Randomness is used in a myriad of industries, including video games, politics, science and cryptography. It is important to continuously alter the methods by which randomness is generated to make it harder to predict. This project studies the randomness of the social media platform ‘Twitter’ by producing a random number generator from the Tweets of several different users. The Tweets are used to produce values that control the AI (Artificial Intelligence) of Robots in the programming game ‘Robocode’, in order to determine the random nature of those values.
The project aims to understand the difference between pseudo and truly random number generation and gain a greater appreciation of the applications of randomness in real world industries. This report describes the concept of a new method of generating random numbers using the aforementioned sources and stimuli.
One method of determining the random nature of the data is a simple visual analysis. See the data visualisation below.
Statistical attributes such as standard deviation and variance will be measured to gain a more empirical conclusion.
A statistical method useful to the analysis of the data is linear regression, specifically the method of least squares.
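The measures above can be sketched briefly in Java. This is an illustrative implementation of variance, standard deviation and an ordinary least-squares line of best fit; the class and method names are my own for this sketch and are not taken from the project code.

```java
// Sketch of the statistical measures used in the analysis: population
// variance, standard deviation, and a least-squares line of best fit.
public class StatsSketch {

    // Population variance: the mean of squared deviations from the mean.
    public static double variance(double[] xs) {
        double mean = 0;
        for (double x : xs) mean += x;
        mean /= xs.length;
        double sumSq = 0;
        for (double x : xs) sumSq += (x - mean) * (x - mean);
        return sumSq / xs.length;
    }

    // Standard deviation is simply the square root of the variance.
    public static double stdDev(double[] xs) {
        return Math.sqrt(variance(xs));
    }

    // Method of least squares: returns {slope, intercept} of the line
    // y = slope * x + intercept that minimises the sum of squared residuals.
    public static double[] leastSquares(double[] xs, double[] ys) {
        int n = xs.length;
        double sx = 0, sy = 0, sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sx += xs[i];
            sy += ys[i];
            sxy += xs[i] * ys[i];
            sxx += xs[i] * xs[i];
        }
        double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double intercept = (sy - slope * sx) / n;
        return new double[] { slope, intercept };
    }
}
```

A least-squares fit with a slope close to a non-zero value and tightly clustered residuals would indicate correlation, which is exactly what a truly random source should not exhibit.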
Thousands of Tweets stored in the MongoDB Atlas Cluster are parsed using a bespoke Java program and Twitter4J.
Thousands of Tweets means millions of characters. Each character represents a single instruction for a TwitterRobot.
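The character-to-instruction idea can be sketched as follows. This is a hypothetical illustration only; the project's actual TweetParser mapping is not reproduced here, and the 0-359 range (a turn angle in degrees) is an assumption for the example.

```java
// Hypothetical sketch: turn each character of a Tweet into a numeric
// instruction for a Robot by reducing its Unicode value into a fixed
// range (here 0-359, e.g. a turn angle in degrees).
public class CharToInstruction {

    public static int[] toInstructions(String tweet) {
        int[] values = new int[tweet.length()];
        for (int i = 0; i < tweet.length(); i++) {
            values[i] = tweet.charAt(i) % 360;
        }
        return values;
    }
}
```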
Thousands of annoying little emojis, varied by Fitzpatrick Modifiers. Yep, there really were that many...
This section extends the previous one, providing the means for visual analysis and ultimately concluding results for the next section.
The final conclusion. Have I created a PRNG or a TRNG? Or is it somewhere in-between? Was it worth it?
The evaluation will help to better understand the influence Robocode had on the random nature of the results.
Does the uneven frequency of characters in the English language skew the results and change the outcome?
To recap, the difference between pseudo-random and truly-random numbers is that pseudo-random number generators produce numbers that are deterministic. If we can determine them, they must have a discernible trend, pattern and correlation. Truly-random numbers, on the other hand, are the opposite: like the decay of a radioactive isotope, they are unpredictable. However, given our ever-improving understanding of computing, technology and algorithms, we are able to write pseudo-random number generators that come very close to true ones. The numbers they produce are, for all intents and purposes, ‘random’, albeit pre-determined from a table of values, and therefore technically not truly-random.
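Determinism is easy to demonstrate. The sketch below is a minimal linear congruential generator, the classic PRNG form; the constants are well-known published ones and this is an illustration of the concept, not the generator used anywhere in the project. Seed two instances identically and they produce identical sequences forever.

```java
// Minimal linear congruential generator (LCG), a classic PRNG.
// The same seed always reproduces exactly the same sequence,
// which is what makes the output deterministic rather than truly random.
public class Lcg {
    private long state;

    public Lcg(long seed) {
        this.state = seed;
    }

    // Well-known constants (Numerical Recipes); state kept to 32 bits.
    public long next() {
        state = (state * 1664525L + 1013904223L) & 0xFFFFFFFFL;
        return state;
    }
}
```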
With the aforementioned definitions in mind, it didn’t take long to conclude that the numbers generated by the Robots in Robocode from the Tweets are not truly-random. As discussed in the evaluation, there are clear patterns and correlations in the data plotted on the scatter graph. The relationship between the battle score and the number of turns was as expected: as one increases, so does the other. However, it’s not all bad. Although the data was restricted to a range of 0-1019, the data points were spread quite sparsely within that range. The statistical analysis of the standard deviation and variance does not support this, as the standard deviation was quite close to the median value. However, I believe that because the majority of the battles had 3 rounds, this created a bias and skewed the data in favour of that sub-dataset, which is most likely why the interquartile range is so small. If we instead take an even number of data points from each number of rounds, we can see that they are actually quite sparse within the overall range.
Finally, regarding the real-world applications of this project: given the strength of correlation that the data expresses, there is no real place for the numbers that the program generates. Given the pre-existing PRNG algorithms, the project would need significant improvement and revision to bring it up to a state where it could compete with the sparsity and randomness of sophisticated pseudo-random algorithms. That does not change the fact, however, that there is a real place for random-number generation in real-life industries.
In conclusion, the numbers generated from the Robocode results are pseudo-random.
Robocode is the variable in the project that was intended to dampen, or even remove, the bias that was to be expected from the Tweets. Given that the characters’ frequency of use in the English language was very likely to introduce a bias, Robocode was to take these values as an input, change them, and return a different value, seemingly at random.
It appears, however, that Robocode has actually created a bias itself due to the nature of the game and its scoring system. It may also be due to the constraints that I put upon the battles themselves due to academic reasons regarding the project.
As illustrated in Chapter 2.1.3 of my dissertation, the frequency of characters differed from that of the Oxford English Dictionary. Some letters occurred drastically more often than in the other source. We would expect the vowels to be the most common, and the letters E, O and A did come up 2nd, 3rd and 4th respectively. However, the most commonly occurring character in the Tweets I had collected was the letter ‘T’, at 10.7%. This seemed unusual until I realised that social media posts on Twitter tend to contain a lot of hyperlinks to webpages. The HTTP/HTTPS protocol contains two letter ‘T’s, and the word ‘Twitter’ itself contains three. That is why I decided to analyse those two statistics in the user analysis.
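The kind of frequency analysis described above can be sketched in a few lines. This is an illustrative version, not the project's actual analysis code: it counts letters a-z only and reports each letter's percentage share of the total.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative letter-frequency analysis: count occurrences of a-z
// and report each letter's percentage share of all letters seen.
public class LetterFrequency {

    public static Map<Character, Double> percentages(String text) {
        Map<Character, Integer> counts = new HashMap<>();
        int total = 0;
        for (char c : text.toLowerCase().toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                counts.merge(c, 1, Integer::sum);
                total++;
            }
        }
        Map<Character, Double> pct = new HashMap<>();
        for (Map.Entry<Character, Integer> e : counts.entrySet()) {
            pct.put(e.getKey(), 100.0 * e.getValue() / total);
        }
        return pct;
    }
}
```

Running this over a URL-heavy corpus shows the effect described above: in a string such as "https://twitter.com", the letter ‘t’ alone accounts for a third of all letters.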
What went well during the project? Did everything go as planned? What was the biggest academic challenge?
What went wrong during the project? Was there anything that caused significant issues?
If I were to do the project again, or continue it, what changes would I make, and why?
I planned a lot of the project during the early stages, as documented in Chapter 4 – Analysis & Design. This made the development stages run more smoothly, as I knew what I had to create. There were of course parts that were unexpected and had to be modified or developed on-the-fly, but the vast majority went as planned. Furthermore, the use of Git VCS and GitHub made pushing the web application to the live server easier and provided peace of mind that the project files were backed up and audited via Git’s version history of commits. The project was also synced via my personal Dropbox account, which ensured that all the files were up-to-date across all my devices during development. This provided another backup and meant that no duplicates were created and no work was lost.
This may just be a testament to the correlation and trends in the final visualisation of the results, but the conclusion was an easy decision to make. I knew right from the very start of planning the project that if the data produced results and visualisations with ambiguous correlation, the decision-making process would be very difficult. It would have required further research into determining whether the numbers were pseudo- or truly-random. The fact that the numbers turned out to be pseudo-random does not make the project a failure; in fact, it was expected as a possible outcome due to the nature of the project.
During the Java development stage of the agile process, I had to write a Java program that interfaced with both the Twitter and Robocode APIs. Twitter wasn’t too much of an issue; simple trial and error saw me through the development of the Twitter-related classes. Once I had the Tweets downloaded and serialised, I could work on the Robot classes. I had an issue with loading custom Robot classes into the RobocodeEngine in my GameConfigurer. Weeks elapsed with no progress, and I had tried countless different methods to resolve the issue. I then conversed with Flemming N. Larsen, one of the major contributors to the Robocode source code, who agreed to take a look. After studying my code, he debugged it and found the cause, then patched the game and released a .JAR of the new version for me. It turned out it was impossible for me to fix the problem myself, as it was an issue with the game itself.
Because D3 works by manipulating SVG elements in the DOM (Document Object Model), large datasets can cause performance problems for the browser. The scatter diagram had over 1000 records passed to it, meaning that it created over 1000 SVG circles, each with mouse-over and on-click events for the browser to handle. This quickly uses up the available memory and degrades the responsiveness of the page.
Given that the Gantt chart solution did not provide any helpful time management for the project, a different approach to the initial planning phase would definitely be helpful if I were to do the project, or something similar, again.
One major element of the program that needed improving was the Java code that parsed the Tweets and produced the values for the Robots. This was the major variable in the whole process that determined the results of the project. As mentioned, there was not enough time to make the algorithm as complex as I would have liked. Therefore, if I were to improve the project in the future, I would spend more time improving and testing the TweetParser class that the RobotController uses.
As discussed in the conclusion, I felt that the data was skewed and biased because I had not simulated an even number of battles for each number of rounds. If I were to do the project again or re-collect the data, I would ensure that the sub-samples were the same size, making the experiment fair and eliminating this factor from the analysis.
The information in this section is an excerpt from the 'Reflection' chapter in my dissertation final report. You can download a copy of it here.