Fishing for Bots: Social Media Honeypots
Have you ever received likes, retweets, or replies on posts for no apparent reason from someone you don’t know? I have a strong feeling most of us have. This is because of an infamous workforce: bots. If you’re not familiar with these programs, you are in a small minority. There are millions of bots across all social media, from Twitter to GitHub. In some cases, bots can be good, providing services and product integration. However, I often find myself plagued by auto-like and auto-retweet bots on Twitter. These bots are designed to like or retweet any post containing certain keywords. Occasionally they are “feed” bots, which are designed to collect posts on a certain topic in one place. Sadly, this is usually not the case. Personally, I became sick of these bots, so I decided to have a bit of fun with them.
I’ve created what I believe to be the first bot designed to annoy other bots. Basically, it posts a large number of generated tweets packed with keywords, then refines its keyword database after each wave of tweets using a genetic algorithm. This creates a bot that continuously improves its ability to trick other bots into liking or retweeting its content. So far I’ve identified over 200 bot accounts using this method (you can find them here). I hope to continue improving the algorithm used to select keywords so I can attempt to trick as many bots as possible.
There have been some pretty funny cases where bots go on retweet sprees, flooding their feeds with my bot’s garbage tweets.
I feel it’s worth noting that I am not trying to harm bots or ruin their functionality. As I stated before, there are many bots that serve useful purposes. This project is mostly for fun, but it also aims to generate a list of bots. It would be cool to have a massive, centralized list of Twitter bots at some point.
Brief Technical Details
My bot’s learning algorithm is extremely stripped down and probably overly simplistic. It runs on a simple genetic algorithm, where each keyword is a gene. The fitness x of a tweet is calculated as
x = 2r + 1.3m + 1.1l
where r is the number of retweets, m is the number of mentions, and l is the number of likes.
Out of a generation of 40 tweets, the 20 individuals with the lowest fitness are removed from the breeding pool, while the top 10 are bred with the next 10. The new tweet is more likely to take a keyword from the ‘alpha’ parent, which is the parent in the top 10. There is also a 30% chance that a random keyword will be selected instead of a keyword from either parent.
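To make the selection and breeding steps concrete, here is a minimal Python sketch of one generation. It is not the bot’s actual code. The weights, the generation size, and the 30% random-keyword chance come from the description above; everything else is an assumption made for illustration: the function names, the fixed-length list of keywords per tweet, the 70% bias toward the alpha parent (the description only says "more likely"), and the pairing of rank 1 with rank 11, rank 2 with rank 12, and so on.

```python
import random

# Weights from the fitness formula x = 2r + 1.3m + 1.1l.
RETWEET_WEIGHT = 2.0
MENTION_WEIGHT = 1.3
LIKE_WEIGHT = 1.1

MUTATION_RATE = 0.30  # chance of picking a random keyword (from the post)
ALPHA_BIAS = 0.70     # ASSUMPTION: the post only says the alpha parent is "more likely"


def fitness(retweets, mentions, likes):
    """Score one tweet from its engagement counts."""
    return (RETWEET_WEIGHT * retweets
            + MENTION_WEIGHT * mentions
            + LIKE_WEIGHT * likes)


def breed(alpha, beta, keyword_pool, rng):
    """Build a child tweet (a list of keywords) gene by gene."""
    child = []
    for alpha_gene, beta_gene in zip(alpha, beta):
        if rng.random() < MUTATION_RATE:
            # Random keyword instead of one from either parent.
            child.append(rng.choice(keyword_pool))
        elif rng.random() < ALPHA_BIAS:
            child.append(alpha_gene)
        else:
            child.append(beta_gene)
    return child


def next_generation(scored_tweets, keyword_pool, rng=None):
    """scored_tweets is a list of (keywords, fitness) pairs for one generation.

    Drops the lower-scoring half, then breeds the top quarter (alphas)
    with the next quarter. ASSUMPTION: parents are paired by rank.
    """
    if rng is None:
        rng = random.Random()
    ranked = sorted(scored_tweets, key=lambda pair: pair[1], reverse=True)
    survivors = [keywords for keywords, _ in ranked[:len(ranked) // 2]]
    half = len(survivors) // 2
    alphas, betas = survivors[:half], survivors[half:]
    return [breed(a, b, keyword_pool, rng) for a, b in zip(alphas, betas)]
```

With a generation of 40 tweets this produces 10 children, one per parent pair. How the remaining slots of the next generation are filled is not covered above, so the sketch leaves that step out.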
I think it would be interesting to develop honeypot accounts in this manner for various social media sites. In fact, it could be an effective way for Twitter to detect bots and flag the accounts for removal. In addition, I would be interested to see the results of this project on Facebook or Google+, as they allow for far longer posts, meaning the genome could be more complex and thus possibly more effective. Regardless, I really enjoyed this project and I hope to continue developing it later on. I’ve already spoken briefly with a few companies about using the method I’ve developed for more effective bot detection.