The DARPA Twitter Bot Challenge

1 minute read

Subrahmanian, V. S., A. Azaria, S. Durst, V. Kagan, A. Galstyan, K. Lerman, L. Zhu, E. Ferrara, A. Flammini, and F. Menczer. The DARPA Twitter Bot Challenge. Computer 49, no. 6 (June 2016): 38–46. doi:10.1109/MC.2016.183.

Introduction

According to Twitter's SEC filing, approximately 8.5% of all Twitter users are bots, such as (i) spambots, (ii) paybots (i.e., bots that copy content from respected sources and repost it together with micro-URLs that pay the bot's creator for redirecting traffic to their sites), and (iii) influence bots (i.e., bots that try to shape discussions in accordance with a certain agenda).

Research has shown that such bots have a surprisingly large influence. This prompted a competition within DARPA's Social Media in Strategic Communication program to identify influence bots promoting pro-vaccination views in Twitter discussions, based on a synthetic data set comprising over 7,000 Twitter profiles, more than 4 million tweets, and weekly snapshots of the Twitter network capturing changes to profiles and to users' followers.

Approach

The teams considered different features for identifying bots, including the following (a rough feature-extraction sketch is given after the list):

  1. Tweet syntax (i.e. patterns that indicate the use of natural language generation programs, etc.)
  2. Tweet semantics (number of posts related to the topic, sentiment, consistency, etc.)
  3. Temporal behavior features (variance in sentiment, duration of sessions, average number of tweets, etc.)
  4. Network features (deviation of a user's sentiment scores from those of their followers and followees, centrality, etc.)
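
To make these features concrete, below is a minimal Python sketch that computes a handful of them for a single user. The record fields (`text`, `timestamp`, `sentiment`), the use of followee sentiment averages, and the specific aggregations are assumptions made for illustration, not the feature definitions used by the competing teams.

```python
from statistics import mean, pvariance

def extract_features(tweets, followee_sentiments):
    """Compute a few illustrative per-user features.

    tweets: list of dicts with assumed keys 'text',
        'timestamp' (epoch seconds), and 'sentiment' (-1..1).
    followee_sentiments: list of average sentiment scores of the
        accounts this user follows (assumed to be precomputed).
    """
    sentiments = [t["sentiment"] for t in tweets]
    timestamps = sorted(t["timestamp"] for t in tweets)
    days = max((timestamps[-1] - timestamps[0]) / 86400.0, 1.0)
    avg_sentiment = mean(sentiments)

    return {
        # Tweet syntax: a crude hint of templated or generated text
        "avg_tweet_length": mean(len(t["text"]) for t in tweets),
        # Tweet semantics: posting volume and overall sentiment
        "tweets_per_day": len(tweets) / days,
        "sentiment_mean": avg_sentiment,
        # Temporal behavior: variance in sentiment over time
        "sentiment_variance": pvariance(sentiments),
        # Network feature: deviation from followees' average sentiment
        "sentiment_deviation_from_followees": (
            abs(avg_sentiment - mean(followee_sentiments))
            if followee_sentiments else 0.0
        ),
    }
```
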
The overall procedure for detecting bots consisted of the following steps:

  1. Identify an initial set of bots based on the features mentioned above.
  2. Use clustering, outlier, and network analysis to locate further bots.
  3. Once a large enough number of bots has been found, apply standard machine learning methods to identify the remaining bots.
Machine learning could not be applied at an earlier stage because not enough training data was available. In addition, all teams used semi-supervised approaches, i.e., machines would identify potential bots that were then confirmed (or rejected) by human experts.
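
The sketch below illustrates that bootstrapping loop: a classifier is trained on the seed labels, and high-confidence candidates are repeatedly passed to a human reviewer. The choice of scikit-learn's RandomForestClassifier, the hypothetical `human_review` callback, and the confidence threshold are assumptions for illustration, not the teams' actual implementations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def bootstrap_bot_detection(features, seed_bot_idx, seed_human_idx,
                            human_review, threshold=0.9):
    """Illustrative semi-supervised bot-detection loop.

    features: (n_users, n_features) array, e.g. from extract_features.
    seed_bot_idx / seed_human_idx: indices of users already labelled
        via the heuristics, clustering, and outlier analysis above.
    human_review: callable taking a user index and returning True
        (bot) or False (human) -- the human-in-the-loop step.
    """
    labels = np.full(len(features), -1)   # -1 marks unlabelled users
    labels[list(seed_bot_idx)] = 1
    labels[list(seed_human_idx)] = 0

    while True:
        known = labels != -1
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(features[known], labels[known])

        unknown = np.where(labels == -1)[0]
        if len(unknown) == 0:
            break
        bot_proba = clf.predict_proba(features[unknown])[:, 1]

        # Only high-confidence candidates are sent to a human expert
        candidates = unknown[bot_proba >= threshold]
        if len(candidates) == 0:
            break
        for idx in candidates:
            labels[idx] = 1 if human_review(idx) else 0

    return labels, clf
```

The loop stops once no unlabelled account clears the confidence threshold; at that point the trained classifier can be used to label the remaining accounts, as in step 3 above.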