Clash of the Contagions - Cooperation and Competition in Information Diffusion

2 minute read

by Seth A. Myers and Jure Leskovec, IEEE International Conference on Data Mining (ICDM 2012), Brussels, Belgium

Introduction

The authors present a statistical information diffusion model that considers competition as well as cooperation between contagions.

  1. during processing information sources (Web pages, TV, Tweets) - or being exposed to contagions - we constantly make choices whether to process (= getting infected with the contagion) or ignore these events.
  2. the authors distinguish tree different factors that determine, whether an infection takes place:
    • interestingness of the content (= content virality)
    • likelihood of the user to share the content (=user bias)
    • the content interaction term (= past exposures)

Potential benefits

  1. optimizing click-through rates by optimizing content placing
  2. combat the spread of negative pieces of information

Related work:

  • models of contagions in isolation: standard information diffusion approaches such as: Linear Threshold Models, Independent Cascade Models, exposure curves
  • models assuming that contagions are mutually exclusive (e.g. adoption of technology in a company such as Skype versus Google Hangout)

Method

The test set consists of URLs in Twitter tweets. Users are infected when they re-tweet a certain URL. Users are modeled as nodes. If a user, re-tweets a URL all his followers become exposed to that particular URL.

The probability of infection is altered by a contagion's (X) predecessors ($$Y_k$$) that are considered within a sliding window of size K. Therefore, the conditional probability of an infection is computed as

\[P(X|Y_1, Y_2, ... Y_K)\]

The authors make the following assumptions to decrease the number of different contagion combinations:

  1. $$Y_k$$ is independent of $$Y_l$$ => they only need to consider $$P(X|Y_k)$$ rather than every possible contagion sequence.
  2. they only consider the interaction between clusters (i.e., latent topics) rather than between all pairs of contagions.

Based on these assumptions they model

\[P(X|Y_k) = P(X) + \Delta^{(k)}_{cont}(u_i, u_j)\]

where $$\Delta^{(k)}_{cont}(u_i, u_j)$$ represents the effect a contagion $$u_i$$ has on contagion $$u_j$$ from k exposures away.

Solution

  1. in the rare cases where the $$\Delta$$ term leads to negative probabilities the authors set the probability to a minimum value of 1E-10
  2. the models obtained have a high number of parameters. The authors tried numerous methods to optimize these parameters and discovered that a variation of stochastic gradient descent worked best for the given use case.

Dataset

The raw dataset contained more than 3 billion tweets that where filtered based on the following criteria:

  1. Tweets containing URLs that where tweeted by at least 50 users (191,650 URLs)
  2. URLs referring to sites which contain enough (>=50 tokens) texts to determine a latent topic (39,771 URLs)
  3. URLs referring to English sites (18,186 URLs and 2,664,207 infectiuous events, i.e. Tweets)

Results

The paper provides evidence that more infectious URLs have

  1. a negative (suppressive) effect on less infectious URLs of unrelated content, and
  2. a positive effect on less infectious URLs of related content.