by Seth A. Myers and Jure Leskovec, IEEE International Conference on Data Mining (ICDM 2012), Brussels, Belgium
Introduction
The authors present a statistical information diffusion model that considers competition as well as cooperation between contagions.
- while consuming information sources (Web pages, TV, Tweets), i.e. while being exposed to contagions, we constantly choose whether to process an item (= become infected with the contagion) or to ignore it.
- the authors distinguish three different factors that determine whether an infection takes place:
- interestingness of the content (= content virality)
- likelihood of the user to share the content (=user bias)
- the content interaction term (= past exposures)
- optimizing click-through rates by optimizing content placement
- combating the spread of negative pieces of information
- models of contagions in isolation: standard information diffusion approaches such as: Linear Threshold Models, Independent Cascade Models, exposure curves
- models assuming that contagions are mutually exclusive (e.g. adoption of technology in a company such as Skype versus Google Hangout)
Method
The test set consists of URLs in Twitter tweets. A user is infected when they re-tweet a certain URL. Users are modeled as nodes. If a user re-tweets a URL, all of their followers become exposed to that particular URL.
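The exposure mechanism described above can be sketched in a few lines of Python. This is only an illustration: the function name `propagate_exposures`, the dictionary-based follower graph, and the sample users and URLs are assumptions made for this example, not part of the paper.

```python
from collections import defaultdict


def propagate_exposures(followers, retweets):
    """Return, for each user, the URLs they were exposed to, in order.

    followers: dict mapping a user to the users who follow them (assumed layout)
    retweets:  chronologically ordered list of (user, url) infection events
    """
    exposures = defaultdict(list)
    for user, url in retweets:
        # Every follower of the re-tweeting user becomes exposed to the URL.
        for follower in followers.get(user, ()):
            exposures[follower].append(url)
    return dict(exposures)


# Hypothetical toy network: bob and carol follow alice, carol follows bob.
followers = {"alice": ["bob", "carol"], "bob": ["carol"]}
retweets = [("alice", "url_a"), ("bob", "url_b")]
exposures = propagate_exposures(followers, retweets)
```

In this toy network, carol ends up with the exposure sequence `url_a`, `url_b`, which is the kind of ordered sequence the model then conditions on.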
The probability of infection is altered by a contagion's (X) predecessors ($$Y_k$$) that are considered within a sliding window of size K. Therefore, the conditional probability of an infection is computed as
\[P(X \mid Y_1, Y_2, \ldots, Y_K)\]
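To make the sliding window concrete, the following sketch pairs each exposure X in a user's exposure sequence with its up to K predecessors. The function name `exposure_windows` is hypothetical, and the ordering convention (the first predecessor in the list is the most recent one) is an assumption of this example.

```python
def exposure_windows(exposures, K):
    """Pair every exposure X with its (up to) K preceding exposures.

    exposures: a single user's exposure sequence, oldest first
    K:         size of the sliding window
    Returns a list of (X, [Y_1, ..., Y_k]) tuples, where Y_1 is assumed
    to be the exposure immediately before X.
    """
    windows = []
    for i, x in enumerate(exposures):
        # Take the K items before position i and reverse them so that the
        # most recent predecessor comes first.
        predecessors = exposures[max(0, i - K):i][::-1]
        windows.append((x, predecessors))
    return windows


pairs = exposure_windows(["a", "b", "c", "d"], K=2)
```

The first exposures of a sequence have fewer than K predecessors; here they are simply given shorter windows, which is one possible way to handle the boundary.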
The authors make the following assumptions to decrease the number of different contagion combinations:
- $$Y_k$$ is independent of $$Y_l$$ => they only need to consider $$P(X|Y_k)$$ rather than every possible contagion sequence.
- they only consider the interaction between clusters (i.e., latent topics) rather than between all pairs of contagions.
- in the rare cases where the $$\Delta$$ term leads to negative probabilities the authors set the probability to a minimum value of 1E-10
- the models obtained have a high number of parameters. The authors tried numerous methods to optimize these parameters and discovered that a variation of stochastic gradient descent worked best for the given use case.
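The following sketch illustrates how the independence assumption and the minimum-probability rule above could fit together. It rests on several assumptions that are not stated in this summary: each pairwise probability is written as a prior plus an additive $$\Delta$$ term, the pairwise terms are combined with the Bayes-rule identity that holds when the $$Y_k$$ are independent (product of the $$P(X|Y_k)$$ divided by $$P(X)^{K-1}$$), and the result is capped at 1. All function names are hypothetical.

```python
MIN_PROB = 1e-10  # minimum value used when a probability would turn negative


def pairwise_prob(prior, delta):
    """P(X | Y_k), assumed here to be prior + delta, clamped from below."""
    return max(prior + delta, MIN_PROB)


def infection_prob(prior, deltas):
    """Combine the pairwise terms for Y_1..Y_K under the independence assumption.

    prior:  P(X), must be > 0
    deltas: one interaction term per preceding exposure (assumed additive)
    """
    if not deltas:
        return max(prior, MIN_PROB)
    prob = 1.0
    for delta in deltas:
        prob *= pairwise_prob(prior, delta)
    # Bayes-rule identity for independent Y_k: prod_k P(X|Y_k) / P(X)^(K-1)
    prob /= prior ** (len(deltas) - 1)
    # The upper cap at 1.0 is an addition of this sketch, not from the paper.
    return min(max(prob, MIN_PROB), 1.0)


suppressed = pairwise_prob(0.01, -0.05)     # negative sum, so clamped
boosted = infection_prob(0.02, [0.01])      # single related predecessor
neutral = infection_prob(0.02, [0.0, 0.0])  # no interaction at all
```

With all interaction terms set to zero, the combined probability falls back to the prior, which is a quick sanity check on the combination rule.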
- Tweets containing URLs that were tweeted by at least 50 users (191,650 URLs)
- URLs referring to sites which contain enough text (>= 50 tokens) to determine a latent topic (39,771 URLs)
- URLs referring to English sites (18,186 URLs and 2,664,207 infectious events, i.e. Tweets)
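The three filtering steps listed above can be expressed as a single predicate over per-URL records. The sketch below assumes the user count, token count, and language of each URL have already been computed; the `UrlRecord` type, its field names, and the sample records are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UrlRecord:
    url: str
    n_users: int    # number of distinct users who tweeted the URL
    n_tokens: int   # number of text tokens on the linked site
    language: str   # language code, assumed to be detected beforehand


def filter_urls(records, min_users=50, min_tokens=50, language="en"):
    """Keep URLs tweeted by enough users, with enough text, in one language."""
    return [
        r for r in records
        if r.n_users >= min_users
        and r.n_tokens >= min_tokens
        and r.language == language
    ]


# Hypothetical sample records, one failing each criterion and one passing.
records = [
    UrlRecord("http://example.com/a", n_users=120, n_tokens=300, language="en"),
    UrlRecord("http://example.com/b", n_users=10, n_tokens=300, language="en"),
    UrlRecord("http://example.com/c", n_users=120, n_tokens=20, language="en"),
    UrlRecord("http://example.com/d", n_users=120, n_tokens=300, language="de"),
]
kept = filter_urls(records)
```

Only the first record survives all three filters, mirroring how each step in the list narrows the set of URLs further.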
Results
The paper provides evidence that more infectious URLs have
- a negative (suppressive) effect on less infectious URLs of unrelated content, and
- a positive effect on less infectious URLs of related content.