Topic-based content and sentiment analysis of Ebola virus on Twitter and in the news

Kim, E. H.-J., Jeong, Y. K., Kim, Y., Kang, K. Y., & Song, M. (2015). Topic-based content and sentiment analysis of Ebola virus on Twitter and in the news. Journal of Information Science.

This paper investigates how topic coverage and sentiment around the Ebola virus differ between Twitter and news media, and how both change over time.

Method

The analysis is based on a corpus of 16,189 news articles and 7,106,297 tweets retrieved with the queries "Ebola" and "Ebola virus". Preprocessing comprised tokenization, stopword removal, lemmatization and part-of-speech tagging.

The following four techniques were applied to the corpus:

  1. vocabulary control, i.e. replacing consumer expressions with expert terminology using the Consumer Health Vocabulary (CHV).
  2. topic modeling using n-gram Latent Dirichlet Allocation (LDA). Differences in coverage were then investigated using (a) within-topic similarity (WTS) and (b) between-topic similarity (BTS).
  3. entity extraction using PKDE4J and creation of an entity network, visualized with Gephi (www.gephi.org).
  4. computation of topic-based sentiment scores: the sentiment value $s(w_j)$ of each word is weighted by its probability under the topic, $$s_{topic_i} = \sum_{j} P(w_j|topic_i) \cdot s(w_j)$$
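
The topic-based sentiment score can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `topic_sentiment`, the toy word probabilities and the toy sentiment lexicon are all hypothetical, and treating words missing from the lexicon as neutral (score 0) is an assumption made here for simplicity.

```python
def topic_sentiment(topic_word_probs, word_sentiment):
    """Sum of word sentiment values weighted by P(word | topic).

    topic_word_probs: dict mapping word -> P(word | topic)
    word_sentiment:   dict mapping word -> sentiment value s(word)
    Words without a lexicon entry contribute 0 (assumption of this sketch).
    """
    return sum(p * word_sentiment.get(w, 0.0) for w, p in topic_word_probs.items())


# Hypothetical toy topic and lexicon, for illustration only.
topic = {"outbreak": 0.5, "vaccine": 0.3, "death": 0.2}
lexicon = {"outbreak": -1.0, "vaccine": 1.0, "death": -2.0}

score = topic_sentiment(topic, lexicon)
print(round(score, 3))  # 0.5*(-1.0) + 0.3*1.0 + 0.2*(-2.0)
```

Because the weights are the topic's word probabilities, high-probability words dominate the score, so a topic is only as negative or positive as its most characteristic vocabulary.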

Performed Analysis

  1. Topic distribution and relations between the topics based on the within-topic similarity (WTS) score.
  2. Entity distribution (entity type per outlet) and visualization of the entity network. Identification of the top entities by betweenness centrality.
  3. Temporal analysis: identification of continuous topics and computation of the topic sentiment over time.
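
To make the ranking of entities by betweenness centrality more concrete, here is a small pure-Python sketch of Brandes' algorithm for an unweighted, undirected graph. The paper computed and visualized its network with Gephi, so this is only an illustration of the measure, not the authors' tooling. The entity names and edges are hypothetical, and the scores are left unnormalized.

```python
from collections import deque


def betweenness_centrality(adj):
    """Unnormalized betweenness centrality (Brandes' algorithm).

    adj: dict mapping node -> set of neighbours (undirected, unweighted);
         every neighbour must also appear as a key.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies, farthest nodes first.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: c / 2 for v, c in bc.items()}


# Hypothetical entity co-occurrence network, for illustration only.
network = {
    "Ebola": {"WHO", "Liberia", "fever"},
    "WHO": {"Ebola", "CDC"},
    "Liberia": {"Ebola"},
    "fever": {"Ebola"},
    "CDC": {"WHO"},
}

scores = betweenness_centrality(network)
top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(top[:2])
```

In this toy network "Ebola" ranks first because it lies on the most shortest paths between other entities, which is the sense in which betweenness centrality picks out bridging entities.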