Natural Language Processing for Health and Social Media

Abbasi, A. et al., 2014. Social Media Analytics for Smart Health. IEEE Intelligent Systems, 29(2), pp. 60-80.

Summary

In this article, Mark Dredze and Michael J. Paul discuss using Twitter to collect population-level data at a much lower cost than traditional data sources such as surveys. A key challenge of mining health information from social media is extracting structured data from short, noisy, unstructured text. Specific difficulties include:

  1. the massive amount of available data makes it difficult to survey and explore.
  2. keyword searches are problematic, since users who search by keyword may miss important health topics.
  3. disambiguating tweets is challenging (for example, not all tweets containing flu-related keywords report actual infections).

Method

  1. the authors use topic models, which represent each document as a distribution over topics; each topic is in turn a distribution over words.
  2. evaluation: the models are assessed on how well they capture real-world trends reported by the US Behavioral Risk Factor Surveillance System (BRFSS).
    • tweets about cancer and serious illnesses per state are positively correlated with the rate of smokers in the corresponding state.
    • tweets on obesity are negatively correlated with the rate of exercise per US state.
    • control: tweets on bacterial infections are unrelated to the rate of asthma in the US states.
  3. a supervised classifier was trained on more than 10,000 labeled tweets to identify tweets that report influenza infections, using the following features:
    • n-grams (phrases)
    • manually specified keyword groups (e.g. worried, scared, ...)
    • Twitter related features (URLs, hashtags, user mentions, emoticons)
    • linguistic features using part-of-speech tags (e.g. flu as a subject indicates awareness - "the flu is going around"; flu as an object indicates infection - "i finally got the flu")
    evaluation: comparison with the infection rate measured by the US Centers for Disease Control and Prevention (CDC).
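
To make the feature list in step 3 more concrete, here is a minimal Python sketch of how such features might be extracted from a single tweet. It is not the authors' implementation: the function name `extract_features`, the feature prefixes, the regular expressions, and the contents of `KEYWORD_GROUPS` are all assumptions made for illustration (the article only gives "worried, scared, ..." as example keywords). Part-of-speech features are left out because they would need an external tagger.

```python
import re
from collections import Counter

# Hypothetical keyword groups; only "worried" and "scared" come from the notes above.
KEYWORD_GROUPS = {
    "concern": {"worried", "scared", "afraid"},
    "infection": {"sick", "fever", "cough"},
}

# Simple illustrative patterns, not a complete Twitter tokenizer.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
EMOTICON_RE = re.compile(r"[:;=]-?[()DPp]")
TOKEN_RE = re.compile(r"[a-z']+")


def extract_features(tweet: str, max_n: int = 2) -> Counter:
    """Return a sparse feature-count mapping for one tweet."""
    features = Counter()

    # Twitter-related features. URLs are counted and removed first so that
    # fragments such as "#section" inside a link are not counted as hashtags.
    features["tw:url"] = len(URL_RE.findall(tweet))
    text = URL_RE.sub(" ", tweet)
    features["tw:hashtag"] = len(HASHTAG_RE.findall(text))
    features["tw:mention"] = len(MENTION_RE.findall(text))
    features["tw:emoticon"] = len(EMOTICON_RE.findall(text))

    # Drop user mentions so that usernames do not become word features;
    # hashtag words are kept (the "#" itself is not a token character).
    text = MENTION_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text.lower())

    # Word n-grams up to length max_n.
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features["ng:" + " ".join(tokens[i:i + n])] += 1

    # Binary indicator per keyword group: does any word of the group occur?
    for group, words in KEYWORD_GROUPS.items():
        if any(tok in words for tok in tokens):
            features["kw:" + group] = 1

    return features


example = extract_features("i finally got the flu :( #sick http://t.co/abc @bob")
```

The resulting sparse counts could then be passed to any standard supervised classifier; the notes above do not say which learning algorithm the authors used.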

Other Corpora

  1. RateMDs - 50,000 doctor reviews;
  2. Drugs-Forum.com - discussions of illicit drugs; use of text summarization to reveal information on drug use.
