A Survey of Types of Text Noise and Techniques to Handle Noisy Text

by Subramaniam, L. V., Roy, S., Faruquie, T. A., & Negi, S. (2009). A survey of types of text noise and techniques to handle noisy text. Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data, AND '09 (pp. 115-122). New York, NY, USA: ACM. doi:10.1145/1568296.1568315

Definition

Noise in text can be defined as any difference between the surface form of an electronic text and the intended, correct, or original text. The authors of this paper identify the following two noise sources:

  1. Automatic processing of signals intended for human consumption (e.g. optical character recognition (OCR), automatic speech recognition (ASR), machine translation). The output may consist entirely of valid vocabulary words and still be very noisy to a human reader.
  2. Text generated in informal settings (e.g. chat messages, short messaging service (SMS) messages, emails, message boards, web pages), where noise arises from (i) spelling errors, (ii) special characters, (iii) non-standard word forms, (iv) grammar mistakes, (v) use of multilingual words, etc.

Methods and Metrics

  1. Levenshtein distance (spelling errors)
  2. Word error rate (WER) and sentence error rate (SER)
  3. Perplexity (how poorly a language model predicts the text; text with many "strange" n-grams gets a higher perplexity)
  4. Character accuracy (for OCR texts)
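
To make the four metrics above concrete, here is a minimal Python sketch. It is not taken from the paper: the function names, the normalization by reference length, and the add-one-smoothed bigram model used for perplexity are all assumptions chosen for illustration. Real evaluations typically use larger language models and dedicated scoring tools.

```python
import math
from collections import Counter


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(ref, hyp):
    """Word-level edit distance divided by the reference length (assumed non-empty)."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def sentence_error_rate(refs, hyps):
    """Share of sentences that contain at least one word error."""
    wrong = sum(1 for r, h in zip(refs, hyps) if r.split() != h.split())
    return wrong / len(refs)


def character_accuracy(ref, hyp):
    """Assumed definition: 1 - (character edit distance / reference length).

    Can become negative when the hypothesis has many inserted characters.
    """
    return 1 - edit_distance(ref, hyp) / len(ref)


def bigram_perplexity(train_tokens, test_tokens):
    """Perplexity of test_tokens under an add-one-smoothed bigram model.

    Simplification: unseen test words are added to the vocabulary so that
    every bigram receives a non-zero probability.
    """
    vocab = set(train_tokens) | set(test_tokens)
    context_counts = Counter(train_tokens[:-1])
    bigram_counts = Counter(zip(train_tokens, train_tokens[1:]))
    pairs = list(zip(test_tokens, test_tokens[1:]))
    log_prob = 0.0
    for prev, word in pairs:
        p = (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))
        log_prob += math.log(p)
    return math.exp(-log_prob / len(pairs))
```

For example, `edit_distance("kitten", "sitting")` is 3, and `word_error_rate("the quick brown fox", "the quik brown fox")` is 0.25 because one of four reference words is wrong. With the toy training text "the cat sat on the mat", `bigram_perplexity` gives a higher value for the misspelled "teh cat szt" than for "the cat sat", which is the behavior the perplexity item in the list describes.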

Impact of Noise

  1. Information Retrieval - Noise is a serious problem for information retrieval because (i) retrieval is often based on string matching, and search queries (ii) frequently contain spelling errors (10-15%; Cucerzan and Brill 2004), (iii) contain a high share of out-of-vocabulary words, and (iv) are typically short (average query length of 2.3 words).
  2. Text Classification - Agarwal et al. (2007) performed a comprehensive study on the impact of noise on text classification using synthetic and real-life data. They concluded that even high noise levels (70%) do not impact classification performance considerably.
  3. Information Extraction - Packer et al. (2010) applied three typical named entity recognition techniques (a dictionary-based extractor, a regular expression-based extractor, and a Maximum Entropy Markov Model-based extractor) to an OCR corpus. They observed F-measures between 28% and 89%, depending on the word error rate of the scanned documents.

Literature

Agarwal, S., Godbole, S., Punjani, D., & Roy, S. (2007). How Much Noise Is Too Much: A Study in Automatic Text Classification. Proceedings of the 2007 Seventh IEEE International Conference on Data Mining, ICDM '07 (pp. 3-12). Washington, DC, USA: IEEE Computer Society. doi:10.1109/ICDM.2007.21

Cucerzan, S., & Brill, E. (2004). Spelling Correction as an Iterative Process that Exploits the Collective Knowledge of Web Users (pp. 293-300). Presented at the EMNLP. Retrieved from http://dblp.uni-trier.de/rec/bibtex/conf/emnlp/CucerzanB04

Packer, T. L., Lutes, J. F., Stewart, A. P., Embley, D. W., Ringger, E. K., Seppi, K. D., & Jensen, L. S. (2010). Extracting person names from diverse and noisy OCR text. Proceedings of the Fourth Workshop on Analytics for Noisy Unstructured Text Data, AND '10 (pp. 19-26). New York, NY, USA: ACM. doi:10.1145/1871840.1871845