Using Word Association to Detect Multitopic Structures in Text Documents

Klahold, A. et al., 2014. Using Word Association to Detect Multitopic Structures in Text Documents. IEEE Intelligent Systems, 29(5), pp. 40–46.

This paper presents a method for detecting multitopic structures in text documents. It combines (i) traditional keyword extraction and (ii) the association between keywords, measured with CIMAWA (Concept for the Imitation of the Mental Ability of Word Association), into an Associative Gravity metric that is used to cluster topics.


Associative gravity focuses on the content structures within a single document and combines the following three steps:

  1. keyword detection - computes a keyword rating kw which determines the most important terms w based on the number of times they occur in the text n(w), the total number of documents |D| and the number of documents in which they occur I(w): $$!kw(w) = n(w) \cdot\frac{|D|}{I(w)}$$
  2. CIMAWA - determines the relations between those keywords. CIMAWA(x(y)) measures the association between words x and y within a given window size (ws), plus the reverse association multiplied by the damping factor k. A window size of 10 was chosen for the experiments. $$!CIMAWA(x(y)) = \frac{Cooc_{ws}(x,y)}{n(y)^\alpha} + k \frac{Cooc_{ws}(x,y)}{n(x)^\alpha}$$
  3. clustering - determines semantic topic clusters. CIMAWA is used to compute the associative gravity force (AGF), which serves as the distance metric for the clustering: $$!AGF(x, y) = \frac{CIMAWA(x(y)) \cdot kw(x)}{y}$$
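
Steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, `Cooc_ws` is assumed to count every pair of positions of x and y that lie at most `ws` tokens apart, and the default values for `alpha` and `k` are placeholders rather than values taken from the paper.

```python
from collections import Counter


def keyword_rating(tokens, doc_freq, num_docs):
    """kw(w) = n(w) * |D| / I(w) for every term w of one tokenized document.

    doc_freq maps each term to I(w), the number of documents containing it;
    num_docs is |D|. Both are assumed to be precomputed from the corpus.
    """
    counts = Counter(tokens)  # n(w)
    return {w: n * num_docs / doc_freq[w] for w, n in counts.items()}


def cooc(tokens, x, y, ws):
    """Assumed Cooc_ws: number of (x, y) position pairs at most ws tokens apart."""
    pos_x = [i for i, t in enumerate(tokens) if t == x]
    pos_y = [j for j, t in enumerate(tokens) if t == y]
    return sum(1 for i in pos_x for j in pos_y if 0 < abs(i - j) <= ws)


def cimawa(tokens, x, y, ws=10, alpha=0.5, k=0.5):
    """CIMAWA(x(y)) as written above; alpha and k defaults are placeholders."""
    counts = Counter(tokens)
    c = cooc(tokens, x, y, ws)
    if c == 0:
        # No co-occurrence (or a word is missing): no association.
        return 0.0
    return c / counts[y] ** alpha + k * c / counts[x] ** alpha


if __name__ == "__main__":
    tokens = ["a", "b", "a", "c"]
    print(keyword_rating(tokens, {"a": 1, "b": 2, "c": 4}, num_docs=4))
    print(cimawa(tokens, "a", "b", ws=1, alpha=1.0, k=0.5))
```

The pairwise position scan is quadratic in the number of occurrences, which is acceptable for a single document but would need an index for larger corpora.
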

The article also includes suggestions for building the clusters and evaluating them using cluster entropy.
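
The summary above does not give the entropy formula used in the paper. As a rough illustration, the sketch below uses a common size-weighted definition: the Shannon entropy of the reference topic labels inside each cluster, averaged by cluster size. The function name and the input layout are assumptions, and the paper's exact measure may differ.

```python
import math
from collections import Counter


def cluster_entropy(clusters):
    """Size-weighted average entropy (in bits) of reference labels per cluster.

    clusters: list of lists, each inner list holding the reference topic
    labels of the keywords assigned to one cluster (assumed input layout).
    Lower values mean purer clusters; 0.0 means every cluster has one label.
    """
    total = sum(len(labels) for labels in clusters)
    weighted = 0.0
    for labels in clusters:
        if not labels:
            continue  # empty clusters contribute nothing
        size = len(labels)
        h = -sum((n / size) * math.log2(n / size)
                 for n in Counter(labels).values())
        weighted += size / total * h
    return weighted


if __name__ == "__main__":
    print(cluster_entropy([["sports", "sports"], ["politics", "politics"]]))
    print(cluster_entropy([["sports", "politics"], ["sports", "politics"]]))
```
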