Klahold, A. et al., 2014. Using Word Association to Detect Multitopic Structures in Text Documents. IEEE Intelligent Systems, 29(5), pp.40–46.
This paper presents a method for detecting multitopic structures in text documents. It combines (i) traditional keyword extraction methods and (ii) the association between keywords (CIMAWA = Concept for the Imitation of the Mental Ability of Word Association) into an Associative Gravity metric, which is used for clustering topics.
Method
Associative gravity focuses on the content structures within a single document and combines the following three steps:
- keyword detection - computes a keyword rating kw(w) that identifies the most important terms w from the number of times w occurs in the text, n(w), the total number of documents, |D|, and the number of documents in which w occurs, I(w):
$$!kw(w) = n(w) \cdot\frac{|D|}{I(w)}$$
- CIMAWA - determines the relations between those keywords. CIMAWA(x(y)) measures the association between words x and y from their co-occurrence within a window of size ws, plus the reverse association multiplied by the damping factor k. The reported experiments use a window size of 10.
$$!CIMAWA(x(y)) = \frac{Cooc_{ws}(x,y)}{n(y)^\alpha} + k \frac{Cooc_{ws}(x,y)}{n(x)^\alpha}$$
- clustering - to determine semantic topic clusters, CIMAWA is used to compute the associative gravity force (AGF), which serves as the distance metric for the clustering:
$$!AGF(x, y) = \frac{CIMAWA(x(y)) \cdot kw(x)}{y}$$
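The keyword rating and CIMAWA formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the symmetric window counting in `cooccurrence`, and the default values for `alpha` and `k` are assumptions made here, since the note does not specify them.

```python
from collections import Counter


def keyword_rating(tokens, doc_freq, num_docs):
    """kw(w) = n(w) * |D| / I(w) for every word w of one tokenized document.

    doc_freq maps each word to I(w); num_docs is |D|.
    """
    counts = Counter(tokens)  # n(w)
    return {w: n * num_docs / doc_freq[w] for w, n in counts.items()}


def cooccurrence(tokens, x, y, ws=10):
    """Cooc_ws(x, y): number of position pairs at most ws tokens apart.

    Assumption: the window is symmetric and every pair of occurrences counts.
    """
    pos_x = [i for i, t in enumerate(tokens) if t == x]
    pos_y = [j for j, t in enumerate(tokens) if t == y]
    return sum(1 for i in pos_x for j in pos_y if 0 < abs(i - j) <= ws)


def cimawa(tokens, x, y, ws=10, alpha=1.0, k=0.5):
    """CIMAWA(x(y)) = Cooc/n(y)^alpha + k * Cooc/n(x)^alpha.

    alpha and k defaults are placeholders, not values taken from the paper.
    Both x and y must occur in tokens, otherwise a count of zero is divided by.
    """
    counts = Counter(tokens)
    cooc = cooccurrence(tokens, x, y, ws)
    return cooc / counts[y] ** alpha + k * cooc / counts[x] ** alpha


tokens = "a b a c b".split()
ratings = keyword_rating(tokens, doc_freq={"a": 5, "b": 2, "c": 1}, num_docs=10)
association = cimawa(tokens, "a", "b", ws=1, alpha=1.0, k=0.5)
```

With this toy input, "a" occurs twice in the text and in 5 of 10 documents, so its rating is 2 · 10 / 5 = 4. Rarer words such as "c" receive a higher rating per occurrence.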
The article also includes suggestions for building the clusters and evaluating them using cluster entropy.
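The note does not say how the paper defines cluster entropy. One common definition, assumed here, is the Shannon entropy of the reference topic labels inside each cluster, averaged over clusters and weighted by cluster size; lower values mean purer clusters. The function names and the toy labels below are hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(labels):
    """Shannon entropy (in bits) of the reference labels in one non-empty cluster."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def weighted_entropy(clusters):
    """Size-weighted mean of the per-cluster entropies."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)


# Each inner list holds the reference topic of every keyword in one cluster.
clusters = [["t1", "t1"], ["t1", "t2"]]
score = weighted_entropy(clusters)
```

Here the first cluster is pure (entropy 0) and the second is an even mix of two topics (entropy 1 bit), so the weighted score is 0.5.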