Thematic Exploration of Linked Data
by Castano et al.; Very Large Data Search (VLDS) 2011
This article addresses the problem of organizing linked data, which is inherently flat, into a more easily browsable representation that the authors call inCloud.
Creating an inCloud requires:
- a computation of the similarity between nodes based on a string metric (the Dice coefficient) that considers the terms Term$$_i$$ appearing in a node $$n_i$$ and in all of its adjacent nodes.
- clustering (thematic aggregation) based on the interconnectivity between nodes by using the clique percolation method (CPM), which yields multiple, potentially overlapping cliques.
- the computation of descriptive terms for each clique by identifying frequent node types (rdf:type).
- the cluster's variability, i.e. the degree of overlap among the cliques in that cluster (lower overlap means a better-defined and more consistent topic).
- the cluster's density, where a higher density indicates a more focused and homogeneous discussion of the topic.
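The similarity step can be illustrated with a minimal Python sketch. It assumes the set-based form of the Dice coefficient, 2|A ∩ B| / (|A| + |B|), applied to the terms of a node merged with the terms of its adjacent nodes. The helper names, the sample nodes, and the way terms are extracted are hypothetical and are not taken from the paper.

```python
def dice_coefficient(terms_a, terms_b):
    """Set-based Dice coefficient: 2|A & B| / (|A| + |B|)."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0  # convention chosen for this sketch when both sets are empty
    return 2 * len(a & b) / (len(a) + len(b))


def node_terms(node, own_terms, neighbours):
    """Merge the terms of a node with the terms of all its adjacent nodes."""
    terms = set(own_terms.get(node, ()))
    for other in neighbours.get(node, ()):
        terms |= set(own_terms.get(other, ()))
    return terms


# Hypothetical toy data: terms per node and adjacency lists.
own_terms = {
    "n1": {"rome", "city"},
    "n2": {"italy", "country"},
    "n3": {"rome", "capital"},
    "n4": {"paris", "city"},
}
neighbours = {"n1": ["n2"], "n3": ["n2"], "n4": []}

sim_n1_n3 = dice_coefficient(
    node_terms("n1", own_terms, neighbours),
    node_terms("n3", own_terms, neighbours),
)
sim_n1_n4 = dice_coefficient(
    node_terms("n1", own_terms, neighbours),
    node_terms("n4", own_terms, neighbours),
)
print(sim_n1_n3)  # 0.75: three shared terms out of 4 + 4
print(sim_n1_n4)
```

In this toy example, n1 and n3 score high because they share a neighbour (n2) and one of their own terms, which is the effect of including adjacent nodes in the comparison.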
Graph Theory
The cluster density $$d_i$$ is the ratio between the number of links $$R_i$$ in the cluster and the maximum number of possible links among its $$N_i$$ nodes:\[ d_i = \frac{2\cdot R_i}{N_i(N_i-1)} \]
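As a quick check of the formula, the following sketch computes the density of a small cluster from an edge list. It assumes undirected links without self-loops; the function name and the sample cluster are hypothetical.

```python
def cluster_density(nodes, edges):
    """Density d = 2R / (N(N-1)) for an undirected cluster without self-loops."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0  # convention for this sketch: no links are possible
    # Count each undirected link once and ignore links that leave the cluster.
    links = {
        frozenset((u, v))
        for u, v in edges
        if u != v and u in nodes and v in nodes
    }
    return 2 * len(links) / (n * (n - 1))


# Hypothetical cluster: 4 nodes and 5 of the 6 possible links.
density = cluster_density(
    ["a", "b", "c", "d"],
    [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")],
)
print(density)
```

With 4 nodes the maximum is 4·3/2 = 6 links, so 5 links give a density of 5/6, and a fully connected cluster gives 1.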
A clique is a subset of vertices such that every two vertices in it are connected by an edge. A pre-print of the Nature article describing the clique percolation method (CPM) can be found here.
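The clique definition and the idea behind CPM can be sketched in a few lines of Python. The sketch assumes the usual formulation of CPM: two k-cliques are adjacent when they share k-1 vertices, and a community is the union of all k-cliques reachable from one another through such adjacencies. It enumerates cliques by brute force, so it is only meant for tiny graphs. All function names and the sample graph are hypothetical, and this is not the implementation used by the authors.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def is_clique(vertices, adjacency):
    """True if every pair of the given vertices is connected by an edge."""
    return all(v in adjacency[u] for u, v in combinations(vertices, 2))


def k_clique_communities(adjacency, k):
    """Merge k-cliques that share k-1 vertices into (possibly overlapping) communities."""
    cliques = [
        frozenset(c)
        for c in combinations(sorted(adjacency), k)
        if is_clique(c, adjacency)
    ]
    unvisited = set(cliques)
    communities = []
    while unvisited:
        start = unvisited.pop()
        community, frontier = set(start), [start]
        while frontier:
            current = frontier.pop()
            # k-cliques adjacent to the current one share exactly k-1 vertices.
            linked = [c for c in unvisited if len(c & current) == k - 1]
            for c in linked:
                unvisited.remove(c)
                community |= c
                frontier.append(c)
        communities.append(frozenset(community))
    return communities


# Hypothetical graph: triangles abc and bcd share an edge; triangle def only shares vertex d.
adjacency = build_adjacency([
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("b", "d"), ("c", "d"),
    ("d", "e"), ("d", "f"), ("e", "f"),
])
communities = k_clique_communities(adjacency, 3)
print(sorted(sorted(c) for c in communities))
```

For k = 3 the two triangles that share the edge b-c merge into one community, while the third triangle forms its own community. Vertex d belongs to both, which shows how CPM yields overlapping clusters.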