Thematic Exploration of Linked Data

by Castano et al.; Very Large Data Search (VLDS) 2011

This article addresses the problem of organizing linked data, which is inherently flat in its organization, into a more easily browsable representation that the authors call inCloud.

Creating an inCloud requires:

  1. a computation of the similarity between nodes based on a string metric (the Dice coefficient) that considers the terms Term$$_i$$ appearing in a node $$n_i$$ and in all of its adjacent nodes.
  2. clustering (thematic aggregation) based on the interconnectivity between nodes by using the clique percolation method (CPM), which yields multiple, potentially overlapping cliques.
  3. the computation of descriptive terms for each clique by identifying frequent node types (rdf:type).
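
The similarity computation in step 1 can be illustrated with a minimal sketch. It assumes a set-based Dice coefficient over whitespace-separated, lower-cased label tokens; the paper may tokenize or weight terms differently. The names `tokenize`, `node_terms`, and `dice_coefficient`, as well as the toy labels, are hypothetical and not taken from the paper.

```python
def tokenize(label):
    """Split a label into a set of lower-cased terms (assumed tokenization)."""
    return set(label.lower().split())


def node_terms(node, labels, adjacency):
    """Collect the terms of a node together with those of its adjacent nodes."""
    terms = set(tokenize(labels[node]))
    for neighbour in adjacency.get(node, ()):
        terms |= tokenize(labels[neighbour])
    return terms


def dice_coefficient(terms_a, terms_b):
    """Dice coefficient 2*|A & B| / (|A| + |B|) of two term sets."""
    if not terms_a and not terms_b:
        return 0.0  # convention chosen here for two empty term sets
    return 2 * len(terms_a & terms_b) / (len(terms_a) + len(terms_b))


# Toy data (hypothetical): two nodes that share the neighbour "b".
labels = {"a": "Rome city", "b": "Italy country", "c": "Rome capital"}
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

similarity = dice_coefficient(
    node_terms("a", labels, adjacency),
    node_terms("c", labels, adjacency),
)
print(similarity)  # 0.75: three shared terms out of 4 + 4
```

Including the neighbours' terms means two nodes can be similar even when their own labels barely overlap, as long as they sit in a similar neighbourhood.
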

The prominence of clusters is determined by considering:

  1. the cluster's variability, i.e. the degree of overlap among the cliques in that cluster (lower overlap indicates a better defined and more consistent topic).
  2. the cluster's density, where a higher density indicates a more focused and homogeneous discussion of the topic.

Graph Theory

The cluster density $$d_i$$ is computed as the ratio between the number of links $$R_i$$ in cluster $$i$$ and the maximum number of possible links among its $$N_i$$ nodes, i.e. $$N_i(N_i-1)/2$$:

\[ d_i = \frac{2\cdot R_i}{N_i(N_i-1)} \]
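
As a quick check of the formula, the following sketch evaluates it for small clusters. The function name `cluster_density` and the choice to return 0 for clusters with fewer than two nodes are assumptions made here, since the formula is undefined in that case.

```python
def cluster_density(num_links, num_nodes):
    """Density d = 2R / (N(N-1)): links present over links possible."""
    if num_nodes < 2:
        return 0.0  # assumed convention; the formula divides by zero here
    return 2 * num_links / (num_nodes * (num_nodes - 1))


# A cluster of 4 nodes can hold at most 4*3/2 = 6 links.
print(cluster_density(6, 4))  # 1.0, a fully connected cluster
print(cluster_density(3, 4))  # 0.5, half of the possible links
```
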

A clique is a subset of vertices such that every two vertices in it are connected by an edge. A pre-print of the Nature article describing the clique percolation method (CPM) can be found here.
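
To make the clique definition and CPM more concrete, here is a minimal pure-Python sketch of the commonly described formulation of clique percolation: find the maximal cliques, keep those with at least k nodes, and merge cliques that share at least k-1 nodes into one community. It is not the authors' implementation, and the helper names (`build_adjacency`, `maximal_cliques`, `k_clique_communities`) and the toy graph are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbours)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def k_clique_communities(adj, k):
    """Merge cliques of size >= k that overlap in at least k-1 nodes."""
    cliques = [c for c in maximal_cliques(adj) if len(c) >= k]
    parent = list(range(len(cliques)))  # union-find over clique indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            parent[find(i)] = find(j)

    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return [frozenset(g) for g in groups.values()]


# Two triangles that share only node 3: with k = 3 they stay separate
# communities, and node 3 belongs to both.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5)]
communities = k_clique_communities(build_adjacency(edges), k=3)
print(sorted(sorted(c) for c in communities))  # [[1, 2, 3], [3, 4, 5]]
```

Adding the edge (2, 4) to the toy graph creates a third triangle {2, 3, 4} that shares two nodes with each of the others, so all five nodes percolate into a single community. Enumerating maximal cliques is exponential in the worst case, so this sketch is only suitable for small graphs.
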