Mining competitor relationships from online news: A network-based approach

Ma, Z., Pant, G. & Sheng, O.R.L., 2011. Mining competitor relationships from online news: A network-based approach. Electronic Commerce Research and Applications, 10(4), pp. 418–427.

This article introduces a method for mining competitor relationships from a network of business relations retrieved from categorized Yahoo Finance news. A news item assigned to a company yields outgoing links to the other companies mentioned in the same article. The authors compute four types of network metrics for every aggregated link between two companies and use them as features for a classifier that determines whether the two companies share an "isCompetitor" relation.

Method

The method draws upon an observation by Bae et al. (2008): a company is more likely to co-occur with its competitors on Web pages. Building on this, Ma et al. (2011) identify competitor relations in the following steps:

  1. They analyse news from Yahoo Finance, where news items have been pre-assigned to companies (and their corresponding ticker symbols). Every mention of another company $$B_i$$ in a news article assigned to company A yields an outgoing link from A to $$B_i$$.
  2. Aggregating these links yields a weighted inter-company network. The authors then compute four different types of network metrics for every link. These metrics serve as features for the subsequent classification step.
  3. The classifier is first trained with 40 randomly selected pairs that have been labeled according to a gold standard. Afterwards, it predicts the labels of the unknown relations (competitor/non-competitor). The authors also address the issue of imbalanced data sets by using techniques such as decision threshold adjustment (DTA) and undersampling-ensemble (UE).
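
The first two steps can be illustrated with a minimal Python sketch. It assumes the news has already been reduced to pairs of a source ticker and the tickers mentioned in the article; the function name `build_company_network`, the tickers and the choice to count one link per listed mention are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def build_company_network(articles):
    """Aggregate company mentions into weighted directed links.

    `articles` is an iterable of (source_ticker, mentioned_tickers) pairs
    (a hypothetical input format). Returns a Counter that maps
    (source, target) to the number of aggregated links.
    """
    weights = Counter()
    for source, mentioned in articles:
        for target in mentioned:
            if target != source:  # assumption: self-mentions are ignored
                weights[(source, target)] += 1
    return weights


# Made-up example: three news items assigned to two companies.
articles = [
    ("AAA", ["BBB", "CCC"]),
    ("AAA", ["BBB"]),
    ("BBB", ["AAA"]),
]
network = build_company_network(articles)
print(network[("AAA", "BBB")])  # weight of the aggregated link AAA -> BBB
```

The resulting dictionary of link weights is the weighted inter-company network on which the link metrics are computed.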

Link and Network Metrics

The approach bases the link classification on the following four types of metrics:

  1. Links between nodes: degree-based attributes (incoming links (weight of dyad in-degree), outgoing links (weight of dyad out-degree), net weight, dyad in/out-degree)
  2. Links between nodes and their neighbors: weight of node in-degree (considers incoming links from all neighbors), node out-degree, node in/out-degree
  3. Centrality-based attributes: PageRank (popularity score), HITS (authority score), betweenness centrality. Both PageRank and HITS compute principal eigenvectors of matrices derived from graph representations of the Web.
  4. Structural equivalence (SE)-based attributes: two nodes are structurally equivalent if they have the same links to and from other nodes. The authors measure the degree of structural equivalence with a similarity metric based on the nodes' in-, out- and in/out-degree. A high overlap between the neighbors of two nodes might correspond to a high overlap in their business and, therefore, to a competitor relationship.
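
To make the degree-based and structural-equivalence attributes more concrete, here is a rough Python sketch over a dictionary of link weights. The feature names and exact definitions are assumptions for illustration. In particular, `neighbor_overlap` is a simple Jaccard overlap of neighbor sets used as a stand-in, not the similarity metric from the paper, and the centrality attributes are omitted.

```python
def dyad_features(weights, a, b):
    """Degree-based attributes of the dyad (a, b), seen from a."""
    out_w = weights.get((a, b), 0)  # weight of links a -> b
    in_w = weights.get((b, a), 0)   # weight of links b -> a
    return {"dyad_out": out_w, "dyad_in": in_w, "dyad_in_out": out_w + in_w}


def node_degrees(weights, node):
    """Weighted in-, out- and in/out-degree of a node over all neighbors."""
    out_w = sum(w for (s, _), w in weights.items() if s == node)
    in_w = sum(w for (_, t), w in weights.items() if t == node)
    return {"node_out": out_w, "node_in": in_w, "node_in_out": out_w + in_w}


def neighbor_overlap(weights, a, b):
    """Jaccard overlap of the neighbor sets of a and b (stand-in for SE)."""
    def neighbors(n):
        linked = set()
        for s, t in weights:
            if s == n:
                linked.add(t)
            elif t == n:
                linked.add(s)
        return linked - {a, b}  # compare only third-party neighbors

    na, nb = neighbors(a), neighbors(b)
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


# Made-up link weights for four hypothetical companies.
weights = {("A", "B"): 2, ("B", "A"): 1, ("A", "C"): 1, ("B", "C"): 3, ("D", "A"): 1}
features = {**dyad_features(weights, "A", "B"), **node_degrees(weights, "A")}
features["overlap"] = neighbor_overlap(weights, "A", "B")
print(features)
```

Each company pair is thus described by one feature vector, which is what the classifier receives as input.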

Data Set Imbalance

  1. Imbalanced data sets, i.e. data sets in which most instances belong to one class, are hard to classify: classifiers optimize accuracy, which makes always returning the majority class a very attractive strategy.
  2. The following approaches may be used to compensate for the imbalance:
    • data-oriented methods: undersampling the majority class, oversampling the minority class, oversampling the minority class by creating synthetic minority instances, segmentation of the data set into disjoint regions
    • algorithmic methods: decision threshold adjustment, cost-sensitive learning, recognition-based learning (which recasts the task as a one-class recognition problem in which only the rules for classifying the minority class are learned)
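
The two techniques named in the method section, decision threshold adjustment and an undersampling ensemble, might look roughly as follows in Python. This is a generic sketch of the ideas, not the authors' implementation: the helper names, the balanced 1:1 sampling ratio and the majority vote are assumptions.

```python
import random


def adjust_threshold(probabilities, threshold):
    """Label an instance as minority (1) when its score reaches `threshold`.

    Lowering the threshold below 0.5 favors the minority class.
    """
    return [1 if p >= threshold else 0 for p in probabilities]


def undersampled_subsets(majority, minority, n_models, seed=0):
    """Build one balanced training set per ensemble member.

    Each set holds all minority instances plus an equally sized random
    sample of the majority instances (assumes len(majority) >= len(minority)).
    """
    rng = random.Random(seed)
    return [rng.sample(majority, len(minority)) + list(minority)
            for _ in range(n_models)]


def majority_vote(votes_per_model):
    """Combine 0/1 predictions of several models by strict majority vote."""
    n_models = len(votes_per_model)
    return [1 if 2 * sum(votes) > n_models else 0
            for votes in zip(*votes_per_model)]


scores = [0.1, 0.3, 0.6]
print(adjust_threshold(scores, 0.5))   # default cut-off
print(adjust_threshold(scores, 0.25))  # shifted toward the minority class
```

In an undersampling ensemble, one classifier would be trained on each subset returned by `undersampled_subsets`, and `majority_vote` would merge their predictions.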

Machine Learning

The authors evaluate their approach with neural network, Naive Bayes, C4.5 decision tree and logistic regression models from the WEKA toolkit using 10-fold cross-validation.
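
For readers unfamiliar with the evaluation scheme, the following Python sketch shows how a 10-fold split partitions the data. It is a plain, unshuffled and unstratified split written for illustration; it does not reproduce WEKA's own procedure, and the function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of the k folds.

    Every sample lands in exactly one test fold; fold sizes differ by
    at most one when n_samples is not divisible by k.
    """
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(25, k=10))
print(len(folds))
```

Each model is trained on the training indices of a fold and scored on its test indices, and the scores are then averaged over all folds.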

Estimate Gold Standard Coverage

Finally, the article describes a method for estimating the coverage of the gold standard. It is based on two different gold standards and on techniques originally developed by Le Cren (1965) for estimating wildlife populations and by Lawrence and Giles for estimating the coverage of search engines.
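
The idea can be sketched with the standard two-sample capture-recapture (Lincoln-Petersen) estimator, which is the classic technique in wildlife population estimation. Whether the paper uses exactly this form is an assumption, and the function names and all numbers below are made up for illustration.

```python
def estimate_population(n_first, n_second, n_overlap):
    """Lincoln-Petersen estimate of the total number of items.

    n_first, n_second: items found by each of two independent sources
    n_overlap: items found by both sources
    """
    if n_overlap == 0:
        raise ValueError("the estimate is undefined without any overlap")
    return n_first * n_second / n_overlap


def estimate_coverage(n_known, n_first, n_second, n_overlap):
    """Share of the estimated total that a source with n_known items covers."""
    return n_known / estimate_population(n_first, n_second, n_overlap)


# Hypothetical counts: gold standard 1 lists 60 competitor pairs,
# gold standard 2 lists 50, and 30 pairs appear in both.
total = estimate_population(60, 50, 30)
coverage = estimate_coverage(60, 60, 50, 30)
print(total, coverage)
```

The estimator assumes that the two gold standards sample competitor pairs independently of each other; correlated sources would make the overlap too large and the estimated total too small.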