by Dey et al.
This article (i) discusses methodologies for obtaining Web intelligence and (ii) presents case studies demonstrating how this information can be integrated with structured data to explain business facts and thereby support future decision making.
Traditionally, analysts relied on facts and figures from third-party reports and survey results to gather competitive intelligence. Nowadays, much of the information required for Web intelligence is available only in unstructured form; it is therefore not directly machine-interpretable and requires significant human effort to read, extract, organize and assimilate the relevant knowledge.
Methods

1. Content Acquisition:
- the authors deploy (i) a site-specific focused crawler and (ii) Google Alerts (www.google.com/alerts) to ensure that no relevant information is missed
- afterwards they use Nutch (nutch.apache.org) to extract content from these Web pages
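The notes name Nutch for this step; as an illustration only, the extraction it performs can be sketched with Python's standard-library HTML parser (the page content here is a made-up example, not the authors' data):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text while skipping script/style blocks -
    a minimal stand-in for Nutch's content extraction."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

page = ("<html><head><style>p{}</style></head>"
        "<body><p>Macy's to open new store.</p>"
        "<script>x=1</script></body></html>")
print(extract_text(page))
```

A production crawler would of course also handle fetching, deduplication and boilerplate removal, which Nutch provides out of the box.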
- Dey and Haque (2008) apply content-cleanup techniques such as content-dependent spelling correction, sentence demarcation, and removal of unnecessary capitalizations and special characters to content retrieved from social sources.
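A rough sketch of such cleanup rules, in the spirit of Dey and Haque (2008) but with illustrative rules of my own choosing rather than the paper's exact pipeline:

```python
import re

def clean_social_text(text: str) -> str:
    """Illustrative noisy-text cleanup: strip special characters,
    lowercase SHOUTED words, collapse emphatic punctuation, and
    demarcate the sentence end."""
    # Drop special characters except basic sentence punctuation.
    text = re.sub(r"[^\w\s.,!?'-]", " ", text)
    # Lowercase fully-capitalized words (unnecessary capitalization).
    text = re.sub(r"\b[A-Z]{2,}\b", lambda m: m.group(0).lower(), text)
    # Collapse repeated punctuation used for emphasis ("!!!" -> "!").
    text = re.sub(r"([.!?])\1+", r"\1", text)
    # Normalize whitespace.
    text = re.sub(r"\s+", " ", text).strip()
    # Ensure the fragment ends on a sentence boundary.
    if text and text[-1] not in ".!?":
        text += "."
    return text

print(clean_social_text("GREAT deal @ Macy's!!!  #shopping"))
```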
- identification and tagging of relevant concepts (i.e. words, phrases, entities and their combinations), supported by the organizational information stored in the domain ontology
- Frequently used ontology-concepts are used to label sets of documents
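The two bullets above can be sketched as follows; the toy ontology and documents are invented for illustration, not taken from the article:

```python
from collections import Counter

# A toy slice of a domain ontology: concept -> surface forms.
# A real system would load this from the enterprise ontology.
ONTOLOGY = {
    "store_opening": ["open", "opening", "new store"],
    "acquisition": ["acquire", "merger", "takeover"],
}

def tag_concepts(doc: str) -> set:
    """Tag a document with every ontology concept whose surface
    form occurs in it."""
    doc = doc.lower()
    return {concept for concept, forms in ONTOLOGY.items()
            if any(form in doc for form in forms)}

def label_cluster(docs: list) -> str:
    """Label a set of documents by its most frequent concept."""
    counts = Counter(c for d in docs for c in tag_concepts(d))
    return counts.most_common(1)[0][0] if counts else "unlabeled"

docs = ["Macy's to open a new store",
        "Dillard's opening downtown",
        "Rumors of a retail merger"]
print(label_cluster(docs))
```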
- Integration with ReVerb; extraction of the following data
- people and events (CEOs, CFOs, ...), market players, ...
- ternary relational patterns (person, organization, and a set of verbs ("holds", "serves", ...))
- competitor's strategies (market news, patents, product launches, mergers and acquisitions, ...) -> action verb classes (see above)
- consumer sentiments and opinions (opinion mining)
- promotion events (Groupon, Foursquare); pattern-based detection of such events
- real-world events (Twitter as an early-warning system)
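The ternary patterns above can be mimicked with a crude regex sketch. ReVerb itself does open relation extraction over arbitrary text; here the people, organizations and action verbs are hypothetical seed lists (in the article they would come from NER, e.g. Stanford NER, and the ontology's action-verb classes):

```python
import re

# Hypothetical seed lists, for illustration only.
PEOPLE = ["Tim Cook", "Mary Barra"]
ORGS = ["Apple", "General Motors"]
ACTION_VERBS = ["holds", "serves", "leads", "joins"]

def extract_ternary(sentence: str):
    """Return (person, verb, organization) triples found in a
    sentence - a crude stand-in for ReVerb-style extraction."""
    triples = []
    for p in PEOPLE:
        for o in ORGS:
            for v in ACTION_VERBS:
                # Require person ... verb ... organization, in order.
                pattern = re.escape(p) + r".*\b" + v + r"\b.*" + re.escape(o)
                if re.search(pattern, sentence):
                    triples.append((p, v, o))
    return triples

print(extract_ternary("Tim Cook serves as CEO of Apple."))
```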
- Label assignment
- based on the content (entities, relations, concepts, and opinions)
- topics are extracted using Latent Dirichlet Allocation (LDA)
- social-network content is very noisy; therefore, labeling of these articles is rule-based and uses prior knowledge about enterprises, products and concepts.
- example output:

      Cluster #11; Dominant Pattern: ; Label: Opening of new store
        Apr 25, 2011 - Macy's to open ...
        Apr 27, 2011 - The Dillard's property opened ...
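The rule-based labeling of noisy social posts might look like the following sketch; the keyword rules and label names are invented here, standing in for the article's prior knowledge about enterprises and products:

```python
# Hypothetical keyword rules, checked in priority order.
RULES = [
    (["%", "off", "coupon", "deal"], "promotion"),
    (["open", "opening", "store"], "store_news"),
    (["love", "hate", "awful", "great"], "opinion"),
]

def rule_label(post: str) -> str:
    """Assign the first matching rule's label to a social post."""
    tokens = post.lower().split()
    for keywords, label in RULES:
        if any(k in tokens for k in keywords):
            return label
    return "other"

print(rule_label("50% off everything this weekend deal"))
print(rule_label("I love the new Macy's store"))
```

The rule order encodes priority: a post mentioning both a deal and a store is a promotion first.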
- all classified content is harmonized and consolidated to generate reports in pre-defined templates. These reports contain quantifications of the extracted information and can be treated as intelligence-digests which are starting points for drill-down analyses.
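The consolidation step amounts to counting labeled items into a quantified digest; a minimal sketch with made-up items (the real templates are pre-defined by the authors):

```python
from collections import Counter

# Classified items: (label, snippet). In the pipeline these come
# from the upstream labeling stages.
items = [
    ("store_news", "Macy's to open ..."),
    ("store_news", "The Dillard's property opened ..."),
    ("promotion", "Groupon deal for ..."),
]

def digest(items):
    """Consolidate labeled content into a quantified digest,
    ordered by frequency - a starting point for drill-down."""
    counts = Counter(label for label, _ in items)
    return "\n".join(f"{label}: {n} item(s)"
                     for label, n in counts.most_common())

print(digest(items))
```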
- social-media data can be an indicator of brand popularity: the popularity of a brand was directly correlated with promotional activities for the brand
- correlation between social-media events and sales data: promotions led to a reduction in, or at least arrested the growth of, rival brands' sales (although the impact varies from region to region)
- cumulative impact of promotion events and negative news about competing brands: the market share of a brand went up once one of its rivals announced price rises
- effects of different categories of news items on an organization's performance: large amounts of data will be required to identify trends and draw conclusions
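The correlation findings above boil down to computing a correlation coefficient between event counts and sales series; a self-contained Pearson sketch with illustrative numbers (not the article's data):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length
    numeric series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up weekly series: detected promotion events for one brand
# vs. a rival brand's unit sales.
promo_events = [0, 1, 3, 5, 2, 4]
rival_sales = [110, 105, 90, 80, 98, 85]
r = pearson(promo_events, rival_sales)
print(round(r, 2))
```

A strongly negative r here would mirror the finding that promotions arrested a rival's growth; in practice region-by-region series would be compared separately.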
- nutch.apache.org - data extraction
- reverb.cs.washington.edu - Open Information Extraction Software
- nlp.stanford.edu/ner/index.shtml - Stanford Named Entity Recognition
- Dey, Lipika and Haque, S K Mirajul (2008). ''Opinion mining from noisy text data'', Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data, ACM, ISBN 978-1-60558-196-5, pages 83-90
- Shroff, Gautam, Agarwal, Puneet and Dey, Lipika (2011). ''Enterprise information fusion for real-time business intelligence'', Proceedings of the 14th International Conference on Information Fusion (FUSION 2011), pages 1-8