Web Page Classification: Features and Algorithms
- subject classification (topic)
- functional classification (role of a Web page; e.g. personal homepage, course page, admission page, ...)
- sentiment classification
- other types of classification (e.g. genre, spam, ...)
Applications
- Constructing, maintaining and extending Web directories
- Improving the quality of search results
- Question answering systems
- Building efficient focused crawlers
- Domain-specific search engines
- Web content filtering
- Contextual advertising
- Ontology annotation
- Knowledge base construction
Features
Qi and Davison (2009) distinguish between on-page features and neighbor features. The latter group is especially useful when the features of a particular Web page are missing, misleading or unrecognizable (e.g. flash intros, image-map navigation, etc.). In such cases, features extracted from neighboring (especially sibling) pages have considerably improved classification performance.
The following on-page features are of special importance:
- n-gram representations, which can capture concepts expressed by phrases (Shen et al. 2006)
- HTML tags
- URLs
- Visual analysis
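As a rough illustration of two of the on-page features listed above, the following sketch extracts word n-grams from page text and splits a URL into tokens. The function names, the tokenization rule and the example URL are assumptions made for this illustration; they are not taken from Shen et al. (2006) or Qi and Davison (2009).

```python
import re
from collections import Counter

def word_ngrams(text, n=2):
    """Count word n-grams in lowercased text (hypothetical helper)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens, dropping the scheme."""
    without_scheme = re.sub(r"^[a-z]+://", "", url.lower())
    return re.findall(r"[a-z0-9]+", without_scheme)

# Bigrams keep the phrase "machine learning" together as one feature.
bigrams = word_ngrams("Machine learning course page for machine learning students")

# URL tokens such as "courses" or "edu" can hint at the role of a page.
tokens = url_tokens("http://www.example.edu/courses/cs101/index.html")
```

A unigram model would see "machine" and "learning" as unrelated terms, whereas the bigram count treats the phrase as a single concept.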
Feature Selection
- Information gain
- Mutual information
- Document frequency
- Chi-squared test
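Two of the criteria above can be computed from a 2x2 contingency table of term occurrence against class membership. The sketch below is a minimal version for a single term and a binary class; the argument names a, b, c, d are notation chosen for this example only.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(a, b, c, d):
    """Information gain of a term for a binary class.

    a: in-class documents containing the term
    b: out-of-class documents containing the term
    c: in-class documents without the term
    d: out-of-class documents without the term
    """
    n = a + b + c + d
    prior = entropy([a + c, b + d])
    with_term = entropy([a, b]) if a + b else 0.0
    without_term = entropy([c, d]) if c + d else 0.0
    return prior - ((a + b) / n) * with_term - ((c + d) / n) * without_term

def chi_squared(a, b, c, d):
    """Chi-squared statistic for the same 2x2 table (no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0
```

For a term that appears in every in-class document and in no other, such as a=5, b=0, c=0, d=5, both scores are at their maximum for that table; for a term spread evenly over both classes, both scores are zero. Terms would then be ranked by score and the top-ranked ones kept as features.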
Hierarchical Classification
Dumais and Chen (2000) suggest the use of hierarchical structures for Web page classification based on classical "divide and conquer" approaches. They show that splitting the classification problem into a set of sub-problems at each level of the hierarchy is more efficient and accurate than classifying using a flat model. Wibowo and Williams (2002) suggested methods to minimize errors by shifting the class assignment to a higher level of the hierarchy if the lower-level assignment is uncertain.
Liu et al. (2005) studied the scalability and effectiveness of SVMs for classifying documents into large-scale taxonomies. They found that, although hierarchical SVMs are more efficient than flat SVMs, neither approach yields satisfactory results for large taxonomies. They also showed that under certain conditions hierarchical settings do more harm than good when used for k-Nearest Neighbor or Naive Bayes classifiers.
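The top-down idea, combined with backing off to a higher level when a lower-level decision is uncertain, can be sketched as follows. This is an illustrative toy and not the procedure of any of the cited papers: the keyword-overlap scorer, the taxonomy and the threshold of 0.6 are all assumptions, and a real system would use a trained classifier at each node.

```python
def keyword_scores(doc, node):
    """Toy per-node scorer: share of keyword hits per child (assumption)."""
    words = set(doc.lower().split())
    hits = {c["name"]: len(words & c["keywords"]) for c in node["children"]}
    total = sum(hits.values())
    if total == 0:
        # No evidence at all: fall back to a uniform distribution.
        return {name: 1 / len(hits) for name in hits}
    return {name: h / total for name, h in hits.items()}

def classify_top_down(doc, node, score_children=keyword_scores, threshold=0.6):
    """Descend the taxonomy one level at a time.

    If the best child scores below the threshold, stop and keep the
    current (higher-level) node as the assignment.
    """
    path = [node["name"]]
    while node.get("children"):
        scores = score_children(doc, node)
        best = max(scores, key=scores.get)
        if scores[best] < threshold:
            break
        node = next(c for c in node["children"] if c["name"] == best)
        path.append(node["name"])
    return path

# Hypothetical two-level taxonomy used only for this example.
taxonomy = {
    "name": "Root",
    "children": [
        {"name": "Science",
         "keywords": {"physics", "biology", "quantum", "cell", "experiment"},
         "children": [
             {"name": "Physics", "keywords": {"quantum", "particle", "physics"}},
             {"name": "Biology", "keywords": {"cell", "gene", "biology"}},
         ]},
        {"name": "Sports", "keywords": {"football", "tennis", "match", "goal"}},
    ],
}

confident = classify_top_down("quantum particle experiment", taxonomy)
backed_off = classify_top_down("experiment results", taxonomy)
```

In this toy, the first document reaches the leaf Physics, while the second stops at Science because neither sub-category is supported by the text. Each node only has to discriminate among its own children, which is the sub-problem decomposition that the hierarchical approach relies on.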