Web Page Classification: Features and Algorithms


Qi, X. & Davison, B.D., 2009. Web page classification: Features and algorithms. ACM Comput. Surv., 41(2), pp. 12:1–12:31.

Classification is traditionally posed as a supervised learning problem. The most common classification tasks are:

  • subject classification (topic)
  • functional classification (role of a Web page; e.g. personal homepage, course page, admission page, ...)
  • sentiment classification
  • other types of classification (e.g. genre, spam, ...)

Applications

  • Constructing, maintaining and extending Web directories
  • Improving the quality of search results
  • Question answering systems
  • Building efficient focused crawlers
  • Domain-specific search engines
  • Web content filtering
  • Contextual advertising
  • Ontology annotation
  • Knowledge base construction

Features

Qi and Davison (2009) distinguish between on-page features and neighbor features. The latter group is especially useful in cases where the features of a particular Web page are missing, misleading or unrecognizable (e.g. Flash intros, image-map navigation, etc.). In such cases, features extracted from neighboring (especially sibling) pages have considerably improved classification performance.
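The fallback idea can be sketched in a few lines of Python. This is only an illustration under assumptions of my own, not a method from the survey: the whitespace tokenizer, the `min_tokens` threshold and the `sibling_weight` factor are all hypothetical choices.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; a deliberately simple stand-in."""
    return text.lower().split()


def page_features(page_text, sibling_texts, min_tokens=5, sibling_weight=0.5):
    """Return term weights for a page.

    If the page has fewer than `min_tokens` tokens (an assumed threshold),
    averaged term counts from its sibling pages are added at a reduced
    weight (`sibling_weight`, also an assumption).
    """
    own = Counter(tokenize(page_text))
    features = {term: float(count) for term, count in own.items()}
    if sum(own.values()) >= min_tokens or not sibling_texts:
        return features  # enough on-page evidence, or nothing to borrow
    for text in sibling_texts:
        for term, count in Counter(tokenize(text)).items():
            borrowed = sibling_weight * count / len(sibling_texts)
            features[term] = features.get(term, 0.0) + borrowed
    return features


# A page that only says "skip intro" borrows vocabulary from its siblings.
sparse = page_features(
    "skip intro",
    ["machine learning course syllabus", "machine learning course homework"],
)
```

Here `sparse["machine"]` ends up at 0.5 while the page's own terms keep weight 1.0, so the borrowed evidence supplements the on-page text rather than replacing it.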

The following on-page features are of special importance:

  • n-gram representations, which can capture concepts expressed by phrases (Shen et al. 2006)
  • HTML tags
  • URLs
  • Visual analysis
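To make the first and third items more concrete, here is a small Python sketch of word n-gram extraction and URL tokenization. The function names and the regular expressions are my own assumptions, chosen for illustration; they are not taken from Shen et al. (2006) or from the survey.

```python
import re


def word_ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max, in order of n."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens, ignoring the scheme."""
    without_scheme = re.sub(r"^[a-z][a-z0-9+.-]*://", "", url.lower())
    return [t for t in re.split(r"[^a-z0-9]+", without_scheme) if t]


print(word_ngrams("Support vector machines"))
# ['support', 'vector', 'machines', 'support vector', 'vector machines']
print(url_tokens("http://www.example.edu/courses/cs101/index.html"))
# ['www', 'example', 'edu', 'courses', 'cs101', 'index', 'html']
```

The bigram "support vector" survives as a single feature, which is the kind of phrase-level concept a plain bag of words would lose.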

Feature Selection

  • Information gain
  • Mutual information
  • Document frequency
  • Chi-squared test
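Three of these criteria can be computed from a simple 2×2 term/class contingency table. The sketch below uses the standard textbook formulas for a single term and a binary "in class / not in class" split; the toy corpus and all function names are hypothetical.

```python
import math


def contingency(docs, labels, term, target):
    """Count documents by (contains term?, belongs to target class?)."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term, in_class = term in tokens, label == target
        if has_term and in_class:
            a += 1  # term present, target class
        elif has_term:
            b += 1  # term present, other class
        elif in_class:
            c += 1  # term absent, target class
        else:
            d += 1  # term absent, other class
    return a, b, c, d


def document_frequency(a, b, c, d):
    """Number of documents that contain the term."""
    return a + b


def chi_squared(a, b, c, d):
    """Chi-squared statistic of the 2x2 table (0.0 if a margin is empty)."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def entropy(counts):
    """Shannon entropy in bits of a list of counts."""
    total = sum(counts)
    if not total:
        return 0.0
    return -sum(k / total * math.log2(k / total) for k in counts if k)


def information_gain(a, b, c, d):
    """Class entropy minus class entropy conditioned on term presence."""
    n = a + b + c + d
    prior = entropy([a + c, b + d])
    conditional = (a + b) / n * entropy([a, b]) + (c + d) / n * entropy([c, d])
    return prior - conditional


# Toy corpus: each document is a set of terms.
docs = [{"football", "match"}, {"football", "goal"},
        {"election", "vote"}, {"election", "match"}]
labels = ["sport", "sport", "politics", "politics"]

football = contingency(docs, labels, "football", "sport")  # (2, 0, 0, 2)
match = contingency(docs, labels, "match", "sport")        # (1, 1, 1, 1)
```

On this corpus "football" separates the classes perfectly (chi-squared 4.0, information gain 1.0 bit), while "match" occurs equally in both classes and scores 0 on both measures, so a selector that keeps the top-scoring terms would drop it.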

Hierarchical Classification

Dumais and Chen (2000) suggest the use of hierarchical structures for Web page classification based on the classical "divide and conquer" approach. They show that splitting the classification problem into a set of sub-problems at each level of the hierarchy is more efficient and accurate than classifying with a flat model. Wibowo and Williams (2002) suggested methods to minimize errors by shifting the class assignment to a higher level of the hierarchy if the lower-level assignment is uncertain.
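The two ideas, a local decision at each level and a back-off to the parent category when that decision is uncertain, can be sketched as follows. This is a toy illustration and not the authors' implementation: the keyword-overlap scorer stands in for a trained classifier, and the taxonomy, the `threshold` value and all names are assumptions.

```python
# Hypothetical two-level taxonomy; each category is described by keywords.
ROOT = {"children": {
    "science": {
        "keywords": {"physics", "quantum", "biology", "cell", "research"},
        "children": {
            "physics": {"keywords": {"quantum", "particle", "energy"}},
            "biology": {"keywords": {"cell", "gene", "protein"}},
        },
    },
    "sports": {
        "keywords": {"football", "tennis", "match", "goal"},
        "children": {
            "football": {"keywords": {"football", "goal", "league"}},
            "tennis": {"keywords": {"tennis", "serve", "racket"}},
        },
    },
}}


def score(tokens, keywords):
    """Fraction of the page's tokens that match the category keywords."""
    return len(tokens & keywords) / len(tokens) if tokens else 0.0


def classify(tokens, node, threshold=0.2, path=()):
    """Walk down the taxonomy and return the category path as a tuple.

    At each level only the children of the current node compete. If the
    best child scores below `threshold`, stop and keep the parent path.
    """
    children = node.get("children", {})
    if not children:
        return path  # reached a leaf category
    best = max(children, key=lambda name: score(tokens, children[name]["keywords"]))
    if score(tokens, children[best]["keywords"]) < threshold:
        return path  # uncertain: assign the higher-level category instead
    return classify(tokens, children[best], threshold, path + (best,))


print(classify({"quantum", "particle", "research"}, ROOT))
# ('science', 'physics')
print(classify({"research", "funding", "grant", "policy"}, ROOT))
# ('science',)
```

The second page is recognizably about science but matches neither sub-category, so it stays at the "science" level instead of being forced into "physics" or "biology".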

Liu et al. (2005) studied the scalability and effectiveness of SVMs for classifying documents into large-scale taxonomies. They found that, although hierarchical SVMs are more efficient than flat SVMs, neither approach yields satisfactory results for large taxonomies. They also showed that, under certain conditions, hierarchical settings do more harm than good when used with k-Nearest Neighbor or Naive Bayes classifiers.

 

Literature

Dumais, S. & Chen, H., 2000. Hierarchical classification of Web content. In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval. SIGIR '00. New York, NY, USA: ACM, pp. 256–263. Available at: http://doi.acm.org/10.1145/345508.345593 [Accessed October 1, 2012].

Liu, T.-Y. et al., 2005. Support vector machines classification with a very large-scale taxonomy. SIGKDD Explor. Newsl., 7(1), pp. 36–43.

Shen, D. et al., 2006. Text classification improved through multigram models. In Proceedings of the 15th ACM international conference on Information and knowledge management. CIKM '06. New York, NY, USA: ACM, pp. 672–681. Available at: http://doi.acm.org/10.1145/1183614.1183710 [Accessed September 28, 2012].

Wibowo, W. & Williams, H.E., 2002. Simple and accurate feature selection for hierarchical categorisation. In Proceedings of the 2002 ACM symposium on Document engineering. DocEng '02. New York, NY, USA: ACM, pp. 111–118. Available at: http://doi.acm.org/10.1145/585058.585079 [Accessed October 1, 2012].