Web Page Classification: Features and Algorithms


Qi, X. & Davison, B.D., 2009. Web page classification: Features and algorithms. ACM Comput. Surv., 41(2), pp. 12:1–12:31.

Classification is traditionally posed as a supervised learning problem. The most common classification tasks are:

subject classification (topic)
functional classification (role of a Web page; e.g. personal homepage, course page, admission page, ...)
sentiment classification
other types of classification (e.g. genre, spam, ...)

Applications

Constructing, maintaining and extending Web directories
Improving the quality of search results
Question answering systems
Building efficient focused crawlers
Domain-specific search engines
Web content filtering
Contextual advertising
Ontology annotation
Knowledge base construction

Features

Qi and Davison (2009) distinguish between on-page features and neighbor features. The latter group is especially useful when the features of a particular Web page are missing, misleading or unrecognizable (e.g. Flash intros, image-map navigation, etc.). In such cases, features extracted from neighboring (especially sibling) pages have considerably improved classification performance.
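To make the idea of neighbor features concrete, here is a minimal sketch. It assumes that sibling pages are pages sharing at least one parent (a page that links to both), and that neighbor term counts are simply added to the page's own counts with a reduced weight. The function names, the link-graph layout and the weight of 0.5 are illustrative assumptions, not something prescribed by Qi and Davison (2009).

```python
from collections import Counter


def sibling_pages(page, links):
    """Return pages that share at least one parent with `page`.

    `links` maps a source page to the list of pages it links to
    (hypothetical layout chosen for this sketch).
    """
    parents = {src for src, targets in links.items() if page in targets}
    siblings = set()
    for parent in parents:
        siblings.update(links[parent])
    siblings.discard(page)
    return siblings


def combine_features(own, neighbor_counts, weight=0.5):
    """Add down-weighted neighbor term counts to a page's own counts."""
    combined = {term: float(count) for term, count in own.items()}
    for counts in neighbor_counts:
        for term, count in counts.items():
            combined[term] = combined.get(term, 0.0) + weight * count
    return combined


links = {"hub": ["a", "b", "c"], "other": ["c", "d"]}
siblings_of_a = sibling_pages("a", links)

own = Counter({"course": 2})
neighbors = [Counter({"course": 1, "syllabus": 2})]
features = combine_features(own, neighbors)
```

A page with little usable text of its own (an empty `own` counter) would then be described entirely by its siblings, which is the situation the paragraph above has in mind.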

The following on-page features are of special importance:

n-gram representations, which can capture concepts expressed by phrases (Shen et al. 2006)
HTML tags
URLs
Visual analysis
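As a rough illustration of the first and third feature types, the sketch below counts word unigrams and bigrams in a page's text and splits a URL into tokens. The tokenization rule (lowercase alphanumeric runs) and the function names are assumptions made for this example; the multigram models of Shen et al. (2006) are more elaborate.

```python
from collections import Counter
import re


def word_ngrams(text, n_max=2):
    """Count word n-grams of length 1..n_max in the lowercased text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", url.lower())


ngrams = word_ngrams("Support vector machines for Web page classification")
tokens = url_tokens("http://example.edu/courses/cs101")
```

The bigram "support vector" ends up as a single feature here, whereas a plain bag-of-words model would only see "support" and "vector" separately.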

Feature Selection

Information gain
Mutual information
Document frequency
Chi-squared test
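As an example of how one of these criteria scores a feature, the sketch below computes the chi-squared statistic for a term and a class from the usual 2x2 contingency table: chi2 = N (AD - CB)^2 / ((A + C)(B + D)(A + B)(C + D)), where A and B count documents containing the term inside and outside the class, and C and D count documents without the term inside and outside the class. The toy documents and the function name are made up for illustration.

```python
def chi_squared(docs, labels, term, target):
    """Chi-squared statistic of `term` with respect to class `target`.

    `docs` is a list of token sets, `labels` the matching class labels.
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == target
        if has_term and in_class:
            a += 1  # term present, document in class
        elif has_term:
            b += 1  # term present, document not in class
        elif in_class:
            c += 1  # term absent, document in class
        else:
            d += 1  # term absent, document not in class
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # term or class never varies, so no evidence either way
    return n * (a * d - c * b) ** 2 / denominator


docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "party"}, {"vote", "match"}]
labels = ["sports", "sports", "politics", "politics"]

score_goal = chi_squared(docs, labels, "goal", "sports")
score_match = chi_squared(docs, labels, "match", "sports")
```

In this toy corpus "goal" occurs only in sports documents and receives a high score, while "match" is spread evenly over both classes and scores zero, so a selector keeping the top-scoring terms would retain "goal" and drop "match".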

Hierarchical Classification

Dumais and Chen (2000) suggest the use of hierarchical structures for Web page classification based on the classical "divide and conquer" approach. They show that splitting the classification problem into a set of sub-problems at each level of the hierarchy is more efficient and accurate than classifying with a flat model. Wibowo and Williams (2002) suggested methods to minimize errors by shifting the class assignment to a higher level of the hierarchy if the lower-level assignment is uncertain.
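The general idea of top-down classification with a fallback to a higher level can be sketched as follows. This is only an illustration under strong assumptions: the per-node "classifier" is a toy keyword-overlap score standing in for a real model, and the taxonomy, the confidence measure and the threshold of 0.6 are hypothetical. It is not the method of Dumais and Chen (2000) or Wibowo and Williams (2002).

```python
TAXONOMY = {
    "name": "Root",
    "children": [
        {
            "name": "Science",
            "keywords": {"physics", "biology", "cell", "quantum", "research"},
            "children": [
                {"name": "Physics", "keywords": {"quantum", "particle"}},
                {"name": "Biology", "keywords": {"cell", "gene"}},
            ],
        },
        {"name": "Sports", "keywords": {"football", "tennis", "match", "goal"}},
    ],
}


def classify_top_down(tokens, node, threshold=0.6):
    """Descend the taxonomy one level at a time.

    At each node only the children of that node compete (divide and
    conquer). If the best child is not confident enough, stop and keep
    the assignment at the current, higher level.
    """
    path = [node["name"]]
    while node.get("children"):
        scores = {
            child["name"]: sum(1 for t in tokens if t in child["keywords"])
            for child in node["children"]
        }
        total = sum(scores.values())
        if total == 0:
            break  # no evidence for any child
        best = max(node["children"], key=lambda child: scores[child["name"]])
        confidence = scores[best["name"]] / total
        if confidence < threshold:
            break  # uncertain: keep the higher-level class
        path.append(best["name"])
        node = best
    return path


confident = classify_top_down(["quantum", "particle", "research"], TAXONOMY)
uncertain = classify_top_down(["research", "quantum", "cell"], TAXONOMY)
```

The first document is assigned to a leaf category, while the second one mixes physics and biology vocabulary and therefore stays at the "Science" level instead of being forced into one of its subcategories.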

Liu et al. (2005) studied the scalability and effectiveness of SVMs for classifying documents into large-scale taxonomies. They found that although hierarchical SVMs are more efficient than flat SVMs, neither approach yields satisfactory results for large taxonomies. They also showed that, under certain conditions, hierarchical settings do more harm than good when used with k-Nearest Neighbor or Naive Bayes classifiers.

Literature

Dumais, S. & Chen, H., 2000. Hierarchical classification of Web content. In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval. SIGIR '00. New York, NY, USA: ACM, pp. 256–263. Available at: http://doi.acm.org/10.1145/345508.345593 [Accessed October 1, 2012].

Liu, T.-Y. et al., 2005. Support vector machines classification with a very large-scale taxonomy. SIGKDD Explor. Newsl., 7(1), pp. 36–43.

Shen, D. et al., 2006. Text classification improved through multigram models. In Proceedings of the 15th ACM international conference on Information and knowledge management. CIKM '06. New York, NY, USA: ACM, pp. 672–681. Available at: http://doi.acm.org/10.1145/1183614.1183710 [Accessed September 28, 2012].

Wibowo, W. & Williams, H.E., 2002. Simple and accurate feature selection for hierarchical categorisation. In Proceedings of the 2002 ACM symposium on Document engineering. DocEng '02. New York, NY, USA: ACM, pp. 111–118. Available at: http://doi.acm.org/10.1145/585058.585079 [Accessed October 1, 2012].


Albert Weichselbraun
