Exploiting link structure for web page genre identification

Jia Zhu, Qing Xie, Shoou I. Yu, Wai Hung Wong

Research output: Contribution to journal, Article, peer-review

18 Scopus citations

Abstract

As the World Wide Web develops at an unprecedented pace, identifying web page genre has recently attracted increasing attention because of its importance in web search. A common approach for identifying genre is to use textual features that can be extracted directly from a web page, that is, On-Page features. The extracted features are then fed into a machine learning algorithm that performs the classification. However, this approach may be ineffective when the web page contains limited textual information (e.g., the page is full of images). In this study, we address genre identification for web pages with limited textual information. We propose a framework that uses On-Page features while simultaneously considering information in neighboring pages, that is, the pages that are connected to the original page by backward and forward links. We first introduce a graph-based model called GenreSim, which selects an appropriate set of neighboring pages. We then construct a multiple classifier combination module that utilizes information from the selected neighboring pages and On-Page features to improve performance in genre identification. Experiments are conducted on well-known corpora, and favorable results indicate that our proposed framework is effective, particularly in identifying web pages with limited textual information. © 2015 The Author(s)
Original language: English (US)
Pages (from-to): 550-575
Number of pages: 26
Journal: Data Mining and Knowledge Discovery
Volume: 30
Issue number: 3
State: Published - Jul 7 2015
