A Study of Approaches to Hypertext Categorization

Cited by: 39
Authors
Yiming Yang
Seán Slattery
Rayid Ghani
Affiliations
[1] Carnegie Mellon University,School of Computer Science
[2] Carnegie Mellon University,School of Computer Science
[3] Accenture Technology Labs - Research
Source
Journal of Intelligent Information Systems | 2002 / Vol. 18
Keywords
hypertext classification; machine learning; web mining; text mining
DOI
Not available
Chinese Library Classification number
Subject classification number
Abstract
Hypertext poses new research challenges for text classification. Hyperlinks, HTML tags, category labels distributed over linked documents, and metadata extracted from related Web sites all provide rich information for classifying hypertext documents. How to appropriately represent that information and automatically learn statistical patterns for solving hypertext classification problems is an open question. This paper seeks a principled approach to providing the answers. Specifically, we define five hypertext regularities which may (or may not) hold in a particular application domain, and whose presence (or absence) may significantly influence the optimal design of a classifier. Using three hypertext datasets and three well-known learning algorithms (Naive Bayes, Nearest Neighbor, and First Order Inductive Learner), we examine these regularities in different domains, and compare alternative ways to exploit them. Our results show that the identification of hypertext regularities in the data and the selection of appropriate representations for hypertext in particular domains are crucial, but seldom obvious, in real-world problems. We find that adding the words in the linked neighborhood to the page having those links (both inlinks and outlinks) was helpful for all our classifiers on one dataset, but more harmful than helpful for two out of the three classifiers on the remaining datasets. We also observed that extracting metadata from related Web sites was extremely useful for improving classification accuracy in some of those domains. Finally, the relative performance of the classifiers being tested provided insights into their strengths and limitations for solving classification problems involving diverse and often noisy Web pages.
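
The neighborhood representation mentioned in the abstract (adding the words of linked pages to the page that has those links) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `tokenize` and `augment_with_neighbors`, the whitespace tokenizer, the unweighted merging of neighbor words, and the toy pages are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def augment_with_neighbors(page_id, texts, links):
    """Return a bag of words for page_id that also counts the words of
    pages it links to (outlinks) and pages linking to it (inlinks)."""
    outlinks = links.get(page_id, set())
    inlinks = {src for src, dsts in links.items() if page_id in dsts}
    bag = Counter(tokenize(texts[page_id]))
    # Each neighbor is merged once, with no weighting (an assumption).
    for neighbor in sorted((outlinks | inlinks) - {page_id}):
        bag.update(tokenize(texts[neighbor]))
    return bag


# Hypothetical toy data: page "a" links to "b", and page "c" links to "a".
texts = {
    "a": "course syllabus machine learning",
    "b": "faculty homepage research",
    "c": "student homepage",
}
links = {"a": {"b"}, "c": {"a"}}

bag_a = augment_with_neighbors("a", texts, links)
bag_b = augment_with_neighbors("b", texts, links)
```

The resulting bag could then be fed to any bag-of-words classifier. As the abstract notes, this kind of augmentation helped on one dataset and hurt on others, so whether to apply it is a per-domain decision.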
Pages: 219-241
Page count: 22