Automating the construction of internet portals with machine learning

Cited by: 745
Authors
McCallum, AK [1]
Nigam, K
Rennie, J
Seymore, K
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] MIT, Cambridge, MA 02139 USA
Source
INFORMATION RETRIEVAL | 2000 / Vol. 3 / Issue 02
Keywords
spidering; crawling; reinforcement learning; information extraction; hidden Markov models; text classification; naive Bayes; expectation-maximization; unlabeled data;
DOI
10.1023/A:1009953814988
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Domain-specific internet portals are growing in popularity because they gather content from the Web and organize it for easy access, retrieval and search. For example, www.campsearch.com allows complex queries by age, location, cost and specialty over summer camps. This functionality is not possible with general, Web-wide search engines. Unfortunately, these portals are difficult and time-consuming to maintain. This paper advocates the use of machine learning techniques to greatly automate the creation and maintenance of domain-specific Internet portals. We describe new research in reinforcement learning, information extraction and text classification that enables efficient spidering, the identification of informative text segments, and the population of topic hierarchies. Using these techniques, we have built a demonstration system: a portal for computer science research papers. It already contains over 50,000 papers and is publicly available at www.cora.justresearch.com. These techniques are widely applicable to portal creation in other domains.
Pages: 127-163
Page count: 37