Performing binary-categorization on multiple-record web documents using information retrieval models and application ontologies

Cited by: 4
Authors
Kwong, LW [1]
Ng, YK [1]
Affiliation
[1] Brigham Young Univ, Dept Comp Sci, Provo, UT 84602 USA
Source
WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS | 2003 / Vol. 6 / No. 03
Keywords
information retrieval; application ontologies; binary-categorization; vector-space model; clustering model; WWW
DOI
10.1023/A:1024653618816
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
To retrieve Web documents of interest, most Web users rely on Web search engines. All existing search engines provide a query facility that lets users search for the desired documents using search-engine keywords. However, when a search engine retrieves a long list of Web documents, the user might need to browse through each retrieved document in order to determine which document is of interest. We observe that there are two kinds of problems involved in the retrieval of Web documents: (1) an inappropriate selection of keywords specified by the user; and (2) poor precision in the retrieved Web documents. To solve these problems, we propose an automatic binary-categorization method for recognizing multiple-record Web documents of interest, which appear often in advertisement Web pages. Our categorization method uses application ontologies and is based on two information retrieval models, the Vector Space Model (VSM) and the Clustering Model (CM). We analyze and cull Web documents to just those applicable to a particular application ontology. The culling analysis (i) uses CM to find a virtual centroid for the records in a Web document, (ii) computes a vector in a multi-dimensional space for this centroid, and (iii) compares the vector with the predefined ontology vector of the same multi-dimensional space using VSM, in which we consider the magnitudes of the vectors as well as the angle between them. Our experimental results show that we have achieved an average of 90% recall and 97% precision in recognizing Web documents that belong to the same category (i.e., domain of interest). Thus, our categorization discards very few documents it should have kept and keeps very few it should have discarded.
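The three culling steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how record vectors are built or how magnitude and angle are combined, so the sketch assumes each record is already a numeric vector over the ontology's dimensions, takes the centroid as the component-wise mean, and accepts a document only when both the cosine of the angle and the ratio of magnitudes pass thresholds. The function names and the threshold values `min_cosine` and `min_magnitude_ratio` are hypothetical.

```python
import math


def centroid(record_vectors):
    """Component-wise mean of the record vectors (assumed form of the 'virtual centroid')."""
    n = len(record_vectors)
    dims = len(record_vectors[0])
    return [sum(v[i] for v in record_vectors) / n for i in range(dims)]


def norm(v):
    """Euclidean magnitude of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    nu, nv = norm(u), norm(v)
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def is_applicable(record_vectors, ontology_vector,
                  min_cosine=0.8, min_magnitude_ratio=0.5):
    """Binary decision: does the document fit the ontology?

    Both thresholds are hypothetical illustration values, not taken from the paper.
    """
    if not record_vectors or norm(ontology_vector) == 0:
        return False
    c = centroid(record_vectors)                      # step (i) and (ii)
    angle_ok = cosine(c, ontology_vector) >= min_cosine
    size_ok = norm(c) / norm(ontology_vector) >= min_magnitude_ratio
    return angle_ok and size_ok                       # step (iii)


# Hypothetical 4-dimensional ontology and two toy documents.
ontology = [1, 1, 1, 1]
on_topic = [[1, 1, 1, 0], [1, 1, 0, 1], [1, 0, 1, 1]]
off_topic = [[0, 0, 0, 1], [0, 0, 0, 0]]

print(is_applicable(on_topic, ontology))
print(is_applicable(off_topic, ontology))
```

Checking the magnitude as well as the angle matters because a document that mentions only a few ontology terms, in the right proportions, can point in nearly the same direction as the ontology vector while carrying far too little evidence to be accepted.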
Pages: 281-303
Number of pages: 23