A semidiscrete matrix decomposition for latent semantic indexing in information retrieval

被引:112
作者
Kolda, TG
O'Leary, DP
机构
[1] Oak Ridge Natl Lab, Math Sci Sect, Oak Ridge, TN 37831 USA
[2] Univ Maryland, Dept Comp Sci, College Pk, MD 20742 USA
[3] Univ Maryland, Inst Adv Comp Studies, College Pk, MD 20742 USA
关键词
algorithms; data mining; latent semantic indexing; semidiscrete decomposition; singular-value decomposition; text retrieval;
D O I
10.1145/291128.291131
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
The vast amount of textual information available today is useless unless it can be effectively and efficiently searched. The goal in information retrieval is to find documents that are relevant to a given user query. We can represent a document collection by a matrix whose (i, j) entry is nonzero only if the ith term appears in the jth document; thus each document corresponds to a column vector. The query is also represented as a column vector whose ith term is nonzero only if the ith term appears in the query. We score each document for relevancy by taking its inner product with the query. The highest-scoring documents are considered the most relevant. Unfortunately, this method does not necessarily retrieve all relevant documents because it is based on literal term matching. Latent semantic indexing (LSI) replaces the document matrix with an approximation generated by the truncated singular-value decomposition (SVD). This method has been shown to overcome many difficulties associated with literal term matching. In this article we propose replacing the SVD with the semidiscrete decomposition (SDD). We will describe the SDD approximation, show how to compute it, and compare the SDD-based LSI method to the SVD-bascd LSI method. We will show that SDD-based LSI does as well as SVD-based LSI in terms of document retrieval while requiring only one-twentieth the storage and one-half the time to compute each query. We will also show how to update the SDD approximation when documents are added or deleted from the document collection.
引用
收藏
页码:322 / 346
页数:25
相关论文
共 24 条
[1]  
[Anonymous], THESIS U TENNESSEE K
[2]  
[Anonymous], 1994, MANAGING GIGABYTES C
[3]  
[Anonymous], NIST SPECIAL PUBLICA
[4]  
BERRY M, 1993, SVDPACK VERSION 1 0
[5]   Using linear algebra for intelligent information retrieval [J].
Berry, MW ;
Dumais, ST ;
OBrien, GW .
SIAM REVIEW, 1995, 37 (04) :573-595
[6]  
BERRY MW, 1996, NUMERICAL LINEAR ALG, V1, P1
[7]  
BROGLIO J, 1995, NIST SPECIAL PUBLICA
[8]  
Cohen E, 1997, PROCEEDINGS OF THE EIGHTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, P682
[9]  
DEERWESTER S, 1990, J AM SOC INFORM SCI, V41, P391, DOI 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO
[10]  
2-9