Lower dimensional representation of text data based on centroids and least squares

被引:97
作者
Park, H [1 ]
Jeon, M
Ben Rosen, J
机构
[1] Univ Minnesota, Dept Comp Sci & Engn, Minneapolis, MN 55455 USA
[2] Univ Calif Santa Barbara, Dept Comp Sci & Engn, Santa Barbara, CA 93106 USA
[3] Univ Calif San Diego, La Jolla, CA 92093 USA
[4] Linkoping Univ, Dept Math, S-58183 Linkoping, Sweden
来源
BIT | 2003年 / 43卷 / 02期
基金
美国国家科学基金会;
关键词
dimension reduction; centroids; least squares; rank reducing decomposition; classification; feature extraction;
D O I
10.1023/A:1026039313770
中图分类号
TP31 [计算机软件];
学科分类号
081202 [计算机软件与理论]; 0835 [软件工程];
摘要
Dimension reduction in today's vector space based information retrieval system is essential for improving computational efficiency in handling massive amounts of data. A mathematical framework for lower dimensional representation of text data in vector space based information retrieval is proposed using minimization and a matrix rank reduction formula. We illustrate how the commonly used Latent Semantic Indexing based on the Singular Value Decomposition (LSI/SVD) can be derived as a method for dimension reduction from our mathematical framework. Then two new methods for dimension reduction based on the centroids of data clusters are proposed and shown to be more efficient and effective than LSI/SVD when we have a priori information on the cluster structure of the data. Several advantages of the new methods in terms of computational efficiency and data representation in the reduced space, as well as their mathematical properties are discussed. Experimental results are presented to illustrate the effectiveness of our methods on certain classification problems in a reduced dimensional space. The results indicate that for a successful lower dimensional representation of the data, it is important to incorporate a priori knowledge in the dimension reduction algorithms.
引用
收藏
页码:427 / 448
页数:22
相关论文
共 39 条
[1]
Anderberg M.R., 1973, Probability and Mathematical Statistics
[2]
[Anonymous], SIAM J MATRIX ANAL A
[3]
Using linear algebra for intelligent information retrieval [J].
Berry, MW ;
Dumais, ST ;
OBrien, GW .
SIAM REVIEW, 1995, 37 (04) :573-595
[4]
Matrices, vector spaces, and information retrieval [J].
Berry, MW ;
Drmac, Z ;
Jessup, ER .
SIAM REVIEW, 1999, 41 (02) :335-362
[5]
BJORCK A, 1996, NUMERICAL METHODS LE
[6]
A rank-one reduction formula and its applications to matrix factorizations [J].
Chu, MT ;
Funderlic, RE ;
Golub, GH .
SIAM REVIEW, 1995, 37 (04) :512-530
[7]
RANK OF A DIFFERENCE OF MATRICES AND ASSOCIATED GENERALIZED INVERSES [J].
CLINE, RE ;
FUNDERLIC, RE .
LINEAR ALGEBRA AND ITS APPLICATIONS, 1979, 24 (APR) :185-215
[8]
COLON JR, 1996, J AM SOC INFORM SCI, V47, P449
[9]
DEERWESTER S, 1990, J AM SOC INFORM SCI, V41, P391, DOI 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO
[10]
2-9