Text classification using string kernels

被引:675
作者
Lodhi, H [1 ]
Saunders, C [1 ]
Shawe-Taylor, J [1 ]
Cristianini, N [1 ]
Watkins, C [1 ]
机构
[1] Univ London Royal Holloway & Bedford New Coll, Dept Comp Sci, Egham TW20 0EX, Surrey, England
关键词
kernels and support vector machines; string subsequence kernel; approximating kernels; text classification;
D O I
10.1162/153244302760200687
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
We propose a novel approach for categorizing text documents based on the use of a special kernel. The kernel is an inner product in the feature space generated by all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences that are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique.
引用
收藏
页码:419 / 444
页数:26
相关论文
共 19 条
[1]  
[Anonymous], UCSCCRL9910 COMP SCI
[2]  
Boser B. E., 1992, Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, P144, DOI 10.1145/130385.130401
[3]  
CAVNAR WB, 1994, P TREC 3 3 TEXT RETR, P269
[4]  
CRISTIANINI N, 2001, NCTR01087 U LOND
[5]  
CRISTIANINI N, IN PRESS NEURAL INFO
[6]  
Cristianini N, 2000, Intelligent Data Analysis: An Introduction
[7]  
Friess T., 1998, P 15 TH INT C MACHIN
[8]  
Huffman S., 1995, P TREC 4 4 TEXT RETR
[9]  
Joachims T, 1999, ADVANCES IN KERNEL METHODS, P169
[10]  
Joachims T., 1998, Lecture Notes in Computer Science, P137, DOI DOI 10.1007/BFB0026683