A Study of the Influence of Two Similarity Calculation Methods on KNN Classification Performance

Cited by: 5

Authors:

黄莉 (Huang Li) [1]

李湘东 (Li Xiangdong) [2]

Affiliations:

[1] Wuhan University Library

[2] School of Information Management, Wuhan University

Source:

情报杂志 (Journal of Intelligence) | 2012, Vol. 31, No. 07

Keywords:

automatic text classification; classification performance; nearest neighbor algorithm; similarity; cosine; Jensen-Shannon divergence

DOI:

Not available

Chinese Library Classification (CLC) number:

TP391.1 [Text information processing]

Discipline classification codes:

081203; 0835

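Abstract:

The k-nearest neighbor (KNN) algorithm is one of the most basic and commonly used algorithms in automatic text classification, and it requires computing the similarity between texts. Taking the Jensen-Shannon divergence as an example, this paper derives and explains its basic principle and then uses it to compute the similarity between texts. For comparison, the conventional cosine method is also used to compute text similarity. In both cases the texts are then classified with KNN, in order to examine how the choice of similarity measure affects the performance of KNN-based automatic text classification. Empirical experiments on several test collections show that, compared with the cosine method, classification based on Jensen-Shannon divergence achieves higher classification accuracy but takes longer to run.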
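The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the whitespace tokenizer, raw term counts, the base-2 Jensen-Shannon divergence converted to a similarity as 1 - JSD, the majority vote, and all function names and the toy `training` data are assumptions made only for illustration.

```python
import math
from collections import Counter

def term_freq(text):
    """Bag-of-words term counts (assumption: lowercase, whitespace tokens)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def js_divergence(a, b):
    """Jensen-Shannon divergence, base 2, so the value lies in [0, 1].

    JSD(P, Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M = (P + Q) / 2,
    where P and Q are the count vectors normalized to term distributions.
    """
    total_a, total_b = sum(a.values()), sum(b.values())
    if total_a == 0 or total_b == 0:
        return 1.0  # assumption: an empty text is treated as maximally distant
    div = 0.0
    for t in a.keys() | b.keys():
        p = a[t] / total_a
        q = b[t] / total_b
        m = 0.5 * (p + q)
        if p > 0:
            div += 0.5 * p * math.log2(p / m)
        if q > 0:
            div += 0.5 * q * math.log2(q / m)
    return div

def js_similarity(a, b):
    """Assumption: turn the divergence into a similarity as 1 - JSD."""
    return 1.0 - js_divergence(a, b)

def knn_classify(query, training, k, similarity):
    """Label `query` by majority vote among its k most similar training texts."""
    q = term_freq(query)
    scored = sorted(
        ((similarity(q, term_freq(text)), label) for text, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy corpus, not taken from the paper's test collections.
training = [
    ("stock market shares rise", "finance"),
    ("bank raises interest rates", "finance"),
    ("team wins football match", "sports"),
    ("player scores goal in match", "sports"),
]

for name, sim in (("cosine", cosine_similarity), ("js", js_similarity)):
    print(name, knn_classify("football team scores goal", training, 3, sim))
```

Passing the similarity function as a parameter keeps the KNN step identical for both measures, so any difference in the predicted labels comes only from the similarity calculation.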

Citation

Pages: 177-181+176

Page count: 6
