Nearest neighbors by neighborhood counting

Cited by: 112
Author
Wang, H [1 ]
Affiliations
[1] Univ Ulster, Sch Comp & Math, Fac Engn, Jordanstown BT37 0QB, North Ireland
[2] Univ Metz, LITA, F-57045 Metz, France
Keywords
pattern recognition; machine learning; nearest neighbors; distance; similarity; neighborhood counting measure;
DOI
10.1109/TPAMI.2006.126
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Finding nearest neighbors is a general idea that underlies many artificial intelligence tasks, including machine learning, data mining, natural language understanding, and information retrieval. This idea is explicitly used in the k-nearest neighbors algorithm (kNN), a popular classification method. In this paper, this idea is adopted in the development of a general methodology, neighborhood counting, for devising similarity functions. We turn our focus from neighbors to neighborhoods, i.e., regions of the data space that cover the data point in question. To measure the similarity between two data points, we consider all neighborhoods that cover both data points and propose to use the number of such neighborhoods as a measure of similarity. Neighborhoods can be defined for different types of data in different ways. Here, we consider one definition of neighborhood for multivariate data and derive a formula for such similarity, called the neighborhood counting measure, or NCM. NCM was tested experimentally in the framework of kNN. Experiments show that NCM is generally comparable to VDM and its variants, the state-of-the-art distance functions for multivariate data, and is at the same time consistently better for relatively large k values. Additionally, NCM consistently outperforms HEOM (a mixture of Euclidean and Hamming distances), the "standard" and most widely used distance function for multivariate data. NCM has a computational complexity of the same order as the standard Euclidean distance function, is task independent, and works for numerical and categorical data in a conceptually uniform way. The neighborhood counting methodology is thus shown experimentally to be sound for multivariate data. We hope it will work for other types of data as well.
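The counting idea described in the abstract can be illustrated with a small sketch. This is not the formula derived in the paper (which the abstract does not reproduce); it is a hedged, hypothetical instantiation assuming ordinal attributes take ranks 1..N with intervals as neighborhoods, and categorical attributes have value subsets as neighborhoods. The function name `ncm` and the domain encoding are illustrative assumptions only.

```python
def ncm(x, y, domains):
    """Illustrative neighborhood-counting similarity (not the paper's exact formula).

    For each attribute, count the neighborhoods that cover both values,
    then multiply the per-attribute counts:
      - ordinal attribute with ranks 1..N: neighborhoods are intervals [a, b];
        intervals covering both values x <= y number x * (N - y + 1)
      - categorical attribute with domain size n: neighborhoods are subsets;
        subsets containing one fixed value number 2**(n-1), and subsets
        containing two distinct fixed values number 2**(n-2)
    """
    total = 1
    for xi, yi, (kind, size) in zip(x, y, domains):
        if kind == "ordinal":
            lo, hi = min(xi, yi), max(xi, yi)
            # an interval [a, b] covers both points iff a <= lo and b >= hi
            count = lo * (size - hi + 1)
        else:  # categorical
            if xi == yi:
                count = 2 ** (size - 1)  # subsets containing the shared value
            else:
                count = 2 ** (size - 2)  # subsets containing both distinct values
        total *= count
    return total

# One ordinal attribute with 5 ranks and one categorical attribute with 3 values:
# ordinal part: 2 * (5 - 3 + 1) = 6 covering intervals; categorical part (equal
# values): 2**2 = 4 covering subsets; product = 24.
print(ncm((2, "a"), (3, "a"), [("ordinal", 5), ("categorical", 3)]))  # 24
```

Note that identical points receive the maximum count for each attribute, so the measure behaves as a similarity (larger means more alike), which is how a kNN classifier would consume it.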
Pages: 942-953
Page count: 12