The similarity metric

被引:583
作者
Li, M [1 ]
Chen, X
Li, X
Ma, B
Vitányi, PMB
机构
[1] Univ Waterloo, Dept Comp Sci, Waterloo, ON N2L 3G1, Canada
[2] Bioinformat Solut Inc, Waterloo, ON, Canada
[3] Univ Calif Santa Barbara, Dept Comp Sci, Santa Barbara, CA 93106 USA
[4] Univ Western Ontario, Dept Comp Sci, London, ON N6A 5B7, Canada
[5] Ctr Math & Comp Sci, CWI, NL-1098 SJ Amsterdam, Netherlands
[6] Univ Amsterdam, Amsterdam, Netherlands
基金
加拿大自然科学与工程研究理事会;
关键词
dissimilarity distance; Kolmogorov complexity; language tree construction; normalized compression distance; normalized information distance; parameter-free data mining; phylogeny in bioinformatics; universal similarity metric;
D O I
10.1109/TIT.2004.838101
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new "normalized information distance," based on the noncomputable notion of Kolmogorov complexity, and show that it is in this class and it minorizes every computable distance in the class (that is, it is universal in that it discovers all computable similarities). We demonstrate that it is a metric and call it the similarity metric. This theory forms the foundation for a new practical tool. To evidence generality and robustness, we give two distinctive applications in widely divergent areas using standard compression programs like gzip and GenCompress. First, we compare whole mitochondrial genomes and infer their evolutionary history. This results in a first completely automatic computed whole mitochondrial phylogeny tree. Secondly, we fully automatically compute the language tree of 52 different languages.
引用
收藏
页码:3250 / 3264
页数:15
相关论文
共 51 条
[1]  
ADACHI J, COMPUT SCI MONOGR I, V28, P1
[2]  
[Anonymous], GENOME INFORM 1999
[3]  
[Anonymous], 1948, Universal Declaration of Human Rights, V1st
[4]  
[Anonymous], 1996, P ROY SOC A-MATH PHY, V452, P769, DOI DOI 10.1098/rspa.1996.0039
[5]  
BALL P, 2002, NATURE JAN
[6]   Language trees and zipping [J].
Benedetto, D ;
Caglioti, E ;
Loreto, V .
PHYSICAL REVIEW LETTERS, 2002, 88 (04) :4
[7]   Chain letters & evolutionary histories [J].
Bennett, CH ;
Li, M ;
Ma, B .
SCIENTIFIC AMERICAN, 2003, 288 (06) :76-81
[8]   Information distance [J].
Bennett, CH ;
Gacs, P ;
Li, M ;
Vitanyi, FMB ;
Zurek, WH .
IEEE TRANSACTIONS ON INFORMATION THEORY, 1998, 44 (04) :1407-1423
[9]  
Berry V, 2000, PROCEEDINGS OF THE ELEVENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, P287
[10]   Big trees from little genomes: mitochondrial gene order as a phylogenetic tool [J].
Boore, JL ;
Brown, WM .
CURRENT OPINION IN GENETICS & DEVELOPMENT, 1998, 8 (06) :668-674