Fast algorithms for computing sequence distances by exhaustive substring composition

被引:19
作者
Apostolico, Alberto [1 ,2 ,3 ]
Denas, Olgert [3 ]
机构
[1] Acad Nazl Lincei, Rome, Italy
[2] Univ Padua, Dept Informat Engn, Padua, Italy
[3] Georgia Inst Technol, Coll Comp, Atlanta, GA 30332 USA
关键词
D O I
10.1186/1748-7188-3-13
中图分类号
Q5 [生物化学];
学科分类号
071010 [生物化学与分子生物学]; 081704 [应用化学];
摘要
The increasing throughput of sequencing raises growing needs for methods of sequence analysis and comparison on a genomic scale, notably, in connection with phylogenetic tree reconstruction. Such needs are hardly fulfilled by the more traditional measures of sequence similarity and distance, like string edit and gene rearrangement, due to a mixture of epistemological and computational problems. Alternative measures, based on the subword composition of sequences, have emerged in recent years and proved to be both fast and effective in a variety of tested cases. The common denominator of such measures is an underlying information theoretic notion of relative compressibility. Their viability depends critically on computational cost. The present paper describes as a paradigm the extension and efficient implementation of one of the methods in this class. The method is based on the comparison of the frequencies of all subwords in the two input sequences, where frequencies are suitably adjusted to take into account the statistical background.
引用
收藏
页数:9
相关论文
共 21 条
[1]
[Anonymous], 1972, INFORM THEORY LIVING
[2]
Apostolico A., 1985, NATO ASI Series, V12, P85, DOI [DOI 10.1007/978-3-642-82456-2_6, 10.1007/978-3-642-82456-26, DOI 10.1007/978-3-642-82456-26]
[4]
Brillouin L, 1971, SCI INFORM THEORY
[5]
Cover TM, 2006, Elements of Information Theory
[6]
Local homology recognition and distance measures in linear time using compressed amino acid alphabets [J].
Edgar, RC .
NUCLEIC ACIDS RESEARCH, 2004, 32 (01) :380-385
[7]
Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment [J].
Ferragina, Paolo ;
Giancarlo, Raffaele ;
Greco, Valentina ;
Manzini, Giovanni ;
Valiente, Gabriel .
BMC BIOINFORMATICS, 2007, 8 (1)
[8]
Hao Bailin, 2004, J Bioinform Comput Biol, V2, P1
[9]
HELDEN JV, 2004, BIOINFORMATICS, V20, P399
[10]
Höhl M, 2006, EVOL BIOINFORM, V2, P359