Local decoding of sequences and alignment-free comparison

Cited by: 10
Authors
Didier, Gilles
Laprevotte, Ivan
Pupin, Maude
Henaut, Alain
Affiliations
[1] Inst Math Luminy, UMR 6206, F-13288 Marseille 6, France
[2] Lab Stat & Genome, Evry, France
[3] Lab Informat Fondamentale Lille Batiment, Villeneuve d'Ascq, France
[4] Univ Evry Val Essonne, Evry, France
Keywords
algorithm; HMM decoding; sequences comparison;
DOI
10.1089/cmb.2006.13.1465
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification codes
071010 ; 081704 ;
Abstract
Subword composition plays an important role in many sequence analyses. Here we define and study the "local decoding of order N of sequences," an alternative that avoids some drawbacks of "subwords of length N" approaches while keeping information about environments of length N in the sequences ("decoding" is taken here in the sense of hidden Markov modeling, i.e., associating a state with every position of the sequence). We present an algorithm for computing the local decoding of order N of a given set of sequences. Its complexity is linear in the total length of the set, whatever the order N, in both time and memory. To illustrate one use of local decoding, we propose a very basic dissimilarity measure between sequences that can be computed either from the local decoding of order N or from the composition in subwords of length N. The accuracies of these two dissimilarities are evaluated, over several datasets, by computing their linear correlations with a reference alignment-based distance. These accuracies are also compared with the accuracy obtained from another recent alignment-free comparison.
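The evaluation scheme described in the abstract can be illustrated with a minimal sketch. The record does not give the paper's formula, so the dissimilarity below is an assumption: the sum of absolute differences between the relative frequencies of subwords of length N. The function names (`subword_counts`, `composition_dissimilarity`, `pearson`) are hypothetical, and the sketch covers only the subword-composition variant, not the local decoding itself.

```python
from collections import Counter
from math import sqrt


def subword_counts(seq, n):
    """Count every overlapping subword of length n in seq."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def composition_dissimilarity(a, b, n):
    """Assumed measure: L1 distance between subword frequency profiles.

    This is an illustrative choice, not necessarily the paper's definition.
    """
    ca, cb = subword_counts(a, n), subword_counts(b, n)
    ta, tb = sum(ca.values()), sum(cb.values())
    if ta == 0 or tb == 0:
        raise ValueError("each sequence must contain at least n symbols")
    # Counter returns 0 for subwords absent from one of the sequences.
    return sum(abs(ca[w] / ta - cb[w] / tb) for w in set(ca) | set(cb))


def pearson(xs, ys):
    """Linear (Pearson) correlation between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)
```

To mimic the evaluation, one would compute `composition_dissimilarity` for every pair of sequences in a dataset, collect the matching alignment-based distances from a reference tool, and pass both lists to `pearson`. A correlation close to 1 would indicate that the alignment-free measure tracks the reference distance well.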
Pages: 1465-1476
Number of pages: 12