Numerical characteristics of word frequencies and their application to dissimilarity measure for sequence comparison

Cited by: 34
Authors
Dai, Qi [1]
Liu, Xiaoqing [2]
Yao, Yuhua [1]
Zhao, Fukun [1]
Affiliations
[1] Zhejiang Sci Tech Univ, Coll Life Sci, Hangzhou 310018, Peoples R China
[2] Hangzhou Dianzi Univ, Sch Sci, Hangzhou 310018, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Word frequency; Word expectation; Word variance; Regulatory sequence; Phylogenetic analysis; 2D GRAPHICAL REPRESENTATION; DNA-SEQUENCES; PROTEIN SEQUENCES; PHYLOGENETIC ANALYSIS; STATISTICAL-METHOD; ALIGNMENT; SIMILARITY; DISTANCE; GENE; INFERENCE;
DOI
10.1016/j.jtbi.2011.02.005
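Chinese Library Classification number
Q [Biological Sciences]
Discipline classification code
090105 [Crop Production Systems and Ecological Engineering]
Abstract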
Sequence comparison is one of the major tasks in bioinformatics, which can be used to study structural and functional conservation, as well as evolutionary relations among the sequences. Numerous dissimilarity measures achieve promising results in sequence comparison, but challenges remain. This paper studied numerical characteristics of word frequencies and proposed a novel dissimilarity measure for sequence comparison. Instead of using the word frequencies directly, the proposed measure considers both the word frequencies and overlapping structures of words. To verify the effectiveness of the proposed measure, we tested it with two experiments and further compared it with alignment-based and alignment-free measures. The results demonstrate that the proposed measure extracting more information on the overlapping structures of the words improves the efficiency of sequence comparison. Crown Copyright (C) 2011 Published by Elsevier Ltd. All rights reserved.
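As background for the abstract, the following is a minimal sketch of the baseline it contrasts against: an alignment-free comparison that uses overlapping word (k-mer) frequencies directly. It is not the measure proposed in the paper, whose formulas are not given in this record. The function names, the DNA alphabet, the default word length k=2, and the choice of Euclidean distance are all illustrative assumptions.

```python
from collections import Counter
from itertools import product
from math import sqrt


def word_frequencies(seq, k, alphabet="ACGT"):
    """Relative frequencies of all overlapping k-words in seq.

    Hypothetical helper for illustration; assumes seq uses only
    characters from `alphabet` (words with other characters are
    counted in the total but not reported).
    """
    total = len(seq) - k + 1  # number of overlapping windows
    if total <= 0:
        raise ValueError("sequence is shorter than the word length")
    counts = Counter(seq[i:i + k] for i in range(total))
    words = ("".join(p) for p in product(alphabet, repeat=k))
    return {w: counts[w] / total for w in words}


def euclidean_dissimilarity(seq_a, seq_b, k=2):
    """Euclidean distance between two k-word frequency vectors.

    A plain frequency-based baseline, NOT the paper's proposed
    measure, which also accounts for word overlap structure.
    """
    fa = word_frequencies(seq_a, k)
    fb = word_frequencies(seq_b, k)
    return sqrt(sum((fa[w] - fb[w]) ** 2 for w in fa))
```

Calling `euclidean_dissimilarity(a, b, k)` on two DNA strings returns 0.0 for sequences with identical k-word composition and larger values as their compositions diverge. Because the windows overlap, neighbouring word counts are statistically dependent, which is the kind of structure the abstract says the proposed measure takes into account.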
Pages: 174-180
Page count: 7