Numerical characteristics of word frequencies and their application to dissimilarity measure for sequence comparison

Cited by: 34

Authors:

Dai, Qi ^{[1]}

Liu, Xiaoqing ^{[2]}

Yao, Yuhua ^{[1]}

Zhao, Fukun ^{[1]}

Affiliations:

[1] Zhejiang Sci Tech Univ, Coll Life Sci, Hangzhou 310018, Peoples R China

[2] Hangzhou Dianzi Univ, Sch Sci, Hangzhou 310018, Peoples R China

Source:

JOURNAL OF THEORETICAL BIOLOGY | 2011 / Vol. 276 / Issue 01

Funding:

National Natural Science Foundation of China;

Keywords:

Word frequency; Word expectation; Word variance; Regulatory sequence; Phylogenetic analysis; 2D GRAPHICAL REPRESENTATION; DNA-SEQUENCES; PROTEIN SEQUENCES; PHYLOGENETIC ANALYSIS; STATISTICAL-METHOD; ALIGNMENT; SIMILARITY; DISTANCE; GENE; INFERENCE;

DOI:

10.1016/j.jtbi.2011.02.005

Chinese Library Classification (CLC) number:

Q [Biological Sciences];

Discipline classification code:

090105 [Crop Production Systems and Ecological Engineering];

Abstract:

Sequence comparison is one of the major tasks in bioinformatics; it can be used to study structural and functional conservation, as well as evolutionary relations among sequences. Numerous dissimilarity measures achieve promising results in sequence comparison, but challenges remain. This paper studies the numerical characteristics of word frequencies and proposes a novel dissimilarity measure for sequence comparison. Instead of using the word frequencies directly, the proposed measure considers both the word frequencies and the overlapping structures of words. To verify the effectiveness of the proposed measure, we tested it in two experiments and compared it with alignment-based and alignment-free measures. The results demonstrate that the proposed measure, by extracting more information on the overlapping structures of the words, improves the efficiency of sequence comparison. Crown Copyright (C) 2011 Published by Elsevier Ltd. All rights reserved.
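
For orientation, the baseline that the abstract contrasts against (using word frequencies directly) can be sketched as follows. This is a minimal illustration only, not the measure proposed in the paper: the function names `word_frequencies` and `euclidean_dissimilarity`, the DNA alphabet, the word length `k=2`, and the choice of Euclidean distance are all assumptions made for the example. The paper's measure additionally accounts for the overlapping structures of words, which this sketch does not attempt to reproduce.

```python
from collections import Counter
from itertools import product
from math import sqrt


def word_frequencies(seq, k, alphabet="ACGT"):
    """Relative frequencies of all overlapping k-words in seq.

    Hypothetical helper: words containing symbols outside `alphabet`
    still count toward the total but are not reported.
    """
    n_words = max(len(seq) - k + 1, 1)  # number of overlapping windows
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {
        "".join(w): counts["".join(w)] / n_words
        for w in product(alphabet, repeat=k)
    }


def euclidean_dissimilarity(seq_a, seq_b, k=2):
    """Euclidean distance between two k-word frequency vectors.

    An assumed alignment-free baseline, not the paper's proposed measure.
    """
    fa = word_frequencies(seq_a, k)
    fb = word_frequencies(seq_b, k)
    return sqrt(sum((fa[w] - fb[w]) ** 2 for w in fa))
```

Calling `euclidean_dissimilarity(seq_a, seq_b, k=3)` on two DNA strings returns a non-negative number that is zero when both sequences have identical 3-word frequency profiles.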

Citation

Pages: 174-180

Page count: 7

49 references in total

[1] Gapped BLAST and PSI-BLAST: a new generation of protein database search programs [J].