A novel filtration method in biological sequence databases

被引:7
作者
Lee, Anthony J. T. [1 ]
Lin, Chao-Wen [1 ]
Lo, Wen-Hsing [1 ]
Chen, Chieh-Chun [1 ]
Chen, Jia-Xin [1 ]
机构
[1] Natl Taiwan Univ, Dept Informat Management, Taipei 10617, Taiwan
关键词
bioinformatics; sequence comparison; filtration method;
D O I
10.1016/j.patrec.2006.08.015
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
In this paper, we propose a new filtration method, called Transformation-based Database Filtration method (TDF), to screen out those data sequences of a DNA sequence database which cannot satisfy a given query sequence. Our proposed method consists of two phases. First, we divide each data sequence into several windows (blocks), each of which is transformed into a data feature vector using the Haar wavelet transform. The transformed data feature vectors are then stored in an index file. Second, we divide a query sequence into sliding windows, each of which is, again, transformed into a query feature vector using the Haar wavelet transform. We then search the index file to find the candidate sequences for each query feature vector and check if they match the query sequence using the sequence alignment algorithm. We transform the bound of edit distance between sequences to the bound of Manhattan distance between feature vectors. Since the Manhattan distance is much easier to compute, our proposed method can efficiently screen out impossible data sequences and guarantee no false negatives. The experimental results show that our proposed method outperforms the QUASAR method in terms of filtration ratio, precision, execution time and index size. The proposed method also outperforms the YM method for long query, low complexity and repetitive data. (c) 2006 Elsevier B.V. All rights reserved.
引用
收藏
页码:447 / 458
页数:12
相关论文
共 17 条
[1]  
AGILI A, 2003, P 3 IEEE INT S BIOIN, P149
[2]   Gapped BLAST and PSI-BLAST: a new generation of protein database search programs [J].
Altschul, SF ;
Madden, TL ;
Schaffer, AA ;
Zhang, JH ;
Zhang, Z ;
Miller, W ;
Lipman, DJ .
NUCLEIC ACIDS RESEARCH, 1997, 25 (17) :3389-3402
[3]   BASIC LOCAL ALIGNMENT SEARCH TOOL [J].
ALTSCHUL, SF ;
GISH, W ;
MILLER, W ;
MYERS, EW ;
LIPMAN, DJ .
JOURNAL OF MOLECULAR BIOLOGY, 1990, 215 (03) :403-410
[4]  
[Anonymous], [No title captured], DOI DOI 10.1145/299432.299460
[5]  
[Anonymous], P ACM SIG MOD INT C
[6]   SST: an algorithm for finding near-exact sequence matches in time proportional to the logarithm of the database size [J].
Giladi, E ;
Walker, MG ;
Wang, JZ ;
Volkmuth, W .
BIOINFORMATICS, 2002, 18 (06) :873-879
[7]  
Krishnan Arun, 2004, In Silico Biology, V4, P133
[8]   RAPID AND SENSITIVE PROTEIN SIMILARITY SEARCHES [J].
LIPMAN, DJ ;
PEARSON, WR .
SCIENCE, 1985, 227 (4693) :1435-1441
[9]   PatternHunter: faster and more sensitive homology search [J].
Ma, B ;
Tromp, J ;
Li, M .
BIOINFORMATICS, 2002, 18 (03) :440-445
[10]  
MEEK C, 2003, P 29 VLDB C