Filtering redundancies for sequence similarity search programs

Cited by: 1
Authors
Cantalloube, H
Chomilier, J
Chiusa, S
Lonquety, M
Spadoni, JL
Zagury, JF
Affiliations
[1] Univ Paris 06, Lab Mineral Cristallog, CNRS, F-75252 Paris 05, France
[2] Univ Paris 07, Lab Mineral Cristallog, CNRS, F-75252 Paris 05, France
[3] Conservatoire Natl Arts & Metiers, Chaire Bioinformat, F-75003 Paris, France
[4] Off Natl Etud & Rech Aerosp, Dept Electromagnetism & Radar, F-91761 Palaiseau, France
[5] INSERM, Grp Bioinformat Genom & Traitement Pathol Syst Im, EMI0355, F-75006 Paris, France
DOI
10.1080/07391102.2005.10507020
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification codes
071010; 081704
Abstract
Database scanning programs such as BLAST and FASTA are now used by most biologists for the post-genomic processing of DNA or protein sequence information (in particular to retrieve the structure/function of uncharacterized proteins). Unfortunately, their results can be polluted by identical alignments (called redundancies) that arise when the same protein or DNA sequence is present in different entries of the database. This makes the listed alignments difficult to use efficiently. Pretreatment of databases has been proposed to suppress strictly identical entries. However, many identical alignments remain, since redundancies may occur locally for entries corresponding to various fragments of the same sequence, or for entries corresponding to highly homologous sequences that differ by only a few residues, such as orthologous proteins. In the present work, we show that redundant alignments can indeed be numerous even when working with a pretreated non-redundant data bank, reaching as much as 60% of the output results depending on the query and the bank. The accuracy and efficiency of post-genomic work will therefore be greatly increased if these redundancies are removed. To solve this hitherto unaddressed problem, we have developed an algorithm that allows the efficient and safe suppression of all redundancies with no loss of information. This algorithm is based on several filtering steps that we describe here in the context of the Automat similarity search program; such an algorithm should also be added to other similarity search programs (BLAST, FASTA, etc.).
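
The abstract does not spell out the filtering steps, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that two alignments are redundant when they cover the same query span with the same aligned subject residues, and it keeps every originating entry identifier so that no information is lost. The `Hit` record, its fields, and `filter_redundant` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One alignment from a search output (hypothetical, simplified record)."""
    entry_id: str     # database entry the alignment came from
    query_start: int  # 0-based start of the aligned region on the query
    query_end: int    # exclusive end of the aligned region on the query
    subject_seq: str  # aligned subject residues, gaps included
    score: float


def filter_redundant(hits):
    """Collapse alignments that are identical on the query span and subject residues.

    Assumption: "redundant" means the same (query span, aligned subject string).
    Returns a list of (representative_hit, entry_ids) pairs in first-seen order,
    where entry_ids lists every database entry that produced that alignment.
    """
    groups = {}
    for hit in hits:
        key = (hit.query_start, hit.query_end, hit.subject_seq)
        if key in groups:
            # Same alignment seen before: record the extra entry, drop the duplicate.
            groups[key][1].append(hit.entry_id)
        else:
            groups[key] = (hit, [hit.entry_id])
    return list(groups.values())


if __name__ == "__main__":
    hits = [
        Hit("entryA", 0, 5, "MKVLA", 30.0),
        Hit("entryB", 0, 5, "MKVLA", 30.0),  # same alignment, different entry
        Hit("entryC", 0, 5, "MKVIA", 25.0),  # differs by one residue, so it is kept
    ]
    for representative, entry_ids in filter_redundant(hits):
        print(representative.subject_seq, entry_ids)
```

Keying on the aligned residues rather than on the entry identifier is what lets this sketch catch the local redundancies the abstract mentions, such as fragments of one sequence stored under several entries. A stricter or looser redundancy criterion would only change how `key` is built.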
Pages: 487-492
Page count: 6