A RAPID algorithm for sequence database comparisons: application to the identification of vector contamination in the EMBL databases

Cited by: 24
Authors
Miller, C [1]
Gurd, J [1]
Brass, A [1]
Affiliations
[1] Univ Manchester, Sch Biol Sci, Manchester M13 9PT, Lancs, England
Keywords
DOI
10.1093/bioinformatics/15.2.111
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Word-matching algorithms such as BLAST are routinely used for sequence comparison. These algorithms typically use areas of matching words to seed alignments, which are then used to assess the degree of sequence similarity. In this paper we show that by formally separating the word-matching and sequence-alignment processes, and using information about word frequencies to generate alignments and similarity scores, we can create a new sequence-comparison algorithm which is both fast and sensitive. The formal split between word searching and alignment allows users to select an appropriate alignment method without affecting the underlying similarity search. The algorithm has been used to develop software for identifying entries in DNA sequence databases which are contaminated with vector sequence.
Results: We present three algorithms, RAPID, PHAT and SPLAT, which together allow vector contaminations to be found and assessed extremely rapidly. RAPID is a word search algorithm which uses probabilities to modify the significance attached to different words; PHAT and SPLAT are alignment algorithms. An initial implementation has been shown to be approximately an order of magnitude faster than BLAST. The formal split between word searching and alignment not only offers considerable gains in performance, but also allows alignment generation to be viewed as a user interface problem, allowing the most useful output method to be selected without affecting the underlying similarity search. Receiver Operator Characteristic (ROC) analysis of an artificial test set allows the optimal score threshold for identifying vector contamination to be determined. ROC curves were also used to determine the optimum word size (nine) for finding vector contamination. An analysis of the entire expressed sequence tag (EST) subset of EMBL found a contamination rate of 0.27%. A more detailed analysis of the 50 000 ESTs in est10.dat (an EST subset of EMBL) finds an error rate of 0.86%, principally due to two large-scale projects.
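The idea of a word search that weights matches by word probability can be illustrated with a small sketch. This is not the RAPID scoring scheme, which the abstract does not specify: the negative-log-frequency weight, the function names and the toy sequences are all assumptions made for illustration. Only the word size of nine comes from the abstract.

```python
import math
from collections import Counter

WORD_SIZE = 9  # the abstract reports nine as the optimum word size


def words(seq, k=WORD_SIZE):
    """Yield every overlapping k-letter word in seq."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def word_weights(background, k=WORD_SIZE):
    """Map each word to -log(relative frequency) in the background set.

    Assumed weighting: rare words score high, common words score low.
    """
    counts = Counter(w for seq in background for w in words(seq, k))
    total = sum(counts.values())
    return {w: -math.log(c / total) for w, c in counts.items()}


def score(query, vector, weights, unseen_weight=0.0, k=WORD_SIZE):
    """Sum the weights of the query words that also occur in the vector."""
    vector_words = set(words(vector, k))
    return sum(
        weights.get(w, unseen_weight)
        for w in words(query, k)
        if w in vector_words
    )


# Hypothetical toy data: a short "vector" and two query sequences.
vector = "GATCCTCTAGAGTCGACCTGCAGGCATGC"
clean = "T" * 20
contaminated = "A" * 10 + vector[:15] + "C" * 10

weights = word_weights([vector, clean, contaminated])
clean_score = score(clean, vector, weights)
contaminated_score = score(contaminated, vector, weights)
```

In a setting like the one the abstract describes, a score threshold separating `clean_score` from `contaminated_score` would be chosen from a ROC analysis on a labelled test set rather than fixed by hand.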
Pages: 111-121
Page count: 11