SSAHA: A fast search method for large DNA databases

被引:692
作者
Ning, ZM [1 ]
Cox, AJ [1 ]
Mullikin, JC [1 ]
机构
[1] Sanger Ctr, Informat Div, Cambridge CB10 1SA, England
关键词
D O I
10.1101/gr.194201
中图分类号
Q5 [生物化学]; Q7 [分子生物学];
学科分类号
071010 ; 081704 ;
摘要
We describe an algorithm, SSAHA (Sequence Search and Alignment by Hashing Algorithm), for performing fast searches on databases containing multiple gigabases of DNA. Sequences in the database are preprocessed by breaking them into consecutive k-tuples of k contiguous bases and then using a hash table to store the position of each occurrence of each k-tuple. Searching for a query sequence in the database is done by obtaining from the hash table the "hits" for each k-tuple in the query sequence and then performing a sort on the results. We discuss the effect of the tuple length k on the search speed, memory usage, and sensitivity of the algorithm and present the results of computational experiments which show that Ss: HA can be three to four orders of magnitude faster than BLAST or FASTA, while requiring less memory than suffix tree methods. The SSAHA algorithm is used for high-throughput single nucleotide polymorphism (SNP) detection and very large scale sequence assembly. Also, it provides Web-based sequence search facilities for Ensembl projects.
引用
收藏
页码:1725 / 1729
页数:5
相关论文
共 16 条
  • [1] Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
    Altschul, SF
    Madden, TL
    Schaffer, AA
    Zhang, JH
    Zhang, Z
    Miller, W
    Lipman, DJ
    [J]. NUCLEIC ACIDS RESEARCH, 1997, 25 (17) : 3389 - 3402
  • [2] ALTSCHUL SF, 1990, J MOL BIOL, V215, P403, DOI 10.1006/jmbi.1990.9999
  • [3] An SNP map of the human genome generated by reduced representation shotgun sequencing
    Altshuler, D
    Pollara, VJ
    Cowles, CR
    Van Etten, WJ
    Baldwin, J
    Linton, L
    Lander, ES
    [J]. NATURE, 2000, 407 (6803) : 513 - 516
  • [4] [Anonymous], 1998, SORTING SEARCHING
  • [5] Tandem repeats finder: a program to analyze DNA sequences
    Benson, G
    [J]. NUCLEIC ACIDS RESEARCH, 1999, 27 (02) : 573 - 580
  • [6] Alignment of whole genomes
    Delcher, AL
    Kasif, S
    Fleischmann, RD
    Peterson, J
    White, O
    Salzberg, SL
    [J]. NUCLEIC ACIDS RESEARCH, 1999, 27 (11) : 2369 - 2376
  • [7] Gusfield D, 1997, ALGORITHMS STRINGS T
  • [8] International Human Genome Sequencing Consortium, NATURE, V409, P860
  • [9] RAPID AND SENSITIVE PROTEIN SIMILARITY SEARCHES
    LIPMAN, DJ
    PEARSON, WR
    [J]. SCIENCE, 1985, 227 (4693) : 1435 - 1441
  • [10] A RAPID algorithm for sequence database comparisons: application to the identification of vector contamination in the EMBL databases
    Miller, C
    Gurd, J
    Brass, A
    [J]. BIOINFORMATICS, 1999, 15 (02) : 111 - 121