Clustered sequence representation for fast homology search

Cited by: 14
Authors
Cameron, Michael
Bernstein, Yaniv
Williams, Hugh E.
Affiliations
[1] RMIT Univ, Sch Comp Sci & IT, Melbourne, Vic 3001, Australia
[2] Microsoft Corp, Redmond, WA 98052 USA
Keywords
BLAST; clustering; homology search; near-duplicate detection; sequence alignment
DOI
10.1089/cmb.2007.R005
Chinese Library Classification number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
We present a novel approach to managing redundancy in sequence databanks such as GenBank. We store each cluster of near-identical sequences as a representative union-sequence and a set of corresponding edits to that sequence. During search, the query is compared only to the union-sequence representing each cluster; cluster members are reconstructed and aligned only if the union-sequence achieves a sufficiently high score. Using this approach with BLAST results in a 27% reduction in collection size and a corresponding 22% decrease in search time, with no significant change in accuracy. We also describe our clustering method, which uses fingerprinting, an approach that has been successfully applied to collections of text and web documents in information retrieval. Our clustering approach is ten times faster on the GenBank nonredundant protein database than the fastest existing approach, CD-HIT. We have integrated our approach into FSA-BLAST, our new open-source version of BLAST (available from http://www.fsa-blast.org/). As a result, FSA-BLAST is twice as fast as NCBI-BLAST with no significant change in accuracy.
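The two-stage search described in the abstract can be illustrated with a small Python sketch. It is not the FSA-BLAST implementation: the names `Cluster`, `build_cluster`, `reconstruct`, `toy_score`, and `search` are hypothetical, members are assumed to have equal length and differ only by substitutions, the union-sequence is modelled as a set of residues per position, and a count of matching residues over ungapped offsets stands in for real alignment scoring.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    # Assumed representation: one set of residues per position.
    union: list[frozenset[str]]
    # One edit map per member: position -> that member's residue,
    # recorded only where the union holds more than one residue.
    member_edits: list[dict[int, str]]


def build_cluster(members: list[str]) -> Cluster:
    """Collapse equal-length, near-identical sequences into one cluster."""
    length = len(members[0])
    if any(len(m) != length for m in members):
        raise ValueError("this sketch only handles equal-length members")
    union = [frozenset(m[i] for m in members) for i in range(length)]
    ambiguous = [i for i, residues in enumerate(union) if len(residues) > 1]
    edits = [{i: m[i] for i in ambiguous} for m in members]
    return Cluster(union, edits)


def reconstruct(cluster: Cluster, index: int) -> str:
    """Rebuild one member from the union-sequence and its edits."""
    edits = cluster.member_edits[index]
    return "".join(
        edits[i] if i in edits else next(iter(residues))
        for i, residues in enumerate(cluster.union)
    )


def toy_score(query: str, subject: list[frozenset[str]]) -> int:
    """Best number of matching positions over all ungapped offsets.

    A toy stand-in for alignment scoring, not BLAST's statistics.
    """
    best = 0
    for offset in range(-(len(query) - 1), len(subject)):
        matches = sum(
            1
            for i, residue in enumerate(query)
            if 0 <= offset + i < len(subject) and residue in subject[offset + i]
        )
        best = max(best, matches)
    return best


def search(query: str, clusters: list[Cluster], threshold: int) -> list[tuple[str, int]]:
    """Score union-sequences first; expand only clusters that pass."""
    hits = []
    for cluster in clusters:
        if toy_score(query, cluster.union) < threshold:
            continue  # whole cluster skipped, no member is reconstructed
        for index in range(len(cluster.member_edits)):
            member = reconstruct(cluster, index)
            score = toy_score(query, [frozenset(c) for c in member])
            if score >= threshold:
                hits.append((member, score))
    return hits
```

Under this toy scoring, a position in the union-sequence matches whenever it would match in any member, so the union score is never lower than a member's score and skipping a low-scoring cluster cannot discard a member that would have passed the threshold. Whether the same bound holds for a real scoring scheme depends on how the union-sequence is scored, which this sketch does not model.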
Pages: 594-614
Number of pages: 21