Tolerating some redundancy significantly speeds up clustering of large protein databases

Cited by: 415
Authors
Li, WZ [1]
Jaroszewski, L [1]
Godzik, A [1]
Affiliations
[1] Burnham Inst, La Jolla, CA 92037 USA
Keywords
DOI
10.1093/bioinformatics/18.1.77
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Subject classification codes
071010 ; 081704 ;
Abstract
Motivation: Sequence clustering replaces groups of similar sequences in a database with single representatives. Clustering large protein databases such as the NCBI Non-Redundant database (NR) is very time-consuming even with the best currently available clustering algorithms, and is practical only at relatively high sequence identity thresholds. Our previous program, CD-HI, clustered NR at 90% identity in ~1 h and at 75% identity in ~1 day on a 1 GHz Linux PC (Li et al., Bioinformatics, 17, 282, 2001); however, even faster clustering is needed because protein databases are growing rapidly and many applications need lower attainable thresholds.
Results: Our previous algorithm (CD-HI) employed short-word filters to speed up the clustering. In this paper, we show that tolerating some redundancy makes for more efficient use of these short-word filters and increases the program's speed 100-fold. Our new program implements this technique and clusters NR at 70% identity within 2 h, and at 50% identity in ~5 days. Although some redundancy is present after clustering, the new program's results differ from those of our previous program by less than 0.4%.
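The abstract mentions short-word filters without describing them. As a rough illustration only, the Python sketch below shows one common form of such a filter: before running an expensive alignment, count the k-letter words two sequences share and reject the pair if the count is below what the identity threshold would require. The function names, the default word length k=2, and the gapless lower bound used here are assumptions made for this example; they are not taken from this record and should not be read as the CD-HI implementation.

```python
from collections import Counter


def word_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-letter word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def min_shared_words(length: int, identity_pct: int, k: int) -> int:
    """Lower bound on shared k-words at the given percent identity.

    Assumption: a gapless comparison over `length` residues, where each
    mismatch can destroy at most k overlapping words.
    """
    mismatches = length * (100 - identity_pct) // 100
    return max(0, (length - k + 1) - k * mismatches)


def shared_words(a: Counter, b: Counter) -> int:
    """Size of the multiset intersection of two word-count tables."""
    return sum((a & b).values())


def passes_filter(query: str, rep: str, identity_pct: int, k: int = 2) -> bool:
    """Return False when the pair cannot reach the threshold under the bound,
    so the expensive alignment can be skipped."""
    need = min_shared_words(len(query), identity_pct, k)
    have = shared_words(word_counts(query, k), word_counts(rep, k))
    return have >= need
```

A pair that passes the filter would still need a full alignment to confirm its identity; the filter only rules out pairs cheaply.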
Pages: 77-82
Page count: 6
Related papers
7 in total
[1]   Gapped BLAST and PSI-BLAST: a new generation of protein database search programs [J].
Altschul, SF ;
Madden, TL ;
Schaffer, AA ;
Zhang, JH ;
Zhang, Z ;
Miller, W ;
Lipman, DJ .
NUCLEIC ACIDS RESEARCH, 1997, 25 (17) :3389-3402
[2]   The Pfam protein families database [J].
Bateman, A ;
Birney, E ;
Durbin, R ;
Eddy, SR ;
Howe, KL ;
Sonnhammer, ELL .
NUCLEIC ACIDS RESEARCH, 2000, 28 (01) :263-266
[3]   The Protein Data Bank [J].
Berman, HM ;
Westbrook, J ;
Feng, Z ;
Gilliland, G ;
Bhat, TN ;
Weissig, H ;
Shindyalov, IN ;
Bourne, PE .
NUCLEIC ACIDS RESEARCH, 2000, 28 (01) :235-242
[4]  
HOBOHM U, 1992, PROTEIN SCI, V1, P409
[5]   Removing near-neighbour redundancy from large protein sequence collections [J].
Holm, L ;
Sander, C .
BIOINFORMATICS, 1998, 14 (05) :423-429
[6]   Clustering of highly homologous sequences to reduce the size of large protein databases [J].
Li, WZ ;
Jaroszewski, L ;
Godzik, A .
BIOINFORMATICS, 2001, 17 (03) :282-283
[7]   RSDB: representative protein sequence databases have high information content [J].
Park, J ;
Holm, L ;
Heger, A ;
Chothia, C .
BIOINFORMATICS, 2000, 16 (05) :458-464