Sequence clustering strategies improve remote homology recognitions while reducing search times

Cited by: 53

Authors:

Li, WZ [1]

Jaroszewski, L [1]

Godzik, A [1]

Affiliations:

[1] Burnham Inst, Program Bioinformat & Biol Complex, La Jolla, CA 92037 USA

Source:

PROTEIN ENGINEERING | 2002 / Vol. 15 / No. 08

Keywords:

fold recognition; intermediate profile search; sequence clustering

DOI:

10.1093/protein/15.8.643

Chinese Library Classification (CLC) number:

Q5 [Biochemistry]; Q7 [Molecular Biology];

Discipline classification code:

071010 ; 081704 ;

Abstract:

Sequence databases are growing rapidly and cover an increasing share of protein sequence space, but this coverage is uneven because most sequencing efforts have concentrated on a small number of organisms. The resulting granularity of sequence space creates many problems for profile-based sequence comparison programs. In this paper, we suggest several strategies that address these problems while also speeding up searches for homologous proteins and improving the ability of profile methods to recognize distant homologies. One strategy combines database clustering, which removes highly redundant sequences, with a two-step PSI-BLAST (PDB-BLAST), which separates the sequence space used for profile composition from the sequence space used for homology searching. Together, these strategies improve distant homology recognition by more than 100% while using only 10% of the CPU time of a standard PSI-BLAST search. Another method, intermediate profile searches, allows the exploration of additional search directions that are normally dominated by large protein sub-families within very diverse families. All methods are evaluated on a large fold-recognition benchmark.
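
The database clustering step mentioned in the abstract can be illustrated with a minimal sketch of greedy redundancy removal. This is a generic illustration and not the procedure used in the paper: the function names `ungapped_identity` and `greedy_cluster`, the 0.9 default threshold, and the position-by-position identity measure (a toy stand-in for a real alignment-based identity) are all assumptions made for this example.

```python
from typing import Dict, List


def ungapped_identity(a: str, b: str) -> float:
    """Toy identity measure (assumption): fraction of matching positions
    over the shorter sequence, comparing from the first residue with no
    alignment. A real pipeline would use an alignment-based identity."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def greedy_cluster(seqs: Dict[str, str], threshold: float = 0.9) -> Dict[str, List[str]]:
    """Greedy incremental clustering (hypothetical helper).

    Sequences are visited longest first. Each one joins the first existing
    representative it matches at or above `threshold`; otherwise it becomes
    a new representative. Returns {representative: [member names]}.
    """
    clusters: Dict[str, List[str]] = {}
    # Longest first; ties broken by name so the result is deterministic.
    for name in sorted(seqs, key=lambda k: (-len(seqs[k]), k)):
        for rep in clusters:
            if ungapped_identity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters
```

Keeping only the dictionary keys of the result gives a reduced, non-redundant set of representatives; searching that smaller set is what lowers the cost of subsequent profile searches in a scheme of this kind.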

Pages: 643-649

Page count: 7
