Fast sequence clustering using a suffix array algorithm

被引:28
作者
Malde, K [1 ]
Coward, E [1 ]
Jonassen, I [1 ]
机构
[1] Univ Bergen, Dept Informat, N-5020 Bergen, Norway
关键词
D O I
10.1093/bioinformatics/btg138
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Motivation: Efficient clustering is important for handling the large amount of available EST sequences. Most contemporary methods are based on some kind of all-against-all comparison, resulting in a quadratic time complexity. A different approach is needed to keep up with the rapid growth of EST data. Results: A new, fast EST clustering algorithm is presented. Sub-quadratic time complexity is achieved by using an algorithm based on suffix arrays. A prototype implementation has been developed and run on a benchmark data set. The produced clusterings are validated by comparing them to clusterings produced by other methods, and the results are quite promising.
引用
收藏
页码:1221 / 1226
页数:6
相关论文
共 13 条
[1]  
ALTSCHUL SF, 1990, J MOL BIOL, V215, P403, DOI 10.1006/jmbi.1990.9999
[2]   ESTABLISHING A HUMAN TRANSCRIPT MAP [J].
BOGUSKI, MS ;
SCHULER, GD .
NATURE GENETICS, 1995, 10 (04) :369-371
[3]   d2_cluster: A validated method for clustering EST and full-length cDNA sequences [J].
Burke, J ;
Davison, D ;
Hide, W .
GENOME RESEARCH, 1999, 9 (11) :1135-1142
[4]   A comparison of imperative and purely functional suffix tree constructions [J].
Giegerich, R ;
Kurtz, S .
SCIENCE OF COMPUTER PROGRAMMING, 1995, 25 (2-3) :187-218
[5]   GeneNest: automated generation and visualization of gene indices [J].
Haas, SA ;
Beissbarth, T ;
Rivals, E ;
Krause, A ;
Vingron, M .
TRENDS IN GENETICS, 2000, 16 (11) :521-523
[6]   Removing near-neighbour redundancy from large protein sequence collections [J].
Holm, L ;
Sander, C .
BIOINFORMATICS, 1998, 14 (05) :423-429
[7]   An optimized protocol for analysis of EST sequences [J].
Liang, F ;
Holt, I ;
Pertea, G ;
Karamycheva, S ;
Salzberg, SL ;
Quackenbush, J .
NUCLEIC ACIDS RESEARCH, 2000, 28 (18) :3657-3665
[8]   SUFFIX ARRAYS - A NEW METHOD FOR ONLINE STRING SEARCHES [J].
MANBER, U ;
MYERS, G .
SIAM JOURNAL ON COMPUTING, 1993, 22 (05) :935-948
[9]   SPACE-ECONOMICAL SUFFIX TREE CONSTRUCTION ALGORITHM [J].
MCCREIGHT, EM .
JOURNAL OF THE ACM, 1976, 23 (02) :262-272
[10]   IMPROVED TOOLS FOR BIOLOGICAL SEQUENCE COMPARISON [J].
PEARSON, WR ;
LIPMAN, DJ .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 1988, 85 (08) :2444-2448