SEED: efficient clustering of next-generation sequences

Cited by: 43
Authors
Bao, Ergude [2 ]
Jiang, Tao [2 ]
Kaloshian, Isgouhi [3 ]
Girke, Thomas [1 ]
Affiliations
[1] Univ Calif Riverside, Dept Bot & Plant Sci, Riverside, CA 92521 USA
[2] Univ Calif Riverside, Dept Comp Sci & Engn, Riverside, CA 92521 USA
[3] Univ Calif Riverside, Dept Nematol, Riverside, CA 92521 USA
Funding
National Science Foundation (USA);
Keywords
GENOME; PROGRAM; PROTEIN; SEARCH; FORMAT; FASTER; RNAS; TOOL;
DOI
10.1093/bioinformatics/btr447
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification codes
071010 ; 081704 ;
Abstract
Motivation: Similarity clustering of next-generation sequences (NGS) is an important computational problem, both for studying the population sizes of DNA/RNA molecules and for reducing redundancy in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads.
Results: Here, we introduce SEED, an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in < 4 h with linear time and memory performance. When SEED was used as a preprocessing tool on genome/transcriptome assembly data, it reduced the time and memory requirements of the Velvet/Oasis assembler for the datasets used in this study by 60-85% and 21-41%, respectively. In addition, the assemblies contained longer contigs than those from non-preprocessed data, as indicated by 12-27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results, with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms.
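To make the similarity criterion in the abstract concrete, the following is a minimal sketch, not SEED's implementation. It assumes that "overhanging residues" means read positions left outside an ungapped overlap with the center, and it replaces the block spaced seed hash tables with a naive comparison against every center, so it runs in quadratic rather than linear time. The names `fits_center` and `greedy_cluster` are hypothetical, and using the first unassigned read as a center is a simplification of the virtual center step.

```python
def fits_center(read, center, max_mismatches=3, max_overhang=3):
    """Return True if `read` matches `center` without gaps at some shift,
    with at most `max_mismatches` mismatches in the overlap and at most
    `max_overhang` read residues falling outside the overlap (assumed
    interpretation of the abstract's similarity parameters)."""
    for shift in range(-max_overhang, max_overhang + 1):
        # Read position i is compared with center position i + shift.
        start = max(0, -shift)
        end = min(len(read), len(center) - shift)
        if end <= start:
            continue  # no overlap at this shift
        # Read residues hanging over either end of the center.
        overhang = start + (len(read) - end)
        if overhang > max_overhang:
            continue
        mismatches = sum(read[i] != center[i + shift] for i in range(start, end))
        if mismatches <= max_mismatches:
            return True
    return False


def greedy_cluster(reads):
    """Assign each read to the first center it fits; otherwise the read
    becomes the center of a new cluster. Naive O(n * clusters) stand-in
    for the hash-table lookup described in the abstract."""
    clusters = []  # list of (center, members) pairs
    for read in reads:
        for center, members in clusters:
            if fits_center(read, center):
                members.append(read)
                break
        else:
            clusters.append((read, [read]))
    return clusters


reads = [
    "ACGTACGTACGT",  # becomes the first center
    "ACGTACGTACGA",  # one mismatch against the first center
    "CGTACGTACGTA",  # first center shifted by one, one overhanging residue
    "TTTTTTTTTTTT",  # unrelated read, starts its own cluster
]
clusters = greedy_cluster(reads)
```

A hash-based seed index, as described in the abstract, would avoid comparing each read against every center; the sketch only illustrates what it means for a read to fall within the stated mismatch and overhang limits.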
Pages: 2502-2509
Page count: 8