AMASS: A structured pattern matching approach to shotgun sequence assembly

被引:19
作者
Kim, S
Segre, AM
机构
[1] Univ Illinois, Ctr Biotechnol, Urbana, IL 61801 USA
[2] Univ Iowa, Dept Management Sci, Iowa City, IA 52242 USA
关键词
fragment assembly; repeating regions; sequence reconstruction; whole genome;
D O I
10.1089/cmb.1999.6.163
中图分类号
Q5 [生物化学];
学科分类号
071010 [生物化学与分子生物学]; 081704 [应用化学];
摘要
In this paper, we propose an efficient, reliable shotgun sequence assembly algorithm based on a fingerprinting scheme that is robust to both noise and repetitive sequences in the data, two primary roadblocks to effective whole-genome shotgun sequencing. Our algorithm uses exact matches of short patterns randomly selected from fragment data to identify fragment overlaps, construct an overlap map, and deliver a consensus sequence. We show how statistical clues made explicit in our approach can easily be exploited to correctly assemble results even in the presence of extensive repetitive sequences. Our approach is both accurate and exceptionally fast in practice: e.g., we have correctly assembled the whole Mycoplasma genitalium genome (approximately 580 kbp) is roughly 8 minutes of 64MB 200MHz Pentium Pro CPU time from real shotgun data, where most existing algorithms can be expected to run for several hours to a day on the same data. Moreover, experiments with artificially-shotgunned data prepared from real DNA sequences from a wide range of organisms (including human DNA) and containing complex repeating regions demonstrate our algorithm's robustness to input noise and the presence of repetitive sequences. For example, we have correctly assembled a 238-kbp human DNA sequence in less than 3 min of 64-MB 200-MHz Pentium Pro CPU time.
引用
收藏
页码:163 / 186
页数:24
相关论文
共 23 条
[1]
AGARWAL P, 1994, P 2 INT C INT SYST M, P1
[2]
A new DNA sequence assembly program [J].
Bonfield, JK ;
Smith, KF ;
Staden, R .
NUCLEIC ACIDS RESEARCH, 1995, 23 (24) :4992-4999
[3]
ENGLE ML, 1994, COMPUT APPL BIOSCI, V10, P567
[4]
ARTIFICIALLY GENERATED DATA SETS FOR TESTING DNA-SEQUENCE ASSEMBLY ALGORITHMS [J].
ENGLE, ML ;
BURKS, C .
GENOMICS, 1993, 16 (01) :286-288
[5]
WHOLE-GENOME RANDOM SEQUENCING AND ASSEMBLY OF HAEMOPHILUS-INFLUENZAE RD [J].
FLEISCHMANN, RD ;
ADAMS, MD ;
WHITE, O ;
CLAYTON, RA ;
KIRKNESS, EF ;
KERLAVAGE, AR ;
BULT, CJ ;
TOMB, JF ;
DOUGHERTY, BA ;
MERRICK, JM ;
MCKENNEY, K ;
SUTTON, G ;
FITZHUGH, W ;
FIELDS, C ;
GOCAYNE, JD ;
SCOTT, J ;
SHIRLEY, R ;
LIU, LI ;
GLODEK, A ;
KELLEY, JM ;
WEIDMAN, JF ;
PHILLIPS, CA ;
SPRIGGS, T ;
HEDBLOM, E ;
COTTON, MD ;
UTTERBACK, TR ;
HANNA, MC ;
NGUYEN, DT ;
SAUDEK, DM ;
BRANDON, RC ;
FINE, LD ;
FRITCHMAN, JL ;
FUHRMANN, JL ;
GEOGHAGEN, NSM ;
GNEHM, CL ;
MCDONALD, LA ;
SMALL, KV ;
FRASER, CM ;
SMITH, HO ;
VENTER, JC .
SCIENCE, 1995, 269 (5223) :496-512
[6]
FOULSER D, 1990, 812 YAL U DEP COMP S
[7]
THE MINIMAL GENE COMPLEMENT OF MYCOPLASMA-GENITALIUM [J].
FRASER, CM ;
GOCAYNE, JD ;
WHITE, O ;
ADAMS, MD ;
CLAYTON, RA ;
FLEISCHMANN, RD ;
BULT, CJ ;
KERLAVAGE, AR ;
SUTTON, G ;
KELLEY, JM ;
FRITCHMAN, JL ;
WEIDMAN, JF ;
SMALL, KV ;
SANDUSKY, M ;
FUHRMANN, J ;
NGUYEN, D ;
UTTERBACK, TR ;
SAUDEK, DM ;
PHILLIPS, CA ;
MERRICK, JM ;
TOMB, JF ;
DOUGHERTY, BA ;
BOTT, KF ;
HU, PC ;
LUCIER, TS ;
PETERSON, SN ;
SMITH, HO ;
HUTCHISON, CA ;
VENTER, JC .
SCIENCE, 1995, 270 (5235) :397-403
[8]
GINGERAS T, NUCL ACIDS RES, V7, P529
[9]
Against a whole-genome shotgun [J].
Green, P .
GENOME RESEARCH, 1997, 7 (05) :410-417
[10]
Green P., 1996, PHRAP DOCUMENTATION