Genomic dark matter: the reliability of short read mapping illustrated by the genome mappability score

Cited by: 77
Authors
Lee, Hayan [1 ,2 ]
Schatz, Michael C. [1 ,2 ]
Affiliations
[1] SUNY Stony Brook, Dept Comp Sci, Stony Brook, NY 11794 USA
[2] Cold Spring Harbor Lab, Simons Ctr Quantit Biol, Cold Spring Harbor, NY 11724 USA
Funding
US National Institutes of Health; US National Science Foundation;
Keywords
ULTRAFAST; ALIGNMENT; SEQUENCE;
DOI
10.1093/bioinformatics/bts330
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification codes
071010 ; 081704 ;
Abstract
Motivation: Genome resequencing and short read mapping are two of the primary tools of genomics and are used for many important applications. The current state of the art in mapping uses quality values and mapping quality scores to evaluate the reliability of the mapping. These attributes, however, are assigned to individual reads and do not directly measure the problematic repeats across the genome. Here, we present the Genome Mappability Score (GMS) as a novel measure of the complexity of resequencing a genome. The GMS is a weighted probability that any read could be unambiguously mapped to a given position and thus measures the overall composition of the genome itself.

Results: We have developed the Genome Mappability Analyzer to compute the GMS of every position in a genome. It leverages the parallelism of cloud computing to analyze large genomes, and enabled us to identify the 5-14% of the human, mouse, fly and yeast genomes that are difficult to analyze with short reads. We examined the accuracy of the widely used BWA/SAMtools polymorphism discovery pipeline in the context of the GMS, and found that discovery errors are dominated by false negatives, especially in regions with poor GMS. These errors are fundamental to the mapping process and cannot be overcome by increasing coverage. As such, the GMS should be considered in every resequencing project to pinpoint the 'dark matter' of the genome, including known clinically relevant variations in these regions.
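
The abstract describes the GMS only as a weighted probability and gives no formula. The following is a minimal Python sketch of one plausible reading, not the Genome Mappability Analyzer itself. It assumes that the score at a position is the average, over the reads covering that position, of each read's probability of being correctly mapped, scaled to 0-100. The only fixed convention used is the Phred scaling of mapping quality, where a mapping quality Q corresponds to an error probability of 10^(-Q/10). The function names `mapping_accuracy` and `genome_mappability_score` are hypothetical.

```python
from typing import Sequence


def mapping_accuracy(mapq: float) -> float:
    """Probability that a read is correctly mapped, given a Phred-scaled
    mapping quality (error probability = 10 ** (-mapq / 10))."""
    return 1.0 - 10.0 ** (-mapq / 10.0)


def genome_mappability_score(mapqs: Sequence[float]) -> float:
    """Illustrative score for one genome position (assumed formula).

    `mapqs` holds the mapping qualities of the reads that cover the
    position. The result is the mean mapping accuracy scaled to 0-100;
    a position with no covering reads is scored 0 here by assumption.
    """
    if not mapqs:
        return 0.0
    total = sum(mapping_accuracy(q) for q in mapqs)
    return 100.0 * total / len(mapqs)


if __name__ == "__main__":
    # Reads in a unique region tend to have high mapping quality,
    # reads in a repeat tend to have mapping quality 0.
    print(genome_mappability_score([60, 60, 60, 60]))
    print(genome_mappability_score([0, 0, 0, 0]))
```

Under this assumed definition, adding more reads with mapping quality 0 does not raise the score at a repetitive position, which is consistent with the abstract's statement that these errors cannot be overcome by increasing coverage.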
Pages: 2097-2105
Number of pages: 9