Characterizing regions in the human genome unmappable by next-generation-sequencing at the read length of 1000 bases

Cited by: 11
Authors
Li, Wentian [1]
Freudenberg, Jan [1]
Affiliations
[1] North Shore LIJ Hlth Syst, Robert S Boas Ctr Genom & Human Genet, Feinstein Inst Med Res, Manhasset, NY 11030 USA
Funding
National Institutes of Health (USA);
Keywords
SEGMENTAL DUPLICATIONS; HOMOLOGOUS RECOMBINATION; DNA; MAPPABILITY; COMPLEXITY; GENE; ORGANIZATION; ANNOTATION; DATABASE; ALU;
DOI
10.1016/j.compbiolchem.2014.08.015
Chinese Library Classification
Q [Biological Sciences];
Discipline classification codes
07 ; 0710 ; 09 ;
Abstract
Repetitive and redundant regions of a genome are particularly problematic for mapping sequencing reads. In the present paper, we compile a list of the unmappable regions in the human genome based on the following definition: a region is unmappable if hypothetical reads of length 1 kb drawn from it cannot be uniquely mapped under zero-mismatch alignment, considering both the forward and reverse strands. The resulting collection of unmappable regions covers 0.77% of the sequence of human autosomes and 8.25% of the sex chromosomes in the reference genome GRCh37/hg19 (overall 1.23%). Not surprisingly, our unmappable regions overlap greatly with segmental duplications, transposable elements, and structural variants. About 99.8% of bases in our unmappable regions are part of either segmental duplications or transposable elements, and 98.3% overlap structural variant annotations. Notably, some of these regions overlap units with important biological functions, including 4% of protein-coding genes. In contrast, these regions have zero intersection with the ultraconserved elements and very low overlap with microRNAs, tRNAs, pseudogenes, CpG islands, tandem repeats, microsatellites, sensitive non-coding regions, and the mapping blacklist regions from the ENCODE project. © 2014 Elsevier Ltd. All rights reserved.
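The definition in the abstract (fixed-length reads, zero mismatches, both strands) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `reverse_complement`, `unmappable_starts`, and `merge_to_regions` are hypothetical, the toy sequence is invented, and reporting the union of bases covered by non-unique reads as half-open intervals is an assumed convention. A genome-scale run with k = 1000 would need a far more memory-efficient index than the dictionary used here.

```python
from collections import defaultdict

# Translation table for complementing DNA bases (assumes only A, C, G, T).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def unmappable_starts(genome, k):
    """Start positions of length-k reads that do not map uniquely.

    A read is treated as non-unique if its exact sequence, or its
    reverse complement, also occurs at another position (zero mismatches).
    """
    positions = defaultdict(list)
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        # Canonical form: a k-mer and its reverse complement share one key,
        # so matches on either strand are grouped together.
        canonical = min(kmer, reverse_complement(kmer))
        positions[canonical].append(i)
    starts = []
    for pos in positions.values():
        if len(pos) > 1:
            starts.extend(pos)
    return sorted(starts)


def merge_to_regions(starts, k):
    """Merge the intervals [s, s + k) of non-unique reads into regions."""
    regions = []
    for s in starts:
        if regions and s <= regions[-1][1]:
            # Overlapping or adjacent to the previous region: extend it.
            regions[-1][1] = max(regions[-1][1], s + k)
        else:
            regions.append([s, s + k])
    return [tuple(r) for r in regions]


if __name__ == "__main__":
    toy = "ACGGATTACGG"  # "ACGG" occurs at positions 0 and 7
    print(merge_to_regions(unmappable_starts(toy, 4), 4))  # [(0, 4), (7, 11)]
```

Grouping each k-mer with its reverse complement under one canonical key is what makes the check strand-aware: a read that matches elsewhere only on the opposite strand is still flagged.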
Pages: 108-117
Page count: 10