Detecting polymorphic regions in Arabidopsis thaliana with resequencing microarrays

Cited by: 38
Authors
Zeller, Georg [1, 2]
Clark, Richard M. [2]
Schneeberger, Korbinian [2]
Bohlen, Anja [1]
Weigel, Detlef [2]
Raetsch, Gunnar [1]
Affiliations
[1] Max Planck Society, Friedrich Miescher Laboratory, D-72070 Tübingen, Germany
[2] Max Planck Institute for Developmental Biology, Department of Molecular Biology, D-72070 Tübingen, Germany
DOI
10.1101/gr.070169.107
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Subject classification code
071010; 081704
Abstract
Whole-genome oligonucleotide resequencing arrays have allowed the comprehensive discovery of single nucleotide polymorphisms (SNPs) in eukaryotic genomes of moderate to large size. With this technology, the detection rate for isolated SNPs is typically high. However, it is greatly reduced when other polymorphisms are located near a SNP as multiple mismatches inhibit hybridization to arrayed oligonucleotides. Contiguous tracts of suppressed hybridization therefore typify polymorphic regions (PRs) such as clusters of SNPs or deletions. We developed a machine learning method, designated margin-based prediction of polymorphic regions (mPPR), to predict PRs from resequencing array data. Conceptually similar to hidden Markov models, the method is trained with discriminative learning techniques related to support vector machines, and accurately identifies even very short polymorphic tracts (<10 bp). We applied this method to resequencing array data previously generated for the euchromatic genomes of 20 strains (accessions) of the best-characterized plant, Arabidopsis thaliana. Nonredundantly, 27% of the genome was included within the boundaries of PRs predicted at high specificity (≈97%). The resulting data set provides a fine-scale view of polymorphic sequences in A. thaliana; patterns of polymorphism not apparent in SNP data were readily detected, especially for noncoding regions. Our predictions provide a valuable resource for evolutionary genetic and functional studies in A. thaliana, and our method is applicable to similar data sets in other species. More broadly, our computational approach can be applied to other segmentation tasks related to the analysis of genomic variation.
Pages: 918-929
Number of pages: 12