Automated SNP detection from a large collection of white spruce expressed sequences: contributing factors and approaches for the categorization of SNPs

Cited by: 55
Authors
Pavy, Nathalie
Parsons, Lee S.
Paule, Charles
MacKay, John
Bousquet, Jean
Affiliations
[1] Univ Laval, ARBOREA, Ste Foy, PQ G1K 7P4, Canada
[2] Univ Laval, Canada Res Chair Forest Genom, Ste Foy, PQ G1K 7P4, Canada
[3] Univ Minnesota, Ctr Computat Genom & Bioinformat, Minneapolis, MN 55455 USA
DOI
10.1186/1471-2164-7-174
Chinese Library Classification (CLC) number
Q81 [Bioengineering (biotechnology)]; Q93 [Microbiology];
Discipline classification code
071005 ; 0836 ; 090102 ; 100705 ;
Abstract
Background: High-throughput genotyping technologies represent a highly efficient way to accelerate genetic mapping and enable association studies. As a first step toward this goal, we aimed to develop a resource of candidate Single Nucleotide Polymorphisms (SNPs) in white spruce (Picea glauca [Moench] Voss), a softwood tree of major economic importance.
Results: A white spruce SNP resource encompassing 12,264 SNPs was constructed from a set of 6,459 contigs derived from Expressed Sequence Tags (ESTs) by using the Bayesian-based statistical software PolyBayes. Several parameters influencing the SNP prediction were analysed, including the a priori expected polymorphism, the probability score (P-SNP), and the contig depth and length. SNP detection in 3' and 5' reads from the same clones revealed a level of inconsistency between overlapping sequences as low as 1%. A subset of 245 predicted SNPs was verified through the independent resequencing of genomic DNA of a genotype also used to prepare cDNA libraries. The validation rate reached a maximum of 85% for SNPs predicted with either P-SNP >= 0.95 or >= 0.99. A total of 9,310 SNPs were detected by using P-SNP >= 0.95 as a criterion. The SNPs were distributed among 3,590 contigs encompassing an array of broad functional categories, with an overall frequency of 1 SNP per 700 nucleotide sites. Experimental and statistical approaches were used to evaluate the proportion of paralogous SNPs, with estimates in the range of 8 to 12%. The 3,789 coding SNPs identified through coding region annotation and ORF prediction were distributed into 39% nonsynonymous and 61% synonymous substitutions. Overall, there were 0.9 SNPs per 1,000 nonsynonymous sites and 5.2 SNPs per 1,000 synonymous sites, for a genome-wide nonsynonymous to synonymous substitution rate ratio (Ka/Ks) of 0.17.
Conclusion: We integrated the SNP data in the ForestTreeDB database along with functional annotations to provide a tool facilitating the choice of candidate genes for mapping purposes or association studies.
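The Ka/Ks figure quoted in the abstract can be checked directly from the two reported SNP densities. The sketch below is only an arithmetic illustration: the function name `ka_ks_ratio` is hypothetical, and it treats the ratio of per-site SNP densities as the Ka/Ks estimate, which is an assumption about how the reported value was obtained.

```python
def ka_ks_ratio(nonsyn_per_1000: float, syn_per_1000: float) -> float:
    """Ratio of nonsynonymous to synonymous SNP densities (both per 1,000 sites)."""
    if syn_per_1000 <= 0:
        raise ValueError("synonymous SNP density must be positive")
    return nonsyn_per_1000 / syn_per_1000


# Densities reported in the abstract: 0.9 and 5.2 SNPs per 1,000 sites.
ratio = ka_ks_ratio(0.9, 5.2)
print(round(ratio, 2))  # 0.17, matching the reported value
```

A ratio well below 1 means that, per available site, far fewer nonsynonymous than synonymous substitutions were observed among the predicted coding SNPs.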
Pages: 14