Machine learning classification procedure for selecting SNPs in genomic selection: application to early mortality in broilers

Cited by: 86
Authors
Long, N. [1]
Gianola, D.
Rosa, G. J. M.
Weigel, K. A.
Avendano, S.
Affiliations
[1] Univ Wisconsin, Dept Anim Sci, Madison, WI 53706 USA
[2] Univ Wisconsin, Dept Dairy Sci, Madison, WI 53706 USA
[3] Aviagen Ltd, Newbridge EH28 8SZ, Midlothian, Scotland
Keywords
filter-wrapper feature selection; genomic selection; machine learning; mortality; single nucleotide polymorphism
DOI
10.1111/j.1439-0388.2007.00694.x
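Chinese Library Classification (CLC) number
S8 [Animal husbandry, veterinary medicine, hunting, sericulture, apiculture]
Discipline classification code
0905
Abstract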
Genome-wide association studies using single nucleotide polymorphisms (SNPs) can identify genetic variants related to complex traits. Typically thousands of SNPs are genotyped, whereas the number of phenotypes for which there is genomic information may be smaller. When predicting phenotypes, options for statistical model building range from incorporating all possible markers into the specification to including only sets of relevant SNPs (features). In the latter case, an efficient method of selecting influential features is required. A two-step feature selection method for binary traits was developed, consisting of filtering (using information gain) and wrapping (using naive Bayesian classification). The filter reduces the large number of SNPs to a much smaller set, which facilitates the wrapper step. As the procedure is tailored to discrete outcomes, an approach based on discretization of phenotypic values was developed to enable feature selection in a classification framework. The method was applied to chick mortality rates (0-14 days of age) in the progeny of 201 sires from a commercial broiler line, with the goal of identifying, among more than 5000 SNPs, those related to progeny mortality. To mimic a case-control study, sires were clustered into two groups, low and high, according to two arbitrarily chosen mortality rate cut points. By varying these thresholds, 11 different 'case-control' samples were formed, and the SNP selection procedure was applied to each sample. To compare the 11 sets of chosen SNPs, the predicted residual sum of squares (PRESS) from a linear model was used. In each case-control sample, the two-step method improved naive Bayesian classification accuracy from around 50% without feature selection to above 90% with it. The best case-control group (63 sires above or below the thresholds) had the smallest PRESS statistic among groups with model p-values below 0.003. The 17 SNPs selected using this group accounted for 31% of the variation in raw mortality rates between sire families.
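The two-step idea in the abstract (an information-gain filter followed by a naive Bayes wrapper) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the search strategy, the validation scheme, or the smoothing used, so greedy forward selection, leave-one-out accuracy, and add-one smoothing are assumptions made here for illustration. The genotype matrix `columns`, the labels `labels`, and all function names are hypothetical toy values, not data from the study.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """H(labels) - H(labels | feature) for one discrete feature (e.g. genotype 0/1/2)."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

def filter_step(columns, labels, keep):
    """Filter: rank SNP columns by information gain, return indices of the top `keep`."""
    ranked = sorted(range(len(columns)),
                    key=lambda j: information_gain(columns[j], labels),
                    reverse=True)
    return ranked[:keep]

def nb_predict(train_cols, train_y, test_row, n_levels=3):
    """Categorical naive Bayes with add-one smoothing (assumed); returns predicted class."""
    n = len(train_y)
    best_class, best_score = None, -math.inf
    for c, n_c in Counter(train_y).items():
        score = math.log(n_c / n)  # log prior
        for col, x in zip(train_cols, test_row):
            match = sum(1 for xi, yi in zip(col, train_y) if yi == c and xi == x)
            score += math.log((match + 1) / (n_c + n_levels))  # smoothed log likelihood
        if score > best_score:
            best_class, best_score = c, score
    return best_class

def loo_accuracy(columns, labels, subset):
    """Leave-one-out accuracy of naive Bayes using only the SNPs listed in `subset`."""
    n = len(labels)
    hits = 0
    for i in range(n):
        train_cols = [[columns[j][k] for k in range(n) if k != i] for j in subset]
        train_y = [labels[k] for k in range(n) if k != i]
        test_row = [columns[j][i] for j in subset]
        hits += nb_predict(train_cols, train_y, test_row) == labels[i]
    return hits / n

def wrapper_step(columns, labels, candidates):
    """Wrapper: greedy forward selection (assumed); stop when accuracy no longer improves."""
    selected, best = [], 0.0
    while True:
        scored = [(loo_accuracy(columns, labels, selected + [j]), j)
                  for j in candidates if j not in selected]
        if not scored:
            break
        acc, j = max(scored, key=lambda t: t[0])
        if acc <= best:
            break
        selected.append(j)
        best = acc
    return selected, best

# Hypothetical toy data: 8 sires labelled low (0) or high (1) mortality, 4 SNPs coded 0/1/2.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 0, 2, 2, 2, 2],  # SNP 0: separates the two groups perfectly
    [0, 1, 2, 0, 1, 2, 0, 1],  # SNP 1: close to noise
    [0, 0, 1, 1, 1, 1, 2, 2],  # SNP 2: partly informative
    [1, 1, 1, 1, 1, 1, 1, 1],  # SNP 3: monomorphic, carries no information
]

candidates = filter_step(columns, labels, keep=2)
selected, accuracy = wrapper_step(columns, labels, candidates)
print(candidates, selected, accuracy)
```

In this sketch the filter is cheap because it scores each SNP once, while the wrapper refits the classifier for every candidate subset; that cost difference is why the filter is run first to shrink the candidate set.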
Pages: 377-389
Number of pages: 13