Machine learning classification procedure for selecting SNPs in genomic selection: application to early mortality in broilers

Cited by: 86

Authors:

Long, N. [1]

Gianola, D.

Rosa, G. J. M.

Weigel, K. A.

Avendano, S.

Affiliations:

[1] Univ Wisconsin, Dept Anim Sci, Madison, WI 53706 USA

[2] Univ Wisconsin, Dept Dairy Sci, Madison, WI 53706 USA

[3] Aviagen Ltd, Newbridge EH28 8SZ, Midlothian, Scotland

Source:

JOURNAL OF ANIMAL BREEDING AND GENETICS | 2007 / Volume 124 / Issue 06

Keywords:

filter-wrapper feature selection; genomic selection; machine learning; mortality; single nucleotide polymorphism;

DOI:

10.1111/j.1439-0388.2007.00694.x

Chinese Library Classification (CLC) number:

S8 [Animal husbandry, veterinary medicine, hunting, sericulture, apiculture];

Discipline classification code:

0905 ;

Abstract:

Genome-wide association studies using single nucleotide polymorphisms (SNPs) can identify genetic variants related to complex traits. Typically thousands of SNPs are genotyped, whereas the number of phenotypes for which there is genomic information may be smaller. When predicting phenotypes, options for statistical model building range from incorporating all possible markers into the specification to including only sets of relevant SNPs (features). In the latter case, an efficient method of selecting influential features is required. A two-step feature selection method for binary traits was developed, which consisted of filtering (using information gain) and wrapping (using naive Bayesian classification). The filter reduces the large number of SNPs to a much smaller size, to facilitate the wrapper step. As the procedure is tailored for discrete outcomes, an approach based on discretization of phenotypic values was developed, to enable feature selection in a classification framework. The method was applied to chick mortality rates (0-14 days of age) on progeny from 201 sires in a commercial broiler line, with the goal of identifying SNPs (over 5000) related to progeny mortality. To mimic a case-control study, sires were clustered into two groups, low and high, according to two arbitrarily chosen mortality rate cut points. By varying these thresholds, 11 different 'case-control' samples were formed, and the SNP selection procedure was applied to each sample. To compare the 11 sets of chosen SNPs, predicted residual sum of squares (PRESS) from a linear model was used. The two-step method improved naive Bayesian classification accuracy over the case without feature selection (from around 50% without feature selection to above 90% with it, in each case-control sample). The best case-control group (63 sires above or below the thresholds) had the smallest PRESS statistic among groups with model p-values below 0.003. The 17 SNPs selected using this group accounted for 31% of the variation in raw mortality rates between sire families.
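
The two-step procedure described in the abstract (an information-gain filter followed by a naive Bayes wrapper) can be illustrated with a minimal sketch. This is not the authors' implementation. The greedy forward search, the leave-one-out accuracy estimate, the Laplace smoothing, the 0/1/2 genotype coding, all function names, and the toy data are assumptions made for illustration only. The abstract does not specify these details.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(genotypes, labels):
    """Entropy of the labels minus their entropy conditional on one SNP."""
    n = len(labels)
    cond = 0.0
    for g in set(genotypes):
        subset = [y for x, y in zip(genotypes, labels) if x == g]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond


def filter_snps(X, y, keep):
    """Filter step: rank SNP columns by information gain, keep the top `keep`."""
    gains = [(information_gain([row[j] for row in X], y), j)
             for j in range(len(X[0]))]
    return [j for _, j in sorted(gains, reverse=True)[:keep]]


def nb_predict(train_X, train_y, row, snps):
    """Naive Bayes over genotype codes 0/1/2 with Laplace smoothing (assumed)."""
    best, best_score = None, -math.inf
    for c in sorted(set(train_y)):
        rows_c = [r for r, t in zip(train_X, train_y) if t == c]
        score = math.log(len(rows_c) / len(train_y))  # log prior
        for j in snps:
            count = sum(1 for r in rows_c if r[j] == row[j])
            score += math.log((count + 1) / (len(rows_c) + 3))
        if score > best_score:
            best, best_score = c, score
    return best


def loo_accuracy(X, y, snps):
    """Leave-one-out classification accuracy using only the SNPs in `snps`."""
    hits = 0
    for i in range(len(X)):
        train_X, train_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        hits += nb_predict(train_X, train_y, X[i], snps) == y[i]
    return hits / len(X)


def wrapper_select(X, y, candidates):
    """Wrapper step: greedy forward search, stopping when accuracy stalls."""
    selected, best_acc = [], 0.0
    while True:
        trials = [(loo_accuracy(X, y, selected + [j]), j)
                  for j in candidates if j not in selected]
        if not trials:
            break
        acc, j = max(trials, key=lambda t: (t[0], -t[1]))
        if acc <= best_acc:
            break
        selected.append(j)
        best_acc = acc
    return selected, best_acc


# Toy data (hypothetical): 60 "sires", 30 SNPs; SNP 3 alone determines the class.
rng = random.Random(0)
X = [[rng.randint(0, 2) for _ in range(30)] for _ in range(60)]
for i, row in enumerate(X):
    row[3] = i % 3
y = [1 if row[3] == 2 else 0 for row in X]  # 1 = "high mortality" group

top = filter_snps(X, y, keep=5)             # filter: 30 SNPs down to 5
selected, acc = wrapper_select(X, y, top)   # wrapper: search within the 5
print(top, selected, acc)
```

The filter is cheap because it scores each SNP independently, which is what makes the more expensive wrapper search feasible on the reduced candidate set. In the study itself the filter would be applied to more than 5000 SNPs on each of the 11 case-control samples.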

Pages: 377-389

Page count: 13
