Relating amino acid sequence to phenotype: Analysis of peptide-binding data

Cited by: 47
Authors
Segal, MR [1]
Cummings, MP
Hubbard, AE
Affiliations
[1] Univ Calif San Francisco, Div Biostat, San Francisco, CA 94143 USA
[2] Josephine Bay Paul Ctr Comparat Mol Biol & Evolut, Marine Biol Lab, Woods Hole, MA 02543 USA
[3] Univ Calif Berkeley, Div Biostat, Berkeley, CA 94720 USA
Keywords
bump hunting; classification trees; prediction rules; unordered categorical covariates
DOI
10.1111/j.0006-341X.2001.00632.x
Chinese Library Classification number
Q [Biological sciences]
Discipline classification codes
07; 0710; 09
Abstract
We illustrate data analytic concerns that arise in the context of relating genotype, as represented by amino acid sequence, to phenotypes (outcomes). The present application examines whether peptides that bind to a particular major histocompatibility complex (MHC) class I molecule have characteristic amino acid sequences. However, the concerns identified and addressed are considerably more general. It is recognized that simple rules for predicting binding based solely on preferences for specific amino acids in certain (anchor) positions of the peptide's amino acid sequence are generally inadequate and that binding is potentially influenced by all sequence positions as well as between-position interactions. The desire to elucidate these more complex prediction rules has spawned various modeling attempts, the shortcomings of which provide motivation for the methods adopted here. Because of (i) this need to model between-position interactions, (ii) amino acids constituting a highly (20) multilevel unordered categorical covariate, and (iii) there frequently being numerous such covariates (i.e., positions) comprising the sequence, standard regression/classification techniques are problematic due to the proliferation of indicator variables required for encoding the sequence position covariates and attendant interactions. These difficulties have led to analyses based on (continuous) properties (e.g., molecular weights) of the amino acids. However, there is potential information loss in such an approach if the properties used are incomplete and/or do not capture the mechanism underlying association with the phenotype. Here we demonstrate that handling unordered categorical covariates with numerous levels and accompanying interactions can be done effectively using classification trees and recently devised bump-hunting methods. 
We further tackle the question of whether observed associations are attributable to amino acid properties as well as addressing the assessment and implications of between-position covariation.
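The abstract's point about the proliferation of indicator variables can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the function names are hypothetical, and the peptide length of 9 positions is an assumption chosen for the example (the abstract does not state a length). It counts the dummy variables needed for main effects and all pairwise between-position interactions under reference-level coding, and the number of distinct binary splits a classification tree could in principle consider for a single unordered covariate with 20 levels.

```python
from math import comb

N_AMINO_ACIDS = 20  # number of levels per position, as stated in the abstract


def indicator_counts(n_positions, n_levels=N_AMINO_ACIDS):
    """Count indicator (dummy) variables under reference-level coding.

    Returns (main_effects, pairwise_interactions).
    """
    per_position = n_levels - 1  # one reference level dropped per position
    main_effects = n_positions * per_position
    # Each pair of positions contributes a product of their indicator sets.
    pairwise = comb(n_positions, 2) * per_position ** 2
    return main_effects, pairwise


def binary_split_count(n_levels=N_AMINO_ACIDS):
    """Number of ways to split an unordered factor's levels into two non-empty groups."""
    return 2 ** (n_levels - 1) - 1


if __name__ == "__main__":
    # ASSUMPTION: 9 sequence positions, used purely for illustration.
    main_effects, pairwise = indicator_counts(n_positions=9)
    print("main-effect indicators:", main_effects)
    print("pairwise-interaction indicators:", pairwise)
    print("candidate binary splits per position:", binary_split_count())
```

Under these assumptions the interaction terms alone number in the thousands, which is the difficulty the abstract attributes to standard regression and classification techniques.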
Pages: 632-642
Page count: 11