Data-mining discovery of pattern and process in ecological systems

Cited by: 109
Authors
Hochachka, Wesley M. [1]
Caruana, Rich
Fink, Daniel
Munson, Art
Riedewald, Mirek
Sorokina, Darla
Kelling, Steve
Affiliations
[1] Cornell Univ, Ornithol Lab, Ithaca, NY 14850 USA
[2] Cornell Univ, Dept Comp Sci, Ithaca, NY 14853 USA
Keywords
bagging; data mining; decision trees; exploratory data analysis; hypothesis generation; machine learning; prediction;
DOI
10.2193/2006-503
Chinese Library Classification number
Q14 [Ecology (Bioecology)];
Discipline classification code
071012 ; 0713 ;
Abstract
Most ecologists use statistical methods as their main analytical tools when analyzing data to identify relationships between a response and a set of predictors; thus, they treat all analyses as hypothesis tests or exercises in parameter estimation. However, little or no prior knowledge about a system can lead to creation of a statistical model or models that do not accurately describe major sources of variation in the response variable. We suggest that under such circumstances data mining is more appropriate for analysis. In this paper we 1) present the distinctions between data-mining (usually exploratory) analyses and parametric statistical (confirmatory) analyses, 2) illustrate 3 strengths of data-mining tools for generating hypotheses from data, and 3) suggest useful ways in which data mining and statistical analyses can be integrated into a thorough analysis of data to facilitate rapid creation of accurate models and to guide further research.
Pages: 2427-2437
Number of pages: 11