TREELETS - AN ADAPTIVE MULTI-SCALE BASIS FOR SPARSE UNORDERED DATA

Cited by: 85
Authors
Lee, Ann B. [1]
Nadler, Boaz [2]
Wasserman, Larry [1]
Affiliations
[1] Carnegie Mellon Univ, Dept Stat, Pittsburgh, PA 15213 USA
[2] Weizmann Inst Sci, Dept Comp Sci & Appl Math, IL-76100 Rehovot, Israel
Keywords
Feature selection; dimensionality reduction; multi-resolution analysis; local best basis; sparsity; principal component analysis; hierarchical clustering; small sample sizes
DOI
10.1214/07-AOAS137
Chinese Library Classification (CLC) number
O21 [Probability and Mathematical Statistics]; C8 [Statistics]
Subject classification code
020208 ; 070103 ; 0714 ;
Abstract
In many modern applications, including analysis of gene expression and text documents, the data are noisy, high-dimensional, and unordered, with no particular meaning to the given order of the variables. Yet, successful learning is often possible due to sparsity: the fact that the data are typically redundant with underlying structures that can be represented by only a few features. In this paper we present treelets, a novel construction of multi-scale bases that extends wavelets to nonsmooth signals. The method is fully adaptive, as it returns a hierarchical tree and an orthonormal basis which both reflect the internal structure of the data. Treelets are especially well-suited as a dimensionality reduction and feature selection tool prior to regression and classification, in situations where sample sizes are small and the data are sparse with unknown groupings of correlated or collinear variables. The method is also simple to implement and analyze theoretically. Here we describe a variety of situations where treelets perform better than principal component analysis, as well as some common variable selection and cluster averaging schemes. We illustrate treelets on a blocked covariance model and on several data sets (hyperspectral image data, DNA microarray data, and internet advertisements) with highly complex dependencies between variables.
Pages: 435-471
Number of pages: 37