TREELETS - AN ADAPTIVE MULTI-SCALE BASIS FOR SPARSE UNORDERED DATA

被引:85
作者
Lee, Ann B. [1 ]
Nadler, Boaz [2 ]
Wasserman, Larry [1 ]
机构
[1] Carnegie Mellon Univ, Dept Stat, Pittsburgh, PA 15213 USA
[2] Weizmann Inst Sci, Dept Comp Sci & Appl Math, IL-76100 Rehovot, Israel
关键词
Feature selection; dimensionality reduction; multi-resolution analysis; local best basis. sparsity; principal component analysis; hierarchical clusetering; small smaple sizes;
D O I
10.1214/07-AOAS137
中图分类号
O21 [概率论与数理统计]; C8 [统计学];
学科分类号
020208 ; 070103 ; 0714 ;
摘要
In many modern applications, including analysis of gene expression and texts documents, the data are noisy, high-dimensional, and unordered-with no particular meaning to the given order of the variables. Yet, successful learning is often possible due to sparsity: the fact that the data are typically redundant with underlying structures that can be represented by only a few features. In this paper we present treelets-a novel construction of multi-scale bases that extends wavelets to nonsmooth signals. The method is fully adaptive, as it returns a hierarchical tree and an orthonormal basis which both reflect the internal structure of the data. Treelets are especially well-suited as dimensionality reduction and feature selection tool prior to regression and classification, in situations where sample sizes are small and the data are sparse with unknown groupings of correlated or collinear variables. The method is also simple to implement and analyze theoretically. Here we describe a variety of situations where treelets perform better than principal component analysis, as well as some common variable selection and cluster averaging schemes. We illustrate treelets on a blocked covariance model and on several data sets (hyperspectral image data, DNA microarray data, and internet advertisements) with highly complex dependencies between variables.
引用
收藏
页码:435 / 471
页数:37
相关论文
共 53 条
[11]  
Candes E, 2007, ANN STAT, V35, P2313, DOI 10.1214/009053606000001523
[12]   Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps [J].
Coifman, RR ;
Lafon, S ;
Lee, AB ;
Maggioni, M ;
Nadler, B ;
Warner, F ;
Zucker, SW .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2005, 102 (21) :7426-7431
[13]   The local Karhunen-Loeve bases [J].
Coifman, RR ;
Saito, N .
PROCEEDINGS OF THE IEEE-SP INTERNATIONAL SYMPOSIUM ON TIME-FREQUENCY AND TIME-SCALE ANALYSIS, 1996, :129-132
[14]  
COIFMAN RR, 1992, IEEE T INFORM THEORY, V32, P712
[15]   Finding predictive gene groups from microarray data [J].
Dettling, M ;
Bühlmann, P .
JOURNAL OF MULTIVARIATE ANALYSIS, 2004, 90 (01) :106-131
[16]   Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization [J].
Donoho, DL ;
Elad, M .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2003, 100 (05) :2197-2202
[17]   Adapting to unknown smoothness via wavelet shrinkage [J].
Donoho, DL ;
Johnstone, IM .
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1995, 90 (432) :1200-1224
[18]   Cluster analysis and display of genome-wide expression patterns [J].
Eisen, MB ;
Spellman, PT ;
Brown, PO ;
Botstein, D .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 1998, 95 (25) :14863-14868
[19]   Model-based clustering, discriminant analysis, and density estimation [J].
Fraley, C ;
Raftery, AE .
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2002, 97 (458) :611-631
[20]  
Friedman J, 2001, The elements of statistical learning, V1, DOI DOI 10.1007/978-0-387-21606-5