Decision trees for hierarchical multi-label classification

被引:465
作者
Vens, Celine [1 ]
Struyf, Jan [1 ]
Schietgat, Leander [1 ]
Dzeroski, Saso [2 ]
Blockeel, Hendrik [1 ]
机构
[1] Katholieke Univ Leuven, Dept Comp Sci, B-3001 Louvain, Belgium
[2] Jozef Stefan Inst, Dept Knowledge Technol, Ljubljana 1000, Slovenia
关键词
Hierarchical classification; Multi-label classification; Decision trees; Functional genomics; Precision-recall analysis;
D O I
10.1007/s10994-008-5077-3
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Hierarchical multi-label classification (HMC) is a variant of classification where instances may belong to multiple classes at the same time and these classes are organized in a hierarchy. This article presents several approaches to the induction of decision trees for HMC, as well as an empirical study of their use in functional genomics. We compare learning a single HMC tree (which makes predictions for all classes together) to two approaches that learn a set of regular classification trees (one for each class). The first approach defines an independent single-label classification task for each class (SC). Obviously, the hierarchy introduces dependencies between the classes. While they are ignored by the first approach, they are exploited by the second approach, named hierarchical single-label classification (HSC). Depending on the application at hand, the hierarchy of classes can be such that each class has at most one parent (tree structure) or such that classes may have multiple parents (DAG structure). The latter case has not been considered before and we show how the HMC and HSC approaches can be modified to support this setting. We compare the three approaches on 24 yeast data sets using as classification schemes MIPS's FunCat (tree structure) and the Gene Ontology (DAG structure). We show that HMC trees outperform HSC and SC trees along three dimensions: predictive accuracy, model size, and induction time. We conclude that HMC trees should definitely be considered in HMC tasks where interpretable models are desired.
引用
收藏
页码:185 / 214
页数:30
相关论文
共 39 条
  • [1] Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
    Altschul, SF
    Madden, TL
    Schaffer, AA
    Zhang, JH
    Zhang, Z
    Miller, W
    Lipman, DJ
    [J]. NUCLEIC ACIDS RESEARCH, 1997, 25 (17) : 3389 - 3402
  • [2] [Anonymous], P 15 INT C MACH LEAR
  • [3] [Anonymous], P ICML 97
  • [4] Gene Ontology: tool for the unification of biology
    Ashburner, M
    Ball, CA
    Blake, JA
    Botstein, D
    Butler, H
    Cherry, JM
    Davis, AP
    Dolinski, K
    Dwight, SS
    Eppig, JT
    Harris, MA
    Hill, DP
    Issel-Tarver, L
    Kasarskis, A
    Lewis, S
    Matese, JC
    Richardson, JE
    Ringwald, M
    Rubin, GM
    Sherlock, G
    [J]. NATURE GENETICS, 2000, 25 (01) : 25 - 29
  • [5] Hierarchical multi-label prediction of gene function
    Barutcuoglu, Z
    Schapire, RE
    Troyanskaya, OG
    [J]. BIOINFORMATICS, 2006, 22 (07) : 830 - 836
  • [6] SmcHD1, containing a structural-maintenance-of-chromosomes hinge domain, has a critical role in X inactivation
    Blewitt, Marnie E.
    Gendrel, Anne-Valerie
    Pang, Zhenyi
    Sparrow, Duncan B.
    Whitelaw, Nadia
    Craig, Jeffrey M.
    Apedaile, Anwyn
    Hilton, Douglas J.
    Dunwoodie, Sally L.
    Brockdorff, Neil
    Kay, Graham F.
    Whitelaw, Emma
    [J]. NATURE GENETICS, 2008, 40 (05) : 663 - 669
  • [7] Blockeel H, 1999, LECT NOTES ARTIF INT, V1704, P32
  • [8] Blockeel H., 2002, Proceedings of the SIGKDD Workshop on Multi-Relational Data Mining (MRDM), P21
  • [9] Blockeel H, 2006, LECT NOTES ARTIF INT, V4213, P18
  • [10] Cesa-Bianchi N, 2006, J MACH LEARN RES, V7, P31