The choice of optimal distance measure in genome-wide datasets

Cited by: 19
Authors
Glazko, G
Gordon, A
Mushegian, A
Affiliations
[1] Stowers Inst Med Res, Kansas City, MO 64110 USA
[2] Univ Rochester, Med Ctr, Dept Biostat & Computat Biol, Rochester, NY 14642 USA
[3] Univ Kansas, Med Ctr, Dept Microbiol Mol Genet & Immunol, Kansas City, KS 66160 USA
DOI
10.1093/bioinformatics/bti1201
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Many types of genomic data are naturally represented as binary vectors. Numerous tasks in computational biology can be cast as the analysis of relationships between these vectors, and the first step is frequently to compute their pairwise distance matrix. Many distance measures have been proposed in the literature, but no theory justifies the choice among them.
Results: We examine approaches to measuring distances between binary vectors and study the characteristic properties of various distance measures and their performance in several genome analysis tasks. Most distance measures between binary vectors turn out to belong to a single parametric family, namely the generalized average-based distance with different exponents. We show that descriptive statistics of the distance distribution, such as skewness and kurtosis, can guide the choice of exponent. In contrast, the more familiar distance properties, such as metricity and additivity, appear to have much less effect on the performance of distances.
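The abstract does not state the formula for the parametric family, so the following is only an illustrative sketch under an assumed definition: the distance is taken as one minus the number of shared 1s divided by the power (generalized) mean, with exponent p, of the number of 1s in each vector. Under that assumption p = 1 gives the Dice distance, p = 0 the Ochiai (cosine) distance, p = -1 the Kulczynski distance, and p = -inf and +inf the overlap and Braun-Blanquet distances. The function names and the toy profiles are hypothetical and not taken from the paper.

```python
import math
from itertools import combinations


def power_mean(x, y, p):
    """Generalized (power) mean of two positive numbers with exponent p."""
    if p == 0:
        return math.sqrt(x * y)      # limit p -> 0: geometric mean
    if p == math.inf:
        return max(x, y)
    if p == -math.inf:
        return min(x, y)
    return ((x ** p + y ** p) / 2) ** (1 / p)


def distance(u, v, p):
    """Assumed family: 1 - shared_ones / power_mean(ones(u), ones(v), p)."""
    shared = sum(1 for a, b in zip(u, v) if a and b)
    ones_u, ones_v = sum(u), sum(v)
    if ones_u == 0 or ones_v == 0:
        return 1.0                   # convention chosen here for all-zero vectors
    return 1.0 - shared / power_mean(ones_u, ones_v, p)


def pairwise_distances(vectors, p):
    """Upper triangle of the pairwise distance matrix, as a flat list."""
    return [distance(u, v, p) for u, v in combinations(vectors, 2)]


def skewness_kurtosis(values):
    """Population skewness and excess kurtosis of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    if m2 == 0:
        return float("nan"), float("nan")  # undefined for constant data
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3


if __name__ == "__main__":
    # Hypothetical presence/absence profiles (rows: genes, columns: genomes).
    profiles = [
        [1, 1, 0, 1, 0, 1],
        [1, 0, 0, 1, 1, 0],
        [0, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 0, 1],
        [0, 0, 0, 1, 0, 0],
    ]
    for p in (-math.inf, -1, 0, 1, math.inf):
        d = pairwise_distances(profiles, p)
        skew, kurt = skewness_kurtosis(d)
        print(f"p={p}: skewness={skew:.3f}, excess kurtosis={kurt:.3f}")
```

Running the script prints the skewness and excess kurtosis of the pairwise distance distribution for each exponent, which is the kind of summary the abstract says can guide the choice of exponent. How those statistics should be turned into a selection rule is not specified in the abstract and is left open here.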
Pages: 2-10
Page count: 9