Ranked Adjusted Rand: integrating distance and partition information in a measure of clustering agreement

Cited by: 17
Authors
Pinto, Francisco R.
Carrico, Joao A.
Ramirez, Mario
Almeida, Jonas S.
Affiliations
[1] Fac Med Lisbon, Inst Microbiol, Inst Mol Med, P-1649028 Lisbon, Portugal
[2] Inst Tecnol Quim & Biol, Grp Biomatemat, P-2780 Oeiras, Portugal
[3] INESC ID, P-1000029 Lisbon, Portugal
[4] Univ Texas, MD Anderson Canc Ctr, Dept Biostat & Appl Math, Houston, TX 77030 USA
DOI
10.1186/1471-2105-8-44
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification code
071010 ; 081704 ;
Abstract
Background: Biological information is commonly used to cluster or classify entities of interest such as genes, conditions, species or samples. However, different sources of data can be used to classify the same set of entities, and methods that compare the performance of two data sources, or determine how well a given classification agrees with another, are frequently needed, especially in the absence of a universally accepted "gold standard" classification.
Results: Here, we describe a novel measure, the Ranked Adjusted Rand (RAR) index. RAR differs from existing methods by evaluating the extent of agreement between any two groupings while taking into account the intercluster distances. This characteristic is relevant when evaluating pairs of entities that are grouped in the same cluster by one method and separated by another: the latter method may assign them to closely neighbouring clusters or, on the contrary, to clusters that are far apart from each other. RAR is applicable even when intercluster distance information is absent for both or one of the groupings. When it is absent for both, RAR is equal to its predecessor, the Adjusted Rand (HA) index. Artificially designed clusterings were used to demonstrate situations in which only RAR was able to detect differences in the grouping patterns. A study with larger simulated clusterings confirmed that, in realistic conditions, RAR effectively integrates distance and partition information. The new method was applied to biological examples to compare 1) two microbial typing methods, 2) two gene regulatory network distances and 3) microarray gene expression data with pathway information. In the first application, one of the methods does not provide intercluster distances while the other produces a hierarchical clustering. RAR proved to be more sensitive than HA in the choice of the threshold for defining clusters in the hierarchical method that maximizes agreement between the results of both methods.
Conclusion: The major advantage of RAR is that it combines cluster distance and partition information, while the previously available methods used only the latter. RAR should be used in the research problems where HA was previously used, because in the absence of intercluster distance effects it is an equally effective measure, and in the presence of distance effects it is a more complete one.
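For orientation, the partition-only baseline that the abstract calls HA can be sketched in a few lines. The following is a minimal illustration of the standard Adjusted Rand index of Hubert and Arabie, computed from cluster labels alone. It is not the RAR index itself, whose formula is not given in this record. The function name `adjusted_rand` and the example labels are chosen here for illustration only.

```python
from collections import Counter
from math import comb


def adjusted_rand(labels_a, labels_b):
    """Adjusted Rand (HA) index between two flat partitions of the same items.

    Only cluster membership is used; intercluster distances play no role,
    which is the limitation the abstract says RAR addresses.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same items")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two items are needed to form a pair")

    # Pairs placed together in both partitions (contingency-table cells).
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together in each partition separately (table margins).
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)  # chance-level agreement
    maximum = (together_a + together_b) / 2          # best attainable agreement
    if maximum == expected:
        # Degenerate case, e.g. both partitions put everything in one cluster.
        return 1.0
    return (together_both - expected) / (maximum - expected)


if __name__ == "__main__":
    typing_method_1 = [0, 0, 1, 1]
    typing_method_2 = [0, 0, 1, 2]  # splits the second cluster in two
    print(adjusted_rand(typing_method_1, typing_method_2))
```

Because the index depends only on which pairs are co-clustered, it returns the same value whether the two separated items end up in neighbouring clusters or in distant ones.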
Pages: 13