Information mining over heterogeneous and high-dimensional time-series data in clinical trials databases

Cited by: 15
Authors
Altiparmak, F [1]
Ferhatosmanoglu, H
Erdal, S
Trost, DC
Affiliations
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
[2] Ohio State Univ, Biophys Grad Program, Columbus, OH 43210 USA
[3] Pfizer Inc, Global Res & Dev, Groton, CT 06340 USA
Source
IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE | 2006 / Vol. 10 / No. 2
Keywords
clinical trials; information mining; time series
DOI
10.1109/TITB.2005.859885
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
An effective analysis of clinical trials data involves analyzing different types of data, such as heterogeneous and high-dimensional time series data. Current time series analysis methods generally assume that the series at hand are long enough for statistical techniques to be applied to them. Other idealized assumptions are that data are collected at equal intervals and that, when time series are compared, they have equal lengths. However, these assumptions are not valid for many real data sets, especially clinical trials data sets. In addition, the data sources differ from each other, the data are heterogeneous, and the sensitivity of the experiments varies by source. Approaches for mining time series data need to be revisited with this wide range of requirements in mind. In this paper, we propose a novel approach for information mining that involves two major steps: applying a data mining algorithm over homogeneous subsets of data, and identifying common or distinct patterns over the information gathered in the first step. Our approach is implemented specifically for heterogeneous and high-dimensional time series clinical trials data. Using this framework, we propose a new way of utilizing frequent itemset mining, as well as clustering and declustering techniques with novel distance metrics for measuring similarity between time series data. By clustering the data, we find groups of analytes (substances in blood) that are most strongly correlated. Most of these relationships are already known and are verified by the clinical panels; in addition, we identify novel groups that need further biomedical analysis. A slight modification to our algorithm results in an effective declustering of high-dimensional time series data, which is then used for "feature selection." Using industry-sponsored clinical trials data sets, we are able to identify a small set of analytes that effectively models the state of normal health.
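The two-step idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: plain Pearson correlation over shared sampling days stands in for the paper's distance metrics, and a simple count across subsets stands in for the pattern-comparison step. The function names, thresholds, analyte names (`ALT`, `AST`, `GLU`), and all values are made up for illustration.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    n = len(xs)
    if n < 2:
        return 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def correlated_pairs(subset, threshold=0.8):
    """Step 1: within one homogeneous subset, find strongly correlated pairs.

    `subset` maps analyte name -> {sampling_day: value}. Series may be short
    and irregularly sampled, so only days shared by both analytes are used.
    """
    pairs = set()
    for a, b in combinations(sorted(subset), 2):
        shared = sorted(set(subset[a]) & set(subset[b]))
        xs = [subset[a][t] for t in shared]
        ys = [subset[b][t] for t in shared]
        if abs(pearson(xs, ys)) >= threshold:
            pairs.add((a, b))
    return pairs


def common_and_distinct(subsets, threshold=0.8, min_support=2):
    """Step 2: compare per-subset results to find common and distinct pairs."""
    counts = Counter()
    for subset in subsets.values():
        counts.update(correlated_pairs(subset, threshold))
    common = {p for p, c in counts.items() if c >= min_support}
    distinct = {p for p, c in counts.items() if c == 1}
    return common, distinct


# Hypothetical data: two sources with different sampling schedules.
subsets = {
    "site_a": {
        "ALT": {0: 20, 7: 25, 14: 30},
        "AST": {0: 18, 7: 24, 14: 29},
        "GLU": {0: 90, 7: 85, 14: 95},
    },
    "site_b": {
        "ALT": {0: 22, 10: 21, 20: 35, 30: 40},
        "AST": {0: 20, 20: 33, 30: 37},
        "GLU": {0: 100, 20: 88, 30: 80},
    },
}

common, distinct = common_and_distinct(subsets)
print("common:", sorted(common))
print("distinct:", sorted(distinct))
```

Mining each source separately avoids mixing measurements whose schedules and sensitivities differ; only the resulting patterns are compared across sources.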
Pages: 254-263
Page count: 10