A generic parallel processing model for facilitating data mining and integration

Cited by: 20
Authors
Han, Liangxiu [1]
Liew, Chee Sun [1, 2]
van Hemert, Jano [1]
Atkinson, Malcolm [1]
Affiliations
[1] Univ Edinburgh, Sch Informat, Edinburgh EH8 9AB, Midlothian, Scotland
[2] Univ Malaya, Fac Comp Sci & Informat Technol, Kuala Lumpur 50603, Malaysia
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK;
Keywords
Pipeline streaming; Parallelism; Data mining and data integration (DMI); Workflow; Life sciences; OGSA-DAI;
DOI
10.1016/j.parco.2011.02.006
Chinese Library Classification number
TP301 [Theory, methods];
Discipline classification code
081202;
Abstract
To facilitate data mining and integration (DMI) processes in a generic way, we investigate a parallel pipeline streaming model. We model a DMI task as a streaming data-flow graph: a directed acyclic graph (DAG) of Processing Elements (PEs). The composition mechanism links PEs via data streams, which may be in memory, buffered via disks or inter-computer data-flows. This makes it possible to build arbitrary DAGs with pipelining and both data and task parallelisms, which provide room for performance enhancement. We have applied this approach to a real DMI case in the life sciences and implemented a prototype. To demonstrate feasibility of the modelled DMI task and assess the efficiency of the prototype, we have also built a performance evaluation model. The experimental evaluation results show that a linear speedup has been achieved with the increase of the number of distributed computing nodes in this case study. (C) 2011 Elsevier B.V. All rights reserved.
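The streaming model described in the abstract can be illustrated with a minimal sketch. This is not the authors' prototype and does not use the OGSA-DAI API; the names `source`, `worker`, `sink`, and `run_pipeline` are hypothetical, and in-memory queues stand in for the data streams that link PEs. Several identical worker PEs read from the same stream (data parallelism) while each item flows onward as soon as it is processed (pipelining).

```python
import queue
import threading

END = object()  # sentinel marking end-of-stream (an assumption of this sketch)


def source(items, out, n_consumers):
    # Source PE: emit every item, then one END marker per downstream consumer.
    for item in items:
        out.put(item)
    for _ in range(n_consumers):
        out.put(END)


def worker(func, inp, out):
    # Worker PE: transform each item as it arrives and pass it downstream.
    while True:
        item = inp.get()
        if item is END:
            out.put(END)
            return
        out.put(func(item))


def sink(inp, n_producers, results):
    # Sink PE: collect items until every upstream producer has signalled END.
    finished = 0
    while finished < n_producers:
        item = inp.get()
        if item is END:
            finished += 1
        else:
            results.append(item)


def run_pipeline(items, func, n_workers=3):
    # Bounded queues play the role of in-memory data streams between PEs.
    q_in = queue.Queue(maxsize=8)
    q_out = queue.Queue(maxsize=8)
    results = []
    threads = [threading.Thread(target=source, args=(items, q_in, n_workers))]
    threads += [
        threading.Thread(target=worker, args=(func, q_in, q_out))
        for _ in range(n_workers)
    ]
    threads.append(threading.Thread(target=sink, args=(q_out, n_workers, results)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results  # arrival order depends on thread scheduling


if __name__ == "__main__":
    print(sorted(run_pipeline(range(10), lambda x: x * x)))
```

Python threads here show only the structure of the graph; they are not expected to reproduce the speedup reported for distributed computing nodes.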
Pages: 157-171
Page count: 15