Trends in big data analytics

Cited by: 442
Authors
Kambatla, Karthik [1]
Kollias, Giorgos [2]
Kumar, Vipin [3]
Grama, Ananth [1]
Affiliations
[1] Purdue Univ, Dept Comp Sci, W Lafayette, IN 47907 USA
[2] IBM TJ Watson Res Ctr, Yorktown Hts, NY 10598 USA
[3] Univ Minnesota, Dept Comp Sci, Minneapolis, MN 55455 USA
Funding
U.S. National Science Foundation
Keywords
Big-data; Analytics; Data centers; Distributed systems; MODEL;
DOI
10.1016/j.jpdc.2014.01.003
Chinese Library Classification (CLC) number
TP301 [Theory and methods]
Discipline classification code
081202
Abstract
One of the major applications of future generation parallel and distributed systems is in big-data analytics. Data repositories for such applications currently exceed exabytes and are rapidly increasing in size. Beyond their sheer magnitude, these datasets and associated applications' considerations pose significant challenges for method and software development. Datasets are often distributed and their size and privacy considerations warrant distributed techniques. Data often resides on platforms with widely varying computational and network capabilities. Considerations of fault-tolerance, security, and access control are critical in many applications (Dean and Ghemawat, 2004; Apache Hadoop). Analysis tasks often have hard deadlines, and data quality is a major concern in yet other applications. For most emerging applications, data-driven models and methods, capable of operating at scale, are as-yet unknown. Even when known methods can be scaled, validation of results is a major issue. Characteristics of hardware platforms and the software stack fundamentally impact data analytics. In this article, we provide an overview of the state-of-the-art and focus on emerging trends to highlight the hardware, software, and application landscape of big-data analytics. © 2014 Elsevier Inc. All rights reserved.
Pages: 2561-2573
Page count: 13
Related papers
75 entries in total
[1]   Aurora: a new model and architecture for data stream management [J].
Abadi, DJ ;
Carney, D ;
Cetintemel, U ;
Cherniack, M ;
Convey, C ;
Lee, S ;
Stonebraker, M ;
Tatbul, N ;
Zdonik, S .
VLDB JOURNAL, 2003, 12 (02) :120-139
[2]  
Abouzeid Azza., 2009, VLDB
[3]  
Ahmad Yanif., 2005, SIGMOD Conference, P882, DOI 10.1145/1066157.1066274
[4]   A scalable, commodity data center network architecture [J].
Al-Fares, Mohammad ;
Loukissas, Alexander ;
Vahdat, Amin .
ACM SIGCOMM COMPUTER COMMUNICATION REVIEW, 2008, 38 (04) :63-74
[5]   Data Center TCP (DCTCP) [J].
Alizadeh, Mohammad ;
Greenberg, Albert ;
Maltz, David A. ;
Padhye, Jitendra ;
Patel, Parveen ;
Prabhakar, Balaji ;
Sengupta, Sudipta ;
Sridharan, Murari .
ACM SIGCOMM COMPUTER COMMUNICATION REVIEW, 2010, 40 (04) :63-74
[6]   FAWN: A Fast Array of Wimpy Nodes [J].
Andersen, David G. ;
Franklin, Jason ;
Kaminsky, Michael ;
Phanishayee, Amar ;
Tan, Lawrence ;
Vasudevan, Vijay .
COMMUNICATIONS OF THE ACM, 2011, 54 (07) :101-109
[7]  
Andersen Rasmus, 2008, Proceedings of the 2008 International Conference on Grid Computing & Applications, P175
[8]   Processing high data rate streams in System S [J].
Andrade, H. ;
Gedik, B. ;
Wu, K-L. ;
Yu, P. S. .
JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2011, 71 (02) :145-156
[9]  
[Anonymous], PLDI
[10]  
[Anonymous], HPCA