Fault-tolerance in the borealis distributed stream processing system

被引:94
作者
Balazinska, Magdalena [1 ]
Balakrishnan, Hari [2 ]
Madden, Samuel R. [2 ]
Stonebraker, Michael [2 ]
机构
[1] Univ Washington, Dept Comp Sci & Engn, Seattle, WA 98195 USA
[2] MIT, Comp Sci & Artificial Intelligence Lab, Cambridge, MA 02139 USA
来源
ACM TRANSACTIONS ON DATABASE SYSTEMS | 2008年 / 33卷 / 01期
关键词
algorithms; design; experimentation; reliability; distributed stream processing; fault-tolerance; availability; consistency;
D O I
10.1145/1331904.1331907
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Over the past few years, Stream Processing Engines (SPEs) have emerged as a new class of software systems, enabling low latency processing of streams of data arriving at high rates. As SPEs mature and get used in monitoring applications that must continuously run (e. g., in network security monitoring), a significant challenge arises: SPEs must be able to handle various software and hardware faults that occur, masking them to provide high availability (HA). In this article, we develop, implement, and evaluate DPC (Delay, Process, and Correct), a protocol to handle crash failures of processing nodes and network failures in a distributed SPE. Like previous approaches to HA, DPC uses replication and masks many types of node and network failures. In the presence of network partitions, the designer of any replication system faces a choice between providing availability or data consistency across the replicas. In DPC, this choice is made explicit: the user specifies an availability bound (no result should be delayed by more than a specified delay threshold even under failure if the corresponding input is available), and DPC attempts to minimize the resulting inconsistency between replicas (not all of which might have seen the input data) while meeting the given delay threshold. Although conceptually simple, the DPC protocol tolerates the occurrence of multiple simultaneous failures as well as any further failures that occur during recovery. This article describes DPC and its implementation in the Borealis SPE. We show that DPC enables a distributed SPE to maintain low-latency processing at all times, while also achieving eventual consistency, where applications eventually receive the complete and correct output streams. Furthermore, we show that, independent of system size and failure location, it is possible to handle failures almost up-to the user-specified bound in a manner that meets the required availability without introducing any inconsistency.
引用
收藏
页数:44
相关论文
共 45 条
[11]  
BALAZINSKA M, 2005, P ACM SIGMOD INT C M
[12]  
BERNSTEIN PA, 1990, P ACM SIGMOD INT C M
[13]   Lessons from giant-scale services [J].
Brewer, EA .
IEEE INTERNET COMPUTING, 2001, 5 (04) :46-55
[14]  
Chandrasekaran S., 2003, P ACM SIGMOD INT C M
[15]  
Cherniack M., 2003, P 1 BIENN C INN DAT
[16]  
CRANOR C, 2003, P ACM SIGMOD INT C M
[17]   A survey of rollback-recovery protocols in message-passing systems [J].
Elnozahy, EN ;
Alvisi, L ;
Wang, YM ;
Johnson, DB .
ACM COMPUTING SURVEYS, 2002, 34 (03) :375-408
[18]  
FEKETE A, 1996, P 15 ACM S PRINC DIS
[19]   HOW TO ASSIGN VOTES IN A DISTRIBUTED SYSTEM [J].
GARCIAMOLINA, H ;
BARBARA, D .
JOURNAL OF THE ACM, 1985, 32 (04) :841-860
[20]  
Gifford D. K., 1979, P 7 ACM S OP SYST PR