Practical byzantine fault tolerance and proactive recovery

被引:3527
作者
Castro, M
Liskov, B
机构
[1] Microsoft Res, Cambridge CB3 0FB, England
[2] MIT, Comp Sci Lab, Cambridge, MA 02139 USA
来源
ACM TRANSACTIONS ON COMPUTER SYSTEMS | 2002年 / 20卷 / 04期
关键词
security; reliability; algorithms; performance; measurement; byzantine fault tolerance; state machine replication; proactive recovery; asynchronous systems; state transfer;
D O I
10.1145/571637.571640
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
Our growing reliance on online services accessible on the Internet demands highly available systems that provide correct service without interruptions. Software bugs, operator mistakes, and malicious attacks are a major cause of service interruptions and they can cause arbitrary behavior, that is, Byzantine faults. This article describes a new replication algorithm, BFT, that can be used to build highly available systems that tolerate Byzantine faults. BFT can be used in practice to implement real services: it performs well, it is safe in asynchronous environments such as the Internet, it incorporates mechanisms to defend against Byzantine-faulty clients, and it recovers replicas proactively. The recovery mechanism allows the algorithm to tolerate any number of faults over the lifetime of the system provided fewer than 1/3 of the replicas become faulty within a small window of vulnerability. BFT has been implemented as a generic program library with a simple interface. We used the library to implement the first Byzantine-fault-tolerant NFS file system, BFS. The BFT library and BFS perform well because the library incorporates several important optimizations, the most important of which is the use of symmetric cryptography to authenticate messages. The performance results show that BFS performs 2% faster to 24% slower than production implementations of the NFS protocol that are not replicated. This supports our claim that the BFT library can be used to build practical systems that tolerate Byzantine faults.
引用
收藏
页码:398 / 461
页数:64
相关论文
共 72 条
  • [41] LAMPORT L, 1989, 49 DIG EQ CORP SYST
  • [42] LAMPSON B, 2001, PRINCIPLES DISTRIBUT
  • [43] Li Gong, 1992, Operating Systems Review, V26, P49, DOI 10.1145/130704.130709
  • [44] LISKOV B, 1991, P 13 ACM S OP SYST P, P226
  • [45] Liskov B. H., 1975, IEEE Transactions on Software Engineering, VSE-1, P7, DOI 10.1109/TSE.1975.6312816
  • [46] Lynch N. A., 1996, DISTRIBUTED ALGORITH
  • [47] Maheshwari U., 2000, P 4 USENIX S OP SYST
  • [48] Byzantine quorum systems
    Malkhi, D
    Reiter, M
    [J]. DISTRIBUTED COMPUTING, 1998, 11 (04) : 203 - 213
  • [49] An architecture for survivable coordination in large distributed systems
    Malkhi, D
    Reiter, MK
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2000, 12 (02) : 187 - 202
  • [50] MALKHI D, 1998, UNPUB CORRECTNESS CO