TraceBack: First fault diagnosis by reconstruction of distributed control flow

Cited by: 13
Authors
Ayers, A [1]
Schooler, R
Metcalf, C
Agarwal, A
Rhee, J
Witchel, E
Affiliations
[1] Microsoft Corp, Redmond, WA 98052 USA
[2] MIT, CSAIL, VERITAS Software, Cambridge, MA 02139 USA
[3] Univ Texas, Austin, TX 78712 USA
Keywords
fault diagnosis; instrumentation;
DOI
10.1145/1064978.1065035
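Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202 [Computer Software and Theory]; 0835 [Software Engineering];
Abstract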
Faults that occur in production systems are the most important faults to fix, but most production systems lack the debugging facilities present in development environments. TraceBack provides debugging information for production systems by providing execution history data about program problems (such as crashes, hangs, and exceptions). TraceBack supports features commonly found in production environments such as multiple threads, dynamically loaded modules, multiple source languages (e.g., Java applications running with JNI modules written in C++), and distributed execution across multiple computers. TraceBack supports first fault diagnosis: discovering what went wrong the first time a fault is encountered. The user can see how the program reached the fault state without having to re-run the computation, in effect enabling a limited form of a debugger in production code. TraceBack uses static, binary program analysis to inject low-overhead runtime instrumentation at control-flow block granularity. Post-facto reconstruction of the records written by the instrumentation code produces a source-statement trace for user diagnosis. The trace shows the dynamic instruction sequence leading up to the fault state, even when the program took exceptions or terminated abruptly (e.g., kill -9). We have implemented TraceBack on a variety of architectures and operating systems, and present examples from a variety of platforms. Performance overhead is variable, from 5% for Apache running SPECweb99, to 16%-25% for the Java SPECJbb benchmark, to 60% average for SPECint2000. We show examples of TraceBack's cross-language and cross-machine abilities, and report its use in diagnosing problems in production software.
Pages: 201-212
Page count: 12