Michael Stumm: Publications

Paper Details

Reference:

Yongle Zhang, Kirk Rodrigues, Yu Luo, Michael Stumm, and Ding Yuan,
"The inflection point hypothesis: a principled debugging approach for locating the root cause of a failure",
In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP'19), Huntsville, ON, Canada, ACM, October, 2019, pp. 131–146.

Download:

PDF; Talk Slides; Talk

Abstract:

The end goal of failure diagnosis is to locate the root cause. Prior root cause localization approaches almost all rely on statistical analysis. This paper proposes taking a different approach based on the observation that if we model an execution as a totally ordered sequence of instructions, then the root cause can be identified by the first instruction where the failure execution deviates from the non-failure execution that has the longest instruction sequence prefix in common with that of the failure execution. Thus, root cause analysis is transformed into a principled search problem to identify the non-failure execution with the longest common prefix. We present Kairux, a tool that does just that. It is, in most cases, capable of pinpointing the root cause of a failure in a distributed system, in a fully automated way. Kairux uses tests from the system's rich unit test suite as building blocks to construct the non-failure execution that has the longest common prefix with the failure execution in order to locate the root cause. By evaluating Kairux on some of the most complex, real-world failures from HBase, HDFS, and ZooKeeper, we show that Kairux can accurately pinpoint each failure's respective root cause.
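
The core observation in the abstract can be illustrated with a toy sketch. It assumes each execution is already available as a list of instruction labels and that a fixed set of non-failure executions is given. The function names and the sample traces are hypothetical and chosen only for illustration. This is not Kairux, which constructs the non-failure execution from unit tests rather than choosing from a fixed list.

```python
from typing import Hashable, Optional, Sequence, Tuple


def common_prefix_len(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Count how many leading elements the two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def inflection_point(
    failure: Sequence[Hashable],
    non_failures: Sequence[Sequence[Hashable]],
) -> Optional[Tuple[int, Hashable]]:
    """Return (index, instruction) of the first point where the failure
    execution deviates from the non-failure execution that shares the
    longest common prefix with it. Returns None if the failure execution
    never deviates from that non-failure execution."""
    if not non_failures:
        raise ValueError("at least one non-failure execution is required")
    # Pick the non-failure execution with the longest common prefix.
    best = max(non_failures, key=lambda run: common_prefix_len(failure, run))
    idx = common_prefix_len(failure, best)
    if idx >= len(failure):
        return None
    return idx, failure[idx]


# Hypothetical traces, invented for illustration only.
failure_run = ["open", "write", "close", "rename", "crash"]
passing_runs = [
    ["open", "read", "close"],
    ["open", "write", "close", "sync", "ok"],
]
print(inflection_point(failure_run, passing_runs))  # (3, 'rename')
```

In this sketch the second passing run shares a three-instruction prefix with the failing run, so the instruction at index 3 is reported as the candidate root cause.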

Keywords:

Distributed systems, root cause, failure diagnosis, debugging

Reference Info:

DOI: 10.1145/3341301.3359650
ISBN: 9781450368735
OCLC: 8877132593

BibTeX:

@inproceedings(Zhang-sosp-19,
    author = {Yongle Zhang and Kirk Rodrigues and Yu Luo and Michael Stumm and Ding Yuan},
    title = {The inflection point hypothesis: a principled debugging approach for locating the root cause of a failure},
    booktitle = {Proceedings of the 27th ACM Symposium on Operating Systems Principles (\textbf{SOSP'19})},
    location = {Huntsville, ON, Canada},
    publisher = {ACM},
    month = {October},
    year = {2019},
    pages = {131--146},
    doi = {10.1145/3341301.3359650},
    isbn = {9781450368735},
    keywords = {Distributed systems, root cause, failure diagnosis, debugging}
)