An Analysis of Production Failures in Distributed Data-intensive Systems
This page contains the detailed analysis of a large collection of failure reports. This is the dataset we used in our paper titled:
Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-intensive Systems. Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm. In the Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI'14), October 2014.
The failures are from five widely used open source software projects: Cassandra, HDFS, Hadoop MapReduce, HBase, and Redis. In each diagnosis report, we document in detail the root causes and symptoms of the failure, as well as how it manifested. We also discuss whether each failure produced sufficient log messages for diagnosis and how it was fixed. If we reproduced a failure, we also document the detailed procedure for reproducing it.
Failure set
Aspirator: A simple static checker
One of our findings is that some of the most catastrophic failures,
i.e., failures that affect all or a majority of users, are caused by
simple bugs in the error handling code. We extracted a few rules
from these bugs and built a static checker, Aspirator, to detect
them automatically. The source code of Aspirator is
available here:
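As an illustrative aside (this is not Aspirator's code, and the rule below is an assumption chosen for the example), the following minimal Python sketch shows the flavor of a rule-based check on error handling code: it scans Java source text for `catch` blocks whose body is empty. The names `EMPTY_CATCH`, `find_empty_catches`, and `SAMPLE` are hypothetical, and a regex over source text is only a rough stand-in for real static analysis.

```python
import re

# Hypothetical rule for illustration: a catch block whose body contains
# nothing but whitespace. Not taken from Aspirator's rule set.
EMPTY_CATCH = re.compile(r"catch\s*\(([^)]*)\)\s*\{\s*\}")


def find_empty_catches(java_source: str) -> list[tuple[int, str]]:
    """Return (line_number, caught_declaration) for each empty catch block."""
    findings = []
    for match in EMPTY_CATCH.finditer(java_source):
        # Line numbers are 1-based: count newlines before the match start.
        line_no = java_source.count("\n", 0, match.start()) + 1
        findings.append((line_no, match.group(1).strip()))
    return findings


# A made-up Java snippet: the first handler swallows the exception,
# the second one rethrows it.
SAMPLE = """\
class Demo {
    void run() {
        try {
            flush();
        } catch (IOException e) {
        }
        try {
            close();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
"""

if __name__ == "__main__":
    for line_no, decl in find_empty_catches(SAMPLE):
        print(f"line {line_no}: empty handler for '{decl}'")
```

A text-based check like this misses handlers that contain only comments and can be fooled by string literals, which is one reason a real checker would work on a parsed representation of the program instead.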
Impact
Our paper stirred quite a few online discussions in news, blogs, developers' mailing lists, and hundreds of tweets (see this and this).
Here are a few of them (and if you wrote something about it, let us know!):
- the morning paper (it is also considered a highlight of 2016)
- Hacker News [1], [2].
- Discussions from HBase developers, which prompted a series of reactions to address the problems we mentioned in the paper.
- Twitter discussions: see this, this, and this (if you're looking for a screenshot that summarizes our paper, see this or this).
- neverworkintheory.org.
- Another word for it.
- Metadata.
- Fifty Quick Ideas to Improve Your Tests.
- Postmortem lessons.
- Agilezilla.
- Some discussions on Google+.
- And quite a few emails sent to us from developers...
Contacts
- Ding Yuan, yuan at eecg dot toronto dot edu