Special Topics in Software Engineering: Dependable Systems

ECE 1724, Fall 2009
University of Toronto

Instructor: Ashvin Goel
Course Number: ECE 1724
Course Time: Wed, 3-5 pm
Course Room: GB248
Start Date: Sep 16, 2009

Project Suggestions

Some suggested projects are described below. Please talk to the instructor for more details about each project, and make sure to get the instructor's confirmation before starting a project.

Detecting Corruption Bugs in Web Applications

Web applications are increasingly popular, but a bug in one of them can affect every user of the application. This project involves detecting bugs that can corrupt data in web applications. The instructor will provide some code that can be used to characterize web application requests. Using this information, your goal is to choose some common web applications, run them with the provided code, and determine whether a request might perform operations that could corrupt web application data.

Detecting I/O Bugs in Applications

File systems perform writes asynchronously and often do not report failures if the data cannot be written to disk successfully (e.g., the application may have exited by the time the file system tries to flush data to disk and an error occurs during the flushing operation). In this project, you will study how this behavior can affect applications. Your task will be to induce I/O failures on asynchronous writes and observe how applications handle such failures. For example, how do applications that use the Berkeley database behave after such failures? How would you evaluate whether applications fail gracefully?
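
As a starting point for thinking about what "handling the failure" means, the sketch below shows one way an application can give itself a chance to see a deferred write error: it forces the data to disk with fsync() and checks the result before treating the write as durable. This is only an illustration, not part of the provided course code. The function name durable_write and the file name record.db are hypothetical, and the sketch assumes a POSIX-like system.

```python
import os
import tempfile


def durable_write(path, data):
    """Write data to path and force it to disk.

    Returns (True, None) on success, or (False, exc) if writing, flushing,
    or closing raised an OSError. Errors from opening the file propagate.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    error = None
    try:
        remaining = memoryview(data)
        while len(remaining) > 0:
            # write() may accept fewer bytes than requested.
            written = os.write(fd, remaining)
            remaining = remaining[written:]
        # write() normally only fills the cache; fsync() is where a
        # failed flush to disk can surface as an error.
        os.fsync(fd)
    except OSError as exc:
        error = exc
    finally:
        try:
            os.close(fd)
        except OSError as exc:
            # Keep the first error if there already was one.
            error = error or exc
    return error is None, error


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ok, err = durable_write(os.path.join(tmp, "record.db"), b"payload")
        print("durable" if ok else f"write failed: {err}")
```

An application that never calls fsync(), or ignores its result, has no opportunity to react to the failure, which is one of the behaviors this project would look for.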

Resource Exhaustion Failures in File Systems

In this project, you will evaluate how file systems handle resource exhaustion, such as the file system being full. File systems have various other statically allocated resources, such as the number of inodes, that may also get exhausted. You will first identify all resources that are statically allocated in a specific file system, such as the Linux Ext4 file system. Does Ext4 handle exhaustion of these resources gracefully? Are the applications informed about these failures correctly?
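
A test harness for this project will probably need to see how close a file system is to exhaustion and how a failure is reported to the application. The sketch below is one possible starting point, assuming a POSIX system where os.statvfs is available. The helper names resource_headroom and classify_failure are hypothetical, and blocks and inodes are only two of the resources you would need to cover.

```python
import errno
import os


def resource_headroom(path):
    """Report free and total blocks (as bytes) and inodes for the
    file system that contains path."""
    st = os.statvfs(path)
    return {
        "free_bytes": st.f_bavail * st.f_frsize,
        "total_bytes": st.f_blocks * st.f_frsize,
        "free_inodes": st.f_favail,
        "total_inodes": st.f_files,
    }


def classify_failure(exc):
    """Map an OSError seen by the application to a coarse category."""
    if exc.errno == errno.ENOSPC:
        # The same error code is used for running out of data blocks
        # and for running out of inodes.
        return "out of blocks or inodes"
    if exc.errno == errno.EDQUOT:
        return "quota exceeded"
    return "other"


if __name__ == "__main__":
    for key, value in resource_headroom(".").items():
        print(f"{key}: {value}")
```

Because ENOSPC covers more than one kind of exhaustion, comparing the statvfs numbers before and after a failed operation is one way to tell which resource actually ran out.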

N-Version File Systems

The goal of this project is to improve the reliability of file systems in the face of hardware and file system bugs. One option is to take advantage of the fact that different file systems handle failures differently. As a result, a simple fault tolerance method would be to replicate all file system operations to two different file systems and detect errors based on comparing the outputs of the operations.
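
The replicate-and-compare idea can be sketched at user level as follows. This is a toy illustration under strong assumptions: the two directories passed to MirroredStore are assumed to be mount points of two different file systems, only whole-file write and read are covered, and the class and exception names are hypothetical. A real design would have to intercept all file system operations.

```python
import os
import tempfile


class DivergenceError(Exception):
    """Raised when the two replicas disagree on the result of an operation."""


class MirroredStore:
    def __init__(self, root_a, root_b):
        # Each root is assumed to live on a different file system.
        self.roots = (root_a, root_b)

    def write(self, name, data):
        # Apply the same operation to both replicas.
        for root in self.roots:
            with open(os.path.join(root, name), "wb") as f:
                f.write(data)

    def read(self, name):
        # Collect the outcome (data or error code) from each replica.
        results = []
        for root in self.roots:
            try:
                with open(os.path.join(root, name), "rb") as f:
                    results.append(("ok", f.read()))
            except OSError as exc:
                results.append(("error", exc.errno))
        if results[0] != results[1]:
            raise DivergenceError(f"{name}: replicas disagree: {results}")
        kind, value = results[0]
        if kind == "error":
            # Both replicas failed the same way; report that error.
            raise OSError(value, os.strerror(value), name)
        return value


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
        store = MirroredStore(a, b)
        store.write("notes.txt", b"hello")
        print(store.read("notes.txt"))
```

With only two replicas, a mismatch shows that something went wrong but not which copy is correct, so this scheme detects errors without repairing them.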

Fast Kernel Updates

The goal of this project is to update kernels with minimal downtime when a kernel update is available. The idea is to preserve application state, while only rebooting the kernel, using an application-level checkpoint recovery scheme. The instructor will provide code that performs this checkpointing. Your goal is to improve the reboot time further by using virtual machines to perform some of the checkpointing and recovery operations in parallel.

Recovery via Restarting Applications

The "Microreboot" paper described a method by which parts of an application are rebooted to allow recovery of the application. This approach gets rid of faulty state in the application. In this project, you will choose either a content download application (e.g., bittorrent) or an instant messaging application (e.g., gaim) and implement a recovery via "reboot" method for this application. You need to make sure that the persistent data (e.g., the music repository or the instant messages received) in the application is not lost. How fine is your reboot granularity? Can you tune it? How often is reboot possible? What types of faults or bugs can the reboot handle? How does the reboot affect user perception?
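
The separation between persistent data and discardable state can be illustrated with a very small sketch. It is not the paper's mechanism and not a model of any real messaging client: MessageLog is a hypothetical component that appends each received message to a file before updating its in-memory state, so that a "reboot" of the component can throw away everything in memory and rebuild from the file.

```python
import json
import os
import tempfile


class MessageLog:
    """Hypothetical instant-messaging component with a reboot() operation."""

    def __init__(self, path):
        self.path = path
        self.reboot()

    def reboot(self):
        """Discard all in-memory state and rebuild it from the file."""
        self.unread_cache = {}  # volatile state, may have become faulty
        self.messages = []
        if os.path.exists(self.path):
            with open(self.path, "r", encoding="utf-8") as f:
                self.messages = [json.loads(line) for line in f if line.strip()]

    def receive(self, sender, text):
        record = {"from": sender, "text": text}
        # Persist first, so a reboot after this point cannot lose the message.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.messages.append(record)
        self.unread_cache[sender] = self.unread_cache.get(sender, 0) + 1


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = MessageLog(os.path.join(tmp, "messages.log"))
        log.receive("alice", "hi")
        log.unread_cache["alice"] = -5  # simulate corrupted volatile state
        log.reboot()
        print(log.messages, log.unread_cache)
```

Here the reboot granularity is a single object. Part of the project is deciding what the corresponding unit is in a real application, and which state can safely be discarded.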

Application-Level Undo and Recovery

The "Undo for Operators" paper implemented an undoable email service. In general, their application-level undo and recovery service requires applications whose operations have well-defined semantics and can be serialized. Another example that satisfies these criteria is a calendar service. Can you think of other such applications? Choose an application and implement an undoable service for that application. Describe the properties of this undoable application. How does application-specific recovery improve on generic recovery as described in the "Exploring Failure Transparency" paper?
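
To make "operations with well-defined semantics" concrete, the sketch below shows a toy calendar in which every operation records how to reverse itself. This is a plain inverse-operation stack, which is much simpler than the design in the paper and is not meant to reproduce it. The class UndoableCalendar and its methods are hypothetical names chosen for illustration.

```python
class UndoableCalendar:
    def __init__(self):
        self.events = {}    # event id -> (title, time slot)
        self.undo_log = []  # stack of (action, event id, saved value)

    def add(self, event_id, title, slot):
        if event_id in self.events:
            raise KeyError(f"event {event_id!r} already exists")
        self.events[event_id] = (title, slot)
        # The inverse of adding an event is removing it.
        self.undo_log.append(("remove", event_id, None))

    def move(self, event_id, new_slot):
        title, old_slot = self.events[event_id]
        self.events[event_id] = (title, new_slot)
        # The inverse of moving is restoring the old entry.
        self.undo_log.append(("restore", event_id, (title, old_slot)))

    def remove(self, event_id):
        old = self.events.pop(event_id)
        # The inverse of removing is restoring the deleted entry.
        self.undo_log.append(("restore", event_id, old))

    def undo(self):
        """Reverse the most recent operation; return False if none is left."""
        if not self.undo_log:
            return False
        action, event_id, saved = self.undo_log.pop()
        if action == "remove":
            del self.events[event_id]
        else:
            self.events[event_id] = saved
        return True


if __name__ == "__main__":
    cal = UndoableCalendar()
    cal.add("e1", "Project meeting", "Wed 3pm")
    cal.move("e1", "Thu 10am")
    cal.undo()
    print(cal.events)
```

Because each operation names exactly what it changed, the log can be stored and replayed in order, which is the property that makes an application a candidate for this kind of service.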
