Special Topics in Software Engineering: Dependable Systems
ECE 1724, Fall 2009
University of Toronto
Instructor: Ashvin Goel
Course Number: ECE 1724
Course Time: Wed, 3-5 pm
Course Room: GB248
Start Date: Sep 16, 2009
Project Suggestions
Some suggested projects are described below. Please talk to the instructor
for more details about each project, and make sure to get the instructor's
confirmation before starting a project.
- Detecting Corruption Bugs in Web Applications
Web applications are becoming increasingly popular. However, a bug in such an
application can affect every user of that application. This project involves
detecting bugs that can corrupt data in web applications. The instructor will
provide some code that can be used to characterize web application
requests. Using this information, your goal is to choose some common web
applications, run them using the provided code, and determine whether a request
might be performing some operations that could corrupt web application data.
- Detecting I/O Bugs in Applications
File systems perform writes asynchronously and often do not report failures if
the data cannot be written to disk successfully (e.g., the application may have
exited by the time the file system tries to flush data to disk and an error
occurs during the flushing operation). In this project, you will study how this
behavior can affect applications. Your task will be to induce I/O failures on
asynchronous writes and observe how applications handle such failures. For
example, how do applications that use the Berkeley database behave after such
failures? How would you evaluate whether applications fail gracefully?
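As a starting point for such experiments, the following minimal sketch shows one way to simulate a flush failure at the application level and check whether the error is noticed. The names `checked_write` and `failing_fsync` and the injectable `fsync` parameter are hypothetical, introduced only for illustration. Real fault injection for this project would more likely happen below the application, in the file system or block layer.

```python
import errno
import os
import tempfile


def checked_write(path, data, fsync=os.fsync):
    """Write data and force it to disk.

    Returns None on success, or the OSError that occurred. The `fsync`
    parameter is a hypothetical injection hook so that a test can
    simulate a failed flush without a faulty disk.
    """
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    except OSError as exc:
        return exc
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Without this call, a later flush error may never reach the
        # application.
        fsync(fd)
        return None
    except OSError as exc:
        return exc
    finally:
        os.close(fd)


def failing_fsync(fd):
    """Stand-in for a flush that fails with an I/O error (assumed fault)."""
    raise OSError(errno.EIO, os.strerror(errno.EIO))
```

Calling `checked_write(path, data, fsync=failing_fsync)` returns the injected `EIO` error, which gives a simple way to check whether the code under study propagates, logs, or silently drops it.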
- Resource Exhaustion Failures in File Systems
In this project, you will evaluate how file systems handle resource exhaustion,
such as the file system becoming full. File systems also have various
statically allocated resources, such as the number of inodes, that may get
exhausted. You will first identify all resources that are statically allocated
in a specific file system such as the Linux Ext4 file system. Does Ext4 handle
exhaustion of these resources gracefully? Are the applications informed about
these failures correctly?
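A first step could be to observe how close a file system is to exhausting blocks and inodes, and to recognize the error codes an application sees when that happens. The sketch below assumes a Unix-like system where `os.statvfs` is available. The helper names `resource_report` and `is_exhaustion_error` are hypothetical, and the set of errno values is an assumption that may not cover every resource in Ext4.

```python
import errno
import os

# errno values that commonly signal an exhausted file system resource.
# On Linux, running out of inodes is reported as ENOSPC, the same code
# that is used for running out of data blocks.
EXHAUSTION_ERRNOS = {errno.ENOSPC, errno.EDQUOT}


def resource_report(path):
    """Return block and inode usage for the file system containing path."""
    st = os.statvfs(path)
    return {
        "total_bytes": st.f_blocks * st.f_frsize,
        "free_bytes": st.f_bavail * st.f_frsize,   # free for unprivileged users
        "total_inodes": st.f_files,
        "free_inodes": st.f_favail,                # free for unprivileged users
    }


def is_exhaustion_error(exc):
    """True if exc is an OSError whose errno suggests resource exhaustion."""
    return isinstance(exc, OSError) and exc.errno in EXHAUSTION_ERRNOS
```

Comparing `resource_report` output before and after a workload, and classifying the errors the workload receives with `is_exhaustion_error`, is one simple way to check whether applications are informed about exhaustion at all.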
- N-Version File Systems
The goal of this project is to improve the reliability of file systems in the
face of hardware faults and file system bugs. One option is to take advantage
of the fact that different file systems handle failures differently. As a
result, a simple fault tolerance method would be to replicate all file system
operations to two different file systems and detect errors by comparing the outputs
of the operations.
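The replicate-and-compare idea can be sketched at user level as follows. This is only an illustration under stated assumptions: the two root directories stand in for mount points of two different file systems, and `NVersionStore` and `DivergenceError` are hypothetical names. An actual project would more likely interpose at the VFS or system call level.

```python
import os
import tempfile


class DivergenceError(Exception):
    """Raised when the two replicas return different results."""

    def __init__(self, name, results):
        super().__init__(f"replicas disagree on {name!r}: {results}")
        self.name = name
        self.results = results


class NVersionStore:
    """Replicates whole-file reads and writes across two directory roots."""

    def __init__(self, root_a, root_b):
        self.roots = (root_a, root_b)

    def write(self, name, data):
        # Apply the same operation to every replica.
        for root in self.roots:
            with open(os.path.join(root, name), "wb") as f:
                f.write(data)

    def read(self, name):
        # Collect either the data or the error code from each replica.
        results = []
        for root in self.roots:
            try:
                with open(os.path.join(root, name), "rb") as f:
                    results.append(("ok", f.read()))
            except OSError as exc:
                results.append(("error", exc.errno))
        if results[0] != results[1]:
            raise DivergenceError(name, results)
        status, value = results[0]
        if status == "error":
            # Both replicas agree that the operation fails.
            raise OSError(value, os.strerror(value))
        return value
```

With only two replicas, a mismatch can be detected but not resolved, because there is no majority to decide which output is correct. Recovering the correct output would need a third replica or some other source of truth.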
- Fast Kernel Updates
The goal of this project is to update kernels with minimal downtime when a
kernel update is available. The idea is to preserve application state, while
only rebooting the kernel, using an application-level checkpoint recovery
scheme. The instructor will provide code that performs this checkpointing. Your
goal is to improve the reboot time further by using virtual machines to perform
some of the checkpointing and recovery operations in parallel.
- Recovery via Restarting Applications
The "Microreboot" paper described a method by which parts of an application are
rebooted to allow recovery of the application. This approach gets rid of faulty
state in the application. In this project, you will choose either a content
download application (e.g., bittorrent) or an instant messaging application
(e.g., gaim) and implement a recovery via "reboot" method for this application.
You need to make sure that the persistent data (e.g., the music repository or
the instant messages received) in the application
is not lost. How fine is your reboot granularity? Can you tune it? How often is
reboot possible? What types of faults or bugs can the reboot handle? How does
the reboot affect user perception?
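The separation between persistent data and restartable state can be illustrated with a toy messaging component. This is a minimal sketch, not the mechanism from the "Microreboot" paper: `MessageLog`, `Session`, and `microreboot` are hypothetical names, and the reboot granularity here is a single in-process object.

```python
import json
import os
import tempfile


class MessageLog:
    """Persistent state: one JSON message per line, flushed on every append."""

    def __init__(self, path):
        self.path = path

    def append(self, msg):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(msg) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def load(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


class Session:
    """Restartable component: everything held here is volatile."""

    def __init__(self, log):
        self.log = log
        self.history = log.load()  # rebuilt from persistent state
        self.unread = 0            # volatile counter, lost on reboot

    def receive(self, text):
        msg = {"text": text}
        self.log.append(msg)       # persist before touching volatile state
        self.history.append(msg)
        self.unread += 1


def microreboot(session):
    """Discard the possibly faulty component and rebuild it from the log."""
    return Session(session.log)
```

In this sketch the received messages survive the reboot, while the unread counter does not. Deciding which state belongs on each side of that line is one concrete form of the granularity question above.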
- Application-Level Undo and Recovery
The "Undo for Operators" paper implemented an undoable email service. In
general, their application-level undo and recovery service requires applications
whose operations have well-defined semantics and can be serialized. Another
example that satisfies these criteria is a calendar service. Can you think of
other such applications? Choose an application and implement an undoable service
for that application. Describe the properties of this undoable application. How
does application-specific recovery improve on generic recovery as described in
the "Exploring Failure Transparency" paper?
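For the calendar example, operations with well-defined semantics make it straightforward to record an inverse for each one. The sketch below uses a simple inverse-operation log. It is an assumption chosen for illustration, not the design from the "Undo for Operators" paper, and `UndoableCalendar` and its methods are hypothetical names.

```python
class UndoableCalendar:
    """In-memory calendar whose operations each record their own inverse."""

    def __init__(self):
        self.events = {}   # event_id -> title
        self._undo = []    # stack of (action, event_id, title)

    def add(self, event_id, title):
        if event_id in self.events:
            raise KeyError(f"event {event_id!r} already exists")
        self.events[event_id] = title
        # The inverse of add is to delete the event again.
        self._undo.append(("delete", event_id, None))

    def remove(self, event_id):
        title = self.events.pop(event_id)
        # The inverse of remove is to put the old title back.
        self._undo.append(("restore", event_id, title))

    def rename(self, event_id, title):
        old = self.events[event_id]
        self.events[event_id] = title
        # The inverse of rename is to restore the previous title.
        self._undo.append(("restore", event_id, old))

    def undo(self):
        """Undo the most recent operation. Returns False if none remain."""
        if not self._undo:
            return False
        action, event_id, title = self._undo.pop()
        if action == "delete":
            del self.events[event_id]
        else:
            self.events[event_id] = title
        return True
```

This log undoes operations strictly in reverse order. Undoing an older operation while keeping later ones would require reasoning about how the operations interact, which is where application-specific semantics become important.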