Special Topics in Software Engineering: Dependable Systems
ECE 1724, Fall 2009
University of Toronto
Instructor: Ashvin Goel
Course Number: ECE 1724
Course Time: Wed, 3-5 pm
Course Room: GB248
Start Date: Sep 16, 2009
Modern computer systems have become tightly intertwined with our daily
lives. However, they are failure-prone and difficult to manage and thus hardly
dependable. Today, these problems dominate the total cost of ownership of computer
systems, and unfortunately they have no simple solutions. There is a
realization that these problems cannot be decisively solved but are ongoing
facts of life that must be dealt with regularly. To do so, systems should be
designed to detect, isolate and recover from these problems.
This advanced graduate-level course focuses on dependability in software
systems and examines current research that aims to address challenges caused
by software and hardware bugs and software
misconfiguration. Students are expected to read and critique recent
research papers in operating systems that cover these areas. They are also
expected to work on a research project and make class presentations. While
there are no specific prerequisites for this course, students who have taken
undergraduate or graduate courses in operating systems, networks and
distributed systems will have an edge.
There are no required textbooks for this course. The optional textbooks are:
- Modern Operating Systems (Third Edition), by Andrew
S. Tanenbaum. Published by Prentice Hall, 2008.
- Distributed Systems: Concepts and Design (Third Edition), by
George Coulouris, Jean Dollimore and Tim Kindberg. Published by Addison
Please subscribe to the class mailing list by joining
this group. You
will need a Yahoo account, although Yahoo will forward the group messages to any
email address of your choice. The instructor will use this group to send
instructions and reminders. All students who subscribe to the group can send
email to the group by sending mail to
this list. The
group is not moderated. If you have a specific question for the instructor,
please send an email to the instructor directly. For the first week of classes,
you can join the group directly. After that the Yahoo groups website will
require the instructor's approval to subscribe you.
Grades will be based on class presentations, a class project, and class
participation. There will be no final exam in this course. The
grading breakdown is as follows:
- Class presentation: 30%
- Class project: 50%
- Class participation: 20%
Note: If a student is unable to attend a class, he or
she will lose 2% for non-participation.
Each week this class will cover a group of papers that focuses on a
specific aspect of the course. Students are expected to read all the
papers in the group that will be presented. At the beginning of the term,
each paper will be assigned to a student who will be presenting the
paper. Presentations will be limited to roughly 20 minutes.
More details about the presentation
format. Please read very carefully.
There will be no assignments in this course.
A major component of this course is devoted to a term-long project. The
topic of the project is largely up to you, but to help you choose a
project, a sample list of projects is provided below. This list should
help students determine whether their own projects are of reasonable size.
More details about the project
format. Please read very carefully.
Here is a list of project ideas.
This is a tentative list. These papers can be accessed from either
the ACM or
the Usenix web site. If you cannot access
ACM articles directly, please read the
following instructions for accessing the
papers via the UoT online Library.
Week 1: Introduction (Sep 16)
- Why Do Computers Stop and What Can Be Done About It? SRDS 1986.
- Broad New OS Research: Challenges and Opportunities. HOTOS 2005.
- Introduction to Dependable Software Systems by Instructor.
- Efficient Readings of Papers in Science and Technology.
- How (and How Not) to Write a Good Systems Paper. Operating Systems Review 1983.
Week 2: Bug Detection and Diagnosis (Sep 23)
- Bugs as Deviant Behavior: A General Approach to Inferring Errors in Systems Code. SOSP 2001. Afshar
- Triage: Diagnosing Production Run Failures at the User's Site. SOSP 2007. Raymond
- RacerX: Effective, Static Detection of Race Conditions and Deadlocks. SOSP 2003.
- Using Model Checking to Find Serious File System Errors. OSDI 2004.
- eXplode: A lightweight, general system for finding serious storage system errors. OSDI 2006.
- Hang Analysis: Fighting Responsiveness Bugs. Eurosys 2008.
Week 3: Race Detection (Sep 30)
- Eraser: A Dynamic Data Race Detector for Multi-Threaded Programs. SOSP 1997. Cedomir
- Finding and Reproducing Heisenbugs in Concurrent Programs. OSDI 2008. David
- Deadlock Immunity: Enabling Systems to Defend Against Deadlocks. OSDI 2008. Svitlana
Week 4: Fault Isolation (Oct 7 - first report due)
- Efficient Software-Based Fault Isolation. SOSP 1993. David
- Hive: Fault Containment for Shared-Memory Multiprocessors. SOSP 1995. Farid
- Dealing With Disaster: Surviving Misbehaved Kernel Extensions. OSDI 1996. Zimu
Week 5: No Class (Oct 14)
Week 6: Fault Isolation (Oct 21)
- CuriOS: Improving Reliability through Operating System Structure. OSDI 2008. Yuk Fai
- Improving the Reliability of Commodity Operating Systems. SOSP 2003. Yuri
- Fast Byte-granularity Software Fault Isolation. SOSP 2009. Afshar
Week 7: Generic Failure Recovery (Oct 28)
- Exploring Failure Transparency and the Limits of Generic Recovery. OSDI 2000. Peter
- Rx: Treating Bugs As Allergies---A Safe Method to Survive Software Failures. SOSP 2005. Patrick
- Enhancing Server Availability and Security Through Failure-Oblivious Computing. OSDI 2004. Svitlana
Week 8: Application-Specific Failure Recovery (Nov 4 - second report due)
- Undo for Operators: Building an Undoable E-mail Store. Usenix 2003. Patricia
- Microreboot - A Technique for Cheap Recovery. OSDI 2004. Isaac
Week 8/9: OS Failure Recovery (Nov 4, 11)
- Tolerating Hardware Device Failures in Software. SOSP 2009. Cedomir (Nov 4)
- Recovery Domains: An Organizing Principle for Recoverable Operating Systems. ASPLOS 2009. Pouya (Nov 11)
Week 9: Fault Tolerance (Nov 11)
- Hypervisor-based Fault-tolerance. SOSP 1995. Peter
- Remus: High Availability via Asynchronous Virtual Machine Replication. NSDI 2008. Sam
Week 10: Storage Failure Recovery (Nov 18)
- Iron File Systems. SOSP 2005. Ryan
- Improving File System Reliability with I/O Shepherding. SOSP 2007. Ryan
- Analyzing the effects of disk-pointer corruption. DSN 2008. Yaowei
Week 11: Testing and Development (Nov 25)
- KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs. OSDI 2008. Pouya
- ODR: Output-Deterministic Replay for Multicore Debugging. SOSP
- R2: An Application-Level Kernel for Record and Replay. OSDI 2008. Zimu
Week 12: Updating Software (Dec 2)
- Ksplice: Automatic rebootless kernel updates. Eurosys 2009. Isaac
- Automatically Patching Errors in Deployed Software. SOSP 2009. Yuri
- DeVirtualizable Virtual Machines: Enabling General, Single-Node, Online Maintenance. ASPLOS 2004. Patricia
- Dynamic and Adaptive Updates to Non-Quiescent Subsystems in Commodity Operating System Kernels. Eurosys 2007. Raymond
Week 13: System Misconfiguration (Dec 9 - final report due)
- Understanding and Dealing with Operator Mistakes in Internet Services. OSDI 2004. Patrick
- AutoBash: Improving Configuration Management with Operating System Causality Analysis. SOSP 2007. Yaowei
- Automatic Misconfiguration Troubleshooting with PeerPressure. OSDI 2004. Farid
- Configuration Debugging as Search: Finding the Needle in the Haystack. OSDI 2004. Sam
- Staged Deployment in Mirage, an Integrated Software Upgrade Testing and Distribution System. SOSP 2007.
Week 14: Project Presentations (Dec 16)