Modern computer systems have become tightly intertwined with our daily lives. However, they are failure-prone and difficult to manage, and thus hardly dependable. Today, these problems dominate the total cost of ownership of computer systems, and unfortunately they have no simple solutions. There is a realization that these problems cannot be decisively solved but are ongoing facts of life that must be dealt with regularly. To do so, systems should be designed to detect, isolate and recover from these problems.
This advanced graduate-level course focuses on dependability in software systems and examines current research that aims to address challenges caused by software and hardware bugs and software misconfiguration. Students are expected to read and critique recent research papers in operating systems that cover these areas. They are also expected to work on a research project and make class presentations. While there are no specific prerequisites for this course, students who have taken undergraduate or graduate courses in operating systems, networks and distributed systems will have an edge.
There are no required textbooks for this course. The optional textbooks are:
- Modern Operating Systems (Third Edition), by Andrew S. Tanenbaum. Published by Prentice Hall, 2008.
- Distributed Systems: Concepts and Design (Fourth Edition), by George Coulouris, Jean Dollimore and Tim Kindberg. Published by Addison Wesley, 2005.
Please subscribe to the class mailing list by joining the UofT ECE1724 Google group. You will need a Google account. Subscribing to the group requires the instructor's approval.
The instructor will use this group to send instructions and reminders. To email the whole class, send your message to this list. If you have a specific question for the instructor, please email the instructor directly.
Grades will be based on class presentations, a class project, and class participation. There will be no final exam in this course. The grading breakdown is as follows:
- Class presentation: 30%
- Class project: 50%
- Class participation: 20%
Note: If a student is unable to attend a class, he or she will lose 2% for non-participation.
Each week this class will cover a group of papers that focuses on a specific aspect of the course. Students are expected to read all the papers in the group that will be presented. At the beginning of the term, each paper will be assigned to a student who will be presenting the paper. Presentations will be limited to roughly 20 minutes.
See the Presentation Format page for more details. Please read it very carefully.
There will be no assignments in this course.
A major component of this course is devoted to a term-long project. The topic of the project is largely up to you, but to help you choose a project, a sample list of projects is provided below. This list should help students determine whether their own projects are of reasonable size and scope.
See the Project Format page for more details. Please read it very carefully.
Week 1: Introduction (Jan 9)
- Why Do Computers Stop and What Can Be Done About It? SRDS 1986.
- Broad New OS Research: Challenges and Opportunities. HOTOS 2005.
- Introduction to Dependable Software Systems by Instructor.
- Efficient Readings of Papers in Science and Technology.
- How (and How Not) to Write a Good Systems Paper. Operating Systems Review 1983.
Week 2: Detecting Races (Jan 16)
- Eraser: A Dynamic Data Race Detector for Multi-Threaded Programs. SOSP 1997. Severin
- Effective Data-Race Detection for the Kernel. OSDI 2010. Wei
- Detecting and Surviving Data Races using Complementary Schedules. SOSP 2011. Akshay
- RacerX: Effective, Static Detection of Race Conditions and Deadlocks. SOSP 2003.
- Bypassing Races in Live Applications with Execution Filters. OSDI 2010.
- Pervasive Detection of Process Races in Deployed Systems. SOSP 2011.
Week 3: More Races (Jan 23)
- Operating Systems Transactions. SOSP 2009. Wael
- Ad Hoc Synchronization Considered Harmful. OSDI 2010.
- CTrigger: Exposing Atomicity Violation Bugs from Their Hiding Places. ASPLOS 2009. Eric
- Finding and Reproducing Heisenbugs in Concurrent Programs. OSDI 2008.
- Deadlock Immunity: Enabling Systems to Defend Against Deadlocks. OSDI 2008.
Week 4: Detecting Bugs (Jan 30)
- Bugs as Deviant Behavior: A General Approach to Inferring Errors in Systems Code. SOSP 2001.
- eXplode: A lightweight, general system for finding serious storage system errors. OSDI 2006. George
- A Randomized Scheduler with Probabilistic Guarantees of Finding Bugs. ASPLOS 2010. Antoine
- Using Model Checking to Find Serious File System Errors. OSDI 2004.
- Triage: Diagnosing Production Run Failures at the User's Site. SOSP 2007.
- Hang Analysis: Fighting Responsiveness Bugs. Eurosys 2008.
Week 5: Testing and Debugging (Feb 6 - first report due)
- R2: An Application-Level Kernel for Record and Replay. OSDI 2008. Wael
- ODR: Output-Deterministic Replay for Multicore Debugging. SOSP 2009.
- Execution Synthesis: A Technique for Automated Software Debugging. Eurosys 2010. David
- KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs. OSDI 2008.
- Anomaly-Based Bug Prediction, Isolation, and Validation: An Automated Approach for Software Debugging. ASPLOS 2009.
Week 6: Fault Isolation (Feb 13)
- Efficient Software-Based Fault Isolation. SOSP 1993.
- Fast Byte-granularity Software Fault Isolation. SOSP 2009. Ioan
- Software fault isolation with API integrity and multi-principal modules. SOSP 2011. Akshay
- Hive: Fault Containment for Shared-Memory Multiprocessors. SOSP 1995.
- Dealing With Disaster: Surviving Misbehaved Kernel Extensions. OSDI 1996.
- Improving the Reliability of Commodity Operating Systems. SOSP 2003.
- CuriOS: Improving Reliability through Operating System Structure. OSDI 2008.
Week 7: No Class (Feb 20)
Week 8: Generic Failure Recovery (Feb 27 - second report due)
- Exploring Failure Transparency and the Limits of Generic Recovery. OSDI 2000. Daniel
- Rx: Treating Bugs As Allergies---A Safe Method to Survive Software Failures. SOSP 2005.
- ASSURE: Automatic Software Self-healing Using REscue points. ASPLOS 2009. Eric
Week 9: No Class (Mar 5)
Instructor is at a conference.
Week 10: Storage Failure Recovery (Mar 12)
- Improving File System Reliability with I/O Shepherding. SOSP 2007. Mike
- Membrane: Operating System Support for Restartable File Systems. FAST 2010.
- Fast Crash Recovery in RAMCloud. SOSP 2011. David
Week 11: Application and OS Failure Recovery (Mar 19)
- Undo for Operators: Building an Undoable E-mail Store. Usenix 2003. Daniel
- Microreboot - A Technique for Cheap Recovery. OSDI 2004. Andy
- Recovery Domains: An Organizing Principle for Recoverable Operating Systems. ASPLOS 2009.
- Tolerating Hardware Device Failures in Software. SOSP 2009. Presented next week. Andy
Week 12: Fault Tolerance (Mar 26)
- Hypervisor-based Fault-tolerance. SOSP 1995.
- Remus: High Availability via Asynchronous Virtual Machine Replication. NSDI 2008. Mike
Week 13: Updating Software (Apr 2)
- Dynamic and Adaptive Updates to Non-Quiescent Subsystems in Commodity Operating System Kernels. Eurosys 2007.
- Ksplice: Automatic rebootless kernel updates. Eurosys 2009. Antoine
- Automatically Patching Errors in Deployed Software. SOSP 2009. Ioan
Week 14: System Misconfiguration (Apr 9 - final report due)
- Automating Configuration Troubleshooting with Dynamic Information Flow Analysis. OSDI 2010.
- Enabling Configuration-Independent Automation by Non-Expert Users. OSDI 2010. Wei
- An Empirical Study on Configuration Errors in Commercial and Open Source Systems. SOSP 2011. George
- Understanding and Dealing with Operator Mistakes in Internet Services. OSDI 2004.
- Configuration Debugging as Search: Finding the Needle in the Haystack. OSDI 2004.
- Automatic Misconfiguration Troubleshooting with PeerPressure. OSDI 2004.
- Staged Deployment in Mirage, an Integrated Software Upgrade Testing and Distribution System. SOSP 2007.
- AutoBash: Improving Configuration Management with Operating System Causality Analysis. SOSP 2007.
- Barricade: Defending Systems Against Operator Mistakes. Eurosys 2010.
- Fingerprinting the Datacenter: Automated Classification of Performance Crises. Eurosys 2010.