Quick Links
Course Description
The last decade has seen an enormous shift in computing, with web-scale applications driving the rise of cloud computing and big data processing. This course discusses the principles, key technologies and trends in the design of web-scale applications. The course will examine and compare the architectures and the infrastructure needed to support several types of web-scale applications. Students will learn how these applications are designed to achieve high scalability, reliability and availability.
Students are expected to read, analyze and discuss seminal and cutting-edge research in this area. The aim is to both learn from prior work and extract exciting research ideas. A course project will provide concrete experience and deeper understanding of the material.
The course covers advanced topics, broadly in the areas of distributed systems, operating systems, storage and databases, with a focus on web-scale applications. The goal is to survey research in this area, rather than focus on a specific topic.
Prerequisites
This course builds on the following undergraduate courses taught at University of Toronto: operating systems (ECE344) and distributed systems (ECE419). It assumes that students are knowledgeable about the contents of these courses. If you have not taken these or similar courses, you will not have sufficient background to take this course.
Students are expected to have strong coding skills and be experienced in languages such as Java, Python, and C++. They should have experience building and debugging a significant software system. If unsure, consult with the instructor.
Textbooks
There are no required textbooks for this course. The optional textbooks are
- Distributed Systems: Concepts and Design (Fifth Edition), by George Coulouris, Jean Dollimore Tim Kindberg and Gordon Blair, 2011.
- Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems, by Martin Kleppmann, 2017.
- Guide to Reliable Distributed Systems: Building High-Assurance Applications and Cloud-Hosted Services, by Kenneth P. Birman, 2012.
Course Announcements and Questions
Course announcements will be made on Quercus Announcements. You should post any general questions about the course on Quercus Discussions. If you need to contact the instructor directly, please send an email.
Grading Policy
Grades will be based on quizzes and a class project. This year, the course has two grading options:
- Course Project: With this option, the student will do a course project and two quizzes. This option is required for MASc and PhD students.
- 2 Quizzes: 50%
- Class project: 50%
- All Quizzes: With this option, the student will do four quizzes and no course project. This option is suggested for MEng students.
- 4 Quizzes: 100% (each is 25%)
Course Readings
Each week this class will cover a set of papers that focuses on a specific aspect of the course. These papers are seminal or recent research papers, broadly in the areas of distributed systems, operating systems, storage, and databases. Students are expected to read typically one or two papers each week. The instructor will present these papers and then we will discuss these papers. Students are expected to take part in the discussion and so they are expected to read the papers before the class. Please expect to spend 4-6 hours per week to read the papers critically.
Class Project and Project Ideas
A major component of this course is devoted to a term-long project. The topic of the project is largely up to you, but to help you choose a project, a sample list of projects is provided below. This list should help students determine whether their own projects are of reasonable size and scope.
Students will generally work in groups on a term project that pushes the state-of-the-art in the design of a web-scale application. Students will typically be required to implement and evaluate a significantly-large software system.
The project deliverables are as follows:
- Project Description (5%): 1 page. Due
Oct 2Oct 7, 2024. - Project Status Report (10%): 3-4 pages. Due
Nov 9Nov 16, 2024. - Project Final Report (35%): 8-10 pages. Due
Dec 4Dec 15, 2024.
We will be holding project presentations on Dec 4, 2024 and Dec 11, 2024 during class. Each presentation should be roughly 15 minutes long, followed by 5 minutes of Q/A.
Please read more details about the project format.
Here is a list of project ideas.
Note: students should not do the course project if they choose the All Quizzes option.
Quizzes
There will be four quizzes in the course, held roughly every 3 weeks. Each quiz will cover topics covered in the last 3 weeks. Students doing the course project can choose to do any two quizzes. Otherwise, students are required to take the four quizzes.
The quizzes will be held on Tuesdays, 7-9 pm Wednesdays in class from 4:10-5:40 pm, on the following dates:
Sep 24 Oct 2, 2024
Oct 15 Oct 16, 2024
Nov 12 Nov 13, 2024
Dec 3 Dec 4, 2024
The format of the quiz will be announced in class.
There will be no assignments in this course.
There will be no final exam in this course.
Lecture Material
The instructor will make lecture material available here as topics are covered in class.
Reading List
This is a list of papers that provides background on the material that will be covered in the course. We will cover a subset of this material in class. This background material will be especially helpful when choosing your course project, for motivating your project and for writing about the research work related to your project.
Most of these papers can be accessed from the ACM web site. If you cannot access ACM articles directly, please read the following instructions for accessing the papers.
Introduction
- A View of Cloud Computing. CACM 2010.
- The Dangers of Replication and a Solution. SIGMOD 1996.
- Efficient Readings of Papers in Science and Technology.
- How (and How Not) to Write a Good Systems Paper. Operating Systems Review 1983.
Consensus and Coordination
- Paxos Made Simple.
- Replication Management using the State Machine Approach.
- The Chubby lock service for loosely-coupled distributed systems. OSDI 2006.
- Exploiting Nil-Externality for Fast Replicated Storage. SOSP 2021.
Cluster Storage Systems
- The Hadoop Distributed File System. MSST 2010.
- Fast Crash Recovery in RAMCloud. SOSP 2011.
- Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency. SOSP 2011.
Transactional Stores
- Fast Distributed Transactions and Strongly Consistent Replication for OLTP Database Systems. TODS 2014.
- Caracal: Contention Management with Deterministic Concurrency Control. SOSP 2021.
Wide Area Storage Systems
- Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS. SOSP 2011.
- Transaction chains: achieving serializability with low latency in geo-distributed storage systems. SOSP 2013.
- Sharding the Shards: Managing Datastore Locality at Scale with Akkio. OSDI 2018.
- Performance-Optimal Read-Only Transactions. OSDI 2020.
- FlightTracker: Consistency across Read-Optimized Online Stores at Facebook. OSDI 2020.
- Shard Manager: A Generic Shard Management Framework for Geo-distributed Applications. SOSP 2021.
Data Parallel Frameworks
- Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. Eurosys 2007.
- Distributed aggregation for data-parallel computing: interfaces and implementations. SOSP 2009.
- Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks. SoCC 2014.
- Latency-Tolerant Software Distributed Shared Memory. USENIX ATC 2015.
Scheduling and Resource Management
- Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling. Eurosys 2010.
- Sparrow: Distributed, Low Latency Scheduling. SOSP 2013.
- Improving MapReduce Performance in Heterogeneous Environments. OSDI 2008.
- Quincy: Fair Scheduling for Distributed Computing Clusters. SOSP 2009.
- Apache Hadoop YARN: Yet Another Resource Negotiator. SoCC 2013.
- Omega: flexible, scalable schedulers for large compute clusters. Eurosys 2013.
- Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing. OSDI 2014.
- Retro: Targeted Resource Management in Multi-tenant Distributed Systems. NSDI 2015.
- Large-scale cluster management at Google with Borg. Eurosys 2015.
- RAS: Continuously Optimized Region-Wide Datacenter Resource Allocation. SOSP 2021.
- Solving Large-Scale Granular Resource Allocation Problems Efficiently with POP. SOSP 2021.
Stream Processing
- Discretized Streams: Fault-Tolerant Streaming Computation at Scale. SOSP 2013.
- Kafka: a Distributed Messaging System for Log Processing. NetDB 2011.
- Naiad: A Timely Dataflow System. SOSP 2013.
- Storm @Twitter. SIGMOD 2014.
- Building a Replicated Logging System with Apache Kafka. VLDB 2015.
- Apache Flinkā¢: Stream and Batch Processing in a Single Engine. 2015.
- Drizzle: Fast and Adaptable Stream Processing at Scale. SOSP 2017.
- SVE: Distributed Video Processing at Facebook Scale. SOSP 2017.
- Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark. SIGMOD 2018.
- Bladerunner: Stream Processing at Scale for a Live View of Backend Data Mutations at the Edge. SOSP 2021.
Graph Processing
- A Lightweight Infrastructure for Graph Analytics. SOSP 2013.
- GraphX: Graph Processing in a Distributed Dataflow Framework. OSDI 2014.
- Chaos: Scale-out Graph Processing from Secondary Storage. SOSP 2015.
- Gemini: A Computation-Centric Distributed Graph Processing System. OSDI 2016.
- Scalability! But at what COST?. HotOS 2015.
Machine Learning Systems
- Scaling Distributed Machine Learning with the Parameter Server. OSDI 2014.
- TensorFlow: A System for Large-Scale Machine Learning. OSDI 2016.
- Project Adam: Building an Efficient and Scalable Deep Learning Training System. OSDI 2014.
- Ray: A Distributed Framework for Emerging AI Applications. OSDI 2018.
- TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. OSDI 2018.
- Gandiva: Introspective Cluster Scheduling for Deep Learning. OSDI 2018.
- Pretzel: Opening the Black Box of Machine Learning Prediction Serving Systems. OSDI 2018.
- PipeDream: Generalized Pipeline Parallelism for DNN Training. SOSP 2019.