Designing Modern Web-Scale Applications

ECE1724, Winter 2022
University of Toronto

Instructor: Ashvin Goel
Course Time: Wed, 4-6:30 pm
Start Date: Jan 12, 2022
Online: Zoom (until end of January 2022)
Classroom: WB219 (from Feb 9, 2022)

Quick Links

Home | Piazza Discussion | Accessing Papers | Presentation Format | Project Format | Project Ideas

Course Description

The last decade has seen an enormous shift in computing, with the rise of cloud computing and big data processing, powered by web-scale applications. This course discusses the principles, key technologies, and trends in the design of web-scale applications. The course will examine and compare the architectures and the infrastructure needed to support several types of web-scale applications. Students will learn how these applications are designed to achieve high scalability, reliability, and availability.

This is a seminar-style course in which students will be required to read, analyze, present and discuss seminal and cutting-edge research in this area. The aim is to both learn from prior work and extract exciting research questions. A course project will provide concrete experience and deeper understanding of the material.

The course covers advanced topics, broadly in the areas of distributed systems, operating systems, storage, and databases, with a focus on web-scale applications. The goal is to provide a survey of research in this area, rather than focus on a specific topic.


This course builds on the following undergraduate courses taught at University of Toronto: operating systems (ECE344) and distributed systems (ECE419). It assumes that students are knowledgeable about the contents of these courses.

Students are expected to have strong coding skills and be experienced in languages such as Java, Python, or C++. They should have experience building and debugging a significant software system. If unsure, consult with the instructor.


There are no required textbooks for this course. The optional textbooks are

Group Discussion

Course announcements and course discussion will be on the Piazza web site. Please sign up for this course on Piazza. You should post any questions about the course on Piazza. You may post to the whole class or post privately to the instructor.

Grading Policy

Grades will be based on class presentations, a class project, and class participation. There will be no final exam in this course. The grading breakdown is as follows:

Note: A student who is unable to attend a class will lose 2% for non-participation.

Class Presentation

Each week the class will cover a group of papers that focuses on a specific aspect of the course. Students are expected to read all the papers in the group that will be presented. At the beginning of the term, each paper will be assigned to a student, who will present it. Presentations will be limited to roughly 20 minutes.

More details about the presentation format are provided separately. Please read them very carefully.


There will be no assignments in this course.

Class Project

A major component of this course is devoted to a term-long project. The topic of the project is largely up to you, but to help you choose a project, a sample list of projects is provided below. This list should help students determine whether their own projects are of reasonable size and scope.

Students will generally work in groups on a term project that pushes the state of the art in the design of a web-scale application. Students will typically be required to implement and evaluate a substantial software system.

More details about the project format are provided separately. Please read them very carefully.

Project Ideas

Here is a list of project ideas.


The reading list below is tentative. Most of these papers can be accessed from the ACM web site. If you cannot access ACM articles directly, please read the instructions for accessing the papers.

Jan 12, Week 1: Introduction

  1. Overview of the course.
  2. Introduction to the course. Slides modified from Ken Birman's course on Cloud Computing 2019.
  3. A View of Cloud Computing. CACM 2010.
  4. The Dangers of Replication and a Solution. SIGMOD 1996.
  5. Efficient Readings of Papers in Science and Technology.
  6. How (and How Not) to Write a Good Systems Paper. Operating Systems Review 1983.

Jan 19, Week 2: Consensus and Coordination

  1. Overview of consensus and coordination.
  2. Overview of linearizability.
  3. In Search of an Understandable Consensus Algorithm. USENIX ATC 2014. Wen Hao
  4. ZooKeeper: Wait-free Coordination for Internet-scale Systems. USENIX ATC 2010. Alex

     Optional reading:

  1. Paxos Made Simple.
  2. Replication Management using the State Machine Approach.
  3. The Chubby lock service for loosely-coupled distributed systems. OSDI 2006.
  4. Exploiting Nil-Externality for Fast Replicated Storage. SOSP 2021.

Jan 26, Week 3: Cluster Storage Systems

  1. Overview by instructor.
  2. The Google File System. SOSP 2003. Amogh
  3. Bigtable: A Distributed Storage System for Structured Data. OSDI 2006. Kexiang

     Optional reading:

  1. The Hadoop Distributed File System. MSST 2010.
  2. Fast Crash Recovery in RAMCloud. SOSP 2011.
  3. Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency. SOSP 2011.

Feb 2, Week 4: Transactional Stores

  1. Overview by instructor.
  2. Sinfonia: A New Paradigm for Building Scalable Distributed Systems. SOSP 2007. Rui
  3. Fast Distributed Transactions and Strongly Consistent Replication for OLTP Database Systems. TODS 2014. Jianxin

      Optional reading:

  1. Caracal: Contention Management with Deterministic Concurrency Control. SOSP 2021.

Feb 9, Week 5: Wide Area Storage Systems

  1. Overview by instructor.
  2. Dynamo: Amazon's Highly Available Key-value Store. SOSP 2007. Yunhao
  3. Spanner: Google’s Globally Distributed Database. ACM TOCS 2013. Dongfang

     Optional reading:

  1. Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS. SOSP 2011.
  2. Transaction chains: achieving serializability with low latency in geo-distributed storage systems. SOSP 2013.
  3. Sharding the Shards: Managing Datastore Locality at Scale with Akkio. OSDI 2018.
  4. Performance-Optimal Read-Only Transactions. OSDI 2020.
  5. FlightTracker: Consistency across Read-Optimized Online Stores at Facebook. OSDI 2020.
  6. Shard Manager: A Generic Shard Management Framework for Geo-distributed Applications. SOSP 2021.

Feb 16, Week 6: Data Parallel Frameworks (first report due)

  1. Overview by instructor.
  2. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004. Tanya
  3. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012. Baixuan

      Optional reading:

  1. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. Eurosys 2007.
  2. Distributed aggregation for data-parallel computing: interfaces and implementations. SOSP 2009.
  3. Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks. SoCC 2014.
  4. Latency-Tolerant Software Distributed Shared Memory. USENIX ATC 2015.

Feb 23: Reading Week, No Class

Mar 2, Week 7: Scheduling

  1. Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling. Eurosys 2010. Haiqi
  2. Sparrow: Distributed, Low Latency Scheduling. SOSP 2013. Umarpeet

      Optional reading:

  1. Improving MapReduce Performance in Heterogeneous Environments. OSDI 2008.
  2. Quincy: Fair Scheduling for Distributed Computing Clusters. SOSP 2009.

Mar 9, Week 8: Resource Management

  1. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. NSDI 2011. Rishikesh
  2. Twine: A Unified Cluster Management System for Shared Infrastructure. OSDI 2020. Xinyang

     Optional reading:

  1. Apache Hadoop YARN: Yet Another Resource Negotiator. SoCC 2013.
  2. Omega: flexible, scalable schedulers for large compute clusters. Eurosys 2013.
  3. Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing. OSDI 2014.
  4. Retro: Targeted Resource Management in Multi-tenant Distributed Systems. NSDI 2015.
  5. Large-scale cluster management at Google with Borg. Eurosys 2015.
  6. RAS: Continuously Optimized Region-Wide Datacenter Resource Allocation. SOSP 2021.
  7. Solving Large-Scale Granular Resource Allocation Problems Efficiently with POP. SOSP 2021.

Mar 16, Week 9: Stream Processing (second report due)

  1. MillWheel: Fault-Tolerant Stream Processing at Internet Scale. VLDB 2013. Ray
  2. Discretized Streams: Fault-Tolerant Streaming Computation at Scale. SOSP 2013. Lan

     Optional reading:

  1. Kafka: a Distributed Messaging System for Log Processing. NetDB 2011.
  2. Naiad: A Timely Dataflow System. SOSP 2013.
  3. Storm @Twitter. SIGMOD 2014.
  4. Building a Replicated Logging System with Apache Kafka. VLDB 2015.
  5. Apache Flink™: Stream and Batch Processing in a Single Engine. 2015.
  6. Drizzle: Fast and Adaptable Stream Processing at Scale. SOSP 2017.
  7. SVE: Distributed Video Processing at Facebook Scale. SOSP 2017.
  8. Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark. SIGMOD 2018.
  9. Bladerunner: Stream Processing at Scale for a Live View of Backend Data Mutations at the Edge. SOSP 2021.

Mar 23, Week 10: Serverless Processing

  1. Fault-tolerant and transactional stateful serverless workflows. OSDI 2020. Zongxin
  2. Boki: Stateful Serverless Computing with Shared Logs. SOSP 2021. Xueqi

      Optional reading:

  1. Realizing the fault-tolerance promise of cloud storage using locks with intent. OSDI 2016.
  2. Faster and Cheaper Serverless Computing on Harvested Resources. SOSP 2021.
  3. FaasCache: keeping serverless computing alive with greedy-dual caching. ASPLOS 2021.

Mar 30, Week 11: Graph Processing

  1. Pregel: A System for Large-Scale Graph Processing. SIGMOD 2010. Junyuan
  2. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. OSDI 2012. Ashvin

     Optional reading:

  1. A Lightweight Infrastructure for Graph Analytics. SOSP 2013.
  2. GraphX: Graph Processing in a Distributed Dataflow Framework. OSDI 2014.
  3. Chaos: Scale-out Graph Processing from Secondary Storage. SOSP 2015.
  4. Gemini: A Computation-Centric Distributed Graph Processing System. OSDI 2016.
  5. Scalability! But at what COST?. HotOS 2015.

Apr 6, Week 12: Machine Learning Systems

  1. Scaling Distributed Machine Learning with the Parameter Server. OSDI 2014. Sheharyar
  2. TensorFlow: A System for Large-Scale Machine Learning. OSDI 2016. Hanyu

     Optional reading:

  1. Project Adam: Building an Efficient and Scalable Deep Learning Training System. OSDI 2014.
  2. Ray: A Distributed Framework for Emerging AI Applications. OSDI 2018.
  3. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. OSDI 2018.
  4. Gandiva: Introspective Cluster Scheduling for Deep Learning. OSDI 2018.
  5. Pretzel: Opening the Black Box of Machine Learning Prediction Serving Systems. OSDI 2018.
  6. PipeDream: Generalized Pipeline Parallelism for DNN Training. SOSP 2019.

Apr 13, Week 13: Instructor Away, No Class

Apr 20, Week 14: Project Presentations (final report due)

Project presentations will be held during class hours. Each presentation should be 15 minutes long, followed by 5 minutes of Q&A. Final project reports are also due.