Designing Modern Web-Scale Applications

ECE1724, Winter 2022
University of Toronto

Instructor: Ashvin Goel
Course Time: Wed, 4-6:30 pm
Start Date: Jan 12, 2022
Online: Zoom (until end of January 2022)
Classroom: WB219 (from Feb 9, 2022)

Quick Links

Home | Piazza Discussion | Accessing Papers | Presentation Format | Project Format | Project Ideas

Course Description

The last decade has seen an enormous shift in computing, with the rise of cloud computing and big data processing, powered by web-scale applications. This course discusses the principles, key technologies, and trends in the design of web-scale applications. The course will examine and compare the architectures and the infrastructure needed to support several types of web-scale applications. Students will learn how these applications are designed to achieve high scalability, reliability, and availability.

This is a seminar-style course in which students will be required to read, analyze, present and discuss seminal and cutting-edge research in this area. The aim is to both learn from prior work and extract exciting research questions. A course project will provide concrete experience and deeper understanding of the material.

The course covers advanced topics, broadly in the areas of distributed systems, operating systems, storage, and databases, with a focus on web-scale applications. The goal is to provide a survey of research in this area, rather than focus on a specific topic.


This course builds on the following undergraduate courses taught at University of Toronto: operating systems (ECE344) and distributed systems (ECE419). It assumes that students are knowledgeable about the contents of these courses.

Students are expected to have strong coding skills and be experienced in languages such as Java, Python, or C++. They should have experience building and debugging a significant software system. If unsure, consult with the instructor.


There are no required textbooks for this course. The optional textbooks are

Group Discussion

Course announcements and course discussion will be on the Piazza web site. Please sign up for this course on Piazza. You should post any questions about the course on Piazza. You may post to the whole class or post privately to the instructor.

Grading Policy

Grades will be based on class presentations, a class project, and class participation. There will be no final exam in this course. The grading breakdown is as follows:

Note: A student who is unable to attend a class will lose 2% for non-participation.

Class Presentation

Each week the class will cover a group of papers that focuses on a specific aspect of the course. Students are expected to read all the papers in the group that will be presented. At the beginning of the term, each paper will be assigned to a student, who will present it. Presentations will be limited to roughly 20 minutes.

More details about the presentation format are provided separately. Please read them very carefully.


There will be no assignments in this course.

Class Project

A major component of this course is devoted to a term-long project. The topic of the project is largely up to you, but to help you choose a project, a sample list of projects is provided below. This list should help students determine whether their own projects are of reasonable size and scope.

Students will generally work in groups on a term project that pushes the state of the art in the design of a web-scale application. Students will typically be required to implement and evaluate a substantial software system.

More details about the project format are provided separately. Please read them very carefully.

Project Ideas

Here is a list of project ideas.


The reading list below is tentative. Most of these papers can be accessed from the ACM web site. If you cannot access ACM articles directly, please read the instructions for accessing the papers.

Jan 12, Week 1: Introduction

  1. Overview of the course.
  2. Introduction to the course. Slides modified from Ken Birman's course on Cloud Computing 2019.
  3. A View of Cloud Computing. CACM 2010.
  4. The Dangers of Replication and a Solution. SIGMOD 1996.
  5. Efficient Readings of Papers in Science and Technology.
  6. How (and How Not) to Write a Good Systems Paper. Operating Systems Review 1983.

Jan 19, Week 2: Consensus and Coordination

  1. Overview of consensus and coordination.
  2. Overview of linearizability.
  3. In Search of an Understandable Consensus Algorithm. USENIX ATC 2014. Wen Hao
  4. ZooKeeper: Wait-free Coordination for Internet-scale Systems. USENIX ATC 2010. Alex

     Optional reading:

  1. Paxos Made Simple.
  2. Replication Management using the State Machine Approach.
  3. The Chubby lock service for loosely-coupled distributed systems. OSDI 2006.
  4. Exploiting Nil-Externality for Fast Replicated Storage. SOSP 2021.

Jan 26, Week 3: Cluster Storage Systems

  1. Overview by instructor.
  2. The Google File System. SOSP 2003. Amogh
  3. Bigtable: A Distributed Storage System for Structured Data. OSDI 2006. Kexiang

     Optional reading:

  1. The Hadoop Distributed File System. MSST 2010.
  2. Fast Crash Recovery in RAMCloud. SOSP 2011.
  3. Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency. SOSP 2011.

Feb 2, Week 4: Transactional Stores

  1. Overview by instructor.
  2. Sinfonia: A New Paradigm for Building Scalable Distributed Systems. SOSP 2007. Rui
  3. Fast Distributed Transactions and Strongly Consistent Replication for OLTP Database Systems. TODS 2014. Jianxin

      Optional reading:

  1. Caracal: Contention Management with Deterministic Concurrency Control. SOSP 2021.

Feb 9, Week 5: Wide Area Storage Systems

  1. Overview by instructor.
  2. Dynamo: Amazon's Highly Available Key-value Store. SOSP 2007. Yunhao
  3. Spanner: Google’s Globally Distributed Database. ACM TOCS 2013. Dongfang

     Optional reading:

  1. Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS. SOSP 2011.
  2. Transaction chains: achieving serializability with low latency in geo-distributed storage systems. SOSP 2013.
  3. Sharding the Shards: Managing Datastore Locality at Scale with Akkio. OSDI 2018.
  4. Performance-Optimal Read-Only Transactions. OSDI 2020.
  5. FlightTracker: Consistency across Read-Optimized Online Stores at Facebook. OSDI 2020.
  6. Shard Manager: A Generic Shard Management Framework for Geo-distributed Applications. SOSP 2021.

Feb 16, Week 6: Data Parallel Frameworks (first report due)

  1. Overview by instructor.
  2. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004. Tanya
  3. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012. Baixuan

      Optional reading:

  1. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. Eurosys 2007.
  2. Distributed aggregation for data-parallel computing: interfaces and implementations. SOSP 2009.
  3. Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks. SoCC 2014.
  4. Latency-Tolerant Software Distributed Shared Memory. USENIX ATC 2015.

Feb 23: Reading Week, No Class

Mar 2, Week 7: Scheduling

  1. Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling. Eurosys 2010. Haiqi
  2. Sparrow: Distributed, Low Latency Scheduling. SOSP 2013. Umarpeet

      Optional reading:

  1. Improving MapReduce Performance in Heterogeneous Environments. OSDI 2008.
  2. Quincy: Fair Scheduling for Distributed Computing Clusters. SOSP 2009.

Mar 9, Week 8: Resource Management

  1. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. NSDI 2011. Rishikesh
  2. Twine: A Unified Cluster Management System for Shared Infrastructure. OSDI 2020. Xinyang

     Optional reading:

  1. Apache Hadoop YARN: Yet Another Resource Negotiator. SoCC 2013.
  2. Omega: flexible, scalable schedulers for large compute clusters. Eurosys 2013.
  3. Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing. OSDI 2014.
  4. Retro: Targeted Resource Management in Multi-tenant Distributed Systems. NSDI 2015.
  5. Large-scale cluster management at Google with Borg. Eurosys 2015.
  6. RAS: Continuously Optimized Region-Wide Datacenter Resource Allocation. SOSP 2021.
  7. Solving Large-Scale Granular Resource Allocation Problems Efficiently with POP. SOSP 2021.

Mar 16, Week 9: Stream Processing (second report due)

  1. MillWheel: Fault-Tolerant Stream Processing at Internet Scale. VLDB 2013. Ray
  2. Discretized Streams: Fault-Tolerant Streaming Computation at Scale. SOSP 2013. Lan

     Optional reading:

  1. Kafka: a Distributed Messaging System for Log Processing. NetDB 2011.
  2. Naiad: A Timely Dataflow System. SOSP 2013.
  3. Storm @Twitter. SIGMOD 2014.
  4. Building a Replicated Logging System with Apache Kafka. VLDB 2015.
  5. Apache Flink™: Stream and Batch Processing in a Single Engine. 2015.
  6. Drizzle: Fast and Adaptable Stream Processing at Scale. SOSP 2017.
  7. SVE: Distributed Video Processing at Facebook Scale. SOSP 2017.
  8. Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark. SIGMOD 2018.
  9. Bladerunner: Stream Processing at Scale for a Live View of Backend Data Mutations at the Edge. SOSP 2021.

Mar 23, Week 10: Serverless Processing

  1. Fault-tolerant and transactional stateful serverless workflows. OSDI 2020. Zongxin
  2. Boki: Stateful Serverless Computing with Shared Logs. SOSP 2021. Xueqi

      Optional reading:

  1. Realizing the fault-tolerance promise of cloud storage using locks with intent. OSDI 2016.
  2. Faster and Cheaper Serverless Computing on Harvested Resources. SOSP 2021.
  3. FaasCache: keeping serverless computing alive with greedy-dual caching. ASPLOS 2021.

Mar 30, Week 11: Graph Processing

  1. Pregel: A System for Large-Scale Graph Processing. SIGMOD 2010. Junyuan
  2. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. OSDI 2012. Ashvin

     Optional reading:

  1. A Lightweight Infrastructure for Graph Analytics. SOSP 2013.
  2. GraphX: Graph Processing in a Distributed Dataflow Framework. OSDI 2014.
  3. Chaos: Scale-out Graph Processing from Secondary Storage. SOSP 2015.
  4. Gemini: A Computation-Centric Distributed Graph Processing System. OSDI 2016.
  5. Scalability! But at what COST?. HotOS 2015.

Apr 6, Week 12: Machine Learning Systems

  1. Scaling Distributed Machine Learning with the Parameter Server. OSDI 2014. Sheharyar
  2. TensorFlow: A System for Large-Scale Machine Learning. OSDI 2016. Hanyu

     Optional reading:

  1. Project Adam: Building an Efficient and Scalable Deep Learning Training System. OSDI 2014.
  2. Ray: A Distributed Framework for Emerging AI Applications. OSDI 2018.
  3. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. OSDI 2018.
  4. Gandiva: Introspective Cluster Scheduling for Deep Learning. OSDI 2018.
  5. Pretzel: Opening the Black Box of Machine Learning Prediction Serving Systems. OSDI 2018.
  6. PipeDream: Generalized Pipeline Parallelism for DNN Training. SOSP 2019.

Apr 13, Week 13: Instructor Away, No Class

Apr 20, Week 14: Project Presentations (final report due)

Project presentations will be held during class hours. Each presentation should be 15 minutes long, followed by 5 minutes of Q&A. Final project reports are also due.