Designing Modern Web-Scale Applications

ECE1724, Winter 2021
University of Toronto

Instructor: Ashvin Goel
Course Time: Fri, 1-3 pm
Start Date: Jan 15, 2021
Zoom URL

Quick Links

Home | Piazza Discussion | Accessing Papers | Presentation Format | Project Format | Project Ideas

Course Description

The last decade has seen an enormous shift in computing, with the rise of cloud computing and big data processing, powered by web-scale applications. This course discusses the principles, key technologies, and trends in the design of web-scale applications. The course will examine and compare the architectures and implementations of several types of web-scale applications. Students will learn how these applications are designed to achieve scalability, reliability, and high availability.

This is a seminar-style course in which students will be required to read, analyze, present and discuss seminal and cutting-edge research in this area. The aim is to both learn from prior work and extract exciting research questions. A course project will provide concrete experience and deeper understanding of the material.

The course covers advanced topics, broadly in the area of distributed systems, storage, and databases, with a focus on web-scale applications. The goal is to provide a survey of research in this area rather than to focus on a specific topic.

Prerequisites

This course builds on the following undergraduate courses taught at the University of Toronto: operating systems (ECE344) and distributed systems (ECE419). It assumes that students are knowledgeable about the contents of these courses.

Students are expected to have strong coding skills and be experienced in a language like Java, Python, or C++. They should have experience building and debugging a significant software system. If unsure, consult with the instructor.

Textbooks

There are no required textbooks for this course. The optional textbooks are

Group Discussion

Course announcements and course discussion will be on the Piazza web site. Please sign up for this course on Piazza. You should post any questions about the course on Piazza. You may post to the whole class or post privately to the instructor.

Grading Policy

Grades will be based on class presentations, a class project, and class participation. There will be no final exam in this course. The grading breakdown is as follows:

Note: If a student is unable to attend a class, he or she will lose 2% for non-participation.

Class Presentation

Each week, the class will cover a group of papers that focuses on a specific aspect of the course. Students are expected to read all the papers in the group that will be presented. At the beginning of the term, each paper will be assigned to a student, who will present it in class. Presentations will be limited to roughly 20 minutes.

More details about the presentation format are available under Presentation Format in the Quick Links above. Please read them very carefully.

Assignments

There will be no assignments in this course.

Class Project

A major component of this course is devoted to a term-long project. The topic of the project is largely up to you, but to help you choose a project, a sample list of projects is provided below. This list should help students determine whether their own projects are of reasonable size and scope.

Students will generally work in groups on a term project that pushes the state of the art in the design of a web-scale application. Students will typically be required to implement and evaluate a substantial software system.

More details about the project format are available under Project Format in the Quick Links above. Please read them very carefully.

Project Ideas

Here is a list of project ideas.

Readings

This is a tentative list. Most of these papers can be accessed from the ACM web site. If you cannot access ACM articles directly, please follow the instructions for accessing the papers (see Accessing Papers in the Quick Links above).

Jan 15, Week 1: Introduction

  1. Overview of the course.
  2. Introduction to the course. Slides modified from Ken Birman's course on Cloud Computing 2019.
  3. Lecture video.
  4. A View of Cloud Computing. CACM 2010.
  5. The Dangers of Replication and a Solution. SIGMOD 1996.
  6. Efficient Reading of Papers in Science and Technology.
  7. How (and How Not) to Write a Good Systems Paper. Operating Systems Review 1983.

Jan 22, Week 2: Consensus and Coordination

  1. Overview of consensus and coordination.
  2. Overview of linearizability.
  3. Lecture video.
  4. In Search of an Understandable Consensus Algorithm. USENIX ATC 2014. Shirley
  5. ZooKeeper: Wait-free Coordination for Internet-scale Systems. USENIX ATC 2010. Daniel

     Optional reading:

  1. Paxos Made Simple.
  2. Replication Management using the State Machine Approach.
  3. The Chubby lock service for loosely-coupled distributed systems. OSDI 2006.

Jan 29, Week 3: Cluster Storage Systems

  1. Overview by instructor.
  2. The Google File System. SOSP 2003. Weiqi
  3. Bigtable: A Distributed Storage System for Structured Data. OSDI 2006. Corey

     Optional reading:

  1. The Hadoop Distributed File System. MSST 2010.
  2. Fast Crash Recovery in RAMCloud. SOSP 2011.

Feb 5, Week 4: Transactional Stores

  1. Overview by instructor.
  2. Sinfonia: A New Paradigm for Building Scalable Distributed Systems. SOSP 2007. Daniel
  3. Fast Distributed Transactions and Strongly Consistent Replication for OLTP Database Systems. TODS 2014. Ziwen

     Optional reading:

  1. Transaction chains: achieving serializability with low latency in geo-distributed storage systems. SOSP 2013.

Feb 12, Week 5: Wide Area Storage Systems (first report due)

  1. Overview by instructor.
  2. Dynamo: Amazon's Highly Available Key-value Store. SOSP 2007. Rita
  3. Spanner: Google’s Globally Distributed Database. ACM TOCS 2013. Minghao

     Optional reading:

  1. Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency. SOSP 2011.
  2. Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS. SOSP 2011.
  3. Sharding the Shards: Managing Datastore Locality at Scale with Akkio. OSDI 2018.
  4. Performance-Optimal Read-Only Transactions. OSDI 2020.
  5. FlightTracker: Consistency across Read-Optimized Online Stores at Facebook. OSDI 2020.

Feb 19: Reading Week, No Class

Feb 26, Week 6: Remote Memory

  1. AIFM: High-Performance, Application-Integrated Far Memory. OSDI 2020. Shirley
  2. Semeru: A Memory-Disaggregated Managed Runtime. OSDI 2020. Qiwei

Mar 5, Week 7: Data Parallel Frameworks

  1. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004. Weiqi
  2. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012. Yifan

     Optional reading:

  1. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. Eurosys 2007.
  2. Distributed aggregation for data-parallel computing: interfaces and implementations. SOSP 2009.
  3. Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks. SoCC 2014.
  4. Latency-Tolerant Software Distributed Shared Memory. USENIX ATC 2015.

Mar 12, Week 8: Scheduling

  1. Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling. Eurosys 2010. Mohammed
  2. Sparrow: Distributed, Low Latency Scheduling. SOSP 2013. Simona

     Optional reading:

  1. Improving MapReduce Performance in Heterogeneous Environments. OSDI 2008.
  2. Quincy: Fair Scheduling for Distributed Computing Clusters. SOSP 2009.

Mar 19, Week 9: Resource Management (second report due)

  1. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. NSDI 2011. Eric
  2. Twine: A Unified Cluster Management System for Shared Infrastructure. OSDI 2020. Dison

     Optional reading:

  1. Apache Hadoop YARN: Yet Another Resource Negotiator. SoCC 2013.
  2. Omega: flexible, scalable schedulers for large compute clusters. Eurosys 2013.
  3. Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing. OSDI 2014.
  4. Retro: Targeted Resource Management in Multi-tenant Distributed Systems. NSDI 2015.
  5. Large-scale cluster management at Google with Borg. Eurosys 2015.

Mar 26, Week 10: Stream Processing

  1. Storm @Twitter. SIGMOD 2014. Wenjun
  2. MillWheel: Fault-Tolerant Stream Processing at Internet Scale. VLDB 2013. Mohammed

     Optional reading:

  1. Kafka: a Distributed Messaging System for Log Processing. NetDB 2011.
  2. Naiad: A Timely Dataflow System. SOSP 2013.
  3. Discretized Streams: Fault-Tolerant Streaming Computation at Scale. SOSP 2013.
  4. Building a Replicated Logging System with Apache Kafka. VLDB 2015.
  5. Apache Flink™: Stream and Batch Processing in a Single Engine. 2015.
  6. Drizzle: Fast and Adaptable Stream Processing at Scale. SOSP 2017.
  7. SVE: Distributed Video Processing at Facebook Scale. SOSP 2017.
  8. Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark. SIGMOD 2018.

Apr 2, Week 11: Graph Processing

  1. Pregel: A System for Large-Scale Graph Processing. SIGMOD 2010. Karthik
  2. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. OSDI 2012. Minghao

     Optional reading:

  1. A Lightweight Infrastructure for Graph Analytics. SOSP 2013.
  2. GraphX: Graph Processing in a Distributed Dataflow Framework. OSDI 2014.
  3. Chaos: Scale-out Graph Processing from Secondary Storage. SOSP 2015.
  4. Gemini: A Computation-Centric Distributed Graph Processing System. OSDI 2016.
  5. Scalability! But at what COST? HotOS 2015.

Apr 9, Week 12: Graph Mining

  1. Arabesque: A System for Distributed Graph Mining. SOSP 2015. Chenghao
  2. Fractal: A General-Purpose Graph Pattern Mining System. SIGMOD 2019. Ziwen

     Optional reading:

  1. RStream: Marrying Relational Algebra with Streaming for Efficient Graph Mining on A Single Machine. OSDI 2018.
  2. AutoMine: Harmonizing High-Level Abstraction and High Performance for Graph Mining. SOSP 2019.

Apr 16, Week 13: Machine Learning Systems (final report due)

  1. Scaling Distributed Machine Learning with the Parameter Server. OSDI 2014. Simona
  2. TensorFlow: A System for Large-Scale Machine Learning. OSDI 2016. Karthik

     Optional reading:

  1. Project Adam: Building an Efficient and Scalable Deep Learning Training System. OSDI 2014.
  2. Ray: A Distributed Framework for Emerging AI Applications. OSDI 2018.
  3. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. OSDI 2018.
  4. Gandiva: Introspective Cluster Scheduling for Deep Learning. OSDI 2018.
  5. Pretzel: Opening the Black Box of Machine Learning Prediction Serving Systems. OSDI 2018.
  6. PipeDream: Generalized Pipeline Parallelism for DNN Training. SOSP 2019.

Apr 23, Week 14: Project Presentations

Project presentations will be held during the class hour. Each presentation should be 15 minutes long, followed by 5 minutes of Q/A. The final project report is also due.