Designing Modern Web-Scale Applications

ECE1724, Winter 2021
University of Toronto

Instructor: Ashvin Goel
Course Time: Fri, 1-3 pm
Start Date: Jan 15, 2021
Zoom URL

Quick Links

Home | Piazza Discussion | Accessing Papers | Presentation Format | Project Format | Project Ideas

Course Description

The last decade has seen an enormous shift in computing, with the rise of cloud computing and big data processing, powered by web-scale applications. This course discusses the principles, key technologies, and trends in the design of web-scale applications. The course will examine and compare the architectures and implementations of several types of web-scale applications. Students will learn how these applications are designed to achieve scalability, reliability, and high availability.

This is a seminar-style course in which students will be required to read, analyze, present and discuss seminal and cutting-edge research in this area. The aim is to both learn from prior work and extract exciting research questions. A course project will provide concrete experience and deeper understanding of the material.

The course covers advanced topics, broadly in the area of distributed systems, storage, and databases, with a focus on web-scale applications. The goal is to provide a survey of research in this area rather than to focus on a specific topic.

Prerequisites

This course builds on the following undergraduate courses taught at the University of Toronto: operating systems (ECE344) and distributed systems (ECE419). It assumes that students are knowledgeable about the contents of these courses.

Students are expected to have strong coding skills and be experienced in a language like Java, Python, or C++. They should have experience building and debugging a significant software system. If unsure, consult with the instructor.

Textbooks

There are no required textbooks for this course. The optional textbooks are

Group Discussion

Course announcements and course discussion will be on the Piazza web site. Please sign up for this course on Piazza. You should post any questions about the course on Piazza. You may post to the whole class or post privately to the instructor.

Grading Policy

Grades will be based on class presentations, a class project, and class participation. There will be no final exam in this course. The grading breakdown is as follows:

Note: If a student is unable to attend a class, he or she will lose 2% for non-participation.

Class Presentation

Each week, the class will cover a group of papers that focuses on a specific aspect of the course. Students are expected to read all the papers in the group that will be presented. At the beginning of the term, each paper will be assigned to a student, who will present it in class. Presentations will be limited to roughly 20 minutes.

More details about the presentation format are available under Presentation Format in the Quick Links above. Please read them very carefully.

Assignments

There will be no assignments in this course.

Class Project

A major component of this course is devoted to a term-long project. The topic of the project is largely up to you, but to help you choose a project, a sample list of projects is provided below. This list should help students determine whether their own projects are of reasonable size and scope.

Students will generally work in groups on a term project that pushes the state of the art in the design of a web-scale application. Students will typically be required to implement and evaluate a substantial software system.

More details about the project format are available under Project Format in the Quick Links above. Please read them very carefully.

Project Ideas

Here is a list of project ideas.

Readings

This is a tentative list. Most of these papers can be accessed from the ACM web site. If you cannot access ACM articles directly, please follow the instructions for accessing the papers (see Accessing Papers in the Quick Links above).

Jan 15, Week 1: Introduction

  1. Overview of the course.
  2. Introduction to the course. Slides modified from Ken Birman's course on Cloud Computing 2019.
  3. Lecture video.
  4. A View of Cloud Computing. CACM 2010.
  5. The Dangers of Replication and a Solution. SIGMOD 1996.
  6. Efficient Reading of Papers in Science and Technology.
  7. How (and How Not) to Write a Good Systems Paper. Operating Systems Review 1983.

Jan 22, Week 2: Consensus and Coordination

  1. Overview of consensus and coordination.
  2. Overview of linearizability.
  3. Lecture video.
  4. In Search of an Understandable Consensus Algorithm. USENIX ATC 2014. Shirley
  5. ZooKeeper: Wait-free Coordination for Internet-scale Systems. USENIX ATC 2010. Daniel

     Optional reading:

  1. Paxos Made Simple.
  2. Replication Management using the State Machine Approach.
  3. The Chubby lock service for loosely-coupled distributed systems. OSDI 2006.

Jan 29, Week 3: Cluster Storage Systems

  1. Overview by instructor.
  2. The Google File System. SOSP 2003. Weiqi
  3. Bigtable: A Distributed Storage System for Structured Data. OSDI 2006. Corey

     Optional reading:

  1. The Hadoop Distributed File System. MSST 2010.
  2. Fast Crash Recovery in RAMCloud. SOSP 2011.

Feb 5, Week 4: Transactional Stores

  1. Overview by instructor.
  2. Sinfonia: A New Paradigm for Building Scalable Distributed Systems. SOSP 2007. Daniel
  3. Fast Distributed Transactions and Strongly Consistent Replication for OLTP Database Systems. TODS 2014. Ziwen

     Optional reading:

  1. Transaction chains: achieving serializability with low latency in geo-distributed storage systems. SOSP 2013.

Feb 12, Week 5: Wide Area Storage Systems (first report due)

  1. Overview by instructor.
  2. Dynamo: Amazon's Highly Available Key-value Store. SOSP 2007. Rita
  3. Spanner: Google’s Globally Distributed Database. ACM TOCS 2013. Minghao

     Optional reading:

  1. Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency. SOSP 2011.
  2. Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS. SOSP 2011.
  3. Sharding the Shards: Managing Datastore Locality at Scale with Akkio. OSDI 2018.
  4. Performance-Optimal Read-Only Transactions. OSDI 2020.
  5. FlightTracker: Consistency across Read-Optimized Online Stores at Facebook. OSDI 2020.

Feb 19: Reading Week, No Class

Feb 26, Week 6: Remote Memory

  1. AIFM: High-Performance, Application-Integrated Far Memory. OSDI 2020. Shirley
  2. Semeru: A Memory-Disaggregated Managed Runtime. OSDI 2020. Qiwei

Mar 5, Week 7: Data Parallel Frameworks

  1. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004. Weiqi
  2. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012. Yifan

     Optional reading:

  1. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. Eurosys 2007.
  2. Distributed aggregation for data-parallel computing: interfaces and implementations. SOSP 2009.
  3. Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks. SoCC 2014.
  4. Latency-Tolerant Software Distributed Shared Memory. USENIX ATC 2015.

Mar 12, Week 8: Scheduling

  1. Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling. Eurosys 2010. Mohammed
  2. Sparrow: Distributed, Low Latency Scheduling. SOSP 2013. Simona

     Optional reading:

  1. Improving MapReduce Performance in Heterogeneous Environments. OSDI 2008.
  2. Quincy: Fair Scheduling for Distributed Computing Clusters. SOSP 2009.

Mar 19, Week 9: Resource Management (second report due)

  1. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. NSDI 2011. Eric
  2. Twine: A Unified Cluster Management System for Shared Infrastructure. OSDI 2020. Dison

     Optional reading:

  1. Apache Hadoop YARN: Yet Another Resource Negotiator. SoCC 2013.
  2. Omega: flexible, scalable schedulers for large compute clusters. Eurosys 2013.
  3. Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing. OSDI 2014.
  4. Retro: Targeted Resource Management in Multi-tenant Distributed Systems. NSDI 2015.
  5. Large-scale cluster management at Google with Borg. Eurosys 2015.

Mar 26, Week 10: Stream Processing

  1. Storm @Twitter. SIGMOD 2014. Wenjun
  2. MillWheel: Fault-Tolerant Stream Processing at Internet Scale. VLDB 2013. Mohammed

     Optional reading:

  1. Kafka: a Distributed Messaging System for Log Processing. NetDB 2011.
  2. Naiad: A Timely Dataflow System. SOSP 2013.
  3. Discretized Streams: Fault-Tolerant Streaming Computation at Scale. SOSP 2013.
  4. Building a Replicated Logging System with Apache Kafka. VLDB 2015.
  5. Apache Flink™: Stream and Batch Processing in a Single Engine. 2015.
  6. Drizzle: Fast and Adaptable Stream Processing at Scale. SOSP 2017.
  7. SVE: Distributed Video Processing at Facebook Scale. SOSP 2017.
  8. Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark. SIGMOD 2018.

Apr 2, Week 11: Graph Processing

  1. Pregel: A System for Large-Scale Graph Processing. SIGMOD 2010. Karthik
  2. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. OSDI 2012. Minghao

     Optional reading:

  1. A Lightweight Infrastructure for Graph Analytics. SOSP 2013.
  2. GraphX: Graph Processing in a Distributed Dataflow Framework. OSDI 2014.
  3. Chaos: Scale-out Graph Processing from Secondary Storage. SOSP 2015.
  4. Gemini: A Computation-Centric Distributed Graph Processing System. OSDI 2016.
  5. Scalability! But at what COST? HotOS 2015.

Apr 9, Week 12: Graph Mining

  1. Arabesque: A System for Distributed Graph Mining. SOSP 2015. Chenghao
  2. Fractal: A General-Purpose Graph Pattern Mining System. SIGMOD 2019. Ziwen

     Optional reading:

  1. RStream: Marrying Relational Algebra with Streaming for Efficient Graph Mining on A Single Machine. OSDI 2018.
  2. AutoMine: Harmonizing High-Level Abstraction and High Performance for Graph Mining. SOSP 2019.

Apr 16, Week 13: Machine Learning Systems (final report due)

  1. Scaling Distributed Machine Learning with the Parameter Server. OSDI 2014. Simona
  2. TensorFlow: A System for Large-Scale Machine Learning. OSDI 2016. Karthik

     Optional reading:

  1. Project Adam: Building an Efficient and Scalable Deep Learning Training System. OSDI 2014.
  2. Ray: A Distributed Framework for Emerging AI Applications. OSDI 2018.
  3. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. OSDI 2018.
  4. Gandiva: Introspective Cluster Scheduling for Deep Learning. OSDI 2018.
  5. Pretzel: Opening the Black Box of Machine Learning Prediction Serving Systems. OSDI 2018.
  6. PipeDream: Generalized Pipeline Parallelism for DNN Training. SOSP 2019.

Apr 23, Week 14: Project Presentations

Project presentations will be held during the class hour. Each presentation should be 15 minutes long, followed by 5 minutes of Q/A. The final project report is also due.