Designing Modern Web-Scale Applications

ECE1724, Winter 2021
University of Toronto

Instructor: Ashvin Goel
Course Time: Fri, 1-3 pm
Start Date: Jan 15, 2021
Zoom URL

Quick Links

HomePiazza DiscussionAccessing PapersPresentation FormatProject FormatProject Ideas

Project Ideas

Some suggested projects are described below. You are also welcome to pick a project of your own choice. However, it is important for you to think about the following questions regarding your project before starting any design and implementation:

  1. What problem are you addressing? You should think about the the main goals of the design and what you plan to achieve in this project.
  2. What is interesting/novel about your approach? One way to answer this question is to ask yourself the following: what question will this project answer that you do not know the answer to already, i.e., why do you need to spend time on this project?
  3. How would you know that you have achieved your goals? You need to think about the metrics and testing method that will you use for evaluation. You should also think about the expected results from the evaluation.

Your project reports and final presentation will be evaluated based on the criteria described above.

All the projects described below are based on work being done by the instructor's group. Please talk to the instructor about more details regarding the projects.

Please make sure to get a confirmation about your project from the instructor before starting the project.

  1. Performance in the Caracal Distributed Database

    Many web applications store their large datasets in distributed databases today. The instructor's group has been working on a high-performance, in-memory distributed database called Caracal that is designed to scale well. In this project, you will be working on one of the projects shown below for improving the performance of Caracal. Caracal is implemented in C++, so you will need significant experience with C++ development.

    1. Random Sharding in Caracal

      Caracal supports various data sharding (partitioning) policies. In this project, you will experiment with hash-based or random sharding, which makes it easier to handle load imbalance across nodes. Currently, we have implemented a subset of the TPC-C benchmark with random sharding. In this project, you will implement the rest of the benchmark for Caracal and compare the performance of random sharding with warehouse-based sharding.

    2. Data Migration in Caracal

      Databases often manage long-term load imbalance across nodes by migrating data from hot nodes to other nodes. In this project, you will implement data migration in Caracal and evaluate the migration scheme using uniform and skewed workloads.

  2. Crash Recovery in the Caracal Database

    The instructor's group has been working on a high-performance, in-memory database called Caracal. Currently, this database does not support logging and checkpointing for crash recovery. In this project, you will implement support for logging and checkpointing a single-node database. This implementation will allow recovering from database crashes and single node failures. Caracal is implemented in C++, so you will need significant experience with C++ development.

  3. Replication in the Hailstorm Storage System

    Today, cloud-based systems often use log-structured merge tree (LSM) based storage systems. These storage systems provide both fast read and write accesses by maintaining data in two or more separate structures, each optimized for its storage media.

    The instructor's group has been working on a rack-level, LSM-based storage system called Hailstorm that is designed to balance CPU and storage load across the machines in the rack. Currently, the storage system does not support data replication for fault tolerance. In this project, you will be working on implementing replication for Hailstorm, similar to the replication mechanisms used in GFS and Bigtable. Hailstorm is implemented in Java, and so you will need significant experience with Java development.

  4. Efficient Distributed Graph Mining

    Graph mining algorithms help discover complex structural patterns, such as cliques and motifs in graphs, enabling applications such as analyzing communities in social networks.

    The instructor's group has been working on a distributed graph mining system called Tesseract. Unlike Arabesque and Fractal, which are designed for mining static graphs, Tesseract is designed for mining evolving graphs. As graph updates arrive, Tesseract detects new graph patterns with low delay and streams them for further processing. Currently, Tesseract is designed for mining general patterns that are expressed using arbitrary pattern-matching programs. However, its performance can be improved if the structure of the patterns is known apriori, by carefully directing and filtering the search based on the structure of the pattern. In this project, you will be implementing optimizations that enable efficient searching of specific patterns in Tesseract. Tesseract is implemented in Java and C++, and so you will need significant experience with development in these languages.