Project Ideas
Some suggested projects are described below. You are also welcome to pick a project of your own choice. However, it is important for you to think about the following questions regarding your project before starting any design and implementation:
- What problem are you addressing? You should think about the main goals of the design and what you plan to achieve in this project.
- What is interesting/novel about your approach? One way to answer this question is to ask yourself the following: what question will this project answer that you do not know the answer to already, i.e., why do you need to spend time on this project?
- How would you know that you have achieved your goals? You need to think about the metrics and testing method that you will use for evaluation. You should also think about the benchmarks you will use and the results you expect from the evaluation.
Your project reports and final presentation will be evaluated based on the criteria described above.
Some of the projects described below are based on work being done in the instructor's group. Other project ideas are based on projects done in the past in this course. Please talk to the instructor for more details regarding the projects.
Please make sure to get a confirmation about your project from the instructor before starting the project.
- A Comparison of Streaming Databases
A stream processing system is a software framework that enables real-time processing and analysis of data streams. To use a streaming system, developers create jobs on it and connect it to upstream data sources, such as Apache Kafka. The system then consumes data from its sources, incrementally processes the data using operators (e.g., filters and aggregators), updates the query results, and stores them on sink systems such as AWS S3, thus providing up-to-date information.
In this project, you will compare and analyze the performance (throughput and latency), scalability, timeliness, and failure recovery times of streaming databases such as Apache Flink and Apache Spark.
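To make the consume, process, and sink flow described above concrete, here is a minimal sketch in Python. It is not how Flink or Spark is implemented; the names `filter_op`, `RunningSum`, and `run_job` are hypothetical, the source is an in-memory list standing in for something like a Kafka topic, and the sink is a dictionary standing in for external storage.

```python
from collections import defaultdict


def filter_op(events, predicate):
    """Filter operator: pass through only events that satisfy the predicate."""
    for event in events:
        if predicate(event):
            yield event


class RunningSum:
    """Aggregator operator: keeps per-key state and updates it one event at a time."""

    def __init__(self):
        self.totals = defaultdict(int)

    def update(self, key, value):
        self.totals[key] += value
        return self.totals[key]


def run_job(source, sink):
    """Consume events from the source, process them incrementally, write to the sink."""
    aggregator = RunningSum()
    for key, value in filter_op(source, lambda event: event[1] > 0):
        # The sink always holds the latest result for each key.
        sink[key] = aggregator.update(key, value)
    return sink


# Hypothetical input: (key, value) pairs; the negative value is filtered out.
events = [("a", 3), ("b", -1), ("a", 4), ("b", 5)]
result = run_job(events, {})
```

Because each event only touches the state for its own key, the result in the sink is updated without reprocessing earlier events, which is the property the comparison would stress under load.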
- Auto Scaling in Streaming Databases
Streaming systems respond to workload changes by automatically scaling and reconfiguring streaming jobs, which enables them to meet their stringent real-time processing requirements. Automatic scaling involves detecting and predicting workload changes, changing the parallelism of operators, and allocating resources such as CPUs, memory and disks to operators on physical or virtual machines.
In this project, you will analyze the auto scaling capabilities of a streaming database such as Apache Flink.
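One way to think about the "changing the parallelism of operators" step is as a rule that maps an observed input rate to a number of operator instances. The sketch below is a hypothetical rule for illustration only, not the policy used by Flink or any other system; `target_parallelism`, `plan_rescaling`, the `headroom` parameter, and all rates are assumptions.

```python
import math


def target_parallelism(input_rate, per_instance_rate, headroom=0.8,
                       min_parallelism=1, max_parallelism=64):
    """Instances needed so each one runs at no more than `headroom` of its capacity.

    Rates are in events per second. All parameters are illustrative assumptions.
    """
    needed = math.ceil(input_rate / (per_instance_rate * headroom))
    return max(min_parallelism, min(max_parallelism, needed))


def plan_rescaling(current, observed_rates, capacities):
    """Return {operator: new_parallelism} for operators whose parallelism should change."""
    plan = {}
    for operator, parallelism in current.items():
        target = target_parallelism(observed_rates[operator], capacities[operator])
        if target != parallelism:
            plan[operator] = target
    return plan


# Hypothetical job with two operators.
current = {"filter": 2, "aggregate": 2}
observed_rates = {"filter": 9000, "aggregate": 3000}   # events/s arriving
capacities = {"filter": 2500, "aggregate": 2000}       # events/s per instance
plan = plan_rescaling(current, observed_rates, capacities)
```

A real auto scaler also has to predict workload changes and account for the cost of reconfiguring a running job, which this rule ignores; those are the parts the analysis would focus on.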
- Elastic, Distributed KV Store
In this project, you will design a distributed KV store that enables adding and removing servers that store data. The project will primarily require you to think about how to migrate data when servers are added or removed.
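As a starting point for the migration question, the sketch below uses consistent hashing, which is one common way (not the only one) to limit how much data moves when membership changes. The class `HashRing`, the helper `keys_to_migrate`, the server names, and the choice of 64 virtual nodes per server are all assumptions made for illustration.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Hash a string to an integer that is stable across runs and machines."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring; each server owns several points (virtual nodes)."""

    def __init__(self, servers, vnodes=64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (point, server)
        for server in servers:
            self.add_server(server)

    def add_server(self, server):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (stable_hash(f"{server}#{i}"), server))

    def remove_server(self, server):
        self._ring = [(point, s) for point, s in self._ring if s != server]

    def owner(self, key):
        """The first point clockwise from the key's hash owns the key."""
        idx = bisect.bisect_right(self._ring, (stable_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


def keys_to_migrate(old_owners, ring, keys):
    """Return {key: (old_server, new_server)} for keys whose owner changed."""
    moves = {}
    for key in keys:
        new_owner = ring.owner(key)
        if new_owner != old_owners[key]:
            moves[key] = (old_owners[key], new_owner)
    return moves


keys = [f"user:{i}" for i in range(1000)]
ring = HashRing(["s1", "s2", "s3"])
before = {key: ring.owner(key) for key in keys}
ring.add_server("s4")
moves = keys_to_migrate(before, ring, keys)
```

With this scheme, every key in `moves` is headed to the newly added server and the remaining keys stay where they are. The sketch only computes the migration plan; moving the data while the store keeps serving requests is the harder part of the project.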
- Replicated KV Store
In this project, you will design a replicated KV store. You could start with the RocksDB database and implement a replicated store. How will you do the replication? There are many options, including application-level command replication (e.g., CockroachDB), LSM-level replication (e.g., Rose) and file-level replication (e.g., Hailstorm). As with any replication system, how will you ensure that the replicas remain consistent? Will the project handle failures, and if so, how?
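The first option, command replication, can be illustrated with a small sketch: a primary assigns each command a log index and every replica applies commands in index order, so all replicas reach the same state. This is a simplified illustration, not how CockroachDB works; `Replica` and `Primary` are hypothetical names, a Python dict stands in for a local store such as RocksDB, and failures are deliberately not handled.

```python
class Replica:
    """Applies commands strictly in log order to a local store."""

    def __init__(self, name):
        self.name = name
        self.log = []     # ordered list of applied commands
        self.store = {}   # stand-in for a local store such as RocksDB

    def apply(self, index, command):
        # Refuse gaps or reordering; a real system would fetch missing entries.
        if index != len(self.log):
            raise ValueError(f"{self.name}: expected index {len(self.log)}, got {index}")
        op, key, value = command
        if op == "put":
            self.store[key] = value
        elif op == "delete":
            self.store.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
        self.log.append(command)


class Primary:
    """Orders client commands and forwards them to every follower."""

    def __init__(self, followers):
        self.local = Replica("primary")
        self.followers = followers

    def execute(self, op, key, value=None):
        command = (op, key, value)
        index = len(self.local.log)
        self.local.apply(index, command)
        for follower in self.followers:
            # Synchronous and assumes every follower is up (no failure handling).
            follower.apply(index, command)
        return index


followers = [Replica("r1"), Replica("r2")]
primary = Primary(followers)
primary.execute("put", "x", "1")
primary.execute("put", "y", "2")
primary.execute("delete", "x")
stores = [primary.local.store] + [f.store for f in followers]
```

The consistency and failure questions above show up as soon as a follower can crash or fall behind: the primary would then need acknowledgements, retries, and some way to elect a new primary, none of which this sketch attempts.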
- Scaling in Container-Based Systems
In this project, you will analyze the auto-scaling methods used in container-based systems (e.g., Kubernetes). In particular, you will analyze whether the scaling methods work well for highly varying workloads. You could also work on predicting workload patterns to improve the scaling mechanism.
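As a reference point, the Kubernetes Horizontal Pod Autoscaler documentation describes its core rule as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), with no change when the ratio is within a tolerance (0.1 by default). The sketch below applies that rule to a made-up load trace. It is a simplification that ignores pod readiness, stabilization windows, and scaling policies; the trace, the replica bounds, and the function names are assumptions.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     tolerance=0.1, min_replicas=1, max_replicas=10):
    """Simplified form of the HPA scaling rule (see the caveats in the text)."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to the target: do nothing
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


def simulate(load_trace, target_per_pod, start_replicas=1):
    """Replay a total-load trace and record the replica count after each decision."""
    replicas = start_replicas
    history = []
    for total_load in load_trace:
        per_pod = total_load / replicas  # metric averaged over current pods
        replicas = desired_replicas(replicas, per_pod, target_per_pod)
        history.append(replicas)
    return history


# Hypothetical trace: a sudden spike followed by a sharp drop.
history = simulate([100, 400, 400, 50], target_per_pod=100)
```

In this toy trace the replica count reacts only after the spike has already arrived, which is the kind of lag a highly varying workload would expose and a workload predictor would try to reduce.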
- Disaggregation in Data Processing Frameworks
Data processing frameworks such as Spark co-locate computation with storage. While this approach provides locality, it can cause storage bottlenecks, especially for large data shuffles. In this project, you will store data on remote storage, thus decoupling compute from storage and enabling remote data shuffles.
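The idea of a remote shuffle can be sketched as follows: map tasks partition their output by key and write each partition to remote storage, and reduce tasks later read their partitions from that storage instead of from the map workers' local disks. This is an illustration only, not Spark's shuffle implementation; `RemoteStore` is an in-memory stand-in for a remote storage service, and the path layout and function names are assumptions.

```python
import zlib
from collections import defaultdict


class RemoteStore:
    """In-memory stand-in for remote storage; a real one would be a network service."""

    def __init__(self):
        self._objects = {}

    def put(self, path, records):
        self._objects[path] = list(records)

    def get(self, path):
        return self._objects.get(path, [])


def partition_of(key, num_reducers):
    """Deterministic partitioning so every map task agrees on a key's reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_task(map_id, words, store, num_reducers):
    """Emit (word, 1) pairs, grouped by reducer, and write them to remote storage."""
    buckets = defaultdict(list)
    for word in words:
        buckets[partition_of(word, num_reducers)].append((word, 1))
    for reduce_id, records in buckets.items():
        store.put(f"shuffle/map-{map_id}/part-{reduce_id}", records)


def reduce_task(reduce_id, num_maps, store):
    """Read this reducer's partition from every map task and sum the counts."""
    counts = defaultdict(int)
    for map_id in range(num_maps):
        for word, n in store.get(f"shuffle/map-{map_id}/part-{reduce_id}"):
            counts[word] += n
    return dict(counts)


inputs = [["a", "b", "a"], ["b", "c"]]
num_reducers = 2
store = RemoteStore()
for map_id, words in enumerate(inputs):
    map_task(map_id, words, store, num_reducers)

word_counts = {}
for reduce_id in range(num_reducers):
    word_counts.update(reduce_task(reduce_id, len(inputs), store))
```

Since the shuffle data lives in the store rather than on the map workers, compute nodes can be added or removed without losing intermediate data; the cost is the extra network round trips, which the project would need to measure.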