Distributed Systems

ECE419, Winter 2025
University of Toronto
Instructor: Ashvin Goel


Advice for Raft Labs

Most of the labs require only a modest amount of code (fewer than a couple of hundred lines per lab part) but can be conceptually difficult and may require a good deal of thought and debugging. Some of the tests are difficult to pass.

Don't start a lab the night before it is due; it's more efficient to do the labs in several sessions spread over multiple days. Tracking down bugs in distributed systems is difficult, because of concurrency, crashes, and an unreliable network.

Additional lab material

The material below will help with the labs. We suggest you give it a quick read before starting. Then, as you do the labs, you can return to it when you are stuck or have trouble debugging your code.

Look at the Raft visualization.
Read about structuring your Raft lab.
Read about locking in the Raft lab.
Read about options for the Raft implementation.
Read this Raft guide for students. This guide is a long read but it is helpful for understanding the details of the Raft protocol.
This diagram of Raft interactions may help you understand code flow between different parts of the system. Note that you will not be implementing snapshots.

Debugging tips

For each part of the lab, write (or re-write) code that is clean and clear, which will greatly help with debugging.
It may be helpful when debugging to insert print statements when a peer sends or receives a message, and collect the output in a file with go test > test.out. Then, by studying the trace of messages in the test.out file, you can identify where your implementation deviates from the desired behavior.
Structure your debug messages in a consistent format so that you can use grep to search for specific lines in the output.
Instead of calling log.Printf directly, you might find DPrintf in raft.go useful, because it lets you turn printing on and off as you debug different problems.
You can use colors or columns to help you parse log output. This post explains one strategy.
As you're writing code (i.e., before you have a bug), it may be worth adding explicit checks for conditions that the code assumes to be true, perhaps using Go's panic. Such checks may help detect situations where later code unwittingly violates your assumptions.
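
As an illustration of the consistent, grep-friendly message format suggested above, here is a minimal sketch. The names formatEvent, debugf, and debugEnabled are hypothetical and are not part of the lab code (the lab's own helper is DPrintf in raft.go); the column layout is just one possible choice.

```go
package main

import "fmt"

// debugEnabled toggles all trace output (hypothetical flag, not from the lab code).
var debugEnabled = true

// formatEvent builds one trace line in a fixed layout:
// peer id, zero-padded term, a short event tag, then free-form detail.
func formatEvent(peer, term int, tag, format string, args ...interface{}) string {
	detail := fmt.Sprintf(format, args...)
	return fmt.Sprintf("S%d T%03d %-5s %s", peer, term, tag, detail)
}

// debugf prints a trace line to standard output when debugging is enabled.
func debugf(peer, term int, tag, format string, args ...interface{}) {
	if !debugEnabled {
		return
	}
	fmt.Println(formatEvent(peer, term, tag, format, args...))
}

func main() {
	debugf(1, 3, "VOTE", "granting vote to S%d", 2)
	debugf(2, 3, "LEAD", "won election with %d votes", 2)
}
```

With a layout like this, a command such as grep "VOTE" test.out or grep "^S1 " test.out pulls out all vote events or all events from one peer.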

Debugging in general

Efficient debugging takes experience. It helps to be systematic: form a hypothesis about a possible cause of the problem; collect evidence that might be relevant; think about the information you've gathered; repeat as needed. For extended debugging sessions it helps to keep notes, both to accumulate evidence and to remind yourself why you've discarded specific earlier hypotheses.

One approach is to progressively narrow down the specific point in time at which things start to go wrong. You could add code at various points in the execution that tests whether the system has reached the bad state. Or your code could print messages with relevant state at various points; collect the output in a file, and look through the file for the first point where things look wrong.
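
The idea of adding code that tests whether the system has reached a bad state can be sketched as follows. The raftState type and the functions checkInvariants and mustHold are hypothetical names for illustration, and the sketch assumes 1-based log indices, so that the last log index equals the log length. Your own state and invariants will differ.

```go
package main

import "fmt"

// raftState is a hypothetical, trimmed-down stand-in for a peer's state.
type raftState struct {
	logLen      int // number of log entries (assumed 1-based indexing)
	commitIndex int // highest index known to be committed
	lastApplied int // highest index applied to the state machine
}

// checkInvariants returns an error describing the first violated
// invariant, or nil if the state looks consistent.
func checkInvariants(s raftState) error {
	if s.commitIndex > s.logLen {
		return fmt.Errorf("commitIndex %d exceeds log length %d", s.commitIndex, s.logLen)
	}
	if s.lastApplied > s.commitIndex {
		return fmt.Errorf("lastApplied %d exceeds commitIndex %d", s.lastApplied, s.commitIndex)
	}
	return nil
}

// mustHold panics, naming the call site, if an invariant is violated.
func mustHold(where string, s raftState) {
	if err := checkInvariants(s); err != nil {
		panic(fmt.Sprintf("%s: %v", where, err))
	}
}

func main() {
	good := raftState{logLen: 5, commitIndex: 3, lastApplied: 2}
	mustHold("after AppendEntries", good) // passes silently

	bad := raftState{logLen: 5, commitIndex: 3, lastApplied: 4}
	fmt.Println(checkInvariants(bad))
}
```

Calling a check like mustHold at several points (for example, at the end of each RPC handler) narrows down where the state first goes bad: the earliest call site that panics is closest to the bug.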

The Raft labs involve events, such as RPCs arriving, timeouts expiring, or peers failing, that may occur at times you don't expect, or may be interleaved in unexpected orders. For example, one peer may decide to become a candidate while another peer thinks it is already the leader. It's worth thinking through the "what can happen next" possibilities. For example, when your Raft code releases a mutex, the very next thing that may happen (before the next line of code is executed!) might be the delivery (and processing) of an RPC request, or a timeout going off. Add print statements to find out the actual order of events during execution.
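
One consequence of the "what can happen next" point is that state read before releasing a mutex may be stale once the mutex is re-acquired. The sketch below shows one way to guard against that when counting votes. The peer type and its methods are hypothetical and heavily simplified; this is not the lab's Raft struct.

```go
package main

import (
	"fmt"
	"sync"
)

type role int

const (
	follower role = iota
	candidate
	leader
)

// peer is a hypothetical, minimal stand-in for a Raft peer.
type peer struct {
	mu    sync.Mutex
	term  int
	role  role
	votes int
}

// handleVoteReply processes a reply for an election started in electionTerm.
// The lock was not held while the RPC was in flight, so the peer may have
// changed term or role in the meantime; re-check before counting the vote.
// It returns true only if the vote was counted.
func (p *peer) handleVoteReply(electionTerm int, granted bool) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.role != candidate || p.term != electionTerm {
		return false // stale reply from an election that is no longer current
	}
	if !granted {
		return false
	}
	p.votes++
	return true
}

// stepDown models a message with a higher term arriving at any moment.
func (p *peer) stepDown(newTerm int) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if newTerm > p.term {
		p.term = newTerm
		p.role = follower
		p.votes = 0
	}
}

func main() {
	p := &peer{term: 4, role: candidate, votes: 1}
	fmt.Println(p.handleVoteReply(4, true)) // true: still a candidate in term 4
	p.stepDown(5)                           // a higher-term message arrives
	fmt.Println(p.handleVoteReply(4, true)) // false: reply is now stale
}
```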

The Raft paper's Figure 2 must be followed fairly exactly. It is easy to miss a condition that Figure 2 says must be checked, or an exact state change that it says must be made. If you have a bug, re-check that all of your code adheres closely to Figure 2.
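
As one example of a Figure 2 condition that is easy to get subtly wrong, the RequestVote receiver rule requires the candidate's log to be "at least as up-to-date" as the voter's. The Raft paper defines this by comparing the terms of the last entries first, and the log lengths only if those terms are equal. A minimal sketch of just that comparison follows; the function name and parameter names are hypothetical, and a full handler must also check the other conditions in Figure 2.

```go
package main

import "fmt"

// candidateUpToDate reports whether a candidate's log is at least as
// up-to-date as the voter's log: the later last term wins, and if the
// last terms are equal, the longer (or equal-length) log wins.
func candidateUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}

func main() {
	fmt.Println(candidateUpToDate(3, 2, 2, 9)) // true: later last term beats a longer log
	fmt.Println(candidateUpToDate(2, 9, 3, 2)) // false: earlier last term loses
	fmt.Println(candidateUpToDate(3, 4, 3, 4)) // true: identical logs count as up-to-date
}
```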

If code used to work, but now it doesn't, maybe a change you've recently made is at fault.

The bug is often in the very last place you think to look, so be sure to look even at code you feel certain is correct.

The TAs are happy to help you think about your code during lab hours, but you're likely to get the most mileage out of the limited lab time if you've already dug as deep as you can into the situation.