Distributed Systems

ECE419, Winter 2025
University of Toronto
Instructor: Ashvin Goel

Advice for Raft Labs

Most of the labs require only a modest amount of code (fewer than a couple of hundred lines per lab part), but they can be conceptually difficult and may require a good deal of thought and debugging. Some of the tests are difficult to pass.

Don't start a lab the night before it is due; it's more efficient to do the labs in several sessions spread over multiple days. Tracking down bugs in distributed systems is difficult, because of concurrency, crashes, and an unreliable network.

Additional lab material

We suggest you give this material a quick read before starting the labs. Then, as you do the labs, you can return to it when you are stuck or have trouble debugging your code.

Debugging tips

Debugging in general

Efficient debugging takes experience. It helps to be systematic: form a hypothesis about a possible cause of the problem; collect evidence that might be relevant; think about the information you've gathered; repeat as needed. For extended debugging sessions it helps to keep notes, both to accumulate evidence and to remind yourself why you've discarded specific earlier hypotheses.

One approach is to progressively narrow down the specific point in time at which things start to go wrong. You could add code at various points in the execution that tests whether the system has reached the bad state. Or your code could print messages with relevant state at various points; collect the output in a file, and look through the file for the first point where things look wrong.

The Raft labs involve events, such as RPCs arriving or timeouts expiring or peers failing, that may occur at times you don't expect, or may be interleaved in unexpected orders. For example, one peer may decide to become a candidate while another peer thinks it is already the leader. It's worth thinking through the "what can happen next" possibilities. For example, when your Raft code releases a mutex, the very next thing that may happen (before the next line of code is executed!) might be the delivery (and processing) of an RPC request, or a timeout going off. Add Print statements to find out the actual order of events during execution.
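As a minimal sketch of why the gap after releasing a mutex matters, consider a candidate that unlocks while it waits for votes and must re-check its state once it holds the lock again. The Raft struct, its fields, and the method names below are hypothetical stand-ins, not the lab's actual definitions.

```go
package main

import (
	"fmt"
	"sync"
)

type peerState int

const (
	follower peerState = iota
	candidate
	leader
)

// Raft is a stripped-down, hypothetical stand-in for a lab's Raft struct.
type Raft struct {
	mu          sync.Mutex
	state       peerState
	currentTerm int
}

// stepDown plays the role of an RPC handler that sees a higher term.
func (rf *Raft) stepDown(term int) {
	rf.mu.Lock()
	defer rf.mu.Unlock()
	if term > rf.currentTerm {
		rf.currentTerm = term
		rf.state = follower
	}
}

// becomeLeaderIfStillCandidate is called after enough votes have arrived.
// The lock was released while waiting for replies, so the state and term
// must be re-checked: anything may have happened in the meantime.
func (rf *Raft) becomeLeaderIfStillCandidate(electionTerm int) bool {
	rf.mu.Lock()
	defer rf.mu.Unlock()
	if rf.state != candidate || rf.currentTerm != electionTerm {
		return false // the election this vote count belongs to is stale
	}
	rf.state = leader
	return true
}

func main() {
	rf := &Raft{state: candidate, currentTerm: 5}
	// While the candidate was waiting for votes with the lock released,
	// an RPC carrying term 6 was delivered and processed.
	rf.stepDown(6)
	fmt.Println(rf.becomeLeaderIfStillCandidate(5)) // prints false
}
```

Without the re-check, the peer would declare itself leader for term 5 even though it has already moved on to term 6 as a follower.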

The Raft paper's Figure 2 must be followed fairly exactly. It is easy to miss a condition that Figure 2 says must be checked, or an exact state change that it says must be made. If you have a bug, re-check that all of your code adheres closely to Figure 2.
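One example of a condition that is easy to get subtly wrong is the "at least as up-to-date" log comparison that Figure 2 requires before granting a vote. The sketch below follows the paper's definition (compare the terms of the last entries first, and only on a tie compare log lengths); the function name and parameters are hypothetical.

```go
package main

import "fmt"

// candidateLogUpToDate reports whether the candidate's log is at least as
// up-to-date as the receiver's. The last entries' terms are compared first;
// the last index only matters when those terms are equal.
func candidateLogUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}

func main() {
	// A shorter log with a higher last term still wins.
	fmt.Println(candidateLogUpToDate(3, 2, 2, 9)) // prints true
	// Same last term, but the candidate's log is shorter.
	fmt.Println(candidateLogUpToDate(2, 9, 2, 10)) // prints false
}
```

A common mistake is to compare only log lengths, or to compare currentTerm instead of the term of the last log entry; either would pass simple tests and fail later ones.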

If code used to work but no longer does, a change you've recently made may be at fault, so start by reviewing what you changed.

The bug is often in the very last place you think to look, so be sure to look even at code you feel certain is correct.

The TAs are happy to help you think about your code during lab hours, but you're likely to get the most mileage out of the limited lab time if you've already dug as deep as you can into the situation.