Distributed Systems

ECE419, Winter 2025
University of Toronto
Instructor: Ashvin Goel


Lab 3: Raft Protocol

Due dates: Mar 2, Mar 16, Mar 30

In this lab you'll implement Raft, a replicated state machine protocol. The bonus lab will use your Raft implementation to build a linearizable, replicated key/value server.

This lab has three parts (3A, 3B, 3C), each building on the previous one. The submission deadlines for the three parts are shown above.

Starter Code

You can get the latest starter code for this lab from the student repository by running:

cd ece419
git pull upstream main

For more details, see our previous instructions on obtaining the starter code.

The starter code for this lab is available in the raft directory. You only need to modify the raft/raft.go file for this lab. This file provides some initial code and some code samples, e.g., how to send and receive RPCs. You may modify any other files for testing, but please run your final tests with the original versions of these files before submission. Also, do not add new files or remove existing ones.

After you have completed your implementation, you can test your code as follows:

cd raft
go test -run 3A
go test -run 3B
go test -run 3C

Overview

A replicated service (e.g., a key/value service) achieves fault tolerance by storing copies of its state (e.g., all key/value data) on multiple servers called replica servers or replicas. Replication allows the service to continue operating even if some of its replicas experience failures (crashes, or a failed or flaky network), thus providing high availability. The challenge is that failures may cause the replicas to hold differing copies of the data, making it hard to ensure correct (e.g., linearizable) operation.

Raft organizes client requests into a sequence, called the log, and ensures that all the replicas see the same log. Each replica executes client requests in log order, applying the requests to its local copy of the service's state. Since all the live replicas see the same log contents, they all execute the same requests in the same order, and thus continue to have identical service state. If a replica fails but later recovers, Raft takes care of bringing its log up to date. Raft will continue to operate as long as at least a majority of the replicas are alive and can talk to each other. If there is no such majority, Raft will make no progress, but it will pick up where it left off and ensure correct operation as soon as a majority can communicate again. This greatly simplifies implementing a linearizable, replicated service.

In this lab, you'll implement Raft as a Go object type with associated methods, meant to be used as a module by a replicated service. A set of Raft instances, called peers, talk to each other using RPCs to maintain replicated logs. Your Raft interface will support an indefinite sequence of numbered commands, also called log entries. A log entry may or may not be committed; when your Raft implementation commits a log entry, it should send that entry to the replicated service so the service can execute it.

You should follow the design in the extended Raft paper, with particular attention to Figure 2. You'll implement most of the design in the paper, including saving persistent state and reading it back after a node fails and then restarts. However, you will not implement cluster membership changes or log compaction (Sections 6 and 7).

Raft API

Your implementation must support the following interface, which our tester and (eventually) your key/value server will use. You'll find more details in the comments in raft.go.

// create a new Raft server instance
rf := Make(peers, me, persister, applyCh)

// start agreement on a new log entry
rf.Start(command interface{}) (index, term, isLeader)

// ask a Raft peer for its current term, and whether it thinks it is leader
rf.GetState() (term, isLeader)

// each time a new entry is committed to the log, every Raft peer
// should send an ApplyMsg to the service (or tester).
type ApplyMsg

A service calls Make(peers,me,...) to create a Raft peer. The peers argument is an array of network identifiers of the Raft peers (including this one). A peer communicates with other peers using RPC. The me argument is the index of this peer in the peers array. Start(command) asks Raft to initiate appending command to the replicated log. Start() should return immediately, without waiting for the log append to complete. The service expects each Raft peer to send an ApplyMsg for each newly committed log entry to the applyCh channel argument passed to Make().
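
For example, a service built on top of Raft might use this interface roughly as follows. This is only a sketch: the exact parameter types and the ApplyMsg field names are defined in the starter code, and cmd here stands for whatever command type the service uses.

// Sketch: how a service might drive the Raft API.
applyCh := make(chan ApplyMsg)
rf := Make(peers, me, persister, applyCh)

// Submit a command; Start() returns immediately, before the entry commits.
index, term, isLeader := rf.Start(cmd)
if !isLeader {
    // this peer is not the leader; the service should try another peer
}
_, _ = index, term // index and term identify the log slot if it ever commits

// Consume committed entries, in log order, from applyCh.
go func() {
    for msg := range applyCh {
        _ = msg // apply the committed command to the service's state here
    }
}()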

The raft.go file contains example code that sends an RPC (sendRequestVote) and that handles an incoming RPC (RequestVote). Your Raft peers should exchange RPCs using the Go labrpc package (source in labrpc directory). The tester can tell labrpc to delay RPCs, re-order them, and discard them to simulate various network failures. While you can temporarily modify labrpc, make sure your Raft implementation works with the original labrpc, since that's what we'll use to test and grade your lab. Your Raft instances must only interact using RPC; for example, they are not allowed to communicate using shared Go variables or files.

Each part of the lab will take significant time, so start early to leave yourself enough time to complete it. The later parts of this lab build on the earlier parts, so make sure to write easy-to-understand, well-commented code.


Part 3A: Leader Election

Implement Raft leader election and heartbeats (AppendEntries RPCs with no log entries). The goal for Part 3A is to 1) elect a single leader at a time, 2) use heartbeats to ensure that the leader remains a leader when there are no failures, and 3) elect a new leader that takes over when the old leader fails or when packets to/from the old leader are lost.

Follow the paper's Figure 2. At this point you care about sending and receiving RequestVote RPCs, the Rules for Servers that relate to elections, and the State related to leader election. Add the Figure 2 state for leader election to the Raft struct in raft.go.
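
For example, the election-related state might look something like the sketch below; the field names are suggestions only, and they go alongside the fields already present in the starter Raft struct.

// Sketch: possible fields to add to the Raft struct for leader election.
currentTerm int       // latest term this peer has seen (Figure 2)
votedFor    int       // peer voted for in currentTerm, or -1 for none (Figure 2)
state       int       // Follower, Candidate, or Leader
lastHeard   time.Time // time of the last heartbeat or granted vote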

Fill in the RequestVoteArgs and RequestVoteReply structs. When a peer hasn't heard from the leader for a period of time (the election timeout), it should start an election by sending out RequestVote RPCs. Make() creates a background ticker() goroutine, which you can use to start elections. Implement the RequestVote RPC handler so that the Raft peers will vote for one another.
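
The argument and reply fields come straight from Figure 2. A minimal sketch (field names must be capitalized so labrpc/labgob can encode them, and you may add fields later):

// Sketch of the Figure 2 RequestVote arguments and reply.
type RequestVoteArgs struct {
    Term         int // candidate's term
    CandidateId  int // candidate requesting the vote
    LastLogIndex int // index of the candidate's last log entry
    LastLogTerm  int // term of the candidate's last log entry
}

type RequestVoteReply struct {
    Term        int  // currentTerm, for the candidate to update itself
    VoteGranted bool // true means the candidate received this peer's vote
}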

To implement heartbeats, define an AppendEntries RPC struct (though you may not need all the arguments yet), and have the leader send AppendEntries RPCs periodically. Write an AppendEntries RPC handler method. The tester requires that the leader send heartbeat RPCs no more than tens of times per second.
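
Again following Figure 2, the AppendEntries structs might look like the sketch below; for 3A heartbeats the Entries field stays empty, and LogEntry is the log-entry struct you'll define in Part 3B.

// Sketch of the Figure 2 AppendEntries arguments and reply.
type AppendEntriesArgs struct {
    Term         int        // leader's term
    LeaderId     int        // so followers can redirect clients
    PrevLogIndex int        // index of the entry immediately preceding the new ones
    PrevLogTerm  int        // term of the PrevLogIndex entry
    Entries      []LogEntry // log entries to store (empty for heartbeats)
    LeaderCommit int        // leader's commitIndex
}

type AppendEntriesReply struct {
    Term    int  // currentTerm, for the leader to update itself
    Success bool // true if the follower matched PrevLogIndex and PrevLogTerm
}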

The tester requires your Raft implementation to elect a new leader within five seconds of the failure of the old leader (if a majority of peers can still communicate). The paper's Sections 5.2 and 9.3 mention election timeouts in the range of 150 to 300 milliseconds. Such a range only makes sense if the leader sends heartbeats considerably more often than once per 150 milliseconds (e.g., once per 10 milliseconds). Since the tester limits you to tens of heartbeats per second, you will have to use an election timeout larger than the paper's 150 to 300 milliseconds, but not too large, because then you may fail to elect a leader within five seconds.
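
One workable pattern (the constants here are illustrative, not required) is to record when this peer last heard from a leader or granted a vote, and have ticker() compare that timestamp against a freshly randomized timeout; lastHeard is the hypothetical field suggested earlier.

// Sketch: randomized election timeout check, using math/rand and time.
timeout := time.Duration(300+rand.Intn(300)) * time.Millisecond
if time.Since(rf.lastHeard) > timeout {
    // start an election: become a candidate, bump currentTerm,
    // vote for self, and send RequestVote RPCs to all other peers
}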

The tester calls your Raft's rf.Kill() when it is permanently shutting down an instance. You can check whether rf.Kill() has been called using rf.killed(). You may want to do this in all loops, to avoid having dead Raft instances print confusing messages.

Don't forget to implement GetState().

Advice

The easiest way to perform periodic processing is to create a goroutine (or reuse an existing one) that invokes time.Sleep() in a loop. See the ticker() goroutine. Don't use Go's time.Timer or time.Ticker, which are difficult to use correctly.
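
A minimal sketch of the shape such a loop can take, assuming the ticker() goroutine created by Make():

// Sketch: periodic work in ticker(), exiting once the instance is killed.
func (rf *Raft) ticker() {
    for !rf.killed() {
        // check whether the election timeout has elapsed (followers and
        // candidates) or whether it is time for the next heartbeat (leaders)
        time.Sleep(10 * time.Millisecond)
    }
}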

You may find Go's rand useful.

Go RPC only sends struct fields whose names start with capital letters. Sub-structures must also have capitalized field names (e.g. fields of log records in an array). The labgob package will warn you about this; don't ignore the warnings.
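
For example, a lower-case field silently disappears when a struct is sent over labrpc; the sketch below shows one possible shape for a log-entry struct (the type itself is up to you, see Part 3B).

// Sketch: every field that must travel over RPC is capitalized.
type LogEntry struct {
    Term    int         // encoded and sent by labrpc/labgob
    Command interface{} // encoded and sent by labrpc/labgob
    // a field named "command" (lower case) would be silently dropped
}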

The most challenging part of this lab may be the debugging. Spend some time making your implementation easy to debug.

Please visit our Lab3 advice page for tips on how to develop and debug your code.

Testing

We supply a set of tests, which you should use to drive your implementation efforts, and which we'll use to grade your submitted lab. The tests are in raft/test_test.go. You can't easily run your Raft implementation directly. Instead, you should run it by way of the tester.

If you wish to run individual tests, look for function names starting with Test, e.g., TestInitialElection3A in test_test.go. Then run the individual test as follows:

go test -run TestInitialElection3A

If your code has trouble passing the tests, read the paper's Figure 2 again; the full logic for leader election is spread over multiple parts of the figure.

If you fail a test, look at test_test.go and config.go to understand what's being tested. config.go also illustrates how the tester uses the Raft API.

When we grade your submissions, we will run the tests without the race flag. However, you should make sure to run your code with this flag (go test -race) and fix any races it reports.

Be sure you pass the 3A tests before submitting Part 3A, so that you see something like this:

$ go test -run 3A
Test (3A): initial election ...
  ... Passed --   3.0  3   58   13928    0
Test (3A): election after network failure ...
  ... Passed --   4.5  3  120   21164    0
Test (3A): multiple elections ...
  ... Passed --   6.0  7  930  167580    0
PASS
ok      ece419/raft     13.529s

The numbers on each Passed line are the real time the test took in seconds, the number of Raft peers, the number of RPCs sent during the test, the total number of bytes in the RPC messages, and the number of log entries that Raft reports were committed. Your numbers will differ from those shown here. You can ignore the numbers if you like, but they may help you sanity-check the number of RPCs that your implementation sends.

It is a good idea to run the tests multiple (e.g., 10) times before submitting and check that each run prints PASS.

for i in {0..10}; do go test -run 3A; done

For this lab, we have provided you a tool that allows you to estimate your lab grade. After setting your PATH variable, you can run the tool in the raft directory as follows:

ece419-lab3-check [A|B|C]

Note that the tool will fail your solution if it takes more than 120 seconds.

Submission

We will use the following files from your code repository for grading: raft/raft.go.

Please see lab submission.


Part 3B: Raft Log

In this part of the lab, you will implement the leader and follower code to append new log entries, so that the go test -run 3B tests pass.

Start by implementing Start() and then write the code to send and receive new log entries via AppendEntries RPCs, following Figure 2. Send each newly committed entry to the applyCh channel on each peer. You'll need to define a struct to hold information about each log entry.
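
A rough sketch of the shape Start() can take is shown below. The locking, the state and log fields, and the indexing convention are all your own design decisions; the names used here (rf.mu, rf.state, rf.log, LogEntry) are just assumptions.

// Sketch of Start(): the leader appends the command to its own log and
// returns immediately; replication to followers happens in the background.
func (rf *Raft) Start(command interface{}) (int, int, bool) {
    rf.mu.Lock()
    defer rf.mu.Unlock()
    if rf.state != Leader {
        return -1, rf.currentTerm, false
    }
    rf.log = append(rf.log, LogEntry{Term: rf.currentTerm, Command: command})
    index := len(rf.log) - 1 // adjust for your own indexing convention
    // kick off (or let the next heartbeat carry) AppendEntries to followers
    return index, rf.currentTerm, true
}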

You will need to implement the election restriction (section 5.4.1 in the paper).
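
Concretely, a RequestVote handler grants a vote only if the candidate's log is at least as up-to-date as the voter's (Section 5.4.1). A sketch of that comparison, with myLastTerm and myLastIndex standing for the term and index of the voter's last log entry:

// Sketch: the candidate's log is at least as up-to-date as the voter's if
// its last term is higher, or the terms are equal and its log is at least
// as long.
upToDate := args.LastLogTerm > myLastTerm ||
    (args.LastLogTerm == myLastTerm && args.LastLogIndex >= myLastIndex)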

Your code may have loops that repeatedly check for certain events. Don't have these loops execute continuously without pausing, since that will slow your implementation enough that it fails tests. Use Go's condition variables or insert a time.Sleep(10 * time.Millisecond) in each loop iteration.
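
One common pattern is a dedicated applier goroutine that sleeps on a condition variable and drains newly committed entries, rather than polling in a busy loop. The sketch below assumes fields such as rf.commitIndex, rf.lastApplied, rf.applyCh, and rf.applyCond (a sync.Cond on rf.mu) that you define, and ApplyMsg field names taken from the starter code's comments; signal rf.applyCond whenever commitIndex advances.

// Sketch: applier goroutine driven by a sync.Cond instead of busy-waiting.
func (rf *Raft) applier() {
    rf.mu.Lock()
    defer rf.mu.Unlock()
    for !rf.killed() {
        for rf.lastApplied >= rf.commitIndex {
            rf.applyCond.Wait() // releases rf.mu while waiting
        }
        rf.lastApplied++
        msg := ApplyMsg{
            CommandValid: true,
            Command:      rf.log[rf.lastApplied].Command,
            CommandIndex: rf.lastApplied,
        }
        rf.mu.Unlock()
        rf.applyCh <- msg // never hold the lock while sending on applyCh
        rf.mu.Lock()
    }
}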

Advice

Please visit our Lab3 advice page for tips on how to develop and debug your code.

Testing

See the testing instructions above.

The tests for the later labs may fail your code if it runs too slowly. You can check how much real time and CPU time your solution uses with the time command. Here's typical output (notice /bin/time before go test below):

$ /bin/time go test -run 3B
Test (3B): basic agreement ...
  ... Passed --   0.4  3   16    3876    3
Test (3B): RPC byte count ...
  ... Passed --   1.2  3   48  112240   11
Test (3B): test progressive failure of followers ...
  ... Passed --   4.4  3  118   22718    3
Test (3B): test failure of leaders ...
  ... Passed --   4.8  3  190   36684    3
Test (3B): agreement after follower reconnects ...
  ... Passed --   3.6  3   92   21031    7
Test (3B): no agreement if too many followers disconnect ...
  ... Passed --   3.2  5  204   39035    3
Test (3B): concurrent Start()s ...
  ... Passed --   0.5  3   24    6247    6
Test (3B): rejoin of partitioned leader ...
  ... Passed --   3.9  3  148   30236    4
Test (3B): leader backs up quickly over incorrect follower logs ...
  ... Passed --  14.7  5 2056 1499404  102
Test (3B): RPC counts aren't too high ...
  ... Passed --   2.1  3   68   18966   12
PASS
ok      ece419/raft     38.915s
1.41user 0.48system 0:39.13elapsed 4%CPU (0avgtext+0avgdata 71280maxresident)k
16inputs+7528outputs (41major+17111minor)pagefaults 0swaps

The ok ece419/raft 38.915s line means that Go measured the time taken for the 3B tests to be 38.915 seconds of real (wall-clock) time. The 1.41user means that the code consumed 1.41 seconds of (user-mode) CPU time, or the time spent actually executing instructions (rather than waiting or sleeping). If your solution uses much more than a minute of real time for the 3B tests or much more than 5 seconds of CPU time, you may run into trouble later on. Look for time spent sleeping or waiting for RPC timeouts, loops that run without sleeping or waiting for conditions or channel messages, or large numbers of RPCs sent.

Submission

We will use the following files from your code repository for grading: raft/raft.go.

Please see lab submission.


Part 3C: Persistence

If a Raft-based server reboots it should resume service where it left off. This requires that Raft keep persistent state that survives a reboot. The paper's Figure 2 mentions which state should be persistent.

A real implementation would write Raft's persistent state to disk each time it changed and would read the state from disk when restarting after a reboot. Your implementation won't use the disk; instead, it will save and restore persistent state from a Persister object (see persister.go). Whoever calls Raft.Make() supplies a Persister that initially holds Raft's most recently persisted state (if any). Raft should initialize its state from that Persister, and should use it to save its persistent state each time the state changes. Use the Persister's ReadRaftState() and Save() methods.

Complete the functions persist() and readPersist() in raft.go by adding code to save and restore persistent state. You will need to encode (or "serialize") the state as an array of bytes in order to pass it to the Persister. Use the labgob encoder; see the comments in persist() and readPersist(). labgob is like Go's gob encoder but prints error messages if you try to encode structures with lower-case field names. Pass nil as the second argument to persister.Save(). Insert calls to persist() at the points where your implementation changes persistent state. Once you've done this, and if the rest of your implementation is correct, you should pass all of the 3C tests.
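
A sketch of the encode/decode pattern follows; the set of persisted fields must match Figure 2's persistent state (currentTerm, votedFor, and the log), and the field and type names here (rf.votedFor, LogEntry, etc.) are assumptions about your own design.

// Sketch of persist() and readPersist() using labgob and the Persister.
func (rf *Raft) persist() {
    w := new(bytes.Buffer)
    e := labgob.NewEncoder(w)
    e.Encode(rf.currentTerm)
    e.Encode(rf.votedFor)
    e.Encode(rf.log)
    rf.persister.Save(w.Bytes(), nil) // nil: no snapshot in this lab
}

func (rf *Raft) readPersist(data []byte) {
    if len(data) < 1 { // no persisted state yet (e.g., first boot)
        return
    }
    r := bytes.NewBuffer(data)
    d := labgob.NewDecoder(r)
    var currentTerm, votedFor int
    var logEntries []LogEntry
    if d.Decode(&currentTerm) == nil &&
        d.Decode(&votedFor) == nil &&
        d.Decode(&logEntries) == nil {
        rf.currentTerm, rf.votedFor, rf.log = currentTerm, votedFor, logEntries
    }
}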

You will probably need the optimization that backs up nextIndex by more than one entry at a time. Look at the extended Raft paper starting at the bottom of page 7 and top of page 8 (marked by a gray line). The paper is vague about the details; you will need to fill in the gaps. While there are several options, one option is to have the rejection reply to an AppendEntries RPC include:

XTerm:  term in the conflicting entry (if any)
XIndex: index of first entry with that term (if any)
XLen:   log length

Then the leader's logic can be something like:

Case 1: leader doesn't have XTerm:
  nextIndex = XIndex
Case 2: leader has XTerm:
  nextIndex = leader's last entry for XTerm
Case 3: follower's log is too short:
  nextIndex = XLen
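
In Go, the leader-side update might look roughly like the sketch below. It assumes the convention that the follower sets XTerm to -1 when its log is too short, and it uses a hypothetical helper lastIndexOfTerm() that returns the index of the leader's last entry with the given term, or -1 if there is none.

// Sketch: back up nextIndex after a rejected AppendEntries, using the
// XTerm/XIndex/XLen fields described above.
if reply.XTerm == -1 {
    // case 3: follower's log is too short
    rf.nextIndex[server] = reply.XLen
} else if last := rf.lastIndexOfTerm(reply.XTerm); last == -1 {
    // case 1: leader has no entry with XTerm
    rf.nextIndex[server] = reply.XIndex
} else {
    // case 2: leader has XTerm; back up to its last entry for that term
    rf.nextIndex[server] = last
}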

Advice

Please visit our Lab3 advice page for tips on how to develop and debug your code.

Testing

See the testing instructions above.

The 3C tests are more demanding than those for 3A or 3B, and failures may be caused by problems in your code for 3A or 3B.

Be sure you pass the 3C tests (as well as the 3A and 3B tests) before submitting Part 3C, so that you see something like this:

$ go test -run 3C
Test (3C): basic persistence ...
  ... Passed --   3.3  3   84   18701    6
Test (3C): more persistence ...
  ... Passed --  16.0  5  992  183642   16
Test (3C): partitioned leader and one follower crash, leader restarts ...
  ... Passed --   1.4  3   36    7995    4
Test (3C): Figure 8 ...
  ... Passed --  30.3  5 1100  208003   52
Test (3C): unreliable agreement ...
  ... Passed --   1.3  5 1036  317628  246
Test (3C): Figure 8 (unreliable) ...
  ... Passed --  40.8  5 10708 14992854  344
Test (3C): churn ...
  ... Passed --  16.5  5 15736 143777533 3741
Test (3C): unreliable churn ...
  ... Passed --  16.4  5 6776 6455245 1491
PASS
ok      ece419/raft     126.007s

Submission

We will use the following files from your code repository for grading: raft/raft.go.

Please see lab submission.

Acknowledgements

Prof. Robert Morris, Prof. Frans Kaashoek, and Prof. Nickolai Zeldovich at MIT developed these labs. We have their permission to use them for this course.