Notes on Caches
Andreas Moshovos
Spring 2007
Please refer to the book for a complete treatment of caches. Here we review a few examples that explain some of the underlying concepts.
Cache Indexing
A cache is organized as a two-dimensional array of blocks. The rows are called sets, and the columns are called ways. At one extreme there is only one set. This is a fully-associative cache. In this case the number of ways equals the number of blocks. At the other extreme there is a single column. This is a direct-mapped cache. In this case, the number of sets equals the number of blocks. Any configuration in between is called an N-way set-associative cache, where N is the number of ways. Notice that the number of sets is not specified; we can derive it from the total cache capacity.
Given an address from the CPU we need to index the cache. By this we mean, selecting the set in which the address may be cached. The common indexing scheme partitions the incoming address into three fields. Starting from the MSB these are the TAG, SET and OFFSET:
|         TAG         |       SET        |     OFFSET      |
<------------------------- Address bits -------------------->
For a fully-associative cache the set field does not exist. This is because there is only one set. For the direct-mapped cache, if the set width is S bits it holds that 2^S = #Blocks.
Let’s go over a few examples.
Example #1
1MB cache, with 64B blocks, 4-way Set-Associative, write-through with LRU replacement. Assume a 36-bit byte-addressable address space.
In this case, there are 2^36 different byte addresses.
It’s good practice to try to write the parameters as a power of two where possible. So we have Capacity = 1MB = 2^20 bytes, Block Size = 64B = 2^6, associativity = 4 = 2^2.
Notice that the capacity refers to the total number of data bytes that the cache can store. It does not include the overhead bits required by the tags, valid bits and LRU bits.
From the size of the blocks we can immediately derive the width of the offset field. It holds that offset_width = lg(Block Size), where lg is the logarithm base two. So, in this case OFFw = 6 bits.
To find the width of the SET field we need to determine the number of rows. To do that we need to determine the number of blocks. We are given that this is a 4-way set-associative cache, hence each set has four blocks.
It holds:
#Blocks = Capacity / BlockSize = 2^20 / 2^6 = 2^14. There are 2^14 blocks.
#Sets = #Blocks / #ways = 2^14 / 2^2 = 2^12. There are 2^12 sets, hence the SET field is 12 bits wide.
So the indexing is done as follows:
|<---- 36 - 12 - 6 ---->|<------- 12 ------->|<------ 6 ------>|
|          TAG          |        SET         |      OFFSET     |
To access the cache we first use the 12 SET bits to select one of the 2^12 sets. Then we compare the TAG field with the TAG fields of the four blocks that belong to the set. If a match is found, we finally use the OFFSET field to select the appropriate byte out of the 64 stored in the block.
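To make the indexing concrete, here is a small Python sketch that splits a 36-bit address into the TAG, SET and OFFSET fields for this example's parameters (the function name split_address and the sample address are purely illustrative):

# Example #1: 1MB, 64B blocks, 4-way set-associative, 36-bit byte addresses.
CAPACITY   = 1 << 20                       # 2^20 bytes
BLOCK_SIZE = 1 << 6                        # 2^6 bytes
WAYS       = 4

OFFSET_BITS = 6                            # lg(64)
NUM_BLOCKS  = CAPACITY // BLOCK_SIZE       # 2^14
NUM_SETS    = NUM_BLOCKS // WAYS           # 2^12
SET_BITS    = 12                           # lg(2^12)
TAG_BITS    = 36 - SET_BITS - OFFSET_BITS  # 18

def split_address(addr):
    """Return (tag, set index, offset) for a 36-bit byte address."""
    offset  = addr & (BLOCK_SIZE - 1)
    set_idx = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag     = addr >> (OFFSET_BITS + SET_BITS)
    return tag, set_idx, offset

print(split_address(0x2F4))                # -> (0, 11, 52)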
Example #2
36KB, 9-way set-associative cache, with 8-byte blocks, write-back with LRU replacement and a 2^24 address space (byte addressable).
We immediately know that OFF width = 3 bits (from the block size).
#blocks = Capacity / block size = 9 x 4KB / 8B = 9 x 2^12 / 2^3 = 9 x 2^9.
#sets = #blocks / #ways = 9 x 2^9 / 9 = 2^9.
So the SET field is 9 bits wide.
It follows that the TAG field is 24 - 9 - 3 = 12 bits wide.
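The same arithmetic can be sanity-checked with a short Python sketch; note that the 9-way associativity is not a power of two, but the number of sets still is:

# Example #2: 36KB, 9-way set-associative, 8B blocks, 24-bit byte addresses.
capacity, block_size, ways, addr_bits = 36 * 1024, 8, 9, 24

num_blocks  = capacity // block_size              # 9 x 2^9 = 4608
num_sets    = num_blocks // ways                  # 2^9 = 512
offset_bits = (block_size - 1).bit_length()       # 3
set_bits    = (num_sets - 1).bit_length()         # 9
tag_bits    = addr_bits - set_bits - offset_bits  # 12
print(num_blocks, num_sets, offset_bits, set_bits, tag_bits)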
Example #3
If we have 7KB of data cache and assuming 128B blocks what can the associativity be?
Let’s first see how many blocks there are:
#blocks = Capacity / block size = 7 x 2^10 / 2^7 = 7 x 2^3 = 56.
We can organize the blocks into one set. This would be a Fully-Associative cache.
Can we organize the blocks into two sets? Each set would have 7 x 2^3 / 2 = 7 x 2^2 = 28 blocks. Yes we can. This is a 28-way set-associative cache.
Can we organize the blocks into four sets? Notice that the number of sets *has* to be a power of two since we assume that we use a portion of the address to index the sets directly (there are advanced designs that use different indexing functions, but we will not discuss those in this course). So, if we had four sets, each set would have 7 x 2^3 / 2^2 = 7 x 2 = 14 blocks. Yes, we can. This is a 14-way set-associative cache.
How about 8 sets? Each set would have 7 blocks. This is a 7-way set-associative cache.
How about 16 sets? Each set would have 56 / 16 = 3.5 blocks. This is not possible.
How about 32? Each set would have 56 / 32 = 1.75 blocks. This is not possible.
How about 64? No, there are only 56 blocks.
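The cases above can also be enumerated with a small Python loop (a sketch; the number of sets must be a power of two and must divide the number of blocks evenly):

# Example #3: possible organizations for 7KB of data with 128B blocks.
capacity, block_size = 7 * 1024, 128
num_blocks = capacity // block_size          # 7 x 2^3 = 56

sets = 1
while sets <= num_blocks:
    if num_blocks % sets == 0:
        print(f"{sets} set(s): {num_blocks // sets}-way set-associative")
    else:
        print(f"{sets} set(s): not possible ({num_blocks / sets} blocks per set)")
    sets *= 2
# 64 sets (and beyond) are not possible either, since there are only 56 blocks.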
Storage Requirements
How many storage bits are required to implement a 256KB cache, with 16B blocks, that is 4-way set-associative, uses write-back, LRU replacement and assuming a 2^36 byte-addressable address space?
Bits are required for:
1. The data
2. The tags
3. The valid bits
4. The dirty bits
5. The LRU bits
Data bits = 256K x 8 = 2^18 x 2^3 = 2^21 bits = 2Mbits
To calculate the rest we need to figure out how the cache is indexed.
#blocks = 2^18 / 2^4 = 2^14
#sets = 2^14 / 2^2 = 2^12
So, the TAG field is 36-12-4 = 20 bits wide.
Tag bits = #blocks x TAG field width = 2^14 x 20 bits = 320 Kbits
There is one valid bit and one dirty bit per block (the dirty bits are needed because this is a write-back cache):
Valid Bits = 2^14 and Dirty Bits = 2^14
The LRU bits are associated with each set. They tell us in which order the blocks within the set were accessed. As we explained, we need at least ceil(lg(ways!)) bits per set to encode this information. So:
LRU bits = 2^12 x ceil(lg(4!)) = 2^12 x ceil(lg 24) = 2^12 x 5 bits
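The whole bit count can be tallied with a small Python sketch (assumptions as above: one valid and one dirty bit per block, and ceil(lg(ways!)) LRU bits per set):

import math

# 256KB, 16B blocks, 4-way set-associative, write-back, LRU, 36-bit addresses.
capacity, block_size, ways, addr_bits = 256 * 1024, 16, 4, 36

num_blocks  = capacity // block_size                 # 2^14
num_sets    = num_blocks // ways                     # 2^12
offset_bits = int(math.log2(block_size))             # 4
set_bits    = int(math.log2(num_sets))               # 12
tag_bits    = addr_bits - set_bits - offset_bits     # 20

data_bits      = capacity * 8                        # 2^21 = 2Mbits
tag_bits_total = num_blocks * tag_bits               # 2^14 x 20
valid_bits     = num_blocks                          # 2^14
dirty_bits     = num_blocks                          # 2^14
lru_bits       = num_sets * math.ceil(math.log2(math.factorial(ways)))  # 2^12 x 5

print(data_bits + tag_bits_total + valid_bits + dirty_bits + lru_bits, "bits")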
Determining Hits and Misses
Let us assume that a program accesses the following addresses:
0x200, 0x204, 0x208, 0x20C, 0x2F4, 0x2F0, 0x200, 0x204, 0x218, 0x21C, 0x24C, 0x2F4, with this 12-access sequence repeated twice (24 accesses in total).
Given a cache with eight 4-byte blocks, determine how many hits and misses will occur.
#1 Direct-Mapped Cache
Since the block size is four bytes, the lower two bits of the address are the offset within the block. Since there are 8 blocks and this is a direct-mapped cache, there are 8 sets; hence the next 3 bits are the set index. The remaining upper bits form the TAG.
ADDRESS IN HEX | ADDRESS IN BINARY | HIT/MISS | Action
0x200 | 0010 0000 0000 | Miss | Bring block 0010 0000 00XX from memory into set 000
0x204 | 0010 0000 0100 | Miss | Bring block 0010 0000 01XX from memory into set 001
0x208 | 0010 0000 1000 | Miss | Bring block 0010 0000 10XX from memory into set 010
0x20C | 0010 0000 1100 | Miss | Bring block 0010 0000 11XX from memory into set 011
0x2F4 | 0010 1111 0100 | Miss | Bring block 0010 1111 01XX from memory into set 101
0x2F0 | 0010 1111 0000 | Miss | Bring block 0010 1111 00XX from memory into set 100
0x200 | 0010 0000 0000 | Hit  | Set 000
0x204 | 0010 0000 0100 | Hit  | Set 001
0x218 | 0010 0001 1000 | Miss | Bring block 0010 0001 10XX from memory into set 110
0x21C | 0010 0001 1100 | Miss | Bring block 0010 0001 11XX from memory into set 111
0x24C | 0010 0100 1100 | Miss | Bring block 0010 0100 11XX from memory into set 011, replacing block 0010 0000 11XX
0x2F4 | 0010 1111 0100 | Hit  | Set 101
0x200 | 0010 0000 0000 | Hit  | Set 000
0x204 | 0010 0000 0100 | Hit  | Set 001
0x208 | 0010 0000 1000 | Hit  | Set 010
0x20C | 0010 0000 1100 | Miss | Bring block 0010 0000 11XX from memory into set 011, replacing block 0010 0100 11XX
0x2F4 | 0010 1111 0100 | Hit  | Set 101
0x2F0 | 0010 1111 0000 | Hit  | Set 100
0x200 | 0010 0000 0000 | Hit  | Set 000
0x204 | 0010 0000 0100 | Hit  | Set 001
0x218 | 0010 0001 1000 | Hit  | Set 110
0x21C | 0010 0001 1100 | Hit  | Set 111
0x24C | 0010 0100 1100 | Miss | Bring block 0010 0100 11XX from memory into set 011, replacing block 0010 0000 11XX
0x2F4 | 0010 1111 0100 | Hit  | Set 101
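The table above can be reproduced with a tiny Python simulation of the direct-mapped cache; this is an illustrative sketch that keeps only the tag stored in each of the 8 sets and works out to 13 hits and 11 misses for the 24 accesses:

# Direct-mapped cache: eight 4-byte blocks, i.e. 8 sets of one block each.
accesses = [0x200, 0x204, 0x208, 0x20C, 0x2F4, 0x2F0,
            0x200, 0x204, 0x218, 0x21C, 0x24C, 0x2F4] * 2

BLOCK_SIZE, NUM_SETS = 4, 8
tags = [None] * NUM_SETS              # the tag currently held by each set

hits = misses = 0
for addr in accesses:
    block = addr // BLOCK_SIZE        # drop the 2 offset bits
    index = block % NUM_SETS          # the 3 set bits
    tag   = block // NUM_SETS         # the remaining tag bits
    if tags[index] == tag:
        hits += 1
        print(f"{addr:#05x}: hit  (set {index:03b})")
    else:
        misses += 1
        tags[index] = tag
        print(f"{addr:#05x}: miss (set {index:03b})")

print(hits, "hits,", misses, "misses")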
#2 Fully-Associative Cache
In this case there is only one set with 8 blocks. The LRU Order column lists the resident blocks from most recently used to least recently used.
ADDRESS IN HEX | ADDRESS IN BINARY | LRU Order | HIT/MISS | Action
0x200 | 0010 0000 0000 | 0x200 | Miss | Bring block 0010 0000 00XX from memory
0x204 | 0010 0000 0100 | 0x204, 0x200 | Miss | Bring block 0010 0000 01XX from memory
0x208 | 0010 0000 1000 | 0x208, 0x204, 0x200 | Miss | Bring block 0010 0000 10XX from memory
0x20C | 0010 0000 1100 | 0x20C, 0x208, 0x204, 0x200 | Miss | Bring block 0010 0000 11XX from memory
0x2F4 | 0010 1111 0100 | 0x2F4, 0x20C, 0x208, 0x204, 0x200 | Miss | Bring block 0010 1111 01XX from memory
0x2F0 | 0010 1111 0000 | 0x2F0, 0x2F4, 0x20C, 0x208, 0x204, 0x200 | Miss | Bring block 0010 1111 00XX from memory
0x200 | 0010 0000 0000 | 0x200, 0x2F0, 0x2F4, 0x20C, 0x208, 0x204 | Hit |
0x204 | 0010 0000 0100 | 0x204, 0x200, 0x2F0, 0x2F4, 0x20C, 0x208 | Hit |
0x218 | 0010 0001 1000 | 0x218, 0x204, 0x200, 0x2F0, 0x2F4, 0x20C, 0x208 | Miss | Bring block 0010 0001 10XX from memory
0x21C | 0010 0001 1100 | 0x21C, 0x218, 0x204, 0x200, 0x2F0, 0x2F4, 0x20C, 0x208 | Miss | Bring block 0010 0001 11XX from memory
0x24C | 0010 0100 1100 | 0x24C, 0x21C, 0x218, 0x204, 0x200, 0x2F0, 0x2F4, 0x20C | Miss | Bring block 0010 0100 11XX from memory, replacing the block containing 0x208
0x2F4 | 0010 1111 0100 | 0x2F4, 0x24C, 0x21C, 0x218, 0x204, 0x200, 0x2F0, 0x20C | Hit |
0x200 | 0010 0000 0000 | 0x200, 0x2F4, 0x24C, 0x21C, 0x218, 0x204, 0x2F0, 0x20C | Hit |
0x204 | 0010 0000 0100 | 0x204, 0x200, 0x2F4, 0x24C, 0x21C, 0x218, 0x2F0, 0x20C | Hit |
0x208 | 0010 0000 1000 | 0x208, 0x204, 0x200, 0x2F4, 0x24C, 0x21C, 0x218, 0x2F0 | Miss | Bring block 0010 0000 10XX from memory, replacing the block containing 0x20C
0x20C | 0010 0000 1100 | 0x20C, 0x208, 0x204, 0x200, 0x2F4, 0x24C, 0x21C, 0x218 | Miss | Bring block 0010 0000 11XX from memory, replacing the block containing 0x2F0
0x2F4 | 0010 1111 0100 | 0x2F4, 0x20C, 0x208, 0x204, 0x200, 0x24C, 0x21C, 0x218 | Hit |
0x2F0 | 0010 1111 0000 | 0x2F0, 0x2F4, 0x20C, 0x208, 0x204, 0x200, 0x24C, 0x21C | Miss | Bring block 0010 1111 00XX from memory, replacing the block containing 0x218
0x200 | 0010 0000 0000 | 0x200, 0x2F0, 0x2F4, 0x20C, 0x208, 0x204, 0x24C, 0x21C | Hit |
0x204 | 0010 0000 0100 | 0x204, 0x200, 0x2F0, 0x2F4, 0x20C, 0x208, 0x24C, 0x21C | Hit |
0x218 | 0010 0001 1000 | 0x218, 0x204, 0x200, 0x2F0, 0x2F4, 0x20C, 0x208, 0x24C | Miss | Bring block 0010 0001 10XX from memory, replacing the block containing 0x21C
0x21C | 0010 0001 1100 | 0x21C, 0x218, 0x204, 0x200, 0x2F0, 0x2F4, 0x20C, 0x208 | Miss | Bring block 0010 0001 11XX from memory, replacing the block containing 0x24C
0x24C | 0010 0100 1100 | 0x24C, 0x21C, 0x218, 0x204, 0x200, 0x2F0, 0x2F4, 0x20C | Miss | Bring block 0010 0100 11XX from memory, replacing the block containing 0x208
0x2F4 | 0010 1111 0100 | 0x2F4, 0x24C, 0x21C, 0x218, 0x204, 0x200, 0x2F0, 0x20C | Hit |
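The fully-associative case can be checked the same way with a sketch that keeps the resident blocks in a list ordered from most to least recently used and evicts the last entry when the single 8-block set is full; for this sequence it works out to 9 hits and 15 misses:

# Fully-associative cache: 8 blocks of 4 bytes, LRU replacement.
accesses = [0x200, 0x204, 0x208, 0x20C, 0x2F4, 0x2F0,
            0x200, 0x204, 0x218, 0x21C, 0x24C, 0x2F4] * 2

BLOCK_SIZE, NUM_BLOCKS = 4, 8
lru = []                              # resident blocks, most recently used first

hits = misses = 0
for addr in accesses:
    block = addr // BLOCK_SIZE
    if block in lru:
        hits += 1
        lru.remove(block)             # will be re-inserted at the front
        outcome = "hit "
    else:
        misses += 1
        if len(lru) == NUM_BLOCKS:
            lru.pop()                 # evict the least recently used block
        outcome = "miss"
    lru.insert(0, block)
    order = ", ".join(f"{b * BLOCK_SIZE:#05x}" for b in lru)
    print(f"{addr:#05x}: {outcome} LRU order: {order}")

print(hits, "hits,", misses, "misses")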
Demonstrating that Caches are a Heuristic
The actual hit rate achieved by a cache is a function of both the cache and the program using it. Given two caches of the same capacity but different organization, it is possible to find access sequences that make them behave differently.
Here’s an example: let’s demonstrate that, given a cache with four one-byte blocks, a direct-mapped organization can be both better and worse than a 2-way set-associative one.
To show that the direct-mapped cache can be worse than the 2-way set-associative cache we can use the following access sequence:
0, 4, 0, 4, 0, 4, …
Addresses 0 and 4 map to set 0 in both the direct-mapped cache and the 2-way SA cache. Since the DM cache has only one block per set, the two addresses conflict and keep evicting each other; the DM cache will never get a hit with this sequence. The 2-way SA cache has two blocks in set 0, hence the two addresses can co-exist. Aside from the two initial misses, all other accesses will be hits.
Let’s now show that the direct-mapped cache can be better than the 2-way SA cache. For that we can use the access sequence:
0, 2, 4, 0, 2, 4, 0, 2, 4, …
All three addresses map to set 0 in the 2-way SA cache and, assuming an LRU replacement policy, they keep evicting each other; there will be no hits in the 2-way SA cache. In the DM cache, addresses 0 and 4 map to set 0 and evict each other, but address 2 maps to set 2 and, after the initial miss, remains in the cache. Hence the DM cache will observe one hit every three accesses.
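Both scenarios can be verified with a small generic simulator; the sketch below (assuming four one-byte blocks and LRU replacement, as in the discussion above) runs the two access sequences against the direct-mapped and the 2-way set-associative organizations:

# Compare a direct-mapped and a 2-way set-associative cache, each with four 1-byte blocks.
def simulate(accesses, num_blocks=4, ways=1, block_size=1):
    """Return (hits, misses) for an N-way set-associative cache with LRU replacement."""
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]      # per-set block list, most recently used first
    hits = misses = 0
    for addr in accesses:
        block = addr // block_size
        index = block % num_sets
        resident = sets[index]
        if block in resident:
            hits += 1
            resident.remove(block)            # refresh its LRU position
        else:
            misses += 1
            if len(resident) == ways:
                resident.pop()                # evict this set's LRU block
        resident.insert(0, block)
    return hits, misses

seq_a = [0, 4] * 6                            # 0, 4, 0, 4, ...
seq_b = [0, 2, 4] * 6                         # 0, 2, 4, 0, 2, 4, ...

for name, seq in (("0, 4, ...", seq_a), ("0, 2, 4, ...", seq_b)):
    print(name, "direct-mapped (hits, misses):", simulate(seq, ways=1),
          " 2-way SA:", simulate(seq, ways=2))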