Determining Hits & Misses with Caches

Andreas Moshovos

Spring 2005

 

In this set of lectures we will look at how, given (1) a piece of assembly code and (2) a memory hierarchy, we can determine how many hits and misses will occur and thus how long it will take for the code to execute. To simplify the discussion we make the following assumptions:

 

1.    Assuming that memory is ideally fast all instructions execute within a single cycle.

2.    We ignore the memory accesses needed for fetching instructions (because we want to avoid having to figure out how many bytes each m68k instruction takes and because adding these references will make it harder to understand what is happening without really enhancing the material).

 

Example #1: Linear access of array elements

 

Lets see what will happen when the following code executes:

 

movea.l #array, a0 ; base of the array

clr.l d0 ; sum = 0

clr.l d1 ; counts the number of elements processed thus far

move.l #4096, d2 ; number of array elements

 

loop

add.l (a0), d0 ; sum = sum + next element of the array

adda.l #4, a0 ; make a0 point to the next element

addq.l #1, d1 ; one more element processed

cmp.l d1, d2 ; have we processed all elements?

Bne loop ; no, continue

 

When this code executes, and ignoring instruction fetches, it will access memory via the instruction add.l (a0), d0. It will access the following addresses:

0, 4, 8, 16, ..., 4092 (decimal).

 

Execution time with ideal memory (single cycle latency):

 

Assuming that each instruction takes a single cycle to execute the aforementioned code will execute in the following number of cycles:

 

4 + 4096 x 5 cycles

 

The first four cycles are for the first four instructions that execute only once. Each loop iteration comprises five instructions. The loop iterates 4096 times (d1 counting from 0 to 4096).

 

Execution time with a Cache

 

Lets now determine how long it will take for the code to execute with a realistic memory hierarchy. For this purpose let us assume that a memory reference completes in a single cycle if it hits in the cache (cache latency = 1 cycle). It takes 100 cycles in total to complete a reference when it misses in the cache (1 cycle for determining that this is a miss plus 99 cycle to go to memory). Lets assume that the data cache is direct mapped, 16Kbytes (data capacity notice that many more bits are needed for implementing the cache since it contains the tags, valid bits and possibly dirty bits these are often not included in the data cache size the latter refers to the useful capacity of the cache). The data cache has 32 byte frames.

 

Let us further assume that array (the base address of the array) is divisible by 16K, that is (array modulo 16K) = 0. This assumption will allow us to determine which set access will map to. Because array is divisible by 16k this implies that the last 14 bits (thats because 2^14 = 16K) are zeroes. So, array = x...x0...0 where the x...x = 18 bits and 0...0 = 14 bits.

 

First we need to determine how the cache is indexed. That is given a 32 bit address from the CPU how do we determine which set and frame this may map to.

 

We know that:

 

#frames = Capacity / bytes_per_frame = 16K / 32 = 2^14 / 2^5 = 2^9 = 512

#sets = #frames / associativity = 2^9 / 1 = 2^9

 

So, our cache has 512 sets with one frame per set. Each frame contains 32 bytes. Thus, the 32-bit address is divided into three fields:

Offset: total 5 bits because each frame holds 32 bytes à bits 4 through bit 0 (the least significant ones)

Set: total 9 bits because there are 2^9 sets in the cache à bits 13 through 5 (immediately next to offset)

Tag: what is left, so 32 9 5 = 18 bits, or bits 31 through 14.

 

Now that we know how addresses are mapped onto the cache we can determine how many are hits and how many are misses.

 

Access #1: address array + 0 à this maps to tag x...x, set 0, and offset 0 à the cache is initially empty hence this is a miss à latency 100 cycles

Note the tag is x...x the upper 18 bits of the array address.

Because this is a miss, the whole block will be fetched from memory in the 100 cycles. The block includes the bytes in addresses 0 through 31 (32 bytes in total).

Recall that the assignment of addresses into blocks is static. Addresses 0 through 31 belong to block 0, 32 to 63 to block 1, 64 trough 95 to block 2 and so on.

Access #2: address array + 4 à tag x...x, set 0, offset 4 à this is a hit since we brought in this data when we missed on the first access à latency is 1

Access #3: address array + 8 à tag x...x, set 0, offset 8 à hit à latency 1 cycle

...

Access #8: address array + 28 à tag x...x, set 0, offset 28 à hit à latency 1 cycle

Access #9: address array + 32 à tag x...x, set 1, offset 0 à miss since this belongs to block 1 and this is the first access to an address within block 1 à latency 100 cycles

This will bring in the data at addresses array + 32 ... array + 63

Access #10: address array + 36 à tag x...x, set 1, offset 4 à hit à latency 1 cycle

 

If we continue this way we will see a repeating pattern of 1 miss and 7 hits per block.

 

So our execution time will be:

 

4 (for the first four instructions) + 4096 x 4 (for the four instructions in the loop that do not access memory) + 4096 (7/8 x 1 (for the 7 hits per block) 1/8 x 100 (for the one miss per block))

if the above code was to be executed twice with the same cache, the second time the array elements are accessed they will all be hits since the full array has been cached. This is possible because the caches capacity is sufficient to hold all of the array (which is 4096 x 4 = 16Kbytes). The same will hold true for caches that have higher capacity. If the cache was 8Kbytes, then accesses to elements 2048 through 4095 would replace elements 0 through 2047 from the cache. So, next time around 0 through 2047 are accessed they will observe the 1 miss 7 hit pattern.

 

 

Example #2: Accessing two aligned arrays

 

Lets us now look at another example where we access two arrays simultaneously:

 

movea.l #aA, a0 ; array A

movea.l #aB, a1 ; array B

move.l #4096, d1 ; elements per array

clr.l d1 ; elements processed thus far

clr.l d0 ; sum = 0

loop

add.l (a0), d0 ; sum = sum + element of array a

add.l (a1), d0 ; sum = sum + element of array b

addq.l #1, d1 ; one more element per array processed

adda.l #4, a0 ; point to next element of array a

adda.l #4, a1 ; point to next element of array b

cmp.l d1, d2 ; all elements processed?

bne loop ; no, continue

 

If we assume an ideal memory system then this many cycles will be needed:

 

5 (for the initialization) + 4096 (iterations) x 7 (instructions in loop body)

 

Lets assume that aA modulo 16K =0 and aB modulo 16K = 0 hence aA = x...x0...0 and bB=y...y0...0 where the 0...0 portions comprise 14 bits and the x...x and y...y are each 18 bits. The exact values of x...x and y...y are not important.

 

As before let us assume a 16Kbyte cache, with 32 bytes per frame and that is direct mapped. Heres the sequence of accesses and what happens during each one of them:

 

Access #1: aA + 0 à tag x...x, set 0, offset 0 à miss -> 100 cycles à brings in data from aA + 0 through aA + 31
Access #2: bB+ 0 à tag y...y, set 0, offset 0 à miss à 100 cycles à brings in data from aB + 0 through bB + 31 à replaces data in the frame (those fetched from the previous access)

Access #3: aA + 4 à tag x...x, set 0, offset 4 à miss à 100 cycles à brings in data from aA+0 through aA + 31 à replaces data from previous reference

Access #4: bB + 4 à tag y...y, set 0, offset 4 à miss à 100 cycles brings in data from aB + 0 through bB + 31 à replaces data from previous reference

And so on.

 

As we see no access results in a hit as consecutive addresses map onto the same cache frame but to different blocks and hence they evict each other.

 

What if the cache was 2-way set associative? In this case, the case will have 256 sets where each set will contain 2 frames. In this case, the tag is 19 bits long and the set 8 bits long. The offset still is 5 bits long since each frame is still 32 bytes long.

 

So it will take this many cycles for the code to execute:

 

5 (for initialization) + 4096 x 5 (for the loop body instructions that do not access memory) + 4096 x 2 x 100 (for the two instructions that access memory)

 

Heres what happens at each access:

 

Access #1: aA + 0 à tag x...x0, set 0, offset 0 à miss -> 100 cycles à brings in data from aA + 0 through aA + 31, placed in frame 0 of set 0
Access #2: bB+ 0 à tag y...y0, set 0, offset 0 à miss à 100 cycles à brings in data from aB + 0 through bB + 31 à placed in frame 1 of set 0

Access #3: aA + 4 à tag x...x0, set 0, offset 4 à hit

Access #4: bB + 4 à tag y...y0, set 0, offset 4 à hit

...

Access #15: aA + 28 à tag x...x0, set 0, offset 28 à hit

Access #16: bB + 28 à tag y...y0, set 0, offset 28 à hit

Access #17: aA + 32 à tag x...x0, set 1, offset 0 à miss -> 100 cycles à brings in data from aA + 32 through aA + 63, placed in frame 0 of set 1
Access #18: bB+ 32 à tag y...y0, set 1, offset 0 à miss à 100 cycles à brings in data from aB + 32 through bB + 63 à placed in frame 1 of set 1

And so on.

 

5 + 4096 x 5 + 4096 x ( 1 / 8 x 100 + 7 / 8 x 1) x 2 (two instructions experiencing one miss and seven hits per eight references).