The problem sets are intended to help you learn the material, and we encourage you to collaborate with other students and to ask questions in discussion sections and office hours to understand the problems. However, each student must turn in their own solution to the problems.

The problem sets also provide essential background material for the exam and the midterms. The problem sets will be graded primarily on an effort basis, but if you do not work through the problem sets you are unlikely to succeed on the exam or midterms! We will distribute solutions to the problem sets on the day the problem sets are due to give you feedback. Homework assignments are due at the beginning of class on the due date, and all assignments are to be submitted through Gradescope. The CS152 Gradescope can be joined using the code: KY4RX5. Late homework will not be accepted, except for extreme circumstances and with prior arrangement.
**Problem 1: Cache Access-Time & Performance**

This problem requires the knowledge of Handout #2 and Lectures 6 & 7. Please, read these materials before answering the following questions.

Ben is trying to determine the best cache configuration for a new processor. He knows how to build two kinds of caches: direct-mapped caches and 4-way set-associative caches. The goal is to find the better cache configuration with the given building blocks. He wants to know how these two different configurations affect the clock speed and the cache miss-rate, and choose the one that provides better performance in terms of average latency for a load.

**Problem 1.A**

Now we want to compute the access time of a direct-mapped cache. We use the implementation shown in Figure H2-A in Handout #2. Assume a 256-KB cache with 8-word (32-byte) cache lines. The address is 32 bits and byte-addressed, so the two least significant bits of the address are ignored since a cache access is word-aligned. The data output is also 32 bits (1 word), and the MUX selects one word out of the eight words in a cache line. Using the delay equations given in Table 2.1-1, fill in the column for the direct-mapped (DM) cache in the table. In the equation for the data output driver, ‘associativity’ refers to the associativity of the cache (1 for direct-mapped caches, A for A-way set-associative caches).

<table>
<thead>
<tr>
<th>Component</th>
<th>Delay equation (ps)</th>
<th>DM (ps)</th>
<th>SA (ps)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Decoder</td>
<td>30×(# of index bits) + 80</td>
<td>Tag 470</td>
<td>410</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Data 470</td>
<td>410</td>
</tr>
<tr>
<td>Memory array</td>
<td>30×log₂(# of rows) + 30×[log₂(# of bits in a row)] + 120</td>
<td>Tag 610</td>
<td>610</td>
</tr>
<tr>
<td>Comparator</td>
<td>30×(# of tag bits) + 70</td>
<td>Data 730</td>
<td>700</td>
</tr>
<tr>
<td>N-to-1 MUX</td>
<td>50×log₂N + 100</td>
<td></td>
<td>250 250</td>
</tr>
<tr>
<td>Buffer driver</td>
<td>180</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Data output driver</td>
<td>50×(associativity) + 100</td>
<td></td>
<td>150 300</td>
</tr>
<tr>
<td>Valid output</td>
<td>80</td>
<td></td>
<td>80 80</td>
</tr>
</tbody>
</table>

Table 2.1-1: Delay of each Cache Component

What is the critical path of this direct-mapped cache for a cache read? What is the access time of the cache (the delay of the critical path)? To compute the access time, assume that a 2-input gate (AND, OR) delay is 50 ps. If the CPU clock is 2.5 GHz, how many CPU cycles does a cache access take?
For the given cache structure which is byte addressable, we can know that the

\[ \text{# of offset bits} = \log_2(\text{# of byte in a word line}) = 5 \text{ bits}. \]

The # of lines in the cache is

\[ \text{# of lines} = \frac{\text{size}}{\text{wordline size}} = 2^{18}/2^5 = 2^{13} \text{ lines} \]

Because the cache is direct map, then

\[ \text{# of index bit} = \log_2(2^{13}) = 13 \text{ bits} \]

The total address bits is 32 bits, then

\[ \text{# of tag bits} = 32 - 13 - 5 = 14 \text{ bits} \]

Applying all values we calculate above to the delay equations, we have:

- \text{Decoder (tag)} = 30 * 13 + 80 = 470ps
- \text{Decoder (data)} = 30 * 13 + 80 = 470ps

Note: \# of bits in a row for the tag should include the valid and dirty bits

- \text{Memory array (tag)} = 30*\log_2(2^{13}) + 30*\text{ceil} (\log_2 (14+2)) + 100 = 610ps
- \text{Memory array (data)} = 30*\log_2(2^{13}) + 30*\text{ceil} (\log_2 (32*8)) + 100 = 730ps

- \text{Comparator} = 30*14+70 = 490ps

- \text{N-1 mux} = 50*\log_2(8) + 100 = 250ps

- \text{Data output driver} = 50 * 1 + 100 = 150ps

To determine the critical path for a cache read, we need to compute the time it takes to go through each path in hardware (tag check and data read). By taking the maximum delay of these two paths, we are left with the critical path.

- \text{Time to tag check valid driver from tag array}\n  = \text{Decoder (tag)} + \text{Memory array (tag)} + \text{comparator} + \text{AND gate} + \text{valid output driver}\n  = 470 + 610 + 490 + 50 + 80 = 1700ps

- \text{Time to data output drive from data array}\n  = \text{Decoder (data)} + \text{Memory array (data)} + \text{8-1 MUX} + \text{data output driver}\n  = 470 + 730 + 250 + 150 = 1600ps

From the above results, we can see that the critical path is tag check. The access time is 1700ps. At 2.5GHz, the cache access takes \((1700ps/(1/2.5GHz)) = 4.3 \sim 5 \text{ cycles}\). Here, \text{rounding up} to the nearest cycle is sensible, as this reflects how a synchronous system would work.
Problem 1.B  Access Time: Set-Associative

We also want to investigate the access time of a set-associative cache using the 4-way set-associative cache in Figure H2-B in Handout #2. Assume the total cache size is still 256KB (each way is 64KB), a 2-input gate delay is 50 ps, and all other parameters (such as the input address, cache line, etc.) are the same as part 2.1.A. **Compute the delay of each component and fill in the column for a 4-way set-associative cache in Table 2.1-1.**

What is the critical path of the 4-way set-associative cache? What is the access time of the cache (the delay of the critical path)? What is the main reason that the 4-way set-associative cache is slower than the direct-mapped cache? If the CPU clock is 2.5 GHz, how many CPU cycles does a cache access take?

For the given cache structure which is byte addressable, we know that the number of offset bits = \( \log_2(\text{# of byte in a word line}) = 5 \) bits.

The # of lines in a way of the cache:

\[ \text{# of lines} = \left( \frac{\text{size}}{\text{wordline size}} \right) / \text{nWays} = \left( \frac{2^{18}}{2^5} \right) / 4 = 2^{11} \text{ lines} \]

The number of index bits is then:

\[ \text{# of index bit} = \log_2(2^{11}) = 11 \text{ bits} \]

The total address bits is 32 bits, then:

\[ \text{# of tag bits} = 32-11-5 = 16 \text{ bits} \]

Applying all values we calculate above to the delay equations, we have:

- **Decoder (tag)** = 30 * 11 + 80 = 410ps
- **Decoder (data)** = 30 * 11 + 80 = 410ps

**Note:** tag bits include the valid/dirty bits (+2)

- **Memory array (tag)** = 30*\( \log_2(2^{11}) \) + 30*ceil(\( \log_2((16+2)*4) \)) + 100 = 640ps
- **Memory array (data)** = 30*\( \log_2(2^{11}) \) + 30*ceil(\( \log_2(32*8*4) \)) + 100 = 730ps

- **Comparator** = 30*16+70 = 550ps
- **N-1 mux** = 50*\( \log_2(8) \) + 100 = 250ps
- **Data output driver** = 50 * 4 + 100 = 300ps

There are three possible critical paths in an associative cache. The first two are the same as those in the direct mapped cache. The third one is the path through the tag array, the tag comparators, through the way-select mux, and through the data output driver.

**Time to tag check valid driver**

\[ = \text{Decoder (tag)} + \text{Memory array (tag)} + \text{comparator} + \text{AND gate} + \text{2*OR gate} + \text{valid output driver} \]

\[ = 410 + 640 + 550 + 50 + 100 + 80 = 1830ps \]
Time to data output drive:
= Decoder (data) + Memory array (data) + 8-1 MUX + data output driver
= 410 + 730 + 250 + 300 = 1690ps

Time to tag valid check to output driver:
= Decoder (tag) + Memory array (tag) + comparator + AND gate + buffer driver +
data output driver
= 400 + 640 + 550 + 50 + 180 + 300 = 2120ps

From the above results, we can see that the critical path is tag valid check to output
driver. The access time is 2120ps. At 2.5GHz, the cache access takes
(2120ps/(1/2.5GHz)) = 5.3 ~ 6 cycles. Here, rounding up to the nearest cycle is sensible,
as this reflects how a synchronous system would work.

Problem 1.C  Miss-rate analysis

Now Ben is studying the effect of set-associativity on the cache performance. Since he now
knows the access time of each configuration, he wants to know the miss-rate of each one.
For the miss-rate analysis, Ben is considering two small caches: a direct-mapped cache
with 8 lines with 32 bytes/line, and a 4-way set-associative cache of the same size and line
size. For the set-associative cache, Ben tries out two replacement policies – least recently
used (LRU) and round robin (FIFO).

Ben tests the cache by accessing the following sequence of hexadecimal byte addresses,
starting with empty caches. For simplicity, assume that the addresses are only 12 bits.

Complete the following tables for the direct-mapped cache and both types of 4-way set-
associative caches showing the progression of cache contents as accesses occur (in the
tables, ‘inv’ = invalid, and the column of a particular cache line contains the tag of that
line). Also, for each address calculate the tag and index (which should help in filling out the table). You only need to fill in elements in the table when a value
changes.

Address: 12 bits
Tag: 4 bits [11:8]
Index: 3 bits [7:5]
Offset: 5 bits [4:0]

<table>
<thead>
<tr>
<th>Address</th>
<th>D-map</th>
<th>Addresses and tags are in HEX</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>line in cache (tag)</td>
</tr>
<tr>
<td>11B</td>
<td>1</td>
<td>inv inv inv inv inv inv inv no</td>
</tr>
<tr>
<td>134</td>
<td>1</td>
<td>no</td>
</tr>
<tr>
<td>20D</td>
<td>2</td>
<td>no</td>
</tr>
<tr>
<td>1A2</td>
<td>1</td>
<td>1 no</td>
</tr>
<tr>
<td>105</td>
<td>1</td>
<td>no</td>
</tr>
<tr>
<td>Address</td>
<td>Tag</td>
<td>Index</td>
</tr>
<tr>
<td>---------</td>
<td>-----</td>
<td>-------</td>
</tr>
<tr>
<td>360</td>
<td>3</td>
<td></td>
</tr>
<tr>
<td>27D</td>
<td>2</td>
<td></td>
</tr>
<tr>
<td>121</td>
<td>1</td>
<td></td>
</tr>
<tr>
<td>1A3</td>
<td></td>
<td>1</td>
</tr>
<tr>
<td>17A</td>
<td></td>
<td>1</td>
</tr>
<tr>
<td>307</td>
<td>3</td>
<td></td>
</tr>
<tr>
<td>273</td>
<td>2</td>
<td></td>
</tr>
<tr>
<td>131</td>
<td>1</td>
<td></td>
</tr>
</tbody>
</table>

### D-map

<table>
<thead>
<tr>
<th>Total Misses</th>
<th>10</th>
</tr>
</thead>
<tbody>
<tr>
<td>Total Accesses</td>
<td>13</td>
</tr>
</tbody>
</table>

Address: 12 bits
Index: 1 bits [5:5]
Offset: 5 bits [4:0]
Assume that the results of the above analysis can represent the average miss-rates of the direct-mapped and the 4-way set-associative 256-KB caches studied in 1.A and 1.B. What would be the average memory access latency in CPU cycles for each cache (assume that the cache miss penalty is 20 cycles)? Which one is better? For the different replacement policies for the set-associative cache, which one has a smaller cache miss rate for the address stream in 1.C? Explain why. Is that replacement policy always going to yield better miss rates? If not, give a counter example using an address stream.

The miss rate for the direct-mapped cache is 10/13. The miss rate for the 4-way LRU set associative cache is 8/13. For FIFO is 9/13.

The average memory access latency is \((\text{hit time}) + (\text{miss rate}) \times (\text{miss penalty})\).

For the direct-mapped cache, the average memory access latency would be:
\((5 \text{ cycles}) + (10/13) \times (20 \text{ cycles}) = 20.4 \text{ cycles}\).
For the LRU set-associative cache, the average memory access latency would be:
\[(6 \text{ cycles}) + (8/13) \times (20 \text{ cycles}) = 18.3 \text{ cycles}.\]

For the FIFO set-associative cache, the average memory access latency would be:
\[(6 \text{ cycles}) + (9/13) \times (20 \text{ cycles}) = 19.8 \text{ cycles}.\]

The set-associative cache with LRU replacement is better than the direct-mapped cache in terms of average memory access latency. For the above example, LRU has a slightly smaller miss rate than FIFO. This is because the FIFO policy replaced tag {4} block instead of tag {D} during the 10th access, because the {4} block has been in the cache longer, even though the {D} was least recently used. In this case, the LRU policy took better advantage of temporal locality.

LRU does not always outperform FIFO. Assume we have a set-associative cache with the same parameters as in 1.C and an access sequence shown below. There is a miss with LRU for the last access while there is a hit with FIFO.

```
0x11B
0x134
0x20D
0x1A2
0x105
0x360
0x27D
0x121
0x1A3
0x17A
0x307
0x273
0x361
```
Problem 2: Loop Ordering

This problem requires knowledge of Lecture 7. Please, read it before answering the following questions.

This problem evaluates the cache performances for different loop orderings. You are asked to consider the following two loops, written in C, which calculate the sum of the entries in a 128 by 32 matrix of 32-bit integers:

<table>
<thead>
<tr>
<th>Loop A</th>
<th>Loop B</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>sum = 0;</code></td>
<td><code>sum = 0;</code></td>
</tr>
<tr>
<td><code>for (i = 0; i &lt; 128; i++)</code></td>
<td><code>for (j = 0; j &lt; 32; j++)</code></td>
</tr>
<tr>
<td><code>for (j = 0; j &lt; 32; j++)</code></td>
<td><code>for (i = 0; i &lt; 128; i++)</code></td>
</tr>
<tr>
<td><code>sum += A[i][j];</code></td>
<td><code>sum += A[i][j];</code></td>
</tr>
</tbody>
</table>

The matrix `A` is stored contiguously in memory in row-major order. Row major order means that elements in the same row of the matrix are adjacent in memory as shown in the following memory layout:

\[ A[i][j] \text{ resides in memory location } [4*(32*i + j)] \]

Memory Location:

```
    0  4  124  128  4*(32*127+31)
```

For Problem 2.A to Problem 2.C, assume that the caches are initially empty. Also, assume that only accesses to matrix `A` cause memory references and all other necessary variables are stored in registers. Instructions are in a separate instruction cache.
Problem 2.A

Consider an 8KB direct-mapped data cache with 4-word (16-byte) cache lines. Calculate the number of cache misses that will occur when running Loop A. Calculate the number of cache misses that will occur when running Loop B.

Each element of the 128x32 matrix A can only be mapped to one particular cache location in this direct-mapped data cache. Since each row has 32 32-bit integers, and since each cache line can hold 4 32-bit ints, a row of the matrix occupies the lines in 8 consecutive sets of the cache.

Loop A—where each iteration of the inner loop sums a row of A—accesses memory addresses in a linear sequence. Given this access pattern, the access to the first word in each cache line will miss, but the next three accesses will hit. After sequentially moving through this line, it will not be accessed again, so its later eviction will not cause any future misses. Therefore, Loop A will only have compulsory misses for the 1024 (128 rows x 8 lines per row) first-word-in-line accesses that matrix A spans.

The consecutive accesses in Loop B will move in a stride of 32 words. Therefore, the inner loop will touch the first element in 128 cache lines before the next iteration of the outer loop. While intuition might suggest that the 128 lines could all fit in the cache with 512 sets, there is a complicating factor: each row is eight cache lines past the previous row, meaning that the lines accessed when traversing the first column go in indices 0, 8, 16, 32, and so on. Since the lines containing the column are competing for only one eighth of the total number of sets (effectively 64 sets), the lines loaded when starting a column are evicted by the time the column is complete, preventing any reuse. Therefore, all 4096 (128 x 32) accesses miss.

The number of cache misses for Loop A:___________1024________________

The number of cache misses for Loop B:___________4096________________

Problem 2.B

Consider a direct-mapped data cache with 4-word (16-byte) cache lines. Calculate the minimum number of cache lines required for the data cache if Loop A is to run without any cache misses other than compulsory misses. Calculate the minimum number of cache lines required for the data cache if Loop B is to run without any cache misses other than compulsory misses.
Since Loop A accesses memory sequentially, we can sum all the elements in a cache line and then never touch it again. Therefore, we only need to hold 1 active line at any given time to avoid all but compulsory misses.

For Loop B to run without any cache misses other than compulsory misses, the data cache needs to have the ability to hold one column of matrix A in the cache. Since the consecutive accesses in the inner loop of Loop B will use one out of every eight cache lines, and since we have 128 rows, Loop B requires $1024 (128 \times 8)$ lines to avoid all but compulsory misses.

Data-cache size required for Loop A: ___________ 1 ______________ cache line(s)

Data-cache size required for Loop B: ___________ 1024 ______________ cache line(s)

Problem 2.C

Consider an 8KB set-associative data cache with 4 ways, and 4-word (16-byte) cache lines. This data cache uses a first-in/first-out (FIFO) replacement policy. Calculate the number of cache misses that will occur when running Loop A. Calculate the number of cache misses that will occur when running Loop B.

Note that the offset is 4 bits.
The # of lines in a way of this cache = $2^13 / (2^4 \times 4) = 2^7 = 128$.

Loop A still only has $1024 (128 \text{ rows} \times 8 \text{ lines per row})$ compulsory misses.

Loop B still cannot fully utilize the cache. Consider accessing a single column. The first $128/8 = 16$ accesses will allocate into way 1 in sets 0, 8, 16, 32, etc.; the next 16 accesses will allocate into way 2 of those same sets; and so on. After 64 accesses, all four ways will be filled, and the next 16 accesses along the column will evict the previous lines in way 1, preventing any reuse. Therefore, all $4096 (128 \times 32)$ accesses miss.

The number of cache misses for Loop A: _______ 1024 ______________

The number of cache misses for Loop B: _______ 4096 ______________
Problem 3: Microtagged Cache

In this problem, we explore microtagging, a technique to reduce the access time of set-associative caches. Recall that for associative caches, the tag check must be completed before load results are returned to the CPU, because the result of the tag check determines which cache way is selected. Consequently, the tag check is often on the critical path.

The time to perform the tag check (and, thus, way selection) is determined in large part by the size of the tag. We can speed up way selection by checking only a subset of the tag—called a microtag—and using the results of this comparison to select the appropriate cache way. Of course, the full tag check must also occur to determine if the cache access is a hit or a miss, but this comparison proceeds in parallel with way selection. We store the full tags separately from the microtag array.

We will consider the impact of microtagging on a 4-way set-associative 16KB data cache with 32-byte lines. Addresses are 32 bits long. Microtags are 8 bits long. The baseline cache (i.e. without microtagging) is depicted in Figure H2-B in the handout 2. Figure 1, below, shows the modified tag comparison and driver hardware in the microtagged cache.

Figure 2.4-1: Microtagged cache datapath
Problem 3.A

Cache Cycle Time

Table 2.4-1, below, contains the delays of the components within the 4-way set-associative cache, for both the baseline and the microtagged cache. For both configurations, determine the critical path and the cache access time (i.e., the delay through the critical path).

Assume that the 2-input AND gates have a 50ps delay and the 4-input OR gate has a 100ps delay.

<table>
<thead>
<tr>
<th>Component</th>
<th>Delay equation (ps)</th>
<th>Baseline</th>
<th>Microtagged</th>
</tr>
</thead>
<tbody>
<tr>
<td>Decoder</td>
<td>20×(# of index bits) + 100</td>
<td>Tag 240</td>
<td>240</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Data 240</td>
<td>240</td>
</tr>
<tr>
<td>Memory array</td>
<td>20×log₂(# of rows) + 20×log₂(# of bits in a row) + 100</td>
<td>Tag 330</td>
<td>330</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Data 440</td>
<td>440</td>
</tr>
<tr>
<td>Comparator</td>
<td>20×(# of tag bits) + 100</td>
<td>Tag 500</td>
<td>500</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Microtag 260</td>
<td>260</td>
</tr>
<tr>
<td>N-to-1 MUX</td>
<td>50×log₂N + 100</td>
<td>250</td>
<td>250</td>
</tr>
<tr>
<td>Buffer driver</td>
<td>200</td>
<td>200</td>
<td>200</td>
</tr>
<tr>
<td>Data output driver</td>
<td>50×(associativity) + 100</td>
<td>300</td>
<td>300</td>
</tr>
<tr>
<td>Valid output driver</td>
<td>100</td>
<td>100</td>
<td>100</td>
</tr>
</tbody>
</table>

Table 2.4-1: Delay of each Cache Component

What is the old critical path? The old cycle time (in ps)?

Candidate 1: Full tag check
- tag decoder ➔ tag read ➔ comparator ➔ 2-in AND ➔ 4-in OR ➔ valid output driver
- 240 ps + 330 ps + 500 ps + 50 ps + 100 ps + 100 ps = 1320 ps

Candidate 2: Data select based on full tag check
- tag decoder ➔ tag read ➔ comparator ➔ 2-in AND ➔ buffer driver ➔ data output driver
- 240 ps + 330 ps + 500 ps + 50 ps + 200 ps + 300 ps = 1620 ps

Candidate 3: Data readout
- data decoder ➔ data read ➔ 4-to-1 MUX ➔ data output driver
- 240 ps + 440 ps + 250 ps + 300 ps = 1230 ps
The critical path is the data select based on the full tag match. The cycle time is 1620 ps.

What is the new critical path? The new cycle time (in ps)?

Candidate 1: Full tag check
same as baseline full tag check => 1320 ps

Candidate 2: Data select based on microtag check
µtag decoder → µtag read → comparator → 2-in AND → buffer driver → data out
driver
240 ps + 300 ps + 260 ps + 50 ps + 200 ps + 300 ps = 1350 ps

Candidate 3: Data readout
same as baseline data read => 1230 ps
The critical path is the data select based on the microtag check. The cycle time is 1350 ps.

**Problem 3.B**

Assume temporarily that both the baseline cache and the microtagged cache have the same hit rate, 95%, and the same average miss penalty, 15 ns. Using the cycle times computed in 3.A as the hit times, compute the average memory access time for both caches. **What was the old AMAT (in ns)? What is the new AMAT (in ns)?**

AMAT = (hit_time) + (miss)_rate x (miss_penalty)
= X + (0.05) * (15ns) = X + 0.75ns, where X is the hit time calculated from 3.A

Old AMAT = 1.620 + 0.75 = 2.37 ns

New AMAT with microtags = 1.350 + 0.75 = 2.10 ns

**Problem 3.C**

Microtags add an additional constraint to the cache: in a given cache set, all microtags must be unique. This constraint is necessary to avoid multiple microtag matches in the same set, which would prevent the cache from selecting the correct way.

**State** which of the 3C’s of cache misses this constraint affects. **How will the cache miss rate compare to an ordinary 4-way set-associative cache?** How will it compare to that of a direct-mapped cache of the same size?
Because the uniqueness property of microtags restricts the replacement policy, the cache isn’t free to make as optimal replacement decisions as it could in the baseline. This will lead to some increase in conflict misses. The magnitude of this effect depends on which 8 bits are selected to form the microtag. In principle, using the bottom 8 bits would result in more potential for microtag collisions and would add the biggest restriction to the ability of the cache to hold spatially local data. However, it will still be better than a directmapped cache of the same size and line size.

### Problem 4: Victim Cache Evaluation

Although direct-mapped caches have an advantage of smaller access time than set-associative caches, they have more conflict misses due to their lack of associativity. In order to reduce these conflict misses, Norm Jouppi proposed victim caching, where a small fully-associative back up cache, called a victim cache, is added to a direct-mapped L1 cache to hold recently evicted cache lines.

The following diagram shows how a victim cache can be added to a direct-mapped L1 data cache. Upon a data access, the following chain of events takes place:

1. The L1 data cache is checked. If it holds the data requested, the data is returned.
2. If the data is not in the L1 cache, the victim cache is checked. If it holds the data requested, the data is moved into the L1 cache and sent back to the processor. The data evicted from the L1 cache is put in the victim cache, and put at the end of the FIFO replacement queue.
3. If neither of the caches holds the data, it is retrieved from memory, and put in the L1 cache. If the L1 cache needs to evict old data to make space for the new data, the old data is put in the victim cache and placed at the end of the FIFO replacement queue. Any data that needs to be evicted from the victim cache to make space is written back to memory or discarded, if unmodified.
Note that the two caches are *exclusive*. That means that the same data cannot be stored in both L1 and victim caches at the same time.

### Problem 4.A

**Baseline Cache Design**

The diagram below shows our victim cache, a 32-Byte fully associative cache with four 8-Byte cache lines. Each line contains of two 4-Byte words and has an associated tag and two status bits (valid and dirty). The Input Address is 32-bits and the two least significant bits are assumed to be zero. The output of the cache is a 32-bit word.

![Victim cache datapath](image)

Figure 2.5-1: Victim cache datapath

Please complete Table 2.5-1 with delays across each element of the cache. Using the data you compute in Table 2.5-1, calculate the critical path delay through this cache (from when the Input Address is set to when both Valid Output Driver and the appropriate Data Output Driver are outputting valid data).

<table>
<thead>
<tr>
<th>Component</th>
<th>Delay equation (ps)</th>
<th>FA(ps)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Comparator</td>
<td>30×(# of tag bits) + 100</td>
<td>970</td>
</tr>
<tr>
<td>N-to-1 MUX</td>
<td>50×log₂ N + 100</td>
<td>150</td>
</tr>
<tr>
<td>Buffer driver</td>
<td>200</td>
<td>200</td>
</tr>
<tr>
<td>AND gate</td>
<td>100</td>
<td>100</td>
</tr>
<tr>
<td>OR gate</td>
<td>50× log₂ N + 100</td>
<td>200</td>
</tr>
<tr>
<td>Data output driver</td>
<td>50×(associativity) + 100</td>
<td>300</td>
</tr>
<tr>
<td>Valid output driver</td>
<td>100</td>
<td>100</td>
</tr>
</tbody>
</table>

Table 2.5-1: Delay of each cache component
Critical Path Cache Delay:
Below, we evaluate the three major paths through the victim cache to find the critical path and cycle time. Note that the victim cache is fully-associative and uses 29-bit tags.

Candidate 1: Tag check
comparator \(\rightarrow\) 2-in AND \(\rightarrow\) 4-in OR \(\rightarrow\) valid output driver
970 ps + 100 ps + 200 ps + 100 ps = 1370 ps

Candidate 2: Data select based on tag check
comparator \(\rightarrow\) 2-in AND \(\rightarrow\) buffer driver \(\rightarrow\) data output driver
970 ps + 100 ps + 200 ps + 300 ps = \(\boxed{1570 \text{ ps}}\)

Candidate 3: Data readout
2-to-1 MUX \(\rightarrow\) data output driver
200 ps + 300 ps = 500 ps

The critical path is the data select based on the tag match. The cycle time is 1570 ps.

Problem 4.B

Victim Cache Behavior

Now we will study the impact of a victim cache on cache hit rate. Our main L1 cache is a 128 byte, direct-mapped cache with 16 bytes per cache line. The cache is word (4-bytes) addressable. The victim cache in Figure 2.5-1 is a 32-byte fully associative cache with 16 bytes per cache line and is also word addressable. The victim cache uses the first in first out (FIFO) replacement policy.

Please complete Table 2.5-2 showing a trace of memory accesses. In the table, each entry contains the tag of that line, or “inv”, if no data is present. You should only fill in elements in the table when a value changes. For simplicity, the addresses are only 8 bits. The first 3 lines of the table have been filled in for you. For your convenience, the address breakdown for access to the main cache is depicted below.
Table 2.5-2: Memory access trace

<p>| | | | | | |</p>
<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>A0</td>
<td>1</td>
<td>N</td>
<td>N</td>
<td></td>
<td></td>
</tr>
<tr>
<td>10</td>
<td>0</td>
<td>N</td>
<td>N</td>
<td></td>
<td></td>
</tr>
<tr>
<td>C0</td>
<td>1</td>
<td>N</td>
<td>N</td>
<td></td>
<td></td>
</tr>
<tr>
<td>18</td>
<td>0</td>
<td>Y</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>20</td>
<td>0</td>
<td>N</td>
<td>A</td>
<td>N</td>
<td></td>
</tr>
<tr>
<td>8C</td>
<td>1</td>
<td>N</td>
<td>0</td>
<td>Y</td>
<td></td>
</tr>
<tr>
<td>28</td>
<td>0</td>
<td>Y</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>AC</td>
<td>1</td>
<td>N</td>
<td>2</td>
<td>Y</td>
<td></td>
</tr>
<tr>
<td>38</td>
<td>0</td>
<td>N</td>
<td>N</td>
<td></td>
<td></td>
</tr>
<tr>
<td>C4</td>
<td>1</td>
<td>Y</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>3C</td>
<td>0</td>
<td>Y</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>48</td>
<td>0</td>
<td>N</td>
<td>C</td>
<td>N</td>
<td></td>
</tr>
<tr>
<td>0C</td>
<td>0</td>
<td>N</td>
<td>8</td>
<td>N</td>
<td></td>
</tr>
<tr>
<td>24</td>
<td>0</td>
<td>N</td>
<td>A</td>
<td>N</td>
<td></td>
</tr>
</tbody>
</table>

Problem 4.C

Assume 15% of memory accesses are resolved in the victim cache. If retrieving data from the victim cache takes 4 cycles and retrieving data from main memory takes 50 cycles, by how many cycles does the victim cache improve the average memory access time? (You may assume the L1 miss rate is 10%)

\[
AMAT = \text{HitTime} + \text{L1MissRate} \times \text{L1MissPenalty}
\]

\[
AMAT^* = \text{HitTime} + \text{L1MissRate} \times (\text{VictimHitRate} \times \text{VictimHitTime} + (1 - \text{VictimHitRate}) \times \text{VictimMissPenalty})
\]

\[
\text{VictimMissPenalty} = \text{L1MissPenalty} = \text{DRAMTime}, \text{since this is just time to get data from main memory}
\]

\[
AMAT - AMAT^* = \text{L1MissRate} \times (\text{DRAMTime} - \text{VictimHitRate} \times \text{VictimHitTime} - (1 - \text{VictimHitRate}) \times \text{DRAMTime})
\]

\[
= \text{L1MissRate} \times (50 - 0.15 \times 4 - 0.85 \times 50)
\]

\[
= \text{L1MissRate} \times 6.9
\]

Since L1 Miss rate was not provided, an answer in the form of above is acceptable. With the 0.1 L1 miss rate, the average difference in AMAT is 0.69.
Problem 5: Three C’s of Cache Misses

Mark whether the following modifications will cause each of the categories to **increase**, **decrease**, or whether the modification will have **no effect**. You can assume the baseline cache is set associative. **Explain your reasoning.**

<table>
<thead>
<tr>
<th>Description</th>
<th>Compulsory Misses</th>
<th>Conflict Misses</th>
<th>Capacity Misses</th>
</tr>
</thead>
<tbody>
<tr>
<td>Halving the line size (associativity and # sets constant)</td>
<td>Increase</td>
<td>Increase</td>
<td>Increase</td>
</tr>
<tr>
<td>Halves capacity!</td>
<td></td>
<td>The program will access more cache lines in total, creating more opportunity for conflict misses.</td>
<td>Capacity has been cut in half.</td>
</tr>
<tr>
<td>Doubling the number of sets (capacity and line size constant)</td>
<td>No effect</td>
<td>Increase</td>
<td>No effect</td>
</tr>
<tr>
<td>Halves associativity!</td>
<td></td>
<td>Typically, lower associativity increases conflict misses, since there are fewer places to put the same element.</td>
<td>Capacity does not change.</td>
</tr>
<tr>
<td>Adding prefetching</td>
<td>Compulsory Misses</td>
<td>Conflict Misses</td>
<td>Capacity Misses</td>
</tr>
<tr>
<td>--------------------</td>
<td>------------------</td>
<td>----------------</td>
<td>----------------</td>
</tr>
<tr>
<td></td>
<td>Decrease</td>
<td>Best answer: Decrease With good prefetching, conflict misses should decrease, as the prefetcher will often bring lines that have been evicted back into the cache.</td>
<td>Best answer: Decrease With good prefetching, capacity misses should decrease. In a situation where the working set simply won’t fit, the prefetcher can dynamically bring lines in, “Just-In-Time,” avoiding what would have been capacity misses.</td>
</tr>
<tr>
<td></td>
<td>Ideally, a good prefetcher can bring data in before we use it, avoiding compulsory misses.</td>
<td>Okay answer: increase With mediocre prefetching, conflict misses could increase, as the prefetcher could evict useful lines to bring in useless.</td>
<td>Okay answer: no effect With a mediocre prefetcher that would increase conflict misses, it is unlikely that capacity misses would increase. If prefetcher traffic evicts useful data, newly created misses will almost</td>
</tr>
<tr>
<td>Combine ICache and DCache into a single L1 cache with the combined capacity (associativity and line size constant)</td>
<td>No effect</td>
<td>May increase: New opportunities for conflicts between cache lines for data and cache lines for instructions are introduced</td>
<td>Decrease: Greater capacity</td>
</tr>
</tbody>
</table>
**Problem 6: Memory Hierarchy Performance**

Mark whether the following modifications will cause each of the categories to increase, decrease, or whether the modification will have no effect. You can assume the baseline cache is set associative. **Explain your reasoning.**

<table>
<thead>
<tr>
<th>Modification</th>
<th>Hit Time</th>
<th>Miss Rate</th>
<th>Miss Penalty</th>
</tr>
</thead>
<tbody>
<tr>
<td>Halving the line size (associativity and # sets constant)</td>
<td>Decreases</td>
<td>Increases</td>
<td>Decreases</td>
</tr>
<tr>
<td>Halves # of capacity</td>
<td>The cache is now physically smaller, which overshadows the slightly increased tag check time (tag grows by 1 bit).</td>
<td>Smaller capacity, less ability to take advantage of spatial locality within a single cache line (more compulsory misses).</td>
<td>Smaller lines can be brought in more quickly. OR No effect, because cache already brings in critical word first.</td>
</tr>
<tr>
<td>Description</td>
<td>Decreases</td>
<td>Increases</td>
<td>No effect</td>
</tr>
<tr>
<td>---------------------------------------------------------------------------</td>
<td>---------------------------------------------------------------------------</td>
<td>--------------------------------------------------------------------------</td>
<td>-----------------------------------------------------------------------------------------------------</td>
</tr>
<tr>
<td>Doubling the number of sets (capacity and line size constant)</td>
<td># of sets increases, so tags get smaller. Fewer tags must be checked, and fewer ways have to be muxed outs.</td>
<td>More conflict misses because associativity gets halved.</td>
<td>This is dominated by the outer memory hierarchy</td>
</tr>
<tr>
<td>Halves associativity</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Adding prefetching</td>
<td>No effect</td>
<td>Decreases</td>
<td>Good answer: no effect.</td>
</tr>
<tr>
<td></td>
<td>The prefetcher isn’t on the hit path.</td>
<td>The whole purpose of a prefetcher is to reduce the miss rate by bringing in data ahead of time.</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>May increase due to bandwidth pollution but we can(should) give a priority on cache misses over prefetch requests.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>May decrease because a prefetch can be in flight when a miss occurs (but this is unlikely).</td>
</tr>
<tr>
<td>Combine L1ICache and L1DCache into a single L1 cache with the combined capacity (associativity and line size constant)</td>
<td>Increase: If the cache is dual-ported, it will be slower than a single-ported cache</td>
<td>May Decrease Cache can more flexibly allocate space towards either data or instructions, depending on dynamic program behavior.</td>
<td>No effect: This is dominated by outer memory hierarchy</td>
</tr>
<tr>
<td></td>
<td>If there is a single port, then frequently data accesses may stall for instruction accesses, or vice-versa</td>
<td>May increase: Edge cases may cause more conflict misses between instruction and data accesses</td>
<td></td>
</tr>
</tbody>
</table>