1 Introduction and goals

The goal of this laboratory assignment is to conduct some simple memory hierarchy experiments in the RISC-V simulation environment. Using the cache simulator module, you will collect cache statistics and make architectural recommendations based on the results.

The lab has two sections, a directed portion and an open-ended portion. Everyone will do the directed portion the same way, and grades will be assigned based on correctness. The open-ended portion will allow you to pursue more creative investigations, and your grade will be based on the effort made to complete the task or the arguments you provide in support of your ideas.

Students are encouraged to discuss solutions to the lab assignments with other students, but must run through the directed portion of the lab by themselves and turn in their own lab report for those problems. For the open-ended portion of each lab, students will work individually or in groups of two or three. Each group will turn in a single report for the open-ended portion of the lab. Students are free to take part in different groups for different lab assignments.

You are only required to do one of the open-ended assignments. These assignments are generally starting points or suggestions. Alternatively, you can propose and complete your own open-ended project as long as it is sufficiently rigorous. If you feel uncertain about the rigor of a proposal, feel free to consult the instructor or the TA.

1.1 Tools and Benchmarks

The processors that you will be studying in this lab implement the RISC-V ISA, recently developed at UC Berkeley for use in education and research.

The ISA simulator, riscv-isa-sim (also called spike), can execute RISC-V binaries. In Lab 1, we used the ISA simulator as the golden reference for the ISA. The ISA simulator executes RISC-V code rapidly, but does not model pipeline timing and so is not cycle-accurate.

In Lab 2, we will use an ISA simulator that has been extended with a cache simulator. The cache simulator will run memory addresses through a simulated cache (with a given size, associativity, and block size), and record the number of accesses, hits, and misses. With the simulated miss rate and miss penalty, we can estimate the impact on CPI. We also use CACTI (http://quid.hpl.hp.com:9081/cacti) to see how various cache parameters will impact the cycle time of the pipeline.

This lab uses the ISA simulator rather than the Chisel-generated emulator, even though the ISA simulator doesn’t give accurate cycle counts, for two reasons: 1) we typically need to run a couple billion cycles to get a realistic view of the cache behavior, and the Chisel-generated emulator runs a few orders of magnitude slower than the ISA simulator; 2) the AMAT/CPI performance estimate for a simple pipeline turns out to be reasonably accurate.

We will use 5 benchmarks (bzip2, mcf, soplex, sjeng, lbm) from the SPEC CPU2006 benchmark suite (http://www.spec.org). SPEC is a standardized set of benchmarks to evaluate the performance of modern computer systems. The test results are published on the SPEC website. To read more about SPEC, please consult the SPEC website.

1.2 Graded Items

You will turn in a copy of your results via e-mail to the TA. Please label each section of the results clearly. The directed items need to be turned in for evaluation. Your group only needs to turn in one of the problems found in the open-ended portion.

1. (Directed) Problem 2.2: simple cache statistics for each benchmark and answers
2. (Directed) Problem 2.3: suggested working sets and evidence
3. (Directed) Problem 2.4: optimal I$ configuration and evaluation
4. (Directed) Problem 2.5: optimal D$ configuration and evaluation
5. (Directed) Problem 2.6: optimal D$ configuration with an L2$ and evaluation
6. (Open-ended) Problem 3.1: optimal cache configuration and evidence
7. (Open-ended) Problem 3.2: suggested victim cache configuration, modifications, and evaluation (include source code if required)
8. (Open-ended) Problem 3.3: suggested prefetching algorithm, modifications, and evaluation (include source code if required)
9. (Open-ended) Problem 3.4: suggested replacement policy, modifications, and evaluation (include source code if required)
10. (Directed) Problem 4: Feedback on this lab

Lab reports must be in readable English and not raw dumps of log-files. Your lab reports must be typed and the open-ended portion must not exceed 6 pages. Charts, tables, and figures - when appropriate - are great ways to succinctly summarize your data.

2 Directed Portion (50% of lab grade)

2.1 Setting Up Your Workspace

To complete this lab you will log in to an instructional server to run the RISC-V ISA simulator and compiler tool chain. We will provide you with an instructional computing account for this purpose.

The tools for this lab were set up to run on any of the 5 instructional Linux servers icluster5.eecs, icluster6.eecs, ..., icluster9.eecs (see http://inst.eecs.berkeley.edu/cgi-bin/clients.cgi?choice=servers for more information about available machines).

First, download the lab materials. This lab is now managed as a git repository, which means you can also use git to fetch updates from the published version. To copy the repo you will need to clone it.

If any updates are released, you can then pull them in using:

```
inst$ cd ${LAB2ROOT}
inst$ git pull
```

If you encounter problems using git, feel free to post a question on Piazza or consult the git documentation (see [https://git-scm.com/doc](https://git-scm.com/doc)).

The following command will set up your bash environment, giving you access to the entire CS152 lab tool-chain. Run it before each session:

```
inst$ source ~cs152/fa16/cs152.lab2.bashrc
```

We will refer to `~/lab2` as `${LAB2ROOT}` in the rest of the handout to denote the location of the Lab 2 directory. The directory structure is shown below:

- `${LAB2ROOT}/`
  - `riscv-isa-sim/` Source code for the RISC-V ISA Simulator
    - `riscv/` Source code for the RISC-V ISA Simulator
  - `spec-cpu2006-riscv`
    - `401.bzip2/` Source and data files for the bzip2 benchmark
      - `src/` Source files
      - `data/` Data files
    - `429.mcf/` Source and data files for the mcf benchmark
    - `450.soplex/` Source and data files for the soplex benchmark
    - `458.sjeng/` Source and data files for the sjeng benchmark
    - `470.lbm/` Source and data files for the lbm benchmark

We will now build the ISA simulator from sources. You may wonder why this is necessary, given that we already gave you a working version in Lab 1. The reason is that in the open-ended section you may be modifying the simulator yourself, and you will need to repeat these steps to build and install the new version. To build the ISA simulator, run the following commands:

```
inst$ cd ${LAB2ROOT}/riscv-isa-sim
inst$ mkdir build
inst$ cd build
inst$ ../configure --prefix=${LAB2ROOT}/install
inst$ make -j
inst$ make install
```

Execute the following line to put the ISA simulator in your path. Also add this command, along with the `source` command above, to your bash profile (`~/.bashrc`) so that your environment is set up automatically every time you open a new session.

    inst$ export PATH=${LAB2ROOT}/install/bin:$PATH

To check whether the ISA simulator is in your path, run the following command.

    inst$ which spike

2.2 Collecting statistics from a simple cache

You should first build the benchmarks.

    inst$ cd ${LAB2ROOT}/spec-cpu2006-riscv
    inst$ make -j
    inst$ ls -ls build.riscv/
    total 17120
         932 -rwxr-xr-x 1 yunsup grad 1014224 Feb 15 23:31 401.bzip2
         1348 -rwxr-xr-x 1 yunsup grad 1440538 Feb 15 23:31 429.mcf
         11940 -rwxr-xr-x 1 yunsup grad 12285464 Feb 15 23:31 450.soplex
         1632 -rwxr-xr-x 1 yunsup grad 1730854 Feb 15 23:31 458.sjeng
         1268 -rwxr-xr-x 1 yunsup grad 1357611 Feb 15 23:31 470.lbm

Execute the target binary on the ISA simulator with an L1 instruction cache.

    inst$ cd ${LAB2ROOT}/spec-cpu2006-riscv
    inst$ mkdir test
    inst$ cd test
    inst$ spike --ic=128:2:64 pk ../build.riscv/401.bzip2 ../401.bzip2/data/test/input/dryer.jpg

    spec_init
    Loading Input Data
    ...
    Tested 64KB buffer: OK!
    I$ Bytes Read:    4539708644
    I$ Bytes Written: 0
    I$ Read Accesses: 1134927161
    I$ Write Accesses: 0
    I$ Read Misses:    1534
    I$ Write Misses:  0
    I$ Writebacks:  0
    I$ Miss Rate:    0.000%

The 3 parameters given to the instruction cache flag `--ic` are number of sets:associativity:line size. Multiplying the three numbers, $128 \times 2 \times 64$, gives the cache size in bytes (16KB). Note that “I$ Read Accesses” equals the number of instructions executed.
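As a quick check of this arithmetic, the sketch below converts a capacity, associativity, and line size into the `sets:associativity:line size` triple that `--ic` expects. The helper name `cache_flag` is made up for illustration and is not part of the lab's tooling.

```python
def cache_flag(name, capacity_bytes, ways, line_bytes):
    """Return a spike-style cache flag such as '--ic=128:2:64'.

    `name` is the flag name without dashes (for example 'ic').
    """
    bytes_per_set = ways * line_bytes
    if capacity_bytes % bytes_per_set != 0:
        raise ValueError("capacity must be a multiple of ways * line size")
    sets = capacity_bytes // bytes_per_set
    return f"--{name}={sets}:{ways}:{line_bytes}"


# 16KB, 2-way, 64-byte lines: the configuration used in the run above.
print(cache_flag("ic", 16 * 1024, 2, 64))  # --ic=128:2:64
```

Going from capacity to sets (rather than the other way around) is convenient later, because the design spaces in this lab are stated in terms of capacity and associativity.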

You can also run the simulation with an L1 data cache, and with a unified L2 cache, with the following commands (these are the cache parameters we will be using).

We wrote a Makefile that launches the benchmarks. Execute all the tests in parallel with the following command. It will take a couple of minutes to run all 5 benchmarks.

```bash
inst$ cd ${LAB2ROOT}/spec-cpu2006-riscv
inst$ make -j run
inst$ ls -ls run.riscv/
```

Answer the following 7 questions.

(Q1) For each benchmark, look at the corresponding output file and record the miss rate for the L1 I$, L1 D$, and the L2$. Which benchmark has the best cache performance? Which has the worst? We highly recommend that you automate this process with a script, as we are going to analyze a lot of data in the subsequent sections.
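One possible starting point for such a script is sketched below. It parses counter lines in the format shown in the sample I$ output earlier in this section and recomputes the miss rate with more digits than the printed percentage. The function names are hypothetical, and the assumption that D$ and L2$ lines follow the same `<cache>$ <counter>: <value>` layout should be checked against your own output files.

```python
import re

# Matches lines such as "I$ Read Misses:    1534".
STAT_LINE = re.compile(r"^\s*(\S+\$)\s+(.+?):\s+(\d+)\s*$")


def parse_cache_stats(text):
    """Return {cache name: {counter name: integer value}} from simulator output."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line)
        if match:
            cache, counter, value = match.groups()
            stats.setdefault(cache, {})[counter] = int(value)
    return stats


def miss_rate(counters):
    """Misses divided by accesses, counting both reads and writes."""
    accesses = counters["Read Accesses"] + counters["Write Accesses"]
    misses = counters["Read Misses"] + counters["Write Misses"]
    return misses / accesses if accesses else 0.0


# Counter lines copied from the sample run in this handout.
sample = """\
I$ Bytes Read:    4539708644
I$ Read Accesses: 1134927161
I$ Write Accesses: 0
I$ Read Misses:    1534
I$ Write Misses:  0
I$ Miss Rate:    0.000%
"""
icache = parse_cache_stats(sample)["I$"]
print(f"I$ miss rate: {miss_rate(icache):.3e}")
```

The percentage line is skipped on purpose (it is not an integer counter); computing the rate from the raw counts avoids the rounding visible in `0.000%`.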

(Q2) What is the cache access time for the L1 I$, L1 D$, and the L2$? Refer to Table 1.

(Q3) What is the cycle time of the pipeline? Assume that the critical path among all non-memory pipeline stages is 900ps long.

(Q4) Calculate the average CPI (in cycles) across all benchmarks without the L2$. Use the following formula to calculate CPI. (MP=Miss Penalty, CT=Cycle Time. Assume that the backside of the L1$s is connected to a DRAM, and that it takes 100ns to access the DRAM. Use $CPI_{base} = 1.1$)

$$CPI = CPI_{base} + \frac{\text{L1 I\$ misses} + \text{L1 D\$ misses}}{\text{\# of insts}} \times \left\lceil \frac{MP}{CT} \right\rceil$$
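The formula translates directly into a few lines of Python. In this sketch the function name and the miss counts are hypothetical, and the 1 ns cycle time is only a placeholder (it is not the answer to Q3); the 100 ns miss penalty and $CPI_{base} = 1.1$ come from the question.

```python
import math


def cpi_with_misses(cpi_base, icache_misses, dcache_misses, instructions,
                    miss_penalty_ns, cycle_time_ns):
    """CPI = CPI_base + (misses per instruction) * ceil(MP / CT)."""
    penalty_cycles = math.ceil(miss_penalty_ns / cycle_time_ns)
    misses_per_inst = (icache_misses + dcache_misses) / instructions
    return cpi_base + misses_per_inst * penalty_cycles


# Made-up counts: 1,000 I$ misses and 4,000 D$ misses over 1,000,000 instructions.
print(cpi_with_misses(1.1, 1_000, 4_000, 1_000_000,
                      miss_penalty_ns=100.0, cycle_time_ns=1.0))
```

Note that the ceiling is applied to the penalty in cycles, not to the final CPI: a miss always stalls the pipeline for a whole number of cycles.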

(Q5) What is the AMAT (in ns) of the L2$ for all benchmarks? Use the following formula to calculate AMAT. (HT=Hit Time, MR=Miss Rate, MP=Miss Penalty. Assume that the backside of the L2$ is connected to a DRAM, and it takes 100ns to access the DRAM)

$$AMAT_{L2} = HT_{L2} + MR_{L2} \times MP_{L2}$$
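A minimal sketch of the AMAT calculation follows. The function name, the 1.5 ns hit time, and the 20% miss rate are invented for illustration; only the 100 ns miss penalty comes from the question. The miss rate is passed as a fraction, not a percentage.

```python
def amat_ns(hit_time_ns, miss_rate, miss_penalty_ns):
    """AMAT = HT + MR * MP, with all times in nanoseconds."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be a fraction between 0 and 1")
    return hit_time_ns + miss_rate * miss_penalty_ns


# Placeholder inputs: look up the real hit time in Table 1 and use your measured miss rate.
print(amat_ns(hit_time_ns=1.5, miss_rate=0.20, miss_penalty_ns=100.0))
```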

(Q6) Calculate the average CPI (in cycles) across all benchmarks with the L2$. Use the following formula to calculate CPI, assuming the L2$ is running asynchronously on its own clock domain. (Use $CPI_{base} = 1.1$)

$$CPI = CPI_{base} + \frac{\text{L1 I\$ misses} + \text{L1 D\$ misses}}{\text{\# of insts}} \times \left\lceil \frac{AMAT_{L2}}{CT} \right\rceil$$

(Q7) Compare the average CPI value with and without the L2$. Does the L2$ help performance?

2.3 Determining benchmark working-set size

Your task in this section is to determine the working-set size of each benchmark by varying the size and associativity of the L2$ used in the previous section. Record the measurements that support your claim. Which benchmark seems to have the largest working set, and how big is it? We have provided a Python script that will help you launch many simulation jobs. Take a look at `$LAB2ROOT/spec-cpu2006-riscv/explore.py`. Simply write a loop that populates the `design_space` dictionary. Finally, run the script to kick off the design-space exploration.

```
inst$ cd ${LAB2ROOT}/spec-cpu2006-riscv
inst$ ./explore.py
```
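The loop that populates the design space could look roughly like the sketch below. The handout does not show what keys and values `explore.py` expects, so the dictionary here is a stand-in with a made-up layout (label to flag string); read the script and adapt it. The `--l2` flag name is assumed by analogy with `--ic` and should be confirmed with `spike --help`, and the capacity list is a hypothetical sweep.

```python
from itertools import product

LINE_BYTES = 64
capacities_kb = [64, 128, 256, 512, 1024]   # hypothetical sweep; choose your own range
associativities = [1, 2, 4, 8]

# Stand-in for the script's dictionary; the real key/value format may differ.
design_space = {}
for cap_kb, ways in product(capacities_kb, associativities):
    sets = cap_kb * 1024 // (ways * LINE_BYTES)
    label = f"l2_{cap_kb}KB_{ways}way"
    # "--l2" is an assumption made by analogy with "--ic".
    design_space[label] = f"--l2={sets}:{ways}:{LINE_BYTES}"

for label, flag in design_space.items():
    print(label, flag)
```

Sweeping capacity in powers of two makes the working-set "knee" easy to spot: the miss rate should drop sharply once the cache is large enough to hold the working set.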

2.4 Find the Optimal L1 I$ Configuration

For this section, you will find the optimal L1 I$ configuration that maximizes performance of the 5 SPEC benchmarks. Assume the data memory accesses all hit in the cache, and hence don’t affect the CPI.

Note:

- Limit the L1 cache design space to: capacity[16KB,32KB,64KB] × associativity[1,2,4,8] × cache line size[64].
- To calculate CPI, come up with a formula similar to the one you used in Section 2.2.
- Consult Table 1 to see how cache access time scales with different cache configurations.
- Think carefully about how the cache access time will affect the cycle time of the processor, and the overall performance.
- We encourage you to modify the design-space exploration script used in 2.3.
- We put a text version of Table 1 in $LAB2ROOT/spec-cpu2006-riscv/cacti just in case you would like to read it from your analysis script.
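The design space in the notes above is small enough to enumerate directly. The sketch below pairs each allowed configuration with its access time, copied by hand from Table 1(b); the function name is hypothetical, and the CPI and cycle-time model is deliberately left out because that is the part you are asked to work out.

```python
from itertools import product

# Access times in ns from Table 1(b) (64-byte lines), keyed by (capacity in KB, ways).
ACCESS_TIME_NS = {
    (16, 1): 0.34, (32, 1): 0.41, (64, 1): 0.53,
    (16, 2): 0.61, (32, 2): 0.64, (64, 2): 0.68,
    (16, 4): 0.84, (32, 4): 0.86, (64, 4): 0.91,
    (16, 8): 1.28, (32, 8): 1.30, (64, 8): 1.34,
}


def l1_design_space():
    """Yield (capacity KB, ways, sets, access time ns) for every allowed L1 configuration."""
    for cap_kb, ways in product([16, 32, 64], [1, 2, 4, 8]):
        sets = cap_kb * 1024 // (ways * 64)
        yield cap_kb, ways, sets, ACCESS_TIME_NS[(cap_kb, ways)]


for cap_kb, ways, sets, t_ns in l1_design_space():
    print(f"{cap_kb:>2}KB {ways}-way  sets={sets:<4} access={t_ns:.2f} ns")
```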

Which configuration do you recommend for the L1 I$? Show your work. Based on your findings, can you provide any intuition behind sizing the L1 I$? If you account for the silicon area of different cache configurations (see Table 2), would your recommendation change? If so, why?

2.5 Find the Optimal L1 D$ Configuration

For this section, you will find the optimal L1 D$ configuration that maximizes performance of the 5 SPEC benchmarks. Pick one configuration from the following cache design space: capacity[16KB,32KB,64KB] × associativity[1,2,4,8] × cache line size[64]. Assume your processor has the L1 I$ you have recommended in Section 2.4.

Which configuration do you recommend for the L1 D$? Show your work. Based on your findings, can you provide any intuition behind sizing the L1 D$? If you account for the silicon area (see Table 2), would your recommendation change? If so, why?

2.6 Find the Optimal L1 D$ Configuration with an L2$

This question is similar to Section 2.5, but now we assume that we have an L2$ between the L1$s and the DRAM. Pick one configuration for the L1 D$ from the following cache design space that maximizes performance: capacity[16KB,32KB,64KB] × associativity[1,2,4,8] × cache line size[64]. Assume your processor has the L1 I$ you have recommended in Section 2.4. Also assume a fixed 256KB 8-way set-associative L2 cache that uses 64-byte cache lines.

Which configuration do you recommend for the L1 D$? Show your work. How does the L2$ affect your cache design decisions?

3 Open-ended Portion (50% of lab grade)

Pick one of the following questions. The open-ended portion is worth a large fraction of the grade of the lab, and the grade depends on how complex and interesting a project you complete, so spend the appropriate amount of time and energy on it. Also, have fun with it!

3.1 Design a memory hierarchy that fits within a 5mm² area budget

For this question, we want to figure out how we can make the best use of our 5mm² area budget for caches. In the directed portion of the lab, we asked you to explore the design space of L1 caches, but we constrained the design space to minimize the busy work. For this study, you have the freedom to change any cache parameter. The only constraint is that the design must fit in a 5mm² area budget. Propose the best memory hierarchy for the 5 SPEC benchmarks we used in this lab.

Use Table 2 to estimate area, and Table 1 to estimate cache access time for different cache configurations. If the tables don’t have an estimate for your cache configuration, please use the CACTI web interface (http://quid.hpl.hp.com:9081/cacti) to obtain one. Use 1 bank and a 45nm technology node.
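Checking a candidate hierarchy against the budget is a simple sum over Table 2 entries. In the sketch below the areas are copied from Table 2(b) for a handful of configurations, while the helper names and the candidate hierarchies are purely illustrative and are not recommendations.

```python
# Areas in mm^2 from Table 2(b) (64-byte lines), keyed by (capacity in KB, ways).
AREA_MM2 = {
    (16, 1): 0.25, (32, 2): 0.44, (64, 4): 0.84,
    (256, 8): 2.75, (512, 8): 3.78, (1024, 8): 5.81,
}
BUDGET_MM2 = 5.0


def total_area(caches):
    """Sum the area of every cache in the hierarchy."""
    return sum(AREA_MM2[cache] for cache in caches)


def fits_budget(caches, budget=BUDGET_MM2):
    return total_area(caches) <= budget


# Hypothetical hierarchy: L1 I$, L1 D$, L2$.
candidate = [(16, 1), (32, 2), (256, 8)]
print(round(total_area(candidate), 2), fits_budget(candidate))  # 3.44 True
```

Extending `AREA_MM2` with the remaining table entries (or with numbers from CACTI) lets the same check filter an automated sweep before any simulation is launched.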

Make sure to report all the statistics you gathered and calculations you made to reach your conclusions.

3.2 Design a victim cache

Although direct-mapped caches have smaller access times than set-associative caches, they suffer more conflict misses due to their lack of associativity. To reduce these conflict misses, N. Jouppi proposed victim caching, in which a small fully-associative backup cache, called a victim cache, is added to a direct-mapped L1 cache to hold recently evicted cache lines.

Given a 32KB direct-mapped L1 D$, design your own victim cache. Assume that the backside of the L1 D$ is directly hooked up to the DRAM. To read more about victim caches, please consult problem 2.4 in the problem set. The only constraint you have is to add less than 2K flip-flops to the cache design.
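To make the mechanism concrete before touching the simulator, here is a minimal behavioral sketch in Python of a direct-mapped cache backed by a small fully-associative victim cache. It is not the lab's C++ simulator code: the class name `VictimCacheModel`, the FIFO replacement inside the victim cache, and the swap on a victim hit are assumptions chosen for illustration, and the model tracks only block addresses (no data, dirty bits, or timing).

```python
from collections import OrderedDict


class VictimCacheModel:
    """Direct-mapped cache plus a small fully-associative FIFO victim cache."""

    def __init__(self, num_sets, line_bytes, victim_entries):
        self.num_sets = num_sets
        self.line_bytes = line_bytes
        self.victim_entries = victim_entries
        self.lines = [None] * num_sets   # block address resident in each set
        self.victim = OrderedDict()      # evicted blocks, oldest first
        self.l1_hits = 0
        self.victim_hits = 0
        self.misses = 0

    def access(self, addr):
        """Simulate one access; return 'l1_hit', 'victim_hit', or 'miss'."""
        block = addr // self.line_bytes
        index = block % self.num_sets
        resident = self.lines[index]
        if resident == block:
            self.l1_hits += 1
            return "l1_hit"
        if block in self.victim:
            # Swap: the block moves back into L1, the resident line takes its place.
            del self.victim[block]
            self.victim_hits += 1
            outcome = "victim_hit"
        else:
            self.misses += 1
            outcome = "miss"
        self.lines[index] = block
        if resident is not None:
            self._insert_victim(resident)
        return outcome

    def _insert_victim(self, block):
        if self.victim_entries == 0:
            return
        self.victim[block] = True
        while len(self.victim) > self.victim_entries:
            self.victim.popitem(last=False)   # drop the oldest entry (FIFO)


model = VictimCacheModel(num_sets=4, line_bytes=64, victim_entries=2)
for addr in [0, 256, 0, 256]:   # both addresses map to set 0
    model.access(addr)
print(model.l1_hits, model.victim_hits, model.misses)  # 0 2 2
```

With `victim_entries=0` the same trace produces four misses, which is the ping-pong conflict pattern a victim cache is meant to absorb.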

First sketch out a block diagram. Then modify the cache simulator to model your victim cache. You will have to modify the `cache_sim_t::access()` function in `$LAB2ROOT/riscv-isa-sim/riscv/cachesim.cc`. Recompile the ISA simulator with the steps described in Section 2.1. Run the benchmarks and record the cache statistics. Analyze the impact on AMAT and CPI.

Table 1: Cache access time (in ns) for various cache configurations in 45nm technology. Data obtained from CACTI.

(a) cache line size = 32 bytes

<table>
<thead>
<tr>
<th>associativity \ capacity</th>
<th>8KB</th>
<th>16KB</th>
<th>32KB</th>
<th>64KB</th>
<th>128KB</th>
<th>256KB</th>
<th>512KB</th>
<th>1MB</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>0.31</td>
<td>0.37</td>
<td>0.44</td>
<td>0.53</td>
<td>0.62</td>
<td>0.79</td>
<td>0.98</td>
<td>1.31</td>
</tr>
<tr>
<td>2</td>
<td>0.51</td>
<td>0.57</td>
<td>0.62</td>
<td>0.65</td>
<td>0.74</td>
<td>0.84</td>
<td>1.17</td>
<td>1.52</td>
</tr>
<tr>
<td>4</td>
<td>0.56</td>
<td>0.60</td>
<td>0.65</td>
<td>0.70</td>
<td>0.74</td>
<td>0.85</td>
<td>1.17</td>
<td>1.52</td>
</tr>
<tr>
<td>8</td>
<td>0.77</td>
<td>0.78</td>
<td>0.86</td>
<td>0.89</td>
<td>0.95</td>
<td>1.03</td>
<td>1.16</td>
<td>1.52</td>
</tr>
<tr>
<td>16</td>
<td>N/A</td>
<td>1.21</td>
<td>1.24</td>
<td>1.30</td>
<td>1.35</td>
<td>1.42</td>
<td>1.53</td>
<td>1.69</td>
</tr>
<tr>
<td>32</td>
<td>N/A</td>
<td>N/A</td>
<td>2.10</td>
<td>2.12</td>
<td>2.19</td>
<td>2.30</td>
<td>2.45</td>
<td>2.46</td>
</tr>
<tr>
<td>64</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>3.90</td>
<td>3.92</td>
<td>3.95</td>
<td>4.02</td>
<td>4.13</td>
</tr>
</tbody>
</table>

(b) cache line size = 64 bytes

<table>
<thead>
<tr>
<th>associativity \ capacity</th>
<th>8KB</th>
<th>16KB</th>
<th>32KB</th>
<th>64KB</th>
<th>128KB</th>
<th>256KB</th>
<th>512KB</th>
<th>1MB</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>0.32</td>
<td>0.34</td>
<td>0.41</td>
<td>0.53</td>
<td>0.61</td>
<td>0.79</td>
<td>1.04</td>
<td>1.31</td>
</tr>
<tr>
<td>2</td>
<td>0.60</td>
<td>0.61</td>
<td>0.64</td>
<td>0.68</td>
<td>0.74</td>
<td>0.85</td>
<td>1.09</td>
<td>1.44</td>
</tr>
<tr>
<td>4</td>
<td>0.83</td>
<td>0.84</td>
<td>0.86</td>
<td>0.91</td>
<td>0.94</td>
<td>0.99</td>
<td>1.09</td>
<td>1.44</td>
</tr>
<tr>
<td>8</td>
<td>N/A</td>
<td>1.28</td>
<td>1.30</td>
<td>1.34</td>
<td>1.37</td>
<td>1.41</td>
<td>1.48</td>
<td>1.60</td>
</tr>
<tr>
<td>16</td>
<td>N/A</td>
<td>N/A</td>
<td>2.05</td>
<td>2.21</td>
<td>2.24</td>
<td>2.29</td>
<td>2.34</td>
<td>2.44</td>
</tr>
<tr>
<td>32</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>3.97</td>
<td>4.01</td>
<td>4.04</td>
<td>4.10</td>
<td>4.18</td>
</tr>
<tr>
<td>64</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>7.20</td>
<td>7.25</td>
<td>7.28</td>
<td>7.35</td>
</tr>
</tbody>
</table>

(c) cache line size = 128 bytes

<table>
<thead>
<tr>
<th>associativity \ capacity</th>
<th>8KB</th>
<th>16KB</th>
<th>32KB</th>
<th>64KB</th>
<th>128KB</th>
<th>256KB</th>
<th>512KB</th>
<th>1MB</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>0.37</td>
<td>0.45</td>
<td>0.46</td>
<td>0.53</td>
<td>0.62</td>
<td>0.85</td>
<td>1.04</td>
<td>1.37</td>
</tr>
<tr>
<td>2</td>
<td>0.83</td>
<td>0.84</td>
<td>0.87</td>
<td>0.90</td>
<td>0.94</td>
<td>0.98</td>
<td>1.06</td>
<td>1.42</td>
</tr>
<tr>
<td>4</td>
<td>N/A</td>
<td>1.29</td>
<td>1.29</td>
<td>1.32</td>
<td>1.35</td>
<td>1.39</td>
<td>1.44</td>
<td>1.53</td>
</tr>
<tr>
<td>8</td>
<td>N/A</td>
<td>N/A</td>
<td>2.16</td>
<td>2.18</td>
<td>2.25</td>
<td>2.27</td>
<td>2.33</td>
<td>2.36</td>
</tr>
<tr>
<td>16</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>3.92</td>
<td>3.96</td>
<td>4.02</td>
<td>4.02</td>
<td>4.12</td>
</tr>
<tr>
<td>32</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>7.42</td>
<td>7.46</td>
<td>7.49</td>
<td>7.54</td>
</tr>
<tr>
<td>64</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>13.84</td>
<td>13.88</td>
<td>13.92</td>
</tr>
</tbody>
</table>

Table 2: Cache area (in mm²) for various cache configurations in 45nm technology. Data obtained from CACTI.

(a) cache line size = 32 bytes

<table>
<thead>
<tr>
<th>associativity \ capacity</th>
<th>8KB</th>
<th>16KB</th>
<th>32KB</th>
<th>64KB</th>
<th>128KB</th>
<th>256KB</th>
<th>512KB</th>
<th>1MB</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>0.09</td>
<td>0.23</td>
<td>0.35</td>
<td>0.54</td>
<td>1.12</td>
<td>1.53</td>
<td>3.11</td>
<td>6.78</td>
</tr>
<tr>
<td>2</td>
<td>0.12</td>
<td>0.16</td>
<td>0.23</td>
<td>0.46</td>
<td>0.65</td>
<td>1.30</td>
<td>2.94</td>
<td>5.49</td>
</tr>
<tr>
<td>4</td>
<td>0.18</td>
<td>0.34</td>
<td>0.40</td>
<td>0.65</td>
<td>0.65</td>
<td>1.30</td>
<td>2.94</td>
<td>4.88</td>
</tr>
<tr>
<td>8</td>
<td>0.37</td>
<td>0.49</td>
<td>0.59</td>
<td>0.73</td>
<td>1.32</td>
<td>1.36</td>
<td>2.62</td>
<td>5.55</td>
</tr>
<tr>
<td>16</td>
<td>N/A</td>
<td>0.80</td>
<td>0.99</td>
<td>1.08</td>
<td>1.34</td>
<td>2.29</td>
<td>3.37</td>
<td>5.51</td>
</tr>
<tr>
<td>32</td>
<td>N/A</td>
<td>N/A</td>
<td>1.77</td>
<td>2.06</td>
<td>2.23</td>
<td>3.40</td>
<td>4.07</td>
<td>6.46</td>
</tr>
<tr>
<td>64</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>4.03</td>
<td>4.32</td>
<td>4.79</td>
<td>5.83</td>
<td>7.96</td>
</tr>
</tbody>
</table>

(b) cache line size = 64 bytes

<table>
<thead>
<tr>
<th>associativity \ capacity</th>
<th>8KB</th>
<th>16KB</th>
<th>32KB</th>
<th>64KB</th>
<th>128KB</th>
<th>256KB</th>
<th>512KB</th>
<th>1MB</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>0.21</td>
<td>0.25</td>
<td>0.33</td>
<td>0.73</td>
<td>1.25</td>
<td>1.95</td>
<td>3.24</td>
<td>6.98</td>
</tr>
<tr>
<td>2</td>
<td>0.34</td>
<td>0.37</td>
<td>0.44</td>
<td>0.57</td>
<td>1.10</td>
<td>1.47</td>
<td>3.12</td>
<td>6.99</td>
</tr>
<tr>
<td>4</td>
<td>0.55</td>
<td>0.64</td>
<td>0.71</td>
<td>0.84</td>
<td>1.48</td>
<td>2.03</td>
<td>3.12</td>
<td>6.99</td>
</tr>
<tr>
<td>8</td>
<td>N/A</td>
<td>1.12</td>
<td>1.26</td>
<td>1.39</td>
<td>2.22</td>
<td>2.75</td>
<td>3.78</td>
<td>5.81</td>
</tr>
<tr>
<td>16</td>
<td>N/A</td>
<td>N/A</td>
<td>3.11</td>
<td>2.54</td>
<td>2.78</td>
<td>4.35</td>
<td>5.36</td>
<td>7.34</td>
</tr>
<tr>
<td>32</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>4.89</td>
<td>5.14</td>
<td>5.64</td>
<td>8.66</td>
<td>10.92</td>
</tr>
<tr>
<td>64</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>10.23</td>
<td>10.75</td>
<td>15.44</td>
<td>17.37</td>
</tr>
</tbody>
</table>

(c) cache line size = 128 bytes

<table>
<thead>
<tr>
<th>associativity \ capacity</th>
<th>8KB</th>
<th>16KB</th>
<th>32KB</th>
<th>64KB</th>
<th>128KB</th>
<th>256KB</th>
<th>512KB</th>
<th>1MB</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>0.70</td>
<td>0.75</td>
<td>0.82</td>
<td>0.98</td>
<td>2.03</td>
<td>3.13</td>
<td>4.60</td>
<td>7.38</td>
</tr>
<tr>
<td>2</td>
<td>1.12</td>
<td>1.27</td>
<td>1.35</td>
<td>1.48</td>
<td>1.77</td>
<td>3.30</td>
<td>4.50</td>
<td>7.39</td>
</tr>
<tr>
<td>4</td>
<td>N/A</td>
<td>2.18</td>
<td>2.36</td>
<td>2.49</td>
<td>2.74</td>
<td>4.65</td>
<td>5.72</td>
<td>7.85</td>
</tr>
<tr>
<td>8</td>
<td>N/A</td>
<td>N/A</td>
<td>4.29</td>
<td>4.57</td>
<td>5.52</td>
<td>6.04</td>
<td>7.16</td>
<td>10.63</td>
</tr>
<tr>
<td>16</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>8.65</td>
<td>9.16</td>
<td>10.69</td>
<td>10.58</td>
<td>13.72</td>
</tr>
<tr>
<td>32</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>17.90</td>
<td>18.36</td>
<td>19.35</td>
<td>21.84</td>
</tr>
<tr>
<td>64</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>36.17</td>
<td>37.17</td>
<td>39.01</td>
</tr>
</tbody>
</table>

Estimate the impact on the critical path and on performance. Change parameters in your design and see which configuration works best.

Make sure to report all the statistics you gathered and calculations you made to reach your conclusions.

Feel free to email your TA or attend office hours if you need help understanding the ISA simulator, the cache simulator, or anything else regarding this problem.

3.3 Design your own hardware prefetcher

For this question, we want to investigate whether hardware data prefetching would improve performance of the 5 SPEC benchmarks we have used in the directed portion. Please take a look at lecture 7 for more information on hardware prefetching.

Assume you are building a hardware prefetcher for a 32KB 4-way set-associative L1 D$. The backside of the L1 D$ is directly hooked up to the DRAM. The only constraint you have is to add less than 2K flip-flops to the cache design.

First understand what the current simulator does, and plan a few ways to improve performance. Then modify the cache simulator to model your hardware prefetcher. You will have to modify the `cache_sim_t::access()` function in `$LAB2ROOT/riscv-isa-sim/riscv/cachesim.cc`. Recompile the ISA simulator with the steps described in Section 2.1. Run the benchmarks and record the cache statistics. Analyze the impact on AMAT and CPI. Estimate the impact on the critical path and on performance. Change parameters in your design and see which configuration works best.
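Before changing the C++ simulator, it can help to prototype an idea in a few lines. The sketch below is a toy Python model, not the lab's simulator: it implements a set-associative cache with LRU replacement (an assumption) and the simplest possible scheme, a next-line prefetch issued on every demand miss. The class name and the trace are invented for illustration, and prefetch timing and memory bandwidth are ignored.

```python
from collections import OrderedDict


class NextLinePrefetchCache:
    """Set-associative LRU cache model with an optional next-line prefetcher."""

    def __init__(self, num_sets, ways, line_bytes, prefetch=True):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        self.prefetch = prefetch
        # One OrderedDict per set; the first key is the least recently used block.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def _fill(self, block):
        """Bring a block into its set, evicting the LRU block if the set is full."""
        lines = self.sets[block % self.num_sets]
        if block in lines:
            return
        if len(lines) >= self.ways:
            lines.popitem(last=False)
        lines[block] = True

    def access(self, addr):
        """Simulate one demand access; return True on a hit."""
        block = addr // self.line_bytes
        lines = self.sets[block % self.num_sets]
        if block in lines:
            lines.move_to_end(block)   # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        self._fill(block)
        if self.prefetch:
            self._fill(block + 1)      # next-line prefetch on a demand miss
        return False


trace = [i * 64 for i in range(8)]     # sequential scan, one access per line
for enabled in (False, True):
    cache = NextLinePrefetchCache(num_sets=4, ways=2, line_bytes=64, prefetch=enabled)
    for addr in trace:
        cache.access(addr)
    print(enabled, cache.hits, cache.misses)
```

On this purely sequential trace the prefetcher halves the miss count; on irregular traces a prefetched line can evict useful data, which is one of the effects worth measuring on the real benchmarks.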

Make sure to report all the statistics you gathered and calculations you made to reach your conclusions.

Feel free to email your TA or attend office hours if you need help understanding the ISA simulator, the cache simulator, or anything else regarding this problem.

3.4 Design your own replacement policy

For this question, we want to investigate whether a different replacement policy would improve performance of the 5 SPEC benchmarks we have used in the directed portion. Please take a look at lecture 6 for more information on replacement policies.

Assume you are designing a new replacement policy for a 32KB 4-way set-associative L1 D$. The backside of the L1 D$ is directly hooked up to the DRAM. The only constraint you have is to add less than 2K flip-flops to the cache design.

First understand what the current simulator does, and plan a few ways to improve performance. Then modify the cache simulator to model your new replacement policy. You will have to modify the `cache_sim_t::victimize()` function in `$LAB2ROOT/riscv-isa-sim/riscv/cachesim.cc`. Recompile the ISA simulator with the steps described in Section 2.1. Run the benchmarks and record the cache statistics. Analyze the impact on AMAT and CPI. Estimate the impact on the critical path and on performance. Change parameters in your design and see which configuration works best.
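A small experiment on a single cache set is often enough to see how two policies diverge. The Python sketch below is illustrative only: it compares LRU and FIFO victim selection on a made-up trace of tags, and it makes no claim about what the simulator's `victimize()` currently does. The function name and policy strings are hypothetical.

```python
def simulate_set(trace, ways, policy):
    """Count misses for one cache set under 'lru' or 'fifo' replacement."""
    if policy not in ("lru", "fifo"):
        raise ValueError("policy must be 'lru' or 'fifo'")
    resident = []   # index 0 holds the next victim
    misses = 0
    for tag in trace:
        if tag in resident:
            if policy == "lru":
                # A hit refreshes the line's position under LRU only.
                resident.remove(tag)
                resident.append(tag)
            continue
        misses += 1
        if len(resident) == ways:
            resident.pop(0)   # evict the line at the head of the order
        resident.append(tag)
    return misses


trace = list("ABCDAEA")   # made-up tag sequence for a 4-way set
print("lru ", simulate_set(trace, ways=4, policy="lru"))    # 5
print("fifo", simulate_set(trace, ways=4, policy="fifo"))   # 6
```

Here the reuse of `A` just before `E` arrives protects it under LRU, while FIFO evicts it and pays an extra miss. The list-based bookkeeping also hints at the hardware cost question in this problem: any ordering you keep per set must fit in the flip-flop budget.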

Make sure to report all the statistics you gathered and calculations you made to reach your conclusions.

Feel free to email your TA or attend office hours if you need help understanding the ISA simulator, the cache simulator, or anything else regarding this problem.

4 The Third Portion: Feedback

In order to improve the labs for the next offering of this course we would like your feedback. Please submit your feedback via an online form (the domain will be provided separately).

How many hours did the directed portion take you? How many hours did you spend on the open-ended portion? Was this lab boring? Did you learn anything? Is there anything you would change? Feel free to write as little or as much as you want (a point will be taken off only if left completely empty).

4.1 Team Feedback

In addition to feedback on the individual portion of the lab, we would like you to answer a few questions about your team:

1. In one short paragraph, describe your contributions to the project.

2. Describe the contribution of each of your teammates.

3. Do you think that every member of the team contributed fairly? If no, why?

5 Acknowledgments

Many people have contributed to versions of this lab over the years. This lab was originally developed for CS152 at UC Berkeley by Yunsup Lee and Andrew Waterman, and heavily inspired by the previous set of CS 152 labs (which targeted the Simics emulators) written by Henry Cook. This lab was made possible through the work of Andrew Waterman, Yunsup Lee, David Patterson, and Krste Asanović who developed the RISC-V ISA.