# CS61C Summer 2014 Lab 6 - Data Locality and Cache Blocking

## Goal

• To start seeing how the memory access patterns of your program affect its performance.
• To understand how you can modify your memory access patterns to improve performance.
• To learn about cache blocking.

## Setup

Copy the directory ~cs61c/labs/su14/06 to an appropriate place under your home directory:

```
$ cp -r ~cs61c/labs/su14/06 ~/lab06
```

## Background Information

### Matrix Multiplication

If you recall, matrices are 2-dimensional data structures where each data element is accessed via two indices. To multiply two matrices, we can simply use 3 nested loops, assuming that matrices A, B, and C are all n-by-n and stored in one-dimensional column-major arrays:

```
for (int i = 0; i < n; i++) {                 // outer loop
    for (int j = 0; j < n; j++) {             // middle loop
        for (int k = 0; k < n; k++) {         // inner loop
            C[i+j*n] += A[i+k*n] * B[k+j*n];  // how many memory accesses each time this is run?
        }
    }
}
```

Matrix multiplication operations are at the heart of many linear algebra algorithms, and efficient matrix multiplication is critical for many applications within graphics (e.g. video games) and the applied sciences.

In the above code, note that the loops are ordered i, j, then k. Considering the innermost loop (k), we move through B with stride 1, A with stride n, and C with stride 0. The loop order doesn't matter for correctness, but how we choose to stride through the matrices can have a large impact on performance. Caches perform better (more cache hits, fewer cache misses) when memory accesses exhibit good spatial and temporal locality. Optimizing a program's memory access pattern is essential to obtaining good performance from the memory hierarchy.

### Matrix Transposition

Sometimes, we wish to swap the rows and columns of a matrix. This operation is called a "transposition" and an efficient implementation can be quite helpful while performing more-complicated linear algebra operations. The transpose of matrix A is often denoted Aᵀ.

### Cache Blocking

In the above code for matrix multiplication, note that we are striding across the entire matrices to compute a single value of C. As such, we are constantly accessing new values from memory and obtain very little reuse of cached data! We can improve the amount of data reuse in cache by implementing a technique called cache blocking. More formally, cache blocking is a technique that attempts to reduce the cache miss rate by improving the temporal and/or spatial locality of memory accesses. In the case of matrix transposition we consider completing the transposition one block matrix at a time.

Note: Try not to confuse the two senses of "block": a "cache block" is the amount of data transferred between cache and memory, while a "block matrix" is an arbitrary section of a larger matrix that we operate on!

Under this scheme, each block Aij of matrix A is transposed into its final location in the output matrix. This significantly reduces the size of the working set in cache at any one point in time, which (if implemented correctly) will result in a substantial improvement in performance. For this lab, you will implement a cache blocking scheme for matrix transposition and analyze its performance. As a side note, you will be required to implement several levels of cache blocking for matrix multiplication in Project 2.

### Lab Machine Specs (Optional)

Find information on your machine's caches if you are on a Mac (e.g. the Orchard):

```
$ /usr/sbin/sysctl -a | grep cache
```

Find information on your machine's caches if you are on a Linux machine (e.g. the Hive):

```
$ cat /proc/cpuinfo | grep cache
```

Note: cpuinfo contains cache info for EACH of the machine's logical processors (16 logical processors total: 8 physical cores across 2 physical sockets).

## Exercises

### Exercise 1: Matrix multiply

Take a glance at matrixMultiply.c. You'll notice that the file contains multiple implementations of matrix multiply with 3 nested loops. Compile and run this code with the following command:

```
$ make ex1
```

Note: Make sure to run on a lab machine (preferably the Hive, as gcc is broken on some of the Orchard machines)! It is important here that we use the '-O3' flag to turn on compiler optimizations.

The makefile will run matrixMultiply() twice and report the function's performance in floating-point operations per second (flops).

1. It will report the performance of the same matrix multiplication on matrices of varying sizes.
2. It will report the performance of a matrix multiplication of fixed size for changing the loop orderings.

Copy the results somewhere so that you do not have to run this program again, and use them to help you answer the following questions. Please note that flops reports the number of calculations done per unit time, so it already accounts for differences in total execution time (i.e. algorithm scaling is not a valid answer to the questions below).

1. Think about the naive matrix multiplication algorithm. Is it relatively good or bad in terms of locality? Why?
2. Why does performance drop for large values of n? (Hint: what gets loaded on a cache miss? How long does that data stay there?)
3. Which of the 6 different loop orderings perform best for 1000-by-1000 matrices? Which ordering(s) perform the worst?
4. How does the way we stride through the matrices with respect to the innermost loop affect performance?

#### Check-off

• Be prepared to EXPLAIN your responses to the above questions to your TA.

### Exercise 2: Matrix transposition

Compile and run the naive matrix transposition implemented in transpose.c by running:

```
$ make ex2
```

Note: Make sure to run on a lab machine (Hive or Orchard)! It is important here that we use the '-O3' flag to turn on compiler optimizations.

1. Note the time required to perform the naive transposition for a matrix of size 2000-by-2000.
2. Modify the function transpose() in transpose.c to implement a single level of cache blocking. That is, loop over all matrix blocks and transpose each into the destination matrix. Make sure to handle the "fringe" cases of the transposition: what happens if we try to transpose a 5-by-5 matrix with a blocksize of 2? (Hint: this can be done by just adding new for-loops and modifying the existing ones a little.)
3. Try block sizes of 2-by-2, 100-by-100, 200-by-200, 400-by-400, 600-by-600, and 1000-by-1000. Which performs best on the 2000-by-2000 matrix? Which performs worst?
4. For a particular matrix size and cache parameters, there will exist an "optimal" block size for this algorithm. What is this optimal block matrix size in terms of the cache parameters (associativity, block size, cache size)? (Hint: how much memory do we need to "touch" in order to transpose a block matrix?)