# CS61C Summer 2014 Lab 13 - GPU Programming

## Goals

Thus far in 61C, we've focused on achieving maximum performance from our CPU. In this lab, we'll attempt high-performance computing using the Nvidia Tesla C1060 Co-Processor that is installed in each hive machine (note that this is a separate card from the Nvidia Quadro FX580 powering the display). This is a GPU-based co-processor designed to allow for extremely high-performance mathematical and scientific computation (SIMD-to-the-extreme). We'll explore Nvidia's CUDA framework.

## Setup

Copy the directory ~cs61c/labs/su14/13 to an appropriate directory under your home directory.

## Exercises

### Exercise 0 - Weighted Vector Addition Walkthrough

This exercise uses the file weightVecAdd.cu. Assuming A and B are vectors and c and d are scalars, weightVecAdd performs the following computation: Result = A*c + B*d. Thus, we're performing the addition of the vectors A and B, weighted by the scalar values c and d.

Before we jump into writing code, let's take a moment to get aquainted with the CUDA programming model.

#### Heterogeneous Programming

The CUDA framework is an example of "Heterogeneous" Programming; the C code you will write (in .cu files, compiled by the nvcc compiler) will run on two separate processors - the CPU and the GPU - each with their own dedicated (separate) memory regions. At a very high level, your code will perform as follows (we will elaborate on the specific steps later):

1. Prepare memory for computation (copy input data from the CPU memory region to the GPU memory region).
2. Dispatch GPU code to the GPU, wait for completion. The code we delegate to run on the GPU is called the Kernel.
4. That's it! Return.

#### Preparing Memory

Note: Most of the time, we'll perform this step for you - however it's a good idea to get acquainted with this step, or you won't know what's being passed into the functions you're expected to write!

The first step in preparing memory is to malloc space in the GPU's addressible region. To do this, we'll use the cudaMalloc(void** ptr, size_t size) function. This function takes in a pointer to a pointer and assigns to that pointer the address of the allocated space. For example:

```float* gpuArray;
size_t len = gpuArrayLen*sizeof(float);
cudaMalloc(&gpuArray, len);
// at this point, gpuArray is a pointer to the allocated space on the GPU
```

Correspondingly, when we're all done with our GPU computation, we'll use the cudaFree(void* ptr) function to free our allocated space. The semantics of this function are identical to its CPU counterpart.

Next, we'll use cudaMemcpy(void* dst, void* src, size_t copy_size, OP_SPECIFIER) to copy data between CPU addressible memory and GPU addressible memory. Here's a breakdown of the arguments it takes:
• void* dst - pointer to the memory you want to copy data into
• void* src - pointer to the memory you want to copy data from
• size_t copy_size - the amount of data you wish to copy in bytes
• OP_SPECIFIER - You will pass it one of two values: cudaMemcpyHostToDevice, if dst is a pointer to GPU addressible memory and src is a pointer to CPU addressible memory OR cudaMemcpyDeviceToHost, if the opposite is true. These values come from <cuda_runtime.h>, you may simply use them.
Finally, you'll notice in weightVecAdd.cu that all of our calls to memory functions are surrounded by the macro CUDA_SAFE_CALL(...). We're using CUDA's convenient automatic error-checking functionality that comes from <cutil.h> (similar to how we would protect malloc calls in normal C code with a NULL check).

#### The Kernel

The essence of GPU computation is a function called the kernel. The kernel is simply a C function that performs your computation, as adapted for the GPU. A kernel function is marked with the modifier __global__. It is compiled separately to run on the GPU's architecture (not x86). Before we take a look at how the kernel actually performs our computation, we need to make a quick digression to the logical geometry of the GPU.

#### The Logical Geometry of the GPU

Just like in CPUs, the fundamental unit of execution in the GPU is the thread. However, GPUs are able to handle many more threads than CPUs. The reason for this is a change in how we approach the problem - instead of having a small set of long-running threads, we instead segment our problem into lots of "mini" threads, each of which are dispatched to run a single kernel call (which usually consists of a rather small amount of code). The GPU then schedules these across the large number of hardware threads it can handle. (This description isn't strictly accurate, but sufficient for our purposes).

Unlike CPUs however, GPUs have a defined "geometry" of threads to which we must fit our problem. This has a two-fold advantage - from the hardware perspective, it allows us to reduce the amount of management hardware the GPU needs to schedule and dispatch threads. The primary advantage we gain as programmers from this special layout is that we can easily manage our work across many more threads than a CPU is capable of concurrently executing.

In the CUDA framework, threads are first organized into a unit called a block. The Tesla C1060 is able to have a total of 512 threads per block. Blocks can be 3-Dimensional structures (x, y, z), but for our purposes, we'll set x to 512, y to 1, and z to 1 (effectively treating the block as one-dimensional). Note that in general, x * y * z must be less than or equal to 512 for your code to work.

Blocks are then built into units called grids. Grids are again multi-dimensional (up to 2-dimensional in the case of the C1060), but we'll stick to a single dimension in this example. Thus, we'll have a grid with an x-dimension of up to 65535 (we'll determine the actual value at runtime) and a y-dimension of one (Hint: You'll want to keep this fact in mind as you approach Checkoff Q1). Below is a table of geometry-related specifications for the Tesla C1060.

SpecificationTesla C1060
(Compute Capability 1.3)
Max Number of Grid dimensions2
Max x, y, or z dimension of Grid65535
Max Number of Block dimensions3
Max x or y dimension of Block512
Max z dimension of Block64
Max number of threads per Block512

#### Back to the Kernel

The sample Kernel from weightVecAdd.cu is below:

```__global__ void weightedVecAddKernel(float* out, float* A, float* B, int len, float weight_a, float weight_b) {
}
}
```

The most important thing to note here is that each call of the kernel computes a single element of the output. Each thread makes exactly one call to the above kernel and thus computes one value of our output.

Our job when we write these kernels is to determine which index a thread should compute for. Luckily, we have values available to us in the kernel that tell us which thread we are (where we fall in the GPU geometry). We've enumerated a few of these below:
• threadIdx.{x, y, or z} - A thread's index within a block
• blockIdx.{x or y} - A block's index within a grid
• blockDim.{x, y, or z} - The size of the given dimension in the block. If we've simply allocated 512 threads per block, blockDim.x is 512 and blockDim.y and blockDim.z are 1.
In this example, the thread computes which index it's working on by taking the block index, multiplying it by the size of a block (in number of threads), and then adding the thread number within that block. We then use this index to do a single computation.

#### Invoking the Kernel

Finally, lets take a look at actually dispatching the kernel to the GPU. This happens in void weightedVecAdd(...). An amended version is below for convenience:

```void weightedVecAdd(float* out, float* A, float* B, int len, float weight_a, float weight_b) {
int threads_per_block = 512; // set to the max possible
int blocks_per_grid = (len/threads_per_block)+1; // similar to what we'd do with intrinsics/openMP

// 3-dim vector objects to initialize values
dim3 dim_blocks_per_grid(blocks_per_grid, 1);

// launch kernel on GPU

cudaThreadSynchronize(); // wait for GPU to finish computation
CUT_CHECK_ERROR(""); // check for errors reported by the GPU
}
```

First, we hardcode the number of threads per block to 512. You'll generally want to stick with this value. Next we compute blocks per grid as you'd expect when doing any kind of parallel computation. Note here that we purposefully allocate an extra block of threads to handle edge cases - this has a negligible impact on performance in this case. We prevent from overrunning the end of our array thanks to the if-statement inside the kernel.

Remember that we established a geometry above. In order to feed this geometry to the CUDA runtime that'll dispatch our kernel code, we build dim3 objects. You'll notice that the declaration of these objects looks a little odd if you're only familiar with C. This declaration/initialization style is taken from C++, and the nvcc compiler allows us to use this semantic for the purpose of creating dim3 objects, where the three arguments are x-dim, y-dim, and z-dim. The first dim3 inside the "chevron" C-extension (the <<<val1, val2>>> syntax) establishes the grid layout - note that we only feed this dim3 two arguments, since we can have at most 2 dimensions in the grid. The size of the x-dimension of our grid in this case is simply the number of blocks we wish to dispatch and we leave the size of the y-dimension as one. The next dim3 defines the geometry of the block. As stated above, we'll stick with a x-dimension of 512 and y and z dimensions of 1.

Finally, we pass in the arguments the kernel expects - this call performs the actual dispatch. However, it is asynchronous (it returns immediately, without waiting for the GPU computation to finish). Although this is useful in some cases, we're interested in measuring performance, so we insert a call to cudaThreadSynchronize(), which will block CPU execution and return when the entire GPU computation is complete.

Now that we've thoroughly analyzed the framework, it's time to try it out!

```\$ make weightVecAdd
```

#### Checkoff

Run weightedVecAdd and write down the answers to the following questions a text file.
1. The weightedVecAdd framework runs 9 benchmarks with increasing array sizes. Why does the last one fail?
2. How could we change the code to support larger arrays? (You don't need to implement this, just describe it). You may want to take a look at the API docs linked at the bottom.

### Exercise 1 - Reduction

This exercise uses the file reduction.cu.

Recall the reduction operation we performed in an earlier lab:

```// sums up the first len elements in float* A
float reductionCPU(float* A, int len) {
float result = 0.0;
for (int i = 0; i < len; i++) {
result += A[i];
}
return result;
}
```

In this exercise, you'll need to implement reduction on the GPU. Unfortunately, synchronization primitives are limited on the GPU, so we'll need to modify our local-sum/global-sum based algorithm slightly, but using the same basic idea. Instead of having arbitrarily sized blocks and producing a single "level" of local sums, we'll need multiple levels of "local sums." Effectively what we'll do is perform pairwise sums at each level, cutting the amount of elements to sum in half at each level of our computation. Ultimately we'll be left with two elements to sum up to produce our final result. Here is an example from Nvidia's CUDA reduction slides (slide 8): Some things you may wish to think about before writing code:
• Where is the GPU storing the intermediate local sums? Is doing this okay?
• How is work divided amongst threads? Do we need the same number of threads at each level?
• How many times do we need to dispatch the kernel? See the pseudocode below for a hint.
• If you're stuck, first consider the case where the array has 512 elements (only one thread block) and generalize.

Pseudocode for Kernel-chaining:
```gpuReduction(...):
while (level != len_of_array):
launch kernel to compute step n
wait for all kernels at this step to finish (by calling cudaThreadSynchronize())

level *= 2
```

Your job is to implement reductionKernel and reductionGPU in reduction.cu to produce a correct result. Note that you are only required to implement a very basic version of the algorithm described above (it will only beat optimized CPU code for very large arrays). You do not need to implement all of the performance tweaks described in the CUDA reduction slides, rather you will describe some of the potential improvements you could make for checkoff. Your solution only needs to be able to support arrays of size 2^n where n ≥ 9 (e.g. 512, 1024, 2048, etc.). You also do not need to worry about having a multidimensional grid. Running the following will allow you to compile your code and run every relevant test case:

```\$ make reduction
\$ ./reduction
```