CS61C Summer 2014 Lab 7 - SIMD Intrinsics and Loop Unrolling


Copy the directory ~cs61c/labs/su14/07 to an appropriate location in your home directory.

$ cp -r ~cs61c/labs/su14/07 ~/lab07

Note: All code using SIMD must be run on a lab machine (Hive or Orchard).


Exercise 1: Familiarize Yourself

There are a very large number of available SIMD intrinsics and the instruction names follow slightly different syntax than you are used to. Here we want you to learn how to find the ones that you'll need in your application.

Go to Intel's website. This can also be found by Googling "intel intrinsic guide." Download the appropriate copy of the Intel Intrinsic Guide and unzip it. Running the Intrinsics Guide will open a two-columned window.

First things to notice:

  1. The left column contains search and filter functions to narrow the instruction list in the right column.
  2. Notice the color-coded Technologies. You will not be needing anything besides MMX through SSE4.2. The right column also color-codes the instructions based on which technologies they come from.
  3. The instruction column by default is a condensed list. The left text shows the intrinsic function prototypes (as you would code into a C program). The right text shows the related SIMD assembly instruction. Clicking on an instruction name will reveal more detailed descriptions and operations. Click again to collapse.
  4. The vector variable types all start with '__m' (notice TWO underscores and ONE m) followed by their bit-widths (64 or 128).
  5. The SSE functions all start with '_mm' (notice ONE underscore and TWO m's) followed by instruction shorthand. If you learn the basics of the function naming scheme, you won't have to come back to the Intrinsics Guide very often.

Do your best to interpret the new syntax and terminology. Find the 128-bit intrinsics for the following SIMD operations (one for each):


Exercise 2: Reading SIMD Code

In this exercise you will consider the vectorization of 2-by-2 matrix multiplication in double precision:

This translates to the following arithmetic operations:

    C[0] += A[0]*B[0] + A[2]*B[1];
    C[1] += A[1]*B[0] + A[3]*B[1];
    C[2] += A[0]*B[2] + A[2]*B[3];
    C[3] += A[1]*B[2] + A[3]*B[3];

You are given the code sseTest.c that implements these operations in a SIMD manner.

The following intrinsics are used:

__m128d _mm_loadu_pd( double *p ) returns vector (p[0], p[1])
__m128d _mm_load1_pd( double *p ) returns vector (p[0], p[0])
__m128d _mm_add_pd( __m128d a, __m128d b ) returns vector (a0+b0, a1+b1)
__m128d _mm_mul_pd( __m128d a, __m128d b ) returns vector (a0b0, a1b1)
   void _mm_storeu_pd( double *p, __m128d a ) stores p[0]=a0, p[1]=a1
  1. Compile the code using "gcc -g0 -O1 -S sseTest.c" to produce the assembly file sseTest.s. Note that you can compile the .s file into binary with "gcc -o sseTest sseTest.s" and then execute as usual.
    (here -g0 is g-zero and -O1 is a capital letter 'Oh'-one)

The assembly output from gcc is in AT&T syntax, which has the following differences from what we're used to:

  1. Find the equivalent assembly code for the for-loop in sseTest.s. What seems to have happened and why would the compiler do this? Identify what each intrinsic is compiled into. Are the instrinsic functions compiled into function calls in assembly? Comment the loop so that your TA can see that you understand the code.


Exercise 3: Writing SIMD Code

For Exercise 3, you will vectorize/SIMDize the following code to achieve a significant speedup over the naive implementation shown here:

    int sum_naive( int n, int *a ) {
        int sum = 0;
        for( int i = 0; i < n; i++ ) {
            sum += a[i];
        return sum;

You might find the following intrinsics useful: (Pay special attention to the variable types of the arguments! You may find casting helpful here.)

__m128i _mm_setzero_si128( ) returns 128-bit zero vector
__m128i _mm_loadu_si128( __m128i *p ) returns 128-bit vector stored at pointer p
__m128i _mm_add_epi32( __m128i a, __m128i b ) returns vector (a0+b0, a1+b1, a2+b2, a3+b3)
   void _mm_storeu_si128( __m128i *p, __m128i a ) stores 128-bit vector a at pointer p

Start with sum.c. Use SSE instrinsics to implement the sum_vectorized() function. (You can compile using the make sum command.)


Exercise 4: Loop Unrolling

Happily, you can obtain even more performance improvement! Carefully unroll the SIMD vector sum code that you created in the previous exercise. This should get you about a factor of 2 further increase in performance. As an example of loop unrolling, consider the supplied function sum_unrolled():

    int sum_unrolled( int n, int *a ) {
        int sum = 0;

        /* do the body of the work in a faster unrolled loop */
        for( int i = 0; i < n/4*4; i += 4 ) {
            sum += a[i+0];
            sum += a[i+1];
            sum += a[i+2];
            sum += a[i+3];

        /* handle the small tail in a usual way */
        for( int i = n/4*4; i < n; i++ ) {
            sum += a[i];

        return sum;

Also, feel free to check out Wikipedia's article on loop unrolling for more information.

Within sum.c, copy your sum_vectorized() code into sum_vectorized_unrolled() and unroll it four times. (Again, you can compile using the make sum command.)