CS61C Fall 2010


Lab 4a: SIMD and Loop unrolling


  1. Introduction
  1. As you learned in class, using Single-Instruction Multiple Data (SIMD) techniques for data-level parallelism can be very useful when looking for performance improvements. In this lab, you will explore a specific example of SIMD: Intel’s Streaming SIMD Extensions (SSE). These extensions give the programmer and compiler the ability to perform a single arithmetic operation on multiple data instances in parallel. For example, one may perform 2 double-precision or 4 single-precision adds simultaneously. The performance improvement with SSE is clearly upper-bounded by a constant factor (depending on the width of the data). This may seem small, but it is necessary to achieve maximal machine floating-point throughput. In other words, double-precision code without SSE will never achieve more than 50% of the peak floating-point capability of a machine!
  2. At a lower level, x86 implements SSE with 8 128-bit registers: xmm0-xmm7. During Exercise 1, you will see how gcc moves data to and from these registers.
  3. Please see the recent SIMD lecture for further introductory information.
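For a concrete taste of SSE, here is a minimal sketch that adds two pairs of doubles with a single SSE2 instruction via compiler intrinsics. (The helper name addPairs is ours for illustration, not part of the lab code.)

```c
#include <emmintrin.h>  // SSE2 intrinsics; an xmm register holds two doubles

// Add two pairs of doubles in one hardware instruction.
// a, b, and c each point to two doubles; c[i] = a[i] + b[i].
void addPairs(const double *a, const double *b, double *c) {
    __m128d va = _mm_loadu_pd(a);          // load a[0], a[1] into an xmm register
    __m128d vb = _mm_loadu_pd(b);          // load b[0], b[1]
    _mm_storeu_pd(c, _mm_add_pd(va, vb));  // one addpd performs both additions
}
```

Compiling this with -S (as in Exercise 1 below) lets you see exactly which SSE instructions the intrinsics map to.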
  2. Setup
  1. Copy the lab tarball “~cs61c/labs/04a/04a.tar.gz” to your home directory, and decompress it with “tar -zxvf 04a.tar.gz”.
  3. Exercise 1: SSE C to assembly
  1. If you consider the file sseTest.c, you will note that it contains a working implementation of the 2x2 matrix multiply example given in class.
  2. Compile sseTest.c to assembly with the command:

                gcc -g0 -O2 -S sseTest.c

  3. The result of this compilation is an x86 assembly file named sseTest.s. Open this file, and search for the label “_main:”. (Note: you can compile sseTest.s into a binary with “gcc sseTest.s” and then run it as usual.)
  4. Key point: the gcc assembly output is in AT&T syntax, so the result of an instruction is stored in the second operand of the instruction. For example,

        movapd source, dest

  5. Comment the _main function within sseTest.s to show that you understand the mapping from C compiler intrinsics to SSE instructions. Show this to your TA.
  4. Exercise 2: Dot Product
  1. The file dotProduct.c contains a naive implementation of a dot product, plus some benchmarking code. The program takes a single command-line argument: the size of the vectors upon which to perform the dot product. Note that the benchmarking code uses the x86 rdtsc instruction for timing. Recall that a dot product takes two vectors of identical length (a and b) and calculates:

dotProduct(a,b) = a1*b1 + a2*b2 + ... + an*bn
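In C, this formula corresponds to a simple scalar loop. A sketch follows; the name naiveDotProduct and its signature are illustrative, and the version in dotProduct.c may differ.

```c
// Naive scalar dot product: one multiply and one add per element.
double naiveDotProduct(const double *a, const double *b, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```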

  2. Implement the sseDotProduct function, and show your TA that it runs faster than the naive code. (Make sure to compile with -O3.)
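The rdtsc instruction mentioned above reads the CPU’s 64-bit timestamp counter, which increments every clock cycle. One common way to access it from C on x86 is GCC inline assembly; this is a sketch of the idea, not necessarily how dotProduct.c implements it.

```c
#include <stdint.h>

// Read the x86 timestamp counter: rdtsc places the low 32 bits of the
// counter in EAX and the high 32 bits in EDX.
static inline uint64_t read_tsc(void) {
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
```

Reading the counter before and after a region of code and subtracting gives an elapsed-cycle count for that region.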


  5. Exercise 3: Loop unrolling
  1. Duff’s Device
  1. What does this code snippet do? Explain it to your TA.


// to and from point to arrays
double *to, *from;
register int count;        // number of elements to copy
register int n = (count + 7) / 8;
switch (count % 8) {
case 0:        do {    *to++ = *from++;
case 7:                *to++ = *from++;
case 6:                *to++ = *from++;
case 5:                *to++ = *from++;
case 4:                *to++ = *from++;
case 3:                *to++ = *from++;
case 2:                *to++ = *from++;
case 1:                *to++ = *from++;
               } while (--n > 0);
}

  2. The above code is a refined implementation of a technique called loop unrolling. What are some advantages of this implementation? What are possible disadvantages of loop unrolling?
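For comparison, here is a plainer (non-Duff) 4-way unrolled copy loop; the name unrolledCopy is illustrative. It shows the usual pattern of a wide main loop plus a scalar cleanup loop for the leftover elements.

```c
// Copy n doubles with a 4-way unrolled main loop.
void unrolledCopy(double *to, const double *from, int n) {
    int i = 0;
    // Main loop: four element copies per iteration, so the loop branch
    // and counter update are amortized over four elements.
    for (; i + 3 < n; i += 4) {
        to[i]     = from[i];
        to[i + 1] = from[i + 1];
        to[i + 2] = from[i + 2];
        to[i + 3] = from[i + 3];
    }
    // Cleanup loop: handle the remaining 0-3 elements.
    for (; i < n; i++)
        to[i] = from[i];
}
```

Duff’s device fuses the cleanup work into the main loop by jumping into the middle of the unrolled body, at the cost of considerably harder-to-read code.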