Lab 8: SIMD Intrinsics and Unrolling

Setup

Copy the contents of ~cs61c/labs/sp11/08 to your home directory:

$ cp -r ~cs61c/labs/sp11/08/ lab8

Exercise 1 — Warm up

Given the large number of available SIMD intrinsics, we want you to learn how to find the ones you'll need in your application.

Here is a way to find the necessary information:
1. Google "intel intrinsic guide"
2. Go to the Intel web site you find and download the "Intel Intrinsic Guide"

Your task:
Find the 128-bit intrinsics for the following SIMD operations:
∗    Four floating point divisions in single precision (i.e. float)
∗    Sixteen max operations over signed 8-bit integers (i.e. char)
∗    Arithmetic shifts right of eight signed 16-bit integers (i.e. short)

Record these intrinsics in a text file to show your GSI.
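To see what a guide entry looks like in use, here is a small sketch built around _mm_add_ps. (This is an illustration only — _mm_add_ps is deliberately NOT one of the three intrinsics you are asked to find — but every intrinsic in the guide is called the same way.)

```c
#include <xmmintrin.h>  /* SSE: __m128 and the _ps intrinsics */

/* Illustration only: computes c[i] = a[i] + b[i] for four floats at once.
   The intrinsics you look up in the guide are used in exactly this style. */
void add_four_floats(const float *a, const float *b, float *c)
{
    __m128 va = _mm_loadu_ps(a);           /* load four floats from a */
    __m128 vb = _mm_loadu_ps(b);           /* load four floats from b */
    _mm_storeu_ps(c, _mm_add_ps(va, vb));  /* four additions, one store */
}
```

Compile with gcc as usual; the SSE headers ship with the compiler on the lab machines.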

Exercise 2 — Study some SIMDized code

In this exercise you will study the vectorization of 2x2 matrix multiplication in double precision, C = C + A*B, with the matrices stored in column-major order.

This amounts to the following arithmetic operations:

    C[0] += A[0]*B[0] + A[2]*B[1];
    C[1] += A[1]*B[0] + A[3]*B[1];
    C[2] += A[0]*B[2] + A[2]*B[3];
    C[3] += A[1]*B[2] + A[3]*B[3];

You are given the code sseTest.c that implements these operations in a SIMD manner. The following intrinsics are used:

__m128d _mm_loadu_pd( double *p )             returns vector (p[0], p[1])
__m128d _mm_load1_pd( double *p )             returns vector (p[0], p[0])
__m128d _mm_add_pd( __m128d a, __m128d b )    returns vector (a0+b0, a1+b1)
__m128d _mm_mul_pd( __m128d a, __m128d b )    returns vector (a0*b0, a1*b1)
void _mm_storeu_pd( double *p, __m128d a )    stores p[0]=a0, p[1]=a1
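If you get lost reading sseTest.c, the following sketch shows one way these intrinsics can implement the four updates above (it is not necessarily line-for-line identical to the supplied file):

```c
#include <emmintrin.h>  /* SSE2: __m128d and the _pd intrinsics */

/* Sketch: C = C + A*B for 2x2 column-major matrices of doubles.
   Each __m128d holds one column (two doubles) at a time. */
void mm_2x2(double *A, double *B, double *C)
{
    __m128d c1 = _mm_loadu_pd(C + 0);        /* (C[0], C[1]) */
    __m128d c2 = _mm_loadu_pd(C + 2);        /* (C[2], C[3]) */

    __m128d a  = _mm_loadu_pd(A + 0);        /* (A[0], A[1]) */
    __m128d b1 = _mm_load1_pd(B + 0);        /* (B[0], B[0]) */
    __m128d b2 = _mm_load1_pd(B + 2);        /* (B[2], B[2]) */
    c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));  /* C[0]+=A[0]*B[0], C[1]+=A[1]*B[0] */
    c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));  /* C[2]+=A[0]*B[2], C[3]+=A[1]*B[2] */

    a  = _mm_loadu_pd(A + 2);                /* (A[2], A[3]) */
    b1 = _mm_load1_pd(B + 1);                /* (B[1], B[1]) */
    b2 = _mm_load1_pd(B + 3);                /* (B[3], B[3]) */
    c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));  /* C[0]+=A[2]*B[1], C[1]+=A[3]*B[1] */
    c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));  /* C[2]+=A[2]*B[3], C[3]+=A[3]*B[3] */

    _mm_storeu_pd(C + 0, c1);
    _mm_storeu_pd(C + 2, c2);
}
```

Note how _mm_load1_pd broadcasts one element of B into both lanes, so each multiply-add handles a whole column of C.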

Your task is:
1. Compile the code using "gcc -g0 -O2 -S sseTest.c" to produce the assembly file sseTest.s. (NOTE: You can compile the .s file into binary with "gcc sseTest.s" and run as usual.)

2. Find the for-loop in sseTest.s and identify what each intrinsic is compiled into. Are they compiled into function calls? Comment the loop so that your GSI can see that you understand the code.

Note that GCC's assembly output is in AT&T syntax, which:
∗    uses "%" in front of register names (not $ as in MIPS assembly)
∗    always puts the destination on the right (e.g. movupd source, destination)
∗    wraps an operand in parentheses to denote a memory location:
       –    (%rax) uses %rax as a pointer
       –    8(%rax) uses %rax as a base and 8 as an offset
       –    (%rax,%rbx) adds %rax and %rbx to form a memory address
       –    (%rax,%rbx,4) adds %rax and 4 times %rbx to form a memory address
∗    "leaq 8(%rax), %rbx" loads address of 8(%rax), i.e. %rax+8, into %rbx
∗    %rsp is the stack pointer; %rbp is the frame pointer
∗    %rXX registers are 64-bit; the %eXX names refer to their lower 32 bits
∗    addq and addl are 64- and 32-bit versions of add respectively
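Putting these rules together, the inner loop of sseTest.s will contain instructions in roughly this style (the register choices below are illustrative, not copied from the actual compiler output):

```asm
movupd  (%rax), %xmm0       # load 128 bits (two doubles) from the address in %rax
mulpd   %xmm1, %xmm0        # %xmm0 = %xmm0 * %xmm1, elementwise
addpd   %xmm0, %xmm2        # accumulate into %xmm2
addq    $16, %rax           # advance the pointer by 16 bytes (one vector)
```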

Exercise 3 — Write some SIMD code

For Exercise 3, you will vectorize/SIMDize the following code to achieve approximately 4x speedup over the naive implementation shown here:

    int sum_naive( int n, int *a )
    {
        int sum = 0;
        for( int i = 0; i < n; i++ )
            sum += a[i];
        return sum;
    }

You might find the following intrinsics useful:

__m128i _mm_setzero_si128( )                      returns 128-bit zero vector
__m128i _mm_loadu_si128( __m128i *p )             returns 128-bit vector stored at pointer p
__m128i _mm_add_epi32( __m128i a, __m128i b )     returns vector (a0+b0, a1+b1, a2+b2, a3+b3)
void _mm_storeu_si128( __m128i *p, __m128i a )    stores 128-bit vector a at pointer p
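One step the list above does not show is combining ("horizontally summing") the four lanes of the accumulator at the end. The following sketch shows one possible overall shape — treat it as a starting point to understand and check, not as the only correct answer:

```c
#include <emmintrin.h>  /* SSE2: __m128i and the integer intrinsics */

/* Sketch of one possible sum_vectorized(): accumulate four partial
   sums in a 128-bit register, combine the lanes, then handle the tail. */
int sum_vectorized(int n, int *a)
{
    __m128i acc = _mm_setzero_si128();
    int i;

    /* main loop: four 32-bit additions per iteration */
    for (i = 0; i < n / 4 * 4; i += 4)
        acc = _mm_add_epi32(acc, _mm_loadu_si128((__m128i *)(a + i)));

    /* combine the four lanes of acc into a scalar */
    int partial[4];
    _mm_storeu_si128((__m128i *)partial, acc);
    int sum = partial[0] + partial[1] + partial[2] + partial[3];

    /* handle the leftover elements when n is not a multiple of 4 */
    for (; i < n; i++)
        sum += a[i];

    return sum;
}
```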

Your task is:
1. Start with sum.c. Use SSE intrinsics to implement the sum_vectorized() function. (Make sure to compile with the -O3 flag!)

2. Show your GSI your working code and performance improvement.

Exercise 4 — Unroll SIMDized code

Happily, you can obtain even more performance improvement! Carefully unroll the SIMD vector sum code that you created in the previous exercise. This should give you roughly another 2x increase in performance. As an example of loop unrolling, consider the supplied function sum_unrolled():

    int sum_unrolled( int n, int *a )
    {
        int sum = 0;

        /* do the body of the work in a faster unrolled loop */
        for( int i = 0; i < n/4*4; i += 4 )
        {
            sum += a[i+0];
            sum += a[i+1];
            sum += a[i+2];
            sum += a[i+3];
        }

        /* handle the small tail in a usual way */
        for( int i = n/4*4; i < n; i++ )   
            sum += a[i];

        return sum;
    }

Also, feel free to check out Wikipedia's article on loop unrolling for more information.
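Applied to the SIMD version, unrolling pays off mainly because independent accumulators break the serial dependency on a single register. The following sketch shows one possible structure for the unrolled version (an assumption about the intended shape, not the official solution):

```c
#include <emmintrin.h>  /* SSE2: __m128i and the integer intrinsics */

/* Sketch: four independent accumulators let the additions overlap
   instead of forming one long dependency chain on a single register. */
int sum_vectorized_unrolled(int n, int *a)
{
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();
    __m128i acc2 = _mm_setzero_si128();
    __m128i acc3 = _mm_setzero_si128();
    int i;

    /* sixteen elements per iteration: 4 lanes x 4 unrolled additions */
    for (i = 0; i < n / 16 * 16; i += 16) {
        acc0 = _mm_add_epi32(acc0, _mm_loadu_si128((__m128i *)(a + i + 0)));
        acc1 = _mm_add_epi32(acc1, _mm_loadu_si128((__m128i *)(a + i + 4)));
        acc2 = _mm_add_epi32(acc2, _mm_loadu_si128((__m128i *)(a + i + 8)));
        acc3 = _mm_add_epi32(acc3, _mm_loadu_si128((__m128i *)(a + i + 12)));
    }

    /* fold the four accumulators into one, then into a scalar */
    acc0 = _mm_add_epi32(_mm_add_epi32(acc0, acc1), _mm_add_epi32(acc2, acc3));
    int partial[4];
    _mm_storeu_si128((__m128i *)partial, acc0);
    int sum = partial[0] + partial[1] + partial[2] + partial[3];

    /* handle the small tail in the usual way */
    for (; i < n; i++)
        sum += a[i];

    return sum;
}
```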

Your task is:
1. Copy your sum_vectorized() code into sum_vectorized_unrolled() and unroll it four times. (Make sure to compile with the -O3 flag!)

2. Show your GSI the unrolled implementation and performance improvement.