TA: James Ferguson
Due Sunday, Oct 21, 2012 @ 23:59:59
This assignment is designed to give you some practice with caches and parallelism, especially simd.
Inside your working repository, pull the hw5 directory from ~cs61c/hw/fa12/05 (git pull ~cs61c/hw/fa12/05). Put your solutions for 1 and 3 in a file named
hw5.txt, and your solution for 2 in
count.c. It is important that you place your submission for hw5 inside this directory and not somewhere else, as when we pull submissions, we will look for your submission there. Then run these commands from inside the hw5 directory:
git add −A git commit −m “hw5 submission” git tag −f hw5 git push --tags origin master
The following C program is run (with no optimizations) on a processor with a direct-mapped cache that has eight-word blocks and holds 256 words of data:
int i,j,c,stride,array; ... for(i=0; i<10000; i++) for (j=0; j<512; j+=stride) c += i%(array[j]);
If we consider only the cache activity generated by references to the array and we assume that integers are words, what are possible miss rates (there is more than one for one of them) when
- stride = 256?
- stride = 255?
Your friend just started programming and wrote a naive count function. This function takes three arguments:
- char *arr - An array of chars.
- int len - The length of the array.
- char key - The char to be counted.
and returns the number of times that key appears in the array.
Unfortunately, the code is kind of slow, so you, eager to show off your newfound knowledge of SIMD instructions, decide to improve on the implementation.
Your task is to SIMD-ize the naive code in order to improve the running time. Your final code should gain a speedup of at least 3x over the naive version on the largest array length (but possibly much more).
In the hw5 directory that you pulled from ~cs61c/hw/fa12/05, you'll find the naive code, as well as a small testing framework.
Some things to note:
-chars are 8 bits wide, meaning they can only hold values between -256 and 255 before overflowing.
-You may (or may not) find _mm_set1_epi8, _mm_setzero_si128, and _mm_cmpeq_epi8 (among others) to be useful for this problem.
-You can run test your implementation by running "make test".
The latter 40% of a program supports operation on multiple cores. You noted the time the entire program took to complete with one cores. In the ideal case, what is the overall speedup you'd expect, running the latter portion with
- four cores rather than one?
- eight cores rather than one?
- as many as you want?
d. Why might the actual speedups be less than the ideal ones you calculated?