I cache
64 KB 2-way set-associative
3 clock cycles latency => 3 cycle hit time
D cache
64 KB 2-way set-associative, write-back, write-allocate
8 block write/victim buffer
3 clock cycles latency => 3 cycle hit time
64 byte block size for both I and D cache
L2 cache
16 clock cycles latency
128 bit interface to cache
=> I cache and D cache miss penalty
= latency + transfer time
= 16 + 64*8/128
= 20 cycles
Assume writing a block takes 20 cycles.
Assume LRU replacement.
Cache system B:
is the same, except that its caches are direct-mapped, write-through,
and no-write-allocate instead of 2-way set-associative, write-back,
and write-allocate.
Assume writing a word takes 10 cycles.
Assume everything hits in the L2 cache for these programs.
(a) A program which is going to run much better on cache system A than
on cache system B:
    1) Take advantage of the associativity. Be careful to overflow the
    victim buffer on B while still fitting within A's two ways plus
    victim cache.
(b) A program which is going to run much better on cache system B than
on cache system A:
    1) Take advantage of no-write-allocate and the smaller write
    granularity. Be careful to overflow A's victim cache and write
    buffer.
Write to arbitrary places that are not in the cache to trigger write
allocation, and flush out the block before any read or write to the
allocated block. Repeat this 19 times to overflow the victim buffer
and write buffer in A. (That is 8 for the victim cache, 8 for the
write buffer, and 2 for the two-way set-associative cache, plus one
for the overflow.)
Example program:
    forever
        write one word to the highest memory address that maps to
            set 1 in A
        write one word to the lowest memory address that maps to
            set 1 in A
        write one word to another, different memory address that
            maps to set 1 in A
        ....
        (total of 19 writes to different memory addresses that all
            map to set 1 in A)
    end
(c) Speedup of the program in (a) on machine A versus machine B.
On each iteration of the loop, machine B will have a miss and will
have to do a write-through to the L2 cache. Not counting the
initialization overhead,
SpeedUp = Time to execute on machine B / Time to execute on machine A
Assume the number of instructions and the clock speed are the same on
both machines, and that the loop instructions always hit in the
I-cache.
Base CPI = 0.63 (given for gcc; assumed to be for an ideal cache with
all hits in the I-cache and D-cache)
Effective CPI on A = 0.63 + (D-cache memory accesses/Instruction *
    Miss Rate * Miss Penalty)
In the steady state of the loop the miss rate is zero, since the
D-cache gets no misses:
Eff CPI of A = 0.63 + (D-cache memory accesses/Instruction * 0 *
    Miss Penalty)
             = 0.63
Effective CPI on B = Base CPI + (D-cache memory accesses/Instruction *
    (read Miss Rate * read Miss Penalty + write rate * write penalty))
Again assume that in the steady state of the loop there are no I-cache
misses; the D-cache, however, misses 4 times on every iteration of the
loop. Assuming the loop can be done in 6 instructions,
    D-cache memory accesses/Instruction = 2/3
    read Miss Rate = 1/2, since it will miss once every 2 read
        accesses in the loop
    Miss Penalty = 20 cycles
    write rate = 1/2
    write penalty = 10 cycles
Effective CPI on B = 0.63 + (0.66 * (1/2 * 20 + 1/2 * 10)) = 10.53
(Looking at the equation, an even higher CPI could be achieved by
dropping the write instructions and replacing them with read
instructions that cause more misses.)
Time to execute = (Instr. Count * Eff. CPI) / Clock Rate
Since IC and Clock Rate are assumed to be the same,
Speedup = Eff. CPI B / Eff. CPI A = 10.53/0.63 ≈ 16.7 times
(d) Speedup of the program in (b) on machine B versus machine A.
Similar to (c):
SpeedUp = Time to execute on machine A / Time to execute on machine B
Assume 13 instructions per loop iteration.
Common to A and B:
    Base CPI = 0.63 (given for gcc; assumed to be for an ideal cache
    with all hits in the I-cache and D-cache)
Effective CPI = Base CPI + (D-cache memory accesses/Instruction *
    Write Miss Rate * Write Miss Penalty)
D-cache accesses/instruction = 0.85 (11 accesses for every 13
    instructions in the loop)
Write Miss Rate = 1 (since we kick out every block we need for doing
    the calculations)
Write Miss Penalty for A = 20 + 20 = 40 cycles, since it has to do a
    read allocate and a write back every time it misses.
Write Miss Penalty for B = 10 cycles, the penalty for writing a word.
Effective CPI of B = 0.63 + (0.85 * 1 * 10) = 9.13
Effective CPI of A = 0.63 + (0.85 * 1 * 40) = 34.63
Since IC and clock rate are the same, the speedup of B over A
= 34.63/9.13 ≈ 3.8 times.
prefetch (a[0][0]);
...
prefetch (a[2][99]);
prefetch (b[0][0]);
...
prefetch (b[100][0]);
for (i = 0; i < 3; i = i+1)
    for (j = 0; j < 100; j = j+1)
        a[i][j] = b[j][0] * b[j+1][0];
This solution will take 2100 clock cycles (loop) + 401 clock cycles
(prefetch) = 2501 clock cycles.
Realistically, a compiler may not do this because the generated code
is too big. This can be a problem especially in embedded systems,
where instructions are stored in ROM and we want to minimize ROM size
to minimize system cost. To decrease code size, we can write a
separate loop for the prefetch section.
for (j = 0; j < 100; j = j+1) {
    prefetch a[0][j];
    prefetch a[1][j];
    prefetch a[2][j];
    prefetch b[j][0];
}
prefetch b[100][0];
for (i = 0; i < 3; i = i+1)
    for (j = 0; j < 100; j = j+1)
        a[i][j] = b[j][0] * b[j+1][0];
This will cost us some additional clock cycles because of the loop overhead.