I cache
64 KB 2-way set-associative
3 clock cycles latency => 3 cycle hit time
D cache
64 KB 2-way set-associative, write-back, write-allocate
8 block write/victim buffer
3 clock cycles latency => 3 cycle hit time
64 byte block size for both I and D cache
L2 cache
16 clock cycles latency
128 bit interface to cache
=> I cache and D cache miss penalty
= latency + transfer time
= 16 + 64*8/128
= 20 cycles
Assume writing a block takes 20 cycles.
Assume LRU replacement.
Cache system B:
is the same, except that its caches are direct-mapped, write-through,
and no-write-allocate instead of 2-way set-associative, write-back,
and write-allocate.
Assume writing a word takes 10 cycles.
Assume everything hits in the L2 cache for these programs.
(a) A program which is going to run much better on cache system A than
on cache system B:
    1) Take advantage of the associativity. Be careful to overflow the
    victim buffer on B while still fitting within A's two ways plus
    victim cache.
(b) A program which is going to run much better on cache system B than
on cache system A:
    1) Take advantage of no-write-allocate and the smaller write
    granularity. Be careful to overflow A's victim cache and write
    buffer.
Write to arbitrary places that are not in the cache to trigger write
allocation, and flush out the block before any read or write to the
allocated block. Repeat this 19 times to overflow the victim buffer
and write buffer in A. (That is 8 for the victim cache, 8 for the
write buffer, and 2 for the two-way set-associative cache, plus one
for the overflow.)
Example program:
    forever
        write one word to the highest memory address that maps to
            set 1 in A
        write one word to the lowest memory address that maps to
            set 1 in A
        write one word to another, different memory address that
            maps to set 1 in A
        ....
        (total of 19 writes to different memory addresses that all
            map to set 1 in A)
    end
(c) Speedup of the program in (a) on machine A versus machine B.
On each iteration of the loop, machine B will have a miss and will
have to do a write-through to the L2 cache. Not counting the
initialization overhead,
SpeedUp = Time to execute on machine B / Time to execute on machine A
Assume the number of instructions and the clock speed are the same on
both machines, and that the loop instructions always hit in the
I-cache.
Base CPI = 0.63 (given for gcc; assumed to be for an ideal cache with
all hits in the I-cache and D-cache)
Effective CPI on A = 0.63 + (D-cache memory accesses/Instruction *
    Miss Rate * Miss Penalty)
In the steady state of the loop the miss rate is zero, since the
D-cache gets no misses:
Eff CPI of A = 0.63 + (D-cache memory accesses/Instruction * 0 *
    Miss Penalty)
             = 0.63
Effective CPI on B = Base CPI + (D-cache memory accesses/Instruction *
    (read Miss Rate * read Miss Penalty + write rate * write penalty))
Again assume that in the steady state of the loop there are no I-cache
misses; the D-cache, however, misses 4 times on every iteration of the
loop. Assuming the loop can be done in 6 instructions,
    D-cache memory accesses/Instruction = 2/3
    read Miss Rate = 1/2, since it will miss once every 2 read
        accesses in the loop
    Miss Penalty = 20 cycles
    write rate = 1/2
    write penalty = 10 cycles
Effective CPI on B = 0.63 + (0.66 * (1/2 * 20 + 1/2 * 10)) = 10.53
(Looking at the equation, an even higher CPI could be achieved by
dropping the write instructions and replacing them with read
instructions that cause more misses.)
Time to execute = (Instr. Count * Eff. CPI) / Clock Rate
Since IC and Clock Rate are assumed to be the same,
Speedup = Eff. CPI B / Eff. CPI A = 10.53/0.63 ≈ 16.7 times
(d) Speedup of the program in (b) on machine B versus machine A.
Similar to (c):
SpeedUp = Time to execute on machine A / Time to execute on machine B
Assume 13 instructions per loop iteration.
Common to A and B:
    Base CPI = 0.63 (given for gcc; assumed to be for an ideal cache
    with all hits in the I-cache and D-cache)
Effective CPI = Base CPI + (D-cache memory accesses/Instruction *
    Write Miss Rate * Write Miss Penalty)
D-cache accesses/instruction = 0.85 (11 accesses for every 13
    instructions in the loop)
Write Miss Rate = 1 (since we kick out every block we need for doing
    the calculations)
Write Miss Penalty for A = 20 + 20 = 40 cycles, since it has to do a
    read allocate and a write back every time it misses.
Write Miss Penalty for B = 10 cycles, the penalty for writing a word.
Effective CPI of B = 0.63 + (0.85 * 1 * 10) = 9.13
Effective CPI of A = 0.63 + (0.85 * 1 * 40) = 34.63
Since IC and clock rate are the same, the speedup of B over A
= 34.63/9.13 ≈ 3.8 times.
prefetch (a[0][0]);
...
prefetch (a[2][99]);
prefetch (b[0][0]);
...
prefetch (b[100][0]);
for (i = 0; i < 3; i = i+1)
    for (j = 0; j < 100; j = j+1)
        a[i][j] = b[j][0] * b[j+1][0];
This solution will take 2100 clock cycles (loop) + 401 clock cycles
(prefetch) = 2501 clock cycles.
Realistically, a compiler may not do this because the generated code
is too big. This can be a problem especially in embedded systems,
where instructions are stored in ROM and we want to minimize ROM size
to minimize system cost. To decrease code size, we can write a
separate loop for the prefetch section.
for (j = 0; j < 100; j = j+1) {
    prefetch a[0][j];
    prefetch a[1][j];
    prefetch a[2][j];
    prefetch b[j][0];
}
prefetch b[100][0];
for (i = 0; i < 3; i = i+1)
    for (j = 0; j < 100; j = j+1)
        a[i][j] = b[j][0] * b[j+1][0];
This will cost us some additional clock cycles because of the loop overhead.