CS61C Spring 2012

TA: Paul Ruan

Due Sunday, March 11, @ 23:59:00

Submission

Submit a file named hw5.txt with all your answers to the following questions with submit hw5.

Problem 1: Cache Dimensions

  1. How many 32-bit integers can be stored in a byte-addressed direct-mapped cache with 15 tag bits, 15 index bits, and 2 offset bits? Write your answer as a power of 2.
  2. Assume 16-bit byte addresses. You are trying to access memory byte address 0x3434. What are the corresponding tag, index and offset for
    1. 32-line direct-mapped cache with 1 word blocks?
    2. 16-line direct-mapped cache with 4 word blocks?
    3. 1KB direct-mapped cache with 2 word blocks?
  3. What is the ratio of data bits to total bits in a 128 byte write-back direct-mapped cache that has 2-word blocks and byte addresses are 64 bits? What about if it were a write-through cache instead?

Problem 2: Cache Accesses (from P&H)

The following C program is run (with no optimizations) on a processor with a direct-mapped cache that has eight-word blocks and holds 256 words of data:

	int i,j,c,stride,array[512];
	...
	for(i=0; i<10000; i++)
	  for (j=0; j<512; j+=stride)
		c += i%(array[j]);

If we consider only the cache activity generated by references to the array and we assume that integers are words, what are possible miss rates (there is more than one for one of them) when

  1. stride = 256?
  2. stride = 255?
Explain your answers clearly in a few sentences.

Problem 3: Cache Performance

  1. Data L1 Cache with hit rate of 80% and Instruction L1 Cache with hit rate of 95%. Accessing main memory takes 100 cycles. If the ideal CPI (no cache misses) is 1 and 25% of instructions are loads and stores, what is the CPI?
  2. L1 Cache has a hit rate of 80% and a hit time of 1 cycle. L2 Cache has a local miss rate of 10% and a hit time of 4 cycles. L3 Cache has a local miss rate 5%, and a hit time of 20 cycles. Accessing main memory takes 100 cycles. Find the AMAT.

Problem 4: Thread Level Parallelism, Performance

The latter 40% of your program supports multiple threads. You noted the time your entire program took to complete with one thread. In the ideal case, what is the overall speedup you'd expect, running the latter portion with
  1. four threads rather than one?
  2. eight threads rather than one?
  3. as many as you want?
Why might the actual speedups be less than the ideal ones you calculated?