











# Memory Caching

- Mismatch between processor and memory speeds leads us to add a new level: a memory cache
- Implemented with same IC processing technology as the CPU (usually integrated on same chip): faster but more expensive than DRAM memory.
- Cache is a copy of a subset of main memory.
- Most processors have separate caches for instructions and data.


# Memory Hierarchy

- If a level is closer to the processor, it is:
  - Smaller
  - Faster
  - More expensive
  - A subset of lower levels (contains most recently used data)
- Lowest level (usually disk) contains all available data (does it go beyond the disk?)
- Memory hierarchy presents the processor with the illusion of a very large & fast memory

## Memory Hierarchy Analogy: Library (1/2)

- You're writing a term paper (Processor) at a table in Doe
- Doe Library is equivalent to disk
  - essentially limitless capacity
  - very slow to retrieve a book
- Table is main memory
  - smaller capacity: means you must return a book when the table fills up
  - easier and faster to find a book there once you've already retrieved it

## Memory Hierarchy Analogy: Library (2/2)

- Open books on table are cache
  - smaller capacity: can have very few open books fit on table; again, when table fills up, you must close a book
  - much, much faster to retrieve data
- Illusion created: whole library open on the tabletop
  - Keep as many recently used books open on table as possible since likely to use again
  - Also keep as many books on table as possible, since faster than going to library

# Memory Hierarchy Basis

- Cache contains copies of data in memory that are being used.
- Memory contains copies of data on disk that are being used.
- Caches work on the principles of temporal and spatial locality.
  - Temporal Locality: if we use it now, chances are we'll want to use it again soon.
  - Spatial Locality: if we use a piece of memory, chances are we'll use the neighboring pieces soon.

Huddleston, Summer 2009 © UCB

CS61C L11 Caches


# Direct-Mapped Cache (1/4)

- In a direct-mapped cache, each memory address is associated with one possible block within the cache
  - Therefore, we only need to look in a single location in the cache for the data if it exists in the cache
  - Block is the unit of transfer between cache and memory

# Cache Design

- How do we organize cache?
- Where does each memory address map to?
  - (Remember that cache is subset of memory, so multiple memory addresses map to the same cache location.)
- How do we know which elements are in cache?
- How do we quickly locate them?

## Administrivia


- Project 4 (on Caches) will be in optional groups of two.
- Jeremy's OH today canceled
  - I will have OH on Friday, time will be posted on the newsgroup
- HW7 due tomorrow
- You MUST have a discussion with your TA in lab tomorrow for credit







# Issues with Direct-Mapped

- Since multiple memory addresses map to same cache index, how do we tell which one is in there?
- What if we have a block size > 1 byte?
- Answer: divide memory address into three fields

| tttttttttttttt | iiiiiiiii | oooo |
|----------------|-----------|------|
| tag: to check if we have the correct block | index: to select block | byte offset within block |

# **Direct-Mapped Cache Terminology**

- All fields are read as unsigned integers.
- Index
  - specifies the cache index (which "row"/block of the cache we should look in)
- Offset
  - once we've found correct block, specifies which byte within the block we want
- Tag
  - the remaining bits after offset and index are determined; these are used to distinguish between all the memory addresses that map to the same location
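The three fields can be extracted with shifts and masks. The following is a minimal sketch, not a definitive implementation: the function name `split_address` and its parameters are hypothetical, and the 10 index bits and 4 offset bits in the usage line are assumed from the 16 KB, 4-word-block example used elsewhere in these slides.

```python
def split_address(addr, index_bits, offset_bits, addr_bits=32):
    """Split an address into (tag, index, offset), each an unsigned integer.

    Hypothetical helper for illustration; field widths are supplied by the caller.
    """
    addr &= (1 << addr_bits) - 1                      # keep only addr_bits bits
    offset = addr & ((1 << offset_bits) - 1)          # lowest bits: byte within block
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)  # middle bits: cache row
    tag = addr >> (offset_bits + index_bits)          # everything that remains
    return tag, index, offset


# Assumed geometry: 10 index bits, 4 offset bits (16 KB of data, 16-byte blocks).
print(split_address(0x00008014, index_bits=10, offset_bits=4))  # (2, 1, 4)
```

Because the tag is simply "whatever is left", changing the index or offset width never requires touching the tag computation.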


# Direct-Mapped Cache Example (3/3)

- Tag: use remaining bits as tag
  - tag length = addr length - offset - index
    - = 32 - 1 - 2 bits
    - = 29 bits
  - so tag is leftmost 29 bits of memory address
- Why not full 32 bit address as tag?
  - All bytes within a block need the same tag, so the offset bits are left out (here 1 bit)
  - Index must be same for every address within a block, so it's redundant in tag check, thus can leave off to save memory (here 2 bits)



# Direct-Mapped Cache Example (1/3)

- Suppose we have 8 B of data in a direct-mapped cache with 2-byte blocks
  - Sound familiar?
- Determine the size of the tag, index and offset fields if we're using a 32-bit architecture
- Offset
  - need to specify correct byte within a block
  - block contains 2 bytes = 2<sup>1</sup> bytes
  - need 1 bit to specify correct byte


# Caching Terminology

- When reading memory, 3 things can happen:
  - cache hit: cache block is valid and contains proper address, so read desired word
  - cache miss: nothing in cache in appropriate block, so fetch from memory
  - cache miss, block replacement: wrong data is in cache at appropriate block, so discard it and fetch desired data from memory (the cache is always a copy)



# Direct-Mapped Cache Example (2/3)

- Index: (~index into an "array of blocks")
  - need to specify correct block in cache
  - cache contains 8 B = 2<sup>3</sup> bytes
  - block contains 2 B = 2<sup>1</sup> bytes
  - number of blocks/cache
    - = (bytes/cache) / (bytes/block)
    - = (2<sup>3</sup> bytes/cache) / (2<sup>1</sup> bytes/block)
    - = 2<sup>2</sup> blocks/cache
  - need 2 bits to specify this many blocks
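The field-size arithmetic for this example can be checked mechanically. This is a small sketch under the assumption that cache and block sizes are powers of two; `field_widths` is a hypothetical name, not part of the course material.

```python
def field_widths(cache_bytes, block_bytes, addr_bits=32):
    """Return (tag_bits, index_bits, offset_bits) for a direct-mapped cache.

    Assumes cache_bytes and block_bytes are powers of two.
    """
    for size in (cache_bytes, block_bytes):
        if size <= 0 or size & (size - 1):
            raise ValueError("sizes must be powers of two")
    offset_bits = (block_bytes - 1).bit_length()   # log2(bytes per block)
    num_blocks = cache_bytes // block_bytes        # blocks per cache
    index_bits = (num_blocks - 1).bit_length()     # log2(blocks per cache)
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


print(field_widths(8, 2))            # (29, 2, 1): the 8 B cache with 2 B blocks
print(field_widths(16 * 1024, 16))   # (18, 10, 4): 16 KB of data, 4-word blocks
```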

























4. Read 0x00008014 = 000000000000000010 0000000001 0100 (tag field, index field, offset)

| Index | Valid | Tag | 0xc-f | 0x8-b | 0x4-7 | 0x0-3 |
|-------|-------|-----|-------|-------|-------|-------|
| 0     | 0     |     |       |       |       |       |
| 1     | 1     | 0   | d     | c     | b     | a     |
| 2     | 0     |     |       |       |       |       |
| 3     | 1     | 0   | h     | g     | f     | e     |
| 4     | 0     |     |       |       |       |       |
| 5     | 0     |     |       |       |       |       |
| 6     | 0     |     |       |       |       |       |
| 7     | 0     |     |       |       |       |       |
| ...   |       |     |       |       |       |       |
| 1022  | 0     |     |       |       |       |       |
| 1023  | 0     |     |       |       |       |       |

Cache Block 1 Tag does not match (0 != 2)

Address 0x00008014: tag field 000000000000000010, index field 0000000001, offset 0100

| Index | Valid | Tag | 0xc-f | 0x8-b | 0x4-7 | 0x0-3 |
|-------|-------|-----|-------|-------|-------|-------|
| 0     | 0     |     |       |       |       |       |
| 1     | 1     | 0   | d     | c     | b     | a     |
| 2     | 0     |     |       |       |       |       |
| 3     | 1     | 0   | h     | g     | f     | e     |
| 4     | 0     |     |       |       |       |       |
| 5     | 0     |     |       |       |       |       |
| 6     | 0     |     |       |       |       |       |
| 7     | 0     |     |       |       |       |       |
| ...   |       |     |       |       |       |       |
| 1022  | 0     |     |       |       |       |       |
| 1023  | 0     |     |       |       |       |       |

# What to do on a write hit?

- Write-through
  - update the word in cache block and corresponding word in memory
- Write-back
  - update word in cache block
  - allow memory word to be "stale"
  - ⇒ add 'dirty' bit to each block indicating that memory needs to be updated when block is replaced
  - ⇒ OS flushes cache before I/O...
- Performance trade-offs?
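One way to get a feel for the trade-off is to count how many writes actually reach memory under each policy. The sketch below is illustrative only: `TinyCache` and `count_memory_writes` are hypothetical names, the cache is direct-mapped with one-word blocks, and allocating a block on a write miss is an assumption (the slides only cover write hits).

```python
class TinyCache:
    """Direct-mapped cache with one-word blocks that counts writes reaching memory."""

    def __init__(self, num_blocks, write_back):
        self.num_blocks = num_blocks
        self.write_back = write_back
        self.lines = {}            # index -> [tag, dirty]; absent means valid bit off
        self.memory_writes = 0

    def write(self, word_addr):
        index = word_addr % self.num_blocks
        tag = word_addr // self.num_blocks
        line = self.lines.get(index)
        if line is None or line[0] != tag:
            # Write miss: a dirty victim must be written to memory before replacement.
            if line is not None and line[1]:
                self.memory_writes += 1
            line = [tag, False]    # assumption: allocate the block on a write miss
            self.lines[index] = line
        if self.write_back:
            line[1] = True         # memory copy is now stale; remember via dirty bit
        else:
            self.memory_writes += 1  # write-through: update memory on every write

    def flush(self):
        """Write all dirty blocks to memory (e.g. before I/O)."""
        for line in self.lines.values():
            if line[1]:
                self.memory_writes += 1
                line[1] = False


def count_memory_writes(trace, write_back, num_blocks=4):
    cache = TinyCache(num_blocks, write_back)
    for word_addr in trace:
        cache.write(word_addr)
    cache.flush()
    return cache.memory_writes


trace = [0, 0, 0, 4]   # three writes to word 0, then a conflicting write to word 4
print(count_memory_writes(trace, write_back=False))  # 4
print(count_memory_writes(trace, write_back=True))   # 2
```

Repeated writes to the same block are where write-back saves memory traffic; the cost is the dirty bit and the extra work at replacement or flush time.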

Miss, so replace block 1 with new data & tag

Address 0x00008014: tag field 000000000000000010, index field 0000000001, offset 0100

| Index | Valid | Tag | 0xc-f | 0x8-b | 0x4-7 | 0x0-3 |
|-------|-------|-----|-------|-------|-------|-------|
| 0     | 0     |     |       |       |       |       |
| 1     | 1     | 2   | l     | k     | j     | i     |
| 2     | 0     |     |       |       |       |       |
| 3     | 1     | 0   | h     | g     | f     | e     |
| 4     | 0     |     |       |       |       |       |
| 5     | 0     |     |       |       |       |       |
| 6     | 0     |     |       |       |       |       |
| 7     | 0     |     |       |       |       |       |
| ...   |       |     |       |       |       |       |
| 1022  | 0     |     |       |       |       |       |
| 1023  | 0     |     |       |       |       |       |

# Types of Cache Misses (1/2)

- "Three Cs" Model of Misses
- 1<sup>st</sup> C: Compulsory Misses
  - occur when a program is first started
  - cache does not contain any of that program's data yet, so misses are bound to occur
  - can't be avoided easily, so won't focus on these in this course

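So read Cache Block 1, Data is Valid

| Index | Valid | Tag | 0xc-f | 0x8-b | 0x4-7 | 0x0-3 |
|-------|-------|-----|-------|-------|-------|-------|
| 0     | 0     |     |       |       |       |       |
| 1     | 1     | 0   | d     | c     | b     | a     |
| 2     | 0     |     |       |       |       |       |
| 3     | 1     | 0   | h     | g     | f     | e     |
| 4     | 0     |     |       |       |       |       |
| 5     | 0     |     |       |       |       |       |
| 6     | 0     |     |       |       |       |       |
| 7     | 0     |     |       |       |       |       |
| ...   |       |     |       |       |       |       |
| 1022  | 0     |     |       |       |       |       |
| 1023  | 0     |     |       |       |       |       |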

And return word j

Address 0x00008014: tag field 000000000000000010, index field 0000000001, offset 0100

| Index | Valid | Tag | 0xc-f | 0x8-b | 0x4-7 | 0x0-3 |
|-------|-------|-----|-------|-------|-------|-------|
| 0     | 0     |     |       |       |       |       |
| 1     | 1     | 2   | l     | k     | j     | i     |
| 2     | 0     |     |       |       |       |       |
| 3     | 1     | 0   | h     | g     | f     | e     |
| 4     | 0     |     |       |       |       |       |
| 5     | 0     |     |       |       |       |       |
| 6     | 0     |     |       |       |       |       |
| 7     | 0     |     |       |       |       |       |
| ...   |       |     |       |       |       |       |
| 1022  | 0     |     |       |       |       |       |
| 1023  | 0     |     |       |       |       |       |

# Types of Cache Misses (2/2)

- 2<sup>nd</sup> C: Conflict Misses
  - miss that occurs because two distinct memory addresses map to the same cache location
  - two blocks (which happen to map to the same location) can keep overwriting each other
  - big problem in direct-mapped caches
  - how do we lessen the effect of these?
- Dealing with Conflict Misses
  - Solution 1: Make the cache size bigger
    - Fails at some point
  - Solution 2: Multiple distinct blocks can fit in the same cache Index?



# Fully Associative Cache (1/3)

- Memory address fields:
  - Tag: same as before
  - Offset: same as before
  - Index: non-existent
- What does this mean?
  - no "rows": any block can go anywhere in the cache
  - must compare with all tags in entire cache to see if data is there

# Final Type of Cache Miss

- 3<sup>rd</sup> C: Capacity Misses
  - miss that occurs because the cache has a limited size
  - miss that would not occur if we increase the size of the cache
  - sketchy definition, so just get the general idea
- This is the primary type of miss for Fully Associative caches.

# N-Way Set Associative Cache (2/3)

### Basic Idea

- cache is direct-mapped w/respect to sets
- each set is fully associative with N blocks in it
- Given memory address:
  - Find correct set using Index value.
  - Compare Tag with all Tag values in the determined set.
  - If a match occurs, hit! Otherwise, a miss.
  - Finally, use the offset field as usual to find the desired data within the block.

# Fully Associative Cache (2/3)

- Fully Associative Cache (e.g., 32 B block)
  - compare tags in parallel



# N-Way Set Associative Cache (1/3)

- Tag: same as before
- Offset: same as before
- Index: points us to the correct "row" (called a set in this case)
- So what's the difference?
  - each set contains multiple blocks
  - once we've found correct set, must compare with all tags in that set to find our data

# N-Way Set Associative Cache (3/3)

- What's so great about this?
  - even a 2-way set assoc cache avoids a lot of conflict misses
  - hardware cost isn't that bad: only need N comparators
- In fact, for a cache with M blocks,
  - it's Direct-Mapped if it's 1-way set assoc
  - it's Fully Assoc if it's M-way set assoc
  - so these two are just special cases of the more general set associative design
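The "special cases" observation can be made concrete by computing the number of sets and index bits for a fixed number of blocks M as N varies. This is a rough sketch; `set_assoc_geometry` is a hypothetical helper, and it assumes M is divisible by N and the set count is a power of two.

```python
def set_assoc_geometry(num_blocks, ways):
    """Return (num_sets, index_bits, comparators) for an N-way set associative cache."""
    if ways <= 0 or num_blocks % ways != 0:
        raise ValueError("number of blocks must be a multiple of the associativity")
    num_sets = num_blocks // ways
    index_bits = (num_sets - 1).bit_length()   # log2(num_sets) for powers of two
    comparators = ways                         # one tag comparator per block in a set
    return num_sets, index_bits, comparators


M = 8  # assumed example size: a cache with 8 blocks
print(set_assoc_geometry(M, 1))   # (8, 3, 1): direct-mapped, one block per set
print(set_assoc_geometry(M, 2))   # (4, 2, 2): 2-way set associative
print(set_assoc_geometry(M, M))   # (1, 0, 8): fully associative, no index field
```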

# Fully Associative Cache (3/3)

- Benefit of Fully Assoc Cache
  - No Conflict Misses (since data can go anywhere)
- Drawbacks of Fully Assoc Cache
  - Need hardware comparator for every single entry: if we have 64 KB of data in cache with 4 B entries, we need 16K comparators: infeasible


# Associative Cache Example

Here's a simple 2-way set associative cache.

[Figure: memory addresses mapped to cache indices in a 2-way set associative cache]



- Memory address fields:


# **Block Replacement Policy**

- Direct-Mapped Cache
  - index completely specifies which position a block can go in on a miss
- N-Way Set Assoc
  - index specifies a set, but block can occupy any position within the set on a miss
- Fully Associative
  - block can be written into any position
- Question: if we have the choice, where should we write an incoming block?
  - If there are any locations with valid bit off (empty), then usually write the new block into the first one.
  - If all possible locations already have a valid block, we must pick a replacement policy: rule by which we determine which block gets "cached out" on a miss.



# And in Conclusion...

- We would like to have the capacity of disk at the speed of the processor: unfortunately this is not feasible.
- So we create a memory hierarchy:
  - each successively higher level contains "most used" data from next lower level
  - exploits temporal & spatial locality
- do the common case fast, worry less about the exceptions (design principle of MIPS)
- Locality of reference is a Big Idea

# Block Replacement Policy: LRU

- LRU (Least Recently Used)
  - Idea: cache out block which has been accessed (read or write) least recently
  - Pro: temporal locality ⇒ recent past use implies likely future use: in fact, this is a very effective policy
  - Con: with 2-way set assoc, easy to keep track (one LRU bit); with 4-way or greater, requires complicated hardware and much time to keep track of this

## **Block Replacement Example**

We have a 2-way set associative cache with a four-word total capacity and one-word blocks. We perform the following word accesses (ignore bytes for this problem):

0, 2, 0, 1, 4, 0, 2, 3, 5, 4

How many hits and how many misses will there be for the LRU block replacement policy?
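One way to check an answer to this question is to simulate the cache. The sketch below is a software model only, assuming word addresses map to sets by `address mod number_of_sets`; `simulate_lru` is a hypothetical name, and an `OrderedDict` stands in for the LRU bookkeeping that real hardware does with LRU bits.

```python
from collections import OrderedDict


def simulate_lru(accesses, num_sets, ways):
    """Count (hits, misses) for a set associative cache with one-word blocks and LRU."""
    # Each set maps tag -> True, ordered from least to most recently used.
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = misses = 0
    for word_addr in accesses:
        blocks = sets[word_addr % num_sets]   # index selects the set
        tag = word_addr // num_sets           # remaining bits form the tag
        if tag in blocks:
            hits += 1
            blocks.move_to_end(tag)           # mark as most recently used
        else:
            misses += 1
            if len(blocks) == ways:
                blocks.popitem(last=False)    # set is full: evict least recently used
            blocks[tag] = True
    return hits, misses


# Four words total, 2-way, one-word blocks => 2 sets.
print(simulate_lru([0, 2, 0, 1, 4, 0, 2, 3, 5, 4], num_sets=2, ways=2))  # (2, 8)
```

Only the two re-accesses of word 0 hit; every other access either fills an empty slot or evicts the least recently used block in its set.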

## Big Idea

- How to choose between associativity, block size, replacement & write policy?
- Design against a performance model
  - Minimize: Average Memory Access Time = Hit Time + Miss Penalty x Miss Rate
  - influenced by technology & program behavior
- Create the illusion of a memory that is large, cheap, and fast - on average
- How can we improve miss penalty?

# Improving Miss Penalty

- Solution: another cache between memory and the processor cache: Second Level (L2) Cache

- Mechanism for transparent movement of data among levels of a storage hierarchy
  - set of address/value bindings
  - address ⇒ index to set of candidates
  - compare desired address with tag
  - service hit or miss
    - load new block and binding on miss

[Figure: an address split into tag, index, and offset fields, selecting a row of a cache table with Valid, Tag, and data columns 0xc-f, 0x8-b, 0x4-7, 0x0-3]

# And in Conclusion...

- We've discussed memory caching in detail. Caching in general shows up over and over in computer systems
  - Filesystem cache, Web page cache, Game databases / tablebases, Software memoization, Others?
- Big idea: if something is expensive but we want to do it repeatedly, do it once and cache the result.
- Cache design choices:
  - Size of cache: speed v. capacity
  - Block size (i.e., cache aspect ratio)
  - Write Policy (Write through v. write back)
  - Associativity choice of N (direct-mapped v. set v. fully associative)
  - Block replacement policy
  - 2nd level cache?
  - 3rd level cache?
- Use performance model to pick between choices, depending on programs, technology, budget, ...



# Bonus slides

- These are extra slides that used to be included in lecture notes, but have been moved to this, the "bonus" area to serve as a supplement.
- The slides will appear in the order they would have in the normal presentation





# Block Size Tradeoff (1/3)


## Benefits of Larger Block Size

- Spatial Locality: if we access a given word, we're likely to access other nearby words soon
- Very applicable with Stored-Program Concept: if we execute a given instruction, it's likely that we'll execute the next few as well
- Works nicely in sequential array accesses too





# Accessing data in a direct mapped cache

- Ex.: 16KB of data, direct-mapped, 4 word blocks
  - Can you work out height, width, area?
- Read 4 addresses
  1. 0x00000014
  2. 0x0000001C
  3. 0x00000034
  4. 0x00008014
- Memory vals here:

| Memory Address (hex) | Value of Word |
|----------------------|---------------|
| 00000010             | a             |
| 00000014             | b             |
| 00000018             | c             |
| 0000001C             | d             |
| ...                  | ...           |
| 00000030             | e             |
| 00000034             | f             |
| 00000038             | g             |
| 0000003C             | h             |
| ...                  | ...           |
| 00008010             | i             |
| 00008014             | j             |
| 00008018             | k             |
| 0000801C             | l             |
| ...                  | ...           |

# Answers

- 0x00000030 a hit
  - Index = 3, Tag matches, Offset = 0, value = e
- 0x0000001c a miss
  - Index = 1, Tag mismatch, so replace from memory, Offset = 0xc, value = d
- Since reads, values must = memory values whether or not cached:
  - 0x00000030 = e
  - 0x0000001c = d

| Memory Address (hex) | Value of Word |
|----------------------|---------------|
| 00000010             | a             |
| 00000014             | b             |
| 00000018             | c             |
| 0000001C             | d             |
| ...                  | ...           |
| 00000030             | e             |
| 00000034             | f             |
| 00000038             | g             |
| 0000003C             | h             |
| ...                  | ...           |
| 00008010             | i             |
| 00008014             | j             |
| 00008018             | k             |
| 0000801C             | l             |
| ...                  | ...           |
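The four reads from the example, followed by the two reads answered above, can be replayed in a small model of the direct-mapped cache. This is a sketch under stated assumptions: `simulate_direct_mapped` is a hypothetical name, the cache starts empty, and the 10 index bits and 4 offset bits follow from 16 KB of data with 4-word (16-byte) blocks.

```python
def simulate_direct_mapped(addresses, index_bits=10, offset_bits=4):
    """Classify each read as a hit, a miss, or a miss with block replacement."""
    tags = {}        # index -> tag of the valid block; absent means valid bit off
    outcomes = []
    for addr in addresses:
        index = (addr >> offset_bits) & ((1 << index_bits) - 1)
        tag = addr >> (offset_bits + index_bits)
        if tags.get(index) == tag:
            outcomes.append("hit")
        elif index in tags:
            outcomes.append("miss, block replacement")   # valid block, wrong tag
        else:
            outcomes.append("miss")                      # nothing valid at this index
        tags[index] = tag                                # block is now cached
    return outcomes


reads = [0x00000014, 0x0000001C, 0x00000034, 0x00008014, 0x00000030, 0x0000001C]
for addr, outcome in zip(reads, simulate_direct_mapped(reads)):
    print(f"0x{addr:08x}: {outcome}")
```

The last two lines of output correspond to the answers: 0x00000030 hits in block 3, while 0x0000001c misses because block 1 now holds the data with tag 2.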

# Block Size Tradeoff (2/3)

- Drawbacks of Larger Block Size
  - Larger block size means larger miss penalty
    - on a miss, takes longer time to load a new block from next level
  - If block size is too big relative to cache size, then there are too few blocks
    - Result: miss rate goes up
- In general, minimize Average Memory Access Time (AMAT) = Hit Time + Miss Penalty x Miss Rate

# Block Size Tradeoff (3/3)

- Hit Time
  - time to find and retrieve data from current level cache
- Miss Penalty
  - average time to retrieve data on a current level miss (includes the possibility of misses on successive levels of memory hierarchy)
- Hit Rate
  - % of requests that are found in current level cache







- Assume
  - Hit Time = 1 cycle
  - Miss rate = 5%
  - Miss penalty = 20 cycles
  - Calculate AMAT...
- Avg mem access time
  - $= 1 + 0.05 \times 20$
  - = 1 + 1 cycles
  - = 2 cycles


## Example: with L2 cache

- Assume
  - L1 Hit Time = 1 cycle
  - L1 Miss rate = 5%
  - L2 Hit Time = 5 cycles
  - L2 Miss rate = 15% (% L1 misses that miss)
  - L2 Miss Penalty = 200 cycles
- L1 miss penalty = 5 + 0.15 \* 200 = 35
- Avg mem access time = 1 + 0.05 x 35 = 2.75 cycles
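The two-level calculation is the same AMAT formula applied twice: the L1 miss penalty is itself the AMAT of the L2 cache. A minimal sketch (the function name `amat` is a hypothetical helper) using the numbers assumed in this example:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty (in cycles)."""
    return hit_time + miss_rate * miss_penalty


# L2 sits between L1 and memory, so an L1 miss costs one L2 access on average.
l1_miss_penalty = amat(hit_time=5, miss_rate=0.15, miss_penalty=200)   # about 35 cycles
with_l2 = amat(hit_time=1, miss_rate=0.05, miss_penalty=l1_miss_penalty)

# Without L2, every L1 miss pays the full 200-cycle trip to memory.
without_l2 = amat(hit_time=1, miss_rate=0.05, miss_penalty=200)

print(f"with L2:    {with_l2:.2f} cycles")
print(f"without L2: {without_l2:.2f} cycles")
print(f"speedup:    {without_l2 / with_l2:.1f}x")
```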





# Typical Scale

- L1
  - size: tens of KB
  - hit time: complete in one clock cycle
  - miss rates: 1-5%
- L2:
  - size: hundreds of KB
  - hit time: few clock cycles
  - miss rates: 10-20%
- L2 miss rate is fraction of L1 misses that also miss in L2
  - why so high?

- Cache
  - 32 KB Instructions and 32 KB Data L1 caches
  - External L2 Cache interface with integrated controller and cache tags, supports up to 1 MByte external L2 cache
- Dual Memory Management Units (MMU) with Translation Lookaside Buffers (TLB)
- Pipelining

[Figure: processor die photo with labeled regions including Data Cache, Data Tags, L2 Cache Tags, Bus Interface, and Instruction Cache]

# Ways to reduce miss rate

- Larger cache
  - limited by cost and technology
  - hit time of first level cache < cycle time (bigger caches are slower)
- More places in the cache to put each block of memory: associativity
  - fully-associative
    - any block any line
  - N-way set associated
    - N places for each block
    - direct map: N=1

## Example: without L2 cache

- Assume
  - L1 Hit Time = 1 cycle
  - L1 Miss rate = 5%
  - L1 Miss Penalty = 200 cycles
- Avg mem access time = 1 + 0.05 x 200 = 11 cycles
- 4x faster with L2 cache! (2.75 vs. 11)

