# inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures

#### Lecture #21 Caches I



2005-11-14

There is one handout today at the front and back of the room!

Lecturer PSOE, new dad Dan Garcia

www.cs.berkeley.edu/~ddgarcia

The matrix...reality? ⇒ MIT neuroscientists can now decipher part of the code involved in recognizing visual objects! A "classifier" was used on the monkey brain signals.

www.physorg.com/news7879.html

CS61C L21 Caches I (1)



#### Review

- Pipeline challenge is hazards
  - Forwarding helps w/many data hazards
  - Delayed branch helps with control hazard in 5 stage pipeline
- More aggressive performance:
  - Superscalar
  - Out-of-order execution
- You can be creative with your pipelines
- Learn from our top 10 worst SW bugs...
  - Test, test, test. Expect the unexpected.
    - Design w/failure as possibility! Redundancy!

#### Big Ideas so far

- 15 weeks to learn big ideas in CS&E
  - Principle of abstraction, used to build systems as layers
  - Pliable Data: a program determines what it is
  - Stored program concept: instructions just data
  - Compilation v. interpretation to move down layers of system
  - Greater performance by exploiting parallelism (pipeline)
  - Principle of Locality, exploited via a memory hierarchy (cache)
  - Principles/Pitfalls of Performance Measurement



#### Where are we now in 61C?

- Architecture! (aka "Systems")
  - CPU Organization
    - Datapath
    - Control
  - Pipelining
  - Caches
  - Virtual Memory
  - I/O
  - Networks
  - Performance



#### **The Big Picture**





# **Memory Hierarchy (1/3)**

#### Processor

- executes instructions on order of nanoseconds to picoseconds
- holds a small amount of code and data in registers

#### Memory

- More capacity than registers, still limited
- Access time ~50-100 ns

#### Disk

- HUGE capacity (virtually limitless)
- VERY slow: runs ~milliseconds

#### **Review: Why We Use Caches**



- 1989 first Intel CPU with cache on chip
- 1998 Pentium III has two levels of cache on chip



#### **Memory Hierarchy (2/3)**



# Size of memory at each level

As we move to deeper levels the latency goes up and price per bit goes down.

Q: Can \$/bit go up as we move deeper?


### **Memory Hierarchy (3/3)**

- If level closer to Processor, it must be:
  - smaller
  - faster
  - subset of lower levels (contains most recently used data)
- Lowest Level (usually disk) contains all available data
- Other levels?



#### **Memory Caching**

- We've discussed three levels in the hierarchy: processor, memory, disk
- Mismatch between processor and memory speeds leads us to add a new level: a memory cache
- Implemented with SRAM technology: faster but more expensive than DRAM memory.
  - "S" = Static, no need to refresh, ~10ns
  - "D" = Dynamic, need to refresh, ~60ns
  - arstechnica.com/paedia/r/ram_guide/ram_guide.part1-1.html



#### **Memory Hierarchy Analogy: Library (1/2)**

- You're writing a term paper (Processor) at a table in Doe
- Doe Library is equivalent to <u>disk</u>
  - essentially limitless capacity
  - very slow to retrieve a book
- Table is memory
  - smaller capacity: means you must return book when table fills up
  - easier and faster to find a book there once you've already retrieved it



#### **Memory Hierarchy Analogy: Library (2/2)**

- Open books on table are <u>cache</u>
  - smaller capacity: can have very few open books fit on table; again, when table fills up, you must close a book
  - much, much faster to retrieve data
- Illusion created: whole library open on the tabletop
  - Keep as many recently used books open on table as possible since likely to use again
  - Also keep as many books on table as possible, since faster than going to library

### **Memory Hierarchy Basis**

- Disk contains everything.
- When Processor needs something, bring it into all higher levels of memory.
- Cache contains copies of data in memory that are being used.
- Memory contains copies of data on disk that are being used.
- Entire idea is based on <u>Temporal Locality</u>: if we use it now, we'll want to use it again soon (a Big Idea)


#### **Cache Design**

- How do we organize cache?
- Where does each memory address map to?

(Remember that cache is subset of memory, so multiple memory addresses map to the same cache location.)

- How do we know which elements are in cache?
- How do we quickly locate them?



#### **Administrivia**

- Dan's Wednesday OH this week moved a few hours earlier, to 10-11am
- Project 3 due Friday



#### **Direct-Mapped Cache (1/2)**

- In a <u>direct-mapped cache</u>, each memory address is associated with one possible <u>block</u> within the cache
  - Therefore, we only need to look in a single location in the cache for the data if it exists in the cache
  - Block is the unit of transfer between cache and memory
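The "one possible block" rule can be sketched with a toy configuration. The sizes here (a 16-byte memory, a cache of 4 one-byte blocks) and the name `cache_index` are assumptions chosen for illustration, not values from the lecture.

```python
# Hypothetical toy configuration: a cache with 4 blocks of 1 byte each.
NUM_BLOCKS = 4


def cache_index(address, num_blocks=NUM_BLOCKS):
    """Return the single cache block that `address` may occupy (1-byte blocks)."""
    return address % num_blocks


# Group the 16 addresses of a toy memory by the cache block they map to.
mapping = {}
for address in range(16):
    mapping.setdefault(cache_index(address), []).append(address)

for index in sorted(mapping):
    addresses = ", ".join(hex(a) for a in mapping[index])
    print(f"cache index {index}: memory addresses {addresses}")
```

Each cache index ends up shared by four memory addresses, which is why the cache needs some way to tell which of them it currently holds.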



### **Direct-Mapped Cache (2/2)**



#### **Issues with Direct-Mapped**

- Since multiple memory addresses map to same cache index, how do we tell which one is in there?
- What if we have a block size > 1 byte?
- Answer: divide memory address into three fields
  - Tag, Index, Offset (from leftmost to rightmost bits)

#### **Direct-Mapped Cache Terminology**

- All fields are read as unsigned integers.
- Index: specifies the cache index (which "row" of the cache we should look in)
- Offset: once we've found correct block, specifies which byte within the block we want, i.e., which "column"
- Tag: the remaining bits after offset and index are determined; these are used to distinguish between all the memory addresses that map to the same location
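The three fields can be pulled out of an address with shifts and masks. This is a minimal sketch: the function name `split_address` and the widths used in the call (10 index bits, 4 offset bits) are assumptions for illustration.

```python
def split_address(address, index_bits, offset_bits):
    """Split an address into (tag, index, offset), all read as unsigned integers."""
    offset = address & ((1 << offset_bits) - 1)                  # lowest bits
    index = (address >> offset_bits) & ((1 << index_bits) - 1)   # middle bits
    tag = address >> (offset_bits + index_bits)                  # remaining bits
    return tag, index, offset


# Assumed widths for illustration: 10 index bits, 4 offset bits.
tag, index, offset = split_address(0x00008014, index_bits=10, offset_bits=4)
print(f"tag={tag} index={index} offset={offset:#x}")
```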



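# **TIO** Dan's great cache mnemonic

Address fields, left to right: **Tag**, **Index**, **Offset**

AREA (cache size, B) = HEIGHT (# of blocks) \* WIDTH (size of one block, B/block)

$2^{(H+W)} = 2^{H} * 2^{W}$

- **Index** (H bits) selects the row: HEIGHT (# of blocks) = $2^{H}$
- **Offset** (W bits) selects the byte: WIDTH (size of one block, B/block) = $2^{W}$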





#### **Direct-Mapped Cache Example (1/3)**

- Suppose we have 16KB of data in a direct-mapped cache with 4 word blocks
- Determine the size of the tag, index and offset fields if we're using a 32-bit architecture
- Offset
  - need to specify correct byte within a block
  - block contains 4 words = 16 bytes = 2<sup>4</sup> bytes
  - need <u>4 bits</u> to specify correct byte

#### **Direct-Mapped Cache Example (2/3)**

- Index: (~index into an "array of blocks")
  - need to specify correct row in cache
  - cache contains 16 KB =  $2^{14}$  bytes
  - block contains 2<sup>4</sup> bytes (4 words)
  - # blocks/cache
    - = (bytes/cache) / (bytes/block)
    - = (2<sup>14</sup> bytes/cache) / (2<sup>4</sup> bytes/block)
    - = 2<sup>10</sup> blocks/cache
  - need <u>10 bits</u> to specify this many rows



### **Direct-Mapped Cache Example (3/3)**

Tag: use remaining bits as tag

```
tag length = addr length - offset - index
           = 32 - 4 - 10 bits
           = 18 bits
```

so tag is leftmost <u>18 bits</u> of memory address

- Why not full 32-bit address as tag?
  - All bytes within a block share every address bit except the offset, so the offset can be left out of the tag (4 bits)
  - Index must be the same for every address within a block, so it's redundant in the tag check and can be left off to save memory (10 bits in this example)
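The arithmetic of this example can be checked with a short sketch. The helper name `field_widths` is hypothetical, and it assumes both sizes are powers of two, as they are here.

```python
def field_widths(cache_bytes, block_bytes, address_bits=32):
    """Return (tag_bits, index_bits, offset_bits) for a direct-mapped cache.

    Assumes cache_bytes and block_bytes are powers of two.
    """
    for size in (cache_bytes, block_bytes):
        if size <= 0 or size & (size - 1):
            raise ValueError("sizes must be powers of two")
    offset_bits = block_bytes.bit_length() - 1   # log2(bytes per block)
    num_blocks = cache_bytes // block_bytes      # HEIGHT = AREA / WIDTH
    index_bits = num_blocks.bit_length() - 1     # log2(number of blocks)
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


# 16 KB of data, 4-word (16-byte) blocks, 32-bit addresses.
print(field_widths(16 * 1024, 4 * 4))
```

For the cache in this example the call yields 18 tag bits, 10 index bits and 4 offset bits, matching the hand calculation.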



# **Caching Terminology**

- When we try to read memory, 3 things can happen:
- 1. cache hit: cache block is valid and contains proper address, so read desired word
- 2. cache miss: nothing in cache in appropriate block, so fetch from memory
- 3. cache miss, block replacement: wrong data is in cache at appropriate block, so discard it and fetch desired data from memory (cache is always a copy)
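The three outcomes depend only on the valid bit and a tag comparison for the selected row. A minimal sketch, where the function name and the returned strings are assumptions for illustration:

```python
def classify_read(valid, stored_tag, requested_tag):
    """Classify a read of one cache row as one of the three cases above."""
    if not valid:
        return "miss"                      # nothing in the block yet
    if stored_tag == requested_tag:
        return "hit"                       # valid and the proper address
    return "miss, block replacement"       # valid, but wrong data is there


print(classify_read(valid=False, stored_tag=None, requested_tag=0))
print(classify_read(valid=True, stored_tag=0, requested_tag=0))
print(classify_read(valid=True, stored_tag=0, requested_tag=2))
```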

# Accessing data in a direct mapped cache

- Ex.: 16KB of data, direct-mapped, 4 word blocks
- Read 4 addresses
  - 1. 0x00000014
  - 2. 0x0000001C
  - 3. 0x00000034
  - 4. 0x00008014
- Memory values below:
  - only cache/memory level of hierarchy

| Address (hex) | Value of Word |
|---------------|---------------|
| ...           | ...           |
| 00000010      | a             |
| 00000014      | b             |
| 00000018      | c             |
| 0000001C      | d             |
| ...           | ...           |
| 00000030      | e             |
| 00000034      | f             |
| 00000038      | g             |
| 0000003C      | h             |
| ...           | ...           |
| 00008010      | i             |
| 00008014      | j             |
| 00008018      | k             |
| 0000801C      | l             |
| ...           | ...           |



### Accessing data in a direct mapped cache

#### 4 Addresses:

```
0x00000014, 0x0000001C, 0x00000034, 0x00008014
```

4 Addresses divided (for convenience) into Tag, Index, Byte Offset fields

```
Tag                 Index       Offset
000000000000000000  0000000001  0100
000000000000000000  0000000001  1100
000000000000000000  0000000011  0100
000000000000000010  0000000001  0100
```

#### 16 KB Direct Mapped Cache, 16B blocks

- Valid bit: determines whether anything is stored in that row (when computer initially turned on, all entries invalid)

| Index | Valid | Tag | 0x0-3 | 0x4-7 | 0x8-b | 0xc-f |
|-------|-------|-----|-------|-------|-------|-------|
| 0     | 0     |     |       |       |       |       |
| 1     | 0     |     |       |       |       |       |
| 2     | 0     |     |       |       |       |       |
| 3     | 0     |     |       |       |       |       |
| 4     | 0     |     |       |       |       |       |
| 5     | 0     |     |       |       |       |       |
| 6     | 0     |     |       |       |       |       |
| 7     | 0     |     |       |       |       |       |
| ...   | ...   |     |       |       |       |       |
| 1022  | 0     |     |       |       |       |       |
| 1023  | 0     |     |       |       |       |       |

#### 1. Read 0x00000014

- 000000000000000000 0000000001 0100 (Tag field, Index field, Offset)

#### So we read block 1 (0000000001)

#### No valid data

- Row 1 has Valid = 0

#### So load that data into cache, setting tag, valid

| Index | Valid | Tag | 0x0-3 | 0x4-7 | 0x8-b | 0xc-f |
|-------|-------|-----|-------|-------|-------|-------|
| 1     | 1     | 0   | a     | b     | c     | d     |

(all other rows still invalid)

#### Read from cache at offset, return word b

- 000000000000000000 0000000001 <u>0100</u> (Tag field, Index field, Offset)



#### 2. Read 0x0000001C = 0...00 0..001 1100

- 000000000000000000 0000000001 1100 (Tag field, Index field, Offset)

| Index | Valid | Tag | 0x0-3 | 0x4-7 | 0x8-b | 0xc-f |
|-------|-------|-----|-------|-------|-------|-------|
| 1     | 1     | 0   | a     | b     | c     | d     |

(all other rows still invalid)

#### **Index** is Valid

- Row 1 has Valid = 1

#### **Index valid, Tag Matches**

- Row 1 holds Tag 0, the same as the Tag field of the address

# Index Valid, Tag Matches, return d

- 000000000000000000 0000000001 <u>1100</u> (Tag field, Index field, Offset)





## 3. Read 0x00000034 = 0...00 0..011 0100

| Index | Valid | Tag | 0x0-3 | 0x4-7 | 0x8-b | 0xc-f |
|-------|-------|-----|-------|-------|-------|-------|
| 0     | 0     |     |       |       |       |       |
| 1     | 1     | 0   | a     | b     | c     | d     |
| 2     | 0     |     |       |       |       |       |
| 3     | 0     |     |       |       |       |       |
| 4     | 0     |     |       |       |       |       |
| 5     | 0     |     |       |       |       |       |
| 6     | 0     |     |       |       |       |       |
| 7     | 0     |     |       |       |       |       |
| ...   | ...   |     |       |       |       |       |
| 1022  | 0     |     |       |       |       |       |
| 1023  | 0     |     |       |       |       |       |

## So read block 3

- 000000000000000000 <u>0000000011</u> 0100 (Tag field, Index field, Offset)

#### No valid data

- Row 3 has Valid = 0

# Load that cache block, return word f

| Index | Valid | Tag | 0x0-3 | 0x4-7 | 0x8-b | 0xc-f |
|-------|-------|-----|-------|-------|-------|-------|
| 1     | 1     | 0   | a     | b     | c     | d     |
| 3     | 1     | 0   | e     | f     | g     | h     |

(all other rows still invalid)



## 4. Read 0x00008014 = 0...10 0..001 0100

| Index | Valid | Tag | 0x0-3 | 0x4-7 | 0x8-b | 0xc-f |
|-------|-------|-----|-------|-------|-------|-------|
| 0     | 0     |     |       |       |       |       |
| 1     | 1     | 0   | a     | b     | c     | d     |
| 2     | 0     |     |       |       |       |       |
| 3     | 1     | 0   | e     | f     | g     | h     |
| 4     | 0     |     |       |       |       |       |
| 5     | 0     |     |       |       |       |       |
| 6     | 0     |     |       |       |       |       |
| 7     | 0     |     |       |       |       |       |
| ...   | ...   |     |       |       |       |       |
| 1022  | 0     |     |       |       |       |       |
| 1023  | 0     |     |       |       |       |       |

# So read Cache Block 1, Data is Valid

- 000000000000000010 <u>0000000001</u> 0100 (Tag field, Index field, Offset)

# Cache Block 1 Tag does not match (0 != 2)

- Row 1 holds Tag 0, but the Tag field of the address is 2

# Miss, so replace block 1 with new data & tag

| Index | Valid | Tag | 0x0-3 | 0x4-7 | 0x8-b | 0xc-f |
|-------|-------|-----|-------|-------|-------|-------|
| 1     | 1     | 2   | i     | j     | k     | l     |
| 3     | 1     | 0   | e     | f     | g     | h     |

(all other rows still invalid)

## And return word j

- 000000000000000010 0000000001 <u>0100</u> (Tag field, Index field, Offset)
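The four reads above can be replayed with a small simulation. This is a sketch under stated assumptions rather than a hardware model: the names (`memory`, `cache`, `read`) are made up for illustration, the cache is a dictionary in which an absent row means Valid = 0, and the words i and l are inferred from the a through l sequence in the lecture's tables.

```python
OFFSET_BITS = 4    # 16-byte blocks
INDEX_BITS = 10    # 1024 rows
WORD_BYTES = 4

# Word values from the lecture's memory table (i and l are inferred).
memory = {
    0x10: "a", 0x14: "b", 0x18: "c", 0x1C: "d",
    0x30: "e", 0x34: "f", 0x38: "g", 0x3C: "h",
    0x8010: "i", 0x8014: "j", 0x8018: "k", 0x801C: "l",
}

cache = {}  # index -> (tag, [4 words]); a missing index means Valid = 0


def read(address):
    """Return (outcome, word) for a read through the direct-mapped cache."""
    offset = address & ((1 << OFFSET_BITS) - 1)
    index = (address >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    row = cache.get(index)
    if row is not None and row[0] == tag:
        outcome = "hit"
    else:
        outcome = "miss" if row is None else "miss, replace"
        base = address - offset  # first byte of the block
        block = [memory[base + WORD_BYTES * w] for w in range(4)]
        cache[index] = (tag, block)  # load the block, set tag and valid
    return outcome, cache[index][1][offset // WORD_BYTES]


results = [read(a) for a in (0x00000014, 0x0000001C, 0x00000034, 0x00008014)]
for outcome, word in results:
    print(outcome, word)
```

Only the second read hits: the first and third find invalid rows, and the fourth finds row 1 valid but holding a different tag.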



# Do an example yourself. What happens?

- Choose from: Cache: Hit, Miss, Miss w. replace. Values returned: a, b, c, d, e, ..., k, l
- Read address 0x00000030 ? 000000000000000000 0000000011 0000
- Read address 0x0000001c ? 000000000000000000 0000000001 1100

| Index | Valid | Tag | 0x0-3 | 0x4-7 | 0x8-b | 0xc-f |
|-------|-------|-----|-------|-------|-------|-------|
| 0     | 0     |     |       |       |       |       |
| 1     | 1     | 2   | i     | j     | k     | l     |
| 2     | 0     |     |       |       |       |       |
| 3     | 1     | 0   | e     | f     | g     | h     |
| 4     | 0     |     |       |       |       |       |
| 5     | 0     |     |       |       |       |       |
| 6     | 0     |     |       |       |       |       |
| 7     | 0     |     |       |       |       |       |
| ...   | ...   |     |       |       |       |       |

#### **Answers**

- 0x00000030 a <u>hit</u>
  - Index = 3, Tag matches, Offset = 0, value = e
- 0x0000001c a <u>miss</u>
  - Index = 1, Tag mismatch, so replace from memory, Offset = 0xc, value = d
- Since reads, values must = memory values whether or not cached:
  - 0x00000030 = e
  - 0x0000001c = d

| Memory Address (hex) | Value of Word |
|----------------------|---------------|
| ...                  | ...           |
| 00000010             | a             |
| 00000014             | b             |
| 00000018             | c             |
| 0000001c             | d             |
| ...                  | ...           |
| 00000030             | e             |
| 00000034             | f             |
| 00000038             | g             |
| 0000003c             | h             |
| ...                  | ...           |
| 00008010             | i             |
| 00008014             | j             |
| 00008018             | k             |
| 0000801c             | l             |
| ...                  | ...           |

Garcia, Fall 2005 © UCB

#### **Peer Instruction**

- A. Mem hierarchies were invented before 1950. (UNIVAC I wasn't delivered 'til 1951)
- B. If you know your computer's cache size, you can often make your code run faster.
- C. Memory hierarchies take advantage of spatial locality by keeping the most recent data items closer to the processor.



ABC

1: FFF

2: FFT

3: FTF

4: FTT

5: TFF

6: TFT

7: TTF

8: TTT

#### **Peer Instruction Answer**

- A. "We are...forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less accessible." von Neumann, 1946
- B. Certainly! That's called "tuning"
- C. "Most Recent" items ⇒ <u>Temporal</u> locality





Answer: 7: TTF



#### **Peer Instructions**

- 1. All caches take advantage of spatial locality.
- 2. All caches take advantage of temporal locality.
- 3. On a read, the return value will depend on what is in the cache.

ABC

1: FFF

2: FFT

3: FTF

4: FTT

5: TFF

6: TFT

7: TTF

8: TTT

#### **Peer Instruction Answer**

- 1. All caches take advantage of spatial locality.
- 2. All caches take advantage of temporal locality.
- 3. On a read, the return value will depend on what is in the cache.

- 1. Block size = 1, no spatial!
- 2. That's the <u>idea</u> of caches; We'll need it again soon.
- 3. It better not! If it's there, use it. Otherwise, get it from memory.

Answer: 3: FTF
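The first two answers can be illustrated by counting misses in a direct-mapped cache that tracks only tags. This is a sketch: `count_misses` is a hypothetical helper, "block size = 1" is taken here to mean one 4-byte word per block, and the access pattern (64 consecutive words) is made up for illustration.

```python
def count_misses(addresses, block_bytes, num_blocks):
    """Count misses for a sequence of reads in a direct-mapped cache."""
    tags = {}  # index -> tag currently stored in that row
    misses = 0
    for address in addresses:
        block_number = address // block_bytes
        index = block_number % num_blocks
        tag = block_number // num_blocks
        if tags.get(index) != tag:  # invalid row or wrong tag
            misses += 1
            tags[index] = tag
    return misses


sequential_words = list(range(0, 256, 4))  # 64 consecutive word addresses

# One word per block: no spatial locality, every first touch misses.
one_word = count_misses(sequential_words, block_bytes=4, num_blocks=1024)
# Four words per block: each miss also brings in the next three words.
four_words = count_misses(sequential_words, block_bytes=16, num_blocks=1024)
# Reading everything twice: the second pass hits (temporal locality).
two_passes = count_misses(sequential_words * 2, block_bytes=4, num_blocks=1024)

print(one_word, four_words, two_passes)
```

With one-word blocks the cache still benefits from reuse on the second pass, but it gains nothing from neighbouring addresses, which is the point of answer 1.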

# **And in Conclusion (1/2)**

- We would like to have the capacity of disk at the speed of the processor: unfortunately this is not feasible.
- So we create a memory hierarchy:
  - each level closer to the processor contains the "most used" data from the next level farther away
  - exploits <u>temporal locality</u>
  - do the common case fast, worry less about the exceptions (design principle of MIPS)
- Locality of reference is a Big Idea



# And in Conclusion (2/2)

- Mechanism for transparent movement of data among levels of a storage hierarchy
  - set of address/value bindings
  - address ⇒ index to set of candidates
  - compare desired address with tag
  - service hit or miss
    - load new block and binding on miss


