

#### EECS 252 Graduate Computer Architecture

## Lec 3 – Performance

#### + Pipeline Review

David Patterson Electrical Engineering and Computer Sciences University of California, Berkeley

http://www.eecs.berkeley.edu/~pattrsn http://www-inst.eecs.berkeley.edu/~cs252

#### **Review from last lecture**

- Tracking and extrapolating technology part of architect's responsibility
- Expect Bandwidth in disks, DRAM, network, and processors to improve by at least as much as the square of the improvement in Latency
- Quantify Cost (vs. Price)
   IC ≈ f(Area<sup>2</sup>) + Learning curve, volume, commodity, margins
- Quantify dynamic and static power – Capacitance x Voltage<sup>2</sup> x frequency, Energy vs. power
- Quantify dependability
   Reliability (MTTF vs. FIT), Availability (MTTF/(MTTF+MTTR))

#### 1/25/2006

CS252-s06, Lec 02-intro



#### **Outline**

- Review
- Quantify and summarize performance – Ratios, Geometric Mean, Multiplicative Standard Deviation
- F&P: Benchmarks age, disks fail, single point of failure danger
- 252 Administrivia
- MIPS An ISA for Pipelining
- 5 stage pipelining
- Structural and Data Hazards
- Forwarding
- Branch Schemes
- Exceptions and Interrupts
- Conclusion




- Performance is in units of things per sec – bigger is better
- If we are primarily concerned with response time:

$$performance(x) = \frac{1}{execution\_time(x)}$$

- "X is n times faster than Y" means

$$n = \frac{Performance(X)}{Performance(Y)} = \frac{Execution\_time(Y)}{Execution\_time(X)}$$
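As a small illustration of the "n times faster" definition, the sketch below computes n from two execution times. The function name and the timings are hypothetical, chosen only for the example.

```python
def times_faster(exec_time_x: float, exec_time_y: float) -> float:
    """Return n such that X is n times faster than Y (response-time view)."""
    perf_x = 1.0 / exec_time_x   # performance(x) = 1 / execution time(x)
    perf_y = 1.0 / exec_time_y
    return perf_x / perf_y       # same value as exec_time_y / exec_time_x


# Hypothetical timings: X runs the program in 10 s, Y in 15 s.
n = times_faster(10.0, 15.0)
print(f"X is {n:.2f} times faster than Y")
```

Note that the faster machine has the smaller execution time, so the times appear inverted relative to the performance ratio.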




#### **Performance: What to measure**

- Usually rely on benchmarks vs. real workloads
- To increase predictability, collections of benchmark applications, called *benchmark suites*, are popular
- SPECCPU: popular desktop benchmark suite
  - CPU only, split between integer and floating point programs
  - SPECint2000 has 12 integer, SPECfp2000 has 14 floating-point pgms
  - SPECCPU2006 to be announced Spring 2006
  - SPECSFS (NFS file server) and SPECWeb (WebServer) added as server benchmarks
- Transaction Processing Performance Council (TPC) measures server performance and cost-performance for databases
  - TPC-C Complex query for Online Transaction Processing
  - TPC-H models ad hoc decision support
  - TPC-W a transactional web benchmark
  - TPC-App application server and web services benchmark

#### How Summarize Suite Performance (1/5)

- Arithmetic average of execution time of all pgms?
  - But they vary by 4X in speed, so some would be more important than others in arithmetic average
- Could add a weight per program, but how to pick the weights?
  - Different companies want different weights for their products
- SPECRatio: Normalize execution times to reference computer, yielding a ratio proportional to performance:

$$SPECRatio = \frac{\text{time on reference computer}}{\text{time on computer being rated}}$$


#### How Summarize Suite Performance (2/5)

• If program SPECRatio on Computer A is 1.25 times bigger than Computer B, then

$$1.25 = \frac{SPECRatio_{A}}{SPECRatio_{B}} = \frac{\dfrac{ExecutionTime_{reference}}{ExecutionTime_{A}}}{\dfrac{ExecutionTime_{reference}}{ExecutionTime_{B}}} = \frac{ExecutionTime_{B}}{ExecutionTime_{A}} = \frac{Performance_{A}}{Performance_{B}}$$

• Note that when comparing 2 computers as a ratio, execution times on the reference computer drop out, so choice of reference computer is irrelevant

#### How Summarize Suite Performance (3/5)

• Since ratios, proper mean is geometric mean (SPECRatio unitless, so arithmetic mean meaningless)

$$GeometricMean = \sqrt[n]{\prod_{i=1}^{n} SPECRatio_{i}}$$

- 2 points make geometric mean of ratios attractive to summarize performance:
  1. Geometric mean of the ratios is the same as the ratio of the geometric means
  2. Ratio of geometric means = Geometric mean of performance ratios $\Rightarrow$ choice of reference computer is irrelevant!
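The claim that the reference computer drops out can be checked numerically. The following is a minimal sketch with made-up execution times for three programs on computers A and B and two different reference machines; all names and numbers are hypothetical.

```python
import math


def geometric_mean(values):
    """n-th root of the product, computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical execution times (seconds) for three programs.
times_a = [10.0, 40.0, 5.0]
times_b = [20.0, 30.0, 8.0]
ref_1 = [100.0, 200.0, 50.0]   # one possible reference computer
ref_2 = [7.0, 900.0, 33.0]     # a very different reference computer


def gm_ratio(ref):
    """Ratio of the geometric means of SPECRatios, A over B."""
    ratios_a = [r / t for r, t in zip(ref, times_a)]
    ratios_b = [r / t for r, t in zip(ref, times_b)]
    return geometric_mean(ratios_a) / geometric_mean(ratios_b)


# Geometric mean of the per-program performance ratios, with no reference at all.
direct = geometric_mean([tb / ta for ta, tb in zip(times_a, times_b)])

print(gm_ratio(ref_1), gm_ratio(ref_2), direct)
```

All three printed values agree up to floating-point rounding, which is the property the two points above describe.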





#### How Summarize Suite Performance (4/5)




- Does a single mean well summarize performance of programs in benchmark suite?
- Can decide if mean a good predictor by characterizing variability of distribution using standard deviation
- Like geometric mean, geometric standard deviation is multiplicative rather than arithmetic
- Can simply take the logarithm of SPECRatios, compute the standard mean and standard deviation, and then take the exponent to convert back:

$$GeometricMean = \exp\left(\frac{1}{n} \times \sum_{i=1}^{n} \ln(SPECRatio_i)\right)$$

$$GeometricStDev = \exp(StDev(\ln(SPECRatio_i)))$$


#### How Summarize Suite Performance (5/5)

- Standard deviation is more informative if know distribution has a standard form
  - bell-shaped normal distribution, whose data are symmetric around mean
  - lognormal distribution, where logarithms of data--not data itself--are normally distributed (symmetric) on a logarithmic scale
- For a lognormal distribution, we expect that

**68% of samples fall in range** $[mean/gstdev, mean \times gstdev]$

**95% of samples fall in range** $[mean/gstdev^2, mean \times gstdev^2]$

• Note: Excel provides functions EXP(), LN(), and STDEV() that make calculating geometric mean and multiplicative standard deviation easy
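The same calculation can be sketched outside a spreadsheet. The example below assumes Python's `statistics.stdev` (sample standard deviation) as the counterpart of Excel's STDEV(), and uses invented SPECRatio values, since the measured data are not reproduced in these notes.

```python
import math
import statistics


def geometric_mean(ratios):
    """exp of the arithmetic mean of the logarithms."""
    return math.exp(statistics.mean(math.log(r) for r in ratios))


def geometric_stdev(ratios):
    """exp of the sample standard deviation of the logarithms."""
    return math.exp(statistics.stdev(math.log(r) for r in ratios))


# Hypothetical SPECRatios for a six-program suite.
spec_ratios = [12.0, 18.5, 25.0, 30.0, 41.0, 55.0]

gm = geometric_mean(spec_ratios)
gsd = geometric_stdev(spec_ratios)
low, high = gm / gsd, gm * gsd          # one multiplicative standard deviation
inside = sum(low <= r <= high for r in spec_ratios)

print(f"GM = {gm:.2f}, gstdev = {gsd:.2f}")
print(f"{inside} of {len(spec_ratios)} ratios fall in [{low:.2f}, {high:.2f}]")
```

Because the deviation is multiplicative, the range is formed by dividing and multiplying the mean rather than by subtracting and adding.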


### **Example Standard Deviation (1/2)**

• GM and multiplicative StDev of SPECfp2000 for Itanium 2

### **Example Standard Deviation (2/2)**

• GM and multiplicative StDev of SPECfp2000 for AMD Athlon



#### **Comments on Itanium 2 and Athlon**

- Standard deviation of 1.98 for Itanium 2 is much higher-- vs. 1.40--so results will differ more widely from the mean, and therefore are likely less predictable
- SPECRatios falling within one standard deviation:
  - 10 of 14 benchmarks (71%) for Itanium 2
  - 11 of 14 benchmarks (79%) for Athlon
- Thus, results are quite compatible with a lognormal distribution (expect 68% for 1 StDev)


#### Fallacies and Pitfalls (1/2)

- Fallacies: commonly held misconceptions
  - When discussing a fallacy, we try to give a counterexample.
- Pitfalls: easily made mistakes
  - Often generalizations of principles true in limited context
  - Show Fallacies and Pitfalls to help you avoid these errors
- Fallacy: Benchmarks remain valid indefinitely
  - Once a benchmark becomes popular, tremendous pressure to improve performance by targeted optimizations or by aggressive interpretation of the rules for running the benchmark: "benchmarksmanship."
  - 70 benchmarks from the 5 SPEC releases; 70% were dropped from the next release since no longer useful
- Pitfall: A single point of failure
  - Rule of thumb for fault tolerant systems: make sure that every component is redundant so that no single component failure can bring down the whole system (e.g., power supply)

#### Fallacies and Pitfalls (2/2)

- Fallacy: Rated MTTF of disks is 1,200,000 hours or ≈ 140 years, so disks practically never fail
- But disk lifetime is 5 years ⇒ replace a disk every 5 years; on average, a disk would be replaced 28 times before one failed
- A better unit: % that fail (1.2M hour MTTF = 833 FIT)
- Fail over lifetime: if had 1000 disks for 5 years
   = 1000 × (5 × 365 × 24) × 833 / 10<sup>9</sup> = 36,485,000 / 10<sup>6</sup> ≈ 37
   ⇒ 3.7% (37/1000) fail over 5 yr lifetime (1.2M hr MTTF)
- But this is under pristine conditions
  - little vibration, narrow temperature range, no power failures
- Real world: 3% to 6% of SCSI drives fail per year
  - 3400 - 6800 FIT or 150,000 - 300,000 hour MTTF [Gray & van Ingen 05]
- 3% to 7% of ATA drives fail per year
  - 3400 - 8000 FIT or 125,000 - 300,000 hour MTTF [Gray & van Ingen 05]
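The disk-failure arithmetic from the MTTF fallacy can be reproduced in a few lines. This is a rough sketch that assumes a constant failure rate and replacement of failed disks, with FIT defined as failures per 10^9 device-hours; the function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, as in the slide's 5*365*24


def mttf_to_fit(mttf_hours: float) -> float:
    """FIT = failures per billion hours = 1e9 / MTTF."""
    return 1e9 / mttf_hours


def expected_failures(num_disks: int, years: float, fit: float) -> float:
    """Expected failures for a population run for the given number of years."""
    return num_disks * years * HOURS_PER_YEAR * fit / 1e9


def annual_rate_to_fit(fraction_failing_per_year: float) -> float:
    """Convert an annual failure fraction (e.g. 0.03) to FIT."""
    return fraction_failing_per_year / HOURS_PER_YEAR * 1e9


fit = mttf_to_fit(1_200_000)                 # about 833 FIT
failures = expected_failures(1000, 5, fit)   # about 37 of 1000 disks

print(f"{fit:.0f} FIT, {failures:.1f} failures, {failures / 1000:.1%} of disks")
print(f"3% per year is about {annual_rate_to_fit(0.03):.0f} FIT")
```

The last line shows how a 3% annual failure rate lands near the 3400 FIT quoted for real-world drives.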


#### CS252: Administrivia

Instructor: Prof David Patterson

Office: 635 Soda Hall, pattrsn@cs

Office Hours: Tue 11 - noon or by appt.

(Contact Cecilia Pracher; cpracher@eecs)

- TA: Archana Ganapathi, archanag@eecs
- Class: M/W, 11:00 - 12:30pm, 203 McLaughlin (and online)

Text: Computer Architecture: A Quantitative Approach, 4th Edition (Oct, 2006), Beta, distributed for free provided report errors

Web page: http://www.cs/~pattrsn/courses/cs252-S06/

Lectures available online <9:00 AM day of lecture

Wiki page: ??

Reading assignment: Memory Hierarchy Basics Appendix C (handout) for Mon 1/30

Wed 2/1: Great ISA debate (3 papers) + Prerequisite Quiz




#### **Outline**

- Review
- Quantify and summarize performance – Ratios, Geometric Mean, Multiplicative Standard Deviation
- F&P: Benchmarks age, disks fail, 1 point fail danger
- 252 Administrivia
- MIPS – An ISA for Pipelining
- 5 stage pipelining
- **Structural and Data Hazards**
- Forwarding
- **Branch Schemes**
- **Exceptions and Interrupts**
- Conclusion

## A "Typical" RISC ISA

- 32-bit fixed format instruction (3 formats)
- 32 32-bit GPR (R0 contains zero, DP take pair)
- 3-address, reg-reg arithmetic instruction
- Single address mode for load/store: base + displacement
  - no indirection
- Simple branch conditions
- Delayed branch

see: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3


## **Example: MIPS (- MIPS)**



#### **Register-Register**



#### **Register-Immediate**

| 31:26 | 25:21 | 20:16 | 15:0      |
|-------|-------|-------|-----------|
| Op    | Rs1   | Rd    | immediate |

#### Branch

| 31:26 | 25:21 | 20:16   | 15:0      |
|-------|-------|---------|-----------|
| Op    | Rs1   | Rs2/Opx | immediate |
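To make the field layout concrete, here is a minimal decoding sketch for the register-immediate format. It assumes the standard MIPS I-format bit positions (Op in 31:26, Rs1 in 25:21, Rd in 20:16, immediate in 15:0) and a sign-extended immediate; the function name and the sample word are chosen for illustration.

```python
def decode_register_immediate(word: int) -> dict:
    """Split a 32-bit instruction word into Op, Rs1, Rd, and immediate."""
    op = (word >> 26) & 0x3F      # bits 31:26, 6-bit opcode
    rs1 = (word >> 21) & 0x1F     # bits 25:21, source register
    rd = (word >> 16) & 0x1F      # bits 20:16, destination register
    imm = word & 0xFFFF           # bits 15:0, immediate
    if imm & 0x8000:              # sign-extend the 16-bit immediate
        imm -= 0x10000
    return {"op": op, "rs1": rs1, "rd": rd, "imm": imm}


# 0x2128FFFC encodes addi r8, r9, -4 (opcode 8) under the assumed layout.
fields = decode_register_immediate(0x2128FFFC)
print(fields)
```

A fixed 32-bit format with fields at fixed positions is what lets the register file be read in parallel with decoding the opcode.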

#### Jump / Call



## **Datapath vs Control**



- Datapath: Storage, FU, interconnect sufficient to perform the desired functions
  - Inputs are Control Points
  - Outputs are signals
- Controller: State machine to orchestrate operation on the data path
  - Based on desired function and signals

## **Approaching an ISA**




- Instruction Set Architecture
  - Defines set of operations, instruction format, hardware supported data types, named storage, addressing modes, sequencing
- Meaning of each instruction is described by RTL on architected registers and memory
- Given technology constraints assemble adequate datapath
  - Architected storage mapped to actual storage
  - Function units to do all the required operations
  - Possible additional storage (e.g., MAR, MBR, ...)
  - Interconnect to move information among regs and FUs
- Map each instruction to sequence of RTLs
- Collate sequences into symbolic controller state transition diagram (STD)
- Lower symbolic STD to control points
- Implement controller


#### **5 Steps of MIPS Datapath**

Figure A.2, Page A-8



## **5 Steps of MIPS Datapath**



#### Inst. Set Processor Controller




## **5 Steps of MIPS Datapath**





Figure A.3, Page A-9



## Pipelining is not quite that easy!

- Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
  - <u>Structural hazards</u>: HW cannot support this combination of instructions (single person to fold and put clothes away)
  - <u>Data hazards</u>: Instruction depends on result of prior instruction still in the pipeline (missing sock)
  - <u>Control hazards</u>: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).

## **Visualizing Pipelining**

Figure A.2, Page A-8



#### One Memory Port/Structural Hazards

Figure A.4, Page A-14





## **One Memory Port/Structural Hazards**

(Similar to Figure A.5, Page A-15)



#### Example: Dual-port vs. Single-port

- Machine A: Dual ported memory ("Harvard Architecture")
- Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate
- Ideal CPI = 1 for both
- Loads are 40% of instructions executed

$$SpeedUp_A = \frac{Pipeline\ Depth}{1 + 0} \times \frac{clock_{unpipe}}{clock_{pipe}} = Pipeline\ Depth$$

$$SpeedUp_B = \frac{Pipeline\ Depth}{1 + 0.4 \times 1} \times \frac{clock_{unpipe}}{clock_{unpipe}/1.05} = \frac{Pipeline\ Depth}{1.4} \times 1.05 = 0.75 \times Pipeline\ Depth$$

$$\frac{SpeedUp_A}{SpeedUp_B} = \frac{Pipeline\ Depth}{0.75 \times Pipeline\ Depth} = 1.33$$

- Machine A is 1.33 times faster
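The dual-port versus single-port comparison can be replayed numerically. The sketch below assumes a pipeline depth of 5 (it cancels in the final ratio), one stall cycle per load on Machine B, and, as in the slide, a cycle-time ratio of 1 for Machine A; the function and variable names are hypothetical.

```python
def pipeline_speedup(depth, stall_cpi, cycle_unpipelined, cycle_pipelined,
                     ideal_cpi=1.0):
    """Speedup = (ideal CPI x depth) / (ideal CPI + stall CPI) x cycle-time ratio."""
    cpi_term = (ideal_cpi * depth) / (ideal_cpi + stall_cpi)
    return cpi_term * (cycle_unpipelined / cycle_pipelined)


depth = 5            # assumed depth; any positive value gives the same ratio
load_fraction = 0.4  # loads are 40% of instructions executed

# Machine A: dual-ported memory, no structural stalls.
speedup_a = pipeline_speedup(depth, 0.0, 1.0, 1.0)
# Machine B: one stall cycle per load, but a 1.05 times faster clock.
speedup_b = pipeline_speedup(depth, load_fraction * 1, 1.0, 1.0 / 1.05)

print(f"A: {speedup_a:.2f}  B: {speedup_b:.2f}  A/B: {speedup_a / speedup_b:.2f}")
```

Here the 5% clock advantage is not enough to offset stalling on 40% of instructions.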


## **Speed Up Equation for Pipelining**



$$Speedup = \frac{Ideal\ CPI \times Pipeline\ depth}{Ideal\ CPI + Pipeline\ stall\ CPI} \times \frac{Cycle\ Time_{unpipelined}}{Cycle\ Time_{pipelined}}$$

For simple RISC pipeline, CPI = 1:

$$Speedup = \frac{Pipeline\ depth}{1 + Pipeline\ stall\ CPI} \times \frac{Cycle\ Time_{unpipelined}}{Cycle\ Time_{pipelined}}$$


**Data Hazard on R1** 

Figure A.6, Page A-17



#### **Three Generic Data Hazards**

- Read After Write (RAW) Instr<sub>J</sub> tries to read operand before Instr<sub>I</sub> writes it
  - I: add r1,r2,r3
    J: sub r4,r1,r3
- Caused by a "Dependence" (in compiler nomenclature). This hazard results from an actual need for communication.


#### **Three Generic Data Hazards**

- Write After Read (WAR) Instr<sub>J</sub> writes operand <u>before</u> Instr<sub>I</sub> reads it
  - I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
- Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".
- Can't happen in MIPS 5 stage pipeline because:
  - All instructions take 5 stages, and
  - Reads are always in stage 2, and
  - Writes are always in stage 5

#### **Three Generic Data Hazards**

- Write After Write (WAW) Instr<sub>J</sub> writes operand <u>before</u> Instr<sub>I</sub> writes it.
  - I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
- Called an "output dependence" by compiler writers. This also results from the reuse of name "r1".
- Can't happen in MIPS 5 stage pipeline because:
  - All instructions take 5 stages, and
  - Writes are always in stage 5
- Will see WAR and WAW in more complicated pipes
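The three hazard classes reduce to simple checks on which registers an earlier instruction I and a later instruction J read and write. The following is an illustrative sketch, not a pipeline model: it reports the name-based dependences that could become hazards, and the `Instr` type is a hypothetical helper.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    op: str                 # mnemonic, e.g. "add"
    dest: str               # register written
    srcs: Tuple[str, ...]   # registers read


def potential_hazards(i: Instr, j: Instr) -> Set[str]:
    """Classify dependences between earlier instruction i and later instruction j."""
    found = set()
    if i.dest in j.srcs:
        found.add("RAW")    # j reads what i writes: true dependence
    if j.dest in i.srcs:
        found.add("WAR")    # j writes what i reads: anti-dependence
    if i.dest == j.dest:
        found.add("WAW")    # both write the same register: output dependence
    return found


# The instruction pairs used on the slides.
raw = potential_hazards(Instr("add", "r1", ("r2", "r3")), Instr("sub", "r4", ("r1", "r3")))
war = potential_hazards(Instr("sub", "r4", ("r1", "r3")), Instr("add", "r1", ("r2", "r3")))
waw = potential_hazards(Instr("sub", "r1", ("r4", "r3")), Instr("add", "r1", ("r2", "r3")))
print(raw, war, waw)
```

Whether a dependence actually causes a hazard depends on the pipeline: in the 5 stage MIPS pipeline only RAW can, for the reasons listed above.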

#### Forwarding to Avoid Data Hazard

Figure A.7, Page A-19






# Software Scheduling to Avoid Load Hazards



Try producing fast code for

a = b + c;

d = e - f;

assuming a, b, c, d, e, and f in memory.
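One way to see the benefit of scheduling is to count load-use stalls for two orderings of the same instructions. The sketch below assumes a 5 stage pipeline with full forwarding, where an instruction that reads a register loaded by the immediately preceding load costs one stall cycle. The register names (Ra to Rf) and both orderings are illustrative assumptions, since the original figure is not reproduced here.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Instr(NamedTuple):
    op: str                  # "lw", "sw", "add", "sub"
    dest: Optional[str]      # register written (None for stores)
    srcs: Tuple[str, ...]    # registers read


def load_use_stalls(program: Sequence[Instr]) -> int:
    """Count one stall for each load whose result is used by the next instruction."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.op == "lw" and prev.dest in cur.srcs:
            stalls += 1
    return stalls


lw = lambda reg: Instr("lw", reg, ())
sw = lambda reg: Instr("sw", None, (reg,))

# Straightforward order: each arithmetic op directly follows its second load.
slow = [lw("Rb"), lw("Rc"), Instr("add", "Ra", ("Rb", "Rc")), sw("Ra"),
        lw("Re"), lw("Rf"), Instr("sub", "Rd", ("Re", "Rf")), sw("Rd")]

# Scheduled order: independent loads and stores fill the load delay.
fast = [lw("Rb"), lw("Rc"), lw("Re"), Instr("add", "Ra", ("Rb", "Rc")),
        lw("Rf"), sw("Ra"), Instr("sub", "Rd", ("Re", "Rf")), sw("Rd")]

print(load_use_stalls(slow), load_use_stalls(fast))
```

Both orderings contain the same eight instructions and preserve every data dependence; only the placement of independent work changes.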





## Control Hazard on Branches: Three Stage Stall

10: beq r1,r3,36

14: and r2,r3,r5

18: or r6,r1,r7

22: add r8,r1,r9

36: xor r10,r1,r11

What do you do with the 3 instructions in between?

How do you do it?

Where is the "commit"?




#### Outline

- Review
- Quantify and summarize performance – Ratios, Geometric Mean, Multiplicative Standard Deviation
- F&P: Benchmarks age, disks fail, 1 point fail danger
- 252 Administrivia
- MIPS An ISA for Pipelining
- 5 stage pipelining
- Structural and Data Hazards
- Forwarding
- Branch Schemes
- Exceptions and Interrupts
- Conclusion


#### **Branch Stall Impact**

- If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9!
- Two part solution:
  - Determine branch taken or not sooner, AND
  - Compute taken branch address earlier
- MIPS branch tests if register = 0 or ≠ 0
- MIPS Solution:
  - Move Zero test to ID/RF stage
  - Adder to calculate new PC in ID/RF stage
  - 1 clock cycle penalty for branch versus 3
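The CPI figure in the first bullet follows from adding the average branch stall cycles per instruction to the base CPI. A worked version, using only the numbers on this slide:

$$CPI_{new} = CPI_{base} + \text{branch frequency} \times \text{stall cycles} = 1 + 0.30 \times 3 = 1.9$$

Under the same assumptions, the 1 clock cycle penalty of the MIPS solution would give:

$$CPI_{new} = 1 + 0.30 \times 1 = 1.3$$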

#### **Pipelined MIPS Datapath**



Figure A.24, page A-38



#### **Four Branch Hazard Alternatives**

#### #1: Stall until branch direction is clear

#### #2: Predict Branch Not Taken

- Execute successor instructions in sequence
- "Squash" instructions in pipeline if branch actually taken
- Advantage of late pipeline state update
- 47% MIPS branches not taken on average
- PC+4 already calculated, so use it to get next instruction

#### #3: Predict Branch Taken

- 53% MIPS branches taken on average
- But haven't calculated branch target address in MIPS
  - MIPS still incurs 1 cycle branch penalty
  - Other machines: branch target known before outcome


## Four Branch Hazard Alternatives

#### #4: Delayed Branch

- Define branch to take place AFTER a following instruction

  - branch instruction
  - sequential successor<sub>1</sub>
  - sequential successor<sub>2</sub>
  - ...
  - sequential successor<sub>n</sub>
  - branch target if taken

- 1 slot delay allows proper decision and branch target address in 5 stage pipeline
- MIPS uses this


#### Scheduling Branch Delay Slots (Fig A.14)



- A is the best choice, fills delay slot & reduces instruction count (IC)
- In B, the sub instruction may need to be copied, increasing IC
- In B and C, must be okay to execute sub when branch fails

#### **Delayed Branch**



- Compiler effectiveness for single branch delay slot:
  - Fills about 60% of branch delay slots
  - About 80% of instructions executed in branch delay slots useful in computation
  - About 50% (60% x 80%) of slots usefully filled
- Delayed Branch downside: As processors go to deeper pipelines and multiple issue, the branch delay grows and more than one delay slot is needed
  - Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches
  - Growth in available transistors has made dynamic approaches relatively cheaper


#### **Problems with Pipelining**

- Exception: An unusual event happens to an instruction during its execution
  - Examples: divide by zero, undefined opcode
- Interrupt: Hardware signal to switch the processor to a new instruction stream
  - Example: a sound card interrupts when it needs more audio output samples (an audio "click" happens if it is left waiting)
- Problem: It must appear that the exception or interrupt occurs between 2 instructions (I<sub>i</sub> and I<sub>i+1</sub>)
  - The effect of all instructions up to and including I<sub>i</sub> is totally complete
  - No effect of any instruction after I<sub>i</sub> can take place
- The interrupt (exception) handler either aborts program or restarts at instruction I<sub>i+1</sub>

#### **Evaluating Branch Alternatives**

Pipeline speedup =  $\frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$ 

Assume 4% unconditional branch, 6% conditional branch-untaken, 10% conditional branch-taken

| Scheduling scheme | Branch penalty | CPI  | Speedup v. unpipelined | Speedup v. stall |
|-------------------|----------------|------|------------------------|------------------|
| Stall pipeline    | 3              | 1.60 | 3.1                    | 1.0              |
| Predict taken     | 1              | 1.20 | 4.2                    | 1.33             |
| Predict not taken | 1              | 1.14 | 4.4                    | 1.40             |
| Delayed branch    | 0.5            | 1.10 | 4.5                    | 1.45             |

#### **Precise Exceptions in Static Pipelines**





Key observation: architected state changes only in the memory and register write stages.

## And In Conclusion: Control and Pipelining

- Quantify and summarize performance – Ratios, Geometric Mean, Multiplicative Standard Deviation
- F&P: Benchmarks age, disks fail,1 point fail danger
- Next time: Read Appendix A, record bugs online!
- Control VIA State Machines and Microprogramming
- Just overlap tasks; easy if tasks are independent
- Speed Up ≤ Pipeline Depth; if ideal CPI is 1, then:

$$Speedup = \frac{Pipeline\ depth}{1 + Pipeline\ stall\ CPI} \times \frac{Cycle\ Time_{unpipelined}}{Cycle\ Time_{pipelined}}$$

- Hazards limit performance on computers:
  - Structural: need more HW resources
  - Data (RAW,WAR,WAW): need forwarding, compiler scheduling
  - Control: delayed branch, prediction
- Exceptions, Interrupts add complexity
- Next time: Read Appendix C, record bugs online!
