



## Instruction Level Parallelism (ILP)

- Another form of parallelism, alongside Request Level Parallelism (RLP) and Data Level Parallelism (DLP)
- RLP: e.g., Warehouse Scale Computing
- DLP: e.g., SIMD, MapReduce
- ILP: e.g., pipelined instruction execution
  - 5-stage pipeline => 5 instructions executing simultaneously, one in each pipeline stage

11/7/11 Fall 2011 – Lecture #31 3



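#### Pipelined Execution Representation

[Figure: pipeline diagram with time on the horizontal axis; each instruction passes through the stages IFtch | Dcd | Exec | Mem | WB.]

- Every instruction must take the same number of steps, also called pipeline "stages", so some stages will sit idle for some instructions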
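The staggered representation above can be sketched in a few lines of Python. This is only an illustration: the stage abbreviations `IF`, `ID`, `EX`, `MEM`, `WB` and the helper names `pipeline_diagram` and `stages_busy` are assumptions, and the model ignores hazards entirely (one new instruction starts every cycle).

```python
# Hypothetical stage names for a 5-stage pipeline (not taken from the slides).
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(instructions):
    """Return one text row per instruction; column c is clock cycle c + 1."""
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instructions):
        cells = ["."] * total_cycles
        # Instruction i enters the pipeline in cycle i + 1, then advances
        # one stage per cycle.
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage
        rows.append(f"{name:<5}" + " ".join(f"{c:<3}" for c in cells).rstrip())
    return rows


def stages_busy(instructions, cycle):
    """Count how many instructions occupy some stage in the given cycle (1-based)."""
    return sum(
        1 for i in range(len(instructions)) if 0 <= cycle - 1 - i < len(STAGES)
    )


if __name__ == "__main__":
    program = ["lw", "sw", "add", "sub", "beq"]
    print("\n".join(pipeline_diagram(program)))
    print("instructions in flight in cycle 5:", stages_busy(program, 5))
```

With five instructions, cycle 5 is the first cycle in which every stage holds a different instruction, which is the "5 instructions executing simultaneously" case.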





#### **Pipeline Performance**

- Assume time for stages is
  - 100 ps for register read or write
  - 200 ps for other stages
- What is the pipelined clock rate?
  - Compare pipelined datapath with single-cycle datapath

| Instr    | Instr fetch | Register read | ALU op | Memory access | Register write | Total time |
|----------|-------------|---------------|--------|---------------|----------------|------------|
| lw       | 200 ps      | 100 ps        | 200 ps | 200 ps        | 100 ps         | 800 ps     |
| sw       | 200 ps      | 100 ps        | 200 ps | 200 ps        |                | 700 ps     |
| R-format | 200 ps      | 100 ps        | 200 ps |               | 100 ps         | 600 ps     |
| beq      | 200 ps      | 100 ps        | 200 ps |               |                | 500 ps     |
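The clock-period question can be worked out directly from the table. The sketch below is a minimal calculation, assuming the single-cycle clock must fit the slowest instruction and the pipelined clock must fit the slowest stage. The dictionary and function names are illustrative, not from the lecture.

```python
# Stage latencies in picoseconds, copied from the table above.
STAGE_TIMES_PS = {
    "fetch": 200,
    "reg_read": 100,
    "alu": 200,
    "mem": 200,
    "reg_write": 100,
}

# Which stages each instruction class actually uses (from the table rows).
INSTR_STAGES = {
    "lw": ["fetch", "reg_read", "alu", "mem", "reg_write"],
    "sw": ["fetch", "reg_read", "alu", "mem"],
    "R-format": ["fetch", "reg_read", "alu", "reg_write"],
    "beq": ["fetch", "reg_read", "alu"],
}


def instruction_time(instr):
    """Total latency of one instruction: the sum of the stages it uses."""
    return sum(STAGE_TIMES_PS[s] for s in INSTR_STAGES[instr])


# Single-cycle datapath: one clock must cover the slowest instruction (lw).
single_cycle_period_ps = max(instruction_time(i) for i in INSTR_STAGES)

# Pipelined datapath: one clock must cover the slowest stage.
pipelined_period_ps = max(STAGE_TIMES_PS.values())

if __name__ == "__main__":
    print("single-cycle period (ps):", single_cycle_period_ps)
    print("pipelined period (ps):", pipelined_period_ps)
```

Under these assumptions the single-cycle clock period is 800 ps and the pipelined clock period is 200 ps, i.e. a 5 GHz pipelined clock versus 1.25 GHz.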

[Figure: execution timeline for lw \$1, 100(\$0); horizontal axis: time in ps (200 to 1800); vertical axis: program execution order; stages shown: Instruction fetch | Reg | ALU | Data | Reg.]

## Pipeline Speedup

- If all stages are balanced
  - i.e., all take the same time
  - $\text{Time between instructions}_{\text{pipelined}} = \dfrac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of stages}}$
- If not balanced, speedup is less
- Speedup due to increased throughput
  - Latency (time for each instruction) does not decrease
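To see why unbalanced stages give less than the ideal speedup, here is a small sketch using the stage times from the Pipeline Performance slide (800 ps single-cycle period, 200 ps pipelined period, 5 stages). The function names are hypothetical, and the model assumes no stalls.

```python
def single_cycle_time_ps(n_instructions, period_ps=800):
    """Each instruction takes one full (long) clock cycle."""
    return n_instructions * period_ps


def pipelined_time_ps(n_instructions, n_stages=5, period_ps=200):
    """First instruction needs n_stages cycles; each later one finishes 1 cycle after."""
    return (n_stages + n_instructions - 1) * period_ps


def speedup(n_instructions):
    return single_cycle_time_ps(n_instructions) / pipelined_time_ps(n_instructions)


if __name__ == "__main__":
    for n in (1, 3, 1000, 1_000_000):
        print(n, "instructions -> speedup", round(speedup(n), 3))
```

For long instruction streams the speedup approaches 800/200 = 4 rather than 5, because the 100 ps register stages are padded out to the 200 ps clock. For a single instruction the pipelined version is actually slower (1000 ps versus 800 ps), which illustrates that latency does not decrease.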


#### Hazards

Situations that prevent starting the next logical instruction in the next clock cycle

1. Structural hazard
   - Required resource is busy (e.g., roommate studying)
2. Data hazard
   - Need to wait for previous instruction to complete its data read/write (e.g., pair of socks in different loads)
3. Control hazard
   - Deciding on control action depends on previous instruction (e.g., how much detergent based on how clean prior load turns out)


#### 1. Structural Hazards

- Conflict for use of a resource
- In MIPS pipeline with a single memory
  - Load/Store requires memory access for data
  - Instruction fetch would have to *stall* for that cycle
    - Causes a pipeline "bubble"
- Hence, pipelined datapaths require separate instruction/data memories
  - In reality, provide separate L1 I\$ and L1 D\$
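The single-memory conflict can be illustrated with a toy scheduler. This is a sketch under stated assumptions: a 5-stage pipeline where the memory stage falls 3 cycles after fetch, only `lw` and `sw` use memory for data, and a stalled fetch simply slips to the next free cycle. The names `fetch_schedule` and `MEM_STAGE_OFFSET` are made up for the example.

```python
MEMORY_USERS = {"lw", "sw"}   # instructions that access memory for data
MEM_STAGE_OFFSET = 3          # MEM stage is 3 cycles after IF (IF, ID, EX, MEM)


def fetch_schedule(opcodes):
    """Return (fetch cycle of each instruction, number of bubbles) with ONE shared memory."""
    mem_busy = set()          # cycles in which a load/store occupies the memory
    fetch_cycles = []
    cycle = 0
    for op in opcodes:
        cycle += 1
        # Instruction fetch must wait while the memory serves a load/store.
        while cycle in mem_busy:
            cycle += 1
        fetch_cycles.append(cycle)
        if op in MEMORY_USERS:
            mem_busy.add(cycle + MEM_STAGE_OFFSET)
    bubbles = fetch_cycles[-1] - len(opcodes) if opcodes else 0
    return fetch_cycles, bubbles


if __name__ == "__main__":
    print(fetch_schedule(["lw", "add", "sub", "or", "and"]))
    print(fetch_schedule(["add", "sub", "or", "and"]))
```

In the first program the `lw` uses memory in cycle 4, so the fourth instruction cannot be fetched until cycle 5, which costs one bubble. With separate instruction and data memories the `mem_busy` check disappears and no bubble occurs.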






## 1. Structural Hazard #2: Registers (2/2)

- Two different solutions have been used:
  - 1) RegFile access is *VERY* fast: takes less than half the time of ALU stage
    - Write to Registers during first half of each clock cycle
    - Read from Registers during second half of each clock cycle
  - 2) Build RegFile with independent read and write ports
- Result: can perform Read and Write during same clock cycle


# Data Hazards (1/2)

Consider the following sequence of instructions

    add $t0, $t1, $t2
    sub $t4, $t0, $t3
    and $t5, $t0, $t6
    or  $t7, $t0, $t8
    xor $t9, $t0, $t10
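A short sketch can flag which of these instructions read `$t0` too early. It assumes a 5-stage pipeline with no forwarding, and a register file that writes in the first half of a cycle and reads in the second half, so a consumer 3 or more instructions after the producer is safe. The helper names `parse` and `raw_hazards` are hypothetical, only three-register instructions are handled, and the check ignores the case where an intermediate instruction overwrites the register.

```python
def parse(instr):
    """Split 'op $rd, $rs, $rt' into (op, destination, [sources])."""
    op, operands = instr.split(None, 1)
    regs = [r.strip() for r in operands.split(",")]
    return op, regs[0], regs[1:]


def raw_hazards(program, safe_distance=3):
    """Return (producer index, consumer index) pairs that read a register too early."""
    parsed = [parse(i) for i in program]
    hazards = []
    for i, (_, dest, _) in enumerate(parsed):
        # Only consumers closer than safe_distance read before the write-back.
        for j in range(i + 1, min(i + safe_distance, len(parsed))):
            if dest in parsed[j][2]:
                hazards.append((i, j))
    return hazards


if __name__ == "__main__":
    program = [
        "add $t0, $t1, $t2",
        "sub $t4, $t0, $t3",
        "and $t5, $t0, $t6",
        "or  $t7, $t0, $t8",
        "xor $t9, $t0 ,$t10",
    ]
    for producer, consumer in raw_hazards(program):
        print(f"{program[consumer]!r} needs the result of {program[producer]!r}")
```

Under these assumptions the `sub` and `and` are flagged, while the `or` reads `$t0` in the same cycle the `add` writes it and the `xor` reads it one cycle later.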





# Data Hazard: Load/Use (1/4)

- Dataflow backward in time is a hazard
- Can't solve all cases with forwarding
  - Must stall the instruction dependent on the load, then forward (more hardware)



# Data Hazard: Load/Use (3/4)

- The instruction slot after a load is called the "<u>load delay</u> slot"
- If that instruction uses the result of the load, then the hardware interlock will stall it for one cycle
- <u>Alternative</u>: If the compiler puts an unrelated instruction in that slot, then no stall
- Letting the hardware stall the instruction in the delay slot is equivalent to putting a nop in the slot (except the latter uses more code space)
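The effect of filling the load delay slot can be sketched by counting load-use stalls before and after reordering. The instruction sequence below is a made-up example, not from the slides, and each instruction is modeled as a hypothetical `(op, destination, sources)` tuple. The model assumes full forwarding, so the only stall is one cycle when the instruction right after a `lw` uses the loaded register.

```python
def load_use_stalls(program):
    """Count one stall for every lw whose result is used by the very next instruction."""
    stalls = 0
    for current, following in zip(program, program[1:]):
        op, dest, _ = current
        _, _, next_sources = following
        if op == "lw" and dest in next_sources:
            stalls += 1
    return stalls


# Illustrative code in the order a naive compiler might emit it.
original = [
    ("lw", "$t1", ["$t0"]),
    ("lw", "$t2", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]),   # uses $t2 right after its load: stall
    ("sw", None, ["$t3", "$t0"]),
    ("lw", "$t4", ["$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),   # uses $t4 right after its load: stall
    ("sw", None, ["$t5", "$t0"]),
]

# Same work, with the third load hoisted so no load delay slot holds a dependent use.
reordered = [
    ("lw", "$t1", ["$t0"]),
    ("lw", "$t2", ["$t0"]),
    ("lw", "$t4", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]),
    ("sw", None, ["$t3", "$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),
    ("sw", None, ["$t5", "$t0"]),
]

if __name__ == "__main__":
    print("stalls, original order:", load_use_stalls(original))
    print("stalls, reordered:", load_use_stalls(reordered))
```

The reordering is legal here because the hoisted load does not depend on any instruction it moves past. A real compiler would also have to check memory dependences against the intervening `sw`, which this sketch does not model.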


# Pipelining and ISA Design

- MIPS Instruction Set designed for pipelining
- All instructions are 32 bits
  - Easier to fetch and decode in one cycle
  - x86: 1- to 15-byte instructions
  - (x86 HW actually translates to internal RISC instructions!)
- Few and regular instruction formats, 2 source register fields always in same place
  - Can decode and read registers in one step
- Memory operands only in Loads and Stores
  - Can calculate address in 3<sup>rd</sup> stage, access memory in 4<sup>th</sup> stage
- Alignment of memory operands
  - Memory access takes only one cycle


## 3. Control Hazards

- Branch determines flow of control
  - Fetching next instruction depends on branch outcome
  - Pipeline can't always fetch correct instruction
    - Still working on ID stage of branch
- BEQ, BNE in MIPS pipeline
- Simple solution, Option 1: stall on every branch until the new PC value is available
  - Would add 2 bubbles (clock cycles) for every branch! (Branches are ~20% of instructions executed)
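The cost of stalling on every branch can be estimated with a one-line CPI calculation. The sketch assumes an ideal CPI of 1 with no other stalls, and uses the slide's figures of 2 bubbles per branch and roughly 20% branches. The function name `cpi_with_branch_stalls` is illustrative.

```python
def cpi_with_branch_stalls(branch_fraction=0.20, bubbles_per_branch=2, base_cpi=1.0):
    """Average cycles per instruction when every branch adds a fixed number of bubbles."""
    return base_cpi + branch_fraction * bubbles_per_branch


if __name__ == "__main__":
    cpi = cpi_with_branch_stalls()
    print("CPI with stall-on-branch:", cpi)
    print("slowdown vs. ideal pipeline:", cpi / 1.0)
```

With these numbers the CPI rises from 1 to about 1.4, i.e. roughly a 40% slowdown relative to the ideal pipeline.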




# Until next time ...

# The BIG Picture

- Pipelining improves performance by increasing instruction throughput: exploits ILP
  - Executes multiple instructions in parallel
  - Each instruction has the same latency
- Subject to hazards
  - Structural, data, control
- Stalls reduce performance
  - But are required to get correct results
- Compiler can arrange code to avoid hazards and stalls
  - Requires knowledge of the pipeline structure
