# EECS150 - Digital Design: Lecture 13 - Accelerators

## March 5, 2013 John Wawrzynek

Spring 2013

EECS150 - Lec13-accelerators


## **Motivation**

- 90/10 rule:
  - Often 90 percent of the program runtime and energy is consumed by 10 percent of the code (inner-loops).
  - Only small portions of an application become the performance bottlenecks.
  - Usually, these portions of code are data processing intensive with relatively fixed dataflow patterns (little control): cryptography, graphics, video, communications signal processing, networking, ...
  - The other 90 percent of the code is not performance critical: UI, control, glue, exceptional cases, ...

Hybrid of processor core and hardware accelerator:

- Hardware accelerator/economizer implements specialized circuits for the inner loops.
- Processor packs the noncritical portions (90% of the code, 10% of the computation) into minimal space.
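
The payoff of accelerating only the inner loops can be estimated with Amdahl's law. The slides do not name this law; it is brought in here as an illustrative assumption, and the symbols $f$ (fraction of runtime spent in the inner loops) and $s$ (speedup of the accelerated portion) are introduced only for this sketch.

$$
S = \frac{1}{(1 - f) + \dfrac{f}{s}}
$$

With $f = 0.9$ and a hypothetical $s = 10$, this gives $S = 1 / (0.1 + 0.09) \approx 5.3$. Even as $s \to \infty$, the overall speedup is bounded by $1 / (1 - f) = 10$, which is why the remaining 10 percent of the computation is left on a small processor rather than optimized further.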

# Energy Efficiency of CPU versus ASIC versus FPGA

Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. Understanding sources of inefficiency in general-purpose chips. SIGARCH Comput. Archit. News, 38:37–47, June 2010.

Ian Kuon and Jonathan Rose. Measuring the gap between FPGAs and ASICs. In Proceedings of the 2006 ACM/SIGDA 14th International Symposium on Field Programmable Gate Arrays, FPGA '06, pages 21–30, New York, NY, USA, 2006. ACM.





#### $\therefore$ FPGA : CPU = 70x

#### Similar story for performance efficiency


# Why is HW more efficient than processors?

- Performance/cost or energy/op:
  1. Exploit problem-specific parallelism, at the thread and instruction level.
  2. Custom "instructions" match the set of operations needed for the algorithm (replacing multiple instructions with one), custom word-width arithmetic, etc.
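
As a rough illustration of the second point, the sketch below shows two operations that cost a general-purpose processor several instructions each. The function names and the choice of operations are assumptions made for this example, not taken from the lecture. In custom hardware, bit reversal is only a rewiring of signals, and a 10-bit adder can be built at exactly that width with no masking step.

```c
#include <stdint.h>

/* Software bit reversal: each of the 32 iterations needs a shift, a mask,
   and an OR. A custom circuit does the same job with crossed wires. */
uint32_t bit_reverse32(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1u);   /* move the lowest bit of x into r */
        x >>= 1;
    }
    return r;
}

/* 10-bit wraparound addition emulated on a 32-bit datapath: the extra AND
   discards the upper bits. Hardware would simply use a 10-bit adder. */
uint32_t add_u10(uint32_t a, uint32_t b)
{
    return (a + b) & 0x3FFu;
}
```

For example, `bit_reverse32(1u)` moves the lowest bit to the top position, and `add_u10(1023, 1)` wraps around to zero.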





#### What about FPGAs?


### "System on Chip" Example

- Three ARM cores, plus lots of accelerators
- Targets smart phones




Figure 1 - NVIDIA Tegra 2 System on a Chip



#### Altera: Dual-Core ARM Cortex-A9 MPCore Processor





### **Custom Hardware in the Pipeline**






**Custom Instructions** 


## **Tightly Coupled Co-processor**



#### **MicroBlaze Fast Simplex Links**



#### **Memory Mapped Accelerator**



- Memory-mapped control/data registers
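
A minimal sketch of how software might drive such an accelerator through memory-mapped control and data registers is given below. The register layout, the bit assignments, and every name (`accel_regs_t`, `ACCEL_START`, `accel_multiply`, and so on) are hypothetical. On real hardware the register block would sit at a fixed bus address; here a small software model stands in for the accelerator so the example can run on an ordinary host.

```c
#include <stdint.h>

/* Hypothetical register layout for this sketch. */
typedef struct {
    volatile uint32_t operand_a;  /* data register, written by the CPU */
    volatile uint32_t operand_b;  /* data register, written by the CPU */
    volatile uint32_t control;    /* bit 0: start */
    volatile uint32_t status;     /* bit 0: done  */
    volatile uint32_t result;     /* data register, read by the CPU */
} accel_regs_t;

#define ACCEL_START 0x1u
#define ACCEL_DONE  0x1u

/* Software stand-in for the hardware. This "accelerator" simply multiplies
   its two operands when the start bit is set. */
void accel_model_step(accel_regs_t *regs)
{
    if (regs->control & ACCEL_START) {
        regs->result = regs->operand_a * regs->operand_b;
        regs->control &= ~ACCEL_START;
        regs->status |= ACCEL_DONE;
    }
}

/* Driver: the CPU uses only ordinary loads and stores to the registers. */
uint32_t accel_multiply(accel_regs_t *regs, uint32_t a, uint32_t b)
{
    regs->operand_a = a;
    regs->operand_b = b;
    regs->status = 0;
    regs->control = ACCEL_START;
    while (!(regs->status & ACCEL_DONE)) {
        accel_model_step(regs);   /* real hardware would run on its own */
    }
    return regs->result;
}

/* Usage: a zeroed register block in host memory replaces the bus address. */
uint32_t accel_demo(void)
{
    accel_regs_t regs = {0};
    return accel_multiply(&regs, 6u, 7u);
}
```

The `volatile` qualifier matters in the real setting: it keeps the compiler from caching register values or removing the polling loop.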

## Memory Mapped Accelerator: Common Variations




# CPU/Accelerator Shared Memory



- Processor instructs accelerator to independently access memory and perform work.
- How does the processor synchronize with the accelerator (how does it know when the accelerator is done)?
- Data cache on the CPU creates a "coherency" issue.
- What about a cache in the accelerator?
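
One common way to answer the synchronization and coherency questions above is sketched here, under stated assumptions. The CPU fills in a job descriptor, flushes its cached copy of the input, starts the accelerator, polls a done flag (an interrupt is the usual alternative), and then invalidates its cached copy of the output. The descriptor layout and all names (`accel_job_t`, `cache_flush`, `cache_invalidate`, `accel_model_run`) are hypothetical. The cache hooks are no-ops on a host machine, and a software function stands in for the accelerator.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical job descriptor that the CPU hands to the accelerator. */
typedef struct {
    const uint32_t *src;      /* input buffer in shared memory  */
    uint32_t *dst;            /* output buffer in shared memory */
    size_t len;               /* number of words to process     */
    volatile uint32_t done;   /* set by the accelerator when finished */
} accel_job_t;

/* Hypothetical cache-maintenance hooks: no-ops here, platform-specific
   operations on a real system with a CPU data cache. */
static void cache_flush(const void *addr, size_t bytes) { (void)addr; (void)bytes; }
static void cache_invalidate(const void *addr, size_t bytes) { (void)addr; (void)bytes; }

/* Software stand-in for the accelerator: doubles each input word,
   writes the outputs, then raises the done flag. */
static void accel_model_run(accel_job_t *job)
{
    for (size_t i = 0; i < job->len; i++)
        job->dst[i] = job->src[i] * 2u;
    job->done = 1;
}

uint32_t shared_memory_demo(void)
{
    uint32_t in[4] = {1, 2, 3, 4};
    uint32_t out[4] = {0};
    accel_job_t job = { in, out, 4, 0 };

    cache_flush(in, sizeof in);          /* make CPU writes visible in memory */
    accel_model_run(&job);               /* real hardware: write a start register */
    while (!job.done) { }                /* poll until the accelerator finishes */
    cache_invalidate(out, sizeof out);   /* discard stale cached copies of dst */

    return out[0] + out[1] + out[2] + out[3];
}
```

If the accelerator had its own cache, the same flush and invalidate steps would be needed on its side as well, unless the hardware keeps the caches coherent.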
