DNN Accelerators and HLS

Qijing (Jenny) Huang

# Outline

- Deep Neural Network (DNN)
- Design Methodology
- Accelerator Architecture
- High-level Synthesis (HLS)

### Learning from the Brain



- The basic computational unit of the brain is **a neuron** 
  - 86B neurons in the brain
- Neurons are connected by roughly **10**<sup>14</sup> to **10**<sup>15</sup> synapses
- Neurons receive input signals from **dendrites** and produce output signals along the **axon**, which interacts with the dendrites of other neurons via **synaptic weights**
- Synaptic weights are learnable and control the strength of influence

\* Slide from http://cs231n.github.io/

### **Neural Networks**



- NNs are usually feed-forward computational graphs constructed from one or more layers
- The "Neuron" computes:
  - Integrate: typically a linear transform (dot product over the receptive field)
  - Fire: followed by a non-linear "activation" function
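The integrate-then-fire behavior of a single neuron can be sketched in a few lines of C++. This is a minimal illustration, not code from the slides: the bias term and the choice of ReLU as the activation are assumptions, and the names `neuron` and `relu` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// "Fire": a non-linear activation (ReLU is assumed here).
float relu(float z) { return std::max(0.0f, z); }

// One neuron: integrate (dot product of inputs and weights, plus an
// assumed bias term), then fire (apply the activation).
float neuron(const std::vector<float>& inputs,
             const std::vector<float>& weights,
             float bias) {
    assert(inputs.size() == weights.size());
    float sum = bias;
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        sum += inputs[i] * weights[i];
    }
    return relu(sum);
}
```

A negative weighted sum yields 0, while a positive one passes through unchanged.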

\* Slide from http://cs231n.github.io/

### **Training vs. Inference**



#### Training

Process for a machine to learn by optimizing models (weights) from labeled data.

Typically computed in the cloud

#### Inference

Using trained models to predict or estimate outcomes from new inputs.

Deployment at the edge

In the Cloud (Training + Inference)

- 10s TFLOPs
- 10s MB on-chip memory
- 8-32 bit precision
- 700 MHz - 1 GHz
- 10-100s Watts



Cloud TPU v3 (45 TFLOP/s)

Many AI chips: ~85 AI chip companies worldwide

At the Edge (Inference)

- 100s-1000s GFLOPs
- 100s KB on-chip memory
- 1-16 bit precision
- 50-400 MHz
- 1-10s Watts



Intel Movidius (4 TFLOP/s)

In the Edge SoC/SiP (Inference)

- 10s-1000s GFLOPs
- 100s KB on-chip memory
- 1-16 bit precision
- 600 MHz - 1 GHz
- 10-100s mWatts



Cambricon-1M IP

\* Data adapted from Prof. Kurt Keutzer's talk at DAC 2018



\* Image from https://www.electronicproducts.com/Digital\_ICs/Designer\_s\_Guide\_Selecting\_AI\_chips\_for\_embedded\_designs.aspx

#### **Computer Vision Applications**



Autonomous Vehicles



Security Camera



Drones



Medical Imaging



Robots



Mobile Applications

#### **Computer Vision Tasks**



Image Classification



Object Detection



Semantic Segmentation



Super Resolution



Activity Recognition

# Deep Neural Network

#### **Common Operations**

- Convolution (dilated, transposed, 3D, etc.)
- ReLU
- Pooling (Average, Max)
- Fully-Connected
- Batch Normalization

#### **Activation/Feature Maps**

- Input images have three dimensions with RGB channels
- Intermediate data have more channels after performing convolution
- We refer to them as feature maps



### Weights/Kernels

- Weights for a full convolution typically have four dimensions:
  - input channels, width, height, output channels
- The input channel size matches the channel dimension of the input features
- The output channel size specifies the channel dimension of the output features



### **3x3 Convolution - Spatially**





Output feature map

Input feature map

- 3x3 Conv with Stride 1 (no striding), No Padding
- Weights = [[0, 1, 2], [2,2,0], [0,1,2]]



Output feature map

Input feature map

- 3x3 Conv with Stride 2, Padding 1
- Weights = [[2, 0, 1], [1,0,0], [0,1,1]]

 $O_{00} = I_{00} \times W_{00} + I_{01} \times W_{01} + I_{02} \times W_{02} + I_{10} \times W_{10} + I_{11} \times W_{11} + I_{12} \times W_{12} + I_{20} \times W_{20} + I_{21} \times W_{21} + I_{22} \times W_{22}$ 
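The sum above generalizes to every output position once stride and zero padding are taken into account. The following is a minimal single-channel sketch under stated assumptions: integer data, zero padding, and the hypothetical names `Matrix` and `conv2d`. It is not taken from the slides.

```cpp
#include <cassert>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Single-channel 2D convolution (cross-correlation, as used in DNNs)
// with zero padding. Output size per dimension:
//   (in + 2 * pad - k) / stride + 1
Matrix conv2d(const Matrix& input, const Matrix& weights, int stride, int pad) {
    assert(stride >= 1 && pad >= 0);
    const int in_h = static_cast<int>(input.size());
    const int in_w = static_cast<int>(input[0].size());
    const int k_h = static_cast<int>(weights.size());
    const int k_w = static_cast<int>(weights[0].size());
    const int out_h = (in_h + 2 * pad - k_h) / stride + 1;
    const int out_w = (in_w + 2 * pad - k_w) / stride + 1;

    Matrix output(out_h, std::vector<int>(out_w, 0));
    for (int oy = 0; oy < out_h; ++oy) {
        for (int ox = 0; ox < out_w; ++ox) {
            int sum = 0;
            for (int ky = 0; ky < k_h; ++ky) {
                for (int kx = 0; kx < k_w; ++kx) {
                    const int iy = oy * stride + ky - pad;
                    const int ix = ox * stride + kx - pad;
                    // Positions that fall in the padding contribute zero.
                    if (iy >= 0 && iy < in_h && ix >= 0 && ix < in_w) {
                        sum += input[iy][ix] * weights[ky][kx];
                    }
                }
            }
            output[oy][ox] = sum;
        }
    }
    return output;
}
```

With a 5x5 input, a 3x3 kernel, stride 2 and padding 1, the output is 3x3, which matches the formula in the comment.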

\* gif from http://deeplearning.net/software/theano\_versions/dev/\_images/

#### 3x3 Convolution - 3D




http://cs231n.github.io/assets/conv-demo/index.html

\* gif from https://cdn-images-1.medium.com/max/800/1\*q95f1mqXAVsj\_VMHaOm6Sw.gif

### **Fully-Connected Layer (FC)**

- Each input activation is connected to every output activation
- Essentially a matrix-vector multiplication
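As a small sketch of the matrix-vector view, the function below computes one fully-connected layer. The row-per-output weight layout and the name `fully_connected` are assumptions made for illustration; bias and activation are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fully-connected layer: out[o] = sum over i of weights[o][i] * in[i].
// weights holds one row per output activation (an assumed layout).
std::vector<int> fully_connected(const std::vector<std::vector<int>>& weights,
                                 const std::vector<int>& in) {
    std::vector<int> out(weights.size(), 0);
    for (std::size_t o = 0; o < weights.size(); ++o) {
        assert(weights[o].size() == in.size());
        for (std::size_t i = 0; i < in.size(); ++i) {
            out[o] += weights[o][i] * in[i];
        }
    }
    return out;
}
```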





#### **ReLU Activation Function**

- Implements the concept of "Firing"
- Introduces non-linearity
- Rectified Linear Unit
  - $R(z) = \max(0, z)$
- Not differentiable at 0



### **Batch Normalization (BN)**

Shifts and scales activations to achieve a <u>zero-centered distribution with unit variance</u>

- Subtracts the mean
- Divides by the standard deviation
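The two steps can be sketched as follows for a flat vector of activations. This is an illustrative sketch only: the small constant `eps` (added to avoid division by zero) and the name `batch_norm` are assumptions, and the learnable scale and shift that usually follow the normalization are omitted.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Normalizes activations to zero mean and (approximately) unit variance.
// eps is an assumed small constant that guards against division by zero.
std::vector<float> batch_norm(const std::vector<float>& x, float eps = 1e-5f) {
    assert(!x.empty());
    float mean = 0.0f;
    for (float v : x) mean += v;
    mean /= static_cast<float>(x.size());

    float var = 0.0f;
    for (float v : x) var += (v - mean) * (v - mean);
    var /= static_cast<float>(x.size());

    const float inv_std = 1.0f / std::sqrt(var + eps);
    std::vector<float> y;
    y.reserve(x.size());
    // Subtract the mean, then divide by the standard deviation.
    for (float v : x) y.push_back((v - mean) * inv_std);
    return y;
}
```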



\* images from https://en.wikipedia.org/wiki/Normal\_distribution

### **Pooling**

- Downsamples the feature map
  - Takes the maximum (max pooling)
  - Takes the average (average pooling)
- Operates on each feature map independently
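A minimal max-pooling sketch for one feature map is shown below. It assumes a square window whose stride equals its size (for example 2x2 with stride 2); `Matrix` and `max_pool` are hypothetical names. Average pooling would replace the running maximum with a sum divided by the window area.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Max pooling on a single feature map with a size x size window and a
// stride equal to the window size (an assumption of this sketch).
Matrix max_pool(const Matrix& input, int size) {
    assert(size >= 1);
    const int out_h = static_cast<int>(input.size()) / size;
    const int out_w = static_cast<int>(input[0].size()) / size;
    Matrix output(out_h, std::vector<int>(out_w));
    for (int oy = 0; oy < out_h; ++oy) {
        for (int ox = 0; ox < out_w; ++ox) {
            int best = input[oy * size][ox * size];
            for (int dy = 0; dy < size; ++dy) {
                for (int dx = 0; dx < size; ++dx) {
                    best = std::max(best, input[oy * size + dy][ox * size + dx]);
                }
            }
            output[oy][ox] = best;
        }
    }
    return output;
}
```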



Figure: pooling example (112x112x64 volume)

\* images from http://cs231n.github.io/convolutional-networks/

#### **Full DNN Example: AlexNet**



| Metric                  | Value |
|-------------------------|-------|
| Top-1 Accuracy          | 57.1% |
| Top-5 Accuracy          | 80.2% |
| Model Size (parameters) | 61M   |
| MACs                    | 725M  |



# Design Methodology

#### **The Roofline Model**



- $\pi$: the peak compute performance
- $\beta$: the peak bandwidth
- $I$: the arithmetic intensity
- The attainable throughput $P$:

$$P = \min(\pi,\ \beta \times I)$$

- **Performance** is upper bounded by <u>the peak performance</u>, <u>the communication</u> <u>bandwidth</u>, and <u>the operational intensity</u>
- Arithmetic (operational) intensity is the ratio of compute operations to memory traffic
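The roofline bound is simple enough to compute directly. The sketch below uses hypothetical function names, and the units (FLOP/s, bytes/s, FLOP/byte) are an assumption; any consistent set of units works.

```cpp
#include <algorithm>
#include <cassert>

// Attainable throughput P = min(pi, beta * I).
//   peak_compute   : pi, peak compute performance (assumed FLOP/s)
//   peak_bandwidth : beta, peak memory bandwidth (assumed bytes/s)
//   intensity      : I, arithmetic intensity (assumed FLOP/byte)
double attainable_throughput(double peak_compute, double peak_bandwidth,
                             double intensity) {
    assert(peak_compute >= 0 && peak_bandwidth >= 0 && intensity >= 0);
    return std::min(peak_compute, peak_bandwidth * intensity);
}

// Arithmetic intensity: operations performed per byte of memory traffic.
double arithmetic_intensity(double operations, double bytes_moved) {
    assert(bytes_moved > 0);
    return operations / bytes_moved;
}
```

A kernel with low intensity lands on the sloped, bandwidth-bound part of the roofline; once `peak_bandwidth * intensity` exceeds `peak_compute`, the kernel is compute-bound.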




Figure from https://arxiv.org/ftp/arxiv/papers/1704/1704.04760.pdf

### **Conv2D to Matrix-Matrix Multiplication**

- Im2Col stores in each column the necessary pixels for each kernel map
  - Duplicates input feature maps in memory
  - Restores output feature map structure



\* Image from http://nmhkahn.github.io/CNN-Practice

#### **Im2col Transform**



\* Image from https://www.researchgate.net/publication/327070011\_Accelerating\_Deep\_Neural\_Networks\_on\_Low\_Power\_Heterogeneous\_Architectures

Image to column operation (im2col): slide over the input image like a convolution, but each patch becomes a column vector.

We get a true performance gain when the kernel has a large number of filters (e.g., F=4) and/or you have a batch of images (N=4). Example: an input batch [4x4x3x4] convolved with 4 filters [2x2x3x2]. The only problem with this approach is the amount of memory.

- Reshaped kernel: [4x12]
- Converted input batch: [12x36]
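The transform and the resulting matrix multiplication can be sketched as below. This is a simplified illustration for a single image with stride 1 and no padding; the `input[c][y][x]` layout and the names `im2col` and `matmul` are assumptions. For a 4x4x3 image and a 2x2 kernel it produces a 12x9 matrix, so a batch of four images gives the 12x36 matrix mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// im2col for one image stored as input[c][y][x], stride 1, no padding.
// Returns a (C*K*K) x (out_h*out_w) matrix: one column per output pixel.
Matrix im2col(const std::vector<Matrix>& input, int k) {
    const int c = static_cast<int>(input.size());
    const int h = static_cast<int>(input[0].size());
    const int w = static_cast<int>(input[0][0].size());
    assert(k >= 1 && k <= h && k <= w);
    const int out_h = h - k + 1;
    const int out_w = w - k + 1;

    Matrix cols(c * k * k, std::vector<int>(out_h * out_w));
    for (int ch = 0; ch < c; ++ch)
        for (int ky = 0; ky < k; ++ky)
            for (int kx = 0; kx < k; ++kx) {
                const int row = (ch * k + ky) * k + kx;
                for (int oy = 0; oy < out_h; ++oy)
                    for (int ox = 0; ox < out_w; ++ox)
                        cols[row][oy * out_w + ox] = input[ch][oy + ky][ox + kx];
            }
    return cols;
}

// Plain matrix-matrix multiplication: (F x CKK) * (CKK x P) -> (F x P).
Matrix matmul(const Matrix& a, const Matrix& b) {
    assert(!a.empty() && a[0].size() == b.size());
    Matrix out(a.size(), std::vector<int>(b[0].size(), 0));
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t p = 0; p < b.size(); ++p)
            for (std::size_t j = 0; j < b[0].size(); ++j)
                out[i][j] += a[i][p] * b[p][j];
    return out;
}
```

Each row of the reshaped kernel matrix holds one flattened filter, so `matmul(kernels, im2col(image, k))` yields one output feature map per row. Overlapping patches are copied into several columns, which is where the extra memory goes.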



\* Image from https://github.com/numforge/laser/wiki/Convolution-optimisation-resources

#### **Conv2D to Matrix-Vector Multiplication**

- For each pixel, we can first perform Matrix-Vector Multiplication along the input channel dimension
- Then we can use an adder tree to sum the results over the K x K pixels (K is the kernel size)
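The two steps can be sketched for a single output pixel as follows. The data layout is an assumption made for illustration: `window[p]` is the input-channel vector at kernel position `p`, and `weights[p]` is the output-channel by input-channel weight slice for that position. The names `matvec`, `adder_tree`, and `conv_pixel` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;
using Mat = std::vector<Vec>;  // Mat[oc][ic]

// Matrix-vector product along the input-channel dimension.
Vec matvec(const Mat& w, const Vec& pixel) {
    Vec out(w.size(), 0);
    for (std::size_t oc = 0; oc < w.size(); ++oc) {
        assert(w[oc].size() == pixel.size());
        for (std::size_t ic = 0; ic < pixel.size(); ++ic)
            out[oc] += w[oc][ic] * pixel[ic];
    }
    return out;
}

// Pairwise (tree-shaped) reduction of the K*K partial result vectors.
Vec adder_tree(std::vector<Vec> partials) {
    assert(!partials.empty());
    while (partials.size() > 1) {
        std::vector<Vec> next;
        for (std::size_t i = 0; i + 1 < partials.size(); i += 2) {
            Vec sum(partials[i]);
            for (std::size_t c = 0; c < sum.size(); ++c)
                sum[c] += partials[i + 1][c];
            next.push_back(sum);
        }
        // An odd element is carried to the next tree level unchanged.
        if (partials.size() % 2 == 1) next.push_back(partials.back());
        partials = next;
    }
    return partials[0];
}

// One output pixel: a matrix-vector product per kernel position,
// followed by the adder tree over all K*K positions.
Vec conv_pixel(const std::vector<Vec>& window, const std::vector<Mat>& weights) {
    assert(window.size() == weights.size());
    std::vector<Vec> partials;
    for (std::size_t p = 0; p < window.size(); ++p)
        partials.push_back(matvec(weights[p], window[p]));
    return adder_tree(partials);
}
```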



Input Channels (IC)

# Accelerator Architecture

# General Architecture

### **Systolic Array**

- A **systolic array** is a homogeneous network of tightly coupled data processing units (DPUs).
- Each **DPU** independently computes a partial result as a function of the data received from its upstream neighbors, stores the result within itself and passes it downstream.
- Advantages of systolic array design:
  - Shorter wires -> lower propagation delay and lower power consumption
  - High degree of pipelining -> faster clock
  - High degree of parallelism -> high throughput
  - Simple control logic -> less design effort
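To make the data movement concrete, here is a small cycle-by-cycle software model of a systolic array computing a matrix product. It is a sketch under stated assumptions: an output-stationary organization (each DPU keeps one element of C), square n x n matrices, values of A moving right and values of B moving down. The name `systolic_matmul` is hypothetical.

```cpp
#include <cassert>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Cycle-by-cycle model of an output-stationary systolic array for C = A * B.
// DPU(i, j) accumulates C[i][j]; A values move right and B values move
// down by one DPU per cycle.
Matrix systolic_matmul(const Matrix& a, const Matrix& b) {
    const int n = static_cast<int>(a.size());
    assert(n > 0 && static_cast<int>(b.size()) == n);
    Matrix a_reg(n, std::vector<int>(n, 0));  // A value held in each DPU
    Matrix b_reg(n, std::vector<int>(n, 0));  // B value held in each DPU
    Matrix c(n, std::vector<int>(n, 0));      // accumulator in each DPU

    // 3n - 2 cycles let the last skewed operands reach DPU(n-1, n-1).
    for (int t = 0; t < 3 * n - 2; ++t) {
        // Shift: each DPU takes the value of its left / upper neighbor.
        for (int i = 0; i < n; ++i)
            for (int j = n - 1; j >= 1; --j)
                a_reg[i][j] = a_reg[i][j - 1];
        for (int i = n - 1; i >= 1; --i)
            for (int j = 0; j < n; ++j)
                b_reg[i][j] = b_reg[i - 1][j];
        // Feed the edges with skewed inputs: row i of A and column j of B
        // enter the array delayed by i and j cycles respectively.
        for (int i = 0; i < n; ++i)
            a_reg[i][0] = (t >= i && t < i + n) ? a[i][t - i] : 0;
        for (int j = 0; j < n; ++j)
            b_reg[0][j] = (t >= j && t < j + n) ? b[t - j][j] : 0;
        // Every DPU performs one multiply-accumulate per cycle.
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                c[i][j] += a_reg[i][j] * b_reg[i][j];
    }
    return c;
}
```

Because of the input skew, the operands `A[i][k]` and `B[k][j]` meet in DPU(i, j) at cycle `i + j + k`, so each DPU only ever talks to its direct neighbors.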



\* Images from http://www.telesens.co/2018/07/30/systolic-architectures/

# Specialized Architecture

### Layer-based Design







AlexNet Design


























### **Spatial Design**



### **Line-Buffer Design**



- Buffers inputs to perform spatial operations
- Buffers inputs for reuse to improve the arithmetic intensity

\* Ritchie Zhao, et al. Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '17)
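A line buffer can be modeled in software as two stored rows plus a small sliding window. The sketch below is a generic 3x3 example written for illustration, not the design from the cited paper; the names `conv3x3_line_buffer` and `Matrix` are hypothetical. Pixels arrive in raster order and each one is read from the input exactly once.

```cpp
#include <cassert>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Streams the image in raster order and applies a 3x3 weighted sum.
// Only the two previous rows (the line buffer) and a 3x3 window are kept.
Matrix conv3x3_line_buffer(const Matrix& image, const Matrix& w) {
    const int h = static_cast<int>(image.size());
    const int wd = static_cast<int>(image[0].size());
    assert(h >= 3 && wd >= 3);
    assert(w.size() == 3 && w[0].size() == 3);

    Matrix line_buf(2, std::vector<int>(wd, 0));
    int window[3][3] = {};
    Matrix out(h - 2, std::vector<int>(wd - 2, 0));

    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < wd; ++x) {
            const int pixel = image[y][x];  // the single read of this pixel
            // Shift the window one column to the left.
            for (int r = 0; r < 3; ++r) {
                window[r][0] = window[r][1];
                window[r][1] = window[r][2];
            }
            // New right column: two buffered rows above plus the new pixel.
            window[0][2] = line_buf[0][x];
            window[1][2] = line_buf[1][x];
            window[2][2] = pixel;
            // Update the line buffer for column x.
            line_buf[0][x] = line_buf[1][x];
            line_buf[1][x] = pixel;
            // The window is valid once three rows and three columns are in.
            if (y >= 2 && x >= 2) {
                int sum = 0;
                for (int r = 0; r < 3; ++r)
                    for (int c = 0; c < 3; ++c)
                        sum += window[r][c] * w[r][c];
                out[y - 2][x - 2] = sum;
            }
        }
    }
    return out;
}
```

Each pixel is reused by up to nine windows without being fetched again, which is how the buffer raises arithmetic intensity.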

| 4 | 2 | 5 | 6 | 9 |
|---|---|---|---|---|
| 1 | 3 | 8 | 7 | 3 |
| 6 | 4 | 2 | 8 | 1 |
|   |   |   |   |   |
| 4 | 8 |   |   |   |

# Mixed-precision Processing Elements

#### **Mixed-Precision Processing Element (PE)**



#### **Spatial PE: 2-bit mode**



16x Parallelism

### **Spatial PE: 4-bit mode**



4x Parallelism

### **Spatial Processing Element: 4-8 bit mode**





**Partial Products** 

2x Parallelism

#### **Mixed-precision PE: Temporal vs. Spatial**





Bit-Serial: Combines results over time

Bit-Parallel: Combines results over space

- Spatial design is normally more efficient in terms of area and power, given the same throughput

\* Images from https://iscaconf.org/isca2018/slides/9A2.pdf
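The difference between the two styles can be illustrated in software with an 8-bit multiply built from smaller pieces. This is a sketch under stated assumptions: unsigned operands, 2-bit building blocks for the spatial version, and one bit per step for the temporal version. The function names are hypothetical and the code does not model the hardware from the referenced slides.

```cpp
#include <cassert>
#include <cstdint>

// Bit-parallel (spatial) composition: split both unsigned 8-bit operands
// into four 2-bit slices. Each of the 16 slice pairs is one small 2-bit
// multiply, and all shifted partial products are summed together.
uint32_t multiply_spatial(uint8_t a, uint8_t b) {
    uint32_t sum = 0;
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            const uint32_t a_slice = (a >> (2 * i)) & 0x3u;
            const uint32_t b_slice = (b >> (2 * j)) & 0x3u;
            sum += (a_slice * b_slice) << (2 * (i + j));
        }
    }
    return sum;
}

// Bit-serial (temporal) composition: consume one bit of b per step and
// accumulate the shifted copy of a over eight steps.
uint32_t multiply_bit_serial(uint8_t a, uint8_t b) {
    uint32_t acc = 0;
    for (int step = 0; step < 8; ++step) {
        if ((b >> step) & 1u) acc += static_cast<uint32_t>(a) << step;
    }
    return acc;
}
```

In the spatial version the 16 small multiplies are independent, so the same 2-bit units could instead serve 16 separate 2-bit products; the temporal version reuses one adder over several steps.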

# High-Level Synthesis

### **High-Level Synthesis**

- Allows users to specify algorithm logic in high-level languages
  - No concept of a clock
  - No need to specify register-transfer level (RTL) activities
- The HLS compiler generates RTL from the high-level algorithmic description
  - Allocation
  - Scheduling
  - Binding
- Advantages:
  - Faster development and debugging cycles
  - More structured code
  - Focus on larger architecture design tradeoffs

#### **HLS Abstraction**

- High-level Languages
  - C/C++, OpenCL, GoLang
- Typical hardware mapping
  - C Function -> Verilog Module
  - Function Arguments -> Memory Ports
  - Basic Blocks (blocks without branches) -> Hardware Logic
  - Operators -> Functional Units
  - Arrays -> BRAMs
  - Control Flow Graph (CFG) -> Finite-state Machine (FSM)
- Limitations:
  - No dynamic memory allocation allowed
  - No recursion support
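A function written with these constraints in mind might look like the sketch below. It is illustrative only: `kSize` and `dot_product` are hypothetical names, and the `PIPELINE` pragma follows the style used in this deck's own listings. A standard C++ compiler ignores the pragma, so the same source can be tested in software first.

```cpp
#include <cassert>

constexpr int kSize = 8;  // fixed size: arrays must be statically sized

// Dot product in an HLS-friendly style: fixed-size array arguments
// (candidates for memory ports), a loop with a constant bound, and no
// dynamic memory allocation or recursion.
int dot_product(const int a[kSize], const int b[kSize]) {
    assert(a != nullptr && b != nullptr);
    int acc = 0;
    for (int i = 0; i < kSize; ++i) {
#pragma HLS PIPELINE II=1
        acc += a[i] * b[i];
    }
    return acc;
}
```

Under the mapping listed above, the function would become a module, the two array arguments memory ports, and the multiply and add functional units driven by a small state machine.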

#### **Example: Matrix Multiplication**

#### **Step 1: Partition Local Arrays**

```
// Local memory to store input and output matrices
int localA[MAX_SIZE][MAX_SIZE];
#pragma HLS ARRAY_PARTITION variable=localA dim=1 complete

int localB[MAX_SIZE][MAX_SIZE];
#pragma HLS ARRAY_PARTITION variable=localB dim=2 complete

int localC[MAX_SIZE][MAX_SIZE];
#pragma HLS ARRAY_PARTITION variable=localC dim=0 complete
```

#### **Step 2: Design Systolic Array (Implicit)**

```
systolic1: for(int k = 0; k < a_col; k++) {
#pragma HLS LOOP_TRIPCOUNT min=c_size max=c_size
#pragma HLS PIPELINE II=1
    systolic2: for(int i = 0; i < MAX_SIZE; i++) {
        systolic3: for(int j = 0; j < MAX_SIZE; j++) {
            // Get previous sum
            int last = (k==0) ? 0 : localC[i][j];

            // Update current sum
            // Handle boundary conditions
            int a_val = (i < a_row && k < a_col)? localA[i][k] : 0;
            int b_val = (k < b_row && j < b_col)? localB[k][j] : 0;
            int result = last + a_val*b_val;

            // Write back results
            localC[i][j] = result;
        }
    }
}
```

#### **Step 2: Design Systolic Array (Explicit)**

```
for (int r = 0; r < N + 2 * MAX_SIZE - 2; r++) {
#pragma HLS pipeline
    // update data (i.e., reads data from previous PE)
    for (int i = 0; i < MAX_SIZE; i++)
        for (int j = MAX_SIZE - 1; j >= 1; j--)
            localA[i][j] = localA[i][j - 1];
    for (int i = MAX_SIZE - 1; i >= 1; i--)
        for (int j = 0; j < MAX_SIZE; j++)
            localB[i][j] = localB[i - 1][j];
    // read new data from inputs
    // not ok here!
    for (int i = 0; i < MAX_SIZE; i++) {
        if (r >= i && r < i + N)
            localA[i][0] = A[i + ii * MAX_SIZE][r - i];
        else
            localA[i][0] = 0;
    }
    for (int j = 0; j < MAX_SIZE; j++) {
        if (r >= j && r < j + N)
            localB[0][j] = B[r - j][j + jj * MAX_SIZE];
        else
            localB[0][j] = 0;
    }
    // PE
    for (int i = 0; i < MAX_SIZE; i++)
        for (int j = 0; j < MAX_SIZE; j++)
            C[i + ii * MAX_SIZE][j + jj * MAX_SIZE] += localA[i][j] * localB[i][j];
}
```

#### **Step 3: Schedule Outer Loop Control Logic and Memory Accesses**

```
// Burst reads on input matrices from global memory
// Read Input A
readA: for(int loc = 0, i = 0, j = 0; loc < a_row*a_col; loc++, j++) {
#pragma HLS LOOP_TRIPCOUNT min=c_size*c_size max=c_size*c_size
#pragma HLS PIPELINE II=1
    if(j == a_col) { i++; j = 0; }
    localA[i][j] = a[loc];
}
// Read Input B
readB: for(int loc = 0, i = 0, j = 0; loc < b_row*b_col; loc++, j++) {
#pragma HLS LOOP_TRIPCOUNT min=c_size*c_size max=c_size*c_size
#pragma HLS PIPELINE II=1
    if(j == b_col) { i++; j = 0; }
    localB[i][j] = b[loc];
}
// Burst write from output matrices to global memory
// Burst write from matrix C
writeC: for(int loc = 0, i = 0, j = 0; loc < c_row*c_col; loc++, j++) {
#pragma HLS LOOP_TRIPCOUNT min=c_size*c_size max=c_size*c_size
#pragma HLS PIPELINE II=1
    if(j == c_col) { i++; j = 0; }
    c[loc] = localC[i][j];
}
```

\* Please see the SDAccel page for detailed source code

#### Resources

- Vivado HLS Design Hubs
- Parallel Programming for FPGAs
- Cornell ECE 5775: High-Level Digital Design Automation
- LegUp: Open-source HLS Compiler
- VTA design example
- Vivado SDAccel design examples

Questions?