

## EECS151/251A Spring 2019 Digital Design and Integrated Circuits

Instructor: John Wawrzynek

Lecture 26: Wrap-up

## Outline



- Important takeaways
- Exam Topics


## Why Study and Learn Digital Design?

- We expect that many of our graduates will eventually be employed as designers.
- Digital design is not a spectator sport. The only way to learn it, and to appreciate the issues, is to do it.
- To a large extent, it comes with practice/experience (this course is just the beginning).
- Another way to get better is to study other designs. There is not time to do much of this during the semester, but it is a good practice for later.
- However, a significant percentage of our graduates will not be digital designers. What's in it for them?
- Better manager of designers, marketers, field engineers, etc.
- Better researcher/scientist/designer in related areas
- Software engineers, fabrication process development, etc.
- To become a better user of electronic systems.


## In What Context Will You be Designing?

Engineers learn so that they can build. Scientists build so that they can learn.

- Electronic design is a critical tool for most areas of pure science:
- Astrophysics - special electronics used for processing radio antenna signals.
- Genomics - special processing architectures for DNA string matching.
- In general - sensor processing, control, and number crunching.
- Machine Learning now relies heavily on special hardware.
- In some fields, computation has replaced experimentation - particle physics, world weather prediction (fluid dynamics).
- In computer engineering, prototypes are often designed, implemented, and studied to "prove out" an idea. This is common within universities and industrial research labs. Lessons learned and proven ideas are often transferred to industry through licensing, technical communications, or startup companies.
- RISC processors were first proved out at Berkeley and IBM Research


## Designs in Industry

- Of course, companies are the primary employer of designers. They provide useful products to society or government and make a profit for the shareholders.
- Interesting recent shift
- All software giants now have hardware design teams (embedded and chips)
- Google, Amazon, Facebook, Microsoft, ...



## Ten Big Ideas from EECS151

1. Modularity and Hierarchy are important ways to describe and think about digital systems.
2. Parallelism is a key property of hardware systems and distinguishes them from serial software execution.
3. Clocking and the use of state elements (latches, flip-flops, and memories) control the flow of data.
4. Cost/Performance/Power tradeoffs are possible at all levels of the system design.
5. Boolean Algebra and other logic representations.
6. Hardware Description Languages (HDLs) and Logic Synthesis are a central tool for digital design.
7. Datapath + Controller is an effective design pattern.
8. The Finite State Machine abstraction gives us a way to model any digital system - used for designing controllers.
9. Arithmetic circuits are often based on "long-hand" arithmetic techniques.
10. FPGAs + ASICs give us a convenient and flexible implementation technology.

## What We Didn't Cover

- Design Verification and Testing
- Industrial designers spend more than half their time testing and verifying correctness of their designs.
- Some of this was covered in the lab and a bit in lecture. We didn't cover rigorous testing procedures.
- Most industrial products are designed from the start for testability. Important for design verification and later for manufacturing test.
- Related: Fault modeling and fault tolerant design.
- Other High-level Optimization Techniques
- High-level Synthesis - now starting to catch on
- Other High-level Architectures: GPUs, video processing, network routers, ...
- Asynchronous Design


## Most Closely Related Courses

- CS152 Computer Architecture and Engineering
- Design and Analysis of Microprocessors
- Applies basic design concepts from EECS151
- EE241B Digital Integrated Circuits
- Transistor-level design of ICs
- More on Advanced ASIC Tool use
- CS250 VLSI Systems Design
- Advanced-undergrad/grad course
- Design tradeoffs at the chip design level


## Future Design Issues

- Automatic High-level synthesis (HLS) and optimization (with micro-architecture synthesis) and hardware/software codesign.
- Current trend is towards "system on a chip" (SOC) design methodology:
- Pre-designed subsystems (processor cores, bus controllers, memory systems, network interfaces, etc. ) connected with standard on-chip interconnect or bus.
- Strong emphasis on "accelerators".
- A number of alternatives to silicon VLSI have been proposed, including techniques based on:
- Carbon nanotubes*, molecular electronics, quantum mechanics, and biological processes.
- How will these change the way we design systems?

> *In 2012, IBM produced a sub-10 nm carbon nanotube transistor that outperformed silicon on speed and power. "The superior low-voltage performance of the sub-10 nm CNT transistor proves the viability of nanotubes for consideration in future aggressively scaled transistor technologies", according to the abstract of the paper in Nano Letters.

## Final Exam and Project Info

- Exam held in scheduled final exam slot: Friday May 17, 7-10PM, Hearst Gym 251
- "Comprehensive" Final Exam
- Emphasis on second half
- Project interviews
- Project final reports

The exam will take place Friday May 17, 7-10 PM in Hearst Gym 251. The exam comprises a set of questions with 1 point per expected minute of completion, for a total of approximately 90 points. 251A students will be asked to complete extra questions. All students are allowed one 2-sided $8.5 \times 11$ inch sheet of notes. No calculators, phones, or other electronic devices will be allowed. Slide-rules will be permitted.

Topics: The final exam will be comprehensive and test all topics covered this semester. However, emphasis will be placed on topics covered after the midterm exam - those listed below.

1. Sources of Power and Energy consumption in Digital ICs
2. Principles Behind Six Low-power Design Techniques
3. How to Improve Energy Efficiency through Parallelism and Pipelining
4. How to Design a RISC-V Single-Cycle Processor from the ISA
5. Processor Pipelining Hazards and Mechanisms
6. Principle behind and motivations for hardware acceleration
7. Line Drawing Accelerator Design Details
8. Memory Block Internal Architecture
9. SRAM Cell and Read/Write Operation
10. Memory Block Periphery Circuits
11. Memory Decoder Design
12. DRAM Cell and Read/Write Operation
13. Dual-port Memory Architecture
14. Effect of Clock Uncertainties on Maximum Clock Frequency
15. Source of Clock Uncertainties
16. Principle of Good Clock Distribution
17. IR and dI/dt effects in Power distribution
18. Cascading Memory blocks for More Width, Depth, and Ports
19. FIFO Implementation
20. Memory Block Specification in Verilog
21. Serialization versus Parallelization in Iterative Computations
22. Principles of Pipelining and Restrictions of Loops
23. C-Slow Technique for Pipelining Loops
24. Carry Select Adder Design
25. Carry Lookahead and Parallel Prefix Adders
26. Bit-Serial Addition
27. Array Multiplier Design
28. Carry Save Addition
29. Signed Multiplication
30. Booth Encoding
31. Bit-Serial Multiplication
32. CSD Multiplier Design
33. Log and Barrel Shifters Design and Analysis
34. Use of Counters in Controller Design
35. Binary Counter Design and Optimization
36. Ring Counter Design
37. LFSR Implementation
38. List Processor Design and Optimizations
39. Modulo Scheduling
40. Types and Sources of Faults in ICs
41. Hamming Codes

## Ring Counters

- "one-hot" counters 0001, 0010, 0100, 1000, 0001, ...

"Self-starting" version:



## Building an LFSR from a Primitive Polynomial

- For $k$-bit LFSR number the flip-flops with FF1 on the right.
- Find the primitive polynomial of the form $x^{k}+\ldots+1$.
- The feedback path comes from the Q output of the leftmost FF, corresponding to the $x^{k}$ term.
- The $x^{0}=1$ term corresponds to connecting the feedback directly to the D input of FF 1.
- Each term of the form $x^{n}$ corresponds to connecting an xor between FF $n$ and $n+1$.
- 4-bit example, uses $x^{4}+x+1$
- $x^{4} \Leftrightarrow$ FF4's Q output
- $x \Leftrightarrow$ xor between FF1 and FF2
- $1 \Leftrightarrow$ FF1's D input
- To build an 8-bit LFSR, use the primitive polynomial $x^{8}+x^{4}+x^{3}+x^{2}+1$ and connect xors between FF2 and FF3, FF3 and FF4, and FF4 and FF5.
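
The construction rules above can be checked with a small behavioral model. The following is a minimal Python sketch, not the lecture's circuit: the function names `lfsr_step` and `lfsr_period`, the integer state encoding (bit 0 holds FF1), and the seed value are all assumptions made for illustration. A primitive polynomial should give a period of $2^{k}-1$.

```python
def lfsr_step(state, k, exponents):
    """Advance a k-bit LFSR by one clock.

    Bit i of `state` holds FF(i+1), so FF1 is the least significant bit.
    `exponents` lists n for every x^n term below x^k (include 0 for the +1 term).
    """
    mask = (1 << k) - 1
    feedback = (state >> (k - 1)) & 1   # Q output of the leftmost FF (x^k term)
    state = (state << 1) & mask         # each FFn feeds FF(n+1)
    if feedback:
        for n in exponents:
            state ^= 1 << n             # x^0 drives FF1's D, x^n is the xor into FF(n+1)
    return state


def lfsr_period(k, exponents, seed=1):
    """Count clocks until the LFSR state returns to `seed` (seed must be non-zero)."""
    state = lfsr_step(seed, k, exponents)
    period = 1
    while state != seed:
        state = lfsr_step(state, k, exponents)
        period += 1
    return period


print(lfsr_period(4, (1, 0)))        # x^4 + x + 1
print(lfsr_period(8, (4, 3, 2, 0)))  # x^8 + x^4 + x^3 + x^2 + 1
```

Both calls should report the maximal period ($15$ and $255$), since the all-zero state is the only one excluded from the sequence.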



## Total Power $=P_{\text {switching }}+P_{\text {short-circuit }}+P_{\text {leakage }}$



## Six low-power design techniques

* Parallelism and pipelining
* Power-down idle transistors
* Slow down non-critical paths
* Clock gating
* Data-dependent processing
* Thermal management


And so, we can transform this:


Block processes stereo audio. 1/2 of clocks for "left", 1/2 for "right".


THIS MAGIC TRICK BROUGHT TO YOU BY CORY HALL ...

## Single-Cycle RISC-V RV32I Datapath



## Pipelined Processor



## Energy Efficiency of CPU versus ASIC versus FPGA

Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. Understanding sources of inefficiency in general-purpose chips. SIGARCH Comput. Archit. News, 38:37-47, June 2010.


Ian Kuon and Jonathan Rose. Measuring the gap between FPGAs and ASICs. In Proceedings of the 2006 ACM/SIGDA 14th International Symposium on Field Programmable Gate Arrays, FPGA '06, pages 21-30, New York, NY, USA, 2006. ACM.

$\therefore$ FPGA : CPU $= 70\times$
Similar story for performance efficiency

## Line Drawing Algorithm

This version assumes: $x_{0}<x_{1}$, $y_{0}<y_{1}$, slope $\le 45$ degrees

    function line(x0, x1, y0, y1)
        int deltax := x1 - x0
        int deltay := y1 - y0
        int error := deltax / 2
        int y := y0
        for x from x0 to x1
            plot(x, y)
            error := error - deltay
            if error < 0 then
                y := y + 1
                error := error + deltax

Note: error starts at deltax/2 and gets decremented by deltay for each $x$. $y$ gets incremented when error goes negative, therefore $y$ gets incremented at a rate proportional to deltay/deltax (on average, once every deltax/deltay steps of $x$).
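
As a quick check of the pseudocode, here is a direct Python transcription. It is a sketch only: `plot` is replaced by collecting points in a list (an assumption, so the result can be inspected), and the argument order `(x0, x1, y0, y1)` follows the slide.

```python
def line(x0, x1, y0, y1):
    """Integer line drawing for x0 < x1, y0 < y1, slope <= 45 degrees.

    Returns the list of (x, y) points instead of calling plot().
    """
    deltax = x1 - x0
    deltay = y1 - y0
    error = deltax // 2          # integer division, as in the pseudocode
    y = y0
    points = []
    for x in range(x0, x1 + 1):  # "for x from x0 to x1" is inclusive
        points.append((x, y))
        error -= deltay
        if error < 0:
            y += 1
            error += deltax
    return points


print(line(0, 4, 0, 2))  # [(0, 0), (1, 0), (2, 1), (3, 1), (4, 2)]
```

Only additions, subtractions, and a sign test appear inside the loop, which is what makes the algorithm attractive for a hardware accelerator.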

## Memory Architecture Overview

- Word lines used to select a row for reading or writing
- Bit lines carry data to/from periphery
- Core aspect ratio is kept close to 1 to help balance delay on word line versus bit line
- Address bits are divided between the two decoders
- Row decoder used to select word line
- Column decoder used to select one or more columns for input/output of data



## SRAM read/write operations



## Periphery

- Decoders
- Sense Amplifiers
- Input/Output Buffers
- Control / Timing Circuitry

## Row Decoder

- Expands L-K address lines into $2^{L-K}$ word lines

- Example: decoder for 8Kx8 memory block
- core arranged as 256x256 cells
- Need 256 AND gates, each driving one word line
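
The arithmetic behind the 8Kx8 example can be sketched in a few lines of Python. The helper name `split_address` is hypothetical, and reading $L$ as the total number of address bits and $K$ as the number of column-select bits is an assumption about the slide's notation.

```python
def split_address(words, bits_per_word, rows):
    """Split a memory's address bits between row and column decoders.

    Assumes `words` and `rows` are powers of two.
    Returns (total address bits L, row bits L-K, column bits K, columns in core).
    """
    total_cells = words * bits_per_word
    columns = total_cells // rows
    address_bits = (words - 1).bit_length()   # L
    row_bits = (rows - 1).bit_length()        # L - K
    column_bits = address_bits - row_bits     # K
    return address_bits, row_bits, column_bits, columns


# 8K x 8 block with the core arranged as 256 rows
print(split_address(8 * 1024, 8, 256))  # (13, 8, 5, 256)
```

With 8 row-address bits the row decoder needs $2^{8}=256$ AND gates, one per word line, and the remaining 5 bits select one 8-bit word out of the 32 words stored in each row.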


## 1-Transistor DRAM Cell



Write: $C_{S}$ is charged or discharged by asserting WL and BL.

Read: Charge redistribution takes place between the bit line and the storage capacitance. Because $C_{S} \ll C_{BL}$, the voltage swing is small; typically around 250 mV.

- To get sufficient $C_{S}$, a special IC process is used.
- Cell reading is destructive, therefore a read operation is always followed by a write-back.
- The cell loses charge (it leaks away in ms - highly temperature dependent), therefore cells occasionally need to be "refreshed" with a read/write cycle.
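
The small read swing follows from charge conservation when the access transistor connects $C_{S}$ to the precharged bit line. The worked equation below is a sketch: the symbols $V_{S}$ (stored cell voltage), $V_{pre}$ (bit-line precharge voltage), and $V_{final}$, as well as the numeric values, are assumptions chosen for illustration and are not taken from the lecture.

$$
\begin{aligned}
C_{S} V_{S}+C_{BL} V_{pre} & =\left(C_{S}+C_{BL}\right) V_{final} \\
\Delta V=V_{final}-V_{pre} & =\left(V_{S}-V_{pre}\right) \frac{C_{S}}{C_{S}+C_{BL}}
\end{aligned}
$$

For example, with assumed values $C_{S}=30$ fF, $C_{BL}=270$ fF, and $V_{S}-V_{pre}=2.5$ V, the swing is $\Delta V=2.5 \times 30 / 300=0.25$ V, which is why a sense amplifier is needed on every bit line.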


## Dual-ported Memory Internals

- Add decoder, another set of read/write logic, bit lines, word lines:
- Example cell: SRAM

- Repeat everything but cross-coupled inverters.
- This scheme extends up to a couple more ports, then need to add additional transistors.


## Clock Constraints in Edge-Triggered Systems

If the launching edge is late and the receiving edge is early, the data will not be too late if:

$$
t_{clk-q, \max }+t_{logic, \max }+t_{setup}<T_{CLK}-t_{JS, 1}-t_{JS, 2}+\delta
$$

Minimum cycle time is determined by the maximum delays through the logic:

$$
t_{clk-q, \max }+t_{logic, \max }+t_{setup}-\delta+2 t_{JS}<T_{CLK}
$$

Skew can be either positive or negative. Jitter $t_{JS}$ is usually expressed as a peak-to-peak or $n \times$ RMS value.
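
A minimal Python sketch of the cycle-time constraint follows. The function name `min_clock_period` and the delay values (in picoseconds) are assumptions for illustration; the formula is the one stated above, with the same jitter bound applied to both clock edges.

```python
def min_clock_period(t_clk_q_max, t_logic_max, t_setup, skew, jitter):
    """Lower bound on T_CLK: t_clk-q + t_logic + t_setup - skew + 2 * jitter.

    `skew` is delta (receiving edge delay minus launching edge delay) and may be
    negative; `jitter` is t_JS applied to each of the two edges.
    """
    return t_clk_q_max + t_logic_max + t_setup - skew + 2 * jitter


# Hypothetical delays in picoseconds
ideal = min_clock_period(100, 800, 50, skew=0, jitter=0)
worst = min_clock_period(100, 800, 50, skew=-30, jitter=20)
print(ideal, worst)  # 950 1020
```

Negative skew and jitter both lengthen the minimum period, while positive skew relaxes the setup constraint (at the cost of a tighter hold constraint, which this sketch does not model).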

## Clock Uncertainties

Sources of clock uncertainty


## H-Tree



Equal wire length/number of buffers to get to every location

## Power Supply Impedance

- No voltage source is ideal - \|Z\| > 0
- Two principal elements increase Z:
- Resistance of supply lines (IR drop)
- Inductance of supply lines (L•di/dt drop)



## Cascading Memory-Blocks

How to make larger memory blocks out of smaller ones. Increasing the depth. Example: given $1\mathrm{K} \times 8$, want $2\mathrm{K} \times 8$
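
One common way to double the depth is to let the extra (most significant) address bit choose between two blocks while the low bits go to both. The Python model below is a behavioral sketch under that assumption; the class names `Ram1Kx8` and `Ram2Kx8` and the single `access` method are hypothetical, and timing is not modeled.

```python
class Ram1Kx8:
    """Behavioral model of a 1K x 8 memory block."""

    def __init__(self):
        self.cells = [0] * 1024

    def access(self, addr, we=False, din=0):
        if we:
            self.cells[addr] = din & 0xFF   # keep 8 bits
        return self.cells[addr]


class Ram2Kx8:
    """2K x 8 memory built from two 1K x 8 blocks (depth cascading)."""

    def __init__(self):
        self.banks = [Ram1Kx8(), Ram1Kx8()]

    def access(self, addr, we=False, din=0):
        bank = (addr >> 10) & 1             # address bit 10 selects the block
        # Only the selected block sees the write enable; its output is
        # the one passed through (the role of the output mux).
        return self.banks[bank].access(addr & 0x3FF, we, din)


ram = Ram2Kx8()
ram.access(5, we=True, din=0xAB)
ram.access(1024 + 5, we=True, din=0xCD)
print(hex(ram.access(5)), hex(ram.access(1029)))  # 0xab 0xcd
```

In hardware the bank select gates the write enables and drives a 2-to-1 mux on the data outputs.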


## FIFO Implementation Details

- Assume a dual-port memory with asynchronous read and synchronous write.
- Binary counter for each of read and write address. CEs (count enable) controlled by WE and RE.
- Equal comparator to see when pointers match.
- State elements for FULL and EMPTY flags:

| WE | RE | equal$^{*}$ | EMPTY$_{i}$ | FULL$_{i}$ |
| :---: | :---: | :---: | :--- | :--- |
| 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | EMPTY$_{i-1}$ | FULL$_{i-1}$ |
| 0 | 1 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 | 0 |
| 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 0 | 0 |
| 1 | 1 | 1 | EMPTY$_{i-1}$ | FULL$_{i-1}$ |

- Control logic (FSM) with truth-table (draft) shown above.

$^{*}$ Actually need 2 signals: "will be equal after read" and "will be equal after write"
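
The pointer-and-flag scheme can be exercised with a small behavioral model. This Python sketch is one possible reading of the draft truth table: it computes "equal" from the pointer values after the current operation, and it additionally ignores writes when full and reads when empty. That gating, the class name `Fifo`, and the `clock` method are assumptions, not part of the slide.

```python
class Fifo:
    """FIFO with read/write pointers, an equality comparator, and FULL/EMPTY flags."""

    def __init__(self, depth):
        self.depth = depth
        self.mem = [None] * depth
        self.rd = 0            # read address counter
        self.wr = 0            # write address counter
        self.empty = True
        self.full = False

    def clock(self, we=False, re=False, din=None):
        """Apply one clock edge; returns the word read, or None."""
        we = we and not self.full        # assumed: writes to a full FIFO are ignored
        re = re and not self.empty       # assumed: reads from an empty FIFO are ignored
        dout = self.mem[self.rd] if re else None   # asynchronous read
        if we:
            self.mem[self.wr] = din                # synchronous write
        next_wr = (self.wr + 1) % self.depth if we else self.wr
        next_rd = (self.rd + 1) % self.depth if re else self.rd
        equal = next_wr == next_rd       # "will be equal after read/write"
        if we != re:                     # exactly one of WE, RE asserted
            self.empty = re and equal
            self.full = we and equal
        # WE = RE: occupancy does not change, so the flags hold their values
        self.wr, self.rd = next_wr, next_rd
        return dout


f = Fifo(2)
f.clock(we=True, din=1)
f.clock(we=True, din=2)
print(f.full, f.clock(re=True), f.clock(re=True), f.empty)  # True 1 2 True
```

Pointers matching after a write means the FIFO has wrapped around and is full; pointers matching after a read means the reader has caught up and the FIFO is empty.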


## Verilog RAM Specification

```
//
// Single-Port RAM with Asynchronous Read
//
module ramBlock (clk, we, a, di, do);
    input clk;
    input we;                   // write enable
    input [19:0] a;             // address
    input [7:0] di;             // data in
    output [7:0] do;            // data out (note: `do` is a reserved word in SystemVerilog)
    reg [7:0] ram [1048575:0];  // 1M words x 8 bits
    always @(posedge clk) begin // Synch write
        if (we)
            ram[a] <= di;
    end
    assign do = ram[a];         // Asynch read
endmodule
```

What do the synthesis tools do with this?

## Time-Multiplexing

- Time multiplex single ALU for all adds and multiplies:
- Attempts to minimize cost at the expense of time.
- Need to add extra register, muxes, control.

- If we adopt above approach, we can then consider the combinational hardware circuit diagram as an abstract computation-graph.


Using other primitives, other coverings are possible.


- This time-multiplexing "covers" the computation graph by performing the action of each node one at a time. (Sort of emulates it.)


## Limits on Pipelining

- Without FF overhead, throughput improvement $\propto$ number of stages.
- After many stages are added FF overhead begins to dominate:

- Other limiters to effective pipelining:
- clock skew contributes to clock overhead
- unequal stages
- FFs dominate cost
- clock distribution power consumption
- feedback (dependencies between loop iterations)


## Pipelining Loops with Feedback "Loop carry dependency"

However, we can overlap the "nonfeedback" part of the iterations:

Addition is associative and commutative. Therefore we can reorder the computation to shorten the delay of the feedback path:

$$
y_{i}=\left(y_{i-1}+x_{i}\right)+a=\left(a+x_{i}\right)+y_{i-1}
$$

"Shorten" the feedback path.

| add$_{1}$ | $x_{i}+a$ | $x_{i+1}+a$ | $x_{i+2}+a$ |  |
| :--- | :--- | :--- | :--- | :--- |
| add$_{2}$ |  | $y_{i}$ | $y_{i+1}$ | $y_{i+2}$ |



- Pipelining is limited to 2 stages.


## "C-slow" Technique

- Essentially this means we go ahead and cut feedback path:

- This makes operations in adjacent pipeline stages independent and allows full cycle for each:
- C computations (in this case $\mathrm{C}=2$ ) can use the pipeline simultaneously.
- Must be independent.
- Input MUX interleaves input streams.
- Each stream runs at half the pipeline frequency.
- Pipeline achieves full throughput.
Multithreaded Processors use this.

| add$_{1}$ | $x+b$ | $x+b$ | $x+b$ | $x+b$ | $x+b$ | $x+b$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| mult | $ay$ | $ay$ | $ay$ | $ay$ | $ay$ | $ay$ |
| add$_{2}$ | $y$ | $y$ | $y$ | $y$ | $y$ | $y$ |
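
The effect of C-slowing with $C=2$ can be illustrated with a behavioral Python model. This is a sketch under assumptions: the recurrence $y_{i}=a \cdot y_{i-1}+(x_{i}+b)$ is inferred from the table rows, the two feedback registers are modeled as a length-2 delay line, and the names `feedback_loop` and `c_slow_2` are hypothetical.

```python
from collections import deque


def feedback_loop(xs, a, b):
    """Reference: one stream through the unpipelined loop y = a*y + (x + b)."""
    y, ys = 0, []
    for x in xs:
        y = a * y + (x + b)
        ys.append(y)
    return ys


def c_slow_2(stream0, stream1, a, b):
    """Two independent streams interleaved through a loop holding 2 registers."""
    loop_regs = deque([0, 0])          # two registers in the feedback path
    outputs = ([], [])
    interleaved = (x for pair in zip(stream0, stream1) for x in pair)  # input MUX
    for cycle, x in enumerate(interleaved):
        y_prev = loop_regs.popleft()   # value produced two cycles ago (same stream)
        y = a * y_prev + (x + b)
        loop_regs.append(y)
        outputs[cycle % 2].append(y)   # de-interleave the results
    return outputs


print(c_slow_2([1, 2, 3], [10, 20, 30], a=2, b=1))
```

Each stream sees exactly the results of the original loop, but only on every other cycle, so each runs at half the pipeline frequency while the pipeline itself stays fully occupied.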

## Carry Select Adder

- Extending Carry-select to multiple blocks

- What is the optimal \# of blocks and \# of bits/block?
- If blocks too small delay dominated by total mux delay
- If blocks too large delay dominated by adder ripple delay

$$
\begin{aligned}
& T \propto \sqrt{N}, \\
& \text{Cost} \approx 2 \times \text{ripple} + \text{muxes}
\end{aligned}
$$




$$
\begin{aligned}
G & =g_{3}+g_{2} p_{3}+\left(g_{1}+g_{0} p_{1}\right) p_{3} p_{2} \\
& =g_{3}+g_{2} p_{3}+g_{1} p_{3} p_{2}+g_{0} p_{3} p_{2} p_{1} \\
& =c_{4}
\end{aligned}
$$

$$
s_{i}=a_{i} \oplus b_{i} \oplus c_{i}=p_{i} \oplus c_{i}
$$
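
The group-generate expression and the sum equation can be verified exhaustively for a 4-bit block. In this Python sketch, $g_{i}=a_{i} b_{i}$ and $p_{i}=a_{i} \oplus b_{i}$; the function name `cla4` and the group-propagate term `P` (needed only when the block has a carry-in) are assumptions added for illustration.

```python
from itertools import product


def cla4(a, b, c0=0):
    """4-bit carry-lookahead block. Returns (sum, c4, G, P)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]      # generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]    # propagate
    G = (g[3] | (g[2] & p[3]) | (g[1] & p[3] & p[2])
         | (g[0] & p[3] & p[2] & p[1]))
    P = p[3] & p[2] & p[1] & p[0]
    c = [c0]
    for i in range(4):
        c.append(g[i] | (p[i] & c[i]))                   # c_{i+1} = g_i + p_i c_i
    s = sum((p[i] ^ c[i]) << i for i in range(4))        # s_i = p_i xor c_i
    return s, c[4], G, P


# Exhaustive check against ordinary addition
for a, b, c0 in product(range(16), range(16), (0, 1)):
    s, c4, G, P = cla4(a, b, c0)
    assert s + 16 * c4 == a + b + c0
    assert c4 == G | (P & c0)
print("all 512 cases agree")
```

With $c_{0}=0$ the carry out equals $G$, which matches the derivation above.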

## Bit-serial Adder

n-bit shift registers


- A, B, and $R$ held in shift-registers. Shift right once per clock cycle.
- Reset is asserted by controller.
- Addition of two $n$-bit numbers:
- takes $n$ clock cycles,
- uses 1 FF, 1 FA cell, plus registers
- the bit streams may come from or go to other circuits, therefore the registers might not be needed.
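
A cycle-by-cycle model makes the cost trade-off concrete: one full-adder cell and one carry flip-flop are reused for $n$ cycles. The Python sketch below is illustrative; the function name `bit_serial_add` is hypothetical, and the shift registers are modeled by indexing bits of integers, least significant bit first.

```python
def bit_serial_add(a, b, n):
    """Add two n-bit numbers one bit per clock cycle.

    Returns (n-bit result, final carry).
    """
    carry = 0                              # the single FF, cleared by reset
    result = 0
    for cycle in range(n):                 # n clock cycles
        a_bit = (a >> cycle) & 1           # bit shifted out of register A
        b_bit = (b >> cycle) & 1           # bit shifted out of register B
        s = a_bit ^ b_bit ^ carry          # full-adder sum
        carry = (a_bit & b_bit) | (carry & (a_bit ^ b_bit))  # full-adder carry
        result |= s << cycle               # bit shifted into register R
    return result, carry


print(bit_serial_add(5, 6, 4))   # (11, 0)
print(bit_serial_add(15, 1, 4))  # (0, 1)
```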


## Combinational Multiplier (unsigned)



## 2's Complement Multiplication



## Carry-Save Addition

- Speeding up multiplication is a matter of speeding up the summing of the partial products.
- "Carry-save" addition can help.
- Carry-save addition passes (saves) the carries to the output, rather than propagating them.

$$
\text{carry-save add}\left\{\begin{array}{ll}
3_{10} & 0011 \\
c & 0010=2_{10} \\
s & 0110=6_{10}
\end{array}\right.
$$

- In general, carry-save addition takes in 3 numbers and produces 2.
- Sometimes called a " $3: 2$ compressor": 3 input signals into 2 in a potentially lossy operation
- Whereas, carry-propagate takes 2 and produces 1.
- With this technique, we can avoid carry propagation until final addition
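
A bitwise model shows why no carry propagates inside a carry-save adder: every bit position is an independent full adder. The Python sketch below is illustrative, with hypothetical names `carry_save_add` and `sum_partial_products`; the operands in the first call are an assumption chosen to reproduce the $c=0010$, $s=0110$ outputs shown above.

```python
def carry_save_add(x, y, z):
    """3:2 compression: returns (c, s) with c + s == x + y + z."""
    s = x ^ y ^ z                               # per-bit sum, no carry chain
    c = ((x & y) | (x & z) | (y & z)) << 1      # per-bit majority, shifted left
    return c, s


def sum_partial_products(values):
    """Reduce many operands with carry-save adds, then one carry-propagate add."""
    c, s = 0, 0
    for v in values:
        c, s = carry_save_add(c, s, v)
    return c + s                                # the only carry-propagate addition


print(carry_save_add(0b0100, 0b0001, 0b0011))   # (2, 6)
print(sum_partial_products([3, 2, 3]))          # 8
```

Since each carry-save stage has the delay of a single full adder regardless of word width, the slow carry-propagate addition is paid only once at the end.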


## Bit-serial Multiplier

- Bit-serial multiplier ( $\mathrm{n}^{2}$ cycles, one bit of result per n cycles):

- Control Algorithm:

```
repeat n cycles { // outer (i) loop
    repeat n cycles{ // inner (j) loop
        shiftA, selectSum, shiftHI
    }
    shiftB, shiftHI, shiftLOW, reset
}
```

Note: The occurrence of a control signal $x$ means $x=1$. The absence of $x$ means $x=0$.

## Booth recoding

(On-the-fly canonical signed digit encoding!)
current bit pair


A "1" in this bit means the previous stage needed to add 4*A. Since this stage is shifted by 2 bits with respect to the previous stage, adding 4* $A$ in the previous stage is like adding $A$ in this stage!

## Canonic Signed Digit Representation

- CSD represents numbers using $1, \overline{1}, \& 0$ with the least possible number of non-zero digits.
- Strings of 2 or more non-zero digits are replaced.
- Leads to a unique representation.
- To form CSD representation might take 2 passes:
- First pass: replace all occurrences of 2 or more 1's:

$$
01..10 \text { by } 10..\overline{1}0
$$

- Second pass: same as above, plus replace $01\overline{1}0$ by $0010$ and $0\overline{1}10$ by $00\overline{1}0$
- Examples:

$$
\begin{array}{lll}
011101=29 & 0010111=23 & 0110110=54 \\
100\overline{1}01=32-4+1 & 001100\overline{1}=16+8-1 & 10\overline{1}10\overline{1}0=64-16+8-2 \\
 & 010\overline{1}00\overline{1}=32-8-1 & 100\overline{1}0\overline{1}0=64-8-2
\end{array}
$$

- Can we further simplify the multiplier circuits?
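
The examples can be reproduced programmatically. The Python sketch below does not implement the two-pass string replacement described above; it swaps in the standard digit-by-digit non-adjacent-form recurrence, which is known to yield the same canonical result for non-negative integers. The function name `to_csd` and the list-of-digits output format are assumptions.

```python
def to_csd(n):
    """Canonic signed digit form of a non-negative integer.

    Returns digits in {1, 0, -1}, most significant first.
    """
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)    # +1 if n mod 4 == 1, -1 if n mod 4 == 3
            n -= d             # makes n divisible by 4, so the next digit is 0
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits[::-1] or [0]


for value in (29, 23, 54):
    digits = to_csd(value)
    text = "".join({1: "1", 0: "0", -1: "T"}[d] for d in digits)  # T stands for -1
    print(value, text)
```

For a constant-coefficient multiplier, each non-zero digit costs one adder or subtractor, so fewer non-zero digits means fewer partial products.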


## Log Shifter / Rotator

- $\log (N)$ stages, each shifts (or not) by a power of 2 places, $S=\left[s_{2} ; s_{1} ; s_{0}\right]$:
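
A behavioral sketch of the rotator variant, in Python, shows how the select bits compose. The function name `log_rotate_left`, the choice of left rotation, the 8-bit default width, and the stage order are assumptions; the slide's figure may order the stages differently, which does not change the result.

```python
def log_rotate_left(value, shift, width=8):
    """Rotate `value` left by `shift` using log2(width) mux stages.

    Stage k is controlled by select bit s_k and rotates by 2**k or passes
    the value through. Assumes `width` is a power of two.
    """
    mask = (1 << width) - 1
    stages = (width - 1).bit_length()      # log2(width) stages
    for k in range(stages):
        if (shift >> k) & 1:               # select bit s_k
            amount = 1 << k
            value = ((value << amount) | (value >> (width - amount))) & mask
    return value


print(bin(log_rotate_left(0b00000001, 3)))  # 0b1000
print(bin(log_rotate_left(0b10000001, 1)))  # 0b11
```

Cost grows as $N \log N$ 2-to-1 muxes and delay as $\log N$ mux stages, with no decoder needed because the binary shift amount drives the stages directly.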



## Barrel Shifter



Cost/delay?

- (don't forget the decoder)


## Controller using Counters

- State Transition Diagram:
- Assume presence of two binary counters. An "i" counter for the outer loop and "j" counter for inner loop.



## 5. Optimization, Architecture \#4

- Datapath:

- Incremental cost:
- Addition of another register \& mux, adder mux, and control.
- Performance: find max time of the four actions

1. $X \leftarrow$ Memory[NUMA], NUMA $\leftarrow$ NEXT+1;
2. NEXT $\leftarrow$ Memory[NEXT], SUM $\leftarrow$ SUM+X;

$$
\begin{aligned}
& 0.5+1+10+1+0.5=13 \mathrm{~ns} \\
& \text { same for all } \Rightarrow \mathrm{T}>13 \mathrm{~ns}, \mathrm{~F}<77 \mathrm{MHz}
\end{aligned}
$$

## Modulo Scheduling List Processor



- Assuming a single adder and a single-ported memory, the minimal schedule section length $=2$, because both the memory and the adder are used for 2 cycles during one iteration.


wrap-around, decrease subscript

- Finished schedule for 4 iterations:

| memory | next$_{1}$ |  | next$_{2}$ | x$_{1}$ | next$_{3}$ | x$_{2}$ | next$_{4}$ | x$_{3}$ |  |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| adder |  | numa$_{1}$ |  | numa$_{2}$ | sum$_{1}$ | numa$_{3}$ | sum$_{2}$ | numa$_{4}$ | sum$_{3}$ |

## Synchronous Counters

- How do we extend to n-bits?
- Extrapolate from $c^{+}$: $d^{+}=d \oplus abc$, $e^{+}=e \oplus abcd$

- Has difficulty scaling (AND gate inputs grow with $n$ )

- CE is "count enable", allows external control of counting,
- TC is "terminal count", is asserted on highest value, allows cascading, external sensing of occurrence of max value.
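
The toggle rule (a bit flips when all lower-order bits are 1) and the CE/TC signals can be modeled in a few lines of Python. This is a sketch: the function name `counter_step`, the LSB-first bit list, and defining TC as "all bits are 1" independent of CE are assumptions.

```python
def counter_step(bits, ce=1):
    """One clock of an n-bit synchronous binary counter.

    `bits[0]` is the least significant bit ("a"). Returns (next_bits, tc),
    where tc is the terminal count of the current state.
    """
    next_bits = []
    toggle = ce                      # bit 0 toggles whenever counting is enabled
    for q in bits:
        next_bits.append(q ^ toggle) # q+ = q xor (CE and all lower bits)
        toggle &= q                  # AND chain: grows by one input per bit
    tc = int(all(bits))              # asserted on the highest value
    return next_bits, tc


state = [0, 0, 0, 0, 0]              # 5-bit counter: e d c b a = 00000
values = []
for _ in range(34):
    values.append(sum(b << i for i, b in enumerate(state)))
    state, tc = counter_step(state)
print(values[:4], values[30:])       # [0, 1, 2, 3] [30, 31, 0, 1]
```

The running `toggle` term is the wide AND gate that makes this structure hard to scale; cascading smaller counters through TC and CE trades some delay for smaller gates.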


## Types of Faults in Digital Designs

- Design Bugs (function, timing, power draw)
- detected and corrected at design time through testing and verification (simulation, static checks)
- Manufacturing Defects (violation of design rules, impurities in processing, statistical variations)
- post production testing for sorting
- spare on-chip resources for repair
- Runtime Failures (physical effects and environmental conditions)
- assuming design is correct and no manufacturing defects


## Hamming Error Correcting Code

- Use more parity bits to pinpoint bit(s) in error, so they can be corrected.
- Example: Single error correction (SEC) on 4-bit data
- use 3 parity bits, with 4-data bits results in 7-bit code word
- 3 parity bits sufficient to identify any one of 7 code word bits
- overlap the assignment of parity bits so that a single error in the 7-bit word can be corrected
- Procedure: group parity bits so they correspond to subsets of the 7 bits:
- $\mathrm{p}_{1}$ protects bits 1,3,5,7
- $\mathrm{p}_{2}$ protects bits $2,3,6,7$
- $\mathrm{p}_{3}$ protects bits $4,5,6,7$


Bit position number
$\left.\begin{array}{l}001=1_{10} \\ 011=3_{10} \\ 101=5_{10} \\ 111=7_{10}\end{array}\right\} \mathrm{p}_{1}$

Note:
number bits from left to right.
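
The parity groupings above are enough to build a working single-error corrector. The Python sketch below assumes even parity, parity bits at positions 1, 2, and 4, and data bits at positions 3, 5, 6, and 7; the function names `hamming_encode` and `hamming_correct` are hypothetical.

```python
def hamming_encode(data):
    """Encode 4 data bits into a 7-bit code word (positions 1..7, left to right)."""
    word = [0] * 8                                   # index 0 unused
    word[3], word[5], word[6], word[7] = data
    word[1] = word[3] ^ word[5] ^ word[7]            # p1 protects bits 1,3,5,7
    word[2] = word[3] ^ word[6] ^ word[7]            # p2 protects bits 2,3,6,7
    word[4] = word[5] ^ word[6] ^ word[7]            # p3 protects bits 4,5,6,7
    return word[1:]


def hamming_correct(code):
    """Correct up to one flipped bit. Returns (data bits, error position or 0)."""
    word = [0] + list(code)
    c1 = word[1] ^ word[3] ^ word[5] ^ word[7]
    c2 = word[2] ^ word[3] ^ word[6] ^ word[7]
    c3 = word[4] ^ word[5] ^ word[6] ^ word[7]
    syndrome = c1 + 2 * c2 + 4 * c3                  # binary position of the error
    if syndrome:
        word[syndrome] ^= 1                          # flip the bad bit back
    return [word[3], word[5], word[6], word[7]], syndrome


code = hamming_encode([1, 0, 1, 1])
code[4] ^= 1                                         # corrupt position 5
print(hamming_correct(code))                         # ([1, 0, 1, 1], 5)
```

Because each bit position belongs to exactly the parity groups named by the 1s in its binary position number, the failing parity checks spell out the position of the error directly.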

## The End.

- Special thanks to our GSIs: Chris and Arya.
- Good luck on the final.
- Thanks for a great semester!

