

EECS151/251A
Fall 2023
Digital Design and
Integrated Circuits

Instructors:
John Wawrzynek

Lecture 23: Adders

## **Announcements**

- Homework 9 posted
- 2 more weeks of lecture
- 2 more homework exercises

## **Outline**



- "Tricks with trees"
- Adder review, subtraction, carry-select
- Carry-lookahead
- Bit-serial addition, summary



**Tricks with Trees** 

# A log(n) lower (time) bound to compute any function of n variables

- Assume we can only use binary operations, each taking unit time
- After 1 time unit, an output can only depend on two inputs
- Use induction to show that after k time units, an output can only depend on 2<sup>k</sup> inputs
  - After log<sub>2</sub> n time units, output depends on at most n inputs
- A binary tree performs such a computation



### Reductions with Trees - Review



If each node (operator) is k-ary instead of binary, what is the delay?

## Trees for optimization





$$((x_0 + x_1) + (x_2 + x_3)) + ((x_4 + x_5) + (x_6 + x_7))$$

- What property of "+" are we exploiting?
- Other associative operators? Boolean operations? Division? Min/Max?
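A balanced-tree reduction can be sketched in a few lines of Python. This is only an illustration, assuming a non-empty input list; the name `tree_reduce` is made up for this example. It returns the result together with the number of operator levels, which is ceil(log2 n) for binary operators.

```python
import operator

def tree_reduce(op, xs):
    """Reduce xs with a binary op using a balanced tree.

    Returns (result, depth), where depth counts operator levels.
    The result matches a left-to-right reduction only if op is associative.
    """
    level = list(xs)
    depth = 0
    while len(level) > 1:
        # combine neighbours pairwise; all pairs in one level work in parallel
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd element passes through unchanged
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth

print(tree_reduce(operator.add, range(8)))      # (28, 3)
print(tree_reduce(operator.sub, [8, 4, 2, 1]))  # (3, 2), but ((8-4)-2)-1 = 1
```

Subtraction (like division) is not associative, so regrouping into a tree changes the answer; add, AND, OR, XOR, min and max are safe.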

## Parallel Prefix, or "Scan"

If "+" is an associative operator and $x_0, \ldots, x_{p-1}$ are the input data, then the parallel prefix operation computes:

$$y_j = x_0 + x_1 + \cdots + x_j \quad \text{for } j = 0, 1, \ldots, p-1$$

that is, the sequence $x_0,\ x_0 + x_1,\ x_0 + x_1 + x_2,\ \ldots$
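One way to compute all prefixes in about log2(p) steps is the doubling scheme sketched below (often called a Hillis-Steele or Kogge-Stone style scan). The function name `parallel_prefix` is an assumption for this example, and the Python loop only models the data flow; in hardware every position in a step would be computed at the same time.

```python
import operator

def parallel_prefix(op, xs):
    """Inclusive scan of xs under an associative op.

    Returns (prefixes, steps); steps is ceil(log2(len(xs))).
    """
    y = list(xs)
    steps = 0
    dist = 1
    while dist < len(y):
        # each position j >= dist combines with the value dist places to its
        # left; operand order (left, right) is kept for non-commutative ops
        y = [y[j] if j < dist else op(y[j - dist], y[j]) for j in range(len(y))]
        dist *= 2
        steps += 1
    return y, steps

print(parallel_prefix(operator.add, [1, 2, 3, 4, 5]))  # ([1, 3, 6, 10, 15], 3)
```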





Adder review, subtraction, carry-select

### 4-bit Adder Example

Motivate the adder circuit design by hand addition:

Add a0 and b0 as follows:

| a | b | r | c (carry to next stage) |
|---|---|---|-------------------------|
| 0 | 0 | 0 | 0                       |
| 0 | 1 | 1 | 0                       |
| 1 | 0 | 1 | 0                       |
| 1 | 1 | 0 | 1                       |

$$r = a \text{ XOR } b = a \oplus b$$

$$c = a \text{ AND } b = ab$$

Add a1 and b1 as follows:

| ci | a | b | r | co |
|----|---|---|---|----|
| 0  | 0 | 0 | 0 | 0  |
| 0  | 0 | 1 | 1 | 0  |
| 0  | 1 | 0 | 1 | 0  |
| 0  | 1 | 1 | 0 | 1  |
| 1  | 0 | 0 | 1 | 0  |
| 1  | 0 | 1 | 0 | 1  |
| 1  | 1 | 0 | 0 | 1  |
| 1  | 1 | 1 | 1 | 1  |

$$r = a \oplus b \oplus c_i$$

$$co = ab + ac_i + bc_i$$
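The two equations can be checked against the truth table with a short Python sketch. The function name `full_adder` is chosen for this example only.

```python
from itertools import product

def full_adder(a, b, ci):
    """One-bit full adder: returns (r, co) for input bits a, b and carry-in ci."""
    r = a ^ b ^ ci                        # sum bit
    co = (a & b) | (a & ci) | (b & ci)    # carry out: majority of the three inputs
    return r, co

# The pair (co, r) is the 2-bit binary count of ones among the inputs.
for ci, a, b in product((0, 1), repeat=3):
    r, co = full_adder(a, b, ci)
    assert a + b + ci == 2 * co + r
```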

## Carry-ripple Adder Revisited

#### Each cell:

$$r_i = a_i \oplus b_i \oplus c_{in}$$

$$c_{out} = a_i c_{in} + a_i b_i + b_i c_{in} = c_{in} (a_i + b_i) + a_i b_i$$



"Full adder cell"

#### 4-bit adder:



- What about subtraction?

### Subtractor/Adder

$$A - B = A + (-B)$$

How do we form -B?

- complement B
- add 1
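A common way to build this, sketched here as an assumption rather than as the exact circuit on the slide, is to XOR every bit of B with a `sub` control signal and feed the same signal into the adder's carry-in, which supplies the "+1". All names below (`ripple_add`, `add_sub`, `to_bits`, `from_bits`) are hypothetical helpers for this example.

```python
def to_bits(x, n):
    """Little-endian list of the n low bits of x."""
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))

def ripple_add(a_bits, b_bits, cin=0):
    """Chain of full-adder cells; returns (sum_bits, carry_out)."""
    s, c = [], cin
    for a, b in zip(a_bits, b_bits):
        s.append(a ^ b ^ c)
        c = (a & b) | (c & (a | b))
    return s, c

def add_sub(a, b, sub, n=4):
    """Compute a + b (sub=0) or a - b (sub=1), modulo 2**n."""
    b_bits = [bit ^ sub for bit in to_bits(b, n)]      # sub=1 complements B
    s, _ = ripple_add(to_bits(a, n), b_bits, cin=sub)  # carry-in adds the 1
    return from_bits(s)

print(add_sub(5, 3, 0), add_sub(5, 3, 1), add_sub(3, 5, 1))  # 8 2 14
```

The last result, 14, is the 4-bit two's complement encoding of -2.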



## Delay in Ripple Adders

Ripple delay amount is a function of the data inputs:



However, we usually only consider the worst-case delay on the critical path. There is always at least one set of input data that exposes the worst-case delay.


## Adders (cont.)

### Ripple Adder



The ripple adder is inherently slow because, in the worst case, s7 must wait for c7, which must wait for c6, and so on.

$T \propto n$, $\text{Cost} \propto n$

How do we make it faster, perhaps with more cost?

## Carry Select Adder



$$T = T_{ripple\_adder} / 2 + T_{MUX}$$

$$COST = 1.5 * COST_{ripple\_adder} + (n/2 + 1) * COST_{MUX}$$
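A behavioral sketch of the two-block organization behind these formulas, assuming an even bit width and little-endian bit lists. The helper names are hypothetical. The upper half is computed twice, once for each possible carry-in, and the lower half's carry-out selects between the results.

```python
def to_bits(x, n):
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))

def ripple_add(a_bits, b_bits, cin=0):
    """Chain of full-adder cells; returns (sum_bits, carry_out)."""
    s, c = [], cin
    for a, b in zip(a_bits, b_bits):
        s.append(a ^ b ^ c)
        c = (a & b) | (c & (a | b))
    return s, c

def carry_select_add(a_bits, b_bits):
    """n-bit add built from three n/2-bit ripple adders and n/2 + 1 muxes."""
    h = len(a_bits) // 2
    lo_sum, lo_c = ripple_add(a_bits[:h], b_bits[:h], 0)
    # Both candidates for the upper half run in parallel with the lower half.
    hi0 = ripple_add(a_bits[h:], b_bits[h:], 0)
    hi1 = ripple_add(a_bits[h:], b_bits[h:], 1)
    hi_sum, cout = hi1 if lo_c else hi0   # muxes: n/2 sum bits plus carry-out
    return lo_sum + hi_sum, cout

s, cout = carry_select_add(to_bits(200, 8), to_bits(100, 8))
print(from_bits(s), cout)  # 44 1  (200 + 100 = 300 = 256 + 44)
```

The three half-width adders account for the 1.5x ripple-adder cost, and the critical path is one half-width ripple plus one mux.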

### Carry Select Adder

Extending Carry-select to multiple blocks



- What is the optimal # of blocks and # of bits/block?
  - If blocks are too small, delay is dominated by total mux delay
  - If blocks are too large, delay is dominated by adder ripple delay

$\sqrt{N}$ stages of $\sqrt{N}$ bits

$T \propto \sqrt{N}$, $\text{Cost} \approx 2 \times \text{ripple} + \text{muxes}$

## Carry Select Adder



- Compare to ripple adder delay:

$T_{total} = 2\sqrt{N}\,T_{FA} - T_{FA}$, assuming $T_{FA} = T_{MUX}$

For ripple adder $T_{total} = N\,T_{FA}$

"cross-over" at N=3, Carry select faster for any value of N>3.

- Is sqrt(N) really the optimum?
  - From right to left, increase the size of each block to better match delays
  - Ex: 64-bit adder, use block sizes [12 11 10 9 8 7 7]; the exact answer depends on the relative delay of mux and FA

(note: one fewer block than the sqrt(N) solution)
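The effect of unequal block sizes can be estimated with a simple unit-delay model. This is a rough sketch under stated assumptions: every full adder and every mux costs one time unit, the least-significant block is a plain ripple adder, and wiring and fan-out are ignored. The function name `carry_select_delay` is made up for this example.

```python
def carry_select_delay(blocks_lsb_first, t_fa=1, t_mux=1):
    """Worst-case delay of a multi-block carry-select adder (unit-delay model).

    blocks_lsb_first lists block widths starting at the least-significant end.
    """
    t = blocks_lsb_first[0] * t_fa            # carry out of the first block
    for size in blocks_lsb_first[1:]:
        # A block's mux can settle only after both its select input (time t)
        # and its two precomputed results (size * t_fa) are ready.
        t = max(t, size * t_fa) + t_mux
    return t

print(carry_select_delay([8] * 8))                    # 15 = 2*sqrt(64) - 1
print(carry_select_delay([7, 7, 8, 9, 10, 11, 12]))   # 13
print(64)                                             # ripple adder: N * T_FA
```

With growing blocks, each block finishes its ripple just as the select signal arrives, so no stage waits on the other.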



Carry-lookahead and Parallel Prefix

- How do we arrange carry generation to be associative?
- Reformulate basic adder stage:

| abc <sub>i</sub> | $C_{i+1}$ | S |                        |
|------------------|-----------|---|------------------------|
| 000              | 0         | 0 | carry "kill"           |
| 001              | 0         | 1 | $k_i = a_i' b_i'$      |
| 010              | 0         | 1 |                        |
| 011              | 1         | 0 | carry "propagate"      |
| 100              | 0         | 1 | $p_i = a_i \oplus b_i$ |
| 101              | 1         | 0 |                        |
| 110              | 1         | 0 | carry "generate"       |
| 111              | 1         | 1 | $g_i = a_i b_i$        |

$$c_{i+1} = g_i + p_i c_i$$
$$s_i = p_i \oplus c_i$$

- Ripple adder using p and g signals:



$$p_i = a_i \oplus b_i$$
$$g_i = a_i b_i$$

- So far, no advantage over ripple adder: $T \propto N$

"Group" propagate and generate signals:



- P true if the group as a whole propagates a carry to c<sub>out</sub>
- G true if the group as a whole generates a carry

$$c_{out} = G + Pc_{in}$$

Group P and G can be generated hierarchically.
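A recursive Python sketch of the hierarchical construction, assuming little-endian bit lists and the combining rule that a group propagates only if both halves propagate, and generates if the upper half generates or the lower half generates and the upper half propagates. The name `group_pg` is hypothetical.

```python
def group_pg(a_bits, b_bits):
    """Return group (P, G) for a block of bits, built by splitting in halves."""
    if len(a_bits) == 1:
        a, b = a_bits[0], b_bits[0]
        return a ^ b, a & b                  # single-bit p_i and g_i
    h = len(a_bits) // 2
    p_lo, g_lo = group_pg(a_bits[:h], b_bits[:h])
    p_hi, g_hi = group_pg(a_bits[h:], b_bits[h:])
    return p_hi & p_lo, g_hi | (p_hi & g_lo)

def to_bits(x, n):
    return [(x >> i) & 1 for i in range(n)]

# c_out = G + P c_in must agree with ordinary addition for a 4-bit group.
for a in range(16):
    for b in range(16):
        P, G = group_pg(to_bits(a, 4), to_bits(b, 4))
        for cin in (0, 1):
            assert (G | (P & cin)) == (a + b + cin) >> 4
```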







# 8-bit Carry Look-ahead Adder with 2-input gates.



### Parallel-Prefix Adders

Can carry generation be made to be a kind of "reduction operation"?

AND reduction

Lowest delay for a reduction is a balanced tree.

- But in this case all intermediate values are required.
- One way is to use "Parallel Prefix" to compute the carries.





$$y_0 = x_0$$
$$y_1 = x_0 x_1$$
$$y_2 = x_0 x_1 x_2$$

Parallel Prefix requires that the operation be associative, but simple carry generation is not!

## Parallel-Prefix Carry Look-ahead Adders

Ground truth specification of all carries directly (no grouping):

$$c_0 = 0$$
$$c_1 = g_0 + p_0 c_0 = g_0$$
$$c_2 = g_1 + p_1 c_1 = g_1 + p_1 g_0$$
$$c_3 = g_2 + p_2 c_2 = g_2 + p_2 g_1 + p_2 p_1 g_0$$
$$c_4 = g_3 + p_3 c_3 = g_3 + p_3 g_2 + p_3 p_2 g_1 + p_3 p_2 p_1 g_0$$

$$c_{i+1} = g_i + p_i c_i$$



Binary (G, P) associative operator

Can be used to form all carries!
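The operator can be written out and its associativity checked exhaustively. The sketch below assumes the usual definition, in which a more-significant pair (G'', P'') combined with a less-significant pair (G', P') gives (G'' + P''G', P''P'). The names `combine` and `carries` are chosen for this example.

```python
from itertools import product

def combine(hi, lo):
    """(G, P) of a span, from its more-significant part hi and less-significant part lo."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return g_hi | (p_hi & g_lo), p_hi & p_lo

# Associativity: exhaustive check over all 2**6 input combinations.
pairs = list(product((0, 1), repeat=2))
assert all(combine(combine(x, y), z) == combine(x, combine(y, z))
           for x, y, z in product(pairs, repeat=3))

def carries(g, p):
    """Return [c_1, ..., c_n] assuming c_0 = 0.

    c_{i+1} is the G component of the prefix covering bits 0..i.
    """
    out, acc = [], None
    for gp in zip(g, p):                 # bit 0 first
        acc = gp if acc is None else combine(gp, acc)
        out.append(acc[0])
    return out

# 11 + 6 with 4-bit operands, bits listed LSB first
print(carries(g=[0, 1, 0, 0], p=[1, 0, 1, 1]))  # [0, 1, 1, 1]
```

Because the operator is associative, the sequential loop in `carries` can be replaced by any parallel prefix network.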

## Parallel Prefix Adder Example





$$G = g_3 + g_2 p_3 + (g_1 + g_0 p_1) p_3 p_2$$
$$= g_3 + g_2 p_3 + g_1 p_3 p_2 + g_0 p_3 p_2 p_1$$
$$= c_4$$

$$s_i = a_i \oplus b_i \oplus c_i = p_i \oplus c_i$$
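Putting the pieces together, a complete prefix adder can be modeled as below. This sketch uses a Kogge-Stone style doubling network as one possible arrangement, which is not necessarily the tree drawn on the slide, and the function name `kogge_stone_add` is an assumption for this example.

```python
def kogge_stone_add(a, b, n=8):
    """Add two n-bit integers with a log-depth prefix network.

    Returns (sum mod 2**n, carry_out), assuming carry-in c_0 = 0.
    """
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    G, P = g[:], p[:]
    dist = 1
    while dist < n:                      # ceil(log2(n)) prefix levels
        G_new, P_new = G[:], P[:]
        for i in range(dist, n):         # all positions update in parallel
            G_new[i] = G[i] | (P[i] & G[i - dist])
            P_new[i] = P[i] & P[i - dist]
        G, P = G_new, P_new
        dist *= 2
    c = [0] + G                          # c_0 = 0, c_{i+1} = G over bits 0..i
    s = sum((p[i] ^ c[i]) << i for i in range(n))   # s_i = p_i XOR c_i
    return s, c[n]

print(kogge_stone_add(200, 100))  # (44, 1)
```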

## Other Parallel Prefix Adder Architectures



Kogge-Stone adder: minimum logic depth, and full binary tree with minimum fan-out, resulting in a fast adder but with a large area

Ladner-Fischer adder: minimum logic depth, large fan-out requirement up to n/2



Brent-Kung adder: minimum area, but high logic depth

Han-Carlson adder: hybrid design combining stages from the Brent-Kung and Kogge-Stone adders

### Carry look-ahead Wrap-up

- Adder delay O(log N).
- Cost?
- Can be applied with other techniques. Group P & G signals can be generated for sub-adders, with another carry propagation technique (for instance ripple) used within the group.
  - For instance, on an FPGA ripple carry up to 32 bits is fast, so CLA is used to extend to larger adders. The CLA tree quickly generates the carry-in for upper blocks.



Bit-serial Addition, Adder summary

## Bit-serial Adder



- Addition of 2 n-bit numbers:
  - takes n clock cycles,
  - uses 1 FF, 1 FA cell, plus registers
  - the bit streams may come from or go to other circuits, therefore the registers might not be needed.
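A cycle-by-cycle model of the bit-serial adder, assuming the operands arrive least-significant bit first and the carry flip-flop is reset to 0 before the first cycle. The name `bit_serial_add` is hypothetical.

```python
def bit_serial_add(a_stream, b_stream):
    """Add two LSB-first bit streams with one full adder and one carry flip-flop.

    Returns (sum_stream, final_carry); each loop iteration is one clock cycle.
    """
    carry_ff = 0                           # flip-flop state, reset to 0
    out = []
    for a, b in zip(a_stream, b_stream):
        out.append(a ^ b ^ carry_ff)       # FA sum output for this cycle
        carry_ff = (a & b) | (carry_ff & (a | b))   # FA carry, latched for next cycle
    return out, carry_ff

# 6 + 7 = 13 with 4-bit operands, LSB first
print(bit_serial_add([0, 1, 1, 0], [1, 1, 1, 0]))  # ([1, 0, 1, 1], 0)
```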

### Adders on FPGAs

- Dedicated carry logic provides fast arithmetic carry capability for high-speed arithmetic functions.
- On Virtex-5
  - Cin to Cout (per bit) delay = 40ps, versus 900ps for F to X delay.
  - 64-bit add delay = 2.5ns.



## Adder Final Words

| Type            | Cost | Delay      |
|-----------------|------|------------|
| Ripple          | O(N) | O(N)       |
| Carry-select    | O(N) | O(sqrt(N)) |
| Carry-lookahead | O(N) | O(log(N))  |
| Bit-serial      | O(1) | O(N)       |

- Dynamic energy per addition for all of these is O(N).
- "O" notation hides the constants. Watch out for this!
- The "real" cost of the carry-select is at least 2X the "real" cost of the ripple. "Real" cost of the CLA is probably at least 2X the "real" cost of the carry-select.
- The actual multiplicative constants depend on the implementation details and technology.
- FPGA and ASIC synthesis tools will try to choose the best adder architecture automatically, assuming you specify addition using the "+" operator, as in "assign A = B + C".