### EECS 151/251A Homework 10

Due Monday, May 1, 2023

#### Problem 1: Vector Addition

Imagine you are asked to implement a vector addition module using single-stage (non-pipelined) adders that works at 100 MHz and consumes 10  $\mu$ W (only due to switching power) at a nominal  $V_{dd}$ . Given a switching power constraint of 15  $\mu$ W and an area constraint of 4 adders, what is the maximum possible throughput (number of elements output per second)? You are allowed to change  $V_{dd}$  arbitrarily. Assume that the frequency scales linearly with  $V_{dd}$  and that delay and power of other components are negligible.

#### Problem 2: Power and Energy Efficiency

The following table shows the effect of optimization on the total capacitance C, supply voltage  $V_{dd}$ , clock frequency f, power P, time per task T, and energy per task E. Consider only the switching power.

| Optimization       | C   | $V_{dd}$ | f   | P | T   | $\mid E \mid$ |
|--------------------|-----|----------|-----|---|-----|---------------|
| Logic optimization | 1/k | 1        | t   |   |     |               |
| Lower $V_{dd}$     | 1   | 1/k      | 1/k |   |     |               |
| Denard scaling     | 1/k | 1/k      | k   |   |     |               |
| Pipelining         | 1   | 1        | k   | k | 1/k | 1             |
| Multi-core         | k   | 1        | 1   | k | 1/k | 1             |

- (a) Fill out the blank cells in the table.
- (b) The effect of pipelining in the table is not accurate. What is wrong?
- (c) The effect of multi-core in the table may not be accurate. Explain why.
- (d) Clock gating is another possible optimization. Explain how it helps.
- (e) Power gating turns off a part of chip to cut leakage power as well as switching power. However, it involves some overhead for activation. Explain an alternative technique to reduce leakage power without such an overhead.

## Problem 3: 36-bit Carry Select Adder

For a 36-bit carry selection adder, we need 6 stages of 6-bit adders in the square root method. Assuming an FA and a multiplexer take the same delay, can you minimize the delay furthermore? Explain how.

#### Problem 4: Hierarchcal Carry Select Adder

The key concept of carry select adders is duplicate adders with carry-in assigned to 0 and 1, except for the first stage, and create multiplexers to select one of them. Consider the case when we apply this idea recursively with radix 2. The following figure shows an example for a 4-bit adder, where we reach the terminal case in 2 steps. For simplicity, only the multiplexers for the carry-outs are shown. It contains 1 FA with variable carry-in, 8 FAs with fixed carry-in (although 2 FAs in the last level can be shared), 4 multiplexers for the carry-outs, and 5 multiplexers (not shown) for the sums. Assuming FAs and multiplexers have the same delay, this 4-bit carry select adder has 3 unit delay for the maximum delay. What is the maximum delay and the number of FAs and multiplexers for an 8-bit carry select adder constructed in this way? Do not consider logic sharing between duplicated adders.



## Problem 5: Carry Look-ahead Adder

The figure below shows the structure of 4-bit carry look-ahead adder. Assuming each box takes the same delay uniformly from each input to dependent output, what is the maximum delay? Can you generalize it for N-bit carry look-ahead adder where N is power of 2 larger than 4?



# Problem 6: Parallel Prefix Adder

Draw a block diagram for a 4-bit Brent-Kung adder with carry-in. It takes 9 inputs  $a0, \ldots, a3$ ,  $b0, \ldots, b3$ , and cin, and generates 5 outputs  $s0, \ldots, s3$ , and cout. You are allowed to define the building blocks such as partial full adder as you want.