

EECS 151/251A Spring 2019 Digital Design and Integrated Circuits

Instructor: John Wawrzynek

Lecture 12

The Watt: Unit of power. A rate of energy (J/s). A gas pump hose delivers 6 MW. The Joule: Unit of energy. A 1 Gallon gas container holds 130 MJ of energy.


1 W = 1 J/s.

1 J = 1 W \* s

120 kW: The power delivered by a Tesla Supercharger. Tesla Model S has a 306 MJ battery (good for 265 miles).
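As a quick check on the units, the sketch below turns the figures quoted above into delivery times using time = energy / power. It assumes constant power and no losses, which real pumps and chargers do not achieve; the function and constant names are illustrative only.

```python
# Unit bookkeeping for the figures quoted on the slide.
# Assumption: constant power delivery and no losses.
GAS_ENERGY_PER_GALLON_J = 130e6   # 1 gallon of gasoline: 130 MJ
PUMP_POWER_W = 6e6                # gas pump hose: 6 MW
SUPERCHARGER_POWER_W = 120e3      # Tesla Supercharger: 120 kW
MODEL_S_BATTERY_J = 306e6         # Tesla Model S battery: 306 MJ


def seconds_to_deliver(energy_j: float, power_w: float) -> float:
    """Time = energy / power, since 1 W = 1 J/s."""
    return energy_j / power_w


gallon_time_s = seconds_to_deliver(GAS_ENERGY_PER_GALLON_J, PUMP_POWER_W)
charge_time_s = seconds_to_deliver(MODEL_S_BATTERY_J, SUPERCHARGER_POWER_W)

print(f"Pumping one gallon at 6 MW: {gallon_time_s:.1f} s")
print(f"Charging 306 MJ at 120 kW: {charge_time_s / 60:.1f} min")
```

Under these idealized assumptions a gallon takes about 22 seconds to pump, while a full battery charge takes about 42 minutes.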

# **Energy and Power**

Energy is the ability to do work (W). Power is rate of expending energy. Energy Efficiency: energy per operation

 $P = \frac{dW}{dt}$ 

Handheld and portable (battery operated):
Energy Efficiency - limits battery life
Power - limited by heat



Infrastructure and servers (connected to power grid):

- Energy Efficiency dictates operation cost
- Power: heat removal contributes to total cost of ownership (TCO)

Remember: P = IV



Sad fact: Computers turn electrical energy into heat. Computation is a byproduct.

# **Energy and Performance**

Air or water carries heat away, or chip melts.

The Joule: Unit of energy. Can also be expressed as Watt-Seconds. Burning 1 Watt for 100 seconds uses 100 Watt-Seconds of energy.

1 V across a 1 Ohm resistor drives 1 A: 1 Watt, or 1 Joule of heat energy per second.

The Watt: Unit of power. The amount of energy burned in the resistor in 1 second.

This is how electric tea pots work ... 1 Joule heats 1 gram of water 0.24 degree C.

**20 W rating: Maximum power** the package is able to transfer to the air. Exceed rating and resistor burns.

## Old example: Cooling an iPod nano ...



Like resistor on last slide, iPod relies on passive transfer of heat from case to the air.

Why? Users don't want fans in their pocket ...

To stay "cool to the touch" via passive cooling, power budget of 5 W.

If iPod nano used 5W all the time, its battery would last 15 minutes ...

## Powering an iPod nano (2005 edition)



**1.2 W-hour battery:** Can supply 1.2 watts of power for 1 hour.

1.2 W-hr / 5 W  $\approx$  15 minutes.

More W-hours require bigger battery and thus bigger "form factor" - it wouldn't be "nano" anymore :-).

Real specs for iPod nano : 14 hours for music, 4 hours for slide shows.

85 mW for music. 300 mW for slides.
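The battery arithmetic on these slides can be reproduced with a few lines. This is a minimal sketch; the helper names are hypothetical, and it assumes the battery delivers its full rated capacity at a constant average power.

```python
BATTERY_WH = 1.2  # 2005 iPod nano battery rating, from the slide


def battery_life_hours(capacity_wh: float, power_w: float) -> float:
    """Hours of operation at a constant power draw."""
    return capacity_wh / power_w


def average_power_mw(capacity_wh: float, life_hours: float) -> float:
    """Average power (mW) implied by a quoted battery life."""
    return 1000.0 * capacity_wh / life_hours


life_at_5w_min = 60 * battery_life_hours(BATTERY_WH, 5.0)
music_mw = average_power_mw(BATTERY_WH, 14)   # 14 hours of music
slides_mw = average_power_mw(BATTERY_WH, 4)   # 4 hours of slide shows

print(f"At 5 W: {life_at_5w_min:.1f} minutes")
print(f"Music: {music_mw:.0f} mW, slides: {slides_mw:.0f} mW")
```

The exact values are 14.4 minutes, about 86 mW, and 300 mW, which the slides round to 15 minutes, 85 mW, and 300 mW.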

# What's happened since 2005?

2010 nano 0.74 ounces (50% of 2005 Nano)

"Up to" 24 hours audio playback. 70% improvement from 2005 nano.



0.39 W Hr (33% of 2005 Nano)

# 2015 **Apple** Watch



3.8 V, 0.78 Wh lithium-ion battery on 38mm model. Apple claims the 205 mAh battery should provide up to 18 hours of use (which translates to 6.5 hours of audio playback, 3 hours of talk time, or 72 hours of Power Reserve mode.)





2.1 Wh battery – 5.3x as much energy as 2010 Nano.

Battery life very usage dependent.

640 x 360 Liquid Crystal on Silicon (LCoS) prism projector.

1.76 ounces - 4X the weight of iPod Shuffle


Logic Board



#### 4.7 inch iPhone6: 1,810mAh battery @3.8V = 6.88 Wh



iPhone 5s: 1570mAh @3.8V = 6 Wh
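Battery ratings quoted in mAh become energy once multiplied by the nominal voltage. The sketch below applies that conversion to the ratings quoted above; the function names are illustrative, and treating the voltage as constant over the discharge is an assumption.

```python
def watt_hours(capacity_mah: float, voltage_v: float) -> float:
    """Energy in Wh = charge in Ah * nominal voltage (assumed constant)."""
    return capacity_mah / 1000.0 * voltage_v


def joules(energy_wh: float) -> float:
    """1 Wh = 3600 J, since 1 W = 1 J/s and 1 hour = 3600 s."""
    return energy_wh * 3600.0


iphone6_wh = watt_hours(1810, 3.8)
iphone5s_wh = watt_hours(1570, 3.8)
watch_wh = watt_hours(205, 3.8)

for name, wh in [("iPhone 6", iphone6_wh),
                 ("iPhone 5s", iphone5s_wh),
                 ("Apple Watch 38mm", watch_wh)]:
    print(f"{name}: {wh:.2f} Wh = {joules(wh):.0f} J")
```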

- The front side of the logic board:
  - Apple A8 APL1011 SoC + SK Hynix RAM as denoted by the markings H9CKNNN8KTMRWR-NTH (we presume it is 1 GB LPDDR3 RAM, the same as in the iPhone 6 Plus)
  - Qualcomm MDM9625M LTE Modem
  - Skyworks 77802-23 Low Band LTE PAD
  - Avago A8020 High Band PAD
  - Avago A8010 Ultra High Band PA + FBARs
  - SkyWorks 77803-20 Mid Band LTE PAD
  - InvenSense MP67B 6-axis Gyroscope and Accelerometer Combo



The A8 is manufactured on a 20 nm process by TSMC. It contains 2 billion transistors. Its physical size is 89 mm^2. It has 1 GB of LPDDR3 RAM included in the package. It is dual core, and has a frequency of 1.38 GHz.

## Notebooks ... as designed in 2006 ...

#### 2006 Apple MacBook -- 5.2 lbs



#### 12.8 in

- Performance: Must be "close enough" to desktop performance ... most people no longer used a desktop (even in 2006).
- Size and Weight. Ideal: paper notebook.
- Heat: No longer "laptops" -- top may get "warm", bottom "hot". Quiet fans OK.

## Battery: Set by size and weight limits ...


**46x more energy than iPod nano battery. And iPod lets you listen to music for 14 hours!** 

Almost full 1 inch depth. Width and height set by available space, weight.

At 1 GHz, CPU consumes 13 Watts. "Energy saver" option uses this mode ...

**Battery rating:** 55 W-hour.

At 2.3 GHz, Intel Core Duo CPU consumes 31 W running a heavy load - under 2 hours battery life! And, just for CPU!



#### MacBook Air ... design the laptop like an iPod



#### Mainboard: fills about 25% of the laptop



35 W-h battery: 63% of 2006 MacBook's 55 W-h

#### MacBook Air: Full <u>PC</u>

Thunderbolt I/O

#### Platform Controller Hub

#### Core i5 CPU/GPU





#### Up to 4GB DRAM



#### 50Wh is 180,000 Joules!



## **Servers: Total Cost of Ownership (TCO)**



Reliability: running computers hot makes them fail more often. Machine rooms are expensive. Removing heat dictates how many servers to put in a machine room.

**Electric bill** adds up! Powering the servers + powering the air conditioners is a big part of TCO.

Computations per W-h double every 1.6 years, going back to the first computer (Jonathan Koomey, Stanford).



# **CMOS Circuits and Energy**

## **Switching Energy: Fundamental Physics**

#### **Every logic transition dissipates energy.**



How can we limit switching energy?

1. Reduce # of clock transitions. But we have work to do ...
2. Reduce Vdd. But lowering Vdd limits the clock speed ...
3. Fewer circuits. But more transistors can do more work.
4. Reduce C per node. One reason why we scale processes.
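The four levers can be read off a simple model. The sketch below uses the per-transition energy $\frac{1}{2} C V_{dd}^2$ and, as an assumption not taken from the slide, the common chip-level estimate activity × C × Vdd² × f, where `activity` is the fraction of the capacitance that makes a 0→1 transition each cycle. All numeric values are hypothetical.

```python
def transition_energy_j(c_farads: float, vdd: float) -> float:
    """Energy dissipated in one 0->1 or 1->0 transition: (1/2) C Vdd^2."""
    return 0.5 * c_farads * vdd ** 2


def dynamic_power_w(activity: float, c_farads: float,
                    vdd: float, f_hz: float) -> float:
    """Assumed chip-level model: activity * C * Vdd^2 * f.

    Each 0->1 transition draws C*Vdd^2 from the supply; half is
    dissipated while charging and half on the later discharge.
    """
    return activity * c_farads * vdd ** 2 * f_hz


# Hypothetical chip: 1 nF of switched capacitance, 10% activity, 1 GHz.
baseline = dynamic_power_w(0.1, 1e-9, 1.0, 1e9)
half_vdd = dynamic_power_w(0.1, 1e-9, 0.5, 1e9)       # lever (2)
half_activity = dynamic_power_w(0.05, 1e-9, 1.0, 1e9)  # lever (1)
half_cap = dynamic_power_w(0.1, 0.5e-9, 1.0, 1e9)      # levers (3), (4)

print(f"baseline      {baseline:.3f} W")
print(f"half Vdd      {half_vdd:.3f} W")
print(f"half activity {half_activity:.3f} W")
print(f"half C        {half_cap:.3f} W")
```

Because voltage enters squared, halving Vdd cuts this estimate to a quarter, while halving activity or capacitance only halves it.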

#### **Chip-Level "Dynamic" Power**



# Additional Dynamic Power - "short circuit current"



[Figure: $V_{in}$ and short-circuit current (peak $I_{peak}$) versus time $t$, with $V_t$ marked.]

When gate switches, brief period when both pullup network and pulldown network could be on.

Worse when input is changing slowly compared to the output.

## **Another Factor: Leakage Currents**

Even when a logic gate isn't switching, it burns power.



$I_{sub}$: Even when this nFET is off, it passes an $I_{off}$ leakage current.

We can engineer any $I_{off}$ we like, but a lower $I_{off}$ also results in a lower $I_{on}$, and thus a lower maximum clock speed.

$I_{gate}$: Ideal capacitors have zero DC current. But modern transistor gates are a few atoms thick, and are not ideal.

#### Intel's 2006 processor designs, leakage vs switching power



A lot of work was done to get a ratio this good ... 50/50 is common.

Bill Holt, Intel, Hot Chips 17.

#### Engineering "On" Current at 25 nm ...

We can increase $I_{on}$ by raising V<sub>dd</sub> and/or lowering V<sub>t</sub>.



#### Plot on a "Log" Scale to See "Off" Current



## **Device engineers trade speed and power**



From: "Silicon Device Scaling to the Sub-10-nm Regime", Meikei Ieong, Bruce Doris, Jakub Kedzierski, Ken Rim, Min Yang.

#### **Customize processes for product types ...**



From: "Facing the Hot Chips Challenge Again", Bill Holt, Intel, presented at Hot Chips 17, 2005.



Transistor channel is a raised fin. Gate controls channel from sides and top. Channel depth is fin width. 12-15nm for L=22nm.







## **Dynamic versus Leakage Power**



Figure 1: The reduction of feature sizes from 45 to 7nm may induce drastic gains in power consumption and leakage power [Xie2015]

Xie, Q. (2015). Performance Comparisons between 7-nm FinFET and Conventional Bulk CMOS Standard Cell Libraries. IEEE Transactions on Circuits and Systems II: Express Briefs, 62(8), 761-765.

## Total Power = $P_{switching}$ + $P_{short-circuit}$ + $P_{leakage}$







## **Six low-power design techniques**

- Parallelism and pipelining
- Power-down idle transistors
- Slow down non-critical paths
- Clock gating
- Data-dependent processing
- Thermal management

Design Technique #1 (of 6)

# **Trading Hardware for Power**

via Parallelism and Pipelining ...



THIS MAGIC TRICK BROUGHT TO YOU BY CORY HALL ...

### Chandrakasan & Brodersen (UCB, 1992)

| Architecture       | Voltage | Area<br>(normalized) | Power<br>(normalized) |
|--------------------|---------|----------------------|-----------------------|
| Simple             | 5V      | 1                    | 1                     |
| Parallel           | 2.9V    | 3.4                  | 0.36                  |
| Pipelined          | 2.9V    | 1.3                  | 0.39                  |
| Pipelined-Parallel | 2.0V    | 3.7                  | 0.2                   |
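The normalized power figures follow from scaling capacitance, voltage, and clock rate together. The sketch below is a hedged reconstruction: the voltages come from the table, but the capacitance factors (2.15, 1.15, 2.5) and the half-rate clock for the parallel designs are not on the slide. They are recalled from the cited Chandrakasan and Brodersen work and should be treated as assumptions.

```python
V_REF = 5.0  # supply voltage of the simple reference design

# name -> (capacitance factor, supply voltage, clock-rate factor)
# Capacitance and clock-rate factors are ASSUMED, not from the slide.
ARCHITECTURES = {
    "Simple": (1.0, 5.0, 1.0),
    "Parallel": (2.15, 2.9, 0.5),
    "Pipelined": (1.15, 2.9, 1.0),
    "Pipelined-Parallel": (2.5, 2.0, 0.5),
}


def normalized_power(c_factor: float, vdd: float, f_factor: float,
                     v_ref: float = V_REF) -> float:
    """Power relative to the reference, using P proportional to C * V^2 * f."""
    return c_factor * (vdd / v_ref) ** 2 * f_factor


results = {name: normalized_power(*params)
           for name, params in ARCHITECTURES.items()}

for name, power in results.items():
    print(f"{name:20s} {power:.2f}")
```

The extra hardware buys timing slack at the same throughput, and the slack is spent on a lower supply voltage, where the quadratic term more than pays for the added capacitance.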











#### Pipelined

Minimizing Power Consumption in CMOS Circuits Anantha P. Chandrakasan Robert W. Brodersen

#### Example: Intel Graphics Pipeline IP



A 2.05 GVertices/s 151 mW Lighting Accelerator for 3D Graphics Vertex and Pixel Shading in 32 nm CMOS

Farhana Sheikh, Member, IEEE, Sanu K. Mathew, Member, IEEE, Mark A. Anders, Member, IEEE, Himanshu Kaul, Member, IEEE, Steven K. Hsu, Member, IEEE, Amit Agarwal, Member, IEEE, Ram K. Krishnamurthy, Fellow, IEEE, and Shekhar Borkar, Fellow, IEEE

#### Clock Rate and Power vs Voltage



A 2.05 GVertices/s 151 mW Lighting Accelerator for 3D Graphics Vertex and Pixel Shading in 32 nm CMOS

Farhana Sheikh, Member, IEEE, Sanu K. Mathew, Member, IEEE, Mark A. Anders, Member, IEEE, Himanshu Kaul, Member, IEEE, Steven K. Hsu, Member, IEEE, Amit Agarwal, Member, IEEE, Ram K. Krishnamurthy, Fellow, IEEE, and Shekhar Borkar, Fellow, IEEE

# **Multiple Cores for Low Power**

#### Trade hardware for power, on a large scale ...

# Cell: The PS3 chip















### Cell (PS3 Chip): 1 CPU + 8 "SPUs"



### A "Schmoo" plot for a Cell SPU ...




### Clock speed alone doesn't help E/op ...

But, lowering clock frequency while keeping voltage constant spreads the same amount of work over a longer time, so chip stays cooler ...

$$E_{0\to 1} = \frac{1}{2} C V_{dd}^2 \qquad E_{1\to 0} = \frac{1}{2} C V_{dd}^2$$

| Vdd (Volt) | 2 | 2.2 | 2.4 | 2.6 | 2.8 | 3 | 3.2 | 3.4 | 3.6 | 3.8 | 4 | 4.2 | 4.4 | 4.6 | 4.8 | 5 | 5.2 |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 1.3 | 48C<br>4W | 49C<br>4W | 50C<br>5W | 50C<br>6W | 51C<br>6W | 52C<br>7W | 53C<br>7W | 54C<br>7W | 55C<br>8W | 56C<br>8W | 57C<br>9W | 58C<br>9W | 59C<br>10W | 60C<br>10W | 61C<br>10W | 63C<br>11W | 61C |
| 1.2 | 39C<br>2W | 39C<br>3W | 40C<br>3W | 41C<br>4W | 42C<br>4W | 42C<br>4W | 43C<br>5W | 44C<br>5W | 45C<br>5W | 45C<br>5W | 46C<br>6W | 47C<br>6W | 47C<br>7W | 48C | 49C | | |
| 1.1 | 32C<br>2W | 33C<br>2W | 33C<br>3W | 35C<br>3W | 35C<br>3W | 36C<br>3W | 36C<br>4W | 37C<br>4W | 37C<br>4W | 38C<br>4W | 38C<br>4W | 39C | 39C | | | | |
| 1 | 28C<br>2W | 28C<br>2W | 29C<br>2W | 29C<br>2W | 30C<br>2W | 30C<br>3W | 30C<br>3W | 31C<br>3W | 31C<br>3W | 31C<br>3W | 32C | | | | | | |
| 0.9 | 25C<br>1W | 26C<br>1W | 26C<br>1W | 26C<br>2W | 27C<br>2W | 27C<br>2W | 27C | | | | | | | | | | |

Rows: Vdd (Volt). Columns: Freq (GHz). Each cell gives die temperature and power. Blank cells: failed.

### Scaling V and f does lower energy/op

#### 1 W to get 2.2 GHz performance. 26 C die temp.

#### 7 W to reliably get 4.4 GHz performance. 47 C die temp.

#### If a program that needs a 4.4 GHz CPU can be recoded to use two 2.2 GHz CPUs ... big win.
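To put a number on the "big win", the sketch below compares the two operating points quoted above. It assumes the recoded program parallelizes perfectly across the two slower cores and that a clock cycle does the same work on either core, neither of which is guaranteed in practice.

```python
# Operating points quoted from the shmoo table.
FAST_CORE = {"power_w": 7.0, "freq_ghz": 4.4}   # Vdd = 1.2 V
SLOW_CORE = {"power_w": 1.0, "freq_ghz": 2.2}   # Vdd = 0.9 V


def energy_per_cycle_nj(power_w: float, freq_ghz: float) -> float:
    """W / GHz = (J/s) / (1e9 cycles/s) = nJ per cycle."""
    return power_w / freq_ghz


fast_nj = energy_per_cycle_nj(**FAST_CORE)
slow_nj = energy_per_cycle_nj(**SLOW_CORE)

# Two slow cores match the fast core's aggregate cycle rate
# (assumption: perfect parallel speedup).
two_slow_power_w = 2 * SLOW_CORE["power_w"]
power_ratio = FAST_CORE["power_w"] / two_slow_power_w

print(f"one 4.4 GHz core:  {fast_nj:.2f} nJ/cycle, 7 W")
print(f"two 2.2 GHz cores: {slow_nj:.2f} nJ/cycle, {two_slow_power_w:.0f} W")
print(f"power ratio: {power_ratio:.1f}x")
```

Under these assumptions the two-core version delivers the same cycle rate for 2 W instead of 7 W.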

| Vdd (Volt) | 2 | 2.2 | 2.4 | 2.6 | 2.8 | 3 | 3.2 | 3.4 | 3.6 | 3.8 | 4 | 4.2 | 4.4 | 4.6 | 4.8 | 5 | 5.2 |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 1.3 | 48C<br>4W | 49C<br>4W | 50C<br>5W | 50C<br>6W | 51C<br>6W | 52C<br>7W | 53C<br>7W | 54C<br>7W | 55C<br>8W | 56C<br>8W | 57C<br>9W | 58C<br>9W | 59C<br>10W | 60C<br>10W | 61C<br>10W | 63C<br>11W | 61C |
| 1.2 | 39C<br>2W | 39C<br>3W | 40C<br>3W | 41C<br>4W | 42C<br>4W | 42C<br>4W | 43C<br>5W | 44C<br>5W | 45C<br>5W | 45C<br>5W | 46C<br>6W | 47C<br>6W | **47C<br>7W** | 48C | 49C | | |
| 1.1 | 32C<br>2W | 33C<br>2W | 33C<br>3W | 35C<br>3W | 35C<br>3W | 36C<br>3W | 36C<br>4W | 37C<br>4W | 37C<br>4W | 38C<br>4W | 38C<br>4W | 39C | 39C | | | | |
| 1 | 28C<br>2W | 28C<br>2W | 29C<br>2W | 29C<br>2W | 30C<br>2W | 30C<br>3W | 30C<br>3W | 31C<br>3W | 31C<br>3W | 31C<br>3W | 32C | | | | | | |
| 0.9 | 25C<br>1W | **26C<br>1W** | 26C<br>1W | 26C<br>2W | 27C<br>2W | 27C<br>2W | 27C | | | | | | | | | | |

Rows: Vdd (Volt). Columns: Freq (GHz). The two operating points discussed are in bold. Blank cells: failed.

Design Technique #2 (of 6)

# **Powering down idle circuits**

### Add "sleep" transistors to logic ...



Example: Floating point unit logic.

When running fixed-point instructions, put logic "to sleep".

Pro: When "asleep", leakage power is dramatically reduced.

Con: Presence of sleep transistors slows down the clock rate when the logic block is in use.

### Intel example: Sleeping cache blocks



### A tiny current supplied in "sleep" maintains SRAM state.

From: "Facing the Hot Chips Challenge Again", Bill Holt, Intel, presented at Hot Chips 17, 2005.

#### Intel Medfield




Switches 45 power "islands."

Fine-grained control of leakage power, to track user activity.

"Race to idle" strategy: finish tasks quickly, to get to power down.

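A rough sketch of why racing to idle can pay off when leakage is significant. Every number here is hypothetical, and the model assumes the supply voltage is the same in both cases, so the dynamic energy of the task is identical and only the time spent leaking differs.

```python
def window_energy_j(active_power_w: float, active_time_s: float,
                    sleep_power_w: float, window_s: float) -> float:
    """Energy over a fixed window: active phase, then power-gated sleep."""
    sleep_time_s = window_s - active_time_s
    if sleep_time_s < 0:
        raise ValueError("task does not fit in the window")
    return active_power_w * active_time_s + sleep_power_w * sleep_time_s


# All values below are hypothetical.
LEAKAGE_W = 0.5    # leakage while the power island is on
SLEEP_W = 0.01     # residual power once the island is switched off
WINDOW_S = 10.0    # time until the next task arrives

# Same task, same voltage: 1.5 J of dynamic energy either way.
race = window_energy_j(1.5 + LEAKAGE_W, 1.0, SLEEP_W, WINDOW_S)   # full clock
slow = window_energy_j(0.75 + LEAKAGE_W, 2.0, SLEEP_W, WINDOW_S)  # half clock

print(f"race to idle: {race:.2f} J")
print(f"half speed:   {slow:.2f} J")
```

The picture changes if the slower run can also lower Vdd, since dynamic energy then drops quadratically; this sketch deliberately leaves that out.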


#### Playing a game ...



#### Watching a video ...



#### Looking at phone screen, not doing anything ...



#### Phone in your pocket, waiting for a call ...



Design Technique #3 (of 6)

# Slow down "slack paths"

### Fact: Most logic on a chip is "too fast"



From "The circuit and physical design of the POWER4 microprocessor", IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.

### Use several supply voltages on a chip ...



Why use multi-Vdd? We can reduce dynamic power by using low-power Vdd for logic off the critical path.

What if we can't do a multi-Vdd design? In a multi-Vt process, we can reduce leakage power on the slow logic by using high-Vt transistors.

From: "Facing the Hot Chips Challenge Again", Bill Holt, Intel, presented at Hot Chips 17, 2005.

#### LOW POWER ARM 1136JF-S™ DESIGN

George Kuo, Anand Iyer Cadence Design Systems, Inc. San Jose, CA 95134, USA

Logical partition into 0.8V and 1.0V nets done manually to meet 350 MHz spec (90nm).

Level-shifter insertion and placement done automatically.

Dynamic power in 0.8V section cut 50% below baseline.

Leakage power in 1.0V section cut 70% by using high V<sub>T</sub>.



From a chapter in a book on ASIC design by Chinnery and Keutzer (UCB).

Design Technique #4 (of 6)

# Gating clocks to save power

### On a CPU, where does the power go?



Half of the power goes to latches (Flip-Flops).

Most of the time, the latches don't change state.

So "gated" clocks are a big win. But, done with CAD tools in a disciplined way.
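A toy model of the saving, under loud assumptions: the activity trace is hypothetical, clock power is taken to be proportional to the number of clock edges that reach the register, and the overhead of the gating cell itself is ignored. Real clock gating is inserted by CAD tools, as noted above.

```python
def clock_edges_delivered(enables: list, gated: bool) -> int:
    """Count clock edges that reach a register over a run of cycles.

    enables[i] is True when the register must capture new data in
    cycle i. Without gating, every cycle clocks the register; with
    gating, only enabled cycles do.
    """
    if gated:
        return sum(1 for enabled in enables if enabled)
    return len(enables)


# Hypothetical trace: the register loads new data in 2 of 10 cycles.
trace = [True, False, False, False, True,
         False, False, False, False, False]

ungated_edges = clock_edges_delivered(trace, gated=False)
gated_edges = clock_edges_delivered(trace, gated=True)
saved_fraction = 1 - gated_edges / ungated_edges

print(f"ungated: {ungated_edges} edges, gated: {gated_edges} edges")
print(f"clock activity saved at this register: {saved_fraction:.0%}")
```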

From: Bose, Martonosi, Brooks: Sigmetrics-2001 Tutorial

### Synopsys Design Compiler can do this ...



"Up to 70% power savings at the block level, for applicable circuits" Synopsys Data Sheet

Design Technique #5 (of 6)

# **Data-Dependent Processing**

#### Example: Video Decode Transform

Most of the time, the inputs flip between small positive and negative integers.

In 2's complement, this wastes power, because nearly every bit flips:

+1: 0b00001, -1: 0b11111



Quad Full-HD Transform Engine for Dual-Standard Low-Power Video Coding Rahul Rithe, Student Member, IEEE, Chih-Chi Cheng, Member, IEEE, and Anantha P. Chandrakasan, Fellow, IEEE

#### Solution: Add bias value to all inputs

30+% power reduction for a bias of 64. For this linear transform, correcting the output for the bias is trivial.
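The effect of the bias can be illustrated by counting bit toggles between consecutive inputs. In the sketch below the 16-bit word width and the sample sequence are hypothetical; only the bias of 64 comes from the slide. Toggle count is a rough proxy for switching activity on the input bus, not a power measurement.

```python
def to_twos_complement(value: int, bits: int) -> int:
    """Bit pattern of a signed value in a fixed-width word."""
    return value & ((1 << bits) - 1)


def count_toggles(samples: list, bits: int) -> int:
    """Total number of bit flips between consecutive samples."""
    total = 0
    previous = to_twos_complement(samples[0], bits)
    for sample in samples[1:]:
        current = to_twos_complement(sample, bits)
        total += bin(previous ^ current).count("1")
        previous = current
    return total


BITS = 16   # hypothetical word width
BIAS = 64   # bias value quoted on the slide

# Hypothetical input: small values flipping between positive and negative.
samples = [1, -1, 2, -2, 1, -1, 3, -3]

unbiased_toggles = count_toggles(samples, BITS)
biased_toggles = count_toggles([s + BIAS for s in samples], BITS)

print(f"without bias: {unbiased_toggles} bit toggles")
print(f"with bias {BIAS}: {biased_toggles} bit toggles")
```

With the bias, every sample is positive, so the upper sign-extension bits stop toggling. Because the transform is linear, the bias contributes a fixed offset to the output that can be subtracted afterwards.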



Quad Full-HD Transform Engine for Dual-Standard Low-Power Video Coding

Rahul Rithe, Student Member, IEEE, Chih-Chi Cheng, Member, IEEE, and Anantha P. Chandrakasan, Fellow, IEEE

Design Technique #6 (of 6)

## **Thermal Management**

### Keep chip cool to minimize leakage power



From: "Optimizing Designs for Power Consumption through Changes to the FPGA Environment", Xilinx WP285 (v1.0), February 14, 2008.

### **IBM Power 4: How does die heat up?**



4 dies on a multi-chip module

2 CPUs



### 115 Watts: Concentrated in "hot spots"



66.8 C == 152 F. 82 C == 179.6 F.

### Idea: Monitor temperature, servo clock speed



TDP = Thermal Design Point
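One way to picture the servo idea is a simple step controller: back the clock off when the die is above a limit, and creep back up when it is below. This is only an illustrative sketch; the function name, thresholds, step size, and clock range are all hypothetical and do not describe any vendor's actual mechanism.

```python
def next_clock_mhz(temp_c: float, clock_mhz: int, limit_c: float,
                   step_mhz: int = 200, min_mhz: int = 1000,
                   max_mhz: int = 4000) -> int:
    """One step of a bang-bang thermal servo (all parameters hypothetical).

    Above the limit: lower the clock by one step, but not below min_mhz.
    At or below the limit: raise it by one step, but not above max_mhz.
    """
    if temp_c > limit_c:
        return max(min_mhz, clock_mhz - step_mhz)
    return min(max_mhz, clock_mhz + step_mhz)


# Hypothetical sequence of temperature readings, with an 80 C limit.
clock = 3000
history = []
for reading_c in [70, 78, 85, 90, 84, 79, 75]:
    clock = next_clock_mhz(reading_c, clock, limit_c=80)
    history.append(clock)

print(history)
```

A production controller would add hysteresis and respond to the rate of temperature change; the point here is only the feedback loop from temperature sensor to clock speed.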

### Intel realtime temp monitoring



### **Six low-power design techniques**

- Parallelism and pipelining
- Power-down idle transistors
- Slow down non-critical paths
- Clock gating
- Data-dependent processing
- Thermal management