

EECS 151/251A Spring 2023 Digital Design and Integrated Circuits

Instructor: John Wawrzynek

## Lecture 11: Timing Part 1

## Announcements

### □ Midterm exam coming up in 2 weeks

# What do ASIC/FPGA Designers need to know about physics?

Physics effects: area  $\Rightarrow$  cost; delay  $\Rightarrow$  performance; energy  $\Rightarrow$  performance & cost.

- Ideally, zero delay, area, and energy. However, the physical devices occupy area, take time, and consume energy.
- CMOS process lets us build transistors, wires, connections, and we get capacitors, inductors, and resistors whether or not we want them.

## Performance, Cost, Power



- How do we measure performance? operations/sec? cycles/sec?
- Performance is directly proportional to clock frequency, although that may not be the entire story:

Ex: CPU execution time

= # instructions × CPI × clock period
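The CPU example above can be made concrete with a short calculation. This is only a sketch: the function name `cpu_time_seconds` and all workload numbers are assumptions chosen for illustration, not measurements.

```python
def cpu_time_seconds(instructions: float, cpi: float, clock_hz: float) -> float:
    """Execution time = instruction count x cycles per instruction x clock period."""
    clock_period = 1.0 / clock_hz  # seconds per cycle
    return instructions * cpi * clock_period


# Hypothetical workload of one billion instructions (assumed numbers).
baseline = cpu_time_seconds(1e9, cpi=1.0, clock_hz=1e9)
# Doubling the clock while CPI rises to 1.5 (assumed) gives less than a 2x speedup.
faster_clock = cpu_time_seconds(1e9, cpi=1.5, clock_hz=2e9)
speedup = baseline / faster_clock

print(f"baseline: {baseline:.3f} s, faster clock: {faster_clock:.3f} s, speedup: {speedup:.2f}x")
```

The point of the comparison is that a higher clock rate only translates fully into performance if the instruction count and CPI stay fixed.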

## Limitations on Clock Rate

1 Logic Gate Delay

2 Delays in flip-flops

3 Interconnect Delay: wires



1, 2, & 3 all contribute to limiting the clock period.

- What must happen in one clock cycle for correct operation?
  - All signals connected to FF (or memory) inputs must be ready and "setup" before rising edge of clock.
  - For now we assume perfect clock distribution (all flip-flops see the clock at the same time).



□ Three important times associated with flip-flops:

- Setup time: how long D must be stable before the rising edge of CLK
- Hold time: how long D must be stable after the rising edge of CLK
- Clock-to-Q delay: propagation delay after the rising edge of CLK

## **Example: Timing Analysis**



## In General ...



$$T \ge \tau_{clk \rightarrow Q} + \tau_{CL} + \tau_{setup}$$
 for all paths.

- How do we enumerate all paths?
  - Any circuit input or register output to any register input or circuit output?
- Note:
  - "setup time" for outputs is a function of what it connects to.
  - "clk-to-q" for circuit inputs depends on from where it comes.
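The constraint above can be sketched as a small calculation: the minimum clock period is set by the slowest register-to-register path. The `Path` class, the path names, and every delay value below are hypothetical, and perfect clock distribution is assumed as stated earlier.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    t_clk_to_q: float  # ps, clock-to-Q delay of the launching flip-flop
    t_cl: float        # ps, combinational logic delay along the path
    t_setup: float     # ps, setup time of the capturing flip-flop

    def delay(self) -> float:
        return self.t_clk_to_q + self.t_cl + self.t_setup


def min_clock_period(paths):
    """T must satisfy T >= t_clk_to_q + t_CL + t_setup for all paths."""
    return max(p.delay() for p in paths)


# Hypothetical paths with made-up delays in picoseconds.
paths = [
    Path("decode -> regfile", 50.0, 400.0, 30.0),
    Path("alu -> bypass", 50.0, 620.0, 30.0),
    Path("pc -> pc", 50.0, 150.0, 30.0),
]

t_min_ps = min_clock_period(paths)
critical = max(paths, key=Path.delay)
f_max_ghz = 1000.0 / t_min_ps  # 1 / (T in ps) expressed in GHz

print(f"critical path: {critical.name}, T >= {t_min_ps} ps, f_max = {f_max_ghz:.2f} GHz")
```

Enumerating every path explicitly does not scale to real designs; it is used here only to show how the inequality picks out the critical path.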

#### Modern CMOS gate delays are on the order of a few picoseconds. (However, highly dependent on gate design and context.)

- Often expressed as FO4 (fan-out-of-4) delays, a process-dependent delay metric:
  - the delay of an inverter driven by an inverter 4x smaller than itself and driving an inverter 4x larger than itself.
  - Less than 10 ps for a 32 nm process. For a 7 nm process, FO4 is around 2.5 ps.









## **Process Dependent FO4 Delay**

Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7 nm

Aaron Stillmaker<sup>a,b,\*</sup>, Bevan Baas<sup>a</sup>

<sup>a</sup> Department of Electrical and Computer Engineering, University of California, Davis, One Shields Ave., Davis, CA 95616, USA

<sup>b</sup> Department of Electrical and Computer Engineering, California State University, Fresno, 2320 E. San Ramon Ave., Fresno, CA 93740, USA

Characteristics of different technology nodes [23]. The modeled measurements are for a single inverter in an FO4 chain. The energy value is the average energy required for a single inverter transition from low to high, or high to low.

| Production<br>Year | Technology<br>Node (nm) | Technology<br>Type | V<sub>DD</sub><br>(V) | Simulated inverter<br>delay (ps) | Simulated inverter<br>energy (fJ) | Simulated inverter<br>power (µW) |
|--------------------|-------------------------|--------------------|------------------------|-----------------|-------------------|---------------|
| 1999               | 180                     | Bulk               | 1.8                    | 77.2            | 27.5              | 105           |
| 2001               | 130                     | Bulk               | 1.2                    | 34.7            | 5.20              | 26.1          |
| 2004               | 90                      | Bulk               | 1.1                    | 26.5            | 2.62              | 13.0          |
| 2007               | 65                      | Bulk               | 1.1                    | 19.8            | 1.72              | 8.58          |
| 2008               | 45                      | High-k             | 1.1                    | 10.9            | 1.05              | 5.19          |
| 2010               | 32                      | High-k             | 0.97                   | 9.8             | 0.51              | 2.47          |
| 2012               | 20                      | Multi-Gate         | 0.9                    | 9.66            | 0.198             | 1.51          |
| 2013               | 16 <sup>a</sup>         | Multi-Gate         | 0.86                   | 6.12            | 0.179             | 1.28          |
| 2013               | 14 <sup>a</sup>         | Multi-Gate         | 0.86                   | 4.02            | 0.144             | 0.995         |
| 2015               | 10                      | Multi-Gate         | 0.83                   | 3.24            | 0.122             | 0.866         |
| 2017               | 7                       | Multi-Gate         | 0.8                    | 2.47            | 0.111             | 0.789         |

<sup>a</sup> The 2013 ITRS report labels a single "16/14" node.

## "Path Delay"



 For correct operation: Total Delay ≤ clock\_period - FF<sub>setup\_time</sub> - FF<sub>clk\_to\_q</sub> on all paths.

 High-speed processors' critical paths (worst-case paths) have around 20 FO4 delays.
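As a rough illustration of what a 20 FO4 critical path means in absolute time, the sketch below multiplies the FO4 count by the simulated inverter delays from the table above. The function name `path_delay_ps` is an assumption, and the results are back-of-the-envelope figures derived from simulated inverter data, not achievable clock periods for real processors.

```python
# Simulated FO4 inverter delays in ps, copied from the technology table above.
FO4_DELAY_PS = {180: 77.2, 65: 19.8, 32: 9.8, 7: 2.47}


def path_delay_ps(node_nm: int, fo4_count: float = 20.0) -> float:
    """Convert a path length measured in FO4 delays into picoseconds."""
    return fo4_count * FO4_DELAY_PS[node_nm]


for node in sorted(FO4_DELAY_PS, reverse=True):
    print(f"{node:>4} nm: 20 FO4 = {path_delay_ps(node):8.1f} ps")
```

Expressing the path in FO4 units keeps the comparison process-independent: the same 20 FO4 logic depth shrinks in absolute time as the process scales.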

## FO4 Delays per clock period



With this open database, you can mine microprocessor trends over the past 40 years.

Andrew Danowitz, Kyle Kelley, James Mao, John P. Stevenson, Mark Horowitz, Stanford University

#### **FO4 Delays Per Cycle for Processor Designs**

CPU DB: Recording Microprocessor History

FO4 delay per cycle is roughly proportional to the amount of computation completed per cycle.

# "Gate Delay"

- What determines the actual delay of a logic gate?
- Transistors are not perfect switches - cannot change terminal voltages instantaneously.
- Consider the NAND gate:
  - Current (I) value depends on: process parameters, transistor size



 $\Delta t \propto C_L / I$ 

- C<sub>L</sub> models gate output, wire, inputs to next stage (Cap. of Load)
- C "integrates" I creating a voltage change at output
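The proportionality $\Delta t \propto C_L / I$ follows from $I = C\,dV/dt$ with a constant drive current. A minimal sketch, assuming a constant current and made-up values for the load, voltage swing, and current (real transistors do not deliver a constant current over the whole transition):

```python
def gate_delay_s(c_load_f: float, delta_v: float, i_drive_a: float) -> float:
    """Time for a constant current I to change the voltage on C_L by delta_v."""
    return c_load_f * delta_v / i_drive_a


# Assumed values: 2 fF load, 0.4 V swing (half of a 0.8 V supply), 50 uA drive.
base = gate_delay_s(2e-15, 0.4, 50e-6)
double_load = gate_delay_s(4e-15, 0.4, 50e-6)    # twice the capacitance
double_drive = gate_delay_s(2e-15, 0.4, 100e-6)  # twice the current (wider FET)

print(f"base: {base * 1e12:.1f} ps")
print(f"2x load: {double_load * 1e12:.1f} ps, 2x drive: {double_drive * 1e12:.1f} ps")
```

Doubling the load doubles the delay and doubling the drive current halves it, which is the trade-off that transistor sizing exploits.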

## More on transistor Current

Transistors actually act like a cross between a resistor and "current source"



I<sub>SAT</sub> depends on process parameters (higher for nFETs than for pFETs) and transistor size (layout):







## Physical Layout determines FET strength



- "Switch-level" abstraction gives a good way to understand the function of a circuit.
  - nFET (g=1 ? short circuit : open)
  - pFET (g=0 ? short circuit : open)
- Understanding delay means going below the switch-level abstraction to transistor physics and layout details.

(Cartoon physics)

If electrons are water molecules, transistor strengths (W/L) are pipe diameters, and capacitors are buckets ...









CS152 / Kubiatowicz Lec3.31

An "on" n-FET empties the bucket.





## **Inverter: Transient Response**



## **Turning Rise/Fall Delay into Gate Delay**



## More on gate delay

Everything that connects to the output of a logic gate (or transistor) contributes capacitance:



- Transistor drains
- Interconnection (wires/contacts/ vias)
- Transistor Gates

## Wires

□ As parallel plate capacitors: $C \propto$ Area = width \* length



• Wires have some finite resistance, so have distributed R and C:







## <u>Wire Delay</u>

- Wires possess distributed resistance and capacitance
- Time constant associated with distributed RC is proportional to the **square** of the length



- For **short wires** on ICs, resistance is insignificant (relative to effective R of transistors), but C is important.
  - Typically around half of C of gate load is in the wires.
- For **long wires** on ICs:
  - busses, clock lines, global control signal, etc.
  - Resistance is significant, therefore distributed RC effect dominates.
  - signals are typically "rebuffered" to reduce delay
- For long wires on ICs with high currents:
  - *inductance* is also important

## **Wire Rebuffering**

- For **long wires** on ICs:
  - busses, clock lines, global control signal, etc.
  - Resistance is significant, therefore the rcL<sup>2</sup> effect dominates.
  - signals are typically "rebuffered" to reduce delay:



Unbuffered wire: $\Delta t \propto L^2$

Wire buffered into N sections: $\Delta t \propto N * (L/N)^2 + (N-1) * t_{buffer}$

Assuming $t_{buffer}$ is small, $\Delta t \propto L^2/N$

Speedup: $\propto N$
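When $t_{buffer}$ is not negligible, adding sections eventually stops paying off. The sketch below evaluates the buffered-wire expression directly and searches for the best number of sections. The constant `k` (standing in for the rc product), the wire length, and the buffer delay are arbitrary assumed units, not process data.

```python
def wire_delay(length: float, n_sections: int = 1, k: float = 1.0, t_buffer: float = 0.0) -> float:
    """Delay of a wire split into N sections: N * k * (L/N)^2 + (N-1) * t_buffer."""
    segment = length / n_sections
    return n_sections * k * segment ** 2 + (n_sections - 1) * t_buffer


def best_sections(length: float, k: float, t_buffer: float, max_sections: int = 64) -> int:
    """Brute-force search for the section count with the lowest total delay."""
    return min(range(1, max_sections + 1),
               key=lambda n: wire_delay(length, n, k, t_buffer))


# Assumed numbers in arbitrary units: L = 10, k = 1, t_buffer = 4.
for n in (1, 2, 4, 5, 6, 8):
    print(f"N = {n}: delay = {wire_delay(10.0, n, 1.0, 4.0):.2f}")

n_best = best_sections(10.0, 1.0, 4.0)
print(f"best N = {n_best}")
```

With zero buffer delay the speedup grows linearly with N, as stated above; with a nonzero buffer delay the total delay has a minimum, here at N = 5.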

#### Flip-Flop delays eat into "time budget"





#### **Recall: Positive edge-triggered flip-flop**



#### **Sensing: When clock is low**



#### Capture: When clock goes high



### Flip Flop delays:

clk-to-Q? setup?



Note: with too much fanout, second stage could fail to capture data properly. Often output is rebuffered.

## Hold-time Violations



Some state elements have positive hold time requirements.

#### How can this be?


- Fast paths from one state element to the next can create a violation. (Think about shift registers!)
- CAD tools do their best to fix violations by inserting delay (buffers).
  - Of course, if the path is delayed too much, then cycle time suffers.
  - Difficult because buffer insertion changes layout, which changes path delay.
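The fast-path problem can be sketched as a hold check: data launched at a clock edge must not reach the next flip-flop before that flip-flop's hold time has elapsed. The sketch assumes perfect clock distribution (as earlier in the lecture) and uses the standard condition that minimum clock-to-Q plus minimum logic delay must be at least the hold time. All function names and numbers are hypothetical.

```python
import math


def hold_slack(t_clk_to_q_min: float, t_cl_min: float, t_hold: float) -> float:
    """Positive slack means the hold requirement is met (no clock skew assumed)."""
    return t_clk_to_q_min + t_cl_min - t_hold


def buffers_needed(slack: float, t_buffer: float) -> int:
    """Number of delay buffers needed to bring a negative slack up to zero."""
    if slack >= 0:
        return 0
    return math.ceil(-slack / t_buffer)


# Shift-register style path: no logic between flip-flops (assumed values in ps).
slack = hold_slack(t_clk_to_q_min=30.0, t_cl_min=0.0, t_hold=50.0)
n_buf = buffers_needed(slack, t_buffer=15.0)

print(f"hold slack = {slack} ps, buffers to insert = {n_buf}")
```

Each inserted buffer also lengthens the same path for the setup check, which is why over-fixing hold violations can hurt cycle time.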




## **Components of Combinational Path Delay**



1. # of levels of logic
2. Internal cell delay
3. Wire delay
4. Cell input capacitance
5. Cell fanout
6. Cell output drive strength

# Who controls the delay in ASIC?

|                                  | Foundry<br>engineer<br>(TSMC) | Library<br>Developer<br>(Artisan) | CAD Tools (DC,<br>IC Compiler) | Designer (you!) |
|----------------------------------|--------------------------------|-----------------------------------|--------------------------------|-----------------|
| 1. # of levels                   |                                |                                   | synthesis                      | HDL design      |
| 2. Internal<br>cell delay        | physical<br>parameters         | cell topology,<br>trans sizing    | cell selection                 |                 |
| 3. Wire delay                    | physical<br>parameters         |                                   | place & route                  | layout          |
| 4. Cell input capacitance        | physical<br>parameters         | cell topology,<br>trans sizing    | cell selection                 |                 |
| 5. Cell fanout                   |                                |                                   | synthesis                      | HDL design      |
| 6. Cell drive strength           | physical<br>parameters         | transistor<br>sizing              | cell selection                 |                 |

# Timing Closure: Searching for and beating down the critical path



Must consider all paths between connected register pairs, plus paths from input to register, plus paths from register to output.

Design tools help in the search:

- Synthesis tools work to meet the clock constraint and report delays on paths.
- Special static timing analyzers accept a design netlist and report path delays.
- And, of course, simulators can be used to determine timing performance.

Tools that are expected to **do something** about the timing behavior (such as synthesizers), also include provisions for specifying input arrival times (relative to the clock), and output requirements (set-up times of next stage).

## Timing Analysis, real example

#### The critical path

Most paths have hundreds of picoseconds to spare.



From "The circuit and physical design of the POWER4 microprocessor", IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.

# **Timing Optimization**

As an ASIC/FPGA designer you get to choose:

- The algorithm
- The Microarchitecture (block diagram)
- The HDL description of the CL blocks (number of levels of logic)
- Where to place registers and memory (the pipelining)
- Overall floorplan and relative placement of blocks

## Circuit retiming\*

Circles are combinational logic, labelled with delays.

Critical path is 5 (ignore FF delay for now). We want to improve it without changing circuit semantics.

Add a register, move one circle. Performance improves by 20%.

> \*a.k.a. Register Rebalancing



Figure 1: A small graph before retiming. The nodes represent logic delays, with the inputs and outputs passing through mandatory, fixed registers. The critical path is 5.



Figure 2: The example in Figure 1 after retiming. The critical path is reduced from 5 to 4.

Logic Synthesis tools can do this in simple cases.
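The idea of moving a register to balance delays can be sketched on a much simpler structure than the graph in the figures: a straight chain of logic blocks with one movable register. The chain, its delays, and the function names are hypothetical and were chosen only so that the critical path drops from 5 to 4, matching the numbers above. Real retiming works on general graphs and must respect loops.

```python
def critical_path(delays, cut):
    """Critical path when a register sits between delays[cut-1] and delays[cut]."""
    return max(sum(delays[:cut]), sum(delays[cut:]))


def best_cut(delays):
    """Register position that minimizes the longer of the two stages."""
    return min(range(1, len(delays)), key=lambda c: critical_path(delays, c))


# Hypothetical chain of combinational blocks, labelled with delays.
delays = [2, 2, 1, 1, 1, 1]

before = critical_path(delays, 3)  # stages: 2+2+1 = 5 and 1+1+1 = 3
cut = best_cut(delays)
after = critical_path(delays, cut)  # stages: 2+2 = 4 and 1+1+1+1 = 4

print(f"before: {before}, after moving the register to position {cut}: {after}")
```

Moving a single block across the register balances the two stages without changing what the chain computes.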

### **Retiming Example**



Want to retime to here, however, delay cannot be added to the loop without changing the semantics of the logic. Because of this, many retiming tools stop at loops.




#### Floorplanning: essential to meet timing.





(Intel XScale 80200)

# **Timing Analysis Tools**

Static Timing Analysis: Tools use delay models for gates and interconnect. Traces through circuit paths.

- Cell delay models capture:
  - For each input/output pair, internal delay (output-load independent)
  - Output-load-dependent delay
- Standalone tools (PrimeTime) and part of logic synthesis.
- Back-annotation takes information from results of place and route to improve accuracy of timing analysis.
- DC in "topographical mode" uses preliminary layout information to model interconnect parasitics.
  - Prior versions used a simple fan-out model of gate loading.
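The two-part cell delay model described above (a load-independent internal delay plus a load-dependent term) can be sketched as a linear function of output load, summed along a path. The linear form, the coefficient names, and all numbers are simplifying assumptions; characterized libraries generally use more detailed models than a single slope.

```python
def cell_delay_ps(intrinsic_ps: float, slope_ps_per_ff: float, c_load_ff: float) -> float:
    """Internal (load-independent) delay plus a term proportional to output load."""
    return intrinsic_ps + slope_ps_per_ff * c_load_ff


def path_delay_ps(stages) -> float:
    """Sum cell delays along a path; each stage is (intrinsic, slope, load)."""
    return sum(cell_delay_ps(*stage) for stage in stages)


# Hypothetical two-gate path: (intrinsic ps, slope ps/fF, load fF) per cell.
stages = [
    (10.0, 2.0, 3.0),  # e.g. a NAND driving 3 fF
    (12.0, 1.5, 4.0),  # e.g. a stronger inverter driving 4 fF
]

total = path_delay_ps(stages)
print(f"path delay = {total} ps")
```

A static timing analyzer applies this kind of per-cell calculation along every path, which is why better load estimates (from layout) improve its accuracy.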

(Figure: cell delay versus output load.)

### **Standard cell characterization**



 Each library cell (FF, NAND, NOR, INV, etc.), along with its size variations (gate strengths), is fully characterized across temperature, loading, etc.

## End of Lecture 11