# Why is lowering VDD not enough?

- Total power P can be minimized by lowering the supply voltage V
  - lower V is a natural result of smaller feature sizes
- But transistor speed decreases dramatically as V is reduced toward the threshold voltage $V_t$
  - performance goals may not be met
  - $t_d = \dfrac{CV}{k(V-V_t)^{\alpha}}$, where $\alpha$ is between 1 and 2
- Why not lower the threshold voltage?
  - makes noise margin and I<sub>leak</sub> worse!
- Need to do smarter voltage scaling!

# Reducing the Supply Voltage: Architectural Approach

- Operate at reduced voltage at lower speed
- Use architecture optimization to compensate for slower operation
  - e.g. concurrency, pipelining via compiler techniques
- Architecture bottlenecks limit voltage reduction
  - degradation of speed-up
  - interconnect overheads
- Similar idea for memory: slower and parallel

# Parallel Datapath

- The clock rate can be reduced by 2x with the same throughput: $f_{par} = f_{ref}/2 = 20 \text{ MHz}$
- Total switched capacitance: $C_{par} = 2.15\,C_{ref}$
- Supply voltage: $V_{par} = V_{ref}/1.7$
- $P_{par} = (2.15\,C_{ref})(V_{ref}/1.7)^2(f_{ref}/2) \approx 0.36\,P_{ref}$

## **Pipelined Datapath**

- Voltage can be dropped while maintaining the original throughput
- $P_{pipe} = C_{pipe} V_{pipe}^2 f_{pipe} = (1.1\,C_{ref})(V_{ref}/1.7)^2 f_{ref} \approx 0.37\,P_{ref}$

# Datapath Architecture-Power Trade-off Summary

| Datapath Architecture | Voltage | Area (normalized) | Power (normalized) |
|-----------------------|---------|-------------------|--------------------|
| Original              | 5 V     | 1                 | 1                  |
| Pipelined             | 2.9 V   | 1.3               | 0.37               |
| Parallel              | 2.9 V   | 3.4               | 0.34               |
| Pipeline-Parallel     | 2.0 V   | 3.7               | 0.18               |

#### An Ultra-Low-Power SoC in 14 nm Tri-gate CMOS for IoT Applications (ISSCC 2018, Intel group)

- 2.5 mm x 2.5 mm chip, 12M transistors
- 4 independent V<sub>DD</sub> domains:
  - 0.55 V for logic, ROM, register file array
  - 0.50 V for dense SRAM array
  - 1.05 V for radio
  - 1.70 V for transceivers

[Plot: energy versus supply voltage, 0.4 V to 0.8 V]

[Handwritten derivation; the equations are not recoverable from the scan.]

#### **Combinational Vs. Sequential Circuits**

**Combinational (or Combinatorial) Networks/Circuits**

- Circuit without storage
- Outputs depend only on the current inputs
- Examples: NAND gate, look-up table (LUT)

Vs.

**Sequential Networks/Circuits**

- Circuit with storage elements
- Outputs depend on present inputs and also on the history of inputs
- Examples: RAM, finite state machine (FSM)
- Normally, combinational cells AND storage elements

#### Computation in Memory (also called Processing in Memory)

Ref.: "Conv-RAM: An Energy-Efficient SRAM with Embedded Convolution Computation for Low-Power CNN (Convolutional Neural Network)-Based Machine Learning Applications," ISSCC 2018, A. Biswas, et al.
MIT. SRAM: weight storage and processing of the convolution.

#### Synchronous Clocking - Design Rules

#### Design rule

Consistently dissociate signals into

- asynchronous reset signals (when to enter the start state)
- clock signals (when to move from one state to the next)
- information signals (what state to enter, what output to produce)

#### Design rule

Make the clock period long enough so that all transient effects have died out before the next active clock edge instructs registers (and other storage devices) to accept new data!

#### Timing Parameters of Combinational & Sequential Circuits

Combinational circuit with three input signals and three output signals.

$t_{pd}$ (propagation delay): the time from applying a new stable logic value at a data or clock input until the output has settled on its final value.

$t_{cd}$ (contamination delay): the time from altering the logic value at a data or clock input until the output value starts to change.

#### Timing Quantities - Comb. and Seq. Circuits

#### Combinational and sequential circuits

Propagation delay $t_{pd}$

- New stable input (data or clock) → output settled on final value
- Example NAND: A-to-Z, and B-to-Z
- Example latch: D-to-Q, and/or CLK-to-Q

Contamination delay (or retain delay) $t_{cd}$

- Altering input (data or clock) → first change of value at output
- By definition: $0 \le t_{cd} \le t_{pd}$

#### **Synchronous Timing**

Synchronous timing: all registers synchronized with the same CLK

# Timing Constraints (Setup & Hold)

- There are two main problems that can arise in synchronous logic:
  - <u>Max Delay</u>: The data doesn't have enough time to pass from one register to the next before the next clock edge.
  - <u>Min Delay</u>: The data path is so short that it causes a hold violation in the capturing register.
- Max delay violations are the result of a slow (long) data path, including the register's t<sub>su</sub>; therefore they are often called "setup violations".
- Min delay violations are the result of a fast (short) data path, causing the data to change before the t<sub>hold</sub> of the register has passed; therefore they are often called "hold violations".

# Setup (Max) Constraint

- Let's see what makes up our clock cycle:
  - After the clock rises, it takes $t_{cq}$ for the data to propagate to point A.
  - Then the data goes through the delay of the logic to get to point B.
  - The data has to arrive at point B at least $t_{su}$ before the next clock edge.
- In general, our timing path is a race:
  - between the data arrival, starting with the launching clock edge,
  - and the data capture, one clock period later.

# Hold (Min) Constraint

- Hold problems occur due to the logic changing before $t_{hold}$ has passed.
- This is not a function of cycle time; it is relative to a single clock edge!
- Example of meeting the hold constraint:
  - The clock rises and the data at A changes after $t_{cq}$.
  - The data at B changes $t_{pd}(logic)$ later.
  - Since the data at B had to stay stable for $t_{hold}$ after the clock (for the second register), the change at B has to be at least $t_{hold}$ after the clock edge.

$$t_{CQ} + t_{\rm logic} > t_{hold}$$

#### Summary

- For setup constraints, the clock period has to be longer than the data path delay: $T > t_{CQ} + t_{\text{logic}} + t_{SU}$
  - This sets our maximum frequency.
  - If we have setup failures, we can always just slow down the clock.
- For hold constraints, the data path delay has to be longer than the hold time: $t_{CQ} + t_{\text{logic}} > t_{hold}$
  - This is independent of clock period.
  - If there is a hold failure, your chip will never work!
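
The two summary inequalities can be checked numerically. The following is a minimal Python sketch, not part of the original slides: the class name `PathTiming`, the function names, and all delay values are hypothetical, and the split of $t_{\text{logic}}$ into a slowest (`t_logic_max`) and fastest (`t_logic_min`) path is an assumption made so that the setup check uses the long path and the hold check uses the short one. Delays are integers in picoseconds to avoid floating-point rounding.

```python
from dataclasses import dataclass


@dataclass
class PathTiming:
    """Illustrative timing numbers for one register-to-register path, in ps."""
    t_cq: int         # clock-to-Q delay of the launching register
    t_logic_max: int  # slowest logic path (assumed name, used for setup)
    t_logic_min: int  # fastest logic path (assumed name, used for hold)
    t_su: int         # setup time of the capturing register
    t_hold: int       # hold time of the capturing register


def min_clock_period(p: PathTiming) -> int:
    """Data path delay that the clock period T must exceed."""
    return p.t_cq + p.t_logic_max + p.t_su


def setup_ok(p: PathTiming, period: int) -> bool:
    """Setup (max) constraint: T > t_cq + t_logic + t_su."""
    return period > min_clock_period(p)


def hold_ok(p: PathTiming) -> bool:
    """Hold (min) constraint: t_cq + t_logic > t_hold (no dependence on T)."""
    return p.t_cq + p.t_logic_min > p.t_hold


# Hypothetical path: long worst-case logic, very short best-case logic.
path = PathTiming(t_cq=100, t_logic_max=1500, t_logic_min=50, t_su=100, t_hold=200)

print(min_clock_period(path))   # data path delay in ps
print(setup_ok(path, 2000))     # slower clock: setup met
print(setup_ok(path, 1500))     # faster clock: setup violated
print(hold_ok(path))            # hold result is the same for any clock period
```

Note that `hold_ok` takes no clock period argument, which mirrors the point made in the summary: a setup failure can be cured by slowing the clock, while a hold failure cannot.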
# Fundamental Timing Conditions - Reminder

$$t_{ho_{ff}} \leqslant t_{cd_{ff}} + t_{cd_c} \left( -t_{sk} \right)$$

Clock Skew

Positive clock skew $t_{sk}$ relaxes the setup condition, while having detrimental effects on the hold condition.

The formulas to the right apply to the data

#### Clock skew

- Spatial variation in temporally equivalent clock edges; deterministic (can be controlled) + random, $t_{SK}$

#### Clock jitter

- Temporal variations in consecutive edges of the clock signal; purely random
- Cycle-to-cycle (short-term) $t_{JS}$
- Long term $t_{JL}$

Both skew and jitter affect the effective cycle time

#### Variation of the pulse width

- Important for level-sensitive clocking (latches)

# Clock Distribution - H-Tree

- Clock tree minimizing skew
- Clock insertion delay to every FF identical (ideally)
- Equal wire length/number of buffers to get to every location
- Sometimes: work with "useful (intentional) skew", e.g. if many FFs have hold violations, to avoid insertion of a large number of buffers

#### Dealing with Skew and Jitter

- Balance clock paths using a regular distribution network, such as an H-tree
- Use local clock grids (increased capacitance and power)
- Route data and clock in opposite directions to improve hold at the expense of setup.
- Shield clock wires (minimize capacitive coupling)
- Use dummy metal density fillers for regular wires
- Use decoupling capacitors (for stable VDD)
- Time borrowing (or cycle stealing): a long path borrows time from a subsequent short path, accomplished using latches
- "Useful skew" to avoid expensive buffering for hold fixes

# **Clock Dividers**

- Derive a slow clock from a fast clock by dividing it by an integer number (counter)

Remember: all clocks (including divided clocks) need to be free of hazards and glitches

- Must assume that any logic can produce glitches
- Generated (divided) clocks MUST come directly from a FF output

# **Clock Gating**

#### Advantages:

- Saves a multiplexer per FlipFlop with Enable (area, delay, and power advantage)
- Avoids activity on the clock net leading to the FlipFlop
- Avoids activity on the FlipFlop's clock pin, reducing internal power consumption

#### Disadvantages:

- Need for additional logic to suppress the clock while the FlipFlop is disabled (area and power penalty)
- Need to ensure that the clock signal is free of glitches

#### A safe strategy to realize clock gating

- A latch on the Enable signal shields glitches during the sensitive period of the clock
- Implemented with individual cells: some timing constraints need to be observed
- Often realized as a single dedicated library cell

#### Granularity

- Coarse grained: disable entire blocks of the design
  - Inserted manually
- Fine grained: enable for small groups of FlipFlops

## **Improving IO Timing**

- Chip provides a clock output that is aligned with the clock at the leaves of the FlipFlops
  - Output of the clock output pad is declared as a clock leaf
  - Delayed clock is used as a reference for the rest of the system
- Delay-locked loop (DLL)
  - Generates a phase-shifted clock such that the reference input is phase aligned with the input clock
  - Clock reference is taken from the leaves of the clock tree
- Internal clock at the leaves is aligned with clock input
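
Returning to the safe clock-gating strategy from the Clock Gating section, the effect of the latch on the Enable signal can be illustrated with a small waveform model. This is a hedged Python sketch rather than a description of any particular library cell: the function names (`naive_gate`, `latched_gate`, `pulse_widths`) and the sample waveforms are hypothetical, signals are sampled twice per clock phase, and the latch is assumed to be transparent while the clock is low and to hold its value while the clock is high.

```python
def naive_gate(clk, en):
    """Gate the clock with a plain AND: gclk = clk AND en."""
    return [c & e for c, e in zip(clk, en)]


def latched_gate(clk, en):
    """Gate the clock through a latch that is transparent while clk is low.

    While clk is high the latch holds its value, so changes on en during
    the high phase cannot reach the gated clock.
    """
    q = 0  # assumed initial latch state
    gclk = []
    for c, e in zip(clk, en):
        if c == 0:
            q = e            # latch transparent: follow the enable
        gclk.append(c & q)   # AND the clock with the latched enable
    return gclk


def pulse_widths(wave):
    """Lengths (in samples) of each run of consecutive 1s in a waveform."""
    widths, run = [], 0
    for bit in wave:
        if bit:
            run += 1
        elif run:
            widths.append(run)
            run = 0
    if run:
        widths.append(run)
    return widths


# Two samples per clock phase; a full high phase is 2 samples wide.
clk = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
# Enable rises in the middle of a high phase (sample 3) and falls while clk is low.
en = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0]

print(pulse_widths(naive_gate(clk, en)))    # includes a shortened pulse
print(pulse_widths(latched_gate(clk, en)))  # only full-width pulses
```

In this model the plain AND gate passes a pulse that is only one sample wide when the enable changes during the high phase, whereas the latched version produces only full-width clock pulses.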