Part 1: Low-Power Physical Design


Baseline Run

The first run, called base, applied no low-power techniques. The RTL went through synthesis with default settings. Non-default configuration items were only what was needed to integrate the SRAM macros: LEF paths, GDS paths, LIB paths, Verilog behavioral model paths, die area set to 2000x2000 um, a manual macro placement file, and the DRC/LVS relaxation flags required because OpenRAM SRAM macros have known DRC issues under Magic. Placement density was set to 30% of core area.

| Metric | Value |
|---|---|
| Total power | 2,577,878.5 (OpenSTA normalized units) |
| Internal power | 2,577,878.5 |
| Switching power | 0.003 |
| Leakage power | 0.000015 |
| Hold violations | 42 |
| Setup violations | 0 |
| Slew violations | 789 |
| Cap violations | 13 |
| Die area | 1,824,020 um² |
| Wirelength | 140,132 um |

Nearly all power is internal power. Switching power and leakage are negligible. In a synchronous CMOS design, internal power is the power dissipated by switching activity on node capacitances inside cells, primarily the clock and output nodes of flip-flops, and it scales with clock frequency. This design has hundreds of registers distributed across the VC FIFOs, output FIFOs, state machines, and allocator logic. The clock pin of every one of those registers toggles on every clock edge regardless of whether useful data is being processed, because the clock tree is routed unconditionally to all flip-flops.

The 42 hold violations and 789 slew violations at baseline are addressed by later optimization steps.


RTL Clock Gating

The highest-impact optimization in the project was RTL-level clock gating. Integrated clock gating (ICG) cells were manually instantiated in the Verilog source. An ICG cell is a latch-based gate with an enable input, a clock input, and a gated clock output. When enable is low, the gated clock output stays at a constant logic level and never transitions.

ICG cells were placed in the clock path to the register banks inside the VC FIFOs and output FIFOs. When a FIFO slot is not actively being written, the enable to its ICG cell is deasserted, and none of the data registers in that FIFO bank receive a clock edge.

The RTL change was guarded behind a Verilog define (CLOCK_GATE) passed through the OpenLane configuration. This run was called baseopt.
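The latch-based gating behavior described above can be illustrated with a small behavioral model. This is a sketch in Python rather than the project's Verilog, and the class and function names (`LatchBasedICG`, `simulate`, `count_rising_edges`) are hypothetical. It assumes the usual ICG structure: a latch that is transparent while the clock is low, followed by an AND with the clock.

```python
class LatchBasedICG:
    """Behavioral model of a latch-based integrated clock gate (illustrative only)."""

    def __init__(self):
        self.latched_enable = 0

    def step(self, clk, enable):
        # The latch is transparent only while the clock is low, so a change on
        # enable during the high phase cannot glitch the gated clock.
        if clk == 0:
            self.latched_enable = enable
        return clk & self.latched_enable


def simulate(enable_per_cycle):
    """Return the gated clock, sampled once per clock phase (low, then high)."""
    icg = LatchBasedICG()
    gated = []
    for enable in enable_per_cycle:
        for clk in (0, 1):
            gated.append(icg.step(clk, enable))
    return gated


def count_rising_edges(samples):
    """Count 0 -> 1 transitions in a sampled waveform."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev == 0 and cur == 1)


# Eight clock cycles, with the FIFO bank written in only three of them.
enables = [1, 1, 0, 0, 0, 1, 0, 0]
gated_clock = simulate(enables)
print("ungated rising edges:", len(enables))
print("gated rising edges:", count_rising_edges(gated_clock))
```

In this model the registers behind the gate see one clock edge per enabled cycle instead of one per cycle, which is the mechanism the power reduction relies on.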

| Metric | Value |
|---|---|
| Total power | 1,129,917 |
| Reduction from base | 56.17% |
| Clock gate cells in placed layout | 45 |
| Hold violations | 18 |

This is a 56% reduction from a single RTL modification. Internal power is proportional to switching activity, and gating the clock to idle register banks eliminates all toggling on those nodes while they are idle, which is most of the time in a router that is not saturated. The 45 ICG cells in the final layout correspond to the latch-based gating cells the synthesizer mapped from the ICG instantiations in the RTL.

Hold violations dropped from 42 to 18 because the synthesizer, operating on a modified netlist with explicit clock enable logic, produces slightly different timing paths.


RTL Isolation Cells

The next step added isolation cells at the RTL level. Isolation cells are AND-gate-like structures placed at module output boundaries. When a module’s enable is deasserted, the isolation cell drives its output to a known constant value (zero) instead of forwarding the module’s actual combinational output. This prevents stale or indeterminate values on a module’s outputs from propagating through downstream combinational logic — mux inputs, crossbar data lines, allocator request lines — and causing unnecessary switching on those nets.

This was guarded behind a Verilog define (ISOLATION_CELLS). The run was called basecgiso.
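The effect of an AND-style isolation clamp on downstream switching can be sketched as follows. This is an illustrative Python model, not the project's RTL. The function names and the sample values are hypothetical, and the only behavior taken from the text is that a deasserted enable forces the output to zero.

```python
def isolate(value, enable):
    """AND-style clamp: forward the value when enabled, drive zero otherwise."""
    return value if enable else 0


def bit_toggles(sequence):
    """Total number of bit flips between consecutive values on a bus."""
    return sum(bin(a ^ b).count("1") for a, b in zip(sequence, sequence[1:]))


# Hypothetical 4-bit module output over five cycles. The module is only
# enabled in the first and last cycle; in between its output is stale.
raw_outputs = [0b1010, 0b0101, 0b1111, 0b0000, 0b1100]
enables = [1, 0, 0, 0, 1]

isolated = [isolate(v, e) for v, e in zip(raw_outputs, enables)]

print("toggles without isolation:", bit_toggles(raw_outputs))
print("toggles with isolation:", bit_toggles(isolated))
```

The clamped bus only changes when the module is enabled or disabled, so stale values no longer ripple into the muxes and allocator logic that consume it.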

| Metric | Value |
|---|---|
| Total power | 1,095,670.4 |
| Reduction from base | 57.50% |
| Incremental reduction from baseopt | 3.03% |
| Hold violations | 112 |

The incremental reduction from isolation cells alone is about 3%. This is consistent with the power breakdown at baseline: switching power was already nearly zero compared to internal power. The dominant source is flip-flop toggling, which clock gating addresses directly. Isolation cells address combinational glitching, which is a secondary contributor.

Hold violations jumped from 18 to 112. The isolation cell logic adds new combinational paths near the SRAM macro interface boundaries. Those short new paths create hold violations that the default resizer settings cannot fully fix.


Reading the OpenLane 2 Source

Before trying synthesis-level knobs, the OpenLane 2 source was read directly to build an accurate picture of what is configurable. The documentation does not enumerate every configuration variable, so relying on it alone means missing options that exist in the codebase. Files examined:

  • openlane/steps/openroad.py — all PnR-level variables
  • openlane/steps/yosys.py and openlane/steps/pyosys.py — synthesis variables
  • openlane/scripts/pyosys/synthesize.py — Yosys synthesis script structure
  • openlane/scripts/pyosys/construct_abc_script.py — every ABC optimization strategy definition
  • openlane/flows/classic.py — step ordering and inter-step state passing

Synthesis-level variables found:

  • SYNTH_STRATEGY controls which ABC optimization strategy runs during technology mapping. Options are AREA 0 through AREA 3 and DELAY 0 through DELAY 4. Each is a different combination of ABC passes.
  • SYNTH_SIZING enables cell upsizing and downsizing in ABC during mapping.
  • SYNTH_ABC_DFF passes flip-flops through ABC for retiming across them.
  • SYNTH_SHARE_RESOURCES enables logic resource sharing.
  • SYNTH_ABC_USE_MFS3 enables SAT-based remapping.
  • SYNTH_ABC_BUFFERING enables buffer insertion inside ABC.
  • USE_LIGHTER activates the Lighter Yosys plugin for automatic clock gating inference.

PnR-level variables found:

  • CTS clustering parameters (CTS_SINK_CLUSTERING_SIZE, CTS_SINK_CLUSTERING_MAX_DIAMETER) control how OpenROAD groups flip-flops into CTS clusters.
  • Post-GPL and post-GRT design repair steps (RUN_POST_GPL_DESIGN_REPAIR, RUN_POST_GRT_RESIZER_TIMING) enable buffer insertion and cell resizing to fix slew, cap, and hold violations.
  • Hold and setup slack margins with buffer budgets control how aggressively the resizer overfixes timing.
  • Gate cloning enables the tool to duplicate cells to reduce fanout on high-fanout nets.

What cannot be done in this flow:

  • Multi-Vt cell assignment: the sky130_fd_sc_hd library is a single threshold voltage library. No HVT or LVT variant exists, so multi-Vt cell assignment cannot be applied.
  • DVFS: requires voltage regulators, level shifters, and frequency dividers. OpenLane 2 has no support for voltage island definition, multi-supply nets, or DVFS infrastructure.
  • Power gating: requires header and footer power switch insertion, power domain boundary definition, retention register mapping, and a power management controller. OpenLane 2 supports none of these.

The isolation cells in this project were manually inserted at the RTL level as a partial approximation of what a full power gating flow would provide at a domain boundary.


Synthesis Optimization Run: basecgisoopt

The first synthesis optimization run added AREA 0 strategy, SYNTH_SIZING, SYNTH_ABC_DFF, and SYNTH_SHARE_RESOURCES on top of the RTL changes from previous runs.

AREA 0 uses the resyn2 rewriting passes followed by an area-optimized technology mapper (amap). Cell sizing allows ABC to downsize cells with excess timing slack, reducing drive strength and switching capacitance. ABC DFF optimization allows retiming across flip-flop boundaries within the synthesis engine.
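As a rough sketch, the settings for this run could be expressed as a set of overrides merged onto an existing OpenLane 2 configuration. The four SYNTH_* names and values come from the text above. `VERILOG_DEFINES` as the variable carrying the two defines, the `DESIGN_NAME` value, and the `merge_config` helper are assumptions for illustration, not taken from the project's actual config file.

```python
import json

# Overrides for the basecgisoopt run. The SYNTH_* entries follow the text;
# VERILOG_DEFINES is an assumed variable name for passing the RTL defines.
synth_overrides = {
    "VERILOG_DEFINES": ["CLOCK_GATE", "ISOLATION_CELLS"],
    "SYNTH_STRATEGY": "AREA 0",
    "SYNTH_SIZING": True,
    "SYNTH_ABC_DFF": True,
    "SYNTH_SHARE_RESOURCES": True,
}


def merge_config(base, overrides):
    """Return a new config dict with overrides applied; the base is not modified."""
    merged = dict(base)
    merged.update(overrides)
    return merged


# Hypothetical base configuration; the real one also carries the SRAM macro paths.
base_config = {"DESIGN_NAME": "router", "CLOCK_PERIOD": 25}

config = merge_config(base_config, synth_overrides)
print(json.dumps(config, indent=2))
```

Keeping each run's changes as a separate override dict makes it easy to see exactly which knobs differ between runs such as basecgiso and basecgisoopt.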

| Metric | Value |
|---|---|
| Total power | 1,052,925.4 |
| Reduction from base | 59.16% |
| Incremental from basecgiso | 3.90% |
| Hold violations | 0 |
| Setup violations | 0 |

This was the first fully timing-clean run. The incremental 3.90% power reduction comes from ABC downsizing cells with excess slack after AREA 0 mapping. Timing closed because the synthesis optimization produced a netlist whose remaining violations the default CTS and resizer settings could fully repair.


Synthesis Strategy Sweep

The ABC synthesis strategies are defined in construct_abc_script.py. Each is a different sequence of ABC passes:

AREA 0 runs resyn2 (ABC's standard resynthesis sequence: balance, rewrite, refactor, balance, rewrite, rewrite -z, balance, refactor -z, rewrite -z, balance) followed by amap (area mapper).

AREA 1 runs choice2 (introduces structural choices in the AIG for better technology mapping flexibility) followed by amap.

AREA 2 runs choice2 twice before amap. Running it twice generates a richer set of structural alternatives for the mapper to choose from. This is the most aggressive area recovery strategy among the AREA options.

AREA 3 uses the ORFS area script, which has a different internal structure from the resyn2 family.

Three additional strategy runs were made after basecgisoopt.

Run: area1

AREA 1 with all previous RTL optimizations.

| Metric | Value |
|---|---|
| Total power | 981,842.9 |
| Reduction from base | 61.91% |
| Hold violations | 6 |

Better power than basecgisoopt but not timing-clean. The choice2 strategy produced a more compact netlist, but the default resizer settings could not close 6 remaining hold violations.

Run: area2

AREA 2 with all previous RTL optimizations.

| Metric | Value |
|---|---|
| Total power | 764,607.6 |
| Reduction from base | 70.34% |
| Hold violations | 136 |
| Slew violations | 878 |
| Cap violations | 22 |

Best power result in the entire project. The double application of choice2 produced a significantly smaller, lower-power netlist. The cost was severe timing and signal integrity violations: 136 hold violations, 878 slew violations, and 22 cap violations. The smaller cells AREA 2 maps to have reduced drive strength, making their output nets more susceptible to slew degradation and cap violations. Without sufficient resizer repair, the tool could not fix these.

Run: area3 (FAILED)

Used AREA 3 strategy with post-CTS resizer timing repair disabled and gate cloning disabled. Failed with unresolvable hold violations. The ORFS area script generates a netlist topology that, combined with no resizer repair, leaves too many hold-critical paths unfixed.

Run: noresizer (FAILED)

Used AREA 0 with all resizer timing repair disabled and a cell exclusion list that blacklisted large drive-strength cells (buf_16, buf_12, inv_16, inv_12, dfxtp_4, dfrtp_4). The intent was to force the tool to use smaller, lower-power cells. This failed because sky130_fd_sc_hd does not have sufficient density of intermediate-drive cells to allow the resizer to fix hold violations when high-drive cells are excluded. Without buf_16 and similar cells available for buffer insertion, the resizer has no legal way to add delay on short hold-critical paths.


Timing Closure for area2: CTS and Resizer Tuning

area2 had 136 hold violations, 878 slew violations, and 22 cap violations. The target was to close all of these without losing the 70.34% power reduction. The approach was coordinated changes to CTS clustering and the design repair and resizer steps.

CTS Changes

CTS_SINK_CLUSTERING_SIZE was reduced from 25 to 20. Smaller cluster sizes produce a more balanced clock tree because the tool groups fewer sinks per cluster, reducing variation in clock insertion delay across sinks. CTS_SINK_CLUSTERING_MAX_DIAMETER was reduced from 50 to 40 um, preventing any single cluster from spanning a distance that would produce a long, high-skew branch.

Design Repair (post-GPL)

RUN_POST_GPL_DESIGN_REPAIR was enabled. This runs buffer insertion and cell resizing after global placement but before detailed placement.

  • DESIGN_REPAIR_MAX_SLEW_PCT set to 10
  • DESIGN_REPAIR_MAX_CAP_PCT set to 10
  • DESIGN_REPAIR_MAX_WIRE_LENGTH set to 300 um, forcing buffer insertion on any net longer than 300 um
  • DESIGN_REPAIR_BUFFER_INPUT_PORTS and DESIGN_REPAIR_BUFFER_OUTPUT_PORTS both enabled

Resizer Timing (post-GRT)

RUN_POST_GRT_RESIZER_TIMING was enabled. PL_RESIZER_HOLD_SLACK_MARGIN was set to 0.15 ns and GRT_RESIZER_HOLD_SLACK_MARGIN to 0.10 ns. These margins cause the resizer to overfix hold timing beyond the zero-slack target, providing headroom against extraction pessimism. Hold and setup buffer budgets were set to 60%, which caps the number of buffers the resizer may insert as a percentage of the design's instance count. Gate cloning was enabled for both post-CTS and post-GRT stages to reduce fanout on nets with excessive load.
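The CTS, design repair, and resizer changes above can be collected into one set of overrides, grouped by stage. This is a sketch, not the project's config file. Variables named in the text are used as given. The buffer budget and gate cloning variable names are not stated in the text, so the `*_MAX_BUFFER_PCT` and `*_GATE_CLONING` names below are assumptions, as is the `flatten` helper.

```python
import json

cts = {
    "CTS_SINK_CLUSTERING_SIZE": 20,          # reduced from 25
    "CTS_SINK_CLUSTERING_MAX_DIAMETER": 40,  # um, reduced from 50
}

design_repair = {
    "RUN_POST_GPL_DESIGN_REPAIR": True,
    "DESIGN_REPAIR_MAX_SLEW_PCT": 10,
    "DESIGN_REPAIR_MAX_CAP_PCT": 10,
    "DESIGN_REPAIR_MAX_WIRE_LENGTH": 300,    # um
    "DESIGN_REPAIR_BUFFER_INPUT_PORTS": True,
    "DESIGN_REPAIR_BUFFER_OUTPUT_PORTS": True,
}

resizer = {
    "RUN_POST_GRT_RESIZER_TIMING": True,
    "PL_RESIZER_HOLD_SLACK_MARGIN": 0.15,    # ns
    "GRT_RESIZER_HOLD_SLACK_MARGIN": 0.10,   # ns
    # Assumed variable names for the 60% buffer budgets and gate cloning.
    "PL_RESIZER_HOLD_MAX_BUFFER_PCT": 60,
    "PL_RESIZER_SETUP_MAX_BUFFER_PCT": 60,
    "GRT_RESIZER_HOLD_MAX_BUFFER_PCT": 60,
    "GRT_RESIZER_SETUP_MAX_BUFFER_PCT": 60,
    "PL_RESIZER_GATE_CLONING": True,
    "GRT_RESIZER_GATE_CLONING": True,
}


def flatten(*groups):
    """Merge per-stage groups into one dict, rejecting duplicate keys."""
    merged = {}
    for group in groups:
        overlap = merged.keys() & group.keys()
        if overlap:
            raise ValueError(f"duplicate variables: {sorted(overlap)}")
        merged.update(group)
    return merged


area2cts_overrides = flatten(cts, design_repair, resizer)
print(json.dumps(area2cts_overrides, indent=2))
```

Rejecting duplicate keys guards against one stage silently overwriting a value set by another when the groups are edited independently.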

Run: area2cts

| Metric | Value |
|---|---|
| Total power | 764,607.6 |
| Hold violations | 0 |
| Setup violations | 0 |
| Slew violations | 45 |
| Cap violations | 0 |
| Wirelength | 152,592 um |

Power stayed at 764,607.6. All timing and cap violations closed. The 45 remaining slew violations are exclusively on nets connected to SRAM macro output pins. The macro’s output driver characteristics come from its Liberty model and cannot be changed by the PnR resizer. OpenROAD cannot insert buffers between a macro output pin and its connected net in a way that fixes slew without going inside the macro boundary, which is not allowed. These 45 violations are a known limitation of hardened SRAM macros and are present across all runs.

The wirelength increase from 138,621 to 152,592 um (about 10%) is the cost of the buffers added by design repair and resizer.


Floorplan Optimization

The baseline floorplan used a 2000x2000 um die. The 15 SRAM macros accounted for 87% of total instance area. Standard cell utilization was only 4.6% of core area. The standard cells were essentially floating in a sea of filler cells because the core was massively oversized relative to actual logic density.

The original macro placement put all 10 input SRAM macros in two columns at x=10 um and x=210 um, and all 5 output SRAM macros in a single column at x=700 um. Vertical pitch between macros in the same column was 190 um.

Several smaller die sizes were attempted. 1200x1200 and 1500x1500 um both failed: the former with PDN channel errors (the power grid routing could not find channels between tightly packed macros) and the latter with routing congestion failures. Changing only the placement density target from 30% to 35% without changing the die size also caused routing congestion.

A 3-column macro layout was designed for the 1800x1800 um die:

  • Column 1 at x=120 um: VC0 SRAM macros for all 5 input units (one per row)
  • Column 2 at x=340 um: VC1 SRAM macros for all 5 input units (one per row)
  • Column 3 at x=820 um: Output FIFO SRAM macros for all 5 output units (one per row)
  • Vertical pitch: 200 um

Grouping each input unit’s two VC FIFOs into adjacent columns keeps the short data and control paths between them compact. The 480 um offset between the x positions of column 2 and column 3 leaves a wide placement channel where the crossbar and switch allocator logic can be placed without being squeezed between macro columns.
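The 3-column layout can be generated programmatically rather than typed by hand. The sketch below uses the column x positions and the 200 um pitch from the list above. The instance names, the y position of the first row, the orientation, and the "name x y orientation" line format are assumptions for illustration and would need to match the real netlist and placement file.

```python
COLUMN_X_UM = {"vc0": 120.0, "vc1": 340.0, "out": 820.0}
VERTICAL_PITCH_UM = 200.0
NUM_UNITS = 5
FIRST_ROW_Y_UM = 100.0  # hypothetical: the text does not give the first row's y


def macro_placements():
    """Return (instance, x, y) tuples for the 15 SRAM macros, one row per unit."""
    placements = []
    for unit in range(NUM_UNITS):
        y = FIRST_ROW_Y_UM + unit * VERTICAL_PITCH_UM
        # Hypothetical instance names; the real hierarchy paths will differ.
        placements.append((f"input_unit{unit}.vc0_fifo.sram", COLUMN_X_UM["vc0"], y))
        placements.append((f"input_unit{unit}.vc1_fifo.sram", COLUMN_X_UM["vc1"], y))
        placements.append((f"output_unit{unit}.fifo.sram", COLUMN_X_UM["out"], y))
    return placements


def to_cfg_lines(placements, orientation="N"):
    """Format placements as 'instance x y orientation' lines (assumed format)."""
    return [f"{name} {x:.2f} {y:.2f} {orientation}" for name, x, y in placements]


for line in to_cfg_lines(macro_placements()):
    print(line)
```

Generating the file this way keeps the pitch and column positions in one place, which helps when iterating on die size as in the failed 1200x1200 and 1500x1500 um attempts.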

Run: area2fp

| Metric | Value |
|---|---|
| Total power | 764,607.6 |
| Die area | 1,528,380 um² |
| Reduction in die area | 16.47% from area2cts |
| Wirelength | 143,786 um |
| Reduction in wirelength | 5.77% from area2cts |
| Hold violations | 1 (WNS = -0.0035 ns) |
| Setup violations | 0 |
| Slew violations | 47 |
| Cap violations | 1 |
| Fill cells | 82,592 (down from 108,434) |
| Utilization | 31.8% (up from 26.8%) |

Power is unchanged because floorplan optimization does not affect logic mapping or cell sizing, only physical placement. The die area reduction of 16.47% is significant. The wirelength reduction comes from shorter routes between cells packed more tightly. The single remaining hold violation at -0.0035 ns on a 25 ns clock is marginal and is attributed to a timing corner at an SRAM macro interface. The fill cell count dropped 23.8% because there is less empty space in the smaller die.


Complete Run Summary

| Run | Power | vs Base | Die Area (um²) | Wirelength (um) | Hold | Setup | Slew | Status |
|---|---|---|---|---|---|---|---|---|
| base | 2,577,878.5 | 0% | 1,824,020 | 140,132 | 42 | 0 | 789 | baseline |
| baseopt | 1,129,917.0 | -56.17% | 1,829,730 | 144,640 | 18 | 0 | 830 | clock gating |
| basecgiso | 1,095,670.4 | -57.50% | 1,830,030 | 142,056 | 112 | 0 | 801 | + isolation cells |
| basecgisoopt | 1,052,925.4 | -59.16% | 1,830,820 | 147,129 | 0 | 0 | 879 | + synth opts |
| area1 | 981,842.9 | -61.91% | 1,830,380 | 143,817 | 6 | 0 | 820 | AREA 1 |
| area2 | 764,607.6 | -70.34% | 1,829,840 | 138,621 | 136 | 0 | 878 | AREA 2 |
| area3 | FAILED | | | | | | | AREA 3 |
| noresizer | FAILED | | | | | | | no resizer |
| area2cts | 764,607.6 | -70.34% | 1,829,840 | 152,592 | 0 | 0 | 45 | timing clean |
| area2fp | 764,607.6 | -70.34% | 1,528,380 | 143,786 | 1 | 0 | 47 | floorplan opt |
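The reduction percentages in the summary can be re-derived from the raw power figures. The snippet below is a small check using the totals reported for each run. The helper name `reduction_pct` is hypothetical.

```python
# Total power per run, in the same OpenSTA normalized units as the summary.
total_power = {
    "base": 2_577_878.5,
    "baseopt": 1_129_917.0,
    "basecgiso": 1_095_670.4,
    "basecgisoopt": 1_052_925.4,
    "area1": 981_842.9,
    "area2": 764_607.6,
}


def reduction_pct(reference, value):
    """Percentage reduction of value relative to reference, to two decimals."""
    return round(100.0 * (reference - value) / reference, 2)


# Reduction of every run relative to the baseline.
for run, power in total_power.items():
    print(f"{run:14s} -{reduction_pct(total_power['base'], power):.2f}%")

# Incremental reductions between consecutive RTL and synthesis steps.
print(reduction_pct(total_power["baseopt"], total_power["basecgiso"]))
print(reduction_pct(total_power["basecgiso"], total_power["basecgisoopt"]))
```

The same helper applies to the die area and wirelength comparisons between area2cts and area2fp.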

Metrics Collection

All runs produce final/metrics.json from the OpenLane 2 signoff steps.

  • Power: OpenSTA static analysis at tt_025C_1v80.
  • Timing: Worst-case across 9 corners: tt_025C_1v80, ss_100C_1v60, ff_n40C_1v95, each crossed with min/nom/max parasitic extraction corners. Hold violations reported are worst across all 9 corners.
  • IR drop: OpenROAD PSM on power and ground nets.
  • Routing DRC: OpenROAD DRT.
  • Physical DRC and LVS: Magic and KLayout.
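The nine timing corners and the worst-case reporting can be sketched as below. The three PVT corners and the min/nom/max extraction corners come from the list above. The `extraction_pvt` naming scheme and the per-corner counts are hypothetical and only illustrate how a single worst-case number is picked.

```python
from itertools import product

PVT_CORNERS = ["tt_025C_1v80", "ss_100C_1v60", "ff_n40C_1v95"]
EXTRACTION_CORNERS = ["min", "nom", "max"]


def corner_names():
    """Cross every extraction corner with every PVT corner (assumed naming)."""
    return [f"{rc}_{pvt}" for rc, pvt in product(EXTRACTION_CORNERS, PVT_CORNERS)]


def worst_corner(counts_by_corner):
    """Return (corner, count) for the corner with the most violations."""
    corner = max(counts_by_corner, key=counts_by_corner.get)
    return corner, counts_by_corner[corner]


corners = corner_names()
print(len(corners), "corners")

# Hypothetical per-corner hold violation counts, for illustration only.
hold_counts = {name: 0 for name in corners}
hold_counts["min_ff_n40C_1v95"] = 1

print(worst_corner(hold_counts))
```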

Known Issue: SRAM DRC Violations

All DRC errors across every run are 132 nwell.4 violations located inside the SRAM macro instance boundaries. Zero routing DRC errors exist on the standard cell or interconnect layers.

OpenRAM-generated macros for sky130 use an optimized SRAM-specific layout. The abstract LEF used during PnR exposes only the external metal interface. When Magic runs DRC on the assembled design, it sees the nwell regions from the macro boundary without the internal tap cell geometry that satisfies the nwell.4 rule. The official OpenLane documentation and the SkyWater PDK known issues page both document this. These errors appear identically on every run including the unmodified baseline and are not caused by any work in this project.