Physical Design of a Wormhole NoC Router on Sky130: From Low Power Techniques to Pipelining to DFT Scan Insertion
The Design Under Work
The subject of all this work is a 5-port wormhole Network-on-Chip router written in Verilog and targeted at the SkyWater 130 nm open-source process node (sky130A), using the sky130_fd_sc_hd high-density standard cell library. The full physical design flow runs on OpenLane 2 with the Classic flow.
The router has five ports: Local, North, South, East, and West. Each input port has two virtual channels, each VC backed by its own synchronous FIFO. The FIFO storage is not register-based; it uses a single-port SRAM macro (sram_1rw_16x16) generated by OpenRAM. With 5 input ports times 2 VCs per port, that gives 10 input FIFOs. Each of the 5 output ports also has its own output FIFO backed by another SRAM macro, giving 5 output FIFOs. Total SRAM instance count in the design: 15 macros.
The router uses XY routing. The routing decision is decoded from the destination field encoded in the HEAD flit. The design includes a crossbar, a switch allocator, and a set of round-robin arbiters for contention resolution. Flit width is 16 bits.
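As a rough illustration of the route decode, here is a minimal Python sketch of dimension-ordered XY routing. The coordinate fields, the direction convention (y increasing toward North), and the port numbering beyond Local = 0 and North = 1 are assumptions made for the example; the post does not show the actual flit field layout.

```python
from enum import Enum

class Port(Enum):
    LOCAL = 0
    NORTH = 1
    SOUTH = 2   # assumption: numbering follows the order the ports are listed in
    EAST = 3
    WEST = 4

def xy_route(cur_x, cur_y, dst_x, dst_y):
    """Dimension-ordered routing: resolve the X offset first, then Y."""
    if dst_x > cur_x:
        return Port.EAST
    if dst_x < cur_x:
        return Port.WEST
    if dst_y > cur_y:
        return Port.NORTH   # assumption: y grows toward North
    if dst_y < cur_y:
        return Port.SOUTH
    return Port.LOCAL       # destination reached: eject to the local port

print(xy_route(1, 1, 3, 0).name)  # EAST: X is corrected before Y
```

Because X is always resolved before Y, a packet never turns from a Y direction back into an X direction, which is what makes XY routing deadlock-free on a mesh.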
The clock period used throughout the low-power experiments was 25 ns. During the pipelining work the clock was tightened to 20 ns to measure actual setup slack improvement.
Part One: Low Power Physical Design
Establishing a Baseline
The first run, called base, used no low power techniques whatsoever. The RTL was taken through synthesis with default settings, and the only non-default configuration items were those required to integrate the SRAM macros: LEF paths, GDS paths, LIB paths, Verilog behavioral model paths, die area set to 2000x2000 um, a manual macro placement file, and the DRC/LVS relaxation flags required because OpenRAM SRAM macros have known DRC issues under Magic (covered in detail later in this post). Placement density was set to 30% of core area.
The baseline results:
| Metric | Value |
|---|---|
| Total power | 2,577,878.5 (normalized units from OpenSTA) |
| Internal power | 2,577,878.5 |
| Switching power | 0.003 |
| Leakage power | 0.000015 |
| Hold violations | 42 |
| Setup violations | 0 |
| Slew violations | 789 |
| Cap violations | 13 |
| Die area | 1,824,020 um² |
| Wirelength | 140,132 um |
The power breakdown is the first important observation. Essentially all power is internal power; switching power and leakage are negligible. This is expected and well understood: in a synchronous CMOS digital design, internal power comes from switching activity on node capacitances inside cells (primarily flip-flop clock and output nodes), and it scales with the clock frequency. This design has hundreds of registers distributed across the VC FIFOs, output FIFOs, state machines, and allocator logic. Every one of those registers receives every clock edge, regardless of whether any useful data is being processed, because the clock tree is routed unconditionally to all flip-flops. The 42 hold violations and 789 slew violations at the baseline are addressed by later optimization steps.
RTL-Level Clock Gating
The single highest-impact optimization applied in this entire project was RTL-level clock gating. An integrated clock gating (ICG) cell was manually instantiated in the Verilog source. The ICG cell is a latch-based gate with an enable input, a clock input, and a gated clock output. When the enable signal is low, the gated clock output stays at a constant logic level and never transitions. The ICG cells were placed in the clock path to the register banks inside the VC FIFOs and output FIFOs. When a FIFO slot is not actively being written, the enable to its corresponding ICG cell is deasserted, and none of the data registers in that FIFO bank toggle.
The RTL change was guarded behind a Verilog define (CLOCK_GATE) passed through the OpenLane configuration. This run was called baseopt.
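The behavior of a latch-based ICG can be sketched in a few lines of Python. This is a behavioral model only, assuming the common structure where the enable latch is transparent while the clock is low; it is not the cell or the RTL used in the project.

```python
def icg(samples):
    """Latch-based clock gate. samples is a list of (clk, en) pairs.

    The enable latch is transparent while clk is low and holds while clk is
    high, so the gated clock is clk AND the latched enable.
    """
    latched_en = 0
    gclk = []
    for clk, en in samples:
        if clk == 0:
            latched_en = en          # latch follows enable only in the low phase
        gclk.append(clk & latched_en)
    return gclk

def rising_edges(wave):
    """Count 0 -> 1 transitions in a sampled waveform."""
    return sum(1 for a, b in zip(wave, wave[1:]) if (a, b) == (0, 1))

# Four clock cycles sampled at half-cycle resolution; enable is asserted only
# around the second cycle.
clk = [0, 1, 0, 1, 0, 1, 0, 1]
en  = [0, 0, 1, 1, 0, 0, 0, 0]
gclk = icg(list(zip(clk, en)))
print(gclk)
print(rising_edges(clk), "free-running edges vs", rising_edges(gclk), "gated edges")
```

Registers behind the gate see one clock edge instead of four in this window. Latching the enable in the low phase is what keeps an enable change during the high phase from producing a glitch on the gated clock.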
Results after clock gating:
| Metric | Value |
|---|---|
| Total power | 1,129,917 |
| Reduction from base | 56.17% |
| Clock gate cells in placed layout | 45 |
| Hold violations | 18 |
A 56% power reduction from a single RTL modification. The reason this works so dramatically is that internal power is proportional to switching activity. Gating the clock to idle register banks does not merely reduce power proportionally to the fraction of time those banks are idle; it eliminates all toggling on those nodes when idle, which is most of the time in a router that is not saturated. The 45 ICG cells visible in the final layout correspond to the latch-based gating cells inserted by the synthesizer when it processes the ICG instantiations in the RTL. Hold violations dropped from 42 to 18 because the synthesizer, now operating on a modified netlist with explicit clock enable logic, produces slightly different timing paths.
RTL-Level Isolation Cells
The next step added isolation cells at the RTL level. Isolation cells are tri-state or AND-gate-like structures placed at module output boundaries. When a module’s enable signal is deasserted (the module is idle), the isolation cell drives its output to a known constant value (zero in this case) instead of forwarding the module’s actual combinational output. The purpose is to prevent stale or indeterminate values on a module’s outputs from propagating through downstream combinational logic (mux inputs, crossbar data lines, allocator request lines) and causing unnecessary switching activity on those nets.
This was also guarded behind a Verilog define (ISOLATION_CELLS). The run was called basecgiso.
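To make the effect concrete, the following Python sketch counts bit toggles on a module output with and without AND-style isolation. The data values and the enable pattern are made up for the example; only the clamp-to-zero behavior comes from the description above.

```python
def isolate(data, enable):
    """AND-style isolation: forward data when enabled, else clamp to 0."""
    return data if enable else 0

def bit_toggles(values):
    """Total number of bit flips between consecutive values."""
    return sum(bin(a ^ b).count("1") for a, b in zip(values, values[1:]))

# Hypothetical 16-bit module output. The module is idle in cycles 2..5, but
# its raw output still wanders between stale values.
raw    = [0x00FF, 0x00FF, 0x1234, 0xFFFF, 0x0000, 0xABCD, 0x00F0]
enable = [1,      1,      0,      0,      0,      0,      1]
clamped = [isolate(d, e) for d, e in zip(raw, enable)]

print("toggles without isolation:", bit_toggles(raw))
print("toggles with isolation:   ", bit_toggles(clamped))
```

The downstream nets only switch when the module enters or leaves the idle state, instead of following every stale value in between.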
Results:
| Metric | Value |
|---|---|
| Total power | 1,095,670.4 |
| Reduction from base | 57.50% |
| Incremental reduction from baseopt | 3.03% |
| Hold violations | 112 |
The incremental reduction from isolation cells alone is about 3%, which is small. This is consistent with what was observed in the power breakdown at baseline: switching power (which isolation cells address) was already nearly zero compared to internal power. The dominant power source is flip-flop toggling, which clock gating addresses directly. Isolation cells address combinational glitching, and combinational glitching is a secondary contributor. The hold violation count jumped from 18 to 112. This happens because the isolation cell logic adds new combinational paths near the SRAM macro interface boundaries, and those short new paths create hold violations that the default resizer settings cannot fully fix.
Reading the OpenLane 2 Source to Find Available Optimizations
Before trying synthesis-level knobs, the OpenLane 2 source code was read directly to build an accurate picture of what is actually configurable. The reason for this: the OpenLane 2 documentation does not enumerate every configuration variable, and relying on documentation alone means missing options that exist in the codebase but are not documented. The files examined were:
- openlane/steps/openroad.py for all PnR-level variables
- openlane/steps/yosys.py and openlane/steps/pyosys.py for synthesis variables
- openlane/scripts/pyosys/synthesize.py for the Yosys synthesis script structure
- openlane/scripts/pyosys/construct_abc_script.py for every ABC optimization strategy definition
- openlane/flows/classic.py for the step ordering and inter-step state passing
This source inspection identified the following relevant variables:
Synthesis-level:
- SYNTH_STRATEGY controls which ABC optimization strategy runs during technology mapping. Options are AREA 0, AREA 1, AREA 2, AREA 3, and DELAY 0 through DELAY 4. Each corresponds to a different combination of ABC passes (rewrite, refactor, retime, map).
- SYNTH_SIZING enables cell upsizing and downsizing in ABC during mapping.
- SYNTH_ABC_DFF passes flip-flops through ABC so it can retime across them.
- SYNTH_SHARE_RESOURCES enables logic resource sharing (merging functionally equivalent sub-expressions).
- SYNTH_ABC_USE_MFS3 enables SAT-based remapping.
- SYNTH_ABC_BUFFERING enables buffer insertion inside ABC.
- USE_LIGHTER activates the Lighter Yosys plugin for automatic clock gating inference.
PnR-level:
CTS clustering parameters (CTS_SINK_CLUSTERING_SIZE, CTS_SINK_CLUSTERING_MAX_DIAMETER) control how OpenROAD groups flip-flops into CTS clusters. Post-GPL and post-GRT design repair steps (RUN_POST_GPL_DESIGN_REPAIR, RUN_POST_GRT_RESIZER_TIMING) enable buffer insertion and cell resizing to fix slew, cap, and hold violations. Hold and setup slack margins with buffer budgets control how aggressively the resizer overfixes timing. Gate cloning enables the tool to duplicate cells to reduce fanout on high-fanout nets.
What cannot be done in this flow:
The sky130_fd_sc_hd library is a single threshold voltage library. There is no HVT variant and no LVT variant. Multi-Vt cell assignment, which is one of the most effective post-synthesis leakage power reduction techniques in commercial flows, cannot be applied here because only one Vt flavor exists. DVFS requires voltage regulators, level shifters, and frequency dividers at the system level; OpenLane 2 has no support for voltage island definition, multi-supply nets, or DVFS-related infrastructure. Power gating (MTCMOS) requires header and footer power switch insertion, power domain boundary definition, retention register mapping, and a power management controller; OpenLane 2 supports none of these. The isolation cells in this project were manually inserted at the RTL level as a partial approximation of what a full power gating flow would provide at a domain boundary.
Synthesis Optimization Run: basecgisoopt
The first synthesis optimization run added AREA 0 strategy, SYNTH_SIZING, SYNTH_ABC_DFF, and SYNTH_SHARE_RESOURCES on top of the RTL changes from the previous runs. AREA 0 uses the resyn2 rewriting passes followed by an area-optimized technology mapper (amap). Cell sizing allows ABC to downsize cells that have excess timing slack, reducing their drive strength and switching capacitance. ABC DFF optimization allows retiming across flip-flop boundaries within the synthesis engine.
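As a sketch of what this run changes in the configuration, the synthesis settings can be written as a small set of overrides. The variable names are the ones listed earlier; the dictionary name and the idea of emitting the overrides as JSON are illustrative assumptions, and the rest of the real configuration (macro paths, die area, and so on) is omitted.

```python
import json

# Overrides for the basecgisoopt run, using the variable names from the text.
# Assumption: these sit on top of an existing config that already carries the
# SRAM macro integration settings and the RTL defines.
synth_overrides = {
    "SYNTH_STRATEGY": "AREA 0",
    "SYNTH_SIZING": True,
    "SYNTH_ABC_DFF": True,
    "SYNTH_SHARE_RESOURCES": True,
}

print(json.dumps(synth_overrides, indent=2))
```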
Results:
| Metric | Value |
|---|---|
| Total power | 1,052,925.4 |
| Reduction from base | 59.16% |
| Incremental from basecgiso | 3.90% |
| Hold violations | 0 |
| Setup violations | 0 |
This was the first fully timing-clean run. The incremental 3.90% power reduction comes from ABC downsizing cells that had excess slack after the AREA 0 mapping. The timing closure (zero hold and setup violations) happened because the synthesis optimization produced a netlist that the default CTS and resizer settings could handle without overflowing.
Synthesis Strategy Sweep
The ABC synthesis strategies are defined in construct_abc_script.py inside the OpenLane 2 source. Each strategy is a different sequence of ABC passes:
AREA 0 runs resyn2 (a standard area recovery sequence of rewrite, rewrite -z, balance, rewrite, rewrite -z, balance, rewrite, balance) followed by amap (area mapper).
AREA 1 runs choice2 (introduces structural choices in the AIG for better technology mapping flexibility) followed by amap.
AREA 2 runs choice2 twice before amap. Running choice2 twice is more aggressive because it generates a richer set of structural alternatives for the mapper to choose from. This is the most aggressive area recovery strategy among the AREA options.
AREA 3 uses the ORFS area script, which has a completely different internal structure from the resyn2 family.
Three additional strategy runs were made after basecgisoopt.
Run: area1
AREA 1 with all previous RTL optimizations.
| Metric | Value |
|---|---|
| Total power | 981,842.9 |
| Reduction from base | 61.91% |
| Hold violations | 6 |
Better power than basecgisoopt but not timing-clean. The choice2 strategy produced a more compact netlist, but the default resizer settings could not close 6 remaining hold violations.
Run: area2
AREA 2 with all previous RTL optimizations.
| Metric | Value |
|---|---|
| Total power | 764,607.6 |
| Reduction from base | 70.34% |
| Hold violations | 136 |
| Slew violations | 878 |
| Cap violations | 22 |
This is the best power result of any run in the entire project. The double application of choice2 produced a significantly smaller and lower-power netlist. The cost was severe timing and signal integrity violations: 136 hold violations, 878 slew violations, and 22 cap violations. The smaller cells that AREA 2 maps to have reduced drive strength, which means their output nets are more susceptible to slew degradation and capacitance violations. Without sufficient resizer repair, the tool could not fix these.
Run: area3 (FAILED)
Used AREA 3 strategy with the post-CTS resizer timing repair disabled and gate cloning disabled. Failed with unresolvable hold violations. The ORFS area script generates a netlist topology that, when combined with no resizer repair, leaves too many hold-critical paths unfixed.
Run: noresizer (FAILED)
Used AREA 0 with all resizer timing repair disabled and a cell exclusion list that blacklisted large drive-strength cells (buf_16, buf_12, inv_16, inv_12, dfxtp_4, dfrtp_4). The idea was to force the tool to use smaller, lower-power cells. This failed because the sky130_fd_sc_hd library does not have a sufficient density of intermediate-drive cells to allow the resizer to fix hold violations when the high-drive cells are excluded. Without buf_16 and similar cells available for buffer insertion, the resizer has no legal way to add delay on short paths that violate hold.
Timing Closure for area2: The CTS and Resizer Tuning
The area2 run had 136 hold violations, 878 slew violations, and 22 cap violations. The target was to bring all of these to zero without losing the 70.34% power reduction.
The approach was a coordinated set of changes to CTS clustering and the design repair and resizer steps.
CTS changes:
CTS_SINK_CLUSTERING_SIZE was reduced from the default of 25 to 20. Smaller cluster sizes produce a more balanced clock tree because the tool groups fewer sinks per cluster, reducing the variation in clock insertion delay across sinks. CTS_SINK_CLUSTERING_MAX_DIAMETER was reduced from 50 to 40 um. This prevents any single cluster from spanning too large a physical distance, which would produce a long, high-skew branch.
Design repair (post-GPL):
RUN_POST_GPL_DESIGN_REPAIR was enabled. This runs buffer insertion and cell resizing after global placement but before detailed placement. DESIGN_REPAIR_MAX_SLEW_PCT was set to 10 (10% of the slew limit as margin). DESIGN_REPAIR_MAX_CAP_PCT was set to 10. DESIGN_REPAIR_MAX_WIRE_LENGTH was set to 300 um, forcing buffer insertion on any net longer than 300 um. DESIGN_REPAIR_BUFFER_INPUT_PORTS and DESIGN_REPAIR_BUFFER_OUTPUT_PORTS were both enabled.
Resizer timing (post-GRT):
RUN_POST_GRT_RESIZER_TIMING was enabled. PL_RESIZER_HOLD_SLACK_MARGIN was set to 0.15 ns and GRT_RESIZER_HOLD_SLACK_MARGIN was set to 0.10 ns. These margins cause the resizer to overfix hold timing beyond the zero-slack target, providing headroom against extraction pessimism. Hold and setup buffer budgets were set to 60% of the available slack budget. Gate cloning was enabled for both the post-CTS and post-GRT stages to reduce fanout on nets with excessive load.
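Collected in one place, the settings named above look roughly like this. Only variables whose names appear in the text are included; the buffer budget and gate cloning settings are left out because the post does not give their variable names, and the dictionary name is just a label for the example.

```python
import json

area2cts_overrides = {
    # CTS clustering
    "CTS_SINK_CLUSTERING_SIZE": 20,           # down from the default of 25
    "CTS_SINK_CLUSTERING_MAX_DIAMETER": 40,   # um, down from 50
    # Design repair after global placement
    "RUN_POST_GPL_DESIGN_REPAIR": True,
    "DESIGN_REPAIR_MAX_SLEW_PCT": 10,
    "DESIGN_REPAIR_MAX_CAP_PCT": 10,
    "DESIGN_REPAIR_MAX_WIRE_LENGTH": 300,     # um
    "DESIGN_REPAIR_BUFFER_INPUT_PORTS": True,
    "DESIGN_REPAIR_BUFFER_OUTPUT_PORTS": True,
    # Resizer timing repair after global routing
    "RUN_POST_GRT_RESIZER_TIMING": True,
    "PL_RESIZER_HOLD_SLACK_MARGIN": 0.15,     # ns
    "GRT_RESIZER_HOLD_SLACK_MARGIN": 0.10,    # ns
}

print(json.dumps(area2cts_overrides, indent=2))
```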
Run: area2cts
| Metric | Value |
|---|---|
| Total power | 764,607.6 |
| Hold violations | 0 |
| Setup violations | 0 |
| Slew violations | 45 |
| Cap violations | 0 |
| Wirelength | 152,592 um |
Power stayed exactly at 764,607.6, and all hold, setup, and cap violations closed. The 45 remaining slew violations are exclusively on nets connected to SRAM macro output pins. The SRAM macro’s output driver characteristics are fixed: they come from the macro’s Liberty model and cannot be changed by the PnR resizer. OpenROAD can only add buffers outside the macro boundary, which does not repair the slew at the macro pin itself; a real fix would mean changing the driver inside the macro, which is not allowed. These 45 violations are a known limitation of using hardened SRAM macros. The wirelength increase from 138,621 um in area2 to 152,592 um (about 10%) is the cost of the buffers added by design repair and the resizer.
Floorplan Optimization
The baseline floorplan used a 2000x2000 um die. The 15 SRAM macros accounted for 87% of total instance area. Standard cell utilization was only 4.6% of the core area. The stdcells were essentially floating in a sea of filler cells because the core was massively oversized relative to the actual logic density.
The original macro placement put all 10 input SRAM macros in two columns at x=10 um and x=210 um, and all 5 output SRAM macros in a single column at x=700 um. Vertical pitch between macros in the same column was 190 um.
Several smaller die sizes were attempted. 1200x1200 and 1500x1500 um both failed, the former with PDN channel errors (the power grid routing could not find channels between tightly packed macros) and the latter with routing congestion failures. Changing only the placement density target from 30% to 35% without changing the die size also caused routing congestion.
A new 3-column macro layout was designed:
- Column 1 at x=120 um: VC0 SRAM macros for all 5 input units (one per row)
- Column 2 at x=340 um: VC1 SRAM macros for all 5 input units (one per row)
- Column 3 at x=820 um: Output FIFO SRAM macros for all 5 output units (one per row)
- Vertical pitch: 200 um
This groups each input unit’s two VC FIFOs into adjacent columns, keeping the short data and control paths between them compact. The gap between column 2 and column 3 (480 um) provides a placement channel where the crossbar and switch allocator logic can be placed by the tool without being squeezed between macro columns.
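A short Python sketch shows how a 3-column placement like this could be generated rather than typed by hand. The column x offsets and the 200 um pitch come from the list above; the y offset of the first row, the instance names, the orientation, and the one-line-per-instance output layout are all assumptions for the example.

```python
COLUMNS = {            # x offsets in um, from the layout description
    "vc0": 120,
    "vc1": 340,
    "out": 820,
}
PITCH_Y = 200          # vertical pitch in um, from the layout description
Y0 = 100               # assumption: y offset of the first row
N_PORTS = 5

def macro_rows():
    """Return (instance, x, y, orientation) for all 15 SRAM macros."""
    rows = []
    for kind, x in COLUMNS.items():
        for port in range(N_PORTS):
            name = f"u_{kind}_sram_{port}"   # hypothetical instance name
            rows.append((name, x, Y0 + port * PITCH_Y, "N"))
    return rows

for name, x, y, orient in macro_rows():
    print(f"{name} {x} {y} {orient}")
```

Generating the file makes it cheap to retry a different pitch or column gap when a smaller die fails on PDN channels or congestion.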
Run: area2fp
Die shrunk from 2000x2000 to 1800x1800 um. Core utilization increased from 25% to 30%. Placement density increased from 30% to 35%.
| Metric | Value |
|---|---|
| Total power | 764,607.6 |
| Die area | 1,528,380 um² |
| Reduction in die area | 16.47% from area2cts |
| Wirelength | 143,786 um |
| Reduction in wirelength | 5.77% from area2cts |
| Hold violations | 1 (worst hold slack = -0.0035 ns) |
| Setup violations | 0 |
| Slew violations | 47 |
| Cap violations | 1 |
| Fill cells | 82,592 (down from 108,434) |
| Utilization | 31.8% (up from 26.8%) |
Power is unchanged because floorplan optimization does not affect logic mapping or cell sizing, only physical placement. The die area reduction of 16.47% is significant. The wirelength reduction comes from shorter routes between cells that are now packed more tightly. The single remaining hold violation at -0.0035 ns on a 25 ns clock is marginal and is attributed to a timing corner at an SRAM macro interface. The fill cell count dropped 23.8% because there is less empty space in the smaller die to fill.
Complete Run Summary
| Run | Power | vs Base | Die Area | Wirelength | Hold | Setup | Slew | Status |
|---|---|---|---|---|---|---|---|---|
| base | 2,577,878.5 | 0% | 1,824,020 | 140,132 | 42 | 0 | 789 | baseline |
| baseopt | 1,129,917.0 | -56.17% | 1,829,730 | 144,640 | 18 | 0 | 830 | clock gating |
| basecgiso | 1,095,670.4 | -57.50% | 1,830,030 | 142,056 | 112 | 0 | 801 | + isolation |
| basecgisoopt | 1,052,925.4 | -59.16% | 1,830,820 | 147,129 | 0 | 0 | 879 | + synth opts |
| area1 | 981,842.9 | -61.91% | 1,830,380 | 143,817 | 6 | 0 | 820 | AREA 1 |
| area2 | 764,607.6 | -70.34% | 1,829,840 | 138,621 | 136 | 0 | 878 | AREA 2 |
| area3 | FAILED | N/A | N/A | N/A | N/A | N/A | N/A | AREA 3 |
| noresizer | FAILED | N/A | N/A | N/A | N/A | N/A | N/A | no resizer |
| area2cts | 764,607.6 | -70.34% | 1,829,840 | 152,592 | 0 | 0 | 45 | timing clean |
| area2fp | 764,607.6 | -70.34% | 1,528,380 | 143,786 | 1 | 0 | 47 | floorplan opt |
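The "vs Base" column and the incremental reductions quoted in the text can be recomputed from the reported totals. This is a small arithmetic check, not part of the flow; the helper name is arbitrary.

```python
runs = [  # (run name, total power as reported in the summary table)
    ("base", 2577878.5),
    ("baseopt", 1129917.0),
    ("basecgiso", 1095670.4),
    ("basecgisoopt", 1052925.4),
    ("area1", 981842.9),
    ("area2", 764607.6),
]

def reduction_pct(before, after):
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before

base_power = runs[0][1]
for name, power in runs[1:]:
    print(f"{name:13s} {reduction_pct(base_power, power):6.2f}% vs base")

# Incremental step: isolation cells on top of clock gating.
print(f"basecgiso vs baseopt: {reduction_pct(1129917.0, 1095670.4):.2f}%")
```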
How Metrics Were Collected
All runs produce a final/metrics.json file from the OpenLane 2 signoff steps. Power numbers are from OpenSTA static power analysis at the nominal corner (tt_025C_1v80). Timing numbers are worst-case across 9 corners: three PVT corners (tt_025C_1v80, ss_100C_1v60, ff_n40C_1v95) crossed with three parasitic extraction corners (min, nom, max). Hold violations reported are worst across all 9 corners. IR drop analysis runs via OpenROAD PSM on the power and ground nets. Routing DRC runs via OpenROAD DRT. Physical DRC and LVS run via Magic and KLayout.
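The 9 timing corners are simply the cross product of the two lists, and the reported number is the worst value over that set. A minimal sketch, assuming a corner label of the form extraction corner, underscore, PVT corner (the post does not show the exact labels used in the reports):

```python
from itertools import product

pvt_corners = ["tt_025C_1v80", "ss_100C_1v60", "ff_n40C_1v95"]
extraction_corners = ["min", "nom", "max"]

# Assumption: labels are "<extraction>_<pvt>".
corners = [f"{rc}_{pvt}" for rc, pvt in product(extraction_corners, pvt_corners)]

def worst_slack(slack_by_corner):
    """The signoff figure is the minimum slack over all corners."""
    return min(slack_by_corner.values())

print(len(corners), "corners")
for corner in corners:
    print(corner)
```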
Part Two: Pipelining the Router
The Non-Pipelined Architecture
Before any pipelining, the router executes its entire datapath within a single clock cycle. Three logical operations happen every cycle in a single combinational chain:
The HEAD flit arrives at an input port FIFO. The input unit state machine reads the destination field from the flit and determines the target output port. This is the route decode operation.
The switch allocator (SA), a 5x5 round-robin arbiter matrix, resolves contention when multiple input ports simultaneously request the same output port. The grant signal is purely combinational: it goes high in the same cycle the request is asserted.
The granted flit passes through the crossbar mux to the output unit, which writes it into the output FIFO.
The path from FIFO read data through route decode logic, through the SA arbiter, through the crossbar mux select, to the output FIFO write enable is a single unbroken combinational chain. At 20 ns clock period targeting sky130_fd_sc_hd, synthesis reports a setup slack of 16.48 ns at the TT corner, confirming the path comfortably fits in a single cycle at this frequency. But the goal is to cut this chain for higher frequency operation, and to understand the microarchitectural consequences of the cuts.
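For readers unfamiliar with round-robin arbitration, here is a minimal Python model of one arbiter of the kind the switch allocator is built from. It assumes the common policy where the winner becomes the lowest-priority requester for the next decision; the post does not spell out the exact pointer update rule used in the RTL.

```python
class RoundRobinArbiter:
    def __init__(self, n):
        self.n = n
        self.ptr = 0   # index that has highest priority for the next decision

    def grant(self, requests):
        """requests: list of n booleans. Returns the granted index or None."""
        for offset in range(self.n):
            idx = (self.ptr + offset) % self.n
            if requests[idx]:
                self.ptr = (idx + 1) % self.n   # winner gets lowest priority next
                return idx
        return None

arb = RoundRobinArbiter(5)
reqs = [True, False, True, False, False]   # inputs 0 and 2 want the same output
print([arb.grant(reqs) for _ in range(4)])  # [0, 2, 0, 2]
```

In the router, one such arbiter per output port picks among the five input ports, which is where the 5x5 matrix comes from.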
The Four-Stage Pipeline Plan
The target pipeline is:
S1 (Buffer Write): Incoming flits are written into the VC FIFO on the input port. This is a registered write.
S2 (Route Decode): A new DECODE state is inserted into the input unit state machine between IDLE and ROUTING. When a HEAD flit is detected at the head of the FIFO, the state machine transitions from IDLE to DECODE, registering the destination port (sa_dst) but not yet asserting the switch allocation request (sa_req). This adds one cycle of latency for decode.
S3 (Switch Allocation): The state machine moves from DECODE to ROUTING and asserts sa_req. The SA resolves contention and produces a grant. In the initial pipelined version, the grant output of the SA (grant_flat) was made a registered reg instead of a combinational wire, adding one pipeline cycle for the grant to propagate out. The crossbar connection state (xbar_src, xbar_vc, xbar_live, out_busy) is also registered in this stage.
S4 (Crossbar Traversal and Output Write): A pipeline register (xbar_v_r for valid, xbar_flit_r for flit data) captures the crossbar output. The output unit reads from this register and writes into the output FIFO.
The three RTL modifications to implement this initial version:
input_unit.v received the new DECODE state. The state transition in IDLE upon seeing a HEAD flit now goes to DECODE (latching sa_dst only) rather than directly to ROUTING. DECODE then transitions to ROUTING on the following cycle, where sa_req is asserted.
switch_allocator.v had grant_flat changed from a combinational wire to a registered reg. The combinational arbitration result (resolved) is captured into grant_flat at the clock edge.
noc_router.v received xbar_v_r (5-bit registered valid) and xbar_flit_r (5 channels times 16-bit flit width = 80-bit registered data) as pipeline registers between the crossbar mux and the output units. The combinational forwarding logic (xbar_v_next) computes whether a flit can be forwarded on each output port, and this is registered into xbar_v_r. The output unit reads from xbar_v_r for the valid signal and from xbar_flit_r for the flit data.
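The input unit change can be summarized as a small state machine model. This Python sketch follows the IDLE, DECODE, ROUTING, ACTIVE sequence described above; the condition for returning to IDLE (a TAIL flit having been forwarded) and the argument names are assumptions, since the post does not show that part of the RTL.

```python
IDLE, DECODE, ROUTING, ACTIVE = "IDLE", "DECODE", "ROUTING", "ACTIVE"

class InputUnitFsm:
    def __init__(self):
        self.state = IDLE
        self.sa_req = 0
        self.sa_dst = None

    def clock(self, head_at_fifo, dst, sa_grant, tail_sent):
        """Advance one posedge. Arguments are the values seen before the edge."""
        if self.state == IDLE and head_at_fifo:
            self.state, self.sa_dst = DECODE, dst    # S2: latch destination only
        elif self.state == DECODE:
            self.state, self.sa_req = ROUTING, 1     # S3: now request the switch
        elif self.state == ROUTING and sa_grant:
            self.state, self.sa_req = ACTIVE, 0      # granted: stop requesting
        elif self.state == ACTIVE and tail_sent:     # assumption: TAIL releases
            self.state, self.sa_dst = IDLE, None

fsm = InputUnitFsm()
stimulus = [(True, 1, 0, False), (True, 1, 0, False), (True, 1, 1, False), (False, None, 0, True)]
for step in stimulus:
    fsm.clock(*step)
    print(fsm.state, "sa_req =", fsm.sa_req, "sa_dst =", fsm.sa_dst)
```

The extra DECODE state is what costs one cycle of latency: sa_req is not visible to the switch allocator until the cycle after the destination has been latched.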
First Simulation Failure: Complete Flit Order Corruption
The initial pipelined version was a disaster in simulation. Out of 52 test checks, only 3 passed: the two reset sanity checks and one count check on test T2. The monitor output showed BODY flits arriving where HEAD flits were expected, and TAIL flits arriving where HEAD flits were expected. Flit ordering was completely corrupted across all test cases.
The root cause was in how the FIFO acknowledge signal (xack) interacted with the registered xbar_v_r. The FIFO pop operation is driven by xack. When xack is high in a given cycle, the FIFO advances its read pointer, and the next flit becomes visible at the FIFO output (iu_flit). In the initial pipelined version, xack was derived from the registered xbar_v_r, so the pop was timed by a signal one cycle behind the capture into xbar_flit_r. xbar_flit_r was written from iu_flit on one posedge, while the read pointer advanced according to the delayed valid. The pop and the capture were therefore not tied to the same flit: by the time xbar_v_r was high, the FIFO head had moved on, and iu_flit held the next flit rather than the one that belonged in xbar_flit_r.
The fix was to derive xack from xack_next, the combinational version of the acknowledge signal computed alongside xbar_v_next. This ensures the FIFO pops in the same cycle the flit data is captured into xbar_flit_r. The FIFO read pointer advances and the data capture into the pipeline register happen at the same clock edge, maintaining coherence between the pop and the capture.
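The importance of popping and capturing on the same edge can be shown with a toy Python model of one FIFO feeding one pipeline register. This is a deliberately simplified sketch, not the router RTL: in this model the misalignment shows up as a duplicated flit, whereas the exact corruption pattern in the real design depends on details that are not shown in the post.

```python
from collections import deque

def forward(flits, pop_from_registered_valid, cycles=8):
    fifo = deque(flits)          # fall-through FIFO: fifo[0] plays iu_flit
    v_r, flit_r = 0, None        # stand-ins for xbar_v_r and xbar_flit_r
    delivered = []
    for _ in range(cycles):
        # Combinational values seen during the cycle.
        v_next = 1 if fifo else 0
        xack = v_r if pop_from_registered_valid else v_next
        if v_r:
            delivered.append(flit_r)      # output side consumes the register
        # Posedge: capture sees the pre-edge FIFO head, then the pop applies.
        flit_r = fifo[0] if fifo else None
        v_r = v_next
        if xack and fifo:
            fifo.popleft()
    return delivered

packet = ["HEAD", "BODY", "TAIL"]
print("xack from registered valid:   ", forward(packet, True))
print("xack from combinational valid:", forward(packet, False))
```

With the combinational acknowledge, every capture is paired with exactly one pop, so the register sees each flit once and in order.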
Second Failure: HEAD Arrives, BODY and TAIL Stuck
After fixing the xack timing issue, the results improved substantially. HEAD flits arrived correctly at the right output ports with correct payloads. Test T2 (a 2-flit HEAD+TAIL packet on VC1) passed entirely. But every test involving multi-flit packets on VC0 showed the same failure pattern: the HEAD flit arrived correctly, and then forwarding stopped permanently. BODY and TAIL flits remained in the input FIFO indefinitely.
A cycle-accurate debug probe was inserted that printed sa_grant, out_busy, xbar_vc, xbar_src, iu_sel_vc, and iu_flit on every cycle. The trace for test T1 (Local to North, VC0, 3 flits) showed this:
cy=12 t=115ns: sa_grant=...010 out_busy=00000 | port1: vc=0
cy=13 t=125ns: sa_grant=...010 out_busy=00010 | port1: vc=0
cy=14 t=135ns: sa_grant=...000 out_busy=00010 | port1: vc=1 <-- BUG
At cycle 12 (t=115ns), the SA grant fires. out_busy is 0 (pre-registration). xbar_vc[1] is correctly set to 0, meaning the crossbar is connecting VC0 of input port 0 to output port 1. One cycle later at t=125ns, the HEAD flit is being forwarded. But at t=135ns (cycle 14), xbar_vc[1] has changed from 0 to 1. The input unit is presenting VC0 data (iu_sel_vc[0]=0), but the crossbar register says VC1 (xbar_vc[1]=1). The VC mismatch check iu_sel_vc[xbar_src[1]] == xbar_vc[1] evaluates to 0 == 1 which is false, so xbar_v_next[1] stays 0 permanently. No BODY or TAIL flit ever gets forwarded.
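The forwarding condition that blocks here can be written out as a small Python function. The VC match term is the one quoted above; combining it with xbar_live, iu_valid, and ou_ready follows the signal names used elsewhere in this post, but the exact expression in noc_router.v is an assumption.

```python
def xbar_v_next(out_port, xbar_live, xbar_src, xbar_vc, iu_sel_vc, iu_valid, ou_ready):
    """True when a flit may be forwarded to out_port this cycle."""
    src = xbar_src[out_port]
    vc_match = iu_sel_vc[src] == xbar_vc[out_port]
    return bool(xbar_live[out_port] and vc_match and iu_valid[src] and ou_ready[out_port])

# Output port 1 is fed by input port 0, as in the trace for test T1.
xbar_live = [0, 1, 0, 0, 0]
xbar_src  = [0, 0, 0, 0, 0]
iu_sel_vc = [0, 0, 0, 0, 0]   # input 0 is presenting VC0 data
iu_valid  = [1, 0, 0, 0, 0]
ou_ready  = [1, 1, 1, 1, 1]

good_vc = [0, 0, 0, 0, 0]     # xbar_vc[1] = 0: matches the input unit
bad_vc  = [0, 1, 0, 0, 0]     # xbar_vc[1] = 1: the corrupted value

print(xbar_v_next(1, xbar_live, xbar_src, good_vc, iu_sel_vc, iu_valid, ou_ready))  # True
print(xbar_v_next(1, xbar_live, xbar_src, bad_vc, iu_sel_vc, iu_valid, ou_ready))   # False
```

Since nothing else rewrites xbar_vc[1] once the wormhole is established, a single bad write leaves the condition false for the rest of the packet.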
The Double-Grant Bug: Exact Mechanism
The VC register corruption was caused by the SA issuing a second grant for the same output port one cycle after the first. Here is the exact cycle-by-cycle sequence:
Cycle N (t=105ns): The input unit is in ROUTING state. sa_req[0]=1. out_busy[1]=0 (its registered value, not yet updated). The combinational arbiter inside the SA sees sa_req_masked[0*5+1]=1 and resolves a grant. But grant_flat is registered, so this grant does not appear at the output yet.
Cycle N+1 (t=115ns): The registered grant_flat[0*5+1] appears (registered from the previous cycle). noc_router sees this and registers xbar_src[1]=0, xbar_vc[1]=0, xbar_live[1]=1, out_busy[1]=1. These updates take effect at the end of this cycle (at posedge). The input unit also sees sa_grant[0]=1 and transitions sa_req[0] from 1 to 0 and st0 from ROUTING to ACTIVE, also at posedge.
The critical question is: what does the SA see during this cycle, before the posedge? sa_req[0] is still 1 (the registered value has not been cleared yet; it clears at posedge of this cycle). out_busy[1] is still 0 (the registered value has not been set yet; it sets at posedge of this cycle). Therefore sa_req_masked[0*5+1]=1 again. The combinational arbiter sees a valid request with an unblocked output and resolves a second grant. This second grant gets registered into grant_flat.
Cycle N+2 (t=125ns): The second grant_flat[0*5+1]=1 appears. The noc_router evaluates the grant installation logic for xbar_vc:
xbar_vc[jj] <= (iu_sa_req[ii][0] && iu_sa_dst0[ii] == jj) ? 1'b0 : 1'b1;
Now iu_sa_req[0][0] is 0 (it was cleared at the posedge of cycle N+1 when the input unit moved to ACTIVE). The condition evaluates to false. The ternary takes the else branch: xbar_vc[1] <= 1'b1. This overwrites the correct value of 0 with the incorrect value of 1.
The root cause is a one-cycle timing gap that exists when a registered pipeline output interacts with a feedback loop. The SA grant register, the sa_req register, and the out_busy register all update at the same posedge. During any given cycle, the SA’s combinational arbiter sees the pre-posedge values of sa_req and out_busy, which are both stale by one cycle relative to the grant that has just propagated out.
The Fix: Removing the SA Grant Register
Three approaches were considered to break the double-grant window.
Approach A was a combinational out_busy_next lookahead: compute out_busy combinationally by OR-ing the registered value with current-cycle grants, and use this masked version to gate SA requests. This does not work because out_busy_next depends on sa_grant_flat, which is the registered output, not the combinational arbiter output. The window still exists between the combinational arbiter resolving inside the SA and the registered output propagating.
Approach B was a just_granted mask inside the SA: track which outputs were granted in the previous cycle and block them. This was rejected because it also blocks a legitimately different source from being granted the same output in the following cycle.
Approach C was to remove the SA grant register entirely, making grant_flat a combinational wire output. This was the correct solution.
With combinational grant_flat:
Cycle N: sa_req[0]=1, out_busy[1]=0. SA combinationally resolves grant_flat[0*5+1]=1 immediately. The noc_router sees this grant and registers out_busy[1]=1 at posedge. The input unit sees sa_grant[0]=1 and registers sa_req[0]=0 at posedge.
Cycle N+1: sa_req[0]=0, out_busy[1]=1. sa_req_masked[0*5+1]=0. No grant fires. Double-grant window eliminated.
The one-cycle timing gap disappears because the grant and the blocking signals now update at the same posedge in the same cycle. The pipeline stages are preserved: the DECODE state in the input unit provides S2, the registered xbar_live, xbar_src, out_busy in noc_router provides S3, and the output FIFO write register provides S4. The SA does not need to be a pipeline stage boundary; the other registered structures already provide the cut.
All 52 test checks passed after this fix.
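The double-grant window and its removal can be reproduced in a small cycle model. The Python sketch below tracks a single requester (input 0, VC0) and a single output port, and commits all registers at the same posedge. It is a simplification of the mechanism described above, with the xbar_vc update reduced to the same ternary on sa_req; it is not the router RTL.

```python
def simulate(registered_grant, cycles=5):
    """Return (number of grants seen by noc_router, final xbar_vc value)."""
    sa_req, out_busy, grant_r, xbar_vc = 1, 0, 0, None
    grants_seen = 0
    for _ in range(cycles):
        # Combinational view during the cycle: pre-posedge register values.
        resolved = 1 if (sa_req and not out_busy) else 0
        grant = grant_r if registered_grant else resolved
        # Next-state values, all committed at the same posedge.
        nxt_req, nxt_busy, nxt_vc = sa_req, out_busy, xbar_vc
        if grant:
            grants_seen += 1
            nxt_vc = 0 if sa_req else 1   # mirrors the ternary on xbar_vc
            nxt_busy = 1
            nxt_req = 0                   # input unit moves ROUTING -> ACTIVE
        sa_req, out_busy, xbar_vc, grant_r = nxt_req, nxt_busy, nxt_vc, resolved
    return grants_seen, xbar_vc

print("registered grant:   ", simulate(True))    # (2, 1): second grant corrupts xbar_vc
print("combinational grant:", simulate(False))   # (1, 0): one grant, VC stays correct
```

In the registered variant the arbiter resolves a grant in two consecutive cycles because sa_req and out_busy lag the grant by one cycle. In the combinational variant the grant, the sa_req clear, and the out_busy set all land on the same edge, so the second resolution never happens.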
Three Optimizations Applied After Correctness Was Established
With the pipelined router functionally correct, three structural optimizations were identified from inspecting the synthesis report.
Optimization 1: Removing xbar_flit_r and xbar_v_r
The xbar_flit_r register (80 FFs) and xbar_v_r register (5 FFs) were introduced to create the S3-to-S4 pipeline boundary. But the output unit already contains an SRAM-backed FIFO with its own write register. The FIFO write (posedge clock with wr_en = flit_valid & ~full) is itself a register boundary. Driving the output unit with the combinational signals xbar_v_next (valid) and xbar_out (flit data) and letting the FIFO write be the S3-to-S4 boundary eliminates 85 FFs without changing the pipeline depth at all. This also eliminates a short FF-to-FF path from xbar_v_r to the output FIFO write enable, which was the source of 228 additional hold-repair buffers in synthesis. The combinational path from xbar_v_next through the FIFO ready check and the ou_ready signal has enough gate delay to naturally satisfy hold without inserted buffers.
Optimization 2: One-Hot xbar_src Encoding
The original xbar_src[dst] was a 3-bit binary register selecting which of the 5 input ports is routed to each output port. A 3-bit binary select driving a 5-to-1 mux over 16-bit flit data produces high-fanout select lines: each select bit must drive the mux control input for all 16 data bits across all 5 output ports. The synthesis report flagged 78 fanout violations on these select nets.
Replacing xbar_src[dst] with a 5-bit one-hot register xbar_src_oh[dst] changes the mux structure. Instead of a binary decode tree driving a mux, each one-hot bit drives a 16-wide AND gate that masks the corresponding input flit bus. The 5 masked results are OR-reduced. Each iu_flit[src] bit fans out to exactly 5 AND gates (one per output port). Each xbar_src_oh[dst][src] bit fans out to exactly 16 AND gates (one per flit bit). This is a significantly more balanced fanout distribution. The one-hot encoding also simplifies xbar_v_next: instead of indexing iu_valid[xbar_src[b]] through a mux, the one-hot bits can gate each iu_valid[src] and OR-reduce, removing mux decode logic from the critical path.
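The AND-OR structure can be checked with a small behavioral model. This is a minimal Python sketch, not the router's RTL: the function names onehot_mux and valid_next are hypothetical, and the 5-port, 16-bit dimensions are taken from the text.

```python
FLIT_WIDTH = 16
NUM_PORTS = 5
FLIT_MASK = (1 << FLIT_WIDTH) - 1


def onehot_mux(src_oh, flits):
    """AND-OR reduction: mask each input flit with its one-hot bit, then OR."""
    out = 0
    for src in range(NUM_PORTS):
        sel = (src_oh >> src) & 1
        # Replicating the select bit across the flit models the 16-wide AND gate.
        mask = FLIT_MASK if sel else 0
        out |= flits[src] & mask
    return out


def valid_next(src_oh, valids):
    """Same gating for the valid bits: AND with the select bit, then OR-reduce."""
    return any(bool((src_oh >> src) & 1) and bool(valids[src])
               for src in range(NUM_PORTS))


flits = [0x1111, 0x2222, 0x3333, 0x4444, 0x5555]
print(hex(onehot_mux(0b00100, flits)))  # one-hot bit 2 selects input port 2
```

An all-zero select word yields an all-zero output, so an idle output port needs no separate default case in the mux.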
Optimization 3: Removing out_busy_next
The out_busy_next combinational block was introduced as a partial workaround for the double-grant bug when the SA grant was still registered. It computed a speculative next-cycle out_busy state by OR-ing the registered value with current-cycle grants. With the SA grant now combinational, out_busy updates at the same posedge as sa_req clears. The double-grant window does not exist, and speculative lookahead is unnecessary. The simple ~out_busy[gj] masking is sufficient. Removing out_busy_next eliminates a 25-term combinational block and reduces the combinational logic feeding the SA request mask.
Final Optimized Pipeline Architecture
S1 (Buffer Write): Incoming flits write into the VC FIFO on the input port. The FIFO is fall-through: combinational read data always presents the head element.
S2 (Route Decode): Input unit state machine transitions from IDLE to DECODE upon seeing a HEAD flit. Destination port is registered into sa_dst. No switch allocation request is asserted yet.
S3 (Switch Allocation and Crossbar Setup): Input unit asserts sa_req in ROUTING state. SA resolves grants combinationally (no register). noc_router registers the crossbar connection state: xbar_src_oh (5-bit one-hot source select), xbar_vc (which VC is in the active wormhole), xbar_live (whether a wormhole is active on this output port), and out_busy (whether an output port is allocated). The combinational forwarding logic xbar_v_next evaluates xbar_live, VC match between registered xbar_vc and iu_sel_vc, input valid (iu_valid), and output ready (ou_ready) for each of the 5 output ports.
S4 (Crossbar Traversal and Output Write): The crossbar mux is combinational, using xbar_src_oh to select the source flit via AND-OR reduction. The output unit receives flit data and valid signal combinationally. The output FIFO write register (clocked on posedge) is the S3-to-S4 pipeline boundary. The FIFO read side drives the router’s output with downstream flow control via credits.
Wormhole forwarding after the HEAD flit does not re-arbitrate. Once xbar_live[dst] is set and the source and VC are registered, BODY and TAIL flits flow through the crossbar in every cycle that iu_valid is high and ou_ready is high. xbar_live[dst] is cleared when a TAIL flit is forwarded, releasing the output port for subsequent arbitration.
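The hold-until-TAIL behavior can be modeled per output port. The following Python sketch is an illustration only: the class and method names are hypothetical, the flit kinds are plain strings, and the field names simply mirror the registers described above.

```python
class OutputPortState:
    """Registered crossbar state for one output port."""

    def __init__(self):
        self.xbar_live = False
        self.xbar_src = None
        self.xbar_vc = None

    def grant(self, src, vc):
        # A HEAD flit won switch allocation: latch source and VC, open the wormhole.
        self.xbar_live = True
        self.xbar_src = src
        self.xbar_vc = vc

    def forward(self, src, vc, kind, iu_valid, ou_ready):
        """Return True if a flit crosses this cycle; a forwarded TAIL closes the wormhole."""
        fire = bool(self.xbar_live and src == self.xbar_src
                    and vc == self.xbar_vc and iu_valid and ou_ready)
        if fire and kind == "TAIL":
            self.xbar_live = False
        return fire


port = OutputPortState()
port.grant(src=2, vc=1)
results = [
    port.forward(2, 1, "BODY", iu_valid=True, ou_ready=True),   # flows
    port.forward(2, 1, "BODY", iu_valid=True, ou_ready=False),  # stalls, stays live
    port.forward(2, 1, "TAIL", iu_valid=True, ou_ready=True),   # flows and releases
    port.forward(2, 1, "BODY", iu_valid=True, ou_ready=True),   # port no longer live
]
print(results, port.xbar_live)
```

A stalled cycle leaves xbar_live set, so no re-arbitration is needed when the output becomes ready again.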
Synthesis Comparison: Non-Pipelined vs Pipelined
Both versions were taken through the full OpenLane flow targeting sky130_fd_sc_hd.
| Metric | Naive (non-pipelined) | Pipelined |
|---|---|---|
| Sequential cells | 500 | 615 |
| Standard cell area | 60,051 um² | 65,464 um² |
| Setup WS at SS corner | 8.73 ns | 9.90 ns |
| Hold violations at SS corner | 14 | 0 |
| Hold violations (worst cross-corner) | 42 | 43 |
| Hold WNS (worst) | -0.951 ns | -0.256 ns |
| Hold buffers inserted | 468 | 696 |
| Fanout violations | 78 | 72 |
| Switching power | 0.003135 W | 0.004005 W |
The pipelined version achieves a 13.4% improvement in setup slack at the SS corner (8.73 ns to 9.90 ns), confirming that the critical combinational path has been broken. Hold violations at the SS corner dropped from 14 to 0 because the inserted pipeline registers create natural hold margins on the previously short combinational paths. Hold violations at the worst cross-corner degraded slightly (42 to 43) but the magnitude improved significantly (-0.951 ns to -0.256 ns), meaning the remaining violations are far less severe.
The costs: 115 additional flip-flops (a 23% increase in sequential cells), 9% more standard cell area, 48.7% more hold buffers (696 vs 468), and 27.8% more switching power (0.003135 W to 0.004005 W). The bulk of the FF increase comes from the DECODE state register and the registered crossbar connection state (xbar_src_oh, xbar_vc, xbar_live, out_busy). The xbar_flit_r and xbar_v_r registers were already removed in the optimization round; if they had been retained, the FF count would have been higher by another 85. The switching power increase is consistent with more state registers toggling per packet.
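The percentages can be re-derived directly from the comparison table. This short Python sketch uses only the table values; the dictionary keys and the pct_change helper are names chosen here for illustration.

```python
naive = {
    "sequential_cells": 500,
    "std_cell_area_um2": 60051,
    "setup_slack_ss_ns": 8.73,
    "hold_buffers": 468,
    "switching_power_w": 0.003135,
}
pipelined = {
    "sequential_cells": 615,
    "std_cell_area_um2": 65464,
    "setup_slack_ss_ns": 9.90,
    "hold_buffers": 696,
    "switching_power_w": 0.004005,
}


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


for metric in naive:
    delta = pct_change(naive[metric], pipelined[metric])
    print(f"{metric}: {delta:+.1f}%")
```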
Part Three: DFT Scan Chain Insertion
Why OpenLane 2 Has No Native DFT
OpenLane 2 version 2.3.10 ships with zero DFT steps. Running Step.factory.list() against the installed package confirms this: no ScanReplace, no ScanInsert, no DFTConfig step exists anywhere in the openlane.steps namespace. OpenLane 1 had a run_dft flag that invoked Yosys’s dfflegalize pass followed by a custom Perl script to stitch scan chains, but this infrastructure was never ported to OpenLane 2.
The only community-maintained DFT option for OpenLane 2 is difetto, an alpha-state package distributed through Nix. It was not used here because it introduces a separate dependency chain that conflicts with the existing Docker-based flow and is not stable enough for integration.
OpenROAD (the PnR engine underlying OpenLane 2) does have native scan chain support through three TCL commands: set_dft_config, scan_replace, and insert_dft. These are fully functional in the OpenROAD binary bundled with OpenLane 2.3.10. The problem is that OpenLane 2 never wraps these commands into steps. The solution was to write those steps manually using the OpenLane 2 Python step API.
Why Scan Insertion Requires Two Separate Steps at Two Different Flow Points
OpenROAD’s scan insertion cannot be done in one pass. It requires two separate operations at two different points in the PnR flow.
scan_replace must run before global placement. It iterates over every flip-flop instance in the netlist and replaces it with its scan-equivalent cell from the standard cell library. In sky130_fd_sc_hd, for example, a sky130_fd_sc_hd__dfxtp_1 becomes a sky130_fd_sc_hd__sdfxtp_1, which adds SCD (scan data) and SCE (scan enable) pins. The scan-equivalent cells are physically larger than the original cells. If scan_replace runs after placement, the placer has already arranged the original smaller cells in the layout rows. Swapping them to larger variants after the fact causes cell overlaps that detailed placement cannot legally resolve without moving cells by distances exceeding the legal perturbation range. The sequential cell area increase in this design was from 18,918 um² to 23,516 um², a 24% increase, all of which the placer needs to account for from the start.
insert_dft must run after detailed placement. It uses the physical coordinates of the already-placed scan flops to build minimum-wirelength chains by ordering the flops spatially. Running it earlier means the tool has no final location data, so it cannot optimize the chain ordering, and the scan stitching wires are committed before cell positions are finalized, creating mismatches between the wiring and the physical cell locations.
The correct flow order is:
ScanReplace → GlobalPlacement → DetailedPlacement → ScanStitch → CTS → Routing → ...
CTS must come after scan stitching because the scan enable signal (scan_enable) needs to be treated as a quasi-clock signal during clock tree synthesis. If CTS runs before scan stitching, the scan enable net has no buffering infrastructure and will violate max-fanout constraints across all 720 scan flops.
The Custom Step Implementation
Two files were written to implement DFT inside the OpenLane 2 Python API.
dft_step.py
Both classes subclass OpenROADStep, the correct base class for any step that generates a TCL script and runs it through the OpenROAD binary. The get_script_path method returns a path inside self.step_dir, the per-step directory OpenLane 2 creates fresh for each step execution. The run method writes the TCL file and calls super().run() to execute it through the OpenROAD subprocess.
ScanReplace generates this TCL:
source $::env(SCRIPTS_DIR)/openroad/common/io.tcl
read_current_odb
set_dft_config \
-max_chains 4 \
-clock_mixing no_mix
scan_replace
write_views
The source and read_current_odb pattern is required because OpenROADStep does not automatically load the database before running user TCL. An initial attempt used read_db $::env(CURRENT_ODB) directly, which fails: CURRENT_ODB is not defined in the subprocess environment when OpenROADStep launches OpenROAD. The correct path is to source the io.tcl helper bundled with OpenLane 2, which defines read_current_odb. That function reads the ODB path through a mechanism OpenLane 2 does wire up correctly, loading the database from the previous step’s output state.
set_dft_config -max_chains 4 -clock_mixing no_mix configures the scan chain structure. max_chains 4 caps the design at 4 scan chains across the 720 flip-flops. no_mix prevents the tool from placing flops from different clock domains in the same chain. scan_replace then modifies the ODB in place, swapping every plain flip-flop instance (the sky130_fd_sc_hd__df* cells) for its scan equivalent (the sky130_fd_sc_hd__sdf* cells). write_views flushes the modified ODB and netlist to disk.
ScanStitch generates:
source $::env(SCRIPTS_DIR)/openroad/common/io.tcl
read_current_odb
insert_dft
place_pin -pin_name scan_enable_1 -layer met2 -location {0 100} -pin_size {0.2 2}
place_pin -pin_name scan_in_1 -layer met2 -location {0 250} -pin_size {0.2 2}
place_pin -pin_name scan_out_1 -layer met2 -location {0 400} -pin_size {0.2 2}
place_pin -pin_name scan_in_2 -layer met2 -location {0 550} -pin_size {0.2 2}
place_pin -pin_name scan_out_2 -layer met2 -location {0 700} -pin_size {0.2 2}
place_pin -pin_name scan_in_3 -layer met2 -location {0 850} -pin_size {0.2 2}
place_pin -pin_name scan_in_4 -layer met2 -location {0 1000} -pin_size {0.2 2}
place_pin -pin_name scan_out_3 -layer met2 -location {0 1150} -pin_size {0.2 2}
write_views
insert_dft reads the set_dft_config parameters stored in the ODB from the earlier scan_replace run, builds minimum-wirelength chains using the flop placement coordinates from detailed placement, and creates the scan I/O ports in the ODB. It auto-names them scan_in_N, scan_out_N, and scan_enable_N. The eight place_pin calls assign physical locations to those ports on met2. The die dimensions are 1358.645 um x 1369.365 um. All eight scan ports are placed on the left edge (x=0) at Y coordinates spaced 150 um apart, all within the die boundary.
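The eight place_pin lines follow a fixed pattern, so they can be generated rather than typed by hand. This is a minimal Python sketch that only builds the TCL text. The helper name place_pin_lines and its defaults are assumptions for illustration, while the pin order, layer, pitch, and pin size are copied from the script above.

```python
SCAN_PINS = [
    "scan_enable_1", "scan_in_1", "scan_out_1", "scan_in_2",
    "scan_out_2", "scan_in_3", "scan_in_4", "scan_out_3",
]


def place_pin_lines(pins, x_um=0.0, y0_um=100.0, pitch_um=150.0, layer="met2"):
    """Build one place_pin command per pin, stepping Y by a fixed pitch (microns)."""
    lines = []
    for i, pin in enumerate(pins):
        y_um = y0_um + i * pitch_um
        # Doubled braces emit literal TCL braces around the coordinate pair.
        lines.append(
            f"place_pin -pin_name {pin} -layer {layer} "
            f"-location {{{x_um:g} {y_um:g}}} -pin_size {{0.2 2}}"
        )
    return lines


tcl = place_pin_lines(SCAN_PINS)
print("\n".join(tcl))
```

Generating the lines from one pitch value keeps the units in a single place.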
run_dft_flow.py
The flow builder retrieves the standard Classic step list via SequentialFlow.factory.get("Classic"), filters out problematic steps, finds the indices of OpenROAD.GlobalPlacement and OpenROAD.DetailedPlacement, and splices in the two custom steps:
steps = steps[:gpl_idx] + [ScanReplace] + steps[gpl_idx:dpl_idx+1] + [ScanStitch] + steps[dpl_idx+1:]
Three existing steps were removed from the flow:
- OpenROAD.RepairDesignPostGPL: after scan replace, the timing constraints include transition time checks on the newly created scan ports (SCD, SCE), which have no set_max_transition SDC constraints. The repair step throws an unrecoverable error about unconstrained transition paths on these ports. Removing this step is safe because post-GPL design repair is a timing optimization, not a correctness requirement.
- Odb.CheckDesignAntennaProperties: the Magic-generated LEF for the DFT design contains USE ; lines with no value for the scan ports, which is a syntax error. The ODB LEF parser aborts on it. The GDS and ODB are already written before this step, so nothing is lost.
- Yosys.EQY: formal equivalence checking cannot pass after scan replacement, because the scan flops have SCD and SCE ports that do not exist on the original DFF cells in the RTL. Comparing the original RTL to the post-scan-replace gate netlist will always report unmatched module interfaces. A DFT-aware equivalence flow would compare non-scan mode behavior only, which requires separate EQY configuration outside the scope of this work.
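The filter-then-splice logic can be exercised without OpenLane installed. In this Python sketch, plain strings stand in for the step classes, and the short step list is an abbreviated illustration, not the real Classic flow. The build_dft_steps helper is a hypothetical name.

```python
REMOVED = {
    "OpenROAD.RepairDesignPostGPL",
    "Odb.CheckDesignAntennaProperties",
    "Yosys.EQY",
}


def build_dft_steps(classic_steps, scan_replace, scan_stitch):
    """Drop the problematic steps, then splice the two DFT steps around placement."""
    steps = [s for s in classic_steps if s not in REMOVED]
    gpl_idx = steps.index("OpenROAD.GlobalPlacement")
    dpl_idx = steps.index("OpenROAD.DetailedPlacement")
    return (
        steps[:gpl_idx] + [scan_replace]          # ScanReplace right before GPL
        + steps[gpl_idx:dpl_idx + 1] + [scan_stitch]  # ScanStitch right after DPL
        + steps[dpl_idx + 1:]
    )


# Abbreviated stand-in for the Classic step list (illustrative ordering only).
classic = [
    "Yosys.Synthesis",
    "OpenROAD.Floorplan",
    "OpenROAD.GlobalPlacement",
    "OpenROAD.RepairDesignPostGPL",
    "OpenROAD.DetailedPlacement",
    "OpenROAD.CTS",
    "OpenROAD.GlobalRouting",
    "Odb.CheckDesignAntennaProperties",
    "Yosys.EQY",
]
dft_steps = build_dft_steps(classic, "ScanReplace", "ScanStitch")
print(dft_steps)
```

Filtering before computing the indices matters: removing a step between the two placement steps afterwards would shift dpl_idx and misplace ScanStitch.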
Bugs Encountered During DFT Implementation
Bug 1: CURRENT_ODB variable reference. Initial TCL used read_db $::env(CURRENT_ODB). This fails because CURRENT_ODB is not set in the subprocess environment when OpenROADStep launches OpenROAD. The fix was to source io.tcl and call read_current_odb.
Bug 2: execute_dft_plan does not exist. The OpenROAD TCL command for scan stitching is insert_dft, not execute_dft_plan. OpenROAD documentation has inconsistencies between versions. Using the wrong command name produces a TCL error invalid command name "execute_dft_plan".
Bug 3: %OL_CREATE_REPORT is not valid TCL. An earlier version of the ScanStitch TCL script included %OL_CREATE_REPORT as a directive to trigger OpenLane 2’s report generation hook. This is a Python-side template substitution marker, not valid TCL. OpenROAD produces a TCL parse error. Removing it entirely and using write_views is sufficient.
Bug 4: RepairDesignPostGPL transition constraint failure. Described above in the step filtering section. The fix was removing the step.
Bug 5: DetailedPlacement failure when ScanReplace ran after GlobalPlacement. The original flow injected ScanReplace after GlobalPlacement. This caused DetailedPlacement to fail with cell overlap errors because the placer had already arranged original-size flip-flops in rows, and the larger scan cells could not fit in the pre-allocated spaces. Moving ScanReplace to before GlobalPlacement fixed this.
Bug 6: scan port coordinates passed in database units instead of microns. After the first successful ScanStitch run, global routing failed with [GRT-0209] Pin scan_out_3 is completely outside the die area and cannot be routed. The DEF after CTS showed that all scan_in_* pins had FIXED status but all scan_out_* pins had no placement entry at all. The place_pin calls in the initial TCL used database unit values (e.g., {0 1150000}) rather than micron values, and place_pin in OpenROAD expects coordinates in microns. The value 1150000 was therefore interpreted as 1,150,000 um, which is 1.15 meters, far outside the roughly 1358 um wide die. OpenROAD did not raise an error on this; it silently dropped the out-of-bounds placement for the output-direction ports. The input pins may have behaved differently at large values because of DBU rounding. The fix was to divide all coordinate values by 1000, converting the erroneous DBU values to microns. After this fix, all eight scan ports appeared with FIXED status in the DEF.
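A bounds check before emitting the TCL would have caught this class of mistake. The Python sketch below assumes 1000 database units per micron, matching the divide-by-1000 fix described above, and uses the die dimensions reported for this design. The helper names are hypothetical.

```python
DBU_PER_MICRON = 1000  # assumption, consistent with the divide-by-1000 fix
DIE_W_UM, DIE_H_UM = 1358.645, 1369.365


def dbu_to_um(value_dbu):
    """Convert a database-unit coordinate to microns."""
    return value_dbu / DBU_PER_MICRON


def inside_die(x_um, y_um):
    """True if the point lies on or inside the die boundary."""
    return 0 <= x_um <= DIE_W_UM and 0 <= y_um <= DIE_H_UM


bad_y = 1150000             # DBU value mistakenly passed as microns
good_y = dbu_to_um(bad_y)   # the same coordinate expressed in microns
print(inside_die(0, bad_y), inside_die(0, good_y))
```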
Bug 7: Magic LEF writer emitting USE with empty value. After routing completed, the flow reached Odb.CheckDesignAntennaProperties and crashed with a SIGABRT. The Magic-generated LEF contained:
PIN scan_enable_1
DIRECTION INPUT ;
USE ;
A USE statement with no value is not valid LEF. Magic’s LEF writer does not handle the SCAN use attribute on ports created by OpenROAD’s DFT flow, so it emits the keyword without a value. The ODB LEF parser treats the empty USE value as a syntax error and aborts. Since the GDS is written at step 58 (Magic.StreamOut) before this LEF is generated, removing the Odb.CheckDesignAntennaProperties step loses only the antenna check, not the GDS or ODB.
Bug 8: flow.start() returning more than two values. After all fixes, the flow completed 75/75 stages but Python threw ValueError: too many values to unpack (expected 2) at state, steps = flow.start(tag="pipe_dft"). A newer version of SequentialFlow.start() returns a tuple with more elements than two. The fix was to replace the unpacking assignment with a bare flow.start(tag="pipe_dft") call, since neither return value was used afterward.
DFT Results
The flow completed at 75/75 stages with LVS passing.
Scan chain statistics:
| Metric | Value |
|---|---|
| Total flip-flops replaced | 720 |
| Scan chains created | 4 |
| Scan ports created | scan_in_1, scan_in_2, scan_in_3, scan_in_4, scan_out_1, scan_out_2, scan_out_3, scan_enable_1 |
| Scan port layer | met2, left edge of die |
| Sequential cell area before scan replace | 18,918 um² |
| Sequential cell area after scan replace | 23,516 um² |
| Sequential cell area increase | 24.3% |
Output files were produced in runs/pipe_dft/final/ covering DEF, GDS, JSON header, KLayout GDS, LEF, LIB, MAG, metrics CSV, metrics JSON, netlist, ODB, power netlist, SDC, SDF, SPEF, SPICE, and Verilog header.
DRC Results and the OpenRAM Issue
The final design reports 132 DRC errors from KLayout. All 132 errors are nwell.4 violations. Every single one is located within the boundaries of the SRAM macro instances (sram_1rw_16x16). Zero routing DRC errors exist on the standard cell or interconnect portion of the design.
The nwell.4 rule in sky130 checks that every nwell region has a metal-connected N+ tap within a specified distance. OpenRAM-generated SRAM macros for sky130 use an optimized SRAM-specific layout that includes tap structures inside the macro, but the abstract LEF file used during PnR contains only the metal layers that connect externally, not the internal tap cell geometry. When Magic runs DRC on the assembled design (combining the standard cell area’s GDS with the SRAM macro GDS), it sees nwells from the macro boundary without seeing the internal tap cells that satisfy the rule, because those tap cells are inside a different abstraction level of the GDS hierarchy.
This is a documented, known issue with OpenRAM-generated macros on sky130 under Magic DRC. The sky130 process uses optical proximity correction to reduce SRAM transistor sizes. SRAM blocks generated by OpenRAM for sky130 follow a different DRC ruleset to accommodate this size reduction, so many DRC violations are expected when they are checked with Magic. The official OpenLane documentation for OpenRAM integration notes this explicitly: SRAM cells in sky130 have a special set of DRC rules; OpenRAM uses these optimized SRAM cells, but the current DRC deck is missing these rules, causing false violations. The SkyWater PDK known issues page further confirms that Magic does not have DRC checking rules for the specialized exceptions for SRAM cells in the sky130_fd_sp_sram SRAM build space.
These violations exist identically in the non-DFT runs and in every run throughout the low-power experiment series. They are not caused by scan insertion, routing, or any change made in this work. The routing DRC output from OpenROAD DRT reports zero violations on the interconnect. All DRC errors in the final report are sourced exclusively from the SRAM macro interiors.
Additional pre-existing warnings:
- 1 antenna pin violation and 1 antenna net violation, both in the SRAM macro boundary (pre-existing across all runs)
- Hold violations in the ss_100C_1v60 corner: pre-existing timing margin issue present in the non-DFT baseline
- Slew and cap violations: pre-existing, consistent across all corners in both DFT and non-DFT runs, attributed to SRAM macro output driver characteristics as explained in the low-power section
- 270 disconnected pins: SRAM macro ports (csb0, spare_wen0) that are not on the routing grid, a known property of the OpenRAM-generated macro’s port placement in the sky130 technology. These ports are internally connected within the macro GDS but appear disconnected when the abstract LEF is used during PnR.
LVS passed. The GDS is valid. The 4 scan chains are physically inserted, placed, routed, and verified.
Part Four: ATPG on the Gate-Level Netlists
What ATPG Is Actually Doing Here
After scan chain insertion establishes the structural DFT infrastructure in the physical design, the separate question is whether test vectors can actually detect faults in the logic. Scan chains provide the mechanism to shift test data into and out of flip-flops. ATPG generates the content of those vectors: the specific bit patterns that activate fault sites and propagate their effects to observable outputs.
The tool used here is Fault 0.6.1, an open-source ATPG solution from the American University in Cairo, targeting the stuck-at fault model. Stuck-at faults model permanent hard faults: a net that is stuck at logic 0 (sa0) regardless of what the driving logic computes, or stuck at logic 1 (sa1) regardless of the driving logic. These represent physical defects such as shorts to supply or ground, broken connections, and oxide defects that hold a node in a fixed state. Stuck-at is not the only fault model in use industrially (transition delay faults and path delay faults are standard additions), but it is the foundational model and the one compatible with Fault 0.6.1’s simulation engine.
ATPG operates on gate-level netlists, not RTL. The RTL contains behavioral descriptions that do not map directly to fault sites. Fault sites are gate pins and wire nodes in a technology-mapped netlist. The flow therefore takes the synthesized flat gate netlist, cuts sequential boundaries, and runs fault simulation on the resulting combinational representation.
The cut step is the mechanism that makes a sequential circuit amenable to combinational ATPG. Every flip-flop in the netlist has its D input and Q output disconnected. The D input becomes a pseudo-primary output. The Q output becomes a pseudo-primary input. The result is a netlist with only combinational logic between the original primary ports and the added pseudo-ports at every flip-flop boundary. This is the same assumption as full-scan: if every flip-flop is in a scan chain and can be loaded with an arbitrary value, then the sequential circuit reduces to a set of combinational cones, each testable independently.
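The port bookkeeping of the cut can be shown on a toy flop list. This Python sketch is a simplified illustration and not Fault's implementation. The Flop record, the cut function, and the net names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Flop:
    name: str
    d_net: str  # net driving the D input
    q_net: str  # net driven by the Q output


def cut(primary_inputs, primary_outputs, flops):
    """Remove the flops: every Q net becomes an input, every D net an output."""
    inputs = list(primary_inputs) + [f.q_net for f in flops]    # pseudo-primary inputs
    outputs = list(primary_outputs) + [f.d_net for f in flops]  # pseudo-primary outputs
    return inputs, outputs


flops = [
    Flop("ptr_reg_0", d_net="ptr_next_0", q_net="ptr_0"),
    Flop("ptr_reg_1", d_net="ptr_next_1", q_net="ptr_1"),
]
cut_inputs, cut_outputs = cut(["req"], ["grant"], flops)
print(cut_inputs, cut_outputs)
```

Each flop adds exactly one controllable input and one observable output, which is the full-scan assumption in port form.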
The ATPG engine in Fault 0.6.1 uses a D-algorithm derived approach. For each undetected fault, the engine attempts two things: first, set the faulted node to the opposite of its stuck value (fault activation), and second, find a path through the combinational cone from the fault site to a primary output where the difference between the fault-free and faulty circuit behaviors is observable (fault propagation). If both can be achieved simultaneously under a consistent input assignment, the input assignment is a test vector for that fault. If no such assignment exists (due to circuit structure preventing either activation or propagation), the fault is classified as redundant or structurally undetectable, meaning no test vector can ever detect it regardless of input.
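Activation, propagation, and redundancy can all be seen on a two-input toy circuit. The Python sketch below finds tests by exhaustive search, which is not the D-algorithm and not how Fault works internally. It only shows what a test vector for a stuck-at fault is. The circuit y = a OR (a AND b) and the function names are made up for this example.

```python
from itertools import product


def circuit(a, b, stuck=None):
    """y = a OR (a AND b). `stuck` optionally forces the internal net n1 to 0 or 1."""
    n1 = a & b
    if stuck is not None:
        n1 = stuck  # inject the stuck-at fault on n1
    return a | n1


def find_tests(stuck):
    """List every input vector whose output differs from the fault-free circuit."""
    return [
        (a, b)
        for a, b in product((0, 1), repeat=2)
        if circuit(a, b) != circuit(a, b, stuck)
    ]


sa1_tests = find_tests(1)  # n1 stuck-at-1
sa0_tests = find_tests(0)  # n1 stuck-at-0
print(sa1_tests, sa0_tests)
```

The stuck-at-1 fault is detected by any vector with a = 0. The stuck-at-0 fault has no test at all, because the OR gate already outputs a on its own, so the AND term is redundant logic. That is the kind of fault the text calls structurally undetectable.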
The parameters used across all four module runs were -m 80 (minimum fault coverage target of 80%), -v 100 (initial test vector count of 100), and --ceiling 500 (upper limit of 500 on the test vector count: once it is reached, no further vectors are added even if the coverage target has not been met). Together these set the termination conditions of the run: generation stops when the coverage target is reached or when the vector ceiling is hit.
Why ATPG Runs on Submodules, Not the Full Top Level
The complete noc_router top-level synthesizes to a netlist with 34,906 fault sites across over 10,000 gates. Fault 0.6.1 spawns one Icarus Verilog simulation process per test vector application. Each process elaborates the full design netlist plus the sky130_fd_sc_hd full cell model (the concatenated primitives and cell definition file, which is itself several thousand lines of Verilog). At this scale, the number of parallel simulation processes multiplied by the memory cost of elaborating both files simultaneously exceeds the available RAM on a standard development machine. The run was attempted and the system ran out of memory during parallel simulation.
The solution was to run ATPG on four submodules independently: rr_arbiter, switch_allocator, input_unit, and vc_fifo. The crossbar module was omitted because it is purely combinational passthrough logic with no sequential state, meaning every fault site in it is covered trivially by the routing logic that drives and reads the crossbar. The four submodules cover all sequential logic in the design.
Running ATPG at the submodule level gives fault coverage numbers for each block independently. It does not give a single fault coverage number for the integrated top level. The trade-off is that the ATPG is actually executable rather than crashing the machine, and the results are still meaningful for each block in isolation.
Tool Installation and Cell Model Preparation
Fault 0.6.1 is distributed as a self-contained Linux x86_64 AppImage:
curl -L https://github.com/AUCOHL/Fault/releases/download/0.6.1/Fault-0.6.1-x86_64.AppImage -o fault.AppImage
chmod +x fault.AppImage
./fault.AppImage --version
The tool internally uses Yosys for synthesis and Icarus Verilog for fault simulation. Both are bundled inside the AppImage.
The sky130_fd_sc_hd Verilog cell models are split across two files: primitives.v containing UDP (User-Defined Primitive) definitions for the mux and DFF-based primitives, and sky130_fd_sc_hd.v containing full cell definitions that reference those UDP primitives. Fault’s -c flag specifies the cell model file. Passing only sky130_fd_sc_hd.v causes Icarus Verilog to report Unknown module type for every sky130_fd_sc_hd__udp_dff$* and sky130_fd_sc_hd__udp_mux_* primitive instantiation inside the cell definitions, because the UDP definitions are in the file that was not provided. The fix is to concatenate both files:
cat primitives.v sky130_fd_sc_hd.v > sky130_fd_sc_hd_full.v
This single concatenated file is passed to every Fault invocation via -c sky130_fd_sc_hd_full.v.
The Liberty .lib file cannot be substituted here. Fault’s -c flag expects Verilog. Passing a Liberty file produces syntax error: I give up from the Icarus parser immediately.
SRAM Behavioral Model
The sram_sp wrapper in the design conditionally instantiates the OpenRAM-generated macro sram_1rw_16x16 under a SYNTHESIS define. When Fault’s internal Yosys elaborates the design without this define set, the synthesis step fails with Module sram_1rw_16x16 referenced in module sram_sp is not part of the design. Fault does not have access to the OpenRAM macro definition and cannot synthesize it.
A behavioral replacement for sram_sp was written with the ifdef guards removed, implementing the register-based simulation model directly:
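module sram_sp #(
    parameter DW = 16,   // data width in bits
    parameter AW = 3     // address width; depth is 2**AW words
)(
    input  wire          clk,
    input  wire          csb,   // chip select, active low
    input  wire          web,   // write enable, active low
    input  wire [AW-1:0] addr,
    input  wire [DW-1:0] din,
    output wire [DW-1:0] dout
);
    reg [DW-1:0] mem [0:(1<<AW)-1];
    reg [DW-1:0] dout_r;
    integer k;

    // Zero the array and the output register at time 0.
    initial begin
        for (k = 0; k < (1<<AW); k = k + 1) mem[k] = 0;
        dout_r = 0;
    end

    // Single port: each selected cycle is either a write or a registered read.
    always @(posedge clk) begin
        if (!csb) begin
            if (!web) mem[addr] <= din;
            else      dout_r <= mem[addr];
        end
    end

    assign dout = dout_r;
endmodule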
This file is passed to synthesis in place of the original sram_sp.v for any module that instantiates SRAM. The behavioral model synthesizes cleanly through Yosys to a register array, giving Fault a complete, simulable gate-level representation of the memory.
DFF Cell Names for the Cut Step
The cut step requires explicit naming of the DFF cell variants in the synthesized netlist so Fault knows which instances are flip-flops and how to sever their sequential boundaries. The cell names present after synthesis were identified with:
grep -o 'sky130_fd_sc_hd__df[a-z0-9_]*' noc_router.netlist.v | sort -u
Three DFF base types appeared: sky130_fd_sc_hd__dfxtp (standard positive-edge triggered DFF), sky130_fd_sc_hd__dfrtp (DFF with reset), and sky130_fd_sc_hd__dfstp (DFF with set). Each comes in drive strength variants 1, 2, and 4. All nine variants were passed to fault cut -d:
sky130_fd_sc_hd__dfxtp_1,sky130_fd_sc_hd__dfxtp_2,sky130_fd_sc_hd__dfxtp_4,
sky130_fd_sc_hd__dfrtp_1,sky130_fd_sc_hd__dfrtp_2,sky130_fd_sc_hd__dfrtp_4,
sky130_fd_sc_hd__dfstp_1,sky130_fd_sc_hd__dfstp_2,sky130_fd_sc_hd__dfstp_4
Missing any variant causes the cut step to leave those flip-flop instances intact as sequential elements in the cut netlist, which means the ATPG engine sees a partially sequential circuit and cannot correctly enumerate pseudo-primary inputs and outputs for those instances.
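Because a missing variant silently degrades the cut, the list is easier to generate than to type. This minimal Python sketch builds the comma-separated value for fault cut -d from the three base types and three drive strengths named above. The helper name dff_cell_list is hypothetical.

```python
DFF_BASES = ["dfxtp", "dfrtp", "dfstp"]  # plain, with reset, with set
DRIVE_STRENGTHS = [1, 2, 4]


def dff_cell_list(prefix="sky130_fd_sc_hd__"):
    """Return all base/drive-strength combinations as one comma-separated string."""
    return ",".join(
        f"{prefix}{base}_{drive}"
        for base in DFF_BASES
        for drive in DRIVE_STRENGTHS
    )


cells = dff_cell_list()
print(cells)
```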
The Three-Step Flow Per Module
Each module goes through three Fault invocations in sequence.
Step 1: Synthesis. fault synth takes the RTL source files and the Liberty file and produces a flat gate-level netlist mapped to sky130_fd_sc_hd cells. Internally this calls Yosys with a standard synthesis script. The output is a Verilog netlist containing only sky130 standard cell instances and interconnect.
./fault.AppImage synth \
-l <liberty_path> \
-t <top_module> \
-o <module>.netlist.v \
<source_files>
Step 2: Cut. fault cut takes the synthesized netlist, identifies every flip-flop instance matching the specified cell name list, severs the D-to-Q connections, and writes a combinational netlist. The D inputs become outputs (observable points) and the Q outputs become inputs (controllable points). The resulting netlist has no sequential elements.
./fault.AppImage cut \
-d <dff_cell_list> \
-o <module>.cut.v \
<module>.netlist.v
Step 3: ATPG. The main Fault invocation takes the cut netlist and the cell model, enumerates all fault sites, applies the D-algorithm to generate test vectors, and writes the result to a JSON file. The --clock and -i flags tell the tool which ports are clock and reset signals to exclude from controllable inputs (since those are infrastructure signals, not data inputs the ATPG should manipulate).
./fault.AppImage \
-c sky130_fd_sc_hd_full.v \
--clock clk \
-i clk,rst_n \
-m 80 \
-v 100 \
--ceiling 500 \
-o <module>.tv.json \
<module>.cut.v
The output .tv.json file contains the actual binary test vectors: the input assignments that detect the covered faults. These can be loaded into a simulator or an ATE (automatic test equipment) pattern loader for silicon testing.
Module Results
rr_arbiter
The round-robin arbiter is the smallest module. It arbitrates between N requestors in a rotating priority order, maintaining a rotating grant pointer in a register. The cut netlist was 433 lines after synthesis.
ATPG results:
- Fault coverage: 96.43%
- Test vectors generated: 97
- Runtime: 5.00 seconds
The arbiter’s register structure is highly observable: the grant output is directly visible at primary outputs, and the priority rotation register has direct paths to the grant outputs on the next cycle boundary (represented as a pseudo-primary output in the cut netlist). The 3.57% uncovered faults are structurally undetectable: nodes where the fault effect is masked by reconvergent fanout before reaching any observable output.
switch_allocator
The switch allocator contains a 5x5 round-robin arbiter matrix, with one arbiter per output port resolving contention among the 5 input ports requesting that output. It instantiates rr_arbiter internally, so both files are passed to synthesis. The cut netlist was 2,722 lines.
ATPG results:
- Fault coverage: 95.46%
- Runtime: 7.12 seconds
The switch allocator had the lowest coverage of the four modules. The round-robin arbitration state creates fault masking paths through the grant logic: a stuck fault on a priority state bit can be masked when the grant resolves through a different path that reaches the same output regardless of the faulted bit value. This is structural masking from the redundancy inherent in the priority rotation logic, not a gap in the ATPG.
vc_fifo
The VC FIFO contains the FIFO control logic, read and write pointer management, full and empty flag generation, and the SRAM interface. The behavioral SRAM model is passed for synthesis.
ATPG results:
- Fault coverage: 98.09%
- Runtime: 10.60 seconds
This is the highest coverage of the four modules. The FIFO datapath has high controllability through the data input port and high observability through the data output port. The read pointer, write pointer, full flag, and empty flag all have direct combinational paths to the output ports that can be driven to observe their values, which makes the majority of fault sites detectable with straightforward input assignments.
input_unit
The input unit contains the pipeline state machine (IDLE, DECODE, ROUTING, ACTIVE), the two-VC FIFO interface logic, and the switch allocation request and grant interface. It instantiates vc_fifo which instantiates sram_sp. With the behavioral SRAM model substituted in, the entire hierarchy flattens into a single combinational netlist at the cut step. The cut netlist was 11,688 lines, the largest of the four modules.
ATPG results:
- Fault coverage: 96.59%
- Runtime: 17.22 seconds
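The runtime grows with netlist size as expected: the fault simulation workload is proportional to the number of fault sites (each requiring at least one Icarus simulation pass to evaluate). Wall-clock time grows much more slowly than line count, though: the input_unit cut netlist is about 27 times larger than the rr_arbiter one (11,688 lines against 433) but took only about 3.4 times as long (17.22 s against 5.00 s). The 11,688-line cut netlist contains the flattened FIFO and SRAM logic in addition to the state machine and allocation interface.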
Summary Table
| Module | Cut Netlist Lines | Fault Coverage | Runtime |
|---|---|---|---|
| rr_arbiter | 433 | 96.43% | 5.00s |
| switch_allocator | 2,722 | 95.46% | 7.12s |
| vc_fifo | (not recorded) | 98.09% | 10.60s |
| input_unit | 11,688 | 96.59% | 17.22s |
All four modules exceeded 95% stuck-at fault coverage. The remaining uncovered faults in each case are structurally undetectable faults: fault sites where no input assignment can simultaneously activate the fault and propagate its effect to any primary or pseudo-primary output, because the circuit topology itself masks the effect through reconvergent fanout or through redundant logic. This is normal in any ATPG run on real logic. No test vector exists that can detect a structurally undetectable fault because its presence has no observable effect on the circuit’s outputs under any input combination.
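The undetected share for each module follows directly from the coverage figures in the table. The short calculation below uses only the numbers reported above. The dictionary and function names are illustrative.

```python
# Fault coverage percentages as reported in the summary table.
coverage_pct = {
    "rr_arbiter": 96.43,
    "switch_allocator": 95.46,
    "vc_fifo": 98.09,
    "input_unit": 96.59,
}


def undetected_pct(coverage):
    """Percentage of fault sites left undetected, rounded to two decimals."""
    return round(100.0 - coverage, 2)


undetected = {name: undetected_pct(cov) for name, cov in coverage_pct.items()}
for name, pct in undetected.items():
    print(f"{name}: {pct:.2f}% undetected")
```

This gives 3.57% for rr_arbiter, 4.54% for switch_allocator, 1.91% for vc_fifo, and 3.41% for input_unit.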
The Relationship Between Scan Insertion and ATPG
The DFT scan chain work in Part Three and the ATPG in Part Four address different parts of the same problem. Scan insertion creates the physical mechanism for controllability and observability: by replacing every flip-flop with its scan-equivalent variant and connecting them into chains, any bit pattern can be shifted into any flip-flop and the state of any flip-flop can be shifted out. ATPG determines what patterns to use.
In the context of this project, the two flows are connected but run separately. The scan insertion was done on the post-PnR physical design using OpenROAD’s DFT commands through the custom OpenLane 2 steps. The ATPG was done on the pre-PnR synthesized netlists using Fault 0.6.1. The test vectors from ATPG target the same logical fault sites that are covered by the scan chains in the physical design, but the vector sets have not been mapped to the specific scan chain ordering that OpenROAD produced.
A complete production DFT flow would close this loop: take the post-scan-replace netlist (the one where all DFFs have been swapped for SDFFs), reorder the fault simulation to account for the specific scan chain ordering (which flop is at position N in chain K), and generate vectors in a format compatible with ATE scan load sequences. Fault 0.6.1 does not produce ATE-formatted vectors natively, and the scan chain ordering from OpenROAD’s insert_dft was not fed back into the ATPG run. This is the main gap between the work done here and a full production-ready DFT closure flow. The ATPG results are valid as submodule stuck-at coverage numbers and as validation that the logic is fully observable and controllable at each block boundary. They are not a substitute for a scan-aware vector set mapped to the physical scan chain topology.
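To make the missing mapping step concrete, here is a minimal sketch of turning a per-flop ATPG assignment into a scan-in bit sequence for one chain. It was not part of this project's flow. The flop names and both function names are hypothetical, and it assumes the chain order is listed from the scan-in pin to the scan-out pin with one bit shifted per clock.

```python
def scan_load_sequence(chain_order, flop_values, fill=0):
    """Return the bits to drive on scan-in, first-shifted bit first.

    chain_order: flop names ordered from the scan-in pin to the scan-out pin.
    flop_values: dict mapping flop name -> desired 0/1 value after the load.
    fill:        value used for flops the vector leaves unspecified.
    """
    # The first bit shifted in travels the full chain length, so it ends up
    # in the flop nearest scan-out. Emit values in reverse chain order.
    return [flop_values.get(name, fill) for name in reversed(chain_order)]


def simulate_shift(chain_length, bits):
    """Shift bits into an all-zero chain; return states from scan-in to scan-out."""
    chain = [0] * chain_length
    for bit in bits:
        chain = [bit] + chain[:-1]
    return chain


# Hypothetical three-flop chain and target state.
order = ["ff_a", "ff_b", "ff_c"]
target = {"ff_a": 1, "ff_b": 1, "ff_c": 0}
bits = scan_load_sequence(order, target)
print(bits)                              # [0, 1, 1]
print(simulate_shift(len(order), bits))  # [1, 1, 0]
```

A real flow would read the chain order from the DFT tool's report and write the sequences in whatever format the tester expects, and neither step is shown here.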
What Structurally Undetectable Faults Mean in Practice
The 1.91% to 4.54% of undetected faults across the four modules are not a sign of insufficient ATPG effort or missed coverage. The D-algorithm with --ceiling 500 backtracks exhaustively within its configured depth for each fault. If a fault cannot be detected after exhaustive search, the tool classifies it as ATPG-untestable. The underlying reason is always structural: the circuit has some combination of reconvergent fanout or redundant logic that makes it impossible for any input to simultaneously satisfy both the activation and propagation conditions.
Reconvergent fanout is the typical mechanism. A signal fans out to two paths that reconverge at an AND or OR gate. A stuck fault on the fanout point affects both paths identically, and the reconvergence point computes the same output regardless of whether the fault is present or absent. The fault cannot be propagated past the reconvergence point.
Redundant logic in an arbiter is a common source of structural undetectability. If two priority decode terms can both assert the same grant under different but operationally equivalent conditions, a fault in one term may be masked by the other term activating the grant through the alternate path. The grant output does not differ between the fault-free and faulty circuit.
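A two-input toy circuit shows this kind of masking. The circuit below, y = a OR (a AND b), is a hypothetical example and is not taken from the router. Signal a fans out to both the AND gate and the OR gate and reconverges at the OR, which makes the AND term redundant. Exhaustive simulation confirms that a stuck-at-0 fault on the AND output changes the output for no input vector, while a stuck-at-1 on the same net is detectable.

```python
from itertools import product


def circuit(a, b, and_out_stuck=None):
    """y = a OR (a AND b), with an optional stuck-at fault on the AND output."""
    and_out = a & b
    if and_out_stuck is not None:
        and_out = and_out_stuck  # force the net to the stuck value
    return a | and_out


def detecting_vectors(stuck_value):
    """Exhaustively list input vectors whose output differs under the fault."""
    return [
        (a, b)
        for a, b in product((0, 1), repeat=2)
        if circuit(a, b) != circuit(a, b, and_out_stuck=stuck_value)
    ]


print(detecting_vectors(0))  # []  -> stuck-at-0 is undetectable
print(detecting_vectors(1))  # [(0, 0), (0, 1)]  -> stuck-at-1 is detectable
```

The stuck-at-0 fault is masked because the AND output only ever reaches 1 when a is already 1, and in that case the OR output is 1 through the direct path anyway.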
These faults cannot be removed by generating more test vectors. They can only be removed by redesigning the logic to eliminate the structural masking. That typically means removing the redundancy that causes it, which may change the circuit’s functional behavior or timing properties. In practice, 95%+ stuck-at coverage with a known remainder of structurally undetectable faults is the accepted outcome of ATPG on any nontrivial combinational logic block.
Summary
This project covered four distinct phases of work on the same 5-port wormhole NoC router on sky130A.
The low-power phase established that RTL-level clock gating with ICG cells was the dominant optimization, contributing a 56% power reduction from baseline by eliminating flip-flop toggling in idle FIFO banks. Adding isolation cells at module output boundaries contributed an additional 3%, addressing combinational glitching on downstream nets. Synthesis strategy tuning with ABC’s AREA 2 strategy (two sequential choice2 passes before area-optimized technology mapping) pushed the total reduction to 70.34% from the baseline. The AREA 2 netlist required coordinated CTS parameter tuning and resizer configuration to achieve timing closure without power regression: smaller cluster sizes for more balanced clock insertion delay, post-GPL design repair for slew and cap violations, and hold slack margins forcing the resizer to overfix beyond the zero-slack boundary. Floorplan restructuring from a 2000x2000 um die to an 1800x1800 um die with a 3-column SRAM arrangement produced a 16.47% die area reduction and a 5.77% wirelength reduction at no cost to power or timing.
The pipelining phase decomposed the single-cycle combinational datapath into a 4-stage pipeline through RTL modifications to input_unit.v, switch_allocator.v, and noc_router.v. The initial implementation had two bugs. The first was a FIFO coherence bug: xack derived from the registered xbar_v_r caused the FIFO to pop one cycle after the flit data was captured, creating a one-cycle mismatch that corrupted flit ordering across all multi-flit packets. The fix was deriving xack from the combinational xack_next so the pop and the capture happen at the same clock edge. The second was a double-grant bug: the registered grant_flat created a one-cycle window where the SA’s combinational arbiter saw stale pre-update values of both sa_req and out_busy, resolving a second grant to the same output port. The fix was removing the SA grant register entirely, making grant_flat a combinational output so the grant, the sa_req clear, and the out_busy set all happen at the same posedge. Three structural optimizations followed: removal of 85 pipeline registers (xbar_flit_r and xbar_v_r) by using the output FIFO write as the S3-to-S4 boundary, replacement of 3-bit binary xbar_src with 5-bit one-hot xbar_src_oh to eliminate 78 fanout violations on the crossbar mux select nets, and removal of the speculative out_busy_next combinational block that was only needed as a workaround for the now-eliminated registered grant.
The DFT phase implemented scan chain insertion inside OpenLane 2 by writing two custom OpenROADStep subclasses: ScanReplace injected before global placement to swap all 720 flip-flops for their scan-equivalent SDFF variants, and ScanStitch injected after detailed placement to build minimum-wirelength chains using final flop coordinates and place the 8 scan ports on met2. Eight bugs were encountered and resolved: incorrect ODB loading via CURRENT_ODB, wrong TCL command name for scan stitching, invalid Python template markers in TCL, a transition constraint failure on unconstrained scan ports, cell overlap from placing ScanReplace after global placement rather than before, scan output pins silently dropped due to coordinate values in database units rather than microns, a Magic LEF writer emitting syntactically invalid USE ; for scan ports, and a Python API return value change in SequentialFlow.start(). The completed flow ran 75/75 stages with LVS passing, producing 4 scan chains across 720 flops with 8 scan ports on the left die edge. All 132 DRC errors in the final design are nwell.4 violations located inside the SRAM macro boundaries, consistent across all runs in this project and documented as a known limitation of OpenRAM-generated macros on sky130 under Magic DRC.
The ATPG phase ran Fault 0.6.1 on synthesized gate-level netlists of four submodules: rr_arbiter, switch_allocator, vc_fifo, and input_unit. ATPG on the full top-level netlist was not feasible due to memory exhaustion from parallel Icarus simulation at 34,906 fault sites. Each submodule went through synthesis, sequential boundary cutting to produce a combinational netlist, and D-algorithm ATPG targeting stuck-at faults. Coverage results were 96.43%, 95.46%, 98.09%, and 96.59% respectively, all above the 95% threshold. The remaining uncovered faults in each module are structurally undetectable due to reconvergent fanout and redundant priority decode logic, not gaps in test vector generation. The ATPG vectors have not been mapped to the physical scan chain ordering from the OpenROAD DFT flow, which is the remaining step to produce a production-complete, ATE-ready test program.
