Part 2: Pipelining the Router


Non-Pipelined Architecture

Before any pipelining, the router executes its entire datapath within a single clock cycle. Three logical operations happen every cycle in a single combinational chain:

The HEAD flit arrives at an input port FIFO. The input unit state machine reads the destination field from the flit and determines the target output port. This is the route decode operation.

The switch allocator (SA), a 5x5 round-robin arbiter matrix, resolves contention when multiple input ports simultaneously request the same output port. The grant signal is purely combinational: it goes high in the same cycle the request is asserted.

The granted flit passes through the crossbar mux to the output unit, which writes it into the output FIFO.
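The round-robin arbitration used for each output port can be sketched in a few lines of Python. This is a behavioral illustration only, not the switch_allocator.v logic: the function name round_robin_grant and the pointer-update policy (priority moves to the input after the last winner) are assumptions, since the text does not describe how the arbiter rotates priority.

```python
def round_robin_grant(requests, pointer):
    """Return (granted_index or None, next_pointer) for one output port.

    requests: list of booleans, one per input port.
    pointer:  index of the input with highest priority this cycle.
    """
    n = len(requests)
    for offset in range(n):
        idx = (pointer + offset) % n
        if requests[idx]:
            # Winner found; the input after it gets top priority next time
            # (assumed rotation policy).
            return idx, (idx + 1) % n
    return None, pointer  # no requests: pointer is unchanged


# Inputs 0 and 3 both request the same output; the pointer starts at input 0.
reqs = [True, False, False, True, False]
first, ptr = round_robin_grant(reqs, 0)
second, ptr = round_robin_grant(reqs, ptr)
```

With both inputs requesting continuously, the grant alternates between them instead of starving input 3. A 5x5 matrix would instantiate one such arbiter per output port.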

The path from FIFO read data through route decode logic, through the SA arbiter, through the crossbar mux select, to the output FIFO write enable is a single unbroken combinational chain. At a 20 ns clock period targeting sky130_fd_sc_hd, synthesis reports a setup slack of 16.48 ns at the TT corner, so the path fits comfortably in one cycle at this frequency. The goal of pipelining is to cut this chain for higher-frequency operation and to understand the microarchitectural consequences of the cuts.


Four-Stage Pipeline Plan

The target pipeline stages:

S1 (Buffer Write): Incoming flits write into the VC FIFO on the input port. The FIFO is fall-through: combinational read data always presents the head element.

S2 (Route Decode): A new DECODE state is inserted into the input unit state machine between IDLE and ROUTING. When a HEAD flit is detected at the head of the FIFO, the state machine transitions from IDLE to DECODE, registering the destination port (sa_dst) but not yet asserting the switch allocation request (sa_req). This adds one cycle of latency for decode.

S3 (Switch Allocation): The state machine moves from DECODE to ROUTING and asserts sa_req. The SA resolves contention and produces a grant. In the initial pipelined version, the SA grant output (grant_flat) was changed from a combinational wire to a registered output, so the grant takes one extra cycle to appear. The crossbar connection state (xbar_src, xbar_vc, xbar_live, out_busy) is also registered in this stage.

S4 (Crossbar Traversal and Output Write): A pipeline register (xbar_v_r for valid, xbar_flit_r for flit data) captures the crossbar output. The output unit reads from this register and writes into the output FIFO.
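The S2/S3 split in the input unit can be illustrated with a small Python model of the state machine. It is a sketch under assumptions: only the IDLE to DECODE to ROUTING path is modelled, the function names are hypothetical, and sa_req is treated as a function of the state for brevity (in the RTL it is a register).

```python
IDLE, DECODE, ROUTING = "IDLE", "DECODE", "ROUTING"

def input_unit_step(state, head_at_fifo, dst_port, sa_dst):
    """One clock edge of the sketch FSM. Returns (next_state, next_sa_dst).

    What happens after ROUTING (grant, forwarding, return to IDLE) is left out.
    """
    if state == IDLE and head_at_fifo:
        return DECODE, dst_port      # latch destination, no request yet
    if state == DECODE:
        return ROUTING, sa_dst       # request is visible from the next cycle
    return state, sa_dst

def sa_req(state):
    """Simplification: the request is asserted only in ROUTING."""
    return state == ROUTING

state, sa_dst = IDLE, None
trace = []
for cycle in range(3):
    trace.append((state, sa_req(state)))
    state, sa_dst = input_unit_step(
        state, head_at_fifo=True, dst_port=1, sa_dst=sa_dst
    )
```

The trace shows the one-cycle decode latency: the destination is latched on the first edge, and the request is only visible two cycles after the HEAD flit reaches the FIFO head.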

RTL Modifications for Initial Pipelined Version

input_unit.v received the new DECODE state. The state transition in IDLE upon seeing a HEAD flit now goes to DECODE (latching sa_dst only) rather than directly to ROUTING. DECODE then transitions to ROUTING on the following cycle, where sa_req is asserted.

switch_allocator.v had grant_flat changed from a combinational wire to a reg that captures the combinational arbitration result at the clock edge.

noc_router.v received xbar_v_r (5-bit registered valid) and xbar_flit_r (5 channels x 16-bit flit width = 80-bit registered data) as pipeline registers between the crossbar mux and the output units. The combinational forwarding logic (xbar_v_next) computes whether a flit can be forwarded on each output port, and this is registered into xbar_v_r. The output unit reads from xbar_v_r for the valid signal and from xbar_flit_r for the flit data.


Bug 1: FIFO Coherence / xack Timing

The initial pipelined version failed in simulation. Out of 52 test checks, only 3 passed: the two reset sanity checks and one count check on test T2. The monitor output showed BODY flits arriving where HEAD flits were expected, and TAIL flits arriving where HEAD flits were expected. Flit ordering was completely corrupted across all test cases.

The root cause was in how the FIFO acknowledge signal (xack) interacted with the registered xbar_v_r. The FIFO pop operation is driven by xack. When xack is high in a given cycle, the FIFO advances its read pointer and the next flit becomes visible at the FIFO output (iu_flit). In the initial pipelined version, xack was derived from the registered xbar_v_r. This means the FIFO was popped one cycle after the flit was supposedly captured into xbar_flit_r.

xbar_flit_r was written from iu_flit at the same posedge that xbar_v_r was loaded from xbar_v_next, but the pop driven by xack did not happen until xbar_v_r went high, one cycle later. Capture and pop were therefore out of step by one cycle, so the flit at the FIFO head (iu_flit) no longer corresponded to the flit that the valid bit described, and xbar_flit_r ended up holding the wrong flit of the packet.

Fix: Derive xack from xack_next, the combinational version of the acknowledge signal computed alongside xbar_v_next. This ensures the FIFO pops in the same cycle the flit data is captured into xbar_flit_r. The FIFO read pointer advances and the data capture into the pipeline register happen at the same clock edge.
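Why the pop and the capture must share a clock edge can be seen in a toy Python model of a fall-through FIFO feeding a single pipeline register. The model is an assumption-laden simplification (one input, one output, no backpressure, hypothetical function name run_pipeline), so the corruption it produces is a duplicated HEAD rather than the exact pattern seen in the router's simulation; the point is only that a pop taken from the registered valid corrupts the stream, while a pop taken from the combinational valid does not.

```python
from collections import deque

def run_pipeline(flits, pop_from_registered_valid, cycles=8):
    """Return the flits seen at the output of a one-register pipeline stage.

    Toy model: a fall-through FIFO feeds a data register (flit_r) with a
    valid bit (v_r). Names mirror the text; the logic is simplified.
    """
    fifo = deque(flits)
    v_r, flit_r = False, None
    seen = []
    for _ in range(cycles):
        if v_r:
            seen.append(flit_r)          # output side consumes the register
        v_next = len(fifo) > 0           # combinational "can forward"
        # Acknowledge taken from the registered or the combinational valid.
        xack = v_r if pop_from_registered_valid else v_next
        # Clock edge: capture and pop both use the pre-edge FIFO head.
        head = fifo[0] if fifo else None
        flit_r, v_r = (head, True) if v_next else (flit_r, False)
        if xack and fifo:
            fifo.popleft()
    return seen

packet = ["HEAD", "BODY", "TAIL"]
fixed = run_pipeline(packet, pop_from_registered_valid=False)
buggy = run_pipeline(packet, pop_from_registered_valid=True)
```

In this model, fixed reproduces the packet in order, while buggy captures the HEAD twice because the FIFO head has not advanced when the second capture happens.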


Bug 2: Double-Grant Bug

After fixing the xack timing issue, HEAD flits arrived correctly. Test T2 (a 2-flit HEAD+TAIL packet on VC1) passed entirely. But every test involving multi-flit packets on VC0 showed the same failure: HEAD flit arrived correctly, then forwarding stopped permanently. BODY and TAIL flits remained in the input FIFO indefinitely.

A cycle-accurate debug probe printed sa_grant, out_busy, xbar_vc, xbar_src, iu_sel_vc, and iu_flit on every cycle. The trace for test T1 (Local to North, VC0, 3 flits):

cy=12 t=115ns: sa_grant=...010 out_busy=00000 | port1: vc=0
cy=13 t=125ns: sa_grant=...010 out_busy=00010 | port1: vc=0
cy=14 t=135ns: sa_grant=...000 out_busy=00010 | port1: vc=1   <-- BUG

At cycle 12 (t=115ns), the SA grant fires. xbar_vc[1] is correctly set to 0, meaning the crossbar is connecting VC0 of input port 0 to output port 1. One cycle later at t=125ns, the HEAD flit is forwarding. At t=135ns (cycle 14), xbar_vc[1] has changed from 0 to 1. The input unit is presenting VC0 data (iu_sel_vc[0]=0), but the crossbar register says VC1 (xbar_vc[1]=1). The VC mismatch check iu_sel_vc[xbar_src[1]] == xbar_vc[1] evaluates to 0 == 1, which is false, so xbar_v_next[1] stays 0 permanently. No BODY or TAIL flit ever gets forwarded.

Exact Mechanism

The VC register corruption was caused by the SA issuing a second grant for the same output port one cycle after the first.

Cycle N (t=105ns): Input unit is in ROUTING state. sa_req[0]=1. out_busy[1]=0 (registered value, not yet updated). The combinational arbiter inside the SA sees sa_req_masked[0*5+1]=1 and resolves a grant. But grant_flat is registered, so this grant does not appear at the output yet.

Cycle N+1 (t=115ns): The registered grant_flat[0*5+1] appears. noc_router sees this and registers xbar_src[1]=0, xbar_vc[1]=0, xbar_live[1]=1, out_busy[1]=1 at posedge. The input unit also sees sa_grant[0]=1 and transitions sa_req[0] from 1 to 0 at posedge.

The critical question is what the SA sees during this cycle before the posedge. sa_req[0] is still 1 (registered value has not been cleared yet). out_busy[1] is still 0 (registered value has not been set yet). Therefore sa_req_masked[0*5+1]=1 again. The combinational arbiter sees a valid request with an unblocked output and resolves a second grant. This second grant gets registered into grant_flat.

Cycle N+2 (t=125ns): The second grant_flat[0*5+1]=1 appears. noc_router evaluates the grant installation logic for xbar_vc:

xbar_vc[jj] <= (iu_sa_req[ii][0] && iu_sa_dst0[ii] == jj) ? 1'b0 : 1'b1;

Now iu_sa_req[0][0] is 0 (it was cleared at posedge of cycle N+1). The condition is false. The ternary takes the else branch: xbar_vc[1] <= 1'b1. This overwrites the correct value of 0 with the incorrect value of 1.

The root cause is a one-cycle timing gap created when a registered pipeline output interacts with a feedback loop. The SA grant register, the sa_req register, and the out_busy register all update at the same posedge. During any given cycle, the SA’s combinational arbiter therefore sees the pre-posedge values of sa_req and out_busy, which are stale by one cycle relative to the grant that has just propagated out of grant_flat.

Fix: Removing the SA Grant Register

Three approaches were considered.

Approach A: A combinational out_busy_next lookahead: compute out_busy combinationally by OR-ing the registered value with current-cycle grants, and use this masked version to gate SA requests. This does not work because out_busy_next depends on sa_grant_flat, which is the registered output, not the combinational arbiter output. The window still exists.

Approach B: A just_granted mask inside the SA: track which outputs were granted in the previous cycle and block them. This incorrectly prevents a legitimately different source from being granted the same output in the following cycle.

Approach C: Remove the SA grant register entirely, making grant_flat a combinational wire output. This is the correct solution.

With combinational grant_flat:

Cycle N: sa_req[0]=1, out_busy[1]=0. SA combinationally resolves grant_flat[0*5+1]=1 immediately. noc_router sees this grant and registers out_busy[1]=1 at posedge. Input unit sees sa_grant[0]=1 and registers sa_req[0]=0 at posedge.

Cycle N+1: sa_req[0]=0, out_busy[1]=1. sa_req_masked[0*5+1]=0. No grant fires. Double-grant window eliminated.

The grant and the blocking signals now update at the same posedge in the same cycle. The pipeline stages are preserved without the SA needing to be a pipeline stage boundary: the DECODE state in the input unit provides S2, the registered xbar_live, xbar_src, and out_busy in noc_router provide S3, and the output FIFO write register provides S4.

All 52 test checks passed after this fix.
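The double-grant window and its removal can be reproduced with a minimal cycle-level Python model of the feedback loop. This is a sketch, not the router RTL: it tracks a single VC0 request from one input to one output, the function name simulate is hypothetical, and the xbar_vc update copies only the shape of the ternary quoted above (0 while the VC0 request is still asserted, 1 otherwise).

```python
def simulate(registered_grant, cycles=4):
    """Trace xbar_vc for one input (VC0 request) and one output port.

    Toy model of the feedback loop: sa_req, out_busy, xbar_vc and
    (optionally) the grant are registers updated on the same clock edge.
    """
    sa_req, out_busy, grant_r, xbar_vc = True, False, False, None
    history = []
    for _ in range(cycles):
        arb = sa_req and not out_busy          # combinational arbiter result
        grant = grant_r if registered_grant else arb
        # Next-state values are computed from pre-edge register contents.
        nxt_vc, nxt_busy, nxt_req = xbar_vc, out_busy, sa_req
        if grant:
            nxt_vc = 0 if sa_req else 1        # same shape as the xbar_vc ternary
            nxt_busy = True
            nxt_req = False
        # Clock edge: every register updates together.
        sa_req, out_busy, grant_r, xbar_vc = nxt_req, nxt_busy, arb, nxt_vc
        history.append(xbar_vc)
    return history

with_register = simulate(registered_grant=True)
without_register = simulate(registered_grant=False)
```

With the grant registered, xbar_vc is first set to 0 and then overwritten with 1 by the second, stale grant. With the grant combinational, the request and out_busy change on the same edge as the grant is consumed, so xbar_vc stays at 0.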


Three Structural Optimizations

With the pipelined router functionally correct, three structural optimizations were identified from the synthesis report.

Optimization 1: Removing xbar_flit_r and xbar_v_r

xbar_flit_r (80 FFs) and xbar_v_r (5 FFs) were introduced to create the S3-to-S4 pipeline boundary. The output unit already contains an SRAM-backed FIFO with its own write register. The FIFO write (posedge clock with wr_en = flit_valid & ~full) is itself a register boundary. Driving the output unit with the combinational signals xbar_v_next (valid) and xbar_out (flit data) and letting the FIFO write be the S3-to-S4 boundary eliminates 85 FFs without changing pipeline depth.

This also eliminates a short FF-to-FF path from xbar_v_r to the output FIFO write enable, which was the source of 228 additional hold-repair buffers in synthesis. The combinational path from xbar_v_next through the FIFO ready check and the ou_ready signal has enough gate delay to naturally satisfy hold without inserted buffers.

Optimization 2: One-Hot xbar_src Encoding

The original xbar_src[dst] was a 3-bit binary register selecting which of the 5 input ports is routed to each output port. A 3-bit binary select driving a 5-to-1 mux over 16-bit flit data produces high-fanout select lines: each select bit must drive the mux control input for all 16 data bits across all 5 output ports. The synthesis report flagged 78 fanout violations on these select nets.

Replacing xbar_src[dst] with a 5-bit one-hot register xbar_src_oh[dst] changes the mux structure. Each one-hot bit drives a 16-wide AND gate that masks the corresponding input flit bus. The 5 masked results are OR-reduced. Each iu_flit[src] bit fans out to exactly 5 AND gates (one per output port). Each xbar_src_oh[dst][src] bit fans out to exactly 16 AND gates (one per flit bit). This is a significantly more balanced fanout distribution.

The one-hot encoding also simplifies xbar_v_next: instead of indexing iu_valid[xbar_src[b]] through a mux, the one-hot bits gate each iu_valid[src] and OR-reduce, removing mux decode logic from the critical path.
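The AND-OR structure is functionally equivalent to the binary-indexed mux, which a short Python check can illustrate. The helper names below (mux_binary, mux_one_hot, valid_one_hot) are hypothetical and the flit values are arbitrary; only the 16-bit width and the 5 ports come from the text.

```python
FLIT_WIDTH = 16
NUM_PORTS = 5

def mux_binary(iu_flit, src):
    """Binary-select crossbar column: index the input flit bus directly."""
    return iu_flit[src]

def mux_one_hot(iu_flit, src_oh):
    """One-hot crossbar column: AND each input with its select bit, then OR."""
    mask_all = (1 << FLIT_WIDTH) - 1
    out = 0
    for src in range(NUM_PORTS):
        # Replicate the select bit across all 16 data lanes.
        select = mask_all if (src_oh >> src) & 1 else 0
        out |= iu_flit[src] & select
    return out

def valid_one_hot(iu_valid, src_oh):
    """Same AND-OR shape applied to the per-input valid bits."""
    return any(iu_valid[src] and (src_oh >> src) & 1 for src in range(NUM_PORTS))

flits = [0x1234, 0xBEEF, 0x0F0F, 0xFFFF, 0x8001]
equivalent = all(
    mux_one_hot(flits, 1 << s) == mux_binary(flits, s) for s in range(NUM_PORTS)
)
```

An all-zero select yields an all-zero output in the one-hot form, which is a convenient idle value when no source is connected to the output port.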

Optimization 3: Removing out_busy_next

The out_busy_next combinational block was introduced as a partial workaround for the double-grant bug when the SA grant was still registered. It computed a speculative next-cycle out_busy state by OR-ing the registered value with current-cycle grants. With the SA grant now combinational, out_busy updates at the same posedge as sa_req clears. The double-grant window does not exist and speculative lookahead is unnecessary. Simple ~out_busy[gj] masking is sufficient. Removing out_busy_next eliminates a 25-term combinational block and reduces the combinational logic feeding the SA request mask.


Final Optimized Pipeline Architecture

S1 (Buffer Write): Incoming flits write into the VC FIFO on the input port. The FIFO is fall-through: combinational read data always presents the head element.

S2 (Route Decode): Input unit state machine transitions from IDLE to DECODE upon seeing a HEAD flit. Destination port is registered into sa_dst. No switch allocation request is asserted yet.

S3 (Switch Allocation and Crossbar Setup): Input unit asserts sa_req in ROUTING state. SA resolves grants combinationally (no register). noc_router registers the crossbar connection state: xbar_src_oh (5-bit one-hot source select), xbar_vc (which VC is in the active wormhole), xbar_live (whether a wormhole is active on this output port), and out_busy (whether an output port is allocated). The combinational forwarding logic xbar_v_next evaluates xbar_live, VC match between registered xbar_vc and iu_sel_vc, input valid (iu_valid), and output ready (ou_ready) for each of the 5 output ports.

S4 (Crossbar Traversal and Output Write): The crossbar mux is combinational, using xbar_src_oh to select the source flit via AND-OR reduction. The output unit receives flit data and valid signal combinationally. The output FIFO write register (clocked on posedge) is the S3-to-S4 pipeline boundary. The FIFO read side drives the router’s output with downstream flow control via credits.

Wormhole forwarding after the HEAD flit does not re-arbitrate. Once xbar_live[dst] is set and the source and VC are registered, BODY and TAIL flits flow through the crossbar every cycle that iu_valid is high and ou_ready is high. xbar_live[dst] is cleared when a TAIL flit is forwarded, releasing the output port for subsequent arbitration.
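The forwarding condition and the TAIL-driven release can be written as a small Python predicate for one output port. It is a sketch under assumptions: the function name forward_step and the flit_is_tail flag are hypothetical, and sel_vc stands for the iu_sel_vc value of the registered source.

```python
def forward_step(xbar_live, xbar_vc, sel_vc, iu_valid, ou_ready, flit_is_tail):
    """Return (xbar_v_next, next_xbar_live) for one output port.

    sel_vc is the VC that the selected input is currently presenting.
    """
    xbar_v_next = xbar_live and (sel_vc == xbar_vc) and iu_valid and ou_ready
    # The wormhole is torn down only when a TAIL flit is actually forwarded.
    next_live = xbar_live and not (xbar_v_next and flit_is_tail)
    return xbar_v_next, next_live

body = forward_step(True, 0, 0, True, True, False)       # BODY flows, stays live
stalled = forward_step(True, 0, 0, True, False, True)    # output not ready
tail = forward_step(True, 0, 0, True, True, True)        # TAIL releases the port
mismatch = forward_step(True, 1, 0, True, True, False)   # VC mismatch blocks
```

The mismatch case is the condition from Bug 2: with xbar_vc corrupted to 1 while the input presents VC0, the predicate never becomes true and the wormhole stays live forever.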


Synthesis Comparison: Non-Pipelined vs Pipelined

Both versions were taken through the full OpenLane flow targeting sky130_fd_sc_hd at a 20 ns clock period.

| Metric | Non-pipelined | Pipelined |
|---|---|---|
| Sequential cells | 500 | 615 |
| Standard cell area (um²) | 60,051 | 65,464 |
| Setup WS at SS corner (ns) | 8.73 | 9.90 |
| Hold violations at SS corner | 14 | 0 |
| Hold violations (worst cross-corner) | 42 | 43 |
| Hold WNS (worst, ns) | -0.951 | -0.256 |
| Hold buffers inserted | 468 | 696 |
| Fanout violations | 78 | 72 |
| Switching power (W) | 0.003135 | 0.004005 |

The pipelined version achieves a 13.4% improvement in setup slack at the SS corner (8.73 ns to 9.90 ns), confirming the critical combinational path has been broken. Hold violations at SS corner dropped from 14 to 0 because the inserted pipeline registers create natural hold margins on previously short combinational paths. Hold violations at worst cross-corner degraded slightly (42 to 43) but the magnitude improved significantly (-0.951 ns to -0.256 ns).

Costs: 115 additional flip-flops (23% increase in sequential cells), 9% more standard cell area, 48.7% more hold buffers, and 27.8% more switching power. The bulk of the FF increase comes from the DECODE state register and the registered crossbar connection state (xbar_src_oh, xbar_vc, xbar_live, out_busy). The xbar_flit_r and xbar_v_r registers were already removed in the optimization round; retaining them would have added another 85 FFs.
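The relative changes quoted in this section can be recomputed from the table with a few lines of Python. The dictionary keys are hypothetical labels; the numbers are copied from the comparison table.

```python
# Values copied from the comparison table: (non-pipelined, pipelined).
metrics = {
    "sequential_cells": (500, 615),
    "area_um2": (60051, 65464),
    "setup_ws_ss_ns": (8.73, 9.90),
    "hold_buffers": (468, 696),
    "switching_power_w": (0.003135, 0.004005),
}

def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return 100.0 * (after - before) / before

changes = {name: percent_change(b, a) for name, (b, a) in metrics.items()}
extra_ffs = metrics["sequential_cells"][1] - metrics["sequential_cells"][0]

for name, value in changes.items():
    print(f"{name}: {value:+.1f}%")
```

Keeping the raw figures in one place makes it easy to rerun the comparison if the design is resynthesized at a different clock period or corner.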