Ping-Pong Wrapper — Tiled GEMM
GEMM Wrapper — Memory Hierarchy Layer
A memory-hierarchy wrapper around the existing gemm systolic array. Adds ping-pong
SRAM double-buffering for operands, a C-tile accumulation buffer, and a sequencer FSM —
turning the bare streaming datapath into a self-timed tile engine.
The five new modules are:
| Module | Role |
|---|---|
sram_bank |
Primitive single-port synchronous SRAM |
ping_pong_buf |
Double-buffered operand SRAM (A or B) |
output_buf |
C-tile staging and partial-sum accumulation |
gemm_ctrl |
Sequencer FSM — all control signals |
gemm_wrapper |
Top-level integration |
gemm, pe, mac, and line_buffer are unchanged.
The Problem Being Solved
The bare gemm module is a pure streaming datapath. It has no memory. Every cycle of
every tile, the caller must present a full column-slice of A and a full row-slice of B
on a_in and b_in. The caller also has to track done, align clr_in, enforce the
minimum inter-tile gap, and guarantee back-pressure-free delivery. At scale this is
impossible to meet from DRAM bandwidth alone — the array sits idle waiting for data.
The wrapper solves this by decoupling loading from computing:
- While the array computes tile i from the active SRAM bank, the host DMA fills the shadow bank with tile i+1.
- When tile i finishes, the banks swap in one cycle. The array immediately starts tile i+1 without a stall.
- The sequencer handles
clr_in,valid_in, bank-flip timing, and output capture automatically.
Math and Throughput
Tile parameters
M = PE rows = output rows of C
N = PE columns = output cols of C
K = dot-product depth
MAC-ops per tile = M × N × K
Pipeline timing inside gemm
DONE_DELAY = K + (M-1) + (N-1)
MIN_GAP = (M-1) + (N-1)
T_tile = K + MIN_GAP = DONE_DELAY
DONE_DELAY is the number of cycles from the first valid_in=1 until done goes high.
It has two parts:
- K cycles — the dot-product accumulation depth. Each PE needs K beats of operands.
- (M-1)+(N-1) cycles — wavefront drain. The last data beat enters PE(0,0) on RUNNING cycle K-1, but it takes (i+j) more cycles to reach PE(i,j) through the PE pass-through registers. The farthest PE is PE(M-1,N-1) at distance M+N-2 = MIN_GAP.
T_tile is the minimum legal period between consecutive tile starts. After K cycles of
valid data, the caller must leave at least MIN_GAP idle cycles before the next clr_in.
The optimal back-to-back period is exactly T_tile = K + MIN_GAP = DONE_DELAY.
Theoretical peak throughput
Peak = (M × N × K) / T_tile MAC-ops/cycle
= (M × N × K) / (K + (M-1) + (N-1))
For M=N=K=3: Peak = 27 / 7 ≈ 3.86 MAC-ops/cycle
PE utilisation
Each PE accumulates for K cycles out of every T_tile cycle period:
PE utilisation = K / T_tile = K / (K + M + N - 2)
For M=N=K=3: 3/7 ≈ 42.9% asymptotic.
The idle fraction (MIN_GAP/T_tile = 4/7 ≈ 57%) is the wavefront drain overhead. It shrinks as K grows relative to M+N. For a Conv2d tile (M=4, N=16, K=36): K/T_tile = 36/54 ≈ 66.7%.
Wrapper overhead
The wrapper adds 2 cycles of PRELOAD (FLIP_INIT + PRELOAD state) before each tile and 1 cycle of CAPTURE_WAIT + CAPTURE + WRITEBACK after. For long tiles (large K) this overhead is negligible. For M=N=K=3 it adds 8 cycles to the first tile latency:
First c_out_valid latency from start = FLIP_INIT(1) + PRELOAD(1) + DONE_DELAY(7) +
CAPTURE_WAIT(1) + CAPTURE(1) + WRITEBACK(1)
= 12 cycles minimum
Measured: 15 cycles (includes 2 extra cycles for PRELOAD_A and the registered
out_we delay stage).
Architecture
┌─────────────────────────────────────────────────┐
│ gemm_wrapper │
│ │
a_wr_en/addr/data ───►│ ┌─────────────────┐ rd_en, rd_addr │
│ │ ping_pong_buf │───────────────────────► │
b_wr_en/addr/data ───►│ │ (A banks ×2) │ a_rd_raw[8M-1:0] │
│ └─────────────────┘ │ │
│ │ (gated by rd_en) │
│ ┌─────────────────┐ ▼ │
│ │ ping_pong_buf │ ┌──────────────────┐ │
│ │ (B banks ×2) │─►│ gemm │ │
│ └─────────────────┘ │ (unchanged) │ │
│ └────────┬─────────┘ │
│ ┌─────────────────┐ │ c_out_w │
│ │ output_buf │◄──────────┘ │
│ │ (C tile latch) │ out_we_r, out_flush_r │
│ └────────┬────────┘ │
│ │ c_out_data, c_out_valid │
│ ┌────────┴────────┐ │
│ │ gemm_ctrl │ valid_in, clr_in │
│ │ (FSM) │─────────────────────────► │
│ └─────────────────┘ flip_en, rd_en, rd_addr │
│ ▲ │
│ start, done │
└─────────────────────────────────────────────────┘
Module Reference
sram_bank
Single-port synchronous SRAM. Synthesisable behavioural model — in a real tape-out this would be replaced by a foundry SRAM macro.
Parameters: DATA_WIDTH, DEPTH, ADDR_WIDTH = $clog2(DEPTH)
Ports: clk, we, oe, addr, wdata, rdata
Read latency: 1 cycle. rdata is registered at every posedge. oe (output enable)
gates rdata to zero when low — this is the mechanism that prevents DRAIN-phase SRAM
residue from reaching the PE array.
posedge clk:
if (we) mem[addr] <= wdata
if (oe) rdata <= mem[addr]
else rdata <= 0
rdata is a reg, so it holds its last value. Setting oe=0 forces it to zero on the
next posedge — important for the SRAM → gemm pipeline cleanliness.
ping_pong_buf
Wraps two sram_bank instances. One bank is active (read port feeds gemm). The other
is shadow (write port receives the next tile from the host). A flip_en pulse swaps
roles on the next posedge.
Parameters: DATA_WIDTH, DEPTH
Ports: clk, rst, flip_en, bank_sel (out),
rd_en, rd_addr, rd_data,
wr_en, wr_addr, wr_data
Bank assignment:
bank_sel = 0 : bank0 = active (read) bank1 = shadow (write)
bank_sel = 1 : bank1 = active (read) bank0 = shadow (write)
The write port always targets the shadow bank. The read port always targets the active bank. Simultaneous R/W to different banks is safe and expected every tile period.
host writes ctrl reads
│ │
▼ ▼
┌─────────┐ ┌─────────┐
│ bank1 │ shadow │ bank0 │ active
│ (next │◄──wr_en │ (curr │──►rd_data
│ tile) │ │ tile) │
└─────────┘ └─────────┘
─── flip_en ───►
┌─────────┐ ┌─────────┐
│ bank1 │ active │ bank0 │ shadow
│ (curr │──►rd_data │ (next │◄──wr_en
│ tile) │ │ tile) │
└─────────┘ └─────────┘
bank_sel_d is a registered copy of bank_sel. The output mux uses bank_sel_d to
correctly align with the 1-cycle SRAM read latency — the mux selects which bank’s
rdata is presented based on which bank was active when the address was presented,
not the current cycle’s bank_sel.
output_buf
Holds one complete C tile (M×N × 32-bit) in a register file. Supports two write modes:
accumulate = 0 : buf[i] = wr_data[i] (overwrite — first K-partition)
accumulate = 1 : buf[i] = buf[i] + wr_data[i] (accumulate — subsequent K-partitions)
flush_en triggers a registered read of the entire buffer onto rd_data and asserts
c_valid for one cycle. The tricky part: we and flush_en can fire on the same cycle
(CAPTURE state). A combinational next_buf bypass solves this — rd_data reads from
the post-write value without waiting for the register to update:
wire next_buf[i] = we ? (accumulate ? buf[i]+wr_data[i] : wr_data[i]) : buf[i];
// flush_en uses next_buf, not buf, so simultaneous we+flush is correct
This is the key correctness fix: without the bypass, a simultaneous write+flush captures
the pre-write value of buf, giving stale C output.
gemm_ctrl
The sequencer FSM. Generates all control signals for gemm, the ping-pong buffers,
and output_buf. Nothing else has any autonomous control logic.
FSM states
┌────────┐ start ┌───────────┐ ┌──────────┐
│ IDLE │────────►│ FLIP_INIT │────────►│ PRELOAD │
└────────┘ └───────────┘ └────┬─────┘
▲ │
│ tile_cnt==NUM_TILES-1 ▼
┌───────────┐ done ┌──────────────┐ ┌─────────┐
│ WRITEBACK │◄─────────│ CAPTURE_WAIT │ │ RUNNING │
└───────────┘ └──────────────┘ └────┬────┘
│ else ▲ │ k_cnt==K-1
▼ │ ▼
┌──────────┐ ┌──────┴──────┐ ┌─────────┐
│ PRELOAD │◄─────────│ CAPTURE │ │ DRAIN │
└──────────┘ └─────────────┘ └─────────┘
gap_cnt==MIN_GAP-1
FLIP_INIT (1 cycle, tile-0 only): asserts flip_en. The shadow bank (loaded by the
host before start) becomes the active bank. For tiles 1+ this state is skipped —
the DRAIN flip at the end of the previous tile already did the swap.
PRELOAD (1 cycle): rd_en=1, clr_in=1. The SRAM sees rd_addr=0 with oe=1
and will output data[0] on the next cycle. clr_in=1 enters gemm’s delay lines —
it will reach PE(i,j) after i+j cycles, exactly when that PE’s first valid data arrives.
RUNNING (K cycles): valid_in=1, rd_en=1 for cycles 0..K-2, rd_en=0 on cycle
K-1 (last beat). The final rd_en=0 forces the SRAM to capture zero at the last
RUNNING posedge, so DRAIN cycle 0’s SRAM output is zero — no extra beat accumulates.
DRAIN (MIN_GAP cycles): valid_in=0. Wavefront drains. flip_en pulses on
gap_cnt=0 (first drain cycle) so the host can immediately start loading the new shadow
bank for the next tile. The host load window is T_tile = K + MIN_GAP cycles.
CAPTURE_WAIT: waits for done from gemm. done is a registered output inside
gemm, so it arrives 1-2 cycles after DRAIN ends. The FSM spins here until done=1.
CAPTURE (1 cycle): asserts out_we. This is delayed 1 cycle in the wrapper
(out_we_r) so output_buf sees the stable gemm.c_out value.
WRITEBACK (1 cycle): pulses tile_done. Returns to PRELOAD (next tile) or IDLE.
rd_addr is combinational
rd_addr is a combinational output (not registered), so the SRAM sees the address
immediately. Combined with 1-cycle SRAM latency:
PRELOAD state: rd_addr = 0 → SRAM output = data[0] arrives RUNNING cy0
RUNNING cy0: rd_addr = 1 → data[1] arrives RUNNING cy1
RUNNING cy1: rd_addr = 2 → data[2] arrives RUNNING cy2
RUNNING cy2: rd_en = 0 → SRAM output = 0 arrives DRAIN cy0
This gives exactly K beats of valid data to PE(0,0), zero otherwise.
gemm_wrapper
Top-level. Instantiates all of the above plus gemm. No logic — just wiring and two
pipeline registers.
out_we_r / out_acc_r / out_flush_r: registered 1 cycle from ctrl outputs.
gemm.c_out is a register that latches acc[] when done=1. The ctrl asserts out_we
in CAPTURE at the same posedge that gemm.c_out latches — so output_buf would capture
the pre-latch value. Delaying the write controls by 1 cycle fixes this.
Timing Diagram (M=N=K=3, T_tile=7, MIN_GAP=4)
Cycle: 0 1 2 3 4 5 6 7 8 ... 14 15
State: FLIP_INIT PRE RUN0 RUN1 RUN2 DRN0 DRN1 DRN2 DRN3 CAP WB
rd_en: 0 1 1 1 0 0 0 0 0 0 0
rd_addr:0 0 1 2 0 0 0 0 0 0 0
SRAM: 0 d[0] d[1] d[2] 0 0 0 0
↑ ↑ ↑ ↑
│ │ │ └─ SRAM captures 0 (rd_en=0 at posedge)
└─────┴─────┴─ K=3 valid beats to PE(0,0)
valid_in: 0 0 1 1 1 0 0 0 0
clr_in: 0 1 0 0 0 0 0 0 0
↑
└─ clr_in enters gemm delay lines here.
PE(0,0) clears in PRELOAD cy (delay=0).
PE(1,1) clears at RUNNING cy1 (delay=2).
PE(2,2) clears at DRAIN cy2 (delay=4).
flip_en: 1 0 0 0 0 1 0 0 0
↑ ↑
└─ tile-0: shadow→active│
└─ tile-0 DRAIN: swap for tile-1
Host load window opens here ─────►
done (from gemm): ←── rises here
│
out_we_r: 0 0 0 0 0 0 0 0 0 0 1
c_out_valid: 1
Host load window (K + MIN_GAP = 7 cycles between flip_en pulses). The host must write K words into the shadow bank within this window. At K=3, SRAM_DEPTH=8, this is comfortable at any reasonable memory bandwidth.
Ping-Pong Handshake
Tile 0 Tile 1 Tile 2
─────── ─────── ───────
Bank1: [active: data0] [shadow: data2 loading] [active: data2]
Bank0: [shadow: data1 loading] [active: data1] [shadow: data3 loading]
flip_en=1 (FLIP_INIT) flip_en=1 (DRAIN cy0) flip_en=1 (DRAIN cy0)
bank_sel: 0→1 bank_sel: 1→0 bank_sel: 0→1
Host: load bank1 before start │ load bank0 in DRAIN │ load bank1 in DRAIN
The host detects the bank flip by watching bank_sel_a toggle. Every time it toggles,
the shadow bank is freshly available for the next tile’s data.
Challenges and Fixes
1. SRAM read latency vs clr_in alignment
Problem: clr_in is delayed i+j cycles inside gemm to reach PE(i,j). Data
arrives at PE(i,j) after (i+j) PE pass-through hops plus 1 cycle of SRAM read
latency. So data is always 1 cycle late relative to clr at every PE except PE(0,0).
PE(i,j) would accumulate one beat before clr fires, giving wrong results.
Fix: Assert clr_in in the PRELOAD state (1 cycle before RUNNING) with rd_en=0
(SRAM output = 0). When clr reaches PE(i,j) after i+j cycles, the SRAM has had i+j+1
cycles of valid output — exactly matching. Because rd_en=0 during PRELOAD, the SRAM
outputs 0 when clr fires, so the MAC clears to zero, not to some residual value.
PRELOAD: clr_in=1, rd_en=0 → SRAM=0 → MAC clears to product(0,0)=0
RUNNING: clr_in=0, rd_en=1 → SRAM=data → K beats accumulate
For PE(1,1) specifically:
clr reaches PE(1,1) at PRELOAD + 2 = RUNNING cy1
data[0] reaches PE(1,1) at RUNNING cy0 + 2 PE hops = RUNNING cy2
↑ 1 SRAM cycle already included
clr fires at cy1, data arrives cy2 → 1 cycle gap → K=3 beats (cy2, cy3, cy4). ✓
2. SRAM bleed into DRAIN — extra beat at PE(0,0)
Problem: The last RUNNING cycle (k_cnt=K-1) has rd_en=1 and the SRAM is
presenting real data. At the posedge of that cycle, the SRAM latches the last read
address and will present data[K-1] at DRAIN cy0. PE(0,0) accumulates this extra beat.
Fix: Set rd_en=0 when k_cnt==K-1 (still in RUNNING state combinationally, but
the SRAM sees oe=0 and captures 0 instead of real data). DRAIN cy0 SRAM output = 0.
PE(0,0) gets exactly K beats.
ST_RUNNING: rd_en = (k_cnt == K-1) ? 1'b0 : 1'b1;
3. output_buf simultaneous write + flush capturing stale data
Problem: out_we fires in CAPTURE state at the same posedge that gemm.c_out
latches acc[] (because gemm.done is itself registered). So output_buf would
write the pre-latch value of c_out — all zeros from the previous state.
Fix A: Delay out_we/out_acc/out_flush by 1 cycle (registered in wrapper).
By the time out_we_r fires, gemm.c_out has already settled.
Fix B: Within output_buf, when we and flush_en fire on the same cycle, use
a combinational next_buf mux to read the post-write value, not the stored register.
This handles any residual same-cycle case cleanly.
4. FLIP_INIT adds a pipeline overhead for all tiles
Problem: Early implementations re-entered FLIP_INIT on every tile, adding 1 extra idle cycle per tile to the throughput and also misaligning the clr/data timing for tiles 2+.
Fix: FLIP_INIT runs only for tile-0 (tracked by first_tile register). For tiles
1+ the FSM goes directly WRITEBACK→PRELOAD. The bank swap for tiles 1+ is handled by
DRAIN’s flip_en pulse.
nstate = (tile_cnt == NUM_TILES-1) ? ST_IDLE : ST_PRELOAD; // not ST_FLIP_INIT
5. Testbench protocol — both SRAM banks must be pre-loaded
Problem: After hard_reset, bank_sel=0 and the write port targets shadow=bank1.
We can load bank1. But tile-1 runs from bank0 (after the first DRAIN flip), which is
still zero from reset. Tile-1 produces all-zero C.
Fix: preload_both_banks task uses a dummy-flip protocol:
1. hard_reset → bank_sel=0, shadow=bank1. Load bank1 (K words).
2. Pulse start for 1 cycle → FLIP_INIT fires → bank_sel=1, shadow=bank0.
3. Load bank0 (K words) while ctrl is in PRELOAD (safe, no RUNNING yet).
4. Assert rst to cancel the dummy run → bank_sel=0, state=IDLE.
5. Both banks now contain tile data. Assert real start.
6. rd_addr must be combinational, not registered
Problem: If rd_addr is registered (as is typical for FSM outputs), there are
2 cycles from “state issues addr” to “data at SRAM output” (1 for the register, 1 for
SRAM latency). Compensating for this required 2 PRELOAD states and complex addr
sequencing that was fragile.
Fix: Make rd_addr a combinational wire assigned from always @(*). The SRAM
sees the address immediately; 1 cycle later the data is ready. The sequencing becomes:
State changes to PRELOAD → rd_addr=0 combinationally → SRAM latches → data at RUNNING cy0
One clean pipeline stage, no compensation needed.
7. CAPTURE_WAIT — gemm.done arrives later than expected
Problem: After DRAIN ends, the FSM moved directly to CAPTURE. But gemm.done is
a registered output of the valid_sr shift register inside gemm — it arrives 1 cycle
after the shift register fills. Without CAPTURE_WAIT, CAPTURE fired before done was
high and gemm.c_out had not yet latched the final accumulator values.
Fix: Add ST_CAPTURE_WAIT — a spin state that waits for done=1 before moving to
CAPTURE. Since done stays high for K cycles, CAPTURE always fires within the valid
window.
File List
| File | New / Existing |
|---|---|
sram_bank |
New |
ping_pong_buf |
New |
output_buf |
New |
gemm_ctrl |
New |
gemm_wrapper |
New |
gemm_wrapper_tb |
New |
gemm |
Existing — unchanged |
pe |
Existing — unchanged |
mac |
Existing — unchanged |
line_buffer |
Existing — unchanged |
Compile and Run
iverilog -o sim_wrapper sram_bank.v ping_pong_buf.v output_buf.v \
line_buffer.v mac.v pe.v gemm.v \
gemm_ctrl.v gemm_wrapper.v gemm_wrapper_tb.v && vvp sim_wrapper
Expected output: 12 PASS / 0 FAIL
Parameters
| Parameter | Default | Description |
|---|---|---|
M |
3 | PE rows / output rows of C |
N |
3 | PE columns / output cols of C |
K |
3 | Dot-product depth |
NUM_TILES |
1 | Total tiles the FSM runs before returning to IDLE |
K_PARTS |
1 | K-partitions per output tile (1 = full K at once) |
SRAM_DEPTH |
8 | Words per ping-pong bank (must be ≥ K) |
Test Coverage
| Test | What it checks |
|---|---|
| T1 | Single tile, all-ones. Basic wrapper correctness, correct latency. |
| T2 | 8 back-to-back tiles, I×I=I. Ping-pong handoff correctness, all 8 tiles pass, throughput measured. |
| T3 | K-partitioned GEMM. A(M×2K)×B(2K×N) split into 2 sub-tiles, manual accumulation matches reference. |
| T4 | Conv2d pass-through. Identity filter with channel-packed im2col input. Output = input pixels. |
| T5 | Reset mid-RUNNING. FSM returns to IDLE cleanly, correct result on restart. |