L2 Tile Cache — Cached Tiled GEMM

The cache implemented here is a direct-mapped on-chip tile cache — it has tag RAM, valid bits, combinational hit detection, and a DMA fill path on miss. Calling it “L2” is a convenience label based on where it sits structurally: the PE accumulators act as L1 (data lives there during computation), and this cache sits one level above between the ping-pong SRAMs and DRAM. But the RTL itself has no knowledge of any hierarchy — it is just an SRAM with cache semantics. Whether it is L1, L2, or a scratchpad in your actual system depends entirely on what else surrounds it.

L2 Tile Cache Layer

This document covers the four new modules added on top of the existing ping-pong wrapper. The ping-pong wrapper (gemm_wrapper, gemm_ctrl, sram_bank, ping_pong_buf, output_buf) is treated as a black box here — see the other doc for that layer.


The Problem Being Solved

The ping-pong wrapper decouples loading from computing — the host DMA fills the shadow bank while the array computes from the active bank. But it still requires the host to supply fresh tile data for every tile, every time, directly from DRAM. For workloads where the same weights (A matrix) are reused across many output tiles — the dominant pattern in Conv2d — this means redundant DRAM traffic. The same filter slice gets fetched over and over.

The L2 tile cache intercepts every tile load request. If the tile is already resident on-chip, it streams directly from cache SRAM without touching DRAM. Only a genuine miss triggers a DMA fetch. The array never sees the difference — the shadow bank gets filled either way before FLIP_INIT.


Architecture

                          DRAM
                           │
              dma_a_req/ack/we/data    dma_b_req/ack/we/data
                    │                          │
          ┌─────────┴──────────────────────────┴─────────┐
          │                                               │
          │  ┌─────────────────┐   ┌─────────────────┐   │
          │  │  tile_cache (A) │   │  tile_cache (B) │   │
          │  │  tag+valid SRAM │   │  tag+valid SRAM │   │
          │  └────────┬────────┘   └────────┬────────┘   │
          │           │ hit/miss             │ hit/miss   │
          │  ┌────────┴────────┐   ┌────────┴────────┐   │
          │  │  cache_ctrl (A) │   │  cache_ctrl (B) │   │
          │  │  LOOKUP/FILL    │   │  LOOKUP/FILL    │   │
          │  └────────┬────────┘   └────────┬────────┘   │
          │           │ pp_wr_*          │ pp_wr_*        │
          │           ▼                  ▼                │
          │    ping_pong_buf(A)   ping_pong_buf(B)        │
          │           │                  │                │
          │           └──────┬───────────┘                │
          │                  ▼                            │
          │             gemm (unchanged)                  │
          │                  │ c_out_w                    │
          │             output_buf                        │
          │                  │                            │
          │        gemm_ctrl_cached (FSM)                 │
          │    ┌──────────────────────────────┐           │
          │    │ CACHE_REQ → CACHE_WAIT →      │           │
          │    │ FLIP_INIT → PRELOAD →         │           │
          │    │ RUNNING → DRAIN →             │           │
          │    │ CAPTURE_WAIT → CAPTURE →      │           │
          │    │ WRITEBACK → (loop or IDLE)    │           │
          │    └──────────────────────────────┘           │
          └───────────────────────────────────────────────┘

New Module Reference

tile_cache

Direct-mapped (one-way set-associative) SRAM cache. One instance per operand (A and B).

Parameters: NUM_SETS, TILE_WORDS (=K), WORD_WIDTH (=max(8*M,8*N)), TAG_WIDTH

Data SRAM: NUM_SETS × TILE_WORDS × WORD_WIDTH bits, indexed as mem[set * TILE_WORDS + word_addr]. Holds one full operand tile per set — K words, each word being one column-slice of A (8·M bits) or row-slice of B (8·N bits).

Tag RAM: NUM_SETS × TAG_WIDTH register array. Tags survive rst (only valid_r is reset, not the tag RAM data itself).

Valid bits: NUM_SETS × 1 register array, reset to 0 on rst.

Set index (direct-map): set = lu_tag[SET_W-1:0] — the low log2(NUM_SETS) bits of the tag. Tags that differ only in higher bits alias to the same set and evict each other. Tag 0 and tag 4 both map to set 0 with NUM_SETS=4.

Hit detection (combinational):

lu_set = lu_tag[SET_W-1:0]
lu_hit = lu_valid && valid_r[lu_set] && (tag_ram[lu_set] == lu_tag)

This fires in the same cycle as lu_valid — no extra latency for the tag compare.
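The lookup path can be sketched as a small Python model (this is not the RTL, just an illustration of the set-index/tag-compare/valid logic described above, using the same signal names):

```python
# Model of the tile_cache lookup path: set index = low bits of the tag,
# hit = valid bit AND exact tag match. NUM_SETS=4 matches the default.

NUM_SETS = 4
SET_MASK = NUM_SETS - 1          # low log2(NUM_SETS) bits of the tag

tag_ram = [0] * NUM_SETS
valid_r = [False] * NUM_SETS

def lookup(lu_tag):
    """Return (lu_set, lu_hit), mirroring the combinational hit detection."""
    lu_set = lu_tag & SET_MASK
    lu_hit = valid_r[lu_set] and tag_ram[lu_set] == lu_tag
    return lu_set, lu_hit

def commit_fill(fill_tag):
    """fill_done: commit the tag and set valid_r for its set."""
    s = fill_tag & SET_MASK
    tag_ram[s] = fill_tag
    valid_r[s] = True

commit_fill(0)           # tile tag 0 fills set 0
print(lookup(0))         # (0, True)  - resident
print(lookup(4))         # (0, False) - aliases set 0, tag mismatch -> miss
commit_fill(4)           # filling tag 4 evicts tag 0
print(lookup(0))         # (0, False) - evicted
```

The last three calls demonstrate the aliasing behaviour described above: tags 0 and 4 share set 0 with NUM_SETS=4 and evict each other.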

Fill interface:

fill_we, fill_set, fill_addr, fill_data  — write one word per cycle
fill_tag, fill_done                      — commit tag and set valid_r on completion

Hit/miss counters: 32-bit saturating counters incremented on every lu_valid pulse.


cache_ctrl

One instance per operand. Mediates between the tile request from gemm_ctrl_cached, the tile_cache lookup/fill ports, the DMA interface, and the ping-pong shadow bank write port.

FSM

  ┌────────┐  req_valid  ┌─────────┐
  │  IDLE  │────────────►│ LOOKUP  │
  └────────┘             └────┬────┘
                         hit? │  miss?
                    ┌─────────┴──────────┐
                    ▼                    ▼
             ┌────────────┐       ┌──────────┐
             │ HIT_STREAM │       │ MISS_REQ │
             │ (K cycles) │       │ wait ack │
             └─────┬──────┘       └────┬─────┘
                   │                   ▼
                   │            ┌───────────┐
                   │            │ MISS_FILL │
                   │            │ K words   │
                   │            └─────┬─────┘
                   └──────┬───────────┘
                          ▼
                      ┌──────┐
                      │ DONE │ → pulse load_done → IDLE
                      └──────┘

IDLE: waits for req_valid. When it fires, latches req_tag into saved_tag at that same posedge (critical — req_tag is only driven in ST_CACHE_REQ by the ctrl and is gone by the time cache_ctrl reaches LOOKUP).

LOOKUP (1 cycle): drives lu_valid=1, lu_tag=saved_tag to tile_cache. Hit or miss is combinationally available by the end of this cycle.

HIT_STREAM (K+1 cycles): reads K words from cache SRAM into the ping-pong shadow bank. The 1-cycle SRAM read latency means rd_en runs ahead of pp_wr_en by one cycle — pp_wr_addr lags rd_addr by 1, and the write fires when rd_data is valid.
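The one-cycle read/write skew can be modelled as below (an illustrative Python sketch, not the RTL; variable names are assumptions mirroring the signal names in the text):

```python
# HIT_STREAM pipeline model: the cache SRAM has 1 cycle of read latency,
# so the shadow-bank write trails the read address by one cycle. Streaming
# K words therefore takes K+1 cycles.

K = 3
cache_mem = [10, 20, 30]              # one resident tile: K words
shadow_bank = [None] * K

rd_data = None                        # registered SRAM read output
for cycle in range(K + 1):            # K+1 cycles total
    if rd_data is not None:           # pp_wr_en fires one cycle behind rd_en
        shadow_bank[cycle - 1] = rd_data   # pp_wr_addr = rd_addr - 1
    rd_data = cache_mem[cycle] if cycle < K else None  # issue next read

print(shadow_bank)                    # [10, 20, 30] after K+1 cycles
```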

MISS_REQ: asserts dma_req and dma_tile_id, holds until dma_ack fires.

MISS_FILL (K cycles): receives DMA words via dma_we/addr/data. Each word goes simultaneously to:

  • tile_cache fill port → writes into cache SRAM
  • pp_wr_* → writes directly into ping-pong shadow bank

This refill-and-use path means the shadow bank is filled in parallel with the cache, adding zero extra latency compared to a direct DMA-to-SRAM path.
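A minimal sketch of that dual write (Python illustration, not the RTL): each DMA word lands in the cache set and the shadow bank in the same cycle.

```python
# Refill-and-use model: during MISS_FILL, every DMA word is written to the
# tile_cache fill port and the ping-pong shadow bank simultaneously, so the
# fill adds no latency over a direct DMA-to-SRAM path.

K = 3
cache_set   = [None] * K     # fill_we / fill_addr / fill_data side
shadow_bank = [None] * K     # pp_wr_en / pp_wr_addr / pp_wr_data side

for addr, word in enumerate([7, 8, 9]):  # dma_we/addr/data, one word/cycle
    cache_set[addr]   = word             # into cache SRAM
    shadow_bank[addr] = word             # into shadow bank, same cycle

print(cache_set == shadow_bank)          # True - both filled in K cycles
```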

DONE (1 cycle): pulses fill_done to tile_cache (commits tag and sets valid_r), and pulses load_done to gemm_ctrl_cached.


gemm_ctrl_cached

Extended sequencer FSM. Replaces gemm_ctrl for the cached system. Adds two states (CACHE_REQ, CACHE_WAIT) at the front of each tile’s pipeline and connects to cache_ctrl via the tile request interface.

FSM

  ┌────────┐  start   ┌───────────┐         ┌────────────┐
  │  IDLE  │─────────►│ CACHE_REQ │────────►│ CACHE_WAIT │
  └────────┘          └───────────┘         └─────┬──────┘
      ▲                                           │ a_load_done
      │ tile_cnt==NUM_TILES-1                     │ & b_load_done
      │                                           ▼
  ┌───────────┐         ┌─────────────┐    ┌───────────┐
  │ WRITEBACK │◄────────│ CAPTURE     │    │ FLIP_INIT │
  └───────────┘         └─────────────┘    └─────┬─────┘
       │ else                  ▲                  ▼
       ▼                  ┌────┴──────┐    ┌──────────┐
  ┌───────────┐           │CAPTURE_WT │    │ PRELOAD  │
  │ CACHE_REQ │◄──────────│           │    └─────┬────┘
  └───────────┘           └───────────┘          ▼
                                             ┌─────────┐
                                             │ RUNNING │
                                             └────┬────┘
                                                  │ k_cnt==K-1
                                                  ▼
                                             ┌─────────┐
                                             │  DRAIN  │
                                             └─────────┘

CACHE_REQ (1 cycle): drives a_req_valid=1, b_req_valid=1 with tags derived from tile_cnt — A tag = {1'b0, tile_cnt[6:0]}, B tag = {1'b1, tile_cnt[6:0]}. These fire for exactly 1 cycle, then cache_ctrl has registered saved_tag and REQ goes low.

CACHE_WAIT: holds until a_load_done & b_load_done both asserted. Duration is:

  • Hit path: 1 (LOOKUP) + K+1 (HIT_STREAM) + 1 (DONE) = K+3 cycles = 6 cycles for K=3
  • Miss path: 1 (LOOKUP) + 1 (MISS_REQ) + DMA_LATENCY + K (MISS_FILL) + 1 (DONE) = DMA_LATENCY+K+3 cycles
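Both durations follow directly from walking the cache_ctrl FSM, which can be checked with a small calculator (illustrative Python, not the RTL):

```python
# CACHE_WAIT duration per the cache_ctrl FSM: LOOKUP(1), then either
# HIT_STREAM(K+1) or MISS_REQ(1)+DMA_LATENCY+MISS_FILL(K), then DONE(1).

def cache_wait_cycles(hit, K, dma_latency):
    cycles = 1                 # LOOKUP
    if hit:
        cycles += K + 1        # HIT_STREAM (K reads + 1-cycle SRAM latency)
    else:
        cycles += 1            # MISS_REQ (holds until ack)
        cycles += dma_latency  # host latency before ack
        cycles += K            # MISS_FILL, one word per cycle
    cycles += 1                # DONE: pulse fill_done / load_done
    return cycles

print(cache_wait_cycles(hit=True,  K=3, dma_latency=3))  # 6 = K+3
print(cache_wait_cycles(hit=False, K=3, dma_latency=3))  # 9 = DMA_LATENCY+K+3
```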

FLIP_INIT: fires flip_en to both ping-pong buffers. The shadow bank (now fully written by cache_ctrl) becomes the active bank. Gemm reads from it immediately in PRELOAD/RUNNING.

Everything from FLIP_INIT onward is identical to gemm_ctrl.

Tile tag generation:

tile-0: a_req_tag = 8'h00  b_req_tag = 8'h80
tile-1: a_req_tag = 8'h01  b_req_tag = 8'h81
tile-n: a_req_tag = n      b_req_tag = 8'h80 + n

The upper bit distinguishes A tiles from B tiles in the unified cache tag space.
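The tag construction can be stated in two lines (Python illustration of the concatenations above; the function name is made up):

```python
# Tag generation: MSB selects the operand (0 = A, 1 = B), low 7 bits
# carry tile_cnt. Mirrors {1'b0, tile_cnt[6:0]} / {1'b1, tile_cnt[6:0]}.

def tile_tags(tile_cnt):
    a_tag = tile_cnt & 0x7F            # {1'b0, tile_cnt[6:0]}
    b_tag = 0x80 | (tile_cnt & 0x7F)   # {1'b1, tile_cnt[6:0]}
    return a_tag, b_tag

print([tuple(hex(t) for t in tile_tags(n)) for n in range(3)])
# [('0x0', '0x80'), ('0x1', '0x81'), ('0x2', '0x82')]
```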


gemm_cached_wrapper

Top-level integration. Wires tile_cache × 2, cache_ctrl × 2, ping_pong_buf × 2, gemm, output_buf, and gemm_ctrl_cached together. No logic — only port connections and two pipeline registers.

Width normalisation: A words are 8*M bits, B words are 8*N bits. Both tile caches use WORD_WIDTH = max(8*M, 8*N). Narrower operand data is zero-padded before the cache write and stripped on read. Done via intermediate wires dma_a_data_w / dma_b_data_w.

out_we_r delay: same as in the ping-pong wrapper — out_we, out_acc, out_flush are delayed 1 cycle to align with gemm.c_out settling after done.


Latency Analysis

Cold miss (first access to a tile)

CACHE_REQ  (1)
CACHE_WAIT (LOOKUP=1 + MISS_REQ=1 + DMA_LATENCY + MISS_FILL=K + DONE=1)
FLIP_INIT  (1)
PRELOAD    (1)
RUNNING    (K)
DRAIN      (MIN_GAP)
CAPTURE_WAIT + CAPTURE (2)
WRITEBACK  (1)
out_we_r delay (1)
─────────────────────────────────────────────────────────────
Total ≈ DMA_LATENCY + 2*K + MIN_GAP + 10 cycles

For M=N=K=3, MIN_GAP=4, DMA_LATENCY=3: 3 + 6 + 4 + 10 = 23 cycles minimum. Measured: 26 cycles (a few extra CAPTURE_WAIT cycles waiting for done).

Cache hit (tile already resident)

CACHE_REQ  (1)
CACHE_WAIT (LOOKUP=1 + HIT_STREAM=K+1 + DONE=1)
FLIP_INIT  (1)
PRELOAD    (1)
RUNNING    (K)
DRAIN      (MIN_GAP)
CAPTURE_WAIT + CAPTURE (2)
WRITEBACK  (1)
out_we_r delay (1)
─────────────────────────────────────────────────────────────
Total ≈ 2*K + MIN_GAP + 10 cycles

For M=N=K=3: 6 + 4 + 10 = 20 cycles minimum. Measured: 22 cycles.
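Both minimum totals drop out of the same per-tile sum, which the following sketch computes (illustrative Python; the function name is an assumption):

```python
# Per-tile latency from the gemm_ctrl_cached state sequence. Passing
# dma_latency selects the cold-miss path; omitting it selects the hit path.

def tile_latency(K, MIN_GAP, dma_latency=None):
    if dma_latency is None:                      # hit: LOOKUP + HIT_STREAM + DONE
        cache_wait = 1 + (K + 1) + 1
    else:                                        # miss: + MISS_REQ + DMA + FILL
        cache_wait = 1 + 1 + dma_latency + K + 1
    # CACHE_REQ(1) + CACHE_WAIT + FLIP_INIT(1) + PRELOAD(1) + RUNNING(K)
    # + DRAIN(MIN_GAP) + CAPTURE_WAIT/CAPTURE(2) + WRITEBACK(1) + out_we_r(1)
    return 1 + cache_wait + 1 + 1 + K + MIN_GAP + 2 + 1 + 1

print(tile_latency(K=3, MIN_GAP=4, dma_latency=3))  # 23 (cold minimum)
print(tile_latency(K=3, MIN_GAP=4))                 # 20 (hit minimum)
```

The difference between the two calls is exactly DMA_LATENCY, matching the cycles-saved analysis below.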

Cycles saved per hit

Cold - Hit = MISS_REQ(1) + DMA_LATENCY + (MISS_FILL(K) - HIT_STREAM(K+1))
           = DMA_LATENCY cycles saved

Measured saving: 4 cycles (DMA_LATENCY=3 plus one extra FSM cycle on the miss path). Every cache hit eliminates one DMA round-trip, hiding DMA_LATENCY cycles of memory latency.

Throughput with cache

Miss path tile period: DMA_LATENCY + K_fill + T_tile + FSM_overhead
Hit  path tile period: K_stream    + T_tile + FSM_overhead

Cold: 99 cycles / 4 tiles = 24.75 cycles/tile
Warm: 83 cycles / 4 tiles = 20.75 cycles/tile


Hit-Rate Mathematics

With NUM_SETS sets and a working set of W unique tiles where W ≤ NUM_SETS (no eviction), repeated over N passes:

Total requests = W × N
Cold misses    = W  (always, one per unique tile on first access)
Hits           = W × (N-1)

Hit-rate = (N-1) / N
Passes   Hit-rate
  2      50.0%
  4      75.0%
  8      87.5%
 16      93.75%
 32      96.875%
  ∞      → 100%

For Conv2d: the filter (A) is typically reused across all output spatial positions. If the filter fits in cache, A hit-rate approaches 100% after warmup. B (input patches) vary per output position and may miss every time unless the spatial reuse pattern fits.

Eviction: with W > NUM_SETS, tags that share the same set (tag % NUM_SETS equal) evict each other. Tags {0,4,8,...} all map to set 0. Running tiles 0-7 with NUM_SETS=4 causes tiles 4,5,6,7 to evict tiles 0,1,2,3. Re-accessing tiles 0-3 afterwards incurs misses again.
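Both the (N-1)/N formula and the eviction pathology can be reproduced with a few lines of simulation (illustrative Python model of the direct-mapped cache, not the RTL):

```python
# Hit-rate simulation: set = tag % num_sets, fill on miss (evicting the
# previous occupant). With W <= num_sets the rate is (N-1)/N over N passes;
# with aliasing, same-set tags thrash and the hit rate collapses.

def simulate(tiles_per_pass, passes, num_sets):
    tag_ram = {}                       # set -> resident tag
    hits = total = 0
    for _ in range(passes):
        for tag in range(tiles_per_pass):
            s = tag % num_sets
            total += 1
            if tag_ram.get(s) == tag:
                hits += 1
            else:
                tag_ram[s] = tag       # fill (evicts whatever was there)
    return hits / total

print(simulate(4, 8, 4))    # 0.875 -> (N-1)/N for N=8, no aliasing
print(simulate(8, 8, 4))    # 0.0   -> tiles 4-7 evict 0-3 on every pass
```

The second call shows why running 8 tiles through 4 sets never warms up: every re-access finds its set occupied by the aliasing tag.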


Challenges and Fixes

1. Tag latched in wrong state — false cache hits on all tiles 1+

Problem: cache_ctrl needed to look up the tile tag in ST_LOOKUP, but req_tag from gemm_ctrl_cached is only driven during ST_CACHE_REQ (one cycle before LOOKUP). By the time the cache_ctrl was in LOOKUP, req_tag had returned to zero. The lookup always compared against tag 0 — which after tile-0 was resident — giving a spurious hit on every subsequent tile regardless of its actual tag.

Root cause: the original code did saved_tag <= lu_tag in LOOKUP, where lu_tag = req_tag — but req_tag was already gone.

Fix: latch in IDLE when req_valid=1 (the only cycle req_tag is valid):

ST_IDLE: if (req_valid) saved_tag <= req_tag;
// Then in LOOKUP:
lu_tag = saved_tag;   // not req_tag

2. fill_data port double-driven — X propagation into SRAM

Problem: cache_ctrl has an output port fill_data that drives tile_cache’s fill input. In the wrapper, this was accidentally connected to dma_a_data_w (a wire also driven by an assign statement from the DMA input). Two drivers on the same wire in Verilog produce X — which propagated into the cache SRAM and rendered every value read from it as X.

Fix: route through the intermediate wire a_fill_data_w correctly:

// WRONG: .fill_data(dma_a_data_w)   ← double-drives the wire
// RIGHT: .fill_data(a_fill_data_w)  ← cache_ctrl drives this, tile_cache reads it

3. FSM ordering: FLIP_INIT before fill → gemm reads zeroes

Problem: original design issued flip_en in FLIP_INIT before cache_ctrl had finished filling the shadow bank. After the flip, the newly-active bank was the old shadow (not yet written) and contained stale zeros. Gemm computed C = 0.

Fix: restructure per-tile sequence to fill first, flip second:

OLD: FLIP_INIT → CACHE_REQ → CACHE_WAIT (fill) → PRELOAD → RUNNING
NEW: CACHE_REQ → CACHE_WAIT (fill completes) → FLIP_INIT → PRELOAD → RUNNING

The shadow bank is guaranteed fully written before it becomes the active bank.

4. HIT_STREAM pp_wr_addr part-select on expression — syntax error

Problem: pp_wr_addr = (word_cnt - 1)[WORD_W-1:0] is illegal Verilog — part-selects cannot be applied to expressions, only to named signals.

Fix: rewrite as an arithmetic expression:

pp_wr_addr = word_cnt[WORD_W-1:0] - 1;

5. Multi-tile FSM stall — CACHE_WAIT entered without prior request

Problem: after tile-0 WRITEBACK, the FSM went to CACHE_WAIT for tile-1 but no cache_ctrl request had been issued (the original design tried to prefetch during DRAIN by re-arming req_issued, but the re-arming and next-state condition both depended on req_issued in the same cycle — the registered update was invisible to the combinational next-state logic). cache_ctrl stayed IDLE, load_done never fired, CACHE_WAIT spun forever.

Fix: remove prefetch entirely. Go directly WRITEBACK→CACHE_REQ for the next tile. Sequential fill-then-compute with no overlap. Simpler and correct. Prefetch overlap is future work.

6. T2 cache hit actually a miss — working set evicted between runs

Problem in testbench: with NUM_TILES=8, tile-0 (tag=0, set=0) was evicted by tile-4 (also set=0) during the same 8-tile run. The T2 “repeat” test fired after the 8-tile run, but tag=0 was no longer resident. So T2 always measured a cold miss (26 cycles) instead of a hit (22 cycles).

Fix: change NUM_TILES=4 so tiles 0-3 fill sets 0-3 with no aliasing. Tag=0 in set=0 survives the entire 4-tile run. T2 immediately after T1 (no rst between them, valid_r intact) correctly gets a cache hit.

7. T3/T4 pass-2 showing same speed as pass-1 — hits not faster

Problem: measured 99 cycles for both cold and warm passes — contradicting the hit path being 4 cycles faster per tile. Root cause: n_captured was not reset between pass-1 and pass-2, so wait_n_tiles(4) returned immediately (already had 4 captured from pass-1) and the pass-2 timing measurement was garbage.

Fix: add n_captured=0 before each pass. Warm pass correctly shows 83 cycles vs cold 99 cycles — 16 cycles total saved across 4 tiles = 4 cycles/tile.


File List

File Role
tile_cache New — direct-mapped SRAM cache with tag/valid/hit/miss
cache_ctrl New — per-operand FSM: LOOKUP → HIT_STREAM or MISS_FILL
gemm_ctrl_cached New — extended sequencer with CACHE_REQ / CACHE_WAIT states
gemm_cached_wrapper New — top-level integration
gemm_cached_wrapper_tb New — testbench with DMA model and T1-T6 test suite
sram_bank Existing (ping-pong layer)
ping_pong_buf Existing (ping-pong layer)
output_buf Existing (ping-pong layer)
gemm_ctrl Existing (ping-pong layer) — not used in cached design
gemm_wrapper Existing (ping-pong layer) — not used in cached design
gemm, pe, mac, line_buffer Existing core — unchanged

Parameters

Parameter Default Description
M 3 PE rows
N 3 PE columns
K 3 Dot-product depth
NUM_TILES 4 Tiles per start pulse
K_PARTS 1 K-partitions per tile (for large K)
SRAM_DEPTH 8 Words per ping-pong bank
CACHE_SETS 4 Cache lines per operand
TAG_WIDTH 8 Tag bits (supports 128 distinct A tiles + 128 B tiles)

DMA Interface

The DMA interface is the only external connection in the cached design. No operand data enters via the wrapper ports — only DMA fills on misses.

dma_X_req      ← asserted by RTL when a miss is detected
dma_X_tile_id  ← which tile to fetch (same as the tag that missed)
dma_X_ack      → host asserts when data is ready to flow (after DRAM latency)
dma_X_we       → host drives 1 for each valid data word
dma_X_addr     → word index within tile (0..K-1)
dma_X_data     → data word (K=3 words, one per clock)
dma_X_done     → host pulses on the last word to commit the cache line

The host (or memory controller) may take any number of cycles to respond to req; the RTL stalls in CACHE_WAIT until ack fires. Once ack fires, the host must send exactly K words on consecutive cycles with we=1, then pulse done.
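The host side of that handshake can be modelled as an event sequence (illustrative Python, not a real memory controller; the function name and event tuples are assumptions):

```python
# Host DMA responder model for one miss: dma_latency idle cycles, then ack,
# then exactly K consecutive data words with we=1, then a done pulse.

def host_dma_response(dram, tile_id, dma_latency, K):
    events = [("wait",)] * dma_latency        # RTL stalls in CACHE_WAIT
    events.append(("ack",))                   # data ready to flow
    for addr in range(K):                     # dma_we / dma_addr / dma_data
        events.append(("we", addr, dram[tile_id][addr]))
    events.append(("done",))                  # commits tag + valid_r
    return events

dram = {0: [11, 22, 33]}
print(host_dma_response(dram, 0, dma_latency=3, K=3))
# [('wait',), ('wait',), ('wait',), ('ack',),
#  ('we', 0, 11), ('we', 1, 22), ('we', 2, 33), ('done',)]
```

The event count (dma_latency + 1 + K + 1) is exactly the MISS_REQ-through-DONE portion of the miss latency budget.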


Compile and Run

iverilog -o sim_cached sram_bank.v ping_pong_buf.v output_buf.v tile_cache.v cache_ctrl.v gemm_ctrl_cached.v gemm_cached_wrapper.v line_buffer.v mac.v pe.v gemm.v gemm_cached_wrapper_tb.v && vvp sim_cached

Expected: 54 PASS / 0 FAIL


Test Coverage

Test What it proves
T1 Cold miss — DMA fills, miss counter increments, correct result
T2 True hit — no rst between T1 and T2, valid_r intact, hit counter increments, latency 22 < cold 26
T3 pass-1 4-tile cold fill — 4 misses, correct C for each unique tag
T3 pass-2 4-tile warm replay — 4 hits, 83 cycles vs cold 99, same correct C
T4 pass-1/2 Capacity fill + full hit — All 4 A hits confirmed CACHE WORKING
T4 note Eviction arithmetic — tag=4 aliases set=0, would evict tag=0
T5 cold DMA stall — latency breakdown: 3+3+7+13=26 cycles
T5 hit Hit latency — 22 cycles, saved 4 vs cold
T6a 8 passes × 4 tiles — 28/32 hits = 87.5% as predicted by (N-1)/N formula
T6b 16 passes warm — 64/64 hits = 100% (cache stays warm from T6a)
T6c Cold-start 16 passes — 60/64 = 93.75%, exactly 4 cold misses on pass-0