Manual RTL Implementation

CIFAR-10 CNN Inference — Manual RTL Verilog Implementation

Pure handwritten Verilog implementation of the CIFAR-10 Mini-ResNet, developed across three progressive design stages: a simulation-oriented modular RTL model, a synthesizable handshake-driven ROM-based pipeline with layer-by-layer verification, and a streaming AXI-based architecture with a fully validated first-layer proof-of-concept. All arithmetic is fixed-point (Q1.7, SCALE = 128). No vendor IP cores are used anywhere in the design.


Design Progression Overview

The RTL implementation was developed in three distinct stages, each building on the previous:

Stage Directory Style Weight Storage Interface Scope
1 — Modular simulation modular/ Inline ROM None (simulation only) Full inference
2 — Synthesizable handshake verilog_roms_mems_hdshk/ .mem ROM files Start/done handshake Full inference, layer-by-layer
3 — Streaming AXI wtinp/, imginp/, conv/ AXI-MM runtime load AXI-MM + AXI-Stream First layer POC

Background: Hardware vs Software CNN Inference

In software, an image is loaded into memory and processed as a batch. On an FPGA, images arrive as a continuous pixel stream from a camera with synchronization signals:

  • VSYNC — start of a new frame
  • HSYNC — start of a new scan line
  • PCLK — pixel clock, driven by the camera, indicates when a pixel is valid

The FPGA latches pixel values in real time using these signals, typically buffering into a FIFO or line buffer before feeding the CNN. In this project, simulation replaces the live camera with .mem files or auto-generated ROM modules — the same hardware logic applies in both cases.

// Simplified capture sketch: assumes vsync/hsync are single-cycle pulses
always @(posedge pclk) begin
    if (vsync) begin
        row <= 0; col <= 0;
    end else if (hsync) begin
        col <= 0; row <= row + 1;
    end else begin
        pixel_buffer[row][col] <= pixel_data;
        col <= col + 1;
    end
end

On-Chip vs Off-Chip Memory

Type Where Use in this project
BRAM / LUTRAM On-chip Feature maps, small weight kernels
ROM (.mem files) On-chip (simulation) Full layer weights and images
External DDR Off-chip AXI streaming stage (Stage 3)

Small, frequently accessed data (feature maps, partial kernels) lives on-chip for low latency. Full layer weights and images are stored in ROMs or off-chip DDR. The Stage 3 design introduces DDR as the primary storage medium, with AXI protocols handling transfers.

Verilog Limitations Encountered

  • $readmemh works in simulation but support during synthesis depends on the FPGA flow. It is used extensively in Stages 1 and 2 for ROM initialization.
  • Multi-dimensional arrays must be flattened to 1D for hardware storage, requiring careful index arithmetic throughout all modules.
  • Real-time camera interfacing requires precise timing verification that simulation alone cannot fully validate without proper stimulus generators.

Tools Used

Tool Purpose
Icarus Verilog (iverilog) Verilog compilation and simulation
vvp Simulation runtime
Python ROM generation, fixed-point conversion, automation
TCL / Python Scripted inference automation
Yosys / OpenLane / OpenROAD / KLayout Synthesis, layout, routing

Fixed-Point Quantization and Weight Preparation

All weights, biases, activations, and intermediate values use Q1.7 fixed-point (SCALE = 128). Float weights are converted to 8-bit signed integers. Overflow is clamped:

SCALE   = 1 << 7       # 128
MAX_VAL =  127
MIN_VAL = -128

def float_to_fixed(f):
    val = int(round(f * SCALE))
    val = min(max(val, MIN_VAL), MAX_VAL)
    return val

Converted weights are written to hex .mem files for Verilog $readmemh:

for val in fixed_values:
    hexval = format((val + 256) % 256, "02X")
    f.write(hexval + "\n")
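The encoding can be inverted when spot-checking a generated .mem file. A minimal sketch (the helper name is illustrative, not part of the project scripts):

```python
def fixed_to_float(hexval: str, scale: int = 128) -> float:
    # invert the two's-complement hex encoding used in the .mem files
    v = int(hexval, 16)
    if v >= 128:
        v -= 256
    return v / scale

assert fixed_to_float("7F") == 127 / 128   # largest positive Q1.7 value
assert fixed_to_float("80") == -1.0        # most negative Q1.7 value
assert fixed_to_float("00") == 0.0
```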

The parameterized ROM module template used across all stages:

module rom_layer #(parameter DEPTH=1024, WIDTH=8)(
    input  wire                      clk,
    input  wire                      rst,
    input  wire                      addr_valid,
    output reg                       addr_ready,
    input  wire [$clog2(DEPTH)-1:0]  addr,
    output reg                       data_valid,
    input  wire                      data_ready,
    output reg  [WIDTH-1:0]          data
);
    reg [WIDTH-1:0] rom [0:DEPTH-1];
    initial $readmemh("weights_layer.mem", rom);
    // Two-cycle handshake logic (address and data phases) omitted for brevity
endmodule

Image ROMs are generated per channel (R, G, B). Each pixel is stored as an 8-bit hex value, 32×32 = 1024 entries per channel:

for i in range(32):
    for j in range(32):
        f.write(f"{array[i,j]:02X}\n")

module image_r_rom(
    input  wire        clk,
    input  wire [9:0]  addr,
    output reg  [7:0]  data
);
    reg [7:0] rom [0:1023];
    initial $readmemh("image_r.mem", rom);
    always @(posedge clk) data <= rom[addr];
endmodule

On-chip BRAM for feature map storage (synthesis hint):

(* ram_style = "block" *) reg [7:0] bram [0:1023];
always @(posedge clk) begin
    if (we) bram[addr] <= data_in;
    data_out <= bram[addr];
end

Two-Cycle Handshake Protocol

All inter-module communication in Stages 1 and 2 uses a two-cycle handshake that decouples address and data phases, preventing race conditions and allowing each consumer to read data at its own pace.

Cycle 1 — Address Phase: The consumer asserts addr_valid and drives an address. When the ROM is ready (addr_ready = 1), it latches the address and de-asserts addr_ready to block further requests until data is delivered.

Cycle 2 — Data Phase: The ROM asserts data_valid when data is ready. The consumer asserts data_ready to latch it. The ROM then de-asserts data_valid and re-asserts addr_ready for the next request.

// Address phase
if (addr_valid && addr_ready) begin
    addr_reg   <= addr;
    data       <= rom[addr];
    data_valid <= 1'b1;
    addr_ready <= 1'b0;
end

// Data phase
if (data_valid && data_ready) begin
    data_valid <= 1'b0;
    addr_ready <= 1'b1;
end

This protocol is used for all ROM accesses — image ROMs, kernel ROMs, and bias ROMs — throughout the design. Every module that reads from a ROM goes through this handshake, ensuring synchronized data delivery regardless of the consumer’s processing speed.
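The protocol's sequencing can be modeled in a few lines of Python. This is a behavioral sketch (names mirror the RTL signals; it is not cycle-accurate):

```python
class RomModel:
    """Behavioral model of the two-cycle ROM handshake."""
    def __init__(self, contents):
        self.rom = list(contents)
        self.addr_ready = True    # ROM can accept an address
        self.data_valid = False   # no data pending yet
        self.data = None

    def address_phase(self, addr):
        assert self.addr_ready, "ROM busy: previous data not yet consumed"
        self.data = self.rom[addr]   # latch address, fetch word
        self.addr_ready = False      # block further requests
        self.data_valid = True

    def data_phase(self):
        assert self.data_valid
        word = self.data             # consumer latches the data
        self.data_valid = False
        self.addr_ready = True       # re-arm for the next request
        return word

rom = RomModel([0x11, 0x22, 0x33])
rom.address_phase(1)
assert rom.data_phase() == 0x22 and rom.addr_ready
```

Because addr_ready drops during the data phase, a fast producer cannot outrun a slow consumer, which is the race-condition guarantee described above.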


Network Architecture

Input: 32×32×3

Block 1 (Residual):
  Conv3×3  (3  → 28, ReLU)
  Conv3×3  (28 → 28, ReLU)
  Conv1×1  (3  → 28, shortcut, no activation)
  Add(conv_out, shortcut) → MaxPool 2×2

Block 2 (Residual):
  Conv3×3  (28 → 56, ReLU)
  Conv3×3  (56 → 56, ReLU)
  Conv1×1  (28 → 56, shortcut, no activation)
  Add(conv_out, shortcut) + ReLU → MaxPool 2×2

Head:
  GlobalAvgPool → Dense(56 → 10) → Softmax/Argmax
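The spatial dimensions above can be sanity-checked with the standard convolution output formula (3×3 convs use same padding, so they preserve size):

```python
def conv_same(hw, k=3, pad=1, stride=1):
    # output spatial size of a convolution layer
    return (hw + 2 * pad - k) // stride + 1

hw = 32
assert conv_same(hw) == 32        # Block 1 convs preserve 32x32
hw = conv_same(hw) // 2           # MaxPool 2x2 -> 16x16
assert conv_same(hw) == 16        # Block 2 convs preserve 16x16
hw = conv_same(hw) // 2           # MaxPool 2x2 -> 8x8
assert hw == 8                    # GAP: 8x8x56 -> 56 values -> Dense(56 -> 10)
```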

Two architecture variants exist in the modular stage:

  • MODEL_ARCH_4 — initial architecture used for early validation. Contains two Conv2D layers and a MaxPool. Used for structural verification of the convolution and pooling modules before adding residual connections.
  • MODEL_ARCH_8 — full CNN with both residual blocks, GAP, dense, and softmax. This is the complete inference pipeline.

Stage 1 — Modular Simulation (ROM-based, No Handshake)

Directory Structure

auto_all.py
auto_single.py
fixed.py
img2rgb.py
cnn_mod.v
cnn_mod_tb.v
cnn_top.v
modular/
    conv.v
    dense.v
    gap.v
    maxpool.v
model_weights_fixed8/
sample_images/
results.txt

Purpose

This stage establishes functional correctness of the full CNN pipeline in a simulation-only environment. Weights and images are statically embedded in ROMs at elaboration time via $readmemh. There is no handshake protocol and no synthesizable interface — the design is optimized for architectural clarity and simulation speed rather than FPGA deployment.

The top-level module cifar10_top wires together conv2d_module → maxpool_module → avgpool_module → dense_module across both residual blocks and the classification head.

One-Time Weight Preparation

python fixed.py

Converts trained float weights to Q1.7 fixed-point and writes .mem files for all layers. Reports overflow count. Run once before any simulation.

Single Image Flow

# Step 1 — copy image
cp sample_images/cat/cat_0.png image.png

# Step 2 — convert to RGB text memory files
python img2rgb.py

# Step 3 — compile RTL
iverilog -o test cnn_mod.v modular/*.v

# Step 4 — simulate
vvp test

The predicted class index is reported via pred_index. Logits logits0 through logits9 are available for inspection. The class with the highest logit is the prediction.

Automated Single Image

python auto_single.py

Wraps image conversion, weight loading, RTL simulation, and result extraction into a single script.

Batch Evaluation (100 Images)

python auto_all.py

Iterates all 100 test images (10 per class), runs full RTL inference for each, compares prediction against ground truth, and logs to results.txt. The file records image index, predicted label, actual label, and match status.

Expected accuracy: ~84%, matching the Python fixed-point reference.

Module Summary

Module File Function
conv2d_module modular/conv.v 2D convolution with padding and ReLU
maxpool_module modular/maxpool.v 2×2 max pooling
avgpool_module modular/gap.v Global average pooling
dense_module modular/dense.v Fully connected layer
cifar10_top cnn_top.v Top-level integration

Stage 2 — Synthesizable Handshake-Driven ROM Pipeline

Directory Structure

fixed_hdshk.py
img2rgb.py
vis_fixed_single_new.py
vis_fixed_all_new.py

00_conv2d_hdshk.v          # Layer 0: Conv3×3, 3→28
01_conv2d_hdshk.v          # Layer 1: Conv3×3, 28→28
02_maxpool_hdshk.v         # MaxPool 2×2
03_conv2d_hdshk.v          # Block 2 Conv
09_gap_new.v               # Global average pool
10_dense_new.v             # Dense 56→10
11_softmax_new.v           # Softmax/argmax

00_tb_hdshk.v              # Testbench for layer 0
01_tb_hdshk.v

verilog_roms_mems_hdshk/   # Auto-generated ROM modules and .mem files
image_roms_mem_hdshk/      # Image ROM modules per channel
model_weights_fixed8/      # Q1.7 text weight files
model_weights_2/           # Secondary weight format

relu_0/                    # Python reference: post-conv0 activations
relu_1/                    # Python reference: post-conv1 activations
max_pool_1/                # Python reference: post-maxpool
global_avg_pool/           # Python reference: GAP outputs
dense_logits/              # Python reference: dense outputs
softmax/                   # Python reference: softmax outputs

00_conv2d_w_br/            # RTL output: layer 0 feature maps

Purpose

This stage migrates from the simulation-only model to a synthesizable design suitable for FPGA implementation. Key differences from Stage 1:

  • Explicit start/done handshake between every layer pair
  • ROM modules and .mem files are auto-generated by fixed_hdshk.py
  • Each layer is independently compilable and testable
  • Every layer’s outputs are compared against Python golden references
  • Structure is compatible with downstream synthesis tools

One-Time Setup

# Generate Q1.7 weights, ROM modules, and .mem files
python fixed_hdshk.py

This script:

  • Converts float weights to Q1.7 fixed-point (reports overflow count)
  • Verifies reconstruction against original float weights
  • Generates Verilog ROM modules in verilog_roms_mems_hdshk/
  • Generates .mem initialization files for all layers

# Prepare image ROM files
python img2rgb.py

# Create output directory for layer 0
mkdir 00_conv2d_w_br

Layer-by-Layer Simulation

Each layer is compiled and run independently. Output feature maps are written to text files and compared against the Python golden reference for that layer.

Layer 0 — Conv3×3, 3→28 channels:

iverilog -o test 00_conv2d_hdshk.v \
    verilog_roms_mems_hdshk/00_conv2d_*.v \
    image_roms_mem_hdshk/*.v \
    00_tb_hdshk.v
vvp test

Expected console output:

Starting conv2d_mem_tb...
Opened 00_conv2d_w_br/feature_map_0.txt
-> Wrote feature_map_0.txt
...
conv2d_mem_tb completed.

Feature maps in 00_conv2d_w_br/feature_map_*.txt are compared against relu_0/.

Layer 1 — Conv3×3, 28→28:

iverilog -o test 01_conv2d_hdshk.v \
    verilog_roms_mems_hdshk/01_conv2d_*.v \
    01_tb_hdshk.v
vvp test

Compare output against relu_1/.

Full Layer Progression

Layer RTL File Reference Dir Output Dir
Conv0 + ReLU 00_conv2d_hdshk.v relu_0/ 00_conv2d_w_br/
Conv1 + ReLU 01_conv2d_hdshk.v relu_1/ 01_conv2d_w_br/
MaxPool 02_maxpool_hdshk.v max_pool_1/ 02_mp_w_br/
Conv Block 2 03_conv2d_hdshk.v relu_2/ 03_conv2d_w_br/
GAP 09_gap_new.v global_avg_pool/
Dense 10_dense_new.v dense_logits/
Softmax 11_softmax_new.v softmax/

Visual comparison across all layers:

python vis_fixed_single_new.py   # single image
python vis_fixed_all_new.py      # all layers

Design Characteristics

  • Fully synchronous — single clock domain throughout
  • Explicit start/done handshake — each layer waits for the previous to assert done before beginning
  • Synthesizable structure — no simulation-only constructs in layer modules (testbenches use $readmemh for stimulus, not the DUT itself)
  • ROM-based weight storage — all weights are in generated Verilog ROM modules with $readmemh initialization
  • Layer-by-layer correctness — each stage is verified independently before integration, ensuring errors are localized

RTL Module Details

conv2d — Convolution Core

The convolution module is the most complex block in the design. It is fully parameterized and reused for every convolutional layer:

Parameter Description
WIDTH, HEIGHT Input feature map spatial dimensions
CHANNELS Number of input channels
FILTERS Number of output filters
K Kernel size (square)
PAD Input padding
BIAS_MODE_POST_ADD Whether bias is added before or after normalization

A single hardware module covers all conv layers without rewriting RTL — only parameters and ROM files change per layer.
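Because multi-dimensional arrays must be flattened to 1-D for hardware storage, each conv layer computes ROM addresses from its loop indices. A sketch of one plausible row-major layout (the actual ordering is fixed by the generator script, so treat this as illustrative):

```python
def kern_addr(f, c, ky, kx, C=3, K=3):
    # [filter][channel][ky][kx] flattened row-major into a 1-D ROM address
    return ((f * C + c) * K + ky) * K + kx

assert kern_addr(0, 0, 0, 0) == 0
assert kern_addr(0, 1, 0, 0) == 9     # one channel = K*K = 9 entries
assert kern_addr(1, 0, 0, 0) == 27    # one filter  = C*K*K = 27 entries
```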

FSM Control

The convolution is orchestrated by a finite state machine. The FSM is the only way to correctly sequence the ROM handshake requests across kernel dimensions, channels, and spatial positions without race conditions.

FSM states:

S_IDLE          → wait for start
S_START_FILTER  → request bias for current filter
S_BIAS_WAIT     → wait for bias ROM response
S_SETUP_PIXEL   → initialise accumulator for current output pixel
S_MAC_DECIDE    → check padding: is this kernel element in bounds?
S_IMG_REQ       → issue image ROM address request
S_IMG_WAIT      → wait for image ROM data valid
S_KERN_REQ      → issue kernel ROM address request
S_KERN_WAIT     → wait for kernel ROM data valid
S_MAC_ACCUM     → perform MAC, advance kernel/channel counters
S_PIXEL_DONE    → normalize, add bias, apply ReLU, emit output
S_NEXT_PIXEL    → advance spatial position
S_NEXT_FILTER   → advance filter counter
S_DONE          → assert done

Linear but looping — the FSM traverses all pixels, all channels, and all filters systematically. The padding check in S_MAC_DECIDE prevents out-of-bounds ROM accesses and implements zero-padding without dedicated pad logic:

in_y = i + m - PAD;
in_x = j + n - PAD;
if ((in_y >= 0) && (in_y < HEIGHT) && (in_x >= 0) && (in_x < WIDTH)) begin
    image_addr       <= in_y * WIDTH + in_x;
    image_addr_valid <= 1'b1;
end
// else: skip this kernel element (implicit zero padding)

Accumulation and Output

After the MAC loop over all kernel elements and channels:

// Normalize: scale down accumulator, add bias
if (BIAS_MODE_POST_ADD) begin
    numerator = accum;
    out_int   = ((numerator * 257) + (1<<15)) >>> 16;
    out_int   = out_int + bias16;
end
// ReLU
if (out_int < 0) out_int = 0;

The multiply-by-257 followed by >>> 16 is an efficient fixed-point normalization — since 257/65536 ≈ 1/255, it is equivalent to dividing by 255 with rounding (removing the raw 8-bit pixel scale), implemented as a multiply-and-shift to avoid a hardware divider.
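The trick can be checked numerically; for the accumulator magnitudes relevant here the multiply-and-shift agrees exactly with rounded division by 255:

```python
def norm255(acc):
    # (acc * 257 + 2^15) >> 16  ~  round(acc / 255), since 257/65536 ~ 1/255
    return (acc * 257 + (1 << 15)) >> 16

# Exact match with round-to-nearest for all accumulators below 32768;
# beyond that the approximation can drift by one LSB.
for acc in range(0, 32768):
    assert norm255(acc) == (acc + 127) // 255
```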

Top-Level Wiring

In the top-level module, each conv2d core receives broadcast image addresses to three channel ROMs (R, G, B). Kernel and bias ROMs are private to each filter. Multiple cores can compute different filters in parallel:

conv2d_core0 (
    .clk(clk), .rst(rst), .start(start),
    .image_addr(image_addr),
    .image_addr_valid(image_addr_valid),
    .image_addr_ready(image_addr_ready),
    .image_r_data(image_r_q),
    .image_g_data(image_g_q),
    .image_b_data(image_b_q),
    .image_data_valid(image_data_valid),
    .image_data_ready(image_data_ready),
    .kernel_addr(kernel_addr0),
    .kernel_addr_valid(kernel_addr_valid0),
    .kernel_addr_ready(kernel_addr_ready0),
    .kernel_data(kernel_data0),
    .kernel_data_valid(kernel_data_valid0),
    .kernel_data_ready(kernel_data_ready0),
    .bias_addr(bias_addr0),
    .bias_data(bias_data0),
    .bias_data_valid(bias_data_valid0),
    .bias_data_ready(bias_data_ready0),
    .out_data(conv0_out),
    .out_valid(conv0_valid)
);

max_pool — Max Pooling Core

Fully parameterized streaming max-pooling:

Parameter Description
WIDTH_IN, HEIGHT_IN Input feature map dimensions
CHANNELS Number of channels
POOL_SIZE Pooling window size (e.g., 2 for 2×2)
STRIDE Step between windows

Output dimensions derived automatically:

WIDTH_OUT  = (WIDTH_IN  - POOL_SIZE) / STRIDE + 1
HEIGHT_OUT = (HEIGHT_IN - POOL_SIZE) / STRIDE + 1
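Plugging in this project's pooling layers confirms the feature map sizes used downstream:

```python
def pool_out(n, pool=2, stride=2):
    # output dimension of a pooling layer (no padding)
    return (n - pool) // stride + 1

assert pool_out(32) == 16   # Block 1: 32x32 -> 16x16
assert pool_out(16) == 8    # Block 2: 16x16 -> 8x8
```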

FSM States

S_IDLE      → wait for start
S_START     → begin first channel and output cell
S_INIT_CELL → reset window counters and max register
S_REQ       → request current pixel (handshake address phase)
S_WAIT      → wait for data valid (handshake data phase)
S_ACC       → compare with current max, update if larger
S_OUTPUT    → stream out pooled max value
S_NEXT      → advance to next output cell, channel, or finish
S_DONE      → assert done

Pixel read request with address calculation:

in_y = ph * STRIDE + pi;
in_x = pw * STRIDE + pj;
ifm_addr       <= in_y * WIDTH_IN + in_x;
ifm_chan       <= c;
ifm_addr_valid <= 1'b1;

Max comparison after handshake completes:

if (sample_q17 > max_val)
    max_val <= sample_q17;

Output streaming with sign extension to 32-bit:

out_data  <= { {16{max_val[15]}}, max_val };
out_valid <= 1'b1;
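The replication concatenation above is ordinary sign extension. A quick Python check of the same bit manipulation on 32-bit patterns:

```python
def sext16_to32(v):
    # {{16{max_val[15]}}, max_val}: replicate the sign bit into the upper half
    return v | 0xFFFF0000 if v & 0x8000 else v

assert sext16_to32(0x7FFF) == 0x00007FFF   # positive value unchanged
assert sext16_to32(0xFFFE) == 0xFFFFFFFE   # -2 stays -2 as a 32-bit pattern
assert sext16_to32(0x8000) == 0xFFFF8000   # most negative 16-bit value
```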

add — Residual / Shortcut Add Module

Implements the skip connection in each residual block. When the shortcut path needs channel dimension alignment, a 1×1 convolution is applied first:

for (i = 0; i < H_IN; i++) begin
    for (j = 0; j < W_IN; j++) begin
        for (oc = 0; oc < OUT_CH; oc++) begin
            sum = 0;
            for (ic = 0; ic < IN_CH; ic++) begin
                prod = kernel_1x1[ic][oc] * input_img[i][j][ic]; // Q7 multiply
                sum  = sum + prod;
            end
            shortcut[i][j][oc] = (sum + ROUND_CONST) / SCALE + bias_1x1[oc];
        end
    end
end

After the shortcut is computed, element-wise addition with the main path:

for (i = 0; i < H_OUT; i++) begin
    for (j = 0; j < W_OUT; j++) begin
        for (oc = 0; oc < OUT_CH; oc++) begin
            res_out[i][j][oc] = conv_out[i][j][oc] + shortcut[i][j][oc];
        end
    end
end

The accumulator is kept wide enough to prevent overflow in Q7 representation. Element-wise addition is highly parallelizable — channels can be computed in parallel as independent SIMD operations in hardware.


gap — Global Average Pooling

Reduces each channel’s H×W feature map to a single value by averaging:

for (c = 0; c < CHANNELS; c++) begin
    sum = 0;
    for (i = 0; i < VALUES_PER_MAP; i++)
        sum = sum + feature_map[c][i]; // Q7 integer sum
    gap_result[c] = sum / VALUES_PER_MAP;
end

Output is one Q7 integer per channel (56 values). These feed directly into the dense layer without additional normalization, keeping the fixed-point scaling consistent throughout.


dense — Fully Connected Layer

Matrix-vector multiply between GAP output (56 values) and the weight matrix (56×10), plus bias addition. Each output neuron accumulates independently:

accum = 0;
for (i = 0; i < INPUT_SIZE; i++) begin
    product = gap_q7[i] * kernel_q7[i][j]; // Q7 × Q7
    accum   = accum + product;
end
out_q7[j] = accum + bias_q7[j];

Wide accumulators prevent overflow from the 56-term summation. Weight and bias ROMs are separate from the image ROMs and are accessed via the same two-cycle handshake. Each of the 10 output neurons can compute independently, allowing parallelism.
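A quick bound shows how wide "wide enough" must be, assuming raw Q1.7 integer magnitudes of at most 128:

```python
# Bound on the dense-layer accumulator (raw Q1.7 integers lie in [-128, 127]).
INPUT_SIZE = 56
worst = INPUT_SIZE * 128 * 128   # largest possible |sum of 56 products|
assert worst == 917504
assert worst < 2 ** 20           # fits comfortably in a 21-bit signed register
```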


softmax — Class Prediction

In hardware, full softmax (exponential + normalization) is resource-expensive and unnecessary for top-1 prediction. The module implements argmax — finds the index of the maximum logit:

max_idx = 0;
max_val = logits_q7[0];
for (i = 1; i < CLASS_NUM; i++) begin
    if (logits_q7[i] > max_val) begin
        max_val = logits_q7[i];
        max_idx = i;
    end
end
predicted_class = max_idx; // 0-9

A 4-bit output is sufficient to encode classes 0–9. The label mapping to strings (airplane, automobile, etc.) is done externally in the testbench or software. This keeps the hardware minimal while maintaining correct top-1 accuracy.


Stage 3 — Streaming AXI Architecture (First Layer POC)

Directory Structure

wtinp/
    axi_simple_mem.v
    axi_weight_bias_loader.v
    weight_bias_loader.v
    tb_axi_weight_bias_loader.v

imginp/
    ddr_mem_dualport.v
    dma_read.v
    byte_fifo.v
    axis_rgb_packer.v
    conv1_image_sink.v
    tb_axis_rgb_packer_verify.v
    tb_axis_rgb_packer.v

conv/
    pe.v
    conv.v
    conv_tb.v
    conv_full_layer1.v

Motivation: Why Move to AXI

The ROM-based stages (1 and 2) statically bind weights and images to the design at synthesis time. This works for simulation and verification but does not reflect realistic Zynq deployment, where:

  • Parameters live in external DDR and are transferred at runtime
  • Retraining requires no re-synthesis — only a memory write
  • The PS can update weights without rebuilding the bitstream
  • Larger models can be supported without on-chip ROM size limits

Stage 3 migrates to AMBA AXI protocols for all data movement, enabling runtime parameter loading and continuous image streaming.


Module 1: wtinp — AXI-MM Weight and Bias Loader

Why AXI4-Lite for Weights

Weights and biases are small relative to image data and accessed infrequently (once per inference, or once on startup). AXI4-Lite was chosen because:

  • Parameters are read sequentially — no random access needed
  • Burst transfers are not required for the parameter count
  • Simpler handshake logic reduces RTL complexity
  • Single outstanding transaction is sufficient

AXI4-Lite limitations accepted in this design: no burst support, lower throughput than full AXI4. For larger models, full AXI4 with burst would be preferable.

AXI Read Handshake

AXI memory-mapped reads use two independent decoupled channels:

Address channel:

  • ARVALID — master asserts: address is valid
  • ARREADY — slave asserts: ready to accept address
  • Transfer completes when both are high simultaneously

Read data channel:

  • RVALID — slave asserts: data is valid on bus
  • RREADY — master asserts: ready to consume data
  • Transfer completes when both are high simultaneously

Decoupling address and data channels allows address and data phases to proceed independently, improving timing robustness and enabling pipelined transactions.

Weight Loader FSM

axi_weight_bias_loader implements a 4-state FSM:

ST_IDLE   → wait for start
ST_WEIGHT → issue sequential read addresses for all W_COUNT weights
ST_BIAS   → issue sequential read addresses for all B_COUNT biases
ST_DONE   → assert done

Operation:

  1. Sequential read addresses issued for all quantized weights
  2. Each 32-bit AXI read response truncated to 8-bit signed: weight_mem[i] ← RDATA[7:0]
  3. After all weights, transitions to bias loading
  4. After all biases, done asserted

Only one outstanding read at a time — simplifies control while maintaining protocol correctness. All values stored as signed 8-bit fixed-point (Q1.7), consistent with the convolution core arithmetic.
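The 8-bit truncation in step 2 and its signed reinterpretation can be sketched as (the helper name is illustrative):

```python
def rdata_to_q17(rdata: int) -> int:
    # keep RDATA[7:0] and reinterpret the byte as a signed Q1.7 integer
    b = rdata & 0xFF
    return b - 256 if b >= 128 else b

assert rdata_to_q17(0x0000007F) == 127    # +127/128 in Q1.7
assert rdata_to_q17(0xDEAD0081) == -127   # upper RDATA bits are ignored
assert rdata_to_q17(0x00000080) == -128   # -1.0 in Q1.7
```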

Memory Map

Parameters arranged sequentially in external memory:

  • Weight block: addresses [0, W_COUNT-1]
  • Bias block: addresses [W_COUNT, W_COUNT + B_COUNT - 1]

This flat layout enables deterministic sequential fetching. A wrapper module per CNN layer encapsulates the loader and provides a clean interface to the convolution core.
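The flat layout reduces address generation to a base-plus-index computation. A sketch using the first-layer counts quoted elsewhere in this document (756 weights, 28 biases; the helper name is illustrative, not from the RTL):

```python
W_COUNT, B_COUNT = 756, 28   # layer-1 parameter counts

def param_addr(index: int, is_bias: bool) -> int:
    # weights occupy [0, W_COUNT-1]; biases follow at [W_COUNT, W_COUNT+B_COUNT-1]
    base = W_COUNT if is_bias else 0
    return base + index

assert param_addr(0, False) == 0                 # first weight
assert param_addr(W_COUNT - 1, False) == 755     # last weight
assert param_addr(0, True) == 756                # first bias
assert param_addr(B_COUNT - 1, True) == 783      # last bias
```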

Simulation

cd wtinp/
iverilog -o test axi_weight_bias_loader.v axi_simple_mem.v tb_axi_weight_bias_loader.v
vvp test

Expected output:

[PASS] AXI-MM weight & bias loader verified

Verifies: correct AXI read sequencing, correct address mapping, correct internal storage, first 16 weights and all biases printed for inspection.


Module 2: imginp — AXI-Stream Image Pipeline

Why AXI-Stream for Images

Image data is large and must be supplied continuously to the convolution core. AXI-Stream is chosen because:

  • No address phase overhead — data flows continuously
  • Natural backpressure support via TREADY
  • Suited for pixel pipelines consuming one pixel per clock
  • TLAST signal cleanly marks frame boundaries

AXI-Stream handshake (simpler than AXI-MM):

  • TVALID — producer: data is valid
  • TREADY — consumer: ready to accept
  • Transfer when TVALID AND TREADY
  • TLAST — end of frame marker

Pipeline Architecture

DDR Model → DMA Engine → Byte FIFO → RGB Packer → Conv Core

ddr_mem_dualport.v — DDR model:

  • Byte-addressable storage
  • Configurable read latency (models realistic DDR latency)
  • Dual-port: separate read and write ports
  • Read requests pipelined internally

dma_read.v — DMA engine:

  • Issues sequential memory requests to DDR
  • Tracks outstanding transactions with a counter: outstanding = req_count - resp_count
  • Limits concurrent requests to prevent overflow
  • Buffers responses into a FIFO
  • Outputs byte stream via AXI-Stream

byte_fifo.v — decoupling FIFO:

  • Backpressure-aware (ready/valid)
  • Decouples DDR latency from stream timing
  • Prevents pipeline stalls when DDR has variable latency

axis_rgb_packer.v — RGB reconstruction:

Images are stored in DDR in planar format: all R pixels, then all G pixels, then all B pixels:

[R block: 1024 bytes] [G block: 1024 bytes] [B block: 1024 bytes]

The packer receives sequential bytes, reconstructs 24-bit RGB pixels in correct channel order, and asserts TLAST at frame end. This ensures channel alignment is correct before entering the convolution core.
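A batch-form Python model of the planar-to-packed conversion (the RTL does this incrementally on a byte stream; sizes assume one 32×32 frame per channel):

```python
N = 32 * 32   # pixels per channel plane

def pack_planar(buf):
    # planar [R block][G block][B block] -> list of (r, g, b, tlast) pixels
    assert len(buf) == 3 * N
    return [(buf[i], buf[N + i], buf[2 * N + i], i == N - 1) for i in range(N)]

buf = [i & 0xFF for i in range(3 * N)]   # dummy frame contents
pixels = pack_planar(buf)
assert pixels[0] == (buf[0], buf[N], buf[2 * N], False)
assert pixels[-1][3] is True             # TLAST on the final pixel of the frame
```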

Simulation

cd imginp/
iverilog -o test ddr_mem_dualport.v dma_read.v byte_fifo.v \
    conv1_image_sink.v axis_rgb_packer.v tb_axis_rgb_packer.v
vvp test

Expected output:

PIX 0 | R=131 G=125 B=125 | TLAST=0 | EXP_R=131 EXP_G=125 EXP_B=125
...
[PASS] RGB packing verified

Verifies: DMA memory traversal, streaming behavior, FIFO backpressure, RGB pixel reconstruction correctness, TLAST frame boundary signaling.


Module 3: conv — Parametric Streaming Convolution Core

Architecture

The convolution core operates on a pixel stream using a sliding window mechanism implemented with three line buffers:

  • Incoming pixels are written row-wise into line memories
  • Previously received rows are shifted as new data arrives
  • Once three rows and three columns are buffered, a valid 3×3 window forms
  • win_fire signal triggers convolution only when a complete window is available

This approach enables continuous streaming without stalling once the pipeline is primed. Column and row counters track spatial position; a delayed coordinate stage ensures window stability before firing.
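A batch-form reference model of the window formation (the line buffers produce the same windows incrementally, one per clock once primed):

```python
W = H = 7
img = [[r * W + c + 1 for c in range(W)] for r in range(H)]   # pixels 1..49

def windows(img):
    # all valid 3x3 windows in raster order, no padding
    for r in range(len(img) - 2):
        for c in range(len(img[0]) - 2):
            yield [img[r + i][c + j] for i in range(3) for j in range(3)]

first = next(windows(img))
assert first == [1, 2, 3, 8, 9, 10, 15, 16, 17]
```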

Kernel Loading

Weights stored in a 3×3 register array, loaded via a lightweight sequential interface before streaming begins:

w_load  — enable
w_addr  — kernel element index (0–8)
w_data  — signed 8-bit weight value

In the integrated design, weights come from the AXI-MM loader rather than the testbench.

Processing Elements and MAC Tree

Nine processing elements (PEs in pe.v) perform synchronous registered multiplication — one per kernel element. PE outputs feed a pipelined accumulation tree:

  • Fully registered multipliers
  • One-cycle accumulation stage
  • Output latency determined by pipeline depth
  • Continuous throughput after pipeline primed

Architecture is structurally scalable — replicate for multi-filter execution by instantiating multiple convolution cores with different weight sets.

Standalone Verification

A 7×7 image was streamed with pixel values 1–49 (row-major) and the kernel:

1 2 3
4 5 6
7 8 9

For the first valid 3×3 window (top-left):

 1  2  3
 8  9 10
15 16 17

Expected output:

(1×1 + 2×2 + 3×3) + (8×4 + 9×5 + 10×6) + (15×7 + 16×8 + 17×9) = 537

RTL simulation produced:

OUT = 537
OUT = 582
OUT = 627
OUT = 672
OUT = 717
OUT = 852
OUT = 897
OUT = 942
OUT = 987
OUT = 1032
OUT = 1167
OUT = 1212
OUT = 1257
OUT = 1302
OUT = 1347
OUT = 1482
OUT = 1527
OUT = 1572
OUT = 1617
OUT = 1662
OUT = 1797
OUT = 1842
OUT = 1887
OUT = 1932
OUT = 1977

25 outputs — correct for a 7×7 input with no padding giving a 5×5 output feature map. All values match the mathematical convolution exactly, confirming the sliding window, PE multiplications, accumulation pipeline, and output timing are all correct.

cd conv/
iverilog -o test conv.v pe.v conv_tb.v
vvp test
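The 25 OUT values above can be reproduced with a direct reference convolution in Python:

```python
W = 7
img = [[r * W + c + 1 for c in range(W)] for r in range(W)]   # pixels 1..49
k = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# all 5x5 valid output positions, raster order
outs = [sum(img[r + i][c + j] * k[i][j] for i in range(3) for j in range(3))
        for r in range(5) for c in range(5)]

assert len(outs) == 25
assert outs[0] == 537      # first window, matches the hand calculation
assert outs[1] == 582      # second window
assert outs[-1] == 1977    # final window, matches the last OUT line
```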

Full First-Layer Implementation (conv_full_layer1.v)

The full first-layer proof-of-concept:

  • Loads 32×32×3 image from text files
  • Loads 3×3×3×28 = 756 kernel weights (28 filters)
  • Loads 28 biases
  • Performs full convolution with same padding
  • Applies Q1.7 fixed-point normalization and ReLU
  • Dumps all 28 feature maps to conv_res/feature_map_X.txt

cd conv/
iverilog -o test conv_full_layer1.v
vvp test

Expected output:

Loading images...
Loading kernel...
Loading bias...
Starting convolution...
Filter 0 done.
...
Filter 27 done.
All 28 feature maps generated.

Quantized First-Layer Validation Against Python Golden

Three consecutive rows (rows 8–10) of Feature Map 0:

Python golden:

0 17 0 0 0 55 22 19 20 21 0 0 0 0 0 0 0 0 3 63 84 41 29 21 0 0 0 35 33 0 0 0
0 38 0 0 0 75 37 18 9 8 0 0 0 4 8 22 55 69 82 75 61 33 20 8 0 0 0 27 0 0 0 0
0 5 0 0 0 52 49 29 15 4 0 0 22 67 36 39 61 107 133 77 37 23 14 0 0 0 24 17 0 0 0 93

RTL generated:

0 17 0 0 0 56 23 19 21 22 0 0 0 0 0 0 0 0 4 63 84 41 29 22 0 0 0 36 34 0 0 0
0 39 0 0 0 75 37 19 10 9 0 0 0 5 9 23 55 70 83 75 61 34 20 9 0 0 0 28 0 0 0 0
0 6 0 0 0 52 50 29 15 5 0 0 23 67 36 40 62 108 134 77 37 23 15 0 0 0 24 17 0 0 0 93

Differences are ±1 at a small number of positions (55→56, 22→23, 38→39, 133→134). These are deterministic fixed-point rounding deviations caused by:

  • Fixed-point rounding in integer division by SCALE=128
  • Ordering of accumulation across input channels
  • Division-by-255 normalization of pixel values

The spatial structure, activation distribution, and zero patterns are identical. All ReLU zero regions agree exactly. The ±1 differences are at positions where the true float value is very close to a rounding boundary — the RTL and Python implementations round in opposite directions. This confirms functional correctness of the quantized convolution and ReLU pipeline.


AXI Protocol Rationale Summary

Data type Protocol Reason
Weights / biases AXI4-Lite (AXI-MM) Small, infrequent, sequential reads — simple handshake sufficient
Image pixels AXI-Stream Large, continuous, latency-sensitive — no address overhead
Feature maps (future) AXI-Lite Control-accessible buffers, lightweight register access

Full AMBA compliance (burst modes, protection bits, cache hints, QoS) was intentionally not implemented. The design covers the minimal functional subset required for CNN inference validation. All AXI modules are handwritten RTL — no Xilinx IP cores used.


What Is Verified

Capability Stage 1 Stage 2 Stage 3
Full CNN inference (~84% accuracy) ✓ ✓ —
Layer-by-layer correctness Partial ✓ (all layers) ✓ (layer 0)
Synthesizable structure — ✓ ✓
Start/done handshake — ✓ —
AXI-MM weight loading — — ✓
AXI-Stream image pipeline — — ✓
RGB packing / DMA — — ✓
Streaming conv core — — ✓
First layer numerical match — — ✓ (±1 rounding)

Current Status and Next Steps

Stage 3 establishes the complete hardware foundation for FPGA CNN acceleration:

  1. AXI-MM read master correctly loads quantized weights and biases from memory
  2. DDR → DMA → AXI-Stream → RGB packer pipeline verified end-to-end
  3. Parametric streaming 3×3 convolution core verified with exact numerical match
  4. Full first convolution layer (28 filters, RGB input) validated against Python golden

Subsequent layers (Conv Block 2, MaxPool, GAP, Dense, Softmax) can be integrated using the same AXI-MM parameter loading and AXI-Stream data flow patterns established in the first-layer proof-of-concept. The wrapper structure from Stage 2 provides the layer-by-layer verification methodology for each new layer added.