
CNN Models for CIFAR-10 — Inference Using Verilog, Optimized for Hardware

Name: CNN For CIFAR10
Description: Implements a lightweight CNN in Verilog HDL for better hardware acceleration of image classification tasks
Start: 21 June 2025
Repository: NeVer
Type: Individual
Level: Beginner
Skills: HDL, Computer Vision, Programming, ML
Tools Used: Verilog, Icarus, Python, NumPy
Current Status: Ongoing (Active)
Progress: Developed a lightweight CNN, [Conv2D×2 + MaxPool]×3 → GAP → Dense(10), for CIFAR-10 image classification (32×32 RGB) using IEEE 754 floating-point as well as Q1.31, Q1.15, Q1.7, and Q1.3 fixed-point arithmetic, achieving about 84% accuracy with floating-point, Q1.31, and Q1.15 (Py ~85% | FP ~84% | Q31 ~84% | Q15 ~84% | Q7 ~82% | Q3 ~65%)
Next Steps: Optimise it for hardware inference

Model Architectures

MODEL_ARCH_1

[ (Conv2D → BN)×2 → MaxPool → Dropout(0.3) ]
→ [ (Conv2D → BN)×2 → MaxPool → Dropout(0.4) ]
→ [ (Conv2D → BN)×2 → MaxPool → Dropout(0.5) ]
→ Flatten → Dense(512) → BN → Dropout(0.5) → Dense(10, softmax)

  • Number of Parameters: 3,251,018
  • Test Accuracy: 90.91%
MODEL_ARCH_2

[ (Conv2D(32) → BN)×2 → MaxPool → Dropout(0.25) ]
→ [ (Conv2D(64) → BN)×2 → MaxPool → Dropout(0.35) ]
→ [ (Conv2D(128) → BN)×2 → MaxPool → Dropout(0.4) ]
→ Flatten → Dense(256) → BN → Dropout(0.5) → Dense(10, softmax)

  • Number of Parameters: 815,530
  • Test Accuracy: 88.84%
MODEL_ARCH_3

[ Conv2D(32)×2 → MaxPool ]
→ [ Conv2D(64)×2 → MaxPool ]
→ [ Conv2D(96) → MaxPool ]
→ Flatten → Dense(256) → Dense(10, softmax)

  • Number of Parameters: 517,002
  • Test Accuracy: 85.53%
MODEL_ARCH_4

[ Conv2D(16)×2 → MaxPool ]
→ [ Conv2D(32)×2 → MaxPool ]
→ [ Conv2D(64)×2 → MaxPool ]
→ GAP → Dense(10, softmax)

  • Number of Parameters: 72,730
  • Test Accuracy: 83.05%
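
The parameter count for MODEL_ARCH_4 can be cross-checked with a few lines of arithmetic. This is a minimal sketch that assumes 3×3 kernels and one bias per filter (neither is stated above); the function names are illustrative only. Under those assumptions the total matches the listed 72,730.

```python
def conv_params(in_ch, out_ch, k=3):
    """Weights plus one bias per filter for a k x k Conv2D layer.

    k=3 is an assumption; the text does not state the kernel size.
    """
    return k * k * in_ch * out_ch + out_ch


def dense_params(in_features, out_features):
    """Weights plus one bias per output unit for a Dense layer."""
    return in_features * out_features + out_features


def model_arch_4_params():
    # RGB input, then Conv2D(16)x2, Conv2D(32)x2, Conv2D(64)x2.
    channels = [3, 16, 16, 32, 32, 64, 64]
    total = sum(conv_params(a, b) for a, b in zip(channels, channels[1:]))
    # Global average pooling leaves one value per channel (64) for Dense(10).
    total += dense_params(64, 10)
    return total


print(model_arch_4_params())  # 72730
```

MaxPool and GAP layers contribute no parameters, which is why replacing Flatten → Dense(256) with GAP shrinks the model so sharply.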

Model Architecture Summary

Model ID     | Number of Parameters | Test Accuracy (%) | Model Size (MB)
MODEL_ARCH_1 | 3,251,018            | 90.91             | 12.40
MODEL_ARCH_2 | 815,530              | 88.84             | 3.11
MODEL_ARCH_3 | 517,002              | 85.53             | 1.97
MODEL_ARCH_4 | 72,730               | 83.05             | 0.28

Model size calculated assuming 32-bit floating-point weights (params × 4 bytes ÷ 1024²).
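
As a quick check of the size column, the formula above can be evaluated directly. The helper name `model_size_mb` is illustrative, not part of the project.

```python
def model_size_mb(num_params, bytes_per_param=4):
    """Model size as params x bytes per parameter / 1024^2 (32-bit floats by default)."""
    return num_params * bytes_per_param / 1024**2


MODELS = {
    "MODEL_ARCH_1": 3_251_018,
    "MODEL_ARCH_2": 815_530,
    "MODEL_ARCH_3": 517_002,
    "MODEL_ARCH_4": 72_730,
}

for name, params in MODELS.items():
    print(f"{name}: {model_size_mb(params):.2f} MB")
```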

Verilog Inference Q-Point Results

All results are based on a test set of 100 images (10 from each class), so the local test accuracy in percent equals the number of correct predictions.

All evaluations use the final model [MODEL_ARCH_4] listed in the table above.

Format                                              | Python Accuracy (%) | Verilog Accuracy (%) | Notes
Float32 (Python) vs Float64 IEEE-754 (Verilog real) | 85                  | 84                   | Verilog used double precision; the minor accuracy difference is not due to precision loss
Q1.31                                               | 84                  | 84                   | High-precision fixed-point; results match exactly
Q1.15                                               | 84                  | 84                   | 16-bit fixed-point; results match exactly
Q1.7                                                | 82                  | ~82                  | Moderate precision; Verilog value estimated from Python result
Q1.3                                                | 65                  | ~65                  | Low precision; Verilog value estimated from Python result
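
To give a feel for why accuracy drops at Q1.3, here is a minimal sketch of Q1.n quantization. It assumes Q1.n means one sign bit plus n fractional bits (range −1 to 1 − 2⁻ⁿ), round-to-nearest, and saturation on overflow; the project's actual rounding and overflow behaviour is not described above, and `to_q` / `from_q` are hypothetical helper names.

```python
def to_q(x, frac_bits):
    """Quantize a real value to a signed Q1.frac_bits integer (assumed: round, then saturate)."""
    scale = 1 << frac_bits
    q = round(x * scale)
    return max(-scale, min(scale - 1, q))


def from_q(q, frac_bits):
    """Convert a Q1.frac_bits integer back to a real value."""
    return q / (1 << frac_bits)


weight = 0.3
for frac_bits in (31, 15, 7, 3):
    q = to_q(weight, frac_bits)
    approx = from_q(q, frac_bits)
    print(f"Q1.{frac_bits}: stored {q}, represents {approx}, error {abs(weight - approx):.2e}")
```

With only 3 fractional bits the step size is 0.125, so a weight of 0.3 is stored as 0.25; at Q1.15 the step is about 3e-5, which is consistent with the Q1.15 results matching the higher-precision ones.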

YOSYS SYNTHESIS STATS (ONLY 1ST CONV)

   Number of wires:                181
   Number of wire bits:           2514
   Number of public wires:          39
   Number of public wire bits:     422
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                165
     $add                           16
     $adff                           5
     $adffe                         14
     $dffe                           5
     $eq                            12
     $ge                             3
     $logic_not                      2
     $logic_or                       3
     $lt                             9
     $mul                            6
     $mux                           41
     $ne                            13
     $neg                            2
     $not                            3
     $pmux                          11
     $reduce_and                    12
     $reduce_bool                    5
     $reduce_or                      1
     $sub                            2


1 modules:
  conv2d_mem
  
  • Module Definition: conv2d_mem

    • Parameters: WIDTH, HEIGHT, CHANNELS, FILTERS, K, PAD, BIAS_MODE_POST_ADD.
    • Ports: clk, rst, start, done, memory interfaces, and out_data output stream.
  • Image ROM interface

    • image_addr, image_addr_valid, image_addr_ready → control signals for address request.
    • image_r_data, image_g_data, image_b_data, image_data_valid → pixel data from memory.
    • image_data_ready → handshake back to memory.
  • Kernel ROM interface

    • kernel_addr, kernel_addr_valid, kernel_addr_ready → kernel coefficient fetch.
    • kernel_data, kernel_data_valid, kernel_data_ready → coefficient value transfer.
  • Bias ROM interface

    • bias_addr, bias_addr_valid, bias_addr_ready → bias fetch.
    • bias_data, bias_data_valid, bias_data_ready → bias value transfer.
  • Output stream

    • out_data (32-bit convolution output), out_valid (valid strobe).
  • Counters and indices

    • f = filter index, i = row index, j = column index.
    • m, n, c = kernel indices (row, col, channel).
    • in_x, in_y = input coordinates for convolution.
    • kernel_row = computed index for kernel access.
  • Datapath registers

    • accum (64-bit accumulator), kernel_mul (48-bit multiply result).
    • out_int, out_int_relu → final processed outputs.
    • pix_signed, kern16, bias16 → signed extensions of input, kernel, bias.
    • Latched memory outputs: image_r_q, image_g_q, image_b_q, kernel_q, bias_q.
  • FSM (Finite State Machine)

    • state register (6 bits).

    • Localparams define FSM states:

      • S_IDLE, S_START_FILTER, S_BIAS_REQ, S_BIAS_WAIT, S_SETUP_PIXEL, S_MAC_DECIDE, S_IMG_REQ, S_IMG_WAIT, S_KERN_REQ, S_KERN_WAIT, S_MAC_ACCUM, S_PIXEL_DONE, S_NEXT_PIXEL, S_NEXT_FILTER, S_DONE.
  • Reset logic

    • Initializes FSM to S_IDLE.
    • Clears counters, output signals, addresses, and accumulator.
  • FSM behavior

    • S_IDLE: Wait for start.
    • S_START_FILTER / S_BIAS_REQ: Request bias for current filter.
    • S_BIAS_WAIT: Wait for bias to be valid, latch into bias_q.
    • S_SETUP_PIXEL: Reset accumulator, set pixel iteration counters.
    • S_MAC_DECIDE: Calculate input pixel position (with padding check).
    • S_IMG_REQ / S_IMG_WAIT: Fetch image data.
    • S_KERN_REQ / S_KERN_WAIT: Fetch kernel weight.
    • S_MAC_ACCUM: Multiply-accumulate pixel × weight, update accum.
    • S_PIXEL_DONE: Apply bias, normalization, ReLU, set out_data.
    • S_NEXT_PIXEL: Iterate over j, i for next pixel.
    • S_NEXT_FILTER: Increment filter index f.
    • S_DONE: Assert done after all pixels and filters are computed.
  • Bias handling

    • Controlled by BIAS_MODE_POST_ADD parameter.
    • Option for post-addition bias vs scaled bias.
  • Output processing

    • Converts accumulator to 32-bit (out_int).
    • Applies bias.
    • Applies ReLU activation (clamp negative to zero).
    • Valid output signaled with out_valid.
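
The FSM walk described above can be mirrored by a small software reference model, which is handy for checking Verilog output pixel by pixel. This is a sketch, not the project's code: it assumes stride 1, zero padding, the index layout `image[c][y][x]` and `kernels[f][m][n][c]`, and it leaves out the normalization step; the comments map each line to the state names listed above.

```python
def conv2d_reference(image, kernels, biases, pad=1):
    """Zero-padded, stride-1 convolution with bias and ReLU.

    image[c][y][x], kernels[f][m][n][c], biases[f] -> out[f][i][j].
    The memory layout and loop order are assumptions for illustration.
    """
    channels = len(image)
    height = len(image[0])
    width = len(image[0][0])
    k = len(kernels[0])
    out = []
    for f in range(len(kernels)):                 # S_START_FILTER / S_NEXT_FILTER
        bias = biases[f]                          # S_BIAS_REQ / S_BIAS_WAIT
        plane = []
        for i in range(height):                   # S_NEXT_PIXEL (row index i)
            row = []
            for j in range(width):                # S_NEXT_PIXEL (column index j)
                accum = 0                         # S_SETUP_PIXEL
                for m in range(k):
                    for n in range(k):
                        in_y = i + m - pad        # S_MAC_DECIDE
                        in_x = j + n - pad
                        if not (0 <= in_y < height and 0 <= in_x < width):
                            continue              # padded position adds zero
                        for c in range(channels):
                            # S_IMG_* / S_KERN_* fetches, then S_MAC_ACCUM
                            accum += image[c][in_y][in_x] * kernels[f][m][n][c]
                row.append(max(0, accum + bias))  # S_PIXEL_DONE: bias + ReLU
            plane.append(row)
        out.append(plane)
    return out


# Tiny example: one 2x2 channel of ones, one 3x3 all-ones filter.
ones_image = [[[1, 1], [1, 1]]]
ones_kernel = [[[[1] for _ in range(3)] for _ in range(3)]]
print(conv2d_reference(ones_image, ones_kernel, biases=[0]))
```

Every output pixel in the example sees all four input pixels inside its 3×3 window, so each result is 4; a bias of −5 would push every sum below zero and ReLU would clamp the outputs to 0.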
rom_00_conv2d_bias
==== VIEW A: after frontend (before lowering) ====
Number of wires:                 16
Number of wire bits:             46
Number of public wires:           9
Number of public wire bits:      22
Number of memories:               1
Number of memory bits:          128
Number of processes:              2
Number of cells:                  4
  $logic_and                      2
  $meminit_v2                     1
  $memrd                          1

==== VIEW B: after memory -nomap (abstract memories kept) ====
Number of wires:                 17
Number of wire bits:             48
Number of public wires:           8
Number of public wire bits:      18
Number of memories:               0
Number of memory bits:            0
Number of processes:              0
Number of cells:                 12
  $adff                           2
  $dff                            1
  $logic_and                      2
  $mem_v2                         1
  $mux                            6

==== VIEW C: after synth_xilinx (tech-mapped) ====
Number of wires:                 19
Number of wire bits:             52
Number of public wires:           8
Number of public wire bits:      18
Number of memories:               0
Number of memory bits:            0
Number of processes:              0
Number of cells:                 36
  BUFG                            1
  FDCE                            1
  FDPE                            1
  FDRE                            6
  IBUF                            8
  LUT3                            1
  LUT4                            8
  OBUF                           10

rom_00_conv2d_kernel
==== VIEW A: after frontend (before lowering) ====
Number of wires:                 16
Number of wire bits:             61
Number of public wires:           9
Number of public wire bits:      32
Number of memories:               1
Number of memory bits:         3456
Number of processes:              2
Number of cells:                  4
  $logic_and                      2
  $meminit_v2                     1
  $memrd                          1

==== VIEW B: after memory -nomap (abstract memories kept) ====
Number of wires:                 17
Number of wire bits:             53
Number of public wires:           8
Number of public wire bits:      23
Number of memories:               0
Number of memory bits:            0
Number of processes:              0
Number of cells:                 12
  $adff                           2
  $dff                            1
  $logic_and                      2
  $mem_v2                         1
  $mux                            6

==== VIEW C: after synth_xilinx (tech-mapped) ====
Number of wires:                 99
Number of wire bits:            186
Number of public wires:           8
Number of public wire bits:      23
Number of memories:               0
Number of memory bits:            0
Number of processes:              0
Number of cells:                142
  BUFG                            1
  FDCE                            1
  FDPE                            1
  FDRE                            8
  IBUF                           13
  LUT2                            4
  LUT3                            1
  LUT4                            2
  LUT5                            1
  LUT6                           62
  MUXF7                          30
  MUXF8                           8
  OBUF                           10

image_b_rom
==== VIEW A: after frontend (before lowering) ====
Number of wires:                 17
Number of wire bits:             65
Number of public wires:           9
Number of public wire bits:      34
Number of memories:               1
Number of memory bits:         8192
Number of processes:              2
Number of cells:                  5
  $logic_and                      2
  $logic_not                      1
  $meminit_v2                     1
  $memrd                          1

==== VIEW B: after memory -nomap (abstract memories kept) ====
Number of wires:                 22
Number of wire bits:             86
Number of public wires:           9
Number of public wire bits:      34
Number of memories:               0
Number of memory bits:            0
Number of processes:              0
Number of cells:                 17
  $dff                            4
  $logic_and                      2
  $logic_not                      1
  $mem_v2                         1
  $mux                            9

==== VIEW C: after synth_xilinx (tech-mapped) ====
Number of wires:                243
Number of wire bits:            504
Number of public wires:           9
Number of public wire bits:      34
Number of memories:               0
Number of memory bits:            0
Number of processes:              0
Number of cells:                360
  BUFG                            1
  FDRE                           28
  IBUF                           14
  LUT2                            5
  LUT3                            7
  LUT4                            6
  LUT5                           23
  LUT6                          160
  MUXF7                          86
  MUXF8                          20
  OBUF                           10

Bias ROM: rom_00_conv2d_bias
  • Views

    • VIEW A (Frontend) – memory still modeled as abstract mem with initialization.
    • VIEW B (After memory -nomap) – memory collected into a single $mem_v2 cell (not yet mapped to logic); the surrounding control is lowered to flip-flops and muxes.
    • VIEW C (After synth_xilinx) – mapped to FPGA primitives (LUTs, FFs, buffers).
  • Module Definition: rom_00_conv2d_bias

    • A small ROM that stores bias values for convolution filters.
    • Single-port memory with initialization ($meminit_v2).
  • Resource breakdown

    • VIEW A:

      • 1 memory, 128 bits total.
      • 2 processes (for initialization and read).
      • Simple logic ($logic_and, $memrd).
    • VIEW B:

      • Memory collected into a single $mem_v2 cell, so 0 abstract memories are reported; it is not yet mapped to FFs or LUTs.
      • 12 cells total (2 $adff, 1 $dff, 2 $logic_and, 1 $mem_v2, 6 $mux).
    • VIEW C:

      • Fully mapped to FPGA primitives.

      • 36 cells total:

        • Sequential logic: 1 FDCE, 1 FDPE, 6 FDRE (flip-flops).
        • Combinational logic: 1 LUT3 and 8 LUT4, which hold the ROM contents and control logic.
        • I/O: 8 IBUF, 10 OBUF.
        • Clocking: 1 BUFG.
  • Summary: The bias ROM is small (128-bit storage), flattened into a few LUTs + registers. It’s essentially a lookup table of per-filter biases, fetched during convolution setup.


Kernel ROM: rom_00_conv2d_kernel
  • Views

    • VIEW A (Frontend) – large memory still intact.
    • VIEW B (After memory -nomap) – memory collected into a single $mem_v2 cell; the surrounding control is lowered to muxes + registers.
    • VIEW C (After synth_xilinx) – mapped into FPGA LUTs, MUXF7/F8 cascade for wide ROM.
  • Module Definition: rom_00_conv2d_kernel

    • Stores kernel weights for convolution (coefficients).
    • Much larger than bias ROM: 3456 bits total.
  • Resource breakdown

    • VIEW A:

      • 1 memory (3456 bits).
      • 2 processes, 4 cells.
    • VIEW B:

      • Memory collected into a single $mem_v2 cell; it is not yet mapped to FFs or LUTs.
      • 12 cells total (same pattern as bias: 2 $adff, 1 $dff, 2 $logic_and, 1 $mem_v2, 6 $mux).
    • VIEW C:

      • Large expansion into 142 FPGA cells.

      • Breakdown:

        • Sequential logic: 1 FDCE, 1 FDPE, 8 FDRE.
        • Combinational logic: 62 LUT6, plus smaller LUTs (LUT2, LUT3, LUT4, LUT5).
        • Wide mux structures: 30 MUXF7, 8 MUXF8 (used to build large ROM).
        • I/O: 13 IBUF, 10 OBUF.
        • Clocking: 1 BUFG.
  • Summary: The kernel ROM is significantly larger than the bias ROM (3456 vs 128 bits). After mapping, it consumes many LUT6s and cascaded multiplexers (MUXF7/MUXF8), which is typical for FPGA ROM inference. This block dominates the logic footprint compared to bias.
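
To put numbers on the comparison, the VIEW C cell counts for the two ROMs can be grouped by primitive family. The counts are copied from the synth_xilinx reports in this section; the grouping by name prefix (`LUT`, `FD`, `MUXF`) and the helper name `summarize` are choices made for this sketch.

```python
VIEW_C_CELLS = {
    "rom_00_conv2d_bias": {
        "BUFG": 1, "FDCE": 1, "FDPE": 1, "FDRE": 6,
        "IBUF": 8, "LUT3": 1, "LUT4": 8, "OBUF": 10,
    },
    "rom_00_conv2d_kernel": {
        "BUFG": 1, "FDCE": 1, "FDPE": 1, "FDRE": 8, "IBUF": 13,
        "LUT2": 4, "LUT3": 1, "LUT4": 2, "LUT5": 1, "LUT6": 62,
        "MUXF7": 30, "MUXF8": 8, "OBUF": 10,
    },
}


def summarize(cells):
    """Group tech-mapped cell counts into LUTs, flip-flops, and wide muxes."""
    def count(prefix):
        return sum(n for name, n in cells.items() if name.startswith(prefix))

    return {
        "total": sum(cells.values()),
        "luts": count("LUT"),
        "ffs": count("FD"),
        "wide_muxes": count("MUXF"),
    }


for module, cells in VIEW_C_CELLS.items():
    print(module, summarize(cells))
```

The kernel ROM uses 70 LUTs and 38 wide muxes against 9 LUTs and none for the bias ROM, while the flip-flop count barely changes (10 vs 8), so the growth is almost entirely in the lookup logic.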

Blue Channel ROM: image_b_rom
  • Views

    • VIEW A (Frontend) – memory modeled as abstract $mem with initialization.
    • VIEW B (After memory -nomap) – memory collected into a single $mem_v2 cell; the surrounding control is lowered to registers and combinational logic.
    • VIEW C (After synth_xilinx) – mapped to FPGA primitives (LUTs, FFs, buffers).
  • Module Definition: image_b_rom

    • Stores the blue channel pixel values for an image block.
    • Single-port memory initialized via $meminit_v2.
  • Resource breakdown

    • VIEW A:

      • 1 memory, 8192 bits total.
      • 2 processes (initialization and read).
      • Simple logic: $logic_and, $logic_not, $memrd.
      • Total 5 cells.
    • VIEW B:

      • Memory collected into a single $mem_v2 cell, so 0 abstract memories are reported; the remaining cells are flip-flops and a mux network.

      • Total 17 cells:

        • 4 $dff (flip-flops for state storage)
        • 2 $logic_and, 1 $logic_not
        • 1 $mem_v2
        • 9 $mux (address/data selection)
    • VIEW C:

      • Fully mapped to FPGA primitives.

      • Total 360 cells:

        • Sequential logic: 28 FDRE flip-flops.
        • Combinational logic: 5 LUT2, 7 LUT3, 6 LUT4, 23 LUT5, 160 LUT6.
        • MUX cascades: 86 MUXF7, 20 MUXF8.
        • I/O: 14 IBUF, 10 OBUF.
        • Clocking: 1 BUFG.
  • Summary:

    The blue channel ROM is the largest of the three (8192 bits) and uses the most logic after tech mapping: 360 cells, compared with 142 for the kernel ROM and 36 for the bias ROM. The report shows it implemented entirely in LUTs, flip-flops, and cascaded multiplexers (MUXF7/MUXF8), with no block RAM primitives.
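
The three memory sizes reported by Yosys line up with simple word counts. The sketch below assumes 8-bit words, a 3×3 kernel, 3 input channels, 16 first-layer filters, and one 32×32 image channel per ROM; the 8-bit width and 3×3 kernel are inferred from the bit counts rather than stated in the text.

```python
WORD_BITS = 8  # assumed word width, inferred from the reported bit counts


def rom_bits(num_words, word_bits=WORD_BITS):
    """Total storage in bits for a ROM of num_words entries."""
    return num_words * word_bits


FILTERS = 16            # first Conv2D layer of MODEL_ARCH_4
KERNEL = 3              # assumed 3x3 kernel
IN_CHANNELS = 3         # RGB
WIDTH = HEIGHT = 32     # CIFAR-10 image size

bias_bits = rom_bits(FILTERS)
kernel_bits = rom_bits(KERNEL * KERNEL * IN_CHANNELS * FILTERS)
image_channel_bits = rom_bits(WIDTH * HEIGHT)

print(bias_bits, kernel_bits, image_channel_bits)  # 128 3456 8192
```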

VISUAL COMPARISON FOR PYTHON AND VERILOG

[Figure panels: Python, Verilog]