VISION: Verilog for Image Processing and Simulation-based Inference Of Neural Networks

Tags: SystemVerilog, Verilog, Microarchitecture, AXI-Stream, Pipelining, Image Processing, Basic ML

Projects under this umbrella:

  • INT8 Hardware-Accelerated CNN
    CNN implemented in fixed-point RTL (<1% accuracy loss) with a systolic-array-based matrix multiplier
  • Bharat AI SoC Student Challenge by ARM & C2S India
    HLS-based approaches and automated workflows, with resource-bloat analysis of the above
  • ImProVe: IMage PROcessing using VErilog
    Image-processing toolkit with AXI-Stream interfaces and OpenCL functional validation
  • ImProVeR: IMage PROcessing using VErilog, Revised
    Image-processing toolkit with AXI-Stream interfaces and OpenCL functional validation
  • NeVer: NEural NEtwork in VERilog
    FCNNs for MNIST and EMNIST (handwritten) classification in fixed-point RTL
  • Wavefront Systolic Array: GEMM, Conv2D, and MAC Physical Design Study
    Benchmarking adder and multiplier architectures for MAC units via an identical RTL2GDS flow
  • Design and Formal Verification of a Fixed-Point CORDIC Core
    Parameterizable core with math wrappers, formally verified for selected properties

Repository: Mummanajagadeesh/ViSiON
Start date: Dec 2024

Note: The following projects evolved in parallel under a single umbrella of compute architectures, hardware acceleration, and RTL-level optimization. While not all efforts directly converge to a single product, each branch was intentionally explored to understand performance–accuracy–area trade-offs across digital design, ML inference, verification, and physical implementation flows.


VISION: Verilog for Image Processing and Simulation-Based Inference of Neural Networks

Timeline Overview

" I tried to ImProVe, but NeVer really did - so I MOVe-d on ¯\_(ツ)_/¯ “

Dec 2024 — ImProVe & Early Image Processing

Work began with ImProVe (IMage PROcessing using VErilog), a streaming RTL toolkit developed while experimenting with hardware image rotation (“ROVER”). Focus areas included:

  • Geometric transforms (rotation, scaling)
  • Color space transforms
  • Edge detection and filtering
  • Image enhancement pipelines

These designs were implemented as AXI-Stream compatible RTL pipelines with FIFO-based backpressure handling. The toolkit served as a practical platform for learning streaming datapath design and pipeline scheduling.
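The backpressure behavior of such a pipeline can be modeled in a few lines of Python (a toy software sketch, not the toolkit's RTL; the depth and signal names are illustrative):

```python
from collections import deque

class StreamFifo:
    """Toy model of an AXI-Stream FIFO with ready/valid backpressure."""
    def __init__(self, depth):
        self.depth = depth
        self.q = deque()

    @property
    def tready(self):          # upstream may send only while not full
        return len(self.q) < self.depth

    def push(self, tdata):     # upstream drives tvalid + tdata
        if self.tready:
            self.q.append(tdata)
            return True        # transfer accepted this cycle
        return False           # stalled: upstream must hold tdata

    def pop(self):             # downstream asserts tready
        return self.q.popleft() if self.q else None

fifo = StreamFifo(depth=2)
accepted = [fifo.push(x) for x in (10, 20, 30)]  # third beat stalls
print(accepted)        # [True, True, False]
print(fifo.pop())      # 10
print(fifo.push(30))   # True: a slot freed, the stalled beat is retried
```

The key property mirrored from AXI-Stream is that a stalled transfer is not dropped: the producer holds its data until `tready` returns.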


Feb 2025 – May 2025 — NeVer: Neural Network in Verilog

The project expanded into hardware preprocessing for handwritten digit recognition.

An RTL preprocessing pipeline for MNIST was implemented, including:

  • Thresholding
  • Contrast normalization
  • Character detection
  • Cropping and centering
  • Rotation correction
  • Resizing

This preprocessing chain fed a fixed-point MLP classifier, implemented entirely in RTL. The system evolved into NeVer (NEural NEtwork in VERilog).

Key characteristics:

  • Fixed-point inference
  • Automated inference flow using TCL/Perl scripts
  • Dataset handling and preprocessing automation
  • 75% accuracy on MNIST/EMNIST datasets


June – July 2025 — MOVe: Math Ops in Verilog

While implementing image rotation and neural network operators, a need arose for hardware math primitives. This led to the creation of MOVe (Math Ops in Verilog) — a collection of arithmetic accelerators implemented entirely in RTL.

Implemented components included:

  • Shift-add sine approximation cores
  • CORDIC prototypes for trigonometric functions
  • MAC units for neural network operations
  • Early exploration of posit arithmetic
  • Fixed-point arithmetic wrappers
  • Custom arithmetic pipelines for matrix operations

This work focused on numerical representation and hardware-friendly math implementations, forming the arithmetic foundation for later CNN accelerators.
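One such fixed-point wrapper can be sketched in Python (an illustrative multiply with round-to-nearest and saturation; the exact Q-format and rounding policy of the RTL wrappers may differ):

```python
def q_mul(a, b, frac=7, width=8):
    """Multiply two Qm.frac fixed-point integers and requantize,
    with round-to-nearest and saturation, as a MAC wrapper might."""
    lo = -(1 << (width - 1))
    hi = (1 << (width - 1)) - 1
    prod = a * b                                # full-precision product, 2*frac fractional bits
    prod = (prod + (1 << (frac - 1))) >> frac   # round-to-nearest, drop frac bits
    return max(lo, min(hi, prod))               # saturate to the output width

# 0.5 * 0.5 = 0.25 in Q1.7: 64 * 64 -> 32
print(q_mul(64, 64))       # 32
# -1.0 * -1.0 = +1.0 is not representable in Q1.7 -> saturates to 127
print(q_mul(-128, -128))   # 127
```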


Aug 2025 — CNN Acceleration & Systolic Compute

The focus shifted to RGB image classification using CIFAR-10.

Major developments:

  • Design of several CNN architectures with Pareto analysis across accuracy, parameters, and FLOPs
  • Quantization experiments (PTQ and QAT)
  • Final deployment using INT8 (Q1.7) inference
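The Q1.7 conversion step itself is simple; a minimal sketch of the per-weight PTQ rounding (the production flow's scale handling may differ):

```python
def quantize_q1_7(x):
    """Post-training quantization of a float in [-1, 1) to Q1.7:
    1 sign bit, 7 fractional bits, stored as int8."""
    q = round(x * 128)                  # 2**7 fractional steps
    return max(-128, min(127, q))       # saturate to the int8 range

def dequantize_q1_7(q):
    return q / 128.0

w = 0.709
q = quantize_q1_7(w)
print(q, dequantize_q1_7(q))                    # 91 0.7109375
print(abs(w - dequantize_q1_7(q)) <= 1 / 256)   # within half an LSB: True
```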

Hardware acceleration efforts included:

  • Systolic-array based matrix multiplication
  • Dedicated MAC microarchitectures
  • GEMM-based convolution implementations
  • Tiling strategies for convolution layers

RTL implementations were validated layer-by-layer against Python reference models.
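That validation style can be sketched as follows (a pure-Python golden model with a ±1 LSB tolerance for fixed-point rounding differences; the shapes and tolerance policy here are illustrative):

```python
def conv2d_ref(img, kernel):
    """Golden-model 'valid' 2D convolution (single channel, no padding)."""
    H, W = len(img), len(img[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(H - kh + 1):
        row = []
        for j in range(W - kw + 1):
            row.append(sum(img[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

def matches_within_1_lsb(rtl_out, golden):
    """Layer-by-layer check: every element within 1 LSB of the reference."""
    return all(abs(a - b) <= 1
               for ra, rb in zip(rtl_out, golden) for a, b in zip(ra, rb))

img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
k = [[1, 0], [0, -1]]
golden = conv2d_ref(img, k)
print(golden)                              # [[-4, -4], [-4, -4]]
rtl = [[-4, -3], [-5, -4]]                 # e.g. values dumped by an RTL testbench
print(matches_within_1_lsb(rtl, golden))   # True
```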


Nov 2025 — CORDIC IP Formalization

The earlier math experiments from MOVe were reorganized into a standalone CORDIC IP project.

Enhancements included:

  • Separation of core iteration engine and wrappers
  • Support for circular, linear, and hyperbolic modes
  • Implementation of trig, magnitude, atan2, exp, div, and mul functions
  • Formal verification using SymbiYosys + Yices2

This became the dedicated CORDIC IP project later integrated into other DSP systems.
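The rotation-mode recurrence at the heart of the core, sketched in floating-point Python (the RTL uses fixed-point shifts and a precomputed atan ROM; this is an illustration, not the IP's code):

```python
import math

def cordic_sincos(theta, iters=16):
    """Circular rotation mode: rotate (K, 0) toward angle theta using
    only shift-add style updates; converges to (cos theta, sin theta)."""
    atans = [math.atan(2.0 ** -i) for i in range(iters)]  # atan(2^-i) table
    K = 1.0                                                # combined gain
    for i in range(iters):
        K *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = K, 0.0, theta
    for i in range(iters):
        d = 1.0 if z >= 0 else -1.0    # drive residual angle z to 0
        x, y, z = (x - d * y * 2.0 ** -i,
                   y + d * x * 2.0 ** -i,
                   z - d * atans[i])
    return x, y                        # ~ (cos(theta), sin(theta))

c, s = cordic_sincos(math.pi / 6)
print(round(c, 3), round(s, 3))        # ~0.866 ~0.5
```

In hardware, the `* 2.0 ** -i` terms become arithmetic right shifts, which is why the core needs no multipliers.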


Dec 2025 — ImProVe Streaming Refinements

The ImProVe toolkit was extended with:

  • Improved AXI-Stream pipeline scheduling
  • FIFO-based backpressure handling
  • Additional image processing operators

Functional validation was performed using OpenCL software models to cross-check RTL behavior.


Jan 2026 — FPGA Deployment (Bharat AI SoC Challenge)

The CNN accelerator work was extended to Zynq-7000 FPGA deployment.

Key system components:

  • AXI-Stream dataflow pipelines
  • AXI4-Lite control interfaces
  • DMA-based feature map transfers
  • ARM–FPGA integration

Additional experimentation included:

  • HLS implementations (Vitis HLS / hls4ml / FINN)
  • Comparison of manual RTL accelerators vs HLS-generated designs
  • Hardware/software co-validation on the target FPGA platform

INT8 Fixed-Point CNN Hardware Accelerator and Image-Processing Suite

  • “Designed and evaluated multiple CIFAR-10 CNNs, selecting a Pareto-optimal 6-layer residual model balancing accuracy (~84%), parameter memory (~52 kB), and compute (~12–13 M FLOPs) for hardware deployment”
  • “Implemented a tiling-based convolution/GEMM accelerator with reusable MAC PEs and line-buffered dataflow; integrated AXI4-Lite control + AXI-Stream/DMA data movement; verified end-to-end via RTL testbenches against Python reference models”
  • “Developed pipelined processing elements using 8-bit Booth–Kogge MACs, with FSM-based control and a 2-cycle ready/valid handshake, ensuring timing-clean and scalable datapath operation”
  • “Performed quantization studies (PTQ/QAT) from FP32 (Q1.31) to fixed-point (Q1.7), achieving ~4× memory reduction with <1% accuracy loss; validated TensorFlow FP32 to RTL numerical consistency”
  • “Built automation flows (TCL/Python) for ROM/weight generation, testbench stimulus, and inference execution; generated coefficient memories and ensured deterministic layer-by-layer verification”
  • “Implemented a streaming image-processing toolkit (AXI-Stream) including edge detection, filtering, denoising, and enhancement, with pipelined RTL and FIFO-based backpressure handling; included MLP-based (E)MNIST classifier with automated preprocessing/inference”

Pipelined Systolic Array for GEMM/Conv2D with MAC PPA Study (Sky130 OpenLane)

  • Designed parameterized output-stationary 2D systolic array for signed 8-bit GEMM/Conv2D with wavefront scheduling and pipelined PEs
  • Implemented im2col-based Conv2D mapping onto GEMM core (4×16×36), achieving 42.47 MAC/cycle (99.5% peak) at 66.3% PE utilization
  • Explored 9 MAC architectures (Array/Baugh/Booth × RCA/Kogge/CSA) via full RTL-to-GDSII (OpenLane, sky130_fd_sc_hd); achieved 100 MHz timing closure with post-route STA correlation and 0 DRC/LVS violations
  • Quantified PPA tradeoffs: Booth+RCA 15.8k µm² / 537 cells (min area); Array+Kogge 5.68 ns (~176 MHz, max Fmax); Kogge ~4–5% faster at ~20% higher power; CSA ~67% larger and ~2× power with no timing gain
  • Designed ping-pong SRAM tiled GEMM with DMA-backed data movement (1-cycle buffer swap), enabling overlap of load and compute
  • Implemented direct-mapped tile cache (tag+valid) achieving 93.75% hit rate; verified across 240+ tests (GEMM/Conv, random/boundary/burst/multi-tile), confirming functional correctness and linear systolic scaling (K+M+N−2)
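The linear-scaling claim can be checked with a small wavefront model (a sketch: in an output-stationary array, PE(i, j) performs its k-th MAC at cycle i + j + k, so the final MAC lands at cycle M + N + K − 3 and total latency is K + M + N − 2):

```python
def systolic_gemm(A, B):
    """Cycle-level toy model of an output-stationary wavefront:
    PE(i, j) executes A[i][k] * B[k][j] at cycle i + j + k."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    last_cycle = 0
    for cycle in range(M + N + K):     # enough cycles to drain the array
        for i in range(M):
            for j in range(N):
                k = cycle - i - j      # which operand pair arrives now
                if 0 <= k < K:
                    C[i][j] += A[i][k] * B[k][j]
                    last_cycle = max(last_cycle, cycle)
    return C, last_cycle

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C, last = systolic_gemm(A, B)
print(C)            # [[19, 22], [43, 50]]
print(last + 1)     # latency = K + M + N - 2 = 4
```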

Design & Formal Verification of Parameterizable Fixed-Point CORDIC IP

  • “Implemented shift-add datapath with all 6 modes rotation/vectoring (circular/linear/hyperbolic); width/iter/angle frac/output width–shift scaling swept across configs”
  • “Built trig/mag/atan2/mul/div/exp wrappers; observed ~1e-5 RMS error (@32b, 16 iter) baseline vs double-precision references”
  • “Proved handshake, deadlock-free bounded liveness, range safety, symmetry & monotonicity via SystemVerilog assertions (SymbiYosys/Yices2)”
  • “Auto-generated atan tables & param files via Python; FuseSoC-packaged core with documented sensitivity, error trends & failure regions”
  • “Variants: pipelined/SIMD/multi-issue; Systems: radix-2 FFT/IFFT, DPLL, Sigma-Delta ADC Front-End, QAM16 receiver (Costas carrier + Gardner timing recovery)”

Expanded Project Overview

Phase 1 — From ROVER to ImProVe (Image Processing Practice Platform)

The project began with hardware-based image rotation. After eliminating non-synthesizable constructs, rotation, geometric transforms, color transformations, filtering, and enhancement algorithms were implemented fully in RTL.

This evolved into ImProVe (IMage PROcessing using VErilog) — initially developed as a practice platform to understand streaming datapaths and pipeline scheduling. The toolkit included:

  • Edge detection
  • Denoising
  • Filtering and enhancement
  • Geometric transforms
  • Thresholding and contrast adjustment

Applications explored (as practice implementations):

  • Label detection
  • Document scanner pipeline
  • Stereo depth estimation

AXI-Stream compliant interfaces were implemented with FIFO-based backpressure control. Later, OpenCL was used for functional validation against software references.


Phase 2 — NeVer: Neural Network in Verilog

Around Feb 2025, preprocessing logic for MNIST was designed:

  • User input via Tkinter GUI
  • RTL-based thresholding
  • Contrast scaling
  • Character detection
  • Cropping
  • Resizing
  • Rotation correction

This created a full hardware preprocessing workflow.
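The software-equivalent of part of that chain can be sketched in pure Python (threshold, crop, and center steps only; the threshold value and canvas size are illustrative, not the RTL's parameters):

```python
def threshold(img, t=128):
    """Binarize a grayscale image (list of rows, pixel values 0-255)."""
    return [[1 if p >= t else 0 for p in row] for row in img]

def crop_to_content(img):
    """Crop to the bounding box of nonzero pixels (character detection)."""
    rows = [i for i, r in enumerate(img) if any(r)]
    cols = [j for j in range(len(img[0])) if any(r[j] for r in img)]
    r0, r1, c0, c1 = min(rows), max(rows), min(cols), max(cols)
    return [row[c0:c1 + 1] for row in img[r0:r1 + 1]]

def center_pad(img, size):
    """Center the cropped glyph on a size x size canvas."""
    h, w = len(img), len(img[0])
    top, left = (size - h) // 2, (size - w) // 2
    out = [[0] * size for _ in range(size)]
    for i in range(h):
        for j in range(w):
            out[top + i][left + j] = img[i][j]
    return out

raw = [[0, 0, 0, 0],
       [0, 200, 0, 0],
       [0, 180, 0, 0],
       [0, 0, 0, 0]]
glyph = center_pad(crop_to_content(threshold(raw)), 4)
print(glyph)   # binarized glyph centered on the canvas
```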

A fixed-point MLP classifier was developed using simple weight scaling, exploiting the linearity of MLP layers, and was later extended to EMNIST.

Observations:

  • Accuracy >75%
  • Noticeable drop due to weight scaling instead of structured quantization
  • Generalization issues from real handwritten inputs (distribution mismatch)

Extensive automation was introduced:

  • TCL/Perl scripts for inference flow
  • Python-based dataset handling
  • Automated testbench execution

This body of work formed NeVer, completed around May 2025.


Phase 3 — INT8 CNN & Systolic Acceleration (CIFAR-10)

Around Aug 2025, the direction shifted to RGB image classification using CIFAR-10 and CNNs.

Training-level optimizations:

  • Architectural exploration (multiple CNN variants)
  • BatchNorm experiments
  • Dense layer restructuring
  • BatchNorm fusion

Quantization:

  • PTQ and QAT experiments
  • Final deployment using Q1.7 format
  • <1% accuracy degradation

RTL Implementation:

  • Layer-by-layer validation using Python golden models
  • Manual ROM generation via Python
  • Deterministic verification across layers

Compute Acceleration:

  • Pipelined systolic array-based matrix multiplication
  • Dedicated convolution and GEMM cores
  • 2-cycle handshake protocol

Arithmetic Selection Study:

  • Multiple adder and multiplier architectures
  • RTL-to-GDS flow using Sky130 (OpenLane)
  • Not intended as production tapeout; used to compare latency, area, and power trade-offs
  • First experience with open-source RTL2GDS toolchains

Phase 4 — Formalized Math Acceleration: CORDIC IP

Rotation experiments originally required sine approximation. An early sine core was implemented in June 2025.

By Nov 2025, the design was restructured:

  • Core–wrapper separation
  • Support for circular, linear, hyperbolic modes
  • Trig, exp, log, div, mul implementations via wrappers

Formal verification (first exposure to formal methods):

  • Verified 2-cycle handshake correctness
  • Deadlock-free guarantees
  • Bounded liveness
  • Mathematical properties within tolerances

Formal tools used:

  • SymbiYosys
  • Yices2

This marked the first structured formal verification effort in the project.


Phase 5 — FPGA Deployment & HLS Exploration (Bharat AI SoC Challenge, Jan 2026)

The CNN accelerator work was extended to ARM-based FPGA deployment on Zynq-7000 (xc7z020) as part of the Bharat AI SoC Challenge (ARM + C2S India).

The objective was to evaluate hardware CNN inference under realistic FPGA resource constraints while comparing manual RTL accelerators with HLS-generated implementations.


Deployment Architecture

The system used a PS–PL co-design architecture.

ARM Cortex-A9 (PS)
    │
AXI4-Lite (control)
    │
CNN Inference Accelerator (PL)
    │
AXI-Stream / AXI Master
    │
DDR Memory (feature maps + images)

Key integration components:

  • AXI4-Lite control interface
  • AXI master DMA for image reads
  • AXI-Stream pipelines for feature-map processing
  • FIFO decoupling between memory and compute units

The accelerator operated on 32×32×3 CIFAR-10 images and returned the predicted class.


Network Architecture

A compact Mini-ResNet CNN was implemented.

Input 32x32x3
  Block 1:
    conv3x3  (3 → 28)
    conv3x3  (28 → 28) + conv1x1 shortcut
    maxpool2x2

  Block 2:
    conv3x3  (28 → 56)
    conv3x3  (56 → 56) + conv1x1 shortcut
    maxpool2x2

Head:
  GlobalAvgPool → Dense(56 → 10)

Weights were stored as Q1.7 fixed-point integers, reducing memory footprint while maintaining classification accuracy.
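A back-of-envelope parameter count for this topology (a sketch; the 1x1 shortcut shapes are assumptions, so the total is approximate) lands in the same ballpark as the ~52 kB Q1.7 footprint:

```python
def conv_params(kh, kw, cin, cout):
    return kh * kw * cin * cout + cout   # weights + bias

# Assumed shortcut shapes: 1x1 convs matching each block's channel count
layers = [
    conv_params(3, 3, 3, 28),    # block 1, conv1
    conv_params(3, 3, 28, 28),   # block 1, conv2
    conv_params(1, 1, 28, 28),   # block 1, shortcut (assumed 28 -> 28)
    conv_params(3, 3, 28, 56),   # block 2, conv1
    conv_params(3, 3, 56, 56),   # block 2, conv2
    conv_params(1, 1, 56, 56),   # block 2, shortcut (assumed 56 -> 56)
    56 * 10 + 10,                # dense head after GlobalAvgPool
]
total = sum(layers)
print(total, "params ->", round(total / 1024, 1), "KiB at 1 byte/param (Q1.7)")
```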


Quantization Results

Model format     Accuracy
FP32 training    ~85%
Q1.7 inference   ~84%
Accuracy loss    <1%

Quantization reduced parameter storage to approximately 52 kB, enabling full weight storage within FPGA memory.


HLS Implementation

The CNN inference pipeline was implemented in Vitis HLS using a manually optimized C++ kernel.

Development stages:

  1. Python reference model
  2. Standalone C++ inference
  3. Vitis HLS kernel generation
  4. FPGA synthesis and resource analysis

The HLS kernel used:

  • ap_fixed<16,8> arithmetic
  • AXI master interface for image input
  • AXI4-Lite control interface

Function interface:

void cifar10_infer(ap_fixed<16,8> image_in[3072], ap_uint<4>* pred_out);

Resource Utilization (Zynq-7020, 100 MHz target)

Resource    Used     Available   Utilization
LUT         18,379   53,200      34%
FF          11,014   106,400     10%
BRAM_18K    243      280         86%
DSP         22       220         10%

Estimated Fmax:

136.99 MHz

The design successfully fit within the FPGA fabric while maintaining the target throughput.


Manual Optimization to Fit FPGA

The initial HLS synthesis did not fit the FPGA:

LUT usage: 70,353 (132%)

The main issue came from runtime index multiplications inside the maxpool2x2 blocks. HLS generated 64-bit multipliers and large mux trees, dramatically increasing LUT usage.

Problematic blocks:

grp_maxpool2x2_16_16_56_s
grp_maxpool2x2_32_32_28_s

Optimization Strategy

Index expressions were replaced with loop-carried address increments.

Instead of:

oh*2*W*C
ow*2*C

the design used:

base pointers + constant stride increments

This allowed HLS to generate adders instead of multipliers.
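The transformation is ordinary strength reduction; a Python sketch demonstrates the equivalence of the two address schemes (the variable names and W/C sizes are illustrative, not the kernel's actual identifiers):

```python
W, C = 16, 56   # feature-map width and channel count (illustrative sizes)

def addr_multiplied(oh, ow):
    # Original form: fresh multiplications per iteration -> HLS infers multipliers
    return oh * 2 * W * C + ow * 2 * C

def addrs_strength_reduced(OH, OW):
    # Rewritten form: base pointer + constant stride increments -> adders only
    addrs, row_base = [], 0
    for _ in range(OH):
        a = row_base
        for _ in range(OW):
            addrs.append(a)
            a += 2 * C             # constant stride: a single adder
        row_base += 2 * W * C      # constant stride: a single adder
    return addrs

OH, OW = 8, 8
ref = [addr_multiplied(oh, ow) for oh in range(OH) for ow in range(OW)]
print(addrs_strength_reduced(OH, OW) == ref)   # True: same addresses, no multiplies
```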


Result

Metric             Before fix   After fix
LUT usage          70,353       18,379
Utilization        132%         34%
Maxpool LUT cost   ~27k         ~2.8k

The optimized design successfully fit on the xc7z020 while maintaining the original CNN accuracy (~84%).


HLS vs Manual RTL Exploration

The project also compared multiple accelerator approaches:

Approach                     Purpose
Manual RTL CNN accelerator   Fine-grained microarchitecture control
HLS CNN kernel               Rapid hardware generation
PYNQ NumPy baseline          Software reference
FINN / hls4ml experiments    Alternative ML hardware toolflows

This comparison highlighted trade-offs between:

  • manual RTL optimization
  • HLS productivity
  • resource efficiency
  • design iteration speed

Key Observations

  • Fixed-point Q1.7 inference preserves ~84% CIFAR-10 accuracy
  • Manual restructuring of HLS code is often required to prevent unintended hardware generation
  • Memory indexing patterns strongly influence hardware resource usage
  • AXI-based streaming allows efficient PS–PL co-processing on Zynq

The deployment validated the feasibility of hardware CNN inference within the constraints of mid-range FPGA platforms.

Current Status / TL;DR

  • “Explored 8 CNN architectures on CIFAR-10; selected Pareto-optimal ResNet-style model (52k params, 12.6M FLOPs, 80-84% accuracy) for hardware deployment.”
  • “Applied PTQ and QAT quantization (Q1.7 fixed-point); 4x memory reduction with <1% accuracy loss across MODEL ARCH 4 (72k params) and MODEL ARCH 8 (52k params).”
  • “Full Verilog RTL CNN: parametric conv2d, max-pool, residual/shortcut add (1x1 conv), GAP, dense, softmax in Q7 fixed-point; FSM-controlled with two-cycle ready/valid handshake; validated end-to-end at 84% accuracy (100 CIFAR-10 images, FP32 and Q1.7).”
  • “Systolic array GEMM and Conv2D: 9-PE CSA-MBE MAC array with Booth multiplier and CSA tree reduction, 3-stage pipeline; benchmarked CSA/Kogge-Stone/RCA adders and MBE/Booth/Baugh-Wooley multipliers on Sky130 RTL2GDS (Yosys/OpenSTA) based on fmax, power and area; GDS and layouts generated.”
  • “Streaming Zynq-7000 architecture with AMBA AXI: AXI4-Lite for runtime weight/bias loading from DDR, AXI-Stream for pixel ingestion with FIFO decoupling, AXI-Lite for control; first convolution layer validated against Python golden model (±1 LSB rounding deviation).”
  • “SW baselines on PYNQ ARM Cortex-A9: NumPy inference (~21s/image FP32, ~30s/image Q1.7); keras2c evaluated across baseline, loop-pragma, and graph-fused variants (residual fusion, tensor materialization reduction) at O0-Ofast; best ~361ms/image.”
  • “hls4ml (Vivado HLS, ap_fixed<16,6>, ReuseFactor=32, Resource strategy) and Vitis HLS (ap_fixed<16,8>, AXI-MM + AXI-Lite) explored for HLS-based acceleration.”
  • “Brevitas QAT (4-bit weights/activations, 8-bit input) + FINN 17-stage dataflow compilation targeting Zybo Z7-10 (xc7z010clg400-1); custom board integration, AXI-Stream + AXI-Lite PS-PL; post-synthesis: 11,581 LUT, 14,557 FF, 17 BRAM36K, 3 DSP; bitstream generated.”
  • “Vitis HLS kernel on xc7z020: 18,379 LUT (34%), 243 BRAM_18K (86%), 22 DSP (10%), Fmax 136.99 MHz.”

Unified Theme

Across all branches—ImProVe, NeVer, MOVe, CNN acceleration, systolic microarchitecture, CORDIC IP, and FPGA deployment—the consistent focus has been:

  • Fixed-point arithmetic design
  • Pipelined microarchitectures
  • Streaming interfaces (AXI-Stream)
  • Hardware–software co-validation
  • Quantization vs scaling trade-offs
  • Automation of inference workflows
  • Exploration of arithmetic-level PPA trade-offs
  • Introduction to physical design and formal verification

The work collectively represents iterative exploration of hardware-aware ML acceleration, rather than a single linear product.