VISION: Verilog for Image Processing and Simulation-Based Inference of Neural Networks


SystemVerilog · Verilog · Microarchitecture · AXI-Stream · Pipelining · Image Processing · Basic ML
INT8 Hardware Accelerated CNN
CNN implemented in fixed-point RTL (<1% accuracy loss) with a systolic-array-based matrix multiplier
Bharat AI SoC Student Challenge by ARM & C2S India
HLS-based approaches and automated workflows, with overhead ("bloat") analysis of the above
ImProVe: IMage PROcessing using VErilog
Image-processing toolkit with AXI-Stream interfaces and OpenCL-based functional validation
NeVer: NEural NEtwork in VERilog
Fully connected NNs (MLPs) for MNIST and EMNIST handwritten-character classification in fixed-point RTL
3-Stage Pipelined Systolic Array-Based MAC Microarchitecture
Benchmarking adder–multiplier combinations for MAC units via an identical RTL2GDS flow
Design and Formal Verification of Fixed-Point CORDIC core
Designed a parameterizable core with math wrappers and formally verified key handshake and numerical properties
Repository: Mummanajagadeesh/ViSiON
Start Date: Dec 2024

Note: The following projects evolved in parallel under a single umbrella of compute architectures, hardware acceleration, and RTL-level optimization. While not all efforts directly converge to a single product, each branch was intentionally explored to understand performance–accuracy–area trade-offs across digital design, ML inference, verification, and physical implementation flows.


VISION: Verilog for Image Processing and Simulation-Based Inference of Neural Networks

Timeline Overview

Dec 2024 Initiated with hardware-based image rotation experiments (“ROVER”). Early focus: geometric transforms, color transforms, filtering, enhancement.

Feb 2025 – May 2025 Expansion into MNIST preprocessing pipeline in RTL. Development of fixed-point MLP classifier → evolved into NeVer.

June 2025 Initial sine approximation core; early experimentation with shift-add based math implementations.

Aug 2025 Transition to RGB datasets and CNNs (CIFAR-10). Began systolic-array based acceleration and structured microarchitecture exploration.

Nov 2025 Separated CORDIC core and wrappers. Introduced full multi-mode CORDIC and formal verification using SymbiYosys.

Dec 2025 AXI-Stream refinements and OpenCL-based functional validation for ImProVe toolkit.

Jan 2026 Extension of CNN accelerator to ARM-based FPGA deployment (Bharat AI SoC Challenge). Integrated AXI4-Lite, AXI-Stream, DMA workflows, and HLS explorations.


INT8 Fixed-Point CNN Hardware Accelerator and Image-Processing Suite

  • Designed a synthesizable shallow Res-CNN for CIFAR-10, Pareto-optimal among 8 CNNs for parameter memory, accuracy & FLOPs
  • Built systolic-array PEs with 8-bit CSA–MBE MACs, FSM-based control, 2-cycle ready/valid handshake, and verified TB operation
  • Performed PTQ/QAT (Q1.31→Q1.3) analysis; Q1.7 PTQ retained ∼84% accuracy (<1% loss) with a 4× smaller (∼52 kB) memory footprint
  • Auto-generated 14 coeff & 3 RGB ROMs via TCL/Py automation; validated TF/FP32–RTL consistency and automated inference execution
  • Implemented AXI-Stream DIP toolkit (edge, denoise, filter, enhance) with pipelined RTL & FIFO backpressure handling
  • MLP classifier on (E)MNIST (>75% accuracy) with GUI visualization; automated preprocessing & inference with TCL/Perl
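The Q1.7 format used for deployment can be sketched in a few lines of Python. This is a minimal illustration of symmetric fixed-point quantization (1 sign bit, 7 fractional bits), not the project's actual quantization scripts; the function names are hypothetical.

```python
def quantize_q1_7(values):
    """Quantize floats to Q1.7 fixed point (1 sign bit, 7 fractional bits).

    Scale by 2**7, round to nearest, and saturate to the signed 8-bit
    range [-128, 127], i.e. a representable range of about [-1.0, 0.992].
    """
    out = []
    for x in values:
        q = int(round(x * 128))
        out.append(max(-128, min(127, q)))
    return out

def dequantize_q1_7(qvalues):
    """Map Q1.7 integers back to floats for accuracy comparison."""
    return [q / 128.0 for q in qvalues]

weights = [0.5, -0.25, 0.9921875, 1.5, -2.0]
print(quantize_q1_7(weights))  # [64, -32, 127, 127, -128]
```

The saturation on the last two values shows why weights must be normalized into [-1, 1) before a Q1.x format is viable.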

3-Stage Pipelined Systolic Array-Based MAC Microarchitecture

  • Benchmarked six 8-bit signed adder–multiplier combinations via an identical RTL2GDS Sky130 flow to isolate arithmetic-level post-route PPA trade-offs
  • 3-stage pipelined systolic MAC (CSA-MBE), achieving ↓66.3% delay; ↑3.1× area efficiency; ↓82.2% typical power vs naïve conv3 baseline
  • Used a 2D PE-grid structure for convolution (verified 0/same padding modes) and optimized GEMM (reducing power by 44.6%; N = 3)
  • Added a 648-bit scan chain across all pipeline/control registers, enabling full DFT/ATPG testability with only +14.5% cell overhead
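The output-stationary dataflow behind such a PE grid can be illustrated with a cycle-level Python model. This is a simplified sketch only (the RTL's 3-stage pipelining, CSA–MBE arithmetic, and handshake are omitted), and `systolic_gemm` is a hypothetical name:

```python
def systolic_gemm(A, B):
    """Cycle-level model of an N x N output-stationary systolic array.

    A flows in from the left (row i skewed by i cycles), B flows in from
    the top (column j skewed by j cycles); each PE multiplies the operands
    passing through it and accumulates locally, so after 3N - 2 cycles
    PE[i][j] holds C[i][j] = sum_k A[i][k] * B[k][j].
    """
    n = len(A)
    acc = [[0] * n for _ in range(n)]      # per-PE accumulators
    a_reg = [[0] * n for _ in range(n)]    # operand registers in each PE
    b_reg = [[0] * n for _ in range(n)]
    for cycle in range(3 * n - 2):
        # shift operands one PE right/down (far edge first)
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # inject skewed operands at the array edges (zeros outside range)
        for i in range(n):
            k = cycle - i
            a_reg[i][0] = A[i][k] if 0 <= k < n else 0
        for j in range(n):
            k = cycle - j
            b_reg[0][j] = B[k][j] if 0 <= k < n else 0
        # every PE performs one MAC per cycle
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc

print(systolic_gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

The skewed injection guarantees that at any cycle a PE sees operands sharing the same index k, which is what makes the per-PE accumulation equal to the dot product.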

Design & Formal Verification of Parameterizable Fixed-Point CORDIC IP

  • Implemented shift-add datapath covering all 6 modes (rotation/vectoring × circular/linear/hyperbolic); swept data width, iteration count, angle fraction bits, and output width/shift scaling across configurations
  • Built trig/mag/atan2/mul/div/exp wrappers; observed ∼10⁻⁵ RMS error (@32-bit, 16 iterations) vs double-precision references
  • Proved handshake, deadlock-free bounded liveness, range safety, symmetry & monotonicity via SystemVerilog assertions (SymbiYosys/Yices2)
  • Auto-generated atan tables & param files via Python; FuseSoC-packaged core with documented sensitivity, error trends & failure regions
  • Built drop-in core variants (pipelined/SIMD/multi-issue); implemented a QAM16 demodulator using the CORDIC core
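The micro-rotation scheme behind the circular rotation mode can be sketched as a floating-point reference model. This is illustrative only, standing in for the fixed-point shift-add RTL; `cordic_sincos` is a hypothetical name.

```python
import math

def cordic_sincos(theta, iterations=16):
    """Circular rotation-mode CORDIC (floating-point reference model).

    Starts from (K, 0), where K is the inverse CORDIC gain, then applies
    `iterations` micro-rotations by +/- atan(2**-i). In RTL the 2**-i
    factors are arithmetic shifts; here plain multiplies stand in for them.
    Valid for theta within the CORDIC convergence range (~|theta| < 1.74).
    """
    # Pre-scale by the inverse gain so the result has unit magnitude.
    k = 1.0
    for i in range(iterations):
        k *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = k, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # rotate toward z = 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)
    return x, y                              # (cos(theta), sin(theta))

c, s = cordic_sincos(math.pi / 6)
print(c, s)  # approximately (0.8660, 0.5000)
```

With 16 iterations the residual angle is about atan(2⁻¹⁵), consistent with the ∼10⁻⁵ RMS error the wrappers report at 32-bit precision.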

Expanded Project Overview

Phase 1 — From ROVER to ImProVe (Image Processing Practice Platform)

The project began with hardware-based image rotation. After eliminating non-synthesizable constructs, the rotation, geometric-transform, color-transform, filtering, and enhancement algorithms were implemented fully in RTL.
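As a flavor of the kind of software golden model such rotation RTL is checked against, here is a minimal reverse-mapping nearest-neighbour rotation in Python. This is a sketch, not the project's validation code; `rotate_nearest` is a hypothetical name.

```python
import math

def rotate_nearest(img, theta):
    """Golden model for nearest-neighbour image rotation about the centre.

    Uses reverse mapping: each destination pixel is traced back through
    the inverse rotation, which avoids the holes a forward mapping would
    leave. `img` is a list of rows of grey values; out-of-range sources
    produce black (0) pixels.
    """
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # inverse-rotate the destination coordinate into source space
            sx = cos_t * (x - cx) + sin_t * (y - cy) + cx
            sy = -sin_t * (x - cx) + cos_t * (y - cy) + cy
            ix, iy = int(round(sx)), int(round(sy))
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = img[iy][ix]
    return out

print(rotate_nearest([[1, 2], [3, 4]], math.pi / 2))  # [[3, 1], [4, 2]]
```

In fixed-point RTL the sin/cos factors become precomputed constants (or CORDIC outputs), but the reverse-mapping structure is the same.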

This evolved into ImProVe (IMage PROcessing using VErilog) — initially developed as a practice platform to understand streaming datapaths and pipeline scheduling. The toolkit included:

  • Edge detection
  • Denoising
  • Filtering and enhancement
  • Geometric transforms
  • Thresholding and contrast adjustment

Applications explored (as practice implementations):

  • Label detection
  • Document scanner pipeline
  • Stereo depth estimation

AXI-Stream-compliant interfaces were implemented with FIFO-based backpressure control. Later, OpenCL was used for functional validation against software references.
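The FIFO backpressure behaviour can be modeled at a high level in Python. This is a toy illustration of ready/valid semantics (not the RTL, and `StreamFifo` is a hypothetical name): a full FIFO deasserts `tready`, forcing the producer to hold its data.

```python
from collections import deque

class StreamFifo:
    """Toy model of a FIFO stage with AXI-Stream-style backpressure."""

    def __init__(self, depth):
        self.depth = depth
        self.q = deque()

    @property
    def tready(self):
        # ready to accept a beat only while there is space
        return len(self.q) < self.depth

    def push(self, tdata):
        """Producer side: a transfer happens only when tvalid & tready."""
        if not self.tready:
            return False      # backpressure: producer must hold tdata
        self.q.append(tdata)
        return True

    def pop(self):
        """Consumer side: drain one beat; None models an empty FIFO."""
        return self.q.popleft() if self.q else None

fifo = StreamFifo(depth=2)
print(fifo.push(10), fifo.push(11), fifo.push(12))  # True True False
print(fifo.pop())                                   # 10
```

The key protocol rule captured here is that data is never dropped: a rejected beat stays with the producer until `tready` reasserts.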


Phase 2 — NeVer: Neural Network in Verilog

Around Feb 2025, preprocessing logic for MNIST was designed:

  • User input via Tkinter GUI
  • RTL-based thresholding
  • Contrast scaling
  • Character detection
  • Cropping
  • Resizing
  • Rotation correction

This created a full hardware preprocessing workflow.

A fixed-point MLP classifier was developed using weight scaling (exploiting the linearity of dense layers). It was later extended to EMNIST.
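The weight-scaling idea (as opposed to calibrated, structured quantization) can be sketched as follows. `int_dense` and the shift value are illustrative, not the project's actual code: because a dense layer is linear, scaling the weights up by 2^shift and shifting the accumulator back down approximates the float result.

```python
def int_dense(x_q, W_q, shift):
    """Integer dense layer: y = (W_q @ x_q) >> shift, as the RTL computes it.

    W_q holds float weights pre-scaled by 2**shift and rounded to integers;
    the arithmetic right shift undoes the scaling after accumulation.
    """
    out = []
    for row in W_q:
        acc = sum(w * x for w, x in zip(row, x_q))
        out.append(acc >> shift)
    return out

shift = 8
W = [[0.25, -0.5], [0.75, 0.125]]
W_q = [[int(round(w * (1 << shift))) for w in row] for row in W]
x = [3, 4]                         # integer pixel inputs
print(int_dense(x, W_q, shift))    # [-2, 2]  (float result: [-1.25, 2.75])
```

The truncation in the final shift (floor, not round-to-nearest) is one source of the accuracy drop noted below relative to a properly calibrated quantizer.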

Observations:

  • Accuracy >75%
  • Noticeable drop due to weight scaling instead of structured quantization
  • Generalization issues from real handwritten inputs (distribution mismatch)

Extensive automation was introduced:

  • TCL/Perl scripts for inference flow
  • Python-based dataset handling
  • Automated testbench execution

This body of work formed NeVer, completed around May 2025.


Phase 3 — INT8 CNN & Systolic Acceleration (CIFAR-10)

Around Aug 2025, the direction shifted to RGB image classification using CIFAR-10 and CNNs.

Training-level optimizations:

  • Architectural exploration (multiple CNN variants)
  • BatchNorm experiments
  • Dense layer restructuring
  • BatchNorm fusion

Quantization:

  • PTQ and QAT experiments
  • Final deployment using Q1.7 format
  • <1% accuracy degradation

RTL Implementation:

  • Layer-by-layer validation using Python golden models
  • Manual ROM generation via Python
  • Deterministic verification across layers
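ROM generation of this kind can be sketched as a small Python helper emitting `$readmemh`-style files; the filename, function name, and word width here are hypothetical:

```python
def write_mem_file(path, values, width_bits=8):
    """Emit a $readmemh-compatible .mem file from signed integer coefficients.

    Each value is converted to its two's-complement representation at the
    given bit width and written as one hex word per line, matching how
    coefficient ROMs are typically initialised in simulation.
    """
    digits = (width_bits + 3) // 4          # hex digits per word
    mask = (1 << width_bits) - 1            # two's-complement wrap mask
    with open(path, "w") as f:
        for v in values:
            f.write(f"{v & mask:0{digits}x}\n")

write_mem_file("conv1_coeff.mem", [64, -32, 127, -128])
print(open("conv1_coeff.mem").read())  # 40 / e0 / 7f / 80, one per line
```

Emitting one word per line keeps the file directly loadable by `$readmemh` into a `reg [7:0] rom [0:N-1]` array with no address directives.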

Compute Acceleration:

  • Pipelined systolic array-based matrix multiplication
  • Dedicated convolution and GEMM cores
  • 2-cycle handshake protocol

Arithmetic Selection Study:

  • Multiple adder and multiplier architectures
  • RTL-to-GDS flow using Sky130 (OpenLane)
  • Not intended as production tapeout; used to compare latency, area, and power trade-offs
  • First experience with open-source RTL2GDS toolchains

Phase 4 — Formalized Math Acceleration: CORDIC IP

Rotation experiments originally required sine approximation. An early sine core was implemented in June 2025.

By Nov 2025, the design was restructured:

  • Core–wrapper separation
  • Support for circular, linear, hyperbolic modes
  • Trig, exp, log, div, mul implementations via wrappers

Formal verification (first exposure to formal methods):

  • Verified 2-cycle handshake correctness
  • Deadlock-free guarantees
  • Bounded liveness
  • Mathematical properties within tolerances

Formal tools used:

  • SymbiYosys
  • Yices2

This marked the first structured formal verification effort in the project.


Phase 5 — ARM-Based FPGA Deployment (Bharat AI SoC Challenge, Jan 2026)

Extension of CIFAR CNN project to ARM-based FPGA (Zynq-7000 class).

Architecture:

  • Tiling-based convolution accelerator
  • MAC PEs + line buffers
  • AXI4-Lite (AXI-MM subset) for weights
  • AXI-Stream for image input
  • Planned feature-map streaming
  • DDR-DMA with FIFO decoupling

Software Baseline:

  • PYNQ-based NumPy implementation
  • keras2c
  • hls4ml (Vitis HLS) exploration

Quantization & Toolflow:

  • Brevitas 4-bit QAT
  • QONNX → FINN conversion
  • Achieved synthesis metrics (~11.5k LUT, 3 DSP, 22 BRAM @100 MHz)

Manual streaming-based loading replaced Python-generated ROM initialization for better deployment flexibility.

First convolution layer implemented on board as proof-of-concept.


Unified Theme

Across all branches—ImProVe, NeVer, CNN acceleration, systolic microarchitecture, CORDIC IP, and FPGA deployment—the consistent focus has been:

  • Fixed-point arithmetic design
  • Pipelined microarchitectures
  • Streaming interfaces (AXI-Stream)
  • Hardware–software co-validation
  • Quantization vs scaling trade-offs
  • Automation of inference workflows
  • Exploration of arithmetic-level PPA trade-offs
  • Introduction to physical design and formal verification

The work collectively represents iterative exploration of hardware-aware ML acceleration, rather than a single linear product.