Verilog for Image Processing and Simulation-based Inference Of Neural Networks

IMage PROcessing using VErilog

ImProVe (IMage PROcessing using VErilog) is a project focused on implementing image processing techniques using Verilog. It involves building image processing logic from the ground up, exploring various algorithms and approaches within HDL

Status: Ongoing

Type: Individual

Tags: FPGA, Verilog, HDL, Digital, Image Processing

View Project

NEural NEtwork in VERilog; (E)MNIST-MLP

Never (NEural NEtwork in VERilog) implements a neural network in Verilog for better hardware acceleration of image processing tasks

Status: Ongoing

Type: Individual

Tags: Verilog, HDL, Digital, Image Processing, ML

View Project

NEural NEtwork in VERilog; CIFAR10-CNN

Developed a lightweight CNN [Conv2D×2 + MaxPool]×3 → GAP → Dense(10) for CIFAR-10 Image Classification (32x32RGB) using both IEEE 754 floating-point and Q1.31, Q1.15, Q1.7, and Q1.3 fixed-point arithmetic, achieving 84% accuracy in both implementations (Py ~85% | FP ~84% | Q31 ~84% | Q15 ~84% | Q7 ~82% | Q3 ~65%)

Status: Ongoing

Type: Individual

Tags: Verilog, HDL, Digital, ML

View Project

Matrix Multiplier with Systolic Array

Systolic array of MAC PEs with Booth multipliers and carry-save adders, supporting both GEMM and 3×3 CNN convolutions for hardware-accelerated deep learning and linear algebra

Status: Started

Type: Individual

Tags: Verilog, HDL, Digital, Systolic Array, Linear Algebra

View Project

Fixed-Point Conversion Report for EMNIST Neural Network

This report outlines the transition of the EMNIST neural network from floating-point (IEEE754) to fixed-point arithmetic to reduce hardware complexity and improve synthesis compatibility, while aiming to preserve model accuracy

Status: N/A

Type: Individual

Tags: Verilog, HDL, Digital

View Project

High-Performance Quantized CNN Hardware Accelerator for Lightweight Inference (Verilog) | Link

Design of High-Performance Q1.7 Fixed-Point Quantized CNN Hardware Accelerator with Microarchitecture Optimization of 3-Stage Pipelined Systolic MAC Arrays for Lightweight Inference

" I tried to ImProVe, but NeVer really did — so I MOVe-d on ¯\_(ツ)_/¯ “

Current Project Overview

Duration: Individual, Ongoing
Tools: Verilog (Icarus Verilog, Yosys) | Python (TensorFlow, NumPy) | Scripting (TCL, Perl)

8-bit Quantized CNN Hardware Accelerator: Open-source, Modular, & Optimized for Inference

Verilog | Basic Architecture | Digital Electronics

Designed a shallow residual-style CNN for CIFAR-10, achieving ~84% accuracy (< 1% loss) with a 52 KB model size (only ~17× 3 KB input). Applied post-training quantization variants including Q1.7 (8-bit signed), optimizing accuracy, model size, and inference efficiency.
Implemented synthesizable Verilog modules (Testbench Verified) with FSM-based control, 2-cycle handshake, and auto-generated ROMs (14: weights/biases & 3 (RGB): input). Intermediate values stored in registers and computed using systolic array-based MAC units.
Explored key image-processing techniques including edge detection, noise reduction, filtering, and enhancement. Implemented (E)MNIST classification using MLP, achieving >75% accuracy. Automated inference flow with TCL/Python scripts and manual GUI inputs.

Technical Summary

Designed a fully synthesizable INT8 CNN accelerator (Q1.7 PTQ) for CIFAR-10, optimized for throughput, latency determinism, and precision efficiency. Implemented a 2-cycle ready/valid handshake for all inter-module transactions and FSM-based control sequencing for deterministic pipeline timing. Trained 8 CNNs (TensorFlow, identical augmentation & LR scheduling w/ vanilla Adam optimizer) and performed architecture-level DSE via Pareto analysis, selecting 2 optimal variants including a ResNet-style residual CNN.
PTQ/QAT comparisons were conducted across Q1.31, Q1.15, Q1.7, and Q1.3; Q1.7 PTQ (1-int, 7-frac | 0.0078 step) gave the best accuracy–memory trade-off { Q1.31 ~84% ~210kB | Q1.7 ~83% ~52kB | Q1.3 ~78% ~26kB }, achieving ~84% top-1 accuracy, <1% loss, and ≈52 KB total (≈17×3 KB RGB inputs)
The 3-stage pipelined systolic-array convolution core employs Processing Elements (PEs) built around MAC units composed of 8-bit signed Carry-Save Adders (CSA) and Modified Booth-Encoded (MBE) multipliers, arranged in a 2D grid for high spatial reuse and single-cycle accumulation. All 14 coefficient ROMs and 3 RGB input ROMs were auto-generated via a Python/TCL automation flow handling coefficient quantization, packing, and open-source EDA simulation. Verified bit-accurate correlation between TensorFlow FP32 and RTL fixed-point inference layer-wise; an IEEE-754 single-precision CNN variant validated numeric consistency.
Integrated image-processing modules (edge detection, denoising, filtering, contrast enhancement) form a Verilog-based hardware preprocessing pipeline, feeding an MLP classifier evaluated on the (E)MNIST (52+)10 ByClass datasets. The MLP shares the preprocessing and automation flow, with an additional IEEE-754 64-bit FP variant for precision benchmarking. A Tkinter GUI enables interactive character input, and preprocessing visualization via Matplotlib

High-Speed 3-Stage Pipelined Systolic Array-Based MAC Architectures

Digital Logic Design | Synthesis

Compared 6 × 8-bit adders/multipliers for systolic-array MACs using PPA metrics (latency / throughput / area, sky130 nm PDK) and analyzed trade-offs.
Final design uses Carry-Save Adder (CSA) and Modified Booth Encoder (MBE) multiplier for 3×3 convolution and GEMM operations with 3-stage pipelined systolic arrays, verified for 0 / same padding modes.
Pipeline Stages: sampling image → truncating & flipping → MAC accumulation.

Technical Summary

Benchmarked six 8-bit signed adder and multiplier architectures for systolic-array MACs targeting CNN/GEMM workloads using a fully open-source ASIC flow (Yosys + OpenROAD/OpenLane) on the Google-SkyWater 130nm PDK (Sky130HS PDK @25°C_1.8V). Evaluated PPA (Power, Performance, Area) and latency/throughput/area metrics under a constant synthesis and layout environment with fixed constraints and floorplan parameters (FP_CORE_UTIL = 30 %, PL_TARGET_DENSITY = 0.36, 10 ns clock, CTS/LVS/DRC/Antenna enabled)
Adders:
- CSA – 5.07 ns CP, 197 MHz Fmax, 2.52 k µm² core, 0.083 mW (best speed/resource trade-off)
- Kogge–Stone – 6.21 ns, 161 MHz, 3.69 k µm² (area-heavy)
- RCA – 7.14 ns, 140 MHz, 1.13 k µm², 0.032 mW (most power/area-efficient)
Multipliers:
- MBE – 8.84 ns, 113 MHz, 9.6 k µm², 0.379 mW (best energy/area efficiency = 3.35 × 10³ pJ/op, 8.64 × 10³ ops/s/µm²)
- Baugh–Wooley – 8.63 ns, 115.9 MHz (fastest)
- Booth (Radix-2) – 12.5 ns, 80 MHz (highest area/power)
Final MAC integrates an 8-bit signed CSA adder and 8-bit signed MBE multiplier in a 3×3 convolution/GEMM core using a 3-stage pipelined systolic array (sampling → truncation/flipping → MAC accumulation). Verified via RTL testbench and post-synthesis timing across zero/same-padding modes. Automated GDS/DEF generation and PPA reporting for all architectures ensured fully reproducible, environment-consistent results
Comparative study (Small Scale Ops) :
- CSA-MBE pair Systolic Array Conv vs Naïve Conv (3x3 Kernel on 5x5 Image @CLK_PERIOD_20_ns)
  - Latency ↓ 66.3% Throughput ↑ 196.6% Speed ↑ 196.6% Area ↓ 67.8% Power ↓ 82.2%
- Single MAC re-use vs Systolic 4-PE Grid (2x2 matrix multiplication)
  - Latency ↓ 0.9% Throughput ≈ same Speed ≈ same Area ≈ same Power ↓ 15% Energy/op ↓ 14.6%
- Single MAC re-use vs Systolic 9-PE Grid (3x3 matrix multiplication)
  - Latency ↓ 1% Throughput ≈ same Speed ≈ same Area ≈ same Power ↓ 44.6% Energy/op ↓ 44%

Repositories

ViSiON – Verilog for Image Processing and Simulation-based Inference Of Neural Networks

This repo includes all related projects as submodules in one place

NEural NEtwork on VERilog | MLP for (E)MNIST | CNN for CIFAR10

ImProVe – IMage PROcessing using VErilog: A collection of image processing algorithms implemented in Verilog, including geometric transformations, color space conversions, and other foundational operations.

NeVer – NEural NEtwork on VERilog: A hardware-implemented MLP in Verilog for character recognition on (E)MNIST, alongside a lightweight CNN for CIFAR-10 image classification

MOVe – Math Ops in VErilog

Verilog-based implementation of the CORDIC algorithm for efficient estimation of mathematical functions

Systolic array of MAC PEs with Booth multipliers and carry-save adders, supporting both GEMM and 3×3 CNN convolutions for hardware-accelerated deep learning and linear algebra

Implements and compares 8-bit multipliers and 16-bit adders in synthesizable Verilog, analyzing their area, timing, and power characteristics in MAC datapath architectures

Posit number system: An alternative to IEEE 754 for efficient arithmetic

CORDIC Algorithm – Implements Coordinate Rotation Digital Computer (CORDIC) algorithms in Verilog for efficient hardware-based calculation of sine, cosine, tangent, square root, magnitude, and more.
Systolic Array Matrix Multiplication – Verilog implementation of matrix multiplication using systolic arrays to enable parallel computation and hardware-level performance optimization. Each processing element leverages a Multiply-Accumulate (MAC) unit for core operations.
Hardware Multiply-Accumulate Unit – Implements and compares 8-bit multipliers and 8-bit adders in synthesizable Verilog, analyzing their area, timing, and power characteristics in MAC datapath architectures.
Posit Arithmetic (Python) – Currently using fixed-point arithmetic; considering Posit as an alternative to IEEE 754 for better precision and dynamic range. Still working through the trade-off.

Storage and Buffer Modules

RAM1KB – A 1KB (1024 x 8-bit) memory module in Verilog with write-once locking for even addresses. Includes a randomized testbench. Also forms the base for a ROM3KB variant to store 32×32 RGB CIFAR-10 image data.
FIFO Buffer – Not started. Planned as a synchronous FIFO with fixed depth, single clock domain, and standard full/empty flag logic.