
Verilog for Image Processing and Simulation-based Inference Of Neural Networks

NEural NEtwork in VERilog; CIFAR10-CNN

Developed a lightweight CNN [Conv2D×2 + MaxPool]×3 → GAP → Dense(10) for CIFAR-10 image classification (32×32 RGB) using both IEEE 754 floating-point and Q1.31, Q1.15, Q1.7, and Q1.3 fixed-point arithmetic, achieving ~84% accuracy in both the floating-point and Q1.31 fixed-point implementations (Py ~85% | FP ~84% | Q31 ~84% | Q15 ~84% | Q7 ~82% | Q3 ~65%)

IMage PROcessing using VErilog

ImProVe (IMage PROcessing using VErilog) is a project focused on implementing image processing techniques in Verilog. It involves building image processing logic from the ground up, exploring various algorithms and design approaches in HDL

NEural NEtwork in VERilog; (E)MNIST-MLP

NeVer (NEural NEtwork in VERilog) implements a neural network in Verilog for hardware-accelerated image-processing tasks

Matrix Multiplier with Systolic Array

Systolic array of MAC PEs with Booth multipliers and carry-save adders, supporting both GEMM and 3×3 CNN convolutions for hardware-accelerated deep learning and linear algebra

Fixed-Point Conversion Report for EMNIST Neural Network

This report outlines the transition of the EMNIST neural network from floating-point (IEEE 754) to fixed-point arithmetic to reduce hardware complexity and improve synthesis compatibility, while aiming to preserve model accuracy



High-Performance Quantized CNN Hardware Accelerator for Lightweight Inference (Verilog)


Design of High-Performance Q1.7 Fixed-Point Quantized CNN Hardware Accelerator with Microarchitecture Optimization of 3-Stage Pipelined Systolic MAC Arrays for Lightweight Inference

" I tried to ImProVe, but NeVer really did — so I MOVe-d on ¯\_(ツ)_/¯ “

Current Project Overview

Duration: Ongoing (individual project)
Tools: Verilog (Icarus Verilog, Yosys) | Python (TensorFlow, NumPy) | Scripting (TCL, Perl)

8-bit Quantized CNN Hardware Accelerator: Open-source, Modular, & Optimized for Inference

Verilog | Basic Architecture | Digital Electronics

  • Designed a shallow residual-style CNN for CIFAR-10, achieving ~84% accuracy (<1% accuracy loss) with a 52 KB model size (only ~17× the 3 KB input image). Applied post-training quantization variants including Q1.7 (8-bit signed), optimizing accuracy, model size, and inference efficiency.
  • Implemented synthesizable Verilog modules (testbench-verified) with FSM-based control, a 2-cycle handshake (see the sketch after this list), and auto-generated ROMs (14 for weights/biases, 3 for the RGB input channels). Intermediate values are stored in registers and computed using systolic-array-based MAC units.
  • Explored key image-processing techniques including edge detection, noise reduction, filtering, and enhancement. Implemented (E)MNIST classification using an MLP, achieving >75% accuracy. Automated the inference flow with TCL/Python scripts and manual GUI inputs.
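
A minimal sketch of how such a 2-cycle ready/valid handshake can be written: the producer presents data and raises valid in the first cycle, and the consumer's ready retires the transfer in the second. Module and signal names here are illustrative assumptions, not taken from the project RTL.

    // 2-cycle ready/valid handshake, FSM-based (illustrative names)
    module handshake_tx (
        input  wire       clk, rst,
        input  wire       ready,      // consumer asserts when it can accept data
        input  wire       data_avail, // upstream indicates a new sample
        input  wire [7:0] data_in,
        output reg        valid,      // high while a transfer is pending
        output reg  [7:0] data_out
    );
        localparam IDLE = 1'b0, XFER = 1'b1;
        reg state;

        always @(posedge clk) begin
            if (rst) begin
                state <= IDLE;
                valid <= 1'b0;
            end else case (state)
                IDLE: if (data_avail) begin   // cycle 1: present data, raise valid
                    data_out <= data_in;
                    valid    <= 1'b1;
                    state    <= XFER;
                end
                XFER: if (ready) begin        // cycle 2: consumer accepts, drop valid
                    valid <= 1'b0;
                    state <= IDLE;
                end
            endcase
        end
    endmodule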

Technical Summary
  • Designed a fully synthesizable INT8 CNN accelerator (Q1.7 PTQ) for CIFAR-10, optimized for throughput, latency determinism, and precision efficiency. Implemented a 2-cycle ready/valid handshake for all inter-module transactions and FSM-based control sequencing for deterministic pipeline timing. Trained 8 CNNs (TensorFlow, identical augmentation & LR scheduling with a vanilla Adam optimizer) and performed architecture-level design-space exploration (DSE) via Pareto analysis, selecting 2 optimal variants including a ResNet-style residual CNN.

  • PTQ/QAT comparisons were conducted across Q1.31, Q1.15, Q1.7, and Q1.3; Q1.7 PTQ (1 int bit, 7 frac bits | 0.0078 step) gave the best accuracy–memory trade-off { Q1.31 ~84% ~210 kB | Q1.7 ~83% ~52 kB | Q1.3 ~78% ~26 kB }, achieving ~84% top-1 accuracy, <1% loss, and ≈52 KB total (≈17× the 3 KB RGB input). A Q1.7 multiply sketch follows this list.

  • The 3-stage pipelined systolic-array convolution core employs Processing Elements (PEs) built around MAC units composed of 8-bit signed Carry-Save Adders (CSA) and Modified Booth-Encoded (MBE) multipliers, arranged in a 2D grid for high spatial reuse and single-cycle accumulation. All 14 coefficient ROMs and 3 RGB input ROMs were auto-generated via a Python/TCL automation flow handling coefficient quantization, packing, and open-source EDA simulation. Verified layer-wise, bit-accurate correlation between TensorFlow FP32 and RTL fixed-point inference; an IEEE-754 single-precision CNN variant validated numeric consistency.

  • Integrated image-processing modules (edge detection, denoising, filtering, contrast enhancement) form a Verilog-based hardware preprocessing pipeline, feeding an MLP classifier evaluated on the (E)MNIST ByClass dataset (52 letter + 10 digit classes). The MLP shares the preprocessing and automation flow, with an additional IEEE-754 64-bit FP variant for precision benchmarking. A Tkinter GUI enables interactive character input and preprocessing visualization via Matplotlib.
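
As referenced above, a hedged sketch of the core Q1.7 scaling step: multiplying two Q1.7 values yields a Q2.14 product, which is shifted back to Q1.7 and saturated rather than wrapped. Names are illustrative, not the project's actual module.

    // Q1.7 x Q1.7 multiply with truncation and saturation back to Q1.7
    module q17_mul (
        input  wire signed [7:0] a,  // Q1.7: sign bit + 7 fraction bits (step 2^-7)
        input  wire signed [7:0] b,
        output wire signed [7:0] p   // saturated Q1.7 product
    );
        wire signed [15:0] full    = a * b;       // full-precision Q2.14 product
        wire signed [8:0]  shifted = full[15:7];  // drop 7 fraction bits (>> 7)

        // Clamp to the representable Q1.7 range instead of wrapping
        assign p = (shifted >  9'sd127) ? 8'h7F :  // +0.9921875, max Q1.7
                   (shifted < -9'sd128) ? 8'h80 :  // -1.0, min Q1.7
                   shifted[7:0];
    endmodule

In an accumulator chain the products would typically be kept at wider precision and only re-quantized once per MAC chain; this sketch shows only the single-multiply case.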

High-Speed 3-Stage Pipelined Systolic Array-Based MAC Architectures

Digital Logic Design | Synthesis

  • Compared six 8-bit adders/multipliers for systolic-array MACs using PPA metrics (latency / throughput / area, SkyWater Sky130 (130 nm) PDK) and analyzed trade-offs.
  • Final design uses a Carry-Save Adder (CSA) and Modified Booth Encoder (MBE) multiplier for 3×3 convolution and GEMM operations with 3-stage pipelined systolic arrays, verified for zero/same padding modes.
  • Pipeline Stages: sampling image → truncating & flipping → MAC accumulation.
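
A skeleton of this 3-stage organization, with illustrative names and deliberately simplified stage contents (the kernel flip is assumed to be absorbed into ROM addressing):

    // 3-stage pipeline skeleton: sample -> truncate/align -> MAC accumulate
    module mac_pipe3 (
        input  wire               clk, rst,
        input  wire signed [7:0]  pixel_in,  // sampled image value
        input  wire signed [7:0]  coeff,     // kernel coefficient
        output reg  signed [19:0] acc        // running accumulation
    );
        reg signed [7:0] s1_pixel;           // stage 1 register
        reg signed [7:0] s2_pixel, s2_coeff; // stage 2 registers

        always @(posedge clk) begin
            if (rst) begin
                s1_pixel <= 0; s2_pixel <= 0; s2_coeff <= 0; acc <= 0;
            end else begin
                s1_pixel <= pixel_in;                  // stage 1: sample
                s2_pixel <= s1_pixel;                  // stage 2: truncate/align
                s2_coeff <= coeff;
                acc      <= acc + s2_pixel * s2_coeff; // stage 3: MAC
            end
        end
    endmodule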

Technical Summary
  • Benchmarked six 8-bit signed adder and multiplier architectures for systolic-array MACs targeting CNN/GEMM workloads using a fully open-source ASIC flow (Yosys + OpenROAD/OpenLane) on the Google-SkyWater 130 nm PDK (Sky130HS @ 25 °C, 1.8 V). Evaluated PPA (power, performance, area) metrics, latency, throughput, and area under a constant synthesis and layout environment with fixed constraints and floorplan parameters (FP_CORE_UTIL = 30%, PL_TARGET_DENSITY = 0.36, 10 ns clock, CTS/LVS/DRC/Antenna checks enabled).

  • Adders:

    • CSA – 5.07 ns CP, 197 MHz Fmax, 2.52 k µm² core, 0.083 mW (best speed/resource trade-off)
    • Kogge–Stone – 6.21 ns, 161 MHz, 3.69 k µm² (area-heavy)
    • RCA – 7.14 ns, 140 MHz, 1.13 k µm², 0.032 mW (most power/area-efficient)
  • Multipliers:

    • MBE – 8.84 ns, 113 MHz, 9.6 k µm², 0.379 mW (best energy/area efficiency = 3.35 × 10³ pJ/op, 8.64 × 10³ ops/s/µm²)
    • Baugh–Wooley – 8.63 ns, 115.9 MHz (fastest)
    • Booth (Radix-2) – 12.5 ns, 80 MHz (highest area/power)
  • Final MAC integrates an 8-bit signed CSA adder and 8-bit signed MBE multiplier in a 3×3 convolution/GEMM core using a 3-stage pipelined systolic array (sampling → truncation/flipping → MAC accumulation). Verified via RTL testbench and post-synthesis timing across zero/same-padding modes. Automated GDS/DEF generation and PPA reporting for all architectures ensured fully reproducible, environment-consistent results
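
The CSA's advantage in the MAC datapath is that a 3:2 compressor sums three operands with no carry propagation at all, deferring a single conventional add to the end of the reduction. A minimal 8-bit sketch (illustrative names):

    // 8-bit carry-save adder (3:2 compressor): no carry chain
    module csa8 (
        input  wire [7:0] a, b, c,
        output wire [7:0] sum,   // bitwise sum, carries not propagated
        output wire [7:0] carry  // saved carries, weighted by 2
    );
        assign sum   = a ^ b ^ c;
        assign carry = (a & b) | (b & c) | (a & c);
    endmodule

    // Final resolution with one ordinary adder:
    //   wire [8:0] result = {1'b0, sum} + {carry, 1'b0};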

  • Comparative study (small-scale ops):

    • CSA-MBE pair systolic-array convolution vs naïve convolution (3×3 kernel on 5×5 image @ 20 ns clock period)

      • Latency ↓ 66.3% | Throughput ↑ 196.6% | Speed ↑ 196.6% | Area ↓ 67.8% | Power ↓ 82.2%
    • Single MAC re-use vs systolic 4-PE grid (2×2 matrix multiplication)

      • Latency ↓ 0.9% | Throughput ≈ same | Speed ≈ same | Area ≈ same | Power ↓ 15% | Energy/op ↓ 14.6%
    • Single MAC re-use vs systolic 9-PE grid (3×3 matrix multiplication)

      • Latency ↓ 1% | Throughput ≈ same | Speed ≈ same | Area ≈ same | Power ↓ 44.6% | Energy/op ↓ 44%
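
For reference, a hedged sketch of what the 4-PE output-stationary grid in the 2×2 comparison can look like: A-operands flow rightward, B-operands flow downward, each PE accumulates one C element, and the controller (not shown) is assumed to feed the row/column streams in skewed order. All names are illustrative.

    // One MAC processing element of the grid
    module pe (
        input  wire               clk, rst, en,
        input  wire signed [7:0]  a_in, b_in,
        output reg  signed [7:0]  a_out, b_out,
        output reg  signed [17:0] acc
    );
        always @(posedge clk) begin
            if (rst) begin
                a_out <= 0; b_out <= 0; acc <= 0;
            end else if (en) begin
                acc   <= acc + a_in * b_in; // accumulate local output element
                a_out <= a_in;              // forward A to the right neighbour
                b_out <= b_in;              // forward B to the neighbour below
            end
        end
    endmodule

    // 2x2 grid: each PE's acc holds one element of C = A * B
    module systolic2x2 (
        input  wire               clk, rst, en,
        input  wire signed [7:0]  a_row0, a_row1, b_col0, b_col1,
        output wire signed [17:0] c00, c01, c10, c11
    );
        wire signed [7:0] a01, a11, b10, b11;         // inter-PE links
        wire signed [7:0] a_nc0, a_nc1, b_nc0, b_nc1; // unused grid edges

        pe pe00 (.clk(clk), .rst(rst), .en(en), .a_in(a_row0), .b_in(b_col0),
                 .a_out(a01),   .b_out(b10),   .acc(c00));
        pe pe01 (.clk(clk), .rst(rst), .en(en), .a_in(a01),    .b_in(b_col1),
                 .a_out(a_nc0), .b_out(b11),   .acc(c01));
        pe pe10 (.clk(clk), .rst(rst), .en(en), .a_in(a_row1), .b_in(b10),
                 .a_out(a11),   .b_out(b_nc0), .acc(c10));
        pe pe11 (.clk(clk), .rst(rst), .en(en), .a_in(a11),    .b_in(b11),
                 .a_out(a_nc1), .b_out(b_nc1), .acc(c11));
    endmodule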



Repositories

ViSiON (Verilog for Image Processing and Simulation-based Inference Of Neural Networks)

This repo includes all related projects as submodules in one place





ImProVe (IMage PROcessing using VErilog): A collection of image processing algorithms implemented in Verilog, including geometric transformations, color space conversions, and other foundational operations.

NeVer (NEural NEtwork in VERilog): A hardware-implemented MLP in Verilog for character recognition on (E)MNIST, alongside a lightweight CNN for CIFAR-10 image classification.


MOVe (Math Ops in VErilog)



  • CORDIC Algorithm – Implements Coordinate Rotation Digital Computer (CORDIC) algorithms in Verilog for efficient hardware-based calculation of sine, cosine, tangent, square root, magnitude, and more (a minimal iteration sketch follows this list).

  • Systolic Array Matrix Multiplication – Verilog implementation of matrix multiplication using systolic arrays to enable parallel computation and hardware-level performance optimization. Each processing element leverages a Multiply-Accumulate (MAC) unit for core operations.

  • Hardware Multiply-Accumulate Unit – Implements and compares 8-bit multipliers and 8-bit adders in synthesizable Verilog, analyzing their area, timing, and power characteristics in MAC datapath architectures.

  • Posit Arithmetic (Python) – Currently using fixed-point arithmetic; considering Posit as an alternative to IEEE 754 for better precision and dynamic range. Still working through the trade-off.
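
As noted in the CORDIC bullet above, the core of such an implementation is a chain of shift-add rotation stages. A minimal sketch of one rotation-mode stage, with illustrative widths and parameterization (not the repository's actual code):

    // One rotation-mode CORDIC stage:
    //   x' = x -/+ (y >> I), y' = y +/- (x >> I), z' = z -/+ atan(2^-I)
    module cordic_stage #(
        parameter I = 0,                       // stage index = shift amount
        parameter signed [15:0] ATAN = 16'sd0  // atan(2^-I) in fixed point
    )(
        input  wire               clk,
        input  wire signed [15:0] x_in, y_in, z_in,
        output reg  signed [15:0] x_out, y_out, z_out
    );
        wire d = z_in[15];  // rotation direction = sign of residual angle
        always @(posedge clk) begin
            x_out <= d ? x_in + (y_in >>> I) : x_in - (y_in >>> I);
            y_out <= d ? y_in - (x_in >>> I) : y_in + (x_in >>> I);
            z_out <= d ? z_in + ATAN         : z_in - ATAN;
        end
    endmodule

Cascading N such stages drives z toward zero; seeding x with 1/K (K ≈ 1.6468, the accumulated CORDIC gain) and y with 0 leaves cos(z0) in x and sin(z0) in y.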


Storage and Buffer Modules

  • RAM1KB – A 1 KB (1024 × 8-bit) memory module in Verilog with write-once locking for even addresses. Includes a randomized testbench. Also forms the base for a ROM3KB variant to store 32×32 RGB CIFAR-10 image data.
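
A hedged, simulation-oriented sketch of the RAM1KB idea, assuming "write-once locking" means an even address ignores all writes after its first (port names are illustrative):

    // 1024 x 8-bit RAM; even addresses lock after their first write
    module ram1kb (
        input  wire       clk,
        input  wire       we,
        input  wire [9:0] addr,
        input  wire [7:0] din,
        output reg  [7:0] dout
    );
        reg [7:0] mem    [0:1023];
        reg       locked [0:1023];  // write-once state per address

        integer i;
        initial for (i = 0; i < 1024; i = i + 1) locked[i] = 1'b0;

        always @(posedge clk) begin
            if (we && !(addr[0] == 1'b0 && locked[addr])) begin
                mem[addr] <= din;
                if (addr[0] == 1'b0) locked[addr] <= 1'b1; // lock even address
            end
            dout <= mem[addr];      // registered synchronous read
        end
    endmodule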

  • FIFO Buffer – Not started. Planned as a synchronous FIFO with fixed depth, single clock domain, and standard full/empty flag logic.
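
Since this module is not started, the following is only a sketch of what the planned FIFO might look like under the stated constraints: fixed depth, a single clock domain, and count-based full/empty flags. All names and parameters are assumptions.

    // Synchronous FIFO, single clock, count-based full/empty flags
    module sync_fifo #(
        parameter WIDTH = 8,
        parameter DEPTH = 16,
        parameter AW    = 4          // address width = log2(DEPTH)
    )(
        input  wire             clk, rst,
        input  wire             wr_en, rd_en,
        input  wire [WIDTH-1:0] din,
        output reg  [WIDTH-1:0] dout,
        output wire             full, empty
    );
        reg [WIDTH-1:0] mem [0:DEPTH-1];
        reg [AW-1:0]    wptr, rptr;
        reg [AW:0]      count;       // fill level: 0..DEPTH

        assign full  = (count == DEPTH);
        assign empty = (count == 0);

        always @(posedge clk) begin
            if (rst) begin
                wptr <= 0; rptr <= 0; count <= 0;
            end else begin
                if (wr_en && !full) begin
                    mem[wptr] <= din;    // enqueue
                    wptr      <= wptr + 1'b1;
                end
                if (rd_en && !empty) begin
                    dout <= mem[rptr];   // dequeue
                    rptr <= rptr + 1'b1;
                end
                case ({wr_en && !full, rd_en && !empty})
                    2'b10:   count <= count + 1'b1; // write only
                    2'b01:   count <= count - 1'b1; // read only
                    default: count <= count;        // both or neither
                endcase
            end
        end
    endmodule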