ImProVe: Image Processing using Verilog

Verilog, SystemVerilog, Image Processing, Computer Vision, Python, OpenCV

Image Processing: Selected Results

A few results are included at the end of this page.

Image Processing: Streaming Version

ISP with multiple modes, built on top of the existing workflow.

Label Detection

Edge-based label localization using Prewitt operator and contour extraction.

Document Scanner

Edge detection, corner detection, and perspective correction for document extraction.

Stereo Vision

Disparity and depth map computation implemented in Verilog.

MNIST Digit Recognition

Fully connected neural network implemented in Verilog using fixed-point arithmetic.

OCR using EMNIST

Multi-layer neural network for 62-class alphanumeric recognition in Verilog.

Overview

ImProVe (IMage PROcessing using VErilog) is an individual project focused on implementing core image processing algorithms directly in Verilog for hardware-oriented deployment. The primary objective is to accelerate image processing by exploiting the parallelism inherent in hardware architectures.

The project emphasizes the design of modular, reusable processing blocks suitable for FPGA/ASIC realization, while preserving a clear understanding of the mathematical foundations behind each algorithm.

Repository	Mummanajagadeesh/ImProVe
Start Date	27 Nov 2024

The work began with geometric rotation experiments and evolved into a structured framework covering edge detection, filtering, geometric transformations, stereo vision, and neural-network-related modules.

The hardware implementations use AXI-Stream interfaces for image input/output, while OpenCL-based implementations are used for functional verification and numerical comparison.

Motivation

The project originated from a practical need to rotate and scan handwritten notes during exam preparation. The initial question was whether image rotation could be implemented in Verilog.

It started as RoVer (Rotation using Verilog) and gradually expanded to include:

Edge detection
Noise reduction
Thresholding
Geometric transformations
Neural network inference
OCR

Each algorithm was implemented while learning its mathematical foundation.

Design Approach

Image data is converted into text-based pixel representations using Python.
Verilog modules operate on pixel arrays.
Results are written back to text files.
Python is used for visualization and validation.
Later versions replace file I/O with synthesizable memory blocks.
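
The text-file round trip described above can be sketched as follows. This is a minimal illustration, not the project's actual script: it assumes one hexadecimal pixel per line in row-major order (a layout that Verilog's `$readmemh` can load), and the helper names `pixels_to_hex_lines` and `hex_lines_to_pixels` are hypothetical.

```python
def pixels_to_hex_lines(pixels, bits=8):
    """Flatten a 2-D list of integer pixels into one hex word per line (row-major)."""
    digits = (bits + 3) // 4          # hex digits needed for the given bit width
    limit = 1 << bits
    lines = []
    for row in pixels:
        for p in row:
            if not 0 <= p < limit:
                raise ValueError(f"pixel {p} does not fit in {bits} bits")
            lines.append(f"{p:0{digits}x}")
    return lines


def hex_lines_to_pixels(lines, width):
    """Rebuild the 2-D pixel list from hex words, `width` pixels per row."""
    values = [int(line, 16) for line in lines if line.strip()]
    if len(values) % width:
        raise ValueError("pixel count is not a multiple of the row width")
    return [values[i:i + width] for i in range(0, len(values), width)]


# Tiny 2x3 grayscale image used only for illustration.
image = [[0, 127, 255], [16, 32, 64]]
lines = pixels_to_hex_lines(image)
print("\n".join(lines))
restored = hex_lines_to_pixels(lines, width=3)
```

Writing `lines` to a file gives the simulator its input; reading the simulator's output file back through `hex_lines_to_pixels` recovers an array that Python can display or compare against a reference.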

Key design goals:

Replace non-synthesizable constructs.
Use fixed-point arithmetic.
Eliminate $cos, $sin, $sqrt, $exp via CORDIC-based implementations.
Move simulation-only constructs to testbenches.
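
As an illustration of how CORDIC replaces `$sin` and `$cos`, the sketch below is a Python reference model of rotation-mode CORDIC that uses only shifts, adds, and a small arctangent table. The Q-format (16 fractional bits) and the iteration count are assumptions chosen for the example, not values taken from the project.

```python
import math

FRAC_BITS = 16                      # assumed fixed-point format: 16 fractional bits
ITERATIONS = 16                     # assumed iteration count
ONE = 1 << FRAC_BITS

# atan(2^-i) in fixed point; in hardware this would be a small ROM.
ATAN_TABLE = [round(math.atan(2.0 ** -i) * ONE) for i in range(ITERATIONS)]

# CORDIC gain compensation: product of 1 / sqrt(1 + 2^-2i).
K = 1.0
for i in range(ITERATIONS):
    K /= math.sqrt(1.0 + 2.0 ** (-2 * i))
K_FIXED = round(K * ONE)


def cordic_sin_cos(angle_rad):
    """Return (sin, cos) for an angle in [-pi/2, pi/2] using shift-and-add steps."""
    x, y = K_FIXED, 0               # start on the x axis, pre-scaled by the gain
    z = round(angle_rad * ONE)      # residual angle still to be rotated
    for i in range(ITERATIONS):
        if z >= 0:
            x, y, z = x - (y >> i), y + (x >> i), z - ATAN_TABLE[i]
        else:
            x, y, z = x + (y >> i), y - (x >> i), z + ATAN_TABLE[i]
    return y / ONE, x / ONE


s, c = cordic_sin_cos(math.pi / 6)
print(f"sin = {s:.4f}, cos = {c:.4f}")
```

Python's `>>` on negative integers is an arithmetic shift, which corresponds to `>>>` on signed values in Verilog. Angles outside the stated range would need quadrant folding before the loop.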

Implemented Functionalities

Edge Detection and Feature Extraction

Sobel Operator
Prewitt Operator
Roberts Cross Operator
Robinson Compass Operator
Kirsch Compass Operator
Laplacian Operator
Laplacian of Gaussian (LoG)
Canny Edge Detection
Emboss Filter
Moravec Corner Detection

Noise Reduction and Smoothing

Gaussian Blur
Median Filter
Box Filter
Bilateral Filter

Thresholding and Binarization

Global Thresholding
Adaptive Thresholding
Otsu’s Method
Color Thresholding

Geometric Transformations

Rotation
Scaling
Translation
Shearing
Cropping
Reflection
3D Homogeneous Perspective Transformation

Color and Intensity Transformations

Negative Transformation
Inversion
Sepia
Brightness Adjustment
Contrast Adjustment
Gamma Correction
Saturation Adjustment
Sharpness Enhancement

Applications

Label Detection

Process:

Split RGB channels using Python.
Convert to grayscale using NTSC luminance formula.
Apply Gaussian blur if required.
Apply Prewitt operator.
Perform flood-fill to detect largest contour.
Draw bounding box.
Superimpose on original image.

Implementation uses Verilog for processing and Python for visualization.
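
The grayscale and Prewitt steps above can be modeled in a few lines of Python. This is a sketch under stated assumptions: the luminance weights 0.299, 0.587, 0.114 are approximated by the integers 77, 150, 29 over 256, and the full Prewitt response is taken as |Gx| + |Gy| clamped to 8 bits. The project's RTL may use different coefficients or a different magnitude formula.

```python
def to_gray(r, g, b):
    """Integer approximation of 0.299 R + 0.587 G + 0.114 B (weights sum to 256)."""
    return (77 * r + 150 * g + 29 * b) >> 8


PREWITT_X = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]   # responds to vertical edges
PREWITT_Y = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]   # responds to horizontal edges


def prewitt(gray):
    """Return |Gx| + |Gy| clamped to 255; border pixels are left at 0."""
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    p = gray[y + dy - 1][x + dx - 1]
                    gx += PREWITT_X[dy][dx] * p
                    gy += PREWITT_Y[dy][dx] * p
            out[y][x] = min(255, abs(gx) + abs(gy))
    return out


# A 5x5 image with a vertical step edge between columns 1 and 2.
gray = [[to_gray(v, v, v) for v in (0, 0, 200, 200, 200)] for _ in range(5)]
edges = prewitt(gray)
for row in edges:
    print(row)
```

Using only `PREWITT_X` or only `PREWITT_Y` in the inner loop gives the single-direction responses; the sum of absolute values avoids a square root in hardware.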

Sample Results

Original Image	After Vertical Prewitt	After Horizontal Prewitt	After Full Prewitt

Original Image	After Full Prewitt	Binary Box	Overlaid Image with Box

Original Image	After Full Prewitt	Binary Box	Overlaid Image with Box

Document Scanner

Process:

Canny edge detection
Boundary fill
Boolean filtering
Moravec corner detection
Bresenham line drawing
Perspective mapping
Shearing and scaling refinement

Current issue: the Bresenham line-drawing implementation still needs refinement.
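
For reference, a common integer-only formulation of Bresenham's line algorithm that handles all octants is sketched below in Python. It is a textbook variant offered as a golden model for comparison, not a description of the project's Verilog implementation.

```python
def bresenham(x0, y0, x1, y1):
    """Return the list of integer points on the line from (x0, y0) to (x1, y1)."""
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1       # step direction in x
    sy = 1 if y0 < y1 else -1       # step direction in y
    err = dx + dy                   # combined error term, integers only
    points = []
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:                # error allows a step in x
            err += dy
            x0 += sx
        if e2 <= dx:                # error allows a step in y
            err += dx
            y0 += sy
    return points


print(bresenham(0, 0, 4, 2))
```

The loop body needs only additions, comparisons, and a doubling (a one-bit shift), so it maps directly onto a small state machine that emits one pixel per clock cycle.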

Stereo Vision

This module implements a complete stereo matching pipeline using calibrated stereo image pairs. The workflow includes grayscale conversion, disparity estimation, depth computation using calibration parameters, and 3D reconstruction support.

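The depth (Z) is computed from the disparity (d) using:

\[ Z = \frac{\text{baseline} \times f}{d + \text{doffs}} \]

where f is the focal length and baseline is the distance between the two camera centers, both obtained from the calibration file, and doffs is the disparity offset that accounts for the horizontal difference between the principal points of the two cameras.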
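
A direct Python transcription of the formula is shown below as a sanity check. The calibration numbers in the example are hypothetical and chosen only to make the arithmetic easy to follow; the function name and the choice to return `None` for invalid disparities are also assumptions.

```python
def depth_from_disparity(d, focal_px, baseline_mm, doffs=0.0):
    """Z = baseline * f / (d + doffs); returns None when the denominator is not positive."""
    denom = d + doffs
    if denom <= 0:
        return None                 # no valid match, depth undefined
    return baseline_mm * focal_px / denom


# Hypothetical calibration: f = 1000 px, baseline = 100 mm, doffs = 50 px.
z = depth_from_disparity(50, focal_px=1000.0, baseline_mm=100.0, doffs=50.0)
print(z)  # depth in the same unit as the baseline (mm here)
```

With the focal length in pixels and the baseline in millimetres, the result is in millimetres. Because depth is inversely proportional to disparity, a one-pixel disparity error matters far more for distant points than for near ones.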

Sample Results

Left Image	Right Image	Disparity / Depth Map

Overview

Stereo image rectification using calibration matrices
Disparity map computation along horizontal epipolar lines
Depth estimation from disparity and baseline
Intermediate result storage for verification
Python-based 3D reconstruction (point cloud / mesh generation)

Current limitation: depth accuracy and sub-pixel disparity estimation are still being improved.

MNIST Digit Recognition (Neural Network)

Dataset: MNIST (28×28, 784 inputs)

Architecture:

Input layer: 784 neurons
Hidden layer: 128 neurons (ReLU)
Output layer: 10 neurons

Training:

Implemented in Python using NumPy
500 iterations
Learning rate: 0.1
90% accuracy

Hardware Adaptation:

Weights scaled by 10,000
Stored as integers in text files
No softmax in hardware
Maximum activation used for classification
Fixed-point format (Q24.8 under development)
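
The integer inference path can be sketched in Python as below. The scale factor of 10,000 comes from the text; everything else (the tiny 2-2-2 network, its weights, and the choice to pre-scale each bias to match its accumulator) is an assumption made for illustration. Softmax can be dropped because it is monotonic: the largest raw activation is also the largest softmax probability.

```python
SCALE = 10_000  # weight scaling factor stated in the text


def quantize(values, scale):
    """Scale floating-point values and round them to integers."""
    return [round(v * scale) for v in values]


def dense(weights, bias, x):
    """Integer matrix-vector product plus bias; weights is indexed [out][in]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(v):
    return [max(0, a) for a in v]


def argmax(v):
    return max(range(len(v)), key=lambda i: v[i])


def predict(x, w1, b1, w2, b2):
    """x holds integer pixels; returns the index of the largest output activation."""
    hidden = relu(dense(w1, b1, x))
    return argmax(dense(w2, b2, hidden))


# Hypothetical toy network (2 inputs, 2 hidden units, 2 outputs).
w1q = [quantize(r, SCALE) for r in [[0.5, -0.25], [0.1, 0.3]]]
b1q = quantize([0.0, 0.1], SCALE)            # same scale as layer-1 accumulators
w2q = [quantize(r, SCALE) for r in [[1.0, -0.5], [-0.2, 0.8]]]
b2q = quantize([0.05, 0.0], SCALE * SCALE)   # layer-2 accumulators carry SCALE^2
x = [3, 1]
outputs = dense(w2q, b2q, relu(dense(w1q, b1q, x)))
print(outputs, predict(x, w1q, b1q, w2q, b2q))
```

Each layer multiplies the running scale by another factor of 10,000, so accumulator width grows quickly; this is one reason a binary fixed-point format with explicit rescaling between layers is attractive in hardware.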

Text files converted into synthesizable register modules using Python scripts.
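
One possible shape for such a generator script is sketched below. It emits a combinational lookup module from a list of integers; the module layout, port names, and the function name `make_rom_module` are hypothetical, and the project's real scripts may produce a different structure.

```python
def make_rom_module(name, values, width=32):
    """Return Verilog source for a combinational lookup holding signed integers."""
    limit = 1 << (width - 1)
    addr_bits = max(1, (len(values) - 1).bit_length())
    lines = [
        f"module {name} (",
        f"    input  wire [{addr_bits - 1}:0] addr,",
        f"    output reg  signed [{width - 1}:0] data",
        ");",
        "    always @(*) begin",
        "        case (addr)",
    ]
    for i, v in enumerate(values):
        if not -limit < v < limit:
            raise ValueError(f"value {v} does not fit in {width} signed bits")
        sign = "-" if v < 0 else ""
        lines.append(f"            {addr_bits}'d{i}: data = {sign}{width}'sd{abs(v)};")
    lines += [
        f"            default: data = {width}'sd0;",
        "        endcase",
        "    end",
        "endmodule",
    ]
    return "\n".join(lines)


src = make_rom_module("w_rom", [5000, -2500, 1000])
print(src)
```

Because the generated module contains no file operations, it can be synthesized as-is, while `$readmemh`-style loading stays confined to the testbench.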

Only testbench contains $display, $finish, and file operations.

OCR (EMNIST – 62 Classes)

Dataset: EMNIST ByClass
Classes: 0–9, A–Z, a–z

Architecture:

Input: 784
Hidden1: 256
Hidden2: 128
Output: 62

Training:

SGD and Adam optimizers
ReLU activations
Integer scaling for hardware compatibility

Inference in Verilog:

Matrix multiplications
ReLU activations
Maximum output selection

Character mapping to:

"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

Python automation scripts generate:

Image memory modules
Weight memory modules
Bias memory modules

All instantiated in a top-level module with a dedicated testbench.

A coarse-grained pipelined fully connected network, controlled by an FSM and using a Taylor-series-based softmax approximation, was also implemented for improved throughput.
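
To illustrate what a Taylor-series softmax can look like, the floating-point Python sketch below approximates the exponential with a truncated series. The number of terms and the range-reduction step (dividing the argument by a power of two, then squaring the result back) are assumptions added to keep the polynomial accurate and monotonic; the source does not state how the hardware version handles large arguments.

```python
import math


def exp_taylor(x, terms=5):
    """Truncated Taylor series 1 + x + x^2/2! + ... using `terms` terms."""
    total, term = 1.0, 1.0
    for n in range(1, terms):
        term *= x / n
        total += term
    return total


def softmax_taylor(logits, terms=5):
    """Softmax built on exp_taylor with range reduction by repeated squaring."""
    m = max(logits)
    shifted = [v - m for v in logits]          # all values are <= 0
    span = max(-v for v in shifted)
    k = 0
    while span / (1 << k) > 1.0:               # bring every argument into [-1, 0]
        k += 1
    exps = []
    for v in shifted:
        e = exp_taylor(v / (1 << k), terms)
        for _ in range(k):                     # exp(v) = exp(v / 2^k) ** (2^k)
            e *= e
        exps.append(e)
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]
approx = softmax_taylor(logits)
exact_exps = [math.exp(v - max(logits)) for v in logits]
exact = [e / sum(exact_exps) for e in exact_exps]
print([round(p, 4) for p in approx])
print([round(p, 4) for p in exact])
```

On the interval [-1, 0] the five-term polynomial is positive and increasing, so the ordering of the classes is preserved; the division by a power of two and the squarings are shifts and multiplications in hardware.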

Tools

Verilog
SystemVerilog
Icarus Verilog 12.0
Xilinx Vivado
Python 3.12
OpenCV

AXI Streaming Acceleration

This section restructures the grayscale hardware model into a fully AXI-Stream–compliant streaming architecture. While grayscale is used as the reference operation, the same infrastructure applies to any pixel-wise or neighborhood-based accelerator (e.g., filtering, edge detection, thresholding).

Transition to Streaming Architecture

The initial RTL grayscale design implemented direct combinational arithmetic with a simple pipeline register. Although functionally correct, it assumed continuous data availability and did not model realistic flow control.

The revised architecture introduces:

AXI-Stream input and output interfaces
Input and output FIFOs
Skid buffering for timing isolation
Valid/ready handshake compliance
Configurable parallel processing lanes (LANES)

This transforms the design into a throughput-aware, backpressure-safe streaming accelerator.

AXI-Stream Dataflow

Input AXI Stream → Input FIFO → Pipeline Register → Parallel Processing Lanes → Output FIFO → Output AXI Stream

Key characteristics:

Proper tvalid/tready handshake behavior
Elastic buffering for stall tolerance
Deterministic cycle-level throughput modeling
Linear scalability with lane count
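
The handshake and elastic-buffering behavior can be explored with a small cycle-level model in Python. This is a behavioral sketch only: the stall probabilities, FIFO depth, and function name are assumptions, and the model covers a single FIFO stage rather than the full pipeline.

```python
import random
from collections import deque


def simulate_stream(data, fifo_depth=4, cycles=1000, seed=0):
    """Cycle-level model: source -> FIFO -> sink with random stalls on both sides."""
    rng = random.Random(seed)
    fifo = deque()
    pending = deque(data)
    received = []
    for _ in range(cycles):
        # All signals are sampled from the state at the start of the cycle.
        src_valid = bool(pending) and rng.random() < 0.7   # upstream tvalid
        fifo_ready = len(fifo) < fifo_depth                 # FIFO input tready
        fifo_valid = bool(fifo)                             # FIFO output tvalid
        sink_ready = rng.random() < 0.5                     # downstream tready

        if fifo_valid and sink_ready:       # output transfer: tvalid && tready
            received.append(fifo.popleft())
        if src_valid and fifo_ready:        # input transfer: tvalid && tready
            fifo.append(pending.popleft())
        if not pending and not fifo:
            break
    return received


pixels = list(range(50))
out = simulate_stream(pixels, cycles=100_000)
print(out == pixels)
```

A transfer happens only in cycles where both valid and ready are high, so stalls on either side delay data but never drop or reorder it. The same check can be run against RTL simulation output as a basic backpressure test.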

Architectural Comparison

Static Combinational Design	AXI-Stream Architecture
Assumes ideal data flow	Handshake-driven flow control
No stall modeling	Backpressure-safe
Limited integration capability	SoC-ready streaming block
Minimal structural realism	Hardware-accurate architecture

Performance Characteristics

Latency: 1 pipeline cycle (core) + FIFO buffering
Throughput:
- 1 pixel/cycle for LANES = 1
- N pixels/cycle for LANES = N
Scales with clock frequency and lane parallelism
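
As a worked example of these figures, the Python sketch below estimates ideal throughput and frame time. The 100 MHz clock and 1920×1080 frame are hypothetical values, and the estimate ignores FIFO fill latency and any stalls caused by backpressure.

```python
def throughput_pixels_per_second(clock_hz, lanes):
    """Ideal throughput: one pixel per lane per clock cycle."""
    return clock_hz * lanes


def frame_cycles(width, height, lanes):
    """Clock cycles needed to stream one frame (ceiling division)."""
    return -(-(width * height) // lanes)


def frame_time_seconds(width, height, clock_hz, lanes):
    return frame_cycles(width, height, lanes) / clock_hz


# Hypothetical operating point: 1920x1080 frame, 100 MHz clock, LANES = 4.
print(throughput_pixels_per_second(100_000_000, 4))
print(frame_cycles(1920, 1080, 4))
print(frame_time_seconds(1920, 1080, 100_000_000, 4))
```

At this operating point a frame takes 518,400 cycles, or roughly 5.2 ms, before accounting for buffering and stalls.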

OpenCL vs RTL Output Comparison

Numerical comparison between floating-point OpenCL output and fixed-point RTL streaming output:

Metric	Value
MAE	8694.96 (0.132677)
RMSE	12930.8 (0.197311)
PSNR	14.097 dB

Minor deviations are expected due to fixed-point coefficient approximation in RTL versus floating-point arithmetic in OpenCL.
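
The three metrics can be computed as in the Python sketch below. The full-scale value of 65535 is an inference, not something the source states: the parenthesized figures in the table equal the raw figures divided by 65535, and the reported PSNR matches 20·log10(65535 / RMSE).

```python
import math


def error_metrics(reference, test, full_scale):
    """Return (MAE, RMSE, PSNR in dB) for two equal-length pixel sequences."""
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [a - b for a, b in zip(reference, test)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    psnr = math.inf if rmse == 0 else 20.0 * math.log10(full_scale / rmse)
    return mae, rmse, psnr


# Small illustrative vectors (not project data).
mae, rmse, psnr = error_metrics([100, 200, 300, 400], [110, 190, 300, 420], 65535)
print(mae, rmse, round(psnr, 2))

# Consistency check of the table: an RMSE of 12930.8 at an assumed
# full scale of 65535 should give the reported PSNR.
_, _, table_psnr = error_metrics([12930.8], [0.0], 65535)
print(round(table_psnr, 3))
```

Dividing MAE and RMSE by the full-scale value gives the normalized figures, which makes results comparable across bit depths.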

Verification and Status

AXI-Stream compliant accelerator
Functional equivalence validated against OpenCL reference
Parameterized multi-lane parallelism
FIFO-based elastic buffering for realistic simulation

This implementation uses AXI-Stream for image input/output handling and OpenCL for functional verification and numerical benchmarking.

Selected Image Processing Results

Below are some of the best results from my image processing work. While there are many more images, including all of them here without relevant explanations would not be meaningful. For a detailed breakdown of the implementation and the mathematical concepts behind each operation, refer to the repository.