ImProVe: Image Processing using Verilog
Overview
ImProVe (IMage PROcessing using VErilog) is an individual project focused on implementing core image processing algorithms directly in Verilog for hardware-oriented deployment. The primary objective is to accelerate image processing by exploiting the parallelism inherent in hardware architectures.
The project emphasizes the design of modular, reusable processing blocks suitable for FPGA/ASIC realization, while preserving a clear understanding of the mathematical foundations behind each algorithm.
| Repository | Mummanajagadeesh/ImProVe |
|---|---|
| Start Date | 27 Nov 2024 |
The work began with geometric rotation experiments and evolved into a structured framework covering edge detection, filtering, geometric transformations, stereo vision, and neural-network-related modules.
The hardware implementations use AXI-Stream interfaces for image input/output, while OpenCL-based implementations are used for functional verification and numerical comparison.
Motivation
The project originated from a practical need to rotate and scan handwritten notes during exam preparation. The initial question was whether image rotation could be implemented in Verilog.
It started as RoVer (Rotation using Verilog) and gradually expanded to include:
- Edge detection
- Noise reduction
- Thresholding
- Geometric transformations
- Neural network inference
- OCR
Each algorithm was implemented while learning its mathematical foundation.
Design Approach
- Image data is converted into text-based pixel representations using Python.
- Verilog modules operate on pixel arrays.
- Results are written back to text files.
- Python is used for visualization and validation.
- Later versions replace file I/O with synthesizable memory blocks.
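The text-based pixel exchange described above can be sketched as follows. The exact file layout used by the project is not stated here, so this sketch assumes one two-digit hex value per line, a layout that Verilog's `$readmemh` can load in a testbench. The function names are hypothetical.

```python
def image_to_hex_lines(image):
    """Flatten a 2-D list of 8-bit pixels into one two-digit hex string per pixel."""
    return [format(pixel, "02x") for row in image for pixel in row]


def hex_lines_to_image(lines, width):
    """Rebuild a 2-D pixel list from hex lines written back by the simulation."""
    pixels = [int(line, 16) for line in lines if line.strip()]
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


# Hypothetical 2x3 grayscale image, only to show the round trip.
image = [[0, 127, 255], [16, 32, 64]]
lines = image_to_hex_lines(image)
restored = hex_lines_to_image(lines, width=3)
```

Writing `"\n".join(lines)` to a file gives the input for the Verilog side; the same parser reads the results back for visualization.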
Key design goals:
- Replace non-synthesizable constructs.
- Use fixed-point arithmetic.
- Eliminate `$cos`, `$sin`, `$sqrt`, `$exp` via CORDIC-based implementations.
- Move simulation-only constructs to testbenches.
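To illustrate how CORDIC replaces `$sin` and `$cos` with shifts and adds, here is a minimal rotation-mode sketch in Python using integer arithmetic only inside the loop. The Q16 format, the iteration count, and all names are assumptions chosen for illustration, not the project's actual parameters; the lookup table would be a constant ROM in hardware.

```python
import math

FRAC_BITS = 16          # assumed fixed-point fraction width
ITERATIONS = 16         # assumed number of CORDIC stages
ONE = 1 << FRAC_BITS

# Precomputed arctan(2^-i) table in fixed point (a constant ROM in hardware).
ATAN_TABLE = [round(math.atan(2.0 ** -i) * ONE) for i in range(ITERATIONS)]

# Reciprocal of the CORDIC gain, so the result comes out with unit magnitude.
_gain = 1.0
for _i in range(ITERATIONS):
    _gain *= math.sqrt(1.0 + 2.0 ** (-2 * _i))
K_INV = round(ONE / _gain)


def cordic_sin_cos(angle_fixed):
    """Return (sin, cos) in fixed point for an angle in fixed-point radians.

    Converges for angles within roughly +/-1.74 rad; larger angles need
    quadrant reduction first.
    """
    x, y, z = K_INV, 0, angle_fixed
    for i in range(ITERATIONS):
        if z >= 0:
            x, y, z = x - (y >> i), y + (x >> i), z - ATAN_TABLE[i]
        else:
            x, y, z = x + (y >> i), y - (x >> i), z + ATAN_TABLE[i]
    return y, x


angle = round(math.pi / 6 * ONE)
sin_fixed, cos_fixed = cordic_sin_cos(angle)
```

Dividing the results by `ONE` converts them back to real values for comparison against `math.sin` and `math.cos`.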
Implemented Functionalities
Edge Detection and Feature Extraction
- Sobel Operator
- Prewitt Operator
- Roberts Cross Operator
- Robinson Compass Operator
- Kirsch Compass Operator
- Laplacian Operator
- Laplacian of Gaussian (LoG)
- Canny Edge Detection
- Emboss Filter
- Moravec Corner Detection
Noise Reduction and Smoothing
- Gaussian Blur
- Median Filter
- Box Filter
- Bilateral Filter
Thresholding and Binarization
- Global Thresholding
- Adaptive Thresholding
- Otsu’s Method
- Color Thresholding
Geometric Transformations
- Rotation
- Scaling
- Translation
- Shearing
- Cropping
- Reflection
- 3D Homogeneous Perspective Transformation
Color and Intensity Transformations
- Negative Transformation
- Inversion
- Sepia
- Brightness Adjustment
- Contrast Adjustment
- Gamma Correction
- Saturation Adjustment
- Sharpness Enhancement
Applications
Label Detection
Process:
- Split RGB channels using Python.
- Convert to grayscale using NTSC luminance formula.
- Apply Gaussian blur if required.
- Apply Prewitt operator.
- Perform flood-fill to detect largest contour.
- Draw bounding box.
- Superimpose on original image.
Implementation uses Verilog for processing and Python for visualization.
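The grayscale and Prewitt steps of this pipeline can be modeled in integer arithmetic as below. This is a sketch under assumptions: the luminance weights 0.299, 0.587, and 0.114 are rounded to 77, 150, and 29 over 256, and the gradient magnitude is approximated as |Gx| + |Gy| to avoid a square root. Neither choice is confirmed to match the project's Verilog.

```python
def to_gray(r, g, b):
    """NTSC luminance with 8-bit integer weights (77 + 150 + 29 = 256)."""
    return (77 * r + 150 * g + 29 * b) >> 8


PREWITT_X = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]    # responds to vertical edges
PREWITT_Y = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]    # responds to horizontal edges


def prewitt(gray):
    """Apply both Prewitt kernels; border pixels are left at 0."""
    height, width = len(gray), len(gray[0])
    out = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for j in range(3):
                for i in range(3):
                    pixel = gray[y + j - 1][x + i - 1]
                    gx += PREWITT_X[j][i] * pixel
                    gy += PREWITT_Y[j][i] * pixel
            # Assumed magnitude approximation, saturated to 8 bits.
            out[y][x] = min(255, abs(gx) + abs(gy))
    return out


edge_image = [[0, 0, 255, 255] for _ in range(4)]
flat_image = [[100] * 4 for _ in range(4)]
edges = prewitt(edge_image)
flat = prewitt(flat_image)
```

The uniform image produces zero response, while the step between columns 1 and 2 saturates the output.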
Sample Results
| Original Image | After Vertical Prewitt | After Horizontal Prewitt | After Full Prewitt |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
| Original Image | After Full Prewitt | Binary Box | Overlayed Image with Box |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
| Original Image | After Full Prewitt | Binary Box | Overlayed Image with Box |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
Document Scanner
Process:
- Canny edge detection
- Boundary fill
- Boolean filtering
- Moravec corner detection
- Bresenham line drawing
- Perspective mapping
- Shearing and scaling refinement
Current issue: the Bresenham line-drawing implementation still needs refinement.
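For reference, the standard integer-only form of Bresenham's line algorithm, which handles all octants with additions and comparisons, looks like this in Python. It is a textbook formulation offered as a behavioral model, not the project's Verilog implementation.

```python
def bresenham(x0, y0, x1, y1):
    """Return the list of integer points on the line from (x0, y0) to (x1, y1)."""
    points = []
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)
    step_x = 1 if x0 < x1 else -1
    step_y = 1 if y0 < y1 else -1
    error = dx + dy
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        doubled = 2 * error
        if doubled >= dy:      # step along x
            error += dy
            x0 += step_x
        if doubled <= dx:      # step along y
            error += dx
            y0 += step_y
    return points


diagonal = bresenham(0, 0, 3, 3)
shallow = bresenham(0, 0, 5, 2)
```

Because the loop uses only add, compare, and a doubling (a one-bit shift), it maps directly onto a small FSM in hardware.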
Stereo Vision
This module implements a complete stereo matching pipeline using calibrated stereo image pairs. The workflow includes grayscale conversion, disparity estimation, depth computation using calibration parameters, and 3D reconstruction support.
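The depth \(Z\) is computed from the disparity \(d\) using:
\[ Z = \frac{\text{baseline} \times f}{d + \text{doffs}} \]
where \(f\) is the focal length, baseline is the distance between the two cameras, and doffs is the horizontal offset between the cameras' principal points. All three are obtained from the calibration file.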
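A direct transcription of the depth formula, as a small sketch. The numeric values are hypothetical and chosen only so the arithmetic is easy to follow; real values come from the calibration file. Returning `None` for a non-positive denominator is an assumption about how invalid disparities might be handled.

```python
def depth_from_disparity(d, baseline, f, doffs):
    """Z = baseline * f / (d + doffs); None when the denominator is not positive."""
    denominator = d + doffs
    if denominator <= 0:
        return None
    return baseline * f / denominator


# Hypothetical calibration: baseline 200 mm, f = 4000 px, doffs = 100 px.
z = depth_from_disparity(d=300, baseline=200.0, f=4000.0, doffs=100.0)
```

With these numbers, Z = 800000 / 400 = 2000 in the same unit as the baseline.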
Sample Results
| Left Image | Right Image | Disparity / Depth Map |
|---|---|---|
| ![]() | ![]() | ![]() |
Overview
- Stereo image rectification using calibration matrices
- Disparity map computation along horizontal epipolar lines
- Depth estimation from disparity and baseline
- Intermediate result storage for verification
- Python-based 3D reconstruction (point cloud / mesh generation)
Current limitation: depth accuracy and sub-pixel disparity estimation are still being refined.
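Disparity search along a horizontal epipolar line can be sketched with block matching on a single rectified row. The matching cost used by the project is not stated, so this example assumes a sum of absolute differences (SAD) over a small window; the function name and parameters are hypothetical.

```python
def disparity_row(left, right, max_disp, half_win=1):
    """Per-pixel disparity for one rectified row pair using SAD block matching."""
    width = len(left)
    disparity = [0] * width
    for x in range(half_win, width - half_win):
        best_cost, best_d = None, 0
        # A point at x in the left image appears at x - d in the right image.
        for d in range(0, min(max_disp, x - half_win) + 1):
            cost = sum(abs(left[x + k] - right[x - d + k])
                       for k in range(-half_win, half_win + 1))
            if best_cost is None or cost < best_cost:
                best_cost, best_d = cost, d
        disparity[x] = best_d
    return disparity


# Hypothetical rows: the left row is the right row shifted by 2 pixels.
right_row = [0, 0, 10, 50, 90, 20, 0, 0, 0, 0]
left_row = [0, 0, 0, 0, 10, 50, 90, 20, 0, 0]
disp = disparity_row(left_row, right_row, max_disp=3)
```

SAD needs only subtractors, absolute values, and an adder tree, which is why it is a common choice for hardware stereo matching. Textureless regions give ambiguous minima, which this sketch resolves by keeping the smallest disparity.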
MNIST Digit Recognition (Neural Network)
Dataset: MNIST (28×28, 784 inputs)
Architecture:
- Input layer: 784 neurons
- Hidden layer: 128 neurons (ReLU)
- Output layer: 10 neurons
Training:
- Implemented in Python using NumPy
- 500 iterations
- Learning rate: 0.1
- 90% accuracy
Hardware Adaptation:
- Weights scaled by 10,000
- Stored as integers in text files
- No softmax in hardware
- Maximum activation used for classification
- Fixed-point format (Q24.8 under development)
Python scripts convert the text files into synthesizable register modules.
Only the testbench contains `$display`, `$finish`, and file operations.
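The integer inference path described above can be modeled in Python as follows. The scale factor of 10,000 comes from the text; the rescaling between layers, the bias scaling, and the tiny 2-2-2 network are assumptions made only to exercise the arithmetic. Softmax can be skipped for classification because it is monotonic: the largest pre-softmax output is also the most probable class.

```python
SCALE = 10_000  # weight scaling factor stated in the text


def quantize(matrix):
    """Scale floating-point weights to integers, as stored in the text files."""
    return [[round(w * SCALE) for w in row] for row in matrix]


def dense(inputs, weights, biases):
    """Integer fully connected layer; weights[j][i] connects input i to neuron j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0, v) for v in values]


def classify(pixels, w1, b1, w2, b2):
    """Return the index of the largest output; biases are assumed scaled by SCALE."""
    hidden = relu(dense(pixels, w1, b1))
    # Assumption: divide by SCALE between layers so values return to input scale.
    hidden = [h // SCALE for h in hidden]
    logits = dense(hidden, w2, b2)
    return max(range(len(logits)), key=logits.__getitem__)


# Hypothetical 2-2-2 network, not trained weights.
W1 = quantize([[0.5, -0.25], [-0.5, 0.75]])
W2 = quantize([[1.0, -1.0], [-1.0, 1.0]])
B1 = [0, 0]
B2 = [0, 0]

class_a = classify([10, 2], W1, B1, W2, B2)
class_b = classify([2, 10], W1, B1, W2, B2)
```

Python integers never overflow, so a model like this does not reveal the accumulator width needed in hardware; that has to be budgeted separately from the layer sizes and the scale factor.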
OCR (EMNIST – 62 Classes)
Dataset: EMNIST ByClass
Classes: 0–9, A–Z, a–z
Architecture:
- Input: 784
- Hidden1: 256
- Hidden2: 128
- Output: 62
Training:
- SGD and Adam optimizers
- ReLU activations
- Integer scaling for hardware compatibility
Inference in Verilog:
- Matrix multiplications
- ReLU activations
- Maximum output selection
- Character mapping to: "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
Python automation scripts generate:
- Image memory modules
- Weight memory modules
- Bias memory modules
All of these are instantiated in a top-level module with a dedicated testbench.
A coarse-grained pipelined fully connected network, built around an FSM and a Taylor-series-based softmax approximation, was also implemented for improved throughput.
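As an illustration of this kind of automation, the sketch below emits a Verilog ROM module from a list of integer weights. The module name, port names, case-statement style, and 32-bit width are all hypothetical; the project's generated modules may be structured differently.

```python
def make_rom_module(name, values, data_width=32):
    """Return Verilog source for a combinational ROM holding signed integers."""
    addr_width = max(1, (len(values) - 1).bit_length())
    lines = [
        f"module {name} (",
        f"    input  wire [{addr_width - 1}:0] addr,",
        f"    output reg  signed [{data_width - 1}:0] data",
        ");",
        "    always @(*) begin",
        "        case (addr)",
    ]
    for index, value in enumerate(values):
        sign = "-" if value < 0 else ""
        lines.append(
            f"            {addr_width}'d{index}: "
            f"data = {sign}{data_width}'sd{abs(value)};"
        )
    lines += [
        f"            default: data = {data_width}'sd0;",
        "        endcase",
        "    end",
        "endmodule",
    ]
    return "\n".join(lines)


# Hypothetical scaled weights, only to show the generated text.
rom_source = make_rom_module("weight_rom", [5000, -2500, 7500])
```

Generating the module as text keeps file operations out of the design itself, so only the testbench needs simulation-only constructs.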
Tools
- Verilog
- SystemVerilog
- Icarus Verilog 12.0
- Xilinx Vivado
- Python 3.12
- OpenCV
AXI Streaming Acceleration
This section restructures the grayscale hardware model into a fully AXI-Stream–compliant streaming architecture. While grayscale is used as the reference operation, the same infrastructure applies to any pixel-wise or neighborhood-based accelerator (e.g., filtering, edge detection, thresholding).
Transition to Streaming Architecture
The initial RTL grayscale design implemented direct combinational arithmetic with a simple pipeline register. Although functionally correct, it assumed continuous data availability and did not model realistic flow control.
The revised architecture introduces:
- AXI-Stream input and output interfaces
- Input and output FIFOs
- Skid buffering for timing isolation
- Valid/ready handshake compliance
- Configurable parallel processing lanes (`LANES`)
This transforms the design into a throughput-aware, backpressure-safe streaming accelerator.
AXI-Stream Dataflow
Input AXI Stream → Input FIFO → Pipeline Register → Parallel Processing Lanes → Output FIFO → Output AXI Stream
Key characteristics:
- Proper `tvalid`/`tready` handshake behavior
- Elastic buffering for stall tolerance
- Deterministic cycle-level throughput modeling
- Linear scalability with lane count
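The handshake and elastic-buffering behavior can be checked with a small cycle-level model before writing RTL. The sketch below models a single FIFO between a source and a sink; a transfer happens on an interface only in a cycle where both `tvalid` and `tready` are high. The function name, FIFO depth, and ready pattern are hypothetical, and the model omits the skid buffer and processing lanes.

```python
from collections import deque


def simulate_stream(data, sink_ready, fifo_depth=4):
    """Push data through a FIFO; return (received items, cycles elapsed)."""
    if not any(sink_ready):
        raise ValueError("sink never asserts tready; the stream would stall forever")
    fifo = deque()
    received = []
    source_index = 0
    cycle = 0
    while len(received) < len(data):
        # All signals are sampled from the state at the start of the cycle.
        in_tvalid = source_index < len(data)
        in_tready = len(fifo) < fifo_depth
        out_tvalid = len(fifo) > 0
        out_tready = sink_ready[cycle % len(sink_ready)]
        if out_tvalid and out_tready:      # output handshake
            received.append(fifo.popleft())
        if in_tvalid and in_tready:        # input handshake
            fifo.append(data[source_index])
            source_index += 1
        cycle += 1
    return received, cycle


pixels = list(range(10))
smooth, smooth_cycles = simulate_stream(pixels, sink_ready=[True])
stalled, stalled_cycles = simulate_stream(pixels, sink_ready=[True, False])
```

With an always-ready sink the model delivers one item per cycle after a one-cycle fill; with a sink that is ready every other cycle, the FIFO fills, `in_tready` drops, and the source is held off without losing or reordering data.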
Architectural Comparison
| Static Combinational Design | AXI-Stream Architecture |
|---|---|
| Assumes ideal data flow | Handshake-driven flow control |
| No stall modeling | Backpressure-safe |
| Limited integration capability | SoC-ready streaming block |
| Minimal structural realism | Hardware-accurate architecture |
Performance Characteristics
- Latency: 1 pipeline cycle (core) + FIFO buffering
- Throughput:
  - 1 pixel/cycle for `LANES = 1`
  - N pixels/cycle for `LANES = N`
- Scales with clock frequency and lane parallelism
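The throughput figures translate into frame time with simple arithmetic. The sketch below assumes the pipeline is never stalled, so it gives a lower bound; the resolution and clock frequency are hypothetical examples rather than measured results.

```python
def frame_cycles(width, height, lanes):
    """Cycles to stream one frame at `lanes` pixels per cycle (ceiling division)."""
    return -(-(width * height) // lanes)


def frame_time_seconds(width, height, lanes, clock_hz):
    """Ideal frame time, ignoring pipeline fill and backpressure stalls."""
    return frame_cycles(width, height, lanes) / clock_hz


# Hypothetical case: 1920x1080 frame, 100 MHz clock.
cycles_1_lane = frame_cycles(1920, 1080, lanes=1)
cycles_4_lanes = frame_cycles(1920, 1080, lanes=4)
time_4_lanes = frame_time_seconds(1920, 1080, lanes=4, clock_hz=100_000_000)
```

Under these assumptions, four lanes cut the frame from 2,073,600 cycles to 518,400 cycles, or about 5.2 ms at 100 MHz.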
OpenCL vs RTL Output Comparison
Numerical comparison between floating-point OpenCL output and fixed-point RTL streaming output:
| Metric | Value |
|---|---|
| MAE | 8694.96 (0.132677) |
| RMSE | 12930.8 (0.197311) |
| PSNR | 14.097 dB |
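Values in parentheses are the errors normalized to full scale; they are consistent with a 16-bit range (8694.96 / 65535 ≈ 0.1327). Some deviation is expected because the RTL uses fixed-point coefficient approximations while OpenCL uses floating-point arithmetic, although a PSNR of about 14 dB indicates a substantial mismatch rather than a minor one.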
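The three metrics in the table can be computed as shown below. This is a generic sketch using the usual definitions, with PSNR taken as 20·log10(peak / RMSE); the function name and the sample data are hypothetical. The last line checks that the normalized RMSE reported above, 0.197311, corresponds to the reported PSNR when the peak is 1.0.

```python
import math


def error_metrics(reference, test, peak):
    """Return (MAE, RMSE, PSNR in dB) between two equal-length pixel sequences."""
    count = len(reference)
    diffs = [a - b for a, b in zip(reference, test)]
    mae = sum(abs(d) for d in diffs) / count
    rmse = math.sqrt(sum(d * d for d in diffs) / count)
    psnr = math.inf if rmse == 0 else 20 * math.log10(peak / rmse)
    return mae, rmse, psnr


# Hypothetical 8-bit samples, only to exercise the function.
mae, rmse, psnr = error_metrics([0, 100, 200], [0, 110, 190], peak=255)

# PSNR implied by the normalized RMSE in the table (peak = 1.0).
table_psnr = 20 * math.log10(1 / 0.197311)
```

`table_psnr` evaluates to about 14.097 dB, matching the table, which confirms that the PSNR row was computed from the normalized RMSE.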
Verification and Status
- AXI-Stream compliant accelerator
- Functional equivalence validated against OpenCL reference
- Parameterized multi-lane parallelism
- FIFO-based elastic buffering for realistic simulation
This implementation uses AXI-Stream for image input/output handling and OpenCL for functional verification and numerical benchmarking.
Selected Image Processing Results
Below are some of the best results from my image processing work. While there are many more images, including all of them here without relevant explanations would not be meaningful. For a detailed breakdown of the implementation and the mathematical concepts behind each operation, refer to the repository.
Edge Detection – Prewitt Operator
Corner Detection – Moravec
Noise Reduction – Gaussian Blur
Thresholding – Otsu’s Method
Geometric Transformations
Rotation with Same Dimensions 
Rotation with Diagonal Dimensions 
Scaling 
Translation 
Shearing 
Cropping 
Reflection (Both Axes) 
3D Homogeneous Perspective Transformation 
Color and Intensity Transformations
Gamma Correction 
Image Inversion 
Sepia Effect 
Negative Transformation 
Grayscale Conversion 
Contrast Adjustment 
Brightness Adjustment 
Saturation Adjustment 
Sharpness Enhancement 
For more detail, see the repository, which explains the mathematical foundations behind each operation.
Important Links and Resources
Digital Image Processing
- GeeksforGeeks: Digital Image Processing Tutorial
- YouTube: Digital Image Processing Introduction
- YouTube Live: Advanced Digital Image Processing Concepts
Mathematics for Engineering and Computing
- YouTube: Building a neural network FROM SCRATCH
- YouTube: I Built a Neural Network from Scratch
- YouTube: Linear Algebra – Essence of Linear Algebra (Playlist)
Verilog
CORDIC Algorithm Resources
- IEEE Xplore: Hardware Implementation of a Math Module Based on the CORDIC Algorithm Using FPGA
- CORDIC Algorithm and Its Applications in DSP (NITR Thesis)
- CORDIC for Dummies (Introductory Guide)
- STMicroelectronics: Using the CORDIC for Mathematical Functions on STM32 MCUs
- Square Root Calculation Using CORDIC In System Verilog
Datasets
- MNIST Dataset: 0-9 Handwritten Numbers
- EMNIST Dataset: Extended MNIST with Alphabet Support
- Standard OCR Dataset: Various Images of Characters in Different Fonts
Contributors
Feel free to contribute by submitting pull requests or feature suggestions!
If interested in working together, do drop a DM or mail 🙂










