NeVer: NEural NEtwork in VERilog
Name | NeVer |
---|---|
Description | NeVer implements a neural network in Verilog for better hardware acceleration of image processing tasks |
Start | 28 Feb 2025 |
Repository | NeVer🔗 |
Type | Individual |
Level | Beginner |
Skills | Image Processing, HDL, Computer Vision, Programming, ML |
Tools Used | Verilog, Icarus, Quartus, Python, NumPy |
Current Status | Ongoing (Active) |
Progress | - Implemented detection of MNIST digits (0-9) - Added support for EMNIST, enabling classification of 62 character classes - Integrated real-time inference with a Tkinter-based character drawing interface |
Next Steps | - Ensure the top module is synthesizable by eliminating the use of the real data type - Optimize the design for parallel processing, leveraging Multiply-Accumulate (MAC) operations - Enhance floating-point multiplication and division support for improved computational efficiency |
Project Overview
NeVer (Neural Network in Verilog) is one of my favorite subprojects under ImProVe, where I aim to implement a fully functional neural network purely in Verilog and optimize it for efficient hardware acceleration
Motivation
I was inspired by this video: Building a neural network FROM SCRATCH (no TensorFlow/PyTorch, just NumPy & math) by Samson Zhang. The video demonstrates a simple 2-layer neural network for recognizing MNIST digits (0-9)
I expanded on this by:
- Using two hidden layers instead of the single hidden layer in the original
- Implementing the Adam optimizer alongside vanilla SGD, replacing basic gradient descent
- Extending support for 62 classes (0-9, A-Z, a-z)
- Using Verilog for inference instead of Python
- Incorporating Tkinter so handwritten characters can be drawn by hand and fed to the trained model in real time, allowing direct interaction with it
Project Roadmap
- Train a model using TensorFlow/PyTorch for quick validation → Infer in Python
- Train using NumPy only → Infer in Python
- Train using NumPy only → Infer in Verilog
- Train using pure Python (no NumPy, user-defined functions) → Infer in Verilog
- Implement a single neuron in Verilog for training
- Train directly in Verilog → Infer in Verilog
- Optimize the implementation using parallel processing, MAC units, etc.
- Make the entire Verilog implementation synthesizable
Current Status
- Successfully performing inference in Verilog using parameters trained in Python (Colab) with NumPy
- Implemented inference for both MNIST (digits 0-9) and EMNIST (62 classes: 0-9, A-Z, a-z)
- Training with 2000 iterations:
- First 1500 iterations: Adam optimizer
- Last 500 iterations: Vanilla SGD (Adam converges faster initially, while switching to SGD helps refine convergence; this was later changed to SGD with Momentum). A NumPy sketch of both update rules appears after this list
- Using Tkinter for drawing input characters, which are then processed and fed into Verilog for inference
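For reference, below is a minimal NumPy sketch of the two-phase schedule described above: Adam updates first, then SGD with momentum. The toy loss and the hyperparameter values are illustrative assumptions, not the project's actual training code.

```python
# Illustrative sketch of the two-phase optimizer schedule (not the project's
# training script): Adam for the first 1500 iterations, then SGD with momentum.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=4)            # toy "weights"

def grad(w):
    # Gradient of a simple quadratic loss 0.5 * ||w||^2 (stand-in for the real loss)
    return w

# Phase 1: Adam for the first 1500 iterations
lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 1501):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)

# Phase 2: SGD with momentum for the last 500 iterations
lr, mu = 0.1, 0.9
vel = np.zeros_like(w)
for _ in range(500):
    vel = mu * vel - lr * grad(w)
    w = w + vel

print(w)   # should end up close to the minimum at the origin
```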
Current Workflow
- Image Processing: `draw.py` → Converts the Tkinter drawing into `drawing.jpg`
- Grayscale Image Conversion: `img2bin.py` → Generates `mnist_single_no.txt` (integer values from 0-255)
- Vectorization: `arr2row.v` → Flattens the 2D 28×28 matrix into `input_vector.txt`
- Memory Preloading: `memloader_from_inp_vec.py` → Converts `input_vector.txt` into Verilog memory (`image_memory.v`); a Python sketch of this step follows the list
- Weight & Bias Preloading: `wtbs_loader.py` → Converts pretrained weight & bias TXT files into Verilog memory (`W1_memory.v`, `b1_memory.v`, etc.)
- Inference in Verilog: `emnist_with_tb.v` → Top module: `emnist.v`
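To make the memory-preloading step concrete, here is a rough Python sketch of what a script like `memloader_from_inp_vec.py` could do, assuming one 8-bit entry per pixel and an address-indexed read port; the generated module's port and signal names are my own, not necessarily the project's.

```python
# Illustrative sketch of the memory-preloading idea (not the actual
# memloader_from_inp_vec.py): read the 784 space-separated pixel values and
# emit a small Verilog memory module with the values hard-coded.
from pathlib import Path

pixels = Path("input_vector.txt").read_text().split()
assert len(pixels) == 784, "expected a flattened 28x28 image"

lines = [
    "module image_memory(input [9:0] addr, output [7:0] pixel);",
    "  reg [7:0] mem [0:783];",
    "  initial begin",
]
lines += [f"    mem[{i}] = 8'd{int(p)};" for i, p in enumerate(pixels)]
lines += [
    "  end",
    "  assign pixel = mem[addr];",
    "endmodule",
    "",
]
Path("image_memory.v").write_text("\n".join(lines))
print(f"wrote image_memory.v with {len(pixels)} entries")
```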
Technical Details
- The top module (`emnist.v`) follows an FSM-based approach with minimal to no overlap
- Softmax Approximation: Using Taylor series expansion for exponentiation (a small reference sketch follows this list)
- Pipeline Strategy:
- Currently: Using coarse-grained pipelining, meaning major computation blocks execute sequentially with some latency
- Next Steps: Implement fine-grained pipelining, where smaller operations are parallelized for higher throughput
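As a behavioural reference for the Taylor-series idea (a Python check model, not the Verilog implementation; the number of series terms is an assumed value):

```python
# Softmax with the exponential replaced by a truncated Taylor series.
import numpy as np

def exp_taylor(x, terms=5):
    # e^x ≈ 1 + x + x^2/2! + ... + x^(terms-1)/(terms-1)!
    result = np.zeros_like(x, dtype=float)
    term = np.ones_like(x, dtype=float)
    for n in range(terms):
        result += term
        term = term * x / (n + 1)
    return result

def softmax_approx(logits):
    # Accuracy degrades for large-magnitude logits because the series is truncated
    e = exp_taylor(logits)
    return e / np.sum(e)

print(softmax_approx(np.array([1.0, 2.0, 0.5])))
```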
To-Do
- Implement LUTs for efficient exponential computation in Softmax (a possible LUT-generation sketch follows this list)
- Remove the `real` datatype in the top module to ensure full synthesizability
- Extend support for all ASCII characters (optional)
- Enable OCR functionality to detect any character in a given image
- Optimize for parallel processing, better pipelining, and hardware acceleration
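One possible shape for the LUT item above, sketched in Python (an assumption about the approach, not the project's design): precompute exponential values over a fixed input range, scale them to integers, and write them in hex so Verilog can load the table with `$readmemh`. The range, step size, and output scale (10,000, matching the weight scaling used elsewhere) are illustrative choices.

```python
# Generate a fixed-point exponential lookup table for use with $readmemh.
import math

SCALE = 10_000          # same integer scaling used for weights and biases
STEP = 0.125            # input resolution of the LUT
X_MIN, X_MAX = -8.0, 8.0

n_entries = int(round((X_MAX - X_MIN) / STEP)) + 1
with open("exp_lut.hex", "w") as f:
    for i in range(n_entries):
        x = X_MIN + i * STEP
        value = int(round(math.exp(x) * SCALE))
        f.write(f"{value:08x}\n")   # one 32-bit hex entry per line
```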
Demos
Here are some demo videos
MNIST Digit Recognition
I developed a fully connected neural network from scratch in Google Colab, avoiding frameworks like TensorFlow and Keras. Instead, I relied on NumPy for numerical operations, Pandas for data handling, and Matplotlib for visualization. The model was trained on sample_data/mnist_train_small.csv, a dataset containing flattened 784-pixel images of handwritten digits. Data preprocessing included normalizing pixel values (dividing by 255) and splitting the dataset into a training set and a development set, with the first 1000 samples reserved for validation. The dataset was shuffled before training to enhance randomness, and labels (digits 0-9) were stored separately
The network architecture consists of an input layer (784 neurons), a hidden layer (128 neurons, ReLU activation), and an output layer (10 neurons, softmax activation). Model parameters (weights and biases) were initialized randomly and updated via gradient descent over 500 iterations with a learning rate of 0.1. Training followed the standard forward propagation for computing activations and backpropagation for updating parameters. Accuracy was recorded every 10 iterations. To ensure compatibility with Verilog, all weights and biases were scaled by 10,000 and stored as integer values in text files (`W1.txt`, `b1.txt`, etc.), eliminating the need for floating-point operations in hardware. These trained parameters were later used for inference on new images, verifying accuracy on the development set before deployment in Verilog for real-time classification.

The trained model, based on sample_data/mnist_train_small.csv, achieved over 90% accuracy. It generates `W1`, `W2`, `b1`, and `b2` text files containing weight and bias values. These parameters are used in Verilog to predict digits from an input image stored in `input_vector.txt`, formatted as 784 space-separated integers. The Verilog module reads this data, performs inference, and displays the predicted output using `$display`. The original CSV file was converted into a space-separated text format, where each row contains a digit followed by 784 pixel values (785 total). During inference, the first value (the label) is discarded, ensuring the model classifies the input image without prior knowledge of its actual label.

The Verilog implementation of the neural network consists of an input layer (784 neurons), a hidden layer (128 neurons), and an output layer (10 neurons). It loads pre-trained weights and biases from `W1.txt`, `b1.txt`, `W2.txt`, and `b2.txt`, along with an input vector from `input_vector.txt`. Input values are normalized by dividing by 255.0, while weights and biases are scaled by 10,000 for fixed-point arithmetic. The hidden layer applies a fully connected transformation (`W1 * input + b1`) followed by ReLU activation, while the output layer computes another weighted sum (`W2 * hidden + b2`). Instead of applying softmax, the model identifies the predicted class by selecting the index of the highest output value.

The module ensures correct file reading before computation begins. Forward propagation is executed sequentially, with an initial delay for loading weights, biases, and input values. After processing activations in both layers, the output layer iterates through its neurons to determine the class with the highest activation. The classification result is displayed via `$display`. This hardware implementation bypasses complex activation functions like softmax while maintaining classification accuracy through direct maximum-value selection.

In the latest iterations, Python scripts convert text-based weight and bias files into synthesizable Verilog memory blocks. These are stored in register modules, which are instantiated in the top-level module. The image input is handled in a similar way.
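Putting the MNIST inference path described above into a small Python behavioural model (the Verilog performs the equivalent computation; the random stand-in data is only there so the snippet runs on its own):

```python
# Behavioural reference for the inference path: scaled-integer parameters,
# input normalization by 255, fully connected layer + ReLU, second weighted
# sum, then argmax instead of softmax.
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for the scaled-by-10,000 integer parameters loaded from W1.txt etc.
W1 = rng.integers(-10_000, 10_000, size=(128, 784))
b1 = rng.integers(-10_000, 10_000, size=(128, 1))
W2 = rng.integers(-10_000, 10_000, size=(10, 128))
b2 = rng.integers(-10_000, 10_000, size=(10, 1))

# Stand-in for input_vector.txt: 784 pixel values in 0-255
pixels = rng.integers(0, 256, size=(784, 1))

x = pixels / 255.0                      # normalization, as in the Verilog module
h = np.maximum(W1 @ x + b1, 0)          # fully connected transformation + ReLU
out = W2 @ h + b2                       # second weighted sum
predicted_digit = int(np.argmax(out))   # highest output value wins
print("predicted digit:", predicted_digit)
```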
Currently, the top module includes a few non-synthesizable constructs, such as `$display`, `$finish`, and the `real` datatype. These were relocated to the testbench in later versions to improve synthesizability. Additionally, I am replacing `real` with a fixed-point representation (Q24.8) to make the design fully synthesizable. Future versions will output the classification result to a seven-segment display via case statements, replacing `$display`.
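For context on the Q24.8 format, a tiny Python illustration of the encoding and of a fixed-point multiply; the helper names are mine, and the hardware version would perform the same shift-based arithmetic on 32-bit signed words:

```python
# Q24.8 fixed point: 24 integer bits, 8 fractional bits, in a 32-bit word.
FRAC_BITS = 8

def to_q24_8(value: float) -> int:
    # Encode a real number by scaling with 2^8 and rounding
    return int(round(value * (1 << FRAC_BITS)))

def from_q24_8(word: int) -> float:
    return word / (1 << FRAC_BITS)

def q_mul(a: int, b: int) -> int:
    # Product of two Q24.8 numbers has 16 fractional bits; shift back down to 8
    return (a * b) >> FRAC_BITS

a = to_q24_8(3.5)     # 896, i.e. 0x380
b = to_q24_8(-0.25)   # stored as two's complement in hardware; plain int here
print(from_q24_8(q_mul(a, b)))   # -> -0.875
```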
Moving forward, I am working on transitioning training from Python to Verilog, aiming to implement a fully synthesizable neural network for hardware-based learning and inference
EMNIST Character Recognition
This model is trained on the EMNIST ByClass dataset (source), which includes 62 character classes: digits (`0-9`), uppercase letters (`A-Z`), and lowercase letters (`a-z`). The dataset undergoes preprocessing, where it is converted into a CSV format, normalized, reduced in dimensionality, and shuffled before training to improve generalization.
Neural Network Architecture
The model consists of multiple layers:
- Input Layer: 784 neurons (28×28 grayscale pixel values)
- First Hidden Layer: 256 neurons (`W1: 256×784`, `b1: 256×1`)
- Second Hidden Layer: 128 neurons (`W2: 128×256`, `b2: 128×1`)
- Output Layer: 62 neurons (`W3: 62×128`, `b3: 62×1`)
Training Process
The network is trained using forward propagation, where activations are computed at each layer through matrix multiplications and ReLU activation functions for hidden layers. Backpropagation is used to update weights based on the gradient of the loss function. The dataset is shuffled before each epoch to prevent overfitting. The model is trained over multiple epochs using a combination of Stochastic Gradient Descent (SGD) and the Adam optimizer to improve convergence
To ensure compatibility with hardware, weights and biases are scaled by 10,000 and stored as integers in text files (`W1.txt`, `b1.txt`, etc.), since synthesizable Verilog has no native support for floating-point arithmetic.
Inference in Verilog
The inference process in Verilog follows a similar structure but accommodates additional layers and character classes. Input images are read from `input_vector.txt`, normalized, and processed through the neural network using preloaded weights and biases. The computation follows:
hidden1 = ReLU(W1 * input + b1)
hidden2 = ReLU(W2 * hidden1 + b2)
output = W3 * hidden2 + b3
The index of the maximum output value corresponds to the predicted character, which is mapped into the string `"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"` and displayed using `$display`.
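A short Python model of that decision step (illustrative only; the Verilog module performs the equivalent comparison loop over its 62 outputs, and the random activations here are just stand-ins):

```python
# Map the index of the maximum output activation to its character class.
import numpy as np

CLASSES = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

rng = np.random.default_rng(7)
output = rng.normal(size=62)        # stand-in for W3 * hidden2 + b3

index = int(np.argmax(output))      # index of the maximum output value
print("predicted character:", CLASSES[index])
```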
Pipeline and Data Flow
The following scripts handle image processing, vectorization, and memory loading for Verilog inference:
- Drawing & Image Processing:
  - `draw.py`: Creates a square canvas in Tkinter for character input
  - After drawing, the script grayscales, inverts, and compresses the image to 28×28 resolution, and saves it as `drawing.jpg` (see the sketch after this list)
- Data Conversion & Preprocessing:
  - `img2bin.py`: Converts `drawing.jpg` into a 28×28 grayscale pixel matrix (`mnist_single_no.txt`)
  - `arr2row.py`: Flattens the 2D array into a 1D vector (784 values) and stores it in `input_vector.txt`
- Memory Module Generation:
  - `memloader_from_inp_vec.py`: Converts `input_vector.txt` into a synthesizable Verilog memory module (`image_memory.v`)
  - `wtbs_loader.py`: Converts `W1`, `W2`, `W3`, `b1`, `b2`, `b3` into Verilog memory modules (`W1_memory.v`, `b1_memory.v`, etc.)
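As a rough illustration of the image-processing chain above (not the actual `img2bin.py` / `arr2row.py`), assuming Pillow for the image operations and the file names given in the list:

```python
# Grayscale, invert, shrink to 28x28, then write the pixel matrix and the
# flattened 784-value vector used by the Verilog memory loader.
import numpy as np
from PIL import Image, ImageOps

img = Image.open("drawing.jpg").convert("L")   # grayscale
img = ImageOps.invert(img)                     # white strokes on black, MNIST-style
img = img.resize((28, 28))                     # compress to 28x28

pixels = np.array(img, dtype=np.uint8)         # 28x28 matrix of 0-255 values

# 28x28 matrix, one row of the image per line
np.savetxt("mnist_single_no.txt", pixels, fmt="%d")

# Flattened 1D vector of 784 space-separated integers
with open("input_vector.txt", "w") as f:
    f.write(" ".join(str(v) for v in pixels.flatten()))
```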
Final Hardware Implementation
All these components are instantiated in the top module (`emnist_with_tb.v`), along with a testbench (`emnist_nn_tb.v`). The system successfully predicts handwritten characters in real time.
In the demo, I tested the characters “H”, “f”, and “7”, each representing different EMNIST subclasses (uppercase letters, lowercase letters, and numbers)
Additionally, I implemented a coarse-grained pipelined fully connected neural network using a Finite State Machine (FSM), integrating a Softmax function approximation via Taylor series expansion to improve computational efficiency
Raw Demo Shots
UpperCase Alphabet | LowerCase Alphabet | Single Digit Number |
---|---|---|
Actual: R / Prediction: R | Actual: i / Prediction: i | Actual: 9 / Prediction: 9 |
(demo screenshot) | (demo screenshot) | (demo screenshot) |
Pre-processing Workflow
Original Drawing | Grayscale Image | Inverted Image | Text Matrix |
---|---|---|---|
(screenshot) | (screenshot) | (screenshot) | (screenshot) |