Brevitas + FINN

CIFAR-10 CNN Inference — Brevitas + FINN FPGA Deployment

End-to-end quantization-aware training and hardware deployment of a ResNet-style CIFAR-10 classifier on the Digilent Zybo Z7-10 (Zynq-7010, xc7z010clg400-1) using PyTorch, Xilinx Brevitas, QONNX export, and the FINN compiler. The pipeline produces a functional .bit bitstream for direct FPGA deployment.

This is distinct from the manual HLS and hls4ml approaches documented elsewhere. FINN generates a streaming dataflow hardware architecture — every layer becomes a dedicated streaming IP block connected by FIFOs — rather than a single monolithic kernel.


Toolchain

Tool Role
PyTorch Model definition and training
Brevitas Quantization-aware training (QAT)
QONNX / QONNXManager Export quantized model to ONNX-based interchange format
FINN compiler Dataflow compilation, HLS/RTL IP generation, Vivado integration
Vivado 2022.2 Synthesis, place and route, bitstream generation

Model Architecture

A custom ResNet-style network designed specifically for the resource constraints of the xc7z010clg400-1: 17,600 LUTs, 35,200 FFs, 80 BRAM_18K blocks, 80 DSP48E1 slices.

Input images are resized to 28×28 (not the standard CIFAR-10 32×32). This reduces the spatial dimensions of all feature maps through the network, directly reducing the memory and logic required for convolution input generators and FIFOs. The architecture uses residual blocks with MaxPool downsampling, no floating-point reductions, and strictly integer arithmetic throughout — all requirements imposed by FINN’s hardware mapping constraints.
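
The effect of the resize on on-chip buffering can be sanity-checked with quick arithmetic. A stdlib sketch, assuming 4-bit activations and the 28-channel feature maps of the first block (both described in the sections below):

```python
# Rough line-buffer and feature-map sizes for the sliding-window
# generators, comparing 28x28 inputs against the native 32x32.
# Channel count (28) and 4-bit activations follow the quantization
# scheme and block structure described later in this document.

def fmap_bits(height, width, channels, bits=4):
    """Total bits held by one feature map at the given activation width."""
    return height * width * channels * bits

def linebuffer_bits(width, channels, kernel=3, bits=4):
    """Bits a KxK sliding-window generator must buffer: K-1 full lines."""
    return (kernel - 1) * width * channels * bits

for side in (32, 28):
    fm = fmap_bits(side, side, channels=28)
    lb = linebuffer_bits(side, channels=28)
    print(f"{side}x{side}: feature map {fm} bits, line buffer {lb} bits")

# The resize shrinks every spatial buffer by the same ratio:
ratio = fmap_bits(28, 28, 28) / fmap_bits(32, 32, 28)
print(f"memory ratio 28x28 / 32x32 = {ratio:.3f}")
```

Every convolution input generator and FIFO in the pipeline scales by this same 0.766 factor, which is where the resize pays off on a device this small.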

Quantization Scheme

FINN requires that every operation in the graph be expressible as integer arithmetic with fixed bitwidths. The quantization scheme was chosen to satisfy this while retaining accuracy:

Input quantization — 8-bit signed

A quantized identity layer is placed at the network input. This scales the incoming pixel values into the 8-bit signed integer range before the first convolution. Without this, the first conv layer receives float inputs and FINN cannot map it to integer hardware.

Weight quantization — 4-bit signed

All convolutional and linear layer weights are quantized to 4-bit signed integers using Brevitas Int4WeightPerTensorFloat (or equivalent). 4-bit weights halve the BRAM footprint compared to 8-bit and reduce the bit-width of the multipliers that FINN generates, which in turn reduces DSP and LUT usage.

Activation quantization — 4-bit unsigned

Post-ReLU activations are quantized to 4-bit unsigned using Brevitas QuantReLU. Unsigned is correct here because ReLU outputs are always non-negative. 4-bit activations match the weight bitwidth, keeping the multiply-accumulate products at 8 bits before accumulation — well within ap_fixed ranges and compatible with FINN’s MVAU (Matrix-Vector Activation Unit) mapping.

The combination of 4-bit weights and 4-bit activations is among the most hardware-efficient quantization points FINN supports for convolutional layers on small Zynq devices.
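
The bitwidth claims above can be checked with plain integer arithmetic. A minimal stdlib sketch of the ranges involved (not Brevitas itself):

```python
# Value ranges for the quantization scheme: 4-bit signed weights,
# 4-bit unsigned (post-ReLU) activations, 8-bit signed inputs.

INT4_MIN, INT4_MAX = -(2 ** 3), 2 ** 3 - 1      # weights: -8 .. 7
UINT4_MIN, UINT4_MAX = 0, 2 ** 4 - 1            # activations: 0 .. 15
INT8_MIN, INT8_MAX = -(2 ** 7), 2 ** 7 - 1      # inputs: -128 .. 127

# Every weight x activation product the MVAU can produce:
products = [w * a for w in range(INT4_MIN, INT4_MAX + 1)
                  for a in range(UINT4_MIN, UINT4_MAX + 1)]

# The extreme products fit in an 8-bit signed partial product before
# accumulation, as stated above.
print(min(products), max(products))   # -120 and 105
assert INT8_MIN <= min(products) and max(products) <= INT8_MAX
```

This 8-bit partial product is what lets FINN keep the multipliers in LUT logic instead of DSP48 blocks, as the utilization numbers later in this document show.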


Training

Dataset and preprocessing

CIFAR-10 with all images resized to 28×28 at load time. Resizing in the dataloader keeps the dataset on disk unmodified while guaranteeing the model trains on exactly the input dimensions the hardware will receive.

Training configuration

  • 30 epochs
  • GPU acceleration
  • DataLoader with pinned memory, multiple CPU workers, and cuDNN benchmarking enabled (torch.backends.cudnn.benchmark = True) — standard settings for throughput-optimized GPU training

Export to QONNX

After training:

  1. Model weights saved to disk.
  2. Network traced with torch.jit.trace using a representative input tensor.
  3. Exported to FINN-compatible QONNX format using QONNXManager.

The .qonnx file encodes the network graph with quantization annotations attached to each operator — bitwidths, scale factors, zero points — in a format that FINN’s transformation passes can parse and lower to hardware.
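
The integer-to-real mapping those annotations define can be sketched with a stdlib mock. Field names here are illustrative, not the exact QONNX attribute schema:

```python
# Mock of the quantization metadata a QONNX Quant node carries and the
# mapping it defines. Field names are illustrative; consult the QONNX
# specification for the exact attribute names.

def quantize(x, scale, zero_point, bit_width, signed):
    """Map a real value to its clipped integer representation."""
    lo = -(2 ** (bit_width - 1)) if signed else 0
    hi = 2 ** (bit_width - 1) - 1 if signed else 2 ** bit_width - 1
    q = round(x / scale) + zero_point
    return max(lo, min(hi, q))

def dequantize(q, scale, zero_point):
    """Recover the real value the integer represents."""
    return scale * (q - zero_point)

# A 4-bit signed weight with an illustrative per-tensor scale:
ann = {"scale": 0.05, "zero_point": 0, "bit_width": 4, "signed": True}
q = quantize(0.31, **ann)
print(q, dequantize(q, ann["scale"], ann["zero_point"]))
```

FINN's streamlining step (described below) works by pushing the scale and zero-point parts of this mapping into adjacent layers so that only the integer part remains in hardware.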


FINN Environment Configuration

Docker I/O optimization

The FINN compiler runs inside a Docker container. The build directory was redirected to /tmp inside the container, which is backed by a RAM disk. This avoids host filesystem latency during the intermediate steps that generate large numbers of small files: Verilator simulation, C++ HLS compilation, and FIFO sizing. Host filesystem I/O can become a significant bottleneck during these steps on machines with slow drives or NFS-mounted home directories.
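
FINN picks up its build directory from the FINN_BUILD_DIR environment variable; a minimal sketch of the redirection (the exact path is illustrative):

```python
# Point FINN's intermediate build outputs at tmpfs-backed /tmp inside
# the container. FINN reads the FINN_BUILD_DIR environment variable;
# the subdirectory name here is illustrative.
import os
import tempfile

build_dir = os.path.join(tempfile.gettempdir(), "finn_build")
os.makedirs(build_dir, exist_ok=True)
os.environ["FINN_BUILD_DIR"] = build_dir
print(os.environ["FINN_BUILD_DIR"])
```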

Dependency version control

Specific compatible versions of Brevitas and QONNX were pinned and installed automatically at container startup. FINN’s transformation passes are sensitive to the QONNX opset version — a version mismatch between the exporter (Brevitas-side) and the consumer (FINN-side) causes silent graph transformation failures where nodes are not recognized and pass through unmodified, producing incorrect hardware.
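
A fail-fast startup check makes this class of failure loud instead of silent; a stdlib sketch (the version strings are placeholders, not the actual pins used in this project):

```python
# Guard against the exporter/consumer opset mismatch described above:
# fail fast at container startup if the installed Brevitas/QONNX
# versions differ from the pinned ones. "X.Y.Z" is a placeholder.
from importlib.metadata import version, PackageNotFoundError

PINS = {"brevitas": "X.Y.Z", "qonnx": "X.Y.Z"}  # placeholder pins

def check_pins(pins):
    """Return a list of human-readable pin violations (empty = OK)."""
    problems = []
    for pkg, want in pins.items():
        try:
            have = version(pkg)
        except PackageNotFoundError:
            problems.append(f"{pkg}: not installed (want {want})")
            continue
        if have != want:
            problems.append(f"{pkg}: have {have}, want {want}")
    return problems

for p in check_pins(PINS):
    print("version pin violation:", p)
```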


Custom Zybo Z7-10 Board Integration

The Zybo Z7-10 is not in FINN’s official board support list. FINN ships with board definitions for Pynq-Z1, Pynq-Z2, ZCU102, and a few others. xc7z010clg400-1 is not among them. Manual integration was required.

Board definition injection

A board entry was added to FINN’s board registry with:

  • Part number: xc7z010clg400-1
  • AXI port width: 32-bit (the Zynq-7010 PS has 32-bit AXI GP and HP ports, unlike the 64-bit ports on larger Zynq UltraScale+ devices)
  • Architecture: zynq_7000

Without this entry, FINN’s Vivado block design generation step does not know the PS configuration and generates incorrect PS–PL wiring.
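
The injection can be sketched against the shape of FINN's board registry. This is modeled on the pynq_part_map and pynq_native_port_width dicts in finn.util.basic; the shipped entries are abridged, and the "ZyboZ7-10" key name is this project's choice:

```python
# Sketch of the board-registry injection, modeled on the dicts in
# finn.util.basic. Shown against local dicts here; in the real flow
# the same assignments are applied to FINN's maps before the build.
pynq_part_map = {
    "Pynq-Z1": "xc7z020clg400-1",   # shipped entries (abridged)
    "Pynq-Z2": "xc7z020clg400-1",
}
pynq_native_port_width = {
    "Pynq-Z1": 32,
    "Pynq-Z2": 32,
}

# Custom Zybo Z7-10 entry: part number and 32-bit native AXI width.
pynq_part_map["ZyboZ7-10"] = "xc7z010clg400-1"
pynq_native_port_width["ZyboZ7-10"] = 32

print(pynq_part_map["ZyboZ7-10"], pynq_native_port_width["ZyboZ7-10"])
```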

TCL template modifications

FINN generates a Vivado block design by rendering TCL templates. The templates were modified to:

  • Use the correct part string for xc7z010clg400-1
  • Wire AXI-Stream and AXI-Lite ports to the correct Zynq PS slave/master ports
  • Bypass FINN’s legacy 10-interface AXI-Lite limit, which was too low for the number of control registers required by the multi-layer ResNet configuration. The limit is a hardcoded constant in the template that was increased.
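
The limit bump can be applied programmatically rather than by hand-editing the template. A sketch, where the TCL variable name and template content are illustrative (the real constant lives in the template FINN renders during block design generation):

```python
# Patch a hardcoded interface-count constant in a TCL template before
# FINN renders it. The 'set max_axilite N' pattern is illustrative of
# where such a limit typically lives; adapt the regex to the actual
# template text.
import re

def raise_axilite_limit(template_text, new_limit):
    """Replace a hardcoded 'set max_axilite <n>'-style constant."""
    return re.sub(r"(set\s+max_axilite\s+)\d+",
                  lambda m: m.group(1) + str(new_limit),
                  template_text)

tcl = "set max_axilite 10\n# ... rest of template ..."
print(raise_axilite_limit(tcl, 32))
```

Patching at startup rather than forking the template keeps the modification visible and survivable across FINN container rebuilds.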

FINN Compilation Pipeline

The exported QONNX model was passed through FINN’s 17-step dataflow compilation pipeline. Each step is a graph transformation pass operating on the QONNX intermediate representation.

Major stages

1–2. Model transformation and cleanup

Canonicalization passes: constant folding, dead-node elimination, operator fusion where possible, and shape inference. Produces a clean graph with no redundant operations.

3. Streamlining

Absorbs scale factors and zero-point shifts from quantization annotations into adjacent layers. After this step, all activations and all weights are integer-valued — there are no floating-point scale multiplications remaining in the graph. This is a prerequisite for hardware mapping.

4. Folding and parallelization

Determines the parallelism (PE count, SIMD width) for each layer. This controls hardware throughput: higher parallelism consumes more DSPs and BRAMs but lowers latency. FINN’s folding step attempts to balance throughput across layers to avoid bottlenecks.
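
FINN expresses folding as a per-node configuration; a sketch of its shape and the tiling rule it must satisfy (node names match the per-module breakdown below, but the PE/SIMD values here are illustrative, not this project's actual folds):

```python
# Illustrative folding configuration in the shape FINN's folding-config
# JSON uses: per-node PE (output parallelism) and SIMD (input
# parallelism). For the fold to tile evenly, SIMD must divide the
# layer's input channels and PE its output channels.
folding = {
    "MVAU_hls_0": {"PE": 4, "SIMD": 3},    # 3 -> 28 channels
    "MVAU_hls_1": {"PE": 4, "SIMD": 4},    # 28 -> 56 channels
    "MVAU_rtl_0": {"PE": 2, "SIMD": 8},    # 56 -> 10 dense head
}
channels = {
    "MVAU_hls_0": (3, 28),
    "MVAU_hls_1": (28, 56),
    "MVAU_rtl_0": (56, 10),
}

for node, cfg in folding.items():
    in_ch, out_ch = channels[node]
    assert in_ch % cfg["SIMD"] == 0, f"{node}: SIMD must divide {in_ch}"
    assert out_ch % cfg["PE"] == 0, f"{node}: PE must divide {out_ch}"
print("folding config tiles evenly")
```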

5–6. HLS IP and RTL IP generation

Each layer is lowered to either an HLS C++ implementation (compiled by Vitis HLS) or a direct RTL implementation. Convolutional layers map to MVAU (Matrix-Vector Activation Unit) HLS blocks, MaxPool to StreamingMaxPool HLS blocks, and input padding to FMPadding RTL blocks. Data width adapters (StreamingDataWidthConverter) and FIFOs (StreamingFIFO) are inserted between layers to match bitwidths and buffer data between pipeline stages.

7. Verilator simulation

Functional simulation of the generated RTL to verify correctness before invoking Vivado. Run inside the container against the RAM disk.

8. Dataflow stitching

All individual IP blocks are stitched together into a single streaming dataflow design. Data moves through the system as AXI-Stream packets: the input DMA engine (IODMA) reads input from DDR and pushes it into the stream, it flows through convolutions, activations, and pooling layers in sequence, and the output stream is written back to DDR by a second DMA engine.

9. Vivado block design generation

Renders the TCL template to produce a complete Vivado block design containing the Zynq PS, the FINN accelerator partition, AXI-Stream connections, and AXI-Lite control.

10–11. Synthesis and place-and-route

Vivado runs synth_design followed by the implementation steps (opt_design, place_design, route_design). The LD_PRELOAD memory allocator override (see below) was applied here.

12. Bitstream generation

write_bitstream produces the final .bit file.

Memory stability fix

Vivado 2022.2 has known instability during synthesis and place-and-route on Linux when using the default glibc memory allocator under high memory pressure. Symptoms include random segfaults or hangs during elaboration. The standard workaround is:

LD_PRELOAD=<path_to_libtcmalloc.so>

This replaces glibc malloc with Google’s tcmalloc, which has better fragmentation behaviour under Vivado’s allocation patterns. This was applied inside the FINN container before invoking the Vivado steps.
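
One way to scope the override to the Vivado invocation only, rather than the whole container environment — a sketch where the library path is illustrative and varies by distro (gperftools typically installs libtcmalloc_minimal):

```python
# Apply LD_PRELOAD only to the Vivado subprocess, leaving the rest of
# the container environment untouched. The library path is
# illustrative; locate the actual tcmalloc .so for your distro.
import os
import subprocess

env = os.environ.copy()
env["LD_PRELOAD"] = "/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4"

# subprocess.run(["vivado", "-mode", "batch", "-source", "build.tcl"],
#                env=env, check=True)
print(env["LD_PRELOAD"])
```

Scoping the preload this way avoids tcmalloc being injected into unrelated tools (Verilator, Python itself) that run in the same container.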


Post-Synthesis Resource Utilization

Top-level summary (xc7z010clg400-1)

Resource Used Available Utilization
LUT 11,581 17,600 66%
SRL (shift register LUT) 879
FF 14,557 35,200 41%
BRAM_36K 17 40 43%
BRAM_18K 5 80 6%
DSP 3 80 4%

The design fits comfortably on the xc7z010. LUT at 66% is the tightest resource, with enough headroom for routing closure. DSP usage at 3 (4%) is remarkably low — a direct consequence of 4-bit weight × 4-bit activation products being narrow enough for FINN to implement the MVAU accumulation in LUT-based logic rather than DSP48 blocks. BRAM at 43% (36K equivalent) is dominated by the large FIFOs inserted between pipeline stages.
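
The headline percentages follow directly from the table; a quick arithmetic cross-check:

```python
# Cross-check of the utilization percentages reported above for the
# xc7z010clg400-1 (BRAM omitted here because 18K and 36K blocks are
# counted separately in the table).
used      = {"LUT": 11581, "FF": 14557, "DSP": 3}
available = {"LUT": 17600, "FF": 35200, "DSP": 80}

for res in used:
    pct = 100 * used[res] / available[res]
    print(f"{res}: {used[res]} / {available[res]} = {pct:.0f}%")
```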

Per-module breakdown

DMA and I/O

Module LUT SRL FF BRAM_36K BRAM_18K DSP
IODMA_hls_0 (input DMA) 1,382 39 2,121 0 1 0
IODMA_hls_0 (output DMA) 1,554 144 2,301 0 1 0

Two DMA engines handle host–accelerator data transfer. The input DMA reads the image from DDR and pushes it into the streaming pipeline. The output DMA reads the result stream and writes the classification output back to DDR. Both are HLS-generated. Combined they consume 2,936 LUTs — about 25% of total LUT usage — and no BRAM_36K. Each uses one BRAM_18K for its internal data buffering.

Block 1 — Conv3×3 (3→28 channels, 28×28 spatial)

Module LUT SRL FF BRAM_36K BRAM_18K DSP
FMPadding_rtl_0 29 0 61 0 0 0
ConvolutionInputGenerator_rtl_0 197 0 95 0 0 0
MVAU_hls_0 535 0 615 0 1 1
StreamingMaxPool_hls_0 419 0 521 0 0 0

FMPadding_rtl_0 inserts the spatial padding required for same-convolution. It is pure RTL with negligible resource usage — 29 LUTs, 61 FFs.

ConvolutionInputGenerator_rtl_0 (CIG) generates the sliding window input for the MVAU. It reads the input feature map stream and outputs overlapping 3×3 patches in the order the MVAU expects. Implemented in RTL, no BRAM needed because the input channel count is small (3 channels) and the line buffer fits in distributed RAM.

MVAU_hls_0 is the matrix-vector activation unit for the first conv layer. This is where the multiply-accumulate happens. 535 LUTs and 1 DSP — the convolution uses one DSP48 for the MAC accumulator, with the weight and activation multiplications handled in LUT logic given the 4-bit operand widths. 1 BRAM_18K stores the weight matrix.

StreamingMaxPool_hls_0 implements 2×2 max pooling in the stream domain. 419 LUTs, no BRAM — the pooling window state is held in registers.

Block 2 — Conv3×3 (28→56 channels, 14×14 spatial)

Module LUT SRL FF BRAM_36K BRAM_18K DSP
FMPadding_rtl_1 47 0 141 0 0 0
ConvolutionInputGenerator_rtl_1 237 0 99 0 0 0
MVAU_hls_1 763 0 966 1 0 1
StreamingMaxPool_hls_1 794 0 962 0 0 0

Block 2 is larger than Block 1 across every metric, consistent with the increased channel count (28→56). MVAU_hls_1 grows to 763 LUTs and moves to BRAM_36K (1 block) for weight storage — the 28×56×3×3 weight matrix is large enough to warrant a full 36K BRAM. StreamingMaxPool_hls_1 at 794 LUTs is significantly larger than its Block 1 counterpart because the 56-channel stream requires more parallel comparison logic.

Dense head (56→10)

Module LUT SRL FF BRAM_36K BRAM_18K DSP
MVAU_rtl_0 71 0 200 2 1 1

The final fully connected layer is implemented as MVAU_rtl_0 — an RTL MVAU rather than HLS. At 56 inputs × 10 outputs the weight matrix is small (560 entries), but FINN still allocates 2 BRAM_36K and 1 BRAM_18K for it. LUT usage is minimal (71) because the RTL implementation offloads most logic to BRAM-based lookup.

FIFOs and width converters

Fourteen StreamingFIFO_rtl instances and five StreamingDataWidthConverter_rtl instances are inserted at stage boundaries throughout the pipeline: FIFOs wherever adjacent stages need decoupling, width converters wherever adjacent stream widths differ.

FIFOs decouple the throughput of adjacent stages. If one stage stalls (e.g. MVAU finishing a channel reduction), the FIFO absorbs the backpressure without stalling the upstream stage. FIFO depths are sized by FINN’s folding step based on the latency difference between adjacent stages. The larger FIFOs (rtl_3 with 8 BRAM_36K, rtl_9 with 4 BRAM_36K) sit between the CIG and MVAU stages where the latency mismatch is largest. Smaller FIFOs use distributed RAM (SRL-based) or pure registers.

StreamingDataWidthConverter_rtl instances adapt between stages that produce and consume different AXI-Stream data widths. For example, a stage emitting 3 channels of 4-bit data produces 3×4 = 12-bit wide beats, while a stage consuming 28 channels of 4-bit data expects 28×4 = 112-bit wide beats — the width converter bridges this boundary.
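
The width arithmetic is simple to make concrete; how many channels are packed per beat is set by the folding step, and the factors below mirror the example above:

```python
# Stream width at a stage boundary: each AXI-Stream beat carries
# (channels packed per beat) x (bits per channel). The packing factors
# below mirror the 3-channel / 28-channel example above.
def stream_width(channels_per_beat, bits_per_channel):
    """AXI-Stream beat width in bits for a given channel packing."""
    return channels_per_beat * bits_per_channel

producer = stream_width(3, 4)     # 12-bit beats
consumer = stream_width(28, 4)    # 112-bit beats
print(producer, consumer)         # the width converter bridges these
```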

Total FIFO and converter resources:

Resource FIFOs (14×) Width converters (5×)
LUT 1,333 150
SRL 520 0
FF 1,193 481
BRAM_36K 13 0

BRAM_36K is dominated by FIFOs — 13 of the 17 total BRAM_36K in the design are FIFO storage.


Final Output

Successful generation of:

  • Vivado hardware handoff (.xsa)
  • .bit bitstream

Both artifacts are deployable via a PYNQ overlay or direct JTAG programming.

Summary

Stage Tool Status
Quantization-aware training PyTorch + Brevitas ✓ Complete
QONNX export QONNXManager ✓ Complete
FINN dataflow compilation FINN compiler ✓ Complete
Vivado synthesis + P&R Vivado 2022.2 ✓ Complete
Bitstream generation Vivado 2022.2 ✓ Complete

The final design uses 66% LUT, 41% FF, 43% BRAM_36K, and only 4% DSP on the xc7z010 — a direct result of 4-bit weight and activation quantization allowing FINN to implement MAC operations in LUT logic rather than DSP48 blocks, and the ResNet architecture being sized to match the device constraints from the start.