SW Inference PYNQ ARM Cortex (NumPy)

CIFAR-10 CNN Inference — Pure NumPy on PYNQ

This is software inference of the CIFAR-10 Mini-ResNet, run directly on the ARM Cortex-A9 processing system (PS) of the Zynq-7000 via the PYNQ framework using a pure NumPy forward pass. No FPGA fabric is used. Two weight variants are documented — float weights and fixed-point (Q1.7) weights — both executed through the same NumPy inference code on the PYNQ board.

This is the PYNQ-side software baseline, distinct from the FPGA-accelerated FINN deployment and the bare-metal keras2c/C inference documented elsewhere.


What This Is

PYNQ provides a Linux environment on the Zynq PS with a Python stack (NumPy, Jupyter, etc.) accessible over the network. Running inference here means using Python and NumPy on the ARM cores — the same physical processor as the Zynq ARM benchmarks, but accessed through the PYNQ software stack rather than cross-compiled C binaries.

The forward pass is implemented entirely in NumPy: convolutions as explicit nested loops or np.tensordot operations, ReLU as np.maximum, pooling with np.max over sliding windows, global average pooling as np.mean, and the dense layer as a matrix multiply. Weights are loaded from files at startup.
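The primitives described above can be sketched roughly as follows. This is an illustrative reconstruction, not the project's actual source — layer shapes, layout (NHWC), and function names are assumptions:

```python
import numpy as np

def conv2d(x, w, b):
    """3x3 'same' convolution via np.tensordot over sliding patches.
    x: (H, W, Cin), w: (3, 3, Cin, Cout), b: (Cout,) — NHWC layout assumed."""
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.empty((H, W, w.shape[3]), dtype=np.float32)
    for i in range(H):
        for j in range(W):
            patch = xp[i:i+3, j:j+3, :]                 # (3, 3, Cin)
            out[i, j, :] = np.tensordot(patch, w, axes=3) + b
    return out

def relu(x):
    return np.maximum(x, 0.0)

def maxpool2(x):
    """2x2 max pooling with stride 2 via a reshape over sliding windows."""
    H, W, C = x.shape
    return x[:H//2*2, :W//2*2, :].reshape(H//2, 2, W//2, 2, C).max(axis=(1, 3))

def gap(x):
    """Global average pooling over the spatial dimensions."""
    return np.mean(x, axis=(0, 1))

def dense(x, w, b):
    """Final classifier layer as a matrix multiply."""
    return x @ w + b
```

Every multiply-accumulate in `conv2d` passes through Python-level loop iterations and NumPy dispatch, which is the root cause of the latency numbers reported below.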


Two Weight Variants

Two runs were performed with different weight representations:

results_txt — float weights
Weights are loaded as float32 values. This is the direct Python/Keras weight format — the same values the model was trained with, at full floating-point precision.

results_txt_fixed — Q1.7 fixed-point weights
Weights are stored as 8-bit integers in Q1.7 format (7 fractional bits; divided by 128 to recover float values at inference time). This matches the quantized weight format used in the C and HLS implementations. The inference arithmetic still runs in float32 inside NumPy — the quantization affects only the loaded weight values, not the computation itself.

Both variants run the identical NumPy forward pass code. The only difference is the weight loading step.
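The two loading paths might look like this. The file format (whitespace-separated text) and function names are assumptions for illustration; only the /128 dequantization rule comes from the description above:

```python
import numpy as np

def load_float_weights(path):
    # Float variant: weights stored as float32 values, used as-is.
    return np.loadtxt(path, dtype=np.float32)

def load_q17_weights(path):
    # Q1.7 variant: 8-bit integers with 7 fractional bits;
    # dividing by 2**7 = 128 recovers the float value.
    q = np.loadtxt(path, dtype=np.int64)
    return q.astype(np.float32) / 128.0
```

After loading, both variants hand identical-shaped float32 arrays to the same forward-pass code.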


Results

Accuracy

Both variants achieve 84 / 100 (84%) on the 100-image test set (10 images per class).
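The overall and per-class tallies follow from a simple argmax over the logits. A minimal sketch, with hypothetical placeholder arrays standing in for the real logits and labels (10 images per class, as in the test set):

```python
import numpy as np

# Placeholder data: logits (100, 10) from the forward pass and
# integer labels (100,), 10 images per class.
rng = np.random.default_rng(0)
logits = rng.normal(size=(100, 10)).astype(np.float32)
labels = np.repeat(np.arange(10), 10)

preds = np.argmax(logits, axis=1)            # top-1 prediction per image
accuracy = np.mean(preds == labels)          # overall accuracy
per_class = [(preds[labels == c] == c).sum() for c in range(10)]
```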

Per-class accuracy — float weights (results_txt)

Class Correct Misclassified as
airplane 9/10 truck
automobile 8/10 dog, horse
bird 7/10 truck, deer, cat
cat 7/10 dog, dog, horse
deer 8/10 ship, bird
dog 8/10 horse, cat
frog 10/10
horse 9/10 cat
ship 10/10
truck 8/10 automobile, automobile

Per-class accuracy — Q1.7 fixed weights (results_txt_fixed)

Class Correct Misclassified as
airplane 9/10 truck
automobile 8/10 dog, horse
bird 8/10 airplane, cat
cat 7/10 dog, dog, horse
deer 7/10 ship, bird, frog
dog 8/10 horse, cat
frog 10/10
horse 9/10 cat
ship 10/10
truck 8/10 automobile, automobile

Observations

Both variants produce the same total accuracy (84%) despite the rounding error introduced by Q1.7 weight quantization. This confirms that Q1.7 (8-bit, 7 fractional bits) is sufficient precision for this network — the quantization error flips the argmax on only a few borderline images, and the flips cancel out, leaving the total unchanged.

The two runs are not identical image-by-image. Three predictions differ:

Image Actual Float pred → Fixed pred
020 bird truck ❌ → airplane ❌
025 bird deer ❌ → bird ✅
048 deer deer ✅ → frog ❌

Image 025 is a gain for the fixed model (deer → bird, correct). Image 048 is a loss (deer → frog, wrong). These are borderline cases where the quantization noise happens to push the logits across a class boundary in opposite directions. The net effect cancels out — both runs stay at 84%.

frog and ship are the strongest classes, achieving perfect 10/10 in both variants. cat and bird are the weakest (7/10 float), with cat showing persistent confusion with dog (2 misclassifications in both runs) — a well-known CIFAR-10 difficulty.

The truck class confuses exclusively with automobile in both runs (both wrong predictions are automobile), reflecting the visual and semantic similarity between the two vehicle classes.


Inference Timing

Metric Float weights Fixed weights
Min (ms) 21,111 29,611
Max (ms) 21,472 30,207
Avg (ms) 21,183 29,704
Std Dev (ms) 58.96 75.31

Float inference: ~21.2 seconds per image
Fixed inference: ~29.7 seconds per image

Both are dramatically slower than the C implementations. The keras2c C baseline at -O0 (worst case) took ~120 ms per image on an x86 laptop; the Zynq ARM C baseline at -O0 took ~5526 ms for 100 images (~55 ms per image). NumPy on PYNQ is therefore roughly 385× (float) to 540× (fixed) slower than compiled C on the same ARM cores.

This is expected. Python interpreter overhead, NumPy function call overhead, and the absence of any compiler optimization mean that every MAC in the convolution loops goes through multiple layers of Python dispatch. NumPy is not an efficient inference engine — it is a correctness reference and prototyping tool.

Why fixed weights are slower than float weights (~40% overhead)

The fixed-weight variant takes ~8.5 seconds longer per image (29,704 ms vs 21,183 ms). Both variants run the same NumPy arithmetic in float32 — the difference is entirely in the weight loading and dequantization step.

In the fixed variant, every weight value is loaded as an integer and divided by 128 at runtime before each use (or during the loading phase). This adds integer-to-float conversion and scaling operations across all weight arrays. The total weight count is substantial: the 3×3 conv layers alone contribute 3×3×3×28 + 3×3×28×28 + 3×3×28×56 + 3×3×56×56 = 50,148 values. Converting, scaling, and potentially reloading this many values in Python adds measurable overhead that the float variant avoids.
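One way to avoid most of this cost — an implementation note, not something the original code necessarily does — is to dequantize once at load time and reuse the cached float32 arrays, rather than converting inside the inference loop:

```python
import numpy as np

# Hypothetical Q1.7 weight array as it would sit on disk (int8).
rng = np.random.default_rng(0)
q = rng.integers(-128, 128, size=50_148, dtype=np.int8)

# Per-use dequantization: conversion + division repeated for every
# image — the pattern that can produce the ~40% overhead above.
def dequant(q):
    return q.astype(np.float32) / 128.0

# Load-time dequantization: pay the cost once at startup and run
# every subsequent inference against the cached float32 array.
w_cached = dequant(q)
```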

The higher standard deviation in the fixed run (75 ms vs 59 ms) also suggests the dequantization introduces more variable execution time, likely due to memory allocation patterns during the conversion step.


Timing Distribution

All inference times are tightly clustered, with very low variance relative to the mean — std dev is 59 ms on a ~21,000 ms mean for float (0.28% CV) and 75 ms on a ~29,700 ms mean for fixed (0.25% CV). There are no extreme outliers. The only notable spike in the float run is Image 072 at 21,472 ms (~290 ms above average) — likely a brief OS scheduling event in the PYNQ Linux environment. All other images fall within ±200 ms of the mean.
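The coefficient-of-variation figures above follow directly from the mean and std-dev values in the timing table:

```python
# CV = std dev / mean, expressed as a percentage.
# Values taken from the inference timing table above (ms).
timings = {
    "float": {"mean": 21183.0, "std": 58.96},
    "fixed": {"mean": 29704.0, "std": 75.31},
}

cv = {name: t["std"] / t["mean"] * 100.0 for name, t in timings.items()}
# cv["float"] ≈ 0.28, cv["fixed"] ≈ 0.25
```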

This tight distribution indicates that the PYNQ Linux environment is stable and the workload is deterministic. The inference time is dominated by compute, not I/O or system noise.


Summary

Metric Float weights Fixed weights
Accuracy 84% 84%
Avg latency ~21.2 s/image ~29.7 s/image
Std dev 59 ms 75 ms
Backend NumPy (float32) NumPy (float32, Q1.7 weights)
Fabric used None (PS only) None (PS only)

Pure NumPy inference on PYNQ establishes the software-only baseline on this platform. The 84% accuracy result confirms the model and weights are correct. The ~21 second latency per image confirms that Python-level inference is impractical for any real-time application and motivates the C, HLS, and FPGA implementations documented in other parts of this project.