SW Inference PYNQ ARM Cortex (NumPy)
CIFAR-10 CNN Inference — Pure NumPy on PYNQ
Software inference of the CIFAR-10 Mini-ResNet running directly on the ARM Cortex-A9 PS of the Zynq-7000 via the PYNQ framework, using a pure NumPy forward pass. No FPGA fabric is used. This documents two weight variants — float weights and fixed-point (Q1.7) weights — both executed through the same NumPy inference code on the PYNQ board.
This is the PYNQ-side software baseline, distinct from the FPGA-accelerated FINN deployment and the bare-metal keras2c/C inference documented elsewhere.
What This Is
PYNQ provides a Linux environment on the Zynq PS with a Python stack (NumPy, Jupyter, etc.) accessible over the network. Running inference here means using Python and NumPy on the ARM cores — the same physical processor as the Zynq ARM benchmarks, but accessed through the PYNQ software stack rather than cross-compiled C binaries.
The forward pass is implemented entirely in NumPy: convolutions as explicit
nested loops or np.tensordot operations, ReLU as np.maximum, pooling with
np.max over sliding windows, global average pooling as np.mean, and the
dense layer as a matrix multiply. Weights are loaded from files at startup.
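The primitives described above can be sketched roughly as follows. This is an illustrative reconstruction, not the project's actual code; layer shapes and function names are assumptions.

```python
import numpy as np

def conv2d_same(x, w, b):
    """3x3 'same' convolution via explicit loops over output pixels.
    x: (H, W, Cin), w: (3, 3, Cin, Cout), b: (Cout,)."""
    H, W, _ = x.shape
    cout = w.shape[3]
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))       # zero-pad for 'same' output
    out = np.empty((H, W, cout), dtype=np.float32)
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + 3, j:j + 3, :]        # (3, 3, Cin) receptive field
            # contract the (3, 3, Cin) axes of patch against w -> (Cout,)
            out[i, j] = np.tensordot(patch, w, axes=3) + b
    return out

def relu(x):
    return np.maximum(x, 0.0)

def maxpool2x2(x):
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

def global_avg_pool(x):
    return np.mean(x, axis=(0, 1))                 # (C,) per-channel means

def dense(x, w, b):
    return x @ w + b                               # logits; argmax = class
```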
Two Weight Variants
Two runs were performed with different weight representations:
results_txt — float weights
Weights loaded as float32 values. This is the direct Python/Keras weight
format — the same values the model was trained with, at full floating-point
precision.
results_txt_fixed — Q1.7 fixed-point weights
Weights stored as 8-bit integers in Q1.7 format (divided by 128 to recover
float values at inference time). This matches the quantized weight format used
in the C and HLS implementations. The inference arithmetic still runs in
float32 inside NumPy — the quantization only affects the weight values loaded,
not the computation itself.
Both variants run the identical NumPy forward pass code. The only difference is the weight loading step.
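The difference between the two loading paths can be sketched as a pair of helpers. The on-disk format is not specified in this document, so these operate on in-memory arrays:

```python
import numpy as np

def quantize_q17(w):
    """Round float weights to the nearest Q1.7 code (sign bit + 7
    fractional bits), saturating to the int8 range [-128, 127]."""
    return np.clip(np.round(w * 128.0), -128, 127).astype(np.int8)

def dequantize_q17(q):
    """Recover float weights from Q1.7 codes: real value = code / 2**7.
    Representable range is [-1, 127/128] with a step of 1/128."""
    return q.astype(np.float32) / 128.0

# Float variant: float32 weights are used as-is.
# Fixed variant: int8 codes are dequantized at load time; all inference
# arithmetic afterwards is still float32, exactly as in the float variant.
```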
Results
Accuracy
Both variants achieve 84 / 100 (84%) on the 100-image test set (10 images per class).
Per-class accuracy — float weights (results_txt)
| Class | Correct | Misclassified as |
|---|---|---|
| airplane | 9/10 | truck |
| automobile | 8/10 | dog, horse |
| bird | 7/10 | truck, deer, cat |
| cat | 7/10 | dog, dog, horse |
| deer | 8/10 | ship, bird |
| dog | 8/10 | horse, cat |
| frog | 10/10 | — |
| horse | 9/10 | cat |
| ship | 10/10 | — |
| truck | 8/10 | automobile, automobile |
Per-class accuracy — Q1.7 fixed weights (results_txt_fixed)
| Class | Correct | Misclassified as |
|---|---|---|
| airplane | 9/10 | truck |
| automobile | 8/10 | dog, horse |
| bird | 8/10 | airplane, cat |
| cat | 7/10 | dog, dog, horse |
| deer | 7/10 | ship, bird, frog |
| dog | 8/10 | horse, cat |
| frog | 10/10 | — |
| horse | 9/10 | cat |
| ship | 10/10 | — |
| truck | 8/10 | automobile, automobile |
Observations
Both variants reach the same total accuracy (84%) despite the rounding error introduced by Q1.7 weight quantization. This suggests that Q1.7 (8-bit, 7 fractional bits) provides sufficient precision for this network: the quantization error nudges a few borderline predictions, but it does not change the total number of correct classifications.
The two runs are not identical image-by-image. Three predictions differ:
| Image | Actual | Float pred | Fixed pred |
|---|---|---|---|
| 020 | bird | truck ❌ | airplane ❌ |
| 025 | bird | deer ❌ | bird ✅ |
| 048 | deer | deer ✅ | frog ❌ |
Image 025 is a gain for the fixed model (deer → bird, correct). Image 048 is a loss (deer → frog, wrong). These are borderline cases where the quantization noise happens to push the logits across a class boundary in opposite directions. The net effect cancels out — both runs stay at 84%.
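The mechanics of such borderline flips can be sketched with a toy final layer: Q1.7 quantization perturbs each logit by a small bounded amount, so only near-tied logits can change argmax. Shapes and values below are illustrative, not taken from the actual model.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy dense classifier: 64 features -> 10 classes (illustrative sizes).
w = rng.uniform(-0.99, 0.99, size=(64, 10)).astype(np.float32)
x = rng.uniform(0.0, 1.0, size=64).astype(np.float32)

# Quantize weights to Q1.7 and dequantize, as the fixed-weight run does.
wq = np.clip(np.round(w * 128), -128, 127).astype(np.int8).astype(np.float32) / 128

logits_float = x @ w
logits_fixed = x @ wq

# Each weight moves by at most 1/256 (half a Q1.7 step), so each logit
# moves by at most sum(|x|) / 256 -- here <= 64/256 = 0.25. The argmax
# can only flip when the float top-two margin is smaller than twice
# that bound, i.e. on borderline images.
err = np.max(np.abs(logits_float - logits_fixed))
top2 = np.sort(logits_float)[-2:]
margin = top2[1] - top2[0]
```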
frog and ship are the strongest classes, achieving perfect 10/10 in both variants. cat and bird are the weakest (7/10 float), with cat showing persistent confusion with dog (2 misclassifications in both runs) — a well-known CIFAR-10 difficulty.
The truck class confuses exclusively with automobile in both runs (both wrong predictions are automobile), reflecting the visual and semantic similarity between the two vehicle classes.
Inference Timing
| Metric | Float weights | Fixed weights |
|---|---|---|
| Min (ms) | 21,111 | 29,611 |
| Max (ms) | 21,472 | 30,207 |
| Avg (ms) | 21,183 | 29,704 |
| Std Dev (ms) | 58.96 | 75.31 |
Float inference averages ~21.2 seconds per image; fixed inference averages ~29.7 seconds per image.
Both are dramatically slower than the C implementations. The keras2c C baseline at -O0 (worst case) took ~120 ms per image on an x86 laptop; the Zynq ARM C baseline at -O0 took ~5526 ms for 100 images (~55 ms per image). At ~21.2 s (float) and ~29.7 s (fixed) per image, NumPy on PYNQ is therefore roughly 380–540× slower than compiled C on the same ARM cores.
This is expected. Python interpreter overhead, NumPy function call overhead, and the absence of any compiler optimization mean that every MAC in the convolution loops goes through multiple layers of Python dispatch. NumPy is not an efficient inference engine — it is a correctness reference and prototyping tool.
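The cost of per-MAC Python dispatch can be seen even in a small matrix multiply. This is a toy benchmark illustrating the point, not the project's code:

```python
import time
import numpy as np

a = np.random.rand(64, 64).astype(np.float32)
b = np.random.rand(64, 64).astype(np.float32)

# Per-element MACs through the interpreter: every multiply-add is a
# separate bytecode dispatch plus NumPy scalar boxing.
t0 = time.perf_counter()
acc = np.zeros((64, 64), dtype=np.float32)
for i in range(64):
    for j in range(64):
        s = 0.0
        for k in range(64):
            s += a[i, k] * b[k, j]
        acc[i, j] = s
t_loop = time.perf_counter() - t0

# The same MACs executed inside one compiled BLAS call.
t0 = time.perf_counter()
ref = a @ b
t_vec = time.perf_counter() - t0
```

The two results agree numerically, but the looped version pays interpreter overhead on every one of the 64³ multiply-adds, which is exactly what happens inside the explicit convolution loops at a much larger scale.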
Why fixed weights are slower than float weights (~40% overhead)
The fixed-weight variant takes ~8.5 seconds longer per image (29,704 ms vs 21,183 ms). Both variants run the same NumPy arithmetic in float32 — the difference is entirely in the weight loading and dequantization step.
In the fixed variant, every weight value is loaded as an integer and divided by 128 at runtime before each use (or during the loading phase). This involves additional integer-to-float conversion operations across all weight arrays. The total weight count across all layers is substantial: the 3×3 conv layers alone have 3×3×3×28 + 3×3×28×28 + 3×3×28×56 + 3×3×56×56 = 50,148 values. Converting, scaling, and potentially reloading this many values in Python adds measurable overhead that the float variant avoids.
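The conv parameter count can be re-derived from the layer shapes listed above:

```python
# 3x3 conv layer shapes quoted in the text: (kh, kw, cin, cout).
shapes = [(3, 3, 3, 28), (3, 3, 28, 28), (3, 3, 28, 56), (3, 3, 56, 56)]
per_layer = [kh * kw * cin * cout for kh, kw, cin, cout in shapes]
total = sum(per_layer)   # weights that must be converted and scaled
```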
The higher standard deviation in the fixed run (75 ms vs 59 ms) also suggests the dequantization introduces more variable execution time, likely due to memory allocation patterns during the conversion step.
Timing Distribution
All inference times are tightly clustered, with very low variance relative to the mean: std dev is 59 ms on a ~21,000 ms mean for float (0.28% CV) and 75 ms on a ~29,700 ms mean for fixed (0.25% CV). The only notable deviation in the float run is image 072 at 21,472 ms (about 290 ms above average), likely a brief OS scheduling event in the PYNQ Linux environment; all other images fall within ±200 ms of the mean.
This tight distribution indicates that the PYNQ Linux environment is stable and the workload is deterministic. The inference time is dominated by compute, not I/O or system noise.
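The coefficients of variation follow directly from the timing table:

```python
# (mean_ms, std_ms) from the inference timing table above.
runs = {"float": (21183.0, 58.96), "fixed": (29704.0, 75.31)}

# CV = std / mean, expressed as a percentage.
cv = {name: 100.0 * std / mean for name, (mean, std) in runs.items()}
```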
Summary
| Metric | Float weights | Fixed weights |
|---|---|---|
| Accuracy | 84% | 84% |
| Avg latency | ~21.2 s/image | ~29.7 s/image |
| Std dev | 59 ms | 75 ms |
| Backend | NumPy (float32) | NumPy (float32, Q1.7 weights) |
| Fabric used | None (PS only) | None (PS only) |
Pure NumPy inference on PYNQ establishes the software-only baseline on this platform. The 84% accuracy result confirms the model and weights are correct. The ~21 second latency per image confirms that Python-level inference is impractical for any real-time application and motivates the C, HLS, and FPGA implementations documented in other parts of this project.