SW Inference PYNQ ARM Cortex (keras2c)
CIFAR-10 CNN Inference on Zynq-7000 (ARM Cortex-A9)
Performance benchmark of a CIFAR-10 CNN model compiled from Keras to C using keras2c, evaluated across three source variants and five GCC optimization levels on the Xilinx Zynq-7000 SoC (ARM Cortex-A9, ARMv7-A).
All results are CPU-only — no FPGA acceleration.
Hardware
| Property | Detail |
|---|---|
| Board | Zynq-7000 SoC |
| Processor | Dual-core ARM Cortex-A9 |
| Architecture | ARMv7-A (32-bit) |
| Frequency | ~650–866 MHz |
| Execution | Single-core, user-space |
| Dataset | 100 CIFAR-10 images (32×32 RGB, 10 per class) |
Project Setup
Clone keras2c and install dependencies:
```bash
git clone https://github.com/PlasmaControl/keras2c.git
cd keras2c
pip install -r requirements.txt
cd ..
```
Convert Keras model to C:
```bash
python convert.py
```
This generates c_model/my_model.c, my_model.h, and my_model_test_suite.c.
Copy runtime files into the model directory:
```bash
cp -r keras2c/include c_model/
cp -r keras2c/src c_model/
```
Download the image loading header:
```bash
cd c_model
wget https://raw.githubusercontent.com/nothings/stb/master/stb_image.h
cd ..
```
Run the generated test suite to verify correctness:
```bash
cd c_model
gcc -O3 my_model.c my_model_test_suite.c include/*.c -I./include -lm -o model_test
./model_test
```
Expected: average time printed, max absolute error ~1e-6.
Directory Layout
```
c_model/
  include/
  src/
  stb_image.h
  predict.c
  predict
  test_all.sh
  prediction.txt
  pred_table.txt
  my_model.c
  my_model.h
  my_model_test_suite.c
sample_images/
  airplane/  automobile/  bird/  cat/  deer/
  dog/  frog/  horse/  ship/  truck/
```
Build & Run
Compile with each optimization level:
```bash
gcc -O0 predict.c my_model.c include/*.c -I./include -lm -o predict_O0
gcc -O1 predict.c my_model.c include/*.c -I./include -lm -o predict_O1
gcc -O2 predict.c my_model.c include/*.c -I./include -lm -o predict_O2
gcc -O3 predict.c my_model.c include/*.c -I./include -lm -o predict_O3
gcc -Ofast predict.c my_model.c include/*.c -I./include -lm -o predict_Ofast
```
Set the test image and run a warmup:
```bash
export IMG=../sample_images/cat/cat_7.png
./predict_O2 $IMG >/dev/null
```
Time a single inference:
```bash
start=$(date +%s%N); ./predict_O3 $IMG >/dev/null; end=$(date +%s%N)
echo "ms = $(( (end-start)/1000000 ))"
```
Run batch evaluation on all 100 images:
```bash
chmod +x test_all.sh
./test_all.sh
```
Outputs: prediction.txt (full logs) and pred_table.txt (image | ground truth | predicted).
Model Variants
Three source versions were evaluated. Each represents a different level of optimization strategy — from untouched generated code, to loop-level hints, to structural graph changes.
c_model — Baseline
Direct, unmodified output from keras2c. The converter generates one C function per Keras layer, each operating on fully materialized input and output tensors. There is no fusion, no shared buffers, and no awareness of adjacent operations.
The residual block structure in this model is:
conv → conv → add → maxpool
Each of these is a separate function call. Every call reads its full input tensor from memory and writes its full output tensor back to memory. This means each residual block incurs 5 separate full-tensor memory passes:
- Read conv output A
- Read conv output B
- Write result of add (a new intermediate tensor)
- Read that add tensor back for pooling
- Write the maxpool output
On a memory-bandwidth-limited processor like the Cortex-A9, these repeated DRAM reads and writes dominate execution time — far more than the actual arithmetic.
c_model_optm — Loop-Level Pragma Optimizations
This variant applies source-level micro-optimizations to the keras2c runtime library files. The model graph itself (my_model.c) is left untouched — no layers are merged, no tensors are eliminated, and the execution order is identical to baseline.
The following files were modified:
k2c_convolution_layers.c
- Added `#pragma GCC ivdep` before inner loops to tell the compiler there are no loop-carried memory dependencies, allowing more aggressive instruction scheduling and potentially vectorized code.
- Added `#pragma GCC unroll` to hint that inner loops should be unrolled, reducing loop control overhead and exposing more instruction-level parallelism.
- Aligned weight and input buffers to cache-line boundaries to reduce the chance of split cache-line loads.
- Restructured loop bounds and index calculations to reduce loop-carried dependencies that could stall the pipeline.
k2c_activations.c
- Tightened the ReLU and linear activation loops — removed redundant branches and reduced per-element call overhead.
- Inlined trivial activation paths to avoid function call overhead in tight loops.
k2c_merge_layers.c
- Elementwise add/multiply/max loops manually unrolled to reduce loop overhead.
- Reduced pointer aliasing with `restrict` qualifiers where applicable, giving the compiler freedom to reorder loads and stores.
k2c_core_layers.c
- Dense layer inner loops restructured for better sequential memory access patterns.
- Improved cache locality by reordering the loop nest to match the data layout.
The intent of all these changes is to help the compiler generate faster machine code — fewer stalls, better pipelining, more vectorization. On x86 (with out-of-order execution, wide SIMD, and large caches) these hints do produce modest gains. On ARM Cortex-A9 they do not, for reasons covered in the analysis section.
c_model_optm_2 — Structural Graph Fusion
This variant makes no changes to the runtime library. All modifications are confined to my_model.c — the generated model execution file. The optimization is a manual operator fusion targeting the residual blocks.
The model contains two residual blocks. Each has the form:
```
conv → conv
   ↘   ↙
add → maxpool
```
In the original (and in c_model_optm), the add and maxpool are two separate operations called sequentially:
```c
k2c_add(...);        // writes full intermediate tensor
k2c_maxpool2d(...);  // reads that tensor, writes pooled output
```
This means the intermediate add tensor — which exists only to be immediately consumed by pooling — must be fully written to memory and then fully read back. On a 32×32×28 feature map, that is a significant amount of data movement for a result that is never used again.
The fusion replaces both calls with a single custom loop that:
- Iterates over the pooling windows directly
- Performs the elementwise add inside the pooling window loop
- Computes the max of the summed values in the same pass
- Writes only the final pooled result to the output tensor
- Never materializes the intermediate add tensor at all
Memory passes per residual block:
| Version | Passes | Description |
|---|---|---|
| Original | 5 | Read A, Read B, Write add, Read add, Write pool |
| Fused | 3 | Read A, Read B, Write pool |
This was applied to both residual blocks in the model:
- Block 1: 32×32×28 feature maps → pooled to 16×16×28
- Block 2: 16×16×56 feature maps → pooled to 8×8×56
The eliminated tensors are large — removing them reduces both peak memory usage and total memory traffic per inference. On a memory-bound platform, fewer bytes moved means lower latency, regardless of how fast the arithmetic is.
Laptop Benchmark (x86, Single Image)
Single-image results measured before deploying to the Zynq, used to validate the optimizations.
Average Latency (ms)
| Opt Level | c_model | c_model_optm | c_model_optm_2 |
|---|---|---|---|
| O0 | 120 | 23 | 23 |
| O1 | 55 | 55 | 7 |
| O2 | 10 | 9 | 6 |
| O3 | 4 | 4 | 4 |
| Ofast | 4 | 4 | 3 |
Best vs Worst (laptop): ~30–40× improvement between fully optimized (c_model_optm_2 -Ofast -march=native) and fully pessimized (c_model -fno-inline -fno-tree-vectorize -fno-builtin).
Key observations
- `c_model_optm` vs baseline: major gain only at `-O0`. At `-O1` and above, the compiler already handles what the pragmas provide.
- `c_model_optm_2` vs baseline: consistent gain across all levels, most pronounced at `-O1` (~7×).
- The bottleneck was graph structure and memory traffic, not arithmetic throughput.
Zynq Benchmark (ARM Cortex-A9, 100 Images)
c_model — Baseline
| Opt Level | Avg (ms) | Min | Max | Std Dev |
|---|---|---|---|---|
| O0 | 5526 | 5525 | 5542 | 1.69 |
| O1 | 1480 | 1479 | 1482 | 0.84 |
| O2 | 398 | 394 | 414 | 4.13 |
| O3 | 378 | 373 | 392 | 4.49 |
| Ofast | 378 | 373 | 388 | 4.11 |
Performance plateau: ~378 ms
c_model_optm — Pragma Optimized
| Opt Level | Avg (ms) | Min | Max | Std Dev |
|---|---|---|---|---|
| O0 | 5684 | 5672 | 5724 | 13.13 |
| O1 | 1506 | 1501 | 1664 | 17.23 |
| O2 | 447 | 442 | 464 | 5.31 |
| O3 | 422 | 418 | 439 | 5.32 |
| Ofast | 422 | 417 | 438 | 4.86 |
Performance plateau: ~422 ms
This variant is slower than baseline at every level on ARM. The Cortex-A9 has small caches, a narrow dual-issue pipeline with only a limited out-of-order window, modest speculation, and GCC rarely auto-vectorizes these loops for its NEON unit. Loop unrolling therefore increases register and instruction-cache pressure without improving memory behavior.
c_model_optm_2 — Fused Graph
| Opt Level | Avg (ms) | Min | Max | Std Dev |
|---|---|---|---|---|
| O0 | 936 | 934 | 944 | 2.34 |
| O1 | 379 | 373 | 397 | 5.76 |
| O2 | 363 | 358 | 378 | 4.99 |
| O3 | 362 | 358 | 377 | 4.59 |
| Ofast | 361 | 357 | 377 | 4.39 |
Performance plateau: ~361 ms
Full Comparison
| Opt Level | c_model | c_model_optm | c_model_optm_2 |
|---|---|---|---|
| O0 | 5526 | 5684 | 936 |
| O1 | 1480 | 1506 | 379 |
| O2 | 398 | 447 | 363 |
| O3 | 378 | 422 | 362 |
| Ofast | 378 | 422 | 361 |
Speedup
| Comparison | Baseline | Fused | Improvement |
|---|---|---|---|
| Best case (Ofast) | 378 ms | 361 ms | ~4.5% |
| Worst case (O0) | 5526 ms | 936 ms | ~5.9× |
Analysis
Compiler Flags: Large Gains Up to O2, Then a Hard Ceiling
On the baseline c_model, moving from -O0 to -O1 delivers a ~3.7× speedup on Zynq (5526 ms → 1480 ms). Moving from -O1 to -O2 delivers another ~3.7× (1480 ms → 398 ms). These are the compiler doing real work: eliminating redundant loads and stores, inlining small functions, removing dead code, and doing basic instruction selection. The keras2c-generated code has a lot of this low-hanging fruit — layer functions called through function pointers, temporary variables that don’t need to live in memory, bounds checks that are loop-invariant.
Beyond -O2 the gains essentially stop. -O3 adds auto-vectorization and more aggressive loop transformations. -Ofast additionally relaxes IEEE floating-point compliance to allow reassociation and use of approximate reciprocal instructions. On this workload, neither makes a measurable difference. The execution is already memory-bound: the processor spends most of its time waiting for data, not executing arithmetic. Making the arithmetic faster does not move the needle.
Why Pragmas Fail on Cortex-A9
The ARM Cortex-A9 is a dual-issue core with a short pipeline, a small out-of-order window, and small L1 caches (32 KB instruction, 32 KB data on the Zynq-7000). Compared with a desktop x86 core it has far less machinery for hiding latency, and its branch predictor is modest. This matters for pragma optimizations in several specific ways:
#pragma GCC ivdep tells the compiler that there are no loop-carried memory dependencies, freeing it to vectorize or reorder iterations. But the Cortex-A9’s NEON SIMD unit, while present, is not aggressively exploited by GCC’s auto-vectorizer for this type of loop — the vectorization threshold is rarely met for the loop structures in keras2c’s convolution code. So the pragma provides permission for an optimization the compiler doesn’t take anyway.
#pragma GCC unroll causes the compiler to replicate loop body iterations. On x86 this can reduce branch overhead and improve instruction-level parallelism, since a wide out-of-order CPU can execute multiple unrolled iterations simultaneously. The Cortex-A9's out-of-order window is far too small for that: independent iterations still run largely sequentially. The main effect is increased code size, which puts more pressure on the 32 KB instruction cache and can actually cause more cache misses. This is the likely cause of c_model_optm being slower than baseline on Zynq at every level: the unrolled convolution code no longer fits cleanly in L1 cache, causing additional fetch stalls that offset any reduction in branch overhead.
Cache alignment of weight buffers also helps less on ARM. On x86 with wider cache lines and hardware prefetchers, misaligned accesses incur significant penalties. The Cortex-A9’s cache line is 32 bytes, and its prefetcher is simpler — aligned accesses help, but the bottleneck is bandwidth to main memory (DDR), not alignment within cache.
The net result is that every optimization in c_model_optm either has no effect or makes things marginally worse on this architecture. This is not a failure of the optimization approach — it reflects a real and important architectural difference between high-performance x86 cores and embedded in-order ARM cores.
Why Structural Fusion Works Everywhere
The residual fusion in c_model_optm_2 does not rely on any particular processor feature. It reduces the total number of bytes read and written during inference. That is a property of the algorithm, not the microarchitecture, and it helps on any memory-bound platform.
Each fused residual block eliminates one full intermediate tensor. Concretely:
- Block 1 intermediate: 32×32×28 floats = 114,688 bytes written + 114,688 bytes read = ~224 KB per inference removed
- Block 2 intermediate: 16×16×56 floats = 57,344 bytes written + 57,344 bytes read = ~112 KB per inference removed
That is roughly 336 KB of memory traffic eliminated per image, just from the two fused blocks. On a Cortex-A9 at ~800 MHz with typical DDR bandwidth in the Zynq-7000 PS, this directly translates to saved time.
The effect is largest at -O0 because without compiler optimization the memory traffic completely dominates — removing a full tensor access is a huge fraction of total work. At -O3 the compiler has already compressed many operations, so the absolute gain is smaller, but it remains consistent (378 ms → 361 ms) because the memory bandwidth ceiling itself hasn’t changed.
ARM vs x86: Why the Same Optimization Behaves Differently
On an x86 laptop the picture looks different. The fused model at -O1 drops from 55 ms to 7 ms, a ~8× gain; the same fusion on Zynq at -O1 goes from 1480 ms to 379 ms, a ~3.9× gain. Both are large relative improvements, but the absolute numbers diverge dramatically because of clock speed, cache hierarchy depth, and memory bandwidth differences.
More interestingly, on x86 the pragma-optimized model (c_model_optm) shows no regression at -O1 and above. The x86 core's wide out-of-order engine absorbs the extra code size from unrolling without penalty, and the larger L1/L2/L3 caches mean the unrolled code still fits without evicting working data. The same changes that are neutral or helpful on x86 are actively harmful on the Cortex-A9's far more constrained microarchitecture.
This confirms the core lesson: the right optimization depends on the target architecture. Pragma-based loop hints are an x86-centric technique. Structural memory reduction is architecture-agnostic and effective on any memory-bound system.
Recommended Configuration
| Target | Source | Compiler Flag |
|---|---|---|
| Zynq ARM (deployment) | c_model_optm_2 | -O3 |
| Laptop / x86 (fast dev) | c_model_optm_2 | -Ofast -march=native |
| Debugging / validation | c_model | -O0 |
Avoid c_model_optm on ARM — it is strictly worse than baseline.
Final Ranking (Zynq)
| Rank | Variant | Latency |
|---|---|---|
| 1 | c_model_optm_2 at Ofast / O3 | ~361 ms |
| 2 | c_model at O3 / Ofast | ~378 ms |
| 3 | c_model_optm at O3 / Ofast | ~422 ms |
| 4 | c_model at O0 | ~5526 ms |
On embedded ARM systems, reducing memory movement at the graph level yields more benefit than micro-optimizing loops.