SW Inference PYNQ ARM Cortex (keras2c)

CIFAR-10 CNN Inference on Zynq-7000 (ARM Cortex-A9)

Performance benchmark of a CIFAR-10 CNN model compiled from Keras to C using keras2c, evaluated across three source variants and five GCC optimization levels on the Xilinx Zynq-7000 SoC (ARM Cortex-A9, ARMv7-A).

All results are CPU-only — no FPGA acceleration.


Hardware

Property      Detail
Board         Zynq-7000 SoC
Processor     Dual-core ARM Cortex-A9
Architecture  ARMv7-A (32-bit)
Frequency     ~650–866 MHz
Execution     Single-core, user-space
Dataset       100 CIFAR-10 images (32×32 RGB, 10 per class)

Project Setup

Clone keras2c and install dependencies:

git clone https://github.com/PlasmaControl/keras2c.git
cd keras2c
pip install -r requirements.txt
cd ..

Convert Keras model to C:

python convert.py

This generates c_model/my_model.c, my_model.h, and my_model_test_suite.c.

Copy runtime files into the model directory:

cp -r keras2c/include c_model/
cp -r keras2c/src c_model/

Download the image loading header:

cd c_model
wget https://raw.githubusercontent.com/nothings/stb/master/stb_image.h
cd ..

Run the generated test suite to verify correctness:

cd c_model
gcc -O3 my_model.c my_model_test_suite.c include/*.c -I./include -lm -o model_test
./model_test

Expected output: the average per-layer timing and a maximum absolute error of roughly 1e-6 against the stored Keras reference outputs.


Directory Layout

c_model/
  include/
  src/
  stb_image.h
  predict.c
  predict
  test_all.sh
  prediction.txt
  pred_table.txt
  my_model.c
  my_model.h
  my_model_test_suite.c
sample_images/
  airplane/ automobile/ bird/ cat/ deer/
  dog/ frog/ horse/ ship/ truck/

Build & Run

Compile with each optimization level:

gcc -O0 predict.c my_model.c include/*.c -I./include -lm -o predict_O0
gcc -O1 predict.c my_model.c include/*.c -I./include -lm -o predict_O1
gcc -O2 predict.c my_model.c include/*.c -I./include -lm -o predict_O2
gcc -O3 predict.c my_model.c include/*.c -I./include -lm -o predict_O3
gcc -Ofast predict.c my_model.c include/*.c -I./include -lm -o predict_Ofast

Set the test image and run a warmup:

export IMG=../sample_images/cat/cat_7.png
./predict_O2 $IMG >/dev/null

Time a single inference:

start=$(date +%s%N); ./predict_O3 $IMG >/dev/null; end=$(date +%s%N)
echo "ms = $(( (end-start)/1000000 ))"
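Latency can also be measured inside the program itself, which avoids shell and process-startup overhead. A minimal sketch, assuming a helper named elapsed_ms is added to predict.c (the helper name and the usage comment are illustrative, not part of the generated code):

```c
#include <time.h>

/* Millisecond difference between two CLOCK_MONOTONIC samples.
 * CLOCK_MONOTONIC is immune to NTP steps and wall-clock changes. */
static double elapsed_ms(struct timespec t0, struct timespec t1) {
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

/* Hypothetical usage inside predict.c, around the keras2c entry point:
 *   struct timespec t0, t1;
 *   clock_gettime(CLOCK_MONOTONIC, &t0);
 *   my_model(&input, &output);
 *   clock_gettime(CLOCK_MONOTONIC, &t1);
 *   printf("latency = %.3f ms\n", elapsed_ms(t0, t1));
 */
```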

Run batch evaluation on all 100 images:

chmod +x test_all.sh
./test_all.sh

Outputs: prediction.txt (full logs) and pred_table.txt (image | ground truth | predicted).


Model Variants

Three source versions were evaluated. Each represents a different optimization strategy: from untouched generated code, to loop-level hints, to structural graph changes.


c_model — Baseline

Direct, unmodified output from keras2c. The converter generates one C function per Keras layer, each operating on fully materialized input and output tensors. There is no fusion, no shared buffers, and no awareness of adjacent operations.

The residual block structure in this model is:

conv → conv → add → maxpool

Each of these is a separate function call. Every call reads its full input tensor from memory and writes its full output tensor back to memory. This means each residual block incurs 5 separate full-tensor memory passes:

  1. Read conv output A
  2. Read conv output B
  3. Write result of add (a new intermediate tensor)
  4. Read that add tensor back for pooling
  5. Write the maxpool output

On a memory-bandwidth-limited processor like the Cortex-A9, these repeated DRAM reads and writes dominate execution time — far more than the actual arithmetic.


c_model_optm — Loop-Level Pragma Optimizations

This variant applies source-level micro-optimizations to the keras2c runtime library files. The model graph itself (my_model.c) is left untouched — no layers are merged, no tensors are eliminated, and the execution order is identical to baseline.

The following files were modified:

k2c_convolution_layers.c

  • Added #pragma GCC ivdep before inner loops to tell the compiler there are no loop-carried memory dependencies, allowing it to generate more aggressive instruction scheduling and potentially vectorized code.
  • Added #pragma GCC unroll to hint that inner loops should be unrolled, reducing loop control overhead and exposing more instruction-level parallelism.
  • Weight and input buffers aligned to cache-line boundaries to reduce the chance of split cache-line loads.
  • Restructured loop bounds and index calculations to reduce loop-carried dependencies that could stall the pipeline.
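The changes above can be sketched on a simplified loop. This is a hypothetical 1-D correlation, not the actual k2c_conv2d code (whose loop nest is deeper), but the hints are applied the same way:

```c
#include <stddef.h>

/* Simplified 1-D correlation: out[i] = sum_j in[i+j] * w[j].
 * restrict tells the compiler the three buffers never alias. */
void conv1d_hinted(const float *restrict in, const float *restrict w,
                   float *restrict out, size_t n, size_t k) {
    #pragma GCC ivdep        /* stores to out[] never feed later loads */
    for (size_t i = 0; i + k <= n; ++i) {
        float acc = 0.0f;
        #pragma GCC unroll 4 /* replicate body to cut loop-control overhead */
        for (size_t j = 0; j < k; ++j)
            acc += in[i + j] * w[j];
        out[i] = acc;
    }
}
```

Note that #pragma GCC unroll requires GCC 8 or later; on older toolchains it is silently ignored.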

k2c_activations.c

  • Tightened the ReLU and linear activation loops — removed redundant branches and reduced per-element call overhead.
  • Inlined trivial activation paths to avoid function call overhead in tight loops.

k2c_merge_layers.c

  • Elementwise add/multiply/max loops manually unrolled to reduce loop overhead.
  • Pointer aliasing reduced using restrict qualifiers where applicable, giving the compiler freedom to reorder loads and stores.
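A sketch of what the merge-layer change looks like. The function name add_restrict is illustrative (the real routine is k2c_add and takes k2c_tensor arguments); the unroll factor of 4 is an assumption:

```c
#include <stddef.h>

/* Without restrict, the compiler must assume out may alias a or b and
 * re-load operands after every store; restrict removes that constraint. */
void add_restrict(const float *restrict a, const float *restrict b,
                  float *restrict out, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {   /* manually unrolled by 4 */
        out[i]     = a[i]     + b[i];
        out[i + 1] = a[i + 1] + b[i + 1];
        out[i + 2] = a[i + 2] + b[i + 2];
        out[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; ++i)             /* remainder elements */
        out[i] = a[i] + b[i];
}
```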

k2c_core_layers.c

  • Dense layer inner loops restructured for better sequential memory access patterns.
  • Improved cache locality by reordering the loop nest to match the data layout.
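The dense-layer reordering amounts to choosing the loop nest that walks the weight matrix in storage order. A minimal sketch, assuming row-major weights of shape [n_in][n_out] (the function name is hypothetical):

```c
#include <stddef.h>

/* y[j] += x[i] * W[i][j]. With i outer and j inner, W is traversed
 * sequentially, so each cache line of W is fully consumed once loaded. */
void dense_rowmajor(const float *x, const float *W, float *y,
                    size_t n_in, size_t n_out) {
    for (size_t j = 0; j < n_out; ++j)
        y[j] = 0.0f;
    for (size_t i = 0; i < n_in; ++i) {
        const float xi = x[i];               /* hoist the invariant operand */
        for (size_t j = 0; j < n_out; ++j)
            y[j] += xi * W[i * n_out + j];
    }
}
```

Swapping the nest (j outer, i inner) would stride through W by n_out floats per step, touching a new cache line on almost every access.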

The intent of all these changes is to help the compiler generate faster machine code — fewer stalls, better pipelining, more vectorization. On x86 (with out-of-order execution, wide SIMD, and large caches) these hints do produce modest gains. On ARM Cortex-A9 they do not, for reasons covered in the analysis section.


c_model_optm_2 — Structural Graph Fusion

This variant makes no changes to the runtime library. All modifications are confined to my_model.c — the generated model execution file. The optimization is a manual operator fusion targeting the residual blocks.

The model contains two residual blocks. Each has the form:

conv → conv
  ↘      ↙
add → maxpool

In the original (and in c_model_optm), the add and maxpool are two separate operations called sequentially:

k2c_add(...);        // writes full intermediate tensor
k2c_maxpool2d(...);  // reads that tensor, writes pooled output

This means the intermediate add tensor — which exists only to be immediately consumed by pooling — must be fully written to memory and then fully read back. On a 32×32×28 feature map, that is a significant amount of data movement for a result that is never used again.

The fusion replaces both calls with a single custom loop that:

  • Iterates over the pooling windows directly
  • Performs the elementwise add inside the pooling window loop
  • Computes the max of the summed values in the same pass
  • Writes only the final pooled result to the output tensor
  • Never materializes the intermediate add tensor at all
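The steps above can be sketched as follows. This is a simplification: the real fused loop in my_model.c operates on k2c_tensor structs, while this version (with the hypothetical name fused_add_maxpool) assumes raw channels-last float buffers and a 2×2, stride-2 pool:

```c
#include <stddef.h>

/* Fused elementwise-add + 2x2/stride-2 max-pool over two HxWxC inputs.
 * Each sum a[..]+b[..] is formed inside the pooling window and only the
 * pooled maximum is stored: the intermediate add tensor never exists. */
void fused_add_maxpool(const float *a, const float *b, float *out,
                       size_t h, size_t w, size_t c) {
    for (size_t y = 0; y + 1 < h; y += 2)
        for (size_t x = 0; x + 1 < w; x += 2)
            for (size_t ch = 0; ch < c; ++ch) {
                float m = a[(y * w + x) * c + ch] + b[(y * w + x) * c + ch];
                for (size_t dy = 0; dy < 2; ++dy)
                    for (size_t dx = 0; dx < 2; ++dx) {
                        size_t idx = ((y + dy) * w + (x + dx)) * c + ch;
                        float s = a[idx] + b[idx];
                        if (s > m) m = s;
                    }
                out[((y / 2) * (w / 2) + x / 2) * c + ch] = m;
            }
}
```

Each summed element is computed up to twice here (once for the seed, once in the window scan), but that redundant arithmetic is far cheaper than the full-tensor write and read-back it replaces.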

Memory passes per residual block:

Version   Passes  Description
Original  5       Read A, Read B, Write add, Read add, Write pool
Fused     3       Read A, Read B, Write pool

This was applied to both residual blocks in the model:

  • Block 1: 32×32×28 feature maps → pooled to 16×16×28
  • Block 2: 16×16×56 feature maps → pooled to 8×8×56

The eliminated tensors are large — removing them reduces both peak memory usage and total memory traffic per inference. On a memory-bound platform, fewer bytes moved means lower latency, regardless of how fast the arithmetic is.


Laptop Benchmark (x86, Single Image)

Results on a single image before deploying to Zynq. Used to validate optimizations.

Average Latency (ms)

Opt Level  c_model  c_model_optm  c_model_optm_2
O0         120      23            23
O1         55       55            7
O2         10       9             6
O3         4        4             4
Ofast      4        4             3

Best vs Worst (laptop): ~30–40× improvement between fully optimized (c_model_optm_2 -Ofast -march=native) and fully pessimized (c_model -fno-inline -fno-tree-vectorize -fno-builtin).

Key observations

  • c_model_optm vs baseline: major gain only at -O0. At -O1 and above the compiler already handles what the pragmas provide.
  • c_model_optm_2 vs baseline: consistent gain across all levels, most pronounced at -O1 (~7×).
  • The bottleneck was graph structure and memory traffic, not arithmetic throughput.

Zynq Benchmark (ARM Cortex-A9, 100 Images)

c_model — Baseline

Opt Level  Avg (ms)  Min   Max   Std Dev
O0         5526      5525  5542  1.69
O1         1480      1479  1482  0.84
O2         398       394   414   4.13
O3         378       373   392   4.49
Ofast      378       373   388   4.11

Performance plateau: ~378 ms

c_model_optm — Pragma Optimized

Opt Level  Avg (ms)  Min   Max   Std Dev
O0         5684      5672  5724  13.13
O1         1506      1501  1664  17.23
O2         447       442   464   5.31
O3         422       418   439   5.32
Ofast      422       417   438   4.86

Performance plateau: ~422 ms

This variant is slower than baseline at every level on ARM. The Cortex-A9 has small caches, a narrow dual-issue pipeline with only a modest out-of-order window, and GCC rarely auto-vectorizes these loops for its NEON unit. Loop unrolling therefore increases register and instruction-cache pressure without improving memory behavior.

c_model_optm_2 — Fused Graph

Opt Level  Avg (ms)  Min  Max  Std Dev
O0         936       934  944  2.34
O1         379       373  397  5.76
O2         363       358  378  4.99
O3         362       358  377  4.59
Ofast      361       357  377  4.39

Performance plateau: ~361 ms

Full Comparison

Opt Level  c_model  c_model_optm  c_model_optm_2
O0         5526     5684          936
O1         1480     1506          379
O2         398      447           363
O3         378      422           362
Ofast      378      422           361

Speedup

Comparison         Baseline  Fused   Improvement
Best case (Ofast)  378 ms    361 ms  ~4.5%
Worst case (O0)    5526 ms   936 ms  ~5.9×

Analysis

Compiler Flags: Large Gains Up to O2, Then a Hard Ceiling

On the baseline c_model, moving from -O0 to -O1 delivers a ~3.7× speedup on Zynq (5526 ms → 1480 ms). Moving from -O1 to -O2 delivers another ~3.7× (1480 ms → 398 ms). These are the compiler doing real work: eliminating redundant loads and stores, inlining small functions, removing dead code, and doing basic instruction selection. The keras2c-generated code has a lot of this low-hanging fruit — layer functions called through function pointers, temporary variables that don’t need to live in memory, bounds checks that are loop-invariant.

Beyond -O2 the gains essentially stop. -O3 adds auto-vectorization and more aggressive loop transformations. -Ofast additionally relaxes IEEE floating-point compliance to allow reassociation and use of approximate reciprocal instructions. On this workload, neither makes a measurable difference. The execution is already memory-bound: the processor spends most of its time waiting for data, not executing arithmetic. Making the arithmetic faster does not move the needle.

Why Pragmas Fail on Cortex-A9

The ARM Cortex-A9 is a dual-issue core with a short pipeline, a modest out-of-order window, and small L1 caches (32 KB instruction, 32 KB data on the Zynq-7000). Compared with a desktop x86 core, its instruction window, reorder capacity, and branch predictor are far more limited. This matters for pragma optimizations in several specific ways:

#pragma GCC ivdep tells the compiler that there are no loop-carried memory dependencies, freeing it to vectorize or reorder iterations. But the Cortex-A9’s NEON SIMD unit, while present, is not aggressively exploited by GCC’s auto-vectorizer for this type of loop — the vectorization threshold is rarely met for the loop structures in keras2c’s convolution code. So the pragma provides permission for an optimization the compiler doesn’t take anyway.

#pragma GCC unroll causes the compiler to replicate loop body iterations. On x86 this can reduce branch overhead and improve instruction-level parallelism, since a wide out-of-order CPU can execute multiple unrolled iterations simultaneously. On the Cortex-A9, with its narrow issue width and small instruction window, independent iterations overlap only slightly and still run largely sequentially. The main effect is increased code size, which puts more pressure on the 32 KB instruction cache and can actually cause more cache misses. This is the likely cause of c_model_optm being slower than baseline on Zynq at every level: the unrolled convolution code no longer fits cleanly in L1 cache, causing additional fetch stalls that offset any reduction in branch overhead.

Cache alignment of weight buffers also helps less on ARM. On x86 with wider cache lines and hardware prefetchers, misaligned accesses incur significant penalties. The Cortex-A9’s cache line is 32 bytes, and its prefetcher is simpler — aligned accesses help, but the bottleneck is bandwidth to main memory (DDR), not alignment within cache.

The net result is that every optimization in c_model_optm either has no effect or makes things marginally worse on this architecture. This is not a failure of the optimization approach — it reflects a real and important architectural difference between high-performance x86 cores and embedded in-order ARM cores.

Why Structural Fusion Works Everywhere

The residual fusion in c_model_optm_2 does not rely on any particular processor feature. It reduces the total number of bytes read and written during inference. That is a property of the algorithm, not the microarchitecture, and it helps on any memory-bound platform.

Each fused residual block eliminates one full intermediate tensor. Concretely:

  • Block 1 intermediate: 32×32×28 floats = 114,688 bytes written + 114,688 bytes read = ~224 KB per inference removed
  • Block 2 intermediate: 16×16×56 floats = 57,344 bytes written + 57,344 bytes read = ~112 KB per inference removed

That is roughly 336 KB of memory traffic eliminated per image, just from the two fused blocks. On a Cortex-A9 at ~800 MHz with typical DDR bandwidth in the Zynq-7000 PS, this directly translates to saved time.

The effect is largest at -O0 because without compiler optimization the memory traffic completely dominates — removing a full tensor access is a huge fraction of total work. At -O3 the compiler has already compressed many operations, so the absolute gain is smaller, but it remains consistent (378 ms → 361 ms) because the memory bandwidth ceiling itself hasn’t changed.

ARM vs x86: Why the Same Optimization Behaves Differently

On an x86 laptop the picture looks different. The fused model at -O1 drops from 55 ms to 7 ms, a ~7× gain. The same fusion on Zynq at -O1 gives 1480 ms → 379 ms, roughly 4×. The relative gains are of the same order, but the absolute numbers diverge dramatically because of differences in clock speed, cache hierarchy depth, and memory bandwidth.

More interestingly, on x86 the pragma-optimized model (c_model_optm) shows no regression at -O1 and above. This is because the x86 out-of-order engine absorbs the extra code size from unrolling without penalty, and the larger L1/L2/L3 caches mean the unrolled code still fits without evicting working data. The same changes that are neutral or helpful on x86 are actively harmful on the Cortex-A9’s constrained, in-order microarchitecture.

This confirms the core lesson: the right optimization depends on the target architecture. Pragma-based loop hints are an x86-centric technique. Structural memory reduction is architecture-agnostic and effective on any memory-bound system.


Recommended Configurations

Target                   Source          Compiler Flag
Zynq ARM (deployment)    c_model_optm_2  -O3
Laptop / x86 (fast dev)  c_model_optm_2  -Ofast -march=native
Debugging / validation   c_model         -O0

Avoid c_model_optm on ARM — it is strictly worse than baseline.


Final Ranking (Zynq)

Rank  Variant                       Latency
1     c_model_optm_2 at Ofast / O3  ~361 ms
2     c_model at O3 / Ofast         ~378 ms
3     c_model_optm at O3 / Ofast    ~422 ms
4     c_model at O0                 ~5526 ms

On embedded ARM systems, reducing memory movement at the graph level yields more benefit than micro-optimizing loops.