SW Inference PYNQ ARM Cortex (keras2c)
CIFAR-10 CNN Inference on Zynq-7000 (ARM Cortex-A9)
Performance benchmark of a CIFAR-10 CNN model compiled from Keras to C using keras2c, evaluated across three source variants and five GCC optimization levels on the Xilinx Zynq-7000 SoC (ARM Cortex-A9, ARMv7-A).
All results are CPU-only — no FPGA acceleration.
Hardware
| Property | Detail |
|---|---|
| Board | Zynq-7000 SoC |
| Processor | Dual-core ARM Cortex-A9 |
| Architecture | ARMv7-A (32-bit) |
| Frequency | ~650–866 MHz |
| Execution | Single-core, user-space |
| Dataset | 100 CIFAR-10 images (32×32 RGB, 10 per class) |
Project Setup
Clone keras2c and install dependencies:
```bash
git clone https://github.com/PlasmaControl/keras2c.git
cd keras2c
pip install -r requirements.txt
cd ..
```
Convert Keras model to C:
```bash
python convert.py
```
This generates c_model/my_model.c, my_model.h, and my_model_test_suite.c.
Copy runtime files into the model directory:
```bash
cp -r keras2c/include c_model/
cp -r keras2c/src c_model/
```
Download the image loading header:
```bash
cd c_model
wget https://raw.githubusercontent.com/nothings/stb/master/stb_image.h
cd ..
```
Run the generated test suite to verify correctness:
```bash
cd c_model
gcc -O3 my_model.c my_model_test_suite.c include/*.c -I./include -lm -o model_test
./model_test
```
Expected: average time printed, max absolute error ~1e-6.
Directory Layout
```
c_model/
  include/
  src/
  stb_image.h
  predict.c
  predict
  test_all.sh
  prediction.txt
  pred_table.txt
  my_model.c
  my_model.h
  my_model_test_suite.c
sample_images/
  airplane/  automobile/  bird/  cat/  deer/
  dog/  frog/  horse/  ship/  truck/
```
Build & Run
Compile with each optimization level:
```bash
gcc -O0 predict.c my_model.c include/*.c -I./include -lm -o predict_O0
gcc -O1 predict.c my_model.c include/*.c -I./include -lm -o predict_O1
gcc -O2 predict.c my_model.c include/*.c -I./include -lm -o predict_O2
gcc -O3 predict.c my_model.c include/*.c -I./include -lm -o predict_O3
gcc -Ofast predict.c my_model.c include/*.c -I./include -lm -o predict_Ofast
```
Set the test image and run a warmup:
```bash
export IMG=../sample_images/cat/cat_7.png
./predict_O2 $IMG >/dev/null
```
Time a single inference:
```bash
start=$(date +%s%N); ./predict_O3 $IMG >/dev/null; end=$(date +%s%N)
echo "ms = $(( (end-start)/1000000 ))"
```
Run batch evaluation on all 100 images:
```bash
chmod +x test_all.sh
./test_all.sh
```
Outputs: prediction.txt (full logs) and pred_table.txt (image | ground truth | predicted).
Model Variants
Three source versions were evaluated. Each represents a different level of optimization strategy — from untouched generated code, to loop-level hints, to structural graph changes.
c_model — Baseline
Direct, unmodified output from keras2c. The converter generates one C function per Keras layer, each operating on fully materialized input and output tensors. There is no fusion, no shared buffers, and no awareness of adjacent operations.
The residual block structure in this model is:
conv → conv → add → maxpool
Each of these is a separate function call. Every call reads its full input tensor from memory and writes its full output tensor back to memory. This means each residual block incurs 5 separate full-tensor memory passes:
- Read conv output A
- Read conv output B
- Write result of add (a new intermediate tensor)
- Read that add tensor back for pooling
- Write the maxpool output
On a memory-bandwidth-limited processor like the Cortex-A9, these repeated DRAM reads and writes dominate execution time — far more than the actual arithmetic.
c_model_optm — Loop-Level Pragma Optimizations
This variant applies source-level micro-optimizations to the keras2c runtime library files. The model graph itself (my_model.c) is left untouched — no layers are merged, no tensors are eliminated, and the execution order is identical to baseline.
The following files were modified:
k2c_convolution_layers.c
- Added `#pragma GCC ivdep` before inner loops to tell the compiler there are no loop-carried memory dependencies, allowing more aggressive instruction scheduling and potentially vectorized code.
- Added `#pragma GCC unroll` to hint that inner loops should be unrolled, reducing loop control overhead and exposing more instruction-level parallelism.
- Aligned weight and input buffers to cache-line boundaries to reduce the chance of split cache-line loads.
- Restructured loop bounds and index calculations to reduce loop-carried dependencies that could stall the pipeline.
k2c_activations.c
- Tightened the ReLU and linear activation loops — removed redundant branches and reduced per-element call overhead.
- Inlined trivial activation paths to avoid function call overhead in tight loops.
k2c_merge_layers.c
- Elementwise add/multiply/max loops manually unrolled to reduce loop overhead.
- Reduced pointer aliasing with `restrict` qualifiers where applicable, giving the compiler freedom to reorder loads and stores.
k2c_core_layers.c
- Dense layer inner loops restructured for better sequential memory access patterns.
- Improved cache locality by reordering the loop nest to match the data layout.
The intent of all these changes is to help the compiler generate faster machine code — fewer stalls, better pipelining, more vectorization. On x86 (with out-of-order execution, wide SIMD, and large caches) these hints do produce modest gains. On ARM Cortex-A9 they do not, for reasons covered in the analysis section.
c_model_optm_2 — Structural Graph Fusion
This variant makes no changes to the runtime library. All modifications are confined to my_model.c — the generated model execution file. The optimization is a manual operator fusion targeting the residual blocks.
The model contains two residual blocks. Each has the form:
```
conv → conv
   ↘   ↙
add → maxpool
```
In the original (and in c_model_optm), the add and maxpool are two separate operations called sequentially:
```c
k2c_add(...);        // writes full intermediate tensor
k2c_maxpool2d(...);  // reads that tensor, writes pooled output
```
This means the intermediate add tensor — which exists only to be immediately consumed by pooling — must be fully written to memory and then fully read back. On a 32×32×28 feature map, that is a significant amount of data movement for a result that is never used again.
The fusion replaces both calls with a single custom loop that:
- Iterates over the pooling windows directly
- Performs the elementwise add inside the pooling window loop
- Computes the max of the summed values in the same pass
- Writes only the final pooled result to the output tensor
- Never materializes the intermediate add tensor at all
Memory passes per residual block:
| Version | Passes | Description |
|---|---|---|
| Original | 5 | Read A, Read B, Write add, Read add, Write pool |
| Fused | 3 | Read A, Read B, Write pool |
This was applied to both residual blocks in the model:
- Block 1: 32×32×28 feature maps → pooled to 16×16×28
- Block 2: 16×16×56 feature maps → pooled to 8×8×56
The eliminated tensors are large — removing them reduces both peak memory usage and total memory traffic per inference. On a memory-bound platform, fewer bytes moved means lower latency, regardless of how fast the arithmetic is.
Laptop Benchmark (x86, Single Image)
Single-image results measured before deploying to the Zynq, used to validate the optimizations.
Average Latency (ms)
| Opt Level | c_model | c_model_optm | c_model_optm_2 |
|---|---|---|---|
| O0 | 120 | 23 | 23 |
| O1 | 55 | 55 | 7 |
| O2 | 10 | 9 | 6 |
| O3 | 4 | 4 | 4 |
| Ofast | 4 | 4 | 3 |
Best vs Worst (laptop): ~30–40× improvement between fully optimized (c_model_optm_2 -Ofast -march=native) and fully pessimized (c_model -fno-inline -fno-tree-vectorize -fno-builtin).
Key observations
- `c_model_optm` vs baseline: major gain only at `-O0`. At `-O1` and above, the compiler already handles what the pragmas provide.
- `c_model_optm_2` vs baseline: consistent gain across all levels, most pronounced at `-O1` (~7×).
- The bottleneck was graph structure and memory traffic, not arithmetic throughput.
Zynq Benchmark (ARM Cortex-A9, 100 Images)
c_model — Baseline
| Opt Level | Avg (ms) | Min | Max | Std Dev |
|---|---|---|---|---|
| O0 | 5526 | 5525 | 5542 | 1.69 |
| O1 | 1480 | 1479 | 1482 | 0.84 |
| O2 | 398 | 394 | 414 | 4.13 |
| O3 | 378 | 373 | 392 | 4.49 |
| Ofast | 378 | 373 | 388 | 4.11 |
Performance plateau: ~378 ms
c_model_optm — Pragma Optimized
| Opt Level | Avg (ms) | Min | Max | Std Dev |
|---|---|---|---|---|
| O0 | 5684 | 5672 | 5724 | 13.13 |
| O1 | 1506 | 1501 | 1664 | 17.23 |
| O2 | 447 | 442 | 464 | 5.31 |
| O3 | 422 | 418 | 439 | 5.32 |
| Ofast | 422 | 417 | 438 | 4.86 |
Performance plateau: ~422 ms
This variant is slower than baseline at every level on ARM. The Cortex-A9 has small caches, a narrow dual-issue pipeline with only a limited out-of-order window, modest speculation, and GCC rarely auto-vectorizes these loops for its NEON unit. Loop unrolling therefore increases register and instruction-cache pressure without improving memory behavior.
c_model_optm_2 — Fused Graph
| Opt Level | Avg (ms) | Min | Max | Std Dev |
|---|---|---|---|---|
| O0 | 936 | 934 | 944 | 2.34 |
| O1 | 379 | 373 | 397 | 5.76 |
| O2 | 363 | 358 | 378 | 4.99 |
| O3 | 362 | 358 | 377 | 4.59 |
| Ofast | 361 | 357 | 377 | 4.39 |
Performance plateau: ~361 ms
Full Comparison
| Opt Level | c_model | c_model_optm | c_model_optm_2 |
|---|---|---|---|
| O0 | 5526 | 5684 | 936 |
| O1 | 1480 | 1506 | 379 |
| O2 | 398 | 447 | 363 |
| O3 | 378 | 422 | 362 |
| Ofast | 378 | 422 | 361 |
Speedup
| Comparison | Baseline | Fused | Improvement |
|---|---|---|---|
| Best case (Ofast) | 378 ms | 361 ms | ~4.5% |
| Worst case (O0) | 5526 ms | 936 ms | ~5.9× |
Analysis
Compiler Flags: Large Gains Up to O2, Then a Hard Ceiling
On the baseline c_model, moving from -O0 to -O1 delivers a ~3.7× speedup on Zynq (5526 ms → 1480 ms). Moving from -O1 to -O2 delivers another ~3.7× (1480 ms → 398 ms). These are the compiler doing real work: eliminating redundant loads and stores, inlining small functions, removing dead code, and doing basic instruction selection. The keras2c-generated code has a lot of this low-hanging fruit — layer functions called through function pointers, temporary variables that don’t need to live in memory, bounds checks that are loop-invariant.
Beyond -O2 the gains essentially stop. -O3 adds auto-vectorization and more aggressive loop transformations. -Ofast additionally relaxes IEEE floating-point compliance to allow reassociation and use of approximate reciprocal instructions. On this workload, neither makes a measurable difference. The execution is already memory-bound: the processor spends most of its time waiting for data, not executing arithmetic. Making the arithmetic faster does not move the needle.
Why Pragmas Fail on Cortex-A9
The ARM Cortex-A9 is a dual-issue core with a short pipeline, a small out-of-order window, and small L1 caches (32 KB instruction, 32 KB data on the Zynq-7000). Compared with a desktop x86 core it has far less machinery for hiding latency, and its branch predictor is modest. This matters for pragma optimizations in several specific ways:
#pragma GCC ivdep tells the compiler that there are no loop-carried memory dependencies, freeing it to vectorize or reorder iterations. But the Cortex-A9’s NEON SIMD unit, while present, is not aggressively exploited by GCC’s auto-vectorizer for this type of loop — the vectorization threshold is rarely met for the loop structures in keras2c’s convolution code. So the pragma provides permission for an optimization the compiler doesn’t take anyway.
#pragma GCC unroll causes the compiler to replicate loop body iterations. On x86 this can reduce branch overhead and improve instruction-level parallelism, since a wide out-of-order CPU can execute multiple unrolled iterations simultaneously. The Cortex-A9's out-of-order window is far too small for that: independent iterations still run largely sequentially. The main effect is increased code size, which puts more pressure on the 32 KB instruction cache and can actually cause more cache misses. This is the likely cause of c_model_optm being slower than baseline on Zynq at every level: the unrolled convolution code no longer fits cleanly in L1 cache, causing additional fetch stalls that offset any reduction in branch overhead.
Cache alignment of weight buffers also helps less on ARM. On x86 with wider cache lines and hardware prefetchers, misaligned accesses incur significant penalties. The Cortex-A9’s cache line is 32 bytes, and its prefetcher is simpler — aligned accesses help, but the bottleneck is bandwidth to main memory (DDR), not alignment within cache.
The net result is that every optimization in c_model_optm either has no effect or makes things marginally worse on this architecture. This is not a failure of the optimization approach — it reflects a real and important architectural difference between high-performance x86 cores and embedded in-order ARM cores.
Why Structural Fusion Works Everywhere
The residual fusion in c_model_optm_2 does not rely on any particular processor feature. It reduces the total number of bytes read and written during inference. That is a property of the algorithm, not the microarchitecture, and it helps on any memory-bound platform.
Each fused residual block eliminates one full intermediate tensor. Concretely:
- Block 1 intermediate: 32×32×28 floats = 114,688 bytes written + 114,688 bytes read = ~224 KB per inference removed
- Block 2 intermediate: 16×16×56 floats = 57,344 bytes written + 57,344 bytes read = ~112 KB per inference removed
That is roughly 336 KB of memory traffic eliminated per image, just from the two fused blocks. On a Cortex-A9 at ~800 MHz with typical DDR bandwidth in the Zynq-7000 PS, this directly translates to saved time.
The effect is largest at -O0 because without compiler optimization the memory traffic completely dominates — removing a full tensor access is a huge fraction of total work. At -O3 the compiler has already compressed many operations, so the absolute gain is smaller, but it remains consistent (378 ms → 361 ms) because the memory bandwidth ceiling itself hasn’t changed.
ARM vs x86: Why the Same Optimization Behaves Differently
On an x86 laptop the picture looks different. The fused model at -O1 drops from 55 ms to 7 ms, a ~8× gain; the same fusion on Zynq at -O1 goes from 1480 ms to 379 ms, a ~3.9× gain. Both are large relative improvements, but the absolute numbers diverge dramatically because of clock speed, cache hierarchy depth, and memory bandwidth differences.
More interestingly, on x86 the pragma-optimized model (c_model_optm) shows no regression at -O1 and above. The x86 core's wide out-of-order engine absorbs the extra code size from unrolling without penalty, and the larger L1/L2/L3 caches mean the unrolled code still fits without evicting working data. The same changes that are neutral or helpful on x86 are actively harmful on the Cortex-A9's far more constrained microarchitecture.
This confirms the core lesson: the right optimization depends on the target architecture. Pragma-based loop hints are an x86-centric technique. Structural memory reduction is architecture-agnostic and effective on any memory-bound system.
Recommended Configuration
| Target | Source | Compiler Flag |
|---|---|---|
| Zynq ARM (deployment) | c_model_optm_2 | -O3 |
| Laptop / x86 (fast dev) | c_model_optm_2 | -Ofast -march=native |
| Debugging / validation | c_model | -O0 |
Avoid c_model_optm on ARM — it is strictly worse than baseline.
Final Ranking (Zynq)
| Rank | Variant | Latency |
|---|---|---|
| 1 | c_model_optm_2 at Ofast / O3 | ~361 ms |
| 2 | c_model at O3 / Ofast | ~378 ms |
| 3 | c_model_optm at O3 / Ofast | ~422 ms |
| 4 | c_model at O0 | ~5526 ms |
On embedded ARM systems, reducing memory movement at the graph level yields more benefit than micro-optimizing loops.