Manual C-HLS with Vitis
CIFAR-10 CNN Inference — Manual Vitis HLS Implementation
Mini-ResNet trained on CIFAR-10, taken from a Keras model all the way to a synthesisable RTL IP core targeting the Xilinx xc7z020 (Zynq-7020) via Vitis HLS. The implementation covers three stages: a standalone C++ float32 reference, an HLS C simulation kernel, and full RTL synthesis with AXI interface export.
All code is written manually — no HLS4ML, no keras2c. The network architecture, weight loading, operator implementations, and HLS pragmas were all authored from scratch.
Repository Layout
.
├── infer_txt_localds_fixed.cpp # Standalone C++ inference (float32 reference)
├── infer_txt_localds_fixed_hls.cpp # Vitis HLS kernel (synthesis target)
├── tb_cifar10.cpp # HLS testbench (10 images, pass/fail)
├── dump_weights_to_hls.py # Converts model_weights_fixed8/ → weights_hls/*.inc
├── run_hls.tcl # Vitis HLS flow script
├── model_weights_fixed8/ # Q1.7 quantized weight files (.txt)
├── weights_hls/ # Weight ROM include files (.inc, generated)
└── sample_images/ # 10 test images, one per class
Network Architecture
Input: 32×32×3
Block 1 (Residual):
conv3x3 (3 → 28, stride=1, same padding, ReLU)
conv3x3 (28 → 28, stride=1, same padding, ReLU)
conv1x1 (3 → 28, shortcut, no activation)
add(conv3x3_out, shortcut) → maxpool2×2
Block 2 (Residual):
conv3x3 (28 → 56, stride=1, same padding, ReLU)
conv3x3 (56 → 56, stride=1, same padding, ReLU)
conv1x1 (28 → 56, shortcut, no activation)
add(conv3x3_out, shortcut) → maxpool2×2
Head:
GlobalAvgPool → Dense(56 → 10) → argmax
Weights are stored as Q1.7 fixed-point integers; to recover a float value,
divide each integer by 128. This quantization scheme was chosen to allow
weight storage as plain int arrays in ROM, with dequantization happening
inline in the MAC loop (w = rom_k[idx] / 128.0).
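The Q1.7 round trip can be sketched in plain C++. The rounding and saturation behaviour shown here is an assumption for illustration — the actual dump_weights_to_hls.py may use a different rounding mode:

```cpp
#include <cstdint>
#include <cmath>

// Q1.7: 1 sign bit + 7 fractional bits, representable range [-1, 127/128].
constexpr float W_SCALE = 128.0f;

// Quantize a float weight to the nearest Q1.7 integer, saturating at the
// format limits (rounding mode assumed, not taken from the converter script).
int8_t quantize_q17(float w) {
    long q = std::lround(w * W_SCALE);
    if (q > 127)  q = 127;
    if (q < -128) q = -128;
    return static_cast<int8_t>(q);
}

// Dequantize exactly as done inline in the MAC loop: w = rom_k[idx] / 128.0
float dequantize_q17(int8_t q) { return q / W_SCALE; }
```

The worst-case quantization error per weight is half a step, 1/256.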
Setup
Generate weight ROM include files (run once)
python3 dump_weights_to_hls.py
Reads model_weights_fixed8/*.txt and writes weights_hls/w_*.inc, one file
per layer kernel and bias. These .inc files are directly #included into
the HLS kernel’s static ROM arrays.
Set HLS environment
export XILINX_HLS=/tools/Xilinx/Vitis_HLS/2022.2
Stage 1 — Standalone C++ Float Reference
infer_txt_localds_fixed.cpp is a pure float32 C++ implementation of the
full forward pass. It has no HLS pragmas, no fixed-point types, and no
synthesis constraints. Its purpose is to validate that the network architecture
and weight loading are correct before introducing HLS-specific complexity.
Build and run
g++ -O3 -march=native -std=c++17 -o infer infer_txt_localds_fixed.cpp -lm
./infer sample_images/cat/cat_0.png
Implementation details
The reference uses a Tensor4D struct — a simple 4D array [N, H, W, C]
stored contiguously in row-major order with a std::vector<float> backing
store. All layers operate on this struct.
Weight loading reads the model_weights_fixed8/ text files, strips Windows
\r\n line endings (see problems section), and dequantizes on load
(value = int_token / 128.0f). Kernels are stored as ConvWeight and
DenseWeight structs holding std::vector<float>.
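A minimal sketch of this layout, assuming the member and method names shown here (the actual struct in the source file may differ):

```cpp
#include <vector>
#include <cstddef>

// Sketch of the Tensor4D layout described above: [N, H, W, C], contiguous,
// row-major, backed by std::vector<float>. Channel varies fastest, then
// width, then height, then batch.
struct Tensor4D {
    int N, H, W, C;
    std::vector<float> data;

    Tensor4D(int n, int h, int w, int c)
        : N(n), H(h), W(w), C(c),
          data(static_cast<std::size_t>(n) * h * w * c, 0.0f) {}

    // Row-major flat index into the backing store.
    float& at(int n, int h, int w, int c) {
        return data[((static_cast<std::size_t>(n) * H + h) * W + w) * C + c];
    }
};
```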
The forward pass mirrors the Keras graph exactly:
// Block 1
auto x1 = relu_tensor(conv2d(input, W.c01)); // conv3x3, 3→28
x1 = relu_tensor(conv2d(x1, W.c02)); // conv3x3, 28→28
auto sc1 = conv2d(input, W.c03); // shortcut conv1x1, no relu
auto x = add_tensors(x1, sc1);
x = max_pool(x);
// Block 2
auto x2 = relu_tensor(conv2d(x, W.c06)); // conv3x3, 28→56
x2 = relu_tensor(conv2d(x2, W.c07)); // conv3x3, 56→56
auto sc2 = conv2d(x, W.c08); // shortcut conv1x1, no relu
x = add_tensors(x2, sc2);
x = max_pool(x);
// Head
auto gap = global_avg_pool(x); // [56]
auto logits = dense_layer(gap, W.d12); // [10]
Image loading uses stb_image.h for decoding, followed by nearest-neighbour
resize to 32×32. Pixel values are normalized to [0, 1] by dividing by 255.
Stage 2 — HLS Kernel and C Simulation
infer_txt_localds_fixed_hls.cpp is the synthesis-targeted version. It is
structured so that it can be compiled as a normal C++ binary for functional
validation (C-sim), and also synthesised to RTL by Vitis HLS.
Dual-mode type system
The most important design decision in this file is the conditional type definition at the top:
#ifdef __SYNTHESIS__
typedef ap_fixed<16, 8> ftype; // 16-bit fixed-point, 8 integer bits
typedef ap_fixed<32,16> atype; // 32-bit accumulator
#else
typedef float ftype;
typedef float atype;
#endif
__SYNTHESIS__ is automatically defined by Vitis HLS during RTL generation,
and undefined during C simulation. This means:
- C-sim (g++ compilation): all arithmetic runs in float32. Results are numerically identical to the standalone reference. Fast to run, easy to debug.
- RTL synthesis: all arithmetic uses ap_fixed. The tool generates hardware multiply-accumulate units sized to the fixed-point widths, with bit-width propagation rules that handle compound accumulation better than a software simulation of the same type does.
This separation is critical. An earlier attempt used ap_fixed for C-sim as
well — see the problems section for why that failed.
Weight storage as integer ROM arrays
Weights are stored as flat static const int arrays, compiled into ROM by
HLS:
static const int ROM_01_K[3*3*3*28] = {
#include "weights_hls/w_01_conv2d_kernel.inc"
};
The .inc files are generated by dump_weights_to_hls.py from the Q1.7
text files. Dequantization happens inline in every MAC: w = rom_k[idx] / W_SCALE
where W_SCALE = 128. In synthesis, HLS maps these to BRAM using
#pragma HLS BIND_STORAGE:
#pragma HLS BIND_STORAGE variable=ROM_02_K type=ROM_2P impl=BRAM
#pragma HLS BIND_STORAGE variable=ROM_06_K type=ROM_2P impl=BRAM
#pragma HLS BIND_STORAGE variable=ROM_07_K type=ROM_2P impl=BRAM
#pragma HLS BIND_STORAGE variable=ROM_08_K type=ROM_2P impl=BRAM
The large weight ROMs (3×3×28×28=7056 entries and 3×3×56×56=28224 entries) are explicitly mapped to BRAM rather than LUT-based distributed RAM. Without this directive, HLS will infer these as large LUT arrays and LUT utilization will exceed device limits.
Activation buffers
All intermediate feature maps are declared as static global arrays:
static ftype buf_in [IN_H * IN_W * IN_C];
static ftype buf_c01 [IN_H * IN_W * C1_OC];
static ftype buf_c02 [IN_H * IN_W * C1_OC];
// ... etc
Being static globals means they are not allocated on the stack (avoiding stack overflow for large tensors) and HLS can map them to BRAM. The top-level function zeroes them explicitly at the start of each call to ensure clean state between inferences.
Convolution implementation
The conv2d function is templated on all shape parameters so all loop bounds are compile-time constants:
template<int H, int W, int IC, int OC, int KH, int KW, bool RELU>
static void conv2d_same(
ftype in_buf [H * W * IC],
ftype out_buf[H * W * OC],
const int rom_k [],
const int rom_b []
)
The loop structure is:
LOOP_H (oh: 0..H)
LOOP_W (ow: 0..W)
LOOP_OC (oc: 0..OC)
LOOP_KI (ki: 0..KH)
LOOP_KJ (kj: 0..KW)
LOOP_IC (ic: 0..IC) ← #pragma HLS PIPELINE II=1 here
acc += in[ih,iw,ic] * kernel[ki,kj,ic,oc]
The #pragma HLS PIPELINE II=1 is placed on the innermost LOOP_IC only.
This is a deliberate and carefully chosen placement — see the problems section
for what happened when it was placed elsewhere.
Pipelining LOOP_IC means: for a given spatial position (oh, ow) and output
channel oc, HLS pipelines the reduction over all input channels. Each
iteration of LOOP_IC issues one multiply-accumulate. With II=1, a new
iteration starts every clock cycle. For a 56-channel inner loop, this completes
the reduction in 56 cycles with a fully pipelined MAC chain.
The outer loops (LOOP_H, LOOP_W, LOOP_OC, LOOP_KI, LOOP_KJ) are
left as sequential. HLS schedules them as nested loops around the pipelined
core. This keeps the elaboration tractable — see the problems section for
what happened with pipeline placement on outer loops.
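A software model of this loop nest, as a sketch: shapes are runtime arguments here instead of template parameters, and the centered zero-padding is an assumption about how "same" padding is implemented; the kernel and bias index/scale expressions follow the snippets shown elsewhere in this document:

```cpp
#include <vector>
#include <algorithm>

// Software model of the conv loop nest above: same padding, NHWC layout,
// Q1.7 integer weights dequantized inline (w = rom_k[idx] / 128.0f).
void conv2d_same_ref(const std::vector<float>& in,  // [H*W*IC]
                     std::vector<float>& out,       // [H*W*OC]
                     const std::vector<int>& rom_k, // [KH*KW*IC*OC], Q1.7
                     const std::vector<int>& rom_b, // [OC], Q1.7
                     int H, int W, int IC, int OC, int KH, int KW,
                     bool relu) {
    const float W_SCALE = 128.0f;
    for (int oh = 0; oh < H; ++oh)
    for (int ow = 0; ow < W; ++ow)
    for (int oc = 0; oc < OC; ++oc) {
        float acc = rom_b[oc] / W_SCALE;
        for (int ki = 0; ki < KH; ++ki)
        for (int kj = 0; kj < KW; ++kj) {
            int ih = oh + ki - KH / 2;   // center the kernel (same padding)
            int iw = ow + kj - KW / 2;
            if (ih < 0 || ih >= H || iw < 0 || iw >= W) continue; // zero pad
            for (int ic = 0; ic < IC; ++ic)   // <- the pipelined loop in HLS
                acc += in[(ih * W + iw) * IC + ic]
                     * (rom_k[((ki * KW + kj) * IC + ic) * OC + oc] / W_SCALE);
        }
        out[(oh * W + ow) * OC + oc] = relu ? std::max(acc, 0.0f) : acc;
    }
}
```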
MaxPool implementation
The maxpool function is templated and uses a loop-carried index scheme that avoids all runtime multiplication in the address calculations:
int o_base = 0;
int r0_base = 0;
POOL_H: for (int oh = 0; oh < OH; ++oh) {
int r1_base = r0_base + ROW_STRIDE; // ROW_STRIDE = W*C (compile-time)
int w0_off = 0;
POOL_W: for (int ow = 0; ow < OW; ++ow) {
int w1_off = w0_off + COL_STRIDE; // COL_STRIDE = C (compile-time)
POOL_C: for (int c = 0; c < C; ++c) {
#pragma HLS PIPELINE II=1
// four reads using only addition-based offsets
ftype v00 = in_buf[r0_base + w0_off + c];
// ...
}
o_base += C;
w0_off += 2 * COL_STRIDE;
}
r0_base += 2 * ROW_STRIDE;
}
ROW_STRIDE and COL_STRIDE are derived from template parameters, so they
are compile-time constants. HLS sees only additions and increments in the
address path — no multipliers are synthesized. This was the fix for the 132%
LUT utilization problem — see the problems section for full detail.
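The equivalence of the addition-only scheme with the multiply-based indices it replaces can be checked in plain C++:

```cpp
// Verify that the loop-carried offsets (r0_base, r1_base, w0_off, w1_off)
// reproduce the multiply-based indices (oh*2*W + ow*2 + dx)*C + c for every
// tap of the 2x2 pooling window.
bool offsets_match(int OH, int OW, int W, int C) {
    const int ROW_STRIDE = W * C;   // compile-time in the HLS version
    const int COL_STRIDE = C;
    int r0_base = 0;
    for (int oh = 0; oh < OH; ++oh) {
        int r1_base = r0_base + ROW_STRIDE;
        int w0_off = 0;
        for (int ow = 0; ow < OW; ++ow) {
            int w1_off = w0_off + COL_STRIDE;
            for (int c = 0; c < C; ++c) {
                if (r0_base + w0_off + c != (oh*2 * W + ow*2) * C + c)       return false;
                if (r0_base + w1_off + c != (oh*2 * W + ow*2 + 1) * C + c)   return false;
                if (r1_base + w0_off + c != ((oh*2+1) * W + ow*2) * C + c)   return false;
                if (r1_base + w1_off + c != ((oh*2+1) * W + ow*2 + 1) * C + c) return false;
            }
            w0_off += 2 * COL_STRIDE;
        }
        r0_base += 2 * ROW_STRIDE;
    }
    return true;
}
```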
GlobalAvgPool implementation
Uses nested h, w loops to avoid integer division (hw / W) and modulo
(hw % W) that would otherwise be needed to recover 2D coordinates from a
flat index:
GAP_C: for (int c = 0; c < C; ++c) {
atype acc = atype(0);
GAP_H: for (int h = 0; h < H; ++h) {
GAP_W: for (int w = 0; w < W; ++w) {
#pragma HLS PIPELINE II=1
acc += atype(in_buf[(h * W + w) * C + c]);
}
}
out_buf[c] = ftype(acc * inv);
}
The (h * W + w) * C expression involves only compile-time constants (W, C)
multiplied by loop variables with bounded ranges, which HLS can implement
as adders rather than general multipliers.
Dense layer
Pipelined over the input feature dimension:
FC_O: for (int o = 0; o < OUT_F; ++o) {
atype acc = atype(rom_b[o]) / atype(W_SCALE);
FC_I: for (int i = 0; i < IN_F; ++i) {
#pragma HLS PIPELINE II=1
acc += atype(in_buf[i]) * (atype(rom_k[i*OUT_F + o]) / atype(W_SCALE));
}
out_buf[o] = ftype(acc);
}
IN_F = 56, so each output neuron takes 56 pipelined MAC cycles. There are 10
output neurons, giving 560 total cycles for the dense layer — negligible
compared to the conv layers.
Argmax
The 10-way argmax is small enough to fully unroll:
for (int i = 1; i < N_CLS; ++i) {
#pragma HLS UNROLL
if (logits[i] > best_val) { best_val = logits[i]; best = label_t(i); }
}
#pragma HLS UNROLL without a factor unrolls the entire loop. With only
10 iterations, this creates a small parallel comparison tree rather than a
sequential loop — sensible given the tiny size.
AXI interface
The top-level function is:
void cifar10_infer(
ftype image_in[IN_H * IN_W * IN_C], // 3072 elements
label_t* pred_out
);
The HLS interface pragmas:
#pragma HLS INTERFACE m_axi port=image_in offset=slave bundle=gmem depth=3072
#pragma HLS INTERFACE s_axilite port=image_in bundle=control
#pragma HLS INTERFACE s_axilite port=pred_out bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
m_axi on image_in: The image input is mapped to an AXI Master
(memory-mapped) interface. This means the IP core will DMA-read the 3072 input
pixels directly from DDR memory. The offset=slave directive means the base
address is configurable at runtime via the AXI-Lite control bus rather than
being hardcoded. bundle=gmem groups this port into a single AXI Master
interface named gmem. In Vivado block design, gmem connects to the HP
(High-Performance) slave port of the Zynq PS, giving the IP direct access to
DDR.
s_axilite on image_in (address register): Even though the data is
read via AXI Master, the base address of the image buffer is passed to the core
through the AXI-Lite control slave. The PS writes the pointer value into the
core’s address register before asserting start.
s_axilite on pred_out: The single output — the predicted class index
(0–9) — is returned through the AXI-Lite control slave as a readable register.
The PS polls or interrupts on done, then reads this register.
s_axilite on return (ap_ctrl_hs): This exposes the standard
start/done/idle/ready handshake signals on the AXI-Lite control slave. The PS
writes 1 to the ap_start bit to trigger inference, and reads ap_done
to know when the result is available. ap_ctrl_hs (handshake) mode holds
ap_done high for one cycle then clears it.
The full connection in Vivado is: AXI-Lite control port → Zynq PS GP master
(for register access), gmem AXI Master → Zynq PS HP slave (for DMA reads).
Build and run (C-sim)
Single image:
g++ -std=c++17 -DHW_CSIM -I$XILINX_HLS/include \
-o csim infer_txt_localds_fixed_hls.cpp -lm
./csim sample_images/airplane/airplane_0.png
Full testbench (10 images):
g++ -std=c++17 -I$XILINX_HLS/include \
-o tb_csim tb_cifar10.cpp infer_txt_localds_fixed_hls.cpp -lm
./tb_csim
Expected: 9/10 correct. bird_0.png is misclassified as airplane — the
float reference has low confidence on this image too, so this is a model
accuracy issue, not an implementation bug.
Stage 3 — Vitis HLS Synthesis and Export
vitis_hls -f run_hls.tcl
The TCL script runs: C simulation → RTL synthesis → IP export. Co-simulation
is excluded because the testbench uses chdir() for relative path handling,
which conflicts with HLS cosim’s internal file I/O. C-sim already validates
functional correctness.
Check synthesis results
# Resource utilization
cat cifar10_hls/solution1/syn/report/cifar10_infer_csynth.rpt \
| grep -A 60 "== Utilization Estimates"
# Timing
cat cifar10_hls/solution1/syn/report/cifar10_infer_csynth.rpt \
| grep -A 10 "== Performance Estimates"
Synthesised IP output
cifar10_hls/solution1/impl/export.zip
Resource Utilization (xc7z020clg400-1, 100 MHz target)
| Resource | Used | Available | Utilization |
|---|---|---|---|
| LUT | 18,379 | 53,200 | 34% |
| FF | 11,014 | 106,400 | 10% |
| BRAM_18K | 243 | 280 | 86% |
| DSP | 22 | 220 | 10% |
Estimated Fmax: 136.99 MHz (target was 100 MHz — design meets timing with margin)
BRAM utilization at 86% is the binding constraint. Most BRAM is consumed by
weight ROMs. The large conv layer ROMs (ROM_02_K: 7056 entries,
ROM_07_K: 28224 entries) are the dominant contributors.
Problems Encountered and How They Were Fixed
This section documents every significant failure in the development process, what caused it, and exactly what was changed to fix it.
Problem 1 — Weight file parse failure (\r\n line endings)
What happened: The C++ weight loader failed to parse the very first kernel file. The integer token on line 4 was read incorrectly and threw a parse error.
Root cause: The .txt weight files were generated on Windows and use
\r\n (CRLF) line endings. std::getline reads up to \n and leaves \r
attached to the end of the string. When the line is fed into
std::istringstream for integer parsing, the \r is treated as part of the
last token on each line, corrupting it. The error only appeared on line 4
because earlier lines happened to have tokens that parsed successfully despite
the corruption — the last token on each line was the one affected.
Fix: After std::getline, explicitly strip \r from the end of every
line before parsing:
if (!line.empty() && line.back() == '\r') line.pop_back();
This one line was added to the loader and all files parsed correctly afterwards.
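The fix in context, as a minimal sketch of a line-based integer loader (the real loader's surrounding structure may differ):

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read whitespace-separated integers line by line, stripping the trailing
// '\r' that std::getline leaves behind on CRLF (Windows) files.
std::vector<int> parse_weight_text(std::istream& in) {
    std::vector<int> vals;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back(); // the fix
        std::istringstream iss(line);
        int v;
        while (iss >> v) vals.push_back(v);
    }
    return vals;
}
```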
Problem 2 — ap_fixed arithmetic giving wrong predictions in C-sim
What happened: With atype = ap_fixed<32,16> used for both C-sim and
synthesis, the HLS kernel predicted truck for an airplane image that the
float reference correctly classified.
Root cause: In C-sim using ap_fixed, every intermediate multiply result
is truncated back to ap_fixed<32,16> before being added to the accumulator.
In the largest conv layer (3×3 kernel, 56 input channels = 504 MACs per output
channel), truncation error accumulates across all 504 additions. The rounding
errors shift the winning logit to the wrong class. In hardware, the fixed-point
arithmetic units handle this correctly because the synthesis tool maps the
ap_fixed types to actual hardware multipliers and adders with the correct
bit-width propagation rules. In software simulation of ap_fixed, the
behaviour is a conservative model that truncates more aggressively than the
hardware would.
Fix: Decouple the type used for C-sim from the type used for synthesis
using the __SYNTHESIS__ macro, which Vitis HLS defines automatically only
during RTL generation:
#ifdef __SYNTHESIS__
typedef ap_fixed<16, 8> ftype;
typedef ap_fixed<32,16> atype;
#else
typedef float ftype;
typedef float atype;
#endif
C-sim now runs entirely in float32 and matches the standalone reference
exactly. The ap_fixed types are only active during synthesis, where the
hardware arithmetic units handle truncation correctly. Functional validation
and hardware generation are fully decoupled.
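The accumulation drift can be illustrated without ap_fixed at all, by modelling a 16-fractional-bit accumulator in plain integers. This is only an illustration of the mechanism — the exact ap_fixed quantization and overflow modes differ from this simplified model:

```cpp
#include <cstdint>
#include <cmath>

// Model ap_fixed<32,16>-style arithmetic: 16 fractional bits, with each
// multiply result truncated back to 16 fractional bits before the add,
// mimicking the per-MAC quantization described above.
double truncating_dot(double x, double w, int n) {
    const int64_t ONE = 1 << 16;
    int64_t xf = std::llround(x * ONE);
    int64_t wf = std::llround(w * ONE);
    int64_t acc = 0;
    for (int i = 0; i < n; ++i)
        acc += (xf * wf) >> 16;   // truncate each product to Q16.16
    return static_cast<double>(acc) / ONE;
}
```

With n = 504 (the 3×3×56 reduction of the largest conv layer), each truncation loses a fraction of 2^-16, and the losses add up across the whole reduction instead of averaging out.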
Problem 3 — HLS synthesis deadlock at “Starting code transformations”
What happened: Vitis HLS hung indefinitely during elaboration at the
“Starting code transformations” phase. First attempt had #pragma HLS UNROLL
on all inner conv loops. Second attempt removed the explicit unrolls but placed
#pragma HLS PIPELINE II=1 on LOOP_OC (the output channel loop). In both
cases the tool ran for over 110 seconds and never produced output.
Root cause: When PIPELINE is placed on LOOP_OC, HLS attempts to
achieve II=1 for that loop. To do so, it must unroll all loops nested inside
LOOP_OC — that is LOOP_KI (3), LOOP_KJ (3), and LOOP_IC (up to 56)
simultaneously. 3×3×56 = 504. HLS now needs 504 parallel MAC operations per
output channel, all executing every cycle. To feed 504 MACs simultaneously,
it must partition the weight ROM ROM_07_K (28,224 entries) into 505 cyclic
banks so every MAC can access its weight in the same cycle. Building the
scheduling dependency graph for a 505-way partitioned ROM with 504 parallel
MACs causes exponential elaboration complexity. The tool was effectively
deadlocked trying to resolve the scheduling problem.
Fix: Move #pragma HLS PIPELINE II=1 to the innermost LOOP_IC only:
LOOP_IC: for (int ic = 0; ic < IC; ++ic) {
#pragma HLS PIPELINE II=1
int k_idx = ((ki*KW + kj)*IC + ic)*OC + oc;
atype w = atype(rom_k[k_idx]) / atype(W_SCALE);
atype x = atype(in_buf[...]);
acc += x * w;
}
With pipeline on LOOP_IC, HLS only needs to pipeline the input-channel
reduction (at most 56 iterations). It does not unroll LOOP_KI, LOOP_KJ,
or LOOP_OC. The ROM only needs a 2-port access (input channel index advances
by 1 each cycle), which BRAM handles natively. Elaboration time dropped from
110+ seconds (hanging) to 3 seconds. The tradeoff is higher latency per
image — the spatial and output-channel loops are sequential — but the design
synthesises correctly.
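The latency cost of this placement can be estimated with a first-order model: the pipelined LOOP_IC issues one MAC per cycle and every outer loop runs sequentially, so a conv layer costs roughly H·W·OC·KH·KW·IC cycles. This ignores pipeline fill/drain and loop-control overhead, so it is a lower bound, not a synthesis report figure:

```cpp
#include <cstdint>

// Rough MAC-cycle count for a conv layer under the chosen pragma placement:
// one MAC per cycle in the pipelined LOOP_IC, all outer loops sequential.
uint64_t conv_mac_cycles(int H, int W, int OC, int KH, int KW, int IC) {
    return uint64_t(H) * W * OC * KH * KW * IC;
}
```

For the largest layer (16×16 spatial, 56→56 channels, 3×3 kernel) this gives about 7.2 M cycles, on the order of tens of milliseconds per layer at 100 MHz — the latency tradeoff mentioned above.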
Problem 4 — LUT utilization at 132% — design did not fit on xc7z020
What happened: After the elaboration fix, synthesis completed but the resource report showed total LUT usage at 70,353 against a device limit of 53,200 — 132% utilization. The design could not be implemented.
Root cause: The two maxpool2x2 modules dominated:
grp_maxpool2x2_16_16_56_s 27,692 LUT
grp_maxpool2x2_32_32_28_s 27,526 LUT
The original maxpool loop used expressions like oh*2*W*C and ow*2*C to
compute input buffer indices, where oh and ow are loop variables:
// Original — causes runtime multipliers
for (int oh = 0; oh < OH; ++oh)
for (int ow = 0; ow < OW; ++ow)
for (int c = 0; c < C; ++c) {
int in_idx_00 = (oh*2 * W + ow*2) * C + c;
int in_idx_01 = (oh*2 * W + ow*2 + 1) * C + c;
// ...
}
Even though W and C are template parameters (compile-time constants),
oh and ow are loop variables — unknown at compile time. HLS cannot fold
oh*2*W*C at elaboration time. It must synthesize a 64-bit runtime
multiplier for each such expression. With the POOL_C loop pipelined, four
such index expressions appear inside the pipeline, generating four instances
of mul_64ns_66ns_129_5_1 — large 64-bit multipliers backed by DSPs plus
thousands of LUT-based mux trees. Each multiplier instance cost ~250 LUTs and
~5,000 FFs. Multiplied across both maxpool modules, this alone consumed over
55,000 LUTs.
Fix: Replace all index expressions involving loop variables with loop-carried increment variables that are only ever added to by compile-time constants:
int o_base = 0;
int r0_base = 0;
POOL_H: for (int oh = 0; oh < OH; ++oh) {
int r1_base = r0_base + ROW_STRIDE; // ROW_STRIDE = W*C, compile-time
int w0_off = 0;
POOL_W: for (int ow = 0; ow < OW; ++ow) {
int w1_off = w0_off + COL_STRIDE; // COL_STRIDE = C, compile-time
POOL_C: for (int c = 0; c < C; ++c) {
#pragma HLS PIPELINE II=1
ftype v00 = in_buf[r0_base + w0_off + c];
ftype v01 = in_buf[r0_base + w1_off + c];
ftype v10 = in_buf[r1_base + w0_off + c];
ftype v11 = in_buf[r1_base + w1_off + c];
// max reduction...
out_buf[o_base + c] = mx;
}
o_base += C;
w0_off += 2 * COL_STRIDE;
}
r0_base += 2 * ROW_STRIDE;
}
Every address computation is now a chain of additions. r0_base, r1_base,
w0_off, w1_off, and o_base are incremented by compile-time stride
constants (ROW_STRIDE = W*C, COL_STRIDE = C). HLS synthesizes these as
simple adders. No runtime multipliers are generated.
Result after fix:
| Module | Before | After |
|---|---|---|
| grp_maxpool2x2_16_16_56_s | 27,692 LUT | 2,921 LUT |
| grp_maxpool2x2_32_32_28_s | 27,526 LUT | 2,809 LUT |
| Total LUT | 70,353 (132%) | 18,379 (34%) |
The design now fits comfortably on the xc7z020 with 66% LUT headroom.
Integrating the IP in Vivado
After synthesis, the exported IP is at:
cifar10_hls/solution1/impl/export.zip
In Vivado block design:
- Add export.zip as an IP repository (IP Catalog → Add Repository).
- Instantiate cifar10_infer.
- Connect gmem (AXI Master) → Zynq PS HP slave port (S_AXI_HP0). This is the high-bandwidth path for DMA — HP ports bypass the GP bus and connect directly to the DDR controller.
- Connect control (AXI-Lite slave) → Zynq PS GP master port (M_AXI_GP0). This is the low-bandwidth path for register access — the PS writes the image pointer address and reads the prediction result here.
- Connect clocks and resets. The HP and GP ports can share the same PL clock.
- Add an AXI Interconnect or SmartConnect between each Zynq port and the IP if needed by the block automation.
From the PS (Linux or bare-metal), usage is:
- Allocate a physically contiguous buffer and write the 3072 input pixels in the core's 16-bit fixed-point (ap_fixed<16,8>) input format.
- Write the physical address of that buffer to the image_in address register via /dev/mem or a UIO/XDMA driver.
- Write 1 to ap_start.
- Poll ap_done (or configure an interrupt on ap_done).
- Read the 4-bit pred_out register.
Summary
| Stage | File | Purpose |
|---|---|---|
| Float reference | infer_txt_localds_fixed.cpp | Validate architecture and weights |
| HLS kernel | infer_txt_localds_fixed_hls.cpp | C-sim + RTL synthesis target |
| Testbench | tb_cifar10.cpp | 10-image pass/fail gate |
| Weight converter | dump_weights_to_hls.py | Q1.7 txt → .inc ROM arrays |
| HLS flow | run_hls.tcl | Automate csim + synth + export |
The four problems encountered during development — CRLF parsing, ap_fixed
C-sim divergence, elaboration deadlock from over-pipelining, and 64-bit
multipliers in maxpool — each had a clear root cause and a targeted fix. None
required redesigning the overall approach. The final synthesised design meets
100 MHz timing at 136.99 MHz Fmax and fits on the xc7z020 at 34% LUT and 86%
BRAM.