Manual C-HLS with Vitis

Manual C-HLS with Vitis

CIFAR-10 CNN Inference — Manual Vitis HLS Implementation

Mini-ResNet trained on CIFAR-10, taken from a Keras model all the way to a synthesisable RTL IP core targeting the Xilinx xc7z020 (Zynq-7020) via Vitis HLS. The implementation covers three stages: a standalone C++ float32 reference, an HLS C simulation kernel, and full RTL synthesis with AXI interface export.

All code is written manually — no HLS4ML, no keras2c. The network architecture, weight loading, operator implementations, and HLS pragmas were all authored from scratch.


Repository Layout

.
├── infer_txt_localds_fixed.cpp          # Standalone C++ inference (float32 reference)
├── infer_txt_localds_fixed_hls.cpp      # Vitis HLS kernel (synthesis target)
├── tb_cifar10.cpp                       # HLS testbench (10 images, pass/fail)
├── dump_weights_to_hls.py               # Converts model_weights_fixed8/ → weights_hls/*.inc
├── run_hls.tcl                          # Vitis HLS flow script
├── model_weights_fixed8/                # Q1.7 quantized weight files (.txt)
├── weights_hls/                         # Weight ROM include files (.inc, generated)
└── sample_images/                       # 10 test images, one per class

Network Architecture

Input: 32×32×3

Block 1 (Residual):
  conv3x3  (3  → 28, stride=1, same padding, ReLU)
  conv3x3  (28 → 28, stride=1, same padding, ReLU)
  conv1x1  (3  → 28, shortcut, no activation)
  add(conv3x3_out, shortcut) → maxpool2×2

Block 2 (Residual):
  conv3x3  (28 → 56, stride=1, same padding, ReLU)
  conv3x3  (56 → 56, stride=1, same padding, ReLU)
  conv1x1  (28 → 56, shortcut, no activation)
  add(conv3x3_out, shortcut) → maxpool2×2

Head:
  GlobalAvgPool → Dense(56 → 10) → argmax

Weights are stored as Q1.7 fixed-point integers. To recover float values: divide each integer by 128. This quantization scheme was chosen to allow weight storage as plain int arrays in ROM, with the dequantization happening inline during the MAC loop (w = rom_k[idx] / 128.0).


Setup

Generate weight ROM include files (run once)

python3 dump_weights_to_hls.py

Reads model_weights_fixed8/*.txt and writes weights_hls/w_*.inc, one file per layer kernel and bias. These .inc files are directly #included into the HLS kernel’s static ROM arrays.

Set HLS environment

export XILINX_HLS=/tools/Xilinx/Vitis_HLS/2022.2

Stage 1 — Standalone C++ Float Reference

infer_txt_localds_fixed.cpp is a pure float32 C++ implementation of the full forward pass. It has no HLS pragmas, no fixed-point types, and no synthesis constraints. Its purpose is to validate that the network architecture and weight loading are correct before introducing HLS-specific complexity.

Build and run

g++ -O3 -march=native -std=c++17 -o infer infer_txt_localds_fixed.cpp -lm
./infer sample_images/cat/cat_0.png

Implementation details

The reference uses a Tensor4D struct — a simple 4D array [N, H, W, C] stored contiguous in row-major order with a std::vector<float> backing store. All layers operate on this struct.

Weight loading reads the model_weights_fixed8/ text files, strips Windows \r\n line endings (see problems section), and dequantizes on load (value = int_token / 128.0f). Kernels are stored as ConvWeight and DenseWeight structs holding std::vector<float>.

The forward pass mirrors the Keras graph exactly:

// Block 1
auto x1  = relu_tensor(conv2d(input, W.c01));        // conv3x3, 3→28
x1       = relu_tensor(conv2d(x1,    W.c02));        // conv3x3, 28→28
auto sc1 = conv2d(input, W.c03);                     // shortcut conv1x1, no relu
auto x   = add_tensors(x1, sc1);
x        = max_pool(x);

// Block 2
auto x2  = relu_tensor(conv2d(x,  W.c06));           // conv3x3, 28→56
x2       = relu_tensor(conv2d(x2, W.c07));           // conv3x3, 56→56
auto sc2 = conv2d(x, W.c08);                         // shortcut conv1x1, no relu
x        = add_tensors(x2, sc2);
x        = max_pool(x);

// Head
auto gap    = global_avg_pool(x);                    // [56]
auto logits = dense_layer(gap, W.d12);               // [10]

Image loading uses stb_image.h for decoding, followed by nearest-neighbour resize to 32×32. Pixel values are normalized to [0, 1] by dividing by 255.


Stage 2 — HLS Kernel and C Simulation

infer_txt_localds_fixed_hls.cpp is the synthesis-targeted version. It is structured so that it can be compiled as a normal C++ binary for functional validation (C-sim), and also synthesised to RTL by Vitis HLS.

Dual-mode type system

The most important design decision in this file is the conditional type definition at the top:

#ifdef __SYNTHESIS__
    typedef ap_fixed<16, 8>  ftype;   // 16-bit fixed-point, 8 integer bits
    typedef ap_fixed<32,16>  atype;   // 32-bit accumulator
#else
    typedef float ftype;
    typedef float atype;
#endif

__SYNTHESIS__ is automatically defined by Vitis HLS during RTL generation, and undefined during C simulation. This means:

  • C-sim (g++ compilation): all arithmetic runs in float32. Results are numerically identical to the standalone reference. Fast to run, easy to debug.
  • RTL synthesis: all arithmetic uses ap_fixed. The tool generates hardware multiply-accumulate units sized to the fixed-point widths. The truncation behaviour of fixed-point arithmetic is handled correctly by the hardware arithmetic units in a way that compound accumulation handles better than software simulation of the same type.

This separation is critical. An earlier attempt used ap_fixed for C-sim as well — see the problems section for why that failed.

Weight storage as integer ROM arrays

Weights are stored as flat static const int arrays, compiled into ROM by HLS:

static const int ROM_01_K[3*3*3*28] = {
#include "weights_hls/w_01_conv2d_kernel.inc"
};

The .inc files are generated by dump_weights_to_hls.py from the Q1.7 text files. Dequantization happens inline in every MAC: w = rom_k[idx] / W_SCALE where W_SCALE = 128. In synthesis, HLS maps these to BRAM using #pragma HLS BIND_STORAGE:

#pragma HLS BIND_STORAGE variable=ROM_02_K type=ROM_2P impl=BRAM
#pragma HLS BIND_STORAGE variable=ROM_06_K type=ROM_2P impl=BRAM
#pragma HLS BIND_STORAGE variable=ROM_07_K type=ROM_2P impl=BRAM
#pragma HLS BIND_STORAGE variable=ROM_08_K type=ROM_2P impl=BRAM

The large weight ROMs (3×3×28×28=7056 entries and 3×3×56×56=28224 entries) are explicitly mapped to BRAM rather than LUT-based distributed RAM. Without this directive, HLS will infer these as large LUT arrays and LUT utilization will exceed device limits.

Activation buffers

All intermediate feature maps are declared as static global arrays:

static ftype buf_in    [IN_H * IN_W * IN_C];
static ftype buf_c01   [IN_H * IN_W * C1_OC];
static ftype buf_c02   [IN_H * IN_W * C1_OC];
// ... etc

Being static globals means they are not allocated on the stack (avoiding stack overflow for large tensors) and HLS can map them to BRAM. The top-level function zeroes them explicitly at the start of each call to ensure clean state between inferences.

Convolution implementation

The conv2d function is templated on all shape parameters so all loop bounds are compile-time constants:

template<int H, int W, int IC, int OC, int KH, int KW, bool RELU>
static void conv2d_same(
    ftype     in_buf [H * W * IC],
    ftype     out_buf[H * W * OC],
    const int rom_k [],
    const int rom_b []
)

The loop structure is:

LOOP_H  (oh: 0..H)
  LOOP_W  (ow: 0..W)
    LOOP_OC  (oc: 0..OC)
      LOOP_KI  (ki: 0..KH)
        LOOP_KJ  (kj: 0..KW)
          LOOP_IC  (ic: 0..IC)   ← #pragma HLS PIPELINE II=1 here
            acc += in[ih,iw,ic] * kernel[ki,kj,ic,oc]

The #pragma HLS PIPELINE II=1 is placed on the innermost LOOP_IC only. This is a deliberate and carefully chosen placement — see the problems section for what happened when it was placed elsewhere.

Pipelining LOOP_IC means: for a given spatial position (oh, ow) and output channel oc, HLS pipelines the reduction over all input channels. Each iteration of LOOP_IC issues one multiply-accumulate. With II=1, a new iteration starts every clock cycle. For a 56-channel inner loop, this completes the reduction in 56 cycles with a fully pipelined MAC chain.

The outer loops (LOOP_H, LOOP_W, LOOP_OC, LOOP_KI, LOOP_KJ) are left as sequential. HLS schedules them as nested loops around the pipelined core. This keeps the elaboration tractable — see the problems section for what happened with pipeline placement on outer loops.

MaxPool implementation

The maxpool function is templated and uses a loop-carried index scheme that avoids all runtime multiplication in the address calculations:

int o_base  = 0;
int r0_base = 0;
POOL_H: for (int oh = 0; oh < OH; ++oh) {
    int r1_base = r0_base + ROW_STRIDE;   // ROW_STRIDE = W*C (compile-time)
    int w0_off  = 0;
    POOL_W: for (int ow = 0; ow < OW; ++ow) {
        int w1_off = w0_off + COL_STRIDE; // COL_STRIDE = C (compile-time)
        POOL_C: for (int c = 0; c < C; ++c) {
#pragma HLS PIPELINE II=1
            // four reads using only addition-based offsets
            ftype v00 = in_buf[r0_base + w0_off + c];
            // ...
        }
        o_base += C;
        w0_off += 2 * COL_STRIDE;
    }
    r0_base += 2 * ROW_STRIDE;
}

ROW_STRIDE and COL_STRIDE are derived from template parameters, so they are compile-time constants. HLS sees only additions and increments in the address path — no multipliers are synthesized. This was the fix for the 132% LUT utilization problem — see the problems section for full detail.

GlobalAvgPool implementation

Uses nested h, w loops to avoid integer division (hw / W) and modulo (hw % W) that would otherwise be needed to recover 2D coordinates from a flat index:

GAP_C: for (int c = 0; c < C; ++c) {
    atype acc = atype(0);
    GAP_H: for (int h = 0; h < H; ++h) {
        GAP_W: for (int w = 0; w < W; ++w) {
#pragma HLS PIPELINE II=1
            acc += atype(in_buf[(h * W + w) * C + c]);
        }
    }
    out_buf[c] = ftype(acc * inv);
}

The (h * W + w) * C expression involves only compile-time constants (W, C) multiplied by loop variables with bounded ranges, which HLS can implement as adders rather than general multipliers.

Dense layer

Pipelined over the input feature dimension:

FC_O: for (int o = 0; o < OUT_F; ++o) {
    atype acc = atype(rom_b[o]) / atype(W_SCALE);
    FC_I: for (int i = 0; i < IN_F; ++i) {
#pragma HLS PIPELINE II=1
        acc += atype(in_buf[i]) * (atype(rom_k[i*OUT_F + o]) / atype(W_SCALE));
    }
    out_buf[o] = ftype(acc);
}

IN_F = 56, so each output neuron takes 56 pipelined MAC cycles. There are 10 output neurons, giving 560 total cycles for the dense layer — negligible compared to the conv layers.

Argmax

The 10-way argmax is small enough to fully unroll:

for (int i = 1; i < N_CLS; ++i) {
#pragma HLS UNROLL
    if (logits[i] > best_val) { best_val = logits[i]; best = label_t(i); }
}

#pragma HLS UNROLL without a factor unrolls the entire loop. With only 10 iterations, this creates a small parallel comparison tree rather than a sequential loop — sensible given the tiny size.

AXI interface

The top-level function is:

void cifar10_infer(
    ftype    image_in[IN_H * IN_W * IN_C],   // 3072 elements
    label_t* pred_out
);

The HLS interface pragmas:

#pragma HLS INTERFACE m_axi     port=image_in offset=slave bundle=gmem depth=3072
#pragma HLS INTERFACE s_axilite port=image_in bundle=control
#pragma HLS INTERFACE s_axilite port=pred_out bundle=control
#pragma HLS INTERFACE s_axilite port=return   bundle=control

m_axi on image_in: The image input is mapped to an AXI Master (memory-mapped) interface. This means the IP core will DMA-read the 3072 input pixels directly from DDR memory. The offset=slave directive means the base address is configurable at runtime via the AXI-Lite control bus rather than being hardcoded. bundle=gmem groups this port into a single AXI Master interface named gmem. In Vivado block design, gmem connects to the HP (High-Performance) slave port of the Zynq PS, giving the IP direct access to DDR.

s_axilite on image_in (address register): Even though the data is read via AXI Master, the base address of the image buffer is passed to the core through the AXI-Lite control slave. The PS writes the pointer value into the core’s address register before asserting start.

s_axilite on pred_out: The single output — the predicted class index (0–9) — is returned through the AXI-Lite control slave as a readable register. The PS polls or interrupts on done, then reads this register.

s_axilite on return (ap_ctrl_hs): This exposes the standard start/done/idle/ready handshake signals on the AXI-Lite control slave. The PS writes 1 to the ap_start bit to trigger inference, and reads ap_done to know when the result is available. ap_ctrl_hs (handshake) mode holds ap_done high for one cycle then clears it.

The full connection in Vivado is: AXI-Lite control port → Zynq PS GP master (for register access), gmem AXI Master → Zynq PS HP slave (for DMA reads).

Build and run (C-sim)

Single image:

g++ -std=c++17 -DHW_CSIM -I$XILINX_HLS/include \
    -o csim infer_txt_localds_fixed_hls.cpp -lm
./csim sample_images/airplane/airplane_0.png

Full testbench (10 images):

g++ -std=c++17 -I$XILINX_HLS/include \
    -o tb_csim tb_cifar10.cpp infer_txt_localds_fixed_hls.cpp -lm
./tb_csim

Expected: 9/10 correct. bird_0.png is misclassified as airplane — the float reference has low confidence on this image too, so this is a model accuracy issue, not an implementation bug.


Stage 3 — Vitis HLS Synthesis and Export

vitis_hls -f run_hls.tcl

The TCL script runs: C simulation → RTL synthesis → IP export. Co-simulation is excluded because the testbench uses chdir() for relative path handling, which conflicts with HLS cosim’s internal file I/O. C-sim already validates functional correctness.

Check synthesis results

# Resource utilization
cat cifar10_hls/solution1/syn/report/cifar10_infer_csynth.rpt \
    | grep -A 60 "== Utilization Estimates"

# Timing
cat cifar10_hls/solution1/syn/report/cifar10_infer_csynth.rpt \
    | grep -A 10 "== Performance Estimates"

Synthesised IP output

cifar10_hls/solution1/impl/export.zip

Resource Utilization (xc7z020clg400-1, 100 MHz target)

Resource Used Available Utilization
LUT 18,379 53,200 34%
FF 11,014 106,400 10%
BRAM_18K 243 280 86%
DSP 22 220 10%

Estimated Fmax: 136.99 MHz (target was 100 MHz — design meets timing with margin)

BRAM utilization at 86% is the binding constraint. Most BRAM is consumed by weight ROMs. The large conv layer ROMs (ROM_02_K: 7056 entries, ROM_07_K: 28224 entries) are the dominant contributors.


Problems Encountered and How They Were Fixed

This section documents every significant failure in the development process, what caused it, and exactly what was changed to fix it.


Problem 1 — Weight file parse failure (\r\n line endings)

What happened: The C++ weight loader failed to parse the very first kernel file. The integer token on line 4 was read incorrectly and threw a parse error.

Root cause: The .txt weight files were generated on Windows and use \r\n (CRLF) line endings. std::getline reads up to \n and leaves \r attached to the end of the string. When the line is fed into std::istringstream for integer parsing, the \r is treated as part of the last token on each line, corrupting it. The error only appeared on line 4 because earlier lines happened to have tokens that parsed successfully despite the corruption — the last token on each line was the one affected.

Fix: After std::getline, explicitly strip \r from the end of every line before parsing:

if (!line.empty() && line.back() == '\r') line.pop_back();

This one line was added to the loader and all files parsed correctly afterwards.


Problem 2 — ap_fixed arithmetic giving wrong predictions in C-sim

What happened: With atype = ap_fixed<32,16> used for both C-sim and synthesis, the HLS kernel predicted truck for an airplane image that the float reference correctly classified.

Root cause: In C-sim using ap_fixed, every intermediate multiply result is truncated back to ap_fixed<32,16> before being added to the accumulator. In the largest conv layer (3×3 kernel, 56 input channels = 504 MACs per output channel), truncation error accumulates across all 504 additions. The rounding errors shift the winning logit to the wrong class. In hardware, the fixed-point arithmetic units handle this correctly because the synthesis tool maps the ap_fixed types to actual hardware multipliers and adders with the correct bit-width propagation rules. In software simulation of ap_fixed, the behaviour is a conservative model that truncates more aggressively than the hardware would.

Fix: Decouple the type used for C-sim from the type used for synthesis using the __SYNTHESIS__ macro, which Vitis HLS defines automatically only during RTL generation:

#ifdef __SYNTHESIS__
    typedef ap_fixed<16, 8>  ftype;
    typedef ap_fixed<32,16>  atype;
#else
    typedef float ftype;
    typedef float atype;
#endif

C-sim now runs entirely in float32 and matches the standalone reference exactly. The ap_fixed types are only active during synthesis, where the hardware arithmetic units handle truncation correctly. Functional validation and hardware generation are fully decoupled.


Problem 3 — HLS synthesis deadlock at “Starting code transformations”

What happened: Vitis HLS hung indefinitely during elaboration at the “Starting code transformations” phase. First attempt had #pragma HLS UNROLL on all inner conv loops. Second attempt removed the explicit unrolls but placed #pragma HLS PIPELINE II=1 on LOOP_OC (the output channel loop). In both cases the tool ran for over 110 seconds and never produced output.

Root cause: When PIPELINE is placed on LOOP_OC, HLS attempts to achieve II=1 for that loop. To do so, it must unroll all loops nested inside LOOP_OC — that is LOOP_KI (3), LOOP_KJ (3), and LOOP_IC (up to 56) simultaneously. 3×3×56 = 504. HLS now needs 504 parallel MAC operations per output channel, all executing every cycle. To feed 504 MACs simultaneously, it must partition the weight ROM ROM_07_K (28,224 entries) into 505 cyclic banks so every MAC can access its weight in the same cycle. Building the scheduling dependency graph for a 505-way partitioned ROM with 504 parallel MACs causes exponential elaboration complexity. The tool was effectively deadlocked trying to resolve the scheduling problem.

Fix: Move #pragma HLS PIPELINE II=1 to the innermost LOOP_IC only:

LOOP_IC: for (int ic = 0; ic < IC; ++ic) {
#pragma HLS PIPELINE II=1
    int k_idx = ((ki*KW + kj)*IC + ic)*OC + oc;
    atype w = atype(rom_k[k_idx]) / atype(W_SCALE);
    atype x = atype(in_buf[...]);
    acc += x * w;
}

With pipeline on LOOP_IC, HLS only needs to pipeline the input-channel reduction (at most 56 iterations). It does not unroll LOOP_KI, LOOP_KJ, or LOOP_OC. The ROM only needs a 2-port access (input channel index advances by 1 each cycle), which BRAM handles natively. Elaboration time dropped from 110+ seconds (hanging) to 3 seconds. The tradeoff is higher latency per image — the spatial and output-channel loops are sequential — but the design synthesises correctly.


Problem 4 — LUT utilization at 132% — design did not fit on xc7z020

What happened: After the elaboration fix, synthesis completed but the resource report showed total LUT usage at 70,353 against a device limit of 53,200 — 132% utilization. The design could not be implemented.

Root cause: The two maxpool2x2 modules dominated:

grp_maxpool2x2_16_16_56_s   27,692 LUT
grp_maxpool2x2_32_32_28_s   27,526 LUT

The original maxpool loop used expressions like oh*2*W*C and ow*2*C to compute input buffer indices, where oh and ow are loop variables:

// Original — causes runtime multipliers
for (int oh = 0; oh < OH; ++oh)
    for (int ow = 0; ow < OW; ++ow)
        for (int c = 0; c < C; ++c) {
            int in_idx_00 = (oh*2 * W + ow*2)     * C + c;
            int in_idx_01 = (oh*2 * W + ow*2 + 1) * C + c;
            // ...
        }

Even though W and C are template parameters (compile-time constants), oh and ow are loop variables — unknown at compile time. HLS cannot fold oh*2*W*C at elaboration time. It must synthesize a 64-bit runtime multiplier for each such expression. With the POOL_C loop pipelined, four such index expressions appear inside the pipeline, generating four instances of mul_64ns_66ns_129_5_1 — large 64-bit multipliers backed by DSPs plus thousands of LUT-based mux trees. Each multiplier instance cost ~250 LUTs and ~5,000 FFs. Multiplied across both maxpool modules, this alone consumed over 55,000 LUTs.

Fix: Replace all index expressions involving loop variables with loop-carried increment variables that are only ever added to by compile-time constants:

int o_base  = 0;
int r0_base = 0;
POOL_H: for (int oh = 0; oh < OH; ++oh) {
    int r1_base = r0_base + ROW_STRIDE;   // ROW_STRIDE = W*C, compile-time
    int w0_off  = 0;
    POOL_W: for (int ow = 0; ow < OW; ++ow) {
        int w1_off = w0_off + COL_STRIDE; // COL_STRIDE = C, compile-time
        POOL_C: for (int c = 0; c < C; ++c) {
#pragma HLS PIPELINE II=1
            ftype v00 = in_buf[r0_base + w0_off + c];
            ftype v01 = in_buf[r0_base + w1_off + c];
            ftype v10 = in_buf[r1_base + w0_off + c];
            ftype v11 = in_buf[r1_base + w1_off + c];
            // max reduction...
            out_buf[o_base + c] = mx;
        }
        o_base += C;
        w0_off += 2 * COL_STRIDE;
    }
    r0_base += 2 * ROW_STRIDE;
}

Every address computation is now a chain of additions. r0_base, r1_base, w0_off, w1_off, and o_base are incremented by compile-time stride constants (ROW_STRIDE = W*C, COL_STRIDE = C). HLS synthesizes these as simple adders. No runtime multipliers are generated.

Result after fix:

Module Before After
grp_maxpool2x2_16_16_56_s 27,692 LUT 2,921 LUT
grp_maxpool2x2_32_32_28_s 27,526 LUT 2,809 LUT
Total LUT 70,353 (132%) 18,379 (34%)

The design now fits comfortably on the xc7z020 with 66% LUT headroom.


Integrating the IP in Vivado

After synthesis, the exported IP is at:

cifar10_hls/solution1/impl/export.zip

In Vivado block design:

  1. Add export.zip as an IP repository (IP Catalog → Add Repository).
  2. Instantiate cifar10_infer.
  3. Connect gmem (AXI Master) → Zynq PS HP slave port (S_AXI_HP0). This is the high-bandwidth path for DMA — HP ports bypass the GP bus and connect directly to the DDR controller.
  4. Connect control (AXI-Lite slave) → Zynq PS GP master port (M_AXI_GP0). This is the low-bandwidth path for register access — the PS writes the image pointer address and reads the prediction result here.
  5. Connect clocks and resets. The HP and GP ports can share the same PL clock.
  6. Add an AXI Interconnect or SmartConnect between each Zynq port and the IP if needed by the block automation.

From the PS (Linux or bare-metal), usage is:

  1. Allocate a physically contiguous buffer, write 3072 float16 pixels.
  2. Write the physical address of that buffer to the image_in address register via /dev/mem or a UIO/XDMA driver.
  3. Write 1 to ap_start.
  4. Poll ap_done (or configure an interrupt on ap_ready).
  5. Read the 4-bit pred_out register.

Summary

Stage File Purpose
Float reference infer_txt_localds_fixed.cpp Validate architecture and weights
HLS kernel infer_txt_localds_fixed_hls.cpp C-sim + RTL synthesis target
Testbench tb_cifar10.cpp 10-image pass/fail gate
Weight converter dump_weights_to_hls.py Q1.7 txt → .inc ROM arrays
HLS flow run_hls.tcl Automate csim + synth + export

The four problems encountered during development — CRLF parsing, ap_fixed C-sim divergence, elaboration deadlock from over-pipelining, and 64-bit multipliers in maxpool — each had a clear root cause and a targeted fix. None required redesigning the overall approach. The final synthesised design meets 100 MHz timing at 136.99 MHz Fmax and fits on the xc7z020 at 34% LUT and 86% BRAM.