High-Level Synthesis (HLS) with C: A Deep Practical Guide from Code to RTL


High-Level Synthesis, usually called HLS, lets you describe hardware in C or C++ and then generate RTL from it. Tools such as Vitis HLS take a C function and turn it into synthesizable Verilog or VHDL.

HLS is often misunderstood.

It is not:

  • A CPU compiler
  • A push button way to convert software into hardware
  • A substitute for understanding digital design

It is:

  • A hardware generator that schedules operations over clock cycles
  • A way to describe datapaths and control logic using C syntax

By the end of this discussion, you should be able to:

  • Recognize how C constructs turn into hardware
  • Read HLS synthesis reports with confidence
  • Control latency and throughput
  • Build a pipelined accelerator
  • Add AXI interfaces
  • Estimate hardware structure before running synthesis

The Required Mindset Shift

In software, the model is simple:

One instruction runs after another.

In hardware:

Independent operations can run at the same time.

HLS converts C code into:

  • Datapath elements such as adders and multipliers
  • Registers
  • Memories
  • Control logic implemented as state machines

The tool decides how operations are scheduled across clock cycles.


From a C Function to an RTL Block

Consider:

int add(int a, int b) {
    return a + b;
}

HLS generates:

  • A module with two inputs
  • One output
  • Possibly handshake signals
  • One adder
  • One register if latency is greater than zero

Conceptually:

a -----
        |--> + --> result
b -----

How C Constructs Map to Hardware

Variables Become Registers

int x;

This becomes a 32-bit register. The width comes from the data type.


Static Variables Become Stateful Registers

static int counter = 0;
counter++;

This becomes:

  • A register
  • Feedback logic
  • Persistent state across calls

In RTL terms, it behaves like:

always @(posedge clk)
    counter <= counter + 1;

Arrays Become Memory

int A[1024];

This usually becomes block RAM.

If the array is small, it may become:

  • LUT RAM
  • Registers

The exact structure depends on:

  • Access pattern
  • Whether the array is partitioned
  • Pragmas applied

If and Else Become Multiplexers

if (a > b)
    c = a;
else
    c = b;

This becomes:

  • A comparator
  • A multiplexer

Loops Become State Machines

for(int i = 0; i < 8; i++)
    sum += A[i];

By default:

  • One iteration runs per clock cycle
  • A single adder is reused
  • A loop counter register is created

Latency is roughly 8 cycles.


Scheduling and Latency

HLS scheduling determines:

  • When each operation executes
  • How many hardware resources are created
  • Total latency
  • Initiation interval

Two key terms:

Latency: the total number of clock cycles required to complete one function call.

Initiation Interval (II): the number of cycles that must pass before a new input can start.


First Example: Vector Addition

Basic Code

void vec_add(int A[8], int B[8], int C[8]) {
    for (int i = 0; i < 8; i++) {
        C[i] = A[i] + B[i];
    }
}

Default Hardware

  • One adder
  • One loop counter
  • Eight iterations
  • Minimum latency of about 8 cycles

Structure:

   A[i] -----
             |--> + --> C[i]
   B[i] -----

A control state machine increments i.


Loop Unrolling

Adding:

#pragma HLS UNROLL

Now:

  • Eight adders are created
  • The loop controller disappears
  • All additions happen in parallel
  • Latency becomes 1 cycle

The tradeoff is area. Hardware usage increases roughly eight times.


Partial Unroll

#pragma HLS UNROLL factor=4

Now:

  • Four adders
  • Two cycles of execution
  • More balanced resource usage

Loop Pipelining

Consider:

for(int i = 0; i < 1024; i++)
    C[i] = A[i] + B[i];

Add:

#pragma HLS PIPELINE II=1

This tells the tool to start a new loop iteration every clock cycle.

A simplified timeline:

Cycle   Iteration
1       i = 0
2       i = 1
3       i = 2

Even if the addition itself has multiple stages, pipelining allows continuous throughput.


Memory Bottlenecks

Consider matrix multiplication:

for(i)
  for(j)
    for(k)
      C[i][j] += A[i][k] * B[k][j];

The issue is memory bandwidth.

A single block RAM has a limited number of read and write ports. If the loop body needs more accesses in one cycle than the ports provide, the pipeline stalls and II increases.

To solve this:

#pragma HLS ARRAY_PARTITION variable=A complete

Partitioning splits one large memory into several smaller memories, allowing parallel access.

Without partitioning:

  • Memory access becomes a bottleneck
  • II becomes greater than 1

Building a Matrix Multiplier Accelerator

Basic Version

#define N 4

void matmul(int A[N][N], int B[N][N], int C[N][N]) {
    for(int i = 0; i < N; i++) {
        for(int j = 0; j < N; j++) {
            int sum = 0;
            for(int k = 0; k < N; k++) {
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}

Hardware includes:

  • One multiplier
  • One adder
  • Nested state machines
  • Block RAM for A, B, and C

Latency is large and II is greater than 1.


Optimized Version

Add:

#pragma HLS PIPELINE
#pragma HLS ARRAY_PARTITION variable=A complete dim=2
#pragma HLS ARRAY_PARTITION variable=B complete dim=1

Now you get:

  • Multiple multipliers
  • An adder tree
  • Fewer stalls
  • II can approach 1

Fixed Point vs Floating Point

Floating point:

float x = a * b;

This becomes:

  • A floating point multiplier IP
  • Larger area
  • Higher latency

Fixed point:

ap_fixed<16,8> x;

This maps more directly to DSP blocks:

  • Smaller area
  • More predictable timing

Most machine learning accelerators use fixed point arithmetic for this reason.


Adding Interfaces

By default, HLS creates simple handshake interfaces.

To generate AXI Lite registers:

#pragma HLS INTERFACE s_axilite port=return
#pragma HLS INTERFACE s_axilite port=A
#pragma HLS INTERFACE s_axilite port=B
#pragma HLS INTERFACE s_axilite port=C

Now the block can be accessed by an ARM processor in a Zynq device through memory mapped registers.


AXI Stream Interface

#pragma HLS INTERFACE axis port=input
#pragma HLS INTERFACE axis port=output

This creates a streaming hardware block suitable for video or DSP pipelines.


Reading HLS Reports

After synthesis, focus on:

Latency: total clock cycles per function call.

Initiation Interval (II): if II = 1, the design can accept a new input every clock cycle.

Resource utilization:

Resource   Meaning
LUT        Logic
FF         Registers
DSP        Multipliers
BRAM       Memory

Critical path: check the estimated timing against your clock target.


Common Mistakes

Assuming Strict Sequential Execution

Code such as:

a = b + c;
d = e + f;

Because the two additions are independent, HLS schedules them in the same cycle rather than one after the other.


Ignoring Memory Ports

A block RAM typically has two ports, so it supports at most two accesses per cycle. Requesting more accesses in the same cycle causes stalls.


Over Unrolling

Unrolling large loops can cause:

  • Excessive DSP usage
  • Routing congestion
  • Timing failure

Example: Streaming FIR Filter

#include <ap_int.h>

#define N 4

void fir(ap_int<16> input,
         ap_int<16> *output) {

#pragma HLS PIPELINE II=1

    static ap_int<16> shift[N] = {0};
    ap_int<16> coeff[N] = {1,2,3,4};
    ap_int<32> acc = 0;

    for(int i = N-1; i > 0; i--)
        shift[i] = shift[i-1];

    shift[0] = input;

    for(int i = 0; i < N; i++)
        acc += shift[i] * coeff[i];

    *output = acc;
}

This produces hardware with:

  • Shift registers
  • DSP multipliers
  • An adder tree
  • One output per clock cycle

It becomes a streaming DSP block.


What HLS Cannot Synthesize

The following are not synthesizable:

  • Dynamic memory allocation (malloc, new)
  • Recursion
  • File input and output
  • Operating system calls

HLS is still a structural hardware description method, just expressed in C.


Practical Exercise

  1. Implement an 8 tap FIR filter.
  2. Synthesize without any pragmas.
  3. Record latency and DSP usage.
  4. Add PIPELINE and ARRAY_PARTITION pragmas.
  5. Compare the results.

Observe:

  • Change in II
  • Increase in resource usage
  • Reduction in latency

Final Thoughts

C in HLS describes:

  • Dataflow
  • Parallelism
  • Resource usage

It does not describe an instruction sequence the way software does.

When you can estimate:

  • Number of adders
  • Number of multipliers
  • Number of BRAM blocks
  • Number of clock cycles

before running synthesis, you are thinking in hardware terms.