
High-Level Synthesis (HLS) with C: A Deep Practical Guide from Code to RTL
High-Level Synthesis (HLS) lets you describe hardware in C or C++ and generate RTL from it. Tools such as Vitis HLS take a C function and turn it into synthesizable Verilog or VHDL.
HLS is often misunderstood.
It is not:
- A CPU compiler
- A push button way to convert software into hardware
- A substitute for understanding digital design
It is:
- A hardware generator that schedules operations over clock cycles
- A way to describe datapaths and control logic using C syntax
By the end of this discussion, you should be able to:
- Recognize how C constructs turn into hardware
- Read HLS synthesis reports with confidence
- Control latency and throughput
- Build a pipelined accelerator
- Add AXI interfaces
- Estimate hardware structure before running synthesis
The Required Mindset Shift
In software, the model is simple:
One instruction runs after another.
In hardware:
Independent operations can run at the same time.
HLS converts C code into:
- Datapath elements such as adders and multipliers
- Registers
- Memories
- Control logic implemented as state machines
The tool decides how operations are scheduled across clock cycles.
From a C Function to an RTL Block
Consider:
int add(int a, int b) {
    return a + b;
}
HLS generates:
- A module with two inputs
- One output
- Possibly handshake signals
- One adder
- One register if latency is greater than zero
Conceptually:
a ----+
      |--> + --> result
b ----+
How C Constructs Map to Hardware
Variables Become Registers
int x;
This becomes a 32-bit register; the width comes from the data type. If the value is produced and consumed within a single cycle, it may remain a plain wire instead.
Static Variables Become Stateful Registers
static int counter = 0;
counter++;
This becomes:
- A register
- Feedback logic
- Persistent state across calls
In RTL terms, it behaves like:
always @(posedge clk)
    counter <= counter + 1;
Arrays Become Memory
int A[1024];
This usually becomes block RAM.
If the array is small, it may become:
- LUT RAM
- Registers
The exact structure depends on:
- Access pattern
- Whether the array is partitioned
- Pragmas applied
If and Else Become Multiplexers
if (a > b)
    c = a;
else
    c = b;
This becomes:
- A comparator
- A multiplexer
Loops Become State Machines
for (int i = 0; i < 8; i++)
    sum += A[i];
By default:
- One iteration runs per clock cycle
- A single adder is reused
- A loop counter register is created
Latency is roughly 8 cycles.
Scheduling and Latency
HLS scheduling determines:
- When each operation executes
- How many hardware resources are created
- Total latency
- Initiation interval
Two key terms:
Latency: the total number of clock cycles required to complete one function call.
Initiation Interval (II): the number of cycles that must pass before a new input can start.
First Example: Vector Addition
Basic Code
void vec_add(int A[8], int B[8], int C[8]) {
    for (int i = 0; i < 8; i++) {
        C[i] = A[i] + B[i];
    }
}
Default Hardware
- One adder
- One loop counter
- Eight iterations
- Minimum latency of about 8 cycles
Structure:
A[i] ----+
         |--> + --> C[i]
B[i] ----+
A control state machine increments i.
Loop Unrolling
Adding:
#pragma HLS UNROLL
Now:
- Eight adders are created
- The loop controller disappears
- All additions happen in parallel
- Latency becomes 1 cycle
The tradeoff is area. Hardware usage increases roughly eight times.
Partial Unroll
#pragma HLS UNROLL factor=4
Now:
- Four adders
- Two cycles of execution
- More balanced resource usage
Loop Pipelining
Consider:
for (int i = 0; i < 1024; i++)
    C[i] = A[i] + B[i];
Add:
#pragma HLS PIPELINE II=1
This tells the tool to start a new loop iteration every clock cycle.
A simplified timeline:
| Cycle | Iteration |
|---|---|
| 1 | i = 0 |
| 2 | i = 1 |
| 3 | i = 2 |
Even if the addition itself has multiple stages, pipelining allows continuous throughput.
Memory Bottlenecks
Consider matrix multiplication:
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];
The issue is memory bandwidth.
A single block RAM has limited read and write ports. If multiple accesses are required in the same cycle, the pipeline stalls and II increases.
To solve this:
#pragma HLS ARRAY_PARTITION variable=A complete
Partitioning splits one large memory into several smaller memories, allowing parallel access.
Without partitioning:
- Memory access becomes a bottleneck
- II becomes greater than 1
Building a Matrix Multiplier Accelerator
Basic Version
#define N 4

void matmul(int A[N][N], int B[N][N], int C[N][N]) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            int sum = 0;
            for (int k = 0; k < N; k++) {
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}
Hardware includes:
- One multiplier
- One adder
- Nested state machines
- Block RAM for A, B, and C
Latency is large and II is greater than 1.
Optimized Version
Add:
#pragma HLS PIPELINE
#pragma HLS ARRAY_PARTITION variable=A complete dim=2
#pragma HLS ARRAY_PARTITION variable=B complete dim=1
Now you get:
- Multiple multipliers
- An adder tree
- Fewer stalls
- II can approach 1
Fixed Point vs Floating Point
Floating point:
float x = a * b;
This becomes:
- A floating point multiplier IP
- Larger area
- Higher latency
Fixed point:
ap_fixed<16,8> x;
This maps more directly to DSP blocks:
- Smaller area
- More predictable timing
Most machine learning accelerators use fixed point arithmetic for this reason.
Adding Interfaces
By default, HLS creates simple handshake interfaces.
To generate AXI Lite registers:
#pragma HLS INTERFACE s_axilite port=return
#pragma HLS INTERFACE s_axilite port=A
#pragma HLS INTERFACE s_axilite port=B
#pragma HLS INTERFACE s_axilite port=C
Now the block can be accessed by an ARM processor in a Zynq device through memory mapped registers.
AXI Stream Interface
#pragma HLS INTERFACE axis port=input
#pragma HLS INTERFACE axis port=output
This creates a streaming hardware block suitable for video or DSP pipelines.
Reading HLS Reports
After synthesis, focus on:
Latency: total clock cycles per function call.
Initiation Interval (II): if II = 1, the design can accept a new input every clock cycle.
Resource utilization
| Resource | Meaning |
|---|---|
| LUT | Logic |
| FF | Registers |
| DSP | Multipliers |
| BRAM | Memory |
Critical path: check the estimated timing against your clock target.
Common Mistakes
Assuming Strict Sequential Execution
Code such as:
a = b + c;
d = e + f;
These two additions can execute in parallel.
Ignoring Memory Ports
A single block RAM typically has two ports, so it supports at most two accesses per cycle. More simultaneous accesses cause stalls.
Over Unrolling
Unrolling large loops can cause:
- Excessive DSP usage
- Routing congestion
- Timing failure
Example: Streaming FIR Filter
#include <ap_int.h>
#define N 4

void fir(ap_int<16> input, ap_int<16> *output) {
#pragma HLS PIPELINE II=1
    static ap_int<16> shift[N] = {0};
    const ap_int<16> coeff[N] = {1, 2, 3, 4};
    ap_int<32> acc = 0;

    for (int i = N - 1; i > 0; i--)
        shift[i] = shift[i - 1];
    shift[0] = input;

    for (int i = 0; i < N; i++)
        acc += shift[i] * coeff[i];

    *output = acc; // note: the 32-bit accumulator is truncated to 16 bits
}
This produces hardware with:
- Shift registers
- DSP multipliers
- An adder tree
- One output per clock cycle
It becomes a streaming DSP block.
What HLS Cannot Synthesize
The following are not supported:
- Dynamic memory allocation (malloc and free)
- Recursion
- File input and output
- Operating system calls
HLS is still a structural hardware description method, just expressed in C.
Practical Exercise
- Implement an 8 tap FIR filter.
- Synthesize without any pragmas.
- Record latency and DSP usage.
- Add PIPELINE and ARRAY_PARTITION pragmas.
- Compare the results.
Observe:
- Change in II
- Increase in resource usage
- Reduction in latency
Final Thoughts
C in HLS describes:
- Dataflow
- Parallelism
- Resource usage
It does not describe an instruction sequence the way software does.
When you can estimate:
- Number of adders
- Number of multipliers
- Number of BRAM blocks
- Number of clock cycles
before running synthesis, you are thinking in hardware terms.