
High-Level Synthesis (HLS) with C: A Deep Practical Guide from Code to RTL
High-Level Synthesis (HLS) lets you describe hardware in C or C++ and generate RTL from it. Tools such as Vitis HLS take a C function and turn it into synthesizable Verilog or VHDL.
HLS is often misunderstood.
It is not:
- A CPU compiler
- A push button way to convert software into hardware
- A substitute for understanding digital design
It is:
- A hardware generator that schedules operations over clock cycles
- A way to describe datapaths and control logic using C syntax
By the end of this discussion, you should be able to:
- Recognize how C constructs turn into hardware
- Read HLS synthesis reports with confidence
- Control latency and throughput
- Build a pipelined accelerator
- Add AXI interfaces
- Estimate hardware structure before running synthesis
The Required Mindset Shift
In software, the model is simple:
One instruction runs after another.
In hardware:
Independent operations can run at the same time.
HLS converts C code into:
- Datapath elements such as adders and multipliers
- Registers
- Memories
- Control logic implemented as state machines
The tool decides how operations are scheduled across clock cycles.
From a C Function to an RTL Block
Consider:
int add(int a, int b) {
    return a + b;
}
HLS generates:
- A module with two inputs
- One output
- Possibly handshake signals
- One adder
- One register if latency is greater than zero
Conceptually:
a ----+
      |--> + --> result
b ----+
How C Constructs Map to Hardware
Variables Become Registers
int x;
This becomes a 32-bit register; the width comes from the data type. If the value is produced and consumed within a single cycle, it may remain a plain wire instead.
Static Variables Become Stateful Registers
static int counter = 0;
counter++;
This becomes:
- A register
- Feedback logic
- Persistent state across calls
In RTL terms, it behaves like:
always @(posedge clk)
    counter <= counter + 1;
Arrays Become Memory
int A[1024];
This usually becomes block RAM.
If the array is small, it may become:
- LUT RAM
- Registers
The exact structure depends on:
- Access pattern
- Whether the array is partitioned
- Pragmas applied
If and Else Become Multiplexers
if (a > b)
    c = a;
else
    c = b;
This becomes:
- A comparator
- A multiplexer
Loops Become State Machines
for (int i = 0; i < 8; i++)
    sum += A[i];
By default:
- One iteration runs per clock cycle
- A single adder is reused
- A loop counter register is created
Latency is roughly 8 cycles.
Scheduling and Latency
HLS scheduling determines:
- When each operation executes
- How many hardware resources are created
- Total latency
- Initiation interval
Two key terms:
Latency: the total number of clock cycles required to complete one function call.
Initiation Interval (II): the number of cycles that must pass before a new input can start.
First Example: Vector Addition
Basic Code
void vec_add(int A[8], int B[8], int C[8]) {
    for (int i = 0; i < 8; i++) {
        C[i] = A[i] + B[i];
    }
}
Default Hardware
- One adder
- One loop counter
- Eight iterations
- Minimum latency of about 8 cycles
Structure:
A[i] ----+
         |--> + --> C[i]
B[i] ----+
A control state machine increments i.
Loop Unrolling
Adding:
#pragma HLS UNROLL
Now:
- Eight adders are created
- The loop controller disappears
- All additions happen in parallel
- Latency becomes 1 cycle
The tradeoff is area. Hardware usage increases roughly eight times.
Partial Unroll
#pragma HLS UNROLL factor=4
Now:
- Four adders
- Two cycles of execution
- More balanced resource usage
Loop Pipelining
Consider:
for (int i = 0; i < 1024; i++)
    C[i] = A[i] + B[i];
Add:
#pragma HLS PIPELINE II=1
This tells the tool to start a new loop iteration every clock cycle.
A simplified timeline:
| Cycle | Iteration |
|---|---|
| 1 | i = 0 |
| 2 | i = 1 |
| 3 | i = 2 |
Even if the addition itself has multiple stages, pipelining allows continuous throughput.
Memory Bottlenecks
Consider matrix multiplication:
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];
The issue is memory bandwidth.
A single block RAM has limited read and write ports. If multiple accesses are required in the same cycle, the pipeline stalls and II increases.
To solve this:
#pragma HLS ARRAY_PARTITION variable=A complete
Partitioning splits one large memory into several smaller memories, allowing parallel access.
Without partitioning:
- Memory access becomes a bottleneck
- II becomes greater than 1
Building a Matrix Multiplier Accelerator
Basic Version
#define N 4

void matmul(int A[N][N], int B[N][N], int C[N][N]) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            int sum = 0;
            for (int k = 0; k < N; k++) {
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}
Hardware includes:
- One multiplier
- One adder
- Nested state machines
- Block RAM for A, B, and C
Latency is large and II is greater than 1.
Optimized Version
Add:
#pragma HLS PIPELINE
#pragma HLS ARRAY_PARTITION variable=A complete dim=2
#pragma HLS ARRAY_PARTITION variable=B complete dim=1
Now you get:
- Multiple multipliers
- An adder tree
- Fewer stalls
- II can approach 1
Fixed Point vs Floating Point
Floating point:
float x = a * b;
This becomes:
- A floating point multiplier IP
- Larger area
- Higher latency
Fixed point:
ap_fixed<16,8> x;
This maps more directly to DSP blocks:
- Smaller area
- More predictable timing
Most machine learning accelerators use fixed point arithmetic for this reason.
Adding Interfaces
By default, HLS creates simple handshake interfaces.
To generate AXI Lite registers:
#pragma HLS INTERFACE s_axilite port=return
#pragma HLS INTERFACE s_axilite port=A
#pragma HLS INTERFACE s_axilite port=B
#pragma HLS INTERFACE s_axilite port=C
Now the block can be accessed by an ARM processor in a Zynq device through memory mapped registers.
AXI Stream Interface
#pragma HLS INTERFACE axis port=input
#pragma HLS INTERFACE axis port=output
This creates a streaming hardware block suitable for video or DSP pipelines.
Reading HLS Reports
After synthesis, focus on:
Latency: total clock cycles per function call.
Initiation Interval (II): if II = 1, the design can accept a new input every clock cycle.
Resource utilization
| Resource | Meaning |
|---|---|
| LUT | Logic |
| FF | Registers |
| DSP | Multipliers |
| BRAM | Memory |
Critical path: check the estimated timing against your clock target.
Common Mistakes
Assuming Strict Sequential Execution
Code such as:
a = b + c;
d = e + f;
These two additions can execute in parallel.
Ignoring Memory Ports
A single block RAM typically has two ports, so it supports at most two accesses per cycle. More simultaneous accesses cause stalls.
Over Unrolling
Unrolling large loops can cause:
- Excessive DSP usage
- Routing congestion
- Timing failure
Example: Streaming FIR Filter
#include <ap_int.h>
#define N 4

void fir(ap_int<16> input, ap_int<16> *output) {
#pragma HLS PIPELINE II=1
    static ap_int<16> shift[N] = {0};
    const ap_int<16> coeff[N] = {1, 2, 3, 4};
    ap_int<32> acc = 0;

    for (int i = N - 1; i > 0; i--)
        shift[i] = shift[i - 1];
    shift[0] = input;

    for (int i = 0; i < N; i++)
        acc += shift[i] * coeff[i];

    *output = acc; // note: the 32-bit accumulator is truncated to 16 bits
}
This produces hardware with:
- Shift registers
- DSP multipliers
- An adder tree
- One output per clock cycle
It becomes a streaming DSP block.
What HLS Cannot Synthesize
The following are not supported:
- Dynamic memory allocation (malloc and free)
- Recursion
- File input and output
- Operating system calls
HLS is still a structural hardware description method, just expressed in C.
Practical Exercise
- Implement an 8 tap FIR filter.
- Synthesize without any pragmas.
- Record latency and DSP usage.
- Add PIPELINE and ARRAY_PARTITION pragmas.
- Compare the results.
Observe:
- Change in II
- Increase in resource usage
- Reduction in latency
Final Thoughts
C in HLS describes:
- Dataflow
- Parallelism
- Resource usage
It does not describe an instruction sequence the way software does.
When you can estimate:
- Number of adders
- Number of multipliers
- Number of BRAM blocks
- Number of clock cycles
before running synthesis, you are thinking in hardware terms.