High-Level Synthesis with C: A Practical Guide from Code to RTL

January 25, 2026

Tags: hardware, fpga, hls, Vitis HLS, Vivado HLS, C to RTL, Zynq, pipeline, loop unrolling, array partition, AXI4, AXI-Stream, BRAM, DSP, hardware acceleration, FPGA design

High-level synthesis takes a C or C++ function and produces synthesizable RTL from it. Vitis HLS and the older Vivado HLS are the dominant tools in the Xilinx ecosystem. The basic premise is that instead of writing register-transfer-level Verilog describing exactly which signals connect to which flip-flops, you write a behavioral description in C and let the tool schedule operations into clock cycles and generate the corresponding hardware.

This is regularly misunderstood in two opposite directions. One group treats HLS as a magic black box that turns software into hardware without requiring any understanding of digital design. The other group dismisses it as a toy that produces inferior RTL compared to handwritten Verilog. Both are wrong. HLS is a hardware generator that operates on a C description. You have to understand what hardware you want, express it clearly in C, and guide the tool with pragmas that communicate your intent. The output RTL is predictable once you understand the mapping from C constructs to hardware structures, and understanding that mapping is the entire point of this post.

By the end, you should be able to look at a C function with HLS pragmas and estimate the resulting hardware structure before running synthesis, read the reports the tool generates and understand what they are telling you, and know which pragma and code changes have meaningful impact on area, latency, and throughput.

Setting Things Up

Software execution is sequential by default. One instruction runs, then the next. Parallelism in software is something the programmer adds explicitly (threads, SIMD intrinsics) on top of a sequential model.

Hardware execution is parallel by default. Every piece of combinational logic that has valid inputs produces a valid output simultaneously. Sequential behavior in hardware is something you add explicitly by inserting registers that break the computation into stages.

HLS operates on C code but generates hardware. Its job is to identify which operations in the C code are independent of each other and can therefore execute in parallel, and which operations have data dependencies that require ordering. The tool then schedules operations into clock cycles: independent operations may be placed in the same clock cycle, dependent operations are placed in successive cycles.
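
To make the scheduling idea concrete, here is a minimal sketch of as-soon-as-possible scheduling in plain C++. It is not how Vitis HLS is implemented; the Op struct, the latencies, and the function name asap_schedule are hypothetical names chosen for illustration. Each operation starts in the first cycle after everything it depends on has finished, so independent operations naturally land in the same cycle.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of one operation: how many cycles it takes and
// which earlier operations (by index) must finish before it can start.
struct Op {
    int latency;
    std::vector<std::size_t> deps;
};

// Returns the start cycle of every operation under ASAP scheduling.
// Operations must be listed so that dependencies come before their users.
std::vector<int> asap_schedule(const std::vector<Op>& ops) {
    std::vector<int> start(ops.size(), 0);
    for (std::size_t i = 0; i < ops.size(); ++i) {
        for (std::size_t d : ops[i].deps) {
            assert(d < i && "dependencies must precede their users");
            // An operation cannot start before its producer has finished.
            start[i] = std::max(start[i], start[d] + ops[d].latency);
        }
    }
    return start;
}
```

For x = a + b, y = c + d, z = x * y with one-cycle adders, both additions are scheduled in cycle 0 and the multiply in cycle 1, which is the "independent in parallel, dependent in sequence" rule described above.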

The fundamental output of HLS is not a program. It is a datapath with a control state machine. The datapath contains the arithmetic and logic units that perform the computation. The state machine controls which data flows through which units in each clock cycle. Both are generated automatically, but the structure of the C code determines what the datapath and state machine look like.

How C Constructs Map to Hardware

The mapping from C constructs to hardware is concrete and consistent. Understanding it means you can predict the hardware structure from the code rather than running synthesis to find out.

Variables and Registers

A local variable declared inside the function becomes a register in the datapath. The width of the register matches the data type: int gives a 32-bit register, short gives a 16-bit register, ap_int<N> from the Vitis HLS arbitrary precision header gives an N-bit register. Variables that hold intermediate results of computations are typically mapped to pipeline registers that sit between successive stages of the datapath.

int a = 5;
int b = a + 3;
int c = b * 2;

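This produces two operations: an adder and a multiplier. (With literal constants like these the tool would simply fold everything down to c = 16, and a multiply by 2 reduces to a shift; read a as a runtime input and the multiply as a general one.) If the tool can schedule them in the same clock cycle (given sufficient clock period), it will. If not, a register holds the result of the addition while the multiplication takes place in the next cycle.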

Static Variables and Persistent State

A static local variable persists across function calls. In hardware, this corresponds to a register that holds its value between invocations of the module. The HLS tool generates a register with feedback: the stored value is the input to the next write operation.

void accumulate(int input, int *output) {
    static int acc = 0;
    acc += input;
    *output = acc;
}

This produces a register for acc, an adder that computes the new value, and a feedback path from the adder output back to the register input. Every time the function is called (every time the hardware is activated), the register updates. This is functionally equivalent to:

always @(posedge clk)
    acc <= acc + in_data;  // the C argument "input"; input is a reserved word in Verilog

Static arrays become multi-element persistent memories, typically block RAM, with the same behavior: their contents persist across calls.

Arrays and Memory

A local array declared inside the function becomes a memory. The HLS tool infers block RAM (BRAM) for arrays that benefit from it and distributed RAM (LUTRAM) for smaller arrays. The choice depends on the access pattern, array size, and applied pragmas.

int A[1024];

Without partitioning pragmas, this maps to a single block RAM. A typical BRAM on UltraScale devices has two ports: one read port and one write port in simple dual-port mode, or two read/write ports in true dual-port mode. If the C code requires more than two accesses to A in the same clock cycle (for example, inside an unrolled loop), the tool cannot schedule them simultaneously against a single BRAM. It spreads the accesses over several cycles instead, and the initiation interval increases.

The access pattern is critical. An array accessed sequentially in a loop, where each loop iteration reads one element, maps efficiently to block RAM with a single read port. An array where each iteration of an unrolled loop reads multiple elements simultaneously requires either multiple BRAM instances (via array partitioning) or a wider BRAM word that packages multiple elements per address.
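
The port limit turns into a simple lower bound on II. The sketch below assumes each port serves one access per cycle and ignores every other constraint; min_ii_from_ports is a hypothetical helper name, not a tool API.

```cpp
#include <cassert>

// Lower bound on the initiation interval imposed by memory ports alone:
// every access to one memory in one iteration needs a port slot, and each
// port serves one access per cycle, so II >= ceil(accesses / ports).
int min_ii_from_ports(int accesses_per_iteration, int ports) {
    assert(accesses_per_iteration >= 0 && ports > 0);
    if (accesses_per_iteration == 0) {
        return 1;  // no memory traffic, so memory does not limit II
    }
    return (accesses_per_iteration + ports - 1) / ports;
}
```

Eight reads per iteration against a two-port BRAM give a bound of 4, which is why an unrolled loop over an unpartitioned array rarely reaches II = 1.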

Conditionals and Multiplexers

An if-else statement generates a comparator and a multiplexer. The condition is evaluated in combinational logic. Both branches may be computed simultaneously (the hardware for each branch exists in parallel), and the mux selects which result is valid. This is called predicated execution and is the default behavior in HLS when both branches are short.

int result;
if (a > b)
    result = a - b;
else
    result = b - a;

This produces one comparator, two subtractors (both computing their results simultaneously), and one 2-to-1 mux at the output. Neither branch is conditional at the hardware level – both always compute their result, and the comparator selects which one passes through. If one branch is substantially more expensive (say, one involves a divide), it may make sense to restructure to avoid the expensive path being always active.

When one branch is significantly longer than the other, the tool has to schedule both paths to the same depth to maintain a consistent latency. The longer branch determines the total latency, and the shorter branch’s hardware sits idle for the extra cycles. This is the hardware equivalent of branch prediction failing: you always pay for the worst case.
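
One way to see the comparator-plus-mux structure is to write the same function with both results computed unconditionally. This is only a sketch of the hardware view; abs_diff_mux and its check function are hypothetical names, and the tool produces this structure from the ordinary if-else without any rewriting.

```cpp
#include <cassert>

// Both subtractions are computed unconditionally, mirroring the two
// subtractors that sit side by side in the generated datapath.
int abs_diff_mux(int a, int b) {
    int diff_ab = a - b;             // subtractor 1
    int diff_ba = b - a;             // subtractor 2
    bool sel = (a > b);              // comparator
    return sel ? diff_ab : diff_ba;  // 2-to-1 mux
}

// Small self-check in the style of a C simulation testbench.
void check_abs_diff_mux() {
    assert(abs_diff_mux(7, 3) == 4);
    assert(abs_diff_mux(3, 7) == 4);
    assert(abs_diff_mux(5, 5) == 0);
}
```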

Loops and State Machines

A for loop generates a state machine that iterates. By default, each iteration of the loop executes sequentially: the state machine advances one state per clock cycle (or per iteration latency), and a loop counter register tracks progress.

int sum = 0;
for (int i = 0; i < 8; i++)
    sum += A[i];

Without optimization pragmas, this generates a single adder, a loop counter register, and a small state machine that steps through the loop body once per iteration, 8 times in total. The adder is reused across all 8 iterations. Total latency is about 8 cycles (plus memory read latency). Initiation interval is likewise about 8 cycles: the function cannot accept a new input until the loop completes.

The loop generates a state machine because the C loop implies sequential iteration. The tool respects that ordering unless you explicitly tell it to relax it via PIPELINE or UNROLL pragmas.

Latency and Initiation Interval

Two numbers define the performance of an HLS module. Understanding both is necessary to evaluate whether the generated hardware meets the application’s requirements.

Latency is the number of clock cycles from when the module accepts its inputs to when valid outputs are produced. For a simple combinational function with no loops and no memory access, the latency after pipelining may be as low as 1 or 2 cycles. For a function with a loop over 1024 elements, the minimum latency is 1024 cycles if the loop runs sequentially.

Initiation interval (II) is the number of clock cycles that must pass before the module can accept a new set of inputs. If a module has latency 8 and II = 8, it processes one input per 8 cycles. If it has latency 8 and II = 1 (achieved through pipelining), it can accept a new input every cycle and processes inputs at full throughput, with each input taking 8 cycles to produce its result.

The distinction matters for throughput. A module with II = 1 and latency 8 produces one output per clock cycle at steady state, even though each individual result takes 8 cycles. A module with II = 8 produces one output every 8 cycles at steady state. For a stream-processing application, II = 1 is the target. For a function called infrequently on large blocks of data, the latency matters more than II.
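
A back-of-envelope model makes the difference visible. The function below is a sketch under the assumption that inputs are always available and nothing stalls; total_cycles is a hypothetical name.

```cpp
#include <cassert>

// Cycles from accepting the first input to producing the last output when
// `count` inputs are fed back to back: the first result appears after
// `latency` cycles, and each further input is accepted `ii` cycles later.
long total_cycles(long latency, long ii, long count) {
    assert(latency >= 1 && ii >= 1 && count >= 1);
    return latency + (count - 1) * ii;
}
```

For 1000 inputs and a latency of 8, II = 1 finishes in 1007 cycles while II = 8 needs 8000. The latency term is paid once; the II term is paid per input.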

The PIPELINE pragma targets II. The UNROLL pragma reduces latency at the cost of area.

Loop Pipelining

The most impactful single pragma in HLS is HLS PIPELINE. It transforms a sequential loop into a pipelined one where a new iteration can begin before the previous one finishes.

void vec_add(int A[1024], int B[1024], int C[1024]) {
    for (int i = 0; i < 1024; i++) {
        #pragma HLS PIPELINE II=1
        C[i] = A[i] + B[i];
    }
}

With this pragma, the tool attempts to schedule the loop body such that a new iteration starts every clock cycle. If the loop body takes 3 clock cycles (for example, one cycle for the memory reads, one for the addition, one for the write), the pipeline has 3 stages, and iterations overlap:

Cycle 1:  Read A[0], Read B[0]
Cycle 2:  Add A[0]+B[0],    Read A[1], Read B[1]
Cycle 3:  Write C[0],       Add A[1]+B[1],    Read A[2], Read B[2]
Cycle 4:                    Write C[1],        Add A[2]+B[2],    ...

At steady state, one result is produced per cycle. The total latency for 1024 iterations is 1024 + (pipeline depth - 1) cycles, which is 1026 cycles for this 3-stage pipeline (a deeper pipeline adds a few more), rather than the 3072 cycles needed to run each 3-cycle iteration back to back.
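
The overlap can be checked with a tiny cycle-stepping model. This is a sketch, not a description of the tool's internals; simulate_pipelined_loop is a hypothetical name, and the model assumes no stalls.

```cpp
#include <cassert>

// Steps a loop pipeline one clock cycle at a time and returns the number
// of cycles until the last iteration leaves the final stage.
// `depth` is the iteration latency in cycles; `ii` is the initiation interval.
int simulate_pipelined_loop(int trip_count, int depth, int ii) {
    assert(trip_count > 0 && depth > 0 && ii > 0);
    int issued = 0;   // iterations that have entered the first stage
    int retired = 0;  // iterations that have left the last stage
    int cycle = 0;    // clock cycles elapsed so far
    while (retired < trip_count) {
        // A new iteration enters every `ii` cycles while any remain.
        if (issued < trip_count && cycle % ii == 0) {
            ++issued;
        }
        ++cycle;
        // Iteration k entered at cycle k * ii and leaves `depth` cycles later.
        if (retired < issued && cycle == retired * ii + depth) {
            ++retired;
        }
    }
    return cycle;
}
```

With a trip count of 1024 and a depth of 3, the model returns 1026 cycles at II = 1 and 3072 cycles at II = 3, where II = 3 is the non-overlapped case.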

The tool attempts to achieve the requested II but may fail if data dependencies or resource conflicts prevent it. A common failure is memory access bottleneck: if the loop body reads two elements from the same array in the same cycle, and that array is a single-port BRAM, the tool cannot schedule both reads simultaneously. The resulting II will be 2, not 1. The synthesis report states the achieved II and the reason if it failed to reach the requested value.

Loop Unrolling

Where PIPELINE improves throughput (reduces II), UNROLL reduces latency by instantiating multiple copies of the loop body hardware.

void vec_add_small(int A[8], int B[8], int C[8]) {
    for (int i = 0; i < 8; i++) {
        #pragma HLS UNROLL  // the pragma goes inside the body of the loop it applies to
        C[i] = A[i] + B[i];
    }
}

Full unrolling instantiates 8 adders, one per iteration, and eliminates the loop counter. All 8 additions execute in the same clock cycle, reducing latency from 8 cycles to 1. Area increases by a factor of 8 for the adder hardware. If the array maps to block RAM, all 8 reads must also happen simultaneously, which requires 8 BRAM ports, which requires array partitioning.

Partial unrolling is the practical compromise:

#pragma HLS UNROLL factor=4

This instantiates 4 adders and runs the 8-element loop in 2 iterations of 4 elements each. Compared with the rolled loop, latency drops by a factor of 4 (from 8 cycles to 2), adder area grows by a factor of 4 (4 adders instead of 1), and memory bandwidth requirements grow by the same factor (4 reads per array per cycle instead of 1).

The interaction between unrolling factor and memory partitioning is a recurring design problem. Unrolling by 4 demands 4-wide memory access. If the array is not partitioned to allow 4 simultaneous reads, the unrolling does not reduce latency – the memory bottleneck limits the pipeline to one access per cycle regardless of how many adders are available.
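
It can help to write out by hand what a factor-4 unroll asks the tool to build. The function below is an illustrative sketch with a hypothetical name; it takes a length argument and assumes that length is a multiple of 4, which the tool would otherwise handle with an extra exit check.

```cpp
#include <cassert>

// Hand-unrolled by 4: each trip performs four independent additions, which
// is the structure `#pragma HLS UNROLL factor=4` asks the tool to build.
void vec_add_unroll4(const int* A, const int* B, int* C, int len) {
    assert(len >= 0 && len % 4 == 0);  // sketch assumes no leftover elements
    for (int i = 0; i < len; i += 4) {
        C[i]     = A[i]     + B[i];      // adder 0
        C[i + 1] = A[i + 1] + B[i + 1];  // adder 1
        C[i + 2] = A[i + 2] + B[i + 2];  // adder 2
        C[i + 3] = A[i + 3] + B[i + 3];  // adder 3
    }
}
```

Written this way, the memory demand is obvious: every trip touches four elements of A, four of B, and four of C, and those accesses only happen in one cycle if the arrays offer that many ports.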

Memory Partitioning

Array partitioning splits one logical array into multiple physical memories, each with its own access ports.

int A[1024];
#pragma HLS ARRAY_PARTITION variable=A cyclic factor=4 dim=1

Cyclic partitioning with factor 4 splits the array into 4 banks using interleaving: element 0 goes to bank 0, element 1 to bank 1, element 2 to bank 2, element 3 to bank 3, element 4 back to bank 0, and so on. Elements at indices i%4 == k go to bank k. Each bank is a separate BRAM instance with its own read and write ports. Four elements at consecutive indices can now be read simultaneously because they reside in four separate memories.

Block partitioning is the alternative: elements 0-255 go to bank 0, 256-511 to bank 1, and so on. This is preferable when the access pattern is blocked rather than strided.
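
The two mappings are easy to model. The sketch below uses hypothetical names (BankAddr, cyclic_addr, block_addr, cyclic_conflict_free), assumes one access per bank per cycle, and rounds the block size up when the array length does not divide evenly, which is an assumption rather than documented tool behavior.

```cpp
#include <cassert>
#include <vector>

// Where one element of a partitioned array lives.
struct BankAddr {
    int bank;    // which physical memory
    int offset;  // address inside that memory
};

// Cyclic partitioning: consecutive indices rotate through the banks.
BankAddr cyclic_addr(int index, int factor) {
    assert(index >= 0 && factor > 0);
    return BankAddr{index % factor, index / factor};
}

// Block partitioning: each bank holds one contiguous chunk of the array.
BankAddr block_addr(int index, int size, int factor) {
    assert(index >= 0 && index < size && factor > 0);
    int chunk = (size + factor - 1) / factor;  // elements per bank, rounded up
    return BankAddr{index / chunk, index % chunk};
}

// True if `count` consecutive indices starting at `first` all land in
// different banks under cyclic partitioning, so no bank is hit twice.
bool cyclic_conflict_free(int first, int count, int factor) {
    std::vector<bool> used(factor, false);
    for (int i = first; i < first + count; ++i) {
        int bank = cyclic_addr(i, factor).bank;
        if (used[bank]) {
            return false;
        }
        used[bank] = true;
    }
    return true;
}
```

With factor 4, any four consecutive indices are conflict-free under the cyclic mapping, while a fifth access wraps around to a bank that is already in use.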

Complete partitioning (complete instead of cyclic factor=4) converts the entire array to registers, one register per element (32 flip-flops for each int). Every element is simultaneously accessible. This is the only partitioning strategy that allows random access to any element in any cycle without port conflicts. It is appropriate for small arrays (coefficients, lookup tables) where the register cost is acceptable. For a 1024-element array, complete partitioning produces 1024 32-bit registers (32,768 flip-flops, plus wide multiplexers for any variable index) from what would otherwise be a single BRAM, which is almost certainly wrong.

The synthesis report shows the achieved II and the bottleneck when II is worse than requested. When memory port conflicts are the cause, the fix is more partitioning or a restructured access pattern. When the cause is a data-flow dependence (a loop where each iteration depends on the result of the previous one, like an accumulator), no amount of partitioning helps: the chain of dependent operations is inherently sequential, and PIPELINE cannot reduce II below the latency of that chain (the dependency distance, in this post's terms).

A Worked Example: Matrix Multiplication

Matrix multiplication is the canonical HLS example because it illustrates all the performance issues simultaneously.

Baseline

#define N 8

void matmul(int A[N][N], int B[N][N], int C[N][N]) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            int acc = 0;
            for (int k = 0; k < N; k++) {
                acc += A[i][k] * B[k][j];
            }
            C[i][j] = acc;
        }
    }
}

Without any pragmas, the tool generates three nested state machines. The innermost loop executes sequentially: one multiply-accumulate at a time, 8 iterations, running once per (i, j) pair, for 8 x 8 x 8 = 512 multiply-accumulate operations total. Even at one cycle per multiply-accumulate that is 512 cycles; with memory read latency, multiplier latency, and loop entry and exit overhead, the reported latency will be noticeably higher, with the exact figure depending on how the tool schedules the memory accesses. II is similarly large because the function cannot restart until the entire triple loop completes.

This is not useful for any application that needs matrix multiplication at high throughput.

Pipelining the Inner Loop

void matmul(int A[N][N], int B[N][N], int C[N][N]) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            int acc = 0;
            for (int k = 0; k < N; k++) {
                #pragma HLS PIPELINE II=1
                acc += A[i][k] * B[k][j];
            }
            C[i][j] = acc;
        }
    }
}

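Pipelining the inner loop targets II = 1 for the accumulation loop. The accumulation has a loop-carried dependency: acc in iteration k depends on acc from iteration k-1. The achievable II is bounded by the latency of the operations on that dependency path. Here only the addition is on the path; the multiplication does not read acc, so the tool can pipeline it ahead of the add. On a 28nm FPGA with DSP48 blocks, a 32-bit multiply typically takes about 3 cycles, which deepens the pipeline but does not limit II, and a 32-bit integer add normally fits in one cycle, so II = 1 is usually achievable for this int version. The picture changes when the add itself takes several cycles, as with a float or double accumulator: the tool then reports an achieved II equal to the adder latency unless the accumulation is reorganized.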

The other constraint to check is memory bandwidth. Reading A[i][k] and B[k][j] in each inner loop iteration requires one read from each matrix per cycle. If A and B each map to their own BRAM, each has one read port available (in simple dual-port mode), so the two memories give two independent read ports, one for A and one for B. This works as long as the inner loop accesses A and B once per iteration, which is the case here. Memory ports therefore do not prevent II = 1 in this version; the loop-carried dependency on acc is the only remaining constraint.

Partitioning and Full Optimization

For small N, completely partitioning the input matrices along the k dimension exposes a full row of A and a full column of B in the same cycle.

#pragma HLS ARRAY_PARTITION variable=A complete dim=2
#pragma HLS ARRAY_PARTITION variable=B complete dim=1

Partitioning A on dimension 2 (the column index k) means all N elements of a given row of A are simultaneously accessible. Partitioning B on dimension 1 (the row index k) means all N elements of a given column of B are simultaneously accessible. The inner loop can now be fully unrolled:

void matmul(int A[N][N], int B[N][N], int C[N][N]) {
    #pragma HLS ARRAY_PARTITION variable=A complete dim=2
    #pragma HLS ARRAY_PARTITION variable=B complete dim=1

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            #pragma HLS PIPELINE II=1
            int acc = 0;
            for (int k = 0; k < N; k++) {
                #pragma HLS UNROLL
                acc += A[i][k] * B[k][j];
            }
            C[i][j] = acc;
        }
    }
}

With the inner loop unrolled and all elements accessible simultaneously, the tool instantiates N multipliers and an adder tree. The 8-element inner product is computed in one pipeline pass. The outer loops still run sequentially, giving a latency of about 8 x 8 = 64 cycles plus the pipeline depth, with II = 1 for the (i, j) iterations if the tool meets the target.

The hardware now contains 8 32-bit multipliers (each typically built from several DSP blocks), an 8-input adder tree, and the storage for the two partitioned 8x8 input matrices: 128 32-bit words in total, split into 8 banks per matrix. For a kernel this small, that is a large hardware footprint. The tradeoff is the expected one: high throughput at high area cost. Whether this tradeoff is appropriate depends on how large N is and how fast the application needs results.

For large N, complete partitioning becomes impractical (an NxN matrix of 1024x1024 32-bit integers would require 2^20 registers per matrix, which is not feasible). The standard approach for large matrices is tiling: partition the computation into smaller tiles that fit in on-chip memory, optimize the tile computation with partitioning and unrolling, and handle the data movement from off-chip memory separately.
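
A minimal sketch of the tiling structure, in plain C++ without pragmas, looks like the following. DIM, TILE, and the function names are hypothetical; the local tile arrays stand in for on-chip buffers and the copy loops stand in for the burst transfers from off-chip memory. In a real design, tile_mac is the part that would receive the PIPELINE, UNROLL, and ARRAY_PARTITION treatment.

```cpp
#include <cassert>

const int DIM = 8;   // full matrix edge, kept small so the check runs quickly
const int TILE = 4;  // tile edge, a hypothetical choice that divides DIM
static_assert(DIM % TILE == 0, "TILE must divide DIM");

// Ct += At * Bt for one TILE x TILE block held in local buffers.
void tile_mac(const int At[TILE][TILE], const int Bt[TILE][TILE],
              int Ct[TILE][TILE]) {
    for (int i = 0; i < TILE; i++)
        for (int j = 0; j < TILE; j++)
            for (int k = 0; k < TILE; k++)
                Ct[i][j] += At[i][k] * Bt[k][j];
}

void matmul_tiled(const int A[DIM][DIM], const int B[DIM][DIM],
                  int C[DIM][DIM]) {
    for (int ti = 0; ti < DIM; ti += TILE) {
        for (int tj = 0; tj < DIM; tj += TILE) {
            int Ct[TILE][TILE] = {};  // output tile accumulates locally
            for (int tk = 0; tk < DIM; tk += TILE) {
                int At[TILE][TILE];
                int Bt[TILE][TILE];
                // Copy-in: models loading one tile of A and one of B.
                for (int i = 0; i < TILE; i++) {
                    for (int j = 0; j < TILE; j++) {
                        At[i][j] = A[ti + i][tk + j];
                        Bt[i][j] = B[tk + i][tj + j];
                    }
                }
                tile_mac(At, Bt, Ct);
            }
            // Copy-out: models writing the finished tile back.
            for (int i = 0; i < TILE; i++)
                for (int j = 0; j < TILE; j++)
                    C[ti + i][tj + j] = Ct[i][j];
        }
    }
}

// Testbench-style check against the plain triple loop.
void check_matmul_tiled() {
    int A[DIM][DIM], B[DIM][DIM], C[DIM][DIM];
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            A[i][j] = i + j;
            B[i][j] = i - j;
        }
    matmul_tiled(A, B, C);
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            int ref = 0;
            for (int k = 0; k < DIM; k++) ref += A[i][k] * B[k][j];
            assert(C[i][j] == ref);
        }
}
```

Only the tile buffers need to be partitioned, so the register and port cost depends on TILE rather than on the full matrix size.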

Fixed-Point Arithmetic in HLS

Vitis HLS provides the ap_fixed and ap_int header types for arbitrary-precision and fixed-point arithmetic. Using float or double in HLS produces floating-point hardware with the area and latency costs described in the floating-point section of the previous topic. Using ap_fixed produces fixed-point datapaths that map directly to DSP blocks.

#include <ap_fixed.h>

typedef ap_fixed<16, 8> fixed_t;  // 16-bit total, 8 integer bits

void fir_fixed(fixed_t input, fixed_t *output) {
    #pragma HLS PIPELINE II=1

    static fixed_t shift[4] = {0, 0, 0, 0};
    const fixed_t coeff[4] = {0.125, 0.25, 0.25, 0.125};

    for (int i = 3; i > 0; i--)
        shift[i] = shift[i-1];
    shift[0] = input;

    fixed_t acc = 0;
    for (int i = 0; i < 4; i++)
        acc += shift[i] * coeff[i];

    *output = acc;
}

The ap_fixed<16, 8> type tells the tool that values are 16-bit fixed-point with 8 integer bits and 8 fractional bits. The tool generates multipliers sized for 16-bit operands and an adder sized for the accumulator width. These map to DSP48 blocks efficiently on Xilinx devices.

Using float in the same code:

void fir_float(float input, float *output) {
    static float shift[4] = {0, 0, 0, 0};
    const float coeff[4] = {0.125f, 0.25f, 0.25f, 0.125f};
    ...
}

produces a floating-point multiplier IP and floating-point adder IP, each requiring multiple DSP blocks and several cycles of pipeline latency. The fixed-point version is smaller, faster, and consumes fewer DSP blocks, at the cost of the precision and dynamic range tradeoff discussed in the floating-point vs fixed-point context.

For signal processing applications where the signal range is bounded and the precision of 16-bit fixed-point is sufficient (the ap_fixed<16, 8> type above has 8 fractional bits, a resolution of 2^-8 or about 3.9x10^-3; a format with 16 fractional bits would give about 1.5x10^-5), the fixed-point HLS implementation is strictly better in hardware cost.
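
The arithmetic behind a 16-bit format with 8 fractional bits can be modeled with ordinary integers. This is a stand-in sketch, not the ap_fixed implementation: the names q8_8_t, to_q8_8, from_q8_8, and q8_8_mul are hypothetical, quantization truncates toward minus infinity, and overflow handling is left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Plain-integer stand-in for a 16-bit fixed-point value with 8 integer
// bits and 8 fractional bits, so one LSB is worth 2^-8.
typedef std::int16_t q8_8_t;
const int FRAC_BITS = 8;

// Quantize by truncating toward minus infinity. Values outside the
// representable range are rejected rather than wrapped or saturated.
q8_8_t to_q8_8(double x) {
    assert(x >= -128.0 && x < 128.0);
    return static_cast<q8_8_t>(std::floor(x * (1 << FRAC_BITS)));
}

double from_q8_8(q8_8_t v) {
    return static_cast<double>(v) / (1 << FRAC_BITS);
}

// Multiply: the full product carries 16 fractional bits, so 8 of them are
// shifted away to return to the 8-fractional-bit format. The shift relies
// on arithmetic right shift for negative values, which mainstream
// compilers provide.
q8_8_t q8_8_mul(q8_8_t a, q8_8_t b) {
    std::int32_t wide = static_cast<std::int32_t>(a) * static_cast<std::int32_t>(b);
    return static_cast<q8_8_t>(wide >> FRAC_BITS);
}
```

In this format 0.125 is stored as 32 and 1.5 x 0.25 comes out exactly as 0.375, but a coefficient such as 0.003 quantizes to zero. Checking coefficient values against the resolution is the first thing to do when choosing a fixed-point type.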

FIR Filter: A Complete Example

A finite impulse response filter is one of the most common DSP blocks and maps cleanly to HLS.

#include <ap_int.h>
#include <ap_fixed.h>

#define TAPS 16

typedef ap_fixed<16, 2> coeff_t;
typedef ap_fixed<16, 2> data_t;
typedef ap_fixed<32, 4> acc_t;

void fir(data_t input, data_t *output) {
#pragma HLS PIPELINE II=1

    static data_t delay_line[TAPS];
#pragma HLS ARRAY_PARTITION variable=delay_line complete

    // Shift register: move all elements one position
    for (int i = TAPS - 1; i > 0; i--)
        delay_line[i] = delay_line[i-1];
    delay_line[0] = input;

    // Coefficients: symmetric for linear phase
    const coeff_t h[TAPS] = {
        0.003, 0.012, 0.033, 0.075, 0.138, 0.207, 0.256, 0.275,
        0.275, 0.256, 0.207, 0.138, 0.075, 0.033, 0.012, 0.003
    };
#pragma HLS ARRAY_PARTITION variable=h complete

    // Multiply-accumulate
    acc_t acc = 0;
    for (int i = 0; i < TAPS; i++) {
#pragma HLS UNROLL
        acc += (acc_t)(delay_line[i] * h[i]);
    }

    *output = (data_t)acc;
}

With ARRAY_PARTITION complete on both the delay line and the coefficient array, and UNROLL on the inner loop, the tool instantiates 16 multipliers and a 16-input adder tree. The delay line shift is also fully unrolled. With PIPELINE II=1 on the function, the filter accepts one new sample per clock cycle.

The hardware structure is a classic direct-form FIR: 16 registers forming a shift register (the delay line), 16 coefficient multipliers computing products simultaneously, and a binary adder tree that sums 16 products in log2(16) = 4 adder stages. Pipeline registers inside the adder tree add latency but allow the clock frequency to be high.

This is the expected implementation for a real-time audio or communications DSP filter. At 250 MHz clock frequency and II = 1, the filter processes 250 million samples per second. With 16-bit Q2.14 input samples, that sample rate covers real-valued signals up to the Nyquist limit of 125 MHz. The hardware footprint is 16 DSP48 blocks plus fabric logic for the shift register and adder tree.

The same filter in floating-point would require 16 floating-point multiplier IPs (each using 3 DSP blocks) and 16 floating-point adder IPs, bringing the DSP block count to approximately 48-64, at the cost of increased latency (each FP multiplier has 5-7 cycles of pipeline depth versus 1-3 for fixed-point on a DSP). For audio signal processing where the input range is bounded, this cost buys nothing.
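
Before synthesizing a filter like this, a software reference model is a cheap way to check the structure. The sketch below uses double instead of ap_fixed so it builds with any C++ compiler; FirModel and check_impulse_response are hypothetical names. It relies on a standard property of FIR filters: a unit impulse at the input reproduces the coefficients at the output, in order, followed by zeros.

```cpp
#include <cassert>

const int TAPS = 16;

// Same coefficient values as the fixed-point filter, held as doubles.
const double H[TAPS] = {
    0.003, 0.012, 0.033, 0.075, 0.138, 0.207, 0.256, 0.275,
    0.275, 0.256, 0.207, 0.138, 0.075, 0.033, 0.012, 0.003
};

// Reference model with the delay line held in the object rather than in
// a static array, so each test starts from a clean state.
struct FirModel {
    double delay_line[TAPS] = {};

    double step(double input) {
        // Shift register: move all elements one position.
        for (int i = TAPS - 1; i > 0; i--)
            delay_line[i] = delay_line[i - 1];
        delay_line[0] = input;

        // Multiply-accumulate over all taps.
        double acc = 0.0;
        for (int i = 0; i < TAPS; i++)
            acc += delay_line[i] * H[i];
        return acc;
    }
};

// Feeds a unit impulse and checks that the coefficients come back in order.
void check_impulse_response() {
    FirModel fir;
    for (int n = 0; n < 2 * TAPS; n++) {
        double y = fir.step(n == 0 ? 1.0 : 0.0);
        double expected = (n < TAPS) ? H[n] : 0.0;
        assert(y == expected);
    }
}
```

The same impulse test can be run against the fixed-point function in C simulation, comparing against the quantized coefficients rather than the double values.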

Interfaces

By default, HLS generates a simple handshake interface with ap_start, ap_done, ap_idle, and ap_ready signals. This is sufficient for a module called from a custom controller, but integration with standard AXI-based SoC infrastructure requires explicit interface pragmas.

AXI4-Lite for Register Access

void accelerator(int *A, int *B, int *C, int length) {
#pragma HLS INTERFACE s_axilite port=return
#pragma HLS INTERFACE s_axilite port=length
#pragma HLS INTERFACE m_axi port=A depth=1024
#pragma HLS INTERFACE m_axi port=B depth=1024
#pragma HLS INTERFACE m_axi port=C depth=1024
    ...
}

s_axilite on port=return generates an AXI4-Lite slave interface with control registers including start, done, and idle bits accessible from the processor. s_axilite on scalar parameters like length creates additional readable/writable registers in the AXI4-Lite register map. m_axi on pointer parameters generates an AXI4 master interface that allows the accelerator to read and write DDR memory through the PS-PL interconnect on a Zynq or Versal device.

This is the standard integration pattern for an ARM-controlled FPGA accelerator. The ARM processor writes parameters into the AXI-Lite control registers, starts the accelerator, waits for the done flag, and reads results from DDR.
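
On the processor side, that sequence is a handful of register accesses. The sketch below assumes the usual control register layout of HLS-generated AXI4-Lite blocks (ap_start in bit 0, ap_done in bit 1, ap_idle in bit 2 of the register at offset 0x00). The offset used for length is an assumption for illustration only; the real offsets come from the driver header generated for the specific design, and any buffer base addresses exposed through the register map would be written the same way.

```cpp
#include <cassert>
#include <cstdint>

// Bit positions in the control register at offset 0x00.
const std::uint32_t AP_START = 1u << 0;
const std::uint32_t AP_DONE  = 1u << 1;
const std::uint32_t AP_IDLE  = 1u << 2;

// Word indices into the register map.
const int REG_CTRL   = 0x00 / 4;
const int REG_LENGTH = 0x10 / 4;  // HYPOTHETICAL offset, check the generated header

// Writes the length, starts the accelerator, and polls for done.
// Returns false if done is not observed within `max_polls` reads.
bool run_accelerator(volatile std::uint32_t* regs, std::uint32_t length,
                     int max_polls) {
    assert(regs != nullptr && max_polls > 0);
    regs[REG_LENGTH] = length;    // parameter register
    regs[REG_CTRL] = AP_START;    // kick off one run
    for (int i = 0; i < max_polls; ++i) {
        if (regs[REG_CTRL] & AP_DONE) {
            return true;
        }
    }
    return false;
}
```

On real hardware, regs would point at the memory-mapped base address of the accelerator. Against an ordinary array, the function times out because nothing ever sets the done bit, which is convenient for checking the write sequence on a desktop.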

AXI-Stream for Throughput

#include <ap_int.h>
#include <hls_stream.h>

void stream_processor(hls::stream<ap_uint<16>> &input_stream,
                      hls::stream<ap_uint<16>> &output_stream) {
#pragma HLS INTERFACE axis port=input_stream
#pragma HLS INTERFACE axis port=output_stream
#pragma HLS INTERFACE ap_ctrl_none port=return
    ...
}

axis generates an AXI4-Stream interface with TDATA, TVALID, and TREADY signals. ap_ctrl_none removes the start/done handshake entirely, making the module a pure dataflow block that processes samples as they arrive on the stream. This is appropriate for signal processing pipelines where the module operates continuously and is integrated into a streaming chain rather than being called on demand.

hls::stream is the HLS FIFO abstraction. Reading from it with input_stream.read() blocks until data is available. Writing with output_stream.write() blocks until space is available. The tool generates the corresponding handshake logic around the stream accesses.

Reading the Synthesis Report

After running csynth_design in Vitis HLS, the synthesis report contains the numbers that matter.

Timing. The estimated clock period should come in below your target by at least the uncertainty margin the report lists. The report shows the estimated worst-case combinational delay. If it exceeds that budget, the critical path needs to be shortened by adding pipeline stages or reducing the combinational depth of the longest path.

Latency. Expressed in cycles. For a function with a loop that runs N times with the inner loop pipelined at II = 1, the expected latency is N + (pipeline depth - 1). Latency that is significantly larger than expected usually means the tool could not pipeline the loop to the requested II and defaulted to sequential execution.

Initiation interval. If the achieved II is greater than the requested II, the report states the reason. Common reasons: loop-carried dependency (sequential dependency between iterations), memory access conflict (more concurrent accesses than available ports), and resource conflict (two operations sharing the same hardware unit that cannot overlap). Each reason has a specific fix.

Resource utilization. LUT, FF, DSP, and BRAM counts. Compare against the device budget. DSP over-use is common when loops are over-unrolled. BRAM over-use is rare but occurs when large arrays are replicated by partitioning. LUT over-use often indicates that multiplexing or control logic is heavier than expected.

================================================================
== Performance Estimates
================================================================
+ Timing:
    * Summary:
    +--------+----------+----------+------------+
    |Clock   |Target    |Estimated |Uncertainty |
    +--------+----------+----------+------------+
    |ap_clk  |  4.00 ns |  3.12 ns |     0.50 ns|
    +--------+----------+----------+------------+

+ Latency:
    * Summary:
    +---------+---------+----------+----------+-----+-----+----------+
    |  Latency (cycles) |  Latency (absolute) |  Interval | Pipeline |
    |     min |     max |      min |      max | min | max |   Type   |
    +---------+---------+----------+----------+-----+-----+----------+
    |     1028|     1028|  4.112 us|  4.112 us| 1029| 1029|      none|
    +---------+---------+----------+----------+-----+-----+----------+

+ Detail:
    * Instance:
    ...
    * Loop:
    +-----------+---------+---------+----------+-----------+-----------+------+----------+
    |           |  Latency (cycles) | Iteration|  Initiation Interval  | Trip |          |
    | Loop Name |     min |     max |  Latency |  achieved |   target  | Count| Pipelined|
    +-----------+---------+---------+----------+-----------+-----------+------+----------+
    |- Loop_1   |     1026|     1026|         3|          1|          1|  1024|       yes|
    +-----------+---------+---------+----------+-----------+-----------+------+----------+

This report shows a loop with 1024 iterations, achieved II = 1 (matching the target), iteration latency of 3 cycles, and total loop latency of 1026 cycles (1024 iterations + 2 cycles for the pipeline fill/drain). The function achieves the full pipelining target.
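
Two of the numbers in the report are worth re-deriving by hand. The helpers below are a sketch with hypothetical names: one checks the timing budget under the assumption that the estimate must fit inside the target minus the uncertainty, the other converts a cycle count into absolute time at the target clock period.

```cpp
#include <cassert>

// Timing is met when the estimated period plus the uncertainty margin
// still fits inside the target period.
bool meets_timing(double target_ns, double estimated_ns, double uncertainty_ns) {
    assert(target_ns > 0.0 && estimated_ns >= 0.0 && uncertainty_ns >= 0.0);
    return estimated_ns + uncertainty_ns <= target_ns;
}

// Absolute latency in microseconds for a cycle count at a given period.
double latency_us(long cycles, double period_ns) {
    assert(cycles >= 0 && period_ns > 0.0);
    return static_cast<double>(cycles) * period_ns / 1000.0;
}
```

For the report above, 3.12 ns + 0.50 ns fits inside the 4.00 ns target, and 1028 cycles at 4.00 ns is the 4.112 us shown in the latency summary.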

Common Design Mistakes

Assuming independence where dependence exists. A loop that accumulates into the same variable has a loop-carried dependence. When the accumulating operation takes more than one cycle (a floating-point add is the usual case), applying PIPELINE without resolving this dependence will not produce II = 1. The tool will report a dependence warning and a higher achieved II. The fix is either to restructure the accumulation (partial sums with a final reduction) or accept the higher II.

Unrolling without partitioning. Unrolling a loop by factor N demands N simultaneous array accesses. If the array is not partitioned to support N ports, the memory becomes the bottleneck. Synthesis will report II > 1 due to memory port conflicts. Unrolling and partitioning must be matched: as a rule of thumb, the partitioning factor equals the unroll factor, although a dual-port memory can serve two of those accesses per bank.

Over-unrolling large loops. Unrolling a 1024-iteration loop completely instantiates 1024 copies of the loop body hardware. For a loop body containing a multiplier, this produces at least 1024 DSPs. DSP counts range from a few hundred on small devices to several thousand on large ones; this is not a reasonable allocation for one module. The synthesis tool will generate the requested hardware if resource constraints are not set, and the resulting design will fail to fit the device. Set resource limits (for example with the ALLOCATION pragma) or bound the unroll factor to a value the device can support.

Ignoring the dependency distance. For an accumulator loop with a multiply-add:

for (int i = 0; i < N; i++)
    acc += A[i] * B[i];  // Loop-carried dep on acc

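The dependence is carried through acc. The multiply does not read acc, so it can be pipelined ahead of the accumulation, and only the add sits on the loop-carried path. With int data and a single-cycle adder, the tool can usually reach II = 1 as written. With floating-point data, or any accumulator whose add takes several cycles, the picture is different: if the add has 4 cycles of latency, the dependency distance is 4, because iteration k cannot start its add until 4 cycles after iteration k-1 started its own. The minimum achievable II for this loop is then 4, not 1. To achieve II = 1 in that case, the accumulation must be restructured to break the dependency: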

acc_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
for (int i = 0; i < N; i += 4) {
    acc0 += A[i]   * B[i];
    acc1 += A[i+1] * B[i+1];
    acc2 += A[i+2] * B[i+2];
    acc3 += A[i+3] * B[i+3];
}
acc = acc0 + acc1 + acc2 + acc3;

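Now four independent accumulation chains run in parallel, each with its own dependency. Each chain still has dependency distance 4, so the restructured loop itself runs at II = 4, but every trip retires four elements. Throughput is one multiply-add per cycle, which is equivalent to II = 1 on the original loop. For floating-point data the result can differ slightly from the single-accumulator version because the additions are reassociated.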
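
A self-contained version of the same idea, with handling for lengths that are not a multiple of 4, might look like this. It is a sketch in plain C++ with float data; dot4 is a hypothetical name, and the final combination order is one reasonable choice among several.

```cpp
#include <cassert>

// Dot product with four independent partial sums. Each accumulator is
// updated once per loop trip, so four multi-cycle adds can be in flight
// at the same time without depending on each other.
float dot4(const float* A, const float* B, int n) {
    assert(n >= 0);
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        acc0 += A[i]     * B[i];
        acc1 += A[i + 1] * B[i + 1];
        acc2 += A[i + 2] * B[i + 2];
        acc3 += A[i + 3] * B[i + 3];
    }
    // Final reduction of the four partial sums as a small adder tree.
    float acc = (acc0 + acc1) + (acc2 + acc3);
    // Leftover elements when n is not a multiple of 4.
    for (; i < n; ++i) {
        acc += A[i] * B[i];
    }
    return acc;
}
```

In an HLS design the leftover loop is usually avoided by padding the data to a multiple of 4, which keeps the hardware to a single regular loop.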

Synthesis Estimation Before Running the Tool

One of the useful skills to develop with HLS is estimating the hardware structure from the C code before synthesizing. This is not guesswork – it follows directly from the mapping rules.

For a pipelined loop with II = 1 over N iterations: latency is approximately N cycles plus the pipeline depth. Each operation in the loop body appears once in hardware (the same units are reused across iterations in a pipeline): one adder, one multiplier. Every array reference in the body needs its own memory port each cycle, so a body that reads two elements of the same array per iteration needs a dual-port memory or a partitioned array.

For an unrolled loop with factor F: F copies of each hardware resource in the loop body. F-wide BRAM access needed. Latency equals N/F loop body latencies.

For a fully unrolled loop: one complete copy of the loop body hardware per iteration. Total hardware is N copies of the loop body. All N memory accesses happen simultaneously.

For a function with no loops: hardware is a single deep combinational or pipelined chain. One copy of each operation.

These estimates let you check whether the design fits the device resource budget before waiting for synthesis to complete (which can take minutes to hours for complex designs). If the estimate shows you will use 3x the available DSP blocks, no amount of synthesis will fix that – the design itself needs to change.
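
The DSP part of that check fits in a few lines. This is a rough sketch with hypothetical names; the number of DSP blocks per multiplier depends on operand width and device family, so it is passed in as a parameter rather than assumed.

```cpp
#include <cassert>

// Rough DSP estimate for a loop: every unrolled copy of the body needs its
// own multipliers, and each multiplier costs some number of DSP blocks.
int estimate_dsp(int unroll_factor, int multipliers_per_body,
                 int dsp_per_multiplier) {
    assert(unroll_factor >= 1 && multipliers_per_body >= 0 &&
           dsp_per_multiplier >= 1);
    return unroll_factor * multipliers_per_body * dsp_per_multiplier;
}

// True if the estimate fits the number of DSP blocks on the device.
bool fits_budget(int estimated, int available) {
    return estimated <= available;
}
```

A 16-tap filter with one DSP per multiplier fits easily on a device with a couple of hundred DSP blocks; a fully unrolled 1024-iteration multiply loop does not, and the estimate says so before synthesis starts.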

Summary

HLS is a hardware generator. The quality of the output depends on how clearly the hardware intent is expressed in the C code and pragmas. Writing C code that looks like software but is intended for hardware, without understanding what hardware the tool will generate, reliably produces slow, area-inefficient results.

The core mental model is: identify the parallelism in the computation, express it through loop structure and data independence, guide the tool to implement that parallelism with PIPELINE and UNROLL pragmas, and resolve memory bandwidth limitations with ARRAY_PARTITION. The synthesis report confirms whether the tool achieved what was intended and identifies the specific reason when it did not.

The achievable II is bounded by the latency of the loop-carried dependency and the available memory port count. Latency is bounded by the inherent sequential depth of the computation. Unrolling reduces latency at linear area cost. Pipelining reduces II at small area cost. Partitioning resolves memory bottlenecks at area cost proportional to the partitioning factor.

When you can look at an HLS function and predict the DSP count, BRAM count, expected latency, and expected II before running synthesis, you are using the tool correctly. That prediction being accurate is the verification that you understand the mapping.