Designing a Simple RTL Block for ASIC vs FPGA: A Pipelined 8-bit MAC

The multiply-accumulate operation

$$ Y \leftarrow Y + (A \times B) $$

is the computational primitive at the center of digital filters, CNN inference, matrix multiply units, and PID controllers. An 8-bit signed MAC is simple enough to analyze completely but representative enough that the conclusions transfer directly to wider datapaths. The RTL is a dozen lines. The implementation decisions that follow from targeting an FPGA versus an ASIC are not.

The function is identical on both targets, but the differences in implementation run deeper than “FPGAs have DSP blocks.” This post works through a pipelined 8-bit MAC from RTL through physical design considerations for both targets, with attention to where the implementations diverge and why.


The Baseline RTL

A starting implementation that simulates correctly:

module mac8 (
    input  wire        clk,
    input  wire        rst,
    input  wire        valid,
    input  wire signed [7:0] A,
    input  wire signed [7:0] B,
    output reg  signed [31:0] Y
);

reg signed [15:0] mult;

always @(posedge clk) begin
    if (rst) begin
        Y    <= 0;
        mult <= 0;
    end else if (valid) begin
        mult <= A * B;
        Y    <= Y + mult;
    end
end

endmodule

This produces a two-stage pipeline: the multiply result is registered in mult on one clock edge, and added to the accumulator on the next. The signed keyword propagates sign extension correctly through both the multiplication and the accumulate addition. It synthesizes without errors on both FPGA and ASIC flows.
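One consequence of the two-stage structure is easy to miss: because both registers share the valid enable, a product sits in mult until the next valid cycle moves it into Y. The self-checking testbench below is a minimal sketch of that behavior. The testbench name mac8_tb, the feed task, and the stimulus values are all illustrative assumptions, not part of the original design.

```verilog
`timescale 1ns/1ps
// Hypothetical testbench; assumes the mac8 module above is compiled with it.
module mac8_tb;
    reg                clk   = 1'b0;
    reg                rst   = 1'b1;
    reg                valid = 1'b0;
    reg  signed [7:0]  A     = 8'sd0;
    reg  signed [7:0]  B     = 8'sd0;
    wire signed [31:0] Y;

    mac8 dut (.clk(clk), .rst(rst), .valid(valid), .A(A), .B(B), .Y(Y));

    always #1 clk = ~clk;   // 2 ns period, rising edges at t = 1, 3, 5, ...

    // Drive one operand pair on a falling edge so that it is stable
    // at the following rising edge.
    task feed;
        input signed [7:0] a;
        input signed [7:0] b;
        begin
            @(negedge clk);
            A = a; B = b; valid = 1'b1;
        end
    endtask

    initial begin
        repeat (2) @(negedge clk);   // hold reset across two rising edges
        rst = 1'b0;
        feed(3, 4);                  //  12
        feed(-5, 6);                 // -30
        feed(-128, -128);            //  16384, the largest signed 8x8 product
        feed(0, 0);                  // extra valid cycle: moves the last product into Y
        @(negedge clk);
        valid = 1'b0;
        // Expected sum: 12 - 30 + 16384 = 16366
        if (Y !== 32'sd16366) $display("FAIL: Y = %0d", Y);
        else                  $display("PASS: Y = %0d", Y);
        $finish;
    end
endmodule
```

Without the final feed(0, 0) cycle, the last product would still be waiting in mult and Y would read -18. A design that needs every product accumulated after valid deasserts has to drain the pipeline in a similar way or carry a delayed copy of valid alongside the data.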

What it does not do is make any of the implementation decisions that determine whether the result meets timing, uses area efficiently, or consumes appropriate power. Those decisions depend on the target technology, and they are not decisions the synthesis tool makes for you by default.


FPGA Implementation

DSP Slice Inference

Modern FPGAs include dedicated DSP slices that implement multiply-accumulate operations in hard silicon, sitting alongside the LUT fabric but operating independently of it. On Xilinx 7-series devices, the DSP48E1 primitive provides a 25x18 signed multiplier with a 48-bit accumulator, internal pipeline registers, and a pre-adder; the DSP48E2 in UltraScale devices widens the multiplier to 27x18. On Intel Cyclone and Stratix devices, the DSP block supports 18x18 and 27x27 configurations with a similar internal pipeline structure.

Mapping the A * B expression to a DSP slice rather than LUT logic is the most important inference decision in this design. A LUT-based 8x8 multiplier on 7-series typically consumes several dozen LUTs plus carry chains (the exact count depends on how the tool packs the partial products) and has a combinational delay of roughly 4 to 6 nanoseconds depending on placement and carry chain length. A DSP48-mapped multiplier consumes one DSP slice and no LUTs, and with the slice's internal registers enabled each pipeline stage has a delay of about 2 nanoseconds. The difference in critical path delay is large enough to determine whether the design closes timing at a given clock frequency.

The synthesis tool infers DSP slices automatically from the * operator when the operand widths are compatible and the surrounding RTL matches patterns the tool recognizes. For an 8x8 signed multiply, this inference is usually reliable on both Xilinx and Intel tools, although a tool may choose LUTs for a small multiplier when its own cost heuristics favor that. Inference fails or becomes inefficient in three cases: when operand widths exceed the native multiplier (a 32x32 product, for example, is split across several slices), when the registers around the multiply do not match the slice's internal stages, and when the surrounding logic prevents the tool from recognizing the accumulate pattern as part of the same DSP primitive.

The DSP48 can implement the full multiply-accumulate in a single slice with its internal pipeline if the RTL matches the expected pattern. To maximize the probability of this, the accumulate addition should be directly connected to the multiply output without intervening combinational logic, and both should be written in the same always block with register inference that matches the DSP’s internal pipeline stages.
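When inference needs a push, both major vendors accept a synthesis attribute in the RTL. The sketch below applies Vivado's use_dsp attribute and Quartus's multstyle attribute to the baseline module; each tool ignores the attribute it does not recognize. The module name mac8_dsp_hint is hypothetical. Older Vivado releases spell the attribute use_dsp48, so the synthesis guide for the tool version in use is the final authority.

```verilog
// Hypothetical wrapper name. use_dsp is a Vivado synthesis attribute;
// multstyle is the Quartus equivalent. Both are hints, not guarantees.
(* use_dsp = "yes", multstyle = "dsp" *)
module mac8_dsp_hint (
    input  wire               clk,
    input  wire               rst,
    input  wire               valid,
    input  wire signed [7:0]  A,
    input  wire signed [7:0]  B,
    output reg  signed [31:0] Y
);

    reg signed [15:0] mult;

    // Multiply and accumulate in one always block with a synchronous
    // reset, so the tool can recognize the whole MAC pattern.
    always @(posedge clk) begin
        if (rst) begin
            Y    <= 0;
            mult <= 0;
        end else if (valid) begin
            mult <= A * B;
            Y    <= Y + mult;
        end
    end

endmodule
```

After synthesis, the utilization report is the quickest check: a successful mapping shows one DSP slice and essentially no LUTs attributed to the multiplier.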

Pipelining for Frequency

DSP slices have internal pipeline stages that must be explicitly enabled through RTL structure. A registering structure that lets the tool use all three pipeline stages in the DSP48 (input registers, multiplier register, and output register) can reach clock frequencies above 400 MHz on 7-series in speed grade 1. A structure that bypasses the internal registers and uses the DSP combinationally may not close timing above 200 MHz in the same device.

The registering structure that enables full DSP pipeline use:

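reg signed [7:0]  A_r, B_r;   // input stage (maps to the DSP48 A/B input registers)
reg signed [15:0] mult_r;     // multiplier stage (maps to the DSP48 M register)

always @(posedge clk) begin
    if (rst) begin
        A_r    <= 0;
        B_r    <= 0;
        mult_r <= 0;
        Y      <= 0;
    end else if (valid) begin
        A_r    <= A;
        B_r    <= B;
        mult_r <= A_r * B_r;
        Y      <= Y + mult_r;   // accumulator (maps to the DSP48 P output register)
    end
end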

This adds one cycle of latency relative to the baseline (the input registers are explicit, and the multiply now operates on registered inputs rather than the raw input ports), but enables the synthesis tool to map the entire datapath into a single DSP slice with its full internal pipeline active.

Routing Delay and Reset Convention

On FPGAs, routing delay is often comparable to or larger than logic delay. A combinational path that appears to be two logic levels in a schematic view may take 3 to 4 nanoseconds if the two cells are placed far apart. The placement tool minimizes this by grouping related cells, but deep pipelines and high fanout signals can force long routes that dominate the critical path.

For reset specifically: FPGA flip-flops have a dedicated set/reset input that can be configured as either synchronous or asynchronous. An asynchronous reset is described with an always @(posedge clk or posedge rst) sensitivity list; a synchronous reset is described with the clock edge alone and the reset tested inside the block. Both work correctly, but synchronous reset is generally preferred in FPGA designs. It does not add recovery and removal checks on the reset deassertion edge, it does not risk metastability when reset is released near a clock edge, and the registers inside a DSP48 slice support only synchronous reset, so an asynchronous reset on the multiply or accumulate registers keeps them out of the slice.
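In RTL the two reset styles differ only in the sensitivity list. The sketch below puts both side by side; the module and signal names are illustrative only.

```verilog
// Illustrative module: the same register written with each reset style.
module reset_styles (
    input  wire       clk,
    input  wire       rst,
    input  wire [7:0] d,
    output reg  [7:0] q_sync,
    output reg  [7:0] q_async
);

    // Synchronous reset: rst is sampled only on the rising clock edge.
    always @(posedge clk) begin
        if (rst) q_sync <= 8'd0;
        else     q_sync <= d;
    end

    // Asynchronous reset: rst clears the register immediately,
    // independent of the clock.
    always @(posedge clk or posedge rst) begin
        if (rst) q_async <= 8'd0;
        else     q_async <= d;
    end

endmodule
```

The MAC examples in this post use the first form throughout.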


ASIC Implementation

No DSP Slices: The Multiplier Is Yours to Build

An ASIC standard cell library contains no multiply primitive. When the synthesis tool encounters A * B, it generates a multiplier from logic gates: full adders, half adders, AND gates. The architecture it chooses depends on the synthesis constraints and the library it is mapping against.

For an 8x8 multiplier, the standard tool-inferred architecture is typically an array multiplier or a modified Booth-encoded multiplier. An array multiplier for 8x8 inputs generates 8 rows of partial products (each row is the multiplicand ANDed with one bit of the multiplier and shifted left by that bit's position), then sums them through successive rows of carry-save adders; Wallace and Dadda multipliers arrange the same carry-save adders as a tree to shorten the path. The area cost is proportional to the number of partial product bits: 8x8 produces 64 partial product bits, reduced to sum-and-carry form by the carry-save stages, with a final carry-propagate adder at the output.

Radix-4 Booth encoding reduces the number of partial products from N to N/2 by recoding the multiplier into a signed digit representation where each digit takes one of the values {-2, -1, 0, 1, 2}. For 8-bit inputs, this reduces the partial product count from 8 to 4, which reduces the adder tree depth by one level. The impact on timing is measurable: a Booth-encoded 8x8 multiplier typically has 20 to 30% lower delay than an array multiplier at the same process node, at a modest area increase from the Booth encoder logic. For 8-bit inputs the difference is real but not dramatic. For 16-bit or 32-bit inputs, Booth encoding is essentially mandatory if the multiplier is on the critical path.
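The recoding itself is compact enough to show behaviorally. The module below is a sketch of radix-4 Booth multiplication for signed 8-bit operands, written to make the four partial products visible. It is not what a synthesis tool emits, and the module and signal names are assumptions made for illustration.

```verilog
// Behavioral sketch of a radix-4 Booth 8x8 signed multiply (hypothetical module).
// Each overlapping 3-bit window of the multiplier selects one partial
// product from {-2a, -a, 0, +a, +2a}; four windows cover all 8 bits.
module booth_r4_mult8 (
    input  wire signed [7:0]  a,   // multiplicand
    input  wire signed [7:0]  b,   // multiplier
    output reg  signed [15:0] p
);

    reg        [8:0]  b_ext;   // multiplier with the implicit b[-1] = 0 appended
    reg signed [15:0] a_ext;   // sign-extended multiplicand
    reg signed [15:0] pp;      // current partial product
    integer i;

    always @* begin
        b_ext = {b, 1'b0};
        a_ext = a;
        p     = 16'sd0;
        for (i = 0; i < 4; i = i + 1) begin
            // Window {b[2i+1], b[2i], b[2i-1]} encodes digit -2*b[2i+1] + b[2i] + b[2i-1]
            case ({b_ext[2*i+2], b_ext[2*i+1], b_ext[2*i]})
                3'b001, 3'b010: pp = a_ext;            // +1
                3'b011:         pp = a_ext <<< 1;      // +2
                3'b100:         pp = -(a_ext <<< 1);   // -2
                3'b101, 3'b110: pp = -a_ext;           // -1
                default:        pp = 16'sd0;           //  0 (windows 000 and 111)
            endcase
            p = p + (pp <<< (2*i));   // digit i carries weight 4^i
        end
    end

endmodule
```

For a = b = -128, only the top window (100) is nonzero: it selects -2a = 256, which is shifted left by 6 to give 16384, matching (-128) x (-128). In hardware the four partial products would be summed in a carry-save tree rather than sequentially as the loop suggests.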

The synthesis tool uses Booth encoding automatically if the library includes a Booth multiplier macro, or if the constraints are tight enough to force the tool toward a faster architecture. In the absence of strong timing constraints, the tool may generate a smaller, slower implementation, because nothing in the constraints tells it that the multiplier is on the critical path.

Critical Path Analysis

In the baseline RTL, the only register-to-register path is the addition Y + mult. The multiply result is already registered in mult, so this path is a 32-bit addition from the accumulator and mult registers back into the accumulator. A 32-bit ripple-carry adder at 28nm has a delay of roughly 1.5 to 2 nanoseconds. With carry-lookahead or carry-select structures, which synthesis tools typically use for additions this wide, the delay drops to under 1 nanosecond. This path closes timing at 500 MHz without heroic effort. The multiplier itself still sits on a combinational path, from the A and B input ports to the mult register, and that path has to fit in whatever part of the clock period the input delay leaves over. It is usually the tighter of the two, which is one reason to register A and B before the multiply.

If the multiply is not registered (collapsed into a single combinational path from inputs through multiply through accumulate to output), the critical path includes the full multiplier delay plus the addition. For an 8x8 array multiplier at 28nm, multiplier delay is approximately 1.5 to 2 nanoseconds; adding the accumulate adder pushes the total to 2.5 to 3 nanoseconds, which closes at roughly 300 to 400 MHz. Registering the multiply result is not just about pipelining for throughput; it directly cuts the critical path and improves the achievable clock frequency.
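The frequency range quoted for the unregistered case follows directly from the two delays. This is a rough sketch that ignores clock-to-Q delay, setup time, and clock uncertainty, all of which lower the result further. The symbols $t_{\text{mult}}$ and $t_{\text{add}}$ are illustrative names for the multiplier and adder delays, with the adder taken at about 1 nanosecond:

$$ f_{\max} \approx \frac{1}{t_{\text{mult}} + t_{\text{add}}}, \qquad \frac{1}{2.0\,\text{ns} + 1.0\,\text{ns}} \approx 333\ \text{MHz}, \qquad \frac{1}{1.5\,\text{ns} + 1.0\,\text{ns}} = 400\ \text{MHz} $$

With the multiply registered, the two delays no longer add: each stage only has to fit its own delay inside the clock period.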

Accumulator Width and Overflow

The 32-bit signed accumulator in the baseline RTL has a maximum positive value of $2^{31} - 1$, and the largest signed 8x8 product is $(-128) \times (-128) = 2^{14}$, so it can hold the sum of just under $2^{31} / 2^{14} = 2^{17}$ (131072) full-scale products before overflowing. Whether that is sufficient depends on how the block is used. In a MAC-based FIR filter, the number of accumulations per output sample equals the number of taps, which for a typical design is tens to hundreds, well within the 32-bit range. In a matrix multiply that accumulates over a large inner dimension, or in an online learning update that runs indefinitely, the same accumulator width may overflow silently.
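The same bound can be written for an arbitrary number of accumulations. As a sketch, let $N$ be the number of products summed between resets and $W$ the signed accumulator width (both symbols are introduced here for illustration), and assume every product can take the worst-case value $2^{14}$. The accumulator must satisfy

$$ 2^{W-1} - 1 \ge N \cdot 2^{14} $$

and the smallest width that does so is

$$ W = 16 + \lfloor \log_2 N \rfloor $$

For a 64-tap FIR filter this gives $W = 16 + 6 = 22$ bits. For $N = 2^{17}$ it gives 33 bits, which is why a 32-bit accumulator stops one product short of $2^{17}$.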

In ASIC design, every bit of the accumulator adds flip-flops, routing, and clock tree load. The accumulator can never be narrower than the 16-bit product, but a 32-bit accumulator where 22 bits would be sufficient (a 64-tap filter needs six guard bits on top of the 16-bit product) wastes roughly 30% of the register area in that block. For a single MAC unit this is trivial. For a systolic array with thousands of MAC units sharing a common accumulate width, the difference in register count is large and contributes to both area and power.

Clock Gating

In the baseline RTL, the always block conditionally updates the registers when valid is asserted. In a standard ASIC synthesis flow without clock gating, this conditional becomes an enable on each register, implemented either as an enable flip-flop or as a multiplexer that recirculates the register's output into its data input. The clock arrives at the flip-flop on every edge, and the enable only determines whether new data is captured. The flip-flop still clocks and still consumes switching power on its clock pin and in the clock tree that feeds it.

Replacing the conditional enable with a gated clock cell disconnects the clock entirely from the accumulator registers when valid is zero. Clock gating is standard practice in ASIC power optimization because dynamic power is proportional to switching activity, and suppressing the clock to registers that are not changing eliminates the clock-related part of their dynamic power. For a MAC unit that is idle 50% of the time, clock gating approximately halves the clock power of the accumulator registers.

Clock gating requires care. The gate cell must be placed early in the clock tree to maximize the number of registers behind it, and it must be verified that the enable signal is stable before the clock edge to avoid glitches. In a standard cell flow, the clock gating transformation is usually performed by the synthesis tool or the physical design tool, but the RTL must be structured to allow it. An always block with a simple valid condition is straightforwardly transformable. A block with complex nested conditions may not be.
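For reference, the gate cell that the tool inserts behaves like a latch followed by an AND gate. The model below is a behavioral sketch for simulation and discussion only. In a real flow the cell is an integrated clock gating cell from the standard cell library, and the module name icg_model is hypothetical.

```verilog
// Behavioral model of a latch-based clock gate (hypothetical name).
// The latch is transparent while clk is low, so en can only change the
// gate while the AND output is already 0; gclk therefore never glitches.
module icg_model (
    input  wire clk,
    input  wire en,
    output wire gclk
);

    reg en_latched;

    // Level-sensitive latch: follows en while clk is low, holds while clk is high.
    always @(clk or en) begin
        if (!clk)
            en_latched <= en;
    end

    assign gclk = clk & en_latched;

endmodule
```

Connecting valid to en and clocking the accumulator from gclk reproduces what the if (valid) condition expresses in the RTL, which is why that coding style is the one the tools can transform automatically.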

Timing Constraints and SDC

An FPGA flow will often produce a usable result for a block like this from little more than a clock constraint, because the vendor tools already know the device's clock networks and cell timing. ASIC synthesis requires a complete SDC file that defines input arrival times, output required times, clock uncertainty, and any multicycle or false path exceptions. For the MAC unit in isolation, the constraints are simple:

create_clock -period 2.0 -name clk [get_ports clk]
set_clock_uncertainty 0.1 [get_clocks clk]
set_input_delay  0.3 -clock clk [get_ports {A B valid rst}]
set_output_delay 0.3 -clock clk [get_ports Y]

A 2-nanosecond clock period targets 500 MHz. The 0.1-nanosecond uncertainty accounts for clock network skew and jitter. The 0.3-nanosecond input and output delays assume that the signals are registered in adjacent flip-flops at the boundary of the block.

The quality of the SDC file determines the quality of the synthesis result. A constraint file with no input delays tells the synthesis tool that inputs arrive at time zero relative to the clock edge, which means the multiplier and all logic between the inputs and the first register have an artificially generous timing budget. The synthesis tool meets this relaxed constraint by generating a smaller, slower multiplier. When the block is integrated into a larger design with real input timing constraints, the path fails, and the fix requires re-synthesizing the block with correct constraints and potentially changing the RTL structure.


Physical Design Considerations

On FPGA, the DSP slices are placed in fixed columns in the device’s floorplan. The routing from the logic around the DSP to the DSP inputs and outputs crosses whatever fabric lies between them. Packing multiple MAC units that need to exchange data works best when they are placed in the same DSP column and in adjacent positions within that column, which the placer usually achieves automatically for small designs but may not for large ones with many competing placement constraints.

On ASIC, the MAC unit is placed in the standard cell area along with everything else. The placement tool assigns cell positions based on timing and congestion constraints. If the accumulator has high fanout (many downstream cells reading Y), the placer may need to insert repeaters in the fanout tree, which increases the effective delay and area. The multiplier, which has the longest combinational delay in the block, should ideally be placed so that its output cells are close to the input of the adder stage. The placement tool handles this within the standard optimization loop, but tight timing constraints on the critical path are necessary to make the optimization aggressive enough.

The flip-flops in the accumulator register are driven by the clock tree. A 32-bit accumulator contains 32 flip-flops, each of which is a leaf node on the clock tree. The clock tree insertion delay and skew budget for those 32 flops must be accounted for in the timing constraints and verified in post-CTS static timing analysis.


What the Target Changes

The same RTL synthesized to an FPGA and to an ASIC produces substantially different physical implementations, and the differences are not resolved by adjusting synthesis flags. On FPGA, the design’s quality is determined by whether the multiplier maps to a DSP slice, whether the pipeline registers match the DSP’s internal staging, and whether the placement achieves short routes between the DSP and surrounding logic. On ASIC, the quality is determined by the multiplier architecture the synthesis tool selects, the depth of the pipeline, the width of the accumulator relative to the actual overflow requirement, and the quality of the timing constraints provided.

An 8-bit MAC is simple enough that both implementations close timing easily at moderate clock frequencies. For wider datapaths, tighter timing targets, or designs with hundreds of MAC units, each of these decisions has measurable consequences in area, power, and timing margin, and the decisions compound.