Systolic Array Matrix Multiplier


Verilog · SystemVerilog · Pipelining · Microarchitecture

Systolic Array Based Matrix Multiplication and Convolution


Systolic arrays are highly efficient hardware architectures designed for performing repetitive numerical computations, particularly matrix multiplication and convolution, in a structured and pipelined manner. The term “systolic” refers to the rhythmic, pulse-like propagation of data through a grid of processing elements (PEs), much like a heartbeat. This design enables massive parallelism while minimizing external memory access, making it highly suitable for applications in linear algebra, deep learning accelerators, and signal processing.

The key principle behind systolic arrays is that each PE performs a simple computation—typically a multiply-accumulate (MAC)—and forwards its inputs to neighboring PEs. This allows each element of the input matrices to be reused across multiple PEs, reducing memory bandwidth requirements. Additionally, the regular, local communication between PEs makes the array highly scalable and amenable to pipelining.

Mathematical Background and Efficiency

Consider the multiplication of two matrices, (A \in \mathbb{R}^{N \times M}) and (B \in \mathbb{R}^{M \times K}), producing a matrix (C \in \mathbb{R}^{N \times K}):

\[ C_{ij} = \sum_{k=0}^{M-1} A_{ik} \cdot B_{kj}, \quad 0 \le i < N, \ 0 \le j < K \]

A naive implementation would require fetching elements of (A) and (B) from memory for each multiplication, leading to high memory bandwidth and latency. Systolic arrays mitigate this by streaming the elements of (A) across rows and elements of (B) across columns. Each PE receives one element from (A) and one from (B), performs a multiplication, adds it to an internal accumulator, and forwards the operands to neighboring PEs.

Mathematically, the computation within a PE can be expressed as:

\[ c_{ij}^{(k)} = c_{ij}^{(k-1)} + a_{ik} \cdot b_{kj} \]

where (c_{ij}^{(k)}) is the partial sum after processing the (k)-th element of the summation. By distributing the summation across a 2D grid of PEs and pipelining the inputs, the systolic array achieves:

  1. Parallelism: Each PE operates concurrently, producing one partial sum per clock cycle.

  2. Data Locality: Operands move only to immediate neighbors, minimizing global memory access.

  3. Pipelining: Once the array is filled, one output per PE per clock cycle can be produced continuously.

This structure allows for high throughput and energy efficiency, especially for large matrices.
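The dataflow described above can be sketched as a cycle-accurate software model. The following Python sketch (a behavioral model of the array, not the RTL) implements an output-stationary (N \times K) grid: each PE keeps its own accumulator, row (i) of (A) enters with a delay of (i) cycles and column (j) of (B) with a delay of (j) cycles, so matching operands meet inside each PE:

```python
def systolic_matmul(A, B):
    """Cycle-accurate behavioral model of an output-stationary systolic
    array: PE (i, j) owns accumulator C[i][j]; A streams rightward along
    rows, B streams downward along columns. Row i of A is delayed by i
    cycles and column j of B by j cycles so matching operands meet."""
    N, M, K = len(A), len(A[0]), len(B[0])
    C = [[0] * K for _ in range(N)]
    a_reg = [[0] * K for _ in range(N)]  # operand register inside each PE
    b_reg = [[0] * K for _ in range(N)]
    for t in range(N + M + K - 2):       # fill + compute + drain
        # sweep PEs bottom-right to top-left so each PE reads its
        # neighbours' values from the *previous* clock cycle
        for i in reversed(range(N)):
            for j in reversed(range(K)):
                a_in = (a_reg[i][j - 1] if j > 0
                        else (A[i][t - i] if 0 <= t - i < M else 0))
                b_in = (b_reg[i - 1][j] if i > 0
                        else (B[t - j][j] if 0 <= t - j < M else 0))
                C[i][j] += a_in * b_in   # the PE's multiply-accumulate
                a_reg[i][j], b_reg[i][j] = a_in, b_in
    return C
```

After (N + M + K - 2) cycles the array holds the complete product; the model agrees with a conventional triple-loop matrix multiplication.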

GEMM Implementation

The general matrix multiplication (GEMM) is implemented using a 2D grid of PEs. Each PE maintains a local accumulator, performs a multiply-accumulate operation, and forwards the operands to the next PE in its row or column. The top-level module organizes the PEs to multiply an (N \times M) matrix (A) with an (M \times K) matrix (B) to produce (C \in \mathbb{R}^{N \times K}).

Processing Element (PE)

Each PE implements a multiply-accumulate operation:

module pe(
    input clk,
    input rst,
    input [7:0] a_in,          // operand arriving from the left neighbour (A)
    input [7:0] b_in,          // operand arriving from the top neighbour (B)
    output reg [7:0] a_out,    // forwarded to the right neighbour
    output reg [7:0] b_out,    // forwarded to the bottom neighbour
    output reg [15:0] c_out    // registered partial sum (lags acc by one cycle)
);
    // Local accumulator for one C[i][j]. Note: 16 bits holds a single
    // 8x8 product; for larger inner dimensions M the accumulator width
    // should be increased to avoid overflow.
    reg [15:0] acc;

    always @(posedge clk or posedge rst) begin
        if (rst) begin
            a_out <= 0;
            b_out <= 0;
            c_out <= 0;
            acc   <= 0;
        end else begin
            // forward operands unchanged while updating the partial sum
            a_out <= a_in;
            b_out <= b_in;
            acc   <= acc + a_in * b_in;
            c_out <= acc;
        end
    end
endmodule

The PE is fully pipelined: ‘a_in’ and ‘b_in’ are forwarded to the next PE while the internal accumulator updates the partial sum. This local accumulation avoids repeated memory accesses for intermediate sums.

Top-Level Systolic Array

The PEs are arranged in an (N \times K) grid, enabling the computation of all (C_{ij}) elements in parallel. Input streams are delayed appropriately using registers to align operands across the array. The top-level module can be parameterized for any matrix dimensions (N), (M), and (K):

module top #(
    parameter N = 4,
    parameter M = 4,
    parameter K = 4
)(
    input clk,
    input rst,
    input [7:0] a_in [0:N-1][0:M-1],
    input [7:0] b_in [0:M-1][0:K-1],
    output [15:0] c_out [0:N-1][0:K-1]
);
    // Instantiate PEs in an NxK grid
    // Forward a_in along rows, b_in along columns
    // Each PE computes partial sums
endmodule

The testbench simply streams the input matrices into the array and observes the outputs. Because the computation is fully pipelined, the testbench logic is minimal and does not introduce additional complexity. Its primary function is to verify correctness for different input sizes.

This GEMM implementation using systolic arrays achieves high throughput and energy efficiency by exploiting parallelism, pipelining, and local communication. The parameterized design allows scaling to arbitrary (N \times M \times K) matrices, making it suitable for hardware accelerators where matrix multiplication dominates computation time.

Convolutional Layer Implementation (Conv2D)

A 2D convolution operation computes a weighted sum over a sliding window of the input image. Mathematically, for an input image (I \in \mathbb{R}^{N \times N}) and a (k \times k) kernel (K):

\[ O_{i,j} = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} I_{i+m,j+n} \cdot K_{m,n}, \quad 0 \le i,j < N-k+1 \]

Two common convolution modes are valid (no padding, output smaller than input) and same (zero-padding, output same size as input).
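Both modes can be captured in a short Python reference model (a behavioral sketch of the windowed sum above, including the kernel-flipping step the hardware pipeline performs; operand widths are ignored here):

```python
def conv2d(image, kernel, mode="valid"):
    """2D convolution over an n x n image with a k x k kernel.
    mode="valid": no padding, output is (n-k+1) x (n-k+1).
    mode="same":  zero-padding, output is n x n."""
    n, k = len(image), len(kernel)
    # flip the kernel in both dimensions (convolution, not correlation)
    flipped = [row[::-1] for row in kernel[::-1]]
    pad = (k - 1) // 2 if mode == "same" else 0
    out_n = n if mode == "same" else n - k + 1
    out = [[0] * out_n for _ in range(out_n)]
    for i in range(out_n):
        for j in range(out_n):
            acc = 0
            for m in range(k):
                for q in range(k):
                    r, c = i + m - pad, j + q - pad
                    if 0 <= r < n and 0 <= c < n:  # zeros outside the image
                        acc += image[r][c] * flipped[m][q]
            out[i][j] = acc
    return out
```

With a centered identity kernel, "same" mode returns the input unchanged, which is a convenient sanity check for the padding logic.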

Systolic Array Mapping and Inner Modules


The 3x3 convolution is mapped to a systolic structure composed of 9 processing elements (PEs), each being a multiply-accumulate (MAC) unit. Every PE takes one pixel and one kernel weight, computes a product, and contributes it to the convolution sum. These 9 products are then aggregated through a tree of carry-save adders (CSAs) to produce the final convolution output.

  • Processing Element (PE): Each PE is realized as a MAC unit that accepts an 8-bit signed pixel ((A)), an 8-bit signed kernel coefficient ((B)), and a 16-bit accumulator input. Internally, it uses:

    1. A Booth multiplier for signed 8(\times)8 multiplication, producing a 16-bit product.

    2. A carry-save adder (CSA) to add the product, the accumulator input, and a zero.

    3. A final carry-propagate adder that combines the CSA’s sum and shifted carry outputs, generating the accumulated result.

    This design minimizes critical path delay while allowing pipelining of MAC results.

  • CSA Tree for Accumulation: Since 9 MAC units operate in parallel, their outputs must be combined efficiently. A three-level CSA tree is employed: $$\begin{aligned} (P_0,P_1,P_2) &\rightarrow (s_0, c_0) \\ (P_3,P_4,P_5) &\rightarrow (s_1, c_1) \\ (P_6,P_7,P_8) &\rightarrow (s_2, c_2) \end{aligned}$$ The final result is computed as: \[ Out = (s_0+s_1+s_2) + ((c_0+c_1+c_2) \ll 1) \] This reduces delay compared to a single 9-input adder.

  • Pipeline Registers:

    1. Stage0: Holds sampled 16-bit pixel values (‘px16_s0’) from the input image.

    2. Stage1: Truncates pixel and kernel values to 8 bits for MACs (‘px8_s1’, ‘ker8_s1’).

    3. Stage2: Registers MAC outputs (‘prod_s2’) before feeding them to the CSA tree.
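The CSA reduction above can be checked with a small Python model: a 3:2 compressor forms bitwise sum and carry words satisfying (a + b + c = s + (carry \ll 1)), and three of them reduce the nine products exactly as in the Out formula:

```python
def csa(a, b, c, width=32):
    """3:2 carry-save compressor: a + b + c == s + (carry << 1)."""
    mask = (1 << width) - 1
    s = (a ^ b ^ c) & mask                        # bitwise sum
    carry = ((a & b) | (b & c) | (a & c)) & mask  # majority = carry bits
    return s, carry

def csa_tree_9(products):
    """Reduce 9 partial products with three CSAs, then combine:
    Out = (s0 + s1 + s2) + ((c0 + c1 + c2) << 1)."""
    s0, c0 = csa(*products[0:3])
    s1, c1 = csa(*products[3:6])
    s2, c2 = csa(*products[6:9])
    return (s0 + s1 + s2) + ((c0 + c1 + c2) << 1)
```

Because each compressor preserves the arithmetic value, the tree output equals the plain sum of the nine products while keeping every level carry-free.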

Pipeline Flow and Data Alignment

The convolution pipeline operates as follows:

  1. Sampling Window: The FSM scans the input image, extracting a 3x3 window for each output pixel. For same convolution, padding is applied by inserting zeros where indices go out of bounds.

  2. Kernel Flipping: The kernel is flipped prior to computation to satisfy the convolution definition.

  3. Stage Propagation:

    • Pixels from ‘px16_s0’ are truncated to 8-bit values and forwarded to MAC units in Stage1.

    • MAC outputs (‘prod_wire’) are registered into ‘prod_s2’ in Stage2 to align with the CSA pipeline.

  4. Valid Alignment: A 3-bit shift register (‘valid_pipe’) tracks the propagation of valid data through the stages. When ‘valid_pipe[2]’ is high, the output corresponds to a fully computed convolution value.

  5. Output Counters: ‘out_row_cnt’ and ‘out_col_cnt’ track the spatial position of the output pixel. These counters advance only when an output is emitted.
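As a toy illustration of steps 3–4 (a hypothetical helper, not the RTL), the valid-alignment scheme can be modeled as a 3-entry shift register that delays the valid flag by the datapath latency, with flush cycles draining the pipeline:

```python
def run_pipeline(windows):
    """Model of the 3-stage valid alignment: 'valid_pipe' delays the
    valid flag to match the datapath latency (sample -> MAC -> CSA),
    and the output count advances only when a result is emitted."""
    valid_pipe = [False, False, False]
    stage = [None, None, None]      # data travelling with each valid bit
    outputs, out_count = [], 0
    for win in windows + [None] * 3:        # 3 extra cycles flush the pipe
        if valid_pipe[2]:                   # stage2 holds a real output
            outputs.append(stage[2])
            out_count += 1
        # shift data and valid bits one stage per clock
        stage = [win, stage[0], stage[1]]
        valid_pipe = [win is not None, valid_pipe[0], valid_pipe[1]]
    return outputs, out_count
```

Every input window emerges exactly three cycles later, and the flush cycles guarantee the final outputs are not stranded in the pipeline.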

Verilog Implementation Overview

The design instantiates:

  • 9 MAC units, each forming a PE.

  • 3 CSAs, reducing the 9 partial products into the final sum.

  • FSM + pipeline control logic, ensuring continuous throughput, flush handling, and SAME/VALID mode support.

Key Architectural Features

  • PE-level parallelism: All 9 multiplications happen simultaneously.

  • MAC architecture: Booth multiplier + CSA + final adder reduces critical path delay.

  • CSA accumulation: Parallel reduction tree for 9 products avoids long adder chains.

  • Stage-by-stage pipelining: Continuous data flow, one output per cycle after pipeline fill.

  • Flush support: Pipeline always propagates valid signals until the last output is produced.

  • Flexible convolution modes: SAME with zero-padding, or VALID without padding.

Yosys Netlists of Modules

To validate the register–transfer level design and visualize the datapath structure, Yosys was used to generate netlists for the major building blocks. These netlists illustrate how arithmetic primitives and processing elements are interconnected within the systolic arrangement.

Netlist of the Booth multiplier block. The partial product generation and encoding stages are visible in the datapath.

Netlist of the carry-save adder (CSA). The three-input adder structure reduces multiple partial results into sum and carry outputs.

Netlist of a multiply-accumulate (MAC) unit. It integrates the Booth multiplier with a CSA and final adder.

Netlist of a processing element (PE). The PE encapsulates one MAC unit along with local registers for systolic data movement.

Netlist of the complete systolic array arrangement. Multiple PEs are interconnected to form a parallel datapath for matrix-style computation.

Comparison of MAC Unit Architectures

In high-throughput accelerators such as systolic arrays for GEMM and Conv2D, the performance of individual MAC units cannot be considered in isolation. When multiple MAC units are paired or arranged in a 2D array, additional waiting cycles occur due to pipeline alignment, data propagation, and carry-save accumulation. Consequently, the latency of a combined MAC pair or array is always higher than the latency of a single MAC unit, while throughput may be limited by interconnect delays and output readiness. This subsection analyzes both adder and multiplier architectures to highlight these effects.

8-bit Signed Adders

| Metric | CSA | Kogge–Stone | RCA |
|---|---|---|---|
| Core Area (µm²) | 2,522 | 3,689 | 1,126 |
| Die Area (µm²) | 4,604 | 6,093 | 2,572 |
| Utilization (%) | 54.79 | 60.54 | 43.79 |
| Flip-Flops | 0 | 0 | 0 |
| Total Cells | 64 | 110 | 34 |
| Combinational Cells | 64 | 110 | 34 |
| Clock Period (ns) | 10.00 | 10.00 | 10.00 |
| Critical Path (ns) | 5.07 | 6.21 | 7.14 |
| Fmax (MHz) | 197.24 | 161.03 | 140.06 |
| Total Power (mW) | 0.0834 | 0.0957 | 0.0326 |
| Energy per Cycle (pJ) | 0.834 | 0.957 | 0.326 |
| Latency (ns) | 5.07 | 6.21 | 7.14 |
| Throughput (ops/s) | 197,238,700 | 161,030,600 | 140,056,000 |
| Energy per Operation (pJ) | 422.84 | 594.30 | 232.76 |
| Power Efficiency (ops/s per mW) | 2,364,972,000 | 1,682,660,000 | 4,296,197,000 |
| Area Efficiency (ops/s per µm²) | 42,839.65 | 26,429.79 | 54,459.20 |

Observations:

  • CSA achieves the shortest critical path (5.07 ns) and highest Fmax (197 MHz) due to parallel partial sum computation.

  • RCA is the most compact and energy-efficient, but its simple ripple structure results in the slowest performance.

  • Kogge–Stone is fast in theory, but its dense prefix network adds wiring delay and area, eroding the advantage in practice.
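For reference, the parallel-prefix idea behind Kogge–Stone can be sketched at bit level in Python (an algorithmic sketch, not the synthesized netlist): generate/propagate pairs are combined over log₂(width) levels, so every carry is available without rippling:

```python
def kogge_stone_add(a, b, width=8):
    """Prefix addition of two unsigned width-bit values (result is
    taken modulo 2**width). g/p are the per-bit generate and propagate
    signals; the loop merges prefix groups, doubling the span each level."""
    mask = (1 << width) - 1
    g = a & b                       # generate: both bits set
    p = a ^ b                       # propagate: exactly one bit set
    d = 1
    while d < width:                # log2(width) prefix levels
        g = g | (p & (g << d))      # group generate of the merged span
        p = p & (p << d)            # group propagate of the merged span
        d <<= 1
    carries = (g << 1) & mask       # carry into bit i is G of bits i-1..0
    return ((a ^ b) ^ carries) & mask
```

The ripple-carry adder computes the same carries sequentially in O(width) stages, which is exactly the critical-path gap the table above quantifies.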

8-bit Signed Multipliers

| Metric | MBE | Booth | Baugh–Wooley |
|---|---|---|---|
| Core Area (µm²) | 9,594 | 14,369 | 11,911 |
| Die Area (µm²) | 13,094 | 18,687 | 15,827 |
| Utilization (%) | 73.27 | 76.89 | 75.26 |
| Flip-Flops | 0 | 0 | 0 |
| Total Cells | 294 | 470 | 374 |
| Combinational Cells | 294 | 470 | 374 |
| Clock Period (ns) | 10.00 | 10.00 | 10.00 |
| Critical Path (ns) | 8.84 | 12.50 | 8.63 |
| Fmax (MHz) | 113.12 | 80.00 | 115.87 |
| Total Power (mW) | 0.379 | 0.701 | 0.448 |
| Energy per Cycle (pJ) | 3.79 | 7.01 | 4.48 |
| Latency (ns) | 8.84 | 12.50 | 8.63 |
| Throughput (ops/s) | 113,122,200 | 80,000,000 | 115,874,900 |
| Energy per Operation (pJ) | 3,350.36 | 8,762.50 | 3,866.24 |
| Power Efficiency (ops/s per mW) | 298,475,400 | 114,122,700 | 258,649,200 |
| Area Efficiency (ops/s per µm²) | 8,639.16 | 4,281.08 | 7,321.29 |

Observations:

  • Baugh–Wooley achieves the shortest critical path (8.63 ns) and highest Fmax (115.9 MHz) due to its regular array layout.

  • Booth multiplier (Radix-2) suffers from high latency (12.5 ns) and power consumption (0.701 mW) because of additional partial product handling.

  • MBE provides the best energy efficiency and balanced area utilization, making it optimal for paired MAC deployments.
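The radix-4 recoding that gives MBE its advantage can be illustrated with a short Python model (a sketch of the encoding, not the synthesized design): overlapping bit triplets of the multiplier select digits in {−2, −1, 0, 1, 2}, halving the number of partial products versus a simple array:

```python
def mbe_multiply(a, b, width=8):
    """Signed width-bit multiply via radix-4 modified Booth encoding.
    The product is returned at full precision (Python ints)."""
    def to_signed(x):
        x &= (1 << width) - 1
        return x - (1 << width) if x & (1 << (width - 1)) else x
    a_s = to_signed(a)
    bpat = b & ((1 << width) - 1)   # two's-complement bit pattern of b
    # triplet b[i+1] b[i] b[i-1]  ->  digit in {-2, -1, 0, 1, 2}
    recode = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
              0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
    product, prev = 0, 0            # prev models the implicit bit b[-1] = 0
    for i in range(0, width, 2):    # width/2 partial products instead of width
        triplet = (((bpat >> i) & 0b11) << 1) | prev
        product += (recode[triplet] * a_s) << i
        prev = (bpat >> (i + 1)) & 1
    return product                  # equals signed(a) * signed(b)
```

Each iteration corresponds to one partial-product row in the hardware, which is why MBE needs only half as many rows to reduce as a radix-2 Booth or plain array multiplier.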

Summary of MAC Unit Efficiency

| Category | Best Adder | Best Multiplier |
|---|---|---|
| Speed / Fmax | CSA | Baugh–Wooley / MBE |
| Power | RCA | MBE |
| Area | RCA | MBE |
| Overall Efficiency | RCA (compact, efficient) | MBE (balanced) |

Table: Overall MAC Component Efficiency Summary

Based on these results, we selected the CSA and MBE as the final adder–multiplier pair for our design: the CSA provides significantly better speed at only a modest area cost, while MBE offers a balanced trade-off between speed, power, and area, making the combination optimal for overall MAC efficiency.

Key Insight:

While these tables reflect the isolated performance of individual adders and multipliers, the effective latency of a MAC array in GEMM or Conv2D accelerators is higher due to pipeline propagation and waiting cycles. Similarly, throughput is influenced by data alignment and CSA accumulation stages, highlighting that array-level performance always differs from single-unit metrics.

Timing Estimate (MODEL_ARCH_8)

Assumptions. The following simplifying assumptions are made for estimating the latency of one forward inference:

  1. The systolic convolution engine computes one convolution output (one spatial location (\times) one output channel) per clock cycle in steady state.

  2. A small fixed overhead of approximately 3 cycles per kernel is incurred (kernel flip and pipeline flush).

  3. Elementwise residual adds and global pooling reductions are implemented as 1 cycle per arithmetic operation.

  4. The fully connected (Dense) layer is implemented with one multiply–accumulate per cycle.

  5. No additional stalls are assumed from memory bandwidth or control logic.

  6. Clock frequency is treated as a variable (f), and estimates are shown for 50, 100, 200, and 250 MHz.

Output counts per layer.

\[ \begin{aligned} \text{conv2d } (32 \times 32 \times 28) &: 32 \cdot 32 \cdot 28 = 28{,}672 \\ \text{conv2d\_1 } (32 \times 32 \times 28) &: 28{,}672 \\ \text{conv2d\_2 } (32 \times 32 \times 28) &: 28{,}672 \\ \text{conv2d\_3 } (16 \times 16 \times 56) &: 16 \cdot 16 \cdot 56 = 14{,}336 \\ \text{conv2d\_4 } (16 \times 16 \times 56) &: 14{,}336 \\ \text{conv2d\_5 } (16 \times 16 \times 56) &: 14{,}336 \end{aligned} \]

\[ \text{Total convolution outputs} = 28{,}672 \times 3 + 14{,}336 \times 3 = 129{,}024 \]

Cycle accounting.

\[ \begin{aligned} \text{Base convolution cycles} &= 129{,}024 \\ \text{Per-kernel overhead} &= 252 \times 3 = 756 \\ \text{Residual adds} &= (32 \cdot 32 \cdot 28) + (16 \cdot 16 \cdot 56) \\ &= 28{,}672 + 14{,}336 = 43{,}008 \\ \text{Global average pooling} &= 56 \cdot 63 = 3{,}528 \\ \text{Dense } (56 \to 10) &= 560 \end{aligned} \]

\[ \text{Total cycles} = 129{,}024 + 756 + 43{,}008 + 3{,}528 + 560 = 176{,}876 \]

Wall-clock latency.

Latency is given by \[ T = \frac{\text{cycles}}{f}. \]

| Clock Frequency (f) | Period | Latency (T) |
|---|---|---|
| 50 MHz | 20 ns | 176,876 / 50×10⁶ ≈ 3.54 ms |
| 100 MHz | 10 ns | 176,876 / 100×10⁶ ≈ 1.77 ms |
| 200 MHz | 5 ns | 176,876 / 200×10⁶ ≈ 0.88 ms |
| 250 MHz | 4 ns | 176,876 / 250×10⁶ ≈ 0.71 ms |

GDS and Layout Visualizations for Adders and Multipliers

Adder Visualizations

(GDS and layout renderings of the CSA, Kogge–Stone, and RCA adders.)

Multiplier Visualizations

(GDS and layout renderings of the MBE, Booth, and Baugh–Wooley multipliers.)

Note on Environment and Verification Setup

All synthesis and verification experiments were carried out in a consistent flow, using the same TCL scripts and Yosys synthesis scripts. This uniform environment ensures that results are directly comparable without any tool-induced bias. Each RTL module was verified against its dedicated testbench, and post-synthesis reports confirmed that all designs met the specified timing requirements.

For technology mapping, only the SkyWater 130 nm (Sky130) open-source PDK was used. This PDK is provided through a collaboration between Google and SkyWater Technology, enabling open-source access to a fully characterized standard-cell library. The Sky130 platform allows digital design, synthesis, place-and-route, and verification flows to be tested on a real CMOS technology node.

Specifically, the standard-cell library used for synthesis was:

sky130_fd_sc_hs__tt_025C_1v80.lib

Breaking down the naming convention:

  • sky130 – Refers to the SkyWater 130 nm technology node.

  • fd – Denotes the "foundry" designation.

  • sc – Standard-cell library.

  • hs – High-speed (HS) flavor of the library, optimized for performance at the cost of area and power.

  • tt – Typical process corner (typical NMOS, typical PMOS), representing nominal manufacturing conditions.

  • 025C – Temperature corner at (25^{\circ})C, i.e., room temperature.

  • 1v80 – Operating supply voltage of 1.80 V.

sky130_fd_sc_hs__tt_025C_1v80.lib represents a standard-cell library characterized for the Sky130 process, using the high-speed cell set, under typical conditions at 25(^{\circ})C and 1.8 V supply. This ensures that synthesis, timing analysis, and verification are carried out on a realistic and industry-compatible open-source technology.