Systolic Array Matrix Multiplication
Overview
This project implements matrix multiplication using a systolic array architecture in Verilog.
The design enables parallel computation and leverages hardware-level performance optimization for high-speed data processing.
Architecture
Systolic Array
- Grid of Processing Elements (PEs) arranged for pipelined data movement.
- Each PE performs Multiply-Accumulate (MAC) operations.
- Designed for throughput optimization in matrix multiplication tasks.
Processing Element – MAC Unit
- Multiplication:
- Uses Booth’s algorithm for efficient signed multiplication.
- Addition:
- Employs a Kogge–Stone adder for fast parallel addition.
- Future enhancements:
- Exploring Modified Booth Encoding (MBE) and Carry-Save Adders (CSA) to further improve speed.
Key Features
- Fully parallel systolic array design
- Verilog-based hardware implementation
- Booth’s algorithm for multiplication efficiency
- Kogge–Stone adder for minimal addition latency
- Scalable for different matrix sizes
Applications
- Digital signal processing (DSP)
- Machine learning accelerators
- FPGA and ASIC implementations for high-speed computation
YOSYS SYNTHESIS STATS
3. Executing HIERARCHY pass (managing design hierarchy).
3.1. Analyzing design hierarchy..
Top module: \systolic_array
Used module: \pe
Used module: \mac_unit
Used module: \carry_save_adder
Used module: \booth_multiplier
3.2. Analyzing design hierarchy..
Top module: \systolic_array
Used module: \pe
Used module: \mac_unit
Used module: \carry_save_adder
Used module: \booth_multiplier
Removed 0 unused modules.
10. Printing statistics.
=== booth_multiplier ===
Number of wires: 42
Number of wire bits: 350
Number of public wires: 8
Number of public wire bits: 115
Number of memories: 0
Number of memory bits: 0
Number of processes: 0
Number of cells: 42
$add 3
$eq 21
$neg 8
$pmux 4
$reduce_or 6
=== carry_save_adder ===
Number of wires: 10
Number of wire bits: 160
Number of public wires: 5
Number of public wire bits: 80
Number of memories: 0
Number of memory bits: 0
Number of processes: 0
Number of cells: 7
$and 3
$or 2
$xor 2
=== mac_unit ===
Number of wires: 8
Number of wire bits: 112
Number of public wires: 8
Number of public wire bits: 112
Number of memories: 0
Number of memory bits: 0
Number of processes: 0
Number of cells: 3
$add 1
booth_multiplier 1
carry_save_adder 1
=== pe ===
Number of wires: 15
Number of wire bits: 88
Number of public wires: 10
Number of public wire bits: 83
Number of memories: 0
Number of memory bits: 0
Number of processes: 0
Number of cells: 10
$adff 3
$adffe 1
$logic_and 1
$logic_not 2
$not 1
$reduce_and 1
mac_unit 1
=== systolic_array ===
Number of wires: 37
Number of wire bits: 330
Number of public wires: 28
Number of public wire bits: 262
Number of memories: 0
Number of memory bits: 0
Number of processes: 0
Number of cells: 19
$add 1
$adff 5
$adffe 1
$eq 3
$lt 1
$pmux 4
pe 4
=== design hierarchy ===
systolic_array 1
pe 4
mac_unit 1
booth_multiplier 1
carry_save_adder 1
Number of wires: 337
Number of wire bits: 3170
Number of public wires: 152
Number of public wire bits: 1822
Number of memories: 0
Number of memory bits: 0
Number of processes: 0
Number of cells: 251
$add 17
$adff 17
$adffe 5
$and 12
$eq 87
$logic_and 4
$logic_not 8
$lt 1
$neg 32
$not 4
$or 8
$pmux 20
$reduce_and 4
$reduce_or 24
$xor 8
5 modules:
booth_multiplier
carry_save_adder
mac_unit
pe
systolic_array
<!-
TO BE VERIFIED | DEADLINE 2 SEPT
csa.v — carry_save_adder
module name and ports
carry_save_adder
input [15:0] A, B, C
output [15:0] Sum, Carry
functional behavior (bit-parallel, combinational)
Sum = A ^ B ^ C (bitwise XOR of the three operands). Carry = (A & B) | (B & C) | (C & A) (pairwise majority). Carry is not shifted inside the module; it is a bit-aligned carry vector suitable for left-shift and final addition outside the CSA.
timing and usage invariants
combinational block only. When used in a reduction tree, the consumer must perform any required carry shifting and final addition. The CSA preserves bit widths (N-bit in, N-bit out).
edge cases and signedness
operates on raw bit-vectors; signed interpretation must be handled by the consumer. No internal sign extension.
mac.v — booth multiplier + mac_unit
modules contained
booth_multiplier
(radix-4 style coded)
mac_unit
(instantiates booth multiplier + CSA reduction + final add)
booth_multiplier ports and widths
input signed [7:0] A
— multiplicand
input signed [7:0] B
— multiplier
output reg signed [15:0] P
— product
algorithm details and implementation notes
radix-4 Booth recoding: processes B in overlapping pairs. For each pair the booth code is {B[i+1], B[i], previous_bit}
where previous_bit is B[i-1] or 0 for i==0.
Cases produce: 0, +A<<i, +A«(i+1), -A<<i, -A«(i+1) as per Booth rules implemented in the code. Multiplicand is sign-extended to 16 bits ({{8{A[7]}}, A}
) before shifts and additions. result
accumulates signed partial products. The module is entirely combinational (always @(*)).
mac_unit ports and widths
input signed [7:0] A, B
input signed [15:0] AccIn
output signed [15:0] AccOut
internal wiring and reduction flow
product
is signed [15:0] from booth_multiplier
. The code instantiates a carry_save_adder
with operands product
, AccIn
, and 16'd0
. The CSA returns Sum
and Carry
. The carry is left-shifted by one (carry_shifted = carry << 1
). Final accumulator output AccOut = sum + carry_shifted
.
exact arithmetic semantics
AccOut
computes AccIn + (A * B)
implemented via CSA + final add. Because AccIn
is 16-bit signed and product
is 16-bit signed, the CSA and final adder are sized to 16 bits. Signed overflow semantics follow two’s complement wrap-around; no saturation logic exists.
timing and pipeline notes
design is combinational from inputs (A
, B
, AccIn
) to AccOut
. Any sequential timing or registerization must be provided by the caller. No internal registers in mac_unit
are used to pipeline the multiplier/adder.
pe.v — processing element
module name and ports
pe
input clk, rst, enable
input signed [7:0] A_in, B_in
output reg signed [7:0] A_out, B_out
output reg signed [15:0] Acc_out
internal state and wires
internal reg signed [15:0] acc_reg
holds the accumulator across cycles. Instantiates mac_unit
with A_in
, B_in
, AccIn = acc_reg
, AccOut = mac_result
.
clocking and reset behavior
on posedge clk
or posedge rst
:
- if
rst
is asserted:A_out <= 0
,B_out <= 0
,acc_reg <= 0
,Acc_out <= 0
. - else: data forwarding, accumulator update and freeze behavior as below.
accumulator update semantics (precise)
if enable
is true:
- if
A_in == 0 && B_in == 0
:acc_reg <= acc_reg
(explicit guard to avoid overwriting with mac result when both inputs are zero). This is a deliberate control in code (prevents a false accumulation when wave is over). - else:
acc_reg <= mac_result
. mac_result equalsacc_reg + A_in * B_in
permac_unit
combinational logic, so the assignment is the new accumulated value. ifenable
is false:acc_reg <= acc_reg
(accumulator freeze).
forwarding behavior
A_out
and B_out
are updated every clock (not conditional on enable
): A_out <= A_in
, B_out <= B_in
. Acc_out <= acc_reg
is also registered each clock.
timing invariant
Acc_out
reflects acc_reg
after the clock edge where it was updated (registered output). Using PEs in systolic arrays requires matching enable phases across the array so that mac updates coincide with correct wavefronts.
systolic_array.v — 2×2 systolic example
top-level ports and widths
input clk, rst
input [7:0] A00, A01, A10, A11
input [7:0] B00, B01, B10, B11
output [15:0] C00, C01, C10, C11
internal signals and PEs
wires for internal forwards: a0_out, a1_out, b0_out, b1_out
(signed 8-bit). Accumulators c00_out, c01_out, c10_out, c11_out
(signed 16-bit).
wavefront feeder FSM and control registers
reg [2:0] cycle; reg signed [7:0] A_in_0_0, A_in_1_0, B_in_0_0, B_in_0_1; reg enable;
Reset sets cycle=0, enable=0, inputs=0. On each clock cycle cycle increments until 5 (cycle < 5
) and enable
is cycle < 5
(enable asserted early cycles only). The case (cycle)
loads the first-column and first-row inputs across three explicit cycles:
- cycle == 1: feed A00 to A_in_0_0 and B00 to B_in_0_0, other inputs zero.
- cycle == 2: feed A01 to A_in_0_0 and B10 to B_in_0_0, feed A10 to A_in_1_0 and B01 to B_in_0_1.
- cycle == 3: feed A11 to A_in_1_0 and B11 to B_in_0_1, clear first inputs.
- other cycles: zero feed.
PE instantiation and connectivity (exact)
Row 0:
PE00
gets A_in_0_0, B_in_0_0 → outputs a0_out
, b0_out
, c00_out
.
PE01
gets A_in from a0_out
, B_in B_in_0_1 → outputs b1_out
, c01_out
.
Row 1:
PE10
gets A_in_1_0, B_in b0_out
→ outputs a1_out
, c10_out
.
PE11
gets A_in a1_out
, B_in b1_out
→ outputs c11_out
.
C outputs assignment
assign C00 = c00_out; assign C01 = c01_out; assign C10 = c10_out; assign C11 = c11_out;
timing and dataflow invariants
A values move right through A_out ports every cycle. B values move down through B_out usage every cycle. enable gating controls whether a PE will update its accumulator on a given cycle; forwarding outputs update regardless of enable. The 3-cycle feeding pattern creates a wavefront of products across the 2×2 array, and accumulators reflect local sums after the appropriate number of cycles.
conv.v — conv3x3_systolic
module name and ports (explicit)
conv3x3_systolic
inputs
clk, rst
we_img
(synchronous write enable to image memory)
img_addr [4:0]
(address into image memory, depth 32 available, used depth = IMG_PIX = 25)
img_wdata signed [15:0]
(write data)
we_ker
(write enable to kernel memory)
ker_addr [3:0]
(address into kernel memory, used depth = K_SZ = 9)
ker_wdata signed [15:0]
start
(start scanning)
mode_same
(1 => SAME zero-pad mode; 0 => VALID)
outputs
out_valid
(reg) — asserted when out_data
is valid this cycle
out_data signed [39:0]
— final assembled signed 40-bit convolution output
out_row [2:0]
, out_col [2:0]
— coordinates of the output in the output map
done
— pulses when the final output is emitted
parameters and memory layout constants (hardcoded)
parameter IMG_N = 5
(image dimension)
parameter IMG_PIX = 25
(image memory depth)
parameter K_SZ = 9
(kernel entries)
internal memories and data layout
img_mem [0:IMG_PIX-1]
holds signed [15:0] pixel values in row-major order; index memidx = row*IMG_N + col
. ker [0:K_SZ-1]
holds signed [15:0] kernel values in row-major order. ker_f [0:K_SZ-1]
is flipped kernel (convolution semantics: flip the kernel before multiply-accumulate).
pipeline registers and widths (explicit)
stage0 registers: px16_s0[0..8]
signed [15:0] — sampled window pixels
stage1 registers: px8_s1[0..8]
signed [7:0] — truncated pixel bytes used by MACs
ker8_s1[0..8]
signed [7:0] — kernel bytes (flipped)
stage2 registers: prod_s2[0..8]
signed [15:0] — registered MAC products
combinational macs and reduction tree
nine mac instances: mac_unit
used with AccIn = 16'd0
so they produce product-only outputs observed on prod_wire[0..8]
combinationally from px8_s1
and ker8_s1
.
CSA tree (three CSA blocks)
CSA groups:
- CSA0 reduces prod_s2[0..2] → s0 (sum), c0 (carry)
- CSA1 reduces prod_s2[3..5] → s1, c1
- CSA2 reduces prod_s2[6..8] → s2, c2
final assembly into out_data (precise bit operations)
When the valid pipeline reaches depth 3, out_data
is computed as:
out_data <= ({{24{s0[15]}}, s0} +
{{24{s1[15]}}, s1} +
{{24{s2[15]}}, s2} +
{{23{c0[15]}}, c0, 1'b0} +
{{23{c1[15]}}, c1, 1'b0} +
{{23{c2[15]}}, c2, 1'b0});
Explanation of the bit-concatenations and shifts
s0
,s1
,s2
are 16-bit sums; each is sign-extended to 40 bits by{{24{sX[15]}}, sX}
.c0
,c1
,c2
are 16-bit carry vectors; each is sign-extended to 39 bits{{23{cX[15]}}, cX}
then shifted left by one (... , 1'b0
) which places the carry bits at the correct significance. The carry shift-left is because CSA carry bits represent carries into the next more-significant bit position. Final addition sums all six extended vectors to produce signed 40-bitout_data
.
valid pipeline and alignment invariant (exact)
valid_pipe
is a 3-bit shift register. Each clock cycle it shifts left and inserts running
as the new LSB, where running
indicates a sample was taken during the previous stage0 action. valid_pipe[2]
being high indicates that data sampled 3 stages earlier has been assembled in prod_s2
and the final assembly result is ready this cycle. This explicit valid pipeline guarantees deterministic alignment between the sampled location and out_data
.
scan control, row/col iteration, running termination (detailed)
running
is asserted on start
and cleared when the scan reaches out_rows_max
and out_cols_max
and the last sampling cycle has been taken. out_rows_max
and out_cols_max
are set at start
:
- if
mode_same == 1
:out_rows_max = IMG_N - 1
,out_cols_max = IMG_N - 1
(output size = IMG_N × IMG_N with zero-padding) - else:
out_rows_max = IMG_N - 3
,out_cols_max = IMG_N - 3
(VALID mode; output size = (IMG_N-2) × (IMG_N-2))
sampling loop (exact code behavior)
when running
:
pad = (mode_same ? -1 : 0)
which is an integer negative for SAME mode.nested loops
rr=0..2
,cc=0..2
compute sample coordinatesxi = r_row + rr + pad
andxj = r_col + cc + pad
.if
mode_same
and coordinate out of bounds, sample value = 0; otherwisepx16_s0[idx] <= img_mem[memidx]
.after sampling a 3×3 window, indices advance:
- if
r_col == out_cols_max
:r_col <= 0
andr_row
increments or clearsrunning
when both at max; - else:
r_col <= r_col + 1
.
- if
flush invariant
running
can be cleared after the last sample but valid_pipe
continues shifting so the final sampled windows still propagate to the assembly stage and assert out_valid
. The code sets valid_pipe <= {valid_pipe[1:0], running}
each clock, so outputs corresponding to the last samples are emitted after pipeline latency.
output counters and done semantics (precise)
out_row_cnt
, out_col_cnt
advance only when an output is emitted (i.e., when valid_pipe[2]
is set and assembly occurs). out_row
and out_col
register the corresponding coordinates when out_valid
is asserted. done
is set when the emitted coordinates equal out_rows_max
and out_cols_max
(final output position).
memory write behavior (synchronous)
synchronous always @(posedge clk)
block handles we_img
and we_ker
writes. rst
zeroes both memories on reset. Writes occur on the positive edge when the corresponding we_*
signals are asserted.
truncation and precision notes (explicit)
pixels are stored in img_mem
as signed 16-bit. Before MACs they are truncated to signed 8-bit: px8_s1[i] <= px16_s0[i][7:0]
. kernel values flip are also truncated similarly: ker8_s1[i] <= ker_f[i][7:0]
. Because truncation is explicit, product values are computed with 8×8→16 precision and then combined into a 40-bit accumulation. The code does not scale or saturate; two’s complement wrap-around applies on overflow.
conv_tb.v — tb_conv3x3_systolic (testbench)
main purpose and topology
testbench instantiates conv3x3_systolic
and verifies SAME and VALID modes against hard-coded expected matrices.
clock generation
initial begin clk = 0; forever #5 clk = ~clk; end
→ clock period = 10 time units (the TB prints “drives clock (10 ns period in the TB)” semantics in comments).
reset and initial values
rst
asserted for 20 time units then deasserted. we_img
, we_ker
, start
cleared initially, mode_same
cleared initially.
test stimulus matrices (values exact)
image X[0..24]
assigned as (row-major, 5×5):
X = [ 0, 4, -2, 1, -4,
3, -1, 0, 2, 4,
-3, 2, 1, -1, 0,
4, -4, 3, 0, -2,
1, 0, -1, 4, -3 ]
kernel K[0..8]
assigned as (row-major, 3×3):
K = [ 2, -1, 0,
3, 4, -2,
-3, 1, 1 ]
write behavior and addresses used
the testbench writes 25 image locations sequentially to img_addr = 0..24
with we_img = 1
. It writes 9 kernel locations sequentially to ker_addr = 0..8
with we_ker = 1
.
SAME mode test sequence (explicit)
mode_same = 1
. On a positive clock edgestart
is asserted then deasserted next cycle (one-cycle pulse).count = 0
. TB waits on rising clock edges; wheneverout_valid
is true it capturessame_out[out_row*5 + out_col] = out_data
and incrementscount
. The loop continues untilcount == 25
(collect 25 outputs).- Expected SAME matrix in TB (
expect_same[0..24]
) is:
[ 7, 11, -9, 2, -22,
4, 0, 4, 32, 9,
-12, 29, -13, -20, 10,
-6, -21, 35, -22, -6,
20, -14, 7, 18, -22 ]
- TB compares each collected output to
expect_same
and reports mismatches via$display("Mismatch SAME at idx %0d: got %0d expected %0d", ...)
. A summary message prints whetherSAME: All outputs match expected.
orSAME: N mismatches.
VALID mode test sequence (explicit)
mode_same = 0
. Againstart
is pulsed. The TB collects 9 outputs intovalid_out[out_row*3 + out_col]
untilcount == 9
.- Expected VALID matrix in TB (
expect_valid[0..8]
) is:
[ 0, 4, 32,
29, -13, -20,
-21, 35, -22 ]
- TB compares the collected values to
expect_valid
and prints mismatch lines and a summary.
testbench termination and reporting semantics
after both modes are exercised and comparisons printed, the TB prints Testbench finished.
and calls $finish
.
explicit list of signals whose widths and signedness matter (reference)
img_mem entries: signed [15:0] ker entries: signed [15:0] px16_s0: signed [15:0] px8_s1, ker8_s1: signed [7:0] (truncation from 16-bit to 8-bit uses lower byte) prod_wire, prod_s2: signed [15:0] (8×8 product) CSA Sum/Carry: [15:0] each (carry is unshifted within CSA) out_data: signed [39:0] (final assembled 40-bit result) mac_unit AccIn/AccOut: signed [15:0] (accumulator for PE) PE A_out/B_out: signed [7:0] (forwards truncated A/B) PE Acc_out: signed [15:0]
hard invariants and corner cases to be aware of (explicit)
- kernel flip happens once at
start
byker_f[i] <= ker[K_SZ-1 - i]
. The flip is synchronous and latched on the same cyclerunning
is set.ker_f
used in stage1 → stage2 path. px8_s1[i] <= px16_s0[i][7:0]
extracts low byte; if the 16-bit pixel had significant bits above bit-7, they are dropped without rounding or saturation.- final
out_data
width is 40 bits. All intermediate sign-extensions use the MSB of 16-bit quantities to propagate sign to 40 bits prior to addition. valid_pipe
depth = 3 always; outputs are produced only whenvalid_pipe[2]
is high. Clearingrunning
does not immediately clearvalid_pipe
, so last samples flush.done
is set synchronously during the cycle that produces the final output (when emitted coordinates equalout_rows_max
andout_cols_max
).done
returns to zero on next reset or new start flow in code (the code setsdone <= 1'b0
at the start of the clock body each cycle except when set).we_img
andwe_ker
writes are synchronous; readingimg_mem[memidx]
in the same clock wherewe_img
wrote to the same address will return old or new data depending on simulator/FPGA RAM behavior. The code assumes synchronous RAM semantics: writes commit on posedge and reads reflect memory contents on next posedge when sampled in stage0 the following cycle.mode_same
toggles pad behavior:pad = -1
for SAME mode andpad = 0
for VALID.pad
type is integer; negative pad handled logically by boundary checkif (xi < 0 || xi >= IMG_N || xj < 0 || xj >= IMG_N) px16_s0[idx] <= 0;
.
factual summary of the runtime sequence
drives clock (10 time units period).
writes 25 image pixels into img_mem
.
writes 9 kernel values into ker[]
.
runs SAME (zero-pad) mode — expects and checks 25 outputs.
runs VALID (no pad) mode — expects and checks 9 outputs.
prints summary of matches/mismatches and finishes simulation.
explicit mapping: code locations → behaviors
carry-save logic: csa.v
top.
booth recode multiplier and MAC reduction: mac.v
.
processing element and accumulate/forward semantics: pe.v
.
2×2 systolic example and feeding cycles: systolic_array.v
.
conv3x3 pipelined sampling, MAC array, CSA tree, valid pipeline, kernel flipping, out assembly: conv.v
(module conv3x3_systolic
).
test sequences and expected golden outputs: conv_tb.v
.
$$ X=\begin{bmatrix} 0 & 4 & -2 & 1 & -4\ 3 & -1 & 0 & 2 & 4\ -3 & 2 & 1 & -1 & 0\ 4 & -4 & 3 & 0 & -2\ 1 & 0 & -1 & 4 & -3 \end{bmatrix}, \qquad K=\begin{bmatrix} 2 & -1 & 0\ 3 & 4 & -2\ -3 & 1 & 1 \end{bmatrix}. $$
Convolution definition (linear): $(X*K)[i,j]=\sum_{u=-1}^{1}\sum_{v=-1}^{1} \tilde K[u,v];X[i-u,;j-v]$, where $\tilde K$ is $K$ flipped in both axes (standard convolution).
Without padding (valid)
Output size $= (5-3+1)\times(5-3+1)=3\times3$:
$$ X*K;;(\text{valid})=\begin{bmatrix} 0 & 4 & 32\ 29 & -13 & -20\ -21 & 35 & -22 \end{bmatrix}. $$
With zero padding (same)
Zero-pad $X$ by 1 pixel on all sides; output size $=5\times5$:
$$ X*K;;(\text{same})=\begin{bmatrix} 7 & 11 & -9 & 2 & -22\ 4 & 0 & 4 & 32 & 9\ -12 & 29 & -13 & -20 & 10\ -6 & -21 & 35 & -22 & -6\ 20 & -14 & 7 & 18 & -22 \end{bmatrix}. $$