Systolic Array Matrix Multiplication

Overview

This project implements matrix multiplication using a systolic array architecture in Verilog.
The design enables parallel computation and leverages hardware-level performance optimization for high-speed data processing.

Architecture

Systolic Array

Grid of Processing Elements (PEs) arranged for pipelined data movement.
Each PE performs Multiply-Accumulate (MAC) operations.
Designed for throughput optimization in matrix multiplication tasks.

Processing Element – MAC Unit

Multiplication:
- Uses Booth’s algorithm for efficient signed multiplication.
Addition:
- Employs a Kogge–Stone adder for fast parallel addition.
Future enhancements:
- Exploring Modified Booth Encoding (MBE) and Carry-Save Adders (CSA) to further improve speed.

Key Features

Fully parallel systolic array design
Verilog-based hardware implementation
Booth’s algorithm for multiplication efficiency
Kogge–Stone adder for minimal addition latency
Scalable for different matrix sizes

Applications

Digital signal processing (DSP)
Machine learning accelerators
FPGA and ASIC implementations for high-speed computation

YOSYS SYNTHESIS STATS

3. Executing HIERARCHY pass (managing design hierarchy).

3.1. Analyzing design hierarchy..
Top module:  \systolic_array
Used module:     \pe
Used module:         \mac_unit
Used module:             \carry_save_adder
Used module:             \booth_multiplier

3.2. Analyzing design hierarchy..
Top module:  \systolic_array
Used module:     \pe
Used module:         \mac_unit
Used module:             \carry_save_adder
Used module:             \booth_multiplier
Removed 0 unused modules.

10. Printing statistics.

=== booth_multiplier ===

   Number of wires:                 42
   Number of wire bits:            350
   Number of public wires:           8
   Number of public wire bits:     115
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                 42
     $add                            3
     $eq                            21
     $neg                            8
     $pmux                           4
     $reduce_or                      6

=== carry_save_adder ===

   Number of wires:                 10
   Number of wire bits:            160
   Number of public wires:           5
   Number of public wire bits:      80
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                  7
     $and                            3
     $or                             2
     $xor                            2

=== mac_unit ===

   Number of wires:                  8
   Number of wire bits:            112
   Number of public wires:           8
   Number of public wire bits:     112
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                  3
     $add                            1
     booth_multiplier                1
     carry_save_adder                1

=== pe ===

   Number of wires:                 15
   Number of wire bits:             88
   Number of public wires:          10
   Number of public wire bits:      83
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                 10
     $adff                           3
     $adffe                          1
     $logic_and                      1
     $logic_not                      2
     $not                            1
     $reduce_and                     1
     mac_unit                        1

=== systolic_array ===

   Number of wires:                 37
   Number of wire bits:            330
   Number of public wires:          28
   Number of public wire bits:     262
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                 19
     $add                            1
     $adff                           5
     $adffe                          1
     $eq                             3
     $lt                             1
     $pmux                           4
     pe                              4

=== design hierarchy ===

   systolic_array                    1
     pe                              4
       mac_unit                      1
         booth_multiplier            1
         carry_save_adder            1

   Number of wires:                337
   Number of wire bits:           3170
   Number of public wires:         152
   Number of public wire bits:    1822
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                251
     $add                           17
     $adff                          17
     $adffe                          5
     $and                           12
     $eq                            87
     $logic_and                      4
     $logic_not                      8
     $lt                             1
     $neg                           32
     $not                            4
     $or                             8
     $pmux                          20
     $reduce_and                     4
     $reduce_or                     24
     $xor                            8


5 modules:
  booth_multiplier
  carry_save_adder
  mac_unit
  pe
  systolic_array

<!-

TO BE VERIFIED | DEADLINE 2 SEPT

csa.v — carry_save_adder

module name and ports

carry_save_adder input [15:0] A, B, C output [15:0] Sum, Carry

functional behavior (bit-parallel, combinational)

Sum = A ^ B ^ C (bitwise XOR of the three operands). Carry = (A & B) | (B & C) | (C & A) (pairwise majority). Carry is not shifted inside the module; it is a bit-aligned carry vector suitable for left-shift and final addition outside the CSA.

timing and usage invariants

combinational block only. When used in a reduction tree, the consumer must perform any required carry shifting and final addition. The CSA preserves bit widths (N-bit in, N-bit out).

edge cases and signedness

operates on raw bit-vectors; signed interpretation must be handled by the consumer. No internal sign extension.

mac.v — booth multiplier + mac_unit

modules contained

booth_multiplier (radix-4 style coded) mac_unit (instantiates booth multiplier + CSA reduction + final add)

booth_multiplier ports and widths

input signed [7:0] A — multiplicand input signed [7:0] B — multiplier output reg signed [15:0] P — product

algorithm details and implementation notes

radix-4 Booth recoding: processes B in overlapping pairs. For each pair the booth code is {B[i+1], B[i], previous_bit} where previous_bit is B[i-1] or 0 for i==0.

Cases produce: 0, +A<<i, +A«(i+1), -A<<i, -A«(i+1) as per Booth rules implemented in the code. Multiplicand is sign-extended to 16 bits ({{8{A[7]}}, A}) before shifts and additions. result accumulates signed partial products. The module is entirely combinational (always @(*)).

mac_unit ports and widths

input signed [7:0] A, B input signed [15:0] AccIn output signed [15:0] AccOut

internal wiring and reduction flow

product is signed [15:0] from booth_multiplier. The code instantiates a carry_save_adder with operands product, AccIn, and 16'd0. The CSA returns Sum and Carry. The carry is left-shifted by one (carry_shifted = carry << 1). Final accumulator output AccOut = sum + carry_shifted.

exact arithmetic semantics

AccOut computes AccIn + (A * B) implemented via CSA + final add. Because AccIn is 16-bit signed and product is 16-bit signed, the CSA and final adder are sized to 16 bits. Signed overflow semantics follow two’s complement wrap-around; no saturation logic exists.

timing and pipeline notes

design is combinational from inputs (A, B, AccIn) to AccOut. Any sequential timing or registerization must be provided by the caller. No internal registers in mac_unit are used to pipeline the multiplier/adder.

pe.v — processing element

module name and ports

pe input clk, rst, enable input signed [7:0] A_in, B_in output reg signed [7:0] A_out, B_out output reg signed [15:0] Acc_out

internal state and wires

internal reg signed [15:0] acc_reg holds the accumulator across cycles. Instantiates mac_unit with A_in, B_in, AccIn = acc_reg, AccOut = mac_result.

clocking and reset behavior

on posedge clk or posedge rst:

if rst is asserted: A_out <= 0, B_out <= 0, acc_reg <= 0, Acc_out <= 0.
else: data forwarding, accumulator update and freeze behavior as below.

accumulator update semantics (precise)

if enable is true:

if A_in == 0 && B_in == 0: acc_reg <= acc_reg (explicit guard to avoid overwriting with mac result when both inputs are zero). This is a deliberate control in code (prevents a false accumulation when wave is over).
else: acc_reg <= mac_result. mac_result equals acc_reg + A_in * B_in per mac_unit combinational logic, so the assignment is the new accumulated value. if enable is false: acc_reg <= acc_reg (accumulator freeze).

forwarding behavior

A_out and B_out are updated every clock (not conditional on enable): A_out <= A_in, B_out <= B_in. Acc_out <= acc_reg is also registered each clock.

timing invariant

Acc_out reflects acc_reg after the clock edge where it was updated (registered output). Using PEs in systolic arrays requires matching enable phases across the array so that mac updates coincide with correct wavefronts.

systolic_array.v — 2×2 systolic example

top-level ports and widths

input clk, rst input [7:0] A00, A01, A10, A11 input [7:0] B00, B01, B10, B11 output [15:0] C00, C01, C10, C11

internal signals and PEs

wires for internal forwards: a0_out, a1_out, b0_out, b1_out (signed 8-bit). Accumulators c00_out, c01_out, c10_out, c11_out (signed 16-bit).

wavefront feeder FSM and control registers

reg [2:0] cycle; reg signed [7:0] A_in_0_0, A_in_1_0, B_in_0_0, B_in_0_1; reg enable;

Reset sets cycle=0, enable=0, inputs=0. On each clock cycle cycle increments until 5 (cycle < 5) and enable is cycle < 5 (enable asserted early cycles only). The case (cycle) loads the first-column and first-row inputs across three explicit cycles:

cycle == 1: feed A00 to A_in_0_0 and B00 to B_in_0_0, other inputs zero.
cycle == 2: feed A01 to A_in_0_0 and B10 to B_in_0_0, feed A10 to A_in_1_0 and B01 to B_in_0_1.
cycle == 3: feed A11 to A_in_1_0 and B11 to B_in_0_1, clear first inputs.
other cycles: zero feed.

PE instantiation and connectivity (exact)

Row 0: PE00 gets A_in_0_0, B_in_0_0 → outputs a0_out, b0_out, c00_out. PE01 gets A_in from a0_out, B_in B_in_0_1 → outputs b1_out, c01_out.

Row 1: PE10 gets A_in_1_0, B_in b0_out → outputs a1_out, c10_out. PE11 gets A_in a1_out, B_in b1_out → outputs c11_out.

C outputs assignment assign C00 = c00_out; assign C01 = c01_out; assign C10 = c10_out; assign C11 = c11_out;

timing and dataflow invariants

A values move right through A_out ports every cycle. B values move down through B_out usage every cycle. enable gating controls whether a PE will update its accumulator on a given cycle; forwarding outputs update regardless of enable. The 3-cycle feeding pattern creates a wavefront of products across the 2×2 array, and accumulators reflect local sums after the appropriate number of cycles.

conv.v — conv3x3_systolic

module name and ports (explicit)

conv3x3_systolic

inputs

clk, rst we_img (synchronous write enable to image memory) img_addr [4:0] (address into image memory, depth 32 available, used depth = IMG_PIX = 25) img_wdata signed [15:0] (write data) we_ker (write enable to kernel memory) ker_addr [3:0] (address into kernel memory, used depth = K_SZ = 9) ker_wdata signed [15:0] start (start scanning) mode_same (1 => SAME zero-pad mode; 0 => VALID)

outputs

out_valid (reg) — asserted when out_data is valid this cycle out_data signed [39:0] — final assembled signed 40-bit convolution output out_row [2:0], out_col [2:0] — coordinates of the output in the output map done — pulses when the final output is emitted

parameters and memory layout constants (hardcoded)

parameter IMG_N = 5 (image dimension) parameter IMG_PIX = 25 (image memory depth) parameter K_SZ = 9 (kernel entries)

internal memories and data layout

img_mem [0:IMG_PIX-1] holds signed [15:0] pixel values in row-major order; index memidx = row*IMG_N + col. ker [0:K_SZ-1] holds signed [15:0] kernel values in row-major order. ker_f [0:K_SZ-1] is flipped kernel (convolution semantics: flip the kernel before multiply-accumulate).

pipeline registers and widths (explicit)

stage0 registers: px16_s0[0..8] signed [15:0] — sampled window pixels stage1 registers: px8_s1[0..8] signed [7:0] — truncated pixel bytes used by MACs ker8_s1[0..8] signed [7:0] — kernel bytes (flipped) stage2 registers: prod_s2[0..8] signed [15:0] — registered MAC products

combinational macs and reduction tree

nine mac instances: mac_unit used with AccIn = 16'd0 so they produce product-only outputs observed on prod_wire[0..8] combinationally from px8_s1 and ker8_s1.

CSA tree (three CSA blocks)

CSA groups:

CSA0 reduces prod_s2[0..2] → s0 (sum), c0 (carry)
CSA1 reduces prod_s2[3..5] → s1, c1
CSA2 reduces prod_s2[6..8] → s2, c2

final assembly into out_data (precise bit operations)

When the valid pipeline reaches depth 3, out_data is computed as:

out_data <= ({{24{s0[15]}}, s0} +
             {{24{s1[15]}}, s1} +
             {{24{s2[15]}}, s2} +
             {{23{c0[15]}}, c0, 1'b0} +
             {{23{c1[15]}}, c1, 1'b0} +
             {{23{c2[15]}}, c2, 1'b0});

Explanation of the bit-concatenations and shifts

s0, s1, s2 are 16-bit sums; each is sign-extended to 40 bits by {{24{sX[15]}}, sX}.
c0, c1, c2 are 16-bit carry vectors; each is sign-extended to 39 bits {{23{cX[15]}}, cX} then shifted left by one (... , 1'b0) which places the carry bits at the correct significance. The carry shift-left is because CSA carry bits represent carries into the next more-significant bit position. Final addition sums all six extended vectors to produce signed 40-bit out_data.

valid pipeline and alignment invariant (exact)

valid_pipe is a 3-bit shift register. Each clock cycle it shifts left and inserts running as the new LSB, where running indicates a sample was taken during the previous stage0 action. valid_pipe[2] being high indicates that data sampled 3 stages earlier has been assembled in prod_s2 and the final assembly result is ready this cycle. This explicit valid pipeline guarantees deterministic alignment between the sampled location and out_data.

scan control, row/col iteration, running termination (detailed)

running is asserted on start and cleared when the scan reaches out_rows_max and out_cols_max and the last sampling cycle has been taken. out_rows_max and out_cols_max are set at start:

if mode_same == 1: out_rows_max = IMG_N - 1, out_cols_max = IMG_N - 1 (output size = IMG_N × IMG_N with zero-padding)
else: out_rows_max = IMG_N - 3, out_cols_max = IMG_N - 3 (VALID mode; output size = (IMG_N-2) × (IMG_N-2))

sampling loop (exact code behavior)

when running:

pad = (mode_same ? -1 : 0) which is an integer negative for SAME mode.
nested loops rr=0..2, cc=0..2 compute sample coordinates xi = r_row + rr + pad and xj = r_col + cc + pad.
if mode_same and coordinate out of bounds, sample value = 0; otherwise px16_s0[idx] <= img_mem[memidx].
after sampling a 3×3 window, indices advance:
- if r_col == out_cols_max: r_col <= 0 and r_row increments or clears running when both at max;
- else: r_col <= r_col + 1.

flush invariant

running can be cleared after the last sample but valid_pipe continues shifting so the final sampled windows still propagate to the assembly stage and assert out_valid. The code sets valid_pipe <= {valid_pipe[1:0], running} each clock, so outputs corresponding to the last samples are emitted after pipeline latency.

output counters and done semantics (precise)

out_row_cnt, out_col_cnt advance only when an output is emitted (i.e., when valid_pipe[2] is set and assembly occurs). out_row and out_col register the corresponding coordinates when out_valid is asserted. done is set when the emitted coordinates equal out_rows_max and out_cols_max (final output position).

memory write behavior (synchronous)

synchronous always @(posedge clk) block handles we_img and we_ker writes. rst zeroes both memories on reset. Writes occur on the positive edge when the corresponding we_* signals are asserted.

truncation and precision notes (explicit)

pixels are stored in img_mem as signed 16-bit. Before MACs they are truncated to signed 8-bit: px8_s1[i] <= px16_s0[i][7:0]. kernel values flip are also truncated similarly: ker8_s1[i] <= ker_f[i][7:0]. Because truncation is explicit, product values are computed with 8×8→16 precision and then combined into a 40-bit accumulation. The code does not scale or saturate; two’s complement wrap-around applies on overflow.

conv_tb.v — tb_conv3x3_systolic (testbench)

main purpose and topology

testbench instantiates conv3x3_systolic and verifies SAME and VALID modes against hard-coded expected matrices.

clock generation

initial begin clk = 0; forever #5 clk = ~clk; end → clock period = 10 time units (the TB prints “drives clock (10 ns period in the TB)” semantics in comments).

reset and initial values

rst asserted for 20 time units then deasserted. we_img, we_ker, start cleared initially, mode_same cleared initially.

test stimulus matrices (values exact)

image X[0..24] assigned as (row-major, 5×5):

X = [  0,  4, -2,  1, -4,
       3, -1,  0,  2,  4,
      -3,  2,  1, -1,  0,
       4, -4,  3,  0, -2,
       1,  0, -1,  4, -3 ]

kernel K[0..8] assigned as (row-major, 3×3):

K = [  2, -1,  0,
       3,  4, -2,
      -3,  1,  1 ]

write behavior and addresses used

the testbench writes 25 image locations sequentially to img_addr = 0..24 with we_img = 1. It writes 9 kernel locations sequentially to ker_addr = 0..8 with we_ker = 1.

SAME mode test sequence (explicit)

mode_same = 1. On a positive clock edge start is asserted then deasserted next cycle (one-cycle pulse).
count = 0. TB waits on rising clock edges; whenever out_valid is true it captures same_out[out_row*5 + out_col] = out_data and increments count. The loop continues until count == 25 (collect 25 outputs).
Expected SAME matrix in TB (expect_same[0..24]) is:

[ 7, 11, -9, 2, -22,
  4, 0, 4, 32, 9,
 -12, 29, -13, -20, 10,
 -6, -21, 35, -22, -6,
 20, -14, 7, 18, -22 ]

TB compares each collected output to expect_same and reports mismatches via $display("Mismatch SAME at idx %0d: got %0d expected %0d", ...). A summary message prints whether SAME: All outputs match expected. or SAME: N mismatches.

VALID mode test sequence (explicit)

mode_same = 0. Again start is pulsed. The TB collects 9 outputs into valid_out[out_row*3 + out_col] until count == 9.
Expected VALID matrix in TB (expect_valid[0..8]) is:

[ 0, 4, 32,
  29, -13, -20,
  -21, 35, -22 ]

TB compares the collected values to expect_valid and prints mismatch lines and a summary.

testbench termination and reporting semantics after both modes are exercised and comparisons printed, the TB prints Testbench finished. and calls $finish.

explicit list of signals whose widths and signedness matter (reference)

img_mem entries: signed [15:0] ker entries: signed [15:0] px16_s0: signed [15:0] px8_s1, ker8_s1: signed [7:0] (truncation from 16-bit to 8-bit uses lower byte) prod_wire, prod_s2: signed [15:0] (8×8 product) CSA Sum/Carry: [15:0] each (carry is unshifted within CSA) out_data: signed [39:0] (final assembled 40-bit result) mac_unit AccIn/AccOut: signed [15:0] (accumulator for PE) PE A_out/B_out: signed [7:0] (forwards truncated A/B) PE Acc_out: signed [15:0]

hard invariants and corner cases to be aware of (explicit)

kernel flip happens once at start by ker_f[i] <= ker[K_SZ-1 - i]. The flip is synchronous and latched on the same cycle running is set. ker_f used in stage1 → stage2 path.
px8_s1[i] <= px16_s0[i][7:0] extracts low byte; if the 16-bit pixel had significant bits above bit-7, they are dropped without rounding or saturation.
final out_data width is 40 bits. All intermediate sign-extensions use the MSB of 16-bit quantities to propagate sign to 40 bits prior to addition.
valid_pipe depth = 3 always; outputs are produced only when valid_pipe[2] is high. Clearing running does not immediately clear valid_pipe, so last samples flush.
done is set synchronously during the cycle that produces the final output (when emitted coordinates equal out_rows_max and out_cols_max). done returns to zero on next reset or new start flow in code (the code sets done <= 1'b0 at the start of the clock body each cycle except when set).
we_img and we_ker writes are synchronous; reading img_mem[memidx] in the same clock where we_img wrote to the same address will return old or new data depending on simulator/FPGA RAM behavior. The code assumes synchronous RAM semantics: writes commit on posedge and reads reflect memory contents on next posedge when sampled in stage0 the following cycle.
mode_same toggles pad behavior: pad = -1 for SAME mode and pad = 0 for VALID. pad type is integer; negative pad handled logically by boundary check if (xi < 0 || xi >= IMG_N || xj < 0 || xj >= IMG_N) px16_s0[idx] <= 0;.

factual summary of the runtime sequence

drives clock (10 time units period). writes 25 image pixels into img_mem. writes 9 kernel values into ker[]. runs SAME (zero-pad) mode — expects and checks 25 outputs. runs VALID (no pad) mode — expects and checks 9 outputs. prints summary of matches/mismatches and finishes simulation.

explicit mapping: code locations → behaviors

carry-save logic: csa.v top. booth recode multiplier and MAC reduction: mac.v. processing element and accumulate/forward semantics: pe.v. 2×2 systolic example and feeding cycles: systolic_array.v. conv3x3 pipelined sampling, MAC array, CSA tree, valid pipeline, kernel flipping, out assembly: conv.v (module conv3x3_systolic). test sequences and expected golden outputs: conv_tb.v.

$$ X=\begin{bmatrix} 0 & 4 & -2 & 1 & -4\ 3 & -1 & 0 & 2 & 4\ -3 & 2 & 1 & -1 & 0\ 4 & -4 & 3 & 0 & -2\ 1 & 0 & -1 & 4 & -3 \end{bmatrix}, \qquad K=\begin{bmatrix} 2 & -1 & 0\ 3 & 4 & -2\ -3 & 1 & 1 \end{bmatrix}. $$

Convolution definition (linear): $(X*K)[i,j]=\sum_{u=-1}^{1}\sum_{v=-1}^{1} \tilde K[u,v];X[i-u,;j-v]$, where $\tilde K$ is $K$ flipped in both axes (standard convolution).

Without padding (valid)

Output size $= (5-3+1)\times(5-3+1)=3\times3$:

$$ X*K;;(\text{valid})=\begin{bmatrix} 0 & 4 & 32\ 29 & -13 & -20\ -21 & 35 & -22 \end{bmatrix}. $$

With zero padding (same)

Zero-pad $X$ by 1 pixel on all sides; output size $=5\times5$:

$$ X*K;;(\text{same})=\begin{bmatrix} 7 & 11 & -9 & 2 & -22\ 4 & 0 & 4 & 32 & 9\ -12 & 29 & -13 & -20 & 10\ -6 & -21 & 35 & -22 & -6\ 20 & -14 & 7 & 18 & -22 \end{bmatrix}. $$