Designing a 4R2W Register File: Why It’s Harder Than It Looks

A register file is a small and very fast storage block inside a processor. It holds the architectural state of the machine, meaning the operands and results that instructions use. Because it sits directly in the pipeline, it has to support multiple accesses at the same time and still meet strict timing requirements.

A 4R2W register file means it supports:

  • 4 simultaneous read ports
  • 2 simultaneous write ports

You typically see this in:

  • Dual-issue in-order cores
  • Superscalar pipelines
  • SIMD front ends
  • Some DSP designs

If each instruction needs two source operands and the processor issues two instructions every cycle, then the number of reads per cycle is:

\[ 2 \times 2 = 4 \text{ reads} \]

If both instructions write back a result in the same cycle, then you need:

\[ 2 \text{ writes per cycle} \]

That is where the 4R2W requirement comes from.

Now let us build one.


A Simple Starting Point

Assume:

  • 32 registers
  • 32-bit width
  • Synchronous write
  • Asynchronous read

A straightforward implementation looks like this:

module regfile_4r2w (
    input  wire         clk,

    input  wire [4:0]   raddr0,
    input  wire [4:0]   raddr1,
    input  wire [4:0]   raddr2,
    input  wire [4:0]   raddr3,

    output wire [31:0]  rdata0,
    output wire [31:0]  rdata1,
    output wire [31:0]  rdata2,
    output wire [31:0]  rdata3,

    input  wire         we0,
    input  wire [4:0]   waddr0,
    input  wire [31:0]  wdata0,

    input  wire         we1,
    input  wire [4:0]   waddr1,
    input  wire [31:0]  wdata1
);

reg [31:0] mem [31:0];

assign rdata0 = mem[raddr0];
assign rdata1 = mem[raddr1];
assign rdata2 = mem[raddr2];
assign rdata3 = mem[raddr3];

always @(posedge clk) begin
    if (we0)
        mem[waddr0] <= wdata0;

    if (we1)
        mem[waddr1] <= wdata1;
end

endmodule

At first glance, this looks fine.

It is not.


Write-Write Conflict

Consider this case:

  • we0 = 1
  • we1 = 1
  • waddr0 == waddr1

Both ports try to write to the same register in the same cycle.

Which value should be stored?

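In the current RTL, both nonblocking assignments execute and the later one in source order takes effect, so port 1 silently wins. That outcome is a side effect of statement order, not a documented architectural rule. It is easy to break when the code is reordered, and a tool that maps the array onto a memory primitive may not preserve it.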

You need to define what happens. One option is to give one port priority. For example, give port 1 higher priority:

always @(posedge clk) begin
    if (we0 && !(we1 && (waddr0 == waddr1)))
        mem[waddr0] <= wdata0;

    if (we1)
        mem[waddr1] <= wdata1;
end

Now if both ports target the same register, port 1 overrides port 0.

Whichever rule you choose, state it in the architecture specification and check it with assertions during verification.


Read After Write in the Same Cycle

Now consider a different problem.

  • One instruction writes to register 5 using port 0.
  • In the same cycle, another instruction reads register 5.

With asynchronous reads written as:

assign rdata0 = mem[raddr0];

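An asynchronous read reflects the current contents of mem for the whole cycle, while the write only takes effect at the clock edge that ends it. The reader therefore sees the old value.

Whether that is acceptable depends on the pipeline. If the design expects a value written back in a given cycle to be visible to an instruction reading its operands in that same cycle, the old value is architecturally wrong.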

The solution is to add bypass logic.

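wire [31:0] mem_rdata0 = mem[raddr0];

// Port 1 is checked first so the forwarded value matches the
// write-conflict rule above (port 1 overrides port 0).
assign rdata0 = (we1 && (waddr1 == raddr0)) ? wdata1 :
                (we0 && (waddr0 == raddr0)) ? wdata0 :
                mem_rdata0;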

You repeat this logic for all four read ports.

Now, if a register is written and read in the same cycle, the new value is forwarded directly to the reader.
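
The per-port repetition can be written once with a generate loop. The following is a minimal sketch rather than a reference design: the module name regfile_4r2w_bypass and the packed raddr/rdata buses are assumptions made to keep the loop short, and port 1 is given priority to match the write-conflict rule chosen earlier.

```verilog
// Sketch: 4R2W register file with same-cycle forwarding on every read port.
// Assumed (not from the original design): the module name and the packed
// read buses. Port 1 wins write conflicts, so it is checked first.
module regfile_4r2w_bypass (
    input  wire         clk,
    input  wire [19:0]  raddr,   // read port i address at raddr[5*i +: 5]
    output wire [127:0] rdata,   // read port i data at rdata[32*i +: 32]
    input  wire         we0,
    input  wire [4:0]   waddr0,
    input  wire [31:0]  wdata0,
    input  wire         we1,
    input  wire [4:0]   waddr1,
    input  wire [31:0]  wdata1
);
    reg [31:0] mem [0:31];

    genvar i;
    generate
        for (i = 0; i < 4; i = i + 1) begin : rd_port
            wire [4:0] ra = raddr[5*i +: 5];
            // Forward pending write data; otherwise read the array.
            assign rdata[32*i +: 32] =
                (we1 && (waddr1 == ra)) ? wdata1 :
                (we0 && (waddr0 == ra)) ? wdata0 :
                                          mem[ra];
        end
    endgenerate

    always @(posedge clk) begin
        if (we0) mem[waddr0] <= wdata0;
        if (we1) mem[waddr1] <= wdata1;  // later assignment wins a conflict
    end
endmodule
```

Keeping the forwarding priority and the storage priority in the same order is the point of the sketch: if they disagree, a reader can be handed a value that the array never stores.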


Area Growth with Multiple Ports

Multiport register files do not scale well.

Each read port adds:

  • A 32-to-1 multiplexer for every bit
  • Large fanout
  • A longer critical path

For 32 registers of 32 bits with 4 read ports, that is 4 × 32 = 128 separate 32-to-1 multiplexers. Each storage bit fans out to four of them, and the routing between storage and muxes becomes heavy.

On an FPGA:

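  • Block RAM generally has synchronous reads, so an asynchronous-read array like this one is unlikely to map to it.
  • The tool typically falls back to LUT-based memory, replicated per read port, or to plain flip-flops.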

On an ASIC:

  • High-performance designs often use a custom multiport register-file macro.
  • A standard-cell implementation with flip-flops works but becomes expensive in area.


Replicating Memory for Reads

Instead of building one memory with four read ports, you can replicate the storage four times.

reg [31:0] mem0 [31:0];
reg [31:0] mem1 [31:0];
reg [31:0] mem2 [31:0];
reg [31:0] mem3 [31:0];

Each read port reads from its own copy.

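Every write port must update all copies, so each copy still needs two write ports:

always @(posedge clk) begin
    if (we0) begin
        mem0[waddr0] <= wdata0;
        mem1[waddr0] <= wdata0;
        mem2[waddr0] <= wdata0;
        mem3[waddr0] <= wdata0;
    end
    // Port 1 is written last, so it wins a same-register conflict.
    if (we1) begin
        mem0[waddr1] <= wdata1;
        mem1[waddr1] <= wdata1;
        mem2[waddr1] <= wdata1;
        mem3[waddr1] <= wdata1;
    end
end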

The tradeoff is simple:

  • Area increases roughly four times.
  • Timing usually improves.
  • Read fanout is reduced.

This technique is common in FPGA designs.


Scaling Write Ports

A true two-write-port memory requires:

  • Two independent write drivers
  • Logic to resolve conflicts
  • Bitcell support if using an ASIC SRAM

In a standard-cell design, this is usually implemented using flip-flops, which increases area significantly.

One alternative is banking.


Banking the Register File

You can split the register file into two banks, for example even and odd registers.

always @(posedge clk) begin
    if (we0) begin
        if (waddr0[0] == 1'b0)
            bank0[waddr0[4:1]] <= wdata0;  // even registers
        else
            bank1[waddr0[4:1]] <= wdata0;  // odd registers
    end
end

Now two writes can occur in the same cycle only if they go to different banks. If both target the same bank, you have a structural conflict.

This approach is used in in-order cores and some SIMD designs.

The tradeoff is that either the compiler or the hardware must handle bank conflicts.
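
When the hardware is the one handling conflicts, the detection itself is small. The sketch below is one possible arrangement under stated assumptions: the module name banked_rf_2w, the single read port, and the policy that port 0 proceeds while port 1 is dropped and flagged for retry are all illustrative choices, not something the banking scheme dictates.

```verilog
// Sketch: even/odd banked storage with two write ports and one read port.
// Assumed policy (not from the original text): on a bank conflict port 0
// writes, port 1 is dropped, and bank_conflict asks the pipeline to retry.
module banked_rf_2w (
    input  wire        clk,
    input  wire        we0,
    input  wire [4:0]  waddr0,
    input  wire [31:0] wdata0,
    input  wire        we1,
    input  wire [4:0]  waddr1,
    input  wire [31:0] wdata1,
    input  wire [4:0]  raddr,
    output wire [31:0] rdata,
    output wire        bank_conflict
);
    reg [31:0] bank0 [0:15];  // even registers
    reg [31:0] bank1 [0:15];  // odd registers

    // Both writes are active and target the same bank (same address LSB).
    assign bank_conflict = we0 && we1 && (waddr0[0] == waddr1[0]);

    // Which port, if any, drives each bank this cycle.
    wire p0_b0 = we0 && !waddr0[0];
    wire p0_b1 = we0 &&  waddr0[0];
    wire p1_b0 = we1 && !waddr1[0] && !bank_conflict;
    wire p1_b1 = we1 &&  waddr1[0] && !bank_conflict;

    // Each bank performs at most one write per cycle.
    always @(posedge clk) begin
        if (p0_b0)      bank0[waddr0[4:1]] <= wdata0;
        else if (p1_b0) bank0[waddr1[4:1]] <= wdata1;

        if (p0_b1)      bank1[waddr0[4:1]] <= wdata0;
        else if (p1_b1) bank1[waddr1[4:1]] <= wdata1;
    end

    // The address LSB selects the bank on reads as well.
    assign rdata = raddr[0] ? bank1[raddr[4:1]] : bank0[raddr[4:1]];
endmodule
```

A pipeline using this would hold the port 1 instruction for a cycle whenever bank_conflict is high. A compiler-managed design would instead guarantee the flag never asserts and could turn it into an assertion.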


Critical Path in the Read Mux

Each read port effectively needs a 32-to-1 multiplexer per bit.

Instead of a flat 32-to-1 mux, you can use a hierarchical structure:

  • Divide the registers into 8 groups of 4
  • First stage: a 4-to-1 mux within each group, driven by the low two address bits
  • Second stage: an 8-to-1 mux across the groups, driven by the upper three address bits

This can improve routing locality and timing. Some synthesis tools perform similar optimizations automatically, but structuring it manually can help in ASIC flows.
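
As a rough illustration of that structure, here is one read port written as two explicit stages. The module name read_mux_2stage and the flattened regs_flat input bus are assumptions for the sketch. A synthesis tool is free to restructure it, so treat it as a statement of intent, not a guaranteed netlist.

```verilog
// Sketch: two-stage read mux for one read port (8 groups of 4 registers).
// Assumed interface: all 32 registers arrive flattened on one bus.
module read_mux_2stage (
    input  wire [1023:0] regs_flat,  // register r at regs_flat[32*r +: 32]
    input  wire [4:0]    raddr,
    output wire [31:0]   rdata
);
    wire [255:0] group_out;          // stage-1 result of group g at [32*g +: 32]

    genvar g;
    generate
        for (g = 0; g < 8; g = g + 1) begin : grp
            // Registers 4g .. 4g+3 belong to group g.
            wire [127:0] members = regs_flat[128*g +: 128];
            // Stage 1: 4-to-1 mux inside the group, selected by raddr[1:0].
            assign group_out[32*g +: 32] = members[32*raddr[1:0] +: 32];
        end
    endgenerate

    // Stage 2: 8-to-1 mux across the groups, selected by raddr[4:2].
    assign rdata = group_out[32*raddr[4:2] +: 32];
endmodule
```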


Pipelining the Read

If timing is tight, you can pipeline the read data.

// rdata0_comb is the combinational read result (array read plus any bypass).
reg [31:0] rdata0_reg;

always @(posedge clk) begin
    rdata0_reg <= rdata0_comb;
end

This introduces:

  • One cycle of read latency
  • Higher maximum clock frequency

The pipeline control logic must account for the extra cycle.
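
To make the extra cycle concrete, the sketch below registers a forwarded read. It is deliberately reduced to one read port and one write port, and the names rf_read_pipelined, rdata_comb, and rdata_q are illustrative, not taken from a particular design.

```verilog
// Sketch: one registered read port with forwarding; a single write port
// is shown for brevity. Names are illustrative assumptions.
module rf_read_pipelined (
    input  wire        clk,
    input  wire [4:0]  raddr,
    output reg  [31:0] rdata_q,   // valid one cycle after raddr is presented
    input  wire        we,
    input  wire [4:0]  waddr,
    input  wire [31:0] wdata
);
    reg [31:0] mem [0:31];

    // Combinational read with same-cycle forwarding.
    wire [31:0] rdata_comb = (we && (waddr == raddr)) ? wdata : mem[raddr];

    always @(posedge clk) begin
        if (we) mem[waddr] <= wdata;
        rdata_q <= rdata_comb;    // the added pipeline register
    end
endmodule
```

Because the forwarding mux sits in front of the register, a write and a read of the same register in one cycle still produce the new value on rdata_q one cycle later.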


Special Case for a Zero Register

In many instruction sets, register 0 always reads as zero.

You can handle this explicitly:

assign rdata0 = (raddr0 == 5'd0) ? 32'b0 : real_data0;

This avoids storing an actual value for register 0 and simplifies write handling for that register.
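
The zero register interacts with bypass logic: if a write aimed at register 0 is forwarded, a reader could briefly see a nonzero value. A hedged sketch of one read port that handles both follows. The module name read_port_x0, the mem_rdata input, and the masked enable outputs are assumptions for illustration, with port 1 keeping write priority as before.

```verilog
// Sketch: one read port with a hardwired zero register plus bypass.
// Assumed: port 1 has write priority; mem_rdata comes from the array.
module read_port_x0 (
    input  wire [4:0]  raddr,
    input  wire [31:0] mem_rdata,
    input  wire        we0,
    input  wire [4:0]  waddr0,
    input  wire [31:0] wdata0,
    input  wire        we1,
    input  wire [4:0]  waddr1,
    input  wire [31:0] wdata1,
    output wire [31:0] rdata,
    output wire        we0_masked,  // write enables with register 0 filtered out
    output wire        we1_masked
);
    assign we0_masked = we0 && (waddr0 != 5'd0);
    assign we1_masked = we1 && (waddr1 != 5'd0);

    // The zero check comes first so a write aimed at register 0 is never forwarded.
    assign rdata = (raddr == 5'd0)                   ? 32'b0  :
                   (we1_masked && (waddr1 == raddr)) ? wdata1 :
                   (we0_masked && (waddr0 == raddr)) ? wdata0 :
                                                       mem_rdata;
endmodule
```

The masked enables are meant to drive the storage array, so register 0 is never written in the first place.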


What to Test in Verification

A multi port register file can fail in subtle ways. The following cases must be tested carefully:

  • Simultaneous writes to different registers
  • Simultaneous writes to the same register
  • Read after write in the same cycle
  • Read after write in the next cycle
  • No state change when write enable is deasserted
  • Randomized stress conditions

Directed Test Example

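// Assumes a free-running clk with a 10-time-unit period, so a rising
// edge falls inside the first #10 window.
initial begin
    // Cycle 1: write 0xAAAA to register 5 through port 0.
    we0 = 1; waddr0 = 5; wdata0 = 32'hAAAA;
    we1 = 0;
    #10;

    // Cycle 2: read it back. This is the next-cycle RAW case.
    we0 = 0;
    raddr0 = 5;
    #1;
    if (rdata0 !== 32'hAAAA)
        $display("Error: RAW failed");
end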

Randomized Stress Example

repeat (1000) begin
    we0 = $random;
    we1 = $random;
    waddr0 = $random % 32;
    waddr1 = $random % 32;
    raddr0 = $random % 32;
    raddr1 = $random % 32;
    #10;
end
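
Random stimulus only finds bugs if something checks the results. One common way to do that is a reference model updated alongside the DUT. The sketch below is self-contained so it can be simulated as is: rf_dut is a compact stand-in (two read ports, no bypass, port 1 wins conflicts) and rf_random_tb, model, and errors are hypothetical names. A real bench would instantiate the actual register file and mirror its chosen conflict and bypass rules in the model.

```verilog
`timescale 1ns/1ps
// Compact stand-in DUT (assumed for this sketch): two of the four read
// ports, no bypass, and port 1 wins a same-register write conflict.
module rf_dut (
    input  wire        clk,
    input  wire [4:0]  raddr0, raddr1,
    output wire [31:0] rdata0, rdata1,
    input  wire        we0, we1,
    input  wire [4:0]  waddr0, waddr1,
    input  wire [31:0] wdata0, wdata1
);
    reg [31:0] mem [0:31];
    assign rdata0 = mem[raddr0];
    assign rdata1 = mem[raddr1];
    always @(posedge clk) begin
        if (we0) mem[waddr0] <= wdata0;
        if (we1) mem[waddr1] <= wdata1;
    end
endmodule

module rf_random_tb;
    reg         clk = 1'b0;
    reg         we0, we1;
    reg  [4:0]  raddr0, raddr1, waddr0, waddr1;
    reg  [31:0] wdata0, wdata1;
    wire [31:0] rdata0, rdata1;

    reg  [31:0] model [0:31];   // reference copy of the register contents
    integer     errors;
    integer     n;

    rf_dut dut (
        .clk(clk),
        .raddr0(raddr0), .raddr1(raddr1), .rdata0(rdata0), .rdata1(rdata1),
        .we0(we0), .we1(we1), .waddr0(waddr0), .waddr1(waddr1),
        .wdata0(wdata0), .wdata1(wdata1)
    );

    always #5 clk = ~clk;

    initial begin
        errors = 0;
        // Bring DUT and model to a known state: zero every register.
        we1 = 1'b0; waddr1 = 5'd0; wdata1 = 32'd0;
        raddr0 = 5'd0; raddr1 = 5'd0;
        for (n = 0; n < 32; n = n + 1) begin
            @(negedge clk);
            we0 = 1'b1; waddr0 = n; wdata0 = 32'd0;
            model[n] = 32'd0;
        end

        repeat (1000) begin
            @(negedge clk);     // the previous cycle's writes have landed
            we0 = $random;    we1 = $random;
            waddr0 = $random; waddr1 = $random;
            wdata0 = $random; wdata1 = $random;
            raddr0 = $random; raddr1 = $random;
            #1;                 // let the asynchronous reads settle
            if (rdata0 !== model[raddr0]) errors = errors + 1;
            if (rdata1 !== model[raddr1]) errors = errors + 1;
            // Mirror what the next rising edge will store; port 1 last so it wins.
            if (we0) model[waddr0] = wdata0;
            if (we1) model[waddr1] = wdata1;
        end

        if (errors == 0) $display("PASS");
        else             $display("FAIL: %0d mismatches", errors);
        $finish;
    end
endmodule
```

Stimulus is driven on the falling edge so that inputs are stable well before the rising edge that performs the writes. The model is updated only after the reads are checked, which matches a DUT without same-cycle forwarding.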

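If the specification forbids two writes to the same register in one cycle, a SystemVerilog immediate assertion can flag violations:

always @(posedge clk)
    assert (!(we0 && we1 && (waddr0 == waddr1)))
        else $display("Write conflict detected");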


Why 4R2W Is Difficult

The difficulty comes from several directions at once:

  • Hardware cost increases quickly as you add ports.
  • Write-write conflicts must be clearly defined.
  • Read-after-write cases require bypass logic.
  • Routing often dominates timing.
  • FPGA and ASIC implementations behave very differently.

In simple RTL, this design appears straightforward.

In silicon, it becomes one of the most dense and timing critical structures in the processor core.

At this point you have:

  • A basic implementation
  • A defined write conflict strategy
  • Bypass logic for read after write
  • Memory replication for read scaling
  • A banking approach
  • Timing improvement ideas
  • Verification strategies

From here, the design can be adapted for:

  • An FPGA prototype
  • A standard cell ASIC
  • A custom SRAM based implementation
  • A superscalar pipeline

Writing the RTL is the easy part.

Balancing correctness, timing, and area at the same time is the real challenge.