Designing a 4R2W Register File: Why It's Harder Than It Looks

October 3, 2025

hardware rtl computer-architecture register file 4R2W multiport memory verilog microarchitecture hazards bypass logic ASIC design FPGA design

A register file is one of those structures that looks trivial until you actually build one. The concept is simple: a small block of fast storage inside a processor that holds the live architectural state. Every instruction reads its operands from here and writes its result back here. Because it sits directly in the pipeline, it has to be available at the same cycle the rest of the datapath demands it, which means no cache miss tolerance, no pipeline stall on access, and no ambiguity about what value a port returns in the same cycle it is being written.

A 4R2W register file supports four simultaneous read ports and two simultaneous write ports. That requirement is not arbitrary. In a dual-issue in-order core, two instructions are fetched and decoded together. If each instruction reads two source operands, that is four reads per cycle. If both instructions write back results in the same cycle, that is two writes per cycle. The 4R2W configuration is the minimum that keeps the machine from stalling on register access alone.

The same configuration appears in SIMD front ends, some DSP pipelines, and any architecture where the decode stage needs to supply multiple functional units with independent operands simultaneously. The number of ports is directly tied to instruction-level parallelism: more issue width means more ports, and more ports means more design problems.

This post works through those problems concretely.

A Starting Implementation

Assume 32 registers, 32-bit width, synchronous write, and asynchronous read. Asynchronous read means the read data is combinationally derived from the current register contents, not clocked. This is the standard choice for register files used in the pipeline decode stage, where the read happens in the same cycle as decode and the result must be available before the clock edge that latches EX stage inputs.

module regfile_4r2w (
    input  wire         clk,

    input  wire [4:0]   raddr0,
    input  wire [4:0]   raddr1,
    input  wire [4:0]   raddr2,
    input  wire [4:0]   raddr3,

    output wire [31:0]  rdata0,
    output wire [31:0]  rdata1,
    output wire [31:0]  rdata2,
    output wire [31:0]  rdata3,

    input  wire         we0,
    input  wire [4:0]   waddr0,
    input  wire [31:0]  wdata0,

    input  wire         we1,
    input  wire [4:0]   waddr1,
    input  wire [31:0]  wdata1
);

reg [31:0] mem [31:0];

assign rdata0 = mem[raddr0];
assign rdata1 = mem[raddr1];
assign rdata2 = mem[raddr2];
assign rdata3 = mem[raddr3];

always @(posedge clk) begin
    if (we0) mem[waddr0] <= wdata0;
    if (we1) mem[waddr1] <= wdata1;
end

endmodule

This synthesizes and simulates. It also has at least three concrete failure modes that will cause incorrect behavior in a real pipeline.

Write-Write Conflict

The first problem is what happens when both write ports target the same register in the same cycle. Both we0 and we1 are asserted, and waddr0 == waddr1.

In the RTL above, the always block evaluates both assignments. Verilog does define the outcome: nonblocking assignments to the same variable from one always block take effect in statement order, so the last one wins and port 1 silently overrides port 0. The problem is that this priority is an accident of statement order rather than a stated architectural rule. Reordering the two lines reverses it, splitting them into separate always blocks turns it into a genuine race, and a tool that maps the array onto a memory primitive may not preserve it.

The architecture has to define what the correct behavior is, and the RTL has to enforce it. The most common choice is to give one port higher priority and document it explicitly.

always @(posedge clk) begin
    if (we0 && !(we1 && (waddr0 == waddr1)))
        mem[waddr0] <= wdata0;

    if (we1)
        mem[waddr1] <= wdata1;
end

Now if both ports write the same register in the same cycle, write port 1 wins. Write port 0 is suppressed. This behavior has to match whatever the pipeline expects. If the microarchitecture guarantees that two instructions writing the same register in the same cycle cannot happen (because of a WAW hazard check in dispatch), then the conflict logic is a safety net that should never trigger. If the pipeline allows it and depends on defined priority, then this logic is the mechanism that produces correct behavior.

Either way, the architecture document and the RTL must agree, and the verification environment must exercise this case deliberately. A testbench that never generates waddr0 == waddr1 with both enables asserted will not find the bug.

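Adding an assertion makes the assumption explicit and checkable. The immediate assert used here is a SystemVerilog construct, so the file has to be compiled in that mode: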

// Architectural guarantee: simultaneous writes to the same register
// are not permitted. This assertion catches violations in simulation.
always @(posedge clk) begin
    if (we0 && we1)
        assert (waddr0 != waddr1)
            else $error("WAW conflict: waddr0=%0d waddr1=%0d", waddr0, waddr1);
end

If the pipeline guarantees no WAW, this assertion will never fire and the priority logic becomes dead code. If the assertion fires in simulation, you have found a dispatch bug, not a register file bug.

Read-After-Write in the Same Cycle

The second problem is more subtle and has a correct answer that must be implemented rather than just documented.

Consider: write port 0 is writing register 5 with wdata0 = 0xDEADBEEF, and simultaneously read port 2 is reading register 5. What should rdata2 return?

With the current asynchronous read implementation, rdata2 = mem[raddr2], the read sees whatever is currently stored in the array. The write happens at the clock edge. The read is combinational and happens before the clock edge. So the reader sees the old value, not the value being written this cycle.

Whether this is correct depends entirely on the pipeline. If the architecture specifies that a register read always returns the value written by a prior cycle’s write, and that same-cycle forwarding is handled by a separate bypass network, then returning the old value is correct and the register file is not responsible for forwarding. This is the typical arrangement in a processor with explicit forwarding logic in the EX stage.

If, however, the architecture specifies that the register file itself should return the new value when a write and a read to the same address happen in the same cycle, then bypass logic must be added inside the register file.

wire [31:0] mem_rdata0 = mem[raddr0];
wire [31:0] mem_rdata1 = mem[raddr1];
wire [31:0] mem_rdata2 = mem[raddr2];
wire [31:0] mem_rdata3 = mem[raddr3];

// Write port 1 takes priority over write port 0 on the same address
assign rdata0 = (we1 && (waddr1 == raddr0)) ? wdata1 :
                (we0 && (waddr0 == raddr0)) ? wdata0 :
                mem_rdata0;

assign rdata1 = (we1 && (waddr1 == raddr1)) ? wdata1 :
                (we0 && (waddr0 == raddr1)) ? wdata0 :
                mem_rdata1;

assign rdata2 = (we1 && (waddr1 == raddr2)) ? wdata1 :
                (we0 && (waddr0 == raddr2)) ? wdata0 :
                mem_rdata2;

assign rdata3 = (we1 && (waddr1 == raddr3)) ? wdata1 :
                (we0 && (waddr0 == raddr3)) ? wdata0 :
                mem_rdata3;

The bypass priority here matches the write priority: write port 1 overrides write port 0 when both target the same address. If write port 0 is writing an address that read port 2 is reading, and write port 1 is writing a different address, then the read port returns wdata0. If both write ports target the same read address, the read port returns wdata1. The logic is a chain of muxes, and the ordering must be consistent with the write conflict resolution policy.

This adds combinational logic between the memory read and the output. Each read port’s data path now passes through two 2-to-1 mux stages after the memory read, and the selects for those muxes come from two 5-bit address comparators that evaluate in parallel with the array access. For timing-critical designs, this bypass network can push the read path close to the clock period limit, which is one argument for keeping the bypass network in the execute stage rather than inside the register file.

Area Scaling with Port Count

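A single-port SRAM has one read/write data bus, one address bus, and one set of sense amplifiers. Adding ports means adding independent address decoders, independent bitline drivers, and independent sense amplifiers. Inside a custom SRAM bitcell, each port adds a wordline in one direction and one or two bitlines in the other, so the cell grows in both dimensions and its area rises roughly with the square of the port count rather than linearly.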

For a standard cell implementation using flip-flops, every read port adds a full 32-to-1 multiplexer tree per bit. A 32-register, 32-bit file has 1024 bits of state. Four read ports means four 32-to-1 mux trees, one per port, each selecting among 32 32-bit values. That is a significant amount of combinational logic, and it all adds fanout load to every flip-flop output since every register’s output must be routable to every read mux.

The fanout problem is concrete: each of the 1024 flip-flops drives up to four mux trees, one per read port. If those mux trees are implemented naively as flat 32-to-1 muxes, each flip-flop is driving four mux inputs simultaneously. Fanout-of-four on a flip-flop is manageable, but the routing congestion in the area where all those wires converge is not. This is why multi-port register files in high-performance processors are usually custom arrays rather than synthesized standard cells.

On an FPGA, the situation is different in a specific way. Block RAMs on most FPGAs offer at most two independent ports (two read-write ports in true dual-port mode, or one read and one write in simple dual-port mode), and their reads are clocked. A 4R2W register file with asynchronous read cannot be mapped directly to block RAM without additional structure. The synthesis tool will typically build it from flip-flops and LUT multiplexers, or from distributed RAM where the write structure allows it, which handles the read ports but at significant LUT cost. The designer often ends up choosing between block RAM with limited port count and LUT-based storage with higher resource usage.

Memory Replication

One standard solution to the port count problem is to replicate the storage. Instead of building one memory with four read ports, build four copies of the same memory, each with one read port.

reg [31:0] mem0 [31:0];
reg [31:0] mem1 [31:0];
reg [31:0] mem2 [31:0];
reg [31:0] mem3 [31:0];

assign rdata0 = mem0[raddr0];
assign rdata1 = mem1[raddr1];
assign rdata2 = mem2[raddr2];
assign rdata3 = mem3[raddr3];

always @(posedge clk) begin
    if (we0) begin
        mem0[waddr0] <= wdata0;
        mem1[waddr0] <= wdata0;
        mem2[waddr0] <= wdata0;
        mem3[waddr0] <= wdata0;
    end
    if (we1) begin
        mem0[waddr1] <= wdata1;
        mem1[waddr1] <= wdata1;
        mem2[waddr1] <= wdata1;
        mem3[waddr1] <= wdata1;
    end
end

Each copy has one read port and handles only the reads for that port. Writes broadcast to all four copies simultaneously so they stay consistent.

The tradeoff is straightforward: area increases by roughly a factor of four because the storage itself is replicated. In exchange, each read port sees only a simple 32-to-1 mux over its own copy, with no shared fanout. The write broadcast is two write enables fanning out to four memories, which is much easier to route than four read muxes sharing 1024 fanout nodes.

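On FPGAs, this approach suits distributed RAM (LUTRAM) because each copy maps independently to its own slice resources. Block RAM in simple dual-port mode can hold the copies too, provided the pipeline accepts a clocked read. In both cases the primitive has a single write port, so replication as written covers the four read ports but each copy still sees two write ports; the second write port needs separate handling, such as the banking described in the next section. Replication is the technique most FPGA-based processor implementations use when they need more than two read ports.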

The write conflict problem does not change with replication – all four copies must resolve conflicts the same way, and the write enable suppression logic from the previous section applies equally to all four write paths.
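
As a rough sketch of how replication and the conflict policy fit together, the copies can be generated in a loop with the suppression computed once and shared. The module name regfile_4r2w_repl and the flattened raddr_flat and rdata_flat buses are assumptions made to keep the example short; the port 1 priority follows the policy chosen earlier in this post.

```verilog
// Sketch: four replicated copies, each with one read port and both
// write ports. Port 1 wins a same-address write conflict.
// Module name and flattened read buses are illustrative assumptions.
module regfile_4r2w_repl (
    input  wire         clk,
    input  wire [19:0]  raddr_flat,  // read port i uses bits [5*i +: 5]
    output wire [127:0] rdata_flat,  // read port i drives bits [32*i +: 32]
    input  wire         we0,
    input  wire [4:0]   waddr0,
    input  wire [31:0]  wdata0,
    input  wire         we1,
    input  wire [4:0]   waddr1,
    input  wire [31:0]  wdata1
);

// Resolve the conflict once, then broadcast the same enables to every copy
wire we0_eff = we0 && !(we1 && (waddr0 == waddr1));

genvar i;
generate
    for (i = 0; i < 4; i = i + 1) begin : copy
        reg [31:0] mem [31:0];

        // Every copy receives identical writes, so the copies stay in sync
        always @(posedge clk) begin
            if (we0_eff) mem[waddr0] <= wdata0;
            if (we1)     mem[waddr1] <= wdata1;
        end

        // Each copy serves exactly one read port
        assign rdata_flat[32*i +: 32] = mem[raddr_flat[5*i +: 5]];
    end
endgenerate

endmodule
```

Computing we0_eff outside the loop is the point of the sketch: if each copy resolved the conflict on its own, a later edit could make the copies disagree.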

Banking for Write Ports

Two write ports require two independent write paths into the storage. In a custom SRAM, this means two sets of write drivers and bitline circuitry per cell. In a standard cell design, it means the update logic has to handle two independent write addresses without interfering. Banking is an alternative that avoids truly multi-write-port cells by partitioning the register space.

The simplest partition is even/odd: registers with even indices go to bank 0, registers with odd indices go to bank 1.

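reg [31:0] bank0 [15:0];  // Registers 0, 2, 4, ..., 30
reg [31:0] bank1 [15:0];  // Registers 1, 3, 5, ..., 31

// Steer each write port to the bank selected by address bit 0
wire p0_to_b0 = we0 && !waddr0[0];
wire p1_to_b0 = we1 && !waddr1[0];
wire p0_to_b1 = we0 &&  waddr0[0];
wire p1_to_b1 = we1 &&  waddr1[0];

// Each bank has a single write port. Same-bank collisions must be
// prevented upstream; if one slips through, port 1 wins and the
// port 0 write is dropped.
always @(posedge clk) begin
    if (p1_to_b0)      bank0[waddr1[4:1]] <= wdata1;
    else if (p0_to_b0) bank0[waddr0[4:1]] <= wdata0;

    if (p1_to_b1)      bank1[waddr1[4:1]] <= wdata1;
    else if (p0_to_b1) bank1[waddr0[4:1]] <= wdata0;
end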

Two writes targeting different banks proceed without conflict. Two writes targeting the same bank in the same cycle are a structural hazard, and the machine must either stall one of the writes or guarantee the compiler or dispatch logic never generates this case.

In practice, an in-order dual-issue machine can often constrain instruction pairing rules to avoid same-bank conflicts at the cost of some parallelism. The compiler allocates registers to instructions knowing that two instructions scheduled together must not write to the same bank. This is a common design point in embedded DSP cores where the register file banking structure is exposed architecturally and the ABI reflects it.

The read side with banking is straightforward: each read port computes which bank and which index to access based on the register address, and two independent reads can proceed in parallel from different banks. If both reads target the same bank, the bank needs two read ports, which pushes the problem back to the starting point. This is why banking is most effective on write-limited designs where read port count is already handled by replication.
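
The read side can be sketched as follows, assuming the same even/odd split. The module name regfile_banked_read is hypothetical, and a single write port and a single read port are shown to keep the example small. Both banks are read at the shared index and address bit 0 picks the result.

```verilog
// Sketch: even/odd banked storage with one read port shown.
// Module and signal names are illustrative assumptions.
module regfile_banked_read (
    input  wire        clk,
    input  wire        we,      // single write port, for brevity
    input  wire [4:0]  waddr,
    input  wire [31:0] wdata,
    input  wire [4:0]  raddr,
    output wire [31:0] rdata
);

reg [31:0] bank0 [15:0];  // even registers
reg [31:0] bank1 [15:0];  // odd registers

// Address bit 0 selects the bank, bits [4:1] select the entry
always @(posedge clk) begin
    if (we && !waddr[0]) bank0[waddr[4:1]] <= wdata;
    if (we &&  waddr[0]) bank1[waddr[4:1]] <= wdata;
end

// Both banks are read at the same index; bit 0 selects the bank
wire [31:0] b0_q = bank0[raddr[4:1]];
wire [31:0] b1_q = bank1[raddr[4:1]];
assign rdata = raddr[0] ? b1_q : b0_q;

endmodule
```

Written this way, every read port consumes one read port on each bank, so four read ports still need four reads per bank. Banking alone does nothing for the read port count.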

Critical Path in the Read Mux

A flat 32-to-1 mux per bit is the straightforward implementation. For 5-bit address selection, a flat mux requires a 5-level deep tree of 2-to-1 muxes. With typical standard cell timing, this can be 0.4 to 0.6 ns on a 28nm process depending on drive strength and fanout, which may be a significant fraction of the target clock period.

A hierarchical mux structure computes the same function but organizes it for better physical locality, and it lets the tool use wider mux cells where the library has them. One common approach is to divide the 32 registers into 8 groups of 4, build 4-to-1 muxes within each group, and then build a final 8-to-1 mux across the 8 group outputs.

Stage 1: 8 groups, each with a 4-to-1 mux (2 address bits select within group)
Stage 2: 8-to-1 mux (3 address bits select among groups)

The total depth is a 4-to-1 stage followed by an 8-to-1 stage. Built from 2-to-1 muxes, that is two levels plus three, the same five levels as the flat tree, so the logical depth only shrinks if the library offers 4-to-1 mux cells. The gain is mainly in routing locality, because the inputs of each 4-to-1 mux are physically close to each other within their group. Modern synthesis tools perform similar transformations automatically, but manually structuring the mux tree gives the tool a starting point that avoids pathological cases.
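
The two stages might be written out as below for a single read port. This is a sketch under stated assumptions: the module name read_mux_2stage is hypothetical, and the register contents arrive on a flattened regs_flat bus with register i in bits [32*i +: 32], which is a packing chosen for the example rather than anything the design requires.

```verilog
// Sketch: two-stage read mux for one read port.
// regs_flat packs register i into bits [32*i +: 32]; this packing and
// the module name are assumptions made for the example.
module read_mux_2stage (
    input  wire [1023:0] regs_flat,
    input  wire [4:0]    raddr,
    output wire [31:0]   rdata
);

// Stage 1: eight 4-to-1 muxes, all selected by raddr[1:0]
wire [31:0] group_out [7:0];

genvar g;
generate
    for (g = 0; g < 8; g = g + 1) begin : grp
        // Group g holds registers 4*g .. 4*g+3
        assign group_out[g] = regs_flat[32*(4*g + raddr[1:0]) +: 32];
    end
endgenerate

// Stage 2: one 8-to-1 mux across the group outputs, selected by raddr[4:2]
assign rdata = group_out[raddr[4:2]];

endmodule
```

A synthesis tool is free to flatten and rebuild this structure, so the hierarchy is a hint rather than a guarantee. Keeping each group in its own generate scope at least makes it easy to apply placement or hierarchy-preservation constraints per group if the flow supports them.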

The bypass mux from the same-cycle write also adds depth to this path. If the bypass is implemented as two additional 2-to-1 muxes at the output of the read mux tree, the total depth increases to seven mux levels. For a 500 MHz target clock, the whole cycle is 2 ns, and the read path including bypass logic has to share it with whatever logic produces the read addresses and with the setup margin of the receiving flops, which is tight for a heavily loaded multi-port structure in a standard cell flow.

Pipelining the Read

If the read path cannot close timing combinationally, the standard fix is to register the read output and add one cycle of read latency.

reg [31:0] rdata0_reg;
reg [31:0] rdata1_reg;
reg [31:0] rdata2_reg;
reg [31:0] rdata3_reg;

always @(posedge clk) begin
    rdata0_reg <= rdata0_comb;
    rdata1_reg <= rdata1_comb;
    rdata2_reg <= rdata2_comb;
    rdata3_reg <= rdata3_comb;
end

The combinational read is now registered, and the read data arrives one cycle after the address is presented. This is how synchronous SRAMs work. The pipeline must account for this by either stalling decode for one cycle while the register file read completes, or by issuing the register read one cycle earlier than the instruction arrives at decode.

In a standard five-stage pipeline, the register file read happens in the ID stage. If the read takes one clock cycle, the operands are not available until the cycle after ID, which is the EX stage. The pipeline must then forward from EX stage outputs or stall on any instruction that needs its operands at EX entry. This is a structural decision that changes how the hazard detection and forwarding network is organized, not just the register file.

The bypass logic for same-cycle writes is more complex in the pipelined case. A write that commits on the same clock edge that captures the read data is not reflected in the registered output, so the write enable, address, and data from that cycle must be captured alongside the read address and compared in the following cycle, when the read data is consumed.
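
A minimal sketch of that capture-and-compare arrangement follows, reduced to one write port and one read port. The module name regfile_sync_read and the _q suffix for captured signals are assumptions for illustration; a full 4R2W version would repeat the compare per read port and per write port with the same priority as the write conflict policy.

```verilog
// Sketch: one registered read port with a one-write-port bypass.
// Names and the single write port are assumptions for brevity.
module regfile_sync_read (
    input  wire        clk,
    input  wire        we,
    input  wire [4:0]  waddr,
    input  wire [31:0] wdata,
    input  wire [4:0]  raddr,
    output wire [31:0] rdata   // valid one cycle after raddr is presented
);

reg [31:0] mem [31:0];
reg [31:0] rdata_reg;

// Write and read addresses captured from the cycle the read was issued
reg        we_q;
reg [4:0]  waddr_q, raddr_q;
reg [31:0] wdata_q;

always @(posedge clk) begin
    if (we) mem[waddr] <= wdata;
    rdata_reg <= mem[raddr];   // old contents if written on this edge
    we_q      <= we;
    waddr_q   <= waddr;
    raddr_q   <= raddr;
    wdata_q   <= wdata;
end

// If the issue-cycle write hit the same register, return its data instead
assign rdata = (we_q && (waddr_q == raddr_q)) ? wdata_q : rdata_reg;

endmodule
```

This covers only a write issued in the same cycle as the read. A write issued one cycle later, while the registered data is being consumed, is a separate case that the pipeline’s forwarding network has to handle.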

Register Zero

RV32I and MIPS both define register 0 as architecturally hardwired to zero: any write to register 0 is discarded, and any read from register 0 returns zero regardless of what was written.

The register file must enforce this. The cleanest implementation suppresses writes to address 0 in the write enable logic and overrides the read output when the read address is 0.

// Suppress writes to x0
wire we0_effective = we0 && (waddr0 != 5'd0);
wire we1_effective = we1 && (waddr1 != 5'd0);

// Override reads from x0
assign rdata0 = (raddr0 == 5'd0) ? 32'd0 : rdata0_internal;
assign rdata1 = (raddr1 == 5'd0) ? 32'd0 : rdata1_internal;
assign rdata2 = (raddr2 == 5'd0) ? 32'd0 : rdata2_internal;
assign rdata3 = (raddr3 == 5'd0) ? 32'd0 : rdata3_internal;

This also simplifies the bypass logic: a read from x0 never needs to check the write ports because the output is unconditionally zero. A write to x0 from either port is suppressed before it reaches the memory, so the physical register 0 can hold any value without affecting correctness.

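In a flip-flop implementation, the synthesis tool will typically remove the storage for register 0 entirely once both pieces are in place: the write suppression means nothing is ever stored there, and the read override means nothing ever observes it. In a RAM-based implementation the location still exists but is never used.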

Verification Requirements

A multi-port register file has enough corner cases that informal testing consistently misses real bugs. Simulation coverage of the following cases must be deliberate, not incidental.

Simultaneous writes to different registers: this is the normal case and must work correctly. Both write ports commit in the same cycle.

Simultaneous writes to the same register: both enables asserted, both addresses identical. The defined priority must apply and the correct value must appear on the next read.

Read-after-write in the same cycle: write port committing to address A while read port reads address A. If bypass is implemented, the new value must appear. If bypass is not implemented, the old value must appear and the pipeline must not depend on seeing the new value.

Read-after-write across cycles: write in cycle N, read in cycle N+1. The stored value must reflect the write.

Write with enable deasserted: we0 = 0, waddr0 = anything, wdata0 = garbage. No storage should change.

x0 invariance: write to register 0 from either port, then read register 0. Result must always be zero.

Writes on both ports to different registers in the same cycle, with reads from both written registers in that cycle: this exercises the bypass logic for both write ports simultaneously.

A randomized testbench that does not explicitly bias toward these cases will produce low coverage of the conflict scenarios. The safe approach is a directed test for each case followed by a constrained random phase that can hit them again in context.

// Directed: write-write conflict, read from conflicting register.
// Stimulus changes on the falling edge so it never races the DUT's
// rising-edge sampling.
initial begin
    @(negedge clk);
    we0 = 1; waddr0 = 5'd7; wdata0 = 32'hAAAA_AAAA;
    we1 = 1; waddr1 = 5'd7; wdata1 = 32'hBBBB_BBBB;
    raddr0 = 5'd7;
    #1;  // let the combinational read path settle
    // With same-cycle bypass and write port 1 priority:
    assert (rdata0 == 32'hBBBB_BBBB)
        else $error("Write conflict bypass wrong: got %h", rdata0);

    // The rising edge commits the write; only then drop the enables
    @(posedge clk);
    @(negedge clk);
    we0 = 0; we1 = 0;
    #1;
    // Check storage with no bypass active
    assert (rdata0 == 32'hBBBB_BBBB)
        else $error("Post-conflict storage wrong: got %h", rdata0);
end
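
The constrained random phase could look something like the sketch below. To keep it self-contained, a cut-down DUT with one read port and no bypass stands in for the full register file; regfile_1r2w_demo, tb_regfile_random, and the four-register address window are all assumptions made for the example. The testbench keeps its own copy of the register state and applies port 0 before port 1, which encodes the port 1 priority rule.

```verilog
// Cut-down DUT standing in for the full register file: one read port,
// two write ports, port 1 wins a same-address conflict, no bypass.
module regfile_1r2w_demo (
    input  wire        clk,
    input  wire [4:0]  raddr0,
    output wire [31:0] rdata0,
    input  wire        we0,
    input  wire [4:0]  waddr0,
    input  wire [31:0] wdata0,
    input  wire        we1,
    input  wire [4:0]  waddr1,
    input  wire [31:0] wdata1
);
    reg [31:0] mem [31:0];
    assign rdata0 = mem[raddr0];
    always @(posedge clk) begin
        if (we0 && !(we1 && (waddr0 == waddr1))) mem[waddr0] <= wdata0;
        if (we1)                                 mem[waddr1] <= wdata1;
    end
endmodule

// Biased random stimulus checked against a reference model.
module tb_regfile_random;
    reg         clk = 1'b0;
    reg         we0 = 1'b0, we1 = 1'b0;
    reg  [4:0]  waddr0 = 5'd0, waddr1 = 5'd0, raddr0 = 5'd0;
    reg  [31:0] wdata0 = 32'd0, wdata1 = 32'd0;
    wire [31:0] rdata0;
    reg  [31:0] model [31:0];  // reference copy of the register state
    reg  [31:0] known;         // bit i set once register i has been written
    integer     n, errors;

    regfile_1r2w_demo dut (
        .clk(clk), .raddr0(raddr0), .rdata0(rdata0),
        .we0(we0), .waddr0(waddr0), .wdata0(wdata0),
        .we1(we1), .waddr1(waddr1), .wdata1(wdata1)
    );

    always #5 clk = ~clk;

    initial begin
        errors = 0;
        known  = 32'd0;
        for (n = 0; n < 2000; n = n + 1) begin
            @(negedge clk);
            we0    = $random;
            we1    = $random;
            wdata0 = $random;
            wdata1 = $random;
            // Addresses drawn from registers 0..3 so that write-write and
            // read-write collisions occur often rather than rarely
            waddr0 = $random & 3;
            waddr1 = $random & 3;
            raddr0 = $random & 3;
            #1;
            // No bypass: the read must return the value from before this edge
            if (known[raddr0] && rdata0 !== model[raddr0]) begin
                errors = errors + 1;
                $display("mismatch at r%0d: got %h, expected %h",
                         raddr0, rdata0, model[raddr0]);
            end
            @(posedge clk);
            // Apply port 0 first and port 1 second so port 1 wins a conflict
            if (we0) begin model[waddr0] = wdata0; known[waddr0] = 1'b1; end
            if (we1) begin model[waddr1] = wdata1; known[waddr1] = 1'b1; end
        end
        $display("random phase finished with %0d errors", errors);
        $finish;
    end
endmodule
```

Narrowing the address window is the bias: with all 32 registers in play, two random write addresses collide only about 3% of the time, while a four-register window raises that to 25%.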

Formal verification of the bypass and conflict logic is also practical at this scale. The register file is small enough that bounded model checking over 5 to 10 cycles can exhaustively cover all combinations of write addresses and read addresses.

FPGA vs ASIC Implementation Gap

The same RTL behaves very differently depending on the target technology, and this difference matters before tapeout.

On an FPGA with distributed RAM (LUTRAM), the register file synthesizes using lookup tables as memory cells. Xilinx 7-series and UltraScale devices support synchronous write, asynchronous read in LUTRAM mode, which matches the read timing the register file needs. The primitives have a single write port, though, so the tool infers LUTRAM only when each memory it sees is written from one port and the RTL otherwise matches a pattern it recognizes. Two write ports into one array, unusual write conflict logic, or bypass muxes that touch the memory outputs can break inference and cause the tool to fall back to flip-flops, which raises LUT and register usage considerably.

On an FPGA with block RAM, the port count mismatch is the main issue. Block RAMs offer two independent ports, typically configured as one read and one write (simple dual-port) or as two read-write ports (true dual-port), and their reads are synchronous. Neither configuration provides four independent read ports. If a 4R2W register file must map to block RAM, the memory replication technique from earlier is necessary: four block RAM instances each configured in simple dual-port mode (one write port and one read port per instance) cover four read ports. That arrangement provides only one write port, so the second write port still has to be handled separately, for example by banking or by time-multiplexing the two writes, and the pipeline has to absorb the one-cycle read latency.

On an ASIC, a synthesized standard cell implementation of a 4R2W register file is area-expensive because flip-flops cost significantly more per bit than SRAM bitcells. A 32-register, 32-bit file has 1024 bits of state. In a 28nm process, a flip-flop costs approximately 6-8 standard cell area units while an SRAM bitcell costs approximately 0.5-1.0. The ratio is therefore somewhere between 6-to-1 and 16-to-1 in favor of SRAM per bit. For a real processor, the register file is one of the most frequently accessed structures and is typically a custom SRAM macro with a multi-port bitcell that supports the required read and write port count directly.

Custom SRAM generation for multi-port register files is a specialized flow. Commercial memory compilers from vendors support two-port SRAM cells. Four-port cells are available but require either a custom bitcell design or a known-good memory architecture that the foundry qualifies. This is one reason why many commercial processors use the banking and replication strategies at the architecture level even when custom SRAMs are available: it allows each bank to use a simpler, well-characterized two-port cell rather than requiring a wide multi-port cell.

Summary

The 4R2W register file is a case where the RTL is not the hard part. The basic module with four read assigns and two clocked writes takes ten minutes to write and synthesizes without errors. The hard parts are the failure modes that only appear under specific conditions.

Write-write conflicts require a defined priority policy that the RTL enforces and the verification environment exercises. The policy must match the pipeline’s architectural assumptions.

Read-after-write in the same cycle requires either a bypass network inside the register file or a documented guarantee that the pipeline will not rely on same-cycle register file forwarding. If bypass is implemented, it must handle both write ports with consistent priority.

Area scaling with port count is nonlinear. Memory replication solves the read port problem by trading area for routing simplicity, and banking addresses the write port problem by constraining which registers can be written simultaneously.

The critical path through the read mux and bypass chain is frequently the timing bottleneck in the decode stage. Hierarchical mux structures and, as a last resort, pipelined reads are the options when the combinational path is too long.

FPGA and ASIC implementations are not interchangeable at the register file level. An RTL that correctly infers LUTRAM on an FPGA may map to flip-flops in an ASIC standard cell flow, changing area by an order of magnitude. Custom SRAM macros are the standard solution for silicon, and the register file architecture should be designed with the available memory cell port count in mind.

The structure is small, sits at the heart of every pipeline, and is accessed every cycle at full bandwidth. Getting it wrong produces data corruption that is difficult to localize because the symptom appears far downstream from the register file access that caused it.