
Designing a 4R2W Register File: Why It’s Harder Than It Looks
A register file is a small and very fast storage block inside a processor. It holds the architectural state of the machine, meaning the operands and results that instructions use. Because it sits directly in the pipeline, it has to support multiple accesses at the same time and still meet strict timing requirements.
A 4R2W register file supports:
- 4 simultaneous read ports
- 2 simultaneous write ports
You typically see this in:
- Dual-issue in-order cores
- Superscalar pipelines
- SIMD front ends
- Some DSP designs
If each instruction needs two source operands and the processor issues two instructions every cycle, then the number of reads per cycle is:
\[ 2 \times 2 = 4 \text{ reads} \]
If both instructions write back a result in the same cycle, then you need:
\[ 2 \text{ writes per cycle} \]
That is where the 4R2W requirement comes from.
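The same arithmetic generalizes. As a sketch, using symbols that are assumptions introduced here (issue width \(W\), up to \(s\) source operands and \(d\) destination registers per instruction), the worst-case port counts are:
\[ R = W \times s, \qquad P_w = W \times d \]
For the dual-issue case above, \(W = 2\), \(s = 2\), and \(d = 1\) give \(R = 4\) read ports and \(P_w = 2\) write ports.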
Now let us build one.
A Simple Starting Point
Assume:
- 32 registers
- 32-bit width
- Synchronous write
- Asynchronous read
A straightforward implementation looks like this:
module regfile_4r2w (
    input  wire        clk,
    // Four read ports
    input  wire [4:0]  raddr0,
    input  wire [4:0]  raddr1,
    input  wire [4:0]  raddr2,
    input  wire [4:0]  raddr3,
    output wire [31:0] rdata0,
    output wire [31:0] rdata1,
    output wire [31:0] rdata2,
    output wire [31:0] rdata3,
    // Two write ports
    input  wire        we0,
    input  wire [4:0]  waddr0,
    input  wire [31:0] wdata0,
    input  wire        we1,
    input  wire [4:0]  waddr1,
    input  wire [31:0] wdata1
);
    // 32 registers of 32 bits each
    reg [31:0] mem [31:0];

    // Asynchronous (combinational) reads
    assign rdata0 = mem[raddr0];
    assign rdata1 = mem[raddr1];
    assign rdata2 = mem[raddr2];
    assign rdata3 = mem[raddr3];

    // Synchronous writes
    always @(posedge clk) begin
        if (we0)
            mem[waddr0] <= wdata0;
        if (we1)
            mem[waddr1] <= wdata1;
    end
endmodule
At first glance, this looks fine.
It is not.
Write-Write Conflict
Consider this case:
we0 = 1
we1 = 1
waddr0 == waddr1
Both ports try to write to the same register in the same cycle.
Which value should be stored?
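In the current RTL, both nonblocking assignments are scheduled and the later one in the always block takes effect, so port 1 wins in simulation. That outcome is a side effect of statement order rather than a stated rule, and it may not survive memory inference or a move to a hard macro. There is no clearly defined architectural rule.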
You need to define what happens. One option is to give one port priority. For example, give port 1 higher priority:
always @(posedge clk) begin
    // Port 0 writes only if port 1 is not writing the same register.
    if (we0 && !(we1 && (waddr0 == waddr1)))
        mem[waddr0] <= wdata0;
    if (we1)
        mem[waddr1] <= wdata1;
end
Now if both ports target the same register, port 1 overrides port 0.
Whichever rule you choose, state it clearly in the architecture specification and add assertions during verification so that the RTL, the specification, and the testbench agree.
Read After Write in the Same Cycle
Now consider a different problem.
- One instruction writes to register 5 using port 0.
- In the same cycle, another instruction reads register 5.
With asynchronous reads written as:
assign rdata0 = mem[raddr0];
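The combinational read returns whatever the array currently holds, and the write only takes effect at the clock edge. During the write cycle, the reader therefore sees the old value.
Whether that is acceptable depends on the pipeline. If an instruction that reads its operands in the same cycle as write back is expected to see the new value, a stale read is a functional bug.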
The solution is to add bypass logic.
wire [31:0] mem_rdata0 = mem[raddr0];
// Check port 1 first so the forwarded value matches the write priority.
assign rdata0 = (we1 && (waddr1 == raddr0)) ? wdata1 :
                (we0 && (waddr0 == raddr0)) ? wdata0 :
                mem_rdata0;
You repeat this logic for all four read ports.
Now, if a register is written and read in the same cycle, the new value is forwarded directly to the reader.
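To make "repeat this logic for all four read ports" concrete, here is a minimal sketch that generates the bypass once per port. It assumes the port 1 priority rule for same-address writes described earlier; the module name `regfile_4r2w_bypass`, the internal `raddr`/`rdata` arrays, and the `g_bypass` label are illustrative names, not part of the original design.

```verilog
// Sketch: 4R2W register file with a write-to-read bypass on every read port.
// Assumes port 1 wins when both write ports target the same register.
module regfile_4r2w_bypass (
    input  wire        clk,
    input  wire [4:0]  raddr0, raddr1, raddr2, raddr3,
    output wire [31:0] rdata0, rdata1, rdata2, rdata3,
    input  wire        we0, we1,
    input  wire [4:0]  waddr0, waddr1,
    input  wire [31:0] wdata0, wdata1
);
    reg [31:0] mem [31:0];

    // Collect the read ports so one generate loop can serve all of them.
    wire [4:0]  raddr [0:3];
    wire [31:0] rdata [0:3];
    assign raddr[0] = raddr0;
    assign raddr[1] = raddr1;
    assign raddr[2] = raddr2;
    assign raddr[3] = raddr3;

    genvar i;
    generate
        for (i = 0; i < 4; i = i + 1) begin : g_bypass
            // Port 1 is checked first so the forwarded value is the one
            // the array will hold after the clock edge.
            assign rdata[i] = (we1 && (waddr1 == raddr[i])) ? wdata1 :
                              (we0 && (waddr0 == raddr[i])) ? wdata0 :
                                                              mem[raddr[i]];
        end
    endgenerate

    assign rdata0 = rdata[0];
    assign rdata1 = rdata[1];
    assign rdata2 = rdata[2];
    assign rdata3 = rdata[3];

    // Same priority rule on the write side: port 0 yields to port 1.
    always @(posedge clk) begin
        if (we0 && !(we1 && (waddr0 == waddr1)))
            mem[waddr0] <= wdata0;
        if (we1)
            mem[waddr1] <= wdata1;
    end
endmodule
```

Each read port now carries two 5-bit comparators and a 3-to-1 select in front of its output, so the bypass adds to the read path delay as well as to area.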
Area Growth with Multiple Ports
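Multi-port register files do not scale well.
Each read port adds:
- A 32-to-1 multiplexer for every bit
- Large fanout
- A longer critical path
For 32 registers of 32 bits and 4 read ports, that is 128 separate 32-to-1 multiplexers (4 ports times 32 bits), all fed from the same 1024 storage bits, so routing and muxing make up a large share of the structure.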
On an FPGA:
- The tool may not infer block RAM properly.
- The memory may get replicated using LUTs.
On an ASIC:
- A custom multi-port SRAM or register file macro is often used.
- A standard cell implementation with flip-flops becomes expensive in area.
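To put a rough number on the read side: a 32-to-1 multiplexer built as a tree of two-input multiplexers needs 16 + 8 + 4 + 2 + 1 = 31 of them per bit. Assuming that simple tree structure (real libraries and tools may use wider cells or decoded AND-OR logic instead), the four read ports cost:
\[ 4 \text{ ports} \times 32 \text{ bits} \times 31 = 3968 \text{ two-input multiplexers} \]
That sits on top of the \(32 \times 32 = 1024\) storage bits, each of which fans out to all four read trees.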
Replicating Memory for Reads
Instead of building one memory with four read ports, you can replicate the storage four times.
reg [31:0] mem0 [31:0];
reg [31:0] mem1 [31:0];
reg [31:0] mem2 [31:0];
reg [31:0] mem3 [31:0];
Each read port reads from its own copy.
Writes update all copies:
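always @(posedge clk) begin
    // Port 0 yields to port 1 on a same-address collision.
    if (we0 && !(we1 && (waddr0 == waddr1))) begin
        mem0[waddr0] <= wdata0;
        mem1[waddr0] <= wdata0;
        mem2[waddr0] <= wdata0;
        mem3[waddr0] <= wdata0;
    end
    if (we1) begin
        mem0[waddr1] <= wdata1;
        mem1[waddr1] <= wdata1;
        mem2[waddr1] <= wdata1;
        mem3[waddr1] <= wdata1;
    end
end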
The tradeoff is simple:
- Area increases roughly four times.
- Timing usually improves.
- Read fanout is reduced.
This technique is common in FPGA designs.
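Put together, the replicated variant might look like the sketch below. It uses a generate loop in place of four hand-written copies and reuses the port 1 priority rule from earlier; the module name `regfile_4r2w_repl` and the `g_copy` label are hypothetical. Note that every copy still has two write ports, so replication by itself only relieves the read side.

```verilog
// Sketch: one private storage copy per read port, all copies written together.
module regfile_4r2w_repl (
    input  wire        clk,
    input  wire [4:0]  raddr0, raddr1, raddr2, raddr3,
    output wire [31:0] rdata0, rdata1, rdata2, rdata3,
    input  wire        we0, we1,
    input  wire [4:0]  waddr0, waddr1,
    input  wire [31:0] wdata0, wdata1
);
    wire [4:0]  raddr [0:3];
    wire [31:0] rdata [0:3];
    assign raddr[0] = raddr0;
    assign raddr[1] = raddr1;
    assign raddr[2] = raddr2;
    assign raddr[3] = raddr3;

    genvar i;
    generate
        for (i = 0; i < 4; i = i + 1) begin : g_copy
            // Private copy of the register contents for read port i.
            reg [31:0] mem [31:0];

            // Every copy sees the same writes, so all copies stay identical.
            always @(posedge clk) begin
                if (we0 && !(we1 && (waddr0 == waddr1)))
                    mem[waddr0] <= wdata0;
                if (we1)
                    mem[waddr1] <= wdata1;
            end

            // Exactly one read port per copy.
            assign rdata[i] = mem[raddr[i]];
        end
    endgenerate

    assign rdata0 = rdata[0];
    assign rdata1 = rdata[1];
    assign rdata2 = rdata[2];
    assign rdata3 = rdata[3];
endmodule
```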
Scaling Write Ports
A true two-write-port memory requires:
- Two independent write drivers
- Logic to resolve conflicts
- Bitcell support if using an ASIC SRAM
In a standard cell design, this is usually implemented using flip-flops, which increases area significantly.
One alternative is banking.
Banking the Register File
You can split the register file into two banks, for example even and odd registers.
always @(posedge clk) begin
    if (we0) begin
        if (waddr0[0] == 1'b0)
            bank0[waddr0[4:1]] <= wdata0;  // even registers
        else
            bank1[waddr0[4:1]] <= wdata0;  // odd registers
    end
end
Now two writes can occur in the same cycle only if they go to different banks. If both target the same bank, you have a structural conflict.
This approach is used in in-order cores and some SIMD designs.
The tradeoff is that either the compiler or the hardware must handle bank conflicts.
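As a sketch of the hardware side of that tradeoff, the module below gives each bank a single write port and raises a flag when both write ports want the same bank. The module name, the single read port, and the `bank_conflict` output are assumptions made for illustration, as is the choice to serve port 0 first; how the pipeline reacts to the flag (stall, replay, or compiler-enforced avoidance) is outside this sketch.

```verilog
// Sketch: even/odd banked storage with one write port per bank.
module regfile_banked (
    input  wire        clk,
    input  wire [4:0]  raddr,
    output wire [31:0] rdata,
    input  wire        we0, we1,
    input  wire [4:0]  waddr0, waddr1,
    input  wire [31:0] wdata0, wdata1,
    output wire        bank_conflict   // hypothetical stall request
);
    reg [31:0] bank0 [15:0];  // even registers: r0, r2, ..., r30
    reg [31:0] bank1 [15:0];  // odd registers:  r1, r3, ..., r31

    // Both ports active and aimed at the same bank: only one can be served.
    assign bank_conflict = we0 && we1 && (waddr0[0] == waddr1[0]);

    // Per-bank write requests from each port.
    wire p0_b0 = we0 && !waddr0[0];
    wire p1_b0 = we1 && !waddr1[0];
    wire p0_b1 = we0 &&  waddr0[0];
    wire p1_b1 = we1 &&  waddr1[0];

    always @(posedge clk) begin
        // Port 0 is served first; port 1 gets a bank only if port 0
        // is not using it in this cycle.
        if (p0_b0)      bank0[waddr0[4:1]] <= wdata0;
        else if (p1_b0) bank0[waddr1[4:1]] <= wdata1;

        if (p0_b1)      bank1[waddr0[4:1]] <= wdata0;
        else if (p1_b1) bank1[waddr1[4:1]] <= wdata1;
    end

    // The low address bit selects the bank on reads as well.
    assign rdata = raddr[0] ? bank1[raddr[4:1]] : bank0[raddr[4:1]];
endmodule
```

When `bank_conflict` is high, the port 1 write has not happened and must be retried by the pipeline. If it is replayed in the next cycle, a same-address collision still ends with port 1's value in the register, which matches the priority rule chosen earlier.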
Critical Path in the Read Mux
Each read port effectively needs a 32-to-1 multiplexer per bit.
Instead of a flat 32-to-1 mux, you can use a hierarchical structure:
- Divide registers into 8 groups of 4
- First stage uses 4-to-1 muxes, one per group
- Second stage selects among the 8 groups
This can improve routing locality and timing. Some synthesis tools perform similar optimizations automatically, but structuring it manually can help in ASIC flows.
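The two-stage structure can be written out explicitly, as in the sketch below. It assumes the register contents are available as one flattened 1024-bit bus with register r in bits [32*r +: 32]; that packing, the module name, and the signal names are illustrative choices. A synthesis tool is free to restructure this logic, so treat it as a way to express intent rather than a guaranteed netlist shape.

```verilog
// Sketch: 32-to-1 read mux split into eight 4-to-1 muxes plus one 8-to-1 mux.
module read_mux_2stage (
    input  wire [1023:0] regs_flat,  // register r occupies bits [32*r +: 32]
    input  wire [4:0]    raddr,
    output wire [31:0]   rdata
);
    wire [31:0] group_out [0:7];

    genvar g;
    generate
        for (g = 0; g < 8; g = g + 1) begin : g_stage1
            // Stage 1: pick one of the four registers in group g
            // (registers 4*g .. 4*g+3) using the low address bits.
            assign group_out[g] = regs_flat[32*(4*g + raddr[1:0]) +: 32];
        end
    endgenerate

    // Stage 2: pick one of the eight group outputs using the high address bits.
    assign rdata = group_out[raddr[4:2]];
endmodule
```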
Pipelining the Read
If timing is tight, you can pipeline the read data.
reg [31:0] rdata0_reg;
always @(posedge clk) begin
    rdata0_reg <= rdata0_comb;  // rdata0_comb: the combinational read result
end
This introduces:
- One cycle of read latency
- Higher maximum clock frequency
The pipeline control logic must account for the extra cycle.
Special Case for a Zero Register
In many instruction sets, register 0 always reads as zero.
You can handle this explicitly:
assign rdata0 = (raddr0 == 5'd0) ? 32'b0 : real_data0;
This avoids storing an actual value for register 0 and simplifies write handling for that register.
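A minimal single-port sketch of the full zero-register handling follows. The module name and the `we0_eff` signal are hypothetical; the point is that the write enable is qualified once and that any bypass uses the qualified enable, so a write aimed at register 0 can never be forwarded to a reader.

```verilog
// Sketch: hardwired zero register, shown with one read and one write port.
module regfile_x0 (
    input  wire        clk,
    input  wire [4:0]  raddr0,
    output wire [31:0] rdata0,
    input  wire        we0,
    input  wire [4:0]  waddr0,
    input  wire [31:0] wdata0
);
    // No storage is allocated for register 0.
    reg [31:0] mem [31:1];

    // Writes to register 0 are silently dropped.
    wire we0_eff = we0 && (waddr0 != 5'd0);

    always @(posedge clk) begin
        if (we0_eff)
            mem[waddr0] <= wdata0;
    end

    // Register 0 always reads as zero. The bypass uses we0_eff, and the
    // zero check comes first, so register 0 never sees forwarded data.
    assign rdata0 = (raddr0 == 5'd0)                 ? 32'b0  :
                    (we0_eff && (waddr0 == raddr0)) ? wdata0 :
                                                       mem[raddr0];
endmodule
```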
What to Test in Verification
A multi port register file can fail in subtle ways. The following cases must be tested carefully:
- Simultaneous writes to different registers
- Simultaneous writes to the same register
- Read after write in the same cycle
- Read after write in the next cycle
- Writes when write enable is deasserted
- Randomized stress conditions
Directed Test Example
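initial begin
    // Write 0xAAAA to register 5 through port 0.
    we0 = 1; waddr0 = 5; wdata0 = 32'hAAAA;
    we1 = 0;
    @(posedge clk);  // the write is captured on this testbench clock edge
    #1;
    // Read the register back in the following cycle.
    we0 = 0;
    raddr0 = 5;
    #1;              // let the combinational read settle
    if (rdata0 !== 32'hAAAA)
        $display("Error: read after write (next cycle) failed");
end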
Randomized Stress Example
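repeat (1000) begin
    // $random is truncated to the width of each target signal.
    // This drives stimulus only; compare rdata* against a reference
    // model to actually detect errors.
    we0    = $random;
    we1    = $random;
    waddr0 = $random;
    waddr1 = $random;
    wdata0 = $random;
    wdata1 = $random;
    raddr0 = $random;
    raddr1 = $random;
    raddr2 = $random;
    raddr3 = $random;
    #10;
end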
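Random stimulus is only useful if something checks the results. The testbench below is one possible self-checking version: it keeps a reference copy of the register contents and compares all four read ports against it. It assumes the `regfile_4r2w` module from earlier in this article with port 1 winning same-address writes; the testbench name, the `model` and `written` arrays, and the `check` task are illustrative. Reads are checked in a cycle with both write enables low, so the same testbench works with or without bypass logic.

```verilog
`timescale 1ns/1ps
// Sketch: randomized, self-checking testbench for regfile_4r2w.
module tb_regfile_random;
    reg         clk = 0;
    reg  [4:0]  raddr0, raddr1, raddr2, raddr3;
    wire [31:0] rdata0, rdata1, rdata2, rdata3;
    reg         we0, we1;
    reg  [4:0]  waddr0, waddr1;
    reg  [31:0] wdata0, wdata1;

    regfile_4r2w dut (
        .clk(clk),
        .raddr0(raddr0), .raddr1(raddr1), .raddr2(raddr2), .raddr3(raddr3),
        .rdata0(rdata0), .rdata1(rdata1), .rdata2(rdata2), .rdata3(rdata3),
        .we0(we0), .waddr0(waddr0), .wdata0(wdata0),
        .we1(we1), .waddr1(waddr1), .wdata1(wdata1)
    );

    always #5 clk = ~clk;

    reg [31:0] model [31:0];  // reference copy of the register contents
    reg [31:0] written;       // bit r is set once register r has been written
    integer    errors;

    // Compare one read port against the model, skipping registers that
    // have never been written (their contents are unknown).
    task check;
        input [4:0]  addr;
        input [31:0] got;
        begin
            if (written[addr] && (got !== model[addr])) begin
                errors = errors + 1;
                $display("Mismatch r%0d: got %h, expected %h",
                         addr, got, model[addr]);
            end
        end
    endtask

    initial begin
        errors  = 0;
        written = 32'b0;
        we0 = 0; we1 = 0;
        repeat (1000) begin
            // Drive random writes away from the capturing clock edge.
            @(negedge clk);
            we0 = $random;    we1 = $random;
            waddr0 = $random; waddr1 = $random;
            wdata0 = $random; wdata1 = $random;
            @(posedge clk);
            // Mirror the writes; port 1 is applied last so it wins a collision.
            if (we0) begin model[waddr0] = wdata0; written[waddr0] = 1'b1; end
            if (we1) begin model[waddr1] = wdata1; written[waddr1] = 1'b1; end
            // Read back random registers with both write enables low.
            @(negedge clk);
            we0 = 0; we1 = 0;
            raddr0 = $random; raddr1 = $random;
            raddr2 = $random; raddr3 = $random;
            #1;
            check(raddr0, rdata0);
            check(raddr1, rdata1);
            check(raddr2, rdata2);
            check(raddr3, rdata3);
        end
        $display("Done: %0d mismatches", errors);
        $finish;
    end
endmodule
```

This sketch does not cover same-cycle read after write. Checking that case requires reading while a write enable is high and comparing against the old or the new value, depending on whether the design includes a bypass.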
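You can also add assertions. This one flags any cycle in which both ports write the same register, which is useful when the architecture forbids that case:
always @(posedge clk)
    assert (!(we0 && we1 && (waddr0 == waddr1)))
        else $error("Write conflict detected");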
Why 4R2W Is Difficult
The difficulty comes from several directions at once:
- Hardware cost increases quickly as you add ports.
- Write-write conflicts must be clearly defined.
- Read-after-write cases require bypass logic.
- Routing often dominates timing.
- FPGA and ASIC implementations behave very differently.
In simple RTL, this design appears straightforward.
In silicon, it becomes one of the most dense and timing critical structures in the processor core.
At this point you have:
- A basic implementation
- A defined write conflict strategy
- Bypass logic for read after write
- Memory replication for read scaling
- A banking approach
- Timing improvement ideas
- Verification strategies
From here, the design can be adapted for:
- An FPGA prototype
- A standard cell ASIC
- A custom SRAM-based implementation
- A superscalar pipeline
Writing the RTL is the easy part.
Balancing correctness, timing, and area at the same time is the real challenge.