Dual-Issue In-Order Superscalar 16-bit RISC Processor

Microarchitecture Superscalar Design Pipeline Architecture Hazard Detection & Forwarding Multi-Port Register File Design Dependency Arbitration Logic Verilog RTL Design

Full documentation can be found here

Information about pipeline registers, memory and otehr details

View Project

Repository	Mummanajagadeesh/risc16-dual-superscalar-core
Start Date	Sep 2025

Summary

Item	Description
Issue Width	Dual (2 instructions per cycle)
Execution Model	In-order issue, execution, and retirement
Pipeline	IF → ID → EX → MEM → WB (per lane)
Register File	4R / 2W multi-ported
Memory	Unified, multi-port
Hazard Handling	Full inter- and intra-lane detection
Forwarding	Cross-lane and cross-stage

Superscalar Processor (RiSC-16 Variant)

Dual-Issue In-Order Superscalar 16-bit RISC Processor with Multi-Port Memory, 5-Stage Pipeline, Full Hazard Detection, Cross-Lane Forwarding, and Dual Commit Logic

This project implements a complete *2-way in-order superscalar processor- based on a compact 16-bit RISC architecture. The design dispatches up to two instructions per cycle while strictly preserving architectural ordering.

The processor integrates:

Parallel decode and execution lanes
Multi-way dependency checking
Cross-lane forwarding network
Multi-ported register file
Unified memory subsystem
Dual write-back commit logic

The entire system is written in synthesizable Verilog.

Architectural Block Diagram

                     +------------------+
                     |   Memory System  |
                     |  (3-Port ARAM)   |
                     +------------------+
                        ^        ^
                        |        |
         +--------------+        +-----------------+
         |                                     |
 +---------------+      +---------------+      +---------------+
 | IF Stage 0    |      | IF Stage 1    |      |   PC Logic    |
 +---------------+      +---------------+      +---------------+
         |                       |  
         v                       v
 +---------------+      +---------------+
 | IF/ID  Pipe   |      | IF/ID Pipe    |
 | Register 0    |      | Register 1    |
 +---------------+      +---------------+
         |                       |
         v                       v
 +---------------+      +---------------+
 | ID/EX Pipe    |      | ID/EX Pipe    |
 | Register 0    |      | Register 1    |
 +---------------+      +---------------+
         |                       |
         v                       v
 +---------------+      +---------------+
 | EX/MEM Pipe   |      | EX/MEM Pipe   |
 | Register 0    |      | Register 1    |
 +---------------+      +---------------+
         |                       |
         v                       v
 +---------------+      +---------------+
 | MEM/WB Pipe   |      | MEM/WB Pipe   |
 | Register 0    |      | Register 1    |
 +---------------+      +---------------+
         |                       |
         +-----------+-----------+
                     v
              Register File

Execution Model

The processor fetches 32 bits per cycle (two 16-bit instructions). Both lanes share identical pipeline depth but operate on independent datapaths where required.

Ordering guarantees:

In-order issue
In-order execution (pipeline-overlapped)
In-order retirement

Superscalar capability is enabled by:

Simultaneous dual decode
Independent ALUs
Multi-port register access
Dual write-back ports
Explicit inter-lane dependency arbitration

Superscalar Issue Logic

Lane 1 issues only when safe. The control logic enforces:

RAW Hazards (Inter-Lane)

If: \[ rd_0 = rs_1 ;; \text{or} ;; rd_0 = rt_1 \] lane 1 is suppressed.

WAW Hazards

If: \[ rd_0 = rd_1 \] lane 1 is cancelled.

Structural Restrictions

Only one memory operation per cycle
Branch instructions are single-issued
Functional unit conflicts block pairing

Branch Pairing Policy

Branches cannot pair
Taken branches squash younger instructions
PC redirection restarts fetch

Pipeline Organization

Parallelized classical RISC structure:

\[ IF \rightarrow ID0/ID1 \rightarrow EX0/EX1 \rightarrow MEM0/MEM1 \rightarrow WB0/WB1 \]

Each lane contains:

Independent pipeline registers
Independent ALU
Independent bypass selection
Dedicated write-back port

Shared structures:

Instruction fetch logic
Unified data memory
Multi-ported register file

Forwarding Network

Forwarding spans:

EX/MEM → ID/EX (both lanes)
MEM/WB → ID/EX (both lanes)
Lane 0 → Lane 1 (same cycle)
Lane 1 → Lane 0 (ordering-checked cases)

Selection logic is based on:

Register index matching
Destination validity
PC temporal ordering
Opcode classification

This enables correct resolution of most RAW hazards without stalling.

Register File Architecture

Tri-ported logical structure:

4 read ports (2 per lane)
2 write ports (1 per lane)
Lane 0 commit priority
r0 hardwired to zero

Commit policy guarantees strict in-order retirement.

Memory Subsystem

Unified memory design:

Two combinational instruction read ports
One effective data memory operation per cycle
Dual-issue logic prevents conflicting load/store pairing

This eliminates MEM-stage structural hazards.

Example Run — Summation Benchmark

Program computes:

\[ 0 + 1 + 2 + \dots + 10 = 55 \]

Assembly Program

ADDI r1, r0, 0
ADDI r1, r1, 1
ADDI r1, r1, 2
ADDI r1, r1, 3
ADDI r1, r1, 4
ADDI r1, r1, 5
ADDI r1, r1, 6
ADDI r1, r1, 7
ADDI r1, r1, 8
ADDI r1, r1, 9
ADDI r1, r1, 10
HALT

Machine Code

Final expected result:

\[ r1 = 0037_{16} = 55_{10} \]

Superscalar Waveform

alt text

Observed behavior:

Dual fetch visible in IF stage
Lane 1 suppressed due to RAW on r1

EXMEM_ALUOut_0 shows cumulative sums:

0000 → 0001 → 0003 → 0006 → 000A → ... → 0037

Correct in-order write-back
HALT cleanly drains pipeline

Although dual-issue capable, this benchmark serializes due to true dependencies.

Terminal Output

alt text

Final register dump:

regs .... 0037 0000 0000 0000

Confirms:

\[ 0037_{16} = 55_{10} \]

Waveform and terminal output are fully consistent.

Current Limitations

Partial ISA implementation
No speculation
No branch prediction
No dynamic scheduling
Only basic load-use stalling

Future Evolution Toward Out-of-Order

To transform this design into a modern dynamic superscalar core:

Register Renaming

Eliminate WAR/WAW hazards via physical register mapping.

Reservation Stations

Buffer instructions until operands are ready.

Common Data Bus (CDB)

Broadcast execution results.

Reorder Buffer (ROB)

Enable out-of-order execution with in-order retirement.

Tomasulo-Style Engine

Provide:

Dynamic scheduling
Automatic wakeup/select
Speculative execution
Higher instruction-level parallelism

Directory Structure

src/
    top.v
    pipelinestages/
    comps/
    memories/
    assembly/
test/
    test.v