Loading…
ROSE: RISC-V Open-Source Env

ROSE: RISC-V Open-Source Env

Verilog TL-Verilog Microarchitecture Pipeline Design Superscalar Design Hazard Detection Register File Design Multi-Port Memory RTL Verification
ROSE: RISC-V Open-Source Env
RV32I(M) CPU Core Variants: SC / MC / Pipelined / Superscalar / Out-of-Order
CPU Cores benchmarked on custom C kernels
View Project
ROSE: RISC-V Open-Source Env
Dual-Issue 16-bit Superscalar RISC Processor
Two-wide in-order superscalar core with hazard suppression and multi-port memory
View Project
ROSE: RISC-V Open-Source Env
RV32I RISC-V Core (TL-Verilog, Single-Cycle)
Single-stage RV32I implementation in TL-Verilog with full base ISA support and cycle-accurate visualization
View Project
MIPS & RISC Microarchitectures (SC / MC / Pipeline)
Three 32-bit MIPS/RISC designs comparing CPI, hazards, and pipeline overhead

Dual Issue Superscalar RISC Processor

RISC-V & MIPS Microarchitectures — SC / MC / Pipeline / Dual-Issue

This umbrella project explores progressively complex processor microarchitectures, from single-cycle execution to a two-wide in-order superscalar design. All designs were verified using directed assembly programs and reference models.

Milestone Update (Mar-Apr 2026)

After the Electrothon competition, I continued this project with a stronger focus on verification, benchmarking, and OOO exploration.

  • Added a QEMU + Spike cross-check flow for RV32I programs.
  • Added support to compile and run arbitrary C files on RTL, then compare output windows against QEMU and Spike.
  • Implemented a single-issue OOO RV32I core using a Tomasulo-style hybrid approach (tag wakeup/select + ROB in-order commit).
  • Attempted a super-OOO variant; this turned out to be far more complex than I initially expected, so I am currently fixing it step by step.
  • Added RV32IM support used for CoreMark bring-up and started working on CoreMark-focused tuning.

Common-Kernel Results Snapshot

VariantCommon KernelsCyclesInstrIPCCPI
sc3897289700.99981.0002
mc33901189700.22994.3491
pipe31232589700.72781.3740
superscalar3939589700.95481.0474
ooo31154889700.77681.2874
super-ooo1 (partial)318324100.75711.3207

These are the shared common-kernel benchmark numbers from the current run. Super-OOO is still under active fixes, so only one common kernel is currently validated there.


RV32I RISC-V Core (TL-Verilog, Single-Cycle)

Tools: Makerchip | TL-Verilog | Verilator

FeatureDescription
ISAFull RV32I base (I, S, B, U, J, R formats)
Register File32 × 32-bit, dual-read / single-write
Invariance( x0 = 0 ) enforced
ALUArithmetic, logical, shift, compare
VerificationDirected program (sum 1–9)
Cycles~50 cycles for validation program

Interactive visualization enabled via m4+cpu_viz() for cycle-accurate introspection.


RV32I RISC-V Core (Verilog, Single-Cycle)

Tools: Verilog | Icarus Verilog | ModelSim | Quartus Prime

FeatureDescription
DatapathModular ALU, control, imm-gen, PC logic
ISA CoverageAll 38 base RV32I instructions
Load/StoreLB, LBU, LH, LHU, LW, SB, SH, SW
ExtensionsCorrect sign/zero extension and RMW behavior
VerificationSelf-checking ModelSim testbenches
SynthesisQuartus Prime validated

MIPS Microarchitectures — SC / MC / 5-Stage Pipeline

Tools: Verilog | Icarus Verilog | ModelSim | GTKWave

Three independent 32-bit MIPS implementations were developed and benchmarked.


Architectural Variants

DesignDescriptionCPI
Single-CycleOne instruction per cycle1.0
Multi-CycleInstruction-dependent cycles~4.1
5-Stage PipelineIF–ID–EX–MEM–WB with forwarding~1.1–1.2

Multi-Cycle Instruction Latency

Instruction TypeCycles
R-type4
I-type arithmetic4
Load5
Store4
Branch3
Jump3

RISCV Comparison

MetricSCMCPipeline
Instructions131313
Cycles135021
CPI1.003.851.62
IPC1.000.260.62
Stall Cycles002
Flush Cycles002
Overhead Cycles0378

MIPS Comparison

MetricSCMCPipeline
Instructions181818
Cycles1870≈50
CPI1.003.89≈2.78
IPC1.000.257≈0.36
Overhead Cycles052≈32

Benchmark: Harris & Harris program (18 instructions) Correctly wrote ( 0x00000007 ) to memory addresses ( 0x50 ) and ( 0x54 ).


Dual-Issue 16-bit Superscalar RISC Processor (In-Order)

Tools: Verilog | Icarus Verilog | GTKWave


Architectural Overview

FeatureDescription
Issue Width2 instructions per cycle
Instruction Width16-bit
Fetch32-bit fetch (2×16-bit instructions)
PipelineIF–ID–EX–MEM–WB (dual lanes)
Register File4 read ports, 2 write ports
MemoryMulti-port ARAM (3-port)
Zero Register( r0 = 0 ) hardwired

Hazard & Dependency Handling

  • RAW detection
  • WAW detection
  • Inter-lane dependency suppression
  • Load-use stall insertion
  • Branch squashing
  • Independent pipeline registers per lane

Lane-1 issue suppressed if hazard detected.


Verification Program

Directed program: Sum integers ( 1 \rightarrow 10 )

Expected: \[ \sum_{i=1}^{10} i = 55 \]

Final register: \[ r1 = 0x0037 \]

Correct inter-lane suppression observed for true dependencies.


Architectural Evolution Summary

StageFocus
Single-CycleISA correctness
Multi-CycleResource reuse vs CPI tradeoff
5-Stage PipelineThroughput vs hazard cost
Dual-IssueParallel issue, dependency management, structural scaling

This umbrella project systematically explores performance scaling from scalar single-cycle execution to dual-issue in-order superscalar microarchitecture while maintaining architectural correctness and cycle-level validation.

Timeline Overview


Mar 2025 — First Exposure to RISC-V (TL-Verilog RV32I Core)

Work on processor design began around late March 2025 through an edX course by the Linux Foundation, which focused on building a RV32I CPU core using TL-Verilog.

During this phase:

  • Implemented a single-cycle RV32I processor in TL-Verilog using the Makerchip environment.

  • Supported the complete RV32I base ISA including all instruction formats (R, I, S, B, U, J).

  • Built the following datapath components:

    • 32 × 32-bit register file (dual read, single write)
    • ALU with arithmetic, logical, shift, and comparison operations
    • Immediate generator
    • Program counter logic
  • Enforced architectural invariants such as x0 = 0.

  • Verified functionality using a directed assembly test program (sum of integers 1–9).

  • Used Makerchip’s cycle-accurate visualization tools (m4+cpu_viz) to inspect pipeline behavior.

This phase primarily served as an introduction to:

ISA structure
CPU datapath construction
instruction decoding
cycle-accurate execution tracing

The TL-Verilog implementation provided a visual and educational introduction to processor design before moving to conventional RTL.


Oct 2025 – Nov 2025 — Verilog RV32I Core and Microarchitecture Exploration

Later in October–November 2025, the processor was re-implemented entirely in Verilog RTL as part of a course project.

This time the design focused on a more conventional RTL microarchitecture rather than TL-Verilog abstractions.

Implementations Developed

Three variants of the same RV32I processor were implemented:

VariantDescription
Single-CycleOne instruction completes per clock cycle
Multi-CycleInstructions executed across multiple stages depending on type
5-Stage PipelineIF–ID–EX–MEM–WB pipeline with forwarding

Key features:

  • Modular datapath structure:

    • ALU
    • control unit
    • register file
    • immediate generator
    • program counter logic
  • Full support for all 38 base RV32I instructions

  • Correct handling of:

    • sign extension
    • zero extension
    • load/store alignment behavior
  • Verification using self-checking ModelSim testbenches

  • Simulation using Icarus Verilog and ModelSim

  • Synthesis validation in Quartus Prime


Performance Comparison

All three implementations were evaluated using the same benchmark program and data file, allowing direct comparison.

Metrics measured included:

  • CPI
  • IPC
  • cycle counts
  • pipeline stall cycles
  • flush cycles
  • architectural overhead

This created a controlled study of how architectural complexity affects performance.


Parallel Exploration — MIPS Microarchitectures

During the same period, a parallel implementation effort was undertaken using MIPS to explore architectural behavior across a different ISA.

Three MIPS processors were implemented:

DesignDescription
Single-Cycleone instruction per clock
Multi-Cycleinstruction-dependent latency
5-Stage PipelineIF–ID–EX–MEM–WB pipeline

These designs were used to compare:

resource reuse
pipeline hazards
instruction latency
CPI differences across architectures

Benchmark used: Harris & Harris example program.


Oct–Nov 2025 — Early Superscalar Experiment (Parallel Work)

Alongside the scalar processors, an experimental superscalar processor was also implemented.

This processor was not RV32I-compatible, and instead used a small 16-bit custom RISC instruction set.

Because of the ISA mismatch, it was not directly comparable to the earlier processors.

Architecture

  • Dual-issue in-order superscalar design
  • Two instructions fetched per cycle (32-bit fetch)
  • Two parallel pipelines

Pipeline stages:

IF → ID → EX → MEM → WB

Key hardware upgrades included:

  • 4-read / 2-write port register file
  • dual-port / multi-port memory architecture
  • inter-lane dependency detection
  • RAW hazard detection
  • WAW hazard detection
  • load-use stall insertion
  • branch squashing
  • independent pipeline registers per lane

If a dependency was detected:

lane-1 issue suppressed
processor falls back to single-issue mode

Supported instruction types were limited:

basic R-type
basic I-type

This processor was mainly an early attempt to understand superscalar issue logic rather than a full ISA implementation.


Dec 2025 — Reference Model Verification Using QEMU

During the December 2025 vacation period, the verification infrastructure was improved.

The emulator QEMU was used as a reference model.

The following states were compared between the RTL simulation and QEMU execution:

  • program counter
  • register dumps
  • memory states

This significantly improved confidence in the correctness of the RV32I implementations.

The superscalar prototype could not be validated using QEMU because its ISA differed from RV32I.


Mar 2026 — Electrothon Hackathon (HackS’US / C-DAC Thiruvananthapuram)

In March 2026, the project entered a new phase during a 42-hour national hackathon organized by:

  • HackS’US RSET IEDC
  • C-DAC Thiruvananthapuram

Competition: Electrothon

The challenge required designing a RV32I processor implementation of any architecture.

This work expanded the earlier processor designs into a more modular and structured framework.


Architectures Implemented

Four RV32I cores were developed:

ArchitectureDescription
Single-Cyclebaseline architecture
Multi-Cycleresource-reusing execution
5-Stage Pipelineclassic IF–ID–EX–MEM–WB pipeline
Dual-Issue Superscalartwo-wide in-order design

The superscalar design was re-implemented using RV32I ISA compatibility, unlike the earlier experimental processor.

Due to time constraints, the superscalar design used:

  • 4-read / 2-write register file
  • single-port memory
  • hazard suppression logic
  • supports all instructions

Superscalar Features

  • two-wide fetch
  • parallel pipelines
  • inter-lane RAW/WAW hazard detection
  • load-use stalls
  • branch squashing
  • dual-issue scheduling
  • fallback to single-issue mode when dependencies occur

Performance Evaluation

Custom hardware performance counters were added to measure:

CPI
IPC
execution latency
cycle counts

All cores were tested using the same benchmark programs.


FPGA Demonstration

The processors were synthesized and demonstrated on a Spartan-7 FPGA, with comparisons of:

  • LUT utilization
  • architectural complexity
  • performance scaling

Competition Result

The project won:

1st Place — National Level
42-Hour Hackathon

Mar-Apr 2026 - Post-Competition Development

After the competition, work continued directly on this codebase as a practical extension of the hackathon prototypes.

Major progress in this phase:

  • Brought up a repeatable QEMU + Spike reference flow for RV32I verification.
  • Added generic C-program execution flow so arbitrary C files can be compiled, simulated on RTL, and cross-checked with QEMU/Spike memory dumps.
  • Implemented a single-issue out-of-order RV32I core and connected it to the existing benchmarking flow.
  • Attempted a super-OOO version; this design proved much more complex than expected in real implementation, and it is currently being stabilized.
  • Added RV32IM support needed for CoreMark-oriented work and started ongoing CoreMark optimization and validation.
  • Have benchmarked all variants and explored tradeoffs; and optimised pipelined variant.

Alongside implementation work, I also studied superscalar and benchmark literature, including:

“Design of a 32-bit Dual Pipeline Superscalar RISC-V Processor on FPGA” by Kuruvilla Varghese.

The paper included additional architectural features such as:

  • branch prediction
  • performance evaluation using CoreMark
  • performance evaluation using Dhrystone

This provided further context for future improvements.


ISA Compliance Testing — RISC-V ACT

To verify ISA correctness, the RISC-V ACT (Architecture Compliance Tests framework) suite was integrated.

The tests verified whether the processors fully complied with the official RV32I ISA specification.

Most cores passed the compliance tests successfully.

A few tests were skipped because Icarus Verilog simulation became extremely slow for large memory jumps and long execution traces.


Current Status (Apr 2026)

Current work is focused on consolidating post-competition progress into a stable and reproducible research workflow.

Active status:

  • QEMU + Spike cross-check flow is working for RV32I verification.
  • Arbitrary C files can be run end-to-end through RTL and compared against software references.
  • SC, MC, PIPE, and Superscalar variants are benchmarked in a common framework.
  • Single-issue OOO is implemented and under continued tuning.
  • Super-OOO is under active fixes; it was far more complex in practice than I initially imagined.
  • RV32IM support for CoreMark work is integrated, and CoreMark optimization is also done.

Near-term direction:

stabilize single-issue OOO first
fix super-OOO correctness path by path
complete CoreMark-focused tuning and reporting
extend benchmark-driven verification across all variants

Future experiments involve:

superscalar scheduling improvements
out-of-order variants and other RISC-V extensions

This project now functions as an evolving microarchitecture lab, from scalar RV32I baselines to superscalar and OOO designs, with correctness-first verification.

I am approaching the super-OOO bring-up humbly: it is a harder system than I expected, and I am fixing it incrementally from a solid single-issue OOO base.