RV32I Superscalar Optimization Journey

Scope and benchmark method

This log tracks the superscalar path from rv32i-superscalar to the current superscalar optimized line.
CoreMark is tracked at -O2, -O3, and -Ofast, with ITERATIONS=1/10/100.
For step-by-step comparison, ITERATIONS=100 is the main reference point (the same convention as the pipe journey).
Metrics used:
- CoreMark/MHz = (ITERATIONS * 1,000,000) / cycles
- CPI = cycles / retired_instructions
- IPC = retired_instructions / cycles
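The three metric definitions can be written as small helpers. This is a minimal sketch; the function names are illustrative and not taken from the project's scripts. The check value uses the baseline O3 i100 cycle count reported in this log.

```python
def coremark_per_mhz(iterations: int, cycles: int) -> float:
    """CoreMark/MHz = (ITERATIONS * 1,000,000) / cycles."""
    return iterations * 1_000_000 / cycles


def cpi(cycles: int, retired_instructions: int) -> float:
    """Cycles per retired instruction."""
    return cycles / retired_instructions


def ipc(cycles: int, retired_instructions: int) -> float:
    """Retired instructions per cycle (the reciprocal of CPI)."""
    return retired_instructions / cycles


if __name__ == "__main__":
    # Baseline rv32i-superscalar, O3, ITERATIONS=100 (cycles from this log).
    print(f"CoreMark/MHz = {coremark_per_mhz(100, 30_832_113):.9f}")
```

Because IPC is just 1/CPI, the two columns in the timeline carry the same information; CPI is convenient for comparing against the scalar pipe, IPC for judging how much of the second issue slot is used.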
CRC checks on successful runs:
- ITER=1: 0x0000e3c1
- ITER=10: 0x0000c64e
- ITER=100: 0x0000844d

How this superscalar path was built (relative to pipe)

The pipe journey explored predictor transitions in more granular steps (static/dynamic sweeps, later hybrid replacement, and later RAS addition).
For superscalar, we reused that learning directly:
- started with BR (early ID branch resolve),
- moved directly to HyBP in superscalar (instead of repeating static/dynamic sweeps),
- combined BR+HyBP,
- added BTB+RAS together in superscalar path,
- then applied hzopt and luopt/fulu style improvements.
CoreMark ELF reuse policy followed the same copied-ELF workflow used in the pipe-based comparisons.

Starting point: superscalar baseline

Variant: rv32i-superscalar

O2 i100: cycles 32,682,915, CPI 1.057730630, CoreMark/MHz 3.059702600
O3 i100: cycles 30,832,113, CPI 1.043518297, CoreMark/MHz 3.243371611
Ofast i100: cycles 30,832,108, CPI 1.043518198, CoreMark/MHz 3.243372137

Step 1: early branch resolution in superscalar

Variant: rv32i-super-br

What changed

Added early branch decision in ID for slot 0 path with branch forwarding.
Kept superscalar issue flow; branch control redirects happen earlier.
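As a rough behavioral model of what an ID-stage branch decision computes, the sketch below evaluates the RV32I branch condition on already-forwarded operand values and picks the redirect target. The funct3 encodings are the standard RV32I ones; the function names and the assumption that operands arrive already forwarded are illustrative, not a description of the actual RTL.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x8000_0000 else value


def branch_taken(funct3: int, rs1_val: int, rs2_val: int) -> bool:
    """Evaluate an RV32I conditional branch on (forwarded) operand values."""
    a, b = rs1_val & MASK32, rs2_val & MASK32
    if funct3 == 0b000:  # BEQ
        return a == b
    if funct3 == 0b001:  # BNE
        return a != b
    if funct3 == 0b100:  # BLT (signed)
        return to_signed32(a) < to_signed32(b)
    if funct3 == 0b101:  # BGE (signed)
        return to_signed32(a) >= to_signed32(b)
    if funct3 == 0b110:  # BLTU (unsigned)
        return a < b
    if funct3 == 0b111:  # BGEU (unsigned)
        return a >= b
    raise ValueError(f"not a branch funct3: {funct3:#05b}")


def next_pc(pc: int, imm: int, funct3: int, rs1_val: int, rs2_val: int) -> int:
    """Redirect target if the branch is taken, otherwise the fall-through PC."""
    if branch_taken(funct3, rs1_val, rs2_val):
        return (pc + imm) & MASK32
    return (pc + 4) & MASK32
```

Resolving in ID rather than later shortens the redirect penalty, which is consistent with the cycle drop reported for this step.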

Result (`i100`)

O3: cycles 30,832,113 -> 28,582,406, CPI 1.043518297 -> 0.967376567, CoreMark/MHz 3.243371611 -> 3.498655781
Improvement vs previous (O3 i100): +7.871%
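The "Improvement vs previous" figures in this log are reproduced by the throughput ratio (previous cycles divided by new cycles, which equals the CoreMark/MHz ratio), not by the fraction of cycles removed. A short sketch with illustrative function names shows the difference for this step:

```python
def improvement_pct(prev_cycles: int, new_cycles: int) -> float:
    """Throughput gain: how much more work per cycle the new variant does."""
    return (prev_cycles / new_cycles - 1.0) * 100.0


def cycle_reduction_pct(prev_cycles: int, new_cycles: int) -> float:
    """Share of the previous cycle count that was removed."""
    return (prev_cycles - new_cycles) / prev_cycles * 100.0


if __name__ == "__main__":
    prev, new = 30_832_113, 28_582_406  # O3 i100: baseline -> rv32i-super-br
    print(f"improvement:     +{improvement_pct(prev, new):.3f}%")
    print(f"cycle reduction:  {cycle_reduction_pct(prev, new):.3f}%")
```

The two numbers differ (about 7.87% versus about 7.30% here), so it matters which one is quoted when comparing against other logs.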

Step 2: superscalar hybrid predictor

Variant: rv32i-super-hybp

What changed

Added tournament-style branch direction predictor (local/global/choice + GHR) to superscalar flow.
This was taken directly from pipe learning; no repeated static/dynamic superscalar sweep.
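For readers unfamiliar with the local/global/choice arrangement, here is a simplified behavioral sketch of a tournament predictor. Table sizes, indexing, and the use of PC-indexed counters for the local side are assumptions made for illustration; the actual HyBP configuration is not specified in this log.

```python
def _bump(counter: int, up: bool) -> int:
    """Saturating 2-bit counter update (range 0..3)."""
    return min(3, counter + 1) if up else max(0, counter - 1)


class TournamentPredictor:
    def __init__(self, index_bits: int = 10) -> None:
        self.mask = (1 << index_bits) - 1
        size = 1 << index_bits
        self.local = [1] * size    # PC-indexed counters (weakly not-taken)
        self.global_ = [1] * size  # GHR-indexed counters
        self.choice = [1] * size   # >= 2 selects global, otherwise local
        self.ghr = 0               # global history register

    def _indices(self, pc: int) -> tuple[int, int]:
        return (pc >> 2) & self.mask, self.ghr & self.mask

    def predict(self, pc: int) -> bool:
        li, gi = self._indices(pc)
        local_taken = self.local[li] >= 2
        global_taken = self.global_[gi] >= 2
        return global_taken if self.choice[gi] >= 2 else local_taken

    def update(self, pc: int, taken: bool) -> None:
        li, gi = self._indices(pc)
        local_ok = (self.local[li] >= 2) == taken
        global_ok = (self.global_[gi] >= 2) == taken
        if local_ok != global_ok:
            # Train the chooser only when the two components disagree.
            self.choice[gi] = _bump(self.choice[gi], global_ok)
        self.local[li] = _bump(self.local[li], taken)
        self.global_[gi] = _bump(self.global_[gi], taken)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask
```

In this model an alternating taken/not-taken branch defeats the per-PC counter but is learned by the history-indexed side, and the chooser migrates to it. That is the kind of case a hybrid is meant to cover.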

Result (`i100`)

O3: cycles 28,582,406 -> 26,840,701, CPI 0.967376567 -> 0.908428254, CoreMark/MHz 3.498655781 -> 3.725685108
Improvement vs previous (O3 i100): +6.489%

Step 3: combine BR + HyBP

Variant: rv32i-super-br-hybp

What changed

Combined slot-0 ID branch resolve with HyBP update/redirect behavior in superscalar pipeline.

Result (`i100`)

O3: cycles 26,840,701 -> 26,483,239, CPI 0.908428254 -> 0.896329890, CoreMark/MHz 3.725685108 -> 3.775973173
Improvement vs previous (O3 i100): +1.350%

Step 4: add BTB + RAS

Variant: rv32i-super-br-hybp-btb-ras

What changed

Added BTB target prediction and RAS return prediction in superscalar path.
In pipe path, these features were introduced in staged evolution; here they were integrated together after BR+HyBP was stable.
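The division of labor between the two structures can be sketched as follows: the BTB remembers targets per branch PC, while the RAS pairs returns with the call that preceded them. Depth, table size, and the simplified call/return classification (link registers x1/x5, ignoring the coroutine-style pop-then-push case) are assumptions for illustration only.

```python
from typing import Optional


class ReturnAddressStack:
    def __init__(self, depth: int = 8) -> None:
        self.depth = depth
        self.entries: list[int] = []

    def push(self, return_pc: int) -> None:
        if len(self.entries) == self.depth:
            self.entries.pop(0)  # overflow: drop the oldest entry
        self.entries.append(return_pc & 0xFFFFFFFF)

    def pop(self) -> Optional[int]:
        return self.entries.pop() if self.entries else None


class BranchTargetBuffer:
    def __init__(self, index_bits: int = 6) -> None:
        self.mask = (1 << index_bits) - 1
        self.table: dict[int, tuple[int, int]] = {}  # index -> (tag PC, target)

    def lookup(self, pc: int) -> Optional[int]:
        entry = self.table.get((pc >> 2) & self.mask)
        return entry[1] if entry is not None and entry[0] == pc else None

    def update(self, pc: int, target: int) -> None:
        self.table[(pc >> 2) & self.mask] = (pc, target)


def is_link(reg: int) -> bool:
    return reg in (1, 5)  # x1 (ra) and x5 (t0) are the link registers


def predict_target(pc: int, is_jal: bool, is_jalr: bool, rd: int, rs1: int,
                   btb: BranchTargetBuffer, ras: ReturnAddressStack) -> Optional[int]:
    """Predicted target, or None when neither structure has a prediction."""
    if is_jalr and is_link(rs1) and not is_link(rd):
        return ras.pop()         # treated as a return
    if (is_jal or is_jalr) and is_link(rd):
        ras.push(pc + 4)         # treated as a call
    return btb.lookup(pc)
```

A return's target changes with every call site, so a BTB alone tends to mispredict it; the RAS covers exactly that case.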

Result (`i100`)

O3: cycles 26,483,239 -> 26,098,634, CPI 0.896329890 -> 0.883312866, CoreMark/MHz 3.775973173 -> 3.831618161
Improvement vs previous (O3 i100): +1.474%

Step 5: hzopt

Variant: rv32i-super-br-hybp-btb-ras-hzopt

What changed

Tightened hazard detection to remove false or overly conservative superscalar stalls.
This mirrors the pipe journey, where hazard refinement also unlocked additional throughput.
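One common source of false stalls in a dual-issue design is comparing raw register fields without checking whether they are really used. The sketch below contrasts a conservative check with a more precise one; it is a generic illustration under that assumption, and the `Decoded` fields are hypothetical names, not the signals changed in hzopt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decoded:
    rd: int           # raw rd field
    rs1: int          # raw rs1 field
    rs2: int          # raw rs2 field
    writes_rd: bool   # instruction really writes a register
    reads_rs1: bool   # instruction really reads rs1
    reads_rs2: bool   # instruction really reads rs2


def conservative_raw(older: Decoded, younger: Decoded) -> bool:
    """Flags a dependency whenever the raw fields happen to match."""
    return older.rd in (younger.rs1, younger.rs2)


def precise_raw(older: Decoded, younger: Decoded) -> bool:
    """Flags a dependency only for real writes to a real, non-x0 source."""
    if not older.writes_rd or older.rd == 0:
        return False
    return ((younger.reads_rs1 and younger.rs1 == older.rd)
            or (younger.reads_rs2 and younger.rs2 == older.rd))
```

For example, a store in slot 0 has no destination register, yet the bits in its rd position can match a source of the instruction in slot 1; the conservative check would hold slot 1 back for no reason.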

Result (`i100`)

O3: cycles 26,098,634 -> 25,142,329, CPI 0.883312866 -> 0.850946555, CoreMark/MHz 3.831618161 -> 3.977356274
Improvement vs previous (O3 i100): +3.804%

Step 6: luopt + fulu on superscalar line

Variant: rv32i-super-br-hybp-btb-ras-hzopt-luopt-fulu

What changed

Applied luopt-style load-use optimization plus fuller forwarding to the superscalar line.
This was the largest throughput jump in the superscalar sequence.
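The interaction between forwarding and load-use stalls can be illustrated with a textbook-style operand-source decision: with forwarding from every later stage, the only case that still needs a stall is a value produced by a load that is still in EX. The stage names and the convention that a destination of 0 means "no write" are assumptions of this sketch, not a description of the luopt/fulu RTL.

```python
def operand_source(rs: int, ex_rd: int, ex_is_load: bool,
                   mem_rd: int, wb_rd: int) -> str:
    """Decide where a source operand comes from for the instruction in ID.

    A destination register of 0 means the stage writes nothing.
    Newer producers take priority over older ones.
    """
    if rs == 0:
        return "regfile"          # x0 is constant, never forwarded
    if rs == ex_rd:
        # A load's data is not available until after MEM, so stall once.
        return "stall" if ex_is_load else "forward_ex"
    if rs == mem_rd:
        return "forward_mem"
    if rs == wb_rd:
        return "forward_wb"
    return "regfile"
```

Every path that returns a forward instead of a stall removes a bubble, which fits the large CPI drop reported for this step.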

Result (`i100`)

O3: cycles 25,142,329 -> 21,484,387, CPI 0.850946555 -> 0.727142863, CoreMark/MHz 3.977356274 -> 4.654542855
Improvement vs previous (O3 i100): +17.026%

Consolidated i100 timeline

O3 i100 timeline

Order	Variant	O3 i100 cycles	O3 i100 CPI	O3 i100 IPC	O3 CoreMark/MHz	Improvement vs previous
1	`rv32i-superscalar`	30,832,113	1.043518297	0.958296566	3.243371611	baseline
2	`rv32i-super-br`	28,582,406	0.967376567	1.033723613	3.498655781	+7.871%
3	`rv32i-super-hybp`	26,840,701	0.908428254	1.100802397	3.725685108	+6.489%
4	`rv32i-super-br-hybp`	26,483,239	0.896329890	1.115660664	3.775973173	+1.350%
5	`rv32i-super-br-hybp-btb-ras`	26,098,634	0.883312866	1.132101703	3.831618161	+1.474%
6	`rv32i-super-br-hybp-btb-ras-hzopt`	25,142,329	0.850946555	1.175161935	3.977356274	+3.804%
7	`rv32i-super-br-hybp-btb-ras-hzopt-luopt-fulu`	21,484,387	0.727142863	1.375245568	4.654542855	+17.026%

Full i100 values for superscalar milestones

Variant	O2 i100 cycles	O2 CPI	O2 CM/MHz	O3 i100 cycles	O3 CPI	O3 CM/MHz	Ofast i100 cycles	Ofast CPI	Ofast CM/MHz
`rv32i-superscalar`	32,682,915	1.057730630	3.059702600	30,832,113	1.043518297	3.243371611	30,832,108	1.043518198	3.243372137
`rv32i-super-br`	30,096,281	0.974018330	3.322669668	28,582,406	0.967376567	3.498655781	28,582,403	0.967376531	3.498656149
`rv32i-super-hybp`	27,787,312	0.899292215	3.598764789	26,840,701	0.908428254	3.725685108	26,836,840	0.908297640	3.726221120
`rv32i-super-br-hybp`	27,607,191	0.893462885	3.622244654	26,483,239	0.896329890	3.775973173	26,501,897	0.896961434	3.773314793
`rv32i-super-br-hybp-btb-ras`	27,185,545	0.879816982	3.678425428	26,098,634	0.883312866	3.831618161	26,117,294	0.883944477	3.828880588
`rv32i-super-br-hybp-btb-ras-hzopt`	26,202,245	0.847994040	3.816466871	25,142,329	0.850946555	3.977356274	25,160,989	0.851578163	3.974406570
`rv32i-super-br-hybp-btb-ras-hzopt-luopt-fulu`	22,589,232	0.731064613	4.426887997	21,484,387	0.727142863	4.654542855	21,489,076	0.727301613	4.653527215

Superscalar end status vs start

Using O3 i100:

Cycles: 30,832,113 -> 21,484,387 (~30.32% lower)
CPI: 1.043518297 -> 0.727142863
IPC: 0.958296566 -> 1.375245568
CoreMark/MHz: 3.243371611 -> 4.654542855
Total speedup: ~1.435x
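The per-step gains compound multiplicatively, so their product should equal the end-to-end speedup. The following sketch recomputes that from the O3 i100 cycle counts in the timeline (variable names are illustrative):

```python
from math import prod

# O3 i100 cycles, in timeline order (baseline first, final variant last).
O3_I100_CYCLES = [
    30_832_113, 28_582_406, 26_840_701, 26_483_239,
    26_098_634, 25_142_329, 21_484_387,
]

# Throughput ratio of each step relative to the step before it.
step_ratios = [prev / new for prev, new in zip(O3_I100_CYCLES, O3_I100_CYCLES[1:])]

total_speedup = prod(step_ratios)
cycle_reduction_pct = (1.0 - O3_I100_CYCLES[-1] / O3_I100_CYCLES[0]) * 100.0

if __name__ == "__main__":
    print(f"total speedup:   {total_speedup:.3f}x")
    print(f"cycle reduction: {cycle_reduction_pct:.2f}%")
```

Note that the step percentages do not simply add up to the total; summing them understates the compounded result.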

Count branch for targeted optimization

To lower CPI (raise IPC) further, we created an instrumented tuning branch:

- rv32i-super-br-hybp-btb-ras-hzopt-luopt-fulu-count
- then the working copy rv32i-super-br-hybp-btb-ras-hzopt-luopt-fulu-count-fq

Purpose: add counters, identify dominant loss sources, and optimize with validation after each change.

Current measured bottlenecks (stable O2 i1 count run)

dbg_cycles = 235625
dbg_slot1_bubble = 44920 (~19.064%)
dbg_squash_s1_cycles = 52416 (~22.246%)
dbg_squash_ctrl_cycles = 38603 (~73.647% of s1 squash)
dbg_squash_load_raw_cycles = 13793 (~26.314% of s1 squash)
dbg_squash_mem_cycles = 20 (~0.038% of s1 squash)
dbg_hz_stall_cycles = 7731 (~3.281%)
dbg_hz_flush_cycles = 13623 (~5.782%)
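The percentages above use two different denominators: total cycles for the top-level counters, and slot1 squash cycles for the three squash causes. A small sketch that recomputes them from the raw counter values (the `shares` helper is illustrative):

```python
COUNTERS = {
    "dbg_cycles": 235625,
    "dbg_slot1_bubble": 44920,
    "dbg_squash_s1_cycles": 52416,
    "dbg_squash_ctrl_cycles": 38603,
    "dbg_squash_load_raw_cycles": 13793,
    "dbg_squash_mem_cycles": 20,
    "dbg_hz_stall_cycles": 7731,
    "dbg_hz_flush_cycles": 13623,
}

# Counters reported relative to slot1 squash cycles rather than total cycles.
SQUASH_CAUSES = (
    "dbg_squash_ctrl_cycles",
    "dbg_squash_load_raw_cycles",
    "dbg_squash_mem_cycles",
)


def shares(counters: dict[str, int]) -> dict[str, float]:
    """Percentage for each counter, using the denominator the log uses."""
    total = counters["dbg_cycles"]
    squash = counters["dbg_squash_s1_cycles"]
    result = {}
    for name, value in counters.items():
        if name == "dbg_cycles":
            continue
        denom = squash if name in SQUASH_CAUSES else total
        result[name] = 100.0 * value / denom
    return result


if __name__ == "__main__":
    for name, pct in shares(COUNTERS).items():
        print(f"{name:28s} {pct:7.3f}%")
```

The three squash causes sum exactly to dbg_squash_s1_cycles, which is a useful sanity check that the counters are mutually exclusive.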

What this indicates

- The main limiter is still control-driven slot1 squash (~73.6% of slot1 squash cycles).
- The secondary limiter is load-RAW-driven slot1 squash (~26.3%).
- Memory-alias squash is currently negligible (~0.04%).

What was tried in count/fq and failed (reverted)

The following were attempted and then reverted after failing correctness checks (CRC mismatch, timeout loops, or program corruption):

- Slot1 keep-path for pred_taken && resolved_not_taken branch case.
- Fetch queue / prefetch queue variants (1-entry/2-entry style attempts).
- Store-data-only load-RAW squash relaxation in hazard path.

These attempts often showed apparently lower cycles in short runs, but failed correctness (x12 CRC mismatch or non-halting behavior), so they were not retained.
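The retention rule described here amounts to a gate in which correctness is checked before cycle counts are even compared. A minimal sketch follows, using the expected CRCs listed at the top of this log; the `RunResult` record, the function names, and the candidate numbers in the usage note are hypothetical.

```python
from dataclasses import dataclass

# Expected CoreMark CRCs per ITERATIONS value (from this log).
EXPECTED_CRC = {1: 0x0000E3C1, 10: 0x0000C64E, 100: 0x0000844D}


@dataclass(frozen=True)
class RunResult:
    iterations: int
    cycles: int
    crc: int
    halted: bool  # False for timeout / non-halting runs


def is_correct(run: RunResult) -> bool:
    """A run counts only if it halted and produced the expected CRC."""
    return run.halted and EXPECTED_CRC.get(run.iterations) == run.crc


def keep_change(candidate: RunResult, baseline: RunResult) -> bool:
    """Retain a change only if it is correct AND faster than the baseline."""
    return is_correct(candidate) and candidate.cycles < baseline.cycles
```

With the retained O2 i1 reference as baseline (235625 cycles, CRC 0x0000e3c1), a hypothetical candidate that finishes in fewer cycles but reports a different CRC is rejected by `keep_change`, which is the situation described for the reverted attempts.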

Current status

The retained working branch remains correctness-first.
Stable O2 i1 retained reference:
- cycles 235625
- CPI 0.729538854
- IPC 1.370728912
- CRC 0x0000e3c1

Ongoing work note

I am still working on lowering CPI (raising IPC) in the superscalar count/fq path, but I keep only changes that pass the full correctness checks (instruction suite + C regressions + CoreMark smoke CRC) before claiming an improvement.
