RV32I Superscalar Optimization Journey
Scope and benchmark method
- This log tracks the superscalar path from `rv32i-superscalar` to the current superscalar optimized line.
- CoreMark is tracked at `-O2`, `-O3`, and `-Ofast`, with `ITERATIONS=1/10/100`.
- For step-by-step comparison, `ITERATIONS=100` is the main reference point (same style as pipe journey).
- Metrics used:
  - `CoreMark/MHz = (ITERATIONS * 1,000,000) / cycles`
  - `CPI = cycles / retired_instructions`
  - `IPC = retired_instructions / cycles`
- CRC checks on successful runs:
  - `ITER=1`: `0x0000e3c1`
  - `ITER=10`: `0x0000c64e`
  - `ITER=100`: `0x0000844d`
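
As a quick cross-check of the metric definitions, the three formulas can be written as small helpers. This is only a sketch: the function names are illustrative and not part of the project's tooling, and the sample input is the baseline `O3 i100` cycle count reported in this log.

```python
def coremark_per_mhz(iterations: int, cycles: int) -> float:
    """CoreMark/MHz = (ITERATIONS * 1,000,000) / cycles."""
    return iterations * 1_000_000 / cycles


def cpi(cycles: int, retired_instructions: int) -> float:
    """Cycles per retired instruction."""
    return cycles / retired_instructions


def ipc(cycles: int, retired_instructions: int) -> float:
    """Retired instructions per cycle (the reciprocal of CPI)."""
    return retired_instructions / cycles


if __name__ == "__main__":
    # Baseline rv32i-superscalar, O3, ITERATIONS=100 (cycle count from this log).
    print(f"CoreMark/MHz = {coremark_per_mhz(100, 30_832_113):.9f}")
```

Because CPI and IPC are reciprocals, either one can be checked against the other without the raw retired-instruction count.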
How this superscalar path was built (relative to pipe)
- The pipe journey explored predictor transitions in more granular steps (static/dynamic sweeps, later hybrid replacement, and later RAS addition).
- For superscalar, we reused that learning directly:
  - started with BR (early ID branch resolve),
  - moved directly to HyBP in superscalar (instead of repeating static/dynamic sweeps),
  - combined BR+HyBP,
  - added BTB+RAS together in superscalar path,
  - then applied hzopt and luopt/fulu style improvements.
- CoreMark ELF reuse policy followed the same copied-ELF workflow used in the pipe-based comparisons.
Starting point: superscalar baseline
Variant: rv32i-superscalar
- `O2 i100`: cycles `32,682,915`, CPI `1.057730630`, CoreMark/MHz `3.059702600`
- `O3 i100`: cycles `30,832,113`, CPI `1.043518297`, CoreMark/MHz `3.243371611`
- `Ofast i100`: cycles `30,832,108`, CPI `1.043518198`, CoreMark/MHz `3.243372137`
Step 1: early branch resolution in superscalar
Variant: rv32i-super-br
What changed
- Added early branch decision in ID for slot 0 path with branch forwarding.
- Kept superscalar issue flow; branch control redirects happen earlier.
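
The early-resolve idea can be modelled in software as "forward the freshest operand values, then evaluate the RV32I branch condition in ID". The sketch below is a behavioural illustration only, not the RTL: the `forward` helper and its EX/MEM forwarding sources are assumptions about the pipeline, while the `funct3` encodings are the standard RV32I branch encodings.

```python
def to_signed32(value: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def forward(reg_value: int, src: int, ex_rd: int, ex_value: int,
            mem_rd: int, mem_value: int) -> int:
    """Pick the newest value for source register `src` (assumed EX then MEM priority)."""
    if src != 0 and src == ex_rd:    # x0 is never forwarded
        return ex_value
    if src != 0 and src == mem_rd:
        return mem_value
    return reg_value


def branch_taken(funct3: int, rs1: int, rs2: int) -> bool:
    """Evaluate an RV32I conditional branch from its funct3 field."""
    a_u, b_u = rs1 & 0xFFFFFFFF, rs2 & 0xFFFFFFFF
    a_s, b_s = to_signed32(rs1), to_signed32(rs2)
    if funct3 == 0b000:
        return a_u == b_u    # BEQ
    if funct3 == 0b001:
        return a_u != b_u    # BNE
    if funct3 == 0b100:
        return a_s < b_s     # BLT
    if funct3 == 0b101:
        return a_s >= b_s    # BGE
    if funct3 == 0b110:
        return a_u < b_u     # BLTU
    if funct3 == 0b111:
        return a_u >= b_u    # BGEU
    raise ValueError(f"not a branch funct3: {funct3:#05b}")
```

Resolving in ID shortens the redirect path, which is consistent with the cycle reduction reported for this step.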
Result (i100)
- `O3`: cycles `30,832,113 -> 28,582,406`, CPI `1.043518297 -> 0.967376567`, CoreMark/MHz `3.243371611 -> 3.498655781`
- Improvement vs previous (`O3 i100`): `+7.871%`
Step 2: superscalar hybrid predictor
Variant: rv32i-super-hybp
What changed
- Added tournament-style branch direction predictor (local/global/choice + GHR) to superscalar flow.
- This was taken directly from pipe learning; no repeated static/dynamic superscalar sweep.
Result (i100)
- `O3`: cycles `28,582,406 -> 26,840,701`, CPI `0.967376567 -> 0.908428254`, CoreMark/MHz `3.498655781 -> 3.725685108`
- Improvement vs previous (`O3 i100`): `+6.489%`
Step 3: combine BR + HyBP
Variant: rv32i-super-br-hybp
What changed
- Combined slot-0 ID branch resolve with HyBP update/redirect behavior in superscalar pipeline.
Result (i100)
- `O3`: cycles `26,840,701 -> 26,483,239`, CPI `0.908428254 -> 0.896329890`, CoreMark/MHz `3.725685108 -> 3.775973173`
- Improvement vs previous (`O3 i100`): `+1.350%`
Step 4: add BTB + RAS
Variant: rv32i-super-br-hybp-btb-ras
What changed
- Added BTB target prediction and RAS return prediction in superscalar path.
- In pipe path, these features were introduced in staged evolution; here they were integrated together after BR+HyBP was stable.
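
The interaction between a BTB and a RAS can be sketched as follows. This is an illustrative model under stated assumptions: the BTB is a plain dictionary from PC to last-seen target, the RAS depth of 8 is arbitrary, and the push/pop rules follow the link-register hints in the RISC-V unprivileged specification (`x1` and `x5` as link registers). The actual design may use different sizes and policies.

```python
class ReturnAddressStack:
    """Small bounded stack of predicted return addresses."""

    def __init__(self, depth: int = 8):  # depth is an assumed value
        self.depth = depth
        self.entries = []

    def push(self, return_pc: int) -> None:
        if len(self.entries) == self.depth:
            self.entries.pop(0)          # overflow: drop the oldest entry
        self.entries.append(return_pc & 0xFFFFFFFF)

    def pop(self):
        return self.entries.pop() if self.entries else None


def is_link(reg: int) -> bool:
    # x1 (ra) and x5 (t0) are the link registers named by the RISC-V hints.
    return reg in (1, 5)


def predict_target(pc, is_jal, is_jalr, rd, rs1, btb, ras):
    """Return a predicted target, or None to fall back to sequential fetch."""
    target = btb.get(pc)                 # BTB guess for this PC, if any
    rd_link, rs1_link = is_link(rd), is_link(rs1)
    if is_jalr and rs1_link and not (rd_link and rd == rs1):
        popped = ras.pop()               # return-like JALR: prefer the RAS
        if popped is not None:
            target = popped
    if (is_jal or is_jalr) and rd_link:
        ras.push(pc + 4)                 # call-like jump: save the return address
    return target
```

For example, a `jal ra, f` at `0x100` pushes `0x104`, and a later `jalr x0, ra, 0` is predicted to return to `0x104` even if the BTB has never seen that return.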
Result (i100)
- `O3`: cycles `26,483,239 -> 26,098,634`, CPI `0.896329890 -> 0.883312866`, CoreMark/MHz `3.775973173 -> 3.831618161`
- Improvement vs previous (`O3 i100`): `+1.474%`
Step 5: hzopt
Variant: rv32i-super-br-hybp-btb-ras-hzopt
What changed
- Hazard behavior tightened to reduce false/overly conservative superscalar stalls.
- Similar trend to pipe where hazard refinement helped unlock additional throughput.
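
This log does not spell out which conservative stalls were removed, so the following is only an illustration of one common class of false hazard, not a description of the actual hzopt change. A naive intra-pair RAW check compares raw register fields, which flags a dependency even when the older instruction writes `x0` or the younger instruction does not read that field at all. The opcode sets below are the standard RV32I major opcodes; the function names are hypothetical.

```python
# RV32I major opcodes, grouped by which register fields they really use.
USES_RS1 = {0x33, 0x13, 0x03, 0x23, 0x63, 0x67}          # OP, OP-IMM, LOAD, STORE, BRANCH, JALR
USES_RS2 = {0x33, 0x23, 0x63}                            # OP, STORE, BRANCH
WRITES_RD = {0x33, 0x13, 0x03, 0x37, 0x17, 0x6F, 0x67}   # plus LUI, AUIPC, JAL


def fields(instr: int):
    """Return (opcode, rd, rs1, rs2) bit fields of a 32-bit instruction."""
    return instr & 0x7F, (instr >> 7) & 0x1F, (instr >> 15) & 0x1F, (instr >> 20) & 0x1F


def conservative_raw(slot0: int, slot1: int) -> bool:
    """Flag a hazard on any raw field match, even for unused fields."""
    _, rd0, _, _ = fields(slot0)
    _, _, rs1, rs2 = fields(slot1)
    return rd0 in (rs1, rs2)


def tightened_raw(slot0: int, slot1: int) -> bool:
    """Flag a hazard only when slot1 truly reads a register slot0 truly writes."""
    op0, rd0, _, _ = fields(slot0)
    op1, _, rs1, rs2 = fields(slot1)
    if op0 not in WRITES_RD or rd0 == 0:
        return False
    return (op1 in USES_RS1 and rs1 == rd0) or (op1 in USES_RS2 and rs2 == rd0)


if __name__ == "__main__":
    addi_x5 = (1 << 20) | (5 << 7) | 0x13        # addi x5, x0, 1
    lui_x6 = (0x00500 << 12) | (6 << 7) | 0x37   # lui x6, 0x00500 (immediate bits alias rs2=5)
    print(conservative_raw(addi_x5, lui_x6), tightened_raw(addi_x5, lui_x6))
```

SYSTEM/CSR instructions are left out of the opcode sets for brevity; a complete check would need to cover them too.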
Result (i100)
- `O3`: cycles `26,098,634 -> 25,142,329`, CPI `0.883312866 -> 0.850946555`, CoreMark/MHz `3.831618161 -> 3.977356274`
- Improvement vs previous (`O3 i100`): `+3.804%`
Step 6: luopt + fulu on superscalar line
Variant: rv32i-super-br-hybp-btb-ras-hzopt-luopt-fulu
What changed
- Applied load-use optimization style plus fuller forwarding behavior in superscalar line.
- This was the largest throughput jump in the superscalar sequence.
Result (i100)
- `O3`: cycles `25,142,329 -> 21,484,387`, CPI `0.850946555 -> 0.727142863`, CoreMark/MHz `3.977356274 -> 4.654542855`
- Improvement vs previous (`O3 i100`): `+17.026%`
Consolidated i100 timeline
O3 i100 timeline
| Order | Variant | O3 i100 cycles | O3 i100 CPI | O3 i100 IPC | O3 CoreMark/MHz | Improvement vs previous |
|---|---|---|---|---|---|---|
| 1 | `rv32i-superscalar` | 30,832,113 | 1.043518297 | 0.958296566 | 3.243371611 | baseline |
| 2 | `rv32i-super-br` | 28,582,406 | 0.967376567 | 1.033723613 | 3.498655781 | +7.871% |
| 3 | `rv32i-super-hybp` | 26,840,701 | 0.908428254 | 1.100802397 | 3.725685108 | +6.489% |
| 4 | `rv32i-super-br-hybp` | 26,483,239 | 0.896329890 | 1.115660664 | 3.775973173 | +1.350% |
| 5 | `rv32i-super-br-hybp-btb-ras` | 26,098,634 | 0.883312866 | 1.132101703 | 3.831618161 | +1.474% |
| 6 | `rv32i-super-br-hybp-btb-ras-hzopt` | 25,142,329 | 0.850946555 | 1.175161935 | 3.977356274 | +3.804% |
| 7 | `rv32i-super-br-hybp-btb-ras-hzopt-luopt-fulu` | 21,484,387 | 0.727142863 | 1.375245568 | 4.654542855 | +17.026% |
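
The "Improvement vs previous" column matches the CoreMark/MHz gain, which for a fixed iteration count equals `previous_cycles / new_cycles - 1`. The short script below recomputes the column from the `O3 i100` cycle counts in the timeline; the helper name is illustrative.

```python
O3_I100_CYCLES = [
    ("rv32i-superscalar", 30_832_113),
    ("rv32i-super-br", 28_582_406),
    ("rv32i-super-hybp", 26_840_701),
    ("rv32i-super-br-hybp", 26_483_239),
    ("rv32i-super-br-hybp-btb-ras", 26_098_634),
    ("rv32i-super-br-hybp-btb-ras-hzopt", 25_142_329),
    ("rv32i-super-br-hybp-btb-ras-hzopt-luopt-fulu", 21_484_387),
]


def improvement_pct(prev_cycles: int, new_cycles: int) -> float:
    """Throughput gain in percent when the cycle count drops from prev to new."""
    return (prev_cycles / new_cycles - 1.0) * 100.0


if __name__ == "__main__":
    for (_, prev), (name, cur) in zip(O3_I100_CYCLES, O3_I100_CYCLES[1:]):
        print(f"{name}: +{improvement_pct(prev, cur):.3f}%")
    first, last = O3_I100_CYCLES[0][1], O3_I100_CYCLES[-1][1]
    print(f"total speedup: {first / last:.3f}x")
```

Note that the per-step gains multiply rather than add, so the end-to-end speedup should be taken from the first and last cycle counts rather than from a sum of the percentages.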
Full i100 values for superscalar milestones
| Variant | O2 i100 cycles | O2 CPI | O2 CM/MHz | O3 i100 cycles | O3 CPI | O3 CM/MHz | Ofast i100 cycles | Ofast CPI | Ofast CM/MHz |
|---|---|---|---|---|---|---|---|---|---|
| `rv32i-superscalar` | 32,682,915 | 1.057730630 | 3.059702600 | 30,832,113 | 1.043518297 | 3.243371611 | 30,832,108 | 1.043518198 | 3.243372137 |
| `rv32i-super-br` | 30,096,281 | 0.974018330 | 3.322669668 | 28,582,406 | 0.967376567 | 3.498655781 | 28,582,403 | 0.967376531 | 3.498656149 |
| `rv32i-super-hybp` | 27,787,312 | 0.899292215 | 3.598764789 | 26,840,701 | 0.908428254 | 3.725685108 | 26,836,840 | 0.908297640 | 3.726221120 |
| `rv32i-super-br-hybp` | 27,607,191 | 0.893462885 | 3.622244654 | 26,483,239 | 0.896329890 | 3.775973173 | 26,501,897 | 0.896961434 | 3.773314793 |
| `rv32i-super-br-hybp-btb-ras` | 27,185,545 | 0.879816982 | 3.678425428 | 26,098,634 | 0.883312866 | 3.831618161 | 26,117,294 | 0.883944477 | 3.828880588 |
| `rv32i-super-br-hybp-btb-ras-hzopt` | 26,202,245 | 0.847994040 | 3.816466871 | 25,142,329 | 0.850946555 | 3.977356274 | 25,160,989 | 0.851578163 | 3.974406570 |
| `rv32i-super-br-hybp-btb-ras-hzopt-luopt-fulu` | 22,589,232 | 0.731064613 | 4.426887997 | 21,484,387 | 0.727142863 | 4.654542855 | 21,489,076 | 0.727301613 | 4.653527215 |
Superscalar end status vs start
Using O3 i100:
- Cycles: `30,832,113 -> 21,484,387` (`~30.32%` lower)
- CPI: `1.043518297 -> 0.727142863`
- IPC: `0.958296566 -> 1.375245568`
- CoreMark/MHz: `3.243371611 -> 4.654542855`
- Total speedup: `~1.435x`
Count branch for targeted optimization
To improve CPI/IPC further, we created a measurement-driven tuning branch:
- `rv32i-super-br-hybp-btb-ras-hzopt-luopt-fulu-count`
- then working copy `rv32i-super-br-hybp-btb-ras-hzopt-luopt-fulu-count-fq`
Purpose: add counters, identify dominant loss sources, and optimize with validation after each change.
Current measured bottlenecks (stable O2 i1 count run)
- `dbg_cycles = 235625`
- `dbg_slot1_bubble = 44920` (`~19.064%`)
- `dbg_squash_s1_cycles = 52416` (`~22.246%`)
- `dbg_squash_ctrl_cycles = 38603` (`~73.647%` of s1 squash)
- `dbg_squash_load_raw_cycles = 13793` (`~26.314%` of s1 squash)
- `dbg_squash_mem_cycles = 20` (`~0.038%` of s1 squash)
- `dbg_hz_stall_cycles = 7731` (`~3.281%`)
- `dbg_hz_flush_cycles = 13623` (`~5.782%`)
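
The percentages for these counters can be reproduced with a few lines. The counter values are copied from the run reported here; the dictionary layout and the `pct` helper are illustrative and not part of the project's tooling. The three slot1 squash causes are expressed relative to `dbg_squash_s1_cycles`, everything else relative to `dbg_cycles`.

```python
COUNTERS = {
    "dbg_cycles": 235_625,
    "dbg_slot1_bubble": 44_920,
    "dbg_squash_s1_cycles": 52_416,
    "dbg_squash_ctrl_cycles": 38_603,
    "dbg_squash_load_raw_cycles": 13_793,
    "dbg_squash_mem_cycles": 20,
    "dbg_hz_stall_cycles": 7_731,
    "dbg_hz_flush_cycles": 13_623,
}

SQUASH_CAUSES = (
    "dbg_squash_ctrl_cycles",
    "dbg_squash_load_raw_cycles",
    "dbg_squash_mem_cycles",
)


def pct(part: int, whole: int) -> float:
    """Share of `whole` taken by `part`, in percent."""
    return 100.0 * part / whole


if __name__ == "__main__":
    total = COUNTERS["dbg_cycles"]
    squash = COUNTERS["dbg_squash_s1_cycles"]
    for name, value in COUNTERS.items():
        if name == "dbg_cycles":
            continue
        base, label = (squash, "of s1 squash") if name in SQUASH_CAUSES else (total, "of cycles")
        print(f"{name}: {value} ({pct(value, base):.3f}% {label})")
```

In this run the three squash causes sum exactly to `dbg_squash_s1_cycles` (38603 + 13793 + 20 = 52416), so the breakdown accounts for every slot1 squash cycle.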
What this indicates
- Main limiter is still control-driven slot1 squash.
- Secondary limiter is load-RAW-driven slot1 squash.
- Memory-alias squash is very small currently.
What was tried in count/fq and failed (reverted)
The following were attempted and then reverted after failing correctness checks (CRC mismatch, timeout loops, or program corruption):
- Slot1 keep-path for `pred_taken && resolved_not_taken` branch case.
- Fetch queue / prefetch queue variants (1-entry/2-entry style attempts).
- Store-data-only load-RAW squash relaxation in hazard path.
These attempts often showed apparently lower cycles in short runs, but failed correctness (x12 CRC mismatch or non-halting behavior), so they were not retained.
Current status
- Working retained branch state remains correctness-first.
- Stable `O2 i1` retained reference:
  - cycles `235625`
  - CPI `0.729538854`
  - IPC `1.370728912`
  - CRC `0x0000e3c1`
Ongoing work note
I am still working on pushing CPI lower (and IPC higher) in the superscalar count/fq path, but I only keep changes that pass the full correctness checks (instruction suite + C regressions + CoreMark smoke CRC) before claiming an improvement.