RV32I Superscalar Optimization Journey

Scope and benchmark method

  • This log tracks the superscalar path from rv32i-superscalar to the current optimized superscalar line.
  • CoreMark is tracked at -O2, -O3, and -Ofast, with ITERATIONS=1/10/100.
  • For step-by-step comparison, ITERATIONS=100 is the main reference point (the same convention as the pipe journey).
  • Metrics used:
    • CoreMark/MHz = (ITERATIONS * 1,000,000) / cycles
    • CPI = cycles / retired_instructions
    • IPC = retired_instructions / cycles
  • CRC checks on successful runs:
    • ITER=1: 0x0000e3c1
    • ITER=10: 0x0000c64e
    • ITER=100: 0x0000844d
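
As a quick sanity check, the metric formulas above can be written out in a few lines of Python. This is only an illustrative sketch: the function names are made up for this note and are not part of the benchmark harness.

```python
def coremark_per_mhz(iterations: int, cycles: int) -> float:
    """CoreMark/MHz = (ITERATIONS * 1,000,000) / cycles."""
    return iterations * 1_000_000 / cycles


def cpi(cycles: int, retired_instructions: int) -> float:
    """Cycles per retired instruction."""
    return cycles / retired_instructions


def ipc(cycles: int, retired_instructions: int) -> float:
    """Retired instructions per cycle (the reciprocal of CPI)."""
    return retired_instructions / cycles


# Baseline O3 i100 cycle count taken from this log.
baseline = coremark_per_mhz(100, 30_832_113)
print(f"baseline O3 i100 CoreMark/MHz: {baseline:.9f}")
```

Because ITERATIONS is fixed within a comparison, CoreMark/MHz moves inversely with the cycle count, so either number can be used to rank variants.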

How this superscalar path was built (relative to pipe)

  • The pipe journey explored predictor transitions in more granular steps (static/dynamic sweeps, later hybrid replacement, and later RAS addition).
  • For superscalar, we reused that learning directly:
    • started with BR (early ID branch resolve),
    • moved directly to HyBP in superscalar (instead of repeating static/dynamic sweeps),
    • combined BR+HyBP,
    • added BTB+RAS together in superscalar path,
    • then applied hzopt and luopt/fulu style improvements.
  • CoreMark ELF reuse policy followed the same copied-ELF workflow used in the pipe-based comparisons.

Starting point: superscalar baseline

Variant: rv32i-superscalar

  • O2 i100: cycles 32,682,915, CPI 1.057730630, CoreMark/MHz 3.059702600
  • O3 i100: cycles 30,832,113, CPI 1.043518297, CoreMark/MHz 3.243371611
  • Ofast i100: cycles 30,832,108, CPI 1.043518198, CoreMark/MHz 3.243372137

Step 1: early branch resolution in superscalar

Variant: rv32i-super-br

What changed

  • Added early branch decision in ID for slot 0 path with branch forwarding.
  • Kept superscalar issue flow; branch control redirects happen earlier.
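
The idea of resolving a branch in ID with forwarded operands can be sketched behaviorally as below. This is a hedged model, not the RTL: the function names and the two forwarding sources (EX and MEM) are assumptions for illustration, while the funct3 encodings are the standard RV32I ones. A real design also has to stall when the newest producer is a load whose data is not ready yet.

```python
MASK32 = 0xFFFFFFFF


def to_signed(value: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def forward_operand(rs, regfile_value, ex_rd, ex_value, mem_rd, mem_value):
    """Pick the newest value for source register rs (x0 always reads 0)."""
    if rs == 0:
        return 0
    if ex_rd == rs:      # youngest in-flight producer wins
        return ex_value
    if mem_rd == rs:
        return mem_value
    return regfile_value


def branch_taken(funct3: int, a: int, b: int) -> bool:
    """Evaluate an RV32I conditional branch on two 32-bit operands."""
    a &= MASK32
    b &= MASK32
    if funct3 == 0b000:  # BEQ
        return a == b
    if funct3 == 0b001:  # BNE
        return a != b
    if funct3 == 0b100:  # BLT (signed)
        return to_signed(a) < to_signed(b)
    if funct3 == 0b101:  # BGE (signed)
        return to_signed(a) >= to_signed(b)
    if funct3 == 0b110:  # BLTU (unsigned)
        return a < b
    if funct3 == 0b111:  # BGEU (unsigned)
        return a >= b
    raise ValueError(f"not a branch funct3: {funct3:#05b}")
```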

Result (i100)

  • O3: cycles 30,832,113 -> 28,582,406, CPI 1.043518297 -> 0.967376567, CoreMark/MHz 3.243371611 -> 3.498655781
  • Improvement vs previous (O3 i100): +7.871%

Step 2: superscalar hybrid predictor

Variant: rv32i-super-hybp

What changed

  • Added tournament-style branch direction predictor (local/global/choice + GHR) to superscalar flow.
  • This was taken directly from pipe learning; no repeated static/dynamic superscalar sweep.
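
For readers unfamiliar with the structure, a tournament predictor with local, global, and choice tables plus a GHR can be modeled as below. This is a toy sketch under stated assumptions: the table sizes, the PC-indexed local table, and the GHR-only global index are illustrative choices, not the parameters of the HyBP implementation in this project.

```python
def bump(counter: int, up: bool) -> int:
    """Saturating 2-bit counter update (range 0..3)."""
    return min(counter + 1, 3) if up else max(counter - 1, 0)


class TournamentPredictor:
    def __init__(self, index_bits: int = 6, history_bits: int = 6):
        self.index_mask = (1 << index_bits) - 1
        self.history_mask = (1 << history_bits) - 1
        self.local = [1] * (1 << index_bits)        # PC-indexed counters
        self.global_ = [1] * (1 << history_bits)    # GHR-indexed counters
        self.choice = [1] * (1 << index_bits)       # >= 2 means "trust global"
        self.ghr = 0                                # global history register

    def _indices(self, pc: int):
        return (pc >> 2) & self.index_mask, self.ghr & self.history_mask

    def predict(self, pc: int) -> bool:
        li, gi = self._indices(pc)
        if self.choice[li] >= 2:
            return self.global_[gi] >= 2
        return self.local[li] >= 2

    def update(self, pc: int, taken: bool) -> None:
        li, gi = self._indices(pc)
        local_ok = (self.local[li] >= 2) == taken
        global_ok = (self.global_[gi] >= 2) == taken
        if local_ok != global_ok:                   # train chooser on disagreement
            self.choice[li] = bump(self.choice[li], global_ok)
        self.local[li] = bump(self.local[li], taken)
        self.global_[gi] = bump(self.global_[gi], taken)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.history_mask


# An alternating branch defeats the local counter but is learned via history.
bp = TournamentPredictor()
late_mispredicts = 0
for i in range(120):
    taken = i % 2 == 0
    if i >= 100 and bp.predict(0x100) != taken:
        late_mispredicts += 1
    bp.update(0x100, taken)
print("mispredicts in last 20 branches:", late_mispredicts)
```

The alternating pattern is the case where the chooser earns its keep: the PC-indexed counter keeps flipping, while the history-indexed table settles after a short warm-up.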

Result (i100)

  • O3: cycles 28,582,406 -> 26,840,701, CPI 0.967376567 -> 0.908428254, CoreMark/MHz 3.498655781 -> 3.725685108
  • Improvement vs previous (O3 i100): +6.489%

Step 3: combine BR + HyBP

Variant: rv32i-super-br-hybp

What changed

  • Combined slot-0 ID branch resolve with HyBP update/redirect behavior in superscalar pipeline.

Result (i100)

  • O3: cycles 26,840,701 -> 26,483,239, CPI 0.908428254 -> 0.896329890, CoreMark/MHz 3.725685108 -> 3.775973173
  • Improvement vs previous (O3 i100): +1.350%

Step 4: add BTB + RAS

Variant: rv32i-super-br-hybp-btb-ras

What changed

  • Added BTB target prediction and RAS return prediction in superscalar path.
  • In pipe path, these features were introduced in staged evolution; here they were integrated together after BR+HyBP was stable.
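
A minimal behavioral sketch of how a BTB and a RAS cooperate on jumps is shown below. Everything here is an assumption for illustration: the dict-based BTB, the stack depth, the overflow policy, and the function names are hypothetical. The call/return classification follows the usual RISC-V convention of x1 and x5 as link registers, simplified so that a JALR with a link rs1 and a non-link rd is treated as a return.

```python
class ReturnAddressStack:
    def __init__(self, depth: int = 8):
        self.depth = depth
        self.entries = []

    def push(self, return_pc: int) -> None:
        if len(self.entries) == self.depth:
            self.entries.pop(0)          # overflow: drop the oldest entry
        self.entries.append(return_pc)

    def pop(self):
        return self.entries.pop() if self.entries else None


def is_link(reg: int) -> bool:
    """x1 (ra) and x5 (t0) are the conventional RISC-V link registers."""
    return reg in (1, 5)


def predict_target(pc, rd, rs1, is_jal, is_jalr, btb, ras):
    """Return a predicted target, or None to fall back to pc + 4."""
    if is_jalr and is_link(rs1) and not is_link(rd):
        target = ras.pop()               # looks like a return
    else:
        target = btb.get(pc)             # last seen target for this pc
    if (is_jal or is_jalr) and is_link(rd):
        ras.push(pc + 4)                 # looks like a call
    return target


btb = {0x100: 0x400}                     # toy BTB: pc -> target
ras = ReturnAddressStack()
call_target = predict_target(0x100, rd=1, rs1=0, is_jal=True,
                             is_jalr=False, btb=btb, ras=ras)
ret_target = predict_target(0x410, rd=0, rs1=1, is_jal=False,
                            is_jalr=True, btb=btb, ras=ras)
print(hex(call_target), hex(ret_target))
```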

Result (i100)

  • O3: cycles 26,483,239 -> 26,098,634, CPI 0.896329890 -> 0.883312866, CoreMark/MHz 3.775973173 -> 3.831618161
  • Improvement vs previous (O3 i100): +1.474%

Step 5: hzopt

Variant: rv32i-super-br-hybp-btb-ras-hzopt

What changed

  • Hazard behavior tightened to reduce false/overly conservative superscalar stalls.
  • Similar trend to pipe where hazard refinement helped unlock additional throughput.
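
One common source of overly conservative stalls is comparing register fields that the younger instruction does not actually read. The sketch below illustrates that general pattern only; it is not a description of the specific hzopt change, and the Instr fields and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    rd: int = 0
    rs1: int = 0
    rs2: int = 0            # raw bits [24:20]; part of the immediate for I-type
    uses_rs1: bool = False
    uses_rs2: bool = False
    writes_rd: bool = False
    is_load: bool = False


def conservative_load_use(older: Instr, younger: Instr) -> bool:
    """Stalls on any field match, even when the field is not a real source."""
    return older.is_load and older.rd in (younger.rs1, younger.rs2)


def precise_load_use(older: Instr, younger: Instr) -> bool:
    """Stalls only when the younger instruction really reads the loaded register."""
    if not (older.is_load and older.writes_rd and older.rd != 0):
        return False
    return (younger.uses_rs1 and younger.rs1 == older.rd) or (
        younger.uses_rs2 and younger.rs2 == older.rd
    )


# lw x5, 0(x1) followed by addi x6, x2, 5: bits [24:20] of the addi hold
# imm[4:0] = 5, which only looks like a read of x5.
load = Instr(rd=5, rs1=1, uses_rs1=True, writes_rd=True, is_load=True)
addi = Instr(rd=6, rs1=2, rs2=5, uses_rs1=True, writes_rd=True)
print(conservative_load_use(load, addi), precise_load_use(load, addi))
```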

Result (i100)

  • O3: cycles 26,098,634 -> 25,142,329, CPI 0.883312866 -> 0.850946555, CoreMark/MHz 3.831618161 -> 3.977356274
  • Improvement vs previous (O3 i100): +3.804%

Step 6: luopt + fulu on superscalar line

Variant: rv32i-super-br-hybp-btb-ras-hzopt-luopt-fulu

What changed

  • Applied the luopt/fulu changes to the superscalar line: load-use optimization plus fuller forwarding.
  • This was the largest throughput jump in the superscalar sequence.

Result (i100)

  • O3: cycles 25,142,329 -> 21,484,387, CPI 0.850946555 -> 0.727142863, CoreMark/MHz 3.977356274 -> 4.654542855
  • Improvement vs previous (O3 i100): +17.026%

Consolidated i100 timeline

O3 i100 timeline

Order | Variant                                      | O3 i100 cycles | O3 i100 CPI | O3 i100 IPC | O3 CoreMark/MHz | Improvement vs previous
------|----------------------------------------------|----------------|-------------|-------------|-----------------|------------------------
1     | rv32i-superscalar                            | 30,832,113     | 1.043518297 | 0.958296566 | 3.243371611     | baseline
2     | rv32i-super-br                               | 28,582,406     | 0.967376567 | 1.033723613 | 3.498655781     | +7.871%
3     | rv32i-super-hybp                             | 26,840,701     | 0.908428254 | 1.100802397 | 3.725685108     | +6.489%
4     | rv32i-super-br-hybp                          | 26,483,239     | 0.896329890 | 1.115660664 | 3.775973173     | +1.350%
5     | rv32i-super-br-hybp-btb-ras                  | 26,098,634     | 0.883312866 | 1.132101703 | 3.831618161     | +1.474%
6     | rv32i-super-br-hybp-btb-ras-hzopt            | 25,142,329     | 0.850946555 | 1.175161935 | 3.977356274     | +3.804%
7     | rv32i-super-br-hybp-btb-ras-hzopt-luopt-fulu | 21,484,387     | 0.727142863 | 1.375245568 | 4.654542855     | +17.026%
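
The "Improvement vs previous" column can be recomputed from the cycle counts alone, since ITERATIONS is fixed and the CoreMark/MHz ratio equals the inverse cycle ratio. A small sketch follows; the helper name is illustrative.

```python
# (variant, O3 i100 cycles) in timeline order, copied from this log.
timeline = [
    ("rv32i-superscalar", 30_832_113),
    ("rv32i-super-br", 28_582_406),
    ("rv32i-super-hybp", 26_840_701),
    ("rv32i-super-br-hybp", 26_483_239),
    ("rv32i-super-br-hybp-btb-ras", 26_098_634),
    ("rv32i-super-br-hybp-btb-ras-hzopt", 25_142_329),
    ("rv32i-super-br-hybp-btb-ras-hzopt-luopt-fulu", 21_484_387),
]


def improvement_pct(prev_cycles: int, cycles: int) -> float:
    """Throughput gain in percent when cycles drop from prev_cycles to cycles."""
    return (prev_cycles / cycles - 1.0) * 100.0


# Pair each variant with its predecessor and compute the step gain.
steps = [
    (name, improvement_pct(prev, cur))
    for (_, prev), (name, cur) in zip(timeline, timeline[1:])
]
for name, pct in steps:
    print(f"{name}: +{pct:.3f}%")
```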

Full i100 values for superscalar milestones

Variant                                      | O2 i100 cycles | O2 CPI      | O2 CM/MHz   | O3 i100 cycles | O3 CPI      | O3 CM/MHz   | Ofast i100 cycles | Ofast CPI   | Ofast CM/MHz
---------------------------------------------|----------------|-------------|-------------|----------------|-------------|-------------|-------------------|-------------|-------------
rv32i-superscalar                            | 32,682,915     | 1.057730630 | 3.059702600 | 30,832,113     | 1.043518297 | 3.243371611 | 30,832,108        | 1.043518198 | 3.243372137
rv32i-super-br                               | 30,096,281     | 0.974018330 | 3.322669668 | 28,582,406     | 0.967376567 | 3.498655781 | 28,582,403        | 0.967376531 | 3.498656149
rv32i-super-hybp                             | 27,787,312     | 0.899292215 | 3.598764789 | 26,840,701     | 0.908428254 | 3.725685108 | 26,836,840        | 0.908297640 | 3.726221120
rv32i-super-br-hybp                          | 27,607,191     | 0.893462885 | 3.622244654 | 26,483,239     | 0.896329890 | 3.775973173 | 26,501,897        | 0.896961434 | 3.773314793
rv32i-super-br-hybp-btb-ras                  | 27,185,545     | 0.879816982 | 3.678425428 | 26,098,634     | 0.883312866 | 3.831618161 | 26,117,294        | 0.883944477 | 3.828880588
rv32i-super-br-hybp-btb-ras-hzopt            | 26,202,245     | 0.847994040 | 3.816466871 | 25,142,329     | 0.850946555 | 3.977356274 | 25,160,989        | 0.851578163 | 3.974406570
rv32i-super-br-hybp-btb-ras-hzopt-luopt-fulu | 22,589,232     | 0.731064613 | 4.426887997 | 21,484,387     | 0.727142863 | 4.654542855 | 21,489,076        | 0.727301613 | 4.653527215

Superscalar end status vs start

Using O3 i100:

  • Cycles: 30,832,113 -> 21,484,387 (~30.32% lower)
  • CPI: 1.043518297 -> 0.727142863
  • IPC: 0.958296566 -> 1.375245568
  • CoreMark/MHz: 3.243371611 -> 4.654542855
  • Total speedup: ~1.435x
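
The two headline numbers are two views of the same ratio: a speedup of s corresponds to a cycle reduction of 1 - 1/s. A short check using the start and end cycle counts quoted here (variable names are illustrative):

```python
start_cycles = 30_832_113   # rv32i-superscalar, O3 i100
end_cycles = 21_484_387     # ...-hzopt-luopt-fulu, O3 i100

speedup = start_cycles / end_cycles
cycle_reduction_pct = (1.0 - end_cycles / start_cycles) * 100.0

print(f"speedup: {speedup:.3f}x")
print(f"cycles lower by: {cycle_reduction_pct:.2f}%")
```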

Count branch for targeted optimization

To improve CPI/IPC further, we created a measurement-driven tuning branch:

  • rv32i-super-br-hybp-btb-ras-hzopt-luopt-fulu-count
  • then working copy rv32i-super-br-hybp-btb-ras-hzopt-luopt-fulu-count-fq

Purpose: add counters, identify dominant loss sources, and optimize with validation after each change.

Current measured bottlenecks (stable O2 i1 count run)

  • dbg_cycles = 235625
  • dbg_slot1_bubble = 44920 (~19.064%)
  • dbg_squash_s1_cycles = 52416 (~22.246%)
  • dbg_squash_ctrl_cycles = 38603 (~73.647% of s1 squash)
  • dbg_squash_load_raw_cycles = 13793 (~26.314% of s1 squash)
  • dbg_squash_mem_cycles = 20 (~0.038% of s1 squash)
  • dbg_hz_stall_cycles = 7731 (~3.281%)
  • dbg_hz_flush_cycles = 13623 (~5.782%)
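
The percentages above follow directly from the raw counters. The sketch below recomputes them and checks that the three squash causes add up to the slot1 squash total; the dictionary layout and helper name are illustrative, while the counter names and values are the ones listed above.

```python
counters = {
    "dbg_cycles": 235625,
    "dbg_slot1_bubble": 44920,
    "dbg_squash_s1_cycles": 52416,
    "dbg_squash_ctrl_cycles": 38603,
    "dbg_squash_load_raw_cycles": 13793,
    "dbg_squash_mem_cycles": 20,
    "dbg_hz_stall_cycles": 7731,
    "dbg_hz_flush_cycles": 13623,
}


def share(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


total = counters["dbg_cycles"]
squash = counters["dbg_squash_s1_cycles"]

# Shares of all cycles.
for name in ("dbg_slot1_bubble", "dbg_squash_s1_cycles",
             "dbg_hz_stall_cycles", "dbg_hz_flush_cycles"):
    print(f"{name}: {share(counters[name], total):.3f}% of cycles")

# Shares of slot1 squash cycles, split by cause.
causes = ("dbg_squash_ctrl_cycles", "dbg_squash_load_raw_cycles",
          "dbg_squash_mem_cycles")
for name in causes:
    print(f"{name}: {share(counters[name], squash):.3f}% of s1 squash")

# The three causes should account for every slot1 squash cycle.
causes_sum = sum(counters[name] for name in causes)
print("causes cover all s1 squash cycles:", causes_sum == squash)
```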

What this indicates

  • Main limiter is still control-driven slot1 squash.
  • Secondary limiter is load-RAW-driven slot1 squash.
  • Memory-alias squash is very small currently.

What was tried in count/fq and failed (reverted)

The following were attempted and then reverted after failing correctness checks (CRC mismatch, timeout loops, or program corruption):

  1. Slot1 keep-path for pred_taken && resolved_not_taken branch case.
  2. Fetch queue / prefetch queue variants (1-entry/2-entry style attempts).
  3. Store-data-only load-RAW squash relaxation in hazard path.

These attempts often showed apparently lower cycles in short runs, but failed correctness (x12 CRC mismatch or non-halting behavior), so they were not retained.

Current status

  • The retained working branch remains correctness-first: no change is kept unless it passes the correctness checks.
  • Stable O2 i1 retained reference:
    • cycles 235625
    • CPI 0.729538854
    • IPC 1.370728912
    • CRC 0x0000e3c1

Ongoing work note

I am still working on lowering CPI (raising IPC) in the superscalar count/fq path, but I only keep changes that pass the full correctness checks (instruction suite + C regressions + CoreMark smoke CRC) before claiming an improvement.
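
The keep-or-revert rule can be expressed as a small gate. This is a hypothetical sketch of the policy, not the actual regression script: the function name and its parameters are assumptions, while the expected CRC values are the ones listed at the top of this log.

```python
# Expected CoreMark CRCs per ITERATIONS setting, from this log.
EXPECTED_CRC = {1: 0x0000E3C1, 10: 0x0000C64E, 100: 0x0000844D}


def keep_change(iterations: int, crc: int, halted: bool,
                cycles: int, reference_cycles: int,
                suites_passed: bool = True) -> bool:
    """Keep a change only if it is correct first and faster second."""
    if not halted or not suites_passed:
        return False                      # timeout loop or failed regressions
    if crc != EXPECTED_CRC.get(iterations):
        return False                      # CRC mismatch: result is not trusted
    return cycles < reference_cycles      # only then does the cycle count matter


# A run with fewer cycles but a wrong CRC is rejected.
print(keep_change(1, 0x00001234, True, 200_000, 235_625))
# A correct run that beats the retained O2 i1 reference is kept.
print(keep_change(1, 0x0000E3C1, True, 230_000, 235_625))
```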