RV32I Core Benchmarking Report

CoreMark Performance Across Microarchitectural Variants


Build & Measurement Flow

All results are generated using a consistent toolchain and simulation flow.

Compilation

riscv32-unknown-elf-gcc -O3 -ffreestanding -fno-builtin -march=rv32im -mabi=ilp32 -nostdlib -Ttext 0x0 \
  -I scripts/coremark -DITERATIONS=100 -DVALIDATION_RUN=1 -DTOTAL_DATA_SIZE=2000 \
  -o coremark_i100_o3.elf \
  scripts/coremark/crt0.S \
  scripts/coremark/core_main.c \
  scripts/coremark/core_list_join.c \
  scripts/coremark/core_matrix.c \
  scripts/coremark/core_state.c \
  scripts/coremark/core_util.c \
  scripts/coremark/core_portme.c \
  -lgcc

Binary Conversion

python3 scripts/elf2hex.py coremark_i100_o3.elf hex/inst_mem.hex hex/data_mem.hex

Simulation

make verilator-prog
./obj_dir/Vtb_program
cat tb_program_results.txt
cp tb_program_results.txt tb_program_results_i100_o3.txt

Result Parsing & Metrics

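import re

def _field(label, txt):
    # Extract the integer that follows "<label>:" in the results text.
    match = re.search(rf"{re.escape(label)}:\s*(\d+)", txt)
    if match is None:
        raise ValueError(f"'{label}' not found in results file")
    return int(match.group(1))

def parse(path):
    # Return (total cycles, retired instructions) from a testbench results file.
    with open(path) as f:
        txt = f.read()
    return _field("Total cycles", txt), _field("Retired instructions", txt)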

runs = [
    ("O2", 1, "tb_program_results_i1_o2.txt"),
    ("O2", 10, "tb_program_results_i10_o2.txt"),
    ("O2", 100, "tb_program_results_i100_o2.txt"),
    ("O3", 1, "tb_program_results_i1_o3.txt"),
    ("O3", 10, "tb_program_results_i10_o3.txt"),
    ("O3", 100, "tb_program_results_i100_o3.txt"),
    ("Ofast", 1, "tb_program_results_i1_ofast.txt"),
    ("Ofast", 10, "tb_program_results_i10_ofast.txt"),
    ("Ofast", 100, "tb_program_results_i100_ofast.txt"),
]

for opt, iters, path in runs:
    cycles, inst = parse(path)
    cpi = cycles / inst
    ipc = inst / cycles
    # CoreMark/MHz is iterations per second divided by the clock in MHz,
    # which reduces to iterations * 1e6 / cycles.
    cm_mhz = (iters * 1_000_000) / cycles
    print(f"{opt} ITER={iters}: cycles={cycles}, inst={inst}, CPI={cpi:.9f}, IPC={ipc:.9f}, CoreMark/MHz={cm_mhz:.9f}")

1. RV32I Single-Cycle (rv32i-sc)

Results

| Opt | Iter | Cycles | Retired Inst | CPI | IPC | CoreMark/MHz | Score @100MHz |
|---|---|---|---|---|---|---|---|
| O2 | 1 | 322977 | 322975 | 1.000006192 | 0.999993808 | 3.096195704 | 309.619570 |
| O2 | 10 | 3102274 | 3102272 | 1.000000645 | 0.999999355 | 3.223441901 | 322.344190 |
| O2 | 100 | 30899090 | 30899088 | 1.000000065 | 0.999999935 | 3.236341264 | 323.634126 |
| O3 | 1 | 307171 | 307169 | 1.000006511 | 0.999993489 | 3.255515657 | 325.551566 |
| O3 | 10 | 2965056 | 2965054 | 1.000000675 | 0.999999325 | 3.372617583 | 337.261758 |
| O3 | 100 | 29546307 | 29546305 | 1.000000068 | 0.999999932 | 3.384517733 | 338.451773 |
| Ofast | 1 | 307169 | 307167 | 1.000006511 | 0.999993489 | 3.255536854 | 325.553685 |
| Ofast | 10 | 2965054 | 2965052 | 1.000000675 | 0.999999325 | 3.372619858 | 337.261986 |
| Ofast | 100 | 29546305 | 29546303 | 1.000000068 | 0.999999932 | 3.384517963 | 338.451796 |

Explanation

  • CPI ≈ 1.0 across all runs
  • IPC ≈ 1.0 (ideal)
  • No hazards, no overlap → every instruction completes in one cycle
  • Performance limited entirely by long clock period
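
The last point can be made concrete with the O3, ITERATIONS=100 figures in this report. CoreMark/MHz ignores clock frequency, so a core with a lower CoreMark/MHz can still win overall if it clocks sufficiently faster. The sketch below computes that break-even clock ratio against the baseline pipelined core. The function name is illustrative, and no actual clock frequencies are assumed because the report does not give them.

```python
# CoreMark/MHz at O3, ITERATIONS=100, copied from the result tables in this report.
CM_PER_MHZ_SC = 3.384517733    # rv32i-sc
CM_PER_MHZ_PIPE = 2.572194225  # rv32i-pipe

def break_even_clock_ratio(cm_slow_clock, cm_fast_clock):
    """Clock ratio f_fast / f_slow at which both cores reach the same score.

    Score = CoreMark/MHz * f_MHz, so equal scores require
    f_fast / f_slow = cm_slow_clock / cm_fast_clock.
    """
    return cm_slow_clock / cm_fast_clock

ratio = break_even_clock_ratio(CM_PER_MHZ_SC, CM_PER_MHZ_PIPE)
print(f"rv32i-pipe must clock more than {ratio:.3f}x faster than rv32i-sc to beat it")
```

The ratio equals the pipelined core's CPI relative to the single-cycle core, which is why the "Score @100MHz" column alone cannot rank the two designs.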

2. RV32I Multi-Cycle (rv32i-mc)

Results

| Opt | Iter | Cycles | Retired Inst | CPI | IPC | CoreMark/MHz | Score @100MHz |
|---|---|---|---|---|---|---|---|
| O2 | 1 | 1284132 | 322976 | 3.975936292 | 0.251513084 | 0.778736142 | 77.873614 |
| O2 | 10 | 12337704 | 3102273 | 3.976988486 | 0.251446541 | 0.810523579 | 81.052358 |
| O2 | 100 | 122889486 | 30899089 | 3.977123274 | 0.251438020 | 0.813739265 | 81.373926 |
| O3 | 1 | 1225585 | 307170 | 3.989924146 | 0.250631331 | 0.815936879 | 81.593688 |
| O3 | 10 | 11830144 | 2965055 | 3.989856512 | 0.250635580 | 0.845298248 | 84.529825 |
| O3 | 100 | 117885350 | 29546306 | 3.989850711 | 0.250635944 | 0.848281826 | 84.828183 |
| Ofast | 1 | 1225576 | 307168 | 3.989920825 | 0.250631540 | 0.815942871 | 81.594287 |
| Ofast | 10 | 11830135 | 2965053 | 3.989856168 | 0.250635601 | 0.845298891 | 84.529889 |
| Ofast | 100 | 117885341 | 29546304 | 3.989850676 | 0.250635946 | 0.848281891 | 84.828189 |

Explanation

  • CPI ≈ 4.0
  • IPC ≈ 0.25
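  • Each instruction is broken into sequential steps (IF/ID/EX/MEM/WB); the measured CPI of just under 4 implies that, on average, instructions do not spend a cycle in all five steps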
  • No overlap → very low throughput

3. RV32I Pipelined (rv32i-pipe)

Results

| Opt | Iter | Cycles | Retired Inst | CPI | IPC | CoreMark/MHz | Score @100MHz |
|---|---|---|---|---|---|---|---|
| O2 | 1 | 430351 | 322977 | 1.332450918 | 0.750496688 | 2.323684620 | 232.368462 |
| O2 | 10 | 4143194 | 3102274 | 1.335534514 | 0.748763876 | 2.413596853 | 241.359685 |
| O2 | 100 | 41275362 | 30899090 | 1.335811572 | 0.748608577 | 2.422752828 | 242.275283 |
| O3 | 1 | 402765 | 307171 | 1.311207764 | 0.762655643 | 2.482837387 | 248.283739 |
| O3 | 10 | 3900320 | 2965056 | 1.315428781 | 0.760208393 | 2.563892193 | 256.389219 |
| O3 | 100 | 38877313 | 29546307 | 1.315809553 | 0.759988402 | 2.572194225 | 257.219423 |
| Ofast | 1 | 402760 | 307169 | 1.311200023 | 0.762660145 | 2.482868209 | 248.286821 |
| Ofast | 10 | 3900315 | 2965054 | 1.315427982 | 0.760208855 | 2.563895480 | 256.389548 |
| Ofast | 100 | 38877308 | 29546305 | 1.315809473 | 0.759988449 | 2.572194556 | 257.219456 |

Explanation

  • CPI ≈ 1.31 to 1.34
  • IPC ≈ 0.75 to 0.76
  • Performance loss from:
    • Branch penalties
    • Load-use hazards
  • Still roughly 3x the CoreMark/MHz of the multi-cycle core (2.57 vs 0.85 at O3, ITERATIONS=100)
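
The CPI above 1.0 can be read as lost cycles per instruction. As a rough sketch using the O3, ITERATIONS=100 row, the following splits total cycles into one cycle per retired instruction plus everything else. It lumps branch bubbles, load-use stalls, and pipeline fill together, because the results file does not break them out.

```python
# rv32i-pipe, O3, ITERATIONS=100 (values copied from the results table).
cycles = 38_877_313
retired = 29_546_307

lost_cycles = cycles - retired          # cycles in which no instruction retired
lost_per_inst = lost_cycles / retired   # average penalty per instruction

print(f"lost cycles: {lost_cycles:,}")
print(f"lost cycles per instruction: {lost_per_inst:.6f}")
print(f"CPI = 1 + {lost_per_inst:.6f} = {1 + lost_per_inst:.6f}")
```

The optimisation steps later in this report can each be viewed as removing part of this penalty.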


4. RV32I Superscalar (rv32i-superscalar)

Results

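| Iter | Cycles | Retired Inst | CPI | IPC | CoreMark/MHz | Score @100MHz | CRC |
|---|---|---|---|---|---|---|---|
| 1 | 319190 | 307172 | 1.039124660 | 0.962348445 | 3.132930230 | 313.293023 | 0x0000e3c1 |
| 10 | 3092965 | 2965057 | 1.043138462 | 0.958645507 | 3.233143602 | 323.314360 | 0x0000c64e |
| 100 | 30832113 | 29546308 | 1.043518297 | 0.958296566 | 3.243371611 | 324.337161 | 0x0000844d |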

Explanation

  • CPI ≈ 1.04
  • IPC ≈ 0.96
  • Dual-issue capability
  • Limited by dependency and control hazards
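
To put the IPC in context, a dual-issue core can retire at most two instructions per cycle, so the measured IPC can be expressed as a fraction of that peak. A small sketch using the ITERATIONS=100 row follows; the issue width of 2 is taken from the dual-issue description above.

```python
ISSUE_WIDTH = 2  # dual-issue

# rv32i-superscalar, ITERATIONS=100 (values copied from the results table).
cycles = 30_832_113
retired = 29_546_308

ipc = retired / cycles
utilization = ipc / ISSUE_WIDTH  # fraction of available issue slots used

print(f"IPC = {ipc:.6f}, issue-slot utilization = {utilization:.1%}")
```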

All Variants Implemented

Base Architectures

  • rv32i-sc — single-cycle
  • rv32i-mc — multi-cycle
  • rv32i-pipe — baseline pipeline
  • rv32i-superscalar — dual-issue
  • rv32i-ooo — out-of-order
  • rv32i-super-ooo — optimized OOO

Pipeline Evolution Variants

Branch Handling

  • rv32i-pipe-br — ID-stage branch resolution
  • rv32i-pipe-sbp — static predictor
  • rv32i-pipe-gbp — global predictor
  • rv32i-pipe-gbp-tuned — tuned global predictor
  • rv32i-pipe-hybp — hybrid predictor

Direction + Target

  • rv32i-pipe-br-2dbp — 2-bit predictor
  • rv32i-pipe-br-2dbp-btb — + BTB

Hazard Optimisations

  • rv32i-pipe-br-2dbp-btb-hzopt
  • rv32i-pipe-br-2dbp-btb-hzopt-luopt
  • rv32i-pipe-br-2dbp-btb-hzopt-luopt-fulu

Advanced Prediction

  • rv32i-pipe-br-hybp-btb-hzopt-luopt-fulu
  • rv32i-pipe-br-hybp-btb-ras-hzopt-luopt-fulu

Experimental

  • rv32i-pipe-1dbp
  • rv32i-pipe-2dbp

RV32I Pipelined Optimization Journey

Scope and benchmark method

  • This log tracks the single-issue pipeline path from rv32i-pipe to the latest optimized variants.
  • CoreMark was run at -O2, -O3, and -Ofast, with ITERATIONS=1/10/100.
  • For step-by-step comparison, -O3 with ITERATIONS=100 is the main reference point.
  • Metrics used:
    • CoreMark/MHz = (ITERATIONS * 1,000,000) / cycles
    • CPI = cycles / retired_instructions
    • IPC = retired_instructions / cycles
  • CRC checks were consistent in all successful runs:
    • ITER=1: 0x0000e3c1
    • ITER=10: 0x0000c64e
    • ITER=100: 0x0000844d
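
A results script can use these CRCs as a guard before computing any metric. The sketch below is one way to do that. The `EXPECTED_CRC` table is copied from the list above, while `check_crc` and its string-based interface are hypothetical, since this report does not show how the testbench prints the CRC.

```python
# Reference CRCs per ITERATIONS value, copied from the list above.
EXPECTED_CRC = {1: 0x0000E3C1, 10: 0x0000C64E, 100: 0x0000844D}

def check_crc(iterations, crc_text):
    """Return True if the reported CRC matches the reference for this run length."""
    expected = EXPECTED_CRC.get(iterations)
    if expected is None:
        raise ValueError(f"no reference CRC for ITERATIONS={iterations}")
    return int(crc_text, 16) == expected

print(check_crc(100, "0x0000844d"))  # matching CRC
print(check_crc(10, "0x0000e3c1"))   # CRC from a different run length
```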

Starting point: plain pipelined core

Variant: rv32i-pipe

  • O2 i100: cycles 41,275,362, CPI 1.335811572, CoreMark/MHz 2.422752828
  • O3 i100: cycles 38,877,313, CPI 1.315809553, CoreMark/MHz 2.572194225
  • Ofast i100: cycles 38,877,308, CPI 1.315809473, CoreMark/MHz 2.572194556

This is the baseline: CPI is about 1.336 at O2 and 1.316 at O3 (ITERATIONS=100).

Step 1: early branch resolution

Variant: rv32i-pipe-br

What changed

  • Branch decision moved earlier to ID stage.
  • Added ID stage operand forwarding for branch compare so branch operands are ready sooner.
  • Added redirect from ID when branch decision differs from sequential path.

Why it helped

  • Reduces control hazard penalty because branch wait time is shorter.

Result (O3 i100)

  • Cycles: 38,877,313 -> 35,812,817
  • CPI: 1.315809553 -> 1.212091142
  • CoreMark/MHz: 2.572194225 -> 2.792296400
  • Improvement vs previous: +8.557% (speedup, computed as old cycles / new cycles - 1)

Branch prediction type sweep (before BTB/hzopt/luopt/fulu)

This sweep compares direction predictors on pipeline variants that do not yet include the BTB or the hzopt/luopt/fulu hazard optimisations.

| Variant | Predictor type | O3 i100 cycles | O3 i100 CPI | O3 CoreMark/MHz |
|---|---|---|---|---|
| rv32i-pipe-sbp | static backward taken | 35,390,851 | 1.197809628 | 2.825589020 |
| rv32i-pipe-1dbp | local 1-bit dynamic | 34,640,813 | 1.172424459 | 2.886768275 |
| rv32i-pipe-2dbp | local 2-bit dynamic | 34,185,095 | 1.157000602 | 2.925251488 |

Best of this sweep was local 2-bit dynamic, so the main optimization line continued with that style.
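
One common explanation for a 2-bit predictor beating a 1-bit one is loop behaviour: a 1-bit predictor mispredicts twice per loop (at exit and again on re-entry), while a 2-bit saturating counter mispredicts only at exit. The sketch below simulates a single branch under both schemes. It is a behavioural illustration, not the RTL of these cores, and the loop trip count is an arbitrary assumption.

```python
def mispredicts_1bit(outcomes, state=True):
    """1-bit predictor: predict whatever the branch did last time."""
    misses = 0
    for taken in outcomes:
        if state != taken:
            misses += 1
        state = taken
    return misses

def mispredicts_2bit(outcomes, counter=3):
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses

# Assumed workload: a loop whose backward branch is taken 9 times and then
# falls through once, with the whole loop executed 100 times in a row.
outcomes = ([True] * 9 + [False]) * 100

print("1-bit misses:", mispredicts_1bit(outcomes))
print("2-bit misses:", mispredicts_2bit(outcomes))
```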

Step 2: branch resolve + 2-bit dynamic predictor

Variant: rv32i-pipe-br-2dbp

What changed

  • Kept ID stage branch resolution.
  • Added dynamic 2-bit branch history table in IF.

Result (O3 i100)

  • Cycles: 35,812,817 -> 33,466,269
  • CPI: 1.212091142 -> 1.132671809
  • CoreMark/MHz: 2.792296400 -> 2.988083315
  • Improvement vs previous: +7.012% over rv32i-pipe-br (35,812,817 / 33,466,269 - 1), or +2.148% over rv32i-pipe-2dbp

Step 3: BTB

Variant: rv32i-pipe-br-2dbp-btb

What changed

  • Added BTB for target prediction in IF.
  • Direction and target prediction were both available earlier.

Why it helped

  • Correctly predicted taken branches no longer wait for late target compute.

Result (O3 i100)

  • Cycles: 33,466,269 -> 32,347,569
  • CPI: 1.132671809 -> 1.094809209
  • CoreMark/MHz: 2.988083315 -> 3.091422419
  • Improvement vs previous: +3.458%
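
As an illustration of the idea, a direct-mapped BTB can be modelled as a small table indexed by low PC bits and tagged with the remaining bits. This is a behavioural sketch only. The entry count, indexing scheme, and class name are assumptions, not the parameters of rv32i-pipe-br-2dbp-btb.

```python
class BTB:
    """Direct-mapped branch target buffer (behavioural model)."""

    def __init__(self, entries=64):
        self.entries = entries          # assumed size
        self.table = [None] * entries   # each slot holds (tag, target) or None

    def _index_tag(self, pc):
        word = pc >> 2  # RV32I instructions are 4-byte aligned
        return word % self.entries, word // self.entries

    def lookup(self, pc):
        """Return the predicted target for pc, or None on a miss."""
        index, tag = self._index_tag(pc)
        entry = self.table[index]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None

    def update(self, pc, target):
        """Record the resolved target of a taken branch."""
        index, tag = self._index_tag(pc)
        self.table[index] = (tag, target)

btb = BTB()
print(btb.lookup(0x100))        # miss before the branch has been seen
btb.update(0x100, 0x80)
print(hex(btb.lookup(0x100)))   # hit once the branch has resolved
```

On a hit, fetch can be redirected to the stored target in the same cycle the branch is fetched, instead of waiting for the target adder later in the pipeline.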

Step 4: hzopt (false load use cleanup)

Variant: rv32i-pipe-br-2dbp-btb-hzopt

What changed

  • Hazard logic was made source aware, using decoded use_rs1 and use_rs2.
  • Removed false positive load use stalls when source register is not actually consumed by the instruction.
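
The difference between a naive and a source-aware check can be sketched as two predicates. The signal names `use_rs1` and `use_rs2` come from the description above. Everything else (function names, argument layout) is illustrative and written in Python rather than the cores' HDL.

```python
def load_use_stall_naive(ex_is_load, ex_rd, id_rs1, id_rs2):
    """Stall whenever a register field in ID matches the load destination."""
    return ex_is_load and ex_rd != 0 and ex_rd in (id_rs1, id_rs2)

def load_use_stall_source_aware(ex_is_load, ex_rd, id_rs1, id_rs2,
                                use_rs1, use_rs2):
    """Stall only if the matching field is actually read by the ID instruction."""
    hit_rs1 = use_rs1 and id_rs1 == ex_rd
    hit_rs2 = use_rs2 and id_rs2 == ex_rd
    return ex_is_load and ex_rd != 0 and (hit_rs1 or hit_rs2)

# Example: lw x5, 0(x1) followed by lui x6, 0x28.
# LUI reads no source register, but its immediate occupies the bit positions
# of the rs1/rs2 fields; with this immediate the rs1 field decodes as 5.
print(load_use_stall_naive(True, 5, 5, 0))
print(load_use_stall_source_aware(True, 5, 5, 0, False, False))
```

The naive check stalls on the coincidental field match, while the source-aware check does not.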

Result (O3 i100)

  • Cycles: 32,347,569 -> 32,346,765
  • CPI: 1.094809209 -> 1.094781998
  • CoreMark/MHz: 3.091422419 -> 3.091499258
  • Improvement vs previous: +0.002%

Step 5: luopt (first true load use reduction)

Variant: rv32i-pipe-br-2dbp-btb-hzopt-luopt

What changed

  • Relaxed one true load use case for store data (load -> store rs2) when safe.
  • Added matching EX path load data forwarding so correctness is preserved.

Result (O3 i100)

  • Cycles: 32,346,765 -> 32,342,681
  • CPI: 1.094781998 -> 1.094643774
  • CoreMark/MHz: 3.091499258 -> 3.091889630
  • Improvement vs previous: +0.013%

Step 6: fulu (generalized load use reduction)

Variant: rv32i-pipe-br-2dbp-btb-hzopt-luopt-fulu

What changed

  • Generalized MEM to EX load forwarding for both EX operands.
  • Stalls kept only for ID stage consumers that truly need value in ID (branch and jalr related reads).
  • Removed extra load use stalls for EX stage consumers now covered by forwarding.
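
Put together, the stall decision after fulu might look like the sketch below. It is a simplified model: `reads_in_id` stands for instructions that consume a register value in ID (branches and jalr, per the list above), and all names are hypothetical.

```python
def load_use_stall_fulu(ex_is_load, ex_rd, id_rs1, id_rs2,
                        use_rs1, use_rs2, reads_in_id):
    """Stall only when a load result is needed in ID; EX consumers are forwarded.

    reads_in_id is True for instructions that read operands in the ID stage
    (branch compare and jalr target in this model).
    """
    depends = ex_is_load and ex_rd != 0 and (
        (use_rs1 and id_rs1 == ex_rd) or (use_rs2 and id_rs2 == ex_rd)
    )
    return depends and reads_in_id

# lw x5, 0(x1); add x6, x5, x7 -> operand arrives via load forwarding, no stall
print(load_use_stall_fulu(True, 5, 5, 7, True, True, reads_in_id=False))
# lw x5, 0(x1); beq x5, x0, L  -> branch compares in ID, so it still stalls
print(load_use_stall_fulu(True, 5, 5, 0, True, True, reads_in_id=True))
```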

Result (O3 i100)

  • Cycles: 32,342,681 -> 31,709,661
  • CPI: 1.094643774 -> 1.073219100
  • CoreMark/MHz: 3.091889630 -> 3.153613027
  • Improvement vs previous: +1.996%

Later revisit: predictor path that started from global and was tuned

After fulu, predictor alternatives were revisited on a separate line.

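| Variant | O3 i100 cycles | O3 i100 CPI | O3 CoreMark/MHz |
|---|---|---|---|
| rv32i-pipe-gbp | 34,709,649 | 1.174754226 | 2.881043251 |
| rv32i-pipe-gbp-tuned | 34,527,599 | 1.168592711 | 2.896233822 |
| rv32i-pipe-hybp | 34,021,025 | 1.151447624 | 2.939358823 |

Tuning the global predictor reduced cycles from 34,709,649 to 34,527,599 (+0.527%), and switching to the hybrid predictor reduced them further to 34,021,025 (+1.489%).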

Using that predictor in the full optimized line

Variant: rv32i-pipe-br-hybp-btb-hzopt-luopt-fulu

What changed

  • Replaced the local 2-bit direction predictor in the full optimized line with tournament style prediction.
  • Kept BTB, hzopt, luopt, and fulu behavior.
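
In a tournament scheme, a chooser decides per branch which component predictor to trust. A minimal behavioural sketch of a 2-bit chooser follows. The report does not describe the actual selection logic in this variant, so the counter width, its encoding, and the function names are assumptions.

```python
def choose(chooser, local_pred, global_pred):
    """Chooser values 0-1 trust the local predictor, 2-3 trust the global one."""
    return global_pred if chooser >= 2 else local_pred

def update_chooser(chooser, local_pred, global_pred, taken):
    """Move toward whichever component was right, but only when they disagree."""
    local_ok = local_pred == taken
    global_ok = global_pred == taken
    if local_ok == global_ok:
        return chooser
    return min(3, chooser + 1) if global_ok else max(0, chooser - 1)

chooser = 1  # start weakly preferring the local predictor
for _ in range(3):
    # Assumed scenario: local says not taken, global says taken, branch is taken.
    chooser = update_chooser(chooser, False, True, True)

print(chooser, choose(chooser, False, True))
```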

Result (O3 i100)

  • Cycles: 31,709,661 -> 31,566,209
  • CPI: 1.073219100 -> 1.068363941
  • CoreMark/MHz: 3.153613027 -> 3.167944557
  • Improvement vs previous: +0.455%

Final control-flow add on this line

Variant: rv32i-pipe-br-hybp-btb-ras-hzopt-luopt-fulu

What changed

  • Added JAL fast path prediction in IF.
  • Added RAS for return style jalr prediction.
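
A return address stack can be modelled in a few lines. The sketch below follows the usual RISC-V convention of treating x1 (ra) as the link register. The depth, the overflow policy, and the method names are assumptions rather than details of rv32i-pipe-br-hybp-btb-ras-hzopt-luopt-fulu.

```python
class ReturnAddressStack:
    """Behavioural model of a small return address stack."""

    def __init__(self, depth=8):
        self.depth = depth  # assumed depth
        self.stack = []

    def on_call(self, pc):
        """jal/jalr writing x1: remember the return address pc + 4."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # drop the oldest entry on overflow
        self.stack.append(pc + 4)

    def predict_return(self):
        """jalr x0, 0(x1): pop the predicted return target, or None if empty."""
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.on_call(0x1000)  # outer call
ras.on_call(0x2000)  # nested call
print(hex(ras.predict_return()))  # return from the nested call
print(hex(ras.predict_return()))  # return from the outer call
print(ras.predict_return())       # stack is empty
```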

Result (O3 i100)

  • Cycles: 31,566,209 -> 31,540,922
  • CPI: 1.068363941 -> 1.067508098
  • CoreMark/MHz: 3.167944557 -> 3.170484363
  • Improvement vs previous: +0.080%

Consolidated O3 i100 timeline

| Order | Variant | O3 i100 cycles | O3 i100 CPI | O3 CoreMark/MHz | Improvement vs previous |
|---|---|---|---|---|---|
| 1 | rv32i-pipe | 38,877,313 | 1.315809553 | 2.572194225 | baseline |
| 2 | rv32i-pipe-br | 35,812,817 | 1.212091142 | 2.792296400 | +8.557% |
| 3 | rv32i-pipe-sbp | 35,390,851 | 1.197809628 | 2.825589020 | +1.192% |
| 4 | rv32i-pipe-1dbp | 34,640,813 | 1.172424459 | 2.886768275 | +2.165% |
| 5 | rv32i-pipe-2dbp | 34,185,095 | 1.157000602 | 2.925251488 | +1.333% |
| 6 | rv32i-pipe-br-2dbp | 33,466,269 | 1.132671809 | 2.988083315 | +2.148% |
| 7 | rv32i-pipe-br-2dbp-btb | 32,347,569 | 1.094809209 | 3.091422419 | +3.458% |
| 8 | rv32i-pipe-br-2dbp-btb-hzopt | 32,346,765 | 1.094781998 | 3.091499258 | +0.002% |
| 9 | rv32i-pipe-br-2dbp-btb-hzopt-luopt | 32,342,681 | 1.094643774 | 3.091889630 | +0.013% |
| 10 | rv32i-pipe-br-2dbp-btb-hzopt-luopt-fulu | 31,709,661 | 1.073219100 | 3.153613027 | +1.996% |
| 11 | rv32i-pipe-gbp | 34,709,649 | 1.174754226 | 2.881043251 | separate branch |
| 12 | rv32i-pipe-gbp-tuned | 34,527,599 | 1.168592711 | 2.896233822 | +0.527% over step 11 |
| 13 | rv32i-pipe-hybp | 34,021,025 | 1.151447624 | 2.939358823 | +1.489% over step 12 |
| 14 | rv32i-pipe-br-hybp-btb-hzopt-luopt-fulu | 31,566,209 | 1.068363941 | 3.167944557 | +0.455% over step 10 |
| 15 | rv32i-pipe-br-hybp-btb-ras-hzopt-luopt-fulu | 31,540,922 | 1.067508098 | 3.170484363 | +0.080% over step 14 |
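
The improvement column can be reproduced from the cycle counts. The percentages in this log correspond to the speedup old_cycles / new_cycles - 1, which equals the relative CoreMark/MHz gain. A short sketch for a few rows follows, with cycle counts copied from the table; the step labels are shorthand.

```python
def improvement_pct(old_cycles, new_cycles):
    """Speedup of the new variant over the old one, in percent."""
    return (old_cycles / new_cycles - 1) * 100

steps = [
    ("rv32i-pipe -> rv32i-pipe-br", 38_877_313, 35_812_817),
    ("br-2dbp -> br-2dbp-btb", 33_466_269, 32_347_569),
    ("luopt -> fulu", 32_342_681, 31_709_661),
    ("hybp line -> hybp line + ras", 31_566_209, 31_540_922),
]

for label, old, new in steps:
    print(f"{label}: +{improvement_pct(old, new):.3f}%")
```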

Full i100 values for main milestones

| Variant | O2 i100 cycles | O2 CPI | O2 CM/MHz | O3 i100 cycles | O3 CPI | O3 CM/MHz | Ofast i100 cycles | Ofast CPI | Ofast CM/MHz |
|---|---|---|---|---|---|---|---|---|---|
| rv32i-pipe | 41,275,362 | 1.335811572 | 2.422752828 | 38,877,313 | 1.315809553 | 2.572194225 | 38,877,308 | 1.315809473 | 2.572194556 |
| rv32i-pipe-br | 37,838,532 | 1.224584025 | 2.642808659 | 35,812,817 | 1.212091142 | 2.792296400 | 35,812,812 | 1.212091055 | 2.792296790 |
| rv32i-pipe-br-2dbp | 35,141,770 | 1.137307604 | 2.845616484 | 33,466,269 | 1.132671809 | 2.988083315 | 33,466,268 | 1.132671852 | 2.988083404 |
| rv32i-pipe-br-2dbp-btb | 33,813,574 | 1.094322648 | 2.957392200 | 32,347,569 | 1.094809209 | 3.091422419 | 32,347,568 | 1.094809249 | 3.091422514 |
| rv32i-pipe-br-2dbp-btb-hzopt | 33,812,468 | 1.094286854 | 2.957488936 | 32,346,765 | 1.094781998 | 3.091499258 | 32,346,764 | 1.094782038 | 3.091499354 |
| rv32i-pipe-br-2dbp-btb-hzopt-luopt | 33,808,868 | 1.094170346 | 2.957803852 | 32,342,681 | 1.094643774 | 3.091889630 | 32,342,680 | 1.094643814 | 3.091889726 |
| rv32i-pipe-br-2dbp-btb-hzopt-luopt-fulu | 33,144,324 | 1.072663434 | 3.017107846 | 31,709,661 | 1.073219100 | 3.153613027 | 31,709,660 | 1.073219139 | 3.153613126 |
| rv32i-pipe-br-hybp-btb-hzopt-luopt-fulu | 32,954,053 | 1.066505616 | 3.034528105 | 31,566,209 | 1.068363941 | 3.167944557 | 31,554,775 | 1.067977028 | 3.169092475 |
| rv32i-pipe-br-hybp-btb-ras-hzopt-luopt-fulu | 32,895,899 | 1.064623554 | 3.039892602 | 31,540,922 | 1.067508098 | 3.170484363 | 31,529,489 | 1.067121219 | 3.171634022 |

Superscalar comparison (fair same-ELF numbers)

From rv32i-superscalar/COREMARK_RESULTS.md fair run:

  • Superscalar O3 i100: cycles 30,832,113, CPI 1.043518297, CoreMark/MHz 3.243371611
  • Best pipelined line so far (rv32i-pipe-br-hybp-btb-ras-hzopt-luopt-fulu):
    • cycles 31,540,922, CPI 1.067508098, CoreMark/MHz 3.170484363

Remaining gap to superscalar on O3 i100:

  • CoreMark/MHz gap: 3.243371611 - 3.170484363 = 0.072887248
  • Relative gap: about 2.299%

Final status summary

  • The main pipeline line improved from 2.572194225 to 3.170484363 CoreMark/MHz at O3 i100.
  • That is about 23.260% better than the original pipelined baseline.
  • The largest single gains came from:
    • early branch resolution,
    • better branch direction prediction,
    • BTB target prediction,
    • generalized load use forwarding (fulu).
  • Later tuning with tournament style direction selection and RAS gave additional smaller gains.

Reference

Full optimization breakdown and detailed analysis: ../../blogs/optm-riscv-core/


Summary

  • Single-cycle: Perfect CPI, limited by clock
  • Multi-cycle: High CPI (~4), low throughput
  • Pipeline: Balanced (~1.3 CPI)
  • Superscalar: Lowest CPI of the overlapped designs (~1.04), but far from the dual-issue ideal of 0.5

The fully optimized pipeline comes within about 2.3% of the superscalar core's CoreMark/MHz at O3, ITERATIONS=100, with significantly lower complexity.