RV32I Core Benchmarking Report

CoreMark Performance Across Microarchitectural Variants


Build & Measurement Flow

All results use the same toolchain and Verilator simulation flow. The commands below show the -O3, ITERATIONS=100 configuration.

Compilation

riscv32-unknown-elf-gcc -O3 -ffreestanding -fno-builtin -march=rv32im -mabi=ilp32 -nostdlib -Ttext 0x0 \
  -I scripts/coremark -DITERATIONS=100 -DVALIDATION_RUN=1 -DTOTAL_DATA_SIZE=2000 \
  -o coremark_i100_o3.elf \
  scripts/coremark/crt0.S \
  scripts/coremark/core_main.c \
  scripts/coremark/core_list_join.c \
  scripts/coremark/core_matrix.c \
  scripts/coremark/core_state.c \
  scripts/coremark/core_util.c \
  scripts/coremark/core_portme.c \
  -lgcc

Binary Conversion

python3 scripts/elf2hex.py coremark_i100_o3.elf hex/inst_mem.hex hex/data_mem.hex

Simulation

make verilator-prog
./obj_dir/Vtb_program
cat tb_program_results.txt
cp tb_program_results.txt tb_program_results_i100_o3.txt

Result Parsing & Metrics

import re

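def parse(path):
    """Return (cycles, retired_instructions) from a testbench results file."""
    # Close the file promptly instead of relying on garbage collection.
    with open(path) as f:
        txt = f.read()
    # Both counters are matched as "<label>: <integer>".
    cycles = int(re.search(r"Total cycles:\s*(\d+)", txt).group(1))
    inst = int(re.search(r"Retired instructions:\s*(\d+)", txt).group(1))
    return cycles, inst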

runs = [
    ("O2", 1, "tb_program_results_i1_o2.txt"),
    ("O2", 10, "tb_program_results_i10_o2.txt"),
    ("O2", 100, "tb_program_results_i100_o2.txt"),
    ("O3", 1, "tb_program_results_i1_o3.txt"),
    ("O3", 10, "tb_program_results_i10_o3.txt"),
    ("O3", 100, "tb_program_results_i100_o3.txt"),
    ("Ofast", 1, "tb_program_results_i1_ofast.txt"),
    ("Ofast", 10, "tb_program_results_i10_ofast.txt"),
    ("Ofast", 100, "tb_program_results_i100_ofast.txt"),
]

for opt, iters, path in runs:
    cycles, inst = parse(path)
    cpi = cycles / inst
    ipc = inst / cycles
    cm_mhz = (iters * 1_000_000) / cycles
    print(f"{opt} ITER={iters}: cycles={cycles}, inst={inst}, CPI={cpi:.9f}, IPC={ipc:.9f}, CoreMark/MHz={cm_mhz:.9f}")
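
The script above does not print the Score @100MHz column that appears in the result tables. A minimal sketch of how that column appears to be derived, assuming it is simply CoreMark/MHz scaled by a 100 MHz clock (the `metrics` helper and its `clock_mhz` parameter are illustrative names, not part of the original script):

```python
def metrics(iters, cycles, inst, clock_mhz=100.0):
    """Derive the table columns from the raw counters.

    Assumption: the score column is CoreMark/MHz multiplied by the
    clock frequency in MHz (100 MHz in the tables).
    """
    cpi = cycles / inst
    ipc = inst / cycles
    cm_mhz = iters * 1_000_000 / cycles
    return {"cpi": cpi, "ipc": ipc, "cm_mhz": cm_mhz, "score": cm_mhz * clock_mhz}

# Single-cycle core, O3, ITERATIONS=100 (counters copied from the results table).
m = metrics(100, 29_546_307, 29_546_305)
print(f"CPI={m['cpi']:.9f} CoreMark/MHz={m['cm_mhz']:.9f} Score@100MHz={m['score']:.6f}")
```

The 100 MHz figure is a common reference clock for every variant, so the score column ranks the cores exactly as CoreMark/MHz does.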

1. RV32I Single-Cycle (rv32i-sc)

Results

Opt    Iter  Cycles     Retired Inst  CPI          IPC          CoreMark/MHz  Score @100MHz
O2     1     322977     322975        1.000006192  0.999993808  3.096195704   309.619570
O2     10    3102274    3102272       1.000000645  0.999999355  3.223441901   322.344190
O2     100   30899090   30899088      1.000000065  0.999999935  3.236341264   323.634126
O3     1     307171     307169        1.000006511  0.999993489  3.255515657   325.551566
O3     10    2965056    2965054       1.000000675  0.999999325  3.372617583   337.261758
O3     100   29546307   29546305      1.000000068  0.999999932  3.384517733   338.451773
Ofast  1     307169     307167        1.000006511  0.999993489  3.255536854   325.553685
Ofast  10    2965054    2965052       1.000000675  0.999999325  3.372619858   337.261986
Ofast  100   29546305   29546303      1.000000068  0.999999932  3.384517963   338.451796

Explanation

  • CPI ≈ 1.0 across all runs
  • IPC ≈ 1.0 (ideal)
  • No hazards, no overlap → every instruction completes in one cycle
  • Real performance is limited by the long clock period, which CoreMark/MHz and the fixed 100 MHz score do not capture
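
Because CoreMark/MHz hides the clock period, one way to read these numbers is to ask how much faster another variant would have to be clocked to match the single-cycle core per second. A minimal sketch using the O3, ITERATIONS=100 values from this report (the dictionary and function names are illustrative):

```python
# CoreMark/MHz at O3, ITERATIONS=100, copied from the result tables in this report.
CM_PER_MHZ = {
    "rv32i-sc": 3.384517733,
    "rv32i-mc": 0.848281826,
    "rv32i-pipe": 2.572194225,
}

def breakeven_clock_ratio(variant, reference="rv32i-sc"):
    """How many times faster `variant` must be clocked to match `reference`."""
    return CM_PER_MHZ[reference] / CM_PER_MHZ[variant]

for name in ("rv32i-mc", "rv32i-pipe"):
    ratio = breakeven_clock_ratio(name)
    print(f"{name}: needs {ratio:.3f}x the single-cycle clock to break even")
```

The report does not give achievable clock frequencies, so this only states the break-even ratio; it makes no claim about which core is faster in wall-clock time.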

2. RV32I Multi-Cycle (rv32i-mc)

Results

Opt    Iter  Cycles     Retired Inst  CPI          IPC          CoreMark/MHz  Score @100MHz
O2     1     1284132    322976        3.975936292  0.251513084  0.778736142   77.873614
O2     10    12337704   3102273       3.976988486  0.251446541  0.810523579   81.052358
O2     100   122889486  30899089      3.977123274  0.251438020  0.813739265   81.373926
O3     1     1225585    307170        3.989924146  0.250631331  0.815936879   81.593688
O3     10    11830144   2965055       3.989856512  0.250635580  0.845298248   84.529825
O3     100   117885350  29546306      3.989850711  0.250635944  0.848281826   84.828183
Ofast  1     1225576    307168        3.989920825  0.250631540  0.815942871   81.594287
Ofast  10    11830135   2965053       3.989856168  0.250635601  0.845298891   84.529889
Ofast  100   117885341  29546304      3.989850676  0.250635946  0.848281891   84.828189

Explanation

  • CPI ≈ 4.0
  • IPC ≈ 0.25
  • Each instruction is broken into sequential steps (IF/ID/EX/MEM/WB); an average just under 4 cycles suggests that not every instruction uses all five
  • No overlap → very low throughput
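
A CPI near 4 with five named steps is what a weighted average over instruction classes produces when some classes skip steps. The sketch below illustrates only the arithmetic: the per-class cycle counts and the instruction mix are textbook-style assumptions, not values measured on rv32i-mc.

```python
# Hypothetical cycles per instruction class for a classic multi-cycle datapath.
# These are illustrative assumptions, not measurements from rv32i-mc.
CYCLES_PER_CLASS = {"alu": 4, "load": 5, "store": 4, "branch": 3, "jump": 3}

def average_cpi(mix):
    """Weighted-average CPI for an instruction mix given as fractions summing to 1."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9, "mix fractions must sum to 1"
    return sum(frac * CYCLES_PER_CLASS[cls] for cls, frac in mix.items())

# Hypothetical mix, chosen only to illustrate the calculation.
example_mix = {"alu": 0.50, "load": 0.20, "store": 0.10, "branch": 0.15, "jump": 0.05}
print(f"average CPI = {average_cpi(example_mix):.2f}")
```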

3. RV32I Pipelined (rv32i-pipe)

Results

Opt    Iter  Cycles     Retired Inst  CPI          IPC          CoreMark/MHz  Score @100MHz
O2     1     430351     322977        1.332450918  0.750496688  2.323684620   232.368462
O2     10    4143194    3102274       1.335534514  0.748763876  2.413596853   241.359685
O2     100   41275362   30899090      1.335811572  0.748608577  2.422752828   242.275283
O3     1     402765     307171        1.311207764  0.762655643  2.482837387   248.283739
O3     10    3900320    2965056       1.315428781  0.760208393  2.563892193   256.389219
O3     100   38877313   29546307      1.315809553  0.759988402  2.572194225   257.219423
Ofast  1     402760     307169        1.311200023  0.762660145  2.482868209   248.286821
Ofast  10    3900315    2965054       1.315427982  0.760208855  2.563895480   256.389548
Ofast  100   38877308   29546305      1.315809473  0.759988449  2.572194556   257.219456

Explanation

  • CPI ≈ 1.32 at O3 (≈ 1.34 at O2)
  • IPC ≈ 0.76 at O3 (≈ 0.75 at O2)
  • Performance loss from:
    • Branch penalties
    • Load-use hazards
  • About 3x the CoreMark/MHz of the multi-cycle core
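
The combined cost of branch penalties and load-use stalls can be read directly off the counters, since an ideal single-issue pipeline retires one instruction per cycle. A small sketch (the function name is illustrative; the counters do not separate branch cycles from load-use cycles, so only the total is recoverable, and it also includes the few fill and drain cycles):

```python
def stall_breakdown(cycles, inst):
    """Split total cycles into one slot per instruction plus everything else."""
    extra = cycles - inst          # cycles not explained by an ideal CPI of 1
    return extra, extra / inst     # total overhead and overhead per instruction

# rv32i-pipe, O3, ITERATIONS=100 (from the results table).
extra, per_inst = stall_breakdown(38_877_313, 29_546_307)
print(f"overhead cycles = {extra:,} ({per_inst:.4f} per instruction)")
```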


4. RV32I Superscalar (rv32i-superscalar)

Results

Iter  Cycles    Retired Inst  CPI          IPC          CoreMark/MHz  Score @100MHz  CRC
1     319190    307172        1.039124660  0.962348445  3.132930230   313.293023     0x0000e3c1
10    3092965   2965057       1.043138462  0.958645507  3.233143602   323.314360     0x0000c64e
100   30832113  29546308      1.043518297  0.958296566  3.243371611   324.337161     0x0000844d

Explanation

  • CPI ≈ 1.04
  • IPC ≈ 0.96
  • Dual-issue capability
  • Limited by dependency and control hazards
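
For a dual-issue core the ideal IPC is 2, so the measured IPC can be restated as issue-slot utilization. A short sketch (the constant and function names are illustrative):

```python
ISSUE_WIDTH = 2  # dual-issue, per the explanation above

def issue_utilization(cycles, inst, width=ISSUE_WIDTH):
    """Fraction of available issue slots that ended up retiring an instruction."""
    return inst / (cycles * width)

# rv32i-superscalar, ITERATIONS=100 (from the results table).
util = issue_utilization(30_832_113, 29_546_308)
print(f"issue-slot utilization = {util:.1%}")
```

An IPC of about 0.96 therefore corresponds to slightly under half of the issue slots being used.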

All Variants Implemented

Base Architectures

  • rv32i-sc — single-cycle
  • rv32i-mc — multi-cycle
  • rv32i-pipe — baseline pipeline
  • rv32i-superscalar — dual-issue
  • rv32i-ooo — out-of-order
  • rv32i-super-ooo — optimized OOO

Pipeline Evolution Variants

Branch Handling

  • rv32i-pipe-br — ID-stage branch resolution
  • rv32i-pipe-sbp — static predictor
  • rv32i-pipe-gbp — global predictor
  • rv32i-pipe-gbp-tuned — tuned global predictor
  • rv32i-pipe-hybp — hybrid predictor

Direction + Target

  • rv32i-pipe-br-2dbp — 2-bit predictor
  • rv32i-pipe-br-2dbp-btb — + BTB

Hazard Optimisations

  • rv32i-pipe-br-2dbp-btb-hzopt
  • rv32i-pipe-br-2dbp-btb-hzopt-luopt
  • rv32i-pipe-br-2dbp-btb-hzopt-luopt-fulu

Advanced Prediction

  • rv32i-pipe-br-hybp-btb-hzopt-luopt-fulu
  • rv32i-pipe-br-hybp-btb-ras-hzopt-luopt-fulu

Experimental

  • rv32i-pipe-1dbp
  • rv32i-pipe-2dbp

RV32I Pipelined Optimization Journey

Scope and benchmark method

  • This log tracks the single-issue pipeline path from rv32i-pipe to the latest optimized variants.
  • CoreMark was run at -O2, -O3, and -Ofast, with ITERATIONS=1/10/100.
  • For step-by-step comparison, -O3 with ITERATIONS=100 is the main reference point.
  • Metrics used:
    • CoreMark/MHz = (ITERATIONS * 1,000,000) / cycles
    • CPI = cycles / retired_instructions
    • IPC = retired_instructions / cycles
  • CRC checks were consistent in all successful runs:
    • ITER=1: 0x0000e3c1
    • ITER=10: 0x0000c64e
    • ITER=100: 0x0000844d

Starting point: plain pipelined core

Variant: rv32i-pipe

  • O2 i100: cycles 41,275,362, CPI 1.335811572, CoreMark/MHz 2.422752828
  • O3 i100: cycles 38,877,313, CPI 1.315809553, CoreMark/MHz 2.572194225
  • Ofast i100: cycles 38,877,308, CPI 1.315809473, CoreMark/MHz 2.572194556

This is the baseline: CPI is about 1.336 at O2 and 1.316 at O3.

Step 1: early branch resolution

Variant: rv32i-pipe-br

What changed

  • Branch decision moved earlier to ID stage.
  • Added ID stage operand forwarding for branch compare so branch operands are ready sooner.
  • Added redirect from ID when branch decision differs from sequential path.

Why it helped

  • Reduces control hazard penalty because branch wait time is shorter.

Result (O3 i100)

  • Cycles: 38,877,313 -> 35,812,817
  • CPI: 1.315809553 -> 1.212091142
  • CoreMark/MHz: 2.572194225 -> 2.792296400
  • Improvement vs previous: +8.557% CoreMark/MHz (cycle count fell by 7.882%)
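
The percentage reported for each step matches the ratio of CoreMark/MHz values, which is not the same number as the fraction of cycles removed. A minimal sketch of both calculations for this step (function names are illustrative):

```python
def throughput_gain(old_cycles, new_cycles):
    """Relative CoreMark/MHz gain for a fixed workload: old/new - 1."""
    return old_cycles / new_cycles - 1.0

def cycle_reduction(old_cycles, new_cycles):
    """Fraction of cycles removed, relative to the old cycle count."""
    return (old_cycles - new_cycles) / old_cycles

old, new = 38_877_313, 35_812_817  # rv32i-pipe -> rv32i-pipe-br, O3 i100
print(f"throughput gain: {throughput_gain(old, new):+.3%}")
print(f"cycle reduction: {cycle_reduction(old, new):.3%}")
```

The same `throughput_gain` calculation reproduces the improvement figures quoted for the BTB and fulu steps.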

Branch prediction type sweep (before BTB/hzopt/luopt/fulu)

This sweep compares direction predictors on the simpler pipeline variants.

Variant          Predictor type         O3 i100 cycles  O3 i100 CPI  O3 CoreMark/MHz
rv32i-pipe-sbp   static backward taken  35,390,851      1.197809628  2.825589020
rv32i-pipe-1dbp  local 1-bit dynamic    34,640,813      1.172424459  2.886768275
rv32i-pipe-2dbp  local 2-bit dynamic    34,185,095      1.157000602  2.925251488

The local 2-bit dynamic predictor was the best of this sweep, so the main optimization line continued with it.
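
To illustrate why the three predictor types can rank this way, the sketch below runs simplified software models of each on a synthetic trace. Everything here is an assumption made for illustration: the trace (one loop back-edge plus one mostly-taken forward branch), the initial counter states, and the per-PC tables are not taken from the RTL.

```python
from collections import defaultdict

PREDICTORS = {
    # name: (initial state, predict(state, backward), update(state, taken))
    "static backward-taken": (0, lambda s, back: back, lambda s, t: s),
    "1-bit": (0, lambda s, back: s == 1, lambda s, t: int(t)),
    "2-bit": (1, lambda s, back: s >= 2,
              lambda s, t: min(s + 1, 3) if t else max(s - 1, 0)),
}

def accuracy(name, trace):
    """Fraction of correct predictions using one state entry per branch PC."""
    init, predict, update = PREDICTORS[name]
    table = defaultdict(lambda: init)
    hits = 0
    for pc, backward, taken in trace:
        state = table[pc]
        hits += predict(state, backward) == taken
        table[pc] = update(state, taken)
    return hits / len(trace)

# Hypothetical trace: both branches are taken 7 times, then not taken once,
# and that pattern repeats 100 times.
pattern = [True] * 7 + [False]
trace = []
for _ in range(100):
    for taken in pattern:
        trace.append((0x100, True, taken))   # loop back-edge (backward target)
        trace.append((0x080, False, taken))  # forward branch, mostly taken

for name in PREDICTORS:
    print(f"{name:22s} accuracy = {accuracy(name, trace):.3f}")
```

On this trace the static scheme mispredicts the mostly-taken forward branch, the 1-bit scheme mispredicts twice per loop exit, and the 2-bit counter mispredicts only once, which is the usual argument for the extra hysteresis bit.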

Step 2: branch resolve + 2-bit dynamic predictor

Variant: rv32i-pipe-br-2dbp

What changed

  • Kept ID stage branch resolution.
  • Added dynamic 2-bit branch history table in IF.

Result (O3 i100)

  • Cycles: 35,812,817 -> 33,466,269
  • CPI: 1.212091142 -> 1.132671809
  • CoreMark/MHz: 2.792296400 -> 2.988083315
  • Improvement vs previous: +7.012% CoreMark/MHz over rv32i-pipe-br

Step 3: BTB

Variant: rv32i-pipe-br-2dbp-btb

What changed

  • Added BTB for target prediction in IF.
  • Direction and target prediction were both available earlier.

Why it helped

  • Correctly predicted taken branches no longer wait for late target compute.

Result (O3 i100)

  • Cycles: 33,466,269 -> 32,347,569
  • CPI: 1.132671809 -> 1.094809209
  • CoreMark/MHz: 2.988083315 -> 3.091422419
  • Improvement vs previous: +3.458%
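
A branch target buffer can be modelled in a few lines. The sketch below is a generic direct-mapped BTB, not the RTL from this project: the entry count, the index/tag split, and the class name are assumptions made for illustration.

```python
class BTB:
    """Direct-mapped branch target buffer indexed by low PC bits (sketch)."""

    def __init__(self, entries=64):
        self.entries = entries
        self.table = [None] * entries      # each slot: (tag, target) or None

    def _index_tag(self, pc):
        word = pc >> 2                     # RV32I instructions are 4-byte aligned
        return word % self.entries, word // self.entries

    def lookup(self, pc):
        """Return the predicted target for pc, or None on a miss."""
        idx, tag = self._index_tag(pc)
        slot = self.table[idx]
        return slot[1] if slot and slot[0] == tag else None

    def update(self, pc, target):
        """Record the resolved target of a taken branch."""
        idx, tag = self._index_tag(pc)
        self.table[idx] = (tag, target)

btb = BTB()
print(btb.lookup(0x1000))        # miss before the branch has been seen
btb.update(0x1000, 0x0F00)
print(hex(btb.lookup(0x1000)))   # hit: the target is available at fetch time
```

On a hit, fetch can redirect to the stored target in the same cycle as the direction prediction instead of waiting for the target to be computed later in the pipeline.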

Step 4: hzopt (false load-use cleanup)

Variant: rv32i-pipe-br-2dbp-btb-hzopt

What changed

  • Hazard logic was made source-aware, using the decoded use_rs1 and use_rs2 signals.
  • Removed false-positive load-use stalls when a source register field is not actually consumed by the instruction.

Result (O3 i100)

  • Cycles: 32,347,569 -> 32,346,765
  • CPI: 1.094809209 -> 1.094781998
  • CoreMark/MHz: 3.091422419 -> 3.091499258
  • Improvement vs previous: +0.002%
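
The difference between the two hazard checks can be sketched as a small software model. This is an illustration rather than the project's RTL: the `Decoded` record, the function names, and the x0 exclusion are assumptions, and only `use_rs1` and `use_rs2` come from the description above.

```python
from dataclasses import dataclass

@dataclass
class Decoded:
    rs1: int        # raw rs1 field, bits [19:15]
    rs2: int        # raw rs2 field, bits [24:20]
    use_rs1: bool   # instruction really reads rs1
    use_rs2: bool   # instruction really reads rs2

def load_use_stall_naive(ex_is_load, ex_rd, id_inst):
    """Stall whenever either raw register field matches the load destination."""
    return ex_is_load and ex_rd != 0 and ex_rd in (id_inst.rs1, id_inst.rs2)

def load_use_stall_source_aware(ex_is_load, ex_rd, id_inst):
    """Stall only when a matching field is a real source operand."""
    hit_rs1 = id_inst.use_rs1 and id_inst.rs1 == ex_rd
    hit_rs2 = id_inst.use_rs2 and id_inst.rs2 == ex_rd
    return ex_is_load and ex_rd != 0 and (hit_rs1 or hit_rs2)

# "lw x5, 0(x2)" followed by "addi x6, x7, 5": the low immediate bits occupy
# the rs2 field and happen to equal 5, but addi never reads rs2.
addi = Decoded(rs1=7, rs2=5, use_rs1=True, use_rs2=False)
print(load_use_stall_naive(True, 5, addi))         # True: a false stall
print(load_use_stall_source_aware(True, 5, addi))  # False: no stall needed
```

Cases like this are rare in compiled code, which is consistent with the very small cycle reduction measured for this step.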

Step 5: luopt (first true load-use reduction)

Variant: rv32i-pipe-br-2dbp-btb-hzopt-luopt

What changed

  • Relaxed one true load-use case for store data (load -> store rs2) when safe.
  • Added matching EX-path load data forwarding so correctness is preserved.

Result (O3 i100)

  • Cycles: 32,346,765 -> 32,342,681
  • CPI: 1.094781998 -> 1.094643774
  • CoreMark/MHz: 3.091499258 -> 3.091889630
  • Improvement vs previous: +0.013%

Step 6: fulu (generalized load-use reduction)

Variant: rv32i-pipe-br-2dbp-btb-hzopt-luopt-fulu

What changed

  • Generalized MEM-to-EX load forwarding for both EX operands.
  • Stalls are kept only for ID-stage consumers that truly need the value in ID (branch and jalr related reads).
  • Removed extra load-use stalls for EX-stage consumers now covered by forwarding.

Result (O3 i100)

  • Cycles: 32,342,681 -> 31,709,661
  • CPI: 1.094643774 -> 1.073219100
  • CoreMark/MHz: 3.091889630 -> 3.153613027
  • Improvement vs previous: +1.996%
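
The resulting stall rule can be summarised as a small decision function. This is a behavioural sketch under assumptions, not the project's hazard unit: the function and argument names are hypothetical, and it models only the decision made while the load is in EX and its consumer is in ID.

```python
def load_use_stall_fulu(ex_is_load, ex_rd, src_regs, reads_in_id):
    """Stall decision once MEM-to-EX load forwarding covers both EX operands.

    src_regs:    registers the ID-stage instruction really reads.
    reads_in_id: True for instructions that need operand values in ID
                 (branches and jalr in this design).
    """
    depends = ex_is_load and ex_rd != 0 and ex_rd in src_regs
    # EX-stage consumers pick the load data up through forwarding, so only
    # ID-stage consumers still have to wait.
    return depends and reads_in_id

# lw x5, ...  followed by  add x6, x5, x7  -> no stall, value is forwarded
print(load_use_stall_fulu(True, 5, {5, 7}, reads_in_id=False))
# lw x5, ...  followed by  beq x5, x0, ... -> stall, compare happens in ID
print(load_use_stall_fulu(True, 5, {5, 0}, reads_in_id=True))
```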

Later revisit: global, tuned global, and hybrid predictors

After fulu, alternative predictors were revisited on a separate line of variants.

Variant               O3 i100 cycles  O3 i100 CPI  O3 CoreMark/MHz
rv32i-pipe-gbp        34,709,649      1.174754226  2.881043251
rv32i-pipe-gbp-tuned  34,527,599      1.168592711  2.896233822
rv32i-pipe-hybp       34,021,025      1.151447624  2.939358823

Tuning the global predictor gave a small gain, and the hybrid predictor was the best of the three.

Using the hybrid predictor in the full optimized line

Variant: rv32i-pipe-br-hybp-btb-hzopt-luopt-fulu

What changed

  • Replaced the local 2-bit direction predictor in the full optimized line with tournament-style (hybrid) prediction.
  • Kept BTB, hzopt, luopt, and fulu behavior.

Result (O3 i100)

  • Cycles: 31,709,661 -> 31,566,209
  • CPI: 1.073219100 -> 1.068363941
  • CoreMark/MHz: 3.153613027 -> 3.167944557
  • Improvement vs previous: +0.455%
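
Tournament-style prediction combines two component predictors with a chooser that learns which one to trust per branch. The sketch below is a generic textbook version, not the hybp RTL: the table sizes, the gshare-style XOR indexing of the global component, and all names are assumptions.

```python
class Counter2:
    """Table of 2-bit saturating counters, initialised weakly not-taken."""
    def __init__(self, size):
        self.size, self.t = size, [1] * size
    def predict(self, i):
        return self.t[i % self.size] >= 2
    def update(self, i, up):
        i %= self.size
        self.t[i] = min(self.t[i] + 1, 3) if up else max(self.t[i] - 1, 0)

class Tournament:
    def __init__(self, size=256, hist_bits=8):
        self.local, self.glob, self.choose = Counter2(size), Counter2(size), Counter2(size)
        self.hist, self.mask = 0, (1 << hist_bits) - 1

    def predict_and_update(self, pc, taken):
        """Predict one branch, train all tables, return True if correct."""
        li = pc >> 2
        gi = li ^ self.hist                           # gshare-style global index
        lp, gp = self.local.predict(li), self.glob.predict(gi)
        pred = gp if self.choose.predict(li) else lp  # chooser >= 2 picks global
        if lp != gp:                                  # train chooser on disagreement
            self.choose.update(li, gp == taken)
        self.local.update(li, taken)
        self.glob.update(gi, taken)
        self.hist = ((self.hist << 1) | int(taken)) & self.mask
        return pred == taken

# A strictly alternating branch defeats a lone 2-bit counter, but the
# history-indexed component learns it and the chooser switches over.
t = Tournament()
hits = sum(t.predict_and_update(0x400, i % 2 == 0) for i in range(1000))
print(f"tournament accuracy on alternating branch: {hits / 1000:.3f}")
```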

Final control-flow addition on this line

Variant: rv32i-pipe-br-hybp-btb-ras-hzopt-luopt-fulu

What changed

  • Added fast-path JAL prediction in IF.
  • Added a return address stack (RAS) to predict return-style jalr.

Result (O3 i100)

  • Cycles: 31,566,209 -> 31,540,922
  • CPI: 1.068363941 -> 1.067508098
  • CoreMark/MHz: 3.167944557 -> 3.170484363
  • Improvement vs previous: +0.080%
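
A return address stack is easy to model in software. The sketch below is a generic version, not the project's implementation: the depth, the overflow policy, and the simplification that only rd = x1 counts as a call are assumptions.

```python
class ReturnAddressStack:
    """Small fixed-depth stack of predicted return addresses (sketch)."""

    def __init__(self, depth=8):
        self.depth, self.stack = depth, []

    def on_call(self, pc):
        """Call (jal/jalr writing x1): push the return address pc + 4."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)              # overflow: drop the oldest entry
        self.stack.append(pc + 4)

    def predict_return(self):
        """Return (jalr reading x1): pop the predicted target, None if empty."""
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.on_call(0x100)                 # outer call
ras.on_call(0x200)                 # nested call
print(hex(ras.predict_return()))   # inner return goes back to 0x204
print(hex(ras.predict_return()))   # outer return goes back to 0x104
```

A BTB alone stores one target per jalr, so a function called from several sites keeps mispredicting its return; the stack supplies the most recent call site instead.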

Consolidated O3 i100 timeline

Order  Variant                                      O3 i100 cycles  O3 i100 CPI  O3 CoreMark/MHz  Improvement vs previous
1      rv32i-pipe                                   38,877,313      1.315809553  2.572194225      baseline
2      rv32i-pipe-br                                35,812,817      1.212091142  2.792296400      +8.557%
3      rv32i-pipe-sbp                               35,390,851      1.197809628  2.825589020      +1.192%
4      rv32i-pipe-1dbp                              34,640,813      1.172424459  2.886768275      +2.165%
5      rv32i-pipe-2dbp                              34,185,095      1.157000602  2.925251488      +1.333%
6      rv32i-pipe-br-2dbp                           33,466,269      1.132671809  2.988083315      +2.148%
7      rv32i-pipe-br-2dbp-btb                       32,347,569      1.094809209  3.091422419      +3.458%
8      rv32i-pipe-br-2dbp-btb-hzopt                 32,346,765      1.094781998  3.091499258      +0.002%
9      rv32i-pipe-br-2dbp-btb-hzopt-luopt           32,342,681      1.094643774  3.091889630      +0.013%
10     rv32i-pipe-br-2dbp-btb-hzopt-luopt-fulu      31,709,661      1.073219100  3.153613027      +1.996%
11     rv32i-pipe-gbp                               34,709,649      1.174754226  2.881043251      separate branch
12     rv32i-pipe-gbp-tuned                         34,527,599      1.168592711  2.896233822      +0.527% over step 11
13     rv32i-pipe-hybp                              34,021,025      1.151447624  2.939358823      +1.489% over step 12
14     rv32i-pipe-br-hybp-btb-hzopt-luopt-fulu      31,566,209      1.068363941  3.167944557      +0.455% over step 10
15     rv32i-pipe-br-hybp-btb-ras-hzopt-luopt-fulu  31,540,922      1.067508098  3.170484363      +0.080% over step 14

Full i100 values for main milestones

Variant                                      O2 i100 cycles  O2 CPI       O2 CM/MHz    O3 i100 cycles  O3 CPI       O3 CM/MHz    Ofast i100 cycles  Ofast CPI    Ofast CM/MHz
rv32i-pipe                                   41,275,362      1.335811572  2.422752828  38,877,313      1.315809553  2.572194225  38,877,308         1.315809473  2.572194556
rv32i-pipe-br                                37,838,532      1.224584025  2.642808659  35,812,817      1.212091142  2.792296400  35,812,812         1.212091055  2.792296790
rv32i-pipe-br-2dbp                           35,141,770      1.137307604  2.845616484  33,466,269      1.132671809  2.988083315  33,466,268         1.132671852  2.988083404
rv32i-pipe-br-2dbp-btb                       33,813,574      1.094322648  2.957392200  32,347,569      1.094809209  3.091422419  32,347,568         1.094809249  3.091422514
rv32i-pipe-br-2dbp-btb-hzopt                 33,812,468      1.094286854  2.957488936  32,346,765      1.094781998  3.091499258  32,346,764         1.094782038  3.091499354
rv32i-pipe-br-2dbp-btb-hzopt-luopt           33,808,868      1.094170346  2.957803852  32,342,681      1.094643774  3.091889630  32,342,680         1.094643814  3.091889726
rv32i-pipe-br-2dbp-btb-hzopt-luopt-fulu      33,144,324      1.072663434  3.017107846  31,709,661      1.073219100  3.153613027  31,709,660         1.073219139  3.153613126
rv32i-pipe-br-hybp-btb-hzopt-luopt-fulu      32,954,053      1.066505616  3.034528105  31,566,209      1.068363941  3.167944557  31,554,775         1.067977028  3.169092475
rv32i-pipe-br-hybp-btb-ras-hzopt-luopt-fulu  32,895,899      1.064623554  3.039892602  31,540,922      1.067508098  3.170484363  31,529,489         1.067121219  3.171634022

Superscalar comparison (fair same-ELF numbers)

From rv32i-superscalar/COREMARK_RESULTS.md fair run:

  • Superscalar O3 i100: cycles 30,832,113, CPI 1.043518297, CoreMark/MHz 3.243371611
  • Best pipelined line so far (rv32i-pipe-br-hybp-btb-ras-hzopt-luopt-fulu):
    • cycles 31,540,922, CPI 1.067508098, CoreMark/MHz 3.170484363

Remaining gap to superscalar on O3 i100:

  • CoreMark/MHz gap: 3.243371611 - 3.170484363 = 0.072887248
  • Relative gap: about 2.299% of the pipelined score

Final status summary

  • The main pipeline line improved from 2.572194225 to 3.170484363 CoreMark/MHz at O3 i100.
  • That is about 23.260% better than the original pipelined baseline.
  • The largest single gains came from:
    • early branch resolution,
    • better branch direction prediction,
    • BTB target prediction,
    • generalized load-use forwarding (fulu).
  • Later tuning with tournament-style direction selection and a RAS gave additional smaller gains.
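
The overall figure can be cross-checked by compounding the per-step gains along the main line. A small sketch using the O3, ITERATIONS=100 cycle counts from this report (the list and function names are illustrative):

```python
# O3, ITERATIONS=100 cycle counts for the main line, copied from this report.
MAIN_LINE = [
    ("rv32i-pipe", 38_877_313),
    ("rv32i-pipe-br", 35_812_817),
    ("rv32i-pipe-br-2dbp", 33_466_269),
    ("rv32i-pipe-br-2dbp-btb", 32_347_569),
    ("rv32i-pipe-br-2dbp-btb-hzopt", 32_346_765),
    ("rv32i-pipe-br-2dbp-btb-hzopt-luopt", 32_342_681),
    ("rv32i-pipe-br-2dbp-btb-hzopt-luopt-fulu", 31_709_661),
    ("rv32i-pipe-br-hybp-btb-hzopt-luopt-fulu", 31_566_209),
    ("rv32i-pipe-br-hybp-btb-ras-hzopt-luopt-fulu", 31_540_922),
]

def step_gains(line):
    """Yield (variant, CoreMark/MHz gain over the previous variant)."""
    for (_, prev), (name, cur) in zip(line, line[1:]):
        yield name, prev / cur - 1.0

total = MAIN_LINE[0][1] / MAIN_LINE[-1][1] - 1.0
for name, gain in step_gains(MAIN_LINE):
    print(f"{name:45s} {gain:+.3%}")
print(f"{'total':45s} {total:+.3%}")
```

Because the gains multiply rather than add, the total is slightly larger than the sum of the individual percentages.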

Reference

Full optimization breakdown and detailed analysis: ../../blogs/optm-riscv-core/


Summary

  • Single-cycle: CPI ≈ 1.0, limited by clock period
  • Multi-cycle: high CPI (~4), low throughput
  • Pipeline: balanced (~1.3 CPI at baseline, ~1.07 after optimization)
  • Superscalar: lowest CPI of the pipelined designs (~1.04), but underutilized for a dual-issue core

The fully optimized pipeline comes within about 2.3% of the superscalar CoreMark/MHz with significantly lower complexity.