RV32I Core Benchmarking Report

CoreMark Performance Across Microarchitectural Variants

Build & Measurement Flow

All results are generated using a consistent toolchain and simulation flow.

Compilation

riscv32-unknown-elf-gcc -O3 -ffreestanding -fno-builtin -march=rv32im -mabi=ilp32 -nostdlib -Ttext 0x0 \
  -I scripts/coremark -DITERATIONS=100 -DVALIDATION_RUN=1 -DTOTAL_DATA_SIZE=2000 \
  -o coremark_i100_o3.elf \
  scripts/coremark/crt0.S \
  scripts/coremark/core_main.c \
  scripts/coremark/core_list_join.c \
  scripts/coremark/core_matrix.c \
  scripts/coremark/core_state.c \
  scripts/coremark/core_util.c \
  scripts/coremark/core_portme.c \
  -lgcc

Binary Conversion

python3 scripts/elf2hex.py coremark_i100_o3.elf hex/inst_mem.hex hex/data_mem.hex

Simulation

make verilator-prog
./obj_dir/Vtb_program
cat tb_program_results.txt
cp tb_program_results.txt tb_program_results_i100_o3.txt

Result Parsing & Metrics

import re

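def parse(path):
    """Return (total cycles, retired instructions) from a testbench results file."""
    with open(path) as f:
        txt = f.read()
    cycles = re.search(r"Total cycles:\s*(\d+)", txt)
    inst = re.search(r"Retired instructions:\s*(\d+)", txt)
    if cycles is None or inst is None:
        # Fail with the file name instead of an AttributeError on None.
        raise ValueError(f"{path}: cycle or instruction count not found")
    return int(cycles.group(1)), int(inst.group(1))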

runs = [
    ("O2", 1, "tb_program_results_i1_o2.txt"),
    ("O2", 10, "tb_program_results_i10_o2.txt"),
    ("O2", 100, "tb_program_results_i100_o2.txt"),
    ("O3", 1, "tb_program_results_i1_o3.txt"),
    ("O3", 10, "tb_program_results_i10_o3.txt"),
    ("O3", 100, "tb_program_results_i100_o3.txt"),
    ("Ofast", 1, "tb_program_results_i1_ofast.txt"),
    ("Ofast", 10, "tb_program_results_i10_ofast.txt"),
    ("Ofast", 100, "tb_program_results_i100_ofast.txt"),
]

for opt, iters, path in runs:
    cycles, inst = parse(path)
    cpi = cycles / inst
    ipc = inst / cycles
    # Iterations per second at 1 MHz: iterations * 1e6 / cycles.
    cm_mhz = (iters * 1_000_000) / cycles
    print(f"{opt} ITER={iters}: cycles={cycles}, inst={inst}, CPI={cpi:.9f}, IPC={ipc:.9f}, CoreMark/MHz={cm_mhz:.9f}")

1. RV32I Single-Cycle (rv32i-sc)

Results

Opt	Iter	Cycles	Retired Inst	CPI	IPC	CoreMark/MHz	Score @100MHz
O2	1	322977	322975	1.000006192	0.999993808	3.096195704	309.619570
O2	10	3102274	3102272	1.000000645	0.999999355	3.223441901	322.344190
O2	100	30899090	30899088	1.000000065	0.999999935	3.236341264	323.634126
O3	1	307171	307169	1.000006511	0.999993489	3.255515657	325.551566
O3	10	2965056	2965054	1.000000675	0.999999325	3.372617583	337.261758
O3	100	29546307	29546305	1.000000068	0.999999932	3.384517733	338.451773
Ofast	1	307169	307167	1.000006511	0.999993489	3.255536854	325.553685
Ofast	10	2965054	2965052	1.000000675	0.999999325	3.372619858	337.261986
Ofast	100	29546305	29546303	1.000000068	0.999999932	3.384517963	338.451796

Explanation

CPI ≈ 1.0 across all runs
IPC ≈ 1.0 (ideal)
No hazards, no overlap → every instruction completes in one cycle
Performance limited entirely by long clock period

2. RV32I Multi-Cycle (rv32i-mc)

Results

Opt	Iter	Cycles	Retired Inst	CPI	IPC	CoreMark/MHz	Score @100MHz
O2	1	1284132	322976	3.975936292	0.251513084	0.778736142	77.873614
O2	10	12337704	3102273	3.976988486	0.251446541	0.810523579	81.052358
O2	100	122889486	30899089	3.977123274	0.251438020	0.813739265	81.373926
O3	1	1225585	307170	3.989924146	0.250631331	0.815936879	81.593688
O3	10	11830144	2965055	3.989856512	0.250635580	0.845298248	84.529825
O3	100	117885350	29546306	3.989850711	0.250635944	0.848281826	84.828183
Ofast	1	1225576	307168	3.989920825	0.250631540	0.815942871	81.594287
Ofast	10	11830135	2965053	3.989856168	0.250635601	0.845298891	84.529889
Ofast	100	117885341	29546304	3.989850676	0.250635946	0.848281891	84.828189

Explanation

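CPI ≈ 4.0 (3.98 to 3.99 measured)
IPC ≈ 0.25
Each instruction is broken into sequential steps (IF/ID/EX/MEM/WB); an average just under 4 cycles suggests that most instructions do not use all five
No overlap between instructions → very low throughput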

3. RV32I Pipelined (rv32i-pipe)

Results

Opt	Iter	Cycles	Retired Inst	CPI	IPC	CoreMark/MHz	Score @100MHz
O2	1	430351	322977	1.332450918	0.750496688	2.323684620	232.368462
O2	10	4143194	3102274	1.335534514	0.748763876	2.413596853	241.359685
O2	100	41275362	30899090	1.335811572	0.748608577	2.422752828	242.275283
O3	1	402765	307171	1.311207764	0.762655643	2.482837387	248.283739
O3	10	3900320	2965056	1.315428781	0.760208393	2.563892193	256.389219
O3	100	38877313	29546307	1.315809553	0.759988402	2.572194225	257.219423
Ofast	1	402760	307169	1.311200023	0.762660145	2.482868209	248.286821
Ofast	10	3900315	2965054	1.315427982	0.760208855	2.563895480	256.389548
Ofast	100	38877308	29546305	1.315809473	0.759988449	2.572194556	257.219456

Explanation

CPI ≈ 1.31 at O3/Ofast, ≈ 1.34 at O2
IPC ≈ 0.76 at O3/Ofast, ≈ 0.75 at O2
Performance loss from:
- Branch penalties
- Load-use hazards
Still significantly better than multi-cycle
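
The cost of those penalties can be read straight off the table: in a single-issue pipeline at most one instruction retires per cycle, so every cycle beyond the retired-instruction count is a lost slot. The sketch below is illustrative only (`lost_cycles` is a hypothetical helper, not part of the repo) and cannot split the loss between branch penalties and load-use stalls, since the results file does not report them separately.

```python
def lost_cycles(cycles, retired):
    """Return (lost cycles, lost cycles per instruction) for a single-issue core.

    Assumes at most one instruction retires per cycle, so cycles - retired
    counts bubbles, stalls, and pipeline fill together.
    """
    lost = cycles - retired
    return lost, lost / retired

# i100 rows copied from the rv32i-pipe results table.
for opt, cycles, retired in [("O2", 41_275_362, 30_899_090),
                             ("O3", 38_877_313, 29_546_307)]:
    lost, per_inst = lost_cycles(cycles, retired)
    print(f"{opt}: {lost} lost cycles, {per_inst:.4f} per instruction")
```

The per-instruction figure is simply CPI minus one, which makes it a convenient number to track as hazards are removed.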

4. RV32I Superscalar (rv32i-superscalar)

Results

Iter	Cycles	Retired Inst	CPI	IPC	CoreMark/MHz	Score @100MHz	CRC
1	319190	307172	1.039124660	0.962348445	3.132930230	313.293023	0x0000e3c1
10	3092965	2965057	1.043138462	0.958645507	3.233143602	323.314360	0x0000c64e
100	30832113	29546308	1.043518297	0.958296566	3.243371611	324.337161	0x0000844d

Explanation

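CPI ≈ 1.04
IPC ≈ 0.96
Dual-issue capability, so the ideal IPC would be 2.0
Limited by dependency and control hazards
Table rows are -O3 builds (instruction counts match the -O3 rows of the other cores)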

All Variants Implemented

Base Architectures

rv32i-sc — single-cycle
rv32i-mc — multi-cycle
rv32i-pipe — baseline pipeline
rv32i-superscalar — dual-issue
rv32i-ooo — out-of-order
rv32i-super-ooo — optimized OOO

Pipeline Evolution Variants

Branch Handling

rv32i-pipe-br — ID-stage branch resolution
rv32i-pipe-sbp — static predictor
rv32i-pipe-gbp — global predictor
rv32i-pipe-gbp-tuned — tuned global predictor
rv32i-pipe-hybp — hybrid predictor

Direction + Target

rv32i-pipe-br-2dbp — 2-bit predictor
rv32i-pipe-br-2dbp-btb — + BTB

Hazard Optimisations

rv32i-pipe-br-2dbp-btb-hzopt
rv32i-pipe-br-2dbp-btb-hzopt-luopt
rv32i-pipe-br-2dbp-btb-hzopt-luopt-fulu

Advanced Prediction

rv32i-pipe-br-hybp-btb-hzopt-luopt-fulu
rv32i-pipe-br-hybp-btb-ras-hzopt-luopt-fulu

Experimental

rv32i-pipe-1dbp
rv32i-pipe-2dbp

RV32I Pipelined Optimization Journey

Scope and benchmark method

This log tracks the single-issue pipeline path from rv32i-pipe to the latest optimized variants.
CoreMark was run at -O2, -O3, and -Ofast, with ITERATIONS=1/10/100.
For step-by-step comparison, -O3 with ITERATIONS=100 is the main reference point.
Metrics used:
- CoreMark/MHz = (ITERATIONS * 1,000,000) / cycles
- CPI = cycles / retired_instructions
- IPC = retired_instructions / cycles
CRC checks were consistent in all successful runs:
- ITER=1: 0x0000e3c1
- ITER=10: 0x0000c64e
- ITER=100: 0x0000844d
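
As a quick cross-check, the metric definitions and CRC values above can be wrapped in two small helpers. This is a sketch, not the project's tooling: the function names are made up for illustration, and the "Score @100MHz" formula (CoreMark/MHz times the clock in MHz) is inferred from the result tables rather than stated in the text.

```python
# CRCs reported for successful runs, keyed by ITERATIONS (from the list above).
EXPECTED_CRC = {1: 0x0000E3C1, 10: 0x0000C64E, 100: 0x0000844D}

def metrics(iterations, cycles, retired, clock_mhz=100):
    """Compute the report's metrics from raw simulator counters."""
    cm_per_mhz = iterations * 1_000_000 / cycles
    return {
        "cpi": cycles / retired,
        "ipc": retired / cycles,
        "coremark_per_mhz": cm_per_mhz,
        # Assumed definition: scale CoreMark/MHz by the target clock.
        "score": cm_per_mhz * clock_mhz,
    }

def crc_matches(iterations, reported_crc):
    """True if the reported CRC equals the expected one for this run length."""
    return EXPECTED_CRC.get(iterations) == reported_crc

# rv32i-pipe, O3, ITERATIONS=100
m = metrics(100, 38_877_313, 29_546_307)
print(f"CPI={m['cpi']:.9f} CoreMark/MHz={m['coremark_per_mhz']:.9f} "
      f"score={m['score']:.6f}")
print("CRC ok:", crc_matches(100, 0x0000844D))
```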

Starting point: plain pipelined core

Variant: rv32i-pipe

O2 i100: cycles 41,275,362, CPI 1.335811572, CoreMark/MHz 2.422752828
O3 i100: cycles 38,877,313, CPI 1.315809553, CoreMark/MHz 2.572194225
Ofast i100: cycles 38,877,308, CPI 1.315809473, CoreMark/MHz 2.572194556

This is the baseline: CPI is about 1.336 at O2 and 1.316 at O3.

Step 1: early branch resolution

Variant: rv32i-pipe-br

What changed

Branch decision moved earlier to ID stage.
Added ID stage operand forwarding for branch compare so branch operands are ready sooner.
Added redirect from ID when branch decision differs from sequential path.

Why it helped

Reduces control hazard penalty because branch wait time is shorter.

Result (`O3 i100`)

Cycles: 38,877,313 -> 35,812,817
CPI: 1.315809553 -> 1.212091142
CoreMark/MHz: 2.572194225 -> 2.792296400
Improvement vs previous: +8.557% (speedup, old cycles / new cycles - 1; the cycle count itself falls by about 7.88%)
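
The percentage above can be reproduced as the ratio of old to new cycle counts minus one, which is also the CoreMark/MHz gain; the plain fractional drop in cycles is a slightly smaller number. A minimal sketch of both calculations follows (the function names are illustrative, not from the repo):

```python
def speedup_pct(prev_cycles, new_cycles):
    """Throughput gain in percent: old cycles / new cycles - 1."""
    return (prev_cycles / new_cycles - 1) * 100

def cycle_reduction_pct(prev_cycles, new_cycles):
    """Fraction of the previous cycle count that was removed, in percent."""
    return (prev_cycles - new_cycles) / prev_cycles * 100

# rv32i-pipe -> rv32i-pipe-br at O3 i100
prev, new = 38_877_313, 35_812_817
print(f"speedup:         +{speedup_pct(prev, new):.3f}%")
print(f"cycle reduction: -{cycle_reduction_pct(prev, new):.3f}%")
```

The same `speedup_pct` convention reproduces the other per-step percentages that were checked against it, such as +3.458% for the BTB step.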

Branch prediction type sweep (before BTB/hzopt/luopt/fulu)

This sweep compares direction predictors on the simpler pipeline lines.

Variant	Predictor type	O3 i100 cycles	O3 i100 CPI	O3 CoreMark/MHz
`rv32i-pipe-sbp`	static backward taken	35,390,851	1.197809628	2.825589020
`rv32i-pipe-1dbp`	local 1-bit dynamic	34,640,813	1.172424459	2.886768275
`rv32i-pipe-2dbp`	local 2-bit dynamic	34,185,095	1.157000602	2.925251488

Best of this sweep was local 2-bit dynamic, so the main optimization line continued with that style.

Step 2: branch resolve + 2-bit dynamic predictor

Variant: rv32i-pipe-br-2dbp

What changed

Kept ID stage branch resolution.
Added dynamic 2-bit branch history table in IF.
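
For readers unfamiliar with the scheme, a 2-bit branch history table can be modelled in a few lines. This is a behavioural sketch only: the table size, the PC indexing, and the weakly-not-taken reset value are assumptions, not details taken from the RTL.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by low PC bits."""

    def __init__(self, entries=64):
        self.entries = entries            # assumed size, not from the RTL
        self.table = [1] * entries        # 1 = weakly not-taken (assumed reset state)

    def _index(self, pc):
        return (pc >> 2) % self.entries   # drop the two byte-offset bits

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2   # 2 or 3 means predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

def mispredictions(predictor, pc, outcomes):
    """Count wrong predictions for one branch over a sequence of outcomes."""
    wrong = 0
    for taken in outcomes:
        wrong += predictor.predict(pc) != taken
        predictor.update(pc, taken)
    return wrong

loop = [True] * 9 + [False]               # backward branch of a 10-iteration loop
bht = TwoBitPredictor()
print(mispredictions(bht, 0x80, loop), mispredictions(bht, 0x80, loop))
```

With these assumptions the loop branch costs two mispredictions on the first pass and one on each later pass, because a single not-taken outcome at loop exit does not flip the counter out of the taken half.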

Result (`O3 i100`)

Cycles: 35,812,817 -> 33,466,269
CPI: 1.212091142 -> 1.132671809
CoreMark/MHz: 2.792296400 -> 2.988083315
Improvement vs previous: +7.012% over rv32i-pipe-br (+2.148% over rv32i-pipe-2dbp)

Step 3: BTB

Variant: rv32i-pipe-br-2dbp-btb

What changed

Added BTB for target prediction in IF.
Direction and target prediction were both available earlier.

Why it helped

Correctly predicted taken branches no longer wait for late target compute.
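
The effect on fetch can be sketched as follows. The direct-mapped organisation, entry count, and full-tag match are assumptions made for illustration; the actual BTB geometry is not described here.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB: each entry holds a tag and a predicted target."""

    def __init__(self, entries=16):
        self.entries = entries                 # assumed size
        self.tags = [None] * entries
        self.targets = [0] * entries

    def _split(self, pc):
        word = pc >> 2                         # instructions are 4-byte aligned
        return word % self.entries, word // self.entries

    def lookup(self, pc):
        idx, tag = self._split(pc)
        return self.targets[idx] if self.tags[idx] == tag else None

    def insert(self, pc, target):
        idx, tag = self._split(pc)
        self.tags[idx], self.targets[idx] = tag, target

def next_fetch_pc(pc, btb, predict_taken):
    """Pick the next fetch address in IF without waiting for decode."""
    target = btb.lookup(pc)
    if predict_taken and target is not None:
        return target                          # redirect immediately
    return pc + 4                              # fall through

btb = BranchTargetBuffer()
btb.insert(0x100, 0x080)                       # learned when the branch resolved
print(hex(next_fetch_pc(0x100, btb, predict_taken=True)))
```

Without a BTB hit, a taken prediction still has to wait for the target to be computed, which is the delay this step removes.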

Result (`O3 i100`)

Cycles: 33,466,269 -> 32,347,569
CPI: 1.132671809 -> 1.094809209
CoreMark/MHz: 2.988083315 -> 3.091422419
Improvement vs previous: +3.458%

Step 4: hzopt (false load use cleanup)

Variant: rv32i-pipe-br-2dbp-btb-hzopt

What changed

Hazard logic was made source aware, using decoded use_rs1 and use_rs2.
Removed false positive load use stalls when source register is not actually consumed by the instruction.
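
A minimal model of the source-aware check is shown below. The `use_rs1`/`use_rs2` names come from the description above; everything else (the function name, the x0 exclusion, the argument layout) is an assumption for illustration.

```python
def load_use_stall(ex_is_load, ex_rd, id_rs1, id_rs2, use_rs1=True, use_rs2=True):
    """Decide whether the instruction in ID must stall behind a load in EX.

    With use_rs1/use_rs2 left at True this behaves like a naive check that
    compares the raw register fields. Passing the decoded flags suppresses
    matches on fields the instruction does not actually read.
    """
    if not ex_is_load or ex_rd == 0:      # x0 never carries a dependency (assumed check)
        return False
    return (use_rs1 and id_rs1 == ex_rd) or (use_rs2 and id_rs2 == ex_rd)

# lw x5, 0(x1) followed by lui x6, imm: LUI has no source registers, but the
# immediate bits that overlap the rs1 field may happen to decode as 5.
naive = load_use_stall(True, 5, id_rs1=5, id_rs2=0)
aware = load_use_stall(True, 5, id_rs1=5, id_rs2=0, use_rs1=False, use_rs2=False)
print(naive, aware)
```

The small measured gain is consistent with such aliasing being rare in compiled code.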

Result (`O3 i100`)

Cycles: 32,347,569 -> 32,346,765
CPI: 1.094809209 -> 1.094781998
CoreMark/MHz: 3.091422419 -> 3.091499258
Improvement vs previous: +0.002%

Step 5: luopt (first true load use reduction)

Variant: rv32i-pipe-br-2dbp-btb-hzopt-luopt

What changed

Relaxed one true load use case for store data (load -> store rs2) when safe.
Added matching EX path load data forwarding so correctness is preserved.

Result (`O3 i100`)

Cycles: 32,346,765 -> 32,342,681
CPI: 1.094781998 -> 1.094643774
CoreMark/MHz: 3.091499258 -> 3.091889630
Improvement vs previous: +0.013%

Step 6: fulu (generalized load use reduction)

Variant: rv32i-pipe-br-2dbp-btb-hzopt-luopt-fulu

What changed

Generalized MEM to EX load forwarding for both EX operands.
Stalls kept only for ID stage consumers that truly need value in ID (branch and jalr related reads).
Removed extra load use stalls for EX stage consumers now covered by forwarding.
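
The resulting stall rule can be summarised in a short sketch. It assumes, as the description implies, that forwarded load data reaches the EX operands in time, so only instructions that consume a register in ID still have to wait; the function and parameter names are hypothetical.

```python
def stall_after_fulu(ex_is_load, ex_rd, id_sources, reads_in_id):
    """Stall decision for the instruction in ID once MEM-to-EX forwarding exists.

    id_sources  -- set of registers the ID instruction really reads
    reads_in_id -- True for branch and jalr style reads that need the value in ID
    """
    if not ex_is_load or ex_rd == 0:
        return False
    if ex_rd not in id_sources:
        return False                  # no true dependency on the load
    return reads_in_id                # EX-stage consumers are covered by forwarding

# lw x5, 0(x1) ; add x6, x5, x7  -> operand arrives by forwarding, no stall
print(stall_after_fulu(True, 5, {5, 7}, reads_in_id=False))
# lw x5, 0(x1) ; beq x5, x0, L   -> compare happens in ID, stall remains
print(stall_after_fulu(True, 5, {5, 0}, reads_in_id=True))
```

Under this rule the remaining load-use stalls are tied to early branch resolution, which is the trade-off introduced in Step 1.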

Result (`O3 i100`)

Cycles: 32,342,681 -> 31,709,661
CPI: 1.094643774 -> 1.073219100
CoreMark/MHz: 3.091889630 -> 3.153613027
Improvement vs previous: +1.996%

Later revisit: predictor path that started from global and was tuned

After fulu, predictor alternatives were revisited on a separate line.

Variant	O3 i100 cycles	O3 i100 CPI	O3 CoreMark/MHz
`rv32i-pipe-gbp`	34,709,649	1.174754226	2.881043251
`rv32i-pipe-gbp-tuned`	34,527,599	1.168592711	2.896233822
`rv32i-pipe-hybp`	34,021,025	1.151447624	2.939358823

The tuned path improved over its own earlier versions and then moved to a hybrid predictor.

Using that predictor in the full optimized line

Variant: rv32i-pipe-br-hybp-btb-hzopt-luopt-fulu

What changed

Replaced the local 2-bit direction predictor in the full optimized line with tournament style prediction.
Kept BTB, hzopt, luopt, and fulu behavior.
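
The general idea of tournament-style prediction can be sketched as below. The structure shown (a PC-indexed local table, a history-indexed global table, and a PC-indexed chooser, all 2-bit counters) is a common textbook arrangement and an assumption here; table sizes and history length are also made up, and the real rv32i-pipe-hybp design may differ.

```python
def bump(counter, up):
    """Move a 2-bit saturating counter one step up or down."""
    return min(3, counter + 1) if up else max(0, counter - 1)

class TournamentPredictor:
    def __init__(self, entries=64, history_bits=6):
        self.entries = entries
        self.local = [1] * entries                 # indexed by PC
        self.glob = [1] * (1 << history_bits)      # indexed by global history
        self.choice = [1] * entries                # >= 2 means trust the global table
        self.history = 0
        self.hmask = (1 << history_bits) - 1

    def _pc_index(self, pc):
        return (pc >> 2) % self.entries

    def predict(self, pc):
        i = self._pc_index(pc)
        local_taken = self.local[i] >= 2
        global_taken = self.glob[self.history] >= 2
        return global_taken if self.choice[i] >= 2 else local_taken

    def update(self, pc, taken):
        i = self._pc_index(pc)
        local_ok = (self.local[i] >= 2) == taken
        global_ok = (self.glob[self.history] >= 2) == taken
        if local_ok != global_ok:                  # train the chooser only on disagreement
            self.choice[i] = bump(self.choice[i], global_ok)
        self.local[i] = bump(self.local[i], taken)
        self.glob[self.history] = bump(self.glob[self.history], taken)
        self.history = ((self.history << 1) | int(taken)) & self.hmask

p = TournamentPredictor()
wrong = 0
for n in range(200):                               # strictly alternating branch
    taken = n % 2 == 0
    wrong += p.predict(0x40) != taken
    p.update(0x40, taken)
print("mispredictions:", wrong)
```

In this model a strictly alternating branch is predicted correctly after a short warm-up, because the global history separates the two cases, whereas a lone 2-bit counter starting at weakly not-taken never settles on it.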

Result (`O3 i100`)

Cycles: 31,709,661 -> 31,566,209
CPI: 1.073219100 -> 1.068363941
CoreMark/MHz: 3.153613027 -> 3.167944557
Improvement vs previous: +0.454%

Final control-flow add on this line

Variant: rv32i-pipe-br-hybp-btb-ras-hzopt-luopt-fulu

What changed

Added JAL fast path prediction in IF.
Added RAS for return style jalr prediction.
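
A return address stack can be modelled as below. The push/pop rules follow the link-register hints in the RISC-V unprivileged specification (x1 and x5 as link registers); the stack depth, the drop-oldest overflow policy, and the function names are assumptions rather than details of this core.

```python
LINK_REGS = (1, 5)                         # x1 (ra) and x5 (t0)

class ReturnAddressStack:
    def __init__(self, depth=8):
        self.depth = depth                 # assumed depth
        self.stack = []

    def push(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)              # overflow: drop the oldest entry
        self.stack.append(return_pc)

    def pop(self):
        return self.stack.pop() if self.stack else None

def ras_step(ras, pc, is_jalr, rd, rs1):
    """Apply one jal/jalr to the RAS; return the predicted jalr target or None."""
    rd_link, rs1_link = rd in LINK_REGS, rs1 in LINK_REGS
    predicted = None
    # Pop for jalr through a link register, unless rd is the same link register.
    if is_jalr and rs1_link and not (rd_link and rd == rs1):
        predicted = ras.pop()
    if rd_link:
        ras.push(pc + 4)                   # calls record their return address
    return predicted

ras = ReturnAddressStack()
ras_step(ras, 0x100, is_jalr=False, rd=1, rs1=0)           # jal ra, f
ras_step(ras, 0x200, is_jalr=False, rd=1, rs1=0)           # jal ra, g (nested)
print(hex(ras_step(ras, 0x300, is_jalr=True, rd=0, rs1=1)))  # ret from g
```

An empty stack yields no prediction, in which case a design would fall back to whatever the jalr path did before the RAS was added.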

Result (`O3 i100`)

Cycles: 31,566,209 -> 31,540,922
CPI: 1.068363941 -> 1.067508098
CoreMark/MHz: 3.167944557 -> 3.170484363
Improvement vs previous: +0.080%

Consolidated O3 i100 timeline

Order	Variant	O3 i100 cycles	O3 i100 CPI	O3 CoreMark/MHz	Improvement vs previous
1	`rv32i-pipe`	38,877,313	1.315809553	2.572194225	baseline
2	`rv32i-pipe-br`	35,812,817	1.212091142	2.792296400	+8.557%
3	`rv32i-pipe-sbp`	35,390,851	1.197809628	2.825589020	+1.192%
4	`rv32i-pipe-1dbp`	34,640,813	1.172424459	2.886768275	+2.165%
5	`rv32i-pipe-2dbp`	34,185,095	1.157000602	2.925251488	+1.333%
6	`rv32i-pipe-br-2dbp`	33,466,269	1.132671809	2.988083315	+2.148% (+7.012% over step 2)
7	`rv32i-pipe-br-2dbp-btb`	32,347,569	1.094809209	3.091422419	+3.458%
8	`rv32i-pipe-br-2dbp-btb-hzopt`	32,346,765	1.094781998	3.091499258	+0.002%
9	`rv32i-pipe-br-2dbp-btb-hzopt-luopt`	32,342,681	1.094643774	3.091889630	+0.013%
10	`rv32i-pipe-br-2dbp-btb-hzopt-luopt-fulu`	31,709,661	1.073219100	3.153613027	+1.996%
11	`rv32i-pipe-gbp`	34,709,649	1.174754226	2.881043251	separate branch
12	`rv32i-pipe-gbp-tuned`	34,527,599	1.168592711	2.896233822	+0.527% over step 11
13	`rv32i-pipe-hybp`	34,021,025	1.151447624	2.939358823	+1.489% over step 12
14	`rv32i-pipe-br-hybp-btb-hzopt-luopt-fulu`	31,566,209	1.068363941	3.167944557	+0.454% over step 10
15	`rv32i-pipe-br-hybp-btb-ras-hzopt-luopt-fulu`	31,540,922	1.067508098	3.170484363	+0.080% over step 14

Full i100 values for main milestones

Variant	O2 i100 cycles	O2 CPI	O2 CM/MHz	O3 i100 cycles	O3 CPI	O3 CM/MHz	Ofast i100 cycles	Ofast CPI	Ofast CM/MHz
`rv32i-pipe`	41,275,362	1.335811572	2.422752828	38,877,313	1.315809553	2.572194225	38,877,308	1.315809473	2.572194556
`rv32i-pipe-br`	37,838,532	1.224584025	2.642808659	35,812,817	1.212091142	2.792296400	35,812,812	1.212091055	2.792296790
`rv32i-pipe-br-2dbp`	35,141,770	1.137307604	2.845616484	33,466,269	1.132671809	2.988083315	33,466,268	1.132671852	2.988083404
`rv32i-pipe-br-2dbp-btb`	33,813,574	1.094322648	2.957392200	32,347,569	1.094809209	3.091422419	32,347,568	1.094809249	3.091422514
`rv32i-pipe-br-2dbp-btb-hzopt`	33,812,468	1.094286854	2.957488936	32,346,765	1.094781998	3.091499258	32,346,764	1.094782038	3.091499354
`rv32i-pipe-br-2dbp-btb-hzopt-luopt`	33,808,868	1.094170346	2.957803852	32,342,681	1.094643774	3.091889630	32,342,680	1.094643814	3.091889726
`rv32i-pipe-br-2dbp-btb-hzopt-luopt-fulu`	33,144,324	1.072663434	3.017107846	31,709,661	1.073219100	3.153613027	31,709,660	1.073219139	3.153613126
`rv32i-pipe-br-hybp-btb-hzopt-luopt-fulu`	32,954,053	1.066505616	3.034528105	31,566,209	1.068363941	3.167944557	31,554,775	1.067977028	3.169092475
`rv32i-pipe-br-hybp-btb-ras-hzopt-luopt-fulu`	32,895,899	1.064623554	3.039892602	31,540,922	1.067508098	3.170484363	31,529,489	1.067121219	3.171634022

Superscalar comparison (fair same-ELF numbers)

From rv32i-superscalar/COREMARK_RESULTS.md fair run:

Superscalar O3 i100: cycles 30,832,113, CPI 1.043518297, CoreMark/MHz 3.243371611
Best pipelined line so far (rv32i-pipe-br-hybp-btb-ras-hzopt-luopt-fulu):
- cycles 31,540,922, CPI 1.067508098, CoreMark/MHz 3.170484363

Remaining gap to superscalar on O3 i100:

CoreMark/MHz gap: 3.243371611 - 3.170484363 = 0.072887248
Relative gap: about 2.299%

Final status summary

The main pipeline line improved from 2.572194225 to 3.170484363 CoreMark/MHz at O3 i100.
That is about 23.260% better than the original pipelined baseline.
The largest single gains came from:
- early branch resolution,
- better branch direction prediction,
- BTB target prediction,
- generalized load use forwarding (fulu).
Later tuning with tournament style direction selection and RAS gave additional smaller gains.

Reference

Full optimization breakdown and detailed analysis: ../../blogs/optm-riscv-core/

Summary

Single-cycle: Perfect CPI, limited by clock period
Multi-cycle: High CPI (~4), low throughput
Pipeline: Balanced (~1.3 CPI baseline, ~1.07 CPI after optimization)
Superscalar: Best (~1.04 CPI), but underutilized for a dual-issue design (ideal CPI 0.5)

The fully optimized pipeline comes within about 2.3% of the superscalar core in CoreMark/MHz at O3 i100, with significantly lower complexity.
