RV32I Core Benchmarking Report
CoreMark Performance Across Microarchitectural Variants
Build & Measurement Flow
All results are generated using a consistent toolchain and simulation flow.
Compilation
riscv32-unknown-elf-gcc -O3 -ffreestanding -fno-builtin -march=rv32im -mabi=ilp32 -nostdlib -Ttext 0x0 \
-I scripts/coremark -DITERATIONS=100 -DVALIDATION_RUN=1 -DTOTAL_DATA_SIZE=2000 \
-o coremark_i100_o3.elf \
scripts/coremark/crt0.S \
scripts/coremark/core_main.c \
scripts/coremark/core_list_join.c \
scripts/coremark/core_matrix.c \
scripts/coremark/core_state.c \
scripts/coremark/core_util.c \
scripts/coremark/core_portme.c \
-lgcc
Binary Conversion
python3 scripts/elf2hex.py coremark_i100_o3.elf hex/inst_mem.hex hex/data_mem.hex
Simulation
make verilator-prog
./obj_dir/Vtb_program
cat tb_program_results.txt
cp tb_program_results.txt tb_program_results_i100_o3.txt
Result Parsing & Metrics
import re
def parse(path):
    """Return (total cycles, retired instructions) from a testbench results file."""
    with open(path) as f:
        txt = f.read()
    cycles = int(re.search(r"Total cycles:\s*(\d+)", txt).group(1))
    inst = int(re.search(r"Retired instructions:\s*(\d+)", txt).group(1))
    return cycles, inst
# (optimization level, CoreMark ITERATIONS, results file)
runs = [
    ("O2", 1, "tb_program_results_i1_o2.txt"),
    ("O2", 10, "tb_program_results_i10_o2.txt"),
    ("O2", 100, "tb_program_results_i100_o2.txt"),
    ("O3", 1, "tb_program_results_i1_o3.txt"),
    ("O3", 10, "tb_program_results_i10_o3.txt"),
    ("O3", 100, "tb_program_results_i100_o3.txt"),
    ("Ofast", 1, "tb_program_results_i1_ofast.txt"),
    ("Ofast", 10, "tb_program_results_i10_ofast.txt"),
    ("Ofast", 100, "tb_program_results_i100_ofast.txt"),
]
for opt, iters, path in runs:
    cycles, inst = parse(path)
    cpi = cycles / inst
    ipc = inst / cycles
    # CoreMark/MHz: benchmark iterations completed per million clock cycles
    cm_mhz = (iters * 1_000_000) / cycles
    print(f"{opt} ITER={iters}: cycles={cycles}, inst={inst}, CPI={cpi:.9f}, IPC={ipc:.9f}, CoreMark/MHz={cm_mhz:.9f}")
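The Score @100MHz column in the result tables is not printed by the script above. The tabulated values are consistent with scaling CoreMark/MHz by an assumed 100 MHz clock. A minimal sketch of that conversion follows; the helper names are hypothetical, and the 100 MHz figure is treated here as a normalization point rather than a measured frequency.

```python
def coremark_per_mhz(iterations: int, cycles: int) -> float:
    """CoreMark/MHz: benchmark iterations completed per million clock cycles."""
    return iterations * 1_000_000 / cycles


def score_at(iterations: int, cycles: int, mhz: float = 100.0) -> float:
    """Hypothetical helper: iterations per second at an assumed clock of `mhz` MHz."""
    return coremark_per_mhz(iterations, cycles) * mhz


# Single-cycle core, O3, ITERATIONS=100 (cycle count taken from the results table).
print(f"{score_at(100, 29_546_307):.6f}")
```

Because the single-cycle core would almost certainly close timing at a lower frequency than the pipelined cores, this column compares per-cycle efficiency only.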
1. RV32I Single-Cycle (rv32i-sc)
Results
| Opt | Iter | Cycles | Retired Inst | CPI | IPC | CoreMark/MHz | Score @100MHz |
|---|---|---|---|---|---|---|---|
| O2 | 1 | 322977 | 322975 | 1.000006192 | 0.999993808 | 3.096195704 | 309.619570 |
| O2 | 10 | 3102274 | 3102272 | 1.000000645 | 0.999999355 | 3.223441901 | 322.344190 |
| O2 | 100 | 30899090 | 30899088 | 1.000000065 | 0.999999935 | 3.236341264 | 323.634126 |
| O3 | 1 | 307171 | 307169 | 1.000006511 | 0.999993489 | 3.255515657 | 325.551566 |
| O3 | 10 | 2965056 | 2965054 | 1.000000675 | 0.999999325 | 3.372617583 | 337.261758 |
| O3 | 100 | 29546307 | 29546305 | 1.000000068 | 0.999999932 | 3.384517733 | 338.451773 |
| Ofast | 1 | 307169 | 307167 | 1.000006511 | 0.999993489 | 3.255536854 | 325.553685 |
| Ofast | 10 | 2965054 | 2965052 | 1.000000675 | 0.999999325 | 3.372619858 | 337.261986 |
| Ofast | 100 | 29546305 | 29546303 | 1.000000068 | 0.999999932 | 3.384517963 | 338.451796 |
Explanation
- CPI ≈ 1.0 across all runs
- IPC ≈ 1.0 (ideal)
- No hazards, no overlap → every instruction completes in one cycle
- Performance limited entirely by long clock period
2. RV32I Multi-Cycle (rv32i-mc)
Results
| Opt | Iter | Cycles | Retired Inst | CPI | IPC | CoreMark/MHz | Score @100MHz |
|---|---|---|---|---|---|---|---|
| O2 | 1 | 1284132 | 322976 | 3.975936292 | 0.251513084 | 0.778736142 | 77.873614 |
| O2 | 10 | 12337704 | 3102273 | 3.976988486 | 0.251446541 | 0.810523579 | 81.052358 |
| O2 | 100 | 122889486 | 30899089 | 3.977123274 | 0.251438020 | 0.813739265 | 81.373926 |
| O3 | 1 | 1225585 | 307170 | 3.989924146 | 0.250631331 | 0.815936879 | 81.593688 |
| O3 | 10 | 11830144 | 2965055 | 3.989856512 | 0.250635580 | 0.845298248 | 84.529825 |
| O3 | 100 | 117885350 | 29546306 | 3.989850711 | 0.250635944 | 0.848281826 | 84.828183 |
| Ofast | 1 | 1225576 | 307168 | 3.989920825 | 0.250631540 | 0.815942871 | 81.594287 |
| Ofast | 10 | 11830135 | 2965053 | 3.989856168 | 0.250635601 | 0.845298891 | 84.529889 |
| Ofast | 100 | 117885341 | 29546304 | 3.989850676 | 0.250635946 | 0.848281891 | 84.828189 |
Explanation
- CPI ≈ 4.0
- IPC ≈ 0.25
- Each instruction broken into sequential steps (IF/ID/EX/MEM/WB)
- No overlap → very low throughput
3. RV32I Pipelined (rv32i-pipe)
Results
| Opt | Iter | Cycles | Retired Inst | CPI | IPC | CoreMark/MHz | Score @100MHz |
|---|---|---|---|---|---|---|---|
| O2 | 1 | 430351 | 322977 | 1.332450918 | 0.750496688 | 2.323684620 | 232.368462 |
| O2 | 10 | 4143194 | 3102274 | 1.335534514 | 0.748763876 | 2.413596853 | 241.359685 |
| O2 | 100 | 41275362 | 30899090 | 1.335811572 | 0.748608577 | 2.422752828 | 242.275283 |
| O3 | 1 | 402765 | 307171 | 1.311207764 | 0.762655643 | 2.482837387 | 248.283739 |
| O3 | 10 | 3900320 | 2965056 | 1.315428781 | 0.760208393 | 2.563892193 | 256.389219 |
| O3 | 100 | 38877313 | 29546307 | 1.315809553 | 0.759988402 | 2.572194225 | 257.219423 |
| Ofast | 1 | 402760 | 307169 | 1.311200023 | 0.762660145 | 2.482868209 | 248.286821 |
| Ofast | 10 | 3900315 | 2965054 | 1.315427982 | 0.760208855 | 2.563895480 | 256.389548 |
| Ofast | 100 | 38877308 | 29546305 | 1.315809473 | 0.759988449 | 2.572194556 | 257.219456 |
Explanation
- CPI ≈ 1.31 to 1.34, depending on optimization level
- IPC ≈ 0.75 to 0.76
- Performance loss from:
  - Branch penalties
  - Load-use hazards
- Still significantly better than multi-cycle
4. RV32I Superscalar
Results
| Iter | Cycles | Retired Inst | CPI | IPC | CoreMark/MHz | Score @100MHz | CRC |
|---|---|---|---|---|---|---|---|
| 1 | 319190 | 307172 | 1.039124660 | 0.962348445 | 3.132930230 | 313.293023 | 0x0000e3c1 |
| 10 | 3092965 | 2965057 | 1.043138462 | 0.958645507 | 3.233143602 | 323.314360 | 0x0000c64e |
| 100 | 30832113 | 29546308 | 1.043518297 | 0.958296566 | 3.243371611 | 324.337161 | 0x0000844d |
Explanation
- CPI ≈ 1.04
- IPC ≈ 0.96
- Dual-issue capability
- Limited by dependency and control hazards
All Variants Implemented
Base Architectures
- rv32i-sc: single-cycle
- rv32i-mc: multi-cycle
- rv32i-pipe: baseline pipeline
- rv32i-superscalar: dual-issue
- rv32i-ooo: out-of-order
- rv32i-super-ooo: optimized OOO
Pipeline Evolution Variants
Branch Handling
- rv32i-pipe-br: ID-stage branch resolution
- rv32i-pipe-sbp: static predictor
- rv32i-pipe-gbp: global predictor
- rv32i-pipe-gbp-tuned: tuned global predictor
- rv32i-pipe-hybp: hybrid predictor
Direction + Target
- rv32i-pipe-br-2dbp: 2-bit predictor
- rv32i-pipe-br-2dbp-btb: + BTB
Hazard Optimisations
- rv32i-pipe-br-2dbp-btb-hzopt
- rv32i-pipe-br-2dbp-btb-hzopt-luopt
- rv32i-pipe-br-2dbp-btb-hzopt-luopt-fulu
Advanced Prediction
- rv32i-pipe-br-hybp-btb-hzopt-luopt-fulu
- rv32i-pipe-br-hybp-btb-ras-hzopt-luopt-fulu
Experimental
- rv32i-pipe-1dbp
- rv32i-pipe-2dbp
RV32I Pipelined Optimization Journey
Scope and benchmark method
- This log tracks the single-issue pipeline path from rv32i-pipe to the latest optimized variants.
- CoreMark was run at -O2, -O3, and -Ofast, with ITERATIONS=1/10/100.
- For step-by-step comparison, -O3 with ITERATIONS=100 is the main reference point.
- Metrics used:
  - CoreMark/MHz = (ITERATIONS * 1,000,000) / cycles
  - CPI = cycles / retired_instructions
  - IPC = retired_instructions / cycles
- CRC checks were consistent in all successful runs:
  - ITER=1: 0x0000e3c1
  - ITER=10: 0x0000c64e
  - ITER=100: 0x0000844d
Starting point: plain pipelined core
Variant: rv32i-pipe
- O2 i100: cycles 41,275,362, CPI 1.335811572, CoreMark/MHz 2.422752828
- O3 i100: cycles 38,877,313, CPI 1.315809553, CoreMark/MHz 2.572194225
- Ofast i100: cycles 38,877,308, CPI 1.315809473, CoreMark/MHz 2.572194556
This is the baseline for every step that follows: CPI is about 1.336 at O2 and 1.316 at O3.
Step 1: early branch resolution
Variant: rv32i-pipe-br
What changed
- Branch decision moved earlier to ID stage.
- Added ID stage operand forwarding for branch compare so branch operands are ready sooner.
- Added redirect from ID when branch decision differs from sequential path.
Why it helped
- Reduces control hazard penalty because branch wait time is shorter.
Result (O3 i100)
- Cycles: 38,877,313 -> 35,812,817
- CPI: 1.315809553 -> 1.212091142
- CoreMark/MHz: 2.572194225 -> 2.792296400
- Improvement vs previous: +8.557% (speedup, computed as old cycles / new cycles - 1)
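The +8.557% figure matches a speedup ratio (old cycles divided by new cycles, minus one). That ratio equals the relative gain in CoreMark/MHz because the iteration count is fixed. The fraction of cycles removed is a different and smaller number. A small sketch of both definitions, with illustrative function names:

```python
def improvement_pct(old_cycles: int, new_cycles: int) -> float:
    """Speedup of the new variant over the old one, as a percentage."""
    return (old_cycles / new_cycles - 1.0) * 100.0


def cycle_reduction_pct(old_cycles: int, new_cycles: int) -> float:
    """Share of the old cycle count that was removed, as a percentage."""
    return (old_cycles - new_cycles) / old_cycles * 100.0


# rv32i-pipe -> rv32i-pipe-br at O3, ITERATIONS=100
print(f"{improvement_pct(38_877_313, 35_812_817):.3f}")      # 8.557
print(f"{cycle_reduction_pct(38_877_313, 35_812_817):.3f}")  # 7.882
```

The same speedup definition reproduces the per-step percentages in the consolidated timeline table.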
Branch prediction type sweep (before BTB/hzopt/luopt/fulu)
This sweep compares direction predictors on the simpler pipeline lines.
| Variant | Predictor type | O3 i100 cycles | O3 i100 CPI | O3 CoreMark/MHz |
|---|---|---|---|---|
| rv32i-pipe-sbp | static backward taken | 35,390,851 | 1.197809628 | 2.825589020 |
| rv32i-pipe-1dbp | local 1-bit dynamic | 34,640,813 | 1.172424459 | 2.886768275 |
| rv32i-pipe-2dbp | local 2-bit dynamic | 34,185,095 | 1.157000602 | 2.925251488 |
Best of this sweep was local 2-bit dynamic, so the main optimization line continued with that style.
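To make the three direction schemes concrete, here is a toy model that counts mispredictions for a single branch. It is only a sketch: the trace is hypothetical, the initial predictor states are assumptions, and the real cores index their tables by PC across many branches.

```python
def mispredicts_static_backward_taken(trace, backward=True):
    """Static rule: predict taken if and only if the branch jumps backward."""
    return sum(1 for taken in trace if taken != backward)


def mispredicts_1bit(trace, state=False):
    """1-bit scheme: predict whatever this branch did last time."""
    misses = 0
    for taken in trace:
        misses += (state != taken)
        state = taken
    return misses


def mispredicts_2bit(trace, counter=1):
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""
    misses = 0
    for taken in trace:
        misses += ((counter >= 2) != taken)
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


# Hypothetical trace: a backward loop branch taken 9 times, then falling
# through once, with the whole loop entered 100 times.
trace = ([True] * 9 + [False]) * 100
print(mispredicts_static_backward_taken(trace))  # 100
print(mispredicts_1bit(trace))                   # 200
print(mispredicts_2bit(trace))                   # 101
```

On this one loop the 1-bit scheme pays twice per loop exit, while the 2-bit counter pays once. The static rule looks just as good here only because the toy branch is a backward loop. It has no way to adapt to forward branches that are mostly taken, which is one plausible reason the dynamic predictors win on the full benchmark.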
Step 2: branch resolve + 2-bit dynamic predictor
Variant: rv32i-pipe-br-2dbp
What changed
- Kept ID stage branch resolution.
- Added dynamic 2-bit branch history table in IF.
Result (O3 i100)
- Cycles: 35,812,817 -> 33,466,269
- CPI: 1.212091142 -> 1.132671809
- CoreMark/MHz: 2.792296400 -> 2.988083315
- Improvement vs previous: +7.012% over rv32i-pipe-br
Step 3: BTB
Variant: rv32i-pipe-br-2dbp-btb
What changed
- Added BTB for target prediction in IF.
- Direction and target prediction were both available earlier.
Why it helped
- Correctly predicted taken branches no longer wait for late target compute.
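As an illustration of target prediction in IF, the sketch below models a direct-mapped BTB. The table size, the indexing, and the full-PC tag are assumptions made for the example, not details taken from the RTL.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB: maps a fetch PC to the last target seen for it."""

    def __init__(self, entries: int = 64):
        assert entries & (entries - 1) == 0, "entries must be a power of two"
        self.entries = entries
        self.tags = [None] * entries
        self.targets = [0] * entries

    def _index(self, pc: int) -> int:
        # Instructions are 4 bytes wide, so the two low PC bits carry no information.
        return (pc >> 2) & (self.entries - 1)

    def lookup(self, pc: int):
        i = self._index(pc)
        return self.targets[i] if self.tags[i] == pc else None

    def update(self, pc: int, target: int) -> None:
        i = self._index(pc)
        self.tags[i] = pc
        self.targets[i] = target


def next_fetch_pc(pc: int, predict_taken: bool, btb: BranchTargetBuffer) -> int:
    """Redirect fetch only when both a taken prediction and a target are available."""
    target = btb.lookup(pc)
    if predict_taken and target is not None:
        return target
    return pc + 4
```

Without a BTB hit, a taken prediction cannot redirect fetch because the target is not yet known. That is the wait this step removes for branches the BTB has already seen.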
Result (O3 i100)
- Cycles: 33,466,269 -> 32,347,569
- CPI: 1.132671809 -> 1.094809209
- CoreMark/MHz: 2.988083315 -> 3.091422419
- Improvement vs previous: +3.458%
Step 4: hzopt (false load use cleanup)
Variant: rv32i-pipe-br-2dbp-btb-hzopt
What changed
- Hazard logic was made source-aware, using decoded use_rs1 and use_rs2.
- Removed false positive load-use stalls when a source register field is not actually consumed by the instruction.
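The difference between the two hazard checks can be sketched as follows. The "naive" version is an assumption about the earlier behavior, inferred from the description of false positives. The signal names other than use_rs1 and use_rs2 are hypothetical.

```python
def load_use_stall_naive(ex_is_load, ex_rd, id_rs1, id_rs2):
    """Stall whenever a register field matches, even if that field is unused."""
    return ex_is_load and ex_rd != 0 and ex_rd in (id_rs1, id_rs2)


def load_use_stall_source_aware(ex_is_load, ex_rd, id_rs1, id_rs2,
                                use_rs1, use_rs2):
    """Stall only when a field that the instruction really reads matches."""
    hit_rs1 = use_rs1 and id_rs1 == ex_rd
    hit_rs2 = use_rs2 and id_rs2 == ex_rd
    return ex_is_load and ex_rd != 0 and (hit_rs1 or hit_rs2)


# lw x5, 0(x10) is in EX. The instruction in ID is an ADDI whose immediate
# bits happen to decode as "rs2 = 5", but ADDI does not read rs2.
print(load_use_stall_naive(True, 5, 7, 5))                      # True
print(load_use_stall_source_aware(True, 5, 7, 5, True, False))  # False
```

Encodings like this are uncommon in compiled code, which fits the very small measured gain for this step.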
Result (O3 i100)
- Cycles: 32,347,569 -> 32,346,765
- CPI: 1.094809209 -> 1.094781998
- CoreMark/MHz: 3.091422419 -> 3.091499258
- Improvement vs previous: +0.002%
Step 5: luopt (first true load use reduction)
Variant: rv32i-pipe-br-2dbp-btb-hzopt-luopt
What changed
- Relaxed one true load-use case for store data (load -> store rs2) when safe.
- Added matching EX path load data forwarding so correctness is preserved.
Result (O3 i100)
- Cycles: 32,346,765 -> 32,342,681
- CPI: 1.094781998 -> 1.094643774
- CoreMark/MHz: 3.091499258 -> 3.091889630
- Improvement vs previous: +0.013%
Step 6: fulu (generalized load use reduction)
Variant: rv32i-pipe-br-2dbp-btb-hzopt-luopt-fulu
What changed
- Generalized MEM to EX load forwarding for both EX operands.
- Stalls kept only for ID stage consumers that truly need value in ID (branch and jalr related reads).
- Removed extra load use stalls for EX stage consumers now covered by forwarding.
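A rough model of the resulting policy is shown below: stall only for consumers that read the loaded value in ID, and forward from MEM for everything that reads it in EX. All names are hypothetical and the sketch ignores timing details of the actual forwarding path.

```python
ID_STAGE_READERS = {"branch", "jalr"}  # per the text, these read operands in ID


def load_use_stall_fulu(ex_is_load, ex_rd, id_kind, id_sources):
    """id_sources: set of register numbers the instruction in ID actually reads."""
    depends_on_load = ex_is_load and ex_rd != 0 and ex_rd in id_sources
    return depends_on_load and id_kind in ID_STAGE_READERS


def ex_operand(rs, regfile_value, mem_is_load, mem_rd, mem_load_data):
    """Forward the load result from MEM to an EX operand when the registers match."""
    if mem_is_load and mem_rd != 0 and mem_rd == rs:
        return mem_load_data
    return regfile_value


# lw x5, ... followed by add x6, x5, x7: no stall, operand is forwarded in EX.
print(load_use_stall_fulu(True, 5, "alu", {5, 7}))     # False
# lw x5, ... followed by beq x5, x0, ...: the compare happens in ID, so stall.
print(load_use_stall_fulu(True, 5, "branch", {5, 0}))  # True
```

Load followed by a dependent ALU instruction is a common pattern in compiled code, which is consistent with this step giving a much larger gain than the two narrower load-use fixes before it.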
Result (O3 i100)
- Cycles: 32,342,681 -> 31,709,661
- CPI: 1.094643774 -> 1.073219100
- CoreMark/MHz: 3.091889630 -> 3.153613027
- Improvement vs previous: +1.996%
Later revisit: predictor path that started from global and was tuned
After fulu, predictor alternatives were revisited on a separate line.
| Variant | O3 i100 cycles | O3 i100 CPI | O3 CoreMark/MHz |
|---|---|---|---|
| rv32i-pipe-gbp | 34,709,649 | 1.174754226 | 2.881043251 |
| rv32i-pipe-gbp-tuned | 34,527,599 | 1.168592711 | 2.896233822 |
| rv32i-pipe-hybp | 34,021,025 | 1.151447624 | 2.939358823 |
The tuned path improved over its own earlier versions and then moved to a hybrid predictor.
Using that predictor in the full optimized line
Variant: rv32i-pipe-br-hybp-btb-hzopt-luopt-fulu
What changed
- Replaced the local 2-bit direction predictor in the full optimized line with tournament style prediction.
- Kept BTB, hzopt, luopt, and fulu behavior.
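The general idea of tournament prediction can be sketched as two direction predictors plus a chooser that learns which one to trust for each branch. The component predictors (a PC-indexed local table and a gshare-style global table), the table sizes, and the initial counter values are all assumptions; the report does not describe the hybrid predictor's internals.

```python
def bump(counter: int, up: bool) -> int:
    """Move a 2-bit saturating counter one step up or down."""
    return min(counter + 1, 3) if up else max(counter - 1, 0)


class TournamentPredictor:
    def __init__(self, index_bits: int = 6):
        size = 1 << index_bits
        self.mask = size - 1
        self.local = [1] * size    # indexed by PC
        self.glob = [1] * size     # indexed by PC xor global history
        self.choice = [1] * size   # 0-1: trust local, 2-3: trust global
        self.history = 0

    def _indices(self, pc: int):
        li = (pc >> 2) & self.mask
        gi = ((pc >> 2) ^ self.history) & self.mask
        return li, gi

    def predict(self, pc: int) -> bool:
        li, gi = self._indices(pc)
        use_global = self.choice[li] >= 2
        return (self.glob[gi] if use_global else self.local[li]) >= 2

    def update(self, pc: int, taken: bool) -> None:
        li, gi = self._indices(pc)
        local_ok = (self.local[li] >= 2) == taken
        global_ok = (self.glob[gi] >= 2) == taken
        if local_ok != global_ok:  # train the chooser only when they disagree
            self.choice[li] = bump(self.choice[li], global_ok)
        self.local[li] = bump(self.local[li], taken)
        self.glob[gi] = bump(self.glob[gi], taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

For a branch that strictly alternates taken and not taken, a lone 2-bit counter starting in a weak state keeps mispredicting, while the history-indexed table learns the pattern and the chooser migrates to it. That kind of per-branch selection is what a hybrid scheme adds over a single local predictor.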
Result (O3 i100)
- Cycles: 31,709,661 -> 31,566,209
- CPI: 1.073219100 -> 1.068363941
- CoreMark/MHz: 3.153613027 -> 3.167944557
- Improvement vs previous: +0.454%
Final control-flow add on this line
Variant: rv32i-pipe-br-hybp-btb-ras-hzopt-luopt-fulu
What changed
- Added JAL fast path prediction in IF.
- Added RAS for return style jalr prediction.
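A minimal sketch of the JAL fast path combined with a return address stack is given below. The stack depth and function names are assumptions. The push and pop rules follow a simplified form of the link-register convention in the RISC-V specification (x1 and x5), leaving out the case where both rd and rs1 are link registers.

```python
LINK_REGS = {1, 5}  # x1 (ra) and x5 (t0)


class ReturnAddressStack:
    def __init__(self, depth: int = 8):
        self.depth = depth
        self.stack = []

    def push(self, return_addr: int) -> None:
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # overflow: drop the oldest entry
        self.stack.append(return_addr)

    def pop(self):
        return self.stack.pop() if self.stack else None


def predict_jump(ras, pc, is_jal, is_jalr, rd, rs1, jal_target=None):
    """Return the predicted next PC, or None when no prediction is available."""
    if is_jal:
        if rd in LINK_REGS:
            ras.push(pc + 4)   # call: remember where to return
        return jal_target      # PC-relative, computable from the fetched instruction
    if is_jalr:
        if rs1 in LINK_REGS and rd not in LINK_REGS:
            return ras.pop()   # return: predict the saved address
        if rd in LINK_REGS:
            ras.push(pc + 4)   # indirect call: push, target not predicted here
    return None


ras = ReturnAddressStack()
predict_jump(ras, 0x100, True, False, rd=1, rs1=0, jal_target=0x200)  # jal ra, 0x200
print(hex(predict_jump(ras, 0x240, False, True, rd=0, rs1=1)))        # 0x104
```

CoreMark executes relatively few call and return pairs compared with conditional branches, so a small gain from this step is plausible.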
Result (O3 i100)
- Cycles: 31,566,209 -> 31,540,922
- CPI: 1.068363941 -> 1.067508098
- CoreMark/MHz: 3.167944557 -> 3.170484363
- Improvement vs previous: +0.080%
Consolidated O3 i100 timeline
| Order | Variant | O3 i100 cycles | O3 i100 CPI | O3 CoreMark/MHz | Improvement vs previous |
|---|---|---|---|---|---|
| 1 | rv32i-pipe | 38,877,313 | 1.315809553 | 2.572194225 | baseline |
| 2 | rv32i-pipe-br | 35,812,817 | 1.212091142 | 2.792296400 | +8.557% |
| 3 | rv32i-pipe-sbp | 35,390,851 | 1.197809628 | 2.825589020 | +1.192% |
| 4 | rv32i-pipe-1dbp | 34,640,813 | 1.172424459 | 2.886768275 | +2.165% |
| 5 | rv32i-pipe-2dbp | 34,185,095 | 1.157000602 | 2.925251488 | +1.333% |
| 6 | rv32i-pipe-br-2dbp | 33,466,269 | 1.132671809 | 2.988083315 | +2.148% |
| 7 | rv32i-pipe-br-2dbp-btb | 32,347,569 | 1.094809209 | 3.091422419 | +3.458% |
| 8 | rv32i-pipe-br-2dbp-btb-hzopt | 32,346,765 | 1.094781998 | 3.091499258 | +0.002% |
| 9 | rv32i-pipe-br-2dbp-btb-hzopt-luopt | 32,342,681 | 1.094643774 | 3.091889630 | +0.013% |
| 10 | rv32i-pipe-br-2dbp-btb-hzopt-luopt-fulu | 31,709,661 | 1.073219100 | 3.153613027 | +1.996% |
| 11 | rv32i-pipe-gbp | 34,709,649 | 1.174754226 | 2.881043251 | separate branch |
| 12 | rv32i-pipe-gbp-tuned | 34,527,599 | 1.168592711 | 2.896233822 | +0.527% over step 11 |
| 13 | rv32i-pipe-hybp | 34,021,025 | 1.151447624 | 2.939358823 | +1.489% over step 12 |
| 14 | rv32i-pipe-br-hybp-btb-hzopt-luopt-fulu | 31,566,209 | 1.068363941 | 3.167944557 | +0.454% over step 10 |
| 15 | rv32i-pipe-br-hybp-btb-ras-hzopt-luopt-fulu | 31,540,922 | 1.067508098 | 3.170484363 | +0.080% over step 14 |
Full i100 values for main milestones
| Variant | O2 i100 cycles | O2 CPI | O2 CM/MHz | O3 i100 cycles | O3 CPI | O3 CM/MHz | Ofast i100 cycles | Ofast CPI | Ofast CM/MHz |
|---|---|---|---|---|---|---|---|---|---|
| rv32i-pipe | 41,275,362 | 1.335811572 | 2.422752828 | 38,877,313 | 1.315809553 | 2.572194225 | 38,877,308 | 1.315809473 | 2.572194556 |
| rv32i-pipe-br | 37,838,532 | 1.224584025 | 2.642808659 | 35,812,817 | 1.212091142 | 2.792296400 | 35,812,812 | 1.212091055 | 2.792296790 |
| rv32i-pipe-br-2dbp | 35,141,770 | 1.137307604 | 2.845616484 | 33,466,269 | 1.132671809 | 2.988083315 | 33,466,268 | 1.132671852 | 2.988083404 |
| rv32i-pipe-br-2dbp-btb | 33,813,574 | 1.094322648 | 2.957392200 | 32,347,569 | 1.094809209 | 3.091422419 | 32,347,568 | 1.094809249 | 3.091422514 |
| rv32i-pipe-br-2dbp-btb-hzopt | 33,812,468 | 1.094286854 | 2.957488936 | 32,346,765 | 1.094781998 | 3.091499258 | 32,346,764 | 1.094782038 | 3.091499354 |
| rv32i-pipe-br-2dbp-btb-hzopt-luopt | 33,808,868 | 1.094170346 | 2.957803852 | 32,342,681 | 1.094643774 | 3.091889630 | 32,342,680 | 1.094643814 | 3.091889726 |
| rv32i-pipe-br-2dbp-btb-hzopt-luopt-fulu | 33,144,324 | 1.072663434 | 3.017107846 | 31,709,661 | 1.073219100 | 3.153613027 | 31,709,660 | 1.073219139 | 3.153613126 |
| rv32i-pipe-br-hybp-btb-hzopt-luopt-fulu | 32,954,053 | 1.066505616 | 3.034528105 | 31,566,209 | 1.068363941 | 3.167944557 | 31,554,775 | 1.067977028 | 3.169092475 |
| rv32i-pipe-br-hybp-btb-ras-hzopt-luopt-fulu | 32,895,899 | 1.064623554 | 3.039892602 | 31,540,922 | 1.067508098 | 3.170484363 | 31,529,489 | 1.067121219 | 3.171634022 |
Superscalar comparison (fair same-ELF numbers)
From rv32i-superscalar/COREMARK_RESULTS.md fair run:
- Superscalar O3 i100: cycles 30,832,113, CPI 1.043518297, CoreMark/MHz 3.243371611
- Best pipelined line so far (rv32i-pipe-br-hybp-btb-ras-hzopt-luopt-fulu):
  - cycles 31,540,922, CPI 1.067508098, CoreMark/MHz 3.170484363
Remaining gap to superscalar on O3 i100:
- CoreMark/MHz gap: 3.243371611 - 3.170484363 = 0.072887248
- Relative gap: about 2.299% (relative to the pipelined score)
Final status summary
- The main pipeline line improved from 2.572194225 to 3.170484363 CoreMark/MHz at O3 i100.
- That is about 23.260% better than the original pipelined baseline.
- The largest single gains came from:
  - early branch resolution,
  - better branch direction prediction,
  - BTB target prediction,
  - generalized load-use forwarding (fulu).
- Later tuning with tournament style direction selection and RAS gave additional smaller gains.
Reference
Full optimization breakdown and detailed analysis: ../../blogs/optm-riscv-core/
Summary
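- Single-cycle: CPI of 1.0 by construction, limited by the long clock period
- Multi-cycle: High CPI (~4), low throughput
- Pipeline: ~1.3 CPI at baseline, ~1.07 CPI with all optimizations
- Superscalar: Lowest CPI of the overlapped designs (~1.04), but well short of the dual-issue ideal of 0.5
The fully optimized pipeline comes within about 2.3% of superscalar CoreMark/MHz at O3 i100, with significantly lower complexity.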