ADR-0043: Extended Benchmark Infrastructure — Comptime, Opt Levels, and Runtime
Status
Implemented
Summary
Extend the benchmarking infrastructure to cover three dimensions that are currently unmeasured: (1) comptime evaluation performance, (2) LLVM compilation + runtime performance at -O0, and (3) LLVM compilation + runtime performance at -O3. Add new benchmark programs that stress comptime execution, add a tracing span around the comptime interpreter, run each benchmark at multiple optimization levels, measure the runtime of compiled binaries, and render all of this on the website's performance dashboard.
Context
What the current benchmarks measure
The existing benchmark infrastructure (ADR-0019, ADR-0031) measures compilation time for 7 stress programs. For each program, --benchmark-json reports per-pass timing (lexer, parser, astgen, sema, cfg_construction, codegen, linker) and aggregate metrics (total time, peak memory, binary size). All compilation happens at the default optimization level (-O0).
What's missing
1. Comptime evaluation performance. None of the 7 stress programs use comptime expressions. The comptime interpreter runs inside the sema pass and has no dedicated tracing span, so even if a program used comptime heavily, the interpreter's cost would be hidden inside the sema number. After ADR-0040 expanded the comptime interpreter to support nearly all pure language constructs, comptime is a substantial subsystem that deserves visibility.
2. Optimization level coverage. The compiler supports -O0 through -O3 via LLVM, but benchmarks only run at -O0. The LLVM mid-end pipeline (default<O3>) is a significant cost center at higher optimization levels. We need to track:
- How long LLVM optimization takes at each level
- Whether codegen or linking time changes across levels
- Binary size differences across optimization levels
3. Runtime performance of compiled binaries. The benchmarks measure how fast the compiler runs but not how fast the compiled programs run. This matters because:
- Optimization level changes should visibly improve runtime performance
- Runtime regressions from codegen changes would go unnoticed today
- Users care about the quality of generated code, not just compilation speed
Design constraints
- CI budget: More configurations = more wall-clock time. Each new optimization-level run multiplies the benchmark matrix.
- Determinism: Runtime benchmarks need programs that do meaningful work and return deterministic results (not I/O-bound or allocation-heavy).
- History compatibility: The existing JSON schema, append-benchmark.py, and generate-charts.py need to be extended, not replaced.
Decision
Part 1: Comptime tracing span
Add a dedicated info_span!("comptime") inside the comptime interpreter entry point (evaluate_comptime_block in gruel-air/src/sema/analyze_ops.rs). This span nests under the existing sema span and records time spent specifically on comptime evaluation.
The --benchmark-json output will naturally include "comptime" as a new pass name. The chart generator already handles arbitrary pass names — no change needed in generate-charts.py for this to render.
Add "comptime" to PASS_ORDER and PASS_COLORS in generate-charts.py so it gets a consistent color and stacking position. Place it between "sema" and "cfg" in the stack, since comptime runs during sema but is conceptually its own phase.
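The generate-charts.py change can be sketched as follows. The pass names mirror the list earlier in this ADR; the hex color is an arbitrary placeholder, and only the placement of "comptime" between "sema" and "cfg_construction" is specified here:

```python
# Illustrative sketch of the updated constants in generate-charts.py.
PASS_ORDER = [
    "lexer", "parser", "astgen", "sema",
    "comptime",          # new: runs inside sema, but stacked as its own band
    "cfg_construction",
    "codegen", "linker",
]

PASS_COLORS = {
    # ... existing entries keep their colors ...
    "comptime": "#9467bd",  # placeholder color, not a project decision
}
```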
Part 2: Comptime stress benchmark
Add a new benchmark program benchmarks/stress/comptime_heavy.gruel and register it in manifest.toml.
The program should exercise:
- Comptime arithmetic: Functions with heavy comptime { ... } blocks doing loops and arithmetic
- Comptime function calls: Functions called at compile time that recurse or iterate
- comptime_unroll for-loops: Multiple comptime_unroll loops that expand into many iterations
- Comptime struct/array construction: Building composite values at compile time
- Comptime pattern matching: Matching on enum variants at compile time
Target ~1-2 seconds of compile time (same as other stress tests) with the majority spent in the comptime interpreter. The program must also run successfully (for Part 4's runtime measurement).
Part 3: Multi-opt-level benchmarking
Extend bench.sh to run each benchmark at multiple optimization levels. Currently bench.sh compiles each program once (at -O0 default). Change it to compile at -O0 and -O3.
Manifest extension:
```toml
# Global benchmark config
[config]
opt_levels = ["O0", "O3"]  # Levels to benchmark

# Existing per-benchmark entries are unchanged
[[benchmark]]
name = "many_functions"
path = "stress/many_functions.gruel"
description = "1000 functions to stress function handling"
```
If the [config] section or opt_levels key is absent, default to ["O0"] for backwards compatibility.
Result naming: Each benchmark result is tagged with its optimization level. The benchmark name becomes "{name}@{opt_level}" in the JSON output (e.g., "many_functions@O0", "many_functions@O3").
bench.sh changes:
- Parse opt_levels from the [config] section of manifest.toml
- For each benchmark × opt level combination, run $GRUEL_BIN --benchmark-json -{opt_level} "$full_path" "$output_binary"
- Tag the result with the opt level
- Keep the compiled binary for runtime measurement (Part 4)
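The loop and naming scheme above can be sketched in Python for illustration (bench.sh is shell; parameter shapes here are hypothetical, with benchmarks as (name, source_path, output_binary) tuples):

```python
def benchmark_matrix(gruel_bin, benchmarks, opt_levels):
    """Yield (tagged_name, argv) for every benchmark x opt-level combination."""
    for name, source_path, output_binary in benchmarks:
        for level in opt_levels:
            tagged_name = f"{name}@{level}"  # e.g. "many_functions@O3"
            argv = [gruel_bin, "--benchmark-json", f"-{level}",
                    source_path, output_binary]
            yield tagged_name, argv
```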
JSON schema: The existing per-benchmark object gains an "opt_level" field:
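An illustrative fragment of the tagged result object; only the "opt_level" field and the "@{opt_level}" name suffix are specified by this ADR, and the object otherwise keeps its existing fields:

```json
{
  "name": "many_functions@O3",
  "opt_level": "O3"
}
```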
Part 4: Runtime benchmarking
After compiling each benchmark program, run the resulting binary and measure its execution time. This captures the quality of generated code.
bench.sh changes:
- After compiling, run the binary with /usr/bin/time to capture wall-clock time and peak memory
- Run multiple iterations (same count as compilation iterations)
- Record mean and stddev of execution time
- Store in the benchmark result as "runtime_ms" and "runtime_std_ms"
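The mean/stddev aggregation can be sketched as follows (the helper itself is hypothetical; only the two output field names come from this ADR):

```python
import statistics

def runtime_stats(times_ms):
    """Aggregate per-iteration wall-clock times (ms) into the two new
    per-benchmark JSON fields."""
    return {
        "runtime_ms": statistics.mean(times_ms),
        # Sample stddev needs at least two iterations
        "runtime_std_ms": statistics.stdev(times_ms) if len(times_ms) > 1 else 0.0,
    }
```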
Program requirements for runtime benchmarks:
- Programs must do deterministic computation (no I/O, no randomness)
- Programs must take long enough to measure reliably (at least ~10ms runtime)
- Programs must return a deterministic exit code for verification
Some existing benchmarks (e.g., many_functions where most functions are never called from main) may produce trivially fast binaries. That's fine — the runtime will just be near-zero. The interesting runtime data comes from programs that actually compute (e.g., arithmetic_heavy, control_flow, and the new comptime_heavy which should produce non-trivial runtime code from unrolled loops).
JSON extension:
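An illustrative fragment of the extended per-benchmark object; the values are made up, and only "runtime_ms" and "runtime_std_ms" are specified above:

```json
{
  "name": "arithmetic_heavy@O0",
  "opt_level": "O0",
  "runtime_ms": 42.7,
  "runtime_std_ms": 1.3
}
```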
Part 5: Website visualization
Extend the performance dashboard to display the new data:
1. Opt-level selector. Add a dropdown next to the existing "Program" dropdown that lets the user filter by optimization level. Default to "O0" to match current behavior.
2. Compilation time by opt level. A new chart type showing compilation time for the same program at different optimization levels side by side (grouped bar chart or overlaid lines). This makes it easy to see the LLVM optimization overhead.
3. Runtime performance chart. A new time-series chart in its own section showing runtime of compiled binaries over commits, similar to the existing compilation time trend. Separate lines per opt level.
4. Binary size by opt level. The existing binary size chart already works per-program; extend it to show O0 vs O3 side by side.
5. Comptime breakdown. No special chart needed — the comptime pass will appear in the existing stacked pass breakdown as its own color band.
Chart generator changes (generate-charts.py):
- Add "comptime" to PASS_COLORS and PASS_ORDER
- Add a runtime chart generation function
- Modify the per-program timeline to support opt-level grouping
- Add binary size comparison across opt levels
- Generate new SVG files: runtime.svg, binary_size_by_opt.svg
- Update metadata.json to include runtime stats and opt-level info
Website template changes (performance.html):
- Add opt-level selector dropdown
- Add "Runtime Performance" section with chart container
- Add "Binary Size by Optimization Level" section
- Update methodology text to describe new measurements
- Update the benchmark suite list to include comptime_heavy
Part 6: CI integration
The existing CI workflow (benchmarks.yml) runs bench.sh. Since bench.sh will now run multiple opt levels, the CI time increases roughly 2× (O0 + O3). To keep this manageable:
- Keep the 5-iteration count (reliable enough for trends)
- The parallelism across platforms remains unchanged
- History files on the perf branch naturally grow to include the new data points
No changes to the workflow YAML itself — the expansion is entirely within bench.sh and manifest.toml.
Implementation Phases
1. Phase 1: Comptime tracing span — Add info_span!("comptime") around evaluate_comptime_block in gruel-air/src/sema/analyze_ops.rs. Verify it appears in --time-passes and --benchmark-json output. Add "comptime" to the chart generator's PASS_ORDER/PASS_COLORS.
2. Phase 2: Comptime stress benchmark — Write benchmarks/stress/comptime_heavy.gruel exercising comptime arithmetic, function calls, comptime_unroll, struct/array construction, and pattern matching. Register it in manifest.toml. Verify the program compiles and runs, and that the comptime span shows significant time.
3. Phase 3: Multi-opt-level benchmarking — Extend bench.sh to parse opt_levels from the manifest.toml [config] section. Run each benchmark at each opt level. Tag results with the "opt_level" field and the "@{level}" suffix in the benchmark name. Update manifest.toml with opt_levels = ["O0", "O3"]. Update append-benchmark.py if needed to handle the new fields.
4. Phase 4: Runtime benchmarking — After compiling each benchmark in bench.sh, run the binary with /usr/bin/time for multiple iterations. Compute mean/stddev of wall-clock execution time. Add "runtime_ms" and "runtime_std_ms" to the per-benchmark JSON output.
5. Phase 5: Website visualization — Update generate-charts.py to generate runtime charts and opt-level comparison charts. Update performance.html to add the opt-level dropdown, runtime section, and binary size comparison section. Update metadata.json generation to include runtime stats.
6. Phase 6: Polish and documentation — Update benchmarks/README.md with new benchmark descriptions and opt-level configuration. Update the website's methodology section. Update CLAUDE.md if the benchmark workflow instructions change.
Consequences
Positive
- Comptime visibility: Comptime interpreter performance is tracked independently from sema, enabling focused optimization
- Optimization level insight: Can see how much compilation time LLVM optimization adds and whether it improves runtime
- Runtime quality tracking: Codegen regressions that produce slower binaries are now detectable
- Richer dashboard: The website gives a more complete picture of compiler performance
Negative
- ~2× CI benchmark time: Running at both O0 and O3 roughly doubles the benchmark wall clock per commit
- More chart complexity: Dashboard gains new sections and dropdowns; could be overwhelming if not laid out well
- Larger history files: Roughly double the data points per run stored in perf branch
Neutral
- No schema break: Existing history data remains valid — new fields are additive
- Comptime tracing span is zero-cost: When --time-passes/--benchmark-json is not active, the span has no overhead
Resolved Questions
- Should we also benchmark -O1 and -O2? Starting with just O0 and O3 captures the extremes. O1/O2 can be added later by editing manifest.toml if the data proves valuable.
- Are all existing benchmarks worth running at O3? Programs like many_functions (1000 trivial functions) may have negligible runtime. We could add a per-benchmark skip_runtime = true flag in the manifest, but it may not be worth the complexity — near-zero runtimes are still valid data.
- Should runtime be measured in the same iteration loop as compilation? Running the binary immediately after compiling means it is hot in the disk cache. For consistency, we always measure it this way rather than mixing cold and warm runs.
Future Work
- Microbenchmarks for specific codegen patterns: e.g., how efficiently the LLVM backend handles struct passing, array copies
- Comptime step count tracking: Report how many interpreter steps each comptime block takes (already tracked internally via COMPTIME_MAX_STEPS)
- Regression alerting: Automatically flag commits where runtime performance degrades significantly
- Comparison with other compilers: Track compile time and runtime performance against equivalent Rust/Zig/C programs
References
- ADR-0019: Compiler Performance Dashboard — Original benchmark infrastructure
- ADR-0031: Robust Performance Testing Infrastructure — Parallel execution and batching
- ADR-0033: LLVM Backend and Comptime Interpreter — LLVM backend with opt levels
- ADR-0040: Comptime Interpreter Expansion — Latest comptime interpreter capabilities