ADR-0019: Compiler Performance Dashboard

Status

Implemented

Summary

Create a system for tracking Gruel compiler performance over time and displaying it on the website. This includes a benchmark corpus of representative Gruel programs, infrastructure to collect and store timing data, and charts for the website (static SVG initially, with interactivity left as future work).

Context

As Gruel develops, we want to track compiler performance to:

Detect regressions - Notice when changes slow down compilation
Track improvements - See the impact of optimization work
Provide transparency - Show the community our performance characteristics

The existing --time-passes flag provides per-pass timing, but three pieces are missing:

Standardized benchmark corpus
Historical data storage
Visualization infrastructure

Decision

Part 1: Benchmark Corpus

Location: benchmarks/ directory at repository root

Design choices:

Decision: Hand-crafted benchmark programs

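Individual spec tests are too small to benchmark meaningfully, so they are only timed in aggregate (the spec_tests entry in the manifest). For everything else we'll create purpose-built programs that exercise different compiler phases at scale.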

Corpus structure:

benchmarks/
├── README.md           # Describes the benchmark suite
├── stress/             # Hand-crafted stress tests
│   ├── many_functions.gruel    # 100+ functions
│   ├── deep_nesting.gruel      # Deeply nested blocks/expressions
│   ├── large_structs.gruel     # Many struct types with fields
│   ├── arithmetic_heavy.gruel  # Lots of arithmetic expressions
│   └── control_flow.gruel      # Complex if/while/match patterns
└── manifest.toml       # Benchmark metadata

Manifest format:

[[benchmark]]
name = "spec_tests"
type = "spec_aggregate"  # Run all spec tests, aggregate timing
description = "Aggregate timing across all spec tests"

[[benchmark]]
name = "many_functions"
path = "stress/many_functions.gruel"
description = "100 trivial functions to stress function handling"

[[benchmark]]
name = "deep_nesting"
path = "stress/deep_nesting.gruel"
description = "20 levels of nested blocks"

Part 2: Data Collection & Storage

CLI addition: --benchmark-json <file> flag

When this flag is provided, the compiler writes structured JSON timing data to <file> instead of printing the human-readable report.

JSON output format:

{
  "version": 1,
  "timestamp": "2025-12-27T10:30:00Z",
  "commit": "abc123def",
  "host": {
    "os": "darwin",
    "arch": "aarch64",
    "cpu": "Apple M1 Pro"
  },
  "benchmarks": [
    {
      "name": "many_functions",
      "iterations": 5,
      "passes": {
        "lexer": { "mean_ms": 0.5, "std_ms": 0.02 },
        "parser": { "mean_ms": 2.1, "std_ms": 0.1 },
        "astgen": { "mean_ms": 1.2, "std_ms": 0.05 },
        "sema": { "mean_ms": 3.4, "std_ms": 0.15 },
        "cfg": { "mean_ms": 0.8, "std_ms": 0.03 },
        "codegen": { "mean_ms": 5.2, "std_ms": 0.2 },
        "linker": { "mean_ms": 1.0, "std_ms": 0.04 }
      },
      "total_ms": { "mean": 14.2, "std": 0.3 }
    }
  ]
}
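The mean/std fields above can be derived from the raw per-iteration timings. The following Python sketch shows one way to do that; it is not the compiler's actual implementation. Two assumptions are made here: the standard deviation is the sample standard deviation, and the total for a run is the sum of its pass times (which matches the example numbers above, where the pass means add up to 14.2). The function name summarize_passes is hypothetical.

```python
import statistics


def summarize_passes(iterations: list[dict[str, float]]) -> dict:
    """Reduce per-iteration pass timings (in ms) to mean/std entries."""
    passes = {}
    for name in iterations[0]:
        samples = [run[name] for run in iterations]
        passes[name] = {
            "mean_ms": statistics.mean(samples),
            # Assumption: sample standard deviation (n - 1 denominator).
            "std_ms": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        }
    # Assumption: a run's total is the sum of its pass times.
    totals = [sum(run.values()) for run in iterations]
    return {
        "iterations": len(iterations),
        "passes": passes,
        "total_ms": {
            "mean": statistics.mean(totals),
            "std": statistics.stdev(totals) if len(totals) > 1 else 0.0,
        },
    }
```

Computing the total's deviation from per-run totals, rather than combining the per-pass deviations, avoids assuming that pass timings are independent of each other.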

Data storage:

Decision: Dedicated perf branch

Since benchmarks run on every commit to trunk, committing the results to trunk itself would clutter its history. Instead:

A dedicated perf branch holds the benchmark history
CI runs benchmarks on each trunk commit and pushes the results to the perf branch
website/static/benchmarks/history.json is built from the perf branch during website deploy
Only the last 100 runs are kept, to limit file size

Benchmark runner script: ./bench.sh

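#!/bin/bash
# Build the release compiler, run the benchmarks, and append the results to history.
set -euo pipefail  # stop at the first failing step

cargo build -p gruel --release
# Pass --release here too; a plain `cargo run` would build and time the debug profile.
cargo run -p gruel --release -- --benchmark-json /tmp/bench.json benchmarks/
# Append this run to the history file (via a small Rust tool or script)
./scripts/append-benchmark.py /tmp/bench.json website/static/benchmarks/history.json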
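
The append step and the 100-run retention limit could look roughly like the following. This is a hedged sketch of what scripts/append-benchmark.py might do, not its actual contents. It assumes history.json is a JSON array of run objects ordered oldest first; the helper name append_run is hypothetical.

```python
import json
import sys
from pathlib import Path

MAX_RUNS = 100  # retention limit stated in this ADR


def append_run(history: list[dict], run: dict, max_runs: int = MAX_RUNS) -> list[dict]:
    """Return a new history with `run` appended, keeping the newest `max_runs`."""
    return (history + [run])[-max_runs:]


def main(run_path: str, history_path: str) -> None:
    history_file = Path(history_path)
    # Assumption: history.json is a JSON array of runs, oldest first.
    history = json.loads(history_file.read_text()) if history_file.exists() else []
    run = json.loads(Path(run_path).read_text())
    history_file.parent.mkdir(parents=True, exist_ok=True)
    history_file.write_text(json.dumps(append_run(history, run), indent=2) + "\n")


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: append-benchmark.py <run.json> <history.json>
    main(sys.argv[1], sys.argv[2])
```

Keeping the trimming logic in a pure function makes the retention rule easy to test without touching the filesystem.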

Part 3: Website Visualization

Page: /performance/ on the Gruel website

Decision: Static charts generated at build time

Keep scope minimal with static SVG charts generated during website build. We can add Chart.js for interactivity later if needed.

Charts to generate:

Time-series chart - Total compilation time over commits
Pass breakdown chart - Stacked bar showing time per pass

Implementation:

A Rust tool or Python script reads history.json and generates SVG charts
Charts are placed in website/static/benchmarks/ during build
Zola includes them in the performance page
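
To make the chart generation step concrete, here is a minimal Python sketch of a time-series generator. It assumes history.json holds a list of runs in the JSON output format from Part 2, and it draws only a bare polyline (no axes or labels). The function name timeline_svg and the layout constants are illustrative choices, not a description of the real generator.

```python
from xml.sax.saxutils import escape


def timeline_svg(history: list[dict], benchmark: str,
                 width: int = 600, height: int = 200) -> str:
    """Render mean total compile time per run as a single SVG polyline."""
    totals = []
    for run in history:
        for bench in run["benchmarks"]:
            if bench["name"] == benchmark:
                totals.append(bench["total_ms"]["mean"])
    if not totals:
        raise ValueError(f"no data for benchmark {benchmark!r}")

    pad = 20  # margin in SVG user units (illustrative value)
    top = max(totals) or 1.0  # avoid dividing by zero on all-zero data
    step = (width - 2 * pad) / max(len(totals) - 1, 1)
    # SVG's y axis points down, so larger times map to smaller y values.
    points = " ".join(
        f"{pad + i * step:.1f},{height - pad - (t / top) * (height - 2 * pad):.1f}"
        for i, t in enumerate(totals)
    )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {width} {height}">'
        f"<title>{escape(benchmark)}: total compile time (ms)</title>"
        f'<polyline fill="none" stroke="black" points="{points}"/>'
        "</svg>"
    )
```

Writing the returned string to website/static/benchmarks/timeline.svg during the build would be enough for Zola to serve it as a static asset.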

Template structure:

website/
├── content/
│   └── performance.md       # Performance page content
├── templates/
│   └── performance.html     # Template including chart images
├── static/
│   └── benchmarks/
│       ├── history.json     # Historical data (from perf branch)
│       ├── timeline.svg     # Generated time-series chart
│       └── breakdown.svg    # Generated pass breakdown

Future enhancement: Add Chart.js for hover tooltips and filtering when scope allows

Implementation Phases

Epic: gruel-a5ah

Phase 1: Benchmark corpus - gruel-a5ah.1
- Create benchmarks/ directory structure
- Write initial stress test programs (many_functions, deep_nesting, etc.)
- Create manifest.toml format

Phase 2: Data collection - gruel-a5ah.2
- Implement --benchmark-json flag in the CLI
- Extend TimingData to output JSON format
- Support multiple iterations with mean/std calculation

Phase 3: Runner & storage - gruel-a5ah.3
- Create ./bench.sh script
- Create scripts/append-benchmark.py to manage history
- Set up perf branch structure
- Document benchmark workflow
- Add release/debug build modes via --release flag

Phase 4: CI integration - gruel-a5ah.4
- GitHub Actions workflow to run benchmarks on trunk commits
- Push results to perf branch
- Configure consistent benchmark environment

Phase 5: Website visualization - gruel-a5ah.5
- Create SVG chart generator (Rust or Python)
- Create /performance/ page template
- Integrate chart generation into website build

Consequences

Positive

Clear visibility into compiler performance over time
Ability to detect regressions before they accumulate
Public transparency about performance characteristics
Foundation for future optimization work
Running benchmarks in CI on every trunk commit ensures continuous tracking

Negative

CI runner variability may add noise (mitigated by multiple iterations)
Adds maintenance burden (corpus, scripts, visualization)
History file will grow over time (mitigated by limiting to 100 entries)

Neutral

Benchmarks measure compilation time, not runtime performance
Initial corpus may not represent "real-world" usage until we have users
Dedicated perf branch keeps trunk clean but adds complexity

Resolved Questions

Multiple iterations? Yes, 5 iterations with mean/std reported.
What stress tests? Initial set: many_functions, deep_nesting, large_structs, arithmetic_heavy, control_flow.
History retention? 100 most recent runs.
Visualization approach? Static SVG charts for now, Chart.js as future enhancement.
When to run benchmarks? On every commit to trunk via CI.
Data storage? Dedicated perf branch to avoid noise in trunk.

Future Work

Interactive charts with Chart.js (hover tooltips, filtering, zoom)
Runtime performance benchmarks (execution speed of compiled programs)
Memory usage tracking
Code size tracking (binary size over time)
Comparison with other compilers (Rust, Zig, etc.)

References

ADR-0018: Tracing Infrastructure - The timing layer this builds on
rustc-perf - Inspiration from Rust's approach
Zig perf - Inspiration from Zig's approach