ADR-0031: Robust Performance Testing Infrastructure
Status
Implemented
Summary
Redesign the performance testing infrastructure to handle high commit velocity (multiple commits per minute) while ensuring complete data collection and eliminating race conditions. The current system uses sequential job execution which creates an ever-growing queue that GitHub Actions cannot process in time.
Context
Current Architecture (ADR-0019)
The existing performance testing system (ADR-0019) runs benchmarks on every commit to trunk:
- GitHub Actions workflow triggers on push to trunk
- Three platform jobs (x86-64-linux, aarch64-linux, aarch64-macos) run sequentially (max-parallel: 1)
- Each job:
- Builds the compiler
- Runs 7 benchmarks with 5 iterations each
- Appends results to a platform-specific history file on the
perfbranch - Commits and pushes to
perfbranch
- Concurrency control:
cancel-in-progress: false(jobs queue up instead of canceling)
The Problem
With multiple commits per minute and sequential execution taking 3-6 minutes per commit, the queue grows faster than it drains:
- Each platform takes ~1-2 minutes to benchmark
- Sequential execution: 3-6 minutes total per commit
- Commit rate: >60 commits per hour
- Processing rate: ~10-20 commits per hour
- Result: Queue grows indefinitely, jobs timeout or get dropped, data is missing
Root Causes
- Sequential execution bottleneck:
max-parallel: 1was added to prevent race conditions on perf branch, but creates throughput bottleneck - Per-commit granularity: Benchmarking every single commit at high velocity is unsustainable
- Git branch as database: Using the perf branch for atomic updates is slow and prone to conflicts
- Lack of batching: No mechanism to group multiple commits together
Why Sequential Execution Existed
The original max-parallel: 1 was added to prevent this race:
Job 1 (x86-64): fetch perf branch → append → push
Job 2 (ARM64): fetch perf branch → append → push (CONFLICT!)
But this "solution" made throughput the limiting factor.
Decision
Architecture Overview
Replace the sequential push-to-perf-branch model with a parallel execution + atomic collection model:
Commit → 3 parallel platform jobs → Artifacts → Collector job → Single atomic push
Key principles:
- Parallel platform execution: All 3 platforms run concurrently
- Artifact-based data flow: Platform jobs upload results as artifacts, not git pushes
- Atomic collection: Single collector job fetches artifacts and pushes once
- Smart batching: Debounce rapid commits to reduce load
- Graceful degradation: If queue is too long, skip intermediate commits intelligently
Part 1: Parallel Platform Execution
Change: Remove max-parallel: 1, let all 3 platforms run concurrently.
Implementation:
- Remove strategy.max-parallel constraint
- Each job stores results as GitHub artifact instead of pushing to perf branch
- Artifact names include commit SHA and platform:
benchmark-results-{sha}-{platform}.json
Benefit: Reduces per-commit time from 3-6 minutes to 1-2 minutes (3x speedup).
Part 2: Atomic Result Collection
New job: collect-results that runs after all platform jobs complete.
Workflow:
jobs:
benchmark:
strategy:
matrix:
include:
steps:
- run benchmarks
- upload artifact: benchmark-results-{sha}-{platform}.json
collect-results:
needs: benchmark
steps:
- download all artifacts
- checkout perf branch
- append all results to platform-specific history files
- commit and push (single atomic push, no race)
Key insight: Only one job pushes to perf branch, so no race conditions.
Part 3: Commit Batching / Debouncing
Problem: Even with parallelization, benchmarking every commit at 60+/hour is expensive.
Solution: Batch multiple commits together using GitHub Actions concurrency groups.
Strategy A: Time-based batching
concurrency:
group: benchmarks-${{ github.run_number / 5 }} # Batch every 5 runs
cancel-in-progress: true # Cancel older batches
Strategy B: Scheduled + on-demand
on:
push:
branches:
schedule:
- cron: '*/15 * * * *' # Every 15 minutes
- On push: Debounce and run once per time window
- Scheduled: Ensure coverage even during quiet periods
- Manual: workflow_dispatch for on-demand runs
Recommendation: Start with Strategy B (scheduled + on-demand) because:
- Predictable resource usage
- Clear sampling frequency
- Easy to reason about
- Can tune frequency based on load
Part 4: Smart Commit Sampling
For very high velocity, don't benchmark every commit. Use one of:
Option 1: Latest commit in time window
- Every 15 minutes, benchmark the most recent commit in that window
- Tag benchmark results with the commit it represents
- Interpolate missing commits in visualization
Option 2: Merge-commit only
- Only benchmark merge commits to trunk (not every development commit)
- Requires workflow that merges PRs vs direct push
- Lower frequency, higher signal
Option 3: Mark commits for benchmarking
- Allow developers to tag commits that should be benchmarked
- Use commit message convention:
[bench]or label on PR - Benchmark tagged commits + scheduled fallback
Recommendation: Start with Option 1 (time window) as it requires no process change.
Part 5: Data Completeness Tracking
Even with batching, we want visibility into what was benchmarked.
Add metadata to results:
Dashboard improvements:
- Show benchmark coverage (% of commits benchmarked)
- Highlight gaps in data
- Allow filtering by commit range
Part 6: Graceful Degradation
If the queue still backs up (e.g., during a period of intense development):
Strategy: Adaptive sampling
- Monitor queue depth using GitHub API
- If queue > 10 jobs: increase time window (30 min instead of 15)
- If queue > 20 jobs: skip all queued jobs, start fresh with latest commit
- Log all skipped commits for transparency
Implementation: Add a pre-job step that checks queue and decides whether to run.
Implementation Phases
Epic: gruel-1h38
Phase 1: Parallel execution + artifact upload - gruel-1h38.1
- Modify benchmarks.yml to remove max-parallel constraint
- Change platform jobs to upload artifacts instead of pushing to perf
- Verify artifacts are created correctly
Phase 2: Atomic collector job - gruel-1h38.2
- Add collect-results job that downloads artifacts
- Implement atomic push to perf branch
- Test with multiple parallel platform jobs
Phase 3: Time-based batching - gruel-1h38.3
- Add scheduled trigger (every 15 minutes)
- Enable cancel-in-progress for concurrency control
- Update documentation
Phase 4: Commit range tracking - gruel-1h38.4
- Update JSON schema to include commit_range field (version 2)
- Capture commit range in collector job (last 24 hours)
- Inject commit_range and benchmark_reason into results
- Update append-benchmark.py documentation for version 2 schema
Phase 5: Graceful degradation - gruel-1h38.5
- Add queue depth monitoring using GitHub API
- Implement adaptive sampling logic (skip if queue > 10 for push, > 20 for all)
- Add logging for skipped commits with reason and queue depth
- Manual runs always proceed regardless of queue depth
Phase 6: Dashboard improvements - gruel-1h38.6
- Visualize benchmark coverage
- Show commit ranges for each benchmark run
- Highlight data gaps
Consequences
Positive
- Throughput: 3x faster per-commit time (parallel vs sequential)
- No race conditions: Single atomic push eliminates conflicts
- Scalable: Handles high commit velocity through batching
- Complete data: No dropped jobs, all commits accounted for
- Graceful degradation: System adapts to load automatically
- Transparency: Clear visibility into what was benchmarked and why
Negative
- Complexity: More moving parts (artifacts, collector job, batching logic)
- Not per-commit: Won't have data for every single commit (but we track ranges)
- Delayed feedback: 15-minute batching means slower results
- Storage costs: GitHub artifact storage (though artifacts expire after 90 days)
Neutral
- Different granularity: From per-commit to per-time-window sampling
- Requires tuning: May need to adjust time windows based on usage patterns
Resolved Questions
How to eliminate race conditions without serialization?
- Artifact-based flow with single collector job
How to handle high commit velocity sustainably?
- Time-based batching (15-minute windows)
How to maintain data completeness?
- Track commit ranges, show coverage metrics
What if the queue still backs up?
- Adaptive sampling with graceful degradation
How to make the change incrementally?
- 6 phases, each independently valuable
Open Questions
What's the right time window? Start with 15 minutes, tune based on data.
Should we keep any per-commit benchmarking? Could benchmark tagged commits in addition to scheduled runs.
How long to keep artifacts? GitHub default is 90 days, should we archive to long-term storage?
Should we alert on missing data? Add monitoring for benchmark gaps exceeding threshold.
Future Work
- Benchmark result diffing: Compare results across commits in UI
- Performance regression detection: Automatic alerts for slowdowns
- Historical trend analysis: ML-based anomaly detection
- Cross-platform comparison: Visualize platform differences
- Incremental benchmarking: Only re-benchmark changed components
References
- ADR-0019: Compiler Performance Dashboard - Original design
- GitHub Actions: Artifacts
- GitHub Actions: Concurrency
- Rust perf infrastructure - Similar problems and solutions