# End-to-End Benchmarks
When the project has an instrumented benchmark tool available, use it as the authoritative before/after measurement for every optimization. E2E benchmarks run each ref in an isolated git worktree, producing instrumented, reproducible measurements that are free from working-tree contamination.
## Detection
Check at session start whether the project's benchmark tool is available. Record the result in .codeflash/setup.md under `## E2E Benchmarks`. See your language's e2e-benchmarks.md for the specific detection steps.
## How It Works
An E2E benchmark tool should:
- Auto-detect changed functions from `git diff`
- Create isolated git worktrees for each ref, so there is no working-tree contamination
- Instrument target functions for per-function timing
- Run benchmarks in isolation
- Produce per-function timings and a side-by-side comparison table
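As a rough illustration of that flow, here is a minimal Python sketch. It assumes a hypothetical `bench.py` script that prints per-function timings as JSON; a real tool would also instrument the target functions and parse the diff down to individual definitions:

```python
import json
import subprocess
import tempfile
from pathlib import Path

def changed_files(base: str, head: str) -> list[str]:
    """Crude change detection: a real tool would parse the diff down
    to the individual function definitions that changed."""
    out = subprocess.run(["git", "diff", "--name-only", base, head],
                         capture_output=True, text=True, check=True)
    return out.stdout.splitlines()

def benchmark_ref(ref: str) -> dict[str, float]:
    """Check `ref` out into a throwaway worktree and benchmark there,
    so uncommitted edits and stale build artifacts cannot leak in."""
    with tempfile.TemporaryDirectory() as tmp:
        wt = str(Path(tmp) / "wt")
        subprocess.run(["git", "worktree", "add", "--detach", wt, ref], check=True)
        try:
            # Assumption: bench.py prints {"function_name": seconds} as JSON.
            out = subprocess.run(["python", "bench.py"], cwd=wt,
                                 capture_output=True, text=True, check=True)
            return json.loads(out.stdout)
        finally:
            subprocess.run(["git", "worktree", "remove", "--force", wt], check=True)

print("changed:", changed_files("main", "HEAD"))
base_times, head_times = benchmark_ref("main"), benchmark_ref("HEAD")
for fn in sorted(base_times):
    speedup = base_times[fn] / head_times[fn]
    print(f"{fn}: {base_times[fn]:.4f}s -> {head_times[fn]:.4f}s ({speedup:.2f}x)")
```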
This is strictly better than ad-hoc timing scripts because:
- Isolation: Each ref runs in its own worktree — no stale build artifacts or uncommitted changes
- Instrumentation: Per-function timing, not just wall-clock
- Reproducibility: Same command produces same measurement on any machine
- Structured output: Per-function breakdown with speedup ratios, not just total time
## Two-Phase Measurement
The experiment loop uses a two-phase approach:
### Phase 1: Quick pre-screen (ad-hoc micro-benchmark)
Before committing, run a quick ad-hoc micro-benchmark (see micro-benchmark.md) to validate the optimization is worth a full benchmark. This is fast (<10s) and catches obvious regressions or no-ops early.
Purpose: Gate for investing in a full E2E run. If the micro-benchmark shows no improvement, discard immediately without the overhead of worktree creation.
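For instance, a pre-screen might be no more than a `timeit` comparison like this sketch, where the module paths and the 1.05x gate are assumptions, not project conventions:

```python
import timeit

# Hypothetical imports: the committed implementation vs. the candidate.
from mypkg.parser import parse_row as baseline
from mypkg.parser_fast import parse_row as candidate

row = "2024-01-15,widget,19.99,3"  # representative input

# min() of several repeats is the least noisy point estimate.
base = min(timeit.repeat(lambda: baseline(row), number=20_000, repeat=5))
cand = min(timeit.repeat(lambda: candidate(row), number=20_000, repeat=5))
speedup = base / cand
print(f"baseline {base:.3f}s  candidate {cand:.3f}s  speedup {speedup:.2f}x")

if speedup < 1.05:  # assumed gate; tune per project
    print("No clear win: discard now, skip the E2E run.")
```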
### Phase 2: Authoritative measurement (E2E benchmark)
After committing a KEEP, run the E2E benchmark tool for the official numbers that go into results.tsv and determine the final keep/discard verdict.
Purpose: Produce trustworthy, isolated, reproducible measurements. These are the numbers you report to the user and record in session state.
If the E2E benchmark contradicts the micro-benchmark (e.g., micro showed 15% but E2E shows 2%), trust the E2E measurement — the micro-benchmark may have missed overhead from setup, imports, or interaction with other code paths.
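A minimal sketch of that precedence rule, with an assumed 1.05x keep threshold:

```python
def final_verdict(micro_speedup: float, e2e_speedup: float,
                  keep_threshold: float = 1.05) -> str:
    """The E2E number wins whenever the two measurements disagree;
    the micro-benchmark only gated whether the E2E run happened at all."""
    return "KEEP" if e2e_speedup >= keep_threshold else "DISCARD"

# Micro showed 1.15x but E2E shows only 1.02x -> trust E2E, discard.
print(final_verdict(micro_speedup=1.15, e2e_speedup=1.02))  # DISCARD
```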
## Usage in the Experiment Loop
### After every KEEP commit
Record the baseline commit SHA in .codeflash/HANDOFF.md at session start for easy reference; after each KEEP, compare that baseline commit with HEAD. See your language's e2e-benchmarks.md for the specific commands.
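A sketch of that bookkeeping in Python, with a placeholder `e2e-bench` command standing in for your language's actual tool:

```python
import subprocess
from pathlib import Path

HANDOFF = Path(".codeflash/HANDOFF.md")

def record_baseline() -> str:
    """At session start, pin the pre-optimization commit so every
    later E2E run compares against the same ref."""
    sha = subprocess.run(["git", "rev-parse", "HEAD"],
                         capture_output=True, text=True, check=True).stdout.strip()
    HANDOFF.parent.mkdir(exist_ok=True)
    with HANDOFF.open("a") as f:
        f.write(f"baseline-sha: {sha}\n")
    return sha

baseline = record_baseline()
# ... experiment loop runs, a KEEP gets committed ...
# Placeholder invocation: substitute the project's real benchmark command.
subprocess.run(["e2e-bench", "--base", baseline, "--head", "HEAD"], check=False)
```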
### Reading the output
The output typically includes:
- End-to-End table: Total benchmark time for base vs head, with delta and speedup
- Per-Function Breakdown: Each instrumented function's time in both refs
- Share of Benchmark Time: What percentage of total time each function consumes
Use the per-function breakdown to confirm your optimization targeted the right function and didn't cause regressions in others.
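If you want to double-check the tool's arithmetic, the derived columns are straightforward. The timings below are made-up placeholders purely to show the computation:

```python
# Assumed shape: per-function seconds for each ref, as a tool might report.
base = {"parse_row": 1.80, "load_file": 0.40, "validate": 0.20}
head = {"parse_row": 0.90, "load_file": 0.41, "validate": 0.20}

total_base = sum(base.values())
print(f"{'function':<12}{'base':>8}{'head':>8}{'speedup':>9}{'share':>8}")
for fn in base:
    speedup = base[fn] / head[fn]
    share = base[fn] / total_base  # share of total benchmark time
    flag = "  <-- check: possible regression" if speedup < 1.0 else ""
    print(f"{fn:<12}{base[fn]:>7.2f}s{head[fn]:>7.2f}s{speedup:>8.2f}x{share:>7.0%}{flag}")
```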
## Fallback: When E2E Benchmarks Are Not Available
If the project doesn't have an instrumented benchmark tool:
- Use ad-hoc micro-benchmarks as the primary measurement (see micro-benchmark.md)
- Use the language's test runner timing as a secondary signal
- Use the language's profiler for per-function attribution
These are less rigorous but still useful. Note the fallback in .codeflash/setup.md.
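For the profiler fallback in Python, for example, `cProfile` gives the per-function attribution; the workload import here is a placeholder:

```python
import cProfile
import pstats

# Hypothetical workload: exercise the code path you optimized.
from mypkg.parser import parse_file

with cProfile.Profile() as prof:
    parse_file("data/sample.csv")

# Top 10 functions by cumulative time, i.e. per-function attribution.
stats = pstats.Stats(prof)
stats.sort_stats(pstats.SortKey.CUMULATIVE).print_stats(10)
```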
## Known Limitations
- Requires committed code: E2E tools work on git refs, so changes must be committed before they can be benchmarked. This is why it's a Phase 2 step (after commit), not Phase 1.
- Benchmark files must exist: If the project has no benchmarks yet, this tool can't help — fall back to ad-hoc measurement.
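Given the first limitation, it can be worth guarding the E2E phase against a dirty tree; a minimal Python check:

```python
import subprocess
import sys

# Refuse to start the E2E phase on a dirty tree: the tool benchmarks git
# refs, so anything uncommitted would be silently excluded from the run.
dirty = subprocess.run(["git", "status", "--porcelain"],
                       capture_output=True, text=True, check=True).stdout.strip()
if dirty:
    sys.exit("Uncommitted changes detected; commit them before the E2E run.")
```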