# End-to-End Benchmarks
When the project has an instrumented benchmark tool available, use it as the **authoritative** before/after measurement for every optimization. E2E benchmarks run each ref in an isolated git worktree with per-function instrumentation, so measurements are reproducible and free from working-tree contamination.
## Detection
Check at session start whether the project's benchmark tool is available. Record the result in `.codeflash/setup.md` under `## E2E Benchmarks`. See your language's `e2e-benchmarks.md` for the specific detection steps.
## How It Works
An E2E benchmark tool should:
1. Auto-detect changed functions from `git diff`
2. Create **isolated git worktrees** for each ref — no working-tree contamination
3. Instrument target functions for per-function timing
4. Run benchmarks in isolation
5. Produce per-function timings and a side-by-side comparison table
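Step 1 can be sketched as a small heuristic over `git diff` output. Real tools compare the ASTs of both refs; the regex below (and the sample diff) are only an illustration of the idea:

```python
import re

def changed_functions(diff_text: str) -> set[str]:
    """Extract names of functions whose bodies changed, from unified diff text.

    Unified diffs carry the enclosing declaration in the hunk header
    ("@@ ... @@ def foo(...):"), so we scan those lines. This heuristic
    misses renames and newly added functions; it is a sketch, not a parser.
    """
    names = set()
    for line in diff_text.splitlines():
        m = re.match(r"^@@ .* @@ .*def (\w+)", line)
        if m:
            names.add(m.group(1))
    return names

sample_diff = """\
@@ -10,6 +10,7 @@ def parse_row(line):
-    parts = line.split(',')
+    parts = line.split(',', maxsplit=3)
"""
print(changed_functions(sample_diff))  # {'parse_row'}
```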
This is strictly better than ad-hoc timing scripts because:
- **Isolation**: Each ref runs in its own worktree — no stale build artifacts or uncommitted changes
- **Instrumentation**: Per-function timing, not just wall-clock
- **Reproducibility**: Same command produces same measurement on any machine
- **Structured output**: Per-function breakdown with speedup ratios, not just total time
## Two-Phase Measurement
The experiment loop uses a **two-phase** approach:
### Phase 1: Quick pre-screen (ad-hoc micro-benchmark)
Before committing, run a quick ad-hoc micro-benchmark (see `micro-benchmark.md`) to validate the optimization is worth a full benchmark. This is fast (<10s) and catches obvious regressions or no-ops early.
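A minimal pre-screen along these lines, using the standard-library `timeit` module. The baseline/optimized pair here is a placeholder (string concatenation vs. `str.join`); substitute the real before/after functions:

```python
import timeit

def baseline(parts):
    # Placeholder for the original implementation.
    out = ""
    for p in parts:
        out += p
    return out

def optimized(parts):
    # Placeholder for the candidate implementation.
    return "".join(parts)

parts = ["x"] * 1000
# Take the best of 5 repeats to reduce scheduler noise; each repeat
# runs the call 500 times so per-call overhead dominates less.
t_base = min(timeit.repeat(lambda: baseline(parts), number=500, repeat=5))
t_opt = min(timeit.repeat(lambda: optimized(parts), number=500, repeat=5))
print(f"micro-benchmark speedup: {t_base / t_opt:.2f}x")
```

Taking the *minimum* over repeats is the usual convention for micro-benchmarks: the fastest run is the one least disturbed by other processes.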
**Purpose**: Gate for investing in a full E2E run. If the micro-benchmark shows no improvement, discard immediately without the overhead of worktree creation.
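The gate itself reduces to a threshold check. The 5% cutoff below is an illustrative choice, not a value this guide prescribes:

```python
def worth_full_benchmark(micro_speedup: float, threshold: float = 1.05) -> bool:
    """Decide whether a micro-benchmark result justifies a full E2E run.

    micro_speedup is baseline_time / optimized_time, so 1.0 means no change.
    """
    return micro_speedup >= threshold

assert worth_full_benchmark(1.15)        # 15% faster: proceed to Phase 2
assert not worth_full_benchmark(0.98)    # regression: discard immediately
assert not worth_full_benchmark(1.01)    # within noise: discard
```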
### Phase 2: Authoritative measurement (E2E benchmark)
After committing a KEEP, run the E2E benchmark tool for the official numbers that go into `results.tsv` and determine the final keep/discard verdict.
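Recording the official numbers might look like the sketch below. The column layout is an assumption for illustration; use whatever schema your session's `results.tsv` already has:

```python
import csv
import io

def append_result(out, function, base_ns, head_ns, verdict):
    """Append one benchmark row as TSV.

    Real code would open .codeflash/results.tsv in append mode; a StringIO
    stands in here so the sketch is self-contained.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    speedup = base_ns / head_ns
    writer.writerow([function, base_ns, head_ns, f"{speedup:.2f}x", verdict])

buf = io.StringIO()
append_result(buf, "parse_row", 1200, 800, "KEEP")
print(buf.getvalue())  # parse_row	1200	800	1.50x	KEEP
```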
**Purpose**: Produce trustworthy, isolated, reproducible measurements. These are the numbers you report to the user and record in session state.
If the E2E benchmark contradicts the micro-benchmark (e.g., micro showed 15% but E2E shows 2%), **trust the E2E measurement** — the micro-benchmark may have missed overhead from setup, imports, or interaction with other code paths.
## Usage in the Experiment Loop
### After every KEEP commit
Compare the commit before your optimization with HEAD. Record the baseline commit SHA in `.codeflash/HANDOFF.md` at session start for easy reference. See your language's `e2e-benchmarks.md` for the specific commands.
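The actual CLI differs per language, so the tool name and flags in the sketch below are placeholders; only the `git rev-parse` call is real:

```python
import subprocess

def e2e_command(base: str, head: str = "HEAD") -> list[str]:
    """Build the compare invocation.

    `bench-tool` and its flags are hypothetical; use the commands from
    your language's e2e-benchmarks.md instead.
    """
    return ["bench-tool", "compare", "--base", base, "--head", head]

def baseline_sha() -> str:
    """Resolve the commit before the optimization.

    Prefer the SHA recorded in .codeflash/HANDOFF.md at session start;
    HEAD~1 only works when exactly one KEEP commit has landed since.
    """
    return subprocess.run(
        ["git", "rev-parse", "HEAD~1"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

print(e2e_command("abc1234"))
```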
### Reading the output
The output typically includes:
1. **End-to-End table**: Total benchmark time for base vs head, with delta and speedup
2. **Per-Function Breakdown**: Each instrumented function's time in both refs
3. **Share of Benchmark Time**: What percentage of total time each function consumes

Use the per-function breakdown to confirm your optimization targeted the right function and didn't cause regressions in others.
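That check can be automated over the breakdown table. The table format below is a made-up example of the kind of output such tools produce, not a real schema:

```python
def parse_breakdown(table: str) -> dict[str, float]:
    """Parse a 'function  base_ms  head_ms' table into per-function speedups."""
    speedups = {}
    for line in table.strip().splitlines()[1:]:  # skip the header row
        name, base_ms, head_ms = line.split()
        speedups[name] = float(base_ms) / float(head_ms)
    return speedups

sample = """\
function    base_ms  head_ms
parse_row   120.0    80.0
write_out   40.0     41.0
"""
result = parse_breakdown(sample)
assert result["parse_row"] == 1.5          # the target function got faster
assert 0.9 < result["write_out"] < 1.1     # no regression elsewhere
```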
## Fallback: When E2E Benchmarks Are Not Available
If the project doesn't have an instrumented benchmark tool:
1. Use ad-hoc micro-benchmarks as the primary measurement (see `micro-benchmark.md`)
2. Use the language's test runner timing as a secondary signal
3. Use the language's profiler for per-function attribution
These are less rigorous but still useful. Note the fallback in `.codeflash/setup.md`.
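In Python, per-function attribution via the standard-library profiler looks like this; the workload is a placeholder for the real target function:

```python
import cProfile
import io
import pstats

def hot_function(n):
    # Placeholder for the function under optimization.
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
hot_function(100_000)
profiler.disable()

out = io.StringIO()
# Restrict the printed stats to entries matching "hot_function".
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats("hot_function")
report = out.getvalue()
print(report)
```

This gives per-function cumulative time, which is the attribution an instrumented E2E tool would otherwise provide.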
## Known Limitations
- **Requires committed code**: E2E tools work on git refs, so changes must be committed before they can be benchmarked. This is why it's a Phase 2 step (after commit), not Phase 1.
- **Benchmark files must exist**: If the project has no benchmarks yet, this tool can't help — fall back to ad-hoc measurement.