# End-to-End Benchmarks

When the project has an instrumented benchmark tool available, use it as the **authoritative** before/after measurement for every optimization. E2E benchmarks provide worktree-isolated, instrumented measurements that are reproducible and free from working-tree contamination.

## Detection

Check at session start whether the project's benchmark tool is available. Record the result in `.codeflash/setup.md` under `## E2E Benchmarks`. See your language's `e2e-benchmarks.md` for the specific detection steps.

## How It Works

An E2E benchmark tool should:

1. Auto-detect changed functions from `git diff`
2. Create **isolated git worktrees** for each ref — no working-tree contamination
3. Instrument target functions for per-function timing
4. Run benchmarks in isolation
5. Produce per-function timings and a side-by-side comparison table

This is strictly better than ad-hoc timing scripts because:

- **Isolation**: Each ref runs in its own worktree — no stale build artifacts or uncommitted changes
- **Instrumentation**: Per-function timing, not just wall-clock
- **Reproducibility**: The same command produces the same measurement on any machine
- **Structured output**: Per-function breakdown with speedup ratios, not just total time

## Two-Phase Measurement

The experiment loop uses a **two-phase** approach:

### Phase 1: Quick pre-screen (ad-hoc micro-benchmark)

Before committing, run a quick ad-hoc micro-benchmark (see `micro-benchmark.md`) to validate that the optimization is worth a full benchmark. This is fast (<10s) and catches obvious regressions or no-ops early.

**Purpose**: Gate for investing in a full E2E run. If the micro-benchmark shows no improvement, discard immediately without the overhead of worktree creation.

### Phase 2: Authoritative measurement (E2E benchmark)

After committing a KEEP, run the E2E benchmark tool to produce the official numbers that go into `results.tsv` and determine the final keep/discard verdict.
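The Phase 1 gate can be sketched as a few lines of Python. This is a minimal illustration, not the project's actual pre-screen; the `prescreen` helper, the 5% threshold, and the example functions are all assumptions chosen for the sketch (see `micro-benchmark.md` for the real procedure):

```python
import timeit

def prescreen(baseline_fn, optimized_fn, min_speedup=1.05, repeat=5, number=1000):
    """Quick ad-hoc gate: True if the optimization looks worth a full E2E run.

    min_speedup=1.05 (a 5% improvement) is an illustrative threshold.
    """
    # Take the best of several repeats to reduce scheduler/GC noise.
    base = min(timeit.repeat(baseline_fn, repeat=repeat, number=number))
    opt = min(timeit.repeat(optimized_fn, repeat=repeat, number=number))
    return base / opt >= min_speedup

# Hypothetical candidate pair: a hand-rolled loop vs. the builtin sum().
def slow():
    total = 0
    for i in range(1000):
        total += i
    return total

def fast():
    return sum(range(1000))

if prescreen(slow, fast):
    print("KEEP candidate: commit, then run the Phase 2 E2E benchmark")
else:
    print("Discard: no measurable improvement, skip the E2E run")
```

Taking the minimum of several repeats, rather than the mean, is the usual way to suppress one-off timing noise in a quick gate like this.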
**Purpose**: Produce trustworthy, isolated, reproducible measurements. These are the numbers you report to the user and record in session state.

If the E2E benchmark contradicts the micro-benchmark (e.g., the micro showed 15% but E2E shows 2%), **trust the E2E measurement** — the micro-benchmark may have missed overhead from setup, imports, or interaction with other code paths.

## Usage in the Experiment Loop

### After every KEEP commit

Compare the commit before your optimization with HEAD. Record the baseline commit SHA in `.codeflash/HANDOFF.md` at session start for easy reference. See your language's `e2e-benchmarks.md` for the specific commands.

### Reading the output

The output typically includes:

1. **End-to-End table**: Total benchmark time for base vs. head, with delta and speedup
2. **Per-Function Breakdown**: Each instrumented function's time in both refs
3. **Share of Benchmark Time**: The percentage of total time each function consumes

Use the per-function breakdown to confirm your optimization targeted the right function and didn't cause regressions in others.

## Fallback: When E2E Benchmarks Are Not Available

If the project doesn't have an instrumented benchmark tool:

1. Use ad-hoc micro-benchmarks as the primary measurement (see `micro-benchmark.md`)
2. Use the language's test runner timing as a secondary signal
3. Use the language's profiler for per-function attribution

These are less rigorous but still useful. Note the fallback in `.codeflash/setup.md`.

## Known Limitations

- **Requires committed code**: E2E tools work on git refs, so changes must be committed before they can be benchmarked. This is why E2E benchmarking is a Phase 2 step (after commit), not Phase 1.
- **Benchmark files must exist**: If the project has no benchmarks yet, the tool can't help — fall back to ad-hoc measurement.
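For the profiler fallback (step 3 of the fallback list), Python's stdlib `cProfile`/`pstats` gives per-function attribution without any project tooling. A minimal sketch, where `hot_function` and `workload` are hypothetical stand-ins for the code under measurement:

```python
import cProfile
import io
import pstats

def hot_function(n):
    # Hypothetical hot spot: the function whose share of time we want to attribute.
    return sum(i * i for i in range(n))

def workload():
    # Hypothetical end-to-end workload that exercises the hot spot.
    for _ in range(100):
        hot_function(10_000)

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Per-function attribution: sort by cumulative time, show the top 10 entries.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
print(buf.getvalue())
```

The `cumulative` column in the report plays the same role as the E2E tool's "Share of Benchmark Time": it shows which functions dominate the workload, though without the worktree isolation or base/head comparison.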