mirror of https://github.com/codeflash-ai/codeflash-agent.git (synced 2026-05-04 18:25:19 +00:00)
# End-to-End Benchmarks
When the project has an instrumented benchmark tool available, use it as the **authoritative** before/after measurement for every optimization. E2E benchmarks provide worktree-isolated, instrumented measurements that are reproducible and free from working-tree contamination.
## Detection
Check at session start whether the project's benchmark tool is available. Record the result in `.codeflash/setup.md` under `## E2E Benchmarks`. See your language's `e2e-benchmarks.md` for the specific detection steps.
## How It Works
An E2E benchmark tool should:
1. Auto-detect changed functions from `git diff`
2. Create **isolated git worktrees** for each ref — no working-tree contamination
3. Instrument target functions for per-function timing
4. Run benchmarks in isolation
5. Produce per-function timings and a side-by-side comparison table
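The worktree-isolation and comparison steps can be sketched in Python. This is a minimal illustration, not the real tool: the `bench.py` entry point, function names, and the best-of-one timing are all assumptions, and a real tool would instrument individual functions rather than time the whole command.

```python
import subprocess
import tempfile
import time
from pathlib import Path

def run_in_worktree(ref: str, bench_cmd: list[str]) -> float:
    """Check out `ref` into a throwaway worktree and time `bench_cmd` there.

    The worktree is created fresh and removed afterwards, so the measurement
    never sees uncommitted changes or stale artifacts from the main checkout.
    """
    with tempfile.TemporaryDirectory() as tmp:
        wt = Path(tmp) / "wt"
        subprocess.run(["git", "worktree", "add", "--detach", str(wt), ref], check=True)
        try:
            start = time.perf_counter()
            subprocess.run(bench_cmd, cwd=wt, check=True)
            return time.perf_counter() - start
        finally:
            subprocess.run(["git", "worktree", "remove", "--force", str(wt)], check=True)

def speedup(base_s: float, head_s: float) -> float:
    """Ratio above 1.0 means head is faster than base."""
    return base_s / head_s

# Typical use (requires a real repo with a committed benchmark entry point):
#   base = run_in_worktree("HEAD~1", ["python", "bench.py"])
#   head = run_in_worktree("HEAD", ["python", "bench.py"])
#   print(f"{speedup(base, head):.2f}x")
```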
This is strictly better than ad-hoc timing scripts because:
- **Isolation**: Each ref runs in its own worktree — no stale build artifacts or uncommitted changes
- **Instrumentation**: Per-function timing, not just wall-clock
- **Reproducibility**: Same command produces same measurement on any machine
- **Structured output**: Per-function breakdown with speedup ratios, not just total time
## Two-Phase Measurement
The experiment loop uses a **two-phase** approach:
### Phase 1: Quick pre-screen (ad-hoc micro-benchmark)
Before committing, run a quick ad-hoc micro-benchmark (see `micro-benchmark.md`) to validate that the optimization is worth a full E2E run. This is fast (typically under 10 seconds) and catches obvious regressions or no-ops early.
**Purpose**: Gate for investing in a full E2E run. If the micro-benchmark shows no improvement, discard immediately without the overhead of worktree creation.
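Such a gate can be sketched with the standard library's `timeit`. The helper name, the 5% threshold, and the toy functions below are assumptions for illustration, not part of any real tool:

```python
import timeit

def prescreen(old_fn, new_fn, *args, reps: int = 5, number: int = 200) -> bool:
    """Quick gate: invest in a full E2E run only if the candidate beats the
    baseline by more than 5%. Taking the best of several repeats damps
    scheduler and warm-up noise."""
    old_t = min(timeit.repeat(lambda: old_fn(*args), repeat=reps, number=number))
    new_t = min(timeit.repeat(lambda: new_fn(*args), repeat=reps, number=number))
    return new_t < old_t * 0.95

# Toy example: a hand-rolled loop vs the C-implemented builtin.
def slow_sum(xs):
    total = 0
    for x in xs:
        total += x
    return total

data = list(range(1_000))
worth_full_run = prescreen(slow_sum, sum, data)
```

The `min` (rather than mean) is deliberate: the fastest observed run is the closest estimate of the code's intrinsic cost, since noise only ever adds time.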
### Phase 2: Authoritative measurement (E2E benchmark)
After committing a KEEP, run the E2E benchmark tool for the official numbers that go into `results.tsv` and determine the final keep/discard verdict.
**Purpose**: Produce trustworthy, isolated, reproducible measurements. These are the numbers you report to the user and record in session state.
If the E2E benchmark contradicts the micro-benchmark (e.g., micro showed 15% but E2E shows 2%), **trust the E2E measurement** — the micro-benchmark may have missed overhead from setup, imports, or interaction with other code paths.
## Usage in the Experiment Loop
### After every KEEP commit
Compare the commit before your optimization with HEAD. Record the baseline commit SHA in `.codeflash/HANDOFF.md` at session start for easy reference. See your language's `e2e-benchmarks.md` for the specific commands.
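Pinning the baseline can be as simple as a `git rev-parse` wrapper; the helper name and the throwaway demo repo below are illustrative, assuming only that `git` is on the PATH:

```python
import subprocess
import tempfile

def git_sha(ref: str = "HEAD", cwd: str = ".") -> str:
    """Resolve a ref to its full commit SHA, e.g. to record the baseline
    commit in `.codeflash/HANDOFF.md` at session start."""
    out = subprocess.run(["git", "rev-parse", ref], cwd=cwd,
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

# Demo against a throwaway repo so the snippet is self-contained.
with tempfile.TemporaryDirectory() as repo:
    subprocess.run(["git", "init", "-q"], cwd=repo, check=True)
    subprocess.run(["git", "-c", "user.email=demo@example.com", "-c", "user.name=demo",
                    "commit", "-q", "--allow-empty", "-m", "baseline"], cwd=repo, check=True)
    baseline_sha = git_sha("HEAD", cwd=repo)
```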
### Reading the output
The output typically includes:
1. **End-to-End table**: Total benchmark time for base vs head, with delta and speedup
2. **Per-Function Breakdown**: Each instrumented function's time in both refs
3. **Share of Benchmark Time**: What percentage of total time each function consumes
Use the per-function breakdown to confirm your optimization targeted the right function and didn't cause regressions in others.
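As a worked illustration, all three views can be derived from per-function timings. The function names and numbers below are made up:

```python
# Hypothetical instrumented timings (seconds) for each ref.
base = {"parse": 1.20, "transform": 0.60, "serialize": 0.20}
head = {"parse": 0.40, "transform": 0.62, "serialize": 0.20}

total_base = sum(base.values())  # end-to-end time at the base ref
total_head = sum(head.values())  # end-to-end time at the head ref
print(f"End-to-End: {total_base:.2f}s -> {total_head:.2f}s "
      f"({total_base / total_head:.2f}x)")

for fn in base:
    share = 100 * base[fn] / total_base  # Share of Benchmark Time
    ratio = base[fn] / head[fn]          # Per-Function speedup
    print(f"  {fn:<10} {base[fn]:.2f}s -> {head[fn]:.2f}s  "
          f"{ratio:.2f}x  ({share:.0f}% of base)")
```

Note how `transform` regressed slightly (0.60s to 0.62s) even though the overall run got faster; that is exactly the kind of side effect the per-function breakdown exposes and a total-time-only measurement would hide.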
## Fallback: When E2E Benchmarks Are Not Available
If the project doesn't have an instrumented benchmark tool:
1. Use ad-hoc micro-benchmarks as the primary measurement (see `micro-benchmark.md`)
2. Use the language's test runner timing as a secondary signal
3. Use the language's profiler for per-function attribution
These are less rigorous but still useful. Note the fallback in `.codeflash/setup.md`.
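For per-function attribution without any project tooling, Python's built-in profiler is one option (in other languages, substitute the native profiler). Here `hot_path` is a stand-in for the function under optimization:

```python
import cProfile
import io
import pstats

def hot_path(n: int) -> int:
    # Stand-in for the function being optimized.
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
hot_path(100_000)
profiler.disable()

# Render the top entries sorted by cumulative time.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
report = buf.getvalue()
print(report)
```

Unlike the E2E tool's worktree-isolated runs, this profiles the working tree as-is, so treat the numbers as attribution hints rather than authoritative before/after measurements.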
## Known Limitations
- **Requires committed code**: E2E tools work on git refs, so changes must be committed before they can be benchmarked. This is why it's a Phase 2 step (after commit), not Phase 1.
- **Benchmark files must exist**: If the project has no benchmarks yet, this tool can't help — fall back to ad-hoc measurement.