# End-to-End Benchmarks
When the project has an instrumented benchmark tool available, use it as the **authoritative** before/after measurement for every optimization. E2E benchmarks run each ref in an isolated git worktree with per-function instrumentation, so measurements are reproducible and free from working-tree contamination.
## Detection
Check at session start whether the project's benchmark tool is available. Record the result in `.codeflash/setup.md` under `## E2E Benchmarks`. See your language's `e2e-benchmarks.md` for the specific detection steps.
## How It Works
An E2E benchmark tool should:
1. Auto-detect changed functions from `git diff`
2. Create **isolated git worktrees** for each ref — no working-tree contamination
3. Instrument target functions for per-function timing
4. Run benchmarks in isolation
5. Produce per-function timings and a side-by-side comparison table
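Step 1 can be sketched as a small heuristic over `git diff` output. Real tools compare the ASTs of both refs; the regex below (and the sample diff) are only an illustration of the idea:

```python
import re

def changed_functions(diff_text: str) -> set[str]:
    """Extract names of functions whose bodies changed, from unified diff text.

    Unified diffs carry the enclosing declaration in the hunk header
    ("@@ ... @@ def foo(...):"), so we scan those lines. This heuristic
    misses renames and newly added functions; it is a sketch, not a parser.
    """
    names = set()
    for line in diff_text.splitlines():
        m = re.match(r"^@@ .* @@ .*def (\w+)", line)
        if m:
            names.add(m.group(1))
    return names

sample_diff = """\
@@ -10,6 +10,7 @@ def parse_row(line):
-    parts = line.split(',')
+    parts = line.split(',', maxsplit=3)
"""
print(changed_functions(sample_diff))  # {'parse_row'}
```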
This is strictly better than ad-hoc timing scripts because:
- **Isolation**: Each ref runs in its own worktree — no stale build artifacts or uncommitted changes
- **Instrumentation**: Per-function timing, not just wall-clock
- **Reproducibility**: Same command produces same measurement on any machine
- **Structured output**: Per-function breakdown with speedup ratios, not just total time
## Two-Phase Measurement
The experiment loop uses a **two-phase** approach:
### Phase 1: Quick pre-screen (ad-hoc micro-benchmark)
Before committing, run a quick ad-hoc micro-benchmark (see `micro-benchmark.md`) to validate the optimization is worth a full benchmark. This is fast (<10s) and catches obvious regressions or no-ops early.
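A minimal pre-screen along these lines, using the standard-library `timeit` module. The baseline/optimized pair here is a placeholder (string concatenation vs. `str.join`); substitute the real before/after functions:

```python
import timeit

def baseline(parts):
    # Placeholder for the original implementation.
    out = ""
    for p in parts:
        out += p
    return out

def optimized(parts):
    # Placeholder for the candidate implementation.
    return "".join(parts)

parts = ["x"] * 1000
# Take the best of 5 repeats to reduce scheduler noise; each repeat
# runs the call 500 times so per-call overhead dominates less.
t_base = min(timeit.repeat(lambda: baseline(parts), number=500, repeat=5))
t_opt = min(timeit.repeat(lambda: optimized(parts), number=500, repeat=5))
print(f"micro-benchmark speedup: {t_base / t_opt:.2f}x")
```

Taking the *minimum* over repeats is the usual convention for micro-benchmarks: the fastest run is the one least disturbed by other processes.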
**Purpose**: Gate for investing in a full E2E run. If the micro-benchmark shows no improvement, discard immediately without the overhead of worktree creation.
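The gate itself reduces to a threshold check. The 5% cutoff below is an illustrative choice, not a value this guide prescribes:

```python
def worth_full_benchmark(micro_speedup: float, threshold: float = 1.05) -> bool:
    """Decide whether a micro-benchmark result justifies a full E2E run.

    micro_speedup is baseline_time / optimized_time, so 1.0 means no change.
    """
    return micro_speedup >= threshold

assert worth_full_benchmark(1.15)        # 15% faster: proceed to Phase 2
assert not worth_full_benchmark(0.98)    # regression: discard immediately
assert not worth_full_benchmark(1.01)    # within noise: discard
```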
### Phase 2: Authoritative measurement (E2E benchmark)
After committing a KEEP, run the E2E benchmark tool for the official numbers that go into `results.tsv` and determine the final keep/discard verdict.
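Recording the official numbers might look like the sketch below. The column layout is an assumption for illustration; use whatever schema your session's `results.tsv` already has:

```python
import csv
import io

def append_result(out, function, base_ns, head_ns, verdict):
    """Append one benchmark row as TSV.

    Real code would open .codeflash/results.tsv in append mode; a StringIO
    stands in here so the sketch is self-contained.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    speedup = base_ns / head_ns
    writer.writerow([function, base_ns, head_ns, f"{speedup:.2f}x", verdict])

buf = io.StringIO()
append_result(buf, "parse_row", 1200, 800, "KEEP")
print(buf.getvalue())  # parse_row	1200	800	1.50x	KEEP
```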
**Purpose**: Produce trustworthy, isolated, reproducible measurements. These are the numbers you report to the user and record in session state.
If the E2E benchmark contradicts the micro-benchmark (e.g., micro showed 15% but E2E shows 2%), **trust the E2E measurement** — the micro-benchmark may have missed overhead from setup, imports, or interaction with other code paths.
## Usage in the Experiment Loop
### After every KEEP commit
Compare the commit before your optimization with HEAD. Record the baseline commit SHA in `.codeflash/HANDOFF.md` at session start for easy reference. See your language's `e2e-benchmarks.md` for the specific commands.
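The actual CLI differs per language, so the tool name and flags in the sketch below are placeholders; only the `git rev-parse` call is real:

```python
import subprocess

def e2e_command(base: str, head: str = "HEAD") -> list[str]:
    """Build the compare invocation.

    `bench-tool` and its flags are hypothetical; use the commands from
    your language's e2e-benchmarks.md instead.
    """
    return ["bench-tool", "compare", "--base", base, "--head", head]

def baseline_sha() -> str:
    """Resolve the commit before the optimization.

    Prefer the SHA recorded in .codeflash/HANDOFF.md at session start;
    HEAD~1 only works when exactly one KEEP commit has landed since.
    """
    return subprocess.run(
        ["git", "rev-parse", "HEAD~1"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

print(e2e_command("abc1234"))
```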
### Reading the output
The output typically includes:
1. **End-to-End table**: Total benchmark time for base vs head, with delta and speedup
2. **Per-Function Breakdown**: Each instrumented function's time in both refs
3. **Share of Benchmark Time**: What percentage of total time each function consumes

Use the per-function breakdown to confirm your optimization targeted the right function and didn't cause regressions in others.
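That check can be automated over the breakdown table. The table format below is a made-up example of the kind of output such tools produce, not a real schema:

```python
def parse_breakdown(table: str) -> dict[str, float]:
    """Parse a 'function  base_ms  head_ms' table into per-function speedups."""
    speedups = {}
    for line in table.strip().splitlines()[1:]:  # skip the header row
        name, base_ms, head_ms = line.split()
        speedups[name] = float(base_ms) / float(head_ms)
    return speedups

sample = """\
function    base_ms  head_ms
parse_row   120.0    80.0
write_out   40.0     41.0
"""
result = parse_breakdown(sample)
assert result["parse_row"] == 1.5          # the target function got faster
assert 0.9 < result["write_out"] < 1.1     # no regression elsewhere
```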
## Fallback: When E2E Benchmarks Are Not Available
If the project doesn't have an instrumented benchmark tool:
1. Use ad-hoc micro-benchmarks as the primary measurement (see `micro-benchmark.md`)
2. Use the language's test runner timing as a secondary signal
3. Use the language's profiler for per-function attribution
These are less rigorous but still useful. Note the fallback in `.codeflash/setup.md`.
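In Python, per-function attribution via the standard-library profiler looks like this; the workload is a placeholder for the real target function:

```python
import cProfile
import io
import pstats

def hot_function(n):
    # Placeholder for the function under optimization.
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
hot_function(100_000)
profiler.disable()

out = io.StringIO()
# Restrict the printed stats to entries matching "hot_function".
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats("hot_function")
report = out.getvalue()
print(report)
```

This gives per-function cumulative time, which is the attribution an instrumented E2E tool would otherwise provide.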
## Known Limitations
- **Requires committed code**: E2E tools work on git refs, so changes must be committed before they can be benchmarked. This is why it's a Phase 2 step (after commit), not Phase 1.
- **Benchmark files must exist**: If the project has no benchmarks yet, this tool can't help — fall back to ad-hoc measurement.