# End-to-End Benchmarks with `codeflash compare`
When the project has `codeflash` installed and `benchmarks-root` configured in `pyproject.toml`, use `codeflash compare` as the **authoritative** before/after measurement for every optimization. It provides worktree-isolated, instrumented benchmarks that are reproducible and free from working-tree contamination.
## Detection
Check at session start:
```bash
# Is codeflash installed?
$RUNNER -c "import codeflash" 2>/dev/null && echo "codeflash available" || echo "not available"
# Is benchmarks-root configured?
grep -A5 '\[tool\.codeflash\]' pyproject.toml | grep benchmarks.root
```
If both checks pass, `codeflash compare` is available. Record this in `.codeflash/setup.md`:
```
## E2E Benchmarks
codeflash compare: available
benchmarks-root: <path>
```
If either check fails, fall back to ad-hoc micro-benchmarks (see `micro-benchmark.md`).
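A minimal sketch that runs both checks and writes the record in one step. The path extraction assumes `benchmarks-root = "<path>"` is written with double quotes in `pyproject.toml`; adjust to the project's actual syntax:
```bash
# Run both detection checks and record the outcome in .codeflash/setup.md
# (assumes benchmarks-root is written as: benchmarks-root = "path/to/benchmarks")
if $RUNNER -c "import codeflash" 2>/dev/null \
   && grep -A5 '\[tool\.codeflash\]' pyproject.toml | grep -q 'benchmarks-root'; then
  BENCH_ROOT=$(grep -A5 '\[tool\.codeflash\]' pyproject.toml | grep 'benchmarks-root' | cut -d'"' -f2)
  printf '## E2E Benchmarks\ncodeflash compare: available\nbenchmarks-root: %s\n' "$BENCH_ROOT" >> .codeflash/setup.md
else
  printf '## E2E Benchmarks\ncodeflash compare: not available\nfallback: ad-hoc micro-benchmarks + pytest durations\n' >> .codeflash/setup.md
fi
```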
## How It Works
`codeflash compare <base_ref> <head_ref>`:
1. Auto-detects changed functions from `git diff` (line-level overlap, not just file-level; a preview sketch follows this list)
2. Creates **isolated git worktrees** for each ref — no working-tree contamination
3. Instruments target functions with `@codeflash_trace`
4. Runs benchmarks via `trace_benchmarks_pytest`
5. Produces per-function nanosecond timings and a side-by-side comparison table
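To sanity-check what step 1 will consider before running a comparison, you can list the Python files that differ between the two refs. This is only a file-level preview; `codeflash compare` itself narrows to line-level overlap:
```bash
# Rough file-level preview of the diff scope between the two refs
git diff --name-only <base_ref> <head_ref> -- '*.py'
```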
`codeflash compare` is strictly better than ad-hoc `time.perf_counter` scripts because:
- **Isolation**: Each ref runs in its own worktree — no stale `.pyc` files, no uncommitted changes
- **Instrumentation**: `@codeflash_trace` captures per-function timing, not just wall-clock
- **Reproducibility**: Same command produces same measurement on any machine
- **Structured output**: Per-function breakdown with speedup ratios, not just total time
## Usage in the Experiment Loop
### After every KEEP commit
Once you commit an optimization, run:
```bash
# Compare the commit before your optimization with HEAD
$RUNNER -m codeflash compare <pre-optimization-sha> HEAD --timeout 120
```
The `<pre-optimization-sha>` is the commit just before your optimization. For example, if your last KEEP was commit `abc1234`, compare it against its parent:
```bash
$RUNNER -m codeflash compare abc1234^ abc1234 --timeout 120
```
Or to measure cumulative improvement since the session baseline:
```bash
$RUNNER -m codeflash compare <baseline-sha> HEAD --timeout 120
```
Record the baseline SHA in `.codeflash/HANDOFF.md` at session start for easy reference.
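A sketch for recording the baseline at session start (the `baseline-sha:` line format in `HANDOFF.md` is an assumption, not a required layout):
```bash
# Capture the session baseline so later cumulative comparisons have a stable ref
BASELINE_SHA=$(git rev-parse HEAD)
echo "baseline-sha: $BASELINE_SHA" >> .codeflash/HANDOFF.md
```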
### Explicit function targeting
When auto-detection misses functions (e.g., methods inside classes are excluded by default), use `--functions`:
```bash
$RUNNER -m codeflash compare <base> HEAD --functions "src/module.py::func1,func2;src/other.py::func3"
```
### Reading the output
The output includes:
1. **End-to-End table**: Total benchmark time for base vs head, with delta and speedup
2. **Per-Function Breakdown**: Each instrumented function's time in both refs
3. **Share of Benchmark Time**: What percentage of total time each function consumes
Use the per-function breakdown to confirm your optimization targeted the right function and didn't cause regressions in others.
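To keep the full report alongside session state, one option is to tee it into a per-commit file (the `.codeflash/compare-*.txt` naming is an assumption):
```bash
# Save the full comparison report for the session record
$RUNNER -m codeflash compare <baseline-sha> HEAD --timeout 120 \
  | tee ".codeflash/compare-$(git rev-parse --short HEAD).txt"
```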
## Two-Phase Measurement
The experiment loop uses a **two-phase** approach:
### Phase 1: Quick pre-screen (ad-hoc micro-benchmark)
Before committing, run a quick ad-hoc micro-benchmark (see `micro-benchmark.md`) to validate the optimization is worth a full benchmark. This is fast (<10s) and catches obvious regressions or no-ops early.
**Purpose**: Gate for investing in a full `codeflash compare` run. If the micro-benchmark shows no improvement, discard immediately without the overhead of worktree creation.
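One possible shape of the pre-screen, using `timeit` against a hypothetical target `func1` in `src/module.py` (see `micro-benchmark.md` for the full recipe):
```bash
# Quick pre-screen: run the same command against the candidate and the current code
# (hypothetical import path and call; adapt setup/arguments to the real target)
$RUNNER -m timeit -n 50 -r 5 -s "from src.module import func1" "func1()"
```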
### Phase 2: Authoritative measurement (`codeflash compare`)
After committing a KEEP, run `codeflash compare` for the official numbers that go into `results.tsv` and determine the final keep/discard verdict.
**Purpose**: Produce trustworthy, isolated, reproducible measurements. These are the numbers you report to the user and record in session state.
If `codeflash compare` contradicts the micro-benchmark (e.g., micro showed 15% but e2e shows 2%), **trust `codeflash compare`** — the micro-benchmark may have missed overhead from setup, imports, or interaction with other code paths.
## Fallback: When `codeflash compare` Is Not Available
If the project doesn't have `codeflash` installed or `benchmarks-root` configured:
1. Use ad-hoc micro-benchmarks as the primary measurement (see `micro-benchmark.md`)
2. Use `pytest --durations` for test suite wall-clock as a secondary signal
3. Use `cProfile` cumtime comparisons for project-function-level attribution (items 2 and 3 are sketched below)
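Items 2 and 3 might look like this in practice (a sketch; `profile.out` and the entry counts are arbitrary choices):
```bash
# Secondary signal: slowest tests by wall-clock
$RUNNER -m pytest --durations=20 -q
# Function-level attribution: profile the suite, then inspect cumulative time
$RUNNER -m cProfile -o profile.out -m pytest -q
$RUNNER -c "import pstats; pstats.Stats('profile.out').sort_stats('cumulative').print_stats(20)"
```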
These fallback signals are less rigorous but still useful. Note in `.codeflash/setup.md`:
```
## E2E Benchmarks
codeflash compare: not available (reason: <no benchmarks-root | codeflash not installed>)
fallback: ad-hoc micro-benchmarks + pytest durations
```
## Known Limitations
- **Only top-level functions** are auto-detected and instrumented. Class methods are excluded because `@codeflash_trace` pickles `self` on every call, which is catastrophic when `self` holds large objects (e.g., CST trees). Use `--functions` to explicitly target methods when needed.
- **Requires committed code**. `codeflash compare` works on git refs, so changes must be committed before they can be benchmarked. This is why it's a Phase 2 step (after commit), not Phase 1.
- **Benchmark files must exist** in `benchmarks-root`. If the project has no benchmarks yet, this tool can't help — fall back to ad-hoc measurement.