# End-to-End Benchmarks with `codeflash compare`
When the project has `codeflash` installed and `benchmarks-root` configured in `pyproject.toml`, use `codeflash compare` as the **authoritative** before/after measurement for every optimization. It provides worktree-isolated, instrumented benchmarks that are reproducible and free from working-tree contamination.
## Detection
Check at session start:
```bash
# Is codeflash installed?
$RUNNER -c "import codeflash" 2>/dev/null && echo "codeflash available" || echo "not available"
# Is benchmarks-root configured?
grep -A5 '\[tool\.codeflash\]' pyproject.toml | grep benchmarks.root
```
If both checks pass, `codeflash compare` is available. Record this in `.codeflash/setup.md`:
```
## E2E Benchmarks
codeflash compare: available
benchmarks-root: <path>
```
If either check fails, fall back to ad-hoc micro-benchmarks (see `micro-benchmark.md`).
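A minimal sketch that runs both checks and writes the record in one step. The path extraction assumes `benchmarks-root = "<path>"` is written with double quotes in `pyproject.toml`; adjust to the project's actual syntax:
```bash
# Run both detection checks and record the outcome in .codeflash/setup.md
# (assumes benchmarks-root is written as: benchmarks-root = "path/to/benchmarks")
if $RUNNER -c "import codeflash" 2>/dev/null \
   && grep -A5 '\[tool\.codeflash\]' pyproject.toml | grep -q 'benchmarks-root'; then
  BENCH_ROOT=$(grep -A5 '\[tool\.codeflash\]' pyproject.toml | grep 'benchmarks-root' | cut -d'"' -f2)
  printf '## E2E Benchmarks\ncodeflash compare: available\nbenchmarks-root: %s\n' "$BENCH_ROOT" >> .codeflash/setup.md
else
  printf '## E2E Benchmarks\ncodeflash compare: not available\nfallback: ad-hoc micro-benchmarks + pytest durations\n' >> .codeflash/setup.md
fi
```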
## How It Works
`codeflash compare <base_ref> <head_ref>`:
1. Auto-detects changed functions from `git diff` (line-level overlap, not just file-level; a preview sketch follows this list)
2. Creates **isolated git worktrees** for each ref — no working-tree contamination
3. Instruments target functions with `@codeflash_trace`
4. Runs benchmarks via `trace_benchmarks_pytest`
5. Produces per-function nanosecond timings and a side-by-side comparison table
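To sanity-check what step 1 will consider before running a comparison, you can list the Python files that differ between the two refs. This is only a file-level preview; `codeflash compare` itself narrows to line-level overlap:
```bash
# Rough file-level preview of the diff scope between the two refs
git diff --name-only <base_ref> <head_ref> -- '*.py'
```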
`codeflash compare` is strictly better than ad-hoc `time.perf_counter` scripts because:
- **Isolation**: Each ref runs in its own worktree — no stale `.pyc` files, no uncommitted changes
- **Instrumentation**: `@codeflash_trace` captures per-function timing, not just wall-clock
- **Reproducibility**: Same command produces same measurement on any machine
- **Structured output**: Per-function breakdown with speedup ratios, not just total time
## Usage in the Experiment Loop
### After every KEEP commit
Once you commit an optimization, run:
```bash
# Compare the commit before your optimization with HEAD
$RUNNER -m codeflash compare <pre-optimization-sha> HEAD --timeout 120
```
The `<pre-optimization-sha>` is the commit just before your optimization. For example, if your last KEEP was commit `abc1234`, compare it against its parent:
```bash
$RUNNER -m codeflash compare abc1234^ abc1234 --timeout 120
```
Or to measure cumulative improvement since the session baseline:
```bash
$RUNNER -m codeflash compare <baseline-sha> HEAD --timeout 120
```
Record the baseline SHA in `.codeflash/HANDOFF.md` at session start for easy reference.
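A sketch for recording the baseline at session start (the `baseline-sha:` line format in `HANDOFF.md` is an assumption, not a required layout):
```bash
# Capture the session baseline so later cumulative comparisons have a stable ref
BASELINE_SHA=$(git rev-parse HEAD)
echo "baseline-sha: $BASELINE_SHA" >> .codeflash/HANDOFF.md
```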
### Explicit function targeting
When auto-detection misses functions (e.g., methods inside classes are excluded by default), use `--functions`:
```bash
$RUNNER -m codeflash compare <base> HEAD --functions "src/module.py::func1,func2;src/other.py::func3"
```
### Reading the output
The output includes:
1. **End-to-End table**: Total benchmark time for base vs head, with delta and speedup
2. **Per-Function Breakdown**: Each instrumented function's time in both refs
3. **Share of Benchmark Time**: What percentage of total time each function consumes
Use the per-function breakdown to confirm your optimization targeted the right function and didn't cause regressions in others.
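To keep the full report alongside session state, one option is to tee it into a per-commit file (the `.codeflash/compare-*.txt` naming is an assumption):
```bash
# Save the full comparison report for the session record
$RUNNER -m codeflash compare <baseline-sha> HEAD --timeout 120 \
  | tee ".codeflash/compare-$(git rev-parse --short HEAD).txt"
```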
## Two-Phase Measurement
The experiment loop uses a **two-phase** approach:
### Phase 1: Quick pre-screen (ad-hoc micro-benchmark)
Before committing, run a quick ad-hoc micro-benchmark (see `micro-benchmark.md`) to validate the optimization is worth a full benchmark. This is fast (<10s) and catches obvious regressions or no-ops early.
**Purpose**: Gate for investing in a full `codeflash compare` run. If the micro-benchmark shows no improvement, discard immediately without the overhead of worktree creation.
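One possible shape of the pre-screen, using `timeit` against a hypothetical target `func1` in `src/module.py` (see `micro-benchmark.md` for the full recipe):
```bash
# Quick pre-screen: run the same command against the candidate and the current code
# (hypothetical import path and call; adapt setup/arguments to the real target)
$RUNNER -m timeit -n 50 -r 5 -s "from src.module import func1" "func1()"
```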
### Phase 2: Authoritative measurement (`codeflash compare`)
After committing a KEEP, run `codeflash compare` for the official numbers that go into `results.tsv` and determine the final keep/discard verdict.
**Purpose**: Produce trustworthy, isolated, reproducible measurements. These are the numbers you report to the user and record in session state.
If `codeflash compare` contradicts the micro-benchmark (e.g., micro showed 15% but e2e shows 2%), **trust `codeflash compare`** — the micro-benchmark may have missed overhead from setup, imports, or interaction with other code paths.
## Fallback: When `codeflash compare` Is Not Available
If the project doesn't have `codeflash` installed or `benchmarks-root` configured:
1. Use ad-hoc micro-benchmarks as the primary measurement (see `micro-benchmark.md`)
2. Use `pytest --durations` for test suite wall-clock as a secondary signal
3. Use `cProfile` cumtime comparisons for project-function-level attribution (items 2 and 3 are sketched below)
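Items 2 and 3 might look like this in practice (a sketch; `profile.out` and the entry counts are arbitrary choices):
```bash
# Secondary signal: slowest tests by wall-clock
$RUNNER -m pytest --durations=20 -q
# Function-level attribution: profile the suite, then inspect cumulative time
$RUNNER -m cProfile -o profile.out -m pytest -q
$RUNNER -c "import pstats; pstats.Stats('profile.out').sort_stats('cumulative').print_stats(20)"
```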
These fallback signals are less rigorous but still useful. Note in `.codeflash/setup.md`:
```
## E2E Benchmarks
codeflash compare: not available (reason: <no benchmarks-root | codeflash not installed>)
fallback: ad-hoc micro-benchmarks + pytest durations
```
## Known Limitations
- **Only top-level functions** are auto-detected and instrumented. Class methods are excluded because `@codeflash_trace` pickles `self` on every call, which is catastrophic when `self` holds large objects (e.g., CST trees). Use `--functions` to explicitly target methods when needed.
- **Requires committed code**. `codeflash compare` works on git refs, so changes must be committed before they can be benchmarked. This is why it's a Phase 2 step (after commit), not Phase 1.
- **Benchmark files must exist** in `benchmarks-root`. If the project has no benchmarks yet, this tool can't help — fall back to ad-hoc measurement.