End-to-End Benchmarks with codeflash compare

When the project has codeflash installed and benchmarks-root configured in pyproject.toml, use codeflash compare as the authoritative before/after measurement for every optimization. It runs instrumented benchmarks in isolated git worktrees, so measurements are reproducible and free from working-tree contamination.

Detection

Check at session start:

# Is codeflash installed?
$RUNNER -c "import codeflash" 2>/dev/null && echo "codeflash available" || echo "not available"

# Is benchmarks-root configured?
grep -A5 '\[tool\.codeflash\]' pyproject.toml | grep benchmarks.root
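
A matching [tool.codeflash] section looks roughly like this (a sketch; the benchmarks/ path is an illustrative assumption, not a required value):

[tool.codeflash]
benchmarks-root = "benchmarks/"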

If both checks pass, codeflash compare is available. Record this in .codeflash/setup.md:

## E2E Benchmarks
codeflash compare: available
benchmarks-root: <path>
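
One way to write that record (a sketch; substitute the real path for <path>):

mkdir -p .codeflash
cat >> .codeflash/setup.md <<'EOF'
## E2E Benchmarks
codeflash compare: available
benchmarks-root: <path>
EOF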

If either check fails, fall back to ad-hoc micro-benchmarks (see micro-benchmark.md).

How It Works

codeflash compare <base_ref> <head_ref>:

  1. Auto-detects changed functions from git diff (line-level overlap, not just file-level)
  2. Creates isolated git worktrees for each ref — no working-tree contamination (sketched below)
  3. Instruments target functions with @codeflash_trace
  4. Runs benchmarks via trace_benchmarks_pytest
  5. Produces per-function nanosecond timings and a side-by-side comparison table
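
The worktree isolation in step 2 is conceptually equivalent to the sketch below (codeflash does this internally; the /tmp paths are illustrative, not commands you run yourself):

# Conceptual sketch of the per-ref isolation (paths illustrative)
git worktree add /tmp/cf-base <base_ref>   # pristine checkout of the base ref
git worktree add /tmp/cf-head <head_ref>   # pristine checkout of the head ref
# ...instrument target functions and run benchmarks inside each worktree...
git worktree remove --force /tmp/cf-base
git worktree remove --force /tmp/cf-head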

This is strictly better than ad-hoc time.perf_counter scripts because:

  • Isolation: Each ref runs in its own worktree — no stale .pyc files, no uncommitted changes
  • Instrumentation: @codeflash_trace captures per-function timing, not just wall-clock
  • Reproducibility: Rerunning the same command repeats the same comparison, and speedup ratios stay comparable across machines even when absolute timings differ
  • Structured output: Per-function breakdown with speedup ratios, not just total time

Usage in the Experiment Loop

After every KEEP commit

Once you commit an optimization, run:

# Compare the commit before your optimization with HEAD
$RUNNER -m codeflash compare <pre-optimization-sha> HEAD --timeout 120

The <pre-optimization-sha> is the commit just before your optimization. If you're on experiment N and your last KEEP was commit abc1234:

$RUNNER -m codeflash compare abc1234^ abc1234 --timeout 120

Or to measure cumulative improvement since the session baseline:

$RUNNER -m codeflash compare <baseline-sha> HEAD --timeout 120

Record the baseline SHA in .codeflash/HANDOFF.md at session start for easy reference.
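
A minimal way to capture it (a sketch; assumes HEAD is still the untouched baseline at session start):

# Record the session baseline once, before any optimization commits
BASELINE_SHA=$(git rev-parse HEAD)
echo "baseline-sha: $BASELINE_SHA" >> .codeflash/HANDOFF.md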

Explicit function targeting

When auto-detection misses functions (e.g., methods inside classes are excluded by default), target them explicitly with --functions; semicolons separate files and commas separate functions within the same file:

$RUNNER -m codeflash compare <base> HEAD --functions "src/module.py::func1,func2;src/other.py::func3"

Reading the output

The output includes:

  1. End-to-End table: Total benchmark time for base vs head, with delta and speedup
  2. Per-Function Breakdown: Each instrumented function's time in both refs
  3. Share of Benchmark Time: What percentage of total time each function consumes

Use the per-function breakdown to confirm your optimization targeted the right function and didn't cause regressions in others.

Two-Phase Measurement

The experiment loop uses a two-phase approach:

Phase 1: Quick pre-screen (ad-hoc micro-benchmark)

Before committing, run a quick ad-hoc micro-benchmark (see micro-benchmark.md) to validate the optimization is worth a full benchmark. This is fast (<10s) and catches obvious regressions or no-ops early.

Purpose: Gate whether a full codeflash compare run is worth the cost. If the micro-benchmark shows no improvement, discard immediately and skip the overhead of worktree creation.
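
The pre-screen can be as small as a timeit one-liner; the module, function, and input below are hypothetical placeholders (see micro-benchmark.md for the full pattern):

# Hypothetical target: substitute your real module, function, and input
$RUNNER -m timeit -s "from src.module import func; data = list(range(10_000))" "func(data)"

Run the same one-liner before and after the change; if the per-loop time doesn't drop, discard without committing.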

Phase 2: Authoritative measurement (codeflash compare)

After committing a KEEP, run codeflash compare for the official numbers that go into results.tsv and determine the final keep/discard verdict.

Purpose: Produce trustworthy, isolated, reproducible measurements. These are the numbers you report to the user and record in session state.

If codeflash compare contradicts the micro-benchmark (e.g., micro showed 15% but e2e shows 2%), trust codeflash compare — the micro-benchmark may have missed overhead from setup, imports, or interaction with other code paths.

Fallback: When codeflash compare Is Not Available

If the project doesn't have codeflash installed or benchmarks-root configured, fall back as follows (fallbacks 2 and 3 are sketched after the list):

  1. Use ad-hoc micro-benchmarks as the primary measurement (see micro-benchmark.md)
  2. Use pytest --durations for test suite wall-clock as a secondary signal
  3. Use cProfile cumtime comparisons for project-function-level attribution
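
Sketches for fallbacks 2 and 3 (the tests/ path is an assumption; run the same commands on both refs and compare the output):

# Secondary signal: the 20 slowest tests by wall-clock time
$RUNNER -m pytest --durations=20 -q

# Function-level attribution: profile the test run, sorted by cumulative time
$RUNNER -m cProfile -s cumtime -m pytest tests/ -q | head -40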

These are less rigorous but still useful. Note in .codeflash/setup.md:

## E2E Benchmarks
codeflash compare: not available (reason: <no benchmarks-root | codeflash not installed>)
fallback: ad-hoc micro-benchmarks + pytest durations

Known Limitations

  • Only top-level functions are auto-detected and instrumented. Class methods are excluded because @codeflash_trace pickles self on every call, which is catastrophic when self holds large objects (e.g., CST trees). Use --functions to explicitly target methods when needed.
  • Requires committed code. codeflash compare works on git refs, so changes must be committed before they can be benchmarked. This is why it's a Phase 2 step (after commit), not Phase 1.
  • Benchmark files must exist in benchmarks-root. If the project has no benchmarks yet, this tool can't help — fall back to ad-hoc measurement.