End-to-End Benchmarks with codeflash compare
When the project has codeflash installed and benchmarks-root configured in pyproject.toml, use codeflash compare as the authoritative before/after measurement for every optimization. It runs instrumented benchmarks in isolated git worktrees, so measurements are reproducible and free from working-tree contamination.
Detection
Check at session start:
# Is codeflash installed?
$RUNNER -c "import codeflash" 2>/dev/null && echo "codeflash available" || echo "not available"
# Is benchmarks-root configured?
grep -A5 '\[tool\.codeflash\]' pyproject.toml | grep benchmarks.root
If both checks pass, codeflash compare is available. Record this in .codeflash/setup.md:
## E2E Benchmarks
codeflash compare: available
benchmarks-root: <path>
If either check fails, fall back to ad-hoc micro-benchmarks (see micro-benchmark.md).
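The two checks can be combined into a single gate; a minimal sketch using the same commands as above:
# One-shot gate: record availability only if both checks pass
if $RUNNER -c "import codeflash" 2>/dev/null && \
   grep -A5 '\[tool\.codeflash\]' pyproject.toml | grep -q benchmarks.root; then
  echo "codeflash compare: available"
else
  echo "codeflash compare: not available"
fi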
How It Works
codeflash compare <base_ref> <head_ref>:
- Auto-detects changed functions from git diff (line-level overlap, not just file-level)
- Creates an isolated git worktree for each ref, so nothing in the working tree contaminates the run (see the sketch after this list)
- Instruments target functions with @codeflash_trace
- Runs benchmarks via trace_benchmarks_pytest
- Produces per-function nanosecond timings and a side-by-side comparison table
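Conceptually, the worktree isolation is equivalent to the following (a simplified sketch of the idea, not codeflash's actual internals):
# Illustrative only: the kind of isolation codeflash sets up for each ref
git worktree add /tmp/cf-base <base_ref>    # clean checkout of the base ref
git worktree add /tmp/cf-head <head_ref>    # clean checkout of the head ref
# ...instrumented benchmarks run inside each worktree...
git worktree remove /tmp/cf-base
git worktree remove /tmp/cf-head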
This is strictly better than ad-hoc time.perf_counter scripts because:
- Isolation: Each ref runs in its own worktree, with no stale .pyc files and no uncommitted changes
- Instrumentation: @codeflash_trace captures per-function timing, not just wall-clock time
- Reproducibility: Same command produces the same measurement on any machine
- Structured output: Per-function breakdown with speedup ratios, not just total time
Usage in the Experiment Loop
After every KEEP commit
Once you commit an optimization, run:
# Compare the commit before your optimization with HEAD
$RUNNER -m codeflash compare <pre-optimization-sha> HEAD --timeout 120
The <pre-optimization-sha> is the commit just before your optimization. If you're on experiment N and your last KEEP was commit abc1234:
$RUNNER -m codeflash compare abc1234^ abc1234 --timeout 120
Or to measure cumulative improvement since the session baseline:
$RUNNER -m codeflash compare <baseline-sha> HEAD --timeout 120
Record the baseline SHA in .codeflash/HANDOFF.md at session start for easy reference.
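A sketch of recording that baseline at session start (the key name is a placeholder; match whatever format HANDOFF.md already uses):
# At session start, once
echo "baseline-sha: $(git rev-parse HEAD)" >> .codeflash/HANDOFF.md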
Explicit function targeting
When auto-detection misses functions (e.g., methods inside classes are excluded by default), use --functions:
$RUNNER -m codeflash compare <base> HEAD --functions "src/module.py::func1,func2;src/other.py::func3"
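In this format, :: separates a file path from the function names it contains, commas separate functions within the same file, and semicolons separate files. Whether methods can be addressed as Class.method in this syntax is an assumption to verify against codeflash's own docs; a hedged sketch:
# Class.method path syntax is an assumption; verify before relying on it
$RUNNER -m codeflash compare <base> HEAD --functions "src/module.py::MyClass.process"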
Reading the output
The output includes:
- End-to-End table: Total benchmark time for base vs head, with delta and speedup
- Per-Function Breakdown: Each instrumented function's time in both refs
- Share of Benchmark Time: What percentage of total time each function consumes
Use the per-function breakdown to confirm your optimization targeted the right function and didn't cause regressions in others.
Two-Phase Measurement
The experiment loop uses a two-phase approach:
Phase 1: Quick pre-screen (ad-hoc micro-benchmark)
Before committing, run a quick ad-hoc micro-benchmark (see micro-benchmark.md) to validate the optimization is worth a full benchmark. This is fast (<10s) and catches obvious regressions or no-ops early.
Purpose: Gate for investing in a full codeflash compare run. If the micro-benchmark shows no improvement, discard immediately without the overhead of worktree creation.
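A minimal pre-screen sketch in the same $RUNNER style, assuming a hypothetical zero-argument target func1 in src/module.py (see micro-benchmark.md for the full recipe):
# Run on the working tree before and after the change; compare the minimums
$RUNNER -c "import timeit; from src.module import func1; print(min(timeit.repeat(lambda: func1(), number=1000, repeat=5)))"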
Phase 2: Authoritative measurement (codeflash compare)
After committing a KEEP, run codeflash compare for the official numbers that go into results.tsv and determine the final keep/discard verdict.
Purpose: Produce trustworthy, isolated, reproducible measurements. These are the numbers you report to the user and record in session state.
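A sketch of recording the verdict, assuming a simple tab-separated layout of experiment id, speedup, and verdict; match the columns your results.tsv actually uses:
# Hypothetical row: experiment id, measured speedup, verdict
printf "exp-03\t1.18x\tkeep\n" >> .codeflash/results.tsv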
If codeflash compare contradicts the micro-benchmark (e.g., micro showed 15% but e2e shows 2%), trust codeflash compare — the micro-benchmark may have missed overhead from setup, imports, or interaction with other code paths.
Fallback: When codeflash compare Is Not Available
If the project doesn't have codeflash installed or benchmarks-root configured:
- Use ad-hoc micro-benchmarks as the primary measurement (see micro-benchmark.md)
- Use pytest --durations for test-suite wall-clock time as a secondary signal
- Use cProfile cumtime comparisons for project-function-level attribution (sketches of both secondary signals follow this list)
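Both secondary signals use standard pytest/cProfile flags; the script name is a placeholder:
# Secondary signal 1: slowest tests by wall-clock
$RUNNER -m pytest --durations=20 -q
# Secondary signal 2: cumulative-time attribution (bench_script.py is a placeholder entry point)
$RUNNER -m cProfile -s cumtime bench_script.py | head -30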
These are less rigorous but still useful. Note in .codeflash/setup.md:
## E2E Benchmarks
codeflash compare: not available (reason: <no benchmarks-root | codeflash not installed>)
fallback: ad-hoc micro-benchmarks + pytest durations
Known Limitations
- Only top-level functions are auto-detected and instrumented. Class methods are excluded because @codeflash_trace pickles self on every call, which is catastrophic when self holds large objects (e.g., CST trees). Use --functions to explicitly target methods when needed.
- Requires committed code. codeflash compare works on git refs, so changes must be committed before they can be benchmarked. This is why it's a Phase 2 step (after commit), not Phase 1.
- Benchmark files must exist in benchmarks-root. If the project has no benchmarks yet, this tool can't help; fall back to ad-hoc measurement.