
# End-to-End Benchmarks — Python

Python-specific E2E benchmark tooling. Read ${CLAUDE_PLUGIN_ROOT}/references/shared/e2e-benchmarks.md first for the language-agnostic framework.

## Detection: `codeflash compare`

Check at session start:

```bash
# Is codeflash installed?
$RUNNER -c "import codeflash" 2>/dev/null && echo "codeflash available" || echo "not available"

# Is benchmarks-root configured?
grep -A5 '\[tool\.codeflash\]' pyproject.toml | grep benchmarks.root
```
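
If the grep is inconclusive, the same check can be done programmatically. This is a sketch only: it assumes Python 3.11+ (for `tomllib`) and that the key is spelled `benchmarks-root` under `[tool.codeflash]`, which is what the grep above looks for.

```python
# Sketch: verify benchmarks-root from pyproject.toml (assumes Python 3.11+ for tomllib).
import tomllib
from pathlib import Path

with open("pyproject.toml", "rb") as f:
    config = tomllib.load(f)

# Key name assumed to match the grep pattern above.
benchmarks_root = config.get("tool", {}).get("codeflash", {}).get("benchmarks-root")
if benchmarks_root and Path(benchmarks_root).is_dir():
    print(f"benchmarks-root: {benchmarks_root}")
else:
    print("not available: no benchmarks-root configured")
```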

If both checks pass, codeflash compare is available. Record in .codeflash/setup.md:

```markdown
## E2E Benchmarks
codeflash compare: available
benchmarks-root: <path>
```

If either check fails, record:

```markdown
## E2E Benchmarks
codeflash compare: not available (reason: <no benchmarks-root | codeflash not installed>)
fallback: ad-hoc micro-benchmarks + pytest durations
```

## How `codeflash compare` Works

`codeflash compare <base_ref> <head_ref>`:

1. Auto-detects changed functions from the git diff (line-level overlap, not just file-level)
2. Creates isolated git worktrees for each ref
3. Instruments target functions with `@codeflash_trace`
4. Runs benchmarks via `trace_benchmarks_pytest` (see the example benchmark file sketched below)
5. Produces per-function nanosecond timings and a side-by-side comparison table
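
The layout of benchmark files under benchmarks-root depends on the project. The sketch below assumes they are ordinary pytest tests that exercise a target function with a realistic workload; `benchmarks/test_pipeline_bench.py`, `my_package.pipeline`, and `process_records` are hypothetical names, not part of codeflash.

```python
# benchmarks/test_pipeline_bench.py (hypothetical file under benchmarks-root).
# Assumption: benchmark files are plain pytest tests that exercise target
# functions with realistic workloads so the instrumentation has something to time.
from my_package.pipeline import process_records  # hypothetical target function


def test_process_records_end_to_end():
    # Use a realistic input size so timings reflect real behavior.
    records = [{"id": i, "value": i * 0.5} for i in range(10_000)]
    result = process_records(records)
    # Assert correctness so the benchmark doubles as a regression test.
    assert len(result) == len(records)
```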

## Usage

### After every KEEP commit

```bash
# Compare the commit before your optimization with HEAD
$RUNNER -m codeflash compare <pre-optimization-sha> HEAD --timeout 120

# Or to measure cumulative improvement since the session baseline
$RUNNER -m codeflash compare <baseline-sha> HEAD --timeout 120
```

### Explicit function targeting

When auto-detection misses functions (e.g., methods inside classes are excluded by default), use `--functions`:

```bash
$RUNNER -m codeflash compare <base> HEAD --functions "src/module.py::func1,func2;src/other.py::func3"
```

### At milestones

```bash
# Cumulative e2e measurement
$RUNNER -m codeflash compare <baseline-sha> HEAD --timeout 120
```

## Fallback: When `codeflash compare` Is Not Available

1. Use ad-hoc micro-benchmarks as the primary measurement (see `micro-benchmark.md`; a minimal sketch follows this list)
2. Use `pytest --durations` for test-suite wall-clock time as a secondary signal
3. Use `cProfile` cumtime comparisons to attribute time to the project's own functions
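
A minimal ad-hoc micro-benchmark sketch using only the standard library. `my_package.pipeline` and `process_records` are hypothetical placeholders for the actual optimization target; adapt the workload to something representative.

```python
# Sketch: ad-hoc micro-benchmark with the standard library (no codeflash required).
import statistics
import time

from my_package.pipeline import process_records  # hypothetical optimization target


def bench(func, *args, repeats=10):
    """Return per-call wall-clock timings in milliseconds over several repeats."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append((time.perf_counter() - start) * 1e3)
    return timings


records = [{"id": i, "value": i * 0.5} for i in range(10_000)]
timings = bench(process_records, records)
print(f"median {statistics.median(timings):.3f} ms, min {min(timings):.3f} ms")
```

Run the same script on the pre- and post-optimization code and compare the medians; report both numbers rather than a single run.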

## Known Limitations

- **Only top-level functions are auto-detected and instrumented.** Class methods are excluded because `@codeflash_trace` pickles `self` on every call, which is catastrophic when `self` holds large objects. Use `--functions` to explicitly target methods when needed.
- **Requires committed code:** comparisons run on git refs, so changes must be committed before benchmarking.
- **Benchmark files must exist in benchmarks-root.** If the project has no benchmarks yet, fall back to ad-hoc measurement.