
# End-to-End Benchmarks — Python

Python-specific E2E benchmark tooling. Read ${CLAUDE_PLUGIN_ROOT}/references/shared/e2e-benchmarks.md first for the language-agnostic framework.

## Detection: `codeflash compare`

Check at session start:

```bash
# Is codeflash installed?
$RUNNER -c "import codeflash" 2>/dev/null && echo "codeflash available" || echo "not available"

# Is benchmarks-root configured?
grep -A5 '\[tool\.codeflash\]' pyproject.toml | grep benchmarks.root
```
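
If the grep is inconclusive, the same check can be done programmatically. This is a sketch only: it assumes Python 3.11+ (for `tomllib`) and that the key is spelled `benchmarks-root` under `[tool.codeflash]`, which is what the grep above looks for.

```python
# Sketch: verify benchmarks-root from pyproject.toml (assumes Python 3.11+ for tomllib).
import tomllib
from pathlib import Path

with open("pyproject.toml", "rb") as f:
    config = tomllib.load(f)

# Key name assumed to match the grep pattern above.
benchmarks_root = config.get("tool", {}).get("codeflash", {}).get("benchmarks-root")
if benchmarks_root and Path(benchmarks_root).is_dir():
    print(f"benchmarks-root: {benchmarks_root}")
else:
    print("not available: no benchmarks-root configured")
```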

If both checks pass, codeflash compare is available. Record in .codeflash/setup.md:

```markdown
## E2E Benchmarks
codeflash compare: available
benchmarks-root: <path>
```

If either check fails, record:

```markdown
## E2E Benchmarks
codeflash compare: not available (reason: <no benchmarks-root | codeflash not installed>)
fallback: ad-hoc micro-benchmarks + pytest durations
```

## How `codeflash compare` Works

`codeflash compare <base_ref> <head_ref>`:

1. Auto-detects changed functions from the git diff (line-level overlap, not just file-level)
2. Creates isolated git worktrees for each ref
3. Instruments target functions with `@codeflash_trace`
4. Runs benchmarks via `trace_benchmarks_pytest` (see the example benchmark file sketched below)
5. Produces per-function nanosecond timings and a side-by-side comparison table
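
The layout of benchmark files under benchmarks-root depends on the project. The sketch below assumes they are ordinary pytest tests that exercise a target function with a realistic workload; `benchmarks/test_pipeline_bench.py`, `my_package.pipeline`, and `process_records` are hypothetical names, not part of codeflash.

```python
# benchmarks/test_pipeline_bench.py (hypothetical file under benchmarks-root).
# Assumption: benchmark files are plain pytest tests that exercise target
# functions with realistic workloads so the instrumentation has something to time.
from my_package.pipeline import process_records  # hypothetical target function


def test_process_records_end_to_end():
    # Use a realistic input size so timings reflect real behavior.
    records = [{"id": i, "value": i * 0.5} for i in range(10_000)]
    result = process_records(records)
    # Assert correctness so the benchmark doubles as a regression test.
    assert len(result) == len(records)
```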

## Usage

### After every KEEP commit

```bash
# Compare the commit before your optimization with HEAD
$RUNNER -m codeflash compare <pre-optimization-sha> HEAD --timeout 120

# Or to measure cumulative improvement since the session baseline
$RUNNER -m codeflash compare <baseline-sha> HEAD --timeout 120
```

### Explicit function targeting

When auto-detection misses functions (e.g., methods inside classes are excluded by default), use `--functions`:

```bash
$RUNNER -m codeflash compare <base> HEAD --functions "src/module.py::func1,func2;src/other.py::func3"
```

### At milestones

```bash
# Cumulative e2e measurement
$RUNNER -m codeflash compare <baseline-sha> HEAD --timeout 120
```

## Fallback: When `codeflash compare` Is Not Available

1. Use ad-hoc micro-benchmarks as the primary measurement (see `micro-benchmark.md`; a minimal sketch follows this list)
2. Use `pytest --durations` for test-suite wall-clock time as a secondary signal
3. Use `cProfile` cumtime comparisons to attribute time to the project's own functions
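
A minimal ad-hoc micro-benchmark sketch using only the standard library. `my_package.pipeline` and `process_records` are hypothetical placeholders for the actual optimization target; adapt the workload to something representative.

```python
# Sketch: ad-hoc micro-benchmark with the standard library (no codeflash required).
import statistics
import time

from my_package.pipeline import process_records  # hypothetical optimization target


def bench(func, *args, repeats=10):
    """Return per-call wall-clock timings in milliseconds over several repeats."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append((time.perf_counter() - start) * 1e3)
    return timings


records = [{"id": i, "value": i * 0.5} for i in range(10_000)]
timings = bench(process_records, records)
print(f"median {statistics.median(timings):.3f} ms, min {min(timings):.3f} ms")
```

Run the same script on the pre- and post-optimization code and compare the medians; report both numbers rather than a single run.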

## Known Limitations

- **Only top-level functions are auto-detected and instrumented.** Class methods are excluded because `@codeflash_trace` pickles `self` on every call, which is catastrophic when `self` holds large objects. Use `--functions` to explicitly target methods when needed.
- **Requires committed code:** comparisons run on git refs, so changes must be committed before benchmarking.
- **Benchmark files must exist in benchmarks-root.** If the project has no benchmarks yet, fall back to ad-hoc measurement.