# End-to-End Benchmarks with `codeflash compare`

When the project has `codeflash` installed and `benchmarks-root` configured in `pyproject.toml`, use `codeflash compare` as the **authoritative** before/after measurement for every optimization. It provides worktree-isolated, instrumented benchmarks that are reproducible and free from working-tree contamination.

## Detection

Check at session start:

```bash
# Is codeflash installed?
$RUNNER -c "import codeflash" 2>/dev/null && echo "codeflash available" || echo "not available"

# Is benchmarks-root configured?
grep -A5 '\[tool\.codeflash\]' pyproject.toml | grep benchmarks.root
```

If both checks pass, `codeflash compare` is available. Record this in `.codeflash/setup.md`:

```
## E2E Benchmarks
codeflash compare: available
benchmarks-root: <path>
```

If either check fails, fall back to ad-hoc micro-benchmarks (see `micro-benchmark.md`).
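
The two checks above can be folded into one step that also writes the `.codeflash/setup.md` record. This is a sketch, assuming `$RUNNER` points at the project's Python launcher (falling back to `python3` when unset); the recorded lines match the format above, minus the `benchmarks-root` path:

```shell
# Sketch: run both detection checks and record the verdict in setup.md.
RUNNER="${RUNNER:-python3}"
mkdir -p .codeflash
if "$RUNNER" -c "import codeflash" 2>/dev/null \
   && grep -A5 '\[tool\.codeflash\]' pyproject.toml 2>/dev/null | grep -q 'benchmarks.root'; then
  status="available"
else
  status="not available"
fi
printf '## E2E Benchmarks\ncodeflash compare: %s\n' "$status" >> .codeflash/setup.md
cat .codeflash/setup.md
```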

## How It Works

`codeflash compare <base_ref> <head_ref>`:

1. Auto-detects changed functions from `git diff` (line-level overlap, not just file-level)
2. Creates **isolated git worktrees** for each ref — no working-tree contamination
3. Instruments target functions with `@codeflash_trace`
4. Runs benchmarks via `trace_benchmarks_pytest`
5. Produces per-function nanosecond timings and a side-by-side comparison table
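
Step 2 is the key isolation mechanism. Conceptually (this is a sketch of the idea in plain git, not codeflash's actual implementation), each ref gets its own pristine checkout:

```shell
# Illustrative only: worktree isolation in plain git. A throwaway repo with
# two commits stands in for the real project and its base/head refs.
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m "base"
base=$(git rev-parse HEAD)
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m "head"
git worktree add -q wt-base "$base"   # pristine checkout of the base ref
git worktree add -q wt-head HEAD      # pristine checkout of the head ref
ls -d wt-base wt-head                 # benchmarks run inside each, never in the main tree
```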

This is strictly better than ad-hoc `time.perf_counter` scripts because:

- **Isolation**: Each ref runs in its own worktree — no stale `.pyc` files, no uncommitted changes
- **Instrumentation**: `@codeflash_trace` captures per-function timing, not just wall-clock time
- **Reproducibility**: The same command produces the same measurement on any machine
- **Structured output**: Per-function breakdown with speedup ratios, not just total time

## Usage in the Experiment Loop

### After every KEEP commit

Once you commit an optimization, run:

```bash
# Compare the commit before your optimization with HEAD
$RUNNER -m codeflash compare <pre-optimization-sha> HEAD --timeout 120
```

The `<pre-optimization-sha>` is the commit just before your optimization. If you're on experiment N and your last KEEP was commit `abc1234`:

```bash
$RUNNER -m codeflash compare abc1234^ abc1234 --timeout 120
```

Or, to measure cumulative improvement since the session baseline:

```bash
$RUNNER -m codeflash compare <baseline-sha> HEAD --timeout 120
```

Record the baseline SHA in `.codeflash/HANDOFF.md` at session start for easy reference.
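
A minimal way to capture that baseline (a sketch: the `baseline-sha:` line format in `HANDOFF.md` is an assumption, not something `codeflash` prescribes):

```shell
# Sketch: record the session baseline SHA once, at session start. Falls back
# to the literal placeholder when run outside a git repository.
mkdir -p .codeflash
baseline=$(git rev-parse HEAD 2>/dev/null || echo '<baseline-sha>')
echo "baseline-sha: $baseline" >> .codeflash/HANDOFF.md
grep 'baseline-sha:' .codeflash/HANDOFF.md
```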

### Explicit function targeting

When auto-detection misses functions (e.g., methods inside classes are excluded by default), use `--functions`:

```bash
$RUNNER -m codeflash compare <base> HEAD --functions "src/module.py::func1,func2;src/other.py::func3"
```

### Reading the output

The output includes:

1. **End-to-End table**: Total benchmark time for base vs head, with delta and speedup
2. **Per-Function Breakdown**: Each instrumented function's time in both refs
3. **Share of Benchmark Time**: What percentage of total time each function consumes

Use the per-function breakdown to confirm your optimization targeted the right function and didn't cause regressions in others.

## Two-Phase Measurement

The experiment loop uses a **two-phase** approach:

### Phase 1: Quick pre-screen (ad-hoc micro-benchmark)

Before committing, run a quick ad-hoc micro-benchmark (see `micro-benchmark.md`) to validate that the optimization is worth a full benchmark. This is fast (<10s) and catches obvious regressions or no-ops early.

**Purpose**: Gate for investing in a full `codeflash compare` run. If the micro-benchmark shows no improvement, discard immediately without the overhead of worktree creation.
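
A Phase 1 pre-screen can be as small as this sketch. `old_path` and `new_path` are hypothetical stand-ins for the before/after call paths; in practice, import the real function as described in `micro-benchmark.md`:

```shell
# Hypothetical pre-screen: best-of-N timing with time.perf_counter.
python3 - <<'EOF' | tee /tmp/prescreen.txt
import time

def bench(fn, reps=5, inner=20_000):
    # Best-of-reps damps scheduler noise in a quick check.
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        for _ in range(inner):
            fn()
        best = min(best, time.perf_counter() - t0)
    return best

def old_path():            # stand-in for the pre-optimization code
    return sum(i * i for i in range(200))

def new_path():            # stand-in for the optimized code
    return sum(i * i for i in range(200))

ratio = bench(old_path) / bench(new_path)
print(f"micro speedup: {ratio:.2f}x")  # gate: proceed to Phase 2 only if clearly > 1
EOF
```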

### Phase 2: Authoritative measurement (`codeflash compare`)

After committing a KEEP, run `codeflash compare` for the official numbers that go into `results.tsv` and determine the final keep/discard verdict.

**Purpose**: Produce trustworthy, isolated, reproducible measurements. These are the numbers you report to the user and record in session state.

If `codeflash compare` contradicts the micro-benchmark (e.g., micro showed 15% but e2e shows 2%), **trust `codeflash compare`** — the micro-benchmark may have missed overhead from setup, imports, or interaction with other code paths.
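
Recording the Phase 2 verdict can be as simple as appending one row per experiment (a sketch: the column layout of experiment id, SHA, speedup, and verdict is an assumption, not a schema `codeflash` defines; keep whatever columns the session already uses):

```shell
# Sketch: append one tab-separated row per experiment; values are illustrative.
printf '%s\t%s\t%s\t%s\n' "exp-03" "abc1234" "1.18x" "KEEP" >> results.tsv
cat results.tsv
```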

## Fallback: When `codeflash compare` Is Not Available

If the project doesn't have `codeflash` installed or `benchmarks-root` configured:

1. Use ad-hoc micro-benchmarks as the primary measurement (see `micro-benchmark.md`)
2. Use `pytest --durations` for test-suite wall-clock time as a secondary signal
3. Use `cProfile` cumtime comparisons for project-function-level attribution
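
Fallback 3 can look like this sketch, where `/tmp/driver.py` and `hot` are hypothetical stand-ins for a script that exercises the changed code path:

```shell
# Hypothetical cProfile fallback: profile a driver script, rank by cumulative
# time, and compare the numbers from before/after runs of the optimization.
cat > /tmp/driver.py <<'EOF'
def hot():                      # stand-in for the project function under study
    return sum(i * i for i in range(50_000))

for _ in range(20):
    hot()
EOF
python3 -m cProfile -o /tmp/profile.out /tmp/driver.py
# Top 5 entries by cumtime; project functions should dominate the list.
python3 -c "import pstats; pstats.Stats('/tmp/profile.out').sort_stats('cumulative').print_stats(5)"
```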

These are less rigorous but still useful. Note in `.codeflash/setup.md`:

```
## E2E Benchmarks
codeflash compare: not available (reason: <no benchmarks-root | codeflash not installed>)
fallback: ad-hoc micro-benchmarks + pytest durations
```

## Known Limitations

- **Only top-level functions** are auto-detected and instrumented. Class methods are excluded because `@codeflash_trace` pickles `self` on every call, which is catastrophic when `self` holds large objects (e.g., CST trees). Use `--functions` to explicitly target methods when needed.
- **Requires committed code**. `codeflash compare` works on git refs, so changes must be committed before they can be benchmarked. This is why it's a Phase 2 step (after commit), not Phase 1.
- **Benchmark files must exist** in `benchmarks-root`. If the project has no benchmarks yet, this tool can't help — fall back to ad-hoc measurement.