mirror of https://github.com/codeflash-ai/codeflash-agent.git (synced 2026-05-04 18:25:19 +00:00)
# End-to-End Benchmarks
When the project has an instrumented benchmark tool available, use it as the **authoritative** before/after measurement for every optimization. E2E benchmarks provide worktree-isolated, instrumented measurements that are reproducible and free from working-tree contamination.
## Detection
Check at session start whether the project's benchmark tool is available. Record the result in `.codeflash/setup.md` under `## E2E Benchmarks`. See your language's `e2e-benchmarks.md` for the specific detection steps.
## How It Works
An E2E benchmark tool should:
1. Auto-detect changed functions from `git diff`
2. Create **isolated git worktrees** for each ref — no working-tree contamination
3. Instrument target functions for per-function timing
4. Run benchmarks in isolation
5. Produce per-function timings and a side-by-side comparison table
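The worktree-isolation and comparison steps can be sketched in Python. This is a minimal illustration, not the real tool: the `bench.py` entry point, function names, and the best-of-one timing are all assumptions, and a real tool would instrument individual functions rather than time the whole command.

```python
import subprocess
import tempfile
import time
from pathlib import Path

def run_in_worktree(ref: str, bench_cmd: list[str]) -> float:
    """Check out `ref` into a throwaway worktree and time `bench_cmd` there.

    The worktree is created fresh and removed afterwards, so the measurement
    never sees uncommitted changes or stale artifacts from the main checkout.
    """
    with tempfile.TemporaryDirectory() as tmp:
        wt = Path(tmp) / "wt"
        subprocess.run(["git", "worktree", "add", "--detach", str(wt), ref], check=True)
        try:
            start = time.perf_counter()
            subprocess.run(bench_cmd, cwd=wt, check=True)
            return time.perf_counter() - start
        finally:
            subprocess.run(["git", "worktree", "remove", "--force", str(wt)], check=True)

def speedup(base_s: float, head_s: float) -> float:
    """Ratio above 1.0 means head is faster than base."""
    return base_s / head_s

# Typical use (requires a real repo with a committed benchmark entry point):
#   base = run_in_worktree("HEAD~1", ["python", "bench.py"])
#   head = run_in_worktree("HEAD", ["python", "bench.py"])
#   print(f"{speedup(base, head):.2f}x")
```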
This is strictly better than ad-hoc timing scripts because:
- **Isolation**: Each ref runs in its own worktree — no stale build artifacts or uncommitted changes
- **Instrumentation**: Per-function timing, not just wall-clock
- **Reproducibility**: Same command produces same measurement on any machine
- **Structured output**: Per-function breakdown with speedup ratios, not just total time
## Two-Phase Measurement
The experiment loop uses a **two-phase** approach:
### Phase 1: Quick pre-screen (ad-hoc micro-benchmark)
Before committing, run a quick ad-hoc micro-benchmark (see `micro-benchmark.md`) to validate that the optimization is worth a full E2E run. This is fast (typically under 10 seconds) and catches obvious regressions or no-ops early.
**Purpose**: Gate for investing in a full E2E run. If the micro-benchmark shows no improvement, discard immediately without the overhead of worktree creation.
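Such a gate can be sketched with the standard library's `timeit`. The helper name, the 5% threshold, and the toy functions below are assumptions for illustration, not part of any real tool:

```python
import timeit

def prescreen(old_fn, new_fn, *args, reps: int = 5, number: int = 200) -> bool:
    """Quick gate: invest in a full E2E run only if the candidate beats the
    baseline by more than 5%. Taking the best of several repeats damps
    scheduler and warm-up noise."""
    old_t = min(timeit.repeat(lambda: old_fn(*args), repeat=reps, number=number))
    new_t = min(timeit.repeat(lambda: new_fn(*args), repeat=reps, number=number))
    return new_t < old_t * 0.95

# Toy example: a hand-rolled loop vs the C-implemented builtin.
def slow_sum(xs):
    total = 0
    for x in xs:
        total += x
    return total

data = list(range(1_000))
worth_full_run = prescreen(slow_sum, sum, data)
```

The `min` (rather than mean) is deliberate: the fastest observed run is the closest estimate of the code's intrinsic cost, since noise only ever adds time.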
### Phase 2: Authoritative measurement (E2E benchmark)
After committing a KEEP, run the E2E benchmark tool for the official numbers that go into `results.tsv` and determine the final keep/discard verdict.
**Purpose**: Produce trustworthy, isolated, reproducible measurements. These are the numbers you report to the user and record in session state.
If the E2E benchmark contradicts the micro-benchmark (e.g., micro showed 15% but E2E shows 2%), **trust the E2E measurement** — the micro-benchmark may have missed overhead from setup, imports, or interaction with other code paths.
## Usage in the Experiment Loop
### After every KEEP commit
Compare the commit before your optimization with HEAD. Record the baseline commit SHA in `.codeflash/HANDOFF.md` at session start for easy reference. See your language's `e2e-benchmarks.md` for the specific commands.
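Pinning the baseline can be as simple as a `git rev-parse` wrapper; the helper name and the throwaway demo repo below are illustrative, assuming only that `git` is on the PATH:

```python
import subprocess
import tempfile

def git_sha(ref: str = "HEAD", cwd: str = ".") -> str:
    """Resolve a ref to its full commit SHA, e.g. to record the baseline
    commit in `.codeflash/HANDOFF.md` at session start."""
    out = subprocess.run(["git", "rev-parse", ref], cwd=cwd,
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

# Demo against a throwaway repo so the snippet is self-contained.
with tempfile.TemporaryDirectory() as repo:
    subprocess.run(["git", "init", "-q"], cwd=repo, check=True)
    subprocess.run(["git", "-c", "user.email=demo@example.com", "-c", "user.name=demo",
                    "commit", "-q", "--allow-empty", "-m", "baseline"], cwd=repo, check=True)
    baseline_sha = git_sha("HEAD", cwd=repo)
```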
### Reading the output
The output typically includes:
1. **End-to-End table**: Total benchmark time for base vs head, with delta and speedup
2. **Per-Function Breakdown**: Each instrumented function's time in both refs
3. **Share of Benchmark Time**: What percentage of total time each function consumes
Use the per-function breakdown to confirm your optimization targeted the right function and didn't cause regressions in others.
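As a worked illustration, all three views can be derived from per-function timings. The function names and numbers below are made up:

```python
# Hypothetical instrumented timings (seconds) for each ref.
base = {"parse": 1.20, "transform": 0.60, "serialize": 0.20}
head = {"parse": 0.40, "transform": 0.62, "serialize": 0.20}

total_base = sum(base.values())  # end-to-end time at the base ref
total_head = sum(head.values())  # end-to-end time at the head ref
print(f"End-to-End: {total_base:.2f}s -> {total_head:.2f}s "
      f"({total_base / total_head:.2f}x)")

for fn in base:
    share = 100 * base[fn] / total_base  # Share of Benchmark Time
    ratio = base[fn] / head[fn]          # Per-Function speedup
    print(f"  {fn:<10} {base[fn]:.2f}s -> {head[fn]:.2f}s  "
          f"{ratio:.2f}x  ({share:.0f}% of base)")
```

Note how `transform` regressed slightly (0.60s to 0.62s) even though the overall run got faster; that is exactly the kind of side effect the per-function breakdown exposes and a total-time-only measurement would hide.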
## Fallback: When E2E Benchmarks Are Not Available
If the project doesn't have an instrumented benchmark tool:
1. Use ad-hoc micro-benchmarks as the primary measurement (see `micro-benchmark.md`)
2. Use the language's test runner timing as a secondary signal
3. Use the language's profiler for per-function attribution
These are less rigorous but still useful. Note the fallback in `.codeflash/setup.md`.
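For per-function attribution without any project tooling, Python's built-in profiler is one option (in other languages, substitute the native profiler). Here `hot_path` is a stand-in for the function under optimization:

```python
import cProfile
import io
import pstats

def hot_path(n: int) -> int:
    # Stand-in for the function being optimized.
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
hot_path(100_000)
profiler.disable()

# Render the top entries sorted by cumulative time.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
report = buf.getvalue()
print(report)
```

Unlike the E2E tool's worktree-isolated runs, this profiles the working tree as-is, so treat the numbers as attribution hints rather than authoritative before/after measurements.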
## Known Limitations
- **Requires committed code**: E2E tools work on git refs, so changes must be committed before they can be benchmarked. This is why it's a Phase 2 step (after commit), not Phase 1.
- **Benchmark files must exist**: If the project has no benchmarks yet, this tool can't help — fall back to ad-hoc measurement.