# End-to-End Benchmarks
When the project has an instrumented benchmark tool available, use it as the authoritative before/after measurement for every optimization. E2E benchmarks run each ref in an isolated git worktree, producing instrumented, reproducible measurements that are free from working-tree contamination.
## Detection
Check at session start whether the project's benchmark tool is available. Record the result in .codeflash/setup.md under `## E2E Benchmarks`. See your language's e2e-benchmarks.md for the specific detection steps.
## How It Works
An E2E benchmark tool should:
- Auto-detect changed functions from `git diff`
- Create isolated git worktrees for each ref, so there is no working-tree contamination
- Instrument target functions for per-function timing
- Run benchmarks in isolation
- Produce per-function timings and a side-by-side comparison table
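As a rough illustration of that flow, here is a minimal Python sketch. It assumes a hypothetical `bench.py` script that prints per-function timings as JSON; a real tool would also instrument the target functions and parse the diff down to individual definitions:

```python
import json
import subprocess
import tempfile
from pathlib import Path

def changed_files(base: str, head: str) -> list[str]:
    """Crude change detection: a real tool would parse the diff down
    to the individual function definitions that changed."""
    out = subprocess.run(["git", "diff", "--name-only", base, head],
                         capture_output=True, text=True, check=True)
    return out.stdout.splitlines()

def benchmark_ref(ref: str) -> dict[str, float]:
    """Check `ref` out into a throwaway worktree and benchmark there,
    so uncommitted edits and stale build artifacts cannot leak in."""
    with tempfile.TemporaryDirectory() as tmp:
        wt = str(Path(tmp) / "wt")
        subprocess.run(["git", "worktree", "add", "--detach", wt, ref], check=True)
        try:
            # Assumption: bench.py prints {"function_name": seconds} as JSON.
            out = subprocess.run(["python", "bench.py"], cwd=wt,
                                 capture_output=True, text=True, check=True)
            return json.loads(out.stdout)
        finally:
            subprocess.run(["git", "worktree", "remove", "--force", wt], check=True)

print("changed:", changed_files("main", "HEAD"))
base_times, head_times = benchmark_ref("main"), benchmark_ref("HEAD")
for fn in sorted(base_times):
    speedup = base_times[fn] / head_times[fn]
    print(f"{fn}: {base_times[fn]:.4f}s -> {head_times[fn]:.4f}s ({speedup:.2f}x)")
```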
This is strictly better than ad-hoc timing scripts because:
- Isolation: Each ref runs in its own worktree — no stale build artifacts or uncommitted changes
- Instrumentation: Per-function timing, not just wall-clock
- Reproducibility: Same command produces same measurement on any machine
- Structured output: Per-function breakdown with speedup ratios, not just total time
## Two-Phase Measurement
The experiment loop uses a two-phase approach:
### Phase 1: Quick pre-screen (ad-hoc micro-benchmark)
Before committing, run a quick ad-hoc micro-benchmark (see micro-benchmark.md) to validate the optimization is worth a full benchmark. This is fast (<10s) and catches obvious regressions or no-ops early.
Purpose: Gate for investing in a full E2E run. If the micro-benchmark shows no improvement, discard immediately without the overhead of worktree creation.
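For instance, a pre-screen might be no more than a `timeit` comparison like this sketch, where the module paths and the 1.05x gate are assumptions, not project conventions:

```python
import timeit

# Hypothetical imports: the committed implementation vs. the candidate.
from mypkg.parser import parse_row as baseline
from mypkg.parser_fast import parse_row as candidate

row = "2024-01-15,widget,19.99,3"  # representative input

# min() of several repeats is the least noisy point estimate.
base = min(timeit.repeat(lambda: baseline(row), number=20_000, repeat=5))
cand = min(timeit.repeat(lambda: candidate(row), number=20_000, repeat=5))
speedup = base / cand
print(f"baseline {base:.3f}s  candidate {cand:.3f}s  speedup {speedup:.2f}x")

if speedup < 1.05:  # assumed gate; tune per project
    print("No clear win: discard now, skip the E2E run.")
```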
### Phase 2: Authoritative measurement (E2E benchmark)
After committing a KEEP, run the E2E benchmark tool for the official numbers that go into results.tsv and determine the final keep/discard verdict.
Purpose: Produce trustworthy, isolated, reproducible measurements. These are the numbers you report to the user and record in session state.
If the E2E benchmark contradicts the micro-benchmark (e.g., micro showed 15% but E2E shows 2%), trust the E2E measurement — the micro-benchmark may have missed overhead from setup, imports, or interaction with other code paths.
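A minimal sketch of that precedence rule, with an assumed 1.05x keep threshold:

```python
def final_verdict(micro_speedup: float, e2e_speedup: float,
                  keep_threshold: float = 1.05) -> str:
    """The E2E number wins whenever the two measurements disagree;
    the micro-benchmark only gated whether the E2E run happened at all."""
    return "KEEP" if e2e_speedup >= keep_threshold else "DISCARD"

# Micro showed 1.15x but E2E shows only 1.02x -> trust E2E, discard.
print(final_verdict(micro_speedup=1.15, e2e_speedup=1.02))  # DISCARD
```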
## Usage in the Experiment Loop
### After every KEEP commit
Record the baseline commit SHA in .codeflash/HANDOFF.md at session start for easy reference; after each KEEP, compare that baseline commit with HEAD. See your language's e2e-benchmarks.md for the specific commands.
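A sketch of that bookkeeping in Python, with a placeholder `e2e-bench` command standing in for your language's actual tool:

```python
import subprocess
from pathlib import Path

HANDOFF = Path(".codeflash/HANDOFF.md")

def record_baseline() -> str:
    """At session start, pin the pre-optimization commit so every
    later E2E run compares against the same ref."""
    sha = subprocess.run(["git", "rev-parse", "HEAD"],
                         capture_output=True, text=True, check=True).stdout.strip()
    HANDOFF.parent.mkdir(exist_ok=True)
    with HANDOFF.open("a") as f:
        f.write(f"baseline-sha: {sha}\n")
    return sha

baseline = record_baseline()
# ... experiment loop runs, a KEEP gets committed ...
# Placeholder invocation: substitute the project's real benchmark command.
subprocess.run(["e2e-bench", "--base", baseline, "--head", "HEAD"], check=False)
```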
### Reading the output
The output typically includes:
- End-to-End table: Total benchmark time for base vs head, with delta and speedup
- Per-Function Breakdown: Each instrumented function's time in both refs
- Share of Benchmark Time: What percentage of total time each function consumes
Use the per-function breakdown to confirm your optimization targeted the right function and didn't cause regressions in others.
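If you want to double-check the tool's arithmetic, the derived columns are straightforward. The timings below are made-up placeholders purely to show the computation:

```python
# Assumed shape: per-function seconds for each ref, as a tool might report.
base = {"parse_row": 1.80, "load_file": 0.40, "validate": 0.20}
head = {"parse_row": 0.90, "load_file": 0.41, "validate": 0.20}

total_base = sum(base.values())
print(f"{'function':<12}{'base':>8}{'head':>8}{'speedup':>9}{'share':>8}")
for fn in base:
    speedup = base[fn] / head[fn]
    share = base[fn] / total_base  # share of total benchmark time
    flag = "  <-- check: possible regression" if speedup < 1.0 else ""
    print(f"{fn:<12}{base[fn]:>7.2f}s{head[fn]:>7.2f}s{speedup:>8.2f}x{share:>7.0%}{flag}")
```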
## Fallback: When E2E Benchmarks Are Not Available
If the project doesn't have an instrumented benchmark tool:
- Use ad-hoc micro-benchmarks as the primary measurement (see micro-benchmark.md)
- Use the language's test runner timing as a secondary signal
- Use the language's profiler for per-function attribution
These are less rigorous but still useful. Note the fallback in .codeflash/setup.md.
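For the profiler fallback in Python, for example, `cProfile` gives the per-function attribution; the workload import here is a placeholder:

```python
import cProfile
import pstats

# Hypothetical workload: exercise the code path you optimized.
from mypkg.parser import parse_file

with cProfile.Profile() as prof:
    parse_file("data/sample.csv")

# Top 10 functions by cumulative time, i.e. per-function attribution.
stats = pstats.Stats(prof)
stats.sort_stats(pstats.SortKey.CUMULATIVE).print_stats(10)
```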
## Known Limitations
- Requires committed code: E2E tools work on git refs, so changes must be committed before they can be benchmarked. This is why it's a Phase 2 step (after commit), not Phase 1.
- Benchmark files must exist: If the project has no benchmarks yet, this tool can't help — fall back to ad-hoc measurement.
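Given the first limitation, it can be worth guarding the E2E phase against a dirty tree; a minimal Python check:

```python
import subprocess
import sys

# Refuse to start the E2E phase on a dirty tree: the tool benchmarks git
# refs, so anything uncommitted would be silently excluded from the run.
dirty = subprocess.run(["git", "status", "--porcelain"],
                       capture_output=True, text=True, check=True).stdout.strip()
if dirty:
    sys.exit("Uncommitted changes detected; commit them before the E2E run.")
```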