refactor: move Java-specific content out of shared files into language overlay

Review feedback: shared experiment-loop-base.md and pre-submit-review.md
contained Java/Kotlin-specific content that all languages inherit. This
broke step numbering for non-Java agents and polluted cross-language files.

Changes:
- Revert experiment-loop-base.md to language-neutral (18-step original)
- Revert pre-submit-review.md to language-neutral (remove Java section)
- Create plugin/languages/java/references/pre-submit-review.md following
  the same pattern as the existing Python pre-submit-review.md
- Reduce duplication in all 4 domain agents (cpu, memory, async, deep):
  replace inlined benchmark-validity and correctness-verification content
  with concise references, keeping only domain-specific additions
- Add pre-submit-review.md to Deep References in all agents

No content was removed — all JMH validation, correctness verification,
mechanism explanation, milestone sanity check, and JDK compatibility
requirements remain in the Java language overlay. They are now referenced
from the single-source-of-truth files instead of being duplicated inline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author: Mohamed Ashraf 2026-04-16 11:22:38 +00:00
parent ba194f2a02
commit 5b2b94fd71
7 changed files with 101 additions and 106 deletions

@@ -113,32 +113,27 @@ grep -rn "Hashtable\|Vector\|synchronizedMap\|StringBuffer" --include="*.java" -
## JMH Benchmark Requirement
-**JMH is MANDATORY for every KEEP decision.** Concurrency benchmarks are especially vulnerable to JVM measurement artifacts because thread scheduling, JIT compilation, and GC interact non-deterministically. JMH with proper forking isolates these effects.
+**JMH is MANDATORY for every KEEP decision.** Validate all benchmarks per `../references/benchmark-validity.md`. Verify correctness per `../references/correctness-verification.md`.
-**Concurrency-specific JMH settings**:
-- `@Fork(3)` minimum (concurrency behavior varies more between JVM instances than single-threaded code)
+**Concurrency-specific JMH settings** (on top of the standard validity checklist):
+- `@Fork(3)` minimum (concurrency behavior varies more between JVM instances)
- `@Threads(N)` matching the target production concurrency level
-- `@State(Scope.Benchmark)` for shared state, `@State(Scope.Thread)` for thread-local state
+- `@State(Scope.Benchmark)` for shared state, `@State(Scope.Thread)` for thread-local
- Longer warmup (>=5 iterations) to stabilize thread pool ramp-up and JIT under contention
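For illustration, a minimal JMH class applying these settings (the class name, thread count, and map workload are hypothetical, not taken from the plugin):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

// Hypothetical benchmark wiring together the settings above: 3 forks,
// production-level thread count, shared vs thread-local state, and a
// longer warmup so thread ramp-up and JIT stabilize before measurement.
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Fork(3)
@Threads(8) // match the agreed production concurrency level
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@State(Scope.Benchmark) // shared across all benchmark threads
public class ConcurrentMapBenchmark {

    ConcurrentHashMap<Integer, Integer> shared;

    @Setup
    public void setup() {
        shared = new ConcurrentHashMap<>();
        for (int i = 0; i < 10_000; i++) shared.put(i, i);
    }

    @State(Scope.Thread) // per-thread key cursor, not shared between threads
    public static class Keys {
        int next;
    }

    @Benchmark
    public void getPut(Keys keys, Blackhole bh) {
        int k = keys.next++ & 8191;
        bh.consume(shared.put(k, k)); // consume results to defeat DCE
        bh.consume(shared.get(k));
    }
}
```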
-**After writing any JMH benchmark, validate it against `../references/benchmark-validity.md`** — the benchmark validity checklist.
## Experiment Loop
Read `${CLAUDE_PLUGIN_ROOT}/references/shared/experiment-loop-base.md` for the full loop. Concurrency-specific additions:
### Correctness Verification
-Before benchmarking, verify output equivalence per `../references/correctness-verification.md`. Concurrency optimizations are especially prone to behavioral changes:
-- Lock changes can introduce race conditions (section 4 — mutable parameter state)
-- Collection type changes (synchronized to concurrent) may change iteration ordering (section 3)
-- Exception behavior may change under concurrent access (section 5)
+Before benchmarking, verify output equivalence per `../references/correctness-verification.md`. Concurrency optimizations are especially high-risk for sections 3 (collection ordering — synchronized to concurrent changes iteration order), 4 (mutable state — lock changes can introduce races), and 5 (exceptions — behavior may change under concurrent access).
-Additionally, run tests under load to detect race conditions. Use Thread Sanitizer or stress-test with multiple threads.
+Additionally, run tests under load to detect race conditions.
### After each fix
-Run JMH at agreed thread count with `@Fork(3)`. Validate benchmark per `../references/benchmark-validity.md`. Also verify: run tests under load to detect races.
+Run JMH at agreed thread count with `@Fork(3)`. Validate per `../references/benchmark-validity.md`. Also run tests under load to detect races.
### Keep/Discard
@@ -204,6 +199,7 @@ commit target_test baseline_throughput optimized_throughput throughput_change ba
For code examples, virtual thread migration guide, JMH concurrency templates, and lock patterns:
- **`../references/benchmark-validity.md`** -- **MANDATORY** JMH benchmark validity checklist (anti-DCE, anti-constant-folding, warmup, forks, error bars)
- **`../references/correctness-verification.md`** -- **MANDATORY** Java output equivalence rules (deep equality, floats, collections, mutability, exceptions)
+- **`../references/pre-submit-review.md`** -- Java/Kotlin pre-submit checklist (JMH audit, mechanism explanation, JDK compat, serialization safety)
- **`../references/async/guide.md`** -- Lock hierarchies, virtual threads, CompletableFuture, structured concurrency, thread pool sizing
- **`../references/data-structures/guide.md`** -- Concurrent collection selection
- **`../../shared/e2e-benchmarks.md`** -- Two-phase measurement with `codeflash compare`

@@ -133,31 +133,17 @@ Run JFR or async-profiler. Print `[ranked targets]` with time percentages. Save
### JMH Benchmark Requirement
-**JMH is MANDATORY for every KEEP decision.** The JVM's JIT compiler, garbage collector, and runtime optimizations actively manipulate measurements. Ad-hoc timing (System.nanoTime loops, test suite wall-clock time) is vulnerable to JIT warmup artifacts, dead code elimination, constant folding, GC noise, and on-stack replacement. JMH is the only tool that defeats all of these simultaneously.
+**JMH is MANDATORY for every KEEP decision.** Use two-phase measurement: (1) quick ad-hoc pre-screen for directionality, (2) authoritative JMH run for the KEEP decision. If JMH is not available, add it as a test-scope dependency first. See `../references/data-structures/guide.md` for the JMH template.
-**Two-phase measurement**:
-- **Phase 1 (pre-screen)**: Quick ad-hoc micro-benchmark to check directionality. If this shows no improvement or regression, DISCARD immediately without investing in a full JMH run.
-- **Phase 2 (authoritative)**: JMH benchmark with proper warmup, forking, and statistical output. This is the ONLY basis for a KEEP decision.
-If JMH is not available in the project, add it as a test-scope dependency before running any optimization experiments. See `../references/data-structures/guide.md` for the JMH template.
-**After writing any JMH benchmark, validate it against `../references/benchmark-validity.md`** — the benchmark validity checklist. A methodologically flawed benchmark produces meaningless numbers.
+**After writing any JMH benchmark, validate it against `../references/benchmark-validity.md`** — the full validity checklist covering anti-DCE, anti-constant-folding, warmup floors, fork isolation, and error bar checks.
### Correctness Verification
-**Before benchmarking**, verify output equivalence per `../references/correctness-verification.md`. This covers:
-- Return value deep equality (not reference equality)
-- Floating-point tolerance (relative epsilon, not exact match)
-- Collection ordering semantics (ordered vs unordered comparison)
-- Mutable parameter state preservation
-- Exception contract preservation
-- Side effect preservation
-If ANY correctness check fails, DISCARD immediately. Do not proceed to benchmarking.
+**Before benchmarking**, verify output equivalence per `../references/correctness-verification.md`. If ANY check fails, DISCARD immediately — do not proceed to benchmarking.
### After each fix
-Run JMH benchmark comparing baseline and optimized implementations. Verify benchmark validity (anti-DCE, anti-constant-folding, warmup floors, fork count, error bar check) per `../references/benchmark-validity.md`.
+Run JMH benchmark comparing baseline and optimized implementations. Validate per `../references/benchmark-validity.md`.
### Keep/Discard
@@ -195,11 +181,9 @@ Re-run JFR/async-profiler. Print new `[ranked targets]`. Compare against ORIGINA
## Milestone Sanity Check
-At each milestone (every 3-5 KEEPs), run a cumulative JMH benchmark comparing the baseline commit (before all optimizations) with current HEAD. Compare the cumulative improvement with the sum of individual experiment improvements.
+At each milestone (every 3-5 KEEPs), run a cumulative JMH benchmark comparing the baseline commit (before all optimizations) with current HEAD.
-**If the cumulative improvement is less than 70% of the sum of individual improvements**: At least one previous KEEP is likely a false positive. Re-measure each individual KEEP by reverting to just before that commit, running JMH, then re-applying. Identify which KEEP does not actually contribute and revert it.
-This catches a dangerous failure mode: individual KEEPs that look good in isolation but do not compound because they optimize the same JIT path, or because one KEEP's improvement was actually a GC timing artifact.
+**If cumulative improvement < 70% of sum of individual improvements**: at least one KEEP is a false positive (same JIT path optimized twice, or a GC timing artifact). Re-measure each individual KEEP by reverting to just before that commit, running JMH, then re-applying. Revert the non-contributing one(s).
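A worked instance of the 70% rule, with hypothetical numbers:

```java
// Three KEEPs claimed 10%, 8%, and 12% individually.
double sumOfIndividuals = 0.10 + 0.08 + 0.12;                        // 0.30
double cumulative = 0.15;                                            // measured baseline -> HEAD
boolean falsePositiveLikely = cumulative < 0.70 * sumOfIndividuals;  // 0.15 < 0.21 -> true
// At least one KEEP probably does not contribute: re-measure each one.
```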
## Plateau Detection
@@ -262,6 +246,7 @@ commit target_test baseline_ms optimized_ms speedup tests_passed tests_failed st
For detailed domain knowledge, code examples, JMH templates, and collection contract traps:
- **`../references/benchmark-validity.md`** -- **MANDATORY** JMH benchmark validity checklist (anti-DCE, anti-constant-folding, warmup, forks, error bars)
- **`../references/correctness-verification.md`** -- **MANDATORY** Java output equivalence rules (deep equality, floats, collections, mutability, exceptions)
+- **`../references/pre-submit-review.md`** -- Java/Kotlin pre-submit checklist (JMH audit, mechanism explanation, JDK compat, serialization safety)
- **`../references/data-structures/guide.md`** -- Collection selection, autoboxing, JIT patterns, JMH template
- **`../references/memory/guide.md`** -- Allocation profiling, GC tuning, escape analysis
- **`../references/native/guide.md`** -- JNI, Panama FFI, Vector API

@@ -137,25 +137,13 @@ Cross-reference CPU hotspots with allocation sites and GC behavior:
## JMH Benchmark Requirement
-**JMH is MANDATORY for every KEEP decision.** The JVM's JIT compiler, garbage collector, and runtime optimizations actively manipulate measurements. Ad-hoc timing is vulnerable to JIT warmup artifacts, dead code elimination, constant folding, GC noise, and on-stack replacement. JMH is the only tool that defeats all of these simultaneously.
+**JMH is MANDATORY for every KEEP decision.** Use two-phase measurement: (1) quick ad-hoc pre-screen for directionality, (2) authoritative JMH run for the KEEP decision. If JMH is not available, add it as a test-scope dependency first.
-**Two-phase measurement**:
-- **Phase 1 (pre-screen)**: Quick ad-hoc measurement to check directionality. DISCARD immediately if no improvement.
-- **Phase 2 (authoritative)**: JMH benchmark with proper warmup, forking, and statistical output. This is the ONLY basis for a KEEP decision.
-If JMH is not available in the project, add it as a test-scope dependency before running any optimization experiments.
-**After writing any JMH benchmark, validate it against `../references/benchmark-validity.md`** — the benchmark validity checklist. A methodologically flawed benchmark produces meaningless numbers.
+**Validate every JMH benchmark against `../references/benchmark-validity.md`** before trusting results.
## GC Measurement Isolation
-If an optimization affects allocation patterns or claims GC improvement, measurements MUST use separate JVM processes. In the same JVM, the "after" run inherits heap state from the "before" run, contaminating GC behavior measurements.
-**Requirements for GC-affecting optimizations**:
-- JMH `@Fork(3)` minimum (each fork starts a fresh JVM with clean heap)
-- Collect JFR GC events (`jdk.G1GarbageCollection`, `jdk.GCPhasePause`, `jdk.YoungGarbageCollection`) from both baseline and optimized runs separately
-- Compare GC pause **distributions** (p50, p99, max), not just averages — a single outlier pause can skew the mean
-- If you claim "GC pauses reduced by X ms," the JFR evidence must show it explicitly
+If an optimization affects allocation patterns or claims GC improvement, measurements MUST use `@Fork(3)` minimum (each fork starts a fresh JVM with clean heap). Collect JFR GC events (`jdk.G1GarbageCollection`, `jdk.GCPhasePause`) from baseline and optimized runs separately. Compare pause **distributions** (p50, p99, max), not averages.
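A sketch of fork-isolated GC measurement, assuming JMH's `jvmArgsAppend` is used to start a per-fork JFR recording (the benchmark body and recording settings are illustrative):

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Fork;

public class GcIsolatedBenchmark {

    // Each of the 3 forks is a fresh JVM with a clean heap and writes its own
    // JFR recording, so baseline and optimized runs never share heap state.
    // In practice, disambiguate the filename per fork/run to avoid overwrites.
    @Fork(value = 3, jvmArgsAppend = {
            "-XX:StartFlightRecording=filename=fork-gc.jfr,settings=profile"})
    @Benchmark
    public Object allocationHeavyPath() {
        return new byte[1024]; // placeholder for the real workload under test
    }
}
```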
## Experiment Loop
@@ -174,9 +162,9 @@ LOOP (until plateau or user requests stop):
3. **Read source.** Read ONLY the target function. Use Explore subagent for broader context. Do NOT read the whole codebase upfront.
4. **Correctness verification.** Before implementing, capture the original function's output with representative inputs (normal, edge, error cases) per `../references/correctness-verification.md`. This is your correctness oracle.
5. **Implement ONE fix.** Print `[experiment N] Implementing: <summary>`.
-6. **Verify output equivalence.** Run the optimized version with the same inputs from step 4. Compare per `../references/correctness-verification.md` (return value deep equality, floating-point tolerance, collection ordering, mutable parameter state, exception contracts). If ANY check fails, DISCARD immediately — do not proceed to benchmarking.
+6. **Verify output equivalence.** Run the optimized version with the same inputs from step 4. Compare per `../references/correctness-verification.md`. If ANY check fails, DISCARD immediately — do not proceed to benchmarking.
7. **Guard** (run tests). Revert if fails.
-8. **Multi-dimensional JMH measurement.** Run JMH benchmark. Validate the benchmark per `../references/benchmark-validity.md` (anti-DCE, anti-constant-folding, warmup floors, fork count, error bar check). Also re-run profiling to measure ALL dimensions (CPU, Memory, GC).
+8. **Multi-dimensional JMH measurement.** Run JMH benchmark. Validate per `../references/benchmark-validity.md`. Also re-run profiling to measure ALL dimensions (CPU, Memory, GC).
9. **Print results** -- ALL dimensions: CPU time, Memory delta, GC pause delta. Include JMH Score +/- Error.
10. **Cross-domain impact assessment.** Did the fix in domain A affect domain B? Was the interaction expected? Record it.
11. **Keep/discard.** Commit after KEEP (see decision tree below).
@@ -205,16 +193,9 @@ Correctness verified? (per correctness-verification.md)
### Milestone Sanity Check (mandatory at every milestone)
-At each milestone (every 3-5 KEEPs), run a **cumulative** JMH benchmark comparing the original baseline commit (before all optimizations) with current HEAD.
+At each milestone (every 3-5 KEEPs), run a **cumulative** JMH benchmark comparing the original baseline commit with current HEAD.
-**Compare**: cumulative improvement vs sum of individual experiment improvements.
-**If cumulative improvement < 70% of sum of individuals**: At least one previous KEEP is a false positive: it looked good in isolation but does not contribute when composed with other optimizations. This happens when:
-- Two KEEPs optimize the same JIT-compiled path and only one actually contributes
-- A KEEP's improvement was actually a GC timing artifact that doesn't reproduce across forks
-- Two KEEPs interact negatively (e.g., one changed data layout, breaking the other's cache optimization)
-**Recovery**: Re-measure each individual KEEP by reverting to just before that commit, running JMH, then re-applying. Identify which KEEP(s) do not actually contribute and revert them.
+**If cumulative improvement < 70% of sum of individuals**: at least one KEEP is a false positive (same JIT path, GC timing artifact, or negative interaction). Re-measure each individual KEEP by reverting to just before that commit, running JMH, then re-applying. Revert non-contributing ones.
### Plateau Detection
@@ -275,6 +256,7 @@ An optimization that uses APIs unavailable on the project's target JDK is invali
|---------------|-------------------|
| Any KEEP decision | `../references/benchmark-validity.md` **(MANDATORY)** |
| Any optimization | `../references/correctness-verification.md` **(MANDATORY)** |
+| Pre-submit review | `../references/pre-submit-review.md` **(MANDATORY)** |
| O(n^2), wrong collection, autoboxing | `../references/data-structures/guide.md` |
| High allocs, GC pressure, memory leaks | `../references/memory/guide.md` |
| Lock contention, VT pinning, thread pools | `../references/async/guide.md` |
@@ -358,16 +340,10 @@ CI mode is triggered when the prompt contains "CI" context (e.g., "This is a CI
## Pre-Submit Review
-**MANDATORY before sending `[complete]`.** Read `${CLAUDE_PLUGIN_ROOT}/references/shared/pre-submit-review.md` for the shared checklist. Additional deep-mode checks:
+**MANDATORY before sending `[complete]`.** Read `${CLAUDE_PLUGIN_ROOT}/references/shared/pre-submit-review.md` for the shared checklist, then `../references/pre-submit-review.md` for Java/Kotlin-specific checks. Additional deep-mode checks:
-1. **Benchmark validity audit**: For every KEEP, confirm the JMH benchmark was validated per `../references/benchmark-validity.md`. Verify: results consumed (anti-DCE), inputs dynamic (anti-constant-folding), warmup met floors, forks >= 2, error bars do not overlap. If any KEEP lacks valid JMH evidence, re-run the benchmark now.
-2. **Mechanism explanation present**: Every KEEP commit message must contain a one-paragraph explanation of WHY the optimization is faster at the JVM level. "Improved performance" is not acceptable — the mechanism must be stated.
-3. **Cross-domain tradeoffs disclosed**: If any experiment improved one dimension at the cost of another, document the tradeoff in commit messages and HANDOFF.md.
-4. **GC impact verified**: If you claimed GC improvement, verify with JFR GC events (`jdk.G1GarbageCollection`, `jdk.GCPhasePause`) or `-Xlog:gc*`, not just CPU timing. Compare pause distributions, not just averages.
-5. **Interaction claims verified**: Every cross-domain interaction you reported must have profiling evidence in BOTH dimensions. "I think this helps memory too" without measurement is not acceptable.
-6. **JDK version compatibility**: Verify every optimization uses only APIs available on the project's minimum JDK version (from setup.md). See the JDK Version Compatibility table above.
-7. **Serialization safety**: If you changed collection types (e.g., `ArrayList` to `EnumSet`, `HashMap` to `Map.of()`), check if the object is serialized anywhere (Java serialization, Jackson, protobuf). See `../references/correctness-verification.md` section 7.
-8. **Correctness verification complete**: For every KEEP, confirm output equivalence was verified per `../references/correctness-verification.md` before benchmarking.
+1. **Cross-domain tradeoffs disclosed**: If any experiment improved one dimension at the cost of another, document the tradeoff in commit messages and HANDOFF.md.
+2. **Interaction claims verified**: Every cross-domain interaction you reported must have profiling evidence in BOTH dimensions. "I think this helps memory too" without measurement is not acceptable.
If you find issues, fix them, re-run tests, and update results.tsv. Note findings in HANDOFF.md under "Pre-submit review findings". Only send `[complete]` after all checks pass.

@@ -131,18 +131,13 @@ jcmd $(pgrep -f "target/.*jar") VM.native_memory summary
## JMH and GC Measurement Requirements
-**JMH is MANDATORY for every KEEP decision that claims performance improvement.** Memory optimizations that reduce heap or GC pauses affect execution speed — this must be measured with JMH, not ad-hoc timing.
+**JMH is MANDATORY for every KEEP decision.** Validate all benchmarks per `../references/benchmark-validity.md`. Verify correctness per `../references/correctness-verification.md`.
**GC measurement isolation is critical.** Measuring "before" and "after" in the same JVM is invalid because the "after" run inherits garbage from the "before" run. GC behavior depends on heap state.
-**Requirements for ALL memory optimizations**:
-- JMH `@Fork(3)` minimum (more forks than standard because GC behavior varies between JVM instances)
+**Memory-specific requirements**:
+- `@Fork(3)` minimum (GC behavior varies more between JVM instances than CPU behavior)
- Collect JFR GC events from separate runs for baseline and optimized
-- Compare GC pause **distributions** (p50, p99, max), not just averages
-- Validate JMH benchmarks per `../references/benchmark-validity.md`
-- Verify correctness per `../references/correctness-verification.md`
-**For GC pause claims specifically**: Collect `jdk.GCPhasePause` and `jdk.G1GarbageCollection` (or equivalent for ZGC/Shenandoah) from both baseline and optimized JFR recordings. If the reduction is not visible in the JFR data, the claim is unsubstantiated.
+- Compare GC pause **distributions** (p50, p99, max), not averages
+- For GC pause claims: collect `jdk.GCPhasePause` and `jdk.G1GarbageCollection` from both runs — if the reduction is not visible in JFR data, the claim is unsubstantiated
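A sketch of comparing pause distributions from a JFR recording via the `jdk.jfr.consumer` API (the percentile math is deliberately simplified):

```java
import java.nio.file.Path;
import java.time.Duration;
import java.util.List;
import java.util.stream.Collectors;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

public class GcPausePercentiles {
    // Prints p50/p99/max GC pause durations for one recording. Run it against
    // the baseline and optimized recordings separately and compare the
    // distributions, not the means.
    public static void main(String[] args) throws Exception {
        List<Duration> pauses = RecordingFile.readAllEvents(Path.of(args[0])).stream()
                .filter(e -> e.getEventType().getName().equals("jdk.GCPhasePause"))
                .map(RecordedEvent::getDuration)
                .sorted()
                .collect(Collectors.toList());
        if (pauses.isEmpty()) {
            System.out.println("no jdk.GCPhasePause events in recording");
            return;
        }
        System.out.printf("p50=%s p99=%s max=%s%n",
                pauses.get(pauses.size() / 2),
                pauses.get((int) ((pauses.size() - 1) * 0.99)),
                pauses.get(pauses.size() - 1));
    }
}
```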
## Experiment Loop
@@ -154,13 +149,11 @@ Run heap histogram + JFR allocation profiling. Build ranked allocator table with
### Correctness Verification
-Before benchmarking, verify output equivalence per `../references/correctness-verification.md`. Memory optimizations frequently change collection types or object structures — this makes correctness verification especially critical. Pay particular attention to:
-- Serialization compatibility (section 7) — changed collection types affect serialized output
-- Collection ordering semantics (section 3) — HashMap to EnumMap changes iteration order
+Before benchmarking, verify output equivalence per `../references/correctness-verification.md`. Memory optimizations are especially high-risk for sections 3 (collection ordering — HashMap to EnumMap changes iteration order) and 7 (serialization — changed collection types affect serialized output).
### After each fix
-Re-run heap histogram profiling. Print `[experiment N] <before> MiB -> <after> MiB (<delta> MiB)`. Run JMH benchmark with `@Fork(3)`. Validate benchmark per `../references/benchmark-validity.md`. Note GC impact from JFR events.
+Re-run heap histogram profiling. Print `[experiment N] <before> MiB -> <after> MiB (<delta> MiB)`. Run JMH benchmark with `@Fork(3)`. Validate per `../references/benchmark-validity.md`. Note GC impact from JFR events.
### Keep/Discard
@@ -226,6 +219,7 @@ commit target_test target_mib heap_used_mib gc_pause_ms gc_count tests_passed te
For code examples, JMH templates, GC tuning recipes, leak detection patterns, and per-stage profiling:
- **`../references/benchmark-validity.md`** -- **MANDATORY** JMH benchmark validity checklist (anti-DCE, anti-constant-folding, warmup, forks, error bars)
- **`../references/correctness-verification.md`** -- **MANDATORY** Java output equivalence rules (deep equality, floats, collections, mutability, exceptions)
+- **`../references/pre-submit-review.md`** -- Java/Kotlin pre-submit checklist (JMH audit, mechanism explanation, JDK compat, serialization safety)
- **`../references/memory/guide.md`** -- JVM heap layout, GC algorithms, escape analysis, leak detection, GC tuning
- **`../references/data-structures/guide.md`** -- Primitive collections, memory-efficient structures
- **`../references/native/guide.md`** -- DirectByteBuffer, NMT, off-heap allocators

@@ -0,0 +1,58 @@
+# Pre-Submit Self-Review — Java/Kotlin
+Java/Kotlin-specific checks for the pre-submit review. Read `${CLAUDE_PLUGIN_ROOT}/references/shared/pre-submit-review.md` first for the language-agnostic checklist.
+## JMH Benchmark Validity
+For every KEEP, confirm the benchmark was validated per `benchmark-validity.md`:
+- Results consumed (anti-DCE): every `@Benchmark` method returns or uses `Blackhole.consume()`
+- Inputs dynamic (anti-constant-folding): all inputs from `@State` or `@Param`, not literals
+- Warmup meets floors: >=5 iterations for fast ops, >=3 for slow
+- Fork count: `@Fork(2)` minimum, `@Fork(3)` for GC-sensitive or marginal improvements
+- Error bars do NOT overlap between baseline and optimized
+If any KEEP lacks valid JMH evidence, re-run the benchmark now.
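A sketch of a benchmark that passes this checklist (class name, parameters, and workload are illustrative):

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5)   // meets the warmup floor for fast ops
@Measurement(iterations = 5)
@Fork(2)                  // use @Fork(3) for GC-sensitive or marginal wins
@State(Scope.Benchmark)
public class SumBenchmark {

    @Param({"1000", "100000"})
    int size;               // inputs come from @Param/@State, not literals

    long[] data;

    @Setup
    public void setup() {
        data = new long[size];
        for (int i = 0; i < size; i++) data[i] = i;
    }

    @Benchmark
    public long sumReturned() { // returning the result defeats DCE
        long sum = 0;
        for (long v : data) sum += v;
        return sum;
    }

    @Benchmark
    public void sumConsumed(Blackhole bh) {
        long sum = 0;
        for (long v : data) sum += v;
        bh.consume(sum);        // explicit Blackhole, equally valid
    }
}
```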
+## Mechanism Explanation
+Every KEEP commit message must contain a one-paragraph explanation of WHY the optimization is faster at the JVM level — the specific mechanism (e.g., "eliminates autoboxing allocations" or "replaces O(n^2) nested loop with HashMap index"). If you wrote "improved performance" without explaining the mechanism, fix the commit message.
+## JDK Version Compatibility
+Verify every optimization uses only APIs available on the project's minimum JDK version (from `.codeflash/setup.md`). Common traps:
+| API / Feature | Minimum JDK |
+|--------------|-------------|
+| `List.of()`, `Map.of()`, `Set.of()` | 9 |
+| `String.isBlank()`, `String.strip()`, `String.repeat()` | 11 |
+| `String.formatted()` | 15 |
+| `Stream.toList()` | 16 |
+| `record` types | 16 |
+| `sealed` classes | 17 |
+| Virtual threads (`Thread.ofVirtual()`) | 21 |
+| `SequencedCollection`, `SequencedMap` | 21 |
+An optimization using unavailable APIs does not compile in production.
+## Correctness Verification
+For every KEEP, confirm output equivalence was verified per `correctness-verification.md`:
+- Return value deep equality (`.equals()`, not `==`)
+- Floating-point tolerance (relative epsilon)
+- Collection ordering semantics (ordered vs unordered comparison)
+- Mutable parameter state preservation
+- Exception contract preservation
+- Side effect preservation
+- Serialization compatibility (if the object is serialized)
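A sketch of the first two checks, deep equality and relative-epsilon comparison (the oracle values and tolerance are illustrative):

```java
import java.util.List;
import java.util.Objects;

public final class EquivalenceCheck {

    // Deep equality via equals()/deepEquals(), never reference equality (==).
    static boolean sameResult(Object original, Object optimized) {
        return Objects.deepEquals(original, optimized);
    }

    // Relative-epsilon comparison for floating point, not exact match.
    static boolean closeEnough(double original, double optimized, double relEps) {
        double scale = Math.max(Math.abs(original), Math.abs(optimized));
        return Math.abs(original - optimized) <= relEps * Math.max(scale, 1e-12);
    }

    public static void main(String[] args) {
        List<Integer> oracle = List.of(1, 2, 3);  // captured from the original code
        List<Integer> actual = List.of(1, 2, 3);  // produced by the optimized code
        System.out.println(sameResult(oracle, actual));                  // true
        System.out.println(closeEnough(0.30000000000000004, 0.3, 1e-9)); // true
    }
}
```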
+## Milestone Sanity Check
+If milestones were reached, confirm the cumulative JMH improvement was at least 70% of the sum of individual improvements. If not, at least one KEEP is a false positive — re-measure each individual KEEP and revert the non-contributing one(s).
+## GC Impact Verification
+If you claimed GC improvement, verify with JFR GC events (`jdk.G1GarbageCollection`, `jdk.GCPhasePause`) or `-Xlog:gc*`, not just CPU timing. Compare pause **distributions** (p50, p99, max), not just averages.
+## Serialization Safety
+If you changed collection types (e.g., `ArrayList` to `EnumSet`, `HashMap` to `Map.of()`), check if the object is serialized anywhere (Java serialization, Jackson, protobuf). See `correctness-verification.md` section 7.
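If the project serializes with Jackson, a quick round-trip comparison can surface such changes (the `ObjectMapper` usage assumes Jackson is on the classpath; the payloads are illustrative):

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;
import java.util.Set;

public class SerializationSafetyCheck {
    // A List serializes as an ordered JSON array; switching the field to an
    // unordered Set can silently change element order in the serialized form.
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        String before = mapper.writeValueAsString(List.of("A", "B"));
        String after = mapper.writeValueAsString(Set.of("A", "B"));
        System.out.println(before.equals(after)
                ? "serialized form unchanged"
                : "serialized form changed: " + before + " vs " + after);
    }
}
```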

@@ -8,8 +8,6 @@ LOOP (until plateau detected or user requests stop):
**Print a status line before each step** so the user can follow progress (see Progress Updates in the agent prompt).
-**IMPORTANT (Java/Kotlin):** If the project is Java or Kotlin, read `${CLAUDE_PLUGIN_ROOT}/languages/java/references/benchmark-validity.md` and `${CLAUDE_PLUGIN_ROOT}/languages/java/references/correctness-verification.md` before entering this loop. JMH is mandatory for every KEEP decision, and correctness verification must happen before benchmarking. These are non-negotiable requirements that exist because the JVM's JIT compiler actively manipulates measurements — see those files for details.
1. **Review git history.** Before choosing a target, read recent experiment history to learn from past attempts:
```bash
git log --oneline -20 # experiment sequence — what was tried
@@ -25,17 +23,15 @@ LOOP (until plateau detected or user requests stop):
7. **Verify benchmark fidelity.** Re-read the benchmark and confirm it exercises the exact code path and parameters you changed. If you modified function arguments, wrapper flags, pool sizes, or configuration, the benchmark must use the same values. If the benchmark was written before step 6, the implementation may have changed assumptions — update the benchmark to match. A benchmark that doesn't mirror the production change proves nothing.
8. **Verify output equivalence.** Run the optimized version with the same inputs from step 4 and compare outputs. If outputs differ, **discard immediately** — this is a correctness regression, not an optimization. Do not proceed to benchmarking.
9. **Benchmark**: Run target test. Print `[experiment N] Benchmarking...`. Always run for correctness, even for micro-only optimizations.
-10. **Verify benchmark validity** (Java/Kotlin). If the project is Java or Kotlin, verify the benchmark against the language-specific benchmark validity checklist before trusting results. A benchmark affected by dead code elimination, constant folding, insufficient warmup, or inadequate forking produces meaningless numbers. See domain file for the specific checklist reference. If the benchmark is invalid, fix it and re-run — do not proceed with invalid measurements.
-11. **Guard** (if configured). Run the guard command (see Guard Command below). If the guard fails, the optimization broke something — revert and rework (max 2 attempts), then discard if still failing.
-12. **Read results**: pass/fail, metrics. Print the domain-specific result line (see domain file).
-13. If crashed or regressed = fix or discard immediately.
-14. **Confirm small deltas**: If improvement is below the domain's noise threshold, re-run to confirm not noise. For Java/Kotlin: if JMH error bars overlap, the improvement is NOT PROVEN — re-run with more iterations/forks or DISCARD.
-15. **Record** in `.codeflash/results.tsv` (schema in domain file).
-16. **Keep/discard** (see decision tree in domain file). Print `[experiment N] KEEP` or `[experiment N] DISCARD — <reason>`.
-17. **Mechanism explanation** (after KEEP). Write a one-paragraph explanation of WHY the optimization improves performance — the specific mechanism, not just "it's faster." This catches measurement artifacts: if you cannot explain the mechanism, the improvement may be fake. See domain file for examples.
-18. **E2E benchmark** (after KEEP, when available). If `codeflash compare` is available (see `e2e-benchmarks.md`), run `$RUNNER -m codeflash compare <pre-opt-sha> HEAD` to get authoritative isolated measurements. Record e2e results alongside micro-bench results in `results.tsv`. If e2e contradicts micro-bench (e.g., micro showed 15% but e2e shows <2%), re-evaluate the keep decision: trust the e2e measurement. Print `[experiment N] E2E: <base>ms → <head>ms (<speedup>x)`.
-19. **Config audit** (after KEEP). Check for related configuration flags that may have become dead or inconsistent after your change. Infrastructure changes (drivers, pools, middleware) often leave behind no-op config. Remove or update stale flags.
-20. **Milestones** (every 3-5 keeps): Run full benchmark (including `codeflash compare <baseline-sha> HEAD` for cumulative e2e measurement), create milestone branch. **Milestone sanity check**: compare cumulative improvement with sum of individual experiment improvements. If cumulative < 70% of sum, at least one KEEP is a false positive; investigate (see domain file for recovery procedure). Print `[milestone] vN — <total kept>/<total experiments>, cumulative <metric>`.
+10. **Guard** (if configured). Run the guard command (see Guard Command below). If the guard fails, the optimization broke something — revert and rework (max 2 attempts), then discard if still failing.
+11. **Read results**: pass/fail, metrics. Print the domain-specific result line (see domain file).
+12. If crashed or regressed = fix or discard immediately.
+13. **Confirm small deltas**: If improvement is below the domain's noise threshold, re-run to confirm not noise.
+14. **Record** in `.codeflash/results.tsv` (schema in domain file).
+15. **Keep/discard** (see decision tree in domain file). Print `[experiment N] KEEP` or `[experiment N] DISCARD — <reason>`.
+16. **E2E benchmark** (after KEEP, when available). If `codeflash compare` is available (see `e2e-benchmarks.md`), run `$RUNNER -m codeflash compare <pre-opt-sha> HEAD` to get authoritative isolated measurements. Record e2e results alongside micro-bench results in `results.tsv`. If e2e contradicts micro-bench (e.g., micro showed 15% but e2e shows <2%), re-evaluate the keep decision: trust the e2e measurement. Print `[experiment N] E2E: <base>ms → <head>ms (<speedup>x)`.
+17. **Config audit** (after KEEP). Check for related configuration flags that may have become dead or inconsistent after your change. Infrastructure changes (drivers, pools, middleware) often leave behind no-op config. Remove or update stale flags.
+18. **Milestones** (every 3-5 keeps): Run full benchmark (including `codeflash compare <baseline-sha> HEAD` for cumulative e2e measurement), create milestone branch. Print `[milestone] vN — <total kept>/<total experiments>, cumulative <metric>`.
## Keep/Discard Decision Tree — Common Structure

@@ -45,16 +45,6 @@ Cross-check your implementation against what the PR claims:
- **Tests cover the alternate paths** your change affects. If the function is called from both sync and async contexts, test both.
- **Regression tests for edge cases** mentioned in your analysis (e.g., empty input, single element, concurrent access).
-## 6. Java/Kotlin-Specific Checks
-If the project is Java or Kotlin, also verify:
-- **JMH benchmark validity**: For every KEEP, confirm the benchmark was validated per the language-specific benchmark validity checklist (anti-DCE, anti-constant-folding, warmup floors, fork count, error bar non-overlap). If any KEEP lacks valid JMH evidence, re-run the benchmark now.
-- **Mechanism explanation present**: Every KEEP commit message must contain a one-paragraph explanation of WHY the optimization is faster at the JVM level — the specific mechanism (e.g., "eliminates autoboxing allocations" or "replaces O(n^2) nested loop with HashMap index"). If you wrote "improved performance" without explaining the mechanism, fix the commit message.
-- **JDK version compatibility**: Verify every optimization uses only APIs available on the project's minimum JDK version (from `.codeflash/setup.md`). Common traps: `List.of()` requires JDK 9, `String.isBlank()` requires JDK 11, virtual threads require JDK 21. An optimization using unavailable APIs does not compile in production.
-- **Correctness verification complete**: For every KEEP, confirm output equivalence was verified per the language-specific correctness verification checklist (return value deep equality, floating-point tolerance, collection ordering, mutable parameters, exception contracts, serialization compatibility).
-- **Milestone sanity check passed**: If milestones were reached, confirm the cumulative improvement was at least 70% of the sum of individual improvements. If not, at least one KEEP is a false positive that should be reverted.
## How to Run
At the end of the experiment loop, before sending `[complete]`: