feat(java): enforce mandatory JMH validation and correctness verification for all optimization agents

The Java optimization agents previously allowed non-JMH benchmarking (ad-hoc timing,
test suite wall-clock) which is vulnerable to JVM measurement artifacts — dead code
elimination, constant folding, insufficient warmup, GC noise, and OSR. This led to
false-positive optimization results being kept and shipped.

Changes:
- Create benchmark-validity.md: mandatory JMH checklist covering anti-DCE, anti-constant-
  folding, warmup floors, fork isolation, statistical significance (2x error margin rule),
  and GC isolation. Includes validity decision tree.
- Create correctness-verification.md: 7-check Java output equivalence rules covering deep
  equality, floating-point tolerance, collection ordering, mutable parameter state,
  exception contracts, side effects, and serialization compatibility.
- Update all 4 domain agents (cpu, memory, async, deep): mandate JMH for every KEEP,
  add correctness verification before benchmarking, rewrite keep/discard trees to start
  with correctness check, require mechanism explanations, add JDK version compatibility.
- Update experiment-loop-base.md: add Java-specific benchmark validity step, error bar
  overlap check, mechanism explanation requirement, milestone sanity check (70% threshold).
- Update data-structures/guide.md: add JMH validity rules and output reading guide.
- Update pre-submit-review.md: add Java/Kotlin-specific pre-submit checks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mohamed Ashraf 2026-04-16 01:56:32 +00:00
parent f8281a24a0
commit 525b2d243e
9 changed files with 620 additions and 70 deletions

View file

@@ -111,29 +111,60 @@ grep -rn "newFixedThreadPool\|newCachedThreadPool\|ThreadPoolExecutor" --include
grep -rn "Hashtable\|Vector\|synchronizedMap\|StringBuffer" --include="*.java" --include="*.kt" src/
```
## JMH Benchmark Requirement
**JMH is MANDATORY for every KEEP decision.** Concurrency benchmarks are especially vulnerable to JVM measurement artifacts because thread scheduling, JIT compilation, and GC interact non-deterministically. JMH with proper forking isolates these effects.
**Concurrency-specific JMH settings**:
- `@Fork(3)` minimum (concurrency behavior varies more between JVM instances than single-threaded code)
- `@Threads(N)` matching the target production concurrency level
- `@State(Scope.Benchmark)` for shared state, `@State(Scope.Thread)` for thread-local state
- Longer warmup (>=5 iterations) to stabilize thread pool ramp-up and JIT under contention
**After writing any JMH benchmark, validate it against `../references/benchmark-validity.md`** — the benchmark validity checklist.
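A minimal sketch of these settings (`SharedCache` and `updateAndGet` are hypothetical stand-ins for the real target):
```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 5, time = 1)   // longer warmup: thread pool ramp-up and JIT under contention
@Measurement(iterations = 10, time = 1)
@Fork(3)                            // minimum for concurrency benchmarks
@Threads(8)                         // match the target production concurrency level
@State(Scope.Benchmark)             // one shared instance contended by all benchmark threads
public class SharedCacheBenchmark {
    private SharedCache cache;      // hypothetical class under test

    @Setup(Level.Trial)
    public void setup() {
        cache = new SharedCache();
    }

    @Benchmark
    public void update(Blackhole bh) {
        bh.consume(cache.updateAndGet("key", 1));   // ANTI-DCE: consume the result
    }
}
```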
## Experiment Loop
Read `${CLAUDE_PLUGIN_ROOT}/references/shared/experiment-loop-base.md` for the full loop. Concurrency-specific additions:
### Correctness Verification
Before benchmarking, verify output equivalence per `../references/correctness-verification.md`. Concurrency optimizations are especially prone to behavioral changes:
- Lock changes can introduce race conditions (section 4 — mutable parameter state)
- Collection type changes (synchronized to concurrent) may change iteration ordering (section 3)
- Exception behavior may change under concurrent access (section 5)
Additionally, run tests under load to detect race conditions. Use a JVM concurrency stress-testing tool such as jcstress, or stress-test with multiple threads.
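A hedged sketch of such a stress test, assuming a hypothetical `Counter` target whose invariant a race would break:
```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hammers the target from many threads at once, then checks an invariant a race would violate.
public class RaceStressTest {
    public static void main(String[] args) throws Exception {
        final int threads = 16;
        final int callsPerThread = 100_000;
        Counter counter = new Counter();                 // hypothetical class under test
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                try {
                    start.await();                       // release all threads together to maximize contention
                    for (int i = 0; i < callsPerThread; i++) {
                        counter.increment();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            });
        }
        start.countDown();
        done.await();
        pool.shutdown();
        long expected = (long) threads * callsPerThread;
        if (counter.get() != expected) {
            throw new AssertionError("Race detected: expected " + expected + " but got " + counter.get());
        }
    }
}
```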
### After each fix
Run JMH at agreed thread count. Also verify: `go test -race` equivalent -- run tests under load to detect races.
Run JMH at agreed thread count with `@Fork(3)`. Validate benchmark per `../references/benchmark-validity.md`. Also verify: run tests under load to detect races.
### Keep/Discard
```
Tests pass? AND no race conditions?
+-- NO -> DISCARD (race conditions are bugs)
+-- YES -> Metric improved?
    +-- >=10% latency or throughput improvement -> KEEP
    +-- <10% -> Re-run 3x (concurrency benchmarks have high variance)
    +-- Lock removal or VT migration -> Always KEEP (prevents thread starvation)
    +-- No improvement -> DISCARD
Correctness verified? (per correctness-verification.md)
+-- NO -> DISCARD immediately (behavioral change)
+-- YES -> Tests pass? AND no race conditions under load?
    +-- NO -> DISCARD (race conditions are bugs)
    +-- YES -> JMH benchmark valid? (per benchmark-validity.md)
        +-- NO -> Fix benchmark and re-run
        +-- YES -> JMH error bars do NOT overlap?
            +-- OVERLAP -> Re-run with more iterations/forks (3x minimum for concurrency). Still overlap? DISCARD
            +-- NO OVERLAP -> Metric improved?
                +-- >=10% latency or throughput improvement AND improvement > 2x error margin -> KEEP
                +-- <10% -> Re-run 3x (concurrency benchmarks have high variance). Confirmed? KEEP. Not confirmed? DISCARD
                +-- Lock removal or VT migration -> KEEP if no race conditions detected (prevents thread starvation)
                +-- No improvement -> DISCARD
```
### Mechanism Explanation (mandatory for every KEEP)
Write one paragraph explaining WHY the optimization improves throughput or latency at the JVM concurrency level. Example: "The original used a global `synchronized` block protecting the entire `updateCache()` method. With 8 concurrent threads, each thread waited an average of 340ms per second on monitor entry (JFR `jdk.JavaMonitorEnter`). The optimized version uses `StampedLock` with optimistic reads, allowing readers to proceed without acquiring any lock. Under the same 8-thread load, monitor wait time dropped to <5ms and throughput increased from 12K to 38K ops/sec."
### Record after each experiment
Update `.codeflash/results.tsv` AND `.codeflash/HANDOFF.md` immediately after every keep/discard. Update Hotspot Summary and Kept/Discarded sections in HANDOFF.md.
Update `.codeflash/results.tsv` AND `.codeflash/HANDOFF.md` immediately after every keep/discard. Update Hotspot Summary and Kept/Discarded sections in HANDOFF.md. Include the mechanism explanation for KEEPs.
## Plateau Detection
@@ -160,6 +191,8 @@ commit target_test baseline_throughput optimized_throughput throughput_change ba
## Deep References
For code examples, virtual thread migration guide, JMH concurrency templates, and lock patterns:
- **`../references/benchmark-validity.md`** -- **MANDATORY** JMH benchmark validity checklist (anti-DCE, anti-constant-folding, warmup, forks, error bars)
- **`../references/correctness-verification.md`** -- **MANDATORY** Java output equivalence rules (deep equality, floats, collections, mutability, exceptions)
- **`../references/async/guide.md`** -- Lock hierarchies, virtual threads, CompletableFuture, structured concurrency, thread pool sizing
- **`../references/data-structures/guide.md`** -- Concurrent collection selection
- **`../../shared/e2e-benchmarks.md`** -- Two-phase measurement with `codeflash compare`

View file

@@ -129,35 +129,78 @@ Read `${CLAUDE_PLUGIN_ROOT}/references/shared/experiment-loop-base.md` for the f
### Baseline
Run JFR or async-profiler. Print `[ranked targets]` with time percentages. Save baseline total.
Run JFR or async-profiler. Print `[ranked targets]` with time percentages. Save baseline total. Save the baseline commit SHA for later JMH comparison.
### JMH Benchmark Requirement
**JMH is MANDATORY for every KEEP decision.** The JVM's JIT compiler, garbage collector, and runtime optimizations actively manipulate measurements. Ad-hoc timing (System.nanoTime loops, test suite wall-clock time) is vulnerable to JIT warmup artifacts, dead code elimination, constant folding, GC noise, and on-stack replacement. JMH is the only tool that defeats all of these simultaneously.
**Two-phase measurement**:
- **Phase 1 (pre-screen)**: Quick ad-hoc micro-benchmark to check directionality. If this shows no improvement or regression, DISCARD immediately without investing in a full JMH run.
- **Phase 2 (authoritative)**: JMH benchmark with proper warmup, forking, and statistical output. This is the ONLY basis for a KEEP decision.
If JMH is not available in the project, add it as a test-scope dependency before running any optimization experiments. See `../references/data-structures/guide.md` for the JMH template.
**After writing any JMH benchmark, validate it against `../references/benchmark-validity.md`** — the benchmark validity checklist. A methodologically flawed benchmark produces meaningless numbers.
### Correctness Verification
**Before benchmarking**, verify output equivalence per `../references/correctness-verification.md`. This covers:
- Return value deep equality (not reference equality)
- Floating-point tolerance (relative epsilon, not exact match)
- Collection ordering semantics (ordered vs unordered comparison)
- Mutable parameter state preservation
- Exception contract preservation
- Side effect preservation
If ANY correctness check fails, DISCARD immediately. Do not proceed to benchmarking.
### After each fix
Run JMH benchmark or target test suite. Compare before/after. See `../references/data-structures/guide.md` for JMH template.
Run JMH benchmark comparing baseline and optimized implementations. Verify benchmark validity (anti-DCE, anti-constant-folding, warmup floors, fork count, error bar check) per `../references/benchmark-validity.md`.
### Keep/Discard
```
Tests pass? (mvn test / gradle test)
+-- NO -> Fix or discard
+-- YES -> benchstat/JMH shows significant improvement?
    +-- >=5% speedup (p < 0.05) -> KEEP
    +-- <5% -> Re-run 3 times (JIT warmup variance is real)
    |   +-- Confirmed -> KEEP
    |   +-- Not significant -> DISCARD
    +-- Micro-bench only: >=20% on confirmed hot path -> KEEP
    +-- JIT deopt fix: KEEP if PrintCompilation confirms deopt eliminated
    +-- No improvement -> DISCARD
Correctness verified? (per correctness-verification.md)
+-- NO -> DISCARD immediately (behavioral change)
+-- YES -> Tests pass? (mvn test / gradle test)
    +-- NO -> Fix or discard
    +-- YES -> JMH benchmark valid? (per benchmark-validity.md)
        +-- NO -> Fix benchmark and re-run
        +-- YES -> JMH error bars do NOT overlap?
            +-- OVERLAP -> Re-run with more iterations/forks. Still overlap? DISCARD
            +-- NO OVERLAP -> Speedup >= 5%?
                +-- YES (>= 5%, improvement > 2x error margin) -> KEEP
                +-- YES but improvement < 2x error margin -> INCONCLUSIVE, re-run or DISCARD
                +-- JIT deopt fix: KEEP if PrintCompilation confirms deopt eliminated AND JMH shows stable improvement
                +-- NO -> DISCARD
```
### Mechanism Explanation (mandatory for every KEEP)
After each KEEP, write a one-paragraph explanation of **the mechanism** by which the optimization improves performance. Not "it's faster" — explain WHY at the JVM level.
Example: "The original used `ArrayList.contains()` inside a loop of N items, giving O(N*M) complexity. The optimized version builds a `HashSet` from the list first (O(M)), then does O(1) lookups per item, giving O(N+M) total. For the benchmark input size of 10,000 items, this eliminates approximately 50 million linear scans."
If you cannot explain the mechanism, treat the optimization with suspicion — it may be a measurement artifact, not a real improvement. Re-validate the benchmark.
### Record after each experiment
Update `.codeflash/results.tsv` AND `.codeflash/HANDOFF.md` immediately after every keep/discard. Update Hotspot Summary and Kept/Discarded sections in HANDOFF.md.
Update `.codeflash/results.tsv` AND `.codeflash/HANDOFF.md` immediately after every keep/discard. Update Hotspot Summary and Kept/Discarded sections in HANDOFF.md. Include the mechanism explanation for KEEPs.
### Mandatory re-profiling after KEEP
Re-run JFR/async-profiler. Print new `[ranked targets]`. Compare against ORIGINAL baseline total. **STOP if all remaining targets below 2% of original baseline.**
## Milestone Sanity Check
At each milestone (every 3-5 KEEPs), run a cumulative JMH benchmark comparing the baseline commit (before all optimizations) with current HEAD. Compare the cumulative improvement with the sum of individual experiment improvements.
**If the cumulative improvement is less than 70% of the sum of individual improvements**: At least one previous KEEP is likely a false positive. Re-measure each individual KEEP by reverting to just before that commit, running JMH, then re-applying. Identify which KEEP does not actually contribute and revert it.
This catches a dangerous failure mode: individual KEEPs that look good in isolation but do not compound because they optimize the same JIT path, or because one KEEP's improvement was actually a GC timing artifact.
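A worked example of the threshold (illustrative numbers only):
```
Individual KEEPs:  +12%, +9%, +10%            -> sum of individuals = 31%
Threshold:         0.7 * 31% = 21.7%
Cumulative JMH (baseline commit vs HEAD): +14%
14% < 21.7% -> below threshold -> re-measure each KEEP in isolation and revert the non-contributor
```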
## Plateau Detection
- 3+ consecutive discards -> check if remaining hotspots are I/O-bound, native, or JVM internals
@@ -166,6 +209,27 @@ Re-run JFR/async-profiler. Print new `[ranked targets]`. Compare against ORIGINA
Strategy rotation: collection swaps -> algorithmic restructuring -> JIT deopt fixes -> caching/memoization -> lock reduction -> native methods
## JDK Version Compatibility
Read the project's minimum JDK version from `.codeflash/setup.md` (detected during setup from `pom.xml` `<maven.compiler.source>` or `build.gradle` `sourceCompatibility`). Every optimization MUST be compatible with this version.
**Common version gates**:
| API / Feature | Minimum JDK | What Breaks If Used on Older JDK |
|--------------|-------------|----------------------------------|
| `List.of()`, `Map.of()`, `Set.of()` | 9 | Compilation error |
| `String.isBlank()`, `String.strip()`, `String.repeat()` | 11 | Compilation error |
| `Stream.toList()` | 16 | Compilation error |
| `record` types | 16 | Compilation error |
| `sealed` classes | 17 | Compilation error |
| Virtual threads (`Thread.ofVirtual()`) | 21 | Compilation error |
| `SequencedCollection`, `SequencedMap` | 21 | Compilation error |
| `String.formatted()` | 15 | Compilation error |
**If the project targets JDK 8**: None of the above are available. Use only JDK 8 APIs.
An optimization that uses APIs unavailable on the project's target JDK is invalid — it produces code that does not compile in production.
## Diff Hygiene
Before pushing, review `git diff <base>..HEAD`:
@@ -173,7 +237,7 @@ Before pushing, review `git diff <base>..HEAD`:
1. No unintended formatting changes (IDE auto-format, import reordering)
2. No deleted code you didn't mean to remove
3. Consistent style with surrounding code (brace placement, naming conventions)
4. No accidental JDK version bumps (e.g., using `List.of()` when project targets JDK 8)
4. No JDK API usage above the project's minimum version (check the table above)
## Results Schema
@@ -196,6 +260,8 @@ commit target_test baseline_ms optimized_ms speedup tests_passed tests_failed st
## Deep References
For detailed domain knowledge, code examples, JMH templates, and collection contract traps:
- **`../references/benchmark-validity.md`** -- **MANDATORY** JMH benchmark validity checklist (anti-DCE, anti-constant-folding, warmup, forks, error bars)
- **`../references/correctness-verification.md`** -- **MANDATORY** Java output equivalence rules (deep equality, floats, collections, mutability, exceptions)
- **`../references/data-structures/guide.md`** -- Collection selection, autoboxing, JIT patterns, JMH template
- **`../references/memory/guide.md`** -- Allocation profiling, GC tuning, escape analysis
- **`../references/native/guide.md`** -- JNI, Panama FFI, Vector API

View file

@@ -135,6 +135,28 @@ Cross-reference CPU hotspots with allocation sites and GC behavior:
**Read `../references/team-orchestration.md`** for the full protocol: creating the team, dispatching domain agents with cross-domain context, dispatching researchers, receiving results, parallel dispatch with profiling conflict awareness, merging dispatched work, and team cleanup.
## JMH Benchmark Requirement
**JMH is MANDATORY for every KEEP decision.** The JVM's JIT compiler, garbage collector, and runtime optimizations actively manipulate measurements. Ad-hoc timing is vulnerable to JIT warmup artifacts, dead code elimination, constant folding, GC noise, and on-stack replacement. JMH is the only tool that defeats all of these simultaneously.
**Two-phase measurement**:
- **Phase 1 (pre-screen)**: Quick ad-hoc measurement to check directionality. DISCARD immediately if no improvement.
- **Phase 2 (authoritative)**: JMH benchmark with proper warmup, forking, and statistical output. This is the ONLY basis for a KEEP decision.
If JMH is not available in the project, add it as a test-scope dependency before running any optimization experiments.
**After writing any JMH benchmark, validate it against `../references/benchmark-validity.md`** — the benchmark validity checklist. A methodologically flawed benchmark produces meaningless numbers.
## GC Measurement Isolation
If an optimization affects allocation patterns or claims GC improvement, measurements MUST use separate JVM processes. In the same JVM, the "after" run inherits heap state from the "before" run, contaminating GC behavior measurements.
**Requirements for GC-affecting optimizations**:
- JMH `@Fork(3)` minimum (each fork starts a fresh JVM with clean heap)
- Collect JFR GC events (`jdk.G1GarbageCollection`, `jdk.GCPhasePause`, `jdk.YoungGarbageCollection`) from both baseline and optimized runs separately
- Compare GC pause **distributions** (p50, p99, max), not just averages — a single outlier pause can skew the mean
- If you claim "GC pauses reduced by X ms," the JFR evidence must show it explicitly
## Experiment Loop
**PROFILING GATE:** Must have printed `[unified targets]` table before entering this loop.
@@ -150,28 +172,50 @@ LOOP (until plateau or user requests stop):
1. **Choose target.** Prefer multi-domain targets. For each target, decide: **handle it yourself** (cross-domain interaction) or **dispatch to a domain agent** (single-domain). Print `[experiment N] Target: <name> (<domains>, hypothesis: <interaction>)`.
2. **Joint reasoning checklist.** Answer all 10 questions. If the interaction hypothesis is unclear, profile deeper first.
3. **Read source.** Read ONLY the target function. Use Explore subagent for broader context. Do NOT read the whole codebase upfront.
4. **Implement ONE fix.** Print `[experiment N] Implementing: <summary>`.
5. **Multi-dimensional measurement.** Re-run profiling, measure ALL dimensions (CPU, Memory, GC).
6. **Guard** (run tests). Revert if fails.
7. **Print results** -- ALL dimensions: CPU, Memory, GC pauses.
8. **Cross-domain impact assessment.** Did the fix in domain A affect domain B? Was the interaction expected? Record it.
9. **Keep/discard.** Commit after KEEP (see decision tree below).
10. **Record** in `.codeflash/results.tsv` AND `.codeflash/HANDOFF.md` immediately. Include ALL dimensions measured. Update Hotspot Summary and Kept/Discarded sections.
11. **Strategy revision** (after every KEEP). Re-run unified profiling. Print updated `[unified targets]` table. Check for remaining targets (>1% CPU, >2 MiB memory, >5ms latency). Scan for code antipatterns (autoboxing, `String.format` in loops, `synchronized` on hot path) that may not rank high in profiling but are trivially fixable. Ask: "What did I learn? What changed across domains? Should I continue or pivot?"
12. **Milestones** (every 3-5 keeps): Full benchmark, tag, AND run adversarial review on commits since last milestone. Fix HIGH-severity findings before continuing.
4. **Correctness verification.** Before implementing, capture the original function's output with representative inputs (normal, edge, error cases) per `../references/correctness-verification.md`. This is your correctness oracle.
5. **Implement ONE fix.** Print `[experiment N] Implementing: <summary>`.
6. **Verify output equivalence.** Run the optimized version with the same inputs from step 4. Compare per `../references/correctness-verification.md` (return value deep equality, floating-point tolerance, collection ordering, mutable parameter state, exception contracts). If ANY check fails, DISCARD immediately — do not proceed to benchmarking.
7. **Guard** (run tests). Revert if fails.
8. **Multi-dimensional JMH measurement.** Run JMH benchmark. Validate the benchmark per `../references/benchmark-validity.md` (anti-DCE, anti-constant-folding, warmup floors, fork count, error bar check). Also re-run profiling to measure ALL dimensions (CPU, Memory, GC).
9. **Print results** -- ALL dimensions: CPU time, Memory delta, GC pause delta. Include JMH Score +/- Error.
10. **Cross-domain impact assessment.** Did the fix in domain A affect domain B? Was the interaction expected? Record it.
11. **Keep/discard.** Commit after KEEP (see decision tree below).
12. **Mechanism explanation** (mandatory for every KEEP). Write one paragraph explaining WHY the optimization is faster at the JVM level. Not "it's faster" — explain the mechanism (e.g., "eliminates 2M autoboxed Integer allocations per call, reducing young GC frequency from 12/sec to 3/sec, which cuts GC pause contribution from 40ms to 8ms"). If you cannot explain the mechanism, re-validate — it may be a measurement artifact.
13. **Record** in `.codeflash/results.tsv` AND `.codeflash/HANDOFF.md` immediately. Include ALL dimensions measured and the mechanism explanation. Update Hotspot Summary and Kept/Discarded sections.
14. **Strategy revision** (after every KEEP). Re-run unified profiling. Print updated `[unified targets]` table. Check for remaining targets (>1% CPU, >2 MiB memory, >5ms latency). Scan for code antipatterns (autoboxing, `String.format` in loops, `synchronized` on hot path) that may not rank high in profiling but are trivially fixable. Ask: "What did I learn? What changed across domains? Should I continue or pivot?"
15. **Milestones** (every 3-5 keeps): Full cumulative JMH benchmark + milestone sanity check + adversarial review. Fix HIGH-severity findings before continuing.
### Keep/Discard
```
Tests passed?
+-- NO -> Fix or discard
+-- YES -> Net cross-domain effect:
    +-- Target >=5% improved AND no regression -> KEEP
    +-- Target + other dimension both improved -> KEEP (compound)
    +-- Target improved but other regressed -> net positive? KEEP with note; net negative? DISCARD
    +-- No dimension improved -> DISCARD
Correctness verified? (per correctness-verification.md)
+-- NO -> DISCARD immediately (behavioral change)
+-- YES -> Tests passed?
    +-- NO -> Fix or discard
    +-- YES -> JMH benchmark valid? (per benchmark-validity.md)
        +-- NO -> Fix benchmark and re-run
        +-- YES -> JMH error bars do NOT overlap?
            +-- OVERLAP -> Re-run with more iterations/forks. Still overlap? DISCARD
            +-- NO OVERLAP -> Net cross-domain effect:
                +-- Target >=5% improved AND no regression AND improvement > 2x error margin -> KEEP
                +-- Target + other dimension both improved -> KEEP (compound)
                +-- Target improved but other regressed -> net positive? KEEP with note; net negative? DISCARD
                +-- No dimension improved -> DISCARD
```
### Milestone Sanity Check (mandatory at every milestone)
At each milestone (every 3-5 KEEPs), run a **cumulative** JMH benchmark comparing the original baseline commit (before all optimizations) with current HEAD.
**Compare**: cumulative improvement vs sum of individual experiment improvements.
**If cumulative improvement < 70% of sum of individuals**: At least one previous KEEP is a false positive: it looked good in isolation but does not contribute when composed with other optimizations. This happens when:
- Two KEEPs optimize the same JIT-compiled path and only one actually contributes
- A KEEP's improvement was actually a GC timing artifact that doesn't reproduce across forks
- Two KEEPs interact negatively (e.g., one changed data layout, breaking the other's cache optimization)
**Recovery**: Re-measure each individual KEEP by reverting to just before that commit, running JMH, then re-applying. Identify which KEEP(s) do not actually contribute and revert them.
### Plateau Detection
- Cross-domain plateau: EVERY dimension has 3+ consecutive discards
@@ -206,12 +250,30 @@ commit target_test cpu_baseline_s cpu_optimized_s cpu_speedup mem_baseline_mb me
- `interaction`: cross-domain effect observed (e.g., `alloc_to_gc_reduction`, `none`)
- `status`: `keep`, `discard`, or `crash`
## JDK Version Compatibility
Read the project's minimum JDK version from `.codeflash/setup.md`. Every optimization MUST be compatible with this version.
| API / Feature | Minimum JDK |
|--------------|-------------|
| `List.of()`, `Map.of()`, `Set.of()` | 9 |
| `String.isBlank()`, `String.strip()`, `String.repeat()` | 11 |
| `Stream.toList()` | 16 |
| `record` types | 16 |
| `sealed` classes | 17 |
| Virtual threads (`Thread.ofVirtual()`) | 21 |
| `SequencedCollection`, `SequencedMap` | 21 |
An optimization that uses APIs unavailable on the project's target JDK is invalid — it produces code that does not compile in production. Always check before implementing.
## Reference Loading
**Read on demand, not upfront.** Only load when you've identified a pattern through profiling:
| Pattern found | Reference to read |
|---------------|-------------------|
| Any KEEP decision | `../references/benchmark-validity.md` **(MANDATORY)** |
| Any optimization | `../references/correctness-verification.md` **(MANDATORY)** |
| O(n^2), wrong collection, autoboxing | `../references/data-structures/guide.md` |
| High allocs, GC pressure, memory leaks | `../references/memory/guide.md` |
| Lock contention, VT pinning, thread pools | `../references/async/guide.md` |
@@ -297,11 +359,14 @@ CI mode is triggered when the prompt contains "CI" context (e.g., "This is a CI
**MANDATORY before sending `[complete]`.** Read `${CLAUDE_PLUGIN_ROOT}/references/shared/pre-submit-review.md` for the shared checklist. Additional deep-mode checks:
1. **Cross-domain tradeoffs disclosed**: If any experiment improved one dimension at the cost of another, document the tradeoff in commit messages and HANDOFF.md.
2. **GC impact verified**: If you claimed GC improvement, verify with JFR GC events (`jdk.G1GarbageCollection`, `jdk.GCPhasePause`) or `-Xlog:gc*`, not just CPU timing.
3. **Interaction claims verified**: Every cross-domain interaction you reported must have profiling evidence in BOTH dimensions. "I think this helps memory too" without measurement is not acceptable.
4. **JDK version guards**: If your fix depends on JDK 9+/11+/17+/21+ APIs, verify the project's minimum JDK version (from setup.md) supports it.
5. **Serialization safety**: If you changed collection types (e.g., `ArrayList` to `EnumSet`, `HashMap` to `Map.of()`), check if the object is serialized anywhere (Java serialization, Jackson, protobuf).
1. **Benchmark validity audit**: For every KEEP, confirm the JMH benchmark was validated per `../references/benchmark-validity.md`. Verify: results consumed (anti-DCE), inputs dynamic (anti-constant-folding), warmup met floors, forks >= 2, error bars do not overlap. If any KEEP lacks valid JMH evidence, re-run the benchmark now.
2. **Mechanism explanation present**: Every KEEP commit message must contain a one-paragraph explanation of WHY the optimization is faster at the JVM level. "Improved performance" is not acceptable — the mechanism must be stated.
3. **Cross-domain tradeoffs disclosed**: If any experiment improved one dimension at the cost of another, document the tradeoff in commit messages and HANDOFF.md.
4. **GC impact verified**: If you claimed GC improvement, verify with JFR GC events (`jdk.G1GarbageCollection`, `jdk.GCPhasePause`) or `-Xlog:gc*`, not just CPU timing. Compare pause distributions, not just averages.
5. **Interaction claims verified**: Every cross-domain interaction you reported must have profiling evidence in BOTH dimensions. "I think this helps memory too" without measurement is not acceptable.
6. **JDK version compatibility**: Verify every optimization uses only APIs available on the project's minimum JDK version (from setup.md). See the JDK Version Compatibility table above.
7. **Serialization safety**: If you changed collection types (e.g., `ArrayList` to `EnumSet`, `HashMap` to `Map.of()`), check if the object is serialized anywhere (Java serialization, Jackson, protobuf). See `../references/correctness-verification.md` section 7.
8. **Correctness verification complete**: For every KEEP, confirm output equivalence was verified per `../references/correctness-verification.md` before benchmarking.
If you find issues, fix them, re-run tests, and update results.tsv. Note findings in HANDOFF.md under "Pre-submit review findings". Only send `[complete]` after all checks pass.

View file

@@ -129,34 +129,63 @@ java -XX:NativeMemoryTracking=summary -jar app.jar
jcmd $(pgrep -f "target/.*jar") VM.native_memory summary
```
## JMH and GC Measurement Requirements
**JMH is MANDATORY for every KEEP decision that claims performance improvement.** Memory optimizations that reduce heap or GC pauses affect execution speed — this must be measured with JMH, not ad-hoc timing.
**GC measurement isolation is critical.** Measuring "before" and "after" in the same JVM is invalid because the "after" run inherits garbage from the "before" run. GC behavior depends on heap state.
**Requirements for ALL memory optimizations**:
- JMH `@Fork(3)` minimum (more forks than standard because GC behavior varies between JVM instances)
- Collect JFR GC events from separate runs for baseline and optimized
- Compare GC pause **distributions** (p50, p99, max), not just averages
- Validate JMH benchmarks per `../references/benchmark-validity.md`
- Verify correctness per `../references/correctness-verification.md`
**For GC pause claims specifically**: Collect `jdk.GCPhasePause` and `jdk.G1GarbageCollection` (or equivalent for ZGC/Shenandoah) from both baseline and optimized JFR recordings. If the reduction is not visible in the JFR data, the claim is unsubstantiated.
## Experiment Loop
Read `${CLAUDE_PLUGIN_ROOT}/references/shared/experiment-loop-base.md` for the full loop. Memory-specific additions:
### Baseline
Run heap histogram + JFR allocation profiling. Build ranked allocator table with bytes and object counts.
Run heap histogram + JFR allocation profiling. Build ranked allocator table with bytes and object counts. Save the baseline commit SHA for JMH comparison.
### Correctness Verification
Before benchmarking, verify output equivalence per `../references/correctness-verification.md`. Memory optimizations frequently change collection types or object structures — this makes correctness verification especially critical. Pay particular attention to:
- Serialization compatibility (section 7) — changed collection types affect serialized output
- Collection ordering semantics (section 3) — HashMap to EnumMap changes iteration order
### After each fix
Re-run profiling. Print `[experiment N] <before> MiB -> <after> MiB (<delta> MiB)`. Note GC impact.
Re-run heap histogram profiling. Print `[experiment N] <before> MiB -> <after> MiB (<delta> MiB)`. Run JMH benchmark with `@Fork(3)`. Validate benchmark per `../references/benchmark-validity.md`. Note GC impact from JFR events.
### Keep/Discard
```
Tests pass?
+-- NO -> Fix or discard
+-- YES -> Metric improved?
    +-- >=5 MiB reduction -> KEEP
    +-- <5 MiB -> Re-run with forced GC to confirm
    +-- Leak fix (unbounded growth stopped) -> Always KEEP
    +-- GC pause reduction >=50ms -> KEEP even if heap unchanged
    +-- No improvement -> DISCARD
Correctness verified? (per correctness-verification.md)
+-- NO -> DISCARD immediately (behavioral change)
+-- YES -> Tests pass?
    +-- NO -> Fix or discard
    +-- YES -> JMH benchmark valid? (per benchmark-validity.md)
        +-- NO -> Fix benchmark and re-run
        +-- YES -> Metric improved?
            +-- >=5 MiB reduction (confirmed by heap histogram in separate JVM) -> KEEP
            +-- <5 MiB -> Re-run with forced GC in separate JVM to confirm
            +-- Leak fix (unbounded growth stopped) -> Always KEEP
            +-- GC pause reduction >=50ms (confirmed by JFR GC events from separate runs) -> KEEP even if heap unchanged
            +-- No improvement -> DISCARD
```
### Mechanism Explanation (mandatory for every KEEP)
Write one paragraph explaining WHY the optimization reduces memory or GC pauses at the JVM level. Example: "The original stored 500K Integer keys in a HashMap, requiring ~48 bytes per entry (32M total). The optimized version uses an EnumMap with enum ordinals as indices, requiring ~4 bytes per entry (2M total), reducing live heap by 30M and eliminating 250K Integer autoboxing allocations per request cycle."
### Record after each experiment
Update `.codeflash/results.tsv` AND `.codeflash/HANDOFF.md` immediately after every keep/discard. Update Hotspot Summary and Kept/Discarded sections in HANDOFF.md.
Update `.codeflash/results.tsv` AND `.codeflash/HANDOFF.md` immediately after every keep/discard. Update Hotspot Summary and Kept/Discarded sections in HANDOFF.md. Include the mechanism explanation for KEEPs.
### Mandatory re-profiling after KEEP
@@ -186,6 +215,8 @@ commit target_test target_mib heap_used_mib gc_pause_ms gc_count tests_passed te
## Deep References
For code examples, JMH templates, GC tuning recipes, leak detection patterns, and per-stage profiling:
- **`../references/benchmark-validity.md`** -- **MANDATORY** JMH benchmark validity checklist (anti-DCE, anti-constant-folding, warmup, forks, error bars)
- **`../references/correctness-verification.md`** -- **MANDATORY** Java output equivalence rules (deep equality, floats, collections, mutability, exceptions)
- **`../references/memory/guide.md`** -- JVM heap layout, GC algorithms, escape analysis, leak detection, GC tuning
- **`../references/data-structures/guide.md`** -- Primitive collections, memory-efficient structures
- **`../references/native/guide.md`** -- DirectByteBuffer, NMT, off-heap allocators

View file

@@ -0,0 +1,145 @@
# Java Benchmark Validity Checklist
**MANDATORY for every KEEP decision.** Before trusting any benchmark result for a Java/Kotlin optimization, verify ALL items below. A benchmark that violates any of these produces meaningless numbers — and meaningless numbers must not drive keep/discard decisions.
This checklist exists because the JVM's JIT compiler, garbage collector, and runtime optimizations actively manipulate benchmark measurements. Unlike interpreted languages, what the JVM executes is NOT what you wrote — the JIT rewrites your code at runtime. These checks ensure you are measuring your optimization, not a JVM artifact.
## The Five JVM Measurement Enemies
Each checklist item below defeats a specific JVM behavior that produces fake results:
| Enemy | What It Does | How It Fakes Results |
|-------|-------------|---------------------|
| **JIT Warmup** | JVM interprets code initially, then compiles to optimized native code after ~10,000 calls | First N calls are 10-100x slower than steady-state. Measuring during warmup compares interpreted vs compiled code, not original vs optimized. |
| **Dead Code Elimination (DCE)** | JIT removes computations whose results are never consumed | Benchmark reports near-zero time because the target function was never actually executed — the JIT eliminated it entirely. |
| **Constant Folding** | JIT pre-computes results when inputs are compile-time constants | Benchmark measures loading a pre-computed constant from a register, not executing the function. Both versions fold to the same constant — no measurable difference. |
| **GC Noise** | Garbage collection pauses stop application threads unpredictably | A GC pause landing in one measurement but not the other creates a false delta. An optimization that reduces allocations may show as a "CPU improvement" when it is actually a GC improvement. |
| **On-Stack Replacement (OSR)** | JIT compiles a loop while it is executing, producing differently optimized code than normal method compilation | Benchmark running one giant loop measures OSR-compiled code, which behaves differently from production code compiled via normal method invocation. |
## Checklist
### 1. Anti-Dead-Code-Elimination (Anti-DCE)
Every `@Benchmark` method must consume its result. The JIT will eliminate any computation whose result is not used.
**Verify**: Each benchmark method either:
- **Returns** the computed value (JMH consumes it automatically), OR
- Passes the result to `Blackhole.consume(result)`
**If neither**: The benchmark is **INVALID**. The JIT may have eliminated the entire computation. Do not trust the numbers.
**Common traps**:
- Calling a void method without verifying it has observable side effects
- Computing a value and storing it in a local variable that is never read
- Using `assert` to consume the value (asserts are disabled by default in the JVM)
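For illustration, a broken benchmark and two valid fixes (a sketch; `compute` stands in for the method under test):
```java
import java.util.concurrent.ThreadLocalRandom;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class DceExampleBenchmark {
    long input;

    @Setup(Level.Iteration)
    public void setup() {
        input = ThreadLocalRandom.current().nextLong();   // dynamic input, opaque to the JIT
    }

    static long compute(long x) {                          // stand-in for the method under test
        return Long.rotateLeft(x, 13) * 31;
    }

    @Benchmark
    public void broken() {
        compute(input);                  // result never consumed -> JIT may eliminate the call entirely
    }

    @Benchmark
    public long fixedByReturn() {
        return compute(input);           // returned value is consumed by JMH automatically
    }

    @Benchmark
    public void fixedByBlackhole(Blackhole bh) {
        bh.consume(compute(input));      // explicit consumption defeats dead code elimination
    }
}
```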
### 2. Anti-Constant-Folding
All benchmark inputs must be opaque to the JIT compiler. If the JIT can determine inputs at compile time, it pre-computes the result and replaces the function call with a constant.
**Verify**: All inputs come from one of:
- `@State` objects (populated in `@Setup` methods)
- `@Param` annotations (JMH injects these at runtime)
- `ThreadLocalRandom` or other runtime sources in `@Setup`
**If inputs are hardcoded literals** (e.g., `fibonacci(30)`, `processRecords(fixedList)`): The benchmark is **UNRELIABLE**. The JIT may fold the computation to a constant. Replace with `@Param` or `@State`-sourced inputs.
**Common traps**:
- Benchmark method receives its data from a `static final` field (the JIT treats these as constants)
- Input data is always the same across iterations (JIT may specialize for that specific input)
- String literals as benchmark inputs
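A minimal contrast between a foldable and an opaque input (a sketch; `hash` stands in for the method under test):
```java
import java.util.concurrent.ThreadLocalRandom;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class FoldingExampleBenchmark {
    static final long CONSTANT = 42L;    // visible to the JIT as a compile-time constant
    long dynamicInput;

    @Setup(Level.Iteration)
    public void setup() {
        dynamicInput = ThreadLocalRandom.current().nextLong();   // supplied at runtime, opaque to the JIT
    }

    static long hash(long x) {           // stand-in for the method under test
        return x * 0x9E3779B97F4A7C15L;
    }

    @Benchmark
    public long unreliable() {
        return hash(CONSTANT);           // JIT may fold this to a pre-computed constant
    }

    @Benchmark
    public long valid() {
        return hash(dynamicInput);       // input comes from @State, cannot be folded
    }
}
```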
### 3. Warmup Sufficiency
The JIT compiler has multiple tiers (interpreter -> C1 -> C2). Measurements taken before C2 compilation stabilizes are measuring the JIT compiler, not your code.
**Minimum floors** (non-negotiable):
| Operation Speed | Warmup Iterations | Measurement Iterations | Why |
|----------------|-------------------|----------------------|-----|
| Fast (<1 us) | >= 5 | >= 10 | Fast operations are most susceptible to JIT variance. More samples needed to average out noise. |
| Medium (1 us - 1 ms) | >= 5 | >= 5 | Standard warmup is usually sufficient at this scale. |
| Slow (>1 ms) | >= 3 | >= 5 | Each iteration provides significant signal. Less warmup needed but still must reach C2. |
**If warmup count is below these floors**: Increase it. Insufficient warmup produces unstable results that vary wildly between runs.
**How to detect insufficient warmup**: If the first 2-3 measurement iterations show significantly different scores than later iterations, warmup was insufficient. JMH's per-iteration output reveals this — check it.
### 4. Fork Isolation
The JIT compiler uses profile-guided optimization, which is **non-deterministic** — different JVM instances may compile the same code differently. A single-fork benchmark captures one JIT compilation path that may not be representative.
**Minimum**: `@Fork(2)` for standard optimizations. `@Fork(3)` for:
- Marginal improvements (<15% claimed speedup)
- GC-sensitive optimizations (heap state differs per fork)
- Concurrency optimizations (thread scheduling differs per fork)
**If `@Fork(1)` or `@Fork(0)`**: The result is **NOT GENERALIZABLE**. The improvement may exist only under one specific JIT compilation path. Re-run with at least `@Fork(2)`.
**Why forking matters**: Each fork starts a completely fresh JVM. This means fresh JIT decisions, fresh heap, fresh GC state. Consistent results across forks prove the improvement is real, not a JIT artifact.
### 5. Statistical Significance (Error Bar Check)
JMH reports results as `Score +/- Error` at 99.9% confidence. The error represents the confidence interval — the true value lies within `[Score - Error, Score + Error]` with 99.9% probability.
**The rule**: If the error bars of baseline and optimized **overlap**, the improvement is **NOT PROVEN**.
**How to check**:
```
Baseline: 245.3 +/- 12.1 ns/op -> range [233.2, 257.4]
Optimized: 189.7 +/- 8.4 ns/op -> range [181.3, 198.1]
Ranges do NOT overlap -> improvement is PROVEN
```
```
Baseline: 245.3 +/- 40.0 ns/op -> range [205.3, 285.3]
Optimized: 230.1 +/- 35.0 ns/op -> range [195.1, 265.1]
Ranges OVERLAP -> improvement is NOT PROVEN -> DISCARD or re-run with more iterations/forks
```
**Additional rule**: The improvement percentage must be at least **2x the relative error margin**. Example: if the error is +/- 5% of the score, the improvement must be at least 10% to be credible. A "7% speedup" with 5% error bars is noise.
**If error bars overlap**: Increase `@Measurement(iterations)` or `@Fork` count and re-run. If they still overlap after 20+ measurement iterations across 3+ forks, the improvement is not real — DISCARD.
### 6. GC Isolation (for memory-affecting optimizations)
If the optimization changes allocation patterns (reduces object creation, changes collection types, pools objects), GC behavior changes as a side effect. Measuring in the same JVM contaminates results because the "after" run inherits heap state from the "before" run.
**Verify**: JMH `@Fork(N)` naturally handles this (each fork is a fresh JVM). For GC-sensitive optimizations, use `@Fork(3)` minimum.
**Additional verification for GC claims**: If you claim "GC pauses reduced," collect JFR GC events from both baseline and optimized runs separately:
```bash
# Baseline:
jfr print --events jdk.GCPhasePause baseline.jfr | grep "duration"
# Optimized:
jfr print --events jdk.GCPhasePause optimized.jfr | grep "duration"
```
Compare pause distributions, not just averages. A single outlier GC pause can skew the average.
## Quick Reference: Validity Decision
```
All results consumed? (anti-DCE)
+-- NO -> INVALID — do not trust
+-- YES -> All inputs dynamic? (anti-constant-folding)
    +-- NO -> UNRELIABLE — replace inputs and re-run
    +-- YES -> Warmup meets minimum floors?
        +-- NO -> INSUFFICIENT — increase warmup and re-run
        +-- YES -> At least 2 forks?
            +-- NO -> NOT GENERALIZABLE — add forks and re-run
            +-- YES -> Error bars do not overlap?
                +-- OVERLAP -> NOT PROVEN — increase iterations/forks or DISCARD
                +-- NO OVERLAP -> VALID — proceed with keep/discard decision
```
## When the Agent Cannot Use JMH
In rare cases, JMH may not be feasible (e.g., project has incompatible build configuration, test environment constraints). In that case:
1. Document why JMH is not feasible in `.codeflash/HANDOFF.md`
2. Use the best available alternative (e.g., JFR-based timing, manual warmup loop with System.nanoTime)
3. Apply the same principles manually: warmup before measuring, consume results, use dynamic inputs, run multiple separate invocations
4. **Raise the improvement threshold to 20%** (to compensate for measurement uncertainty)
5. Mark the confidence level as LOW in results.tsv
6. Note in the commit message that the benchmark was non-JMH and therefore lower confidence

View file

@@ -0,0 +1,165 @@
# Java Correctness Verification
**MANDATORY before benchmarking.** An optimization that changes behavior is a bug, not an improvement. This checklist covers Java-specific correctness traps that compile cleanly but break at runtime.
Correctness verification happens BEFORE benchmarking in the experiment loop (step 8). If any check fails, DISCARD immediately — do not proceed to benchmarking.
## Why Java Correctness Is Harder Than It Looks
Java has several language features that make behavior verification non-trivial:
- **Reference equality vs value equality**: `==` compares memory addresses for objects, not content. Two objects with identical fields are `!=` unless they are literally the same object.
- **Mutable method parameters**: Java passes object references by value — methods can (and often do) modify the objects passed to them. An optimization that stops mutating a parameter changes behavior for every caller.
- **Checked exceptions as API contract**: Java method signatures declare which exceptions they throw. Changing the exception type — even to a "more correct" one — breaks callers that catch the original type.
- **Collection iteration order**: `HashMap` does not guarantee order. Code that depends on accidental HashMap ordering breaks when you switch to a different Map implementation.
- **Serialization contracts**: Objects that are serialized (JSON, Java serialization, protobuf) have implicit contracts about their structure. Changing internal collection types can change serialized output.
## Verification Checklist
### 1. Return Value Equivalence
Compare the return value of the original and optimized versions using deep equality, not reference equality.
**Method**: Run both versions with identical inputs. Compare outputs using:
- `.equals()` for types that override it correctly (String, Integer, collections, well-implemented domain objects)
- Serialization-based comparison (Kryo, JSON) for complex object graphs where `.equals()` may not be implemented
- Field-by-field comparison for objects without `.equals()` override
**Traps to check**:
- `equals()` not overridden: defaults to reference equality (`==`), which always returns false for different objects even with identical content
- `hashCode()` inconsistent with `equals()`: objects may be "equal" but behave differently in hash-based collections
- `Comparable.compareTo()` inconsistent with `equals()`: TreeMap/TreeSet behavior differs from HashMap/HashSet
**If the function returns void**: Skip return value comparison but verify all other checks (especially mutable parameter state and side effects).
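A sketch of content-based comparison for a domain type without an `equals()` override (`Order`, `originalLookup`, and `optimizedLookup` are hypothetical stand-ins):
```java
import java.util.List;
import java.util.Objects;

final class ReturnValueEquivalence {
    private ReturnValueEquivalence() {}

    static void verify(List<String> input) {
        Order original = originalLookup(input);      // hypothetical original implementation
        Order optimized = optimizedLookup(input);    // hypothetical optimized implementation

        // WRONG: reference equality is always false for two distinct objects
        // if (original != optimized) { ... }

        // Collections and types with a correct equals(): compare content directly
        if (!Objects.equals(original.lineItems, optimized.lineItems)) {
            throw new AssertionError("Line items differ");
        }
        // Type without an equals() override: fall back to field-by-field comparison
        if (original.totalCents != optimized.totalCents) {
            throw new AssertionError("Totals differ");
        }
    }

    // Hypothetical stand-in so the sketch compiles; replace with the real type and implementations under test.
    static final class Order {
        final List<String> lineItems;
        final long totalCents;
        Order(List<String> lineItems, long totalCents) {
            this.lineItems = lineItems;
            this.totalCents = totalCents;
        }
    }
    static Order originalLookup(List<String> in) { return new Order(in, in.size() * 100L); }
    static Order optimizedLookup(List<String> in) { return new Order(List.copyOf(in), in.size() * 100L); }
}
```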
### 2. Floating-Point Tolerance
Java `double` and `float` follow IEEE 754, which means:
- Arithmetic is not associative: `(a + b) + c != a + (b + c)` in some cases
- `Double.NaN != Double.NaN` (NaN is not equal to itself)
- `0.0 == -0.0` is true but they have different bit patterns
- Different computation orders produce slightly different results due to rounding
**Method**: Compare floating-point results using relative epsilon tolerance:
- `double`: relative tolerance of 1e-10 (or absolute tolerance of 1e-15 for values near zero)
- `float`: relative tolerance of 1e-5 (or absolute tolerance of 1e-7 for values near zero)
**Formula**: `|original - optimized| <= max(relativeEpsilon * max(|original|, |optimized|), absoluteEpsilon)`
**If exact floating-point equality is required by the function's contract** (e.g., a serializer, a hash function): Use `Double.doubleToRawLongBits()` for bitwise comparison instead of epsilon. Document this requirement.
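The formula above as a small helper (a sketch; tolerance constants mirror the `double` defaults listed above):
```java
final class FloatTolerance {
    private FloatTolerance() {}

    /** True if two doubles match within a combined relative/absolute tolerance. */
    static boolean approxEquals(double original, double optimized) {
        final double relativeEpsilon = 1e-10;   // relative tolerance for double
        final double absoluteEpsilon = 1e-15;   // absolute floor for values near zero
        if (Double.isNaN(original) || Double.isNaN(optimized)) {
            return Double.isNaN(original) && Double.isNaN(optimized);   // treat NaN as matching NaN for oracle purposes
        }
        double diff = Math.abs(original - optimized);
        double scale = Math.max(Math.abs(original), Math.abs(optimized));
        return diff <= Math.max(relativeEpsilon * scale, absoluteEpsilon);
    }
}
```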
### 3. Collection Ordering Semantics
Different collection types have different ordering guarantees. Swapping one for another can silently change iteration order.
| Collection Type | Order Guarantee | Comparison Method |
|----------------|----------------|-------------------|
| `List` (ArrayList, LinkedList) | Insertion order preserved | Compare elements in order |
| `LinkedHashMap` | Insertion order preserved | Compare entries in order |
| `TreeMap` / `TreeSet` | Sorted by key/element | Compare elements in order |
| `HashMap` / `HashSet` | **No order guarantee** | Compare contents only (ignore order) |
| `ConcurrentHashMap` | **No order guarantee** | Compare contents only (ignore order) |
| `EnumMap` / `EnumSet` | Enum declaration order | Compare elements in order |
**Method**: When comparing outputs that contain collections:
1. Identify the collection type used by the **original** code
2. If the original uses an unordered collection (HashMap, HashSet): compare **contents** regardless of iteration order
3. If the original uses an ordered collection (List, LinkedHashMap, TreeMap): compare contents **AND** order
4. If the optimization changes the collection type: verify the new type's ordering semantics are compatible with all downstream consumers
**Trap**: The original code may use `HashMap` but downstream code accidentally depends on iteration order (e.g., serializes the map to JSON, and tests compare JSON strings). The optimization changes nothing semantically, but tests break because JSON key order changed. This is a test problem, not an optimization problem — but it must be addressed before merging.
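A sketch of the two comparison modes (inputs are hypothetical original/optimized outputs):
```java
import java.util.List;
import java.util.Map;

final class CollectionComparison {
    private CollectionComparison() {}

    // Original used a List or LinkedHashMap: contents AND order must match.
    static void assertOrderedEquivalent(List<String> original, List<String> optimized) {
        if (!original.equals(optimized)) {               // List.equals is order-sensitive
            throw new AssertionError("Ordered contents differ");
        }
    }

    // Original used a HashMap or HashSet: compare contents only, ignoring iteration order.
    static void assertUnorderedEquivalent(Map<String, Integer> original, Map<String, Integer> optimized) {
        if (!original.equals(optimized)) {               // Map.equals compares entry sets, not iteration order
            throw new AssertionError("Map contents differ");
        }
    }
}
```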
### 4. Mutable Parameter State
Java methods frequently mutate their input parameters. If the original method modifies a list, map, or object passed to it, the optimized version must produce the same mutations.
**Method**: For each parameter that is a mutable object (not a primitive, not a String):
1. Capture the parameter state **before** calling the original
2. Call the original, capture the parameter state **after**
3. Reset the parameter to the "before" state
4. Call the optimized version, capture the parameter state **after**
5. Compare the two "after" states
**Common mutations to check**:
- Elements added to or removed from a List/Set/Map parameter
- Fields modified on an object parameter
- Array elements modified
- Counters incremented on a passed-in accumulator object
- Iterators advanced on a passed-in iterator
**If the optimization eliminates a parameter mutation** (e.g., original sorts a list in-place, optimized returns a new sorted list): This is a **behavioral change**. DISCARD unless:
- You verified ALL callers of this method and none depends on the mutation
- You update the method signature/documentation to reflect the new contract
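A sketch of the snapshot-and-compare procedure for a mutable `List` parameter, with the two implementations passed in as in-place operations:
```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class MutationEquivalence {
    private MutationEquivalence() {}

    /** original and optimized are the two implementations under test, applied as in-place operations. */
    static void verifyParameterMutation(List<Integer> sampleInput,
                                        Consumer<List<Integer>> original,
                                        Consumer<List<Integer>> optimized) {
        // Steps 1-2: run the original on a copy and keep its "after" state
        List<Integer> afterOriginal = new ArrayList<>(sampleInput);
        original.accept(afterOriginal);

        // Steps 3-4: reset to the "before" state and run the optimized version
        List<Integer> afterOptimized = new ArrayList<>(sampleInput);
        optimized.accept(afterOptimized);

        // Step 5: the two "after" states must match (same elements, same order)
        if (!afterOriginal.equals(afterOptimized)) {
            throw new AssertionError("Optimized version mutates its parameter differently");
        }
    }
}
```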
### 5. Exception Contract Preservation
Java methods have explicit (checked) and implicit (unchecked) exception contracts. Changing which exceptions are thrown — even in error cases — breaks callers.
**Method**: Test with inputs that trigger exceptions in the original. Verify:
1. **Same exception types**: If original throws `IllegalArgumentException`, optimized must not throw `NullPointerException` for the same input
2. **Same conditions**: If original throws on null input, optimized must also throw on null input (not silently return null)
3. **Same message content** (if the message is part of the API contract — rare but possible)
**Common traps**:
- Optimization removes an explicit null check, causing a `NullPointerException` deep in the code instead of an `IllegalArgumentException` at the entry point
- Optimization catches a broader exception type (e.g., catches `Exception` instead of `IOException`), swallowing exceptions the caller expected to propagate
- Optimization changes loop bounds, causing `IndexOutOfBoundsException` on edge case inputs that the original handled
**If the optimization improves exception behavior** (e.g., adds a missing null check that the original should have had): Document this as a behavioral improvement in the commit message. Do not silently change exception behavior.
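A sketch of checking that both versions fail the same way on the same bad input, with the two implementations passed in as functions:
```java
import java.util.Objects;
import java.util.function.Function;

final class ExceptionContract {
    private ExceptionContract() {}

    /** Both implementations must throw the same exception type (or both succeed) for the same input. */
    static void verifySameFailureMode(String badInput,
                                      Function<String, Object> original,
                                      Function<String, Object> optimized) {
        Class<?> originalFailure = failureType(original, badInput);
        Class<?> optimizedFailure = failureType(optimized, badInput);
        if (!Objects.equals(originalFailure, optimizedFailure)) {
            throw new AssertionError("Exception contract changed: original threw " + originalFailure
                    + " but optimized threw " + optimizedFailure);
        }
    }

    private static Class<?> failureType(Function<String, Object> impl, String input) {
        try {
            impl.apply(input);
            return null;                         // no exception thrown
        } catch (RuntimeException e) {           // covers unchecked exceptions; adapt for checked ones
            return e.getClass();
        }
    }
}
```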
### 6. Side Effect Preservation
If the method produces observable effects beyond its return value — writing to files, logging, sending network requests, updating shared state — the optimized version must preserve those effects.
**Method**: Identify all side effects in the original:
- I/O operations (file writes, network calls, database updates)
- Logging calls (log level, message content)
- Static field modifications
- Cache updates (adding/removing entries from shared caches)
- Event publishing (observer/listener notifications)
**If the optimization removes or reorders side effects**: This is a behavioral change. Either:
- Restore the side effects in the optimized version, OR
- Document the change explicitly in the commit message and verify no consumer depends on the removed side effect
### 7. Serialization Compatibility
If the modified class or its fields are serialized anywhere in the codebase, changing internal types can break serialization.
**Check**: Search the codebase for serialization usage:
- Java serialization: `implements Serializable`, `serialVersionUID`, `ObjectOutputStream`
- Jackson/JSON: `@JsonProperty`, `ObjectMapper`, `@JsonSerialize`
- Protobuf: `.proto` files, `GeneratedMessageV3`
- Kryo: `kryo.register()`, `KryoSerializer`
**Common traps when changing collection types**:
- `ArrayList` to `List.of()`: Java serialization fails (List.of returns an unserializable implementation in some JDK versions)
- `HashMap` to `EnumMap`: Jackson serializes enum keys as strings vs integers depending on configuration
- `TreeMap` to `HashMap`: JSON key order changes, breaking string-based comparisons in tests or APIs
**If the object is serialized**: Verify that the serialized output is byte-compatible (for Java serialization) or semantically equivalent (for JSON/protobuf) between original and optimized versions.
## Verification Workflow
For each optimization, before benchmarking:
1. **Identify the function's contract**: What does it promise? (return value, mutations, exceptions, side effects)
2. **Prepare representative inputs**: Normal cases, edge cases (empty, null, single element, maximum size), and error cases
3. **Run original**: Capture return value, parameter state after call, any side effects
4. **Run optimized**: Same inputs, capture same outputs
5. **Compare**: Apply checks 1-7 above
6. **If ANY check fails**: DISCARD immediately. Do not proceed to benchmarking.
7. **If ALL checks pass**: Proceed to benchmarking with confidence that the optimization preserves behavior.
## Edge Cases That Must Be Tested
For every optimization, test these inputs at minimum:
| Input Category | Examples | Why |
|---------------|---------|-----|
| **Empty** | empty list, empty string, empty map, 0, null | Optimizations often skip empty cases differently |
| **Single element** | list with 1 item, string with 1 char | Off-by-one errors in loop restructuring |
| **Boundary** | Integer.MAX_VALUE, Integer.MIN_VALUE, Long overflow | Arithmetic optimizations may overflow differently |
| **Null** | null parameter, null element in collection, null map value | Null handling is a frequent source of behavioral changes |
| **Concurrent** (if applicable) | Multiple threads calling simultaneously | Thread-safety regressions from lock changes |
| **Large** | 10x the normal input size | Algorithmic changes may have different constant factors at scale |
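A sketch of driving the equivalence checks across these categories with JUnit 5 (`originalJoin` and `optimizedJoin` are hypothetical stand-ins for the implementations under test):
```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.MethodSource;

class EdgeCaseEquivalenceTest {

    static Stream<List<String>> edgeCaseInputs() {
        return Stream.of(
                Collections.<String>emptyList(),            // empty
                Collections.singletonList("only"),          // single element
                Arrays.asList("a", null, "c"),              // null element in the collection
                Collections.nCopies(100_000, "x"));         // large input (well above normal size)
    }

    @ParameterizedTest
    @MethodSource("edgeCaseInputs")
    void optimizedMatchesOriginal(List<String> input) {
        Assertions.assertEquals(originalJoin(input), optimizedJoin(input));
    }

    // Hypothetical stand-ins; replace with the real original and optimized implementations.
    static String originalJoin(List<String> in) { return String.join(",", in); }
    static String optimizedJoin(List<String> in) { return String.join(",", in); }
}
```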

View file

@@ -497,38 +497,69 @@ For dynamic dispatch: use `LambdaMetafactory` to generate a functional interface
## JMH Benchmark Template
Standard template for validating collection/algorithm changes:
Standard template for validating collection/algorithm changes. **Every JMH benchmark must be validated against `../benchmark-validity.md` before trusting results.**
```java
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 10, time = 1)
@Fork(2)
@Warmup(iterations = 5, time = 1) // MINIMUM: 5 warmup iterations for fast ops, 3 for slow
@Measurement(iterations = 10, time = 1) // MINIMUM: 5 measurement iterations
@Fork(2) // MINIMUM: 2 forks (3 for GC-sensitive or marginal improvements)
public class CollectionBenchmark {
@Param({"100", "10000", "1000000"})
@Param({"100", "10000", "1000000"}) // ANTI-CONSTANT-FOLDING: inputs from @Param, not literals
int size;
private List<Integer> data;
@Setup(Level.Trial)
public void setup() {
// ANTI-CONSTANT-FOLDING: use random data, not predictable sequences
data = ThreadLocalRandom.current()
.ints(size).boxed().collect(Collectors.toList());
}
@Benchmark
public int baseline(Blackhole bh) {
// ANTI-DCE: either return the result OR use bh.consume(result)
// If the result is not consumed, the JIT may eliminate the entire computation
// original implementation
}
@Benchmark
public int optimized(Blackhole bh) {
// ANTI-DCE: same consumption requirement
// optimized implementation
}
}
```
**Key rules:** Always use `Blackhole` or return results (prevent DCE). Use `@Param` for multiple sizes (prevent constant folding). Use `@Fork(2)+` to isolate JIT behavior across runs.
### JMH Validity Rules (non-negotiable)
These rules defeat the five JVM measurement enemies. Violating any of them produces fake numbers.
**1. Anti-Dead-Code-Elimination (DCE):** Every `@Benchmark` method must either **return** the computed value (JMH consumes it automatically) or pass it to `Blackhole.consume()`. If the result is not consumed, the JIT compiler will eliminate the entire computation and you will measure an empty method. This is the most common source of "infinite speedup" results.
**2. Anti-Constant-Folding:** All inputs must come from `@State` or `@Param`, never from hardcoded literals or `static final` fields. The JIT pre-computes results when it can determine inputs at compile time. If you benchmark `fibonacci(30)` with a literal 30, the JIT may fold it to a constant after enough iterations.
**3. Warmup Floors:** At least 5 warmup iterations for operations faster than 1 microsecond. At least 3 for slower operations. The JVM's JIT compiler has multiple tiers (interpreter -> C1 -> C2). Measurements before C2 stabilization are measuring the JIT compiler, not your code.
**4. Fork Isolation:** At least `@Fork(2)`. JIT profile-guided optimization is non-deterministic — different JVM instances may compile the same code differently. A single fork captures one JIT path that may not generalize. Use `@Fork(3)` for GC-sensitive or marginal (<15%) improvements.
**5. Error Bar Check:** After running, check JMH output: `Score +/- Error`. If the error bars of baseline and optimized **overlap**, the improvement is NOT PROVEN. Re-run with more iterations or forks. The improvement percentage must be at least 2x the relative error margin.
### Reading JMH Output
```
Benchmark (size) Mode Cnt Score Error Units
CollectionBenchmark.baseline 100 avgt 20 245.3 ± 12.1 ns/op
CollectionBenchmark.optimized 100 avgt 20 189.7 ± 8.4 ns/op
```
- **Score**: average time per operation (lower is better for `avgt`)
- **Error**: 99.9% confidence interval
- **Cnt**: total measurement iterations across all forks
- **Check**: baseline range [233.2, 257.4], optimized range [181.3, 198.1] — no overlap, improvement is real
See `../benchmark-validity.md` for the complete validity decision tree.

View file

@ -8,6 +8,8 @@ LOOP (until plateau detected or user requests stop):
**Print a status line before each step** so the user can follow progress (see Progress Updates in the agent prompt).
**IMPORTANT (Java/Kotlin):** If the project is Java or Kotlin, read `${CLAUDE_PLUGIN_ROOT}/languages/java/references/benchmark-validity.md` and `${CLAUDE_PLUGIN_ROOT}/languages/java/references/correctness-verification.md` before entering this loop. JMH is mandatory for every KEEP decision, and correctness verification must happen before benchmarking. These are non-negotiable requirements that exist because the JVM's JIT compiler actively manipulates measurements — see those files for details.
1. **Review git history.** Before choosing a target, read recent experiment history to learn from past attempts:
```bash
git log --oneline -20 # experiment sequence — what was tried
@ -23,15 +25,17 @@ LOOP (until plateau detected or user requests stop):
7. **Verify benchmark fidelity.** Re-read the benchmark and confirm it exercises the exact code path and parameters you changed. If you modified function arguments, wrapper flags, pool sizes, or configuration, the benchmark must use the same values. If the benchmark was written before step 6, the implementation may have changed assumptions — update the benchmark to match. A benchmark that doesn't mirror the production change proves nothing.
8. **Verify output equivalence.** Run the optimized version with the same inputs from step 4 and compare outputs. If outputs differ, **discard immediately** — this is a correctness regression, not an optimization. Do not proceed to benchmarking.
9. **Benchmark**: Run target test. Print `[experiment N] Benchmarking...`. Always run for correctness, even for micro-only optimizations.
10. **Verify benchmark validity** (Java/Kotlin). If the project is Java or Kotlin, verify the benchmark against the language-specific benchmark validity checklist before trusting results. A benchmark affected by dead code elimination, constant folding, insufficient warmup, or inadequate forking produces meaningless numbers. See domain file for the specific checklist reference. If the benchmark is invalid, fix it and re-run — do not proceed with invalid measurements.
11. **Guard** (if configured). Run the guard command (see Guard Command below). If the guard fails, the optimization broke something — revert and rework (max 2 attempts), then discard if still failing.
12. **Read results**: pass/fail, metrics. Print the domain-specific result line (see domain file).
13. If crashed or regressed, fix or discard immediately.
14. **Confirm small deltas**: If improvement is below the domain's noise threshold, re-run to confirm not noise. For Java/Kotlin: if JMH error bars overlap, the improvement is NOT PROVEN — re-run with more iterations/forks or DISCARD.
15. **Record** in `.codeflash/results.tsv` (schema in domain file).
16. **Keep/discard** (see decision tree in domain file). Print `[experiment N] KEEP` or `[experiment N] DISCARD — <reason>`.
17. **Mechanism explanation** (after KEEP). Write a one-paragraph explanation of WHY the optimization improves performance — the specific mechanism, not just "it's faster." This catches measurement artifacts: if you cannot explain the mechanism, the improvement may be fake. See domain file for examples.
18. **E2E benchmark** (after KEEP, when available). If `codeflash compare` is available (see `e2e-benchmarks.md`), run `$RUNNER -m codeflash compare <pre-opt-sha> HEAD` to get authoritative isolated measurements. Record e2e results alongside micro-bench results in `results.tsv`. If e2e contradicts micro-bench (e.g., micro showed 15% but e2e shows <2%), re-evaluate the keep decision and trust the e2e measurement. Print `[experiment N] E2E: <base>ms → <head>ms (<speedup>x)`.
19. **Config audit** (after KEEP). Check for related configuration flags that may have become dead or inconsistent after your change. Infrastructure changes (drivers, pools, middleware) often leave behind no-op config. Remove or update stale flags.
20. **Milestones** (every 3-5 keeps): Run full benchmark (including `codeflash compare <baseline-sha> HEAD` for cumulative e2e measurement), create milestone branch. **Milestone sanity check**: compare the cumulative improvement with the sum of individual experiment improvements. If cumulative < 70% of the sum, at least one KEEP is a false positive; investigate (see domain file for the recovery procedure, and the worked example below). Print `[milestone] vN — <total kept>/<total experiments>, cumulative <metric>`.
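A worked example of the sanity check in step 20 (hypothetical numbers; runnable as-is in `jshell`):
```java
// Individual micro-bench improvements reported by four KEEPs: 12%, 8%, 15%, 5%
double[] gains = {0.12, 0.08, 0.15, 0.05};
double sum = 0;
for (double g : gains) sum += g;             // 0.40

double cumulative = 0.22;                    // cumulative improvement measured at the milestone

boolean suspect = cumulative < 0.70 * sum;   // 0.22 < 0.28 -> true: at least one KEEP is likely a false positive
```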
## Keep/Discard Decision Tree — Common Structure

View file

@ -45,6 +45,16 @@ Cross-check your implementation against what the PR claims:
- **Tests cover the alternate paths** your change affects. If the function is called from both sync and async contexts, test both.
- **Regression tests for edge cases** mentioned in your analysis (e.g., empty input, single element, concurrent access).
## 6. Java/Kotlin-Specific Checks
If the project is Java or Kotlin, also verify:
- **JMH benchmark validity**: For every KEEP, confirm the benchmark was validated per the language-specific benchmark validity checklist (anti-DCE, anti-constant-folding, warmup floors, fork count, error bar non-overlap). If any KEEP lacks valid JMH evidence, re-run the benchmark now.
- **Mechanism explanation present**: Every KEEP commit message must contain a one-paragraph explanation of WHY the optimization is faster at the JVM level — the specific mechanism (e.g., "eliminates autoboxing allocations" or "replaces O(n^2) nested loop with HashMap index"). If you wrote "improved performance" without explaining the mechanism, fix the commit message.
- **JDK version compatibility**: Verify every optimization uses only APIs available on the project's minimum JDK version (from `.codeflash/setup.md`). Common traps: `List.of()` requires JDK 9, `String.isBlank()` requires JDK 11, virtual threads require JDK 21. An optimization that uses unavailable APIs will not compile against the production toolchain (see the sketch after this list).
- **Correctness verification complete**: For every KEEP, confirm output equivalence was verified per the language-specific correctness verification checklist (return value deep equality, floating-point tolerance, collection ordering, mutable parameters, exception contracts, serialization compatibility).
- **Milestone sanity check passed**: If milestones were reached, confirm the cumulative improvement was at least 70% of the sum of individual improvements. If not, at least one KEEP is a false positive that should be reverted.
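For the JDK version compatibility check, a sketch of a typical trap and its JDK 8-compatible rewrite, assuming the project's minimum is JDK 8 and the usual `java.util` imports; running `javac --release <minVersion>` over the changed files is a quick way to catch these.
```java
// Requires JDK 9+; fails to compile with `javac --release 8`
List<String> flags = List.of("fast", "safe");

// JDK 8-compatible equivalent
List<String> legacyFlags = Collections.unmodifiableList(Arrays.asList("fast", "safe"));
```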
## How to Run
At the end of the experiment loop, before sending `[complete]`: