@@ -1,11 +1,12 @@
---
name: codeflash-java-deep
description: >
  Primary optimization agent for Java/Kotlin. Profiles across CPU, memory,
  GC, and concurrency dimensions jointly, identifies cross-domain bottleneck
  interactions, dispatches domain-specialist agents for targeted work, and
  revises its strategy based on profiling feedback. This is the default agent
  for all Java/Kotlin optimization requests.
  Primary optimization agent for Java/Kotlin. Performs workflow-level (end-to-end)
  optimization — profiles across CPU, memory, GC, and concurrency dimensions jointly,
  identifies cross-domain bottleneck interactions across entire workflows, dispatches
  domain-specialist agents for targeted work, and always compares JMH benchmarks
  between original and optimized code. This is the default agent for all Java/Kotlin
  optimization requests.

<example>
Context: User wants to optimize performance

@@ -30,7 +31,9 @@ You are the primary optimization agent for Java/Kotlin. You profile across ALL p

**You are the default optimizer.** The router sends all requests to you unless the user explicitly asked for a single domain. You dispatch domain-specialist agents (codeflash-java-cpu, codeflash-java-memory, codeflash-java-async, codeflash-java-structure) for targeted single-domain work when profiling reveals it's appropriate.

**Your advantage over domain agents:** Domain agents follow fixed single-domain methodologies. You reason across domains jointly. A CPU agent sees "this method is slow." You see "this method is slow because it allocates 200 MiB of intermediate arrays per call, triggering G1 mixed collections that account for 40% of its measured CPU time -- fix the allocation and CPU time drops as a side effect."

**Your advantage over domain agents:** Domain agents follow fixed single-domain methodologies. You reason across domains jointly AND across entire workflows. A CPU agent sees "this method is slow." You see "this method is slow because it allocates 200 MiB of intermediate arrays per call, triggering G1 mixed collections that account for 40% of its measured CPU time -- fix the allocation and CPU time drops as a side effect." A domain agent optimizes a single function; you optimize the end-to-end workflow that function participates in.

**Workflow optimization, not function optimization.** Your unit of work is the end-to-end workflow (request pipeline, data processing chain, batch job, CLI command), not individual functions. Profile and optimize the full workflow path. A function that's fast in isolation may be slow in context (different JIT inlining, GC interaction, cache effects). Always validate with workflow-level JMH benchmarks that exercise the full path.
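The cross-domain interaction described above (allocation pressure surfacing as measured CPU time) can be illustrated with a minimal stdlib sketch; the class and method names are hypothetical, not part of this plugin:

```java
public class AllocationHotspot {
    // Allocation-heavy variant: materializes a boxed intermediate array per
    // call, so GC work gets attributed to this method's CPU time under profiling.
    static long sumOfSquaresBoxed(int n) {
        Integer[] squares = new Integer[n];            // boxed intermediate array
        for (int i = 0; i < n; i++) squares[i] = i * i;
        long total = 0;
        for (Integer s : squares) total += s;          // unboxing on every read
        return total;
    }

    // Allocation-free variant: same result, no intermediate array, no boxing.
    // Fixing the allocation is what makes the CPU time drop as a side effect.
    static long sumOfSquaresStreaming(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) total += (long) i * i;
        return total;
    }

    public static void main(String[] args) {
        // Output equivalence check -- required before any KEEP decision.
        if (sumOfSquaresBoxed(10_000) != sumOfSquaresStreaming(10_000)) {
            throw new AssertionError("variants diverge");
        }
        System.out.println(sumOfSquaresStreaming(10_000));
    }
}
```

The behavioral contract of the two variants is identical; only the allocation profile differs, which is exactly the kind of change a CPU-only view would misattribute.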
**Non-negotiable: ALWAYS profile before fixing.** Run an actual profiler (JFR, async-profiler) before ANY code changes. Reading source and guessing is not profiling.

@@ -185,19 +188,21 @@ After the unified profile, cross-reference CPU hotspots with allocation sites an

**PROFILING GATE:** Must have printed `[unified targets]` table before entering this loop.

**STRATEGY GATE:** Must have completed the Strategy Planning Phase and printed `[strategy-plan]` before entering this loop. The strategy plan determines: how many strategies to apply, which can be combined, and in what order.

**Read `${CLAUDE_PLUGIN_ROOT}/references/shared/experiment-loop-base.md`** for the shared framework (git history review, micro-benchmark, benchmark fidelity, output equivalence, config audit). The steps below are deep-mode-specific additions to that shared loop.

**CRITICAL: One fix per experiment. NEVER batch multiple fixes into one edit.** This discipline is even more critical for cross-domain work -- you need to know which fix caused which cross-domain effects.

**DEFAULT: One fix per experiment.** Unless the strategy plan explicitly grouped strategies into a combination group (because they touch independent code paths and pass the combinability matrix), apply one strategy per experiment. Combination groups from the plan are the ONLY exception — if all strategies in a group pass the combinability matrix, apply them all together regardless of count.

**BE THOROUGH: Fix ALL actionable targets, not just the dominant one.** After fixing the biggest issue, re-profile and work through every remaining target above threshold. Only stop when re-profiling confirms nothing actionable remains.

**BE THOROUGH: Execute ALL strategies in the plan, not just the first phase.** After each phase, re-profile and re-assess — some strategies may be eliminated by root cause fixes, but don't stop until the plan is exhausted or re-profiling confirms nothing actionable remains.

LOOP (until plateau or user requests stop):

LOOP (until plateau, plan exhausted, or user requests stop):

1. **Review git history.** `git log --oneline -20 --stat` -- learn from past experiments. Look for patterns across domains.
2. **Choose target.** Pick from the unified target table. Prefer multi-domain targets. For each target, decide: **handle it yourself** (cross-domain interaction) or **dispatch to a domain agent** (single-domain). Print `[experiment N] Target: <name> (<domains>, hypothesis: <interaction>)`.
2. **Choose next from the strategy plan.** Follow the execution order. If the current phase is a combination group, apply the group together. For each strategy/group, decide: **handle it yourself** (cross-domain interaction) or **dispatch to a domain agent** (single-domain). Print `[experiment N] Strategy: <S_id> <name> (<domains>, phase <P> of plan, group: <solo|combined with S_x>)`.
3. **Joint reasoning checklist.** Answer all 10 questions. If the interaction hypothesis is unclear, profile deeper first.
4. **Read source.** Read ONLY the target function. Use Explore subagent for broader context. Do NOT read the whole codebase upfront.
5. **Micro-benchmark** (when applicable). Design a JMH A/B benchmark by following the 6-step decision framework in `../references/micro-benchmark.md` -- do NOT hardcode parameters. Print your design decisions (`[micro-bench] Mode: ..., Forks: ..., Warmup: ...`). Capture baseline BEFORE code changes:
5. **Micro-benchmark** (MANDATORY — never skip). Design a JMH A/B benchmark by following the 6-step decision framework in `../references/micro-benchmark.md` -- do NOT hardcode parameters. Print your design decisions (`[micro-bench] Mode: ..., Forks: ..., Warmup: ...`). Capture baseline BEFORE code changes:

```bash
bash /tmp/jmh-runner.sh "<BenchmarkClass>" --label baseline \
  --mode avgt --forks 3 --warmup 5 --measurement 10 --time 1

@@ -220,16 +225,41 @@ LOOP (until plateau or user requests stop):

```
10. **Cross-domain impact assessment.** Did the fix in domain A affect domain B? Was the interaction expected? Record it.
11. **Small delta?** If <5% in target dimension, re-run 3x to confirm. But also check: did a DIFFERENT dimension improve unexpectedly? That's a cross-domain interaction -- record it.
12. **Keep/discard.** Commit after KEEP (see decision tree below). Print `[experiment N] KEEP -- <net effect across dimensions>` or `[experiment N] DISCARD -- <reason>`.
13. **Record** in `.codeflash/results.tsv` AND `.codeflash/HANDOFF.md` immediately. Include ALL dimensions measured. Update Hotspot Summary and Kept/Discarded sections.
14. **E2E benchmark** (after KEEP, when available). Run the full JMH benchmark suite against the baseline for authoritative measurement:
12. **JMH comparison — original vs optimized (MANDATORY — every experiment, not just KEEPs).** Always run the workflow-level JMH benchmark on both the original (pre-session baseline) and current optimized code. Use git worktrees for clean isolation:

```bash
bash /tmp/jmh-runner.sh "<BenchmarkClass>" --label e2e \
  --compare /tmp/jmh-results-baseline.json
BASE_SHA=$(cat .codeflash/base-sha.txt)  # recorded at session start
git worktree add /tmp/base-worktree "${BASE_SHA}"

# Build and benchmark original in worktree
(cd /tmp/base-worktree && mvn clean package -DskipTests && \
  java -jar target/benchmarks.jar "WorkflowBenchmark" \
  -rf json -rff /tmp/base-results.json -v EXTRA -f 3 -wi 5 -i 10)

# Build and benchmark optimized in current worktree
mvn clean package -DskipTests
java -jar target/benchmarks.jar "WorkflowBenchmark" \
  -rf json -rff /tmp/head-results.json -v EXTRA -f 3 -wi 5 -i 10

git worktree remove /tmp/base-worktree
```

Read `../references/e2e-benchmarks.md` for the git-worktree-based workflow for more rigorous isolation. Record e2e results alongside micro-bench results. If e2e contradicts micro-bench (e.g., micro showed 15% but e2e shows <2%), re-evaluate -- trust the e2e measurement. Print `[experiment N] E2E: base=<X>ns -> head=<Y>ns (<speedup>x)`.

Compare min scores across ALL benchmarks (see `../references/e2e-benchmarks.md` for the comparison script). Print:

```
[experiment N] JMH comparison (original vs optimized):
[experiment N] <benchmark1>: <base_min>ns -> <head_min>ns (<speedup>x)
[experiment N] <benchmark2>: <base_min>ns -> <head_min>ns (<speedup>x)
```

This comparison is the authoritative measurement. If it contradicts micro-bench (e.g., micro showed 15% but workflow JMH shows <2%), trust the workflow JMH — the function may behave differently under full workflow JIT context. Mark regressions with `!!`.
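The arithmetic behind those printed comparison lines is simple; a hedged sketch of it (hypothetical names, JSON parsing omitted; this is not the actual comparison script from `e2e-benchmarks.md`):

```java
public class JmhComparison {
    // Speedup from min scores (avgt mode: lower ns/op is better).
    static double speedup(double baseMinNs, double headMinNs) {
        return baseMinNs / headMinNs;
    }

    // Formats one "[experiment N] <benchmark>: ..." line; speedup < 1.0
    // means the optimized code is slower, so the line is flagged with !!.
    static String reportLine(int experiment, String benchmark,
                             double baseMinNs, double headMinNs) {
        double s = speedup(baseMinNs, headMinNs);
        String flag = s < 1.0 ? " !!" : "";
        return String.format("[experiment %d] %s: %.0fns -> %.0fns (%.2fx)%s",
                experiment, benchmark, baseMinNs, headMinNs, s, flag);
    }

    public static void main(String[] args) {
        System.out.println(reportLine(3, "WorkflowBenchmark.endToEnd", 1520.0, 980.0));
        System.out.println(reportLine(3, "WorkflowBenchmark.coldStart", 880.0, 910.0));
    }
}
```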
13. **Keep/discard.** Commit after KEEP (see decision tree below). Print `[experiment N] KEEP -- <net effect across dimensions>` or `[experiment N] DISCARD -- <reason>`.
14. **Record** in `.codeflash/results.tsv` AND `.codeflash/HANDOFF.md` immediately. Include ALL dimensions measured AND JMH comparison numbers. Update Hotspot Summary and Kept/Discarded sections.
15. **Config audit** (after KEEP). Check for related configuration flags that became dead or inconsistent. Cross-domain fixes may leave behind stale config across multiple subsystems.
16. **Strategy revision** (after every KEEP). Re-run unified profiling. Print updated `[unified targets]` table. Check for remaining targets (>1% CPU, >2 MiB memory, >5ms latency). Scan for code antipatterns (autoboxing, `String.format` in loops, `synchronized` on hot path, `Arrays.asList` in hot loops) that may not rank high in profiling but are trivially fixable. Ask: "What did I learn? What changed across domains? Should I continue or pivot?"
16. **Strategy plan revision** (after every KEEP). Re-run unified profiling. Print updated `[unified targets]` table. Then update the strategy plan:
   - Mark completed strategies as DONE
   - Mark strategies eliminated by root cause fixes as ELIMINATED (with evidence)
   - Re-assess combination groups — a KEEP may have made previously incompatible strategies combinable or vice versa
   - Check for new strategies revealed by the profile shift
   - Update `.codeflash/strategy-plan.md`
   - Print `[strategy-plan] Revision: ...`
   - Scan for code antipatterns (autoboxing, `String.format` in loops, `synchronized` on hot path, `Arrays.asList` in hot loops) that may not rank high in profiling but are trivially fixable — add as new strategies if found.
17. **Milestones** (every 3-5 keeps): Full benchmark, tag, AND run adversarial review on commits since last milestone. Fix HIGH-severity findings before continuing.
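As a concrete instance of the `String.format`-in-a-loop antipattern the scan step looks for, here is a minimal before/after sketch (hypothetical names), including the output-equivalence check the loop requires:

```java
public class HotLoopFormatting {
    // Antipattern: String.format parses the format string and boxes the int
    // on every iteration; += also re-copies the whole string each pass.
    static String renderSlow(int[] ids) {
        String out = "";
        for (int id : ids) {
            out += String.format("id=%d;", id);
        }
        return out;
    }

    // Fix: one right-sized StringBuilder, no format parsing, no boxing.
    static String renderFast(int[] ids) {
        StringBuilder sb = new StringBuilder(ids.length * 8);
        for (int id : ids) {
            sb.append("id=").append(id).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] ids = {1, 2, 3};
        if (!renderSlow(ids).equals(renderFast(ids))) {
            throw new AssertionError("output equivalence violated");
        }
        System.out.println(renderFast(ids));   // id=1;id=2;id=3;
    }
}
```

This is the "trivially fixable" shape: it rarely tops a profile on its own, but it is cheap to fix and improves CPU and allocation at once.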
### Keep/Discard

@@ -248,9 +278,10 @@ Tests passed?

**You are the primary optimizer. Keep going until there is genuinely nothing left to fix.** Do not stop after fixing only the dominant issue -- work through secondary and tertiary targets too. A 50ms GC reduction on a secondary allocator is still worth a commit. Only stop when profiling shows no actionable targets remain.

- **Exhaustion-based plateau:** After each KEEP, re-profile and rebuild the unified target table. If the table still has targets with measurable impact (>1% CPU, >2 MiB memory, >5ms latency), keep working. Also scan for antipatterns that profiling alone wouldn't catch (autoboxing in hot loops, `synchronized` on hot path, `String.format` in loops, `Arrays.asList` wrapping). Only declare plateau when ALL remaining targets are below these thresholds.
- **Cross-domain plateau:** EVERY dimension has 3+ consecutive discards across all strategies, AND you've checked all interaction patterns -- stop.
- **Single-dimension plateau with headroom elsewhere:** pivot, don't stop.
- **Plan-based plateau:** All strategies in the plan are either DONE, ELIMINATED, or DISCARDED. Re-profile one final time to check for new strategies not in the original plan. If none found, plateau is confirmed.
- **Exhaustion-based plateau:** After each KEEP, re-profile and rebuild the unified target table. If the table still has targets with measurable impact (>1% CPU, >2 MiB memory, >5ms latency), add them as new strategies to the plan and keep working. Also scan for antipatterns that profiling alone wouldn't catch (autoboxing in hot loops, `synchronized` on hot path, `String.format` in loops, `Arrays.asList` wrapping). Only declare plateau when ALL remaining targets are below these thresholds AND the strategy plan is exhausted.
- **Cross-domain plateau:** EVERY dimension has 3+ consecutive discards across all strategies in the plan, AND you've checked all interaction patterns -- stop.
- **Single-dimension plateau with headroom elsewhere:** pivot, don't stop — update the plan to reflect the dimension shift.

### Stuck State Recovery

@@ -258,7 +289,7 @@ If 5+ consecutive discards across all dimensions and strategies:

1. **Re-profile from scratch.** Your cached mental model may be wrong. Run the unified profiling script fresh.
2. **Re-read results.tsv.** Look for patterns: which techniques worked in which domains? Any untried combinations?
3. **Try cross-domain combinations.** Combine 2-3 previously successful single-domain techniques.
3. **Try cross-domain combinations.** Combine previously successful single-domain techniques that pass the combinability matrix.
4. **Try the opposite.** If fine-grained fixes keep failing, try a coarser architectural change that spans domains.
5. **Verify JIT behavior.** The JIT may be optimizing away your changes. Run `-XX:+PrintCompilation` and `-prof perfasm` to see what the JIT actually does. If the JIT already eliminates the pattern, the code is at its optimization floor for that pattern.
6. **Check for missed interactions.** Run JFR with `jdk.G1GarbageCollection` and `jdk.ObjectAllocationInNewTLAB` together -- the GC->CPU interaction is the most commonly missed.

@@ -267,30 +298,170 @@ If 5+ consecutive discards across all dimensions and strategies:

If still stuck after 3 more experiments, **stop and report** with a comprehensive cross-domain analysis of why the code is at its floor.
## Strategy Framework

## Strategy Planning Phase (MANDATORY — before entering the experiment loop)

**You have full agency over your optimization strategy.** This is a decision framework, not a fixed pipeline.

After unified profiling and building the target table, you MUST plan strategies upfront before executing any experiments. This phase enumerates all applicable strategies, determines their count, analyzes which can be combined, and produces an execution plan.

### Choosing your next action

### Step 1: Enumerate all applicable strategies

After each profiling or experiment result, ask:

For each target in the unified target table, list every strategy that could fix it. Draw from the domain strategy catalogs:

**CPU strategies:** algorithmic restructuring, collection swap, JIT deopt fix, caching/memoization, loop-invariant hoisting, autoboxing elimination, stream-to-loop, reflection caching, stdlib replacement

**Memory strategies:** autoboxing elimination, collection right-sizing, object pooling/reuse, cache bounding, leak fix, off-heap migration, escape analysis restructuring, string deduplication

**Async strategies:** lock elimination, parallelization, thread pool tuning, virtual thread migration, lock-free structures, batching/coalescing, executor isolation

**Structure strategies:** circular dep breaking, static init deferral, class loading optimization, ServiceLoader lazy loading, JPMS module optimization, dead code removal
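One catalog entry, collection right-sizing, in miniature (hypothetical names; the 0.75 divisor mirrors `HashMap`'s default load factor):

```java
import java.util.HashMap;
import java.util.Map;

public class CollectionRightSizing {
    // Default-capacity map: growing to n entries from capacity 16 triggers
    // repeated resize-and-rehash passes, which show up as allocation churn.
    static Map<String, Integer> indexDefault(String[] keys) {
        Map<String, Integer> m = new HashMap<>();
        for (int i = 0; i < keys.length; i++) m.put(keys[i], i);
        return m;
    }

    // Right-sized: capacity chosen so n entries never cross the 0.75 load
    // factor, eliminating every intermediate resize.
    static Map<String, Integer> indexRightSized(String[] keys) {
        Map<String, Integer> m = new HashMap<>((int) (keys.length / 0.75f) + 1);
        for (int i = 0; i < keys.length; i++) m.put(keys[i], i);
        return m;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "b", "c", "d"};
        if (!indexDefault(keys).equals(indexRightSized(keys))) {
            throw new AssertionError("output equivalence violated");
        }
        System.out.println(indexRightSized(keys).size());
    }
}
```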
Print the full enumeration:

```
[strategy-plan] Applicable strategies:
S1: algorithmic restructuring on processRecords (O(n^2) -> O(n), est. 40-60% CPU reduction)
S2: collection right-sizing on serialize (HashMap initial capacity, est. 5-10% alloc reduction)
S3: autoboxing elimination on processRecords (Integer -> int in loop, est. 10-20% CPU + alloc reduction)
S4: lock elimination on loadData (synchronized -> StampedLock, est. 20-40% throughput gain)
S5: object pooling on serialize (reuse byte buffers, est. 15-30% alloc reduction)
Total: 5 strategies identified
```
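The enumeration above names a `synchronized` to `StampedLock` swap (S4); a minimal sketch of the optimistic-read pattern that swap typically introduces (the `loadData` target and these names are hypothetical):

```java
import java.util.concurrent.locks.StampedLock;

public class DataHolder {
    private final StampedLock lock = new StampedLock();
    private long value;

    // Write path: exclusive, same semantics as a synchronized setter.
    void store(long v) {
        long stamp = lock.writeLock();
        try {
            value = v;
        } finally {
            lock.unlockWrite(stamp);
        }
    }

    // Read path: the optimistic read avoids any shared-state write in the
    // uncontended case; it falls back to a pessimistic read lock only when
    // validate() detects that a writer intervened.
    long load() {
        long stamp = lock.tryOptimisticRead();
        long v = value;
        if (!lock.validate(stamp)) {
            stamp = lock.readLock();
            try {
                v = value;
            } finally {
                lock.unlockRead(stamp);
            }
        }
        return v;
    }

    public static void main(String[] args) {
        DataHolder h = new DataHolder();
        h.store(42L);
        System.out.println(h.load());
    }
}
```

Note that `StampedLock` is not reentrant, so the swap is only safe when the original `synchronized` block never re-enters itself; that is exactly the kind of check an experiment's output-equivalence step must cover.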
### Step 2: Assess combinability

For each pair (or group) of strategies, determine if they can be safely combined into a single commit or must be applied sequentially. Use this decision matrix:

```
Can S_a and S_b be combined?

+-- Touch different methods/files AND different domains?
|   -> YES: combinable (no interaction risk)
|   -> Example: S2 (collection right-sizing in serialize) + S4 (lock elimination in loadData)
|
+-- Touch the same method but orthogonal aspects?
|   -> MAYBE: combinable if changes don't interact
|   -> Example: S1 (algorithmic fix) + S3 (autoboxing elimination) in same method
|   -> CHECK: would the algorithmic change eliminate the autoboxing path entirely?
|      -> If YES -> sequential (S1 first, then re-assess if S3 still applies)
|      -> If NO -> combinable (independent aspects of same method)
|
+-- Touch the same data structure or control flow?
|   -> NO: must be sequential (changes interact, can't attribute improvement)
|   -> Example: S1 (new algorithm) changes the loop that S3 (autoboxing) targets
|
+-- Cross-domain interaction where one fix may resolve the other?
|   -> NO: sequential, root cause first
|   -> Example: S1 (reduce allocs) may fix S_gc (GC pauses) as side effect
|   -> Apply S1 first, re-profile, check if S_gc is still needed
|
+-- One strategy is a prerequisite for another?
    -> NO: sequential, prerequisite first
    -> Example: S_algo (reduce data size) enables S_cache (now fits in L2)
```
### Step 3: Build combination groups

Group combinable strategies into **batches** that can be applied together:

```
[strategy-plan] Combination analysis:
Group A (combinable — different methods, no interaction):
  S2: collection right-sizing on serialize
  S4: lock elimination on loadData
  -> Apply together, measure jointly

Group B (combinable — same method, orthogonal aspects):
  S1: algorithmic restructuring on processRecords
  S3: autoboxing elimination on processRecords
  -> CHECK: does S1 change the loop structure? If yes -> sequential. If no -> combine.

Sequential (must be separate):
  S5: object pooling on serialize (depends on S2's collection changes)
  -> Apply after Group A, re-assess

Prerequisite chain:
  S1 -> re-profile -> S5 (S1 may reduce allocation enough to make S5 unnecessary)
```

**Combination safety rules:**

- **Same commit**: only if changes touch independent code paths AND you can attribute improvement to each
- **No artificial cap**: if all N strategies pass the combinability matrix (independent code paths, no interaction risk), group all N together — don't split arbitrarily
- **Cross-domain compounds**: combine freely when the interaction is well-understood (e.g., autoboxing elimination improves both CPU and alloc — that's one change, two benefits, safe to combine)
- **NEVER combine** when one strategy might eliminate the need for another — apply the root cause first
### Step 4: Determine execution order

Order the groups/strategies by:

1. **Root causes first.** Strategies that may resolve other targets as side effects go first (e.g., algorithmic fix that reduces alloc -> GC drops -> CPU drops).
2. **Highest compound impact.** Groups that improve multiple dimensions simultaneously rank higher.
3. **Cheapest to verify.** Among equal-impact strategies, prefer the one with the easiest JMH validation.
4. **Dependencies.** Prerequisite strategies before dependent ones.

Print the execution plan:

```
[strategy-plan] Execution order:
Phase 1: S1 (algorithmic restructuring) — root cause, may resolve S3 + GC issues
Phase 2: [S2 + S4] combined — independent targets, different methods
Phase 3: Re-profile and re-assess — S3 and S5 may no longer be needed
Phase 4: S3 (if still applicable after S1) + S5 (if alloc still high after S2)
Estimated total: 3-4 experiments (2 strategies may be eliminated by root cause fixes)
```

### Step 5: Record the plan

Write the strategy plan to `.codeflash/strategy-plan.md`:

```markdown
## Strategy Plan — <date>

### Strategies identified: <N>
### Combination groups: <N>
### Estimated experiments: <N> (with <N> potentially eliminated by root cause fixes)

<full plan from above>
```

**Update HANDOFF.md** with a summary under "Strategy & Decisions".
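A filled-in instance of that template, reusing the hypothetical S1-S5 strategies from the enumeration example (the date is a placeholder):

```markdown
## Strategy Plan — 2025-01-15

### Strategies identified: 5
### Combination groups: 2
### Estimated experiments: 4 (with 2 potentially eliminated by root cause fixes)

Phase 1: S1 (algorithmic restructuring) — root cause
Phase 2: [S2 + S4] combined — independent methods
Phase 3: Re-profile; re-assess S3 and S5
Phase 4: S3 and/or S5, if still applicable
```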
### During the experiment loop: revise, don't abandon

The plan is a starting point, not a rigid script. After each experiment:

1. **Check if the plan still holds.** Did the profile shift? Did a root cause fix eliminate downstream strategies?
2. **Re-assess combination groups.** A KEEP may make previously incompatible strategies combinable (or vice versa).
3. **Print revisions explicitly:**

```
[strategy-plan] Revision after experiment N:
- S3 ELIMINATED: S1's algorithmic fix removed the autoboxing loop entirely
- S5 PROMOTED: alloc still high after S2, moving to Phase 3
- New strategy S6 discovered: JIT deopt on new code path from S1
Remaining: 2 experiments (was 4)
```

4. **If 3+ consecutive discards**, rebuild the plan from scratch with fresh profiling (see Stuck State Recovery).

## Strategy Framework — Dynamic Decisions

The strategy plan gives you a roadmap. These questions guide in-the-moment decisions during execution:

### After each experiment result, ask:

1. **What did I learn?** New interaction discovered? Hypothesis confirmed or refuted?
2. **What has the most headroom?** Which dimension still has the largest gap between current and theoretical best?
3. **What compounds?** Would fixing X make Y's fix more effective? (e.g., reducing allocs first makes CPU fixes more measurable because GC noise drops)
4. **What's cheapest to verify?** If two targets look equally promising, try the one you can JMH micro-benchmark fastest.
5. **What strategies were eliminated?** Did the last KEEP resolve other targets as a side effect? Update the plan.

### Strategy revision triggers

### Revision triggers

Revise your approach when:

Revise the plan when:

- **Root cause fix resolved downstream targets**: Mark them ELIMINATED in the plan, reduce estimated experiments
- **Interaction discovery**: A CPU target's real bottleneck is memory allocation -> pivot to memory fix first, CPU time may drop as side effect
- **JIT surprise**: `-prof perfasm` shows the JIT already optimizes the pattern -> skip it, move to next target
- **Compounding opportunity**: A memory fix reduced GC time, revealing a cleaner CPU profile -> re-rank CPU targets with the fresh profile
- **JIT surprise**: `-prof perfasm` shows the JIT already optimizes the pattern -> mark strategy ELIMINATED
- **Compounding opportunity**: A memory fix reduced GC time, revealing a cleaner CPU profile -> re-rank CPU targets, possibly merge strategies
- **Combination group invalidated**: A KEEP changed code that another strategy in the same group depends on -> split the group
- **New strategy discovered**: Profile shift reveals a target not in the original plan -> add it, re-assess combinations
- **Diminishing returns**: 3+ consecutive discards in current dimension -> check if another dimension has untapped headroom
- **Profile shift**: After a KEEP, the unified profile looks fundamentally different -> rebuild the target table from scratch
- **Profile shift**: After a KEEP, the unified profile looks fundamentally different -> rebuild the plan from scratch

Print strategy revisions explicitly:

Print revisions explicitly:

```
[strategy] Pivoting from <old approach> to <new approach>. Reason: <evidence>.
```
@@ -300,12 +471,14 @@ Print strategy revisions explicitly:

Print one status line before each major step:

1. **After unified profiling**: `[baseline] <unified target table -- top 5 with CPU%, MiB, GC, domains>`
2. **After each experiment**: `[experiment N] target: <name>, domains: <list>, result: KEEP/DISCARD, CPU: <delta>, Mem: <delta>, GC: <delta>, cross-domain: <interaction or none>`
3. **Every 3 experiments**: `[progress] <N> experiments (<keeps> kept, <discards> discarded) | best: <top keep> | CPU: <baseline>s -> <current>s | Mem: <baseline> -> <current> MiB | interactions found: <N> | next: <next target>`
4. **Strategy pivot**: `[strategy] Pivoting from <old> to <new>. Reason: <evidence>`
5. **At milestones (every 3-5 keeps)**: `[milestone] <cumulative across all dimensions>`
6. **At completion** (ONLY after: no actionable targets remain, pre-submit review passes, AND adversarial review passes): `[complete] <final: experiments, keeps, per-dimension improvements, interactions found, adversarial review: passed>`
7. **When stuck**: `[stuck] <what's been tried across dimensions>`
2. **After strategy planning**: `[strategy-plan] <N> strategies identified, <N> combination groups, <N> estimated experiments, execution order: <phases>`
3. **After each experiment**: `[experiment N] strategy: <S_id> <name>, domains: <list>, phase: <P>/<total>, result: KEEP/DISCARD, CPU: <delta>, Mem: <delta>, GC: <delta>, cross-domain: <interaction or none>`
4. **After strategy plan revision**: `[strategy-plan] Revision: <N> strategies remaining (was <N>), <N> eliminated, <N> new. Reason: <evidence>`
5. **Every 3 experiments**: `[progress] <N> experiments (<keeps> kept, <discards> discarded) | plan: phase <P>/<total>, <N> strategies remaining | CPU: <baseline>s -> <current>s | Mem: <baseline> -> <current> MiB | interactions found: <N>`
6. **Strategy pivot**: `[strategy] Pivoting from <old> to <new>. Reason: <evidence>`
7. **At milestones (every 3-5 keeps)**: `[milestone] <cumulative across all dimensions>`
8. **At completion** (ONLY after: no actionable targets remain OR plan exhausted, pre-submit review passes, AND adversarial review passes): `[complete] <final: strategies planned/executed/eliminated/kept, per-dimension improvements, interactions found, adversarial review: passed>`
9. **When stuck**: `[stuck] <what's been tried, which plan phases completed, which strategies remain>`

Also update the shared task list:

- After baseline: `TaskUpdate("Baseline profiling" -> completed)`

@@ -366,17 +539,29 @@ You are self-sufficient -- handle your own setup before any profiling.
### Starting fresh

1. **Create or switch to optimization branch.** `git checkout -b codeflash/optimize` (or checkout if it already exists). (**CI mode**: skip this -- stay on the current branch.)
2. **Initialize `.codeflash/HANDOFF.md`** from `${CLAUDE_PLUGIN_ROOT}/references/shared/handoff-template.md`. Fill in: branch, project root, JDK version, build tool, test command, GC algorithm.
3. **Unified baseline.** Run the unified CPU+Memory+GC profiling.
4. **Capture JMH baseline.** If the project has JMH benchmarks (check `.codeflash/setup.md`), run them on the unmodified code to establish a performance baseline:
2. **Record the baseline SHA.** This is the immutable reference point for all JMH comparisons:

```bash
bash /tmp/jmh-runner.sh "<BenchmarkClass>" --label baseline \
git rev-parse HEAD > .codeflash/base-sha.txt
```

Every JMH comparison during the session uses this SHA as the "original" version.

3. **Initialize `.codeflash/HANDOFF.md`** from `${CLAUDE_PLUGIN_ROOT}/references/shared/handoff-template.md`. Fill in: branch, project root, JDK version, build tool, test command, GC algorithm.
4. **Unified baseline.** Run the unified CPU+Memory+GC profiling.
5. **Identify workflow benchmarks.** Find or create JMH benchmarks that exercise entire workflows (request pipelines, data processing chains, batch jobs), not just individual functions. Check `.codeflash/setup.md` for existing JMH infrastructure. If only micro-benchmarks exist, create workflow-level benchmarks that chain the relevant hot-path functions together:

```bash
# Look for existing workflow-level benchmarks
grep -rn "Benchmark" src/jmh/ --include="*.java" 2>/dev/null
# If none exist, create them exercising the full code path the user cares about
```

6. **Capture JMH baseline on workflows.** Run the workflow-level JMH benchmarks on the unmodified code:

```bash
bash /tmp/jmh-runner.sh "<WorkflowBenchmarkClass>" --label baseline \
  --mode avgt --forks 3 --warmup 5 --measurement 10 --time 1
```

This baseline is the comparison point for all subsequent experiments. Without it, benchmark numbers are meaningless.

5. **Build unified target table.** Cross-reference CPU hotspots with memory allocators and GC impact. Identify multi-domain targets. **Update HANDOFF.md** Hotspot Summary.
6. **Plan dispatch.** Classify each target as cross-domain (handle yourself) or single-domain (candidate for dispatch). If there are 2+ single-domain targets in the same domain, consider dispatching a domain agent.
7. **Enter the experiment loop.**

This baseline is the comparison point for all subsequent experiments. Without it, benchmark numbers are meaningless. **Always benchmark end-to-end workflows, not just the individual function you plan to change.**

7. **Build unified target table.** Cross-reference CPU hotspots with memory allocators and GC impact. Identify multi-domain targets. Trace how targets participate in workflow paths — a function that's 5% of CPU but sits on the critical path of the main workflow matters more than a 15% CPU function in a cold utility. **Update HANDOFF.md** Hotspot Summary.
8. **Strategy planning (MANDATORY).** Execute the full Strategy Planning Phase (see above): enumerate all applicable strategies per target, assess combinability (which can be batched, which must be sequential), build combination groups, determine execution order, and write `.codeflash/strategy-plan.md`. Print the `[strategy-plan]` summary.
9. **Plan dispatch.** Using the strategy plan, classify each strategy/group as cross-domain (handle yourself) or single-domain (candidate for dispatch). If there are 2+ single-domain strategies in the same domain that form a combination group, consider dispatching a domain agent for the whole group.
10. **Enter the experiment loop.** Follow the execution order from the strategy plan.
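A workflow-level benchmark exercises a chained path like the one sketched below; under JMH, the `endToEnd` method would be the body of a `@Benchmark` method. This is a hedged stdlib illustration with hypothetical stage names, not a real benchmark from this project:

```java
public class WorkflowPath {
    // Hypothetical stages of one workflow: parse -> transform -> serialize.
    static int[] parse(String csv) {
        String[] parts = csv.split(",");
        int[] out = new int[parts.length];
        for (int i = 0; i < parts.length; i++) out[i] = Integer.parseInt(parts[i].trim());
        return out;
    }

    static int[] transform(int[] in) {
        int[] out = new int[in.length];
        for (int i = 0; i < in.length; i++) out[i] = in[i] * 2;
        return out;
    }

    static String serialize(int[] in) {
        StringBuilder sb = new StringBuilder();
        for (int v : in) sb.append(v).append(';');
        return sb.toString();
    }

    // The full chain, not any single stage, is what the workflow-level
    // benchmark method calls, so JIT inlining and allocation behavior match
    // production rather than an isolated micro-benchmark.
    static String endToEnd(String csv) {
        return serialize(transform(parse(csv)));
    }

    public static void main(String[] args) {
        System.out.println(endToEnd("1, 2, 3"));   // 2;4;6;
    }
}
```

Benchmarking `endToEnd` rather than `transform` alone is what makes the baseline sensitive to cross-stage effects such as intermediate allocations between stages.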
### CI mode

@@ -392,11 +577,12 @@ CI mode is triggered when the prompt contains "CI" context (e.g., "This is a CI

### Resuming

1. Read `.codeflash/HANDOFF.md`, `.codeflash/results.tsv`, `.codeflash/learnings.md`.
2. Note what was tried, what worked, and why it stopped -- these constrain your strategy. **Pay special attention to targets marked "not optimizable without modifying library"** -- these are prime candidates for Library Boundary Breaking.
2. Read `.codeflash/strategy-plan.md` if it exists. Note what was tried, what was eliminated, and what remains. **Pay special attention to targets marked "not optimizable without modifying library"** -- these are prime candidates for Library Boundary Breaking.
3. **Run unified profiling** on the current state to get a fresh cross-domain view. The profile may look very different after previous optimizations.
4. **Check for library ceiling.** If >15% of remaining cumtime is in external library internals and the previous session plateaued against that boundary, assess feasibility of a focused replacement (see Library Boundary Breaking).
5. **Build unified target table.** Previous work may have shifted the profile. Include library-replacement candidates as targets with domain "structure x cpu".
6. **Enter the experiment loop.**
6. **Rebuild strategy plan.** Run the full Strategy Planning Phase with the fresh profile. Compare to the previous plan — carry forward strategies that are still valid, drop eliminated ones, add newly discovered ones. Reassess all combination groups with the current code state.
7. **Enter the experiment loop.** Follow the updated execution order from the strategy plan.

### Session End (plateau, completion, or user stop)