LARGE-SCALE session mode
This commit is contained in:
parent
92105254f8
commit
c5645ce1fe
26 changed files with 429 additions and 100 deletions
14
README.md
14
README.md
|
|
@ -31,8 +31,8 @@ Build the plugin first, then launch Claude with it:
|
|||
```bash
|
||||
git clone https://github.com/codeflash-ai/codeflash-agent.git
|
||||
cd codeflash-agent
|
||||
make build-plugin # assembles plugin into dist/ — must run before launching
|
||||
claude --dangerously-skip-permissions --effort max --plugin-dir ./dist/
|
||||
powershell -ExecutionPolicy Bypass -File scripts/build_plugin.ps1 -Lang java
|
||||
claude --dangerously-skip-permissions --effort max --plugin-dir ./dist-java/
|
||||
```
|
||||
|
||||
## Your first optimization
|
||||
|
|
@ -76,15 +76,19 @@ uv run pytest packages/ -v # test all packages
|
|||
### Plugin development
|
||||
|
||||
```bash
|
||||
make build-plugin # assemble plugin → dist/ (base + python overlay + vendor)
|
||||
make clean # remove dist/
|
||||
powershell -ExecutionPolicy Bypass -File scripts/build_plugin.ps1 # Windows: build all dist-* plugins
|
||||
powershell -ExecutionPolicy Bypass -File scripts/build_plugin.ps1 -Lang java # Windows: build only dist-java/
|
||||
python scripts/build_plugin.py --lang java # Python alternative
|
||||
make build # optional wrapper on systems with make
|
||||
make clean # remove dist-* plugin builds
|
||||
```
|
||||
|
||||
The plugin is self-contained under `plugin/`:
|
||||
- `plugin/` — language-agnostic agents, hooks, shared references
|
||||
- `plugin/languages/python/` — Python domain agents, skills, references
|
||||
- `plugin/languages/javascript/` — JavaScript domain agents, skills, references
|
||||
- `make build-plugin` assembles base + language overlay into `dist/` (default: `LANG=python`)
|
||||
- `plugin/languages/java/` — Java/Kotlin domain agents, skills, JFR/JMH references
|
||||
- `scripts/build_plugin.ps1` or `scripts/build_plugin.py` assembles one language overlay per `dist-<language>/`; install `dist-java/` for Java/Kotlin work
|
||||
|
||||
## Optimization patterns
|
||||
|
||||
|
|
|
|||
|
|
@ -4,21 +4,21 @@
|
|||
"name": "Codeflash"
|
||||
},
|
||||
"metadata": {
|
||||
"description": "Autonomous performance optimization plugins for Python and JavaScript/TypeScript",
|
||||
"description": "Autonomous performance optimization plugins for Python, JavaScript/TypeScript, Go, and Java/Kotlin",
|
||||
"version": "0.1.0"
|
||||
},
|
||||
"plugins": [
|
||||
{
|
||||
"name": "codeflash-agent",
|
||||
"source": "./",
|
||||
"description": "Autonomous performance optimization agent. Profiles code, implements optimizations, benchmarks before and after, and iterates until plateau. Supports Python and JavaScript/TypeScript.",
|
||||
"description": "Autonomous performance optimization agent. Profiles code, implements optimizations, benchmarks before and after, and iterates until plateau. Supports Python, JavaScript/TypeScript, Go, and Java/Kotlin.",
|
||||
"version": "0.1.0",
|
||||
"author": {
|
||||
"name": "Codeflash"
|
||||
},
|
||||
"repository": "https://github.com/codeflash-ai/codeflash-agent",
|
||||
"license": "BSL-1.1",
|
||||
"keywords": ["optimization", "performance", "profiling", "python", "javascript", "typescript"]
|
||||
"keywords": ["optimization", "performance", "profiling", "python", "javascript", "typescript", "go", "java", "kotlin"]
|
||||
}
|
||||
]
|
||||
}
|
||||
|
|
|
|||
|
|
@ -1,13 +1,13 @@
|
|||
{
|
||||
"name": "codeflash-agent",
|
||||
"version": "0.1.0",
|
||||
"description": "Autonomous performance optimization agent for Python and JavaScript/TypeScript.",
|
||||
"description": "Autonomous performance optimization agent for Python, JavaScript/TypeScript, Go, and Java/Kotlin.",
|
||||
"author": {
|
||||
"name": "Codeflash"
|
||||
},
|
||||
"repository": "https://github.com/codeflash-ai/codeflash-agent",
|
||||
"license": "BSL-1.1",
|
||||
"keywords": ["optimization", "performance", "profiling", "python", "javascript", "typescript"],
|
||||
"keywords": ["optimization", "performance", "profiling", "python", "javascript", "typescript", "go", "java", "kotlin"],
|
||||
"mcpServers": {
|
||||
"context7": {
|
||||
"type": "http",
|
||||
|
|
|
|||
|
|
@ -5,7 +5,7 @@
|
|||
1. **SessionStart hook** — initializes Codex session state
|
||||
2. **User triggers** `/codeflash-optimize start` (skill)
|
||||
3. **Language router** (`codeflash`) — detects project language, delegates to language-specific router
|
||||
4. **Language-specific router** (e.g., `codeflash-python`) — detects domain, asks user questions, launches setup
|
||||
4. **Language-specific router** (e.g., `codeflash-python`, `codeflash-java`) — detects domain/session mode, asks user questions when not autonomous, launches setup
|
||||
5. **Setup agent** (e.g., `codeflash-setup`) — detects env, installs deps/profilers, writes `.codeflash/setup.md`
|
||||
6. **Router validates** setup, runs test suite, researches deps via context7
|
||||
7. **Router creates team** and dispatches optimizer agent
|
||||
|
|
@ -14,7 +14,7 @@
|
|||
|
||||
8. **Optimizer** (`codeflash-deep` or domain-specific: `-cpu`, `-memory`, `-async`, `-structure`) — profiles all dimensions, ranks targets
|
||||
9. **Researcher** (`codeflash-researcher`) — launched alongside to analyze targets in parallel, sends findings back to optimizer
|
||||
10. **Experiment cycle**: profile → reason → implement → test → benchmark → keep/discard → commit → re-profile → repeat
|
||||
10. **Experiment cycle**: profile → reason → implement → test → benchmark → E2E/workload verification when required → keep/discard → commit/audit → re-profile → repeat
|
||||
11. **Plateau detection** (3+ consecutive discards) → optimizer sends `[complete]`
|
||||
|
||||
## Review Gate
|
||||
|
|
@ -63,6 +63,19 @@ Defined in `plugin/hooks/hooks.json`, fire at session boundaries:
|
|||
| `codeflash-ci` | CI mode agent for GitHub webhooks | CI service |
|
||||
| `codeflash-pr-prep` | PR preparation agent | Post-session |
|
||||
|
||||
### Java/Kotlin-specific (`plugin/languages/java/agents/`)
|
||||
|
||||
| Agent | Role | Triggered by |
|
||||
|-------|------|-------------|
|
||||
| `codeflash-java` | Java/Kotlin domain router/team lead | Language router after detecting Maven/Gradle |
|
||||
| `codeflash-java-setup` | Maven/Gradle, JDK, JMH, profiler, representative workload detection | Java router |
|
||||
| `codeflash-java-scan` | Quick cross-domain diagnosis | `/codeflash-optimize scan` or router recon |
|
||||
| `codeflash-java-deep` | Primary optimizer (JFR/JMH across CPU, memory, GC, concurrency) | Java router default |
|
||||
| `codeflash-java-cpu` | CPU/data-structure/JIT specialist; owns profile-gated cross-function refactors | Java router or deep agent dispatch |
|
||||
| `codeflash-java-memory` | Allocation/GC specialist | Java router or deep agent dispatch |
|
||||
| `codeflash-java-async` | Concurrency/threading specialist | Java router or deep agent dispatch |
|
||||
| `codeflash-java-structure` | Startup/class-loading/module specialist | Java router or deep agent dispatch |
|
||||
|
||||
## Commands (`plugin/commands/`)
|
||||
|
||||
User-invocable anytime:
|
||||
|
|
@ -151,4 +164,4 @@ Created during execution in `.codeflash/`:
|
|||
|
||||
## Assembly
|
||||
|
||||
`make build-plugin` merges `plugin/` (base, excluding `languages/`) + `plugin/languages/python/` (overlay) into `dist/`. Set `LANG=javascript` to build for JS instead. Agent files use `${CLAUDE_PLUGIN_ROOT}` for references — paths differ between source and assembled output.
|
||||
`make build` creates one assembled plugin directory per language (`dist-python/`, `dist-javascript/`, `dist-java/`, etc.) by merging `plugin/` (base, excluding `languages/`) with `plugin/languages/<lang>/` (overlay). Install the assembled language-specific directory, not raw `plugin/`, so Claude Code sees one `/codeflash-optimize` skill and one language's agent set.
|
||||
|
|
|
|||
|
|
@ -29,13 +29,17 @@ Tier 1: Top Router (plugin/agents/codeflash.md)
|
|||
└─ Detects language, delegates immediately
|
||||
|
||||
Tier 2: Language Router / Team Lead
|
||||
├─ codeflash-python (plugin/languages/python/agents/)
|
||||
└─ codeflash-javascript (plugin/languages/javascript/agents/)
|
||||
├─ codeflash-python (plugin/languages/python/agents/)
|
||||
├─ codeflash-javascript (plugin/languages/javascript/agents/)
|
||||
├─ codeflash-java (plugin/languages/java/agents/)
|
||||
└─ codeflash-go (plugin/languages/go/agents/)
|
||||
Tools: TeamCreate, TeamDelete, Agent, SendMessage, TaskCreate/Update
|
||||
|
||||
Tier 3: Deep Agent / Sub-Team Lead
|
||||
├─ codeflash-deep (Python)
|
||||
└─ codeflash-js-deep (JavaScript)
|
||||
├─ codeflash-deep (Python)
|
||||
├─ codeflash-js-deep (JavaScript)
|
||||
├─ codeflash-java-deep (Java/Kotlin)
|
||||
└─ codeflash-deep (Go overlay)
|
||||
Tools: TeamCreate, Agent, SendMessage (can dispatch domain specialists)
|
||||
|
||||
Tier 4: Domain Specialists
|
||||
|
|
@ -77,6 +81,15 @@ Shared (language-agnostic):
|
|||
- `skills/` — `/codeflash-optimize` entry point, V8 profiling reference
|
||||
- `references/` — JS-specific references (Prisma performance, domain deep-dives)
|
||||
|
||||
### Java/Kotlin (`languages/java/`)
|
||||
- `agents/codeflash-java.md` — Java/Kotlin domain router (Team Lead)
|
||||
- `agents/codeflash-java-deep.md` — primary optimizer (JFR/JMH, CPU, memory, GC, concurrency)
|
||||
- `agents/codeflash-java-cpu.md`, `-memory.md`, `-async.md`, `-structure.md` — domain specialists
|
||||
- `agents/codeflash-java-setup.md` — detects Maven/Gradle, JDK, JMH, representative workloads
|
||||
- `agents/codeflash-java-scan.md`, `-ci.md`, `-pr-prep.md` — scan, CI, and PR preparation
|
||||
- `skills/` — `/codeflash-optimize` entry point and JFR profiling reference
|
||||
- `references/` — Java-specific JFR/JMH, E2E benchmarking, data-structure, memory, async, I/O, and worker guides
|
||||
|
||||
## Adding a New Language
|
||||
|
||||
Follow this template — Team Lead + Deep Agent are required, domain specialists are added as needed:
|
||||
|
|
@ -100,9 +113,9 @@ Shared agents (`codeflash-researcher`, `codeflash-review`) work across all langu
|
|||
## Build
|
||||
|
||||
```bash
|
||||
make build-plugin # default: LANG=python → dist/ is a Python-only plugin
|
||||
make build-plugin LANG=javascript # dist/ is a JavaScript-only plugin
|
||||
make clean # remove dist/
|
||||
powershell -ExecutionPolicy Bypass -File scripts/build_plugin.ps1 # Windows
|
||||
python scripts/build_plugin.py # Python alternative
|
||||
make clean # remove dist-* plugin builds
|
||||
```
|
||||
|
||||
Each build produces a **single-language plugin** in `dist/`. The Makefile copies language-agnostic files from `plugin/`, overlays `plugin/languages/<LANG>/` (agents, references, skills), and rewrites internal paths so everything is flat. You pick the language at build time — there is no multi-language dist yet, because loading all languages at once is extremely heavy on context and compute.
|
||||
Each build produces **single-language plugin directories** (`dist-python/`, `dist-javascript/`, `dist-java/`, etc.). `scripts/build_plugin.ps1` and `scripts/build_plugin.py` copy language-agnostic files from `plugin/`, overlay `plugin/languages/<LANG>/` (agents, references, skills), and rewrite internal paths so everything is flat. Install the built `dist-java/` plugin for Java/Kotlin work so Claude Code sees one `/codeflash-optimize` skill and the Java agent set without duplicate language-specific skill names.
|
||||
|
|
|
|||
|
|
@ -23,6 +23,12 @@ description: >
|
|||
assistant: "I'll use codeflash to profile memory and iteratively optimize."
|
||||
</example>
|
||||
|
||||
<example>
|
||||
Context: User wants to optimize a Java/Kotlin project
|
||||
user: "Find E2E performance wins in this Maven service"
|
||||
assistant: "I'll launch codeflash to detect Java/Kotlin and optimize with the Java router."
|
||||
</example>
|
||||
|
||||
<example>
|
||||
Context: User wants to continue a previous session
|
||||
user: "Continue the mar20 optimization experiments"
|
||||
|
|
@ -57,7 +63,7 @@ Check the project root for these markers:
|
|||
|
||||
Detection priority:
|
||||
1. Check for unambiguous markers first (e.g., `pyproject.toml` = Python, `package.json` = JS).
|
||||
2. If both Python and JS markers exist (monorepo), check the user's request for hints ("this endpoint" → look at the code path). If still ambiguous, ask the user which language to optimize.
|
||||
2. If multiple language markers exist (monorepo), check the user's request for hints ("Maven", "Gradle", "JFR", "JMH", ".java", ".kt" -> Java/Kotlin; "this endpoint" plus package.json -> JS; "pytest"/"pyproject" -> Python). If still ambiguous, ask the user which language to optimize.
|
||||
3. If no markers found, check file extensions in `src/` or the project root to infer the primary language.
|
||||
|
||||
## Routing
|
||||
|
|
|
|||
|
|
@ -28,6 +28,10 @@ You are an autonomous concurrency and async performance optimization agent for J
|
|||
|
||||
**Read `${CLAUDE_PLUGIN_ROOT}/references/shared/agent-base-protocol.md` at session start** for shared operational rules.
|
||||
|
||||
## Value Gate
|
||||
|
||||
If your prompt contains `SESSION MODE: LARGE-SCALE`, only work on concurrency targets tied to a representative workload: measurable lock/park/pinning events, at least 5% throughput headroom, at least 5 ms/request latency, or a method involved in at least 1% workload CPU. A JMH contention benchmark validates the mechanism, but KEEP requires representative throughput/latency evidence at a realistic thread count. Do not tune pools from guesses; measure wait/compute ratio or use workload throughput curves.
|
||||
|
||||
## Target Categories
|
||||
|
||||
| Category | Worth fixing? | Typical Impact |
|
||||
|
|
@ -166,7 +170,7 @@ Strategy rotation: lock elimination -> parallelization -> thread pool tuning ->
|
|||
## Results Schema
|
||||
|
||||
```
|
||||
commit target_test baseline_throughput optimized_throughput throughput_change baseline_latency_p99_ms optimized_latency_p99_ms threads status pattern description
|
||||
commit session_mode target_test workload_command workload_cpu_pct baseline_metric optimized_metric improvement_pct e2e_improvement_pct baseline_throughput optimized_throughput throughput_change baseline_latency_p99_ms optimized_latency_p99_ms threads tests_passed tests_failed status pattern correctness_probe description
|
||||
```
|
||||
|
||||
## Progress Reporting
|
||||
|
|
|
|||
|
|
@ -72,6 +72,9 @@ Steps:
|
|||
```
|
||||
Agent(subagent_type="codeflash-java-deep", prompt="AUTONOMOUS MODE: The user has already been asked for context (included below). Do NOT ask the user any questions -- work fully autonomously. Make all decisions yourself: generate a run tag from today's date, identify benchmark tiers from available tests, choose optimization targets from profiler output. If something is ambiguous, pick the reasonable default and document your choice in HANDOFF.md.
|
||||
|
||||
SESSION MODE: LARGE-SCALE
|
||||
REFACTOR SCOPE: PROFILE-GATED CROSS-FUNCTION ALLOWED
|
||||
|
||||
Optimize the Java/Kotlin code in this repository. This is a CI run triggered by PR #{number} ({head_ref} -> {base_ref}).
|
||||
|
||||
Focus on the files changed in this PR: {file_list}.
|
||||
|
|
|
|||
|
|
@ -28,6 +28,14 @@ You are an autonomous CPU/runtime performance optimization agent for Java and Ko
|
|||
|
||||
**Read `${CLAUDE_PLUGIN_ROOT}/references/shared/agent-base-protocol.md` at session start** for shared operational rules.
|
||||
|
||||
## Value Gate
|
||||
|
||||
Default to workload-scale value, not isolated source cleanup. If your prompt contains `SESSION MODE: LARGE-SCALE`, you may only optimize targets that the lead's workload profile shows as hot: at least 1% workload CPU, at least 20 MiB/run avoidable allocation caused by CPU-side data structure choices, at least 5 ms/request latency, or a documented cross-domain interaction. JMH is still mandatory for mechanism validation, but a micro-benchmark alone is not enough to KEEP a large-scale experiment.
|
||||
|
||||
If your prompt contains `SESSION MODE: LIBRARY PRIMITIVE`, JMH-only acceptance is allowed only for a pervasive primitive (decoder, collection, hash/string utility) with at least three named downstream callers. Record the caller list in HANDOFF.md before editing.
|
||||
|
||||
If your prompt contains `REFACTOR SCOPE: PROFILE-GATED CROSS-FUNCTION ALLOWED` or `SESSION MODE: CROSS-FUNCTION REFACTOR`, you may attempt a cross-function refactor only for a profile-proven data-flow target. The target must have a Touched Call Graph, a preserved-behavior contract, a committed behavioral-equivalence test, and E2E/workload evidence before final KEEP.
|
||||
|
||||
## Target Categories
|
||||
|
||||
| Category | Worth fixing? | Threshold |
|
||||
|
|
@ -44,7 +52,7 @@ You are an autonomous CPU/runtime performance optimization agent for Java and Ko
|
|||
| **Eager initialization on hot path** (upfront alloc of rarely-used objects) | Yes if profiler-confirmed | Object used in <20% of calls |
|
||||
| **Synchronized hot path** (unnecessary locking) | Yes | Profiler shows contention |
|
||||
| **Cold code** (<2% profiler time) | **NEVER fix** | Below noise floor |
|
||||
| **Cross-function refactor** (remove intermediate materialization between methods, fuse loops across call boundaries, monomorphize polymorphic dispatch, eliminate allocation along a pipeline) | Only in CROSS-FUNCTION REFACTOR mode | See dedicated section below |
|
||||
| **Cross-function refactor** (remove intermediate materialization between methods, fuse loops across call boundaries, monomorphize polymorphic dispatch, eliminate allocation along a pipeline) | Only when `SESSION MODE: CROSS-FUNCTION REFACTOR` or `REFACTOR SCOPE: PROFILE-GATED CROSS-FUNCTION ALLOWED` is present | See dedicated section below |
|
||||
|
||||
### Top Antipatterns
|
||||
|
||||
|
|
@ -93,7 +101,7 @@ When you see `for (T x : items) { x.doThing(); }` and want to add a fast-path sk
|
|||
|
||||
Rule: Don't hoist guards out of polymorphic call targets. See `../references/data-structures/guide.md` "Polymorphic Dispatch Safety" for the full trap catalog.
|
||||
|
||||
## CROSS-FUNCTION REFACTOR Mode (only when the session brief says so)
|
||||
## CROSS-FUNCTION REFACTOR Mode (only when the prompt authorizes it)
|
||||
|
||||
By default, every optimization you attempt is a **single-method change**. That is the correct default for most targets and matches the small-PR reviewer expectation — one method, one fix, one PR, under ~100 LOC.
|
||||
|
||||
|
|
@ -104,7 +112,7 @@ Some optimizations cannot be captured by changing a single method. Examples:
|
|||
- **Monomorphizing a polymorphic dispatch.** A hot call site has an abstract type with 10 implementations but only 3 are ever hot. Replacing the abstract call with an explicit type switch + direct calls removes method-handle overhead, but the fix touches the abstract class, the hot call site, and the direct-call path.
|
||||
- **Eliminating allocation along a data-flow pipeline.** Method A creates a `Range` object, passes it to B, B reads two fields and discards the `Range`. Inlining the two fields into A/B's signatures removes the per-row allocation, touching A, B, possibly the `Range` class, and every caller.
|
||||
|
||||
**These optimizations are only permitted when the session's scoped brief explicitly declares `SESSION MODE: CROSS-FUNCTION REFACTOR`.** If the brief does NOT say this, you MUST restrict yourself to single-method changes — the project's PR reviewers reject sweeping refactors as blast-radius risks.
|
||||
**These optimizations are only permitted when the prompt contains `SESSION MODE: CROSS-FUNCTION REFACTOR` or `REFACTOR SCOPE: PROFILE-GATED CROSS-FUNCTION ALLOWED`.** If neither appears, you MUST restrict yourself to single-method changes and record the larger candidate in HANDOFF.md. Reviewers reject sweeping refactors unless the workload profile proves the blast radius is justified.
|
||||
|
||||
### When you ARE in CROSS-FUNCTION REFACTOR mode
|
||||
|
||||
|
|
@ -201,7 +209,7 @@ Tests pass? (mvn test / gradle test)
|
|||
+-- <5% -> Re-run 3 times (JIT warmup variance is real)
|
||||
| +-- Confirmed -> KEEP
|
||||
| +-- Not significant -> DISCARD
|
||||
+-- Micro-bench only: >=20% on confirmed hot path -> KEEP
|
||||
+-- Micro-bench only: >=20% on confirmed hot path -> KEEP only in LIBRARY PRIMITIVE mode with >=3 named downstream callers; in LARGE-SCALE mode require E2E/workload evidence
|
||||
+-- JIT deopt fix: KEEP if PrintCompilation confirms deopt eliminated
|
||||
+-- No improvement -> DISCARD
|
||||
```
|
||||
|
|
@ -238,7 +246,7 @@ Before pushing, review `git diff <base>..HEAD`:
|
|||
## Results Schema
|
||||
|
||||
```
|
||||
commit target_test baseline_ms optimized_ms speedup tests_passed tests_failed status pattern description
|
||||
commit session_mode target_test workload_command workload_cpu_pct baseline_metric optimized_metric improvement_pct e2e_improvement_pct tests_passed tests_failed status pattern correctness_probe description
|
||||
```
|
||||
|
||||
## Progress Reporting
|
||||
|
|
|
|||
|
|
@ -90,6 +90,22 @@ All three conditions must hold: (1) >15% CPU in library internals, (2) domain ag
|
|||
|
||||
You MUST profile before making any code changes. The unified profiling script below is your starting point -- run it first, then use deeper tools as needed. Do NOT skip profiling to "just read the code and fix obvious issues."
|
||||
|
||||
### Large-Scale Value Gate (mandatory before experiments)
|
||||
|
||||
Your default job is to find valuable workload wins, not source-level cleanups. Before choosing any experiment in `LARGE-SCALE` mode, build a workload-backed target list:
|
||||
|
||||
1. **Pick the most representative workload available.** Prefer, in order: an existing production-like benchmark, an integration/E2E test that drives the user-facing path, a benchmark suite named in `codeflash_profile.md` or `.codeflash/setup.md`, then the narrowest test that exercises the requested path. Do not use `mvn compile` as a workload unless the user requested build/startup performance.
|
||||
2. **Map entry points to hot methods.** For each top target, record the entry point, owning class, caller chain, CPU percentage, allocation volume, GC pause contribution, and the exact workload command that exercises it.
|
||||
3. **Apply the impact threshold.** In `LARGE-SCALE` mode, only optimize targets with at least 1% of workload CPU, at least 20 MiB/run avoidable allocation, at least 5 ms/request latency, or at least 5% throughput headroom from lock/thread contention. Anything below that is a micro-win and must be skipped unless it is a pervasive primitive with at least three named downstream callers.
|
||||
4. **Classify acceptance tier before editing.**
|
||||
- `strong`: expected >=10% E2E improvement on at least two representative workloads.
|
||||
- `modest`: expected >=5% E2E improvement on one representative workload.
|
||||
- `primitive`: JMH-only acceptance allowed because the target is a decoder/collection/hash/string utility used by at least three named downstream call sites.
|
||||
- `cross-function`: expected E2E win requires coordinated changes across multiple methods. Allowed only when the prompt contains `REFACTOR SCOPE: PROFILE-GATED CROSS-FUNCTION ALLOWED` or `SESSION MODE: CROSS-FUNCTION REFACTOR`.
|
||||
5. **Write the table to HANDOFF.md.** Include `session_mode`, `workload_command`, `entry_point`, `target_method`, `workload_cpu_pct`, `alloc_mib`, `gc_ms`, `acceptance_tier`, and `why_this_is_not_a_micro_win`.
|
||||
|
||||
If you cannot produce this table, keep profiling or discover a better workload. Do not enter the experiment loop with only a list of source-code smells.
|
||||
|
||||
### Unified CPU + Memory + GC profiling (MANDATORY first step)
|
||||
|
||||
This gives you the cross-domain view that single-domain agents lack. The script lives at `${CLAUDE_PLUGIN_ROOT}/languages/java/references/unified-profiling-script.sh` -- copy it to `/tmp/java_deep_profile.sh` and run it.
|
||||
|
|
@ -105,7 +121,7 @@ chmod +x /tmp/jmh-runner.sh
|
|||
|
||||
**Usage:** `bash /tmp/java_deep_profile.sh <source_package> -- <command> [args...]`
|
||||
|
||||
- `<source_package>` -- Java package prefix to filter CPU results. Only methods in this package (or subpackages) appear in the CPU report. Read this from `.codeflash/setup.md` (the base package). Use "." to include everything.
|
||||
- `<source_package>` -- Java package prefix to filter CPU results. Only methods in this package (or subpackages) appear in the CPU report. Read `Base package for profiling` from `.codeflash/setup.md`. Use "." when setup says the package root is ambiguous.
|
||||
- Everything after `--` is the command to profile.
|
||||
|
||||
**Examples:**
|
||||
|
|
@ -153,6 +169,14 @@ After the unified profile, cross-reference CPU hotspots with allocation sites an
|
|||
|
||||
**Methods in 2+ domains rank higher** -- cross-domain targets are where deep reasoning adds value.
|
||||
|
||||
For `LARGE-SCALE` mode, add these columns to the table and sort by expected workload value, not ease of implementation:
|
||||
|
||||
```
|
||||
| Entry point | Workload command | Method | CPU % | Alloc MiB | GC ms | Callers | Acceptance tier | Skip? |
|
||||
```
|
||||
|
||||
Skip cold targets explicitly with a reason such as `cold: 0.3% CPU` or `micro-only: no downstream caller evidence`. Recording skips prevents the session from drifting into cosmetic cleanup.
|
||||
|
||||
### Additional profiling tools (use on demand)
|
||||
|
||||
| Tool | When to use | How |
|
||||
|
|
@ -179,24 +203,24 @@ After the unified profile, cross-reference CPU hotspots with allocation sites an
|
|||
8. **Exercised?** Does benchmark exercise this path?
|
||||
9. **Correctness?** Thread safety, null handling, exception contracts.
|
||||
10. **Production context?** Server/CLI/batch/library changes what "improvement" means.
|
||||
11. **Session mode?** LARGE-SCALE / LIBRARY PRIMITIVE / CROSS-FUNCTION REFACTOR / PLUGIN VALIDATION — read the Part 3 brief. The mode determines your acceptance tier, your evidence bar, and whether multi-method refactors are allowed. If no mode is declared, assume LARGE-SCALE and act accordingly.
|
||||
12. **Single-method fix sufficient?** If yes, keep the diff small and do NOT invoke CROSS-FUNCTION REFACTOR patterns. If no, and the session mode permits it (CROSS-FUNCTION REFACTOR), follow that mode's protocol explicitly.
|
||||
11. **Session mode and refactor scope?** LARGE-SCALE / LIBRARY PRIMITIVE / PLUGIN VALIDATION, plus optional `REFACTOR SCOPE`. If no mode is declared, assume LARGE-SCALE. If no refactor scope is declared, assume `SINGLE-METHOD ONLY` unless the launcher provided the Java default `PROFILE-GATED CROSS-FUNCTION ALLOWED`.
|
||||
12. **Single-method fix sufficient?** If yes, keep the diff small. If no, and the prompt permits cross-function work (`REFACTOR SCOPE: PROFILE-GATED CROSS-FUNCTION ALLOWED` or `SESSION MODE: CROSS-FUNCTION REFACTOR`), follow the cross-function protocol explicitly. If cross-function is not permitted, record the candidate in HANDOFF.md and continue to the next allowed target.
|
||||
|
||||
## Session Mode Handling
|
||||
|
||||
Your Part 3 prompt may declare `SESSION MODE: <mode>`. This determines what kinds of changes you are authorized to attempt and what evidence you must produce for a keep.
|
||||
|
||||
- **LARGE-SCALE:** targets selected from the Phase 2.5 workload profile ranked by CPU%. Each keep requires hot-path evidence (≥1% of workload CPU) AND an end-to-end query-level wall-time measurement showing the win carries through. Default mode when unspecified.
|
||||
- **LARGE-SCALE:** targets selected from the Phase 2.5 workload profile ranked by CPU%. Each keep requires hot-path evidence (≥1% of workload CPU, or the memory/GC/latency/contention thresholds from the value gate) AND an end-to-end query-level wall-time measurement showing the win carries through. Default mode when unspecified. If the prompt also contains `REFACTOR SCOPE: PROFILE-GATED CROSS-FUNCTION ALLOWED`, large-scale may include cross-function refactors when profiling proves the optimization spans multiple methods.
|
||||
- **LIBRARY PRIMITIVE:** targets are primitives (decoders, collections, hash/string utilities). Each keep requires JMH rigor (@Fork≥2, non-overlapping 99% CI) AND ≥3 named downstream call sites. No whole-workload profile needed.
|
||||
- **CROSS-FUNCTION REFACTOR:** targets are data-flow-level optimizations that span multiple methods. Small-PR Rule is lifted to ~500 LOC. Each keep requires:
|
||||
- **CROSS-FUNCTION REFACTOR / PROFILE-GATED CROSS-FUNCTION:** targets are data-flow-level optimizations that span multiple methods. This may be explicit as `SESSION MODE: CROSS-FUNCTION REFACTOR` or enabled under large-scale by `REFACTOR SCOPE: PROFILE-GATED CROSS-FUNCTION ALLOWED`. Small-PR Rule is lifted only for the scoped data-flow change, up to ~500 LOC of production code. Each keep requires:
|
||||
- Hot-path evidence showing combined ≥5% CPU across touched methods (from Phase 2.5 profile).
|
||||
- End-to-end wall-time improvement: ≥10% on ≥2 representative queries (Strong tier) or ≥5% on ≥1 query (Modest tier), with non-overlapping CI.
|
||||
- A behavioral-equivalence property-based test COMMITTED in the PR's diff at the project's conventional test location (e.g. `core/trino-main/src/test/java/...`). Session-local tests do NOT count — the test ships with the PR.
|
||||
- A Touched Call Graph in HANDOFF.md: every changed method with old→new signature, every call site, the preserved-behavior contract.
|
||||
Delegate to `codeflash-java-cpu` with the mode declared verbatim in your prompt so the CPU agent follows its CROSS-FUNCTION REFACTOR Mode section.
|
||||
You may implement the refactor yourself if the cross-domain interaction is the fix; otherwise delegate to `codeflash-java-cpu` with both the `SESSION MODE` line and `REFACTOR SCOPE` line declared verbatim in your prompt so the CPU agent follows its CROSS-FUNCTION REFACTOR Mode section.
|
||||
- **PLUGIN VALIDATION:** you're being tested as an agent, not shipping production code. Rigor gates are relaxed. Do the work that demonstrates agent behavior.
|
||||
|
||||
When you spawn a subagent, **the first line of your prompt to the subagent MUST be `SESSION MODE: <mode>`**. This is non-negotiable — subagents choose their acceptance tier based on the mode, and the mode cannot be inferred from the rest of the prompt reliably.
|
||||
When you spawn a subagent, **the first lines of your prompt to the subagent MUST include `SESSION MODE: <mode>` and any `REFACTOR SCOPE:` line**. This is non-negotiable — subagents choose their acceptance tier and refactor authorization from those lines, and they cannot be inferred from the rest of the prompt reliably.
|
||||
|
||||
## Team Orchestration
|
||||
|
||||
|
|
@ -227,7 +251,7 @@ LOOP (until plateau or user requests stop):
|
|||
2. **Choose target.** Pick from the unified target table. Prefer multi-domain targets. For each target, decide: **handle it yourself** (cross-domain interaction) or **dispatch to a domain agent** (single-domain). Print `[experiment N] Target: <name> (<domains>, hypothesis: <interaction>)`.
|
||||
3. **Joint reasoning checklist.** Answer all 10 questions. If the interaction hypothesis is unclear, profile deeper first.
|
||||
4. **Read source.** Read ONLY the target function. Use Explore subagent for broader context. Do NOT read the whole codebase upfront.
|
||||
5. **Micro-benchmark** (when applicable). Design a JMH A/B benchmark by following the 6-step decision framework in `../references/micro-benchmark.md` -- do NOT hardcode parameters. Print your design decisions (`[micro-bench] Mode: ..., Forks: ..., Warmup: ...`). Capture baseline BEFORE code changes:
|
||||
5. **Micro-benchmark** (when applicable). Design a JMH A/B benchmark by following the 6-step decision framework in `../references/micro-benchmark.md` -- do NOT hardcode parameters. Print your design decisions (`[micro-bench] Mode: ..., Forks: ..., Warmup: ...`). In `LARGE-SCALE` mode, this is only a pre-screen for the implementation mechanism; it cannot justify KEEP by itself. Capture baseline BEFORE code changes:
|
||||
```bash
|
||||
bash /tmp/jmh-runner.sh "<BenchmarkClass>" --label baseline \
|
||||
--mode avgt --forks 3 --warmup 5 --measurement 10 --time 1
|
||||
|
|
@ -239,7 +263,7 @@ LOOP (until plateau or user requests stop):
|
|||
--compare /tmp/jmh-results-baseline.json
|
||||
```
|
||||
The runner extracts min scores and computes speedup automatically. Print `[experiment N] Micro: baseline min=<X>ns, optimized min=<Y>ns, speedup=<Z>x`. If micro-benchmark shows no improvement, verify with `--prof perfasm` whether the JIT already optimizes this pattern. If so, DISCARD without implementing.
|
||||
6. **Implement ONE fix.** Print `[experiment N] Implementing: <summary>`.
|
||||
6. **Implement ONE fix.** Print `[experiment N] Implementing: <summary>`. For a profile-gated cross-function refactor, "one fix" means one coherent data-flow change with a Touched Call Graph, not a grab bag of local cleanups.
|
||||
7. **Multi-dimensional measurement.** Re-run the unified profiling script. Measure ALL dimensions (CPU, Memory, GC), not just the one you targeted.
|
||||
8. **Guard** (run tests). `mvn test` or `./gradlew test`. Revert if fails.
|
||||
9. **Print results** -- ALL dimensions: CPU, Memory, GC pauses.
|
||||
|
|
@ -251,17 +275,18 @@ LOOP (until plateau or user requests stop):
|
|||
10. **Cross-domain impact assessment.** Did the fix in domain A affect domain B? Was the interaction expected? Record it.
|
||||
11. **Small delta?** If <5% in target dimension, re-run 3x to confirm. But also check: did a DIFFERENT dimension improve unexpectedly? That's a cross-domain interaction -- record it.
|
||||
11a. **Correctness probe before KEEP (mandatory for data-flow changes).** If this experiment swapped a constructor, changed an encoding/charset, replaced an algorithm, fused loops across methods, monomorphized dispatch, removed a fast-path, or otherwise changed HOW output is produced, write a behavioral-equivalence probe per `${CLAUDE_PLUGIN_ROOT}/references/shared/correctness-probe-patterns.md` BEFORE step 12. The probe must target inputs where the before and after implementations COULD diverge (boundary bytes 0x00/0x7F/0x80/0xFF, embedded multi-byte sequences, adversarial/malformed inputs, empty, max-length), ship as a committed test at the project's conventional test path, and pass under the project's existing test runner. "Existing tests pass" is NOT a substitute. Actively try to construct an adversarial input that could falsify equivalence — either the probe passes on it (good, KEEP is defensible) or it exposes a divergence you don't yet understand (stop and investigate, do NOT KEEP).
|
||||
12. **Keep/discard.** (Step 11a must have passed for any data-flow change.) Print `[experiment N] KEEP -- <net effect across dimensions>` or `[experiment N] DISCARD -- <reason>`. Do NOT `git commit` here — commits are handled per `${CLAUDE_PLUGIN_ROOT}/references/shared/agent-base-protocol.md` "Git Operations Boundary" and only after the post-return audit in `${CLAUDE_PLUGIN_ROOT}/references/shared/router-base.md` has cleared the experiment.
|
||||
13. **Record** in `.codeflash/results.tsv` AND `.codeflash/HANDOFF.md` immediately. Include ALL dimensions measured. Update Hotspot Summary and Kept/Discarded sections.
|
||||
14. **E2E benchmark** (after KEEP, when available). Run the full JMH benchmark suite against the baseline for authoritative measurement:
|
||||
12. **Provisional keep/discard.** (Step 11a must have passed for any data-flow change.) If the local benchmark/profile result is clearly regressed or unchanged, DISCARD now. If it appears promising, mark it `provisional_keep` in your notes and continue to E2E/workload measurement. Do NOT record final `keep` for `LARGE-SCALE` or cross-function work before E2E evidence exists.
|
||||
13. **E2E benchmark before final KEEP.** For `LARGE-SCALE` and all cross-function refactors, run the representative workload or full JMH benchmark suite against the baseline for authoritative measurement:
|
||||
```bash
|
||||
bash /tmp/jmh-runner.sh "<BenchmarkClass>" --label e2e \
|
||||
--compare /tmp/jmh-results-baseline.json
|
||||
```
|
||||
Read `../references/e2e-benchmarks.md` for the git-worktree-based workflow for more rigorous isolation. Record e2e results alongside micro-bench results. If e2e contradicts micro-bench (e.g., micro showed 15% but e2e shows <2%), re-evaluate -- trust the e2e measurement. Print `[experiment N] E2E: base=<X>ns -> head=<Y>ns (<speedup>x)`.
|
||||
15. **Config audit** (after KEEP). Check for related configuration flags that became dead or inconsistent. Cross-domain fixes may leave behind stale config across multiple subsystems.
|
||||
16. **Strategy revision** (after every KEEP). Re-run unified profiling. Print updated `[unified targets]` table. Check for remaining targets (>1% CPU, >2 MiB memory, >5ms latency). Scan for code antipatterns (autoboxing, `String.format` in loops, `synchronized` on hot path, `Arrays.asList` in hot loops) that may not rank high in profiling but are trivially fixable. Ask: "What did I learn? What changed across domains? Should I continue or pivot?"
|
||||
17. **Milestones** (every 3-5 keeps): Full benchmark, tag, AND run adversarial review on commits since last milestone. Fix HIGH-severity findings before continuing.
|
||||
Read `../references/e2e-benchmarks.md` for the git-worktree-based workflow for more rigorous isolation. If E2E contradicts micro-bench (e.g., micro showed 15% but E2E shows <2%), trust E2E and DISCARD or rework.
|
||||
14. **Final keep/discard.** In `LARGE-SCALE` mode, final KEEP requires workload evidence: the target met the value gate before the change AND an E2E/representative workload measurement improved by >=5% with no material regression. Strong cross-function keeps require >=10% on >=2 representative workloads when available. Micro-only wins are DISCARD unless the session was reclassified as `LIBRARY PRIMITIVE` and has >=3 named downstream callers. Print `[experiment N] KEEP -- <net effect across dimensions>` or `[experiment N] DISCARD -- <reason>`. Do NOT `git commit` here — commits are handled per `${CLAUDE_PLUGIN_ROOT}/references/shared/agent-base-protocol.md` "Git Operations Boundary" and only after the post-return audit in `${CLAUDE_PLUGIN_ROOT}/references/shared/router-base.md` has cleared the experiment.
|
||||
15. **Record** in `.codeflash/results.tsv` AND `.codeflash/HANDOFF.md` immediately. Include ALL dimensions measured and E2E metrics. Update Hotspot Summary and Kept/Discarded sections.
|
||||
16. **Config audit** (after KEEP). Check for related configuration flags that became dead or inconsistent. Cross-domain fixes may leave behind stale config across multiple subsystems.
|
||||
17. **Strategy revision** (after every KEEP). Re-run unified profiling. Print updated `[unified targets]` table. Check for remaining targets (>1% CPU, >2 MiB memory, >5ms latency). Scan for code antipatterns that may not rank high in profiling but are only worth fixing if they meet the value gate or are part of the same hot path. Ask: "What did I learn? What changed across domains? Should I continue or pivot?"
|
||||
18. **Milestones** (every 3-5 keeps): Full benchmark, tag, AND run adversarial review on commits since last milestone. Fix HIGH-severity findings before continuing.
|
||||
|
||||
### Keep/Discard
|
||||
|
||||
|
|
@ -347,12 +372,15 @@ Also update the shared task list:
|
|||
Tab-separated `.codeflash/results.tsv`:
|
||||
|
||||
```
|
||||
commit target_test cpu_baseline_s cpu_optimized_s cpu_speedup mem_baseline_mb mem_optimized_mb mem_delta_mb gc_before_s gc_after_s tests_passed tests_failed status domains interaction description
|
||||
commit session_mode target_test workload_command workload_cpu_pct baseline_metric optimized_metric improvement_pct e2e_improvement_pct cpu_baseline_s cpu_optimized_s cpu_speedup mem_baseline_mb mem_optimized_mb mem_delta_mb gc_before_s gc_after_s tests_passed tests_failed status domains interaction correctness_probe description
|
||||
```
|
||||
|
||||
- `domains`: comma-separated (e.g., `cpu,mem`)
|
||||
- `interaction`: cross-domain effect observed (e.g., `alloc_to_gc_reduction`, `none`)
|
||||
- `status`: `keep`, `discard`, or `crash`
|
||||
- `baseline_metric` / `optimized_metric`: real measured values with units (`12345 ns/op`, `82 ms`, `240 MiB`, `920 ops/s`), never identifiers or prose.
|
||||
- `improvement_pct`: finite numeric percent improvement in the primary measured metric.
|
||||
- `e2e_improvement_pct`: finite numeric percent improvement from the representative workload for `LARGE-SCALE`; use `N/A` only for `discard` rows or explicitly documented `LIBRARY PRIMITIVE` rows.
|
||||
|
||||
## Reference Loading
|
||||
|
||||
|
|
@ -399,15 +427,16 @@ You are self-sufficient -- handle your own setup before any profiling.
|
|||
1. **Verify you are on `codeflash/optimize`.** Run `git branch --show-current`. If the branch is already `codeflash/optimize`, continue. If it exists but is not checked out, run `git checkout codeflash/optimize`. **Do NOT create the branch yourself** — per `${CLAUDE_PLUGIN_ROOT}/references/shared/agent-base-protocol.md` "Git Operations Boundary", branch creation is the router's (or user's) responsibility and happens before you are launched. If you find yourself on `main` (or any other branch) and `codeflash/optimize` does not exist, report this back via the coordination loop as a router-setup gap and stop — do not paper over it with `git checkout -b`. (**CI mode**: skip this step entirely — stay on the current branch.)
|
||||
2. **Initialize `.codeflash/HANDOFF.md`** from `${CLAUDE_PLUGIN_ROOT}/references/shared/handoff-template.md`. Fill in: branch, project root, JDK version, build tool, test command, GC algorithm.
|
||||
3. **Unified baseline.** Run the unified CPU+Memory+GC profiling.
|
||||
4. **Capture JMH baseline.** If the project has JMH benchmarks (check `.codeflash/setup.md`), run them on the unmodified code to establish a performance baseline:
|
||||
4. **Build the Large-Scale Value Gate table.** In `LARGE-SCALE` mode, do this before any JMH micro-benchmark. If no target meets the threshold, discover a better representative workload or stop with a measured plateau; do not fall back to cold source-code smells.
|
||||
5. **Capture JMH baseline.** If the project has JMH benchmarks (check `.codeflash/setup.md`) and they exercise a target from the value-gate table, run them on the unmodified code to establish a performance baseline:
|
||||
```bash
|
||||
bash /tmp/jmh-runner.sh "<BenchmarkClass>" --label baseline \
|
||||
--mode avgt --forks 3 --warmup 5 --measurement 10 --time 1
|
||||
```
|
||||
This baseline is the comparison point for all subsequent experiments. Without it, benchmark numbers are meaningless.
|
||||
5. **Build unified target table.** Cross-reference CPU hotspots with memory allocators and GC impact. Identify multi-domain targets. **Update HANDOFF.md** Hotspot Summary.
|
||||
6. **Plan dispatch.** Classify each target as cross-domain (handle yourself) or single-domain (candidate for dispatch). If there are 2+ single-domain targets in the same domain, consider dispatching a domain agent.
|
||||
7. **Enter the experiment loop.**
|
||||
6. **Build unified target table.** Cross-reference CPU hotspots with memory allocators and GC impact. Identify multi-domain targets. **Update HANDOFF.md** Hotspot Summary.
|
||||
7. **Plan dispatch.** Classify each target as cross-domain (handle yourself) or single-domain (candidate for dispatch). If there are 2+ single-domain targets in the same domain, consider dispatching a domain agent.
|
||||
8. **Enter the experiment loop.**
|
||||
|
||||
### CI mode
|
||||
|
||||
|
|
|
|||
|
|
@ -28,6 +28,10 @@ You are an autonomous memory and GC optimization agent for Java and Kotlin. You
|
|||
|
||||
**Read `${CLAUDE_PLUGIN_ROOT}/references/shared/agent-base-protocol.md` at session start** for shared operational rules.
|
||||
|
||||
## Value Gate
|
||||
|
||||
If your prompt contains `SESSION MODE: LARGE-SCALE`, only work on allocation/GC targets tied to a representative workload: at least 20 MiB/run avoidable allocation, at least 5 ms/request GC or allocation-driven latency, or a method that also accounts for at least 1% workload CPU. JMH with `-prof gc` validates the mechanism, but KEEP requires workload-level allocation/GC or latency improvement. For `SESSION MODE: LIBRARY PRIMITIVE`, JMH-only acceptance is allowed only when the primitive has at least three named downstream callers.
|
||||
|
||||
## Allocation Categories
|
||||
|
||||
| Category | Reducible? | Strategy |
|
||||
|
|
@ -227,7 +231,7 @@ Re-run heap histogram. Print updated allocator table. The #2 allocator may now b
|
|||
## Results Schema
|
||||
|
||||
```
|
||||
commit target_test target_mib heap_used_mib gc_pause_ms gc_count tests_passed tests_failed status description
|
||||
commit session_mode target_test workload_command workload_cpu_pct baseline_metric optimized_metric improvement_pct e2e_improvement_pct mem_baseline_mb mem_optimized_mb mem_delta_mb gc_before_s gc_after_s gc_count tests_passed tests_failed status pattern correctness_probe description
|
||||
```
|
||||
|
||||
## Progress Reporting
|
||||
|
|
|
|||
|
|
@ -163,6 +163,17 @@ ls -d src/test/java/ src/test/kotlin/ 2>/dev/null
|
|||
find . -name "module-info.java" -not -path "*/node_modules/*" 2>/dev/null | head -5
|
||||
```
|
||||
|
||||
Detect the base package for profiler filtering. Prefer the most common package prefix under production sources; fall back to `.` if ambiguous:
|
||||
|
||||
```bash
|
||||
find src/main/java src/main/kotlin -type f \( -name "*.java" -o -name "*.kt" \) 2>/dev/null | \
|
||||
xargs grep -hE '^[[:space:]]*package[[:space:]]+[A-Za-z0-9_.]+' 2>/dev/null | \
|
||||
sed -E 's/^[[:space:]]*package[[:space:]]+([^;]+).*/\1/' | \
|
||||
awk -F. '{print $1"."$2}' | sort | uniq -c | sort -rn | head -5
|
||||
```
|
||||
|
||||
If a multi-module project has several unrelated package roots, record `.` and list the candidates.
|
||||
|
||||
### 8. Check for existing benchmark infrastructure
|
||||
|
||||
```bash
|
||||
|
|
@ -199,6 +210,19 @@ Determine JMH mode based on what you find:
|
|||
| `jmh-core` in pom.xml or build.gradle (no plugin) | `project-deps` |
|
||||
| None of the above | `standalone` (jmh-runner.sh will download JMH JARs) |
|
||||
|
||||
Also detect representative workload entry points for large-scale optimization:
|
||||
|
||||
```bash
|
||||
# Integration/E2E/performance test names
|
||||
find . -path "*/src/test/*" -name "*.java" 2>/dev/null | \
|
||||
grep -Ei "(IT|Integration|E2E|EndToEnd|Perf|Performance|Benchmark|Load)" | head -20
|
||||
|
||||
# Maven/Gradle benchmark or integration-test tasks/profiles
|
||||
grep -R -n -Ei "integrationTest|e2e|performance|benchmark|jmh|failsafe" pom.xml build.gradle build.gradle.kts settings.gradle settings.gradle.kts 2>/dev/null | head -30
|
||||
```
|
||||
|
||||
Record the best available workload candidates in setup.md. Deep mode uses these before falling back to generic unit tests.
|
||||
|
||||
### 9. Detect GC algorithm
|
||||
|
||||
```bash
|
||||
|
|
@ -242,6 +266,8 @@ Create the `.codeflash/` directory if needed, then write:
|
|||
- **JMH mode**: <gradle-plugin|source-dir|project-deps|standalone>
|
||||
- **Benchmark source dir**: <path to src/jmh/java or benchmark directory|N/A>
|
||||
- **Benchmark infrastructure**: <JMH benchmarks found|benchmark directory|none>
|
||||
- **Representative workloads**: <commands/tests/profiles likely to exercise production paths, or none found>
|
||||
- **Base package for profiling**: <e.g., com.example or . if ambiguous>
|
||||
- **Project root**: <absolute path>
|
||||
```
|
||||
|
||||
|
|
|
|||
|
|
@ -28,6 +28,10 @@ You are an autonomous codebase structure optimization agent for Java and Kotlin.
|
|||
|
||||
**Read `${CLAUDE_PLUGIN_ROOT}/references/shared/agent-base-protocol.md` at session start** for shared operational rules.
|
||||
|
||||
## Value Gate
|
||||
|
||||
If your prompt contains `SESSION MODE: LARGE-SCALE`, only work on structure targets with measured startup/build/runtime impact: startup/class-loading cost in the representative workload, a class/module load chain contributing at least 5 ms, or a structural issue that blocks a larger measured optimization. Do not perform broad cleanup refactors as performance work. KEEP requires startup/build/workload timing or class-load-count evidence plus normal correctness checks.
|
||||
|
||||
## Target Categories
|
||||
|
||||
| Category | Worth fixing? | How to measure |
|
||||
|
|
@ -153,7 +157,7 @@ Update `.codeflash/results.tsv` AND `.codeflash/HANDOFF.md` immediately after ev
|
|||
## Results Schema
|
||||
|
||||
```
|
||||
commit target metric_name baseline result delta tests_passed tests_failed status description
|
||||
commit session_mode target_test workload_command workload_cpu_pct baseline_metric optimized_metric improvement_pct e2e_improvement_pct metric_name tests_passed tests_failed status pattern correctness_probe description
|
||||
```
|
||||
|
||||
## Progress Reporting
|
||||
|
|
|
|||
|
|
@ -51,18 +51,23 @@ You are the team lead for Java/Kotlin performance optimization. Your job is to d
|
|||
|
||||
**Structure optimization is opt-in.** Only route to `codeflash-java-structure` when the user explicitly mentions startup time, class loading, module structure, or circular dependencies.
|
||||
|
||||
## Session Mode Detection (NEW)
|
||||
## Session Mode Detection
|
||||
|
||||
The user-supplied scoped brief (Part 3 of your prompt) may declare an explicit session mode on its first line:
|
||||
|
||||
- `SESSION MODE: LARGE-SCALE` — the user wants infrastructure-cost-grade evidence: whole-workload profiling, target methods at ≥1% of workload CPU, end-to-end query-level benchmark confirming the win carries through. Route to `codeflash-java-deep` (default) and FORWARD the mode declaration + the workload profile context (if attached in the brief) to every subagent you spawn.
|
||||
- `SESSION MODE: LIBRARY PRIMITIVE` — the user is optimizing a primitive (file-format decoder, collection, hash/string utility) whose hotness is implied by its pervasive call pattern. JMH rigor + CI non-overlap + ≥3 named callers is the bar. Route to the appropriate single-domain agent (usually `codeflash-java-cpu`).
|
||||
- `SESSION MODE: CROSS-FUNCTION REFACTOR` — the user has authorized multi-method coordinated changes (removing intermediate materialization, loop fusion across call boundaries, monomorphizing dispatch, pipeline allocation elimination). The default Small-PR Rule is lifted for this session up to ~500 LOC. Route to `codeflash-java-cpu` and INCLUDE in the agent prompt: "This session is CROSS-FUNCTION REFACTOR mode per the user's brief. You MAY attempt coordinated changes across multiple methods. You MUST follow the CROSS-FUNCTION REFACTOR Mode section in your instructions, including writing a committed behavioral-equivalence test and producing a Touched Call Graph."
|
||||
- `SESSION MODE: CROSS-FUNCTION REFACTOR` — legacy explicit mode for a session whose primary target is already known to require multi-method coordinated changes (removing intermediate materialization, loop fusion across call boundaries, monomorphizing dispatch, pipeline allocation elimination). Prefer `codeflash-java-deep` unless the user also explicitly says CPU-only; deep mode must still build workload evidence before dispatching CPU.
|
||||
- `SESSION MODE: PLUGIN VALIDATION` — the user is testing agent behavior on a fixture, not shipping merge-ready code. Relax rigor gates; the output is being judged for agent-behavior correctness, not merge-readiness.
|
||||
|
||||
**If the brief does NOT declare a mode**, assume `LIBRARY PRIMITIVE` for file-format/collection/utility targets and `LARGE-SCALE` for engine/operator/planner targets. Default to LARGE-SCALE when in doubt.
|
||||
|
||||
**The session mode is forwarded verbatim to every spawned subagent.** Subagents rely on the mode to know which acceptance tier applies to their work and whether to attempt cross-function refactors or stay single-method. When you spawn a subagent, prepend your prompt with `SESSION MODE: <mode>` on its own line so the subagent cannot miss it.
|
||||
The scoped brief may also include:
|
||||
|
||||
- `REFACTOR SCOPE: PROFILE-GATED CROSS-FUNCTION ALLOWED` — default for Java large-scale sessions. This is NOT a replacement for `LARGE-SCALE`; it is authorization to attempt cross-function refactors only after profiling proves the win spans multiple methods and the stricter cross-function evidence/test gates are satisfied.
|
||||
- `REFACTOR SCOPE: SINGLE-METHOD ONLY` — user/reviewer wants small local diffs. Do not attempt cross-function refactors; record larger candidates in HANDOFF.md.
|
||||
|
||||
**Forward the session mode and refactor scope verbatim to every spawned subagent.** Subagents rely on these lines to know which acceptance tier applies and whether multi-method refactors are allowed. When you spawn a subagent, prepend your prompt with the `SESSION MODE: <mode>` line and any `REFACTOR SCOPE:` line so the subagent cannot miss them.
|
||||
|
||||
## Reference Loading
|
||||
|
||||
|
|
|
|||
|
|
@ -106,7 +106,7 @@ java -XX:+PrintCompilation -jar benchmarks.jar "MicroBench" -f 1 2>&1 | grep "bi
|
|||
|
||||
**Step 14 -- Record**: Record immediately in `.codeflash/results.tsv`. Do not batch.
|
||||
|
||||
**Step 16 -- E2E benchmark (after KEEP)**: Run JMH-based E2E comparison per `../e2e-benchmarks.md` at production thread count. If E2E shows no throughput gain despite micro-bench showing one, the contention may not be the real bottleneck in production -- trust E2E.
|
||||
**Step 16 -- E2E benchmark (before final KEEP for LARGE-SCALE)**: Run JMH-based E2E comparison per `../e2e-benchmarks.md` at production thread count. If E2E shows no throughput gain despite micro-bench showing one, the contention may not be the real bottleneck in production -- trust E2E.
|
||||
|
||||
**Step 17 -- Config audit**: After concurrency changes, check:
|
||||
- Thread pool sizes that assumed the old locking model
|
||||
|
|
@ -174,10 +174,10 @@ If top 3 remaining issues are all non-optimizable, **stop and report to user** w
|
|||
|
||||
## Logging Format
|
||||
|
||||
Tab-separated `.codeflash/results.tsv`, 13 columns:
|
||||
Tab-separated `.codeflash/results.tsv`:
|
||||
|
||||
```
|
||||
commit target_test baseline_throughput optimized_throughput throughput_change baseline_latency_p99_ms optimized_latency_p99_ms threads jit_verified tests_passed tests_failed status pattern description
|
||||
commit session_mode target_test workload_command workload_cpu_pct baseline_metric optimized_metric improvement_pct e2e_improvement_pct baseline_throughput optimized_throughput throughput_change baseline_latency_p99_ms optimized_latency_p99_ms threads jit_verified tests_passed tests_failed status pattern correctness_probe description
|
||||
```
|
||||
|
||||
- `target_test`: test name, endpoint name, or `micro:<JMH benchmark name>`
|
||||
|
|
|
|||
|
|
@ -120,7 +120,7 @@ java -jar benchmarks.jar "MicroBench" -prof gc
|
|||
|
||||
**Step 14 -- Record**: Record immediately in `.codeflash/results.tsv`. Do not batch.
|
||||
|
||||
**Step 16 -- E2E benchmark (after KEEP)**: Run JMH-based E2E comparison per `../e2e-benchmarks.md`. Use git worktrees for isolation. If E2E contradicts micro-bench, trust the E2E.
|
||||
**Step 16 -- E2E benchmark (before final KEEP for LARGE-SCALE/cross-function)**: Run JMH-based E2E comparison per `../e2e-benchmarks.md`. Use git worktrees for isolation. If E2E contradicts micro-bench, trust the E2E.
|
||||
|
||||
Print: `[experiment N] E2E: baseline <X> ns/op, optimized <Y> ns/op (<Z>x, min-to-min)`
|
||||
|
||||
|
|
@ -141,7 +141,7 @@ Output matches original?
|
|||
+-- <5% -> Re-run 3x with min-to-min comparison
|
||||
| +-- Confirmed (min-to-min consistent) -> KEEP
|
||||
| +-- Noise (min varies by >5% across runs) -> DISCARD
|
||||
+-- Micro-bench only: >=20% on confirmed hot path -> KEEP
|
||||
+-- Micro-bench only: >=20% on confirmed hot path -> KEEP only in LIBRARY PRIMITIVE mode with >=3 named downstream callers; in LARGE-SCALE mode require E2E/workload evidence
|
||||
+-- JIT deopt fix: KEEP if -XX:+PrintCompilation confirms deopt eliminated
|
||||
+-- No improvement -> DISCARD
|
||||
```
|
||||
|
|
@ -173,16 +173,16 @@ If top 3 hotspots are all non-optimizable, **stop and report to user** with what
|
|||
|
||||
## Logging Format
|
||||
|
||||
Tab-separated `.codeflash/results.tsv`, 11 columns:
|
||||
Tab-separated `.codeflash/results.tsv`:
|
||||
|
||||
```
|
||||
commit target_test baseline_ns_op optimized_ns_op speedup jit_verified tests_passed tests_failed status pattern description
|
||||
commit session_mode target_test workload_command workload_cpu_pct baseline_metric optimized_metric improvement_pct e2e_improvement_pct jit_verified tests_passed tests_failed status pattern correctness_probe description
|
||||
```
|
||||
|
||||
- `target_test`: test name, `all`, or `micro:<JMH benchmark name>`
|
||||
- `baseline_ns_op`: baseline min score in nanoseconds per operation
|
||||
- `optimized_ns_op`: optimized min score in nanoseconds per operation
|
||||
- `speedup`: ratio (e.g., `1.37x`) or percentage (e.g., `37%`)
|
||||
- `baseline_metric` / `optimized_metric`: real measured values with units, typically min nanoseconds per operation for JMH
|
||||
- `improvement_pct`: finite numeric percent improvement in the primary metric
|
||||
- `e2e_improvement_pct`: finite numeric workload improvement for LARGE-SCALE keeps; `N/A` only for discards or LIBRARY PRIMITIVE rows
|
||||
- `jit_verified`: `yes`, `no`, or `skip` (for >2x speedups)
|
||||
- `status`: `keep`, `discard`, `crash`, or `escalate`
|
||||
- `pattern`: antipattern category (e.g., `arraylist-contains`, `quadratic-loop`, `autoboxing`)
|
||||
|
|
|
|||
|
|
@ -43,15 +43,8 @@ java -jar target/benchmarks.jar "RelevantBenchmark" \
|
|||
-rf json -rff /tmp/head-results.json -v EXTRA \
|
||||
-f 3 -wi 5 -i 10
|
||||
|
||||
# 2. Stash or checkout base, run same benchmark
|
||||
git stash # or use worktree
|
||||
git checkout "${BASE_SHA}"
|
||||
mvn clean package -DskipTests
|
||||
java -jar target/benchmarks.jar "RelevantBenchmark" \
|
||||
-rf json -rff /tmp/base-results.json -v EXTRA \
|
||||
-f 3 -wi 5 -i 10
|
||||
git checkout - # return to optimized branch
|
||||
git stash pop # if stashed
|
||||
# 2. Prefer a git worktree for the base ref; do not disturb the user's worktree.
|
||||
# See "Using git worktrees" below.
|
||||
|
||||
# 3. Compare min scores (requires jq)
|
||||
# Extract min scores from both files and compare
|
||||
|
|
@ -116,17 +109,18 @@ git worktree remove /tmp/base-worktree
|
|||
|
||||
If the project has no JMH infrastructure:
|
||||
|
||||
1. **Use the unified profiling script** as the primary E2E measurement:
|
||||
1. **Use the unified profiling script** as the primary E2E measurement, profiling the base ref in a separate worktree and head in the current checkout:
|
||||
```bash
|
||||
# Run on base
|
||||
git stash
|
||||
bash /tmp/java_deep_profile.sh com.example -- mvn test
|
||||
# Record wall time and JFR metrics
|
||||
BASE_SHA="abc1234" # pre-optimization commit
|
||||
git worktree add /tmp/base-worktree "${BASE_SHA}"
|
||||
|
||||
# Run on head
|
||||
git stash pop
|
||||
# Base workload profile
|
||||
(cd /tmp/base-worktree && bash /tmp/java_deep_profile.sh com.example -- mvn test)
|
||||
|
||||
# Head workload profile
|
||||
bash /tmp/java_deep_profile.sh com.example -- mvn test
|
||||
# Compare wall time and JFR metrics
|
||||
|
||||
git worktree remove /tmp/base-worktree
|
||||
```
|
||||
|
||||
2. **Use test suite timing** as a secondary signal:
|
||||
|
|
|
|||
|
|
@ -85,7 +85,7 @@ If both baseline and optimized show zero allocation (`gc.alloc.rate.norm = 0`),
|
|||
|
||||
**Step 14 -- Record**: Record immediately in `.codeflash/results.tsv`. Do not batch.
|
||||
|
||||
**Step 16 -- E2E benchmark (after KEEP)**: Run JMH-based E2E comparison per `../e2e-benchmarks.md` with `-prof gc`. Compare both time AND allocation. If E2E shows no allocation difference despite micro-bench showing one, the JIT optimizes differently in production context -- trust E2E.
|
||||
**Step 16 -- E2E benchmark (before final KEEP for LARGE-SCALE)**: Run JMH-based E2E comparison per `../e2e-benchmarks.md` with `-prof gc`. Compare both time AND allocation. If E2E shows no allocation difference despite micro-bench showing one, the JIT optimizes differently in production context -- trust E2E.
|
||||
|
||||
**Step 17 -- Config audit**: After allocation reduction, check:
|
||||
- Heap sizing (`-Xms`, `-Xmx`) -- may be oversized now
|
||||
|
|
@ -141,10 +141,10 @@ If >85% of peak is irreducible, **stop and report to user** with what's left and
|
|||
|
||||
## Logging Format
|
||||
|
||||
Tab-separated `.codeflash/results.tsv`, 12 columns:
|
||||
Tab-separated `.codeflash/results.tsv`:
|
||||
|
||||
```
|
||||
commit target_test target_mib heap_used_mib gc_alloc_rate_norm gc_pause_ms gc_count tests_passed tests_failed status pattern description
|
||||
commit session_mode target_test workload_command workload_cpu_pct baseline_metric optimized_metric improvement_pct e2e_improvement_pct mem_baseline_mb mem_optimized_mb mem_delta_mb gc_before_s gc_after_s gc_count tests_passed tests_failed status pattern correctness_probe description
|
||||
```
|
||||
|
||||
- `target_test`: test name, `all`, or `micro:<JMH benchmark name>`
|
||||
|
|
|
|||
|
|
@ -79,7 +79,7 @@ Print: `[experiment N] startup: <X>ms -> <Y>ms (<delta>ms), classes: <A> -> <B>`
|
|||
|
||||
**Step 14 -- Record**: Record immediately in `.codeflash/results.tsv`. Do not batch.
|
||||
|
||||
**Step 16 -- E2E verification (after KEEP)**: For structure changes, E2E means running the full application or test suite and verifying:
|
||||
**Step 16 -- E2E verification (before final KEEP for LARGE-SCALE)**: For structure changes, E2E means running the full application or test suite and verifying:
|
||||
- All tests pass (structure changes have wide blast radius)
|
||||
- Startup time improved (or at least didn't regress)
|
||||
- No runtime errors from reflection, serialization, or framework scanning
|
||||
|
|
@ -136,10 +136,10 @@ If top 3 remaining issues are all non-optimizable, **stop and report to user** w
|
|||
|
||||
## Logging Format
|
||||
|
||||
Tab-separated `.codeflash/results.tsv`, 11 columns:
|
||||
Tab-separated `.codeflash/results.tsv`:
|
||||
|
||||
```
|
||||
commit target metric_name baseline result delta jdeps_cycles class_load_count tests_passed tests_failed status description
|
||||
commit session_mode target_test workload_command workload_cpu_pct baseline_metric optimized_metric improvement_pct e2e_improvement_pct metric_name jdeps_cycles class_load_count tests_passed tests_failed status pattern correctness_probe description
|
||||
```
|
||||
|
||||
- `target`: package or class affected (e.g., `com.app.model<->service`)
|
||||
|
|
|
|||
|
|
@ -19,16 +19,28 @@ TaskCreate("Dispatched: Async targets") -- if dispatching async agent
|
|||
|
||||
The key difference from the router dispatching blindly: **you provide cross-domain context the domain agent wouldn't have.**
|
||||
|
||||
Every dispatched prompt MUST start with the current `SESSION MODE: <mode>` line and any `REFACTOR SCOPE:` line, then include the workload row that justified the assignment: `workload_command`, `entry_point`, `target_method`, `workload_cpu_pct`, allocation/GC/latency evidence, and acceptance tier. Do not dispatch a specialist with only "this code looks slow"; specialists need the workload context to avoid chasing micro-wins.
|
||||
|
||||
### CPU specialist example
|
||||
|
||||
```
|
||||
Agent(subagent_type: "codeflash-java-cpu", name: "cpu-specialist",
|
||||
team_name: "deep-session", isolation: "worktree", prompt: "
|
||||
SESSION MODE: LARGE-SCALE
|
||||
REFACTOR SCOPE: PROFILE-GATED CROSS-FUNCTION ALLOWED
|
||||
|
||||
You are working under the deep optimizer's direction.
|
||||
|
||||
## Targeted Assignment
|
||||
Optimize these specific functions: <list from unified target table>
|
||||
|
||||
## Workload Evidence
|
||||
- workload_command: <exact command profiled>
|
||||
- entry_point: <test/benchmark/service path>
|
||||
- target hotness: <CPU %, allocation MiB, GC ms, latency/throughput evidence>
|
||||
- acceptance tier: <strong|modest|primitive>
|
||||
- why this is not a micro-win: <caller or workload evidence>
|
||||
|
||||
## Cross-Domain Context (from deep profiling)
|
||||
- processRecords: 45% CPU, but 40% of that is GC from 120 MiB allocation.
|
||||
I've already fixed the allocation in experiment 1. Re-profile -- the CPU
|
||||
|
|
|
|||
|
|
@ -31,10 +31,16 @@ AUTONOMOUS MODE: The user has already been asked for context (included below). D
|
|||
|
||||
Part 2 — the user's original request (verbatim).
|
||||
|
||||
Part 3 — the user's answer from Step 1 (verbatim).
|
||||
Part 3 — session brief. If the user's answer already starts with `SESSION MODE:`, include it verbatim. Otherwise prepend this line before the user's answer:
|
||||
```
|
||||
SESSION MODE: LARGE-SCALE
|
||||
REFACTOR SCOPE: PROFILE-GATED CROSS-FUNCTION ALLOWED
|
||||
```
|
||||
|
||||
Do not add any other instructions — the router sets up the project, creates the team, launches the optimizer in the background, and coordinates the session. Progress streams directly to the user.
|
||||
|
||||
`LARGE-SCALE` is the default because most Java optimization sessions should find infrastructure-cost-grade wins, not isolated micro-wins. `PROFILE-GATED CROSS-FUNCTION ALLOWED` means the optimizer may attempt a multi-method refactor only after profiling proves a single-method fix cannot capture the win, and only with a committed behavioral-equivalence test, touched call graph, and E2E evidence. Other explicit session modes are `LIBRARY PRIMITIVE`, `CROSS-FUNCTION REFACTOR`, and `PLUGIN VALIDATION`.
|
||||
|
||||
## For `resume`
|
||||
|
||||
Launch the language router:
|
||||
|
|
|
|||
|
|
@ -79,13 +79,12 @@ All session state lives in `.codeflash/`:
|
|||
## Session Resume
|
||||
|
||||
1. Read `.codeflash/HANDOFF.md`, `.codeflash/results.tsv`, `.codeflash/conventions.md`.
|
||||
2. Confirm with user what to work on next.
|
||||
3. Continue the experiment loop.
|
||||
2. Continue the experiment loop from the strongest remaining measured target. Under AUTONOMOUS MODE, do not ask the user what to work on next; infer it from session state and profiling.
|
||||
|
||||
## Session Start — Common Steps
|
||||
|
||||
1. **Read setup.** Read `.codeflash/setup.md` for the runner, language version, and test command. Read `.codeflash/conventions.md` if it exists. Also check for org-level conventions at `../conventions.md` (project-level overrides org-level). Read `.codeflash/learnings.md` if it exists — these are discoveries from previous sessions that prevent repeating dead ends. Read CLAUDE.md. Use the runner from setup.md everywhere you see `$RUNNER`.
|
||||
2. **Create or switch to optimization branch.** `git checkout -b codeflash/optimize` (or `git checkout codeflash/optimize` if it already exists). All optimizations stack as commits on this single branch.
|
||||
2. **Confirm optimization branch.** The router creates or switches to `codeflash/optimize` before launching you. If you are not already on that branch, you may `git checkout codeflash/optimize` if it exists. Do NOT create branches yourself; report a router setup gap if the branch is missing.
|
||||
3. **Initialize HANDOFF.md** with environment and discovery.
|
||||
|
||||
Domain agents add domain-specific steps after these common steps (e.g., baseline profiling method, benchmark tier definition).
|
||||
|
|
|
|||
|
|
@ -20,7 +20,7 @@ LOOP (until plateau detected or user requests stop):
|
|||
4. **Capture original output and baseline performance.** Before changing anything:
|
||||
- **Correctness oracle:** Run the target function with representative inputs and save its output. The optimized version must produce identical results.
|
||||
- **Performance baseline:** Run the benchmark on the CURRENT (unmodified) code and save the results with a "baseline" label. This is your comparison point — without a baseline captured on the same machine in the same session, benchmark numbers are meaningless. See the domain file for the specific benchmark command and tool (e.g., JMH runner for Java, `codeflash compare` for Python).
|
||||
5. **Micro-benchmark** (when applicable). Print `[experiment N] Micro-benchmarking...` then result.
|
||||
5. **Micro-benchmark** (when applicable). Print `[experiment N] Micro-benchmarking...` then result. A micro-benchmark is a screening tool unless the domain file explicitly says the session is a primitive/library optimization. For large-scale sessions, never keep a change on micro-benchmark evidence alone; the target must be hot in the workload profile and the win must carry through to an end-to-end or representative workload measurement.
|
||||
6. **Implement**. Print `[experiment N] Implementing: <one-line summary of change>`.
|
||||
7. **Verify benchmark fidelity.** Re-read the benchmark and confirm it exercises the exact code path and parameters you changed. If you modified function arguments, wrapper flags, pool sizes, or configuration, the benchmark must use the same values. If the benchmark was written before step 6, the implementation may have changed assumptions — update the benchmark to match. A benchmark that doesn't mirror the production change proves nothing.
|
||||
8. **Verify output equivalence.** Run the optimized version with the same inputs from step 4 and compare outputs. If outputs differ, **discard immediately** — this is a correctness regression, not an optimization. Do not proceed to benchmarking.
|
||||
|
|
@ -29,9 +29,10 @@ LOOP (until plateau detected or user requests stop):
|
|||
11. **Read results**: pass/fail, metrics. Print the domain-specific result line (see domain file).
|
||||
12. If crashed or regressed = fix or discard immediately.
|
||||
13. **Confirm small deltas**: If improvement is below the domain's noise threshold, re-run to confirm not noise.
|
||||
14. **Record** in `.codeflash/results.tsv` (schema in domain file).
|
||||
15. **Keep/discard** (see decision tree in domain file). Print `[experiment N] KEEP` or `[experiment N] DISCARD — <reason>`.
|
||||
16. **E2E benchmark** (after KEEP, when available). Run an end-to-end comparison between the pre-optimization baseline and the current code using the language's authoritative tool. For Python: `$RUNNER -m codeflash compare <pre-opt-sha> HEAD`. For Java: `bash /tmp/jmh-runner.sh <BenchClass> --label e2e --compare /tmp/jmh-results-baseline.json` (see `e2e-benchmarks.md`). Record e2e results alongside micro-bench results in `results.tsv`. If e2e contradicts micro-bench (e.g., micro showed 15% but e2e shows <2%), re-evaluate the keep decision — trust the e2e measurement. Print `[experiment N] E2E: <base>ms → <head>ms (<speedup>x)`.
|
||||
14. **Record provisional results** in `.codeflash/results.tsv` (schema in domain file). Use `status=provisional_keep` only when local evidence is promising but the mode still requires E2E/workload confirmation.
|
||||
15. **Provisional keep/discard** (see decision tree in domain file). Print `[experiment N] PROVISIONAL KEEP` if local evidence is promising; print `[experiment N] DISCARD — <reason>` if it already failed correctness, tests, or local measurement. For modes that require workload evidence, do not record final `keep` yet.
|
||||
16. **E2E benchmark** (required before final KEEP when the domain/mode demands it; otherwise after provisional KEEP when available). Run an end-to-end comparison between the pre-optimization baseline and the current code using the language's authoritative tool. For Python: `$RUNNER -m codeflash compare <pre-opt-sha> HEAD`. For Java: `bash /tmp/jmh-runner.sh <BenchClass> --label e2e --compare /tmp/jmh-results-baseline.json` (see `e2e-benchmarks.md`). Record e2e results alongside micro-bench results in `results.tsv`. If e2e contradicts micro-bench (e.g., micro showed 15% but e2e shows <2%), re-evaluate the keep decision — trust the e2e measurement. Print `[experiment N] E2E: <base>ms → <head>ms (<speedup>x)`.
|
||||
16a. **Final keep/discard.** If the domain/mode requires E2E/workload evidence, decide final KEEP/DISCARD after step 16 and update the row from `provisional_keep` to `keep` or `discard`.
|
||||
17. **Config audit** (after KEEP). Check for related configuration flags that may have become dead or inconsistent after your change. Infrastructure changes (drivers, pools, middleware) often leave behind no-op config. Remove or update stale flags.
|
||||
18. **Milestones** (every 3-5 keeps): Run full benchmark (including `codeflash compare <baseline-sha> HEAD` for cumulative e2e measurement), create milestone branch. Print `[milestone] vN — <total kept>/<total experiments>, cumulative <metric>`.
|
||||
|
||||
|
|
@ -50,7 +51,7 @@ Output matches original?
|
|||
+-- YES (< domain threshold) -> Re-run to confirm not noise
|
||||
| +-- Confirmed -> KEEP
|
||||
| +-- Noise -> DISCARD
|
||||
+-- Micro-bench only improved (>= domain micro threshold) -> KEEP (if on confirmed hot path)
|
||||
+-- Micro-bench only improved (>= domain micro threshold) -> KEEP only when the domain file allows primitive/library mode and the target has documented downstream callers; otherwise require E2E/workload evidence before KEEP
|
||||
+-- NO -> DISCARD
|
||||
```
|
||||
|
||||
|
|
@ -111,7 +112,7 @@ During profiling or experimentation, you may discover the real bottleneck is in
|
|||
When you detect a cross-domain signal:
|
||||
|
||||
1. **Log it** in results.tsv: `experiment N | ESCALATE | <signal description> | suggests <domain>`
|
||||
2. **Tell the user**: "I'm finding that the real bottleneck is [description] — this is a [domain] issue, not [current domain]. Want me to switch?"
|
||||
2. **Tell the router/team lead**: "I'm finding that the real bottleneck is [description] — this is a [domain] issue, not [current domain]." Under AUTONOMOUS MODE, do not ask the user whether to switch; record the evidence, pick the most likely profitable domain, and continue or hand off through the router.
|
||||
3. **Write it in HANDOFF.md** so a resumed session picks it up.
|
||||
|
||||
Do NOT silently switch domains or attempt fixes outside your expertise.
|
||||
|
|
|
|||
|
|
@ -22,7 +22,7 @@ Before forming a team, optionally read these references (in `references/shared/`
|
|||
- The ONLY files you should read are: `CLAUDE.md`, `codeflash_profile.md` (project or parent directory), the dependency manifest (see Language Configuration), `.codeflash/*.md`, `.codeflash/results.tsv`, and guide.md reference files.
|
||||
- The ONLY files you should write are: `.codeflash/conventions.md`, `.codeflash/learnings.md`, `.codeflash/changelog.md`.
|
||||
- Follow the numbered steps in order. Do not skip steps or improvise your own workflow.
|
||||
- **AUTONOMOUS MODE**: If the prompt includes "AUTONOMOUS MODE", pass it through to the optimizer agent and do NOT ask the user any questions yourself. Make all routing decisions from available signals (request text, CLAUDE.md, branch names, .codeflash/ state).
|
||||
- **AUTONOMOUS MODE**: If the prompt includes "AUTONOMOUS MODE", pass it through to the optimizer agent and do NOT ask the user any questions yourself. Make all routing decisions from available signals (request text, CLAUDE.md, branch names, .codeflash/ state). If setup, profiling, or tests fail, diagnose and attempt technical workarounds; do not stop to ask the user unless the prompt lacks autonomous mode.
|
||||
- **Batch your questions.** Never ask one question at a time across multiple round-trips. If you need to ask the user about domain, scope, constraints, and guard command — ask them all in one message (max 4 questions per batch).
|
||||
|
||||
## Domain Detection
|
||||
|
|
@ -58,7 +58,7 @@ See your Language Configuration for the reference loading table (directories and
|
|||
### Start (new session)
|
||||
|
||||
1. **Gather context in one batch.** Detect domain from the user's request using your domain detection table. If anything is unclear or missing (and NOT in autonomous mode), ask all questions in one message (max 4 questions). For example, if you need domain, scope, and constraints — ask them together, not in separate round-trips. Also ask: "Is there a command that must always pass as a safety net? (e.g., <guard examples from your Language Configuration>)" to configure the guard. If the user already provided enough context or you are in autonomous mode, skip the questions and proceed.
|
||||
2. **Verify branch state.** Run `git status` and `git branch --show-current`. If on `codeflash/optimize`, treat as resume. If on `main` (or another branch), check if `codeflash/optimize` already exists — if so, check it out; if not, the domain agent will create it. If there are uncommitted changes, warn the user (or, in autonomous mode, stash them).
|
||||
2. **Verify branch state.** Run `git status` and `git branch --show-current`. If on `codeflash/optimize`, treat as resume. If on `main` (or another branch), check if `codeflash/optimize` already exists — if so, check it out; if not, create `codeflash/optimize` yourself before launching any optimizer. Branch creation is the router's responsibility; domain agents must never create optimization branches. If there are uncommitted changes, warn the user when not in autonomous mode; in autonomous mode, preserve them with a named stash before switching branches and record the stash name in `.codeflash/conventions.md`.
|
||||
3. **Detect multi-repo context.** Check if `CLAUDE.md` mentions related repositories or if the parent directory contains sibling repos. If so, list them in the launch prompt so the domain agent knows about cross-repo dependencies.
|
||||
4. **Create team.** `TeamCreate("codeflash-session")`. Then create tasks to track the session phases:
|
||||
- `TaskCreate("Setup environment")` — assign to self
|
||||
|
|
@ -107,6 +107,12 @@ See your Language Configuration for the reference loading table (directories and
|
|||
|
||||
Begin a new optimization session. The user wants: <user's request>
|
||||
|
||||
## Session Brief
|
||||
<Full scoped brief from the launcher, including the SESSION MODE line and any REFACTOR SCOPE line. For Java general optimization, default to:
|
||||
SESSION MODE: LARGE-SCALE
|
||||
REFACTOR SCOPE: PROFILE-GATED CROSS-FUNCTION ALLOWED
|
||||
Forward these lines verbatim to every spawned optimizer/subagent.>
|
||||
|
||||
## Environment
|
||||
<.codeflash/setup.md contents>
|
||||
|
||||
|
|
@ -151,7 +157,7 @@ See your Language Configuration for the reference loading table (directories and
|
|||
- **Actively monitor.** Track the last message timestamp from the optimizer. If it goes silent (no message after what feels like a long time for the current phase — e.g., baseline should report within a few minutes, experiments within 5-10 minutes each), proactively ping it: `SendMessage(to: "optimizer", summary: "Status check", message: "Report your current status — what are you working on?")`. Relay the response to the user.
|
||||
- **Keep teammates on track.** If the optimizer reports it's "approaching plateau" or "running out of targets" but hasn't sent `[complete]`, push back: read `.codeflash/results.tsv` to check what's been tried, and suggest unexplored areas or ask it to do a final re-profile before declaring plateau.
|
||||
- **Recover from drops.** If the optimizer stops without sending `[complete]` (context limit, error, goes idle), do NOT just report to the user and exit. Instead: read `.codeflash/HANDOFF.md` and `.codeflash/results.tsv`, assess whether there's more work to do, and if so, **relaunch the optimizer** with the resume prompt (step 6 from Resume flow). Only proceed to Cleanup when the session is genuinely complete or the user says to stop.
|
||||
- **Audit `results.tsv` before exit (Post-Return Keep Audit).** After the optimizer sends `[complete]` and before you proceed to Cleanup or the Post-session review gate, read `.codeflash/results.tsv` and inspect EVERY row with `status=keep`. For each such row, both of these MUST hold: (a) `improvement_pct` parses as a finite numeric value (not `N/A`, `TBD`, `deferred`, empty, or any non-number), AND (b) `optimized_metric` is a measurement (ns/op, ms, MiB, throughput, etc.) — NOT a literal identifier like `StandardCharsets.US_ASCII`, a constructor signature, a class name, or prose. If any `keep` row fails either check, **downgrade it to `status=blocked` in-place**, add a note `blocked-by-audit: no benchmark evidence (improvement_pct=<x>, optimized_metric=<y>)`, rewrite `HANDOFF.md`'s "Kept" section to move that row to a new "Downgraded by audit" subsection, and include in your `[complete]` summary to the user: `Post-return audit downgraded <N> keep(s) to blocked (no benchmark evidence).` This audit is non-negotiable under AUTONOMOUS MODE — the optimizer's self-declared keep is a claim, not a verdict. A row without a measured number is never a valid keep.
|
||||
- **Audit `results.tsv` before exit (Post-Return Keep Audit).** After the optimizer sends `[complete]` and before you proceed to Cleanup or the Post-session review gate, read `.codeflash/results.tsv` and inspect EVERY row with `status=keep`. For each such row, both of these MUST hold: (a) `improvement_pct` parses as a finite numeric value (not `N/A`, `TBD`, `deferred`, empty, or any non-number), AND (b) `optimized_metric` is a measurement (ns/op, ms, MiB, throughput, etc.) — NOT a literal identifier like `StandardCharsets.US_ASCII`, a constructor signature, a class name, or prose. If `session_mode=LARGE-SCALE`, the row must also contain workload evidence: at least one hotness field meets the Java value gate (`workload_cpu_pct >= 1`, `mem_delta_mb >= 20`, `gc_before_s - gc_after_s >= 0.005`, or documented latency/throughput evidence in `description`) and `e2e_improvement_pct >= 5` for a modest keep. Strong keeps should document `e2e_improvement_pct >= 10` on at least two representative workloads in `description` or HANDOFF.md. If any `keep` row fails these checks, **downgrade it to `status=blocked` in-place**, add a note `blocked-by-audit: no benchmark/workload evidence (...)`, rewrite `HANDOFF.md`'s "Kept" section to move that row to a new "Downgraded by audit" subsection, and include in your `[complete]` summary to the user: `Post-return audit downgraded <N> keep(s) to blocked (insufficient benchmark/workload evidence).` This audit is non-negotiable under AUTONOMOUS MODE — the optimizer's self-declared keep is a claim, not a verdict. A row without a measured number is never a valid keep.
|
||||
- **Exit only on `[complete]` or user stop.** The only valid exit conditions are: optimizer sends `[complete]` AND the post-return audit has run, or the user explicitly says to stop/wrap up. In either case, proceed to Cleanup.
|
||||
|
||||
### Resume
|
||||
|
|
@ -512,6 +518,9 @@ For a **new session**, follow the standard setup steps (1-9 from Start), then:
|
|||
|
||||
Begin a deep optimization session. The user wants: <user's request>
|
||||
|
||||
## Session Brief
|
||||
<Full scoped brief from the launcher, including SESSION MODE and REFACTOR SCOPE lines>
|
||||
|
||||
## Environment
|
||||
<.codeflash/setup.md contents>
|
||||
|
||||
|
|
|
|||
90
scripts/build_plugin.ps1
Normal file
90
scripts/build_plugin.ps1
Normal file
|
|
@ -0,0 +1,90 @@
|
|||
# Assemble language-specific Claude Code plugin directories (Windows
# counterpart of scripts/build_plugin.py): copy the language-agnostic plugin
# base, overlay one language's agents/references/skills, and rewrite
# source-tree reference paths for the flattened dist layout.
param(
    [string[]] $Lang,
    [string] $PluginRoot = "plugin",
    [string] $OutputDir = ".",
    [switch] $NoClean
)

# Fail fast on any error instead of continuing with a half-built dist.
$ErrorActionPreference = "Stop"

$pluginRootPath = (Resolve-Path -LiteralPath $PluginRoot).Path
$outputParent = (Resolve-Path -LiteralPath $OutputDir).Path
$languagesRoot = Join-Path $pluginRootPath "languages"
# Top-level plugin/ entries that never ship in a dist build: language overlays
# are applied separately, and the repo docs are not plugin content.
$excludeFromBase = @("languages", "README.md", "ARCHITECTURE.md", "ROADMAP.md")

# No -Lang given: build every language overlay found under plugin/languages/.
if (-not $Lang -or $Lang.Count -eq 0) {
    $Lang = Get-ChildItem -LiteralPath $languagesRoot -Directory | ForEach-Object { $_.Name }
}
|
||||
|
||||
function Copy-DirectoryContents {
    # Merge the contents of $Source into $Destination (creating it if needed).
    # Copying each child individually -- rather than Copy-Item on the directory
    # itself -- merges into an existing destination instead of nesting the
    # source folder inside it. Silently a no-op when $Source does not exist,
    # since a language overlay may omit any of the optional folders.
    param(
        [string] $Source,
        [string] $Destination
    )

    if (-not (Test-Path -LiteralPath $Source)) {
        return
    }

    New-Item -ItemType Directory -Force -Path $Destination | Out-Null
    # -Force includes hidden entries; Copy-Item -Force overwrites base files
    # already present in the destination with the overlay's version.
    Get-ChildItem -LiteralPath $Source -Force | ForEach-Object {
        Copy-Item -LiteralPath $_.FullName -Destination $Destination -Recurse -Force
    }
}
|
||||
|
||||
function Rewrite-MarkdownPaths {
    # Rewrite source-tree reference paths in every .md file under $Root so
    # they match the flattened dist layout: languages/<lang>/{references,
    # agents,skills}/ collapses to {references,agents,skills}/, in both the
    # forward-slash and backslash spellings. Files are only rewritten when
    # something actually changed.
    param(
        [string] $Root,
        [string] $Language
    )

    $replacements = @{
        "languages/$Language/references/" = "references/"
        "languages/$Language/agents/" = "agents/"
        "languages/$Language/skills/" = "skills/"
        "languages\$Language\references\" = "references/"
        "languages\$Language\agents\" = "agents/"
        "languages\$Language\skills\" = "skills/"
    }

    Get-ChildItem -LiteralPath $Root -Recurse -File -Filter "*.md" | ForEach-Object {
        $text = Get-Content -LiteralPath $_.FullName -Raw
        if ($null -eq $text) {
            # Get-Content -Raw yields $null for a zero-length file; calling
            # .Replace() on $null would throw and, with
            # $ErrorActionPreference = "Stop", abort the whole build.
            return
        }
        $updated = $text
        foreach ($key in $replacements.Keys) {
            $updated = $updated.Replace($key, $replacements[$key])
        }
        if ($updated -ne $text) {
            # NOTE(review): on Windows PowerShell 5.x, "-Encoding UTF8" writes
            # a BOM and Set-Content appends a trailing newline, unlike the
            # Python builder's BOM-less LF output -- confirm whether
            # byte-identical dist output across the two builders matters.
            Set-Content -LiteralPath $_.FullName -Value $updated -Encoding UTF8
        }
    }
}
|
||||
|
||||
$built = @()
foreach ($language in $Lang) {
    $langRoot = Join-Path $languagesRoot $language
    if (-not (Test-Path -LiteralPath $langRoot)) {
        throw "Unknown language: $language ($langRoot not found)"
    }

    # Each language is assembled into its own dist-<language>/ directory.
    $out = Join-Path $outputParent "dist-$language"
    if (-not $NoClean -and (Test-Path -LiteralPath $out)) {
        # Default behavior: remove the previous build so stale files from an
        # earlier assembly cannot linger in the new dist.
        Remove-Item -LiteralPath $out -Recurse -Force
    }
    New-Item -ItemType Directory -Force -Path $out | Out-Null

    # 1) Copy the language-agnostic base, minus excluded top-level entries.
    Get-ChildItem -LiteralPath $pluginRootPath -Force | Where-Object {
        $excludeFromBase -notcontains $_.Name
    } | ForEach-Object {
        Copy-Item -LiteralPath $_.FullName -Destination $out -Recurse -Force
    }

    # 2) Overlay this language's agents/references/skills onto the base.
    foreach ($folder in @("agents", "references", "skills")) {
        Copy-DirectoryContents -Source (Join-Path $langRoot $folder) -Destination (Join-Path $out $folder)
    }

    # 3) Flatten source-tree reference paths in the assembled markdown.
    Rewrite-MarkdownPaths -Root $out -Language $language
    $built += $out
    Write-Host "Assembled plugin ($language) -> $out"
}

Write-Host ("Built: " + ($built -join ", "))
|
||||
99
scripts/build_plugin.py
Normal file
99
scripts/build_plugin.py
Normal file
|
|
@ -0,0 +1,99 @@
|
|||
"""Assemble language-specific Claude Code plugin directories.
|
||||
|
||||
This is the cross-platform equivalent of the Makefile build target. It copies the
|
||||
language-agnostic plugin files, overlays one language's agents/references/skills,
|
||||
and rewrites source-tree reference paths to the flattened assembled layout.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import shutil
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
# Top-level entries of plugin/ that never ship in an assembled dist: language
# overlays are applied separately, and the repo docs are not plugin content.
EXCLUDE_FROM_BASE = {"languages", "README.md", "ARCHITECTURE.md", "ROADMAP.md"}


def copy_base(plugin_root: Path, output_dir: Path) -> None:
    """Copy the language-agnostic plugin files into *output_dir*.

    Every entry directly under *plugin_root* is copied except those listed in
    EXCLUDE_FROM_BASE. Directories merge into any content already present in
    the destination (``dirs_exist_ok=True``).
    """
    for entry in plugin_root.iterdir():
        if entry.name in EXCLUDE_FROM_BASE:
            continue

        target = output_dir / entry.name
        if entry.is_dir():
            shutil.copytree(entry, target, dirs_exist_ok=True)
        else:
            # Plain file: copy2 preserves metadata alongside the contents.
            shutil.copy2(entry, target)
|
||||
|
||||
|
||||
def overlay_language(plugin_root: Path, output_dir: Path, language: str) -> None:
    """Overlay one language's agents/references/skills onto *output_dir*.

    Raises SystemExit with a readable message when *language* has no
    directory under ``plugin_root/languages``.
    """
    lang_root = plugin_root / "languages" / language
    if not lang_root.is_dir():
        raise SystemExit(f"Unknown language: {language} ({lang_root} not found)")

    for folder_name in ("agents", "references", "skills"):
        overlay_src = lang_root / folder_name
        if not overlay_src.is_dir():
            # A language overlay may omit any of the optional folders.
            continue
        shutil.copytree(overlay_src, output_dir / folder_name, dirs_exist_ok=True)
|
||||
|
||||
|
||||
def rewrite_paths(output_dir: Path, language: str) -> None:
|
||||
replacements = {
|
||||
f"languages/{language}/references/": "references/",
|
||||
f"languages/{language}/agents/": "agents/",
|
||||
f"languages/{language}/skills/": "skills/",
|
||||
f"languages\\{language}\\references\\": "references/",
|
||||
f"languages\\{language}\\agents\\": "agents/",
|
||||
f"languages\\{language}\\skills\\": "skills/",
|
||||
}
|
||||
|
||||
for path in output_dir.rglob("*.md"):
|
||||
text = path.read_text(encoding="utf-8")
|
||||
updated = text
|
||||
for old, new in replacements.items():
|
||||
updated = updated.replace(old, new)
|
||||
if updated != text:
|
||||
path.write_text(updated, encoding="utf-8", newline="\n")
|
||||
|
||||
|
||||
def build_language(plugin_root: Path, output_parent: Path, language: str, clean: bool) -> Path:
    """Assemble ``dist-<language>`` under *output_parent* and return its path.

    When *clean* is true any previous build is removed first; otherwise the
    new files are layered over whatever is already in the directory.
    """
    dist_dir = output_parent / f"dist-{language}"
    if clean and dist_dir.exists():
        # Start from scratch so stale files from an earlier build can't linger.
        shutil.rmtree(dist_dir)
    dist_dir.mkdir(parents=True, exist_ok=True)

    copy_base(plugin_root, dist_dir)
    overlay_language(plugin_root, dist_dir, language)
    rewrite_paths(dist_dir, language)
    return dist_dir
|
||||
|
||||
|
||||
def discover_languages(plugin_root: Path) -> list[str]:
    """Return the names of all language overlays, sorted alphabetically.

    Looks for directories under ``plugin_root/languages``. Raises SystemExit
    with a readable message when that directory is missing (e.g. a wrong
    ``--plugin-root``), matching the error style used by overlay_language
    instead of surfacing a raw FileNotFoundError traceback.
    """
    languages_root = plugin_root / "languages"
    if not languages_root.is_dir():
        raise SystemExit(f"No languages directory found under {plugin_root}")
    return sorted(path.name for path in languages_root.iterdir() if path.is_dir())
|
||||
|
||||
|
||||
def main() -> None:
    """CLI entry point: assemble one ``dist-<lang>/`` per requested language."""
    parser = argparse.ArgumentParser(description="Build language-specific Claude plugin directories.")
    parser.add_argument("--lang", action="append", help="Language to build. Repeat for multiple. Defaults to all.")
    parser.add_argument("--plugin-root", default="plugin", help="Path to the source plugin directory.")
    parser.add_argument("--output-dir", default=".", help="Directory where dist-<lang>/ folders are written.")
    parser.add_argument("--no-clean", action="store_true", help="Do not remove existing dist-<lang>/ before building.")
    args = parser.parse_args()

    plugin_root = Path(args.plugin_root).resolve()
    output_parent = Path(args.output_dir).resolve()
    # No --lang given: build every overlay found under plugin/languages/.
    requested = args.lang or discover_languages(plugin_root)

    outputs: list[Path] = []
    for language in requested:
        dist_dir = build_language(plugin_root, output_parent, language, clean=not args.no_clean)
        outputs.append(dist_dir)
        print(f"Assembled plugin ({language}) -> {dist_dir}")

    print("Built: " + ", ".join(str(path) for path in outputs))


if __name__ == "__main__":
    main()
|
||||
Loading…
Reference in a new issue