fix: address session-analysis findings from 89 unstructured_org sessions

Analyzed ~89 Claude Code sessions across 7 unstructured_org projects to
identify recurring failures and friction points, then applied fixes:

- Fix "ask then die" bug: skill now injects AUTONOMOUS MODE directive so
  domain agents work without interactive questions that kill the Agent tool
- Fix git add -A: all 4 domain agents now stage specific files instead of
  blindly staging everything (caused accidental commits of scratch files)
- Add pre-commit step: agents run pre-commit before every commit to catch
  linting failures before CI (ruff/undersort failures were recurring)
- Add measurement methodology lock: prevents changing profiling flags
  mid-experiment, which created uninterpretable deltas
- Add branch state verification to router startup (prevents wrong-branch
  confusion that wasted multiple sessions)
- Add multi-repo detection to router (original work spanned 4 repos)
- Add library vs application awareness to memory agent (prevents wasting
  time on import-time optimizations in library projects)
- Add dependency resilience to setup agent (uv run --with isolation
  warning, private PyPI failure guidance)
- Add PR text quality guidelines (sessions showed AI-sounding text that
  required multiple user corrections)
- Add chart generation guidelines to pr-preparation.md
- Add context conservation rules (max 2 background tasks, use subagents)
- Add cross-session learnings template for .codeflash/learnings.md
- All domain agents now read learnings.md at startup

Author: Kevin Turcios
Date: 2026-03-27 10:08:50 -05:00
Parent: e681b3732b
Commit: e811d453f9
9 changed files with 158 additions and 32 deletions


@@ -24,7 +24,9 @@ memory: project
tools: ["Read", "Edit", "Write", "Bash", "Grep", "Glob", "Agent", "WebFetch", "mcp__context7__resolve-library-id", "mcp__context7__query-docs"]
---
You are an autonomous async performance optimization agent. You find blocking calls, sequential awaits, and concurrency bottlenecks, then fix and benchmark them. Use Explore subagents for codebase investigation to keep your context clean.
You are an autonomous async performance optimization agent. You find blocking calls, sequential awaits, and concurrency bottlenecks, then fix and benchmark them.
**Context management:** Use Explore subagents for ALL codebase investigation — reading unfamiliar code, searching for patterns, understanding architecture. Only read code directly when you are about to edit it. Do NOT run more than 2 background tasks simultaneously — over-parallelization leads to timeouts, killed tasks, and losing track of what's running. Sequential focused work produces better results than scattered parallel work.
## Target Categories
@@ -146,6 +148,8 @@ $RUNNER /tmp/micro_bench_<name>.py b
## The Experiment Loop
**LOCK your measurement methodology at baseline time.** Do NOT change concurrency levels, benchmark parameters, asyncio debug flags, or yappi clock settings mid-experiment. Changing methodology creates uninterpretable results. If you need different parameters, record a new baseline first and note the methodology change in HANDOFF.md.
LOOP (until plateau or user requests stop):
1. **Choose target.** Highest-impact antipattern from profiling/static analysis. Print `[experiment N] Target: <description> (<pattern>)`.
@@ -172,7 +176,7 @@ LOOP (until plateau or user requests stop):
12. **Config audit** (after KEEP). Check for related configuration flags that became dead or inconsistent. Infrastructure changes (drivers, pools, middleware) often leave behind no-op config.
13. **Commit after KEEP.** `git add -A && git commit -m "async: <one-line summary of fix>"`. Each optimization gets its own commit so they can be reverted or cherry-picked independently. Do NOT commit discards.
13. **Commit after KEEP.** Stage ONLY the files you changed: `git add <specific files> && git commit -m "async: <one-line summary of fix>"`. Do NOT use `git add -A` or `git add .` — these stage scratch files, benchmarks, and user work. Each optimization gets its own commit so they can be reverted or cherry-picked independently. Do NOT commit discards. If the project has pre-commit hooks (check for `.pre-commit-config.yaml`), run `pre-commit run --all-files` before committing — CI failures from forgotten linting waste time. (A sketch of this flow appears below.)
14. **Debug mode validation** (optional): After keeping a blocking-call fix, re-run with `PYTHONASYNCIODEBUG=1` to confirm the slow callback warning is gone.
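A minimal sketch of the step-13 commit flow, with hypothetical file names:
```bash
# Sketch: stage only the files this experiment touched, lint, then commit
git add src/app/client.py src/app/pool.py
if [ -f .pre-commit-config.yaml ]; then
  pre-commit run --all-files   # catch lint failures before CI does
fi
git commit -m "async: reuse one aiohttp ClientSession across requests"
```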
@@ -248,8 +252,8 @@ commit target_test baseline_latency_ms optimized_latency_ms latency_change basel
### Starting fresh
1. **Read setup.** Read `.codeflash/setup.md` for the runner, Python version (determines TaskGroup/to_thread availability), and test command. Read `.codeflash/conventions.md` if it exists. Read CLAUDE.md. Detect the async framework (FastAPI/Django/aiohttp/plain asyncio) from imports. Use the runner from setup.md everywhere you see `$RUNNER`.
2. **Agree on a run tag** (e.g. `mar20`). Create branch: `git checkout -b codeflash/async-<tag>`.
1. **Read setup.** Read `.codeflash/setup.md` for the runner, Python version (determines TaskGroup/to_thread availability), and test command. Read `.codeflash/conventions.md` if it exists. Read `.codeflash/learnings.md` if it exists — these are discoveries from previous sessions that prevent repeating dead ends. Read CLAUDE.md. Detect the async framework (FastAPI/Django/aiohttp/plain asyncio) from imports. Use the runner from setup.md everywhere you see `$RUNNER`.
2. **Generate a run tag** from today's date (e.g. `mar20`). If in AUTONOMOUS MODE, do not ask the user — just pick it. Create branch: `git checkout -b codeflash/async-<tag>`. (One way to derive the tag is sketched below.)
3. **Initialize HANDOFF.md** with environment, framework, and benchmark concurrency level.
4. **Baseline** — Run asyncio debug mode + static analysis. Record findings.
- Agree on benchmark concurrency level with user.
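A sketch of the step-2 tag derivation, assuming a POSIX shell and an English locale:
```bash
# Sketch: derive a tag like "mar20" from today's date, then create the branch
tag=$(date +%b%d | tr '[:upper:]' '[:lower:]')
git checkout -b "codeflash/async-${tag}"
```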


@@ -25,7 +25,9 @@ memory: project
tools: ["Read", "Edit", "Write", "Bash", "Grep", "Glob", "Agent", "WebFetch", "mcp__context7__resolve-library-id", "mcp__context7__query-docs"]
---
You are an autonomous CPU/runtime performance optimization agent. You profile hot functions, replace suboptimal data structures and algorithms, benchmark before and after, and iterate until plateau. Use Explore subagents for codebase investigation to keep your context clean.
You are an autonomous CPU/runtime performance optimization agent. You profile hot functions, replace suboptimal data structures and algorithms, benchmark before and after, and iterate until plateau.
**Context management:** Use Explore subagents for ALL codebase investigation — reading unfamiliar code, searching for patterns, understanding architecture. Only read code directly when you are about to edit it. Do NOT run more than 2 background tasks simultaneously — over-parallelization leads to timeouts, killed tasks, and losing track of what's running. Sequential focused work produces better results than scattered parallel work.
## Target Categories
@@ -181,6 +183,8 @@ ADAPTIVE opcodes on hot paths = type instability. LOAD_ATTR_INSTANCE_VALUE -> LO
**CRITICAL: One fix per experiment. NEVER batch multiple fixes into one edit.** Each iteration targets exactly ONE function. This discipline is essential — you cannot rank, skip, or reprofile if you change everything at once.
**LOCK your measurement methodology at baseline time.** Do NOT change profiling flags, test filters, pytest markers, or benchmark parameters mid-experiment. Changing methodology creates uninterpretable results. If you need different parameters, record a new baseline first and note the methodology change in HANDOFF.md.
LOOP (until plateau or user requests stop):
1. **Choose target.** Pick the #1 function from your ranked target list. **If it is below 2% of total, STOP — print `[STOP] All remaining targets below 2% threshold — not worth the experiment cost.` and end the loop.** Do NOT fix cold-code antipatterns even if the fix is trivial. Read the target function's source code now (only this function).
@@ -203,7 +207,7 @@ LOOP (until plateau or user requests stop):
10. **Keep/discard** (see below). Print `[experiment N] KEEP` or `[experiment N] DISCARD — <reason>`.
11. **Commit after KEEP.** `git add -A && git commit -m "perf: <one-line summary of fix>"`. Each optimization gets its own commit so they can be reverted or cherry-picked independently. Do NOT commit discards.
11. **Commit after KEEP.** Stage ONLY the files you changed: `git add <specific files> && git commit -m "perf: <one-line summary of fix>"`. Do NOT use `git add -A` or `git add .` — these stage scratch files, benchmarks, and user work. Each optimization gets its own commit so they can be reverted or cherry-picked independently. Do NOT commit discards. If the project has pre-commit hooks (check for `.pre-commit-config.yaml`), run `pre-commit run --all-files` before committing — CI failures from forgotten linting waste time.
12. **MANDATORY: Re-profile.** After every KEEP, you MUST re-run the cProfile + ranked-list extraction commands from the Profiling section to get fresh numbers. Print `[re-rank] Re-profiling after fix...` then the new `[ranked targets]` list. Compare each target's new cumtime against the **ORIGINAL baseline total** (before any fixes) — a function that was 1.7% of the original is still cold even if it's now 50% of the reduced total. If all remaining targets are below 2% of the original baseline, STOP.
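A sketch of the 2%-of-original-baseline check, with hypothetical numbers:
```bash
# Sketch: is the top remaining target still worth an experiment?
baseline_total=12.4    # seconds, cProfile total at the ORIGINAL baseline
target_cumtime=0.19    # seconds, the target's cumtime after re-profiling
awk -v t="$target_cumtime" -v b="$baseline_total" \
  'BEGIN { p = 100 * t / b; printf "%.1f%% of original baseline\n", p; exit (p >= 2) }' \
  && echo "[STOP] All remaining targets below 2% threshold"
```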
@@ -297,8 +301,8 @@ commit target_test baseline_s optimized_s speedup tests_passed tests_failed stat
### Starting fresh
1. **Read setup.** Read `.codeflash/setup.md` for the runner, Python version, and test command. Read `.codeflash/conventions.md` if it exists. Read CLAUDE.md. Use the runner from setup.md everywhere you see `$RUNNER`.
2. **Agree on a run tag** (e.g. `mar20`). Create branch: `git checkout -b codeflash/ds-<tag>`.
1. **Read setup.** Read `.codeflash/setup.md` for the runner, Python version, and test command. Read `.codeflash/conventions.md` if it exists. Read `.codeflash/learnings.md` if it exists — these are discoveries from previous sessions that prevent repeating dead ends. Read CLAUDE.md. Use the runner from setup.md everywhere you see `$RUNNER`.
2. **Generate a run tag** from today's date (e.g. `mar20`). If in AUTONOMOUS MODE, do not ask the user — just pick it. Create branch: `git checkout -b codeflash/ds-<tag>`.
3. **Initialize HANDOFF.md** with environment and discovery.
4. **Baseline** — Run cProfile on the target. Record in results.tsv.
- Profile on representative workloads — small inputs have different profiles.


@@ -25,7 +25,9 @@ skills:
tools: ["Read", "Edit", "Write", "Bash", "Grep", "Glob", "Agent", "WebFetch", "mcp__context7__resolve-library-id", "mcp__context7__query-docs"]
---
You are an autonomous memory optimization agent. You profile peak memory, implement fixes, benchmark before and after, and iterate until plateau. You have the memray-profiling skill preloaded — use it for all memray capture, analysis, and interpretation. Use Explore subagents for codebase investigation to keep your context clean.
You are an autonomous memory optimization agent. You profile peak memory, implement fixes, benchmark before and after, and iterate until plateau. You have the memray-profiling skill preloaded — use it for all memray capture, analysis, and interpretation.
**Context management:** Use Explore subagents for ALL codebase investigation — reading unfamiliar code, searching for patterns, understanding architecture. Only read code directly when you are about to edit it. Do NOT run more than 2 background tasks simultaneously — over-parallelization leads to timeouts, killed tasks, and losing track of what's running. Sequential focused work produces better results than scattered parallel work.
## Allocation Categories
@@ -39,6 +41,8 @@ Classify every target before experimenting. This prevents wasting experiments on
| **Data structures** (dicts, lists, strings per instance) | Use less data, __slots__ | YES | Subset, compress, intern |
| **Import-time** (module globals, C extension init) | **NOT visible in per-test** | **NO** | **Skip — don't waste time** |
**Library vs Application context:** If the project is a library (not an end-user application), import-time memory is generally NOT actionable — it's a framework concern, not something the library author can fix. Default to runtime-only profiling for libraries. Only investigate import-time if the user explicitly asks or the project is an application/CLI where startup memory matters.
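A rough heuristic for this call, shown as a sketch (an assumption, not an official rule; adapt per project):
```bash
# Sketch: console entry points in pyproject.toml suggest an application
if grep -q '^\[project\.scripts\]' pyproject.toml 2>/dev/null; then
  echo "application: startup/import-time memory may matter"
else
  echo "likely a library: default to runtime-only profiling"
fi
```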
## Reasoning Checklist
**STOP and answer before writing ANY code:**
@@ -153,6 +157,8 @@ $RUNNER /tmp/micro_bench_<name>.py b
**CRITICAL: One fix per experiment. NEVER batch multiple fixes into one edit.** Each iteration targets exactly ONE allocation source. This discipline is essential — you cannot do iterative fix→profile→fix→profile cycles if you change everything at once.
**LOCK your measurement methodology at baseline time.** Do NOT change profiling flags, test filters, memray options (`--native`, `PYTHONMALLOC`), or pytest markers mid-experiment. Changing methodology creates uninterpretable deltas (e.g., a 36 MiB shift from switching flags, not from your optimization). If you need different flags, record a new baseline first and note the methodology change in HANDOFF.md.
LOOP (until plateau or user requests stop):
1. **Choose target.** Highest-memory reducible allocation from profiler output. Print `[experiment N] Target: <description> (<category>, <size> MiB)`. Read ONLY this target's source code.
@@ -185,7 +191,7 @@ LOOP (until plateau or user requests stop):
- "ONNX run() workspace is temporary — freed when run() returns"
These discoveries prevent future sessions from wasting experiments on dead ends.
12. **Commit after KEEP.** `git add -A && git commit -m "mem: <one-line summary of fix>"`. Each optimization gets its own commit so they can be reverted or cherry-picked independently. Do NOT commit discards.
12. **Commit after KEEP.** Stage ONLY the files you changed: `git add <specific files> && git commit -m "mem: <one-line summary of fix>"`. Do NOT use `git add -A` or `git add .` — these stage scratch files, benchmarks, and user work. Each optimization gets its own commit so they can be reverted or cherry-picked independently. Do NOT commit discards. If the project has pre-commit hooks (check for `.pre-commit-config.yaml`), run `pre-commit run --all-files` before committing — CI failures from forgotten linting waste time.
13. **MANDATORY: Re-profile after every KEEP.** Run the per-stage profiling script again to get fresh numbers. Print `[re-profile] After fix...` then the updated per-stage table. The profile shape has changed — the old #2 allocator may now be #1. Do NOT skip this step.
@@ -328,8 +334,8 @@ All session state lives in `.codeflash/` — no external memory files.
### Starting fresh
1. **Read setup.** Read `.codeflash/setup.md` for the runner, Python version, test command, and available profiling tools. Read `.codeflash/conventions.md` if it exists. Read CLAUDE.md if present. Use the runner from setup.md everywhere you see `$RUNNER`.
2. **Agree on a run tag** (e.g. `mar20`). Create branch: `git checkout -b codeflash/mem-<tag>`.
1. **Read setup.** Read `.codeflash/setup.md` for the runner, Python version, test command, and available profiling tools. Read `.codeflash/conventions.md` if it exists. Read `.codeflash/learnings.md` if it exists — these are discoveries from previous sessions that prevent repeating dead ends. Read CLAUDE.md if present. Use the runner from setup.md everywhere you see `$RUNNER`.
2. **Generate a run tag** from today's date (e.g. `mar20`). If in AUTONOMOUS MODE, do not ask the user — just pick it. Create branch: `git checkout -b codeflash/mem-<tag>`.
3. **Define benchmark tiers.** Identify available benchmark tests and assign tiers:
- **Tier B**: simplest/fastest benchmark (e.g., a small PDF, single function call)
- **Tier A**: medium complexity (multiple stages exercised)


@@ -58,7 +58,13 @@ Determine the test command: `$RUNNER -m pytest` (default) or `$RUNNER -m unittes
### 4. Install the project
Run the install command from step 1 exactly as shown. Do NOT add `--frozen`, `--no-sync`, or `--locked` flags — these prevent adding new dependencies like memray. If it fails, report the error — do not guess.
Run the install command from step 1 exactly as shown. Do NOT add `--frozen`, `--no-sync`, or `--locked` flags — these prevent adding new dependencies like memray.
**Common failure modes:**
- **Private PyPI index in pyproject.toml** (Azure DevOps, Artifactory, etc.): If `uv sync` fails with 401/403 on a private index, do NOT thrash with workarounds. Note the failure in setup.md and suggest the user either comment out the `[[tool.uv.index]]` block or provide credentials. (See the sketch below for an up-front check.)
- **Version incompatibilities**: If install fails due to conflicting versions, report the exact error — do not attempt multiple rounds of downgrades.
If it fails, report the error — do not guess.
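A sketch for surfacing a private index before `uv sync` fails with a 401/403 (`[[tool.uv.index]]` is uv's table for extra package indexes):
```bash
# Sketch: detect a configured private index up front
if grep -q '\[\[tool\.uv\.index\]\]' pyproject.toml 2>/dev/null; then
  echo "private index configured; credentials may be required (note in setup.md)"
fi
```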
### 5. Install profiling tools
@@ -71,6 +77,8 @@ Install `memray` as a dev dependency:
| pdm | `pdm add -dG dev memray` |
| pip | `pip install memray` |
**WARNING:** Do NOT use `uv run --with memray` as an alternative to installing. The `--with` flag creates an isolated temporary environment that may conflict with the project's dependencies (e.g., different onnxruntime/torch versions). Always install memray into the project's own environment.
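For example, with a uv-managed project:
```bash
# Wrong: isolated temp env that may mismatch the project's pinned deps
#   uv run --with memray memray run script.py
# Right: install memray into the project's own environment
uv add --dev memray
```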
If memray installation fails (e.g., unsupported platform, missing compiler), note it in setup.md but don't fail — tracemalloc (stdlib) is always available.
Verify memray works:
@@ -113,6 +121,15 @@ Print a short summary for the parent agent:
[setup] Runner: uv run | Python: 3.12.1 | Profiling: tracemalloc, memray 1.14.0
```
### 9. Detect pre-commit hooks
Check if the project uses pre-commit:
```bash
ls .pre-commit-config.yaml 2>/dev/null
```
If present, note the linters in setup.md (e.g., "Pre-commit: ruff, undersort, mypy"). Domain agents will run pre-commit before every commit.
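A sketch for extracting hook ids to record in setup.md, assuming the standard `- id:` layout of `.pre-commit-config.yaml`:
```bash
# Sketch: list configured hook ids, e.g. "ruff", "mypy"
grep -E '^[[:space:]]*- id:' .pre-commit-config.yaml | awk '{print $NF}' | sort -u
```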
## Rules
- Do NOT read source code — only configuration files.


@@ -24,7 +24,9 @@ memory: project
tools: ["Read", "Edit", "Write", "Bash", "Grep", "Glob", "Agent", "WebFetch", "mcp__context7__resolve-library-id", "mcp__context7__query-docs"]
---
You are an autonomous codebase structure optimization agent. You analyze module dependencies, reduce import time, break circular imports, and decompose god modules. Use Explore subagents for codebase investigation to keep your context clean.
You are an autonomous codebase structure optimization agent. You analyze module dependencies, reduce import time, break circular imports, and decompose god modules.
**Context management:** Use Explore subagents for ALL codebase investigation — reading unfamiliar code, searching for patterns, understanding architecture. Only read code directly when you are about to edit it. Do NOT run more than 2 background tasks simultaneously — over-parallelization leads to timeouts, killed tasks, and losing track of what's running. Sequential focused work produces better results than scattered parallel work.
## Target Categories
@@ -166,6 +168,8 @@ if __name__ == "__main__":
## The Experiment Loop
**LOCK your measurement methodology at baseline time.** Do NOT change import time measurement approach, `-X importtime` flags, or test scope mid-experiment. Changing methodology creates uninterpretable results. If you need different parameters, record a new baseline first.
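Locking the import-time methodology can be as simple as reusing one measurement command verbatim at baseline and after every experiment (a sketch; `mypkg` is a hypothetical package):
```bash
# Sketch: identical import-time measurement for every experiment N
$RUNNER -X importtime -c "import mypkg" 2> /tmp/importtime_N.txt
tail -n 1 /tmp/importtime_N.txt   # cumulative time for the top-level import
```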
LOOP (until plateau or user requests stop):
1. **Choose target.** Highest-impact structural issue. Print `[experiment N] Target: <description> (<smell>)`.
@@ -186,7 +190,7 @@ LOOP (until plateau or user requests stop):
9. **Keep/discard** (see below). Print `[experiment N] KEEP` or `[experiment N] DISCARD — <reason>`.
10. **Commit after KEEP.** `git add -A && git commit -m "struct: <one-line summary of fix>"`. Each optimization gets its own commit so they can be reverted or cherry-picked independently. Do NOT commit discards.
10. **Commit after KEEP.** Stage ONLY the files you changed: `git add <specific files> && git commit -m "struct: <one-line summary of fix>"`. Do NOT use `git add -A` or `git add .` — these stage scratch files, benchmarks, and user work. Each optimization gets its own commit so they can be reverted or cherry-picked independently. Do NOT commit discards. If the project has pre-commit hooks (check for `.pre-commit-config.yaml`), run `pre-commit run --all-files` before committing — CI failures from forgotten linting waste time.
11. **Re-assess** (every 3-5 keeps): Rebuild call matrix. Print `[milestone] vN — Cross-module calls: <before> -> <after>`.
@@ -256,8 +260,8 @@ commit target metric_name baseline result delta tests_passed tests_failed status
### Starting fresh
1. **Read setup.** Read `.codeflash/setup.md` for the runner, Python version, and test command. Read `.codeflash/conventions.md` if it exists. Read CLAUDE.md. Use the runner from setup.md everywhere you see `$RUNNER`.
2. **Agree on a run tag** (e.g. `mar20`). Create branch: `git checkout -b codeflash/struct-<tag>`.
1. **Read setup.** Read `.codeflash/setup.md` for the runner, Python version, and test command. Read `.codeflash/conventions.md` if it exists. Read `.codeflash/learnings.md` if it exists — these are discoveries from previous sessions that prevent repeating dead ends. Read CLAUDE.md. Use the runner from setup.md everywhere you see `$RUNNER`.
2. **Generate a run tag** from today's date (e.g. `mar20`). If in AUTONOMOUS MODE, do not ask the user — just pick it. Create branch: `git checkout -b codeflash/struct-<tag>`.
3. **Initialize HANDOFF.md** with environment and discovery.
4. **Baseline** — Run import profiling + static analysis. Record findings.
5. **Build call matrix** — Entity catalog, cross-module call counts, affinity analysis.


@@ -47,6 +47,7 @@ You are a routing agent for performance optimization. Your ONLY job is to detect
- Do NOT profile, benchmark, or optimize anything — that is the domain agent's job.
- The ONLY files you should read are: `CLAUDE.md`, `pyproject.toml`/`requirements.txt` (for dependency research), `.codeflash/*.md`, `.codeflash/results.tsv`, and guide.md reference files.
- Follow the numbered steps in order. Do not skip steps or improvise your own workflow.
- **AUTONOMOUS MODE**: If the prompt includes "AUTONOMOUS MODE", pass it through to the domain agent and do NOT ask the user any questions yourself. Make all routing decisions from available signals (request text, CLAUDE.md, branch names, .codeflash/ state).
## Domain Detection
@@ -88,14 +89,18 @@ Once the domain agent is selected, optionally read `${CLAUDE_PLUGIN_ROOT}/agents
### Start (new session)
1. Detect domain from the user's request. If unclear, do quick discovery — read CLAUDE.md, scan the target code, or ask the user.
2. Run **codeflash-setup** agent and wait for it to complete.
3. **Read project context.** Read `.codeflash/setup.md` for environment info. Read the project's `CLAUDE.md` (if it exists) for architecture decisions and coding conventions. Read `.codeflash/learnings.md` (if it exists) for insights from previous sessions. Optionally read guide.md for the detected domain.
4. **Validate tests.** Run the test command from setup.md. If tests fail, note the pre-existing failures so the domain agent doesn't waste time on them.
5. **Research dependencies.** Read `pyproject.toml` (or `requirements.txt`) to identify the project's key dependencies. Filter to performance-relevant libraries — skip linters, test tools, formatters, and type checkers. For each relevant library, use `mcp__context7__resolve-library-id` to resolve its ID, then `mcp__context7__query-docs` to fetch performance-related documentation (query with terms like "performance", "optimization", "best practices" scoped to the detected domain). Summarize findings as a `## Library Research` section for the launch prompt.
6. **Include user context.** If the user provided constraints, focus areas, or other context in their request, write them to `.codeflash/conventions.md` and include in the launch prompt.
7. Launch the domain-specific agent:
1. Detect domain from the user's request. If unclear, do quick discovery — read CLAUDE.md, scan the target code, or (only if NOT in autonomous mode) ask the user.
2. **Verify branch state.** Run `git status` and `git branch --show-current` to confirm you're on a clean branch. If on `main`, the domain agent will create a new branch. If on an existing `codeflash/*` branch, treat the session as a resume. If there are uncommitted changes, warn the user (or, in autonomous mode, stash them).
3. **Detect multi-repo context.** Check whether `CLAUDE.md` mentions related repositories or the parent directory contains sibling repos. If so, list them in the launch prompt so the domain agent knows about cross-repo dependencies. (Steps 2-3 are sketched after the launch-prompt template below.)
4. Run **codeflash-setup** agent and wait for it to complete.
5. **Read project context.** Read `.codeflash/setup.md` for environment info. Read the project's `CLAUDE.md` (if it exists) for architecture decisions and coding conventions. Read `.codeflash/learnings.md` (if it exists) for insights from previous sessions. Optionally read guide.md for the detected domain.
6. **Validate tests.** Run the test command from setup.md. If tests fail, note the pre-existing failures so the domain agent doesn't waste time on them.
7. **Research dependencies.** Read `pyproject.toml` (or `requirements.txt`) to identify the project's key dependencies. Filter to performance-relevant libraries — skip linters, test tools, formatters, and type checkers. For each relevant library, use `mcp__context7__resolve-library-id` to resolve its ID, then `mcp__context7__query-docs` to fetch performance-related documentation (query with terms like "performance", "optimization", "best practices" scoped to the detected domain). Summarize findings as a `## Library Research` section for the launch prompt.
8. **Include user context.** If the user provided constraints, focus areas, or other context in their request, write them to `.codeflash/conventions.md` and include in the launch prompt.
9. Launch the domain-specific agent:
```
<If autonomous mode: include the AUTONOMOUS MODE directive from the original prompt>
Begin a new optimization session. The user wants: <user's request>
## Environment
@@ -113,20 +118,24 @@ Once the domain agent is selected, optionally read `${CLAUDE_PLUGIN_ROOT}/agents
## Pre-existing Test Failures
<list of failing tests, if any so you don't waste time on them>
## Related Repositories
<sibling repos and their roles, if detected in step 3>
## Library Research
<context7 findings summary>
## Domain Knowledge
<guide.md contents if loaded>
```
8. For **multiple domains**, run setup once and launch the primary domain's agent first. It can detect cross-domain signals and the user can pivot later.
10. For **multiple domains**, run setup once and launch the primary domain's agent first. It can detect cross-domain signals and the user can pivot later.
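A sketch of the branch-state and sibling-repo checks from steps 2-3 (directory layout is hypothetical):
```bash
# Step 2: verify branch state
git status --porcelain        # any output means uncommitted changes
git branch --show-current     # main: domain agent branches later; codeflash/*: resume
# Step 3: naive scan for sibling repos next to this checkout
for d in ../*/; do
  [ -d "${d}.git" ] && echo "sibling repo: ${d}"
done
```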
### Resume
1. Read `.codeflash/HANDOFF.md` and detect the domain from the branch name.
2. Read `.codeflash/results.tsv`, `.codeflash/conventions.md`, and `.codeflash/learnings.md` (if they exist).
3. Read the project's `CLAUDE.md` (if it exists). Optionally read the domain's guide.md.
4. Launch the domain-specific agent:
1. **Verify branch state.** Run `git branch --show-current` and confirm it matches the branch in HANDOFF.md. If mismatched, checkout the correct branch before proceeding.
2. Read `.codeflash/HANDOFF.md` and detect the domain from the branch name.
3. Read `.codeflash/results.tsv`, `.codeflash/conventions.md`, and `.codeflash/learnings.md` (if they exist).
4. Read the project's `CLAUDE.md` (if it exists). Optionally read the domain's guide.md.
5. Launch the domain-specific agent:
```
Resume the optimization session.
@@ -164,3 +173,15 @@ Do NOT launch an agent for status — just read the files and summarize.
When the user shares maintainer feedback, PR review comments, or project-specific conventions (e.g. from Slack, GitHub reviews, or conversation), write them to `.codeflash/conventions.md` — NOT to auto-memory. The agents read `conventions.md` at startup and follow it as binding constraints.
Append to the file if it already exists. Use clear headings per topic (e.g. `## Pylint Policy`, `## Profiling`, `## Code Style`).
## Cross-Session Learnings
When domain agents discover non-obvious technical facts about the codebase (e.g., "PIL close() preserves metadata", "Paddle arena chunks are 500 MiB from C++"), they record them in HANDOFF.md's "Key Discoveries" section. After a session ends or plateau is reached, distill the most important discoveries into `.codeflash/learnings.md` so future sessions across ALL domains can benefit.
Learnings.md is NOT a session log — it's a curated set of facts that prevent future sessions from repeating dead ends. Each entry should be:
```
## <Short title>
<Specific technical detail with evidence. Include what was tried and why it didn't work.>
```
Read learnings.md at every session start and include it in the domain agent's launch prompt.
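A sketch of the distillation step; the entry fields follow the template defined in learnings.md, with placeholders left as-is:
```bash
# Sketch: append a distilled discovery to learnings.md at session end
cat >> .codeflash/learnings.md <<'EOF'

## <Short descriptive title>
**Domain:** memory | cpu | async | structure
**Discovered:** <date>
<finding with evidence, copied from HANDOFF.md's Key Discoveries>
**Implication:** <what future sessions should do or avoid>
EOF
```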


@@ -0,0 +1,42 @@
# Cross-Session Learnings
Non-obvious technical discoveries about this codebase. Read at session start to avoid repeating dead ends.
## How to use this file
- **Domain agents**: Add entries after discovering something non-obvious (keep or discard).
- **Router agent**: Read this file at every session start and include it in the domain agent's launch prompt.
- **Entries should be**: specific, technical, evidence-based. Not opinions or preferences.
- **Remove entries** when they become outdated (e.g., a library version changes and the workaround no longer applies).
## Template
```markdown
## <Short descriptive title>
**Domain:** memory | cpu | async | structure
**Discovered:** <date>
<1-3 sentences with the specific technical finding. Include evidence: profiler output, version numbers, error messages.>
**Implication:** <What this means for future optimization attempts. What to do or avoid.>
```
## Example entries
```markdown
## pytest-memray measures per-test peak only
**Domain:** memory
**Discovered:** 2026-03-17
pytest-memray's `@pytest.mark.limit_memory` and `--memray` flag measure memory allocated during the test function body only. Import-time allocations (module globals, C extension init) are NOT counted. Verified: 40 MiB english_words list invisible in pytest-memray but visible in `memray run`.
**Implication:** Import-time memory optimizations will show zero improvement in pytest-memray benchmarks. Use `memray run` on the full process to capture import-time.
## Paddle inference engine allocates in 500 MiB arena chunks
**Domain:** memory
**Discovered:** 2026-03-19
PaddleOCR's C++ inference engine allocates memory in 500 MiB arena chunks via `auto_growth` strategy. These are native memory pools, not proportional to data size. `config.memory_pool_init_size_mb()` is read-only (100 MiB default, but pool grows to 500 MiB). `enable_ort_optimization()` requires Paddle compiled with ONNX Runtime support. `rec_batch_num` controls the number of arena chunks allocated during recognition (6 -> 4 chunks, 1 -> 1 chunk).
**Implication:** Cannot cap Paddle arena size directly. Only lever is `rec_batch_num` to reduce number of chunks. Don't waste time on arena configuration APIs.
```


@@ -93,10 +93,18 @@ Use `gh pr edit NUMBER --repo ORG/REPO --body "$(cat <<'EOF' ... EOF)"` to repla
Each domain agent defines its own branch prefix and PR title prefix. Common rules:
- **Do NOT open PRs yourself** unless the user explicitly asks. Prepare the branch, push it, tell the user it's ready.
- **Do NOT open PRs yourself** unless the user explicitly asks. Prepare the branch, push it, tell the user it's ready. Do NOT push branches or create PRs as a "next step" — wait for explicit instruction.
- Keep PR changed files minimal — only the actual code change, not benchmark scripts or images.
- Benchmark scripts go inline in the PR body `<details>` block.
### Writing quality
Write PR descriptions like a human engineer, not a summarizer:
- **Be specific**: "Replaces HuggingFace's RTDetrImageProcessor with torchvision transforms to eliminate 110 MiB of duplicate weight loading" — not "Improves memory efficiency of image processing."
- **Lead with the technical mechanism**, not the benefit. Reviewers want to know WHAT you did, not that it's "an improvement."
- **No generic headings** like "Summary", "Overview", "Key Changes" unless the PR template requires them. If the change is simple enough for 2 sentences, use 2 sentences.
- **Don't over-explain** the problem. Assume the reviewer knows the codebase. Explain WHY your approach works, not what the code does line-by-line.
### 7. Chart hosting (if available)
If the project has an image hosting setup (e.g., an orphan branch for assets), use it:
@@ -123,3 +131,13 @@ gh api repos/ORG/REPO/contents/images/{name}.png \
```
Otherwise, describe the results in text tables only.
### 8. Chart generation guidelines
When generating benchmark charts (e.g., with plotly, matplotlib):
- **Separate concerns**: Use distinct charts for different metrics (throughput vs memory, latency vs RSS). Combined charts are hard to read and require multiple iterations.
- **Plain-language axis labels**: Use "Peak Memory (MiB)" not "RSS delta". Use "Throughput (req/s)" not "ops".
- **Include the baseline**: Always show the baseline variant as the first bar/line for comparison.
- **Annotate absolute values**: Don't just show bars — label each with the actual number.
- **Keep it simple**: Bar charts for before/after comparisons. Line charts only for scaling tests (varying N). No 3D charts, no unnecessary styling.


@@ -15,11 +15,21 @@ Optimization session launcher.
Before launching the agent, ask the user: "Before I start optimizing, is there anything I should know? For example: areas to avoid, known constraints, things you've already tried, or specific files to focus on. Or just say 'go' to proceed."
Wait for the user's response. Then use the Agent tool to launch the **codeflash** agent with `run_in_background: true`. Include the user's original request AND their answer in the prompt. Do not add any other instructions — the agent has its own workflow.
Wait for the user's response. Then use the Agent tool to launch the **codeflash** agent with `run_in_background: true`. Include the user's original request AND their answer in the prompt. Also include this directive at the top of the prompt:
```
AUTONOMOUS MODE: The user has already been asked for context (included below). Do NOT ask the user any questions — work fully autonomously. Make all decisions yourself: generate a run tag from today's date, identify benchmark tiers from available tests, choose optimization targets from profiler output. If something is ambiguous, pick the reasonable default and document your choice in HANDOFF.md.
```
Do not add any other instructions — the agent has its own workflow.
## For `resume`
Use the Agent tool to launch the **codeflash** agent with `run_in_background: true`. Pass `resume` and the user's request as the prompt.
Use the Agent tool to launch the **codeflash** agent with `run_in_background: true`. Pass `resume` and the user's request as the prompt. Include this at the top:
```
AUTONOMOUS MODE: Work fully autonomously. Do NOT ask the user any questions. Read session state from .codeflash/ and continue where the last session left off.
```
## For `status`