feat: git memory, guard command, stuck recovery, batched setup (#2)

* feat: git memory, guard command, stuck recovery, batched setup

- Add git history review as step 1 of every experiment loop iteration
  (read git log + diff to learn from past experiments and detect patterns)
- Add guard command as formal regression safety net (step 10) with
  revert-and-rework protocol (max 2 attempts before discard)
- Add stuck state recovery protocol (5+ consecutive discards triggers
  re-read of all files, results log analysis, goal re-check, combination
  of past successes, and opposite-strategy attempts)
- Add batched setup questions rule to orchestrator (max 4 questions per
  message, never one-at-a-time across round-trips)
- Update decision tree to include guard check before keep/discard
- Update orchestrator workflow to configure guard during session init

Inspired by patterns from uditgoenka/autoresearch.
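
For a concrete picture, a minimal sketch of what a configured guard could look like in `.codeflash/conventions.md` (the `## Guard` heading comes from the orchestrator workflow; the specific command and surrounding layout are assumptions):

```bash
# Hypothetical setup snippet: record the user's guard command under "## Guard"
# so domain agents can run it after every benchmark.
cat >> .codeflash/conventions.md << 'EOF'

## Guard
pytest tests/
EOF
```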

* feat: add git history, guard, and config audit steps to cpu/memory/structure agents

Align experiment loops in all domain agents with async agent.
Each now includes step 1 (review git history), guard command
(revert+rework on failure), and config audit after KEEP with
domain-specific guidance.
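
In practice the git history step is three reads at the top of every iteration (taken verbatim from the shared loop):

```bash
git log --oneline -20   # experiment sequence: what was tried
git diff HEAD~1         # why the last change worked (or didn't)
git log -20 --stat      # which files drive improvements
```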

* feat: add CI plugin validation workflow (#3)

Uses claude-code-action with plugin-dev plugin to validate plugin
structure, agent consistency, eval manifests, and skills on every PR.
Includes @claude mention support for interactive fixes.

* fix: correct plugin marketplace name for CI validation

plugin-dev is in claude-plugins-official, not claude-code-plugins.
Also adds plugin_marketplaces URL for discovery.

* fix: expand allowed tools for validation workflow

Add gh pr comment, gh api, cat, python3, jq to allowed tools so
Claude can post PR summary comments and subagents can function.

* fix: enable track_progress and show_full_output for debugging

* fix: remove colons from Bash glob patterns in validate allowedTools

The gh command patterns used colons (e.g. Bash(gh pr diff:*)) which
are treated as literal characters, so they never matched actual
commands like `gh pr diff 2 --name-only`. This caused 1 permission
denial per CI run and prevented the summary comment from posting.

* fix: fail CI job when validation finds issues

Add verdict step that writes PASS/FAIL to a file, with a follow-up
workflow step that exits 1 on FAIL. Previously validation reported
issues in comments but the job always succeeded.
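
A minimal sketch of that follow-up step, assuming the verdict is written to a scratch file (path hypothetical):

```bash
# Hypothetical path; the real step writes PASS or FAIL during validation.
if grep -q 'FAIL' /tmp/validation-verdict.txt; then
  echo "Plugin validation reported FAIL" >&2
  exit 1
fi
```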

* fix: remove double-commit contradiction in async agent and shared base

Step 4/5/6 (Implement) said "and commit" but the commit-after-KEEP
step later says "Do NOT commit discards." This meant discarded
experiments were still committed. CPU/memory/structure agents already
had it right — only commit at the KEEP step.
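
The surviving rule is the KEEP-step commit (prefix varies by domain: perf:, mem:, async:, struct:):

```bash
# Only after a KEEP; discarded experiments are never committed.
git add -A && git commit -m "perf: <one-line summary of fix>"
```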

* fix: treat validation warnings as blocking failures

Warnings were previously non-blocking — the verdict step only
checked for "issues that need fixing." Now any warning also
triggers FAIL.

* fix: address plugin-validator warnings

- Declare context7 MCP server in plugin.json (domain agents use it)
- Change codeflash-setup color from green to red (collision with router)
- Add memory: project and example block to codeflash-setup frontmatter

* feat: add eval regression testing (Layer 3)

- baseline-scores.json: checked-in expected scores + min thresholds
  for ranking, memory-hard, memory-misdirection
- check-regression.sh: orchestrator that runs evals, scores them, and
  compares to baselines (exits 1 if any score < min)
- eval-regression.yml: on-demand CI workflow (workflow_dispatch) with
  Bedrock OIDC, artifact upload, and job summary table
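
Invocation follows the usage block at the top of the script:

```bash
./evals/check-regression.sh                      # run all baseline evals
./evals/check-regression.sh ranking              # run a specific template
./evals/check-regression.sh --score-only <dir>   # score existing results, skip running
```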

* fix: address remaining plugin-validator warnings

- Add memory: project to codeflash-memory.md (matches other agents)
- Add allowed-tools to memray-profiling skill
- Fix blank line in codeflash-setup.md frontmatter
- Add git log -20 --stat to all 4 domain agents' git history step
- Add step numbering note to async reference experiment-loop.md

* feat: add deterministic scoring for profiler and ranking criteria

Add session-text-based auto-scoring that overrides LLM grades for
mechanically verifiable criteria:
- used_memory_profiler: grep for memray/tracemalloc Bash commands
- profiled_iteratively: count distinct profiling runs (1=1pt, 2+=full)
- built_ranked_list_with_impact_pct: detect cProfile + ranking output

These anchor 2-4 points per eval deterministically, reducing LLM
variance. Baseline thresholds tightened from min=6 to min=7.
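
Roughly, the `used_memory_profiler` check reduces to a grep over the captured session text (transcript path and line format assumed; the real logic lives in the scoring script):

```bash
SESSION=results/session.txt   # assumed location of the captured session transcript
if grep -E 'Bash.*(memray|tracemalloc)' "$SESSION" > /dev/null; then
  echo "used_memory_profiler: full credit (deterministic)"
else
  echo "used_memory_profiler: 0 (deterministic)"
fi
```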

* fix: parse verdict from PR comment instead of temp file

Claude writes "Verdict: FAIL/PASS" in the PR comment but doesn't
execute the python3 file-write command. The check step now reads
the claude[bot] comment via gh api and greps for the verdict line.
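
Sketched, assuming the standard `GITHUB_REPOSITORY` variable and a `PR_NUMBER` supplied by the workflow:

```bash
verdict=$(gh api "repos/${GITHUB_REPOSITORY}/issues/${PR_NUMBER}/comments" \
  --jq '.[] | select(.user.login == "claude[bot]") | .body' \
  | grep -oE 'Verdict: (PASS|FAIL)' | tail -1)
if [ "$verdict" = "Verdict: FAIL" ]; then
  echo "Validator verdict: FAIL" >&2
  exit 1
fi
```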

* fix: address all remaining validator findings for PR #2

- Remove non-standard fields (repository, license, keywords) from plugin.json
- Pin context7 MCP to @2.1.4 instead of @latest
- Add context7 fallback note in router agent
- Remove unused Read from codeflash-optimize allowed-tools
- Make AskUserQuestion usage explicit in skill body
- Add tracemalloc to memray-profiling trigger phrases
- List all reference files in memray-profiling skill

* fix: wire stuck state recovery into domain agents, fix skill allowed-tools

- Add 5+ consecutive discard stuck recovery protocol to all 4 domain
  agents (cpu, memory, async, structure) inline experiment loops,
  matching the shared base definition
- Add Read and Bash to codeflash-optimize skill allowed-tools so it
  can inspect session state before delegating
- Add Write to memray-profiling skill allowed-tools so it can create
  profiling harness scripts

* fix: sync versions, add missing tools to router and skills

- Sync marketplace.json metadata version to 0.1.0 to match plugin.json
- Add Write and Edit to codeflash.md router tools (needed for
  writing .codeflash/conventions.md during setup)
- Remove Bash from codeflash-optimize skill (least-privilege: all
  execution is delegated via Agent)
- Add Grep and Glob to memray-profiling skill (needed for finding
  test files, capture outputs, and config files)

* fix: restore plugin.json fields, revert marketplace version, relax validator verdict

- Restore repository, license, keywords in plugin.json (accidentally
  removed when mcpServers was added)
- Revert marketplace.json metadata.version to 1.0.0 (collection
  version, distinct from individual plugin version 0.1.0)
- Change validator verdict rule: only FAIL on major issues, not
  warnings — prevents the LLM validator from blocking on subjective
  minor findings each run

Committed by Kevin Turcios on 2026-03-27 07:43:14 -05:00 (merged via GitHub)
commit d1f34cf794, parent e681b3732b
16 changed files with 603 additions and 95 deletions


@@ -7,5 +7,11 @@
},
"repository": "https://github.com/codeflash-ai/codeflash-agent",
"license": "BSL-1.1",
"keywords": ["optimization", "performance", "profiling", "code-quality"]
"keywords": ["optimization", "performance", "profiling", "python"],
"mcpServers": {
"context7": {
"command": "npx",
"args": ["-y", "@upstash/context7-mcp@2.1.4"]
}
}
}

.github/workflows/eval-regression.yml (new file)

@@ -0,0 +1,107 @@
name: Eval Regression
on:
workflow_dispatch:
inputs:
templates:
description: 'Comma-separated eval templates (blank = all baseline evals)'
required: false
default: ''
jobs:
eval:
runs-on: ubuntu-latest
permissions:
contents: read
id-token: write
timeout-minutes: 30
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Configure AWS Credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_ROLE_TO_ASSUME }}
aws-region: ${{ secrets.AWS_REGION }}
- name: Install uv
uses: astral-sh/setup-uv@v6
- name: Install Claude Code
run: npm install -g @anthropic-ai/claude-code
- name: Configure Claude for Bedrock
run: |
mkdir -p ~/.claude
cat > ~/.claude/settings.json << 'EOF'
{
"permissions": {
"allow": ["Bash", "Read", "Write", "Edit", "Glob", "Grep", "Agent", "Skill"],
"deny": []
}
}
EOF
- name: Run regression check
env:
ANTHROPIC_MODEL: us.anthropic.claude-sonnet-4-6
CLAUDE_CODE_USE_BEDROCK: 1
run: |
chmod +x evals/check-regression.sh evals/run-eval.sh evals/score-eval.sh
ARGS=()
if [ -n "${{ inputs.templates }}" ]; then
IFS=',' read -ra TMPLS <<< "${{ inputs.templates }}"
for t in "${TMPLS[@]}"; do
ARGS+=("$(echo "$t" | xargs)")
done
fi
./evals/check-regression.sh "${ARGS[@]}"
- name: Upload results
if: always()
uses: actions/upload-artifact@v4
with:
name: eval-results-${{ github.run_number }}
path: evals/results/
retention-days: 30
- name: Post job summary
if: always()
run: |
SUMMARY="evals/results/regression-summary.json"
if [ ! -f "$SUMMARY" ]; then
echo "::warning::No regression summary found"
exit 0
fi
passed=$(jq -r '.passed' "$SUMMARY")
echo "## Eval Regression Results" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
if [ "$passed" = "true" ]; then
echo "**Status: PASSED**" >> $GITHUB_STEP_SUMMARY
else
echo "**Status: FAILED**" >> $GITHUB_STEP_SUMMARY
fi
echo "" >> $GITHUB_STEP_SUMMARY
echo "| Template | Score | Min | Expected | Status |" >> $GITHUB_STEP_SUMMARY
echo "|----------|-------|-----|----------|--------|" >> $GITHUB_STEP_SUMMARY
jq -r '.results | to_entries[] | "\(.key)\t\(.value.score)\t\(.value.min)\t\(.value.expected)"' "$SUMMARY" | \
while IFS=$'\t' read -r template score min expected; do
if [ "$score" -lt "$min" ]; then
status="FAIL"
elif [ "$score" -lt "$expected" ]; then
status="WARN"
else
status="PASS"
fi
echo "| $template | $score | $min | $expected | $status |" >> $GITHUB_STEP_SUMMARY
done
echo "" >> $GITHUB_STEP_SUMMARY
echo "*Triggered at $(jq -r '.timestamp' "$SUMMARY")*" >> $GITHUB_STEP_SUMMARY


@@ -149,8 +149,9 @@ jobs:
**Verdict: PASS**
**Verdict: FAIL**
Use FAIL if ANY step found issues or warnings. Warnings are blocking.
Use PASS only if every step passed with zero issues and zero warnings.
Use FAIL only if a step found a **major** issue (broken functionality, missing required fields, incorrect cross-references).
Warnings and minor style suggestions are NOT blocking — use PASS if the only findings are warnings.
Use PASS if every step passed or only had minor/warning-level findings.
</step>
claude_args: '--model us.anthropic.claude-sonnet-4-6 --allowedTools "Agent,Read,Glob,Grep,Bash(gh pr diff*),Bash(gh pr view*),Bash(gh pr comment*),Bash(gh api*),Bash(git diff*),Bash(git log*),Bash(git status*),Bash(cat *),Bash(python3 *),Bash(jq *)"'


@@ -148,35 +148,39 @@ $RUNNER /tmp/micro_bench_<name>.py b
LOOP (until plateau or user requests stop):
1. **Choose target.** Highest-impact antipattern from profiling/static analysis. Print `[experiment N] Target: <description> (<pattern>)`.
1. **Review git history.** Read `git log --oneline -20`, `git diff HEAD~1`, and `git log -20 --stat` to learn from past experiments. Look for patterns: if 3+ commits that improved the metric all touched the same file or area, focus there. If a specific approach failed 3+ times, avoid it. If a successful commit used a technique, look for similar opportunities elsewhere.
2. **Reasoning checklist.** Answer all 10 questions. Unknown = research more.
2. **Choose target.** Highest-impact antipattern from profiling/static analysis, informed by git history patterns. Print `[experiment N] Target: <description> (<pattern>)`.
3. **Micro-benchmark** (when applicable). Print `[experiment N] Micro-benchmarking...` then result.
3. **Reasoning checklist.** Answer all 10 questions. Unknown = research more.
4. **Implement and commit.** Print `[experiment N] Implementing: <one-line summary>`.
4. **Micro-benchmark** (when applicable). Print `[experiment N] Micro-benchmarking...` then result.
5. **Verify benchmark fidelity.** Re-read the benchmark and confirm it exercises the exact code path and parameters you changed. If you modified wrapper flags (e.g., `thread_sensitive`), pool sizes, or driver config, the benchmark must use the same values. Update the benchmark if needed.
5. **Implement.** Print `[experiment N] Implementing: <one-line summary>`.
6. **Benchmark.** Run at agreed concurrency level. Print `[experiment N] Benchmarking at concurrency=<N>...`.
6. **Verify benchmark fidelity.** Re-read the benchmark and confirm it exercises the exact code path and parameters you changed. If you modified wrapper flags (e.g., `thread_sensitive`), pool sizes, or driver config, the benchmark must use the same values. Update the benchmark if needed.
7. **Read results.** Print `[experiment N] Latency: <before>ms -> <after>ms (<Z>% faster). Throughput: <X> -> <Y> req/s`.
7. **Benchmark.** Run at agreed concurrency level. Print `[experiment N] Benchmarking at concurrency=<N>...`.
8. **Crashed or regressed?** Fix or discard immediately.
8. **Guard** (if configured in conventions.md). Run the guard command. If it fails: revert, rework (max 2 attempts), then discard.
9. **Small delta?** If <10%, re-run 3 times. Async benchmarks have higher variance.
9. **Read results.** Print `[experiment N] Latency: <before>ms -> <after>ms (<Z>% faster). Throughput: <X> -> <Y> req/s`.
10. **Record** in `.codeflash/results.tsv` AND `.codeflash/HANDOFF.md` immediately. Don't batch.
10. **Crashed or regressed?** Fix or discard immediately.
11. **Keep/discard** (see below). Print `[experiment N] KEEP` or `[experiment N] DISCARD — <reason>`.
11. **Small delta?** If <10%, re-run 3 times. Async benchmarks have higher variance.
12. **Config audit** (after KEEP). Check for related configuration flags that became dead or inconsistent. Infrastructure changes (drivers, pools, middleware) often leave behind no-op config.
12. **Record** in `.codeflash/results.tsv` AND `.codeflash/HANDOFF.md` immediately. Don't batch.
13. **Commit after KEEP.** `git add -A && git commit -m "async: <one-line summary of fix>"`. Each optimization gets its own commit so they can be reverted or cherry-picked independently. Do NOT commit discards.
13. **Keep/discard** (see below). Print `[experiment N] KEEP` or `[experiment N] DISCARD — <reason>`.
14. **Debug mode validation** (optional): After keeping a blocking-call fix, re-run with `PYTHONASYNCIODEBUG=1` to confirm the slow callback warning is gone.
14. **Config audit** (after KEEP). Check for related configuration flags that became dead or inconsistent. Infrastructure changes (drivers, pools, middleware) often leave behind no-op config.
15. **Milestones** (every 3-5 keeps): Full benchmark, `codeflash/async-<tag>-v<N>` tag.
15. **Commit after KEEP.** `git add -A && git commit -m "async: <one-line summary of fix>"`. Each optimization gets its own commit so they can be reverted or cherry-picked independently. Do NOT commit discards.
16. **Debug mode validation** (optional): After keeping a blocking-call fix, re-run with `PYTHONASYNCIODEBUG=1` to confirm the slow callback warning is gone.
17. **Milestones** (every 3-5 keeps): Full benchmark, `codeflash/async-<tag>-v<N>` tag.
### Keep/Discard
@@ -207,6 +211,19 @@ Async changes often show larger gains under higher concurrency. If a change remo
3+ consecutive discards on same type -> switch:
sequential await gathering -> blocking call removal -> connection management -> architectural restructuring
### Stuck State Recovery
If 5+ consecutive discards (across all strategy rotations), trigger this recovery protocol before giving up:
1. **Re-read all in-scope files from scratch.** Your mental model may have drifted — re-read the actual code, not your cached understanding.
2. **Re-read the full results log** (`.codeflash/results.tsv`). Look for patterns: which files/functions appeared in successful experiments (focus there), which techniques worked (try variants on new targets), which approaches failed repeatedly (avoid them).
3. **Re-read the original goal.** Has the focus drifted from what the user asked for?
4. **Try combining 2-3 previously successful changes** that might compound (e.g., an await gathering + a connection pool change in the same async path).
5. **Try the opposite** of what hasn't worked. If fine-grained optimizations keep failing, try a coarser architectural change. If local changes keep failing, try a cross-function refactor.
6. **Check git history for hints**: `git log --oneline -20 --stat` — do successful commits cluster in specific files or patterns?
If recovery still produces no improvement after 3 more experiments, **stop and report** with a summary of what was tried and why the codebase appears to be at its optimization floor for this domain.
## Progress Updates
Print one status line before each major step:


@@ -183,31 +183,37 @@ ADAPTIVE opcodes on hot paths = type instability. LOAD_ATTR_INSTANCE_VALUE -> LO
LOOP (until plateau or user requests stop):
1. **Choose target.** Pick the #1 function from your ranked target list. **If it is below 2% of total, STOP — print `[STOP] All remaining targets below 2% threshold — not worth the experiment cost.` and end the loop.** Do NOT fix cold-code antipatterns even if the fix is trivial. Read the target function's source code now (only this function).
1. **Review git history.** Read `git log --oneline -20`, `git diff HEAD~1`, and `git log -20 --stat` to learn from past experiments. Look for patterns: if 3+ commits that improved the metric all touched the same file or area, focus there. If a specific approach failed 3+ times, avoid it. If a successful commit used a technique, look for similar opportunities elsewhere.
2. **Reasoning checklist.** Answer all 9 questions. Unknown = research more.
2. **Choose target.** Pick the #1 function from your ranked target list. **If it is below 2% of total, STOP — print `[STOP] All remaining targets below 2% threshold — not worth the experiment cost.` and end the loop.** Do NOT fix cold-code antipatterns even if the fix is trivial. Read the target function's source code now (only this function).
3. **Micro-benchmark** (when applicable). Print `[experiment N] Micro-benchmarking...` then result.
3. **Reasoning checklist.** Answer all 9 questions. Unknown = research more.
4. **Implement.** Fix ONLY the one target function. Do not touch other functions. Print `[experiment N] Implementing: <one-line summary>`.
4. **Micro-benchmark** (when applicable). Print `[experiment N] Micro-benchmarking...` then result.
5. **Benchmark.** Run target test. Always run for correctness.
5. **Implement.** Fix ONLY the one target function. Do not touch other functions. Print `[experiment N] Implementing: <one-line summary>`.
6. **Read results.** Print `[experiment N] baseline <X>s, optimized <Y>s — <Z>% faster`.
6. **Benchmark.** Run target test. Always run for correctness.
7. **Crashed or regressed?** Fix or discard immediately.
7. **Guard** (if configured in conventions.md). Run the guard command. If it fails: revert, rework (max 2 attempts), then discard.
8. **Small delta?** If <5% speedup, re-run 3 times to confirm not noise.
8. **Read results.** Print `[experiment N] baseline <X>s, optimized <Y>s — <Z>% faster`.
9. **Record** in `.codeflash/results.tsv` AND `.codeflash/HANDOFF.md` immediately. Don't batch.
9. **Crashed or regressed?** Fix or discard immediately.
10. **Keep/discard** (see below). Print `[experiment N] KEEP` or `[experiment N] DISCARD — <reason>`.
10. **Small delta?** If <5% speedup, re-run 3 times to confirm not noise.
11. **Commit after KEEP.** `git add -A && git commit -m "perf: <one-line summary of fix>"`. Each optimization gets its own commit so they can be reverted or cherry-picked independently. Do NOT commit discards.
11. **Record** in `.codeflash/results.tsv` AND `.codeflash/HANDOFF.md` immediately. Don't batch.
12. **MANDATORY: Re-profile.** After every KEEP, you MUST re-run the cProfile + ranked-list extraction commands from the Profiling section to get fresh numbers. Print `[re-rank] Re-profiling after fix...` then the new `[ranked targets]` list. Compare each target's new cumtime against the **ORIGINAL baseline total** (before any fixes) — a function that was 1.7% of the original is still cold even if it's now 50% of the reduced total. If all remaining targets are below 2% of the original baseline, STOP.
12. **Keep/discard** (see below). Print `[experiment N] KEEP` or `[experiment N] DISCARD — <reason>`.
13. **Milestones** (every 3-5 keeps): Full benchmark, `codeflash/ds-<tag>-v<N>` tag.
13. **Config audit** (after KEEP). Check for related configuration flags that became dead or inconsistent. Data structure changes (container swaps, caching, __slots__) may leave behind unused size hints, obsolete cache settings, or redundant validation.
14. **Commit after KEEP.** `git add -A && git commit -m "perf: <one-line summary of fix>"`. Each optimization gets its own commit so they can be reverted or cherry-picked independently. Do NOT commit discards.
15. **MANDATORY: Re-profile.** After every KEEP, you MUST re-run the cProfile + ranked-list extraction commands from the Profiling section to get fresh numbers. Print `[re-rank] Re-profiling after fix...` then the new `[ranked targets]` list. Compare each target's new cumtime against the **ORIGINAL baseline total** (before any fixes) — a function that was 1.7% of the original is still cold even if it's now 50% of the reduced total. If all remaining targets are below 2% of the original baseline, STOP.
16. **Milestones** (every 3-5 keeps): Full benchmark, `codeflash/ds-<tag>-v<N>` tag.
### Keep/Discard
@@ -236,6 +242,19 @@ Test passed?
3+ consecutive discards on same type -> switch:
container swaps -> algorithmic restructuring -> caching/precomputation -> stdlib replacements
### Stuck State Recovery
If 5+ consecutive discards (across all strategy rotations), trigger this recovery protocol before giving up:
1. **Re-read all in-scope files from scratch.** Your mental model may have drifted — re-read the actual code, not your cached understanding.
2. **Re-read the full results log** (`.codeflash/results.tsv`). Look for patterns: which files/functions appeared in successful experiments (focus there), which techniques worked (try variants on new targets), which approaches failed repeatedly (avoid them).
3. **Re-read the original goal.** Has the focus drifted from what the user asked for?
4. **Try combining 2-3 previously successful changes** that might compound (e.g., a data structure change + an algorithm change in the same hot path).
5. **Try the opposite** of what hasn't worked. If fine-grained optimizations keep failing, try a coarser architectural change. If local changes keep failing, try a cross-function refactor.
6. **Check git history for hints**: `git log --oneline -20 --stat` — do successful commits cluster in specific files or patterns?
If recovery still produces no improvement after 3 more experiments, **stop and report** with a summary of what was tried and why the codebase appears to be at its optimization floor for this domain.
## Diff Hygiene
Before pushing, review `git diff <base>..HEAD`:


@@ -20,6 +20,7 @@ description: >
model: inherit
color: yellow
memory: project
skills:
- memray-profiling
tools: ["Read", "Edit", "Write", "Bash", "Grep", "Glob", "Agent", "WebFetch", "mcp__context7__resolve-library-id", "mcp__context7__query-docs"]
@@ -155,27 +156,33 @@ $RUNNER /tmp/micro_bench_<name>.py b
LOOP (until plateau or user requests stop):
1. **Choose target.** Highest-memory reducible allocation from profiler output. Print `[experiment N] Target: <description> (<category>, <size> MiB)`. Read ONLY this target's source code.
1. **Review git history.** Read `git log --oneline -20`, `git diff HEAD~1`, and `git log -20 --stat` to learn from past experiments. Look for patterns: if 3+ commits that improved the metric all touched the same file or area, focus there. If a specific approach failed 3+ times, avoid it. If a successful commit used a technique, look for similar opportunities elsewhere.
2. **Reasoning checklist.** Answer all 8 questions. Unknown = research more.
2. **Choose target.** Highest-memory reducible allocation from profiler output. Print `[experiment N] Target: <description> (<category>, <size> MiB)`. Read ONLY this target's source code.
3. **Micro-benchmark** (when applicable). Print `[experiment N] Micro-benchmarking...` then result.
3. **Reasoning checklist.** Answer all 8 questions. Unknown = research more.
4. **Implement.** Fix ONLY the one target allocation. Do not touch other functions. Print `[experiment N] Implementing: <one-line summary>`.
4. **Micro-benchmark** (when applicable). Print `[experiment N] Micro-benchmarking...` then result.
5. **Benchmark.** Run target test. Always run for correctness, even for micro-only changes.
5. **Implement.** Fix ONLY the one target allocation. Do not touch other functions. Print `[experiment N] Implementing: <one-line summary>`.
6. **Read results.** Print `[experiment N] <before> MiB -> <after> MiB (<delta> MiB)`.
6. **Benchmark.** Run target test. Always run for correctness, even for micro-only changes.
7. **Crashed or regressed?** Fix or discard immediately.
7. **Guard** (if configured in conventions.md). Run the guard command. If it fails: revert, rework (max 2 attempts), then discard.
8. **Small delta?** If <5 MiB, re-run to confirm not noise.
8. **Read results.** Print `[experiment N] <before> MiB -> <after> MiB (<delta> MiB)`.
9. **Record** in `.codeflash/results.tsv` immediately. Don't batch.
9. **Crashed or regressed?** Fix or discard immediately.
10. **Keep/discard** (see below). Print `[experiment N] KEEP` or `[experiment N] DISCARD — <reason>`.
10. **Small delta?** If <5 MiB, re-run to confirm not noise.
11. **Update HANDOFF.md** immediately after each experiment:
11. **Record** in `.codeflash/results.tsv` immediately. Don't batch.
12. **Keep/discard** (see below). Print `[experiment N] KEEP` or `[experiment N] DISCARD — <reason>`.
13. **Config audit** (after KEEP). Check for related configuration flags that became dead or inconsistent. Memory changes (buffer management, loading strategies, format changes) may leave behind unused pool sizes, stale allocation hints, or redundant config.
14. **Update HANDOFF.md** immediately after each experiment:
- **KEEP**: Add to "Optimizations Kept" with numbered entry, mechanism, and MiB savings.
- **DISCARD**: Add to "What Was Tried and Discarded" table with exp#, what, and specific reason.
- **Discovery**: Did you learn something non-obvious about how this system allocates memory? Add to "Key Discoveries" with a numbered entry. Examples of discoveries worth recording:
@@ -185,11 +192,11 @@ LOOP (until plateau or user requests stop):
- "ONNX run() workspace is temporary — freed when run() returns"
These discoveries prevent future sessions from wasting experiments on dead ends.
12. **Commit after KEEP.** `git add -A && git commit -m "mem: <one-line summary of fix>"`. Each optimization gets its own commit so they can be reverted or cherry-picked independently. Do NOT commit discards.
15. **Commit after KEEP.** `git add -A && git commit -m "mem: <one-line summary of fix>"`. Each optimization gets its own commit so they can be reverted or cherry-picked independently. Do NOT commit discards.
13. **MANDATORY: Re-profile after every KEEP.** Run the per-stage profiling script again to get fresh numbers. Print `[re-profile] After fix...` then the updated per-stage table. The profile shape has changed — the old #2 allocator may now be #1. Do NOT skip this step.
16. **MANDATORY: Re-profile after every KEEP.** Run the per-stage profiling script again to get fresh numbers. Print `[re-profile] After fix...` then the updated per-stage table. The profile shape has changed — the old #2 allocator may now be #1. Do NOT skip this step.
14. **Milestones** (every 3-5 keeps): Full benchmark, `codeflash/mem-<tag>-v<N>` tag.
17. **Milestones** (every 3-5 keeps): Full benchmark, `codeflash/mem-<tag>-v<N>` tag.
### Keep/Discard
@@ -251,6 +258,19 @@ A tier escalation often reveals new optimization targets that were invisible in
3+ failures on same allocation type -> switch:
allocations -> format changes -> reordering -> quantization
### Stuck State Recovery
If 5+ consecutive discards (across all strategy rotations), trigger this recovery protocol before giving up:
1. **Re-read all in-scope files from scratch.** Your mental model may have drifted — re-read the actual code, not your cached understanding.
2. **Re-read the full results log** (`.codeflash/results.tsv`). Look for patterns: which files/functions appeared in successful experiments (focus there), which techniques worked (try variants on new targets), which approaches failed repeatedly (avoid them).
3. **Re-read the original goal.** Has the focus drifted from what the user asked for?
4. **Try combining 2-3 previously successful changes** that might compound (e.g., a format change + a reordering in the same allocation-heavy path).
5. **Try the opposite** of what hasn't worked. If fine-grained optimizations keep failing, try a coarser architectural change. If local changes keep failing, try a cross-function refactor.
6. **Check git history for hints**: `git log --oneline -20 --stat` — do successful commits cluster in specific files or patterns?
If recovery still produces no improvement after 3 more experiments, **stop and report** with a summary of what was tried and why the codebase appears to be at its optimization floor for this domain.
## Source Reading Rules
Investigate stages in **strict measured-delta order**. Do NOT let source appearance re-order.


@@ -5,8 +5,16 @@ description: >
manager, installs the project, installs profiling tools (memray), and writes
.codeflash/setup.md with the discovered environment. Called automatically
before domain agents start fresh sessions.
<example>
Context: Router agent starts a fresh optimization session
user: "Set up the project environment for optimization"
assistant: "I'll launch codeflash-setup to detect the environment and install profiling tools."
</example>
model: sonnet
color: green
color: red
memory: project
tools: ["Read", "Bash", "Glob", "Grep", "Write"]
---


@@ -168,27 +168,33 @@ if __name__ == "__main__":
LOOP (until plateau or user requests stop):
1. **Choose target.** Highest-impact structural issue. Print `[experiment N] Target: <description> (<smell>)`.
1. **Review git history.** Read `git log --oneline -20`, `git diff HEAD~1`, and `git log -20 --stat` to learn from past experiments. Look for patterns: if 3+ commits that improved the metric all touched the same file or area, focus there. If a specific approach failed 3+ times, avoid it. If a successful commit used a technique, look for similar opportunities elsewhere.
2. **Reasoning checklist.** Answer all 8 questions.
2. **Choose target.** Highest-impact structural issue. Print `[experiment N] Target: <description> (<smell>)`.
3. **Measure baseline.** Print `[experiment N] Baseline: <metric>=<value>`.
3. **Reasoning checklist.** Answer all 8 questions.
4. **Implement the move.** Follow safe refactoring protocol (below). Print `[experiment N] Moving: <entity> from <source> to <target>`.
4. **Measure baseline.** Print `[experiment N] Baseline: <metric>=<value>`.
5. **Run tests.** All tests must pass after each move.
5. **Implement the move.** Follow safe refactoring protocol (below). Print `[experiment N] Moving: <entity> from <source> to <target>`.
6. **Measure result.** Print `[experiment N] <metric>: <before> -> <after>`.
6. **Run tests.** All tests must pass after each move.
7. **Tests fail?** Fix or revert immediately.
7. **Guard** (if configured in conventions.md). Run the guard command. If it fails: revert, rework (max 2 attempts), then discard.
8. **Record** in `.codeflash/results.tsv` AND `.codeflash/HANDOFF.md` immediately. Don't batch.
8. **Measure result.** Print `[experiment N] <metric>: <before> -> <after>`.
9. **Keep/discard** (see below). Print `[experiment N] KEEP` or `[experiment N] DISCARD — <reason>`.
9. **Tests fail?** Fix or revert immediately.
10. **Commit after KEEP.** `git add -A && git commit -m "struct: <one-line summary of fix>"`. Each optimization gets its own commit so they can be reverted or cherry-picked independently. Do NOT commit discards.
10. **Record** in `.codeflash/results.tsv` AND `.codeflash/HANDOFF.md` immediately. Don't batch.
11. **Re-assess** (every 3-5 keeps): Rebuild call matrix. Print `[milestone] vN — Cross-module calls: <before> -> <after>`.
11. **Keep/discard** (see below). Print `[experiment N] KEEP` or `[experiment N] DISCARD — <reason>`.
12. **Config audit** (after KEEP). Check for related configuration flags that became dead or inconsistent. Module restructuring may leave behind stale `__all__` exports, unused re-exports, or inconsistent import paths.
13. **Commit after KEEP.** `git add -A && git commit -m "struct: <one-line summary of fix>"`. Each optimization gets its own commit so they can be reverted or cherry-picked independently. Do NOT commit discards.
14. **Re-assess** (every 3-5 keeps): Rebuild call matrix. Print `[milestone] vN — Cross-module calls: <before> -> <after>`.
### Safe Refactoring Protocol
@@ -218,6 +224,19 @@ Tests passed?
3+ failures on same type -> switch:
entity moves -> circular dep breaking -> god module decomposition -> dead code removal
### Stuck State Recovery
If 5+ consecutive discards (across all strategy rotations), trigger this recovery protocol before giving up:
1. **Re-read all in-scope files from scratch.** Your mental model may have drifted — re-read the actual code, not your cached understanding.
2. **Re-read the full results log** (`.codeflash/results.tsv`). Look for patterns: which files/functions appeared in successful experiments (focus there), which techniques worked (try variants on new targets), which approaches failed repeatedly (avoid them).
3. **Re-read the original goal.** Has the focus drifted from what the user asked for?
4. **Try combining 2-3 previously successful changes** that might compound (e.g., an entity move + a circular dep break in the same module cluster).
5. **Try the opposite** of what hasn't worked. If fine-grained moves keep failing, try a coarser decomposition. If local changes keep failing, try a cross-module refactor.
6. **Check git history for hints**: `git log --oneline -20 --stat` — do successful commits cluster in specific files or patterns?
If recovery still produces no improvement after 3 more experiments, **stop and report** with a summary of what was tried and why the codebase appears to be at its optimization floor for this domain.
## Progress Updates
```


@@ -35,7 +35,7 @@ description: >
model: sonnet
color: green
memory: project
tools: ["Read", "Bash", "Grep", "Glob", "Agent", "mcp__context7__resolve-library-id", "mcp__context7__query-docs"]
tools: ["Read", "Write", "Edit", "Bash", "Grep", "Glob", "Agent", "mcp__context7__resolve-library-id", "mcp__context7__query-docs"]
---
You are a routing agent for performance optimization. Your ONLY job is to detect the optimization domain, run setup, and launch the right specialized agent.
@@ -47,6 +47,7 @@ You are a routing agent for performance optimization. Your ONLY job is to detect
- Do NOT profile, benchmark, or optimize anything — that is the domain agent's job.
- The ONLY files you should read are: `CLAUDE.md`, `pyproject.toml`/`requirements.txt` (for dependency research), `.codeflash/*.md`, `.codeflash/results.tsv`, and guide.md reference files.
- Follow the numbered steps in order. Do not skip steps or improvise your own workflow.
- **Batch your questions.** Never ask one question at a time across multiple round-trips. If you need to ask the user about domain, scope, constraints, and guard command — ask them all in one message (max 4 questions per batch). Users should see all configuration choices together.
## Domain Detection
@@ -88,13 +89,14 @@ Once the domain agent is selected, optionally read `${CLAUDE_PLUGIN_ROOT}/agents
### Start (new session)
1. Detect domain from the user's request. If unclear, do quick discovery — read CLAUDE.md, scan the target code, or ask the user.
1. **Gather context in one batch.** Detect domain from the user's request. If anything is unclear or missing, ask all questions in one message (max 4 questions). For example, if you need domain, scope, and constraints — ask them together, not in separate round-trips. Also ask: "Is there a command that must always pass as a safety net? (e.g., `pytest tests/`, `mypy .`)" to configure the guard. If the user already provided enough context, skip the questions and proceed.
2. Run **codeflash-setup** agent and wait for it to complete.
3. **Read project context.** Read `.codeflash/setup.md` for environment info. Read the project's `CLAUDE.md` (if it exists) for architecture decisions and coding conventions. Read `.codeflash/learnings.md` (if it exists) for insights from previous sessions. Optionally read guide.md for the detected domain.
4. **Validate tests.** Run the test command from setup.md. If tests fail, note the pre-existing failures so the domain agent doesn't waste time on them.
5. **Research dependencies.** Read `pyproject.toml` (or `requirements.txt`) to identify the project's key dependencies. Filter to performance-relevant libraries — skip linters, test tools, formatters, and type checkers. For each relevant library, use `mcp__context7__resolve-library-id` to find each library, then `mcp__context7__query-docs` to fetch performance-related documentation (query with terms like "performance", "optimization", "best practices" scoped to the detected domain). Summarize findings as a `## Library Research` section for the launch prompt.
6. **Include user context.** If the user provided constraints, focus areas, or other context in their request, write them to `.codeflash/conventions.md` and include in the launch prompt.
7. Launch the domain-specific agent:
5. **Research dependencies.** Read `pyproject.toml` (or `requirements.txt`) to identify the project's key dependencies. Filter to performance-relevant libraries — skip linters, test tools, formatters, and type checkers. For each relevant library, use `mcp__context7__resolve-library-id` to find each library, then `mcp__context7__query-docs` to fetch performance-related documentation (query with terms like "performance", "optimization", "best practices" scoped to the detected domain). Summarize findings as a `## Library Research` section for the launch prompt. If context7 tools are unavailable (e.g., npx not installed), skip this step — library research is supplemental, not blocking.
6. **Configure guard.** If the user specified a guard command, write it to `.codeflash/conventions.md` under `## Guard`. The domain agent will run this command after every benchmark — if it fails, the optimization is reverted.
7. **Include user context.** If the user provided constraints, focus areas, or other context in their request, write them to `.codeflash/conventions.md` and include in the launch prompt.
8. Launch the domain-specific agent:
```
Begin a new optimization session. The user wants: <user's request>
@@ -105,7 +107,7 @@ Once the domain agent is selected, optionally read `${CLAUDE_PLUGIN_ROOT}/agents
<CLAUDE.md contents if it exists>
## Conventions
<conventions.md contents if it exists>
<conventions.md contents if it exists, including guard command if configured>
## Learnings from Previous Sessions
<learnings.md contents if it exists>
@@ -119,7 +121,7 @@ Once the domain agent is selected, optionally read `${CLAUDE_PLUGIN_ROOT}/agents
## Domain Knowledge
<guide.md contents if loaded>
```
8. For **multiple domains**, run setup once and launch the primary domain's agent first. It can detect cross-domain signals and the user can pivot later.
9. For **multiple domains**, run setup once and launch the primary domain's agent first. It can detect cross-domain signals and the user can pivot later.
### Resume


@@ -1,6 +1,8 @@
# Experiment Loop — Async Domain
> Base framework: `../shared/experiment-loop-base.md`
>
> **Note:** Step numbers below match the async agent's inline loop (17 steps), not the shared base (which has extra steps 4 "Capture original output" and 8 "Verify output equivalence" that are not applicable to async domain).
## Reasoning Checklist
@@ -18,7 +20,7 @@ Before writing any code, answer these 9 questions. If you can't answer 3-6 concr
## Domain-Specific Loop Steps
**Step 1 — Choose target** sources:
**Step 2 — Choose target** sources:
- **asyncio debug mode output**: Slow callback warnings pointing to blocking calls.
- **Static analysis**: Grep-based pattern detection — sequential awaits, await-in-loop, blocking calls, @cache on async.
- **Deep source reading**: Cross-function concurrency opportunities, connection management issues, architectural bottlenecks. Use Explore subagents for this.
@@ -27,15 +29,15 @@ Print: `[experiment N] Target: <description> (<pattern>, <est. impact>%)`
**Step 6 — Benchmark fidelity (async-specific):** If your change involves wrapper parameters (e.g., `thread_sensitive`, `Semaphore` bounds, pool sizes, driver config), verify the benchmark uses the same parameters. A benchmark testing `sync_to_async(fn)` does NOT validate a fix that uses `sync_to_async(fn, thread_sensitive=False)`. A benchmark using default pool settings does NOT validate a fix that changes pool configuration.
**Step 8 — Benchmark**: Run at the agreed concurrency level. Print `[experiment N] Benchmarking at concurrency=<N>...`.
**Step 7 — Benchmark**: Run at the agreed concurrency level. Print `[experiment N] Benchmarking at concurrency=<N>...`.
**Step 9 — Read results**: Print `[experiment N] Latency: <before>ms -> <after>ms (<Z>% faster). Throughput: <X> -> <Y> req/s`. Include whichever metrics are available.
**Step 11 — Noise threshold**: If speedup is <10%, re-run 3 times to confirm not noise. Async benchmarks have higher variance than sync.
**Step 14 — Debug mode validation (optional)**: After keeping a fix for a blocking call, re-run with `PYTHONASYNCIODEBUG=1` to confirm the slow callback warning is gone.
**Step 16 — Debug mode validation (optional)**: After keeping a fix for a blocking call, re-run with `PYTHONASYNCIODEBUG=1` to confirm the slow callback warning is gone.
**Step 15 — Milestones**: Create `codeflash/async-<tag>-v<N>` branch. Print `[milestone] vN — <total kept>/<total experiments>. Latency: <baseline>ms -> <current>ms. Throughput: <baseline> -> <current> req/s`.
**Step 17 — Milestones**: Create `codeflash/async-<tag>-v<N>` branch. Print `[milestone] vN — <total kept>/<total experiments>. Latency: <baseline>ms -> <current>ms. Throughput: <baseline> -> <current> req/s`.
## Keep/Discard Thresholds


@@ -8,21 +8,29 @@ LOOP (until plateau detected or user requests stop):
**Print a status line before each step** so the user can follow progress (see Progress Updates in the agent prompt).
1. **Choose target.** Pick the next candidate from the ranked bottleneck list (see Bottleneck Ranking in the agent prompt). Print `[experiment N] Target: <description> (<category>, <est. impact>)`. If the list is empty or stale (after a re-rank), rebuild it from profiling data (see domain file for sources).
2. **Reasoning checklist.** Answer all questions from the domain file. Unknown answers = research more.
3. **Capture original output.** Before changing anything, run the target function with representative inputs and save its output. This is your correctness oracle — the optimized version must produce identical results.
4. **Micro-benchmark** (when applicable). Print `[experiment N] Micro-benchmarking...` then result.
5. **Implement and commit**. Print `[experiment N] Implementing: <one-line summary of change>`.
6. **Verify benchmark fidelity.** Re-read the benchmark and confirm it exercises the exact code path and parameters you changed. If you modified function arguments, wrapper flags, pool sizes, or configuration, the benchmark must use the same values. If the benchmark was written before step 5, the implementation may have changed assumptions — update the benchmark to match. A benchmark that doesn't mirror the production change proves nothing.
7. **Verify output equivalence.** Run the optimized version with the same inputs from step 3 and compare outputs. If outputs differ, **discard immediately** — this is a correctness regression, not an optimization. Do not proceed to benchmarking.
8. **Benchmark**: Run target test. Print `[experiment N] Benchmarking...`. Always run for correctness, even for micro-only optimizations.
9. **Read results**: pass/fail, metrics. Print the domain-specific result line (see domain file).
10. If crashed or regressed = fix or discard immediately.
11. **Confirm small deltas**: If improvement is below the domain's noise threshold, re-run to confirm not noise.
12. **Record** in `.codeflash/results.tsv` (schema in domain file).
13. **Keep/discard** (see decision tree in domain file). Print `[experiment N] KEEP` or `[experiment N] DISCARD — <reason>`.
14. **Config audit** (after KEEP). Check for related configuration flags that may have become dead or inconsistent after your change. Infrastructure changes (drivers, pools, middleware) often leave behind no-op config. Remove or update stale flags.
15. **Milestones** (every 3-5 keeps): Run full benchmark, create milestone branch. Print `[milestone] vN — <total kept>/<total experiments>, cumulative <metric>`.
1. **Review git history.** Before choosing a target, read recent experiment history to learn from past attempts:
```bash
git log --oneline -20 # experiment sequence — what was tried
git diff HEAD~1 # why the last change worked (or didn't)
git log -20 --stat # which files drive improvements
```
Look for patterns: if 3+ commits that improved the metric all touched the same file or area, focus there. If a specific approach failed 3+ times, avoid it. If a successful commit used a technique (e.g., "replaced list with set"), look for similar opportunities elsewhere.
2. **Choose target.** Pick the next candidate from the ranked bottleneck list (see Bottleneck Ranking in the agent prompt), informed by patterns from step 1. Print `[experiment N] Target: <description> (<category>, <est. impact>)`. If the list is empty or stale (after a re-rank), rebuild it from profiling data (see domain file for sources).
3. **Reasoning checklist.** Answer all questions from the domain file. Unknown answers = research more.
4. **Capture original output.** Before changing anything, run the target function with representative inputs and save its output. This is your correctness oracle — the optimized version must produce identical results.
5. **Micro-benchmark** (when applicable). Print `[experiment N] Micro-benchmarking...` then result.
6. **Implement**. Print `[experiment N] Implementing: <one-line summary of change>`.
7. **Verify benchmark fidelity.** Re-read the benchmark and confirm it exercises the exact code path and parameters you changed. If you modified function arguments, wrapper flags, pool sizes, or configuration, the benchmark must use the same values. If the benchmark was written before step 6, the implementation may have changed assumptions — update the benchmark to match. A benchmark that doesn't mirror the production change proves nothing.
8. **Verify output equivalence.** Run the optimized version with the same inputs from step 4 and compare outputs. If outputs differ, **discard immediately** — this is a correctness regression, not an optimization. Do not proceed to benchmarking.
9. **Benchmark**: Run target test. Print `[experiment N] Benchmarking...`. Always run for correctness, even for micro-only optimizations.
10. **Guard** (if configured). Run the guard command (see Guard Command below). If the guard fails, the optimization broke something — revert and rework (max 2 attempts), then discard if still failing.
11. **Read results**: pass/fail, metrics. Print the domain-specific result line (see domain file).
12. If crashed or regressed = fix or discard immediately.
13. **Confirm small deltas**: If improvement is below the domain's noise threshold, re-run to confirm not noise.
14. **Record** in `.codeflash/results.tsv` (schema in domain file).
15. **Keep/discard** (see decision tree in domain file). Print `[experiment N] KEEP` or `[experiment N] DISCARD — <reason>`.
16. **Config audit** (after KEEP). Check for related configuration flags that may have become dead or inconsistent after your change. Infrastructure changes (drivers, pools, middleware) often leave behind no-op config. Remove or update stale flags.
17. **Milestones** (every 3-5 keeps): Run full benchmark, create milestone branch. Print `[milestone] vN — <total kept>/<total experiments>, cumulative <metric>`.
## Keep/Discard Decision Tree — Common Structure
@@ -31,27 +39,58 @@ Output matches original?
+-- NO -> DISCARD immediately (correctness regression)
+-- YES -> Test passed?
+-- NO -> Fix or discard immediately
+-- YES -> Primary metric improved?
+-- YES (>= domain threshold) -> KEEP
+-- YES (< domain threshold) -> Re-run to confirm not noise
| +-- Confirmed -> KEEP
| +-- Noise -> DISCARD
+-- Micro-bench only improved (>= domain micro threshold) -> KEEP (if on confirmed hot path)
+-- NO -> DISCARD
+-- YES -> Guard passed? (skip if no guard configured)
+-- NO -> Revert, rework optimization (max 2 attempts)
| +-- Still fails -> DISCARD
+-- YES -> Primary metric improved?
+-- YES (>= domain threshold) -> KEEP
+-- YES (< domain threshold) -> Re-run to confirm not noise
| +-- Confirmed -> KEEP
| +-- Noise -> DISCARD
+-- Micro-bench only improved (>= domain micro threshold) -> KEEP (if on confirmed hot path)
+-- NO -> DISCARD
```
Domain files specify the exact thresholds and any additional branches.
## Guard Command
An optional secondary verification that must always pass — a regression safety net. The guard prevents optimizing one metric while silently breaking another.
**Setup:** During session initialization, ask the user if there's a command that must always pass (e.g., `pytest tests/`, `mypy .`, `npm run typecheck`). Store it in `.codeflash/conventions.md` under `## Guard`. If no guard is specified, skip step 10 in the loop.
**Rules:**
- The guard runs AFTER benchmarking (step 10), not before — don't waste time guarding a change that didn't even improve the metric.
- If the metric improved but the guard fails: revert the change, rework the optimization to not break the guard, and re-run (max 2 attempts). If it still fails after 2 rework attempts, DISCARD.
- NEVER modify guard/test files to make the guard pass. Always adapt the implementation instead.
- Record guard status in results.tsv: add `guard_pass` or `guard_fail` to the status column.
## Strategy Rotation
If 3+ consecutive discards on the same type of optimization, switch strategy. Domain files list the rotation order.
## Plateau Detection
## Plateau Detection & Stuck State Recovery
**Universal checks** (run after every experiment): See Stopping Criteria in the agent prompt — diminishing returns, user target reached, cumulative stall. If any fires, stop.
**Domain-specific**: 3+ consecutive discards across all strategies = check if remaining candidates are non-optimizable (see domain file for criteria). If top 3 candidates are all non-optimizable, **stop and report to user** with what's left and why.
### Stuck State Recovery
If 5+ consecutive discards (across all strategy rotations), trigger this recovery protocol before giving up:
1. **Re-read all in-scope files from scratch.** Your mental model may have drifted — re-read the actual code, not your cached understanding.
2. **Re-read the full results log** (`.codeflash/results.tsv`). Look for patterns:
- Which files/functions appeared in successful experiments? Focus there.
- Which techniques worked? Try variants of those techniques on new targets.
- Which approaches failed repeatedly? Explicitly avoid them.
3. **Re-read the original goal.** Has the focus drifted from what the user asked for?
4. **Try combining 2-3 previously successful changes** that might compound (e.g., a data structure change + an algorithm change in the same hot path).
5. **Try the opposite** of what hasn't worked. If fine-grained optimizations keep failing, try a coarser architectural change. If local changes keep failing, try a cross-function refactor.
6. **Check git history for hints**: `git log --oneline -20 --stat` — do successful commits cluster in specific files or patterns?
If recovery still produces no improvement after 3 more experiments, **stop and report** with a summary of what was tried and why the codebase appears to be at its optimization floor for this domain.
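For steps 2 and 6, a quick pattern scan might look like the following; the column layout of `.codeflash/results.tsv` (experiment, target, technique, status) is an assumption for illustration:

```bash
# Assumed results.tsv columns: experiment_id, file_or_function, technique, status
cut -f3,4 .codeflash/results.tsv | sort | uniq -c | sort -rn    # which techniques keep vs. discard
awk -F'\t' '$4 ~ /keep/ {print $2}' .codeflash/results.tsv | sort | uniq -c | sort -rn   # where the wins cluster
git log --oneline -20 --stat    # step 6: do kept commits cluster in specific files?
```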
## Cross-Domain Escalation
During profiling or experimentation, you may discover the real bottleneck is in a different domain than the one you're optimizing. Watch for these signals:


@ -0,0 +1,10 @@
{
"version": 2,
"updated": "2026-03-27",
"note": "Deterministic auto-scoring for profiler usage, iterative profiling, and ranked list reduces LLM variance. Thresholds tightened from expected-3 to expected-2.",
"evals": {
"ranking": { "expected": 9, "min": 7, "max": 10 },
"memory-hard": { "expected": 9, "min": 7, "max": 10 },
"memory-misdirection": { "expected": 9, "min": 7, "max": 10 }
}
}

evals/check-regression.sh Executable file

@ -0,0 +1,178 @@
#!/bin/bash
set -euo pipefail
# Eval regression checker for codeflash-agent plugin
#
# Runs a subset of evals, scores them, and compares to checked-in baselines.
# Exits 1 if any score drops below the minimum threshold.
#
# Usage:
# ./check-regression.sh # run all baseline evals
# ./check-regression.sh ranking # run specific template(s)
# ./check-regression.sh --score-only <dir> # score existing results, skip running
EVAL_DIR="$(cd "$(dirname "$0")" && pwd)"
BASELINE_FILE="$EVAL_DIR/baseline-scores.json"
die() { echo "ERROR: $*" >&2; exit 1; }
[ -f "$BASELINE_FILE" ] || die "Baseline file not found: $BASELINE_FILE"
# --- Parse args ---
SCORE_ONLY=""
RESULTS_DIRS=()
TEMPLATES=()
while [[ $# -gt 0 ]]; do
  case "$1" in
    --score-only)
      SCORE_ONLY=1
      shift
      while [[ $# -gt 0 && ! "$1" =~ ^-- ]]; do
        RESULTS_DIRS+=("$1")
        shift
      done
      ;;
    *)
      TEMPLATES+=("$1")
      shift
      ;;
  esac
done
# If no templates specified, use all from baseline file
if [[ ${#TEMPLATES[@]} -eq 0 && -z "$SCORE_ONLY" ]]; then
  mapfile -t TEMPLATES < <(jq -r '.evals | keys[]' "$BASELINE_FILE")
fi
# --- Score-only mode ---
if [[ -n "$SCORE_ONLY" ]]; then
  [[ ${#RESULTS_DIRS[@]} -gt 0 ]] || die "Usage: $0 --score-only <results-dir> [<results-dir> ...]"
  for dir in "${RESULTS_DIRS[@]}"; do
    [ -d "$dir" ] || die "Results directory not found: $dir"
    echo "Scoring: $dir"
    "$EVAL_DIR/score-eval.sh" "$dir"
  done
  echo ""
  echo "Scored ${#RESULTS_DIRS[@]} result(s). Compare manually or re-run without --score-only."
  exit 0
fi
# --- Run evals ---
echo "=== Eval Regression Check ==="
echo "Templates: ${TEMPLATES[*]}"
echo "Baseline: $BASELINE_FILE"
echo ""
declare -A RESULT_DIRS
for template in "${TEMPLATES[@]}"; do
  # Verify template exists in baseline
  min=$(jq -r --arg t "$template" '.evals[$t].min // empty' "$BASELINE_FILE")
  [[ -n "$min" ]] || die "Template '$template' not found in baseline file"
  echo "--- Running: $template ---"
  output=$("$EVAL_DIR/run-eval.sh" "$template" --skill-only 2>&1)
  echo "$output"
  # Extract the results directory from run-eval output
  result_dir=$(echo "$output" | grep "^Results:" | head -1 | awk '{print $2}')
  [[ -n "$result_dir" && -d "$result_dir" ]] || die "Could not find results directory for $template"
  RESULT_DIRS[$template]="$result_dir"
  echo ""
done
# --- Score ---
echo "=== Scoring ==="
echo ""
declare -A SCORES
for template in "${TEMPLATES[@]}"; do
  result_dir="${RESULT_DIRS[$template]}"
  echo "--- Scoring: $template ---"
  "$EVAL_DIR/score-eval.sh" "$result_dir"
  # Read the score
  score_file="$result_dir/skill.score.json"
  if [[ -f "$score_file" ]]; then
    score=$(jq -r '.total' "$score_file")
    SCORES[$template]="$score"
  else
    echo "WARNING: No score file for $template"
    SCORES[$template]="0"
  fi
  echo ""
done
# --- Compare to baseline ---
echo "=== Regression Check ==="
echo ""
printf "%-25s %8s %8s %8s %10s\n" "Template" "Score" "Min" "Expected" "Status"
printf "%-25s %8s %8s %8s %10s\n" "--------" "-----" "---" "--------" "------"
FAILED=0
for template in "${TEMPLATES[@]}"; do
  score="${SCORES[$template]}"
  min=$(jq -r --arg t "$template" '.evals[$t].min' "$BASELINE_FILE")
  expected=$(jq -r --arg t "$template" '.evals[$t].expected' "$BASELINE_FILE")
  max=$(jq -r --arg t "$template" '.evals[$t].max' "$BASELINE_FILE")
  if [[ "$score" -lt "$min" ]]; then
    status="FAIL"
    FAILED=1
  elif [[ "$score" -lt "$expected" ]]; then
    status="WARN"
  else
    status="PASS"
  fi
  printf "%-25s %8s %8s %8s %10s\n" "$template" "$score/$max" "$min" "$expected" "$status"
done
echo ""
# --- Write summary for CI ---
SUMMARY_FILE="$EVAL_DIR/results/regression-summary.json"
mkdir -p "$(dirname "$SUMMARY_FILE")"
{
  echo "{"
  echo " \"timestamp\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\","
  echo " \"passed\": $([ $FAILED -eq 0 ] && echo true || echo false),"
  echo " \"results\": {"
  first=1
  for template in "${TEMPLATES[@]}"; do
    [ $first -eq 0 ] && echo ","
    first=0
    score="${SCORES[$template]}"
    min=$(jq -r --arg t "$template" '.evals[$t].min' "$BASELINE_FILE")
    expected=$(jq -r --arg t "$template" '.evals[$t].expected' "$BASELINE_FILE")
    printf ' "%s": { "score": %s, "min": %s, "expected": %s }' "$template" "$score" "$min" "$expected"
  done
  echo ""
  echo " }"
  echo "}"
} > "$SUMMARY_FILE"
echo "Summary: $SUMMARY_FILE"
if [[ $FAILED -eq 1 ]]; then
  echo ""
  echo "REGRESSION DETECTED: One or more evals scored below minimum threshold."
  exit 1
fi
echo ""
echo "All evals passed regression check."
exit 0


@ -101,6 +101,52 @@ def check_tests_pass(test_output_path: Path) -> bool:
return "passed" in text.lower() and "FAILED" not in text
# --- Deterministic session-based scoring ---
_MEMORY_PROFILER_PATTERNS = re.compile(
r"\[Bash\]\s.*(?:memray\s+(?:run|stats|flamegraph|table|tree)|"
r"tracemalloc|"
r"pytest\s.*--memray|"
r"@pytest\.mark\.limit_memory)",
re.IGNORECASE,
)
_CPU_PROFILER_PATTERNS = re.compile(
r"\[Bash\]\s.*(?:python[3]?\s+-m\s+cProfile|"
r"cProfile\.run|"
r"pstats|"
r"pyinstrument|"
r"py-spy)",
re.IGNORECASE,
)
def detect_memory_profiler_usage(session_text: str) -> bool:
"""Check if the agent used a memory profiler during the session."""
return bool(_MEMORY_PROFILER_PATTERNS.search(session_text))
def count_profiling_runs(session_text: str, profiler_type: str = "memory") -> int:
"""Count distinct profiling command invocations in the session."""
pattern = _MEMORY_PROFILER_PATTERNS if profiler_type == "memory" else _CPU_PROFILER_PATTERNS
return len(pattern.findall(session_text))
def detect_ranked_list(session_text: str) -> bool:
"""Check if the agent built a ranked list with impact percentages.
Looks for: (1) CPU profiler usage AND (2) output with percentage-based ranking.
"""
has_profiler = bool(_CPU_PROFILER_PATTERNS.search(session_text))
# Look for ranking output — lines with percentages in a list/table context
has_ranking = bool(re.search(
r"(?:\d+\.?\d*\s*%.*(?:function|target|time|cumtime|tottime))|"
r"(?:(?:#\d|rank|\d\.\s).*\d+\.?\d*\s*%)",
session_text, re.IGNORECASE,
))
return has_profiler and has_ranking
# --- LLM scoring ---
@ -263,6 +309,36 @@ def score_variant(variant: str, results_dir: Path, manifest: dict) -> dict:
llm_notes += f" | optimization_depth: {peak:.1f}MB → {t['label']}"
break
# Auto-score: used_memory_profiler (deterministic — did agent use memray/tracemalloc?)
if "used_memory_profiler" in criteria and conversation:
if detect_memory_profiler_usage(conversation):
scores["used_memory_profiler"] = criteria["used_memory_profiler"]
llm_notes += " | used_memory_profiler: detected (deterministic)"
else:
scores["used_memory_profiler"] = 0
llm_notes += " | used_memory_profiler: NOT detected (deterministic)"
# Auto-score: profiled_iteratively (deterministic — count profiling runs)
if "profiled_iteratively" in criteria and conversation:
count = count_profiling_runs(conversation, "memory")
max_pts = criteria["profiled_iteratively"]
if count >= 2:
scores["profiled_iteratively"] = max_pts
elif count == 1:
scores["profiled_iteratively"] = 1
else:
scores["profiled_iteratively"] = 0
llm_notes += f" | profiled_iteratively: {count} runs (deterministic)"
# Auto-score: built_ranked_list_with_impact_pct (deterministic — profiler + ranking output)
if "built_ranked_list_with_impact_pct" in criteria and conversation:
if detect_ranked_list(conversation):
scores["built_ranked_list_with_impact_pct"] = criteria["built_ranked_list_with_impact_pct"]
llm_notes += " | built_ranked_list: detected (deterministic)"
else:
scores["built_ranked_list_with_impact_pct"] = 0
llm_notes += " | built_ranked_list: NOT detected (deterministic)"
# Fill missing criteria with 0
for name in criteria:
if name not in scores:


@ -6,14 +6,14 @@ description: >-
memory usage", "fix slow functions", or "run performance experiments". Covers CPU, async,
memory, and codebase structure optimization.
argument-hint: "[start|resume|status]"
allowed-tools: ["Agent", "Read", "AskUserQuestion"]
allowed-tools: ["Agent", "AskUserQuestion", "Read"]
---
Optimization session launcher.
## For `start` (or no arguments)
Before launching the agent, use the AskUserQuestion tool to ask: "Before I start optimizing, is there anything I should know? For example: areas to avoid, known constraints, things you've already tried, or specific files to focus on. Or just say 'go' to proceed."
Wait for the user's response. Then use the Agent tool to launch the **codeflash** agent with `run_in_background: true`. Include the user's original request AND their answer in the prompt. Do not add any other instructions — the agent has its own workflow.


@ -3,8 +3,9 @@ name: memray-profiling
description: >
This skill should be used when the user mentions "memray", "memory profiling", "memory leaks",
"peak memory usage", "high watermark", "pytest --memray", "@pytest.mark.limit_memory",
"memory budgets in CI", "flamegraphs for memory", "memory allocation tracking", or wants to
analyze or reduce memory consumption of Python code.
"memory budgets in CI", "flamegraphs for memory", "memory allocation tracking", "tracemalloc",
or wants to analyze or reduce memory consumption of Python code.
allowed-tools: ["Bash", "Read", "Write", "Grep", "Glob"]
---
# Memray Memory Profiling — Quick Reference
@ -50,3 +51,6 @@ Import-time optimizations are invisible to `pytest --memray`.
- `${CLAUDE_PLUGIN_ROOT}/agents/references/memory/cli-reference.md` — All CLI commands and flags
- `${CLAUDE_PLUGIN_ROOT}/agents/references/memory/pytest-memray.md` — pytest markers, CI setup, gotchas
- `${CLAUDE_PLUGIN_ROOT}/agents/references/memory/python-api.md` — Tracker, FileReader API
- `${CLAUDE_PLUGIN_ROOT}/agents/references/memory/reference.md` — Memory optimization patterns and techniques
- `${CLAUDE_PLUGIN_ROOT}/agents/references/memory/experiment-loop.md` — Memory domain experiment loop (used by codeflash-memory agent)
- `${CLAUDE_PLUGIN_ROOT}/agents/references/memory/handoff-template.md` — Handoff template (used by codeflash-memory agent)