# Experiment Loop — Shared Base
Each domain's `experiment-loop.md` extends this base with domain-specific reasoning checklists, metrics, thresholds, and logging schemas. Read the domain file first — it will reference this for the common framework.
## The Loop
LOOP (until plateau detected or user requests stop):
**Print a status line before each step** so the user can follow progress (see Progress Updates in the agent prompt).
1. **Review git history.** Before choosing a target, read recent experiment history to learn from past attempts:
```bash
git log --oneline -20 # experiment sequence — what was tried
git diff HEAD~1 # why the last change worked (or didn't)
git log -20 --stat # which files drive improvements
```
Look for patterns: if 3+ commits that improved the metric all touched the same file or area, focus there. If a specific approach failed 3+ times, avoid it. If a successful commit used a technique (e.g., "replaced list with set"), look for similar opportunities elsewhere.
2. **Choose target.** Pick the next candidate from the ranked bottleneck list (see Bottleneck Ranking in the agent prompt), informed by patterns from step 1. Print `[experiment N] Target: <description> (<category>, <est. impact>)`. If the list is empty or stale (after a re-rank), rebuild it from profiling data (see domain file for sources).
3. **Reasoning checklist.** Answer all questions from the domain file. If any answer is unknown, research further before proceeding.
4. **Capture original output.** Before changing anything, run the target function with representative inputs and save its output. This is your correctness oracle — the optimized version must produce identical results.
5. **Micro-benchmark** (when applicable). Print `[experiment N] Micro-benchmarking...` followed by the result.
6. **Implement.** Print `[experiment N] Implementing: <one-line summary of change>`.
7. **Verify benchmark fidelity.** Re-read the benchmark and confirm it exercises the exact code path and parameters you changed. If you modified function arguments, wrapper flags, pool sizes, or configuration, the benchmark must use the same values. If the benchmark was written before step 6, the implementation may have changed assumptions — update the benchmark to match. A benchmark that doesn't mirror the production change proves nothing.
8. **Verify output equivalence.** Run the optimized version with the same inputs from step 4 and compare outputs. If outputs differ, **discard immediately** — this is a correctness regression, not an optimization. Do not proceed to benchmarking.
9. **Benchmark.** Run the target test. Print `[experiment N] Benchmarking...`. Always run for correctness, even for micro-only optimizations.
10. **Guard** (if configured). Run the guard command (see Guard Command below). If the guard fails, the optimization broke something — revert and rework (max 2 attempts), then discard if still failing.
11. **Read results.** Note pass/fail and metrics. Print the domain-specific result line (see domain file).
12. If the run crashed or the metric regressed, fix or discard immediately.
13. **Confirm small deltas.** If the improvement is below the domain's noise threshold, re-run to confirm it is not noise.
14. **Record** in `.codeflash/results.tsv` (schema in domain file).
15. **Keep/discard** (see decision tree in domain file). Print `[experiment N] KEEP` or `[experiment N] DISCARD — <reason>`.
16. **E2E benchmark** (after KEEP, when available). If `codeflash compare` is available (see `e2e-benchmarks.md`), run `$RUNNER -m codeflash compare <pre-opt-sha> HEAD` to get authoritative isolated measurements. Record e2e results alongside micro-bench results in `results.tsv`. If e2e contradicts micro-bench (e.g., micro showed 15% but e2e shows < 2%), re-evaluate the keep decision — trust the e2e measurement. Print `[experiment N] E2E: <base>ms → <head>ms (<speedup>x)`.
17. **Config audit** (after KEEP). Check for related configuration flags that may have become dead or inconsistent after your change. Infrastructure changes (drivers, pools, middleware) often leave behind no-op config. Remove or update stale flags.
18. **Milestones** (every 3-5 keeps): Run full benchmark (including `codeflash compare <baseline-sha> HEAD` for cumulative e2e measurement), create milestone branch. Print `[milestone] vN — <total kept>/<total experiments>, cumulative <metric>` .
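The correctness oracle of steps 4 and 8 can be sketched as a pair of helpers. This is a minimal illustration, not part of any shipped tooling: `capture_baseline`, `check_equivalence`, and the pickle path are hypothetical names, and the real loop may compare outputs however the domain requires.

```python
# Hypothetical sketch of steps 4 and 8: persist the original outputs
# before editing, then verify the optimized version reproduces them.
import pickle

def capture_baseline(target_fn, representative_inputs,
                     path=".codeflash/baseline.pkl"):
    """Step 4: run the unmodified function and save its outputs."""
    outputs = [target_fn(*args) for args in representative_inputs]
    with open(path, "wb") as f:
        pickle.dump(outputs, f)
    return outputs

def check_equivalence(target_fn, representative_inputs,
                      path=".codeflash/baseline.pkl"):
    """Step 8: compare the optimized function against the saved oracle.

    A False result means a correctness regression: discard immediately,
    do not proceed to benchmarking.
    """
    with open(path, "rb") as f:
        expected = pickle.load(f)
    actual = [target_fn(*args) for args in representative_inputs]
    return actual == expected
```

Pickling keeps the oracle intact across the edit; any comparison that survives the round-trip (deep equality, tolerance-based float checks) works equally well.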
## Keep/Discard Decision Tree — Common Structure
```
Output matches original?
+-- NO -> DISCARD immediately (correctness regression)
+-- YES -> Test passed?
+-- NO -> Fix or discard immediately
+-- YES -> Guard passed? (skip if no guard configured)
+-- NO -> Revert, rework optimization (max 2 attempts)
| +-- Still fails -> DISCARD
+-- YES -> Primary metric improved?
+-- YES (>= domain threshold) -> KEEP
+-- YES (< domain threshold) -> Re-run to confirm not noise
| +-- Confirmed -> KEEP
| +-- Noise -> DISCARD
+-- Micro-bench only improved (>= domain micro threshold) -> KEEP (if on confirmed hot path)
+-- NO -> DISCARD
```
Domain files specify the exact thresholds and any additional branches.
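The common structure of the tree can be expressed as a single predicate. This is a hedged sketch only: the parameter names, the `guard_passed=None` convention for "no guard configured", and all defaults are hypothetical, and the actual thresholds come from each domain file.

```python
# Illustrative encoding of the keep/discard tree above.
# All names and defaults are placeholders, not real plugin API.

def keep_decision(output_matches, test_passed, guard_passed, delta,
                  threshold, micro_delta=0.0, micro_threshold=float("inf"),
                  on_hot_path=False, rerun_confirms=None):
    """Return "KEEP" or "DISCARD" following the common decision tree.

    guard_passed=None means no guard is configured (branch skipped);
    guard_passed=False means the guard still failed after the max
    2 rework attempts.
    """
    if not output_matches:
        return "DISCARD"      # correctness regression: discard immediately
    if not test_passed:
        return "DISCARD"      # fix or discard immediately
    if guard_passed is False:
        return "DISCARD"      # revert-and-rework already exhausted
    if delta >= threshold:
        return "KEEP"         # primary metric improved past threshold
    if delta > 0:
        # below the noise threshold: a confirming re-run decides
        return "KEEP" if rerun_confirms else "DISCARD"
    if micro_delta >= micro_threshold and on_hot_path:
        return "KEEP"         # micro-bench-only win on a confirmed hot path
    return "DISCARD"
```

Keeping the branches in this order mirrors the tree: correctness gates come before any performance judgment, so a fast-but-wrong change can never reach KEEP.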
- Add step numbering note to async reference experiment-loop.md
* feat: add deterministic scoring for profiler and ranking criteria
Add session-text-based auto-scoring that overrides LLM grades for
mechanically verifiable criteria:
- used_memory_profiler: grep for memray/tracemalloc Bash commands
- profiled_iteratively: count distinct profiling runs (1=1pt, 2+=full)
- built_ranked_list_with_impact_pct: detect cProfile + ranking output
These anchor 2-4 points per eval deterministically, reducing LLM
variance. Baseline thresholds tightened from min=6 to min=7.
* feat: add deterministic scoring for profiler and ranking criteria
Add session-text-based auto-scoring that overrides LLM grades for
mechanically verifiable criteria:
- used_memory_profiler: grep for memray/tracemalloc Bash commands
- profiled_iteratively: count distinct profiling runs (1=1pt, 2+=full)
- built_ranked_list_with_impact_pct: detect cProfile + ranking output
These anchor 2-4 points per eval deterministically, reducing LLM
variance. Baseline thresholds tightened from min=6 to min=7.
* fix: parse verdict from PR comment instead of temp file
Claude writes "Verdict: FAIL/PASS" in the PR comment but doesn't
execute the python3 file-write command. The check step now reads
the claude[bot] comment via gh api and greps for the verdict line.
* fix: address all remaining validator findings for PR #2
- Remove non-standard fields (repository, license, keywords) from plugin.json
- Pin context7 MCP to @2.1.4 instead of @latest
- Add context7 fallback note in router agent
- Remove unused Read from codeflash-optimize allowed-tools
- Make AskUserQuestion usage explicit in skill body
- Add tracemalloc to memray-profiling trigger phrases
- List all reference files in memray-profiling skill
* fix: address all remaining validator findings for PR #2
- Remove non-standard fields (repository, license, keywords) from plugin.json
- Pin context7 MCP to @2.1.4 instead of @latest
- Add context7 fallback note in router agent
- Remove unused Read from codeflash-optimize allowed-tools
- Make AskUserQuestion usage explicit in skill body
- Add tracemalloc to memray-profiling trigger phrases
- List all reference files in memray-profiling skill
* fix: wire stuck state recovery into domain agents, fix skill allowed-tools
- Add 5+ consecutive discard stuck recovery protocol to all 4 domain
agents (cpu, memory, async, structure) inline experiment loops,
matching the shared base definition
- Add Read and Bash to codeflash-optimize skill allowed-tools so it
can inspect session state before delegating
- Add Write to memray-profiling skill allowed-tools so it can create
profiling harness scripts
* fix: wire stuck state recovery into domain agents, fix skill allowed-tools
- Add 5+ consecutive discard stuck recovery protocol to all 4 domain
agents (cpu, memory, async, structure) inline experiment loops,
matching the shared base definition
- Add Read and Bash to codeflash-optimize skill allowed-tools so it
can inspect session state before delegating
- Add Write to memray-profiling skill allowed-tools so it can create
profiling harness scripts
* fix: sync versions, add missing tools to router and skills
- Sync marketplace.json metadata version to 0.1.0 to match plugin.json
- Add Write and Edit to codeflash.md router tools (needed for
writing .codeflash/conventions.md during setup)
- Remove Bash from codeflash-optimize skill (least-privilege: all
execution is delegated via Agent)
- Add Grep and Glob to memray-profiling skill (needed for finding
test files, capture outputs, and config files)
* fix: sync versions, add missing tools to router and skills
- Sync marketplace.json metadata version to 0.1.0 to match plugin.json
- Add Write and Edit to codeflash.md router tools (needed for
writing .codeflash/conventions.md during setup)
- Remove Bash from codeflash-optimize skill (least-privilege: all
execution is delegated via Agent)
- Add Grep and Glob to memray-profiling skill (needed for finding
test files, capture outputs, and config files)
* fix: restore plugin.json fields, revert marketplace version, relax validator verdict
- Restore repository, license, keywords in plugin.json (accidentally
removed when mcpServers was added)
- Revert marketplace.json metadata.version to 1.0.0 (collection
version, distinct from individual plugin version 0.1.0)
- Change validator verdict rule: only FAIL on major issues, not
warnings — prevents the LLM validator from blocking on subjective
minor findings each run
* fix: restore plugin.json fields, revert marketplace version, relax validator verdict
- Restore repository, license, keywords in plugin.json (accidentally
removed when mcpServers was added)
- Revert marketplace.json metadata.version to 1.0.0 (collection
version, distinct from individual plugin version 0.1.0)
- Change validator verdict rule: only FAIL on major issues, not
warnings — prevents the LLM validator from blocking on subjective
minor findings each run
2026-03-27 12:43:14 +00:00
## Guard Command
An optional secondary verification that must always pass — a regression safety net. The guard prevents optimizing one metric while silently breaking another.
**Setup:** During session initialization, ask the user if there's a command that must always pass (e.g., `pytest tests/`, `mypy .`, `npm run typecheck`). Store it in `.codeflash/conventions.md` under `## Guard`. If no guard is specified, skip step 10 in the loop.
**Rules:**
- The guard runs AFTER benchmarking (step 10), not before — don't waste time guarding a change that didn't even improve the metric.
- If the metric improved but the guard fails: revert the change, rework the optimization to not break the guard, and re-run (max 2 attempts). If it still fails after 2 rework attempts, DISCARD.
- NEVER modify guard/test files to make the guard pass. Always adapt the implementation instead.
- Record guard status in results.tsv: add `guard_pass` or `guard_fail` to the status column.
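The guard step above can be sketched as follows. This is a minimal illustration, assuming the guard command is stored as a single fenced line under `## Guard` in conventions.md; the parsing details and the default path are assumptions, not a fixed format.

```python
import subprocess
from pathlib import Path

def read_guard_command(conventions=".codeflash/conventions.md"):
    """Return the command stored under '## Guard', or None if no guard is set."""
    lines = Path(conventions).read_text().splitlines()
    try:
        start = lines.index("## Guard") + 1
    except ValueError:
        return None  # no guard section: skip step 10
    for line in lines[start:]:
        stripped = line.strip().strip("`")
        if stripped.startswith("## "):
            break  # reached the next section without finding a command
        if stripped:
            return stripped
    return None

def run_guard(cmd):
    """Run the guard command; return the status token for results.tsv."""
    result = subprocess.run(cmd, shell=True)
    return "guard_pass" if result.returncode == 0 else "guard_fail"
```

A `guard_fail` return triggers the revert-and-rework protocol above; the string itself goes into the results.tsv status column.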
## Strategy Rotation
If 3+ consecutive discards on the same type of optimization, switch strategy. Domain files list the rotation order.
## Plateau Detection & Stuck State Recovery
**Universal checks** (run after every experiment): See Stopping Criteria in the agent prompt — diminishing returns, user target reached, cumulative stall. If any fires, stop.
**Domain-specific**: After 3+ consecutive discards across all strategies, check whether the remaining candidates are non-optimizable (see domain file for criteria). If the top 3 candidates are all non-optimizable, **stop and report to user** with what's left and why.
### Stuck State Recovery
If 5+ consecutive discards (across all strategy rotations), trigger this recovery protocol before giving up:
1. **Re-read all in-scope files from scratch.** Your mental model may have drifted — re-read the actual code, not your cached understanding.
2. **Re-read the full results log** (`.codeflash/results.tsv`). Look for patterns:
- Which files/functions appeared in successful experiments? Focus there.
- Which techniques worked? Try variants of those techniques on new targets.
- Which approaches failed repeatedly? Explicitly avoid them.
3. **Re-read the original goal.** Has the focus drifted from what the user asked for?
4. **Try combining 2-3 previously successful changes** that might compound (e.g., a data structure change + an algorithm change in the same hot path).
5. **Try the opposite** of what hasn't worked. If fine-grained optimizations keep failing, try a coarser architectural change. If local changes keep failing, try a cross-function refactor.
6. **Check git history for hints**: `git log --oneline -20 --stat` — do successful commits cluster in specific files or patterns?
If recovery still produces no improvement after 3 more experiments, **stop and report** with a summary of what was tried and why the codebase appears to be at its optimization floor for this domain.
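The log scan in recovery step 2 can be sketched as a small helper. The column layout assumed here (experiment id, verdict, target, technique) is illustrative; adapt it to the actual `results.tsv` schema your domain file defines:

```python
from collections import Counter
from pathlib import Path

def scan_results(path=".codeflash/results.tsv"):
    """Tally KEEP/discard verdicts per (target, technique) pair.

    Assumes tab-separated rows whose first four columns are
    experiment id, verdict, target, technique — adjust as needed.
    """
    wins, losses = Counter(), Counter()
    for line in Path(path).read_text().splitlines():
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # skip malformed or header rows
        _, verdict, target, technique = fields[:4]
        bucket = wins if verdict.strip().upper() == "KEEP" else losses
        bucket[(target, technique)] += 1
    return wins, losses
```

Targets with repeated wins are candidates for step 4's combinations; pairs with two or more discards belong on the explicit avoid list.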
## Cross-Domain Escalation
During profiling or experimentation, you may discover the real bottleneck is in a different domain than the one you're optimizing. Watch for these signals:
| You are | Signal | Likely domain |
|---------|--------|---------------|
| CPU agent | Peak memory dominates runtime (GC pressure, swapping) | **Memory** |
| CPU agent | Hot function is `await`-heavy or serializes I/O | **Async** |
| Memory agent | Allocations are fast but algorithm is O(n^2) | **CPU** |
| Memory agent | Memory growth from connection/session accumulation | **Async** |
| Async agent | Individual coroutines are CPU-bound, not I/O-bound | **CPU** |
| Async agent | Coroutines hold large buffers that overlap at peak | **Memory** |
| Any agent | Import time or circular deps are the real bottleneck | **Structure** |
When you detect a cross-domain signal:
1. **Log it** in results.tsv: `experiment N | ESCALATE | <signal description> | suggests <domain>`
2. **Tell the user**: "I'm finding that the real bottleneck is [description] — this is a [domain] issue, not [current domain]. Want me to switch?"
3. **Write it in HANDOFF.md** so a resumed session picks it up.
Do NOT silently switch domains or attempt fixes outside your expertise.
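A minimal sketch of logging the ESCALATE row from step 1, assuming the pipe-delimited shape shown above (the real column schema is whatever each domain file defines; `log_escalation` is a hypothetical helper, not an existing API):

```python
from pathlib import Path

def log_escalation(n, signal, domain, path=".codeflash/results.tsv"):
    """Append an ESCALATE row so the results log and a resumed
    session both record the cross-domain signal."""
    row = f"experiment {n} | ESCALATE | {signal} | suggests {domain}"
    with Path(path).open("a") as fh:  # append-only: keep prior rows intact
        fh.write(row + "\n")
    return row
```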
## Session End — Learnings
When the session ends (plateau, user stop, or escalation), write `.codeflash/learnings.md` with insights that would help future sessions on this codebase. Append if the file already exists.
Format:
```markdown
## <date> — <domain> session on <branch>
### What worked
- <technique> on <target> gave <improvement> (e.g., "dict index for dedup gave 12x on process_records")
### What didn't work
- <technique> on <target> — <why> (e.g., "generator pipeline for parse_rows — overhead exceeded savings at n < 1000")
### Codebase insights
- <observation> (e.g., "ORM layer accounts for 60% of runtime — query optimization would have more impact than Python-level changes")
```
Keep entries concise. Future sessions read this file to avoid repeating failed approaches and to build on successful patterns.
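Writing the entry in the format above can be sketched as follows; `append_learnings` and its parameters are illustrative, not part of any defined tooling:

```python
from datetime import date
from pathlib import Path

def append_learnings(domain, branch, worked, failed, insights,
                     path=".codeflash/learnings.md"):
    """Append one session entry in the learnings format (hypothetical helper)."""
    lines = [f"## {date.today().isoformat()} — {domain} session on {branch}",
             "### What worked"] + [f"- {item}" for item in worked]
    lines += ["### What didn't work"] + [f"- {item}" for item in failed]
    lines += ["### Codebase insights"] + [f"- {item}" for item in insights]
    with Path(path).open("a") as fh:  # append: never clobber earlier sessions
        fh.write("\n".join(lines) + "\n\n")
```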