Remove .codeflash/ from ruff extend-exclude, add per-file ignores
for .codeflash/, scripts/, evals/, and plugin/ (benchmark/script
patterns like print, eval, magic values). Remove shebangs. Widen
pre-commit hooks to check the full repo.
* Add Unstructured engagement report as uv workspace member
Three-tier Plotly Dash app (Executive Brief, Engineering Team, Full
Detail) with data in JSON, theme constants in theme.py, and Dash
production improvements (Google Fonts, clientside callbacks, meta tags).
Also: add .playwright-mcp/ to .gitignore, add reports/* ruff overrides,
remove tracked .codeflash/observability/read-tracker.
* Rewrite statusline to derive context from git state
Detects active area from changed files (reports, packages, plugin,
.codeflash, case-studies, evals), falls back to branch name convention
(perf/*, feat/*, fix/*), shows dirty indicator. Uses whoami for
cross-platform user detection.
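
A minimal sketch of the detection order described above, written in Python for illustration only; the real statusline script may be implemented differently. The function name `detect_context`, the prefix-to-label mapping, and the `*` dirty marker are assumptions, not taken from the repo.

```python
from __future__ import annotations

# Hypothetical mapping from top-level path prefix to area label.
AREA_PREFIXES = {
    "reports/": "reports",
    "packages/": "packages",
    "plugin/": "plugin",
    ".codeflash/": ".codeflash",
    "case-studies/": "case-studies",
    "evals/": "evals",
}

# Branch name conventions listed in the commit message.
BRANCH_PREFIXES = ("perf/", "feat/", "fix/")


def detect_context(changed_files: list[str], branch: str) -> str:
    """Pick a label from changed files first, then fall back to the branch name."""
    label = None
    for path in changed_files:
        for prefix, area in AREA_PREFIXES.items():
            if path.startswith(prefix):
                label = area
                break
        if label is not None:
            break
    if label is None:
        for prefix in BRANCH_PREFIXES:
            if branch.startswith(prefix):
                label = prefix.rstrip("/")
                break
    if label is None:
        label = branch  # no convention matched: show the raw branch name
    dirty = "*" if changed_files else ""  # assumed dirty indicator
    return f"{label}{dirty}"
```

In this sketch the first changed file that matches a known area wins; a real implementation might instead pick the area with the most changed files.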
* Add pre-push lint rule to commit guidelines
* Exclude .codeflash/ from ruff linting
Benchmark and profiling scripts in .codeflash/ are scratch work, not
package source. Excluding them prevents CI failures from ad-hoc scripts.
* Run ruff format across packages, scripts, evals, and plugin refs
* Fix github-app async test failures in CI
Add asyncio_mode = "auto" to root pytest config so async tests
are detected when running from the repo root via uv run pytest packages/.
* feat: git memory, guard command, stuck recovery, batched setup
- Add git history review as step 1 of every experiment loop iteration
(read git log + diff to learn from past experiments and detect patterns)
- Add guard command as formal regression safety net (step 10) with
revert-and-rework protocol (max 2 attempts before discard)
- Add stuck state recovery protocol (5+ consecutive discards triggers
re-read of all files, results log analysis, goal re-check, combination
of past successes, and opposite-strategy attempts)
- Add batched setup questions rule to orchestrator (max 4 questions per
message, never one-at-a-time across round-trips)
- Update decision tree to include guard check before keep/discard
- Update orchestrator workflow to configure guard during session init
Inspired by patterns from uditgoenka/autoresearch.
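
The guard and stuck-recovery rules above can be sketched roughly as follows. This is an illustration, not the agents' actual prompt logic: `decide`, `is_stuck`, and the callback parameters are hypothetical names, and "max 2 attempts" is read here as two guard runs in total.

```python
from __future__ import annotations

from typing import Callable

MAX_GUARD_ATTEMPTS = 2  # "max 2 attempts before discard" (assumed: two guard runs)
STUCK_THRESHOLD = 5     # "5+ consecutive discards"


def decide(improved: bool,
           run_guard: Callable[[], bool],
           rework: Callable[[], None]) -> str:
    """Return "KEEP" or "DISCARD" for a single experiment."""
    if not improved:
        return "DISCARD"
    for attempt in range(MAX_GUARD_ATTEMPTS):
        if run_guard():
            return "KEEP"
        if attempt < MAX_GUARD_ATTEMPTS - 1:
            rework()  # revert the change and try a reworked version
    return "DISCARD"


def is_stuck(history: list[str]) -> bool:
    """True when the last STUCK_THRESHOLD results are all discards."""
    recent = history[-STUCK_THRESHOLD:]
    return len(recent) == STUCK_THRESHOLD and all(r == "DISCARD" for r in recent)
```

When `is_stuck` returns true, the loop would switch to the recovery steps listed above (re-read files, analyze the results log, re-check the goal, and so on) instead of starting another ordinary iteration.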
* feat: add git history, guard, and config audit steps to cpu/memory/structure agents
Align the experiment loops in the cpu, memory, and structure agents with the async agent.
Each now includes step 1 (review git history), guard command
(revert+rework on failure), and config audit after KEEP with
domain-specific guidance.
* feat: add CI plugin validation workflow (#3)
Uses claude-code-action with plugin-dev plugin to validate plugin
structure, agent consistency, eval manifests, and skills on every PR.
Includes @claude mention support for interactive fixes.
* fix: correct plugin marketplace name for CI validation
plugin-dev is in claude-plugins-official, not claude-code-plugins.
Also adds plugin_marketplaces URL for discovery.
* fix: expand allowed tools for validation workflow
Add gh pr comment, gh api, cat, python3, jq to allowed tools so
Claude can post PR summary comments and subagents can function.
* fix: enable track_progress and show_full_output for debugging
* fix: remove colons from Bash glob patterns in validate allowedTools
The gh command patterns used colons (e.g. Bash(gh pr diff:*)) which
are treated as literal characters, so they never matched actual
commands like `gh pr diff 2 --name-only`. This caused 1 permission
denial per CI run and prevented the summary comment from posting.
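
The mismatch can be reproduced with Python's `fnmatch`, used here only as a stand-in for the workflow's own pattern matcher (an assumption; the real matcher may differ in detail). The replacement pattern `gh pr diff *` is likewise assumed, since the exact new pattern is not quoted above.

```python
from fnmatch import fnmatchcase

command = "gh pr diff 2 --name-only"

old_pattern = "gh pr diff:*"  # the colon must appear literally in the command
new_pattern = "gh pr diff *"  # assumed form after removing the colon

old_matches = fnmatchcase(command, old_pattern)
new_matches = fnmatchcase(command, new_pattern)

print(old_matches, new_matches)  # False True
```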
* fix: fail CI job when validation finds issues
Add verdict step that writes PASS/FAIL to a file, with a follow-up
workflow step that exits 1 on FAIL. Previously validation reported
issues in comments but the job always succeeded.
* fix: remove double-commit contradiction in async agent and shared base
Step 4/5/6 (Implement) said "and commit" but the commit-after-KEEP
step later says "Do NOT commit discards." This meant discarded
experiments were still committed. CPU/memory/structure agents already
had it right — only commit at the KEEP step.
* fix: treat validation warnings as blocking failures
Warnings were previously non-blocking — the verdict step only
checked for "issues that need fixing." Now any warning also
triggers FAIL.
* fix: address plugin-validator warnings
- Declare context7 MCP server in plugin.json (domain agents use it)
- Change codeflash-setup color from green to red (collision with router)
- Add memory: project and example block to codeflash-setup frontmatter
* feat: add eval regression testing (Layer 3)
- baseline-scores.json: checked-in expected scores + min thresholds
for ranking, memory-hard, memory-misdirection
- check-regression.sh: orchestrator that runs evals, scores them, and
compares to baselines (exits 1 if any score < min)
- eval-regression.yml: on-demand CI workflow (workflow_dispatch) with
Bedrock OIDC, artifact upload, and job summary table
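
A rough sketch of the comparison step, in Python rather than shell for readability. The JSON layout (`{"<eval>": {"expected": ..., "min": ...}}`), the function name `find_regressions`, and the example numbers are assumptions; the real baseline-scores.json schema is not shown here. Treating a missing score as a failure is also an assumption.

```python
from __future__ import annotations


def find_regressions(baselines: dict[str, dict[str, float]],
                     scores: dict[str, float]) -> list[str]:
    """Return the names of evals whose score is missing or below its minimum."""
    failures = []
    for name, entry in baselines.items():
        score = scores.get(name)
        if score is None or score < entry["min"]:
            failures.append(name)
    return failures


# Made-up example data, for illustration only.
example_baselines = {
    "ranking": {"expected": 8, "min": 6},
    "memory-hard": {"expected": 8, "min": 6},
    "memory-misdirection": {"expected": 8, "min": 6},
}
example_scores = {"ranking": 8, "memory-hard": 5, "memory-misdirection": 6}

failed = find_regressions(example_baselines, example_scores)
exit_code = 1 if failed else 0  # mirrors "exits 1 if any score < min"
```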
* fix: address remaining plugin-validator warnings
- Add memory: project to codeflash-memory.md (matches other agents)
- Add allowed-tools to memray-profiling skill
- Fix blank line in codeflash-setup.md frontmatter
- Add git log -20 --stat to all 4 domain agents' git history step
- Add step numbering note to async reference experiment-loop.md
* feat: add deterministic scoring for profiler and ranking criteria
Add session-text-based auto-scoring that overrides LLM grades for
mechanically verifiable criteria:
- used_memory_profiler: grep for memray/tracemalloc Bash commands
- profiled_iteratively: count distinct profiling runs (1=1pt, 2+=full)
- built_ranked_list_with_impact_pct: detect cProfile + ranking output
These anchor 2-4 points per eval deterministically, reducing LLM
variance. Baseline thresholds tightened from min=6 to min=7.
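
As an illustration of the `profiled_iteratively` rule, a sketch under stated assumptions: the regex, the value of `FULL_POINTS`, and the idea that "distinct" means distinct command lines are all hypothetical and not taken from the scorer itself.

```python
import re

# Hypothetical pattern for profiler invocations in the session text.
PROFILER_RUN = re.compile(r"\b(memray\s+run|tracemalloc)\b")

FULL_POINTS = 2  # assumed value of "full" credit


def score_profiled_iteratively(session_text: str) -> int:
    """Score 0 for no profiling runs, 1 for one, FULL_POINTS for two or more."""
    runs = {
        line.strip()
        for line in session_text.splitlines()
        if PROFILER_RUN.search(line)
    }
    if len(runs) >= 2:
        return FULL_POINTS
    return len(runs)
```

Because the score depends only on the session text, repeated grading of the same session gives the same result, which is the variance reduction this commit is after.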
* fix: parse verdict from PR comment instead of temp file
Claude writes "Verdict: FAIL/PASS" in the PR comment but doesn't
execute the python3 file-write command. The check step now reads
the claude[bot] comment via gh api and greps for the verdict line.
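
The check itself uses gh api and grep; the parsing part can be sketched in Python like this. The regex, the tolerance for Markdown emphasis around the verdict, and the choice to fail when no verdict line is found are assumptions made for the example.

```python
import re
from typing import Optional

# Matches a line such as "Verdict: FAIL" or "**Verdict: PASS**".
VERDICT_RE = re.compile(r"^\W*Verdict:\W*(PASS|FAIL)\b", re.MULTILINE)


def parse_verdict(comment_body: str) -> Optional[str]:
    """Return "PASS" or "FAIL" from the first verdict line, or None if absent."""
    match = VERDICT_RE.search(comment_body)
    return match.group(1) if match else None


def exit_code(comment_body: str) -> int:
    """0 only on an explicit PASS; a missing verdict is treated as failure (assumed)."""
    return 0 if parse_verdict(comment_body) == "PASS" else 1
```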
* fix: address all remaining validator findings for PR #2
- Remove non-standard fields (repository, license, keywords) from plugin.json
- Pin context7 MCP to @2.1.4 instead of @latest
- Add context7 fallback note in router agent
- Remove unused Read from codeflash-optimize allowed-tools
- Make AskUserQuestion usage explicit in skill body
- Add tracemalloc to memray-profiling trigger phrases
- List all reference files in memray-profiling skill
* fix: wire stuck state recovery into domain agents, fix skill allowed-tools
- Add 5+ consecutive discard stuck recovery protocol to all 4 domain
agents (cpu, memory, async, structure) inline experiment loops,
matching the shared base definition
- Add Read and Bash to codeflash-optimize skill allowed-tools so it
can inspect session state before delegating
- Add Write to memray-profiling skill allowed-tools so it can create
profiling harness scripts
* fix: sync versions, add missing tools to router and skills
- Sync marketplace.json metadata version to 0.1.0 to match plugin.json
- Add Write and Edit to codeflash.md router tools (needed for
writing .codeflash/conventions.md during setup)
- Remove Bash from codeflash-optimize skill (least-privilege: all
execution is delegated via Agent)
- Add Grep and Glob to memray-profiling skill (needed for finding
test files, capture outputs, and config files)
* fix: restore plugin.json fields, revert marketplace version, relax validator verdict
- Restore repository, license, keywords in plugin.json (accidentally
removed when mcpServers was added)
- Revert marketplace.json metadata.version to 1.0.0 (collection
version, distinct from individual plugin version 0.1.0)
- Change validator verdict rule: only FAIL on major issues, not
warnings — prevents the LLM validator from blocking on subjective
minor findings each run
- Shallow clone (--no-checkout --depth 1 + fetch specific commit) instead
of full clone — 15s vs 2+ min for large repos like codeflash-internal
- Cache clone in evals/repos/<name>/workspace/, cp -r for each run
- Use gh repo clone for private repo auth
- Fix eval prompt to skip skill's AskUserQuestion step in non-interactive mode
- Gitignore workspace/ dirs
- Update intro.md with v2 eval docs
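
The shallow-clone sequence could look roughly like the commands built below. This is a sketch, not the contents of run-eval.sh: the helper name `shallow_clone_commands` and its arguments are hypothetical, and the exact flags used by the script may differ. It assumes the remote allows fetching a commit by SHA, as GitHub does.

```python
from __future__ import annotations


def shallow_clone_commands(repo: str, commit: str, dest: str) -> list[list[str]]:
    """Build the commands for a depth-1 clone pinned to one commit."""
    return [
        # gh handles auth for private repos; flags after "--" go to git clone.
        ["gh", "repo", "clone", repo, dest, "--", "--no-checkout", "--depth", "1"],
        # Fetch only the commit the eval is pinned to.
        ["git", "-C", dest, "fetch", "--depth", "1", "origin", commit],
        # Check it out as a detached HEAD.
        ["git", "-C", dest, "checkout", commit],
    ]
```

Each command list can be passed to `subprocess.run(cmd, check=True)`; the resulting directory is what would be cached and copied for each run.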
Add support for v2 evals that clone a real repo at a specific commit
instead of using bundled template source. The agent handles setup,
diagnosis, and fixing on its own.
- run-eval.sh: v1/v2 dispatch, repos/ directory, prompt from manifest
- First v2 eval: codeflash-internal psycopg serialization (PR #2489)
- EVAL-V2-SKETCH.md: design doc for the v2 eval system
- intro.md: repo onboarding guide