Commit graph

26 commits

Author SHA1 Message Date
Kevin Turcios
3b59d97647 squash 2026-04-13 14:12:17 -05:00
Kevin Turcios
cee3987d7b cleanup 2026-04-06 05:58:13 -05:00
Kevin Turcios
ebb9658dfd Merge main-teammate branch 2026-04-03 17:36:50 -05:00
Kevin Turcios
0cda0d907c fix: align marketplace version with plugin.json and recursive .DS_Store ignore
- marketplace.json metadata.version 1.0.0 → 0.1.0 to match plugin.json
- .gitignore .DS_Store → **/.DS_Store for nested directories
2026-03-27 11:39:34 -05:00
Kevin Turcios
7fab0082c0
Merge pull request #6 from codeflash-ai/feat/tool-configs
feat: improve skill and eval system
2026-03-27 11:31:30 -05:00
Kevin Turcios
37efa524d7 feat: improve skill, eval system, and tessl config
- Optimize codeflash-optimize SKILL.md (review score 17% → 98%, eval 87% → 100%)
  - Fix frontmatter (allowed-tools format, argument-hint under metadata)
  - Lead description with concrete actions, explicit agent launch parameters
- Add multi-run variance detection to eval system (--runs N flag)
  - score.py aggregate command: min/max/avg/stddev per criterion, flaky detection
  - check-regression.sh defaults to 3 runs for reliable regression detection
- Add per-criterion regression tracking to baseline-scores.json (v3)
  - Reports exactly which criteria regressed, not just total score drops
- Rename evals/ → codeflash-evals/ to avoid tessl directory conflicts
- Switch tessl to managed mode, gitignore vendored tiles and symlinks
2026-03-27 11:30:17 -05:00
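The multi-run aggregation described in this commit might look roughly like the sketch below. It is an illustration only, not the actual `score.py`: the function name `aggregate_runs`, the input shape (one dict of criterion scores per run), and the flaky rule (any criterion whose score differs between runs) are assumptions.

```python
from statistics import mean, pstdev


def aggregate_runs(runs):
    """Summarise per-criterion scores across several eval runs.

    `runs` is a list of dicts mapping criterion name -> numeric score.
    (Hypothetical input shape; the real score.py format may differ.)
    """
    summary = {}
    criteria = sorted({name for run in runs for name in run})
    for name in criteria:
        scores = [run[name] for run in runs if name in run]
        summary[name] = {
            "min": min(scores),
            "max": max(scores),
            "avg": mean(scores),
            "stddev": pstdev(scores),
            # Assumed rule: a criterion is flaky if its score varies at all.
            "flaky": min(scores) != max(scores),
        }
    return summary
```

With three runs, as `check-regression.sh` defaults to, a criterion that scores the same every time is reported as stable, while one that changes between runs is flagged for inspection.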
Kevin Turcios
999e08fb5e
Merge pull request #5 from codeflash-ai/fix/session-analysis-improvements
fix: session-analysis improvements from 89 real-world sessions
2026-03-27 10:17:44 -05:00
Kevin Turcios
61c393e7ed ci: add actions:read permission for CI status checks
The claude-code-action MCP server requires 'actions: read' to enable
CI status check functionality. Without it, the server is skipped with
a warning.
2026-03-27 10:16:16 -05:00
Kevin Turcios
24ffa83bbf merge: resolve conflicts with main (guard, git history, stuck recovery)
Merge origin/main which added guard commands, git history review step,
stuck state recovery, batched setup questions, and config audit steps.

Resolved 5 conflicts by keeping both:
- Our git-add-specific-files + pre-commit rules applied to the new
  renumbered commit steps (15 instead of 12, etc.)
- Upstream's Record, Config audit, Guard steps preserved
- Router keeps both AUTONOMOUS MODE and batch-questions rules
- Router start steps merged: our branch verification + multi-repo
  detection integrated into upstream's batched-questions flow
2026-03-27 10:15:10 -05:00
Kevin Turcios
ce02fdee29 fix: add .codeflash/ gitignore and session cleanup workflow
- Setup agent now ensures .codeflash/ is in .gitignore before writing
  session state files (prevents accidental commits of profiling artifacts)
- Router agent gets a Cleanup section: preserves learnings.md and
  results.tsv across sessions, deletes transient files (HANDOFF.md,
  setup.md, conventions.md, bench scripts), removes agent-memory dir
2026-03-27 10:09:51 -05:00
Kevin Turcios
e811d453f9 fix: address session-analysis findings from 89 unstructured_org sessions
Analyzed ~89 Claude Code sessions across 7 unstructured_org projects to
identify recurring failures and friction points, then applied fixes:

- Fix "ask then die" bug: skill now injects AUTONOMOUS MODE directive so
  domain agents work without interactive questions that kill the Agent tool
- Fix git add -A: all 4 domain agents now stage specific files instead of
  blindly staging everything (caused accidental commits of scratch files)
- Add pre-commit step: agents run pre-commit before every commit to catch
  linting failures before CI (ruff/undersort failures were recurring)
- Add measurement methodology lock: prevents changing profiling flags
  mid-experiment which created uninterpretable deltas
- Add branch state verification to router startup (prevents wrong-branch
  confusion that wasted multiple sessions)
- Add multi-repo detection to router (original work spanned 4 repos)
- Add library vs application awareness to memory agent (prevents wasting
  time on import-time optimizations in library projects)
- Add dependency resilience to setup agent (uv run --with isolation
  warning, private PyPI failure guidance)
- Add PR text quality guidelines (sessions showed AI-sounding text that
  required multiple user corrections)
- Add chart generation guidelines to pr-preparation.md
- Add context conservation rules (max 2 background tasks, use subagents)
- Add cross-session learnings template for .codeflash/learnings.md
- All domain agents now read learnings.md at startup
2026-03-27 10:08:50 -05:00
Kevin Turcios
d1f34cf794
feat: git memory, guard command, stuck recovery, batched setup (#2)
* feat: git memory, guard command, stuck recovery, batched setup

- Add git history review as step 1 of every experiment loop iteration
  (read git log + diff to learn from past experiments and detect patterns)
- Add guard command as formal regression safety net (step 10) with
  revert-and-rework protocol (max 2 attempts before discard)
- Add stuck state recovery protocol (5+ consecutive discards triggers
  re-read of all files, results log analysis, goal re-check, combination
  of past successes, and opposite-strategy attempts)
- Add batched setup questions rule to orchestrator (max 4 questions per
  message, never one-at-a-time across round-trips)
- Update decision tree to include guard check before keep/discard
- Update orchestrator workflow to configure guard during session init

Inspired by patterns from uditgoenka/autoresearch.
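The stuck-state trigger above (5+ consecutive discards) could be detected from a decision log along these lines. The list-of-strings log format and the names `trailing_discards` and `is_stuck` are hypothetical; only the threshold of 5 comes from the commit message.

```python
STUCK_THRESHOLD = 5  # from the commit message: 5+ consecutive discards


def trailing_discards(decisions):
    """Count how many of the most recent decisions are discards in a row."""
    count = 0
    for decision in reversed(decisions):
        if decision.upper() != "DISCARD":
            break
        count += 1
    return count


def is_stuck(decisions, threshold=STUCK_THRESHOLD):
    """True once the latest `threshold` decisions are all discards."""
    return trailing_discards(decisions) >= threshold
```

A single KEEP resets the count, so only an unbroken run of discards at the end of the log triggers recovery.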

* feat: add git history, guard, and config audit steps to cpu/memory/structure agents

Align experiment loops in all domain agents with async agent.
Each now includes step 1 (review git history), guard command
(revert+rework on failure), and config audit after KEEP with
domain-specific guidance.

* feat: add CI plugin validation workflow (#3)

Uses claude-code-action with plugin-dev plugin to validate plugin
structure, agent consistency, eval manifests, and skills on every PR.
Includes @claude mention support for interactive fixes.

* fix: correct plugin marketplace name for CI validation

plugin-dev is in claude-plugins-official, not claude-code-plugins.
Also adds plugin_marketplaces URL for discovery.

* fix: expand allowed tools for validation workflow

Add gh pr comment, gh api, cat, python3, jq to allowed tools so
Claude can post PR summary comments and subagents can function.

* fix: enable track_progress and show_full_output for debugging

* fix: remove colons from Bash glob patterns in validate allowedTools

The gh command patterns used colons (e.g. Bash(gh pr diff:*)) which
are treated as literal characters, so they never matched actual
commands like `gh pr diff 2 --name-only`. This caused 1 permission
denial per CI run and prevented the summary comment from posting.

* fix: fail CI job when validation finds issues

Add verdict step that writes PASS/FAIL to a file, with a follow-up
workflow step that exits 1 on FAIL. Previously validation reported
issues in comments but the job always succeeded.

* fix: remove double-commit contradiction in async agent and shared base

Step 4/5/6 (Implement) said "and commit" but the commit-after-KEEP
step later says "Do NOT commit discards." This meant discarded
experiments were still committed. CPU/memory/structure agents already
had it right — only commit at the KEEP step.

* fix: treat validation warnings as blocking failures

Warnings were previously non-blocking — the verdict step only
checked for "issues that need fixing." Now any warning also
triggers FAIL.

* fix: address plugin-validator warnings

- Declare context7 MCP server in plugin.json (domain agents use it)
- Change codeflash-setup color from green to red (collision with router)
- Add memory: project and example block to codeflash-setup frontmatter

* feat: add eval regression testing (Layer 3)

- baseline-scores.json: checked-in expected scores + min thresholds
  for ranking, memory-hard, memory-misdirection
- check-regression.sh: orchestrator that runs evals, scores them, and
  compares to baselines (exits 1 if any score < min)
- eval-regression.yml: on-demand CI workflow (workflow_dispatch) with
  Bedrock OIDC, artifact upload, and job summary table
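The baseline comparison described above (exit 1 if any score falls below its minimum) can be sketched as follows. The dict shapes and the name `find_regressions` are assumptions for illustration; the real `baseline-scores.json` layout is not shown in this log.

```python
def find_regressions(scores, baselines):
    """Return names of evals whose score fell below the baseline minimum.

    `scores` maps eval name -> achieved score; `baselines` maps eval name ->
    {"expected": ..., "min": ...}. Both shapes are assumed for illustration.
    """
    failed = []
    for name, baseline in baselines.items():
        score = scores.get(name)
        # Assumption: a missing score counts as a regression.
        if score is None or score < baseline["min"]:
            failed.append(name)
    return failed


def exit_code(scores, baselines):
    """Mirror the described behaviour: non-zero when anything regressed."""
    return 1 if find_regressions(scores, baselines) else 0
```

The numeric values used to exercise this sketch would be made up; only the eval names `ranking`, `memory-hard`, and `memory-misdirection` appear in the commit message.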

* fix: address remaining plugin-validator warnings

- Add memory: project to codeflash-memory.md (matches other agents)
- Add allowed-tools to memray-profiling skill
- Fix blank line in codeflash-setup.md frontmatter
- Add git log -20 --stat to all 4 domain agents' git history step
- Add step numbering note to async reference experiment-loop.md

* feat: add deterministic scoring for profiler and ranking criteria

Add session-text-based auto-scoring that overrides LLM grades for
mechanically verifiable criteria:
- used_memory_profiler: grep for memray/tracemalloc Bash commands
- profiled_iteratively: count distinct profiling runs (1=1pt, 2+=full)
- built_ranked_list_with_impact_pct: detect cProfile + ranking output

These anchor 2-4 points per eval deterministically, reducing LLM
variance. Baseline thresholds tightened from min=6 to min=7.
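A minimal sketch of the session-text scoring idea, assuming the session has already been reduced to a list of Bash command strings. The helper names, the regular expression, and `full_points=2` are hypothetical, and "distinct" is interpreted here as distinct command strings.

```python
import re

# Matches the two profilers named in the commit message.
PROFILER_RE = re.compile(r"\b(memray|tracemalloc)\b")


def profiling_commands(bash_commands):
    """Return the distinct Bash commands that invoke a memory profiler."""
    return {cmd.strip() for cmd in bash_commands if PROFILER_RE.search(cmd)}


def score_used_memory_profiler(bash_commands):
    """1 point if any profiler command appears, else 0."""
    return 1 if profiling_commands(bash_commands) else 0


def score_profiled_iteratively(bash_commands, full_points=2):
    """0 runs -> 0, 1 run -> 1 point, 2+ runs -> full points."""
    runs = len(profiling_commands(bash_commands))
    if runs == 0:
        return 0
    return 1 if runs == 1 else full_points
```

Because these checks depend only on the recorded commands, repeated scoring of the same session gives the same result, which is the variance reduction the commit is after.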

* fix: parse verdict from PR comment instead of temp file

Claude writes "Verdict: FAIL/PASS" in the PR comment but doesn't
execute the python3 file-write command. The check step now reads
the claude[bot] comment via gh api and greps for the verdict line.

* fix: address all remaining validator findings for PR #2

- Remove non-standard fields (repository, license, keywords) from plugin.json
- Pin context7 MCP to @2.1.4 instead of @latest
- Add context7 fallback note in router agent
- Remove unused Read from codeflash-optimize allowed-tools
- Make AskUserQuestion usage explicit in skill body
- Add tracemalloc to memray-profiling trigger phrases
- List all reference files in memray-profiling skill

* fix: wire stuck state recovery into domain agents, fix skill allowed-tools

- Add 5+ consecutive discard stuck recovery protocol to all 4 domain
  agents (cpu, memory, async, structure) inline experiment loops,
  matching the shared base definition
- Add Read and Bash to codeflash-optimize skill allowed-tools so it
  can inspect session state before delegating
- Add Write to memray-profiling skill allowed-tools so it can create
  profiling harness scripts

* fix: sync versions, add missing tools to router and skills

- Sync marketplace.json metadata version to 0.1.0 to match plugin.json
- Add Write and Edit to codeflash.md router tools (needed for
  writing .codeflash/conventions.md during setup)
- Remove Bash from codeflash-optimize skill (least-privilege: all
  execution is delegated via Agent)
- Add Grep and Glob to memray-profiling skill (needed for finding
  test files, capture outputs, and config files)

* fix: restore plugin.json fields, revert marketplace version, relax validator verdict

- Restore repository, license, keywords in plugin.json (accidentally
  removed when mcpServers was added)
- Revert marketplace.json metadata.version to 1.0.0 (collection
  version, distinct from individual plugin version 0.1.0)
- Change validator verdict rule: only FAIL on major issues, not
  warnings — prevents the LLM validator from blocking on subjective
  minor findings each run

2026-03-27 07:43:14 -05:00
Kevin Turcios
e681b3732b Merge pull request #4 from codeflash-ai/feat/eval-v2-real-repos
feat: eval v2 — real-repo evals
2026-03-27 07:31:02 -05:00
Kevin Turcios
66187bbcc3 fix: v2 eval runner — shallow cached clones + non-interactive prompt
- Shallow clone (--no-checkout --depth 1 + fetch specific commit) instead
  of full clone — 15s vs 2+ min for large repos like codeflash-internal
- Cache clone in evals/repos/<name>/workspace/, cp -r for each run
- Use gh repo clone for private repo auth
- Fix eval prompt to skip skill's AskUserQuestion step in non-interactive mode
- Gitignore workspace/ dirs
- Update intro.md with v2 eval docs
2026-03-27 07:27:12 -05:00
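The shallow-clone approach in this commit corresponds roughly to the git invocations built below. This is a sketch, not the actual `run-eval.sh`: the function names are hypothetical, and plain `git clone` is used here where the commit uses `gh repo clone` for private-repo authentication.

```python
import subprocess


def shallow_clone_commands(repo_url, commit_sha, dest):
    """Build git commands that fetch one commit without full history."""
    return [
        # Create the repo with minimal history and no working tree yet.
        ["git", "clone", "--no-checkout", "--depth", "1", repo_url, dest],
        # Fetch only the commit the eval is pinned to.
        ["git", "-C", dest, "fetch", "--depth", "1", "origin", commit_sha],
        # Check out that commit (detached HEAD).
        ["git", "-C", dest, "checkout", commit_sha],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Fetching a commit by SHA only works when the server permits it. The cached workspace can then be copied for each run, as the commit does with `cp -r`.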
Kevin Turcios
0d4fc9d8b7 feat: eval v2 — real-repo evals cloned from git
Add support for v2 evals that clone a real repo at a specific commit
instead of using bundled template source. The agent handles setup,
diagnosis, and fixing on its own.

- run-eval.sh: v1/v2 dispatch, repos/ directory, prompt from manifest
- First v2 eval: codeflash-internal psycopg serialization (PR #2489)
- EVAL-V2-SKETCH.md: design doc for the v2 eval system
- intro.md: repo onboarding guide
2026-03-27 07:25:10 -05:00
Kevin Turcios
cae86f669b fix: parse verdict from PR comment instead of temp file
Claude writes "Verdict: FAIL/PASS" in the PR comment but doesn't
execute the python3 file-write command. The check step now reads
the claude[bot] comment via gh api and greps for the verdict line.
2026-03-27 06:40:40 -05:00
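Extracting the verdict from a comment body, as this commit describes, could be sketched like this. The regular expression and function names are assumptions, as is treating a missing verdict as a failure; retrieving the comment via `gh api` is left out.

```python
import re

# Looks for the "Verdict: PASS" / "Verdict: FAIL" line anywhere in the text.
VERDICT_RE = re.compile(r"Verdict:\s*(PASS|FAIL)\b")


def parse_verdict(comment_body):
    """Return 'PASS' or 'FAIL' from a comment body, or None if absent."""
    match = VERDICT_RE.search(comment_body)
    return match.group(1) if match else None


def exit_code_for(comment_body):
    """0 only for an explicit PASS."""
    # Assumption: a missing verdict is treated as a failure.
    return 0 if parse_verdict(comment_body) == "PASS" else 1
```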
Kevin Turcios
b81c50f7f4 fix: treat validation warnings as blocking failures
Warnings were previously non-blocking — the verdict step only
checked for "issues that need fixing." Now any warning also
triggers FAIL.
2026-03-27 06:17:50 -05:00
Kevin Turcios
037b1940eb fix: fail CI job when validation finds issues
Add verdict step that writes PASS/FAIL to a file, with a follow-up
workflow step that exits 1 on FAIL. Previously validation reported
issues in comments but the job always succeeded.
2026-03-27 06:09:21 -05:00
Kevin Turcios
a9fa0687d5 fix: remove colons from Bash glob patterns in validate allowedTools
The gh command patterns used colons (e.g. Bash(gh pr diff:*)) which
are treated as literal characters, so they never matched actual
commands like `gh pr diff 2 --name-only`. This caused 1 permission
denial per CI run and prevented the summary comment from posting.
2026-03-27 06:05:07 -05:00
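The mismatch this commit describes can be reproduced with an ordinary glob matcher. Python's `fnmatch` is used purely as a stand-in, since the matcher behind `allowedTools` is not shown in this log, and the colon-free pattern `gh pr diff *` is an assumed form of the fix.

```python
from fnmatch import fnmatchcase

command = "gh pr diff 2 --name-only"

# The colon is a literal character, so the command would need to contain
# "diff:" for this pattern to match.
with_colon = fnmatchcase(command, "gh pr diff:*")

# Without the colon, "*" matches the remainder of the command.
without_colon = fnmatchcase(command, "gh pr diff *")

print(with_colon, without_colon)
```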
Kevin Turcios
a529812b9d fix: enable track_progress and show_full_output for debugging 2026-03-27 05:58:34 -05:00
Kevin Turcios
9f4a8eda6d fix: expand allowed tools for validation workflow
Add gh pr comment, gh api, cat, python3, jq to allowed tools so
Claude can post PR summary comments and subagents can function.
2026-03-27 05:53:45 -05:00
Kevin Turcios
a482bf6af8 fix: correct plugin marketplace name for CI validation
plugin-dev is in claude-plugins-official, not claude-code-plugins.
Also adds plugin_marketplaces URL for discovery.
2026-03-27 05:49:15 -05:00
Kevin Turcios
021d64a1fd feat: add CI plugin validation workflow (#3)
Uses claude-code-action with plugin-dev plugin to validate plugin
structure, agent consistency, eval manifests, and skills on every PR.
Includes @claude mention support for interactive fixes.
2026-03-27 05:44:44 -05:00
Kevin Turcios
5a990f78a9 Merge pull request #1 from codeflash-ai/fix/async-benchmark-validation
fix: add benchmark fidelity verification and config audit steps
2026-03-27 05:06:43 -05:00
Kevin Turcios
93e93ff1f6 fix: add benchmark fidelity verification and config audit steps
The async agent wrote a benchmark that tested the problem (psycopg2
single connection) but didn't validate the fix (thread_sensitive=False),
and missed dead config (conn_health_checks) after a driver migration.

- Add step 6 (benchmark fidelity) to shared and async experiment loops
- Add step 14 (config audit after KEEP) to shared and async loops
- Add architectural change workflow for driver/infrastructure migrations
- Strengthen profiling rule: never present benchmarking as optional
- Add config audit to async reasoning checklist (question #9)
2026-03-27 05:04:54 -05:00
Kevin Turcios
64268dd023 Hello World 2026-03-24 16:14:04 -05:00