codeflash-agent/.github/workflows/validate.yml
Kevin Turcios 37efa524d7 feat: improve skill, eval system, and tessl config
- Optimize codeflash-optimize SKILL.md (review score 17% → 98%, eval 87% → 100%)
  - Fix frontmatter (allowed-tools format, argument-hint under metadata)
  - Lead description with concrete actions, explicit agent launch parameters
- Add multi-run variance detection to eval system (--runs N flag)
  - score.py aggregate command: min/max/avg/stddev per criterion, flaky detection
  - check-regression.sh defaults to 3 runs for reliable regression detection
- Add per-criterion regression tracking to baseline-scores.json (v3)
  - Reports exactly which criteria regressed, not just total score drops
- Rename evals/ → codeflash-evals/ to avoid tessl directory conflicts
- Switch tessl to managed mode, gitignore vendored tiles and symlinks
2026-03-27 11:30:17 -05:00

249 lines
11 KiB
YAML

name: Plugin Validation
on:
pull_request:
types: [opened, synchronize, ready_for_review, reopened]
issue_comment:
types: [created]
pull_request_review_comment:
types: [created]
pull_request_review:
types: [submitted]
jobs:
validate:
concurrency:
group: validate-${{ github.head_ref || github.run_id }}
cancel-in-progress: true
if: |
(
github.event_name == 'pull_request' &&
github.event.sender.login != 'claude[bot]' &&
github.event.pull_request.head.repo.full_name == github.repository
)
runs-on: ubuntu-latest
permissions:
actions: read
contents: read
pull-requests: write
issues: read
id-token: write
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
fetch-depth: 0
ref: ${{ github.event.pull_request.head.ref }}
- name: Configure AWS Credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_ROLE_TO_ASSUME }}
aws-region: ${{ secrets.AWS_REGION }}
- name: Run Plugin Validation
uses: anthropics/claude-code-action@v1
with:
use_bedrock: "true"
use_sticky_comment: true
track_progress: true
show_full_output: true
prompt: |
You are validating the codeflash-agent Claude Code plugin. This plugin has:
- 6 agents in `agents/` (router + setup + 4 domain agents)
- 2 skills in `skills/` (codeflash-optimize, memray-profiling)
- Eval templates in `codeflash-evals/templates/`
- Plugin manifest at `.claude-plugin/plugin.json`
- No hooks directory
Execute each step in order. If a step finds no issues, state that and continue.
<step name="triage">
Assess what changed in this PR:
1. Run `gh pr diff ${{ github.event.pull_request.number }} --name-only` to get changed files.
2. Classify changes:
- AGENTS: files in `agents/`
- SKILLS: files in `skills/`
- EVALS: files in `codeflash-evals/`
- PLUGIN_CONFIG: `.claude-plugin/plugin.json`, hooks
- DOCS: `*.md` outside agents/skills, LICENSE
- OTHER: anything else
3. Record which categories have changes — later steps only run if relevant.
</step>
<step name="plugin_structure">
First, use the Agent tool to launch a **claude-code-guide** agent with this prompt:
"Look up the full Claude Code plugin specification. I need the required and optional fields for:
1. plugin.json manifest schema
2. Agent .md frontmatter (YAML between --- markers) — all valid fields
3. Skill SKILL.md frontmatter — all valid fields
Return the complete field lists with types and whether each is required."
Then, using the spec returned by that agent, validate this plugin:
- Read `.claude-plugin/plugin.json` and check against the plugin.json schema
- Read each `agents/*.md` and validate frontmatter fields against the agent spec
- Read each `skills/*/SKILL.md` and validate frontmatter fields against the skill spec
- Check file cross-references (agents referenced in plugin.json exist, skills referenced in agent frontmatter exist)
- Report any issues found
</step>
<step name="agent_consistency">
Only run if AGENTS changed.
The 4 domain agents (codeflash-cpu.md, codeflash-memory.md, codeflash-async.md, codeflash-structure.md)
must all have these steps in their experiment loops:
1. A "Review git history" step (step 1) with `git log --oneline -20` and `git diff HEAD~1`
2. A "Guard" step (if configured in conventions.md) with revert/rework/discard logic
3. A "Config audit" step (after KEEP) checking for dead/inconsistent config flags
Check each domain agent:
1. Read the experiment loop section of each file.
2. Verify all 3 steps are present.
3. Verify step numbering is sequential with no gaps.
4. Verify the Guard step includes "revert, rework (max 2 attempts), then discard".
5. Verify the Config audit step has domain-specific guidance (not generic).
Also check: router agent (codeflash.md) domain detection table matches the 4 domain agents that exist.
</step>
<step name="eval_manifests">
Only run if EVALS changed.
For each `codeflash-evals/templates/*/manifest.json`:
1. Verify valid JSON.
2. Verify required fields: `name`, `eval_type`, `bugs` (array), `rubric` (object with `criteria`).
3. Verify each bug has: `id`, `file`, `description`, `domain`.
4. Verify `rubric.criteria` values are positive integers.
5. Verify `rubric.total` equals the sum of criteria values (if present).
6. Verify referenced files (`file` in bugs, `test_file`) actually exist in that template directory.
</step>
<step name="skill_review">
Only run if SKILLS changed.
First, use the Agent tool to launch a **claude-code-guide** agent with this prompt:
"Look up Claude Code skill best practices. I need:
1. What makes a good skill description (trigger terms, specificity, completeness)
2. Best practices for allowed-tools restrictions
3. Best practices for skill content structure (conciseness, actionability, progressive disclosure)
Return the complete guidelines."
Then, using those guidelines, review each skill in `skills/`:
- Check description quality and trigger term coverage
- Check allowed-tools restrictions are appropriate
- Check content follows best practices (concise, actionable, clear workflow)
- Report any issues found
</step>
<step name="summary">
Post exactly one summary comment with all results:
## Plugin Validation
### Plugin Structure
(validation findings or "All checks passed")
### Agent Consistency
(experiment loop check results or "Not applicable — no agent changes")
### Eval Manifests
(manifest validation results or "Not applicable — no eval changes")
### Skill Review
(skill review findings or "Not applicable — no skill changes")
---
*Validated by claude-code-guide + codeflash-agent checks*
</step>
<step name="verdict">
End your summary comment with exactly one of these lines (no other text on that line):
**Verdict: PASS**
**Verdict: FAIL**
Use FAIL only if a step found a **major** issue (broken functionality, missing required fields, incorrect cross-references).
Warnings and minor style suggestions are NOT blocking — use PASS if the only findings are warnings.
Use PASS if every step passed or only had minor/warning-level findings.
</step>
claude_args: '--model us.anthropic.claude-sonnet-4-6 --allowedTools "Agent,Read,Glob,Grep,Bash(gh pr diff*),Bash(gh pr view*),Bash(gh pr comment*),Bash(gh api*),Bash(git diff*),Bash(git log*),Bash(git status*),Bash(cat *),Bash(python3 *),Bash(jq *)"'
- name: Check validation verdict
if: always()
env:
GH_TOKEN: ${{ github.token }}
run: |
# Parse verdict from Claude's PR comment
VERDICT=$(gh api repos/${{ github.repository }}/issues/${{ github.event.pull_request.number }}/comments \
--jq '[.[] | select(.user.login == "claude[bot]")] | last | .body' \
| grep -oP 'Verdict:\s*\K(PASS|FAIL)' | tail -1 || true)
if [ -z "$VERDICT" ]; then
echo "::warning::Could not find verdict in Claude's PR comment"
exit 0
fi
echo "Verdict: $VERDICT"
if [ "$VERDICT" = "FAIL" ]; then
echo "::error::Plugin validation found issues that need fixing"
exit 1
fi
claude-mention:
concurrency:
group: claude-mention-${{ github.event.issue.number || github.event.pull_request.number || github.run_id }}
cancel-in-progress: false
if: |
(
github.event_name == 'issue_comment' &&
contains(github.event.comment.body, '@claude') &&
(github.event.comment.author_association == 'OWNER' || github.event.comment.author_association == 'MEMBER' || github.event.comment.author_association == 'COLLABORATOR')
) ||
(
github.event_name == 'pull_request_review_comment' &&
contains(github.event.comment.body, '@claude') &&
(github.event.comment.author_association == 'OWNER' || github.event.comment.author_association == 'MEMBER' || github.event.comment.author_association == 'COLLABORATOR') &&
github.event.pull_request.head.repo.full_name == github.repository
) ||
(
github.event_name == 'pull_request_review' &&
contains(github.event.review.body, '@claude') &&
(github.event.review.author_association == 'OWNER' || github.event.review.author_association == 'MEMBER' || github.event.review.author_association == 'COLLABORATOR') &&
github.event.pull_request.head.repo.full_name == github.repository
)
runs-on: ubuntu-latest
permissions:
contents: write
pull-requests: write
issues: read
id-token: write
steps:
- name: Get PR head ref
id: pr-ref
env:
GH_TOKEN: ${{ github.token }}
run: |
if [ "${{ github.event_name }}" = "issue_comment" ]; then
PR_REF=$(gh api repos/${{ github.repository }}/pulls/${{ github.event.issue.number }} --jq '.head.ref')
echo "ref=$PR_REF" >> $GITHUB_OUTPUT
else
echo "ref=${{ github.event.pull_request.head.ref || github.head_ref }}" >> $GITHUB_OUTPUT
fi
- name: Checkout repository
uses: actions/checkout@v4
with:
fetch-depth: 0
ref: ${{ steps.pr-ref.outputs.ref }}
- name: Configure AWS Credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_ROLE_TO_ASSUME }}
aws-region: ${{ secrets.AWS_REGION }}
- name: Run Claude Code
uses: anthropics/claude-code-action@v1
with:
use_bedrock: "true"
claude_args: '--model us.anthropic.claude-sonnet-4-6 --allowedTools "Agent,Read,Edit,Write,Glob,Grep,Bash(git status*),Bash(git diff*),Bash(git add *),Bash(git commit *),Bash(git push*),Bash(git log*),Bash(gh pr comment*),Bash(gh pr view*),Bash(gh pr diff*)"'