- Optimize codeflash-optimize SKILL.md (review score 17% → 98%, eval 87% → 100%)
- Fix frontmatter (allowed-tools format, argument-hint under metadata)
- Lead description with concrete actions, explicit agent launch parameters
- Add multi-run variance detection to eval system (--runs N flag)
- score.py aggregate command: min/max/avg/stddev per criterion, flaky detection
- check-regression.sh defaults to 3 runs for reliable regression detection
- Add per-criterion regression tracking to baseline-scores.json (v3)
- Reports exactly which criteria regressed, not just total score drops
- Rename evals/ → codeflash-evals/ to avoid tessl directory conflicts
- Switch tessl to managed mode, gitignore vendored tiles and symlinks
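The per-criterion aggregation mentioned in the commit message could look roughly like the following. This is a minimal sketch, not the actual `score.py`: the function name `aggregate_runs`, the input shape (one dict of criterion scores per run), and the rule that any spread between runs marks a criterion as flaky are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def aggregate_runs(runs):
    """Summarize per-criterion scores across several eval runs.

    `runs` is a list of dicts mapping criterion name -> numeric score.
    This input shape is hypothetical; the real score.py may differ.
    """
    criteria = sorted({name for run in runs for name in run})
    summary = {}
    for name in criteria:
        # Treat a criterion that is missing from a run as a score of 0.
        scores = [run.get(name, 0) for run in runs]
        summary[name] = {
            "min": min(scores),
            "max": max(scores),
            "avg": mean(scores),
            "stddev": pstdev(scores),
            # Assumed rule: any variation across runs marks the criterion flaky.
            "flaky": min(scores) != max(scores),
        }
    return summary


if __name__ == "__main__":
    demo = [
        {"finds_bug": 10, "writes_test": 5},
        {"finds_bug": 10, "writes_test": 0},
        {"finds_bug": 10, "writes_test": 5},
    ]
    for name, stats in aggregate_runs(demo).items():
        print(name, stats)
```

Running several evals and feeding their scores through a helper like this makes it possible to tell a real regression (a lower minimum on a stable criterion) from noise (a criterion whose score already varies between identical runs).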
249 lines
11 KiB
YAML
name: Plugin Validation

on:
  pull_request:
    types: [opened, synchronize, ready_for_review, reopened]
  issue_comment:
    types: [created]
  pull_request_review_comment:
    types: [created]
  pull_request_review:
    types: [submitted]

jobs:
  validate:
    concurrency:
      group: validate-${{ github.head_ref || github.run_id }}
      cancel-in-progress: true
    if: |
      (
        github.event_name == 'pull_request' &&
        github.event.sender.login != 'claude[bot]' &&
        github.event.pull_request.head.repo.full_name == github.repository
      )
    runs-on: ubuntu-latest
    permissions:
      actions: read
      contents: read
      pull-requests: write
      issues: read
      id-token: write
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
          ref: ${{ github.event.pull_request.head.ref }}

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_TO_ASSUME }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Run Plugin Validation
        uses: anthropics/claude-code-action@v1
        with:
          use_bedrock: "true"
          use_sticky_comment: true
          track_progress: true
          show_full_output: true
          prompt: |
            You are validating the codeflash-agent Claude Code plugin. This plugin has:
            - 6 agents in `agents/` (router + setup + 4 domain agents)
            - 2 skills in `skills/` (codeflash-optimize, memray-profiling)
            - Eval templates in `codeflash-evals/templates/`
            - Plugin manifest at `.claude-plugin/plugin.json`
            - No hooks directory

            Execute each step in order. If a step finds no issues, state that and continue.

            <step name="triage">
            Assess what changed in this PR:
            1. Run `gh pr diff ${{ github.event.pull_request.number }} --name-only` to get changed files.
            2. Classify changes:
               - AGENTS: files in `agents/`
               - SKILLS: files in `skills/`
               - EVALS: files in `codeflash-evals/`
               - PLUGIN_CONFIG: `.claude-plugin/plugin.json`, hooks
               - DOCS: `*.md` outside agents/skills, LICENSE
               - OTHER: anything else
            3. Record which categories have changes — later steps only run if relevant.
            </step>

            <step name="plugin_structure">
            First, use the Agent tool to launch a **claude-code-guide** agent with this prompt:
            "Look up the full Claude Code plugin specification. I need the required and optional fields for:
            1. plugin.json manifest schema
            2. Agent .md frontmatter (YAML between --- markers) — all valid fields
            3. Skill SKILL.md frontmatter — all valid fields
            Return the complete field lists with types and whether each is required."

            Then, using the spec returned by that agent, validate this plugin:
            - Read `.claude-plugin/plugin.json` and check against the plugin.json schema
            - Read each `agents/*.md` and validate frontmatter fields against the agent spec
            - Read each `skills/*/SKILL.md` and validate frontmatter fields against the skill spec
            - Check file cross-references (agents referenced in plugin.json exist, skills referenced in agent frontmatter exist)
            - Report any issues found
            </step>

            <step name="agent_consistency">
            Only run if AGENTS changed.

            The 4 domain agents (codeflash-cpu.md, codeflash-memory.md, codeflash-async.md, codeflash-structure.md)
            must all have these steps in their experiment loops:
            1. A "Review git history" step (step 1) with `git log --oneline -20` and `git diff HEAD~1`
            2. A "Guard" step (if configured in conventions.md) with revert/rework/discard logic
            3. A "Config audit" step (after KEEP) checking for dead/inconsistent config flags

            Check each domain agent:
            1. Read the experiment loop section of each file.
            2. Verify all 3 steps are present.
            3. Verify step numbering is sequential with no gaps.
            4. Verify the Guard step includes "revert, rework (max 2 attempts), then discard".
            5. Verify the Config audit step has domain-specific guidance (not generic).

            Also check: router agent (codeflash.md) domain detection table matches the 4 domain agents that exist.
            </step>

            <step name="eval_manifests">
            Only run if EVALS changed.

            For each `codeflash-evals/templates/*/manifest.json`:
            1. Verify valid JSON.
            2. Verify required fields: `name`, `eval_type`, `bugs` (array), `rubric` (object with `criteria`).
            3. Verify each bug has: `id`, `file`, `description`, `domain`.
            4. Verify `rubric.criteria` values are positive integers.
            5. Verify `rubric.total` equals the sum of criteria values (if present).
            6. Verify referenced files (`file` in bugs, `test_file`) actually exist in that template directory.
            </step>

            <step name="skill_review">
            Only run if SKILLS changed.

            First, use the Agent tool to launch a **claude-code-guide** agent with this prompt:
            "Look up Claude Code skill best practices. I need:
            1. What makes a good skill description (trigger terms, specificity, completeness)
            2. Best practices for allowed-tools restrictions
            3. Best practices for skill content structure (conciseness, actionability, progressive disclosure)
            Return the complete guidelines."

            Then, using those guidelines, review each skill in `skills/`:
            - Check description quality and trigger term coverage
            - Check allowed-tools restrictions are appropriate
            - Check content follows best practices (concise, actionable, clear workflow)
            - Report any issues found
            </step>

            <step name="summary">
            Post exactly one summary comment with all results:

            ## Plugin Validation

            ### Plugin Structure
            (validation findings or "All checks passed")

            ### Agent Consistency
            (experiment loop check results or "Not applicable — no agent changes")

            ### Eval Manifests
            (manifest validation results or "Not applicable — no eval changes")

            ### Skill Review
            (skill review findings or "Not applicable — no skill changes")

            ---
            *Validated by claude-code-guide + codeflash-agent checks*
            </step>

            <step name="verdict">
            End your summary comment with exactly one of these lines (no other text on that line):

            **Verdict: PASS**
            **Verdict: FAIL**

            Use FAIL only if a step found a **major** issue (broken functionality, missing required fields, incorrect cross-references).
            Warnings and minor style suggestions are NOT blocking — use PASS if the only findings are warnings.
            Use PASS if every step passed or only had minor/warning-level findings.
            </step>
          claude_args: '--model us.anthropic.claude-sonnet-4-6 --allowedTools "Agent,Read,Glob,Grep,Bash(gh pr diff*),Bash(gh pr view*),Bash(gh pr comment*),Bash(gh api*),Bash(git diff*),Bash(git log*),Bash(git status*),Bash(cat *),Bash(python3 *),Bash(jq *)"'

      - name: Check validation verdict
        if: always()
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          # Parse verdict from Claude's PR comment
          VERDICT=$(gh api repos/${{ github.repository }}/issues/${{ github.event.pull_request.number }}/comments \
            --jq '[.[] | select(.user.login == "claude[bot]")] | last | .body' \
            | grep -oP 'Verdict:\s*\K(PASS|FAIL)' | tail -1 || true)

          if [ -z "$VERDICT" ]; then
            echo "::warning::Could not find verdict in Claude's PR comment"
            exit 0
          fi

          echo "Verdict: $VERDICT"
          if [ "$VERDICT" = "FAIL" ]; then
            echo "::error::Plugin validation found issues that need fixing"
            exit 1
          fi

  claude-mention:
    concurrency:
      group: claude-mention-${{ github.event.issue.number || github.event.pull_request.number || github.run_id }}
      cancel-in-progress: false
    if: |
      (
        github.event_name == 'issue_comment' &&
        contains(github.event.comment.body, '@claude') &&
        (github.event.comment.author_association == 'OWNER' || github.event.comment.author_association == 'MEMBER' || github.event.comment.author_association == 'COLLABORATOR')
      ) ||
      (
        github.event_name == 'pull_request_review_comment' &&
        contains(github.event.comment.body, '@claude') &&
        (github.event.comment.author_association == 'OWNER' || github.event.comment.author_association == 'MEMBER' || github.event.comment.author_association == 'COLLABORATOR') &&
        github.event.pull_request.head.repo.full_name == github.repository
      ) ||
      (
        github.event_name == 'pull_request_review' &&
        contains(github.event.review.body, '@claude') &&
        (github.event.review.author_association == 'OWNER' || github.event.review.author_association == 'MEMBER' || github.event.review.author_association == 'COLLABORATOR') &&
        github.event.pull_request.head.repo.full_name == github.repository
      )
    runs-on: ubuntu-latest
    permissions:
      contents: write
      pull-requests: write
      issues: read
      id-token: write
    steps:
      - name: Get PR head ref
        id: pr-ref
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          if [ "${{ github.event_name }}" = "issue_comment" ]; then
            PR_REF=$(gh api repos/${{ github.repository }}/pulls/${{ github.event.issue.number }} --jq '.head.ref')
            echo "ref=$PR_REF" >> $GITHUB_OUTPUT
          else
            echo "ref=${{ github.event.pull_request.head.ref || github.head_ref }}" >> $GITHUB_OUTPUT
          fi

      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
          ref: ${{ steps.pr-ref.outputs.ref }}

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_TO_ASSUME }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Run Claude Code
        uses: anthropics/claude-code-action@v1
        with:
          use_bedrock: "true"
          claude_args: '--model us.anthropic.claude-sonnet-4-6 --allowedTools "Agent,Read,Edit,Write,Glob,Grep,Bash(git status*),Bash(git diff*),Bash(git add *),Bash(git commit *),Bash(git push*),Bash(git log*),Bash(gh pr comment*),Bash(gh pr view*),Bash(gh pr diff*)"'
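The `eval_manifests` step in the prompt above lists six checks in prose. The sketch below shows how those checks might be expressed deterministically in Python, for example as a script the agent could run through its `Bash(python3 *)` permission. It is an illustration under stated assumptions, not part of the workflow: the function name `validate_manifest` is hypothetical, `rubric.criteria` is assumed to be an object mapping criterion names to point values, and `test_file` is assumed to be an optional top-level string.

```python
import json
from pathlib import Path


def validate_manifest(manifest_path):
    """Return a list of problems found in one eval manifest (empty if none)."""
    path = Path(manifest_path)
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["manifest must be a JSON object"]

    problems = []
    for field in ("name", "eval_type", "bugs", "rubric"):
        if field not in data:
            problems.append(f"missing required field: {field}")

    bugs = data.get("bugs", [])
    if not isinstance(bugs, list):
        problems.append("bugs must be an array")
        bugs = []
    referenced = []
    for index, bug in enumerate(bugs):
        if not isinstance(bug, dict):
            problems.append(f"bugs[{index}] must be an object")
            continue
        for field in ("id", "file", "description", "domain"):
            if field not in bug:
                problems.append(f"bugs[{index}] missing field: {field}")
        if isinstance(bug.get("file"), str):
            referenced.append(bug["file"])

    rubric = data.get("rubric", {})
    criteria = rubric.get("criteria") if isinstance(rubric, dict) else None
    if "rubric" in data and not isinstance(criteria, dict):
        problems.append("rubric must be an object with a criteria object")
    elif isinstance(criteria, dict):
        valid = True
        for name, value in criteria.items():
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
                problems.append(f"rubric.criteria[{name}] must be a positive integer")
                valid = False
        if valid and "total" in rubric and rubric["total"] != sum(criteria.values()):
            problems.append("rubric.total does not equal the sum of criteria values")

    # Assumption: test_file is an optional top-level string field.
    if isinstance(data.get("test_file"), str):
        referenced.append(data["test_file"])
    for relative in referenced:
        # Referenced files are resolved relative to the manifest's directory.
        if not (path.parent / relative).is_file():
            problems.append(f"referenced file not found: {relative}")
    return problems


if __name__ == "__main__":
    for manifest in sorted(Path("codeflash-evals/templates").glob("*/manifest.json")):
        issues = validate_manifest(manifest)
        print(manifest, "OK" if not issues else issues)
```

Returning a list of problems rather than raising on the first failure lets a single run report every issue in a manifest, which matches the step's instruction to report all findings in one summary comment.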