Merge pull request #6 from codeflash-ai/feat/tool-configs

feat: improve skill and eval system
Kevin Turcios 2026-03-27 11:31:30 -05:00 committed by GitHub
commit 7fab0082c0
75 changed files with 680 additions and 278 deletions

View file

@@ -48,7 +48,7 @@ jobs:
ANTHROPIC_MODEL: us.anthropic.claude-sonnet-4-6
CLAUDE_CODE_USE_BEDROCK: 1
run: |
chmod +x evals/check-regression.sh evals/run-eval.sh evals/score-eval.sh
chmod +x codeflash-evals/check-regression.sh codeflash-evals/run-eval.sh codeflash-evals/score-eval.sh
ARGS=()
if [ -n "${{ inputs.templates }}" ]; then
@@ -58,20 +58,20 @@ jobs:
done
fi
./evals/check-regression.sh "${ARGS[@]}"
./codeflash-evals/check-regression.sh "${ARGS[@]}"
- name: Upload results
if: always()
uses: actions/upload-artifact@v4
with:
name: eval-results-${{ github.run_number }}
path: evals/results/
path: codeflash-evals/results/
retention-days: 30
- name: Post job summary
if: always()
run: |
SUMMARY="evals/results/regression-summary.json"
SUMMARY="codeflash-evals/results/regression-summary.json"
if [ ! -f "$SUMMARY" ]; then
echo "::warning::No regression summary found"
exit 0
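The summary parsed here is produced by `check-regression.sh`, added later in this commit. As a minimal sketch of how the rest of this truncated step could consume it (field names match the script; the jq calls are illustrative):

```bash
# Illustrative continuation of the truncated step above
fi
PASSED=$(jq -r '.passed' "$SUMMARY")
RUNS=$(jq -r '.runs_per_eval' "$SUMMARY")
{
  echo "## Eval Regression Summary ($RUNS run(s) per eval)"
  jq -r '.results | to_entries[] |
    "- \(.key): \(.value.score) (min \(.value.min), expected \(.value.expected))"' "$SUMMARY"
} >> "$GITHUB_STEP_SUMMARY"
[ "$PASSED" = "true" ] || echo "::error::Eval regression detected"
```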

View file

@@ -48,13 +48,11 @@ jobs:
use_sticky_comment: true
track_progress: true
show_full_output: true
plugins: "plugin-dev@claude-plugins-official"
plugin_marketplaces: "https://github.com/anthropics/claude-plugins-official.git"
prompt: |
You are validating the codeflash-agent Claude Code plugin. This plugin has:
- 6 agents in `agents/` (router + setup + 4 domain agents)
- 2 skills in `skills/` (codeflash-optimize, memray-profiling)
- Eval templates in `evals/templates/`
- Eval templates in `codeflash-evals/templates/`
- Plugin manifest at `.claude-plugin/plugin.json`
- No hooks directory
@@ -66,22 +64,27 @@ jobs:
2. Classify changes:
- AGENTS: files in `agents/`
- SKILLS: files in `skills/`
- EVALS: files in `evals/`
- PLUGIN_CONFIG: `.claude-plugin/plugin.json`, hooks, `.mcp.json`
- EVALS: files in `codeflash-evals/`
- PLUGIN_CONFIG: `.claude-plugin/plugin.json`, hooks
- DOCS: `*.md` outside agents/skills, LICENSE
- OTHER: anything else
3. Record which categories have changes — later steps only run if relevant.
</step>
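A rough shell equivalent of the classification rules above, purely as a sketch (in the workflow the agent classifies the diff itself; `GITHUB_BASE_REF` is an assumption):

```bash
# Sketch: path-based classification mirroring the categories above.
# Order matters: agents/ and skills/ match before the *.md DOCS rule.
git diff --name-only "origin/${GITHUB_BASE_REF:-main}..." | while read -r f; do
  case "$f" in
    agents/*)                           echo "AGENTS: $f" ;;
    skills/*)                           echo "SKILLS: $f" ;;
    codeflash-evals/*)                  echo "EVALS: $f" ;;
    .claude-plugin/plugin.json|hooks/*) echo "PLUGIN_CONFIG: $f" ;;
    *.md|LICENSE)                       echo "DOCS: $f" ;;
    *)                                  echo "OTHER: $f" ;;
  esac
done
```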
<step name="plugin_structure">
Use the Agent tool to launch the **plugin-dev:plugin-validator** agent with this prompt:
"Validate the plugin at the current directory. Check plugin.json, all agent frontmatter, all skill frontmatter, and hook configuration. Report any issues found."
First, use the Agent tool to launch a **claude-code-guide** agent with this prompt:
"Look up the full Claude Code plugin specification. I need the required and optional fields for:
1. plugin.json manifest schema
2. Agent .md frontmatter (YAML between --- markers) — all valid fields
3. Skill SKILL.md frontmatter — all valid fields
Return the complete field lists with types and whether each is required."
This agent knows the full Claude Code plugin spec and will check:
- plugin.json schema and required fields
- Agent YAML frontmatter (name, description, model, tools, color)
- Skill YAML frontmatter (name, description, allowed-tools)
- File cross-references and directory structure
Then, using the spec returned by that agent, validate this plugin:
- Read `.claude-plugin/plugin.json` and check against the plugin.json schema
- Read each `agents/*.md` and validate frontmatter fields against the agent spec
- Read each `skills/*/SKILL.md` and validate frontmatter fields against the skill spec
- Check file cross-references (agents referenced in plugin.json exist, skills referenced in agent frontmatter exist)
- Report any issues found
</step>
<step name="agent_consistency">
@@ -106,7 +109,7 @@ jobs:
<step name="eval_manifests">
Only run if EVALS changed.
For each `evals/templates/*/manifest.json`:
For each `codeflash-evals/templates/*/manifest.json`:
1. Verify valid JSON.
2. Verify required fields: `name`, `eval_type`, `bugs` (array), `rubric` (object with `criteria`).
3. Verify each bug has: `id`, `file`, `description`, `domain`.
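A minimal manifest passing these checks might look like the sketch below; every value is hypothetical (real manifests appear later in this commit), and the shape of each `criteria` entry is assumed:

```bash
# Hypothetical minimal manifest satisfying checks 1-3 above
cat > manifest.json <<'EOF'
{
  "name": "example-eval",
  "eval_type": "optimization",
  "bugs": [
    { "id": "example-bug", "file": "src/example.py",
      "description": "O(n^2) scan in a hot loop", "domain": "data-structures" }
  ],
  "rubric": { "criteria": { "fixed_example_bug": { "points": 2 } } }
}
EOF
# Exit code reflects whether the required fields are present
jq -e '.name and .eval_type and (.bugs | type == "array") and (.rubric | has("criteria"))' manifest.json
```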
@@ -118,9 +121,18 @@ jobs:
<step name="skill_review">
Only run if SKILLS changed.
Use the Agent tool to launch the **plugin-dev:skill-reviewer** agent with this prompt:
"Review all skills in the `skills/` directory. Check description quality, triggering accuracy,
allowed-tools restrictions, and whether the skill content follows best practices."
First, use the Agent tool to launch a **claude-code-guide** agent with this prompt:
"Look up Claude Code skill best practices. I need:
1. What makes a good skill description (trigger terms, specificity, completeness)
2. Best practices for allowed-tools restrictions
3. Best practices for skill content structure (conciseness, actionability, progressive disclosure)
Return the complete guidelines."
Then, using those guidelines, review each skill in `skills/`:
- Check description quality and trigger term coverage
- Check allowed-tools restrictions are appropriate
- Check content follows best practices (concise, actionable, clear workflow)
- Report any issues found
</step>
<step name="summary">
@@ -129,7 +141,7 @@ jobs:
## Plugin Validation
### Plugin Structure
(plugin-validator findings or "All checks passed")
(validation findings or "All checks passed")
### Agent Consistency
(experiment loop check results or "Not applicable — no agent changes")
@@ -138,10 +150,10 @@ jobs:
(manifest validation results or "Not applicable — no eval changes")
### Skill Review
(skill-reviewer findings or "Not applicable — no skill changes")
(skill review findings or "Not applicable — no skill changes")
---
*Validated by plugin-dev + codeflash-agent checks*
*Validated by claude-code-guide + codeflash-agent checks*
</step>
<step name="verdict">
@@ -234,6 +246,4 @@ jobs:
uses: anthropics/claude-code-action@v1
with:
use_bedrock: "true"
plugins: "plugin-dev@claude-plugins-official"
plugin_marketplaces: "https://github.com/anthropics/claude-plugins-official.git"
claude_args: '--model us.anthropic.claude-sonnet-4-6 --allowedTools "Agent,Read,Edit,Write,Glob,Grep,Bash(git status*),Bash(git diff*),Bash(git add *),Bash(git commit *),Bash(git push*),Bash(git log*),Bash(gh pr comment*),Bash(gh pr view*),Bash(gh pr diff*)"'

View file

@@ -15,6 +15,8 @@ description: >
model: sonnet
color: red
memory: project
skills:
- memray-profiling
tools: ["Read", "Bash", "Glob", "Grep", "Write"]
---

View file

@@ -0,0 +1,43 @@
{
"version": 3,
"updated": "2026-03-27",
"note": "v3: per-criterion baselines for pinpointed regression detection",
"evals": {
"ranking": {
"expected": 9,
"min": 7,
"max": 10,
"criteria": {
"built_ranked_list_with_impact_pct": { "expected": 3, "min": 2 },
"fixed_highest_impact_first": { "expected": 2, "min": 1 },
"skipped_low_impact_targets": { "expected": 3, "min": 2 },
"reprofiled_after_major_fix": { "expected": 2, "min": 1 }
}
},
"memory-hard": {
"expected": 9,
"min": 7,
"max": 10,
"criteria": {
"used_memory_profiler": { "expected": 2, "min": 2 },
"profiled_per_stage": { "expected": 2, "min": 1 },
"identified_dominant_allocator": { "expected": 3, "min": 2 },
"fixed_dominant_issue": { "expected": 2, "min": 1 },
"fixed_secondary_issues": { "expected": 1, "min": 0 }
}
},
"memory-misdirection": {
"expected": 9,
"min": 7,
"max": 10,
"criteria": {
"used_memory_profiler": { "expected": 1, "min": 1 },
"profiled_iteratively": { "expected": 2, "min": 1 },
"identified_analytics_as_major": { "expected": 2, "min": 1 },
"fixed_analytics_details": { "expected": 2, "min": 1 },
"fixed_other_issues": { "expected": 2, "min": 1 },
"tests_pass": { "expected": 1, "min": 1 }
}
}
}
}
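`check-regression.sh` (next file) reads these thresholds per criterion; for example:

```bash
# Paths match this file; PASS/WARN/FAIL logic is in check-regression.sh
jq -r '.evals.ranking.criteria.built_ranked_list_with_impact_pct
       | "expected=\(.expected) min=\(.min)"' codeflash-evals/baseline-scores.json
# -> expected=3 min=2   (score >= expected passes, >= min warns, below min fails)
```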

View file

@@ -0,0 +1,321 @@
#!/bin/bash
set -euo pipefail
# Eval regression checker for codeflash-agent plugin
#
# Runs evals, scores them, and compares to checked-in baselines.
# Reports per-criterion regressions so you know exactly what broke.
# Exits 1 if any score drops below the minimum threshold.
#
# Usage:
# ./check-regression.sh # run all baseline evals (3 runs each)
# ./check-regression.sh ranking # run specific template(s)
# ./check-regression.sh --runs 1 # single run (faster, less reliable)
# ./check-regression.sh --score-only <dir> # score existing results, skip running
EVAL_DIR="$(cd "$(dirname "$0")" && pwd)"
BASELINE_FILE="$EVAL_DIR/baseline-scores.json"
die() { echo "ERROR: $*" >&2; exit 1; }
[ -f "$BASELINE_FILE" ] || die "Baseline file not found: $BASELINE_FILE"
# --- Parse args ---
SCORE_ONLY=""
RESULTS_DIRS=()
TEMPLATES=()
NUM_RUNS=3
while [[ $# -gt 0 ]]; do
case "$1" in
--score-only)
SCORE_ONLY=1
shift
while [[ $# -gt 0 && ! "$1" =~ ^-- ]]; do
RESULTS_DIRS+=("$1")
shift
done
;;
--runs)
NUM_RUNS="${2:?--runs requires a number}"
shift 2
;;
*)
TEMPLATES+=("$1")
shift
;;
esac
done
# If no templates specified, use all from baseline file
if [[ ${#TEMPLATES[@]} -eq 0 && -z "$SCORE_ONLY" ]]; then
mapfile -t TEMPLATES < <(jq -r '.evals | keys[]' "$BASELINE_FILE")
fi
# --- Score-only mode ---
if [[ -n "$SCORE_ONLY" ]]; then
[[ ${#RESULTS_DIRS[@]} -gt 0 ]] || die "Usage: $0 --score-only <results-dir> [<results-dir> ...]"
for dir in "${RESULTS_DIRS[@]}"; do
[ -d "$dir" ] || die "Results directory not found: $dir"
echo "Scoring: $dir"
# Check if this is a multi-run parent dir
if ls "$dir"/run-*/ >/dev/null 2>&1; then
for run_dir in "$dir"/run-*/; do
"$EVAL_DIR/score-eval.sh" "$run_dir"
done
python3 "$EVAL_DIR/score.py" aggregate "$dir"
else
"$EVAL_DIR/score-eval.sh" "$dir"
fi
done
echo ""
echo "Scored ${#RESULTS_DIRS[@]} result(s). Compare manually or re-run without --score-only."
exit 0
fi
# --- Run evals ---
echo "=== Eval Regression Check ==="
echo "Templates: ${TEMPLATES[*]}"
echo "Runs per eval: $NUM_RUNS"
echo "Baseline: $BASELINE_FILE"
echo ""
declare -A RESULT_DIRS
for template in "${TEMPLATES[@]}"; do
# Verify template exists in baseline
min=$(jq -r --arg t "$template" '.evals[$t].min // empty' "$BASELINE_FILE")
[[ -n "$min" ]] || die "Template '$template' not found in baseline file"
echo "--- Running: $template ---"
run_args=("$template" --skill-only)
if [[ "$NUM_RUNS" -gt 1 ]]; then
run_args+=(--runs "$NUM_RUNS")
fi
output=$("$EVAL_DIR/run-eval.sh" "${run_args[@]}" 2>&1)
echo "$output"
# Extract the results directory from run-eval output
result_dir=$(echo "$output" | grep "^Directory:" | head -1 | awk '{print $2}')
# Fallback to "Results:" prefix
if [[ -z "$result_dir" || ! -d "$result_dir" ]]; then
result_dir=$(echo "$output" | grep "^Results:" | head -1 | awk '{print $2}')
fi
[[ -n "$result_dir" && -d "$result_dir" ]] || die "Could not find results directory for $template"
RESULT_DIRS[$template]="$result_dir"
echo ""
done
# --- Score (single-run only, multi-run scores during run-eval.sh) ---
if [[ "$NUM_RUNS" -eq 1 ]]; then
echo "=== Scoring ==="
echo ""
for template in "${TEMPLATES[@]}"; do
result_dir="${RESULT_DIRS[$template]}"
echo "--- Scoring: $template ---"
"$EVAL_DIR/score-eval.sh" "$result_dir"
echo ""
done
fi
# --- Read scores ---
declare -A SCORES
read_scores() {
local template=$1
local result_dir="${RESULT_DIRS[$template]}"
if [[ "$NUM_RUNS" -gt 1 ]]; then
# Multi-run: read from aggregate
local agg_file="$result_dir/skill.aggregate.json"
if [[ -f "$agg_file" ]]; then
SCORES[$template]=$(jq -r '.total.avg | floor' "$agg_file")
return
fi
fi
# Single-run: read from score file
local score_file="$result_dir/skill.score.json"
if [[ -f "$score_file" ]]; then
SCORES[$template]=$(jq -r '.total' "$score_file")
else
echo "WARNING: No score file for $template"
SCORES[$template]="0"
fi
}
for template in "${TEMPLATES[@]}"; do
read_scores "$template"
done
# --- Compare to baseline (totals) ---
echo "=== Regression Check ==="
echo ""
printf "%-25s %8s %8s %8s %10s\n" "Template" "Score" "Min" "Expected" "Status"
printf "%-25s %8s %8s %8s %10s\n" "--------" "-----" "---" "--------" "------"
FAILED=0
for template in "${TEMPLATES[@]}"; do
score="${SCORES[$template]}"
min=$(jq -r --arg t "$template" '.evals[$t].min' "$BASELINE_FILE")
expected=$(jq -r --arg t "$template" '.evals[$t].expected' "$BASELINE_FILE")
max=$(jq -r --arg t "$template" '.evals[$t].max' "$BASELINE_FILE")
if [[ "$score" -lt "$min" ]]; then
status="FAIL"
FAILED=1
elif [[ "$score" -lt "$expected" ]]; then
status="WARN"
else
status="PASS"
fi
printf "%-25s %8s %8s %8s %10s\n" "$template" "$score/$max" "$min" "$expected" "$status"
done
echo ""
# --- Per-criterion regression check ---
echo "=== Per-Criterion Breakdown ==="
echo ""
CRITERION_FAILURES=0
for template in "${TEMPLATES[@]}"; do
result_dir="${RESULT_DIRS[$template]}"
# Check if baseline has per-criterion data
has_criteria=$(jq -r --arg t "$template" '.evals[$t].criteria // empty' "$BASELINE_FILE")
[[ -n "$has_criteria" ]] || continue
# Read actual criterion scores
local_criteria=""
if [[ "$NUM_RUNS" -gt 1 ]]; then
agg_file="$result_dir/skill.aggregate.json"
[[ -f "$agg_file" ]] && local_criteria=$(jq -r '.criteria' "$agg_file")
else
score_file="$result_dir/skill.score.json"
[[ -f "$score_file" ]] && local_criteria=$(jq -r '.criteria' "$score_file")
fi
[[ -n "$local_criteria" ]] || continue
echo "--- $template ---"
# Get criterion names from baseline
criteria_names=$(jq -r --arg t "$template" '.evals[$t].criteria | keys[]' "$BASELINE_FILE")
for crit in $criteria_names; do
crit_expected=$(jq -r --arg t "$template" --arg c "$crit" '.evals[$t].criteria[$c].expected' "$BASELINE_FILE")
crit_min=$(jq -r --arg t "$template" --arg c "$crit" '.evals[$t].criteria[$c].min' "$BASELINE_FILE")
# Get actual score
if [[ "$NUM_RUNS" -gt 1 ]]; then
actual=$(jq -r --arg c "$crit" '.criteria[$c].avg // 0' "$agg_file")
stddev=$(jq -r --arg c "$crit" '.criteria[$c].stddev // 0' "$agg_file")
actual_int=$(echo "$actual" | awk '{printf "%d", $1}')
score_display="${actual} (stddev=${stddev})"
else
actual=$(echo "$local_criteria" | jq -r --arg c "$crit" '.[$c] // 0')
actual_int="$actual"
score_display="$actual"
fi
if [[ "$actual_int" -lt "$crit_min" ]]; then
status="FAIL"
CRITERION_FAILURES=$((CRITERION_FAILURES + 1))
elif [[ "$actual_int" -lt "$crit_expected" ]]; then
status="WARN"
else
status="PASS"
fi
printf " %-40s %12s expected=%-3s min=%-3s %s\n" "$crit" "$score_display" "$crit_expected" "$crit_min" "$status"
done
echo ""
done
# --- Flaky criteria report (multi-run only) ---
if [[ "$NUM_RUNS" -gt 1 ]]; then
echo "=== Variance Report ==="
echo ""
any_flaky=0
for template in "${TEMPLATES[@]}"; do
result_dir="${RESULT_DIRS[$template]}"
agg_file="$result_dir/skill.aggregate.json"
[[ -f "$agg_file" ]] || continue
flaky=$(jq -r '.flaky_criteria // [] | .[]' "$agg_file" 2>/dev/null)
if [[ -n "$flaky" ]]; then
any_flaky=1
echo " $template:"
for crit in $flaky; do
scores=$(jq -r --arg c "$crit" '.criteria[$c].scores | map(tostring) | join(", ")' "$agg_file")
stddev=$(jq -r --arg c "$crit" '.criteria[$c].stddev' "$agg_file")
echo " $crit: [$scores] stddev=$stddev"
done
fi
done
if [[ "$any_flaky" -eq 0 ]]; then
echo " No flaky criteria detected across $NUM_RUNS runs."
fi
echo ""
fi
# --- Write summary for CI ---
SUMMARY_FILE="$EVAL_DIR/results/regression-summary.json"
mkdir -p "$(dirname "$SUMMARY_FILE")"
{
echo "{"
echo " \"timestamp\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\","
echo " \"runs_per_eval\": $NUM_RUNS,"
echo " \"passed\": $([ $FAILED -eq 0 ] && [ $CRITERION_FAILURES -eq 0 ] && echo true || echo false),"
echo " \"total_regressions\": $FAILED,"
echo " \"criterion_regressions\": $CRITERION_FAILURES,"
echo " \"results\": {"
first=1
for template in "${TEMPLATES[@]}"; do
[ $first -eq 0 ] && echo ","
first=0
score="${SCORES[$template]}"
min=$(jq -r --arg t "$template" '.evals[$t].min' "$BASELINE_FILE")
expected=$(jq -r --arg t "$template" '.evals[$t].expected' "$BASELINE_FILE")
printf ' "%s": { "score": %s, "min": %s, "expected": %s }' "$template" "$score" "$min" "$expected"
done
echo ""
echo " }"
echo "}"
} > "$SUMMARY_FILE"
echo "Summary: $SUMMARY_FILE"
if [[ $FAILED -eq 1 || $CRITERION_FAILURES -gt 0 ]]; then
echo ""
if [[ $FAILED -eq 1 ]]; then
echo "REGRESSION DETECTED: Total score below minimum threshold."
fi
if [[ $CRITERION_FAILURES -gt 0 ]]; then
echo "CRITERION REGRESSION: $CRITERION_FAILURES criterion(s) below minimum threshold."
fi
exit 1
fi
echo ""
echo "All evals passed regression check."
exit 0
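An illustrative invocation (scores are hypothetical; the table format comes from the printf calls above):

```bash
./codeflash-evals/check-regression.sh ranking --runs 1
# Template                     Score      Min Expected     Status
# --------                     -----      --- --------     ------
# ranking                       8/10        7        9       WARN
```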

View file

@@ -6,9 +6,8 @@ set -euo pipefail
# ./run-eval.sh <template> # run both skill + baseline
# ./run-eval.sh <template> --skill-only # run with-skill only
# ./run-eval.sh <template> --baseline-only # run baseline only
# ./run-eval.sh <template> --runs 3 # run 3 times, aggregate
# ./run-eval.sh --list # list available templates
#
# Templates: crossdomain-easy, crossdomain-hard, layered, ranking
EVAL_DIR="$(cd "$(dirname "$0")" && pwd)"
TEMPLATES_DIR="$EVAL_DIR/templates"
@@ -186,6 +185,39 @@ run_claude() {
fi
}
run_single() {
# Run a single eval iteration into $RUN_DIR
local eval_name=$1 mode=$2 manifest=$3 prompt=$4 version=$5
echo "$PROMPT" > /dev/null # ensure PROMPT is available
# --- With-skill run ---
if [ "$mode" != "--baseline-only" ]; then
local skill_workdir
if [ "$version" -ge 2 ]; then
skill_workdir=$(setup_workspace_v2 "$eval_name" "skill" "$manifest")
else
skill_workdir=$(setup_workspace_v1 "$eval_name" "skill")
fi
echo "$prompt" > "$skill_workdir/.eval-prompt"
run_claude "$skill_workdir" "with-skill" "$RUN_DIR/skill" "true"
echo ""
fi
# --- Baseline run ---
if [ "$mode" != "--skill-only" ]; then
local baseline_workdir
if [ "$version" -ge 2 ]; then
baseline_workdir=$(setup_workspace_v2 "$eval_name" "baseline" "$manifest")
else
baseline_workdir=$(setup_workspace_v1 "$eval_name" "baseline")
fi
echo "$prompt" > "$baseline_workdir/.eval-prompt"
run_claude "$baseline_workdir" "baseline" "$RUN_DIR/baseline" "false"
echo ""
fi
}
# --- Main ---
if [ "${1:-}" = "--list" ]; then
@@ -193,8 +225,33 @@ if [ "${1:-}" = "--list" ]; then
exit 0
fi
eval_name="${1:?Usage: $0 <eval-name> [--skill-only|--baseline-only]}"
mode="${2:---both}"
# --- Parse args ---
eval_name=""
mode="--both"
num_runs=1
while [[ $# -gt 0 ]]; do
case "$1" in
--skill-only|--baseline-only)
mode="$1"
shift
;;
--runs)
num_runs="${2:?--runs requires a number}"
shift 2
;;
-*)
die "Unknown flag: $1"
;;
*)
eval_name="$1"
shift
;;
esac
done
[[ -n "$eval_name" ]] || die "Usage: $0 <eval-name> [--skill-only|--baseline-only] [--runs N]"
EVAL_SOURCE_DIR=$(find_eval_dir "$eval_name")
[ -n "$EVAL_SOURCE_DIR" ] || die "Eval not found: $eval_name. Use --list to see available evals."
@@ -205,59 +262,76 @@ MANIFEST="$EVAL_SOURCE_DIR/manifest.json"
VERSION=$(jq -r '.version // 1' "$MANIFEST")
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
RUN_DIR="$RESULTS_DIR/${eval_name}-${TIMESTAMP}"
mkdir -p "$RUN_DIR"
# Copy manifest for reference
cp "$MANIFEST" "$RUN_DIR/manifest.json"
echo "Eval: $eval_name (v$VERSION)"
echo "Results: $RUN_DIR"
echo ""
# Build prompt
PROMPT=$(build_prompt "$MANIFEST")
echo "Prompt: $PROMPT"
echo ""
# --- With-skill run ---
if [ "$mode" != "--baseline-only" ]; then
if [ "$VERSION" -ge 2 ]; then
skill_workdir=$(setup_workspace_v2 "$eval_name" "skill" "$MANIFEST")
else
skill_workdir=$(setup_workspace_v1 "$eval_name" "skill")
fi
echo "$PROMPT" > "$skill_workdir/.eval-prompt"
run_claude "$skill_workdir" "with-skill" "$RUN_DIR/skill" "true"
if [ "$num_runs" -gt 1 ]; then
# Multi-run mode: create parent dir with run-1/, run-2/, etc.
PARENT_DIR="$RESULTS_DIR/${eval_name}-${TIMESTAMP}-${num_runs}runs"
mkdir -p "$PARENT_DIR"
cp "$MANIFEST" "$PARENT_DIR/manifest.json"
echo "Eval: $eval_name (v$VERSION) — $num_runs runs"
echo "Results: $PARENT_DIR"
echo "Prompt: $PROMPT"
echo ""
fi
# --- Baseline run ---
if [ "$mode" != "--skill-only" ]; then
if [ "$VERSION" -ge 2 ]; then
baseline_workdir=$(setup_workspace_v2 "$eval_name" "baseline" "$MANIFEST")
else
baseline_workdir=$(setup_workspace_v1 "$eval_name" "baseline")
fi
echo "$PROMPT" > "$baseline_workdir/.eval-prompt"
run_claude "$baseline_workdir" "baseline" "$RUN_DIR/baseline" "false"
for i in $(seq 1 "$num_runs"); do
echo "========================================"
echo " Run $i / $num_runs"
echo "========================================"
RUN_DIR="$PARENT_DIR/run-$i"
mkdir -p "$RUN_DIR"
cp "$MANIFEST" "$RUN_DIR/manifest.json"
run_single "$eval_name" "$mode" "$MANIFEST" "$PROMPT" "$VERSION"
# Score this run immediately
"$EVAL_DIR/score-eval.sh" "$RUN_DIR"
echo ""
done
# Aggregate across runs
echo "========================================"
echo " Aggregating $num_runs runs"
echo "========================================"
python3 "$EVAL_DIR/score.py" aggregate "$PARENT_DIR"
echo "=== Results ==="
echo "Directory: $PARENT_DIR"
echo "Runs: $num_runs"
echo ""
echo "Files:"
ls -1 "$PARENT_DIR/"
else
# Single-run mode (original behavior)
RUN_DIR="$RESULTS_DIR/${eval_name}-${TIMESTAMP}"
mkdir -p "$RUN_DIR"
cp "$MANIFEST" "$RUN_DIR/manifest.json"
echo "Eval: $eval_name (v$VERSION)"
echo "Results: $RUN_DIR"
echo "Prompt: $PROMPT"
echo ""
run_single "$eval_name" "$mode" "$MANIFEST" "$PROMPT" "$VERSION"
echo "=== Results ==="
echo "Directory: $RUN_DIR"
echo ""
if [ -f "$RUN_DIR/skill.duration" ] && [ -f "$RUN_DIR/baseline.duration" ]; then
skill_dur=$(cat "$RUN_DIR/skill.duration")
baseline_dur=$(cat "$RUN_DIR/baseline.duration")
echo "With-skill duration: ${skill_dur}s"
echo "Baseline duration: ${baseline_dur}s"
fi
echo ""
echo "Files:"
ls -1 "$RUN_DIR/"
echo ""
echo "Next step: run ./score-eval.sh $RUN_DIR to score the results"
fi
# --- Summary ---
echo "=== Results ==="
echo "Directory: $RUN_DIR"
echo ""
if [ -f "$RUN_DIR/skill.duration" ] && [ -f "$RUN_DIR/baseline.duration" ]; then
skill_dur=$(cat "$RUN_DIR/skill.duration")
baseline_dur=$(cat "$RUN_DIR/baseline.duration")
echo "With-skill duration: ${skill_dur}s"
echo "Baseline duration: ${baseline_dur}s"
fi
echo ""
echo "Files:"
ls -1 "$RUN_DIR/"
echo ""
echo "Next step: run ./score-eval.sh $RUN_DIR to score the results"

View file

@@ -4,10 +4,13 @@
Feeds the manifest rubric and full conversation to Claude, which scores
each criterion. Fully automated, no human input needed.
Usage: python3 score.py <results-dir>
Usage:
python3 score.py <results-dir> # score a single run
python3 score.py aggregate <parent-dir> # aggregate multiple runs
"""
import json
import math
import re
import subprocess
import sys
@@ -370,8 +373,101 @@ def score_variant(variant: str, results_dir: Path, manifest: dict) -> dict:
return result
def main():
results_dir = Path(sys.argv[1])
# --- Aggregation ---
def aggregate_runs(parent_dir: Path) -> int:
"""Aggregate scores from multiple runs into stats per criterion."""
run_dirs = sorted(parent_dir.glob("run-*/"))
if not run_dirs:
print(f"ERROR: No run-*/ directories found in {parent_dir}", file=sys.stderr)
return 1
for variant in ("skill", "baseline"):
score_files = [d / f"{variant}.score.json" for d in run_dirs]
score_files = [f for f in score_files if f.exists()]
if not score_files:
continue
scores = [json.loads(f.read_text()) for f in score_files]
n = len(scores)
# Aggregate totals
totals = [s["total"] for s in scores]
max_total = scores[0]["max"]
# Aggregate per-criterion
all_criteria = list(scores[0]["criteria"].keys())
criteria_stats = {}
for crit in all_criteria:
vals = [s["criteria"].get(crit, 0) for s in scores]
avg = sum(vals) / n
criteria_stats[crit] = {
"scores": vals,
"min": min(vals),
"max": max(vals),
"avg": round(avg, 1),
"stddev": round(math.sqrt(sum((v - avg) ** 2 for v in vals) / n), 2),
}
total_avg = sum(totals) / n
agg = {
"variant": variant,
"runs": n,
"total": {
"scores": totals,
"min": min(totals),
"max": max(totals),
"avg": round(total_avg, 1),
"stddev": round(math.sqrt(sum((v - total_avg) ** 2 for v in totals) / n), 2),
},
"max_possible": max_total,
"criteria": criteria_stats,
}
# Identify flaky criteria (stddev > 0)
flaky = [c for c, s in criteria_stats.items() if s["stddev"] > 0]
if flaky:
agg["flaky_criteria"] = flaky
# Collect durations
durations = [s.get("duration") for s in scores if s.get("duration") is not None]
if durations:
agg["duration"] = {
"min": min(durations),
"max": max(durations),
"avg": round(sum(durations) / len(durations), 1),
}
agg_path = parent_dir / f"{variant}.aggregate.json"
agg_path.write_text(json.dumps(agg, indent=2))
# Print summary
print(f"=== {variant} aggregate ({n} runs) ===")
print(f" Total: {agg['total']['avg']}/{max_total} "
f"(range {agg['total']['min']}-{agg['total']['max']}, "
f"stddev {agg['total']['stddev']})")
print()
for crit, stats in criteria_stats.items():
flaky_mark = " *" if stats["stddev"] > 0 else ""
print(f" {crit:40s} avg={stats['avg']:4.1f} "
f"range=[{stats['min']}-{stats['max']}] "
f"stddev={stats['stddev']}{flaky_mark}")
if flaky:
print(f"\n * flaky criteria (non-zero stddev): {', '.join(flaky)}")
if agg.get("duration"):
d = agg["duration"]
print(f"\n Duration: avg={d['avg']}s range=[{d['min']}-{d['max']}s]")
print()
return 0
def score_single(results_dir: Path) -> int:
"""Score a single results directory."""
manifest_path = results_dir / "manifest.json"
if not manifest_path.exists():
@@ -446,5 +542,20 @@ def main():
return 0
def main():
if len(sys.argv) < 2:
print("Usage: python3 score.py <results-dir>", file=sys.stderr)
print(" python3 score.py aggregate <parent-dir>", file=sys.stderr)
return 1
if sys.argv[1] == "aggregate":
if len(sys.argv) < 3:
print("Usage: python3 score.py aggregate <parent-dir>", file=sys.stderr)
return 1
return aggregate_runs(Path(sys.argv[2]))
return score_single(Path(sys.argv[1]))
if __name__ == "__main__":
sys.exit(main())
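The aggregate file written above can be inspected with jq; keys match `aggregate_runs`, values are illustrative:

```bash
jq -r '"total \(.total.avg)/\(.max_possible) stddev=\(.total.stddev) " +
       "flaky=\(.flaky_criteria // [] | join(","))"' skill.aggregate.json
# e.g. total 8.3/10 stddev=0.47 flaky=fixed_highest_impact_first
```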

View file

@@ -8,6 +8,7 @@
"id": "score-on2",
"file": "src/analytics/pipeline.py",
"function": "score_by_category",
"domain": "data-structures",
"description": "O(n²) nested loop counting category peers for each record",
"expected_fix": "Pre-compute category counts with Counter/defaultdict",
"impact_pct": 31
@@ -16,6 +17,7 @@
"id": "rank-insertion-sort",
"file": "src/analytics/pipeline.py",
"function": "rank_results",
"domain": "data-structures",
"description": "O(n²) insertion sort with custom comparator",
"expected_fix": "Use sorted() with key function",
"impact_pct": 31
@@ -24,6 +26,7 @@
"id": "summary-cross-category",
"file": "src/analytics/pipeline.py",
"function": "generate_summary",
"domain": "data-structures",
"description": "O(c² × n) cross-category source overlap with nested list scans per category pair",
"expected_fix": "Pre-build source sets per category, use set intersection",
"impact_pct": 36
@@ -32,6 +35,7 @@
"id": "enrich-deepcopy",
"file": "src/analytics/pipeline.py",
"function": "enrich_metadata",
"domain": "data-structures",
"description": "copy.deepcopy(config) per record",
"expected_fix": "Extract defaults once before loop",
"impact_pct": 0.9,
@@ -41,6 +45,7 @@
"id": "format-json-roundtrip",
"file": "src/analytics/pipeline.py",
"function": "format_output",
"domain": "data-structures",
"description": "Double JSON serialization per record for integrity check",
"expected_fix": "Single serialization or remove round-trip",
"impact_pct": 0.4,
@@ -50,6 +55,7 @@
"id": "normalize-string-concat",
"file": "src/analytics/pipeline.py",
"function": "normalize_fields",
"domain": "data-structures",
"description": "Character-by-character string concatenation in loop",
"expected_fix": "Use split/join or regex",
"impact_pct": 0.2,
@@ -59,6 +65,7 @@
"id": "dedup-list-scan",
"file": "src/analytics/pipeline.py",
"function": "deduplicate",
"domain": "data-structures",
"description": "List-based ID dedup with O(n) scan per record",
"expected_fix": "Use set for seen IDs",
"impact_pct": 0.2,
@@ -68,6 +75,7 @@
"id": "parse-regex-compile",
"file": "src/analytics/pipeline.py",
"function": "parse_records",
"domain": "data-structures",
"description": "re.compile() called per field instead of once",
"expected_fix": "Compile regex once outside the loop",
"impact_pct": 0.1,
@@ -77,6 +85,7 @@
"id": "validate-list-blocklist",
"file": "src/analytics/pipeline.py",
"function": "validate_records",
"domain": "data-structures",
"description": "List-based blocklist and required fields checks",
"expected_fix": "Convert to sets",
"impact_pct": 0.05,
@@ -86,6 +95,7 @@
"id": "filter-list-tags",
"file": "src/analytics/pipeline.py",
"function": "apply_filters",
"domain": "data-structures",
"description": "Nested loop checking tags against blocklist",
"expected_fix": "Use set intersection",
"impact_pct": 0.03,

View file

@@ -1,10 +0,0 @@
{
"version": 2,
"updated": "2026-03-27",
"note": "Deterministic auto-scoring for profiler usage, iterative profiling, and ranked list reduces LLM variance. Thresholds tightened from expected-3 to expected-2.",
"evals": {
"ranking": { "expected": 9, "min": 7, "max": 10 },
"memory-hard": { "expected": 9, "min": 7, "max": 10 },
"memory-misdirection": { "expected": 9, "min": 7, "max": 10 }
}
}

View file

@@ -1,178 +0,0 @@
#!/bin/bash
set -euo pipefail
# Eval regression checker for codeflash-agent plugin
#
# Runs a subset of evals, scores them, and compares to checked-in baselines.
# Exits 1 if any score drops below the minimum threshold.
#
# Usage:
# ./check-regression.sh # run all baseline evals
# ./check-regression.sh ranking # run specific template(s)
# ./check-regression.sh --score-only <dir> # score existing results, skip running
EVAL_DIR="$(cd "$(dirname "$0")" && pwd)"
BASELINE_FILE="$EVAL_DIR/baseline-scores.json"
die() { echo "ERROR: $*" >&2; exit 1; }
[ -f "$BASELINE_FILE" ] || die "Baseline file not found: $BASELINE_FILE"
# --- Parse args ---
SCORE_ONLY=""
RESULTS_DIRS=()
TEMPLATES=()
while [[ $# -gt 0 ]]; do
case "$1" in
--score-only)
SCORE_ONLY=1
shift
while [[ $# -gt 0 && ! "$1" =~ ^-- ]]; do
RESULTS_DIRS+=("$1")
shift
done
;;
*)
TEMPLATES+=("$1")
shift
;;
esac
done
# If no templates specified, use all from baseline file
if [[ ${#TEMPLATES[@]} -eq 0 && -z "$SCORE_ONLY" ]]; then
mapfile -t TEMPLATES < <(jq -r '.evals | keys[]' "$BASELINE_FILE")
fi
# --- Score-only mode ---
if [[ -n "$SCORE_ONLY" ]]; then
[[ ${#RESULTS_DIRS[@]} -gt 0 ]] || die "Usage: $0 --score-only <results-dir> [<results-dir> ...]"
for dir in "${RESULTS_DIRS[@]}"; do
[ -d "$dir" ] || die "Results directory not found: $dir"
echo "Scoring: $dir"
"$EVAL_DIR/score-eval.sh" "$dir"
done
echo ""
echo "Scored ${#RESULTS_DIRS[@]} result(s). Compare manually or re-run without --score-only."
exit 0
fi
# --- Run evals ---
echo "=== Eval Regression Check ==="
echo "Templates: ${TEMPLATES[*]}"
echo "Baseline: $BASELINE_FILE"
echo ""
declare -A RESULT_DIRS
for template in "${TEMPLATES[@]}"; do
# Verify template exists in baseline
min=$(jq -r --arg t "$template" '.evals[$t].min // empty' "$BASELINE_FILE")
[[ -n "$min" ]] || die "Template '$template' not found in baseline file"
echo "--- Running: $template ---"
output=$("$EVAL_DIR/run-eval.sh" "$template" --skill-only 2>&1)
echo "$output"
# Extract the results directory from run-eval output
result_dir=$(echo "$output" | grep "^Results:" | head -1 | awk '{print $2}')
[[ -n "$result_dir" && -d "$result_dir" ]] || die "Could not find results directory for $template"
RESULT_DIRS[$template]="$result_dir"
echo ""
done
# --- Score ---
echo "=== Scoring ==="
echo ""
declare -A SCORES
for template in "${TEMPLATES[@]}"; do
result_dir="${RESULT_DIRS[$template]}"
echo "--- Scoring: $template ---"
"$EVAL_DIR/score-eval.sh" "$result_dir"
# Read the score
score_file="$result_dir/skill.score.json"
if [[ -f "$score_file" ]]; then
score=$(jq -r '.total' "$score_file")
SCORES[$template]="$score"
else
echo "WARNING: No score file for $template"
SCORES[$template]="0"
fi
echo ""
done
# --- Compare to baseline ---
echo "=== Regression Check ==="
echo ""
printf "%-25s %8s %8s %8s %10s\n" "Template" "Score" "Min" "Expected" "Status"
printf "%-25s %8s %8s %8s %10s\n" "--------" "-----" "---" "--------" "------"
FAILED=0
for template in "${TEMPLATES[@]}"; do
score="${SCORES[$template]}"
min=$(jq -r --arg t "$template" '.evals[$t].min' "$BASELINE_FILE")
expected=$(jq -r --arg t "$template" '.evals[$t].expected' "$BASELINE_FILE")
max=$(jq -r --arg t "$template" '.evals[$t].max' "$BASELINE_FILE")
if [[ "$score" -lt "$min" ]]; then
status="FAIL"
FAILED=1
elif [[ "$score" -lt "$expected" ]]; then
status="WARN"
else
status="PASS"
fi
printf "%-25s %8s %8s %8s %10s\n" "$template" "$score/$max" "$min" "$expected" "$status"
done
echo ""
# --- Write summary for CI ---
SUMMARY_FILE="$EVAL_DIR/results/regression-summary.json"
mkdir -p "$(dirname "$SUMMARY_FILE")"
{
echo "{"
echo " \"timestamp\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\","
echo " \"passed\": $([ $FAILED -eq 0 ] && echo true || echo false),"
echo " \"results\": {"
first=1
for template in "${TEMPLATES[@]}"; do
[ $first -eq 0 ] && echo ","
first=0
score="${SCORES[$template]}"
min=$(jq -r --arg t "$template" '.evals[$t].min' "$BASELINE_FILE")
expected=$(jq -r --arg t "$template" '.evals[$t].expected' "$BASELINE_FILE")
printf ' "%s": { "score": %s, "min": %s, "expected": %s }' "$template" "$score" "$min" "$expected"
done
echo ""
echo " }"
echo "}"
} > "$SUMMARY_FILE"
echo "Summary: $SUMMARY_FILE"
if [[ $FAILED -eq 1 ]]; then
echo ""
echo "REGRESSION DETECTED: One or more evals scored below minimum threshold."
exit 1
fi
echo ""
echo "All evals passed regression check."
exit 0

View file

@@ -1,31 +1,45 @@
---
name: codeflash-optimize
description: >-
This skill should be used when the user asks to "optimize my code", "start an optimization
session", "resume optimization", "check optimization status", "make this faster", "reduce
memory usage", "fix slow functions", or "run performance experiments". Covers CPU, async,
memory, and codebase structure optimization.
Profiles code, identifies bottlenecks, runs benchmarks, and applies targeted optimizations
across CPU, async, memory, and codebase structure domains. Use when the user asks to
"optimize my code", "start an optimization session", "resume optimization", "check
optimization status", "make this faster", "reduce memory usage", "fix slow functions",
or "run performance experiments".
allowed-tools: "Agent, AskUserQuestion, Read"
argument-hint: "[start|resume|status]"
allowed-tools: ["Agent", "AskUserQuestion", "Read"]
---
Optimization session launcher.
Optimization session launcher. Routes to the **codeflash** agent in one of three modes.
## For `start` (or no arguments)
Before launching the agent, use the AskUserQuestion tool to ask: "Before I start optimizing, is there anything I should know? For example: areas to avoid, known constraints, things you've already tried, or specific files to focus on. Or just say 'go' to proceed."
**Step 1.** Use AskUserQuestion to ask:
Wait for the user's response. Then use the Agent tool to launch the **codeflash** agent with `run_in_background: true`. Include the user's original request AND their answer in the prompt. Also include this directive at the top of the prompt:
> Before I start optimizing, is there anything I should know? For example: areas to avoid, known constraints, things you've already tried, or specific files to focus on. Or just say 'go' to proceed.
**Step 2.** After the user responds, launch the agent with these exact parameters:
- **Agent name:** `codeflash`
- **run_in_background:** `true`
- **Prompt:** The prompt must contain exactly three parts in this order, and nothing else:
Part 1 — the AUTONOMOUS MODE directive (copy verbatim):
```
AUTONOMOUS MODE: The user has already been asked for context (included below). Do NOT ask the user any questions — work fully autonomously. Make all decisions yourself: generate a run tag from today's date, identify benchmark tiers from available tests, choose optimization targets from profiler output. If something is ambiguous, pick the reasonable default and document your choice in HANDOFF.md.
```
Part 2 — the user's original request (verbatim).
Part 3 — the user's answer from Step 1 (verbatim).
Do not add any other instructions — the agent has its own workflow.
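Assembled, the Step 2 prompt could look like this (user text is hypothetical; the directive is abbreviated here, copy it in full from above):

```
AUTONOMOUS MODE: The user has already been asked for context (included below). Do NOT ask the user any questions — work fully autonomously. [...]

Optimize the data pipeline in src/analytics/, it has gotten slow.

Avoid touching the public API; I have already tried caching.
```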
## For `resume`
Use the Agent tool to launch the **codeflash** agent with `run_in_background: true`. Pass `resume` and the user's request as the prompt. Include this at the top:
Launch the agent with these exact parameters:
- **Agent name:** `codeflash`
- **run_in_background:** `true`
- **Prompt:** The directive below (verbatim), followed by `resume` and the user's request:
```
AUTONOMOUS MODE: Work fully autonomously. Do NOT ask the user any questions. Read session state from .codeflash/ and continue where the last session left off.
@@ -33,4 +47,9 @@ AUTONOMOUS MODE: Work fully autonomously. Do NOT ask the user any questions. Rea
## For `status`
Use the Agent tool to launch the **codeflash** agent. Pass `status` as the prompt. Do NOT run in background — wait for the result and show it to the user.
Launch the agent with these exact parameters:
- **Agent name:** `codeflash`
- **run_in_background:** `false` (wait for the result)
- **Prompt:** `status`
Show the agent's result to the user.