Merge pull request #6 from codeflash-ai/feat/tool-configs

feat: improve skill and eval system
Kevin Turcios 2026-03-27 11:31:30 -05:00 committed by GitHub
commit 7fab0082c0
75 changed files with 680 additions and 278 deletions

View file

@@ -48,7 +48,7 @@ jobs:
ANTHROPIC_MODEL: us.anthropic.claude-sonnet-4-6
CLAUDE_CODE_USE_BEDROCK: 1
run: |
chmod +x evals/check-regression.sh evals/run-eval.sh evals/score-eval.sh
chmod +x codeflash-evals/check-regression.sh codeflash-evals/run-eval.sh codeflash-evals/score-eval.sh
ARGS=()
if [ -n "${{ inputs.templates }}" ]; then
@@ -58,20 +58,20 @@ jobs:
done
fi
./evals/check-regression.sh "${ARGS[@]}"
./codeflash-evals/check-regression.sh "${ARGS[@]}"
- name: Upload results
if: always()
uses: actions/upload-artifact@v4
with:
name: eval-results-${{ github.run_number }}
path: evals/results/
path: codeflash-evals/results/
retention-days: 30
- name: Post job summary
if: always()
run: |
SUMMARY="evals/results/regression-summary.json"
SUMMARY="codeflash-evals/results/regression-summary.json"
if [ ! -f "$SUMMARY" ]; then
echo "::warning::No regression summary found"
exit 0
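The summary parsed here is produced by `check-regression.sh`, added later in this commit. As a minimal sketch of how the rest of this truncated step could consume it (field names match the script; the jq calls are illustrative):

```bash
# Illustrative continuation of the truncated step above
fi
PASSED=$(jq -r '.passed' "$SUMMARY")
RUNS=$(jq -r '.runs_per_eval' "$SUMMARY")
{
  echo "## Eval Regression Summary ($RUNS run(s) per eval)"
  jq -r '.results | to_entries[] |
    "- \(.key): \(.value.score) (min \(.value.min), expected \(.value.expected))"' "$SUMMARY"
} >> "$GITHUB_STEP_SUMMARY"
[ "$PASSED" = "true" ] || echo "::error::Eval regression detected"
```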

View file

@@ -48,13 +48,11 @@ jobs:
use_sticky_comment: true
track_progress: true
show_full_output: true
plugins: "plugin-dev@claude-plugins-official"
plugin_marketplaces: "https://github.com/anthropics/claude-plugins-official.git"
prompt: |
You are validating the codeflash-agent Claude Code plugin. This plugin has:
- 6 agents in `agents/` (router + setup + 4 domain agents)
- 2 skills in `skills/` (codeflash-optimize, memray-profiling)
- Eval templates in `evals/templates/`
- Eval templates in `codeflash-evals/templates/`
- Plugin manifest at `.claude-plugin/plugin.json`
- No hooks directory
@@ -66,22 +64,27 @@ jobs:
2. Classify changes:
- AGENTS: files in `agents/`
- SKILLS: files in `skills/`
- EVALS: files in `evals/`
- PLUGIN_CONFIG: `.claude-plugin/plugin.json`, hooks, `.mcp.json`
- EVALS: files in `codeflash-evals/`
- PLUGIN_CONFIG: `.claude-plugin/plugin.json`, hooks
- DOCS: `*.md` outside agents/skills, LICENSE
- OTHER: anything else
3. Record which categories have changes — later steps only run if relevant.
</step>
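A rough shell equivalent of the classification rules above, purely as a sketch (in the workflow the agent classifies the diff itself; `GITHUB_BASE_REF` is an assumption):

```bash
# Sketch: path-based classification mirroring the categories above.
# Order matters: agents/ and skills/ match before the *.md DOCS rule.
git diff --name-only "origin/${GITHUB_BASE_REF:-main}..." | while read -r f; do
  case "$f" in
    agents/*)                           echo "AGENTS: $f" ;;
    skills/*)                           echo "SKILLS: $f" ;;
    codeflash-evals/*)                  echo "EVALS: $f" ;;
    .claude-plugin/plugin.json|hooks/*) echo "PLUGIN_CONFIG: $f" ;;
    *.md|LICENSE)                       echo "DOCS: $f" ;;
    *)                                  echo "OTHER: $f" ;;
  esac
done
```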
<step name="plugin_structure">
Use the Agent tool to launch the **plugin-dev:plugin-validator** agent with this prompt:
"Validate the plugin at the current directory. Check plugin.json, all agent frontmatter, all skill frontmatter, and hook configuration. Report any issues found."
First, use the Agent tool to launch a **claude-code-guide** agent with this prompt:
"Look up the full Claude Code plugin specification. I need the required and optional fields for:
1. plugin.json manifest schema
2. Agent .md frontmatter (YAML between --- markers) — all valid fields
3. Skill SKILL.md frontmatter — all valid fields
Return the complete field lists with types and whether each is required."
This agent knows the full Claude Code plugin spec and will check:
- plugin.json schema and required fields
- Agent YAML frontmatter (name, description, model, tools, color)
- Skill YAML frontmatter (name, description, allowed-tools)
- File cross-references and directory structure
Then, using the spec returned by that agent, validate this plugin:
- Read `.claude-plugin/plugin.json` and check against the plugin.json schema
- Read each `agents/*.md` and validate frontmatter fields against the agent spec
- Read each `skills/*/SKILL.md` and validate frontmatter fields against the skill spec
- Check file cross-references (agents referenced in plugin.json exist, skills referenced in agent frontmatter exist)
- Report any issues found
</step>
<step name="agent_consistency">
@@ -106,7 +109,7 @@ jobs:
<step name="eval_manifests">
Only run if EVALS changed.
For each `evals/templates/*/manifest.json`:
For each `codeflash-evals/templates/*/manifest.json`:
1. Verify valid JSON.
2. Verify required fields: `name`, `eval_type`, `bugs` (array), `rubric` (object with `criteria`).
3. Verify each bug has: `id`, `file`, `description`, `domain`.
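A minimal manifest passing these checks might look like the sketch below; every value is hypothetical (real manifests appear later in this commit), and the shape of each `criteria` entry is assumed:

```bash
# Hypothetical minimal manifest satisfying checks 1-3 above
cat > manifest.json <<'EOF'
{
  "name": "example-eval",
  "eval_type": "optimization",
  "bugs": [
    { "id": "example-bug", "file": "src/example.py",
      "description": "O(n^2) scan in a hot loop", "domain": "data-structures" }
  ],
  "rubric": { "criteria": { "fixed_example_bug": { "points": 2 } } }
}
EOF
# Exit code reflects whether the required fields are present
jq -e '.name and .eval_type and (.bugs | type == "array") and (.rubric | has("criteria"))' manifest.json
```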
@@ -118,9 +121,18 @@ jobs:
<step name="skill_review">
Only run if SKILLS changed.
Use the Agent tool to launch the **plugin-dev:skill-reviewer** agent with this prompt:
"Review all skills in the `skills/` directory. Check description quality, triggering accuracy,
allowed-tools restrictions, and whether the skill content follows best practices."
First, use the Agent tool to launch a **claude-code-guide** agent with this prompt:
"Look up Claude Code skill best practices. I need:
1. What makes a good skill description (trigger terms, specificity, completeness)
2. Best practices for allowed-tools restrictions
3. Best practices for skill content structure (conciseness, actionability, progressive disclosure)
Return the complete guidelines."
Then, using those guidelines, review each skill in `skills/`:
- Check description quality and trigger term coverage
- Check allowed-tools restrictions are appropriate
- Check content follows best practices (concise, actionable, clear workflow)
- Report any issues found
</step>
<step name="summary">
@@ -129,7 +141,7 @@ jobs:
## Plugin Validation
### Plugin Structure
(plugin-validator findings or "All checks passed")
(validation findings or "All checks passed")
### Agent Consistency
(experiment loop check results or "Not applicable — no agent changes")
@@ -138,10 +150,10 @@ jobs:
(manifest validation results or "Not applicable — no eval changes")
### Skill Review
(skill-reviewer findings or "Not applicable — no skill changes")
(skill review findings or "Not applicable — no skill changes")
---
*Validated by plugin-dev + codeflash-agent checks*
*Validated by claude-code-guide + codeflash-agent checks*
</step>
<step name="verdict">
@@ -234,6 +246,4 @@ jobs:
uses: anthropics/claude-code-action@v1
with:
use_bedrock: "true"
plugins: "plugin-dev@claude-plugins-official"
plugin_marketplaces: "https://github.com/anthropics/claude-plugins-official.git"
claude_args: '--model us.anthropic.claude-sonnet-4-6 --allowedTools "Agent,Read,Edit,Write,Glob,Grep,Bash(git status*),Bash(git diff*),Bash(git add *),Bash(git commit *),Bash(git push*),Bash(git log*),Bash(gh pr comment*),Bash(gh pr view*),Bash(gh pr diff*)"'

View file

@@ -15,6 +15,8 @@ description: >
model: sonnet
color: red
memory: project
skills:
- memray-profiling
tools: ["Read", "Bash", "Glob", "Grep", "Write"]
---

View file

@@ -0,0 +1,43 @@
{
"version": 3,
"updated": "2026-03-27",
"note": "v3: per-criterion baselines for pinpointed regression detection",
"evals": {
"ranking": {
"expected": 9,
"min": 7,
"max": 10,
"criteria": {
"built_ranked_list_with_impact_pct": { "expected": 3, "min": 2 },
"fixed_highest_impact_first": { "expected": 2, "min": 1 },
"skipped_low_impact_targets": { "expected": 3, "min": 2 },
"reprofiled_after_major_fix": { "expected": 2, "min": 1 }
}
},
"memory-hard": {
"expected": 9,
"min": 7,
"max": 10,
"criteria": {
"used_memory_profiler": { "expected": 2, "min": 2 },
"profiled_per_stage": { "expected": 2, "min": 1 },
"identified_dominant_allocator": { "expected": 3, "min": 2 },
"fixed_dominant_issue": { "expected": 2, "min": 1 },
"fixed_secondary_issues": { "expected": 1, "min": 0 }
}
},
"memory-misdirection": {
"expected": 9,
"min": 7,
"max": 10,
"criteria": {
"used_memory_profiler": { "expected": 1, "min": 1 },
"profiled_iteratively": { "expected": 2, "min": 1 },
"identified_analytics_as_major": { "expected": 2, "min": 1 },
"fixed_analytics_details": { "expected": 2, "min": 1 },
"fixed_other_issues": { "expected": 2, "min": 1 },
"tests_pass": { "expected": 1, "min": 1 }
}
}
}
}
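`check-regression.sh` (next file) reads these thresholds per criterion; for example:

```bash
# Paths match this file; PASS/WARN/FAIL logic is in check-regression.sh
jq -r '.evals.ranking.criteria.built_ranked_list_with_impact_pct
       | "expected=\(.expected) min=\(.min)"' codeflash-evals/baseline-scores.json
# -> expected=3 min=2   (score >= expected passes, >= min warns, below min fails)
```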

View file

@@ -0,0 +1,321 @@
#!/bin/bash
set -euo pipefail
# Eval regression checker for codeflash-agent plugin
#
# Runs evals, scores them, and compares to checked-in baselines.
# Reports per-criterion regressions so you know exactly what broke.
# Exits 1 if any score drops below the minimum threshold.
#
# Usage:
# ./check-regression.sh # run all baseline evals (3 runs each)
# ./check-regression.sh ranking # run specific template(s)
# ./check-regression.sh --runs 1 # single run (faster, less reliable)
# ./check-regression.sh --score-only <dir> # score existing results, skip running
EVAL_DIR="$(cd "$(dirname "$0")" && pwd)"
BASELINE_FILE="$EVAL_DIR/baseline-scores.json"
die() { echo "ERROR: $*" >&2; exit 1; }
[ -f "$BASELINE_FILE" ] || die "Baseline file not found: $BASELINE_FILE"
# --- Parse args ---
SCORE_ONLY=""
RESULTS_DIRS=()
TEMPLATES=()
NUM_RUNS=3
while [[ $# -gt 0 ]]; do
case "$1" in
--score-only)
SCORE_ONLY=1
shift
while [[ $# -gt 0 && ! "$1" =~ ^-- ]]; do
RESULTS_DIRS+=("$1")
shift
done
;;
--runs)
NUM_RUNS="${2:?--runs requires a number}"
shift 2
;;
*)
TEMPLATES+=("$1")
shift
;;
esac
done
# If no templates specified, use all from baseline file
if [[ ${#TEMPLATES[@]} -eq 0 && -z "$SCORE_ONLY" ]]; then
mapfile -t TEMPLATES < <(jq -r '.evals | keys[]' "$BASELINE_FILE")
fi
# --- Score-only mode ---
if [[ -n "$SCORE_ONLY" ]]; then
[[ ${#RESULTS_DIRS[@]} -gt 0 ]] || die "Usage: $0 --score-only <results-dir> [<results-dir> ...]"
for dir in "${RESULTS_DIRS[@]}"; do
[ -d "$dir" ] || die "Results directory not found: $dir"
echo "Scoring: $dir"
# Check if this is a multi-run parent dir
if ls "$dir"/run-*/ >/dev/null 2>&1; then
for run_dir in "$dir"/run-*/; do
"$EVAL_DIR/score-eval.sh" "$run_dir"
done
python3 "$EVAL_DIR/score.py" aggregate "$dir"
else
"$EVAL_DIR/score-eval.sh" "$dir"
fi
done
echo ""
echo "Scored ${#RESULTS_DIRS[@]} result(s). Compare manually or re-run without --score-only."
exit 0
fi
# --- Run evals ---
echo "=== Eval Regression Check ==="
echo "Templates: ${TEMPLATES[*]}"
echo "Runs per eval: $NUM_RUNS"
echo "Baseline: $BASELINE_FILE"
echo ""
declare -A RESULT_DIRS
for template in "${TEMPLATES[@]}"; do
# Verify template exists in baseline
min=$(jq -r --arg t "$template" '.evals[$t].min // empty' "$BASELINE_FILE")
[[ -n "$min" ]] || die "Template '$template' not found in baseline file"
echo "--- Running: $template ---"
run_args=("$template" --skill-only)
if [[ "$NUM_RUNS" -gt 1 ]]; then
run_args+=(--runs "$NUM_RUNS")
fi
output=$("$EVAL_DIR/run-eval.sh" "${run_args[@]}" 2>&1)
echo "$output"
# Extract the results directory from run-eval output
result_dir=$(echo "$output" | grep "^Directory:" | head -1 | awk '{print $2}')
# Fallback to "Results:" prefix
if [[ -z "$result_dir" || ! -d "$result_dir" ]]; then
result_dir=$(echo "$output" | grep "^Results:" | head -1 | awk '{print $2}')
fi
[[ -n "$result_dir" && -d "$result_dir" ]] || die "Could not find results directory for $template"
RESULT_DIRS[$template]="$result_dir"
echo ""
done
# --- Score (single-run only, multi-run scores during run-eval.sh) ---
if [[ "$NUM_RUNS" -eq 1 ]]; then
echo "=== Scoring ==="
echo ""
for template in "${TEMPLATES[@]}"; do
result_dir="${RESULT_DIRS[$template]}"
echo "--- Scoring: $template ---"
"$EVAL_DIR/score-eval.sh" "$result_dir"
echo ""
done
fi
# --- Read scores ---
declare -A SCORES
read_scores() {
local template=$1
local result_dir="${RESULT_DIRS[$template]}"
if [[ "$NUM_RUNS" -gt 1 ]]; then
# Multi-run: read from aggregate
local agg_file="$result_dir/skill.aggregate.json"
if [[ -f "$agg_file" ]]; then
SCORES[$template]=$(jq -r '.total.avg | floor' "$agg_file")
return
fi
fi
# Single-run: read from score file
local score_file="$result_dir/skill.score.json"
if [[ -f "$score_file" ]]; then
SCORES[$template]=$(jq -r '.total' "$score_file")
else
echo "WARNING: No score file for $template"
SCORES[$template]="0"
fi
}
for template in "${TEMPLATES[@]}"; do
read_scores "$template"
done
# --- Compare to baseline (totals) ---
echo "=== Regression Check ==="
echo ""
printf "%-25s %8s %8s %8s %10s\n" "Template" "Score" "Min" "Expected" "Status"
printf "%-25s %8s %8s %8s %10s\n" "--------" "-----" "---" "--------" "------"
FAILED=0
for template in "${TEMPLATES[@]}"; do
score="${SCORES[$template]}"
min=$(jq -r --arg t "$template" '.evals[$t].min' "$BASELINE_FILE")
expected=$(jq -r --arg t "$template" '.evals[$t].expected' "$BASELINE_FILE")
max=$(jq -r --arg t "$template" '.evals[$t].max' "$BASELINE_FILE")
if [[ "$score" -lt "$min" ]]; then
status="FAIL"
FAILED=1
elif [[ "$score" -lt "$expected" ]]; then
status="WARN"
else
status="PASS"
fi
printf "%-25s %8s %8s %8s %10s\n" "$template" "$score/$max" "$min" "$expected" "$status"
done
echo ""
# --- Per-criterion regression check ---
echo "=== Per-Criterion Breakdown ==="
echo ""
CRITERION_FAILURES=0
for template in "${TEMPLATES[@]}"; do
result_dir="${RESULT_DIRS[$template]}"
# Check if baseline has per-criterion data
has_criteria=$(jq -r --arg t "$template" '.evals[$t].criteria // empty' "$BASELINE_FILE")
[[ -n "$has_criteria" ]] || continue
# Read actual criterion scores
local_criteria=""
if [[ "$NUM_RUNS" -gt 1 ]]; then
agg_file="$result_dir/skill.aggregate.json"
[[ -f "$agg_file" ]] && local_criteria=$(jq -r '.criteria' "$agg_file")
else
score_file="$result_dir/skill.score.json"
[[ -f "$score_file" ]] && local_criteria=$(jq -r '.criteria' "$score_file")
fi
[[ -n "$local_criteria" ]] || continue
echo "--- $template ---"
# Get criterion names from baseline
criteria_names=$(jq -r --arg t "$template" '.evals[$t].criteria | keys[]' "$BASELINE_FILE")
for crit in $criteria_names; do
crit_expected=$(jq -r --arg t "$template" --arg c "$crit" '.evals[$t].criteria[$c].expected' "$BASELINE_FILE")
crit_min=$(jq -r --arg t "$template" --arg c "$crit" '.evals[$t].criteria[$c].min' "$BASELINE_FILE")
# Get actual score
if [[ "$NUM_RUNS" -gt 1 ]]; then
actual=$(jq -r --arg c "$crit" '.criteria[$c].avg // 0' "$agg_file")
stddev=$(jq -r --arg c "$crit" '.criteria[$c].stddev // 0' "$agg_file")
actual_int=$(echo "$actual" | awk '{printf "%d", $1}')
score_display="${actual} (stddev=${stddev})"
else
actual=$(echo "$local_criteria" | jq -r --arg c "$crit" '.[$c] // 0')
actual_int="$actual"
score_display="$actual"
fi
if [[ "$actual_int" -lt "$crit_min" ]]; then
status="FAIL"
CRITERION_FAILURES=$((CRITERION_FAILURES + 1))
elif [[ "$actual_int" -lt "$crit_expected" ]]; then
status="WARN"
else
status="PASS"
fi
printf " %-40s %12s expected=%-3s min=%-3s %s\n" "$crit" "$score_display" "$crit_expected" "$crit_min" "$status"
done
echo ""
done
# --- Flaky criteria report (multi-run only) ---
if [[ "$NUM_RUNS" -gt 1 ]]; then
echo "=== Variance Report ==="
echo ""
any_flaky=0
for template in "${TEMPLATES[@]}"; do
result_dir="${RESULT_DIRS[$template]}"
agg_file="$result_dir/skill.aggregate.json"
[[ -f "$agg_file" ]] || continue
flaky=$(jq -r '.flaky_criteria // [] | .[]' "$agg_file" 2>/dev/null)
if [[ -n "$flaky" ]]; then
any_flaky=1
echo " $template:"
for crit in $flaky; do
scores=$(jq -r --arg c "$crit" '.criteria[$c].scores | map(tostring) | join(", ")' "$agg_file")
stddev=$(jq -r --arg c "$crit" '.criteria[$c].stddev' "$agg_file")
echo " $crit: [$scores] stddev=$stddev"
done
fi
done
if [[ "$any_flaky" -eq 0 ]]; then
echo " No flaky criteria detected across $NUM_RUNS runs."
fi
echo ""
fi
# --- Write summary for CI ---
SUMMARY_FILE="$EVAL_DIR/results/regression-summary.json"
mkdir -p "$(dirname "$SUMMARY_FILE")"
{
echo "{"
echo " \"timestamp\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\","
echo " \"runs_per_eval\": $NUM_RUNS,"
echo " \"passed\": $([ $FAILED -eq 0 ] && [ $CRITERION_FAILURES -eq 0 ] && echo true || echo false),"
echo " \"total_regressions\": $FAILED,"
echo " \"criterion_regressions\": $CRITERION_FAILURES,"
echo " \"results\": {"
first=1
for template in "${TEMPLATES[@]}"; do
[ $first -eq 0 ] && echo ","
first=0
score="${SCORES[$template]}"
min=$(jq -r --arg t "$template" '.evals[$t].min' "$BASELINE_FILE")
expected=$(jq -r --arg t "$template" '.evals[$t].expected' "$BASELINE_FILE")
printf ' "%s": { "score": %s, "min": %s, "expected": %s }' "$template" "$score" "$min" "$expected"
done
echo ""
echo " }"
echo "}"
} > "$SUMMARY_FILE"
echo "Summary: $SUMMARY_FILE"
if [[ $FAILED -eq 1 || $CRITERION_FAILURES -gt 0 ]]; then
echo ""
if [[ $FAILED -eq 1 ]]; then
echo "REGRESSION DETECTED: Total score below minimum threshold."
fi
if [[ $CRITERION_FAILURES -gt 0 ]]; then
echo "CRITERION REGRESSION: $CRITERION_FAILURES criterion(s) below minimum threshold."
fi
exit 1
fi
echo ""
echo "All evals passed regression check."
exit 0
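An illustrative invocation (scores are hypothetical; the table format comes from the printf calls above):

```bash
./codeflash-evals/check-regression.sh ranking --runs 1
# Template                     Score      Min Expected     Status
# --------                     -----      --- --------     ------
# ranking                       8/10        7        9       WARN
```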

View file

@@ -6,9 +6,8 @@ set -euo pipefail
# ./run-eval.sh <template> # run both skill + baseline
# ./run-eval.sh <template> --skill-only # run with-skill only
# ./run-eval.sh <template> --baseline-only # run baseline only
# ./run-eval.sh <template> --runs 3 # run 3 times, aggregate
# ./run-eval.sh --list # list available templates
#
# Templates: crossdomain-easy, crossdomain-hard, layered, ranking
EVAL_DIR="$(cd "$(dirname "$0")" && pwd)"
TEMPLATES_DIR="$EVAL_DIR/templates"
@@ -186,6 +185,39 @@ run_claude() {
fi
}
run_single() {
# Run a single eval iteration into $RUN_DIR
local eval_name=$1 mode=$2 manifest=$3 prompt=$4 version=$5
echo "$PROMPT" > /dev/null # ensure PROMPT is available
# --- With-skill run ---
if [ "$mode" != "--baseline-only" ]; then
local skill_workdir
if [ "$version" -ge 2 ]; then
skill_workdir=$(setup_workspace_v2 "$eval_name" "skill" "$manifest")
else
skill_workdir=$(setup_workspace_v1 "$eval_name" "skill")
fi
echo "$prompt" > "$skill_workdir/.eval-prompt"
run_claude "$skill_workdir" "with-skill" "$RUN_DIR/skill" "true"
echo ""
fi
# --- Baseline run ---
if [ "$mode" != "--skill-only" ]; then
local baseline_workdir
if [ "$version" -ge 2 ]; then
baseline_workdir=$(setup_workspace_v2 "$eval_name" "baseline" "$manifest")
else
baseline_workdir=$(setup_workspace_v1 "$eval_name" "baseline")
fi
echo "$prompt" > "$baseline_workdir/.eval-prompt"
run_claude "$baseline_workdir" "baseline" "$RUN_DIR/baseline" "false"
echo ""
fi
}
# --- Main ---
if [ "${1:-}" = "--list" ]; then
@@ -193,8 +225,33 @@ if [ "${1:-}" = "--list" ]; then
exit 0
fi
eval_name="${1:?Usage: $0 <eval-name> [--skill-only|--baseline-only]}"
mode="${2:---both}"
# --- Parse args ---
eval_name=""
mode="--both"
num_runs=1
while [[ $# -gt 0 ]]; do
case "$1" in
--skill-only|--baseline-only)
mode="$1"
shift
;;
--runs)
num_runs="${2:?--runs requires a number}"
shift 2
;;
-*)
die "Unknown flag: $1"
;;
*)
eval_name="$1"
shift
;;
esac
done
[[ -n "$eval_name" ]] || die "Usage: $0 <eval-name> [--skill-only|--baseline-only] [--runs N]"
EVAL_SOURCE_DIR=$(find_eval_dir "$eval_name")
[ -n "$EVAL_SOURCE_DIR" ] || die "Eval not found: $eval_name. Use --list to see available evals."
@@ -205,59 +262,76 @@ MANIFEST="$EVAL_SOURCE_DIR/manifest.json"
VERSION=$(jq -r '.version // 1' "$MANIFEST")
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
RUN_DIR="$RESULTS_DIR/${eval_name}-${TIMESTAMP}"
mkdir -p "$RUN_DIR"
# Copy manifest for reference
cp "$MANIFEST" "$RUN_DIR/manifest.json"
echo "Eval: $eval_name (v$VERSION)"
echo "Results: $RUN_DIR"
echo ""
# Build prompt
PROMPT=$(build_prompt "$MANIFEST")
echo "Prompt: $PROMPT"
echo ""
# --- With-skill run ---
if [ "$mode" != "--baseline-only" ]; then
if [ "$VERSION" -ge 2 ]; then
skill_workdir=$(setup_workspace_v2 "$eval_name" "skill" "$MANIFEST")
else
skill_workdir=$(setup_workspace_v1 "$eval_name" "skill")
fi
echo "$PROMPT" > "$skill_workdir/.eval-prompt"
run_claude "$skill_workdir" "with-skill" "$RUN_DIR/skill" "true"
if [ "$num_runs" -gt 1 ]; then
# Multi-run mode: create parent dir with run-1/, run-2/, etc.
PARENT_DIR="$RESULTS_DIR/${eval_name}-${TIMESTAMP}-${num_runs}runs"
mkdir -p "$PARENT_DIR"
cp "$MANIFEST" "$PARENT_DIR/manifest.json"
echo "Eval: $eval_name (v$VERSION) — $num_runs runs"
echo "Results: $PARENT_DIR"
echo "Prompt: $PROMPT"
echo ""
fi
# --- Baseline run ---
if [ "$mode" != "--skill-only" ]; then
if [ "$VERSION" -ge 2 ]; then
baseline_workdir=$(setup_workspace_v2 "$eval_name" "baseline" "$MANIFEST")
else
baseline_workdir=$(setup_workspace_v1 "$eval_name" "baseline")
fi
echo "$PROMPT" > "$baseline_workdir/.eval-prompt"
run_claude "$baseline_workdir" "baseline" "$RUN_DIR/baseline" "false"
for i in $(seq 1 "$num_runs"); do
echo "========================================"
echo " Run $i / $num_runs"
echo "========================================"
RUN_DIR="$PARENT_DIR/run-$i"
mkdir -p "$RUN_DIR"
cp "$MANIFEST" "$RUN_DIR/manifest.json"
run_single "$eval_name" "$mode" "$MANIFEST" "$PROMPT" "$VERSION"
# Score this run immediately
"$EVAL_DIR/score-eval.sh" "$RUN_DIR"
echo ""
done
# Aggregate across runs
echo "========================================"
echo " Aggregating $num_runs runs"
echo "========================================"
python3 "$EVAL_DIR/score.py" aggregate "$PARENT_DIR"
echo "=== Results ==="
echo "Directory: $PARENT_DIR"
echo "Runs: $num_runs"
echo ""
echo "Files:"
ls -1 "$PARENT_DIR/"
else
# Single-run mode (original behavior)
RUN_DIR="$RESULTS_DIR/${eval_name}-${TIMESTAMP}"
mkdir -p "$RUN_DIR"
cp "$MANIFEST" "$RUN_DIR/manifest.json"
echo "Eval: $eval_name (v$VERSION)"
echo "Results: $RUN_DIR"
echo "Prompt: $PROMPT"
echo ""
run_single "$eval_name" "$mode" "$MANIFEST" "$PROMPT" "$VERSION"
echo "=== Results ==="
echo "Directory: $RUN_DIR"
echo ""
if [ -f "$RUN_DIR/skill.duration" ] && [ -f "$RUN_DIR/baseline.duration" ]; then
skill_dur=$(cat "$RUN_DIR/skill.duration")
baseline_dur=$(cat "$RUN_DIR/baseline.duration")
echo "With-skill duration: ${skill_dur}s"
echo "Baseline duration: ${baseline_dur}s"
fi
echo ""
echo "Files:"
ls -1 "$RUN_DIR/"
echo ""
echo "Next step: run ./score-eval.sh $RUN_DIR to score the results"
fi
# --- Summary ---
echo "=== Results ==="
echo "Directory: $RUN_DIR"
echo ""
if [ -f "$RUN_DIR/skill.duration" ] && [ -f "$RUN_DIR/baseline.duration" ]; then
skill_dur=$(cat "$RUN_DIR/skill.duration")
baseline_dur=$(cat "$RUN_DIR/baseline.duration")
echo "With-skill duration: ${skill_dur}s"
echo "Baseline duration: ${baseline_dur}s"
fi
echo ""
echo "Files:"
ls -1 "$RUN_DIR/"
echo ""
echo "Next step: run ./score-eval.sh $RUN_DIR to score the results"

View file

@@ -4,10 +4,13 @@
Feeds the manifest rubric and full conversation to Claude, which scores
each criterion. Fully automated, no human input needed.
Usage: python3 score.py <results-dir>
Usage:
python3 score.py <results-dir> # score a single run
python3 score.py aggregate <parent-dir> # aggregate multiple runs
"""
import json
import math
import re
import subprocess
import sys
@@ -370,8 +373,101 @@ def score_variant(variant: str, results_dir: Path, manifest: dict) -> dict:
return result
def main():
results_dir = Path(sys.argv[1])
# --- Aggregation ---
def aggregate_runs(parent_dir: Path) -> int:
"""Aggregate scores from multiple runs into stats per criterion."""
run_dirs = sorted(parent_dir.glob("run-*/"))
if not run_dirs:
print(f"ERROR: No run-*/ directories found in {parent_dir}", file=sys.stderr)
return 1
for variant in ("skill", "baseline"):
score_files = [d / f"{variant}.score.json" for d in run_dirs]
score_files = [f for f in score_files if f.exists()]
if not score_files:
continue
scores = [json.loads(f.read_text()) for f in score_files]
n = len(scores)
# Aggregate totals
totals = [s["total"] for s in scores]
max_total = scores[0]["max"]
# Aggregate per-criterion
all_criteria = list(scores[0]["criteria"].keys())
criteria_stats = {}
for crit in all_criteria:
vals = [s["criteria"].get(crit, 0) for s in scores]
avg = sum(vals) / n
criteria_stats[crit] = {
"scores": vals,
"min": min(vals),
"max": max(vals),
"avg": round(avg, 1),
"stddev": round(math.sqrt(sum((v - avg) ** 2 for v in vals) / n), 2),
}
total_avg = sum(totals) / n
agg = {
"variant": variant,
"runs": n,
"total": {
"scores": totals,
"min": min(totals),
"max": max(totals),
"avg": round(total_avg, 1),
"stddev": round(math.sqrt(sum((v - total_avg) ** 2 for v in totals) / n), 2),
},
"max_possible": max_total,
"criteria": criteria_stats,
}
# Identify flaky criteria (stddev > 0)
flaky = [c for c, s in criteria_stats.items() if s["stddev"] > 0]
if flaky:
agg["flaky_criteria"] = flaky
# Collect durations
durations = [s.get("duration") for s in scores if s.get("duration") is not None]
if durations:
agg["duration"] = {
"min": min(durations),
"max": max(durations),
"avg": round(sum(durations) / len(durations), 1),
}
agg_path = parent_dir / f"{variant}.aggregate.json"
agg_path.write_text(json.dumps(agg, indent=2))
# Print summary
print(f"=== {variant} aggregate ({n} runs) ===")
print(f" Total: {agg['total']['avg']}/{max_total} "
f"(range {agg['total']['min']}-{agg['total']['max']}, "
f"stddev {agg['total']['stddev']})")
print()
for crit, stats in criteria_stats.items():
flaky_mark = " *" if stats["stddev"] > 0 else ""
print(f" {crit:40s} avg={stats['avg']:4.1f} "
f"range=[{stats['min']}-{stats['max']}] "
f"stddev={stats['stddev']}{flaky_mark}")
if flaky:
print(f"\n * flaky criteria (non-zero stddev): {', '.join(flaky)}")
if agg.get("duration"):
d = agg["duration"]
print(f"\n Duration: avg={d['avg']}s range=[{d['min']}-{d['max']}s]")
print()
return 0
def score_single(results_dir: Path) -> int:
"""Score a single results directory."""
manifest_path = results_dir / "manifest.json"
if not manifest_path.exists():
@@ -446,5 +542,20 @@ def main():
return 0
def main():
if len(sys.argv) < 2:
print("Usage: python3 score.py <results-dir>", file=sys.stderr)
print(" python3 score.py aggregate <parent-dir>", file=sys.stderr)
return 1
if sys.argv[1] == "aggregate":
if len(sys.argv) < 3:
print("Usage: python3 score.py aggregate <parent-dir>", file=sys.stderr)
return 1
return aggregate_runs(Path(sys.argv[2]))
return score_single(Path(sys.argv[1]))
if __name__ == "__main__":
sys.exit(main())
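The aggregate file written above can be inspected with jq; keys match `aggregate_runs`, values are illustrative:

```bash
jq -r '"total \(.total.avg)/\(.max_possible) stddev=\(.total.stddev) " +
       "flaky=\(.flaky_criteria // [] | join(","))"' skill.aggregate.json
# e.g. total 8.3/10 stddev=0.47 flaky=fixed_highest_impact_first
```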

View file

@@ -8,6 +8,7 @@
"id": "score-on2",
"file": "src/analytics/pipeline.py",
"function": "score_by_category",
"domain": "data-structures",
"description": "O(n²) nested loop counting category peers for each record",
"expected_fix": "Pre-compute category counts with Counter/defaultdict",
"impact_pct": 31
@@ -16,6 +17,7 @@
"id": "rank-insertion-sort",
"file": "src/analytics/pipeline.py",
"function": "rank_results",
"domain": "data-structures",
"description": "O(n²) insertion sort with custom comparator",
"expected_fix": "Use sorted() with key function",
"impact_pct": 31
@@ -24,6 +26,7 @@
"id": "summary-cross-category",
"file": "src/analytics/pipeline.py",
"function": "generate_summary",
"domain": "data-structures",
"description": "O(c² × n) cross-category source overlap with nested list scans per category pair",
"expected_fix": "Pre-build source sets per category, use set intersection",
"impact_pct": 36
@@ -32,6 +35,7 @@
"id": "enrich-deepcopy",
"file": "src/analytics/pipeline.py",
"function": "enrich_metadata",
"domain": "data-structures",
"description": "copy.deepcopy(config) per record",
"expected_fix": "Extract defaults once before loop",
"impact_pct": 0.9,
@@ -41,6 +45,7 @@
"id": "format-json-roundtrip",
"file": "src/analytics/pipeline.py",
"function": "format_output",
"domain": "data-structures",
"description": "Double JSON serialization per record for integrity check",
"expected_fix": "Single serialization or remove round-trip",
"impact_pct": 0.4,
@@ -50,6 +55,7 @@
"id": "normalize-string-concat",
"file": "src/analytics/pipeline.py",
"function": "normalize_fields",
"domain": "data-structures",
"description": "Character-by-character string concatenation in loop",
"expected_fix": "Use split/join or regex",
"impact_pct": 0.2,
@@ -59,6 +65,7 @@
"id": "dedup-list-scan",
"file": "src/analytics/pipeline.py",
"function": "deduplicate",
"domain": "data-structures",
"description": "List-based ID dedup with O(n) scan per record",
"expected_fix": "Use set for seen IDs",
"impact_pct": 0.2,
@@ -68,6 +75,7 @@
"id": "parse-regex-compile",
"file": "src/analytics/pipeline.py",
"function": "parse_records",
"domain": "data-structures",
"description": "re.compile() called per field instead of once",
"expected_fix": "Compile regex once outside the loop",
"impact_pct": 0.1,
@@ -77,6 +85,7 @@
"id": "validate-list-blocklist",
"file": "src/analytics/pipeline.py",
"function": "validate_records",
"domain": "data-structures",
"description": "List-based blocklist and required fields checks",
"expected_fix": "Convert to sets",
"impact_pct": 0.05,
@@ -86,6 +95,7 @@
"id": "filter-list-tags",
"file": "src/analytics/pipeline.py",
"function": "apply_filters",
"domain": "data-structures",
"description": "Nested loop checking tags against blocklist",
"expected_fix": "Use set intersection",
"impact_pct": 0.03,

View file

@@ -1,10 +0,0 @@
{
"version": 2,
"updated": "2026-03-27",
"note": "Deterministic auto-scoring for profiler usage, iterative profiling, and ranked list reduces LLM variance. Thresholds tightened from expected-3 to expected-2.",
"evals": {
"ranking": { "expected": 9, "min": 7, "max": 10 },
"memory-hard": { "expected": 9, "min": 7, "max": 10 },
"memory-misdirection": { "expected": 9, "min": 7, "max": 10 }
}
}

View file

@@ -1,178 +0,0 @@
#!/bin/bash
set -euo pipefail
# Eval regression checker for codeflash-agent plugin
#
# Runs a subset of evals, scores them, and compares to checked-in baselines.
# Exits 1 if any score drops below the minimum threshold.
#
# Usage:
# ./check-regression.sh # run all baseline evals
# ./check-regression.sh ranking # run specific template(s)
# ./check-regression.sh --score-only <dir> # score existing results, skip running
EVAL_DIR="$(cd "$(dirname "$0")" && pwd)"
BASELINE_FILE="$EVAL_DIR/baseline-scores.json"
die() { echo "ERROR: $*" >&2; exit 1; }
[ -f "$BASELINE_FILE" ] || die "Baseline file not found: $BASELINE_FILE"
# --- Parse args ---
SCORE_ONLY=""
RESULTS_DIRS=()
TEMPLATES=()
while [[ $# -gt 0 ]]; do
case "$1" in
--score-only)
SCORE_ONLY=1
shift
while [[ $# -gt 0 && ! "$1" =~ ^-- ]]; do
RESULTS_DIRS+=("$1")
shift
done
;;
*)
TEMPLATES+=("$1")
shift
;;
esac
done
# If no templates specified, use all from baseline file
if [[ ${#TEMPLATES[@]} -eq 0 && -z "$SCORE_ONLY" ]]; then
mapfile -t TEMPLATES < <(jq -r '.evals | keys[]' "$BASELINE_FILE")
fi
# --- Score-only mode ---
if [[ -n "$SCORE_ONLY" ]]; then
[[ ${#RESULTS_DIRS[@]} -gt 0 ]] || die "Usage: $0 --score-only <results-dir> [<results-dir> ...]"
for dir in "${RESULTS_DIRS[@]}"; do
[ -d "$dir" ] || die "Results directory not found: $dir"
echo "Scoring: $dir"
"$EVAL_DIR/score-eval.sh" "$dir"
done
echo ""
echo "Scored ${#RESULTS_DIRS[@]} result(s). Compare manually or re-run without --score-only."
exit 0
fi
# --- Run evals ---
echo "=== Eval Regression Check ==="
echo "Templates: ${TEMPLATES[*]}"
echo "Baseline: $BASELINE_FILE"
echo ""
declare -A RESULT_DIRS
for template in "${TEMPLATES[@]}"; do
# Verify template exists in baseline
min=$(jq -r --arg t "$template" '.evals[$t].min // empty' "$BASELINE_FILE")
[[ -n "$min" ]] || die "Template '$template' not found in baseline file"
echo "--- Running: $template ---"
output=$("$EVAL_DIR/run-eval.sh" "$template" --skill-only 2>&1)
echo "$output"
# Extract the results directory from run-eval output
result_dir=$(echo "$output" | grep "^Results:" | head -1 | awk '{print $2}')
[[ -n "$result_dir" && -d "$result_dir" ]] || die "Could not find results directory for $template"
RESULT_DIRS[$template]="$result_dir"
echo ""
done
# --- Score ---
echo "=== Scoring ==="
echo ""
declare -A SCORES
for template in "${TEMPLATES[@]}"; do
result_dir="${RESULT_DIRS[$template]}"
echo "--- Scoring: $template ---"
"$EVAL_DIR/score-eval.sh" "$result_dir"
# Read the score
score_file="$result_dir/skill.score.json"
if [[ -f "$score_file" ]]; then
score=$(jq -r '.total' "$score_file")
SCORES[$template]="$score"
else
echo "WARNING: No score file for $template"
SCORES[$template]="0"
fi
echo ""
done
# --- Compare to baseline ---
echo "=== Regression Check ==="
echo ""
printf "%-25s %8s %8s %8s %10s\n" "Template" "Score" "Min" "Expected" "Status"
printf "%-25s %8s %8s %8s %10s\n" "--------" "-----" "---" "--------" "------"
FAILED=0
for template in "${TEMPLATES[@]}"; do
score="${SCORES[$template]}"
min=$(jq -r --arg t "$template" '.evals[$t].min' "$BASELINE_FILE")
expected=$(jq -r --arg t "$template" '.evals[$t].expected' "$BASELINE_FILE")
max=$(jq -r --arg t "$template" '.evals[$t].max' "$BASELINE_FILE")
if [[ "$score" -lt "$min" ]]; then
status="FAIL"
FAILED=1
elif [[ "$score" -lt "$expected" ]]; then
status="WARN"
else
status="PASS"
fi
printf "%-25s %8s %8s %8s %10s\n" "$template" "$score/$max" "$min" "$expected" "$status"
done
echo ""
# --- Write summary for CI ---
SUMMARY_FILE="$EVAL_DIR/results/regression-summary.json"
mkdir -p "$(dirname "$SUMMARY_FILE")"
{
echo "{"
echo " \"timestamp\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\","
echo " \"passed\": $([ $FAILED -eq 0 ] && echo true || echo false),"
echo " \"results\": {"
first=1
for template in "${TEMPLATES[@]}"; do
[ $first -eq 0 ] && echo ","
first=0
score="${SCORES[$template]}"
min=$(jq -r --arg t "$template" '.evals[$t].min' "$BASELINE_FILE")
expected=$(jq -r --arg t "$template" '.evals[$t].expected' "$BASELINE_FILE")
printf ' "%s": { "score": %s, "min": %s, "expected": %s }' "$template" "$score" "$min" "$expected"
done
echo ""
echo " }"
echo "}"
} > "$SUMMARY_FILE"
echo "Summary: $SUMMARY_FILE"
if [[ $FAILED -eq 1 ]]; then
echo ""
echo "REGRESSION DETECTED: One or more evals scored below minimum threshold."
exit 1
fi
echo ""
echo "All evals passed regression check."
exit 0

View file

@@ -1,31 +1,45 @@
---
name: codeflash-optimize
description: >-
This skill should be used when the user asks to "optimize my code", "start an optimization
session", "resume optimization", "check optimization status", "make this faster", "reduce
memory usage", "fix slow functions", or "run performance experiments". Covers CPU, async,
memory, and codebase structure optimization.
Profiles code, identifies bottlenecks, runs benchmarks, and applies targeted optimizations
across CPU, async, memory, and codebase structure domains. Use when the user asks to
"optimize my code", "start an optimization session", "resume optimization", "check
optimization status", "make this faster", "reduce memory usage", "fix slow functions",
or "run performance experiments".
allowed-tools: "Agent, AskUserQuestion, Read"
argument-hint: "[start|resume|status]"
allowed-tools: ["Agent", "AskUserQuestion", "Read"]
---
Optimization session launcher.
Optimization session launcher. Routes to the **codeflash** agent in one of three modes.
## For `start` (or no arguments)
Before launching the agent, use the AskUserQuestion tool to ask: "Before I start optimizing, is there anything I should know? For example: areas to avoid, known constraints, things you've already tried, or specific files to focus on. Or just say 'go' to proceed."
**Step 1.** Use AskUserQuestion to ask:
Wait for the user's response. Then use the Agent tool to launch the **codeflash** agent with `run_in_background: true`. Include the user's original request AND their answer in the prompt. Also include this directive at the top of the prompt:
> Before I start optimizing, is there anything I should know? For example: areas to avoid, known constraints, things you've already tried, or specific files to focus on. Or just say 'go' to proceed.
**Step 2.** After the user responds, launch the agent with these exact parameters:
- **Agent name:** `codeflash`
- **run_in_background:** `true`
- **Prompt:** The prompt must contain exactly three parts in this order, and nothing else:
Part 1 — the AUTONOMOUS MODE directive (copy verbatim):
```
AUTONOMOUS MODE: The user has already been asked for context (included below). Do NOT ask the user any questions — work fully autonomously. Make all decisions yourself: generate a run tag from today's date, identify benchmark tiers from available tests, choose optimization targets from profiler output. If something is ambiguous, pick the reasonable default and document your choice in HANDOFF.md.
```
Part 2 — the user's original request (verbatim).
Part 3 — the user's answer from Step 1 (verbatim).
Do not add any other instructions — the agent has its own workflow.
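Assembled, the Step 2 prompt could look like this (user text is hypothetical; the directive is abbreviated here, copy it in full from above):

```
AUTONOMOUS MODE: The user has already been asked for context (included below). Do NOT ask the user any questions — work fully autonomously. [...]

Optimize the data pipeline in src/analytics/, it has gotten slow.

Avoid touching the public API; I have already tried caching.
```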
## For `resume`
Use the Agent tool to launch the **codeflash** agent with `run_in_background: true`. Pass `resume` and the user's request as the prompt. Include this at the top:
Launch the agent with these exact parameters:
- **Agent name:** `codeflash`
- **run_in_background:** `true`
- **Prompt:** The directive below (verbatim), followed by `resume` and the user's request:
```
AUTONOMOUS MODE: Work fully autonomously. Do NOT ask the user any questions. Read session state from .codeflash/ and continue where the last session left off.
@@ -33,4 +47,9 @@ AUTONOMOUS MODE: Work fully autonomously. Do NOT ask the user any questions. Rea
## For `status`
Use the Agent tool to launch the **codeflash** agent. Pass `status` as the prompt. Do NOT run in background — wait for the result and show it to the user.
Launch the agent with these exact parameters:
- **Agent name:** `codeflash`
- **run_in_background:** `false` (wait for the result)
- **Prompt:** `status`
Show the agent's result to the user.