* feat: git memory, guard command, stuck recovery, batched setup
  - Add git history review as step 1 of every experiment loop iteration (read git log + diff to learn from past experiments and detect patterns)
  - Add guard command as formal regression safety net (step 10) with revert-and-rework protocol (max 2 attempts before discard)
  - Add stuck state recovery protocol (5+ consecutive discards triggers re-read of all files, results log analysis, goal re-check, combination of past successes, and opposite-strategy attempts)
  - Add batched setup questions rule to orchestrator (max 4 questions per message, never one-at-a-time across round-trips)
  - Update decision tree to include guard check before keep/discard
  - Update orchestrator workflow to configure guard during session init
  Inspired by patterns from uditgoenka/autoresearch.
* feat: add git history, guard, and config audit steps to cpu/memory/structure agents
  Align experiment loops in all domain agents with the async agent. Each now includes step 1 (review git history), guard command (revert+rework on failure), and config audit after KEEP with domain-specific guidance.
* feat: add CI plugin validation workflow (#3)
  Uses claude-code-action with the plugin-dev plugin to validate plugin structure, agent consistency, eval manifests, and skills on every PR. Includes @claude mention support for interactive fixes.
* fix: correct plugin marketplace name for CI validation
  plugin-dev is in claude-plugins-official, not claude-code-plugins. Also adds the plugin_marketplaces URL for discovery.
* fix: expand allowed tools for validation workflow
  Add gh pr comment, gh api, cat, python3, and jq to the allowed tools so Claude can post PR summary comments and subagents can function.
* fix: enable track_progress and show_full_output for debugging
* fix: remove colons from Bash glob patterns in validate allowedTools
  The gh command patterns used colons (e.g. Bash(gh pr diff:*)) which are treated as literal characters, so they never matched actual commands like `gh pr diff 2 --name-only`. This caused one permission denial per CI run and prevented the summary comment from posting.
* fix: fail CI job when validation finds issues
  Add a verdict step that writes PASS/FAIL to a file, with a follow-up workflow step that exits 1 on FAIL. Previously validation reported issues in comments but the job always succeeded.
* fix: remove double-commit contradiction in async agent and shared base
  Step 4/5/6 (Implement) said "and commit" but the commit-after-KEEP step later says "Do NOT commit discards." This meant discarded experiments were still committed. CPU/memory/structure agents already had it right: only commit at the KEEP step.
* fix: treat validation warnings as blocking failures
  Warnings were previously non-blocking; the verdict step only checked for "issues that need fixing." Now any warning also triggers FAIL.
* fix: address plugin-validator warnings
  - Declare the context7 MCP server in plugin.json (domain agents use it)
  - Change the codeflash-setup color from green to red (collision with router)
  - Add memory: project and an example block to the codeflash-setup frontmatter
* feat: add eval regression testing (Layer 3)
  - baseline-scores.json: checked-in expected scores + min thresholds for ranking, memory-hard, and memory-misdirection
  - check-regression.sh: orchestrator that runs evals, scores them, and compares to baselines (exits 1 if any score < min)
  - eval-regression.yml: on-demand CI workflow (workflow_dispatch) with Bedrock OIDC, artifact upload, and a job summary table
* fix: address remaining plugin-validator warnings
  - Add memory: project to codeflash-memory.md (matches other agents)
  - Add allowed-tools to the memray-profiling skill
  - Fix a blank line in the codeflash-setup.md frontmatter
  - Add git log -20 --stat to all 4 domain agents' git history step
  - Add a step numbering note to the async reference experiment-loop.md
* feat: add deterministic scoring for profiler and ranking criteria
  Add session-text-based auto-scoring that overrides LLM grades for mechanically verifiable criteria:
  - used_memory_profiler: grep for memray/tracemalloc Bash commands
  - profiled_iteratively: count distinct profiling runs (1 = 1 pt, 2+ = full)
  - built_ranked_list_with_impact_pct: detect cProfile + ranking output
  These anchor 2-4 points per eval deterministically, reducing LLM variance. Baseline thresholds tightened from min=6 to min=7.
* fix: parse verdict from PR comment instead of temp file
  Claude writes "Verdict: FAIL/PASS" in the PR comment but doesn't execute the python3 file-write command. The check step now reads the claude[bot] comment via gh api and greps for the verdict line.
* fix: address all remaining validator findings for PR #2
  - Remove non-standard fields (repository, license, keywords) from plugin.json
  - Pin the context7 MCP to @2.1.4 instead of @latest
  - Add a context7 fallback note in the router agent
  - Remove unused Read from the codeflash-optimize allowed-tools
  - Make AskUserQuestion usage explicit in the skill body
  - Add tracemalloc to the memray-profiling trigger phrases
  - List all reference files in the memray-profiling skill
* fix: wire stuck state recovery into domain agents, fix skill allowed-tools
  - Add the 5+ consecutive discard stuck recovery protocol to all 4 domain agents' (cpu, memory, async, structure) inline experiment loops, matching the shared base definition
  - Add Read and Bash to the codeflash-optimize skill allowed-tools so it can inspect session state before delegating
  - Add Write to the memray-profiling skill allowed-tools so it can create profiling harness scripts
* fix: sync versions, add missing tools to router and skills
  - Sync the marketplace.json metadata version to 0.1.0 to match plugin.json
  - Add Write and Edit to the codeflash.md router tools (needed for writing .codeflash/conventions.md during setup)
  - Remove Bash from the codeflash-optimize skill (least-privilege: all execution is delegated via Agent)
  - Add Grep and Glob to the memray-profiling skill (needed for finding test files, captured outputs, and config files)
* fix: restore plugin.json fields, revert marketplace version, relax validator verdict
  - Restore repository, license, and keywords in plugin.json (accidentally removed when mcpServers was added)
  - Revert marketplace.json metadata.version to 1.0.0 (the collection version, distinct from the individual plugin version 0.1.0)
  - Change the validator verdict rule: only FAIL on major issues, not warnings; this prevents the LLM validator from blocking on subjective minor findings each run
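
The deterministic-scoring commit describes grep-based overrides of LLM grades. A minimal sketch of that idea, assuming the eval session transcript is available as a plain-text file; the argument, variable names, grep patterns, and point values here are illustrative assumptions, not the plugin's actual scorer:

    #!/bin/bash
    # Sketch only: scan a session transcript and anchor two criteria deterministically.
    SESSION_TXT="$1"   # path to the captured session text (assumed input)

    # used_memory_profiler: any memray/tracemalloc invocation in the transcript
    used_memory_profiler=0
    grep -Eq 'memray|tracemalloc' "$SESSION_TXT" && used_memory_profiler=1

    # profiled_iteratively: count profiling invocations (the real scorer counts
    # distinct runs); 1 run = 1 point, 2 or more = full credit
    runs=$(grep -Ec 'memray run|cProfile' "$SESSION_TXT" || true)
    if   [[ "$runs" -ge 2 ]]; then profiled_iteratively=2
    elif [[ "$runs" -eq 1 ]]; then profiled_iteratively=1
    else                           profiled_iteratively=0; fi

    echo "used_memory_profiler=$used_memory_profiler profiled_iteratively=$profiled_iteratively"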
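The verdict-parsing fix reads the verdict out of the claude[bot] PR comment rather than a temp file. A hedged sketch of what that check step could look like, assuming the workflow exposes GITHUB_REPOSITORY and a PR_NUMBER variable; the jq filter and exact bot login are assumptions:

    # Fetch the latest claude[bot] comment on the PR and grep its verdict line.
    verdict=$(gh api "repos/${GITHUB_REPOSITORY}/issues/${PR_NUMBER}/comments" \
      --jq '[.[] | select(.user.login == "claude[bot]")] | last | .body' \
      | grep -Eo 'Verdict: (PASS|FAIL)' | tail -1 || true)

    if [[ "$verdict" == "Verdict: FAIL" ]]; then
      echo "Plugin validation reported FAIL"
      exit 1
    fi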
check-regression.sh (178 lines, 4.8 KiB, Bash, executable file)
#!/bin/bash
set -euo pipefail

# Eval regression checker for codeflash-agent plugin
#
# Runs a subset of evals, scores them, and compares to checked-in baselines.
# Exits 1 if any score drops below the minimum threshold.
#
# Usage:
#   ./check-regression.sh                       # run all baseline evals
#   ./check-regression.sh ranking               # run specific template(s)
#   ./check-regression.sh --score-only <dir>    # score existing results, skip running

EVAL_DIR="$(cd "$(dirname "$0")" && pwd)"
BASELINE_FILE="$EVAL_DIR/baseline-scores.json"

die() { echo "ERROR: $*" >&2; exit 1; }

[ -f "$BASELINE_FILE" ] || die "Baseline file not found: $BASELINE_FILE"

# --- Parse args ---

SCORE_ONLY=""
RESULTS_DIRS=()
TEMPLATES=()

while [[ $# -gt 0 ]]; do
  case "$1" in
    --score-only)
      SCORE_ONLY=1
      shift
      while [[ $# -gt 0 && ! "$1" =~ ^-- ]]; do
        RESULTS_DIRS+=("$1")
        shift
      done
      ;;
    *)
      TEMPLATES+=("$1")
      shift
      ;;
  esac
done

# If no templates specified, use all from baseline file
if [[ ${#TEMPLATES[@]} -eq 0 && -z "$SCORE_ONLY" ]]; then
  mapfile -t TEMPLATES < <(jq -r '.evals | keys[]' "$BASELINE_FILE")
fi

# --- Score-only mode ---

if [[ -n "$SCORE_ONLY" ]]; then
  [[ ${#RESULTS_DIRS[@]} -gt 0 ]] || die "Usage: $0 --score-only <results-dir> [<results-dir> ...]"

  for dir in "${RESULTS_DIRS[@]}"; do
    [ -d "$dir" ] || die "Results directory not found: $dir"
    echo "Scoring: $dir"
    "$EVAL_DIR/score-eval.sh" "$dir"
  done

  echo ""
  echo "Scored ${#RESULTS_DIRS[@]} result(s). Compare manually or re-run without --score-only."
  exit 0
fi

# --- Run evals ---

echo "=== Eval Regression Check ==="
echo "Templates: ${TEMPLATES[*]}"
echo "Baseline: $BASELINE_FILE"
echo ""

declare -A RESULT_DIRS

for template in "${TEMPLATES[@]}"; do
  # Verify template exists in baseline
  min=$(jq -r --arg t "$template" '.evals[$t].min // empty' "$BASELINE_FILE")
  [[ -n "$min" ]] || die "Template '$template' not found in baseline file"

  echo "--- Running: $template ---"
  output=$("$EVAL_DIR/run-eval.sh" "$template" --skill-only 2>&1)
  echo "$output"

  # Extract the results directory from run-eval output
  result_dir=$(echo "$output" | grep "^Results:" | head -1 | awk '{print $2}')
  [[ -n "$result_dir" && -d "$result_dir" ]] || die "Could not find results directory for $template"
  RESULT_DIRS[$template]="$result_dir"
  echo ""
done

# --- Score ---

echo "=== Scoring ==="
echo ""

declare -A SCORES

for template in "${TEMPLATES[@]}"; do
  result_dir="${RESULT_DIRS[$template]}"
  echo "--- Scoring: $template ---"
  "$EVAL_DIR/score-eval.sh" "$result_dir"

  # Read the score
  score_file="$result_dir/skill.score.json"
  if [[ -f "$score_file" ]]; then
    score=$(jq -r '.total' "$score_file")
    SCORES[$template]="$score"
  else
    echo "WARNING: No score file for $template"
    SCORES[$template]="0"
  fi
  echo ""
done

# --- Compare to baseline ---

echo "=== Regression Check ==="
echo ""

printf "%-25s %8s %8s %8s %10s\n" "Template" "Score" "Min" "Expected" "Status"
printf "%-25s %8s %8s %8s %10s\n" "--------" "-----" "---" "--------" "------"

FAILED=0

for template in "${TEMPLATES[@]}"; do
  score="${SCORES[$template]}"
  min=$(jq -r --arg t "$template" '.evals[$t].min' "$BASELINE_FILE")
  expected=$(jq -r --arg t "$template" '.evals[$t].expected' "$BASELINE_FILE")
  max=$(jq -r --arg t "$template" '.evals[$t].max' "$BASELINE_FILE")

  if [[ "$score" -lt "$min" ]]; then
    status="FAIL"
    FAILED=1
  elif [[ "$score" -lt "$expected" ]]; then
    status="WARN"
  else
    status="PASS"
  fi

  printf "%-25s %8s %8s %8s %10s\n" "$template" "$score/$max" "$min" "$expected" "$status"
done

echo ""

# --- Write summary for CI ---

SUMMARY_FILE="$EVAL_DIR/results/regression-summary.json"
mkdir -p "$(dirname "$SUMMARY_FILE")"

{
  echo "{"
  echo "  \"timestamp\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\","
  echo "  \"passed\": $([ $FAILED -eq 0 ] && echo true || echo false),"
  echo "  \"results\": {"
  first=1
  for template in "${TEMPLATES[@]}"; do
    [ $first -eq 0 ] && echo ","
    first=0
    score="${SCORES[$template]}"
    min=$(jq -r --arg t "$template" '.evals[$t].min' "$BASELINE_FILE")
    expected=$(jq -r --arg t "$template" '.evals[$t].expected' "$BASELINE_FILE")
    printf '    "%s": { "score": %s, "min": %s, "expected": %s }' "$template" "$score" "$min" "$expected"
  done
  echo ""
  echo "  }"
  echo "}"
} > "$SUMMARY_FILE"

echo "Summary: $SUMMARY_FILE"

if [[ $FAILED -eq 1 ]]; then
  echo ""
  echo "REGRESSION DETECTED: One or more evals scored below minimum threshold."
  exit 1
fi

echo ""
echo "All evals passed regression check."
exit 0