Two types of evals, both run through `run-eval.sh`:
**v1 (templates)** — Small synthetic projects in `evals/templates/`. Each bundles source code, tests, and a `pyproject.toml`. The runner copies the template to a temp dir, installs deps with `uv`, and runs Claude. Good for testing specific agent behaviors (ranking accuracy, memory profiling methodology, cross-domain detection). 9 templates across ranking, memory, crossdomain, and layered types.
**v2 (repos)** — Real repos in `evals/repos/`. Each has a `manifest.json` pointing to a GitHub repo + commit where a known bug exists. The runner shallow-clones the repo (cached locally after first run), drops Claude in, and the agent handles everything — setup, profiling, diagnosis, fix. More realistic but slower and more expensive (~$2/run). The manifest includes a `fix_commit` for reference and a rubric for scoring.
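For orientation, a v2 manifest might look roughly like the sketch below. Only `fix_commit` is named above; the other field names and the rubric wording are assumptions, so check an existing manifest in `evals/repos/` for the real schema.

```json
{
  "repo": "https://github.com/example-org/example-project",
  "commit": "abc1234",
  "fix_commit": "def5678",
  "rubric": [
    "Agent reproduced the problem before changing code",
    "Agent used a profiler to locate the hot path rather than guessing",
    "Fix addresses the same root cause as the reference fix_commit"
  ]
}
```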
Each eval produces results in `evals/results/<name>-<timestamp>/`. Score with `score.py`, which uses a mix of deterministic checks (did the agent use a profiler? did tests pass?) and LLM grading against the manifest's rubric.
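The scored output format isn't documented here, but conceptually a result combines those two signal types, something like this hypothetical record (all field names invented for illustration):

```json
{
  "eval": "example-repo",
  "checks": {
    "used_profiler": true,
    "tests_passed": true
  },
  "rubric_grade": 0.8
}
```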
**Regression testing** — Go to Actions > "Eval Regression" > Run workflow. Runs a subset of evals, scores them, and compares the scores to the baselines in `evals/baseline-scores.json`. Fails if any score drops below the threshold. Use before merging agent behavior changes.
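The schema of `evals/baseline-scores.json` isn't shown here; a hypothetical shape, assuming per-eval baselines and thresholds live in that file, would be:

```json
{
  "ranking-basic": { "baseline": 0.85, "threshold": 0.75 },
  "memory-leak": { "baseline": 0.70, "threshold": 0.60 }
}
```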
- Language-agnostic methodology lives in `plugin/references/shared/`; language-specific implementations live under `plugin/languages/<lang>/references/`
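Concretely, that split looks something like the tree below; the `<topic>.md` file names are placeholders, and `<lang>` stands for any supported language directory.

```
plugin/
  references/
    shared/                # language-agnostic methodology
      <topic>.md
  languages/
    <lang>/
      references/          # language-specific implementation of the same topics
        <topic>.md
```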