# codeflash-agent — how this repo works

## Packages (UV workspace)

- `packages/codeflash-core/` — shared foundation: models, AI client, telemetry, git helpers
- `packages/codeflash-python/` — Python language CLI (`codeflash` command), extends core
- `packages/codeflash-mcp/` — MCP server (stub)
- `packages/codeflash-lsp/` — LSP server (stub)

## Services

- `services/github-app/` — GitHub App integration service

## Plugin

- `plugin/` — Claude Code plugin (self-contained, multi-language). See [plugin/README.md](plugin/README.md) for architecture and session flow.

## Evals

Two types of evals, both run through `run-eval.sh`:

**v1 (templates)** — Small synthetic projects in `evals/templates/`. Each bundles source code, tests, and a `pyproject.toml`. The runner copies the template to a temp dir, installs deps with `uv`, and runs Claude. Good for testing specific agent behaviors (ranking accuracy, memory profiling methodology, cross-domain detection). 9 templates across ranking, memory, crossdomain, and layered types.

**v2 (repos)** — Real repos in `evals/repos/`. Each has a `manifest.json` pointing to a GitHub repo + commit where a known bug exists. The runner shallow-clones the repo (cached locally after first run), drops Claude in, and the agent handles everything — setup, profiling, diagnosis, fix. More realistic but slower and more expensive (~$2/run). The manifest includes a `fix_commit` for reference and a rubric for scoring.

Each eval produces results in `evals/results/-/`. Score with `score.py`, which uses a mix of deterministic checks (did the agent use a profiler? did tests pass?) and LLM grading against the manifest's rubric.

**Regression testing** — Go to Actions > "Eval Regression" > Run workflow. Runs a subset of evals, scores them, compares to baselines in `evals/baseline-scores.json`. Fails if any score drops below threshold. Use before merging agent behavior changes.

```
./evals/run-eval.sh --list                                                   # see all evals (v1 + v2)
./evals/run-eval.sh ranking --skill-only                                     # run a v1 eval
./evals/run-eval.sh codeflash-internal-psycopg-serialization --skill-only    # run a v2 eval
./evals/score-eval.sh evals/results/                                         # score it
./evals/check-regression.sh                                                  # full regression check
```

## CI (runs on every PR)

The `validate` workflow runs Claude with the `plugin-dev` plugin to check:

- Plugin structure (frontmatter, manifest, cross-references)
- Agent consistency (all domain agents must have the same experiment loop steps)
- Eval manifest validity
- Skill quality

Warnings are blocking — any issue fails the job. Claude posts a summary comment on the PR.

## Key conventions

- Domain agents are self-contained — all methodology is inline, no required file reads before starting
- Every agent uses the same experiment loop structure (choose target > implement > benchmark > keep/discard > commit only on KEEP)
- Changes to one domain agent should be mirrored to others where applicable (CI enforces this)
- The plugin uses `.codeflash/` in the user's project for session state (results.tsv, HANDOFF.md)
- Language-agnostic methodology lives in `plugin/references/shared/`; language-specific implementations live under `plugin/languages//references/`

## Contributing

1. Branch off main
2. Make changes, push — CI validates automatically
3. If you changed agent behavior, trigger an eval regression run before merging
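
## Appendix: illustrative config sketches

The actual schemas are not reproduced in this README, so the snippets below are rough sketches rather than definitive formats. This first one shows roughly what a v2 eval `manifest.json` could contain, covering the fields the Evals section mentions (a GitHub repo, the pinned commit where the bug exists, a `fix_commit` for reference, and a rubric for scoring). Every field name and value here is a placeholder assumption; check an existing file under `evals/repos/` for the real format.

```json
{
  "repo": "https://github.com/example-org/example-repo",
  "commit": "abc1234",
  "fix_commit": "def5678",
  "rubric": "Agent should reproduce the slowdown with a profiler, identify the hot path, apply a fix, and keep the test suite passing."
}
```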
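Along the same lines, a guess at the shape of `evals/baseline-scores.json`, which the regression check compares fresh scores against, failing if any score drops below threshold. The key names, score scale, and structure are assumptions, not the actual schema.

```json
{
  "ranking": 0.85,
  "memory": 0.80,
  "codeflash-internal-psycopg-serialization": 0.70
}
```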