codeflash-agent — how this repo works
Packages (UV workspace)
- packages/codeflash-core/ — shared foundation: models, AI client, telemetry, git helpers
- packages/codeflash-python/ — Python language CLI (the codeflash command), extends core
- packages/codeflash-mcp/ — MCP server (stub)
- packages/codeflash-lsp/ — LSP server (stub)
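As a hedged illustration of how the workspace wiring typically looks with uv (the actual file contents aren't shown here, so the names and version below are assumptions): the root pyproject.toml declares the members, and a member such as codeflash-python depends on codeflash-core through a workspace source.

```toml
# packages/codeflash-python/pyproject.toml (illustrative sketch, not the actual file)
# The root pyproject.toml would declare the members, e.g.:
#   [tool.uv.workspace]
#   members = ["packages/*"]
[project]
name = "codeflash-python"
version = "0.1.0"
dependencies = ["codeflash-core"]  # the shared foundation package

[tool.uv.sources]
codeflash-core = { workspace = true }  # resolve from the local workspace, not PyPI
```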
Services
- services/github-app/ — GitHub App integration service
Plugin
- plugin/ — Claude Code plugin (self-contained, multi-language). See plugin/README.md for architecture and session flow.
Evals
Two types of evals, both run through run-eval.sh:
v1 (templates) — Small synthetic projects in evals/templates/. Each bundles source code, tests, and a pyproject.toml. The runner copies the template to a temp dir, installs deps with uv, and runs Claude. Good for testing specific agent behaviors (ranking accuracy, memory profiling methodology, cross-domain detection). 9 templates across ranking, memory, crossdomain, and layered types.
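A rough sketch of what one of these templates looks like on disk (directory and file names are illustrative, not taken from the repo):

```
evals/templates/<name>/
├── pyproject.toml    # dependencies the runner installs with uv in the temp dir
├── src/              # synthetic source code the agent optimizes
└── tests/            # correctness tests that must stay green
```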
v2 (repos) — Real repos in evals/repos/. Each has a manifest.json pointing to a GitHub repo + commit where a known bug exists. The runner shallow-clones the repo (cached locally after first run), drops Claude in, and the agent handles everything — setup, profiling, diagnosis, fix. More realistic but slower and more expensive (~$2/run). The manifest includes a fix_commit for reference and a rubric for scoring.
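The manifest schema isn't spelled out above, so the following is only a hedged sketch built from the fields it mentions (repo, commit, fix_commit, rubric); field names and values are illustrative:

```json
{
  "repo": "https://github.com/example-org/example-repo",
  "commit": "<sha where the known bug exists>",
  "fix_commit": "<sha of the reference fix, kept for scoring only>",
  "rubric": [
    "Agent profiled before changing code",
    "Fix addresses the known hotspot",
    "Existing tests still pass"
  ]
}
```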
Each eval produces results in evals/results/<name>-<timestamp>/. Score with score.py, which uses a mix of deterministic checks (did the agent use a profiler? did tests pass?) and LLM grading against the manifest's rubric.
Regression testing — Go to Actions > "Eval Regression" > Run workflow. Runs a subset of evals, scores them, compares to baselines in evals/baseline-scores.json. Fails if any score drops below threshold. Use before merging agent behavior changes.
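The baseline file's exact schema isn't shown here; a plausible sketch of evals/baseline-scores.json (eval names reused from this doc, numbers illustrative) simply maps each eval to the score it is expected to hold:

```json
{
  "ranking": 0.85,
  "memory": 0.80,
  "codeflash-internal-psycopg-serialization": 0.70
}
```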
```bash
./evals/run-eval.sh --list                                                  # see all evals (v1 + v2)
./evals/run-eval.sh ranking --skill-only                                    # run a v1 eval
./evals/run-eval.sh codeflash-internal-psycopg-serialization --skill-only   # run a v2 eval
./evals/score-eval.sh evals/results/<dir>                                   # score it
./evals/check-regression.sh                                                 # full regression check
```
CI (runs on every PR)
The validate workflow runs Claude with the plugin-dev plugin to check:
- Plugin structure (frontmatter, manifest, cross-references)
- Agent consistency (all domain agents must have the same experiment loop steps)
- Eval manifest validity
- Skill quality
Warnings are blocking — any issue fails the job. Claude posts a summary comment on the PR.
Key conventions
- Domain agents are self-contained — all methodology is inline, no required file reads before starting
- Every agent uses the same experiment loop structure (choose target > implement > benchmark > keep/discard > commit only on KEEP)
- Changes to one domain agent should be mirrored to others where applicable (CI enforces this)
- The plugin uses .codeflash/ in the user's project for session state (results.tsv, HANDOFF.md); see the sketch after this list
- Language-agnostic methodology lives in plugin/references/shared/; language-specific implementations live under plugin/languages/<lang>/references/
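For context on that session state, a hypothetical results.tsv might record one row per experiment with the benchmark numbers and the KEEP/DISCARD verdict; the columns below are assumptions, not the plugin's actual schema:

```
target            baseline_ms   candidate_ms   verdict
parse_rows        412           388            DISCARD
serialize_batch   950           310            KEEP
```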
Contributing
- Branch off main
- Make changes, push — CI validates automatically
- If you changed agent behavior, trigger an eval regression run before merging