codeflash-agent — how this repo works

Packages (UV workspace)

packages/codeflash-core/ — shared foundation: models, AI client, telemetry, git helpers
packages/codeflash-python/ — Python language CLI (codeflash command), extends core
packages/codeflash-mcp/ — MCP server (stub)
packages/codeflash-lsp/ — LSP server (stub)

Services

services/github-app/ — GitHub App integration service

Plugin (language-agnostic)

plugin/agents/codeflash-review.md — review agent
plugin/agents/codeflash-researcher.md — research agent
plugin/commands/ — codex CLI commands
vendor/codex/ — codex companion scripts and schemas (vendored)
plugin/references/shared/ — shared methodology (experiment loop, templates, benchmarks)
plugin/hooks/ — session lifecycle and review gate hooks

Languages (per-language content)

languages/python/plugin/agents/codeflash.md — router that detects the domain and delegates
languages/python/plugin/agents/codeflash-cpu.md, codeflash-memory.md, codeflash-async.md, codeflash-structure.md — one agent per domain
languages/python/plugin/agents/codeflash-setup.md — detects project env, installs deps
languages/python/plugin/skills/ — /codeflash-optimize entry point, memray profiling
languages/python/plugin/references/ — domain-specific deep-dive docs (async, memory, data-structures, structure)

Evals

evals/templates/ — 9 synthetic eval scenarios (v1: ranking, memory, crossdomain, layered)
evals/repos/ — real-repo evals (v2: clone a repo at a specific commit, agent finds and fixes the bug)

CI (runs on every PR)

The validate workflow runs Claude with the plugin-dev plugin to check:

Plugin structure (frontmatter, manifest, cross-references)
Agent consistency (all domain agents must have the same experiment loop steps)
Eval manifest validity
Skill quality

Warnings are blocking — any issue fails the job. Claude posts a summary comment on the PR.

Evals

Two types of evals, both run through run-eval.sh:

v1 (templates) — Small synthetic projects in evals/templates/. Each bundles source code, tests, and a pyproject.toml. The runner copies the template to a temp dir, installs deps with uv, and runs Claude. Good for testing specific agent behaviors (ranking accuracy, memory profiling methodology, cross-domain detection). 9 templates across ranking, memory, crossdomain, and layered types.
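The staging step described above (copy the template to a temp dir before anything touches it) can be sketched roughly as follows. This is an illustration, not the code in run-eval.sh: the function name `stage_template` and the temp-dir prefix are hypothetical, and the dependency install and Claude invocation are left out.

```python
import shutil
import tempfile
from pathlib import Path


def stage_template(template_dir: Path) -> Path:
    """Copy a template into a fresh temp dir and return the path of the copy.

    Hypothetical helper: the real runner is a shell script and may differ.
    """
    # A new directory per run keeps the checked-in template untouched.
    workdir = Path(tempfile.mkdtemp(prefix="codeflash-eval-"))
    target = workdir / template_dir.name
    shutil.copytree(template_dir, target)
    return target
```

Working on a copy means a run can edit source files or install dependencies freely without dirtying evals/templates/.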

v2 (repos) — Real repos in evals/repos/. Each has a manifest.json pointing to a GitHub repo + commit where a known bug exists. The runner shallow-clones the repo (cached locally after first run), drops Claude in, and the agent handles everything — setup, profiling, diagnosis, fix. More realistic but slower and more expensive (~$2/run). The manifest includes a fix_commit for reference and a rubric for scoring.
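As a rough picture of what such a manifest carries, here is a minimal loader sketch. Only `fix_commit` and `rubric` are named in the text above; the `repo` and `commit` key names, the list-of-strings shape of the rubric, and the `RepoEvalManifest` and `load_manifest` names are assumptions made for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RepoEvalManifest:
    repo: str          # assumed key name: where to clone from
    commit: str        # assumed key name: commit where the known bug exists
    fix_commit: str    # named in the README: reference fix, not shown to the agent
    rubric: list[str]  # named in the README; the list-of-strings shape is assumed


def load_manifest(text: str) -> RepoEvalManifest:
    """Parse manifest JSON and fail loudly if an expected key is absent."""
    data = json.loads(text)
    required = ("repo", "commit", "fix_commit", "rubric")
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"manifest is missing keys: {missing}")
    return RepoEvalManifest(
        repo=data["repo"],
        commit=data["commit"],
        fix_commit=data["fix_commit"],
        rubric=list(data["rubric"]),
    )
```

Validating up front is cheaper than discovering a missing key after a clone and a paid agent run.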

Each eval produces results in evals/results/<name>-<timestamp>/. Score with score.py, which uses a mix of deterministic checks (did the agent use a profiler? did tests pass?) and LLM grading against the manifest's rubric.
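One way to picture the mix of deterministic checks and LLM grading is a weighted blend. The sketch below is an assumption throughout: the README does not say how score.py combines the two, so the `combine_score` name, the 0 to 1 grade scale, and the 50/50 default weighting are all hypothetical.

```python
def combine_score(
    checks: dict[str, bool],
    llm_grade: float,
    check_weight: float = 0.5,
) -> float:
    """Blend pass/fail checks with an LLM rubric grade (assumed to lie in [0, 1])."""
    if not 0.0 <= llm_grade <= 1.0:
        raise ValueError("llm_grade must be between 0 and 1")
    if not 0.0 <= check_weight <= 1.0:
        raise ValueError("check_weight must be between 0 and 1")
    # Fraction of deterministic checks that passed; no checks counts as 0.
    check_score = sum(checks.values()) / len(checks) if checks else 0.0
    return check_weight * check_score + (1.0 - check_weight) * llm_grade
```

For example, `combine_score({"used_profiler": True, "tests_passed": False}, 1.0)` gives 0.75 under the default weighting: half the checks passed, and the rubric grade was perfect.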

Regression testing — Go to Actions → "Eval Regression" → Run workflow. The workflow runs a subset of evals, scores them, and compares the scores to the baselines in evals/baseline-scores.json. It fails if any score drops below its threshold. Run it before merging changes to agent behavior.
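The baseline comparison can be sketched as below. The shape of evals/baseline-scores.json is not described here, so this assumes a flat mapping from eval name to score, and it treats the threshold as the baseline minus a tolerance. Both are assumptions, as is the `find_regressions` name.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, tuple[float, float]]:
    """Return {eval name: (baseline score, current score)} for every regression."""
    regressions: dict[str, tuple[float, float]] = {}
    for name, base in baseline.items():
        score = current.get(name)
        if score is None:
            # Only a subset of evals runs, so a missing score is not a failure.
            continue
        if score < base - tolerance:
            regressions[name] = (base, score)
    return regressions
```

A CI job built on this would exit non-zero whenever the returned mapping is non-empty.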

./evals/run-eval.sh --list                  # see all evals (v1 + v2)
./evals/run-eval.sh ranking --skill-only    # run a v1 eval
./evals/run-eval.sh codeflash-internal-psycopg-serialization --skill-only  # run a v2 eval
./evals/score-eval.sh evals/results/<dir>   # score it
./evals/check-regression.sh                 # full regression check

Key conventions

Domain agents are self-contained — all methodology is inline, no required file reads before starting
Every agent uses the same experiment loop structure (choose target → implement → benchmark → keep/discard → commit only on KEEP)
Changes to one domain agent should be mirrored to others where applicable (CI enforces this)
The plugin uses .codeflash/ in the user's project for session state (results.tsv, HANDOFF.md)
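The shared experiment loop in the conventions above (choose target, implement, benchmark, keep or discard, commit only on KEEP) can be sketched in a few lines. The agents follow this loop as written methodology, not as code, so everything here is illustrative: the callback names, the `Result` record, and the "lower time is better" criterion are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Result:
    target: str
    seconds: float
    decision: str  # "KEEP" or "DISCARD"


def experiment_loop(
    targets: Iterable[str],
    implement: Callable[[str], None],
    benchmark: Callable[[], float],
    commit: Callable[[str], None],
    revert: Callable[[str], None],
) -> list[Result]:
    """Run one experiment per target and commit only changes that beat the best time."""
    best = benchmark()  # baseline measurement before any change
    results: list[Result] = []
    for target in targets:       # choose target
        implement(target)        # implement
        seconds = benchmark()    # benchmark
        if seconds < best:       # keep or discard
            commit(target)       # commit only on KEEP
            best = seconds
            results.append(Result(target, seconds, "KEEP"))
        else:
            revert(target)       # a discarded change never reaches history
            results.append(Result(target, seconds, "DISCARD"))
    return results
```

Each `Result` corresponds loosely to one row a session might record in .codeflash/results.tsv, though the actual columns of that file are not specified here.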

Contributing

Branch off main
Make changes, push — CI validates automatically
If you changed agent behavior, trigger an eval regression run before merging
