
# codeflash-agent — how this repo works

## Packages (UV workspace)
- `packages/codeflash-core/` — shared foundation: models, AI client, telemetry, git helpers
- `packages/codeflash-python/` — Python language CLI (`codeflash` command), extends core
- `packages/codeflash-mcp/` — MCP server (stub)
- `packages/codeflash-lsp/` — LSP server (stub)

## Services
- `services/github-app/` — GitHub App integration service

## Plugin (language-agnostic)
- `plugin/agents/codeflash-review.md` — review agent
- `plugin/agents/codeflash-researcher.md` — research agent
- `plugin/commands/` — codex CLI commands
- `vendor/codex/` — codex companion scripts and schemas (vendored)
- `plugin/references/shared/` — shared methodology (experiment loop, templates, benchmarks)
- `plugin/hooks/` — session lifecycle and review gate hooks

## Languages (per-language content)
- `languages/python/plugin/agents/codeflash.md` — router that detects the domain and delegates
- `languages/python/plugin/agents/codeflash-cpu.md`, `codeflash-memory.md`, `codeflash-async.md`, `codeflash-structure.md` — one agent per domain
- `languages/python/plugin/agents/codeflash-setup.md` — detects project env, installs deps
- `languages/python/plugin/skills/` — `/codeflash-optimize` entry point, memray profiling
- `languages/python/plugin/references/` — domain-specific deep-dive docs (async, memory, data-structures, structure)

## Evals
- `evals/templates/` — 9 synthetic eval scenarios (v1: ranking, memory, crossdomain, layered)
- `evals/repos/` — real-repo evals (v2: clone a repo at a specific commit, agent finds and fixes the bug)

## CI (runs on every PR)

The `validate` workflow runs Claude with the `plugin-dev` plugin to check:

- Plugin structure (frontmatter, manifest, cross-references)
- Agent consistency (all domain agents must have the same experiment loop steps)
- Eval manifest validity
- Skill quality

Warnings are blocking — any issue fails the job. Claude posts a summary comment on the PR.

## Running evals

Two types of evals, both run through `run-eval.sh`:

**v1 (templates)** — Small synthetic projects in `evals/templates/`. Each bundles source code, tests, and a `pyproject.toml`. The runner copies the template to a temp dir, installs deps with `uv`, and runs Claude. Good for testing specific agent behaviors (ranking accuracy, memory profiling methodology, cross-domain detection). 9 templates across ranking, memory, crossdomain, and layered types.
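The copy-then-install step of a v1 run can be sketched roughly as follows. This is an illustration only, not the contents of `run-eval.sh`: the helper names `stage_template` and `install_command` are hypothetical, and `uv sync` is an assumed choice of install command.

```python
import shutil
import tempfile
from pathlib import Path


def stage_template(template_dir: Path) -> Path:
    """Copy a v1 template into a fresh temp dir and return the copy's path."""
    workdir = Path(tempfile.mkdtemp(prefix="codeflash-eval-"))
    target = workdir / template_dir.name
    shutil.copytree(template_dir, target)
    return target


def install_command() -> list[str]:
    # Assumption: dependencies are installed with `uv sync`; the real runner
    # may use a different uv subcommand.
    return ["uv", "sync"]


if __name__ == "__main__":
    # Build a throwaway template so the sketch runs without the repo checked out.
    source = Path(tempfile.mkdtemp()) / "ranking"
    (source / "tests").mkdir(parents=True)
    (source / "pyproject.toml").write_text('[project]\nname = "ranking"\nversion = "0.0.0"\n')
    staged = stage_template(source)
    print(staged, sorted(p.name for p in staged.iterdir()))
```

Working on a copy keeps the template itself pristine, so each run starts from the same state regardless of what the agent changed last time.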

**v2 (repos)** — Real repos in `evals/repos/`. Each has a `manifest.json` pointing to a GitHub repo + commit where a known bug exists. The runner shallow-clones the repo (cached locally after first run), drops Claude in, and the agent handles everything — setup, profiling, diagnosis, fix. More realistic but slower and more expensive (~$2/run). The manifest includes a `fix_commit` for reference and a rubric for scoring.
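To make the manifest's role concrete, here is a minimal sketch of loading and validating one. Only `fix_commit` is named in the description above; the other keys (`repo`, `commit`, `rubric`) and the shape of the rubric are assumptions for illustration, so check a real `manifest.json` for the actual schema.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RepoEvalManifest:
    repo: str          # hypothetical key: GitHub repo, e.g. "owner/name"
    commit: str        # hypothetical key: commit where the known bug exists
    fix_commit: str    # named in the text above; kept for reference only
    rubric: list[str]  # hypothetical shape: one grading criterion per entry


def parse_manifest(text: str) -> RepoEvalManifest:
    """Parse manifest JSON and fail loudly if an expected key is absent."""
    data = json.loads(text)
    missing = {"repo", "commit", "fix_commit", "rubric"} - set(data)
    if missing:
        raise ValueError(f"manifest is missing keys: {sorted(missing)}")
    return RepoEvalManifest(
        data["repo"], data["commit"], data["fix_commit"], list(data["rubric"])
    )


# Placeholder values only; no real repo or commit is implied.
EXAMPLE = """
{
  "repo": "example-org/example-repo",
  "commit": "<buggy-commit-sha>",
  "fix_commit": "<fix-commit-sha>",
  "rubric": ["Used a profiler before changing code", "Tests pass after the fix"]
}
"""
manifest = parse_manifest(EXAMPLE)
print(manifest.repo, len(manifest.rubric))
```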

Each eval produces results in `evals/results/<name>-<timestamp>/`. Score with `score.py`, which uses a mix of deterministic checks (did the agent use a profiler? did tests pass?) and LLM grading against the manifest's rubric.
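One way to picture the two-part scoring is a blend of a pass rate over the deterministic checks and a normalized rubric grade. The sketch below is not `score.py`; the `EvalScore` class, the check names, and the 50/50 weighting are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class EvalScore:
    deterministic: dict[str, bool]  # e.g. "used_profiler", "tests_passed"
    rubric_points: float            # points awarded by the LLM grader
    rubric_max: float               # maximum points the rubric allows

    def total(self, deterministic_weight: float = 0.5) -> float:
        """Blend both parts into a 0..1 score (the weighting is an assumption)."""
        if self.deterministic:
            det = sum(self.deterministic.values()) / len(self.deterministic)
        else:
            det = 0.0
        llm = self.rubric_points / self.rubric_max if self.rubric_max else 0.0
        return deterministic_weight * det + (1 - deterministic_weight) * llm


score = EvalScore(
    {"used_profiler": True, "tests_passed": False},
    rubric_points=6,
    rubric_max=8,
)
print(score.total())  # → 0.625 (half the checks passed, 6/8 on the rubric)
```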

**Regression testing** — Go to Actions → "Eval Regression" → Run workflow. Runs a subset of evals, scores them, compares to baselines in `evals/baseline-scores.json`. Fails if any score drops below threshold. Use before merging agent behavior changes.

```
./evals/run-eval.sh --list                  # see all evals (v1 + v2)
./evals/run-eval.sh ranking --skill-only    # run a v1 eval
./evals/run-eval.sh codeflash-internal-psycopg-serialization --skill-only  # run a v2 eval
./evals/score-eval.sh evals/results/<dir>   # score it
./evals/check-regression.sh                 # full regression check
```
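The baseline comparison behind the regression check can be sketched as below. This assumes `evals/baseline-scores.json` is a flat mapping from eval name to score and that "drops below threshold" means falling more than a tolerance under the baseline; both are guesses, and `find_regressions` is a hypothetical helper rather than the logic in `check-regression.sh`.

```python
import json


def find_regressions(
    baselines: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, tuple[float, float]]:
    """Return {eval_name: (baseline, current)} for every eval that regressed."""
    regressions = {}
    for name, baseline in baselines.items():
        if name not in current:
            continue  # eval was not part of this run's subset
        if current[name] < baseline - tolerance:
            regressions[name] = (baseline, current[name])
    return regressions


# Made-up scores; a real run would read evals/baseline-scores.json instead.
baselines = json.loads('{"ranking": 0.80, "memory": 0.70, "layered": 0.90}')
current = {"ranking": 0.85, "memory": 0.60}
failed = find_regressions(baselines, current, tolerance=0.05)
print(failed)  # a non-empty result would fail the workflow
```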

## Key conventions

- Domain agents are self-contained — all methodology is inline, no required file reads before starting
- Every agent uses the same experiment loop structure (choose target → implement → benchmark → keep/discard → commit only on KEEP)
- Changes to one domain agent should be mirrored to others where applicable (CI enforces this)
- The plugin uses `.codeflash/` in the user's project for session state (results.tsv, HANDOFF.md)
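
The keep/discard step of the experiment loop, together with the `results.tsv` log, might look like the sketch below. The column names, the 5% speedup margin, and the timing-based metric are assumptions (a memory-focused agent would compare a different measurement); the agents' inline methodology is the source of truth.

```python
import csv
import tempfile
from pathlib import Path


def decide(baseline_s: float, candidate_s: float, min_speedup: float = 1.05) -> str:
    """KEEP only when the candidate beats the baseline by a margin (assumed 5%)."""
    return "KEEP" if baseline_s / candidate_s >= min_speedup else "DISCARD"


def record(results_path: Path, target: str, baseline_s: float, candidate_s: float) -> str:
    """Append one experiment to results.tsv and return the decision."""
    decision = decide(baseline_s, candidate_s)
    results_path.parent.mkdir(parents=True, exist_ok=True)
    is_new = not results_path.exists()
    with results_path.open("a", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        if is_new:
            # Hypothetical header; the real results.tsv columns may differ.
            writer.writerow(["target", "baseline_s", "candidate_s", "decision"])
        writer.writerow([target, baseline_s, candidate_s, decision])
    return decision  # the caller commits only when this is "KEEP"


# Use a temp dir in place of the user's project root.
results = Path(tempfile.mkdtemp()) / ".codeflash" / "results.tsv"
print(record(results, "parse_rows", baseline_s=2.0, candidate_s=1.0))    # KEEP
print(record(results, "load_config", baseline_s=1.0, candidate_s=0.99))  # DISCARD
```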

## Contributing

1. Branch off main
2. Make changes, push — CI validates automatically
3. If you changed agent behavior, trigger an eval regression run before merging