
codeflash-agent

Monorepo for the Codeflash optimization platform: Python packages, Claude Code plugin, and services.

Case Studies

Active case-study data lives in .codeflash/{org}/{project}/ (status notes, bench scripts, raw data, VM infrastructure details). Summaries are built from that data into case-studies/{org}/{project}/.

Active case studies in .codeflash/:

  • microsoft/typeagent
  • unstructured/core-product
  • netflix/metaflow
  • coveragepy/coveragepy
  • textualize/rich
  • pypa/pip
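For one project, the two trees might look like this (status.md and data/results.tsv appear later in this document; the other file names are illustrative):

```text
.codeflash/netflix/metaflow/
├── status.md        # current state of the case study
├── bench/           # bench scripts (names illustrative)
├── data/
│   └── results.tsv  # VM benchmark results
└── vm/              # VM infra details
case-studies/netflix/metaflow/
└── summary.md       # summary built from .codeflash/ (name illustrative)
```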

Directory conventions

Target repos live in ~/Desktop/work/{org}_org/{project}:

  • microsoft_org/typeagent
  • unstructured_org/core-product
  • netflix_org/metaflow
  • coveragepy_org/coveragepy

Optimization flow

  1. Make changes in the target repo on a perf/<description> branch
  2. Run tests locally to verify nothing breaks
  3. Commit and push to the fork
  4. Benchmark on the VM via ssh -A azureuser@<ip> "cd ~/<project> && git fetch origin && ..."
  5. Record results in .codeflash/{org}/{project}/data/results.tsv
  6. Update status.md in .codeflash/{org}/{project}/
  7. Open a PR on the fork with VM benchmark numbers
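Steps 1-4 can be sketched as a dry run; the branch name, VM IP, project, and bench script below are illustrative placeholders, and the script prints each command instead of executing it (steps 5-7 are manual edits and the PR):

```shell
# Dry-run sketch of the optimization flow; prints commands rather than running them.
# BRANCH, VM_IP, PROJECT, and bench.sh are illustrative placeholders.
BRANCH="perf/faster-json-decode"
VM_IP="203.0.113.7"
PROJECT="metaflow"

CMDS=""
run() { CMDS="$CMDS$* ; "; echo "+ $*"; }   # swap echo for "$@" to execute for real

run git checkout -b "$BRANCH"               # 1. branch in the target repo
run .venv/bin/python -m pytest              # 2. run the tests locally
run git commit -am "perf: description"      # 3. commit and ...
run git push fork "$BRANCH"                 #    ... push to the fork
run ssh -A "azureuser@$VM_IP" \
    "cd ~/$PROJECT && git fetch origin && git checkout $BRANCH && ./bench.sh"  # 4. benchmark on the VM
```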

VM access

VMs use SSH agent forwarding -- always connect with ssh -A:

Project        VM IP          Size              Resource group
core-product   40.65.91.158   Standard_E4s_v5   core-product-BENCH-RG
typeagent      40.65.81.123   Standard_D2s_v5   typeagent-BENCH-RG

If SSH times out, check:

  1. The VM is running -- start it if it was deallocated: az vm start --resource-group <RG> --name <vm>
  2. The NSG allows your current IP -- update the AllowSSHFromMyIP source address in the Azure portal or via az network nsg rule update
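Both recovery steps as concrete az invocations; the resource group, VM name, and NSG name below are illustrative placeholders, and the commands are printed rather than executed:

```shell
# Placeholders -- substitute real values from the VM access table.
RG="typeagent-BENCH-RG"
VM="typeagent-vm"       # illustrative VM name
MY_IP="203.0.113.7"     # your current public IP, e.g. from: curl -s ifconfig.me

# 1. Start the VM if it was deallocated (printed, not executed).
START_CMD="az vm start --resource-group $RG --name $VM"
echo "$START_CMD"

# 2. Re-point the NSG rule at your current IP ("${VM}-nsg" is an assumed NSG name).
NSG_CMD="az network nsg rule update --resource-group $RG --nsg-name ${VM}-nsg --name AllowSSHFromMyIP --source-address-prefixes $MY_IP"
echo "$NSG_CMD"
```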

PR strategy

  • Individual PRs on the fork (KRRT7/<repo>) -- one per optimization on a perf/<description> branch. Each is self-contained with its own benchmark numbers.
  • Stacked draft PR (optional) on the fork (--base main --head optimization) -- accumulates all optimizations, shows cumulative gain.
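An individual fork PR in this style might be opened with gh; the repo name, branch, title, and body file are illustrative placeholders, and the command is printed rather than executed:

```shell
# Illustrative individual PR on the fork -- one optimization, one perf/ branch.
PR_CMD='gh pr create --repo KRRT7/metaflow --base main --head perf/faster-json-decode --title "perf: faster JSON decode (VM-benchmarked)" --body-file pr-body.md'
echo "$PR_CMD"
```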

Benchmarking

  • codeflash compare for internal benchmarks (fork PRs) -- worktree-isolated, per-function breakdown, structured markdown. Does NOT handle import time yet -- use hyperfine for that.
  • hyperfine for upstream PRs and import time measurements -- portable, no codeflash dependency for maintainers to install.
  • Keep the VM running during optimization sessions -- don't deallocate between benchmarks.
  • Cloud-init must use ASCII only -- Azure CLI chokes on non-ASCII (em dashes, etc.)
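An import-time measurement in the hyperfine style described above (the package name is illustrative; printed as a dry run so it can be copied to a machine with hyperfine installed):

```shell
# Measure cold import time of a package (name illustrative) with hyperfine.
HYPERFINE_CMD="hyperfine --warmup 3 --runs 20 'python -c \"import metaflow\"'"
echo "$HYPERFINE_CMD"
```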

Runner convention

Use $RUNNER in docs and scripts to refer to the Python runner. The value depends on context:

Context                   $RUNNER value      Why
VM benchmark scripts      .venv/bin/python   Accuracy -- uv run adds ~50% overhead and 2.5x variance
Upstream PR reproducers   uv run python      Portability -- matches how the target team works
Setup / verify steps      uv run python      Measurement accuracy doesn't matter
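A minimal sketch of the convention: scripts read $RUNNER from the environment with a context-appropriate default, so the same script works in every context (python3 is used as the fallback here only so the sketch runs anywhere; on a VM bench script the default would be .venv/bin/python):

```shell
# Resolve the runner from the environment; the python3 fallback is illustrative.
RUNNER="${RUNNER:-python3}"
OUT="$("$RUNNER" -c 'print("runner ok")')"
echo "$OUT"
```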