# Infrastructure Design

Benchmarking and CI infrastructure for codeflash self-optimization.

## Architecture

```text
┌─────────────────────────────────────────────────────────┐
│                    Developer Machine                     │
│  • Implement optimization                               │
│  • Push to branch                                       │
│  • Trigger benchmark run                                │
└────────────────────────┬────────────────────────────────┘
                         │ git push
                         ▼
┌─────────────────────────────────────────────────────────┐
│                    GitHub Actions CI                     │
│  • Lint + type check                                    │
│  • Unit tests (fast, every push)                        │
│  • Trigger benchmark VM if perf/ branch                 │
└────────────────────────┬────────────────────────────────┘
                         │ webhook / SSH
                         ▼
┌─────────────────────────────────────────────────────────┐
│              Azure Benchmark VM (cf-bench)              │
│  Standard_D4s_v5 (4 vCPU, 16 GB, non-burstable)         │
│                                                         │
│  • Checkout branch                                      │
│  • Install codeflash in editable mode                   │
│  • Run benchmark suite                                  │
│  • Compare against baseline (main)                      │
│  • Post results as PR comment                           │
└─────────────────────────────────────────────────────────┘
```

## Benchmark VM

### Why a dedicated VM?

| Concern | Laptop | GitHub Actions runner | Dedicated VM |
|---|---|---|---|
| CPU consistency | Poor (thermal, background) | Poor (shared, noisy neighbor) | Good (non-burstable) |
| Reproducibility | Low | Medium | High |
| Cost | Free | Free (but noisy) | ~$0.10/hr (on-demand) |
| Python versions | Whatever's installed | Configurable | Full control |

### VM Spec

| Setting | Value | Rationale |
|---|---|---|
| Size | Standard_D4s_v5 | 4 vCPU, 16 GB RAM — enough to run codeflash on itself without swapping |
| OS | Ubuntu 24.04 LTS | Matches CI, stable |
| Region | westus2 | Low latency, proven reliable |
| Disk | 64 GB Premium SSD | Fast I/O for git, pip cache |
| Scheduling | On-demand (start/stop) | Only runs during benchmark jobs, ~$0.10/hr |
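
Creating the VM is a one-time step. A minimal sketch with the Azure CLI (the `Ubuntu2404` image alias, admin username, and cloud-init filename are assumptions; the resource names match the Start/Stop commands below):

```bash
# One-time creation of the benchmark VM (sketch; adjust for your subscription).
# --custom-data injects the cloud-init file described under Provisioning.
az vm create \
  --resource-group CF-BENCH-RG \
  --name cf-bench \
  --location westus2 \
  --image Ubuntu2404 \
  --size Standard_D4s_v5 \
  --os-disk-size-gb 64 \
  --storage-sku Premium_LRS \
  --admin-username azureuser \
  --generate-ssh-keys \
  --custom-data cloud-init.yaml
```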

### Provisioning

Cloud-init installs:

1. System packages (`git`, `build-essential`, `curl`, `jq`)
2. `uv` (fast Python/venv management)
3. Python 3.12, 3.13, 3.14 via `uv`
4. `hyperfine` v1.19+
5. `memray` (memory profiling)
6. `py-spy` (CPU sampling profiler)
7. Codeflash clone + editable install
8. Benchmark scripts
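
A sketch of what the cloud-init `runcmd` section boils down to (the repo URL, home paths, and the apt-packaged hyperfine version are assumptions; pin versions as needed):

```bash
#!/usr/bin/env bash
# Provisioning sketch: run once as root by cloud-init on first boot.
set -euo pipefail

# 1. System packages
apt-get update && apt-get install -y git build-essential curl jq

# 2-3. uv for the benchmark user, then the Python toolchains
sudo -u azureuser bash -c 'curl -LsSf https://astral.sh/uv/install.sh | sh'
sudo -u azureuser ~azureuser/.local/bin/uv python install 3.12 3.13 3.14

# 4. hyperfine (verify the packaged version meets v1.19+; else install a release binary)
apt-get install -y hyperfine

# 5-6. memray and py-spy as standalone tools
sudo -u azureuser ~azureuser/.local/bin/uv tool install memray
sudo -u azureuser ~azureuser/.local/bin/uv tool install py-spy

# 7. Codeflash clone + editable install (repo URL is an assumption)
sudo -u azureuser git clone https://github.com/codeflash-ai/codeflash ~azureuser/codeflash
sudo -u azureuser bash -c 'cd ~/codeflash && ~/.local/bin/uv venv && ~/.local/bin/uv pip install -e .'
```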

### Start/Stop

```bash
# Start VM (before benchmark run)
az vm start --resource-group CF-BENCH-RG --name cf-bench

# Run benchmarks
ssh azureuser@<ip> "bash ~/bench/bench_all.sh <branch>"

# Stop VM (after benchmark run — stops billing)
az vm deallocate --resource-group CF-BENCH-RG --name cf-bench
```
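
A driver script can chain the three steps and guarantee the VM is deallocated even when a benchmark fails (a sketch; the public-IP lookup via `az vm show -d` is standard Azure CLI, the rest mirrors the commands above):

```bash
#!/usr/bin/env bash
# run_bench.sh <branch>: start VM, run the suite, always stop billing.
set -euo pipefail
branch="$1"
rg=CF-BENCH-RG
vm=cf-bench

az vm start --resource-group "$rg" --name "$vm"
# --show-details (-d) includes the current public IP
ip=$(az vm show -d --resource-group "$rg" --name "$vm" --query publicIps -o tsv)

# Deallocate on exit, success or failure
trap 'az vm deallocate --resource-group "$rg" --name "$vm"' EXIT

ssh "azureuser@${ip}" "bash ~/bench/bench_all.sh ${branch}"
```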

## Benchmark Suite Design

### Layers

```text
Layer 1: Startup (import time, CLI response time)
    ├── python -X importtime
    └── hyperfine: codeflash --version, codeflash --help

Layer 2: Unit operations (micro-benchmarks)
    ├── timeit: AST parsing, test generation, result analysis
    └── Function-level profiling of hot paths

Layer 3: Integration (real optimization runs)
    ├── codeflash optimize <target> on a fixture codebase
    └── Wall clock, memory peak, output quality

Layer 4: Memory
    ├── memray: peak RSS during optimization run
    └── tracemalloc: allocation hotspots
```
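
Representative commands behind Layers 1 and 4 (a sketch; output filenames and the fixture target path are arbitrary, and running codeflash under memray via `-m` assumes the package is executable as a module):

```bash
# Layer 1: CLI startup latency; the JSON export feeds the comparison step
hyperfine --warmup 3 --export-json startup.json \
  'codeflash --version' 'codeflash --help'

# Layer 1: per-module import cost (importtime writes to stderr)
python -X importtime -c "import codeflash" 2> importtime.log

# Layer 4: peak memory during an optimization run
memray run -o optimize.bin -m codeflash optimize fixtures/target.py
memray stats optimize.bin
```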

### Benchmark scripts

| Script | Layer | What it measures |
|---|---|---|
| `bench_startup.sh` | 1 | CLI startup time (`--version`, `--help`, import) |
| `bench_importtime.py` | 1 | Per-module import cost breakdown |
| `bench_micro.py` | 2 | Hot function micro-benchmarks (`timeit`) |
| `bench_optimize.sh` | 3 | Full optimization run on fixture codebase |
| `bench_memory.sh` | 4 | Peak memory during optimization run |
| `bench_all.sh` | * | Run all benchmarks, save results |
| `bench_compare.sh` | * | A/B comparison between two branches |
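
The A/B flow in `bench_compare.sh` can be as simple as running the full suite on both branches and diffing the exports (a sketch; the repo path and results layout follow the conventions in this document, and the `jq` extraction assumes hyperfine's JSON schema):

```bash
#!/usr/bin/env bash
# bench_compare.sh <baseline> <candidate>: run the suite on both, compare means.
set -euo pipefail
base="$1"
cand="$2"

for branch in "$base" "$cand"; do
  git -C ~/codeflash checkout "$branch"
  bash ~/bench/bench_all.sh "$branch"   # writes results/<branch>-<timestamp>/
done

# Pull the hyperfine means from the most recent run of each branch
base_json=$(ls -dt results/"$base"-*/ | head -1)startup.json
cand_json=$(ls -dt results/"$cand"-*/ | head -1)startup.json
jq -n --slurpfile a "$base_json" --slurpfile b "$cand_json" \
  '{command: $a[0].results[0].command,
    baseline_s: $a[0].results[0].mean,
    candidate_s: $b[0].results[0].mean}'
```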

### Fixture codebase

A small but representative Python project for integration benchmarks:

- ~500 lines across 5-10 files
- Mix of pure functions, classes, and I/O-bound code
- Known optimization opportunities (so we can measure "did codeflash find them?")
- Checked into the `fixtures/` directory

### Result format

Each benchmark run produces a directory:

```text
results/<branch>-<timestamp>/
├── startup.json          # hyperfine JSON export
├── importtime.tsv        # Per-module import breakdown
├── micro.json            # Micro-benchmark results
├── optimize.json         # Integration benchmark (wall clock, memory, findings)
├── memory.json           # Peak RSS and allocation data
└── summary.md            # Human-readable summary
```
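
Everything machine-readable is plain JSON, so the summary can be assembled with `jq`; for instance, per-command startup stats come straight out of the hyperfine export (hyperfine stores per-command results under a top-level `results` array):

```bash
# Print mean ± stddev (seconds) for each benchmarked command
run_dir=results/<branch>-<timestamp>   # placeholder, matching the layout above
jq -r '.results[] | "\(.command): \(.mean) ± \(.stddev)"' "$run_dir/startup.json"
```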

## CI Integration

On every push to a `perf/*` branch:

1. Run unit tests (GitHub Actions, fast)
2. Start the benchmark VM
3. Run `bench_compare.sh main <branch>` on the VM
4. Post results as a PR comment via `gh`
5. Deallocate the VM
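
Steps 4 and 5 need nothing beyond the `gh` and `az` CLIs in the workflow; a sketch, assuming the job has `GH_TOKEN` set, the VM's IP in `$ip`, and the PR number in `$PR_NUMBER` (the `results/latest` path is an assumption about the bench scripts):

```bash
# Step 4: fetch the summary from the VM and post it on the PR
scp "azureuser@${ip}:bench/results/latest/summary.md" summary.md
gh pr comment "$PR_NUMBER" --body-file summary.md

# Step 5: stop billing
az vm deallocate --resource-group CF-BENCH-RG --name cf-bench
```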

PR comment format:

```markdown
## Benchmark Results: `perf/optimize-startup` vs `main`

| Metric | main | branch | Delta |
|---|---|---|---|
| `codeflash --version` | 450ms | 120ms | **-73% (3.75x)** |
| `import codeflash` | 380ms | 310ms | **-18%** |
| Full optimize run | 12.3s | 11.8s | **-4%** |
| Peak memory | 245 MB | 230 MB | **-6%** |

<details><summary>Per-module import breakdown</summary>
...
</details>
```

## Cost

| Component | Cost | Frequency |
|---|---|---|
| Benchmark VM (D4s_v5) | ~$0.10/hr | On-demand, ~10 min per run |
| Storage (64 GB SSD) | ~$10/month | Always |
| GitHub Actions | Free (public) / included (private) | Every push |

Estimated: ~$15/month with daily benchmark runs (storage dominates; at ~10 min/day, VM compute comes to under $1/month).