# Infrastructure Design
Benchmarking and CI infrastructure for codeflash self-optimization.
## Architecture
```
┌─────────────────────────────────────────────────────────┐
│ Developer Machine                                       │
│ • Implement optimization                                │
│ • Push to branch                                        │
│ • Trigger benchmark run                                 │
└────────────────────────┬────────────────────────────────┘
                         │ git push
                         ▼
┌─────────────────────────────────────────────────────────┐
│ GitHub Actions CI                                       │
│ • Lint + type check                                     │
│ • Unit tests (fast, every push)                         │
│ • Trigger benchmark VM if perf/ branch                  │
└────────────────────────┬────────────────────────────────┘
                         │ webhook / SSH
                         ▼
┌─────────────────────────────────────────────────────────┐
│ Azure Benchmark VM (cf-bench)                           │
│ Standard_D4s_v5 (4 vCPU, 16 GB, non-burstable)          │
│                                                         │
│ • Checkout branch                                       │
│ • Install codeflash in editable mode                    │
│ • Run benchmark suite                                   │
│ • Compare against baseline (main)                       │
│ • Post results as PR comment                            │
└─────────────────────────────────────────────────────────┘
```
## Benchmark VM
### Why dedicated VM?
| Concern | Laptop | GitHub Actions runner | Dedicated VM |
|---|---|---|---|
| CPU consistency | Poor (thermal throttling, background load) | Poor (shared hardware, noisy neighbors) | **Good** (non-burstable) |
| Reproducibility | Low | Medium | **High** |
| Cost | Free | Free (but results are noisy) | ~$0.10/hr (on-demand) |
| Python versions | Whatever's installed | Configurable | **Full control** |
### VM Spec
| Setting | Value | Rationale |
|---|---|---|
| Size | `Standard_D4s_v5` | 4 vCPU, 16 GB RAM — enough to run codeflash on itself without swapping |
| OS | Ubuntu 24.04 LTS | Matches CI, stable |
| Region | `westus2` | Low latency, proven reliable |
| Disk | 64 GB Premium SSD | Fast I/O for git, pip cache |
| Scheduling | On-demand (start/stop) | Only runs during benchmark jobs, ~$0.10/hr |
### Provisioning
Cloud-init installs:

1. System packages (git, build-essential, curl, jq)
2. uv (fast Python/venv management)
3. Python 3.12, 3.13, 3.14 via uv
4. hyperfine v1.19+
5. memray (memory profiling)
6. py-spy (CPU sampling profiler)
7. Codeflash clone + editable install
8. Benchmark scripts
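A quick smoke test after provisioning can confirm the toolchain actually landed on PATH. This is a minimal sketch, assuming the tool list above; `check_tools` is a hypothetical helper, not an existing script:

```python
import shutil

def check_tools(tools):
    """Map each expected tool name to whether it is found on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}

# Tools the cloud-init config above is expected to install (assumption:
# the installers put them on PATH under these exact names).
expected = ["git", "curl", "jq", "uv", "hyperfine", "py-spy", "memray"]
status = check_tools(expected)
missing = [tool for tool, found in status.items() if not found]
print("missing:", missing or "none")
```

Running this as the last cloud-init step would fail fast on a half-provisioned VM instead of producing silently skewed benchmarks.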
### Start/Stop
```bash
# Start VM (before benchmark run)
az vm start --resource-group CF-BENCH-RG --name cf-bench

# Run benchmarks
ssh azureuser@<ip> "bash ~/bench/bench_all.sh <branch>"

# Stop VM (after benchmark run — stops billing)
az vm deallocate --resource-group CF-BENCH-RG --name cf-bench
```
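The same lifecycle can be driven programmatically from CI. A hedged Python sketch (the wrapper and its function names are illustrative, not an existing script; the `az`/`ssh` invocations mirror the block above) whose main point is that deallocation runs even when the benchmark fails:

```python
import subprocess

RESOURCE_GROUP = "CF-BENCH-RG"  # values from the az commands above
VM_NAME = "cf-bench"

def az_vm_cmd(action):
    """Build the az CLI argument list for a VM lifecycle action."""
    return ["az", "vm", action,
            "--resource-group", RESOURCE_GROUP,
            "--name", VM_NAME]

def run_benchmarks(host, branch):
    """Start the VM, run the suite over SSH, and always deallocate."""
    subprocess.run(az_vm_cmd("start"), check=True)
    try:
        subprocess.run(
            ["ssh", f"azureuser@{host}",
             f"bash ~/bench/bench_all.sh {branch}"],
            check=True)
    finally:
        # Deallocate even on failure, so a broken run does not keep billing.
        subprocess.run(az_vm_cmd("deallocate"), check=True)
```

The `try/finally` is the important part: a bare shell script that exits mid-run on error would leave the VM allocated and billing.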
## Benchmark Suite Design
### Layers
```
Layer 1: Startup (import time, CLI response time)
  └── python -X importtime
  └── hyperfine: codeflash --version, codeflash --help

Layer 2: Unit operations (micro-benchmarks)
  └── timeit: AST parsing, test generation, result analysis
  └── Function-level profiling of hot paths

Layer 3: Integration (real optimization runs)
  └── codeflash optimize <target> on a fixture codebase
  └── Wall clock, memory peak, output quality

Layer 4: Memory
  └── memray: peak RSS during optimization run
  └── tracemalloc: allocation hotspots
```
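A Layer 2 measurement can be as simple as `timeit` around one hot operation. A minimal sketch, using AST parsing as a stand-in for the operations listed above (the source snippet and repeat counts are illustrative):

```python
import ast
import timeit

SOURCE = "def f(x):\n    return x * 2\n"

def parse_source():
    return ast.parse(SOURCE)

# Take the best of several repeats: the minimum is more robust to
# background interference than a single run or the mean.
repeats = timeit.repeat(parse_source, number=1000, repeat=5)
best = min(repeats)
per_call_us = best / 1000 * 1e6
print(f"ast.parse: {per_call_us:.1f} us/call (best of 5)")
```

Reporting the minimum of repeats is a common micro-benchmark convention; anything above it is measurement noise, not the operation itself.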
### Benchmark scripts
| Script | Layer | What it measures |
|---|---|---|
| `bench_startup.sh` | 1 | CLI startup time (--version, --help, import) |
| `bench_importtime.py` | 1 | Per-module import cost breakdown |
| `bench_micro.py` | 2 | Hot function micro-benchmarks (timeit) |
| `bench_optimize.sh` | 3 | Full optimization run on fixture codebase |
| `bench_memory.sh` | 4 | Peak memory during optimization run |
| `bench_all.sh` | * | Run all benchmarks, save results |
| `bench_compare.sh` | * | A/B comparison between two branches |
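`python -X importtime` writes a self/cumulative time breakdown to stderr, one `import time:` line per module. A sketch of the kind of parsing `bench_importtime.py` might do (the parsing logic is an assumption about that script, and the sample lines stand in for real stderr output; nesting depth, encoded as leading spaces in the module column, is discarded here):

```python
SAMPLE_STDERR = """\
import time: self [us] | cumulative | imported package
import time:       136 |        136 | _io
import time:       453 |        589 | io
"""

def parse_importtime(text):
    """Parse -X importtime stderr into (module, self_us, cumulative_us) rows."""
    rows = []
    for line in text.splitlines():
        if not line.startswith("import time:"):
            continue
        body = line[len("import time:"):]
        self_part, cum_part, module = (field.strip() for field in body.split("|"))
        if not self_part.isdigit():
            continue  # skip the header row
        rows.append((module, int(self_part), int(cum_part)))
    return rows

rows = parse_importtime(SAMPLE_STDERR)
```

Sorting `rows` by self time descending gives the per-module breakdown that lands in `importtime.tsv`.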
### Fixture codebase
A small but representative Python project for integration benchmarks:

- ~500 lines across 5-10 files
- Mix of pure functions, classes, I/O-bound code
- Known optimization opportunities (so we can measure "did codeflash find them?")
- Checked into `fixtures/` directory
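One kind of planted opportunity, sketched here with hypothetical fixture content (these function names are not actual codeflash code): a quadratic string build that an optimizer should rewrite as a join.

```python
def slow_join(parts):
    """Deliberately quadratic: repeated string concatenation in a loop."""
    out = ""
    for part in parts:
        out += part + ","
    return out.rstrip(",")

def fast_join(parts):
    """The expected optimized form."""
    return ",".join(parts)

# Both must agree, so the benchmark can check correctness as well as speed.
assert slow_join(["a", "b", "c"]) == fast_join(["a", "b", "c"]) == "a,b,c"
```

Because the expected rewrite is known in advance, `bench_optimize.sh` can score both "was the opportunity found?" and "is the rewritten code still correct?".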
### Result format
Each benchmark run produces a directory:

```
results/<branch>-<timestamp>/
├── startup.json      # hyperfine JSON export
├── importtime.tsv    # Per-module import breakdown
├── micro.json        # Micro-benchmark results
├── optimize.json     # Integration benchmark (wall clock, memory, findings)
├── memory.json       # Peak RSS and allocation data
└── summary.md        # Human-readable summary
```
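`startup.json` follows hyperfine's `--export-json` schema: a top-level `results` list with `command`, `mean`, `stddev`, and related fields per entry. A sketch of summarizing it, with an inline sample standing in for a real export:

```python
import json

# Inline stand-in for a real hyperfine --export-json file (times in seconds).
SAMPLE = json.dumps({
    "results": [
        {"command": "codeflash --version", "mean": 0.450, "stddev": 0.012},
        {"command": "codeflash --help", "mean": 0.470, "stddev": 0.015},
    ]
})

def summarize(raw):
    """Map each benchmarked command to its mean runtime in milliseconds."""
    data = json.loads(raw)
    return {r["command"]: round(r["mean"] * 1000, 3) for r in data["results"]}

means_ms = summarize(SAMPLE)
print(means_ms)
```

The same shape feeds `bench_compare.sh`: load two exports, join on `command`, and diff the means.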
## CI Integration
### On every push to `perf/*` branch:

1. Run unit tests (GitHub Actions, fast)
2. Start benchmark VM
3. Run `bench_compare.sh main <branch>` on VM
4. Post results as PR comment via `gh`
5. Deallocate VM
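A hedged sketch of what the Actions side of these steps could look like (workflow name, secret names, and the `gh` invocation are assumptions, not the actual workflow; Azure login and PR-number resolution are omitted for brevity):

```yaml
name: perf-benchmark
on:
  push:
    branches: ["perf/**"]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Start benchmark VM
        run: az vm start --resource-group CF-BENCH-RG --name cf-bench
      - name: Run comparison on VM
        run: >
          ssh azureuser@${{ secrets.BENCH_VM_IP }}
          "bash ~/bench/bench_compare.sh main ${{ github.ref_name }}"
      - name: Post results as PR comment
        run: gh pr comment --body-file results/summary.md
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
      - name: Deallocate VM
        if: always()
        run: az vm deallocate --resource-group CF-BENCH-RG --name cf-bench
```

The `if: always()` on the deallocate step matters for the same reason as the `try/finally` pattern: the VM must stop billing even when the benchmark step fails.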
### PR comment format:
```markdown
## Benchmark Results: `perf/optimize-startup` vs `main`

| Metric | main | branch | Delta |
|---|---|---|---|
| `codeflash --version` | 450ms | 120ms | **-73% (3.75x)** |
| `import codeflash` | 380ms | 310ms | **-18%** |
| Full optimize run | 12.3s | 11.8s | **-4%** |
| Peak memory | 245 MB | 230 MB | **-6%** |

<details><summary>Per-module import breakdown</summary>
...
</details>
```
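The Delta column follows directly from the two means: percent change relative to `main`, plus a speedup factor when it is worth calling out. A small helper (hypothetical, mirroring the table above) makes the convention explicit:

```python
def delta(main, branch):
    """Format a benchmark delta: signed percent change plus speedup factor."""
    pct = (branch - main) / main * 100
    speedup = main / branch
    return f"{pct:+.0f}% ({speedup:.2f}x)"

print(delta(450, 120))  # -73% (3.75x)
```

Keeping the formula in one place ensures the `-73% (3.75x)` style stays consistent across `summary.md` and the PR comment.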
## Cost
| Component | Cost | Frequency |
|---|---|---|
| Benchmark VM (D4s_v5) | ~$0.10/hr | On-demand, ~10 min per run |
| Storage (64 GB SSD) | ~$10/month | Always |
| GitHub Actions | Free (public) / included (private) | Every push |
Estimated: **~$15/month** with daily benchmark runs.