# Infrastructure Design
Benchmarking and CI infrastructure for codeflash self-optimization.
## Architecture
```
┌─────────────────────────────────────────────────────────┐
│ Developer Machine                                       │
│ • Implement optimization                                │
│ • Push to branch                                        │
│ • Trigger benchmark run                                 │
└────────────────────────┬────────────────────────────────┘
                         │ git push
                         ▼
┌─────────────────────────────────────────────────────────┐
│ GitHub Actions CI                                       │
│ • Lint + type check                                     │
│ • Unit tests (fast, every push)                         │
│ • Trigger benchmark VM if perf/ branch                  │
└────────────────────────┬────────────────────────────────┘
                         │ webhook / SSH
                         ▼
┌─────────────────────────────────────────────────────────┐
│ Azure Benchmark VM (cf-bench)                           │
│ Standard_D4s_v5 (4 vCPU, 16 GB, non-burstable)          │
│                                                         │
│ • Checkout branch                                       │
│ • Install codeflash in editable mode                    │
│ • Run benchmark suite                                   │
│ • Compare against baseline (main)                       │
│ • Post results as PR comment                            │
└─────────────────────────────────────────────────────────┘
```
## Benchmark VM
### Why dedicated VM?
| Concern | Laptop | GitHub Actions runner | Dedicated VM |
|---|---|---|---|
| CPU consistency | Poor (thermal throttling, background load) | Poor (shared hardware, noisy neighbors) | **Good** (non-burstable) |
| Reproducibility | Low | Medium | **High** |
| Cost | Free | Free (but results are noisy) | ~$0.10/hr (on-demand) |
| Python versions | Whatever's installed | Configurable | **Full control** |
### VM Spec
| Setting | Value | Rationale |
|---|---|---|
| Size | `Standard_D4s_v5` | 4 vCPU, 16 GB RAM — enough to run codeflash on itself without swapping |
| OS | Ubuntu 24.04 LTS | Matches CI, stable |
| Region | `westus2` | Low latency, proven reliable |
| Disk | 64 GB Premium SSD | Fast I/O for git, pip cache |
| Scheduling | On-demand (start/stop) | Only runs during benchmark jobs, ~$0.10/hr |
### Provisioning
Cloud-init installs:

1. System packages (git, build-essential, curl, jq)
2. uv (fast Python/venv management)
3. Python 3.12, 3.13, 3.14 via uv
4. hyperfine v1.19+
5. memray (memory profiling)
6. py-spy (CPU sampling profiler)
7. Codeflash clone + editable install
8. Benchmark scripts
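A quick smoke test after provisioning can confirm the toolchain actually landed on PATH. This is a minimal sketch, assuming the tool list above; `check_tools` is a hypothetical helper, not an existing script:

```python
import shutil

def check_tools(tools):
    """Map each expected tool name to whether it is found on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}

# Tools the cloud-init config above is expected to install (assumption:
# the installers put them on PATH under these exact names).
expected = ["git", "curl", "jq", "uv", "hyperfine", "py-spy", "memray"]
status = check_tools(expected)
missing = [tool for tool, found in status.items() if not found]
print("missing:", missing or "none")
```

Running this as the last cloud-init step would fail fast on a half-provisioned VM instead of producing silently skewed benchmarks.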
### Start/Stop
```bash
# Start VM (before benchmark run)
az vm start --resource-group CF-BENCH-RG --name cf-bench

# Run benchmarks
ssh azureuser@<ip> "bash ~/bench/bench_all.sh <branch>"

# Stop VM (after benchmark run — stops billing)
az vm deallocate --resource-group CF-BENCH-RG --name cf-bench
```
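The same lifecycle can be driven programmatically from CI. A hedged Python sketch (the wrapper and its function names are illustrative, not an existing script; the `az`/`ssh` invocations mirror the block above) whose main point is that deallocation runs even when the benchmark fails:

```python
import subprocess

RESOURCE_GROUP = "CF-BENCH-RG"  # values from the az commands above
VM_NAME = "cf-bench"

def az_vm_cmd(action):
    """Build the az CLI argument list for a VM lifecycle action."""
    return ["az", "vm", action,
            "--resource-group", RESOURCE_GROUP,
            "--name", VM_NAME]

def run_benchmarks(host, branch):
    """Start the VM, run the suite over SSH, and always deallocate."""
    subprocess.run(az_vm_cmd("start"), check=True)
    try:
        subprocess.run(
            ["ssh", f"azureuser@{host}",
             f"bash ~/bench/bench_all.sh {branch}"],
            check=True)
    finally:
        # Deallocate even on failure, so a broken run does not keep billing.
        subprocess.run(az_vm_cmd("deallocate"), check=True)
```

The `try/finally` is the important part: a bare shell script that exits mid-run on error would leave the VM allocated and billing.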
## Benchmark Suite Design
### Layers
```
Layer 1: Startup (import time, CLI response time)
  └── python -X importtime
  └── hyperfine: codeflash --version, codeflash --help

Layer 2: Unit operations (micro-benchmarks)
  └── timeit: AST parsing, test generation, result analysis
  └── Function-level profiling of hot paths

Layer 3: Integration (real optimization runs)
  └── codeflash optimize <target> on a fixture codebase
  └── Wall clock, memory peak, output quality

Layer 4: Memory
  └── memray: peak RSS during optimization run
  └── tracemalloc: allocation hotspots
```
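A Layer 2 measurement can be as simple as `timeit` around one hot operation. A minimal sketch, using AST parsing as a stand-in for the operations listed above (the source snippet and repeat counts are illustrative):

```python
import ast
import timeit

SOURCE = "def f(x):\n    return x * 2\n"

def parse_source():
    return ast.parse(SOURCE)

# Take the best of several repeats: the minimum is more robust to
# background interference than a single run or the mean.
repeats = timeit.repeat(parse_source, number=1000, repeat=5)
best = min(repeats)
per_call_us = best / 1000 * 1e6
print(f"ast.parse: {per_call_us:.1f} us/call (best of 5)")
```

Reporting the minimum of repeats is a common micro-benchmark convention; anything above it is measurement noise, not the operation itself.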
### Benchmark scripts
| Script | Layer | What it measures |
|---|---|---|
| `bench_startup.sh` | 1 | CLI startup time (--version, --help, import) |
| `bench_importtime.py` | 1 | Per-module import cost breakdown |
| `bench_micro.py` | 2 | Hot function micro-benchmarks (timeit) |
| `bench_optimize.sh` | 3 | Full optimization run on fixture codebase |
| `bench_memory.sh` | 4 | Peak memory during optimization run |
| `bench_all.sh` | * | Run all benchmarks, save results |
| `bench_compare.sh` | * | A/B comparison between two branches |
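`python -X importtime` writes a self/cumulative time breakdown to stderr, one `import time:` line per module. A sketch of the kind of parsing `bench_importtime.py` might do (the parsing logic is an assumption about that script, and the sample lines stand in for real stderr output; nesting depth, encoded as leading spaces in the module column, is discarded here):

```python
SAMPLE_STDERR = """\
import time: self [us] | cumulative | imported package
import time:       136 |        136 | _io
import time:       453 |        589 | io
"""

def parse_importtime(text):
    """Parse -X importtime stderr into (module, self_us, cumulative_us) rows."""
    rows = []
    for line in text.splitlines():
        if not line.startswith("import time:"):
            continue
        body = line[len("import time:"):]
        self_part, cum_part, module = (field.strip() for field in body.split("|"))
        if not self_part.isdigit():
            continue  # skip the header row
        rows.append((module, int(self_part), int(cum_part)))
    return rows

rows = parse_importtime(SAMPLE_STDERR)
```

Sorting `rows` by self time descending gives the per-module breakdown that lands in `importtime.tsv`.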
### Fixture codebase
A small but representative Python project for integration benchmarks:

- ~500 lines across 5-10 files
- Mix of pure functions, classes, I/O-bound code
- Known optimization opportunities (so we can measure "did codeflash find them?")
- Checked into `fixtures/` directory
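One kind of planted opportunity, sketched here with hypothetical fixture content (these function names are not actual codeflash code): a quadratic string build that an optimizer should rewrite as a join.

```python
def slow_join(parts):
    """Deliberately quadratic: repeated string concatenation in a loop."""
    out = ""
    for part in parts:
        out += part + ","
    return out.rstrip(",")

def fast_join(parts):
    """The expected optimized form."""
    return ",".join(parts)

# Both must agree, so the benchmark can check correctness as well as speed.
assert slow_join(["a", "b", "c"]) == fast_join(["a", "b", "c"]) == "a,b,c"
```

Because the expected rewrite is known in advance, `bench_optimize.sh` can score both "was the opportunity found?" and "is the rewritten code still correct?".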
### Result format
Each benchmark run produces a directory:

```
results/<branch>-<timestamp>/
├── startup.json      # hyperfine JSON export
├── importtime.tsv    # Per-module import breakdown
├── micro.json        # Micro-benchmark results
├── optimize.json     # Integration benchmark (wall clock, memory, findings)
├── memory.json       # Peak RSS and allocation data
└── summary.md        # Human-readable summary
```
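`startup.json` follows hyperfine's `--export-json` schema: a top-level `results` list with `command`, `mean`, `stddev`, and related fields per entry. A sketch of summarizing it, with an inline sample standing in for a real export:

```python
import json

# Inline stand-in for a real hyperfine --export-json file (times in seconds).
SAMPLE = json.dumps({
    "results": [
        {"command": "codeflash --version", "mean": 0.450, "stddev": 0.012},
        {"command": "codeflash --help", "mean": 0.470, "stddev": 0.015},
    ]
})

def summarize(raw):
    """Map each benchmarked command to its mean runtime in milliseconds."""
    data = json.loads(raw)
    return {r["command"]: round(r["mean"] * 1000, 3) for r in data["results"]}

means_ms = summarize(SAMPLE)
print(means_ms)
```

The same shape feeds `bench_compare.sh`: load two exports, join on `command`, and diff the means.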
## CI Integration
### On every push to `perf/*` branch:

1. Run unit tests (GitHub Actions, fast)
2. Start benchmark VM
3. Run `bench_compare.sh main <branch>` on VM
4. Post results as PR comment via `gh`
5. Deallocate VM
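A hedged sketch of what the Actions side of these steps could look like (workflow name, secret names, and the `gh` invocation are assumptions, not the actual workflow; Azure login and PR-number resolution are omitted for brevity):

```yaml
name: perf-benchmark
on:
  push:
    branches: ["perf/**"]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Start benchmark VM
        run: az vm start --resource-group CF-BENCH-RG --name cf-bench
      - name: Run comparison on VM
        run: >
          ssh azureuser@${{ secrets.BENCH_VM_IP }}
          "bash ~/bench/bench_compare.sh main ${{ github.ref_name }}"
      - name: Post results as PR comment
        run: gh pr comment --body-file results/summary.md
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
      - name: Deallocate VM
        if: always()
        run: az vm deallocate --resource-group CF-BENCH-RG --name cf-bench
```

The `if: always()` on the deallocate step matters for the same reason as the `try/finally` pattern: the VM must stop billing even when the benchmark step fails.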
### PR comment format:
```markdown
## Benchmark Results: `perf/optimize-startup` vs `main`

| Metric | main | branch | Delta |
|---|---|---|---|
| `codeflash --version` | 450ms | 120ms | **-73% (3.75x)** |
| `import codeflash` | 380ms | 310ms | **-18%** |
| Full optimize run | 12.3s | 11.8s | **-4%** |
| Peak memory | 245 MB | 230 MB | **-6%** |

<details><summary>Per-module import breakdown</summary>
...
</details>
```
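The Delta column follows directly from the two means: percent change relative to `main`, plus a speedup factor when it is worth calling out. A small helper (hypothetical, mirroring the table above) makes the convention explicit:

```python
def delta(main, branch):
    """Format a benchmark delta: signed percent change plus speedup factor."""
    pct = (branch - main) / main * 100
    speedup = main / branch
    return f"{pct:+.0f}% ({speedup:.2f}x)"

print(delta(450, 120))  # -73% (3.75x)
```

Keeping the formula in one place ensures the `-73% (3.75x)` style stays consistent across `summary.md` and the PR comment.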
## Cost
| Component | Cost | Frequency |
|---|---|---|
| Benchmark VM (D4s_v5) | ~$0.10/hr | On-demand, ~10 min per run |
| Storage (64 GB SSD) | ~$10/month | Always |
| GitHub Actions | Free (public) / included (private) | Every push |
Estimated: **~$15/month** with daily benchmark runs.