# Infrastructure Design

Benchmarking and CI infrastructure for codeflash self-optimization.

## Architecture

```
┌─────────────────────────────────────────────────────────┐
│                    Developer Machine                    │
│  • Implement optimization                               │
│  • Push to branch                                       │
│  • Trigger benchmark run                                │
└────────────────────────┬────────────────────────────────┘
                         │ git push
                         ▼
┌─────────────────────────────────────────────────────────┐
│                    GitHub Actions CI                    │
│  • Lint + type check                                    │
│  • Unit tests (fast, every push)                        │
│  • Trigger benchmark VM if perf/ branch                 │
└────────────────────────┬────────────────────────────────┘
                         │ webhook / SSH
                         ▼
┌─────────────────────────────────────────────────────────┐
│              Azure Benchmark VM (cf-bench)              │
│     Standard_D4s_v5 (4 vCPU, 16 GB, non-burstable)      │
│                                                         │
│  • Checkout branch                                      │
│  • Install codeflash in editable mode                   │
│  • Run benchmark suite                                  │
│  • Compare against baseline (main)                      │
│  • Post results as PR comment                           │
└─────────────────────────────────────────────────────────┘
```

## Benchmark VM

### Why dedicated VM?

| Concern | Laptop | GitHub Actions runner | Dedicated VM |
|---|---|---|---|
| CPU consistency | Poor (thermal, background) | Poor (shared, noisy neighbor) | **Good** (non-burstable) |
| Reproducibility | Low | Medium | **High** |
| Cost | Free | Free (but noisy) | ~$0.10/hr (on-demand) |
| Python versions | Whatever's installed | Configurable | **Full control** |

### VM Spec

| Setting | Value | Rationale |
|---|---|---|
| Size | `Standard_D4s_v5` | 4 vCPU, 16 GB RAM — enough to run codeflash on itself without swapping |
| OS | Ubuntu 24.04 LTS | Matches CI, stable |
| Region | `westus2` | Low latency, proven reliable |
| Disk | 64 GB Premium SSD | Fast I/O for git, pip cache |
| Scheduling | On-demand (start/stop) | Only runs during benchmark jobs, ~$0.10/hr |

### Provisioning

Cloud-init installs:

1. System packages (git, build-essential, curl, jq)
2. uv (fast Python/venv management)
3. Python 3.12, 3.13, 3.14 via uv
4. hyperfine v1.19+
5. memray (memory profiling)
6. py-spy (CPU sampling profiler)
7. Codeflash clone + editable install
8. Benchmark scripts

### Start/Stop

```bash
# Start VM (before benchmark run)
az vm start --resource-group CF-BENCH-RG --name cf-bench

# Run benchmarks
ssh azureuser@ "bash ~/bench/bench_all.sh "

# Stop VM (after benchmark run — stops billing)
az vm deallocate --resource-group CF-BENCH-RG --name cf-bench
```

## Benchmark Suite Design

### Layers

```
Layer 1: Startup (import time, CLI response time)
  └── python -X importtime
  └── hyperfine: codeflash --version, codeflash --help

Layer 2: Unit operations (micro-benchmarks)
  └── timeit: AST parsing, test generation, result analysis
  └── Function-level profiling of hot paths

Layer 3: Integration (real optimization runs)
  └── codeflash optimize on a fixture codebase
  └── Wall clock, memory peak, output quality

Layer 4: Memory
  └── memray: peak RSS during optimization run
  └── tracemalloc: allocation hotspots
```

### Benchmark scripts

| Script | Layer | What it measures |
|---|---|---|
| `bench_startup.sh` | 1 | CLI startup time (--version, --help, import) |
| `bench_importtime.py` | 1 | Per-module import cost breakdown |
| `bench_micro.py` | 2 | Hot function micro-benchmarks (timeit) |
| `bench_optimize.sh` | 3 | Full optimization run on fixture codebase |
| `bench_memory.sh` | 4 | Peak memory during optimization run |
| `bench_all.sh` | * | Run all benchmarks, save results |
| `bench_compare.sh` | * | A/B comparison between two branches |

### Fixture codebase

A small but representative Python project for integration benchmarks:

- ~500 lines across 5-10 files
- Mix of pure functions, classes, I/O-bound code
- Known optimization opportunities (so we can measure "did codeflash find them?")
- Checked into `fixtures/` directory

### Result format

Each benchmark run produces a directory:

```
results/-/
├── startup.json      # hyperfine JSON export
├── importtime.tsv    # Per-module import breakdown
├── micro.json        # Micro-benchmark results
├── optimize.json     # Integration benchmark (wall clock, memory, findings)
├── memory.json       # Peak RSS and allocation data
└── summary.md        # Human-readable summary
```

## CI Integration

### On every push to a `perf/*` branch:

1. Run unit tests (GitHub Actions, fast)
2. Start benchmark VM
3. Run `bench_compare.sh main ` on the VM
4. Post results as PR comment via `gh`
5. Deallocate VM

### PR comment format:

```markdown
## Benchmark Results: `perf/optimize-startup` vs `main`

| Metric | main | branch | Delta |
|---|---|---|---|
| `codeflash --version` | 450ms | 120ms | **-73% (3.75x)** |
| `import codeflash` | 380ms | 310ms | **-18%** |
| Full optimize run | 12.3s | 11.8s | **-4%** |
| Peak memory | 245 MB | 230 MB | **-6%** |
Per-module import breakdown ...
```

## Cost

| Component | Cost | Frequency |
|---|---|---|
| Benchmark VM (D4s_v5) | ~$0.10/hr | On-demand, ~10 min per run |
| Storage (64 GB SSD) | ~$10/month | Always |
| GitHub Actions | Free (public) / included (private) | Every push |

Estimated: **~$15/month** with daily benchmark runs.
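The start → run → deallocate lifecycle in the Start/Stop section keeps billing if a benchmark fails between `az vm start` and `az vm deallocate`. A minimal sketch of a wrapper that guarantees deallocation via an EXIT trap (dry-run by default; `RUN`, `BENCH_HOST`, and `bench_run` are hypothetical names, not part of the repo):

```shell
#!/usr/bin/env bash
# Sketch: start -> benchmark -> deallocate, with deallocation guaranteed
# even when the ssh step fails. Dry-run by default; set RUN=1 to execute
# the real az/ssh commands. BENCH_HOST is a hypothetical placeholder.
set -u

RG="CF-BENCH-RG"
VM="cf-bench"

run() {
  # Execute the command, or print it in dry-run mode.
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "DRY: $*"; fi
}

bench_run() (
  # Subshell body: the EXIT trap fires when the subshell exits, success
  # or failure, so compute billing stops (deallocate, not stop) either way.
  trap 'run az vm deallocate --resource-group "$RG" --name "$VM"' EXIT
  run az vm start --resource-group "$RG" --name "$VM"
  run ssh "azureuser@${BENCH_HOST:-<vm-ip>}" 'bash ~/bench/bench_all.sh'
)

bench_run
```

`az vm deallocate` rather than `az vm stop` matters here: a stopped-but-allocated VM still accrues compute charges.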
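The Layer 1 per-module breakdown (`bench_importtime.py` producing `importtime.tsv`) boils down to parsing `-X importtime` output. A sketch, with the stdlib `json` module standing in for `import codeflash` and `importtime.log` as an illustrative file name:

```shell
# -X importtime writes one line per imported module to stderr:
#   import time: <self us> | <cumulative us> | <module>
python3 -X importtime -c "import json" 2> importtime.log

# Top 5 modules by cumulative import cost (second |-separated column).
grep "import time:" importtime.log | sort -t'|' -k2 -rn | head -n 5
```

Sorting by the cumulative column surfaces the heaviest import subtrees, which is what the PR comment's per-module breakdown should highlight.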
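The Delta column in the PR comment table can be derived from the two means with a one-liner; `delta` is a hypothetical helper, and the 450/120 pair below is the `codeflash --version` row from the example table:

```shell
# Percent change and speedup factor from baseline (main) vs branch means.
# Usage: delta <main_ms> <branch_ms>
delta() {
  awk -v m="$1" -v b="$2" \
    'BEGIN { printf "%+.0f%% (%.2fx)\n", (b - m) / m * 100, m / b }'
}

delta 450 120   # → -73% (3.75x)
```

The same two numbers per metric come straight out of hyperfine's JSON export (`startup.json` in the result directory).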
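Step 4 of the CI flow (post results via `gh`) can be a single `gh pr comment` call; with no PR argument, `gh` infers the PR from the checked-out branch. `results/summary.md`, `post_comment`, and the `RUN` dry-run guard are assumptions for illustration:

```shell
# Post the benchmark summary as a PR comment.
post_comment() {
  local summary="results/summary.md"  # assumed output of bench_compare.sh
  if [ "${RUN:-0}" = "1" ]; then
    gh pr comment --body-file "$summary"
  else
    echo "DRY: gh pr comment --body-file $summary"
  fi
}

post_comment
```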