# Infrastructure Design

Benchmarking and CI infrastructure for codeflash self-optimization.

## Architecture
```
┌─────────────────────────────────────────────────────────┐
│ Developer Machine                                       │
│ • Implement optimization                                │
│ • Push to branch                                        │
│ • Trigger benchmark run                                 │
└────────────────────────┬────────────────────────────────┘
                         │ git push
                         ▼
┌─────────────────────────────────────────────────────────┐
│ GitHub Actions CI                                       │
│ • Lint + type check                                     │
│ • Unit tests (fast, every push)                         │
│ • Trigger benchmark VM if perf/ branch                  │
└────────────────────────┬────────────────────────────────┘
                         │ webhook / SSH
                         ▼
┌─────────────────────────────────────────────────────────┐
│ Azure Benchmark VM (cf-bench)                           │
│ Standard_D4s_v5 (4 vCPU, 16 GB, non-burstable)          │
│                                                         │
│ • Checkout branch                                       │
│ • Install codeflash in editable mode                    │
│ • Run benchmark suite                                   │
│ • Compare against baseline (main)                       │
│ • Post results as PR comment                            │
└─────────────────────────────────────────────────────────┘
```
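The "trigger benchmark VM if perf/ branch" gate in the CI box can be an ordinary shell step. A minimal sketch: `GITHUB_REF` is standard Actions environment, while `BENCH_VM_IP` is an assumed secret/variable holding the VM's address.

```bash
# Hypothetical CI step: only wake the benchmark VM for perf/ branches.
branch="${GITHUB_REF#refs/heads/}"
if [[ "$branch" == perf/* ]]; then
  az vm start --resource-group CF-BENCH-RG --name cf-bench
  ssh "azureuser@${BENCH_VM_IP}" "bash ~/bench/bench_all.sh '${branch}'"
fi
```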
## Benchmark VM

### Why a dedicated VM?
| Concern | Laptop | GitHub Actions runner | Dedicated VM |
|---|---|---|---|
| CPU consistency | Poor (thermal, background) | Poor (shared, noisy neighbor) | Good (non-burstable) |
| Reproducibility | Low | Medium | High |
| Cost | Free | Free (but noisy) | ~$0.10/hr (on-demand) |
| Python versions | Whatever's installed | Configurable | Full control |
### VM Spec
| Setting | Value | Rationale |
|---|---|---|
| Size | Standard_D4s_v5 | 4 vCPU, 16 GB RAM — enough to run codeflash on itself without swapping |
| OS | Ubuntu 24.04 LTS | Matches CI, stable |
| Region | westus2 | Low latency, proven reliable |
| Disk | 64 GB Premium SSD | Fast I/O for git, pip cache |
| Scheduling | On-demand (start/stop) | Only runs during benchmark jobs, ~$0.10/hr |
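For reference, one `az vm create` invocation that matches this spec. This is a sketch: the `Ubuntu2404` image alias assumes a recent Azure CLI, and `cloud-init.yaml` is a placeholder name for the provisioning config below.

```bash
# Sketch only: provisions cf-bench per the table above.
az vm create \
  --resource-group CF-BENCH-RG \
  --name cf-bench \
  --size Standard_D4s_v5 \
  --image Ubuntu2404 \
  --location westus2 \
  --os-disk-size-gb 64 \
  --storage-sku Premium_LRS \
  --admin-username azureuser \
  --generate-ssh-keys \
  --custom-data cloud-init.yaml   # runs the provisioning steps below
```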
### Provisioning
Cloud-init installs:
- System packages (git, build-essential, curl, jq)
- uv (fast Python/venv management)
- Python 3.12, 3.13, 3.14 via uv
- hyperfine v1.19+
- memray (memory profiling)
- py-spy (CPU sampling profiler)
- Codeflash clone + editable install
- Benchmark scripts
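Roughly what the cloud-init `runcmd` boils down to, as a standalone script. A sketch under a few assumptions: the codeflash repo URL, the `~/codeflash` path, and installing hyperfine from its GitHub release (apt's packaged version may lag behind v1.19).

```bash
#!/usr/bin/env bash
set -euo pipefail

# System packages
sudo apt-get update
sudo apt-get install -y git build-essential curl jq

# uv, then the interpreter matrix
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"
uv python install 3.12 3.13 3.14

# hyperfine v1.19+ from GitHub releases
curl -LO https://github.com/sharkdp/hyperfine/releases/download/v1.19.0/hyperfine_1.19.0_amd64.deb
sudo dpkg -i hyperfine_1.19.0_amd64.deb

# Profilers as isolated tools
uv tool install memray
uv tool install py-spy

# Codeflash in editable mode (repo URL assumed)
git clone https://github.com/codeflash-ai/codeflash ~/codeflash
cd ~/codeflash && uv venv && uv pip install -e .
```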
### Start/Stop
```bash
# Start VM (before benchmark run)
az vm start --resource-group CF-BENCH-RG --name cf-bench

# Run benchmarks
ssh azureuser@<ip> "bash ~/bench/bench_all.sh <branch>"

# Stop VM (after benchmark run — stops billing)
az vm deallocate --resource-group CF-BENCH-RG --name cf-bench
```
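In practice it is worth wrapping these steps in one script that deallocates on every exit path, so a failed benchmark run cannot leave the meter running. A sketch (`run_bench.sh` is a hypothetical name):

```bash
#!/usr/bin/env bash
# run_bench.sh <branch>: start the VM, run the suite, always deallocate.
set -euo pipefail
branch="${1:?usage: run_bench.sh <branch>}"

az vm start --resource-group CF-BENCH-RG --name cf-bench
# Deallocate on success *and* failure: billing stops either way.
trap 'az vm deallocate --resource-group CF-BENCH-RG --name cf-bench' EXIT

ip=$(az vm show -d --resource-group CF-BENCH-RG --name cf-bench \
     --query publicIps -o tsv)
ssh "azureuser@${ip}" "bash ~/bench/bench_all.sh '${branch}'"
```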
## Benchmark Suite Design

### Layers
```
Layer 1: Startup (import time, CLI response time)
  └── python -X importtime
  └── hyperfine: codeflash --version, codeflash --help

Layer 2: Unit operations (micro-benchmarks)
  └── timeit: AST parsing, test generation, result analysis
  └── Function-level profiling of hot paths

Layer 3: Integration (real optimization runs)
  └── codeflash optimize <target> on a fixture codebase
  └── Wall clock, memory peak, output quality

Layer 4: Memory
  └── memray: peak RSS during optimization run
  └── tracemalloc: allocation hotspots
```
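A sketch of how the two Layer 1 probes fit together (the flags are standard hyperfine/CPython; the file names are assumptions). hyperfine handles warmup and statistical reporting; `importtime` supplies the per-module breakdown behind any regression it finds.

```bash
# Layer 1: end-to-end CLI latency, warmed up, exported for later comparison
hyperfine --warmup 3 --export-json startup.json \
  "codeflash --version" \
  "codeflash --help"

# Layer 1: per-module import cost (-X importtime writes to stderr)
python -X importtime -c "import codeflash" 2> importtime.log
# Lines look like "import time: <self us> | <cumulative> | <module>";
# sort by the cumulative column to find the heaviest imports.
sort -t'|' -k2 -rn importtime.log | head -20
```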
### Benchmark scripts
| Script | Layer | What it measures |
|---|---|---|
| `bench_startup.sh` | 1 | CLI startup time (`--version`, `--help`, import) |
| `bench_importtime.py` | 1 | Per-module import cost breakdown |
| `bench_micro.py` | 2 | Hot function micro-benchmarks (timeit) |
| `bench_optimize.sh` | 3 | Full optimization run on fixture codebase |
| `bench_memory.sh` | 4 | Peak memory during optimization run |
| `bench_all.sh` | * | Run all benchmarks, save results |
| `bench_compare.sh` | * | A/B comparison between two branches |
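The core of `bench_compare.sh` can be a single hyperfine invocation, using per-command `--prepare` hooks to check out and reinstall the tree being measured. A sketch under that assumption (each `--prepare` and `--command-name` pairs with the command in the same position):

```bash
#!/usr/bin/env bash
# bench_compare.sh <base> <branch>: A/B CLI startup between two branches.
set -euo pipefail
base="${1:-main}"
branch="${2:?usage: bench_compare.sh <base> <branch>}"

hyperfine --warmup 3 --export-json compare.json \
  --command-name "$base"   --prepare "git checkout -q '$base'   && uv pip install -q -e ." \
  --command-name "$branch" --prepare "git checkout -q '$branch' && uv pip install -q -e ." \
  "codeflash --version" \
  "codeflash --version"
```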
### Fixture codebase
A small but representative Python project for integration benchmarks:
- ~500 lines across 5-10 files
- Mix of pure functions, classes, I/O-bound code
- Known optimization opportunities (so we can measure "did codeflash find them?")
- Checked into the `fixtures/` directory
### Result format
Each benchmark run produces a directory:

```
results/<branch>-<timestamp>/
├── startup.json     # hyperfine JSON export
├── importtime.tsv   # Per-module import breakdown
├── micro.json       # Micro-benchmark results
├── optimize.json    # Integration benchmark (wall clock, memory, findings)
├── memory.json      # Peak RSS and allocation data
└── summary.md       # Human-readable summary
```
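summary.md can be derived mechanically from the JSON artifacts. For example, the headline startup delta comes straight out of hyperfine's documented export schema (`.results[].mean`, in seconds); a sketch with jq, which the VM already has installed:

```bash
# Compute the percentage delta between the two commands in compare.json.
base=$(jq '.results[0].mean' compare.json)
branch=$(jq '.results[1].mean' compare.json)
jq -n --argjson base "$base" --argjson branch "$branch" \
   '"Delta: \((($branch - $base) / $base * 100 * 10 | round) / 10)%"'
```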
## CI Integration

On every push to a `perf/*` branch:

- Run unit tests (GitHub Actions, fast)
- Start the benchmark VM
- Run `bench_compare.sh main <branch>` on the VM
- Post results as a PR comment via `gh`
- Deallocate the VM
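The posting step is a single `gh` call once summary.md is copied back from the VM. A sketch; the `results/latest` symlink is an assumption about how `bench_all.sh` marks its newest results directory.

```bash
# Hypothetical final CI step: fetch the summary and attach it to the PR.
branch="${1:?perf/* branch under test}"
ip=$(az vm show -d --resource-group CF-BENCH-RG --name cf-bench \
     --query publicIps -o tsv)
scp "azureuser@${ip}:~/bench/results/latest/summary.md" summary.md
gh pr comment "$branch" --body-file summary.md   # gh accepts a branch name here
```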
PR comment format:
```markdown
## Benchmark Results: `perf/optimize-startup` vs `main`

| Metric | main | branch | Delta |
|---|---|---|---|
| `codeflash --version` | 450ms | 120ms | **-73% (3.75x)** |
| `import codeflash` | 380ms | 310ms | **-18%** |
| Full optimize run | 12.3s | 11.8s | **-4%** |
| Peak memory | 245 MB | 230 MB | **-6%** |

<details><summary>Per-module import breakdown</summary>
...
</details>
```
## Cost
| Component | Cost | Frequency |
|---|---|---|
| Benchmark VM (D4s_v5) | ~$0.10/hr | On-demand, ~10 min per run |
| Storage (64 GB SSD) | ~$10/month | Always |
| GitHub Actions | Free (public) / included (private) | Every push |
Estimated: ~$15/month with daily benchmark runs; the always-on disk is the dominant cost, since ~10 minutes of on-demand compute per day adds well under $1/month.