# Codeflash Self-Optimization
Dogfooding codeflash on itself: applying the same methodology that produced the 2.35x Rich Console import and 1.81x pip resolver speedups to codeflash's own performance.
## The Stack
All codeflash repos under one roof for vertical optimization. A user-facing operation (e.g., `codeflash optimize foo.py`) touches every layer — optimizing one layer in isolation misses cross-boundary costs.
```
┌──────────────────────────────────────────────────────────────┐
│ User: codeflash optimize foo.py                              │
└──────────────────────────────┬───────────────────────────────┘
┌──────────────────────────────▼───────────────────────────────┐
│ codeflash (core engine)                                      │
│ • CLI entry point                                            │
│ • Test generation, AST analysis, benchmark harness           │
│ • Optimization loop: profile → generate → validate           │
│ repos/codeflash/                                             │
└──────────────────────────────┬───────────────────────────────┘
┌──────────────────────────────▼───────────────────────────────┐
│ codeflash-internal (backend service)                         │
│ • LLM orchestration, prompt management                       │
│ • Optimization result storage                                │
│ repos/codeflash-internal/                                    │
└──────────────────────────────┬───────────────────────────────┘
┌──────────────────────────────▼───────────────────────────────┐
│ docflash (CI pipeline)                                       │
│ • Dockerized optimization runs                               │
│ • Bug detection + auto-fix pipeline                          │
│ repos/docflash/                                              │
└──────────────────────────────────────────────────────────────┘
```
### Setup
```bash
# Clone all repos into repos/
mkdir -p repos
git clone git@github.com:codeflash-ai/codeflash.git repos/codeflash
git clone git@github.com:codeflash-ai/codeflash-internal.git repos/codeflash-internal
git clone git@github.com:codeflash-ai/docflash.git repos/docflash
```
### Cross-boundary optimization targets
| Boundary | What to look for |
|---|---|
| **codeflash CLI → internal service** | HTTP round-trip latency, payload size, connection reuse, retry overhead (see the sketch below) |
| **codeflash CLI → user's code** | AST parsing cost, test generation I/O, benchmark harness subprocess overhead |
| **docflash → codeflash CLI** | Docker startup, volume mount overhead, cold-start import time |
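
As a concrete probe for the first boundary, a minimal sketch that times repeated calls with and without a shared `requests.Session`; the endpoint URL is a placeholder rather than the real service address, and `requests` is an assumed client.
```python
import time
import requests  # assumption: requests is available; the real client may differ

URL = "https://example.com/"  # placeholder endpoint, not the actual backend URL

def timed(call, n: int = 20) -> float:
    start = time.perf_counter()
    for _ in range(n):
        call()
    return time.perf_counter() - start

# Fresh connection per request: pays the TCP + TLS handshake every time.
cold = timed(lambda: requests.get(URL, timeout=10))

# Shared session: keep-alive connections are pooled and reused.
session = requests.Session()
warm = timed(lambda: session.get(URL, timeout=10))

print(f"no reuse: {cold:.2f}s   pooled session: {warm:.2f}s")
```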
## Prior Art
| Project | Key Result | Approach | Case Study |
|---|---|---|---|
| **Rich** | 2.35x Console import (79ms → 34ms) | Import deferral, `re` elimination, runtime micro-opts | [rich_org](https://github.com/KRRT7/rich_org) |
| **pip** | 7x `--version`, 1.81x resolver | 122 commits: startup, resolver, packaging, import deferral | [pip_org](https://github.com/KRRT7/pip_org) |
## Patterns That Worked
Distilled from 122 pip commits + 2 Rich PRs. These are the repeatable optimization categories, ordered by typical impact; a short illustrative sketch follows each tier's table.
### Tier 1: Startup / Import Time (highest user-visible impact)
| Pattern | Example | Typical Savings |
|---|---|---|
| **Fast-path early exit** | `pip --version` bypasses entire `pip._internal` import | 5-100x for that codepath |
| **Import deferral** | Move `import X` from module level into the function that uses it | 2-20ms per deferred module |
| **`TYPE_CHECKING` guard** | Move annotation-only imports behind `if TYPE_CHECKING:` | 1-5ms per module |
| **`from __future__ import annotations`** | Enables string annotations so type aliases can move to `TYPE_CHECKING` | Unlocks further deferrals |
| **Kill dead imports** | Remove imports that aren't used at runtime | 1-10ms each |
| **Avoid transitive chains** | `dataclasses` → `inspect` (~10ms); `typing.Match` → `re` (~3ms) | Chain-dependent |
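
A minimal sketch of the Tier 1 patterns combined in one hypothetical CLI module; `load_config`, `heavy_cli`, and the `yaml` dependency are illustrative names, not taken from codeflash:
```python
from __future__ import annotations  # annotations stay strings, enabling the guard below

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Annotation-only import: never executed at runtime, costs nothing at startup.
    from pathlib import Path

def load_config(path: Path) -> dict:
    # Import deferral: only callers that actually load a config pay for yaml.
    import yaml
    with open(path) as f:
        return yaml.safe_load(f)

def main(argv: list[str]) -> int:
    # Fast-path early exit: --version never touches the heavy import chain.
    if argv[:1] == ["--version"]:
        print("codeflash 0.0.0")  # placeholder version string
        return 0
    import heavy_cli  # deferred until real work is requested (illustrative module)
    return heavy_cli.run(argv)
```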
### Tier 2: Architecture (highest absolute time savings)
| Pattern | Example | Typical Savings |
|---|---|---|
| **Replace `@dataclass` with `__slots__`** | ConsoleOptions: 344 → 136 bytes, eliminates `inspect` import | 10ms import + 60% memory |
| **Lazy loading large data** | Rich emoji dict (3,608 entries) deferred to first use | 2-5ms |
| **Speculative prefetch** | Background thread downloads metadata while resolver works | 10-30% on I/O-bound paths |
| **Conditional rebuild** | Skip rebuilding Criterion when nothing changed (95% of cases) | 20-40% on hot loop |
| **Cache at the right level** | `lru_cache` on `Style._add`, `parse_wheel_filename`, tag generation | Varies widely |
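
Two Tier 2 patterns sketched together, loosely modeled on the Rich example rather than copied from it: a `__slots__` class standing in for a `@dataclass`, and an `lru_cache` placed on a call site that is actually hot.
```python
from functools import lru_cache

class ConsoleOptions:
    # __slots__ drops the per-instance __dict__ (smaller objects, faster attribute
    # access) and avoids the dataclasses -> inspect import chain entirely.
    __slots__ = ("width", "height", "legacy_windows")

    def __init__(self, width: int, height: int, legacy_windows: bool = False) -> None:
        self.width = width
        self.height = height
        self.legacy_windows = legacy_windows

@lru_cache(maxsize=1024)
def parse_style(style: str) -> tuple[str, ...]:
    # Cache at the right level: the same handful of style strings are parsed
    # thousands of times per render, so the hit rate stays high.
    return tuple(style.split())
```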
### Tier 3: Micro-optimizations (small per-call, adds up in hot loops)
| Pattern | Example | Typical Savings |
|---|---|---|
| **Identity shortcut (`is` before `==`)** | `Style.__eq__`, `Segment.simplify` | 1.3-1.8x for identity case |
| **Bypass public API internally** | `Style._add` (cached) vs `__add__` (copies linked styles) | 1.1-1.3x |
| **Hoist to module level** | `operator.attrgetter`, `methodcaller` as module constants | ns per call |
| **`__slots__` on hot classes** | Criterion, ConsoleOptions, tokenizer state | 40-60% memory |
| **Pre-compute in `__init__`** | `Link._is_wheel`, `Version._str_cache` | Eliminates repeated work |
| **Direct construction** | `__new__` + slot assignment bypassing `__init__` | 20-40% for allocation-heavy paths |
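
A sketch of the identity shortcut and direct-construction patterns on a stand-in `Segment` class (not Rich's actual implementation):
```python
class Segment:
    __slots__ = ("text", "style")

    def __init__(self, text: str, style: str | None = None) -> None:
        self.text = text
        self.style = style

    def __eq__(self, other: object) -> bool:
        if self is other:  # identity shortcut: skip field comparison in the common case
            return True
        if not isinstance(other, Segment):
            return NotImplemented
        return self.text == other.text and self.style == other.style

    def with_text(self, text: str) -> "Segment":
        # Direct construction: allocate via __new__ and assign slots directly,
        # skipping __init__ overhead on an allocation-heavy path.
        seg = Segment.__new__(Segment)
        seg.text = text
        seg.style = self.style
        return seg
```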
### Tier 4: I/O and Serialization
| Pattern | Example | Typical Savings |
|---|---|---|
| **Replace slow serializer** | msgpack (pure Python) → stdlib JSON (C) | 2-5x for cache ops |
| **Connection pooling** | Increase HTTP pool size for parallel index fetches | Latency-dependent |
| **Parallel I/O** | SharedThreadPoolExecutor for wheel downloads | Throughput-dependent |
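
A sketch of the parallel-I/O row: one pooled `requests.Session` shared across a `ThreadPoolExecutor`. The URLs are illustrative stand-ins for index fetches, and `requests` is an assumed dependency.
```python
from concurrent.futures import ThreadPoolExecutor
import requests  # assumption: requests is the HTTP client in use

URLS = [f"https://pypi.org/simple/{name}/" for name in ("rich", "pip", "requests")]

def fetch(session: requests.Session, url: str) -> int:
    # The shared session gives connection pooling; the executor gives overlap.
    return len(session.get(url, timeout=10).content)

with requests.Session() as session, ThreadPoolExecutor(max_workers=8) as pool:
    sizes = list(pool.map(lambda url: fetch(session, url), URLS))

print(dict(zip(URLS, sizes)))
```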
## Anti-patterns (things that didn't work or weren't worth it)
- **Caching with low hit rate** — Caches that get evicted before reuse add overhead
- **Premature `__slots__`** — Only worth it on classes with many instances or in hot loops
- **Over-deferring** — Deferring imports in functions called once on startup just moves the cost
- **Regex elimination** — On Python 3.12, `typing` imports `re` anyway, so deferring `re` is a no-op there
- **Optimizing cold paths** — Error handling, setup/teardown, one-time init — not worth the complexity
## Methodology
### Profiling toolkit
| Tool | Purpose | When to use |
|---|---|---|
| `python -X importtime` | Import cost breakdown | First step for any CLI tool |
| `hyperfine` | E2E command timing with statistics | Before/after validation |
| `cProfile` / `py-spy` | Function-level CPU profiling | Finding hot functions |
| `timeit` | Micro-benchmarks for specific functions | Validating micro-opts |
| `memray` / `tracemalloc` | Memory profiling | Allocation-heavy paths |
| `objgraph` | Object count tracking | Finding redundant allocations |
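
For the function-level profiling row, a minimal cProfile harness sketch; `run_optimization` is a placeholder for whichever entry point is being profiled.
```python
import cProfile
import pstats

def run_optimization() -> None:
    # Placeholder workload; substitute the real call being profiled.
    sum(i * i for i in range(1_000_000))

profiler = cProfile.Profile()
profiler.enable()
run_optimization()
profiler.disable()

# Print the 20 functions with the highest cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```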
### Environment
- Azure Standard_D2s_v5 (non-burstable, consistent CPU)
- Multiple Python versions (3.12, 3.13 minimum — behavior differs)
- hyperfine with `--warmup 5 --min-runs 30` for statistical rigor
- All tests passing before AND after every change
### Workflow
```
1. Profile → identify top-N bottlenecks
2. For each bottleneck:
   a. Read the actual code (don't guess from profiler shapes)
   b. Implement the smallest change that addresses it
   c. Micro-benchmark before/after
   d. Run full test suite
   e. E2E benchmark
3. Commit with clear perf: prefix and numbers
4. Repeat until plateau
```
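Workflow step 2c in miniature: benchmark the old and new implementations of a candidate function side by side before trusting the change; both functions here are toy stand-ins for whatever is being optimized.
```python
import re
import timeit

_WHEEL_RE = re.compile(r"\.whl$", re.IGNORECASE)

def is_wheel_old(filename: str) -> bool:
    # Baseline implementation: regex search on every call.
    return _WHEEL_RE.search(filename) is not None

def is_wheel_new(filename: str) -> bool:
    # Candidate optimization: plain string method, same result.
    return filename.lower().endswith(".whl")

name = "rich-13.7.0-py3-none-any.whl"
old = timeit.timeit(lambda: is_wheel_old(name), number=1_000_000)
new = timeit.timeit(lambda: is_wheel_new(name), number=1_000_000)
print(f"old: {old:.3f}s  new: {new:.3f}s  speedup: {old / new:.2f}x")
```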
## Codeflash Optimization Plan
### Phase 1: Profile each layer
**codeflash (core engine)**
- [ ] `python -X importtime -c "import codeflash"` — import chain analysis (see the sketch after this list)
- [ ] `codeflash --version` startup time baseline
- [ ] Profile a real optimization run end-to-end (py-spy flamegraph)
- [ ] Memory profile on a large codebase target
- [ ] Trace test generation loop: AST parse, codegen, subprocess, validation
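
A sketch for the first checklist item above: run the import under `-X importtime` and rank modules by self time. The parsing assumes CPython's standard `import time: self [us] | cumulative | imported package` stderr format.
```python
import subprocess
import sys

proc = subprocess.run(
    [sys.executable, "-X", "importtime", "-c", "import codeflash"],
    capture_output=True,
    text=True,
)

rows = []
for line in proc.stderr.splitlines():
    if not line.startswith("import time:"):
        continue
    parts = line[len("import time:"):].split("|")
    if len(parts) != 3:
        continue
    try:
        self_us = int(parts[0].strip())
    except ValueError:
        continue  # skip the header row
    rows.append((self_us, parts[2].strip()))

# Top 20 modules by self time: prime candidates for deferral.
for self_us, module in sorted(rows, reverse=True)[:20]:
    print(f"{self_us:>10} us  {module}")
```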
**codeflash-internal (backend service)**
- [ ] Profile LLM call latency vs overhead (serialization, prompt assembly, result parsing)
- [ ] Check connection reuse and retry patterns
- [ ] Measure cold-start time

**Cross-boundary**
- [ ] E2E trace: user command → agent → CLI → service → result (where does time go?)
- [ ] Measure serialization costs at each boundary
- [ ] Identify redundant round-trips
### Phase 2: Identify targets
- [ ] Rank imports by cost per layer, identify deferrable ones
- [ ] Find hot functions in the optimization loop
- [ ] Check for heavy dependencies that could be deferred or replaced
- [ ] Map cross-boundary overhead (serialization, subprocess, HTTP)
- [ ] Look for the patterns from Tier 1-4 above
### Phase 3: Implement
- [ ] Apply Tier 1 (startup/import) optimizations first — highest visibility
- [ ] Then Tier 2 (architecture) — highest absolute savings
- [ ] Then Tier 3 (micro) and Tier 4 (I/O) as needed
- [ ] Cross-boundary optimizations last (require changes in multiple repos)
- [ ] Each change: micro-bench → test suite → E2E bench → commit
### Phase 4: Document
- [ ] Before/after benchmark tables per layer
- [ ] E2E before/after for user-facing operations
- [ ] Per-optimization breakdown
- [ ] Flamegraphs showing the shift
- [ ] Case study narrative: "codeflash optimized itself"
## Repo Structure
```
.
├── README.md               # This file — framework and playbook
├── repos/                  # The vertical stack (git-ignored, clone locally)
│   ├── codeflash/          # Core engine (codeflash-ai/codeflash)
│   ├── codeflash-internal/ # Backend service (codeflash-ai/codeflash-internal)
│   └── docflash/           # CI pipeline (codeflash-ai/docflash)
├── prior-art/
│   ├── rich-summary.md     # What we learned from Rich
│   └── pip-summary.md      # What we learned from pip
├── infra/
│   ├── README.md           # Infrastructure design and architecture
│   ├── cloud-init.yaml     # VM provisioning (one-shot)
│   └── vm-manage.sh        # VM lifecycle management script
├── profiles/               # Profiling output (importtime, flamegraphs)
│   ├── codeflash/          # Core engine profiles
│   ├── codeflash-internal/ # Service profiles
│   └── cross-boundary/     # E2E traces spanning layers
├── bench/                  # Benchmark scripts (copied to VM by cloud-init)
├── data/                   # Raw benchmark results
└── results/                # Before/after analysis
```