# Codeflash Self-Optimization

Dogfooding codeflash on itself — using the same methodology that produced the 2.35x Rich Console import and the 1.81x pip resolver speedups to optimize codeflash's own performance.

## The Stack

All codeflash repos under one roof for vertical optimization. A user-facing operation (e.g., `codeflash optimize foo.py`) touches every layer — optimizing one layer in isolation misses cross-boundary costs.

```
┌─────────────────────────────────────────────────────────────┐
│ User: codeflash optimize foo.py                             │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│ codeflash (core engine)                                      │
│ • CLI entry point                                            │
│ • Test generation, AST analysis, benchmark harness           │
│ • Optimization loop: profile → generate → validate           │
│   repos/codeflash/                                           │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│ codeflash-internal (backend service)                         │
│ • LLM orchestration, prompt management                       │
│ • Optimization result storage                                │
│   repos/codeflash-internal/                                  │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│ docflash (CI pipeline)                                       │
│ • Dockerized optimization runs                               │
│ • Bug detection + auto-fix pipeline                          │
│   repos/docflash/                                            │
└─────────────────────────────────────────────────────────────┘
```

### Setup

```bash
# Clone all repos into repos/
mkdir -p repos
git clone git@github.com:codeflash-ai/codeflash.git repos/codeflash
git clone git@github.com:codeflash-ai/codeflash-internal.git repos/codeflash-internal
git clone git@github.com:codeflash-ai/docflash.git repos/docflash
```

### Cross-boundary optimization targets

| Boundary | What to look for |
|---|---|
| **codeflash CLI → internal service** | HTTP round-trip latency, payload size, connection reuse, retry overhead |
| **codeflash CLI → user's code** | AST parsing cost, test generation I/O, benchmark harness subprocess overhead |
| **docflash → codeflash CLI** | Docker startup, volume mount overhead, cold-start import time |

## Prior Art

| Project | Key Result | Approach | Case Study |
|---|---|---|---|
| **Rich** | 2.35x Console import (79ms → 34ms) | Import deferral, `re` elimination, runtime micro-opts | [rich_org](https://github.com/KRRT7/rich_org) |
| **pip** | 7x `--version`, 1.81x resolver | 122 commits: startup, resolver, packaging, import deferral | [pip_org](https://github.com/KRRT7/pip_org) |

## Patterns That Worked

Distilled from 122 pip commits + 2 Rich PRs. These are the repeatable optimization categories, ordered by typical impact.

### Tier 1: Startup / Import Time (highest user-visible impact)

| Pattern | Example | Typical Savings |
|---|---|---|
| **Fast-path early exit** | `pip --version` bypasses entire `pip._internal` import | 5-100x for that codepath |
| **Import deferral** | Move `import X` from module level into the function that uses it | 2-20ms per deferred module |
| **`TYPE_CHECKING` guard** | Move annotation-only imports behind `if TYPE_CHECKING:` | 1-5ms per module |
| **`from __future__ import annotations`** | Enables string annotations so type aliases can move to `TYPE_CHECKING` | Unlocks further deferrals |
| **Kill dead imports** | Remove imports that aren't used at runtime | 1-10ms each |
| **Avoid transitive chains** | `dataclasses` → `inspect` (~10ms); `typing.Match` → `re` (~3ms) | Chain-dependent |
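Import deferral, the `TYPE_CHECKING` guard, and `from __future__ import annotations` compose: the future import makes every annotation a string, which lets annotation-only imports move behind the guard and runtime-only imports move into the functions that use them. A minimal sketch (illustrative module, not codeflash code; `pandas` stands in for any heavy dependency):

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Annotation-only import: never executed at runtime.
    from pandas import DataFrame


def summarize(frame: DataFrame) -> str:
    # With `from __future__ import annotations`, `DataFrame` here is just
    # a string at runtime, so the guarded import above is sufficient.
    return f"{len(frame)} rows"


def render_report(path: str) -> None:
    # Import deferral: pandas is paid for only on the codepath that
    # actually needs it, not on fast paths like `--version`.
    import pandas as pd

    print(summarize(pd.read_csv(path)))
```

`python -X importtime` (see the profiling toolkit below) confirms whether a deferred module actually drops out of the import chain, since another dependency may still pull it in transitively.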
### Tier 2: Architecture (highest absolute time savings)

| Pattern | Example | Typical Savings |
|---|---|---|
| **Replace `@dataclass` with `__slots__`** | ConsoleOptions: 344 → 136 bytes, eliminates `inspect` import | 10ms import + 60% memory |
| **Lazy loading large data** | Rich emoji dict (3,608 entries) deferred to first use | 2-5ms |
| **Speculative prefetch** | Background thread downloads metadata while resolver works | 10-30% on I/O-bound paths |
| **Conditional rebuild** | Skip rebuilding Criterion when nothing changed (95% of cases) | 20-40% on hot loop |
| **Cache at the right level** | `lru_cache` on `Style._add`, `parse_wheel_filename`, tag generation | Varies widely |

### Tier 3: Micro-optimizations (small per-call, adds up in hot loops)

| Pattern | Example | Typical Savings |
|---|---|---|
| **Identity shortcut (`is` before `==`)** | `Style.__eq__`, `Segment.simplify` | 1.3-1.8x for identity case |
| **Bypass public API internally** | `Style._add` (cached) vs `__add__` (copies linked styles) | 1.1-1.3x |
| **Hoist to module level** | `operator.attrgetter`, `methodcaller` as module constants | ns per call |
| **`__slots__` on hot classes** | Criterion, ConsoleOptions, tokenizer state | 40-60% memory |
| **Pre-compute in `__init__`** | `Link._is_wheel`, `Version._str_cache` | Eliminates repeated work |
| **Direct construction** | `__new__` + slot assignment bypassing `__init__` | 20-40% for allocation-heavy paths |
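Several Tier 2 and Tier 3 patterns tend to land on the same hot class. A sketch loosely modeled on the Rich `Style` work cited in the tables above (the `Style` class and `combine` helper here are illustrative, not actual Rich or codeflash code):

```python
from __future__ import annotations

from functools import lru_cache


class Style:
    # Tier 2: `__slots__` drops the per-instance __dict__ (big memory
    # savings on many-instance classes) and, unlike @dataclass, pulls
    # in no `inspect` import at startup.
    __slots__ = ("bold", "color")

    def __init__(self, bold: bool = False, color: str | None = None) -> None:
        # Treated as immutable once constructed, so caching on it is safe.
        self.bold = bold
        self.color = color

    def __eq__(self, other: object) -> bool:
        # Tier 3: identity shortcut. Comparing an object to itself is
        # common in hot loops, and `is` skips the attribute comparisons.
        if self is other:
            return True
        if not isinstance(other, Style):
            return NotImplemented
        return self.bold == other.bold and self.color == other.color

    def __hash__(self) -> int:
        # Needed both because __eq__ is defined and for lru_cache below.
        return hash((self.bold, self.color))


@lru_cache(maxsize=1024)
def combine(a: Style, b: Style) -> Style:
    # Tier 2: cache at the right level. The same style pairs recur
    # constantly, so repeated combines return the memoized object.
    return Style(a.bold or b.bold, b.color or a.color)
```

The `lru_cache` pays off only because the same pairs recur constantly; at a low-hit-rate call site it would be one of the anti-patterns below.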
### Tier 4: I/O and Serialization

| Pattern | Example | Typical Savings |
|---|---|---|
| **Replace slow serializer** | msgpack (pure Python) → stdlib JSON (C) | 2-5x for cache ops |
| **Connection pooling** | Increase HTTP pool size for parallel index fetches | Latency-dependent |
| **Parallel I/O** | SharedThreadPoolExecutor for wheel downloads | Throughput-dependent |

## Anti-patterns (things that didn't work or weren't worth it)

- **Caching with low hit rate** — Caches that get evicted before reuse add overhead
- **Premature `__slots__`** — Only worth it on classes with many instances or in hot loops
- **Over-deferring** — Deferring imports in functions called once on startup just moves the cost
- **Regex elimination** — On Python 3.12, `typing` imports `re` anyway, so deferring `re` is a no-op there
- **Optimizing cold paths** — Error handling, setup/teardown, one-time init — not worth the complexity

## Methodology

### Profiling toolkit

| Tool | Purpose | When to use |
|---|---|---|
| `python -X importtime` | Import cost breakdown | First step for any CLI tool |
| `hyperfine` | E2E command timing with statistics | Before/after validation |
| `cProfile` / `py-spy` | Function-level CPU profiling | Finding hot functions |
| `timeit` | Micro-benchmarks for specific functions | Validating micro-opts |
| `memray` / `tracemalloc` | Memory profiling | Allocation-heavy paths |
| `objgraph` | Object count tracking | Finding redundant allocations |

### Environment

- Azure Standard_D2s_v5 (non-burstable, consistent CPU)
- Multiple Python versions (3.12 and 3.13 at minimum — behavior differs between versions)
- hyperfine with `--warmup 5 --min-runs 30` for statistical rigor
- All tests passing before AND after every change

### Workflow

```
1. Profile → identify top-N bottlenecks
2. For each bottleneck:
   a. Read the actual code (don't guess from profiler shapes)
   b. Implement the smallest change that addresses it
   c. Micro-benchmark before/after
   d. Run full test suite
   e. E2E benchmark
3. Commit with a clear perf: prefix and numbers
4. Repeat until plateau
```

## Codeflash Optimization Plan

### Phase 1: Profile each layer

**codeflash (core engine)**

- [ ] `python -X importtime -c "import codeflash"` — import chain analysis
- [ ] `codeflash --version` startup time baseline
- [ ] Profile a real optimization run end-to-end (py-spy flamegraph)
- [ ] Memory profile on a large codebase target
- [ ] Trace test generation loop: AST parse, codegen, subprocess, validation

**codeflash-internal (backend service)**

- [ ] Profile LLM call latency vs overhead (serialization, prompt assembly, result parsing)
- [ ] Check connection reuse and retry patterns
- [ ] Measure cold-start time

**Cross-boundary**

- [ ] E2E trace: user command → agent → CLI → service → result (where does the time go?)
- [ ] Measure serialization costs at each boundary
- [ ] Identify redundant round-trips

### Phase 2: Identify targets

- [ ] Rank imports by cost per layer, identify deferrable ones
- [ ] Find hot functions in the optimization loop
- [ ] Check for heavy dependencies that could be deferred or replaced
- [ ] Map cross-boundary overhead (serialization, subprocess, HTTP)
- [ ] Look for the patterns from Tiers 1-4 above

### Phase 3: Implement

- [ ] Apply Tier 1 (startup/import) optimizations first — highest visibility
- [ ] Then Tier 2 (architecture) — highest absolute savings
- [ ] Then Tier 3 (micro) and Tier 4 (I/O) as needed
- [ ] Cross-boundary optimizations last (require changes in multiple repos)
- [ ] Each change: micro-bench → test suite → E2E bench → commit

### Phase 4: Document

- [ ] Before/after benchmark tables per layer
- [ ] E2E before/after for user-facing operations
- [ ] Per-optimization breakdown
- [ ] Flamegraphs showing the shift
- [ ] Case study narrative: "codeflash optimized itself"

## Repo Structure

```
.
├── README.md              # This file — framework and playbook
├── repos/                 # The vertical stack (git-ignored, clone locally)
│   ├── codeflash/         # Core engine (codeflash-ai/codeflash)
│   ├── codeflash-internal/# Backend service (codeflash-ai/codeflash-internal)
│   └── docflash/          # CI pipeline (codeflash-ai/docflash)
├── prior-art/
│   ├── rich-summary.md    # What we learned from Rich
│   └── pip-summary.md     # What we learned from pip
├── infra/
│   ├── README.md          # Infrastructure design and architecture
│   ├── cloud-init.yaml    # VM provisioning (one-shot)
│   └── vm-manage.sh       # VM lifecycle management script
├── profiles/              # Profiling output (importtime, flamegraphs)
│   ├── codeflash/         # Core engine profiles
│   ├── codeflash-internal/# Service profiles
│   └── cross-boundary/    # E2E traces spanning layers
├── bench/                 # Benchmark scripts (copied to VM by cloud-init)
├── data/                  # Raw benchmark results
└── results/               # Before/after analysis
```
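The first Phase 1 checkbox is also the easiest to automate for `profiles/`. A minimal sketch of a helper that captures and ranks the import baseline (hypothetical script, stdlib only; `-X importtime` writes lines of the form `import time: <self us> | <cumulative us> | <module>` to stderr):

```python
"""Rank `python -X importtime` output by cumulative import cost."""
import subprocess
import sys

# Run the import under -X importtime and capture the per-module report.
result = subprocess.run(
    [sys.executable, "-X", "importtime", "-c", "import codeflash"],
    capture_output=True,
    text=True,
)

rows = []
for line in result.stderr.splitlines():
    if not line.startswith("import time:"):
        continue
    try:
        _self, cumulative_us, module = line[len("import time:"):].split("|")
        rows.append((int(cumulative_us.strip()), module.strip()))
    except ValueError:
        continue  # skip the non-numeric header line

# Top offenders are the first candidates for deferral or TYPE_CHECKING guards.
for cumulative_us, module in sorted(rows, reverse=True)[:20]:
    print(f"{cumulative_us / 1000:8.1f} ms  {module}")
```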