# Codeflash Self-Optimization
Dogfooding codeflash on itself — using the same methodology that produced the 2.35x Rich Console import speedup and the 1.81x pip resolver speedup to optimize codeflash's own performance.
## The Stack
All codeflash repos under one roof for vertical optimization. A user-facing operation (e.g., `codeflash optimize foo.py`) touches every layer — optimizing one layer in isolation misses cross-boundary costs.
```
┌──────────────────────────────────────────────────────────────┐
│ User: codeflash optimize foo.py                               │
└──────────────────────────┬───────────────────────────────────┘
                           │
┌──────────────────────────▼───────────────────────────────────┐
│ codeflash (core engine)                                       │
│   • CLI entry point                                           │
│   • Test generation, AST analysis, benchmark harness          │
│   • Optimization loop: profile → generate → validate          │
│   repos/codeflash/                                            │
└──────────────────────────┬───────────────────────────────────┘
                           │
┌──────────────────────────▼───────────────────────────────────┐
│ codeflash-internal (backend service)                          │
│   • LLM orchestration, prompt management                      │
│   • Optimization result storage                               │
│   repos/codeflash-internal/                                   │
└──────────────────────────┬───────────────────────────────────┘
                           │
┌──────────────────────────▼───────────────────────────────────┐
│ docflash (CI pipeline)                                        │
│   • Dockerized optimization runs                              │
│   • Bug detection + auto-fix pipeline                         │
│   repos/docflash/                                             │
└──────────────────────────────────────────────────────────────┘
```
## Setup

```bash
# Clone all repos into repos/
mkdir -p repos
git clone git@github.com:codeflash-ai/codeflash.git repos/codeflash
git clone git@github.com:codeflash-ai/codeflash-internal.git repos/codeflash-internal
git clone git@github.com:codeflash-ai/docflash.git repos/docflash
```
## Cross-boundary optimization targets
| Boundary | What to look for |
|---|---|
| codeflash CLI → internal service | HTTP round-trip latency, payload size, connection reuse, retry overhead |
| codeflash CLI → user's code | AST parsing cost, test generation I/O, benchmark harness subprocess overhead |
| docflash → codeflash CLI | Docker startup, volume mount overhead, cold-start import time |
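
As a first probe of the CLI → user's code boundary, interpreter startup and cold-import cost can be separated with nothing but the standard library. This is a rough sketch, not codeflash's actual harness, and the numbers it prints are machine-dependent:

```python
# Rough probe: how much subprocess overhead is interpreter startup vs. importing the CLI?
import subprocess
import sys
import time

def timed(code: str) -> float:
    """Run `python -c <code>` in a fresh process and return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", code], check=True, capture_output=True)
    return time.perf_counter() - start

bare = timed("pass")                 # bare interpreter startup
cold = timed("import codeflash")     # startup + cold import of the CLI package

print(f"interpreter startup:        {bare * 1000:6.1f} ms")
print(f"startup + codeflash import: {cold * 1000:6.1f} ms")
print(f"approx. import cost:        {(cold - bare) * 1000:6.1f} ms")
```

The CLI → internal service boundary (round-trip latency, payload size, connection reuse) needs the service running, so it is better captured by the end-to-end traces planned in Phase 1.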
## Prior Art

| Project | Key Result | Approach | Case Study |
|---|---|---|---|
| Rich | 2.35x Console import (79ms → 34ms) | Import deferral, `re` elimination, runtime micro-opts | rich_org |
| pip | 7x `--version`, 1.81x resolver | 122 commits: startup, resolver, packaging, import deferral | pip_org |
## Patterns That Worked
Distilled from 122 pip commits + 2 Rich PRs. These are the repeatable optimization categories, ordered by typical impact; a short illustrative sketch follows each tier's table.
### Tier 1: Startup / Import Time (highest user-visible impact)

| Pattern | Example | Typical Savings |
|---|---|---|
| Fast-path early exit | `pip --version` bypasses entire `pip._internal` import | 5-100x for that codepath |
| Import deferral | Move `import X` from module level into the function that uses it | 2-20ms per deferred module |
| `TYPE_CHECKING` guard | Move annotation-only imports behind `if TYPE_CHECKING:` | 1-5ms per module |
| `from __future__ import annotations` | Enables string annotations so type aliases can move to `TYPE_CHECKING` | Unlocks further deferrals |
| Kill dead imports | Remove imports that aren't used at runtime | 1-10ms each |
| Avoid transitive chains | `dataclasses` → `inspect` (~10ms); `typing.Match` → `re` (~3ms) | Chain-dependent |
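
A minimal sketch combining the first four patterns in a hypothetical CLI entry module; `codeflash.optimizer` and `codeflash.models` are stand-in names, not necessarily the real package layout:

```python
# Hypothetical cli.py illustrating Tier 1 patterns; the submodule names are assumptions.
from __future__ import annotations        # annotations become strings -> imports can move below

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only the type checker pays for this; nothing is imported at runtime.
    from codeflash.models import OptimizationResult

def summarize(result: OptimizationResult) -> str:
    return str(result)                     # annotation resolves lazily thanks to __future__

def main(argv: list[str]) -> int:
    # Fast-path early exit: --version never touches the heavy machinery.
    if argv[:1] == ["--version"]:
        from importlib.metadata import version   # deferred: only this code path pays for it
        print(version("codeflash"))
        return 0

    # Import deferral: the expensive AST/codegen stack loads only when actually optimizing.
    from codeflash.optimizer import run_optimization
    return run_optimization(argv)
```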
### Tier 2: Architecture (highest absolute time savings)

| Pattern | Example | Typical Savings |
|---|---|---|
| Replace `@dataclass` with `__slots__` | `ConsoleOptions`: 344 → 136 bytes, eliminates `inspect` import | 10ms import + 60% memory |
| Lazy loading large data | Rich emoji dict (3,608 entries) deferred to first use | 2-5ms |
| Speculative prefetch | Background thread downloads metadata while resolver works | 10-30% on I/O-bound paths |
| Conditional rebuild | Skip rebuilding `Criterion` when nothing changed (95% of cases) | 20-40% on hot loop |
| Cache at the right level | `lru_cache` on `Style._add`, `parse_wheel_filename`, tag generation | Varies widely |
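
Sketches of the `__slots__`, lazy-loading, and caching rows; the class, the data module, and the simplified wheel parser are stand-ins rather than the actual Rich/pip code:

```python
# Stand-in examples for three Tier 2 patterns; names and fields are illustrative only.
from functools import lru_cache

class ConsoleOptions:
    """__slots__ instead of @dataclass: smaller instances, no inspect import at module load."""
    __slots__ = ("max_width", "legacy_windows", "is_terminal")

    def __init__(self, max_width: int, legacy_windows: bool = False, is_terminal: bool = True):
        self.max_width = max_width
        self.legacy_windows = legacy_windows
        self.is_terminal = is_terminal

_EMOJI: dict[str, str] | None = None      # the big table is not built at import time

def get_emoji(name: str) -> str:
    """Lazy loading: pay for the large dict only on first use."""
    global _EMOJI
    if _EMOJI is None:
        from emoji_data import EMOJI      # hypothetical data-only module, imported on demand
        _EMOJI = EMOJI
    return _EMOJI[name]

@lru_cache(maxsize=2048)
def parse_wheel_filename(filename: str) -> tuple[str, str]:
    """Cache at the right level: a (simplified) parse that the resolver repeats many times."""
    name, _, rest = filename.partition("-")
    return name, rest.split("-", 1)[0]
```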
### Tier 3: Micro-optimizations (small per-call, adds up in hot loops)

| Pattern | Example | Typical Savings |
|---|---|---|
| Identity shortcut (`is` before `==`) | `Style.__eq__`, `Segment.simplify` | 1.3-1.8x for identity case |
| Bypass public API internally | `Style._add` (cached) vs `__add__` (copies linked styles) | 1.1-1.3x |
| Hoist to module level | `operator.attrgetter`, `methodcaller` as module constants | ns per call |
| `__slots__` on hot classes | `Criterion`, `ConsoleOptions`, tokenizer state | 40-60% memory |
| Pre-compute in `__init__` | `Link._is_wheel`, `Version._str_cache` | Eliminates repeated work |
| Direct construction | `__new__` + slot assignment bypassing `__init__` | 20-40% for allocation-heavy paths |
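
A made-up `Style`-like class showing three of these micro-patterns together; this is a sketch, not Rich's actual implementation:

```python
# Illustrative hot-path class: identity shortcut, precomputation, and direct construction.
class Style:
    __slots__ = ("color", "bold", "_hash")

    def __init__(self, color: str | None = None, bold: bool = False):
        self.color = color
        self.bold = bold
        self._hash = hash((color, bold))   # pre-compute in __init__: hash work happens once

    def __hash__(self) -> int:
        return self._hash

    def __eq__(self, other: object) -> bool:
        if self is other:                  # identity shortcut: skip attribute comparison
            return True
        if not isinstance(other, Style):
            return NotImplemented
        return self.color == other.color and self.bold == other.bold

    def copy(self) -> "Style":
        # Direct construction: __new__ plus slot assignment bypasses __init__'s re-hashing.
        new = Style.__new__(Style)
        new.color = self.color
        new.bold = self.bold
        new._hash = self._hash
        return new
```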
### Tier 4: I/O and Serialization

| Pattern | Example | Typical Savings |
|---|---|---|
| Replace slow serializer | `msgpack` (pure Python) → stdlib JSON (C) | 2-5x for cache ops |
| Connection pooling | Increase HTTP pool size for parallel index fetches | Latency-dependent |
| Parallel I/O | `SharedThreadPoolExecutor` for wheel downloads | Throughput-dependent |
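
A minimal parallel-download sketch with one shared thread pool for the whole run; the worker count and fetch function are placeholders, not pip's implementation:

```python
# Parallel I/O sketch: one shared pool amortizes thread startup across all downloads.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url: str) -> bytes:
    # Stdlib fetch; a pooled HTTP session would additionally reuse connections.
    with urlopen(url, timeout=30) as resp:
        return resp.read()

def download_all(urls: list[str], max_workers: int = 8) -> list[bytes]:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```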
### Anti-patterns (things that didn't work or weren't worth it)

- Caching with low hit rate — Caches that get evicted before reuse add overhead
- Premature `__slots__` — Only worth it on classes with many instances or in hot loops
- Over-deferring — Deferring imports in functions called once on startup just moves the cost
- Regex elimination — On Python 3.12, `typing` imports `re` anyway, so deferring `re` is a no-op there
- Optimizing cold paths — Error handling, setup/teardown, one-time init — not worth the complexity
## Methodology

### Profiling toolkit
| Tool | Purpose | When to use |
|---|---|---|
| `python -X importtime` | Import cost breakdown | First step for any CLI tool |
| `hyperfine` | E2E command timing with statistics | Before/after validation |
| `cProfile` / `py-spy` | Function-level CPU profiling | Finding hot functions |
| `timeit` | Micro-benchmarks for specific functions | Validating micro-opts |
| `memray` / `tracemalloc` | Memory profiling | Allocation-heavy paths |
| `objgraph` | Object count tracking | Finding redundant allocations |
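
For the `timeit` row, a small before/after harness is usually enough to validate a micro-opt; `old_impl` and `new_impl` below are stand-ins for whatever function is under test:

```python
# Micro-benchmark harness: compare two implementations of the same function on fixed data.
import timeit
from operator import itemgetter

DATA = [(str(i), i % 97) for i in range(1_000)]

def old_impl(items):
    return sorted(items, key=lambda pair: pair[1])   # baseline: Python-level key function

def new_impl(items):
    return sorted(items, key=itemgetter(1))          # candidate: C-level key function

for name, fn in (("old", old_impl), ("new", new_impl)):
    best = min(timeit.repeat(lambda: fn(DATA), number=1_000, repeat=5))
    print(f"{name}: {best / 1_000 * 1e6:.2f} µs per call (best of 5 runs)")
```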
### Environment

- Azure Standard_D2s_v5 (non-burstable, consistent CPU)
- Multiple Python versions (3.12, 3.13 minimum — behavior differs)
- hyperfine with `--warmup 5 --min-runs 30` for statistical rigor
- All tests passing before AND after every change
### Workflow

1. Profile → identify top-N bottlenecks
2. For each bottleneck:
   a. Read the actual code (don't guess from profiler shapes)
   b. Implement the smallest change that addresses it
   c. Micro-benchmark before/after
   d. Run full test suite
   e. E2E benchmark
3. Commit with a clear `perf:` prefix and numbers
4. Repeat until plateau
## Codeflash Optimization Plan

### Phase 1: Profile each layer

#### codeflash (core engine)
- `python -X importtime -c "import codeflash"` — import chain analysis
- `codeflash --version` startup time baseline
- Profile a real optimization run end-to-end (py-spy flamegraph)
- Memory profile on a large codebase target
- Trace test generation loop: AST parse, codegen, subprocess, validation
#### codeflash-internal (backend service)
- Profile LLM call latency vs overhead (serialization, prompt assembly, result parsing)
- Check connection reuse and retry patterns
- Measure cold-start time
#### Cross-boundary
- E2E trace: user command → agent → CLI → service → result (where does time go?)
- Measure serialization costs at each boundary
- Identify redundant round-trips
### Phase 2: Identify targets
- Rank imports by cost per layer, identify deferrable ones (see the ranking sketch after this list)
- Find hot functions in the optimization loop
- Check for heavy dependencies that could be deferred or replaced
- Map cross-boundary overhead (serialization, subprocess, HTTP)
- Look for the patterns from Tier 1-4 above
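
For the import-ranking step, the stderr output of `python -X importtime -c "import codeflash"` can be sorted with a short script; the parsing below is a rough sketch of CPython's importtime line format, and the log path is a placeholder:

```python
# Rank modules from `python -X importtime -c "import codeflash" 2> import.log` by self-time.
import sys

def rank_importtime(log_path: str, top: int = 20) -> None:
    rows = []
    with open(log_path) as f:
        for line in f:
            if not line.startswith("import time:") or "cumulative" in line:
                continue                      # skip non-importtime lines and the header row
            _, timings = line.split(":", 1)
            self_us, cumulative_us, module = (part.strip() for part in timings.split("|"))
            rows.append((int(self_us), int(cumulative_us), module))
    rows.sort(reverse=True)                   # biggest self-time first
    for self_us, cumulative_us, module in rows[:top]:
        print(f"{self_us / 1000:7.1f} ms self  {cumulative_us / 1000:7.1f} ms cumulative  {module}")

if __name__ == "__main__":
    rank_importtime(sys.argv[1] if len(sys.argv) > 1 else "import.log")
```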
### Phase 3: Implement
- Apply Tier 1 (startup/import) optimizations first — highest visibility
- Then Tier 2 (architecture) — highest absolute savings
- Then Tier 3 (micro) and Tier 4 (I/O) as needed
- Cross-boundary optimizations last (require changes in multiple repos)
- Each change: micro-bench → test suite → E2E bench → commit
### Phase 4: Document
- Before/after benchmark tables per layer
- E2E before/after for user-facing operations
- Per-optimization breakdown
- Flamegraphs showing the shift
- Case study narrative: "codeflash optimized itself"
## Repo Structure

```
.
├── README.md               # This file — framework and playbook
├── repos/                  # The vertical stack (git-ignored, clone locally)
│   ├── codeflash/          # Core engine (codeflash-ai/codeflash)
│   ├── codeflash-internal/ # Backend service (codeflash-ai/codeflash-internal)
│   └── docflash/           # CI pipeline (codeflash-ai/docflash)
├── prior-art/
│   ├── rich-summary.md     # What we learned from Rich
│   └── pip-summary.md      # What we learned from pip
├── infra/
│   ├── README.md           # Infrastructure design and architecture
│   ├── cloud-init.yaml     # VM provisioning (one-shot)
│   └── vm-manage.sh        # VM lifecycle management script
├── profiles/               # Profiling output (importtime, flamegraphs)
│   ├── codeflash/          # Core engine profiles
│   ├── codeflash-internal/ # Service profiles
│   └── cross-boundary/     # E2E traces spanning layers
├── bench/                  # Benchmark scripts (copied to VM by cloud-init)
├── data/                   # Raw benchmark results
└── results/                # Before/after analysis
```