# Codeflash Self-Optimization
Dogfooding codeflash on itself — using the same methodology that produced the 2.35x Rich Console import speedup and the 1.81x pip resolver speedup to optimize codeflash's own performance.
## The Stack
All codeflash repos under one roof for vertical optimization. A user-facing operation (e.g., `codeflash optimize foo.py`) touches every layer — optimizing one layer in isolation misses cross-boundary costs.
```
┌──────────────────────────────────────────────────────────────┐
│ User: codeflash optimize foo.py                               │
└──────────────────────────┬───────────────────────────────────┘
                           │
┌──────────────────────────▼───────────────────────────────────┐
│ codeflash (core engine)                                       │
│   • CLI entry point                                           │
│   • Test generation, AST analysis, benchmark harness          │
│   • Optimization loop: profile → generate → validate          │
│   repos/codeflash/                                            │
└──────────────────────────┬───────────────────────────────────┘
                           │
┌──────────────────────────▼───────────────────────────────────┐
│ codeflash-internal (backend service)                          │
│   • LLM orchestration, prompt management                      │
│   • Optimization result storage                               │
│   repos/codeflash-internal/                                   │
└──────────────────────────┬───────────────────────────────────┘
                           │
┌──────────────────────────▼───────────────────────────────────┐
│ docflash (CI pipeline)                                        │
│   • Dockerized optimization runs                              │
│   • Bug detection + auto-fix pipeline                         │
│   repos/docflash/                                             │
└──────────────────────────────────────────────────────────────┘
```
## Setup

```bash
# Clone all repos into repos/
mkdir -p repos
git clone git@github.com:codeflash-ai/codeflash.git repos/codeflash
git clone git@github.com:codeflash-ai/codeflash-internal.git repos/codeflash-internal
git clone git@github.com:codeflash-ai/docflash.git repos/docflash
```
## Cross-boundary optimization targets
| Boundary | What to look for |
|---|---|
| codeflash CLI → internal service | HTTP round-trip latency, payload size, connection reuse, retry overhead |
| codeflash CLI → user's code | AST parsing cost, test generation I/O, benchmark harness subprocess overhead |
| docflash → codeflash CLI | Docker startup, volume mount overhead, cold-start import time |
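
As a first probe of the CLI → user's code boundary, interpreter startup and cold-import cost can be separated with nothing but the standard library. This is a rough sketch, not codeflash's actual harness, and the numbers it prints are machine-dependent:

```python
# Rough probe: how much subprocess overhead is interpreter startup vs. importing the CLI?
import subprocess
import sys
import time

def timed(code: str) -> float:
    """Run `python -c <code>` in a fresh process and return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", code], check=True, capture_output=True)
    return time.perf_counter() - start

bare = timed("pass")                 # bare interpreter startup
cold = timed("import codeflash")     # startup + cold import of the CLI package

print(f"interpreter startup:        {bare * 1000:6.1f} ms")
print(f"startup + codeflash import: {cold * 1000:6.1f} ms")
print(f"approx. import cost:        {(cold - bare) * 1000:6.1f} ms")
```

The CLI → internal service boundary (round-trip latency, payload size, connection reuse) needs the service running, so it is better captured by the end-to-end traces planned in Phase 1.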
## Prior Art

| Project | Key Result | Approach | Case Study |
|---|---|---|---|
| Rich | 2.35x Console import (79ms → 34ms) | Import deferral, `re` elimination, runtime micro-opts | rich_org |
| pip | 7x `--version`, 1.81x resolver | 122 commits: startup, resolver, packaging, import deferral | pip_org |
## Patterns That Worked
Distilled from 122 pip commits + 2 Rich PRs. These are the repeatable optimization categories, ordered by typical impact; a short illustrative sketch follows each tier's table.
### Tier 1: Startup / Import Time (highest user-visible impact)

| Pattern | Example | Typical Savings |
|---|---|---|
| Fast-path early exit | `pip --version` bypasses entire `pip._internal` import | 5-100x for that codepath |
| Import deferral | Move `import X` from module level into the function that uses it | 2-20ms per deferred module |
| `TYPE_CHECKING` guard | Move annotation-only imports behind `if TYPE_CHECKING:` | 1-5ms per module |
| `from __future__ import annotations` | Enables string annotations so type aliases can move to `TYPE_CHECKING` | Unlocks further deferrals |
| Kill dead imports | Remove imports that aren't used at runtime | 1-10ms each |
| Avoid transitive chains | `dataclasses` → `inspect` (~10ms); `typing.Match` → `re` (~3ms) | Chain-dependent |
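
A minimal sketch combining the first four patterns in a hypothetical CLI entry module; `codeflash.optimizer` and `codeflash.models` are stand-in names, not necessarily the real package layout:

```python
# Hypothetical cli.py illustrating Tier 1 patterns; the submodule names are assumptions.
from __future__ import annotations        # annotations become strings -> imports can move below

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only the type checker pays for this; nothing is imported at runtime.
    from codeflash.models import OptimizationResult

def summarize(result: OptimizationResult) -> str:
    return str(result)                     # annotation resolves lazily thanks to __future__

def main(argv: list[str]) -> int:
    # Fast-path early exit: --version never touches the heavy machinery.
    if argv[:1] == ["--version"]:
        from importlib.metadata import version   # deferred: only this code path pays for it
        print(version("codeflash"))
        return 0

    # Import deferral: the expensive AST/codegen stack loads only when actually optimizing.
    from codeflash.optimizer import run_optimization
    return run_optimization(argv)
```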
### Tier 2: Architecture (highest absolute time savings)

| Pattern | Example | Typical Savings |
|---|---|---|
| Replace `@dataclass` with `__slots__` | `ConsoleOptions`: 344 → 136 bytes, eliminates `inspect` import | 10ms import + 60% memory |
| Lazy loading large data | Rich emoji dict (3,608 entries) deferred to first use | 2-5ms |
| Speculative prefetch | Background thread downloads metadata while resolver works | 10-30% on I/O-bound paths |
| Conditional rebuild | Skip rebuilding `Criterion` when nothing changed (95% of cases) | 20-40% on hot loop |
| Cache at the right level | `lru_cache` on `Style._add`, `parse_wheel_filename`, tag generation | Varies widely |
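
Sketches of the `__slots__`, lazy-loading, and caching rows; the class, the data module, and the simplified wheel parser are stand-ins rather than the actual Rich/pip code:

```python
# Stand-in examples for three Tier 2 patterns; names and fields are illustrative only.
from functools import lru_cache

class ConsoleOptions:
    """__slots__ instead of @dataclass: smaller instances, no inspect import at module load."""
    __slots__ = ("max_width", "legacy_windows", "is_terminal")

    def __init__(self, max_width: int, legacy_windows: bool = False, is_terminal: bool = True):
        self.max_width = max_width
        self.legacy_windows = legacy_windows
        self.is_terminal = is_terminal

_EMOJI: dict[str, str] | None = None      # the big table is not built at import time

def get_emoji(name: str) -> str:
    """Lazy loading: pay for the large dict only on first use."""
    global _EMOJI
    if _EMOJI is None:
        from emoji_data import EMOJI      # hypothetical data-only module, imported on demand
        _EMOJI = EMOJI
    return _EMOJI[name]

@lru_cache(maxsize=2048)
def parse_wheel_filename(filename: str) -> tuple[str, str]:
    """Cache at the right level: a (simplified) parse that the resolver repeats many times."""
    name, _, rest = filename.partition("-")
    return name, rest.split("-", 1)[0]
```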
### Tier 3: Micro-optimizations (small per-call, adds up in hot loops)

| Pattern | Example | Typical Savings |
|---|---|---|
| Identity shortcut (`is` before `==`) | `Style.__eq__`, `Segment.simplify` | 1.3-1.8x for identity case |
| Bypass public API internally | `Style._add` (cached) vs `__add__` (copies linked styles) | 1.1-1.3x |
| Hoist to module level | `operator.attrgetter`, `methodcaller` as module constants | ns per call |
| `__slots__` on hot classes | `Criterion`, `ConsoleOptions`, tokenizer state | 40-60% memory |
| Pre-compute in `__init__` | `Link._is_wheel`, `Version._str_cache` | Eliminates repeated work |
| Direct construction | `__new__` + slot assignment bypassing `__init__` | 20-40% for allocation-heavy paths |
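
A made-up `Style`-like class showing three of these micro-patterns together; this is a sketch, not Rich's actual implementation:

```python
# Illustrative hot-path class: identity shortcut, precomputation, and direct construction.
class Style:
    __slots__ = ("color", "bold", "_hash")

    def __init__(self, color: str | None = None, bold: bool = False):
        self.color = color
        self.bold = bold
        self._hash = hash((color, bold))   # pre-compute in __init__: hash work happens once

    def __hash__(self) -> int:
        return self._hash

    def __eq__(self, other: object) -> bool:
        if self is other:                  # identity shortcut: skip attribute comparison
            return True
        if not isinstance(other, Style):
            return NotImplemented
        return self.color == other.color and self.bold == other.bold

    def copy(self) -> "Style":
        # Direct construction: __new__ plus slot assignment bypasses __init__'s re-hashing.
        new = Style.__new__(Style)
        new.color = self.color
        new.bold = self.bold
        new._hash = self._hash
        return new
```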
### Tier 4: I/O and Serialization

| Pattern | Example | Typical Savings |
|---|---|---|
| Replace slow serializer | `msgpack` (pure Python) → stdlib JSON (C) | 2-5x for cache ops |
| Connection pooling | Increase HTTP pool size for parallel index fetches | Latency-dependent |
| Parallel I/O | `SharedThreadPoolExecutor` for wheel downloads | Throughput-dependent |
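
A minimal parallel-download sketch with one shared thread pool for the whole run; the worker count and fetch function are placeholders, not pip's implementation:

```python
# Parallel I/O sketch: one shared pool amortizes thread startup across all downloads.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url: str) -> bytes:
    # Stdlib fetch; a pooled HTTP session would additionally reuse connections.
    with urlopen(url, timeout=30) as resp:
        return resp.read()

def download_all(urls: list[str], max_workers: int = 8) -> list[bytes]:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```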
### Anti-patterns (things that didn't work or weren't worth it)

- Caching with low hit rate — Caches that get evicted before reuse add overhead
- Premature `__slots__` — Only worth it on classes with many instances or in hot loops
- Over-deferring — Deferring imports in functions called once on startup just moves the cost
- Regex elimination — On Python 3.12, `typing` imports `re` anyway, so deferring `re` is a no-op there
- Optimizing cold paths — Error handling, setup/teardown, one-time init — not worth the complexity
## Methodology

### Profiling toolkit
| Tool | Purpose | When to use |
|---|---|---|
| `python -X importtime` | Import cost breakdown | First step for any CLI tool |
| `hyperfine` | E2E command timing with statistics | Before/after validation |
| `cProfile` / `py-spy` | Function-level CPU profiling | Finding hot functions |
| `timeit` | Micro-benchmarks for specific functions | Validating micro-opts |
| `memray` / `tracemalloc` | Memory profiling | Allocation-heavy paths |
| `objgraph` | Object count tracking | Finding redundant allocations |
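
For the `timeit` row, a small before/after harness is usually enough to validate a micro-opt; `old_impl` and `new_impl` below are stand-ins for whatever function is under test:

```python
# Micro-benchmark harness: compare two implementations of the same function on fixed data.
import timeit
from operator import itemgetter

DATA = [(str(i), i % 97) for i in range(1_000)]

def old_impl(items):
    return sorted(items, key=lambda pair: pair[1])   # baseline: Python-level key function

def new_impl(items):
    return sorted(items, key=itemgetter(1))          # candidate: C-level key function

for name, fn in (("old", old_impl), ("new", new_impl)):
    best = min(timeit.repeat(lambda: fn(DATA), number=1_000, repeat=5))
    print(f"{name}: {best / 1_000 * 1e6:.2f} µs per call (best of 5 runs)")
```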
### Environment

- Azure Standard_D2s_v5 (non-burstable, consistent CPU)
- Multiple Python versions (3.12, 3.13 minimum — behavior differs)
- hyperfine with `--warmup 5 --min-runs 30` for statistical rigor
- All tests passing before AND after every change
### Workflow

1. Profile → identify top-N bottlenecks
2. For each bottleneck:
   a. Read the actual code (don't guess from profiler shapes)
   b. Implement the smallest change that addresses it
   c. Micro-benchmark before/after
   d. Run full test suite
   e. E2E benchmark
3. Commit with a clear `perf:` prefix and numbers
4. Repeat until plateau
## Codeflash Optimization Plan

### Phase 1: Profile each layer

#### codeflash (core engine)
- `python -X importtime -c "import codeflash"` — import chain analysis
- `codeflash --version` startup time baseline
- Profile a real optimization run end-to-end (py-spy flamegraph)
- Memory profile on a large codebase target
- Trace test generation loop: AST parse, codegen, subprocess, validation
#### codeflash-internal (backend service)
- Profile LLM call latency vs overhead (serialization, prompt assembly, result parsing)
- Check connection reuse and retry patterns
- Measure cold-start time
#### Cross-boundary
- E2E trace: user command → agent → CLI → service → result (where does time go?)
- Measure serialization costs at each boundary
- Identify redundant round-trips
### Phase 2: Identify targets
- Rank imports by cost per layer, identify deferrable ones (see the ranking sketch after this list)
- Find hot functions in the optimization loop
- Check for heavy dependencies that could be deferred or replaced
- Map cross-boundary overhead (serialization, subprocess, HTTP)
- Look for the patterns from Tier 1-4 above
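
For the import-ranking step, the stderr output of `python -X importtime -c "import codeflash"` can be sorted with a short script; the parsing below is a rough sketch of CPython's importtime line format, and the log path is a placeholder:

```python
# Rank modules from `python -X importtime -c "import codeflash" 2> import.log` by self-time.
import sys

def rank_importtime(log_path: str, top: int = 20) -> None:
    rows = []
    with open(log_path) as f:
        for line in f:
            if not line.startswith("import time:") or "cumulative" in line:
                continue                      # skip non-importtime lines and the header row
            _, timings = line.split(":", 1)
            self_us, cumulative_us, module = (part.strip() for part in timings.split("|"))
            rows.append((int(self_us), int(cumulative_us), module))
    rows.sort(reverse=True)                   # biggest self-time first
    for self_us, cumulative_us, module in rows[:top]:
        print(f"{self_us / 1000:7.1f} ms self  {cumulative_us / 1000:7.1f} ms cumulative  {module}")

if __name__ == "__main__":
    rank_importtime(sys.argv[1] if len(sys.argv) > 1 else "import.log")
```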
### Phase 3: Implement
- Apply Tier 1 (startup/import) optimizations first — highest visibility
- Then Tier 2 (architecture) — highest absolute savings
- Then Tier 3 (micro) and Tier 4 (I/O) as needed
- Cross-boundary optimizations last (require changes in multiple repos)
- Each change: micro-bench → test suite → E2E bench → commit
### Phase 4: Document
- Before/after benchmark tables per layer
- E2E before/after for user-facing operations
- Per-optimization breakdown
- Flamegraphs showing the shift
- Case study narrative: "codeflash optimized itself"
## Repo Structure

```
.
├── README.md               # This file — framework and playbook
├── repos/                  # The vertical stack (git-ignored, clone locally)
│   ├── codeflash/          # Core engine (codeflash-ai/codeflash)
│   ├── codeflash-internal/ # Backend service (codeflash-ai/codeflash-internal)
│   └── docflash/           # CI pipeline (codeflash-ai/docflash)
├── prior-art/
│   ├── rich-summary.md     # What we learned from Rich
│   └── pip-summary.md      # What we learned from pip
├── infra/
│   ├── README.md           # Infrastructure design and architecture
│   ├── cloud-init.yaml     # VM provisioning (one-shot)
│   └── vm-manage.sh        # VM lifecycle management script
├── profiles/               # Profiling output (importtime, flamegraphs)
│   ├── codeflash/          # Core engine profiles
│   ├── codeflash-internal/ # Service profiles
│   └── cross-boundary/     # E2E traces spanning layers
├── bench/                  # Benchmark scripts (copied to VM by cloud-init)
├── data/                   # Raw benchmark results
└── results/                # Before/after analysis
```