Codeflash Self-Optimization

Dogfooding codeflash on itself — applying the same methodology that produced the 2x Rich import speedup and the 1.8x pip resolver speedup to codeflash's own performance.

The Stack

All codeflash repos under one roof for vertical optimization. A user-facing operation (e.g., codeflash optimize foo.py) touches every layer — optimizing one layer in isolation misses cross-boundary costs.

┌─────────────────────────────────────────────────────────────┐
│  User: codeflash optimize foo.py                            │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│  codeflash (core engine)                                    │
│  • CLI entry point                                          │
│  • Test generation, AST analysis, benchmark harness         │
│  • Optimization loop: profile → generate → validate         │
│  repos/codeflash/                                           │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│  codeflash-internal (backend service)                       │
│  • LLM orchestration, prompt management                     │
│  • Optimization result storage                              │
│  repos/codeflash-internal/                                  │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│  docflash (CI pipeline)                                     │
│  • Dockerized optimization runs                             │
│  • Bug detection + auto-fix pipeline                        │
│  repos/docflash/                                            │
└─────────────────────────────────────────────────────────────┘

Setup

# Clone all repos into repos/
mkdir -p repos
git clone git@github.com:codeflash-ai/codeflash.git repos/codeflash
git clone git@github.com:codeflash-ai/codeflash-internal.git repos/codeflash-internal
git clone git@github.com:codeflash-ai/docflash.git repos/docflash

Cross-boundary optimization targets

| Boundary | What to look for |
|---|---|
| codeflash CLI → internal service | HTTP round-trip latency, payload size, connection reuse, retry overhead |
| codeflash CLI → user's code | AST parsing cost, test generation I/O, benchmark harness subprocess overhead |
| docflash → codeflash CLI | Docker startup, volume mount overhead, cold-start import time |

Prior Art

| Project | Key Result | Approach | Case Study |
|---|---|---|---|
| Rich | 2.35x Console import (79ms → 34ms) | Import deferral, `re` elimination, runtime micro-opts | rich_org |
| pip | 7x `--version`, 1.81x resolver | 122 commits: startup, resolver, packaging, import deferral | pip_org |

Patterns That Worked

Distilled from 122 pip commits + 2 Rich PRs. These are the repeatable optimization categories, ordered by typical impact.

Tier 1: Startup / Import Time (highest user-visible impact)

| Pattern | Example | Typical Savings |
|---|---|---|
| Fast-path early exit | `pip --version` bypasses the entire `pip._internal` import | 5-100x for that codepath |
| Import deferral | Move `import X` from module level into the function that uses it | 2-20ms per deferred module |
| `TYPE_CHECKING` guard | Move annotation-only imports behind `if TYPE_CHECKING:` | 1-5ms per module |
| `from __future__ import annotations` | Enables string annotations so type aliases can move to `TYPE_CHECKING` | Unlocks further deferrals |
| Kill dead imports | Remove imports that aren't used at runtime | 1-10ms each |
| Avoid transitive chains | `dataclasses` → `inspect` (~10ms); `typing.Match` → `re` (~3ms) | Chain-dependent |
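
To make the top patterns concrete, here is a minimal sketch combining `from __future__ import annotations`, a `TYPE_CHECKING` guard, and a deferred runtime import; `xml.etree.ElementTree` stands in for any expensive dependency, and the functions are hypothetical:

```python
from __future__ import annotations  # annotations stay strings, so typing-only imports can be deferred

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Annotation-only import: visible to type checkers, never paid for at runtime.
    from xml.etree.ElementTree import Element


def count_children(node: Element) -> int:
    # Only the annotation mentions Element, and with postponed evaluation it is never looked up.
    return len(node)


def parse_report(path: str) -> Element:
    # Deferred import: the parser module is loaded only when this function runs,
    # not when this module is imported on the CLI's fast path.
    from xml.etree.ElementTree import parse

    return parse(path).getroot()
```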

Tier 2: Architecture (highest absolute time savings)

| Pattern | Example | Typical Savings |
|---|---|---|
| Replace `@dataclass` with `__slots__` | `ConsoleOptions`: 344 → 136 bytes, eliminates `inspect` import | 10ms import + 60% memory |
| Lazy loading large data | Rich emoji dict (3,608 entries) deferred to first use | 2-5ms |
| Speculative prefetch | Background thread downloads metadata while resolver works | 10-30% on I/O-bound paths |
| Conditional rebuild | Skip rebuilding `Criterion` when nothing changed (95% of cases) | 20-40% on hot loop |
| Cache at the right level | `lru_cache` on `Style._add`, `parse_wheel_filename`, tag generation | Varies widely |
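
A sketch of the first two Tier 2 patterns; `RenderOptions` and `get_emoji` are hypothetical stand-ins, not the actual Rich classes:

```python
class RenderOptions:
    # __slots__ instead of @dataclass: smaller instances, and neither the
    # dataclasses nor the inspect machinery is pulled in at import time.
    __slots__ = ("max_width", "no_color", "legacy_windows")

    def __init__(self, max_width: int, no_color: bool = False, legacy_windows: bool = False) -> None:
        self.max_width = max_width
        self.no_color = no_color
        self.legacy_windows = legacy_windows


_EMOJI: dict[str, str] | None = None  # stands in for a large table like the 3,608-entry emoji dict


def get_emoji(name: str) -> str:
    # Lazy loading: the table is built on first use, so code paths that never
    # render emoji never pay for it.
    global _EMOJI
    if _EMOJI is None:
        _EMOJI = {"thumbs_up": "\N{THUMBS UP SIGN}", "rocket": "\N{ROCKET}"}
    return _EMOJI[name]
```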

Tier 3: Micro-optimizations (small per-call, adds up in hot loops)

| Pattern | Example | Typical Savings |
|---|---|---|
| Identity shortcut (`is` before `==`) | `Style.__eq__`, `Segment.simplify` | 1.3-1.8x for identity case |
| Bypass public API internally | `Style._add` (cached) vs `__add__` (copies linked styles) | 1.1-1.3x |
| Hoist to module level | `operator.attrgetter`, `methodcaller` as module constants | ns per call |
| `__slots__` on hot classes | `Criterion`, `ConsoleOptions`, tokenizer state | 40-60% memory |
| Pre-compute in `__init__` | `Link._is_wheel`, `Version._str_cache` | Eliminates repeated work |
| Direct construction | `__new__` + slot assignment bypassing `__init__` | 20-40% for allocation-heavy paths |
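
Two of these micro-patterns, identity shortcut and direct construction, in one hedged sketch (the `Span` class is hypothetical, loosely modeled on the Rich and pip examples above):

```python
class Span:
    __slots__ = ("start", "end", "style")

    def __init__(self, start: int, end: int, style: str) -> None:
        self.start = start
        self.end = end
        self.style = style

    def __eq__(self, other: object) -> bool:
        # Identity shortcut: `is` is a pointer comparison and wins whenever the
        # same object is compared against itself, which is common in hot loops.
        if self is other:
            return True
        if not isinstance(other, Span):
            return NotImplemented
        return (self.start, self.end, self.style) == (other.start, other.end, other.style)

    def moved(self, offset: int) -> "Span":
        # Direct construction: allocate via __new__ and assign slots directly,
        # skipping __init__'s argument handling on an allocation-heavy path.
        span = Span.__new__(Span)
        span.start = self.start + offset
        span.end = self.end + offset
        span.style = self.style
        return span
```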

Tier 4: I/O and Serialization

| Pattern | Example | Typical Savings |
|---|---|---|
| Replace slow serializer | msgpack (pure Python) → stdlib JSON (C) | 2-5x for cache ops |
| Connection pooling | Increase HTTP pool size for parallel index fetches | Latency-dependent |
| Parallel I/O | `SharedThreadPoolExecutor` for wheel downloads | Throughput-dependent |
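
A minimal sketch of the parallel-I/O pattern using only the stdlib; the pool size, URLs, and fetch helper are placeholders, not pip's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# One pool shared by the whole process, so every caller reuses the same threads
# (a crude stand-in for the shared-executor pattern mentioned above).
_POOL = ThreadPoolExecutor(max_workers=8)


def fetch(url: str) -> bytes:
    # Placeholder fetch; a real client would also reuse HTTP connections (pooling).
    with urlopen(url, timeout=30) as resp:
        return resp.read()


def fetch_all(urls: list[str]) -> list[bytes]:
    # Downloads overlap instead of running back-to-back, so wall time approaches
    # the slowest single download rather than the sum of all of them.
    return list(_POOL.map(fetch, urls))
```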

Anti-patterns (things that didn't work or weren't worth it)

  • Caching with low hit rate — Caches that get evicted before reuse add overhead
  • Premature __slots__ — Only worth it on classes with many instances or in hot loops
  • Over-deferring — Deferring imports in functions called once on startup just moves the cost
  • Regex elimination — On Python 3.12, typing imports re anyway, so deferring re is a no-op there
  • Optimizing cold paths — Error handling, setup/teardown, one-time init — not worth the complexity

Methodology

Profiling toolkit

| Tool | Purpose | When to use |
|---|---|---|
| `python -X importtime` | Import cost breakdown | First step for any CLI tool |
| `hyperfine` | E2E command timing with statistics | Before/after validation |
| `cProfile` / `py-spy` | Function-level CPU profiling | Finding hot functions |
| `timeit` | Micro-benchmarks for specific functions | Validating micro-opts |
| `memray` / `tracemalloc` | Memory profiling | Allocation-heavy paths |
| `objgraph` | Object count tracking | Finding redundant allocations |
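
For the `timeit` row, a throwaway micro-benchmark harness of the kind used to validate a Tier 3 change (the two comparison functions are illustrative, not real codeflash code):

```python
import timeit

def eq_plain(a, b):
    return a == b

def eq_identity(a, b):
    # Tier 3 identity shortcut: pointer comparison first, fall back to ==.
    return a is b or a == b

x = "bold red on black"
for name, fn in [("plain ==", eq_plain), ("is-then-==", eq_identity)]:
    # repeat() returns one total per run; take the best run to reduce noise.
    best = min(timeit.repeat(lambda: fn(x, x), number=1_000_000, repeat=5))
    print(f"{name:12s} {best * 1000:.1f} ns/call")
```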

Environment

  • Azure Standard_D2s_v5 (non-burstable, consistent CPU)
  • Multiple Python versions (3.12, 3.13 minimum — behavior differs)
  • hyperfine with --warmup 5 --min-runs 30 for statistical rigor
  • All tests passing before AND after every change

Workflow

1. Profile → identify top-N bottlenecks
2. For each bottleneck:
   a. Read the actual code (don't guess from profiler shapes)
   b. Implement the smallest change that addresses it
   c. Micro-benchmark before/after
   d. Run full test suite
   e. E2E benchmark
3. Commit with a clear `perf:` prefix and the before/after numbers
4. Repeat until plateau

Codeflash Optimization Plan

Phase 1: Profile each layer

codeflash (core engine)

  • python -X importtime -c "import codeflash" — import chain analysis (a helper to rank the output is sketched after this list)
  • codeflash --version startup time baseline
  • Profile a real optimization run end-to-end (py-spy flamegraph)
  • Memory profile on a large codebase target
  • Trace test generation loop: AST parse, codegen, subprocess, validation
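
The importtime step above writes one line per imported module to stderr; a throwaway helper like the following (assuming codeflash is installed in the current environment) ranks modules by self cost:

```python
import subprocess
import sys

# Run the target import under -X importtime and rank modules by self import cost.
proc = subprocess.run(
    [sys.executable, "-X", "importtime", "-c", "import codeflash"],
    capture_output=True,
    text=True,
)

rows = []
for line in proc.stderr.splitlines():
    if not line.startswith("import time:") or "self [us]" in line:
        continue  # skip non-importtime lines and the header row
    self_us, cumulative_us, module = line[len("import time:"):].split("|")
    rows.append((int(self_us), int(cumulative_us), module.strip()))

for self_us, cumulative_us, module in sorted(rows, reverse=True)[:20]:
    print(f"{self_us:>10,} us  {module}")
```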

codeflash-internal (backend service)

  • Profile LLM call latency vs overhead (serialization, prompt assembly, result parsing)
  • Check connection reuse and retry patterns
  • Measure cold-start time

Cross-boundary

  • E2E trace: user command → agent → CLI → service → result (where does time go?)
  • Measure serialization costs at each boundary (a timing sketch follows this list)
  • Identify redundant round-trips
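
One lightweight way to attribute boundary costs is a timing context manager around each serialization step or round-trip; everything below is a hypothetical sketch, not codeflash's tracing code:

```python
import json
import time
from contextlib import contextmanager

# Hypothetical helper: wrap each boundary crossing to see where E2E time goes.
@contextmanager
def timed(label: str, sink: dict):
    start = time.perf_counter()
    try:
        yield
    finally:
        sink[label] = sink.get(label, 0.0) + time.perf_counter() - start


timings: dict[str, float] = {}
payload = {"function": "foo", "candidates": ["def foo(): ..."] * 1000}  # placeholder request body

with timed("serialize", timings):
    body = json.dumps(payload)
with timed("deserialize", timings):
    json.loads(body)

for label, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{label:12s} {seconds * 1000:8.3f} ms")
```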

Phase 2: Identify targets

  • Rank imports by cost per layer, identify deferrable ones
  • Find hot functions in the optimization loop
  • Check for heavy dependencies that could be deferred or replaced
  • Map cross-boundary overhead (serialization, subprocess, HTTP)
  • Look for the patterns from Tier 1-4 above

Phase 3: Implement

  • Apply Tier 1 (startup/import) optimizations first — highest visibility
  • Then Tier 2 (architecture) — highest absolute savings
  • Then Tier 3 (micro) and Tier 4 (I/O) as needed
  • Cross-boundary optimizations last (require changes in multiple repos)
  • Each change: micro-bench → test suite → E2E bench → commit

Phase 4: Document

  • Before/after benchmark tables per layer
  • E2E before/after for user-facing operations
  • Per-optimization breakdown
  • Flamegraphs showing the shift
  • Case study narrative: "codeflash optimized itself"

Repo Structure

.
├── README.md              # This file — framework and playbook
├── repos/                 # The vertical stack (git-ignored, clone locally)
│   ├── codeflash/         # Core engine (codeflash-ai/codeflash)
│   ├── codeflash-internal/# Backend service (codeflash-ai/codeflash-internal)
│   └── docflash/          # CI pipeline (codeflash-ai/docflash)
├── prior-art/
│   ├── rich-summary.md    # What we learned from Rich
│   └── pip-summary.md     # What we learned from pip
├── infra/
│   ├── README.md          # Infrastructure design and architecture
│   ├── cloud-init.yaml    # VM provisioning (one-shot)
│   └── vm-manage.sh       # VM lifecycle management script
├── profiles/              # Profiling output (importtime, flamegraphs)
│   ├── codeflash/         # Core engine profiles
│   ├── codeflash-internal/# Service profiles
│   └── cross-boundary/    # E2E traces spanning layers
├── bench/                 # Benchmark scripts (copied to VM by cloud-init)
├── data/                  # Raw benchmark results
└── results/               # Before/after analysis