# Codeflash Self-Optimization

Dogfooding codeflash on itself — using the same methodology that produced the 2.35x Rich Console import and the 1.81x pip resolver speedups to optimize codeflash's own performance.

## The Stack

All codeflash repos under one roof for vertical optimization. A user-facing operation (e.g., `codeflash optimize foo.py`) touches every layer — optimizing one layer in isolation misses cross-boundary costs.

```
┌─────────────────────────────────────────────────────────────┐
│ User: codeflash optimize foo.py                             │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│ codeflash (core engine)                                      │
│ • CLI entry point                                            │
│ • Test generation, AST analysis, benchmark harness           │
│ • Optimization loop: profile → generate → validate           │
│   repos/codeflash/                                           │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│ codeflash-internal (backend service)                         │
│ • LLM orchestration, prompt management                       │
│ • Optimization result storage                                │
│   repos/codeflash-internal/                                  │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│ docflash (CI pipeline)                                       │
│ • Dockerized optimization runs                               │
│ • Bug detection + auto-fix pipeline                          │
│   repos/docflash/                                            │
└─────────────────────────────────────────────────────────────┘
```

### Setup

```bash
# Clone all repos into repos/
mkdir -p repos
git clone git@github.com:codeflash-ai/codeflash.git repos/codeflash
git clone git@github.com:codeflash-ai/codeflash-internal.git repos/codeflash-internal
git clone git@github.com:codeflash-ai/docflash.git repos/docflash
```

### Cross-boundary optimization targets

| Boundary | What to look for |
|---|---|
| **codeflash CLI → internal service** | HTTP round-trip latency, payload size, connection reuse, retry overhead |
| **codeflash CLI → user's code** | AST parsing cost, test generation I/O, benchmark harness subprocess overhead |
| **docflash → codeflash CLI** | Docker startup, volume mount overhead, cold-start import time |

## Prior Art

| Project | Key Result | Approach | Case Study |
|---|---|---|---|
| **Rich** | 2.35x Console import (79ms → 34ms) | Import deferral, `re` elimination, runtime micro-opts | [rich_org](https://github.com/KRRT7/rich_org) |
| **pip** | 7x `--version`, 1.81x resolver | 122 commits: startup, resolver, packaging, import deferral | [pip_org](https://github.com/KRRT7/pip_org) |

## Patterns That Worked

Distilled from 122 pip commits + 2 Rich PRs. These are the repeatable optimization categories, ordered by typical impact.

### Tier 1: Startup / Import Time (highest user-visible impact)

| Pattern | Example | Typical Savings |
|---|---|---|
| **Fast-path early exit** | `pip --version` bypasses entire `pip._internal` import | 5-100x for that codepath |
| **Import deferral** | Move `import X` from module level into the function that uses it | 2-20ms per deferred module |
| **`TYPE_CHECKING` guard** | Move annotation-only imports behind `if TYPE_CHECKING:` | 1-5ms per module |
| **`from __future__ import annotations`** | Enables string annotations so type aliases can move to `TYPE_CHECKING` | Unlocks further deferrals |
| **Kill dead imports** | Remove imports that aren't used at runtime | 1-10ms each |
| **Avoid transitive chains** | `dataclasses` → `inspect` (~10ms); `typing.Match` → `re` (~3ms) | Chain-dependent |
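Import deferral, the `TYPE_CHECKING` guard, and `from __future__ import annotations` compose: the future import makes every annotation a string, which lets annotation-only imports move behind the guard and runtime-only imports move into the functions that use them. A minimal sketch (illustrative module, not codeflash code; `pandas` stands in for any heavy dependency):

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Annotation-only import: never executed at runtime.
    from pandas import DataFrame


def summarize(frame: DataFrame) -> str:
    # With `from __future__ import annotations`, `DataFrame` here is just
    # a string at runtime, so the guarded import above is sufficient.
    return f"{len(frame)} rows"


def render_report(path: str) -> None:
    # Import deferral: pandas is paid for only on the codepath that
    # actually needs it, not on fast paths like `--version`.
    import pandas as pd

    print(summarize(pd.read_csv(path)))
```

`python -X importtime` (see the profiling toolkit below) confirms whether a deferred module actually drops out of the import chain, since another dependency may still pull it in transitively.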
### Tier 2: Architecture (highest absolute time savings)

| Pattern | Example | Typical Savings |
|---|---|---|
| **Replace `@dataclass` with `__slots__`** | ConsoleOptions: 344 → 136 bytes, eliminates `inspect` import | 10ms import + 60% memory |
| **Lazy loading large data** | Rich emoji dict (3,608 entries) deferred to first use | 2-5ms |
| **Speculative prefetch** | Background thread downloads metadata while resolver works | 10-30% on I/O-bound paths |
| **Conditional rebuild** | Skip rebuilding Criterion when nothing changed (95% of cases) | 20-40% on hot loop |
| **Cache at the right level** | `lru_cache` on `Style._add`, `parse_wheel_filename`, tag generation | Varies widely |

### Tier 3: Micro-optimizations (small per-call, adds up in hot loops)

| Pattern | Example | Typical Savings |
|---|---|---|
| **Identity shortcut (`is` before `==`)** | `Style.__eq__`, `Segment.simplify` | 1.3-1.8x for identity case |
| **Bypass public API internally** | `Style._add` (cached) vs `__add__` (copies linked styles) | 1.1-1.3x |
| **Hoist to module level** | `operator.attrgetter`, `methodcaller` as module constants | ns per call |
| **`__slots__` on hot classes** | Criterion, ConsoleOptions, tokenizer state | 40-60% memory |
| **Pre-compute in `__init__`** | `Link._is_wheel`, `Version._str_cache` | Eliminates repeated work |
| **Direct construction** | `__new__` + slot assignment bypassing `__init__` | 20-40% for allocation-heavy paths |
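Several Tier 2 and Tier 3 patterns tend to land on the same hot class. A sketch loosely modeled on the Rich `Style` work cited in the tables above (the `Style` class and `combine` helper here are illustrative, not actual Rich or codeflash code):

```python
from __future__ import annotations

from functools import lru_cache


class Style:
    # Tier 2: `__slots__` drops the per-instance __dict__ (big memory
    # savings on many-instance classes) and, unlike @dataclass, pulls
    # in no `inspect` import at startup.
    __slots__ = ("bold", "color")

    def __init__(self, bold: bool = False, color: str | None = None) -> None:
        # Treated as immutable once constructed, so caching on it is safe.
        self.bold = bold
        self.color = color

    def __eq__(self, other: object) -> bool:
        # Tier 3: identity shortcut. Comparing an object to itself is
        # common in hot loops, and `is` skips the attribute comparisons.
        if self is other:
            return True
        if not isinstance(other, Style):
            return NotImplemented
        return self.bold == other.bold and self.color == other.color

    def __hash__(self) -> int:
        # Needed both because __eq__ is defined and for lru_cache below.
        return hash((self.bold, self.color))


@lru_cache(maxsize=1024)
def combine(a: Style, b: Style) -> Style:
    # Tier 2: cache at the right level. The same style pairs recur
    # constantly, so repeated combines return the memoized object.
    return Style(a.bold or b.bold, b.color or a.color)
```

The `lru_cache` pays off only because the same pairs recur constantly; at a low-hit-rate call site it would be one of the anti-patterns below.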
### Tier 4: I/O and Serialization

| Pattern | Example | Typical Savings |
|---|---|---|
| **Replace slow serializer** | msgpack (pure Python) → stdlib JSON (C) | 2-5x for cache ops |
| **Connection pooling** | Increase HTTP pool size for parallel index fetches | Latency-dependent |
| **Parallel I/O** | SharedThreadPoolExecutor for wheel downloads | Throughput-dependent |

## Anti-patterns (things that didn't work or weren't worth it)

- **Caching with low hit rate** — Caches that get evicted before reuse add overhead
- **Premature `__slots__`** — Only worth it on classes with many instances or in hot loops
- **Over-deferring** — Deferring imports in functions called once on startup just moves the cost
- **Regex elimination** — On Python 3.12, `typing` imports `re` anyway, so deferring `re` is a no-op there
- **Optimizing cold paths** — Error handling, setup/teardown, one-time init — not worth the complexity

## Methodology

### Profiling toolkit

| Tool | Purpose | When to use |
|---|---|---|
| `python -X importtime` | Import cost breakdown | First step for any CLI tool |
| `hyperfine` | E2E command timing with statistics | Before/after validation |
| `cProfile` / `py-spy` | Function-level CPU profiling | Finding hot functions |
| `timeit` | Micro-benchmarks for specific functions | Validating micro-opts |
| `memray` / `tracemalloc` | Memory profiling | Allocation-heavy paths |
| `objgraph` | Object count tracking | Finding redundant allocations |

### Environment

- Azure Standard_D2s_v5 (non-burstable, consistent CPU)
- Multiple Python versions (3.12 and 3.13 at minimum — behavior differs between versions)
- hyperfine with `--warmup 5 --min-runs 30` for statistical rigor
- All tests passing before AND after every change

### Workflow

```
1. Profile → identify top-N bottlenecks
2. For each bottleneck:
   a. Read the actual code (don't guess from profiler shapes)
   b. Implement the smallest change that addresses it
   c. Micro-benchmark before/after
   d. Run full test suite
   e. E2E benchmark
3. Commit with a clear perf: prefix and numbers
4. Repeat until plateau
```

## Codeflash Optimization Plan

### Phase 1: Profile each layer

**codeflash (core engine)**

- [ ] `python -X importtime -c "import codeflash"` — import chain analysis
- [ ] `codeflash --version` startup time baseline
- [ ] Profile a real optimization run end-to-end (py-spy flamegraph)
- [ ] Memory profile on a large codebase target
- [ ] Trace test generation loop: AST parse, codegen, subprocess, validation

**codeflash-internal (backend service)**

- [ ] Profile LLM call latency vs overhead (serialization, prompt assembly, result parsing)
- [ ] Check connection reuse and retry patterns
- [ ] Measure cold-start time

**Cross-boundary**

- [ ] E2E trace: user command → agent → CLI → service → result (where does the time go?)
- [ ] Measure serialization costs at each boundary
- [ ] Identify redundant round-trips

### Phase 2: Identify targets

- [ ] Rank imports by cost per layer, identify deferrable ones
- [ ] Find hot functions in the optimization loop
- [ ] Check for heavy dependencies that could be deferred or replaced
- [ ] Map cross-boundary overhead (serialization, subprocess, HTTP)
- [ ] Look for the patterns from Tiers 1-4 above

### Phase 3: Implement

- [ ] Apply Tier 1 (startup/import) optimizations first — highest visibility
- [ ] Then Tier 2 (architecture) — highest absolute savings
- [ ] Then Tier 3 (micro) and Tier 4 (I/O) as needed
- [ ] Cross-boundary optimizations last (require changes in multiple repos)
- [ ] Each change: micro-bench → test suite → E2E bench → commit

### Phase 4: Document

- [ ] Before/after benchmark tables per layer
- [ ] E2E before/after for user-facing operations
- [ ] Per-optimization breakdown
- [ ] Flamegraphs showing the shift
- [ ] Case study narrative: "codeflash optimized itself"

## Repo Structure

```
.
├── README.md              # This file — framework and playbook
├── repos/                 # The vertical stack (git-ignored, clone locally)
│   ├── codeflash/         # Core engine (codeflash-ai/codeflash)
│   ├── codeflash-internal/# Backend service (codeflash-ai/codeflash-internal)
│   └── docflash/          # CI pipeline (codeflash-ai/docflash)
├── prior-art/
│   ├── rich-summary.md    # What we learned from Rich
│   └── pip-summary.md     # What we learned from pip
├── infra/
│   ├── README.md          # Infrastructure design and architecture
│   ├── cloud-init.yaml    # VM provisioning (one-shot)
│   └── vm-manage.sh       # VM lifecycle management script
├── profiles/              # Profiling output (importtime, flamegraphs)
│   ├── codeflash/         # Core engine profiles
│   ├── codeflash-internal/# Service profiles
│   └── cross-boundary/    # E2E traces spanning layers
├── bench/                 # Benchmark scripts (copied to VM by cloud-init)
├── data/                  # Raw benchmark results
└── results/               # Before/after analysis
```
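The first Phase 1 checkbox is also the easiest to automate for `profiles/`. A minimal sketch of a helper that captures and ranks the import baseline (hypothetical script, stdlib only; `-X importtime` writes lines of the form `import time: <self us> | <cumulative us> | <module>` to stderr):

```python
"""Rank `python -X importtime` output by cumulative import cost."""
import subprocess
import sys

# Run the import under -X importtime and capture the per-module report.
result = subprocess.run(
    [sys.executable, "-X", "importtime", "-c", "import codeflash"],
    capture_output=True,
    text=True,
)

rows = []
for line in result.stderr.splitlines():
    if not line.startswith("import time:"):
        continue
    try:
        _self, cumulative_us, module = line[len("import time:"):].split("|")
        rows.append((int(cumulative_us.strip()), module.strip()))
    except ValueError:
        continue  # skip the non-numeric header line

# Top offenders are the first candidates for deferral or TYPE_CHECKING guards.
for cumulative_us, module in sorted(rows, reverse=True)[:20]:
    print(f"{cumulative_us / 1000:8.1f} ms  {module}")
```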