# Codeflash Self-Optimization
Dogfooding codeflash on itself: applying the same methodology that produced the 2.35x Rich Console import and 1.81x pip resolver speedups to codeflash's own performance.
## The Stack
All codeflash repos under one roof for vertical optimization. A user-facing operation (e.g., `codeflash optimize foo.py`) touches every layer — optimizing one layer in isolation misses cross-boundary costs.
```
┌──────────────────────────────────────────────────────────────┐
│ User: codeflash optimize foo.py                              │
└──────────────────────────────┬───────────────────────────────┘
┌──────────────────────────────▼───────────────────────────────┐
│ codeflash (core engine)                                      │
│ • CLI entry point                                            │
│ • Test generation, AST analysis, benchmark harness           │
│ • Optimization loop: profile → generate → validate           │
│ repos/codeflash/                                             │
└──────────────────────────────┬───────────────────────────────┘
┌──────────────────────────────▼───────────────────────────────┐
│ codeflash-internal (backend service)                         │
│ • LLM orchestration, prompt management                       │
│ • Optimization result storage                                │
│ repos/codeflash-internal/                                    │
└──────────────────────────────┬───────────────────────────────┘
┌──────────────────────────────▼───────────────────────────────┐
│ docflash (CI pipeline)                                       │
│ • Dockerized optimization runs                               │
│ • Bug detection + auto-fix pipeline                          │
│ repos/docflash/                                              │
└──────────────────────────────────────────────────────────────┘
```
### Setup
```bash
# Clone all repos into repos/
mkdir -p repos
git clone git@github.com:codeflash-ai/codeflash.git repos/codeflash
git clone git@github.com:codeflash-ai/codeflash-internal.git repos/codeflash-internal
git clone git@github.com:codeflash-ai/docflash.git repos/docflash
```
### Cross-boundary optimization targets
| Boundary | What to look for |
|---|---|
| **codeflash CLI → internal service** | HTTP round-trip latency, payload size, connection reuse, retry overhead (see the sketch below) |
| **codeflash CLI → user's code** | AST parsing cost, test generation I/O, benchmark harness subprocess overhead |
| **docflash → codeflash CLI** | Docker startup, volume mount overhead, cold-start import time |
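
As a concrete probe for the first boundary, a minimal sketch that times repeated calls with and without a shared `requests.Session`; the endpoint URL is a placeholder rather than the real service address, and `requests` is an assumed client.
```python
import time
import requests  # assumption: requests is available; the real client may differ

URL = "https://example.com/"  # placeholder endpoint, not the actual backend URL

def timed(call, n: int = 20) -> float:
    start = time.perf_counter()
    for _ in range(n):
        call()
    return time.perf_counter() - start

# Fresh connection per request: pays the TCP + TLS handshake every time.
cold = timed(lambda: requests.get(URL, timeout=10))

# Shared session: keep-alive connections are pooled and reused.
session = requests.Session()
warm = timed(lambda: session.get(URL, timeout=10))

print(f"no reuse: {cold:.2f}s   pooled session: {warm:.2f}s")
```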
## Prior Art
| Project | Key Result | Approach | Case Study |
|---|---|---|---|
| **Rich** | 2.35x Console import (79ms → 34ms) | Import deferral, `re` elimination, runtime micro-opts | [rich_org](https://github.com/KRRT7/rich_org) |
| **pip** | 7x `--version`, 1.81x resolver | 122 commits: startup, resolver, packaging, import deferral | [pip_org](https://github.com/KRRT7/pip_org) |
## Patterns That Worked
Distilled from 122 pip commits + 2 Rich PRs. These are the repeatable optimization categories, ordered by typical impact; a short illustrative sketch follows each tier's table.
### Tier 1: Startup / Import Time (highest user-visible impact)
| Pattern | Example | Typical Savings |
|---|---|---|
| **Fast-path early exit** | `pip --version` bypasses entire `pip._internal` import | 5-100x for that codepath |
| **Import deferral** | Move `import X` from module level into the function that uses it | 2-20ms per deferred module |
| **`TYPE_CHECKING` guard** | Move annotation-only imports behind `if TYPE_CHECKING:` | 1-5ms per module |
| **`from __future__ import annotations`** | Enables string annotations so type aliases can move to `TYPE_CHECKING` | Unlocks further deferrals |
| **Kill dead imports** | Remove imports that aren't used at runtime | 1-10ms each |
| **Avoid transitive chains** | `dataclasses` → `inspect` (~10ms); `typing.Match` → `re` (~3ms) | Chain-dependent |
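
A minimal sketch of the Tier 1 patterns combined in one hypothetical CLI module; `load_config`, `heavy_cli`, and the `yaml` dependency are illustrative names, not taken from codeflash:
```python
from __future__ import annotations  # annotations stay strings, enabling the guard below

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Annotation-only import: never executed at runtime, costs nothing at startup.
    from pathlib import Path

def load_config(path: Path) -> dict:
    # Import deferral: only callers that actually load a config pay for yaml.
    import yaml
    with open(path) as f:
        return yaml.safe_load(f)

def main(argv: list[str]) -> int:
    # Fast-path early exit: --version never touches the heavy import chain.
    if argv[:1] == ["--version"]:
        print("codeflash 0.0.0")  # placeholder version string
        return 0
    import heavy_cli  # deferred until real work is requested (illustrative module)
    return heavy_cli.run(argv)
```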
### Tier 2: Architecture (highest absolute time savings)
| Pattern | Example | Typical Savings |
|---|---|---|
| **Replace `@dataclass` with `__slots__`** | ConsoleOptions: 344 → 136 bytes, eliminates `inspect` import | 10ms import + 60% memory |
| **Lazy loading large data** | Rich emoji dict (3,608 entries) deferred to first use | 2-5ms |
| **Speculative prefetch** | Background thread downloads metadata while resolver works | 10-30% on I/O-bound paths |
| **Conditional rebuild** | Skip rebuilding Criterion when nothing changed (95% of cases) | 20-40% on hot loop |
| **Cache at the right level** | `lru_cache` on `Style._add`, `parse_wheel_filename`, tag generation | Varies widely |
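
Two Tier 2 patterns sketched together, loosely modeled on the Rich example rather than copied from it: a `__slots__` class standing in for a `@dataclass`, and an `lru_cache` placed on a call site that is actually hot.
```python
from functools import lru_cache

class ConsoleOptions:
    # __slots__ drops the per-instance __dict__ (smaller objects, faster attribute
    # access) and avoids the dataclasses -> inspect import chain entirely.
    __slots__ = ("width", "height", "legacy_windows")

    def __init__(self, width: int, height: int, legacy_windows: bool = False) -> None:
        self.width = width
        self.height = height
        self.legacy_windows = legacy_windows

@lru_cache(maxsize=1024)
def parse_style(style: str) -> tuple[str, ...]:
    # Cache at the right level: the same handful of style strings are parsed
    # thousands of times per render, so the hit rate stays high.
    return tuple(style.split())
```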
### Tier 3: Micro-optimizations (small per-call, adds up in hot loops)
| Pattern | Example | Typical Savings |
|---|---|---|
| **Identity shortcut (`is` before `==`)** | `Style.__eq__`, `Segment.simplify` | 1.3-1.8x for identity case |
| **Bypass public API internally** | `Style._add` (cached) vs `__add__` (copies linked styles) | 1.1-1.3x |
| **Hoist to module level** | `operator.attrgetter`, `methodcaller` as module constants | ns per call |
| **`__slots__` on hot classes** | Criterion, ConsoleOptions, tokenizer state | 40-60% memory |
| **Pre-compute in `__init__`** | `Link._is_wheel`, `Version._str_cache` | Eliminates repeated work |
| **Direct construction** | `__new__` + slot assignment bypassing `__init__` | 20-40% for allocation-heavy paths |
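
A sketch of the identity shortcut and direct-construction patterns on a stand-in `Segment` class (not Rich's actual implementation):
```python
class Segment:
    __slots__ = ("text", "style")

    def __init__(self, text: str, style: str | None = None) -> None:
        self.text = text
        self.style = style

    def __eq__(self, other: object) -> bool:
        if self is other:  # identity shortcut: skip field comparison in the common case
            return True
        if not isinstance(other, Segment):
            return NotImplemented
        return self.text == other.text and self.style == other.style

    def with_text(self, text: str) -> "Segment":
        # Direct construction: allocate via __new__ and assign slots directly,
        # skipping __init__ overhead on an allocation-heavy path.
        seg = Segment.__new__(Segment)
        seg.text = text
        seg.style = self.style
        return seg
```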
### Tier 4: I/O and Serialization
| Pattern | Example | Typical Savings |
|---|---|---|
| **Replace slow serializer** | msgpack (pure Python) → stdlib JSON (C) | 2-5x for cache ops |
| **Connection pooling** | Increase HTTP pool size for parallel index fetches | Latency-dependent |
| **Parallel I/O** | SharedThreadPoolExecutor for wheel downloads | Throughput-dependent |
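
A sketch of the parallel-I/O row: one pooled `requests.Session` shared across a `ThreadPoolExecutor`. The URLs are illustrative stand-ins for index fetches, and `requests` is an assumed dependency.
```python
from concurrent.futures import ThreadPoolExecutor
import requests  # assumption: requests is the HTTP client in use

URLS = [f"https://pypi.org/simple/{name}/" for name in ("rich", "pip", "requests")]

def fetch(session: requests.Session, url: str) -> int:
    # The shared session gives connection pooling; the executor gives overlap.
    return len(session.get(url, timeout=10).content)

with requests.Session() as session, ThreadPoolExecutor(max_workers=8) as pool:
    sizes = list(pool.map(lambda url: fetch(session, url), URLS))

print(dict(zip(URLS, sizes)))
```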
## Anti-patterns (things that didn't work or weren't worth it)
- **Caching with low hit rate** — Caches that get evicted before reuse add overhead
- **Premature `__slots__`** — Only worth it on classes with many instances or in hot loops
- **Over-deferring** — Deferring imports in functions called once on startup just moves the cost
- **Regex elimination** — On Python 3.12, `typing` imports `re` anyway, so deferring `re` is a no-op there
- **Optimizing cold paths** — Error handling, setup/teardown, one-time init — not worth the complexity
## Methodology
### Profiling toolkit
| Tool | Purpose | When to use |
|---|---|---|
| `python -X importtime` | Import cost breakdown | First step for any CLI tool |
| `hyperfine` | E2E command timing with statistics | Before/after validation |
| `cProfile` / `py-spy` | Function-level CPU profiling | Finding hot functions |
| `timeit` | Micro-benchmarks for specific functions | Validating micro-opts |
| `memray` / `tracemalloc` | Memory profiling | Allocation-heavy paths |
| `objgraph` | Object count tracking | Finding redundant allocations |
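
For the function-level profiling row, a minimal cProfile harness sketch; `run_optimization` is a placeholder for whichever entry point is being profiled.
```python
import cProfile
import pstats

def run_optimization() -> None:
    # Placeholder workload; substitute the real call being profiled.
    sum(i * i for i in range(1_000_000))

profiler = cProfile.Profile()
profiler.enable()
run_optimization()
profiler.disable()

# Print the 20 functions with the highest cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```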
### Environment
- Azure Standard_D2s_v5 (non-burstable, consistent CPU)
- Multiple Python versions (3.12, 3.13 minimum — behavior differs)
- hyperfine with `--warmup 5 --min-runs 30` for statistical rigor
- All tests passing before AND after every change
### Workflow
```
1. Profile → identify top-N bottlenecks
2. For each bottleneck:
   a. Read the actual code (don't guess from profiler shapes)
   b. Implement the smallest change that addresses it
   c. Micro-benchmark before/after
   d. Run full test suite
   e. E2E benchmark
3. Commit with clear perf: prefix and numbers
4. Repeat until plateau
```
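Workflow step 2c in miniature: benchmark the old and new implementations of a candidate function side by side before trusting the change; both functions here are toy stand-ins for whatever is being optimized.
```python
import re
import timeit

_WHEEL_RE = re.compile(r"\.whl$", re.IGNORECASE)

def is_wheel_old(filename: str) -> bool:
    # Baseline implementation: regex search on every call.
    return _WHEEL_RE.search(filename) is not None

def is_wheel_new(filename: str) -> bool:
    # Candidate optimization: plain string method, same result.
    return filename.lower().endswith(".whl")

name = "rich-13.7.0-py3-none-any.whl"
old = timeit.timeit(lambda: is_wheel_old(name), number=1_000_000)
new = timeit.timeit(lambda: is_wheel_new(name), number=1_000_000)
print(f"old: {old:.3f}s  new: {new:.3f}s  speedup: {old / new:.2f}x")
```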
## Codeflash Optimization Plan
### Phase 1: Profile each layer
**codeflash (core engine)**
- [ ] `python -X importtime -c "import codeflash"` — import chain analysis (see the sketch after this list)
- [ ] `codeflash --version` startup time baseline
- [ ] Profile a real optimization run end-to-end (py-spy flamegraph)
- [ ] Memory profile on a large codebase target
- [ ] Trace test generation loop: AST parse, codegen, subprocess, validation
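
A sketch for the first checklist item above: run the import under `-X importtime` and rank modules by self time. The parsing assumes CPython's standard `import time: self [us] | cumulative | imported package` stderr format.
```python
import subprocess
import sys

proc = subprocess.run(
    [sys.executable, "-X", "importtime", "-c", "import codeflash"],
    capture_output=True,
    text=True,
)

rows = []
for line in proc.stderr.splitlines():
    if not line.startswith("import time:"):
        continue
    parts = line[len("import time:"):].split("|")
    if len(parts) != 3:
        continue
    try:
        self_us = int(parts[0].strip())
    except ValueError:
        continue  # skip the header row
    rows.append((self_us, parts[2].strip()))

# Top 20 modules by self time: prime candidates for deferral.
for self_us, module in sorted(rows, reverse=True)[:20]:
    print(f"{self_us:>10} us  {module}")
```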
**codeflash-internal (backend service)**
- [ ] Profile LLM call latency vs overhead (serialization, prompt assembly, result parsing)
- [ ] Check connection reuse and retry patterns
- [ ] Measure cold-start time

**Cross-boundary**
- [ ] E2E trace: user command → agent → CLI → service → result (where does time go?)
- [ ] Measure serialization costs at each boundary
- [ ] Identify redundant round-trips
### Phase 2: Identify targets
- [ ] Rank imports by cost per layer, identify deferrable ones
- [ ] Find hot functions in the optimization loop
- [ ] Check for heavy dependencies that could be deferred or replaced
- [ ] Map cross-boundary overhead (serialization, subprocess, HTTP)
- [ ] Look for the patterns from Tier 1-4 above
### Phase 3: Implement
- [ ] Apply Tier 1 (startup/import) optimizations first — highest visibility
- [ ] Then Tier 2 (architecture) — highest absolute savings
- [ ] Then Tier 3 (micro) and Tier 4 (I/O) as needed
- [ ] Cross-boundary optimizations last (require changes in multiple repos)
- [ ] Each change: micro-bench → test suite → E2E bench → commit
### Phase 4: Document
- [ ] Before/after benchmark tables per layer
- [ ] E2E before/after for user-facing operations
- [ ] Per-optimization breakdown
- [ ] Flamegraphs showing the shift
- [ ] Case study narrative: "codeflash optimized itself"
## Repo Structure
```
.
├── README.md               # This file — framework and playbook
├── repos/                  # The vertical stack (git-ignored, clone locally)
│   ├── codeflash/          # Core engine (codeflash-ai/codeflash)
│   ├── codeflash-internal/ # Backend service (codeflash-ai/codeflash-internal)
│   └── docflash/           # CI pipeline (codeflash-ai/docflash)
├── prior-art/
│   ├── rich-summary.md     # What we learned from Rich
│   └── pip-summary.md      # What we learned from pip
├── infra/
│   ├── README.md           # Infrastructure design and architecture
│   ├── cloud-init.yaml     # VM provisioning (one-shot)
│   └── vm-manage.sh        # VM lifecycle management script
├── profiles/               # Profiling output (importtime, flamegraphs)
│   ├── codeflash/          # Core engine profiles
│   ├── codeflash-internal/ # Service profiles
│   └── cross-boundary/     # E2E traces spanning layers
├── bench/                  # Benchmark scripts (copied to VM by cloud-init)
├── data/                   # Raw benchmark results
└── results/                # Before/after analysis
```