mirror of
https://github.com/codeflash-ai/codeflash-agent.git
synced 2026-05-04 18:25:19 +00:00
# Codeflash Self-Optimization

Dogfooding codeflash on itself: applying the same methodology that produced the 2.35x Rich Console import and 1.81x pip resolver speedups to codeflash's own performance.

## The Stack

All codeflash repos under one roof for vertical optimization. A user-facing operation (e.g., `codeflash optimize foo.py`) touches every layer — optimizing one layer in isolation misses cross-boundary costs.

```
┌────────────────────────────────────────────────────────┐
│  User: codeflash optimize foo.py                       │
└──────────────────────────┬─────────────────────────────┘
                           │
┌──────────────────────────▼─────────────────────────────┐
│  codeflash (core engine)                               │
│  • CLI entry point                                     │
│  • Test generation, AST analysis, benchmark harness    │
│  • Optimization loop: profile → generate → validate    │
│  repos/codeflash/                                      │
└──────────────────────────┬─────────────────────────────┘
                           │
┌──────────────────────────▼─────────────────────────────┐
│  codeflash-internal (backend service)                  │
│  • LLM orchestration, prompt management                │
│  • Optimization result storage                         │
│  repos/codeflash-internal/                             │
└──────────────────────────┬─────────────────────────────┘
                           │
┌──────────────────────────▼─────────────────────────────┐
│  docflash (CI pipeline)                                │
│  • Dockerized optimization runs                        │
│  • Bug detection + auto-fix pipeline                   │
│  repos/docflash/                                       │
└────────────────────────────────────────────────────────┘
```

### Setup

```bash
# Clone all repos into repos/
mkdir -p repos
git clone git@github.com:codeflash-ai/codeflash.git repos/codeflash
git clone git@github.com:codeflash-ai/codeflash-internal.git repos/codeflash-internal
git clone git@github.com:codeflash-ai/docflash.git repos/docflash
```

### Cross-boundary optimization targets

| Boundary | What to look for |
|---|---|
| **codeflash CLI → internal service** | HTTP round-trip latency, payload size, connection reuse, retry overhead |
| **codeflash CLI → user's code** | AST parsing cost, test generation I/O, benchmark harness subprocess overhead |
| **docflash → codeflash CLI** | Docker startup, volume mount overhead, cold-start import time |
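
Before optimizing any of these boundaries, it helps to put numbers on them. A minimal, network-free sketch for the serialization side of the cost (the payload shown is invented, not a real codeflash message):

```python
import json
import time


def boundary_cost(payload, rounds: int = 1000) -> tuple[int, float]:
    """Return (wire bytes, seconds per encode+decode round) for a payload.

    Measures only serialization overhead; HTTP latency would sit on top.
    """
    # Compact separators shrink the payload that crosses the boundary.
    wire = json.dumps(payload, separators=(",", ":")).encode()
    start = time.perf_counter()
    for _ in range(rounds):
        json.loads(json.dumps(payload, separators=(",", ":")))
    per_round = (time.perf_counter() - start) / rounds
    return len(wire), per_round
```

If the per-round serialization time is negligible next to the measured HTTP round-trip, the payload column can be deprioritized for that boundary.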

## Prior Art

| Project | Key Result | Approach | Case Study |
|---|---|---|---|
| **Rich** | 2.35x Console import (79ms → 34ms) | Import deferral, `re` elimination, runtime micro-opts | [rich_org](https://github.com/KRRT7/rich_org) |
| **pip** | 7x `--version`, 1.81x resolver | 122 commits: startup, resolver, packaging, import deferral | [pip_org](https://github.com/KRRT7/pip_org) |

## Patterns That Worked

Distilled from 122 pip commits + 2 Rich PRs. These are the repeatable optimization categories, ordered by typical impact.

### Tier 1: Startup / Import Time (highest user-visible impact)

| Pattern | Example | Typical Savings |
|---|---|---|
| **Fast-path early exit** | `pip --version` bypasses entire `pip._internal` import | 5-100x for that codepath |
| **Import deferral** | Move `import X` from module level into the function that uses it | 2-20ms per deferred module |
| **`TYPE_CHECKING` guard** | Move annotation-only imports behind `if TYPE_CHECKING:` | 1-5ms per module |
| **`from __future__ import annotations`** | Enables string annotations so type aliases can move to `TYPE_CHECKING` | Unlocks further deferrals |
| **Kill dead imports** | Remove imports that aren't used at runtime | 1-10ms each |
| **Avoid transitive chains** | `dataclasses` → `inspect` (~10ms); `typing.Match` → `re` (~3ms) | Chain-dependent |
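
The first few patterns compose naturally in one module. A minimal sketch (the deferred module, the `Namespace` annotation, and the version string are illustrative, not from the codeflash codebase):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Annotation-only import: the type checker sees it, the runtime never
    # pays its import cost. (With `from __future__ import annotations` at
    # the top of a real module, the quotes below become unnecessary.)
    from argparse import Namespace


def fast_path_version(argv: list[str]) -> "str | None":
    # Fast-path early exit: answer --version before importing anything heavy.
    # The version string is a placeholder.
    if argv[:1] == ["--version"]:
        return "codeflash 0.0.0"
    return None


def describe(args: "Namespace") -> str:
    # Import deferral: pprint is only needed on this path, so its import
    # cost moves from program startup to first call.
    from pprint import pformat

    return pformat(vars(args))
```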

### Tier 2: Architecture (highest absolute time savings)

| Pattern | Example | Typical Savings |
|---|---|---|
| **Replace `@dataclass` with `__slots__`** | ConsoleOptions: 344 → 136 bytes, eliminates `inspect` import | 10ms import + 60% memory |
| **Lazy loading large data** | Rich emoji dict (3,608 entries) deferred to first use | 2-5ms |
| **Speculative prefetch** | Background thread downloads metadata while resolver works | 10-30% on I/O-bound paths |
| **Conditional rebuild** | Skip rebuilding Criterion when nothing changed (95% of cases) | 20-40% on hot loop |
| **Cache at the right level** | `lru_cache` on `Style._add`, `parse_wheel_filename`, tag generation | Varies widely |
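
A sketch of the first and last rows. `ConsoleOptions` borrows its name from the Rich case study but is a toy here, and the cached helper is a hypothetical stand-in for functions like `Style._add`:

```python
from functools import lru_cache


class ConsoleOptions:
    # Plain class with __slots__: no per-instance __dict__, and no
    # `dataclasses` import (which transitively pulls in `inspect`).
    __slots__ = ("width", "height", "no_color")

    def __init__(self, width: int, height: int, no_color: bool = False) -> None:
        self.width = width
        self.height = height
        self.no_color = no_color


@lru_cache(maxsize=1024)
def combine_styles(a: str, b: str) -> str:
    # "Cache at the right level": the operation is pure and hit repeatedly
    # with the same argument pairs, so memoizing it is nearly free.
    return f"{a} {b}".strip()
```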

### Tier 3: Micro-optimizations (small per-call, adds up in hot loops)

| Pattern | Example | Typical Savings |
|---|---|---|
| **Identity shortcut (`is` before `==`)** | `Style.__eq__`, `Segment.simplify` | 1.3-1.8x for identity case |
| **Bypass public API internally** | `Style._add` (cached) vs `__add__` (copies linked styles) | 1.1-1.3x |
| **Hoist to module level** | `operator.attrgetter`, `methodcaller` as module constants | ns per call |
| **`__slots__` on hot classes** | Criterion, ConsoleOptions, tokenizer state | 40-60% memory |
| **Pre-compute in `__init__`** | `Link._is_wheel`, `Version._str_cache` | Eliminates repeated work |
| **Direct construction** | `__new__` + slot assignment bypassing `__init__` | 20-40% for allocation-heavy paths |
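
Several of these rows fit in one small class. This is a hypothetical, minimal analogue of Rich's `Style`, not its real implementation:

```python
class Style:
    # __slots__ on a hot class: no per-instance __dict__.
    __slots__ = ("name",)

    def __init__(self, name: str) -> None:
        self.name = name

    def __eq__(self, other: object) -> bool:
        # Identity shortcut: comparing an object to itself is common in hot
        # loops, and `is` settles it without any attribute access.
        if self is other:
            return True
        if not isinstance(other, Style):
            return NotImplemented
        return self.name == other.name

    @classmethod
    def _from_name(cls, name: str) -> "Style":
        # Direct construction: __new__ plus slot assignment skips __init__.
        # Worth it only on allocation-heavy internal paths where the caller
        # has already validated the fields.
        style = cls.__new__(cls)
        style.name = name
        return style
```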

### Tier 4: I/O and Serialization

| Pattern | Example | Typical Savings |
|---|---|---|
| **Replace slow serializer** | msgpack (pure Python) → stdlib JSON (C) | 2-5x for cache ops |
| **Connection pooling** | Increase HTTP pool size for parallel index fetches | Latency-dependent |
| **Parallel I/O** | SharedThreadPoolExecutor for wheel downloads | Throughput-dependent |
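
A sketch of the serializer swap and the parallel-I/O pattern. `fetch` is injected so the example stays network-free; in a real pipeline it would be an HTTP GET on a pooled session:

```python
import json
from concurrent.futures import ThreadPoolExecutor


def cache_dump(obj) -> bytes:
    # Stdlib json has a C accelerator, and compact separators shrink the
    # payload: this is the shape of the msgpack → JSON swap above.
    return json.dumps(obj, separators=(",", ":")).encode()


def fetch_all(urls, fetch, max_workers: int = 8):
    # Parallel I/O: overlap network waits across downloads instead of
    # fetching one URL at a time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```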

## Anti-patterns (things that didn't work or weren't worth it)

- **Caching with low hit rate** — Caches that get evicted before reuse add overhead
- **Premature `__slots__`** — Only worth it on classes with many instances or in hot loops
- **Over-deferring** — Deferring imports in functions called once on startup just moves the cost
- **Regex elimination** — On Python 3.12, `typing` imports `re` anyway, so deferring `re` is a no-op there
- **Optimizing cold paths** — Error handling, setup/teardown, one-time init — not worth the complexity
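
The first anti-pattern is cheap to check before committing a cache: `cache_info()` reports the hit rate directly. The helper below is hypothetical; the measurement is the point:

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def normalize(name: str) -> str:
    # Toy hot-path helper; only the cache accounting below matters.
    return name.strip().lower().replace("-", "_")


# Simulate a workload with one repeated key.
for raw in ["Foo-Bar", "foo-bar ", "Foo-Bar"]:
    normalize(raw)

# A cache whose hits stay near zero under a realistic workload is pure
# overhead and should be removed.
info = normalize.cache_info()
```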

## Methodology

### Profiling toolkit

| Tool | Purpose | When to use |
|---|---|---|
| `python -X importtime` | Import cost breakdown | First step for any CLI tool |
| `hyperfine` | E2E command timing with statistics | Before/after validation |
| `cProfile` / `py-spy` | Function-level CPU profiling | Finding hot functions |
| `timeit` | Micro-benchmarks for specific functions | Validating micro-opts |
| `memray` / `tracemalloc` | Memory profiling | Allocation-heavy paths |
| `objgraph` | Object count tracking | Finding redundant allocations |
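
As a concrete instance of the `timeit` row, a before/after micro-benchmark of a deliberately naive string join (both functions are illustrative, not codeflash code):

```python
import timeit


def join_naive(parts):
    # Baseline: repeated string concatenation.
    out = ""
    for part in parts:
        out += part
    return out


def join_fast(parts):
    # Candidate optimization: single allocation via str.join.
    return "".join(parts)


parts = ["x"] * 1_000
# Identical workload for both variants; compare t_naive vs t_fast.
t_naive = timeit.timeit(lambda: join_naive(parts), number=200)
t_fast = timeit.timeit(lambda: join_fast(parts), number=200)
```

The same template validates any micro-opt: prove the outputs are identical first, then compare the timings.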

### Environment

- Azure Standard_D2s_v5 (non-burstable, consistent CPU)
- Multiple Python versions (3.12, 3.13 minimum — behavior differs)
- hyperfine with `--warmup 5 --min-runs 30` for statistical rigor
- All tests passing before AND after every change

### Workflow

```
1. Profile → identify top-N bottlenecks
2. For each bottleneck:
   a. Read the actual code (don't guess from profiler shapes)
   b. Implement the smallest change that addresses it
   c. Micro-benchmark before/after
   d. Run full test suite
   e. E2E benchmark
3. Commit with clear perf: prefix and numbers
4. Repeat until plateau
```

## Codeflash Optimization Plan

### Phase 1: Profile each layer

**codeflash (core engine)**

- [ ] `python -X importtime -c "import codeflash"` — import chain analysis
- [ ] `codeflash --version` startup time baseline
- [ ] Profile a real optimization run end-to-end (py-spy flamegraph)
- [ ] Memory profile on a large codebase target
- [ ] Trace test generation loop: AST parse, codegen, subprocess, validation
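
The first checkbox produces a wall of stderr; a small helper (illustrative, not part of codeflash) turns it into a ranked list:

```python
import subprocess
import sys


def import_costs(module: str) -> list[tuple[int, str]]:
    """Run `python -X importtime` on a module and return (cumulative_us, name)
    pairs sorted by descending cost."""
    proc = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", f"import {module}"],
        capture_output=True, text=True, check=True,
    )
    rows = []
    for line in proc.stderr.splitlines():
        # Data lines look like: "import time:       152 |       3937 |   json"
        # The header line contains "[us]" and is skipped.
        if not line.startswith("import time:") or "[us]" in line:
            continue
        _, rest = line.split(":", 1)
        _self_us, cumulative, name = (part.strip() for part in rest.split("|"))
        rows.append((int(cumulative), name))
    return sorted(rows, reverse=True)
```

The top entries of `import_costs("codeflash")` are the candidates for the Tier 1 deferral patterns above.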

**codeflash-internal (backend service)**

- [ ] Profile LLM call latency vs overhead (serialization, prompt assembly, result parsing)
- [ ] Check connection reuse and retry patterns
- [ ] Measure cold-start time

**Cross-boundary**

- [ ] E2E trace: user command → agent → CLI → service → result (where does time go?)
- [ ] Measure serialization costs at each boundary
- [ ] Identify redundant round-trips

### Phase 2: Identify targets

- [ ] Rank imports by cost per layer, identify deferrable ones
- [ ] Find hot functions in the optimization loop
- [ ] Check for heavy dependencies that could be deferred or replaced
- [ ] Map cross-boundary overhead (serialization, subprocess, HTTP)
- [ ] Look for the patterns from Tier 1-4 above

### Phase 3: Implement

- [ ] Apply Tier 1 (startup/import) optimizations first — highest visibility
- [ ] Then Tier 2 (architecture) — highest absolute savings
- [ ] Then Tier 3 (micro) and Tier 4 (I/O) as needed
- [ ] Cross-boundary optimizations last (require changes in multiple repos)
- [ ] Each change: micro-bench → test suite → E2E bench → commit

### Phase 4: Document

- [ ] Before/after benchmark tables per layer
- [ ] E2E before/after for user-facing operations
- [ ] Per-optimization breakdown
- [ ] Flamegraphs showing the shift
- [ ] Case study narrative: "codeflash optimized itself"

## Repo Structure

```
.
├── README.md               # This file — framework and playbook
├── repos/                  # The vertical stack (git-ignored, clone locally)
│   ├── codeflash/          # Core engine (codeflash-ai/codeflash)
│   ├── codeflash-internal/ # Backend service (codeflash-ai/codeflash-internal)
│   └── docflash/           # CI pipeline (codeflash-ai/docflash)
├── prior-art/
│   ├── rich-summary.md     # What we learned from Rich
│   └── pip-summary.md      # What we learned from pip
├── infra/
│   ├── README.md           # Infrastructure design and architecture
│   ├── cloud-init.yaml     # VM provisioning (one-shot)
│   └── vm-manage.sh        # VM lifecycle management script
├── profiles/               # Profiling output (importtime, flamegraphs)
│   ├── codeflash/          # Core engine profiles
│   ├── codeflash-internal/ # Service profiles
│   └── cross-boundary/     # E2E traces spanning layers
├── bench/                  # Benchmark scripts (copied to VM by cloud-init)
├── data/                   # Raw benchmark results
└── results/                # Before/after analysis
```