# pip Optimization — Lessons Learned
Full case study: [pip_org](https://github.com/KRRT7/pip_org)
## Context
pip is the default Python package installer. This case study covers 122 optimization commits across startup, dependency resolution, the packaging layer, import deferral, and the vendored Rich copy, benchmarked on Python 3.15.0a7 on macOS arm64.
## What we did (by impact)
### Startup (7x `--version`)
The single biggest visible win. `pip --version` went from 138ms to 20ms by:
1. Adding an ultra-fast path in `__main__.py` that reads the version and exits before importing `pip._internal`
2. Deferring `base_command.py` import chain to command creation time
3. Deferring autocompletion imports behind `PIP_AUTO_COMPLETE` check
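The fast-path idea can be sketched as follows; the version constant and the exact fallback import are illustrative, not pip's actual code:

```python
import sys

__version__ = "25.0"  # illustrative: pip reads this from pip/__init__.py

def main(argv=None):
    argv = sys.argv[1:] if argv is None else argv
    if argv == ["--version"]:
        # Ultra-fast path: answer and exit before any heavy import.
        print(f"pip {__version__} (fast path)")
        return 0
    # Slow path: only now pay for the full import chain.
    from pip._internal.cli.main import main as real_main  # deferred
    return real_main(argv)
```

The same check-before-import trick applies to autocompletion: the completion machinery is imported only when the `PIP_AUTO_COMPLETE` environment variable is present.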

**Key insight**: For simple commands like `--version`, the user shouldn't pay the cost of importing the entire tool.
### Resolver architecture (1.81x for complex resolves)
1. **Speculative metadata prefetch**: While the resolver processes package A, a background thread downloads PEP 658 metadata for the most likely next candidate. This overlaps I/O with CPU.
2. **Conditional Criterion rebuild**: `_remove_information_from_criteria` was rebuilding all criteria on every backtrack step — 95% of the time nothing changed. Added a check to skip unchanged criteria.
3. **`__slots__` on Criterion**: `Criterion` objects are created per package, per resolution step. `__slots__` saves roughly 100 bytes per instance, and across thousands of instances that adds up.
4. **Two-level candidate cache**: Specifier merge results + candidate infos cached across backtracking steps. The resolver re-evaluates the same packages many times during backtracking.
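A minimal sketch of the two-level idea, with illustrative types (the real cache sits inside pip's resolvelib provider and keys on real `SpecifierSet` objects):

```python
class CandidateCache:
    """Level 1 caches specifier merges; level 2 caches candidate
    lookups keyed by (name, merged specifier). Illustrative only."""

    __slots__ = ("_merged", "_candidates")  # same trick as Criterion

    def __init__(self):
        self._merged = {}
        self._candidates = {}

    def merge(self, specifiers):
        key = frozenset(specifiers)
        if key not in self._merged:
            # Stand-in for the real SpecifierSet merge.
            self._merged[key] = ",".join(sorted(specifiers))
        return self._merged[key]

    def find(self, name, specifiers, compute):
        key = (name, self.merge(specifiers))
        if key not in self._candidates:
            self._candidates[key] = compute(name, key[1])
        return self._candidates[key]
```

During backtracking the resolver asks for the same `(name, specifiers)` pairs over and over; both levels turn those repeats into dictionary hits.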
### Packaging layer (1.82x for `install -r`)
The vendored `packaging` library is called thousands of times during resolution:
- `Version.__hash__` cached in a slot (42K → 21K calls)
- `Specifier.__str__` and `__hash__` cached
- `_tokenizer` dataclass → `__slots__` class
- Integer comparison key for Version (avoids full `_key` tuple construction)
- Bisect-based `filter_versions` for O(log n + k) batch filtering
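The hash-caching pattern, sketched with an illustrative `_key` field (the vendored classes carry more state than this):

```python
class Version:
    __slots__ = ("_key", "_hash")

    def __init__(self, release):
        self._key = release     # e.g. (2, 31, 0)
        self._hash = None       # filled on first __hash__ call

    def __hash__(self):
        if self._hash is None:  # compute once, reuse forever
            self._hash = hash(self._key)
        return self._hash

    def __eq__(self, other):
        return isinstance(other, Version) and self._key == other._key
```

Set and dict operations during resolution hash the same `Version` objects repeatedly, so the cached value turns every repeat into an attribute read.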
### Import deferral (vendored Rich)
Same patterns as the Rich case study, but applied to pip's vendored copy:
- Deferred all Rich imports to first use
- Stripped unused Rich modules from the import chain
- Deferred heavy imports in `console.py`, `progress_bars.py`, `self_outdated_check.py`
### I/O
- Replaced pure-Python msgpack with stdlib JSON for HTTP cache serialization
- Increased connection pool size for parallel index fetches
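The serialization swap, sketched with an illustrative entry shape (JSON is text-only, so binary bodies need an encoding such as base64):

```python
import base64
import json

def dumps_entry(headers: dict, body: bytes) -> bytes:
    """Serialize an HTTP cache entry with stdlib json."""
    return json.dumps({
        "headers": headers,
        "body": base64.b64encode(body).decode("ascii"),
    }).encode("utf-8")

def loads_entry(blob: bytes):
    obj = json.loads(blob)
    return obj["headers"], base64.b64decode(obj["body"])
```

stdlib `json` is C-accelerated, so even with the base64 overhead it can beat a pure-Python msgpack decoder.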
## Results
| Benchmark | Before | After | Speedup |
|---|---|---|---|
| `pip --version` | 138ms | 20ms | **7.0x** |
| `flask+django+boto3+requests` resolve | 1,493ms | 826ms | **1.81x** |
| `install -r requirements.txt` (21 pkgs) | 1,344ms | 740ms | **1.82x** |
| `pip list` | 162ms | 146ms | **1.11x** |
| All benchmarks (sum) | 18,717ms | 15,223ms | **1.23x** |
## Bugs found along the way
Optimization work surfaced real bugs:
1. **`--report -` outputs invalid JSON** ([pypa/pip#13898](https://github.com/pypa/pip/issues/13898)) — Rich was mixing log output into stdout JSON
2. **Test failure on Python 3.15** ([pypa/pip#13901](https://github.com/pypa/pip/issues/13901)) — `importlib.metadata` behavior change
3. **`_stderr_console` typo in logging.py** — global never actually set (pre-existing, not fixed to keep diff focused)

**Key insight**: Deep performance work forces you to understand code paths that normal development doesn't touch. Bugs fall out naturally.
## Key takeaways
1. **Profile first, always**: The resolver was the bottleneck for real workloads, not startup — but startup was the most *visible* improvement to users
2. **Allocation counting reveals hidden work**: `Tag.__init__` called 45,301 times → 1,559 with caching (97% reduction). You can't see this in wall-clock profiling alone
3. **Caching needs the right granularity**: Per-resolution-step caches worked; global caches didn't (different resolution contexts)
4. **Vendored code is fair game**: pip's vendored `packaging` had the most micro-optimization opportunities because it's called thousands of times in tight loops
5. **Test suite is your safety net**: 1,690 unit tests + 15 functional tests caught every regression. Never skip this step
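Takeaway 2 in miniature: a call counter plus an interning cache makes the hidden constructor work visible. `Tag` here is a stand-in, not `packaging.tags.Tag`:

```python
from functools import lru_cache

constructions = 0

class Tag:  # stand-in for packaging.tags.Tag
    def __init__(self, interpreter, abi, platform):
        global constructions
        constructions += 1
        self.key = (interpreter, abi, platform)

@lru_cache(maxsize=None)
def intern_tag(interpreter, abi, platform):
    return Tag(interpreter, abi, platform)

for _ in range(1000):
    intern_tag("cp315", "cp315", "macosx_11_0_arm64")

# 1000 requests, one construction; the other 999 are cache hits.
```

Wall-clock profiles smear this cost across many call sites; counting constructions shows exactly how much work the cache removes.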
## Applicable to codeflash
- **Startup fast-path**: Does `codeflash --version` import the entire optimization engine? It shouldn't
- **Test generation loop**: If codeflash generates/runs many test variants, the same caching patterns apply (version parsing, specifier matching, etc.)
- **AST parsing**: If parsing the same files repeatedly, cache the AST
- **Benchmark harness**: subprocess overhead for running benchmarks is a known bottleneck — could the harness be more efficient?
- **Vendored/installed deps**: Which heavy deps does codeflash import at startup? Profile and defer
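The AST-caching bullet, as a sketch keyed on path and mtime so edits invalidate the entry (illustrative, not codeflash's internals):

```python
import ast
import os

_ast_cache = {}

def parse_cached(path: str) -> ast.Module:
    mtime = os.stat(path).st_mtime_ns
    entry = _ast_cache.get(path)
    if entry is not None and entry[0] == mtime:
        return entry[1]  # file unchanged: reuse the parsed tree
    with open(path, encoding="utf-8") as f:
        tree = ast.parse(f.read(), filename=path)
    _ast_cache[path] = (mtime, tree)
    return tree
```

Keying on `st_mtime_ns` keeps the cache correct across file edits without hashing file contents on every lookup.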