codeflash-agent/case-studies/python/pip/summary.md
Kevin Turcios 1734199e85
Add metaflow and core-product case studies, rename pypa to python (#18)
- Rename case-studies/pypa/ → case-studies/python/ to match .codeflash/ convention
- Add case-studies/netflix/metaflow/summary.md (7-18x lz4 vs gzip)
- Add case-studies/unstructured/core-product/summary.md (14.6% latency, 2.1 GB memory)
- Update main README results table with all five case studies
2026-04-14 23:31:49 -05:00


# pip Optimization — Lessons Learned
Full case study: [pip_org](https://github.com/KRRT7/pip_org)
## Context
pip is the default Python package installer. This effort landed 122 optimization commits across startup, dependency resolution, the packaging layer, import deferral, and pip's vendored copy of Rich. All benchmarks were run on Python 3.15.0a7 on macOS arm64.
## What we did (by impact)
### Startup (7x `--version`)
The single biggest visible win. `pip --version` went from 138ms to 20ms by:
1. Adding an ultra-fast path in `__main__.py` that reads the version and exits before importing `pip._internal`
2. Deferring `base_command.py` import chain to command creation time
3. Deferring autocompletion imports behind `PIP_AUTO_COMPLETE` check
**Key insight**: For simple commands like `--version`, the user shouldn't pay the cost of importing the entire tool.
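The fast-path idea can be sketched in a few lines. This is a minimal illustration, not pip's actual `__main__.py`; the `mytool` names and the heavy `_internal.cli` import are hypothetical stand-ins.

```python
import sys

__version__ = "25.0"  # hypothetical tool version


def main(argv=None):
    """Answer trivial queries before importing any heavy internals."""
    argv = sys.argv[1:] if argv is None else argv
    if argv == ["--version"]:
        # Fast path: print the version without touching the package internals
        print(f"mytool {__version__}")
        return 0
    # Only non-trivial commands pay for the full import chain
    from mytool._internal.cli import run  # hypothetical heavy import
    return run(argv)
```

The point is ordering: the version check happens before any expensive module-level import can run.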
### Resolver architecture (1.81x for complex resolves)
1. **Speculative metadata prefetch**: While the resolver processes package A, a background thread downloads PEP 658 metadata for the most likely next candidate. This overlaps I/O with CPU.
2. **Conditional Criterion rebuild**: `_remove_information_from_criteria` was rebuilding all criteria on every backtrack step — 95% of the time nothing changed. Added a check to skip unchanged criteria.
3. **`__slots__` on Criterion**: Criterion objects are created per package, per resolution step. Adding `__slots__` saves roughly 100 bytes per instance; multiplied across thousands of instances during a backtracking resolve, the savings are significant.
4. **Two-level candidate cache**: Specifier merge results + candidate infos cached across backtracking steps. The resolver re-evaluates the same packages many times during backtracking.
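The conditional-rebuild idea (item 2) can be sketched as follows. This is a simplified illustration of the pattern, not pip's resolvelib internals; the data shapes and names are ours.

```python
def remove_information(criteria, failed_parents):
    """Rebuild only the criteria that reference a failed parent.

    criteria maps a package name to a list of (requirement, parent)
    pairs. Most backtrack steps touch only a few criteria, so the
    untouched ones are reused as-is instead of being rebuilt.
    """
    failed = set(failed_parents)
    new_criteria = {}
    for name, pairs in criteria.items():
        if not any(parent in failed for _, parent in pairs):
            new_criteria[name] = pairs  # unchanged: skip the rebuild
        else:
            new_criteria[name] = [
                (req, parent) for req, parent in pairs if parent not in failed
            ]
    return new_criteria
```

Reusing the untouched lists (rather than copying them) is what makes the 95% no-change case essentially free.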
### Packaging layer (1.82x for `install -r`)
The vendored `packaging` library is called thousands of times during resolution:
- `Version.__hash__` result cached in a slot (42K → 21K calls)
- `Specifier.__str__` and `__hash__` cached
- `_tokenizer` dataclass → `__slots__` class
- Integer comparison key for Version (avoids full `_key` tuple construction)
- Bisect-based `filter_versions` for O(log n + k) batch filtering
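The bisect-based filtering works because candidate versions can be kept sorted. A minimal sketch of the idea, handling only the common `>=lower,<upper` range (real specifiers support many more operators):

```python
import bisect


def filter_versions(sorted_versions, lower, upper):
    """Batch-filter a sorted version list against the half-open range
    [lower, upper) in O(log n + k), instead of testing every version
    against the specifier individually.
    """
    lo = bisect.bisect_left(sorted_versions, lower)
    hi = bisect.bisect_left(sorted_versions, upper)
    return sorted_versions[lo:hi]
```

Two binary searches replace n per-version specifier checks, which matters when the same index page is filtered repeatedly during backtracking.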
### Import deferral (vendored Rich)
Same patterns as the Rich case study, but applied to pip's vendored copy:
- Deferred all Rich imports to first use
- Stripped unused Rich modules from the import chain
- Deferred heavy imports in `console.py`, `progress_bars.py`, `self_outdated_check.py`
### I/O
- Replaced pure-Python msgpack with stdlib JSON for HTTP cache serialization
- Increased connection pool size for parallel index fetches
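Swapping pure-Python msgpack for the C-accelerated stdlib `json` module requires handling binary bodies, since JSON has no bytes type. A sketch of the round-trip (field names are illustrative, not pip's cache schema):

```python
import base64
import json


def dump_cache_entry(status, headers, body):
    """Serialize an HTTP cache entry with stdlib json; the bytes body
    is base64-encoded because JSON cannot carry raw binary data."""
    return json.dumps({
        "status": status,
        "headers": headers,
        "body": base64.b64encode(body).decode("ascii"),
    })


def load_cache_entry(blob):
    entry = json.loads(blob)
    entry["body"] = base64.b64decode(entry["body"])
    return entry
```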
## Results
| Benchmark | Before | After | Speedup |
|---|---|---|---|
| `pip --version` | 138ms | 20ms | **7.0x** |
| `flask+django+boto3+requests` resolve | 1,493ms | 826ms | **1.81x** |
| `install -r requirements.txt` (21 pkgs) | 1,344ms | 740ms | **1.82x** |
| `pip list` | 162ms | 146ms | **1.11x** |
| All benchmarks (sum) | 18,717ms | 15,223ms | **1.23x** |
## Bugs found along the way
Optimization work surfaced real bugs:
1. **`--report -` outputs invalid JSON** ([pypa/pip#13898](https://github.com/pypa/pip/issues/13898)) — Rich was mixing log output into stdout JSON
2. **Test failure on Python 3.15** ([pypa/pip#13901](https://github.com/pypa/pip/issues/13901)) — `importlib.metadata` behavior change
3. **`_stderr_console` typo in logging.py** — global never actually set (pre-existing, not fixed to keep diff focused)
**Key insight**: Deep performance work forces you to understand code paths that normal development doesn't touch. Bugs fall out naturally.
## Key takeaways
1. **Profile first, always**: The resolver was the bottleneck for real workloads, not startup — but startup was the most *visible* improvement to users
2. **Allocation counting reveals hidden work**: `Tag.__init__` called 45,301 times → 1,559 with caching (97% reduction). You can't see this in wall-clock profiling alone
3. **Caching needs the right granularity**: Per-resolution-step caches worked; global caches didn't (different resolution contexts)
4. **Vendored code is fair game**: pip's vendored `packaging` had the most micro-optimization opportunities because it's called thousands of times in tight loops
5. **Test suite is your safety net**: 1,690 unit tests + 15 functional tests caught every regression. Never skip this step
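The `Tag` caching in takeaway 2 follows a generic pattern: route construction through a memoized factory, and use the cache statistics to see the hidden work. A toy sketch (this `Tag` is a stand-in, not `packaging.tags.Tag`):

```python
from functools import lru_cache


class Tag:
    """Toy stand-in for packaging's Tag: immutable and re-created
    with identical arguments thousands of times per resolve."""
    __slots__ = ("interpreter", "abi", "platform")

    def __init__(self, interpreter, abi, platform):
        self.interpreter = interpreter
        self.abi = abi
        self.platform = platform


@lru_cache(maxsize=None)
def make_tag(interpreter, abi, platform):
    # Identical tags are requested repeatedly; build each combination once.
    return Tag(interpreter, abi, platform)
```

After a run, `make_tag.cache_info()` gives the hits/misses split directly, which is the same signal the 45,301 → 1,559 call count exposed.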
## Applicable to codeflash
- **Startup fast-path**: Does `codeflash --version` import the entire optimization engine? It shouldn't
- **Test generation loop**: If codeflash generates/runs many test variants, the same caching patterns apply (version parsing, specifier matching, etc.)
- **AST parsing**: If parsing the same files repeatedly, cache the AST
- **Benchmark harness**: subprocess overhead for running benchmarks is a known bottleneck — could the harness be more efficient?
- **Vendored/installed deps**: Which heavy deps does codeflash import at startup? Profile and defer
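The AST-caching suggestion can be sketched by keying the cache on path plus modification time, so edits invalidate stale entries automatically. The helper names here are ours, not codeflash's API.

```python
import ast
import os
from functools import lru_cache


@lru_cache(maxsize=128)
def _parse(path, mtime_ns):
    # mtime_ns participates in the cache key, so a modified file
    # produces a cache miss and gets re-parsed.
    with open(path, "r", encoding="utf-8") as f:
        return ast.parse(f.read(), filename=path)


def parse_cached(path):
    """Parse a Python file, reusing the AST while the file is unchanged."""
    return _parse(path, os.stat(path).st_mtime_ns)
```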