# pip Optimization — Lessons Learned
Full case study: [pip_org](https://github.com/KRRT7/pip_org)
## Context
pip is the default Python package installer. This case study covers 122 optimization commits across startup, dependency resolution, the packaging layer, import deferral, and the vendored Rich copy, benchmarked on Python 3.15.0a7 on macOS arm64.
## What we did (by impact)
### Startup (7x `--version`)
The single biggest visible win. `pip --version` went from 138ms to 20ms by:
1. Adding an ultra-fast path in `__main__.py` that reads the version and exits before importing `pip._internal`
2. Deferring `base_command.py` import chain to command creation time
3. Deferring autocompletion imports behind `PIP_AUTO_COMPLETE` check
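The fast-path idea can be sketched as follows; the version constant and the exact fallback import are illustrative, not pip's actual code:

```python
import sys

__version__ = "25.0"  # illustrative: pip reads this from pip/__init__.py

def main(argv=None):
    argv = sys.argv[1:] if argv is None else argv
    if argv == ["--version"]:
        # Ultra-fast path: answer and exit before any heavy import.
        print(f"pip {__version__} (fast path)")
        return 0
    # Slow path: only now pay for the full import chain.
    from pip._internal.cli.main import main as real_main  # deferred
    return real_main(argv)
```

The same check-before-import trick applies to autocompletion: the completion machinery is imported only when the `PIP_AUTO_COMPLETE` environment variable is present.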

**Key insight**: For simple commands like `--version`, the user shouldn't pay the cost of importing the entire tool.
### Resolver architecture (1.81x for complex resolves)
1. **Speculative metadata prefetch**: While the resolver processes package A, a background thread downloads PEP 658 metadata for the most likely next candidate. This overlaps I/O with CPU.
2. **Conditional Criterion rebuild**: `_remove_information_from_criteria` was rebuilding all criteria on every backtrack step — 95% of the time nothing changed. Added a check to skip unchanged criteria.
3. **`__slots__` on Criterion**: `Criterion` objects are created per package, per resolution step. `__slots__` saves roughly 100 bytes per instance, and across thousands of instances that adds up.
4. **Two-level candidate cache**: Specifier merge results + candidate infos cached across backtracking steps. The resolver re-evaluates the same packages many times during backtracking.
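A minimal sketch of the two-level idea, with illustrative types (the real cache sits inside pip's resolvelib provider and keys on real `SpecifierSet` objects):

```python
class CandidateCache:
    """Level 1 caches specifier merges; level 2 caches candidate
    lookups keyed by (name, merged specifier). Illustrative only."""

    __slots__ = ("_merged", "_candidates")  # same trick as Criterion

    def __init__(self):
        self._merged = {}
        self._candidates = {}

    def merge(self, specifiers):
        key = frozenset(specifiers)
        if key not in self._merged:
            # Stand-in for the real SpecifierSet merge.
            self._merged[key] = ",".join(sorted(specifiers))
        return self._merged[key]

    def find(self, name, specifiers, compute):
        key = (name, self.merge(specifiers))
        if key not in self._candidates:
            self._candidates[key] = compute(name, key[1])
        return self._candidates[key]
```

During backtracking the resolver asks for the same `(name, specifiers)` pairs over and over; both levels turn those repeats into dictionary hits.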
### Packaging layer (1.82x for `install -r`)
The vendored `packaging` library is called thousands of times during resolution:
- `Version.__hash__` cached in a slot (42K → 21K calls)
- `Specifier.__str__` and `__hash__` cached
- `_tokenizer` dataclass → `__slots__` class
- Integer comparison key for Version (avoids full `_key` tuple construction)
- Bisect-based `filter_versions` for O(log n + k) batch filtering
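The hash-caching pattern, sketched with an illustrative `_key` field (the vendored classes carry more state than this):

```python
class Version:
    __slots__ = ("_key", "_hash")

    def __init__(self, release):
        self._key = release     # e.g. (2, 31, 0)
        self._hash = None       # filled on first __hash__ call

    def __hash__(self):
        if self._hash is None:  # compute once, reuse forever
            self._hash = hash(self._key)
        return self._hash

    def __eq__(self, other):
        return isinstance(other, Version) and self._key == other._key
```

Set and dict operations during resolution hash the same `Version` objects repeatedly, so the cached value turns every repeat into an attribute read.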
### Import deferral (vendored Rich)
Same patterns as the Rich case study, but applied to pip's vendored copy:
- Deferred all Rich imports to first use
- Stripped unused Rich modules from the import chain
- Deferred heavy imports in `console.py`, `progress_bars.py`, `self_outdated_check.py`
### I/O
- Replaced pure-Python msgpack with stdlib JSON for HTTP cache serialization
- Increased connection pool size for parallel index fetches
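The serialization swap, sketched with an illustrative entry shape (JSON is text-only, so binary bodies need an encoding such as base64):

```python
import base64
import json

def dumps_entry(headers: dict, body: bytes) -> bytes:
    """Serialize an HTTP cache entry with stdlib json."""
    return json.dumps({
        "headers": headers,
        "body": base64.b64encode(body).decode("ascii"),
    }).encode("utf-8")

def loads_entry(blob: bytes):
    obj = json.loads(blob)
    return obj["headers"], base64.b64decode(obj["body"])
```

stdlib `json` is C-accelerated, so even with the base64 overhead it can beat a pure-Python msgpack decoder.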
## Results
| Benchmark | Before | After | Speedup |
|---|---|---|---|
| `pip --version` | 138ms | 20ms | **7.0x** |
| `flask+django+boto3+requests` resolve | 1,493ms | 826ms | **1.81x** |
| `install -r requirements.txt` (21 pkgs) | 1,344ms | 740ms | **1.82x** |
| `pip list` | 162ms | 146ms | **1.11x** |
| All benchmarks (sum) | 18,717ms | 15,223ms | **1.23x** |
## Bugs found along the way
Optimization work surfaced real bugs:
1. **`--report -` outputs invalid JSON** ([pypa/pip#13898](https://github.com/pypa/pip/issues/13898)) — Rich was mixing log output into stdout JSON
2. **Test failure on Python 3.15** ([pypa/pip#13901](https://github.com/pypa/pip/issues/13901)) — `importlib.metadata` behavior change
3. **`_stderr_console` typo in logging.py** — global never actually set (pre-existing, not fixed to keep diff focused)

**Key insight**: Deep performance work forces you to understand code paths that normal development doesn't touch. Bugs fall out naturally.
## Key takeaways
1. **Profile first, always**: The resolver was the bottleneck for real workloads, not startup — but startup was the most *visible* improvement to users
2. **Allocation counting reveals hidden work**: `Tag.__init__` called 45,301 times → 1,559 with caching (97% reduction). You can't see this in wall-clock profiling alone
3. **Caching needs the right granularity**: Per-resolution-step caches worked; global caches didn't (different resolution contexts)
4. **Vendored code is fair game**: pip's vendored `packaging` had the most micro-optimization opportunities because it's called thousands of times in tight loops
5. **Test suite is your safety net**: 1,690 unit tests + 15 functional tests caught every regression. Never skip this step
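Takeaway 2 in miniature: a call counter plus an interning cache makes the hidden constructor work visible. `Tag` here is a stand-in, not `packaging.tags.Tag`:

```python
from functools import lru_cache

constructions = 0

class Tag:  # stand-in for packaging.tags.Tag
    def __init__(self, interpreter, abi, platform):
        global constructions
        constructions += 1
        self.key = (interpreter, abi, platform)

@lru_cache(maxsize=None)
def intern_tag(interpreter, abi, platform):
    return Tag(interpreter, abi, platform)

for _ in range(1000):
    intern_tag("cp315", "cp315", "macosx_11_0_arm64")

# 1000 requests, one construction; the other 999 are cache hits.
```

Wall-clock profiles smear this cost across many call sites; counting constructions shows exactly how much work the cache removes.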
## Applicable to codeflash
- **Startup fast-path**: Does `codeflash --version` import the entire optimization engine? It shouldn't
- **Test generation loop**: If codeflash generates/runs many test variants, the same caching patterns apply (version parsing, specifier matching, etc.)
- **AST parsing**: If parsing the same files repeatedly, cache the AST
- **Benchmark harness**: subprocess overhead for running benchmarks is a known bottleneck — could the harness be more efficient?
- **Vendored/installed deps**: Which heavy deps does codeflash import at startup? Profile and defer
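The AST-caching bullet, as a sketch keyed on path and mtime so edits invalidate the entry (illustrative, not codeflash's internals):

```python
import ast
import os

_ast_cache = {}

def parse_cached(path: str) -> ast.Module:
    mtime = os.stat(path).st_mtime_ns
    entry = _ast_cache.get(path)
    if entry is not None and entry[0] == mtime:
        return entry[1]  # file unchanged: reuse the parsed tree
    with open(path, encoding="utf-8") as f:
        tree = ast.parse(f.read(), filename=path)
    _ast_cache[path] = (mtime, tree)
    return tree
```

Keying on `st_mtime_ns` keeps the cache correct across file edits without hashing file contents on every lookup.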