# pip Optimization — Lessons Learned

Full case study: pip_org

## Context

pip is the default Python package installer. This effort covered 122 optimization commits across startup, dependency resolution, packaging, import deferral, and the vendored Rich library. Benchmarked on Python 3.15.0a7, macOS arm64.
## What we did (by impact)

### Startup (7x for `--version`)

The single biggest visible win. `pip --version` went from 138ms to 20ms by:

- Adding an ultra-fast path in `__main__.py` that reads the version and exits before importing `pip._internal`
- Deferring the `base_command.py` import chain to command creation time
- Deferring autocompletion imports behind the `PIP_AUTO_COMPLETE` check

**Key insight**: For simple commands like `--version`, the user shouldn't pay the cost of importing the entire tool.
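A minimal sketch of such a fast path (illustrative only, not pip's actual `__main__.py`; the `main` entry point and the `--version`-only check are assumptions for the example):

```python
import sys


def main(argv=None):
    # Hypothetical fast path: answer --version from installed package
    # metadata without importing the heavy pip._internal package.
    argv = sys.argv[1:] if argv is None else argv
    if argv == ["--version"]:
        from importlib.metadata import version  # cheap stdlib lookup
        return f"pip {version('pip')}"
    # Slow path: only now pay for the full import chain.
    from pip._internal.cli.main import main as real_main
    return real_main(argv)
```

The point of the structure is that the heavy import sits inside the slow branch, so the interpreter never evaluates it for the trivial command.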
### Resolver architecture (1.81x for complex resolves)

- **Speculative metadata prefetch**: While the resolver processes package A, a background thread downloads PEP 658 metadata for the most likely next candidate. This overlaps I/O with CPU.
- **Conditional Criterion rebuild**: `_remove_information_from_criteria` was rebuilding all criteria on every backtrack step — 95% of the time nothing changed. Added a check to skip unchanged criteria.
- **`__slots__` on Criterion**: Criteria are created per package, per resolution step. With `__slots__`, each instance is roughly 100 bytes smaller; multiplied by thousands of instances, the savings are significant.
- **Two-level candidate cache**: Specifier-merge results and candidate infos are cached across backtracking steps, because the resolver re-evaluates the same packages many times during backtracking.
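The two-level cache can be sketched roughly like this (class and method names are illustrative, and the merge step is a placeholder, not the resolver's real specifier logic):

```python
class CandidateCache:
    """Illustrative two-level cache kept alive across backtracking steps."""

    def __init__(self):
        self._merged = {}      # level 1: specifier-merge results
        self._candidates = {}  # level 2: candidate infos per (name, merged specs)

    def merge(self, specifiers):
        # Merging is deterministic, so repeated merges of the same set are cached.
        key = frozenset(specifiers)
        if key not in self._merged:
            self._merged[key] = ",".join(sorted(specifiers))  # placeholder merge
        return self._merged[key]

    def candidates(self, name, specifiers, compute):
        # compute() runs only the first time this (name, specs) pair appears;
        # later backtracking steps that revisit the package hit the cache.
        key = (name, self.merge(specifiers))
        if key not in self._candidates:
            self._candidates[key] = compute(name, key[1])
        return self._candidates[key]
```

The granularity matters: the cache lives for one resolution, so entries never leak across unrelated resolution contexts.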
### Packaging layer (1.82x for `install -r`)

The vendored packaging library is called thousands of times during resolution:

- `Version.__hash__` cached in a slot (42K → 21K calls)
- `Specifier.__str__` and `__hash__` cached
- `_tokenizer` dataclass converted to a `__slots__` class
- Integer comparison key for `Version` (avoids full `_key` tuple construction)
- Bisect-based `filter_versions` for O(log n + k) batch filtering
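The slot-cached hash pattern looks like this (a minimal sketch, not the vendored `packaging` class; the `_key` tuple stands in for the real comparison key):

```python
class Version:
    """Sketch: cache __hash__ in a slot so repeated hashing is a field read."""

    __slots__ = ("_key", "_hash")

    def __init__(self, key):
        self._key = key    # comparison key, e.g. a tuple of release parts
        self._hash = None  # filled in lazily on first hash()

    def __eq__(self, other):
        return isinstance(other, Version) and self._key == other._key

    def __hash__(self):
        if self._hash is None:            # first call: compute and store
            self._hash = hash(self._key)
        return self._hash                 # later calls: no tuple hashing
```

Because `__slots__` already reserves the fields, the extra `_hash` slot costs one pointer per instance, far less than repeated tuple hashing in a resolver loop.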
### Import deferral (vendored Rich)

Same patterns as the Rich case study, but applied to pip's vendored copy:

- Deferred all Rich imports to first use
- Stripped unused Rich modules from the import chain
- Deferred heavy imports in `console.py`, `progress_bars.py`, and `self_outdated_check.py`
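The defer-to-first-use pattern, in its simplest form. This is an illustrative sketch: `io.StringIO` stands in for a genuinely heavy import such as the vendored `rich.console.Console`, and the `get_console` accessor is a hypothetical name.

```python
_console = None


def get_console():
    """Return a shared console object, importing it only on first call.

    The import cost is paid here, not at module import time, so commands
    that never produce rich output skip it entirely.
    """
    global _console
    if _console is None:
        from io import StringIO  # deferred import (stand-in for a heavy module)
        _console = StringIO()
    return _console
```

Repeated calls return the same cached object, so the deferral adds only one `None` check per call after the first.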
### I/O

- Replaced pure-Python msgpack with stdlib JSON for HTTP cache serialization
- Increased the connection pool size for parallel index fetches
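The serialization swap is straightforward when cache entries are plain dicts. A sketch with hypothetical helper names (the entry shape is an assumption for the example):

```python
import json


def dump_cache_entry(entry):
    """Serialize an HTTP cache entry with stdlib json.

    Compact separators keep the payload small, and stdlib json ships a C
    accelerator, unlike a pure-Python msgpack fallback.
    """
    return json.dumps(entry, separators=(",", ":")).encode("utf-8")


def load_cache_entry(data):
    return json.loads(data.decode("utf-8"))
```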
## Results

| Benchmark | Before | After | Speedup |
|---|---|---|---|
| `pip --version` | 138ms | 20ms | 7.0x |
| flask+django+boto3+requests resolve | 1,493ms | 826ms | 1.81x |
| `install -r requirements.txt` (21 pkgs) | 1,344ms | 740ms | 1.82x |
| `pip list` | 162ms | 146ms | 1.11x |
| All benchmarks (sum) | 18,717ms | 15,223ms | 1.23x |
## Bugs found along the way

Optimization work surfaced real bugs:

- `--report -` outputs invalid JSON (pypa/pip#13898) — Rich was mixing log output into the stdout JSON
- Test failure on Python 3.15 (pypa/pip#13901) — `importlib.metadata` behavior change
- `_stderr_console` typo in logging.py — the global was never actually set (pre-existing; not fixed, to keep the diff focused)

**Key insight**: Deep performance work forces you to understand code paths that normal development doesn't touch. Bugs fall out naturally.
## Key takeaways

- **Profile first, always**: The resolver was the bottleneck for real workloads, not startup — but startup was the most visible improvement to users
- **Allocation counting reveals hidden work**: `Tag.__init__` was called 45,301 times; with caching, 1,559 (a 97% reduction). You can't see this in wall-clock profiling alone
- **Caching needs the right granularity**: Per-resolution-step caches worked; global caches didn't (different resolution contexts)
- **Vendored code is fair game**: pip's vendored `packaging` had the most micro-optimization opportunities because it's called thousands of times in tight loops
- **Test suite is your safety net**: 1,690 unit tests + 15 functional tests caught every regression. Never skip this step
## Applicable to codeflash

- **Startup fast-path**: Does `codeflash --version` import the entire optimization engine? It shouldn't
- **Test generation loop**: If codeflash generates and runs many test variants, the same caching patterns apply (version parsing, specifier matching, etc.)
- **AST parsing**: If the same files are parsed repeatedly, cache the AST
- **Benchmark harness**: subprocess overhead for running benchmarks is a known bottleneck — could the harness be more efficient?
- **Vendored/installed deps**: Which heavy deps does codeflash import at startup? Profile and defer