# pip Optimization — Lessons Learned

Full case study: [pip_org](https://github.com/KRRT7/pip_org)

## Context

pip is the default Python package installer. This case study covers 122 optimization commits across startup, dependency resolution, packaging, import deferral, and vendored Rich. Benchmarked on Python 3.15.0a7, macOS arm64.

## What we did (by impact)

### Startup (7x `--version`)

The single biggest visible win. `pip --version` went from 138ms to 20ms by:

1. Adding an ultra-fast path in `__main__.py` that reads the version and exits before importing `pip._internal`
2. Deferring the `base_command.py` import chain to command creation time
3. Deferring autocompletion imports behind a `PIP_AUTO_COMPLETE` check

**Key insight**: For simple commands like `--version`, the user shouldn't pay the cost of importing the entire tool.

### Resolver architecture (1.81x for complex resolves)

1. **Speculative metadata prefetch**: While the resolver processes package A, a background thread downloads PEP 658 metadata for the most likely next candidate, overlapping I/O with CPU work.
2. **Conditional Criterion rebuild**: `_remove_information_from_criteria` was rebuilding all criteria on every backtrack step, even though 95% of the time nothing had changed. Added a check to skip unchanged criteria.
3. **`__slots__` on Criterion**: Criterion instances are created per package, per resolution step. With `__slots__`, roughly 100 bytes saved per instance × thousands of instances is a significant saving.
4. **Two-level candidate cache**: Specifier merge results and candidate infos are cached across backtracking steps, since the resolver re-evaluates the same packages many times while backtracking.
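The `__slots__` change can be illustrated with a small sketch. The class and attribute names below are illustrative, not pip's actual `Criterion`; the point is that a slotted instance skips the per-instance `__dict__` entirely:

```python
import sys

class CriterionDict:
    """Plain class: every instance carries a resizable __dict__."""
    def __init__(self, information, incompatibilities, candidates):
        self.information = information
        self.incompatibilities = incompatibilities
        self.candidates = candidates

class CriterionSlots:
    """Slotted class: attributes live in fixed slots, no per-instance __dict__."""
    __slots__ = ("information", "incompatibilities", "candidates")
    def __init__(self, information, incompatibilities, candidates):
        self.information = information
        self.incompatibilities = incompatibilities
        self.candidates = candidates

plain = CriterionDict([], [], [])
slotted = CriterionSlots([], [], [])

# The slotted instance avoids the __dict__ allocation entirely; multiplied
# by thousands of per-step instances, this is the saving described above.
plain_size = sys.getsizeof(plain) + sys.getsizeof(plain.__dict__)
slotted_size = sys.getsizeof(slotted)
print(plain_size, slotted_size)
```

On CPython the slotted footprint is consistently smaller; exact numbers vary by Python version.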
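The speculative-prefetch pattern can be sketched with a single worker thread. Here `fetch` and `process` are stand-ins for the metadata download and the resolver's CPU work, not pip's actual API; the structure shows how the next candidate's I/O overlaps with the current candidate's processing:

```python
from concurrent.futures import ThreadPoolExecutor

def resolve(candidates, fetch, process):
    """Process candidates in order while prefetching the next one's metadata.

    While `process` runs for the current package, the next candidate's
    `fetch` is already executing in the worker thread.
    """
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch, candidates[0])
        for i, name in enumerate(candidates):
            meta = future.result()  # wait for this package's metadata
            if i + 1 < len(candidates):
                # Speculative prefetch of the most likely next candidate.
                future = pool.submit(fetch, candidates[i + 1])
            results.append(process(name, meta))
    return results

# Demo with trivial stand-in functions.
out = resolve(["a", "b", "c"], lambda n: f"meta:{n}", lambda n, m: (n, m))
print(out)
```

In the real resolver the prefetch target is a heuristic guess; if the guess is wrong, the work is wasted but correctness is unaffected, since results are always awaited before use.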
### Packaging layer (1.82x for `install -r`)

The vendored `packaging` library is called thousands of times during resolution:

- `Version.__hash__` cached in a slot (42K → 21K calls)
- `Specifier.__str__` and `__hash__` cached
- `_tokenizer` dataclass converted to a `__slots__` class
- Integer comparison key for `Version` (avoids constructing the full `_key` tuple)
- Bisect-based `filter_versions` for O(log n + k) batch filtering

### Import deferral (vendored Rich)

Same patterns as the Rich case study, but applied to pip's vendored copy:

- Deferred all Rich imports to first use
- Stripped unused Rich modules from the import chain
- Deferred heavy imports in `console.py`, `progress_bars.py`, and `self_outdated_check.py`

### I/O

- Replaced pure-Python msgpack with stdlib JSON for HTTP cache serialization
- Increased the connection pool size for parallel index fetches

## Results

| Benchmark | Before | After | Speedup |
|---|---|---|---|
| `pip --version` | 138ms | 20ms | **7.0x** |
| `flask+django+boto3+requests` resolve | 1,493ms | 826ms | **1.81x** |
| `install -r requirements.txt` (21 pkgs) | 1,344ms | 740ms | **1.82x** |
| `pip list` | 162ms | 146ms | **1.11x** |
| All benchmarks (sum) | 18,717ms | 15,223ms | **1.23x** |

## Bugs found along the way

Optimization work surfaced real bugs:

1. **`--report -` outputs invalid JSON** ([pypa/pip#13898](https://github.com/pypa/pip/issues/13898)): Rich was mixing log output into the JSON written to stdout
2. **Test failure on Python 3.15** ([pypa/pip#13901](https://github.com/pypa/pip/issues/13901)): caused by an `importlib.metadata` behavior change
3. **`_stderr_console` typo in `logging.py`**: the global was never actually set (pre-existing; left unfixed to keep the diff focused)

**Key insight**: Deep performance work forces you to understand code paths that normal development doesn't touch. Bugs fall out naturally.

## Key takeaways

1. **Profile first, always**: The resolver, not startup, was the bottleneck for real workloads — but startup was the most *visible* improvement to users
2. **Allocation counting reveals hidden work**: `Tag.__init__` went from 45,301 calls to 1,559 with caching (a 97% reduction). You can't see this in wall-clock profiling alone
3. **Caching needs the right granularity**: Per-resolution-step caches worked; global caches didn't, because different resolution contexts produce different answers
4. **Vendored code is fair game**: pip's vendored `packaging` had the most micro-optimization opportunities because it is called thousands of times in tight loops
5. **Test suite is your safety net**: 1,690 unit tests + 15 functional tests caught every regression. Never skip this step

## Applicable to codeflash

- **Startup fast-path**: Does `codeflash --version` import the entire optimization engine? It shouldn't
- **Test generation loop**: If codeflash generates and runs many test variants, the same caching patterns apply (version parsing, specifier matching, etc.)
- **AST parsing**: If the same files are parsed repeatedly, cache the AST
- **Benchmark harness**: subprocess overhead for running benchmarks is a known bottleneck; could the harness be more efficient?
- **Vendored/installed deps**: Which heavy deps does codeflash import at startup? Profile and defer
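The startup fast-path pattern described earlier generalizes to any CLI: answer trivial queries before paying for the heavy import chain. A minimal sketch, with illustrative names and a hard-coded version standing in for real package metadata:

```python
TOOL_VERSION = "1.0"  # a real tool would read this from package metadata

def main(argv):
    # Fast path: handle --version before importing any heavy machinery.
    if argv == ["--version"]:
        print(TOOL_VERSION)
        return 0
    # Slow path: only real commands pay for the full import chain.
    from argparse import ArgumentParser  # stand-in for the heavy CLI imports
    parser = ArgumentParser(prog="tool")
    parser.add_argument("command")
    args = parser.parse_args(argv)
    print(f"running {args.command}")
    return 0

main(["--version"])  # never touches the slow path
```

The key design point is that the fast path's only cost is the interpreter startup plus a string comparison; everything expensive lives behind a local import that runs only when a real command is executed.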