# pip Optimization — Lessons Learned

Full case study: pip_org

## Context

pip is the default Python package installer. This effort covered 122 optimization commits across startup, dependency resolution, packaging, import deferral, and the vendored Rich library. Benchmarked on Python 3.15.0a7, macOS arm64.
## What we did (by impact)

### Startup (7x for `--version`)

The single biggest visible win. `pip --version` went from 138ms to 20ms by:

- Adding an ultra-fast path in `__main__.py` that reads the version and exits before importing `pip._internal`
- Deferring the `base_command.py` import chain to command creation time
- Deferring autocompletion imports behind the `PIP_AUTO_COMPLETE` check

**Key insight**: For simple commands like `--version`, the user shouldn't pay the cost of importing the entire tool.
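A minimal sketch of such a fast path (illustrative only, not pip's actual `__main__.py`; the `main` entry point and the `--version`-only check are assumptions for the example):

```python
import sys


def main(argv=None):
    # Hypothetical fast path: answer --version from installed package
    # metadata without importing the heavy pip._internal package.
    argv = sys.argv[1:] if argv is None else argv
    if argv == ["--version"]:
        from importlib.metadata import version  # cheap stdlib lookup
        return f"pip {version('pip')}"
    # Slow path: only now pay for the full import chain.
    from pip._internal.cli.main import main as real_main
    return real_main(argv)
```

The point of the structure is that the heavy import sits inside the slow branch, so the interpreter never evaluates it for the trivial command.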
### Resolver architecture (1.81x for complex resolves)

- **Speculative metadata prefetch**: While the resolver processes package A, a background thread downloads PEP 658 metadata for the most likely next candidate. This overlaps I/O with CPU.
- **Conditional Criterion rebuild**: `_remove_information_from_criteria` was rebuilding all criteria on every backtrack step — 95% of the time nothing changed. Added a check to skip unchanged criteria.
- **`__slots__` on Criterion**: Criteria are created per package, per resolution step. With `__slots__`, each instance is roughly 100 bytes smaller; multiplied by thousands of instances, the savings are significant.
- **Two-level candidate cache**: Specifier-merge results and candidate infos are cached across backtracking steps, because the resolver re-evaluates the same packages many times during backtracking.
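The two-level cache can be sketched roughly like this (class and method names are illustrative, and the merge step is a placeholder, not the resolver's real specifier logic):

```python
class CandidateCache:
    """Illustrative two-level cache kept alive across backtracking steps."""

    def __init__(self):
        self._merged = {}      # level 1: specifier-merge results
        self._candidates = {}  # level 2: candidate infos per (name, merged specs)

    def merge(self, specifiers):
        # Merging is deterministic, so repeated merges of the same set are cached.
        key = frozenset(specifiers)
        if key not in self._merged:
            self._merged[key] = ",".join(sorted(specifiers))  # placeholder merge
        return self._merged[key]

    def candidates(self, name, specifiers, compute):
        # compute() runs only the first time this (name, specs) pair appears;
        # later backtracking steps that revisit the package hit the cache.
        key = (name, self.merge(specifiers))
        if key not in self._candidates:
            self._candidates[key] = compute(name, key[1])
        return self._candidates[key]
```

The granularity matters: the cache lives for one resolution, so entries never leak across unrelated resolution contexts.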
### Packaging layer (1.82x for `install -r`)

The vendored packaging library is called thousands of times during resolution:

- `Version.__hash__` cached in a slot (42K → 21K calls)
- `Specifier.__str__` and `__hash__` cached
- `_tokenizer` dataclass converted to a `__slots__` class
- Integer comparison key for `Version` (avoids full `_key` tuple construction)
- Bisect-based `filter_versions` for O(log n + k) batch filtering
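The slot-cached hash pattern looks like this (a minimal sketch, not the vendored `packaging` class; the `_key` tuple stands in for the real comparison key):

```python
class Version:
    """Sketch: cache __hash__ in a slot so repeated hashing is a field read."""

    __slots__ = ("_key", "_hash")

    def __init__(self, key):
        self._key = key    # comparison key, e.g. a tuple of release parts
        self._hash = None  # filled in lazily on first hash()

    def __eq__(self, other):
        return isinstance(other, Version) and self._key == other._key

    def __hash__(self):
        if self._hash is None:            # first call: compute and store
            self._hash = hash(self._key)
        return self._hash                 # later calls: no tuple hashing
```

Because `__slots__` already reserves the fields, the extra `_hash` slot costs one pointer per instance, far less than repeated tuple hashing in a resolver loop.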
### Import deferral (vendored Rich)

Same patterns as the Rich case study, but applied to pip's vendored copy:

- Deferred all Rich imports to first use
- Stripped unused Rich modules from the import chain
- Deferred heavy imports in `console.py`, `progress_bars.py`, and `self_outdated_check.py`
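The defer-to-first-use pattern, in its simplest form. This is an illustrative sketch: `io.StringIO` stands in for a genuinely heavy import such as the vendored `rich.console.Console`, and the `get_console` accessor is a hypothetical name.

```python
_console = None


def get_console():
    """Return a shared console object, importing it only on first call.

    The import cost is paid here, not at module import time, so commands
    that never produce rich output skip it entirely.
    """
    global _console
    if _console is None:
        from io import StringIO  # deferred import (stand-in for a heavy module)
        _console = StringIO()
    return _console
```

Repeated calls return the same cached object, so the deferral adds only one `None` check per call after the first.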
### I/O

- Replaced pure-Python msgpack with stdlib JSON for HTTP cache serialization
- Increased the connection pool size for parallel index fetches
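The serialization swap is straightforward when cache entries are plain dicts. A sketch with hypothetical helper names (the entry shape is an assumption for the example):

```python
import json


def dump_cache_entry(entry):
    """Serialize an HTTP cache entry with stdlib json.

    Compact separators keep the payload small, and stdlib json ships a C
    accelerator, unlike a pure-Python msgpack fallback.
    """
    return json.dumps(entry, separators=(",", ":")).encode("utf-8")


def load_cache_entry(data):
    return json.loads(data.decode("utf-8"))
```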
## Results

| Benchmark | Before | After | Speedup |
|---|---|---|---|
| `pip --version` | 138ms | 20ms | 7.0x |
| flask+django+boto3+requests resolve | 1,493ms | 826ms | 1.81x |
| `install -r requirements.txt` (21 pkgs) | 1,344ms | 740ms | 1.82x |
| `pip list` | 162ms | 146ms | 1.11x |
| All benchmarks (sum) | 18,717ms | 15,223ms | 1.23x |
## Bugs found along the way

Optimization work surfaced real bugs:

- `--report -` outputs invalid JSON (pypa/pip#13898) — Rich was mixing log output into the stdout JSON
- Test failure on Python 3.15 (pypa/pip#13901) — `importlib.metadata` behavior change
- `_stderr_console` typo in logging.py — the global was never actually set (pre-existing; not fixed, to keep the diff focused)

**Key insight**: Deep performance work forces you to understand code paths that normal development doesn't touch. Bugs fall out naturally.
## Key takeaways

- **Profile first, always**: The resolver was the bottleneck for real workloads, not startup — but startup was the most visible improvement to users
- **Allocation counting reveals hidden work**: `Tag.__init__` was called 45,301 times; with caching, 1,559 (a 97% reduction). You can't see this in wall-clock profiling alone
- **Caching needs the right granularity**: Per-resolution-step caches worked; global caches didn't (different resolution contexts)
- **Vendored code is fair game**: pip's vendored `packaging` had the most micro-optimization opportunities because it's called thousands of times in tight loops
- **Test suite is your safety net**: 1,690 unit tests + 15 functional tests caught every regression. Never skip this step
## Applicable to codeflash

- **Startup fast-path**: Does `codeflash --version` import the entire optimization engine? It shouldn't
- **Test generation loop**: If codeflash generates and runs many test variants, the same caching patterns apply (version parsing, specifier matching, etc.)
- **AST parsing**: If the same files are parsed repeatedly, cache the AST
- **Benchmark harness**: subprocess overhead for running benchmarks is a known bottleneck — could the harness be more efficient?
- **Vendored/installed deps**: Which heavy deps does codeflash import at startup? Profile and defer