pip Optimization — Lessons Learned

Full case study: pip_org

Context

pip is the default Python package installer. This case study covers 122 optimization commits across startup, dependency resolution, the packaging layer, import deferral, and the vendored Rich copy. All numbers were benchmarked on Python 3.15.0a7, macOS arm64.

What we did (by impact)

Startup (7x --version)

The single biggest visible win. pip --version went from 138ms to 20ms by:

  1. Adding an ultra-fast path in __main__.py that reads the version and exits before importing pip._internal (sketched after this list)
  2. Deferring base_command.py import chain to command creation time
  3. Deferring autocompletion imports behind PIP_AUTO_COMPLETE check
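
A minimal sketch of the fast path from item 1, with simplified output; pip's real logic is more defensive, but the shape is the same:

```python
# pip/__main__.py (sketch): handle `pip --version` before any heavy
# imports. Only an exact `--version` invocation takes the fast path;
# everything else falls through to the normal CLI.
import sys

if __name__ == "__main__":
    if sys.argv[1:] == ["--version"]:
        # Importing `pip` only executes pip/__init__.py, which is cheap;
        # the expensive import chain lives under pip._internal.
        from pip import __version__

        print(f"pip {__version__}")  # real pip also prints its install path
        sys.exit(0)

    # Slow path: the full CLI and its import chain.
    from pip._internal.cli.main import main

    sys.exit(main())
```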

Key insight: For simple commands like --version, the user shouldn't pay the cost of importing the entire tool.

Resolver architecture (1.81x for complex resolves)

  1. Speculative metadata prefetch: While the resolver processes package A, a background thread downloads PEP 658 metadata for the most likely next candidate. This overlaps network I/O with CPU-bound resolution work (sketched after this list).

  2. Conditional Criterion rebuild: _remove_information_from_criteria was rebuilding all criteria on every backtrack step — 95% of the time nothing changed. Added a check to skip unchanged criteria.

  3. __slots__ on Criterion: Criterion objects are created per package, per resolution step. With __slots__, each instance is roughly 100 bytes smaller, and across thousands of instances that adds up.

  4. Two-level candidate cache: Specifier merge results + candidate infos cached across backtracking steps. The resolver re-evaluates the same packages many times during backtracking.
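
To make the prefetch from item 1 concrete, here is a minimal sketch; fetch_pep658_metadata and guess_next_candidate are hypothetical stand-ins for the real provider hooks, which are more involved:

```python
# Sketch of speculative metadata prefetch: while the resolver does
# CPU-bound work on one candidate, a background thread downloads the
# metadata it will most likely ask for next.
from concurrent.futures import Future, ThreadPoolExecutor

def fetch_pep658_metadata(candidate):
    """Hypothetical: download the PEP 658 .metadata file for `candidate`."""
    ...

def guess_next_candidate(candidate):
    """Hypothetical heuristic for the resolver's likely next request."""
    ...

class PrefetchingMetadataProvider:
    def __init__(self):
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._inflight: dict[object, Future] = {}

    def start_prefetch(self, current_candidate):
        # Called when the resolver begins processing `current_candidate`.
        nxt = guess_next_candidate(current_candidate)
        if nxt is not None and nxt not in self._inflight:
            self._inflight[nxt] = self._executor.submit(
                fetch_pep658_metadata, nxt
            )

    def get_metadata(self, candidate):
        # If the guess was right, the download already overlapped with
        # CPU work; otherwise fall back to a synchronous fetch.
        future = self._inflight.pop(candidate, None)
        if future is not None:
            return future.result()
        return fetch_pep658_metadata(candidate)
```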

Packaging layer (1.82x for install -r)

The vendored packaging library is called thousands of times during resolution:

  • Version.__hash__ cached in slot (42K → 21K calls)
  • Specifier.__str__ and __hash__ cached
  • _tokenizer dataclass → __slots__ class
  • Integer comparison key for Version (avoids full _key tuple construction)
  • Bisect-based filter_versions for O(log n + k) batch filtering (idea sketched after this list)
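
The bisect idea, sketched under two assumptions: the candidate versions are kept sorted, and the specifier reduces to a single contiguous range (the general case intersects several ranges). The standalone packaging distribution stands in here for pip's vendored copy so the example runs as-is:

```python
# Sketch of bisect-based batch filtering. For a sorted version list and
# a half-open range [low, high), two binary searches replace n full
# specifier checks: O(log n + k), where k is the number of matches.
from bisect import bisect_left
from packaging.version import Version

def filter_range(versions, low, high):
    lo = bisect_left(versions, low)   # first index with version >= low
    hi = bisect_left(versions, high)  # first index with version >= high
    return versions[lo:hi]

versions = sorted(Version(v) for v in ["1.0", "1.1", "1.2", "1.5", "2.0", "2.1"])
print(filter_range(versions, Version("1.2"), Version("2.0")))
# [<Version('1.2')>, <Version('1.5')>]
```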

Import deferral (vendored Rich)

Same patterns as the Rich case study, but applied to pip's vendored copy:

  • Deferred all Rich imports to first use
  • Stripped unused Rich modules from the import chain
  • Deferred heavy imports in console.py, progress_bars.py, self_outdated_check.py
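
One common deferral pattern is PEP 562's module-level __getattr__, which resolves heavy names on first access. A sketch with illustrative names, not pip's exact code:

```python
# Sketch: lazy attribute resolution via PEP 562. The heavy module is
# imported the first time someone touches the attribute, not when this
# module itself is imported.
import importlib

_LAZY_ATTRS = {
    "Console": "pip._vendor.rich.console",
    "Progress": "pip._vendor.rich.progress",
}

def __getattr__(name):
    target = _LAZY_ATTRS.get(name)
    if target is None:
        raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
    module = importlib.import_module(target)
    value = getattr(module, name)
    globals()[name] = value  # cache so later lookups skip __getattr__
    return value
```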

I/O

  • Replaced pure-Python msgpack with stdlib JSON for HTTP cache serialization
  • Increased connection pool size for parallel index fetches
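
For the pool-size change, the knob in a requests-based client looks like this (the constants are illustrative, not pip's actual values):

```python
# Sketch: widen urllib3's per-host connection pool so parallel index
# fetches don't queue behind each other.
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=20)
session.mount("https://", adapter)
session.mount("http://", adapter)
```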

Results

| Benchmark | Before | After | Speedup |
| --- | --- | --- | --- |
| pip --version | 138ms | 20ms | 7.0x |
| flask+django+boto3+requests resolve | 1,493ms | 826ms | 1.81x |
| install -r requirements.txt (21 pkgs) | 1,344ms | 740ms | 1.82x |
| pip list | 162ms | 146ms | 1.11x |
| All benchmarks (sum) | 18,717ms | 15,223ms | 1.23x |

Bugs found along the way

Optimization work surfaced real bugs:

  1. --report - outputs invalid JSON (pypa/pip#13898) — Rich was mixing log output into stdout JSON
  2. Test failure on Python 3.15 (pypa/pip#13901) — importlib.metadata behavior change
  3. _stderr_console typo in logging.py — global never actually set (pre-existing, not fixed to keep diff focused)

Key insight: Deep performance work forces you to understand code paths that normal development doesn't touch. Bugs fall out naturally.

Key takeaways

  1. Profile first, always: The resolver was the bottleneck for real workloads, not startup — but startup was the most visible improvement to users
  2. Allocation counting reveals hidden work: Tag.__init__ called 45,301 times → 1,559 with caching (97% reduction). You can't see this in wall-clock profiling alone (see the profiling sketch after this list)
  3. Caching needs the right granularity: Per-resolution-step caches worked; global caches didn't (different resolution contexts)
  4. Vendored code is fair game: pip's vendored packaging had the most micro-optimization opportunities because it's called thousands of times in tight loops
  5. Test suite is your safety net: 1,690 unit tests + 15 functional tests caught every regression. Never skip this step
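
Takeaway 2 is easiest to act on with a deterministic profiler: cProfile's ncalls column shows call volume that sampling (wall-clock) profilers hide. A sketch, with resolve() standing in for whatever workload is under investigation:

```python
# Sketch: count calls, not just time. `resolve` is a hypothetical
# workload; the ncalls column shows how often each function ran.
import cProfile
import pstats

def resolve():
    ...  # hypothetical workload under investigation

cProfile.run("resolve()", "resolve.prof")
stats = pstats.Stats("resolve.prof")
stats.sort_stats("ncalls").print_stats(20)  # top 20 by call count
```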

Applicable to codeflash

  • Startup fast-path: Does codeflash --version import the entire optimization engine? It shouldn't
  • Test generation loop: If codeflash generates/runs many test variants, the same caching patterns apply (version parsing, specifier matching, etc.)
  • AST parsing: If parsing the same files repeatedly, cache the AST
  • Benchmark harness: subprocess overhead for running benchmarks is a known bottleneck — could the harness be more efficient?
  • Vendored/installed deps: Which heavy deps does codeflash import at startup? Profile and defer