# pip Performance Optimization
End-to-end performance optimization of [pip](https://github.com/pypa/pip), the Python package installer. 122 commits across startup, dependency resolution, packaging, import deferral, and vendored Rich.
## Results
**Environment**: Python 3.15.0a7, macOS arm64 (Apple Silicon), ~27 packages installed, HTTP cache warm, hyperfine (5-10 runs, 2-3 warmup runs)
### Startup
| Benchmark | main | optimized | Speedup |
|---|---:|---:|---:|
| `pip --version` | 138ms | **20ms** | **7.0x** |
| `pip --help` | 143ms | **121ms** | **1.18x** |
### Dependency Resolution
| Benchmark | main | optimized | Speedup |
|---|---:|---:|---:|
| `requests` (~5 deps) | 589ms | **516ms** | **1.14x** |
| `flask + django` (~15 deps) | 708ms | **599ms** | **1.18x** |
| `flask + django + boto3 + requests` (~30 deps) | 1,493ms | **826ms** | **1.81x** |
| `fastapi[standard]` (~42 deps) | 13,325ms | **11,664ms** | **1.14x** |
### Package Operations
| Benchmark | main | optimized | Speedup |
|---|---:|---:|---:|
| `pip list` | 162ms | **146ms** | **1.11x** |
| `pip freeze` | 225ms | **211ms** | **1.07x** |
| `pip show pip` | 162ms | **148ms** | **1.09x** |
| `install -r requirements.txt` (21 pkgs) | 1,344ms | **740ms** | **1.82x** |
### Totals
| | main | optimized | Speedup |
|-|---:|---:|---:|
| **All benchmarks** | 18,717ms | 15,223ms | **1.23x** |
| **Excluding fastapi[standard]** | 5,392ms | 3,559ms | **1.51x** |
## What We Optimized (122 commits)
### 1. Startup
- Ultra-fast `--version` path in `__main__.py` — exits before importing `pip._internal` (138ms → 20ms)
- Fast-path `--version` in `cli/main.py` — avoids `pip._internal.utils.misc` import
- Deferred `base_command.py` import chain to command creation time
- Deferred `Configuration` module loading
- Deferred autocompletion imports behind `PIP_AUTO_COMPLETE` check
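
The fast-path idea above can be sketched roughly as follows. This is a minimal illustration, not pip's actual `__main__.py`: `run`, `load_full_cli`, and `DEMO_VERSION` are hypothetical names, and the printed string is a placeholder rather than pip's real `--version` output.

```python
from typing import Callable, List

DEMO_VERSION = "0.0.0"  # hypothetical placeholder, not a real pip version

CliMain = Callable[[List[str]], int]


def run(argv: List[str], load_full_cli: Callable[[], CliMain]) -> int:
    """Answer ``--version`` before paying for the heavy import chain."""
    # Fast path: plain list comparisons only, no heavy imports.
    if argv in (["--version"], ["-V"]):
        print(f"demo {DEMO_VERSION}")
        return 0
    # Slow path: the loader performs the expensive imports only when needed.
    cli_main = load_full_cli()
    return cli_main(argv)
```

In a real entry point the loader would be a function whose body contains the deferred `import` statements, so their cost is only paid on the slow path.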
### 2. Dependency Resolver — Architecture
- **Speculative metadata prefetch**: background thread downloads PEP 658 metadata for the top candidate while the resolver processes other packages
- **Conditional Criterion rebuild**: `_remove_information_from_criteria` skips rebuilding unaffected criteria, eliminating ~95% of allocations
- **`__slots__` on Criterion**: reduces per-instance memory by ~100 bytes
- Two-level cache for `_iter_found_candidates` (specifier merge + candidate infos)
- Parallel index-page prefetch during dependency resolution
- Unified shared ThreadPoolExecutor for parallel wheel downloads
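
A speculative prefetch of the kind listed above might look something like the sketch below. It assumes a single resolver thread calling `prefetch` and `get`. `MetadataPrefetcher` and the `fetch` callable are hypothetical stand-ins, not names from pip.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


class MetadataPrefetcher:
    """Start metadata downloads early and collect them when needed."""

    def __init__(self, fetch: Callable[[str], str], max_workers: int = 4) -> None:
        self._fetch = fetch  # hypothetical download function: url -> metadata text
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures: Dict[str, Future] = {}

    def prefetch(self, url: str) -> None:
        # Submit at most one background download per URL.
        if url not in self._futures:
            self._futures[url] = self._pool.submit(self._fetch, url)

    def get(self, url: str) -> str:
        # Reuse the in-flight download if one exists, otherwise start it now,
        # then block until the result is available.
        self.prefetch(url)
        return self._futures[url].result()

    def close(self) -> None:
        self._pool.shutdown(wait=True)
```

The resolver would call `prefetch` for the most likely candidate as soon as it is known, keep working on other packages, and call `get` only when the metadata is actually required.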
### 3. Dependency Resolver — Micro
- Cached wheel tag priority dict on `TargetPython`
- Pre-extracted requirements tuple on `Criterion`, avoiding a generator expression on every call
- Cached `Marker.evaluate()` results for repeated extra lookups
- Hoisted `operator.methodcaller`/`attrgetter` to module-level constants
- Cached `_sort_key` results to avoid double evaluation in `compute_best_candidate`
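
As an illustration of the sort-key caching item, the sketch below memoizes a key function so that a sort followed by a `max` over the same items evaluates the key once per item. `memoize_key` is a hypothetical helper and assumes the items are hashable. It is not the code used in the branch.

```python
from typing import Callable, Dict, Hashable, TypeVar

K = TypeVar("K")


def memoize_key(key_func: Callable[[Hashable], K]) -> Callable[[Hashable], K]:
    """Wrap a key function so each distinct item is evaluated only once."""
    cache: Dict[Hashable, K] = {}

    def cached(item: Hashable) -> K:
        try:
            return cache[item]
        except KeyError:
            # First time this item is seen: compute and remember the key.
            result = cache[item] = key_func(item)
            return result

    return cached
```

Passing the same wrapped function to both `sorted(..., key=cached)` and `max(..., key=cached)` is what removes the second evaluation.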
### 4. Packaging (vendored `pip._vendor.packaging`)
- Replaced `_tokenizer` dataclass with `__slots__` class
- Deferred `Version.__hash__` computation until first call
- Integer comparison key (`_cmp_int`) — avoids full `_key` tuple construction
- Bisect-based `filter_versions` for O(log n + k) batch filtering
- Pre-computed integer bounds on `SpecifierSet` for fast rejection
- Cached parsed `Version`, `Requirement`, `Specifier` objects
- Fast-path tokenizer for simple tokens to bypass regex engine
- Direct release-tuple prefix comparison in `_compare_equal` / `_compare_compatible`
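
The bisect-based filtering can be sketched as below. This is a simplified model: plain integer tuples stand in for `Version` objects, the bounds are a half-open range, and pre-release handling is ignored. `filter_range` is a hypothetical name.

```python
from bisect import bisect_left
from typing import List, Tuple

Release = Tuple[int, ...]


def filter_range(
    sorted_versions: List[Release], lower: Release, upper: Release
) -> List[Release]:
    """Return versions v with lower <= v < upper from an ascending list."""
    # Two binary searches locate the slice boundaries in O(log n);
    # copying the k matching entries costs O(k).
    start = bisect_left(sorted_versions, lower)
    end = bisect_left(sorted_versions, upper)
    return sorted_versions[start:end]
```

For example, filtering `[(1, 0), (1, 5), (2, 0), (2, 1), (3, 0)]` with `lower=(1, 5)` and `upper=(3, 0)` yields `[(1, 5), (2, 0), (2, 1)]` without comparing every entry against the bounds.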
### 5. Link and Wheel Parsing
- Pre-computed `Link._is_wheel` slot to avoid repeated `splitext`
- Cached URL scheme on `Link` to skip `urlsplit` for `is_vcs`/`is_file`
- Inlined Link construction in `_evaluate_json_page` to skip redundant work
- A single `rsplit` instead of three `rfind` calls for wheel tag extraction
- Cached `parse_tag` results to eliminate redundant `Tag` creation
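
A rough sketch of the `rsplit` and caching ideas, assuming the standard wheel filename layout in which the last three dash-separated fields are the Python, ABI, and platform tags. `wheel_tags` is a hypothetical helper, not a function from pip or `packaging`.

```python
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=None)
def wheel_tags(filename: str) -> Tuple[str, str, str]:
    """Extract (python tag, abi tag, platform tag) from a wheel filename."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    stem = filename[: -len(".whl")]
    # One right-to-left split yields the three trailing tag fields at once.
    parts = stem.rsplit("-", 3)
    if len(parts) != 4:
        raise ValueError(f"malformed wheel filename: {filename!r}")
    _name_and_version, python_tag, abi_tag, platform_tag = parts
    return python_tag, abi_tag, platform_tag
```

Because of `lru_cache`, a filename seen again during resolution returns the stored tuple instead of being parsed a second time.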
### 6. I/O and Caching
- Replaced pure-Python msgpack with stdlib JSON for cache serialization
- Increased HTTP connection pool and prefetch concurrency
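
To illustrate what JSON-based cache serialization can look like, here is a minimal sketch using only the standard library. The on-disk layout (a 4-byte length prefix, a JSON header, then the raw body) and the names `dumps_entry` and `loads_entry` are assumptions made for this example. They do not describe the format used in the branch.

```python
import json
from typing import Dict, Tuple


def dumps_entry(status: int, headers: Dict[str, str], body: bytes) -> bytes:
    """Serialize a cached response as length prefix + JSON header + raw body."""
    header = json.dumps(
        {"status": status, "headers": headers}, separators=(",", ":")
    ).encode("utf-8")
    # The body stays as raw bytes, so it never passes through the JSON encoder.
    return len(header).to_bytes(4, "big") + header + body


def loads_entry(data: bytes) -> Tuple[int, Dict[str, str], bytes]:
    """Inverse of dumps_entry."""
    header_len = int.from_bytes(data[:4], "big")
    meta = json.loads(data[4 : 4 + header_len])
    return meta["status"], meta["headers"], data[4 + header_len :]
```

Keeping the body outside the JSON document avoids base64-encoding large payloads while the small metadata header is handled by the C-accelerated `json` module.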
### 7. Import Deferral (vendored Rich)
- Deferred all Rich imports to first use
- Stripped unused Rich modules from import chain
- Deferred heavy imports in Rich `console.py` (pretty/pager/scope/screen/export)
- Deferred Rich imports in `progress_bars.py` and `self_outdated_check.py`
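
One generic way to defer an import to first use is a small proxy object, sketched below. `LazyModule` is a hypothetical helper for illustration. The changes listed above may instead simply move `import` statements into the functions that need them.

```python
import importlib
from types import ModuleType
from typing import Any, Optional


class LazyModule:
    """Proxy that imports the named module on first attribute access."""

    def __init__(self, name: str) -> None:
        self._name = name
        self._module: Optional[ModuleType] = None

    def __getattr__(self, attr: str) -> Any:
        # Called only for attributes not found on the proxy itself,
        # that is, for everything the real module is expected to provide.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)
```

With `json = LazyModule("json")` at module level, the real import happens the first time something like `json.dumps(...)` runs, not when the enclosing module is imported.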
### 8. Micro-optimizations
- Bypassed `InstallationCandidate.__init__` with `__new__` + direct slot assignment
- Removed redundant O(n) subset assertion in `BestCandidateResult`
- Cached `Hashes.__hash__`, `Constraint.empty()` singleton, `Requirement.__str__`
- Bypassed `email.parser` for metadata parsing
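
The `__new__` plus direct slot assignment pattern can be illustrated with a toy class. `Candidate` and `make_candidate_fast` below are hypothetical stand-ins for `InstallationCandidate`, and the `str()` call in `__init__` merely represents whatever per-instance work the fast path skips.

```python
class Candidate:
    """Toy stand-in for a slotted candidate class."""

    __slots__ = ("name", "version", "link")
    init_calls = 0  # class-level counter, used only to show __init__ is skipped

    def __init__(self, name: str, version: object, link: str) -> None:
        Candidate.init_calls += 1
        self.name = name
        self.version = str(version)  # illustrative normalization work
        self.link = link


def make_candidate_fast(name: str, version: str, link: str) -> Candidate:
    """Build a Candidate without running __init__.

    The caller must pass values that are already in their final form.
    """
    obj = Candidate.__new__(Candidate)
    obj.name = name
    obj.version = version
    obj.link = link
    return obj
```

This trades safety for speed: any validation in `__init__` is bypassed, so the pattern only suits hot paths where the inputs are known to be well formed.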
## Upstream Contributions
### Bug fixes (PRs to pypa/pip)
| PR | Status | Description |
|---|---|---|
| [pypa/pip#13900](https://github.com/pypa/pip/pull/13900) | Open | Fix `--report -` to use stdlib `json` instead of Rich for stdout output |
| [pypa/pip#13902](https://github.com/pypa/pip/pull/13902) | Open | Fix `test_trailing_slash_directory_metadata` for Python 3.15 |
### Bug reports (issues on pypa/pip)
| Issue | Description |
|---|---|
| [pypa/pip#13898](https://github.com/pypa/pip/issues/13898) | `pip install --report -` outputs invalid JSON when not combined with `--quiet` |
| [pypa/pip#13901](https://github.com/pypa/pip/issues/13901) | `test_trailing_slash_directory_metadata` fails on Python 3.15.0a8 |
### Rich upstream (separate case study)
| PR | Description |
|---|---|
| [Textualize/rich#4070](https://github.com/Textualize/rich/pull/4070) | Import deferral — 2x import speedup |
| [KRRT7/rich#12](https://github.com/KRRT7/rich/pull/12) | Architectural wins (dataclass → `__slots__`, lazy emoji) |
| [KRRT7/rich#13](https://github.com/KRRT7/rich/pull/13) | Import deferral + runtime micro-opts |
See [rich_org](https://github.com/KRRT7/rich_org) for the full Rich case study.
## Methodology
### Profiling approach
1. **`python -X importtime`** — Identified the heaviest imports in the startup chain
2. **cProfile / py-spy** — Found hot functions in the resolver and packaging layers
3. **Allocation counting** — Tracked object creation counts to find redundant work (e.g., 45,301 → 1,559 `Tag.__init__` calls with caching)
4. **E2E hyperfine** — Validated every change with end-to-end benchmarks
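
For the cProfile step, a small helper along these lines is one way to capture the most expensive functions of a single call. `profile_top`, `slow_sum`, and `square` are hypothetical names used only for this sketch.

```python
import cProfile
import io
import pstats
from typing import Any, Callable, Tuple


def profile_top(func: Callable[..., Any], *args: Any, limit: int = 5) -> Tuple[Any, str]:
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so call-chain entry points appear first.
    stats.sort_stats("cumulative").print_stats(limit)
    return result, stream.getvalue()


def square(i: int) -> int:
    return i * i


def slow_sum(n: int) -> int:
    """Toy workload: many small function calls."""
    total = 0
    for i in range(n):
        total += square(i)
    return total
```

Calling `profile_top(slow_sum, 100_000)` returns the sum together with a text report in which `slow_sum` and `square` show up with their call counts and cumulative times.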
### Environment
- **Local**: macOS arm64 (Apple Silicon), Python 3.15.0a7, ~27 packages installed
- **CI validation**: Azure VM (Standard_D2s_v5, Ubuntu 24.04, Python 3.12), nox test sessions
- **Test suite**: 1,690 unit tests + 15 functional tests passing throughout
### Branch
All 122 optimization commits are on [`codeflash/optimize`](https://github.com/KRRT7/pip/tree/codeflash/optimize) in the KRRT7/pip fork.
## Repo Structure
```
.
├── README.md                    # This file
└── data/
    ├── benchmarks.md            # Full E2E benchmark results table
    ├── results.tsv              # Per-optimization tracking (target, speedup, status)
    ├── benchmark-analysis.md    # Detailed profiling analysis
    ├── io-analysis.md           # I/O and caching analysis
    ├── coverage-analysis.md     # Test coverage analysis
    ├── learnings.md             # Session learnings and patterns
    └── session-handoff.md       # Optimization session state
```