All dependencies are vendored
Everything in `src/pip/_vendor/` is vendored from upstream: resolvelib, packaging, requests, urllib3, cachecontrol, certifi, distlib, importlib_metadata, pygments, rich, etc. These are copies of external libraries maintained via `tools/vendoring/`. Each vendored library is a candidate for replacement — if pip only uses a subset of a library's API, a focused implementation covering just that subset will be faster (no generalized overhead). If a vendored library is fully replaced and no longer imported anywhere in `src/pip/_internal/`, delete it from `_vendor/` and remove its entry from the vendoring manifest at `src/pip/_vendor/vendor.txt`.
Resolution flow is the primary hot path
The dependency resolution call chain:

    Resolver.resolve()                            [resolution/resolvelib/resolver.py]
      -> resolvelib.Resolver                      (vendored algorithm)
        -> PipProvider.find_matches()             [resolution/resolvelib/provider.py]
          -> Factory._iter_found_candidates()     [resolution/resolvelib/factory.py]
            -> PackageFinder.find_best_candidate()    [index/package_finder.py]
              -> LinkCollector.collect_sources()      [index/collector.py]
              -> LinkEvaluator.evaluate_link()
              -> CandidateEvaluator.compute_best_candidate()
        -> PipProvider.get_dependencies()
          -> Candidate.iter_dependencies()        [resolution/resolvelib/candidates.py]
The vendored resolvelib drives the algorithm; pip's layer (factory, provider, candidates, package_finder) is where the overhead lives.
Existing caching in pip — evaluate and improve
Current caching is a starting point, not a ceiling. Profile each one — the cache sizes, strategies, and data structures may all be suboptimal:
- `functools.lru_cache(maxsize=10000)` on `parse_version` in `utils/packaging.py` — is 10k the right size? Is `lru_cache` the fastest caching strategy here? Would a plain dict be faster (no LRU eviction overhead)? A sketch of the dict alternative follows this list.
- `functools.lru_cache(maxsize=32)` on `get_requirement` in `utils/packaging.py` — only 32 slots. During large resolutions this evicts constantly. Profile whether a larger cache or an unbounded `@functools.cache` is faster.
- `@functools.cache` on Link properties in `models/link.py` — `functools.cache` has per-call overhead for hashing its arguments. If a Link property is called with `self` only, a simple `__dict__`-based cache or `__slots__` with pre-computed values may be faster.
- `@functools.cached_property` on InstallRequirement properties — carried per-instance locking overhead before Python 3.12 (the lock was removed in 3.12), and pip still supports 3.9-3.11. Evaluate whether a simpler lazy pattern is faster.
- `@functools.cache` on candidate creation in `resolution/resolvelib/provider.py` — profile the cache hit rate. If it is low, the hashing overhead is pure waste.
- HTTP caching via vendored `CacheControl` in `network/session.py` — a general-purpose HTTP cache. If pip only needs a subset of its caching semantics, a focused implementation could be faster.
- Wheel cache by URL hash in `cache.py` — uses sha224. Profile whether a faster hash (xxhash via C, or even just a dict keyed on the URL string) would help at scale.
- Lazy wheel loading in `network/lazy_wheel.py` — the TODO says range requests aren't cached. Fix this, and also evaluate whether the lazy loading strategy itself is optimal (e.g., batch range requests, prefetch metadata sections).
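As referenced in the first bullet, a minimal sketch of the plain-dict alternative to `lru_cache`, assuming the vendored `packaging` is the underlying parser (the function name is illustrative, not pip's exact definition):

```python
from pip._vendor.packaging.version import Version

# Unbounded dict cache: the version strings seen during one resolution are
# finite and small, so skipping LRU bookkeeping on every hit is the entire win.
_version_cache: dict[str, Version] = {}

def parse_version_cached(version: str) -> Version:
    try:
        return _version_cache[version]
    except KeyError:
        parsed = _version_cache[version] = Version(version)
        return parsed
```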
Known TODOs from source — verified optimization opportunities
- `resolution/resolvelib/candidates.py` ~line 250: "TODO performance: this means we iterate dependencies at least twice" — dependencies are extracted from metadata, then iterated again during resolution.
- `resolution/resolvelib/factory.py`: "TODO: Check already installed candidate, and use it if the link and hash match" — redundant work when a compatible version is already installed.
- `network/lazy_wheel.py`: "TODO: Get range requests to be correctly cached" — lazy wheel metadata fetches bypass the HTTP cache.
Version parsing happens repeatedly
During candidate evaluation in `package_finder.py`, version strings are parsed from Link URLs multiple times across different stages (link evaluation, candidate evaluation, sorting). The `lru_cache` on `parse_version` helps, but the cache key is the raw string — if the same release is spelled differently across URLs (e.g., `1.0` vs `1.0.0`), each spelling is parsed and cached separately.
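A small illustration of that string-keyed behavior, using the vendored `packaging` directly (the wrapper here is illustrative, not pip's exact definition):

```python
from functools import lru_cache

from pip._vendor.packaging.version import Version

@lru_cache(maxsize=None)
def parse_version(v: str) -> Version:
    return Version(v)

# "1.0" and "1.0.0" compare equal under PEP 440, but they are distinct
# cache keys, so each spelling is parsed and stored separately.
assert parse_version("1.0") == parse_version("1.0.0")
assert parse_version.cache_info().currsize == 2
```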
Link objects are high-volume
`models/link.py` Link objects are created for every candidate from every index page. They use `@functools.cache` on properties, but the sheer volume (hundreds to thousands per resolution) means object creation overhead itself matters.
Test suite structure
- `tests/unit/` — fast, no network, good for profiling feedback
- `tests/unit/resolution_resolvelib/` — resolver-specific unit tests
- `tests/functional/` — slow, needs network, creates real virtualenvs
- Socket access is disabled by default in the pytest config
- `tests/unit/test_finder.py` — tests for PackageFinder
- `tests/unit/test_req.py` — tests for requirement handling
pip targets Python 3.9+ and PyPy
The walrus operator is fine (3.8+), but match/case (3.10+), exception groups (3.11+), and the `type` statement (3.12+) are off-limits. `typing.Self` (3.11+) and `typing.TypeAlias` (3.10+) need to come from `typing_extensions` on 3.9, or be confined to annotation-only uses behind `from __future__ import annotations`.
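A minimal sketch of the guarded-import pattern for 3.9 compatibility (the `TYPE_CHECKING` route shown here is one option; importing from pip's vendored `typing_extensions` instead is an assumption to verify against the project's conventions):

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by type checkers, so 3.10+/3.11+ names are safe here.
    from typing import Self

class Builder:
    def add(self) -> Self:  # lazy annotation: never evaluated at runtime
        return self
```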
Ruff is the linter
Line length 88, target-version py39. Key ignores: PERF203 is explicitly ignored for `src/pip/_internal/*` (try-except in loop). The isort config has a custom "vendored" section for `pip._vendor` imports.
packse is available for realistic resolver workloads
The sibling repo at `../packse/` contains 148 dependency resolution test scenarios. These can be used to create realistic profiling workloads by building the packse index and running `pip install --index-url <packse-url> <scenario-package>`. Categories with the most resolver stress: fork (32 scenarios), prereleases (20), local versions (16), requires-python (15).
Apr05 session: optimization results
Key findings from the optimization session:
- `get_supported()` is the single most impactful cache target. A single `@functools.lru_cache` on the underlying implementation reduces `Tag.__init__` calls from 45K to 1.5K in resolver test workloads (a 97% reduction). The cache key is (version, platforms_tuple, impl, abis_tuple). The hit rate is high because the same TargetPython parameters are used across a resolution. (Sketch after this list.)
- `canonicalize_name()` has a 92% cache hit rate. Package names are canonicalized repeatedly during resolution — once for each candidate evaluation, each distribution check, and each requirement comparison. An `lru_cache(maxsize=1024)` catches the vast majority of calls. (Same sketch after this list.)
- Test suite wall-clock time is a poor proxy for pip performance. The unit test suite is dominated by fixture creation (0.3s setup per resolver test × 40 tests = 12s), I/O (directory scanning), and subprocess calls (build isolation). Caches provide little benefit there because each test creates fresh state; real pip invocations process a single dependency tree where caches accumulate hits.
- cProfile overhead is higher with `lru_cache`. cProfile tracks every function call, including the `lru_cache` wrapper, so the profiling overhead ratio is ~3.3x with cached functions vs ~1.2x without. This makes the optimized code look slower under cProfile even though real execution is equivalent or faster.
- Python 3.15 is significantly faster than 3.14. The same unit test suite runs in ~37-42s on Py 3.15 vs ~130s on Py 3.14. This comes from general CPython performance improvements, not pip-specific changes.
- E2E profiling reveals completely different targets than unit tests. The unit test suite is dominated by test infrastructure (fixture creation, subprocess calls). A real `pip install --dry-run flask django boto3 requests` with cached metadata shows that Link object creation (12K+), Version operations, URL cleaning, and filename parsing dominate. Always profile real workloads.
- Double urlsplit is a hidden 2.5% cost. `_ensure_quoted_url` does `urlsplit` to check the path, then `Link.__init__` does `urlsplit` again on the same URL. Integrating quoting into `__init__` eliminates this. For HTTP/HTTPS URLs with already-clean paths (99% of URLs from package indices), a regex fast path (`_PATH_ALREADY_QUOTED_RE`) skips `_clean_url_path` entirely. (Sketch after this list.)
- Pre-computing hot properties in `__init__` is the most effective pattern for high-volume objects. Link objects are created 12K+ times. Moving splitext, filename, and hash computation from property access to `__init__` eliminated ~7% of self-time, because these properties were each accessed 2-4x per link during evaluation and sorting. (Combined sketch after this list.)
- Lazy parsing for rarely-used fields saves significant time. `upload_time` (an ISO datetime) was parsed eagerly for all 12K links but is only used when the `--uploaded-prior-to` flag is set (rare). Deferring `parse_iso_datetime` to first property access eliminated 1.4% of self-time. (Same combined sketch after this list.)
- Remaining performance floor after link/version optimizations. The profile is now flat: `Version.__init__` (7%), `Link.__init__` (7%), `evaluate_link` (5%), `from_json` (4%), specifier filtering (3%), version comparison (3%). These are core resolution operations — further gains require algorithmic changes (reducing candidate count) or resolver restructuring.
- The `parse_wheel_filename` cache has a 75% hit rate in single-package installs. In larger resolutions with many candidates from the same package, the hit rate is higher.
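A minimal sketch of the two caching wins above. The vendored `packaging` calls are real; `_get_supported_impl` is a stand-in for the actual implementation in `utils/compatibility_tags.py`, and the signature is illustrative:

```python
import functools

from pip._vendor.packaging.tags import compatible_tags
from pip._vendor.packaging.utils import canonicalize_name as _canonicalize_name

# canonicalize_name: a thin lru_cache wrapper covered ~92% of calls.
canonicalize_name = functools.lru_cache(maxsize=1024)(_canonicalize_name)

def _get_supported_impl(version, platforms, impl, abis):
    # Stand-in for the real tag computation; each call builds many Tag objects.
    return list(compatible_tags())

# get_supported takes list arguments, which are unhashable — so the cache
# lives on an inner function that only ever sees tuples.
@functools.lru_cache(maxsize=64)
def _get_supported_cached(version, platforms, impl, abis):
    return _get_supported_impl(
        version,
        list(platforms) if platforms is not None else None,
        impl,
        list(abis) if abis is not None else None,
    )

def get_supported(version=None, platforms=None, impl=None, abis=None):
    return _get_supported_cached(
        version,
        tuple(platforms) if platforms is not None else None,
        impl,
        tuple(abis) if abis is not None else None,
    )
```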
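A sketch of the quoted-path fast path described above — the regex and the slow-path quoting are illustrative stand-ins, not pip's exact definitions:

```python
import re
from urllib.parse import quote, urlsplit, urlunsplit

# Paths made only of characters that never need percent-quoting (plus '%'
# for already-encoded sequences) can skip cleaning entirely.
_PATH_ALREADY_QUOTED_RE = re.compile(r"^[A-Za-z0-9._~/%+-]*$")

def ensure_quoted_url(url: str) -> str:
    parts = urlsplit(url)
    if parts.scheme in ("http", "https") and _PATH_ALREADY_QUOTED_RE.match(parts.path):
        return url  # ~99% of index URLs take this branch
    # Slow path: re-quote the path, preserving existing percent-escapes.
    return urlunsplit(parts._replace(path=quote(parts.path, safe="/%")))
```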
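A combined sketch of the precompute and lazy-parse patterns on a stripped-down Link (the real class in `models/link.py` has many more fields; the `upload_time` handling here is illustrative):

```python
import posixpath
from datetime import datetime
from typing import Optional
from urllib.parse import urlsplit

class Link:
    __slots__ = ("url", "path", "filename", "ext",
                 "_upload_time_raw", "_upload_time")

    def __init__(self, url: str, upload_time_raw: Optional[str] = None) -> None:
        self.url = url
        # Precompute: these are each read 2-4x per link during evaluation
        # and sorting, so paying once at construction beats cached properties.
        self.path = urlsplit(url).path
        self.filename = posixpath.basename(self.path)
        self.ext = posixpath.splitext(self.filename)[1]
        # Lazy: upload_time is almost never read, so keep the raw string.
        self._upload_time_raw = upload_time_raw
        self._upload_time: Optional[datetime] = None

    @property
    def upload_time(self) -> Optional[datetime]:
        if self._upload_time is None and self._upload_time_raw is not None:
            self._upload_time = datetime.fromisoformat(self._upload_time_raw)
        return self._upload_time
```

Note that `functools.cached_property` cannot be used with `__slots__`, which is another argument for this manual lazy pattern on high-volume classes.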
Apr 2026 session: optimization floor analysis
- `_evaluate_json_page` is at the Python-level floor. Per-entry processing costs 4.2µs across 13.7K entries. The cost is spread across `dict.get` (65K calls, 0.009s), `str.endswith` (21K, 0.003s), string operations (rsplit/find/split/startswith, ~0.005s total), object construction (15K `__new__`, 0.002s), and `isinstance` (19K, 0.002s). No single operation dominates. A py3-none-any fast path that replaces rsplit + set lookup with a single `endswith` showed <2% improvement, within noise (sketch after this list). The function's self-time is fundamentally the cost of executing ~340 lines of Python bytecode per entry.
- Resolution round counts are much lower than expected after caching. The two-level cache in `_iter_found_candidates` (experiment 18) reduced resolver iterations dramatically. flask+django+boto3+requests: 23 state pushes, 0 backtracks. fastapi[standard]: 48 pushes, 0 backtracks. COW state snapshots, IteratorMapping elimination, and other per-round optimizations have negligible impact at these scales.
- Wall time is I/O-dominated after CPU optimizations. For flask+django+boto3+requests (826ms optimized), HTTP requests account for ~70% of wall time. The 41 HTTP requests (21 for index pages + 20 for metadata) are serialized by the resolver's sequential processing of packages. Our parallel prefetch infrastructure helps, but it can only overlap I/O for packages discovered through dependency traversal, not for the initial set (sketch after this list).
- Benchmark results are highly sensitive to network conditions and cache state. The same benchmark can vary 2-3x depending on HTTP cache warmth and network latency. Always use hyperfine with warmup runs and report median/mean with sigma. "Using cached" lines in pip output indicate cache hits; "Downloading" indicates misses.
- The `make_install_req_from_link` serialize-reparse is wasteful but low-impact. The function serializes a Requirement to a string and then re-parses it through `install_req_from_line` (which does `os.path.normpath`, `os.path.abspath`, URL parsing, etc.). But with only 21 calls at ~0.1ms each (2.2ms total), the absolute impact is negligible; it would only matter for workloads with thousands of direct requirements.
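The attempted fast path from the first bullet, sketched with illustrative names (`SUPPORTED_TAGS` and the tag-extraction logic are simplified stand-ins for the real evaluation):

```python
SUPPORTED_TAGS = {"py3-none-any", "cp313-cp313-manylinux_2_17_x86_64"}

def is_compatible_wheel(filename: str) -> bool:
    # Fast path: one C-level endswith covers pure-Python wheels, which
    # dominate index pages. Measured gain was <2%, within noise.
    if filename.endswith("-py3-none-any.whl"):
        return True
    # Slow path: the original rsplit + set-lookup shape.
    stem = filename.rsplit(".", 1)[0]        # drop ".whl"
    tag = "-".join(stem.split("-")[-3:])     # e.g. "py3-none-any"
    return tag in SUPPORTED_TAGS
```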
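A conceptual sketch of the prefetch overlap limitation: warming the HTTP cache with a thread pool (over a `requests`-style session) only works for URLs already discovered through dependency traversal, so the initial requirement set stays serial:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable

def prefetch(session, urls: Iterable[str], max_workers: int = 8) -> None:
    # Fire-and-wait: responses land in the HTTP cache, so the resolver's
    # later sequential fetches become cache hits instead of network I/O.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(lambda url: session.get(url).status_code, urls))
```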
Local warehouse (PyPI) is running
A full warehouse instance is running via Docker at http://localhost:80/. The Simple API is at http://localhost:80/simple/. This enables end-to-end profiling of the entire pip → network → warehouse → database stack. The warehouse source at `../warehouse/` is live-reloaded by gunicorn — changes to `warehouse/api/simple.py` (the Simple API endpoint) take effect immediately. Manage it with `cd ../warehouse && docker compose [up -d | down | logs web]`.
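A quick smoke test against the local Simple API (the content type assumes warehouse serves PEP 691 JSON responses, as upstream PyPI does):

```python
from urllib.request import Request, urlopen

req = Request(
    "http://localhost:80/simple/flask/",
    headers={"Accept": "application/vnd.pypi.simple.v1+json"},
)
with urlopen(req) as resp:
    print(resp.status, resp.headers.get("Content-Type"))
```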