Resolver Benchmark Analysis: Ideas 4-10 Applicability
Methodology
Each workload was run 3 times with pip install --dry-run --ignore-installed.
Resolver internals were monkey-patched (no source modifications) to capture 10 metrics
per resolution. Timing values are medians of the 3 runs; count values are deterministic
and were taken from the first run. The HTTP cache was warm for all runs.
Environment: Python 3.15.0a7, macOS arm64 (Apple Silicon), branch codeflash/optimize
Important timing note: "Resolver CPU" measures wall time inside Resolution.resolve(),
which includes metadata fetching, wheel downloading, and sdist building -- not just the
resolver algorithm. The actual algorithmic CPU cost (round logic, pin satisfying checks,
preference computation) is a tiny fraction of this. For packages that require building
from source (dask, fastapi), the resolver CPU time is dominated by build system overhead.
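The instrumentation behind these metrics is ordinary method wrapping done at runtime. A minimal sketch of the pattern, with an illustrative import path and counter names (the real harness wraps several resolver internals, not just resolve()):

```python
import functools
import time

def instrument(cls, method_name, counters):
    """Monkey-patch cls.method_name in place to record call counts and wall time."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        counters[f"{method_name}_calls"] = counters.get(f"{method_name}_calls", 0) + 1
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            counters[f"{method_name}_wall"] = counters.get(f"{method_name}_wall", 0.0) + elapsed

    setattr(cls, method_name, wrapper)

# Illustrative usage -- the vendored module layout varies by pip version:
# from pip._vendor.resolvelib.resolvers import Resolution
# counters = {}
# instrument(Resolution, "resolve", counters)
```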
Table 1: Core Resolution Metrics
| Workload | Rounds | Backtracks | Peak Criteria | Pin-Satisfying Calls | Candidates Pinned |
|---|---|---|---|---|---|
| simple: requests | 7 | 0 | 6 | 57 | 6 |
| simple: flask | 9 | 0 | 8 | 100 | 8 |
| simple: click | 3 | 0 | 2 | 7 | 2 |
| medium: django | 5 | 0 | 4 | 26 | 4 |
| medium: flask+django+boto3+requests | 23 | 0 | 22 | 682 | 22 |
| complex: fastapi[standard] | 48 | 0 | 47 | 2,654 | 47 |
| complex: jupyterlab | 96 | 0 | 95 | 10,529 | 95 |
| complex: celery[redis] | 21 | 0 | 20 | 533 | 20 |
| complex: dask[complete] | 35 | 0 | 34 | 1,400 | 34 |
| conflict: google-cloud-bigquery+pandas<2 | 1 | 0 | 2 | 2 | 1 |
| conflict: boto3==1.26.0+botocore>=1.31 | 1 | 1 | 2 | 2 | 1 |
| large: 30-pkg requirements | 95 | 0 | 94 | 11,676 | 94 |
Key observation: Across all 12 workloads, backtracks range from 0 to 1, and round counts track the number of packages resolved almost exactly (rounds ~= packages + 1). The resolver is operating as an essentially linear algorithm.
Table 2: make_requirements_from_spec Analysis
| Workload | Total Calls | Unique Specs | Duplicate Specs | Dup Rate |
|---|---|---|---|---|
| simple: requests | 4 | 4 | 0 | 0.0% |
| simple: flask | 8 | 7 | 1 | 12.5% |
| simple: click | 0 | 0 | 0 | - |
| medium: django | 2 | 2 | 0 | 0.0% |
| medium: flask+django+boto3+requests | 22 | 20 | 2 | 9.1% |
| complex: fastapi[standard] | 84 | 64 | 20 | 23.8% |
| complex: jupyterlab | 156 | 136 | 20 | 12.8% |
| complex: celery[redis] | 35 | 22 | 13 | 37.1% |
| complex: dask[complete] | 70 | 45 | 25 | 35.7% |
| conflict: google-cloud-bigquery+pandas<2 | 0 | 0 | 0 | - |
| conflict: boto3==1.26.0+botocore>=1.31 | 3 | 3 | 0 | 0.0% |
| large: 30-pkg requirements | 118 | 109 | 9 | 7.6% |
Aggregate: 502 total calls, 90 duplicates (17.9% duplication rate).
Table 3: RequirementInformation Allocations and narrow_requirement_selection
| Workload | RI Allocs | Narrow Calls | Reductions | Avg In | Avg Out | Reduction Rate |
|---|---|---|---|---|---|---|
| simple: requests | 10 | 4 | 1 | 3.5 | 2.5 | 25% |
| simple: flask | 16 | 6 | 1 | 4.5 | 3.5 | 17% |
| simple: click | 2 | 0 | 0 | - | - | - |
| medium: django | 6 | 2 | 1 | 2.5 | 1.5 | 50% |
| medium: flask+django+boto3+requests | 47 | 21 | 1 | 8.5 | 8.1 | 5% |
| complex: fastapi[standard] | 138 | 45 | 1 | 8.8 | 8.5 | 2% |
| complex: jupyterlab | 247 | 92 | 1 | 15.3 | 15.2 | 1% |
| complex: celery[redis] | 54 | 17 | 1 | 6.5 | 5.9 | 6% |
| complex: dask[complete] | 103 | 31 | 1 | 6.7 | 6.4 | 3% |
| conflict: google-cloud-bigquery+pandas<2 | 2 | 1 | 0 | 2.0 | 2.0 | 0% |
| conflict: boto3==1.26.0+botocore>=1.31 | 4 | 1 | 0 | 2.0 | 2.0 | 0% |
| large: 30-pkg requirements | 239 | 93 | 1 | 29.8 | 29.5 | 1% |
Aggregate: 313 narrow calls, 9 actual reductions (3% effectiveness). Narrowing occurs almost exclusively on the first round, where a Requires-Python requirement is present.
Table 4: Timing Breakdown
| Workload | Wall (s) | Resolver CPU (s) | Non-Resolver (s) | _build_result (ms) | Resolver % |
|---|---|---|---|---|---|
| simple: requests | 0.328 | 0.215 | 0.108 | 0.033 | 66% |
| simple: flask | 0.347 | 0.236 | 0.111 | 0.037 | 68% |
| simple: click | 0.209 | 0.101 | 0.107 | 0.019 | 48% |
| medium: django | 0.299 | 0.190 | 0.108 | 0.028 | 63% |
| medium: flask+django+boto3+requests | 0.668 | 0.551 | 0.116 | 0.072 | 82% |
| complex: fastapi[standard] | 12.209 | 11.598 | 0.493 | 0.143 | 95%* |
| complex: jupyterlab | 6.182 | 5.908 | 0.274 | 0.246 | 96%* |
| complex: celery[redis] | 0.575 | 0.438 | 0.117 | 2.046 | 76% |
| complex: dask[complete] | 197.001 | 194.707 | 2.121 | 0.158 | 99%** |
| conflict: google-cloud-bigquery+pandas<2 | 3.322 | 2.645 | 0.677 | 0.000 | 80% |
| conflict: boto3==1.26.0+botocore>=1.31 | 0.297 | 0.184 | 0.114 | 0.000 | 62% |
| large: 30-pkg requirements | 4.592 | 4.308 | 0.255 | 0.290 | 94%* |
*These high CPU percentages include metadata downloading and wheel building inside the resolver loop, not just resolver algorithm work.
**dask[complete] spends ~193s building C extensions (numpy, scipy, distributed, etc.) inside the resolver's metadata preparation.
Ideas 4-10: Detailed Analysis
Idea 4: Copy-on-Write (COW) State Snapshots
Theory: Each resolution round copies the full state dict via _push_new_state.
With 3000+ rounds and heavy backtracking, COW would defer the copy until mutation.
Measured reality:
- Round counts across all workloads: 1-96
- Backtrack counts: 0-1
- Peak criteria dict sizes: 2-95 entries
At these sizes, dict.copy() on a 2-95 entry dict costs 0.5-5 microseconds per copy.
Even at 96 rounds (jupyterlab, the largest), total copy overhead is ~0.5ms.
COW proxy objects would add per-access overhead on every dict read/write, easily
exceeding the copy cost given that each round does dozens of dict operations.
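The copy-cost figure is easy to sanity-check with a standalone micro-benchmark sized to the largest observed criteria dict (a sketch, not part of the benchmark harness):

```python
import timeit

# Approximate the per-round state copy: a dict with as many entries as the
# largest observed criteria dict (95 entries for jupyterlab / the 30-package run).
criteria = {f"package-{i}": object() for i in range(95)}

per_copy_us = timeit.timeit(criteria.copy, number=100_000) / 100_000 * 1e6
print(f"dict.copy() on 95 entries: ~{per_copy_us:.2f} us per copy")
print(f"Total over 96 rounds: ~{per_copy_us * 96 / 1000:.3f} ms")
```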
VERDICT: NO-GO
The codex assumed 3000+ rounds with heavy backtracking. Reality: 1-96 rounds with 0-1 backtracks. State copies at this scale are free. COW would be a net negative.
Idea 5: Batch/Vectorize _is_current_pin_satisfying
Theory: Called once per criterion per round. Vectorizing into a single pass over the mapping dict would reduce per-call overhead.
Measured reality:
| Workload | Calls | Per Round |
|---|---|---|
| simple: click | 7 | 2 |
| simple: requests | 57 | 8 |
| simple: flask | 100 | 11 |
| medium: django | 26 | 5 |
| medium: flask+django+boto3+requests | 682 | 30 |
| complex: celery[redis] | 533 | 25 |
| complex: dask[complete] | 1,400 | 40 |
| complex: fastapi[standard] | 2,654 | 55 |
| complex: jupyterlab | 10,529 | 110 |
| large: 30-pkg requirements | 11,676 | 123 |
The function body is already optimized: dict.get() (single hash lookup) + all()
generator with early exit. Each call costs ~200ns. Even at the maximum (11,676 calls),
total cost is 2.3ms -- negligible relative to wall time.
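For reference, the check has roughly the following shape -- a paraphrase of the resolvelib logic, not the exact vendored source:

```python
def _is_current_pin_satisfying(name, criterion, mapping, provider):
    """Paraphrased sketch: is the currently pinned candidate for `name` still
    satisfied by every requirement accumulated in `criterion`?"""
    current_pin = mapping.get(name)          # single hash lookup
    if current_pin is None:
        return False
    return all(                              # generator with early exit
        provider.is_satisfied_by(requirement=r, candidate=current_pin)
        for r in criterion.iter_requirement()
    )
```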
VERDICT: NO-GO
Call counts scale as O(rounds * criteria), which is O(n^2) where n = packages. But even at n=95, the total is only ~12K calls at 200ns each = 2.3ms. Batching would add complexity for negligible gain.
Idea 6: Cache make_requirements_from_spec Results
Theory: Same specifier strings may be passed multiple times during backtracking. Caching avoids redundant InstallRequirement construction.
Measured reality:
- Aggregate across all workloads: 502 total calls, 90 duplicates (17.9%)
- Per workload: 0-156 calls, 0-25 duplicates
- Highest duplication rates: celery[redis] 37.1%, dask[complete] 35.7%
Even if every duplicate call is eliminated (~5us each), total savings = 0.45ms.
The duplication occurs because different packages declare the same dependency specifier
string (e.g., typing-extensions>=3.7.4), not because of backtracking.
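For completeness, the proposed cache would amount to little more than memoizing on the specifier string. A hypothetical sketch with stand-in names (not pip's actual factory API) makes the scale of the savings clear:

```python
import functools

def make_requirements_from_spec(spec_str, comes_from):
    # Stand-in for the real pip factory method; construction costs ~5 us per call.
    return (spec_str, comes_from)

@functools.lru_cache(maxsize=None)
def cached_make_requirements(spec_str, comes_from):
    # Hypothetical memoizing wrapper keyed on the specifier string. With 502
    # calls and 90 duplicates across all 12 workloads, eliminating every
    # duplicate saves roughly 90 x 5 us ~= 0.45 ms -- not worth the machinery.
    return make_requirements_from_spec(spec_str, comes_from)
```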
VERDICT: NO-GO
Total call count is too low (max 156) and per-call cost too small (~5us) for caching to produce a measurable improvement. The codex assumed backtracking would cause thousands of repeated spec evaluations; with 0 backtracks, each spec is essentially processed once.
Idea 7: Pool/Reduce RequirementInformation Allocations
Theory: RequirementInformation namedtuples are allocated in hot loops. Object pooling or flyweight pattern could reduce allocation pressure.
Measured reality:
- Range: 2-247 allocations per resolution
- Maximum (jupyterlab): 247 allocations
- Cost: 247 allocs x 50ns/alloc = 0.012ms
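The per-allocation figure can be sanity-checked with a stand-in namedtuple of the same shape (two fields, requirement and parent), noting that the lambda adds its own call overhead to the measurement:

```python
import timeit
from collections import namedtuple

# Stand-in with the same shape as resolvelib's RequirementInformation.
RequirementInformation = namedtuple("RequirementInformation", ["requirement", "parent"])

runs = 1_000_000
per_alloc_ns = timeit.timeit(
    lambda: RequirementInformation("requirement", "parent"), number=runs
) / runs * 1e9
print(f"~{per_alloc_ns:.0f} ns per allocation (including lambda overhead)")
print(f"247 allocations (jupyterlab worst case): ~{per_alloc_ns * 247 / 1e6:.4f} ms")
```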
VERDICT: NO-GO
Allocation counts are trivially small. Python's tuple allocator handles this
volume in microseconds. Pooling infrastructure (hash lookups, reference counting)
would cost more than the allocations themselves. The __slots__ on Criterion
(already applied) captures far more value than RI pooling ever could.
Idea 8: Optimize narrow_requirement_selection
Theory: This function is called each round. Optimizing it or skipping it when it won't narrow could save time.
Measured reality:
- Aggregate: 313 calls, 9 reductions (3% effectiveness)
- The function only reduces on the first round (Requires-Python check) for most workloads
- Average input size: 2-30 identifiers; average output reduction: <1 identifier
The function's O(n) linear scan over identifiers is already at the floor. Its primary value lies in the Requires-Python early-return (always on the first round) and in backtrack-cause narrowing, which only activates during backtracks -- of which there are 0-1 across all workloads.
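A paraphrased sketch of the two behaviours just described, with a hypothetical signature and sentinel name (pip's real provider method takes more arguments):

```python
def narrow_requirement_selection(identifiers, backtrack_causes,
                                 requires_python_key="<Requires-Python>"):
    """Paraphrased sketch of the narrowing logic described above."""
    # Fast path: a pending Requires-Python pseudo-requirement is selected first.
    # This is the single reduction seen on the first round of most workloads.
    if requires_python_key in identifiers:
        return [requires_python_key]

    # Backtrack-cause narrowing: prefer identifiers implicated in the last
    # conflict. Only relevant while backtracking (0-1 occurrences here).
    if backtrack_causes:
        narrowed = [i for i in identifiers if i in backtrack_causes]
        if narrowed:
            return narrowed

    return identifiers
```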
VERDICT: NO-GO
Already well-optimized. 313 calls x ~10us/call = 3ms total. The function serves its purpose (Requires-Python fast-path) and has no meaningful optimization headroom.
Idea 9: Faster _build_result Graph Traversal
Theory: The final graph construction via _has_route_to_root recursive
traversal could be expensive for large dependency trees.
Measured reality:
| Workload | _build_result time | % of Wall |
|---|---|---|
| simple: click | 0.019ms | 0.009% |
| medium: flask+django+boto3+requests | 0.072ms | 0.011% |
| complex: fastapi[standard] | 0.143ms | 0.001% |
| complex: jupyterlab | 0.246ms | 0.004% |
| complex: celery[redis] | 2.046ms | 0.356% |
| large: 30-pkg requirements | 0.290ms | 0.006% |
Maximum: 2.0ms (celery[redis], likely a one-off GC pause). Called exactly once per resolution.
VERDICT: NO-GO
_build_result accounts for roughly 0.01% of wall time or less in every workload except the celery[redis] outlier (0.36%), and even that outlier is only 2ms in absolute terms. No optimization here could produce a user-visible improvement.
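For context, the traversal in question is a short memoized walk from each criterion back toward the root. A paraphrased sketch (not the exact vendored code; it assumes criteria map identifiers to objects exposing iter_parent(), parent_key_of maps id(candidate) to its identifier, and connected starts as {None} for the sentinel root) shows why it stays sub-millisecond at these graph sizes:

```python
_MISSING = object()

def has_route_to_root(criteria, key, parent_key_of, connected):
    """Paraphrased sketch: does `key` reach the root (the user's direct
    requirements, whose parent is None) through some chain of parents?
    `connected` memoizes keys already known to reach the root."""
    if key in connected:
        return True
    if key not in criteria:
        return False
    for parent in criteria[key].iter_parent():
        parent_key = parent_key_of.get(id(parent), _MISSING)
        if parent_key is _MISSING:
            continue
        if parent_key in connected or has_route_to_root(
            criteria, parent_key, parent_key_of, connected
        ):
            connected.add(key)
            return True
    return False
```

With at most 95 criteria and memoization via connected, the walk touches each node only a handful of times, which matches the sub-millisecond timings above.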
Idea 10: Reduce I/O Overhead / Improve CPU-I/O Overlap
Theory: Resolution is I/O-bound. Better overlap between CPU and I/O work, or reducing I/O volume, could significantly cut wall time.
Measured reality:
For pure-python workloads (no compilation overhead):
| Workload | Wall (s) | Non-Resolver (s) | I/O Fraction |
|---|---|---|---|
| simple: click | 0.209 | 0.107 | 51% |
| simple: requests | 0.328 | 0.108 | 33% |
| simple: flask | 0.347 | 0.111 | 32% |
| medium: django | 0.299 | 0.108 | 36% |
| medium: flask+django+boto3+requests | 0.668 | 0.116 | 17% |
| complex: celery[redis] | 0.575 | 0.117 | 20% |
| conflict: boto3==1.26.0+botocore>=1.31 | 0.297 | 0.114 | 38% |
For simple workloads, ~100-110ms is a fixed I/O floor (pip startup + initial index request). As workload complexity grows, the I/O fraction shrinks because HTTP responses are cached and the resolver does more CPU work per package.
The speculative metadata prefetch (already implemented) overlaps I/O with CPU for packages discovered through dependency traversal. Further gains require:
- Protocol-level changes (server-side filtering, batch metadata endpoints)
- HTTP/2 multiplexing for parallel index page requests
- Larger connection pool for concurrent metadata downloads
VERDICT: GO (partially addressed)
I/O is 17-51% of wall time for pure-python workloads. The prefetch infrastructure already handles the most impactful case (top-candidate metadata overlap). Remaining I/O is structural: initial index page fetches that must complete before resolution can begin.
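To make the "larger connection pool" bullet above concrete, here is an illustrative sketch of overlapping Simple-API index fetches with a small thread pool. It uses a plain requests.Session rather than pip's own PipSession and cache plumbing, so it is a sketch of the concept, not a drop-in change:

```python
from concurrent.futures import ThreadPoolExecutor

import requests  # stand-in HTTP client for illustration

INDEX = "https://pypi.org/simple"

def fetch_index_page(session, project):
    # One Simple-API page per project -- these requests make up the fixed
    # ~100-110 ms floor plus the per-package metadata round trips.
    resp = session.get(
        f"{INDEX}/{project}/",
        headers={"Accept": "application/vnd.pypi.simple.v1+json"},
    )
    resp.raise_for_status()
    return project, resp.json()

def fetch_all(projects, max_workers=8):
    with requests.Session() as session, ThreadPoolExecutor(max_workers) as pool:
        return dict(pool.map(lambda p: fetch_index_page(session, p), projects))

# Example: fetch_all(["flask", "django", "boto3", "requests"])
```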
Summary: GO/NO-GO Recommendations
| # | Idea | Verdict | Key Evidence |
|---|---|---|---|
| 4 | COW state snapshots | NO-GO | 1-96 rounds (not 3000+), 0-1 backtracks, copy cost <0.5ms |
| 5 | Batch _is_current_pin_satisfying | NO-GO | Max 11,676 calls x 200ns = 2.3ms total |
| 6 | Cache make_requirements_from_spec | NO-GO | 502 total calls, 90 dups, savings ~0.45ms |
| 7 | Pool RequirementInformation | NO-GO | Max 247 allocs x 50ns = 0.012ms |
| 8 | Optimize narrow_requirement_selection | NO-GO | 313 calls, 3% effective, already O(n) |
| 9 | Faster _build_result | NO-GO | Max 2.0ms, <0.01% of wall time |
| 10 | Reduce I/O overhead | GO (partial) | 17-51% of wall time is I/O for pure-python pkgs |
Key Insight: The Resolver Operates in a Fundamentally Different Regime Than Assumed
The codex research report assumed worst-case scenarios: 3000+ resolution rounds, heavy backtracking, massive criteria dictionaries, and repeated processing of the same specifiers. The measured reality across 12 diverse workloads is:
| Metric | Codex Assumption | Measured Reality |
|---|---|---|
| Resolution rounds | 3,000+ | 1-96 |
| Backtracks | Heavy | 0-1 |
| Criteria dict size | Large | 2-95 entries |
| make_req_from_spec calls | Thousands (repeated) | 0-156 (few dups) |
| RI allocations | Hundreds of thousands | 2-247 |
| _build_result cost | Significant | 0.02-2.0ms |
The two-level cache on _iter_found_candidates (Experiment 18) is the key
optimization that made this possible. By caching specifier merge results and
candidate info lists, the resolver avoids redundant work during what would
otherwise be expensive backtracking cycles. The result is an essentially linear,
one-pass-per-package algorithm with at most one backtrack across all 12 workloads.
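A simplified sketch of the two-level idea (not the actual Experiment 18 implementation): the first level keys on the set of specifier strings seen for an identifier so the merge is computed once, and the second level keys on the merged result so the candidate list is built once.

```python
class TwoLevelCandidateCache:
    """Simplified sketch of a two-level cache for candidate discovery."""

    def __init__(self, merge_specifiers, find_candidates):
        self._merge = merge_specifiers    # e.g. intersect SpecifierSets
        self._find = find_candidates      # e.g. index lookup + filtering
        self._merged = {}                 # level 1: (id, specifier strings) -> merged
        self._candidates = {}             # level 2: (id, merged) -> candidate infos

    def get(self, identifier, specifier_strings):
        key1 = (identifier, frozenset(specifier_strings))
        merged = self._merged.get(key1)
        if merged is None:
            merged = self._merged[key1] = self._merge(specifier_strings)

        key2 = (identifier, str(merged))
        found = self._candidates.get(key2)
        if found is None:
            found = self._candidates[key2] = self._find(identifier, merged)
        return found
```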
Bottom line: Ideas 4-9 target algorithmic overhead that has been effectively eliminated by the existing caching infrastructure. The resolver's pure algorithmic cost (round logic, preference computation, pin satisfaction checks) totals approximately 2-12ms even for the most complex workloads. This is irreducible overhead at the Python function-call level.
The only remaining lever for wall-time improvement is I/O (idea 10), which requires changes at the protocol/network layer, not the resolver algorithm.