codeflash-agent/.codeflash/krrt7/python/pip/data/benchmark-analysis.md
Kevin Turcios cc29a27289
Migrate .codeflash/ to {teammember}/{org}/{project}/ format (#15)
Add team member dimension to case study paths so multiple contributors
can track optimization data independently. Derives member from
git config user.name in session-start hooks.

- Move all case studies under .codeflash/krrt7/
- Rename pypa/pip → python/pip (org grouping)
- Update session-start hooks, docs, scripts, and references
2026-04-14 23:04:34 -05:00

14 KiB

Resolver Benchmark Analysis: Ideas 4-10 Applicability

Methodology

Each workload was run 3 times with pip install --dry-run --ignore-installed. Resolver internals were monkey-patched (no source modifications) to capture 10 metrics per resolution. Timing values are medians of 3 runs; count values are deterministic (first run). HTTP cache was warm for all runs.

Environment: Python 3.15.0a7, macOS arm64 (Apple Silicon), branch codeflash/optimize

Important timing note: "Resolver CPU" measures wall time inside Resolution.resolve(), which includes metadata fetching, wheel downloading, and sdist building -- not just the resolver algorithm. The actual algorithmic CPU cost (round logic, pin satisfying checks, preference computation) is a tiny fraction of this. For packages that require building from source (dask, fastapi), the resolver CPU time is dominated by build system overhead.


Table 1: Core Resolution Metrics

Workload Rounds Backtracks Peak Criteria Pin-Satisfying Calls Candidates Pinned
simple: requests 7 0 6 57 6
simple: flask 9 0 8 100 8
simple: click 3 0 2 7 2
medium: django 5 0 4 26 4
medium: flask+django+boto3+requests 23 0 22 682 22
complex: fastapi[standard] 48 0 47 2,654 47
complex: jupyterlab 96 0 95 10,529 95
complex: celery[redis] 21 0 20 533 20
complex: dask[complete] 35 0 34 1,400 34
conflict: google-cloud-bigquery+pandas<2 1 0 2 2 1
conflict: boto3==1.26.0+botocore>=1.31 1 1 2 2 1
large: 30-pkg requirements 95 0 94 11,676 94

Key observation: Across all 12 workloads, backtracks range from 0 to 1. Round counts are directly proportional to the number of packages resolved (rounds ~= packages + 1). The resolver is operating as an essentially linear algorithm.

Table 2: make_requirements_from_spec Analysis

Workload Total Calls Unique Specs Duplicate Specs Dup Rate
simple: requests 4 4 0 0.0%
simple: flask 8 7 1 12.5%
simple: click 0 0 0 -
medium: django 2 2 0 0.0%
medium: flask+django+boto3+requests 22 20 2 9.1%
complex: fastapi[standard] 84 64 20 23.8%
complex: jupyterlab 156 136 20 12.8%
complex: celery[redis] 35 22 13 37.1%
complex: dask[complete] 70 45 25 35.7%
conflict: google-cloud-bigquery+pandas<2 0 0 0 -
conflict: boto3==1.26.0+botocore>=1.31 3 3 0 0.0%
large: 30-pkg requirements 118 109 9 7.6%

Aggregate: 502 total calls, 90 duplicates (17.9% duplication rate).

Table 3: RequirementInformation Allocations and narrow_requirement_selection

Workload RI Allocs Narrow Calls Reductions Avg In Avg Out Reduction Rate
simple: requests 10 4 1 3.5 2.5 25%
simple: flask 16 6 1 4.5 3.5 17%
simple: click 2 0 0 - - -
medium: django 6 2 1 2.5 1.5 50%
medium: flask+django+boto3+requests 47 21 1 8.5 8.1 5%
complex: fastapi[standard] 138 45 1 8.8 8.5 2%
complex: jupyterlab 247 92 1 15.3 15.2 1%
complex: celery[redis] 54 17 1 6.5 5.9 6%
complex: dask[complete] 103 31 1 6.7 6.4 3%
conflict: google-cloud-bigquery+pandas<2 2 1 0 2.0 2.0 0%
conflict: boto3==1.26.0+botocore>=1.31 4 1 0 2.0 2.0 0%
large: 30-pkg requirements 239 93 1 29.8 29.5 1%

Aggregate: 313 narrow calls, 9 actual reductions (3% effectiveness -- narrows almost exclusively on the first round where Requires-Python is present).

Table 4: Timing Breakdown

Workload Wall (s) Resolver CPU (s) Non-Resolver (s) _build_result (ms) Resolver %
simple: requests 0.328 0.215 0.108 0.033 66%
simple: flask 0.347 0.236 0.111 0.037 68%
simple: click 0.209 0.101 0.107 0.019 48%
medium: django 0.299 0.190 0.108 0.028 63%
medium: flask+django+boto3+requests 0.668 0.551 0.116 0.072 82%
complex: fastapi[standard] 12.209 11.598 0.493 0.143 95%*
complex: jupyterlab 6.182 5.908 0.274 0.246 96%*
complex: celery[redis] 0.575 0.438 0.117 2.046 76%
complex: dask[complete] 197.001 194.707 2.121 0.158 99%**
conflict: google-cloud-bigquery+pandas<2 3.322 2.645 0.677 0.000 80%
conflict: boto3==1.26.0+botocore>=1.31 0.297 0.184 0.114 0.000 62%
large: 30-pkg requirements 4.592 4.308 0.255 0.290 94%*

*These high CPU percentages include metadata downloading and wheel building inside the resolver loop, not just resolver algorithm work. **dask[complete] spends ~193s building C extensions (numpy, scipy, distributed, etc.) inside the resolver's metadata preparation.


Ideas 4-10: Detailed Analysis

Idea 4: Copy-on-Write (COW) State Snapshots

Theory: Each resolution round copies the full state dict via _push_new_state. With 3000+ rounds and heavy backtracking, COW would defer the copy until mutation.

Measured reality:

  • Round counts across all workloads: 1-96
  • Backtrack counts: 0-1
  • Peak criteria dict sizes: 2-95 entries

At these sizes, dict.copy() on a 2-95 entry dict costs 0.5-5 microseconds per copy. Even at 96 rounds (jupyterlab, the largest), total copy overhead is ~0.5ms. COW proxy objects would add per-access overhead on every dict read/write, easily exceeding the copy cost given that each round does dozens of dict operations.

VERDICT: NO-GO

The codex assumed 3000+ rounds with heavy backtracking. Reality: 1-96 rounds with 0-1 backtracks. State copies at this scale are free. COW would be a net negative.


Idea 5: Batch/Vectorize _is_current_pin_satisfying

Theory: Called once per criterion per round. Vectorizing into a single pass over the mapping dict would reduce per-call overhead.

Measured reality:

Workload Calls Per Round
simple: click 7 2
simple: requests 57 8
simple: flask 100 11
medium: django 26 5
medium: flask+django+boto3+requests 682 30
complex: celery[redis] 533 25
complex: dask[complete] 1,400 40
complex: fastapi[standard] 2,654 55
complex: jupyterlab 10,529 110
large: 30-pkg requirements 11,676 123

The function body is already optimized: dict.get() (single hash lookup) + all() generator with early exit. Each call costs ~200ns. Even at the maximum (11,676 calls), total cost is 2.3ms -- unmeasurably small vs wall time.

VERDICT: NO-GO

Call counts scale as O(rounds * criteria), which is O(n^2) where n = packages. But even at n=95, the total is only ~12K calls at 200ns each = 2.3ms. Batching would add complexity for negligible gain.


Idea 6: Cache make_requirements_from_spec Results

Theory: Same specifier strings may be passed multiple times during backtracking. Caching avoids redundant InstallRequirement construction.

Measured reality:

  • Aggregate across all workloads: 502 total calls, 90 duplicates (17.9%)
  • Per workload: 0-156 calls, 0-25 duplicates
  • Highest duplication rates: celery[redis] 37.1%, dask[complete] 35.7%

Even if every duplicate call is eliminated (~5us each), total savings = 0.45ms. The duplication occurs because some packages share common dependency specifier strings (e.g., typing-extensions>=3.7.4), not from backtracking.

VERDICT: NO-GO

Total call count is too low (max 156) and per-call cost is too cheap (~5us) for caching to produce measurable improvement. The codex assumed backtracking would cause thousands of repeated spec evaluations; with 0 backtracks, each spec is essentially processed once.


Idea 7: Pool/Reduce RequirementInformation Allocations

Theory: RequirementInformation namedtuples are allocated in hot loops. Object pooling or flyweight pattern could reduce allocation pressure.

Measured reality:

  • Range: 2-247 allocations per resolution
  • Maximum (jupyterlab): 247 allocations
  • Cost: 247 allocs x 50ns/alloc = 0.012ms

VERDICT: NO-GO

Allocation counts are trivially small. Python's tuple allocator handles this volume in microseconds. Pooling infrastructure (hash lookups, reference counting) would cost more than the allocations themselves. The __slots__ on Criterion (already applied) captures far more value than RI pooling ever could.


Idea 8: Optimize narrow_requirement_selection

Theory: This function is called each round. Optimizing it or skipping it when it won't narrow could save time.

Measured reality:

  • Aggregate: 313 calls, 9 reductions (3% effectiveness)
  • The function only reduces on the first round (Requires-Python check) for most workloads
  • Average input size: 2-30 identifiers; average output reduction: <1 identifier

The function's O(n) linear scan over identifiers is already at the floor. Its primary value is the Requires-Python early-return (always first round), and the backtrack-cause narrowing (only activates during backtracks -- of which there are 0-1 across all workloads).

VERDICT: NO-GO

Already well-optimized. 313 calls x ~10us/call = 3ms total. The function serves its purpose (Requires-Python fast-path) and has no meaningful optimization headroom.


Idea 9: Faster _build_result Graph Traversal

Theory: The final graph construction via _has_route_to_root recursive traversal could be expensive for large dependency trees.

Measured reality:

Workload _build_result time % of Wall
simple: click 0.019ms 0.009%
medium: flask+django+boto3+requests 0.072ms 0.011%
complex: fastapi[standard] 0.143ms 0.001%
complex: jupyterlab 0.246ms 0.004%
complex: celery[redis] 2.046ms 0.356%
large: 30-pkg requirements 0.290ms 0.006%

Maximum: 2.0ms (celery[redis], likely a one-off GC pause). Called exactly once per resolution.

VERDICT: NO-GO

_build_result accounts for <0.01% of wall time in all workloads except one outlier. Even that outlier is 2ms. No optimization here could produce a user-visible improvement.


Idea 10: Reduce I/O Overhead / Improve CPU-I/O Overlap

Theory: Resolution is I/O-bound. Better overlap between CPU and I/O work, or reducing I/O volume, could significantly cut wall time.

Measured reality:

For pure-python workloads (no compilation overhead):

Workload Wall (s) Non-Resolver (s) I/O Fraction
simple: click 0.209 0.107 51%
simple: requests 0.328 0.108 33%
simple: flask 0.347 0.111 32%
medium: django 0.299 0.108 36%
medium: flask+django+boto3+requests 0.668 0.116 17%
complex: celery[redis] 0.575 0.117 20%
conflict: boto3==1.26.0+botocore>=1.31 0.297 0.114 38%

For simple workloads, ~100-110ms is a fixed I/O floor (pip startup + initial index request). As workload complexity grows, the I/O fraction shrinks because HTTP responses are cached and the resolver does more CPU work per package.

The speculative metadata prefetch (already implemented) overlaps I/O with CPU for packages discovered through dependency traversal. Further gains require:

  • Protocol-level changes (server-side filtering, batch metadata endpoints)
  • HTTP/2 multiplexing for parallel index page requests
  • Larger connection pool for concurrent metadata downloads

VERDICT: GO (partially addressed)

I/O is 17-51% of wall time for pure-python workloads. The prefetch infrastructure already handles the most impactful case (top-candidate metadata overlap). Remaining I/O is structural: initial index page fetches that must complete before resolution can begin.


Summary: GO/NO-GO Recommendations

# Idea Verdict Key Evidence
4 COW state snapshots NO-GO 1-96 rounds (not 3000+), 0-1 backtracks, copy cost <0.5ms
5 Batch _is_current_pin_satisfying NO-GO Max 11,676 calls x 200ns = 2.3ms total
6 Cache make_requirements_from_spec NO-GO 502 total calls, 90 dups, savings ~0.45ms
7 Pool RequirementInformation NO-GO Max 247 allocs x 50ns = 0.012ms
8 Optimize narrow_requirement_selection NO-GO 313 calls, 3% effective, already O(n)
9 Faster _build_result NO-GO Max 2.0ms, <0.01% of wall time
10 Reduce I/O overhead GO (partial) 17-51% of wall time is I/O for pure-python pkgs

Key Insight: The Resolver Operates in a Fundamentally Different Regime Than Assumed

The codex research report assumed worst-case scenarios: 3000+ resolution rounds, heavy backtracking, massive criteria dictionaries, and repeated processing of the same specifiers. The measured reality across 12 diverse workloads is:

Metric Codex Assumption Measured Reality
Resolution rounds 3,000+ 1-96
Backtracks Heavy 0-1
Criteria dict size Large 2-95 entries
make_req_from_spec calls Thousands (repeated) 0-156 (few dups)
RI allocations Hundreds of thousands 2-247
_build_result cost Significant 0.02-2.0ms

The two-level cache on _iter_found_candidates (Experiment 18) is the key optimization that made this possible. By caching specifier merge results and candidate info lists, the resolver avoids redundant work during what would otherwise be expensive backtracking cycles. The result is an essentially linear one-pass-per-package algorithm with zero backtracks.

Bottom line: Ideas 4-9 target algorithmic overhead that has been effectively eliminated by the existing caching infrastructure. The resolver's pure algorithmic cost (round logic, preference computation, pin satisfaction checks) totals approximately 2-12ms even for the most complex workloads. This is irreducible overhead at the Python function-call level.

The only remaining lever for wall-time improvement is I/O (idea 10), which requires changes at the protocol/network layer, not the resolver algorithm.