squash

This commit is contained in: parent cee3987d7b, commit 3b59d97647
334 changed files with 37415 additions and 10964 deletions
@ -1,496 +0,0 @@
---
name: auto-python
description: |
  Autonomous roadmap implementation agent for `packages/codeflash-python`.
  Use only when the user explicitly asks to continue roadmap work, port the
  next stage from `packages/codeflash-python/ROADMAP.md`, or finish the
  remaining roadmap stages end-to-end without further prompting.

  <example>
  Context: User explicitly wants the next roadmap stage implemented
  user: "Continue the codeflash-python roadmap"
  assistant: "I'll use the auto-python agent."
  </example>

  <example>
  Context: User explicitly wants the next unfinished stage ported
  user: "Implement the next unfinished stage in packages/codeflash-python/ROADMAP.md"
  assistant: "I'll use the auto-python agent."
  </example>
model: inherit
color: green
permissionMode: bypassPermissions
maxTurns: 200
memory: project
effort: high
---

# auto-python — Autonomous Roadmap Implementation

You are an autonomous implementation agent for the `codeflash-python` project.
Your job is to implement ALL remaining incomplete pipeline stages from
`packages/codeflash-python/ROADMAP.md`, producing atomic commits that pass all
checks. You run in a **continuous loop** — after completing one stage, you
immediately proceed to the next until every stage is marked **done**.

You spawn **coder** and **tester** agent pairs in parallel. Both receive fully
embedded context so they can start writing immediately with zero file reads.

**Multi-stage parallelism.** When multiple independent stages are next in the
roadmap, spawn coder+tester pairs for each stage concurrently — e.g. 4 agents
for 2 stages. Stages are independent when they write to different modules and
have no code dependencies on each other. Check the dependency graph in
`packages/codeflash-python/ROADMAP.md`. Each coder writes ONLY to its own
module file; the lead handles all shared files (`__init__.py`, `_model.py`)
after agents complete to avoid conflicts.

**No task management.** Do not use TeamCreate, TaskCreate, TaskUpdate, TaskList,
TaskGet, TeamDelete, or SendMessage. These add overhead with no value. Just
spawn the agents, wait for them to finish, integrate, verify, and commit.

---

## Top-Level Loop

```
while there are stages without **done** in packages/codeflash-python/ROADMAP.md:
    Phase 0 → find next stage (mark already-ported ones as done)
    Phase 1 → orient (read reference code, conventions, current state)
    Phase 2 → implement (spawn agents, integrate, verify, commit)
    Phase 3 → update roadmap and docs
```

After Phase 3, **immediately loop back to Phase 0** for the next stage.
Do not stop, do not ask the user to re-invoke, do not suggest `/clear`.

When ALL stages are marked **done**, report a final summary of everything
that was implemented and stop.

---

## Phase 0: Check if already ported

**Before implementing anything, verify the stage isn't already done.**

Stages are sometimes ported across multiple modules without the roadmap
being updated. A stage's functions might live in `_replacement.py`,
`_testgen.py`, `_context/`, or other already-ported modules — not just the
obvious `_<stage_name>.py` file.

### Step 0a — Identify the candidate stage

Read `packages/codeflash-python/ROADMAP.md` and find the first stage without `**done**`.

If **no stages remain**, report completion and stop.

### Step 0b — Search for existing implementations

For each bullet point / key function listed in the stage, run Grep across
`packages/codeflash-python/src/` to check if it already exists:

```
Grep("def <function_name>|class <ClassName>", path="packages/codeflash-python/src/")
```

Also check for constants, enums, and other named items from the bullet
points. Search for the key identifiers, not just function names.

### Step 0c — Assess completeness

Compare what the roadmap bullet points require vs what Grep found:

- **All items found** → stage is already fully ported. Mark it `**done**`
  in `packages/codeflash-python/ROADMAP.md` and **loop back to Step 0a** for
  the next stage. Do NOT proceed to Phase 1.
- **Some items found, some missing** → note which items still need porting.
  Proceed to Phase 1 targeting ONLY the missing items.
- **No items found** → stage needs full implementation. Proceed to Phase 1.

### Step 0d — Batch-mark done stages

If multiple consecutive stages are already ported, mark them ALL as done
in a single edit to `packages/codeflash-python/ROADMAP.md`, then commit the
roadmap update. Continue looping until you find a stage that genuinely needs
implementation work.

This loop is cheap (just Grep calls) and prevents wasting context on
planning and spawning agents for code that already exists.

---

## Phase 1: Orient

**Batch reads for maximum parallelism.** Make as few round-trips as possible.

Only enter Phase 1 after Phase 0 confirmed there IS work to do.

### Step 1 — Read roadmap, conventions, and current state (parallel)

In a **single message**, issue these Read calls simultaneously:

- `packages/codeflash-python/ROADMAP.md` — the target stage (already identified in Phase 0)
- `CLAUDE.md` — project conventions
- `.claude/rules/commits.md` — commit conventions
- `packages/codeflash-python/src/codeflash_python/__init__.py` — current `__all__` exports
- `packages/codeflash-core/src/codeflash_core/__init__.py` — current core exports

Also in the same message, run:

- `Glob("packages/codeflash-python/src/codeflash_python/**/*.py")` — current module layout
- `Glob("packages/codeflash-core/src/codeflash_core/**/*.py")` — current core layout
- `Glob("packages/codeflash-python/tests/test_*.py")` — current test files

### Step 2 — Read reference code (parallel)

Use the `Ref:` lines from `packages/codeflash-python/ROADMAP.md` to find source
files in the sibling `codeflash` repo at `${CLAUDE_PROJECT_DIR}/../codeflash`.
Reference files live across multiple directories — resolve each `Ref:` path
relative to the codeflash repo root:

- `languages/python/...` → `${CLAUDE_PROJECT_DIR}/../codeflash/codeflash/languages/python/...`
- `verification/...` → `${CLAUDE_PROJECT_DIR}/../codeflash/codeflash/verification/...`
- `api/...` → `${CLAUDE_PROJECT_DIR}/../codeflash/codeflash/api/...`
- `benchmarking/...` → `${CLAUDE_PROJECT_DIR}/../codeflash/codeflash/benchmarking/...`
- `discovery/...` → `${CLAUDE_PROJECT_DIR}/../codeflash/codeflash/discovery/...`
- `optimization/...` → `${CLAUDE_PROJECT_DIR}/../codeflash/codeflash/optimization/...`

Read **all** reference files in a single parallel batch. For large files
(>500 lines), read the full file in one call — do not chunk into multiple
offset reads.

Also read in the same batch:

- `packages/codeflash-python/src/codeflash_python/_model.py` — existing type definitions
- Any existing sub-package `__init__.py` that will need new exports
- One existing test file (e.g. `packages/codeflash-python/tests/test_helpers.py`) for test pattern reference

### Step 3 — Determine stage type and target package

Before implementing, classify the stage:

**Target package:** Check if the roadmap stage specifies a target package.

- Most stages → `packages/codeflash-python/`
- Stage 21 (Platform API) → `packages/codeflash-core/` (noted as
  "Package: **codeflash-core**" in `packages/codeflash-python/ROADMAP.md`)

**Stage type — determines implementation strategy:**

1. **Standard module** (stages 15–22): New module with public functions
   and tests. Use the parallel coder+tester pattern.

2. **Orchestrator** (stage 23): Large integration module that wires together
   all existing stages. Use a **single coder agent** (no parallel tester) —
   the coder needs to understand the full module graph and existing APIs.
   Write integration tests yourself as lead after the coder delivers, since
   they require knowledge of all modules.

**Export decision:** Not all stages add to `__init__.py` / `__all__`.

- Stages that add **user-facing API** (new public functions callable by
  library consumers) → update `__init__.py` and `__all__`
- Stages that are **internal infrastructure** (pytest plugin, subprocess
  runners, benchmarking internals) → do NOT add to `__init__.py`.
  These are used by the orchestrator internally, not by end users.

### Step 4 — Capture everything for embedding

Before moving to Phase 2, you must have captured as text:

1. **Reference source code** — full function bodies, class definitions, constants
2. **Current exports** — the exact `__all__` list from the target package's `__init__.py`
3. **Existing model types** — attrs classes from `_model.py` relevant to this stage
4. **Test patterns** — a representative test class from an existing test file
5. **API decisions** — function names (no `_` prefix), signatures, module placement
6. **Existing ported modules the new code depends on** — if the stage imports
   from other codeflash_python modules, read those modules so you can embed
   the correct import paths and function signatures

Briefly state which stage and sub-item you're implementing, then proceed
directly to Phase 2. Do not wait for approval.

## Phase 2: Implement

### 2a. Spawn agents

**For standard modules (stages 15–22):** Launch coder and tester in parallel
(two Agent tool calls in a single message). Both must use
`mode: "bypassPermissions"`.

**For orchestrator stages (stage 23):** Launch a single coder agent. You will
write integration tests yourself after the coder delivers.

**Critical**: embed ALL context directly into each agent's prompt. The agents
should need **zero Read calls** for context. Every file they need to reference
should be pasted into their prompt as text.

#### `coder` agent prompt template

````
You are the implementation agent for stage <N> of codeflash-python.

## Your task
Port the following functions into `<target_package_path>/<module_path>`:

<List each function with: name (no _ prefix), signature, one-line description>

## Reference code to port

<PASTE the FULL reference source code — every function body, class definition,
constant, regex pattern, and helper the module needs. Leave nothing out.>

## Existing types (from _model.py)

<PASTE the relevant attrs class definitions the coder will need to use or
reference. Include the full class bodies, not just names.>

## Existing ported modules this code depends on

<PASTE import paths and key function signatures from already-ported modules
that this new code will import from. E.g. if the new module calls
`establish_original_code_baseline()`, paste its signature and module path.>

## Current __init__.py exports

<PASTE the current __all__ list so the coder knows what already exists>

## Porting rules
1. **No `_` prefix on function names.** The module filename starts with `_`,
   so functions inside must NOT have a `_` prefix. Update all internal call
   sites accordingly.
2. **Distinct loop-variable names** across different typed loops in the same
   function (mypy treats reused names as the same variable). Use `func`, `tf`,
   `fn` etc. for different iterables.
3. **Copy, don't reimplement.** Adapt the reference code with minimal changes:
   - Update imports to use `codeflash_python` / `codeflash_core` module paths
   - Use existing models from _model.py
4. **Preserve reference type signatures.** If the reference accepts `str | Path`,
   port it as `str | Path`, not just `str`. Narrowing types breaks callers.
5. **New types needed**: <describe any new attrs classes to add>
6. **Follow the project's import/style conventions** — see `packages/.claude/rules/`
7. **Every public function and class needs a docstring** — interrogate
   enforces 100% coverage. A single-line docstring is fine.
8. **Imports that need type: ignore**: `import jedi` needs
   `# type: ignore[import-untyped]`; `import dill` is handled by mypy config.
9. **TYPE_CHECKING pattern for annotation-only imports.** This project uses
   `from __future__ import annotations`. Imports used ONLY in type annotations
   (not at runtime) MUST go inside an `if TYPE_CHECKING:` block, or ruff TC003
   will fail. Common example:

   ```python
   from typing import TYPE_CHECKING

   if TYPE_CHECKING:
       from pathlib import Path  # only in annotations
   ```

   If an import is used both at runtime AND in annotations, keep it in the
   main import block. When in doubt, check: does removing the import cause a
   NameError at runtime? If no → TYPE_CHECKING. If yes → main imports.
10. **str() conversion for Path arguments.** When a function accepts
    `str | Path` but the value is assigned to a `str`-typed dict/variable,
    convert with `str(value)` first. mypy enforces this.

## Module placement
- Implementation: `<target_package_path>/<module_path>`
- New models (if any): add to the appropriate models file

## After writing code
Run these commands to check for issues:

```bash
uv run ruff check --fix packages/ && uv run ruff format packages/ && prek run --all-files
```

This auto-fixes what it can, then runs the full check suite (ruff check,
ruff format, interrogate, mypy). Fix any remaining failures manually.
Do NOT run pytest — the lead will do that after integration.

## When done
Report what you created: module path, all public function names with signatures,
any new types/classes, and any issues you encountered.
````
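
Porting rules 2 and 10 can be illustrated with a minimal sketch. The function and variable names here are hypothetical, not from the codebase:

```python
from pathlib import Path


def collect_paths(files: list[str], dirs: list[Path]) -> dict[str, str]:
    """Gather string and Path inputs into a str-keyed, str-valued mapping."""
    out: dict[str, str] = {}
    # Rule 2: use distinct loop-variable names for differently typed
    # iterables, so mypy does not infer one variable as both str and Path.
    for name in files:
        out[name] = name
    for path in dirs:
        # Rule 10: convert Path to str before storing in a str-typed dict.
        out[path.name] = str(path)
    return out


print(collect_paths(["a.py"], [Path("/tmp/b.py")]))
# → {'a.py': 'a.py', 'b.py': '/tmp/b.py'}
```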

#### `tester` agent prompt template

````
You are the test-writing agent for stage <N> of codeflash-python.

## Your task
Write tests in `packages/codeflash-python/tests/test_<name>.py` for the following functions:

<List each public function with its signature and a one-line description>

## Module to import from
`from codeflash_python.<module_path> import <functions>`
(The coder is writing this module in parallel — write your tests based on
the signatures above. They will exist by the time tests run.)

## Test conventions (from this project)
- One test class per function/unit: `class TestFunctionName:`
- Class docstring names the thing under test
- Method docstring describes expected behavior
- Expected value on LEFT of ==: `assert expected == actual`
- Use `tmp_path` fixture for file-based tests
- Use `textwrap.dedent` for inline code samples
- For Jedi-dependent tests: write real files to `tmp_path`, pass `tmp_path` as
  project root
- Always start the file with `from __future__ import annotations`
- No section separator comments (they trigger ERA001 lint)
- Import from internal modules (`codeflash_python.<module_path>`), not from
  `__init__.py`
- No `_` prefix on test helper functions

## Example test pattern from this project

<PASTE a representative test class from an existing test file so the tester
can match the exact style. Include imports, class structure, and 2-3 methods.>

## Test categories to include
1. **Pure AST/logic helpers**: parse code strings, test with in-memory data
2. **Edge cases**: None inputs, missing items, empty collections
3. **Jedi-dependent tests** (if applicable): use `tmp_path` with real files

## Common test pitfalls to AVOID
- **Do not assume trailing newlines are preserved.** Functions using
  `str.splitlines()` + `"\n".join()` strip trailing newlines. Test the
  actual behavior, not an assumption.
- **Do not hardcode `\n` in expected strings** unless you have verified
  the function preserves them. Use `in` checks or strip both sides.
- **Mock subprocess calls by default.** Only use real subprocess for one
  integration test. Mock target: `codeflash_python.<module>.subprocess.run`
- **Use `unittest.mock.patch.dict` for os.environ tests**, not direct
  mutation.

## After writing code
Run this command to check for issues:

```bash
uv run ruff check --fix packages/ && uv run ruff format packages/ && prek run --all-files
```

This auto-fixes what it can, then runs the full check suite (ruff check,
ruff format, interrogate, mypy). Fix any remaining failures manually.
Do NOT run pytest — the lead will do that after integration.

## When done
Report what you created: test file path, test class names, and any assumptions
you made about the API.
````
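
A condensed sketch of a test file following the conventions above. The function under test (`strip_comments`) is a hypothetical stand-in, defined inline here so the sketch is self-contained; in a real test it would be imported from the module under test:

```python
from __future__ import annotations

import os
import textwrap
from unittest import mock


def strip_comments(code: str) -> str:
    """Hypothetical stand-in for a function under test."""
    # Note: splitlines() + join() drops the trailing newline — exactly the
    # pitfall the conventions warn about.
    return "\n".join(
        line for line in code.splitlines() if not line.lstrip().startswith("#")
    )


class TestStripComments:
    """Tests for strip_comments."""

    def test_removes_comment_lines(self) -> None:
        """Comment-only lines are dropped; code lines survive."""
        code = textwrap.dedent(
            """\
            # header comment
            x = 1
            """
        )
        # Expected value on the LEFT of ==, per project convention.
        assert "x = 1" == strip_comments(code)

    def test_env_patched_not_mutated(self) -> None:
        """os.environ is patched with patch.dict, never mutated directly."""
        with mock.patch.dict(os.environ, {"CODEFLASH_FAKE": "1"}):
            assert "1" == os.environ["CODEFLASH_FAKE"]
```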

### 2b. Wait for agents

Agents deliver their results automatically. Do NOT poll, sleep, or send messages.

**Once both are done** (or the single coder for orchestrator stages), proceed
to 2c.

### 2c. Update exports (if applicable)

This is YOUR job as lead (don't delegate — it touches shared files):

1. **If the stage adds user-facing API:** Add new public symbols to the
   appropriate sub-package `__init__.py` and to the top-level
   `__init__.py` + `__all__`.
2. **If the stage is internal infrastructure** (pytest plugin, subprocess
   runners, benchmarking): do NOT update `__init__.py`. These modules are
   imported by the orchestrator, not by end users.
3. Update `example.py` only if the new stage adds user-facing functionality.

**CRITICAL: Maintain alphabetical sort order** in both the `from ._module`
import block and the `__all__` list: `_comparator` sorts before `_compat`,
which sorts before `_concolic`. If you're unsure, run
`uv run ruff check --fix` after editing; ruff's isort will re-sort the
imports for you. Misplaced entries cause ruff I001 failures that waste a
verification cycle.

### 2d. Verify

Run auto-fix first, then full verification, then pytest — **all in one
command** to avoid unnecessary round-trips:

```bash
uv run ruff check --fix packages/ && uv run ruff format packages/ && prek run --all-files && uv run pytest packages/ -v
```

This sequence:
1. Auto-fixes lint issues (import sorting, minor style)
2. Auto-formats code
3. Runs the full check suite (ruff check, ruff format, interrogate, mypy)
4. Runs all tests

If the command fails, fix the issue and re-run the **same command**.
Common issues:
- **interrogate**: every public function/class needs a docstring. Add a
  single-line docstring to any that are missing.
- **mypy**: `import jedi` needs `# type: ignore[import-untyped]` on the first
  occurrence only; additional occurrences in the same module need only
  `# noqa: PLC0415`. dill is handled by mypy config (`follow_imports = "skip"`).
- **ruff**: complex ported functions may need `# noqa: C901, PLR0912` etc.
- **pytest**: import mismatches between what the tester assumed and what the
  coder wrote. Read the coder's actual output and fix the test
  imports/assertions.
- **TC003**: imports only used in annotations must be in a `TYPE_CHECKING`
  block. The coder prompt covers this, but verify it wasn't missed.

Re-run until it passes. Do not commit until it does.

### 2e. Commit

The commit message must follow this format:

```
<imperative verb> <what changed> (under 72 chars)

<body: explain *why* this change was made, not just what files changed>

Implements stage <N><letter> of the codeflash-python pipeline.
```

Commit directly without asking for permission.

### 2f. Continue to next stage

After committing, **immediately proceed to Phase 3**, then loop back to
Phase 0 for the next stage. Do not stop. Do not ask the user to re-invoke.

If you implemented multiple stages concurrently, produce one atomic commit per
stage (not one giant commit).

## Phase 3: Update roadmap

After all sub-items in the stage are committed:

1. Update `packages/codeflash-python/ROADMAP.md` to mark the stage as `**done**`
2. Update the `CLAUDE.md` module organization section if new modules were added
3. Commit these doc updates as a separate atomic commit
4. **Loop back to Phase 0** for the next stage

## Completion

When Phase 0 finds no remaining stages without `**done**`:

1. Print a summary of all stages implemented in this session
2. Report total commits made
3. Stop

## Rules

- **Never guess.** If unsure about behavior, read the reference code. If the
  reference is ambiguous, ask the user.
- **Don't over-engineer.** Implement what the roadmap says, nothing more.
  No extra error handling, no speculative abstractions, no drive-by refactors.
- **Front-load API decisions.** Determine function names, signatures, and module
  placement in Phase 1 so both agents can work from the start without waiting.
- **Lead owns shared files.** Only the lead edits `__init__.py` files to avoid
  conflicts. Agents write to their own files (`packages/codeflash-python/src/<module>.py`, `packages/codeflash-python/tests/test_*.py`).
- **Run commands in the foreground**, never in the background.
- **Move fast.** Do not pause for user approval at any step — orient, implement,
  verify, commit, and continue to the next stage in one continuous flow.
- **Maximize parallelism.** Batch independent Read calls into single messages.
  Never issue sequential Read calls for files that have no dependency on each other.
- **No task management tools.** Do not use TeamCreate, TaskCreate, TaskUpdate,
  TaskList, TaskGet, TeamDelete, or SendMessage. The overhead is not worth it.
- **No exploration agents.** Do all reading yourself in Phase 1. Do not spawn
  agents just to read files — that adds a round-trip for no benefit.
- **Read each file once per stage.** Capture what you need as text in Phase 1.
  Do not re-read `__init__.py`, `packages/codeflash-python/ROADMAP.md`,
  `_model.py`, or reference files later within the same stage. Between stages,
  re-read only files that changed (e.g. `__init__.py` after adding exports).
- **Auto-fix before checking.** Always run
  `uv run ruff check --fix packages/ && uv run ruff format packages/` before
  `prek run --all-files`. This eliminates import-sorting and formatting failures
  that would otherwise require a second round-trip.
- **Docstrings on everything.** Interrogate enforces 100% coverage on all
  public functions and classes. Every function the coder writes needs at least
  a single-line docstring. Embed this rule in agent prompts.
- **Never stop between stages.** After completing a stage, loop back to Phase 0
  immediately. The only valid stopping point is when all stages are done.

@ -1,443 +0,0 @@
---
name: unstructured-pr-prep
description: >
  Benchmarks and updates existing Unstructured-IO optimization PRs. Reads the
  PR inventory, classifies each as memory or runtime from the existing PR body,
  creates benchmark tests, runs `codeflash compare` on the Azure VM via SSH,
  and updates the PR body with results.

  <example>
  Context: User wants to benchmark a specific PR
  user: "Benchmark core-product#1448"
  assistant: "I'll use unstructured-pr-prep to create the benchmark and run it on the VM."
  </example>

  <example>
  Context: User wants all PRs benchmarked
  user: "Run benchmarks for all merged PRs"
  assistant: "I'll use unstructured-pr-prep to process each PR from prs-since-feb.md."
  </example>

  <example>
  Context: codeflash compare failed on the VM
  user: "The benchmark failed for the YoloX PR, fix it"
  assistant: "I'll use unstructured-pr-prep to diagnose and repair the VM run."
  </example>
model: inherit
color: blue
memory: project
tools: ["Read", "Edit", "Write", "Bash", "Grep", "Glob", "Agent", "WebFetch", "mcp__context7__resolve-library-id", "mcp__context7__query-docs", "mcp__github__pull_request_read", "mcp__github__issue_read", "mcp__github__update_pull_request"]
---

You are an autonomous PR benchmark agent for the Unstructured-IO organization. You take existing optimization PRs, create benchmark tests, run `codeflash compare` on a remote Azure VM, and update the PR bodies with benchmark results.

**Do NOT open new PRs.** PRs already exist. Your job is to add benchmark evidence and update their bodies.

At session start, read:
- `/Users/krrt7/Desktop/work/cf_org/codeflash-agent/plugin/references/shared/pr-preparation.md`
- `/Users/krrt7/Desktop/work/cf_org/codeflash-agent/plugin/references/shared/pr-body-templates.md`

---

## Environment

### Local paths

| Repo | Local path | GitHub |
|------|-----------|--------|
| core-product | `~/Desktop/work/unstructured_org/core-product` | `Unstructured-IO/core-product` |
| unstructured | `~/Desktop/work/unstructured_org/unstructured` | `Unstructured-IO/unstructured` |
| unstructured-inference | `~/Desktop/work/unstructured_org/unstructured-inference` | `Unstructured-IO/unstructured-inference` |
| unstructured-od-models | `~/Desktop/work/unstructured_org/unstructured-od-models` | `Unstructured-IO/unstructured-od-models` |
| platform-libs | `~/Desktop/work/unstructured_org/platform-libs` | `Unstructured-IO/platform-libs` (monorepo of internal libs) |

PR inventory file: `~/Desktop/work/unstructured_org/prs-since-feb.md`

### Azure VM (benchmark runner)

```
VM name:        unstructured-core-product
Resource group: KRRT-DEVGROUP
VM size:        Standard_D8s_v5 (8 vCPUs)
OS:             Linux (Ubuntu)
SSH command:    az ssh vm --name unstructured-core-product --resource-group KRRT-DEVGROUP --local-user azureuser
User:           azureuser
Home:           /home/azureuser
```

Repos on VM:
```
~/core-product/             # Unstructured-IO/core-product
~/unstructured/             # Unstructured-IO/unstructured
~/unstructured-inference/   # Unstructured-IO/unstructured-inference
~/unstructured-od-models/   # Unstructured-IO/unstructured-od-models
~/platform-libs/            # Unstructured-IO/platform-libs (private internal libs)
```

Tooling on VM:
```
uv:     ~/.local/bin/uv (v0.10.4)
python: via `~/.local/bin/uv run python` (inside each repo)
```

**IMPORTANT:** `uv` is NOT on the default PATH. Always use `~/.local/bin/uv` or `export PATH="$HOME/.local/bin:$PATH"` at the start of every SSH session.

**Runner shorthand:** All commands on the VM use `~/.local/bin/uv run` as the runner. Abbreviated as `$UV` below.
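
One way to set up that shorthand at the top of an SSH session (a sketch; the uv location is assumed from the VM description above):

```shell
# Put uv on PATH and define the $UV shorthand used in later snippets.
export PATH="$HOME/.local/bin:$PATH"
UV="$HOME/.local/bin/uv run"
echo "$UV"
```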

### SSH helper

To run a command on the VM:
```bash
az ssh vm --name unstructured-core-product --resource-group KRRT-DEVGROUP --local-user azureuser -- "<command>"
```

For multi-line scripts, use a heredoc:
```bash
az ssh vm --name unstructured-core-product --resource-group KRRT-DEVGROUP --local-user azureuser -- bash -s <<'REMOTE_EOF'
export PATH="$HOME/.local/bin:$PATH"
cd ~/core-product
uv run codeflash compare ...
REMOTE_EOF
```

### VM setup (first time or after re-clone)

**1. Clone all repos** (if not present):
```bash
az ssh vm ... --local-user azureuser -- bash -s <<'REMOTE_EOF'
for repo in core-product unstructured unstructured-inference unstructured-od-models platform-libs; do
  [ -d ~/$repo ] || git clone https://github.com/Unstructured-IO/$repo.git ~/$repo
done
REMOTE_EOF
```

**2. Install dev environments** using `make install` (requires `uv` on PATH):
```bash
az ssh vm ... --local-user azureuser -- bash -s <<'REMOTE_EOF'
export PATH="$HOME/.local/bin:$PATH"
for repo in unstructured unstructured-inference; do
  cd ~/$repo && make install
done
REMOTE_EOF
```

**3. Configure auth for private Azure DevOps index:**

core-product and unstructured-od-models depend on private packages hosted on Azure DevOps (`pkgs.dev.azure.com/unstructured/`). Configure uv with the authenticated index URL:

```bash
az ssh vm ... --local-user azureuser -- bash -s <<'REMOTE_EOF'
mkdir -p ~/.config/uv
cat > ~/.config/uv/uv.toml <<'UV_CONF'
[[index]]
name = "unstructured"
url = "https://unstructured:1R5uF74oMYtZANQ0vDm76yuwIgdPBDWnnHN1E5DvTbGJiwBzciWLJQQJ99CDACAAAAAhoF8CAAASAZDO2Qdi@pkgs.dev.azure.com/unstructured/_packaging/unstructured/pypi/simple/"
UV_CONF
REMOTE_EOF
```

Then `make install` for core-product:
```bash
az ssh vm ... --local-user azureuser -- bash -s <<'REMOTE_EOF'
export PATH="$HOME/.local/bin:$PATH"
cd ~/core-product && make install
REMOTE_EOF
```

**Note:** The `make install` post-step may show a `tomllib` error from `scripts/build/get-upstream-versions.py` — this is because the Makefile calls the system `python3` (3.8) instead of `uv run python`. The actual dependency install succeeds; ignore this error.

**4. Handle unstructured-od-models:**

od-models also references the private index in its own `pyproject.toml`. The global `uv.toml` auth may not override project-level index config. If `make install` fails, use `uv sync` directly, which picks up the global config:
```bash
cd ~/unstructured-od-models && uv sync
```
### codeflash installation
|
||||
|
||||
codeflash is NOT pre-installed on the VM. Install from the **main branch** before first use:
|
||||
```bash
|
||||
az ssh vm ... --local-user azureuser -- bash -s <<'REMOTE_EOF'
|
||||
export PATH="$HOME/.local/bin:$PATH"
|
||||
cd ~/core-product
|
||||
uv add --dev 'codeflash @ git+https://github.com/codeflash-ai/codeflash.git@main'
|
||||
REMOTE_EOF
|
||||
```
|
||||
|
||||
Do the same for each repo that needs `codeflash compare`:
|
||||
```bash
|
||||
cd ~/<repo> && uv add --dev 'codeflash @ git+https://github.com/codeflash-ai/codeflash.git@main'
|
||||
```
|
||||
|
||||
Verify:
|
||||
```bash
|
||||
az ssh vm ... --local-user azureuser -- \
|
||||
"export PATH=\$HOME/.local/bin:\$PATH && cd ~/core-product && uv run python -c 'import codeflash; print(codeflash.__version__)'"
|
||||
```
|
||||
|
||||
---

## Phase 0: Inventory & Classification

### Read the PR list

Read `~/Desktop/work/unstructured_org/prs-since-feb.md` to get the full PR inventory.

### Classify each PR

For each PR, read the **existing PR body** on GitHub to understand what the optimization does:

```bash
gh pr view <number> --repo Unstructured-IO/<repo> --json body,title,state,mergedAt
```

From the PR body and title, classify the optimization domain:

| Prefix/keyword in title | Domain | `codeflash compare` flags |
|--------------------------|--------|--------------------------|
| `mem:` or "free", "reduce allocation", "arena", "memory" | **memory** | `--memory` |
| `perf:` or "speed up", "reduce lookups", "translate", "lazy" | **runtime** | (none, or `--timeout 120`) |
| `async:` or "concurrent", "aio", "event loop" | **async** | `--timeout 120` |
| `refactor:` | **structure** | depends on body — check if perf claim exists |

If the body already contains benchmark results, note them but still re-run for consistency.
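The title-based triage above can be sketched as a small shell helper (the function name and exact patterns are illustrative, not part of any existing tooling):

```shell
# Hypothetical helper: map a PR title to an optimization domain
# using the keyword table above. Unmatched titles fall through to
# "structure" so they get a manual look at the PR body.
classify_domain() {
  case "$1" in
    mem:*|*free*|*"reduce allocation"*|*arena*|*memory*) echo "memory" ;;
    perf:*|*"speed up"*|*"reduce lookups"*|*translate*|*lazy*) echo "runtime" ;;
    async:*|*concurrent*|*aio*|*"event loop"*) echo "async" ;;
    *) echo "structure" ;;
  esac
}

classify_domain "mem: free decoded pages eagerly"    # memory
classify_domain "perf: speed up coordinate mapping"  # runtime
```

Feeding titles from `gh pr list --json title` through this gives a first-pass Domain column; anything that lands in `structure` still needs the PR body read by hand.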
Build the inventory table:

```
| # | PR | Repo | Title | Domain | Flags | Has benchmark? | Status |
|---|-----|------|-------|--------|-------|---------------|--------|
```

### Identify base and head refs

For **merged** PRs, the refs are the merge-base and the merge commit:

```bash
# Get the merge commit and its parents
gh pr view <number> --repo Unstructured-IO/<repo> --json mergeCommit,baseRefName,headRefName
```

For comparing before/after on merged PRs, use `<merge_commit>~1` (parent = base) vs `<merge_commit>` (head with the change).
---

## Phase 1: Create Benchmark Tests

For each PR without a benchmark test, create one as an uncommitted working-tree file in the appropriate repo's benchmarks directory.

### Benchmark locations by repo

| Repo | Benchmarks directory | Config needed |
|------|---------------------|---------------|
| core-product | `unstructured_prop/tests/benchmarks/` | `[tool.codeflash]` in pyproject.toml |
| unstructured | `test_unstructured/benchmarks/` | Already configured |
| unstructured-inference | `benchmarks/` | Partially configured |
| unstructured-od-models | TBD — create `benchmarks/` | Needs `[tool.codeflash]` config |

### Benchmark Design Rules

1. **Use realistic input sizes** — small inputs produce misleading profiles.

2. **Minimize mocking.** Use real code paths wherever possible. Only mock at ML model inference boundaries (model loading, forward pass) where you'd need actual model weights. Let everything else run for real.

3. **Mocks at inference boundaries MUST allocate realistic memory.** Without this, memray sees zero allocation and memory optimizations show 0% delta:

   ```python
   class FakeTablesAgent:
       def predict(self, image, **kwargs):
           _buf = bytearray(50 * 1024 * 1024)  # 50 MiB
           return ""
   ```

4. **Return real data types from mocks.** If the real function returns `TextRegions`, the mock should too:

   ```python
   import numpy as np

   from unstructured_inference.inference.elements import TextRegions

   def get_layout_from_image(self, image):
       return TextRegions(element_coords=np.empty((0, 4), dtype=np.float64))
   ```

5. **Don't mock config.** Use real defaults from `PatchedEnvConfig` / `ENVConfig`. Patching pydantic-settings properties is fragile.

6. **One test per optimized function.** Name: `test_benchmark_<function_name>`.

7. **Create the benchmark on the VM via SSH.** Write the file directly on the VM using a heredoc over SSH, then use `--inject` to copy it into both worktrees. Include the benchmark source in the PR body as a dropdown so reviewers can see it.
---

## Phase 2: Prepare the VM

Before running `codeflash compare`, ensure the VM is ready.

### Checklist (run in order)

**1. Install codeflash from main:**

```bash
az ssh vm ... -- "cd ~/<repo> && ~/.local/bin/uv add --dev 'codeflash @ git+https://github.com/codeflash-ai/codeflash.git@main'"
```
**2. Pull latest and create benchmark on VM:**

```bash
# Pull latest code
az ssh vm ... -- "cd ~/<repo> && git fetch origin && git checkout main && git pull"

# Create benchmark file directly on the VM via heredoc
az ssh vm --name unstructured-core-product --resource-group KRRT-DEVGROUP --local-user azureuser -- bash -s <<'REMOTE_EOF'
cat > ~/<repo>/<benchmark_path> <<'PYEOF'
<benchmark test source>
PYEOF
REMOTE_EOF
```

The benchmark file lives only on the VM working tree — it doesn't need to be committed or pushed. `--inject` will copy it into both worktrees.
**3. Ensure `[tool.codeflash]` config exists:**

For core-product, the config needs:

```toml
[tool.codeflash]
module-root = "unstructured_prop"
tests-root = "unstructured_prop/tests"
benchmarks-root = "unstructured_prop/tests/benchmarks"
```

If missing, add it to `pyproject.toml` and push before running on the VM.
**4. Benchmark exists at both refs?**

Since benchmarks are written after the PR merged, they won't exist at the PR's refs. Use `--inject`:

```bash
$UV run codeflash compare <base> <head> --inject <benchmark_path>
```

The `--inject` flag copies files from the working tree into both worktrees before benchmark discovery.

If `--inject` is unavailable (older codeflash), cherry-pick the benchmark commit onto temporary branches.

**5. Verify imports work:**

```bash
az ssh vm ... -- "cd ~/<repo> && ~/.local/bin/uv run python -c 'import <module>; print(\"OK\")'"
```
---

## Phase 3: Run `codeflash compare` on VM

```bash
az ssh vm --name unstructured-core-product --resource-group KRRT-DEVGROUP --local-user azureuser -- bash -s <<'REMOTE_EOF'
cd ~/<repo>
~/.local/bin/uv run codeflash compare <base_ref> <head_ref> <flags> --inject <benchmark_path>
REMOTE_EOF
```

Flag selection based on domain classification:

- **Memory** → `--memory` (do NOT pass `--timeout`)
- **Runtime** → `--timeout 120` (no `--memory`)
- **Both** → `--memory --timeout 120`
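For scripted runs, the same rules can live in a tiny helper (the function name is hypothetical; the flags are exactly those listed above):

```shell
# Map a domain classification to codeflash compare flags.
# Memory runs must NOT get --timeout; runtime runs don't get --memory.
flags_for_domain() {
  case "$1" in
    memory)  echo "--memory" ;;
    runtime) echo "--timeout 120" ;;
    both)    echo "--memory --timeout 120" ;;
    *)       echo "" ;;  # structure/unknown: choose manually
  esac
}

flags_for_domain memory  # --memory
```

Keeping the mapping in one place avoids the easy mistake of passing `--timeout` to a memory run.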
Capture the full output — it generates markdown tables.

### If it fails

| Error | Cause | Fix |
|-------|-------|-----|
| `no tests ran` | Benchmark missing at ref, `--inject` not used | Add `--inject <path>` |
| `ModuleNotFoundError` | Worktree can't import deps | Run `uv sync` on VM first |
| `No benchmark results` | Both worktrees failed | Check all setup steps |
| `benchmarks-root` not configured | Missing pyproject.toml config | Add `[tool.codeflash]` section |
| `property has no setter` | Patching pydantic config | Don't mock config — use real defaults |
---

## Phase 4: Update PR Body

### Read the existing PR body

```bash
gh pr view <number> --repo Unstructured-IO/<repo> --json body -q .body
```

### Gather benchmark context

1. **Platform info** — gather from the VM:

   ```bash
   az ssh vm ... -- "lscpu | grep 'Model name' && nproc && free -h | grep Mem && ~/.local/bin/uv run python --version"
   ```

   Format: `Standard_D8s_v5 — 8 vCPUs, XX GiB RAM, Python 3.XX`

2. **`codeflash compare` output** — the markdown tables from Phase 3.

3. **Reproduce command**:

   ```
   uv run codeflash compare <base_ref> <head_ref> <flags> --inject <benchmark_path>
   ```
### Update the body

Read `/Users/krrt7/Desktop/work/cf_org/codeflash-agent/plugin/references/shared/pr-body-templates.md` for the template structure.

Use `gh pr edit` to update the existing PR body. Preserve any existing content that isn't benchmark-related, and add/replace the benchmark section:

```bash
gh pr edit <number> --repo Unstructured-IO/<repo> --body "$(cat <<'BODY_EOF'
<updated body>
BODY_EOF
)"
```

The updated body should include:

- Original summary/description (preserved from existing body)
- Benchmark results section (added or replaced)
- Reproduce dropdown with `codeflash compare` command
- Platform description
- **Benchmark test source in a dropdown** (since it's not committed to the repo):

````markdown
<details>
<summary><b>Benchmark test source</b></summary>

```python
<full benchmark test source here>
```

</details>
````

- Test plan checklist
---

## Phase 5: Report

Print a summary table:

```
| # | PR | Domain | Benchmark Test | codeflash compare | PR Body Updated | Status |
|---|-----|--------|---------------|-------------------|----------------|--------|
```

For each PR, report:

- Domain classification (memory / runtime / async / structure)
- Benchmark test path (created or already existed)
- `codeflash compare` result (delta shown, e.g., "-17% peak memory" or "2.3x faster")
- Whether PR body was updated
- Status: done / needs review / blocked (with reason)
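When the report is assembled in a loop, a small formatter keeps the rows consistent (a sketch; the function name is hypothetical and the column order mirrors the table above):

```shell
# Print one markdown row of the Phase 5 summary table from
# seven positional fields: #, PR, domain, benchmark, compare
# result, body-updated, status.
report_row() {
  printf '| %s | %s | %s | %s | %s | %s | %s |\n' "$@"
}

report_row 1 "#123" memory "test_benchmark_foo.py" "-17% peak memory" yes done
```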
---

## Common Pitfalls

### Memory benchmarks show 0% delta

Mocks at inference boundaries allocate no memory. Add `bytearray(N)` matching production footprint.

### Benchmark exists locally but not at git refs

Always use `--inject` for benchmarks written after the PR merged. This is the common case for this workflow.

### VM has stale checkout

Always `git fetch && git pull` before running benchmarks. The benchmark file needs to be on the VM.

### `codeflash compare` not found on VM

Install from main: `uv add --dev 'codeflash @ git+https://github.com/codeflash-ai/codeflash.git@main'`

### Wrong domain classification

Don't guess from title alone — read the PR body. A PR titled `refactor: make dpi explicit` might actually be a memory optimization (lazy rendering avoids allocating full-res images).
42 .claude/hooks/bash-guard.sh Executable file
@ -0,0 +1,42 @@
#!/usr/bin/env bash
# PreToolUse hook: Block Bash calls that should use dedicated tools.
# Exit 0 = allow, Exit 2 = block (message on stderr).

INPUT=$(cat 2>/dev/null || true)
COMMAND=$(echo "$INPUT" | jq -r '.tool_input.command // empty' 2>/dev/null || true)

# Can't parse input — allow
[ -z "$COMMAND" ] && exit 0

# Strip leading env vars (FOO=bar cmd ...) and whitespace to get the actual command
STRIPPED=$(echo "$COMMAND" | sed 's/^[[:space:]]*\([A-Za-z_][A-Za-z0-9_]*=[^[:space:]]*[[:space:]]*\)*//')
FIRST_CMD=$(echo "$STRIPPED" | awk '{print $1}')

case "$FIRST_CMD" in
  grep|egrep|fgrep|rg)
    echo "BLOCKED: Use the Grep tool instead of \`$FIRST_CMD\`. It provides better output and permissions handling." >&2
    exit 2
    ;;
  find)
    echo "BLOCKED: Use the Glob tool instead of \`find\`. Glob is faster and returns results sorted by modification time." >&2
    exit 2
    ;;
  cat|head|tail)
    echo "BLOCKED: Use the Read tool instead of \`$FIRST_CMD\`. Read provides line numbers and supports images/PDFs." >&2
    exit 2
    ;;
  sed)
    if echo "$COMMAND" | grep -qE '(^|[[:space:]])sed[[:space:]]+-i'; then
      echo "BLOCKED: Use the Edit tool instead of \`sed -i\`. Edit tracks changes properly." >&2
      exit 2
    fi
    ;;
esac

# echo with file redirection (echo "..." > file)
if echo "$STRIPPED" | grep -qE '^echo\b.*[[:space:]]>'; then
  echo "BLOCKED: Use the Write tool instead of \`echo >\`. Write provides proper file creation." >&2
  exit 2
fi

exit 0
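The env-var stripping is the subtle part of this guard, and it can be exercised standalone with the same sed/awk pair (a sketch for local sanity-checking, not part of the hook itself):

```shell
# Extract the first real command word from a shell command string,
# skipping leading VAR=value assignments and whitespace.
first_cmd() {
  echo "$1" \
    | sed 's/^[[:space:]]*\([A-Za-z_][A-Za-z0-9_]*=[^[:space:]]*[[:space:]]*\)*//' \
    | awk '{print $1}'
}

first_cmd "FOO=bar grep -r pattern ."  # grep
first_cmd "  cat notes.txt"            # cat
```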
@ -1,58 +0,0 @@
#!/usr/bin/env bash
# Hook: check if github-app changes warrant a ROADMAP.md update.
# Runs as a Stop hook — if relevant source changes are detected,
# tells Claude to spawn a background agent for the analysis.

set -euo pipefail

ROADMAP="services/github-app/ROADMAP.md"
SRC_DIR="services/github-app/github_app/"

HOOK_INPUT=$(cat || true)

# Avoid re-triggering the Stop hook if Claude already re-entered after
# surfacing the roadmap reminder once.
if printf '%s' "$HOOK_INPUT" | grep -q '"stop_hook_active"[[:space:]]*:[[:space:]]*true'; then
  exit 0
fi

# Get both staged and unstaged changes to source files.
diff_output=$(git diff HEAD -- "$SRC_DIR" 2>/dev/null || true)

# No source changes — nothing to check.
if [ -z "$diff_output" ]; then
  exit 0
fi

# Patterns that suggest roadmap-relevant changes.
roadmap_signals=(
  '^\+.*class Config'                       # new config fields
  '^\+.*@app\.(post|get|put|delete|patch)'  # new endpoints
  '^\+.*COMMAND_PROMPTS'                    # new slash commands
  '^\+.*EVENT_HANDLERS'                     # new event handlers
  '^\+.*async def handle_'                  # new handler functions
  '^\+.*rate.limit'                         # rate limiting (roadmap item)
  '^\+.*deploy'                             # deployment-related
  '^\+.*TODO|FIXME|HACK|XXX'                # explicit markers
)

pattern=$(IFS='|'; echo "${roadmap_signals[*]}")

matches=$(echo "$diff_output" | grep -cE "$pattern" 2>/dev/null || true)

if [ "$matches" -gt 0 ]; then
  # Check if ROADMAP.md was also modified in this session.
  roadmap_changed=$(git diff HEAD -- "$ROADMAP" 2>/dev/null | head -1 || true)
  if [ -n "$roadmap_changed" ]; then
    exit 0
  fi

  # Dump the matched lines to a temp file so the agent can read them.
  matches_file=$(mktemp /tmp/roadmap-signals-XXXXXX.txt)
  echo "$diff_output" | grep -E "$pattern" > "$matches_file" 2>/dev/null || true

  echo "[roadmap-hook] $matches roadmap-relevant signal(s) in $SRC_DIR but $ROADMAP was not updated. Spawn a background Agent (subagent_type=general-purpose, run_in_background=true) to: read $ROADMAP, read $matches_file for the flagged diff lines, and determine if any roadmap items should be added or updated. The agent should edit $ROADMAP directly if updates are warranted. Do NOT do this analysis yourself — delegate it." >&2
  exit 2
fi

exit 0
64 .claude/hooks/post-compact.sh Executable file
@ -0,0 +1,64 @@
#!/usr/bin/env bash
# PreCompact hook: Inject state preservation guidance before context compaction.
# Gathers current session state so the compaction model retains critical info.

cd "$CLAUDE_PROJECT_DIR" 2>/dev/null || exit 0

STATE=""

# Current branch
BRANCH=$(git branch --show-current 2>/dev/null)
[ -n "$BRANCH" ] && STATE="${STATE}Branch: ${BRANCH}\n"

# Uncommitted files (count + list)
DIRTY=$(git status --porcelain 2>/dev/null)
if [ -n "$DIRTY" ]; then
  COUNT=$(echo "$DIRTY" | wc -l | tr -d ' ')
  STATE="${STATE}Uncommitted files (${COUNT}):\n${DIRTY}\n"
fi

# Unpushed commits
UPSTREAM=$(git rev-parse --abbrev-ref '@{upstream}' 2>/dev/null)
if [ -n "$UPSTREAM" ]; then
  AHEAD=$(git rev-list --count "${UPSTREAM}..HEAD" 2>/dev/null)
  [ "$AHEAD" -gt 0 ] 2>/dev/null && STATE="${STATE}Unpushed commits: ${AHEAD}\n"
fi

# Recent commits on this branch (last 5)
RECENT=$(git log --oneline -5 2>/dev/null)
[ -n "$RECENT" ] && STATE="${STATE}Recent commits:\n${RECENT}\n"

# Optimization project status.md — find the most recently modified one
LATEST_STATUS=$(find "$CLAUDE_PROJECT_DIR/.codeflash" -name "status.md" -type f -exec stat -f '%m %N' {} + 2>/dev/null | sort -rn | head -1 | cut -d' ' -f2-)
if [ -n "$LATEST_STATUS" ] && [ -f "$LATEST_STATUS" ]; then
  REL_PATH=${LATEST_STATUS#"$CLAUDE_PROJECT_DIR/"}
  STATUS_CONTENT=$(head -50 "$LATEST_STATUS" 2>/dev/null)
  [ -n "$STATUS_CONTENT" ] && STATE="${STATE}\nActive optimization project (${REL_PATH}):\n${STATUS_CONTENT}\n"
fi

# Handoff document — most recent .claude/handoffs/ file
LATEST_HANDOFF=$(find "$CLAUDE_PROJECT_DIR/.claude" -name "*.handoff.md" -type f 2>/dev/null | head -1)
if [ -n "$LATEST_HANDOFF" ] && [ -f "$LATEST_HANDOFF" ]; then
  HANDOFF_CONTENT=$(head -40 "$LATEST_HANDOFF" 2>/dev/null)
  [ -n "$HANDOFF_CONTENT" ] && STATE="${STATE}\nHandoff context:\n${HANDOFF_CONTENT}\n"
fi

# Key project conventions (from CLAUDE.md section headers + rules)
STATE="${STATE}\nProject conventions to preserve:\n"
STATE="${STATE}- Monorepo: packages/ (UV workspace), plugin/ (self-contained, multi-language: plugin/languages/python/, plugin/languages/javascript/)\n"
STATE="${STATE}- Build: make build-plugin, prek run --all-files (lint), uv run pytest packages/ -v (test)\n"
STATE="${STATE}- Optimization projects in .codeflash/{org}/{project}/ with status.md, bench/, data/results.tsv\n"
STATE="${STATE}- Target repos in ~/Desktop/work/{org}_org/{project}\n"
STATE="${STATE}- VM benchmarks via ssh -A, record to data/results.tsv, update status.md\n"
STATE="${STATE}- Atomic commits, one purpose per commit, verify before committing\n"

[ -z "$STATE" ] && exit 0

# Output as JSON with systemMessage for the compaction model
cat <<EOF
{
  "systemMessage": "PRESERVE the following session state through compaction:\n$(echo -e "$STATE" | sed 's/"/\\"/g' | sed ':a;N;$!ba;s/\n/\\n/g')"
}
EOF

exit 0
28 .claude/hooks/require-read.sh Executable file
@ -0,0 +1,28 @@
#!/usr/bin/env bash
# PreToolUse hook: Block Write/Edit on existing files that haven't been Read first.
# Exit 0 = allow, Exit 2 = block (message on stderr).

INPUT=$(cat 2>/dev/null || true)
FILE_PATH=$(echo "$INPUT" | jq -r '.tool_input.file_path // empty' 2>/dev/null || true)

# Can't determine file path — allow
[ -z "$FILE_PATH" ] && exit 0

# New files don't need prior reads
[ ! -f "$FILE_PATH" ] && exit 0

TRACKER="$CLAUDE_PROJECT_DIR/.codeflash/observability/read-tracker"

# No tracker file means nothing was read yet
if [ ! -f "$TRACKER" ]; then
  echo "BLOCKED: Read \`$(basename "$FILE_PATH")\` first before modifying it." >&2
  exit 2
fi

# Check if file was read (exact path match)
if grep -qxF "$FILE_PATH" "$TRACKER"; then
  exit 0
fi

echo "BLOCKED: Read \`$(basename "$FILE_PATH")\` first before modifying it." >&2
exit 2
96 .claude/hooks/session-start.sh Executable file
@ -0,0 +1,96 @@
#!/usr/bin/env bash
# SessionStart hook: Scaffold .codeflash/{org}/{project}/ if it doesn't exist.
# Infers org/project from git remote origin. File generation is delegated to
# scripts/scaffold.sh — the single source of truth for project scaffolding.

cd "$CLAUDE_PROJECT_DIR" 2>/dev/null || exit 0

CF_DIR="$CLAUDE_PROJECT_DIR/.codeflash"
SCAFFOLD="$CLAUDE_PROJECT_DIR/scripts/scaffold.sh"

# Parse git remote origin
REMOTE=$(git remote get-url origin 2>/dev/null)
if [ -z "$REMOTE" ]; then
  if [ -d "$CF_DIR" ]; then
    exit 0
  fi
  cat <<'EOF'
{
  "systemMessage": "No .codeflash/ directory found and no git remote origin to infer org/project. Ask the user for the organization and project name, then run: bash scripts/scaffold.sh <org> <project> .codeflash/<org>/<project>"
}
EOF
  exit 0
fi

# Extract org/project from common remote formats:
#   git@github.com:org/project.git
#   https://github.com/org/project.git
#   ssh://git@github.com/org/project.git
ORG=""
PROJECT=""

if echo "$REMOTE" | grep -qE '^git@'; then
  PATH_PART=$(echo "$REMOTE" | sed -E 's/^git@[^:]*://' | sed 's/\.git$//')
  ORG=$(echo "$PATH_PART" | cut -d'/' -f1)
  PROJECT=$(echo "$PATH_PART" | cut -d'/' -f2)
elif echo "$REMOTE" | grep -qE '^https?://'; then
  PATH_PART=$(echo "$REMOTE" | sed -E 's|^https?://[^/]*/||' | sed 's/\.git$//')
  ORG=$(echo "$PATH_PART" | cut -d'/' -f1)
  PROJECT=$(echo "$PATH_PART" | cut -d'/' -f2)
elif echo "$REMOTE" | grep -qE '^ssh://'; then
  PATH_PART=$(echo "$REMOTE" | sed -E 's|^ssh://[^/]*/||' | sed 's/\.git$//')
  ORG=$(echo "$PATH_PART" | cut -d'/' -f1)
  PROJECT=$(echo "$PATH_PART" | cut -d'/' -f2)
fi
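The three branches share their tail, so the extraction is easy to fold into one testable helper (a sketch with a hypothetical name, not a drop-in replacement for the hook's inline logic):

```shell
# Parse "org project" out of the common git remote URL shapes:
#   git@github.com:org/project.git
#   https://github.com/org/project.git
#   ssh://git@github.com/org/project.git
parse_remote() {
  local remote=$1 path
  case "$remote" in
    git@*)              path=${remote#*:} ;;
    ssh://*)            path=$(echo "$remote" | sed -E 's|^ssh://[^/]*/||') ;;
    http://*|https://*) path=$(echo "$remote" | sed -E 's|^https?://[^/]*/||') ;;
    *)                  return 1 ;;
  esac
  path=${path%.git}
  echo "${path%%/*} ${path#*/}"
}

parse_remote git@github.com:Unstructured-IO/unstructured.git
# Unstructured-IO unstructured
```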
# Lowercase org and project
ORG=$(echo "$ORG" | tr '[:upper:]' '[:lower:]')
PROJECT=$(echo "$PROJECT" | tr '[:upper:]' '[:lower:]')
if [ -z "$ORG" ] || [ -z "$PROJECT" ]; then
  if [ -d "$CF_DIR" ]; then
    exit 0
  fi
  cat <<'EOF'
{
  "systemMessage": "No .codeflash/ directory found. Could not parse org/project from git remote. Ask the user for the organization and project name, then run: bash scripts/scaffold.sh <org> <project> .codeflash/<org>/<project>"
}
EOF
  exit 0
fi

PROJECT_DIR="$CF_DIR/$ORG/$PROJECT"

# Skip bootstrap when working on the agent repo itself
if [ "$ORG" = "codeflash-ai" ] && [ "$PROJECT" = "codeflash-agent" ]; then
  exit 0
fi

# Ensure observability dir exists
mkdir -p "$CF_DIR/observability"

# Already initialized — tell Claude to read the existing files
if [ -d "$PROJECT_DIR" ]; then
  cat <<EOF
{
  "systemMessage": "Read $PROJECT_DIR/status.md and $PROJECT_DIR/data/results.tsv to understand current project state before starting work. Benchmark scripts are in $PROJECT_DIR/bench/, VM infra is in $PROJECT_DIR/infra/."
}
EOF
  exit 0
fi

# Scaffold using the shared generator
if [ -x "$SCAFFOLD" ]; then
  bash "$SCAFFOLD" "$ORG" "$PROJECT" "$PROJECT_DIR"
else
  echo "Warning: $SCAFFOLD not found, cannot scaffold" >&2
  exit 0
fi

cat <<EOF
{
  "systemMessage": "Scaffolded $PROJECT_DIR/ with status.md, bench/, data/results.tsv, infra/cloud-init.yaml, and infra/vm-manage.sh. Fill in: $PROJECT_DIR/status.md with current project state, completed work, next steps, and blockers; $PROJECT_DIR/data/results.tsv with benchmark results as they are collected; $PROJECT_DIR/bench/ with project-specific benchmark scripts; $PROJECT_DIR/infra/cloud-init.yaml with project-specific setup and benchmark file entries; $PROJECT_DIR/infra/vm-manage.sh is ready to use -- run 'bash $PROJECT_DIR/infra/vm-manage.sh create' to provision."
}
EOF

exit 0
21 .claude/hooks/status-line.sh Executable file
@ -0,0 +1,21 @@
#!/usr/bin/env bash
# Status line: Show active codeflash org/project.

cd "$CLAUDE_PROJECT_DIR" 2>/dev/null || exit 0

CF_DIR="$CLAUDE_PROJECT_DIR/.codeflash"
[ -d "$CF_DIR" ] || exit 0

# Find the org/project directory (first one found)
for ORG_DIR in "$CF_DIR"/*/; do
  [ -d "$ORG_DIR" ] || continue
  ORG=$(basename "$ORG_DIR")
  for PROJ_DIR in "$ORG_DIR"/*/; do
    [ -d "$PROJ_DIR" ] || continue
    PROJECT=$(basename "$PROJ_DIR")
    echo "codeflash-agent is working in: $ORG/$PROJECT"
    exit 0
  done
done

exit 0
12 .claude/hooks/track-read.sh Executable file
@ -0,0 +1,12 @@
#!/usr/bin/env bash
# PostToolUse hook: Track Read calls for the require-read guard.

INPUT=$(cat 2>/dev/null || true)
FILE_PATH=$(echo "$INPUT" | jq -r '.tool_input.file_path // empty' 2>/dev/null || true)

[ -z "$FILE_PATH" ] && exit 0

TRACKER_DIR="$CLAUDE_PROJECT_DIR/.codeflash/observability"
mkdir -p "$TRACKER_DIR"
echo "$FILE_PATH" >> "$TRACKER_DIR/read-tracker"
exit 0
@ -22,19 +22,6 @@ Every commit must be a single, self-contained logical change. Tests must pass at
- Use the body for *why*, not *what* — the diff shows what changed
- Reference the pipeline stage or roadmap item when relevant

## Verification

Before every commit, all checks must pass:

```bash
prek run --all-files
uv run pytest packages/ -v
```

`prek run --all-files` runs ruff check, ruff format, interrogate, and mypy. pytest is a pre-push hook and must be run separately before pushing.

If a check fails, fix it in the same commit — don't create a separate "fix lint" commit.

## Branch Hygiene

- Delete feature branches locally after merging into main (`git branch -d <branch>`)
3 .claude/rules/github.md Normal file
@ -0,0 +1,3 @@
# GitHub Interactions

Prefer MCP GitHub tools (`mcp__github__*`) over the `gh` CLI for all GitHub operations. Only fall back to `gh` via Bash when no matching MCP tool exists.
29 .claude/rules/optimization-projects.md Normal file
@ -0,0 +1,29 @@
# Optimization Project Workflow

## Location

Active optimization data lives in `.codeflash/{org}/{project}/` on main. Summaries are built into `case-studies/{org}/{project}/`.

## Status tracking

Every optimization project has a `status.md` at its root. Update it after every session:

- What was completed this session
- What's next
- Current branches in the target repo
- VM state (running/deallocated)
- Any blockers

This file persists across sessions -- it's the source of truth for resuming work, not session memory.

## Recording results

After every VM benchmark run:

1. Append to `data/results.tsv` with the commit, target, before/after numbers
2. Update `README.md` results table if the optimization is kept
3. Update `status.md` with current state

## Committing

Commit optimization data changes to main alongside other work. This data is part of the repo, not isolated on branches.
19 .claude/rules/sessions.md Normal file
@ -0,0 +1,19 @@
# Session Discipline

## Scope

One task per session. Don't mix implementation with communication drafting, transcript search, or strategic planning. These have different context needs and dilute each other.

## Duration

Cap sessions at 2-3 hours. Use `/handoff` at natural breakpoints rather than letting auto-compaction degrade context. If the session has overflowed context once, strongly consider starting a new session.

## Context preservation

- Update `status.md` in the optimization project after completing any milestone
- When compacting, preserve: modified files list, current branch, VM state, test commands used, key decisions made
- Use subagents for exploration to keep main context clean

## Avoid polling

Don't use `/loop` to poll agent status -- it burns context on repetitive status messages. If you need to monitor a long-running agent, check the output file directly.
@ -1,12 +1,35 @@
{
  "env": {
    "CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS": "1",
    "ENABLE_LSP_TOOL": "1",
    "ENABLE_TOOL_SEARCH": "true"
  },
  "attribution": {
    "commit": "",
    "pr": ""
  },
  "includeCoAuthoredBy": false,
  "permissions": {
    "allow": [
      "Bash(git status)",
      "Bash(git diff *)",
      "Bash(git log *)",
      "Bash(git branch *)",
      "Bash(git show *)",
      "Bash(git fetch *)",
      "Bash(git checkout *)",
      "Bash(uv run *)",
      "Bash(uv sync *)",
      "Bash(uv pip *)",
      "Bash(prek *)",
      "Bash(make *)",
      "Bash(pytest *)",
      "Bash(ruff *)",
      "Bash(mypy *)",
      "Bash(gh *)",
      "Bash(ssh *)",
      "Bash(hyperfine *)",
      "Bash(codeflash *)",
      "mcp__github__search_pull_requests"
    ]
  },
@ -14,20 +37,77 @@
    "evals/**/CLAUDE.md"
  ],
  "hooks": {
    "Stop": [
    "PreToolUse": [
      {
        "matcher": "",
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "$CLAUDE_PROJECT_DIR/.claude/hooks/check-roadmap.sh",
            "command": "$CLAUDE_PROJECT_DIR/.claude/hooks/bash-guard.sh",
            "timeout": 5
          }
        ]
      },
      {
        "matcher": "Write",
        "hooks": [
          {
            "type": "command",
            "command": "$CLAUDE_PROJECT_DIR/.claude/hooks/require-read.sh",
            "timeout": 5
          }
        ]
      },
      {
        "matcher": "Edit",
        "hooks": [
          {
            "type": "command",
            "command": "$CLAUDE_PROJECT_DIR/.claude/hooks/require-read.sh",
            "timeout": 5
          }
        ]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Read",
        "hooks": [
          {
            "type": "command",
            "command": "$CLAUDE_PROJECT_DIR/.claude/hooks/track-read.sh",
            "timeout": 5
          }
        ]
      }
    ],
    "PostCompact": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "$CLAUDE_PROJECT_DIR/.claude/hooks/post-compact.sh",
            "timeout": 10
          }
        ]
      }
    ],
    "SessionStart": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "$CLAUDE_PROJECT_DIR/.claude/hooks/session-start.sh",
            "timeout": 10
          }
        ]
      }
    ]
  },
  "enabledPlugins": {
    "codex@codeflash": true
  }
  "statusLine": {
    "type": "command",
    "command": "$CLAUDE_PROJECT_DIR/.claude/hooks/status-line.sh"
  },
  "enableAllProjectMcpServers": true,
  "enabledPlugins": {}
}
46 .codeflash/coveragepy/coveragepy/README.md Normal file

@@ -0,0 +1,46 @@
# coveragepy Performance Optimization

Upstream performance improvements to [nedbat/coveragepy](https://github.com/nedbat/coveragepy), the standard Python code coverage measurement tool by Ned Batchelder.

## Background

coverage.py instruments Python execution to measure which lines and branches are exercised by tests. It's used by virtually every Python project with CI coverage gates. Performance matters because coverage overhead directly increases test suite wall time — often 2-5x slower than uncovered execution.

Profiling reveals optimization surfaces in both the trace-loop hot path and the data persistence layer.

## Optimization Targets

### Data Collection (Phase 1 — highest leverage)

| Target | File | Approach |
|---|---|---|
| numbits encoding/union | `numbits.py` | Pre-allocate bytearray, replace `zip_longest` with explicit loop |
| `add_lines()` / `add_arcs()` batching | `sqldata.py` | Batch SQL INSERTs, reduce numbits round-trips |
| `should_trace()` sys.path check | `inorout.py` | Hash sys.path instead of full list comparison |
| `mapped_file_dict()` flush | `collector.py` | Snapshot strategy instead of retry loop |
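The numbits row can be sketched concretely. This is a minimal illustration of the pre-allocation idea, assuming numbits are `bytes` blobs whose set bits encode line numbers (as in `coverage.numbits`); it is not coverage.py's actual implementation:

```python
from itertools import zip_longest


def numbits_union_zip(nb1: bytes, nb2: bytes) -> bytes:
    # Baseline shape: generator-based byte-wise OR via zip_longest.
    return bytes(b1 | b2 for b1, b2 in zip_longest(nb1, nb2, fillvalue=0))


def numbits_union_loop(nb1: bytes, nb2: bytes) -> bytes:
    # Pre-allocate a bytearray from the longer blob, then OR the shorter
    # blob over its prefix with an explicit loop — no generator machinery.
    if len(nb1) < len(nb2):
        nb1, nb2 = nb2, nb1
    out = bytearray(nb1)
    for i, b in enumerate(nb2):
        out[i] |= b
    return bytes(out)
```

Both produce identical unions; the loop version avoids per-pair tuple allocation on the hot path.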
### Parsing & Analysis (Phase 2)

| Target | File | Approach |
|---|---|---|
| `PythonParser.parse_source()` | `parser.py` | Memoize tokenization, bulk newline indexing |
| `Analysis` set operations | `results.py` | Defer expensive calculations to lazy properties |
| SQLite query caching | `sqldata.py` | Cache `lines()`/`arcs()` results per context |
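The "lazy properties" approach for `Analysis` can be sketched with `functools.cached_property`. The class shape here is hypothetical (plain line-number sets), not coverage.py's real `Analysis` API:

```python
from functools import cached_property


class Analysis:
    # Sketch: defer an expensive set computation until first access.
    def __init__(self, statements: set[int], executed: set[int]):
        self.statements = statements
        self.executed = executed

    @cached_property
    def missing(self) -> set[int]:
        # Computed once on demand, then cached on the instance —
        # reports that never read it never pay for it.
        return self.statements - self.executed


a = Analysis({1, 2, 3, 4}, {1, 3})
print(sorted(a.missing))  # → [2, 4]
```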
### Reporting (Phase 3)

| Target | File | Approach |
|---|---|---|
| HTML report generation | `html.py` | Pre-compute analysis metadata, batch rendering |
| Path normalization | `files.py` | Verify cache hit rates, batch path ops |

## Results

_No optimizations applied yet._

## PRs

_None yet._

| PR | Branch | Status | Description |
|---|---|---|---|
1 .codeflash/coveragepy/coveragepy/data/results.tsv Normal file

@@ -0,0 +1 @@
date	commit	target	metric	before	after	speedup	notes
0 .codeflash/coveragepy/coveragepy/infra/.gitkeep Normal file

250 .codeflash/coveragepy/coveragepy/infra/cloud-init.yaml Normal file

@@ -0,0 +1,250 @@
#cloud-config
package_update: true
packages:
  - git
  - build-essential
  - curl
  - wget
  - jq
  - linux-tools-common
  - linux-tools-generic

write_files:
  - path: /home/azureuser/bench/bench_numbits.py
    owner: azureuser:azureuser
    permissions: "0755"
    defer: true
    content: |
      #!/usr/bin/env python3
      """Micro-benchmark for coverage.py numbits operations."""
      import json
      import random
      import sys
      import timeit

      sys.path.insert(0, "/home/azureuser/coveragepy")
      from coverage.numbits import (
          nums_to_numbits,
          numbits_to_nums,
          numbits_union,
          numbits_intersection,
          numbits_any_intersection,
          num_in_numbits,
      )

      random.seed(42)

      SMALL = set(random.sample(range(1, 200), 50))
      MEDIUM = set(random.sample(range(1, 2000), 500))
      LARGE = set(random.sample(range(1, 10000), 3000))

      SMALL_NB = nums_to_numbits(SMALL)
      MEDIUM_NB = nums_to_numbits(MEDIUM)
      LARGE_NB = nums_to_numbits(LARGE)

      SMALL_NB2 = nums_to_numbits(set(random.sample(range(1, 200), 50)))
      MEDIUM_NB2 = nums_to_numbits(set(random.sample(range(1, 2000), 500)))
      LARGE_NB2 = nums_to_numbits(set(random.sample(range(1, 10000), 3000)))

      N = 10_000

      benchmarks = {
          "nums_to_numbits (small)": lambda: nums_to_numbits(SMALL),
          "nums_to_numbits (medium)": lambda: nums_to_numbits(MEDIUM),
          "nums_to_numbits (large)": lambda: nums_to_numbits(LARGE),
          "numbits_to_nums (small)": lambda: numbits_to_nums(SMALL_NB),
          "numbits_to_nums (medium)": lambda: numbits_to_nums(MEDIUM_NB),
          "numbits_to_nums (large)": lambda: numbits_to_nums(LARGE_NB),
          "numbits_union (small)": lambda: numbits_union(SMALL_NB, SMALL_NB2),
          "numbits_union (medium)": lambda: numbits_union(MEDIUM_NB, MEDIUM_NB2),
          "numbits_union (large)": lambda: numbits_union(LARGE_NB, LARGE_NB2),
          "numbits_intersection (small)": lambda: numbits_intersection(SMALL_NB, SMALL_NB2),
          "numbits_intersection (medium)": lambda: numbits_intersection(MEDIUM_NB, MEDIUM_NB2),
          "numbits_intersection (large)": lambda: numbits_intersection(LARGE_NB, LARGE_NB2),
          "numbits_any_intersection (small)": lambda: numbits_any_intersection(SMALL_NB, SMALL_NB2),
          "numbits_any_intersection (medium)": lambda: numbits_any_intersection(MEDIUM_NB, MEDIUM_NB2),
          "numbits_any_intersection (large)": lambda: numbits_any_intersection(LARGE_NB, LARGE_NB2),
          "num_in_numbits (small)": lambda: num_in_numbits(100, SMALL_NB),
          "num_in_numbits (medium)": lambda: num_in_numbits(1000, MEDIUM_NB),
          "num_in_numbits (large)": lambda: num_in_numbits(5000, LARGE_NB),
      }

      outfile = sys.argv[1] if len(sys.argv) > 1 else None
      results = {}

      print(f"{'Benchmark':<45} {'Time (us)':>12}")
      print("-" * 58)
      for name, func in benchmarks.items():
          t = timeit.timeit(func, number=N)
          us = t / N * 1_000_000
          results[name] = us
          print(f"{name:<45} {us:>10.2f}us")

      if outfile:
          with open(outfile, "w") as f:
              json.dump(results, f, indent=2)
          print(f"\nJSON written to {outfile}")

  - path: /home/azureuser/bench/bench_e2e.sh
    owner: azureuser:azureuser
    permissions: "0755"
    defer: true
    content: |
      #!/usr/bin/env bash
      set -euo pipefail
      PYTHON="$HOME/coveragepy/.venv/bin/python"
      COVERAGE="$HOME/coveragepy/.venv/bin/coverage"

      echo "=== coverage.py E2E benchmarks ==="
      echo "Python: $($PYTHON --version)"
      echo "Coverage: $($COVERAGE --version | head -1)"
      echo ""

      # Create a synthetic workload: many-file project
      WORKLOAD="$HOME/bench/workload"
      if [ ! -d "$WORKLOAD" ]; then
        echo "--- Creating synthetic workload ---"
        mkdir -p "$WORKLOAD"
        $PYTHON -c "
      import os
      for i in range(200):
          with open(os.path.join('$WORKLOAD', f'mod_{i}.py'), 'w') as f:
              f.write(f'def func_{i}():\n')
              for j in range(50):
                  f.write(f'    x_{j} = {j} * {i}\n')
              f.write(f'    return x_0\n\n')
      with open(os.path.join('$WORKLOAD', 'run_all.py'), 'w') as f:
          for i in range(200):
              f.write(f'from mod_{i} import func_{i}\n')
          for i in range(200):
              f.write(f'func_{i}()\n')
      "
      fi

      echo "--- coverage run (200 modules, 50 lines each) ---"
      hyperfine --warmup 5 --min-runs 30 --shell=none \
        --command-name 'coverage run' \
        "$COVERAGE run $WORKLOAD/run_all.py"

      echo ""
      echo "--- coverage json (report generation) ---"
      $COVERAGE run "$WORKLOAD/run_all.py" 2>/dev/null
      hyperfine --warmup 3 --min-runs 20 --shell=none \
        --command-name 'coverage json' \
        "$COVERAGE json -o /dev/null"

      echo ""
      echo "--- baseline (no coverage) ---"
      hyperfine --warmup 5 --min-runs 30 --shell=none \
        --command-name 'no coverage' \
        "$PYTHON $WORKLOAD/run_all.py"

  - path: /home/azureuser/bench/bench_all.sh
    owner: azureuser:azureuser
    permissions: "0755"
    defer: true
    content: |
      #!/usr/bin/env bash
      set -euo pipefail
      BRANCH="${1:?Usage: bench_all.sh <branch>}"
      TS=$(date +%Y%m%d-%H%M%S)
      OUTDIR="$HOME/results/${BRANCH//\//-}-${TS}"
      mkdir -p "$OUTDIR"
      PYTHON="$HOME/coveragepy/.venv/bin/python"

      cd ~/coveragepy
      git fetch origin
      git checkout "$BRANCH"
      export PATH="$HOME/.local/bin:$PATH"
      uv pip install -e .

      echo "=== Benchmarking branch: $BRANCH ==="
      echo "Output: $OUTDIR"
      echo ""

      echo "--- Micro: numbits ---"
      $PYTHON ~/bench/bench_numbits.py "$OUTDIR/numbits.json"

      echo ""
      echo "--- E2E ---"
      bash ~/bench/bench_e2e.sh 2>&1 | tee "$OUTDIR/e2e.txt"

      echo ""
      echo "Results saved to $OUTDIR/"
      ls -la "$OUTDIR/"

  - path: /home/azureuser/bench/bench_compare.sh
    owner: azureuser:azureuser
    permissions: "0755"
    defer: true
    content: |
      #!/usr/bin/env bash
      set -euo pipefail
      BASE="${1:?Usage: bench_compare.sh <base-branch> <opt-branch>}"
      OPT="${2:?Usage: bench_compare.sh <base-branch> <opt-branch>}"

      echo "=== Comparing $BASE vs $OPT ==="
      bash ~/bench/bench_all.sh "$BASE"
      bash ~/bench/bench_all.sh "$OPT"

      echo ""
      echo "Compare results in ~/results/"
      ls ~/results/

  - path: /home/azureuser/setup_coveragepy.sh
    owner: azureuser:azureuser
    permissions: "0755"
    defer: true
    content: |
      #!/usr/bin/env bash
      set -euo pipefail
      export PATH="$HOME/.local/bin:$PATH"

      echo "=== Installing uv ==="
      curl -LsSf https://astral.sh/uv/install.sh | sh
      export PATH="$HOME/.local/bin:$PATH"

      echo "=== Installing Python ==="
      uv python install 3.13

      echo "=== Cloning coveragepy ==="
      git clone https://github.com/nedbat/coveragepy.git ~/coveragepy

      echo "=== Creating venv and installing ==="
      cd ~/coveragepy
      uv venv --python 3.13
      uv pip install -e ".[dev]"

      echo "=== Installing profiling tools ==="
      uv pip install memray py-spy

      echo "=== Creating results directory ==="
      mkdir -p ~/results

      echo "=== Done ==="
      ~/coveragepy/.venv/bin/python -c "import coverage; print(f'coverage {coverage.__version__} installed')"

  - path: /home/azureuser/bin/gh-auth-token.sh
    owner: azureuser:azureuser
    permissions: "0755"
    defer: true
    content: |
      #!/usr/bin/env bash
      set -euo pipefail
      if [ -z "${GH_TOKEN:-}" ]; then
        echo "Error: GH_TOKEN not set. Pass it via:"
        echo "  export GH_TOKEN=ghp_... && ssh -o SendEnv=GH_TOKEN azureuser@<ip> 'bash ~/bin/gh-auth-token.sh'"
        exit 1
      fi
      echo "$GH_TOKEN" | gh auth login --with-token
      gh auth status

runcmd:
  - wget -q https://github.com/sharkdp/hyperfine/releases/download/v1.19.0/hyperfine_1.19.0_amd64.deb -O /tmp/hyperfine.deb
  - dpkg -i /tmp/hyperfine.deb
  # Install GitHub CLI
  - curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg -o /usr/share/keyrings/githubcli-archive-keyring.gpg
  - chmod go+r /usr/share/keyrings/githubcli-archive-keyring.gpg
  - echo "deb [arch=amd64 signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" > /etc/apt/sources.list.d/github-cli.list
  - apt-get update -qq && apt-get install -y gh
  - su - azureuser -c 'bash /home/azureuser/setup_coveragepy.sh'
112 .codeflash/coveragepy/coveragepy/infra/vm-manage.sh Executable file

@@ -0,0 +1,112 @@
#!/usr/bin/env bash
# Manage the coveragepy-bench Azure VM
set -euo pipefail

RG="COVERAGEPY-BENCH-RG"
VM="coveragepy-bench"
REGION="westus2"
SIZE="Standard_D2s_v5"
IMAGE="Canonical:ubuntu-24_04-lts:server:latest"
SSH_KEY="${SSH_KEY:-$HOME/.ssh/id_ed25519.pub}"

case "${1:-help}" in
  create)
    if [ ! -f "$SSH_KEY" ]; then
      echo "Error: SSH public key not found at $SSH_KEY"
      echo "Generate one: ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519"
      echo "Or set SSH_KEY=/path/to/key.pub"
      exit 1
    fi

    echo "Creating resource group..."
    az group create --name "$RG" --location "$REGION" --only-show-errors --output none

    echo "Creating VM (Trusted Launch, SSH-only, locked-down NSG)..."
    az vm create \
      --resource-group "$RG" \
      --name "$VM" \
      --image "$IMAGE" \
      --size "$SIZE" \
      --os-disk-size-gb 64 \
      --admin-username azureuser \
      --ssh-key-values "$SSH_KEY" \
      --authentication-type ssh \
      --security-type TrustedLaunch \
      --enable-secure-boot true \
      --enable-vtpm true \
      --nsg-rule NONE \
      --custom-data infra/cloud-init.yaml \
      --only-show-errors

    MY_IP=$(curl -s ifconfig.me)
    echo "Restricting SSH to $MY_IP..."
    az network nsg rule create \
      --resource-group "$RG" \
      --nsg-name "${VM}NSG" \
      --name AllowSSHFromMyIP \
      --priority 1000 \
      --source-address-prefixes "$MY_IP/32" \
      --destination-port-ranges 22 \
      --access Allow \
      --protocol Tcp \
      --output none

    echo "VM created. Get IP with: $0 ip"
    ;;

  start)
    echo "Starting VM..."
    az vm start --resource-group "$RG" --name "$VM"
    echo "Started. IP: $(az vm show -g "$RG" -n "$VM" -d --query publicIps -o tsv)"
    ;;

  stop)
    echo "Deallocating VM (stops billing)..."
    az vm deallocate --resource-group "$RG" --name "$VM"
    echo "Deallocated."
    ;;

  ip)
    az vm show -g "$RG" -n "$VM" -d --query publicIps -o tsv
    ;;

  ssh)
    IP=$(az vm show -g "$RG" -n "$VM" -d --query publicIps -o tsv)
    ssh azureuser@"$IP" "${@:2}"
    ;;

  bench)
    BRANCH="${2:?Usage: $0 bench <branch>}"
    IP=$(az vm show -g "$RG" -n "$VM" -d --query publicIps -o tsv)
    ssh azureuser@"$IP" "bash ~/bench/bench_all.sh $BRANCH"
    ;;

  gh-auth)
    if [ -z "${GH_TOKEN:-}" ]; then
      echo "Error: GH_TOKEN not set."
      echo "Usage: GH_TOKEN=ghp_... $0 gh-auth"
      exit 1
    fi
    IP=$(az vm show -g "$RG" -n "$VM" -d --query publicIps -o tsv)
    ssh -o SendEnv=GH_TOKEN azureuser@"$IP" "bash ~/bin/gh-auth-token.sh"
    ;;

  destroy)
    echo "Destroying resource group (all resources)..."
    az group delete --name "$RG" --yes --no-wait
    echo "Deletion started."
    ;;

  help|*)
    echo "Usage: $0 {create|start|stop|ip|ssh|bench <branch>|gh-auth|destroy}"
    echo ""
    echo "  create  - Provision VM with cloud-init"
    echo "  start   - Start deallocated VM"
    echo "  stop    - Deallocate VM (stops billing)"
    echo "  ip      - Show VM public IP"
    echo "  ssh     - SSH into VM"
    echo "  bench   - Run benchmarks on a branch"
    echo "  gh-auth - Authenticate gh CLI on VM (requires GH_TOKEN)"
    echo "  destroy - Delete resource group and all resources"
    ;;
esac
36 .codeflash/coveragepy/coveragepy/status.md Normal file

@@ -0,0 +1,36 @@
# coveragepy Status

Last updated: 2026-04-10

## Current state

Scouting complete. Optimization targets identified; no PRs opened yet.

## Target repo

`~/Desktop/work/coveragepy_org/coveragepy`

## PRs

None yet.

## Key findings

- **Hot path**: `PyTracer._trace()` is called on every line/call/return event — dictionary lookups and `set.add()` on every hit
- **Data persistence**: numbits encoding uses `zip_longest` generators; `add_lines`/`add_arcs` do per-file SQL round-trips
- **Quick wins**: `numbits_union()` optimization (low complexity), sys.path hash caching in `should_trace()`, SQL batching in `add_lines`/`add_arcs`
- **Parser**: `PythonParser` does multi-pass source text analysis with repeated `text.count("\n")` calls

## VM

Not provisioned. coverage.py is pure Python + C extension — local benchmarking is sufficient. Use the existing test suite + `hyperfine` for end-to-end, `pytest-benchmark` for targeted functions.

## Next steps

1. Start with numbits optimizations (lowest risk, clear speedup path)
2. Batch SQL operations in `add_lines`/`add_arcs`
3. Benchmark with coverage's own test suite as workload

## Blockers

None.
234 .codeflash/microsoft/typeagent/README.md Normal file

@@ -0,0 +1,234 @@
# typeagent Performance Optimization

Upstream performance improvements to [microsoft/typeagent-py](https://github.com/microsoft/typeagent-py), a structured RAG library (ingest, index, query) for Python.

## Background

typeagent-py is Microsoft's Python library for structured knowledge processing — ingesting conversations, building semantic indexes, and querying them with LLM-backed answer generation. Profiling `import typeagent` revealed unnecessary eager imports pulling in heavy dependencies (like `black`, a code formatter) at module load time, even when they're only used in cold debug/formatting paths.

In addition to optimizing the core library, we optimized the vector search hot paths in a community contributor's open optimization PR ([microsoft/typeagent-py#228](https://github.com/microsoft/typeagent-py/pull/228) — "Auto-tune Embedding Model Parameters & Add Benchmarking Tool"). That PR was itself a performance effort — tuning embedding model parameters and adding benchmarking tooling — and we further improved its code with 1.4x–14.2x speedups on the search paths it touches.

## Results

### Import Time (hyperfine, 30 runs, Standard_D2s_v5)

| Benchmark | Before | After | Speedup |
|---|---:|---:|---:|
| `import typeagent` | 791 ms | 683 ms | **1.16x** |

Import-time breakdown by PR:

| PR | Before | After | Δ | Cumulative |
|---|---:|---:|---:|---:|
| #229 defer-black | 791 ms | 713 ms | −78 ms | 713 ms |
| #235 optional-black | 713 ms | 734 ms* | — | 734 ms |
| #236 defer-query-imports | 734 ms | 683 ms | −51 ms | 683 ms |

\* PR #235 replaces `black` with stdlib `pprint` — no import-time change expected; the 713→734 delta is measurement noise within hyperfine variance.

### Offline E2E Test Suite (hyperfine, 10 runs, Standard_D2s_v5)

| Benchmark | Before | After | Speedup |
|---|---:|---:|---:|
| 69 offline tests | 5.72 s | 5.60 s | **1.02x** |

### Indexing Pipeline (pytest-async-benchmark pedantic, 20 rounds, Standard_D2s_v5)

| Benchmark | Before (min) | After (min) | Speedup |
|---|---:|---:|---:|
| `add_messages_with_indexing` (200 msgs) | 28.8 ms | 25.0 ms | **1.16x** |
| `add_messages_with_indexing` (50 msgs) | 7.8 ms | 6.7 ms | **1.16x** |
| VTT ingest (40 msgs) | 6.9 ms | 6.1 ms | **1.14x** |

> Consistent ~14-16% improvement across all message counts. Only the hot path is timed — setup (DB creation, storage init) and teardown (close, delete) are excluded via `async_benchmark.pedantic()`. All 69 tests pass before and after.

### Query (pytest-async-benchmark pedantic, 200 rounds, Standard_D2s_v5)

| Benchmark | Before (median) | After (median) | Speedup |
|---|---:|---:|---:|
| `lookup_term_filtered` (200 matches) | 2.652 ms | 1.260 ms | **2.10x** |
| `group_matches_by_type` (200 matches) | 2.453 ms | 992 μs | **2.47x** |
| `get_scored_semantic_refs_from_ordinals_iter` (200 matches) | 2.511 ms | 2.979 ms | 0.84x |
| `lookup_property_in_property_index` (200 matches) | 24.484 ms | 9.376 ms | **2.61x** |
| `get_matches_in_scope` (200 matches) | 24.062 ms | 9.185 ms | **2.62x** |

> 200 matches against a 200-message indexed SQLite transcript. Only the function under test is timed. Includes batch metadata query, binary search in `contains_range`, inline tuple comparisons in `TextRange`, and skipping pydantic validation in `get_metadata_multiple`.

### Vector Search (pytest-async-benchmark pedantic, 200 rounds, Standard_D2s_v5)

| Benchmark | Before (min) | After (min) | Speedup |
|---|---:|---:|---:|
| `fuzzy_lookup_embedding` (1K vecs) | 257 μs | 70 μs | **3.7x** |
| `fuzzy_lookup_embedding` (10K vecs) | 5.72 ms | 559 μs | **10.2x** |
| `fuzzy_lookup_embedding` (10K + predicate) | 4.79 ms | 3.41 ms | **1.4x** |
| `fuzzy_lookup_embedding_in_subset` (1K of 10K) | 3.45 ms | 243 μs | **14.2x** |

> 384-dim embeddings, normalized. The no-predicate path (most common in practice) sees the largest gains by staying entirely in numpy. The subset lookup benefits from computing dot products only for subset indices instead of all vectors. This optimization applies to code from [microsoft/typeagent-py#228](https://github.com/microsoft/typeagent-py/pull/228) (not yet merged upstream); the PR was opened against the contributor's fork at [shreejaykurhade/typeagent-py#1](https://github.com/shreejaykurhade/typeagent-py/pull/1) and has been merged there.

## What We Changed

### Startup / Import

**Defer `black` import to first use** ([KRRT7/typeagent-py#4](https://github.com/KRRT7/typeagent-py/pull/4))

`black` (the code formatter) was imported at module level in two files but only used in cold formatting paths:

- **`knowpro/answers.py`** — `black.format_str()` called only in `create_context_prompt()` to pretty-print debug context
- **`aitools/utils.py`** — `black.format_str()` called only in `format_code()` for terminal output formatting

Moved `import black` inside each function. `black` pulls in `pathspec`, `platformdirs`, `tomli`, and its own parser — none of which are needed until someone actually formats code.
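The deferred-import pattern is simple to sketch. Here stdlib `pprint` stands in for `black` so the example runs without external dependencies; the mechanics are identical for any heavy module:

```python
def format_debug(obj) -> str:
    # Deferred import: the formatter module loads on the first call to this
    # cold path, not when the enclosing module is imported. (`pprint` is a
    # stand-in for `black` here; the pattern is what matters.)
    import pprint
    return pprint.pformat(obj)


# The formatter's import cost is paid only if the cold path actually runs.
print(format_debug({"zebra": 1, "apple": 2}))  # → {'apple': 2, 'zebra': 1}
```

Repeated calls re-execute the `import` statement, but after the first call it is a cheap `sys.modules` cache hit.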
**Replace `black` with stdlib `pprint` for runtime formatting** ([microsoft/typeagent-py#235](https://github.com/microsoft/typeagent-py/pull/235))

After deferring `black` in #229, we went further: removed `black` from runtime dependencies entirely. The two call sites (`create_context_prompt` in `answers.py` and `format_code`/`pretty_print` in `utils.py`) only format Python data structures — `pprint.pformat` produces equivalent output with zero external dependencies. `format_code` now uses `ast.literal_eval` to round-trip `repr()` strings back to objects for `pprint` formatting. `black` moved to dev-only dependencies.

**Defer query-time imports in `conversation_base`** ([microsoft/typeagent-py#236](https://github.com/microsoft/typeagent-py/pull/236))

`conversation_base.py` eagerly imported `answers`, `searchlang`, `search_query_schema`, and `answer_response_schema` at module level. These modules are only needed when `query()` is called — not during ingestion or indexing. Moved these imports into the `query()` method body and added `TYPE_CHECKING` guards for type annotations. Saves ~51 ms (734 ms → 683 ms).

The largest remaining import-time cost is `pydantic_ai` at ~161 ms — this is an upstream issue outside typeagent's control, noted in the PR body.
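The `TYPE_CHECKING` guard plus method-body import can be sketched as follows. The class and the `json` stand-in module are illustrative, not typeagent's actual code:

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    # Seen only by the type checker — zero cost at runtime import.
    # `json` stands in for the heavy query-time modules named in the PR.
    import json  # noqa: F401


class ConversationBase:
    def query(self, raw: str) -> Any:
        # The heavy dependency loads on the first query() call,
        # not during ingestion or indexing.
        import json
        return json.loads(raw)


print(ConversationBase().query('{"answer": 42}')["answer"])  # → 42
```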
**Reproducer** (upstream-friendly, uses `$RUNNER`):

```bash
# Before (main)
$RUNNER -X importtime -c 'import typeagent' 2>&1 | sort -t'|' -k1 -rn | head -20

# After (optimization branch)
# black no longer appears in the import chain at all
```

### Runtime / Indexing

**Batch SQLite INSERTs for indexing pipeline** ([KRRT7/typeagent-py#5](https://github.com/KRRT7/typeagent-py/pull/5))

The indexing pipeline (`add_messages_with_indexing`) was issuing individual `cursor.execute()` calls for every semantic ref term and property — over 1000 individual INSERT calls for 200 messages. Added `add_terms_batch` and `add_properties_batch` to the interface protocols, with SQLite backends using `executemany` to batch all inserts into 2–3 calls. Restructured the callers (`add_metadata_to_index_from_list`, `add_to_property_index`) to collect data via pure functions first, then batch-insert.
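A minimal sketch of the batching shape, using an in-memory SQLite database and an illustrative table name (not typeagent's actual schema):

```python
import sqlite3


def add_terms_batch(con: sqlite3.Connection, rows: list[tuple[int, str]]) -> None:
    # One executemany call replaces N individual INSERT round-trips.
    con.executemany(
        "INSERT INTO term_index (semref_id, term) VALUES (?, ?)", rows
    )


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE term_index (semref_id INTEGER, term TEXT)")

# Collect rows first (the "pure function" phase), then batch-insert once.
rows = [(1, "alpha"), (1, "beta"), (2, "gamma")]
add_terms_batch(con, rows)
print(con.execute("SELECT COUNT(*) FROM term_index").fetchone()[0])  # → 3
```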
**Reverted: batch schema DDL + pre-compiled regex** — no measurable gain, reverted.

### Vector Search

**Numpy vectorized fuzzy lookup** ([shreejaykurhade/typeagent-py#1](https://github.com/shreejaykurhade/typeagent-py/pull/1))

The `fuzzy_lookup_embedding` hot path was building Python `ScoredInt` objects for every vector, then sorting the full list. `fuzzy_lookup_embedding_in_subset` delegated to the full-scan method with a `lambda i: i in ordinals_set` predicate, computing dot products for *all* vectors and then filtering.

Replaced with numpy-native operations:

- **No-predicate path**: `np.flatnonzero` for score filtering + `np.argpartition` for O(n) top-k selection — `ScoredInt` only created for the final k results
- **Predicate path**: numpy pre-filters by score threshold, applies the predicate only to passing candidates
- **Subset lookup**: `self._vectors[subset]` fancy indexing computes dot products only for subset indices, then the same fast top-k path
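The no-predicate and subset paths can be sketched as below. This is an illustration of the numpy techniques named above, not the PR's actual code; it assumes L2-normalized vectors so the dot product is cosine similarity:

```python
import numpy as np


def fuzzy_top_k(vectors, query, k, min_score):
    # vectors: (n, d), query: (d,) — one matmul for all scores.
    scores = vectors @ query
    candidates = np.flatnonzero(scores >= min_score)  # score filter, no Python loop
    if candidates.size > k:
        # argpartition is O(n): pick top k unordered, then sort just those k.
        candidates = candidates[np.argpartition(scores[candidates], -k)[-k:]]
    order = np.argsort(scores[candidates])[::-1]      # descending by score
    top = candidates[order]
    # Python objects only for the final k results.
    return [(int(i), float(scores[i])) for i in top]


def fuzzy_top_k_subset(vectors, query, subset, k, min_score):
    # Fancy indexing: dot products computed only for the subset's rows.
    sub = np.asarray(subset)
    hits = fuzzy_top_k(vectors[sub], query, k, min_score)
    return [(int(sub[i]), s) for i, s in hits]


rng = np.random.default_rng(0)
vecs = rng.normal(size=(1000, 8))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
q = vecs[42]  # query identical to vector 42, so it must rank first
print(fuzzy_top_k(vecs, q, k=3, min_score=0.0)[0][0])  # → 42
```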
### Query

**Batch metadata query across 5 N+1 call sites** ([KRRT7/typeagent-py#7](https://github.com/KRRT7/typeagent-py/pull/7))

Five call sites used `get_item()` per scored ref — one SELECT and full deserialization per match (the N+1 pattern). Profiling showed 64% of per-row cost was deserializing `knowledge_json`, which the filters never use — they only check `knowledge_type` (a plain DB column) and/or `range` (which needs `json.loads(range_json)` only).

Added `get_metadata_multiple` to `ISemanticRefCollection` that fetches only `semref_id, range_json, knowledge_type` in a single batch query, skipping `json.loads(knowledge_json)` and `deserialize_knowledge()` entirely. Replaced the N+1 loop with one `get_metadata_multiple` call at each site.
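The batched-metadata shape can be sketched with a single `IN (...)` query; the table and column layout below is illustrative, not typeagent's actual schema:

```python
import sqlite3


def get_metadata_multiple(con: sqlite3.Connection, ids: list[int]):
    # One batched SELECT over only the cheap columns, replacing one
    # SELECT plus full-row deserialization per id (the N+1 pattern).
    placeholders = ",".join("?" * len(ids))
    sql = (
        "SELECT semref_id, range_json, knowledge_type "
        f"FROM semantic_refs WHERE semref_id IN ({placeholders})"
    )
    return {row[0]: (row[1], row[2]) for row in con.execute(sql, ids)}


con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE semantic_refs "
    "(semref_id INTEGER, range_json TEXT, knowledge_type TEXT, knowledge_json TEXT)"
)
# knowledge_json is never read by the batch query — that is the point.
con.executemany(
    "INSERT INTO semantic_refs VALUES (?, ?, ?, ?)",
    [(i, "{}", "entity", "{...large payload...}") for i in range(5)],
)
print(len(get_metadata_multiple(con, [1, 3, 4])))  # → 3
```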
Further optimized the scope-filtering path (benchmarks #4 and #5 went from 1.08x/1.06x to 2.61x/2.62x):

- **Binary search in `TextRangeCollection.contains_range`**: replaced the O(n) linear scan with `bisect_right` keyed on `start` — O(log n) lookups for the common case of non-overlapping point ranges
- **Inline tuple comparisons in `TextRange`**: replaced `TextLocation` object allocation in `__eq__`/`__lt__`/`__contains__` with a shared `_effective_end` returning tuples
- **Skip pydantic validation in `get_metadata_multiple`**: construct `TextLocation`/`TextRange` directly from JSON instead of going through `__pydantic_validator__`
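The `bisect_right` containment check can be sketched on sorted, non-overlapping half-open intervals; the parallel-list representation here is an assumption for illustration, not the real `TextRangeCollection` layout:

```python
from bisect import bisect_right


def contains_point(starts: list[int], ends: list[int], x: int) -> bool:
    # starts/ends hold sorted, non-overlapping [start, end) intervals.
    # bisect_right finds the last interval starting at or before x in
    # O(log n); x is contained iff it falls before that interval's end.
    i = bisect_right(starts, x) - 1
    return i >= 0 and x < ends[i]


starts, ends = [0, 10, 25], [5, 20, 30]
print(contains_point(starts, ends, 12), contains_point(starts, ends, 22))  # → True False
```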
|
||||
|
||||
Call sites:
|
||||
1. `lookup_term_filtered` — batch metadata, filter by knowledge_type/range (2.10x)
|
||||
2. `group_matches_by_type` — batch metadata, group by knowledge_type (2.47x)
|
||||
3. `get_scored_semantic_refs_from_ordinals_iter` — two-phase: metadata filter then batch fetch
|
||||
4. `lookup_property_in_property_index` — batch metadata + bisect + inline comparisons (2.61x)
|
||||
5. `get_matches_in_scope` — batch metadata + bisect + inline comparisons (2.62x)
|
||||
|
||||
### Bugfix
|
||||
|
||||
**Fix parse_azure_endpoint passing query string to AsyncAzureOpenAI** ([microsoft/typeagent-py#231](https://github.com/microsoft/typeagent-py/pull/231))
|
||||
|
||||
`parse_azure_endpoint` returned the full URL including `?api-version=...`. `AsyncAzureOpenAI` appends `/openai/` to `azure_endpoint`, producing a mangled URL. Now strips the query string with `str.split("?", 1)[0]`. Added 6 unit tests.
|
||||
|
||||
### Remaining Targets (identified, not yet implemented)
|
||||
|
||||
From `python -X importtime` profiling on the optimization branch (683ms baseline), sorted by self-time:
|
||||
|
||||
| Module | Self-time | Notes |
|
||||
|---|---:|---|
|
||||
| `pydantic_ai` (total) | ~161 ms | Upstream dependency — largest single cost, outside typeagent control |
|
||||
| `pydantic_ai.messages` | 22 ms | Heavy pydantic model definitions |
|
||||
| `knowledge_schema` | 17 ms | Schema initialization |
|
||||
| `griffe` | 11 ms | Used for code introspection |
|
||||
| `annotated_types` | 9 ms | Type annotation overhead |
|
||||
|
||||
## Upstream Contributions
|
||||
|
||||
Stacked PRs — merge in order (#231 first, each subsequent PR builds on the previous):
|
||||
|
||||
| Order | PR | Status | Description |
|
||||
|---|---|---|---|
|
||||
| 1 | [microsoft/typeagent-py#231](https://github.com/microsoft/typeagent-py/pull/231) | **Merged** | Fix parse_azure_endpoint passing query string to AsyncAzureOpenAI |
|
||||
| 2 | [microsoft/typeagent-py#229](https://github.com/microsoft/typeagent-py/pull/229) | **Merged** | Defer black import to first use |
|
||||
| 3 | [microsoft/typeagent-py#230](https://github.com/microsoft/typeagent-py/pull/230) | Open | Batch SQLite INSERTs for indexing pipeline |
|
||||
| 4 | [microsoft/typeagent-py#232](https://github.com/microsoft/typeagent-py/pull/232) | Open | Batch metadata query to avoid N+1 across 5 call sites |
|
||||
| 5 | [microsoft/typeagent-py#234](https://github.com/microsoft/typeagent-py/pull/234) | Open | Vectorize fuzzy_lookup_embedding with numpy ops |
|
||||
| 6 | [microsoft/typeagent-py#235](https://github.com/microsoft/typeagent-py/pull/235) | Open | Replace black with stdlib pprint for runtime formatting |
|
||||
| 7 | [microsoft/typeagent-py#236](https://github.com/microsoft/typeagent-py/pull/236) | Open | Defer query-time imports in conversation_base |
|
||||
| — | [shreejaykurhade/typeagent-py#1](https://github.com/shreejaykurhade/typeagent-py/pull/1) | **Merged** | Numpy vectorized fuzzy lookup (against PR #228's fork) |
|
||||
|
||||
### Fork PRs

| PR | Type | Description |
|---|---|---|
| [KRRT7/typeagent-py#3](https://github.com/KRRT7/typeagent-py/pull/3) | Stacked draft | Cumulative optimizations (all changes) |
| [KRRT7/typeagent-py#4](https://github.com/KRRT7/typeagent-py/pull/4) | Individual | Defer black import in answers.py and utils.py |
| [KRRT7/typeagent-py#5](https://github.com/KRRT7/typeagent-py/pull/5) | Individual | Batch SQLite INSERTs for indexing pipeline |
| [KRRT7/typeagent-py#6](https://github.com/KRRT7/typeagent-py/pull/6) | Individual | Benchmark tests for indexing pipeline |
| KRRT7/typeagent-py `perf/vectorbase-lookup` | Branch | Numpy vectorized fuzzy lookup (PR opened against contributor fork) |
| [KRRT7/typeagent-py#7](https://github.com/KRRT7/typeagent-py/pull/7) | Individual | Batch metadata query across 5 N+1 call sites + parse_azure_endpoint bugfix |
## Methodology

### Environment

- **VM**: Azure Standard_D2s_v5 (2 vCPU, 8 GB RAM, non-burstable)
- **OS**: Ubuntu 24.04 LTS
- **Region**: westus2
- **Python**: 3.13 via uv
- **Tooling**: hyperfine (warmup 5, min-runs 30), `python -X importtime`, pytest-async-benchmark (pedantic mode, 20 rounds, 3 warmup)

A non-burstable VM was chosen for consistent CPU performance — no thermal throttling or turbo variability.
### Profiling approach

1. `python -X importtime -c 'import typeagent'` — identified the heaviest imports by self-time
2. Traced each heavy import to its call site — checked whether it is needed at module level or only on a cold path
3. hyperfine A/B comparison (`bench_ab.sh main optimization`) — validated every change end-to-end
4. pytest-async-benchmark for isolated runtime benchmarks (indexing pipeline, avoids import-time noise)
5. Full offline test suite (69 tests) run before and after every change to catch regressions
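The importtime triage in step 1 can be scripted. The sketch below ranks modules by self-time from the `-X importtime` stderr format; the module names in the sample are made up for illustration, not real measurements:

```python
import re

def top_imports(importtime_output: str, n: int = 3):
    """Rank modules by self-time (us) from `python -X importtime` output.

    Data lines look like: 'import time:      1234 |       5678 | module.name'
    (self-time, cumulative, module). The header line has no digits, so the
    regex skips it automatically.
    """
    rows = []
    for line in importtime_output.splitlines():
        m = re.match(r"import time:\s+(\d+)\s+\|\s+(\d+)\s+\|\s+(\S+)", line)
        if m:
            rows.append((int(m.group(1)), m.group(3)))
    return sorted(rows, reverse=True)[:n]

# Synthetic sample in the real output format (values are illustrative)
sample = """import time: self [us] | cumulative | imported package
import time:      1200 |       1200 | pydantic
import time:     45000 |      46200 | black
import time:       300 |      46500 | typeagent"""

print(top_imports(sample))  # heaviest self-time first
```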
### Runner convention

Benchmark scripts use `.venv/bin/python` directly for accuracy (`uv run` adds ~50% overhead and 2.5x variance). Upstream reproducers use `uv run python` for portability.
### Benchmark harness

All scripts provisioned via cloud-init on the VM:

| Script | Purpose |
|---|---|
| `bench_import.sh` | `import typeagent` time via hyperfine |
| `bench_tests.sh` | Offline E2E test suite via hyperfine |
| `bench_baseline.sh` | Run both import + test benchmarks |
| `bench_compare.sh` | Single-branch benchmark (checkout + rebuild + measure) |
| `bench_ab.sh` | Side-by-side A/B comparison of two branches |
### Raw data

Results tracked in [`data/results.tsv`](data/results.tsv).
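The `speedup` column in the TSV is simply the before/after ratio. A small helper (illustrative, not part of the repo) reproduces it from the duration strings used in the table:

```python
def to_seconds(s: str) -> float:
    """Parse duration strings like '791ms', '5.72s', '257us' into seconds."""
    # Check multi-char suffixes before the bare 's' suffix
    for suffix, scale in (("us", 1e-6), ("ms", 1e-3), ("s", 1.0)):
        if s.endswith(suffix):
            return float(s[: -len(suffix)]) * scale
    raise ValueError(f"unrecognized duration: {s}")

def speedup(before: str, after: str) -> float:
    return to_seconds(before) / to_seconds(after)

print(f"{speedup('791ms', '713ms'):.2f}x")  # matches the defer-black row (1.11x)
```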
## Repo Structure

```
.
├── README.md          # This file
├── bench/             # Benchmark scripts
├── data/              # Raw benchmark data
│   └── results.tsv
└── infra/             # VM provisioning
    ├── cloud-init.yaml
    └── vm-manage.sh
```
0
.codeflash/microsoft/typeagent/bench/.gitkeep
Normal file
12
.codeflash/microsoft/typeagent/bench/bench_ab.sh
Executable file
@@ -0,0 +1,12 @@
#!/usr/bin/env bash
set -euo pipefail
BASE="${1:?Usage: bench_ab.sh <base-branch> <opt-branch>}"
OPT="${2:?Usage: bench_ab.sh <base-branch> <opt-branch>}"

echo "=== A/B comparison: $BASE vs $OPT ==="
bash ~/bench/bench_compare.sh "$BASE"
bash ~/bench/bench_compare.sh "$OPT"

echo ""
echo "Compare results in ~/results/"
ls ~/results/
14
.codeflash/microsoft/typeagent/bench/bench_baseline.sh
Executable file
@@ -0,0 +1,14 @@
#!/usr/bin/env bash
set -euo pipefail

echo "=== Running all baseline benchmarks ==="
echo ""

bash ~/bench/bench_import.sh
echo ""
bash ~/bench/bench_tests.sh

echo ""
echo "=== All baselines complete ==="
echo "Results in ~/results/"
ls -R ~/results/
32
.codeflash/microsoft/typeagent/bench/bench_compare.sh
Executable file
@@ -0,0 +1,32 @@
#!/usr/bin/env bash
set -euo pipefail
BRANCH="${1:?Usage: bench_compare.sh <branch-or-commit>}"
TS=$(date +%Y%m%d-%H%M%S)
OUTDIR="$HOME/results/${BRANCH//\//-}-${TS}"
mkdir -p "$OUTDIR"

cd ~/typeagent
git fetch origin
git checkout "$BRANCH"

# Rebuild after switching branches
export PATH="$HOME/.local/bin:$PATH"
uv sync

PYTHON=~/typeagent/.venv/bin/python

echo "=== Benchmarking branch: $BRANCH ==="

# Import time (direct venv to avoid uv run overhead)
hyperfine --warmup 5 --min-runs 30 --shell=none \
  --export-json "$OUTDIR/import.json" \
  "$PYTHON -c 'import typeagent'"

# Offline test suite
hyperfine --warmup 2 --min-runs 10 --shell=none \
  --export-json "$OUTDIR/test_suite.json" \
  "$PYTHON -m pytest tests/test_incremental_index.py tests/test_add_messages_with_indexing.py tests/test_podcast_incremental.py tests/test_sqlite_indexes.py tests/test_query.py -q"

echo ""
echo "Results saved to $OUTDIR/"
ls -la "$OUTDIR/"
22
.codeflash/microsoft/typeagent/bench/bench_import.sh
Executable file
@@ -0,0 +1,22 @@
#!/usr/bin/env bash
set -euo pipefail

PYTHON=~/typeagent/.venv/bin/python
OUTDIR="$HOME/results/import"
mkdir -p "$OUTDIR"

echo "=== typeagent import benchmark ==="
echo ""

# E2E import time (direct venv to avoid uv run overhead)
hyperfine --warmup 5 --min-runs 30 --shell=none \
  --export-json "$OUTDIR/import.json" \
  "$PYTHON -c 'import typeagent'"

# Per-module breakdown
$PYTHON -X importtime -c 'import typeagent' 2>"$OUTDIR/importtime_raw.txt"
sort -t'|' -k1 -rn "$OUTDIR/importtime_raw.txt" | head -30 > "$OUTDIR/importtime_top30.txt"

echo ""
echo "Top 30 imports by self time:"
cat "$OUTDIR/importtime_top30.txt"
25
.codeflash/microsoft/typeagent/bench/bench_tests.sh
Executable file
@@ -0,0 +1,25 @@
#!/usr/bin/env bash
set -euo pipefail

PYTHON=~/typeagent/.venv/bin/python
OUTDIR="$HOME/results/tests"
mkdir -p "$OUTDIR"

cd ~/typeagent

echo "=== typeagent offline E2E test benchmark ==="
echo ""

# Run offline tests with timing (direct venv to avoid uv run overhead)
$PYTHON -m pytest \
  tests/test_incremental_index.py \
  tests/test_add_messages_with_indexing.py \
  tests/test_podcast_incremental.py \
  tests/test_sqlite_indexes.py \
  tests/test_query.py \
  --durations=0 -v 2>&1 | tee "$OUTDIR/test_output.txt"

# Hyperfine on the full offline suite
hyperfine --warmup 2 --min-runs 10 --shell=none \
  --export-json "$OUTDIR/test_suite.json" \
  "$PYTHON -m pytest tests/test_incremental_index.py tests/test_add_messages_with_indexing.py tests/test_podcast_incremental.py tests/test_sqlite_indexes.py tests/test_query.py -q"
18
.codeflash/microsoft/typeagent/data/results.tsv
Normal file
@@ -0,0 +1,18 @@
commit	target	category	before	after	speedup	tests_passed	tests_failed	status	description
baseline	import typeagent	baseline	791ms	-	-	69	0	baseline	VM baseline: Azure Standard_D2s_v5, Python 3.13, Ubuntu 24.04
baseline	test suite (69 tests)	baseline	5.72s	-	-	69	0	baseline	VM baseline: offline E2E test suite
ecbf6f5	import typeagent	startup	791ms	713ms	1.11x	69	0	keep	Defer black import to first use in answers.py and utils.py
ecbf6f5	test suite (69 tests)	startup	5.72s	5.60s	1.02x	69	0	keep	Defer black import to first use in answers.py and utils.py
d4bc744	test suite (69 tests)	runtime	6.07s	6.08s	1.00x	69	0	reverted	Batch schema DDL into executescript + pre-compile regex -- no measurable gain
bc9f2df	add_messages_with_indexing (200 msgs)	runtime	28.8ms	25.0ms	1.16x	69	0	keep	Batch SQLite INSERTs via executemany for semref and property indexing
bc9f2df	add_messages_with_indexing (50 msgs)	runtime	7.8ms	6.7ms	1.16x	69	0	keep	Batch SQLite INSERTs via executemany for semref and property indexing
bc9f2df	VTT ingest (40 msgs)	runtime	6.9ms	6.1ms	1.14x	69	0	keep	Batch SQLite INSERTs via executemany for semref and property indexing
bc5b319	fuzzy_lookup_embedding (1K vecs)	runtime	257us	70us	3.7x	455	0	keep	Numpy vectorized fuzzy lookup -- np.flatnonzero + np.argpartition
bc5b319	fuzzy_lookup_embedding (10K vecs)	runtime	5.72ms	559us	10.2x	455	0	keep	Numpy vectorized fuzzy lookup -- np.flatnonzero + np.argpartition
bc5b319	fuzzy_lookup_embedding (10K + predicate)	runtime	4.79ms	3.41ms	1.4x	455	0	keep	Numpy vectorized fuzzy lookup -- numpy pre-filter then predicate
bc5b319	fuzzy_lookup_embedding_in_subset (1K of 10K)	runtime	3.45ms	243us	14.2x	455	0	keep	Numpy vectorized fuzzy lookup -- fancy indexing for subset dot products
bc7d230	lookup_term_filtered (200 matches)	runtime	2.652ms	1.260ms	2.10x	456	0	keep	Batch metadata query + bisect + inline tuple comparisons
bc7d230	group_matches_by_type (200 matches)	runtime	2.453ms	992us	2.47x	456	0	keep	Batch metadata query + skip pydantic validation
bc7d230	get_scored_semantic_refs_from_ordinals_iter (200 matches)	runtime	2.511ms	2.979ms	0.84x	456	0	keep	Two-phase: metadata filter then batch fetch (break-even)
bc7d230	lookup_property_in_property_index (200 matches)	runtime	24.484ms	9.376ms	2.61x	456	0	keep	Batch metadata + bisect in contains_range + inline tuple comparisons
bc7d230	get_matches_in_scope (200 matches)	runtime	24.062ms	9.185ms	2.62x	456	0	keep	Batch metadata + bisect in contains_range + inline tuple comparisons
196
.codeflash/microsoft/typeagent/infra/cloud-init.yaml
Normal file
@@ -0,0 +1,196 @@
#cloud-config
#
# Benchmark VM provisioning for microsoft/typeagent-py
#
# Structured RAG library (ingest, index, query) -- Python 3.13, uv-based build.
# Clones from KRRT7 fork, installs via uv sync (lockfile-based).
#
# Usage:
#   az vm create ... --custom-data infra/cloud-init.yaml
#
# VM: Azure Standard_D2s_v5 (2 vCPU, 8 GB RAM, non-burstable)
# Non-burstable ensures consistent CPU -- no thermal throttling or turbo variability.

package_update: true
packages:
  - git
  - build-essential
  - curl
  - wget
  - jq

write_files:
  # --- Benchmark: import time ---
  - path: /home/azureuser/bench/bench_import.sh
    owner: azureuser:azureuser
    permissions: "0755"
    defer: true
    content: |
      #!/usr/bin/env bash
      set -euo pipefail

      PYTHON=~/typeagent/.venv/bin/python
      OUTDIR="$HOME/results/import"
      mkdir -p "$OUTDIR"

      echo "=== typeagent import benchmark ==="
      echo ""

      # E2E import time (direct venv to avoid uv run overhead)
      hyperfine --warmup 5 --min-runs 30 --shell=none \
        --export-json "$OUTDIR/import.json" \
        "$PYTHON -c 'import typeagent'"

      # Per-module breakdown
      $PYTHON -X importtime -c 'import typeagent' 2>"$OUTDIR/importtime_raw.txt"
      sort -t'|' -k1 -rn "$OUTDIR/importtime_raw.txt" | head -30 > "$OUTDIR/importtime_top30.txt"

      echo ""
      echo "Top 30 imports by self time:"
      cat "$OUTDIR/importtime_top30.txt"

  # --- Benchmark: offline E2E test suite ---
  - path: /home/azureuser/bench/bench_tests.sh
    owner: azureuser:azureuser
    permissions: "0755"
    defer: true
    content: |
      #!/usr/bin/env bash
      set -euo pipefail

      PYTHON=~/typeagent/.venv/bin/python
      OUTDIR="$HOME/results/tests"
      mkdir -p "$OUTDIR"

      cd ~/typeagent

      echo "=== typeagent offline E2E test benchmark ==="
      echo ""

      # Run offline tests with timing (direct venv to avoid uv run overhead)
      $PYTHON -m pytest \
        tests/test_incremental_index.py \
        tests/test_add_messages_with_indexing.py \
        tests/test_podcast_incremental.py \
        tests/test_sqlite_indexes.py \
        tests/test_query.py \
        --durations=0 -v 2>&1 | tee "$OUTDIR/test_output.txt"

      # Hyperfine on the full offline suite
      hyperfine --warmup 2 --min-runs 10 --shell=none \
        --export-json "$OUTDIR/test_suite.json" \
        "$PYTHON -m pytest tests/test_incremental_index.py tests/test_add_messages_with_indexing.py tests/test_podcast_incremental.py tests/test_sqlite_indexes.py tests/test_query.py -q"

  # --- Benchmark: all baselines (runs import + tests) ---
  - path: /home/azureuser/bench/bench_baseline.sh
    owner: azureuser:azureuser
    permissions: "0755"
    defer: true
    content: |
      #!/usr/bin/env bash
      set -euo pipefail

      echo "=== Running all baseline benchmarks ==="
      echo ""

      bash ~/bench/bench_import.sh
      echo ""
      bash ~/bench/bench_tests.sh

      echo ""
      echo "=== All baselines complete ==="
      echo "Results in ~/results/"
      ls -R ~/results/

  # --- Benchmark: A/B branch comparison ---
  - path: /home/azureuser/bench/bench_compare.sh
    owner: azureuser:azureuser
    permissions: "0755"
    defer: true
    content: |
      #!/usr/bin/env bash
      set -euo pipefail
      BRANCH="${1:?Usage: bench_compare.sh <branch-or-commit>}"
      TS=$(date +%Y%m%d-%H%M%S)
      OUTDIR="$HOME/results/${BRANCH//\//-}-${TS}"
      mkdir -p "$OUTDIR"

      cd ~/typeagent
      git fetch origin
      git checkout "$BRANCH"

      # Rebuild after switching branches
      export PATH="$HOME/.local/bin:$PATH"
      uv sync

      PYTHON=~/typeagent/.venv/bin/python

      echo "=== Benchmarking branch: $BRANCH ==="

      # Import time (direct venv to avoid uv run overhead)
      hyperfine --warmup 5 --min-runs 30 --shell=none \
        --export-json "$OUTDIR/import.json" \
        "$PYTHON -c 'import typeagent'"

      # Offline test suite
      hyperfine --warmup 2 --min-runs 10 --shell=none \
        --export-json "$OUTDIR/test_suite.json" \
        "$PYTHON -m pytest tests/test_incremental_index.py tests/test_add_messages_with_indexing.py tests/test_podcast_incremental.py tests/test_sqlite_indexes.py tests/test_query.py -q"

      echo ""
      echo "Results saved to $OUTDIR/"
      ls -la "$OUTDIR/"

  # --- Benchmark: side-by-side two branches ---
  - path: /home/azureuser/bench/bench_ab.sh
    owner: azureuser:azureuser
    permissions: "0755"
    defer: true
    content: |
      #!/usr/bin/env bash
      set -euo pipefail
      BASE="${1:?Usage: bench_ab.sh <base-branch> <opt-branch>}"
      OPT="${2:?Usage: bench_ab.sh <base-branch> <opt-branch>}"

      echo "=== A/B comparison: $BASE vs $OPT ==="
      bash ~/bench/bench_compare.sh "$BASE"
      bash ~/bench/bench_compare.sh "$OPT"

      echo ""
      echo "Compare results in ~/results/"
      ls ~/results/

  # --- Setup script (runs once via runcmd) ---
  - path: /home/azureuser/setup.sh
    owner: azureuser:azureuser
    permissions: "0755"
    defer: true
    content: |
      #!/usr/bin/env bash
      set -euo pipefail
      export PATH="$HOME/.local/bin:$PATH"

      echo "=== Cloning typeagent ==="
      git clone https://github.com/KRRT7/typeagent-py.git ~/typeagent
      cd ~/typeagent

      echo "=== Installing toolchain and building ==="
      curl -LsSf https://astral.sh/uv/install.sh | sh
      export PATH="$HOME/.local/bin:$PATH"
      uv sync --python 3.13

      echo "=== Creating results directory ==="
      mkdir -p ~/results

      echo "=== Verifying installation ==="
      uv run python -c 'import typeagent; print("OK")'

      echo "=== Running baseline benchmarks ==="
      bash ~/bench/bench_baseline.sh

      echo "=== Done ==="

runcmd:
  - wget -q https://github.com/sharkdp/hyperfine/releases/download/v1.19.0/hyperfine_1.19.0_amd64.deb -O /tmp/hyperfine.deb
  - dpkg -i /tmp/hyperfine.deb
  - su - azureuser -c 'bash /home/azureuser/setup.sh'
111
.codeflash/microsoft/typeagent/infra/vm-manage.sh
Executable file
@@ -0,0 +1,111 @@
#!/usr/bin/env bash
#
# Template: Azure benchmark VM lifecycle management
#
# Customize:
#   1. Replace typeagent with your project name (e.g., "rich", "myapi")
#   2. Adjust SIZE if your project needs more/less resources
#   3. Update the cloud-init path if yours lives elsewhere
#
# Usage:
#   bash infra/vm-manage.sh {create|start|stop|ip|ssh|bench <branch>|destroy}

set -euo pipefail

RG="typeagent-BENCH-RG"
VM="typeagent-bench"
REGION="westus2"
SIZE="Standard_D2s_v5"
IMAGE="Canonical:ubuntu-24_04-lts:server:latest"
SSH_KEY="${SSH_KEY:-$HOME/.ssh/id_ed25519.pub}"

case "${1:-help}" in
  create)
    if [ ! -f "$SSH_KEY" ]; then
      echo "Error: SSH public key not found at $SSH_KEY"
      echo "Generate one: ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519"
      echo "Or set SSH_KEY=/path/to/key.pub"
      exit 1
    fi

    echo "Creating resource group..."
    az group create --name "$RG" --location "$REGION" --only-show-errors --output none

    echo "Creating VM (Trusted Launch, SSH-only, locked-down NSG)..."
    az vm create \
      --resource-group "$RG" \
      --name "$VM" \
      --image "$IMAGE" \
      --size "$SIZE" \
      --os-disk-size-gb 64 \
      --admin-username azureuser \
      --ssh-key-values "$SSH_KEY" \
      --authentication-type ssh \
      --security-type TrustedLaunch \
      --enable-secure-boot true \
      --enable-vtpm true \
      --nsg-rule NONE \
      --custom-data infra/cloud-init.yaml \
      --only-show-errors

    MY_IP=$(curl -s ifconfig.me)
    echo "Restricting SSH to $MY_IP..."
    az network nsg rule create \
      --resource-group "$RG" \
      --nsg-name "${VM}NSG" \
      --name AllowSSHFromMyIP \
      --priority 1000 \
      --source-address-prefixes "$MY_IP/32" \
      --destination-port-ranges 22 \
      --access Allow \
      --protocol Tcp \
      --output none

    echo "VM created. Get IP with: $0 ip"
    ;;

  start)
    echo "Starting VM..."
    az vm start --resource-group "$RG" --name "$VM"
    echo "Started. IP: $(az vm show -g "$RG" -n "$VM" -d --query publicIps -o tsv)"
    ;;

  stop)
    echo "Deallocating VM (stops billing)..."
    az vm deallocate --resource-group "$RG" --name "$VM"
    echo "Deallocated."
    ;;

  ip)
    az vm show -g "$RG" -n "$VM" -d --query publicIps -o tsv
    ;;

  ssh)
    IP=$(az vm show -g "$RG" -n "$VM" -d --query publicIps -o tsv)
    ssh azureuser@"$IP" "${@:2}"
    ;;

  bench)
    BRANCH="${2:?Usage: $0 bench <branch>}"
    IP=$(az vm show -g "$RG" -n "$VM" -d --query publicIps -o tsv)
    ssh azureuser@"$IP" "bash ~/bench/bench_compare.sh $BRANCH"
    ;;

  destroy)
    echo "Destroying resource group (all resources)..."
    az group delete --name "$RG" --yes --no-wait
    echo "Deletion started."
    ;;

  help|*)
    echo "Usage: $0 {create|start|stop|ip|ssh|bench <branch>|destroy}"
    echo ""
    echo "  create  - Provision VM with cloud-init"
    echo "  start   - Start deallocated VM"
    echo "  stop    - Deallocate VM (stops billing)"
    echo "  ip      - Show VM public IP"
    echo "  ssh     - SSH into VM"
    echo "  bench   - Run benchmarks on a branch"
    echo "  destroy - Delete resource group and all resources"
    ;;
esac
50
.codeflash/microsoft/typeagent/status.md
Normal file
@@ -0,0 +1,50 @@
# typeagent Status

Last updated: 2026-04-10

## Current state

Optimization work is active. 6 optimizations completed, 2 PRs merged, 5 PRs open.

## Target repo

`~/Desktop/work/microsoft_org/typeagent` on branch `perf/defer-query-imports`

## PRs

| PR | Branch | Status | Description |
|---|---|---|---|
| #229 | `perf/defer-black` | Merged | Defer black import to first use |
| #231 | `fix/parse-azure-endpoint` | Merged | Fix parse_azure_endpoint query string bug |
| #230 | `perf/batch-inserts` | Open | Batch SQLite INSERTs for indexing pipeline |
| #232 | `perf/batch-metadata-query` | Open | Batch metadata query to avoid N+1 across 5 call sites |
| #234 | `perf/vectorbase-numpy` | Open (stacked on #232) | Vectorize fuzzy_lookup_embedding with numpy ops |
| #235 | `perf/optional-black` | Open (stacked on #234) | Replace black with stdlib pprint for runtime formatting |
| #236 | `perf/defer-query-imports` | Open (stacked on #235) | Defer query-time imports in conversation_base |

## Key results

- **Import**: 791ms → 683ms (1.16x cumulative) via defer-black + optional-black + defer-query-imports
- **Batch inserts**: 1.16x on add_messages_with_indexing
- **Vectorbase**: 3.7x-14.2x on fuzzy_lookup_embedding (numpy ops)
- **Metadata query**: 2.1x-2.6x on lookup/group/scope operations

## VM

- **IP**: 40.65.81.123
- **Size**: Standard_D2s_v5
- **RG**: typeagent-BENCH-RG
- **State**: Unknown (check with `az vm show -g typeagent-BENCH-RG -n typeagent-bench -d --query powerState -o tsv`)

## Next

- Check PR review feedback on #230, #232, #234, #235, #236
- Profile for more import-time optimizations (currently 683ms, aim for the absolute floor)
- Investigate lightweight query path (branch `perf/lightweight-query` exists but no PR yet)
- pydantic_ai accounts for ~161ms of import time — upstream opportunity, noted in #236 body
- Update README.md with final results table

## Local branches to clean up

- `perf/vectorbase-lookup` — superseded by `perf/vectorbase-numpy`
- `perf/benchmark-tests` — merged into individual PRs
48
.codeflash/netflix/metaflow/README.md
Normal file
@@ -0,0 +1,48 @@
# metaflow Performance Optimization

Upstream performance improvements to [Netflix/metaflow](https://github.com/Netflix/metaflow), a human-centric framework for data science and ML workflows.

## Background

Metaflow is Netflix's open-source Python framework for building and managing real-world data science projects. It handles workflow orchestration, versioning, and execution across local, cloud, and Kubernetes environments.

Profiling reveals two main optimization surfaces:

1. **Import time (~513ms)**: Heavy optional dependencies (requests, kubernetes, asyncio, yaml) are loaded eagerly even when not needed. Plugin resolution alone accounts for 65% of import time.
2. **Runtime hot paths**: Double gzip compression on every artifact, SHA1 hashing where faster non-cryptographic hashes suffice, and sleep-based polling in multiprocessing utilities.

## Optimization Targets

### Import Time (Phase 1 — ~200ms estimated savings)

| Target | Current | Savings | Approach |
|---|---:|---:|---|
| Defer `requests` in metadata providers | 128ms | ~108ms | Lazy import inside ServiceMetadataProvider |
| Lazy-load Kubernetes clients | 50ms | ~48ms | Conditional import when K8s decorator used |
| Defer `asyncio` in subprocess_manager | 91ms | ~41ms | Import inside async functions only |
| Defer YAML/cards infrastructure | 52ms | ~37ms | Move YAML import to card render time |

### Runtime (Phase 2)

| Target | File | Approach |
|---|---|---|
| Double gzip compression | `content_addressed_store.py` | Single compression, tune level |
| SHA1 content hashing | `content_addressed_store.py` | Switch to xxHash/BLAKE3 |
| Sleep-based polling | `multicore_utils.py` | Event-based waiting |
| Extension loading cache | `extension_support/__init__.py` | Mtime-based cache |
## Results

_No optimizations applied yet._

| Benchmark | Before | After | Speedup |
|---|---:|---:|---:|
| `import metaflow` | 513ms | — | — |
| `metaflow --version` CLI | ~1.8s | — | — |

## PRs

_None yet._

| PR | Branch | Status | Description |
|---|---|---|---|
0
.codeflash/netflix/metaflow/bench/.gitkeep
Normal file
1
.codeflash/netflix/metaflow/data/results.tsv
Normal file
@@ -0,0 +1 @@
date	commit	target	metric	before	after	speedup	notes
53
.codeflash/netflix/metaflow/data/sha1-proposal.md
Normal file
@@ -0,0 +1,53 @@
# SHA1 -> Faster Hash Proposal (Content Addressed Store)

Status: **Deferred** — needs discussion with maintainers before implementation.

## Opportunity

SHA1 is used as the content-addressing hash in `content_addressed_store.py:98`. Benchmarks on Azure Standard_D2s_v5:

| Blob Size | SHA1 | xxh64 | xxh64 Speedup | blake2b (est) |
|-----------|-------|-------|---------------|---------------|
| 1KB | 0.001ms | 0.0004ms | 2.5x | ~1.5x |
| 100KB | 0.060ms | 0.008ms | 7.5x | ~3x |
| 1MB | 0.596ms | 0.073ms | 8.2x | ~4x |
| 10MB | 5.979ms | 0.736ms | 8.1x | ~4x |

## Why it's not a simple drop-in

The SHA1 hex digest is the **storage key** — it determines where artifacts live on disk/S3 (`<prefix>/<sha[:2]>/<sha>`). It's persisted in metadata databases and used across 14 locations in the codebase.

### All SHA1 usage locations

| File | Line | Purpose | Persistent? | Breaking? |
|------|------|---------|-------------|-----------|
| `content_addressed_store.py` | 98 | Artifact content-address key | S3/filesystem paths | Yes |
| `filecache.py` | 96-100 | Log/metadata cache tokens | Local cache filenames | Local only |
| `includefile.py` | 417 | Include file hash | Metadata DB | Yes |
| `metadata.py` | 588 | Artifact metadata field | Metadata DB | Yes |
| `argo_workflows.py` | 781 | Event name suffix | Argo event names | No (regenerable) |
| `argo_workflows_cli.py` | 550, 605, 611 | Workflow name truncation | Argo workflow names | No (regenerable) |
| `step_functions_cli.py` | 299, 307 | StateMachine name suffix | AWS resource names | No (regenerable) |
| `airflow_cli.py` | 454 | DAG naming | Airflow DAG names | No (regenerable) |
| `event_bridge_client.py` | 82 | Rule name truncation | AWS resource names | No (regenerable) |
| `s3op.py` | 726 | S3 download cache filename | Local cache filenames | Local only |
| test files | Multiple | Test verification | No | No |

### Key concerns

1. **Dedup boundary**: Same content saved before/after the change gets different keys — no cross-version dedup
2. **Collision safety**: xxh64 (64-bit) has a birthday bound of ~2^32, too small for content addressing. Must use xxh128 or blake2b.
3. **New dependency**: xxhash adds to `install_requires`; blake2b is stdlib (Python 3.6+) but slower
4. **Migration**: Needs a versioned hash algorithm in metadata and a dual-compute transition period
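On the blake2b option: because stdlib blake2b accepts a `digest_size` parameter, it can emit a 20-byte digest with the same 40-character hex width as SHA1, so the `<prefix>/<sha[:2]>/<sha>` path layout is unchanged. A sketch of that property (not the proposal's final migration design):

```python
import hashlib

blob = b"example artifact bytes"

sha1_key = hashlib.sha1(blob).hexdigest()
# digest_size=20 gives the same hex width as SHA1 (40 chars)
b2_key = hashlib.blake2b(blob, digest_size=20).hexdigest()

# Same shard-prefix path scheme still applies to the new key
print(f"{b2_key[:2]}/{b2_key}")
```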
### Proposed approach (for future PR)

1. Add a `hash_version` field to CAS metadata
2. Use blake2b (stdlib, no new dep) or xxh128 for new writes
3. Keep SHA1 reader support indefinitely for backward compat
4. `load_blobs` is already key-based (no rehash), so old artifacts remain loadable
5. Open a discussion issue first to align with maintainers on migration strategy

### Recommendation

Open a GitHub issue proposing the change and linking the benchmark data. Let maintainers weigh in on hash choice and migration strategy before implementing.
0
.codeflash/netflix/metaflow/infra/.gitkeep
Normal file
560
.codeflash/netflix/metaflow/infra/cloud-init.yaml
Normal file
@@ -0,0 +1,560 @@
#cloud-config
#
# Benchmark VM provisioning for Netflix/metaflow
#
# Pure Python workflow framework -- targets: content_addressed_store (gzip, SHA1),
# multicore_utils (sleep polling). Python 3.12, pip editable install.
#
# Usage:
#   az vm create ... --custom-data infra/cloud-init.yaml
#
# VM: Azure Standard_D2s_v5 (2 vCPU, 8 GB RAM, non-burstable)
# Non-burstable ensures consistent CPU -- no thermal throttling or turbo variability.

package_update: true
packages:
  - git
  - build-essential
  - curl
  - wget
  - jq

write_files:
  # --- Benchmark: content_addressed_store (gzip + SHA1) ---
  - path: /home/azureuser/bench/bench_cas.py
    owner: azureuser:azureuser
    permissions: "0755"
    defer: true
    content: |
      #!/usr/bin/env python3
      """
      Benchmark content_addressed_store hot paths: gzip compression/decompression
      and SHA1 hashing at various blob sizes.

      Outputs JSON results to ~/results/cas/.
      """
      import gzip
      import hashlib
      import json
      import os
      import time
      from io import BytesIO

      OUTDIR = os.path.expanduser("~/results/cas")
      os.makedirs(OUTDIR, exist_ok=True)

      # Simulate realistic artifact sizes: small (1KB pickled scalar),
      # medium (100KB pickled array), large (10MB pickled dataframe)
      BLOB_SIZES = {
          "1KB": 1_000,
          "10KB": 10_000,
          "100KB": 100_000,
          "1MB": 1_000_000,
          "10MB": 10_000_000,
      }

      ITERATIONS = {
          "1KB": 5000,
          "10KB": 2000,
          "100KB": 500,
          "1MB": 50,
          "10MB": 10,
      }

      def make_blob(size):
          """Pseudo-random blob (compressible, like pickled Python objects)."""
          import random
          random.seed(42)
          # Mix of structured + random bytes to mimic pickled data
          pattern = bytes(range(256)) * (size // 256 + 1)
          return pattern[:size]

      def bench_sha1(blob, iterations):
          start = time.perf_counter()
          for _ in range(iterations):
              hashlib.sha1(blob).hexdigest()
          elapsed = time.perf_counter() - start
          return elapsed / iterations

      def bench_gzip_compress(blob, iterations, level=3):
          start = time.perf_counter()
          for _ in range(iterations):
              buf = BytesIO()
              with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=level) as f:
                  f.write(blob)
              buf.seek(0)
              _ = buf.read()
          elapsed = time.perf_counter() - start
          return elapsed / iterations

      def bench_gzip_decompress(compressed, iterations):
          start = time.perf_counter()
          for _ in range(iterations):
              with gzip.GzipFile(fileobj=BytesIO(compressed), mode="rb") as f:
                  f.read()
          elapsed = time.perf_counter() - start
          return elapsed / iterations

      def bench_zlib_compress(blob, iterations, level=3):
          import zlib
          start = time.perf_counter()
          for _ in range(iterations):
              zlib.compress(blob, level)
          elapsed = time.perf_counter() - start
          return elapsed / iterations

      def bench_zlib_decompress(compressed, iterations):
          import zlib
          start = time.perf_counter()
          for _ in range(iterations):
              zlib.decompress(compressed)
          elapsed = time.perf_counter() - start
          return elapsed / iterations

      def main():
          results = {}

          for label, size in BLOB_SIZES.items():
              iters = ITERATIONS[label]
              blob = make_blob(size)
              print(f"\n=== {label} blob ({len(blob)} bytes, {iters} iterations) ===")

              # SHA1
              sha1_time = bench_sha1(blob, iters)
              print(f"  SHA1: {sha1_time*1000:.3f} ms/op")

              # Gzip compress (current: level 3)
              gzip_c_time = bench_gzip_compress(blob, iters, level=3)
              print(f"  gzip compress L3: {gzip_c_time*1000:.3f} ms/op")

              # Gzip compress level 1 (fastest)
              gzip_c1_time = bench_gzip_compress(blob, iters, level=1)
              print(f"  gzip compress L1: {gzip_c1_time*1000:.3f} ms/op")

              # Prepare compressed blob for decompression bench
              buf = BytesIO()
              with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=3) as f:
                  f.write(blob)
              compressed = buf.getvalue()
              ratio = len(compressed) / len(blob)
              print(f"  compression ratio: {ratio:.3f} ({len(compressed)} bytes)")

              # Gzip decompress
              gzip_d_time = bench_gzip_decompress(compressed, iters)
              print(f"  gzip decompress: {gzip_d_time*1000:.3f} ms/op")

              # zlib compress (no gzip header overhead)
|
||||
import zlib
|
||||
zlib_c_time = bench_zlib_compress(blob, iters, level=3)
|
||||
print(f" zlib compress L3: {zlib_c_time*1000:.3f} ms/op")
|
||||
|
||||
zlib_compressed = zlib.compress(blob, 3)
|
||||
zlib_d_time = bench_zlib_decompress(zlib_compressed, iters)
|
||||
print(f" zlib decompress: {zlib_d_time*1000:.3f} ms/op")
|
||||
|
||||
results[label] = {
|
||||
"blob_bytes": len(blob),
|
||||
"iterations": iters,
|
||||
"sha1_ms": round(sha1_time * 1000, 4),
|
||||
"gzip_compress_L3_ms": round(gzip_c_time * 1000, 4),
|
||||
"gzip_compress_L1_ms": round(gzip_c1_time * 1000, 4),
|
||||
"gzip_decompress_ms": round(gzip_d_time * 1000, 4),
|
||||
"gzip_compressed_bytes": len(compressed),
|
||||
"gzip_ratio": round(ratio, 4),
|
||||
"zlib_compress_L3_ms": round(zlib_c_time * 1000, 4),
|
||||
"zlib_decompress_ms": round(zlib_d_time * 1000, 4),
|
||||
}
|
||||
|
||||
# Try optional fast alternatives if available
|
||||
try:
|
||||
import xxhash
|
||||
print("\n=== xxhash available ===")
|
||||
for label, size in BLOB_SIZES.items():
|
||||
iters = ITERATIONS[label]
|
||||
blob = make_blob(size)
|
||||
start = time.perf_counter()
|
||||
for _ in range(iters):
|
||||
xxhash.xxh64(blob).hexdigest()
|
||||
elapsed = time.perf_counter() - start
|
||||
xxh_time = elapsed / iters
|
||||
results[label]["xxh64_ms"] = round(xxh_time * 1000, 4)
|
||||
sha1_ms = results[label]["sha1_ms"]
|
||||
print(f" {label}: xxh64={xxh_time*1000:.3f} ms vs sha1={sha1_ms:.3f} ms ({sha1_ms/xxh_time/1000*1000:.1f}x faster)")
|
||||
except ImportError:
|
||||
print("\n xxhash not installed, skipping")
|
||||
|
||||
try:
|
||||
import lz4.frame
|
||||
print("\n=== lz4 available ===")
|
||||
for label, size in BLOB_SIZES.items():
|
||||
iters = ITERATIONS[label]
|
||||
blob = make_blob(size)
|
||||
start = time.perf_counter()
|
||||
for _ in range(iters):
|
||||
lz4.frame.compress(blob)
|
||||
elapsed = time.perf_counter() - start
|
||||
lz4_c_time = elapsed / iters
|
||||
|
||||
lz4_compressed = lz4.frame.compress(blob)
|
||||
start = time.perf_counter()
|
||||
for _ in range(iters):
|
||||
lz4.frame.decompress(lz4_compressed)
|
||||
elapsed = time.perf_counter() - start
|
||||
lz4_d_time = elapsed / iters
|
||||
|
||||
lz4_ratio = len(lz4_compressed) / len(blob)
|
||||
results[label]["lz4_compress_ms"] = round(lz4_c_time * 1000, 4)
|
||||
results[label]["lz4_decompress_ms"] = round(lz4_d_time * 1000, 4)
|
||||
results[label]["lz4_ratio"] = round(lz4_ratio, 4)
|
||||
gzip_ms = results[label]["gzip_compress_L3_ms"]
|
||||
print(f" {label}: lz4={lz4_c_time*1000:.3f} ms vs gzip={gzip_ms:.3f} ms (ratio: lz4={lz4_ratio:.3f} vs gzip={results[label]['gzip_ratio']:.3f})")
|
||||
except ImportError:
|
||||
print("\n lz4 not installed, skipping")
|
||||
|
||||
with open(os.path.join(OUTDIR, "baseline.json"), "w") as f:
|
||||
json.dump(results, f, indent=2)
|
||||
print(f"\nResults saved to {OUTDIR}/baseline.json")
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
||||
# --- Benchmark: multicore_utils (polling overhead) ---
- path: /home/azureuser/bench/bench_multicore.py
  owner: azureuser:azureuser
  permissions: "0755"
  defer: true
  content: |
    #!/usr/bin/env python3
    """
    Benchmark multicore_utils polling overhead.

    Measures wall-clock time for parallel_map with trivial vs real workloads
    to isolate the polling/sleep overhead from actual work.
    """
    import json
    import os
    import sys
    import time

    sys.path.insert(0, os.path.expanduser("~/metaflow"))
    from metaflow.multicore_utils import parallel_map, parallel_imap_unordered

    OUTDIR = os.path.expanduser("~/results/multicore")
    os.makedirs(OUTDIR, exist_ok=True)

    def noop(x):
        return x

    def sleep_10ms(x):
        time.sleep(0.01)
        return x

    def cpu_work(x):
        """~5ms of CPU work."""
        total = 0
        for i in range(100_000):
            total += i * i
        return total

    def main():
        results = {}

        # Trivial workload -- exposes polling overhead
        for n_items in [4, 16, 64]:
            items = list(range(n_items))

            # noop: all overhead is fork + polling
            start = time.perf_counter()
            parallel_map(noop, items, max_parallel=2)
            noop_time = time.perf_counter() - start

            # cpu_work: real work dominates
            start = time.perf_counter()
            parallel_map(cpu_work, items, max_parallel=2)
            cpu_time = time.perf_counter() - start

            overhead_pct = (noop_time / cpu_time) * 100 if cpu_time > 0 else 0

            print(f"n={n_items}: noop={noop_time:.3f}s, cpu_work={cpu_time:.3f}s, overhead={overhead_pct:.1f}%")

            results[f"n{n_items}"] = {
                "items": n_items,
                "noop_s": round(noop_time, 4),
                "cpu_work_s": round(cpu_time, 4),
                "overhead_pct": round(overhead_pct, 2),
            }

        # Sleep workload -- isolates polling gap
        for n_items in [4, 16]:
            items = list(range(n_items))
            start = time.perf_counter()
            parallel_map(sleep_10ms, items, max_parallel=2)
            sleep_time = time.perf_counter() - start
            ideal = (n_items / 2) * 0.01  # perfect parallelism
            gap = sleep_time - ideal

            print(f"n={n_items} sleep_10ms: actual={sleep_time:.3f}s, ideal={ideal:.3f}s, gap={gap:.3f}s")
            results[f"n{n_items}_sleep"] = {
                "items": n_items,
                "actual_s": round(sleep_time, 4),
                "ideal_s": round(ideal, 4),
                "gap_s": round(gap, 4),
            }

        with open(os.path.join(OUTDIR, "baseline.json"), "w") as f:
            json.dump(results, f, indent=2)
        print(f"\nResults saved to {OUTDIR}/baseline.json")

    if __name__ == "__main__":
        main()

# --- Benchmark: end-to-end save/load via CAS ---
- path: /home/azureuser/bench/bench_cas_e2e.py
  owner: azureuser:azureuser
  permissions: "0755"
  defer: true
  content: |
    #!/usr/bin/env python3
    """
    End-to-end benchmark of ContentAddressedStore save_blobs / load_blobs
    using the local filesystem storage backend.

    This tests the full pipeline: SHA1 hash -> dedup check -> gzip -> write -> read -> gunzip.
    """
    import json
    import os
    import shutil
    import sys
    import tempfile
    import time

    sys.path.insert(0, os.path.expanduser("~/metaflow"))

    from metaflow.datastore.content_addressed_store import ContentAddressedStore
    from metaflow.plugins.datastores.local_storage import LocalStorage

    OUTDIR = os.path.expanduser("~/results/cas_e2e")
    os.makedirs(OUTDIR, exist_ok=True)

    BLOB_SIZES = {
        "1KB": 1_000,
        "100KB": 100_000,
        "1MB": 1_000_000,
        "10MB": 10_000_000,
    }

    ITERATIONS = {
        "1KB": 200,
        "100KB": 100,
        "1MB": 20,
        "10MB": 5,
    }

    def make_blob(size):
        import random
        random.seed(42)
        pattern = bytes(range(256)) * (size // 256 + 1)
        return pattern[:size]

    def main():
        results = {}

        for label, size in BLOB_SIZES.items():
            iters = ITERATIONS[label]
            blob = make_blob(size)
            print(f"\n=== {label} blob ({len(blob)} bytes, {iters} iterations) ===")

            save_times = []
            load_times = []

            for i in range(iters):
                tmpdir = tempfile.mkdtemp(prefix="cas_bench_")
                try:
                    storage = LocalStorage(tmpdir)
                    cas = ContentAddressedStore("cas", storage)

                    # Use unique blobs to avoid dedup short-circuit
                    unique_blob = blob + i.to_bytes(4, "big")

                    # Save
                    start = time.perf_counter()
                    result = cas.save_blobs(iter([unique_blob]))
                    save_elapsed = time.perf_counter() - start
                    save_times.append(save_elapsed)

                    key = result[0].key

                    # Load
                    start = time.perf_counter()
                    loaded = list(cas.load_blobs([key]))
                    load_elapsed = time.perf_counter() - start
                    load_times.append(load_elapsed)

                    # Verify correctness
                    assert loaded[0][1] == unique_blob, "Data mismatch!"
                finally:
                    shutil.rmtree(tmpdir, ignore_errors=True)

            avg_save = sum(save_times) / len(save_times)
            avg_load = sum(load_times) / len(load_times)
            print(f" save: {avg_save*1000:.3f} ms/op")
            print(f" load: {avg_load*1000:.3f} ms/op")
            print(f" total: {(avg_save+avg_load)*1000:.3f} ms/op")

            results[label] = {
                "blob_bytes": len(blob),
                "iterations": iters,
                "save_ms": round(avg_save * 1000, 4),
                "load_ms": round(avg_load * 1000, 4),
                "total_ms": round((avg_save + avg_load) * 1000, 4),
            }

        with open(os.path.join(OUTDIR, "baseline.json"), "w") as f:
            json.dump(results, f, indent=2)
        print(f"\nResults saved to {OUTDIR}/baseline.json")

    if __name__ == "__main__":
        main()

# --- Benchmark: run all baselines ---
- path: /home/azureuser/bench/bench_baseline.sh
  owner: azureuser:azureuser
  permissions: "0755"
  defer: true
  content: |
    #!/usr/bin/env bash
    set -euo pipefail

    PYTHON=~/metaflow/.venv/bin/python

    echo "=== Running all baseline benchmarks ==="
    echo ""

    echo "--- CAS microbenchmarks (gzip, SHA1, alternatives) ---"
    $PYTHON ~/bench/bench_cas.py

    echo ""
    echo "--- CAS end-to-end (save_blobs / load_blobs) ---"
    $PYTHON ~/bench/bench_cas_e2e.py

    echo ""
    echo "--- Multicore utils (polling overhead) ---"
    $PYTHON ~/bench/bench_multicore.py

    echo ""
    echo "=== All baselines complete ==="
    echo "Results in ~/results/"
    find ~/results/ -name "*.json" -exec echo {} \;

# --- Benchmark: A/B branch comparison ---
- path: /home/azureuser/bench/bench_compare.sh
  owner: azureuser:azureuser
  permissions: "0755"
  defer: true
  content: |
    #!/usr/bin/env bash
    set -euo pipefail
    BRANCH="${1:?Usage: bench_compare.sh <branch-or-commit>}"
    TS=$(date +%Y%m%d-%H%M%S)
    OUTDIR="$HOME/results/${BRANCH//\//-}-${TS}"
    mkdir -p "$OUTDIR"

    PYTHON=~/metaflow/.venv/bin/python

    cd ~/metaflow
    git fetch origin
    git checkout "$BRANCH"

    # Reinstall after switching branches
    $PYTHON -m pip install -e . -q

    echo "=== Benchmarking branch: $BRANCH ==="

    $PYTHON ~/bench/bench_cas.py
    cp ~/results/cas/baseline.json "$OUTDIR/cas.json"

    $PYTHON ~/bench/bench_cas_e2e.py
    cp ~/results/cas_e2e/baseline.json "$OUTDIR/cas_e2e.json"

    $PYTHON ~/bench/bench_multicore.py
    cp ~/results/multicore/baseline.json "$OUTDIR/multicore.json"

    echo ""
    echo "Results saved to $OUTDIR/"
    ls -la "$OUTDIR/"

# --- Benchmark: side-by-side two branches ---
- path: /home/azureuser/bench/bench_ab.sh
  owner: azureuser:azureuser
  permissions: "0755"
  defer: true
  content: |
    #!/usr/bin/env bash
    set -euo pipefail
    BASE="${1:?Usage: bench_ab.sh <base-branch> <opt-branch>}"
    OPT="${2:?Usage: bench_ab.sh <base-branch> <opt-branch>}"

    echo "=== A/B comparison: $BASE vs $OPT ==="
    bash ~/bench/bench_compare.sh "$BASE"
    bash ~/bench/bench_compare.sh "$OPT"

    echo ""
    echo "Compare results in ~/results/"
    ls ~/results/

# --- Unit test runner ---
- path: /home/azureuser/bench/run_tests.sh
  owner: azureuser:azureuser
  permissions: "0755"
  defer: true
  content: |
    #!/usr/bin/env bash
    set -euo pipefail

    PYTHON=~/metaflow/.venv/bin/python
    cd ~/metaflow

    echo "=== Running metaflow unit tests ==="
    $PYTHON -m pytest test/unit/ -v --tb=short --timeout=120 -m "not docker" "$@"

# --- Setup script (runs once via runcmd) ---
- path: /home/azureuser/setup.sh
  owner: azureuser:azureuser
  permissions: "0755"
  defer: true
  content: |
    #!/usr/bin/env bash
    set -euo pipefail

    echo "=== Cloning metaflow ==="
    git clone https://github.com/KRRT7/metaflow.git ~/metaflow
    cd ~/metaflow
    git remote add upstream https://github.com/Netflix/metaflow.git

    echo "=== Installing Python 3.12 venv ==="
    sudo apt-get install -y python3.12-venv python3-pip
    python3 -m venv .venv
    .venv/bin/pip install --upgrade pip

    echo "=== Installing metaflow (editable) ==="
    .venv/bin/pip install -e ".[dev]" || .venv/bin/pip install -e .
    .venv/bin/pip install pytest pytest-timeout

    echo "=== Installing benchmark dependencies ==="
    .venv/bin/pip install xxhash lz4

    echo "=== Creating results directory ==="
    mkdir -p ~/results

    echo "=== Verifying installation ==="
    .venv/bin/python -c 'import metaflow; print("metaflow OK:", metaflow.__version__)'

    echo "=== Running baseline benchmarks ==="
    bash ~/bench/bench_baseline.sh

    echo "=== Done ==="

runcmd:
  - wget -q https://github.com/sharkdp/hyperfine/releases/download/v1.19.0/hyperfine_1.19.0_amd64.deb -O /tmp/hyperfine.deb
  - dpkg -i /tmp/hyperfine.deb
  - su - azureuser -c 'bash /home/azureuser/setup.sh'
106
.codeflash/netflix/metaflow/infra/vm-manage.sh
Normal file
@ -0,0 +1,106 @@
#!/usr/bin/env bash
#
# Azure benchmark VM lifecycle management for Netflix/metaflow
#
# Usage:
#   bash infra/vm-manage.sh {create|start|stop|ip|ssh|bench <branch>|destroy}

set -euo pipefail

RG="metaflow-BENCH-RG"
VM="metaflow-bench"
REGION="westus2"
SIZE="Standard_D2s_v5"
IMAGE="Canonical:ubuntu-24_04-lts:server:latest"
SSH_KEY="${SSH_KEY:-$HOME/.ssh/id_ed25519.pub}"

case "${1:-help}" in
  create)
    if [ ! -f "$SSH_KEY" ]; then
      echo "Error: SSH public key not found at $SSH_KEY"
      echo "Generate one: ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519"
      echo "Or set SSH_KEY=/path/to/key.pub"
      exit 1
    fi

    echo "Creating resource group..."
    az group create --name "$RG" --location "$REGION" --only-show-errors --output none

    echo "Creating VM (Trusted Launch, SSH-only, locked-down NSG)..."
    az vm create \
      --resource-group "$RG" \
      --name "$VM" \
      --image "$IMAGE" \
      --size "$SIZE" \
      --os-disk-size-gb 64 \
      --admin-username azureuser \
      --ssh-key-values "$SSH_KEY" \
      --authentication-type ssh \
      --security-type TrustedLaunch \
      --enable-secure-boot true \
      --enable-vtpm true \
      --nsg-rule NONE \
      --custom-data infra/cloud-init.yaml \
      --only-show-errors

    MY_IP=$(curl -s ifconfig.me)
    echo "Restricting SSH to $MY_IP..."
    az network nsg rule create \
      --resource-group "$RG" \
      --nsg-name "${VM}NSG" \
      --name AllowSSHFromMyIP \
      --priority 1000 \
      --source-address-prefixes "$MY_IP/32" \
      --destination-port-ranges 22 \
      --access Allow \
      --protocol Tcp \
      --output none

    echo "VM created. Get IP with: $0 ip"
    ;;

  start)
    echo "Starting VM..."
    az vm start --resource-group "$RG" --name "$VM"
    echo "Started. IP: $(az vm show -g "$RG" -n "$VM" -d --query publicIps -o tsv)"
    ;;

  stop)
    echo "Deallocating VM (stops billing)..."
    az vm deallocate --resource-group "$RG" --name "$VM"
    echo "Deallocated."
    ;;

  ip)
    az vm show -g "$RG" -n "$VM" -d --query publicIps -o tsv
    ;;

  ssh)
    IP=$(az vm show -g "$RG" -n "$VM" -d --query publicIps -o tsv)
    ssh -A azureuser@"$IP" "${@:2}"
    ;;

  bench)
    BRANCH="${2:?Usage: $0 bench <branch>}"
    IP=$(az vm show -g "$RG" -n "$VM" -d --query publicIps -o tsv)
    ssh -A azureuser@"$IP" "bash ~/bench/bench_compare.sh $BRANCH"
    ;;

  destroy)
    echo "Destroying resource group (all resources)..."
    az group delete --name "$RG" --yes --no-wait
    echo "Deletion started."
    ;;

  help|*)
    echo "Usage: $0 {create|start|stop|ip|ssh|bench <branch>|destroy}"
    echo ""
    echo "  create  - Provision VM with cloud-init"
    echo "  start   - Start deallocated VM"
    echo "  stop    - Deallocate VM (stops billing)"
    echo "  ip      - Show VM public IP"
    echo "  ssh     - SSH into VM"
    echo "  bench   - Run benchmarks on a branch"
    echo "  destroy - Delete resource group and all resources"
    ;;
esac
51
.codeflash/netflix/metaflow/status.md
Normal file
@ -0,0 +1,51 @@
# metaflow Status

Last updated: 2026-04-10

## Current state

First PR open upstream. Waiting for maintainer feedback.

## Target repo

`~/Desktop/work/netflix_org/metaflow` — fork remote: `KRRT7/metaflow`

## VM

Azure Standard_D2s_v5, IP: 20.112.32.177, RG: metaflow-BENCH-RG (deallocated)
- Python 3.12, pip editable install, lz4/xxhash/numpy/hyperfine installed
- Baseline + realistic benchmarks complete in ~/results/

## PRs

| PR | Branch | Status | Description |
|---|---|---|---|
| [Netflix/metaflow#3090](https://github.com/Netflix/metaflow/pull/3090) | `perf/lz4-artifact-compression` | Open, waiting for review | Replace gzip with lz4 in CAS — 7-18x on realistic data |
| [KRRT7/metaflow#1](https://github.com/KRRT7/metaflow/pull/1) | `perf/lz4-artifact-compression` | Draft (mirror) | Same, on fork |

## Key results (realistic artifacts)

| Payload | Pickled Size | gzip total | lz4 total | Speedup |
|---------|-------------|------------|-----------|---------|
| Small dict (config) | 233B | 0.341ms | 0.218ms | 1.6x |
| Metrics dict (feature stats) | 52KB | 2.278ms | 0.327ms | 7.0x |
| Numpy float64 (embeddings) | 800KB | 29.111ms | 1.557ms | 18.7x |
| Numpy float64 (model weights) | 8MB | 289.234ms | 15.792ms | 18.3x |
| Random bytes (opaque model) | 5MB | 118.315ms | 9.646ms | 12.3x |

## Open questions on PR

- Hard vs soft dependency for lz4
- Forward compat story (old metaflow can't read cas_version=2)
- Benchmark scripts to be reverted before merge

## Next steps (pending maintainer response)

1. If approach accepted: make lz4 optional, revert benchmark scripts, address feedback
2. If rejected on dependency grounds: explore zlib.compress directly (no new dep, smaller win)
3. Open SHA1 discussion issue (data in `data/sha1-proposal.md`)
4. Multicore polling improvement (low priority, marginal impact)

## Blockers

Waiting on Netflix/metaflow#3090 review.
1
.codeflash/pypa/pip/.gitignore
vendored
Normal file
@ -0,0 +1 @@
.DS_Store
156
.codeflash/pypa/pip/README.md
Normal file
@ -0,0 +1,156 @@
# pip Performance Optimization

End-to-end performance optimization of [pip](https://github.com/pypa/pip), the Python package installer. 122 commits across startup, dependency resolution, packaging, import deferral, and vendored Rich.

## Results

**Environment**: Python 3.15.0a7, macOS arm64 (Apple Silicon), ~27 packages installed, HTTP cache warm, hyperfine (5–10 runs, 2–3 warmup)

### Startup

| Benchmark | main | optimized | Speedup |
|---|---:|---:|---:|
| `pip --version` | 138ms | **20ms** | **7.0x** |
| `pip --help` | 143ms | **121ms** | **1.18x** |

### Dependency Resolution

| Benchmark | main | optimized | Speedup |
|---|---:|---:|---:|
| `requests` (~5 deps) | 589ms | **516ms** | **1.14x** |
| `flask + django` (~15 deps) | 708ms | **599ms** | **1.18x** |
| `flask + django + boto3 + requests` (~30 deps) | 1,493ms | **826ms** | **1.81x** |
| `fastapi[standard]` (~42 deps) | 13,325ms | **11,664ms** | **1.14x** |

### Package Operations

| Benchmark | main | optimized | Speedup |
|---|---:|---:|---:|
| `pip list` | 162ms | **146ms** | **1.11x** |
| `pip freeze` | 225ms | **211ms** | **1.07x** |
| `pip show pip` | 162ms | **148ms** | **1.09x** |
| `install -r requirements.txt` (21 pkgs) | 1,344ms | **740ms** | **1.82x** |

### Totals

| | main | optimized | Speedup |
|-|---:|---:|---:|
| **All benchmarks** | 18,717ms | 15,223ms | **1.23x** |
| **Excluding fastapi[standard]** | 5,392ms | 3,559ms | **1.51x** |

## What We Optimized (122 commits)

### 1. Startup
- Ultra-fast `--version` path in `__main__.py` — exits before importing `pip._internal` (138ms → 20ms)
- Fast-path `--version` in `cli/main.py` — avoids `pip._internal.utils.misc` import
- Deferred `base_command.py` import chain to command creation time
- Deferred `Configuration` module loading
- Deferred autocompletion imports behind `PIP_AUTO_COMPLETE` check

### 2. Dependency Resolver — Architecture
- **Speculative metadata prefetch**: background thread downloads PEP 658 metadata for the top candidate while the resolver processes other packages
- **Conditional Criterion rebuild**: `_remove_information_from_criteria` skips rebuilding unaffected criteria, eliminating ~95% of allocations
- **`__slots__` on Criterion**: reduces per-instance memory by ~100 bytes
- Two-level cache for `_iter_found_candidates` (specifier merge + candidate infos)
- Parallel index-page prefetch during dependency resolution
- Unified shared ThreadPoolExecutor for parallel wheel downloads

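The speculative-prefetch pattern above can be sketched with stdlib primitives. This is a hedged illustration rather than pip's actual code: `fetch_metadata`, `SpeculativePrefetcher`, and the candidate strings are hypothetical stand-ins for the real PEP 658 fetch path.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a metadata download (the real one is an HTTP fetch).
def fetch_metadata(candidate):
    return f"metadata-for-{candidate}"

class SpeculativePrefetcher:
    """Start fetching the likely-next candidate's metadata in the background
    while the caller keeps working, then collect the result when needed."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._inflight = {}

    def prefetch(self, candidate):
        # Kick off a background fetch; idempotent per candidate.
        if candidate not in self._inflight:
            self._inflight[candidate] = self._pool.submit(fetch_metadata, candidate)

    def get(self, candidate):
        future = self._inflight.pop(candidate, None)
        if future is not None:
            return future.result()        # already done (or finishing) in background
        return fetch_metadata(candidate)  # cache miss: fetch synchronously

prefetcher = SpeculativePrefetcher()
prefetcher.prefetch("requests-2.32.3")
# ... the resolver would process other packages here ...
print(prefetcher.get("requests-2.32.3"))  # → metadata-for-requests-2.32.3
```

The win comes from overlapping network latency with resolver CPU work; when the guess is wrong, the `get` path simply falls back to a synchronous fetch.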
### 3. Dependency Resolver — Micro
- Cached wheel tag priority dict on `TargetPython`
- Pre-extracted requirements tuple on `Criterion` to avoid per-call generator expressions
- Cached `Marker.evaluate()` results for repeated extra lookups
- Hoisted `operator.methodcaller`/`attrgetter` to module-level constants
- Cached `_sort_key` results to avoid double evaluation in `compute_best_candidate`

### 4. Packaging (vendored `pip._vendor.packaging`)
- Replaced `_tokenizer` dataclass with `__slots__` class
- Deferred `Version.__hash__` computation until first call
- Integer comparison key (`_cmp_int`) — avoids full `_key` tuple construction
- Bisect-based `filter_versions` for O(log n + k) batch filtering
- Pre-computed integer bounds on `SpecifierSet` for fast rejection
- Cached parsed `Version`, `Requirement`, `Specifier` objects
- Fast-path tokenizer for simple tokens to bypass regex engine
- Direct release-tuple prefix comparison in `_compare_equal` / `_compare_compatible`

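The bisect-based batch filter can be illustrated as follows. This is a sketch on plain tuples, not the vendored `packaging` types: given an ascending candidate list, a half-open `[low, high)` range reduces to two binary searches plus one slice.

```python
import bisect

def filter_versions(sorted_versions, low, high):
    """Return versions v with low <= v < high from an ascending list,
    in O(log n + k) instead of scanning all n candidates."""
    lo = bisect.bisect_left(sorted_versions, low)   # first index >= low
    hi = bisect.bisect_left(sorted_versions, high)  # first index >= high
    return sorted_versions[lo:hi]

versions = [(1, 0), (1, 2), (1, 4), (2, 0), (2, 1), (3, 0)]
print(filter_versions(versions, (1, 2), (2, 1)))  # → [(1, 2), (1, 4), (2, 0)]
```

The real specifier logic has more cases (pre-releases, `~=`, exclusions), but the core cost model is the same: the linear scan per specifier becomes logarithmic plus the size of the surviving slice.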
### 5. Link and Wheel Parsing
- Pre-computed `Link._is_wheel` slot to avoid repeated `splitext`
- Cached URL scheme on `Link` to skip `urlsplit` for `is_vcs`/`is_file`
- Inlined Link construction in `_evaluate_json_page` to skip redundant work
- `rsplit` instead of `rfind`x3 for wheel tag extraction
- Cached `parse_tag` results to eliminate redundant `Tag` creation

### 6. I/O and Caching
- Replaced pure-Python msgpack with stdlib JSON for cache serialization
- Increased HTTP connection pool and prefetch concurrency

### 7. Import Deferral (vendored Rich)
- Deferred all Rich imports to first use
- Stripped unused Rich modules from import chain
- Deferred heavy imports in Rich `console.py` (pretty/pager/scope/screen/export)
- Deferred Rich imports in `progress_bars.py` and `self_outdated_check.py`

### 8. Micro-optimizations
- Bypassed `InstallationCandidate.__init__` with `__new__` + direct slot assignment
- Removed redundant O(n) subset assertion in `BestCandidateResult`
- Cached `Hashes.__hash__`, `Constraint.empty()` singleton, `Requirement.__str__`
- Bypassed `email.parser` for metadata parsing

## Upstream Contributions

### Bug fixes (PRs to pypa/pip)

| PR | Status | Description |
|---|---|---|
| [pypa/pip#13900](https://github.com/pypa/pip/pull/13900) | Open | Fix `--report -` to use stdlib `json` instead of Rich for stdout output |
| [pypa/pip#13902](https://github.com/pypa/pip/pull/13902) | Open | Fix `test_trailing_slash_directory_metadata` for Python 3.15 |

### Bug reports (issues on pypa/pip)

| Issue | Description |
|---|---|
| [pypa/pip#13898](https://github.com/pypa/pip/issues/13898) | `pip install --report -` outputs invalid JSON when not combined with `--quiet` |
| [pypa/pip#13901](https://github.com/pypa/pip/issues/13901) | `test_trailing_slash_directory_metadata` fails on Python 3.15.0a8 |

### Rich upstream (separate case study)

| PR | Description |
|---|---|
| [Textualize/rich#4070](https://github.com/Textualize/rich/pull/4070) | Import deferral — 2x import speedup |
| [KRRT7/rich#12](https://github.com/KRRT7/rich/pull/12) | Architectural wins (dataclass→__slots__, lazy emoji) |
| [KRRT7/rich#13](https://github.com/KRRT7/rich/pull/13) | Import deferral + runtime micro-opts |

See [rich_org](https://github.com/KRRT7/rich_org) for the full Rich case study.

## Methodology

### Profiling approach

1. **`python -X importtime`** — Identified the heaviest imports in the startup chain
2. **cProfile / py-spy** — Found hot functions in the resolver and packaging layers
3. **Allocation counting** — Tracked object creation counts to find redundant work (e.g., 45,301 → 1,559 `Tag.__init__` calls with caching)
4. **E2E hyperfine** — Validated every change with end-to-end benchmarks

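The measurement protocol in step 4 (warmup runs discarded, median of the timed runs reported) can be approximated in pure Python. This harness is only an illustration of the protocol, not the tool that produced the numbers above:

```python
import statistics
import time

def bench(fn, runs=5, warmup=2):
    """Hyperfine-style measurement: run warmup iterations first (to warm
    caches and JITs), then report the median of the timed runs, which
    damps one-off scheduler noise better than the mean."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

median_s = bench(lambda: sum(i * i for i in range(10_000)))
print(f"median: {median_s * 1000:.3f} ms")
```

For whole-process benchmarks like `pip --version`, hyperfine itself is preferable since it measures process startup too, which an in-process timer cannot see.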
### Environment

- **Local**: macOS arm64 (Apple Silicon), Python 3.15.0a7, ~27 packages installed
- **CI validation**: Azure VM (Standard_D2s_v5, Ubuntu 24.04, Python 3.12), nox test sessions
- **Test suite**: 1,690 unit tests + 15 functional tests passing throughout

### Branch

All 122 optimization commits are on [`codeflash/optimize`](https://github.com/KRRT7/pip/tree/codeflash/optimize) in the KRRT7/pip fork.

## Repo Structure

```
.
├── README.md                  # This file
└── data/
    ├── benchmarks.md          # Full E2E benchmark results table
    ├── results.tsv            # Per-optimization tracking (target, speedup, status)
    ├── benchmark-analysis.md  # Detailed profiling analysis
    ├── io-analysis.md         # I/O and caching analysis
    ├── coverage-analysis.md   # Test coverage analysis
    ├── learnings.md           # Session learnings and patterns
    └── session-handoff.md     # Optimization session state
```
0
.codeflash/pypa/pip/bench/.gitkeep
Normal file
330
.codeflash/pypa/pip/data/benchmark-analysis.md
Normal file
@ -0,0 +1,330 @@
# Resolver Benchmark Analysis: Ideas 4-10 Applicability

## Methodology

Each workload was run 3 times with `pip install --dry-run --ignore-installed`.
Resolver internals were monkey-patched (no source modifications) to capture 10 metrics
per resolution. Timing values are medians of 3 runs; count values are deterministic
(first run). HTTP cache was warm for all runs.

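The monkey-patching approach can be sketched like this: wrap a method at runtime to count calls without editing the source. `Resolution` here is a hypothetical stand-in for the real resolvelib class, and the metric name is illustrative.

```python
import functools

class Resolution:
    """Hypothetical stand-in for the resolver class being instrumented."""
    def _is_current_pin_satisfying(self, name):
        return True

metrics = {"pin_satisfying_calls": 0}

# Keep a reference to the original so behavior is unchanged
# (and so it can be restored after the measured run).
_orig = Resolution._is_current_pin_satisfying

@functools.wraps(_orig)
def counted(self, name):
    metrics["pin_satisfying_calls"] += 1
    return _orig(self, name)

Resolution._is_current_pin_satisfying = counted

r = Resolution()
for pkg in ["requests", "urllib3", "idna"]:
    r._is_current_pin_satisfying(pkg)
print(metrics)  # → {'pin_satisfying_calls': 3}
```

The same wrapper shape works for timing (`perf_counter` around the delegated call) or for recording argument distributions, which is how per-resolution metrics like the pin-satisfying counts below can be captured.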
**Environment:** Python 3.15.0a7, macOS arm64 (Apple Silicon), branch `codeflash/optimize`
|
||||
|
||||
**Important timing note:** "Resolver CPU" measures wall time inside `Resolution.resolve()`,
|
||||
which includes metadata fetching, wheel downloading, and sdist building -- not just the
|
||||
resolver algorithm. The actual algorithmic CPU cost (round logic, pin satisfying checks,
|
||||
preference computation) is a tiny fraction of this. For packages that require building
|
||||
from source (dask, fastapi), the resolver CPU time is dominated by build system overhead.

---

## Table 1: Core Resolution Metrics

| Workload | Rounds | Backtracks | Peak Criteria | Pin-Satisfying Calls | Candidates Pinned |
|----------|-------:|-----------:|--------------:|---------------------:|------------------:|
| simple: requests | 7 | 0 | 6 | 57 | 6 |
| simple: flask | 9 | 0 | 8 | 100 | 8 |
| simple: click | 3 | 0 | 2 | 7 | 2 |
| medium: django | 5 | 0 | 4 | 26 | 4 |
| medium: flask+django+boto3+requests | 23 | 0 | 22 | 682 | 22 |
| complex: fastapi[standard] | 48 | 0 | 47 | 2,654 | 47 |
| complex: jupyterlab | 96 | 0 | 95 | 10,529 | 95 |
| complex: celery[redis] | 21 | 0 | 20 | 533 | 20 |
| complex: dask[complete] | 35 | 0 | 34 | 1,400 | 34 |
| conflict: google-cloud-bigquery+pandas<2 | 1 | 0 | 2 | 2 | 1 |
| conflict: boto3==1.26.0+botocore>=1.31 | 1 | 1 | 2 | 2 | 1 |
| large: 30-pkg requirements | 95 | 0 | 94 | 11,676 | 94 |

**Key observation:** Across all 12 workloads, backtracks range from 0 to 1. Round counts
are directly proportional to the number of packages resolved (rounds ~= packages + 1).
The resolver is operating as an essentially linear algorithm.

## Table 2: make_requirements_from_spec Analysis

| Workload | Total Calls | Unique Specs | Duplicate Specs | Dup Rate |
|----------|------------:|-------------:|----------------:|---------:|
| simple: requests | 4 | 4 | 0 | 0.0% |
| simple: flask | 8 | 7 | 1 | 12.5% |
| simple: click | 0 | 0 | 0 | - |
| medium: django | 2 | 2 | 0 | 0.0% |
| medium: flask+django+boto3+requests | 22 | 20 | 2 | 9.1% |
| complex: fastapi[standard] | 84 | 64 | 20 | 23.8% |
| complex: jupyterlab | 156 | 136 | 20 | 12.8% |
| complex: celery[redis] | 35 | 22 | 13 | 37.1% |
| complex: dask[complete] | 70 | 45 | 25 | 35.7% |
| conflict: google-cloud-bigquery+pandas<2 | 0 | 0 | 0 | - |
| conflict: boto3==1.26.0+botocore>=1.31 | 3 | 3 | 0 | 0.0% |
| large: 30-pkg requirements | 118 | 109 | 9 | 7.6% |

**Aggregate:** 502 total calls, 90 duplicates (17.9% duplication rate).

## Table 3: RequirementInformation Allocations and narrow_requirement_selection

| Workload | RI Allocs | Narrow Calls | Reductions | Avg In | Avg Out | Reduction Rate |
|----------|----------:|-------------:|-----------:|-------:|--------:|---------------:|
| simple: requests | 10 | 4 | 1 | 3.5 | 2.5 | 25% |
| simple: flask | 16 | 6 | 1 | 4.5 | 3.5 | 17% |
| simple: click | 2 | 0 | 0 | - | - | - |
| medium: django | 6 | 2 | 1 | 2.5 | 1.5 | 50% |
| medium: flask+django+boto3+requests | 47 | 21 | 1 | 8.5 | 8.1 | 5% |
| complex: fastapi[standard] | 138 | 45 | 1 | 8.8 | 8.5 | 2% |
| complex: jupyterlab | 247 | 92 | 1 | 15.3 | 15.2 | 1% |
| complex: celery[redis] | 54 | 17 | 1 | 6.5 | 5.9 | 6% |
| complex: dask[complete] | 103 | 31 | 1 | 6.7 | 6.4 | 3% |
| conflict: google-cloud-bigquery+pandas<2 | 2 | 1 | 0 | 2.0 | 2.0 | 0% |
| conflict: boto3==1.26.0+botocore>=1.31 | 4 | 1 | 0 | 2.0 | 2.0 | 0% |
| large: 30-pkg requirements | 239 | 93 | 1 | 29.8 | 29.5 | 1% |

**Aggregate:** 313 narrow calls, 9 actual reductions (3% effectiveness -- narrowing occurs
almost exclusively on the first round, where Requires-Python is present).

## Table 4: Timing Breakdown

| Workload | Wall (s) | Resolver CPU (s) | Non-Resolver (s) | _build_result (ms) | Resolver % |
|----------|---------:|-----------------:|-----------------:|-------------------:|-----------:|
| simple: requests | 0.328 | 0.215 | 0.108 | 0.033 | 66% |
| simple: flask | 0.347 | 0.236 | 0.111 | 0.037 | 68% |
| simple: click | 0.209 | 0.101 | 0.107 | 0.019 | 48% |
| medium: django | 0.299 | 0.190 | 0.108 | 0.028 | 63% |
| medium: flask+django+boto3+requests | 0.668 | 0.551 | 0.116 | 0.072 | 82% |
| complex: fastapi[standard] | 12.209 | 11.598 | 0.493 | 0.143 | 95%* |
| complex: jupyterlab | 6.182 | 5.908 | 0.274 | 0.246 | 96%* |
| complex: celery[redis] | 0.575 | 0.438 | 0.117 | 2.046 | 76% |
| complex: dask[complete] | 197.001 | 194.707 | 2.121 | 0.158 | 99%** |
| conflict: google-cloud-bigquery+pandas<2 | 3.322 | 2.645 | 0.677 | 0.000 | 80% |
| conflict: boto3==1.26.0+botocore>=1.31 | 0.297 | 0.184 | 0.114 | 0.000 | 62% |
| large: 30-pkg requirements | 4.592 | 4.308 | 0.255 | 0.290 | 94%* |

\*These high CPU percentages include metadata downloading and wheel building inside the
resolver loop, not just resolver algorithm work. \*\*dask[complete] spends ~193s building
C extensions (numpy, scipy, distributed, etc.) inside the resolver's metadata preparation.

---

## Ideas 4-10: Detailed Analysis

### Idea 4: Copy-on-Write (COW) State Snapshots

**Theory:** Each resolution round copies the full state dict via `_push_new_state`.
With 3000+ rounds and heavy backtracking, COW would defer the copy until mutation.

**Measured reality:**
- Round counts across all workloads: **1-96**
- Backtrack counts: **0-1**
- Peak criteria dict sizes: **2-95** entries

At these sizes, `dict.copy()` on a 2-95 entry dict costs 0.5-5 microseconds per copy.
Even at 96 rounds (jupyterlab, the largest), total copy overhead is ~0.5ms.
COW proxy objects would add per-access overhead on every dict read and write, easily
exceeding the copy cost given that each round does dozens of dict operations.

**VERDICT: NO-GO**

The codex assumed 3000+ rounds with heavy backtracking. Reality: 1-96 rounds with
0-1 backtracks. State copies at this scale are free. COW would be a net negative.
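
The copy-cost claim is easy to check directly; the dict contents below are synthetic stand-ins for criteria entries, not real resolver state:

```python
import timeit

# Largest criteria map observed: 95 entries (jupyterlab).
state = {f"pkg{i}": object() for i in range(95)}

# Median cost of one shallow copy, amortized over 10k iterations.
per_copy_s = timeit.timeit(state.copy, number=10_000) / 10_000

# Worst case observed: 96 rounds, one copy per round.
total_ms = per_copy_s * 96 * 1_000
```

On typical hardware `total_ms` comes out well under a millisecond, which is the entire budget a COW scheme could ever recover.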

---

### Idea 5: Batch/Vectorize _is_current_pin_satisfying

**Theory:** Called once per criterion per round. Vectorizing into a single pass
over the mapping dict would reduce per-call overhead.

**Measured reality:**

| Workload | Calls | Per Round |
|----------|------:|----------:|
| simple: click | 7 | 2 |
| simple: requests | 57 | 8 |
| simple: flask | 100 | 11 |
| medium: django | 26 | 5 |
| medium: flask+django+boto3+requests | 682 | 30 |
| complex: celery[redis] | 533 | 25 |
| complex: dask[complete] | 1,400 | 40 |
| complex: fastapi[standard] | 2,654 | 55 |
| complex: jupyterlab | 10,529 | 110 |
| large: 30-pkg requirements | 11,676 | 123 |

The function body is already optimized: `dict.get()` (a single hash lookup) plus an `all()`
generator with early exit. Each call costs ~200ns. Even at the maximum (11,676 calls),
the total cost is **2.3ms** -- unmeasurably small vs wall time.

**VERDICT: NO-GO**

Call counts scale as O(rounds * criteria), which is O(n^2) where n = packages. But
even at n = 95, the total is only ~12K calls at 200ns each = 2.3ms. Batching would add
complexity for negligible gain.
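
The `dict.get()`-plus-`all()` shape described above looks roughly like this; the names and the callable-spec representation are simplifications, not pip's exact code:

```python
def is_current_pin_satisfying(identifier, mapping, criteria):
    # One hash lookup for the currently pinned candidate...
    candidate = mapping.get(identifier)
    if candidate is None:
        return False
    # ...then a generator with early exit over that identifier's requirements.
    return all(spec(candidate) for spec in criteria.get(identifier, ()))

mapping = {"requests": "2.31.0"}
criteria = {"requests": (lambda version: version.startswith("2."),)}
```

There is no redundant work left to batch away: each call is one lookup plus a short-circuiting scan.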

---

### Idea 6: Cache make_requirements_from_spec Results

**Theory:** The same specifier strings may be passed multiple times during backtracking.
Caching avoids redundant InstallRequirement construction.

**Measured reality:**
- Aggregate across all workloads: **502 total calls**, **90 duplicates** (17.9%)
- Per workload: 0-156 calls, 0-25 duplicates
- Highest duplication rates: celery[redis] 37.1%, dask[complete] 35.7%

Even if every duplicate call were eliminated (~5us each), total savings = **0.45ms**.
The duplication occurs because some packages share common dependency specifier
strings (e.g., `typing-extensions>=3.7.4`), not because of backtracking.

**VERDICT: NO-GO**

The total call count is too low (max 156) and the per-call cost too cheap (~5us) for
caching to produce a measurable improvement. The codex assumed backtracking would
cause thousands of repeated spec evaluations; with 0 backtracks, each spec is
essentially processed once.
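
The rejected cache would have been a one-liner; the sketch below (hypothetical names, a tuple standing in for InstallRequirement) shows why the measured duplication caps its benefit:

```python
from functools import lru_cache

constructions = {"count": 0}

@lru_cache(maxsize=None)
def make_requirement(spec):
    constructions["count"] += 1        # stands in for InstallRequirement construction
    return ("requirement", spec)

# Duplicates come from shared dependency strings, not backtracking:
for spec in ("typing-extensions>=3.7.4", "idna>=2.5", "typing-extensions>=3.7.4"):
    make_requirement(spec)
# 3 calls, 1 cache hit; at ~5us per construction, the 90 avoided
# calls across all workloads would save only ~0.45ms in total.
```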

---

### Idea 7: Pool/Reduce RequirementInformation Allocations

**Theory:** RequirementInformation namedtuples are allocated in hot loops.
An object pool or flyweight pattern could reduce allocation pressure.

**Measured reality:**
- Range: **2-247** allocations per resolution
- Maximum (jupyterlab): 247 allocations
- Cost: 247 allocs x 50ns/alloc = **0.012ms**

**VERDICT: NO-GO**

Allocation counts are trivially small. Python's tuple allocator handles this
volume in microseconds. Pooling infrastructure (hash lookups, reference counting)
would cost more than the allocations themselves. The `__slots__` on Criterion
(already applied) captures far more value than RI pooling ever could.
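
The relative weight of `__slots__` versus pooling can be seen with a toy class; the class names are illustrative, not pip's actual Criterion:

```python
import sys

class CriterionDict:                       # default: per-instance __dict__
    def __init__(self):
        self.candidates = self.information = self.incompatibilities = None

class CriterionSlots:                      # fixed slots, no __dict__ at all
    __slots__ = ("candidates", "information", "incompatibilities")
    def __init__(self):
        self.candidates = self.information = self.incompatibilities = None

with_dict = sys.getsizeof(CriterionDict()) + sys.getsizeof(CriterionDict().__dict__)
with_slots = sys.getsizeof(CriterionSlots())
saved_per_instance = with_dict - with_slots   # roughly the ~100 bytes cited
```

The exact byte counts vary by CPython version, but the slotted instance is always smaller, and that saving applies to every Criterion ever created, whereas pooling would only help the 2-247 RI allocations measured here.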

---

### Idea 8: Optimize narrow_requirement_selection

**Theory:** This function is called each round. Optimizing it, or skipping it
when it won't narrow, could save time.

**Measured reality:**
- Aggregate: **313 calls**, **9 reductions** (**3% effectiveness**)
- The function only reduces on the first round (Requires-Python check) for most workloads
- Average input size: 2-30 identifiers; average output reduction: <1 identifier

The function's O(n) linear scan over identifiers is already at the floor.
Its primary value is the Requires-Python early return (always the first round)
and the backtrack-cause narrowing, which only activates during backtracks -- of
which there are 0-1 across all workloads.

**VERDICT: NO-GO**

Already well-optimized. 313 calls x ~10us/call = 3ms total. The function serves
its purpose (the Requires-Python fast path) and has no meaningful optimization headroom.

---

### Idea 9: Faster _build_result Graph Traversal

**Theory:** The final graph construction via the `_has_route_to_root` recursive
traversal could be expensive for large dependency trees.

**Measured reality:**

| Workload | _build_result time | % of Wall |
|----------|-------------------:|----------:|
| simple: click | 0.019ms | 0.009% |
| medium: flask+django+boto3+requests | 0.072ms | 0.011% |
| complex: fastapi[standard] | 0.143ms | 0.001% |
| complex: jupyterlab | 0.246ms | 0.004% |
| complex: celery[redis] | 2.046ms | 0.356% |
| large: 30-pkg requirements | 0.290ms | 0.006% |

Maximum: **2.0ms** (celery[redis], likely a one-off GC pause).
Called exactly once per resolution.

**VERDICT: NO-GO**

_build_result accounts for <0.01% of wall time in all workloads except one
outlier, and even that outlier is 2ms. No optimization here could produce a
user-visible improvement.

---

### Idea 10: Reduce I/O Overhead / Improve CPU-I/O Overlap

**Theory:** Resolution is I/O-bound. Better overlap between CPU and I/O work,
or reducing I/O volume, could significantly cut wall time.

**Measured reality:**

For pure-Python workloads (no compilation overhead):

| Workload | Wall (s) | Non-Resolver (s) | I/O Fraction |
|----------|---------:|-----------------:|-------------:|
| simple: click | 0.209 | 0.107 | 51% |
| simple: requests | 0.328 | 0.108 | 33% |
| simple: flask | 0.347 | 0.111 | 32% |
| medium: django | 0.299 | 0.108 | 36% |
| medium: flask+django+boto3+requests | 0.668 | 0.116 | 17% |
| complex: celery[redis] | 0.575 | 0.117 | 20% |
| conflict: boto3==1.26.0+botocore>=1.31 | 0.297 | 0.114 | 38% |

For simple workloads, ~100-110ms is a fixed I/O floor (pip startup plus the initial
index request). As workload complexity grows, the I/O fraction shrinks because
HTTP responses are cached and the resolver does more CPU work per package.

The speculative metadata prefetch (already implemented) overlaps I/O with CPU
for packages discovered through dependency traversal. Further gains require:
- Protocol-level changes (server-side filtering, batch metadata endpoints)
- HTTP/2 multiplexing for parallel index page requests
- A larger connection pool for concurrent metadata downloads

**VERDICT: GO (partially addressed)**

I/O is 17-51% of wall time for pure-Python workloads. The prefetch infrastructure
already handles the most impactful case (top-candidate metadata overlap). The
remaining I/O is structural: initial index page fetches that must complete before
resolution can begin.
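
The prefetch overlap pattern amounts to the following minimal sketch, with a fake fetch function; pip's real implementation manages its own background thread and PEP 658 metadata endpoints:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch_metadata(name):
    time.sleep(0.02)                 # stand-in for an HTTP metadata request
    return f"{name}: metadata"

with ThreadPoolExecutor(max_workers=1) as pool:
    # Speculatively fetch the top candidate's metadata in the background...
    future = pool.submit(fetch_metadata, "requests")
    # ...while the resolver keeps doing CPU work on other packages.
    cpu_work = sum(i * i for i in range(50_000))
    metadata = future.result()       # often already complete by this point
```

When the CPU work outlasts the request, the metadata is effectively free; that is the "top-candidate overlap" case the existing prefetch already captures.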

---

## Summary: GO/NO-GO Recommendations

| # | Idea | Verdict | Key Evidence |
|---|------|---------|--------------|
| 4 | COW state snapshots | **NO-GO** | 1-96 rounds (not 3000+), 0-1 backtracks, copy cost <0.5ms |
| 5 | Batch _is_current_pin_satisfying | **NO-GO** | Max 11,676 calls x 200ns = 2.3ms total |
| 6 | Cache make_requirements_from_spec | **NO-GO** | 502 total calls, 90 dups, savings ~0.45ms |
| 7 | Pool RequirementInformation | **NO-GO** | Max 247 allocs x 50ns = 0.012ms |
| 8 | Optimize narrow_requirement_selection | **NO-GO** | 313 calls, 3% effective, already O(n) |
| 9 | Faster _build_result | **NO-GO** | Max 2.0ms, <0.01% of wall time |
| 10 | Reduce I/O overhead | **GO (partial)** | 17-51% of wall time is I/O for pure-Python pkgs |

---

## Key Insight: The Resolver Operates in a Fundamentally Different Regime Than Assumed

The codex research report assumed worst-case scenarios: 3000+ resolution rounds,
heavy backtracking, massive criteria dictionaries, and repeated processing of
the same specifiers. The measured reality across 12 diverse workloads is:

| Metric | Codex Assumption | Measured Reality |
|--------|-----------------:|-----------------:|
| Resolution rounds | 3,000+ | **1-96** |
| Backtracks | Heavy | **0-1** |
| Criteria dict size | Large | **2-95 entries** |
| make_req_from_spec calls | Thousands (repeated) | **0-156 (few dups)** |
| RI allocations | Hundreds of thousands | **2-247** |
| _build_result cost | Significant | **0.02-2.0ms** |

The two-level cache on `_iter_found_candidates` (Experiment 18) is the key
optimization that made this possible. By caching specifier merge results and
candidate info lists, the resolver avoids redundant work during what would
otherwise be expensive backtracking cycles. The result is an essentially
linear one-pass-per-package algorithm with zero backtracks.

**Bottom line:** Ideas 4-9 target algorithmic overhead that has been effectively
eliminated by the existing caching infrastructure. The resolver's pure algorithmic
cost (round logic, preference computation, pin-satisfaction checks) totals
approximately **2-12ms** even for the most complex workloads. This is irreducible
overhead at the Python function-call level.

The only remaining lever for wall-time improvement is I/O (idea 10), which requires
changes at the protocol/network layer, not the resolver algorithm.
144 .codeflash/pypa/pip/data/benchmarks.md (Normal file)
@@ -0,0 +1,144 @@
# pip End-to-End Performance: `main` vs `codeflash/optimize`

**Branch:** `codeflash/optimize` (118 commits ahead of `main`)
**Environment:** Python 3.15.0a7 | macOS arm64 (Apple Silicon) | ~27 packages installed | HTTP cache warm
**Tool:** hyperfine (5-10 runs, 2-3 warmup)

---

## Startup

| Benchmark | Main | Optimized | Delta | Speedup |
|-----------|-----:|----------:|------:|--------:|
| `pip --version` | 138 ms | **20 ms** | -118 ms | **7.0x** |
| `pip --help` | 143 ms | **121 ms** | -22 ms | **1.18x** |
| `pip install --help` | 207 ms | 208 ms | +1 ms | ~1.0x |

## Package Operations

| Benchmark | Main | Optimized | Delta | Speedup |
|-----------|-----:|----------:|------:|--------:|
| `pip list` | 162 ms | **146 ms** | -16 ms | **1.11x** |
| `pip freeze` | 225 ms | **211 ms** | -14 ms | **1.07x** |
| `pip show pip` | 162 ms | **148 ms** | -14 ms | **1.09x** |
| `pip check` | 191 ms | **174 ms** | -17 ms | **1.10x** |

## Dependency Resolution

Cached HTTP responses, `--dry-run --ignore-installed` to force full resolution.

| Benchmark | Main | Optimized | Delta | Speedup |
|-----------|-----:|----------:|------:|--------:|
| `requests` (simple, ~5 deps) | 589 ms | **516 ms** | -73 ms | **1.14x** |
| `flask + django` (medium, ~15 deps) | 708 ms | **599 ms** | -109 ms | **1.18x** |
| `flask + django + boto3 + requests` (complex, ~30 deps) | 1,493 ms | **826 ms** | **-667 ms** | **1.81x** |
| `fastapi[standard]` (heavy, ~42 deps) | 13,325 ms | **11,664 ms** | **-1,661 ms** | **1.14x** |

## Parsing

| Benchmark | Main | Optimized | Delta | Speedup |
|-----------|-----:|----------:|------:|--------:|
| `install -r requirements.txt` (21 pinned packages, `--no-deps`) | 1,344 ms | **740 ms** | **-604 ms** | **1.82x** |

## Import Time

| Benchmark | Main | Optimized | Delta | Speedup |
|-----------|-----:|----------:|------:|--------:|
| `import pip._internal.cli.main` | 50 ms | 50 ms | 0 ms | 1.0x |

> Note: On Python 3.15 the import chain is already fast (50ms). The `--version`
> fast path bypasses this import entirely, which is why `pip --version` is 7x faster.

---

## Totals

| | Main | Optimized | Speedup |
|-|-----:|----------:|--------:|
| **All benchmarks (sum)** | **18,717 ms** | **15,223 ms** | **1.23x (18.7% faster)** |
| **Excluding fastapi[standard]** | **5,392 ms** | **3,559 ms** | **1.51x (34.0% faster)** |
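
The summary rows reduce to straightforward arithmetic over the stated sums:

```python
main_ms, optimized_ms = 18_717, 15_223           # all benchmarks
speedup = main_ms / optimized_ms                  # -> 1.23x
pct_faster = (1 - optimized_ms / main_ms) * 100   # -> 18.7%

main_nf, opt_nf = 5_392, 3_559                    # excluding fastapi[standard]
speedup_nf = main_nf / opt_nf
pct_faster_nf = (1 - opt_nf / main_nf) * 100      # -> 34.0%
```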

---

## Top Improvements

| Rank | Benchmark | Improvement | Time Saved |
|-----:|-----------|------------:|-----------:|
| 1 | `resolve: fastapi[standard]` | **12.5%** | 1,661 ms |
| 2 | `resolve: flask+django+boto3+requests` | **44.7%** | 667 ms |
| 3 | `install -r requirements.txt` | **44.9%** | 604 ms |
| 4 | `pip --version` | **85.5%** | 118 ms |
| 5 | `resolve: flask+django` | **15.4%** | 109 ms |
| 6 | `resolve: requests` | **12.4%** | 73 ms |

---

## What Was Optimized (118 commits)

### 1. Startup
- Ultra-fast `--version` path in `__main__.py` that exits before importing `pip._internal`
- Fast-path `--version` in `cli/main.py` that avoids the `pip._internal.utils.misc` import
- Deferred `base_command.py` import chain to command creation time (saves ~22ms on `--help`)
- Deferred `Configuration` module loading
- Deferred autocompletion imports behind a `PIP_AUTO_COMPLETE` check
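
The deferral pattern behind these startup wins can be sketched as follows; the module names are illustrative (`json` stands in for `pip._internal`), not pip's real entry point:

```python
def main(argv):
    # Fast path: answer trivial queries before importing anything heavy.
    if argv[:1] == ["--version"]:
        return "pip X.Y (fast path)"
    # Slow path: import the heavy machinery only when it is actually needed.
    import json as heavy_internals   # stand-in for the pip._internal chain
    return f"dispatching via {heavy_internals.__name__}"
```

Because the import happens inside the function body, a `--version` invocation never pays for it at all.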

### 2. Dependency Resolver -- Architecture
- **Speculative metadata prefetch**: a background thread downloads PEP 658 metadata for the top candidate while the resolver processes other packages
- **Conditional Criterion rebuild**: `_remove_information_from_criteria` now skips rebuilding unaffected criteria, eliminating ~95% of allocations
- **`__slots__` on Criterion**: reduces per-instance memory by ~100 bytes
- Two-level cache for `_iter_found_candidates` (specifier merge cache + candidate-infos cache)
- Fail-first preference heuristic (`candidate_count` in the resolver preference tuple)
- `ChainMap` delta and plain dict in resolvelib state management
- Parallel index-page prefetch during dependency resolution
- Thread-safe `dist` property on candidates for concurrent metadata access

### 3. Dependency Resolver -- Micro
- Cached wheel-tag priority dict on `TargetPython`
- Pre-extracted requirements tuple on `Criterion` to avoid per-call generator expressions
- Cached specifier merge and candidate infos across resolver backtracking
- Cached `Marker.evaluate()` results for repeated extra lookups
- Cached `_sort_key` results to avoid double evaluation in `compute_best_candidate`
- Hoisted `operator.methodcaller`/`attrgetter` to module-level constants

### 4. Packaging (vendored `pip._vendor.packaging`)
- Replaced the `_tokenizer` dataclass with a `__slots__` class
- Deferred `Version.__hash__` computation until first call
- Integer comparison key (`_cmp_int`) for `Version` and `Specifier` -- avoids full `_key` tuple construction
- Bisect-based `filter_versions` for O(log n + k) batch filtering
- Pre-computed integer bounds on `SpecifierSet` for fast rejection
- Cached parsed `Version` objects in `_coerce_version`
- Cached parsed `Requirement` fields for repeated requirement strings
- Cached the parsed `frozenset` of `Specifier`s in `SpecifierSet`
- Fast-path tokenizer for simple tokens to bypass the regex engine
- Ultra-fast path in `SpecifierSet.contains` for `prereleases=True`
- Pre-computed `is_prerelease`/`is_postrelease` flags at `Version` init
- Direct release-tuple prefix comparison in `_compare_equal` and `_compare_compatible`
- Cached `Specifier.__str__` and `__hash__`
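
The bisect-based batch filter reduces to the standard sorted-range idiom; the integer keys below stand in for the integer comparison keys mentioned above, and the function name is illustrative:

```python
from bisect import bisect_left, bisect_right

def filter_versions(sorted_keys, low, high):
    """Return every key k with low <= k <= high in O(log n + k) time."""
    lo = bisect_left(sorted_keys, low)
    hi = bisect_right(sorted_keys, high)
    return sorted_keys[lo:hi]

keys = [1, 2, 4, 7, 9, 12]
```

Two binary searches replace a full linear scan, which is where the O(log n + k) bound comes from.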

### 5. Link and Wheel Parsing
- Pre-computed `Link._is_wheel` slot to avoid a repeated `splitext` comparison
- Cached URL scheme on `Link` to skip `urlsplit` for `is_vcs`/`is_file`
- Deferred URL path extraction in `Link.from_json` when a filename exists
- Inlined Link construction in `_evaluate_json_page` to skip redundant work
- Direct string extraction replacing `parse_wheel_filename` in the sort path
- `rsplit` instead of three `rfind` calls for wheel tag extraction
- Cached `parse_tag` results to eliminate redundant `Tag` creation

### 6. I/O and Caching
- Replaced pure-Python msgpack with C-level stdlib JSON for cache serialization (backward compatible)
- Increased HTTP connection pool size and prefetch concurrency

### 7. Import Deferral
- Deferred the `base_command.py` import chain to command creation time
- Deferred all Rich imports to first use
- Stripped unused Rich modules from the import chain
- Deferred heavy imports in Rich `console.py` (pretty/pager/scope/screen/export)
- Deferred Rich imports in `progress_bars.py` and `self_outdated_check.py`

### 8. Micro-optimizations
- Bypassed `InstallationCandidate.__init__` with `__new__` + direct slot assignment
- Removed a redundant O(n) subset assertion in `BestCandidateResult`
- Replaced `min()` builtins with inline conditionals in `_cmp_int`
- Cached `Hashes.__hash__` to avoid repeated sort+join computation
- Cached a `Constraint.empty()` singleton to avoid 169K redundant allocations
- Bypassed `email.parser` for metadata parsing
342 .codeflash/pypa/pip/data/coverage-analysis.md (Normal file)
@@ -0,0 +1,342 @@
# Pip Coverage Analysis Report

**Branch:** `codeflash/optimize`
**Date:** 2026-04-08
**Test suite:** `tests/unit/` (1690 passed, 39 skipped, 4 xfailed)
**Tool:** coverage.py with `--source=src/pip`

---

## 1. Summary Statistics

| Metric | Value |
|--------|-------|
| Total files analyzed | 388 |
| Total statements | 42,496 |
| Covered lines | 20,725 |
| Uncovered lines | 21,771 |
| **Overall coverage** | **48.8%** |

### Breakdown by area

| Area | Statements | Covered | Missing | Coverage |
|------|-----------:|--------:|--------:|---------:|
| Vendored (`_vendor/`) | 27,551 | 10,462 | 17,089 | 38.0% |
| Internal (`_internal/`) | 14,902 | 10,260 | 4,642 | 68.8% |
| Other (top-level) | 43 | 3 | 40 | 7.0% |

Over 78% of uncovered code is in vendored dependencies.
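
The headline numbers are internally consistent, as a quick check shows:

```python
covered, total = 20_725, 42_496
overall = covered / total * 100                   # -> 48.8%

vendor_cov = 10_462 / 27_551 * 100                # -> 38.0%
internal_cov = 10_260 / 14_902 * 100              # -> 68.8%
vendor_share = 17_089 / (total - covered) * 100   # vendored share of uncovered lines
```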

---

## 2. Completely Unused Files (0% Coverage)

### Vendor (51 files, 5,411 lines total)

| Lines | Package | File |
|------:|---------|------|
| 586 | msgpack | `src/pip/_vendor/msgpack/fallback.py` |
| 488 | tomli | `src/pip/_vendor/tomli/_parser.py` |
| 421 | urllib3 | `src/pip/_vendor/urllib3/contrib/securetransport.py` |
| 419 | packaging | `src/pip/_vendor/packaging/metadata.py` |
| 346 | distro | `src/pip/_vendor/distro/distro.py` |
| 276 | urllib3 | `src/pip/_vendor/urllib3/contrib/_securetransport/bindings.py` |
| 261 | urllib3 | `src/pip/_vendor/urllib3/contrib/pyopenssl.py` |
| 212 | truststore | `src/pip/_vendor/truststore/_windows.py` |
| 200 | pyproject_hooks | `src/pip/_vendor/pyproject_hooks/_in_process/_in_process.py` |
| 172 | idna | `src/pip/_vendor/idna/uts46data.py` |
| 159 | urllib3 | `src/pip/_vendor/urllib3/contrib/_securetransport/low_level.py` |
| 147 | platformdirs | `src/pip/_vendor/platformdirs/windows.py` |
| 145 | platformdirs | `src/pip/_vendor/platformdirs/android.py` |
| 141 | platformdirs | `src/pip/_vendor/platformdirs/unix.py` |
| 137 | pygments | `src/pip/_vendor/pygments/sphinxext.py` |
| 109 | urllib3 | `src/pip/_vendor/urllib3/contrib/appengine.py` |
| 106 | pygments | `src/pip/_vendor/pygments/lexers/python.py` |
| 90 | urllib3 | `src/pip/_vendor/urllib3/packages/backports/weakref_finalize.py` |
| 84 | pygments | `src/pip/_vendor/pygments/formatters/__init__.py` |
| 81 | idna | `src/pip/_vendor/idna/codec.py` |
| 72 | msgpack | `src/pip/_vendor/msgpack/ext.py` |
| 71 | cachecontrol | `src/pip/_vendor/cachecontrol/heuristics.py` |
| 62 | urllib3 | `src/pip/_vendor/urllib3/contrib/ntlmpool.py` |
| 59 | packaging | `src/pip/_vendor/packaging/licenses/__init__.py` |
| 57 | requests | `src/pip/_vendor/requests/help.py` |
| 40 | pygments | `src/pip/_vendor/pygments/scanner.py` |
| 40 | pygments | `src/pip/_vendor/pygments/unistring.py` |
| 38 | dependency_groups | `src/pip/_vendor/dependency_groups/_pip_wrapper.py` |
| 38 | pygments | `src/pip/_vendor/pygments/console.py` |
| 35 | cachecontrol | `src/pip/_vendor/cachecontrol/_cmd.py` |
| 35 | urllib3 | `src/pip/_vendor/urllib3/packages/backports/makefile.py` |
| 34 | dependency_groups | `src/pip/_vendor/dependency_groups/_lint_dependency_groups.py` |
| 34 | tomli | `src/pip/_vendor/tomli/_re.py` |
| 30 | dependency_groups | `src/pip/_vendor/dependency_groups/__main__.py` |
| 30 | pygments | `src/pip/_vendor/pygments/formatter.py` |
| 26 | truststore | `src/pip/_vendor/truststore/_openssl.py` |
| 25 | platformdirs | `src/pip/_vendor/platformdirs/__main__.py` |
| 23 | msgpack | `src/pip/_vendor/msgpack/__init__.py` |
| 17 | msgpack | `src/pip/_vendor/msgpack/exceptions.py` |
| 11 | packaging | `src/pip/_vendor/packaging/licenses/_spdx.py` |
| 8 | certifi | `src/pip/_vendor/certifi/__main__.py` |
| 8 | idna | `src/pip/_vendor/idna/compat.py` |
| 8 | rich | `src/pip/_vendor/rich/pager.py` |
| 6 | dependency_groups | `src/pip/_vendor/dependency_groups/_toml_compat.py` |
| 6 | pygments | `src/pip/_vendor/pygments/__main__.py` |
| 4 | rich | `src/pip/_vendor/rich/_export_format.py` |
| 4 | tomli | `src/pip/_vendor/tomli/_types.py` |
| 3 | distro | `src/pip/_vendor/distro/__init__.py` |
| 3 | distro | `src/pip/_vendor/distro/__main__.py` |
| 3 | tomli | `src/pip/_vendor/tomli/__init__.py` |
| 1 | pygments | `src/pip/_vendor/pygments/formatters/_mapping.py` |

### Internal (2 files, 79 lines total)

| Lines | Module | File |
|------:|--------|------|
| 75 | locations | `src/pip/_internal/locations/_distutils.py` |
| 4 | (root) | `src/pip/_internal/main.py` |

### Other (2 files, 38 lines total)

| Lines | File |
|------:|------|
| 21 | `src/pip/__pip-runner__.py` |
| 17 | `src/pip/__main__.py` |

---

## 3. Nearly Dead Files (<10% coverage)

No files fall in the 1-9% range. All partially-covered files have at least 10% coverage.

---

## 4. Top 30 Files by Uncovered Line Count

These files contain the most dead code by absolute count.

| Missing | Total | Cov% | Area | File |
|--------:|------:|-----:|------|------|
| 1,022 | 1,272 | 19.7% | vendor | `src/pip/_vendor/distlib/util.py` |
| 888 | 1,561 | 43.1% | vendor | `src/pip/_vendor/pkg_resources/__init__.py` |
| 586 | 586 | 0.0% | vendor | `src/pip/_vendor/msgpack/fallback.py` |
| 488 | 488 | 0.0% | vendor | `src/pip/_vendor/tomli/_parser.py` |
| 480 | 996 | 51.8% | vendor | `src/pip/_vendor/rich/console.py` |
| 421 | 421 | 0.0% | vendor | `src/pip/_vendor/urllib3/contrib/securetransport.py` |
| 419 | 419 | 0.0% | vendor | `src/pip/_vendor/packaging/metadata.py` |
| 381 | 466 | 18.2% | vendor | `src/pip/_vendor/pygments/lexer.py` |
| 346 | 346 | 0.0% | vendor | `src/pip/_vendor/distro/distro.py` |
| 315 | 396 | 20.5% | vendor | `src/pip/_vendor/rich/pretty.py` |
| 310 | 481 | 35.6% | vendor | `src/pip/_vendor/requests/utils.py` |
| 310 | 622 | 50.2% | vendor | `src/pip/_vendor/rich/progress.py` |
| 276 | 276 | 0.0% | vendor | `src/pip/_vendor/urllib3/contrib/_securetransport/bindings.py` |
| 271 | 632 | 57.1% | vendor | `src/pip/_vendor/packaging/specifiers.py` |
| 261 | 344 | 24.1% | vendor | `src/pip/_vendor/rich/syntax.py` |
| 261 | 261 | 0.0% | vendor | `src/pip/_vendor/urllib3/contrib/pyopenssl.py` |
| 257 | 292 | 12.0% | vendor | `src/pip/_vendor/idna/core.py` |
| 234 | 606 | 61.4% | vendor | `src/pip/_vendor/rich/text.py` |
| 232 | 503 | 53.9% | vendor | `src/pip/_vendor/urllib3/packages/six.py` |
| 226 | 424 | 46.7% | vendor | `src/pip/_vendor/urllib3/response.py` |
| 225 | 424 | 46.9% | vendor | `src/pip/_vendor/rich/style.py` |
| 218 | 290 | 24.8% | vendor | `src/pip/_vendor/rich/traceback.py` |
| 216 | 397 | 45.6% | internal | `src/pip/_internal/resolution/resolvelib/factory.py` |
| 212 | 212 | 0.0% | vendor | `src/pip/_vendor/truststore/_windows.py` |
| 200 | 200 | 0.0% | vendor | `src/pip/_vendor/pyproject_hooks/_in_process/_in_process.py` |
| 197 | 455 | 56.7% | vendor | `src/pip/_vendor/requests/models.py` |
| 190 | 235 | 19.1% | vendor | `src/pip/_vendor/cachecontrol/controller.py` |
| 173 | 764 | 77.4% | internal | `src/pip/_internal/index/package_finder.py` |
| 172 | 172 | 0.0% | vendor | `src/pip/_vendor/idna/uts46data.py` |
| 171 | 288 | 40.6% | internal | `src/pip/_internal/commands/install.py` |

---

## 5. Vendored Package Breakdown

Sorted by uncovered (dead) lines, most dead code first.

| Package | Files | Total | Covered | Missing | Coverage |
|---------|------:|------:|--------:|--------:|---------:|
| rich | 62 | 6,796 | 3,156 | 3,640 | 46.4% |
| urllib3 | 39 | 4,883 | 1,858 | 3,025 | 38.1% |
| pygments | 23 | 1,782 | 325 | 1,457 | 18.2% |
| packaging | 17 | 3,065 | 1,654 | 1,411 | 54.0% |
| distlib | 5 | 1,715 | 477 | 1,238 | 27.8% |
| requests | 18 | 2,176 | 1,136 | 1,040 | 52.2% |
| pkg_resources | 1 | 1,561 | 673 | 888 | 43.1% |
| **msgpack** | **4** | **698** | **0** | **698** | **0.0%** |
| platformdirs | 8 | 787 | 233 | 554 | 29.6% |
| idna | 8 | 592 | 50 | 542 | 8.4% |
| **tomli** | **4** | **529** | **0** | **529** | **0.0%** |
| cachecontrol | 12 | 678 | 175 | 503 | 25.8% |
| truststore | 6 | 715 | 233 | 482 | 32.6% |
| **distro** | **3** | **352** | **0** | **352** | **0.0%** |
| pyproject_hooks | 4 | 323 | 80 | 243 | 24.8% |
| resolvelib | 9 | 457 | 285 | 172 | 62.4% |
| dependency_groups | 6 | 201 | 75 | 126 | 37.3% |
| tomli_w | 2 | 130 | 27 | 103 | 20.8% |
| certifi | 3 | 38 | 17 | 21 | 44.7% |

**Bold** = entirely unused (0% coverage across all files in the package).

**Totals:** 27,551 vendor lines, 10,462 covered (38.0%), 17,089 uncovered.
---

## 6. Dead Code Hotspots in pip Internals

### Internal module breakdown

| Module | Files | Total | Covered | Missing | Coverage |
|--------|------:|------:|--------:|--------:|---------:|
| commands | 19 | 1,601 | 648 | 953 | 40.5% |
| resolution | 13 | 1,579 | 927 | 652 | 58.7% |
| cli | 13 | 1,268 | 828 | 440 | 65.3% |
| req | 8 | 1,384 | 1,015 | 369 | 73.3% |
| operations | 12 | 1,036 | 677 | 359 | 65.3% |
| utils | 28 | 1,729 | 1,409 | 320 | 81.5% |
| (root) | 9 | 1,194 | 887 | 307 | 74.3% |
| metadata | 8 | 958 | 672 | 286 | 70.1% |
| vcs | 6 | 829 | 563 | 266 | 67.9% |
| locations | 4 | 371 | 123 | 248 | 33.2% |
| index | 4 | 1,113 | 904 | 209 | 81.2% |
| network | 8 | 920 | 785 | 135 | 85.3% |
| models | 13 | 789 | 723 | 66 | 91.6% |
| distributions | 5 | 131 | 99 | 32 | 75.6% |

### Large contiguous uncovered blocks in pip internals (>= 10 lines)

These are likely entire unused functions, methods, or code branches.

| Size | Lines | File |
|-----:|-------|------|
| 16 | 212-227 | `src/pip/_internal/commands/show.py` |
| 16 | 1175-1190 | `src/pip/_internal/index/package_finder.py` |
| 16 | 475-490 | `src/pip/_internal/metadata/base.py` |
| 13 | 75-87 | `src/pip/_internal/locations/_sysconfig.py` |
| 13 | 423-435 | `src/pip/_internal/models/link.py` |
| 13 | 24-36 | `src/pip/_internal/utils/pylock.py` |
| 12 | 103-114 | `src/pip/_internal/resolution/resolvelib/found_candidates.py` |
| 12 | 142-153 | `src/pip/_internal/vcs/subversion.py` |
| 11 | 284-294 | `src/pip/_internal/commands/list.py` |
| 11 | 388-398 | `src/pip/_internal/commands/list.py` |
| 11 | 93-103 | `src/pip/_internal/operations/install/wheel.py` |
| 10 | 533-542 | `src/pip/_internal/commands/install.py` |
| 10 | 632-641 | `src/pip/_internal/commands/install.py` |
| 10 | 621-630 | `src/pip/_internal/req/req_uninstall.py` |
| 10 | 69-78 | `src/pip/_internal/resolution/resolvelib/found_candidates.py` |
| 10 | 97-106 | `src/pip/_internal/utils/filesystem.py` |
| 10 | 104-113 | `src/pip/_internal/wheel_builder.py` |

Total: 17 blocks, 204 lines of contiguous dead code in internals.

**Note:** The `commands` module has 953 uncovered lines (40.5% coverage). This is expected because unit tests do not exercise most CLI command handlers -- those are covered by functional tests (which were not included in this analysis). The unit tests primarily exercise library/utility code.

---

## 7. Never-Imported Modules During Typical Usage

Running `pip install --dry-run requests` imported **337 pip modules**. The following were **never imported** during that operation.

### Never-imported vendor modules (54 modules)

**Entirely unused vendor packages:**

- `pip._vendor.msgpack` (all 4 modules) -- serialization library, not used at runtime
- `pip._vendor.tomli` (all 4 modules) -- TOML parser, not needed for install
- `pip._vendor.distro` (all 3 modules) -- Linux distribution detection, not needed on macOS/for install
- `pip._vendor.tomli_w` (2 modules) -- TOML writer

**Unused vendor submodules (platform-specific / optional features):**

- `pip._vendor.truststore._windows`, `pip._vendor.truststore._openssl` -- platform-specific TLS backends
- `pip._vendor.platformdirs.windows`, `pip._vendor.platformdirs.android`, `pip._vendor.platformdirs.unix` -- wrong-platform dirs
- `pip._vendor.urllib3.contrib.*` (securetransport, pyopenssl, appengine, ntlmpool, socks, backports) -- optional urllib3 extras
- `pip._vendor.idna.codec`, `pip._vendor.idna.compat`, `pip._vendor.idna.uts46data` -- IDNA codec/compat, rarely needed
- `pip._vendor.cachecontrol._cmd`, `pip._vendor.cachecontrol.heuristics` -- CLI/heuristic features unused by pip
- `pip._vendor.packaging.metadata`, `pip._vendor.packaging.licenses` -- packaging metadata/license handling
- `pip._vendor.dependency_groups.*` (all 5 modules) -- dependency group resolution
- `pip._vendor.requests.help` -- requests debug info
- `pip._vendor.rich` partial: `_export_format`, `_spinners`, `ansi`, `file_proxy`, `filesize`, `live`, `live_render`, `pager`, `progress`, `progress_bar`, `screen`, `spinner`
- `pip._vendor.certifi.__main__` -- certifi CLI
- `pip._vendor.pygments` (most submodules) -- syntax highlighting, not used in the install path

### Never-imported internal modules (26 modules)

Most are **command modules not used during `install`**:

- `pip._internal.commands.cache`
- `pip._internal.commands.check`
- `pip._internal.commands.completion`
- `pip._internal.commands.configuration`
- `pip._internal.commands.debug`
- `pip._internal.commands.download`
- `pip._internal.commands.freeze`
- `pip._internal.commands.hash`
- `pip._internal.commands.help`
- `pip._internal.commands.index`
- `pip._internal.commands.inspect`
- `pip._internal.commands.list`
- `pip._internal.commands.lock`
- `pip._internal.commands.search`
- `pip._internal.commands.show`
- `pip._internal.commands.uninstall`
- `pip._internal.commands.wheel`

Other never-imported internals:

- `pip._internal.locations._distutils` -- legacy distutils location support
- `pip._internal.main` -- thin wrapper, bypassed in tests
- `pip._internal.metadata.pkg_resources` -- legacy metadata backend
- `pip._internal.network.xmlrpc` -- XML-RPC client (for `pip search`)
- `pip._internal.operations.freeze` -- freeze operation
- `pip._internal.resolution.legacy` (2 modules) -- legacy resolver
- `pip._internal.utils._jaraco_text` -- text utility

---

## 8. Recommendations

### High-impact: Entirely unused vendor packages

These packages have **0% coverage** and were **never imported** during install. They are candidates for removal or lazy-loading.

| Package | Lines | Recommendation |
|---------|------:|----------------|
| **msgpack** | 698 | Already replaced by JSON caching (per commit `070099c01`). Can likely be fully removed from vendor. |
| **tomli** | 529 | Python 3.11+ has `tomllib` in the stdlib. If pip's minimum is 3.11+, this is dead weight. Otherwise needed for <3.11. |
| **distro** | 352 | Only used on Linux for distro detection. Already lazy-imported. Could be skipped entirely on macOS/Windows. |

**Potential savings: ~1,579 lines of vendor code.**

### Medium-impact: Heavily unused vendor code

| Package | Missing Lines | Notes |
|---------|-------------:|-------|
| rich | 3,640 | Pip uses a small fraction of rich. Consider vendoring only the needed subset. |
| urllib3 `contrib/` | ~1,289 | securetransport, pyopenssl, appengine, ntlmpool, socks, backports -- all 0% coverage. Platform-specific or optional. |
| pygments | 1,457 | 18.2% coverage. Pip only uses basic lexing. Most formatters, lexers, and utilities are unused. |
| distlib | 1,238 | `util.py` alone has 1,022 uncovered lines. Much of distlib is unused. |
| pkg_resources | 888 | Legacy metadata backend. 43.1% coverage. Being phased out. |

### Low-impact: Internal pip dead code

The internal pip code is reasonably well-covered at 68.8%. The uncovered code is mostly:

1. **Command handlers** (953 lines) -- Expected. These are tested by functional tests, not unit tests. Not actually dead.
2. **Legacy resolver** (`resolution/legacy/`) -- Never imported during install. Could be lazy-loaded or gated.
3. **Platform-specific paths** (distutils locations, Windows/Linux branches) -- Not dead, just not exercised on macOS.
4. **VCS backends** (subversion, mercurial) -- Only used when installing from VCS URLs.

### Lazy-loading opportunities

These modules are never imported during a standard `pip install` but are needed for other commands:

- All command modules except `install` -- already lazy-loaded via command discovery
- `pip._internal.resolution.legacy` -- could gate behind a flag check
- `pip._internal.metadata.pkg_resources` -- could lazy-import
- `pip._internal.network.xmlrpc` -- only used by the deprecated `pip search`
- `pip._vendor.pygments` -- only needed for `--verbose` or rich output formatting
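Deferring an import until first use is straightforward with `importlib`. The sketch below is a generic lazy-import shim, not pip's actual mechanism (pip's command discovery works differently); `LazyModule` is a hypothetical name for illustration:

```python
import importlib

class LazyModule:
    """Defer importing a module until an attribute is first accessed."""

    def __init__(self, name):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        # Only called for attributes not already on the instance,
        # so _name/_module lookups above do not recurse here.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)

# The import cost is paid only if/when the module is actually used.
xmlrpc_client = LazyModule("xmlrpc.client")
print(xmlrpc_client.dumps(("hello",)))  # import happens here, on first use
```

The same pattern would let `pip._internal.network.xmlrpc` or `pip._vendor.pygments` stay off the import path for a plain `pip install`.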
### Summary of removable/reducible code

| Category | Estimated Removable Lines |
|----------|-------------------------:|
| Entirely unused vendor packages (msgpack, tomli, distro) | ~1,579 |
| Unused vendor submodules (urllib3 contrib, pygments extras, etc.) | ~2,500 |
| Never-imported vendor utility modules (`__main__`, CLI tools, etc.) | ~400 |
| Total potential reduction | **~4,500 lines** |

This represents roughly **10.6% of all pip source code** that could potentially be removed or lazy-loaded.
.codeflash/pypa/pip/data/io-analysis.md (543 lines, new file)

@@ -0,0 +1,543 @@
# Pip I/O Layer Deep Analysis

Investigation date: 2026-04-08
Branch: `codeflash/optimize`
Investigator: Research agent

---

## 1. Request Flow Diagram

```
User: pip install <pkg>
    |
    v
Resolver (resolvelib)
    |
    +-- provider.get_dependencies(candidate)
    |     +-- prefetch_packages(dep_names)          [background threads]
    |
    +-- provider.find_matches(identifier)
    |     +-- factory.find_candidates()
    |           +-- finder.find_best_candidate(name)
    |                 +-- finder.find_all_candidates(name)
    |                       +-- [check _all_candidates cache]
    |                       +-- [check _prefetch_futures]
    |                       +-- _do_fetch_all_candidates(name)
    |                             +-- link_collector.collect_sources(name)
    |                             |     +-- search_scope.get_index_urls_locations(name)
    |                             |           # => ["https://pypi.org/simple/<name>/"]
    |                             +-- source.page_candidates()
    |                                   +-- process_project_url(url)
    |                                         +-- link_collector.fetch_response(url)
    |                                         |     +-- _get_index_content(url)
    |                                         |           +-- _get_simple_response(url, session)
    |                                         |                 |
    |                                         |                 v
    |                                         |             session.get(url, headers={
    |                                         |                 Accept: "application/vnd.pypi.simple.v1+json, ...",
    |                                         |                 Cache-Control: "max-age=0"
    |                                         |             })
    |                                         |                 |
    |                                         |                 v
    |                                         |             CacheControlAdapter.send()
    |                                         |               +-- controller.cached_request()
    |                                         |               |     # max-age=0 => ALWAYS bypasses cache
    |                                         |               |     # Adds If-None-Match / If-Modified-Since
    |                                         |               +-- controller.conditional_headers()
    |                                         |               +-- HTTPAdapter.send()
    |                                         |               |     +-- urllib3.HTTPSConnectionPool.urlopen()
    |                                         |               |           +-- _get_conn()   [from pool queue]
    |                                         |               |           +-- TLS handshake (if new conn)
    |                                         |               |           +-- HTTP/1.1 GET request
    |                                         |               |           +-- _put_conn()   [return to pool]
    |                                         |               +-- controller.cache_response()
    |                                         |                     # Stores response w/ ETag for
    |                                         |                     # future conditional requests
    |                                         +-- [JSON] _evaluate_json_page()
    |                                         +-- [HTML] parse_links() + evaluate_links()
    |
    +-- candidate.dist                               [triggers metadata fetch]
          +-- _prepare()
                +-- preparer.prepare_linked_requirement()
                      +-- _fetch_metadata_only()
                            +-- [1] _fetch_metadata_using_link_data_attr()
                            |         # PEP 658: GET <url>.metadata
                            +-- [2] _fetch_metadata_using_lazy_wheel()
                            |         # HTTP Range requests on .whl
                            |         +-- LazyZipOverHTTP(url, session)
                            |               +-- session.head(url)   # get Content-Length
                            |               +-- _check_zip()        # range-fetch tail
                            |               +-- ZipFile(self)       # parse EOCD
                            +-- [3] Full download as fallback
```

### Request Count Per Package (typical PyPI resolution)

For each **unique package name** the resolver encounters:

1. **1 GET** for the index page (`/simple/<name>/`) -- conditional if cached
2. **1 GET** for metadata (PEP 658 `.metadata` file) -- OR --
   **1 HEAD + 1-2 Range GETs** for lazy wheel metadata -- OR --
   **1 GET** full wheel download as fallback
3. **1 GET** for the actual wheel download (after resolution)

For a workload like `boto3` with ~40 transitive deps:

- ~40 index page GETs (conditional requests)
- ~40 metadata GETs (PEP 658 when available)
- ~40 wheel download GETs
- **Total: ~120 HTTP requests minimum**

---

## 2. Per-Area Findings

### 2.1 HTTP Request Flow

**How requests are serialized:**

- The resolver processes packages **sequentially** through resolvelib's `resolve()` loop
- Each `find_matches()` call triggers `find_all_candidates()`, which fetches the index page **synchronously** (unless prefetched)
- Each `get_dependencies()` call triggers `candidate.dist`, which fetches metadata **synchronously** (unless prefetched)

**Existing parallelism (two separate thread pools):**

1. **Index page prefetch** (`PackageFinder._prefetch_executor`): 16 worker threads
   - Triggered in `provider.get_dependencies()` for all discovered deps
   - Triggered in `resolver.resolve()` for all root requirements
   - Workers call `_do_fetch_all_candidates()`, which runs the full index fetch + evaluate pipeline
2. **Metadata prefetch** (`Factory._metadata_prefetch_executor`): 8 worker threads
   - Triggered in `_iter_found_candidates()` for the top candidate only
   - Workers access `candidate.dist`, which triggers PEP 658 / lazy wheel fetching

**Key finding: the two prefetch mechanisms are independent and both effective, but they don't coordinate.** The metadata prefetch for package B can't start until B's index page fetch completes. There is no pipelining of "index fetch -> immediately prefetch top candidate metadata."

**Redundant requests found:**

- `LazyZipOverHTTP.__init__()` sends a HEAD request (line 57 of lazy_wheel.py). If PEP 658 metadata is available, this HEAD is **never needed** -- the code tries PEP 658 first and only falls back to lazy wheel. This is correct and not redundant.
- However, the HEAD request in `LazyZipOverHTTP` is sent **even if the wheel doesn't support range requests**, wasting one round trip before discovering this.
- `_get_simple_response()` sends a HEAD before GET only if the URL looks like an archive (lines 120-121 of collector.py). This is a rare case and correctly guarded.

### 2.2 Connection Reuse & Pooling

**Current configuration (session.py lines 388-389):**

```python
_pool_connections = 20  # urllib3 PoolManager caches pools for 20 distinct hosts
_pool_maxsize = 16      # Each pool keeps up to 16 idle connections
```

**Analysis:**

- The pool is correctly sized for 16 prefetch workers
- `pool_block=False` (the default) means excess connections proceed but aren't returned to the pool
- **Connections ARE reused** for same-host requests through urllib3's `HTTPSConnectionPool._get_conn()` / `_put_conn()` mechanism
- HTTP/1.1 keep-alive works by default (urllib3 uses persistent connections)
- The connection pool is per-(host, port, scheme), so `pypi.org:443` and `files.pythonhosted.org:443` each get their own pool
- **A typical pip install touches only 2-3 hosts**: `pypi.org` (index pages), `files.pythonhosted.org` (wheel downloads, metadata), and possibly an extra index. A pool of 20 is more than adequate.

**TLS handshake analysis:**

- A TLS handshake happens **once per connection** (not per request)
- With `pool_maxsize=16`, up to 16 connections are kept alive per host
- The 16 prefetch threads can each hold a connection, so in theory all 16 reuse their TLS sessions
- **Risk:** If more than 16 requests fire concurrently to the same host, excess connections are created and then **discarded** (not pooled), causing extra TLS handshakes. With `pool_block=False`, they proceed but the connection is thrown away after use.

**Finding:** The pool is sized well for the current prefetch concurrency. No wasted TLS handshakes under normal operation.
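The `_get_conn()` / `_put_conn()` reuse cycle described above can be modelled with a stdlib queue. This is a simplified sketch of the idea, not urllib3's actual implementation; `FakeConnection` and `MiniPool` are hypothetical names:

```python
import queue

class FakeConnection:
    """Stands in for an established (TLS-handshaken) connection."""
    def __init__(self, host):
        self.host = host
        self.requests_served = 0

class MiniPool:
    """Simplified model of urllib3's per-host connection pool."""
    def __init__(self, host, maxsize=16):
        self.host = host
        self._pool = queue.LifoQueue(maxsize)

    def _get_conn(self):
        try:
            return self._pool.get_nowait()    # reuse an idle connection
        except queue.Empty:
            return FakeConnection(self.host)  # the "TLS handshake" cost lands here

    def _put_conn(self, conn):
        try:
            self._pool.put_nowait(conn)       # return to the pool for reuse
        except queue.Full:
            pass                              # excess connections are discarded

    def urlopen(self):
        conn = self._get_conn()
        conn.requests_served += 1
        self._put_conn(conn)
        return conn

pool = MiniPool("pypi.org")
first = pool.urlopen()
second = pool.urlopen()
print(first is second, first.requests_served)  # True 2 -> one handshake, two requests
```

The `queue.Full` branch is the "discarded, not pooled" behaviour flagged in the risk note: once the pool is at capacity, extra connections are simply dropped after use.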
### 2.3 Caching Layer

**How CacheControl works with pip:**

1. **Index pages (`/simple/<name>/`):** sent with a `Cache-Control: max-age=0` header
   - The CacheControl controller sees `max-age=0` and **always bypasses the cache** (controller.py lines 184-186)
   - But it adds conditional headers (`If-None-Match`, `If-Modified-Since`) via `conditional_headers()`
   - On 304 Not Modified, the cached response is served (no body transfer)
   - On 200, the response is cached with its ETag for next time
   - **This is working as intended** -- it ensures freshness while avoiding re-downloading unchanged index pages
2. **Package downloads (wheels, sdists):** sent via `Downloader._http_get()` with `Accept-Encoding: identity`
   - No `Cache-Control: max-age=0` header on these requests
   - CacheControl can serve fully cached responses for packages that haven't changed
   - `SafeFileCache` stores metadata + body as separate files on disk
   - The cache key is the full URL (after normalization)
3. **PEP 658 metadata files:** fetched via `get_http_url()` using the Downloader
   - Same caching behavior as package downloads
   - Small files (~5-50KB), cached effectively
4. **Lazy wheel range requests:** sent with `Cache-Control: no-cache`
   - **Explicitly bypasses caching** (lazy_wheel.py line 180)
   - This is correct -- range requests for ZIP metadata shouldn't be cached as full responses

**Cache efficiency finding:**

- The `max-age=0` on index pages means **every resolution always incurs at least one conditional round-trip per package**. This is the single biggest I/O constraint for warm-cache scenarios.
- For a `pip install --upgrade` with a warm cache, all 40 index page requests still go to the network (as conditional GETs), but most return 304 with no body. Each 304 round-trip costs ~50-100ms (RTT to pypi.org).
- **Total warm-cache overhead: 40 * ~80ms = ~3.2 seconds** just in sequential conditional GETs (partially parallelized by prefetch).
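The revalidation step can be modelled in a few lines. This is a sketch of the idea only, not CacheControl's real API; `conditional_headers` here is a free function taking a dict of cached response headers:

```python
def conditional_headers(cached_headers):
    """Build revalidation headers from a previously cached response's headers."""
    headers = {}
    etag = cached_headers.get("ETag")
    last_modified = cached_headers.get("Last-Modified")
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers

cached = {"ETag": '"abc123"', "Last-Modified": "Tue, 01 Apr 2025 00:00:00 GMT"}
print(conditional_headers(cached))
# If the server answers 304 Not Modified, the cached body is served unchanged;
# the round-trip is paid, but the body transfer is not.
```

This is why `max-age=0` saves bandwidth but not latency: each package still costs one RTT even when nothing changed.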
### 2.4 Metadata Fetching

**Fallback chain (prepare.py `_fetch_metadata_only()`):**

1. **PEP 658 metadata** (`_fetch_metadata_using_link_data_attr()`):
   - Checks `link.metadata_link()` -- the link must have a `data-dist-info-metadata` or `core-metadata` attribute
   - If present, downloads the separate `.metadata` file (tiny: 5-50KB)
   - **PyPI supports PEP 658** for all wheels uploaded after ~2023
   - This is the fastest path: a single small GET
2. **Lazy wheel** (`_fetch_metadata_using_lazy_wheel()`):
   - Requires the `--use-feature=fast-deps` flag
   - Sends HEAD to get Content-Length and check Accept-Ranges
   - Downloads the tail of the wheel (ZIP end-of-central-directory) via range requests
   - Parses the ZIP to find the METADATA file, downloads just that range
   - **2-4 HTTP requests per wheel** (HEAD + 1-3 range GETs)
   - Has a `_lazy_wheel_cache` to avoid redundant range requests for the same URL
3. **Full download** (fallback):
   - Downloads the entire wheel/sdist
   - For wheels: extracts metadata from the archive
   - For sdists: runs `setup.py egg_info` or a `pyproject.toml` build
   - **Most expensive path**

**Key finding:** PEP 658 is the dominant path for PyPI packages. The speculative metadata prefetch (factory.py) eagerly builds the top candidate and submits a background thread to fetch its metadata. This overlaps metadata I/O with resolution logic.

**Optimization in place:** `_lazy_wheel_cache` (prepare.py line 288) prevents duplicate range requests when a package is evaluated with different extras (e.g., `pkg` and `pkg[extra]`).
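The three-step fallback amounts to "first strategy that returns something wins". A sketch with stub fetchers (the real pip methods and signatures differ; the strategy names and the `.whl`-suffix check below are illustrative assumptions):

```python
def fetch_metadata(link, fetchers):
    """Try each metadata strategy in order; the first non-None result wins."""
    for name, fetch in fetchers:
        result = fetch(link)
        if result is not None:
            return name, result
    raise RuntimeError(f"no metadata source worked for {link}")

fetchers = [
    # PEP 658: one tiny GET for the .metadata file (pretend wheels support it)
    ("pep658", lambda link: "METADATA" if link.endswith(".whl") else None),
    # Lazy wheel: HEAD + Range GETs (disabled in this sketch)
    ("lazy_wheel", lambda link: None),
    # Full download: always works, most expensive
    ("full_download", lambda link: "METADATA-from-archive"),
]

print(fetch_metadata("pkg-1.0-py3-none-any.whl", fetchers))  # ('pep658', 'METADATA')
print(fetch_metadata("pkg-1.0.tar.gz", fetchers))            # ('full_download', 'METADATA-from-archive')
```

The ordering matters: each step down the chain trades more HTTP round-trips and bytes for broader applicability.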
### 2.5 DNS & TLS

**DNS resolution:**

- urllib3 delegates to Python's `socket.create_connection()`, which calls `getaddrinfo()`
- **No DNS caching in urllib3 or pip** -- they rely on the OS-level DNS cache
- However, connection pooling effectively caches DNS results because connections persist
- With 16 pool connections to `pypi.org`, DNS is resolved at most once per connection creation

**TLS handshakes:**

- One TLS handshake per connection (not per request)
- Connection pooling limits handshakes to `pool_maxsize` (16) per host
- Python's `ssl` module handles TLS session resumption at the OpenSSL level
- `_SSLContextAdapterMixin` (session.py line 255) properly forwards the SSL context to pools

**Finding:** DNS and TLS are not significant bottlenecks. The connection pool effectively amortizes both costs. Pre-warming is not needed because the first batch of prefetch requests creates all needed connections.
### 2.6 HTTP/2 and Protocol

**Current state: pip uses HTTP/1.1 exclusively.**

- The vendored `urllib3` (on the 1.x/2.x line) does not support HTTP/2
- The vendored `requests` library has no HTTP/2 support
- There are **no references** to HTTP/2, h2, or hyper anywhere in pip's codebase

**Would HTTP/2 help?**

- **Index page fetches:** HTTP/2 multiplexing would allow sending all ~40 index page requests over a **single TCP connection** to pypi.org. Currently, each of the 16 prefetch threads uses its own connection. With HTTP/2, one connection handles all requests, eliminating 15 TLS handshakes and reducing head-of-line blocking.
- **Metadata fetches:** similarly multiplexed over the same connection.
- **Package downloads:** less benefit -- these are large sequential downloads.

**Estimated benefit:** For index-heavy workloads (many small packages), HTTP/2 could reduce the connection setup overhead by ~90% and improve throughput by 20-30% due to multiplexing.

**What it would take:**

- Replace the vendored `requests`/`urllib3` with `httpx` (supports HTTP/2 via `h2`), or add `h2` support to urllib3
- A major architectural change -- it affects all of pip's network layer
- PyPI's CDN (Fastly) already supports HTTP/2
### 2.7 Parallel I/O Architecture

**Index page prefetch (PackageFinder):**

```python
# package_finder.py lines 1535-1556
def prefetch_packages(self, project_names):
    with self._prefetch_lock:
        for name in project_names:
            if name in self._all_candidates or name in self._prefetch_futures:
                continue
            if self._prefetch_executor is None:
                self._prefetch_executor = ThreadPoolExecutor(max_workers=16)
            self._prefetch_futures[name] = self._prefetch_executor.submit(
                self._do_fetch_all_candidates, name
            )
```

- Called from two places:
  1. `resolver.resolve()` -- submits all root requirements upfront
  2. `provider.get_dependencies()` -- submits all discovered deps
- Workers run `_do_fetch_all_candidates()`, which runs the full pipeline:
  collect_sources -> fetch_response -> parse/evaluate
- Results are cached in the `_all_candidates` dict
- `find_all_candidates()` checks futures with a 10s timeout

**Metadata prefetch (Factory):**

```python
# factory.py lines 188-245
def _prefetch_top_candidate_metadata(self, name, top_info, extras, template):
    # Build top candidate eagerly (cheap: wheel-cache lookup)
    candidate = build_func()
    # Only prefetch for remote wheels
    if link.is_file or not link.is_wheel:
        return

    def _do_prefetch():
        candidate.dist  # triggers prepare_linked_requirement()

    # Submit to 8-thread pool
    self._metadata_prefetch_executor.submit(_do_prefetch)
```

**Serialization points that force sequential I/O:**

1. **resolvelib's main loop is single-threaded.** Each round processes one package at a time. Even with prefetching, the resolver can only consume one result at a time.
2. **`_complete_partial_requirements()`** (prepare.py line 474) downloads all "needs more preparation" requirements **sequentially** via `self._download.batch()` -- which is just a for-loop, not actually batched/parallel.
3. **The `Downloader.batch()` method** (download.py lines 179-184) is misleadingly named -- it's a sequential for-loop:

   ```python
   def batch(self, links, location):
       for link in links:
           filepath, content_type = self(link, location)
           yield link, (filepath, content_type)
   ```

**This is a significant finding.** All final wheel downloads happen sequentially.
### 2.8 Response Compression

**Index page requests:**

- `_get_simple_response()` in collector.py sets custom `Accept` headers but does NOT set `Accept-Encoding`
- The requests library's default `Accept-Encoding` header is `gzip, deflate` (from urllib3's `ACCEPT_ENCODING = "gzip,deflate"`, applied by requests' `default_headers()`)
- **Index pages ARE compressed** by PyPI/Fastly with gzip. The requests library transparently decompresses them.
- No brotli support (it would require the `brotli` or `brotlicffi` package)

**Package downloads:**

- `Downloader._http_get()` uses `HEADERS = {"Accept-Encoding": "identity"}` (utils.py line 26)
- **Package downloads explicitly disable compression.** This is intentional -- packages are already compressed archives (wheels are ZIP files, sdists are .tar.gz). Re-compressing would waste CPU and break hash verification.
- `response_chunks()` uses `decode_content=False` to preserve raw bytes for hash checking.

**Finding:** Compression is correctly handled. Index pages use gzip (transparent). Packages disable compression (correct). No improvement opportunity here.
### 2.9 Lazy/Streaming Approaches

**Current behavior:**

- Index pages: `response.content` (collector.py line 309) reads the entire response into memory before parsing
- JSON index pages can be 50KB-2MB for popular packages (e.g., boto3 has ~12,000 file entries)
- HTML index pages are similar in size

**Streaming opportunity:**

- JSON index pages COULD be streamed using an incremental JSON parser (e.g., `ijson`)
- However, `json.loads()` on a 1MB string takes ~5ms -- negligible compared to the ~80ms network round-trip
- The real cost is not parsing but **candidate evaluation** -- the `_evaluate_json_page()` fast path already handles this efficiently with a single-pass fused pipeline

**Early abort opportunity:**

- When the resolver only needs the "best" (newest compatible) version, we could theoretically abort after finding it
- **Problem:** The index page must be fully fetched before we know all versions (there is no streaming API from PyPI)
- The speculative metadata prefetch already handles this by eagerly fetching metadata for the top candidate

**Finding:** Streaming/early-abort offers negligible benefit for index pages because network latency dominates. The JSON parsing is already fast.
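The "~5ms for a 1MB string" claim is easy to sanity-check with a synthetic payload shaped loosely like a Simple API index page. Timings vary by machine, so the sketch prints the measurement rather than asserting it; the entry shape below is an assumption, not PyPI's exact schema:

```python
import json
import time

# Build a roughly 1 MB JSON document: 9,000 file entries with dummy hashes.
entries = [
    {"filename": f"pkg-{i}.whl", "hashes": {"sha256": "0" * 64}}
    for i in range(9000)
]
payload = json.dumps({"files": entries})
print(f"payload size: {len(payload) / 1e6:.2f} MB")

start = time.perf_counter()
parsed = json.loads(payload)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"json.loads took {elapsed_ms:.1f} ms for {len(parsed['files'])} entries")
```

On typical hardware the parse lands in the single-digit-millisecond range, which is why an incremental parser would not move the needle against an ~80ms round-trip.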
### 2.10 PyPI-Specific Optimizations

**Bulk/batch APIs:**

- PyPI has no bulk metadata API (no way to get metadata for 40 packages in one request)
- The Simple Repository API (PEP 503/691) is package-by-package
- There is no "dependency tree" API that would let pip skip index page fetches

**CDN-level optimizations already in use:**

- `Cache-Control: max-age=0` with conditional requests (ETags/Last-Modified) -- implemented
- PyPI responses include strong ETags
- 304 responses save bandwidth but still cost one RTT each

**JSON API:**

- pip already prefers the JSON Simple API (`application/vnd.pypi.simple.v1+json`) via Accept header priority
- The JSON path (`_evaluate_json_page()`) is heavily optimized with fused evaluation
- PyPI's JSON API doesn't support partial responses or field selection

**Server-push / Link preload:**

- PyPI doesn't support HTTP/2 Server Push for metadata files
- Even with HTTP/2, the server can't know which wheel the client will pick

---
## 3. Optimization Ideas (Ranked by Expected Impact)

### Tier 1: High Impact (10-30% wall-time reduction)

#### 3.1 Parallel Wheel Downloads

**What:** Replace the sequential `Downloader.batch()` for-loop with a `concurrent.futures.ThreadPoolExecutor`.

**Where:** `src/pip/_internal/network/download.py` lines 179-184 and `src/pip/_internal/operations/prepare.py` lines 492-493.

**Why:** After resolution completes, all wheels are downloaded sequentially. For 40 packages, this is 40 sequential HTTP GETs. Parallelizing would overlap download + write for multiple packages.

**Expected improvement:** 15-25% of total wall time for download-heavy workloads. With 8 parallel downloads, the download phase shrinks from ~40 * avg_time to ~5 * avg_time.

**Complexity:** Medium. Need to handle progress bar display for parallel downloads and ensure thread safety.

**Risk:** Low -- downloads are independent operations.

**pip-only change:** Yes.
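A minimal order-preserving parallel replacement for the sequential loop might look like the sketch below. This is an illustration only: the real `Downloader` call signature, progress-bar handling, and error paths are more involved, and `fake_download` is a stub:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_batch(download_one, links, location, max_workers=8):
    """Download links concurrently but yield results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(download_one, link, location) for link in links]
        for link, future in zip(links, futures):
            yield link, future.result()  # re-raises any download error

# Stub standing in for Downloader.__call__, which returns (filepath, content_type).
def fake_download(link, location):
    return f"{location}/{link.rsplit('/', 1)[-1]}", "application/zip"

links = [f"https://files.example/{name}.whl" for name in ("a", "b", "c")]
for link, (path, ctype) in parallel_batch(fake_download, links, "/tmp/wheels"):
    print(link, "->", path)
```

Yielding in submission order (rather than completion order) keeps the caller's bookkeeping identical to the current sequential `batch()`, at the cost of slight head-of-line waiting on the slowest early download.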
#### 3.2 Pipeline Index Fetch + Metadata Prefetch

**What:** When an index page prefetch completes, immediately trigger metadata prefetch for the top candidate -- don't wait for the resolver to consume the index result.

**Where:** `src/pip/_internal/index/package_finder.py` `_do_fetch_all_candidates()` should call `factory._prefetch_top_candidate_metadata()` at the end.

**Why:** Currently, there's a gap between index fetch completion and metadata prefetch submission. The metadata prefetch only fires when the resolver calls `_iter_found_candidates()`. This gap can be 100ms-2s depending on how fast the resolver processes.

**Expected improvement:** 5-15% for resolution-heavy workloads. Eliminates the serial gap between "index data ready" and "metadata fetch starts."

**Complexity:** Medium. Requires threading coordination between PackageFinder and Factory. The PackageFinder would need a reference to the Factory (currently doesn't have one).

**Risk:** Low-medium -- need to ensure thread safety for candidate cache.

**pip-only change:** Yes.
#### 3.3 Increase Metadata Prefetch Depth

**What:** Prefetch metadata for top N candidates (not just the top 1), and prefetch for ALL packages whose index is ready (not just when the resolver asks).

**Where:** `src/pip/_internal/resolution/resolvelib/factory.py` `_prefetch_top_candidate_metadata()`.

**Why:** The resolver sometimes backtracks and needs the 2nd or 3rd candidate. Currently only the top candidate's metadata is prefetched. Prefetching the top 2-3 would prevent serial metadata fetches during backtracking.

**Expected improvement:** 3-8% for workloads with backtracking.

**Complexity:** Low.

**Risk:** Low. Wastes some bandwidth on metadata that may not be needed, but metadata files are tiny (5-50KB).

**pip-only change:** Yes.
### Tier 2: Medium Impact (5-15% wall-time reduction)

#### 3.4 HTTP/2 Support via httpx

**What:** Replace the `requests` + `urllib3` stack with `httpx` which supports HTTP/2 multiplexing.

**Why:** With HTTP/2, all index page requests and metadata fetches to pypi.org can be multiplexed over a single TCP connection. This eliminates 15 extra TLS handshakes and allows the server to interleave responses.

**Expected improvement:** 10-20% for cold-cache workloads (fewer TLS handshakes, multiplexed requests). Less impact for warm-cache (304 responses are already small).

**Complexity:** Very high. Fundamental change to pip's network layer. Would affect caching, authentication, proxies, all adapters.

**Risk:** High -- potential for regressions across pip's extensive networking surface.

**pip-only change:** Yes, but major architectural change.
#### 3.5 Conditional Request Short-Circuit for Index Pages

**What:** For warm-cache scenarios, batch all conditional index page requests into concurrent futures BEFORE the resolver starts, rather than lazily.

**Where:** Before calling `resolver.resolve()`, pre-submit conditional GETs for ALL packages known from the lock file or previous resolution.

**Why:** Currently, prefetch only fires as the resolver discovers dependencies. If pip could predict the dependency set (from a lock file or previous run), all ~40 conditional GETs could be fired simultaneously.

**Expected improvement:** 5-10% for warm-cache repeat installs. Turns 3.2s of serial conditional GETs into <0.5s of parallel ones.

**Complexity:** Medium. Need a mechanism to predict the package set (lock file, cache of previous resolution result).

**Risk:** Low -- conditional GETs are safe to fire speculatively.

**pip-only change:** Yes.
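The pre-submission step can be sketched as below. Both `predicted_packages` and `fetch_index_conditional` are hypothetical stand-ins: the first for a lock-file-derived package set, the second for pip's conditional-GET (If-None-Match / If-Modified-Since) machinery.

```python
from concurrent.futures import ThreadPoolExecutor


def prefetch_indexes(predicted_packages, fetch_index_conditional, max_workers=16):
    """Fire conditional GETs for every predicted package before resolution starts.

    fetch_index_conditional(name) is assumed to revalidate the cached index page
    for `name` (returning the 304-refreshed or newly downloaded page).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            name: pool.submit(fetch_index_conditional, name)
            for name in predicted_packages
        }
    # Exiting the with-block waits for completion; the HTTP cache is now warm
    # before the resolver asks for its first index page.
    return {name: fut.result() for name, fut in futures.items()}
```

Because conditional GETs are idempotent, a wrong prediction only costs a wasted round trip, never an incorrect install.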
#### 3.6 Connection Pre-warming

**What:** Open TLS connections to pypi.org and files.pythonhosted.org at session creation time, before any requests.

**Where:** `src/pip/_internal/network/session.py` `PipSession.__init__()`.

**Why:** The first request to each host pays the TCP + TLS handshake cost (~100-200ms). Pre-warming during argument parsing / environment setup overlaps this with CPU work.

**Expected improvement:** 2-5% (saves ~200ms one-time cost).

**Complexity:** Low.

**Risk:** Low -- harmless if the connections go unused (they just time out).

**pip-only change:** Yes.
### Tier 3: Low Impact (1-5% wall-time reduction)

#### 3.7 Cache Index ETags In-Memory Across Packages

**What:** After the first conditional GET returns an ETag for `pypi.org/simple/`, cache the server's response pattern in memory. Some CDNs return the same 304 pattern for all resources with the same age.

**Expected improvement:** Negligible (<1%). The conditional request still requires a round trip.

**pip-only change:** Yes.
#### 3.8 Brotli Compression for Index Pages

**What:** Add `brotli` or `brotlicffi` as an optional dependency so index page responses can be compressed with brotli (better compression ratio than gzip).

**Why:** Brotli can compress JSON index pages 20-30% better than gzip, reducing transfer time for large index pages.

**Expected improvement:** 1-3% for cold-cache scenarios. Index pages are typically 50KB-2MB; brotli saves ~30% of that.

**Complexity:** Low. Just add the dependency and urllib3/requests will advertise brotli support.

**Risk:** Low. Optional dependency, gzip fallback.

**pip-only change:** Yes.

---
## 4. Quick Wins (< 50 lines of code)

### QW1: Parallel Wheel Downloads (the biggest quick win)

**File:** `src/pip/_internal/operations/prepare.py` `_complete_partial_requirements()`

**Change:** Replace the sequential `self._download.batch()` loop with `ThreadPoolExecutor.map()`:
```python
# Current (sequential):
batch_download = self._download.batch(links_to_fully_download.keys(), temp_dir)
for link, (filepath, _) in batch_download:
    ...

# Proposed (parallel):
with ThreadPoolExecutor(max_workers=8) as pool:
    results = pool.map(
        lambda link: (link, self._download(link, temp_dir)),
        links_to_fully_download.keys(),
    )
    for link, (filepath, _) in results:
        ...
```

**Lines:** ~15 changed

**Impact:** 15-25% wall-time reduction on download-heavy workloads
### QW2: Pipeline Index + Metadata Prefetch

**File:** `src/pip/_internal/index/package_finder.py` `_do_fetch_all_candidates()`

**Change:** After building the candidate list, immediately trigger metadata prefetch for the top candidate if a factory callback is registered:
```python
# At the end of _do_fetch_all_candidates:
if self._metadata_prefetch_callback and self._all_candidates[project_name]:
    self._metadata_prefetch_callback(project_name, self._all_candidates[project_name])
```

**Lines:** ~20 changed (add callback registration + invocation)

**Impact:** 5-15% for resolution-heavy workloads
### QW3: Connection Pre-warming

**File:** `src/pip/_internal/network/session.py`

**Change:** Add a `prewarm()` method that opens connections to known hosts in background threads:
```python
def prewarm(self, urls: list[str]) -> None:
    """Open TCP+TLS connections in background to reduce first-request latency."""
    from concurrent.futures import ThreadPoolExecutor

    def _warm(url):
        try:
            self.head(url, timeout=5)
        except Exception:
            pass

    # A `with` block would call shutdown(wait=True) and block startup until the
    # HEAD requests finish, so submit the warm-ups and return immediately.
    pool = ThreadPoolExecutor(max_workers=2)
    for url in urls:
        pool.submit(_warm, url)
    pool.shutdown(wait=False)
```

**Lines:** ~15

**Impact:** 2-5% (saves ~200ms startup)
### QW4: Prefetch Top 2-3 Candidates' Metadata

**File:** `src/pip/_internal/resolution/resolvelib/factory.py`

**Change:** In `_iter_found_candidates()`, prefetch metadata for top 2-3 candidates instead of just top 1:
```python
# Current: prefetch only infos_list[0]
# Proposed: prefetch infos_list[0:3]
for info in infos_list[:3]:
    self._prefetch_top_candidate_metadata(name, info, extras, template)
```

**Lines:** ~10 changed

**Impact:** 3-8% for workloads with backtracking

---
## 5. Big Bets (Architectural Changes for 20%+ Improvement)

### BB1: Fully Parallel Resolution Pipeline

**Description:** Replace the sequential resolvelib loop with a resolution architecture where ALL I/O is fully parallel. When the resolver needs data for package X, it doesn't block -- it queues the need and processes another package. When I/O completes, the resolver is notified.

**Mechanism:** This is essentially an async resolver. Could be implemented with:

- asyncio event loop driving the resolver
- `aiohttp` or `httpx` async client for HTTP
- resolvelib with a coroutine-based provider

**Expected improvement:** 30-50% for large dependency trees. Eliminates all serial I/O gaps.

**Complexity:** Very high. Fundamental architectural change to pip's resolver integration.

**Risk:** High -- resolvelib is synchronous by design.
### BB2: HTTP/2 Multiplexing

**Description:** Replace vendored `requests` + `urllib3` with `httpx` (which supports HTTP/2 via h2).

**Expected improvement:** 20-30% for cold-cache workloads. All requests to pypi.org multiplex over one connection. No head-of-line blocking between index page requests.

**Complexity:** Very high. ~500+ line change touching all network code.

**Risk:** High.
### BB3: Dependency Prediction + Bulk Prefetch

**Description:** Maintain a local cache of "last resolved dependency tree" per project. On next `pip install`, immediately fire all index page + metadata prefetch requests for the predicted set BEFORE the resolver starts.

**Expected improvement:** 20-40% for repeat installs. Instead of discovering dependencies one-by-one through resolution, fire all 40+ conditional GETs simultaneously at startup.

**Complexity:** Medium-high. Need a prediction cache format, staleness detection, and graceful handling of prediction misses.

**Risk:** Medium. Wrong predictions waste bandwidth but don't cause correctness issues.
### BB4: Server-Side Dependency Resolution API

**Description:** Propose a PyPI API extension that accepts a requirements list and returns the resolved dependency tree (with all metadata). One HTTP request replaces 120+ requests.

**Expected improvement:** 50-80% for cold-cache scenarios. Eliminates all per-package round trips.

**Complexity:** Very high. Requires PyPI server cooperation, PEP process, etc.

**Risk:** High. Requires ecosystem buy-in. Fallback to current behavior needed.

---
## 6. Summary of Key Files

| File | Role |
|------|------|
| `src/pip/_internal/index/collector.py` | Fetches index pages, parses HTML/JSON |
| `src/pip/_internal/index/package_finder.py` | Evaluates candidates, manages prefetch pool (16 threads) |
| `src/pip/_internal/network/session.py` | PipSession, connection pool config (20/16), adapters |
| `src/pip/_internal/network/cache.py` | SafeFileCache (filesystem-based HTTP cache) |
| `src/pip/_internal/network/download.py` | Downloader (sequential batch downloads!) |
| `src/pip/_internal/network/lazy_wheel.py` | LazyZipOverHTTP for range-request metadata |
| `src/pip/_internal/network/utils.py` | Accept-Encoding: identity for downloads, chunk streaming |
| `src/pip/_internal/operations/prepare.py` | RequirementPreparer, metadata fetch chain |
| `src/pip/_internal/resolution/resolvelib/factory.py` | Metadata prefetch pool (8 threads), candidate building |
| `src/pip/_internal/resolution/resolvelib/provider.py` | Triggers dep prefetch in get_dependencies() |
| `src/pip/_internal/resolution/resolvelib/resolver.py` | Kicks off root requirement prefetch |
| `src/pip/_internal/resolution/resolvelib/candidates.py` | Thread-safe dist preparation with _prepare_lock |
| `src/pip/_vendor/cachecontrol/adapter.py` | CacheControlAdapter -- intercepts requests for caching |
| `src/pip/_vendor/cachecontrol/controller.py` | Cache logic: max-age=0 bypass, conditional headers, 304 handling |
## 7. Critical Finding: Sequential Download Is The Biggest Remaining Win

The single most impactful optimization remaining is **parallelizing wheel downloads**. After resolution completes, `_complete_partial_requirements()` downloads all wheels sequentially through `Downloader.batch()`. This is purely sequential I/O with no data dependencies between packages. With 40 packages averaging 500KB each at ~50ms per download, the sequential phase takes ~2 seconds. Parallelizing with 8 workers would reduce this to ~0.25 seconds -- a potential 15-25% total wall-time improvement depending on the workload.
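The arithmetic above can be checked with a small simulation; `time.sleep` stands in for network latency, and the 20ms per-download figure and 16-package set are arbitrary illustration values, not measurements.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(pkg: str) -> str:
    time.sleep(0.02)  # stand-in for one network-bound wheel download
    return pkg


pkgs = [f"pkg{i}" for i in range(16)]

# Sequential: total time is roughly n * per-download latency.
t0 = time.perf_counter()
for p in pkgs:
    fake_download(p)
sequential = time.perf_counter() - t0

# Parallel with 8 workers: downloads overlap, total shrinks toward n/8 * latency.
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(fake_download, pkgs))
parallel = time.perf_counter() - t0
```

With these toy numbers, the sequential phase takes ~0.32s and the parallel one ~0.04s, mirroring the ~2s to ~0.25s estimate in the text.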
110
.codeflash/pypa/pip/data/learnings.md
Normal file
## All dependencies are vendored

Everything in `src/pip/_vendor/` is vendored from upstream: resolvelib, packaging, requests, urllib3, cachecontrol, certifi, distlib, importlib_metadata, pygments, rich, etc. These are copies of external libraries maintained via `tools/vendoring/`. Each vendored library is a candidate for replacement — if pip only uses a subset of a library's API, a focused implementation covering just that subset will be faster (no generalized overhead). If a vendored library is fully replaced and no longer imported anywhere in `src/pip/_internal/`, delete it from `_vendor/` and remove its entry from `_vendor/vendor.txt`. The vendoring manifest is at `src/pip/_vendor/vendor.txt`.
## Resolution flow is the primary hot path

The dependency resolution call chain:

```
Resolver.resolve() [resolution/resolvelib/resolver.py]
  -> resolvelib.Resolver (vendored algorithm)
    -> PipProvider.find_matches() [resolution/resolvelib/provider.py]
      -> Factory._iter_found_candidates() [resolution/resolvelib/factory.py]
        -> PackageFinder.find_best_candidate() [index/package_finder.py]
          -> LinkCollector.collect_sources() [index/collector.py]
          -> LinkEvaluator.evaluate_link()
          -> CandidateEvaluator.compute_best_candidate()
    -> PipProvider.get_dependencies()
      -> Candidate.iter_dependencies() [resolution/resolvelib/candidates.py]
```

The vendored `resolvelib` drives the algorithm; pip's layer (factory, provider, candidates, package_finder) is where the overhead lives.
## Existing caching in pip — evaluate and improve

Current caching is a starting point, not a ceiling. Profile each one — the cache sizes, strategies, and data structures may all be suboptimal:

- `functools.lru_cache(maxsize=10000)` on `parse_version` in `utils/packaging.py` — is 10k the right size? Is lru_cache the fastest caching strategy here? Would a plain dict be faster (no LRU eviction overhead)?
- `functools.lru_cache(maxsize=32)` on `get_requirement` in `utils/packaging.py` — only 32 slots. During large resolutions this evicts constantly. Profile whether a larger cache or unbounded `@functools.cache` is faster.
- `@functools.cache` on Link properties in `models/link.py` — functools.cache has per-call overhead for hashing args. If Link properties are called with `self` only, a simple `__dict__`-based cache or `__slots__` with pre-computed values may be faster.
- `@functools.cached_property` on InstallRequirement properties — has thread-safety overhead in 3.12+. Evaluate whether a simpler lazy pattern is faster.
- `@functools.cache` on candidate creation in `resolution/resolvelib/provider.py` — profile the cache hit rate. If it's low, the hashing overhead is pure waste.
- HTTP caching via vendored `CacheControl` in `network/session.py` — a general-purpose HTTP cache. If pip only needs a subset of caching semantics, a focused implementation could be faster.
- Wheel cache by URL hash in `cache.py` — uses sha224. Profile whether a faster hash (xxhash via C, or even just dict key on URL string) would help at scale.
- Lazy wheel loading in `network/lazy_wheel.py` — the TODO says range requests aren't cached. Fix this, and also evaluate whether the lazy loading strategy itself is optimal (e.g., batch range requests, prefetch metadata sections).
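The lru_cache-versus-plain-dict trade-off raised in the first bullet can be sketched with toy stand-ins for `parse_version` (the real function parses PEP 440 versions; these just split on dots):

```python
import functools


# lru_cache variant: thread-safe, LRU eviction, per-call wrapper overhead.
@functools.lru_cache(maxsize=None)
def parse_cached(s: str) -> tuple:
    return tuple(int(p) for p in s.split("."))


# Plain-dict variant: no eviction and no lock, so slightly cheaper per hit,
# but the cache grows without bound for the process lifetime.
_memo: dict = {}


def parse_dict(s: str) -> tuple:
    try:
        return _memo[s]
    except KeyError:
        result = _memo[s] = tuple(int(p) for p in s.split("."))
        return result
```

For a short-lived process like a single `pip install`, unbounded growth is usually acceptable, which is what makes the dict variant worth benchmarking here.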
## Known TODOs from source — verified optimization opportunities

1. `resolution/resolvelib/candidates.py` ~line 250: "TODO performance: this means we iterate dependencies at least twice" — dependencies are extracted from metadata, then iterated again during resolution
2. `resolution/resolvelib/factory.py`: "TODO: Check already installed candidate, and use it if the link and hash match" — redundant work when a compatible version is already installed
3. `network/lazy_wheel.py`: "TODO: Get range requests to be correctly cached" — lazy wheel metadata fetches bypass the HTTP cache
## Version parsing happens repeatedly

During candidate evaluation in `package_finder.py`, version strings are parsed from Link URLs multiple times across different stages (link evaluation, candidate evaluation, sorting). The `lru_cache` on `parse_version` helps but the cache key is the string — if the same version appears in different URL formats, it may be parsed redundantly.
## Link objects are high-volume

`models/link.py` Link objects are created for every candidate from every index page. They use `@functools.cache` on properties, but the sheer volume (hundreds to thousands per resolution) means object creation overhead itself matters.
## Tests structure

- `tests/unit/` — fast, no network, good for profiling feedback
- `tests/unit/resolution_resolvelib/` — resolver-specific unit tests
- `tests/functional/` — slow, needs network, creates real virtualenvs
- Socket disabled by default in pytest config
- `tests/unit/test_finder.py` — tests for PackageFinder
- `tests/unit/test_req.py` — tests for requirement handling
## pip targets Python 3.9+ and PyPy

Cannot use syntax newer than 3.9: match/case (3.10+), exception groups (3.11+), or the `type` statement (3.12+). `typing.Self` and `typing.TypeAlias` need imports from `typing_extensions`.
## Ruff is the linter

Line length 88, target-version py39. Key ignores: `PERF203` is explicitly ignored for `src/pip/_internal/*` (try-except in loop). Isort has a custom `vendored` section for `pip._vendor`.
## packse is available for realistic resolver workloads

The sibling repo at `../packse/` contains 148 dependency resolution test scenarios. These can be used to create realistic profiling workloads by building the packse index and running `pip install --index-url <packse-url> <scenario-package>`. Categories with the most resolver stress: fork (32 scenarios), prereleases (20), local versions (16), requires-python (15).
## Apr05 session: optimization results

Key findings from the optimization session:

1. **get_supported() is the single most impactful cache target.** A single `@functools.lru_cache` on the underlying implementation reduces Tag.__init__ calls from 45K to 1.5K in resolver test workloads (97% reduction). The cache key is (version, platforms_tuple, impl, abis_tuple). Hit rate is high because the same TargetPython params are used across resolution.

2. **canonicalize_name() has 92% cache hit rate.** Package names are canonicalized repeatedly during resolution — once for each candidate evaluation, each distribution check, and each requirement comparison. An `lru_cache(maxsize=1024)` catches the vast majority of calls.

3. **Test suite wall-clock is poor proxy for pip performance.** The unit test suite is dominated by test fixture creation (0.3s setup per resolver test × 40 tests = 12s), I/O (directory scanning), and subprocess calls (build isolation). Caches provide little benefit because each test creates fresh state. Real pip invocations process a single dependency tree where caches accumulate hits.

4. **cProfile overhead is higher with lru_cache.** cProfile tracks every function call including the lru_cache wrapper. The profiling overhead ratio is ~3.3x with cached functions vs ~1.2x without. This makes the optimized code look slower under cProfile, but real execution is equivalent or faster.

5. **Python 3.15 is significantly faster than 3.14.** The same unit test suite runs in ~37-42s on Py 3.15 vs ~130s on Py 3.14. This is from general CPython performance improvements, not pip-specific changes.

6. **E2E profiling reveals completely different targets than unit tests.** The unit test suite is dominated by test infrastructure (fixture creation, subprocess calls). Real `pip install --dry-run flask django boto3 requests` with cached metadata reveals: Link object creation (12K+), Version operations, URL cleaning, and filename parsing dominate. Always profile real workloads.

7. **Double urlsplit is a hidden 2.5% cost.** `_ensure_quoted_url` does `urlsplit` to check the path, then `Link.__init__` does `urlsplit` again on the same URL. Integrating quoting into __init__ eliminates this. For HTTP/HTTPS URLs with already-clean paths (99% from package indices), a regex fast-path (`_PATH_ALREADY_QUOTED_RE`) skips `_clean_url_path` entirely.

8. **Pre-computing hot properties in __init__ is the most effective pattern for high-volume objects.** Link objects are created 12K+ times. Moving splitext, filename, and hash computation from property access to __init__ eliminated ~7% of self-time because these properties were accessed 2-4x each per link during evaluation and sorting.

9. **Lazy parsing for rarely-used fields saves significant time.** `upload_time` (ISO datetime) was parsed eagerly for all 12K links but only used when `--uploaded-prior-to` flag is set (rare). Deferring parse_iso_datetime to first property access eliminated 1.4% of self-time.

10. **Remaining performance floor after link/version optimizations.** Profile is now flat: Version.__init__ (7%), Link.__init__ (7%), evaluate_link (5%), from_json (4%), specifier filtering (3%), version comparison (3%). These are core resolution operations — further gains require algorithmic changes (reducing candidate count) or resolver restructuring.

- **parse_wheel_filename cache has 75% hit rate** in single-package installs. In larger resolutions with many candidates from the same package, hit rate is higher.
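Finding 8's pattern can be illustrated with a toy class. `LinkSketch` is hypothetical: the real `Link` has many more fields and much richer URL handling; the point is only that derived values read 2-4 times per object should be computed once in `__init__` rather than behind cached properties.

```python
import posixpath


class LinkSketch:
    """Toy model: compute hot derived values once at construction time."""

    __slots__ = ("url", "filename", "ext")

    def __init__(self, url: str) -> None:
        self.url = url
        # Pre-compute values that candidate evaluation and sorting would
        # otherwise recompute (or pay cached-property overhead for) on
        # every access.
        path = url.rsplit("#", 1)[0].rstrip("/")
        self.filename = posixpath.basename(path)
        self.ext = posixpath.splitext(self.filename)[1]
```

With `__slots__` and no property machinery, each attribute read is a plain slot load, which matters when 12K+ objects are each touched several times.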
## Apr 2026 session: optimization floor analysis

11. **_evaluate_json_page is at the Python-level floor.** Per-entry processing costs 4.2us across 13.7K entries. The cost is spread across dict.get (65K calls, 0.009s), str.endswith (21K, 0.003s), str operations (rsplit/find/split/startswith: ~0.005s total), object construction (15K __new__, 0.002s), and isinstance (19K, 0.002s). No single operation dominates. A py3-none-any fast path that replaces rsplit+set-lookup with a single endswith showed <2% improvement within noise. The function's self-time is fundamentally the cost of executing ~340 lines of Python bytecode per entry.

12. **Resolution round counts are much lower than expected after caching.** The two-level cache in _iter_found_candidates (experiment 18) reduced resolver iterations dramatically. flask+django+boto3+requests: 23 state pushes, 0 backtracks. fastapi[standard]: 48 pushes, 0 backtracks. COW state snapshots, IteratorMapping elimination, and other per-round optimizations have negligible impact at these scales.

13. **Wall time is I/O-dominated after CPU optimizations.** For flask+django+boto3+requests (826ms optimized), HTTP requests account for ~70% of wall time. The 41 HTTP requests (21 for index pages + 20 for metadata) are serialized by the resolver's sequential processing of packages. Our parallel prefetch infrastructure helps but can only overlap I/O for packages discovered through dependency traversal, not the initial set.

14. **Benchmark results are highly sensitive to network conditions and cache state.** The same benchmark can vary 2-3x depending on HTTP cache warmth and network latency. Always use hyperfine with warmup runs and report median/mean with sigma. The "Using cached" lines in pip output indicate cache hits; "Downloading" indicates misses.

15. **make_install_req_from_link serialize-reparse is wasteful but low-impact.** The function serializes a Requirement to string then re-parses it through install_req_from_line (which does os.path.normpath, os.path.abspath, URL parsing, etc.). But with only 21 calls at 0.1ms each (2.2ms total), the absolute impact is negligible. Would only matter for workloads with thousands of direct requirements.

## Local warehouse (PyPI) is running

A full warehouse instance is running via Docker at `http://localhost:80/`. The Simple API is at `http://localhost:80/simple/`. This enables end-to-end profiling of the entire pip → network → warehouse → database stack. The warehouse source at `../warehouse/` is live-reloaded by gunicorn — changes to `warehouse/api/simple.py` (the Simple API endpoint) take effect immediately. Manage with `cd ../warehouse && docker compose [up -d | down | logs web]`.
19
.codeflash/pypa/pip/data/results.tsv
Normal file
```
commit target_test cpu_baseline_s cpu_optimized_s cpu_speedup mem_baseline_mb mem_optimized_mb mem_delta_mb gc_before_s gc_after_s tests_passed tests_failed status domains interaction description
uncommitted resolver_tests 0.155 0.155 1.0x - - - - - 1690 0 keep cpu tag_gen_elimination lru_cache on get_supported — Tag.__init__ 45301→1559 calls (97% reduction)
uncommitted resolver_tests 0.155 0.155 1.0x - - - - - 1690 0 keep cpu none Pre-compiled regex for wheel name and project name validation
uncommitted resolver_tests 0.155 0.155 1.0x - - - - - 1690 0 keep cpu none pathlib→os.scandir + distribution dict cache for O(1) lookups
uncommitted e2e_install 0.027 0.027 1.0x - - - - - 1690 0 keep cpu cache_hit_92pct lru_cache on canonicalize_name/parse_wheel_filename/Version.parse
uncommitted resolver_tests 0.155 0.155 1.0x - - - - - 1690 0 keep cpu none _version_nodot dict cache
uncommitted e2e_install 1.890 0.830 2.3x - - - - - 1690 0 keep cpu urlsplit_dedup Integrated URL quoting into Link.__init__ (eliminates double urlsplit)
uncommitted e2e_install 1.890 0.830 2.3x - - - - - 1690 0 keep cpu none Pre-computed Link._splitext/_filename/_hash in __init__
uncommitted e2e_install 1.890 0.830 2.3x - - - - - 1690 0 keep cpu none Version.__str__ caching via _str_cache slot
uncommitted e2e_install 1.890 0.830 2.3x - - - - - 1690 0 keep cpu deferred_parse Lazy upload_time parsing (defer fromisoformat to first access)
uncommitted e2e_install 1.890 0.830 2.3x - - - - - 1690 0 keep cpu none parse_wheel_filename cache 512->4096
uncommitted e2e_install - - - - - - - - 1690 0 keep cpu wheel_elimination Eliminate Wheel construction from evaluate_link (6899→17 calls)
uncommitted e2e_install - - - - - - - - 1690 0 keep cpu inline_splitext Inline splitext + eliminate duplicate basename in Link.__init__
uncommitted e2e_install - - - - - - - - 1690 0 keep cpu id_based_assert BestCandidateResult identity-based assertion (eliminates 21K hash calls)
uncommitted e2e_install - - - - - - - - 1690 0 keep cpu cache_overflow parse_wheel_filename cache 4096→16384
uncommitted e2e_install - - - - - - - - 1690 0 keep cpu hash_cache Version.__hash__ cached in slot (42K→21K calls)
uncommitted e2e_install - - - - - - - - 1690 0 keep cpu alloc_skip supported_hashes fast-path avoids dict alloc for 99% case
uncommitted e2e_install - - - - - - - - 1690 0 keep cpu,structure version_dedup Deduplicate versions before specifier filtering (83% fewer is_prerelease calls)
uncommitted e2e_install - - - - - - - - 1690 0 keep structure direct_construct from_json direct Link construction: __init__ 12618→51, find_hash_frag 12601→34
```
305
.codeflash/pypa/pip/data/session-handoff.md
Normal file
# Optimization Session — apr05

## Environment

- Python 3.15.0a7, macOS arm64
- Branch: codeflash/optimize (off main 8df7b668b)
- Tests: 1690/1690 unit tests passing, 40/40 resolver tests passing
- Lint: ruff E,F clean on all modified files
- Run tag: apr05
## Baseline Profile (resolver unit tests, cProfile)

- Wall: ~5-13s (40 tests, high variance from subprocess calls)
- Project self-time: 0.155s
- Top targets by self-time:
  1. Tag.__init__: 25.1% (0.039s, 45,301 calls)
  2. cpython_tags: 8.6% (0.013s, 21,630 calls)
  3. compatible_tags: 5.8% (0.009s, 23,520 calls)
  4. _version_nodot: 4.9% (0.008s, 18,570 calls)
  5. find_legacy_editables: 3.6% (0.006s, 160 calls)
  6. canonicalize_name: 2.4% (0.004s, 3,244 calls)
  7. parse_name_and_version_from_info_directory: 2.1% (0.003s, 2,734 calls)
## Baseline Profile (e2e single install, cProfile)

- Project self-time: 0.027s
- Tag.__init__: 3,014 calls, 5.9% of project CPU
- cpython_tags: 1,442 calls, 2.6%
- _version_nodot: 1,238 calls, 1.4%
## Optimized Profile (resolver unit tests, cProfile)

- Project self-time: 0.261s (higher due to cProfile overhead on lru_cache wrappers)
- Tag.__init__: 1,559 calls (from 45,301 — 97% reduction)
- cpython_tags: 721 calls (from 21,630 — 97% reduction)
- Profile is flat — no function above 3.3% except find_legacy_editables at 25.1% (I/O bound)
## Optimized Profile (e2e single install, cProfile)

- Project self-time: 0.089s (higher due to cProfile overhead on lru_cache wrappers)
- Tag.__init__: 1,505 calls (from 3,014 — 50% reduction)
- canonicalize_name: 92.4% cache hit rate (327 hits / 27 misses)
- parse_wheel_filename: 75% cache hit rate (3 hits / 1 miss)
## Cache Hit Rates (real pip install --dry-run requests)

- get_supported: 50% (1 hit, 1 miss) — saves full 1500-tag generation
- canonicalize_name: 92.4% (327/354) — most impactful cache
- parse_wheel_filename: 75% (3/4)
- Version.parse: 20% (1/5) — higher for large dependency trees
## Strategy

- Target 1: packaging.tags — global caching of tag lists via lru_cache
- Target 2: Distribution scanning — pathlib to os.scandir, dict cache for O(1) lookups
- Target 3: canonicalize_name — lru_cache with 92% hit rate
- Target 4: Version/wheel parsing — lru_cache, pre-compiled regex
- Target 5: _version_nodot — dict cache
## Experiments

### Experiment 1: lru_cache on get_supported()

- File: src/pip/_internal/utils/compatibility_tags.py
- Change: Split get_supported() into public wrapper + @functools.lru_cache(maxsize=32) cached impl
- Result: Tag.__init__ calls 45,301 → 1,559 (97% reduction)
- Status: KEEP
### Experiment 2: Pre-compiled regex for wheel name validation

- File: src/pip/_vendor/packaging/utils.py
- Change: Moved inline re.match() to module-level _wheel_name_regex
- Result: Eliminates re.compile per call (313 calls in resolver tests)
- Status: KEEP
### Experiment 3: Pre-compiled regex for project name validation

- File: src/pip/_internal/metadata/base.py
- Change: Moved inline re.match() to module-level _VALID_PROJECT_NAME
- Result: Eliminates re.compile per call (~700 calls in iter_all_distributions)
- Status: KEEP
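The pattern behind Experiments 2 and 3, hoisting `re.compile` to module level, can be sketched as follows. The regex here is a simplified illustration, not the exact pattern pip or packaging uses.

```python
import re

# Compiled once at import time; per-call work is just .match().
_VALID_PROJECT_NAME = re.compile(r"^[A-Za-z0-9]([A-Za-z0-9._-]*[A-Za-z0-9])?$")


def is_valid_project_name(name: str) -> bool:
    return _VALID_PROJECT_NAME.match(name) is not None


def is_valid_project_name_inline(name: str) -> bool:
    # Equivalent behavior, but re.match pays a pattern-cache lookup
    # (and potentially a compile) on every call.
    return re.match(r"^[A-Za-z0-9]([A-Za-z0-9._-]*[A-Za-z0-9])?$", name) is not None
```

The win is small per call, but it multiplies across the ~700 calls per `iter_all_distributions` pass that Experiment 3 reports.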
### Experiment 4: pathlib to os.scandir + distribution dict cache

- File: src/pip/_internal/metadata/importlib/_envs.py
- Change: Replaced pathlib.Path.iterdir() with os.scandir(); added _distributions_cache dict for O(1) get_distribution lookups
- Result: get_distribution changes from O(n) linear scan to O(1) after first build
- Status: KEEP
### Experiment 5: lru_cache on canonicalize_name, parse_wheel_filename, Version.parse

- Files: src/pip/_vendor/packaging/utils.py, src/pip/_vendor/packaging/version.py
- Change: Added @functools.lru_cache to canonicalize_name (maxsize=1024), parse_wheel_filename (maxsize=512), Version.parse (maxsize=1024)
- Result: canonicalize_name 92.4% hit rate; parse_wheel_filename 75% hit rate
- Status: KEEP
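For canonicalize_name, the underlying work is the PEP 503 normalization rule (runs of `-`, `_`, `.` collapse to a single `-`, lowercased). A sketch of the cached form, with `cache_info()` being how the hit rates quoted above can be measured:

```python
import functools
import re

_CANONICALIZE_RE = re.compile(r"[-_.]+")


@functools.lru_cache(maxsize=1024)
def canonicalize_name(name: str) -> str:
    # PEP 503: collapse separator runs to "-", then lowercase.
    return _CANONICALIZE_RE.sub("-", name).lower()
```

After a run, `canonicalize_name.cache_info()` reports `hits` and `misses`, from which the hit rate follows directly.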
### Experiment 6: _version_nodot dict cache

- File: src/pip/_vendor/packaging/tags.py
- Change: Added module-level dict cache for _version_nodot results
- Result: Avoids repeated "".join(map(str, version)) calls
- Status: KEEP

## E2E Profile (pip install --dry-run flask django boto3 requests)

### Before this round (cached metadata, cProfile)

- Wall: ~5.9s
- Project self-time: ~1.9s
- Top targets:
  1. Version.__init__: 6.7%, 12,537 calls
  2. Link.from_json: 6.6%, 12,567 calls
  3. evaluate_link: 4.3%, 12,584 calls
  4. _clean_url_path: 3.0%, 12,567 calls
  5. _ensure_quoted_url: 2.8%, 12,567 calls
  6. splitext (link): 2.6%, 23,473 calls
  7. Version.__str__: 2.6%, 12,442 calls
  8. splitext (misc): 2.4%, 23,532 calls
  9. Link.filename: 2.6%, 12,257 calls
  10. Link.__hash__: 2.0%, 48,203 calls

### After this round

- Wall: ~2.6s (median of 3 runs: 2.1s, 2.6s, 3.6s)
- Project self-time: ~0.83s (58% reduction)
- Top targets now:
  1. Version.__init__: 7.3%, 12,537 calls — fundamental, per-candidate
  2. Link.__init__: 6.7%, 12,618 calls — includes pre-computation
  3. evaluate_link: 5.0%, 12,584 calls — the evaluation algorithm
  4. from_json: 4.4%, 12,567 calls — JSON dict access
  5. _sort_key: 2.7% — sorting candidates
  6. filter (specifiers): 2.6% — version filtering
  7. _key (Version): 2.5%, 183K calls — comparison key (cached)
- Eliminated from hot path: _clean_url_path, _ensure_quoted_url, splitext, Link.filename, Link.__hash__, parse_iso_datetime

### Experiments (this round)

#### Experiment 7: Link URL quoting integrated into __init__

- File: src/pip/_internal/models/link.py
- Change: Moved _ensure_quoted_url logic into Link.__init__, sharing the single urlsplit call. Added _PATH_ALREADY_QUOTED_RE fast path for HTTP/HTTPS URLs that skip _clean_url_path entirely (99% of package index URLs).
- Impact: Eliminated double urlsplit for every Link. _ensure_quoted_url (2.5%) and _clean_url_path (3.0%) gone from profile.
- Status: KEEP

#### Experiment 8: Link._splitext pre-computed in __init__

- File: src/pip/_internal/models/link.py
- Change: Pre-compute splitext result during Link construction. splitext() method and ext property return cached values.
- Impact: splitext (link) 2.6% + splitext (misc) 2.4% → eliminated from profile
- Status: KEEP

#### Experiment 9: Link._filename and Link._hash pre-computed

- File: src/pip/_internal/models/link.py
- Change: Pre-compute filename (posixpath.basename) and hash(url) during construction.
- Impact: Link.filename (2.6%) + Link.__hash__ (2.0%) → eliminated from profile
- Status: KEEP
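Experiments 8 and 9 share one idea: a Link is read far more often than it is constructed, so derived values move from per-access properties into `__init__`. A hedged sketch (this is not pip's actual Link class, and the path-trimming here is deliberately minimal):

```python
import posixpath


class Link:
    __slots__ = ("url", "_filename", "_splitext", "_hash")

    def __init__(self, url):
        self.url = url
        # Pre-compute values that used to be derived on every property
        # access. Strip fragment and query before taking the basename.
        path = url.split("#", 1)[0].split("?", 1)[0]
        self._filename = posixpath.basename(path)
        self._splitext = posixpath.splitext(self._filename)
        self._hash = hash(url)

    @property
    def filename(self):
        return self._filename

    @property
    def ext(self):
        return self._splitext[1]

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        return isinstance(other, Link) and self.url == other.url
```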
#### Experiment 10: Version.__str__ caching

- File: src/pip/_vendor/packaging/version.py
- Change: Added _str_cache slot, cache string representation on first __str__ call. Also fixed _TrimmedRelease to initialize the cache.
- Impact: Version.__str__ 2.6% → 2.0% (35% faster per call, cached for repeated access)
- Status: KEEP

#### Experiment 11: Lazy upload_time parsing

- File: src/pip/_internal/models/link.py
- Change: Store raw ISO string in from_json, defer parse_iso_datetime to first access of upload_time property. Only parsed when --uploaded-prior-to is used.
- Impact: parse_iso_datetime (1.4%, 12,568 calls) → eliminated from hot path
- Status: KEEP
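Experiments 10 and 11 are both "compute on first use, then reuse" moves. A combined sketch on a stand-in class (not pip's Version or Link):

```python
from datetime import datetime


class Candidate:
    __slots__ = ("release", "_str_cache", "_raw_upload_time", "_upload_time")

    def __init__(self, release, raw_upload_time=None):
        self.release = release
        self._str_cache = None       # Experiment 10: cache the str() result
        self._raw_upload_time = raw_upload_time
        self._upload_time = None     # Experiment 11: defer ISO-8601 parsing

    def __str__(self):
        if self._str_cache is None:
            self._str_cache = ".".join(map(str, self.release))
        return self._str_cache

    @property
    def upload_time(self):
        # Only pay for datetime parsing if the value is actually read
        # (pip needs it only when --uploaded-prior-to is given).
        if self._upload_time is None and self._raw_upload_time is not None:
            # Trailing 'Z' handling mirrors the Round 2 finding about
            # fromisoformat() on Python 3.9/3.10.
            raw = self._raw_upload_time.replace("Z", "+00:00")
            self._upload_time = datetime.fromisoformat(raw)
        return self._upload_time
```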
#### Experiment 12: parse_wheel_filename cache size 512 → 4096

- File: src/pip/_vendor/packaging/utils.py
- Change: Increased lru_cache maxsize from 512 to 4096 to handle large resolutions (10,980 unique filenames observed in multi-package installs).
- Impact: Better cache hit rate for large dependency trees
- Status: KEEP

## Round 3: Algorithmic changes to _evaluate_json_page

### Experiments 13-16: Tag-first parsing, direct JSON filename, endswith checks

- Files: src/pip/_internal/index/package_finder.py, src/pip/_internal/models/target_python.py
- Changes:
  - New `_evaluate_json_page()` method: single-pass over raw JSON, checks extension via endswith, extracts wheel tags from filename end using rfind, checks tag compatibility via frozenset before name parsing
  - Direct use of PEP 691 `filename` field (avoids URL construction)
  - Version interning across platform wheels
  - Tag tuples frozenset cached on TargetPython
- Impact: _evaluate_json_page self-time reduced ~33%, from_json calls reduced from ~10,899 to ~200 per page (only surviving candidates)
- Status: KEEP (experiments 13-16 committed as 4 separate commits)

### Experiment 17: Two-level platform pre-filter

- Tried adding platform-only pre-filter (1 rfind) before full 3-rfind extraction
- Results: Within noise margin (2-5%), code complexity not justified
- Status: DISCARD

## Round 4: Resolver backtracking cache (_iter_found_candidates)

### Experiment 18: Two-level cache on _iter_found_candidates

- File: src/pip/_internal/resolution/resolvelib/factory.py
- Problem: During fastapi[standard] resolution, _iter_found_candidates is called 134K+ times with only ~120 unique (name, specifier, hashes, extras) tuples. Each call redundantly: merges specifiers (set+update+frozenset), calls find_best_candidate (dict lookup), scans all_yanked, checks is_pinned, allocates functools.partial objects. Total: ~9.4s of 60s wall time.
- Change: Two-level cache:
  - Level 1 (merge cache): Maps raw specifier inputs (constraint _specs + each ireq's _specs frozenset) to merged result (specifier, hashes, extras). Uses frozenset VALUES (not id()) for GC safety. Frozenset hashing is O(1) after first call. Eliminates specifier merge on 99.9% of calls.
  - Level 2 (infos cache): Maps merged (name, specs, hashes, extras) to the list of (version, build_func) tuples from find_best_candidate. Eliminates find_best_candidate call, all_yanked scan, is_pinned check, and functools.partial allocation.
  - Inlined _get_installed_candidate (runs fresh every call — depends on incompatible_ids which changes during backtracking).
- Correctness note: Initial attempt used id()-based L1 cache for speed. This caused InconsistentCandidate errors because Python reuses memory addresses for gc'd objects during resolver backtracking, producing stale cache hits. Fixed by using frozenset value-based keys.
- Impact:
  - _iter_found_candidates: 9.4s → 1.85s (5x faster)
  - fastapi[standard] resolution: 37.9s → 15.2s (2.49x faster)
  - boto3: 0.65s → 0.33s (1.95x)
  - django: 0.24s → 0.18s (1.35x)
  - requests: 0.51s → 0.28s (1.84x)
  - black: 0.33s → 0.30s (1.12x)
- Status: KEEP
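The two-level structure of Experiment 18 can be reduced to a small sketch. This is heavily simplified relative to the real factory.py change (no installed-candidate handling, no hashes merging); the key point it preserves is that both levels are keyed by values, never by `id()`, since CPython reuses memory addresses of collected objects during backtracking:

```python
def make_candidate_cache(find_best_candidate):
    """Two-level cache sketch for the Experiment 18 pattern (simplified)."""
    merge_cache = {}  # level 1: raw specifier inputs -> merged key
    infos_cache = {}  # level 2: merged key -> candidate infos

    def find(name, specifier_sets, hashes, extras):
        # Level 1 key: the raw frozenset inputs themselves. Frozensets
        # hash in O(1) after their first hash, and compare by value,
        # so stale-address bugs are impossible.
        l1_key = (name, tuple(specifier_sets), hashes, extras)
        merged = merge_cache.get(l1_key)
        if merged is None:
            specs = frozenset().union(*specifier_sets) if specifier_sets else frozenset()
            merged = merge_cache[l1_key] = (name, specs, hashes, extras)
        # Level 2: skip the expensive candidate search entirely on a hit.
        infos = infos_cache.get(merged)
        if infos is None:
            infos = infos_cache[merged] = find_best_candidate(*merged)
        return infos

    return find
```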
## Plateau Analysis (Updated Apr 2026)

- Resolver runs 22-48 rounds for typical workloads (flask+django+boto3+requests: 23 pushes, 0 backtracks; fastapi[standard]: 48 pushes, 0 backtracks). COW state snapshots would save negligible time at these scales.
- _evaluate_json_page: 55.7% of project self-time (0.84s out of 1.5s), but per-entry cost is 4.2us dominated by dict.get (65K calls), Link construction (14 attr assignments), and version interning. No single operation dominates. py3-none-any fast path tested and discarded (<2% improvement within noise).
- install_req_from_line: 21 calls at 0.1ms each = 2.2ms total. Not worth bypassing the serialize-reparse pattern at this call count.
- IteratorMapping: 4-6 objects per round x 48 rounds = ~240 allocations. Each is 3 attribute assignments. Total cost negligible.
- Wall time dominated by HTTP I/O: 41 requests account for ~70% of wall time. Network latency and TLS handshakes are irreducible.
- Profile is genuinely flat: after _evaluate_json_page (55.7%), the next project function is _iter_found_candidates at 5.2%, then TLS at ~12%. No single function has enough headroom for meaningful improvement.
- Further gains require: (a) moving hot Python loops to C, (b) protocol-level changes (e.g. server-side filtering), or (c) fundamentally different resolution strategies (e.g. SAT solver).

## Pre-submit Review Findings

1. **CRITICAL (fixed)**: `get_applicable_candidates()` sorting was removed in an earlier optimization, breaking the resolver's assumption that applicable_candidates are version-sorted. The resolver iterates `reversed(icans)` expecting newest-first order. Fixed by sorting in `compute_best_candidate` while using `max()` for best-candidate tiebreaker stability.
2. **F821 lint (fixed)**: `Version` type annotation in `_evaluate_json_page` referenced undefined name. Changed to `_BaseVersion`.
3. **Reviewed (safe)**: `InstallationCandidate` frozen removal — no code compares candidates by value. Identity-based assertions already updated.
4. **Reviewed (safe)**: `_lazy_wheel_cache` — bounded by dependency tree size (20-200 packages).
5. **Reviewed (safe)**: `specifier._specs` direct access — vendored library under our control.
6. **Reviewed (safe)**: `_prereleases = None` in bulk merge — pip never sets non-None prereleases on SpecifierSet in the factory path.

## Adversarial Review Findings

### Round 1

1. **HIGH (fixed)**: Link.from_json query string stripping — signed URLs (?X-Amz-Signature=...) corrupted _path/_filename causing is_wheel=False. Fixed by finding earliest of ? or # to delimit path end.
2. **HIGH (fixed)**: _build_distribution_cache dict comprehension kept last-seen instead of first-seen for duplicate names. Fixed with setdefault.
3. **MEDIUM (safe)**: factory.py same-version installed candidate reuse. Investigated — FoundCandidates.__iter__ filters by incompatible_ids, and the original code already skips remote candidates for installed versions via versions_found set. No behavior change.
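The Round 1 path-delimiting fix (and the related Round 4 `url.index` vs `url.find` fix) can be illustrated with a small helper. This is a standalone sketch, not pip's actual Link code:

```python
import posixpath


def _path_end(url):
    # find() rather than index(): authority-only URLs without ? or #
    # fall back to the full length instead of raising ValueError.
    end = len(url)
    for delim in "?#":
        i = url.find(delim)
        if i != -1 and i < end:
            end = i
    return end


def filename_from_url(url):
    # The path ends at the EARLIEST of '?' or '#'; stripping only the
    # query would leave signed URLs like ...whl?X-Amz-Signature=...
    # with a corrupted filename (and is_wheel=False downstream).
    return posixpath.basename(url[:_path_end(url)])
```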
### Round 2

1. **HIGH (fixed)**: JSON sdist extensions — _evaluate_json_page only checked .tar.gz/.zip/.tar.bz2, missing .tgz/.tar/.tbz/.tar.xz etc. Fixed by adding all SUPPORTED_EXTENSIONS.
2. **HIGH (fixed)**: JSON wheel fast path accepted malformed wheel names. Fixed by validating via parse_wheel_filename() (lru_cached).
3. **MEDIUM (fixed)**: HTML pages lost _sort_links() dedup/precedence. Restored evaluate_links() call.
4. **MEDIUM (fixed)**: datetime.fromisoformat() fails on Python 3.9/3.10 with trailing 'Z'. Replaced with parse_iso_datetime().

### Round 3

1. **HIGH (fixed)**: Link.from_json derived _filename from URL path, not JSON filename field. Fixed to prefer file_data["filename"].
2. **MEDIUM (fixed)**: _log_skipped_link had early return on non-DEBUG that prevented requires-python skip bookkeeping. Fixed to always record. JSON path also records skip reasons in dedicated set.

### Round 4

1. **HIGH (fixed)**: Reverted factory.py installed-candidate reuse — conflated installed and index artifacts for the same version, blocking resolver backtracking.
2. **HIGH (fixed)**: Link.from_json crashes on authority-only URLs (no path). Changed url.index to url.find with fallback.
3. **MEDIUM (dismissed)**: Missing filename fallback in JSON path — PEP 691 requires filename field. Non-conformant indexes fall back to standard parse_links path.
4. **MEDIUM (fixed)**: _FastMetadata.get_payload() returned empty string, dropping long descriptions from metadata_dict/pip inspect. Now preserves body text.

## Refreshed E2E Benchmarks (Apr 2026, Py 3.15.0a7)

All measured with hyperfine (5-10 runs, 2-3 warmup), HTTP cache warm.

| Benchmark | Main | Optimized | Speedup |
|-----------|-----:|----------:|--------:|
| pip --version | 138ms | 20ms | **7.0x** |
| pip --help | 143ms | 121ms | **1.18x** |
| pip list | 162ms | 146ms | **1.11x** |
| pip freeze | 225ms | 211ms | **1.07x** |
| pip show pip | 162ms | 148ms | **1.09x** |
| pip check | 191ms | 174ms | **1.10x** |
| requests | 589ms | 516ms | **1.14x** |
| flask+django | 708ms | 599ms | **1.18x** |
| flask+django+boto3+requests | 1493ms | 826ms | **1.81x** |
| fastapi[standard] | 13325ms | 11664ms | **1.14x** |
| -r requirements.txt (21 pkgs) | 1344ms | 740ms | **1.82x** |

Notes:
- fastapi[standard] installs 42 packages including C extensions (uvloop, pydantic_core) that require sdist building. The 11.7s is dominated by build system overhead, not resolution.
- The complex resolution benchmark (flask+django+boto3+requests) shows the largest resolution-specific speedup (1.81x) because it exercises the largest JSON pages (botocore 4692 entries, boto3 4020 entries).

## Files Modified

1. src/pip/_internal/utils/compatibility_tags.py — lru_cache on get_supported
2. src/pip/_vendor/packaging/utils.py — lru_cache on canonicalize_name/parse_wheel_filename (16384), pre-compiled regex
3. src/pip/_vendor/packaging/version.py — lru_cache on parse(), __str__/__hash__ caching
4. src/pip/_vendor/packaging/tags.py — dict cache on _version_nodot
5. src/pip/_internal/metadata/base.py — pre-compiled project name regex
6. src/pip/_internal/metadata/importlib/_envs.py — os.scandir + distribution dict cache
7. src/pip/_internal/models/link.py — direct JSON construction, lazy URL parsing, pre-computed splitext/filename/hash, lazy upload_time
8. src/pip/_internal/index/package_finder.py — fused _evaluate_json_page, tag-first parsing, version interning, sorted applicable_candidates restoration
9. src/pip/_internal/models/target_python.py — tag tuples frozenset, tag priority cache
10. src/pip/_internal/resolution/resolvelib/factory.py — bulk specifier merge, hashes fast-path, two-level candidate infos cache
11. src/pip/_internal/models/candidate.py — version pass-through, removed frozen dataclass overhead
12. src/pip/_vendor/packaging/specifiers.py — canonical_spec cache, __str__/__hash__ caching, __eq__ fast-path
13. src/pip/_vendor/packaging/requirements.py — __str__ caching
14. src/pip/_vendor/packaging/markers.py — default_environment caching
15. src/pip/_vendor/requests/utils.py — proxy detection memoization
16. src/pip/_vendor/resolvelib/resolvers/resolution.py — hoisted method/attrgetter constants
17. src/pip/_vendor/resolvelib/structs.py — guard for empty appends
18. src/pip/_internal/utils/hashes.py — __hash__ caching, supported_hashes fast-path
19. src/pip/_internal/resolution/resolvelib/base.py — Constraint.empty() singleton
20. src/pip/_internal/operations/prepare.py — lazy wheel metadata cache
0	.codeflash/pypa/pip/infra/.gitkeep	Normal file

0	.codeflash/pypa/pip/status.md	Normal file

1	.codeflash/textualize/rich/.gitignore	vendored	Normal file

@@ -0,0 +1 @@
.DS_Store

131	.codeflash/textualize/rich/README.md	Normal file

@@ -0,0 +1,131 @@
# Rich Performance Optimization

Upstream performance improvements to [Textualize/rich](https://github.com/Textualize/rich), motivated by pip startup time profiling.

## Background

pip vendors Rich for its progress bars, logging, and error display. Profiling `pip --version` revealed Rich as one of the heaviest imports in the startup chain — `from rich.console import Console` alone took ~79ms on CPython 3.12 (Standard_D2s_v5 VM).

Rather than patching pip's vendored copy, we contributed upstream so everyone benefits.

## Results

### Import Time (hyperfine, 30+ runs, Standard_D2s_v5)

#### CPython 3.12

| Import | master | optimized | Speedup |
|---|---|---|---|
| `Console` | 79.1 ± 0.8ms | 37.5 ± 0.5ms | **2.11x** |
| `RichHandler` | 100.3 ± 3.6ms | 39.6 ± 0.5ms | **2.53x** |

#### CPython 3.13

| Import | master | optimized | Speedup |
|---|---|---|---|
| `Console` | 67.9 ± 0.7ms | 33.6 ± 0.5ms | **2.02x** |
| `RichHandler` | — | 37.5 ± 0.4ms | — |

> On Python 3.13+, `typing` no longer imports `re`, so deferring all `re.compile()` calls eliminates `re` (+ `_sre`, `re._compiler`, `re._parser`, `re._constants`) from the Console import chain entirely.

### Runtime Micro-benchmarks (Python 3.13.13)

| Benchmark | Before | After | Speedup |
|---|---|---|---|
| Style.\_\_eq\_\_ (identity) | 114ns/call | 62ns/call | **1.84x** |
| Style.combine (3 styles) | 579ns/call | 433ns/call | **1.34x** |
| Segment.simplify (identity) | 1269ns/call | 931ns/call | **1.36x** |
| Style.chain (3 styles) | 959ns/call | 878ns/call | **1.09x** |
| E2E Console.print | 173.7us/call | 171.6us/call | ~1.01x |

## What We Changed

### PR #12 — Architectural wins ([KRRT7/rich#12](https://github.com/KRRT7/rich/pull/12))

- **Replace `@dataclass` with `__slots__` classes** — `ConsoleOptions` and `ConsoleThreadLocals` used `@dataclass`, which imports `inspect` at module level (~10ms). Replaced with plain classes + `__slots__`. ConsoleOptions memory: 344 → 136 bytes (60% reduction).
- **Lazy-load emoji dictionary** — `_emoji_codes.EMOJI` (3,608 entries) loaded unconditionally via `text.py → emoji.py`. Deferred to first use via module-level `__getattr__`.
- **Defer imports across 12+ modules** — `inspect`, `pretty`, `scope`, `getpass`, `configparser`, `html.escape`, `zlib`, `traceback`, `pathlib` → deferred to the methods that actually use them.
- **`from __future__ import annotations`** — Enabled in key modules to allow moving type-only imports to `TYPE_CHECKING`.
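The dataclass-to-`__slots__` swap can be sketched as follows. The field list is illustrative rather than Rich's actual `ConsoleOptions`; the relevant properties are that no `dataclasses` machinery is imported and instances carry no per-object `__dict__`:

```python
class ConsoleOptions:
    """Plain __slots__ class standing in for the former @dataclass."""

    __slots__ = ("max_width", "min_width", "is_terminal", "encoding")

    def __init__(self, max_width=80, min_width=1, is_terminal=False, encoding="utf-8"):
        self.max_width = max_width
        self.min_width = min_width
        self.is_terminal = is_terminal
        self.encoding = encoding

    def copy(self):
        # __new__ + slot copy avoids re-running __init__ defaults.
        new = ConsoleOptions.__new__(ConsoleOptions)
        for slot in self.__slots__:
            setattr(new, slot, getattr(self, slot))
        return new
```

Besides skipping the `dataclasses` import chain at startup, `__slots__` removes the per-instance dict, which is where the 344 → 136 byte reduction quoted above comes from.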
### PR #13 — Import deferral + runtime micro-opts ([KRRT7/rich#13](https://github.com/KRRT7/rich/pull/13))

**Import deferral (7 files):**
- `color.py`: `RE_COLOR` compiled lazily in `Color.parse()` (LRU-cached)
- `text.py`: `_re_whitespace` lazy; inline `import re` in 6 methods
- `markup.py`: `RE_TAGS` via `_compile_tags()`, `RE_HANDLER` and escape regex lazy
- `_emoji_replace.py`: regex default arg → lazy `_EMOJI_SUB` global
- `_wrap.py`: `re_word` → lazy `_re_word`
- `highlighter.py`: `import re` inside `JSONHighlighter.highlight()`
- `default_styles.py`: 3 `rgb(...)` strings → `Color.from_rgb()` to avoid `Color.parse()` regex at import
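The lazy-regex pattern used across these files looks roughly like the sketch below (the regex and function name are illustrative, mirroring the `color.py` description rather than copying Rich's code). On 3.13+ this keeps `re` out of the Console import chain entirely, since compilation only happens on the first parse:

```python
import re
from functools import lru_cache

_RE_COLOR = None  # compiled on first parse, not at import time


@lru_cache(maxsize=1024)
def parse_color(color):
    global _RE_COLOR
    if _RE_COLOR is None:
        # Deferred compile: import-time cost moves to first call,
        # and the lru_cache makes repeat parses skip the regex anyway.
        _RE_COLOR = re.compile(r"^rgb\((\d{1,3}),(\d{1,3}),(\d{1,3})\)$")
    match = _RE_COLOR.match(color)
    if match is None:
        raise ValueError(f"not an rgb() color: {color!r}")
    return tuple(int(g) for g in match.groups())
```

Note that the `import re` statement itself can also be moved inside the function when the module has no other use for it, which is what several of the files above do.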
**Runtime micro-optimizations:**
- `Style.__eq__`/`__ne__`: identity shortcut (`is`) before hash comparison
- `Style.combine`/`chain`: use `_add` (LRU-cached) directly instead of `sum()` → `__add__` → `.copy()` check
- `Segment.simplify`: `is` before `==` for style comparison
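The identity shortcut is worth spelling out, since it is the cheapest of these wins. A minimal stand-in (not Rich's actual Style class):

```python
class Style:
    __slots__ = ("_attrs", "_hash")

    def __init__(self, **attrs):
        self._attrs = tuple(sorted(attrs.items()))
        self._hash = hash(self._attrs)

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        # Identity shortcut: styles are heavily interned and cached in
        # Rich, so `a == a` is common and can skip the field comparison.
        if self is other:
            return True
        if not isinstance(other, Style):
            return NotImplemented
        # Cheap hash check first, full comparison only on hash match.
        return self._hash == other._hash and self._attrs == other._attrs

    def __ne__(self, other):
        result = self.__eq__(other)
        return result if result is NotImplemented else not result
```

This is the shape behind the 114ns → 62ns identity-comparison number in the table above.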
### Upstream PR

- [Textualize/rich#4070](https://github.com/Textualize/rich/pull/4070) — Initial import deferral PR (subset of the above)

## Methodology

### Environment

- **VM**: Azure Standard_D2s_v5 (2 vCPU, 8 GB RAM, non-burstable)
- **OS**: Ubuntu 24.04 LTS
- **Region**: westus2
- **Python**: 3.12 and 3.13 via uv
- **Tooling**: hyperfine (warmup 5, min-runs 30), timeit (best of 7)

Non-burstable VM chosen for consistent CPU performance — no thermal throttling or turbo variability.

### Benchmark harness

All scripts in [`bench/`](bench/):

| Script | Purpose |
|---|---|
| `bench_import.sh` | Overall `import rich` time via hyperfine |
| `bench_module.sh` | Per-module import time (Console, RichHandler, Traceback, etc.) |
| `bench_e2e.sh` | A/B comparison: master vs optimized branch |
| `bench_compare.sh` | Generic branch comparison wrapper |
| `bench_importtime.py` | `python -X importtime` parser → sorted TSV breakdown |
| `bench_runtime.py` | PR #12 runtime benchmarks (ConsoleOptions, emoji_replace) |
| `bench_runtime2.py` | PR #13 runtime benchmarks (Style.__eq__, combine, Segment.simplify) |
| `bench_text.py` | Text hot-path benchmarks (construction, copy, divide, render) |
| `test_all_impls.sh` | Run tests across CPython 3.9–3.14 + PyPy 3.10 |

### Raw data

Hyperfine JSON exports in [`data/`](data/).

## Maintainer Engagement

Reached out to Will McGugan (Textualize CEO) via Discord. Conversation in [`discord-transcript.md`](discord-transcript.md).

Key quotes:
- "Seems like a clear win. Feel free to open a PR."
- "I'd say single PR."

## Repo Structure

```
.
├── README.md                 # This file
├── cloud-init.yaml           # VM provisioning (one-shot reproducible setup)
├── discord-transcript.md     # Will McGugan conversation
├── bench/                    # Benchmark scripts (from VM)
│   ├── bench_import.sh
│   ├── bench_module.sh
│   ├── bench_e2e.sh
│   ├── bench_compare.sh
│   ├── bench_importtime.py
│   ├── bench_runtime.py
│   ├── bench_runtime2.py
│   ├── bench_text.py
│   └── test_all_impls.sh
├── data/                     # Raw benchmark data (hyperfine JSON)
│   ├── e2e-3.12/
│   └── runtime/
└── vm-setup.md               # Azure VM provisioning instructions
```
32	.codeflash/textualize/rich/bench/bench_compare.sh	Normal file

@@ -0,0 +1,32 @@
#!/usr/bin/env bash
set -euo pipefail
BRANCH="${1:?Usage: bench_compare.sh <branch-or-commit>}"
VENV_PYTHON="$HOME/rich/.venv/bin/python"
TS=$(date +%Y%m%d-%H%M%S)
OUTDIR="$HOME/results/${BRANCH//\//-}-${TS}"
mkdir -p "$OUTDIR"

cd ~/rich
git checkout "$BRANCH"
export PATH="$HOME/.local/bin:$PATH"
uv pip install -e .

echo "=== Benchmarking branch: $BRANCH ==="

hyperfine --warmup 3 --min-runs 30 --shell=none \
  --export-json "$OUTDIR/import.json" \
  "$VENV_PYTHON -c 'import rich'"

hyperfine --warmup 3 --min-runs 20 --shell=none \
  --export-json "$OUTDIR/modules.json" \
  -n 'console' "$VENV_PYTHON -c 'from rich.console import Console'" \
  -n 'logging' "$VENV_PYTHON -c 'from rich.logging import RichHandler'" \
  -n 'traceback' "$VENV_PYTHON -c 'from rich.traceback import Traceback'" \
  -n 'syntax' "$VENV_PYTHON -c 'from rich.syntax import Syntax'" \
  -n 'markdown' "$VENV_PYTHON -c 'from rich.markdown import Markdown'"

python3 ~/bench/bench_importtime.py "import rich" "$OUTDIR/importtime.tsv"

echo ""
echo "Results saved to $OUTDIR/"
ls -la "$OUTDIR/"
59	.codeflash/textualize/rich/bench/bench_e2e.sh	Normal file

@@ -0,0 +1,59 @@
#!/usr/bin/env bash
set -euo pipefail
export PATH="$HOME/.local/bin:$PATH"

cd ~/rich
TS=$(date +%Y%m%d-%H%M%S)
OUTDIR="$HOME/results/e2e-${TS}"
mkdir -p "$OUTDIR"

VENV_PY="$HOME/rich/.venv/bin/python"
SETUP='cd ~/rich && git checkout {branch} -q && PATH=$HOME/.local/bin:$PATH uv pip install --python '"$VENV_PY"' -e . -q 2>/dev/null'

echo "=== E2E Benchmark: master vs codeflash/optimize ==="
echo "Python: $($VENV_PY --version)"
echo "Output: $OUTDIR"
echo ""

echo "--- Console import ---"
hyperfine --warmup 5 --min-runs 30 --shell=bash \
  --export-json "$OUTDIR/console.json" \
  -L branch master,codeflash/optimize \
  --setup "$SETUP" \
  -n '{branch}' \
  "$VENV_PY -c 'from rich.console import Console'"

echo ""
echo "--- RichHandler import ---"
hyperfine --warmup 5 --min-runs 30 --shell=bash \
  --export-json "$OUTDIR/richhandler.json" \
  -L branch master,codeflash/optimize \
  --setup "$SETUP" \
  -n '{branch}' \
  "$VENV_PY -c 'from rich.logging import RichHandler'"

echo ""
echo "--- import rich ---"
hyperfine --warmup 5 --min-runs 30 --shell=bash \
  --export-json "$OUTDIR/rich.json" \
  -L branch master,codeflash/optimize \
  --setup "$SETUP" \
  -n '{branch}' \
  "$VENV_PY -c 'import rich'"

echo ""
echo "--- Per-module breakdown (codeflash/optimize) ---"
git checkout codeflash/optimize -q
uv pip install --python $VENV_PY -e . -q 2>/dev/null
hyperfine --warmup 3 --min-runs 20 --shell=none \
  --export-json "$OUTDIR/modules.json" \
  -n 'import rich' "$VENV_PY -c 'import rich'" \
  -n 'Console' "$VENV_PY -c 'from rich.console import Console'" \
  -n 'RichHandler' "$VENV_PY -c 'from rich.logging import RichHandler'" \
  -n 'Traceback' "$VENV_PY -c 'from rich.traceback import Traceback'" \
  -n 'Syntax' "$VENV_PY -c 'from rich.syntax import Syntax'" \
  -n 'Markdown' "$VENV_PY -c 'from rich.markdown import Markdown'"

echo ""
echo "Results saved to $OUTDIR/"
ls -la "$OUTDIR/"
6	.codeflash/textualize/rich/bench/bench_import.sh	Normal file

@@ -0,0 +1,6 @@
#!/usr/bin/env bash
set -euo pipefail
VENV_PYTHON="$HOME/rich/.venv/bin/python"
echo "=== Rich overall import time ==="
hyperfine --warmup 3 --min-runs 30 --shell=none \
  "$VENV_PYTHON -c 'import rich'"
47	.codeflash/textualize/rich/bench/bench_importtime.py	Normal file

@@ -0,0 +1,47 @@
#!/usr/bin/env python3
"""Parse python -X importtime output and produce a sorted breakdown."""
import subprocess
import sys
import re
import os


def parse_importtime(stderr_lines):
    # The indent group must directly follow the second "|": a greedy \s+
    # before it would swallow the indentation and always report depth 0.
    pattern = re.compile(
        r"import time:\s+(\d+)\s+\|\s+(\d+)\s+\|(\s*)([\w.]+)"
    )
    results = []
    for line in stderr_lines:
        m = pattern.match(line)
        if m:
            self_us = int(m.group(1))
            cumul_us = int(m.group(2))
            indent = len(m.group(3)) // 2
            module = m.group(4)
            results.append((module, self_us, cumul_us, indent))
    return results


def main():
    target = sys.argv[1] if len(sys.argv) > 1 else "import rich"
    venv_python = os.path.expanduser("~/rich/.venv/bin/python")

    proc = subprocess.run(
        [venv_python, "-X", "importtime", "-c", target],
        capture_output=True, text=True
    )
    entries = parse_importtime(proc.stderr.splitlines())
    entries.sort(key=lambda e: e[1], reverse=True)

    print(f"{'Module':<50} {'Self (us)':>12} {'Cumul (us)':>12} {'Depth':>6}")
    print("-" * 82)
    for mod, self_us, cumul_us, depth in entries[:40]:
        print(f"{mod:<50} {self_us:>12,} {cumul_us:>12,} {depth:>6}")

    if len(sys.argv) > 2:
        with open(sys.argv[2], "w") as f:
            f.write("module\tself_us\tcumul_us\tdepth\n")
            for mod, self_us, cumul_us, depth in entries:
                f.write(f"{mod}\t{self_us}\t{cumul_us}\t{depth}\n")
        print(f"\nTSV written to {sys.argv[2]}")


if __name__ == "__main__":
    main()
13	.codeflash/textualize/rich/bench/bench_module.sh	Normal file

@@ -0,0 +1,13 @@
#!/usr/bin/env bash
set -euo pipefail
VENV_PYTHON="$HOME/rich/.venv/bin/python"
echo "=== Per-module import time ==="
hyperfine --warmup 3 --min-runs 20 --shell=none \
  -n 'rich (top-level)' "$VENV_PYTHON -c 'import rich'" \
  -n 'rich.console.Console' "$VENV_PYTHON -c 'from rich.console import Console'" \
  -n 'rich.logging.RichHandler' "$VENV_PYTHON -c 'from rich.logging import RichHandler'" \
  -n 'rich.traceback.Traceback' "$VENV_PYTHON -c 'from rich.traceback import Traceback'" \
  -n 'rich.print_json' "$VENV_PYTHON -c 'from rich import print_json'" \
  -n 'rich.syntax.Syntax' "$VENV_PYTHON -c 'from rich.syntax import Syntax'" \
  -n 'rich.pretty' "$VENV_PYTHON -c 'import rich.pretty'" \
  -n 'rich.markdown.Markdown' "$VENV_PYTHON -c 'from rich.markdown import Markdown'"
75	.codeflash/textualize/rich/bench/bench_runtime.py	Normal file

@@ -0,0 +1,75 @@
"""Benchmark the runtime optimizations in PR #12.
|
||||
|
||||
Compares:
|
||||
- ConsoleOptions.__eq__ (explicit short-circuit vs all(getattr(...)))
|
||||
- ConsoleOptions.update() (identity check vs isinstance)
|
||||
- _emoji_replace() (inline cache vs _get_emoji() call)
|
||||
|
||||
Usage:
|
||||
python3.13 bench_runtime.py
|
||||
"""
|
||||
import timeit
|
||||
import sys
|
||||
import os
|
||||
|
||||
sys.path.insert(0, os.path.expanduser("~/rich"))
|
||||
|
||||
def bench(label, stmt, setup, number=500_000):
|
||||
times = timeit.repeat(stmt, setup, number=number, repeat=5)
|
||||
best = min(times)
|
||||
per_call_ns = best / number * 1e9
|
||||
print(f" {label}: {best*1000:.1f}ms total, {per_call_ns:.0f}ns/call ({number:,} iterations, best of 5)")
|
||||
return best
|
||||
|
||||
print(f"Python {sys.version}")
|
||||
print(f"Rich path: {os.path.expanduser('~/rich')}")
|
||||
print()
|
||||
|
||||
# --- 1. ConsoleOptions.__eq__ ---
|
||||
print("=== ConsoleOptions.__eq__ ===")
|
||||
eq_setup = """\
|
||||
import sys, os
|
||||
sys.path.insert(0, os.path.expanduser("~/rich"))
|
||||
from rich.console import Console
|
||||
c = Console()
|
||||
opts_a = c.options
|
||||
opts_b = c.options.copy()
|
||||
"""
|
||||
bench("__eq__ (equal objects)", "opts_a == opts_b", eq_setup)
|
||||
bench("__eq__ (same object)", "opts_a == opts_a", eq_setup)
|
||||
|
||||
eq_setup_diff = eq_setup + """\
|
||||
from rich.console import ConsoleDimensions
|
||||
opts_c = opts_b.copy()
|
||||
opts_c.size = ConsoleDimensions(999, 999)
|
||||
"""
|
||||
bench("__eq__ (differ at size)", "opts_a == opts_c", eq_setup_diff)
|
||||
print()
|
||||
|
||||
# --- 2. ConsoleOptions.update() ---
|
||||
print("=== ConsoleOptions.update() ===")
|
||||
update_setup = """\
|
||||
import sys, os
|
||||
sys.path.insert(0, os.path.expanduser("~/rich"))
|
||||
from rich.console import Console
|
||||
c = Console()
|
||||
opts = c.options
|
||||
"""
|
||||
bench("update(width=80)", "opts.update(width=80)", update_setup)
|
||||
bench("update() no changes", "opts.update()", update_setup)
|
||||
bench("update(width=80, no_wrap=True, highlight=False)",
|
||||
"opts.update(width=80, no_wrap=True, highlight=False)", update_setup)
|
||||
print()
|
||||
|
||||
# --- 3. _emoji_replace ---
|
||||
print("=== _emoji_replace() ===")
|
||||
emoji_setup = """\
|
||||
import sys, os
|
||||
sys.path.insert(0, os.path.expanduser("~/rich"))
|
||||
from rich._emoji_replace import _emoji_replace
|
||||
"""
|
||||
bench("_emoji_replace (with emoji)", '_emoji_replace("Hello :wave: world :smile:")', emoji_setup, number=200_000)
|
||||
bench("_emoji_replace (no emoji)", '_emoji_replace("Hello world, no emojis here")', emoji_setup, number=200_000)
|
||||
print()
|
||||
|
||||
print("Done.")
99
.codeflash/textualize/rich/bench/bench_runtime2.py
Normal file
@@ -0,0 +1,99 @@
"""Benchmark the runtime micro-optimizations from 6b354159.

Targets:
1. Style.__eq__ identity shortcut
2. Style.combine/chain via _add (bypassing sum + __add__)
3. Segment.simplify with `is` check

Usage:
    cd ~/rich && ~/venv313/bin/python ~/bench/bench_runtime2.py
"""
import timeit
import sys
import os

sys.path.insert(0, os.path.expanduser("~/rich"))


def bench(label, stmt, setup, number=500_000, repeat=7):
    times = timeit.repeat(stmt, setup, number=number, repeat=repeat)
    best = min(times)
    per_call_ns = best / number * 1e9
    print(f" {label}: {best*1000:.1f}ms/{number//1000}K calls, {per_call_ns:.0f}ns/call")
    return best


print(f"Python {sys.version}")
print()

common_setup = """\
import sys, os
sys.path.insert(0, os.path.expanduser("~/rich"))
from rich.style import Style
"""

# --- 1. Style.__eq__ ---
print("=== Style.__eq__ ===")
eq_setup = common_setup + """\
s1 = Style(bold=True, color="red")
s2 = Style(bold=True, color="red")
# Force hash caching
hash(s1); hash(s2)
"""
bench("identity (s1 == s1)", "s1 == s1", eq_setup, number=1_000_000)
bench("equal (s1 == s2)", "s1 == s2", eq_setup, number=1_000_000)
bench("not-equal (s1 != Style())", "s1 != Style()", eq_setup + "s3 = Style(); hash(s3)\n", number=1_000_000)
print()

# --- 2. Style.combine ---
print("=== Style.combine ===")
combine_setup = common_setup + """\
styles = [Style(bold=True), Style(color="red"), Style(italic=True)]
"""
bench("combine(3 styles)", "Style.combine(styles)", combine_setup, number=200_000)

combine_setup_2 = common_setup + """\
styles = [Style(bold=True), Style(color="red")]
"""
bench("combine(2 styles)", "Style.combine(styles)", combine_setup_2, number=200_000)
print()

# --- 3. Style.chain ---
print("=== Style.chain ===")
chain_setup = common_setup + """\
s1 = Style(bold=True)
s2 = Style(color="red")
s3 = Style(italic=True)
"""
bench("chain(3 styles)", "Style.chain(s1, s2, s3)", chain_setup, number=200_000)
print()

# --- 4. Segment.simplify ---
print("=== Segment.simplify ===")
simplify_setup = common_setup + """\
from rich.segment import Segment
style_a = Style(bold=True, color="red")
# Same object reference (common case)
segs_identity = [Segment("hello ", style_a), Segment("world", style_a), Segment("! ", style_a)]
# Equal but different objects
style_b = Style(bold=True, color="red")
segs_equal = [Segment("hello ", style_a), Segment("world", style_b), Segment("! ", style_a)]
# Different styles (no merge)
style_c = Style(italic=True)
segs_diff = [Segment("hello ", style_a), Segment("world", style_c), Segment("! ", style_a)]
"""
bench("simplify (identity styles)", "list(Segment.simplify(segs_identity))", simplify_setup, number=200_000)
bench("simplify (equal styles)", "list(Segment.simplify(segs_equal))", simplify_setup, number=200_000)
bench("simplify (diff styles)", "list(Segment.simplify(segs_diff))", simplify_setup, number=200_000)
print()

# --- 5. E2E Console.print ---
print("=== E2E Console.print ===")
e2e_setup = common_setup + """\
from rich.console import Console
from rich.text import Text
c = Console(file=open(os.devnull, "w"), color_system="truecolor")
markup = "[bold red]Error:[/bold red] Something [italic]went wrong[/italic] in [blue underline]module.py[/blue underline]:42"
"""
bench("Console.print(markup)", "c.print(markup)", e2e_setup, number=5_000, repeat=5)
print()

print("Done.")
75
.codeflash/textualize/rich/bench/bench_text.py
Normal file
@@ -0,0 +1,75 @@
"""Benchmark Text hot paths: construction, copy, divide, render."""
import timeit
import sys
import os

sys.path.insert(0, os.path.expanduser("~/rich"))


def bench(label, stmt, setup, number=200_000):
    times = timeit.repeat(stmt, setup, number=number, repeat=5)
    best = min(times)
    per_call_ns = best / number * 1e9
    print(f" {label}: {best*1000:.1f}ms total, {per_call_ns:.0f}ns/call ({number:,} iters, best of 5)")
    return best


print(f"Python {sys.version}")
print()

common = """\
import sys, os
sys.path.insert(0, os.path.expanduser("~/rich"))
from rich.text import Text, Span
from rich.style import Style
from rich.console import Console
"""

# --- Text construction ---
print("=== Text() construction ===")
bench("Text('hello world')", "Text('hello world')", common)
bench("Text('hello world', style='bold')", "Text('hello world', style='bold')", common)
print()

# --- Text.copy ---
print("=== Text.copy() ===")
copy_setup = common + "t = Text('hello world', style='bold')\nt.stylize('red', 0, 5)\n"
bench("copy()", "t.copy()", copy_setup)
print()

# --- Text.blank_copy ---
print("=== Text.blank_copy() ===")
bench("blank_copy()", "t.blank_copy()", copy_setup)
bench("blank_copy('new text')", "t.blank_copy('new text')", copy_setup)
print()

# --- Text.divide ---
print("=== Text.divide() ===")
div_setup = common + "t = Text('hello world, this is a longer text for divide testing')\nt.stylize('bold', 0, 5)\nt.stylize('red', 6, 11)\n"
bench("divide([10, 20, 30])", "t.divide([10, 20, 30])", div_setup, number=100_000)
print()

# --- Text.render ---
print("=== Text.render() ===")
render_setup = common + """\
c = Console(width=80)
t0 = Text('hello world')
t1 = Text('hello world')
t1.stylize('bold', 0, 5)
t2 = Text('hello world')
t2.stylize('bold', 0, 5)
t2.stylize('red', 6, 11)
"""
bench("render (no spans)", "list(t0.render(c))", render_setup, number=100_000)
bench("render (1 span)", "list(t1.render(c))", render_setup, number=100_000)
bench("render (2 spans)", "list(t2.render(c))", render_setup, number=100_000)
print()

# --- E2E Console.print ---
print("=== Console.print() E2E ===")
print_setup = common + """\
import io
c = Console(file=io.StringIO(), width=80)
"""
bench("print('hello')", "c.file.seek(0); c.print('hello')", print_setup, number=50_000)
bench("print('[bold]hello[/bold]')", "c.file.seek(0); c.print('[bold]hello[/bold]')", print_setup, number=50_000)

print("\nDone.")
32
.codeflash/textualize/rich/bench/test_all_impls.sh
Normal file
@@ -0,0 +1,32 @@
#!/usr/bin/env bash
set -euo pipefail
export PATH="$HOME/.local/bin:$PATH"

BRANCH="${1:?Usage: test_all_impls.sh <branch>}"
cd ~/rich
git checkout "$BRANCH"

PYTHONS=(
    "$HOME/.local/bin/python3.9"
    "$HOME/.local/bin/python3.10"
    "$HOME/.local/bin/python3.11"
    "$HOME/.local/bin/python3.12"
    "$HOME/.local/bin/python3.13"
    "$HOME/.local/bin/python3.14"
    "$HOME/.local/bin/pypy3.10"
)

for PYTHON in "${PYTHONS[@]}"; do
    IMPL=$($PYTHON -c "import platform; print(f'{platform.python_implementation()} {platform.python_version()}')")
    echo ""
    echo "=== $IMPL ==="

    VENV_DIR="/tmp/rich-test-$(basename $PYTHON)"
    rm -rf "$VENV_DIR"
    uv venv --python "$PYTHON" "$VENV_DIR" 2>/dev/null
    VENV_PY="$VENV_DIR/bin/python"

    uv pip install --python "$VENV_PY" -e . pytest attrs 2>/dev/null

    $VENV_PY -m pytest tests/ -x -q 2>&1 | tail -3
done
113
.codeflash/textualize/rich/data/discord-transcript.md
Normal file
@@ -0,0 +1,113 @@
# Discord Conversation with Will McGugan

## April 8–9, 2026

**KRRT** — Yesterday at 11:01 PM

Hey Will, I'm working on a POC project to convince my boss — part of that is optimizing pip's startup time. pip vendors in Rich, and it's one of the heavier imports in the chain. I've been profiling it and found some quick wins on the import-time side. I could just open a PR against pip's vendored copy, but I'd rather contribute upstream so everyone benefits. I want to skip some of the red tape and wanted to establish a conversation with you on this — I'm aware of the new AI policy. I have full benchmark data from a controlled environment if you're interested.

---

**KRRT** — Yesterday at 11:12 PM

https://github.com/KRRT7/rich/pull/1

here's a draft PR on my fork for your reference

---

**Will McGugan** — Yesterday at 11:34 PM

Seems like a clear win. Feel free to open a PR.

---

**KRRT** — Yesterday at 11:38 PM

thanks, let me clean it up.

---

**KRRT** — Yesterday at 11:48 PM

I've got 8 more import-time wins stacked on top of it. Combined E2E results:

Console import: 1.50x faster (77.1ms → 51.5ms)
RichHandler import: 1.75x faster (97.6ms → 55.6ms)

All benchmarked on a dedicated VM with hyperfine, tests pass on CPython 3.9–3.14 and PyPy 3.10. The full breakdown is here: https://github.com/KRRT7/rich/pull/10

The changes are all the same pattern — deferring imports that are only used in specific code paths (inspect, pretty, scope, getpass, configparser, html/zlib for SVG export, plus a dead logging import removal and a pathlib→os.path swap). I have them as individual branches if you want to review separately, but it'd be cleaner to open a single combined PR upstream. What do you prefer?

you can see the other PRs in my fork

---

**Will McGugan** — Yesterday at 11:50 PM

I'd say single PR. Are they all needed at runtime? Maybe some can go in an if TYPE_CHECKING block.

---

**KRRT** — Yesterday at 11:52 PM

yeah, they're all needed at runtime, but not import time, that's why they're deferred to the methods that actually use them rather than put in TYPE_CHECKING

though with future annotations maybe we can do a mix of both to maintain the type checking
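A minimal sketch of the mix being discussed — annotation-only names under `TYPE_CHECKING`, runtime-only imports deferred into the method that needs them. Module and function names here are illustrative, not taken from the actual PR:

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Annotation-only: the type checker sees this; runtime never imports it.
    from inspect import Signature


def describe(func) -> Signature:
    # Runtime-only: deferred so importing this module stays cheap.
    # `inspect` is loaded on the first call, not at import time.
    import inspect

    return inspect.signature(func)
```

With postponed evaluation the `Signature` annotation is never evaluated at runtime, so the `if TYPE_CHECKING` import costs nothing unless a type checker is running.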
---

**KRRT** — 12:13 AM

ok, that worked even better than i expected
Updated numbers with everything combined:

Console import: 1.52x faster (78.8ms → 52.0ms)
RichHandler import: 2.0x faster (99.4ms → 50.0ms)

RichHandler is now faster than Console on master — it doesn't import rich.console at all anymore, defers it to first use via get_console().

I've updated the upstream PR with everything in a single commit: https://github.com/Textualize/rich/pull/4070

There's more TYPE_CHECKING opportunities in console.py, syntax.py, panel.py, and table.py too. this is just the initial low-hanging fruit, let me keep going

---

**KRRT** — 12:34 AM

I also profiled what's left after these changes and found a few bigger architectural wins that would need your input:

**Replace @dataclass with plain classes + __slots__ (~10ms import, 60% less memory)**

console.py uses @dataclass for ConsoleOptions and ConsoleThreadLocals. The dataclasses module imports inspect at module level, so it's ~10ms to load. Replacing these with plain classes eliminates the entire dataclasses→inspect chain.

Adding __slots__ at the same time gives a runtime win too: ConsoleOptions drops from 344 bytes to 136 bytes per instance (60% reduction). Since ConsoleOptions.update() creates a copy on every renderable, this adds up. The copy() method would change from __dict__.copy() to explicit slot assignment — I benchmarked this and it's the same speed (27.5ms vs 26.1ms per 100K copies). Style, Text, and Emoji already use __slots__, so this aligns with existing patterns.
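A minimal sketch of that replacement — class and field names are abbreviated placeholders, not the real ConsoleOptions attributes: a plain class with `__slots__`, where `copy()` does explicit slot assignment since slotted instances have no `__dict__` to copy.

```python
class Options:
    # Stands in for the @dataclass; __slots__ removes the per-instance __dict__.
    __slots__ = ("min_width", "max_width", "no_wrap")

    def __init__(self, min_width, max_width, no_wrap=False):
        self.min_width = min_width
        self.max_width = max_width
        self.no_wrap = no_wrap

    def copy(self):
        # Explicit slot assignment replaces __dict__.copy().
        new = Options.__new__(Options)
        new.min_width = self.min_width
        new.max_width = self.max_width
        new.no_wrap = self.no_wrap
        return new


opts = Options(1, 80)
clone = opts.copy()
assert clone.max_width == 80
assert not hasattr(opts, "__dict__")  # the memory saving comes from this
```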
**Lazy emoji loading (~2ms)**

_emoji_codes.py is a 3,608-entry dict that gets loaded unconditionally through text.py → emoji.py and console.py → _emoji_replace.py. Most users never use :emoji_name: syntax. If _emoji_codes.EMOJI were lazily loaded (e.g., a module-level __getattr__ or moving the import inside _emoji_replace()), that's ~2ms back.
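A sketch of the module-level `__getattr__` variant (PEP 562), simulated here with a synthetic module since the real `_emoji_codes` table isn't at hand; the two-entry dict stands in for the 3,608 entries:

```python
import sys
import types

# Simulate a module whose big table is built only on first attribute access.
mod = types.ModuleType("fake_emoji_codes")


def _lazy_getattr(name):
    if name == "EMOJI":
        table = {"smile": "\U0001F604", "wave": "\U0001F44B"}  # stand-in table
        mod.EMOJI = table  # cache in the module dict; __getattr__ won't fire again
        return table
    raise AttributeError(name)


mod.__getattr__ = _lazy_getattr  # PEP 562 hook, consulted only on attribute miss
sys.modules["fake_emoji_codes"] = mod

import fake_emoji_codes

assert "EMOJI" not in vars(fake_emoji_codes)  # nothing built at import time
table = fake_emoji_codes.EMOJI                # first access triggers the build
assert fake_emoji_codes.EMOJI is table        # cached: same object comes back
```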
**Remaining inspect imports in protocol.py and repr.py**

protocol.py does from inspect import isclass — same pattern I fixed in console.py, replaceable with isinstance(x, type)
repr.py does import inspect for inspect.signature() in one method — could be deferred to that method

These wouldn't save time on their own right now (inspect gets pulled in by dataclasses anyway), but they'd become free wins once #1 is done.

**Codebase-wide from __future__ import annotations**

This is the bigger unlock. Right now, most TYPE_CHECKING wins are blocked because annotation-only names share import lines with runtime names (e.g., from .style import Style, StyleType where Style is runtime but StyleType is annotation-only). With future annotations everywhere, type aliases like StyleType, TextType, AlignMethod, JustifyMethod, VerticalAlignMethod etc. could all move to TYPE_CHECKING. This is a larger change that touches many files but is mechanical and low-risk.
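The alias split described above can be sketched on a toy module — the alias and class names mirror the ones mentioned, but the definitions are made up for illustration:

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Annotation-only alias: with postponed evaluation it never has to
    # exist at runtime, so this import line carries no runtime cost.
    from typing import Union

    StyleType = Union[str, "Style"]


class Style:
    def __init__(self, markup: str = ""):
        self.markup = markup


def normalize(style: StyleType) -> Style:
    # StyleType appears only in the (unevaluated) annotation above.
    return style if isinstance(style, Style) else Style(style)


assert normalize("bold").markup == "bold"
```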
---

**KRRT** — 1:32 AM

https://github.com/KRRT7/rich/pull/12

I went ahead and prototyped the bigger architectural changes I mentioned, figured it'd be easier to show than describe.

Replaced @dataclass with plain classes + __slots__ for ConsoleOptions / ConsoleThreadLocals — eliminates the dataclasses→inspect import chain (~10ms). Also cuts ConsoleOptions memory from 344 → 136 bytes per instance (60% less). Style, Text, and Emoji already use __slots__ so it's consistent with the codebase.

Lazy-loaded _emoji_codes.EMOJI — the 3,608-entry dict was loading unconditionally even though most code paths never use emoji markup. Deferred to first use via module-level __getattr__.

the stuff around emoji looks ugly / unpythonic but it's for performance reasons.
222
.codeflash/textualize/rich/data/e2e-3.12/console.json
Normal file
@@ -0,0 +1,222 @@
{
  "results": [
    {
      "command": "master",
      "mean": 0.07810320798162165,
      "stddev": 0.00095055037378374,
      "median": 0.07818114836000001,
      "user": 0.0698524254054054,
      "system": 0.008184573513513512,
      "min": 0.07617246636000001,
      "max": 0.07979268136,
      "times": [
        0.07843564636000001, 0.07647260736000001, 0.07814914036000001, 0.07874194336000001,
        0.07863066236, 0.07713108936000002, 0.07938313736000001, 0.07642884036000001,
        0.07804165836000002, 0.07914831636000001, 0.07847910836000001, 0.07866800636000001,
        0.07718950336000001, 0.07918183136000001, 0.07933491736, 0.07794807036000001,
        0.07979268136, 0.07904792136000001, 0.07818114836000001, 0.07726757036000001,
        0.07764409136000001, 0.07719195136000001, 0.07884813136, 0.07757704336000001,
        0.07765492236, 0.07820118336000001, 0.07617985436000001, 0.07765687336,
        0.07746460936, 0.07893502436000001, 0.07792255936, 0.07833061336000001,
        0.07617246636000001, 0.07756865236, 0.07967146736000001, 0.07840486236000001,
        0.07874058936000002
      ],
      "exit_codes": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      "parameters": {
        "branch": "master"
      }
    },
    {
      "command": "codeflash/optimize",
      "mean": 0.05223519170545455,
      "stddev": 0.0006959150304968116,
      "median": 0.05235289036,
      "user": 0.04565700181818182,
      "system": 0.006518259999999999,
      "min": 0.050661796360000004,
      "max": 0.05407629236,
      "times": [
        0.05333011636, 0.05187694736, 0.052415662360000004, 0.05246903636,
        0.05217590736, 0.05117402636, 0.05173293436, 0.05197986536,
        0.05352534336, 0.05241098336, 0.05183966336, 0.051560131360000004,
        0.052137108360000003, 0.05251879536, 0.05167922736, 0.05252321736,
        0.052266155360000004, 0.05159055536, 0.05244860936, 0.05144317736,
        0.05276390636, 0.05215044236, 0.05238430236, 0.050661796360000004,
        0.052386172360000004, 0.052011422360000004, 0.05164590436, 0.05142629636,
        0.05256219536, 0.05288451536, 0.05170777336, 0.05098433936,
        0.05407629236, 0.05292583336, 0.05181803136, 0.052201927360000004,
        0.05296897136, 0.05269375836, 0.05254830636, 0.05235289036,
        0.05308978636, 0.052731576360000004, 0.05170993836, 0.05203099236,
        0.05282735636, 0.051441208360000004, 0.05107490336, 0.05317463236,
        0.05257437736, 0.05241192036, 0.052466891360000004, 0.05218884036,
        0.05340267036, 0.05272586036, 0.05083204936
      ],
      "exit_codes": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      "parameters": {
        "branch": "codeflash/optimize"
      }
    }
  ]
}
846
.codeflash/textualize/rich/data/e2e-3.12/modules.json
Normal file
@@ -0,0 +1,846 @@
{
  "results": [
    {
      "command": "import rich",
      "mean": 0.018196754719512202,
      "stddev": 0.0002996488024376836,
      "median": 0.018180326500000003,
      "user": 0.015105195121951215,
      "system": 0.002990146341463413,
      "min": 0.017536867,
      "max": 0.019437769,
      "times": [
        0.018251129, 0.018244252000000002, 0.018266125, 0.017906164000000002,
        0.018249968000000002, 0.018236207, 0.017830629, 0.018089297,
        0.017754278000000002, 0.017944224, 0.018514856, 0.018006916,
        0.017923675, 0.018400269, 0.017880253000000002, 0.017832469,
        0.017844988000000003, 0.018010199, 0.018418105, 0.018048329000000002,
        0.018273814000000003, 0.01817411, 0.018446933000000002, 0.018404291,
        0.017980897000000003, 0.017536867, 0.018476020000000003, 0.018629838000000003,
        0.018271879, 0.018597226, 0.018618404, 0.01835842,
        0.017904972, 0.018450059, 0.018311659, 0.017969042,
        0.017753378, 0.017857986000000003, 0.018337087000000002, 0.018013485000000003,
        0.017607493000000002, 0.017829739, 0.018205016, 0.017938607000000002,
        0.017705658000000003, 0.018179755000000002, 0.018241513, 0.018345740000000003,
        0.018356617000000002, 0.018049115, 0.017930293, 0.018295298,
        0.018082205, 0.018851941, 0.018233373, 0.018180898,
        0.018063023, 0.01837899, 0.018413479, 0.018589148,
        0.019437769, 0.018488971, 0.018159481, 0.018270386,
        0.018255365000000003, 0.018672402, 0.018526237, 0.018089059,
        0.017941078000000003, 0.018559174, 0.018601956000000003, 0.018376306000000002,
        0.018153656, 0.018400734000000002, 0.017879048, 0.017764375000000002,
        0.01761076, 0.017948859, 0.017896956000000002, 0.017921445,
        0.018428127000000002, 0.018333757000000003, 0.018152090000000003, 0.018414771,
        0.018037304, 0.017814868, 0.018236246, 0.017666607,
        0.017761114, 0.018145883, 0.01788361, 0.018192916,
        0.018700021, 0.018515486, 0.018381343, 0.018217801000000002,
        0.017801349, 0.018164178, 0.018517038, 0.018069474000000002,
        0.018127400000000002, 0.018050861, 0.017949531, 0.017916360000000003,
        0.018354702, 0.018411262, 0.018110822000000002, 0.018157680000000002,
        0.018126791, 0.018230155, 0.017697823, 0.017747085000000003,
        0.018269773, 0.017914693000000002, 0.018171853, 0.017959091,
        0.018171074000000002, 0.018541512, 0.018830191, 0.018365594000000002,
        0.017964542, 0.018151816, 0.018682853000000003, 0.01898715,
        0.018588472, 0.018337375, 0.018533038, 0.018078482,
        0.017932934, 0.018497584, 0.018672153, 0.018044446000000002,
        0.018454695, 0.018362808, 0.018220054, 0.018532080000000003,
        0.017951892, 0.017791683000000003, 0.017769834, 0.018381618000000002,
        0.018117405, 0.018375182, 0.018778525, 0.018107995,
        0.018097072000000002, 0.0183311, 0.018120588, 0.017952549,
        0.018339422, 0.018107447000000002, 0.017766824, 0.018754066,
        0.018259234000000003, 0.018268896, 0.018667996000000003, 0.018277583,
        0.018153804000000003, 0.018200994, 0.018260939, 0.017904843,
        0.01838597, 0.018076319, 0.017864373000000003, 0.018174286
      ],
      "exit_codes": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    },
    {
      "command": "Console",
      "mean": 0.05248550654385966,
      "stddev": 0.0007108706642267253,
      "median": 0.052470906000000005,
      "user": 0.04568101754385964,
      "system": 0.006691350877192981,
      "min": 0.050992847,
      "max": 0.05381309200000001,
      "times": [
        0.052016548, 0.052470906000000005, 0.051723211000000005, 0.052538924,
        0.053017934, 0.053500668, 0.053485133000000004, 0.053173254,
        0.052754079, 0.053803477, 0.052417272, 0.053650501,
        0.052284180000000006, 0.05305688200000001, 0.052597725000000005, 0.053455708000000005,
        0.05277209, 0.053155690000000005, 0.053752049, 0.05204993,
        0.051688107000000004, 0.052713187, 0.051776659, 0.052871581,
        0.053042656, 0.052902246, 0.05292729400000001, 0.052050587,
        0.051419384000000005, 0.050992847, 0.052629497000000004, 0.051396171000000004,
        0.052446156, 0.05350644, 0.05305356000000001, 0.052523024,
        0.05381309200000001, 0.053018389000000006, 0.05310294300000001, 0.052008286,
        0.051717645000000007, 0.052147932, 0.052172183000000004, 0.051828264000000006,
        0.052352877000000006, 0.052296342, 0.051595705000000006, 0.052370015000000006,
        0.05118607, 0.051981998, 0.052016134000000006, 0.052598726000000005,
        0.051667111, 0.051453914, 0.051994616, 0.053007785,
        0.051728289000000004
      ],
      "exit_codes": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    },
    {
      "command": "RichHandler",
      "mean": 0.05638588145098039,
      "stddev": 0.0005947363715264545,
      "median": 0.056376734000000005,
      "user": 0.04918654901960783,
      "system": 0.007074039215686279,
      "min": 0.055324908000000006,
      "max": 0.058057899,
      "times": [
        0.057823422000000006, 0.057196783, 0.056851225000000005, 0.056229926000000006,
        0.05589844, 0.056064844, 0.055944992000000006, 0.056353303,
        0.055667395, 0.055876371, 0.056712499000000006, 0.055990061,
        0.055633333, 0.05633269900000001, 0.058057899, 0.056888435,
        0.056110749, 0.056471275, 0.056792867000000004, 0.05639526500000001,
        0.055350116000000005, 0.056408873000000005, 0.05678959, 0.056398842000000005,
        0.055839638000000004, 0.055712844000000004, 0.05688121300000001, 0.055469405000000006,
        0.055445379, 0.056376734000000005, 0.056948133000000005, 0.055932182000000004,
        0.055324908000000006, 0.056271942000000005, 0.057060106000000006, 0.056296983,
        0.056830793000000004, 0.055892177, 0.056888030000000006, 0.055857778000000004,
        0.056503080000000004, 0.05593001, 0.056412152, 0.056972234000000004,
        0.056851148000000004, 0.05638242500000001, 0.056807090000000005, 0.05630981,
        0.056195705000000006, 0.057476241000000004, 0.05657261
      ],
      "exit_codes": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    },
    {
      "command": "Traceback",
      "mean": 0.09253963434375,
      "stddev": 0.0009403225621558442,
      "median": 0.09262608750000001,
      "user": 0.0829881875,
      "system": 0.009431812500000001,
      "min": 0.090248097,
      "max": 0.09427576900000001,
      "times": [
        0.092071738, 0.09285795300000001, 0.091159931, 0.093380026,
        0.09285441900000001, 0.093224544, 0.09278657200000001, 0.092307729,
        0.09311195300000001, 0.091947652, 0.092079354, 0.09091943200000001,
        0.09201131900000001, 0.0926831, 0.092154733, 0.091895371,
        0.091225806, 0.09427576900000001, 0.092972882, 0.09321852900000001,
        0.092569075, 0.090248097, 0.093477535, 0.093973213,
        0.09421566, 0.09186965500000001, 0.09277553100000001, 0.091959758,
        0.092280314, 0.09376243100000001, 0.091938661, 0.093059557
      ],
      "exit_codes": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    },
    {
      "command": "Syntax",
      "mean": 0.06469766976086957,
      "stddev": 0.0009928259814298258,
      "median": 0.064564858,
      "user": 0.05657663043478261,
      "system": 0.008004260869565216,
      "min": 0.062625494,
      "max": 0.06789864500000001,
      "times": [
        0.06508817800000001, 0.062625494, 0.06368148500000001, 0.06789864500000001,
        0.06453987500000001, 0.06478584400000001, 0.065787263, 0.067362112,
        0.064447933, 0.06612948, 0.06510700400000001, 0.063486955,
        0.064218064, 0.064889889, 0.063855414, 0.063622633,
        0.064729509, 0.06478154900000001, 0.06418591800000001, 0.06458984100000001,
        0.06423559100000001, 0.064592149, 0.064511466, 0.064991319,
        0.063613287, 0.06333606500000001, 0.065130748, 0.065149529,
        0.06382328000000001, 0.06560769400000001, 0.064206179, 0.064363587,
        0.064345068, 0.065027838, 0.064215859, 0.065392161,
        0.06645457, 0.06472346100000001, 0.06444989300000001, 0.064814671,
        0.06417358000000001, 0.06409564000000001, 0.066077574, 0.06521038700000001,
        0.064144928, 0.0635932
      ],
      "exit_codes": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0
|
||||
]
|
||||
},
|
||||
{
|
||||
"command": "Markdown",
|
||||
"mean": 0.10298694286206897,
|
||||
"stddev": 0.0012223350486491229,
|
||||
"median": 0.102965391,
|
||||
"user": 0.09229879310344825,
|
||||
"system": 0.010564827586206896,
|
||||
"min": 0.100920489,
|
||||
"max": 0.10607996400000001,
|
||||
"times": [
|
||||
0.10251539200000001,
|
||||
0.10298868400000001,
|
||||
0.10136832900000001,
|
||||
0.10398676800000001,
|
||||
0.101551582,
|
||||
0.101647445,
|
||||
0.100920489,
|
||||
0.10258349000000001,
|
||||
0.102057163,
|
||||
0.101942976,
|
||||
0.103073089,
|
||||
0.10340062800000001,
|
||||
0.10365265900000001,
|
||||
0.101700521,
|
||||
0.10272693100000001,
|
||||
0.10280757,
|
||||
0.10195390600000001,
|
||||
0.101912294,
|
||||
0.10350345100000001,
|
||||
0.104377943,
|
||||
0.104097272,
|
||||
0.102965391,
|
||||
0.103025325,
|
||||
0.102225334,
|
||||
0.10346978800000001,
|
||||
0.105075075,
|
||||
0.104930935,
|
||||
0.10607996400000001,
|
||||
0.104080949
|
||||
],
|
||||
"exit_codes": [
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
0
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
656  .codeflash/textualize/rich/data/e2e-3.12/rich.json  Normal file

@ -0,0 +1,656 @@
{
  "results": [
    {
      "command": "master",
      "mean": 0.01823860520153846,
      "stddev": 0.00033418748253994426,
      "median": 0.018192030740000004,
      "user": 0.015077637948717948,
      "system": 0.0031177301282051275,
      "min": 0.017511767240000004,
      "max": 0.01981077624,
      "times": [ /* 156 runs */ ],
      "exit_codes": [ /* 156 runs, all 0 */ ],
      "parameters": { "branch": "master" }
    },
    {
      "command": "codeflash/optimize",
      "mean": 0.01828925187398692,
      "stddev": 0.0003618156098647296,
      "median": 0.018273530240000002,
      "user": 0.014669318692810454,
      "system": 0.003577026143790847,
      "min": 0.017499048240000002,
      "max": 0.019074251240000003,
      "times": [ /* 153 runs */ ],
      "exit_codes": [ /* 153 runs, all 0 */ ],
      "parameters": { "branch": "codeflash/optimize" }
    }
  ]
}
202  .codeflash/textualize/rich/data/e2e-3.12/richhandler.json  Normal file

@ -0,0 +1,202 @@
{
  "results": [
    {
      "command": "master",
      "mean": 0.09939199042666667,
      "stddev": 0.0008906697962791088,
      "median": 0.09926365576000001,
      "user": 0.08788506,
      "system": 0.011414793333333334,
      "min": 0.09772261926,
      "max": 0.10137707126,
      "times": [ /* 30 runs */ ],
      "exit_codes": [ /* 30 runs, all 0 */ ],
      "parameters": { "branch": "master" }
    },
    {
      "command": "codeflash/optimize",
      "mean": 0.05692257022153847,
      "stddev": 0.0007609053000082627,
      "median": 0.05695223626,
      "user": 0.049386048461538455,
      "system": 0.007473440769230769,
      "min": 0.05559647026,
      "max": 0.05946960126,
      "times": [ /* 52 runs */ ],
      "exit_codes": [ /* 52 runs, all 0 */ ],
      "parameters": { "branch": "codeflash/optimize" }
    }
  ]
}
@ -0,0 +1,24 @@
Python 3.13.13 (main, Apr 7 2026, 20:49:46) [Clang 22.1.1]
Commit: 118efe21 (before runtime micro-optimizations)

=== Style.__eq__ ===
identity (s1 == s1): 114.0ms/1000K calls, 114ns/call
equal (s1 == s2): 112.8ms/1000K calls, 113ns/call
not-equal (s1 != Style()): 1205.5ms/1000K calls, 1206ns/call

=== Style.combine ===
combine(3 styles): 115.7ms/200K calls, 579ns/call
combine(2 styles): 120.8ms/200K calls, 604ns/call

=== Style.chain ===
chain(3 styles): 191.8ms/200K calls, 959ns/call

=== Segment.simplify ===
simplify (identity styles): 253.8ms/200K calls, 1269ns/call
simplify (equal styles): 253.9ms/200K calls, 1269ns/call
simplify (diff styles): 125.7ms/200K calls, 629ns/call

=== E2E Console.print ===
Console.print(markup): 868.7ms/5K calls, 173745ns/call

Done.
@ -0,0 +1,24 @@
Python 3.13.13 (main, Apr 7 2026, 20:49:46) [Clang 22.1.1]
Commit: 6b354159 (with runtime micro-optimizations)

=== Style.__eq__ ===
identity (s1 == s1): 62.0ms/1000K calls, 62ns/call
equal (s1 == s2): 129.1ms/1000K calls, 129ns/call
not-equal (s1 != Style()): 1246.2ms/1000K calls, 1246ns/call

=== Style.combine ===
combine(3 styles): 86.6ms/200K calls, 433ns/call
combine(2 styles): 110.0ms/200K calls, 550ns/call

=== Style.chain ===
chain(3 styles): 175.7ms/200K calls, 878ns/call

=== Segment.simplify ===
simplify (identity styles): 186.3ms/200K calls, 931ns/call
simplify (equal styles): 225.5ms/200K calls, 1128ns/call
simplify (diff styles): 143.6ms/200K calls, 718ns/call

=== E2E Console.print ===
Console.print(markup): 858.0ms/5K calls, 171609ns/call

Done.
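The identity numbers for `Style.__eq__` above (114ns before, 62ns after) are consistent with an `is` fast path being added, and the per-call figures come from a simple batch-timing harness. A minimal sketch of both, assuming `timeit` and using a hypothetical stand-in class rather than Rich's actual `Style`:

```python
import timeit

def report(label: str, stmt, n: int) -> None:
    """Time `stmt` n times and print in the ms-per-batch / ns-per-call format above."""
    total_s = timeit.timeit(stmt, number=n)
    print(f"{label}: {total_s * 1e3:.1f}ms/{n // 1000}K calls, {total_s * 1e9 / n:.0f}ns/call")

# Hypothetical stand-in for rich.style.Style, to keep the sketch self-contained.
class Style:
    def __init__(self, color=None):
        self.color = color

    def __eq__(self, other):
        if self is other:  # identity fast path: skips all attribute comparison
            return True
        return isinstance(other, Style) and self.color == other.color

s1, s2 = Style("red"), Style("red")
report("identity (s1 == s1)", lambda: s1 == s1, 1_000_000)
report("equal (s1 == s2)", lambda: s1 == s2, 1_000_000)
```

The identity branch only pays off when comparing an object to itself, which explains why the `equal` case above got slightly slower (113ns to 129ns) while `identity` halved.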
160  .codeflash/textualize/rich/infra/cloud-init.yaml  Normal file

@ -0,0 +1,160 @@
#cloud-config
package_update: true
packages:
  - git
  - build-essential
  - curl
  - wget
  - jq

write_files:
  - path: /home/azureuser/bench/bench_import.sh
    owner: azureuser:azureuser
    permissions: "0755"
    content: |
      #!/usr/bin/env bash
      set -euo pipefail
      VENV_PYTHON="$HOME/rich/.venv/bin/python"
      echo "=== Rich overall import time ==="
      hyperfine --warmup 3 --min-runs 30 --shell=none \
        "$VENV_PYTHON -c 'import rich'"

  - path: /home/azureuser/bench/bench_module.sh
    owner: azureuser:azureuser
    permissions: "0755"
    content: |
      #!/usr/bin/env bash
      set -euo pipefail
      VENV_PYTHON="$HOME/rich/.venv/bin/python"
      echo "=== Per-module import time ==="
      hyperfine --warmup 3 --min-runs 20 --shell=none \
        -n 'rich (top-level)' "$VENV_PYTHON -c 'import rich'" \
        -n 'rich.console.Console' "$VENV_PYTHON -c 'from rich.console import Console'" \
        -n 'rich.logging.RichHandler' "$VENV_PYTHON -c 'from rich.logging import RichHandler'" \
        -n 'rich.traceback.Traceback' "$VENV_PYTHON -c 'from rich.traceback import Traceback'" \
        -n 'rich.print_json' "$VENV_PYTHON -c 'from rich import print_json'" \
        -n 'rich.syntax.Syntax' "$VENV_PYTHON -c 'from rich.syntax import Syntax'" \
        -n 'rich.pretty' "$VENV_PYTHON -c 'import rich.pretty'" \
        -n 'rich.markdown.Markdown' "$VENV_PYTHON -c 'from rich.markdown import Markdown'"

  - path: /home/azureuser/bench/bench_importtime.py
    owner: azureuser:azureuser
    permissions: "0755"
    content: |
      #!/usr/bin/env python3
      """Parse python -X importtime output and produce a sorted breakdown."""
      import os
      import re
      import subprocess
      import sys

      def parse_importtime(stderr_lines):
          # Depth is encoded as two spaces of indent per level after the
          # second "|". Match "| " explicitly so the preceding \s+ cannot
          # greedily swallow the indent and leave the depth group empty.
          pattern = re.compile(
              r"import time:\s+(\d+)\s+\|\s+(\d+)\s+\| (\s*)([\w.]+)"
          )
          results = []
          for line in stderr_lines:
              m = pattern.match(line)
              if m:
                  self_us = int(m.group(1))
                  cumul_us = int(m.group(2))
                  indent = len(m.group(3)) // 2
                  module = m.group(4)
                  results.append((module, self_us, cumul_us, indent))
          return results

      def main():
          target = sys.argv[1] if len(sys.argv) > 1 else "import rich"
          venv_python = os.path.expanduser("~/rich/.venv/bin/python")

          proc = subprocess.run(
              [venv_python, "-X", "importtime", "-c", target],
              capture_output=True, text=True
          )
          entries = parse_importtime(proc.stderr.splitlines())
          entries.sort(key=lambda e: e[1], reverse=True)

          print(f"{'Module':<50} {'Self (us)':>12} {'Cumul (us)':>12} {'Depth':>6}")
          print("-" * 82)
          for mod, self_us, cumul_us, depth in entries[:40]:
              print(f"{mod:<50} {self_us:>12,} {cumul_us:>12,} {depth:>6}")

          if len(sys.argv) > 2:
              with open(sys.argv[2], "w") as f:
                  f.write("module\tself_us\tcumul_us\tdepth\n")
                  for mod, self_us, cumul_us, depth in entries:
                      f.write(f"{mod}\t{self_us}\t{cumul_us}\t{depth}\n")
              print(f"\nTSV written to {sys.argv[2]}")

      if __name__ == "__main__":
          main()

  - path: /home/azureuser/bench/bench_compare.sh
    owner: azureuser:azureuser
    permissions: "0755"
    content: |
      #!/usr/bin/env bash
      set -euo pipefail
      BRANCH="${1:?Usage: bench_compare.sh <branch-or-commit>}"
      VENV_PYTHON="$HOME/rich/.venv/bin/python"
      TS=$(date +%Y%m%d-%H%M%S)
      OUTDIR="$HOME/results/${BRANCH//\//-}-${TS}"
      mkdir -p "$OUTDIR"

      cd ~/rich
      git checkout "$BRANCH"
      export PATH="$HOME/.local/bin:$PATH"
      uv pip install -e .

      echo "=== Benchmarking branch: $BRANCH ==="

      hyperfine --warmup 3 --min-runs 30 --shell=none \
        --export-json "$OUTDIR/import.json" \
        "$VENV_PYTHON -c 'import rich'"

      hyperfine --warmup 3 --min-runs 20 --shell=none \
        --export-json "$OUTDIR/modules.json" \
        -n 'console' "$VENV_PYTHON -c 'from rich.console import Console'" \
        -n 'logging' "$VENV_PYTHON -c 'from rich.logging import RichHandler'" \
        -n 'traceback' "$VENV_PYTHON -c 'from rich.traceback import Traceback'" \
        -n 'syntax' "$VENV_PYTHON -c 'from rich.syntax import Syntax'" \
        -n 'markdown' "$VENV_PYTHON -c 'from rich.markdown import Markdown'"

      python3 ~/bench/bench_importtime.py "import rich" "$OUTDIR/importtime.tsv"

      echo ""
      echo "Results saved to $OUTDIR/"
      ls -la "$OUTDIR/"

  - path: /home/azureuser/setup_rich.sh
    owner: azureuser:azureuser
    permissions: "0755"
    content: |
      #!/usr/bin/env bash
      set -euo pipefail
      export PATH="$HOME/.local/bin:$PATH"

      echo "=== Installing uv ==="
      curl -LsSf https://astral.sh/uv/install.sh | sh

      echo "=== Installing Python 3.12 and 3.13 ==="
      uv python install 3.12 3.13

      echo "=== Cloning Rich ==="
      git clone https://github.com/Textualize/rich.git ~/rich

      echo "=== Creating venv and installing Rich ==="
      cd ~/rich
      uv venv --python 3.13
      uv pip install -e .

      echo "=== Creating results directory ==="
      mkdir -p ~/results

      echo "=== Done ==="
      ~/rich/.venv/bin/python -c "import rich; print(f'Rich {rich.__version__} installed')"

runcmd:
  - wget -q https://github.com/sharkdp/hyperfine/releases/download/v1.19.0/hyperfine_1.19.0_amd64.deb -O /tmp/hyperfine.deb
  - dpkg -i /tmp/hyperfine.deb
  - su - azureuser -c 'bash /home/azureuser/setup_rich.sh'
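For reference, `bench_importtime.py` consumes the stderr format CPython emits under `-X importtime`: `import time: <self us> | <cumulative us> | <module>`, with two spaces of indent per import depth after the last `|`. A minimal parsing sketch against sample lines (the timing numbers here are invented for illustration):

```python
import re

# Lines in the shape CPython writes to stderr under -X importtime:
# a header first, then one line per imported module.
sample = """\
import time: self [us] | cumulative | imported package
import time:       283 |        283 | zipimport
import time:       112 |       1205 |   email.utils
"""

# Matching "| " explicitly keeps the depth indent out of the greedy \s+.
pattern = re.compile(r"import time:\s+(\d+)\s+\|\s+(\d+)\s+\| (\s*)([\w.]+)")
rows = []
for line in sample.splitlines():
    m = pattern.match(line)
    if m:
        # (module, self_us, cumul_us, depth)
        rows.append((m.group(4), int(m.group(1)), int(m.group(2)), len(m.group(3)) // 2))

print(rows)  # → [('zipimport', 283, 283, 0), ('email.utils', 112, 1205, 1)]
```

The header line fails the `(\d+)` group and is skipped automatically, so no special-casing is needed.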
92  .codeflash/textualize/rich/infra/vm-setup.md  Normal file

@ -0,0 +1,92 @@
# Azure VM Setup for Benchmarking

## VM Spec

| Setting | Value |
|---|---|
| Name | `rich-bench` |
| Resource group | `RICH-BENCH-RG` |
| Region | `westus2` |
| Size | `Standard_D2s_v5` (2 vCPU, 8 GB RAM, **non-burstable**) |
| OS | Ubuntu 24.04 LTS |
| Image | `Canonical:ubuntu-24_04-lts:server:latest` |

Non-burstable is critical — burstable VMs (B-series) have variable CPU performance that makes benchmarks unreliable.

## Provisioning

```bash
# Create resource group
az group create --name RICH-BENCH-RG --location westus2

# Create VM
az vm create \
  --resource-group RICH-BENCH-RG \
  --name rich-bench \
  --image Canonical:ubuntu-24_04-lts:server:latest \
  --size Standard_D2s_v5 \
  --admin-username azureuser \
  --generate-ssh-keys \
  --custom-data cloud-init.yaml
```

## Cloud-init

The full cloud-init is in [`cloud-init.yaml`](cloud-init.yaml). It installs:

1. **System packages**: `git`, `build-essential`, `curl`
2. **uv**: `curl -LsSf https://astral.sh/uv/install.sh | sh`
3. **Python 3.12 + 3.13**: `uv python install 3.12 3.13`
4. **hyperfine**: From GitHub releases (latest)
5. **Rich clone**: `git clone https://github.com/Textualize/rich /home/azureuser/rich`
6. **Venvs**: `.venv` (3.12) and `venv313` (3.13) with Rich in editable mode
7. **Bench scripts**: Copied to `/home/azureuser/bench/`

## Post-provisioning verification

```bash
ssh azureuser@<ip>

# Check tools
python3.12 --version
python3.13 --version
hyperfine --version

# Check Rich
cd ~/rich && git status
~/rich/.venv/bin/python -c "import rich; print(rich.__version__)"

# Run baseline
bash ~/bench/bench_import.sh

# Verify low stddev (should be <2ms for import benchmarks)
```

## Directory layout on VM

```
/home/azureuser/
├── rich/                    # Rich repo clone (editable install)
│   ├── .venv/               # Python 3.12 venv
│   └── ...
├── venv313/                 # Python 3.13 venv
├── bench/
│   ├── bench_import.sh      # Overall import time
│   ├── bench_module.sh      # Per-module imports
│   ├── bench_e2e.sh         # A/B branch comparison
│   ├── bench_compare.sh     # Generic branch comparison
│   ├── bench_importtime.py  # -X importtime parser
│   ├── bench_runtime.py     # PR #12 runtime benchmarks
│   ├── bench_runtime2.py    # PR #13 runtime benchmarks
│   ├── bench_text.py        # Text hot-path benchmarks
│   └── test_all_impls.sh    # Multi-version test runner
└── results/                 # Benchmark output storage
```

## Why this setup

- **Dedicated VM** eliminates background process noise from a developer laptop
- **Non-burstable** gives consistent CPU frequency — no turbo boost variability
- **Two Python versions** because `typing` imports `re` on 3.12 but not 3.13, which affects the `re` deferral benchmarks
- **hyperfine** handles warmup, min-runs, and statistical reporting (mean ± stddev)
- **Editable install** allows quick branch switching without reinstall overhead
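The `--export-json` files that the bench scripts write can be compared across two runs with a short script. A sketch assuming hyperfine's JSON layout (`results[0].mean` / `results[0].stddev` in seconds); the result paths passed on the command line would come from two `bench_compare.sh` runs:

```python
import json
import sys

def load_mean(path: str) -> tuple[float, float]:
    """Read a hyperfine --export-json file; return (mean, stddev) in seconds."""
    with open(path) as f:
        result = json.load(f)["results"][0]
    return result["mean"], result["stddev"]

def compare(base_path: str, opt_path: str) -> float:
    """Print both measurements and return the speedup factor (base / optimized)."""
    base_mean, base_sd = load_mean(base_path)
    opt_mean, opt_sd = load_mean(opt_path)
    print(f"base: {base_mean * 1e3:.2f} ms ± {base_sd * 1e3:.2f} ms")
    print(f"opt:  {opt_mean * 1e3:.2f} ms ± {opt_sd * 1e3:.2f} ms")
    return base_mean / opt_mean

if __name__ == "__main__" and len(sys.argv) == 3:
    print(f"speedup: {compare(sys.argv[1], sys.argv[2]):.2f}x")
```

Applied to the `richhandler.json` data above (99.4 ms vs 56.9 ms mean), this reports roughly a 1.75x speedup.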
0  .codeflash/textualize/rich/status.md  Normal file

123  .codeflash/unstructured/core-product/README.md  Normal file
@ -0,0 +1,123 @@
# core-product Performance Optimization

Unstructured-IO's document processing pipeline -- PDF partitioning, OCR, layout detection, and element extraction. Python 3.12, multi-package uv workspace.

## Results

**Environment**: Python 3.12, Ubuntu 24.04 LTS (Azure Standard_E4s_v5, taskset -c 0), pytest-benchmark (5 rounds, 1 warmup, median reported)

### Cumulative latency progression

| Stage | 1p-tables | 10p-scan | 16p-mixed |
|---|---:|---:|---:|
| Baseline (main) | 2.93s | 35.91s | 65.32s |
| + CPU-aware serial OCR (#1502) | 2.92s (-0.3%) | 33.80s (**-5.9%**) | 61.23s (**-6.3%**) |
| + BMP render format (#1503) | 2.72s (**-7.3%**) | 30.68s (**-14.6%**) | 56.63s (**-13.3%**) |

### Memory

| Optimization | Metric | Before | After | Improvement |
|---|---|---:|---:|---:|
| CPU-aware serial OCR (#1502) | Process-tree RSS (10p scan, post-partition) | 3,491 MB | 1,398 MB | **-2,093 MB (60%)** |
| CPU-aware serial OCR (#1502) | Process-tree RSS (10p scan, pre-partition) | 2,619 MB | 499 MB | **-2,120 MB (81%)** |

### Throughput

No throughput regressions detected. Serial OCR path matches pool-on-1-CPU throughput (pool provides no parallelism benefit when pinned to 1 core).
## What We Changed

### Latency

- **BMP render format** (#1503): Replace PNG with BMP in the pdfium process-isolation worker. BMP is uncompressed — eliminates ~90ms/page of PNG compression on write and decompression on read. 9.2% incremental gain on 10p-scan (on top of serial OCR).
- **CPU-aware serial OCR** (#1502): Use `os.sched_getaffinity(0)` to detect available CPUs (respects cgroup limits + taskset masks). On single-CPU pods, the OCR worker pool is never created — avoids 4 idle workers each loading duplicate OCR/ONNX models into ~500 MB of private memory. 5.9% latency improvement + 2.1 GB memory savings.
### Memory

- **CPU-aware serial OCR** (#1502): Saves ~2.1 GB with zero latency cost on single-CPU pods.

### Prior merges (before benchmarking infrastructure)

- **Free page image before table OCR** (#1448): Release PIL image memory before table extraction starts
- **Resize-first preprocessing** (#1441): Resize numpy arrays before YOLOX preprocessing instead of after
- **Replace lazyproperty** (#1464): Switch from custom lazyproperty to stdlib `functools.cached_property`
- **Reduce attribute lookups** (#1481): Optimize `elements_intersect_vertically` inner loop
- **Fix blocking event loop** (#1400): Replace blocking CSV merge with async implementation

## Upstream Contributions

| PR | Status | Description |
|---|---|---|
| [Unstructured-IO/core-product#1503](https://github.com/Unstructured-IO/core-product/pull/1503) | Draft | Render PDF pages as BMP instead of PNG in pdfium pool |
| [Unstructured-IO/core-product#1502](https://github.com/Unstructured-IO/core-product/pull/1502) | Draft | Cap OCR workers to available CPUs — serial mode on 1-CPU pods |
| [Unstructured-IO/core-product#1500](https://github.com/Unstructured-IO/core-product/pull/1500) | Draft | Benchmark infrastructure and repo conventions |
| [Unstructured-IO/core-product#1481](https://github.com/Unstructured-IO/core-product/pull/1481) | Merged | Reduce attribute lookups in elements_intersect_vertically |
| [Unstructured-IO/core-product#1464](https://github.com/Unstructured-IO/core-product/pull/1464) | Merged | Replace lazyproperty with functools.cached_property |
| [Unstructured-IO/core-product#1448](https://github.com/Unstructured-IO/core-product/pull/1448) | Merged | Free page image before table extraction |
| [Unstructured-IO/core-product#1441](https://github.com/Unstructured-IO/core-product/pull/1441) | Merged | Resize-first numpy preprocessing for YOLOX |
| [Unstructured-IO/core-product#1400](https://github.com/Unstructured-IO/core-product/pull/1400) | Merged | Fix blocking event loop in CSV merge |

## Methodology

### Environment

- **VM**: Azure Standard_E4s_v5 (4 vCPU, 32 GB RAM, memory-optimized)
- **OS**: Ubuntu 24.04 LTS
- **Region**: westus2
- **Python**: 3.12 (project constraint: `>=3.12, <3.13`)
- **Tooling**: pytest-benchmark (5 rounds, 1 warmup, median reported), memray, cProfile
- **CPU pinning**: `taskset -c 0` to match production pod profile (1 CPU request, 32 GB RAM limit)

Non-burstable VM + CPU pinning matches production Knative pod resources. 32 GB RAM matches the pod limit exactly.

### Benchmarking methodology

- `pedantic(rounds=5, warmup_rounds=1)` — 1 warmup absorbs ONNX model JIT, page cache warming, and pool initialization overhead. 5 measured rounds enable median, IQR, and Tukey outlier detection.
- **Median** reported as primary metric (robust to up to 2 outliers in 5 samples)
- Observed stddev <0.4% of median across all measurements
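The median/IQR/Tukey computation is stdlib-level. A minimal sketch, using hypothetical round timings rather than measured data:

```python
import statistics


def summarize(samples: list[float]) -> dict:
    med = statistics.median(samples)
    q1, _, q3 = statistics.quantiles(samples, n=4)  # quartiles
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # Tukey fences
    outliers = [s for s in samples if s < lo or s > hi]
    return {"median": med, "iqr": iqr, "outliers": outliers}


rounds = [30.71, 30.68, 30.66, 30.70, 30.69]  # hypothetical 5-round timings
print(summarize(rounds))
```

With 5 rounds the median tolerates up to two wild samples, which is why it is the primary metric here.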
### Profiling approach

1. cProfile + memray -- identify hot functions and peak memory allocators
2. Per-stage benchmark instrumentation -- render, detect, OCR, merge timing breakdown
3. Cumulative progression on stacked branch with proper statistical methodology
4. Full unit test suite run before and after every change (348 tests)

### Memory measurement

Process-tree RSS measured by summing `/proc/[pid]/status` VmRSS for the main process and all direct children. This captures worker process memory that `resource.getrusage(RUSAGE_SELF)` misses.
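A minimal sketch of that measurement (Linux-only, reads `/proc` directly; the helper names are illustrative, not the benchmark's actual code):

```python
import os
from pathlib import Path


def rss_kb(pid: int) -> int:
    # Parse the "VmRSS:   123456 kB" line from /proc/<pid>/status.
    for line in Path(f"/proc/{pid}/status").read_text().splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return 0  # kernel threads carry no VmRSS line


def tree_rss_kb(pid: int) -> int:
    # Sum the main process plus its direct children, which are listed
    # per-thread in /proc/<pid>/task/*/children.
    total = rss_kb(pid)
    for task in Path(f"/proc/{pid}/task").iterdir():
        for child in (task / "children").read_text().split():
            total += rss_kb(int(child))
    return total


print(tree_rss_kb(os.getpid()), "kB")
```

Because each OCR worker is a separate process, summing over children is what makes the 2 GB pool overhead visible at all.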
### Runner convention

Benchmark scripts use `.venv/bin/python` directly for accuracy (`uv run` adds overhead). Upstream reproducers use `uv run python` for portability.
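That launcher overhead can be spot-checked with a small timing harness. This is a sketch: it times only the current interpreter, and the `uv run` variant is shown in a comment since `uv` may not be on PATH.

```python
import subprocess
import sys
import time


def mean_startup(cmd: list[str], n: int = 3) -> float:
    """Average wall-clock seconds to run `cmd` to completion."""
    t0 = time.perf_counter()
    for _ in range(n):
        subprocess.run(cmd, check=True, capture_output=True)
    return (time.perf_counter() - t0) / n


# On the VM the same harness would compare
#   [".venv/bin/python", "-c", "pass"]  vs  ["uv", "run", "python", "-c", "pass"]
print(f"{mean_startup([sys.executable, '-c', 'pass']) * 1000:.1f} ms")
```

Any per-invocation launcher cost is fixed noise added to every benchmark round, which is why the scripts bypass it.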
### Install notes

core-product is a multi-package uv workspace with three sub-packages (`unstructured_prop`, `unstructured_inference_prop`, `unstructured-api`). All share a **single root-level `.venv`**:

```bash
uv venv --python 3.12          # create root .venv
export VIRTUAL_ENV=$PWD/.venv  # all uv sync --active commands use this
make install                   # syncs all three sub-packages into shared venv
```

Private PyPI access requires:

```bash
export UV_INDEX_UNSTRUCTURED_USERNAME=unstructured
export UV_INDEX_UNSTRUCTURED_PASSWORD=<token from Keeper "Private PyPi Index">
```

System dependencies beyond build-essential: `tesseract-ocr libtesseract-dev libleptonica-dev poppler-utils libmagic1 libgl1 libglib2.0-0`

## Repo Structure

```
.
├── README.md        # This file
├── bench/           # Benchmark scripts
├── data/            # Raw benchmark data
│   └── results.tsv
└── infra/           # VM provisioning
    ├── cloud-init.yaml
    └── vm-manage.sh
```
0 .codeflash/unstructured/core-product/bench/.gitkeep Normal file

183 .codeflash/unstructured/core-product/bench/bench_throughput.py Normal file

@@ -0,0 +1,183 @@
"""Concurrent throughput benchmark for core-product partition pipeline.

Mirrors production: FastAPI calls `asyncio.to_thread(partition, ...)` per request.
Uses pytest-async-benchmark for structured timing with rich output.

Usage (on benchmark VM):
    cd ~/core-product
    # Install from pedantic-mode branch
    uv pip install --python .venv/bin/python \
        'pytest-async-benchmark[asyncio] @ git+https://github.com/KRRT7/pytest-async-benchmark.git@feat/pedantic-mode'

    # Run all benchmarks (~5 min total)
    taskset -c 0 .venv/bin/python -m pytest bench/bench_throughput.py -v -s

    # Just concurrency scaling (fast strategy, ~20s)
    taskset -c 0 .venv/bin/python -m pytest bench/bench_throughput.py -v -s -k Concurrency

    # Just OCR pipeline (hi_res, ~5 min)
    taskset -c 0 .venv/bin/python -m pytest bench/bench_throughput.py -v -s -k OCR

Copy this file to ~/core-product/bench/ on the VM before running.
"""

from __future__ import annotations

import asyncio
from pathlib import Path

import pytest

REPO_ROOT = Path(__file__).resolve().parent.parent

# ---------------------------------------------------------------------------
# Fixtures -- chosen for fast iteration
#
# "fast" strategy fixtures (pdfminer, no OCR/detection, <0.5s each):
#   These test concurrency scaling without the 6s+ OCR overhead.
#
# "hi_res" fixture (full pipeline: render + detect + OCR + merge):
#   Single small doc to measure the hot path we're optimizing.
# ---------------------------------------------------------------------------
FAST_FIXTURES = [
    ("unstructured-api/sample-docs/embedded-images-tables.pdf", "img-tables-1p"),
    ("unstructured_prop/tests/test_files/multi-column-2p.pdf", "multicol-2p"),
    ("unstructured-api/sample-docs/layout-parser-paper-with-table.pdf", "table-1p"),
]

HIRES_FIXTURE = (
    "unstructured-api/sample-docs/embedded-images-tables.pdf",
    "img-tables-1p",
)


def resolve(rel: str) -> Path:
    p = REPO_ROOT / rel
    if not p.exists():
        pytest.skip(f"fixture not found: {p}")
    return p


# ---------------------------------------------------------------------------
# Partition wrappers
# ---------------------------------------------------------------------------
def partition_sync(filepath: Path, strategy: str = "auto") -> int:
    from unstructured.partition.auto import partition

    elements = partition(filename=str(filepath), strategy=strategy)
    return len(elements)


async def partition_async(filepath: Path, strategy: str = "auto") -> int:
    """Mirrors production: asyncio.to_thread(partition, ...)"""
    return await asyncio.to_thread(partition_sync, filepath, strategy)


async def batch_serial(filepaths: list[Path], strategy: str) -> list[int]:
    results = []
    for fp in filepaths:
        results.append(await partition_async(fp, strategy))
    return results


async def batch_concurrent(filepaths: list[Path], strategy: str, concurrency: int) -> list[int]:
    sem = asyncio.Semaphore(concurrency)

    async def worker(fp: Path) -> int:
        async with sem:
            return await partition_async(fp, strategy)

    return list(await asyncio.gather(*[worker(fp) for fp in filepaths]))


# ===================================================================
# Concurrency scaling -- fast strategy (~0.2s/doc)
#
# Measures throughput scaling with concurrent requests.
# Uses fast strategy so each round completes in seconds, not minutes.
# The async patterns are identical to hi_res -- only the partition
# internals differ.
# ===================================================================
class TestConcurrencyScaling:
    """Concurrency scaling with fast strategy (practical iteration speed)."""

    @pytest.fixture
    def fast_paths(self) -> list[Path]:
        return [resolve(rel) for rel, _ in FAST_FIXTURES]

    @pytest.mark.asyncio
    @pytest.mark.async_benchmark(rounds=5, warmup_rounds=1)
    async def test_serial(self, async_benchmark, fast_paths):
        result = await async_benchmark(batch_serial, fast_paths, "fast")
        assert result["mean"] > 0

    @pytest.mark.asyncio
    @pytest.mark.async_benchmark(rounds=5, warmup_rounds=1)
    async def test_concurrent_2(self, async_benchmark, fast_paths):
        result = await async_benchmark(batch_concurrent, fast_paths, "fast", 2)
        assert result["mean"] > 0

    @pytest.mark.asyncio
    @pytest.mark.async_benchmark(rounds=5, warmup_rounds=1)
    async def test_concurrent_4(self, async_benchmark, fast_paths):
        result = await async_benchmark(batch_concurrent, fast_paths, "fast", 4)
        assert result["mean"] > 0


# ===================================================================
# OCR pipeline -- hi_res strategy (the hot path)
#
# Exercises: render PDF pages -> layout detection (ONNX) -> OCR
# (Tesseract) -> element merge. This is what experimental-agent
# optimized and where native async will have the biggest impact.
#
# Baseline (main, taskset -c 0):
#   single 1p: ~8s
#   3 docs serial: ~46s
#   3 docs concurrent: ~44s (only 5% gain -- GIL-bound)
# ===================================================================
class TestOCRPipeline:
    """OCR pipeline latency (hi_res, single doc)."""

    @pytest.mark.asyncio
    @pytest.mark.async_benchmark(rounds=2, warmup_rounds=1)
    async def test_hires_single(self, async_benchmark):
        """Single 1-page PDF through full hi_res pipeline."""
        filepath = resolve(HIRES_FIXTURE[0])
        result = await async_benchmark(partition_async, filepath, "hi_res")
        assert result["mean"] > 0

    @pytest.mark.asyncio
    @pytest.mark.async_benchmark(rounds=2, warmup_rounds=1)
    async def test_hires_serial_3docs(self, async_benchmark):
        """3 docs through hi_res, serial baseline."""
        paths = [resolve(rel) for rel, _ in FAST_FIXTURES]
        result = await async_benchmark(batch_serial, paths, "hi_res")
        assert result["mean"] > 0

    @pytest.mark.asyncio
    @pytest.mark.async_benchmark(rounds=2, warmup_rounds=1)
    async def test_hires_concurrent_3docs(self, async_benchmark):
        """3 docs through hi_res, concurrent -- measures async gain."""
        paths = [resolve(rel) for rel, _ in FAST_FIXTURES]
        result = await async_benchmark(batch_concurrent, paths, "hi_res", 3)
        assert result["mean"] > 0


# ===================================================================
# Per-doc latency -- fast strategy, parametrized
# ===================================================================
@pytest.mark.parametrize(
    "fixture_rel,fixture_id",
    FAST_FIXTURES,
    ids=[fid for _, fid in FAST_FIXTURES],
)
class TestPerDocLatency:
    """Per-document partition latency (fast strategy)."""

    @pytest.mark.asyncio
    @pytest.mark.async_benchmark(rounds=5, warmup_rounds=1)
    async def test_fast(self, async_benchmark, fixture_rel, fixture_id):
        filepath = resolve(fixture_rel)
        result = await async_benchmark(partition_async, filepath, "fast")
        assert result["mean"] > 0
50 .codeflash/unstructured/core-product/data/conventions.md Normal file

@@ -0,0 +1,50 @@
# Conventions

## Code Style

- **Line length**: 100
- **Formatter**: ruff
- **Python version**: 3.12

## Workspace Layout

Multi-package uv workspace with a shared `.venv` at the repo root.

| Package | Description |
|---------|-------------|
| `unstructured_prop/` | Proprietary extensions to `unstructured` |
| `unstructured_inference_prop/` | Proprietary extensions to `unstructured-inference` |
| `unstructured-api/` | FastAPI service entry point |

## Installation

```bash
make install
```

Requires `UV_EXTRA_INDEX_URL` for the private Azure DevOps PyPI registry.

## Testing

```bash
make test-unit
```

## Benchmarks

Benchmarks live in `.codeflash/benchmarks/` and use pytest-benchmark.

```bash
# Run all benchmarks (pin to 1 CPU to match production pod)
taskset -c 0 .venv/bin/python -m pytest .codeflash/benchmarks/ -v

# Fast iteration (1-page fixture only)
taskset -c 0 .venv/bin/python -m pytest .codeflash/benchmarks/ -v -k "1p"
```

## Production Profile

- **Runtime**: Knative
- **CPU**: 1 request
- **RAM**: 32 GB
- **Benchmark VM**: Azure Standard_E4s_v5, `taskset -c 0` to match
1 .codeflash/unstructured/core-product/data/results.tsv Normal file

@@ -0,0 +1 @@

commit target category before after speedup tests_passed tests_failed status description
145 .codeflash/unstructured/core-product/infra/cloud-init.yaml Normal file

@@ -0,0 +1,145 @@
#cloud-config
#
# Benchmark VM provisioning for Unstructured-IO/core-product
#
# Document processing pipeline -- Python 3.12, uv-based multi-package workspace.
# Private repo: requires SSH agent forwarding for clone (not done in cloud-init).
#
# Two-phase setup:
#   Phase 1 (cloud-init): packages, hyperfine, uv, bench scripts
#   Phase 2 (manual): ssh -A, clone, make install, baseline benchmarks
#
# Usage:
#   az vm create ... --custom-data infra/cloud-init.yaml
#   bash infra/vm-manage.sh ssh   # connects with -A for agent forwarding
#   bash ~/setup.sh               # clone + install + verify
#
# VM: Azure Standard_E4s_v5 (4 vCPU, 32 GB RAM, memory-optimized)
# Matches production pod profile (1 CPU request, 32 GB RAM limit).
# Use taskset -c 0 to pin benchmarks to 1 core for production-realistic results.
# Non-burstable ensures consistent CPU -- no thermal throttling or turbo variability.

package_update: true
packages:
  - git
  - build-essential
  - curl
  - wget
  - jq
  - tesseract-ocr
  - libtesseract-dev
  - libleptonica-dev
  - poppler-utils
  - libmagic1
  - libgl1
  - libglib2.0-0
  - pandoc
  - libreoffice-core

write_files:
  # --- Benchmark: unit tests (fast, no ML models) ---
  - path: /home/azureuser/bench/bench_tests.sh
    owner: azureuser:azureuser
    permissions: "0755"
    defer: true
    content: |
      #!/usr/bin/env bash
      set -euo pipefail

      cd ~/core-product
      PYTHON=.venv/bin/python

      echo "=== core-product unit tests (fast, no ML models) ==="
      echo ""

      $PYTHON -m pytest unstructured_prop/tests/unit -n auto -m "not slow" -q 2>&1 | tail -5
      $PYTHON -m pytest unstructured_inference_prop/tests/unit -n auto -m "not slow" -q 2>&1 | tail -5

  # --- Benchmark: A/B branch comparison ---
  - path: /home/azureuser/bench/bench_compare.sh
    owner: azureuser:azureuser
    permissions: "0755"
    defer: true
    content: |
      #!/usr/bin/env bash
      set -euo pipefail
      BRANCH="${1:?Usage: bench_compare.sh <branch-or-commit>}"
      TS=$(date +%Y%m%d-%H%M%S)
      OUTDIR="$HOME/results/${BRANCH//\//-}-${TS}"
      mkdir -p "$OUTDIR"

      cd ~/core-product
      git fetch origin
      git checkout "$BRANCH"

      # Rebuild after switching branches
      export PATH="$HOME/.local/bin:$PATH"
      export VIRTUAL_ENV=$PWD/.venv
      make install

      PYTHON=.venv/bin/python

      echo "=== Benchmarking branch: $BRANCH ==="

      # Unit tests (fast, no ML models)
      $PYTHON -m pytest unstructured_prop/tests/unit -n auto -m "not slow" -q 2>&1 | tee "$OUTDIR/test_output.txt"

      echo ""
      echo "Results saved to $OUTDIR/"
      ls -la "$OUTDIR/"

  # --- Benchmark: side-by-side two branches ---
  - path: /home/azureuser/bench/bench_ab.sh
    owner: azureuser:azureuser
    permissions: "0755"
    defer: true
    content: |
      #!/usr/bin/env bash
      set -euo pipefail
      BASE="${1:?Usage: bench_ab.sh <base-branch> <opt-branch>}"
      OPT="${2:?Usage: bench_ab.sh <base-branch> <opt-branch>}"

      echo "=== A/B comparison: $BASE vs $OPT ==="
      bash ~/bench/bench_compare.sh "$BASE"
      bash ~/bench/bench_compare.sh "$OPT"

      echo ""
      echo "Compare results in ~/results/"
      ls ~/results/

  # --- Post-provision setup (run manually after ssh -A) ---
  - path: /home/azureuser/setup.sh
    owner: azureuser:azureuser
    permissions: "0755"
    defer: true
    content: |
      #!/usr/bin/env bash
      set -euo pipefail
      export PATH="$HOME/.local/bin:$PATH"

      echo "=== Cloning core-product ==="
      git clone git@github.com:Unstructured-IO/core-product.git ~/core-product
      cd ~/core-product

      echo "=== Creating shared venv ==="
      uv venv --python 3.12
      export VIRTUAL_ENV=$PWD/.venv

      echo "=== Installing dependencies ==="
      # Requires UV_INDEX_UNSTRUCTURED_USERNAME and UV_INDEX_UNSTRUCTURED_PASSWORD
      # to be set for private PyPI access
      make install

      echo "=== Creating results directory ==="
      mkdir -p ~/results

      echo "=== Verifying installation ==="
      .venv/bin/python -c 'from unstructured.partition.pdf import partition_pdf; print("OK")'

      echo "=== Done ==="

runcmd:
  - wget -q https://github.com/sharkdp/hyperfine/releases/download/v1.19.0/hyperfine_1.19.0_amd64.deb -O /tmp/hyperfine.deb
  - dpkg -i /tmp/hyperfine.deb
  # Install uv (phase 1 -- no git clone, that requires SSH agent forwarding)
  - su - azureuser -c 'curl -LsSf https://astral.sh/uv/install.sh | sh'
111 .codeflash/unstructured/core-product/infra/vm-manage.sh Executable file

@@ -0,0 +1,111 @@
#!/usr/bin/env bash
#
# Template: Azure benchmark VM lifecycle management
#
# Customize:
#   1. Replace core-product with your project name (e.g., "rich", "myapi")
#   2. Adjust SIZE if your project needs more/less resources
#   3. Update the cloud-init path if yours lives elsewhere
#
# Usage:
#   bash infra/vm-manage.sh {create|start|stop|ip|ssh|bench <branch>|destroy}

set -euo pipefail

RG="core-product-BENCH-RG"
VM="core-product-bench"
REGION="westus2"
SIZE="Standard_E4s_v5"
IMAGE="Canonical:ubuntu-24_04-lts:server:latest"
SSH_KEY="${SSH_KEY:-$HOME/.ssh/id_ed25519.pub}"

case "${1:-help}" in
  create)
    if [ ! -f "$SSH_KEY" ]; then
      echo "Error: SSH public key not found at $SSH_KEY"
      echo "Generate one: ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519"
      echo "Or set SSH_KEY=/path/to/key.pub"
      exit 1
    fi

    echo "Creating resource group..."
    az group create --name "$RG" --location "$REGION" --only-show-errors --output none

    echo "Creating VM (Trusted Launch, SSH-only, locked-down NSG)..."
    az vm create \
      --resource-group "$RG" \
      --name "$VM" \
      --image "$IMAGE" \
      --size "$SIZE" \
      --os-disk-size-gb 64 \
      --admin-username azureuser \
      --ssh-key-values "$SSH_KEY" \
      --authentication-type ssh \
      --security-type TrustedLaunch \
      --enable-secure-boot true \
      --enable-vtpm true \
      --nsg-rule NONE \
      --custom-data infra/cloud-init.yaml \
      --only-show-errors

    MY_IP=$(curl -s ifconfig.me)
    echo "Restricting SSH to $MY_IP..."
    az network nsg rule create \
      --resource-group "$RG" \
      --nsg-name "${VM}NSG" \
      --name AllowSSHFromMyIP \
      --priority 1000 \
      --source-address-prefixes "$MY_IP/32" \
      --destination-port-ranges 22 \
      --access Allow \
      --protocol Tcp \
      --output none

    echo "VM created. Get IP with: $0 ip"
    ;;

  start)
    echo "Starting VM..."
    az vm start --resource-group "$RG" --name "$VM"
    echo "Started. IP: $(az vm show -g "$RG" -n "$VM" -d --query publicIps -o tsv)"
    ;;

  stop)
    echo "Deallocating VM (stops billing)..."
    az vm deallocate --resource-group "$RG" --name "$VM"
    echo "Deallocated."
    ;;

  ip)
    az vm show -g "$RG" -n "$VM" -d --query publicIps -o tsv
    ;;

  ssh)
    IP=$(az vm show -g "$RG" -n "$VM" -d --query publicIps -o tsv)
    ssh -A azureuser@"$IP" "${@:2}"
    ;;

  bench)
    BRANCH="${2:?Usage: $0 bench <branch>}"
    IP=$(az vm show -g "$RG" -n "$VM" -d --query publicIps -o tsv)
    ssh -A azureuser@"$IP" "bash ~/bench/bench_compare.sh $BRANCH"
    ;;

  destroy)
    echo "Destroying resource group (all resources)..."
    az group delete --name "$RG" --yes --no-wait
    echo "Deletion started."
    ;;

  help|*)
    echo "Usage: $0 {create|start|stop|ip|ssh|bench <branch>|destroy}"
    echo ""
    echo "  create  - Provision VM with cloud-init"
    echo "  start   - Start deallocated VM"
    echo "  stop    - Deallocate VM (stops billing)"
    echo "  ip      - Show VM public IP"
    echo "  ssh     - SSH into VM"
    echo "  bench   - Run benchmarks on a branch"
    echo "  destroy - Delete resource group and all resources"
    ;;
esac
59 .codeflash/unstructured/core-product/status.md Normal file

@@ -0,0 +1,59 @@
# core-product Status

Last updated: 2026-04-10

## Current state

Stacked PR #1500 updated with cumulative progression (proper benchmarking: 5 rounds + 1 warmup, median reported, <0.4% stddev). Cumulative: 14.6% latency on 10p-scan, 13.3% on 16p-mixed, 2.1 GB memory savings. Next: optimization #4 (direct numpy-to-BMP for tesseract).

## Target repo

`~/Desktop/work/unstructured_org/core-product` on branch `main` (PR branch: `perf/cpu-aware-serial-ocr`)

## PRs

| PR | Branch | Status | Description |
|---|---|---|---|
| #1503 | `perf/bmp-render-format` | Draft | Render PDF pages as BMP instead of PNG in pdfium pool |
| #1502 | `perf/cpu-aware-serial-ocr` | Draft | Cap OCR workers to available CPUs (serial mode on 1-CPU pods) |
| #1500 | `codeflash-agent` | Draft | Stacked optimizations + benchmark infra (cumulative progression) |
| #1481 | `perf/elements-intersect-vertically` | Merged | Reduce attribute lookups |
| #1464 | `replace-lazyproperty-with-cached-property` | Merged | Replace lazyproperty with functools.cached_property |
| #1448 | `mem/free-pil-before-table-extraction` | Merged | Free page image before table OCR |
| #1441 | `mem/numpy-preprocessing-yolox` | Merged | Resize-first preprocessing |
| #1400 | `async-join-responses` | Merged | Fix blocking event loop in CSV merge |

## Optimization queue

1. ~~**CPU-aware serial OCR**~~ — PR #1502 open (draft), benchmarked. Rebase after #1501 merges.
2. ~~**Early memory release**~~ — skipped, codebase already well-optimized (context managers, per-page cleanup)
3. ~~**BMP render format**~~ — PR #1503 open (draft), benchmarked. 14.9% latency improvement on 10p-scan.
4. **Direct numpy-to-BMP for tesseract** — encode from numpy without PIL round-trip
5. Skip remaining PIL↔numpy conversions in OCR path

## Dependencies

- PR #1501 (segfault fix, `patched_convert_pdf_to_image` refactor) must merge before #1502 rebase. Different functions, clean rebase expected.

## VM

- **IP**: 40.65.91.158
- **Size**: Standard_E4s_v5
- **RG**: core-product-BENCH-RG
- **State**: Running (verified 2026-04-10)
- **Git auth**: HTTPS with embedded token (set previously). Use `ssh -A` for agent forwarding if token expires.
- **Note**: `uv` is at `~/.local/bin/uv` — needs `export PATH=$HOME/.local/bin:$PATH` in non-login shells.
- **Note**: `pytest-benchmark` installed in `.venv` (not in lockfile).

## Next steps

1. Implement "pass file path to tesseract" optimization (skip PIL→numpy→PIL→temp-file round-trip)
2. Benchmark on VM, open draft PR
3. Rebase #1502 once #1501 merges

## Notes

- `memray tree` opens a TUI — do not run directly over SSH. Use `memray stats`, `memray summary`, or `memray flamegraph --output file.html` instead.
- memray peak is 1.0 GB (10p scan, serial path). 10 GB total allocated = heavy PIL churn per page, not accumulation.
- Benchmarking: use `pedantic(rounds=5, warmup_rounds=1)` — warmup absorbs ONNX JIT + page cache. Observed stddev <0.4% of median. Guest CPU frequency controls are ineffective on Azure Hyper-V — use statistical methods (more rounds + median) instead of trying to pin frequency.
- Workflow: independent `perf/<name>` branch → open individual draft PR → cherry-pick to `codeflash-agent` → benchmark stacked progression → update #1500 body.
6 .github/workflows/github-app-tests.yml (vendored)

@@ -3,11 +3,11 @@ name: GitHub App Tests
 on:
   pull_request:
     paths:
-      - "github-app/**"
+      - "packages/github-app/**"
   push:
     branches: [main, main-teammate]
     paths:
-      - "github-app/**"
+      - "packages/github-app/**"
 
 jobs:
   test:

@@ -19,7 +19,7 @@ jobs:
       contents: read
     defaults:
       run:
-        working-directory: github-app
+        working-directory: packages/github-app
     steps:
       - name: Checkout
         uses: actions/checkout@v4
63 .github/workflows/packages-ci.yml (vendored) Normal file

@@ -0,0 +1,63 @@
name: Packages CI

on:
  pull_request:
    paths:
      - "packages/codeflash-core/**"
      - "packages/codeflash-python/**"
      - "pyproject.toml"
      - "uv.lock"
  push:
    branches: [main]
    paths:
      - "packages/codeflash-core/**"
      - "packages/codeflash-python/**"
      - "pyproject.toml"
      - "uv.lock"

jobs:
  check:
    runs-on: ubuntu-latest
    concurrency:
      group: packages-ci-${{ github.head_ref || github.run_id }}
      cancel-in-progress: true
    permissions:
      contents: read

    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 0 # needed for version check against origin/main

      - name: Set up Python 3.12
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install uv
        uses: astral-sh/setup-uv@v6

      - name: Install dependencies
        run: uv sync --all-packages

      - name: Ruff check
        run: uv run ruff check packages/

      - name: Ruff format
        run: uv run ruff format --check packages/

      - name: Interrogate
        run: uv run interrogate packages/codeflash-core/src/ packages/codeflash-python/src/

      - name: Mypy
        run: uv run mypy packages/codeflash-core/src/ packages/codeflash-python/src/

      - name: Pytest
        run: uv run pytest packages/ -v
        env:
          CI: "true"

      - name: Check version bump
        if: github.event_name == 'pull_request'
        run: uv run python scripts/versioning.py check-version
5 .gitignore (vendored)

@@ -2,8 +2,11 @@
 __pycache__/
 *.pyc
 .venv/
-.codeflash/
+# .codeflash/ ephemeral files (case study data is tracked)
+.codeflash/observability/
 original_base_research/
 .claude/settings.local.json
 .claude/handoffs/
 dist/
 dist-*/
 dist-v2/
80 CLAUDE.md

@@ -2,37 +2,67 @@
 
Monorepo for the Codeflash optimization platform: Python packages, Claude Code plugin, and services.
|
||||
|
||||
## Layout
|
||||
## Case Studies
|
||||
|
||||
- **`packages/`** — UV workspace with Python packages (core, python, mcp, lsp)
|
||||
- **`plugin/`** — Claude Code plugin (language-agnostic base: review agent, hooks, shared references)
|
||||
- **`languages/python/plugin/`** — Python-specific plugin overlay (domain agents, skills, references)
|
||||
- **`vendor/codex/`** — Vendored OpenAI Codex runtime
|
||||
- **`services/github-app/`** — GitHub App integration service
|
||||
- **`evals/`** — Eval templates and real-repo scenarios
|
||||
Active case study data lives in `.codeflash/{org}/{project}/` (status, bench scripts, raw data, VM infra). Summaries are built out of `.codeflash/` into `case-studies/{org}/{project}/`.
|
||||
|
||||
## Build
|
||||
Active case studies in `.codeflash/`:
|
||||
- `microsoft/typeagent`
|
||||
- `unstructured/core-product`
|
||||
- `netflix/metaflow`
|
||||
- `coveragepy/coveragepy`
|
||||
- `textualize/rich`
|
||||
- `pypa/pip`
|
||||
|
||||
```bash
|
||||
make build-plugin # Assemble plugin → dist/ (base + python overlay + vendor)
|
||||
make clean # Remove dist/
|
||||
```
|
||||
### Directory conventions
|
||||
|
||||
## Packages (UV workspace)
|
||||
Target repos live in `~/Desktop/work/{org}_org/{project}`:
|
||||
- `microsoft_org/typeagent`
|
||||
- `unstructured_org/core-product`
|
||||
- `netflix_org/metaflow`
|
||||
- `coveragepy_org/coveragepy`
|
||||
|
||||
```bash
|
||||
uv sync # Install all packages + dev deps
|
||||
prek run --all-files # Lint: ruff check, ruff format, interrogate, mypy
|
||||
uv run pytest packages/ -v # Test all packages
|
||||
```
|
||||
### Optimization flow
|
||||
|
||||
Package-specific conventions (attrs patterns, type annotations, testing) are in `packages/.claude/rules/` and load automatically when editing package source.
|
||||
1. **Make changes** in the target repo on a `perf/<description>` branch
|
||||
2. **Run tests locally** to verify nothing breaks
|
||||
3. **Commit and push** to the fork
|
||||
4. **Benchmark on the VM** via `ssh -A azureuser@<ip> "cd ~/<project> && git fetch origin && ..."`
|
||||
5. **Record results** in `.codeflash/{org}/{project}/data/results.tsv`
|
||||
6. **Update status.md** in `.codeflash/{org}/{project}/`
|
||||
7. **Open a PR** on the fork with VM benchmark numbers
|
||||
|
||||
## Plugin Development
|
||||
### VM access
|
||||
|
||||
The plugin is split for composition:
|
||||
- `plugin/` has language-agnostic agents, hooks, and shared references
|
||||
- `languages/python/plugin/` has Python domain agents, skills, and references
|
||||
- `make build-plugin` merges them into `dist/` with path rewriting
|
||||
VMs use SSH agent forwarding -- always connect with `ssh -A`:
|
||||
|
||||
Agent files use `${CLAUDE_PLUGIN_ROOT}` for references. When editing agents, be aware that paths differ between source (`languages/python/plugin/references/`) and assembled (`references/`).
|
||||
| Project | VM IP | Size | Resource group |
|
||||
|---|---|---|---|
|
||||
| core-product | 40.65.91.158 | Standard_E4s_v5 | core-product-BENCH-RG |
|
||||
| typeagent | 40.65.81.123 | Standard_D2s_v5 | typeagent-BENCH-RG |
|
||||
|
||||
If SSH times out, check:
|
||||
1. VM is running: `az vm start --resource-group <RG> --name <vm>`
|
||||
2. NSG IP is current: update `AllowSSHFromMyIP` source address in the Azure portal or via `az network nsg rule update`
|
||||
|
||||
### PR strategy
|
||||
|
||||
- **Individual PRs** on the fork (`KRRT7/<repo>`) -- one per optimization on a `perf/<description>` branch. Each is self-contained with its own benchmark numbers.
|
||||
- **Stacked draft PR** (optional) on the fork (`--base main --head optimization`) -- accumulates all optimizations, shows cumulative gain.
|
||||
|
||||
### Benchmarking
|
||||
|
||||
- **`codeflash compare`** for internal benchmarks (fork PRs) -- worktree-isolated, per-function breakdown, structured markdown. Does NOT handle import time yet -- use hyperfine for that.
|
||||
- **hyperfine** for upstream PRs and import time measurements -- portable, no codeflash dependency for maintainers to install.
|
||||
- **Keep the VM running** during optimization sessions -- don't deallocate between benchmarks
|
||||
- **Cloud-init must use ASCII only** -- Azure CLI chokes on non-ASCII (em dashes, etc.)
|
||||
|
||||
### Runner convention
|
||||
|
||||
Use `$RUNNER` in docs and scripts to refer to the Python runner. The value depends on context:
|
||||
|
||||
| Context | `$RUNNER` value | Why |
|
||||
|---|---|---|
|
||||
| VM benchmark scripts | `.venv/bin/python` | Accuracy -- uv run adds ~50% overhead and 2.5x variance |
|
||||
| Upstream PR reproducers | `uv run python` | Portability -- matches how the target team works |
|
||||
| Setup / verify steps | `uv run python` | Measurement accuracy doesn't matter |
|
||||
|
|
@ -9,24 +9,28 @@
|
|||
## Services
|
||||
- `services/github-app/` — GitHub App integration service
|
||||
|
||||
## Plugin (language-agnostic)
|
||||
- `plugin/agents/codeflash-review.md` — review agent
|
||||
- `plugin/agents/codeflash-researcher.md` — research agent
|
||||
- `plugin/commands/` — codex CLI commands
|
||||
- `vendor/codex/` — codex companion scripts and schemas (vendored)
|
||||
- `plugin/references/shared/` — shared methodology (experiment loop, templates, benchmarks)
|
||||
- `plugin/hooks/` — session lifecycle and review gate hooks
|
||||
|
||||
## Languages (per-language content)
|
||||
- `languages/python/plugin/agents/codeflash.md` — router that detects the domain and delegates
|
||||
- `languages/python/plugin/agents/codeflash-cpu.md`, `codeflash-memory.md`, `codeflash-async.md`, `codeflash-structure.md` — one agent per domain
|
||||
- `languages/python/plugin/agents/codeflash-setup.md` — detects project env, installs deps
|
||||
- `languages/python/plugin/skills/` — `/codeflash-optimize` entry point, memray profiling
|
||||
- `languages/python/plugin/references/` — domain-specific deep-dive docs (async, memory, data-structures, structure)
|
||||
## Plugin
|
||||
- `plugin/` — Claude Code plugin (self-contained, multi-language). See [plugin/README.md](plugin/README.md) for architecture and session flow.
|
||||
|
||||
## Evals
|
||||
- `evals/templates/` — 9 synthetic eval scenarios (v1: ranking, memory, crossdomain, layered)
|
||||
- `evals/repos/` — real-repo evals (v2: clone a repo at a specific commit, agent finds and fixes the bug)
|
||||
|
||||
Two types of evals, both run through `run-eval.sh`:
|
||||
|
||||
**v1 (templates)** — Small synthetic projects in `evals/templates/`. Each bundles source code, tests, and a `pyproject.toml`. The runner copies the template to a temp dir, installs deps with `uv`, and runs Claude. Good for testing specific agent behaviors (ranking accuracy, memory profiling methodology, cross-domain detection). 9 templates across ranking, memory, crossdomain, and layered types.
|
||||
|
||||
**v2 (repos)** — Real repos in `evals/repos/`. Each has a `manifest.json` pointing to a GitHub repo + commit where a known bug exists. The runner shallow-clones the repo (cached locally after first run), drops Claude in, and the agent handles everything — setup, profiling, diagnosis, fix. More realistic but slower and more expensive (~$2/run). The manifest includes a `fix_commit` for reference and a rubric for scoring.
|
||||
|
||||
Each eval produces results in `evals/results/<name>-<timestamp>/`. Score with `score.py`, which uses a mix of deterministic checks (did the agent use a profiler? did tests pass?) and LLM grading against the manifest's rubric.
|
||||
|
||||
**Regression testing** — Go to Actions > "Eval Regression" > Run workflow. Runs a subset of evals, scores them, compares to baselines in `evals/baseline-scores.json`. Fails if any score drops below threshold. Use before merging agent behavior changes.
|
||||
|
||||
```
|
||||
./evals/run-eval.sh --list # see all evals (v1 + v2)
|
||||
./evals/run-eval.sh ranking --skill-only # run a v1 eval
|
||||
./evals/run-eval.sh codeflash-internal-psycopg-serialization --skill-only # run a v2 eval
|
||||
./evals/score-eval.sh evals/results/<dir> # score it
|
||||
./evals/check-regression.sh # full regression check
|
||||
```
|
||||
|
||||
## CI (runs on every PR)
|
||||
|
||||
|
|
@ -39,32 +43,13 @@ The `validate` workflow runs Claude with the `plugin-dev` plugin to check:
|
|||
|
||||
Warnings are blocking — any issue fails the job. Claude posts a summary comment on the PR.
|
||||
|
||||
## Evals
|
||||
|
||||
Two types of evals, both run through `run-eval.sh`:
|
||||
|
||||
**v1 (templates)** — Small synthetic projects in `evals/templates/`. Each bundles source code, tests, and a `pyproject.toml`. The runner copies the template to a temp dir, installs deps with `uv`, and runs Claude. Good for testing specific agent behaviors (ranking accuracy, memory profiling methodology, cross-domain detection). 9 templates across ranking, memory, crossdomain, and layered types.
|
||||
|
||||
**v2 (repos)** — Real repos in `evals/repos/`. Each has a `manifest.json` pointing to a GitHub repo + commit where a known bug exists. The runner shallow-clones the repo (cached locally after first run), drops Claude in, and the agent handles everything — setup, profiling, diagnosis, fix. More realistic but slower and more expensive (~$2/run). The manifest includes a `fix_commit` for reference and a rubric for scoring.
|
||||
|
||||
Each eval produces results in `evals/results/<name>-<timestamp>/`. Score with `score.py`, which uses a mix of deterministic checks (did the agent use a profiler? did tests pass?) and LLM grading against the manifest's rubric.
|
||||
|
||||
**Regression testing** — Go to Actions → "Eval Regression" → Run workflow. Runs a subset of evals, scores them, compares to baselines in `evals/baseline-scores.json`. Fails if any score drops below threshold. Use before merging agent behavior changes.
|
||||
|
||||
```
|
||||
./evals/run-eval.sh --list # see all evals (v1 + v2)
|
||||
./evals/run-eval.sh ranking --skill-only # run a v1 eval
|
||||
./evals/run-eval.sh codeflash-internal-psycopg-serialization --skill-only # run a v2 eval
|
||||
./evals/score-eval.sh evals/results/<dir> # score it
|
||||
./evals/check-regression.sh # full regression check
|
||||
```
|
||||
|
||||
## Key conventions
|
||||
|
||||
- Domain agents are self-contained — all methodology is inline, no required file reads before starting
|
||||
- Every agent uses the same experiment loop structure (choose target → implement → benchmark → keep/discard → commit only on KEEP)
|
||||
- Every agent uses the same experiment loop structure (choose target > implement > benchmark > keep/discard > commit only on KEEP)
|
||||
- Changes to one domain agent should be mirrored to others where applicable (CI enforces this)
|
||||
- The plugin uses `.codeflash/` in the user's project for session state (results.tsv, HANDOFF.md)
|
||||
- Language-agnostic methodology lives in `plugin/references/shared/`; language-specific implementations live under `plugin/languages/<lang>/references/`
|
||||
|
||||
## Contributing
|
||||
|
||||
Makefile (145 changes)
@@ -1,48 +1,107 @@
DIST := dist
LANG := python
LANGS := $(notdir $(wildcard plugin/languages/*))

.PHONY: build-plugin clean
.PHONY: build clean bootstrap \
	install lock sync \
	lint format typecheck docs-check test check tidy \
	check-version version-dev version-release

build-plugin: clean
	@echo "Assembling plugin → $(DIST)/"
##############
#   Plugin   #
##############

	# 1. Base plugin
	cp -R plugin/ $(DIST)/

	# 2. Language overlay (agents, references, skills merge into same dirs)
	cp -R languages/$(LANG)/plugin/agents/ $(DIST)/agents/
	cp -R languages/$(LANG)/plugin/references/ $(DIST)/references/
	cp -R languages/$(LANG)/plugin/skills/ $(DIST)/skills/

	# 3. Vendored codex (now inside dist as sibling)
	mkdir -p $(DIST)/vendor
	cp -R vendor/codex/ $(DIST)/vendor/codex/

	# 4. Language config
	cp languages/$(LANG)/lang.toml $(DIST)/lang.toml

	# 5. Rewrite paths — vendor is now co-located instead of ../
	# Do CLAUDE_PLUGIN_ROOT paths first (more specific), then generic ../vendor
	find $(DIST) -type f \( -name '*.json' -o -name '*.md' \) -exec \
		sed -i '' \
		's|$${CLAUDE_PLUGIN_ROOT}/../vendor/codex|$${CLAUDE_PLUGIN_ROOT}/vendor/codex|g' {} +
	find $(DIST) -type f \( -name '*.json' -o -name '*.md' \) -exec \
		sed -i '' 's|\.\./vendor/codex|./vendor/codex|g' {} +

	# 6. Rewrite language-relative paths — everything is now co-located
	find $(DIST) -type f -name '*.md' -exec \
		sed -i '' 's|languages/$(LANG)/plugin/references/|references/|g' {} +
	find $(DIST) -type f -name '*.md' -exec \
		sed -i '' 's|languages/$(LANG)/plugin/agents/|agents/|g' {} +
	find $(DIST) -type f -name '*.md' -exec \
		sed -i '' 's|languages/$(LANG)/plugin/skills/|skills/|g' {} +
	find $(DIST) -type f -name '*.md' -exec \
		sed -i '' 's|languages/$(LANG)/plugin/|./|g' {} +

	# 7. Remove .DS_Store artifacts
	find $(DIST) -name '.DS_Store' -delete

	@echo "Done. Plugin assembled in $(DIST)/"
build: clean
	@for lang in $(LANGS); do \
		echo "Assembling plugin ($$lang) → dist-$$lang/"; \
		rsync -a --exclude='languages/' plugin/ dist-$$lang/; \
		cp -R plugin/languages/$$lang/agents/ dist-$$lang/agents/; \
		cp -R plugin/languages/$$lang/references/ dist-$$lang/references/; \
		cp -R plugin/languages/$$lang/skills/ dist-$$lang/skills/; \
		find dist-$$lang -type f -name '*.md' -exec \
			sed -i '' "s|languages/$$lang/references/|references/|g" {} +; \
		find dist-$$lang -type f -name '*.md' -exec \
			sed -i '' "s|languages/$$lang/agents/|agents/|g" {} +; \
		find dist-$$lang -type f -name '*.md' -exec \
			sed -i '' "s|languages/$$lang/skills/|skills/|g" {} +; \
		find dist-$$lang -name '.DS_Store' -delete; \
		echo "Done. dist-$$lang/"; \
		echo ""; \
	done
	@echo "Built: $(foreach l,$(LANGS),dist-$l/)"

clean:
	rm -rf $(DIST)
	rm -rf $(foreach l,$(LANGS),dist-$l/)

##############
#  Scaffold  #
##############

# Scaffold optimization projects: make bootstrap ORG=roboflow PROJECTS="supervision inference"
bootstrap:
ifndef ORG
	$(error ORG is required. Usage: make bootstrap ORG=roboflow PROJECTS="supervision inference")
endif
ifndef PROJECTS
	$(error PROJECTS is required. Usage: make bootstrap ORG=roboflow PROJECTS="supervision inference")
endif
	@echo "Scaffolding projects under: .codeflash/$(ORG)/"
	@for proj in $(PROJECTS); do \
		bash scripts/scaffold.sh $(ORG) $$proj .codeflash/$(ORG)/$$proj; \
	done
	@echo ""
	@echo "Next steps:"
	@echo "  1. Fill in infra/cloud-init.yaml with project-specific setup"
	@echo "  2. Add benchmark scripts to bench/"
	@echo "  3. Edit README.md with project description and methodology"
	@echo "  4. Update status.md with current project state"

##############
#  Install   #
##############

install: ## Install all workspace packages
	uv sync --locked --all-packages

lock: ## Re-lock the workspace
	uv lock

sync: ## Sync a specific package: make sync PKG=codeflash-core
	uv sync --locked --package $(PKG)

##############
#  Quality   #
##############

lint: ## Run ruff linter on packages
	uv run ruff check packages/

format: ## Check formatting on packages
	uv run ruff format --check packages/

typecheck: ## Run mypy on package sources
	uv run mypy packages/codeflash-core/src/ packages/codeflash-python/src/

docs-check: ## Check docstring coverage
	uv run interrogate packages/codeflash-core/src/ packages/codeflash-python/src/

test: ## Run all package tests
	uv run pytest packages/ -v

check: lint format typecheck docs-check ## Run all checks (no tests)

tidy: ## Auto-fix formatting and lint issues
	uv run ruff format packages/
	uv run ruff check --fix-only --show-fixes packages/

##############
# Versioning #
##############

check-version: ## Verify version was bumped (CI)
	uv run python scripts/versioning.py check-version $(ARGS)

version-dev: ## Bump to pre-release version with changelog entry
	uv run python scripts/versioning.py version-dev $(ARGS)

version-release: ## Release version, aggregate changelogs
	uv run python scripts/versioning.py version-release $(ARGS)
	uv run python scripts/combine-changelogs.py $(ARGS)
README.md (263 changes)
@@ -1,6 +1,13 @@
# codeflash-agent

A [Claude Code plugin](https://docs.anthropic.com/en/docs/claude-code/plugins) for autonomous Python runtime performance optimization. Profiles code, implements optimizations, benchmarks before and after, and iterates until plateau.
Autonomous performance optimization platform. Profiles code, implements optimizations, benchmarks before and after, and iterates until plateau.

**What it's achieved on real projects:**

| Project | Result | Details |
|---|---|---|
| Rich | 2x Console import (79ms → 34ms) | [summary](case-studies/textualize/rich/summary.md) |
| pip | 7x `--version` (138ms → 20ms), 1.81x resolver | [summary](case-studies/pypa/pip/summary.md) |

## Domains

@@ -16,69 +23,142 @@ The agent auto-detects which domain(s) apply based on your request.

## Install

Inside Claude Code, run:

```
/plugin marketplace add codeflash-ai/codeflash-agent
/plugin install codeflash-agent@codeflash
```

### Team setup

Add to your repo's `.claude/settings.json` so everyone on the team gets it automatically:

```json
{
  "extraKnownMarketplaces": {
    "codeflash": {
      "source": {
        "source": "github",
        "repo": "codeflash-ai/codeflash-agent"
      }
    }
  },
  "enabledPlugins": {
    "codeflash-agent@codeflash": true
  }
}
```

### Local (development)
Build the plugin first, then launch Claude with it:

```bash
git clone https://github.com/codeflash-ai/codeflash-agent.git
claude --plugin-dir ./codeflash-agent
cd codeflash-agent
make build-plugin   # assembles plugin into dist/ — must run before launching
claude --dangerously-skip-permissions --effort max --plugin-dir ./dist/
```

## Usage
## Your first optimization

The agent triggers automatically when you describe a performance problem:
Just run:

```
> /codeflash-optimize start
```

If you know where the problem lies, describe it in natural language instead:

```
> Our /process endpoint takes 5s but individual calls should only take 500ms each
> test_process_large_file is using 3GB, find ways to reduce it
> process_records is too slow, it's doing O(n²) lookups
```

Or use the slash command:
Other commands:

```
> /codeflash-optimize start    # begin a new session
> /codeflash-optimize resume   # continue from where you left off
> /codeflash-optimize status   # check progress
> /codeflash-optimize scan     # quick cross-domain diagnosis (no changes)
> /codeflash-optimize status   # check progress
> /codeflash-optimize resume   # continue from where you left off
> /codeflash-optimize review   # review current changes or a PR
```

## How it works
Codeflash will profile, analyze, implement fixes one at a time, re-profile after each, and stop when gains plateau. Session state persists in `HANDOFF.md` and `results.tsv` so you can resume across conversations.

1. **Discovery** — reads project structure, detects package manager, identifies target code
2. **Baseline** — profiles the target before making any changes (mandatory)
3. **Analysis** — ranks bottlenecks by measured impact, not source-reading intuition
4. **Experiment loop** — implements fixes one at a time, re-profiles after each, keeps or discards based on measured improvement
5. **Plateau detection** — stops when gains diminish or stall
## For contributors

Session state persists in `HANDOFF.md` and `results.tsv`, so you can resume across conversations.
### Dev setup

```bash
git clone https://github.com/codeflash-ai/codeflash-agent.git
cd codeflash-agent
uv sync                      # install all packages + dev deps
prek run --all-files         # lint: ruff check, ruff format, interrogate, mypy
uv run pytest packages/ -v   # test all packages
```

### Plugin development

```bash
make build-plugin   # assemble plugin → dist/ (base + python overlay + vendor)
make clean          # remove dist/
```

The plugin is self-contained under `plugin/`:
- `plugin/` — language-agnostic agents, hooks, shared references
- `plugin/languages/python/` — Python domain agents, skills, references
- `plugin/languages/javascript/` — JavaScript domain agents, skills, references
- `make build-plugin` assembles base + language overlay into `dist/` (default: `LANG=python`)

## Optimization patterns

Distilled from 122 pip commits + 2 Rich PRs. Ordered by typical impact.

| Tier | Category | Examples | Typical impact |
|---|---|---|---|
| 1 | **Startup / Import** | Fast-path early exit, import deferral, `TYPE_CHECKING` guards, dead import removal | 2-100x for startup paths |
| 2 | **Architecture** | `@dataclass` → `__slots__`, lazy loading, speculative prefetch, conditional rebuild, caching | 10-60% on hot paths |
| 3 | **Micro** | Identity shortcuts (`is` before `==`), bypass public API internally, hoist to module level, `__slots__` on hot classes | 1.1-1.8x per call |
| 4 | **I/O** | Replace slow serializers, connection pooling, parallel I/O | 2-5x for I/O-bound ops |

**Anti-patterns to avoid:** caching with low hit rate, premature `__slots__`, over-deferring imports in one-time paths, optimizing cold paths.

Full pattern catalog with examples: [docs/codeflash-agent-dogfooding.md](docs/codeflash-agent-dogfooding.md#patterns-that-worked)
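The tier 1-3 patterns are mechanical enough to sketch. A minimal, hypothetical module showing import deferral plus a `TYPE_CHECKING` guard (the names here are illustrative, not taken from pip or Rich):

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers; never imported at runtime.
    from decimal import Decimal


def format_price(value: float | Decimal) -> str:
    """Annotations resolve for mypy, but `decimal` never loads on import."""
    return f"{value:.2f}"


def render_report(rows: list[dict]) -> str:
    # Deferred import: the heavy module loads on first call,
    # not when this module is imported.
    import json

    return json.dumps(rows, indent=2)
```

Whether deferral pays off depends on the call pattern: deferring an import that every invocation hits anyway just moves the cost, which is the "over-deferring" anti-pattern above.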
## Methodology

### Profiling toolkit

| Tool | Purpose | When to use |
|---|---|---|
| `python -X importtime` | Import cost breakdown | First step for any CLI tool |
| `hyperfine` | E2E command timing with statistics | Before/after validation |
| `cProfile` / `py-spy` | Function-level CPU profiling | Finding hot functions |
| `timeit` | Micro-benchmarks for specific functions | Validating micro-opts |
| `memray` / `tracemalloc` | Memory profiling | Allocation-heavy paths |
| `objgraph` | Object count tracking | Finding redundant allocations |

### Workflow

```
1. Profile → identify top-N bottlenecks
2. For each bottleneck:
   a. Read the actual code (don't guess from profiler shapes)
   b. Implement the smallest change that addresses it
   c. Micro-benchmark before/after
   d. Run full test suite
   e. E2E benchmark
3. Commit with clear perf: prefix and numbers
4. Repeat until plateau
```
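Step 2c can be as small as a `timeit` pair. The candidate change here (hoisting a builtin lookup to a local) is only an illustration of the shape:

```python
import timeit

data = list(range(1000))


def before() -> int:
    total = 0
    for x in data:
        total += abs(x)  # global lookup of `abs` on every iteration
    return total


def after() -> int:
    _abs = abs  # hoisted once to a fast local
    total = 0
    for x in data:
        total += _abs(x)
    return total


# Behavior must be identical before timing anything.
assert before() == after()

# min() of several repeats is the standard way to reduce scheduler noise.
t_before = min(timeit.repeat(before, number=500, repeat=5))
t_after = min(timeit.repeat(after, number=500, repeat=5))
print(f"before: {t_before:.4f}s  after: {t_after:.4f}s")
```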
### Environment requirements

- Non-burstable VM (e.g., Azure Standard_D2s_v5) for consistent CPU
- Multiple Python versions (3.12, 3.13 minimum — behavior differs)
- `hyperfine --warmup 5 --min-runs 30` for statistical rigor
- All tests passing before AND after every change

Full methodology details: [docs/codeflash-agent-dogfooding.md](docs/codeflash-agent-dogfooding.md#methodology)

## Workspace convention

Each target organization gets its own `<org>_org/` directory containing all repos for that org:

```
~/Desktop/work/
├── cf_org/                     # Codeflash
│   ├── codeflash-agent/        # this monorepo
│   ├── codeflash/              # core engine
│   ├── codeflash-internal/     # backend service
│   └── ...
├── unstructured_org/           # Unstructured.io
│   ├── unstructured/           # open source library
│   ├── core-product/           # main product
│   ├── unstructured-inference/ # ML inference
│   └── ...
├── microsoft_org/              # Microsoft
│   └── typeagent/              # typeagent-py (Structured RAG)
├── roboflow_org/               # Roboflow
│   └── supervision/
└── <org>_org/                  # new target org
    └── <repo>/
```

When starting work on a new org: create `<org>_org/`, clone all relevant repos under it, and keep non-repo files out of the org directory.

## Repo structure

@@ -92,24 +172,87 @@ packages/
services/
  github-app/              # GitHub App integration (FastAPI)

plugin/                    # Claude Code plugin (language-agnostic)
  .claude-plugin/          # plugin manifest & marketplace config
  agents/                  # review & research agents
  commands/                # codex CLI integration commands
  hooks/                   # session lifecycle & review gate hooks
  references/shared/       # shared methodology & benchmarking guides
plugin/                    # Claude Code plugin (self-contained, multi-language)
  languages/python/        # Python domain agents, skills, references
  languages/javascript/    # JavaScript domain agents, skills, references

languages/python/plugin/   # Python-specific plugin content
  agents/                  # router, domain agents (cpu, memory, async, structure),
                           # deep, setup, scan, ci, pr-prep
  references/              # domain-specific guides (async, memory, structure,
                           # data-structures, library replacement)
  skills/                  # /codeflash-optimize, memray profiling

vendor/
  codex/                   # OpenAI Codex runtime (vendored)
.codeflash/                # active optimization data (org-grouped)
  textualize/rich/         # 2x Rich import speedup
  pypa/pip/                # 7x pip --version, 1.81x resolver
  microsoft/typeagent/     # Structured RAG optimization
  <org>/<project>/         # new optimization targets

case-studies/              # summaries built from .codeflash/
scripts/                   # scaffold scripts
docs/                      # internal guides
evals/                     # eval templates & real-repo scenarios
dist/                      # assembled plugin (generated by make build-plugin)
```

## Adding an optimization target

When you optimize a new project, scaffold it in `.codeflash/` and build summaries into `case-studies/`.

### 1. Set up local workspace

Each org gets a `<org>_org/` directory under `work/`. Clone from your fork, add the upstream remote:

```bash
mkdir -p ~/Desktop/work/<org>_org
git clone https://github.com/KRRT7/<repo>.git ~/Desktop/work/<org>_org/<project>
cd ~/Desktop/work/<org>_org/<project>
git remote add upstream https://github.com/<org>/<repo>.git
```

### 2. Scaffold the project

```bash
# Single project:
make bootstrap ORG=roboflow PROJECTS=supervision

# Multiple projects under one org:
make bootstrap ORG=unstructured PROJECTS="unstructured unstructured-inference core-product"
```

This creates:

```
.codeflash/<org>/<project>/
├── README.md           # results, what changed, methodology (from template)
├── bench/              # add your benchmark scripts here
├── data/               # save raw benchmark data here
└── infra/
    ├── cloud-init.yaml # VM provisioning (fill in remaining placeholders)
    └── vm-manage.sh    # VM lifecycle: create, start, stop, ssh, bench, destroy
```

### 3. Fill in the placeholders

The scaffold substitutes `<PROJECT>` automatically. You still need to fill in:

| Placeholder | Where | What to fill in |
|---|---|---|
| `<REPO_URL>` | `infra/cloud-init.yaml` | Your fork's clone URL |
| `<SETUP_COMMANDS>` | `infra/cloud-init.yaml` | Toolchain install + build (language-specific) |
| `<BENCH_COMMAND>` | `infra/cloud-init.yaml` | The command to benchmark |
| `<VERIFY_COMMAND>` | `infra/cloud-init.yaml` | Smoke test after setup |

The cloud-init template includes examples for Python, Rust, Go, Node.js, and Java.

### VM lifecycle

Each project gets a `vm-manage.sh` for the benchmark VM:

```bash
cd .codeflash/<org>/<project>
bash infra/vm-manage.sh create       # provision VM with cloud-init
bash infra/vm-manage.sh bench main   # run benchmarks on a branch
bash infra/vm-manage.sh ssh          # SSH into VM
bash infra/vm-manage.sh stop         # deallocate (stops billing)
bash infra/vm-manage.sh destroy      # delete everything
```

### Examples

Use the existing projects as templates:
- [Rich](.codeflash/textualize/rich/) — focused scope, 2 PRs, import + runtime micro-opts
- [pip](.codeflash/pypa/pip/) — large scope, 122 commits across 8 categories
case-studies/pypa/pip/summary.md (84 lines, new file)
@@ -0,0 +1,84 @@
# pip Optimization — Lessons Learned

Full case study: [pip_org](https://github.com/KRRT7/pip_org)

## Context

pip is the default Python package installer. 122 optimization commits across startup, dependency resolution, packaging, import deferral, and vendored Rich. Benchmarked on Python 3.15.0a7, macOS arm64.

## What we did (by impact)

### Startup (7x `--version`)

The single biggest visible win. `pip --version` went from 138ms to 20ms by:
1. Adding an ultra-fast path in `__main__.py` that reads the version and exits before importing `pip._internal`
2. Deferring the `base_command.py` import chain to command creation time
3. Deferring autocompletion imports behind a `PIP_AUTO_COMPLETE` check

**Key insight**: For simple commands like `--version`, the user shouldn't pay the cost of importing the entire tool.
### Resolver architecture (1.81x for complex resolves)

1. **Speculative metadata prefetch**: While the resolver processes package A, a background thread downloads PEP 658 metadata for the most likely next candidate. This overlaps I/O with CPU.

2. **Conditional Criterion rebuild**: `_remove_information_from_criteria` was rebuilding all criteria on every backtrack step — 95% of the time nothing changed. Added a check to skip unchanged criteria.

3. **`__slots__` on Criterion**: Created per-package, per-resolution-step. With `__slots__`, roughly 100 bytes saved per instance × thousands of instances adds up.

4. **Two-level candidate cache**: Specifier merge results + candidate infos cached across backtracking steps. The resolver re-evaluates the same packages many times during backtracking.
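The `__slots__` point is easy to verify in isolation. This standalone comparison uses a toy class, not pip's actual `Criterion`:

```python
import sys


class CriterionDict:
    def __init__(self, candidates, information, incompatibilities):
        self.candidates = candidates
        self.information = information
        self.incompatibilities = incompatibilities


class CriterionSlots:
    __slots__ = ("candidates", "information", "incompatibilities")

    def __init__(self, candidates, information, incompatibilities):
        self.candidates = candidates
        self.information = information
        self.incompatibilities = incompatibilities


a = CriterionDict([], [], [])
b = CriterionSlots([], [], [])

# The dict-backed instance pays for the object plus its __dict__;
# the slotted one stores attributes inline and has no __dict__ at all.
dict_cost = sys.getsizeof(a) + sys.getsizeof(a.__dict__)
slot_cost = sys.getsizeof(b)
print(f"dict-backed: {dict_cost} bytes, slotted: {slot_cost} bytes")
```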
### Packaging layer (1.82x for `install -r`)
|
||||
|
||||
The vendored `packaging` library is called thousands of times during resolution:
|
||||
- `Version.__hash__` cached in slot (42K → 21K calls)
|
||||
- `Specifier.__str__` and `__hash__` cached
|
||||
- `_tokenizer` dataclass → `__slots__` class
|
||||
- Integer comparison key for Version (avoids full `_key` tuple construction)
|
||||
- Bisect-based `filter_versions` for O(log n + k) batch filtering
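The bisect-based filtering in the last bullet reduces to two binary searches plus a slice once the candidate versions are kept sorted. A hedged sketch (simplified to plain comparable keys; the real change operates on parsed `Version` objects and specifier sets):

```python
# O(log n + k) batch filtering of a sorted version list against a
# half-open [lower, upper) range, instead of testing every element.
from bisect import bisect_left


def filter_versions(sorted_versions, lower, upper):
    """Return versions v with lower <= v < upper."""
    lo = bisect_left(sorted_versions, lower)
    hi = bisect_left(sorted_versions, upper)
    return sorted_versions[lo:hi]


print(filter_versions([1, 2, 3, 5, 8, 13], 2, 8))  # → [2, 3, 5]
```

Specifiers like `>=2,<8` map directly onto such ranges; operators like `!=` need a second pass.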

### Import deferral (vendored Rich)

Same patterns as the Rich case study, but applied to pip's vendored copy:

- Deferred all Rich imports to first use
- Stripped unused Rich modules from the import chain
- Deferred heavy imports in `console.py`, `progress_bars.py`, `self_outdated_check.py`

### I/O

- Replaced pure-Python msgpack with stdlib JSON for HTTP cache serialization
- Increased connection pool size for parallel index fetches

## Results

| Benchmark | Before | After | Speedup |
|---|---|---|---|
| `pip --version` | 138ms | 20ms | **7.0x** |
| `flask+django+boto3+requests` resolve | 1,493ms | 826ms | **1.81x** |
| `install -r requirements.txt` (21 pkgs) | 1,344ms | 740ms | **1.82x** |
| `pip list` | 162ms | 146ms | **1.11x** |
| All benchmarks (sum) | 18,717ms | 15,223ms | **1.23x** |

## Bugs found along the way

Optimization work surfaced real bugs:

1. **`--report -` outputs invalid JSON** ([pypa/pip#13898](https://github.com/pypa/pip/issues/13898)) — Rich was mixing log output into stdout JSON
2. **Test failure on Python 3.15** ([pypa/pip#13901](https://github.com/pypa/pip/issues/13901)) — `importlib.metadata` behavior change
3. **`_stderr_console` typo in logging.py** — global never actually set (pre-existing, not fixed to keep diff focused)

**Key insight**: Deep performance work forces you to understand code paths that normal development doesn't touch. Bugs fall out naturally.

## Key takeaways

1. **Profile first, always**: The resolver was the bottleneck for real workloads, not startup — but startup was the most *visible* improvement to users
2. **Allocation counting reveals hidden work**: `Tag.__init__` called 45,301 times → 1,559 with caching (97% reduction). You can't see this in wall-clock profiling alone
3. **Caching needs the right granularity**: Per-resolution-step caches worked; global caches didn't (different resolution contexts)
4. **Vendored code is fair game**: pip's vendored `packaging` had the most micro-optimization opportunities because it's called thousands of times in tight loops
5. **Test suite is your safety net**: 1,690 unit tests + 15 functional tests caught every regression. Never skip this step

## Applicable to codeflash

- **Startup fast-path**: Does `codeflash --version` import the entire optimization engine? It shouldn't
- **Test generation loop**: If codeflash generates/runs many test variants, the same caching patterns apply (version parsing, specifier matching, etc.)
- **AST parsing**: If parsing the same files repeatedly, cache the AST
- **Benchmark harness**: subprocess overhead for running benchmarks is a known bottleneck — could the harness be more efficient?
- **Vendored/installed deps**: Which heavy deps does codeflash import at startup? Profile and defer
case-studies/textualize/rich/summary.md
Normal file
@@ -0,0 +1,65 @@
# Rich Optimization — Lessons Learned

Full case study: [rich_org](https://github.com/KRRT7/rich_org)

## Context

pip vendors Rich for progress bars, logging, and error display. `from rich.console import Console` took 79ms on CPython 3.12 — a significant chunk of pip's startup.

## What we did

### Import deferral (PR #12 + #13)

Deferred 15+ imports across Rich's codebase. The pattern:

```python
# Before (module level)
import re

RE_COLOR = re.compile(r"...")


# After (lazy)
_RE_COLOR = None


def parse(color):
    global _RE_COLOR
    if _RE_COLOR is None:
        import re

        _RE_COLOR = re.compile(r"...")
    return _RE_COLOR.match(color)
```

**Key insight**: Most regex patterns in Rich are behind LRU-cached methods, so the lazy compile cost is paid once and amortized.

### Architectural changes (PR #12)

1. **`@dataclass` → `__slots__`**: `ConsoleOptions` and `ConsoleThreadLocals` used `@dataclass`, pulling in `inspect` (~10ms). Replaced with plain classes + `__slots__`. Memory: 344 → 136 bytes per instance.

2. **Lazy emoji dict**: `_emoji_codes.EMOJI` (3,608 entries) loaded unconditionally. Deferred to first use via module-level `__getattr__`.
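The module-level `__getattr__` mechanism (PEP 562) behind the lazy emoji dict can be sketched like this. To stay self-contained the demo builds a throwaway module object instead of a real `.py` file, and the one-entry dict stands in for Rich's 3,608-entry table:

```python
# PEP 562 demo: a module attribute built only on first access, then cached.
import sys
import types

mod = types.ModuleType("lazy_emoji")


def _module_getattr(name):
    # Runs only when normal attribute lookup on the module fails.
    if name == "EMOJI":
        emoji = {"thumbs_up": "\U0001F44D"}  # stand-in for the big table
        mod.EMOJI = emoji  # cache: later lookups never reach __getattr__
        return emoji
    raise AttributeError(name)


mod.__getattr__ = _module_getattr
sys.modules["lazy_emoji"] = mod

import lazy_emoji

print(lazy_emoji.EMOJI["thumbs_up"])  # built on first access
```

In a real module, `__getattr__` is simply defined at the top level of the `.py` file; the self-caching assignment keeps the amortized cost at one dict build per process.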

### Runtime micro-optimizations (PR #13)

1. `Style.__eq__` identity shortcut: `is` before hash comparison (1.84x for identity case)
2. `Style.combine/chain`: direct `_add` (LRU-cached) instead of `sum()` → `__add__` (1.34x)
3. `Segment.simplify`: `is` before `==` for style comparison (1.36x)
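The identity shortcut in items 1 and 3 is a one-line guard. A minimal sketch with a stand-in class (not Rich's actual `Style`, whose equality goes through cached hashes):

```python
# `is` before the expensive structural comparison: interned/cached objects
# are frequently compared against themselves, so the identity hit is common.
class Style:
    def __init__(self, attrs):
        self.attrs = attrs

    def __eq__(self, other):
        if self is other:  # identity hit: skip the expensive path entirely
            return True
        if not isinstance(other, Style):
            return NotImplemented
        return self.attrs == other.attrs


s = Style({"bold": True})
assert s == s                      # fast identity path
assert s == Style({"bold": True})  # slow structural path
```

The trick only pays off when the same object genuinely recurs — which caching and interning make true for styles and segments.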

## Results

| Import | Before | After | Speedup |
|---|---|---|---|
| Console (3.12) | 79.1ms | 37.5ms | **2.11x** |
| Console (3.13) | 67.9ms | 33.6ms | **2.02x** |
| RichHandler (3.12) | 100.3ms | 39.6ms | **2.53x** |

## Key takeaways

1. **Python version matters**: `typing` imports `re` on 3.12 but not 3.13 — this made our `re` deferral a no-op on 3.12
2. **`from __future__ import annotations`** is the unlock for `TYPE_CHECKING` moves — without it, annotation-only names that share import lines with runtime names can't be separated
3. **Benchmark on controlled hardware**: Laptop results were noisy; an Azure non-burstable VM gave consistent ±0.5ms stddev
4. **Maintainer engagement matters**: A direct Discord DM to Will McGugan got "Seems like a clear win. Feel free to open a PR" within 30 minutes
5. **Stack PRs, not scatter**: Started with 11 individual PRs, consolidated to 2 stacked PRs — much cleaner to review

## Applicable to codeflash

- Any Rich imports in codeflash's output/display layer are candidates for the same deferral
- If codeflash vendors or depends on Rich, the upstream improvements benefit it automatically
- The `@dataclass` → `__slots__` pattern applies to any hot dataclass in codeflash
- The identity shortcut pattern (`is` before `==`) applies to any cached/interned objects
docs/codeflash-agent-dogfooding.md
Normal file
@@ -0,0 +1,213 @@
# Codeflash Self-Optimization

Dogfooding codeflash on itself — using the same methodology that produced 2x Rich imports and 1.8x pip resolver to optimize codeflash's own performance.

## The Stack

All codeflash repos under one roof for vertical optimization. A user-facing operation (e.g., `codeflash optimize foo.py`) touches every layer — optimizing one layer in isolation misses cross-boundary costs.

```
┌─────────────────────────────────────────────────────────────┐
│ User: codeflash optimize foo.py                             │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│ codeflash (core engine)                                     │
│ • CLI entry point                                           │
│ • Test generation, AST analysis, benchmark harness          │
│ • Optimization loop: profile → generate → validate          │
│ repos/codeflash/                                            │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│ codeflash-internal (backend service)                        │
│ • LLM orchestration, prompt management                      │
│ • Optimization result storage                               │
│ repos/codeflash-internal/                                   │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│ docflash (CI pipeline)                                      │
│ • Dockerized optimization runs                              │
│ • Bug detection + auto-fix pipeline                         │
│ repos/docflash/                                             │
└─────────────────────────────────────────────────────────────┘
```

### Setup

```bash
# Clone all repos into repos/
mkdir -p repos
git clone git@github.com:codeflash-ai/codeflash.git repos/codeflash
git clone git@github.com:codeflash-ai/codeflash-internal.git repos/codeflash-internal
git clone git@github.com:codeflash-ai/docflash.git repos/docflash
```

### Cross-boundary optimization targets

| Boundary | What to look for |
|---|---|
| **codeflash CLI → internal service** | HTTP round-trip latency, payload size, connection reuse, retry overhead |
| **codeflash CLI → user's code** | AST parsing cost, test generation I/O, benchmark harness subprocess overhead |
| **docflash → codeflash CLI** | Docker startup, volume mount overhead, cold-start import time |

## Prior Art

| Project | Key Result | Approach | Case Study |
|---|---|---|---|
| **Rich** | 2.35x Console import (79ms → 34ms) | Import deferral, `re` elimination, runtime micro-opts | [rich_org](https://github.com/KRRT7/rich_org) |
| **pip** | 7x `--version`, 1.81x resolver | 122 commits: startup, resolver, packaging, import deferral | [pip_org](https://github.com/KRRT7/pip_org) |

## Patterns That Worked

Distilled from 122 pip commits + 2 Rich PRs. These are the repeatable optimization categories, ordered by typical impact.

### Tier 1: Startup / Import Time (highest user-visible impact)

| Pattern | Example | Typical Savings |
|---|---|---|
| **Fast-path early exit** | `pip --version` bypasses entire `pip._internal` import | 5-100x for that codepath |
| **Import deferral** | Move `import X` from module level into the function that uses it | 2-20ms per deferred module |
| **`TYPE_CHECKING` guard** | Move annotation-only imports behind `if TYPE_CHECKING:` | 1-5ms per module |
| **`from __future__ import annotations`** | Enables string annotations so type aliases can move to `TYPE_CHECKING` | Unlocks further deferrals |
| **Kill dead imports** | Remove imports that aren't used at runtime | 1-10ms each |
| **Avoid transitive chains** | `dataclasses` → `inspect` (~10ms); `typing.Match` → `re` (~3ms) | Chain-dependent |
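The `TYPE_CHECKING` guard and future-annotations rows combine into one pattern. A minimal sketch (module and function names are illustrative):

```python
# With string annotations enabled, imports needed only for type hints
# never execute at runtime — type checkers still see them.
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Paid only by mypy/pyright, never at import time.
    from decimal import Decimal


def total(prices: list[Decimal]) -> Decimal:
    # Annotations are plain strings here, so Decimal need not be imported.
    return sum(prices)


print(total([1.5, 2.5]))  # → 4.0
```

Without the `__future__` import, the annotations would be evaluated eagerly and the guard would raise `NameError` at definition time.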

### Tier 2: Architecture (highest absolute time savings)

| Pattern | Example | Typical Savings |
|---|---|---|
| **Replace `@dataclass` with `__slots__`** | ConsoleOptions: 344 → 136 bytes, eliminates `inspect` import | 10ms import + 60% memory |
| **Lazy loading large data** | Rich emoji dict (3,608 entries) deferred to first use | 2-5ms |
| **Speculative prefetch** | Background thread downloads metadata while resolver works | 10-30% on I/O-bound paths |
| **Conditional rebuild** | Skip rebuilding Criterion when nothing changed (95% of cases) | 20-40% on hot loop |
| **Cache at the right level** | `lru_cache` on `Style._add`, `parse_wheel_filename`, tag generation | Varies widely |
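"Cache at the right level" usually means memoizing the small pure helper rather than the whole operation, so every call site benefits. A hedged sketch with an illustrative parse function (not codeflash's or pip's actual code):

```python
# Memoize the cheap-to-key, expensive-to-compute helper with lru_cache.
from functools import lru_cache


@lru_cache(maxsize=None)
def parse_version(text):
    # Imagine a much more expensive parse; the tuple result is hashable
    # and cheap to compare, which also speeds up downstream sorting.
    return tuple(int(part) for part in text.split("."))


parse_version("1.2.3")
parse_version("1.2.3")          # served from the cache
info = parse_version.cache_info()
print(info.hits, info.misses)   # → 1 1
```

`cache_info()` gives the hit rate for free — the anti-patterns section below is exactly about checking it before keeping a cache.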

### Tier 3: Micro-optimizations (small per-call, adds up in hot loops)

| Pattern | Example | Typical Savings |
|---|---|---|
| **Identity shortcut (`is` before `==`)** | `Style.__eq__`, `Segment.simplify` | 1.3-1.8x for identity case |
| **Bypass public API internally** | `Style._add` (cached) vs `__add__` (copies linked styles) | 1.1-1.3x |
| **Hoist to module level** | `operator.attrgetter`, `methodcaller` as module constants | ns per call |
| **`__slots__` on hot classes** | Criterion, ConsoleOptions, tokenizer state | 40-60% memory |
| **Pre-compute in `__init__`** | `Link._is_wheel`, `Version._str_cache` | Eliminates repeated work |
| **Direct construction** | `__new__` + slot assignment bypassing `__init__` | 20-40% for allocation-heavy paths |

### Tier 4: I/O and Serialization

| Pattern | Example | Typical Savings |
|---|---|---|
| **Replace slow serializer** | msgpack (pure Python) → stdlib JSON (C) | 2-5x for cache ops |
| **Connection pooling** | Increase HTTP pool size for parallel index fetches | Latency-dependent |
| **Parallel I/O** | SharedThreadPoolExecutor for wheel downloads | Throughput-dependent |
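The serializer swap in Tier 4 is easy to validate in isolation before touching any cache code. A rough sketch (the payload shape is made up; pip's actual change replaced its vendored pure-Python msgpack with stdlib `json`, whose encoder/decoder are C-accelerated):

```python
# Measure a round-trip on a representative cache payload before committing
# to a serializer swap.
import json
import timeit

payload = {"urls": [f"https://example.invalid/pkg-{i}.whl" for i in range(200)]}

t = timeit.timeit(lambda: json.loads(json.dumps(payload)), number=200)
print(f"json round-trip x200: {t:.3f}s")
```

Swapping the other candidate serializer into the same harness gives a like-for-like comparison on your real payloads rather than synthetic ones.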

## Anti-patterns (things that didn't work or weren't worth it)

- **Caching with low hit rate** — Caches that get evicted before reuse add overhead
- **Premature `__slots__`** — Only worth it on classes with many instances or in hot loops
- **Over-deferring** — Deferring imports in functions called once on startup just moves the cost
- **Regex elimination** — On Python 3.12, `typing` imports `re` anyway, so deferring `re` is a no-op there
- **Optimizing cold paths** — Error handling, setup/teardown, one-time init — not worth the complexity

## Methodology

### Profiling toolkit

| Tool | Purpose | When to use |
|---|---|---|
| `python -X importtime` | Import cost breakdown | First step for any CLI tool |
| `hyperfine` | E2E command timing with statistics | Before/after validation |
| `cProfile` / `py-spy` | Function-level CPU profiling | Finding hot functions |
| `timeit` | Micro-benchmarks for specific functions | Validating micro-opts |
| `memray` / `tracemalloc` | Memory profiling | Allocation-heavy paths |
| `objgraph` | Object count tracking | Finding redundant allocations |
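A typical first pass with the top row of the toolkit looks like this (`import json` is a stand-in target; substitute `import codeflash`). `-X importtime` writes `|`-separated self/cumulative costs in microseconds to stderr:

```shell
# Capture the import-time log for a target module.
python3 -X importtime -c "import json" 2> importtime.log

# Rough top-10 by cumulative cost (second |-separated column).
sort -t'|' -k2,2rn importtime.log | head -n 10
```

The sorted tail usually points straight at the two or three modules worth deferring.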

### Environment

- Azure Standard_D2s_v5 (non-burstable, consistent CPU)
- Multiple Python versions (3.12, 3.13 minimum — behavior differs)
- hyperfine with `--warmup 5 --min-runs 30` for statistical rigor
- All tests passing before AND after every change

### Workflow

```
1. Profile → identify top-N bottlenecks
2. For each bottleneck:
   a. Read the actual code (don't guess from profiler shapes)
   b. Implement the smallest change that addresses it
   c. Micro-benchmark before/after
   d. Run full test suite
   e. E2E benchmark
3. Commit with clear perf: prefix and numbers
4. Repeat until plateau
```

## Codeflash Optimization Plan

### Phase 1: Profile each layer

**codeflash (core engine)**

- [ ] `python -X importtime -c "import codeflash"` — import chain analysis
- [ ] `codeflash --version` startup time baseline
- [ ] Profile a real optimization run end-to-end (py-spy flamegraph)
- [ ] Memory profile on a large codebase target
- [ ] Trace test generation loop: AST parse, codegen, subprocess, validation

**codeflash-internal (backend service)**

- [ ] Profile LLM call latency vs overhead (serialization, prompt assembly, result parsing)
- [ ] Check connection reuse and retry patterns
- [ ] Measure cold-start time

**Cross-boundary**

- [ ] E2E trace: user command → agent → CLI → service → result (where does time go?)
- [ ] Measure serialization costs at each boundary
- [ ] Identify redundant round-trips

### Phase 2: Identify targets

- [ ] Rank imports by cost per layer, identify deferrable ones
- [ ] Find hot functions in the optimization loop
- [ ] Check for heavy dependencies that could be deferred or replaced
- [ ] Map cross-boundary overhead (serialization, subprocess, HTTP)
- [ ] Look for the patterns from Tiers 1-4 above

### Phase 3: Implement

- [ ] Apply Tier 1 (startup/import) optimizations first — highest visibility
- [ ] Then Tier 2 (architecture) — highest absolute savings
- [ ] Then Tier 3 (micro) and Tier 4 (I/O) as needed
- [ ] Cross-boundary optimizations last (require changes in multiple repos)
- [ ] Each change: micro-bench → test suite → E2E bench → commit

### Phase 4: Document

- [ ] Before/after benchmark tables per layer
- [ ] E2E before/after for user-facing operations
- [ ] Per-optimization breakdown
- [ ] Flamegraphs showing the shift
- [ ] Case study narrative: "codeflash optimized itself"

## Repo Structure

```
.
├── README.md               # This file — framework and playbook
├── repos/                  # The vertical stack (git-ignored, clone locally)
│   ├── codeflash/          # Core engine (codeflash-ai/codeflash)
│   ├── codeflash-internal/ # Backend service (codeflash-ai/codeflash-internal)
│   └── docflash/           # CI pipeline (codeflash-ai/docflash)
├── prior-art/
│   ├── rich-summary.md     # What we learned from Rich
│   └── pip-summary.md      # What we learned from pip
├── infra/
│   ├── README.md           # Infrastructure design and architecture
│   ├── cloud-init.yaml     # VM provisioning (one-shot)
│   └── vm-manage.sh        # VM lifecycle management script
├── profiles/               # Profiling output (importtime, flamegraphs)
│   ├── codeflash/          # Core engine profiles
│   ├── codeflash-internal/ # Service profiles
│   └── cross-boundary/     # E2E traces spanning layers
├── bench/                  # Benchmark scripts (copied to VM by cloud-init)
├── data/                   # Raw benchmark results
└── results/                # Before/after analysis
```
docs/hypothesis.md
Normal file
@@ -0,0 +1,40 @@
# Hypothesis: Outdated Dependencies Cause Performance Regressions

## Claim

Outdated dependencies accumulate performance regressions over time through:

- Missing tree-shaking improvements in newer versions
- Duplicated polyfills for features now native to the runtime
- Unoptimized codepaths that newer releases have rewritten
- Missed bundle-size reductions from internal refactors
- Transitive dependency bloat from old sub-dependencies

## Testing approach

Upgrade dependencies in order of likely performance impact on the cf-webapp Next.js dashboard (app.codeflash.ai). Build after each batch. Measure bundle size and build time before/after.

## Experiment: cf-webapp (2026-04-10)

### Baseline

- 46 outdated packages identified via `npm outdated`
- 16 major version bumps, ~30 semver-compatible patches

### Round 1 — Semver-compatible patches (~30 packages)

React 19.2.5, Sentry 10.48.0, Radix UI patches, PostCSS 8.5.9, auth0 4.17.0, etc.

- **Result**: Build passes, 0 vulnerabilities

### Round 2 — Major version upgrades (performance-impactful)

- `posthog-js` 1.127 → 1.367 (analytics SDK, loads every page)
- `lucide-react` 0.563 → 1.8 (icon library, v1 tree-shaking rewrite; required `Github` → `GitFork` rename — brand icons removed)
- `tailwind-merge` 2.6 → 3.5 (used in every `cn()` call, v3 smaller/faster runtime)
- `marked` 16.4 → 18.0 (markdown parser)
- `react-markdown` 9.1 → 10.1 (required removing `className` prop — dropped in v10)
- `prettier` 3.2 → 3.8, `lint-staged` 15 → 16, `posthog-node` 4 → 5
- **Result**: Build passes after migration fixes

### Deferred (high migration cost)

- tailwindcss 3 → 4 (complete CSS framework rewrite)
- prisma 6 → 7 (database client API changes)
- zod 3 → 4 (validation API changes)
- typescript 5 → 6 (type system changes)

### Measurements

TODO: Run `ANALYZE=true npm run build` before/after to capture concrete bundle size deltas.
docs/infra_readme.md
Normal file
@@ -0,0 +1,173 @@
# Infrastructure Design

Benchmarking and CI infrastructure for codeflash self-optimization.

## Architecture

```
┌─────────────────────────────────────────────────────────┐
│ Developer Machine                                       │
│ • Implement optimization                                │
│ • Push to branch                                        │
│ • Trigger benchmark run                                 │
└────────────────────────┬────────────────────────────────┘
                         │ git push
                         ▼
┌─────────────────────────────────────────────────────────┐
│ GitHub Actions CI                                       │
│ • Lint + type check                                     │
│ • Unit tests (fast, every push)                         │
│ • Trigger benchmark VM if perf/ branch                  │
└────────────────────────┬────────────────────────────────┘
                         │ webhook / SSH
                         ▼
┌─────────────────────────────────────────────────────────┐
│ Azure Benchmark VM (cf-bench)                           │
│ Standard_D4s_v5 (4 vCPU, 16 GB, non-burstable)          │
│                                                         │
│ • Checkout branch                                       │
│ • Install codeflash in editable mode                    │
│ • Run benchmark suite                                   │
│ • Compare against baseline (main)                       │
│ • Post results as PR comment                            │
└─────────────────────────────────────────────────────────┘
```

## Benchmark VM

### Why a dedicated VM?

| Concern | Laptop | GitHub Actions runner | Dedicated VM |
|---|---|---|---|
| CPU consistency | Poor (thermal, background) | Poor (shared, noisy neighbor) | **Good** (non-burstable) |
| Reproducibility | Low | Medium | **High** |
| Cost | Free | Free (but noisy) | ~$0.10/hr (on-demand) |
| Python versions | Whatever's installed | Configurable | **Full control** |

### VM Spec

| Setting | Value | Rationale |
|---|---|---|
| Size | `Standard_D4s_v5` | 4 vCPU, 16 GB RAM — enough to run codeflash on itself without swapping |
| OS | Ubuntu 24.04 LTS | Matches CI, stable |
| Region | `westus2` | Low latency, proven reliable |
| Disk | 64 GB Premium SSD | Fast I/O for git, pip cache |
| Scheduling | On-demand (start/stop) | Only runs during benchmark jobs, ~$0.10/hr |

### Provisioning

Cloud-init installs:

1. System packages (git, build-essential, curl, jq)
2. uv (fast Python/venv management)
3. Python 3.12, 3.13, 3.14 via uv
4. hyperfine v1.19+
5. memray (memory profiling)
6. py-spy (CPU sampling profiler)
7. Codeflash clone + editable install
8. Benchmark scripts

### Start/Stop

```bash
# Start VM (before benchmark run)
az vm start --resource-group CF-BENCH-RG --name cf-bench

# Run benchmarks
ssh azureuser@<ip> "bash ~/bench/bench_all.sh <branch>"

# Stop VM (after benchmark run — stops billing)
az vm deallocate --resource-group CF-BENCH-RG --name cf-bench
```

## Benchmark Suite Design

### Layers

```
Layer 1: Startup (import time, CLI response time)
  └── python -X importtime
  └── hyperfine: codeflash --version, codeflash --help

Layer 2: Unit operations (micro-benchmarks)
  └── timeit: AST parsing, test generation, result analysis
  └── Function-level profiling of hot paths

Layer 3: Integration (real optimization runs)
  └── codeflash optimize <target> on a fixture codebase
  └── Wall clock, memory peak, output quality

Layer 4: Memory
  └── memray: peak RSS during optimization run
  └── tracemalloc: allocation hotspots
```

### Benchmark scripts

| Script | Layer | What it measures |
|---|---|---|
| `bench_startup.sh` | 1 | CLI startup time (--version, --help, import) |
| `bench_importtime.py` | 1 | Per-module import cost breakdown |
| `bench_micro.py` | 2 | Hot function micro-benchmarks (timeit) |
| `bench_optimize.sh` | 3 | Full optimization run on fixture codebase |
| `bench_memory.sh` | 4 | Peak memory during optimization run |
| `bench_all.sh` | * | Run all benchmarks, save results |
| `bench_compare.sh` | * | A/B comparison between two branches |

### Fixture codebase

A small but representative Python project for integration benchmarks:

- ~500 lines across 5-10 files
- Mix of pure functions, classes, I/O-bound code
- Known optimization opportunities (so we can measure "did codeflash find them?")
- Checked into `fixtures/` directory

### Result format

Each benchmark run produces a directory:

```
results/<branch>-<timestamp>/
├── startup.json     # hyperfine JSON export
├── importtime.tsv   # Per-module import breakdown
├── micro.json       # Micro-benchmark results
├── optimize.json    # Integration benchmark (wall clock, memory, findings)
├── memory.json      # Peak RSS and allocation data
└── summary.md       # Human-readable summary
```
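Post-processing `startup.json` into a summary is a few lines of stdlib code. The field names below (`results[].command`, `results[].mean`) follow hyperfine's `--export-json` format — verify against your hyperfine version; the sample data here is made up:

```python
# Compute a speedup from a hyperfine-style JSON export (inline sample data).
import json

raw = (
    '{"results": ['
    '{"command": "codeflash --version (main)", "mean": 0.45}, '
    '{"command": "codeflash --version (branch)", "mean": 0.12}]}'
)
results = json.loads(raw)["results"]
base, opt = (r["mean"] for r in results)
print(f"speedup: {base / opt:.2f}x")  # → speedup: 3.75x
```

`bench_compare.sh` can emit exactly this kind of ratio into the PR comment table shown below.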

## CI Integration

### On every push to a `perf/*` branch:

1. Run unit tests (GitHub Actions, fast)
2. Start benchmark VM
3. Run `bench_compare.sh main <branch>` on VM
4. Post results as PR comment via `gh`
5. Deallocate VM

### PR comment format:

```markdown
## Benchmark Results: `perf/optimize-startup` vs `main`

| Metric | main | branch | Delta |
|---|---|---|---|
| `codeflash --version` | 450ms | 120ms | **-73% (3.75x)** |
| `import codeflash` | 380ms | 310ms | **-18%** |
| Full optimize run | 12.3s | 11.8s | **-4%** |
| Peak memory | 245 MB | 230 MB | **-6%** |

<details><summary>Per-module import breakdown</summary>
...
</details>
```

## Cost

| Component | Cost | Frequency |
|---|---|---|
| Benchmark VM (D4s_v5) | ~$0.10/hr | On-demand, ~10 min per run |
| Storage (64 GB SSD) | ~$10/month | Always |
| GitHub Actions | Free (public) / included (private) | Every push |

Estimated: **~$15/month** with daily benchmark runs.

@@ -164,15 +164,29 @@ run_claude() {

     # Run claude in the workspace dir
     # --dangerously-skip-permissions: evals run in temp dirs, safe to allow all tools
+    local exit_code=0
     (cd "$workdir" && claude "${claude_args[@]}" --dangerously-skip-permissions) \
-        > "${result_prefix}.json" 2> "${result_prefix}.stderr" || true
+        > "${result_prefix}.json" 2> "${result_prefix}.stderr" || exit_code=$?

     end_time=$(date +%s)
     duration=$((end_time - start_time))

-    echo "$label completed in ${duration}s"
+    echo "$exit_code" > "${result_prefix}.exitcode"
+    echo "$label completed in ${duration}s (exit code: $exit_code)"
     echo "$duration" > "${result_prefix}.duration"

+    # Detect empty/missing output — likely a crash or timeout
+    if [ ! -s "${result_prefix}.json" ]; then
+        echo "WARNING: $label produced no output (exit code: $exit_code)" >&2
+        echo "CRASH DIAGNOSTIC: claude exited with code $exit_code after ${duration}s. No JSON output produced." \
+            >> "${result_prefix}.stderr"
+        if [ "$exit_code" -eq 124 ] || [ "$duration" -gt 1800 ]; then
+            echo "LIKELY CAUSE: timeout (duration=${duration}s)" >> "${result_prefix}.stderr"
+        elif [ "$exit_code" -eq 137 ] || [ "$exit_code" -eq 139 ]; then
+            echo "LIKELY CAUSE: OOM or signal kill (exit code=$exit_code)" >> "${result_prefix}.stderr"
+        fi
+    fi
+
     # Run tests post-optimization to check correctness + timing
     local test_cmd
     test_cmd=$(jq -r '.test_command // empty' "$RUN_DIR/manifest.json")
@ -1,4 +0,0 @@
|
|||
[language]
|
||||
name = "python"
|
||||
extensions = [".py", ".pyi"]
|
||||
commands = ["optimize", "review", "triage", "audit-libs"]
|
||||
|
|
@@ -9,7 +9,19 @@ try:
except Exception:  # noqa: BLE001
    __version__ = "0.0.0"

from ._capabilities import (
    REQUIRED_CAPABILITIES,
    CompareResultsFn,
    DiscoverFn,
    ExtractContextFn,
    NormalizeFn,
    ParseResultsFn,
    ReplaceCodeFn,
    RunTestsFn,
    validate_capabilities,
)
from ._client import AIClient
from ._configuration import LanguageConfiguration
from ._git import check_and_push_branch, get_repo_owner_and_name
from ._model import (
    BenchmarkDetail,

@@ -33,6 +45,7 @@ from ._pipeline import (
)
from ._platform import PlatformClient, parse_repo_owner_and_name
from ._plugin import LanguagePlugin
from ._state import LanguageState
from ._telemetry import init_telemetry, ph
from .exceptions import (
    AIServiceConnectionError,

@@ -41,6 +54,7 @@ from .exceptions import (
)

__all__ = [
    "REQUIRED_CAPABILITIES",
    "AIClient",
    "AIServiceConnectionError",
    "AIServiceError",

@@ -48,14 +62,23 @@ __all__ = [
    "Candidate",
    "CandidateForest",
    "CandidateNode",
    "CompareResultsFn",
    "DiscoverFn",
    "EvaluationContext",
    "ExtractContextFn",
    "FileDiffContent",
    "InvalidAPIKeyError",
    "LanguageConfiguration",
    "LanguagePlugin",
    "LanguageState",
    "NormalizeFn",
    "OptimizationRequest",
    "OptimizationReviewResult",
    "ParseResultsFn",
    "PlatformClient",
    "PrComment",
    "ReplaceCodeFn",
    "RunTestsFn",
    "__version__",
    "check_and_push_branch",
    "create_rank_dictionary",

@@ -69,4 +92,5 @@ __all__ = [
    "performance_gain",
    "ph",
    "select_best",
    "validate_capabilities",
]
packages/codeflash-core/src/codeflash_core/_capabilities.py
Normal file
@@ -0,0 +1,227 @@
"""Capability protocols for language-specific callables.

These protocols formalize the function signatures that core
pipeline steps already accept as parameters. Instead of
documenting "pass a callable that normalizes code" in prose,
we now have typed contracts.

Language packages implement these protocols (structurally —
no inheritance needed) and declare them in their
:class:`LanguagePlugin`. Core pipeline functions type their
parameters against these protocols for static checking.

This is the language equivalent of cloud abstractions'
``AbstractComponent`` subclasses (VPC, ObjectStore, etc.) —
each protocol defines what one capability looks like across
all languages.

Example usage in a core pipeline function::

    from codeflash_core import NormalizeFn

    def dedup_candidates(
        candidates: list[Candidate],
        *,
        normalize_fn: NormalizeFn,
        ...
    ) -> list[Candidate]:
        ...

Example implementation in a language package::

    # codeflash_python/analysis/_normalizer.py
    def normalize_python_code(code: str) -> str:
        ...  # satisfies NormalizeFn structurally
"""

from __future__ import annotations

from typing import TYPE_CHECKING, Protocol, runtime_checkable

if TYPE_CHECKING:
    from pathlib import Path


# -- Normalization -------------------------------------------------


@runtime_checkable
class NormalizeFn(Protocol):
    """Normalize source code for deduplication comparison.

    Takes raw source code and returns a canonical form where
    semantically equivalent code produces identical strings.
    Used by :func:`dedup_candidates` to detect duplicate
    optimization candidates.

    Python: strips comments, normalizes whitespace, sorts imports.
    JavaScript: might normalize semicolons, quote styles, etc.
    """

    def __call__(self, code: str) -> str: ...


# -- Discovery ----------------------------------------------------


@runtime_checkable
class DiscoverFn(Protocol):
    """Discover optimizable functions in a source file.

    Takes the source text and its file path, returns a list of
    function descriptors. The return type is ``list[object]``
    because function models are language-specific (Python has
    ``FunctionToOptimize``, JavaScript will have its own).

    Each returned object must have at least:

    - ``function_name: str``
    - ``qualified_name: str``
    - ``file_path: Path``
    """

    def __call__(
        self,
        source: str,
        file_path: Path,
    ) -> list[object]: ...


# -- Context Extraction --------------------------------------------


@runtime_checkable
class ExtractContextFn(Protocol):
    """Extract optimization context for a function.

    Gathers the surrounding code context that the AI service
    needs to generate good optimizations: imports, class
    definitions, helper functions, type annotations, etc.

    The return type is ``object`` because context models are
    language-specific. Each returned object must have at least:

    - ``read_writable: str`` — code the AI can modify
|
||||
- ``read_only: str`` — surrounding context for reference
|
||||
"""
|
||||
|
||||
def __call__(
|
||||
self,
|
||||
function: object,
|
||||
project_root: Path,
|
||||
) -> object: ...
|
||||
|
||||
|
||||
# -- Code Replacement ----------------------------------------------
|
||||
|
||||
|
||||
@runtime_checkable
|
||||
class ReplaceCodeFn(Protocol):
|
||||
"""Replace function definitions in source code.
|
||||
|
||||
Takes the original source, a list of function names to
|
||||
replace, and the new optimized code. Returns the updated
|
||||
source with the functions replaced.
|
||||
|
||||
Language packages handle the AST/CST manipulation
|
||||
internally (Python uses libcst, JavaScript might use
|
||||
babel/recast).
|
||||
"""
|
||||
|
||||
def __call__(
|
||||
self,
|
||||
source_code: str,
|
||||
function_names: list[str],
|
||||
optimized_code: str,
|
||||
) -> str: ...
|
||||
|
||||
|
||||
# -- Test Execution ------------------------------------------------
|
||||
|
||||
|
||||
@runtime_checkable
|
||||
class RunTestsFn(Protocol):
|
||||
"""Run behavioral tests and return raw results.
|
||||
|
||||
Executes the test suite (or a subset) against the current
|
||||
code and returns a results object. The return type is
|
||||
``object`` because test result models are language-specific.
|
||||
|
||||
Each returned object must have at least:
|
||||
- ``passed: bool``
|
||||
- ``runtime_ns: int``
|
||||
"""
|
||||
|
||||
def __call__(
|
||||
self,
|
||||
test_files: object,
|
||||
cwd: Path,
|
||||
) -> object: ...
|
||||
|
||||
|
||||
@runtime_checkable
|
||||
class ParseResultsFn(Protocol):
|
||||
"""Parse raw test output into structured results.
|
||||
|
||||
Takes the raw output from the test runner and returns
|
||||
a structured results object. Separated from
|
||||
:class:`RunTestsFn` so that the same parser can be used
|
||||
for cached/replayed results.
|
||||
"""
|
||||
|
||||
def __call__(self, raw_output: object) -> object: ...
|
||||
|
||||
|
||||
@runtime_checkable
|
||||
class CompareResultsFn(Protocol):
|
||||
"""Compare original and optimized test results.
|
||||
|
||||
Returns ``True`` if the optimized code is behaviorally
|
||||
equivalent to the original (same outputs, no regressions).
|
||||
"""
|
||||
|
||||
def __call__(
|
||||
self,
|
||||
original: object,
|
||||
optimized: object,
|
||||
) -> bool: ...
|
||||
|
||||
|
||||
# -- Capability Map ------------------------------------------------
|
||||
|
||||
|
||||
REQUIRED_CAPABILITIES: frozenset[str] = frozenset(
|
||||
{
|
||||
"normalize_code",
|
||||
"discover_functions",
|
||||
"extract_context",
|
||||
"replace_code",
|
||||
"run_tests",
|
||||
"parse_results",
|
||||
"compare_results",
|
||||
},
|
||||
)
|
||||
"""Capability names that every language plugin must provide.
|
||||
|
||||
Used by :func:`validate_capabilities` to check that a plugin
|
||||
declares all required callables.
|
||||
"""
|
||||
|
||||
OPTIONAL_CAPABILITIES: frozenset[str] = frozenset(
|
||||
{
|
||||
"run_benchmarks",
|
||||
"generate_tests",
|
||||
"detect_numerical",
|
||||
},
|
||||
)
|
||||
"""Capability names that are useful but not required."""
|
||||
|
||||
|
||||
def validate_capabilities(
|
||||
capabilities: dict[str, object],
|
||||
) -> list[str]:
|
||||
"""Return a list of missing required capabilities.
|
||||
|
||||
Returns an empty list if all required capabilities are
|
||||
present. Callers can use this at plugin construction
|
||||
time to fail fast with a clear error message.
|
||||
"""
|
||||
return sorted(REQUIRED_CAPABILITIES - capabilities.keys())
|
||||
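The structural-typing claim in the module docstring (plain functions satisfy the protocols, no inheritance needed) can be checked at runtime thanks to `@runtime_checkable`. Below is a minimal standalone sketch: the protocol, capability set, and `validate_capabilities` are re-declared here so the snippet runs without the `codeflash_core` package installed, and the toy normalizer is a hypothetical stand-in for the real one.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class NormalizeFn(Protocol):
    def __call__(self, code: str) -> str: ...


# Mirrors REQUIRED_CAPABILITIES from the module above.
REQUIRED_CAPABILITIES = frozenset(
    {
        "normalize_code", "discover_functions", "extract_context",
        "replace_code", "run_tests", "parse_results", "compare_results",
    },
)


def validate_capabilities(capabilities: dict) -> list:
    # Missing required capability names, sorted for stable error messages.
    return sorted(REQUIRED_CAPABILITIES - capabilities.keys())


def normalize_python_code(code: str) -> str:
    # Toy normalizer: drop blank lines and trailing whitespace.
    return "\n".join(ln.rstrip() for ln in code.splitlines() if ln.strip())


# A plain function satisfies the protocol structurally -- no inheritance.
assert isinstance(normalize_python_code, NormalizeFn)

# A plugin declaring only one capability fails fast with a clear list.
missing = validate_capabilities({"normalize_code": normalize_python_code})
print(missing)
```

Note that `isinstance` against a `runtime_checkable` protocol only verifies that the member exists, not its signature; the signature contract is enforced statically by the type checker.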
@@ -3,7 +3,6 @@
from __future__ import annotations

import contextlib
import os
import sys
import uuid
from typing import Any
@ -16,66 +15,13 @@ else:
|
|||
import attrs
|
||||
import requests
|
||||
|
||||
from ._http import _resolve_api_key, _resolve_base_url, _strip_trailing_slash
|
||||
from ._model import Candidate, OptimizationRequest, OptimizationReviewResult
|
||||
from .exceptions import (
|
||||
AIServiceConnectionError,
|
||||
AIServiceError,
|
||||
InvalidAPIKeyError,
|
||||
)
|
||||
|
||||
_PROD_URL = "https://app.codeflash.ai"
|
||||
_LOCAL_URL = "http://localhost:8000"
|
||||
|
||||
_CFAPI_PROD_URL = "https://app.codeflash.ai"
|
||||
_CFAPI_LOCAL_URL = "http://localhost:3001"
|
||||
|
||||
|
||||
def _resolve_base_url() -> str:
|
||||
"""
|
||||
Return the base URL based on *CODEFLASH_AIS_SERVER*.
|
||||
"""
|
||||
server = os.environ.get("CODEFLASH_AIS_SERVER", "prod")
|
||||
if server.lower() == "local":
|
||||
return _LOCAL_URL
|
||||
return _PROD_URL
|
||||
|
||||
|
||||
def _resolve_cfapi_base_url() -> str:
|
||||
"""Return the platform API base URL from the environment."""
|
||||
server = os.environ.get("CODEFLASH_CFAPI_SERVER", "prod")
|
||||
if server.lower() == "local":
|
||||
return _CFAPI_LOCAL_URL
|
||||
return _CFAPI_PROD_URL
|
||||
|
||||
|
||||
def _strip_trailing_slash(url: str) -> str:
|
||||
"""Remove a trailing slash from *url*."""
|
||||
return url.rstrip("/")
|
||||
|
||||
|
||||
def _resolve_api_key() -> str:
|
||||
"""
|
||||
Read and validate *CODEFLASH_API_KEY* from the environment.
|
||||
"""
|
||||
key = os.environ.get("CODEFLASH_API_KEY", "")
|
||||
if not key:
|
||||
msg = (
|
||||
"Codeflash API key not found. Set the"
|
||||
" CODEFLASH_API_KEY environment variable."
|
||||
" Generate one at"
|
||||
" https://app.codeflash.ai/app/apikeys"
|
||||
)
|
||||
raise InvalidAPIKeyError(msg)
|
||||
if not key.startswith("cf-"):
|
||||
msg = (
|
||||
"Invalid Codeflash API key — must start with"
|
||||
f" 'cf-', got '{key[:6]}…'."
|
||||
" Generate a new one at"
|
||||
" https://app.codeflash.ai/app/apikeys"
|
||||
)
|
||||
raise InvalidAPIKeyError(msg)
|
||||
return key
|
||||
|
||||
|
||||
@attrs.define
|
||||
class AIClient:
|
||||
|
|
@@ -88,11 +34,6 @@ class AIClient:
        default=attrs.Factory(_resolve_base_url),
        converter=_strip_trailing_slash,
    )
    _cfapi_base_url: str = attrs.field(
        alias="cfapi_base_url",
        default=attrs.Factory(_resolve_cfapi_base_url),
        converter=_strip_trailing_slash,
    )
    _api_key: str = attrs.field(
        alias="api_key",
        default=attrs.Factory(_resolve_api_key),
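The `attrs.Factory` defaults above defer environment resolution to instance creation time. The same pattern can be illustrated with the stdlib's `dataclasses.field(default_factory=...)`, which behaves like `attrs.Factory` here; this is a sketch, not the package's code, and `DEMO_BASE_URL` is a hypothetical variable used only for illustration:

```python
import os
from dataclasses import dataclass, field


def _resolve_base_url() -> str:
    # Hypothetical env var; trailing-slash stripping mirrors the
    # _strip_trailing_slash converter from the hunk above.
    return os.environ.get("DEMO_BASE_URL", "https://example.invalid").rstrip("/")


@dataclass
class Client:
    # default_factory runs at *instance* creation, like attrs.Factory,
    # so environment changes are picked up per-client, not at import time.
    base_url: str = field(default_factory=_resolve_base_url)


os.environ["DEMO_BASE_URL"] = "http://localhost:8000/"
print(Client().base_url)  # → http://localhost:8000
```

Had the default been written as `base_url: str = _resolve_base_url()`, the URL would be frozen at module import, which is exactly what the factory indirection avoids.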
@@ -120,65 +61,6 @@ class AIClient:
        """Exit the context manager and close the session."""
        self.close()

    def get_user_id(self) -> str | None:
        """Fetch the current user's ID from the Codeflash API."""
        try:
            resp = self._session.get(
                f"{self._cfapi_base_url}/cfapi/cli-get-user",
                timeout=self._timeout,
            )
        except requests.RequestException:
            return None

        if not resp.ok:
            return None

        try:
            data = resp.json()
            return data.get("userId")  # type: ignore[no-any-return]
        except (ValueError, KeyError):
            # Older API returns plain-text user ID.
            return resp.text or None

    def validate_api_key(self) -> str:
        """Validate the API key and return the user ID.

        Raises :class:`InvalidAPIKeyError` if the key is rejected
        (HTTP 403) or missing. Returns the user ID string on
        success. Network errors are re-raised as
        :class:`AIServiceConnectionError`.
        """
        try:
            resp = self._session.get(
                f"{self._cfapi_base_url}/cfapi/cli-get-user",
                timeout=self._timeout,
            )
        except requests.RequestException as exc:
            raise AIServiceConnectionError(str(exc)) from exc

        if resp.status_code == 403:  # noqa: PLR2004
            msg = (
                "Invalid Codeflash API key."
                " Generate a new one at"
                " https://app.codeflash.ai/app/apikeys"
            )
            raise InvalidAPIKeyError(msg)

        if not resp.ok:
            raise AIServiceError(resp.status_code, resp.text)

        try:
            data = resp.json()
            user_id: str | None = data.get("userId")
        except (ValueError, KeyError):
            user_id = resp.text or None

        if not user_id:
            msg = "Could not retrieve user ID from the API."
            raise AIServiceError(0, msg)

        return user_id

    def post(
        self,
        endpoint: str,
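Both removed methods share the same fallback chain for extracting the user ID: try the JSON `userId` field, then fall back to the plain-text body that older API versions return. That chain can be sketched against a stubbed response object, with no network involved; `FakeResponse` and `extract_user_id` are hypothetical names for this illustration only:

```python
class FakeResponse:
    """Stub with just the attributes the extraction logic touches."""

    def __init__(self, status_code=200, payload=None, text=""):
        self.status_code = status_code
        self._payload = payload
        self.text = text

    @property
    def ok(self):
        return self.status_code < 400

    def json(self):
        if self._payload is None:
            raise ValueError("body is not JSON")
        return self._payload


def extract_user_id(resp):
    # Mirrors the fallback chain above: JSON "userId" first, then the
    # plain-text body used by older API versions, then None.
    try:
        return resp.json().get("userId")
    except (ValueError, KeyError):
        return resp.text or None


print(extract_user_id(FakeResponse(payload={"userId": "u_123"})))  # → u_123
print(extract_user_id(FakeResponse(text="u_legacy")))              # → u_legacy
```

Keeping this extraction in one place is presumably why the diff later routes both `get_user_id` and `validate_api_key` through shared helpers rather than duplicating the try/except in each method.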
@@ -1,27 +0,0 @@
"""Platform constants and codeflash directory paths."""

from __future__ import annotations

import os
import sys
import tempfile
from pathlib import Path

from platformdirs import user_config_dir

LF: str = os.linesep
IS_POSIX: bool = os.name != "nt"
SAFE_SYS_EXECUTABLE: str = Path(sys.executable).as_posix()

codeflash_cache_dir: Path = Path(
    user_config_dir(
        appname="codeflash",
        appauthor="codeflash-ai",
        ensure_exists=True,
    ),
)

codeflash_temp_dir: Path = Path(tempfile.gettempdir()) / "codeflash"
codeflash_temp_dir.mkdir(parents=True, exist_ok=True)

codeflash_cache_db: Path = codeflash_cache_dir / "codeflash_cache.db"
Some files were not shown because too many files have changed in this diff.