Library Boundary Breaking — Deep Guide
Domain agents treat external libraries as walls they can't cross. The primary optimizer doesn't. When profiling shows an external library dominating runtime and domain agents have plateaued, the optimizer has the authority to replace library calls with focused implementations that only cover the subset the codebase actually uses.
This is one of the optimizer's highest-value capabilities — a general-purpose library paying for features you never call is a cross-domain problem (structure × CPU) that no single-domain agent can solve.
When to consider this
All three conditions must hold:
- Profiling evidence: The library accounts for >15% of cumtime, AND the cost is in the library's internal machinery (visitor dispatch, metadata resolution, generalized parsing), not in your code's usage of it
- Plateau evidence: A domain agent has already tried to reduce traversals, skip unnecessary calls, cache results — and still plateaued because the remaining calls are essential but the library's implementation of them is heavy
- Narrow usage surface: The codebase uses a small fraction of the library's API. If you're using 5 functions out of 200, a focused replacement is feasible. If you're using most of the API, it's not worth it
How to assess feasibility
Step 1 — Audit the actual API surface. Grep for all imports and calls to the library across the project:
```bash
# What does the codebase actually import?
grep -rn "from <library>" --include="*.py" | sort -u
grep -rn "import <library>" --include="*.py" | sort -u

# What classes/functions are actually called?
# (the second grep drops call sites that are commented out)
grep -rn "<library>\." --include="*.py" | grep -v ':[[:space:]]*#' | sort -u
```
Step 2 — Classify each usage. For each call site, determine:
- What does it need? (parse source → AST, transform AST → source, visit nodes, resolve metadata)
- What subset of the library's type system does it touch?
- Could ast (stdlib) + string manipulation cover this use case?
- Does it depend on library-specific features (e.g., CST whitespace preservation, scope resolution)?
Step 3 — Map the replacement boundary. Draw the line:
- Replace: Uses where the codebase needs information extraction (collecting definitions, finding names, checking node types) — ast handles this
- Keep: Uses where the codebase needs source-faithful transformation (rewriting imports while preserving formatting, inserting code) — CST libraries provide this, ast doesn't
- Hybrid: Parse with ast for analysis, fall back to the library only for transformations that must preserve source formatting (sketched below)
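In code, that boundary can be as small as two entry points: every read-only pass builds on a stdlib parse, and only the formatting-sensitive rewrite keeps the CST dependency. A minimal sketch assuming libcst on the write side; the function names are illustrative:

```python
import ast

import libcst as cst  # kept only for the write path

def parse_for_analysis(source: str) -> ast.Module:
    """Read side: cheap stdlib parse, no CST overhead.
    Every information-extraction pass builds on this tree."""
    return ast.parse(source)

def rewrite_preserving_format(source: str, transformer: cst.CSTTransformer) -> str:
    """Write side: stays on libcst because the rewrite must
    round-trip comments and whitespace exactly."""
    return cst.parse_module(source).visit(transformer).code
```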
Step 4 — Estimate effort vs payoff. A focused replacement is worth it when:
- The library calls being replaced account for >20% of total runtime
- The replacement can use stdlib (ast, tokenize, inspect) — no new dependencies
- The API surface being replaced is <10 functions/classes
- Correctness can be verified against the library's output (run both, diff results)
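The runtime-share criterion in the first bullet can be read straight out of a cProfile dump rather than estimated. A sketch, assuming a dump written with python -m cProfile -o profile.out; it leans on Stats.stats, an internal but widely used attribute holding the raw counters:

```python
import pstats

LIBRARY_PATH_FRAGMENT = "/libcst/"  # placeholder: how library frames appear in file paths

stats = pstats.Stats("profile.out")
# each value is (call count, primitive calls, tottime, cumtime, callers)
total = sum(tt for (cc, nc, tt, ct, callers) in stats.stats.values())
library = sum(tt for (filename, lineno, func), (cc, nc, tt, ct, callers)
              in stats.stats.items() if LIBRARY_PATH_FRAGMENT in filename)

print(f"library share of total runtime: {library / total:.1%}")  # worth replacing above ~20%
```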
The replacement pattern
The canonical case: a CST library (libcst, RedBaron) used primarily for reading code structure, but the library pays CST overhead (whitespace tracking, parent pointers, metadata resolution) that the codebase doesn't need for those reads.
Typical breakdown:
- 60% of calls: "Give me all top-level definitions" → ast.parse + ast.walk
- 25% of calls: "Find all names used in this scope" → ast.parse + ast.walk
- 10% of calls: "Remove unused imports" → needs source-faithful rewrite → KEEP the library
- 5% of calls: "Add this import statement" → needs source-faithful rewrite → KEEP the library
Replace the 85% that only reads. Keep the 15% that writes.
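The read-only replacements tend to be short. A sketch of the first two bullets above on stdlib ast alone; helper names are illustrative:

```python
import ast

def top_level_definitions(source: str) -> list[str]:
    """'Give me all top-level definitions' without CST overhead."""
    tree = ast.parse(source)
    return [node.name for node in tree.body
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))]

def names_used(source: str) -> set[str]:
    """'Find all names used': every Name read anywhere in the tree."""
    tree = ast.parse(source)
    return {node.id for node in ast.walk(tree)
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)}
```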
Implementation approach:
- Write the ast-based replacement for the read-only use cases
- Verify correctness: run the replacement alongside the library on real project files, diff the outputs
- Micro-benchmark: the replacement should be 5-20x faster for read-only operations (no CST overhead); a timing sketch follows this list
- Swap in the replacement at each call site. Keep the library import for the write operations that need it
- Profile the full benchmark — the library's visitor dispatch cost drops proportionally to how many traversals you eliminated
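A minimal timing sketch for the micro-benchmark step, parsing one representative file through both parsers (the path is a placeholder; the exact ratio varies with file size and libcst version):

```python
import ast
import timeit
from pathlib import Path

import libcst as cst

source = Path("some_module.py").read_text()  # placeholder: a representative project file

ast_time = timeit.timeit(lambda: ast.parse(source), number=100)
cst_time = timeit.timeit(lambda: cst.parse_module(source), number=100)

print(f"ast:    {ast_time:.3f}s")
print(f"libcst: {cst_time:.3f}s")
print(f"ratio:  {cst_time / ast_time:.1f}x")  # parse alone is usually around 10x
```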
Verification is non-negotiable
Library replacements are high-reward but high-risk. The library handles edge cases you may not think of. Always verify:
- Diff test: Run both the library path and your replacement on every file in the project's test suite. The outputs must match exactly (a harness sketch follows this list)
- Edge cases: Empty files, files with syntax errors, files with decorators/async/walrus operators/match statements, files with star imports, files with __all__
- Encoding: The library may handle encoding declarations (# -*- coding: utf-8 -*-). Your replacement must too, or document the limitation
- Version coverage: If the project supports Python 3.8-3.13, your ast usage must handle grammar differences (e.g., match statements only exist in 3.10+)
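One shape the diff test can take. definitions_via_library and definitions_via_ast are hypothetical names for the existing pass and its replacement; tokenize.open is used deliberately because it honors encoding declarations, which covers the encoding bullet:

```python
import tokenize
from pathlib import Path

from analysis_old import definitions_via_library  # hypothetical: existing libcst-backed pass
from analysis_new import definitions_via_ast      # hypothetical: the ast replacement

def outcome(fn, source):
    """Result or exception type: error behavior must match too."""
    try:
        return fn(source)
    except SyntaxError as exc:
        return type(exc)

failures = []
for path in Path("src").rglob("*.py"):  # placeholder: every file the test suite touches
    with tokenize.open(path) as f:      # honors '# -*- coding: ... -*-' declarations
        source = f.read()
    if outcome(definitions_via_library, source) != outcome(definitions_via_ast, source):
        failures.append(path)

assert not failures, f"outputs diverge on {len(failures)} files, e.g. {failures[0]}"
```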
Example: libcst → ast for analysis passes
This is the pattern you'll see most often. libcst provides a full Concrete Syntax Tree with whitespace preservation, metadata providers (parent, scope, qualified names), and a visitor/transformer framework. But analysis-only passes — collecting definitions, finding name references, building dependency graphs — don't need any of that. They need the parse tree structure, which ast provides at a fraction of the cost.
What makes this expensive in libcst:
- MetadataWrapper resolves metadata providers (parent, scope) even when the visitor only checks node types
- The visitor pattern dispatches visit_Name, leave_Name, etc. through a deep class hierarchy, with 523K+ calls for moderate files
- CST nodes carry whitespace tokens, making the tree ~3x larger than an AST
What ast gives you:
- ast.parse() is C-implemented, ~10x faster than libcst's parser
- ast.walk() is a simple generator over the tree — no visitor dispatch overhead
- Nodes are lightweight (no whitespace, no parent pointers unless you add them)
- ast.NodeVisitor exists if you need the visitor pattern, but for most analysis ast.walk + isinstance checks suffice
What ast does NOT give you:
- Round-trip source fidelity (comments and whitespace are lost)
- Built-in scope resolution (you'd need to implement it or use a lighter library)
- Automatic metadata (parent node, qualified names) — you track these yourself if needed
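The parent-pointer gap in the last bullet takes a few lines to close by hand. A sketch; the .parent attribute is a convention of this snippet, not part of the ast module:

```python
import ast

def attach_parents(tree: ast.AST) -> ast.AST:
    """stdlib ast nodes carry no parent links; add them in one O(n) pass."""
    for node in ast.walk(tree):
        for child in ast.iter_child_nodes(node):
            child.parent = node  # convention only: ast does not define .parent
    return tree

tree = attach_parents(ast.parse("def f():\n    return 1\n"))
ret = next(n for n in ast.walk(tree) if isinstance(n, ast.Return))
assert isinstance(ret.parent, ast.FunctionDef)
```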
If the analysis pass just needs "what names are defined at module level" or "what names does this function reference," ast is the right tool.