diff --git a/.claude/rules/debugging.md b/.claude/rules/debugging.md
new file mode 100644
index 0000000..d786dfc
--- /dev/null
+++ b/.claude/rules/debugging.md
@@ -0,0 +1,21 @@
+# Debugging and Testing
+
+## Root cause first
+
+When encountering a bug, always investigate the root cause. Don't patch symptoms. If you're about to add a try/except, a fallback default, or a defensive check -- ask whether the real fix is upstream.
+
+The user will tell you "fix the root cause" if you get this wrong. Don't make them say it twice.
+
+## Isolated testing
+
+Prefer running individual test functions or modules over full E2E suites. Only run the full suite when explicitly asked or before pushing.
+
+- Single function: `uv run pytest tests/test_foo.py::TestBar::test_baz -v`
+- Single module: `uv run pytest tests/test_foo.py -v`
+- Full suite: only when asked, or before `git push`
+
+When debugging a specific endpoint or integration, test it directly (e.g., `requests.post()` to a local server) instead of running the entire pipeline end-to-end.
+
+## Subprocess failures
+
+When a subprocess fails, always log stdout and stderr. "Exit code 1" with no output is useless. If you see a subprocess failure without output, add logging before moving on.
diff --git a/.claude/rules/github.md b/.claude/rules/github.md
index be9d7c9..c0f0d7e 100644
--- a/.claude/rules/github.md
+++ b/.claude/rules/github.md
@@ -1,3 +1,5 @@
 # GitHub Interactions
 
-Prefer MCP GitHub tools (`mcp__github__*`) over the `gh` CLI for all GitHub operations. Only fall back to `gh` via Bash when no matching MCP tool exists.
+ALWAYS use MCP GitHub tools (`mcp__github__*`) for GitHub operations. Check for a matching MCP tool first -- only fall back to `gh` via Bash when no MCP tool exists for the operation.
+
+This also applies to other MCP-connected services (Linear, Granola). MCP first, CLI second.
diff --git a/.claude/rules/sessions.md b/.claude/rules/sessions.md
index a9a5033..9384ea3 100644
--- a/.claude/rules/sessions.md
+++ b/.claude/rules/sessions.md
@@ -6,7 +6,11 @@ One task per session. Don't mix implementation with communication drafting, tran
 
 ## Duration
 
-Cap sessions at 2-3 hours. Use `/handoff` at natural breakpoints rather than letting auto-compaction degrade context. If the session has overflowed context once, strongly consider starting a new session.
+Cap sessions at 2-3 hours. Use `/handoff` at natural breakpoints rather than letting auto-compaction degrade context.
+
+- After 1 compaction: consider wrapping up the current task and handing off.
+- After 3 compactions: stop what you're doing, update `status.md`, and tell the user to start a fresh session.
+- Never continue past 5 compactions. Context is too degraded to be productive.
 
 ## Context preservation
 
@@ -14,6 +18,12 @@ Cap sessions at 2-3 hours. Use `/handoff` at natural breakpoints rather than let
 - When compacting, preserve: modified files list, current branch, VM state, test commands used, key decisions made
 - Use subagents for exploration to keep main context clean
 
-## Avoid polling
+## No polling
 
-Don't use `/loop` to poll agent status -- it burns context on repetitive status messages. If you need to monitor a long-running agent, check the output file directly.
+Never poll background tasks. No `wc -l`, no `tail -f`, no `sleep` loops checking output files. Use `run_in_background` and wait for the completion notification. One check after notification is fine. Sixty-two checks in a loop is not.
+
+## File read budget
+
+If you've read the same file 3+ times in a session, something is wrong. Either:
+- The session is too long and compaction destroyed your context -- write a handoff
+- You're not retaining key information from previous reads -- write it down in your response before it compacts away
diff --git a/packages/.claude/rules/error-handling.md b/packages/.claude/rules/error-handling.md
new file mode 100644
index 0000000..d55f119
--- /dev/null
+++ b/packages/.claude/rules/error-handling.md
@@ -0,0 +1,41 @@
+# Error Handling in Pipeline Code
+
+## No silent swallowing
+
+Never write `except Exception: pass` or `except Exception: return`. Every except block must either:
+- Log at WARNING or higher with `exc_info=True`
+- Re-raise after cleanup
+- Return a clearly documented sentinel value with at least DEBUG logging
+
+The tracing code (`benchmarking/_tracing.py`) is the worst offender -- 7 bare excepts with zero logging. Don't add more.
+
+## Protect ast.parse() calls
+
+`ast.parse()` raises `SyntaxError` on malformed input. Always wrap it:
+
+```python
+try:
+    tree = ast.parse(source)
+except SyntaxError:
+    log.warning("Failed to parse %s", path, exc_info=True)
+    return None  # or skip this file
+```
+
+Known unprotected calls: `_state.py:70`, `_ranking.py:33`.
+
+## XML/TOML parsing
+
+Always use `recover=True` for lxml `XMLParser`. Always wrap `tomlkit.parse()` in try/except. Log parsing failures at WARNING, not DEBUG -- config parsing failures matter.
+
+## Format consistency: client-server boundary
+
+The AI service returns markdown-fenced code blocks. Every endpoint response must be parsed with `CodeStringsMarkdown.parse_markdown_code()` before using the code. Currently only `/optimize` and `/optimize-line-profiler` do this correctly. The refinement, repair, and adaptive endpoints in `ai/_refinement.py` skip this step.
+
+Pattern to follow (from `pipeline/_candidate_gen.py:83-91`):
+
+```python
+parsed = CodeStringsMarkdown.parse_markdown_code(c.code)
+if not parsed.code_strings:
+    continue
+plain_code = "\n\n".join(cs.code for cs in parsed.code_strings)
+```
diff --git a/packages/.claude/rules/test-coverage.md b/packages/.claude/rules/test-coverage.md
new file mode 100644
index 0000000..7f15087
--- /dev/null
+++ b/packages/.claude/rules/test-coverage.md
@@ -0,0 +1,24 @@
+# Test Coverage Requirements
+
+## Every module needs unit tests
+
+If you add or modify a module, it must have a corresponding test file. No exceptions for "infrastructure" or "worker" code -- those break the hardest in E2E.
+
+Modules that currently have zero unit tests and have broken in production:
+- `pipeline/_candidate_gen.py` -- candidate generation strategies
+- `analysis/_discovery_worker.py` -- pytest test collection subprocess
+
+Other untested modules that need coverage:
+- `codegen/_create_pr.py`
+- `ai/_tabulate.py`
+- `benchmarking/_trace_db.py`
+- `benchmarking/_benchmark_worker.py`
+
+## Test the error paths
+
+The happy path usually works. What breaks in E2E is:
+- Malformed input (bad XML, raw source where markdown expected, missing config keys)
+- Subprocess crashes (exit code != 0 with no output)
+- Version mismatches (pytest 9 vs 8, different config section names)
+
+Write tests for these cases, not just the golden path.