# Failure Modes & Recovery

This document catalogs known failure modes in multi-teammate workflows, along with how to detect them, their root causes, and recovery procedures.

## Failure Mode 1: Deadlock (Circular Task Dependencies)

### Detection

- TaskList shows all teammates in "blocked" state
- No tasks in "in_progress" (all waiting for something)
- Lead waits for teammates to move, teammates wait for lead or each other
- Session stalled for > 30 minutes with no progress

### Root Causes

- **Circular dependencies**: Benchmarker blocked by Reviewer, Reviewer blocked by Benchmarker
- **Missing dependency definition**: Lead created tasks without specifying blockedBy/blocks
- **Lead waiting for teammate, teammate waiting for lead approval**: Both stalled

### Example Scenario

```
Optimizer blocked by: Reviewer (waiting for code review)
Reviewer blocked by: Benchmarker (wants benchmark results first)
Benchmarker blocked by: Optimizer (waiting for implementation)
Lead: Blocked waiting for Optimizer to complete

Result: No one can move
```

### Recovery Procedure

**Step 1: Break the cycle** (5 min)

```
Lead: Manually complete one task in the chain
      Option A: Lead implements the optimization (breaks Optimizer bottleneck)
      Option B: Lead approves current implementation (breaks Reviewer bottleneck)
      Option C: Lead runs benchmark manually (breaks Benchmarker bottleneck)
```

**Step 2: Unblock teammates**

```
Lead: TaskUpdate for the broken task with status: "completed"
      Notify teammates via SendMessage: "Cycle broken, [task] completed, proceed"
```

**Step 3: Replan for next session**

```
Lead: Review dependency graph in CLAUDE.md
      Redefine as linear chain, not circular:
      - Profile (Optimizer) → blocks
      - Implement (Optimizer) → blocks
      - Benchmark (Benchmarker) → blocks
      - Review (Reviewer) → completed
```

**Step 4: Document**

```
Add to project MEMORY.md:
- Circular dependency detected: [describe]
- Broken by: [what lead did]
- Prevention: Use team-structure.md decision tree before creating teams
```

### Prevention

- Add TaskCreated hook to validate dependency graph
- Use team-structure.md to pick proven configurations (single optimization = linear chain)
- Always define dependencies when creating tasks

---

## Failure Mode 2: Silent Teammate Failure

### Detection

- Teammate stops responding (no new messages/updates)
- Task stuck in "in_progress" for > 2 hours
- No error message in last response
- Lead checks TaskList, sees task still assigned but no progress

### Root Causes

- **API error** (StopFailure hook triggered): Model hit rate limit, timeout, or internal error
- **Infinite loop**: Teammate caught in retry logic (unlikely but possible)
- **Network timeout**: SSH to VM failed, teammate waiting indefinitely
- **Rare**: Teammate hit unrecoverable error and silently stopped

### Example Scenario

```
Benchmarker: Started benchmark on VM, got timeout error
             StopFailure hook fired (no recovery context provided)
             Session ended with no message to lead

Lead: Checks TaskList 2 hours later, sees task still "in_progress"
      Has no idea benchmarker failed
```

### Recovery Procedure

**Step 1: Verify failure** (2 min)

```bash
# Check if task is actually stuck (via TaskList):
#   - Is task marked "in_progress"?
#   - What was the last message/timestamp?
#   - Who is the owner?

# Check teammate session logs (if accessible)
ls -la /Users/*/Desktop/work/.claude/projects/*/
grep "error\|failed" teammate-session.log
```

**Step 2: Assess damage** (2 min)

```
Questions:
- Did teammate complete any work before failing?
- Did they commit to branch or save results?
- Is MEMORY.md updated with partial findings?
- Can work be resumed from where they left off?
```

**Step 3: Choose recovery path**

**Path A: Retry with fresh teammate** (recommended)

```
Lead: TaskUpdate(task, owner: "new-teammate-name",
        description: "Previous teammate failed. Retry from scratch or check MEMORY.md for partial findings")

      TaskCreate if previous teammate created partial work:
      - If branch partially implemented: "Completed implementation on perf/... branch"
      - If benchmark partial: "Validate benchmark results, see .codeflash/.../data/"

New teammate: Reads task description + previous teammate MEMORY.md
              Continues from where possible, or redoes from scratch
```

**Path B: Lead takes over** (if time-critical)

```
Lead: Break team pattern temporarily
      Manually complete the failing task
      Continue rest of team as normal

This saves time but adds context cost. Document cost in metrics.tsv
```

**Path C: Abandon task** (if low priority)

```
Lead: TaskUpdate(task, status: "deleted")
      SendMessage to remaining teammates: "Benchmark task cancelled, focus on review"
```

**Step 4: Post-mortem**

```
Add to failure-modes.md and project MEMORY.md:
- Failure: [what happened]
- Cause: [why - check logs if possible]
- Recovery: [what lead did]
- Prevention: [what to do differently]
```

### Prevention

- Monitor teammate progress manually every 1-2 hours during long-running tasks
- Add PostToolUseFailure hook to capture error context and retry info
- Define task timeouts: "Benchmark should complete within 30 min, alert lead if not"

---

## Failure Mode 3: Context Loss After Compaction

### Detection

- Teammate resumes session after compaction
- Forgets current branch or optimization goal
- Asks "what was I working on?" or implements wrong thing
- Compaction removed critical state info (branch name, status, next step)

### Root Causes

- **PreCompact hook not configured**: Critical state not snapshotted before compaction
- **SessionStart hook not re-injecting**: State not restored after compaction
- **CLAUDE.md "Compact instructions" too aggressive**: Dropped branch/status info
- **Teammate MEMORY.md not updated**: No context to fall back to

### Example Scenario

```
Optimizer: Working on perf/batch-size, got 40% gain, was about to benchmark
           Context at 150k tokens → compaction fires
           PreCompact hook missing → branch/status not saved
           SessionStart hook resumes but status.md not injected

Optimizer resumes: Reads compacted summary, can't find branch name
                   Runs `git branch | grep perf` but if many perf/* branches, confused
                   Starts on wrong branch or redoes profiling
```

### Recovery Procedure

**Step 1: Restore context** (5 min)

```
Teammate: Check available context in order:

Option A: Read .codeflash/{teammember}/{org}/{project}/status.md
          Should say "Current branch: perf/batch-size"
          Should say "Latest: 40% throughput gain, ready for benchmark"

Option B: Read own MEMORY.md
          Should have notes about findings, branch, next steps

Option C: Check git
          git branch (shows current and recent branches)
          git log --oneline | head (shows recent commits with messages)

Option D: Read compacted summary in session
          Look for "Optimization" or branch/result references
```

**Step 2: Verify before continuing** (2 min)

```bash
# Confirm right branch
git status
git branch -v

# Confirm right state
git log -1 --format=fuller

# Confirm results exist
ls -la .codeflash/{teammember}/{org}/{project}/data/

# Read what lead said in last interaction
# (Look for approval, next steps in task description)
```

**Step 3: Document if context was lost**

```
If context had to be reconstructed from git/files instead of summary:
- Note what was missing (branch? status? results?)
- Check PreCompact hook configuration
- Check SessionStart hook configuration

Add to failure-modes.md:
- Lost: [branch name / status / results]
- Recovered from: [git log / MEMORY.md / files]
- Fix: Configure PreCompact hook to snapshot state
```

**Step 4: Continue work** (normal)

```
Now that context is restored, proceed normally
```

### Prevention

- **Configure PreCompact hook** to snapshot before compaction
- **Configure SessionStart hook** to inject status after compaction
- **Keep MEMORY.md updated** with current branch and findings as you work
- **Update .codeflash/{teammember}/{org}/{project}/status.md** regularly, not just at handoff

---

## Failure Mode 4: Stale Teammate Results

### Detection

- Lead reviews completed task 1-2 hours after completion
- Requirements have changed since task started
- Teammate's results are outdated
- Lead asks for redo but is frustrated by wasted time

### Root Causes

- **No real-time notification**: Lead doesn't check TaskList frequently
- **Long-running tasks**: 2-3 hour benchmarks, teammates finish but lead not watching
- **Requirements evolved**: New discovery mid-task, teammates didn't know to adjust
- **Lead context lost**: Lead forgot teammate was working on this, started duplicate work

### Example Scenario

```
Time 1:00 PM: Lead creates benchmark task for optimizer
              "Benchmark batch-size 32 on typeagent VM"

Time 1:30 PM: Optimizer implements, TaskUpdate(status: "completed")
              Results show batch-size 32 gives 40% gain

Time 2:30 PM: Lead discovers new issue: "Actually, test variance is 15%, need < 5%"
              Lead contacts optimizer: "We need tighter tolerance"

Time 3:00 PM: Optimizer: "I already finished, results are done"
              Lead: "Those are worthless now, redo with variance validation"
              Wasted 1.5 hours of benchmarking
```

### Recovery Procedure

**Step 1: Assess if results are salvageable** (5 min)

```
Lead: Can we adjust the results, or truly need redo?
- If just need re-analysis: Have teammate post-process
- If need re-benchmarking: Full redo required
```

**Step 2: Create new task with updated requirements** (5 min)

```
Lead: TaskCreate("Re-benchmark batch-size 32 with variance < 5%",
        owner: "optimizer",
        description: "Previous run showed 40% gain but variance 15%. Need tighter tolerance. Run at least 10 iterations.")

      SendMessage(to: "optimizer"): "Requirements refined, new task created with details"
```

**Step 3: Reuse previous findings** (5 min)

```
Optimizer: Reads new task description
           Reuses MEMORY from previous attempt
           Focuses on variance reduction (more iterations, control environment)
           Much faster than cold start
```

**Step 4: Document and prevent**

```
Add to failure-modes.md:
- Cause: Requirements changed mid-task, lead didn't notify
- Prevention: Lead checks TaskList every 30-60 min for long-running tasks
- Prevention: Use SendMessage "Requirements changed, task TBD" ASAP when discovered
```

### Prevention

- **Set expectation upfront**: "Lead will check TaskList every hour for in-progress tasks"
- **Use SendMessage proactively**: If requirements change, contact teammate immediately
- **Keep lead engaged**: Don't create tasks and disappear for 3 hours
- **Milestone tracking**: For long tasks, ask for mid-task updates (e.g., "benchmark started, ETA 1 hour")

---

## Failure Mode 5: Over-Specified Compaction

### Detection

- After compaction + resume, teammate forgets key results (performance numbers, branch state)
- Compaction summary drops critical info
- Teammate re-does work they'd already done

### Root Causes

- **CLAUDE.md "Compact instructions" too aggressive**: Dropped results to save tokens
- **PreCompact hook not configured**: Key info not snapshotted before summarization
- **Summary algorithm dropped wrong info**: All outputs removed, but results needed

### Example Scenario

```
Optimizer: Profiled code, found O(n²) loop, results in .codeflash/data/profile.json
           Session context at 140k tokens
           Compaction fires, summary keeps "O(n²) found" but drops exact numbers

After compaction:
Optimizer: Compacted summary says "bottleneck found" but no numbers
           Teammate starts profiling again (déjà vu)
Lead: "I already have the profile data!"
```

### Recovery Procedure

**Step 1: Check what was actually lost** (2 min)

```bash
# Do the files still exist?
ls -la .codeflash/{teammember}/{org}/{project}/data/profile.json
head .codeflash/{teammember}/{org}/{project}/data/profile.json

# Is it in teammate MEMORY.md?
grep -A 5 "O(n²)" MEMORY.md

# Is it in project MEMORY.md?
grep -A 5 "hotspot" MEMORY.md
```

**Step 2: If files/memory exist, retrieve from there** (2 min)

```
Teammate: Reference the files instead of repeating work
          Use previous profile data as starting point for next phase
          This is why MEMORY.md should be updated regularly
```

**Step 3: If truly lost, redo** (time-dependent)

```
Quick redo (< 5 min): Profile again, it's fast
Expensive redo (> 30 min): Benchmark or full optimization
  → Lead should have caught this, expensive oversight
```

**Step 4: Adjust compaction settings**

```
Edit CLAUDE.md root:

# Current (too aggressive)
# Compact instructions
When you are using compact, please focus on test output and code changes

# Better (preserve results)
# Compact instructions
When compacting, preserve:
1. Benchmark results and performance deltas
2. Code changes made (git diffs)
3. Active branch state and next steps
4. Key findings and hotspots

Drop: verbose tool output, full file reads, intermediate test runs
```

### Prevention

- **Adjust CLAUDE.md "Compact instructions"** to preserve benchmark numbers, branch state, findings
- **Configure PreCompact hook** to snapshot before compaction
- **Rely on MEMORY.md and status.md**, not compaction summary, for critical state
- **Inject via SessionStart hook** after compaction

---

## Failure Mode 6: Lead Doesn't Wait for Teammates

### Detection

- Lead starts implementing while teammates still in "in_progress"
- Lead context balloons (own work + monitoring teammates)
- Teammate completes but lead doesn't see because distracted
- Task ordering breaks (lead gets ahead of profile results)

### Root Causes

- **Lead gets impatient**: "Teammates are taking too long, I'll start implementing"
- **No enforcement**: Stop hook not configured to block lead
- **Lead habit**: From single-session work where they do everything
- **Unclear role**: Lead thinks they should be coding, not just coordinating

### Example Scenario

```
Lead: TaskCreate("Profile typeagent", owner: "optimizer")

Lead: (3 minutes later) Starts implementing own optimization
      Figures "optimizer will take a while, I can start now"
      Context bloats with both leading AND implementing

Optimizer: Finishes profile after 20 min, posts findings
           Lead misses it (absorbed in own implementation)
           Optimizer does TaskUpdate but lead never reads

Result: Lead + Optimizer both working independently, no coordination
        All parallelism benefits lost
```

### Recovery Procedure

**Step 1: Acknowledge**

```
Lead: Recognize you've started implementing while teammates work
      This defeats purpose of team (kills parallelism benefit)
```

**Step 2: Pause own work** (immediate)

```
Lead: Save current work (commit or stash)
      TaskUpdate for any open tasks to "on_hold" or revert
      Check TaskList for teammate progress
      Read teammate MEMORY.md for findings
```

**Step 3: Reorient to coordination role** (5 min)

```
Lead: What have teammates completed?
      What are they waiting for?
      What decisions do they need from you?

Actions:
- TaskUpdate with approvals if needed
- SendMessage with feedback or next directions
- Don't start new implementation
```

**Step 4: Enforce for next session**

```
Add Stop hook to prevent this:
- If teammate tasks in "in_progress", block stop
- Prevent lead from working on anything other than coordination

Add to CLAUDE.md:
Lead role: Coordinate, approve, synthesize
Lead constraint: Do not implement while teammates work
```

### Prevention

- **Add Stop hook** to prevent this
- **Make role explicit**: "Your job is to coordinate and review, not code"
- **Review this document** before starting team session

---

## Failure Mode 7: Task Never Completes (Ambiguous "Done")

### Detection

- Task stuck in "in_progress" for many hours
- Lead and teammate disagree on what "done" means
- Teammate asks "Is this good enough?" Lead says "no, keep going"
- No checklist to verify completion

### Root Causes

- **Vague task description**: "Profile Agent.execute()" (what counts as thorough?)
- **No deliverables defined**: Teammate doesn't know what evidence to provide
- **Unclear success criteria**: "Optimize for speed" (how fast is enough?)
- **Teammate over-thinking**: Perfectionism, "one more iteration"

### Example Scenario

```
Lead: TaskCreate("Profile Agent.execute()")

Optimizer: Profiles once, finds O(n²)
           TaskUpdate(status: "completed")

Lead: Reads profile, says "Need more detail on cache misses"

Optimizer: "I did basic profiling, thought that was done"

Optimizer: Profiles again with cache analysis
           TaskUpdate(status: "completed")

Lead: Says "What about GC? And memory allocations?"

Back and forth continues...
```

### Recovery Procedure

**Step 1: Define done explicitly**

```
Lead + Teammate: Discuss and agree:
- What is the minimum viable finding?
- What metrics matter (throughput, latency, memory)?
- How confident do we need to be?
- What's the next stopping point?
```

**Step 2: Create checklist**

```
Lead: Update task description with deliverables:

DELIVERABLES (task is DONE when):
- [ ] Profile collected on 1000-message corpus
- [ ] Top 3 hotspots identified (> 5% each)
- [ ] O-complexity analysis for each hotspot
- [ ] Cache hit rates measured
- [ ] Variance < 5% (min 3 runs)
- [ ] Findings documented in MEMORY.md
- [ ] Branch created or pushed if applicable

NOT required (out of scope):
- GC analysis (separate task)
- Memory profiling (separate task)
- Implementation (separate task)
```

**Step 3: Teammate completes to checklist**

```
Optimizer: Works to checklist
           When all boxes checked, done
           TaskUpdate(status: "completed")

Lead can't argue "but what about X" if X not on checklist
```

**Step 4: TaskCompleted hook validates**

```
Hook script:
- Read task deliverables checklist
- For "Profile" tasks: Verify profile.json exists, MEMORY.md has findings
- Accept completion only if deliverables present
```

### Prevention

- **Always create task with deliverables checklist**
- **Use TaskCompleted hook** to validate before accepting completion
- **Define scope boundaries**: "This task does X, NOT Y" (prevents scope creep)

---

## Reference: Detecting Failure Early

### Signals to Watch For

| Signal | Possible Failure | Action |
|---|---|---|
| Task in "in_progress" > 2 hours, no updates | Silent failure or infinite loop | Check teammate logs, consider restart |
| Lead implementing while teammates work | Lead over-working, parallelism lost | Move lead implementation to a new task, enforce with Stop hook |
| Task oscillating between "in_progress" and no change | Ambiguous "done", perfectionism | Define explicit checklist |
| All tasks "blocked", nothing moving | Deadlock, circular dependencies | Break cycle with lead override |
| Teammate forgets branch after compaction | Context loss | Check PreCompact/SessionStart hooks |
| Lead changes requirements mid-task | Stale results | Notify teammate immediately, update task |

### Monitoring Routine (Every 1-2 hours during team session)

```
Lead checklist:
- [ ] TaskList: All in_progress tasks have recent updates?
- [ ] Teammates: Any stuck or stalled?
- [ ] MEMORY.md: Updated with current findings?
- [ ] Status: Still on track, or need course correction?
- [ ] Blockers: Any teammate waiting on lead decision?

If anything looks wrong:
→ SendMessage to teammate: "How's it going? Any blockers?"
→ Escalate if no response in 30 min
```

---

See also: team-structure.md (prevent deadlock via config), agent-teams.md (Claude Code agent team docs)
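
The dependency-graph validation listed under Failure Mode 1's Prevention could look roughly like the sketch below. It is only an illustration: the `blocked_by` mapping (task name to the tasks it waits on) and the `find_cycle` helper are hypothetical, and how a TaskCreated hook would actually receive task data is not specified in this document.

```python
from typing import Dict, List, Optional


def find_cycle(blocked_by: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of task names, or None if acyclic.

    blocked_by maps each task to the tasks it waits on (assumed shape).
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, fully explored
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(task: str) -> Optional[List[str]]:
        color[task] = GRAY
        path.append(task)
        for dep in blocked_by.get(task, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so the loop is closed.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[task] = BLACK
        return None

    for task in blocked_by:
        if color.get(task, WHITE) == WHITE:
            found = visit(task)
            if found:
                return found
    return None


if __name__ == "__main__":
    # The deadlock from the Failure Mode 1 example scenario.
    deadlocked = {
        "Optimizer": ["Reviewer"],
        "Reviewer": ["Benchmarker"],
        "Benchmarker": ["Optimizer"],
    }
    print(find_cycle(deadlocked))
```

A hook built on this idea would reject (or warn about) a new task whenever `find_cycle` returns a non-empty list, and accept it when the result is `None`, as it is for the linear Profile → Implement → Benchmark → Review chain.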
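
The "in_progress for > 2 hours, no updates" signal from Failure Mode 2 can be checked mechanically if task state is available as data. The following is a minimal sketch under that assumption; the `Task` record and `find_stalled` function are hypothetical names, not part of any tool described here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Task:
    # Hypothetical record shape; adapt to however task state is exported.
    name: str
    owner: str
    status: str
    last_update: datetime


def find_stalled(tasks: List[Task], now: datetime,
                 max_silence: timedelta = timedelta(hours=2)) -> List[Task]:
    """Return in_progress tasks whose last update is older than max_silence."""
    return [
        t for t in tasks
        if t.status == "in_progress" and now - t.last_update > max_silence
    ]


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 15, 0)
    tasks = [
        Task("Benchmark batch-size 32", "benchmarker", "in_progress",
             datetime(2024, 1, 1, 12, 30)),
        Task("Review implementation", "reviewer", "in_progress",
             datetime(2024, 1, 1, 14, 45)),
    ]
    for task in find_stalled(tasks, now):
        print(f"Stalled: {task.name} (owner: {task.owner})")
```

The two-hour default mirrors the threshold in the Detection list; a shorter `max_silence` fits tasks with an explicit timeout, such as a benchmark expected to finish within 30 minutes.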
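
The checklist validation described in Failure Mode 7 (accept completion only when every deliverable is ticked) might be sketched as below. This assumes the task description is available as plain text using the `- [ ]` / `- [x]` checkbox style shown in Step 2; `unchecked_deliverables` is a hypothetical helper, and checks such as verifying that profile.json exists are left out.

```python
import re
from typing import List

# Matches "- [ ] item" and "- [x] item"; plain "- item" lines are ignored,
# so the "NOT required (out of scope)" entries never block completion.
_CHECKBOX = re.compile(r"^\s*-\s*\[( |x|X)\]\s*(.+?)\s*$")


def unchecked_deliverables(description: str) -> List[str]:
    """Return checklist items still marked "- [ ]" in a task description."""
    missing = []
    for line in description.splitlines():
        match = _CHECKBOX.match(line)
        if match and match.group(1) == " ":
            missing.append(match.group(2))
    return missing


if __name__ == "__main__":
    description = """DELIVERABLES (task is DONE when):
- [x] Profile collected on 1000-message corpus
- [ ] Cache hit rates measured
NOT required (out of scope):
- GC analysis (separate task)
"""
    remaining = unchecked_deliverables(description)
    if remaining:
        print("Not done yet:", remaining)
```

An empty result means every listed deliverable is ticked, which is the point at which a completion could reasonably be accepted.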