Kevin Turcios 3b59d97647 squash

2026-04-13 14:12:17 -05:00

Failure Modes & Recovery

This document catalogs known failure modes in multi-teammate workflows, along with how to detect them, their root causes, and recovery procedures.

Failure Mode 1: Deadlock (Circular Task Dependencies)

Detection

TaskList shows all teammates in "blocked" state
No tasks in "in_progress" (all waiting for something)
Lead waits for teammates to move, teammates wait for lead or each other
Session stalled for > 30 minutes with no progress

Root Causes

Circular dependencies: Benchmarker blocked by Reviewer, Reviewer blocked by Benchmarker
Missing dependency definition: Lead created tasks without specifying blockedBy/blocks
Lead waiting for teammate, teammate waiting for lead approval: Both stalled

Example Scenario

Optimizer blocked by: Reviewer (waiting for code review)
Reviewer blocked by: Benchmarker (wants benchmark results first)
Benchmarker blocked by: Optimizer (waiting for implementation)
Lead: Blocked waiting for Optimizer to complete

Result: No one can move

Recovery Procedure

Step 1: Break the cycle (5 min)

Lead: Manually complete one task in the chain
  Option A: Lead implements the optimization (breaks Optimizer bottleneck)
  Option B: Lead approves current implementation (breaks Reviewer bottleneck)
  Option C: Lead runs benchmark manually (breaks Benchmarker bottleneck)

Step 2: Unblock teammates

Lead: TaskUpdate for the broken task with status: "completed"
      Notify teammates via SendMessage: "Cycle broken, [task] completed, proceed"

Step 3: Replan for next session

Lead: Review dependency graph in CLAUDE.md
      Redefine as linear chain, not circular:
      - Profile (Optimizer) → blocks
      - Implement (Optimizer) → blocks  
      - Benchmark (Benchmarker) → blocks
      - Review (Reviewer) → completed

Step 4: Document

Add to project MEMORY.md:
- Circular dependency detected: [describe]
- Broken by: [what lead did]
- Prevention: Use team-structure.md decision tree before creating teams

Prevention

Add TaskCreated hook to validate dependency graph
Use team-structure.md to pick proven configurations (single optimization = linear chain)
Always define dependencies when creating tasks
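
The dependency-graph validation suggested above could look roughly like the sketch below. It is a minimal illustration, not an actual TaskCreated hook: the `blocked_by` mapping and the `find_cycle` helper are hypothetical names, and how a real hook would obtain the task graph is left open. It uses a depth-first search to report one cycle if any exists.

```python
from typing import Dict, List, Optional


def find_cycle(blocked_by: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of task names, or None if acyclic.

    `blocked_by` maps each task to the tasks it waits on (hypothetical shape).
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(task: str) -> Optional[List[str]]:
        color[task] = GRAY
        path.append(task)
        for dep in blocked_by.get(task, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so we closed a loop
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[task] = BLACK
        return None

    for task in blocked_by:
        if color.get(task, WHITE) == WHITE:
            cycle = visit(task)
            if cycle:
                return cycle
    return None


# The deadlock from the example scenario
deadlocked = {
    "optimizer": ["reviewer"],
    "reviewer": ["benchmarker"],
    "benchmarker": ["optimizer"],
}
# The linear chain recommended in Step 3
linear = {
    "profile": [],
    "implement": ["profile"],
    "benchmark": ["implement"],
    "review": ["benchmark"],
}

print(find_cycle(deadlocked))
print(find_cycle(linear))
```

Running a check like this before tasks are assigned would flag the Optimizer/Reviewer/Benchmarker loop up front, while the linear chain passes.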

Failure Mode 2: Silent Teammate Failure

Detection

Teammate stops responding (no new messages/updates)
Task stuck in "in_progress" for > 2 hours
No error message in last response
Lead checks TaskList, sees task still assigned but no progress

Root Causes

API error (StopFailure hook triggered): Model hit rate limit, timeout, or internal error
Infinite loop: Teammate caught in retry logic (unlikely but possible)
Network timeout: SSH to VM failed, teammate waiting indefinitely
Rare: Teammate hit unrecoverable error and silently stopped

Example Scenario

Benchmarker: Started benchmark on VM, got timeout error
           StopFailure hook fired (no recovery context provided)
           Session ended with no message to lead
           
Lead: Checks TaskList 2 hours later, sees task still "in_progress"
      Has no idea benchmarker failed

Recovery Procedure

Step 1: Verify failure (2 min)

# Check if the task is actually stuck. In TaskList, confirm:
#   - Is the task marked "in_progress"?
#   - What was the last message/timestamp?
#   - Who is the owner?

# Check teammate session logs (if accessible)
ls -la /Users/*/Desktop/work/.claude/projects/*/
grep -E "error|failed" teammate-session.log

Step 2: Assess damage (2 min)

Questions:
- Did teammate complete any work before failing?
- Did they commit to branch or save results?
- Is MEMORY.md updated with partial findings?
- Can work be resumed from where they left off?

Step 3: Choose recovery path

Path A: Retry with fresh teammate (recommended)

Lead: TaskUpdate(task, owner: "new-teammate-name", 
                 description: "Previous teammate failed. Retry from scratch or check MEMORY.md for partial findings")
      
      TaskCreate if previous teammate created partial work:
        - If branch partially implemented: "Complete implementation on perf/... branch"
        - If benchmark partial: "Validate benchmark results, see .codeflash/.../data/"
        
New teammate: Reads task description + previous teammate MEMORY.md
              Continues from where possible, or redoes from scratch

Path B: Lead takes over (if time-critical)

Lead: Break team pattern temporarily
      Manually complete the failing task
      Continue rest of team as normal
      
This saves time but adds context cost. Document cost in metrics.tsv

Path C: Abandon task (if low priority)

Lead: TaskUpdate(task, status: "deleted")
      SendMessage to remaining teammates: "Benchmark task cancelled, focus on review"

Step 4: Post-mortem

Add to failure-modes.md and project MEMORY.md:
- Failure: [what happened]
- Cause: [why - check logs if possible]
- Recovery: [what lead did]
- Prevention: [what to do differently]

Prevention

Monitor teammate progress manually every 1-2 hours if doing long-running tasks
Add PostToolUseFailure hook to capture error context and retry info
Define task timeouts: "Benchmark should complete within 30 min, alert lead if not"
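
The "stuck in in_progress for > 2 hours" signal can be checked mechanically. The sketch below assumes task records with `name`, `status`, and `last_update` fields; those field names and the `find_stale_tasks` helper are illustrative only and do not reflect the actual TaskList output format.

```python
from datetime import datetime, timedelta


def find_stale_tasks(tasks, now, limit=timedelta(hours=2)):
    """Return names of in-progress tasks whose last update is older than `limit`."""
    return [
        t["name"]
        for t in tasks
        if t["status"] == "in_progress" and now - t["last_update"] > limit
    ]


# Hypothetical snapshot of a task list at 3:00 PM
now = datetime(2026, 4, 13, 15, 0)
tasks = [
    {"name": "benchmark", "status": "in_progress",
     "last_update": datetime(2026, 4, 13, 12, 30)},
    {"name": "review", "status": "in_progress",
     "last_update": datetime(2026, 4, 13, 14, 45)},
    {"name": "profile", "status": "completed",
     "last_update": datetime(2026, 4, 13, 11, 0)},
]

print(find_stale_tasks(tasks, now))
```

Passing a shorter `limit` (for example 30 minutes for a benchmark) would implement the per-task timeout idea from the list above.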

Failure Mode 3: Context Loss After Compaction

Detection

Teammate resumes session after compaction
Forgets current branch or optimization goal
Asks "what was I working on?" or implements wrong thing
Compaction removed critical state info (branch name, status, next step)

Root Causes

PreCompact hook not configured: Critical state not snapshotted before compaction
SessionStart hook not re-injecting: State not restored after compaction
CLAUDE.md "Compact instructions" too aggressive: Dropped branch/status info
Teammate MEMORY.md not updated: No context to fall back to

Example Scenario

Optimizer: Working on perf/batch-size, got 40% gain, was about to benchmark
           Context at 150k tokens → compaction fires
           PreCompact hook missing → branch/status not saved
           SessionStart hook resumes but status.md not injected
           
Optimizer resumes: Reads compacted summary, can't find branch name
                   Runs `git branch | grep perf`, but with many perf/* branches cannot tell which one is current
                   Starts on wrong branch or redoes profiling

Recovery Procedure

Step 1: Restore context (5 min)

Teammate: Check available context in order:

Option A: Read .codeflash/{org}/{project}/status.md
          Should say "Current branch: perf/batch-size"
          Should say "Latest: 40% throughput gain, ready for benchmark"
          
Option B: Read own MEMORY.md
          Should have notes about findings, branch, next steps
          
Option C: Check git
          git branch (shows current and recent branches)
          git log --oneline | head (shows recent commits with messages)
          
Option D: Read compacted summary in session
          Look for "Optimization" or branch/result references

Step 2: Verify before continuing (2 min)

# Confirm right branch
git status
git branch -v

# Confirm right state
git log -1 --format=fuller

# Confirm results exist
ls -la .codeflash/{org}/{project}/data/

# Read what lead said in last interaction
# (Look for approval, next steps in task description)

Step 3: Document if context was lost

If context had to be reconstructed from git/files instead of summary:
  - Note what was missing (branch? status? results?)
  - Check PreCompact hook configuration
  - Check SessionStart hook configuration
  
Add to failure-modes.md:
- Lost: [branch name / status / results]
- Recovered from: [git log / MEMORY.md / files]
- Fix: Configure PreCompact hook to snapshot state

Step 4: Continue work (normal)

Now that context is restored, proceed normally

Prevention

Configure PreCompact hook to snapshot before compaction
Configure SessionStart hook to inject status after compaction
Keep MEMORY.md updated with current branch and findings as you work
Update .codeflash/{org}/{project}/status.md regularly, not just at handoff
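
One way to make the snapshot and re-inject steps concrete is a small, line-oriented status format that is trivial to write before compaction and to parse afterwards. The sketch below is an assumption about what status.md might contain: the "Current branch:" and "Latest:" lines follow the recovery procedure above, while the "Next step:" line and both helper names are hypothetical. Wiring these into PreCompact or SessionStart hooks is not shown.

```python
def render_status(branch, latest, next_step):
    """Build the text a pre-compaction snapshot could write to status.md."""
    return (
        f"Current branch: {branch}\n"
        f"Latest: {latest}\n"
        f"Next step: {next_step}\n"  # hypothetical extra field
    )


def parse_status(text):
    """Recover 'Key: value' fields from a status snapshot."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # skip lines that are not 'Key: value'
            fields[key] = value
    return fields


snapshot = render_status(
    "perf/batch-size",
    "40% throughput gain, ready for benchmark",
    "run benchmark",
)
restored = parse_status(snapshot)
print(restored["Current branch"])
```

Keeping the format this simple means a teammate can also read it by eye when the hooks are missing.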

Failure Mode 4: Stale Teammate Results

Detection

Lead reviews completed task 1-2 hours after completion
Requirements have changed since task started
Teammate's results are outdated
Lead asks for redo but frustrated by wasted time

Root Causes

No real-time notification: Lead doesn't check TaskList frequently
Long-running tasks: 2-3 hour benchmarks, teammates finish but lead not watching
Requirements evolved: New discovery mid-task, teammates didn't know to adjust
Lead context lost: Lead forgot teammate was working on this, started duplicate work

Example Scenario

Time 1:00 PM: Lead creates benchmark task for optimizer
              "Benchmark batch-size 32 on typeagent VM"

Time 1:30 PM: Optimizer runs the benchmark, TaskUpdate(status: "completed")
              Results show batch-size 32 gives 40% gain

Time 2:30 PM: Lead discovers new issue: "Actually, test variance is 15%, need < 5%"
              Lead contacts optimizer: "We need tighter tolerance"
              
Time 3:00 PM: Optimizer: "I already finished, results are done"
              Lead: "Those are worthless now, redo with variance validation"
              Wasted 1.5 hours of benchmarking

Recovery Procedure

Step 1: Assess if results are salvageable (5 min)

Lead: Can we adjust the results, or truly need redo?
      - If just need re-analysis: Have teammate post-process
      - If need re-benchmarking: Full redo required

Step 2: Create new task with updated requirements (5 min)

Lead: TaskCreate("Re-benchmark batch-size 32 with variance < 5%", 
                 owner: "optimizer",
                 description: "Previous run showed 40% gain but variance 15%. Need tighter tolerance. Run at least 10 iterations.")
      
      SendMessage(to: "optimizer"): "Requirements refined, new task created with details"

Step 3: Reuse previous findings (5 min)

Optimizer: Reads new task description
           Reuses MEMORY from previous attempt
           Focuses on variance reduction (more iterations, control environment)
           Much faster than cold start
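
The "variance < 5%" requirement is not defined precisely in the task description. A minimal sketch, assuming it means the sample standard deviation relative to the mean (coefficient of variation) over at least 10 runs, could be checked like this. The helper names and the sample timings are invented for illustration.

```python
from statistics import mean, stdev


def relative_spread(samples):
    """Sample standard deviation divided by the mean (coefficient of variation)."""
    return stdev(samples) / mean(samples)


def is_stable(samples, max_spread=0.05, min_runs=10):
    """True if there are enough runs and their relative spread is under the limit."""
    return len(samples) >= min_runs and relative_spread(samples) < max_spread


# Made-up throughput numbers for two benchmark sessions
noisy = [100, 130, 85, 120, 95, 110, 80, 125, 90, 115]
steady = [100, 102, 99, 101, 100, 98, 101, 100, 99, 102]

print(is_stable(noisy), is_stable(steady))
```

Stating the exact metric in the task description avoids a second round of "that is not what I meant" after the re-benchmark.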

Step 4: Document and prevent

Add to failure-modes.md:
- Cause: Requirements changed mid-task, lead didn't notify
- Prevention: Lead checks TaskList every 30-60 min for long-running tasks
- Prevention: Use SendMessage ("Requirements changed, task TBD") as soon as the change is discovered

Prevention

Set expectation upfront: "Lead will check TaskList every hour for in-progress tasks"
Use SendMessage proactively: If requirements change, contact teammate immediately
Keep lead engaged: Don't create tasks and disappear for 3 hours
Milestone tracking: For long tasks, ask for mid-task updates (e.g., "benchmark started, ETA 1 hour")

Failure Mode 5: Over-Specified Compaction

Detection

After compaction + resume, teammate forgets key results (performance numbers, branch state)
Compaction summary drops critical info
Teammate re-does work they'd already done

Root Causes

CLAUDE.md "Compact instructions" too aggressive: Dropped results to save tokens
PreCompact hook not configured: Key info not snapshotted before summarization
Summary algorithm dropped wrong info: All outputs removed, but results needed

Example Scenario

Optimizer: Profiled code, found O(n²) loop, results in .codeflash/data/profile.json
           Session context at 140k tokens
           Compaction fires, summary keeps "O(n²) found" but drops exact numbers
           
After compaction:
Optimizer: Compacted summary says "bottleneck found" but no numbers
           Teammate starts profiling again (déjà vu)
           Lead: "I already have the profile data!"

Recovery Procedure

Step 1: Check what was actually lost (2 min)

# Do the files still exist?
ls -la .codeflash/{org}/{project}/data/profile.json
head .codeflash/{org}/{project}/data/profile.json

# Is it in teammate MEMORY.md?
grep -A 5 "O(n²)" MEMORY.md

# Is it in project MEMORY.md?
grep -A 5 "hotspot" MEMORY.md

Step 2: If files/memory exist, retrieve from there (2 min)

Teammate: Reference the files instead of repeating work
          Use previous profile data as starting point for next phase
          This is why MEMORY.md should be updated regularly

Step 3: If truly lost, redo (time-dependent)

Quick redo (< 5 min): Profile again, it's fast
Expensive redo (> 30 min): Benchmark or full optimization
  → Lead should have caught this, expensive oversight

Step 4: Adjust compaction settings

Edit the root CLAUDE.md:

# Current (too aggressive)
# Compact instructions
When you are using compact, please focus on test output and code changes

# Better (preserve results)
# Compact instructions
When compacting, preserve:
1. Benchmark results and performance deltas
2. Code changes made (git diffs)
3. Active branch state and next steps
4. Key findings and hotspots
Drop: verbose tool output, full file reads, intermediate test runs

Prevention

Adjust CLAUDE.md "Compact instructions" to preserve benchmark numbers, branch state, findings
Configure PreCompact hook to snapshot before compaction
Rely on MEMORY.md and status.md, not compaction summary, for critical state
Inject via SessionStart hook after compaction

Failure Mode 6: Lead Doesn't Wait for Teammates

Detection

Lead starts implementing while teammates still in "in_progress"
Lead context balloons (own work + monitoring teammates)
Teammate completes but lead doesn't see because distracted
Task ordering breaks (lead gets ahead of profile results)

Root Causes

Lead gets impatient: "Teammates are taking too long, I'll start implementing"
No enforcement: Stop hook not configured to block lead
Lead habit: From single-session work where they do everything
Unclear role: Lead thinks they should be coding, not just coordinating

Example Scenario

Lead: TaskCreate("Profile typeagent", owner: "optimizer")

Lead: (3 minutes later) Starts implementing own optimization
      Figures "optimizer will take a while, I can start now"
      Context bloats with both leading AND implementing
      
Optimizer: Finishes profile after 20 min, posts findings
           Lead misses it (absorbed in own implementation)
           Optimizer does TaskUpdate but lead never reads
           
Result: Lead + Optimizer both working independently, no coordination
        All parallelism benefits lost

Recovery Procedure

Step 1: Acknowledge

Lead: Recognize you've started implementing while teammates work
      This defeats purpose of team (kills parallelism benefit)

Step 2: Pause own work (immediate)

Lead: Save current work (commit or stash)
      TaskUpdate for any open tasks to "on_hold" or revert
      Check TaskList for teammate progress
      Read teammate MEMORY.md for findings

Step 3: Reorient to coordination role (5 min)

Lead: What have teammates completed?
      What are they waiting for?
      What decisions do they need from you?
      
      Actions:
      - TaskUpdate with approvals if needed
      - SendMessage with feedback or next directions
      - Don't start new implementation

Step 4: Enforce for next session

Add Stop hook to prevent this:
  - If teammate tasks in "in_progress", block stop
  - Prevent lead from working on anything other than coordination
  
Add to CLAUDE.md:
  Lead role: Coordinate, approve, synthesize
  Lead constraint: Do not implement while teammates work
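
The decision logic behind such a Stop hook can be sketched as a pure function. This only computes whether the lead should be blocked; how a real hook receives the task list and reports a block is not covered here, and the record fields (`name`, `owner`, `status`) and the `stop_decision` helper are hypothetical.

```python
def stop_decision(tasks, lead="lead"):
    """Return (allow_stop, reason) for the lead based on teammate task states."""
    active = [
        t["name"]
        for t in tasks
        if t["owner"] != lead and t["status"] == "in_progress"
    ]
    if active:
        return False, "Teammate tasks still in progress: " + ", ".join(active)
    return True, "No teammate tasks in progress"


# Hypothetical task list while the optimizer is still profiling
tasks = [
    {"name": "Profile typeagent", "owner": "optimizer", "status": "in_progress"},
    {"name": "Synthesize findings", "owner": "lead", "status": "in_progress"},
]

allowed, reason = stop_decision(tasks)
print(allowed, reason)
```

The lead's own in-progress tasks are ignored on purpose: the rule is about waiting for teammates, not about the lead's coordination work.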

Prevention

Add Stop hook to prevent this
Make role explicit: "Your job is to coordinate and review, not code"
Review this document before starting team session

Failure Mode 7: Task Never Completes (Ambiguous "Done")

Detection

Task stuck in "in_progress" for many hours
Lead and teammate disagree on what "done" means
Teammate asks "Is this good enough?" Lead says "no, keep going"
No checklist to verify completion

Root Causes

Vague task description: "Profile Agent.execute()" (what counts as thorough?)
No deliverables defined: Teammate doesn't know what evidence to provide
Unclear success criteria: "Optimize for speed" (how fast is enough?)
Teammate over-thinking: Perfectionism, "one more iteration"

Example Scenario

Lead: TaskCreate("Profile Agent.execute()")

Optimizer: Profiles once, finds O(n²)
           TaskUpdate(status: "completed")

Lead: Reads profile, says "Need more detail on cache misses"
      Optimizer: "I did basic profiling, thought that was done"

Optimizer: Profiles again with cache analysis
           TaskUpdate(status: "completed")

Lead: Says "What about GC? And memory allocations?"
      Back and forth continues...

Recovery Procedure

Step 1: Define done explicitly

Lead + Teammate: Discuss and agree:
  - What is the minimum viable finding?
  - What metrics matter (throughput, latency, memory)?
  - How confident do we need to be?
  - What's the next stopping point?

Step 2: Create checklist

Lead: Update task description with deliverables:

DELIVERABLES (task is DONE when):
- [ ] Profile collected on 1000-message corpus
- [ ] Top 3 hotspots identified (> 5% each)
- [ ] O-complexity analysis for each hotspot
- [ ] Cache hit rates measured
- [ ] Variance < 5% (min 3 runs)
- [ ] Findings documented in MEMORY.md
- [ ] Branch created or pushed if applicable

NOT required (out of scope):
- GC analysis (separate task)
- Memory profiling (separate task)
- Implementation (separate task)

Step 3: Teammate completes to checklist

Optimizer: Works to checklist
           When all boxes checked, done
           TaskUpdate(status: "completed")
           Lead can't argue "but what about X" if X not on checklist

Step 4: TaskCompleted hook validates

Hook script:
- Read task deliverables checklist
- For "Profile" tasks: Verify profile.json exists, MEMORY.md has findings
- Accept completion only if deliverables present

Prevention

Always create task with deliverables checklist
Use TaskCompleted hook to validate before accepting completion
Define scope boundaries: "This task does X, NOT Y" (prevents scope creep)

Reference: Detecting Failure Early

Signals to Watch For

Signal	Possible Failure	Action
Task in "in_progress" > 2 hours, no updates	Silent failure or infinite loop	Check teammate logs, consider restart
Lead implementing while teammates work	Lead over-working, parallelism lost	Move lead's implementation into a separate task; enforce with a Stop hook
Task oscillating between "in_progress" and "completed" with no real change	Ambiguous "done", perfectionism	Define explicit checklist
All tasks "blocked", nothing moving	Deadlock, circular dependencies	Break cycle with lead override
Teammate forgets branch after compaction	Context loss	Check PreCompact/SessionStart hooks
Lead changes requirements mid-task	Stale results	Notify teammate immediately, update task

Monitoring Routine (Every 1-2 hours during team session)

Lead checklist:
- [ ] TaskList: All in_progress tasks have recent updates?
- [ ] Teammates: Any stuck or stalled?
- [ ] MEMORY.md: Updated with current findings?
- [ ] Status: Still on track, or need course correction?
- [ ] Blockers: Any teammate waiting on lead decision?

If anything looks wrong:
  → SendMessage to teammate: "How's it going? Any blockers?"
  → Escalate if no response in 30 min

See also: team-structure.md (prevent deadlock via config), agent-teams.md (Claude Code agent team docs)
