codeflash-agent/plugin/references/shared/failure-modes.md
Kevin Turcios 3b59d97647 squash
2026-04-13 14:12:17 -05:00

Failure Modes & Recovery

This document catalogs known failure modes in multi-teammate workflows, how to detect them, root causes, and recovery procedures.

Failure Mode 1: Deadlock (Circular Task Dependencies)

Detection

  • TaskList shows all teammates in "blocked" state
  • No tasks in "in_progress" (all waiting for something)
  • Lead waits for teammates to move, teammates wait for lead or each other
  • Session stalled for > 30 minutes with no progress

Root Causes

  • Circular dependencies: Benchmarker blocked by Reviewer, Reviewer blocked by Benchmarker
  • Missing dependency definition: Lead created tasks without specifying blockedBy/blocks
  • Lead waiting for teammate, teammate waiting for lead approval: Both stalled

Example Scenario

Optimizer blocked by: Reviewer (waiting for code review)
Reviewer blocked by: Benchmarker (wants benchmark results first)
Benchmarker blocked by: Optimizer (waiting for implementation)
Lead: Blocked waiting for Optimizer to complete

Result: No one can move

Recovery Procedure

Step 1: Break the cycle (5 min)

Lead: Manually complete one task in the chain
  Option A: Lead implements the optimization (breaks Optimizer bottleneck)
  Option B: Lead approves current implementation (breaks Reviewer bottleneck)
  Option C: Lead runs benchmark manually (breaks Benchmarker bottleneck)

Step 2: Unblock teammates

Lead: TaskUpdate for the broken task with status: "completed"
      Notify teammates via SendMessage: "Cycle broken, [task] completed, proceed"

Step 3: Replan for next session

Lead: Review dependency graph in CLAUDE.md
      Redefine as a linear chain, not a cycle:
      - Profile (Optimizer) → blocks Implement
      - Implement (Optimizer) → blocks Benchmark
      - Benchmark (Benchmarker) → blocks Review
      - Review (Reviewer) → final step, blocks nothing

Step 4: Document

Add to project MEMORY.md:
- Circular dependency detected: [describe]
- Broken by: [what lead did]
- Prevention: Use team-structure.md decision tree before creating teams

Prevention

  • Add TaskCreated hook to validate dependency graph
  • Use team-structure.md to pick proven configurations (single optimization = linear chain)
  • Always define dependencies when creating tasks
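
The dependency validation suggested above can be sketched as a depth-first search for cycles. This is a minimal illustration, not the actual hook: the `blocked_by` mapping (task name to the tasks it waits on) and the function name `find_cycle` are assumptions made for this example, and how a TaskCreated hook would obtain that mapping is not shown.

```python
from typing import Dict, List, Optional


def find_cycle(blocked_by: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of task names, or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, fully explored
    color = {task: WHITE for task in blocked_by}
    path: List[str] = []

    def visit(task: str) -> Optional[List[str]]:
        color[task] = GRAY
        path.append(task)
        for dep in blocked_by.get(task, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so the loop is closed
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[task] = BLACK
        return None

    for task in list(blocked_by):
        if color[task] == WHITE:
            cycle = visit(task)
            if cycle:
                return cycle
    return None


# The deadlock from the example scenario (hypothetical task names)
deadlock = {
    "optimizer": ["reviewer"],
    "reviewer": ["benchmarker"],
    "benchmarker": ["optimizer"],
}
print(find_cycle(deadlock))
```

A validator built this way would reject the task set whenever `find_cycle` returns a list, and report that list to the lead so the offending dependency can be removed.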

Failure Mode 2: Silent Teammate Failure

Detection

  • Teammate stops responding (no new messages/updates)
  • Task stuck in "in_progress" for > 2 hours
  • No error message in last response
  • Lead checks TaskList, sees task still assigned but no progress

Root Causes

  • API error (StopFailure hook triggered): Model hit rate limit, timeout, or internal error
  • Infinite loop: Teammate caught in retry logic (unlikely but possible)
  • Network timeout: SSH to VM failed, teammate waiting indefinitely
  • Rare: Teammate hit unrecoverable error and silently stopped

Example Scenario

Benchmarker: Started benchmark on VM, got timeout error
           StopFailure hook fired (no recovery context provided)
           Session ended with no message to lead
           
Lead: Checks TaskList 2 hours later, sees task still "in_progress"
      Has no idea benchmarker failed

Recovery Procedure

Step 1: Verify failure (2 min)

# Check if task is actually stuck
TaskList show:
  - Is task marked "in_progress"?
  - What was the last message/timestamp?
  - Who is the owner?

# Check teammate session logs (if accessible)
ls -la /Users/*/Desktop/work/.claude/projects/*/
grep -iE "error|failed" teammate-session.log

Step 2: Assess damage (2 min)

Questions:
- Did teammate complete any work before failing?
- Did they commit to branch or save results?
- Is MEMORY.md updated with partial findings?
- Can work be resumed from where they left off?

Step 3: Choose recovery path

Path A: Retry with fresh teammate (recommended)

Lead: TaskUpdate(task, owner: "new-teammate-name", 
                 description: "Previous teammate failed. Retry from scratch or check MEMORY.md for partial findings")
      
      TaskCreate if previous teammate created partial work:
        - If branch partially implemented: "Completed implementation on perf/... branch"
        - If benchmark partial: "Validate benchmark results, see .codeflash/.../data/"
        
New teammate: Reads task description + previous teammate MEMORY.md
              Continues from where possible, or redoes from scratch

Path B: Lead takes over (if time-critical)

Lead: Break team pattern temporarily
      Manually complete the failing task
      Continue rest of team as normal
      
This saves time but adds context cost. Document cost in metrics.tsv

Path C: Abandon task (if low priority)

Lead: TaskUpdate(task, status: "deleted")
      SendMessage to remaining teammates: "Benchmark task cancelled, focus on review"

Step 4: Post-mortem

Add to failure-modes.md and project MEMORY.md:
- Failure: [what happened]
- Cause: [why - check logs if possible]
- Recovery: [what lead did]
- Prevention: [what to do differently]

Prevention

  • Monitor teammate progress manually every 1-2 hours if doing long-running tasks
  • Add PostToolUseFailure hook to capture error context and retry info
  • Define task timeouts: "Benchmark should complete within 30 min, alert lead if not"
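
The 2-hour detection threshold and the task-timeout idea can be expressed as a small check over task records. This is a sketch under stated assumptions: the record fields (`id`, `status`, `last_update`) and the function name `find_stalled_tasks` are hypothetical, and exporting TaskList output into this shape is left out.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def find_stalled_tasks(tasks: List[Dict], now: datetime,
                       max_idle: timedelta = timedelta(hours=2)) -> List[str]:
    """Return ids of in_progress tasks whose last update is older than max_idle."""
    stalled = []
    for task in tasks:
        if task["status"] != "in_progress":
            continue  # only running tasks can be silently stuck
        if now - task["last_update"] > max_idle:
            stalled.append(task["id"])
    return stalled
```

Passing a shorter `max_idle` (for example 30 minutes for a benchmark task) turns the same check into a per-task timeout alert.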

Failure Mode 3: Context Loss After Compaction

Detection

  • Teammate resumes session after compaction
  • Forgets current branch or optimization goal
  • Asks "what was I working on?" or implements wrong thing
  • Compaction removed critical state info (branch name, status, next step)

Root Causes

  • PreCompact hook not configured: Critical state not snapshotted before compaction
  • SessionStart hook not re-injecting: State not restored after compaction
  • CLAUDE.md "Compact instructions" too aggressive: Dropped branch/status info
  • Teammate MEMORY.md not updated: No context to fall back to

Example Scenario

Optimizer: Working on perf/batch-size, got 40% gain, was about to benchmark
           Context at 150k tokens → compaction fires
           PreCompact hook missing → branch/status not saved
           SessionStart hook resumes but status.md not injected
           
Optimizer resumes: Reads compacted summary, can't find branch name
                   Runs `git branch | grep perf`, but with many perf/* branches cannot tell which one is current
                   Starts on wrong branch or redoes profiling

Recovery Procedure

Step 1: Restore context (5 min)

Teammate: Check available context in order:

Option A: Read .codeflash/{org}/{project}/status.md
          Should say "Current branch: perf/batch-size"
          Should say "Latest: 40% throughput gain, ready for benchmark"
          
Option B: Read own MEMORY.md
          Should have notes about findings, branch, next steps
          
Option C: Check git
          git branch (lists local branches; the current one is marked with *)
          git log --oneline | head (shows recent commits with messages)
          
Option D: Read compacted summary in session
          Look for "Optimization" or branch/result references

Step 2: Verify before continuing (2 min)

# Confirm right branch
git status
git branch -v

# Confirm right state
git log -1 --format=fuller

# Confirm results exist
ls -la .codeflash/{org}/{project}/data/

# Read what lead said in last interaction
# (Look for approval, next steps in task description)

Step 3: Document if context was lost

If context had to be reconstructed from git/files instead of summary:
  - Note what was missing (branch? status? results?)
  - Check PreCompact hook configuration
  - Check SessionStart hook configuration
  
Add to failure-modes.md:
- Lost: [branch name / status / results]
- Recovered from: [git log / MEMORY.md / files]
- Fix: Configure PreCompact hook to snapshot state

Step 4: Continue work (normal)

Now that context is restored, proceed normally

Prevention

  • Configure PreCompact hook to snapshot before compaction
  • Configure SessionStart hook to inject status after compaction
  • Keep MEMORY.md updated with current branch and findings as you work
  • Update .codeflash/{org}/{project}/status.md regularly, not just at handoff
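
To make the snapshot-and-restore idea concrete, here is a minimal sketch of writing and reading back the state that a PreCompact hook would save and a SessionStart hook would re-inject. The `Current branch:` and `Latest:` lines follow the status.md wording shown in the recovery steps; the `Next step:` line and both function names are assumptions for illustration, and the hook wiring itself is not shown.

```python
from typing import Dict


def render_status(branch: str, latest: str, next_step: str) -> str:
    """Build the text of a status snapshot, one 'Key: value' line per field."""
    return (
        f"Current branch: {branch}\n"
        f"Latest: {latest}\n"
        f"Next step: {next_step}\n"
    )


def parse_status(text: str) -> Dict[str, str]:
    """Parse 'Key: value' lines back into a dict, ignoring lines without a colon."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


snapshot = render_status(
    "perf/batch-size",
    "40% throughput gain, ready for benchmark",
    "run benchmark",
)
print(parse_status(snapshot)["Current branch"])
```

Keeping the format to plain key-value lines means the same file stays readable by a teammate who opens it by hand after compaction.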

Failure Mode 4: Stale Teammate Results

Detection

  • Lead reviews completed task 1-2 hours after completion
  • Requirements have changed since task started
  • Teammate's results are outdated
  • Lead asks for a redo and is frustrated by the wasted time

Root Causes

  • No real-time notification: Lead doesn't check TaskList frequently
  • Long-running tasks: 2-3 hour benchmarks, teammates finish but lead not watching
  • Requirements evolved: New discovery mid-task, teammates didn't know to adjust
  • Lead context lost: Lead forgot teammate was working on this, started duplicate work

Example Scenario

Time 1:00 PM: Lead creates benchmark task for optimizer
              "Benchmark batch-size 32 on typeagent VM"

Time 1:30 PM: Optimizer runs the benchmark, TaskUpdate(status: "completed")
              Results show batch-size 32 gives 40% gain

Time 2:30 PM: Lead discovers new issue: "Actually, test variance is 15%, need < 5%"
              Lead contacts optimizer: "We need tighter tolerance"
              
Time 3:00 PM: Optimizer: "I already finished, results are done"
              Lead: "Those are worthless now, redo with variance validation"
              Benchmark run wasted; 1.5 hours passed between completion and the redo request

Recovery Procedure

Step 1: Assess if results are salvageable (5 min)

Lead: Can we adjust the results, or truly need redo?
      - If just need re-analysis: Have teammate post-process
      - If need re-benchmarking: Full redo required

Step 2: Create new task with updated requirements (5 min)

Lead: TaskCreate("Re-benchmark batch-size 32 with variance < 5%", 
                 owner: "optimizer",
                 description: "Previous run showed 40% gain but variance 15%. Need tighter tolerance. Run at least 10 iterations.")
      
      SendMessage(to: "optimizer"): "Requirements refined, new task created with details"
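
The "variance < 5%" requirement in the new task can be checked mechanically once the runs are in. The sketch below assumes "variance" means the coefficient of variation (sample standard deviation divided by the mean), which is one common reading but is not confirmed by this document; the function names and the sample figures are hypothetical.

```python
import statistics
from typing import List


def relative_spread(samples: List[float]) -> float:
    """Coefficient of variation: sample standard deviation divided by the mean."""
    if len(samples) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(samples)
    if mean == 0:
        raise ValueError("mean is zero; relative spread is undefined")
    return statistics.stdev(samples) / abs(mean)


def meets_tolerance(samples: List[float], max_spread: float = 0.05,
                    min_runs: int = 10) -> bool:
    """True when there are enough runs and their relative spread is within tolerance."""
    return len(samples) >= min_runs and relative_spread(samples) < max_spread


# Illustrative throughput numbers only, not real benchmark results
runs = [100.0, 101.0, 99.5, 100.5, 100.0, 99.0, 101.5, 100.0, 100.5, 99.5]
print(meets_tolerance(runs))
```

Writing the threshold and minimum run count into the task description as explicit numbers lets the teammate verify the result before marking the task completed.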

Step 3: Reuse previous findings (5 min)

Optimizer: Reads new task description
           Reuses MEMORY from previous attempt
           Focuses on variance reduction (more iterations, control environment)
           Much faster than cold start

Step 4: Document and prevent

Add to failure-modes.md:
- Cause: Requirements changed mid-task, lead didn't notify
- Prevention: Lead checks TaskList every 30-60 min for long-running tasks
- Prevention: Use SendMessage ("Requirements changed, task TBD") as soon as the change is discovered

Prevention

  • Set expectation upfront: "Lead will check TaskList every hour for in-progress tasks"
  • Use SendMessage proactively: If requirements change, contact teammate immediately
  • Keep lead engaged: Don't create tasks and disappear for 3 hours
  • Milestone tracking: For long tasks, ask for mid-task updates (e.g., "benchmark started, ETA 1 hour")

Failure Mode 5: Over-Specified Compaction

Detection

  • After compaction + resume, teammate forgets key results (performance numbers, branch state)
  • Compaction summary drops critical info
  • Teammate re-does work they'd already done

Root Causes

  • CLAUDE.md "Compact instructions" too aggressive: Dropped results to save tokens
  • PreCompact hook not configured: Key info not snapshotted before summarization
  • Summary algorithm dropped wrong info: All outputs removed, but results needed

Example Scenario

Optimizer: Profiled code, found O(n²) loop, results in .codeflash/{org}/{project}/data/profile.json
           Session context at 140k tokens
           Compaction fires, summary keeps "O(n²) found" but drops exact numbers
           
After compaction:
Optimizer: Compacted summary says "bottleneck found" but no numbers
           Teammate starts profiling again (déjà vu)
           Lead: "I already have the profile data!"

Recovery Procedure

Step 1: Check what was actually lost (2 min)

# Do the files still exist?
ls -la .codeflash/{org}/{project}/data/profile.json
head .codeflash/{org}/{project}/data/profile.json

# Is it in teammate MEMORY.md?
grep -A 5 "O(n²)" MEMORY.md

# Is it in project MEMORY.md?
grep -A 5 "hotspot" MEMORY.md

Step 2: If files/memory exist, retrieve from there (2 min)

Teammate: Reference the files instead of repeating work
          Use previous profile data as starting point for next phase
          This is why MEMORY.md should be updated regularly

Step 3: If truly lost, redo (time-dependent)

Quick redo (< 5 min): Profile again, it's fast
Expensive redo (> 30 min): Benchmark or full optimization
  → Lead should have caught this, expensive oversight

Step 4: Adjust compaction settings

Edit the root CLAUDE.md:

# Current (too aggressive)
# Compact instructions
When you are using compact, please focus on test output and code changes

# Better (preserve results)
# Compact instructions
When compacting, preserve:
1. Benchmark results and performance deltas
2. Code changes made (git diffs)
3. Active branch state and next steps
4. Key findings and hotspots
Drop: verbose tool output, full file reads, intermediate test runs

Prevention

  • Adjust CLAUDE.md "Compact instructions" to preserve benchmark numbers, branch state, findings
  • Configure PreCompact hook to snapshot before compaction
  • Rely on MEMORY.md and status.md, not compaction summary, for critical state
  • Inject via SessionStart hook after compaction

Failure Mode 6: Lead Doesn't Wait for Teammates

Detection

  • Lead starts implementing while teammates still in "in_progress"
  • Lead context balloons (own work + monitoring teammates)
  • Teammate completes but lead doesn't see because distracted
  • Task ordering breaks (lead gets ahead of profile results)

Root Causes

  • Lead gets impatient: "Teammates are taking too long, I'll start implementing"
  • No enforcement: Stop hook not configured to block lead
  • Lead habit: From single-session work where they do everything
  • Unclear role: Lead thinks they should be coding, not just coordinating

Example Scenario

Lead: TaskCreate("Profile typeagent", owner: "optimizer")

Lead: (3 minutes later) Starts implementing own optimization
      Figures "optimizer will take a while, I can start now"
      Context bloats with both leading AND implementing
      
Optimizer: Finishes profile after 20 min, posts findings
           Lead misses it (absorbed in own implementation)
           Optimizer does TaskUpdate but lead never reads it
           
Result: Lead + Optimizer both working independently, no coordination
        All parallelism benefits lost

Recovery Procedure

Step 1: Acknowledge

Lead: Recognize you've started implementing while teammates work
      This defeats purpose of team (kills parallelism benefit)

Step 2: Pause own work (immediate)

Lead: Save current work (commit or stash)
      TaskUpdate for any open tasks to "on_hold" or revert
      Check TaskList for teammate progress
      Read teammate MEMORY.md for findings

Step 3: Reorient to coordination role (5 min)

Lead: What have teammates completed?
      What are they waiting for?
      What decisions do they need from you?
      
      Actions:
      - TaskUpdate with approvals if needed
      - SendMessage with feedback or next directions
      - Don't start new implementation

Step 4: Enforce for next session

Add Stop hook to prevent this:
  - If teammate tasks in "in_progress", block stop
  - Prevent lead from working on anything other than coordination
  
Add to CLAUDE.md:
  Lead role: Coordinate, approve, synthesize
  Lead constraint: Do not implement while teammates work
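
The blocking rule described for the Stop hook reduces to one decision over the task list. The following is a sketch of that decision only: the task record fields (`id`, `status`, `owner`), the `lead` owner name, and the returned dictionary shape are all assumptions for illustration and may not match what the hook runner actually expects as output.

```python
from typing import Dict, List


def stop_decision(tasks: List[Dict], lead: str = "lead") -> Dict[str, str]:
    """Decide whether the lead may stop, given the current task records."""
    # Only tasks owned by teammates count; the lead's own tasks are ignored
    active = [
        t["id"] for t in tasks
        if t["status"] == "in_progress" and t["owner"] != lead
    ]
    if active:
        reason = (
            "Teammate tasks still in progress: " + ", ".join(active)
            + ". Coordinate and review instead of implementing."
        )
        return {"decision": "block", "reason": reason}
    return {"decision": "allow", "reason": "No teammate tasks in progress."}
```

Including the task ids in the reason gives the lead an immediate list of what to check in TaskList before doing anything else.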

Prevention

  • Add a Stop hook that blocks while teammate tasks are "in_progress"
  • Make role explicit: "Your job is to coordinate and review, not code"
  • Review this document before starting team session

Failure Mode 7: Task Never Completes (Ambiguous "Done")

Detection

  • Task stuck in "in_progress" for many hours
  • Lead and teammate disagree on what "done" means
  • Teammate asks "Is this good enough?" Lead says "no, keep going"
  • No checklist to verify completion

Root Causes

  • Vague task description: "Profile Agent.execute()" (what counts as thorough?)
  • No deliverables defined: Teammate doesn't know what evidence to provide
  • Unclear success criteria: "Optimize for speed" (how fast is enough?)
  • Teammate over-thinking: Perfectionism, "one more iteration"

Example Scenario

Lead: TaskCreate("Profile Agent.execute()")

Optimizer: Profiles once, finds O(n²)
           TaskUpdate(status: "completed")

Lead: Reads profile, says "Need more detail on cache misses"
      Optimizer: "I did basic profiling, thought that was done"

Optimizer: Profiles again with cache analysis
           TaskUpdate(status: "completed")

Lead: Says "What about GC? And memory allocations?"
      Back and forth continues...

Recovery Procedure

Step 1: Define done explicitly

Lead + Teammate: Discuss and agree:
  - What is the minimum viable finding?
  - What metrics matter (throughput, latency, memory)?
  - How confident do we need to be?
  - What's the next stopping point?

Step 2: Create checklist

Lead: Update task description with deliverables:

DELIVERABLES (task is DONE when):
- [ ] Profile collected on 1000-message corpus
- [ ] Top 3 hotspots identified (> 5% each)
- [ ] O-complexity analysis for each hotspot
- [ ] Cache hit rates measured
- [ ] Variance < 5% (min 3 runs)
- [ ] Findings documented in MEMORY.md
- [ ] Branch created or pushed if applicable

NOT required (out of scope):
- GC analysis (separate task)
- Memory profiling (separate task)
- Implementation (separate task)

Step 3: Teammate completes to checklist

Optimizer: Works to checklist
           When all boxes checked, done
           TaskUpdate(status: "completed")
           Lead can't argue "but what about X" if X not on checklist

Step 4: TaskCompleted hook validates

Hook script:
- Read task deliverables checklist
- For "Profile" tasks: Verify profile.json exists, MEMORY.md has findings
- Accept completion only if deliverables present
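
The checklist part of that validation can be sketched as a parser for the `- [ ]` items in the task description. This is an illustration under assumptions: it treats `- [x]` as the checked form, the function names are hypothetical, and the file-existence checks (profile.json, MEMORY.md) mentioned above are not covered here.

```python
import re
from typing import List

# Matches "- [ ] text" (unchecked) and "- [x] text" (checked)
CHECKBOX = re.compile(r"^\s*-\s*\[( |x|X)\]\s*(.+?)\s*$")


def unchecked_deliverables(description: str) -> List[str]:
    """Return the text of every checklist item that is still unchecked."""
    missing = []
    for line in description.splitlines():
        match = CHECKBOX.match(line)
        if match and match.group(1) == " ":
            missing.append(match.group(2))
    return missing


def can_complete(description: str) -> bool:
    """A task may complete only if it has a checklist and nothing is unchecked."""
    has_items = any(CHECKBOX.match(line) for line in description.splitlines())
    return has_items and not unchecked_deliverables(description)
```

Requiring at least one checklist item means a task created without deliverables cannot be marked completed by accident, which pushes the lead to define "done" up front.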

Prevention

  • Always create task with deliverables checklist
  • Use TaskCompleted hook to validate before accepting completion
  • Define scope boundaries: "This task does X, NOT Y" (prevents scope creep)

Reference: Detecting Failure Early

Signals to Watch For

| Signal | Possible Failure | Action |
|---|---|---|
| Task in "in_progress" > 2 hours, no updates | Silent failure or infinite loop | Check teammate logs, consider restart |
| Lead implementing while teammates work | Lead over-working, parallelism lost | Move lead implementation to new task, stop block enforcement |
| Task oscillating between "in_progress" and no change | Ambiguous "done", perfectionism | Define explicit checklist |
| All tasks "blocked", nothing moving | Deadlock, circular dependencies | Break cycle with lead override |
| Teammate forgets branch after compaction | Context loss | Check PreCompact/SessionStart hooks |
| Lead makes requirements change mid-task | Stale results | Notify teammate immediately, update task |

Monitoring Routine (Every 1-2 hours during team session)

Lead checklist:
- [ ] TaskList: All in_progress tasks have recent updates?
- [ ] Teammates: Any stuck or stalled?
- [ ] MEMORY.md: Updated with current findings?
- [ ] Status: Still on track, or need course correction?
- [ ] Blockers: Any teammate waiting on lead decision?

If anything looks wrong:
  → SendMessage to teammate: "How's it going? Any blockers?"
  → Escalate if no response in 30 min

See also: team-structure.md (prevent deadlock via config), agent-teams.md (Claude Code agent team docs)