codeflash-agent/plugin/references/shared/failure-modes.md

# Failure Modes & Recovery

This document catalogs known failure modes in multi-teammate workflows. For each one it covers how to detect it, its root causes, and how to recover.

## Failure Mode 1: Deadlock (Circular Task Dependencies)

### Detection
- TaskList shows all teammates in "blocked" state
- No tasks in "in_progress" (all waiting for something)
- Lead waits for teammates to move, teammates wait for lead or each other
- Session stalled for > 30 minutes with no progress

### Root Causes
- **Circular dependencies**: Benchmarker blocked by Reviewer, Reviewer blocked by Benchmarker
- **Missing dependency definition**: Lead created tasks without specifying blockedBy/blocks
- **Lead waiting for teammate, teammate waiting for lead approval**: Both stalled

### Example Scenario
```
Optimizer blocked by: Reviewer (waiting for code review)
Reviewer blocked by: Benchmarker (wants benchmark results first)
Benchmarker blocked by: Optimizer (waiting for implementation)
Lead: Blocked waiting for Optimizer to complete

Result: No one can move
```

### Recovery Procedure

**Step 1: Break the cycle** (5 min)
```
Lead: Manually complete one task in the chain
  Option A: Lead implements the optimization (breaks Optimizer bottleneck)
  Option B: Lead approves current implementation (breaks Reviewer bottleneck)
  Option C: Lead runs benchmark manually (breaks Benchmarker bottleneck)
```

**Step 2: Unblock teammates**
```
Lead: TaskUpdate for the broken task with status: "completed"
      Notify teammates via SendMessage: "Cycle broken, [task] completed, proceed"
```

**Step 3: Replan for next session**
```
Lead: Review dependency graph in CLAUDE.md
      Redefine as linear chain, not circular:
      - Profile (Optimizer) → blocks
      - Implement (Optimizer) → blocks
      - Benchmark (Benchmarker) → blocks
      - Review (Reviewer) → completed
```

**Step 4: Document**
```
Add to project MEMORY.md:
- Circular dependency detected: [describe]
- Broken by: [what lead did]
- Prevention: Use team-structure.md decision tree before creating teams
```

### Prevention
- Add TaskCreated hook to validate dependency graph
- Use team-structure.md to pick proven configurations (single optimization = linear chain)
- Always define dependencies when creating tasks
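
The dependency-graph validation mentioned above could look roughly like the following. This is a minimal sketch, not the hook itself: it assumes the hook can obtain a mapping from each task to the tasks it is blocked by, and the function name `find_cycle` and the dictionary shape are hypothetical.

```python
from typing import Dict, List, Optional


def find_cycle(blocked_by: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of task names, or None.

    `blocked_by` maps each task to the tasks it is waiting on (assumed shape).
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, fully explored
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(task: str) -> Optional[List[str]]:
        color[task] = GRAY
        path.append(task)
        for dep in blocked_by.get(task, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so this edge closes a cycle
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[task] = BLACK
        return None

    for task in blocked_by:
        if color.get(task, WHITE) == WHITE:
            found = visit(task)
            if found:
                return found
    return None


# The deadlock from the example scenario
deadlocked = {
    "optimizer": ["reviewer"],
    "reviewer": ["benchmarker"],
    "benchmarker": ["optimizer"],
}
# The linear chain recommended in Step 3
linear = {
    "profile": [],
    "implement": ["profile"],
    "benchmark": ["implement"],
    "review": ["benchmark"],
}

cycle = find_cycle(deadlocked)
print("cycle:", cycle)
print("linear chain cycle:", find_cycle(linear))
```

A hook built on this idea would reject a new task whenever adding its `blockedBy` entries makes `find_cycle` return a non-empty result.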

---

## Failure Mode 2: Silent Teammate Failure

### Detection
- Teammate stops responding (no new messages/updates)
- Task stuck in "in_progress" for > 2 hours
- No error message in last response
- Lead checks TaskList, sees task still assigned but no progress

### Root Causes
- **API error** (StopFailure hook triggered): Model hit rate limit, timeout, or internal error
- **Infinite loop**: Teammate caught in retry logic (unlikely but possible)
- **Network timeout**: SSH to VM failed, teammate waiting indefinitely
- **Unrecoverable error** (rare): Teammate hit an error it could not recover from and stopped without reporting it

### Example Scenario
```
Benchmarker: Started benchmark on VM, got timeout error
           StopFailure hook fired (no recovery context provided)
           Session ended with no message to lead

Lead: Checks TaskList 2 hours later, sees task still "in_progress"
      Has no idea benchmarker failed
```

### Recovery Procedure

**Step 1: Verify failure** (2 min)
```bash
# Check whether the task is actually stuck. In TaskList, confirm:
#   - Is the task marked "in_progress"?
#   - What was the last message/timestamp?
#   - Who is the owner?

# Check teammate session logs (if accessible)
ls -la /Users/*/Desktop/work/.claude/projects/*/
grep -E "error|failed" teammate-session.log
```

**Step 2: Assess damage** (2 min)
```
Questions:
- Did teammate complete any work before failing?
- Did they commit to branch or save results?
- Is MEMORY.md updated with partial findings?
- Can work be resumed from where they left off?
```

**Step 3: Choose recovery path**

**Path A: Retry with fresh teammate** (recommended)
```
Lead: TaskUpdate(task, owner: "new-teammate-name",
                 description: "Previous teammate failed. Retry from scratch or check MEMORY.md for partial findings")

      TaskCreate if previous teammate created partial work:
        - If branch partially implemented: "Complete implementation on perf/... branch"
        - If benchmark partial: "Validate benchmark results, see .codeflash/.../data/"

New teammate: Reads task description + previous teammate MEMORY.md
              Continues from where possible, or redoes from scratch
```

**Path B: Lead takes over** (if time-critical)
```
Lead: Break team pattern temporarily
      Manually complete the failing task
      Continue rest of team as normal

This saves time but adds context cost. Document cost in metrics.tsv
```

**Path C: Abandon task** (if low priority)
```
Lead: TaskUpdate(task, status: "deleted")
      SendMessage to remaining teammates: "Benchmark task cancelled, focus on review"
```

**Step 4: Post-mortem**
```
Add to failure-modes.md and project MEMORY.md:
- Failure: [what happened]
- Cause: [why - check logs if possible]
- Recovery: [what lead did]
- Prevention: [what to do differently]
```

### Prevention
- Check teammate progress manually every 1-2 hours during long-running tasks
- Add PostToolUseFailure hook to capture error context and retry info
- Define task timeouts: "Benchmark should complete within 30 min, alert lead if not"
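
The timeout check could be sketched as follows. This is an illustration only: the task fields `id`, `status`, and `last_update` are hypothetical, and how a lead exports TaskList data into this shape is not covered here. The 2-hour default mirrors the detection threshold used in this section.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def find_stale_tasks(tasks: List[Dict], now: datetime,
                     max_idle: timedelta = timedelta(hours=2)) -> List[str]:
    """Return ids of in-progress tasks with no update for longer than max_idle."""
    stale = []
    for task in tasks:
        if task["status"] != "in_progress":
            continue  # only in-progress tasks can fail silently
        if now - task["last_update"] > max_idle:
            stale.append(task["id"])
    return stale


# Illustrative data; field names are assumptions, not a TaskList schema
now = datetime(2024, 1, 1, 15, 0)
tasks = [
    {"id": "benchmark", "status": "in_progress", "last_update": datetime(2024, 1, 1, 12, 30)},
    {"id": "review", "status": "in_progress", "last_update": datetime(2024, 1, 1, 14, 45)},
    {"id": "profile", "status": "completed", "last_update": datetime(2024, 1, 1, 9, 0)},
]
stale = find_stale_tasks(tasks, now)
print(stale)
```

Passing a shorter `max_idle`, such as `timedelta(minutes=30)` for a benchmark task, gives the per-task timeout described in the last bullet.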

---

## Failure Mode 3: Context Loss After Compaction

### Detection
- Teammate resumes session after compaction
- Forgets current branch or optimization goal
- Asks "what was I working on?" or implements wrong thing
- Compaction removed critical state info (branch name, status, next step)

### Root Causes
- **PreCompact hook not configured**: Critical state not snapshotted before compaction
- **SessionStart hook not re-injecting**: State not restored after compaction
- **CLAUDE.md "Compact instructions" too aggressive**: Dropped branch/status info
- **Teammate MEMORY.md not updated**: No context to fall back to

### Example Scenario
```
Optimizer: Working on perf/batch-size, got 40% gain, was about to benchmark
           Context at 150k tokens → compaction fires
           PreCompact hook missing → branch/status not saved
           SessionStart hook resumes but status.md not injected

Optimizer resumes: Reads compacted summary, can't find branch name
                   Runs `git branch | grep perf`, but with many perf/* branches can't tell which one is current
                   Starts on wrong branch or redoes profiling
```

### Recovery Procedure

**Step 1: Restore context** (5 min)
```
Teammate: Check available context in order:

Option A: Read .codeflash/{teammember}/{org}/{project}/status.md
          Should say "Current branch: perf/batch-size"
          Should say "Latest: 40% throughput gain, ready for benchmark"

Option B: Read own MEMORY.md
          Should have notes about findings, branch, next steps

Option C: Check git
          git branch (shows current and recent branches)
          git log --oneline | head (shows recent commits with messages)

Option D: Read compacted summary in session
          Look for "Optimization" or branch/result references
```

**Step 2: Verify before continuing** (2 min)
```bash
# Confirm right branch
git status
git branch -v

# Confirm right state
git log -1 --format=fuller

# Confirm results exist
ls -la .codeflash/{teammember}/{org}/{project}/data/

# Read what lead said in last interaction
# (Look for approval, next steps in task description)
```

**Step 3: Document if context was lost**
```
If context had to be reconstructed from git/files instead of summary:
  - Note what was missing (branch? status? results?)
  - Check PreCompact hook configuration
  - Check SessionStart hook configuration

Add to failure-modes.md:
- Lost: [branch name / status / results]
- Recovered from: [git log / MEMORY.md / files]
- Fix: Configure PreCompact hook to snapshot state
```

**Step 4: Continue work** (normal)
```
Now that context is restored, proceed normally
```

### Prevention
- **Configure PreCompact hook** to snapshot before compaction
- **Configure SessionStart hook** to inject status after compaction
- **Keep MEMORY.md updated** with current branch and findings as you work
- **Update .codeflash/{teammember}/{org}/{project}/status.md** regularly, not just at handoff
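
One way to keep status.md current is a small helper that both the teammate and the hooks can call. The sketch below assumes a plain `Key: value` layout. The `Current branch` and `Latest` lines follow the recovery example in this section, while the `Next step` line and the function names are assumptions added for illustration.

```python
import tempfile
from pathlib import Path
from typing import Dict


def write_status(path: Path, branch: str, latest: str, next_step: str) -> None:
    """Write a small status file that does not depend on the compaction summary."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        f"Current branch: {branch}\n"
        f"Latest: {latest}\n"
        f"Next step: {next_step}\n",  # hypothetical extra field
        encoding="utf-8",
    )


def read_status(path: Path) -> Dict[str, str]:
    """Parse 'Key: value' lines back into a dict; returns {} if the file is missing."""
    if not path.exists():
        return {}
    state = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            state[key] = value
    return state


# Round trip in a temporary directory (stand-in for the real status.md location)
with tempfile.TemporaryDirectory() as tmp:
    status_path = Path(tmp) / "status.md"
    write_status(status_path, "perf/batch-size",
                 "40% throughput gain, ready for benchmark", "Run benchmark")
    restored = read_status(status_path)
    missing = read_status(Path(tmp) / "does-not-exist.md")

print(restored["Current branch"])
```

A PreCompact hook could call something like `write_status`, and a SessionStart hook could print the result of `read_status` so the state is re-injected after compaction.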

---

## Failure Mode 4: Stale Teammate Results

### Detection
- Lead reviews completed task 1-2 hours after completion
- Requirements have changed since task started
- Teammate's results are outdated
- Lead asks for a redo, and both sides are frustrated by the wasted time

### Root Causes
- **No real-time notification**: Lead doesn't check TaskList frequently
- **Long-running tasks**: 2-3 hour benchmarks, teammates finish but lead not watching
- **Requirements evolved**: New discovery mid-task, teammates didn't know to adjust
- **Lead context lost**: Lead forgot teammate was working on this, started duplicate work

### Example Scenario
```
Time 1:00 PM: Lead creates benchmark task for optimizer
              "Benchmark batch-size 32 on typeagent VM"

Time 1:30 PM: Optimizer implements, TaskUpdate(status: "completed")
              Results show batch-size 32 gives 40% gain

Time 2:30 PM: Lead discovers new issue: "Actually, test variance is 15%, need < 5%"
              Lead contacts optimizer: "We need tighter tolerance"

Time 3:00 PM: Optimizer: "I already finished, results are done"
              Lead: "Those are worthless now, redo with variance validation"
              Benchmark run wasted, and 1.5 hours passed before the mismatch surfaced
```

### Recovery Procedure

**Step 1: Assess if results are salvageable** (5 min)
```
Lead: Can we adjust the results, or truly need redo?
      - If just need re-analysis: Have teammate post-process
      - If need re-benchmarking: Full redo required
```

**Step 2: Create new task with updated requirements** (5 min)
```
Lead: TaskCreate("Re-benchmark batch-size 32 with variance < 5%",
                 owner: "optimizer",
                 description: "Previous run showed 40% gain but variance 15%. Need tighter tolerance. Run at least 10 iterations.")

      SendMessage(to: "optimizer"): "Requirements refined, new task created with details"
```

**Step 3: Reuse previous findings** (5 min)
```
Optimizer: Reads new task description
           Reuses MEMORY.md notes from previous attempt
           Focuses on variance reduction (more iterations, control environment)
           Much faster than cold start
```

**Step 4: Document and prevent**
```
Add to failure-modes.md:
- Cause: Requirements changed mid-task, lead didn't notify
- Prevention: Lead checks TaskList every 30-60 min for long-running tasks
- Prevention: Use SendMessage ("Requirements changed, task TBD") as soon as the change is discovered
```

### Prevention
- **Set expectation upfront**: "Lead will check TaskList every hour for in-progress tasks"
- **Use SendMessage proactively**: If requirements change, contact teammate immediately
- **Keep lead engaged**: Don't create tasks and disappear for 3 hours
- **Milestone tracking**: For long tasks, ask for mid-task updates (e.g., "benchmark started, ETA 1 hour")

---

## Failure Mode 5: Over-Specified Compaction

### Detection
- After compaction + resume, teammate forgets key results (performance numbers, branch state)
- Compaction summary drops critical info
- Teammate re-does work they'd already done

### Root Causes
- **CLAUDE.md "Compact instructions" too aggressive**: Dropped results to save tokens
- **PreCompact hook not configured**: Key info not snapshotted before summarization
- **Summary algorithm dropped wrong info**: All outputs removed, but results needed

### Example Scenario
```
Optimizer: Profiled code, found O(n²) loop, results in .codeflash/data/profile.json
           Session context at 140k tokens
           Compaction fires, summary keeps "O(n²) found" but drops exact numbers

After compaction:
Optimizer: Compacted summary says "bottleneck found" but no numbers
           Teammate starts profiling again (déjà vu)
           Lead: "I already have the profile data!"
```

### Recovery Procedure

**Step 1: Check what was actually lost** (2 min)
```bash
# Do the files still exist?
ls -la .codeflash/{teammember}/{org}/{project}/data/profile.json
head .codeflash/{teammember}/{org}/{project}/data/profile.json

# Is it in teammate MEMORY.md?
grep -A 5 "O(n²)" MEMORY.md

# Is it in project MEMORY.md?
grep -A 5 "hotspot" MEMORY.md
```

**Step 2: If files/memory exist, retrieve from there** (2 min)
```
Teammate: Reference the files instead of repeating work
          Use previous profile data as starting point for next phase
          This is why MEMORY.md should be updated regularly
```

**Step 3: If truly lost, redo** (time-dependent)
```
Quick redo (< 5 min): Profile again, it's fast
Expensive redo (> 30 min): Benchmark or full optimization
  → Lead should have caught this, expensive oversight
```

**Step 4: Adjust compaction settings**
```
Edit the root CLAUDE.md:

# Current (too aggressive)
# Compact instructions
When you are using compact, please focus on test output and code changes

# Better (preserve results)
# Compact instructions
When compacting, preserve:
1. Benchmark results and performance deltas
2. Code changes made (git diffs)
3. Active branch state and next steps
4. Key findings and hotspots
Drop: verbose tool output, full file reads, intermediate test runs
```

### Prevention
- **Adjust CLAUDE.md "Compact instructions"** to preserve benchmark numbers, branch state, findings
- **Configure PreCompact hook** to snapshot before compaction
- **Rely on MEMORY.md and status.md**, not compaction summary, for critical state
- **Inject via SessionStart hook** after compaction

---

## Failure Mode 6: Lead Doesn't Wait for Teammates

### Detection
- Lead starts implementing while teammates still in "in_progress"
- Lead context balloons (own work + monitoring teammates)
- Teammate completes a task, but the lead is distracted and doesn't notice
- Task ordering breaks (lead gets ahead of profile results)

### Root Causes
- **Lead gets impatient**: "Teammates are taking too long, I'll start implementing"
- **No enforcement**: Stop hook not configured to block lead
- **Lead habit**: From single-session work where they do everything
- **Unclear role**: Lead thinks they should be coding, not just coordinating

### Example Scenario
```
Lead: TaskCreate("Profile typeagent", owner: "optimizer")

Lead: (3 minutes later) Starts implementing own optimization
      Figures "optimizer will take a while, I can start now"
      Context bloats with both leading AND implementing

Optimizer: Finishes profile after 20 min, posts findings
           Lead misses it (absorbed in own implementation)
           Optimizer does TaskUpdate but lead never reads

Result: Lead + Optimizer both working independently, no coordination
        All parallelism benefits lost
```

### Recovery Procedure

**Step 1: Acknowledge**
```
Lead: Recognize you've started implementing while teammates work
      This defeats the purpose of the team (kills the parallelism benefit)
```

**Step 2: Pause own work** (immediate)
```
Lead: Save current work (commit or stash)
      TaskUpdate for any open tasks to "on_hold" or revert
      Check TaskList for teammate progress
      Read teammate MEMORY.md for findings
```

**Step 3: Reorient to coordination role** (5 min)
```
Lead: What have teammates completed?
      What are they waiting for?
      What decisions do they need from you?

      Actions:
      - TaskUpdate with approvals if needed
      - SendMessage with feedback or next directions
      - Don't start new implementation
```

**Step 4: Enforce for next session**
```
Add Stop hook to prevent this:
  - If teammate tasks in "in_progress", block stop
  - Prevent lead from working on anything other than coordination

Add to CLAUDE.md:
  Lead role: Coordinate, approve, synthesize
  Lead constraint: Do not implement while teammates work
```
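
The blocking rule for the Stop hook can be expressed as a small decision function. This is a sketch under stated assumptions: the task fields (`id`, `owner`, `status`) are hypothetical, how the hook obtains the task list is left open, and the `decision`/`reason` output shape should be checked against the hook documentation for your Claude Code version before relying on it.

```python
import json
from typing import Dict, List


def stop_decision(tasks: List[Dict[str, str]], lead: str = "lead") -> Dict[str, str]:
    """Decide whether the lead may stop, given the current task list.

    Returns {} to allow, or a block decision with a human-readable reason.
    """
    # Teammate-owned tasks that are still running
    active = [t for t in tasks if t["status"] == "in_progress" and t["owner"] != lead]
    if not active:
        return {}
    names = ", ".join(f'{t["id"]} ({t["owner"]})' for t in active)
    return {
        "decision": "block",  # assumed output field, verify against hook docs
        "reason": f"Teammate tasks still in progress: {names}. "
                  "Check TaskList and coordinate instead of implementing.",
    }


# Illustrative task list; field names are assumptions
tasks = [
    {"id": "profile-typeagent", "owner": "optimizer", "status": "in_progress"},
    {"id": "plan", "owner": "lead", "status": "completed"},
]
blocked = stop_decision(tasks)
allowed = stop_decision([{"id": "plan", "owner": "lead", "status": "completed"}])
print(json.dumps(blocked))
```

Keeping the rule in a pure function makes it easy to test separately from whatever mechanism feeds it the task list.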

### Prevention
- **Add Stop hook** to prevent this
- **Make role explicit**: "Your job is to coordinate and review, not code"
- **Review this document** before starting team session

---

## Failure Mode 7: Task Never Completes (Ambiguous "Done")

### Detection
- Task stuck in "in_progress" for many hours
- Lead and teammate disagree on what "done" means
- Teammate asks "Is this good enough?" Lead says "no, keep going"
- No checklist to verify completion

### Root Causes
- **Vague task description**: "Profile Agent.execute()" (what counts as thorough?)
- **No deliverables defined**: Teammate doesn't know what evidence to provide
- **Unclear success criteria**: "Optimize for speed" (how fast is enough?)
- **Teammate over-thinking**: Perfectionism, "one more iteration"

### Example Scenario
```
Lead: TaskCreate("Profile Agent.execute()")

Optimizer: Profiles once, finds O(n²)
           TaskUpdate(status: "completed")

Lead: Reads profile, says "Need more detail on cache misses"
      Optimizer: "I did basic profiling, thought that was done"

Optimizer: Profiles again with cache analysis
           TaskUpdate(status: "completed")

Lead: Says "What about GC? And memory allocations?"
      Back and forth continues...
```

### Recovery Procedure

**Step 1: Define done explicitly**
```
Lead + Teammate: Discuss and agree:
  - What is the minimum viable finding?
  - What metrics matter (throughput, latency, memory)?
  - How confident do we need to be?
  - What's the next stopping point?
```

**Step 2: Create checklist**
```
Lead: Update task description with deliverables:

DELIVERABLES (task is DONE when):
- [ ] Profile collected on 1000-message corpus
- [ ] Top 3 hotspots identified (> 5% each)
- [ ] O-complexity analysis for each hotspot
- [ ] Cache hit rates measured
- [ ] Variance < 5% (min 3 runs)
- [ ] Findings documented in MEMORY.md
- [ ] Branch created or pushed if applicable

NOT required (out of scope):
- GC analysis (separate task)
- Memory profiling (separate task)
- Implementation (separate task)
```

**Step 3: Teammate completes to checklist**
```
Optimizer: Works to checklist
           When all boxes checked, done
           TaskUpdate(status: "completed")
           Lead can't argue "but what about X" if X not on checklist
```

**Step 4: TaskCompleted hook validates**
```
Hook script:
- Read task deliverables checklist
- For "Profile" tasks: Verify profile.json exists, MEMORY.md has findings
- Accept completion only if deliverables present
```
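
The first check in that hook script, reading the deliverables checklist, might be sketched like this. It assumes the checklist uses the `- [ ]` / `- [x]` markdown form shown in Step 2 and that the hook receives the task description as text. The function name is hypothetical, and the file-existence checks (profile.json, MEMORY.md) are not covered.

```python
import re
from typing import List

# Matches "- [ ] item" and "- [x] item" (optionally indented)
CHECKBOX = re.compile(r"^\s*- \[( |x|X)\] (.+)$")


def unchecked_deliverables(description: str) -> List[str]:
    """Return the text of every checklist item that is not yet ticked."""
    pending = []
    for line in description.splitlines():
        match = CHECKBOX.match(line)
        if match and match.group(1) == " ":
            pending.append(match.group(2).strip())
    return pending


description = """DELIVERABLES (task is DONE when):
- [x] Profile collected on 1000-message corpus
- [x] Top 3 hotspots identified (> 5% each)
- [ ] Cache hit rates measured
- [ ] Findings documented in MEMORY.md

NOT required (out of scope):
- GC analysis (separate task)
"""
pending = unchecked_deliverables(description)
print(pending)
```

A hook could accept completion only when this list is empty, and otherwise return the pending items to the teammate as the reason for rejection.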

### Prevention
- **Always create task with deliverables checklist**
- **Use TaskCompleted hook** to validate before accepting completion
- **Define scope boundaries**: "This task does X, NOT Y" (prevents scope creep)

---

## Reference: Detecting Failure Early

### Signals to Watch For

| Signal | Possible Failure | Action |
|---|---|---|
| Task in "in_progress" > 2 hours, no updates | Silent failure or infinite loop | Check teammate logs, consider restart |
| Lead implementing while teammates work | Lead over-working, parallelism lost | Move lead's implementation into a separate task, enforce with Stop hook |
| Task repeatedly marked "completed" and then reopened | Ambiguous "done", perfectionism | Define explicit checklist |
| All tasks "blocked", nothing moving | Deadlock, circular dependencies | Break cycle with lead override |
| Teammate forgets branch after compaction | Context loss | Check PreCompact/SessionStart hooks |
| Lead changes requirements mid-task | Stale results | Notify teammate immediately, update task |

### Monitoring Routine (Every 1-2 hours during team session)

```
Lead checklist:
- [ ] TaskList: All in_progress tasks have recent updates?
- [ ] Teammates: Any stuck or stalled?
- [ ] MEMORY.md: Updated with current findings?
- [ ] Status: Still on track, or need course correction?
- [ ] Blockers: Any teammate waiting on lead decision?

If anything looks wrong:
  → SendMessage to teammate: "How's it going? Any blockers?"
  → Escalate if no response in 30 min
```

---

See also: team-structure.md (prevent deadlock via config), agent-teams.md (Claude Code agent team docs)