codeflash-agent/plugin/references/shared/failure-modes.md


2026-04-09 08:36:01 +00:00
# Failure Modes & Recovery
This document catalogs known failure modes in multi-teammate workflows, how to detect them, root causes, and recovery procedures.
## Failure Mode 1: Deadlock (Circular Task Dependencies)
### Detection
- TaskList shows all teammates in "blocked" state
- No tasks in "in_progress" (all waiting for something)
- Lead waits for teammates to move, teammates wait for lead or each other
- Session stalled for > 30 minutes with no progress
### Root Causes
- **Circular dependencies**: Benchmarker blocked by Reviewer, Reviewer blocked by Benchmarker
- **Missing dependency definition**: Lead created tasks without specifying blockedBy/blocks
- **Lead waiting for teammate, teammate waiting for lead approval**: Both stalled
### Example Scenario
```
Optimizer blocked by: Reviewer (waiting for code review)
Reviewer blocked by: Benchmarker (wants benchmark results first)
Benchmarker blocked by: Optimizer (waiting for implementation)
Lead: Blocked waiting for Optimizer to complete
Result: No one can move
```
### Recovery Procedure
**Step 1: Break the cycle** (5 min)
```
Lead: Manually complete one task in the chain
Option A: Lead implements the optimization (breaks Optimizer bottleneck)
Option B: Lead approves current implementation (breaks Reviewer bottleneck)
Option C: Lead runs benchmark manually (breaks Benchmarker bottleneck)
```
**Step 2: Unblock teammates**
```
Lead: TaskUpdate for the broken task with status: "completed"
Notify teammates via SendMessage: "Cycle broken, [task] completed, proceed"
```
**Step 3: Replan for next session**
```
Lead: Review dependency graph in CLAUDE.md
Redefine as linear chain, not circular:
- Profile (Optimizer) → blocks
- Implement (Optimizer) → blocks
- Benchmark (Benchmarker) → blocks
- Review (Reviewer) → completed
```
**Step 4: Document**
```
Add to project MEMORY.md:
- Circular dependency detected: [describe]
- Broken by: [what lead did]
- Prevention: Use team-structure.md decision tree before creating teams
```
### Prevention
- Add TaskCreated hook to validate dependency graph
- Use team-structure.md to pick proven configurations (single optimization = linear chain)
- Always define dependencies when creating tasks
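The dependency validation suggested above can be sketched as a cycle check over the blockedBy graph. This is a minimal illustration, not the actual hook: `find_cycle` and the dictionary layout (task name mapped to the tasks it is blocked by) are assumptions, and a real hook would need to obtain the graph from wherever tasks are stored.

```python
from typing import Dict, List, Optional


def find_cycle(blocked_by: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of task names, or None if acyclic.

    blocked_by maps each task to the tasks it waits on (hypothetical layout).
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, fully explored
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(task: str) -> Optional[List[str]]:
        color[task] = GRAY
        path.append(task)
        for dep in blocked_by.get(task, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so the path loops back to it
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[task] = BLACK
        return None

    for task in blocked_by:
        if color.get(task, WHITE) == WHITE:
            cycle = visit(task)
            if cycle:
                return cycle
    return None


# The deadlock from the example scenario
deadlocked = {
    "Optimizer": ["Reviewer"],
    "Reviewer": ["Benchmarker"],
    "Benchmarker": ["Optimizer"],
}
# The linear chain from Step 3
linear = {
    "Profile": [],
    "Implement": ["Profile"],
    "Benchmark": ["Implement"],
    "Review": ["Benchmark"],
}

print(find_cycle(deadlocked))
print(find_cycle(linear))
```

A hook built on this could reject a new task whenever `find_cycle` returns a non-empty list and report the cycle to the lead.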
---
## Failure Mode 2: Silent Teammate Failure
### Detection
- Teammate stops responding (no new messages/updates)
- Task stuck in "in_progress" for > 2 hours
- No error message in last response
- Lead checks TaskList, sees task still assigned but no progress
### Root Causes
- **API error** (StopFailure hook triggered): Model hit rate limit, timeout, or internal error
- **Infinite loop**: Teammate caught in retry logic (unlikely but possible)
- **Network timeout**: SSH to VM failed, teammate waiting indefinitely
- **Unrecoverable error** (rare): Teammate hit an error it could not recover from and stopped without reporting
### Example Scenario
```
Benchmarker: Started benchmark on VM, got timeout error
StopFailure hook fired (no recovery context provided)
Session ended with no message to lead
Lead: Checks TaskList 2 hours later, sees task still "in_progress"
Has no idea benchmarker failed
```
### Recovery Procedure
**Step 1: Verify failure** (2 min)
```bash
# Check whether the task is actually stuck. In TaskList, confirm:
#   - Is the task marked "in_progress"?
#   - What was the last message/timestamp?
#   - Who is the owner?
# Check teammate session logs (if accessible)
ls -la /Users/*/Desktop/work/.claude/projects/*/
grep "error\|failed" teammate-session.log
```
**Step 2: Assess damage** (2 min)
```
Questions:
- Did teammate complete any work before failing?
- Did they commit to branch or save results?
- Is MEMORY.md updated with partial findings?
- Can work be resumed from where they left off?
```
**Step 3: Choose recovery path**
**Path A: Retry with fresh teammate** (recommended)
```
Lead: TaskUpdate(task, owner: "new-teammate-name",
description: "Previous teammate failed. Retry from scratch or check MEMORY.md for partial findings")
TaskCreate if previous teammate created partial work:
- If branch partially implemented: "Complete implementation on perf/... branch"
- If benchmark partial: "Validate benchmark results, see .codeflash/.../data/"
New teammate: Reads task description + previous teammate MEMORY.md
Continues from where possible, or redoes from scratch
```
**Path B: Lead takes over** (if time-critical)
```
Lead: Break team pattern temporarily
Manually complete the failing task
Continue rest of team as normal
This saves time but adds context cost. Document cost in metrics.tsv
```
**Path C: Abandon task** (if low priority)
```
Lead: TaskUpdate(task, status: "deleted")
SendMessage to remaining teammates: "Benchmark task cancelled, focus on review"
```
**Step 4: Post-mortem**
```
Add to failure-modes.md and project MEMORY.md:
- Failure: [what happened]
- Cause: [why - check logs if possible]
- Recovery: [what lead did]
- Prevention: [what to do differently]
```
### Prevention
- Monitor teammate progress manually every 1-2 hours if doing long-running tasks
- Add PostToolUseFailure hook to capture error context and retry info
- Define task timeouts: "Benchmark should complete within 30 min, alert lead if not"
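The "in_progress for > 2 hours" signal can be checked mechanically instead of by eye. The sketch below assumes a hypothetical `TaskSnapshot` record with a `last_update` timestamp. The real TaskList output may expose different fields, so treat the names as placeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class TaskSnapshot:
    # Hypothetical record; the real TaskList output may use different fields
    name: str
    owner: str
    status: str
    last_update: datetime


def find_stalled(tasks: List[TaskSnapshot], now: datetime,
                 limit: timedelta = timedelta(hours=2)) -> List[TaskSnapshot]:
    """Return in_progress tasks whose last update is older than `limit`."""
    return [t for t in tasks
            if t.status == "in_progress" and now - t.last_update > limit]


# Illustrative data only
now = datetime(2026, 4, 9, 15, 0)
tasks = [
    TaskSnapshot("Benchmark batch-size 32", "benchmarker", "in_progress",
                 datetime(2026, 4, 9, 12, 30)),
    TaskSnapshot("Review perf branch", "reviewer", "in_progress",
                 datetime(2026, 4, 9, 14, 40)),
    TaskSnapshot("Profile Agent.execute()", "optimizer", "completed",
                 datetime(2026, 4, 9, 11, 0)),
]
for t in find_stalled(tasks, now):
    print(f"{t.name} (owner: {t.owner}): no update since {t.last_update:%H:%M}")
```

Passing a shorter `limit`, such as `timedelta(minutes=30)`, matches the per-task timeout idea in the last bullet.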
---
## Failure Mode 3: Context Loss After Compaction
### Detection
- Teammate resumes session after compaction
- Forgets current branch or optimization goal
- Asks "what was I working on?" or implements wrong thing
- Compaction removed critical state info (branch name, status, next step)
### Root Causes
- **PreCompact hook not configured**: Critical state not snapshotted before compaction
- **SessionStart hook not re-injecting**: State not restored after compaction
- **CLAUDE.md "Compact instructions" too aggressive**: Dropped branch/status info
- **Teammate MEMORY.md not updated**: No context to fall back to
### Example Scenario
```
Optimizer: Working on perf/batch-size, got 40% gain, was about to benchmark
Context at 150k tokens → compaction fires
PreCompact hook missing → branch/status not saved
SessionStart hook resumes but status.md not injected
Optimizer resumes: Reads compacted summary, can't find branch name
Runs `git branch | grep perf`, but with many perf/* branches cannot tell which one was in use
Starts on wrong branch or redoes profiling
```
### Recovery Procedure
**Step 1: Restore context** (5 min)
```
Teammate: Check available context in order:
Option A: Read .codeflash/{teammember}/{org}/{project}/status.md
Should say "Current branch: perf/batch-size"
Should say "Latest: 40% throughput gain, ready for benchmark"
Option B: Read own MEMORY.md
Should have notes about findings, branch, next steps
Option C: Check git
git branch (shows current and recent branches)
git log --oneline | head (shows recent commits with messages)
Option D: Read compacted summary in session
Look for "Optimization" or branch/result references
```
**Step 2: Verify before continuing** (2 min)
```bash
# Confirm right branch
git status
git branch -v
# Confirm right state
git log -1 --format=fuller
# Confirm results exist
ls -la .codeflash/{teammember}/{org}/{project}/data/
# Read what lead said in last interaction
# (Look for approval, next steps in task description)
```
**Step 3: Document if context was lost**
```
If context had to be reconstructed from git/files instead of summary:
- Note what was missing (branch? status? results?)
- Check PreCompact hook configuration
- Check SessionStart hook configuration
Add to failure-modes.md:
- Lost: [branch name / status / results]
- Recovered from: [git log / MEMORY.md / files]
- Fix: Configure PreCompact hook to snapshot state
```
**Step 4: Continue work** (normal)
```
Now that context is restored, proceed normally
```
### Prevention
- **Configure PreCompact hook** to snapshot before compaction
- **Configure SessionStart hook** to inject status after compaction
- **Keep MEMORY.md updated** with current branch and findings as you work
- **Update .codeflash/{teammember}/{org}/{project}/status.md** regularly, not just at handoff
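The snapshot-and-restore idea behind these bullets can be sketched as a pair of helpers that a PreCompact hook and a SessionStart hook might call. This is only an illustration: the `Key: value` layout of status.md, the function names, and the "Next step" field are assumptions, not a format defined by this document.

```python
from pathlib import Path
from typing import Dict
import tempfile


def write_status(path: Path, state: Dict[str, str]) -> None:
    """Write state as 'Key: value' lines (hypothetical status.md layout)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    lines = [f"{key}: {value}" for key, value in state.items()]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def read_status(path: Path) -> Dict[str, str]:
    """Parse 'Key: value' lines back into a dict; a missing file yields {}."""
    if not path.exists():
        return {}
    state = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # skip lines that are not 'Key: value'
            state[key] = value
    return state


# Round trip in a temporary directory; a real hook would target the
# teammate's status.md path instead
with tempfile.TemporaryDirectory() as tmp:
    status_file = Path(tmp) / "status.md"
    write_status(status_file, {
        "Current branch": "perf/batch-size",
        "Latest": "40% throughput gain, ready for benchmark",
        "Next step": "Run benchmark",
    })
    restored = read_status(status_file)

print(restored["Current branch"])
```

Keeping the file as plain `Key: value` lines means a teammate can also read it directly during recovery, without any tooling.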
---
## Failure Mode 4: Stale Teammate Results
### Detection
- Lead reviews completed task 1-2 hours after completion
- Requirements have changed since task started
- Teammate's results are outdated
- Lead asks for a redo, and the time already spent is wasted
### Root Causes
- **No real-time notification**: Lead doesn't check TaskList frequently
- **Long-running tasks**: 2-3 hour benchmarks, teammates finish but lead not watching
- **Requirements evolved**: New discovery mid-task, teammates didn't know to adjust
- **Lead context lost**: Lead forgot teammate was working on this, started duplicate work
### Example Scenario
```
Time 1:00 PM: Lead creates benchmark task for optimizer
"Benchmark batch-size 32 on typeagent VM"
Time 1:30 PM: Optimizer implements, TaskUpdate(status: "completed")
Results show batch-size 32 gives 40% gain
Time 2:30 PM: Lead discovers new issue: "Actually, test variance is 15%, need < 5%"
Lead contacts optimizer: "We need tighter tolerance"
Time 3:00 PM: Optimizer: "I already finished, results are done"
Lead: "Those are worthless now, redo with variance validation"
Wasted: the 30-minute benchmark run, plus 1.5 hours before the mismatch surfaced
```
### Recovery Procedure
**Step 1: Assess if results are salvageable** (5 min)
```
Lead: Can we adjust the results, or truly need redo?
- If just need re-analysis: Have teammate post-process
- If need re-benchmarking: Full redo required
```
**Step 2: Create new task with updated requirements** (5 min)
```
Lead: TaskCreate("Re-benchmark batch-size 32 with variance < 5%",
owner: "optimizer",
description: "Previous run showed 40% gain but variance 15%. Need tighter tolerance. Run at least 10 iterations.")
SendMessage(to: "optimizer"): "Requirements refined, new task created with details"
```
**Step 3: Reuse previous findings** (5 min)
```
Optimizer: Reads new task description
Reuses MEMORY from previous attempt
Focuses on variance reduction (more iterations, control environment)
Much faster than cold start
```
**Step 4: Document and prevent**
```
Add to failure-modes.md:
- Cause: Requirements changed mid-task, lead didn't notify
- Prevention: Lead checks TaskList every 30-60 min for long-running tasks
- Prevention: Send "Requirements changed, task TBD" via SendMessage as soon as the change is discovered
```
### Prevention
- **Set expectation upfront**: "Lead will check TaskList every hour for in-progress tasks"
- **Use SendMessage proactively**: If requirements change, contact teammate immediately
- **Keep lead engaged**: Don't create tasks and disappear for 3 hours
- **Milestone tracking**: For long tasks, ask for mid-task updates (e.g., "benchmark started, ETA 1 hour")
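A requirement such as "variance < 5%, at least 10 iterations" is easier to agree on upfront when it is written as a check. The sketch below interprets "variance" as the coefficient of variation (sample standard deviation divided by the mean). That interpretation is an assumption, as are the function names and the made-up sample numbers.

```python
from statistics import mean, stdev
from typing import List


def relative_spread(samples: List[float]) -> float:
    """Sample standard deviation divided by the mean (coefficient of variation)."""
    if len(samples) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return stdev(samples) / mean(samples)


def meets_tolerance(samples: List[float], max_spread: float = 0.05,
                    min_runs: int = 10) -> bool:
    """True only if there are enough runs and their spread is under the limit."""
    return len(samples) >= min_runs and relative_spread(samples) < max_spread


# Made-up throughput numbers for illustration only
noisy = [100.0, 130.0, 85.0, 115.0, 90.0, 125.0, 95.0, 110.0, 80.0, 120.0]
steady = [100.0, 102.0, 99.0, 101.0, 98.0, 100.0, 103.0, 97.0, 101.0, 99.0]

print(meets_tolerance(noisy), meets_tolerance(steady))
```

Putting the threshold and minimum run count in the task description from the start lets the teammate run this kind of check before marking the task completed.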
---
## Failure Mode 5: Over-Specified Compaction
### Detection
- After compaction + resume, teammate forgets key results (performance numbers, branch state)
- Compaction summary drops critical info
- Teammate re-does work they'd already done
### Root Causes
- **CLAUDE.md "Compact instructions" too aggressive**: Dropped results to save tokens
- **PreCompact hook not configured**: Key info not snapshotted before summarization
- **Summary algorithm dropped wrong info**: All outputs removed, but results needed
### Example Scenario
```
Optimizer: Profiled code, found O(n²) loop, results in .codeflash/data/profile.json
Session context at 140k tokens
Compaction fires, summary keeps "O(n²) found" but drops exact numbers
After compaction:
Optimizer: Compacted summary says "bottleneck found" but no numbers
Teammate starts profiling again (déjà vu)
Lead: "I already have the profile data!"
```
### Recovery Procedure
**Step 1: Check what was actually lost** (2 min)
```bash
# Do the files still exist?
ls -la .codeflash/{teammember}/{org}/{project}/data/profile.json
cat .codeflash/{teammember}/{org}/{project}/data/profile.json | head
# Is it in teammate MEMORY.md?
grep -A 5 "O(n²)" MEMORY.md
# Is it in project MEMORY.md?
grep -A 5 "hotspot" MEMORY.md
```
**Step 2: If files/memory exist, retrieve from there** (2 min)
```
Teammate: Reference the files instead of repeating work
Use previous profile data as starting point for next phase
This is why MEMORY.md should be updated regularly
```
**Step 3: If truly lost, redo** (time-dependent)
```
Quick redo (< 5 min): Profile again, it's fast
Expensive redo (> 30 min): Benchmark or full optimization
→ Lead should have caught this, expensive oversight
```
**Step 4: Adjust compaction settings**
```
Edit CLAUDE.md root:
# Current (too aggressive)
# Compact instructions
When you are using compact, please focus on test output and code changes
# Better (preserve results)
# Compact instructions
When compacting, preserve:
1. Benchmark results and performance deltas
2. Code changes made (git diffs)
3. Active branch state and next steps
4. Key findings and hotspots
Drop: verbose tool output, full file reads, intermediate test runs
```
### Prevention
- **Adjust CLAUDE.md "Compact instructions"** to preserve benchmark numbers, branch state, findings
- **Configure PreCompact hook** to snapshot before compaction
- **Rely on MEMORY.md and status.md**, not compaction summary, for critical state
- **Inject via SessionStart hook** after compaction
---
## Failure Mode 6: Lead Doesn't Wait for Teammates
### Detection
- Lead starts implementing while teammates still in "in_progress"
- Lead context balloons (own work + monitoring teammates)
- Teammate completes but lead doesn't see because distracted
- Task ordering breaks (lead gets ahead of profile results)
### Root Causes
- **Lead gets impatient**: "Teammates are taking too long, I'll start implementing"
- **No enforcement**: Stop hook not configured to block lead
- **Lead habit**: From single-session work where they do everything
- **Unclear role**: Lead thinks they should be coding, not just coordinating
### Example Scenario
```
Lead: TaskCreate("Profile typeagent", owner: "optimizer")
Lead: (3 minutes later) Starts implementing own optimization
Figures "optimizer will take a while, I can start now"
Context bloats with both leading AND implementing
Optimizer: Finishes profile after 20 min, posts findings
Lead misses it (absorbed in own implementation)
Optimizer does TaskUpdate but lead never reads
Result: Lead + Optimizer both working independently, no coordination
All parallelism benefits lost
```
### Recovery Procedure
**Step 1: Acknowledge**
```
Lead: Recognize you've started implementing while teammates work
This defeats purpose of team (kills parallelism benefit)
```
**Step 2: Pause own work** (immediate)
```
Lead: Save current work (commit or stash)
TaskUpdate for any open tasks to "on_hold" or revert
Check TaskList for teammate progress
Read teammate MEMORY.md for findings
```
**Step 3: Reorient to coordination role** (5 min)
```
Lead: What have teammates completed?
What are they waiting for?
What decisions do they need from you?
Actions:
- TaskUpdate with approvals if needed
- SendMessage with feedback or next directions
- Don't start new implementation
```
**Step 4: Enforce for next session**
```
Add Stop hook to prevent this:
- If teammate tasks in "in_progress", block stop
- Prevent lead from working on anything other than coordination
Add to CLAUDE.md:
Lead role: Coordinate, approve, synthesize
Lead constraint: Do not implement while teammates work
```
### Prevention
- **Add Stop hook** that blocks the lead from stopping while teammate tasks are "in_progress"
- **Make role explicit**: "Your job is to coordinate and review, not code"
- **Review this document** before starting team session
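The Stop hook rule ("if teammate tasks are in_progress, block stop") reduces to a small decision function. The sketch below is hedged on two points: the task dictionaries are a hypothetical shape, and the `{"decision": "block", "reason": ...}` output is an assumed format that should be checked against the current hook reference before use.

```python
import json
from typing import Dict, List, Optional


def stop_decision(tasks: List[Dict[str, str]],
                  lead: str = "lead") -> Optional[Dict[str, str]]:
    """Return a block decision while any teammate task is in_progress, else None."""
    active = [t for t in tasks
              if t["status"] == "in_progress" and t["owner"] != lead]
    if not active:
        return None
    names = ", ".join(f'{t["name"]} ({t["owner"]})' for t in active)
    return {
        "decision": "block",  # assumed output shape; verify against the hook docs
        "reason": f"Teammate tasks still in progress: {names}. "
                  "Coordinate, do not implement.",
    }


# Hypothetical task list for illustration
tasks = [
    {"name": "Profile typeagent", "owner": "optimizer", "status": "in_progress"},
    {"name": "Plan next session", "owner": "lead", "status": "in_progress"},
]
decision = stop_decision(tasks)
if decision is not None:
    print(json.dumps(decision))
```

The lead's own tasks are excluded on purpose: only teammate work should hold the lead in the coordination role.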
---
## Failure Mode 7: Task Never Completes (Ambiguous "Done")
### Detection
- Task stuck in "in_progress" for many hours
- Lead and teammate disagree on what "done" means
- Teammate asks "Is this good enough?" Lead says "no, keep going"
- No checklist to verify completion
### Root Causes
- **Vague task description**: "Profile Agent.execute()" (what counts as thorough?)
- **No deliverables defined**: Teammate doesn't know what evidence to provide
- **Unclear success criteria**: "Optimize for speed" (how fast is enough?)
- **Teammate over-thinking**: Perfectionism, "one more iteration"
### Example Scenario
```
Lead: TaskCreate("Profile Agent.execute()")
Optimizer: Profiles once, finds O(n²)
TaskUpdate(status: "completed")
Lead: Reads profile, says "Need more detail on cache misses"
Optimizer: "I did basic profiling, thought that was done"
Optimizer: Profiles again with cache analysis
TaskUpdate(status: "completed")
Lead: Says "What about GC? And memory allocations?"
Back and forth continues...
```
### Recovery Procedure
**Step 1: Define done explicitly**
```
Lead + Teammate: Discuss and agree:
- What is the minimum viable finding?
- What metrics matter (throughput, latency, memory)?
- How confident do we need to be?
- What's the next stopping point?
```
**Step 2: Create checklist**
```
Lead: Update task description with deliverables:
DELIVERABLES (task is DONE when):
- [ ] Profile collected on 1000-message corpus
- [ ] Top 3 hotspots identified (> 5% each)
- [ ] O-complexity analysis for each hotspot
- [ ] Cache hit rates measured
- [ ] Variance < 5% (min 3 runs)
- [ ] Findings documented in MEMORY.md
- [ ] Branch created or pushed if applicable
NOT required (out of scope):
- GC analysis (separate task)
- Memory profiling (separate task)
- Implementation (separate task)
```
**Step 3: Teammate completes to checklist**
```
Optimizer: Works to checklist
When all boxes checked, done
TaskUpdate(status: "completed")
Lead can't argue "but what about X" if X not on checklist
```
**Step 4: TaskCompleted hook validates**
```
Hook script:
- Read task deliverables checklist
- For "Profile" tasks: Verify profile.json exists, MEMORY.md has findings
- Accept completion only if deliverables present
```
### Prevention
- **Always create task with deliverables checklist**
- **Use TaskCompleted hook** to validate before accepting completion
- **Define scope boundaries**: "This task does X, NOT Y" (prevents scope creep)
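The deliverables check in Step 4 can be approximated by parsing the checklist in the task description. This is a minimal sketch that assumes the `- [ ]` / `- [x]` checkbox convention shown in Step 2. `unchecked_items` is a hypothetical helper, and a fuller hook would also verify that artifacts such as profile.json exist.

```python
import re
from typing import List

# Matches "- [ ] text" and "- [x] text" checklist lines
CHECKBOX = re.compile(r"^\s*-\s*\[( |x|X)\]\s+(.*)$")


def unchecked_items(description: str) -> List[str]:
    """Return the text of every checklist item that has not been ticked."""
    missing = []
    for line in description.splitlines():
        match = CHECKBOX.match(line)
        if match and match.group(1) == " ":
            missing.append(match.group(2).strip())
    return missing


description = """DELIVERABLES (task is DONE when):
- [x] Profile collected on 1000-message corpus
- [x] Top 3 hotspots identified (> 5% each)
- [ ] Cache hit rates measured
- [ ] Findings documented in MEMORY.md
NOT required (out of scope):
- GC analysis (separate task)
"""

missing = unchecked_items(description)
print("accept" if not missing else f"reject, still open: {missing}")
```

Out-of-scope bullets have no checkbox, so they are ignored and cannot block completion.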
---
## Reference: Detecting Failure Early
### Signals to Watch For
| Signal | Possible Failure | Action |
|---|---|---|
| Task in "in_progress" > 2 hours, no updates | Silent failure or infinite loop | Check teammate logs, consider restart |
| Lead implementing while teammates work | Lead over-working, parallelism lost | Pause lead's implementation, move it to a separate task, enforce the Stop hook block |
| Task repeatedly reopened after "completed", or "in_progress" with no change | Ambiguous "done", perfectionism | Define explicit checklist |
| All tasks "blocked", nothing moving | Deadlock, circular dependencies | Break cycle with lead override |
| Teammate forgets branch after compaction | Context loss | Check PreCompact/SessionStart hooks |
| Lead changes requirements mid-task | Stale results | Notify teammate immediately, update task |
### Monitoring Routine (Every 1-2 hours during team session)
```
Lead checklist:
- [ ] TaskList: All in_progress tasks have recent updates?
- [ ] Teammates: Any stuck or stalled?
- [ ] MEMORY.md: Updated with current findings?
- [ ] Status: Still on track, or need course correction?
- [ ] Blockers: Any teammate waiting on lead decision?
If anything looks wrong:
→ SendMessage to teammate: "How's it going? Any blockers?"
→ Escalate if no response in 30 min
```
---
See also: team-structure.md (prevent deadlock via config), agent-teams.md (Claude Code agent team docs)