# Failure Modes & Recovery

This document catalogs known failure modes in multi-teammate workflows, how to detect them, root causes, and recovery procedures.

## Failure Mode 1: Deadlock (Circular Task Dependencies)

### Detection
- TaskList shows all teammates in "blocked" state
- No tasks in "in_progress" (all waiting for something)
- Lead waits for teammates to move, teammates wait for lead or each other
- Session stalled for > 30 minutes with no progress

### Root Causes
- **Circular dependencies**: Benchmarker blocked by Reviewer, Reviewer blocked by Benchmarker
- **Missing dependency definition**: Lead created tasks without specifying blockedBy/blocks
- **Lead waiting for teammate, teammate waiting for lead approval**: Both stalled

### Example Scenario
```
Optimizer blocked by: Reviewer (waiting for code review)
Reviewer blocked by: Benchmarker (wants benchmark results first)
Benchmarker blocked by: Optimizer (waiting for implementation)
Lead: Blocked waiting for Optimizer to complete

Result: No one can move
```

### Recovery Procedure

**Step 1: Break the cycle** (5 min)
```
Lead: Manually complete one task in the chain
Option A: Lead implements the optimization (breaks Optimizer bottleneck)
Option B: Lead approves current implementation (breaks Reviewer bottleneck)
Option C: Lead runs benchmark manually (breaks Benchmarker bottleneck)
```

**Step 2: Unblock teammates**
```
Lead: TaskUpdate for the broken task with status: "completed"
Notify teammates via SendMessage: "Cycle broken, [task] completed, proceed"
```

**Step 3: Replan for next session**
```
Lead: Review dependency graph in CLAUDE.md
Redefine as linear chain, not circular:
- Profile (Optimizer) → blocks
- Implement (Optimizer) → blocks
- Benchmark (Benchmarker) → blocks
- Review (Reviewer) → completed
```

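The replanned chain, and the kind of validation a TaskCreated hook might perform, can be sketched as follows. This is a minimal illustration that assumes the dependency graph is available as a mapping from task name to the names it is blocked by; `linear_chain`, `find_cycle`, and the mapping format are hypothetical helpers, not part of any documented TaskList API. The example flags the circular chain from the scenario above and accepts the linear one.

```python
from typing import Dict, List, Optional


def linear_chain(steps: List[str]) -> Dict[str, List[str]]:
    """Build a blocked-by map in which each step waits only on the previous one."""
    return {step: ([steps[i - 1]] if i > 0 else []) for i, step in enumerate(steps)}


def find_cycle(blocked_by: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of task names, or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, fully explored
    color = {task: WHITE for task in blocked_by}
    path: List[str] = []

    def visit(task: str) -> Optional[List[str]]:
        color[task] = GRAY
        path.append(task)
        for dep in blocked_by.get(task, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so we have looped back to it
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[task] = BLACK
        return None

    for task in list(blocked_by):
        if color[task] == WHITE:
            found = visit(task)
            if found:
                return found
    return None


# The deadlock from the example scenario (hypothetical representation).
deadlocked = {
    "Optimizer": ["Reviewer"],
    "Reviewer": ["Benchmarker"],
    "Benchmarker": ["Optimizer"],
}
print("cycle:", find_cycle(deadlocked))

# The replanned linear chain: Profile -> Implement -> Benchmark -> Review.
chain = linear_chain(["Profile", "Implement", "Benchmark", "Review"])
print("cycle:", find_cycle(chain))
```
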
**Step 4: Document**
```
Add to project MEMORY.md:
- Circular dependency detected: [describe]
- Broken by: [what lead did]
- Prevention: Use team-structure.md decision tree before creating teams
```

### Prevention
- Add TaskCreated hook to validate dependency graph
- Use team-structure.md to pick proven configurations (single optimization = linear chain)
- Always define dependencies when creating tasks

---

## Failure Mode 2: Silent Teammate Failure

### Detection
- Teammate stops responding (no new messages/updates)
- Task stuck in "in_progress" for > 2 hours
- No error message in last response
- Lead checks TaskList, sees task still assigned but no progress

### Root Causes
- **API error** (StopFailure hook triggered): Model hit rate limit, timeout, or internal error
- **Infinite loop**: Teammate caught in retry logic (unlikely but possible)
- **Network timeout**: SSH to VM failed, teammate waiting indefinitely
- **Unrecoverable error** (rare): Teammate hit an unrecoverable error and stopped silently

### Example Scenario
```
Benchmarker: Started benchmark on VM, got timeout error
StopFailure hook fired (no recovery context provided)
Session ended with no message to lead

Lead: Checks TaskList 2 hours later, sees task still "in_progress"
Has no idea benchmarker failed
```

### Recovery Procedure

**Step 1: Verify failure** (2 min)
```bash
# Check if task is actually stuck (via TaskList):
# - Is task marked "in_progress"?
# - What was the last message/timestamp?
# - Who is the owner?

# Check teammate session logs (if accessible)
ls -la /Users/*/Desktop/work/.claude/projects/*/
grep -E "error|failed" teammate-session.log
```

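The first check in Step 1 (is the task still "in_progress", and how old is its last update) can be automated. The sketch below assumes task records are plain dictionaries with `id`, `owner`, `status`, and `last_update` keys; that record shape and the `find_stalled_tasks` name are illustrative assumptions, since the actual TaskList output format is not shown here. The two-hour default mirrors the detection threshold above.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def find_stalled_tasks(
    tasks: List[Dict],
    now: datetime,
    max_idle: timedelta = timedelta(hours=2),
) -> List[Dict]:
    """Return in-progress tasks whose last update is older than max_idle."""
    return [
        task for task in tasks
        if task["status"] == "in_progress" and now - task["last_update"] > max_idle
    ]


# Hypothetical snapshot of a task list at 3:00 PM.
now = datetime(2026, 1, 1, 15, 0)
tasks = [
    {"id": "bench-1", "owner": "benchmarker", "status": "in_progress",
     "last_update": datetime(2026, 1, 1, 12, 30)},
    {"id": "review-1", "owner": "reviewer", "status": "in_progress",
     "last_update": datetime(2026, 1, 1, 14, 45)},
    {"id": "profile-1", "owner": "optimizer", "status": "completed",
     "last_update": datetime(2026, 1, 1, 10, 0)},
]

for task in find_stalled_tasks(tasks, now):
    idle = now - task["last_update"]
    print(f'{task["id"]} (owner: {task["owner"]}) has been idle for {idle}')
```
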
**Step 2: Assess damage** (2 min)
```
Questions:
- Did teammate complete any work before failing?
- Did they commit to branch or save results?
- Is MEMORY.md updated with partial findings?
- Can work be resumed from where they left off?
```

**Step 3: Choose recovery path**

**Path A: Retry with fresh teammate** (recommended)
```
Lead: TaskUpdate(task, owner: "new-teammate-name",
  description: "Previous teammate failed. Retry from scratch or check MEMORY.md for partial findings")

TaskCreate if previous teammate created partial work:
- If branch partially implemented: "Completed implementation on perf/... branch"
- If benchmark partial: "Validate benchmark results, see .codeflash/.../data/"

New teammate: Reads task description + previous teammate MEMORY.md
Continues from where possible, or redoes from scratch
```

**Path B: Lead takes over** (if time-critical)
```
Lead: Break team pattern temporarily
Manually complete the failing task
Continue rest of team as normal

This saves time but adds context cost. Document cost in metrics.tsv
```

**Path C: Abandon task** (if low priority)
```
Lead: TaskUpdate(task, status: "deleted")
SendMessage to remaining teammates: "Benchmark task cancelled, focus on review"
```

**Step 4: Post-mortem**
```
Add to failure-modes.md and project MEMORY.md:
- Failure: [what happened]
- Cause: [why - check logs if possible]
- Recovery: [what lead did]
- Prevention: [what to do differently]
```

### Prevention
- Check teammate progress manually every 1-2 hours during long-running tasks
- Add PostToolUseFailure hook to capture error context and retry info
- Define task timeouts: "Benchmark should complete within 30 min, alert lead if not"

---

## Failure Mode 3: Context Loss After Compaction

### Detection
- Teammate resumes session after compaction
- Forgets current branch or optimization goal
- Asks "what was I working on?" or implements wrong thing
- Compaction removed critical state info (branch name, status, next step)

### Root Causes
- **PreCompact hook not configured**: Critical state not snapshotted before compaction
- **SessionStart hook not re-injecting**: State not restored after compaction
- **CLAUDE.md "Compact instructions" too aggressive**: Dropped branch/status info
- **Teammate MEMORY.md not updated**: No context to fall back to

### Example Scenario
```
Optimizer: Working on perf/batch-size, got 40% gain, was about to benchmark
Context at 150k tokens → compaction fires
PreCompact hook missing → branch/status not saved
SessionStart hook resumes but status.md not injected

Optimizer resumes: Reads compacted summary, can't find branch name
Runs `git branch | grep perf`, but with many perf/* branches cannot tell which one was active
Starts on wrong branch or redoes profiling
```

### Recovery Procedure

**Step 1: Restore context** (5 min)
```
Teammate: Check available context in order:

Option A: Read .codeflash/{teammember}/{org}/{project}/status.md
Should say "Current branch: perf/batch-size"
Should say "Latest: 40% throughput gain, ready for benchmark"

Option B: Read own MEMORY.md
Should have notes about findings, branch, next steps

Option C: Check git
git branch (shows current and recent branches)
git log --oneline | head (shows recent commits with messages)

Option D: Read compacted summary in session
Look for "Optimization" or branch/result references
```

**Step 2: Verify before continuing** (2 min)
```bash
# Confirm right branch
git status
git branch -v

# Confirm right state
git log -1 --format=fuller

# Confirm results exist
ls -la .codeflash/{teammember}/{org}/{project}/data/

# Read what lead said in last interaction
# (Look for approval, next steps in task description)
```

**Step 3: Document if context was lost**
```
If context had to be reconstructed from git/files instead of summary:
- Note what was missing (branch? status? results?)
- Check PreCompact hook configuration
- Check SessionStart hook configuration

Add to failure-modes.md:
- Lost: [branch name / status / results]
- Recovered from: [git log / MEMORY.md / files]
- Fix: Configure PreCompact hook to snapshot state
```

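A rough sketch of what "snapshot state" could mean in practice: a script, callable from a PreCompact hook, that writes the current branch and latest result to status.md in the `Current branch:` / `Latest:` form quoted in Step 1, plus a parser a SessionStart hook could use to read it back. The `Next step:` field, the function names, and the way a hook would invoke the script are assumptions for illustration; only `git rev-parse --abbrev-ref HEAD` is a standard git command.

```python
import subprocess
from pathlib import Path
from typing import Dict


def current_branch() -> str:
    """Ask git for the name of the checked-out branch."""
    result = subprocess.run(
        ["git", "rev-parse", "--abbrev-ref", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def render_status(branch: str, latest: str, next_step: str) -> str:
    """Render the fields a resumed session needs, one 'Key: value' per line."""
    return (
        f"Current branch: {branch}\n"
        f"Latest: {latest}\n"
        f"Next step: {next_step}\n"  # assumed field, not shown in the original status.md
    )


def parse_status(text: str) -> Dict[str, str]:
    """Read 'Key: value' lines back into a dictionary."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key] = value
    return fields


def snapshot(status_path: Path, latest: str, next_step: str) -> None:
    """Write status.md for the current branch (call this before compaction)."""
    status_path.parent.mkdir(parents=True, exist_ok=True)
    status_path.write_text(render_status(current_branch(), latest, next_step))


# Round trip using the values from the example scenario (no git call needed here).
text = render_status(
    "perf/batch-size",
    "40% throughput gain, ready for benchmark",
    "run benchmark",
)
print(text, end="")
print(parse_status(text)["Current branch"])
```
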
**Step 4: Continue work** (normal)
```
Now that context is restored, proceed normally
```

### Prevention
- **Configure PreCompact hook** to snapshot before compaction
- **Configure SessionStart hook** to inject status after compaction
- **Keep MEMORY.md updated** with current branch and findings as you work
- **Update .codeflash/{teammember}/{org}/{project}/status.md** regularly, not just at handoff

---

## Failure Mode 4: Stale Teammate Results

### Detection
- Lead reviews completed task 1-2 hours after completion
- Requirements have changed since task started
- Teammate's results are outdated
- Lead asks for a redo and is frustrated by the wasted time

### Root Causes
- **No real-time notification**: Lead doesn't check TaskList frequently
- **Long-running tasks**: 2-3 hour benchmarks, teammates finish but lead is not watching
- **Requirements evolved**: New discovery mid-task, teammates didn't know to adjust
- **Lead context lost**: Lead forgot teammate was working on this, started duplicate work

### Example Scenario
```
Time 1:00 PM: Lead creates benchmark task for optimizer
"Benchmark batch-size 32 on typeagent VM"

Time 1:30 PM: Optimizer implements, TaskUpdate(status: "completed")
Results show batch-size 32 gives 40% gain

Time 2:30 PM: Lead discovers new issue: "Actually, test variance is 15%, need < 5%"
Lead contacts optimizer: "We need tighter tolerance"

Time 3:00 PM: Optimizer: "I already finished, results are done"
Lead: "Those are worthless now, redo with variance validation"
Wasted 1.5 hours of benchmarking
```

### Recovery Procedure

**Step 1: Assess if results are salvageable** (5 min)
```
Lead: Can we adjust the results, or truly need redo?
- If just need re-analysis: Have teammate post-process
- If need re-benchmarking: Full redo required
```

**Step 2: Create new task with updated requirements** (5 min)
```
Lead: TaskCreate("Re-benchmark batch-size 32 with variance < 5%",
  owner: "optimizer",
  description: "Previous run showed 40% gain but variance 15%. Need tighter tolerance. Run at least 10 iterations.")

SendMessage(to: "optimizer"): "Requirements refined, new task created with details"
```

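The "variance < 5%" requirement in the new task can be made checkable. The sketch below interprets it as the coefficient of variation (sample standard deviation divided by the mean) across benchmark runs; that interpretation is an assumption, since the source does not say how variance is measured. The function names and sample timings are likewise illustrative. The minimum of 10 runs matches the task description above.

```python
import statistics
from typing import List


def coefficient_of_variation(samples: List[float]) -> float:
    """Sample standard deviation expressed as a fraction of the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)


def benchmark_is_stable(
    samples: List[float],
    max_cv: float = 0.05,
    min_runs: int = 10,
) -> bool:
    """True only if there are enough runs and their spread is below max_cv."""
    if len(samples) < min_runs:
        return False
    return coefficient_of_variation(samples) < max_cv


# Made-up timings (ms) for illustration: one tight run set, one noisy run set.
tight = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100]
noisy = [100, 130, 80, 115, 90, 125, 85, 110, 95, 120]

for name, samples in (("tight", tight), ("noisy", noisy)):
    cv = coefficient_of_variation(samples)
    print(f"{name}: cv={cv:.1%} stable={benchmark_is_stable(samples)}")
```
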
**Step 3: Reuse previous findings** (5 min)
```
Optimizer: Reads new task description
Reuses MEMORY from previous attempt
Focuses on variance reduction (more iterations, control environment)
Much faster than cold start
```

**Step 4: Document and prevent**
```
Add to failure-modes.md:
- Cause: Requirements changed mid-task, lead didn't notify
- Prevention: Lead checks TaskList every 30-60 min for long-running tasks
- Prevention: SendMessage "Requirements changed, task TBD" as soon as the change is discovered
```

### Prevention
- **Set expectation upfront**: "Lead will check TaskList every hour for in-progress tasks"
- **Use SendMessage proactively**: If requirements change, contact teammate immediately
- **Keep lead engaged**: Don't create tasks and disappear for 3 hours
- **Milestone tracking**: For long tasks, ask for mid-task updates (e.g., "benchmark started, ETA 1 hour")

---

## Failure Mode 5: Over-Specified Compaction

### Detection
- After compaction + resume, teammate forgets key results (performance numbers, branch state)
- Compaction summary drops critical info
- Teammate re-does work they'd already done

### Root Causes
- **CLAUDE.md "Compact instructions" too aggressive**: Dropped results to save tokens
- **PreCompact hook not configured**: Key info not snapshotted before summarization
- **Summary algorithm dropped wrong info**: All outputs removed, but results needed

### Example Scenario
```
Optimizer: Profiled code, found O(n²) loop, results in .codeflash/data/profile.json
Session context at 140k tokens
Compaction fires, summary keeps "O(n²) found" but drops exact numbers

After compaction:
Optimizer: Compacted summary says "bottleneck found" but no numbers
Teammate starts profiling again (déjà vu)
Lead: "I already have the profile data!"
```

### Recovery Procedure

**Step 1: Check what was actually lost** (2 min)
```bash
# Do the files still exist?
ls -la .codeflash/{teammember}/{org}/{project}/data/profile.json
head .codeflash/{teammember}/{org}/{project}/data/profile.json

# Is it in teammate MEMORY.md?
grep -A 5 "O(n²)" MEMORY.md

# Is it in project MEMORY.md?
grep -A 5 "hotspot" MEMORY.md
```

**Step 2: If files/memory exist, retrieve from there** (2 min)
```
Teammate: Reference the files instead of repeating work
Use previous profile data as starting point for next phase
This is why MEMORY.md should be updated regularly
```

**Step 3: If truly lost, redo** (time-dependent)
```
Quick redo (< 5 min): Profile again, it's fast
Expensive redo (> 30 min): Benchmark or full optimization
→ Lead should have caught this, expensive oversight
```

**Step 4: Adjust compaction settings**
```
Edit CLAUDE.md root:

# Current (too aggressive)
# Compact instructions
When you are using compact, please focus on test output and code changes

# Better (preserve results)
# Compact instructions
When compacting, preserve:
1. Benchmark results and performance deltas
2. Code changes made (git diffs)
3. Active branch state and next steps
4. Key findings and hotspots
Drop: verbose tool output, full file reads, intermediate test runs
```

### Prevention
- **Adjust CLAUDE.md "Compact instructions"** to preserve benchmark numbers, branch state, findings
- **Configure PreCompact hook** to snapshot before compaction
- **Rely on MEMORY.md and status.md**, not compaction summary, for critical state
- **Re-inject state via SessionStart hook** after compaction

---

## Failure Mode 6: Lead Doesn't Wait for Teammates

### Detection
- Lead starts implementing while teammates still in "in_progress"
- Lead context balloons (own work + monitoring teammates)
- Teammate completes but lead doesn't see because distracted
- Task ordering breaks (lead gets ahead of profile results)

### Root Causes
- **Lead gets impatient**: "Teammates are taking too long, I'll start implementing"
- **No enforcement**: Stop hook not configured to block lead
- **Lead habit**: From single-session work where they do everything
- **Unclear role**: Lead thinks they should be coding, not just coordinating

### Example Scenario
```
Lead: TaskCreate("Profile typeagent", owner: "optimizer")

Lead: (3 minutes later) Starts implementing own optimization
Figures "optimizer will take a while, I can start now"
Context bloats with both leading AND implementing

Optimizer: Finishes profile after 20 min, posts findings
Lead misses it (absorbed in own implementation)
Optimizer does TaskUpdate but lead never reads

Result: Lead + Optimizer both working independently, no coordination
All parallelism benefits lost
```

### Recovery Procedure

**Step 1: Acknowledge**
```
Lead: Recognize you've started implementing while teammates work
This defeats the purpose of the team (kills the parallelism benefit)
```

**Step 2: Pause own work** (immediate)
```
Lead: Save current work (commit or stash)
TaskUpdate for any open tasks to "on_hold" or revert
Check TaskList for teammate progress
Read teammate MEMORY.md for findings
```

**Step 3: Reorient to coordination role** (5 min)
```
Lead: What have teammates completed?
What are they waiting for?
What decisions do they need from you?

Actions:
- TaskUpdate with approvals if needed
- SendMessage with feedback or next directions
- Don't start new implementation
```

**Step 4: Enforce for next session**
```
Add Stop hook to prevent this:
- If teammate tasks in "in_progress", block stop
- Prevent lead from working on anything other than coordination

Add to CLAUDE.md:
Lead role: Coordinate, approve, synthesize
Lead constraint: Do not implement while teammates work
```

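The decision behind the first Stop hook rule can be sketched as a pure function. It assumes task records are dictionaries with `owner`, `status`, and `subject` keys and that the lead's owner name is `"lead"`; both are hypothetical. How a hook script obtains the task list and reports a block back to the session depends on the hook interface, which is not covered here.

```python
from typing import Dict, List, Optional


def stop_block_reason(tasks: List[Dict], lead: str = "lead") -> Optional[str]:
    """Return a reason to block the lead from stopping, or None to allow it."""
    active = [
        task for task in tasks
        if task["status"] == "in_progress" and task["owner"] != lead
    ]
    if not active:
        return None
    names = ", ".join(f'{task["owner"]}: {task["subject"]}' for task in active)
    return (
        f"Teammate tasks still in progress ({names}). "
        "Coordinate and review instead of implementing."
    )


# Hypothetical task list: the optimizer is still profiling.
tasks = [
    {"owner": "optimizer", "status": "in_progress", "subject": "Profile typeagent"},
    {"owner": "reviewer", "status": "completed", "subject": "Review batch-size change"},
]

reason = stop_block_reason(tasks)
print(reason if reason else "No teammate work in progress; lead may stop.")
```
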
### Prevention
- **Add Stop hook** that blocks the lead while teammate tasks are "in_progress"
- **Make role explicit**: "Your job is to coordinate and review, not code"
- **Review this document** before starting team session

---

## Failure Mode 7: Task Never Completes (Ambiguous "Done")

### Detection
- Task stuck in "in_progress" for many hours
- Lead and teammate disagree on what "done" means
- Teammate asks "Is this good enough?" Lead says "no, keep going"
- No checklist to verify completion

### Root Causes
- **Vague task description**: "Profile Agent.execute()" (what counts as thorough?)
- **No deliverables defined**: Teammate doesn't know what evidence to provide
- **Unclear success criteria**: "Optimize for speed" (how fast is enough?)
- **Teammate over-thinking**: Perfectionism, "one more iteration"

### Example Scenario
```
Lead: TaskCreate("Profile Agent.execute()")

Optimizer: Profiles once, finds O(n²)
TaskUpdate(status: "completed")

Lead: Reads profile, says "Need more detail on cache misses"
Optimizer: "I did basic profiling, thought that was done"

Optimizer: Profiles again with cache analysis
TaskUpdate(status: "completed")

Lead: Says "What about GC? And memory allocations?"
Back and forth continues...
```

### Recovery Procedure

**Step 1: Define done explicitly**
```
Lead + Teammate: Discuss and agree:
- What is the minimum viable finding?
- What metrics matter (throughput, latency, memory)?
- How confident do we need to be?
- What's the next stopping point?
```

**Step 2: Create checklist**
```
Lead: Update task description with deliverables:

DELIVERABLES (task is DONE when):
- [ ] Profile collected on 1000-message corpus
- [ ] Top 3 hotspots identified (> 5% each)
- [ ] O-complexity analysis for each hotspot
- [ ] Cache hit rates measured
- [ ] Variance < 5% (min 3 runs)
- [ ] Findings documented in MEMORY.md
- [ ] Branch created or pushed if applicable

NOT required (out of scope):
- GC analysis (separate task)
- Memory profiling (separate task)
- Implementation (separate task)
```

**Step 3: Teammate completes to checklist**
```
Optimizer: Works to checklist
When all boxes checked, done
TaskUpdate(status: "completed")
Lead can't argue "but what about X" if X is not on the checklist
```

**Step 4: TaskCompleted hook validates**
```
Hook script:
- Read task deliverables checklist
- For "Profile" tasks: Verify profile.json exists, MEMORY.md has findings
- Accept completion only if deliverables present
```

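The hook script outlined in Step 4 might look something like the sketch below. It assumes the deliverables checklist lives in the task description as Markdown checkboxes (as in Step 2) and that the profile artifacts sit in a data directory next to a MEMORY.md file. The function names are hypothetical, and "MEMORY.md has findings" is approximated here as "the file exists and is not empty", which is a deliberately weak stand-in.

```python
import re
from pathlib import Path
from typing import List

# Matches "- [ ] item" and "- [x] item"; group 1 is the box, group 2 the item text.
CHECKBOX = re.compile(r"^\s*-\s*\[([ xX])\]\s*(.+)$")


def unchecked_items(description: str) -> List[str]:
    """Return the text of every checklist item that is still unchecked."""
    items = []
    for line in description.splitlines():
        match = CHECKBOX.match(line)
        if match and match.group(1) == " ":
            items.append(match.group(2).strip())
    return items


def missing_artifacts(data_dir: Path, memory_file: Path) -> List[str]:
    """Check the artifacts expected from a "Profile" task."""
    problems = []
    if not (data_dir / "profile.json").is_file():
        problems.append("profile.json not found")
    if not memory_file.is_file() or not memory_file.read_text().strip():
        problems.append("MEMORY.md missing or empty")
    return problems


def completion_problems(description: str, data_dir: Path, memory_file: Path) -> List[str]:
    """Return every reason to reject completion; an empty list means accept."""
    problems = [f"unchecked: {item}" for item in unchecked_items(description)]
    return problems + missing_artifacts(data_dir, memory_file)


# Hypothetical task description with one deliverable still open.
description = """DELIVERABLES (task is DONE when):
- [x] Profile collected on 1000-message corpus
- [x] Top 3 hotspots identified (> 5% each)
- [ ] Cache hit rates measured
"""

for problem in completion_problems(description, Path("does-not-exist"), Path("does-not-exist/MEMORY.md")):
    print(problem)
```
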
### Prevention
- **Always create task with deliverables checklist**
- **Use TaskCompleted hook** to validate before accepting completion
- **Define scope boundaries**: "This task does X, NOT Y" (prevents scope creep)

---

## Reference: Detecting Failure Early

### Signals to Watch For

| Signal | Possible Failure | Action |
|---|---|---|
| Task in "in_progress" > 2 hours, no updates | Silent failure or infinite loop | Check teammate logs, consider restart |
| Lead implementing while teammates work | Lead over-working, parallelism lost | Move lead's implementation to a separate task; enforce with a Stop hook |
| Task repeatedly marked "completed" and then reopened | Ambiguous "done", perfectionism | Define explicit checklist |
| All tasks "blocked", nothing moving | Deadlock, circular dependencies | Break cycle with lead override |
| Teammate forgets branch after compaction | Context loss | Check PreCompact/SessionStart hooks |
| Lead changes requirements mid-task | Stale results | Notify teammate immediately, update task |

### Monitoring Routine (Every 1-2 hours during team session)

```
Lead checklist:
- [ ] TaskList: All in_progress tasks have recent updates?
- [ ] Teammates: Any stuck or stalled?
- [ ] MEMORY.md: Updated with current findings?
- [ ] Status: Still on track, or need course correction?
- [ ] Blockers: Any teammate waiting on lead decision?

If anything looks wrong:
→ SendMessage to teammate: "How's it going? Any blockers?"
→ Escalate if no response in 30 min
```

---

See also: team-structure.md (prevent deadlock via config), agent-teams.md (Claude Code agent team docs)