# PR Preparation
After the experiment loop plateaus, prepare upstream PRs for kept optimizations.
## Workflow
### 1. Inventory
Build a table of kept optimizations → target repos → PR status:
| # | Optimization | Target repo | PR status |
|---|-------------|-------------|-----------|
| 1 | description | repo-name | needs PR |
| 2 | description | repo-name | PR #N opened |
For each optimization without a PR:

- Check upstream — has the code already been changed on main? (`gh api repos/ORG/REPO/contents/PATH --jq '.content' | base64 -d | grep ...`)
- Check existing PRs — is there already a PR covering this area? (`gh pr list --repo ORG/REPO --state all --search "relevant keywords"`)
- Decide: create a new PR, fold into an existing PR, or skip.
### 2. Folding into existing PRs
When a new optimization targets the same function/file as an existing open PR, fold it in rather than creating a separate PR:
- Check out the existing PR branch
- Apply the additional change
- Commit with a clear message explaining the addition
- Re-run the benchmark — this is critical. The PR's benchmark data must reflect ALL changes in the PR, not just the original ones.
- Update the PR description with new benchmark results
- Push
### 3. Comparative benchmarks
When a PR accumulates multiple changes, run a multi-variant benchmark showing each change's incremental contribution:
- Variant 1: Baseline (upstream main, no changes)
- Variant 2: Original PR changes only
- Variant 3: Original + new changes (full PR)
This lets reviewers understand what each change contributes independently.
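A minimal sketch of a driver for such a comparison, assuming each variant is a git ref and a hypothetical `bench.py` prints a JSON object with a `throughput` field (the ref names, script name, and output format are all illustrative assumptions):

```python
import json
import subprocess

# Variant label -> git ref (illustrative; "pr-branch~1" assumes the
# original change is exactly one commit behind the branch tip).
VARIANTS = {
    "baseline (upstream main)": "upstream/main",
    "original PR changes only": "pr-branch~1",
    "full PR (original + new)": "pr-branch",
}

results = {}
for label, ref in VARIANTS.items():
    # Check out the variant, then run the same benchmark script against it.
    subprocess.run(["git", "checkout", "--quiet", ref], check=True)
    out = subprocess.run(
        ["python", "bench.py", "--runs", "5"],
        check=True, capture_output=True, text=True,
    )
    results[label] = json.loads(out.stdout)["throughput"]

# Report each variant relative to the baseline.
baseline = results["baseline (upstream main)"]
for label, throughput in results.items():
    delta = 100.0 * (throughput / baseline - 1)
    print(f"{label:30s} {throughput:10.1f} req/s ({delta:+.1f}%)")
```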
#### Benchmark script pattern
Write a self-contained script (sketched below) that:

- Creates realistic test inputs (correct data sizes and volumes)
- Runs each variant under the domain's profiling tool and parses output
- Supports `--runs N` for repeated measurements and `--report` for chart generation
- Uses `tempfile.TemporaryDirectory()` for all intermediate files
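A minimal skeleton under those constraints, using wall-clock timing in place of a domain profiling tool; the two workload functions are placeholders for the real code paths under comparison:

```python
"""Self-contained benchmark skeleton; workload bodies are placeholders."""
import argparse
import statistics
import tempfile
import time
from pathlib import Path


def baseline(workdir: Path):
    # Placeholder: build realistic inputs and run the unmodified code path.
    (workdir / "out.bin").write_bytes(b"x" * 10_000_000)


def optimized(workdir: Path):
    # Placeholder: same workload through the optimized code path.
    (workdir / "out.bin").write_bytes(b"x" * 10_000_000)


def measure(fn, runs: int, workdir: Path) -> float:
    # Repeat the measurement and keep the median to dampen noise.
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(workdir)
        times.append(time.perf_counter() - start)
    return statistics.median(times)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--runs", type=int, default=5)
    parser.add_argument("--report", action="store_true",
                        help="also generate a comparison chart")
    args = parser.parse_args()

    # All intermediate files live in a temp dir and are cleaned up on exit.
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        results = {name: measure(fn, args.runs, workdir)
                   for name, fn in [("baseline", baseline),
                                    ("optimized", optimized)]}

    for name, seconds in results.items():
        print(f"{name:12s} {seconds:.3f}s (median of {args.runs} runs)")
    if args.report:
        pass  # chart generation goes here; see section 8 for guidelines


if __name__ == "__main__":
    main()
```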
### 4. PR body structure
````markdown
## Summary

<1-3 bullet points describing what changed and why>

## Details

<Technical explanation: what the code does, why the old version was suboptimal,
how the new version improves it, any safety considerations>

## Benchmark

<Chart image or text table with exact numbers>
<Platform/Python version/tool info>

## Test plan

- [x] Test A — PASSED
- [x] Test B — PASSED (no regression)

### Reproduce

<details>
<summary>Benchmark script</summary>

```python
# Full self-contained benchmark script
```

</details>
````
### 5. PR description updates
When folding changes into an existing PR, update the entire PR body — not just append. The PR body should read as a coherent description of everything in the PR. Specifically update:
- Summary bullets to mention all changes
- Benchmark table/chart with fresh numbers covering all changes
- Changelog entry if the PR includes one
Use `gh pr edit NUMBER --repo ORG/REPO --body "$(cat <<'EOF' ... EOF)"` to replace the body.
### 6. Conventions
Each domain agent defines its own branch prefix and PR title prefix. Common rules:
- Do NOT open PRs yourself unless the user explicitly asks. Prepare the branch, push it, tell the user it's ready. Do NOT push branches or create PRs as a "next step" — wait for explicit instruction.
- Keep PR changed files minimal — only the actual code change, not benchmark scripts or images.
- Benchmark scripts go inline in the PR body `<details>` block.
#### Writing quality
Write PR descriptions like a human engineer, not a summarizer:
- Be specific: "Replaces HuggingFace's RTDetrImageProcessor with torchvision transforms to eliminate 110 MiB of duplicate weight loading" — not "Improves memory efficiency of image processing."
- Lead with the technical mechanism, not the benefit. Reviewers want to know WHAT you did, not that it's "an improvement."
- No generic headings like "Summary", "Overview", "Key Changes" unless the PR template requires them. If the change is simple enough for 2 sentences, use 2 sentences.
- Don't over-explain the problem. Assume the reviewer knows the codebase. Explain WHY your approach works, not what the code does line-by-line.
### 7. Chart hosting (if available)
If the project has an image hosting setup (e.g., an orphan branch for assets), use it:
```bash
# Upload
gh api repos/ORG/REPO/contents/images/{name}.png \
  --method PUT \
  -f message="add {name} benchmark chart" \
  -f content="$(base64 -i /path/to/chart.png)" \
  -f branch=assets-branch

# To update an existing image, include the SHA:
SHA=$(gh api repos/ORG/REPO/contents/images/{name}.png -q '.sha' -H "Accept: application/vnd.github.v3+json" --method GET -f ref=assets-branch)
gh api repos/ORG/REPO/contents/images/{name}.png \
  --method PUT \
  -f message="update {name}" \
  -f content="$(base64 -i /path/to/chart.png)" \
  -f branch=assets-branch \
  -f sha="$SHA"

# Reference in PR body:
# ![{name}](https://raw.githubusercontent.com/ORG/REPO/assets-branch/images/{name}.png)
```
Otherwise, describe the results in text tables only.
### 8. Chart generation guidelines
When generating benchmark charts (e.g., with plotly, matplotlib), follow these rules (a sketch follows the list):
- Separate concerns: Use distinct charts for different metrics (throughput vs memory, latency vs RSS). Combined charts are hard to read and require multiple iterations.
- Plain-language axis labels: Use "Peak Memory (MiB)" not "RSS delta". Use "Throughput (req/s)" not "ops".
- Include the baseline: Always show the baseline variant as the first bar/line for comparison.
- Annotate absolute values: Don't just show bars — label each with the actual number.
- Keep it simple: Bar charts for before/after comparisons. Line charts only for scaling tests (varying N). No 3D charts, no unnecessary styling.
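As an illustration of these rules, a minimal matplotlib sketch; the variant names and numbers are placeholders, not real measurements:

```python
import matplotlib.pyplot as plt

variants = ["Baseline", "Original PR", "Full PR"]  # baseline shown first
peak_mib = [512.0, 430.0, 402.0]                   # placeholder values

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(variants, peak_mib)
ax.set_ylabel("Peak Memory (MiB)")   # plain-language axis label
ax.bar_label(bars, fmt="%.0f MiB")   # annotate each bar with its value
fig.tight_layout()
fig.savefig("peak_memory.png", dpi=150)
```

One bar chart, one metric; a second metric (e.g., throughput) would get its own chart rather than a second axis on this one.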