Remove .codeflash/ from ruff extend-exclude, add per-file ignores
for .codeflash/, scripts/, evals/, and plugin/ (benchmark/script
patterns like print, eval, magic values). Remove shebangs. Widen
pre-commit hooks to check the full repo.
* Add Unstructured engagement report as uv workspace member
Three-tier Plotly Dash app (Executive Brief, Engineering Team, Full
Detail) with data in JSON, theme constants in theme.py, and Dash
production improvements (Google Fonts, clientside callbacks, meta tags).
Also: add .playwright-mcp/ to .gitignore, add reports/* ruff overrides,
remove tracked .codeflash/observability/read-tracker.
* Rewrite statusline to derive context from git state
Detects active area from changed files (reports, packages, plugin,
.codeflash, case-studies, evals), falls back to branch name convention
(perf/*, feat/*, fix/*), shows dirty indicator. Uses whoami for
cross-platform user detection.
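
A minimal sketch of the detection order described above, assuming the changed paths and the branch name have already been read from git. The function name `detect_context` and the tie-breaking rule (the area touched by the most files wins) are illustrative assumptions, not the actual statusline script.

```python
from collections import Counter

# Top-level directories treated as "areas", taken from the list above.
AREAS = ("reports", "packages", "plugin", ".codeflash", "case-studies", "evals")
# Branch prefixes used as the fallback convention.
BRANCH_PREFIXES = ("perf/", "feat/", "fix/")


def detect_context(changed_files, branch):
    """Return (label, dirty) for a statusline (hypothetical helper)."""
    dirty = bool(changed_files)
    top_levels = (path.split("/", 1)[0] for path in changed_files)
    hits = Counter(name for name in top_levels if name in AREAS)
    if hits:
        # Assumption: the area touched by the most changed files wins.
        return hits.most_common(1)[0][0], dirty
    for prefix in BRANCH_PREFIXES:
        if branch.startswith(prefix):
            return prefix.rstrip("/"), dirty
    # Nothing recognizable: fall back to the raw branch name.
    return branch, dirty
```

For example, `detect_context([], "perf/faster-import")` falls through to the branch convention and reports a clean tree.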
* Add pre-push lint rule to commit guidelines
* Exclude .codeflash/ from ruff linting
Benchmark and profiling scripts in .codeflash/ are scratch work, not
package source. Excluding them prevents CI failures from ad-hoc scripts.
* Run ruff format across packages, scripts, evals, and plugin refs
* Fix github-app async test failures in CI
Add asyncio_mode = "auto" to root pytest config so async tests
are detected when running from the repo root via uv run pytest packages/.
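
As an illustration of what auto mode changes, the sketch below shows an async test with no explicit marker. With pytest-asyncio in auto mode, a test of this shape is collected and awaited as-is. `fetch_status` is a hypothetical stand-in, not code from the github-app package.

```python
import asyncio


async def fetch_status():
    """Stand-in for an async call under test (hypothetical helper)."""
    await asyncio.sleep(0)
    return "ok"


# No @pytest.mark.asyncio decorator: in auto mode pytest-asyncio treats
# every async test function as an asyncio test.
async def test_fetch_status():
    assert await fetch_status() == "ok"
```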
* Fix mypy errors and apply ruff formatting across packages
Fix ast.FunctionDef calls missing type_params for Python 3.12+,
correct type: ignore error codes in _comparator and _plugin, and
run ruff format on all package source and test files.
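
For context, `ast.FunctionDef` gained a `type_params` field in Python 3.12. The sketch below shows one way to construct a node that is valid on both older and newer interpreters. `make_function_def` is a hypothetical helper, not the code changed in this commit.

```python
import ast
import sys


def make_function_def(name, body):
    """Build a zero-argument FunctionDef that is valid before and after 3.12."""
    extra = {}
    if sys.version_info >= (3, 12):
        # Field added in Python 3.12 for PEP 695 type parameters.
        extra["type_params"] = []
    return ast.FunctionDef(
        name=name,
        args=ast.arguments(
            posonlyargs=[], args=[], vararg=None,
            kwonlyargs=[], kw_defaults=[], kwarg=None, defaults=[],
        ),
        body=body,
        decorator_list=[],
        returns=None,
        type_comment=None,
        **extra,
    )


# Generate `def answer(): return 42`, compile it, and load it into a namespace.
module = ast.Module(
    body=[make_function_def("answer", [ast.Return(value=ast.Constant(value=42))])],
    type_ignores=[],
)
ast.fix_missing_locations(module)
namespace = {}
exec(compile(module, "<generated>", "exec"), namespace)
```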
* Switch CI to prek for lint/typecheck checks
Use j178/prek-action for consistent lint+typecheck (ruff check,
ruff format, interrogate, mypy) matching local pre-commit config.
Keep test as a separate parallel job for test-env support.
Replace packages-ci.yml and github-app-tests.yml with a single
ci.yml that calls the shared ci-python-uv reusable workflow.
Lint, typecheck, and test run as parallel jobs. Version check
stays local (needs fetch-depth: 0 + PR-only conditional).
Replace placeholder text ("No optimizations applied yet", empty PR table) with:
- CAS lz4 compression results (7-18x on realistic ML payloads)
- Upstream PR status (Netflix/metaflow#3090, open)
- Open questions on dependency management and forward compat
- Methodology, remaining targets, and lessons learned
- Rename case-studies/pypa/ → case-studies/python/ to match .codeflash/ convention
- Add case-studies/netflix/metaflow/summary.md (7-18x lz4 vs gzip)
- Add case-studies/unstructured/core-product/summary.md (14.6% latency, 2.1 GB memory)
- Update main README results table with all five case studies
Move vector search benchmarks out of main results into a Lessons Learned
section. The 3.7x-14.2x numbers were real but measured a non-bottleneck:
the maintainer confirmed model API calls and SQL dominate real latency.
Results section now only shows legitimate wins: import time (1.16x),
indexing pipeline (1.14-1.16x), and query batching (2.10-2.62x).
Add team member dimension to case study paths so multiple contributors
can track optimization data independently. Derives member from
git config user.name in session-start hooks.
- Move all case studies under .codeflash/krrt7/
- Rename pypa/pip → python/pip (org grouping)
- Update session-start hooks, docs, scripts, and references
* Move Go plugin overlay from languages/go/ to plugin/languages/go/
Aligns Go with the Java/Python/JavaScript convention where all language
overlays live under plugin/languages/<lang>/. The Makefile already
discovers from plugin/languages/* so Go is now included in builds.
* Remove accidental read-tracker changes
* Ignore .codeflash/observability/ in gitignore
* Add Java/Kotlin detection to top-level language router
Adds pom.xml, build.gradle, build.gradle.kts, settings.gradle, and
settings.gradle.kts as markers that route to the codeflash-java router.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
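
The marker-based routing above can be sketched as follows. The router itself is an agent prompt, so this Python is only an illustration of the rule; `detect_java_markers` and `route` are hypothetical names.

```python
from pathlib import Path

# Marker files listed in the commit body above.
JAVA_MARKERS = (
    "pom.xml",
    "build.gradle",
    "build.gradle.kts",
    "settings.gradle",
    "settings.gradle.kts",
)


def detect_java_markers(root):
    """Return the marker files found directly under root (empty list if none)."""
    root = Path(root)
    return [name for name in JAVA_MARKERS if (root / name).is_file()]


def route(root):
    """Return the router to delegate to, or None when no marker is present."""
    return "codeflash-java" if detect_java_markers(root) else None
```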
* Add Java/Kotlin agent definitions for all optimization domains
10 agents covering the full optimization pipeline:
- codeflash-java: router/team lead for domain detection
- codeflash-java-setup: environment detection (build tool, JDK, profiling tools)
- codeflash-java-deep: cross-domain optimizer (default)
- codeflash-java-cpu: data structures, algorithms, JIT deopt, JMH benchmarks
- codeflash-java-memory: heap/GC tuning, escape analysis, leak detection
- codeflash-java-async: virtual threads, lock contention, CompletableFuture
- codeflash-java-structure: class loading, JPMS, startup time, circular deps
- codeflash-java-scan: quick cross-domain diagnosis via JFR/jdeps/GC logs
- codeflash-java-ci: GitHub webhook handler for Java PRs
- codeflash-java-pr-prep: JMH benchmarks and PR body templates
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add Java domain reference guides for all optimization domains
6 guides covering deep domain knowledge for agent consumption:
- data-structures: collection selection, autoboxing, JIT patterns, sorting
- memory: JVM heap layout, GC algorithms and tuning, escape analysis, leaks
- async: virtual threads, structured concurrency, lock hierarchy, contention
- structure: class loading, JPMS, CDS/AppCDS, ServiceLoader, Spring startup
- database: JPA N+1, HikariCP, pagination, batch operations, EXPLAIN plans
- native: JNI, Panama FFM API, GraalVM native-image, Vector API
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add Java optimization skills: session launcher and JFR profiling
- codeflash-optimize: session launcher with start/resume/status/scan/review
- jfr-profiling: quick-action JFR profiling in cpu/alloc/wall modes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Slim Java agents to match Go's concise ~175-line pattern
Move inline code examples, antipattern encyclopedias, JMH templates,
and deep-dive sections from agent prompts into reference guides.
Agents now contain only: target tables, one-liner antipatterns,
reasoning checklists, profiling commands, and keep/discard trees.
Line counts (before → after):
cpu: 636 → 181
memory: 878 → 193
async: 578 → 165
structure: 532 → 167
deep: 507 → 186
scan: 440 → 163
Average: 595 → 176 (vs Go's 175)
Adds to data-structures/guide.md:
- Collection contract traps table
- Reflection → MethodHandle migration pattern
- JMH benchmark template
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Fix Makefile build: use rsync merge and portable sed -i
Two bugs in the build target:
1. cp -R created nested dirs (agents/agents/, references/references/)
instead of merging language overlay into shared base. Fix: rsync -a.
2. sed -i '' is macOS-only; fails silently on Linux. Fix: sed -i.bak
(works on both macOS and Linux), then delete .bak files.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add HANDOFF.md session lifecycle to Java agents
Java agents could read HANDOFF.md on resume but never wrote or
updated it. A session that hit plateau would lose all context —
what was tried, what worked, why it stopped, what to do next.
Changes:
- Deep agent: init HANDOFF.md on fresh start, record after each
experiment, write Stop Reason + learnings.md on session end
- Domain agents (CPU, memory, async, structure): record to
HANDOFF.md after each keep/discard, write session-end state
- Handoff template: make language-agnostic (was Python-specific),
add Session status, Strategy & Decisions, and Stop Reason fields
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Close 11 gaps between Java and Python plugins
Add missing sections to Java deep agent: experiment loop depth (12 steps),
library boundary breaking, Phase 0 environment setup, CI mode, pre-submit
review, adversarial review, team orchestration, cross-domain results schema,
and structured progress reporting.
Add polymorphic dispatch safety to CPU agent and data-structures guide.
Add diff hygiene to CPU agent. Add native reference to router.
Create two new reference files: library-replacement.md (Guava/Commons/
Jackson/Joda replacement tables) and team-orchestration.md (full dispatch
and merge protocol).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
- Setup agent now ensures .codeflash/ is in .gitignore before writing
session state files (prevents accidental commits of profiling artifacts)
- Router agent gets a Cleanup section: preserves learnings.md and
results.tsv across sessions, deletes transient files (HANDOFF.md,
setup.md, conventions.md, bench scripts), removes agent-memory dir
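
A rough sketch of the cleanup rule, assuming all session state lives in a single directory and that the agent-memory directory is named `agent-memory`. Bench scripts are left out because their naming is not stated here. `cleanup_session` is a hypothetical helper.

```python
import shutil
from pathlib import Path

# Kept across sessions.
PRESERVE = {"learnings.md", "results.tsv"}
# Deleted at cleanup time.
TRANSIENT_FILES = {"HANDOFF.md", "setup.md", "conventions.md"}
# Assumed directory name for agent memory.
TRANSIENT_DIRS = {"agent-memory"}


def cleanup_session(state_dir):
    """Delete transient session state; return the names that were removed."""
    removed = []
    for path in sorted(Path(state_dir).iterdir()):
        if path.name in PRESERVE:
            continue
        if path.is_file() and path.name in TRANSIENT_FILES:
            path.unlink()
            removed.append(path.name)
        elif path.is_dir() and path.name in TRANSIENT_DIRS:
            shutil.rmtree(path)
            removed.append(path.name)
    return removed
```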
Analyzed ~89 Claude Code sessions across 7 unstructured_org projects to
identify recurring failures and friction points, then applied fixes:
- Fix "ask then die" bug: skill now injects AUTONOMOUS MODE directive so
domain agents work without interactive questions that kill the Agent tool
- Fix git add -A: all 4 domain agents now stage specific files instead of
blindly staging everything (caused accidental commits of scratch files)
- Add pre-commit step: agents run pre-commit before every commit to catch
linting failures before CI (ruff/undersort failures were recurring)
- Add measurement methodology lock: prevents changing profiling flags
mid-experiment which created uninterpretable deltas
- Add branch state verification to router startup (prevents wrong-branch
confusion that wasted multiple sessions)
- Add multi-repo detection to router (original work spanned 4 repos)
- Add library vs application awareness to memory agent (prevents wasting
time on import-time optimizations in library projects)
- Add dependency resilience to setup agent (uv run --with isolation
warning, private PyPI failure guidance)
- Add PR text quality guidelines (sessions showed AI-sounding text that
required multiple user corrections)
- Add chart generation guidelines to pr-preparation.md
- Add context conservation rules (max 2 background tasks, use subagents)
- Add cross-session learnings template for .codeflash/learnings.md
- All domain agents now read learnings.md at startup
* feat: git memory, guard command, stuck recovery, batched setup
- Add git history review as step 1 of every experiment loop iteration
(read git log + diff to learn from past experiments and detect patterns)
- Add guard command as formal regression safety net (step 10) with
revert-and-rework protocol (max 2 attempts before discard)
- Add stuck state recovery protocol (5+ consecutive discards triggers
re-read of all files, results log analysis, goal re-check, combination
of past successes, and opposite-strategy attempts)
- Add batched setup questions rule to orchestrator (max 4 questions per
message, never one-at-a-time across round-trips)
- Update decision tree to include guard check before keep/discard
- Update orchestrator workflow to configure guard during session init
Inspired by patterns from uditgoenka/autoresearch.
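
The guard and stuck-recovery rules above can be sketched as two small functions. This assumes "max 2 attempts" means two rework attempts after the first guard failure, and that the results log can be reduced to a list of "keep"/"discard" strings. `decide` and `is_stuck` are hypothetical names.

```python
MAX_REWORK_ATTEMPTS = 2
STUCK_THRESHOLD = 5


def decide(guard, rework, max_attempts=MAX_REWORK_ATTEMPTS):
    """Return "keep" if the guard passes, reworking a limited number of times.

    guard:  callable returning True when the guard command reports no regression.
    rework: callable that reverts the change and applies a reworked version.
    """
    if guard():
        return "keep"
    for _ in range(max_attempts):
        rework()
        if guard():
            return "keep"
    return "discard"


def is_stuck(outcomes, threshold=STUCK_THRESHOLD):
    """True when the most recent `threshold` outcomes are all discards."""
    tail = outcomes[-threshold:]
    return len(tail) == threshold and all(o == "discard" for o in tail)
```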
* feat: add git history, guard, and config audit steps to cpu/memory/structure agents
Align experiment loops in all domain agents with async agent.
Each now includes step 1 (review git history), guard command
(revert+rework on failure), and config audit after KEEP with
domain-specific guidance.
* feat: add CI plugin validation workflow (#3)
Uses claude-code-action with plugin-dev plugin to validate plugin
structure, agent consistency, eval manifests, and skills on every PR.
Includes @claude mention support for interactive fixes.
* fix: correct plugin marketplace name for CI validation
plugin-dev is in claude-plugins-official, not claude-code-plugins.
Also adds plugin_marketplaces URL for discovery.
* fix: expand allowed tools for validation workflow
Add gh pr comment, gh api, cat, python3, jq to allowed tools so
Claude can post PR summary comments and subagents can function.
* fix: enable track_progress and show_full_output for debugging
* fix: remove colons from Bash glob patterns in validate allowedTools
The gh command patterns used colons (e.g. Bash(gh pr diff:*)) which
are treated as literal characters, so they never matched actual
commands like `gh pr diff 2 --name-only`. This caused 1 permission
denial per CI run and prevented the summary comment from posting.
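
To illustrate why a literal colon prevents a match, the sketch below uses Python's `fnmatch` as a stand-in for the action's pattern matcher. The real matcher is not shown here, and the colon-free pattern is a hypothetical replacement, not necessarily the one committed.

```python
from fnmatch import fnmatchcase

command = "gh pr diff 2 --name-only"
colon_pattern = "gh pr diff:*"   # pattern from the commit body above
plain_pattern = "gh pr diff *"   # hypothetical colon-free replacement


def matches(cmd, pattern):
    """Glob-match a command line, treating ':' like any other literal character."""
    return fnmatchcase(cmd, pattern)


colon_result = matches(command, colon_pattern)
plain_result = matches(command, plain_pattern)
```

Under these glob semantics the colon pattern requires a literal `:` right after `diff`, which the real command never contains.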
* fix: fail CI job when validation finds issues
Add verdict step that writes PASS/FAIL to a file, with a follow-up
workflow step that exits 1 on FAIL. Previously validation reported
issues in comments but the job always succeeded.
* fix: remove double-commit contradiction in async agent and shared base
Step 4/5/6 (Implement) said "and commit" but the commit-after-KEEP
step later says "Do NOT commit discards." This meant discarded
experiments were still committed. CPU/memory/structure agents already
had it right — only commit at the KEEP step.
* fix: treat validation warnings as blocking failures
Warnings were previously non-blocking — the verdict step only
checked for "issues that need fixing." Now any warning also
triggers FAIL.
* fix: address plugin-validator warnings
- Declare context7 MCP server in plugin.json (domain agents use it)
- Change codeflash-setup color from green to red (collision with router)
- Add memory: project and example block to codeflash-setup frontmatter
* feat: add eval regression testing (Layer 3)
- baseline-scores.json: checked-in expected scores + min thresholds
for ranking, memory-hard, memory-misdirection
- check-regression.sh: orchestrator that runs evals, scores them, and
compares to baselines (exits 1 if any score < min)
- eval-regression.yml: on-demand CI workflow (workflow_dispatch) with
Bedrock OIDC, artifact upload, and job summary table
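
The comparison step can be sketched as below. The JSON layout (an `expected` and a `min` value per eval) and the numbers are assumptions for illustration; the real baseline-scores.json schema and check-regression.sh logic are not reproduced here.

```python
import json

# Hypothetical baseline file contents; schema and values are illustrative only.
BASELINES_JSON = """
{
  "ranking": {"expected": 8, "min": 7},
  "memory-hard": {"expected": 8, "min": 7},
  "memory-misdirection": {"expected": 8, "min": 7}
}
"""


def find_regressions(baselines, scores):
    """Return {eval_name: (score, minimum)} for every eval below its minimum."""
    failures = {}
    for name, entry in baselines.items():
        score = scores.get(name)
        # A missing score counts as a regression (assumption).
        if score is None or score < entry["min"]:
            failures[name] = (score, entry["min"])
    return failures


def exit_code(baselines, scores):
    """Mirror the shell convention: 1 if any score is below its minimum."""
    return 1 if find_regressions(baselines, scores) else 0


baselines = json.loads(BASELINES_JSON)
```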
* fix: address remaining plugin-validator warnings
- Add memory: project to codeflash-memory.md (matches other agents)
- Add allowed-tools to memray-profiling skill
- Fix blank line in codeflash-setup.md frontmatter
- Add git log -20 --stat to all 4 domain agents' git history step
- Add step numbering note to async reference experiment-loop.md
* feat: add deterministic scoring for profiler and ranking criteria
Add session-text-based auto-scoring that overrides LLM grades for
mechanically verifiable criteria:
- used_memory_profiler: grep for memray/tracemalloc Bash commands
- profiled_iteratively: count distinct profiling runs (1 run = 1 pt, 2+ runs = full credit)
- built_ranked_list_with_impact_pct: detect cProfile + ranking output
These anchor 2-4 points per eval deterministically, reducing LLM
variance. Baseline thresholds tightened from min=6 to min=7.
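
A minimal sketch of the two memory-profiler criteria, assuming the session's Bash commands are available as a list of strings and that "distinct runs" means distinct command lines. The point values passed as defaults are placeholders, and the function names are hypothetical.

```python
import re

# Any Bash command mentioning memray or tracemalloc counts as a profiler run.
PROFILER_PATTERN = re.compile(r"\b(memray|tracemalloc)\b")


def profiling_runs(commands):
    """Return the distinct commands that invoke a memory profiler."""
    return {cmd.strip() for cmd in commands if PROFILER_PATTERN.search(cmd)}


def score_used_memory_profiler(commands, points=1):
    """Full points if any profiler command appears, else 0."""
    return points if profiling_runs(commands) else 0


def score_profiled_iteratively(commands, full_points=2):
    """0 for no runs, 1 point for a single run, full credit for two or more."""
    runs = len(profiling_runs(commands))
    if runs == 0:
        return 0
    return 1 if runs == 1 else full_points
```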
* fix: parse verdict from PR comment instead of temp file
Claude writes "Verdict: FAIL/PASS" in the PR comment but doesn't
execute the python3 file-write command. The check step now reads
the claude[bot] comment via gh api and greps for the verdict line.
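
The verdict check can be sketched as below, assuming the comment body has already been fetched. The regex tolerates Markdown emphasis around the word, and treating a missing verdict as a failure is an assumption. `parse_verdict` and `exit_status` are hypothetical names.

```python
import re

# Matches lines such as "Verdict: PASS" or "**Verdict:** FAIL".
VERDICT_LINE = re.compile(r"^\W*Verdict\W+(PASS|FAIL)\b", re.IGNORECASE | re.MULTILINE)


def parse_verdict(comment_body):
    """Return "PASS" or "FAIL" from the last verdict line, or None if absent."""
    found = VERDICT_LINE.findall(comment_body)
    return found[-1].upper() if found else None


def exit_status(comment_body):
    """0 only on an explicit PASS; a missing verdict fails the job (assumption)."""
    return 0 if parse_verdict(comment_body) == "PASS" else 1
```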
* fix: address all remaining validator findings for PR #2
- Remove non-standard fields (repository, license, keywords) from plugin.json
- Pin context7 MCP to @2.1.4 instead of @latest
- Add context7 fallback note in router agent
- Remove unused Read from codeflash-optimize allowed-tools
- Make AskUserQuestion usage explicit in skill body
- Add tracemalloc to memray-profiling trigger phrases
- List all reference files in memray-profiling skill
* fix: wire stuck state recovery into domain agents, fix skill allowed-tools
- Add 5+ consecutive discard stuck recovery protocol to all 4 domain
agents (cpu, memory, async, structure) inline experiment loops,
matching the shared base definition
- Add Read and Bash to codeflash-optimize skill allowed-tools so it
can inspect session state before delegating
- Add Write to memray-profiling skill allowed-tools so it can create
profiling harness scripts
* fix: sync versions, add missing tools to router and skills
- Sync marketplace.json metadata version to 0.1.0 to match plugin.json
- Add Write and Edit to codeflash.md router tools (needed for
writing .codeflash/conventions.md during setup)
- Remove Bash from codeflash-optimize skill (least-privilege: all
execution is delegated via Agent)
- Add Grep and Glob to memray-profiling skill (needed for finding
test files, capture outputs, and config files)
* fix: restore plugin.json fields, revert marketplace version, relax validator verdict
- Restore repository, license, keywords in plugin.json (accidentally
removed when mcpServers was added)
- Revert marketplace.json metadata.version to 1.0.0 (collection
version, distinct from individual plugin version 0.1.0)
- Change validator verdict rule: only FAIL on major issues, not
warnings — prevents the LLM validator from blocking on subjective
minor findings each run
- Shallow clone (--no-checkout --depth 1 + fetch specific commit) instead
of full clone — 15s vs 2+ min for large repos like codeflash-internal
- Cache clone in evals/repos/<name>/workspace/, cp -r for each run
- Use gh repo clone for private repo auth
- Fix eval prompt to skip skill's AskUserQuestion step in non-interactive mode
- Gitignore workspace/ dirs
- Update intro.md with v2 eval docs
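
The shallow-clone sequence above might look like the following command list. This is a sketch, not the contents of run-eval.sh: the function name and argument layout are assumptions, and the commands are only built here, not executed.

```python
def shallow_clone_commands(repo, commit, dest):
    """Build the commands for a depth-1 clone pinned to one commit.

    repo:   "owner/name" as accepted by `gh repo clone`
    commit: full commit SHA to check out
    dest:   target directory for the cached clone
    """
    return [
        # Flags after "--" are passed through to git clone.
        ["gh", "repo", "clone", repo, dest, "--", "--no-checkout", "--depth", "1"],
        # Fetch only the pinned commit, again without history.
        ["git", "-C", dest, "fetch", "--depth", "1", "origin", commit],
        ["git", "-C", dest, "checkout", commit],
    ]


commands = shallow_clone_commands("example-org/example-repo", "0123abc", "workspace")
```

Each entry can be passed to `subprocess.run(cmd, check=True)` in order; the repository name and SHA shown are placeholders.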
Add support for v2 evals that clone a real repo at a specific commit
instead of using bundled template source. The agent handles setup,
diagnosis, and fixing on its own.
- run-eval.sh: v1/v2 dispatch, repos/ directory, prompt from manifest
- First v2 eval: codeflash-internal psycopg serialization (PR #2489)
- EVAL-V2-SKETCH.md: design doc for the v2 eval system
- intro.md: repo onboarding guide
The async agent wrote a benchmark that tested the problem (psycopg2
single connection) but didn't validate the fix (thread_sensitive=False),
and missed dead config (conn_health_checks) after a driver migration.
- Add step 6 (benchmark fidelity) to shared and async experiment loops
- Add step 14 (config audit after KEEP) to shared and async loops
- Add architectural change workflow for driver/infrastructure migrations
- Strengthen profiling rule: never present benchmarking as optional
- Add config audit to async reasoning checklist (question #9)