codeflash-agent

mirror of https://github.com/codeflash-ai/codeflash-agent.git synced 2026-05-04 18:25:19 +00:00

Author	SHA1	Message	Date
Kevin Turcios	20f6c59f05	Lint and format entire repo, not just packages (#23 ) Remove .codeflash/ from ruff extend-exclude, add per-file ignores for .codeflash/, scripts/, evals/, and plugin/ (benchmark/script patterns like print, eval, magic values). Remove shebangs. Widen pre-commit hooks to check the full repo.	2026-04-15 03:16:15 -05:00
Kevin Turcios	33faedf427	Add Unstructured report, rewrite statusline, format evals/scripts (#20 ) * Add Unstructured engagement report as uv workspace member Three-tier Plotly Dash app (Executive Brief, Engineering Team, Full Detail) with data in JSON, theme constants in theme.py, and Dash production improvements (Google Fonts, clientside callbacks, meta tags). Also: add .playwright-mcp/ to .gitignore, add reports/* ruff overrides, remove tracked .codeflash/observability/read-tracker. * Rewrite statusline to derive context from git state Detects active area from changed files (reports, packages, plugin, .codeflash, case-studies, evals), falls back to branch name convention (perf/, feat/, fix/), shows dirty indicator. Uses whoami for cross-platform user detection. Add pre-push lint rule to commit guidelines * Exclude .codeflash/ from ruff linting Benchmark and profiling scripts in .codeflash/ are scratch work, not package source. Excluding them prevents CI failures from ad-hoc scripts. * Run ruff format across packages, scripts, evals, and plugin refs * Fix github-app async test failures in CI Add asyncio_mode = "auto" to root pytest config so async tests are detected when running from the repo root via uv run pytest packages/.	2026-04-15 03:06:16 -05:00
Kevin Turcios	2caaf6af7c	Fix CI: mypy errors, ruff formatting, switch to prek (#22 ) * Fix mypy errors and apply ruff formatting across packages Fix ast.FunctionDef calls missing type_params for Python 3.12+, correct type: ignore error codes in _comparator and _plugin, and run ruff format on all package source and test files. * Switch CI to prek for lint/typecheck checks Use j178/prek-action for consistent lint+typecheck (ruff check, ruff format, interrogate, mypy) matching local pre-commit config. Keep test as a separate parallel job for test-env support.	2026-04-15 02:52:47 -05:00
Kevin Turcios	a1710f7f92	Adopt shared CI workflow (#21 ) Replace packages-ci.yml and github-app-tests.yml with a single ci.yml that calls the shared ci-python-uv reusable workflow. Lint, typecheck, and test run as parallel jobs. Version check stays local (needs fetch-depth: 0 + PR-only conditional).	2026-04-15 02:36:17 -05:00
Kevin Turcios	7d86202524	Update metaflow README with actual results and PR status (#19 ) Replace placeholder text ("No optimizations applied yet", empty PR table) with: - CAS lz4 compression results (7-18x on realistic ML payloads) - Upstream PR status (Netflix/metaflow#3090, open) - Open questions on dependency management and forward compat - Methodology, remaining targets, and lessons learned	2026-04-14 23:41:55 -05:00
Kevin Turcios	1734199e85	Add metaflow and core-product case studies, rename pypa to python (#18 ) - Rename case-studies/pypa/ → case-studies/python/ to match .codeflash/ convention - Add case-studies/netflix/metaflow/summary.md (7-18x lz4 vs gzip) - Add case-studies/unstructured/core-product/summary.md (14.6% latency, 2.1 GB memory) - Update main README results table with all five case studies	2026-04-14 23:31:49 -05:00
Kevin Turcios	09ba9b44b2	Add typeagent-py case study (#17 ) - Add case-studies/microsoft/typeagent/summary.md with results, lessons learned (failed vector search experiment, maintainer alignment), and takeaways for codeflash - Update upstream PR statuses: #235 merged, #236 closed (rejected), #232 blocked on #230 - Add typeagent to main README results table	2026-04-14 23:25:29 -05:00
Kevin Turcios	6dd3b02168	Restructure typeagent README: separate failed vector search experiment (#16 ) Move vector search benchmarks out of main results into a Lessons Learned section. The 3.7x-14.2x numbers were real but on a non-bottleneck — maintainer confirmed model API calls and SQL dominate real latency. Results section now only shows legitimate wins: import time (1.16x), indexing pipeline (1.14-1.16x), and query batching (2.10-2.62x).	2026-04-14 23:21:53 -05:00
Kevin Turcios	cc29a27289	Migrate .codeflash/ to {teammember}/{org}/{project}/ format (#15 ) Add team member dimension to case study paths so multiple contributors can track optimization data independently. Derives member from git config user.name in session-start hooks. - Move all case studies under .codeflash/krrt7/ - Rename pypa/pip → python/pip (org grouping) - Update session-start hooks, docs, scripts, and references	2026-04-14 23:04:34 -05:00
Kevin Turcios	4a65f17bfb	Set up CODEOWNERS for Go and Java language overlays (#14 )	2026-04-14 19:18:05 -05:00
Kevin Turcios	361bb899e2	Move Go overlay to plugin/languages/go/ (#13 ) * Move Go plugin overlay from languages/go/ to plugin/languages/go/ Aligns Go with the Java/Python/JavaScript convention where all language overlays live under plugin/languages/<lang>/. The Makefile already discovers from plugin/languages/* so Go is now included in builds. * Remove accidental read-tracker changes * Ignore .codeflash/observability/ in gitignore	2026-04-14 19:14:57 -05:00
m-ali-24	044b2f190a	[FEAT] golang agents (#11 ) * go base * missing javascript --------- Co-authored-by: ali <--global>	2026-04-14 18:55:36 -05:00
mashraf-222	270cb56cee	Feat/java language support (#12 ) * Add Java/Kotlin detection to top-level language router Adds pom.xml, build.gradle, build.gradle.kts, settings.gradle, and settings.gradle.kts as markers that route to the codeflash-java router. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add Java/Kotlin agent definitions for all optimization domains 10 agents covering the full optimization pipeline: - codeflash-java: router/team lead for domain detection - codeflash-java-setup: environment detection (build tool, JDK, profiling tools) - codeflash-java-deep: cross-domain optimizer (default) - codeflash-java-cpu: data structures, algorithms, JIT deopt, JMH benchmarks - codeflash-java-memory: heap/GC tuning, escape analysis, leak detection - codeflash-java-async: virtual threads, lock contention, CompletableFuture - codeflash-java-structure: class loading, JPMS, startup time, circular deps - codeflash-java-scan: quick cross-domain diagnosis via JFR/jdeps/GC logs - codeflash-java-ci: GitHub webhook handler for Java PRs - codeflash-java-pr-prep: JMH benchmarks and PR body templates Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add Java domain reference guides for all optimization domains 6 guides covering deep domain knowledge for agent consumption: - data-structures: collection selection, autoboxing, JIT patterns, sorting - memory: JVM heap layout, GC algorithms and tuning, escape analysis, leaks - async: virtual threads, structured concurrency, lock hierarchy, contention - structure: class loading, JPMS, CDS/AppCDS, ServiceLoader, Spring startup - database: JPA N+1, HikariCP, pagination, batch operations, EXPLAIN plans - native: JNI, Panama FFM API, GraalVM native-image, Vector API Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add Java optimization skills: session launcher and JFR profiling - codeflash-optimize: session launcher with start/resume/status/scan/review - jfr-profiling: quick-action JFR profiling in cpu/alloc/wall modes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Slim Java agents to match Go's concise ~175-line pattern Move inline code examples, antipattern encyclopedias, JMH templates, and deep-dive sections from agent prompts into reference guides. Agents now contain only: target tables, one-liner antipatterns, reasoning checklists, profiling commands, and keep/discard trees. Line counts (before → after): cpu: 636 → 181 memory: 878 → 193 async: 578 → 165 structure: 532 → 167 deep: 507 → 186 scan: 440 → 163 Average: 595 → 176 (vs Go's 175) Adds to data-structures/guide.md: - Collection contract traps table - Reflection → MethodHandle migration pattern - JMH benchmark template Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Fix Makefile build: use rsync merge and portable sed -i Two bugs in the build target: 1. cp -R created nested dirs (agents/agents/, references/references/) instead of merging language overlay into shared base. Fix: rsync -a. 2. sed -i '' is macOS-only; fails silently on Linux. Fix: sed -i.bak (works on both macOS and Linux), then delete .bak files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add HANDOFF.md session lifecycle to Java agents Java agents could read HANDOFF.md on resume but never wrote or updated it. A session that hit plateau would lose all context — what was tried, what worked, why it stopped, what to do next. Changes: - Deep agent: init HANDOFF.md on fresh start, record after each experiment, write Stop Reason + learnings.md on session end - Domain agents (CPU, memory, async, structure): record to HANDOFF.md after each keep/discard, write session-end state - Handoff template: make language-agnostic (was Python-specific), add Session status, Strategy & Decisions, and Stop Reason fields Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Close 11 gaps between Java and Python plugins Add missing sections to Java deep agent: experiment loop depth (12 steps), library boundary breaking, Phase 0 environment setup, CI mode, pre-submit review, adversarial review, team orchestration, cross-domain results schema, and structured progress reporting. Add polymorphic dispatch safety to CPU agent and data-structures guide. Add diff hygiene to CPU agent. Add native reference to router. Create two new reference files: library-replacement.md (Guava/Commons/ Jackson/Joda replacement tables) and team-orchestration.md (full dispatch and merge protocol). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 18:49:41 -05:00
Kevin Turcios	043bf45415	Ignore .lprof and .prof binary files, update read-tracker	2026-04-14 18:42:38 -05:00
Kevin Turcios	9830b7b4a1	Track .codeflash/ data: unignore observability and add krrt7/odoo case study	2026-04-14 18:40:08 -05:00
Kevin Turcios	3b59d97647	squash	2026-04-13 14:12:17 -05:00
Kevin Turcios	cee3987d7b	cleanup	2026-04-06 05:58:13 -05:00
Kevin Turcios	ebb9658dfd	Merge main-teammate branch	2026-04-03 17:36:50 -05:00
Kevin Turcios	0cda0d907c	fix: align marketplace version with plugin.json and recursive .DS_Store ignore - marketplace.json metadata.version 1.0.0 → 0.1.0 to match plugin.json - .gitignore .DS_Store → **/.DS_Store for nested directories	2026-03-27 11:39:34 -05:00
Kevin Turcios	7fab0082c0	Merge pull request #6 from codeflash-ai/feat/tool-configs feat: improve skill and eval system	2026-03-27 11:31:30 -05:00
Kevin Turcios	37efa524d7	feat: improve skill, eval system, and tessl config - Optimize codeflash-optimize SKILL.md (review score 17% → 98%, eval 87% → 100%) - Fix frontmatter (allowed-tools format, argument-hint under metadata) - Lead description with concrete actions, explicit agent launch parameters - Add multi-run variance detection to eval system (--runs N flag) - score.py aggregate command: min/max/avg/stddev per criterion, flaky detection - check-regression.sh defaults to 3 runs for reliable regression detection - Add per-criterion regression tracking to baseline-scores.json (v3) - Reports exactly which criteria regressed, not just total score drops - Rename evals/ → codeflash-evals/ to avoid tessl directory conflicts - Switch tessl to managed mode, gitignore vendored tiles and symlinks	2026-03-27 11:30:17 -05:00
Kevin Turcios	999e08fb5e	Merge pull request #5 from codeflash-ai/fix/session-analysis-improvements fix: session-analysis improvements from 89 real-world sessions	2026-03-27 10:17:44 -05:00
Kevin Turcios	61c393e7ed	ci: add actions:read permission for CI status checks The claude-code-action MCP server requires 'actions: read' to enable CI status check functionality. Without it, the server is skipped with a warning.	2026-03-27 10:16:16 -05:00
Kevin Turcios	24ffa83bbf	merge: resolve conflicts with main (guard, git history, stuck recovery) Merge origin/main which added guard commands, git history review step, stuck state recovery, batched setup questions, and config audit steps. Resolved 5 conflicts by keeping both: - Our git-add-specific-files + pre-commit rules applied to the new renumbered commit steps (15 instead of 12, etc.) - Upstream's Record, Config audit, Guard steps preserved - Router keeps both AUTONOMOUS MODE and batch-questions rules - Router start steps merged: our branch verification + multi-repo detection integrated into upstream's batched-questions flow	2026-03-27 10:15:10 -05:00
Kevin Turcios	ce02fdee29	fix: add .codeflash/ gitignore and session cleanup workflow - Setup agent now ensures .codeflash/ is in .gitignore before writing session state files (prevents accidental commits of profiling artifacts) - Router agent gets a Cleanup section: preserves learnings.md and results.tsv across sessions, deletes transient files (HANDOFF.md, setup.md, conventions.md, bench scripts), removes agent-memory dir	2026-03-27 10:09:51 -05:00
Kevin Turcios	e811d453f9	fix: address session-analysis findings from 89 unstructured_org sessions Analyzed ~89 Claude Code sessions across 7 unstructured_org projects to identify recurring failures and friction points, then applied fixes: - Fix "ask then die" bug: skill now injects AUTONOMOUS MODE directive so domain agents work without interactive questions that kill the Agent tool - Fix git add -A: all 4 domain agents now stage specific files instead of blindly staging everything (caused accidental commits of scratch files) - Add pre-commit step: agents run pre-commit before every commit to catch linting failures before CI (ruff/undersort failures were recurring) - Add measurement methodology lock: prevents changing profiling flags mid-experiment which created uninterpretable deltas - Add branch state verification to router startup (prevents wrong-branch confusion that wasted multiple sessions) - Add multi-repo detection to router (original work spanned 4 repos) - Add library vs application awareness to memory agent (prevents wasting time on import-time optimizations in library projects) - Add dependency resilience to setup agent (uv run --with isolation warning, private PyPI failure guidance) - Add PR text quality guidelines (sessions showed AI-sounding text that required multiple user corrections) - Add chart generation guidelines to pr-preparation.md - Add context conservation rules (max 2 background tasks, use subagents) - Add cross-session learnings template for .codeflash/learnings.md - All domain agents now read learnings.md at startup	2026-03-27 10:08:50 -05:00
Kevin Turcios	d1f34cf794	feat: git memory, guard command, stuck recovery, batched setup (#2 ) * feat: git memory, guard command, stuck recovery, batched setup - Add git history review as step 1 of every experiment loop iteration (read git log + diff to learn from past experiments and detect patterns) - Add guard command as formal regression safety net (step 10) with revert-and-rework protocol (max 2 attempts before discard) - Add stuck state recovery protocol (5+ consecutive discards triggers re-read of all files, results log analysis, goal re-check, combination of past successes, and opposite-strategy attempts) - Add batched setup questions rule to orchestrator (max 4 questions per message, never one-at-a-time across round-trips) - Update decision tree to include guard check before keep/discard - Update orchestrator workflow to configure guard during session init Inspired by patterns from uditgoenka/autoresearch. * feat: git memory, guard command, stuck recovery, batched setup - Add git history review as step 1 of every experiment loop iteration (read git log + diff to learn from past experiments and detect patterns) - Add guard command as formal regression safety net (step 10) with revert-and-rework protocol (max 2 attempts before discard) - Add stuck state recovery protocol (5+ consecutive discards triggers re-read of all files, results log analysis, goal re-check, combination of past successes, and opposite-strategy attempts) - Add batched setup questions rule to orchestrator (max 4 questions per message, never one-at-a-time across round-trips) - Update decision tree to include guard check before keep/discard - Update orchestrator workflow to configure guard during session init Inspired by patterns from uditgoenka/autoresearch. * feat: add git history, guard, and config audit steps to cpu/memory/structure agents Align experiment loops in all domain agents with async agent. Each now includes step 1 (review git history), guard command (revert+rework on failure), and config audit after KEEP with domain-specific guidance. * feat: add git history, guard, and config audit steps to cpu/memory/structure agents Align experiment loops in all domain agents with async agent. Each now includes step 1 (review git history), guard command (revert+rework on failure), and config audit after KEEP with domain-specific guidance. * feat: add CI plugin validation workflow (#3) Uses claude-code-action with plugin-dev plugin to validate plugin structure, agent consistency, eval manifests, and skills on every PR. Includes @claude mention support for interactive fixes. * fix: correct plugin marketplace name for CI validation plugin-dev is in claude-plugins-official, not claude-code-plugins. Also adds plugin_marketplaces URL for discovery. * fix: expand allowed tools for validation workflow Add gh pr comment, gh api, cat, python3, jq to allowed tools so Claude can post PR summary comments and subagents can function. * fix: enable track_progress and show_full_output for debugging * fix: remove colons from Bash glob patterns in validate allowedTools The gh command patterns used colons (e.g. Bash(gh pr diff:)) which are treated as literal characters, so they never matched actual commands like `gh pr diff 2 --name-only`. This caused 1 permission denial per CI run and prevented the summary comment from posting. fix: fail CI job when validation finds issues Add verdict step that writes PASS/FAIL to a file, with a follow-up workflow step that exits 1 on FAIL. Previously validation reported issues in comments but the job always succeeded. * fix: remove double-commit contradiction in async agent and shared base Step 4/5/6 (Implement) said "and commit" but the commit-after-KEEP step later says "Do NOT commit discards." This meant discarded experiments were still committed. CPU/memory/structure agents already had it right — only commit at the KEEP step. * fix: remove double-commit contradiction in async agent and shared base Step 4/5/6 (Implement) said "and commit" but the commit-after-KEEP step later says "Do NOT commit discards." This meant discarded experiments were still committed. CPU/memory/structure agents already had it right — only commit at the KEEP step. * fix: treat validation warnings as blocking failures Warnings were previously non-blocking — the verdict step only checked for "issues that need fixing." Now any warning also triggers FAIL. * fix: address plugin-validator warnings - Declare context7 MCP server in plugin.json (domain agents use it) - Change codeflash-setup color from green to red (collision with router) - Add memory: project and example block to codeflash-setup frontmatter * fix: address plugin-validator warnings - Declare context7 MCP server in plugin.json (domain agents use it) - Change codeflash-setup color from green to red (collision with router) - Add memory: project and example block to codeflash-setup frontmatter * feat: add eval regression testing (Layer 3) - baseline-scores.json: checked-in expected scores + min thresholds for ranking, memory-hard, memory-misdirection - check-regression.sh: orchestrator that runs evals, scores them, and compares to baselines (exits 1 if any score < min) - eval-regression.yml: on-demand CI workflow (workflow_dispatch) with Bedrock OIDC, artifact upload, and job summary table * feat: add eval regression testing (Layer 3) - baseline-scores.json: checked-in expected scores + min thresholds for ranking, memory-hard, memory-misdirection - check-regression.sh: orchestrator that runs evals, scores them, and compares to baselines (exits 1 if any score < min) - eval-regression.yml: on-demand CI workflow (workflow_dispatch) with Bedrock OIDC, artifact upload, and job summary table * fix: address remaining plugin-validator warnings - Add memory: project to codeflash-memory.md (matches other agents) - Add allowed-tools to memray-profiling skill - Fix blank line in codeflash-setup.md frontmatter - Add git log -20 --stat to all 4 domain agents' git history step - Add step numbering note to async reference experiment-loop.md * fix: address remaining plugin-validator warnings - Add memory: project to codeflash-memory.md (matches other agents) - Add allowed-tools to memray-profiling skill - Fix blank line in codeflash-setup.md frontmatter - Add git log -20 --stat to all 4 domain agents' git history step - Add step numbering note to async reference experiment-loop.md * feat: add deterministic scoring for profiler and ranking criteria Add session-text-based auto-scoring that overrides LLM grades for mechanically verifiable criteria: - used_memory_profiler: grep for memray/tracemalloc Bash commands - profiled_iteratively: count distinct profiling runs (1=1pt, 2+=full) - built_ranked_list_with_impact_pct: detect cProfile + ranking output These anchor 2-4 points per eval deterministically, reducing LLM variance. Baseline thresholds tightened from min=6 to min=7. * feat: add deterministic scoring for profiler and ranking criteria Add session-text-based auto-scoring that overrides LLM grades for mechanically verifiable criteria: - used_memory_profiler: grep for memray/tracemalloc Bash commands - profiled_iteratively: count distinct profiling runs (1=1pt, 2+=full) - built_ranked_list_with_impact_pct: detect cProfile + ranking output These anchor 2-4 points per eval deterministically, reducing LLM variance. Baseline thresholds tightened from min=6 to min=7. * fix: parse verdict from PR comment instead of temp file Claude writes "Verdict: FAIL/PASS" in the PR comment but doesn't execute the python3 file-write command. The check step now reads the claude[bot] comment via gh api and greps for the verdict line. * fix: address all remaining validator findings for PR #2 - Remove non-standard fields (repository, license, keywords) from plugin.json - Pin context7 MCP to @2.1.4 instead of @latest - Add context7 fallback note in router agent - Remove unused Read from codeflash-optimize allowed-tools - Make AskUserQuestion usage explicit in skill body - Add tracemalloc to memray-profiling trigger phrases - List all reference files in memray-profiling skill * fix: address all remaining validator findings for PR #2 - Remove non-standard fields (repository, license, keywords) from plugin.json - Pin context7 MCP to @2.1.4 instead of @latest - Add context7 fallback note in router agent - Remove unused Read from codeflash-optimize allowed-tools - Make AskUserQuestion usage explicit in skill body - Add tracemalloc to memray-profiling trigger phrases - List all reference files in memray-profiling skill * fix: wire stuck state recovery into domain agents, fix skill allowed-tools - Add 5+ consecutive discard stuck recovery protocol to all 4 domain agents (cpu, memory, async, structure) inline experiment loops, matching the shared base definition - Add Read and Bash to codeflash-optimize skill allowed-tools so it can inspect session state before delegating - Add Write to memray-profiling skill allowed-tools so it can create profiling harness scripts * fix: wire stuck state recovery into domain agents, fix skill allowed-tools - Add 5+ consecutive discard stuck recovery protocol to all 4 domain agents (cpu, memory, async, structure) inline experiment loops, matching the shared base definition - Add Read and Bash to codeflash-optimize skill allowed-tools so it can inspect session state before delegating - Add Write to memray-profiling skill allowed-tools so it can create profiling harness scripts * fix: sync versions, add missing tools to router and skills - Sync marketplace.json metadata version to 0.1.0 to match plugin.json - Add Write and Edit to codeflash.md router tools (needed for writing .codeflash/conventions.md during setup) - Remove Bash from codeflash-optimize skill (least-privilege: all execution is delegated via Agent) - Add Grep and Glob to memray-profiling skill (needed for finding test files, capture outputs, and config files) * fix: sync versions, add missing tools to router and skills - Sync marketplace.json metadata version to 0.1.0 to match plugin.json - Add Write and Edit to codeflash.md router tools (needed for writing .codeflash/conventions.md during setup) - Remove Bash from codeflash-optimize skill (least-privilege: all execution is delegated via Agent) - Add Grep and Glob to memray-profiling skill (needed for finding test files, capture outputs, and config files) * fix: restore plugin.json fields, revert marketplace version, relax validator verdict - Restore repository, license, keywords in plugin.json (accidentally removed when mcpServers was added) - Revert marketplace.json metadata.version to 1.0.0 (collection version, distinct from individual plugin version 0.1.0) - Change validator verdict rule: only FAIL on major issues, not warnings — prevents the LLM validator from blocking on subjective minor findings each run * fix: restore plugin.json fields, revert marketplace version, relax validator verdict - Restore repository, license, keywords in plugin.json (accidentally removed when mcpServers was added) - Revert marketplace.json metadata.version to 1.0.0 (collection version, distinct from individual plugin version 0.1.0) - Change validator verdict rule: only FAIL on major issues, not warnings — prevents the LLM validator from blocking on subjective minor findings each run	2026-03-27 07:43:14 -05:00
Kevin Turcios	e681b3732b	Merge pull request #4 from codeflash-ai/feat/eval-v2-real-repos feat: eval v2 — real-repo evals	2026-03-27 07:31:02 -05:00
Kevin Turcios	66187bbcc3	fix: v2 eval runner — shallow cached clones + non-interactive prompt - Shallow clone (--no-checkout --depth 1 + fetch specific commit) instead of full clone — 15s vs 2+ min for large repos like codeflash-internal - Cache clone in evals/repos/<name>/workspace/, cp -r for each run - Use gh repo clone for private repo auth - Fix eval prompt to skip skill's AskUserQuestion step in non-interactive mode - Gitignore workspace/ dirs - Update intro.md with v2 eval docs	2026-03-27 07:27:12 -05:00
Kevin Turcios	0d4fc9d8b7	feat: eval v2 — real-repo evals cloned from git Add support for v2 evals that clone a real repo at a specific commit instead of using bundled template source. The agent handles setup, diagnosis, and fixing on its own. - run-eval.sh: v1/v2 dispatch, repos/ directory, prompt from manifest - First v2 eval: codeflash-internal psycopg serialization (PR #2489) - EVAL-V2-SKETCH.md: design doc for the v2 eval system - intro.md: repo onboarding guide	2026-03-27 07:25:10 -05:00
Kevin Turcios	cae86f669b	fix: parse verdict from PR comment instead of temp file Claude writes "Verdict: FAIL/PASS" in the PR comment but doesn't execute the python3 file-write command. The check step now reads the claude[bot] comment via gh api and greps for the verdict line.	2026-03-27 06:40:40 -05:00
Kevin Turcios	b81c50f7f4	fix: treat validation warnings as blocking failures Warnings were previously non-blocking — the verdict step only checked for "issues that need fixing." Now any warning also triggers FAIL.	2026-03-27 06:17:50 -05:00
Kevin Turcios	037b1940eb	fix: fail CI job when validation finds issues Add verdict step that writes PASS/FAIL to a file, with a follow-up workflow step that exits 1 on FAIL. Previously validation reported issues in comments but the job always succeeded.	2026-03-27 06:09:21 -05:00
Kevin Turcios	a9fa0687d5	fix: remove colons from Bash glob patterns in validate allowedTools The gh command patterns used colons (e.g. Bash(gh pr diff:*)) which are treated as literal characters, so they never matched actual commands like `gh pr diff 2 --name-only`. This caused 1 permission denial per CI run and prevented the summary comment from posting.	2026-03-27 06:05:07 -05:00
Kevin Turcios	a529812b9d	fix: enable track_progress and show_full_output for debugging	2026-03-27 05:58:34 -05:00
Kevin Turcios	9f4a8eda6d	fix: expand allowed tools for validation workflow Add gh pr comment, gh api, cat, python3, jq to allowed tools so Claude can post PR summary comments and subagents can function.	2026-03-27 05:53:45 -05:00
Kevin Turcios	a482bf6af8	fix: correct plugin marketplace name for CI validation plugin-dev is in claude-plugins-official, not claude-code-plugins. Also adds plugin_marketplaces URL for discovery.	2026-03-27 05:49:15 -05:00
Kevin Turcios	021d64a1fd	feat: add CI plugin validation workflow (#3 ) Uses claude-code-action with plugin-dev plugin to validate plugin structure, agent consistency, eval manifests, and skills on every PR. Includes @claude mention support for interactive fixes.	2026-03-27 05:44:44 -05:00
Kevin Turcios	5a990f78a9	Merge pull request #1 from codeflash-ai/fix/async-benchmark-validation fix: add benchmark fidelity verification and config audit steps	2026-03-27 05:06:43 -05:00
Kevin Turcios	93e93ff1f6	fix: add benchmark fidelity verification and config audit steps The async agent wrote a benchmark that tested the problem (psycopg2 single connection) but didn't validate the fix (thread_sensitive=False), and missed dead config (conn_health_checks) after a driver migration. - Add step 6 (benchmark fidelity) to shared and async experiment loops - Add step 14 (config audit after KEEP) to shared and async loops - Add architectural change workflow for driver/infrastructure migrations - Strengthen profiling rule: never present benchmarking as optional - Add config audit to async reasoning checklist (question #9)	2026-03-27 05:04:54 -05:00
Kevin Turcios	64268dd023	Hello World	2026-03-24 16:14:04 -05:00

41 commits