codeflash-agent

mirror of https://github.com/codeflash-ai/codeflash-agent.git synced 2026-05-04 18:25:19 +00:00

Author	SHA1	Message	Date
Kevin Turcios	0ad5e60523	Add blackbox package: session flight recorder with HTMX dashboard (#39 ) * feat(blackbox): add package with models, CLI, and HTMX dashboard * test(blackbox): add comprehensive test coverage for dashboard * feat(blackbox): cache session scanning via watcher invalidation * docs(blackbox): add README and use fastapi[standard] for dev server * refactor(blackbox): extract presentation logic into formatter classes * refactor(blackbox): extract classify_error helpers * feat(blackbox): wire analytics into session detail view Show token usage, tool breakdowns, and session stats in a collapsible panel when viewing a session. * feat(blackbox): add codeflash plugin detection Detect codeflash agent names, skills, and commands in transcripts. Surface language, optimization domain, and capability badges in the analytics panel. * refactor(blackbox): remove underscore prefixes from internal functions * chore: add ty python-version to root pyproject.toml * chore(blackbox): fix lint errors in test files * style(blackbox): apply ruff formatting to analytics * feat(blackbox): add Playwright E2E tests for dashboard Refactor app.py to expose create_app() factory accepting a projects_dir override, enabling tests to run against fixture data instead of the real ~/.claude/projects/ directory. Routes now read projects_dir from app.state instead of the module-level constant. Add 26 Playwright tests across 5 files covering dashboard loading, session list, session detail with filters and analytics, sidebar collapse/localStorage persistence, and SSE log streaming. All tests pass on chromium, firefox, and webkit (78 total). CI gets a new e2e-blackbox job with a browser matrix strategy running all three engines in parallel, conditional on blackbox path changes, with trace upload on failure. * fix(ci): sync only blackbox package in e2e job * fix(ci): exclude e2e tests from unit test job The test job doesn't install Playwright browsers, so e2e tests error when pytest collects them. Ignore tests/e2e/ directories in the test job — those are handled by the dedicated e2e-blackbox job.	2026-04-28 19:58:43 -05:00
Kevin Turcios	2ff9431656	ci: only test packages with changed files on PRs (#41 ) * ci: only test packages with changed files on PRs * ci: add pull-requests read permission for paths-filter	2026-04-28 18:53:20 -05:00
Kevin Turcios	9bc9ff2250	chore: add ready-to-merge gate for branch freshness	2026-04-23 08:12:25 -05:00
Kevin Turcios	3851fd4cf9	fix: pass CI bot secrets to tessl update workflow	2026-04-23 08:04:06 -05:00
Kevin Turcios	fa1b5ece4e	fix: add permissions to tessl update caller workflow	2026-04-23 07:49:56 -05:00
Kevin Turcios	f83ece9ad8	chore: initialize tessl with vendored tiles Install 32 Python tiles in vendored mode, add MCP configs for all agents, and set up weekly tile update workflow via reusable github-workflows caller.	2026-04-23 07:43:04 -05:00
Kevin Turcios	2caaf6af7c	Fix CI: mypy errors, ruff formatting, switch to prek (#22 ) * Fix mypy errors and apply ruff formatting across packages Fix ast.FunctionDef calls missing type_params for Python 3.12+, correct type: ignore error codes in _comparator and _plugin, and run ruff format on all package source and test files. * Switch CI to prek for lint/typecheck checks Use j178/prek-action for consistent lint+typecheck (ruff check, ruff format, interrogate, mypy) matching local pre-commit config. Keep test as a separate parallel job for test-env support.	2026-04-15 02:52:47 -05:00
Kevin Turcios	a1710f7f92	Adopt shared CI workflow (#21 ) Replace packages-ci.yml and github-app-tests.yml with a single ci.yml that calls the shared ci-python-uv reusable workflow. Lint, typecheck, and test run as parallel jobs. Version check stays local (needs fetch-depth: 0 + PR-only conditional).	2026-04-15 02:36:17 -05:00
Kevin Turcios	4a65f17bfb	Set up CODEOWNERS for Go and Java language overlays (#14 )	2026-04-14 19:18:05 -05:00
m-ali-24	044b2f190a	[FEAT] golang agents (#11 ) * go base * missing javascript --------- Co-authored-by: ali <--global>	2026-04-14 18:55:36 -05:00
Kevin Turcios	3b59d97647	squash	2026-04-13 14:12:17 -05:00
Kevin Turcios	ebb9658dfd	Merge main-teammate branch	2026-04-03 17:36:50 -05:00
Kevin Turcios	37efa524d7	feat: improve skill, eval system, and tessl config - Optimize codeflash-optimize SKILL.md (review score 17% → 98%, eval 87% → 100%) - Fix frontmatter (allowed-tools format, argument-hint under metadata) - Lead description with concrete actions, explicit agent launch parameters - Add multi-run variance detection to eval system (--runs N flag) - score.py aggregate command: min/max/avg/stddev per criterion, flaky detection - check-regression.sh defaults to 3 runs for reliable regression detection - Add per-criterion regression tracking to baseline-scores.json (v3) - Reports exactly which criteria regressed, not just total score drops - Rename evals/ → codeflash-evals/ to avoid tessl directory conflicts - Switch tessl to managed mode, gitignore vendored tiles and symlinks	2026-03-27 11:30:17 -05:00
Kevin Turcios	61c393e7ed	ci: add actions:read permission for CI status checks The claude-code-action MCP server requires 'actions: read' to enable CI status check functionality. Without it, the server is skipped with a warning.	2026-03-27 10:16:16 -05:00
Kevin Turcios	d1f34cf794	feat: git memory, guard command, stuck recovery, batched setup (#2 ) * feat: git memory, guard command, stuck recovery, batched setup - Add git history review as step 1 of every experiment loop iteration (read git log + diff to learn from past experiments and detect patterns) - Add guard command as formal regression safety net (step 10) with revert-and-rework protocol (max 2 attempts before discard) - Add stuck state recovery protocol (5+ consecutive discards triggers re-read of all files, results log analysis, goal re-check, combination of past successes, and opposite-strategy attempts) - Add batched setup questions rule to orchestrator (max 4 questions per message, never one-at-a-time across round-trips) - Update decision tree to include guard check before keep/discard - Update orchestrator workflow to configure guard during session init Inspired by patterns from uditgoenka/autoresearch. * feat: git memory, guard command, stuck recovery, batched setup - Add git history review as step 1 of every experiment loop iteration (read git log + diff to learn from past experiments and detect patterns) - Add guard command as formal regression safety net (step 10) with revert-and-rework protocol (max 2 attempts before discard) - Add stuck state recovery protocol (5+ consecutive discards triggers re-read of all files, results log analysis, goal re-check, combination of past successes, and opposite-strategy attempts) - Add batched setup questions rule to orchestrator (max 4 questions per message, never one-at-a-time across round-trips) - Update decision tree to include guard check before keep/discard - Update orchestrator workflow to configure guard during session init Inspired by patterns from uditgoenka/autoresearch. * feat: add git history, guard, and config audit steps to cpu/memory/structure agents Align experiment loops in all domain agents with async agent. Each now includes step 1 (review git history), guard command (revert+rework on failure), and config audit after KEEP with domain-specific guidance. * feat: add git history, guard, and config audit steps to cpu/memory/structure agents Align experiment loops in all domain agents with async agent. Each now includes step 1 (review git history), guard command (revert+rework on failure), and config audit after KEEP with domain-specific guidance. * feat: add CI plugin validation workflow (#3) Uses claude-code-action with plugin-dev plugin to validate plugin structure, agent consistency, eval manifests, and skills on every PR. Includes @claude mention support for interactive fixes. * fix: correct plugin marketplace name for CI validation plugin-dev is in claude-plugins-official, not claude-code-plugins. Also adds plugin_marketplaces URL for discovery. * fix: expand allowed tools for validation workflow Add gh pr comment, gh api, cat, python3, jq to allowed tools so Claude can post PR summary comments and subagents can function. * fix: enable track_progress and show_full_output for debugging * fix: remove colons from Bash glob patterns in validate allowedTools The gh command patterns used colons (e.g. Bash(gh pr diff:)) which are treated as literal characters, so they never matched actual commands like `gh pr diff 2 --name-only`. This caused 1 permission denial per CI run and prevented the summary comment from posting. fix: fail CI job when validation finds issues Add verdict step that writes PASS/FAIL to a file, with a follow-up workflow step that exits 1 on FAIL. Previously validation reported issues in comments but the job always succeeded. * fix: remove double-commit contradiction in async agent and shared base Step 4/5/6 (Implement) said "and commit" but the commit-after-KEEP step later says "Do NOT commit discards." This meant discarded experiments were still committed. CPU/memory/structure agents already had it right — only commit at the KEEP step. * fix: remove double-commit contradiction in async agent and shared base Step 4/5/6 (Implement) said "and commit" but the commit-after-KEEP step later says "Do NOT commit discards." This meant discarded experiments were still committed. CPU/memory/structure agents already had it right — only commit at the KEEP step. * fix: treat validation warnings as blocking failures Warnings were previously non-blocking — the verdict step only checked for "issues that need fixing." Now any warning also triggers FAIL. * fix: address plugin-validator warnings - Declare context7 MCP server in plugin.json (domain agents use it) - Change codeflash-setup color from green to red (collision with router) - Add memory: project and example block to codeflash-setup frontmatter * fix: address plugin-validator warnings - Declare context7 MCP server in plugin.json (domain agents use it) - Change codeflash-setup color from green to red (collision with router) - Add memory: project and example block to codeflash-setup frontmatter * feat: add eval regression testing (Layer 3) - baseline-scores.json: checked-in expected scores + min thresholds for ranking, memory-hard, memory-misdirection - check-regression.sh: orchestrator that runs evals, scores them, and compares to baselines (exits 1 if any score < min) - eval-regression.yml: on-demand CI workflow (workflow_dispatch) with Bedrock OIDC, artifact upload, and job summary table * feat: add eval regression testing (Layer 3) - baseline-scores.json: checked-in expected scores + min thresholds for ranking, memory-hard, memory-misdirection - check-regression.sh: orchestrator that runs evals, scores them, and compares to baselines (exits 1 if any score < min) - eval-regression.yml: on-demand CI workflow (workflow_dispatch) with Bedrock OIDC, artifact upload, and job summary table * fix: address remaining plugin-validator warnings - Add memory: project to codeflash-memory.md (matches other agents) - Add allowed-tools to memray-profiling skill - Fix blank line in codeflash-setup.md frontmatter - Add git log -20 --stat to all 4 domain agents' git history step - Add step numbering note to async reference experiment-loop.md * fix: address remaining plugin-validator warnings - Add memory: project to codeflash-memory.md (matches other agents) - Add allowed-tools to memray-profiling skill - Fix blank line in codeflash-setup.md frontmatter - Add git log -20 --stat to all 4 domain agents' git history step - Add step numbering note to async reference experiment-loop.md * feat: add deterministic scoring for profiler and ranking criteria Add session-text-based auto-scoring that overrides LLM grades for mechanically verifiable criteria: - used_memory_profiler: grep for memray/tracemalloc Bash commands - profiled_iteratively: count distinct profiling runs (1=1pt, 2+=full) - built_ranked_list_with_impact_pct: detect cProfile + ranking output These anchor 2-4 points per eval deterministically, reducing LLM variance. Baseline thresholds tightened from min=6 to min=7. * feat: add deterministic scoring for profiler and ranking criteria Add session-text-based auto-scoring that overrides LLM grades for mechanically verifiable criteria: - used_memory_profiler: grep for memray/tracemalloc Bash commands - profiled_iteratively: count distinct profiling runs (1=1pt, 2+=full) - built_ranked_list_with_impact_pct: detect cProfile + ranking output These anchor 2-4 points per eval deterministically, reducing LLM variance. Baseline thresholds tightened from min=6 to min=7. * fix: parse verdict from PR comment instead of temp file Claude writes "Verdict: FAIL/PASS" in the PR comment but doesn't execute the python3 file-write command. The check step now reads the claude[bot] comment via gh api and greps for the verdict line. * fix: address all remaining validator findings for PR #2 - Remove non-standard fields (repository, license, keywords) from plugin.json - Pin context7 MCP to @2.1.4 instead of @latest - Add context7 fallback note in router agent - Remove unused Read from codeflash-optimize allowed-tools - Make AskUserQuestion usage explicit in skill body - Add tracemalloc to memray-profiling trigger phrases - List all reference files in memray-profiling skill * fix: address all remaining validator findings for PR #2 - Remove non-standard fields (repository, license, keywords) from plugin.json - Pin context7 MCP to @2.1.4 instead of @latest - Add context7 fallback note in router agent - Remove unused Read from codeflash-optimize allowed-tools - Make AskUserQuestion usage explicit in skill body - Add tracemalloc to memray-profiling trigger phrases - List all reference files in memray-profiling skill * fix: wire stuck state recovery into domain agents, fix skill allowed-tools - Add 5+ consecutive discard stuck recovery protocol to all 4 domain agents (cpu, memory, async, structure) inline experiment loops, matching the shared base definition - Add Read and Bash to codeflash-optimize skill allowed-tools so it can inspect session state before delegating - Add Write to memray-profiling skill allowed-tools so it can create profiling harness scripts * fix: wire stuck state recovery into domain agents, fix skill allowed-tools - Add 5+ consecutive discard stuck recovery protocol to all 4 domain agents (cpu, memory, async, structure) inline experiment loops, matching the shared base definition - Add Read and Bash to codeflash-optimize skill allowed-tools so it can inspect session state before delegating - Add Write to memray-profiling skill allowed-tools so it can create profiling harness scripts * fix: sync versions, add missing tools to router and skills - Sync marketplace.json metadata version to 0.1.0 to match plugin.json - Add Write and Edit to codeflash.md router tools (needed for writing .codeflash/conventions.md during setup) - Remove Bash from codeflash-optimize skill (least-privilege: all execution is delegated via Agent) - Add Grep and Glob to memray-profiling skill (needed for finding test files, capture outputs, and config files) * fix: sync versions, add missing tools to router and skills - Sync marketplace.json metadata version to 0.1.0 to match plugin.json - Add Write and Edit to codeflash.md router tools (needed for writing .codeflash/conventions.md during setup) - Remove Bash from codeflash-optimize skill (least-privilege: all execution is delegated via Agent) - Add Grep and Glob to memray-profiling skill (needed for finding test files, capture outputs, and config files) * fix: restore plugin.json fields, revert marketplace version, relax validator verdict - Restore repository, license, keywords in plugin.json (accidentally removed when mcpServers was added) - Revert marketplace.json metadata.version to 1.0.0 (collection version, distinct from individual plugin version 0.1.0) - Change validator verdict rule: only FAIL on major issues, not warnings — prevents the LLM validator from blocking on subjective minor findings each run * fix: restore plugin.json fields, revert marketplace version, relax validator verdict - Restore repository, license, keywords in plugin.json (accidentally removed when mcpServers was added) - Revert marketplace.json metadata.version to 1.0.0 (collection version, distinct from individual plugin version 0.1.0) - Change validator verdict rule: only FAIL on major issues, not warnings — prevents the LLM validator from blocking on subjective minor findings each run	2026-03-27 07:43:14 -05:00
Kevin Turcios	cae86f669b	fix: parse verdict from PR comment instead of temp file Claude writes "Verdict: FAIL/PASS" in the PR comment but doesn't execute the python3 file-write command. The check step now reads the claude[bot] comment via gh api and greps for the verdict line.	2026-03-27 06:40:40 -05:00
Kevin Turcios	b81c50f7f4	fix: treat validation warnings as blocking failures Warnings were previously non-blocking — the verdict step only checked for "issues that need fixing." Now any warning also triggers FAIL.	2026-03-27 06:17:50 -05:00
Kevin Turcios	037b1940eb	fix: fail CI job when validation finds issues Add verdict step that writes PASS/FAIL to a file, with a follow-up workflow step that exits 1 on FAIL. Previously validation reported issues in comments but the job always succeeded.	2026-03-27 06:09:21 -05:00
Kevin Turcios	a9fa0687d5	fix: remove colons from Bash glob patterns in validate allowedTools The gh command patterns used colons (e.g. Bash(gh pr diff:*)) which are treated as literal characters, so they never matched actual commands like `gh pr diff 2 --name-only`. This caused 1 permission denial per CI run and prevented the summary comment from posting.	2026-03-27 06:05:07 -05:00
Kevin Turcios	a529812b9d	fix: enable track_progress and show_full_output for debugging	2026-03-27 05:58:34 -05:00
Kevin Turcios	9f4a8eda6d	fix: expand allowed tools for validation workflow Add gh pr comment, gh api, cat, python3, jq to allowed tools so Claude can post PR summary comments and subagents can function.	2026-03-27 05:53:45 -05:00
Kevin Turcios	a482bf6af8	fix: correct plugin marketplace name for CI validation plugin-dev is in claude-plugins-official, not claude-code-plugins. Also adds plugin_marketplaces URL for discovery.	2026-03-27 05:49:15 -05:00
Kevin Turcios	021d64a1fd	feat: add CI plugin validation workflow (#3 ) Uses claude-code-action with plugin-dev plugin to validate plugin structure, agent consistency, eval manifests, and skills on every PR. Includes @claude mention support for interactive fixes.	2026-03-27 05:44:44 -05:00

23 commits