Commit graph

196 commits

Author SHA1 Message Date
Kevin Turcios
1da8b41b66 Add blackbox benchmark VM infra
D2s_v5 (non-burstable, 2 vCPU, 8 GB) with cloud-init provisioning,
CPU-pinned benchmarks, and A/B comparison scripts.
2026-04-29 03:12:19 -05:00
Kevin Turcios
5f3bb4dba8 fix: remove stale noqa and use set membership test
The refactoring reduced complexity below C901/PLR0912 thresholds,
making the noqa directive unused. Use `in` set test per PLR1714.
2026-04-29 02:20:05 -05:00
codeflash[bot]
3dfd5ffd10 perf(rendering): cache entry.level and is_thinking in render_log_html
Avoid repeated attribute access and redundant strip() calls
by caching level and is_thinking as local variables.
2026-04-29 02:06:05 -05:00
Kevin Turcios
1e8cbbede4
Fix Dependabot resolver and bump GitPython for security (#42)
* Fix Dependabot security updates and bump GitPython to 3.1.47+

Dependabot's uv ecosystem resolver was inferring Python 3.9 from the
workspace root's requires-python, then failing because sub-packages
require >=3.12. Adding .python-version=3.12 tells the resolver to use
a compatible Python. Also bumps gitpython>=3.1.47 to resolve the two
open security advisories (GHSA unsafe option check, command injection).

* Bump codeflash-core and codeflash-python versions for release
2026-04-28 20:28:42 -05:00
Kevin Turcios
0ad5e60523
Add blackbox package: session flight recorder with HTMX dashboard (#39)
* feat(blackbox): add package with models, CLI, and HTMX dashboard

* test(blackbox): add comprehensive test coverage for dashboard

* feat(blackbox): cache session scanning via watcher invalidation

* docs(blackbox): add README and use fastapi[standard] for dev server

* refactor(blackbox): extract presentation logic into formatter classes

* refactor(blackbox): extract classify_error helpers

* feat(blackbox): wire analytics into session detail view

Show token usage, tool breakdowns, and session stats in a
collapsible panel when viewing a session.

* feat(blackbox): add codeflash plugin detection

Detect codeflash agent names, skills, and commands in transcripts.
Surface language, optimization domain, and capability badges in
the analytics panel.

* refactor(blackbox): remove underscore prefixes from internal functions

* chore: add ty python-version to root pyproject.toml

* chore(blackbox): fix lint errors in test files

* style(blackbox): apply ruff formatting to analytics

* feat(blackbox): add Playwright E2E tests for dashboard

Refactor app.py to expose create_app() factory accepting a projects_dir
override, enabling tests to run against fixture data instead of the real
~/.claude/projects/ directory. Routes now read projects_dir from
app.state instead of the module-level constant.

Add 26 Playwright tests across 5 files covering dashboard loading,
session list, session detail with filters and analytics, sidebar
collapse/localStorage persistence, and SSE log streaming. All tests
pass on chromium, firefox, and webkit (78 total).

CI gets a new e2e-blackbox job with a browser matrix strategy running
all three engines in parallel, conditional on blackbox path changes,
with trace upload on failure.

* fix(ci): sync only blackbox package in e2e job

* fix(ci): exclude e2e tests from unit test job

The test job doesn't install Playwright browsers, so e2e tests error
when pytest collects them. Ignore tests/e2e/ directories in the test
job — those are handled by the dedicated e2e-blackbox job.
2026-04-28 19:58:43 -05:00
Kevin Turcios
2ff9431656
ci: only test packages with changed files on PRs (#41)
* ci: only test packages with changed files on PRs

* ci: add pull-requests read permission for paths-filter
2026-04-28 18:53:20 -05:00
Kevin Turcios
919a673be2
Fix pre-existing CI lint and test failures (#40)
* chore: add gitignore entries for local eval repos, e2e fixtures, and env files

* fix: restore clean bubble_sort_method.py test fixture

The call-site ID commit re-contaminated this file with instrumentation
decorators, causing tests to fail with missing CODEFLASH_LOOP_INDEX.

* fix: resolve ruff and mypy errors in codeflash-python

- Add import-not-found ignores for optional torch/jax imports
- Extract magic column index to _STDOUT_COLUMN_INDEX constant
- Fix unused variable in _instrument_sync.py
- Cast cpu_time_ns to int for mypy arg-type

* fix: add skip markers for optional deps and apply ruff formatting to tests

Skip torch/jax/tensorflow tests when those packages are not installed.
Move has_module helper to conftest.py for reuse across test files.
Apply ruff format to all test files that drifted.

* fix: resolve remaining ruff format and mypy errors

- Add missing blank line in conftest.py (ruff format)
- Remove unused import-untyped ignore on jax import (mypy unused-ignore)
- Add type: ignore comments for object-typed SQLite row values

* chore: bump codeflash-python to 0.1.1.dev0
2026-04-28 18:39:46 -05:00
Kevin Turcios
2c9f2ad8de Fix call-site IDs to use source line numbers instead of sequential counter
Restore the old InjectPerfOnly behavior where call-site identifiers
are the source line number of the instrumented statement. Also fix
the sync integration test to properly apply the decorator and write
the helper file, and remove dead imports from test_instrumentation.
2026-04-24 07:12:45 -05:00
Kevin Turcios
5b20981cd4 Unify SQLite schema into single codeflash_results table
Merge async_results and test_results tables into one 15-column
codeflash_results table with a dedicated cpu_time_ns column.
Consolidate file pattern to codeflash_results_{N}.sqlite and
delete the now-unused _data_parsers.py module.
2026-04-24 07:12:34 -05:00
Kevin Turcios
ba001950ee fix: restore clean bubble_sort_method.py test fixture 2026-04-24 05:55:32 -05:00
Kevin Turcios
ca951dd1f3 Rewrite sync instrumentation to decorator-based approach
Replace the old AST-injected codeflash_wrap/InjectPerfOnly sync path
with decorator-based instrumentation matching the async path:

- Add codeflash_performance_sync and codeflash_behavior_sync decorators
  with GPU device sync (torch CUDA/MPS, JAX, TensorFlow) via find_spec
- Add sync_devices_before/sync_devices_after with lazy cached detection
- Clean _instrumentation.py to a thin sync/async dispatcher (~47 lines)
- Remove dead code from _instrument_core.py (create_wrapper_function,
  create_device_sync_statements, get_call_arguments, etc.)
- Fix all production imports to point at source modules directly
- Drop underscore prefixes on internal helpers (connections, get_async_db,
  close_all_connections, detect_device_sync, etc.)
- Rewrite all test files for the new sync path assertions
- Add real-framework GPU device sync tests (torch, jax, tensorflow)
2026-04-24 05:54:32 -05:00
Kevin Turcios
918a2a10a4 feat: add sync instrumentation module with decorator-based approach
New _instrument_sync.py mirrors the async instrumentation pattern:
- SyncCallInstrumenter injects _codeflash_call_site.set() before sync calls
- SyncDecoratorAdder applies @codeflash_behavior_sync via libcst
- add_sync_decorator_to_function() decorates source files
- inject_sync_profiling_into_existing_test() instruments test files

Reuses the same helper file (codeflash_async_wrapper.py) since both
sync and async decorators live in _codeflash_async_decorators.py.
2026-04-24 04:54:45 -05:00
Kevin Turcios
8c218038e9 feat: add codeflash_behavior_sync decorator
Same pattern as the async behavior decorator: decorates the
function-under-test directly, captures return values, timing
(wall + CPU), and stdout into the shared async_results SQLite
table. This is the first step toward replacing the AST-injected
codeflash_wrap approach for sync functions.
2026-04-24 04:41:34 -05:00
Kevin Turcios
c9f65aba6b fix: capture stdout in async decorator and fix result merger
The async behavior decorator now captures stdout per invocation via
io.StringIO into a new `stdout` column in the async_results SQLite
table. The result merger prefers data-sourced stdout over XML stdout,
fixing the root cause of empty stdout in merged async results.

Also fixes: duplicate async parse block in _parse_results.py,
CODEFLASH_RUN_TMPDIR propagation to subprocesses, and removes
dead async code from _stdout_parsers.py and _wrap_decorator.py.
2026-04-24 04:35:02 -05:00
Kevin Turcios
629d7f9f08 feat: rewrite async instrumentation to use SQLite-only data path and contextvars
Replace the fragile stdout tag protocol with a unified SQLite table
(async_results) for all 3 async test modes. The new runtime decorators
write behavior, performance, and concurrency results directly to the DB
with zero stdout output. Test-file instrumentation now injects
_codeflash_call_site.set() (contextvar) instead of os.environ assignments,
which is correct for async task isolation.

New modules:
- runtime/_codeflash_async_decorators.py: self-contained decorators
- testing/_async_data_parser.py: SQLite reader replacing stdout parsing

Both at 100% test coverage (42 new tests).
2026-04-24 03:44:06 -05:00
Kevin Turcios
24199efc63 refactor: remove dead parameters from AsyncCallInstrumenter and inject_async_profiling
Drop unused module_path, mode, tests_project_root parameters and the
module_name_from_file_path import they required. Update all call sites.
2026-04-24 02:49:05 -05:00
Kevin Turcios
c670d637c0 refactor: clean up _instrument_async and add 100% test coverage
Remove dead code (unused fields, hasattr guard, duplicate decorator
set), rename _optimized_instrument_statement to _find_awaited_target_call,
simplify AsyncDecoratorAdder init and leave_FunctionDef. Add 21 new
unit tests covering all branches: non-test skipping, attribute calls,
class body recursion, counter independence, decorator deduplication
(name and call form), error handlers, and mode selection.
2026-04-24 02:45:07 -05:00
Kevin Turcios
2fd9d06e28 refactor: eliminate inline async decorator duplication and fix 10-column test gaps
Replace 218-line ASYNC_HELPER_INLINE_CODE string with shutil.copy2 of the
runtime decorator file. Update remaining test files for 10-column SQLite
schema (cpu_runtime). Add cpu_runtime assertions to async E2E tests.
2026-04-24 02:31:40 -05:00
Kevin Turcios
eb6a0be717 feat: add dual-clock instrumentation (wall + CPU time) and remove dead binary parser
Measure both wall-clock time (perf_counter_ns) and CPU thread time
(thread_time_ns) in instrumented test code. cpu_runtime is now a required
int field on FunctionTestInvocation, stored in the SQLite test_results
table as a 10th column.

Also fixes the sleeptime.py bug (10e9 → 1e9 divisor) and removes the
binary pickle parser (parse_test_return_values_bin) since no writer
exists in the current codebase — SQLite is the sole data capture path.
2026-04-24 02:21:22 -05:00
Kevin Turcios
0c622ac469 fix: loosen timing tolerance in time correction instrumentation tests
The busy-wait sleep function can overshoot by 90%+ under CPU contention
(observed 190ms for a 100ms target). The test verifies that
instrumentation produces runtimes in the right order of magnitude,
not that sleep timing is precise. Widen rel_tol from 0.05 to 1.0.
2026-04-24 01:38:06 -05:00
Kevin Turcios
fd88580ac8 test: add 262 tests for previously untested core modules
- test_danom_result.py: 58 tests for Ok/Err Result monad
- test_danom_stream.py: 65 tests for Stream pipeline operations
- test_model.py: 57 tests for core data models and serialization
- test_pipeline.py: 59 tests for pipeline utilities and candidate evaluation
- test_normalizer.py: 23 tests for code normalization including SyntaxError handling
2026-04-24 01:36:14 -05:00
Kevin Turcios
90a46d732c fix: harden error handling and add missing future annotations
Error handling:
- Protect ast.parse() in _normalizer.py (returns original on SyntaxError)
- Protect cst.parse_module() in _replacement.py (raises ValueError)
- Narrow except Exception to OSError/SyntaxError in _discovery.py (2 sites)
- Narrow except Exception to sqlite3.Error/OSError in _data_parsers.py
- Narrow pickle except to specific unpickling errors in _data_parsers.py

Missing future annotations:
- Add from __future__ import annotations to 12 __init__.py files
2026-04-24 01:36:04 -05:00
Kevin Turcios
6b73b07d15 fix: deduplicate code across codeflash-core and codeflash-python
- Extract _parse_candidates helper in _client.py (used by get_candidates and optimize_with_line_profiler)
- Parameterize URL resolution in _http.py (_resolve_url_from_env replaces two near-identical functions)
- Delegate get_repo_owner_and_name to parse_repo_owner_and_name in _git.py
- Simplify _par_apply_fns to delegate to _apply_fns in danom/stream.py
- Remove duplicate performance_gain from _verification.py (use codeflash_core's version)
- Extract _extract_pytest_error helper in _verification.py (replaces duplicated 6-line block)
- Consolidate collect_names_from_annotation into collect_type_names_from_annotation in _ast_helpers.py
- Add ast.Attribute handling and relax BinOp guard in collect_type_names_from_annotation
- Add unit tests for all extracted helpers
2026-04-23 22:39:50 -05:00
Kevin Turcios
ffadf16147
chore: add standup dashboard with CI audit integration (#36)
Dash app at .codeflash/standups/ for weekly eng meetings. Pulls live PR data across 4 org repos, renders markdown standup notes, integrates CI audit report with corrected billing numbers from real GitHub API data. Deployed to Plotly Cloud.
2026-04-23 18:52:33 -05:00
Kevin Turcios
3ee9c22c8e
fix: resolve all ruff lint errors across repo (#38)
* fix: resolve all ruff lint errors across repo

Auto-fixed 31 errors (unused imports, formatting, simplifications).
Manually fixed 14 remaining:
- EXE001: removed shebangs from non-executable bench scripts
- C417: replaced map(lambda) with generator expression
- C901/PLR0915: extracted _write_and_instrument_tests from generate_ai_tests
- C901/PLR0912: extracted _parse_toml_addopts and _ini_section_name from modify_addopts
- RUF001/RUF002: replaced ambiguous Unicode chars (en dash, multiplication sign)
- FBT002: made boolean params keyword-only in report functions
- E402: moved `import re` to top of file in security reports

* fix: resolve pre-existing mypy errors across packages

- _testgen.py: annotate `generated` as `str` to avoid no-any-return
- _test_runner.py: use str() for TimeoutExpired stdout/stderr (bytes|str),
  remove unused type: ignore on proc.kill()
- _candidate_eval.py: annotate `speedup` as `float` to avoid no-any-return
  from lazy-loaded performance_gain
2026-04-23 10:22:42 -05:00
codeflash-ci-bot[bot]
c249bcd0ce
chore: update tessl tiles 2026-04-23 (#35)
Co-authored-by: codeflash-ci-bot[bot] <codeflash-ci-bot[bot]@users.noreply.github.com>
2026-04-23 08:15:44 -05:00
Kevin Turcios
9bc9ff2250 chore: add ready-to-merge gate for branch freshness 2026-04-23 08:12:25 -05:00
Kevin Turcios
5d02df9890
Merge pull request #34 from codeflash-ai/fix/tessl-app-token
fix: pass CI bot secrets to tessl update workflow
2026-04-23 08:04:58 -05:00
Kevin Turcios
3851fd4cf9 fix: pass CI bot secrets to tessl update workflow 2026-04-23 08:04:06 -05:00
Kevin Turcios
82f16dc9f0
Merge pull request #33 from codeflash-ai/fix/tessl-caller-permissions
fix: add permissions to tessl update caller workflow
2026-04-23 07:50:43 -05:00
Kevin Turcios
fa1b5ece4e fix: add permissions to tessl update caller workflow 2026-04-23 07:49:56 -05:00
Kevin Turcios
01dc370d09
Merge pull request #32 from codeflash-ai/chore/tessl-vendored-setup
chore: initialize tessl with vendored tiles
2026-04-23 07:44:58 -05:00
Kevin Turcios
f83ece9ad8 chore: initialize tessl with vendored tiles
Install 32 Python tiles in vendored mode, add MCP configs for all
agents, and set up weekly tile update workflow via reusable
github-workflows caller.
2026-04-23 07:43:04 -05:00
Kevin Turcios
616d078ecd Add CI optimization item to roadmap 2026-04-23 05:59:08 -05:00
Kevin Turcios
851bd61017 Add project roadmap 2026-04-23 05:27:58 -05:00
Kevin Turcios
72a8610fcf Add vulture to dev dependencies for dead code detection 2026-04-23 04:57:32 -05:00
Kevin Turcios
57446aad31 Fix unawaited coroutine warning in test_default_timeout_is_600
The AsyncMock for wait_for discarded the coroutine from
proc.communicate() without consuming it. Replace with a side_effect
that closes the coroutine before returning the mock result.
2026-04-23 04:46:32 -05:00
Kevin Turcios
76a07c7f66 Add __test__ = False to Test*-prefixed domain model classes
Pytest's default collection pattern matches any class starting with
"Test". These domain models (TestType, TestResults, TestFiles, etc.)
are not test classes — mark them explicitly so pytest skips them.
2026-04-23 04:41:18 -05:00
Kevin Turcios
4f98b5421f Rename TestDiff/TestDiffScope to BehaviorDiff/BehaviorDiffScope
These classes represent behavioral verification diffs, not tests. The
Test* prefix caused pytest to attempt collection and emit warnings.
2026-04-23 04:37:24 -05:00
Kevin Turcios
9e893675c9 Add Plotly Cloud deployment config for CI audit report 2026-04-23 03:59:35 -05:00
Kevin Turcios
c492164fbf Add codeflash org CI audit case study and interactive Dash report
Case study in .codeflash/krrt7/codeflash-ai/ci-audit/ with README,
status, and raw data (fork activity, PRs merged).

Interactive Dash report in reports/codeflash-ci-audit/ with two tabs:
Executive Summary (hero metrics, cost impact charts, before/after) and
Full Detail (fork breakdown, findings table, PR inventory, methodology).

Key numbers: 71% fewer workflow runs, ~$12K/yr in Enterprise overage
savings, 200+ forks disabled, 11 PRs merged across 2 repos.
2026-04-23 03:56:04 -05:00
Kevin Turcios
8221ce32a2 Add codeflash-ai/ci-audit to active case studies list 2026-04-23 03:52:10 -05:00
Kevin Turcios
0316d9822a Add testcontainers[postgres] to workspace dev dependencies
The 12 DB integration tests in codeflash-api need testcontainers to spin
up a real PostgreSQL instance via Docker. Was already declared in the
package's own dev deps but missing from the root workspace.
2026-04-23 03:52:07 -05:00
Kevin Turcios
bf8707695f Add codeflash-api to workspace dev dependencies
The package was a workspace member but not listed in the root dev
group, so its tests couldn't import codeflash_api when running
from the monorepo root.
2026-04-23 03:38:30 -05:00
Kevin Turcios
e3e74c3f2e Add missing pyproject.toml for codeflash-ci-audit workspace member 2026-04-23 03:34:03 -05:00
Kevin Turcios
e41a1bf56a Fix conftest collision between codeflash-api and github-app test suites
Both packages had tests/__init__.py, creating competing `tests`
packages under --import-mode=importlib. Remove both __init__.py files
and change github-app imports from `from tests.helpers` to
`from helpers` via sys.path insertion in conftest.py.
2026-04-23 03:33:58 -05:00
Kevin Turcios
43a4009294 Fix callee syntax validation in prepare_python_module
normalize_code no longer raises SyntaxError (it returns raw code as
fallback), so validate callee source with ast.parse() explicitly
before normalizing. Fixes test_callee_syntax_error_returns_none.
2026-04-23 03:28:58 -05:00
Kevin Turcios
bd5613d22f Update test-coverage.md: remove resolved callouts for covered modules 2026-04-23 03:13:28 -05:00
Kevin Turcios
e1990092e0 Add tests for error handling paths in ranking, refinement, and state
- test_ranking: Update normalize_code test to expect fallback on invalid
  syntax instead of SyntaxError (matches new behavior)
- test_refinement: Add 7 tests for _parse_candidate markdown parsing
  (fenced blocks, file paths, multiple blocks, plain fallback)
- test_state: Add 6 tests for PythonState.module_ast and invalidate_module
  (valid parse, caching, SyntaxError→None, re-parse after fix)
2026-04-23 03:13:23 -05:00
Kevin Turcios
9e679f1c06 Fix error handling: add logging to bare excepts, protect ast.parse(), parse markdown in refinement
- _tracing.py: Add log.warning(exc_info=True) to 4 bare except blocks that
  previously silently swallowed errors
- _state.py: Wrap ast.parse() in SyntaxError handler, return None for
  malformed files
- _ranking.py: Wrap ast.parse() in SyntaxError handler, fall back to raw
  code string for dedup
- _refinement.py: Add CodeStringsMarkdown.parse_markdown_code() to
  _parse_candidate(), matching the pattern in _candidate_gen.py
- Update error-handling.md rules to reflect resolved issues
2026-04-23 03:06:03 -05:00