Commit graph

151 commits

Author SHA1 Message Date
Kevin Turcios
e41a1bf56a Fix conftest collision between codeflash-api and github-app test suites
Both packages had tests/__init__.py, creating competing `tests`
packages under --import-mode=importlib. Remove both __init__.py files
and change github-app imports from `from tests.helpers` to
`from helpers` via sys.path insertion in conftest.py.
2026-04-23 03:33:58 -05:00
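A minimal sketch of the `sys.path` insertion described in e41a1bf56a above. The helper name `ensure_on_sys_path` is hypothetical; the actual conftest.py may simply inline these lines.

```python
import sys
from pathlib import Path


def ensure_on_sys_path(directory: Path) -> None:
    """Prepend `directory` to sys.path once, so `from helpers import ...` resolves."""
    entry = str(directory.resolve())
    if entry not in sys.path:
        sys.path.insert(0, entry)


# In a conftest.py this would typically be called as (assumed usage):
#     ensure_on_sys_path(Path(__file__).parent)
```

With no `tests/__init__.py` in either package, each suite's helper modules are imported by their bare name rather than through a shared `tests` package.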
Kevin Turcios
43a4009294 Fix callee syntax validation in prepare_python_module
normalize_code no longer raises SyntaxError (it returns raw code as
fallback), so validate callee source with ast.parse() explicitly
before normalizing. Fixes test_callee_syntax_error_returns_none.
2026-04-23 03:28:58 -05:00
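The ordering described in 43a4009294 above can be sketched as follows. Both function bodies are illustrative assumptions: `normalize_code` here is a stand-in that mimics only the "return raw code as fallback" behavior, and `prepare_callee` is a hypothetical name, not the real `prepare_python_module`.

```python
import ast
from typing import Optional


def normalize_code(source: str) -> str:
    """Stand-in: canonicalize via the AST, falling back to the raw text."""
    try:
        return ast.unparse(ast.parse(source))
    except SyntaxError:
        return source  # no longer raises, so callers cannot rely on it to validate


def prepare_callee(source: str) -> Optional[str]:
    """Return normalized callee source, or None when the source does not parse."""
    try:
        ast.parse(source)  # explicit validation step before normalizing
    except SyntaxError:
        return None
    return normalize_code(source)
```

Because the fallback hides the error, validation has to happen before normalization rather than being inferred from it.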
Kevin Turcios
bd5613d22f Update test-coverage.md: remove resolved callouts for covered modules 2026-04-23 03:13:28 -05:00
Kevin Turcios
e1990092e0 Add tests for error handling paths in ranking, refinement, and state
- test_ranking: Update normalize_code test to expect fallback on invalid
  syntax instead of SyntaxError (matches new behavior)
- test_refinement: Add 7 tests for _parse_candidate markdown parsing
  (fenced blocks, file paths, multiple blocks, plain fallback)
- test_state: Add 6 tests for PythonState.module_ast and invalidate_module
  (valid parse, caching, SyntaxError→None, re-parse after fix)
2026-04-23 03:13:23 -05:00
Kevin Turcios
9e679f1c06 Fix error handling: add logging to bare excepts, protect ast.parse(), parse markdown in refinement
- _tracing.py: Add log.warning(exc_info=True) to 4 bare except blocks that
  previously silently swallowed errors
- _state.py: Wrap ast.parse() in SyntaxError handler, return None for
  malformed files
- _ranking.py: Wrap ast.parse() in SyntaxError handler, fall back to raw
  code string for dedup
- _refinement.py: Add CodeStringsMarkdown.parse_markdown_code() to
  _parse_candidate(), matching the pattern in _candidate_gen.py
- Update error-handling.md rules to reflect resolved issues
2026-04-23 03:06:03 -05:00
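A small sketch of the "log instead of silently swallowing" pattern that 9e679f1c06 above applies in `_tracing.py`. The function `record_event`, its parameters, and the logger name are hypothetical; only `log.warning(..., exc_info=True)` is taken from the commit message.

```python
import logging

log = logging.getLogger("tracing_sketch")


def record_event(serialize, payload):
    """Best-effort call: never raise into the traced program, but leave a trace in the logs."""
    try:
        return serialize(payload)
    except Exception:
        # exc_info=True attaches the active traceback to the log record
        log.warning("failed to serialize trace payload", exc_info=True)
        return None
```

The caller still gets `None` on failure, as before, but the traceback is now recoverable from the log output.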
Kevin Turcios
dd7d2db451 Add unit tests for _benchmark_worker subprocess script
5 tests covering module-level argv parsing, project_root derivation,
benchmark plugin and trace decorator imports, and __main__ guard.
2026-04-23 02:31:38 -05:00
Kevin Turcios
e2135e39b2 Add unit tests for vendored _tabulate module
64 tests covering: tabulate() with pipe/simple formats, empty/None
data, dict input, numeric alignment, float formatting, whitespace
preservation, separating lines, firstrow headers. Internal helpers:
type detection, number parsing, ANSI stripping, padding, multiline
detection, pipe segment alignment. Integration test matching the
_create_pr use case.
2026-04-23 02:31:38 -05:00
Kevin Turcios
276c2f36da Add unit tests for _discovery_worker (collection parsing, plugin)
Covers parse_pytest_collection_results with top-level functions,
class methods, and empty input. Tests PytestCollectionPlugin
benchmark skipping, collection_finish pickle output, and item
accumulation. Uses sys.argv patching to handle module-level reads.
2026-04-23 02:28:23 -05:00
Kevin Turcios
957f299243 Add unit tests for _create_pr (PR creation, suggestion, error paths)
Covers PR number env var parsing, suggest-changes vs create-pr
branching, branch push failure, GitHub App not-installed warning,
and generic API error logging.
2026-04-23 02:24:01 -05:00
Kevin Turcios
c31fbc1e43 Add unit tests for _trace_db (sanitize, trace queries, run time)
Covers sanitize_to_filename edge cases, get_traced_arguments with
class filtering and invalid event types, and get_trace_total_run_time_ns
with missing files/tables/empty tables.
2026-04-23 02:23:56 -05:00
Kevin Turcios
cf7cf60936 Add unit tests for _candidate_gen (generate, repair, refinement)
Covers happy paths and error paths for generate_candidates,
repair_failed_candidates, and generate_refinement_candidates.
Tests AI service errors, unparseable markdown, missing runtime
data, and repair failures.
2026-04-23 02:23:52 -05:00
Kevin Turcios
815eba00c0 Fix unawaited coroutine warning in test_skips_ai_test_generation
optimize() is now async, so the test must use async def + await.
2026-04-23 01:47:26 -05:00
Kevin Turcios
92e39d6923 Convert remaining sync test runner callers to async
Replace all sync test runner calls (run_behavioral_tests,
run_benchmarking_tests, run_line_profile_tests) with their async
counterparts throughout the pipeline. This eliminates the
ThreadPoolExecutor in _baseline.py in favor of asyncio.gather(),
and makes _async_bench.py, _candidate_gen.py, and
_function_optimizer.py fully async. Adds async_run_line_profile_tests
and coverage support to async_run_behavioral_tests in _test_runner.py.
2026-04-23 01:46:01 -05:00
Kevin Turcios
a292698a1d Add pytest-cov to dev dependencies 2026-04-23 00:41:32 -05:00
Kevin Turcios
f204f8e740 Unify sync/async candidate eval into single async path
Delete the sync evaluate_candidate() and run_tests_and_benchmark()
functions — all callers now use the async versions. Rename
async_run_tests_and_benchmark → run_tests_and_benchmark and
async_evaluate_candidate_isolated → evaluate_candidate_isolated.

The entire optimization pipeline is now async with a single
asyncio.run() entry point in _cli.py:main(). PythonOptimizer.run()
and PythonFunctionOptimizer.optimize() are async coroutines.

Update test_candidate_eval.py and test_parallel_eval_integration.py
to match the unified API.
2026-04-23 00:41:28 -05:00
Kevin Turcios
8defba8a72 Add unit tests for async test runners and candidate evaluation
29 new tests in test_test_runner.py covering async_execute_test_subprocess,
async_run_behavioral_tests, async_run_benchmarking_tests, _base_pytest_args,
replay test path, and coverage path.

21 new tests in test_candidate_eval.py covering evaluate_candidate,
rank_candidates, build_benchmark_details, log_evaluation_results, and
async_run_tests_and_benchmark.
2026-04-23 00:24:59 -05:00
Kevin Turcios
8d308fe8e8 Replace ThreadPoolExecutor with asyncio for parallel candidate evaluation
Thread-safety concern with shared EvaluationContext mutations is
eliminated by switching to cooperative concurrency — between await
points only one coroutine runs, so no locks are needed.

Adds async variants of test runners (async_execute_test_subprocess,
async_run_behavioral_tests, async_run_benchmarking_tests) and async
evaluation functions (async_run_tests_and_benchmark,
async_evaluate_candidate_isolated). Rewrites _evaluate_batch_parallel
to use asyncio.Semaphore + asyncio.gather instead of ThreadPoolExecutor.
2026-04-23 00:12:53 -05:00
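The `asyncio.Semaphore` + `asyncio.gather` shape described in 8d308fe8e8 above might look roughly like this. The names `evaluate_batch`, `evaluate`, and `max_concurrency` are assumptions for illustration, not the signature of `_evaluate_batch_parallel`.

```python
import asyncio


async def evaluate_batch(candidates, evaluate, max_concurrency=4):
    """Run `evaluate(candidate)` for every candidate, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(candidate):
        async with semaphore:  # waits here once the limit is reached
            return await evaluate(candidate)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(c) for c in candidates))
```

Since all coroutines run on one event loop thread, shared state is only ever touched by one coroutine between `await` points, which is why the commit can drop the locks.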
Kevin Turcios
df3538167f Add .coverage to gitignore 2026-04-22 23:40:22 -05:00
Kevin Turcios
fb76024cfb Fix CLAUDE.md accuracy: remove nonexistent files, update patterns
- Remove _line_profiler.py, observability/models.py, _optimizer.py,
  _rate_limit.py, _usage.py from tree (never created)
- Add _background.py, _markdown.py, _xml.py that actually exist
- Mark java/ and js_ts/ as stubs
- Update endpoint count from 15 to 14, note log_features stub
- Fix Depends() example to use Annotated[] pattern
- Add deferred items: optimize-line-profiler, observability DB writes
2026-04-22 23:40:01 -05:00
Kevin Turcios
3a07579bb0 Raise codeflash-api test coverage from 81% to 92%
Add 182 new tests across optimize, V4A diff, CST utils, and postprocess
modules. Key coverage improvements:
- optimize/_pipeline.py: 29% → 97%
- optimize/_router.py: 40% → 93%
- diff/_v4a.py: 40% → 97%
- languages/python/_cst_utils.py: 67% → 96%
- languages/python/_postprocess.py: 67% → 87%

Also apply ruff format to 5 files that had formatting drift.
2026-04-22 23:39:54 -05:00
Kevin Turcios
2d9fca6b3e Fix all ruff lint errors in codeflash-core
- Replace commented-out code pattern with descriptive comment in __init__.py
- Move ModuleType into TYPE_CHECKING block in _git.py
- Add noqa: F821 for PEP 562 lazy-loaded git module references
- Restore noqa: PLC0415 on reformatted sentry imports in _telemetry.py
2026-04-22 23:39:47 -05:00
Kevin Turcios
fdfade528f Strengthen testgen test assertions and remove duplicate integration tests
Replace weak assertions (len >= 1, bare MagicMock) with exact counts,
_stub_llm_response, response body checks, and mock call verification.
Remove 6 duplicate TestTestgenIntegration tests already covered in
test_testgen.py::TestTestgenEndpoint.
2026-04-22 23:24:36 -05:00
Kevin Turcios
758da2592f Achieve 100% test coverage for testgen module
Add 15 new tests covering all previously uncovered paths:
- _validate.py: regex class splitting, trailing blank stripping,
  repair preamble edge cases (empty during iteration, lineno=None,
  out-of-range index, max attempts exhausted), AST gap/decorator paths
- _generate.py: multi-context ellipsis detection, extract_code_block
  returning None, no test functions after validation
- _review_router.py: non-dict/non-list JSON in review verdicts

Mark 2 provably unreachable defensive lines with pragma: no cover.
2026-04-22 23:10:27 -05:00
Kevin Turcios
92c5fd7c74 Remove instrumented_behavior_tests and instrumented_perf_tests from testgen API
Instrumentation (behavior/perf AST transformations) moves to the client
side. The API now returns raw validated code only via generated_tests.
2026-04-22 23:10:16 -05:00
Kevin Turcios
051317e2dc Mark all 9 implementation steps complete in architecture docs 2026-04-22 22:11:08 -05:00
Kevin Turcios
4b219907fd Implement POST /ai/testgen endpoint with full generation pipeline
Port test generation from Django reference: prompt templates (Jinja2
with model-type-aware formatting), LLM call orchestration with
even/odd model selection, AST-based code validation with regex
fallback, preamble repair, and ellipsis detection. Instrumentation
and postprocessing are deferred — all four response fields return
the same validated code for now.
2026-04-22 22:11:04 -05:00
Kevin Turcios
b3840627bb Use explicit .strip() assertions in testgen repair tests 2026-04-22 20:36:14 -05:00
Kevin Turcios
6abcc8daa3 Add testgen review and repair endpoints
Port /ai/testgen_review and /ai/testgen_repair from Django reference.
Review: parallel LLM calls per test source, auto-flags behavioral
failures, parses JSON verdicts. Repair: Jinja2 prompt templates,
syntax-error retry loop, Python code extraction and validation.

Schemas: TestgenReviewRequest/Response, TestRepairRequest/Response,
CoverageDetails, FunctionVerdict, TestSourceWithFailures.

23 tests covering: coverage context building, verdict parsing,
syntax error detection, endpoint success/error/retry/language paths,
and the model validator for python_version resolution.
2026-04-22 20:35:39 -05:00
Kevin Turcios
1d70d65914 Wire observability recording into LLM client
Add fire-and-forget background task manager (background.py) and
LLM call recording (recording.py). Every LLMClient.call now records
trace_id, model, latency, tokens, cost, and errors via fire-and-forget.
drain() awaits pending tasks on shutdown. Currently logs only —
database persistence deferred until llm_calls table is wired.
2026-04-22 20:30:10 -05:00
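A minimal sketch of a fire-and-forget task manager with a `drain()` step, as outlined in 1d70d65914 above. The class name `BackgroundTasks` and the method `fire_and_forget` are hypothetical; only `drain()` is named in the commit message.

```python
import asyncio


class BackgroundTasks:
    """Tracks fire-and-forget tasks so they can be awaited on shutdown."""

    def __init__(self):
        self._pending = set()

    def fire_and_forget(self, coro):
        task = asyncio.create_task(coro)
        # Keep a strong reference so the task is not garbage-collected mid-flight
        self._pending.add(task)
        task.add_done_callback(self._pending.discard)
        return task

    async def drain(self):
        """Wait for all pending tasks, including any scheduled while draining."""
        while self._pending:
            await asyncio.gather(*list(self._pending), return_exceptions=True)
```

`return_exceptions=True` keeps one failed recording from aborting shutdown; a real implementation would probably also log those failures.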
Kevin Turcios
a62f1ecd03 Add real DB integration tests with testcontainers
12 tests covering all Queries methods against a real PostgreSQL
instance via testcontainers. Automatically skipped when Docker is
unavailable. Tests: api key lookup, last_used update, organization
fetch, subscription CRUD, usage increment, cumulative increments.
2026-04-22 20:02:28 -05:00
Kevin Turcios
3e16d44912 Fix all mypy strict errors across codeflash-api
- Narrow search_start guard in search/replace parser
- Type optimizations_limit as int|str instead of object
- Wrap cost calculation return in float()
- Cast binary op result to int in CST evaluator
- Suppress import-untyped for asyncpg (no stubs available)
- Suppress arg-type for OpenAI messages (dict→union mismatch)
- Type isort kwargs as Any, add Coroutine import for refinement
- Narrow feature_version to tuple[int, int] for ast.parse
- Rename shadowed loop variable in annotation walker
- Add mypy strict=true config to pyproject.toml
2026-04-22 19:59:42 -05:00
Kevin Turcios
03bb712c65 Add integration tests and fix AuthenticatedUser runtime import
FastAPI with `from __future__ import annotations` cannot resolve
Annotated[AuthenticatedUser, Depends()] when AuthenticatedUser is
only imported under TYPE_CHECKING — it becomes a query parameter
instead of a dependency. Move to runtime import in all 11 routers
with noqa: TC001 suppression.

30 integration tests cover all endpoints (success, invalid trace_id,
LLM failure, edge cases) using httpx ASGITransport with mocked LLM.
2026-04-21 22:48:30 -05:00
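The mechanism behind 03bb712c65 above can be illustrated with the standard library alone. This sketch does not use FastAPI; it only shows that a string annotation (which is what `from __future__ import annotations` produces for every annotation) cannot be resolved at runtime when the name was imported under `TYPE_CHECKING`. All function names are hypothetical, and `Decimal` and `Fraction` stand in for `AuthenticatedUser`.

```python
import typing
from fractions import Fraction  # runtime import: resolvable
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from decimal import Decimal  # seen by type checkers only, never at runtime


def handler_type_checking_only(amount: "Decimal") -> None:
    """Annotation refers to a name that does not exist at runtime."""


def handler_runtime(amount: "Fraction") -> None:
    """Annotation refers to a name that was really imported."""


def can_resolve(func) -> bool:
    """True if the function's string annotations evaluate to real objects."""
    try:
        typing.get_type_hints(func)
    except NameError:
        return False
    return True
```

A framework that inspects annotations at runtime is in the same position as `can_resolve`, which is why the commit moves the import out of the `TYPE_CHECKING` block.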
Kevin Turcios
935c6f229e Add remaining endpoints: repair, refinement, adaptive, explain, review, ranking, jit, workflow, testgen, log_features
Port all P1 endpoints from the Django aiservice to FastAPI:

- repair: 2-attempt LLM retry, SearchAndReplaceDiff patch application
- refinement: parallel LLM calls via asyncio.gather, single/multi-file
  context dispatch, XML explanation extraction, deduplication
- adaptive: single LLM call with previous candidate history
- explain: conditional throughput/concurrency/acceptance prompt sections,
  XML <explain> tag extraction
- review: 4-dimension scoring, JSON code block extraction, 2-attempt retry
- ranking: 4-dimension weighted scoring, JSON extraction with 3 fallbacks
  (direct parse, markdown block, brace matching), legacy XML fallback
- jit: reuses optimize pipeline with JIT-specific prompts
- workflow: 3-tier regex YAML extraction, LLM-generated CI steps
- testgen: stub returning 501 (language-specific logic deferred)
- log_features: trace_id validation, DB write stubbed

Also adds:
- Task-specific model assignments in llm/_models.py
- XML tag extraction utility in languages/python/_xml.py
- All 11 routers registered in _app.py

348 tests passing, all lint clean.
2026-04-21 22:36:31 -05:00
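The three JSON extraction fallbacks listed for the ranking endpoint in 935c6f229e above (direct parse, markdown block, brace matching) could be sketched as below. This is an assumption-laden illustration: the name `extract_json` is hypothetical, and the third tier uses `json.JSONDecoder.raw_decode` from the first `{` as a stand-in for hand-written brace matching.

```python
import json
import re

_TICKS = "`" * 3  # built programmatically to avoid a literal fence in this example
_FENCE = re.compile(_TICKS + r"(?:json)?\s*\n(.*?)" + _TICKS, re.DOTALL)


def extract_json(text):
    """Try progressively looser ways of pulling a JSON value out of LLM output."""
    # Tier 1: the whole response is JSON
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Tier 2: JSON inside a fenced markdown code block
    match = _FENCE.search(text)
    if match:
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            pass
    # Tier 3: decode one JSON value starting at the first opening brace
    start = text.find("{")
    if start != -1:
        try:
            value, _end = json.JSONDecoder().raw_decode(text, start)
            return value
        except json.JSONDecodeError:
            pass
    return None
```

Ordering the tiers from strictest to loosest means well-formed responses never go through the heuristic paths.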
Kevin Turcios
6c04324e25 Add optimize endpoint: context, pipeline, router, prompt templates
Faithful port of the Python optimization pipeline from Django aiservice:
- schemas.py: Pydantic request/response models (OptimizeRequest, OptimizeResponse)
- _markdown.py: markdown code block extraction, splitting, grouping
- _context.py: BaseOptimizerContext with Single/Multi variants for
  prompt assembly, LLM response extraction, and postprocessing
- _pipeline.py: parallel LLM orchestration with model distribution
  (GPT-5-mini + Claude Sonnet 4.5), diversity via line profiler toggling
- _router.py: POST /ai/optimize with auth, rate limiting, usage tracking
- 11 prompt templates copied verbatim from Django reference
- LLM client wired into app lifespan
2026-04-21 22:16:22 -05:00
Kevin Turcios
3e62f502e7 Add language layer: CST utils, validator, postprocessing pipeline
Faithful port of Python language utilities from Django aiservice:
- _cst_utils.py: depth tracking, import extraction, definition removal,
  ellipsis detection, expression evaluation, module path helpers
- _validator.py: dual ast+libcst syntax validation, parse-or-none
- _postprocess.py: full optimization postprocessing pipeline including
  dedup, equality check, docstring restoration, comment cleaning,
  forward reference fixing, ellipsis filtering, isort
2026-04-21 22:04:39 -05:00
Kevin Turcios
5c6b82050a Add diff layer: SEARCH/REPLACE and V4A patch application
Faithfully ported from Django aiservice. V4A uses 3-tier fuzzy
context matching (exact/rstrip/strip) with EOF penalties and scope
markers. Per-file lint ignores for ported complexity.
2026-04-21 21:55:28 -05:00
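A rough sketch of the three-tier context matching (exact/rstrip/strip) mentioned in 5c6b82050a above. It deliberately omits the EOF penalties and scope markers, and the function name `find_context` and its return shape are assumptions rather than the ported implementation.

```python
def find_context(lines, context, start=0):
    """Locate `context` (a list of lines) in `lines`, loosening whitespace rules tier by tier.

    Returns (index, tier_name), or (-1, None) when no tier matches.
    """
    tiers = (
        ("exact", lambda s: s),    # byte-for-byte equality
        ("rstrip", str.rstrip),    # ignore trailing whitespace
        ("strip", str.strip),      # ignore leading and trailing whitespace
    )
    width = len(context)
    for name, normalize in tiers:
        wanted = [normalize(c) for c in context]
        for i in range(start, len(lines) - width + 1):
            if [normalize(line) for line in lines[i:i + width]] == wanted:
                return i, name
    return -1, None
```

Trying every position at the strictest tier before relaxing means an exact match later in the file wins over a fuzzy match earlier in it.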
Kevin Turcios
2acebdbf51 Add DB layer: asyncpg pool, engine, row schemas, lifespan wiring
Pool creation with min=2/max=100, row schema attrs classes for all
7 tables, and lifespan integration in the app factory for pool
startup/shutdown.
2026-04-21 21:55:19 -05:00
Kevin Turcios
fcaac3a9f2 Add LLM layer: client abstraction, cost calculation, retry policy
Dual-provider client (Azure OpenAI + Anthropic Bedrock) behind
a common async interface with cache-aware cost calculation and
event-loop-safe client lifecycle.
2026-04-21 21:55:09 -05:00
Kevin Turcios
d20b82762a Add auth layer: key hashing, rate limiting, usage tracking
SHA-384 + base64url key hashing matching the JS client. FastAPI
dependencies for require_auth, check_rate_limit, and track_usage
with Annotated[Depends()] pattern. Per-user per-endpoint rate
limiting with employee bypass. Atomic subscription usage tracking
with enterprise org and employee exemptions. DB queries module
with asyncpg raw SQL for auth tables. 27 new tests covering auth
flow, rate limits, usage enforcement, and edge cases.
2026-04-21 21:33:02 -05:00
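The key hashing named in d20b82762a above (SHA-384 followed by base64url) can be sketched with the standard library. The function name `hash_api_key` and the UTF-8 encoding of the key are assumptions; the exact byte handling needed to match the JS client is not stated in the commit.

```python
import base64
import hashlib


def hash_api_key(api_key: str) -> str:
    """Hash an API key with SHA-384 and encode the digest as URL-safe base64."""
    digest = hashlib.sha384(api_key.encode("utf-8")).digest()  # 48 bytes
    return base64.urlsafe_b64encode(digest).decode("ascii")
```

A 48-byte digest encodes to exactly 64 base64 characters, so no `=` padding appears and the padding convention does not affect comparisons.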
Kevin Turcios
69714f410f Scaffold codeflash-api package with app factory, config, and healthcheck
FastAPI app factory with lifespan, CORS, optional Sentry. Pydantic-settings
config for all env vars. Full directory structure for all 15 endpoints per
the architecture doc. Workspace integration: ruff src paths, isort, pytest
testpaths, per-file ignores. aiohttp for production, httpx for test client.
2026-04-21 21:28:59 -05:00
Kevin Turcios
0901db9fee Update coveragepy status after E2E validation session 2026-04-21 21:19:24 -05:00
Kevin Turcios
e34873fb82 Add codeflash-api architecture docs and project-scoped rules
CLAUDE.md with full package structure, layer boundaries, endpoint map,
implementation order, business logic audit, and design decisions.

Rules: architecture (layer boundaries, model conventions), testing
(coverage requirements, mocking strategy), porting (reference files,
what to port vs skip).
2026-04-21 21:16:32 -05:00
Kevin Turcios
2221de0a71 Clarify attrs for internals, Pydantic for API boundary only 2026-04-21 21:13:04 -05:00
Kevin Turcios
c978a9d2a9 Add codeflash-api to project layout and rewrite context in CLAUDE.md 2026-04-21 21:12:30 -05:00
Kevin Turcios
122152b3ce Upgrade all dependencies to latest versions
Notable: pydantic 2.13.3, rich 15.0.0, fastapi 0.136.0, sentry-sdk 2.58.0,
ruff 0.15.11, mypy 1.20.2, uvicorn 0.45.0. All 2693 tests pass.
2026-04-21 21:11:25 -05:00
Kevin Turcios
b912121cf4 Upgrade lxml to 6.1.0 to fix XXE vulnerability (CVE-2026-41066) 2026-04-21 21:07:09 -05:00
Kevin Turcios
c0a72e978d Add rules from session audit: error handling, testing, debugging
- sessions.md: hard compaction limits, no-polling, file read budget
- debugging.md: root cause first, isolated testing, subprocess logging
- github.md: strengthen MCP-first enforcement
- error-handling.md (packages): no silent swallowing, protect ast.parse
- test-coverage.md (packages): every module needs tests, known gaps
2026-04-21 21:06:15 -05:00
Kevin Turcios
cefd625d35 Fix pytest 9 compat, addopts conflicts, XML recovery, and diagnostics
- Use getattr for rootdir/rootpath in discovery worker (pytest 9 compat)
- Add -o addopts= to all pytest invocations to override project config
- Extract _base_pytest_args helper to eliminate duplication across runners
- Support [tool.pytest] config section (not just [tool.pytest.ini_options])
- Add --dist, --no-flaky-report, --failed-first to BLACKLIST_ADDOPTS
- Add recover=True to XMLParser for malformed JUnit XML tolerance
- Log subprocess stdout/stderr on baseline and candidate test failures
- Friendly warning when GitHub App not installed instead of raw error
- Upgrade repair failure logging from debug to warning
2026-04-21 20:41:43 -05:00
Kevin Turcios
17de71251f Send existing unit test source to optimizer for behavioral context
_extract_test_input_examples() now includes EXISTING_UNIT_TEST files
(hand-written assertions) in addition to GENERATED_REGRESSION tests.
Existing tests are prioritized since they represent the developer's
explicit behavioral expectations. Extracted a _collect_test_sources()
helper to keep complexity manageable.
2026-04-21 17:06:30 -05:00
Kevin Turcios
9abaa95437 Fix pytest rootdir mismatch during overlay candidate evaluation
When evaluating candidates in an overlay directory, pytest's
--rootdir was set to the overlay cwd, causing classnames in JUnit
XML to be computed relative to the wrong directory. The XML parser
then failed to resolve test files, producing "0 diffs" instead of
detecting real behavioral failures. Pass tests_project_rootdir as
--rootdir so classnames are always relative to the project root.
2026-04-21 15:45:32 -05:00