Covers PR number env var parsing, suggest-changes vs create-pr
branching, branch push failure, GitHub App not-installed warning,
and generic API error logging.
Covers sanitize_to_filename edge cases, get_traced_arguments with
class filtering and invalid event types, and get_trace_total_run_time_ns
with missing files/tables/empty tables.
Covers happy paths and error paths for generate_candidates,
repair_failed_candidates, and generate_refinement_candidates.
Tests AI service errors, unparseable markdown, missing runtime
data, and repair failures.
Replace all sync test runner calls (run_behavioral_tests,
run_benchmarking_tests, run_line_profile_tests) with their async
counterparts throughout the pipeline. This eliminates the
ThreadPoolExecutor in _baseline.py in favor of asyncio.gather(),
and makes _async_bench.py, _candidate_gen.py, and
_function_optimizer.py fully async. Add async_run_line_profile_tests
and coverage support to async_run_behavioral_tests in _test_runner.py.
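
A minimal sketch of the gather-based fan-out; the wrapper name and the simplified runner signatures are illustrative, since the real coroutines in _test_runner.py take test files, working directory, and timeouts as arguments.

```python
import asyncio

# Simplified stand-ins for the async runners in _test_runner.py.
async def async_run_behavioral_tests() -> str:
    await asyncio.sleep(0)
    return "behavioral results"

async def async_run_benchmarking_tests() -> str:
    await asyncio.sleep(0)
    return "benchmark results"

async def async_run_line_profile_tests() -> str:
    await asyncio.sleep(0)
    return "line profile results"

async def establish_baseline() -> tuple[str, str, str]:
    # gather() runs all three coroutines concurrently and returns results
    # in call order, replacing the ThreadPoolExecutor fan-out in _baseline.py.
    behavioral, benchmarking, line_profile = await asyncio.gather(
        async_run_behavioral_tests(),
        async_run_benchmarking_tests(),
        async_run_line_profile_tests(),
    )
    return behavioral, benchmarking, line_profile
```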
Delete the sync evaluate_candidate() and run_tests_and_benchmark()
functions — all callers now use the async versions. Rename
async_run_tests_and_benchmark → run_tests_and_benchmark and
async_evaluate_candidate_isolated → evaluate_candidate_isolated.
The entire optimization pipeline is now async with a single
asyncio.run() entry point in _cli.py:main(). PythonOptimizer.run()
and PythonFunctionOptimizer.optimize() are async coroutines.
Update test_candidate_eval.py and test_parallel_eval_integration.py
to match the unified API.
29 new tests in test_test_runner.py covering async_execute_test_subprocess,
async_run_behavioral_tests, async_run_benchmarking_tests, _base_pytest_args,
replay test path, and coverage path.
21 new tests in test_candidate_eval.py covering evaluate_candidate,
rank_candidates, build_benchmark_details, log_evaluation_results, and
async_run_tests_and_benchmark.
The thread-safety concern around shared EvaluationContext mutations is
eliminated by switching to cooperative concurrency: between await
points only one coroutine runs, so no locks are needed.
Adds async variants of test runners (async_execute_test_subprocess,
async_run_behavioral_tests, async_run_benchmarking_tests) and async
evaluation functions (async_run_tests_and_benchmark,
async_evaluate_candidate_isolated). Rewrites _evaluate_batch_parallel
to use asyncio.Semaphore + asyncio.gather instead of ThreadPoolExecutor.
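
A sketch of the bounded fan-out; the candidate type and result shape are assumptions, and only the Semaphore + gather structure reflects the change described above.

```python
import asyncio

async def evaluate_candidate_isolated(candidate: str) -> dict:
    # Stand-in for the real evaluation coroutine.
    await asyncio.sleep(0)
    return {"candidate": candidate, "passed": True}

async def _evaluate_batch_parallel(candidates: list[str], max_parallel: int = 4) -> list[dict]:
    # The semaphore bounds how many candidates are evaluated at once;
    # gather() collects results in input order. No locks are needed because
    # only one coroutine runs between await points.
    sem = asyncio.Semaphore(max_parallel)

    async def _one(candidate: str) -> dict:
        async with sem:
            return await evaluate_candidate_isolated(candidate)

    return await asyncio.gather(*(_one(c) for c in candidates))
```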
- Remove _line_profiler.py, observability/models.py, _optimizer.py,
_rate_limit.py, _usage.py from tree (never created)
- Add _background.py, _markdown.py, _xml.py that actually exist
- Mark java/ and js_ts/ as stubs
- Update endpoint count from 15 to 14, note log_features stub
- Fix Depends() example to use Annotated[] pattern
- Add deferred items: optimize-line-profiler, observability DB writes
- Replace commented-out code pattern with descriptive comment in __init__.py
- Move ModuleType into TYPE_CHECKING block in _git.py
- Add noqa: F821 for PEP 562 lazy-loaded git module references
- Restore noqa: PLC0415 on reformatted sentry imports in _telemetry.py
Port test generation from Django reference: prompt templates (Jinja2
with model-type-aware formatting), LLM call orchestration with
even/odd model selection, AST-based code validation with regex
fallback, preamble repair, and ellipsis detection. Instrumentation
and postprocessing are deferred — all four response fields return
the same validated code for now.
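
One possible shape of the validation step, as a sketch: the helper name and exact checks are illustrative, but the idea is to try a real parse first, fall back to a regex on fragments, and reject bodies the model left as a bare ellipsis.

```python
import ast
import re

_DEF_RE = re.compile(r"^\s*(?:async\s+)?def\s+\w+\s*\(", re.MULTILINE)

def looks_like_valid_code(code: str) -> bool:
    try:
        tree = ast.parse(code)
    except SyntaxError:
        # Fragment the parser cannot digest: fall back to a cheap regex.
        return bool(_DEF_RE.search(code))
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and len(node.body) == 1:
            stmt = node.body[0]
            if (
                isinstance(stmt, ast.Expr)
                and isinstance(stmt.value, ast.Constant)
                and stmt.value.value is Ellipsis
            ):
                # The model left the body as a bare `...` placeholder.
                return False
    return True
```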
Add fire-and-forget background task manager (background.py) and
LLM call recording (recording.py). Every LLMClient.call now records
trace_id, model, latency, tokens, cost, and errors via fire-and-forget.
drain() awaits pending tasks on shutdown. Currently logs only —
database persistence deferred until llm_calls table is wired.
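
A sketch of the fire-and-forget pattern with illustrative names; the real background.py API may differ.

```python
import asyncio

class BackgroundTasks:
    """Schedule coroutines without awaiting them inline; drain on shutdown."""

    def __init__(self) -> None:
        self._tasks: set[asyncio.Task] = set()

    def fire_and_forget(self, coro) -> None:
        task = asyncio.get_running_loop().create_task(coro)
        # Hold a strong reference until the task finishes, then drop it.
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)

    async def drain(self) -> None:
        # Called from lifespan shutdown so pending recordings are not lost.
        if self._tasks:
            await asyncio.gather(*self._tasks, return_exceptions=True)
```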
12 tests covering all Queries methods against a real PostgreSQL
instance via testcontainers. Automatically skipped when Docker is
unavailable. Tests: API key lookup, last_used update, organization
fetch, subscription CRUD, usage increment, cumulative increments.
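
A sketch of the Docker guard and a session-scoped container, assuming testcontainers' PostgresContainer; the fixture name and image tag are illustrative.

```python
import shutil
import subprocess

import pytest

def _docker_available() -> bool:
    if shutil.which("docker") is None:
        return False
    try:
        subprocess.run(["docker", "info"], capture_output=True, check=True, timeout=10)
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return False

# Module-level mark: every test in this file is skipped without Docker.
pytestmark = pytest.mark.skipif(not _docker_available(), reason="Docker is not available")

@pytest.fixture(scope="session")
def postgres_url():
    # Imported lazily so collection works even without testcontainers installed.
    from testcontainers.postgres import PostgresContainer

    with PostgresContainer("postgres:16") as pg:
        yield pg.get_connection_url()
```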
- Narrow search_start guard in search/replace parser
- Type optimizations_limit as int|str instead of object
- Wrap cost calculation return in float()
- Cast binary op result to int in CST evaluator
- Suppress import-untyped for asyncpg (no stubs available)
- Suppress arg-type for OpenAI messages (dict→union mismatch)
- Type isort kwargs as Any, add Coroutine import for refinement
- Narrow feature_version to tuple[int, int] for ast.parse
- Rename shadowed loop variable in annotation walker
- Add mypy strict=true config to pyproject.toml
FastAPI with `from __future__ import annotations` cannot resolve
Annotated[AuthenticatedUser, Depends()] when AuthenticatedUser is
only imported under TYPE_CHECKING — it becomes a query parameter
instead of a dependency. Move to runtime import in all 11 routers
with noqa: TC001 suppression.
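
A hedged sketch of the fix in one router; the module path, dependency name, and endpoint are illustrative, and only the runtime import plus the noqa: TC001 suppression reflect the change.

```python
from __future__ import annotations

from typing import Annotated

from fastapi import APIRouter, Depends

# Runtime import: with postponed annotations, FastAPI must resolve the
# annotation string when it inspects the signature, so the name cannot live
# only under `if TYPE_CHECKING:`.
from app.auth import AuthenticatedUser, require_auth  # noqa: TC001

router = APIRouter()

@router.get("/example")
async def example(user: Annotated[AuthenticatedUser, Depends(require_auth)]) -> dict:
    return {"user_id": user.id}
```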
30 integration tests cover all endpoints (success, invalid trace_id,
LLM failure, edge cases) using httpx ASGITransport with mocked LLM.
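
One such test might look like the sketch below; the app factory import, the async test marker, and the expected status code are assumptions.

```python
import httpx
import pytest

from app.main import create_app  # hypothetical import path for the app factory

@pytest.mark.anyio
async def test_optimize_rejects_invalid_trace_id() -> None:
    app = create_app()
    transport = httpx.ASGITransport(app=app)  # serve the ASGI app in-process
    async with httpx.AsyncClient(transport=transport, base_url="http://test") as client:
        response = await client.post("/ai/optimize", json={"trace_id": "not-a-uuid"})
    assert response.status_code == 422  # validation error; actual code may differ
```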
Faithful port of the Python optimization pipeline from Django aiservice:
- schemas.py: Pydantic request/response models (OptimizeRequest, OptimizeResponse)
- _markdown.py: markdown code block extraction, splitting, grouping
  (sketched after this list)
- _context.py: BaseOptimizerContext with Single/Multi variants for
prompt assembly, LLM response extraction, and postprocessing
- _pipeline.py: parallel LLM orchestration with model distribution
(GPT-5-mini + Claude Sonnet 4.5), diversity via line profiler toggling
- _router.py: POST /ai/optimize with auth, rate limiting, usage tracking
- 11 prompt templates copied verbatim from Django reference
- LLM client wired into app lifespan
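
A sketch of the extraction step in _markdown.py; the real module also splits and groups blocks, and its exact regex and signature may differ.

```python
import re

# Matches fenced blocks: three backticks, optional language tag, body.
_FENCE_RE = re.compile(r"`{3}(\w*)[ \t]*\n(.*?)`{3}", re.DOTALL)

def extract_code_blocks(markdown: str, language: str = "python") -> list[str]:
    # Return the body of every block tagged with `language` (or untagged).
    return [
        body.strip()
        for lang, body in _FENCE_RE.findall(markdown)
        if lang in ("", language)
    ]
```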
Dual-provider client (Azure OpenAI + Anthropic Bedrock) behind
a common async interface with cache-aware cost calculation and
event-loop-safe client lifecycle.
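
The shared surface might be as small as a Protocol, as in this sketch; the method name and parameters are illustrative.

```python
from typing import Any, Protocol

class LLMProvider(Protocol):
    # Both the Azure OpenAI and Anthropic Bedrock clients satisfy this shape;
    # cost calculation and per-event-loop lifecycle live behind it.
    async def call(self, model: str, messages: list[dict[str, Any]]) -> str: ...
```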
SHA-384 + base64url key hashing matching the JS client. FastAPI
dependencies for require_auth, check_rate_limit, and track_usage
with Annotated[Depends()] pattern. Per-user per-endpoint rate
limiting with employee bypass. Atomic subscription usage tracking
with enterprise org and employee exemptions. DB queries module
with asyncpg raw SQL for auth tables. 27 new tests covering auth
flow, rate limits, usage enforcement, and edge cases.
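
The hashing step reduces to a few lines; this sketch assumes UTF-8 input, and the exact output format must match whatever the JS client produces.

```python
import base64
import hashlib

def hash_api_key(api_key: str) -> str:
    digest = hashlib.sha384(api_key.encode("utf-8")).digest()
    # A 48-byte SHA-384 digest encodes to exactly 64 base64url characters,
    # so no '=' padding appears in the stored hash.
    return base64.urlsafe_b64encode(digest).decode("ascii")
```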
FastAPI app factory with lifespan, CORS, optional Sentry. Pydantic-settings
config for all env vars. Full directory structure for all 15 endpoints per
the architecture doc. Workspace integration: ruff src paths, isort, pytest
testpaths, per-file ignores. aiohttp for production, httpx for test client.
CLAUDE.md with full package structure, layer boundaries, endpoint map,
implementation order, business logic audit, and design decisions.
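
A minimal sketch of the app factory described above, with Sentry and settings wiring omitted; names are illustrative.

```python
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncIterator[None]:
    # Startup: create shared resources (DB pool, LLM client) and stash them
    # on app.state; shutdown code goes after the yield.
    yield

def create_app() -> FastAPI:
    app = FastAPI(lifespan=lifespan)
    app.add_middleware(
        CORSMiddleware,
        allow_origins=["*"],  # real origins come from pydantic-settings config
        allow_methods=["*"],
        allow_headers=["*"],
    )
    return app
```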
Rules: architecture (layer boundaries, model conventions), testing
(coverage requirements, mocking strategy), porting (reference files,
what to port vs skip).
- Use getattr for rootdir/rootpath in discovery worker (pytest 9 compat)
- Add -o addopts= to all pytest invocations to override project config
- Extract _base_pytest_args helper to eliminate duplication across runners
  (sketched after this list)
- Support [tool.pytest] config section (not just [tool.pytest.ini_options])
- Add --dist, --no-flaky-report, --failed-first to BLACKLIST_ADDOPTS
- Add recover=True to XMLParser for malformed JUnit XML tolerance
- Log subprocess stdout/stderr on baseline and candidate test failures
- Friendly warning when GitHub App not installed instead of raw error
- Upgrade repair failure logging from debug to warning
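
A sketch of the shared helper and the rootdir/rootpath shim; the real helper passes more flags, so treat this as illustrative.

```python
import pytest

def _base_pytest_args() -> list[str]:
    # `-o addopts=` overrides addopts from the project's own pytest config
    # so the runner's flags are not duplicated or contradicted.
    return ["-o", "addopts="]

def _project_root(config: pytest.Config):
    # rootdir vs rootpath availability differs across pytest versions
    # (the pytest 9 note above); getattr avoids hard-coding either attribute.
    return getattr(config, "rootpath", None) or getattr(config, "rootdir", None)
```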
_extract_test_input_examples() now includes EXISTING_UNIT_TEST files
(hand-written assertions) in addition to GENERATED_REGRESSION tests.
Existing tests are prioritized since they represent the developer's
explicit behavioral expectations. Extracted a _collect_test_sources()
helper to keep complexity manageable.
When evaluating candidates in an overlay directory, pytest's
--rootdir was set to the overlay cwd, causing classnames in JUnit
XML to be computed relative to the wrong directory. The XML parser
then failed to resolve test files, producing "0 diffs" instead of
detecting real behavioral failures. Pass tests_project_rootdir as
--rootdir so classnames are always relative to the project root.
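
In concrete terms, the candidate invocation might be built as in this sketch; the helper name and surrounding arguments are illustrative, and the rootdir choice is the point.

```python
def candidate_pytest_cmd(tests_project_rootdir: str, junit_xml_path: str, test_files: list[str]) -> list[str]:
    return [
        "pytest",
        # The project root, not the overlay cwd, so JUnit classnames resolve
        # back to the real test files.
        f"--rootdir={tests_project_rootdir}",
        f"--junit-xml={junit_xml_path}",
        *test_files,
    ]
```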
Send baseline_runtime_ns, loop_count, test_input_examples, and
line_profiler_results from the client to the optimize endpoint so
the AI service can generate better-informed candidates. Restructure
the per-function optimizer to establish baseline before candidate
generation, and alternate line profiler data across calls for
diversity.
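
A hedged sketch of how the extra fields might surface on the request schema; the field names come from the description above, while types and defaults are assumptions.

```python
from pydantic import BaseModel

class OptimizeRequest(BaseModel):
    # Existing fields omitted; only the newly sent ones are shown.
    baseline_runtime_ns: int | None = None
    loop_count: int | None = None
    test_input_examples: list[str] | None = None
    line_profiler_results: str | None = None
```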
- Generate regression tests concurrently with ThreadPoolExecutor
  instead of sequentially, cutting testgen wall time roughly in half
  (see the sketch after this list)
- Increase default AIClient timeout from 120s to 300s to match
typical LLM response times for optimization endpoints
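
A sketch of the concurrent generation pattern from the first item; the worker count and function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_all_regression_tests(functions: list[str], generate_one) -> list[str]:
    # Each call is an I/O-bound LLM request, so threads overlap the network
    # waits; map() returns results in submission order.
    with ThreadPoolExecutor(max_workers=min(8, len(functions) or 1)) as pool:
        return list(pool.map(generate_one, functions))
```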
Three fixes to baseline establishment:
- Include zero-nanosecond runtimes in usable_runtime_data_by_test_case
  (use `is not None` instead of a truthiness check; see the sketch after
  this list)
- Check runtime_data dict instead of total_timing scalar for skip decision
- Use module_root directly in PYTHONPATH (not module_root.parent) so
src-layout projects resolve imports correctly in test subprocesses
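
The first fix is small but easy to get wrong; a sketch with an assumed dict shape:

```python
def usable_runtimes(runtime_data: dict[str, int | None]) -> dict[str, int]:
    # A truthiness check (`if runtime_ns`) would silently drop a legitimate
    # 0 ns measurement; the explicit None check keeps it.
    return {test: ns for test, ns in runtime_data.items() if ns is not None}
```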
Server-side instrumentation wrote return values to .bin files, which
became corrupted under concurrent pytest processes (interleaved records →
UnicodeDecodeError). Client-side instrumentation writes to SQLite,
which handles concurrent access safely.
The client now ignores instrumented_behavior_tests and
instrumented_perf_tests from the AI service response and instruments
the plain generated_tests locally using inject_profiling_into_existing_test,
the same path used for discovered existing tests.