# Pull Request Checklist
## Description
- [ ] **Description of PR**: Clear and concise description of what this
PR accomplishes
- [ ] **Breaking Changes**: Document any breaking changes (if
applicable)
- [ ] **Related Issues**: Link to any related issues or tickets
## Testing
- [ ] **Test cases attached**: All relevant test cases have been added or updated
- [ ] **Manual Testing**: Manual testing completed for the changes
## Monitoring & Debugging
- [ ] **Logging in place**: Appropriate logging has been added for
debugging user issues
- [ ] **Sentry will be able to catch errors**: Error handling ensures
Sentry can capture and report errors
- [ ] **Avoid dev-only/Prisma logging**: No development-only or Prisma-specific logging in production code
## Configuration
- [ ] **New environment variables**: Any new environment variables are documented in the .env.example file or mentioned in the description
---
## Additional Notes
<!-- Add any additional context, screenshots, or notes for reviewers
here -->
Co-authored-by: ali <mohammed18200118@gmail.com>
- ruff-format: reformat test file
- fix ty type error: cast mock clients to MagicMock for assert_called_once
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This fixes a critical bug where old AsyncAzureOpenAI and AsyncAnthropicBedrock
clients were not being closed when the event loop changed, causing:
1. Connection pool exhaustion → "couldn't get a connection after 30.00 sec"
2. RuntimeError: Event loop is closed during httpx client cleanup
Root cause:
In LLMClient.call(), when the event loop changed, new clients were created
but old clients were not properly closed, leading to connection leaks.
Fix:
- Added await client.close() for both openai_client and anthropic_client
before creating new instances
- Added comprehensive unit tests to verify proper cleanup
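A minimal sketch of the cleanup order described above (attribute and method names follow this description; the real `LLMClient.call()` does more work around this):
```python
import asyncio
from typing import Any


class LLMClient:
    """Illustrative event-loop-aware client cache, not the real implementation."""

    def __init__(self) -> None:
        self._loop: asyncio.AbstractEventLoop | None = None
        self.openai_client: Any = None      # AsyncAzureOpenAI in production
        self.anthropic_client: Any = None   # AsyncAnthropicBedrock in production

    async def _refresh_clients_if_loop_changed(self) -> None:
        current_loop = asyncio.get_running_loop()
        if current_loop is self._loop:
            return
        # Close the clients bound to the old loop before replacing them;
        # otherwise their httpx connection pools leak and can exhaust the pool
        # ("couldn't get a connection after 30.00 sec").
        for client in (self.openai_client, self.anthropic_client):
            if client is not None:
                try:
                    await client.close()
                except RuntimeError:
                    pass  # the old loop may already be closed
        self._loop = current_loop
        self.openai_client = None      # recreated lazily on the next call
        self.anthropic_client = None
```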
Impact:
- Resolves 150+ test generation failures (500 errors)
- Fixes event loop closure errors in aiservice logs
Trace IDs affected: 04500fbd-88e0-44e4-8d20-32f6a0dc06cc (and many others)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
## Summary
- Convert
`debug_log_sensitive_data(f"...{response.model_dump_json(indent=2)}")`
to `debug_log_sensitive_data_from_callable(lambda: ...)` across 8
endpoint files
- In production, `debug_log_sensitive_data` is a no-op but the f-string
interpolation (including `model_dump_json(indent=2)`) was always
evaluated — serializing the full LLM response to JSON on every call
- The `_from_callable` variant only invokes the lambda when debug
logging is active (non-production)
- **Fix pre-existing bug**: `log_response()` closures in 4 endpoint
files returned `None` instead of a string, causing
`debug_log_sensitive_data_from_callable` to log `None`. Now they return
the concatenated log string as expected by the callable-based API.
Affected endpoints: Python optimizer, line profiler, jit_rewrite, Java
optimizer, Java line profiler, JS/TS optimizer, JS/TS line profiler,
testgen.
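A minimal sketch of the difference, assuming a simplified production flag (the real helpers live in the aiservice logging utilities and take more context):
```python
import os
from collections.abc import Callable

IS_PRODUCTION = os.environ.get("ENV") == "production"  # assumption for this sketch


def debug_log_sensitive_data(message: str) -> None:
    if IS_PRODUCTION:
        return  # no-op, but the caller already paid to build `message`
    print(message)


def debug_log_sensitive_data_from_callable(build_message: Callable[[], str]) -> None:
    if IS_PRODUCTION:
        return  # the callable is never invoked, so nothing is serialized
    print(build_message())


# Before: the f-string (and model_dump_json) runs on every call, even in prod.
#   debug_log_sensitive_data(f"LLM response: {response.model_dump_json(indent=2)}")
# After: serialization happens only when debug logging is active.
#   debug_log_sensitive_data_from_callable(
#       lambda: f"LLM response: {response.model_dump_json(indent=2)}"
#   )
```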
## Test plan
- [x] All 558 unit tests pass
- [x] mypy clean
- [x] ruff clean
- [ ] Verify debug logging still works in non-production environments
---------
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
## Summary
- Replace Pydantic frozen dataclass with stdlib
`@dataclass(frozen=True)` for `CodeExplanationAndID` and
`CodeAndExplanation`, removing `field_validator` that ran `.code` +
`compile()` ~280 times per pipeline run
- Pre-compute `original_module.code` once and pass to pipeline steps
(`clean_extraneous_comments`, `equality_check`) that previously called
it independently
- Replace `ast.dump(annotate_fields=False)` with `ast.unparse` in
`deduplicate_optimizations` (70% faster)
- Skip re-parse in `dedup_and_sort_imports` when isort returns unchanged
code
- Cache comment-stripped original code across candidates in
`clean_extraneous_comments`
**Pipeline median per-run: ~1.5s → 184ms** (4 candidates, controlled
measurement). Saves ~4-5s of CPU per optimization request in production.
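Two of the changes in miniature (fields trimmed; the real models carry more):
```python
import ast
from dataclasses import dataclass


# Stdlib frozen dataclass: immutable and hashable, with no per-instance
# field_validator running compile() on every construction.
@dataclass(frozen=True)
class CodeAndExplanation:
    code: str
    explanation: str


def dedup_key(tree: ast.AST) -> str:
    # ast.unparse gives a canonical source string for deduplication and is
    # markedly cheaper than ast.dump(annotate_fields=False) on large trees.
    return ast.unparse(tree)
```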
## Test plan
- [x] All 558 unit tests pass
- [x] mypy clean
- [x] ruff clean (no new warnings)
- [ ] Verify optimizer endpoints return correct results in staging
---------
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
## Summary
- Move `safe_log_features()` and `update_optimization_cost()` out of
blocking `TaskGroup`s into fire-and-forget background tasks across 4
optimization endpoints (optimizer, optimizer_line_profiler, jit_rewrite,
adaptive_optimizer)
- These DB writes are analytics-only and don't affect response bodies —
waiting for them adds 100-300ms per request unnecessarily
- Add `aiservice/background.py` with `fire_and_forget()` helper using
the same `set` + `add_done_callback` pattern already used in `LLMClient`
- `get_or_create_optimization_event()` remains awaited where the
response needs `event.id`
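A sketch of what the `fire_and_forget()` helper looks like under that description (the strong-reference set prevents tasks from being garbage-collected mid-flight):
```python
import asyncio
from collections.abc import Coroutine
from typing import Any

# Strong references keep the tasks alive until they finish.
_background_tasks: set[asyncio.Task[Any]] = set()


def fire_and_forget(coro: Coroutine[Any, Any, Any]) -> asyncio.Task[Any]:
    task = asyncio.create_task(coro)
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
    return task
```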
## Test plan
- [x] All 550 tests pass locally
- [ ] Verify response latency improvement in production metrics after
deploy
- [ ] Confirm `safe_log_features` and `update_optimization_cost` still
complete successfully in background (check DB records)
---------
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
Vitest tests were failing with "Cannot find module" errors because
`vi.mock()` calls retained `.js` extensions while imports had them
stripped, causing mock/import path mismatch in ESM mode.
## Root Cause
The `strip_js_extensions()` function in `testgen.py` only handled
`jest.mock()` but not `vi.mock()`, which is used by Vitest. The pattern
`_JEST_MOCK_EXTENSION_PATTERN` matched Jest mocking functions but not
Vitest's `vi.*` equivalents.
## Fix
Added `_VITEST_MOCK_EXTENSION_PATTERN` regex to match and strip
extensions from:
- `vi.mock()`
- `vi.doMock()`
- `vi.unmock()`
- `vi.requireActual()`
- `vi.requireMock()`
- `vi.importActual()`
- `vi.importMock()`
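An illustrative version of the pattern and stripping step (the real `_VITEST_MOCK_EXTENSION_PATTERN` in `testgen.py` may be written differently):
```python
import re

_VITEST_MOCK_EXTENSION_PATTERN = re.compile(
    r"""(vi\.(?:mock|doMock|unmock|requireActual|requireMock|importActual|importMock)\(\s*['"])"""
    r"""([^'"]+?)\.(?:js|jsx|ts|tsx)(['"])"""
)


def strip_vitest_mock_extensions(source: str) -> str:
    # vi.mock('../config/paths.js') -> vi.mock('../config/paths'), so the mock
    # path matches the extension-stripped import path in ESM mode.
    return _VITEST_MOCK_EXTENSION_PATTERN.sub(r"\1\2\3", source)
```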
## Affected Trace IDs
- `0fe99c9f-b348-4f0a-b051-0ea9455231ba`
- `127cdaec-a343-4918-a86a-b646dd4d79cf`
- `2b6c896e-20d7-4505-8bf4-e4a2f20b37fc`
These trace IDs exhibited the bug where generated tests had
`vi.mock('../config/paths.js')` but imports had `from
'../config/paths'`, causing module resolution failures.
## Test Coverage
- Added 8 new tests in `TestStripJsExtensions` class
- All 31 tests in `test_testgen_javascript.py` pass
- Specific regression test for vi.mock() extension stripping
- Tests cover all vi.mock variants and edge cases
## Files Changed
- `django/aiservice/core/languages/js_ts/testgen.py` (fix)
- `django/aiservice/tests/testgen/test_testgen_javascript.py` (tests)
---------
Co-authored-by: Codeflash Bot <codeflash-bot@codeflash.ai>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Sarthak Agarwal <sarthak.saga@gmail.com>
## Summary
Fixes 500 Internal Server Error when replaying test generation with
`--rerun` flag and database arrays contain `None`/`NULL` values.
## Root Cause
The `rerun_testgen()` function in `core/shared/replay.py` accessed array
elements without checking if they were `None`. When PostgreSQL arrays
contained `NULL` values (e.g., `generated_test = [NULL, 'test2']`), the
function returned a `TestGenResponseSchema` with `None` values, causing
Pydantic validation to fail:
```
pydantic_core._pydantic_core.ValidationError: 2 validation errors for TestGenResponseSchema
generated_tests
Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
instrumented_behavior_tests
Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
```
## Changes
Added explicit `None` checks before creating `TestGenResponseSchema`:
- If `generated_test[index]` or `instrumented_generated_test[index]` is
`None`, return `None` (skip this test)
- If `instrumented_perf_test[index]` is `None`, default to empty string
(non-critical field)
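A sketch of the guard (schema trimmed to the relevant fields; the perf field name and the real `rerun_testgen()` signature are illustrative):
```python
from pydantic import BaseModel


class TestGenResponseSchema(BaseModel):
    generated_tests: str
    instrumented_behavior_tests: str
    instrumented_perf_tests: str = ""  # field name illustrative


def response_for_index(
    generated_test: list[str | None],
    instrumented_generated_test: list[str | None],
    instrumented_perf_test: list[str | None],
    index: int,
) -> TestGenResponseSchema | None:
    # A NULL entry means this test never completed: skip it instead of letting
    # Pydantic reject the whole response with a 500.
    if generated_test[index] is None or instrumented_generated_test[index] is None:
        return None
    return TestGenResponseSchema(
        generated_tests=generated_test[index],
        instrumented_behavior_tests=instrumented_generated_test[index],
        # The perf test is non-critical, so a NULL entry degrades to "".
        instrumented_perf_tests=instrumented_perf_test[index] or "",
    )
```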
## Impact
Resolves **10+ replay failures** where test generation produced partial
results stored as `NULL` in database arrays.
## Test Coverage
Added comprehensive test suite for `replay.py`:
- `test_rerun_with_valid_test_data()` - Happy path
- `test_rerun_with_none_values_in_arrays()` - **Primary bug fix test**
- `test_rerun_with_index_out_of_bounds()` - Boundary conditions
- `test_rerun_with_empty_arrays()` - Empty data handling
- `test_rerun_with_none_arrays()` - NULL arrays
- `test_rerun_with_mismatched_array_lengths()` - Length mismatches
- `test_rerun_missing_perf_test()` - Missing perf data
All 7 tests pass.
## Trace IDs
This fix addresses errors seen in traces:
- Primary: `056561cc-94af-4d7b-ac79-85dfd4b7282d`
- And 9 additional trace IDs with the same "500 - Error generating
JavaScript tests" error
## Verification
Tested with original failing trace:
```bash
cd /workspace/target && codeflash --file src/daemon/constants.ts --function formatGatewayServiceDescription --rerun 056561cc-94af-4d7b-ac79-85dfd4b7282d
```
**Before fix:** `ERROR: 500 - Traceback... ValidationError: Input should
be a valid string [type=string_type, input_value=None]`
**After fix:** Gracefully skips None entries, no 500 error ✅
---------
Co-authored-by: Codeflash Bot <codeflash-bot@codeflash.ai>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
- **Memory leak fix**: Added explicit `LOGGING` config in `settings.py`
to prevent unbounded `LogRecord` buffering. Django's `django.request`
logger creates WARNING records for 4xx responses with the full
`ASGIRequest` (headers, body, payload) pinned in `args`. Without
explicit config, Django's default handlers and Sentry's
`enable_logs=True` buffer these indefinitely. Setting `django.request`
to ERROR level + removing `enable_logs=True` eliminated the leak — load
testing showed **84% reduction** in per-request memory growth (7.4 → 1.2
KiB/req).
- **Async event loop fix**: Wrapped
`parse_and_generate_candidate_schema()` in `asyncio.to_thread()` across
all 4 async callers (optimizer, optimizer_line_profiler, jit_rewrite,
adaptive_optimizer). This offloads the synchronous libcst parsing +
8-stage postprocessing pipeline to the thread pool, preventing it from
blocking the event loop during peak traffic.
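A minimal sketch of the logging change (handler names and the rest of the production `LOGGING` dict are assumptions); the event loop fix amounts to wrapping each call site as `candidate = await asyncio.to_thread(parse_and_generate_candidate_schema, ...)`.
```python
# settings.py (sketch): cap django.request at ERROR so 4xx WARNING records,
# which pin the full ASGIRequest in their args, are never buffered.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "console": {"class": "logging.StreamHandler"},
    },
    "loggers": {
        "django.request": {
            "handlers": ["console"],
            "level": "ERROR",
            "propagate": False,
        },
    },
}
```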
## Test plan
- [x] All 550 tests pass (`uv run pytest tests/ --ignore=tests/profiling
-x -q`)
- [ ] Monitor Azure memory alerts after deploy — expect significant
reduction in memory growth rate
- [ ] Monitor 5xx error rate during peak traffic — expect reduction from
event loop no longer blocked by sync postprocessing
---------
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
- **`thread_sensitive=False`** on `sync_to_async` so concurrent
`log_features` calls get their own threads instead of serializing
through one (was `True`, causing a bottleneck)
- **Raised DB pool `max_size` from 10 to 100** — prod Postgres allows
859 connections, giving plenty of headroom
- **Added `safe_log_features` wrapper** that catches errors via Sentry
instead of propagating — used at all 9 TaskGroup and bare-await call
sites so a logging failure can't crash an otherwise successful
optimization endpoint
- **Kept `transaction.atomic` + `select_for_update`** for correctness
(Django doesn't support async transactions yet, and removing these
causes lost-update races on dict-merge fields)
## Root cause
`log_features` uses `@sync_to_async` + `@transaction.atomic` because
Django lacks async transaction support. The previous fix for pool
exhaustion changed `thread_sensitive=False` to `True`, which serialized
all calls through a single thread — fixing pool exhaustion but creating
a throughput bottleneck that caused 500s under load. Additionally, 6
call sites used `asyncio.TaskGroup` where any `log_features` exception
would propagate and crash the entire endpoint.
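A sketch of the wrapper (the real `log_features` signature and its ORM work are elided):
```python
import sentry_sdk
from asgiref.sync import sync_to_async


@sync_to_async(thread_sensitive=False)
def log_features(**fields) -> None:
    ...  # the real implementation does the ORM write inside transaction.atomic


async def safe_log_features(**fields) -> None:
    # Feature logging is best-effort: report failures to Sentry rather than let
    # them crash an otherwise successful optimization endpoint.
    try:
        await log_features(**fields)
    except Exception as exc:  # noqa: BLE001
        sentry_sdk.capture_exception(exc)
```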
## Test plan
- [x] `tests/log_features/test_log_features_concurrency.py` — verifies
`thread_sensitive=False` and `safe_log_features` is async
- [x] `ruff check` passes on all changed files
- [ ] Deploy to staging and verify no 500s under concurrent optimization
requests
The optimization hoisted the 70-element `reserved_words` set out of `_is_valid_js_identifier` into a module-level `frozenset`, eliminating 1677 repeated set constructions that consumed 1.79 ms per profiler (42% of that function's time). More significantly, `_detect_export_style` previously compiled six regex patterns on every invocation via f-string interpolation with `escaped_id`; the optimized version pre-compiles generic patterns once at module load and uses `finditer` plus manual identifier comparison, cutting the function's runtime from 3.17 s to 14.7 ms across 1146 calls—a 99.5% reduction that accounts for nearly all of the 10× speedup. Test annotations confirm the largest gains occur in the `test_large_scale_many_class_methods_with_alternating_export_styles` case (107 ms → 4.66 ms), where repeated export detection dominated.
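The hoisting half of that change in miniature (reserved-word list abbreviated, pattern names illustrative):
```python
import re

# Built once at import time instead of ~1677 times per profile; the real set
# holds all ~70 reserved words.
_JS_RESERVED_WORDS = frozenset({
    "break", "case", "class", "const", "default", "export", "function",
    "import", "new", "return", "this", "typeof", "var", "while", "yield",
})

_IDENTIFIER = re.compile(r"[A-Za-z_$][\w$]*")

# Generic export pattern compiled once; the captured identifier is compared to
# the target name in Python instead of interpolating escaped_id into a fresh
# regex on every call.
_EXPORT_DEFAULT_CLASS = re.compile(r"export\s+default\s+class\s+([A-Za-z_$][\w$]*)")


def _is_valid_js_identifier(name: str) -> bool:
    return bool(_IDENTIFIER.fullmatch(name)) and name not in _JS_RESERVED_WORDS


def default_exports_class(source: str, class_name: str) -> bool:
    return any(m.group(1) == class_name for m in _EXPORT_DEFAULT_CLASS.finditer(source))
```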
When generating tests for class methods (e.g., ModulesContainer.getById),
the test generator was incorrectly assuming default import style, generating:
import ModulesContainer from '...'
This caused "Cannot find module" errors when:
1. The class was not exported at all
2. The class used named export (export class X) instead of default export
This fix:
- Adds _detect_export_style() to parse source code and detect actual export style
- Modifies _resolve_import() to use detected export style:
- 'export default class X' → default import
- 'export class X' → named import
- No export → named import (test will fail, surfacing the issue)
- Adds comprehensive unit tests for all scenarios
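A condensed sketch of the export-style-to-import mapping (the real _detect_export_style() and _resolve_import() operate on parsed source, not substring checks):
```python
def build_import(class_name: str, module_path: str, source: str) -> str:
    if f"export default class {class_name}" in source:
        # 'export default class X' -> default import
        return f"import {class_name} from '{module_path}';"
    # 'export class X' (or no export at all) -> named import; a missing export
    # makes the generated test fail loudly instead of masking the problem.
    return f"import {{ {class_name} }} from '{module_path}';"
```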
Affected traces: 12332328-80e8-4bde-bdd6-c76ac373675a, 73ccd4c6-a4f7-467a-8356-5199e9d9b877, 989dcbda-bc27-40b7-aed0-0ab51fd00e6d, and others with ERR_MODULE_NOT_FOUND
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
## Summary
Parallelize independent DB writes at the end of 4 endpoints using
`asyncio.TaskGroup`. With psycopg3 connection pooling (#2489), each task
gets its own connection from the pool.
### Endpoints optimized
| Endpoint | Before | After |
|----------|--------|-------|
| **Refinement** | `log_features` then `update_optimization_cost` | `TaskGroup` (concurrent) |
| **Explanations** | `update_optimization_cost` inside inner fn | Moved to handler, `TaskGroup` with `log_features` |
| **Optimization review** | `update_optimization_cost` inside inner fn | Moved to handler, `TaskGroup` with `update_optimization_features_review` |
| **Ranker** | `update_optimization_cost` inside inner fn | Moved to handler, `TaskGroup` with `log_features` |
Each endpoint saves ~87ms (one DB round-trip) by overlapping two
independent writes.
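A sketch of the epilogue shape shared by these handlers (write functions stubbed; real call sites pass endpoint-specific arguments):
```python
import asyncio


async def log_features(**fields) -> None: ...              # stand-in DB write
async def update_optimization_cost(**fields) -> None: ...  # stand-in DB write


async def finish_request(features: dict, cost: dict) -> None:
    # Both writes are independent of the response body, so they overlap inside
    # one TaskGroup; with psycopg3 pooling each task checks out its own connection.
    async with asyncio.TaskGroup() as tg:
        tg.create_task(log_features(**features))
        tg.create_task(update_optimization_cost(**cost))
```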
### Comprehensive audit
All 13 endpoints were audited — no remaining async antipatterns found:
- No blocking calls in async paths
- No `await`-in-loop patterns
- LLM clients already use connection reuse
- All other endpoints have at most 1 DB write in the epilogue
## Test plan
- [x] All 538 tests passing
- [ ] Verify under load in staging
---------
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Kevin Turcios <KRRT7@users.noreply.github.com>
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
## Summary
The TypeScript validator was rejecting valid JSX/TSX syntax, causing
optimization runs to fail on React components with JSX.
## Problem
The validator was using `tree_sitter_typescript.language_typescript()`
which doesn't parse JSX syntax. This caused validation failures for
`.tsx` files containing JSX elements like:
- `<div className={...} />`
- `{...rest}` (spread props)
- Any JSX tags
## Solution
Changed to use `tree_sitter_typescript.language_tsx()` instead. Since
TSX is a superset of TypeScript, this supports both:
- Plain TypeScript code
- TypeScript with JSX (TSX)
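A small sketch of the parser setup (exact py-tree-sitter calls vary slightly by version; this follows the current bindings):
```python
import tree_sitter_typescript as ts_typescript
from tree_sitter import Language, Parser

# The TSX grammar parses plain TypeScript as well as JSX-bearing .tsx sources.
TSX_LANGUAGE = Language(ts_typescript.language_tsx())
parser = Parser(TSX_LANGUAGE)

tree = parser.parse(b'const App = () => <div className="app" />;')
print(tree.root_node.has_error)  # False: JSX no longer trips the validator
```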
## Testing
Added three new test cases:
- `test_tsx_simple_jsx` - Tests basic JSX elements
- `test_tsx_nested_jsx` - Tests nested JSX
- `test_tsx_with_props_spread` - Tests spread props in JSX
All existing tests continue to pass.
## Impact
This fixes validation errors for all React/JSX components. Affected
trace IDs from logs:
- 5bedfbb7-ccc0-4fdd-b208-60b8b860750c
- 39892d42-774f-4921-80fc-2ee42ff8ae1c
- 80b818b6-e784-4ff8-abda-c3ce6b25422f
- 9b76943e-1a93-45fa-84b9-aae7d6305f79
- d1bac014-d622-4772-90ea-0f9ff88e32dd
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Codeflash Bot <codeflash-bot@codeflash.ai>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Bug: PostgreSQL connection pool timeout (30 seconds)
Root cause: log_features uses @sync_to_async(thread_sensitive=False), causing
each call to grab a separate database connection from the pool. When multiple
optimization requests run concurrently, the pool (max_size=10) exhausts.
Error seen: psycopg_pool.PoolTimeout: couldn't get a connection after 30.00 sec
Fix: Change thread_sensitive=False to thread_sensitive=True. This ensures Django
properly reuses connections across async/sync boundaries instead of allocating
a new connection for each call.
Affected trace IDs from logs:
- a0d8dab6-6524-47dc-9c82-5fa92e6390fb
- 62f5c35b-7161-4ab0-958a-4865231f5188
- ddc0e882-f914-49e4-a2ac-2d5f19a17507
- eaeb0cbe-6474-4808-9092-42f837dd52cf
Testing:
- Added test_log_features_concurrency.py to verify thread_sensitive=True
- Verified reproduction script now passes without pool exhaustion
- All existing tests pass
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
## Summary
- Adds `rerun_trace_id` field to all request schemas (`OptimizeSchema`,
`OptimizeSchemaLP`, `TestGenSchema`, `RefinementRequestSchema`,
`CodeRepairRequestSchema`)
- Creates `core/shared/replay.py` with shared rerun logic that queries
`optimization_features` and returns stored results
- Adds early-return short-circuit to `/optimize`,
`/optimize-line-profiler`, `/testgen`, `/refinement`, `/code_repair` —
bypasses LLM calls when `rerun_trace_id` is provided
- Filters results by `optimizations_origin.source` (OPTIMIZE,
OPTIMIZE_LP, REFINE, REPAIR) and matches by parent optimization ID for
refinement/repair
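The short-circuit in miniature (schema trimmed; `rerun_optimization` stands in for the shared lookup in `core/shared/replay.py`):
```python
from pydantic import BaseModel


class OptimizeSchema(BaseModel):          # trimmed to the field relevant here
    rerun_trace_id: str | None = None


async def rerun_optimization(trace_id: str):
    ...  # queries optimization_features and returns the stored result, if any


async def run_full_optimization(data: OptimizeSchema):
    ...  # the normal LLM-backed path


async def optimize(data: OptimizeSchema):
    # Early return: a rerun_trace_id serves stored results and skips every LLM call.
    if data.rerun_trace_id:
        stored = await rerun_optimization(data.rerun_trace_id)
        if stored is not None:
            return stored
    return await run_full_optimization(data)
```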
## Test plan
- [ ] Run optimization normally to populate `optimization_features` with
a trace_id
- [ ] Rerun with `codeflash --rerun <trace_id>` against local server
- [ ] Verify each endpoint returns stored results without LLM calls
- [ ] Verify backward compatibility — requests without `rerun_trace_id`
behave unchanged
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Sarthak Agarwal <sarthak.saga@gmail.com>
## Summary
- The `/testgen_repair` endpoint was missed when `language_version`
support was added in #2488
- Clients that stopped sending `python_version`
(codeflash-ai/codeflash#1914) hit `400 - Python version is required`
- Adds `language_version` field and `resolve_python_version` validator
to `TestRepairSchema`, matching the pattern in
`OptimizeSchema`/`TestGenSchema`
- Replaces `python_version=data.python_version` with
`language_version=data.language_version` when constructing
`TestGenSchema` in the repair handler
## Test plan
- [ ] Deploy and verify testgen repair calls no longer return 400
- [ ] Verify old clients sending `python_version` still work (backward
compat via validator)
## Summary
- Replaces `psycopg2-binary` with `psycopg[binary,pool]>=3.1.8,<4` for
native async connection pooling
- Sets `conn_max_age=0` (persistent connections don't work correctly
under ASGI) and configures psycopg3's `ConnectionPool` with `min_size=2,
max_size=10`
- Changes `log_features` to use `@sync_to_async(thread_sensitive=False)`
so concurrent calls dispatch to different threads, each getting a
separate connection from the pool
## Why
With psycopg2, all `sync_to_async` calls from the same coroutine shared
a single thread and Postgres connection. Parallel DB calls via
`TaskGroup` were serialized at the connection level — zero improvement.
Benchmarks confirmed that with psycopg3's pool, 5 parallel calls now get
5 different threads and 5 different `pg_backend_pid`s. Parallel
`aexists` operations showed **15.6% improvement** (208ms → 175ms).
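Roughly what the settings change looks like (credentials and the exact pool wiring are assumptions; recent Django versions expose psycopg3's `ConnectionPool` through the `pool` option):
```python
# settings.py (sketch)
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "aiservice",                      # illustrative
        "CONN_MAX_AGE": 0,                        # persistent conns misbehave under ASGI
        "OPTIONS": {
            "pool": {"min_size": 2, "max_size": 10},
        },
    },
}
```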
## Test plan
- [x] 538 tests pass, 0 failures, 12 skipped (pre-existing)
- [x] Ruff lint clean
- [x] Benchmark confirms multiple connections: 5 unique threads, 5
unique pg_pids
- [ ] Verify in staging that connection pool behaves correctly under
sustained load
## Summary
- Add `model_validator` to `OptimizeSchema`, `OptimizeSchemaLP`, and
`TestGenSchema` that resolves `python_version` from `language_version`
when the request language is `"python"`
- Old clients sending `python_version` continue to work unchanged
- New clients sending only `language_version` get `python_version`
populated automatically for downstream handlers
This must be deployed **before** the client-side changes that stop
sending `python_version`
(codeflash-ai/codeflash#remove-python-version-field).
Once all clients are updated, `python_version` can be fully removed from
the schemas and handlers.
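A trimmed sketch of the validator (the real schemas carry many more fields):
```python
from pydantic import BaseModel, model_validator


class OptimizeSchema(BaseModel):
    language: str = "python"
    language_version: str | None = None
    python_version: str | None = None

    @model_validator(mode="after")
    def resolve_python_version(self) -> "OptimizeSchema":
        # New clients send only language_version; keep python_version populated
        # for downstream handlers until it can be removed everywhere.
        if self.python_version is None and self.language == "python":
            self.python_version = self.language_version
        return self
```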
## Test plan
- [x] Existing schema tests pass
(`tests/optimizer/test_javascript_schema.py`)
- [ ] Verify old clients (sending `python_version`) still work after
deploy
- [ ] Verify new clients (sending only `language_version`) resolve
correctly
## Summary
- The `/testgen` response's `generated_tests` field contained the
assert-removed version with `codeflash_output` assignments
- When the CLI's testgen review fell back to this field (instead of
`raw_generated_tests`), the review LLM flagged every test as a "no-op
assignment"
- Now returns the display version (asserts kept, no instrumentation) as
`generated_tests`, matching what the repair endpoint already does
- Also applies isort to the display source for consistency
## Summary
- Use greedy code extraction and retry on syntax errors in testgen
repair
- Fix broken `asyncio.to_thread(log_features)` in Java optimizer —
`log_features` is `@sync_to_async` so calling it via `to_thread` created
an unawaited coroutine (`RuntimeWarning: coroutine
'SyncToAsync.__call__' was never awaited`) and silently skipped logging.
Replaced with `await log_features(...)` using correct keyword arguments.
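The shape of the bug and the fix (keyword names in the call are illustrative):
```python
from asgiref.sync import sync_to_async


@sync_to_async
def log_features(**fields) -> None:
    ...  # synchronous ORM write in the real code


async def handler() -> None:
    # Broken: asyncio.to_thread(log_features, ...) calls the async wrapper in a
    # worker thread, which just returns an un-awaited coroutine
    # ("coroutine 'SyncToAsync.__call__' was never awaited") and logs nothing.

    # Correct: the sync_to_async wrapper is itself awaitable.
    await log_features(optimization_id=1, features={"speedup": 2.0})
```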
## Test plan
- [ ] Verify testgen repair handles syntax errors with retry
- [ ] Verify Java optimization requests no longer emit `SyncToAsync`
RuntimeWarning
- [ ] Verify Java optimization features are correctly logged to DB
## Summary
- Switch `extract_code_block_with_context` (non-greedy `.*?`) →
`extract_code_block` (greedy `.*`) for repair code extraction — the
non-greedy regex matched the first closing fence, truncating code when
the LLM included explanatory snippets before the full file (root cause
of 82% of repair failures)
- Add `ast.parse` validation before CST parsing for fast syntax checking
- Retry the LLM once with the specific syntax error appended to the
conversation when validation fails
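Why the non-greedy pattern truncated output, in miniature (patterns are illustrative, not the actual extraction code):
```python
import re

FENCE = "`" * 3  # literal triple-backtick kept out of the source for readability

_NON_GREEDY = re.compile(FENCE + r"(?:\w+)?\n(.*?)\n" + FENCE, re.DOTALL)
_GREEDY = re.compile(FENCE + r"(?:\w+)?\n(.*)\n" + FENCE, re.DOTALL)

reply = (
    f"Here is the key change:\n{FENCE}python\nx = 1\n{FENCE}\n"
    f"Full file:\n{FENCE}python\nx = 1\ny = 2\n{FENCE}"
)

print(_NON_GREEDY.search(reply).group(1))
# -> 'x = 1'   (stops at the first closing fence, losing the full file)
print(_GREEDY.search(reply).group(1).splitlines()[-1])
# -> 'y = 2'   (the match runs to the last closing fence, so nothing is cut off)
```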
## Test plan
- [x] Existing tests pass
- [ ] Run end-to-end optimization to verify repairs succeed
## Summary
- Add multiline string literal constraint to testgen and repair prompts
— LLM was consistently generating unterminated string literals by
splitting strings across lines without triple quotes
- Deduplicate anthropic/markdown branches in testgen prompt templates —
single flow with inline `{% if is_xml %}` wrappers instead of duplicated
content
## Test plan
- [x] Verified templates render correctly for both anthropic and openai
model types (sync and async)
- [x] All block overrides from child templates work with the unified
block names
## Summary
- Pass coverage details (unexecuted lines, threshold) to review and
repair prompts so the LLM can identify low-coverage tests
- Accept previous repair errors in the repair endpoint and include them
in the prompt for retry cycles
- Parallelize per-test review LLM calls with `asyncio.TaskGroup`
- Conditionally include codeflash env var context
(`CODEFLASH_TRACER_DISABLE`, etc.) in repair prompts when the function
under test references them
## Test plan
- [x] Tested locally with codeflash CLI against `Tracer.__enter__` —
review, repair, and retry cycles all work
- [x] Coverage details and previous errors appear correctly in prompts
- [x] Review parallelization reduces latency from sequential ~60s per
test to concurrent
## Summary
- Switch testgen repair endpoint from `EXECUTE_MODEL` (GPT-5-Mini) to
`HAIKU_MODEL` (Haiku 4.5)
- Matches the review endpoint which already uses Haiku
- Repair is a structured task (splice functions, fix assertions) that
doesn't need a frontier model
- Should reduce latency (was timing out at 90s in CI) and cost
Accept coverage_summary in the review schema and pass it to the prompt.
Add two new review criteria: low coverage detection and constructor/
dependency error patterns. Coverage percentage is shown in the user
prompt so the reviewer can flag tests that don't exercise the function.
Include runtime error messages from behavioral test failures in the
review request. Failed function verdicts now include the specific error
message. The review prompt shows error details so the AI can see
patterns like type validation failures.