## Summary
- The `/testgen` response's `generated_tests` field contained the
assert-removed version with `codeflash_output` assignments
- When the CLI's testgen review fell back to this field (instead of
`raw_generated_tests`), the review LLM flagged every test as a "no-op
assignment"
- Now returns the display version (asserts kept, no instrumentation) as
`generated_tests`, matching what the repair endpoint already does
- Also applies isort to the display source for consistency (see the
sketch below)
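
A minimal sketch of the intended response shape, assuming hypothetical
helper and field names (this is not the actual handler):

```python
import isort

def build_testgen_response(display_source: str, instrumented_source: str) -> dict:
    # Human-readable version: asserts kept, no instrumentation, and isort
    # applied so the source matches what the repair endpoint returns.
    return {
        "generated_tests": isort.code(display_source),
        "instrumented_tests": instrumented_source,
    }
```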
## Summary
- Use greedy code extraction and retry on syntax errors in testgen
repair
- Fix broken `asyncio.to_thread(log_features)` in Java optimizer —
`log_features` is `@sync_to_async` so calling it via `to_thread` created
an unawaited coroutine (`RuntimeWarning: coroutine
'SyncToAsync.__call__' was never awaited`) and silently skipped logging.
Replaced with `await log_features(...)` using the correct keyword
arguments (see the sketch below).
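
A minimal illustration of the bug pattern and the fix, assuming asgiref's
`sync_to_async`; the `log_features` body and arguments are placeholders:

```python
import asyncio
from asgiref.sync import sync_to_async

@sync_to_async
def log_features(language: str) -> None:
    ...  # synchronous DB write (placeholder)

async def handler() -> None:
    # Broken: to_thread calls log_features(...) in a worker thread, but the
    # decorated call returns a coroutine, which is then dropped unawaited:
    #   await asyncio.to_thread(log_features, language="java")
    # Fixed: the decorator already makes the call awaitable.
    await log_features(language="java")

asyncio.run(handler())
```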
## Test plan
- [ ] Verify testgen repair handles syntax errors with retry
- [ ] Verify Java optimization requests no longer emit `SyncToAsync`
RuntimeWarning
- [ ] Verify Java optimization features are correctly logged to DB
## Summary
- Switch `extract_code_block_with_context` (non-greedy `.*?`) →
`extract_code_block` (greedy `.*`) for repair code extraction: the
non-greedy regex stopped at the first closing fence, truncating the code
whenever the LLM included explanatory snippets before the full file (the
root cause of 82% of repair failures)
- Add `ast.parse` validation before CST parsing for fast syntax checking
- Retry the LLM once, appending the specific syntax error to the
conversation, when validation fails (see the sketch after this list)
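
A sketch of the greedy/non-greedy difference and the retry flow; the
regexes are simplified and `retry_once` stands in for the real
conversation retry:

```python
import ast
import re

# Non-greedy .*? stops at the FIRST closing fence; greedy .* spans to the
# LAST one, so an explanatory snippet before the full file no longer
# truncates the extracted code.
NON_GREEDY = re.compile(r"```(?:python)?\n(.*?)\n```", re.DOTALL)
GREEDY = re.compile(r"```(?:python)?\n(.*)\n```", re.DOTALL)

def extract_and_validate(llm_output: str, retry_once) -> str:
    match = GREEDY.search(llm_output)
    code = match.group(1) if match else llm_output
    try:
        ast.parse(code)  # fast syntax check before the heavier CST parse
        return code
    except SyntaxError as err:
        # One retry, with the specific error appended to the conversation.
        return retry_once(f"SyntaxError: {err}")
```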
## Test plan
- [x] Existing tests pass
- [ ] Run end-to-end optimization to verify repairs succeed
## Summary
- Add a multiline string literal constraint to the testgen and repair
prompts: the LLM was consistently generating unterminated string
literals by splitting strings across lines without triple quotes
- Deduplicate the anthropic/markdown branches in the testgen prompt
templates: a single flow with inline `{% if is_xml %}` wrappers instead
of duplicated content (see the toy example after this list)
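
A toy fragment (not the actual prompt file) showing the single-flow
pattern, with the body wrapped in XML tags only for Anthropic models:

```python
from jinja2 import Environment

FRAGMENT = """\
{% if is_xml %}<string_rules>{% else %}## String rules{% endif %}
Never split a string literal across physical lines; use triple-quoted
strings for multiline content.
{% if is_xml %}</string_rules>{% endif %}
"""

env = Environment(keep_trailing_newline=True)
print(env.from_string(FRAGMENT).render(is_xml=True))   # Anthropic: XML tags
print(env.from_string(FRAGMENT).render(is_xml=False))  # OpenAI: markdown header
```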
## Test plan
- [x] Verified templates render correctly for both anthropic and openai
model types (sync and async)
- [x] All block overrides from child templates work with the unified
block names
## Summary
- Pass coverage details (unexecuted lines, threshold) to review and
repair prompts so the LLM can identify low-coverage tests
- Accept previous repair errors in the repair endpoint and include them
in the prompt for retry cycles
- Parallelize per-test review LLM calls with `asyncio.TaskGroup` (see
the sketch after this list)
- Conditionally include codeflash env var context
(`CODEFLASH_TRACER_DISABLE`, etc.) in repair prompts when the function
under test references them
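
A hedged sketch of the fan-out (`review_one_test` is a placeholder for
the actual per-test LLM call; `asyncio.TaskGroup` needs Python 3.11+):

```python
import asyncio

async def review_one_test(test: dict) -> dict:
    ...  # placeholder for the per-test review LLM call

async def review_all(tests: list[dict]) -> list[dict]:
    async with asyncio.TaskGroup() as tg:
        tasks = [tg.create_task(review_one_test(t)) for t in tests]
    # On exit the TaskGroup has awaited every task (or raised a grouped
    # error), so the results are safe to collect.
    return [task.result() for task in tasks]
```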
## Test plan
- [x] Tested locally with codeflash CLI against `Tracer.__enter__` —
review, repair, and retry cycles all work
- [x] Coverage details and previous errors appear correctly in prompts
- [x] Review parallelization cuts latency: tests that previously ran
sequentially at ~60s each are now reviewed concurrently
## Summary
- Switch testgen repair endpoint from `EXECUTE_MODEL` (GPT-5-Mini) to
`HAIKU_MODEL` (Haiku 4.5)
- Matches the review endpoint which already uses Haiku
- Repair is a structured task (splice functions, fix assertions) that
doesn't need a frontier model
- Should reduce both latency (repairs were timing out at 90s in CI) and
cost
Accept coverage_summary in the review schema and pass it to the prompt.
Add two new review criteria: low-coverage detection and
constructor/dependency error patterns. The coverage percentage is shown
in the user prompt so the reviewer can flag tests that don't exercise
the function.
Include runtime error messages from behavioral test failures in the
review request. Failed function verdicts now include the specific error
message. The review prompt shows error details so the AI can see
patterns like type validation failures.
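
A hypothetical shape for the additions; the class and field names are
guesses based on the description, not the actual schema:

```python
from pydantic import BaseModel

class TestFunctionReviewItem(BaseModel):
    test_source: str
    coverage_summary: str | None = None  # e.g. percentage plus unexecuted lines
    runtime_error: str | None = None     # behavioral failure message, if any
```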
Instead of replacing the entire test file with the LLM's output, parse
both the original and repaired sources as CST, extract only the flagged
function nodes from the repair output, and surgically replace them in
the original. Unflagged functions are preserved exactly as-is.
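
A sketch of the splice using libcst, assuming the flagged test names are
known up front; this illustrates the approach rather than the production
helper:

```python
import libcst as cst

def splice_repaired_functions(
    original_src: str, repaired_src: str, flagged: set[str]
) -> str:
    # Index the repaired top-level functions by name.
    repaired = {
        node.name.value: node
        for node in cst.parse_module(repaired_src).body
        if isinstance(node, cst.FunctionDef) and node.name.value in flagged
    }

    class Splicer(cst.CSTTransformer):
        def leave_FunctionDef(
            self, original_node: cst.FunctionDef, updated_node: cst.FunctionDef
        ) -> cst.FunctionDef:
            # Swap in the repaired node only for flagged functions; unflagged
            # functions keep their exact original formatting.
            return repaired.get(original_node.name.value, updated_node)

    return cst.parse_module(original_src).visit(Splicer()).code
```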
Repaired tests from the LLM now go through the same postprocessing
pipeline as initial generation (import fixing, loop limiting, unused
definition removal) before instrumentation. The endpoint returns the
display version (with asserts) as generated_tests for client-side
display.
Split postprocessing_testgen_pipeline to capture the test source before
assert removal — fully cleaned (imports, loops, definitions) but with
original asserts intact. Return it as raw_generated_tests in the
TestGenResponseSchema so the CLI can display the human-readable version.
Deduplicate the identical Environment(FileSystemLoader, StrictUndefined,
keep_trailing_newline=True) setup across JS testgen, Python testgen, and
Python explanations into core/shared/jinja_utils.py.
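
A plausible shape for the shared helper in core/shared/jinja_utils.py;
the function name and signature are inferred from the description:

```python
from jinja2 import Environment, FileSystemLoader, StrictUndefined

def make_jinja_env(template_dir: str) -> Environment:
    return Environment(
        loader=FileSystemLoader(template_dir),
        undefined=StrictUndefined,
        keep_trailing_newline=True,
    )
```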
Also fix tests/testgen/test_testgen_javascript.py, which had a stale
copy of build_javascript_prompt and loaded the now-deleted .md files.
Split _generate_import_statement into _resolve_import (pure logic:
identifier validation, dot splitting, reserved words) and a js_import
Jinja2 macro (pure formatting: ESM vs CJS syntax). The macro lives in
_macros.md.j2 and is imported by user.md.j2.
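
A rough sketch of the pure-logic half; the reserved-word handling and
return shape are guesses based on the description:

```python
# Partial list for illustration only; the real module presumably keeps a
# complete set of JS reserved words.
JS_RESERVED = {"class", "default", "delete", "function", "new"}

def _resolve_import(qualified_name: str) -> tuple[str, str | None]:
    """Validate the identifier, split on dots, reject reserved words."""
    root, _, rest = qualified_name.partition(".")
    if not root.isidentifier() or root in JS_RESERVED:
        raise ValueError(f"cannot generate an import for {qualified_name!r}")
    # (binding to import, attribute path accessed after import)
    return root, rest or None
```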
Replace plain .md prompts rendered with str.format() with Jinja2
templates using {% extends %}, {% block %}, and {% if %} branching:
- model_type branching: XML tags for Anthropic, markdown headers for OpenAI
- module_system support: ESM imports (import { fn } from '...') vs CJS
(require), rendered as in the toy example after this list
- Template inheritance: base_system.md.j2 with sync/async overrides
- Unified user.md.j2 with is_async and module_system conditionals
- Add module_system field to TestGenSchema
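
A toy rendering of the module_system branch; the real templates use
{% extends %} inheritance across *.md.j2 files rather than an inline
string:

```python
from jinja2 import Environment

IMPORT_LINE = (
    "{% if module_system == 'esm' %}"
    "import { {{ fn }} } from '{{ path }}';"
    "{% else %}"
    "const { {{ fn }} } = require('{{ path }}');"
    "{% endif %}"
)

template = Environment().from_string(IMPORT_LINE)
print(template.render(module_system="esm", fn="add", path="./math.js"))
print(template.render(module_system="cjs", fn="add", path="./math.js"))
```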
The async testgen prompt was steering the LLM toward timing-dependent
and ordering-sensitive tests that produce non-deterministic results
across runs, causing a ~50% E2E failure rate for the JS ESM async
workflow.
- Add determinism requirement: never assert on timing, elapsed
duration, or relative ordering of async side effects
- Remove directive to use Promise.all() for large-scale tests
- Change large-scale objective from "concurrent operations" to
"correctness with larger inputs"
- Replace concurrent execution template example with a simple
large-input correctness test
Add POST /ai/testgen_review and POST /ai/testgen_repair endpoints.
Review accepts per-test data with pre-flagged behavioral failures; the
AI reviews the passing functions for unrealistic patterns and returns
per-function verdicts. Repair takes the flagged functions, has the LLM
rewrite them, re-instruments them, and returns the repaired test source.
Both endpoints are gated to Python only.
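
A hypothetical skeleton of the two routes, assuming a FastAPI app; the
real request and response schemas carry far more per-test detail:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ReviewRequest(BaseModel):
    language: str
    tests: list[dict]  # per-test data with pre-flagged behavioral failures

@app.post("/ai/testgen_review")
async def testgen_review(req: ReviewRequest) -> dict:
    if req.language != "python":  # Python-only gate
        return {"verdicts": []}
    ...  # AI review of passing functions -> per-function verdicts
    return {"verdicts": []}
```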
Split the 1,734-line instrument_new_tests.py into three modules by concern:
- device_sync.py: GPU/device framework detection and sync AST generation
- wrapper.py: wrapper function generation, unified inject_logging_code, format_and_float_to_top
- instrument_new_tests.py: core AST transformer (InjectPerfAndLogging) and instrument_test_source
Also extract select_model_for_test() from testgen_python() in generate.py to
separate model selection logic from the HTTP handler.
Replace class hierarchy (BaseTestGenContext → Single/Multi) with
standalone functions that branch on is_multi_context() internally.
Delete context.py, move TestGenContextData to models.py, and
distribute logic to validate.py, preprocess_pipeline.py, and
generate.py.
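
A sketch of the post-refactor shape: standalone functions that branch
internally instead of a class hierarchy (the field and function bodies
are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class TestGenContextData:
    contexts: list[str] = field(default_factory=list)

def is_multi_context(data: TestGenContextData) -> bool:
    return len(data.contexts) > 1

def build_context_block(data: TestGenContextData) -> str:
    # Previously Single/Multi subclasses overrode this behavior; now a
    # single function branches on is_multi_context().
    if is_multi_context(data):
        return "\n\n".join(data.contexts)
    return data.contexts[0] if data.contexts else ""
```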
Use {% extends %} to deduplicate sync/async system templates via
base_system.md.j2, {% include %} for conditional JIT content, and a
compose_user.md.j2 wrapper to replace Python string assembly in
build_prompt().
Move prompts into prompts/ subdirectory with clearer names, rename
testgen.py to generate.py, extract validate.py and demo_hacks.py,
rename testgen_context.py to context.py, delete unused explain prompts.
- Extract shared content into Jinja2 macros (`section`, `field`,
`code_field`) that handle Anthropic XML vs OpenAI markdown wrapping,
eliminating the full duplication of every section across both branches
(see the toy macro after this list)
- Tighten the system prompt to enforce concise 3-6 sentence output:
trim bloated per-field context descriptions, add a concrete positive
example, explicitly forbid section headers and bullet groups, and move
output_format to the last section so constraints sit closest to
generation
- Add caveat that original_explanation is for factual reference only (in
both system and user prompts) to prevent the model from mimicking its
verbose multi-section format
- Condense throughput/concurrency/acceptance sections to essentials
- Rename misleading `## CRITICAL` heading to `## Acceptance Criteria`
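
A toy version of the `section` macro pattern; the real macros
(`section`, `field`, `code_field`) live in the shared prompt template
files:

```python
from jinja2 import Environment

TEMPLATE = """\
{%- macro section(name, body, is_xml) -%}
{%- if is_xml -%}
<{{ name }}>
{{ body }}
</{{ name }}>
{%- else -%}
## {{ name }}
{{ body }}
{%- endif -%}
{%- endmacro -%}
{{ section("acceptance_criteria", "Explain in 3-6 sentences.", is_xml) }}
"""

env = Environment()
print(env.from_string(TEMPLATE).render(is_xml=True))   # Anthropic: XML tags
print(env.from_string(TEMPLATE).render(is_xml=False))  # OpenAI: markdown header
```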