The async testgen prompt was steering the LLM toward generating
timing-dependent and ordering-sensitive tests that produce
non-deterministic results across runs. This caused a ~50% E2E failure
rate for the JS ESM async workflow.
- Add determinism requirement: never assert on timing, elapsed
duration, or relative ordering of async side effects (see the sketch
after this list)
- Remove directive to use Promise.all() for large-scale tests
- Change large-scale objective from "concurrent operations" to
"correctness with larger inputs"
- Replace concurrent execution template example with a simple
large-input correctness test
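A minimal illustration of the forbidden vs. required assertion style, in Python for brevity (the failures were in the JS ESM workflow; function names here are hypothetical):

```python
import asyncio
import time


async def fetch_all(items):  # hypothetical function under test
    return await asyncio.gather(*(asyncio.sleep(0, result=i * 2) for i in items))


# Forbidden: asserting on elapsed time makes the verdict depend on
# scheduler load, so it flips between runs.
async def test_fetch_all_timing():
    start = time.monotonic()
    await fetch_all([1, 2, 3])
    assert time.monotonic() - start < 0.05  # flaky


# Required: assert only on returned values, which are stable across runs.
async def test_fetch_all_values():
    assert await fetch_all([1, 2, 3]) == [2, 4, 6]
```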
Add POST /ai/testgen_review and POST /ai/testgen_repair endpoints.
Review accepts per-test data with pre-flagged behavioral failures; the
AI reviews the passing functions for unrealistic patterns and returns
per-function verdicts. Repair takes the flagged functions, has the LLM
rewrite them, re-instruments them, and returns the repaired test
source. Python-only gate.
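A rough sketch of the two handlers, assuming FastAPI and hypothetical request shapes (the real signatures may differ):

```python
from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter()


class ReviewRequest(BaseModel):  # hypothetical request shapes
    tests: list[dict]            # per-test data, behavioral failures pre-flagged


class RepairRequest(BaseModel):
    flagged_functions: list[str]
    test_source: str


@router.post("/ai/testgen_review")
async def testgen_review(req: ReviewRequest) -> dict:
    # The real handler has the AI review passing functions for
    # unrealistic patterns; stubbed here with a fixed verdict.
    return {"verdicts": {t["name"]: "realistic" for t in req.tests}}


@router.post("/ai/testgen_repair")
async def testgen_repair(req: RepairRequest) -> dict:
    # The real handler has the LLM rewrite the flagged functions,
    # re-instruments the result, and returns the repaired source.
    return {"repaired_test_source": req.test_source}
```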
Split the 1,734-line instrument_new_tests.py into three modules by concern:
- device_sync.py: GPU/device framework detection and sync AST generation
- wrapper.py: wrapper function generation, unified inject_logging_code, format_and_float_to_top
- instrument_new_tests.py: core AST transformer (InjectPerfAndLogging) and instrument_test_source
Also extract select_model_for_test() from testgen_python() in generate.py to
separate model selection logic from the HTTP handler.
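The shape of the extraction; the selection criteria and model names below are illustrative, not the real logic:

```python
def select_model_for_test(is_async: bool, is_numerical_code: bool) -> str:
    """Model selection pulled out of the HTTP handler so it can be
    unit-tested on its own."""
    if is_numerical_code:
        return "model-numerical"
    return "model-async" if is_async else "model-default"


# Inside testgen_python(), the handler now only orchestrates:
#   model = select_model_for_test(request.is_async, request.is_numerical_code)
```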
Replace class hierarchy (BaseTestGenContext → Single/Multi) with
standalone functions that branch on is_multi_context() internally.
Delete context.py, move TestGenContextData to models.py, and
distribute logic to validate.py, preprocess_pipeline.py, and
generate.py.
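Illustrative after-state; apart from is_multi_context() and TestGenContextData, the names and fields are hypothetical:

```python
from dataclasses import dataclass, field


@dataclass
class TestGenContextData:  # now lives in models.py
    function_name: str
    helpers: list[str] = field(default_factory=list)  # illustrative fields


def is_multi_context(ctx: TestGenContextData) -> bool:
    return len(ctx.helpers) > 0  # illustrative condition


def build_context_section(ctx: TestGenContextData) -> str:
    # One standalone function replaces BaseTestGenContext and its
    # Single/Multi subclasses; the branch moved inside.
    if is_multi_context(ctx):
        return f"{ctx.function_name} with helpers: {', '.join(ctx.helpers)}"
    return ctx.function_name
```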
Use {% extends %} to deduplicate sync/async system templates via
base_system.md.j2, {% include %} for conditional JIT content, and a
compose_user.md.j2 wrapper to replace Python string assembly in
build_prompt().
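A minimal sketch of the inheritance layout, using an in-memory loader; the block names and template text are illustrative:

```python
from jinja2 import DictLoader, Environment

env = Environment(loader=DictLoader({
    # Shared skeleton; sync/async children override only what differs.
    "base_system.md.j2": (
        "You generate {{ kind }} tests.\n"
        "{% block determinism %}{% endblock %}\n"
        "{% if is_numerical_code %}{% include 'jit.md.j2' %}{% endif %}"
    ),
    "async_system.md.j2": (
        "{% extends 'base_system.md.j2' %}"
        "{% block determinism %}Never assert on timing or ordering.{% endblock %}"
    ),
    "jit.md.j2": "Apply the JIT decorator directly.",
}))

print(env.get_template("async_system.md.j2").render(
    kind="async", is_numerical_code=True))
```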
Move prompts into prompts/ subdirectory with clearer names, rename
testgen.py to generate.py, extract validate.py and demo_hacks.py,
rename testgen_context.py to context.py, delete unused explain prompts.
- Extract shared content into Jinja2 macros (`section`, `field`,
`code_field`) that handle Anthropic XML vs OpenAI markdown wrapping,
eliminating full duplication of every section across both branches
(sketched after this list)
- Tighten system prompt to enforce concise 3-6 sentence output: trim
bloated per-field context descriptions, add concrete positive example,
explicitly forbid section headers and bullet groups, move output_format
to be the last section so constraints are closest to generation
- Add caveat that original_explanation is for factual reference only (in
both system and user prompts) to prevent the model from mimicking its
verbose multi-section format
- Condense throughput/concurrency/acceptance sections to essentials
- Rename misleading `## CRITICAL` heading to `## Acceptance Criteria`
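A stripped-down version of the wrapping macro; the real `section` macro takes more arguments, and the template text is illustrative:

```python
from jinja2 import DictLoader, Environment

env = Environment(loader=DictLoader({
    "macros.j2": (
        # One macro renders the same content as Anthropic XML tags or
        # OpenAI markdown headings, instead of duplicating every section.
        "{% macro section(name, body, model_type) %}"
        "{% if model_type == 'anthropic' %}<{{ name }}>{{ body }}</{{ name }}>"
        "{% else %}## {{ name }}\n{{ body }}{% endif %}"
        "{% endmacro %}"
    ),
    "prompt.j2": (
        "{% from 'macros.j2' import section %}"
        "{{ section('code', code, model_type) }}"
    ),
}))

print(env.get_template("prompt.j2").render(code="def f(): ...", model_type="anthropic"))
print(env.get_template("prompt.j2").render(code="def f(): ...", model_type="openai"))
```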
Extract inline prompts into .md.j2 templates, move schemas to
models.py, and add model_type branching (XML for Anthropic, markdown
for OpenAI) following the testgen pattern. The Jinja2 environment uses
StrictUndefined, trim_blocks, and lstrip_blocks.
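The environment settings named above, in the standard Jinja2 setup (template directory assumed):

```python
from jinja2 import Environment, FileSystemLoader, StrictUndefined

env = Environment(
    loader=FileSystemLoader("prompts"),  # assumed template directory
    undefined=StrictUndefined,  # a missing variable raises instead of rendering ""
    trim_blocks=True,           # drop the newline right after a {% ... %} tag
    lstrip_blocks=True,         # strip indentation before a block tag
)
```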
Reduce the IntersectionObserver threshold list from six entries to one,
remove the backdrop-blur from the sticky header, drop the
opacity/color/maxHeight transitions that fired on every activeIndex
change, and narrow the progress bar to transition-[width].
Display ADAPTIVE source candidates in the timeline with Sparkles icon,
parent candidate linking, and ranking labels. Also fix the backend to
pass call_type, trace_id, and user_id to call_llm for proper
observability logging.
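The backend side of the fix, roughly; call_llm's remaining parameters are omitted and the call_type value is illustrative:

```python
async def rank_source_candidates(messages, trace_id: str, user_id: str):
    # Previously these identifiers were dropped, so the call could not
    # be attributed in observability logs; now they are passed through.
    return await call_llm(
        messages=messages,
        call_type="candidate_ranking",  # illustrative value
        trace_id=trace_id,
        user_id=user_id,
    )
```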
Add markdown code block parsing, detailed syntax error locations with
line/col info, and structured logging to the JavaScript/TypeScript
validators.
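A minimal version of the fence parsing, assuming the LLM wraps its output in ```ts-style markdown fences:

```python
import re

FENCE_RE = re.compile(
    r"```(?:js|jsx|ts|tsx|javascript|typescript)?\s*\n(.*?)```", re.DOTALL
)


def extract_code_blocks(llm_output: str) -> list[str]:
    """Return fenced code block contents, falling back to the raw text
    when the model didn't use fences at all."""
    blocks = [m.group(1) for m in FENCE_RE.finditer(llm_output)]
    return blocks if blocks else [llm_output]
```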
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update type hints for `add_months_safe` and `get_next_subscription_period`
to accept both datetime.datetime and datetime.date, and add a
ty:ignore comment for a Django ORM field type that ty cannot infer
correctly.
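The widened hint, roughly (parameter names are assumptions; datetime.datetime is a subclass of datetime.date, so nothing changes at runtime):

```python
import datetime


def add_months_safe(
    start: datetime.date | datetime.datetime, months: int
) -> datetime.date | datetime.datetime:
    """Hint widened to accept both; the body is elided here."""
    ...
```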
Co-authored-by: Aseem Saxena <aseembits93@users.noreply.github.com>
Auth now attaches fetched organization/subscription to the request so
TrackUsageMiddleware reuses them instead of re-querying. RateLimitMiddleware
caches restricted_paths at init and uses async cache methods. LLM call
recording is fire-and-forget via asyncio.create_task to avoid blocking
responses on DB writes.
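The fire-and-forget pattern, sketched with a stub writer; keeping a reference to the task prevents it from being garbage-collected before the write finishes:

```python
import asyncio

_background_tasks: set[asyncio.Task] = set()


async def record_llm_call(call_data: dict) -> None:
    """Stub for the real DB writer."""


def record_llm_call_nowait(call_data: dict) -> None:
    # Schedule the DB write without awaiting it, so the HTTP response
    # does not block on the insert.
    task = asyncio.create_task(record_llm_call(call_data))
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
```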
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move JIT instructions appending from the per-call level
(optimize_python_code_line_profiler_single) to the endpoint level
(optimize endpoint), matching the regular optimizer's pattern.
This removes the need to thread the is_numerical_code parameter
through the call chain.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When is_numerical_code is true, the LLM sometimes outputs conditional
fallback paths (try/except, if/else) instead of applying the JIT
decorator directly. Add explicit output format instructions to prevent
this behavior.
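The two output shapes, illustrated with numba (the actual JIT framework and decorator options may differ):

```python
# Undesired: the model wraps the decorator in a guarded fallback path.
try:
    from numba import njit

    @njit(cache=True)
    def compute(xs):
        return sum(xs)
except ImportError:
    def compute(xs):
        return sum(xs)


# Desired: the decorator applied directly, no conditional path.
from numba import njit

@njit(cache=True)
def compute(xs):
    return sum(xs)
```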
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Coverage analysis in the Claude pr-review job needs these env vars
to run pytest, matching how django-unit-tests and codeflash-aiservice
workflows configure them.
The ty type checker correctly flags that list[str] is not a subtype of
list[str | None] due to list invariance. Add an explicit cast.
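The shape of the fix; variable and function names are illustrative:

```python
from typing import cast

names: list[str] = ["a", "b"]


def first_non_null(values: list[str | None]) -> str | None:
    # Only reads the list, but the parameter type alone can't promise
    # that: a callee could legally append None to a list[str | None].
    return next((v for v in values if v is not None), None)


# ty correctly rejects first_non_null(names); the cast asserts that
# the read-only usage here is safe.
print(first_non_null(cast(list[str | None], names)))
```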
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The JSON parsing path returned the LLM's explicit ranking array,
which sometimes contradicted its own per-dimension scores. Use
_scores_to_ranking() to compute the ranking from weighted scores
when available, falling back to the LLM ranking only when scores
are absent.
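A plausible shape for the computation; the weighting scheme and field names are assumptions:

```python
def _scores_to_ranking(
    scores: list[dict[str, float]], weights: dict[str, float]
) -> list[int]:
    """Rank candidates by weighted per-dimension scores, best first."""
    totals = [
        sum(weights.get(dim, 1.0) * value for dim, value in s.items())
        for s in scores
    ]
    # argsort, descending: index of the best-scoring candidate first
    return sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)


def final_ranking(llm_ranking: list[int], scores, weights) -> list[int]:
    # Trust the LLM's own ranking array only when scores are absent.
    return _scores_to_ranking(scores, weights) if scores else llm_ranking
```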