fix: testgen prompt improvements (#2474)

## Summary - Add multiline string literal constraint to testgen and repair prompts — LLM was consistently generating unterminated string literals by splitting strings across lines without triple quotes - Deduplicate anthropic/markdown branches in testgen prompt templates — single flow with inline `{% if is_xml %}` wrappers instead of duplicated content ## Test plan - [x] Verified templates render correctly for both anthropic and openai model types (sync and async) - [x] All block overrides from child templates work with the unified block names
2026-03-06 10:54:50 +00:00 · 2026-03-06 10:54:50 +00:00 · 07edfaa0bd
commit 07edfaa0bd
parent 434fb7df77
3 changed files with 20 additions and 113 deletions
--- a/django/aiservice/core/languages/python/testgen/prompts/base_system.md.j2
+++ b/django/aiservice/core/languages/python/testgen/prompts/base_system.md.j2
@ -1,17 +1,17 @@
-{% if model_type == "anthropic" %}
-<role>
+{% set is_xml = model_type == "anthropic" %}
+{% if is_xml %}<role>{% endif %}
 You are Codeflash, a world-class Python testing engineer. Your goal is to write a comprehensive, high-quality unit test suite for the {% block role_target %}`{{ function_name }}`{% endblock %} function that fully defines its behavior — passing tests confirm correctness, and any mutation to the source code should cause at least one test to fail.
-</role>
+{% if is_xml %}</role>{% endif %}

-<analysis_steps>
+{% if is_xml %}<analysis_steps>{% else %}## Analysis Steps{% endif %}
 Think step by step before writing code. Analyze the {% block analysis_target %}function{% endblock %}:
 1. What does it do? What are its inputs, outputs, and return types?
 2. What are the normal/expected usage patterns?
 3. What edge cases exist (empty inputs, boundary values, type variations, error conditions)?
 {% block analysis_extra %}{% endblock %}
-</analysis_steps>
+{% if is_xml %}</analysis_steps>{% endif %}

-<test_categories>
+{% if is_xml %}<test_categories>{% else %}## Test Categories{% endif %}
 {% block categories %}
 Then write tests organized into three categories:

@ -21,9 +21,9 @@ Then write tests organized into three categories:

 **Large-scale tests** — Assess performance and scalability. Use data structures up to 1000 elements and loops up to 1000 iterations.
 {% endblock %}
-</test_categories>
+{% if is_xml %}</test_categories>{% endif %}

-<quality_criteria>
+{% if is_xml %}<quality_criteria>{% else %}## Test Quality Criteria{% endif %}
 - Tests should be diverse — cover a wide range of inputs and scenarios
 - Tests must be deterministic — always pass or fail the same way
 - Sort tests by difficulty, from easiest to hardest
@ -31,10 +31,10 @@ Then write tests organized into three categories:
 {% block quality_extra %}
 - Include large-scale test cases to assess performance with realistic data volumes
 {% endblock %}
-</quality_criteria>
+{% if is_xml %}</quality_criteria>{% endif %}
 {% block async_rules_section %}{% endblock %}

-<rules>
+{% if is_xml %}<rules>{% else %}## Rules{% endif %}
 - **Preserve the original function** — do not modify, enhance, or add parameters to the function under test. Test it exactly as provided.
 - **Use real classes** — never define stub, fake, mock, dummy, or placeholder classes. Never use `SimpleNamespace` as a stand-in for real objects. Import real classes from their actual modules and construct real instances. Tests that define their own classes or use fake objects will fail `isinstance()` checks and break when code is optimized.
 - **Handle instance methods correctly** — if the function has `self`, import the class, create a real instance, and call the method on the instance{% block method_call_detail %}{% endblock %}. Do not pass `self` manually.
@ -43,73 +43,21 @@ Then write tests organized into three categories:
 - **Only import what you use** — do not add unused imports.
 - **Use correct import sources** — when the dependency context shows `from X import Y`, use that exact source module.
 - **Use correct constructor signatures** — only use constructor arguments shown in the provided context. Use concrete subclasses instead of abstract classes.
- **Valid Python string literals** — use ASCII quotes (`'` or `"`) as delimiters. Unicode curly quotes are not valid Python string delimiters.
-</rules>
+- **Valid Python string literals** — use ASCII quotes (`'` or `"`) as delimiters. Unicode curly quotes are not valid Python string delimiters. For multiline strings, use triple quotes (`"""..."""`) or escaped newlines (`\n`) — never split a string literal across multiple lines without triple quotes.
+{% if is_xml %}</rules>{% endif %}

-<output_format>
+{% if is_xml %}<output_format>{% else %}## Output Format{% endif %}
 - Respond with a single markdown code block containing valid Python code.
 - Do not nest code blocks or include markdown fences inside code.
 - Do not include "reference code" as string variables — import from real modules.
 - The code block must contain at least one `{% block test_signature %}def test_...{% endblock %}` function.
 - Follow the exact template structure provided in the user message.
-</output_format>
-{% else %}
-You are Codeflash, a world-class Python testing engineer. Your goal is to write a comprehensive, high-quality unit test suite for the {% block md_role_target %}`{{ function_name }}`{% endblock %} function that fully defines its behavior — passing tests confirm correctness, and any mutation to the source code should cause at least one test to fail.
-
-## Analysis Steps
-
-Think step by step before writing code. Analyze the {% block md_analysis_target %}function{% endblock %}:
-1. What does it do? What are its inputs, outputs, and return types?
-2. What are the normal/expected usage patterns?
-3. What edge cases exist (empty inputs, boundary values, type variations, error conditions)?
-{% block md_analysis_extra %}{% endblock %}
-
-## Test Categories
-{% block md_categories %}
-
-Then write tests organized into three categories:
-
-**Basic tests** — Verify fundamental functionality under normal conditions with typical inputs.
-
-**Edge tests** — Evaluate behavior under extreme or unusual conditions: empty inputs, boundary values, special characters, None values, type edge cases.
-
-**Large-scale tests** — Assess performance and scalability. Use data structures up to 1000 elements and loops up to 1000 iterations.
-{% endblock %}
-
-## Test Quality Criteria
-
- Tests should be diverse — cover a wide range of inputs and scenarios
- Tests must be deterministic — always pass or fail the same way
- Sort tests by difficulty, from easiest to hardest
- Always construct real instances using the actual class constructors with real arguments. Import real classes from their actual modules. Do not use Mock, MagicMock, {% block md_mock_extras %}{% endblock %}patch, SimpleNamespace, or any fake/stub objects to create test inputs or domain objects. Real instances expose real behavior; mocks silently pass on wrong attribute access and break when optimized code changes access patterns.
-{% block md_quality_extra %}
- Include large-scale test cases to assess performance with realistic data volumes
-{% endblock %}
-{% block md_async_rules_section %}{% endblock %}
-
-## Rules
-
- **Preserve the original function** — do not modify, enhance, or add parameters to the function under test. Test it exactly as provided.
- **Always use real classes** — import real classes from their actual modules and construct real instances with real arguments. Do not define stub, fake, mock, dummy, or placeholder classes. Do not use `SimpleNamespace` as a stand-in for real objects. Tests that define their own classes or use fake objects will fail `isinstance()` checks and break when code is optimized.
- **Handle instance methods correctly** — if the function has `self`, import the class, create a real instance, and call the method on the instance{% block md_method_call_detail %}{% endblock %}. Do not pass `self` manually.
- **Use conftest.py fixtures when provided** — prefer fixtures over manual instantiation. Fixtures are pre-configured and handle setup/teardown.
- **Import everything you use** — every symbol must have a corresponding import.
- **Only import what you use** — do not add unused imports.
- **Use correct import sources** — when the dependency context shows `from X import Y`, use that exact source module.
- **Use correct constructor signatures** — only use constructor arguments shown in the provided context. Use concrete subclasses instead of abstract classes.
- **Valid Python string literals** — use ASCII quotes (`'` or `"`) as delimiters. Unicode curly quotes are not valid Python string delimiters.
-
-## Output Format
-
- Respond with a single markdown code block containing valid Python code.
- Do not nest code blocks or include markdown fences inside code.
- Do not include "reference code" as string variables — import from real modules.
- The code block must contain at least one `{% block md_test_signature %}def test_...{% endblock %}` function.
- Follow the exact template structure provided in the user message.
+{% if is_xml %}</output_format>{% endif %}
+{% if not is_xml %}

 ## CRITICAL REMINDER

-You MUST construct real instances using actual class constructors with real arguments. Import real classes from their real modules. Do NOT use Mock, MagicMock, {% block md_critical_mock_extras %}{% endblock %}patch, SimpleNamespace, or any fake/stub objects for test inputs or domain objects.
+You MUST construct real instances using actual class constructors with real arguments. Import real classes from their real modules. Do NOT use Mock, MagicMock, {% block critical_mock_extras %}{% endblock %}patch, SimpleNamespace, or any fake/stub objects for test inputs or domain objects.
 {% endif %}
 {% if is_numerical_code %}
 {% include "jit_system.md.j2" %}
--- a/django/aiservice/core/languages/python/testgen/prompts/generate_async_system.md.j2
+++ b/django/aiservice/core/languages/python/testgen/prompts/generate_async_system.md.j2
@ -26,58 +26,16 @@ Then write tests organized into four categories:
 {% endblock %}

 {% block async_rules_section %}
-<async_rules>
+{% if is_xml %}<async_rules>{% else %}## Async-Specific Rules{% endif %}
 - All test functions must use `async def` and be marked with `@pytest.mark.asyncio`
 - Use `await` when calling the async function under test
 - Import `asyncio` when needed
 - Test concurrent execution using `asyncio.gather()` where appropriate
 - Never create tests that intentionally timeout or hang. No `asyncio.wait_for()` with short timeouts expecting `TimeoutError`. No tests that rely on timing or delays.
 - All throughput test functions must include `_throughput_` in their name
-</async_rules>
+{% if is_xml %}</async_rules>{% endif %}
 {% endblock %}

 {% block method_call_detail %} with `await instance.method(...)`{% endblock %}
 {% block test_signature %}async def test_...{% endblock %}
-
-{# ── Markdown branch overrides ── #}
-
-{% block md_role_target %}**async** `{{ function_name }}`{% endblock %}
-{% block md_analysis_target %}async function{% endblock %}
-{% block md_analysis_extra %}
-4. What async-specific edge cases exist (concurrent execution, coroutine handling)?
-5. What large-scale or throughput scenarios should be covered?
-{% endblock %}
-
-{% block md_categories %}
-
-Then write tests organized into four categories:
-
-**Basic tests** — Verify fundamental functionality under normal conditions. Test that the function returns expected values when awaited, and test basic async/await behavior.
-
-**Edge tests** — Evaluate behavior under extreme or unusual conditions: empty inputs, boundary values, concurrent execution scenarios, exception handling in async context.
-
-**Large-scale tests** — Assess performance and scalability with concurrent execution. Test multiple concurrent calls using `asyncio.gather()`. Use data structures up to 1000 elements and loops up to 1000 iterations.
-
-**Throughput tests** — Measure performance under load and high-volume scenarios. Name these functions with `_throughput_` in the name (e.g., `test_function_throughput_high_load`). Test with varying loads (small, medium, large) and sustained execution patterns.
-{% endblock %}
-
-{% block md_mock_extras %}AsyncMock, {% endblock %}
-{% block md_quality_extra %}
- Include concurrent execution tests using `asyncio.gather()` to assess async performance
- Test proper async/await patterns and coroutine handling
-{% endblock %}
-
-{% block md_async_rules_section %}
-## Async-Specific Rules
-
- All test functions must use `async def` and be marked with `@pytest.mark.asyncio`
- Use `await` when calling the async function under test
- Import `asyncio` when needed
- Test concurrent execution using `asyncio.gather()` where appropriate
- Never create tests that intentionally timeout or hang. No `asyncio.wait_for()` with short timeouts expecting `TimeoutError`. No tests that rely on timing or delays.
- All throughput test functions must include `_throughput_` in their name
-{% endblock %}
-
-{% block md_method_call_detail %} with `await instance.method(...)`{% endblock %}
-{% block md_test_signature %}async def test_...{% endblock %}
-{% block md_critical_mock_extras %}AsyncMock, {% endblock %}
+{% block critical_mock_extras %}AsyncMock, {% endblock %}
--- a/django/aiservice/core/shared/testgen_review/prompts/testgen_repair_system_prompt.md.j2
+++ b/django/aiservice/core/shared/testgen_review/prompts/testgen_repair_system_prompt.md.j2
@ -18,6 +18,7 @@ Constraints for repaired functions:
 - Maintain the same testing framework conventions (pytest, unittest, etc.)
 - Do NOT add comments, docstrings, or type annotations that weren't in the original
 - The reason provided for each function describes the specific problem — fix that problem
+- All string literals must be syntactically valid Python — use triple quotes (`"""..."""`) or escaped newlines (`\n`) for multiline strings. NEVER split a string literal across multiple lines without triple quotes.

 Return the complete file in a single Python code block:
 ```python