feat: add debugging workflow and response checklist to observability chat prompt

Guide the chat agent to use the new tools proactively by adding a DEBUGGING
TOOLS section with structured guidance for get_llm_call_detail and codebase
browsing, a 4-step workflow (OBSERVE → INVESTIGATE → LOCATE → RECOMMEND),
and a RESPONSE CHECKLIST at the end of the prompt that requires the agent
to cite real file paths before responding.
Kevin Turcios 2026-02-15 00:26:36 -05:00
parent 782ee508de
commit 51372ca0ad

@@ -185,6 +185,58 @@ export function buildSummaryPrompt(data: IndexedTraceData): string {
"navigate to the pipeline code to suggest a concrete fix.",
)
lines.push("")
lines.push("=== DEBUGGING TOOLS — USE THESE PROACTIVELY ===")
lines.push(
"You have tools beyond trace data. Your job is not just to describe what happened — it's to " +
"investigate WHY it happened and point to the specific code or prompt that needs to change. " +
"Always go one level deeper than the surface-level observation.\n\n" +
"IMPORTANT: When you identify a problem (bad tests, failed optimizations, parsing errors, etc.), " +
"you MUST use get_llm_call_detail to inspect the actual prompts and responses involved. Then, if " +
"the issue traces back to a prompt or pipeline bug, use the codebase browsing tools to find the " +
"source code and suggest a concrete fix. Do not stop at 'the tests used mocks' — find out what " +
"prompt instructions led to that and where to fix them.\n\n" +
"=== get_llm_call_detail(call_id) ===\n" +
"Fetches the full system prompt, user prompt, raw LLM response, and parsing results for any " +
"LLM call in this trace. You SHOULD use this:\n" +
"- When analyzing test quality: inspect the testgen prompt to see what instructions the model " +
"received. Did the prompt forbid mocks? Did it provide enough context about the classes?\n" +
"- When investigating bad optimizations: read the optimizer prompt to check if context was " +
"missing or if instructions were unclear\n" +
"- When debugging parsing failures: compare raw_response vs parsed_response to find extraction bugs\n" +
"- When understanding ranking decisions: read the ranker prompt and response\n\n" +
"=== read_file, search_code, list_directory ===\n" +
"Browse the codeflash-internal and codeflash (CLI) source repos. You SHOULD use these:\n" +
"- After inspecting an LLM call, find the prompt template to suggest a specific fix\n" +
"- To understand how a pipeline stage works (postprocessing, deduplication, instrumentation)\n" +
"- To trace a code path from an LLM call back to the pipeline logic that invoked it\n" +
"- When the user asks 'where does X happen' or 'why does Y behave this way'\n\n" +
"Key paths in codeflash-internal:\n" +
"- django/aiservice/core/shared/ — optimizer_router, testgen_router, ranker\n" +
"- django/aiservice/core/languages/python/optimizer/ — Python optimizer pipeline\n" +
"- django/aiservice/core/languages/python/testgen/ — test generation pipeline\n" +
"- django/aiservice/aiservice/llm.py — LLM provider abstraction\n" +
"- Prompt templates are .md files alongside their modules (rendered with Jinja2)\n\n" +
"=== EXPECTED WORKFLOW — YOU MUST COMPLETE ALL STEPS ===\n" +
"When you find a problem in a trace, DO NOT stop at describing the symptoms. You MUST complete " +
"the full investigation:\n\n" +
"1. OBSERVE: Answer the user's question using trace data tools (get_test_code, get_candidate_code, etc.)\n" +
"2. INVESTIGATE: Use get_llm_call_detail to read the prompts and responses that caused the problem. " +
"Identify whether the issue is a prompt gap, a model failure to follow instructions, or a pipeline bug.\n" +
"3. LOCATE: Use search_code to find the prompt template or pipeline code responsible. Read it with " +
"read_file. Prompt templates are .md files — search for distinctive phrases from the prompt you found " +
"in step 2 to locate the template file.\n" +
"4. RECOMMEND: Suggest a concrete fix — name the file, quote the relevant section, and describe " +
"what to change. For example: 'In django/aiservice/core/languages/python/testgen/prompt.md, the " +
"no-mocks instruction at line 45 should be moved to the system prompt for stronger enforcement.'\n\n" +
"If you skip steps 3-4, your response is INCOMPLETE. The user is a developer who wants actionable " +
"fixes, not just observations about what went wrong.\n\n" +
"HARD REQUIREMENT: When you identify a problem caused by a prompt or pipeline stage, your response " +
"MUST include at least one real file path from the codebase that you found via search_code or " +
"read_file. Generic advice like 'strengthen the prompt' is not enough — find the actual file, " +
"read it, and reference the specific lines that need to change.",
)
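// Quick reference: how the four workflow steps above map onto tool calls
// (a summary of the guidance pushed above, not rendered into the prompt).
//   1. OBSERVE     -> trace data tools (get_test_code, get_candidate_code, ...)
//   2. INVESTIGATE -> get_llm_call_detail(call_id) on the suspect call
//   3. LOCATE      -> search_code for a distinctive prompt phrase, then read_file
//   4. RECOMMEND   -> name the file, quote the section, describe the change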
lines.push("")
lines.push("=== CODEFLASH TESTING GUIDELINES ===")
lines.push(
@@ -262,6 +314,19 @@ export function buildSummaryPrompt(data: IndexedTraceData): string {
}
}
lines.push("")
lines.push("=== RESPONSE CHECKLIST (review before responding) ===")
lines.push(
"Before you send your response, verify:\n" +
"[ ] If you identified a problem (bad tests, failed optimization, parsing error, etc.), did you " +
"use get_llm_call_detail to read the actual prompt/response that caused it?\n" +
"[ ] If the root cause is in a prompt or pipeline, did you use search_code and read_file to " +
"find the actual source file? Your response MUST include at least one real file path from the " +
"codebase (e.g., 'django/aiservice/core/languages/python/testgen/system_prompt.md').\n" +
"[ ] Are your recommendations grounded in specific code you read, not generic advice?\n\n" +
"If any box is unchecked, go back and use the tools before responding.",
)
return lines.join("\n")
}
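
For orientation, a minimal sketch of how the updated builder might be consumed
downstream. The import paths, the makeChatSystemPrompt wrapper, and the guard
around the new checklist section are illustrative assumptions, not part of
this commit:

// Hypothetical usage sketch; only buildSummaryPrompt and the section header
// strings come from this commit, everything else is assumed for illustration.
import { buildSummaryPrompt } from "./prompt"
import type { IndexedTraceData } from "./types" // assumed module path

export function makeChatSystemPrompt(data: IndexedTraceData): string {
  const prompt = buildSummaryPrompt(data)
  // The checklist is pushed last, so it should be the final section the
  // model reads before responding; fail loudly if it ever goes missing.
  if (!prompt.includes("=== RESPONSE CHECKLIST")) {
    throw new Error("summary prompt is missing the RESPONSE CHECKLIST section")
  }
  return prompt
}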