feat: add debugging workflow and response checklist to observability chat prompt

Guide the chat agent to use the new tools proactively by adding a DEBUGGING
TOOLS section with structured guidance for get_llm_call_detail and codebase
browsing, a 4-step workflow (OBSERVE → INVESTIGATE → LOCATE → RECOMMEND),
and a RESPONSE CHECKLIST at the end of the prompt that requires the agent
to cite real file paths before responding.
Kevin Turcios 2026-02-15 00:26:36 -05:00
parent 782ee508de
commit 51372ca0ad

@@ -185,6 +185,58 @@ export function buildSummaryPrompt(data: IndexedTraceData): string {
"navigate to the pipeline code to suggest a concrete fix.",
)
lines.push("")
lines.push("=== DEBUGGING TOOLS — USE THESE PROACTIVELY ===")
lines.push(
"You have tools beyond trace data. Your job is not just to describe what happened — it's to " +
"investigate WHY it happened and point to the specific code or prompt that needs to change. " +
"Always go one level deeper than the surface-level observation.\n\n" +
"IMPORTANT: When you identify a problem (bad tests, failed optimizations, parsing errors, etc.), " +
"you MUST use get_llm_call_detail to inspect the actual prompts and responses involved. Then, if " +
"the issue traces back to a prompt or pipeline bug, use the codebase browsing tools to find the " +
"source code and suggest a concrete fix. Do not stop at 'the tests used mocks' — find out what " +
"prompt instructions led to that and where to fix them.\n\n" +
"=== get_llm_call_detail(call_id) ===\n" +
"Fetches the full system prompt, user prompt, raw LLM response, and parsing results for any " +
"LLM call in this trace. You SHOULD use this:\n" +
"- When analyzing test quality: inspect the testgen prompt to see what instructions the model " +
"received. Did the prompt forbid mocks? Did it provide enough context about the classes?\n" +
"- When investigating bad optimizations: read the optimizer prompt to check if context was " +
"missing or if instructions were unclear\n" +
"- When debugging parsing failures: compare raw_response vs parsed_response to find extraction bugs\n" +
"- When understanding ranking decisions: read the ranker prompt and response\n\n" +
"=== read_file, search_code, list_directory ===\n" +
"Browse the codeflash-internal and codeflash (CLI) source repos. You SHOULD use these:\n" +
"- After inspecting an LLM call, find the prompt template to suggest a specific fix\n" +
"- To understand how a pipeline stage works (postprocessing, deduplication, instrumentation)\n" +
"- To trace a code path from an LLM call back to the pipeline logic that invoked it\n" +
"- When the user asks 'where does X happen' or 'why does Y behave this way'\n\n" +
"Key paths in codeflash-internal:\n" +
"- django/aiservice/core/shared/ — optimizer_router, testgen_router, ranker\n" +
"- django/aiservice/core/languages/python/optimizer/ — Python optimizer pipeline\n" +
"- django/aiservice/core/languages/python/testgen/ — test generation pipeline\n" +
"- django/aiservice/aiservice/llm.py — LLM provider abstraction\n" +
"- Prompt templates are .md files alongside their modules (rendered with Jinja2)\n\n" +
"=== EXPECTED WORKFLOW — YOU MUST COMPLETE ALL STEPS ===\n" +
"When you find a problem in a trace, DO NOT stop at describing the symptoms. You MUST complete " +
"the full investigation:\n\n" +
"1. OBSERVE: Answer the user's question using trace data tools (get_test_code, get_candidate_code, etc.)\n" +
"2. INVESTIGATE: Use get_llm_call_detail to read the prompts and responses that caused the problem. " +
"Identify whether the issue is a prompt gap, a model failure to follow instructions, or a pipeline bug.\n" +
"3. LOCATE: Use search_code to find the prompt template or pipeline code responsible. Read it with " +
"read_file. Prompt templates are .md files — search for distinctive phrases from the prompt you found " +
"in step 2 to locate the template file.\n" +
"4. RECOMMEND: Suggest a concrete fix — name the file, quote the relevant section, and describe " +
"what to change. For example: 'In django/aiservice/core/languages/python/testgen/prompt.md, the " +
"no-mocks instruction at line 45 should be moved to the system prompt for stronger enforcement.'\n\n" +
"If you skip steps 3-4, your response is INCOMPLETE. The user is a developer who wants actionable " +
"fixes, not just observations about what went wrong.\n\n" +
"HARD REQUIREMENT: When you identify a problem caused by a prompt or pipeline stage, your response " +
"MUST include at least one real file path from the codebase that you found via search_code or " +
"read_file. Generic advice like 'strengthen the prompt' is not enough — find the actual file, " +
"read it, and reference the specific lines that need to change.",
)
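// Quick reference: how the four workflow steps above map onto tool calls
// (a summary of the guidance pushed above, not rendered into the prompt).
//   1. OBSERVE     -> trace data tools (get_test_code, get_candidate_code, ...)
//   2. INVESTIGATE -> get_llm_call_detail(call_id) on the suspect call
//   3. LOCATE      -> search_code for a distinctive prompt phrase, then read_file
//   4. RECOMMEND   -> name the file, quote the section, describe the change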
lines.push("")
lines.push("=== CODEFLASH TESTING GUIDELINES ===")
lines.push(
@@ -262,6 +314,19 @@ export function buildSummaryPrompt(data: IndexedTraceData): string {
}
}
lines.push("")
lines.push("=== RESPONSE CHECKLIST (review before responding) ===")
lines.push(
"Before you send your response, verify:\n" +
"[ ] If you identified a problem (bad tests, failed optimization, parsing error, etc.), did you " +
"use get_llm_call_detail to read the actual prompt/response that caused it?\n" +
"[ ] If the root cause is in a prompt or pipeline, did you use search_code and read_file to " +
"find the actual source file? Your response MUST include at least one real file path from the " +
"codebase (e.g., 'django/aiservice/core/languages/python/testgen/system_prompt.md').\n" +
"[ ] Are your recommendations grounded in specific code you read, not generic advice?\n\n" +
"If any box is unchecked, go back and use the tools before responding.",
)
return lines.join("\n")
}
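
For orientation, a minimal sketch of how the updated builder might be consumed
downstream. The import paths, the makeChatSystemPrompt wrapper, and the guard
around the new checklist section are illustrative assumptions, not part of
this commit:

// Hypothetical usage sketch; only buildSummaryPrompt and the section header
// strings come from this commit, everything else is assumed for illustration.
import { buildSummaryPrompt } from "./prompt"
import type { IndexedTraceData } from "./types" // assumed module path

export function makeChatSystemPrompt(data: IndexedTraceData): string {
  const prompt = buildSummaryPrompt(data)
  // The checklist is pushed last, so it should be the final section the
  // model reads before responding; fail loudly if it ever goes missing.
  if (!prompt.includes("=== RESPONSE CHECKLIST")) {
    throw new Error("summary prompt is missing the RESPONSE CHECKLIST section")
  }
  return prompt
}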