2026-02-14 22:25:30 -05:00

Scenario: All optimization candidates silently disappear

Context

An optimization request for Python code completes without errors but returns zero candidates. The request payload is valid:

{
  "source_code": "def add(a, b):\n    return a + b",
  "trace_id": "trace-456",
  "language": "python",
  "n_candidates": 5
}

The server logs show:

INFO  2026-02-14 11:05:00 llm: call_llm completed for model gpt-4o, received response
INFO  2026-02-14 11:05:01 llm: call_llm completed for model gpt-4o, received response
INFO  2026-02-14 11:05:01 llm: call_llm completed for model claude-sonnet, received response
INFO  2026-02-14 11:05:02 postprocess: deduplicate_optimizations removed 4 of 5 candidates
INFO  2026-02-14 11:05:02 postprocess: equality_check removed 1 of 1 remaining candidates
INFO  2026-02-14 11:05:02 optimizer: 0 candidates after postprocessing
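One way to locate the failing stage from logs like these is to turn each postprocessing line into a before/after count and find the first stage that leaves nothing. This is a minimal sketch: the helper names `stage_counts` and `find_emptying_stage` are hypothetical, and the regex assumes every postprocessing line follows the "<stage> removed X of Y" shape shown above.

```python
import re

# Assumed log shape, taken from the scenario's log lines:
# "postprocess: deduplicate_optimizations removed 4 of 5 candidates"
REMOVED_RE = re.compile(
    r"postprocess: (?P<stage>\w+) removed (?P<removed>\d+) of (?P<total>\d+)"
)


def stage_counts(log_lines):
    """Return (stage, before, after) for every postprocessing log line."""
    counts = []
    for line in log_lines:
        match = REMOVED_RE.search(line)
        if match is None:
            continue  # not a postprocessing line (e.g. llm or optimizer lines)
        total = int(match.group("total"))
        removed = int(match.group("removed"))
        counts.append((match.group("stage"), total, total - removed))
    return counts


def find_emptying_stage(log_lines):
    """Return the first postprocessing stage that leaves zero candidates, or None."""
    for stage, _before, after in stage_counts(log_lines):
        if after == 0:
            return stage
    return None


logs = [
    "INFO  2026-02-14 11:05:01 llm: call_llm completed for model claude-sonnet, received response",
    "INFO  2026-02-14 11:05:02 postprocess: deduplicate_optimizations removed 4 of 5 candidates",
    "INFO  2026-02-14 11:05:02 postprocess: equality_check removed 1 of 1 remaining candidates",
    "INFO  2026-02-14 11:05:02 optimizer: 0 candidates after postprocessing",
]
print(stage_counts(logs))
# [('deduplicate_optimizations', 5, 1), ('equality_check', 1, 0)]
print(find_emptying_stage(logs))  # equality_check
```

The trace shows the two checks working together: deduplication collapses 5 candidates to 1, and the equality check removes the last one.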

The optimization logging table shows optimizations_raw = 5 but optimizations_post = 0.
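The reasoning behind that discrepancy can be expressed as a small decision function. This is a hypothetical helper (`locate_loss` and its return labels are illustrative, not from the codebase), assuming optimizations_raw counts candidates before postprocessing and optimizations_post counts the ones that survive it.

```python
def locate_loss(optimizations_raw: int, optimizations_post: int) -> str:
    """Say where candidates were lost, using only the two logging-table counts."""
    if optimizations_raw == 0:
        # Nothing reached postprocessing, so the problem is upstream (generation).
        return "generation"
    if optimizations_post == 0:
        # Candidates existed, but postprocessing filtered every one of them.
        return "postprocessing (all removed)"
    if optimizations_post < optimizations_raw:
        # Some candidates were filtered; this is normal for dedup/equality checks.
        return "postprocessing (some removed)"
    return "no loss"


print(locate_loss(5, 0))  # postprocessing (all removed)
```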

Task

Diagnose why all candidates were removed. Walk through the optimization pipeline to find the failure stage, explain why this happens for a trivial function like add, and recommend a fix.

  1. Identify which stage removed the candidates.
  2. Explain the two postprocessing checks that reduced 5 candidates to 0.
  3. Explain why optimizations_raw = 5 but optimizations_post = 0 in the logging table.
  4. Recommend what to do when all candidates are removed by postprocessing.

Expected Outputs

  • Identification that the failure is at the postprocessing stage (Step 5 of the debug-optimization-failure workflow).
  • Explanation that deduplicate_optimizations() in core/languages/python/optimizer/postprocess.py uses ast.parse() + ast.dump() to remove candidates with identical ASTs, and equality_check() removes candidates identical to the original code.
  • For a trivial function like add, the LLMs likely return the same code for every candidate (or code identical to the original), so the dedup and equality checks remove everything.
  • The logging discrepancy (optimizations_raw vs optimizations_post) confirms candidates existed before postprocessing but were all filtered.
  • Recommendation: increase n_candidates, improve prompt quality, or adjust dedup thresholds.
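The two checks can be approximated in a few lines to show why add loses every candidate. This is a minimal re-implementation assuming the checks behave exactly as described above; it is not the code in core/languages/python/optimizer/postprocess.py, and the five candidate strings are invented for illustration.

```python
import ast


def normalized(code: str) -> str:
    """AST dump of the code; comments and layout do not affect it."""
    return ast.dump(ast.parse(code))


def deduplicate_optimizations(candidates):
    """Keep only the first candidate for each distinct AST."""
    seen = set()
    unique = []
    for code in candidates:
        key = normalized(code)
        if key not in seen:
            seen.add(key)
            unique.append(code)
    return unique


def equality_check(original, candidates):
    """Drop candidates whose AST matches the original source."""
    original_key = normalized(original)
    return [code for code in candidates if normalized(code) != original_key]


original = "def add(a, b):\n    return a + b"
# Hypothetical LLM outputs for a function with nothing to optimize:
# they differ only in whitespace, comments, or redundant parentheses.
raw = [
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    # already optimal\n    return a + b",
    "def add(a,b):\n    return a+b",
    "def add(a, b):\n    return (a + b)",
]

after_dedup = deduplicate_optimizations(raw)
after_equality = equality_check(original, after_dedup)
print(len(raw), len(after_dedup), len(after_equality))  # 5 1 0
```

Because ast.parse() discards comments and ast.dump() omits line and column positions by default, formatting-only rewrites all map to the same key. That reproduces the 5 → 1 → 0 sequence in the logs: cosmetic edits to add cannot survive either check.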