Kevin Turcios ff2abd29f2 chore: add eval scenarios for codeflash-skills tile

5 scenarios testing: sequential debugging, Result type + effort config,
test patterns, domain type conventions, and deduplication/repair mechanics.
Also adds tessl-labs/tessl-skill-eval-scenarios dev dependency.

2026-02-14 21:24:54 -05:00

825 B


Investigate Low Candidate Diversity

Context

A codeflash user is optimizing a data-processing function at the medium effort level. The AI service returns 5 candidates, but the optimization log shows that only 1 candidate was actually benchmarked. Of the 5 candidates, 1 passed behavioral tests but did not meet the performance threshold. The user wants to understand what happened to the other 4 candidates and why no repair attempts were made.
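As a purely illustrative aid, the sketch below shows one way a set of returned candidates could collapse before benchmarking: candidates that differ only in comments or whitespace are treated as duplicates. This is an assumption made for illustration, not codeflash's actual logic; the helper names `normalize` and `unique_candidates` and the sample candidates are hypothetical.

```python
import ast


def normalize(source: str) -> str:
    """Return a canonical form of the source that ignores comments and formatting.

    Hypothetical helper: ast.dump omits line/column attributes by default,
    so layout-only differences disappear.
    """
    return ast.dump(ast.parse(source))


def unique_candidates(candidates: list[str]) -> list[str]:
    """Keep the first occurrence of each structurally distinct candidate."""
    seen: set[str] = set()
    unique: list[str] = []
    for source in candidates:
        key = normalize(source)
        if key not in seen:
            seen.add(key)
            unique.append(source)
    return unique


# Five made-up candidates that differ only in comments and whitespace.
candidates = [
    "def total(xs):\n    return sum(xs)\n",
    "def total(xs):\n    # use the builtin\n    return sum(xs)\n",
    "def total(xs):\n    return sum( xs )\n",
    "def total(xs):\n\n    return sum(xs)\n",
    "def total(xs): return sum(xs)\n",
]

survivors = unique_candidates(candidates)
print(len(candidates), "->", len(survivors))  # prints "5 -> 1"
```

Under this assumed scheme, five returned candidates yield a single structurally distinct one, which would be the only one left to test and benchmark.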

Task

Write an analysis document explaining:

Why only 1 out of 5 candidates was benchmarked
How the system determines which candidates to actually test
Under what conditions the system would have attempted to repair the failing candidates
What the user could change to get more diverse results

Expected Outputs

A markdown file analysis.md with the explanation.
