chore: improve skills to 100% review score and bump to v0.2.0

- Add trigger hints and code snippets to both skills
- Add checkpoints after each step
- Extract module reference and troubleshooting into linked files
- Bump codeflash-skills tile to 0.2.0
Kevin Turcios 2026-02-14 21:07:24 -05:00
parent 6718e66582
commit 18ad00be59
6 changed files with 173 additions and 72 deletions

@ -71,7 +71,7 @@
"version": "0.1.0"
},
"codeflash/codeflash-skills": {
"version": "0.1.0"
"version": "0.2.0"
}
}
}

@ -0,0 +1,13 @@
# Module Reference
| Feature area | Primary module | Key files |
|-------------|----------------|-----------|
| New optimization strategy | `optimization/` | `function_optimizer.py`, `optimizer.py` |
| New test type | `verification/`, `models/` | `test_runner.py`, `pytest_plugin.py`, `test_type.py` |
| New AI service endpoint | `api/` | `aiservice.py` |
| New language support | `languages/` | Create new `languages/<lang>/support.py` |
| Context extraction change | `context/` | `code_context_extractor.py` |
| New CLI command | `cli_cmds/` | `cli.py` |
| New config option | `setup/`, `code_utils/` | `config_consts.py`, `setup/detector.py` |
| Discovery filter | `discovery/` | `functions_to_optimize.py` |
| PR/result changes | `github/`, `result/` | Relevant handlers |

@ -1,27 +1,23 @@
---
name: add-codeflash-feature
description: Step-by-step workflow for adding a new feature to the codeflash codebase
description: >
Guides implementation of new functionality in the codeflash optimization engine.
Use when adding a feature, building new functionality, implementing a new
optimization strategy, adding a language backend, creating an API endpoint,
extending the verification pipeline, or developing any new codeflash capability.
Covers module identification, Result type patterns, config, types, tests, and
quality checks.
---
# Add Codeflash Feature
Use this workflow when implementing a new feature in the codeflash codebase.
Use this workflow when implementing new functionality in the codeflash codebase — new optimization strategies, language backends, API endpoints, CLI commands, config options, or pipeline extensions.
## Step 1: Identify Target Modules
Determine which module(s) need modification based on the feature:
Determine which module(s) need modification. See [MODULE_REFERENCE.md](MODULE_REFERENCE.md) for the full mapping of feature areas to modules and key files.
| Feature area | Primary module | Key files |
|-------------|----------------|-----------|
| New optimization strategy | `optimization/` | `function_optimizer.py`, `optimizer.py` |
| New test type | `verification/`, `models/` | `test_runner.py`, `pytest_plugin.py`, `test_type.py` |
| New AI service endpoint | `api/` | `aiservice.py` |
| New language support | `languages/` | Create new `languages/<lang>/support.py` |
| Context extraction change | `context/` | `code_context_extractor.py` |
| New CLI command | `cli_cmds/` | `cli.py` |
| New config option | `setup/`, `code_utils/` | `config_consts.py`, `setup/detector.py` |
| Discovery filter | `discovery/` | `functions_to_optimize.py` |
| PR/result changes | `github/`, `result/` | Relevant handlers |
**Checkpoint**: Read the target files and understand existing patterns before writing any code. Look for similar features already implemented as reference.
## Step 2: Follow Result Type Pattern
@ -43,33 +39,76 @@ if not is_successful(result):
value = result.unwrap()
```
**Checkpoint**: Verify your function signatures match the `Result` pattern used in surrounding code. Not all functions use `Result` — match the convention of the module you're modifying.
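For reference, a minimal sketch of a function written in this style, using the `returns` library's `Success`/`Failure` constructors (the function name and error messages below are hypothetical):
```python
from returns.result import Failure, Result, Success


def parse_threshold(raw: str) -> Result[float, str]:
    """Parse a threshold value, returning Failure with a message on bad input."""
    try:
        value = float(raw)
    except ValueError:
        return Failure(f"not a number: {raw!r}")
    if value <= 0:
        return Failure("threshold must be positive")
    return Success(value)
```
Callers then check the outcome with `is_successful()` and `.unwrap()` exactly as shown above.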
## Step 3: Add Configuration Constants
If the feature needs configurable thresholds or limits:
1. Add constants to `code_utils/config_consts.py`
2. If effort-dependent, add to `EFFORT_VALUES` dict with values for `LOW`, `MEDIUM`, `HIGH`
3. Add a corresponding `EffortKeys` enum entry
4. Access via `get_effort_value(EffortKeys.MY_KEY, effort_level)`
2. If effort-dependent, add to `EFFORT_VALUES` dict with values for all three levels:
```python
# In config_consts.py:
class EffortKeys(str, Enum):
    MY_NEW_KEY = "MY_NEW_KEY"

EFFORT_VALUES: dict[str, dict[EffortLevel, Any]] = {
    # ... existing entries ...
    EffortKeys.MY_NEW_KEY.value: {
        EffortLevel.LOW: 1,
        EffortLevel.MEDIUM: 3,
        EffortLevel.HIGH: 5,
    },
}
```
3. Access via `get_effort_value(EffortKeys.MY_NEW_KEY, effort_level)`
**Checkpoint**: Skip this step if the feature doesn't need configuration. Not every feature requires new constants.
## Step 4: Add Domain Types
If new data structures are needed:
1. Add Pydantic models or frozen dataclasses to `models/models.py` or `models/function_types.py`
2. Use `@dataclass(frozen=True)` for immutable data
3. Use `BaseModel` for models that need serialization
4. Keep `function_types.py` dependency-free (no imports from other codeflash modules)
2. Use `@dataclass(frozen=True)` for immutable data, `BaseModel` for models that need serialization
3. Keep `function_types.py` dependency-free — no imports from other codeflash modules
Example following existing patterns:
```python
# In models/models.py:
@dataclass(frozen=True)
class MyNewType:
    name: str
    value: int
    source: OptimizedCandidateSource

# For serializable models:
class MyNewModel(BaseModel):
    items: list[MyNewType] = []
```
**Checkpoint**: Skip this step if you can reuse existing types. Check `models/models.py` for types that already fit your needs.
## Step 5: Write Tests
Follow existing test patterns:
1. Create test files in the `tests/` directory mirroring the source structure
2. Use pytest's `tmp_path` fixture for temp directories
3. Always call `.resolve()` on Path objects
1. Create test files in `tests/` mirroring the source structure (e.g., `tests/test_optimization/test_my_feature.py`)
2. Use pytest's `tmp_path` fixture for temp directories — never `NamedTemporaryFile`
3. Always call `.resolve()` on Path objects and `.as_posix()` for string conversion
4. Assert full string equality for code context tests — no substring matching
5. Remember the pytest plugin patches `time`, `random`, `uuid`, `datetime` — don't rely on real values
5. The pytest plugin patches `time`, `random`, `uuid`, `datetime` — never rely on real values in verification tests
```python
def test_my_feature(tmp_path: Path) -> None:
    test_file = tmp_path / "test_module.py"
    test_file.write_text("def foo(): return 1", encoding="utf-8")
    result = my_operation(test_file.resolve())
    assert is_successful(result)
    assert result.unwrap() == expected_value
```
**Checkpoint**: Run the new tests in isolation before proceeding: `uv run pytest tests/path/to/test_file.py -x`
## Step 6: Run Quality Checks
@ -86,11 +125,22 @@ uv run mypy codeflash/
uv run pytest tests/path/to/relevant/tests -x
```
**If checks fail**:
- `prek run` failures: Fix formatting/lint issues reported by ruff, then re-run
- `mypy` failures: Fix type errors — common issues are missing return types, wrong `Optional` usage, or missing imports in `TYPE_CHECKING` block
- Test failures: Fix the failing test or the implementation, then re-run
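For the `TYPE_CHECKING` fix mentioned above, the usual pattern looks like this (the imported name and module path are illustrative, following the layout described in Step 4):
```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only imported for type checkers, so no runtime circular import
    from codeflash.models.models import MyNewType


def describe(item: MyNewType) -> str:
    return f"{item.name}={item.value}"
```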
## Step 7: Language Support Considerations
If the feature needs to work across languages:
1. Check if the feature uses language-specific APIs — use `get_language_support(identifier)` from `languages/registry.py`
1. Use `get_language_support(identifier)` from `languages/registry.py` — never import language classes directly
2. Current language is a singleton: `set_current_language()` / `current_language()` from `languages/current.py`
3. Use `is_python()` / `is_javascript()` guards for language-specific branches
4. New language support classes must use `@register_language` decorator
4. New language support classes must use `@register_language` decorator and be instantiable without arguments
**Checkpoint**: Skip this step if the feature is Python-only. Most features don't need multi-language support.
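A rough sketch of a language-aware branch, assuming the helpers are importable from the modules named above (the `languages.current` / `languages.registry` import paths and the `"javascript"` identifier are assumptions):
```python
from codeflash.languages.current import is_python
from codeflash.languages.registry import get_language_support

if is_python():
    ...  # Python-only fast path
else:
    # Resolve the support class through the registry instead of importing it directly
    support = get_language_support("javascript")
    ...  # delegate language-specific work to `support`
```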
## Troubleshooting
If you run into issues, see [TROUBLESHOOTING.md](TROUBLESHOOTING.md) for common problems and fixes (circular imports, `UnsupportedLanguageError`, CI path failures, Pydantic validation errors, token limit exceeded).

@ -0,0 +1,9 @@
# Troubleshooting
| Problem | Likely cause | Fix |
|---------|-------------|-----|
| Circular import at startup | Importing from `models/` in a module loaded early | Move import into `TYPE_CHECKING` block or use lazy import |
| `UnsupportedLanguageError` | Language modules not registered yet | Call `_ensure_languages_registered()` or use `get_language_support()` which does it automatically |
| Tests pass locally but fail in CI | Path differences (absolute vs relative) | Always use `.resolve()` on Path objects |
| `ValidationError` from Pydantic | Invalid code passed to `CodeString` | Check that generated code passes syntax validation for the target language |
| `encoded_tokens_len` exceeds limit | Context too large | Reduce helper functions or split into read-only vs read-writable |

@ -1,6 +1,10 @@
---
name: debug-optimization-failure
description: Debug why a codeflash optimization failed at any pipeline stage
description: >
Diagnose why a codeflash optimization produced no results or failed silently.
Use when an optimization run errors out, returns no candidates, or all candidates
are rejected. Walks through discovery, ranking, context limits, AI service,
test verification, deduplication, and repair stages.
---
# Debug Optimization Failure
@ -11,85 +15,110 @@ Use this workflow when an optimization run fails or produces no results. Work th
Determine if the function was discovered by `FunctionVisitor`.
1. Look at the discovery output or logs for the function name
2. Check `discovery/functions_to_optimize.py` — the `FunctionVisitor` filters out:
- Functions that are too small or trivial
- Functions matching exclude patterns in config
- Functions already optimized (`was_function_previously_optimized()`)
3. Verify the function file is under the configured `module-root`
1. Search logs for the function name in discovery output:
```python
# In discovery/functions_to_optimize.py, FunctionVisitor filters out:
# - Functions matching exclude patterns in pyproject.toml [tool.codeflash]
# - Functions already optimized (was_function_previously_optimized())
# - Functions outside the configured module-root
```
2. Verify the function file is under the configured `module-root` in `pyproject.toml`
3. Check if the function was previously optimized — look for it in the optimization history
**If not discovered**: Check config patterns, file location, and function size.
**Checkpoint**: If the function doesn't appear in discovery output, fix config patterns or file location before proceeding.
## Step 2: Check Ranking
If trace data is used, check if the function was ranked high enough.
1. Look at `benchmarking/function_ranker.py` output
2. The function's **addressable time** must exceed `DEFAULT_IMPORTANCE_THRESHOLD=0.001`
3. Addressable time = own time + callee time / call count
1. Look at `benchmarking/function_ranker.py` output for the function's addressable time
2. The function must exceed `DEFAULT_IMPORTANCE_THRESHOLD=0.001`:
```python
# Addressable time = own time + callee time / call count
# Grep for the function in ranking output:
# grep -i "function_name" in ranking logs
```
3. Functions below the threshold are silently skipped
**If ranked too low**: The function doesn't spend enough time to be worth optimizing.
**Checkpoint**: If ranked too low, the function doesn't spend enough time to be worth optimizing. No fix needed — this is expected.
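A quick worked example of the addressable-time arithmetic above (all numbers are made up; units are whatever the ranker reports):
```python
# Hypothetical profile figures for one function
own_time = 0.0004
callee_time = 0.0030
call_count = 10

DEFAULT_IMPORTANCE_THRESHOLD = 0.001
addressable_time = own_time + callee_time / call_count  # 0.0004 + 0.0003 = 0.0007

# 0.0007 is below the threshold, so this function would be silently skipped
worth_ranking = addressable_time > DEFAULT_IMPORTANCE_THRESHOLD
```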
## Step 3: Check Context Token Limits
Verify the function's context fits within token limits.
1. Check `OPTIMIZATION_CONTEXT_TOKEN_LIMIT=16000` and `TESTGEN_CONTEXT_TOKEN_LIMIT=16000` in `code_utils/config_consts.py`
2. Token counting is done by `encoded_tokens_len()` in `code_utils/code_utils.py`
3. Large helper function chains or deep dependency trees can blow the limit
1. Check thresholds in `code_utils/config_consts.py`:
```python
OPTIMIZATION_CONTEXT_TOKEN_LIMIT = 16000 # tokens
TESTGEN_CONTEXT_TOKEN_LIMIT = 16000 # tokens
```
2. Token counting uses `encoded_tokens_len()` from `code_utils/code_utils.py`
3. Common causes: large helper function chains, deep dependency trees, large class hierarchies
**If context too large**: The function has too many dependencies. Consider refactoring to reduce context size.
**Checkpoint**: If context exceeds limits, the function is rejected. Consider refactoring to reduce dependencies or splitting large modules.
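To reproduce the check by hand, something along these lines should work, assuming `encoded_tokens_len()` takes the context string and returns a token count (import paths are inferred from the module names above, and `context_str` stands in for the extracted context):
```python
from codeflash.code_utils.code_utils import encoded_tokens_len
from codeflash.code_utils.config_consts import OPTIMIZATION_CONTEXT_TOKEN_LIMIT

context_str = "..."  # the extracted optimization context, assembled elsewhere
if encoded_tokens_len(context_str) > OPTIMIZATION_CONTEXT_TOKEN_LIMIT:
    print("context rejected: over the optimization token limit")
```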
## Step 4: Check AI Service Response
Verify the AI service returned valid candidates.
1. Check logs for `AiServiceClient` request/response
2. Look for HTTP errors (non-200 status codes)
3. Verify `_get_valid_candidates()` parsed the response — empty `code_strings` means invalid markdown code blocks
4. Check if all candidates were filtered out during parsing
1. Look for HTTP errors in logs:
```
# Error patterns to search for:
"Error generating optimized candidates"
"Error generating jit rewritten candidate"
"cli-optimize-error-caught"
"cli-optimize-error-response"
```
2. Check `_get_valid_candidates()` in `api/aiservice.py` — empty `code_strings` after `CodeStringsMarkdown.parse_markdown_code()` means the LLM returned malformed code blocks
3. Verify API key is valid (`get_codeflash_api_key()`)
**If no candidates returned**: Check API key, network connectivity, and service status.
**Checkpoint**: If no candidates returned, check API key, network, and service status before proceeding.
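If you have the raw LLM response from the logs, a quick parse check is possible, assuming `parse_markdown_code()` is a classmethod that takes the markdown text (the exact signature and import location are assumptions):
```python
from codeflash.models.models import CodeStringsMarkdown  # import location assumed

raw_response = "..."  # markdown body captured from the AI service logs
parsed = CodeStringsMarkdown.parse_markdown_code(raw_response)
if not parsed.code_strings:
    print("no valid fenced code blocks in the response")
```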
## Step 5: Check Test Failures
Determine if candidates failed behavioral or benchmark tests.
1. **Behavioral failures**: Compare return values, stdout, pass/fail status between original baseline and candidate
- Check `TestDiffScope`: `RETURN_VALUE`, `STDOUT`, `DID_PASS`
- Look at JUnit XML results for specific test failures
2. **Benchmark failures**: Check if candidate met `MIN_IMPROVEMENT_THRESHOLD=0.05` (5% speedup)
3. **Stability failures**: Check if timing was stable within `STABILITY_WINDOW_SIZE=0.35`
1. **Behavioral failures** — compare return values, stdout, pass/fail between baseline and candidate:
```python
# TestDiffScope enum values to look for:
# RETURN_VALUE - function returned different value
# STDOUT - different stdout output
# DID_PASS - test passed/failed differently
```
2. **Benchmark failures** — candidate must beat `MIN_IMPROVEMENT_THRESHOLD=0.05` (5% speedup)
3. **Stability failures** — timing must be stable within `STABILITY_WINDOW_SIZE=0.35` (35% of iterations)
4. Check JUnit XML test results in the temp directory for specific failure messages
**If behavioral failure**: The optimization changed the function's behavior. Check test diffs for specific mismatches.
**If benchmark failure**: The optimization didn't provide enough speedup.
**Checkpoint**: Behavioral failure = optimization changed behavior (check test diffs). Benchmark failure = not fast enough. Stability failure = noisy timing environment.
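A worked example of the benchmark gate, with made-up timings and assuming the improvement is computed as a simple relative speedup over the baseline:
```python
MIN_IMPROVEMENT_THRESHOLD = 0.05

original_runtime_ns = 100_000  # hypothetical baseline timing
candidate_runtime_ns = 94_000  # hypothetical candidate timing

improvement = (original_runtime_ns - candidate_runtime_ns) / original_runtime_ns  # 0.06

# 6% beats the 5% threshold, so this candidate clears the benchmark gate
passes_benchmark = improvement > MIN_IMPROVEMENT_THRESHOLD
```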
## Step 6: Check Deduplication
Verify candidates weren't deduplicated away.
1. `CandidateEvaluationContext.ast_code_to_id` tracks normalized code → candidate mapping
2. `normalize_code()` from `code_utils/deduplicate_code.py` normalizes AST for comparison
3. If all candidates normalize to the same code, only one is actually tested
1. `CandidateEvaluationContext.ast_code_to_id` tracks normalized AST → candidate mapping
2. `normalize_code()` from `code_utils/deduplicate_code.py` strips comments/whitespace and normalizes the AST
3. If all candidates normalize to identical code, only the first is tested — the rest copy its results
**If all duplicates**: The LLM generated the same optimization multiple times. Try higher effort level.
**Checkpoint**: If all duplicates, the LLM generated the same optimization repeatedly. Try a higher effort level for more diverse candidates.
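A minimal illustration of the deduplication check, assuming `normalize_code()` takes a source string and returns a comparable normalized form (import path inferred from the file named above):
```python
from codeflash.code_utils.deduplicate_code import normalize_code

candidate_a = "def f(x):\n    return x + 1  # fast path\n"
candidate_b = "def f(x):\n    # same logic, different comment\n    return x + 1\n"

# If the normalized forms match, only the first candidate is actually tested
is_duplicate = normalize_code(candidate_a) == normalize_code(candidate_b)
```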
## Step 7: Check Repair/Refinement
If initial candidates failed, check repair and refinement stages.
1. Repair only runs if fewer than `MIN_CORRECT_CANDIDATES=2` passed
2. Repair sends `AIServiceCodeRepairRequest` with test diffs
3. Check `REPAIR_UNMATCHED_PERCENTAGE_LIMIT` — if too many tests failed, repair is skipped
4. Refinement only runs on top valid candidates
1. Repair only triggers if fewer than `MIN_CORRECT_CANDIDATES=2` passed behavioral tests
2. Repair sends `AIServiceCodeRepairRequest` with `TestDiff` objects showing what went wrong
3. Check `REPAIR_UNMATCHED_PERCENTAGE_LIMIT` (effort-dependent: 0.2/0.3/0.4) — if too many tests failed, repair is skipped entirely
4. Refinement only runs on the top valid candidates (count depends on effort level)
**If repair also failed**: The optimization approach may not work for this function.
**Checkpoint**: If repair also fails, the optimization approach likely doesn't work for this function. The function may rely on side effects or external state that the LLM can't safely optimize.
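For a rough feel of the repair gate, the percentage check works out like this (test counts are made up; the 0.2 limit is the LOW-effort value from the list above):
```python
REPAIR_UNMATCHED_PERCENTAGE_LIMIT = 0.2  # LOW-effort value from the step above

total_behavioral_tests = 20
mismatched_tests = 5

unmatched_ratio = mismatched_tests / total_behavioral_tests  # 0.25

# 25% mismatched exceeds the 20% limit, so repair is skipped entirely
repair_attempted = unmatched_ratio <= REPAIR_UNMATCHED_PERCENTAGE_LIMIT
```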
## Key Files to Check
## Key Files Reference
- `optimization/function_optimizer.py` — Main optimization loop, `determine_best_candidate()`
- `verification/test_runner.py` — Test execution
- `api/aiservice.py` — AI service communication
- `code_utils/config_consts.py` — Thresholds
- `context/code_context_extractor.py` — Context extraction
- `models/models.py``CandidateEvaluationContext`, `TestResults`
| File | What to check |
|------|---------------|
| `optimization/function_optimizer.py` | Main loop, `determine_best_candidate()` |
| `verification/test_runner.py` | Test subprocess execution |
| `api/aiservice.py` | AI service requests/responses |
| `code_utils/config_consts.py` | All thresholds and limits |
| `context/code_context_extractor.py` | Context extraction and token counting |
| `models/models.py` | `CandidateEvaluationContext`, `TestResults`, `TestDiff` |
| `code_utils/deduplicate_code.py` | AST normalization for deduplication |

@ -1,6 +1,6 @@
{
"name": "codeflash/codeflash-skills",
"version": "0.1.0",
"version": "0.2.0",
"summary": "Procedural workflows for developing and debugging codeflash",
"private": true,
"skills": {