chore: improve skills to 100% review score and bump to v0.2.0

- Add trigger hints and code snippets to both skills
- Add checkpoints after each step
- Extract module reference and troubleshooting into linked files
- Bump codeflash-skills tile to 0.2.0
Kevin Turcios 2026-02-14 21:07:24 -05:00
parent 6718e66582
commit 18ad00be59
6 changed files with 173 additions and 72 deletions

@ -71,7 +71,7 @@
"version": "0.1.0"
},
"codeflash/codeflash-skills": {
"version": "0.1.0"
"version": "0.2.0"
}
}
}

@ -0,0 +1,13 @@
# Module Reference
| Feature area | Primary module | Key files |
|-------------|----------------|-----------|
| New optimization strategy | `optimization/` | `function_optimizer.py`, `optimizer.py` |
| New test type | `verification/`, `models/` | `test_runner.py`, `pytest_plugin.py`, `test_type.py` |
| New AI service endpoint | `api/` | `aiservice.py` |
| New language support | `languages/` | Create new `languages/<lang>/support.py` |
| Context extraction change | `context/` | `code_context_extractor.py` |
| New CLI command | `cli_cmds/` | `cli.py` |
| New config option | `setup/`, `code_utils/` | `config_consts.py`, `setup/detector.py` |
| Discovery filter | `discovery/` | `functions_to_optimize.py` |
| PR/result changes | `github/`, `result/` | Relevant handlers |

@ -1,27 +1,23 @@
---
name: add-codeflash-feature
description: Step-by-step workflow for adding a new feature to the codeflash codebase
description: >
Guides implementation of new functionality in the codeflash optimization engine.
Use when adding a feature, building new functionality, implementing a new
optimization strategy, adding a language backend, creating an API endpoint,
extending the verification pipeline, or developing any new codeflash capability.
Covers module identification, Result type patterns, config, types, tests, and
quality checks.
---
# Add Codeflash Feature
Use this workflow when implementing a new feature in the codeflash codebase.
Use this workflow when implementing new functionality in the codeflash codebase — new optimization strategies, language backends, API endpoints, CLI commands, config options, or pipeline extensions.
## Step 1: Identify Target Modules
Determine which module(s) need modification based on the feature:
Determine which module(s) need modification. See [MODULE_REFERENCE.md](MODULE_REFERENCE.md) for the full mapping of feature areas to modules and key files.
| Feature area | Primary module | Key files |
|-------------|----------------|-----------|
| New optimization strategy | `optimization/` | `function_optimizer.py`, `optimizer.py` |
| New test type | `verification/`, `models/` | `test_runner.py`, `pytest_plugin.py`, `test_type.py` |
| New AI service endpoint | `api/` | `aiservice.py` |
| New language support | `languages/` | Create new `languages/<lang>/support.py` |
| Context extraction change | `context/` | `code_context_extractor.py` |
| New CLI command | `cli_cmds/` | `cli.py` |
| New config option | `setup/`, `code_utils/` | `config_consts.py`, `setup/detector.py` |
| Discovery filter | `discovery/` | `functions_to_optimize.py` |
| PR/result changes | `github/`, `result/` | Relevant handlers |
**Checkpoint**: Read the target files and understand existing patterns before writing any code. Look for similar features already implemented as reference.
## Step 2: Follow Result Type Pattern
@ -43,33 +39,76 @@ if not is_successful(result):
value = result.unwrap()
```
**Checkpoint**: Verify your function signatures match the `Result` pattern used in surrounding code. Not all functions use `Result` — match the convention of the module you're modifying.
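For reference, a minimal sketch of a function written in this style, using the `returns` library's `Success`/`Failure` constructors (the function name and error messages below are hypothetical):
```python
from returns.result import Failure, Result, Success


def parse_threshold(raw: str) -> Result[float, str]:
    """Parse a threshold value, returning Failure with a message on bad input."""
    try:
        value = float(raw)
    except ValueError:
        return Failure(f"not a number: {raw!r}")
    if value <= 0:
        return Failure("threshold must be positive")
    return Success(value)
```
Callers then check the outcome with `is_successful()` and `.unwrap()` exactly as shown above.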
## Step 3: Add Configuration Constants
If the feature needs configurable thresholds or limits:
1. Add constants to `code_utils/config_consts.py`
2. If effort-dependent, add to `EFFORT_VALUES` dict with values for `LOW`, `MEDIUM`, `HIGH`
3. Add a corresponding `EffortKeys` enum entry
4. Access via `get_effort_value(EffortKeys.MY_KEY, effort_level)`
2. If effort-dependent, add to `EFFORT_VALUES` dict with values for all three levels:
```python
# In config_consts.py:
class EffortKeys(str, Enum):
    MY_NEW_KEY = "MY_NEW_KEY"

EFFORT_VALUES: dict[str, dict[EffortLevel, Any]] = {
    # ... existing entries ...
    EffortKeys.MY_NEW_KEY.value: {
        EffortLevel.LOW: 1,
        EffortLevel.MEDIUM: 3,
        EffortLevel.HIGH: 5,
    },
}
```
3. Access via `get_effort_value(EffortKeys.MY_NEW_KEY, effort_level)`
**Checkpoint**: Skip this step if the feature doesn't need configuration. Not every feature requires new constants.
## Step 4: Add Domain Types
If new data structures are needed:
1. Add Pydantic models or frozen dataclasses to `models/models.py` or `models/function_types.py`
2. Use `@dataclass(frozen=True)` for immutable data
3. Use `BaseModel` for models that need serialization
4. Keep `function_types.py` dependency-free (no imports from other codeflash modules)
2. Use `@dataclass(frozen=True)` for immutable data, `BaseModel` for models that need serialization
3. Keep `function_types.py` dependency-free — no imports from other codeflash modules
Example following existing patterns:
```python
# In models/models.py:
@dataclass(frozen=True)
class MyNewType:
    name: str
    value: int
    source: OptimizedCandidateSource

# For serializable models:
class MyNewModel(BaseModel):
    items: list[MyNewType] = []
```
**Checkpoint**: Skip this step if you can reuse existing types. Check `models/models.py` for types that already fit your needs.
## Step 5: Write Tests
Follow existing test patterns:
1. Create test files in the `tests/` directory mirroring the source structure
2. Use pytest's `tmp_path` fixture for temp directories
3. Always call `.resolve()` on Path objects
1. Create test files in `tests/` mirroring the source structure (e.g., `tests/test_optimization/test_my_feature.py`)
2. Use pytest's `tmp_path` fixture for temp directories — never `NamedTemporaryFile`
3. Always call `.resolve()` on Path objects and `.as_posix()` for string conversion
4. Assert full string equality for code context tests — no substring matching
5. Remember the pytest plugin patches `time`, `random`, `uuid`, `datetime` — don't rely on real values
5. The pytest plugin patches `time`, `random`, `uuid`, `datetime` — never rely on real values in verification tests
```python
def test_my_feature(tmp_path: Path) -> None:
    test_file = tmp_path / "test_module.py"
    test_file.write_text("def foo(): return 1", encoding="utf-8")
    result = my_operation(test_file.resolve())
    assert is_successful(result)
    assert result.unwrap() == expected_value
```
**Checkpoint**: Run the new tests in isolation before proceeding: `uv run pytest tests/path/to/test_file.py -x`
## Step 6: Run Quality Checks
@ -86,11 +125,22 @@ uv run mypy codeflash/
uv run pytest tests/path/to/relevant/tests -x
```
**If checks fail**:
- `prek run` failures: Fix formatting/lint issues reported by ruff, then re-run
- `mypy` failures: Fix type errors — common issues are missing return types, wrong `Optional` usage, or missing imports in `TYPE_CHECKING` block
- Test failures: Fix the failing test or the implementation, then re-run
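For the `TYPE_CHECKING` fix mentioned above, the usual pattern looks like this (the imported name and module path are illustrative, following the layout described in Step 4):
```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only imported for type checkers, so no runtime circular import
    from codeflash.models.models import MyNewType


def describe(item: MyNewType) -> str:
    return f"{item.name}={item.value}"
```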
## Step 7: Language Support Considerations
If the feature needs to work across languages:
1. Check if the feature uses language-specific APIs — use `get_language_support(identifier)` from `languages/registry.py`
1. Use `get_language_support(identifier)` from `languages/registry.py` — never import language classes directly
2. Current language is a singleton: `set_current_language()` / `current_language()` from `languages/current.py`
3. Use `is_python()` / `is_javascript()` guards for language-specific branches
4. New language support classes must use `@register_language` decorator
4. New language support classes must use `@register_language` decorator and be instantiable without arguments
**Checkpoint**: Skip this step if the feature is Python-only. Most features don't need multi-language support.
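A rough sketch of a language-aware branch, assuming the helpers are importable from the modules named above (the `languages.current` / `languages.registry` import paths and the `"javascript"` identifier are assumptions):
```python
from codeflash.languages.current import is_python
from codeflash.languages.registry import get_language_support

if is_python():
    ...  # Python-only fast path
else:
    # Resolve the support class through the registry instead of importing it directly
    support = get_language_support("javascript")
    ...  # delegate language-specific work to `support`
```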
## Troubleshooting
If you run into issues, see [TROUBLESHOOTING.md](TROUBLESHOOTING.md) for common problems and fixes (circular imports, `UnsupportedLanguageError`, CI path failures, Pydantic validation errors, token limit exceeded).

@ -0,0 +1,9 @@
# Troubleshooting
| Problem | Likely cause | Fix |
|---------|-------------|-----|
| Circular import at startup | Importing from `models/` in a module loaded early | Move import into `TYPE_CHECKING` block or use lazy import |
| `UnsupportedLanguageError` | Language modules not registered yet | Call `_ensure_languages_registered()` or use `get_language_support()` which does it automatically |
| Tests pass locally but fail in CI | Path differences (absolute vs relative) | Always use `.resolve()` on Path objects |
| `ValidationError` from Pydantic | Invalid code passed to `CodeString` | Check that generated code passes syntax validation for the target language |
| `encoded_tokens_len` exceeds limit | Context too large | Reduce helper functions or split into read-only vs read-writable |

@ -1,6 +1,10 @@
---
name: debug-optimization-failure
description: Debug why a codeflash optimization failed at any pipeline stage
description: >
Diagnose why a codeflash optimization produced no results or failed silently.
Use when an optimization run errors out, returns no candidates, or all candidates
are rejected. Walks through discovery, ranking, context limits, AI service,
test verification, deduplication, and repair stages.
---
# Debug Optimization Failure
@ -11,85 +15,110 @@ Use this workflow when an optimization run fails or produces no results. Work th
Determine if the function was discovered by `FunctionVisitor`.
1. Look at the discovery output or logs for the function name
2. Check `discovery/functions_to_optimize.py` — the `FunctionVisitor` filters out:
- Functions that are too small or trivial
- Functions matching exclude patterns in config
- Functions already optimized (`was_function_previously_optimized()`)
3. Verify the function file is under the configured `module-root`
1. Search logs for the function name in discovery output:
```python
# In discovery/functions_to_optimize.py, FunctionVisitor filters out:
# - Functions matching exclude patterns in pyproject.toml [tool.codeflash]
# - Functions already optimized (was_function_previously_optimized())
# - Functions outside the configured module-root
```
2. Verify the function file is under the configured `module-root` in `pyproject.toml`
3. Check if the function was previously optimized — look for it in the optimization history
**If not discovered**: Check config patterns, file location, and function size.
**Checkpoint**: If the function doesn't appear in discovery output, fix config patterns or file location before proceeding.
## Step 2: Check Ranking
If trace data is used, check if the function was ranked high enough.
1. Look at `benchmarking/function_ranker.py` output
2. The function's **addressable time** must exceed `DEFAULT_IMPORTANCE_THRESHOLD=0.001`
3. Addressable time = own time + callee time / call count
1. Look at `benchmarking/function_ranker.py` output for the function's addressable time
2. The function must exceed `DEFAULT_IMPORTANCE_THRESHOLD=0.001`:
```python
# Addressable time = own time + callee time / call count
# Grep for the function in ranking output:
# grep -i "function_name" in ranking logs
```
3. Functions below the threshold are silently skipped
**If ranked too low**: The function doesn't spend enough time to be worth optimizing.
**Checkpoint**: If ranked too low, the function doesn't spend enough time to be worth optimizing. No fix needed — this is expected.
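A quick worked example of the addressable-time arithmetic above (all numbers are made up; units are whatever the ranker reports):
```python
# Hypothetical profile figures for one function
own_time = 0.0004
callee_time = 0.0030
call_count = 10

DEFAULT_IMPORTANCE_THRESHOLD = 0.001
addressable_time = own_time + callee_time / call_count  # 0.0004 + 0.0003 = 0.0007

# 0.0007 is below the threshold, so this function would be silently skipped
worth_ranking = addressable_time > DEFAULT_IMPORTANCE_THRESHOLD
```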
## Step 3: Check Context Token Limits
Verify the function's context fits within token limits.
1. Check `OPTIMIZATION_CONTEXT_TOKEN_LIMIT=16000` and `TESTGEN_CONTEXT_TOKEN_LIMIT=16000` in `code_utils/config_consts.py`
2. Token counting is done by `encoded_tokens_len()` in `code_utils/code_utils.py`
3. Large helper function chains or deep dependency trees can blow the limit
1. Check thresholds in `code_utils/config_consts.py`:
```python
OPTIMIZATION_CONTEXT_TOKEN_LIMIT = 16000 # tokens
TESTGEN_CONTEXT_TOKEN_LIMIT = 16000 # tokens
```
2. Token counting uses `encoded_tokens_len()` from `code_utils/code_utils.py`
3. Common causes: large helper function chains, deep dependency trees, large class hierarchies
**If context too large**: The function has too many dependencies. Consider refactoring to reduce context size.
**Checkpoint**: If context exceeds limits, the function is rejected. Consider refactoring to reduce dependencies or splitting large modules.
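To reproduce the check by hand, something along these lines should work, assuming `encoded_tokens_len()` takes the context string and returns a token count (import paths are inferred from the module names above, and `context_str` stands in for the extracted context):
```python
from codeflash.code_utils.code_utils import encoded_tokens_len
from codeflash.code_utils.config_consts import OPTIMIZATION_CONTEXT_TOKEN_LIMIT

context_str = "..."  # the extracted optimization context, assembled elsewhere
if encoded_tokens_len(context_str) > OPTIMIZATION_CONTEXT_TOKEN_LIMIT:
    print("context rejected: over the optimization token limit")
```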
## Step 4: Check AI Service Response
Verify the AI service returned valid candidates.
1. Check logs for `AiServiceClient` request/response
2. Look for HTTP errors (non-200 status codes)
3. Verify `_get_valid_candidates()` parsed the response — empty `code_strings` means invalid markdown code blocks
4. Check if all candidates were filtered out during parsing
1. Look for HTTP errors in logs:
```
# Error patterns to search for:
"Error generating optimized candidates"
"Error generating jit rewritten candidate"
"cli-optimize-error-caught"
"cli-optimize-error-response"
```
2. Check `_get_valid_candidates()` in `api/aiservice.py` — empty `code_strings` after `CodeStringsMarkdown.parse_markdown_code()` means the LLM returned malformed code blocks
3. Verify API key is valid (`get_codeflash_api_key()`)
**If no candidates returned**: Check API key, network connectivity, and service status.
**Checkpoint**: If no candidates returned, check API key, network, and service status before proceeding.
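If you have the raw LLM response from the logs, a quick parse check is possible, assuming `parse_markdown_code()` is a classmethod that takes the markdown text (the exact signature and import location are assumptions):
```python
from codeflash.models.models import CodeStringsMarkdown  # import location assumed

raw_response = "..."  # markdown body captured from the AI service logs
parsed = CodeStringsMarkdown.parse_markdown_code(raw_response)
if not parsed.code_strings:
    print("no valid fenced code blocks in the response")
```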
## Step 5: Check Test Failures
Determine if candidates failed behavioral or benchmark tests.
1. **Behavioral failures**: Compare return values, stdout, pass/fail status between original baseline and candidate
- Check `TestDiffScope`: `RETURN_VALUE`, `STDOUT`, `DID_PASS`
- Look at JUnit XML results for specific test failures
2. **Benchmark failures**: Check if candidate met `MIN_IMPROVEMENT_THRESHOLD=0.05` (5% speedup)
3. **Stability failures**: Check if timing was stable within `STABILITY_WINDOW_SIZE=0.35`
1. **Behavioral failures** — compare return values, stdout, pass/fail between baseline and candidate:
```python
# TestDiffScope enum values to look for:
# RETURN_VALUE - function returned different value
# STDOUT - different stdout output
# DID_PASS - test passed/failed differently
```
2. **Benchmark failures** — candidate must beat `MIN_IMPROVEMENT_THRESHOLD=0.05` (5% speedup)
3. **Stability failures** — timing must be stable within `STABILITY_WINDOW_SIZE=0.35` (35% of iterations)
4. Check JUnit XML test results in the temp directory for specific failure messages
**If behavioral failure**: The optimization changed the function's behavior. Check test diffs for specific mismatches.
**If benchmark failure**: The optimization didn't provide enough speedup.
**Checkpoint**: Behavioral failure = optimization changed behavior (check test diffs). Benchmark failure = not fast enough. Stability failure = noisy timing environment.
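A worked example of the benchmark gate, with made-up timings and assuming the improvement is computed as a simple relative speedup over the baseline:
```python
MIN_IMPROVEMENT_THRESHOLD = 0.05

original_runtime_ns = 100_000  # hypothetical baseline timing
candidate_runtime_ns = 94_000  # hypothetical candidate timing

improvement = (original_runtime_ns - candidate_runtime_ns) / original_runtime_ns  # 0.06

# 6% beats the 5% threshold, so this candidate clears the benchmark gate
passes_benchmark = improvement > MIN_IMPROVEMENT_THRESHOLD
```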
## Step 6: Check Deduplication
Verify candidates weren't deduplicated away.
1. `CandidateEvaluationContext.ast_code_to_id` tracks normalized code → candidate mapping
2. `normalize_code()` from `code_utils/deduplicate_code.py` normalizes AST for comparison
3. If all candidates normalize to the same code, only one is actually tested
1. `CandidateEvaluationContext.ast_code_to_id` tracks normalized AST → candidate mapping
2. `normalize_code()` from `code_utils/deduplicate_code.py` strips comments/whitespace and normalizes the AST
3. If all candidates normalize to identical code, only the first is tested — the rest copy its results
**If all duplicates**: The LLM generated the same optimization multiple times. Try higher effort level.
**Checkpoint**: If all duplicates, the LLM generated the same optimization repeatedly. Try a higher effort level for more diverse candidates.
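A minimal illustration of the deduplication check, assuming `normalize_code()` takes a source string and returns a comparable normalized form (import path inferred from the file named above):
```python
from codeflash.code_utils.deduplicate_code import normalize_code

candidate_a = "def f(x):\n    return x + 1  # fast path\n"
candidate_b = "def f(x):\n    # same logic, different comment\n    return x + 1\n"

# If the normalized forms match, only the first candidate is actually tested
is_duplicate = normalize_code(candidate_a) == normalize_code(candidate_b)
```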
## Step 7: Check Repair/Refinement
If initial candidates failed, check repair and refinement stages.
1. Repair only runs if fewer than `MIN_CORRECT_CANDIDATES=2` passed
2. Repair sends `AIServiceCodeRepairRequest` with test diffs
3. Check `REPAIR_UNMATCHED_PERCENTAGE_LIMIT` — if too many tests failed, repair is skipped
4. Refinement only runs on top valid candidates
1. Repair only triggers if fewer than `MIN_CORRECT_CANDIDATES=2` passed behavioral tests
2. Repair sends `AIServiceCodeRepairRequest` with `TestDiff` objects showing what went wrong
3. Check `REPAIR_UNMATCHED_PERCENTAGE_LIMIT` (effort-dependent: 0.2/0.3/0.4) — if too many tests failed, repair is skipped entirely
4. Refinement only runs on the top valid candidates (count depends on effort level)
**If repair also failed**: The optimization approach may not work for this function.
**Checkpoint**: If repair also fails, the optimization approach likely doesn't work for this function. The function may rely on side effects or external state that the LLM can't safely optimize.
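For a rough feel of the repair gate, the percentage check works out like this (test counts are made up; the 0.2 limit is the LOW-effort value from the list above):
```python
REPAIR_UNMATCHED_PERCENTAGE_LIMIT = 0.2  # LOW-effort value from the step above

total_behavioral_tests = 20
mismatched_tests = 5

unmatched_ratio = mismatched_tests / total_behavioral_tests  # 0.25

# 25% mismatched exceeds the 20% limit, so repair is skipped entirely
repair_attempted = unmatched_ratio <= REPAIR_UNMATCHED_PERCENTAGE_LIMIT
```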
## Key Files to Check
## Key Files Reference
- `optimization/function_optimizer.py` — Main optimization loop, `determine_best_candidate()`
- `verification/test_runner.py` — Test execution
- `api/aiservice.py` — AI service communication
- `code_utils/config_consts.py` — Thresholds
- `context/code_context_extractor.py` — Context extraction
- `models/models.py``CandidateEvaluationContext`, `TestResults`
| File | What to check |
|------|---------------|
| `optimization/function_optimizer.py` | Main loop, `determine_best_candidate()` |
| `verification/test_runner.py` | Test subprocess execution |
| `api/aiservice.py` | AI service requests/responses |
| `code_utils/config_consts.py` | All thresholds and limits |
| `context/code_context_extractor.py` | Context extraction and token counting |
| `models/models.py` | `CandidateEvaluationContext`, `TestResults`, `TestDiff` |
| `code_utils/deduplicate_code.py` | AST normalization for deduplication |

@ -1,6 +1,6 @@
{
"name": "codeflash/codeflash-skills",
"version": "0.1.0",
"version": "0.2.0",
"summary": "Procedural workflows for developing and debugging codeflash",
"private": true,
"skills": {