chore: improve skills to 100% review score and bump to v0.2.0
- Add trigger hints and code snippets to both skills
- Add checkpoints after each step
- Extract module reference and troubleshooting into linked files
- Bump codeflash-skills tile to 0.2.0
This commit is contained in: parent 6718e66582, commit 18ad00be59
6 changed files with 173 additions and 72 deletions
@@ -71,7 +71,7 @@
      "version": "0.1.0"
    },
    "codeflash/codeflash-skills": {
-      "version": "0.1.0"
+      "version": "0.2.0"
    }
  }
}
@@ -0,0 +1,13 @@
+# Module Reference
+
+| Feature area | Primary module | Key files |
+|-------------|----------------|-----------|
+| New optimization strategy | `optimization/` | `function_optimizer.py`, `optimizer.py` |
+| New test type | `verification/`, `models/` | `test_runner.py`, `pytest_plugin.py`, `test_type.py` |
+| New AI service endpoint | `api/` | `aiservice.py` |
+| New language support | `languages/` | Create new `languages/<lang>/support.py` |
+| Context extraction change | `context/` | `code_context_extractor.py` |
+| New CLI command | `cli_cmds/` | `cli.py` |
+| New config option | `setup/`, `code_utils/` | `config_consts.py`, `setup/detector.py` |
+| Discovery filter | `discovery/` | `functions_to_optimize.py` |
+| PR/result changes | `github/`, `result/` | Relevant handlers |
@@ -1,27 +1,23 @@
---
name: add-codeflash-feature
-description: Step-by-step workflow for adding a new feature to the codeflash codebase
+description: >
+  Guides implementation of new functionality in the codeflash optimization engine.
+  Use when adding a feature, building new functionality, implementing a new
+  optimization strategy, adding a language backend, creating an API endpoint,
+  extending the verification pipeline, or developing any new codeflash capability.
+  Covers module identification, Result type patterns, config, types, tests, and
+  quality checks.
---

# Add Codeflash Feature

-Use this workflow when implementing a new feature in the codeflash codebase.
+Use this workflow when implementing new functionality in the codeflash codebase — new optimization strategies, language backends, API endpoints, CLI commands, config options, or pipeline extensions.

## Step 1: Identify Target Modules

-Determine which module(s) need modification based on the feature:
+Determine which module(s) need modification. See [MODULE_REFERENCE.md](MODULE_REFERENCE.md) for the full mapping of feature areas to modules and key files.

-| Feature area | Primary module | Key files |
-|-------------|----------------|-----------|
-| New optimization strategy | `optimization/` | `function_optimizer.py`, `optimizer.py` |
-| New test type | `verification/`, `models/` | `test_runner.py`, `pytest_plugin.py`, `test_type.py` |
-| New AI service endpoint | `api/` | `aiservice.py` |
-| New language support | `languages/` | Create new `languages/<lang>/support.py` |
-| Context extraction change | `context/` | `code_context_extractor.py` |
-| New CLI command | `cli_cmds/` | `cli.py` |
-| New config option | `setup/`, `code_utils/` | `config_consts.py`, `setup/detector.py` |
-| Discovery filter | `discovery/` | `functions_to_optimize.py` |
-| PR/result changes | `github/`, `result/` | Relevant handlers |
+**Checkpoint**: Read the target files and understand existing patterns before writing any code. Look for similar features already implemented as reference.

## Step 2: Follow Result Type Pattern
@@ -43,33 +39,76 @@ if not is_successful(result):
    value = result.unwrap()
```

+**Checkpoint**: Verify your function signatures match the `Result` pattern used in surrounding code. Not all functions use `Result` — match the convention of the module you're modifying.
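For readers new to the pattern above: a minimal, self-contained sketch of the `Result` flow, assuming the dry-python `returns` library (the usual home of `is_successful()` and `.unwrap()`); `parse_speedup` is a hypothetical function used only for illustration.

```python
# A minimal sketch of the Result pattern, assuming the dry-python `returns`
# library; parse_speedup is a hypothetical function for illustration.
from returns.pipeline import is_successful
from returns.result import Failure, Result, Success


def parse_speedup(raw: str) -> Result[float, str]:
    """Return Success(value) or Failure(reason) instead of raising."""
    try:
        return Success(float(raw))
    except ValueError:
        return Failure(f"not a number: {raw!r}")


result = parse_speedup("1.25")
if not is_successful(result):
    print(result.failure())   # explicit error branch, no exception raised
else:
    value = result.unwrap()   # safe only after the success check
```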

## Step 3: Add Configuration Constants

If the feature needs configurable thresholds or limits:

1. Add constants to `code_utils/config_consts.py`
-2. If effort-dependent, add to `EFFORT_VALUES` dict with values for `LOW`, `MEDIUM`, `HIGH`
-3. Add a corresponding `EffortKeys` enum entry
-4. Access via `get_effort_value(EffortKeys.MY_KEY, effort_level)`
+2. If effort-dependent, add to `EFFORT_VALUES` dict with values for all three levels:
+   ```python
+   # In config_consts.py:
+   class EffortKeys(str, Enum):
+       MY_NEW_KEY = "MY_NEW_KEY"
+
+   EFFORT_VALUES: dict[str, dict[EffortLevel, Any]] = {
+       # ... existing entries ...
+       EffortKeys.MY_NEW_KEY.value: {
+           EffortLevel.LOW: 1,
+           EffortLevel.MEDIUM: 3,
+           EffortLevel.HIGH: 5,
+       },
+   }
+   ```
+3. Access via `get_effort_value(EffortKeys.MY_NEW_KEY, effort_level)`
+
+**Checkpoint**: Skip this step if the feature doesn't need configuration. Not every feature requires new constants.
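To make the lookup in item 3 concrete, here is a hypothetical, self-contained reimplementation inferred from the `EFFORT_VALUES` structure above; the real helper lives in `code_utils/config_consts.py`, and the `EffortLevel` values shown are assumptions.

```python
# Hypothetical stand-in for the lookup described in item 3 above,
# inferred from the EFFORT_VALUES structure; the real helper may differ.
from enum import Enum
from typing import Any


class EffortLevel(Enum):  # assumed values, for illustration only
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class EffortKeys(str, Enum):
    MY_NEW_KEY = "MY_NEW_KEY"


EFFORT_VALUES: dict[str, dict[EffortLevel, Any]] = {
    EffortKeys.MY_NEW_KEY.value: {
        EffortLevel.LOW: 1,
        EffortLevel.MEDIUM: 3,
        EffortLevel.HIGH: 5,
    },
}


def get_effort_value(key: EffortKeys, effort_level: EffortLevel) -> Any:
    # Index by the enum's string value, then by the effort level.
    return EFFORT_VALUES[key.value][effort_level]


assert get_effort_value(EffortKeys.MY_NEW_KEY, EffortLevel.MEDIUM) == 3
```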

## Step 4: Add Domain Types

If new data structures are needed:

1. Add Pydantic models or frozen dataclasses to `models/models.py` or `models/function_types.py`
-2. Use `@dataclass(frozen=True)` for immutable data
-3. Use `BaseModel` for models that need serialization
-4. Keep `function_types.py` dependency-free (no imports from other codeflash modules)
+2. Use `@dataclass(frozen=True)` for immutable data, `BaseModel` for models that need serialization
+3. Keep `function_types.py` dependency-free — no imports from other codeflash modules
+
+Example following existing patterns:
+```python
+# In models/models.py:
+@dataclass(frozen=True)
+class MyNewType:
+    name: str
+    value: int
+    source: OptimizedCandidateSource
+
+# For serializable models:
+class MyNewModel(BaseModel):
+    items: list[MyNewType] = []
+```
+
+**Checkpoint**: Skip this step if you can reuse existing types. Check `models/models.py` for types that already fit your needs.

## Step 5: Write Tests

Follow existing test patterns:

-1. Create test files in the `tests/` directory mirroring the source structure
-2. Use pytest's `tmp_path` fixture for temp directories
-3. Always call `.resolve()` on Path objects
+1. Create test files in `tests/` mirroring the source structure (e.g., `tests/test_optimization/test_my_feature.py`)
+2. Use pytest's `tmp_path` fixture for temp directories — never `NamedTemporaryFile`
+3. Always call `.resolve()` on Path objects and `.as_posix()` for string conversion
4. Assert full string equality for code context tests — no substring matching
-5. Remember the pytest plugin patches `time`, `random`, `uuid`, `datetime` — don't rely on real values
+5. The pytest plugin patches `time`, `random`, `uuid`, `datetime` — never rely on real values in verification tests
+
+```python
+def test_my_feature(tmp_path: Path) -> None:
+    test_file = tmp_path / "test_module.py"
+    test_file.write_text("def foo(): return 1", encoding="utf-8")
+
+    result = my_operation(test_file.resolve())
+    assert is_successful(result)
+    assert result.unwrap() == expected_value
+```
+
+**Checkpoint**: Run the new tests in isolation before proceeding: `uv run pytest tests/path/to/test_file.py -x`

## Step 6: Run Quality Checks
@@ -86,11 +125,22 @@ uv run mypy codeflash/
uv run pytest tests/path/to/relevant/tests -x
```

+**If checks fail**:
+- `prek run` failures: Fix formatting/lint issues reported by ruff, then re-run
+- `mypy` failures: Fix type errors — common issues are missing return types, wrong `Optional` usage, or missing imports in `TYPE_CHECKING` block
+- Test failures: Fix the failing test or the implementation, then re-run
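As a generic illustration of the `TYPE_CHECKING` fix named in the `mypy` bullet above (standard Python, not code from this repo):

```python
# Generic TYPE_CHECKING pattern: the import exists for annotations only,
# so it never executes at runtime (avoiding circular imports).
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from pathlib import Path  # resolved by mypy, skipped at runtime


def count_lines(path: Path) -> int:
    return len(path.read_text(encoding="utf-8").splitlines())
```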

## Step 7: Language Support Considerations

If the feature needs to work across languages:

-1. Check if the feature uses language-specific APIs — use `get_language_support(identifier)` from `languages/registry.py`
+1. Use `get_language_support(identifier)` from `languages/registry.py` — never import language classes directly
2. Current language is a singleton: `set_current_language()` / `current_language()` from `languages/current.py`
3. Use `is_python()` / `is_javascript()` guards for language-specific branches
-4. New language support classes must use `@register_language` decorator
+4. New language support classes must use `@register_language` decorator and be instantiable without arguments
+
+**Checkpoint**: Skip this step if the feature is Python-only. Most features don't need multi-language support.
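A hedged sketch of what items 1 and 4 above imply; the registry below is a self-contained stand-in, not the repo's actual `languages/registry.py`, and the `LuaSupport` backend is purely illustrative.

```python
# Self-contained stand-in for the decorator-based registry described above;
# the real register_language / get_language_support live in languages/registry.py.
_REGISTRY: dict[str, object] = {}


def register_language(cls: type) -> type:
    # Instantiates with no arguments, which is why item 4 requires
    # backends to be constructible without any.
    instance = cls()
    _REGISTRY[instance.identifier] = instance
    return cls


def get_language_support(identifier: str) -> object:
    return _REGISTRY[identifier]


@register_language
class LuaSupport:
    """Hypothetical backend; 'lua' support is illustrative only."""

    identifier = "lua"


support = get_language_support("lua")  # resolved via the registry, not a direct import
```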

+## Troubleshooting
+
+If you run into issues, see [TROUBLESHOOTING.md](TROUBLESHOOTING.md) for common problems and fixes (circular imports, `UnsupportedLanguageError`, CI path failures, Pydantic validation errors, token limit exceeded).
@@ -0,0 +1,9 @@
+# Troubleshooting
+
+| Problem | Likely cause | Fix |
+|---------|-------------|-----|
+| Circular import at startup | Importing from `models/` in a module loaded early | Move import into `TYPE_CHECKING` block or use lazy import |
+| `UnsupportedLanguageError` | Language modules not registered yet | Call `_ensure_languages_registered()` or use `get_language_support()` which does it automatically |
+| Tests pass locally but fail in CI | Path differences (absolute vs relative) | Always use `.resolve()` on Path objects |
+| `ValidationError` from Pydantic | Invalid code passed to `CodeString` | Check that generated code passes syntax validation for the target language |
+| `encoded_tokens_len` exceeds limit | Context too large | Reduce helper functions or split into read-only vs read-writable |
@@ -1,6 +1,10 @@
---
name: debug-optimization-failure
-description: Debug why a codeflash optimization failed at any pipeline stage
+description: >
+  Diagnose why a codeflash optimization produced no results or failed silently.
+  Use when an optimization run errors out, returns no candidates, or all candidates
+  are rejected. Walks through discovery, ranking, context limits, AI service,
+  test verification, deduplication, and repair stages.
---

# Debug Optimization Failure
@@ -11,85 +15,110 @@ Use this workflow when an optimization run fails or produces no results. Work th

Determine if the function was discovered by `FunctionVisitor`.

-1. Look at the discovery output or logs for the function name
-2. Check `discovery/functions_to_optimize.py` — the `FunctionVisitor` filters out:
-   - Functions that are too small or trivial
-   - Functions matching exclude patterns in config
-   - Functions already optimized (`was_function_previously_optimized()`)
-3. Verify the function file is under the configured `module-root`
+1. Search logs for the function name in discovery output:
+   ```python
+   # In discovery/functions_to_optimize.py, FunctionVisitor filters out:
+   # - Functions matching exclude patterns in pyproject.toml [tool.codeflash]
+   # - Functions already optimized (was_function_previously_optimized())
+   # - Functions outside the configured module-root
+   ```
+2. Verify the function file is under the configured `module-root` in `pyproject.toml`
+3. Check if the function was previously optimized — look for it in the optimization history

-**If not discovered**: Check config patterns, file location, and function size.
+**Checkpoint**: If the function doesn't appear in discovery output, fix config patterns or file location before proceeding.

## Step 2: Check Ranking

If trace data is used, check if the function was ranked high enough.

-1. Look at `benchmarking/function_ranker.py` output
-2. The function's **addressable time** must exceed `DEFAULT_IMPORTANCE_THRESHOLD=0.001`
-3. Addressable time = own time + callee time / call count
+1. Look at `benchmarking/function_ranker.py` output for the function's addressable time
+2. The function must exceed `DEFAULT_IMPORTANCE_THRESHOLD=0.001`:
+   ```python
+   # Addressable time = own time + callee time / call count
+   # Grep for the function in ranking output:
+   # grep -i "function_name" in ranking logs
+   ```
+3. Functions below the threshold are silently skipped

-**If ranked too low**: The function doesn't spend enough time to be worth optimizing.
+**Checkpoint**: If ranked too low, the function doesn't spend enough time to be worth optimizing. No fix needed — this is expected.
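Worked arithmetic for the addressable-time formula quoted above, using invented numbers (units follow whatever convention the ranker uses):

```python
# Invented numbers illustrating: addressable = own time + callee time / call count
own_time = 0.0008      # time attributed to the function body itself
callee_time = 0.0040   # total time attributed to its callees
call_count = 10

addressable_time = own_time + callee_time / call_count  # 0.0008 + 0.0004 = 0.0012

DEFAULT_IMPORTANCE_THRESHOLD = 0.001
print(addressable_time > DEFAULT_IMPORTANCE_THRESHOLD)  # True, so it gets ranked
```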

## Step 3: Check Context Token Limits

Verify the function's context fits within token limits.

-1. Check `OPTIMIZATION_CONTEXT_TOKEN_LIMIT=16000` and `TESTGEN_CONTEXT_TOKEN_LIMIT=16000` in `code_utils/config_consts.py`
-2. Token counting is done by `encoded_tokens_len()` in `code_utils/code_utils.py`
-3. Large helper function chains or deep dependency trees can blow the limit
+1. Check thresholds in `code_utils/config_consts.py`:
+   ```python
+   OPTIMIZATION_CONTEXT_TOKEN_LIMIT = 16000  # tokens
+   TESTGEN_CONTEXT_TOKEN_LIMIT = 16000  # tokens
+   ```
+2. Token counting uses `encoded_tokens_len()` from `code_utils/code_utils.py`
+3. Common causes: large helper function chains, deep dependency trees, large class hierarchies

-**If context too large**: The function has too many dependencies. Consider refactoring to reduce context size.
+**Checkpoint**: If context exceeds limits, the function is rejected. Consider refactoring to reduce dependencies or splitting large modules.
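A rough stand-in for this check: the real `encoded_tokens_len()` uses an actual tokenizer, while the sketch below only approximates token counts from character length.

```python
OPTIMIZATION_CONTEXT_TOKEN_LIMIT = 16000  # from code_utils/config_consts.py


def fits_in_context(code: str) -> bool:
    # Crude stand-in for encoded_tokens_len(): roughly 4 characters per
    # token is a common rule of thumb for English-like source text.
    approx_tokens = len(code) // 4
    return approx_tokens <= OPTIMIZATION_CONTEXT_TOKEN_LIMIT


print(fits_in_context("def foo():\n    return 1\n"))  # True
```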

## Step 4: Check AI Service Response

Verify the AI service returned valid candidates.

-1. Check logs for `AiServiceClient` request/response
-2. Look for HTTP errors (non-200 status codes)
-3. Verify `_get_valid_candidates()` parsed the response — empty `code_strings` means invalid markdown code blocks
-4. Check if all candidates were filtered out during parsing
+1. Look for HTTP errors in logs:
+   ```
+   # Error patterns to search for:
+   "Error generating optimized candidates"
+   "Error generating jit rewritten candidate"
+   "cli-optimize-error-caught"
+   "cli-optimize-error-response"
+   ```
+2. Check `_get_valid_candidates()` in `api/aiservice.py` — empty `code_strings` after `CodeStringsMarkdown.parse_markdown_code()` means the LLM returned malformed code blocks
+3. Verify API key is valid (`get_codeflash_api_key()`)

-**If no candidates returned**: Check API key, network connectivity, and service status.
+**Checkpoint**: If no candidates returned, check API key, network, and service status before proceeding.

## Step 5: Check Test Failures

Determine if candidates failed behavioral or benchmark tests.

-1. **Behavioral failures**: Compare return values, stdout, pass/fail status between original baseline and candidate
-   - Check `TestDiffScope`: `RETURN_VALUE`, `STDOUT`, `DID_PASS`
-   - Look at JUnit XML results for specific test failures
-2. **Benchmark failures**: Check if candidate met `MIN_IMPROVEMENT_THRESHOLD=0.05` (5% speedup)
-3. **Stability failures**: Check if timing was stable within `STABILITY_WINDOW_SIZE=0.35`
+1. **Behavioral failures** — compare return values, stdout, pass/fail between baseline and candidate:
+   ```python
+   # TestDiffScope enum values to look for:
+   # RETURN_VALUE - function returned different value
+   # STDOUT - different stdout output
+   # DID_PASS - test passed/failed differently
+   ```
+2. **Benchmark failures** — candidate must beat `MIN_IMPROVEMENT_THRESHOLD=0.05` (5% speedup)
+3. **Stability failures** — timing must be stable within `STABILITY_WINDOW_SIZE=0.35` (35% of iterations)
+4. Check JUnit XML test results in the temp directory for specific failure messages

-**If behavioral failure**: The optimization changed the function's behavior. Check test diffs for specific mismatches.
-**If benchmark failure**: The optimization didn't provide enough speedup.
+**Checkpoint**: Behavioral failure = optimization changed behavior (check test diffs). Benchmark failure = not fast enough. Stability failure = noisy timing environment.

## Step 6: Check Deduplication

Verify candidates weren't deduplicated away.

-1. `CandidateEvaluationContext.ast_code_to_id` tracks normalized code → candidate mapping
-2. `normalize_code()` from `code_utils/deduplicate_code.py` normalizes AST for comparison
-3. If all candidates normalize to the same code, only one is actually tested
+1. `CandidateEvaluationContext.ast_code_to_id` tracks normalized AST → candidate mapping
+2. `normalize_code()` from `code_utils/deduplicate_code.py` strips comments/whitespace and normalizes the AST
+3. If all candidates normalize to identical code, only the first is tested — the rest copy its results

-**If all duplicates**: The LLM generated the same optimization multiple times. Try higher effort level.
+**Checkpoint**: If all duplicates, the LLM generated the same optimization repeatedly. Try a higher effort level for more diverse candidates.
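The idea behind `normalize_code()` can be demonstrated with the standard library alone; this is a simplified sketch, not the repo's implementation, which may normalize more aggressively:

```python
# Simplified AST normalization (Python 3.9+): parsing then unparsing drops
# comments and formatting, so cosmetically different candidates compare equal.
import ast


def normalize(code: str) -> str:
    return ast.unparse(ast.parse(code))


a = "def f(x):  return x + 1  # fast path"
b = "def f(x):\n    return (x + 1)"
print(normalize(a) == normalize(b))  # True, so they deduplicate to one candidate
```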

## Step 7: Check Repair/Refinement

If initial candidates failed, check repair and refinement stages.

-1. Repair only runs if fewer than `MIN_CORRECT_CANDIDATES=2` passed
-2. Repair sends `AIServiceCodeRepairRequest` with test diffs
-3. Check `REPAIR_UNMATCHED_PERCENTAGE_LIMIT` — if too many tests failed, repair is skipped
-4. Refinement only runs on top valid candidates
+1. Repair only triggers if fewer than `MIN_CORRECT_CANDIDATES=2` passed behavioral tests
+2. Repair sends `AIServiceCodeRepairRequest` with `TestDiff` objects showing what went wrong
+3. Check `REPAIR_UNMATCHED_PERCENTAGE_LIMIT` (effort-dependent: 0.2/0.3/0.4) — if too many tests failed, repair is skipped entirely
+4. Refinement only runs on the top valid candidates (count depends on effort level)

-**If repair also failed**: The optimization approach may not work for this function.
+**Checkpoint**: If repair also fails, the optimization approach likely doesn't work for this function. The function may rely on side effects or external state that the LLM can't safely optimize.

-## Key Files to Check
+## Key Files Reference

-- `optimization/function_optimizer.py` — Main optimization loop, `determine_best_candidate()`
-- `verification/test_runner.py` — Test execution
-- `api/aiservice.py` — AI service communication
-- `code_utils/config_consts.py` — Thresholds
-- `context/code_context_extractor.py` — Context extraction
-- `models/models.py` — `CandidateEvaluationContext`, `TestResults`
+| File | What to check |
+|------|---------------|
+| `optimization/function_optimizer.py` | Main loop, `determine_best_candidate()` |
+| `verification/test_runner.py` | Test subprocess execution |
+| `api/aiservice.py` | AI service requests/responses |
+| `code_utils/config_consts.py` | All thresholds and limits |
+| `context/code_context_extractor.py` | Context extraction and token counting |
+| `models/models.py` | `CandidateEvaluationContext`, `TestResults`, `TestDiff` |
+| `code_utils/deduplicate_code.py` | AST normalization for deduplication |
@@ -1,6 +1,6 @@
{
  "name": "codeflash/codeflash-skills",
-  "version": "0.1.0",
+  "version": "0.2.0",
  "summary": "Procedural workflows for developing and debugging codeflash",
  "private": true,
  "skills": {