mirror of
https://github.com/codeflash-ai/codeflash.git
synced 2026-05-04 18:25:17 +00:00
feat: add private tessl tiles for codeflash rules, docs, and skills
Three private tiles in the codeflash workspace:
- codeflash-rules: 6 steering rules (code-style, architecture, optimization-patterns, git-conventions, testing-rules, language-rules)
- codeflash-docs: 7 doc pages (domain-types, optimization-pipeline, context-extraction, verification, ai-service, configuration)
- codeflash-skills: 2 skills (debug-optimization-failure, add-codeflash-feature)
This commit is contained in:
parent
90601c3324
commit
6718e66582
20 changed files with 965 additions and 0 deletions
@ -33,3 +33,5 @@ Discovery → Ranking → Context Extraction → Test Gen + Optimization → Bas

# Agent Rules <!-- tessl-managed -->

@.tessl/RULES.md follow the [instructions](.tessl/RULES.md)

@AGENTS.md
@ -63,6 +63,15 @@

    },
    "tessl/pypi-filelock": {
      "version": "3.19.0"
    },
    "codeflash/codeflash-rules": {
      "version": "0.1.0"
    },
    "codeflash/codeflash-docs": {
      "version": "0.1.0"
    },
    "codeflash/codeflash-skills": {
      "version": "0.1.0"
    }
  }
}
108 tiles/codeflash-docs/docs/ai-service.md Normal file

@ -0,0 +1,108 @@
# AI Service

How codeflash communicates with the AI optimization backend.

## `AiServiceClient` (`api/aiservice.py`)

The client connects to the AI service at `https://app.codeflash.ai` (or `http://localhost:8000` when `CODEFLASH_AIS_SERVER=local`).

Authentication uses a Bearer token from `get_codeflash_api_key()`. All requests go through `make_ai_service_request()`, which handles JSON serialization via a Pydantic encoder.

Timeout: 90s for production, 300s for local.
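The request shape can be sketched as follows. `build_request` and the exact header names are illustrative, not codeflash's actual helper; only the base URL, Bearer auth, JSON body, and POST semantics are taken from the text above.

```python
import json
import urllib.request

AI_SERVICE_URL = "https://app.codeflash.ai"  # http://localhost:8000 when CODEFLASH_AIS_SERVER=local
TIMEOUT_SECONDS = 90  # 300 for local, per the docs above

def build_request(endpoint: str, payload: dict, api_key: str) -> urllib.request.Request:
    """Build an authenticated JSON POST request for an AI service endpoint.

    Hypothetical helper for illustration; the real client wraps this in
    make_ai_service_request() with Pydantic serialization."""
    return urllib.request.Request(
        url=AI_SERVICE_URL + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```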
## Endpoints

### `/ai/optimize` — Generate Candidates

Method: `optimize_code()`

Sends source code plus dependency context to generate optimization candidates.

Payload:
- `source_code` — The read-writable code (markdown format)
- `dependency_code` — Read-only context code
- `trace_id` — Unique trace ID for the optimization run
- `language` — `"python"`, `"javascript"`, or `"typescript"`
- `n_candidates` — Number of candidates to generate (controlled by effort level)
- `is_async` — Whether the function is async
- `is_numerical_code` — Whether the code is numerical (affects optimization strategy)

Returns: `list[OptimizedCandidate]` with `source=OptimizedCandidateSource.OPTIMIZE`
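Assembled from the field list above, an illustrative payload might look like this; all literal values are invented for the example.

```python
# Illustrative /ai/optimize payload. Field names come from the docs above;
# every value here is made up for the example.
payload = {
    "source_code": "def total(xs):\n    return sum(xs)",  # markdown-formatted in the real client
    "dependency_code": "",
    "trace_id": "trace-0001",          # hypothetical trace ID
    "language": "python",
    "n_candidates": 5,                 # MEDIUM effort, per the Configuration page
    "is_async": False,
    "is_numerical_code": False,
}
```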
### `/ai/optimize_line_profiler` — Line-Profiler-Guided Candidates

Method: `optimize_python_code_line_profiler()`

Like `/optimize`, but includes `line_profiler_results` to guide the LLM toward hot lines.

Returns: candidates with `source=OptimizedCandidateSource.OPTIMIZE_LP`

### `/ai/refine` — Refine Existing Candidate

Method: `refine_code()`

Request type: `AIServiceRefinerRequest`

Sends an existing candidate with runtime data and line profiler results to generate an improved version.

Key fields:
- `original_source_code` / `optimized_source_code` — Before and after
- `original_code_runtime` / `optimized_code_runtime` — Timing data
- `speedup` — Current speedup ratio
- `original_line_profiler_results` / `optimized_line_profiler_results`

Returns: candidates with `source=OptimizedCandidateSource.REFINE` and `parent_id` set to the refined candidate's ID
### `/ai/repair` — Fix Failed Candidate

Method: `repair_code()`

Request type: `AIServiceCodeRepairRequest`

Sends a failed candidate with test diffs showing what went wrong.

Key fields:
- `original_source_code` / `modified_source_code`
- `test_diffs: list[TestDiff]` — Each with `scope` (return_value/stdout/did_pass), original vs candidate values, and test source code

Returns: candidates with `source=OptimizedCandidateSource.REPAIR` and `parent_id` set

### `/ai/adaptive_optimize` — Multi-Candidate Adaptive

Method: `adaptive_optimize()`

Request type: `AIServiceAdaptiveOptimizeRequest`

Sends multiple previous candidates with their speedups for the LLM to learn from and generate better candidates.

Key fields:
- `candidates: list[AdaptiveOptimizedCandidate]` — Previous candidates with source code, explanation, source type, and speedup

Returns: candidates with `source=OptimizedCandidateSource.ADAPTIVE`

### `/ai/rewrite_jit` — JIT Rewrite

Method: `get_jit_rewritten_code()`

Rewrites code to use JIT compilation (e.g., Numba).

Returns: candidates with `source=OptimizedCandidateSource.JIT_REWRITE`
## Candidate Parsing

All endpoints return JSON with an `optimizations` array. Each entry has:
- `source_code` — Markdown-formatted code blocks
- `explanation` — LLM explanation
- `optimization_id` — Unique ID
- `parent_id` — Optional parent reference
- `model` — Which LLM model was used

`_get_valid_candidates()` parses the markdown code via `CodeStringsMarkdown.parse_markdown_code()` and filters out entries with empty code blocks.
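A minimal sketch of that parse-and-filter step, assuming fences of the form described on the Domain Types page (```` ```lang:path ````); `parse_blocks` is a hypothetical stand-in for the real `parse_markdown_code()`:

```python
import re

# Matches fenced blocks of the assumed form ```lang:path\ncode\n```
FENCE = re.compile(r"```(\w+):([^\n]+)\n(.*?)```", re.DOTALL)

def parse_blocks(markdown: str) -> list[tuple[str, str]]:
    """Return (file_path, code) pairs, dropping blocks whose code is empty —
    the same filtering _get_valid_candidates() is described as doing."""
    blocks = []
    for m in FENCE.finditer(markdown):
        code = m.group(3).strip()
        if code:
            blocks.append((m.group(2), code))
    return blocks
```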
## `LocalAiServiceClient`

Used when `CODEFLASH_EXPERIMENT_ID` is set. Mirrors `AiServiceClient` but sends requests to a separate experimental endpoint for A/B testing optimization strategies.

## LLM Call Sequencing

`AiServiceClient` tracks the call sequence via `llm_call_counter` (an `itertools.count`). Each request includes a `call_sequence` number, which the backend uses to maintain conversation context across multiple calls for the same function.
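The sequencing mechanism amounts to something like the following sketch; the counter's start value and the method name are assumptions.

```python
import itertools

class SequencedClient:
    """Sketch: attach a monotonically increasing call_sequence to each request
    so the backend can order all LLM calls made for one function."""

    def __init__(self) -> None:
        self.llm_call_counter = itertools.count(1)  # start value assumed

    def with_sequence(self, body: dict) -> dict:
        return {**body, "call_sequence": next(self.llm_call_counter)}
```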
79 tiles/codeflash-docs/docs/configuration.md Normal file

@ -0,0 +1,79 @@
# Configuration

Key configuration constants, effort levels, and thresholds.

## Constants (`code_utils/config_consts.py`)

### Test Execution

| Constant | Value | Description |
|----------|-------|-------------|
| `MAX_TEST_RUN_ITERATIONS` | 5 | Maximum test loop iterations |
| `INDIVIDUAL_TESTCASE_TIMEOUT` | 15s | Timeout per individual test case |
| `MAX_FUNCTION_TEST_SECONDS` | 60s | Max total time for function testing |
| `MAX_TEST_FUNCTION_RUNS` | 50 | Max test function executions |
| `MAX_CUMULATIVE_TEST_RUNTIME_NANOSECONDS` | 100ms | Max cumulative test runtime |
| `TOTAL_LOOPING_TIME` | 10s | Candidate benchmarking budget |
| `MIN_TESTCASE_PASSED_THRESHOLD` | 6 | Minimum test cases that must pass |
### Performance Thresholds

| Constant | Value | Description |
|----------|-------|-------------|
| `MIN_IMPROVEMENT_THRESHOLD` | 0.05 (5%) | Minimum speedup to accept a candidate |
| `MIN_THROUGHPUT_IMPROVEMENT_THRESHOLD` | 0.10 (10%) | Minimum async throughput improvement |
| `MIN_CONCURRENCY_IMPROVEMENT_THRESHOLD` | 0.20 (20%) | Minimum concurrency ratio improvement |
| `COVERAGE_THRESHOLD` | 60.0% | Minimum test coverage |

### Stability Thresholds

| Constant | Value | Description |
|----------|-------|-------------|
| `STABILITY_WINDOW_SIZE` | 0.35 | 35% of total iteration window |
| `STABILITY_CENTER_TOLERANCE` | 0.0025 | ±0.25% around median |
| `STABILITY_SPREAD_TOLERANCE` | 0.0025 | 0.25% window spread |
### Context Limits

| Constant | Value | Description |
|----------|-------|-------------|
| `OPTIMIZATION_CONTEXT_TOKEN_LIMIT` | 16000 | Max tokens for optimization context |
| `TESTGEN_CONTEXT_TOKEN_LIMIT` | 16000 | Max tokens for test generation context |
| `MAX_CONTEXT_LEN_REVIEW` | 1000 | Max context length for optimization review |

### Other

| Constant | Value | Description |
|----------|-------|-------------|
| `MIN_CORRECT_CANDIDATES` | 2 | Min correct candidates before skipping repair |
| `REPEAT_OPTIMIZATION_PROBABILITY` | 0.1 | Probability of re-optimizing a function |
| `DEFAULT_IMPORTANCE_THRESHOLD` | 0.001 | Minimum addressable time to consider a function |
| `CONCURRENCY_FACTOR` | 10 | Number of concurrent executions for the concurrency benchmark |
| `REFINED_CANDIDATE_RANKING_WEIGHTS` | (2, 1) | (runtime, diff) weights — runtime weighted 2x over diff |
## Effort Levels

`EffortLevel` enum: `LOW`, `MEDIUM`, `HIGH`

Effort controls the number of candidates, repairs, and refinements:

| Key | LOW | MEDIUM | HIGH |
|-----|-----|--------|------|
| `N_OPTIMIZER_CANDIDATES` | 3 | 5 | 6 |
| `N_OPTIMIZER_LP_CANDIDATES` | 4 | 6 | 7 |
| `N_GENERATED_TESTS` | 2 | 2 | 2 |
| `MAX_CODE_REPAIRS_PER_TRACE` | 2 | 3 | 5 |
| `REPAIR_UNMATCHED_PERCENTAGE_LIMIT` | 0.2 | 0.3 | 0.4 |
| `TOP_VALID_CANDIDATES_FOR_REFINEMENT` | 2 | 3 | 4 |
| `ADAPTIVE_OPTIMIZATION_THRESHOLD` | 0 | 0 | 2 |
| `MAX_ADAPTIVE_OPTIMIZATIONS_PER_TRACE` | 0 | 0 | 4 |

Use `get_effort_value(EffortKeys.KEY, effort_level)` to retrieve values.
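The lookup can be pictured as a table keyed by effort level. This re-creation uses plain strings instead of the real `EffortKeys` enum and copies three rows from the table above:

```python
from enum import Enum

class EffortLevel(Enum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2

# Hypothetical re-creation of the lookup; the real table lives in codeflash.
_EFFORT_TABLE = {
    "N_OPTIMIZER_CANDIDATES": (3, 5, 6),
    "N_OPTIMIZER_LP_CANDIDATES": (4, 6, 7),
    "MAX_CODE_REPAIRS_PER_TRACE": (2, 3, 5),
}

def get_effort_value(key: str, effort: EffortLevel) -> int:
    """Return the configured value for (key, effort level)."""
    return _EFFORT_TABLE[key][effort.value]
```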
## Project Configuration

Configuration is read from `pyproject.toml` under `[tool.codeflash]`. Key settings are auto-detected by `setup/detector.py`:
- `module-root` — Root of the module to optimize
- `tests-root` — Root of test files
- `test-framework` — pytest, unittest, jest, etc.
- `formatter-cmds` — Code formatting commands
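An illustrative `[tool.codeflash]` section using the keys listed above; the paths, framework, and formatter command are example values, and the exact value shapes are assumptions.

```toml
# Illustrative pyproject.toml fragment — all values are examples.
[tool.codeflash]
module-root = "src/mypackage"
tests-root = "tests"
test-framework = "pytest"
formatter-cmds = ["black $file"]
```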
60 tiles/codeflash-docs/docs/context-extraction.md Normal file

@ -0,0 +1,60 @@
# Context Extraction

How codeflash extracts and limits code context for optimization and test generation.

## Overview

Context extraction (`context/code_context_extractor.py`) builds a `CodeOptimizationContext` containing all the code the LLM needs to understand and optimize a function, split into:

- **Read-writable code** (`CodeContextType.READ_WRITABLE`): The function being optimized plus its helper functions — code the LLM is allowed to modify
- **Read-only context** (`CodeContextType.READ_ONLY`): Dependency code for reference — imports, type definitions, base classes
- **Testgen context** (`CodeContextType.TESTGEN`): Context for test generation; may include imported class definitions and external base class inits
- **Hashing context** (`CodeContextType.HASHING`): Used to deduplicate optimization runs

## Token Limits

Both optimization and test generation contexts are token-limited:
- `OPTIMIZATION_CONTEXT_TOKEN_LIMIT = 16000` tokens
- `TESTGEN_CONTEXT_TOKEN_LIMIT = 16000` tokens

Token counting uses `encoded_tokens_len()` from `code_utils/code_utils.py`. Functions whose context exceeds these limits are skipped.
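The skip rule reduces to a simple comparison. In this sketch, `encoded_tokens_len` is approximated by a whitespace split; the real implementation uses a proper tokenizer.

```python
OPTIMIZATION_CONTEXT_TOKEN_LIMIT = 16000

def encoded_tokens_len(code: str) -> int:
    """Crude token count stand-in; the real function uses an encoder."""
    return len(code.split())

def within_context_limit(code: str) -> bool:
    """Functions whose context exceeds the limit are skipped."""
    return encoded_tokens_len(code) <= OPTIMIZATION_CONTEXT_TOKEN_LIMIT
```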
## Context Building Process

### 1. Helper Discovery

For the target function (`FunctionToOptimize`), the extractor finds:
- **Helpers of the function**: Functions/classes in the same file that the target function calls
- **Helpers of helpers**: Transitive dependencies of the helper functions

These are organized as `dict[Path, set[FunctionSource]]` — mapping file paths to the set of helper functions found in each file.
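The shape of that helper index can be sketched as follows, with `FunctionSource` simplified to a qualified-name string for illustration:

```python
from pathlib import Path

# file path -> names of helper functions found in that file
helpers: dict[Path, set[str]] = {}

def add_helper(path: Path, qualified_name: str) -> None:
    """Record a discovered helper under the file it was found in."""
    helpers.setdefault(path, set()).add(qualified_name)

add_helper(Path("src/utils.py"), "normalize")
add_helper(Path("src/utils.py"), "validate")
```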
### 2. Code Extraction

`extract_code_markdown_context_from_files()` builds a `CodeStringsMarkdown` from the helper dictionaries. Each file's relevant code is extracted as a `CodeString` with its file path.

### 3. Testgen Context Enrichment

`build_testgen_context()` extends the basic context with:
- Imported class definitions (resolved from imports)
- External base class `__init__` methods
- External class `__init__` methods referenced in the context

### 4. Unused Definition Removal

`detect_unused_helper_functions()` and `remove_unused_definitions_by_function_names()` from `context/unused_definition_remover.py` prune definitions that are not transitively reachable from the target function, reducing token usage.

### 5. Deduplication

The hashing context (`hashing_code_context`) generates a hash (`hashing_code_context_hash`) used to detect when the same function context has already been optimized in a previous run, avoiding redundant work.
## Key Functions

| Function | Location | Purpose |
|----------|----------|---------|
| `build_testgen_context()` | `context/code_context_extractor.py` | Build enriched testgen context |
| `extract_code_markdown_context_from_files()` | `context/code_context_extractor.py` | Convert helper dicts to `CodeStringsMarkdown` |
| `detect_unused_helper_functions()` | `context/unused_definition_remover.py` | Find unused definitions |
| `remove_unused_definitions_by_function_names()` | `context/unused_definition_remover.py` | Remove unused definitions |
| `collect_top_level_defs_with_usages()` | `context/unused_definition_remover.py` | Analyze definition usage |
| `encoded_tokens_len()` | `code_utils/code_utils.py` | Count tokens in code |
153 tiles/codeflash-docs/docs/domain-types.md Normal file

@ -0,0 +1,153 @@
# Domain Types

Core data types used throughout the codeflash optimization pipeline.

## Function Representation

### `FunctionToOptimize` (`models/function_types.py`)

The canonical dataclass representing a function candidate for optimization. Works across Python, JavaScript, and TypeScript.

Key fields:
- `function_name: str` — The function name
- `file_path: Path` — Absolute path of the file where the function is defined
- `parents: list[FunctionParent]` — Parent scopes (classes/functions), each with `name` and `type`
- `starting_line` / `ending_line: Optional[int]` — Line range (1-indexed)
- `is_async: bool` — Whether the function is async
- `is_method: bool` — Whether it belongs to a class
- `language: str` — Programming language (default: `"python"`)

Key properties:
- `qualified_name` — Full dotted name including parent classes (e.g., `MyClass.my_method`)
- `top_level_parent_name` — Name of the outermost parent, or the function name if there are no parents
- `class_name` — Immediate parent class name, or `None`
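The property logic can be sketched from the documented fields alone; this is a simplified stand-in, not codeflash's actual class.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional

@dataclass
class FunctionParent:
    name: str   # e.g. "MyClass"
    type: str   # e.g. "ClassDef"

@dataclass
class FunctionToOptimize:
    function_name: str
    file_path: Path
    parents: list = field(default_factory=list)

    @property
    def qualified_name(self) -> str:
        """Dotted name including parent scopes."""
        return ".".join([p.name for p in self.parents] + [self.function_name])

    @property
    def class_name(self) -> Optional[str]:
        """Immediate parent class name, or None for top-level functions."""
        return self.parents[-1].name if self.parents else None
```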
### `FunctionParent` (`models/function_types.py`)

Represents a parent scope: `name: str` (e.g., `"MyClass"`) and `type: str` (e.g., `"ClassDef"`).

### `FunctionSource` (`models/models.py`)

Represents a resolved function with source code. Used for helper functions in context extraction.

Fields: `file_path`, `qualified_name`, `fully_qualified_name`, `only_function_name`, `source_code`, `jedi_definition`.
## Code Representation

### `CodeString` (`models/models.py`)

A single code block with validated syntax:
- `code: str` — The source code
- `file_path: Optional[Path]` — Origin file path
- `language: str` — Language for validation (default: `"python"`)

Validates syntax on construction via a `model_validator`.

### `CodeStringsMarkdown` (`models/models.py`)

A collection of `CodeString` blocks — the primary format for passing code through the pipeline.

Key properties:
- `.flat` — Combined source code with file-path comment prefixes (e.g., `# file: path/to/file.py`)
- `.markdown` — Markdown-formatted with fenced code blocks: `` ```python:filepath\ncode\n``` ``
- `.file_to_path()` — Dict mapping file path strings to code

Static method:
- `parse_markdown_code(markdown_code, expected_language)` — Parses markdown code blocks back into `CodeStringsMarkdown`
## Optimization Context

### `CodeOptimizationContext` (`models/models.py`)

Holds all code context needed for optimization:
- `read_writable_code: CodeStringsMarkdown` — Code the LLM can modify
- `read_only_context_code: str` — Reference-only dependency code
- `testgen_context: CodeStringsMarkdown` — Context for test generation
- `hashing_code_context: str` / `hashing_code_context_hash: str` — For deduplication
- `helper_functions: list[FunctionSource]` — Helper functions in the writable code
- `preexisting_objects: set[tuple[str, tuple[FunctionParent, ...]]]` — Objects that already exist in the code

### `CodeContextType` enum (`models/models.py`)

Defines context categories: `READ_WRITABLE`, `READ_ONLY`, `TESTGEN`, `HASHING`.

## Candidates

### `OptimizedCandidate` (`models/models.py`)

A generated code variant:
- `source_code: CodeStringsMarkdown` — The optimized code
- `explanation: str` — LLM explanation of the optimization
- `optimization_id: str` — Unique identifier
- `source: OptimizedCandidateSource` — How it was generated
- `parent_id: str | None` — ID of the parent candidate (for refinements/repairs)
- `model: str | None` — Which LLM model generated it

### `OptimizedCandidateSource` enum (`models/models.py`)

How a candidate was generated: `OPTIMIZE`, `OPTIMIZE_LP` (line profiler), `REFINE`, `REPAIR`, `ADAPTIVE`, `JIT_REWRITE`.

### `CandidateEvaluationContext` (`models/models.py`)

Tracks state during candidate evaluation:
- `speedup_ratios` / `optimized_runtimes` / `is_correct` — Per-candidate results
- `ast_code_to_id` — Deduplication map (normalized AST → first-seen candidate)
- `valid_optimizations` — Candidates that passed all checks

Key methods: `record_failed_candidate()`, `record_successful_candidate()`, `handle_duplicate_candidate()`, `register_new_candidate()`.
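The AST-keyed dedup map can be sketched with the standard library: parsing and re-dumping collapses formatting and comment differences, so re-submitted variants map to the first-seen candidate. `register` is a hypothetical stand-in for `register_new_candidate()`.

```python
import ast

def normalized(code: str) -> str:
    """Normalize source by round-tripping through the AST."""
    return ast.dump(ast.parse(code))

ast_code_to_id: dict[str, str] = {}

def register(code: str, candidate_id: str) -> str:
    """Return the first-seen candidate ID for this normalized code."""
    return ast_code_to_id.setdefault(normalized(code), candidate_id)
```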
## Baseline & Results

### `OriginalCodeBaseline` (`models/models.py`)

Baseline measurements for the original code:
- `behavior_test_results: TestResults` / `benchmarking_test_results: TestResults`
- `line_profile_results: dict`
- `runtime: int` — Total runtime in nanoseconds
- `coverage_results: Optional[CoverageData]`

### `BestOptimization` (`models/models.py`)

The winning candidate after evaluation:
- `candidate: OptimizedCandidate`
- `helper_functions: list[FunctionSource]`
- `code_context: CodeOptimizationContext`
- `runtime: int`
- `winning_behavior_test_results` / `winning_benchmarking_test_results: TestResults`
## Test Types

### `TestType` enum (`models/test_type.py`)

- `EXISTING_UNIT_TEST` (1) — Pre-existing tests from the codebase
- `INSPIRED_REGRESSION` (2) — Tests inspired by existing tests
- `GENERATED_REGRESSION` (3) — AI-generated regression tests
- `REPLAY_TEST` (4) — Tests from recorded benchmark data
- `CONCOLIC_COVERAGE_TEST` (5) — Coverage-guided tests
- `INIT_STATE_TEST` (6) — Class init state verification

### `TestFile` / `TestFiles` (`models/models.py`)

`TestFile` represents a single test file with `instrumented_behavior_file_path`, an optional `benchmarking_file_path`, `original_file_path`, `test_type`, and `tests_in_file`.

`TestFiles` is a collection with lookup methods: `get_by_type()`, `get_by_original_file_path()`, `get_test_type_by_instrumented_file_path()`.
### `TestResults` (`models/models.py`)

A collection of `FunctionTestInvocation` results with indexed lookup. Key methods:
- `add(invocation)` — Deduplicated insert
- `total_passed_runtime()` — Sum of minimum runtimes per test case (nanoseconds)
- `number_of_loops()` — Max loop index across all results
- `usable_runtime_data_by_test_case()` — Dict of invocation ID → list of runtimes
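The `total_passed_runtime()` aggregation described above — minimum runtime per test case across loops, summed over test cases — reduces to a one-liner over the grouped timing data:

```python
def total_passed_runtime(runtimes_by_case: dict[str, list[int]]) -> int:
    """Per test case, take the fastest observed loop; sum across cases (ns).
    Input shape mirrors usable_runtime_data_by_test_case()."""
    return sum(min(times) for times in runtimes_by_case.values())
```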
## Result Type

### `Result[L, R]` / `Success` / `Failure` (`either.py`)

Functional error-handling type:
- `Success(value)` — Wraps a successful result
- `Failure(error)` — Wraps an error
- `result.is_successful()` / `result.is_failure()` — Check type
- `result.unwrap()` — Get the success value (raises if Failure)
- `result.failure()` — Get the failure value (raises if Success)
- `is_successful(result)` — Module-level helper function
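A minimal sketch of the Success/Failure pattern with the methods listed above; the real types in codeflash's `either.py` are more elaborate (generics, separate classes), but the behavior is the same idea.

```python
class Result:
    """Either-style wrapper: holds a success value or an error."""

    def __init__(self, value, ok: bool) -> None:
        self._value, self._ok = value, ok

    def is_successful(self) -> bool:
        return self._ok

    def is_failure(self) -> bool:
        return not self._ok

    def unwrap(self):
        if not self._ok:
            raise ValueError("unwrap() called on Failure")
        return self._value

    def failure(self):
        if self._ok:
            raise ValueError("failure() called on Success")
        return self._value

def Success(value) -> Result:
    return Result(value, True)

def Failure(error) -> Result:
    return Result(error, False)

def is_successful(result: Result) -> bool:
    """Module-level helper, as in the list above."""
    return result.is_successful()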
41 tiles/codeflash-docs/docs/index.md Normal file

@ -0,0 +1,41 @@
# Codeflash Internal Documentation

CodeFlash is an AI-powered Python code optimizer that automatically improves code performance while maintaining correctness. It uses LLMs to generate optimization candidates, verifies correctness through test execution, and benchmarks performance improvements.

## Pipeline Overview

```
Discovery → Ranking → Context Extraction → Test Gen + Optimization → Baseline → Candidate Evaluation → PR
```

1. **Discovery** (`discovery/`): Find optimizable functions across the codebase using `FunctionVisitor`
2. **Ranking** (`benchmarking/function_ranker.py`): Rank functions by addressable time using trace data
3. **Context** (`context/`): Extract code dependencies — split into read-writable (modifiable) and read-only (reference)
4. **Optimization** (`optimization/`, `api/`): Generate candidates via the AI service; runs concurrently with test generation
5. **Verification** (`verification/`): Run candidates against tests via a custom pytest plugin and compare outputs
6. **Benchmarking** (`benchmarking/`): Measure performance and select the best candidate by speedup
7. **Result** (`result/`, `github/`): Create a PR with the winning optimization
## Key Entry Points

| Task | File |
|------|------|
| CLI arguments & commands | `cli_cmds/cli.py` |
| Optimization orchestration | `optimization/optimizer.py` → `Optimizer.run()` |
| Per-function optimization | `optimization/function_optimizer.py` → `FunctionOptimizer` |
| Function discovery | `discovery/functions_to_optimize.py` |
| Context extraction | `context/code_context_extractor.py` |
| Test execution | `verification/test_runner.py`, `verification/pytest_plugin.py` |
| Performance ranking | `benchmarking/function_ranker.py` |
| Domain types | `models/models.py`, `models/function_types.py` |
| AI service | `api/aiservice.py` → `AiServiceClient` |
| Configuration | `code_utils/config_consts.py` |

## Documentation Pages

- [Domain Types](domain-types.md) — Core data types and their relationships
- [Optimization Pipeline](optimization-pipeline.md) — Step-by-step data flow through the pipeline
- [Context Extraction](context-extraction.md) — How code context is extracted and token-limited
- [Verification](verification.md) — Test execution, pytest plugin, deterministic patches
- [AI Service](ai-service.md) — AI service client endpoints and request types
- [Configuration](configuration.md) — Config schema, effort levels, thresholds
84 tiles/codeflash-docs/docs/optimization-pipeline.md Normal file

@ -0,0 +1,84 @@
# Optimization Pipeline

Step-by-step data flow from function discovery to PR creation.

## 1. Entry Point: `Optimizer.run()` (`optimization/optimizer.py`)

The `Optimizer` class is initialized with CLI args and creates:
- `TestConfig` with test roots, project root, and the pytest command
- `AiServiceClient` for AI service communication
- An optional `LocalAiServiceClient` for experiments

`run()` orchestrates the full pipeline: it discovers functions, optionally ranks them, then optimizes each in turn.

## 2. Function Discovery (`discovery/functions_to_optimize.py`)

`FunctionVisitor` traverses source files to find optimizable functions, producing `FunctionToOptimize` instances. Filters include:
- Skipping functions that are too small or trivial
- Skipping previously optimized functions (via `was_function_previously_optimized()`)
- Applying user-configured include/exclude patterns

## 3. Function Ranking (`benchmarking/function_ranker.py`)

When trace data is available, `FunctionRanker` ranks functions by **addressable time** — the time a function spends that could be optimized (own time + callee time / call count). Functions below `DEFAULT_IMPORTANCE_THRESHOLD=0.001` are skipped.
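Spelling out the formula above: the interpretation of the importance threshold as a fraction of total traced time is an assumption for this sketch, not confirmed by the source.

```python
DEFAULT_IMPORTANCE_THRESHOLD = 0.001

def addressable_time(own_time: float, callee_time: float, call_count: int) -> float:
    """own time + callee time / call count, per the description above."""
    return own_time + callee_time / call_count

def is_worth_optimizing(func_time: float, total_time: float) -> bool:
    """ASSUMED reading: skip functions whose share of total traced time
    falls below the importance threshold."""
    return func_time / total_time >= DEFAULT_IMPORTANCE_THRESHOLD
```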
## 4. Per-Function Optimization: `FunctionOptimizer` (`optimization/function_optimizer.py`)

For each function, `FunctionOptimizer.optimize_function()` runs the full optimization loop:

### 4a. Context Extraction (`context/code_context_extractor.py`)

Extracts a `CodeOptimizationContext` containing:
- `read_writable_code` — Code the LLM can modify (the function + helpers)
- `read_only_context_code` — Dependency code for reference only
- `testgen_context` — Context for test generation (may include imported class definitions)

Token limits are enforced: `OPTIMIZATION_CONTEXT_TOKEN_LIMIT=16000` and `TESTGEN_CONTEXT_TOKEN_LIMIT=16000`. Functions exceeding these are rejected.

### 4b. Concurrent Test Generation + LLM Optimization

These run in parallel using `concurrent.futures`:
- **Test generation**: Generates regression tests from the function context
- **LLM optimization**: Sends `read_writable_code.markdown` + `read_only_context_code` to the AI service

The number of candidates depends on the effort level (see the Configuration docs).
### 4c. Candidate Evaluation

For each `OptimizedCandidate`:

1. **Deduplication**: Normalize the code's AST and check against `CandidateEvaluationContext.ast_code_to_id`. If it is a duplicate, copy the results from the previous evaluation.

2. **Code replacement**: Replace the original function with the candidate using `replace_function_definitions_in_module()`.

3. **Behavioral testing**: Run instrumented tests in a subprocess. The custom pytest plugin applies deterministic patches. Compare return values, stdout, and pass/fail status against the original baseline.

4. **Benchmarking**: If behavior matches, run performance tests with looping (`TOTAL_LOOPING_TIME=10s`) and calculate the speedup ratio.

5. **Validation**: The candidate must beat `MIN_IMPROVEMENT_THRESHOLD=0.05` (5% speedup) and pass stability checks.
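The speedup gate can be sketched as follows, assuming the common convention `ratio = (original − optimized) / optimized` (so 0.5 means 50% faster); the source does not spell out the exact formula.

```python
MIN_IMPROVEMENT_THRESHOLD = 0.05  # 5%, from config_consts

def speedup_ratio(original_ns: int, optimized_ns: int) -> float:
    """Relative improvement under the assumed convention."""
    return (original_ns - optimized_ns) / optimized_ns

def beats_threshold(original_ns: int, optimized_ns: int) -> bool:
    """A candidate must beat the minimum improvement threshold."""
    return speedup_ratio(original_ns, optimized_ns) > MIN_IMPROVEMENT_THRESHOLD
```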
### 4d. Refinement & Repair

- **Repair**: If fewer than `MIN_CORRECT_CANDIDATES=2` pass, failed candidates can be repaired via `AIServiceCodeRepairRequest` (sends test diffs to the LLM).
- **Refinement**: Top valid candidates are refined via `AIServiceRefinerRequest` (sends runtime data and line profiler results).
- **Adaptive**: At HIGH effort, additional adaptive optimization rounds run via `AIServiceAdaptiveOptimizeRequest`.

### 4e. Best Candidate Selection

The winning candidate is selected by:
1. Highest speedup ratio
2. For tied speedups, shortest diff length from the original
3. Refinement candidates use weighted ranking: `(2 * runtime_rank + 1 * diff_rank)`

The result is a `BestOptimization` with the candidate, context, test results, and runtime.
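The weighted ranking for refinement candidates reduces to a small scoring function (lower score wins; the weights come from `REFINED_CANDIDATE_RANKING_WEIGHTS = (2, 1)` in the Configuration docs):

```python
RUNTIME_WEIGHT, DIFF_WEIGHT = 2, 1  # runtime weighted 2x over diff size

def weighted_rank(runtime_rank: int, diff_rank: int) -> int:
    """Combined rank: 2 * runtime_rank + 1 * diff_rank, lower is better."""
    return RUNTIME_WEIGHT * runtime_rank + DIFF_WEIGHT * diff_rank

# Hypothetical example: name -> (runtime_rank, diff_rank), 1 = best
candidates = {"cand-a": (1, 3), "cand-b": (2, 2)}
best = min(candidates, key=lambda name: weighted_rank(*candidates[name]))
```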
## 5. PR Creation (`github/`)

If a winning candidate is found, a PR is created with:
- The optimized code diff
- Performance benchmark details
- The explanation from the LLM

## Worktree Mode

When `--worktree` is enabled, optimization runs in an isolated git worktree (`code_utils/git_worktree_utils.py`). This allows parallel optimization without affecting the working tree. Changes are captured as patch files.
93 tiles/codeflash-docs/docs/verification.md Normal file

@ -0,0 +1,93 @@
# Verification

How codeflash verifies candidate correctness and measures performance.

## Test Execution Architecture

Tests are executed in a **subprocess** to isolate the test environment from the main codeflash process. The test runner (`verification/test_runner.py`) invokes pytest (or Jest for JS/TS) with specific plugin configurations.

### Plugin Blocklists

- **Behavioral tests**: Block `benchmark`, `codspeed`, `xdist`, `sugar`
- **Benchmarking tests**: Block `codspeed`, `cov`, `benchmark`, `profiling`, `xdist`, `sugar`

These are defined as `BEHAVIORAL_BLOCKLISTED_PLUGINS` and `BENCHMARKING_BLOCKLISTED_PLUGINS` in `verification/test_runner.py`.
## Custom Pytest Plugin (`verification/pytest_plugin.py`)

The plugin is loaded into the test subprocess and provides:

### Deterministic Patches

`_apply_deterministic_patches()` replaces non-deterministic functions with fixed values to ensure reproducible test output:

| Module | Function | Fixed Value |
|--------|----------|-------------|
| `time` | `time()` | `1761717605.108106` |
| `time` | `perf_counter()` | Incrementing by 1ms per call |
| `datetime` | `datetime.now()` | `2021-01-01 02:05:10 UTC` |
| `datetime` | `datetime.utcnow()` | `2021-01-01 02:05:10 UTC` |
| `uuid` | `uuid4()` / `uuid1()` | `12345678-1234-5678-9abc-123456789012` |
| `random` | `random()` | `0.123456789` (seeded with 42) |
| `os` | `urandom(n)` | `b"\x42" * n` |
| `numpy.random` | seed | `42` |

Patches call the original function first to maintain performance characteristics (same call overhead).
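The call-the-original-then-return-fixed pattern looks roughly like this sketch for `time.time()`; the fixed constant is from the table above, the rest is illustrative.

```python
import time

_original_time = time.time
FIXED_TIME = 1761717605.108106  # fixed value from the table above

def _patched_time() -> float:
    _original_time()  # call the original first to preserve its call overhead
    return FIXED_TIME

time.time = _patched_time
```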
### Timing Markers

Test results include timing markers in stdout: `!######<id>:<duration_ns>######!`

The `_TIMING_MARKER_PATTERN` regex extracts this timing data to calculate the function utilization fraction.
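Extraction of those markers can be sketched with a regex built from the documented format; the exact pattern codeflash uses may differ.

```python
import re

# Matches the documented marker format: !######<id>:<duration_ns>######!
TIMING_MARKER = re.compile(r"!######(.+?):(\d+)######!")

def extract_timings(stdout: str) -> dict[str, int]:
    """Map each marker ID to its duration in nanoseconds."""
    return {m.group(1): int(m.group(2)) for m in TIMING_MARKER.finditer(stdout)}
```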
### Loop Stability

Performance benchmarking uses configurable stability thresholds:
- `STABILITY_WINDOW_SIZE = 0.35` (35% of total iterations)
- `STABILITY_CENTER_TOLERANCE = 0.0025` (±0.25% around the median)
- `STABILITY_SPREAD_TOLERANCE = 0.0025` (0.25% window spread)
### Memory Limits (Linux)

On Linux, the plugin sets `RLIMIT_AS` to 85% of total system memory (RAM + swap) to prevent OOM kills.
## Test Result Processing
|
||||
|
||||
### `TestResults` (`models/models.py`)
|
||||
|
||||
Collects `FunctionTestInvocation` results with:
|
||||
- Deduplicated insertion via `unique_invocation_loop_id`
|
||||
- `total_passed_runtime()` — Sum of minimum runtimes per test case (nanoseconds)
|
||||
- `number_of_loops()` — Max loop index
|
||||
- `usable_runtime_data_by_test_case()` — Grouped timing data
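The "sum of minimum runtimes per test case" rule above can be sketched on hypothetical invocation tuples (the real type stores `FunctionTestInvocation` objects):

```python
from collections import defaultdict

# Hypothetical rows: (test_case_id, loop_index, runtime_ns, did_pass)
invocations = [
    ("test_a", 1, 1200, True),
    ("test_a", 2, 1100, True),
    ("test_b", 1, 5000, True),
    ("test_b", 2, 5300, True),
    ("test_c", 1, 900, False),  # failed invocations contribute nothing
]


def total_passed_runtime(rows) -> int:
    # For each test case, keep only the minimum passing runtime, then sum
    best = defaultdict(lambda: float("inf"))
    for test_id, _loop, runtime_ns, did_pass in rows:
        if did_pass:
            best[test_id] = min(best[test_id], runtime_ns)
    return int(sum(best.values()))
```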
### `FunctionTestInvocation`

Each invocation records:

- `loop_index` — Iteration number (starts at 1)
- `id: InvocationId` — Fully qualified test identifier
- `did_pass: bool` — Pass/fail status
- `runtime: Optional[int]` — Time in nanoseconds
- `return_value: Optional[object]` — Captured return value
- `test_type: TestType` — Which test category

### Behavioral vs Performance Testing

1. **Behavioral**: Runs with `TestingMode.BEHAVIOR`. Compares return values and stdout between original and candidate. Any difference = candidate rejected.
2. **Performance**: Runs with `TestingMode.PERFORMANCE`. Loops for `TOTAL_LOOPING_TIME=10s` to get stable timing. Calculates speedup ratio.
3. **Line Profile**: Runs with `TestingMode.LINE_PROFILE`. Collects per-line timing data for refinement.

## Test Types

| TestType | Value | Description |
|----------|-------|-------------|
| `EXISTING_UNIT_TEST` | 1 | Pre-existing tests from the codebase |
| `INSPIRED_REGRESSION` | 2 | Tests inspired by existing tests |
| `GENERATED_REGRESSION` | 3 | AI-generated regression tests |
| `REPLAY_TEST` | 4 | Tests from recorded benchmark data |
| `CONCOLIC_COVERAGE_TEST` | 5 | Coverage-guided tests |
| `INIT_STATE_TEST` | 6 | Class init state verification |

## Coverage

Coverage is measured via `CoverageData` with a threshold of `COVERAGE_THRESHOLD=60.0%`. Low coverage may affect confidence in the optimization's correctness.
7
tiles/codeflash-docs/tile.json
Normal file
{
  "name": "codeflash/codeflash-docs",
  "version": "0.1.0",
  "summary": "Internal documentation for the codeflash optimization engine",
  "private": true,
  "docs": "docs/index.md"
}
45
tiles/codeflash-rules/rules/architecture.md
Normal file
# Architecture

```
codeflash/
├── main.py                   # CLI entry point
├── cli_cmds/                 # Command handling, console output (Rich)
├── discovery/                # Find optimizable functions
├── context/                  # Extract code dependencies and imports
├── optimization/             # Generate optimized code via AI
│   ├── optimizer.py          # Main optimization orchestration
│   └── function_optimizer.py # Per-function optimization logic
├── verification/             # Run deterministic tests (pytest plugin)
├── benchmarking/             # Performance measurement
├── github/                   # PR creation
├── api/                      # AI service communication
├── code_utils/               # Code parsing, git utilities
├── models/                   # Pydantic models and types
├── languages/                # Multi-language support (Python, JavaScript/TypeScript)
├── setup/                    # Config schema, auto-detection, first-run experience
├── picklepatch/              # Serialization/deserialization utilities
├── tracing/                  # Function call tracing
├── tracer.py                 # Root-level tracer entry point for profiling
├── lsp/                      # IDE integration (Language Server Protocol)
├── telemetry/                # Sentry, PostHog
├── either.py                 # Functional Result type for error handling
├── result/                   # Result types and handling
└── version.py                # Version information
```

## Key Entry Points

| Task | Start here |
|------|------------|
| CLI arguments & commands | `cli_cmds/cli.py` |
| Optimization orchestration | `optimization/optimizer.py` → `Optimizer.run()` |
| Per-function optimization | `optimization/function_optimizer.py` → `FunctionOptimizer` |
| Function discovery | `discovery/functions_to_optimize.py` |
| Context extraction | `context/code_context_extractor.py` |
| Test execution | `verification/test_runner.py`, `verification/pytest_plugin.py` |
| Performance ranking | `benchmarking/function_ranker.py` |
| Domain types | `models/models.py`, `models/function_types.py` |
| Result handling | `either.py` (`Result`, `Success`, `Failure`, `is_successful`) |
| AI service communication | `api/aiservice.py` → `AiServiceClient` |
| Configuration constants | `code_utils/config_consts.py` |
| Language support | `languages/registry.py` → `get_language_support()` |
11
tiles/codeflash-rules/rules/code-style.md
Normal file
# Code Style

- **Line length**: 120 characters
- **Python**: 3.9+ syntax (use `from __future__ import annotations` for type hints)
- **Package management**: Always use `uv`, never `pip` — run commands via `uv run`
- **Tooling**: Ruff for linting/formatting, mypy strict mode, prek for pre-commit checks (`uv run prek run`)
- **Comments**: Minimal — only explain "why", not "what"
- **Docstrings**: Do not add unless explicitly requested
- **Naming**: NEVER use leading underscores (`_function_name`) — Python has no true private functions, use public names
- **Paths**: Always use absolute `Path` objects, handle encoding explicitly (UTF-8)
- **Source transforms**: Use `libcst` for code modification/transformation to preserve formatting; `ast` is acceptable for read-only analysis and parsing
9
tiles/codeflash-rules/rules/git-conventions.md
Normal file
# Git Conventions

- **Always create a new branch from `main`** — never commit directly to `main` or reuse an existing feature branch for unrelated changes
- Use conventional commit format: `fix:`, `feat:`, `refactor:`, `docs:`, `test:`, `chore:`
- Keep commits atomic — one logical change per commit
- Commit message body should be concise (1-2 sentences max)
- PR titles should also use conventional format
- Branch naming: `cf-#-title` (lowercase, hyphenated) where `#` is the Linear issue number
- If related to a Linear issue, include `CF-#` in the PR body
9
tiles/codeflash-rules/rules/language-rules.md
Normal file
# Language Support Rules

- Current language is a module-level singleton in `languages/current.py` — use `set_current_language()` / `current_language()`, never pass language as a parameter through call chains
- Use `get_language_support(identifier)` from `languages/registry.py` to get a `LanguageSupport` instance — accepts `Path`, `Language` enum, or string; never import language classes directly
- New language support classes must use the `@register_language` decorator to register with the extension and language registries
- `languages/__init__.py` uses `__getattr__` for lazy imports to avoid circular dependencies — follow this pattern when adding new exports
- `is_javascript()` returns `True` for both JavaScript and TypeScript
- Language modules are lazily imported on first `get_language_support()` call via `_ensure_languages_registered()` — the `@register_language` decorator fires on import and populates `_EXTENSION_REGISTRY` and `_LANGUAGE_REGISTRY`
- `LanguageSupport` instances are cached in `_SUPPORT_CACHE` — use `clear_cache()` only in tests
11
tiles/codeflash-rules/rules/optimization-patterns.md
Normal file
# Optimization Pipeline Patterns

- All major operations return `Result[SuccessType, ErrorType]` — construct with `Success(value)` / `Failure(error)`, check with `is_successful()` before calling `unwrap()`
- Code context has token limits (`OPTIMIZATION_CONTEXT_TOKEN_LIMIT=16000`, `TESTGEN_CONTEXT_TOKEN_LIMIT=16000` in `code_utils/config_consts.py`) — exceeding them rejects the function
- `read_writable_code` (modifiable code) can span multiple files; `read_only_context_code` is reference-only dependency code
- Code is serialized as markdown code blocks: `` ```language:filepath\ncode\n``` `` — see `CodeStringsMarkdown` in `models/models.py`
- Candidates form a forest (DAG): refinements/repairs reference `parent_id` on previous candidates via `OptimizedCandidateSource` (OPTIMIZE, REFINE, REPAIR, ADAPTIVE, JIT_REWRITE)
- Test generation and optimization run concurrently — coordinate through `CandidateEvaluationContext`
- Generated tests are instrumented with `codeflash_capture.py` to record return values and traces
- Minimum improvement threshold is 5% (`MIN_IMPROVEMENT_THRESHOLD=0.05`) — candidates below this are rejected
- Stability thresholds: `STABILITY_WINDOW_SIZE=0.35`, `STABILITY_CENTER_TOLERANCE=0.0025`, `STABILITY_SPREAD_TOLERANCE=0.0025`
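The markdown serialization rule above can be sketched with a hypothetical helper (the real type is `CodeStringsMarkdown`):

```python
def to_markdown(language: str, file_path: str, code: str) -> str:
    # Produces: <fence>language:filepath \n code \n <fence>
    fence = "`" * 3
    return f"{fence}{language}:{file_path}\n{code}\n{fence}"


block = to_markdown("python", "src/app.py", "def f():\n    return 1")
```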
13
tiles/codeflash-rules/rules/testing-rules.md
Normal file
# Testing Rules

- Code context extraction and replacement tests must assert full string equality — no substring matching
- Use pytest's `tmp_path` fixture for temp directories (it's a `Path` object)
- Write temp files inside `tmp_path`, never use `NamedTemporaryFile` (causes Windows file contention)
- Always call `.resolve()` on Path objects to ensure absolute paths and resolve symlinks
- Use `.as_posix()` when converting resolved paths to strings (normalizes to forward slashes)
- Any new feature or bug fix that can be tested automatically must have test cases
- If changes affect existing test expectations, update the tests accordingly — tests must always pass after changes
- The pytest plugin patches `time`, `random`, `uuid`, `datetime`, `os.urandom`, and `numpy.random` for deterministic test execution — never assume real randomness or real time in verification tests
- `conftest.py` uses an autouse fixture that calls `reset_current_language()` — tests always start with Python as the default language
- Test types are defined by the `TestType` enum: `EXISTING_UNIT_TEST`, `INSPIRED_REGRESSION`, `GENERATED_REGRESSION`, `REPLAY_TEST`, `CONCOLIC_COVERAGE_TEST`, `INIT_STATE_TEST`
- Verification runs tests in a subprocess using a custom pytest plugin (`verification/pytest_plugin.py`) — behavioral tests use blocklisted plugins (`benchmark`, `codspeed`, `xdist`, `sugar`), benchmarking tests additionally block `cov` and `profiling`
26
tiles/codeflash-rules/tile.json
Normal file
{
  "name": "codeflash/codeflash-rules",
  "version": "0.1.0",
  "summary": "Coding standards and conventions for the codeflash codebase",
  "private": true,
  "rules": {
    "code-style": {
      "rules": "rules/code-style.md"
    },
    "architecture": {
      "rules": "rules/architecture.md"
    },
    "optimization-patterns": {
      "rules": "rules/optimization-patterns.md"
    },
    "git-conventions": {
      "rules": "rules/git-conventions.md"
    },
    "testing-rules": {
      "rules": "rules/testing-rules.md"
    },
    "language-rules": {
      "rules": "rules/language-rules.md"
    }
  }
}
96
tiles/codeflash-skills/skills/add-codeflash-feature/SKILL.md
Normal file
---
name: add-codeflash-feature
description: Step-by-step workflow for adding a new feature to the codeflash codebase
---

# Add Codeflash Feature

Use this workflow when implementing a new feature in the codeflash codebase.

## Step 1: Identify Target Modules

Determine which module(s) need modification based on the feature:

| Feature area | Primary module | Key files |
|-------------|----------------|-----------|
| New optimization strategy | `optimization/` | `function_optimizer.py`, `optimizer.py` |
| New test type | `verification/`, `models/` | `test_runner.py`, `pytest_plugin.py`, `test_type.py` |
| New AI service endpoint | `api/` | `aiservice.py` |
| New language support | `languages/` | Create new `languages/<lang>/support.py` |
| Context extraction change | `context/` | `code_context_extractor.py` |
| New CLI command | `cli_cmds/` | `cli.py` |
| New config option | `setup/`, `code_utils/` | `config_consts.py`, `setup/detector.py` |
| Discovery filter | `discovery/` | `functions_to_optimize.py` |
| PR/result changes | `github/`, `result/` | Relevant handlers |

## Step 2: Follow Result Type Pattern

Use the `Result[L, R]` type from `either.py` for error handling in pipeline operations:

```python
from codeflash.either import Failure, Result, Success, is_successful

def my_operation() -> Result[str, MyResultType]:
    if error_condition:
        return Failure("descriptive error message")
    return Success(result_value)

# Usage:
result = my_operation()
if not is_successful(result):
    logger.error(result.failure())
    return
value = result.unwrap()
```

## Step 3: Add Configuration Constants

If the feature needs configurable thresholds or limits:

1. Add constants to `code_utils/config_consts.py`
2. If effort-dependent, add to `EFFORT_VALUES` dict with values for `LOW`, `MEDIUM`, `HIGH`
3. Add a corresponding `EffortKeys` enum entry
4. Access via `get_effort_value(EffortKeys.MY_KEY, effort_level)`
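The steps above can be sketched with a self-contained mirror of the pattern (the key name and values here are invented for illustration; real entries live in `code_utils/config_consts.py`):

```python
from enum import Enum


class EffortKeys(Enum):
    MAX_CANDIDATES = "max_candidates"  # hypothetical key


# Hypothetical per-effort values for the new key
EFFORT_VALUES = {
    EffortKeys.MAX_CANDIDATES: {"LOW": 3, "MEDIUM": 5, "HIGH": 8},
}


def get_effort_value(key: EffortKeys, effort_level: str):
    return EFFORT_VALUES[key][effort_level]
```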
## Step 4: Add Domain Types

If new data structures are needed:

1. Add Pydantic models or frozen dataclasses to `models/models.py` or `models/function_types.py`
2. Use `@dataclass(frozen=True)` for immutable data
3. Use `BaseModel` for models that need serialization
4. Keep `function_types.py` dependency-free (no imports from other codeflash modules)
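For example, a frozen dataclass in this style (the type here is hypothetical; use `pydantic.BaseModel` instead when serialization is needed):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateTiming:
    candidate_id: str
    runtime_ns: int


timing = CandidateTiming(candidate_id="abc123", runtime_ns=1200)
```

Attempting to assign to a field of a frozen instance raises `FrozenInstanceError`, which is what makes the type safe to share across the pipeline.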
## Step 5: Write Tests

Follow existing test patterns:

1. Create test files in the `tests/` directory mirroring the source structure
2. Use pytest's `tmp_path` fixture for temp directories
3. Always call `.resolve()` on Path objects
4. Assert full string equality for code context tests — no substring matching
5. Remember the pytest plugin patches `time`, `random`, `uuid`, `datetime` — don't rely on real values

## Step 6: Run Quality Checks

Run all validation before committing:

```bash
# Pre-commit checks (ruff format + lint)
uv run prek run

# Type checking
uv run mypy codeflash/

# Run relevant tests
uv run pytest tests/path/to/relevant/tests -x
```

## Step 7: Language Support Considerations

If the feature needs to work across languages:

1. Check if the feature uses language-specific APIs — use `get_language_support(identifier)` from `languages/registry.py`
2. Current language is a singleton: `set_current_language()` / `current_language()` from `languages/current.py`
3. Use `is_python()` / `is_javascript()` guards for language-specific branches
4. New language support classes must use `@register_language` decorator
95
tiles/codeflash-skills/skills/debug-optimization-failure/SKILL.md
Normal file
---
name: debug-optimization-failure
description: Debug why a codeflash optimization failed at any pipeline stage
---

# Debug Optimization Failure

Use this workflow when an optimization run fails or produces no results. Work through the stages sequentially — stop at the first failure found.

## Step 1: Check Function Discovery

Determine if the function was discovered by `FunctionVisitor`.

1. Look at the discovery output or logs for the function name
2. Check `discovery/functions_to_optimize.py` — the `FunctionVisitor` filters out:
   - Functions that are too small or trivial
   - Functions matching exclude patterns in config
   - Functions already optimized (`was_function_previously_optimized()`)
3. Verify the function file is under the configured `module-root`

**If not discovered**: Check config patterns, file location, and function size.

## Step 2: Check Ranking

If trace data is used, check if the function was ranked high enough.

1. Look at `benchmarking/function_ranker.py` output
2. The function's **addressable time** must exceed `DEFAULT_IMPORTANCE_THRESHOLD=0.001`
3. Addressable time = own time + callee time / call count

**If ranked too low**: The function doesn't spend enough time to be worth optimizing.
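As a worked example of the threshold check (this is one possible reading of the formula above; see `benchmarking/function_ranker.py` for the exact computation):

```python
DEFAULT_IMPORTANCE_THRESHOLD = 0.001


def addressable_fraction(own_ns: float, callee_ns: float, calls: int, total_ns: float) -> float:
    # Own time plus per-call callee time, as a fraction of total program time
    addressable_ns = own_ns + callee_ns / calls
    return addressable_ns / total_ns


# 2ms own time, 8ms in callees over 4 calls, in a 1s program run
fraction = addressable_fraction(2e6, 8e6, 4, 1e9)
```

Here `fraction` works out to 0.004, which clears the 0.001 threshold, so the function would be ranked as worth optimizing.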
## Step 3: Check Context Token Limits

Verify the function's context fits within token limits.

1. Check `OPTIMIZATION_CONTEXT_TOKEN_LIMIT=16000` and `TESTGEN_CONTEXT_TOKEN_LIMIT=16000` in `code_utils/config_consts.py`
2. Token counting is done by `encoded_tokens_len()` in `code_utils/code_utils.py`
3. Large helper function chains or deep dependency trees can blow the limit

**If context too large**: The function has too many dependencies. Consider refactoring to reduce context size.
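A rough stand-in for the check (a character-based estimate with an assumed ratio; the real `encoded_tokens_len()` uses an actual tokenizer):

```python
OPTIMIZATION_CONTEXT_TOKEN_LIMIT = 16000


def context_fits(context_code: str, tokens_per_char: float = 0.3) -> bool:
    # Crude heuristic: roughly 0.3 tokens per character of source code
    estimated_tokens = int(len(context_code) * tokens_per_char)
    return estimated_tokens <= OPTIMIZATION_CONTEXT_TOKEN_LIMIT
```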
## Step 4: Check AI Service Response

Verify the AI service returned valid candidates.

1. Check logs for `AiServiceClient` request/response
2. Look for HTTP errors (non-200 status codes)
3. Verify `_get_valid_candidates()` parsed the response — empty `code_strings` means invalid markdown code blocks
4. Check if all candidates were filtered out during parsing

**If no candidates returned**: Check API key, network connectivity, and service status.

## Step 5: Check Test Failures

Determine if candidates failed behavioral or benchmark tests.

1. **Behavioral failures**: Compare return values, stdout, pass/fail status between original baseline and candidate
   - Check `TestDiffScope`: `RETURN_VALUE`, `STDOUT`, `DID_PASS`
   - Look at JUnit XML results for specific test failures
2. **Benchmark failures**: Check if candidate met `MIN_IMPROVEMENT_THRESHOLD=0.05` (5% speedup)
3. **Stability failures**: Check if timing was stable within `STABILITY_WINDOW_SIZE=0.35`

**If behavioral failure**: The optimization changed the function's behavior. Check test diffs for specific mismatches.
**If benchmark failure**: The optimization didn't provide enough speedup.

## Step 6: Check Deduplication

Verify candidates weren't deduplicated away.

1. `CandidateEvaluationContext.ast_code_to_id` tracks normalized code → candidate mapping
2. `normalize_code()` from `code_utils/deduplicate_code.py` normalizes AST for comparison
3. If all candidates normalize to the same code, only one is actually tested

**If all duplicates**: The LLM generated the same optimization multiple times. Try higher effort level.
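An AST round-trip illustrates why superficially different candidates can collapse into one (a rough stand-in for `normalize_code()`):

```python
import ast


def normalized(source: str) -> str:
    # Parsing and unparsing drops comments, extra whitespace, and
    # redundant parentheses, so equivalent code compares equal
    return ast.unparse(ast.parse(source))


candidate_a = "def f(x):  return x + 1  # fast path"
candidate_b = "def f(x):\n    return (x + 1)"
```

Both candidates normalize to the same string, so only one would be tested.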
## Step 7: Check Repair/Refinement

If initial candidates failed, check repair and refinement stages.

1. Repair only runs if fewer than `MIN_CORRECT_CANDIDATES=2` passed
2. Repair sends `AIServiceCodeRepairRequest` with test diffs
3. Check `REPAIR_UNMATCHED_PERCENTAGE_LIMIT` — if too many tests failed, repair is skipped
4. Refinement only runs on top valid candidates

**If repair also failed**: The optimization approach may not work for this function.

## Key Files to Check

- `optimization/function_optimizer.py` — Main optimization loop, `determine_best_candidate()`
- `verification/test_runner.py` — Test execution
- `api/aiservice.py` — AI service communication
- `code_utils/config_consts.py` — Thresholds
- `context/code_context_extractor.py` — Context extraction
- `models/models.py` — `CandidateEvaluationContext`, `TestResults`
14
tiles/codeflash-skills/tile.json
Normal file
{
  "name": "codeflash/codeflash-skills",
  "version": "0.1.0",
  "summary": "Procedural workflows for developing and debugging codeflash",
  "private": true,
  "skills": {
    "debug-optimization-failure": {
      "path": "skills/debug-optimization-failure/SKILL.md"
    },
    "add-codeflash-feature": {
      "path": "skills/add-codeflash-feature/SKILL.md"
    }
  }
}