feat: add private tessl tiles for codeflash rules, docs, and skills

Three private tiles in the codeflash workspace:
- codeflash-rules: 6 steering rules (code-style, architecture, optimization-patterns, git-conventions, testing-rules, language-rules)
- codeflash-docs: 7 doc pages (index, domain-types, optimization-pipeline, context-extraction, verification, ai-service, configuration)
- codeflash-skills: 2 skills (debug-optimization-failure, add-codeflash-feature)
Kevin Turcios 2026-02-14 20:55:06 -05:00
parent 90601c3324
commit 6718e66582
20 changed files with 965 additions and 0 deletions

@ -33,3 +33,5 @@ Discovery → Ranking → Context Extraction → Test Gen + Optimization → Bas
# Agent Rules <!-- tessl-managed -->
@.tessl/RULES.md follow the [instructions](.tessl/RULES.md)
@AGENTS.md

@ -63,6 +63,15 @@
    },
    "tessl/pypi-filelock": {
      "version": "3.19.0"
    },
    "codeflash/codeflash-rules": {
      "version": "0.1.0"
    },
    "codeflash/codeflash-docs": {
      "version": "0.1.0"
    },
    "codeflash/codeflash-skills": {
      "version": "0.1.0"
    }
  }
}

@ -0,0 +1,108 @@
# AI Service
How codeflash communicates with the AI optimization backend.
## `AiServiceClient` (`api/aiservice.py`)
The client connects to the AI service at `https://app.codeflash.ai` (or `http://localhost:8000` when `CODEFLASH_AIS_SERVER=local`).
Authentication uses a Bearer token from `get_codeflash_api_key()`. All requests go through `make_ai_service_request()`, which handles JSON serialization via the Pydantic encoder.
Timeout: 90s for production, 300s for local.
## Endpoints
### `/ai/optimize` — Generate Candidates
Method: `optimize_code()`
Sends source code + dependency context to generate optimization candidates.
Payload:
- `source_code` — The read-writable code (markdown format)
- `dependency_code` — Read-only context code
- `trace_id` — Unique trace ID for the optimization run
- `language` — `"python"`, `"javascript"`, or `"typescript"`
- `n_candidates` — Number of candidates to generate (controlled by effort level)
- `is_async` — Whether the function is async
- `is_numerical_code` — Whether the code is numerical (affects optimization strategy)
Returns: `list[OptimizedCandidate]` with `source=OptimizedCandidateSource.OPTIMIZE`
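A minimal sketch of the request body using the field names above; the literal values and the commented send step are illustrative only, and the exact wire format is an assumption.
```python
# Hypothetical /ai/optimize request body; field names are taken from the list above.
payload = {
    "source_code": "def f():\n    return 1\n",  # markdown-formatted code blocks in practice
    "dependency_code": "",                       # read-only context code
    "trace_id": "example-trace-id",
    "language": "python",
    "n_candidates": 5,                           # chosen by effort level
    "is_async": False,
    "is_numerical_code": False,
}
# response = make_ai_service_request("/ai/optimize", payload=payload)  # call shape assumed
```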
### `/ai/optimize_line_profiler` — Line-Profiler-Guided Candidates
Method: `optimize_python_code_line_profiler()`
Like `/optimize` but includes `line_profiler_results` to guide the LLM toward hot lines.
Returns: candidates with `source=OptimizedCandidateSource.OPTIMIZE_LP`
### `/ai/refine` — Refine Existing Candidate
Method: `refine_code()`
Request type: `AIServiceRefinerRequest`
Sends an existing candidate with runtime data and line profiler results to generate an improved version.
Key fields:
- `original_source_code` / `optimized_source_code` — Before and after
- `original_code_runtime` / `optimized_code_runtime` — Timing data
- `speedup` — Current speedup ratio
- `original_line_profiler_results` / `optimized_line_profiler_results`
Returns: candidates with `source=OptimizedCandidateSource.REFINE` and `parent_id` set to the refined candidate's ID
### `/ai/repair` — Fix Failed Candidate
Method: `repair_code()`
Request type: `AIServiceCodeRepairRequest`
Sends a failed candidate with test diffs showing what went wrong.
Key fields:
- `original_source_code` / `modified_source_code`
- `test_diffs: list[TestDiff]` — Each with `scope` (return_value/stdout/did_pass), original vs candidate values, and test source code
Returns: candidates with `source=OptimizedCandidateSource.REPAIR` and `parent_id` set
### `/ai/adaptive_optimize` — Multi-Candidate Adaptive
Method: `adaptive_optimize()`
Request type: `AIServiceAdaptiveOptimizeRequest`
Sends multiple previous candidates with their speedups for the LLM to learn from and generate better candidates.
Key fields:
- `candidates: list[AdaptiveOptimizedCandidate]` — Previous candidates with source code, explanation, source type, and speedup
Returns: candidates with `source=OptimizedCandidateSource.ADAPTIVE`
### `/ai/rewrite_jit` — JIT Rewrite
Method: `get_jit_rewritten_code()`
Rewrites code to use JIT compilation (e.g., Numba).
Returns: candidates with `source=OptimizedCandidateSource.JIT_REWRITE`
## Candidate Parsing
All endpoints return JSON with an `optimizations` array. Each entry has:
- `source_code` — Markdown-formatted code blocks
- `explanation` — LLM explanation
- `optimization_id` — Unique ID
- `parent_id` — Optional parent reference
- `model` — Which LLM model was used
`_get_valid_candidates()` parses the markdown code via `CodeStringsMarkdown.parse_markdown_code()` and filters out entries with empty code blocks.
## `LocalAiServiceClient`
Used when `CODEFLASH_EXPERIMENT_ID` is set. Mirrors `AiServiceClient` but sends to a separate experimental endpoint for A/B testing optimization strategies.
## LLM Call Sequencing
`AiServiceClient` tracks call sequence via `llm_call_counter` (itertools.count). Each request includes a `call_sequence` number, used by the backend to maintain conversation context across multiple calls for the same function.
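A minimal sketch of the counter mechanism described above; the starting value and surrounding names are assumptions, only `itertools.count` is standard library.
```python
import itertools

llm_call_counter = itertools.count()    # one counter per AiServiceClient instance
call_sequence = next(llm_call_counter)  # attached to each request payload
```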

@ -0,0 +1,79 @@
# Configuration
Key configuration constants, effort levels, and thresholds.
## Constants (`code_utils/config_consts.py`)
### Test Execution
| Constant | Value | Description |
|----------|-------|-------------|
| `MAX_TEST_RUN_ITERATIONS` | 5 | Maximum test loop iterations |
| `INDIVIDUAL_TESTCASE_TIMEOUT` | 15s | Timeout per individual test case |
| `MAX_FUNCTION_TEST_SECONDS` | 60s | Max total time for function testing |
| `MAX_TEST_FUNCTION_RUNS` | 50 | Max test function executions |
| `MAX_CUMULATIVE_TEST_RUNTIME_NANOSECONDS` | 100ms | Max cumulative test runtime |
| `TOTAL_LOOPING_TIME` | 10s | Candidate benchmarking budget |
| `MIN_TESTCASE_PASSED_THRESHOLD` | 6 | Minimum test cases that must pass |
### Performance Thresholds
| Constant | Value | Description |
|----------|-------|-------------|
| `MIN_IMPROVEMENT_THRESHOLD` | 0.05 (5%) | Minimum speedup to accept a candidate |
| `MIN_THROUGHPUT_IMPROVEMENT_THRESHOLD` | 0.10 (10%) | Minimum async throughput improvement |
| `MIN_CONCURRENCY_IMPROVEMENT_THRESHOLD` | 0.20 (20%) | Minimum concurrency ratio improvement |
| `COVERAGE_THRESHOLD` | 60.0% | Minimum test coverage |
### Stability Thresholds
| Constant | Value | Description |
|----------|-------|-------------|
| `STABILITY_WINDOW_SIZE` | 0.35 | 35% of total iteration window |
| `STABILITY_CENTER_TOLERANCE` | 0.0025 | ±0.25% around median |
| `STABILITY_SPREAD_TOLERANCE` | 0.0025 | 0.25% window spread |
### Context Limits
| Constant | Value | Description |
|----------|-------|-------------|
| `OPTIMIZATION_CONTEXT_TOKEN_LIMIT` | 16000 | Max tokens for optimization context |
| `TESTGEN_CONTEXT_TOKEN_LIMIT` | 16000 | Max tokens for test generation context |
| `MAX_CONTEXT_LEN_REVIEW` | 1000 | Max context length for optimization review |
### Other
| Constant | Value | Description |
|----------|-------|-------------|
| `MIN_CORRECT_CANDIDATES` | 2 | Min correct candidates before skipping repair |
| `REPEAT_OPTIMIZATION_PROBABILITY` | 0.1 | Probability of re-optimizing a function |
| `DEFAULT_IMPORTANCE_THRESHOLD` | 0.001 | Minimum addressable time to consider a function |
| `CONCURRENCY_FACTOR` | 10 | Number of concurrent executions for concurrency benchmark |
| `REFINED_CANDIDATE_RANKING_WEIGHTS` | (2, 1) | (runtime, diff) weights — runtime 2x more important |
## Effort Levels
`EffortLevel` enum: `LOW`, `MEDIUM`, `HIGH`
Effort controls the number of candidates, repairs, and refinements:
| Key | LOW | MEDIUM | HIGH |
|-----|-----|--------|------|
| `N_OPTIMIZER_CANDIDATES` | 3 | 5 | 6 |
| `N_OPTIMIZER_LP_CANDIDATES` | 4 | 6 | 7 |
| `N_GENERATED_TESTS` | 2 | 2 | 2 |
| `MAX_CODE_REPAIRS_PER_TRACE` | 2 | 3 | 5 |
| `REPAIR_UNMATCHED_PERCENTAGE_LIMIT` | 0.2 | 0.3 | 0.4 |
| `TOP_VALID_CANDIDATES_FOR_REFINEMENT` | 2 | 3 | 4 |
| `ADAPTIVE_OPTIMIZATION_THRESHOLD` | 0 | 0 | 2 |
| `MAX_ADAPTIVE_OPTIMIZATIONS_PER_TRACE` | 0 | 0 | 4 |
Use `get_effort_value(EffortKeys.KEY, effort_level)` to retrieve values.
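A usage sketch of that lookup; the module path and the exact `EffortKeys` member name are assumptions based on the table keys above.
```python
# Illustrative only: the real import path and enum member names may differ slightly.
from codeflash.code_utils.config_consts import EffortKeys, EffortLevel, get_effort_value

n_candidates = get_effort_value(EffortKeys.N_OPTIMIZER_CANDIDATES, EffortLevel.HIGH)  # 6 per the table
```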
## Project Configuration
Configuration is read from `pyproject.toml` under `[tool.codeflash]`. Key settings are auto-detected by `setup/detector.py`:
- `module-root` — Root of the module to optimize
- `tests-root` — Root of test files
- `test-framework` — pytest, unittest, jest, etc.
- `formatter-cmds` — Code formatting commands

@ -0,0 +1,60 @@
# Context Extraction
How codeflash extracts and limits code context for optimization and test generation.
## Overview
Context extraction (`context/code_context_extractor.py`) builds a `CodeOptimizationContext` containing all code needed for the LLM to understand and optimize a function, split into:
- **Read-writable code** (`CodeContextType.READ_WRITABLE`): The function being optimized plus its helper functions — code the LLM is allowed to modify
- **Read-only context** (`CodeContextType.READ_ONLY`): Dependency code for reference — imports, type definitions, base classes
- **Testgen context** (`CodeContextType.TESTGEN`): Context for test generation, may include imported class definitions and external base class inits
- **Hashing context** (`CodeContextType.HASHING`): Used for deduplication of optimization runs
## Token Limits
Both optimization and test generation contexts are token-limited:
- `OPTIMIZATION_CONTEXT_TOKEN_LIMIT = 16000` tokens
- `TESTGEN_CONTEXT_TOKEN_LIMIT = 16000` tokens
Token counting uses `encoded_tokens_len()` from `code_utils/code_utils.py`. Functions whose context exceeds these limits are skipped.
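A sketch of the gate this implies; the helper below is hypothetical and the argument shape of `encoded_tokens_len()` is assumed.
```python
from codeflash.code_utils.code_utils import encoded_tokens_len
from codeflash.code_utils.config_consts import OPTIMIZATION_CONTEXT_TOKEN_LIMIT

def context_fits(final_context_markdown: str) -> bool:
    # Hypothetical check: functions whose context exceeds the limit are skipped.
    return encoded_tokens_len(final_context_markdown) <= OPTIMIZATION_CONTEXT_TOKEN_LIMIT
```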
## Context Building Process
### 1. Helper Discovery
For the target function (`FunctionToOptimize`), the extractor finds:
- **Helpers of the function**: Functions/classes in the same file that the target function calls
- **Helpers of helpers**: Transitive dependencies of the helper functions
These are organized as `dict[Path, set[FunctionSource]]` — mapping file paths to the set of helper functions found in each file.
### 2. Code Extraction
`extract_code_markdown_context_from_files()` builds `CodeStringsMarkdown` from the helper dictionaries. Each file's relevant code is extracted as a `CodeString` with its file path.
### 3. Testgen Context Enrichment
`build_testgen_context()` extends the basic context with:
- Imported class definitions (resolved from imports)
- External base class `__init__` methods
- External class `__init__` methods referenced in the context
### 4. Unused Definition Removal
`detect_unused_helper_functions()` and `remove_unused_definitions_by_function_names()` from `context/unused_definition_remover.py` prune definitions that are not transitively reachable from the target function, reducing token usage.
### 5. Deduplication
The hashing context (`hashing_code_context`) generates a hash (`hashing_code_context_hash`) used to detect when the same function context has already been optimized in a previous run, avoiding redundant work.
## Key Functions
| Function | Location | Purpose |
|----------|----------|---------|
| `build_testgen_context()` | `context/code_context_extractor.py` | Build enriched testgen context |
| `extract_code_markdown_context_from_files()` | `context/code_context_extractor.py` | Convert helper dicts to `CodeStringsMarkdown` |
| `detect_unused_helper_functions()` | `context/unused_definition_remover.py` | Find unused definitions |
| `remove_unused_definitions_by_function_names()` | `context/unused_definition_remover.py` | Remove unused definitions |
| `collect_top_level_defs_with_usages()` | `context/unused_definition_remover.py` | Analyze definition usage |
| `encoded_tokens_len()` | `code_utils/code_utils.py` | Count tokens in code |

@ -0,0 +1,153 @@
# Domain Types
Core data types used throughout the codeflash optimization pipeline.
## Function Representation
### `FunctionToOptimize` (`models/function_types.py`)
The canonical dataclass representing a function candidate for optimization. Works across Python, JavaScript, and TypeScript.
Key fields:
- `function_name: str` — The function name
- `file_path: Path` — Absolute file path where the function is located
- `parents: list[FunctionParent]` — Parent scopes (classes/functions), each with `name` and `type`
- `starting_line / ending_line: Optional[int]` — Line range (1-indexed)
- `is_async: bool` — Whether the function is async
- `is_method: bool` — Whether it belongs to a class
- `language: str` — Programming language (default: `"python"`)
Key properties:
- `qualified_name` — Full dotted name including parent classes (e.g., `MyClass.my_method`)
- `top_level_parent_name` — Name of outermost parent, or function name if no parents
- `class_name` — Immediate parent class name, or `None`
### `FunctionParent` (`models/function_types.py`)
Represents a parent scope: `name: str` (e.g., `"MyClass"`) and `type: str` (e.g., `"ClassDef"`).
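A minimal construction sketch tying the two types together; keyword construction, the import path, and defaults for the omitted fields are assumptions.
```python
from pathlib import Path

from codeflash.models.function_types import FunctionParent, FunctionToOptimize  # import path assumed

fto = FunctionToOptimize(
    function_name="my_method",
    file_path=Path("/abs/path/to/module.py"),
    parents=[FunctionParent(name="MyClass", type="ClassDef")],
)
assert fto.qualified_name == "MyClass.my_method"
assert fto.class_name == "MyClass"
```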
### `FunctionSource` (`models/models.py`)
Represents a resolved function with source code. Used for helper functions in context extraction.
Fields: `file_path`, `qualified_name`, `fully_qualified_name`, `only_function_name`, `source_code`, `jedi_definition`.
## Code Representation
### `CodeString` (`models/models.py`)
A single code block with validated syntax:
- `code: str` — The source code
- `file_path: Optional[Path]` — Origin file path
- `language: str` — Language for validation (default: `"python"`)
Validates syntax on construction via `model_validator`.
### `CodeStringsMarkdown` (`models/models.py`)
A collection of `CodeString` blocks — the primary format for passing code through the pipeline.
Key properties:
- `.flat` — Combined source code with file-path comment prefixes (e.g., `# file: path/to/file.py`)
- `.markdown` — Markdown-formatted with fenced code blocks: `` ```python:filepath\ncode\n``` ``
- `.file_to_path()` — Dict mapping file path strings to code
Static method:
- `parse_markdown_code(markdown_code, expected_language)` — Parses markdown code blocks back into `CodeStringsMarkdown`
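A round-trip sketch of the two representations above; the `code_strings` constructor argument and the import path are assumptions inferred from these docs.
```python
from pathlib import Path

from codeflash.models.models import CodeString, CodeStringsMarkdown  # import path assumed

blocks = CodeStringsMarkdown(
    code_strings=[CodeString(code="def f():\n    return 1\n", file_path=Path("src/mod.py"))]
)
md = blocks.markdown  # fenced, path-tagged code blocks
parsed = CodeStringsMarkdown.parse_markdown_code(md, "python")
```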
## Optimization Context
### `CodeOptimizationContext` (`models/models.py`)
Holds all code context needed for optimization:
- `read_writable_code: CodeStringsMarkdown` — Code the LLM can modify
- `read_only_context_code: str` — Reference-only dependency code
- `testgen_context: CodeStringsMarkdown` — Context for test generation
- `hashing_code_context: str` / `hashing_code_context_hash: str` — For deduplication
- `helper_functions: list[FunctionSource]` — Helper functions in the writable code
- `preexisting_objects: set[tuple[str, tuple[FunctionParent, ...]]]` — Objects that already exist in the code
### `CodeContextType` enum (`models/models.py`)
Defines context categories: `READ_WRITABLE`, `READ_ONLY`, `TESTGEN`, `HASHING`.
## Candidates
### `OptimizedCandidate` (`models/models.py`)
A generated code variant:
- `source_code: CodeStringsMarkdown` — The optimized code
- `explanation: str` — LLM explanation of the optimization
- `optimization_id: str` — Unique identifier
- `source: OptimizedCandidateSource` — How it was generated
- `parent_id: str | None` — ID of parent candidate (for refinements/repairs)
- `model: str | None` — Which LLM model generated it
### `OptimizedCandidateSource` enum (`models/models.py`)
How a candidate was generated: `OPTIMIZE`, `OPTIMIZE_LP` (line profiler), `REFINE`, `REPAIR`, `ADAPTIVE`, `JIT_REWRITE`.
### `CandidateEvaluationContext` (`models/models.py`)
Tracks state during candidate evaluation:
- `speedup_ratios` / `optimized_runtimes` / `is_correct` — Per-candidate results
- `ast_code_to_id` — Deduplication map (normalized AST → first seen candidate)
- `valid_optimizations` — Candidates that passed all checks
Key methods: `record_failed_candidate()`, `record_successful_candidate()`, `handle_duplicate_candidate()`, `register_new_candidate()`.
## Baseline & Results
### `OriginalCodeBaseline` (`models/models.py`)
Baseline measurements for the original code:
- `behavior_test_results: TestResults` / `benchmarking_test_results: TestResults`
- `line_profile_results: dict`
- `runtime: int` — Total runtime in nanoseconds
- `coverage_results: Optional[CoverageData]`
### `BestOptimization` (`models/models.py`)
The winning candidate after evaluation:
- `candidate: OptimizedCandidate`
- `helper_functions: list[FunctionSource]`
- `code_context: CodeOptimizationContext`
- `runtime: int`
- `winning_behavior_test_results` / `winning_benchmarking_test_results: TestResults`
## Test Types
### `TestType` enum (`models/test_type.py`)
- `EXISTING_UNIT_TEST` (1) — Pre-existing tests from the codebase
- `INSPIRED_REGRESSION` (2) — Tests inspired by existing tests
- `GENERATED_REGRESSION` (3) — AI-generated regression tests
- `REPLAY_TEST` (4) — Tests from recorded benchmark data
- `CONCOLIC_COVERAGE_TEST` (5) — Coverage-guided tests
- `INIT_STATE_TEST` (6) — Class init state verification
### `TestFile` / `TestFiles` (`models/models.py`)
`TestFile` represents a single test file with `instrumented_behavior_file_path`, optional `benchmarking_file_path`, `original_file_path`, `test_type`, and `tests_in_file`.
`TestFiles` is a collection with lookup methods: `get_by_type()`, `get_by_original_file_path()`, `get_test_type_by_instrumented_file_path()`.
### `TestResults` (`models/models.py`)
Collection of `FunctionTestInvocation` results with indexed lookup. Key methods:
- `add(invocation)` — Deduplicated insert
- `total_passed_runtime()` — Sum of minimum runtimes per test case (nanoseconds)
- `number_of_loops()` — Max loop index across all results
- `usable_runtime_data_by_test_case()` — Dict of invocation ID → list of runtimes
## Result Type
### `Result[L, R]` / `Success` / `Failure` (`either.py`)
Functional error handling type:
- `Success(value)` — Wraps a successful result
- `Failure(error)` — Wraps an error
- `result.is_successful()` / `result.is_failure()` — Check type
- `result.unwrap()` — Get success value (raises if Failure)
- `result.failure()` — Get failure value (raises if Success)
- `is_successful(result)` — Module-level helper function

@ -0,0 +1,41 @@
# Codeflash Internal Documentation
CodeFlash is an AI-powered Python code optimizer that automatically improves code performance while maintaining correctness. It uses LLMs to generate optimization candidates, verifies correctness through test execution, and benchmarks performance improvements.
## Pipeline Overview
```
Discovery → Ranking → Context Extraction → Test Gen + Optimization → Baseline → Candidate Evaluation → PR
```
1. **Discovery** (`discovery/`): Find optimizable functions across the codebase using `FunctionVisitor`
2. **Ranking** (`benchmarking/function_ranker.py`): Rank functions by addressable time using trace data
3. **Context** (`context/`): Extract code dependencies — split into read-writable (modifiable) and read-only (reference)
4. **Optimization** (`optimization/`, `api/`): Generate candidates via the AI service; runs concurrently with test generation
5. **Verification** (`verification/`): Run candidates against tests via custom pytest plugin, compare outputs
6. **Benchmarking** (`benchmarking/`): Measure performance, select best candidate by speedup
7. **Result** (`result/`, `github/`): Create PR with winning optimization
## Key Entry Points
| Task | File |
|------|------|
| CLI arguments & commands | `cli_cmds/cli.py` |
| Optimization orchestration | `optimization/optimizer.py` → `Optimizer.run()` |
| Per-function optimization | `optimization/function_optimizer.py` → `FunctionOptimizer` |
| Function discovery | `discovery/functions_to_optimize.py` |
| Context extraction | `context/code_context_extractor.py` |
| Test execution | `verification/test_runner.py`, `verification/pytest_plugin.py` |
| Performance ranking | `benchmarking/function_ranker.py` |
| Domain types | `models/models.py`, `models/function_types.py` |
| AI service | `api/aiservice.py` → `AiServiceClient` |
| Configuration | `code_utils/config_consts.py` |
## Documentation Pages
- [Domain Types](domain-types.md) — Core data types and their relationships
- [Optimization Pipeline](optimization-pipeline.md) — Step-by-step data flow through the pipeline
- [Context Extraction](context-extraction.md) — How code context is extracted and token-limited
- [Verification](verification.md) — Test execution, pytest plugin, deterministic patches
- [AI Service](ai-service.md) — AI service client endpoints and request types
- [Configuration](configuration.md) — Config schema, effort levels, thresholds

@ -0,0 +1,84 @@
# Optimization Pipeline
Step-by-step data flow from function discovery to PR creation.
## 1. Entry Point: `Optimizer.run()` (`optimization/optimizer.py`)
The `Optimizer` class is initialized with CLI args and creates:
- `TestConfig` with test roots, project root, pytest command
- `AiServiceClient` for AI service communication
- Optional `LocalAiServiceClient` for experiments
`run()` orchestrates the full pipeline: discovers functions, optionally ranks them, then optimizes each in turn.
## 2. Function Discovery (`discovery/functions_to_optimize.py`)
`FunctionVisitor` traverses source files to find optimizable functions, producing `FunctionToOptimize` instances. Filters include:
- Skipping functions that are too small or trivial
- Skipping previously optimized functions (via `was_function_previously_optimized()`)
- Applying user-configured include/exclude patterns
## 3. Function Ranking (`benchmarking/function_ranker.py`)
When trace data is available, `FunctionRanker` ranks functions by **addressable time** — the time a function spends that could be optimized (own time + callee time / call count). Functions below `DEFAULT_IMPORTANCE_THRESHOLD=0.001` are skipped.
## 4. Per-Function Optimization: `FunctionOptimizer` (`optimization/function_optimizer.py`)
For each function, `FunctionOptimizer.optimize_function()` runs the full optimization loop:
### 4a. Context Extraction (`context/code_context_extractor.py`)
Extracts `CodeOptimizationContext` containing:
- `read_writable_code` — Code the LLM can modify (the function + helpers)
- `read_only_context_code` — Dependency code for reference only
- `testgen_context` — Context for test generation (may include imported class definitions)
Token limits are enforced: `OPTIMIZATION_CONTEXT_TOKEN_LIMIT=16000` and `TESTGEN_CONTEXT_TOKEN_LIMIT=16000`. Functions exceeding these are rejected.
### 4b. Concurrent Test Generation + LLM Optimization
These run in parallel using `concurrent.futures`:
- **Test generation**: Generates regression tests from the function context
- **LLM optimization**: Sends `read_writable_code.markdown` + `read_only_context_code` to the AI service
The number of candidates depends on effort level (see Configuration docs).
### 4c. Candidate Evaluation
For each `OptimizedCandidate`:
1. **Deduplication**: Normalize code AST and check against `CandidateEvaluationContext.ast_code_to_id`. If duplicate, copy results from previous evaluation.
2. **Code replacement**: Replace the original function with the candidate using `replace_function_definitions_in_module()`.
3. **Behavioral testing**: Run instrumented tests in subprocess. The custom pytest plugin applies deterministic patches. Compare return values, stdout, and pass/fail status against the original baseline.
4. **Benchmarking**: If behavior matches, run performance tests with looping (`TOTAL_LOOPING_TIME=10s`). Calculate speedup ratio.
5. **Validation**: Candidate must beat `MIN_IMPROVEMENT_THRESHOLD=0.05` (5% speedup) and pass stability checks.
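A hedged sketch of the acceptance check in step 5; the exact speedup formula is not spelled out on this page, so this is one plausible formulation.
```python
from codeflash.code_utils.config_consts import MIN_IMPROVEMENT_THRESHOLD

def beats_threshold(original_runtime_ns: int, candidate_runtime_ns: int) -> bool:
    # One plausible definition of "speedup ratio"; stability checks are applied separately.
    speedup = (original_runtime_ns - candidate_runtime_ns) / candidate_runtime_ns
    return speedup > MIN_IMPROVEMENT_THRESHOLD
```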
### 4d. Refinement & Repair
- **Repair**: If fewer than `MIN_CORRECT_CANDIDATES=2` pass, failed candidates can be repaired via `AIServiceCodeRepairRequest` (sends test diffs to LLM).
- **Refinement**: Top valid candidates are refined via `AIServiceRefinerRequest` (sends runtime data, line profiler results).
- **Adaptive**: At HIGH effort, additional adaptive optimization rounds via `AIServiceAdaptiveOptimizeRequest`.
### 4e. Best Candidate Selection
The winning candidate is selected by:
1. Highest speedup ratio
2. For tied speedups, shortest diff length from original
3. Refinement candidates use weighted ranking: `(2 * runtime_rank + 1 * diff_rank)`
Result is a `BestOptimization` with the candidate, context, test results, and runtime.
## 5. PR Creation (`github/`)
If a winning candidate is found, a PR is created with:
- The optimized code diff
- Performance benchmark details
- Explanation from the LLM
## Worktree Mode
When `--worktree` is enabled, optimization runs in an isolated git worktree (`code_utils/git_worktree_utils.py`). This allows parallel optimization without affecting the working tree. Changes are captured as patch files.

@ -0,0 +1,93 @@
# Verification
How codeflash verifies candidate correctness and measures performance.
## Test Execution Architecture
Tests are executed in a **subprocess** to isolate the test environment from the main codeflash process. The test runner (`verification/test_runner.py`) invokes pytest (or Jest for JS/TS) with specific plugin configurations.
### Plugin Blocklists
- **Behavioral tests**: Block `benchmark`, `codspeed`, `xdist`, `sugar`
- **Benchmarking tests**: Block `codspeed`, `cov`, `benchmark`, `profiling`, `xdist`, `sugar`
These are defined as `BEHAVIORAL_BLOCKLISTED_PLUGINS` and `BENCHMARKING_BLOCKLISTED_PLUGINS` in `verification/test_runner.py`.
## Custom Pytest Plugin (`verification/pytest_plugin.py`)
The plugin is loaded into the test subprocess and provides:
### Deterministic Patches
`_apply_deterministic_patches()` replaces non-deterministic functions with fixed values to ensure reproducible test output:
| Module | Function | Fixed Value |
|--------|----------|-------------|
| `time` | `time()` | `1761717605.108106` |
| `time` | `perf_counter()` | Incrementing by 1ms per call |
| `datetime` | `datetime.now()` | `2021-01-01 02:05:10 UTC` |
| `datetime` | `datetime.utcnow()` | `2021-01-01 02:05:10 UTC` |
| `uuid` | `uuid4()` / `uuid1()` | `12345678-1234-5678-9abc-123456789012` |
| `random` | `random()` | `0.123456789` (seeded with 42) |
| `os` | `urandom(n)` | `b"\x42" * n` |
| `numpy.random` | seed | `42` |
Patches call the original function first to maintain performance characteristics (same call overhead).
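A minimal sketch of one such patch in the spirit described above; the plugin's real implementation is not reproduced here.
```python
import time

_real_time = time.time

def _deterministic_time() -> float:
    _real_time()               # call the original to keep roughly the same call overhead
    return 1761717605.108106   # fixed value from the table above

time.time = _deterministic_time
```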
### Timing Markers
Test results include timing markers in stdout: `!######<id>:<duration_ns>######!`
The pattern `_TIMING_MARKER_PATTERN` extracts timing data for calculating function utilization fraction.
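A sketch of extracting those markers; the regex below is an assumption inferred from the format shown above, not the plugin's actual pattern.
```python
import re

# Assumed regex shape for the marker format !######<id>:<duration_ns>######!
TIMING_MARKER = re.compile(r"!######(?P<id>.+?):(?P<duration_ns>\d+)######!")

def extract_timings(captured_stdout: str) -> list[tuple[str, int]]:
    return [(m.group("id"), int(m.group("duration_ns"))) for m in TIMING_MARKER.finditer(captured_stdout)]
```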
### Loop Stability
Performance benchmarking uses configurable stability thresholds:
- `STABILITY_WINDOW_SIZE = 0.35` (35% of total iterations)
- `STABILITY_CENTER_TOLERANCE = 0.0025` (±0.25% around median)
- `STABILITY_SPREAD_TOLERANCE = 0.0025` (0.25% window spread)
### Memory Limits (Linux)
On Linux, the plugin sets `RLIMIT_AS` to 85% of total system memory (RAM + swap) to prevent OOM kills.
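A sketch of that cap using the standard library `resource` module; how total memory is measured is left out, and the 85% figure comes from this page.
```python
import resource

def cap_address_space(total_memory_bytes: int) -> None:
    # Apply the 85% cap described above (Linux only; RLIMIT_AS limits virtual memory).
    limit = int(0.85 * total_memory_bytes)
    resource.setrlimit(resource.RLIMIT_AS, (limit, limit))
```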
## Test Result Processing
### `TestResults` (`models/models.py`)
Collects `FunctionTestInvocation` results with:
- Deduplicated insertion via `unique_invocation_loop_id`
- `total_passed_runtime()` — Sum of minimum runtimes per test case (nanoseconds)
- `number_of_loops()` — Max loop index
- `usable_runtime_data_by_test_case()` — Grouped timing data
### `FunctionTestInvocation`
Each invocation records:
- `loop_index` — Iteration number (starts at 1)
- `id: InvocationId` — Fully qualified test identifier
- `did_pass: bool` — Pass/fail status
- `runtime: Optional[int]` — Time in nanoseconds
- `return_value: Optional[object]` — Captured return value
- `test_type: TestType` — Which test category
### Behavioral vs Performance Testing
1. **Behavioral**: Runs with `TestingMode.BEHAVIOR`. Compares return values and stdout between original and candidate. Any difference = candidate rejected.
2. **Performance**: Runs with `TestingMode.PERFORMANCE`. Loops for `TOTAL_LOOPING_TIME=10s` to get stable timing. Calculates speedup ratio.
3. **Line Profile**: Runs with `TestingMode.LINE_PROFILE`. Collects per-line timing data for refinement.
## Test Types
| TestType | Value | Description |
|----------|-------|-------------|
| `EXISTING_UNIT_TEST` | 1 | Pre-existing tests from the codebase |
| `INSPIRED_REGRESSION` | 2 | Tests inspired by existing tests |
| `GENERATED_REGRESSION` | 3 | AI-generated regression tests |
| `REPLAY_TEST` | 4 | Tests from recorded benchmark data |
| `CONCOLIC_COVERAGE_TEST` | 5 | Coverage-guided tests |
| `INIT_STATE_TEST` | 6 | Class init state verification |
## Coverage
Coverage is measured via `CoverageData` with a threshold of `COVERAGE_THRESHOLD=60.0%`. Low coverage may affect confidence in the optimization's correctness.

@ -0,0 +1,7 @@
{
"name": "codeflash/codeflash-docs",
"version": "0.1.0",
"summary": "Internal documentation for the codeflash optimization engine",
"private": true,
"docs": "docs/index.md"
}

@ -0,0 +1,45 @@
# Architecture
```
codeflash/
├── main.py # CLI entry point
├── cli_cmds/ # Command handling, console output (Rich)
├── discovery/ # Find optimizable functions
├── context/ # Extract code dependencies and imports
├── optimization/ # Generate optimized code via AI
│   ├── optimizer.py # Main optimization orchestration
│   └── function_optimizer.py # Per-function optimization logic
├── verification/ # Run deterministic tests (pytest plugin)
├── benchmarking/ # Performance measurement
├── github/ # PR creation
├── api/ # AI service communication
├── code_utils/ # Code parsing, git utilities
├── models/ # Pydantic models and types
├── languages/ # Multi-language support (Python, JavaScript/TypeScript)
├── setup/ # Config schema, auto-detection, first-run experience
├── picklepatch/ # Serialization/deserialization utilities
├── tracing/ # Function call tracing
├── tracer.py # Root-level tracer entry point for profiling
├── lsp/ # IDE integration (Language Server Protocol)
├── telemetry/ # Sentry, PostHog
├── either.py # Functional Result type for error handling
├── result/ # Result types and handling
└── version.py # Version information
```
## Key Entry Points
| Task | Start here |
|------|------------|
| CLI arguments & commands | `cli_cmds/cli.py` |
| Optimization orchestration | `optimization/optimizer.py` → `Optimizer.run()` |
| Per-function optimization | `optimization/function_optimizer.py` → `FunctionOptimizer` |
| Function discovery | `discovery/functions_to_optimize.py` |
| Context extraction | `context/code_context_extractor.py` |
| Test execution | `verification/test_runner.py`, `verification/pytest_plugin.py` |
| Performance ranking | `benchmarking/function_ranker.py` |
| Domain types | `models/models.py`, `models/function_types.py` |
| Result handling | `either.py` (`Result`, `Success`, `Failure`, `is_successful`) |
| AI service communication | `api/aiservice.py` → `AiServiceClient` |
| Configuration constants | `code_utils/config_consts.py` |
| Language support | `languages/registry.py` → `get_language_support()` |

@ -0,0 +1,11 @@
# Code Style
- **Line length**: 120 characters
- **Python**: 3.9+ syntax (use `from __future__ import annotations` for type hints)
- **Package management**: Always use `uv`, never `pip` — run commands via `uv run`
- **Tooling**: Ruff for linting/formatting, mypy strict mode, prek for pre-commit checks (`uv run prek run`)
- **Comments**: Minimal — only explain "why", not "what"
- **Docstrings**: Do not add unless explicitly requested
- **Naming**: NEVER use leading underscores (`_function_name`) — Python has no true private functions, use public names
- **Paths**: Always use absolute `Path` objects, handle encoding explicitly (UTF-8)
- **Source transforms**: Use `libcst` for code modification/transformation to preserve formatting; `ast` is acceptable for read-only analysis and parsing

@ -0,0 +1,9 @@
# Git Conventions
- **Always create a new branch from `main`** — never commit directly to `main` or reuse an existing feature branch for unrelated changes
- Use conventional commit format: `fix:`, `feat:`, `refactor:`, `docs:`, `test:`, `chore:`
- Keep commits atomic — one logical change per commit
- Commit message body should be concise (1-2 sentences max)
- PR titles should also use conventional format
- Branch naming: `cf-#-title` (lowercase, hyphenated) where `#` is the Linear issue number
- If related to a Linear issue, include `CF-#` in the PR body

@ -0,0 +1,9 @@
# Language Support Rules
- Current language is a module-level singleton in `languages/current.py` — use `set_current_language()` / `current_language()`, never pass language as a parameter through call chains
- Use `get_language_support(identifier)` from `languages/registry.py` to get a `LanguageSupport` instance — accepts `Path`, `Language` enum, or string; never import language classes directly
- New language support classes must use the `@register_language` decorator to register with the extension and language registries
- `languages/__init__.py` uses `__getattr__` for lazy imports to avoid circular dependencies — follow this pattern when adding new exports
- `is_javascript()` returns `True` for both JavaScript and TypeScript
- Language modules are lazily imported on first `get_language_support()` call via `_ensure_languages_registered()` — the `@register_language` decorator fires on import and populates `_EXTENSION_REGISTRY` and `_LANGUAGE_REGISTRY`
- `LanguageSupport` instances are cached in `_SUPPORT_CACHE` — use `clear_cache()` only in tests
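A usage sketch of the registry lookup described above; only the accepted argument types come from these rules, and the return value's attributes are not shown.
```python
from pathlib import Path

from codeflash.languages.registry import get_language_support  # import path assumed

support = get_language_support(Path("src/app.ts"))  # also accepts a Language enum value or a string
```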

@ -0,0 +1,11 @@
# Optimization Pipeline Patterns
- All major operations return `Result[SuccessType, ErrorType]` — construct with `Success(value)` / `Failure(error)`, check with `is_successful()` before calling `unwrap()`
- Code context has token limits (`OPTIMIZATION_CONTEXT_TOKEN_LIMIT=16000`, `TESTGEN_CONTEXT_TOKEN_LIMIT=16000` in `code_utils/config_consts.py`) — exceeding them rejects the function
- `read_writable_code` (modifiable code) can span multiple files; `read_only_context_code` is reference-only dependency code
- Code is serialized as markdown code blocks: `` ```language:filepath\ncode\n``` `` — see `CodeStringsMarkdown` in `models/models.py`
- Candidates form a forest: refinements and repairs set `parent_id` to the candidate they were derived from, and `OptimizedCandidateSource` records how each candidate was generated (`OPTIMIZE`, `OPTIMIZE_LP`, `REFINE`, `REPAIR`, `ADAPTIVE`, `JIT_REWRITE`)
- Test generation and optimization run concurrently — coordinate through `CandidateEvaluationContext`
- Generated tests are instrumented with `codeflash_capture.py` to record return values and traces
- Minimum improvement threshold is 5% (`MIN_IMPROVEMENT_THRESHOLD=0.05`) — candidates below this are rejected
- Stability thresholds: `STABILITY_WINDOW_SIZE=0.35`, `STABILITY_CENTER_TOLERANCE=0.0025`, `STABILITY_SPREAD_TOLERANCE=0.0025`

@ -0,0 +1,13 @@
# Testing Rules
- Code context extraction and replacement tests must assert full string equality — no substring matching
- Use pytest's `tmp_path` fixture for temp directories (it's a `Path` object)
- Write temp files inside `tmp_path`, never use `NamedTemporaryFile` (causes Windows file contention)
- Always call `.resolve()` on Path objects to ensure absolute paths and resolve symlinks
- Use `.as_posix()` when converting resolved paths to strings (normalizes to forward slashes)
- Any new feature or bug fix that can be tested automatically must have test cases
- If changes affect existing test expectations, update the tests accordingly — tests must always pass after changes
- The pytest plugin patches `time`, `random`, `uuid`, `datetime`, `os.urandom`, and `numpy.random` for deterministic test execution — never assume real randomness or real time in verification tests
- `conftest.py` uses an autouse fixture that calls `reset_current_language()` — tests always start with Python as the default language
- Test types are defined by the `TestType` enum: `EXISTING_UNIT_TEST`, `INSPIRED_REGRESSION`, `GENERATED_REGRESSION`, `REPLAY_TEST`, `CONCOLIC_COVERAGE_TEST`, `INIT_STATE_TEST`
- Verification runs tests in a subprocess using a custom pytest plugin (`verification/pytest_plugin.py`) — behavioral tests use blocklisted plugins (`benchmark`, `codspeed`, `xdist`, `sugar`), benchmarking tests additionally block `cov` and `profiling`
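A small test sketch following the path conventions above; the module under test is hypothetical.
```python
def test_writes_temp_module(tmp_path):
    # tmp_path is pytest's per-test temporary directory (a Path object).
    target = (tmp_path / "module.py").resolve()
    target.write_text("def f():\n    return 1\n", encoding="utf-8")
    assert target.as_posix().endswith("module.py")
```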

@ -0,0 +1,26 @@
{
"name": "codeflash/codeflash-rules",
"version": "0.1.0",
"summary": "Coding standards and conventions for the codeflash codebase",
"private": true,
"rules": {
"code-style": {
"rules": "rules/code-style.md"
},
"architecture": {
"rules": "rules/architecture.md"
},
"optimization-patterns": {
"rules": "rules/optimization-patterns.md"
},
"git-conventions": {
"rules": "rules/git-conventions.md"
},
"testing-rules": {
"rules": "rules/testing-rules.md"
},
"language-rules": {
"rules": "rules/language-rules.md"
}
}
}

@ -0,0 +1,96 @@
---
name: add-codeflash-feature
description: Step-by-step workflow for adding a new feature to the codeflash codebase
---
# Add Codeflash Feature
Use this workflow when implementing a new feature in the codeflash codebase.
## Step 1: Identify Target Modules
Determine which module(s) need modification based on the feature:
| Feature area | Primary module | Key files |
|-------------|----------------|-----------|
| New optimization strategy | `optimization/` | `function_optimizer.py`, `optimizer.py` |
| New test type | `verification/`, `models/` | `test_runner.py`, `pytest_plugin.py`, `test_type.py` |
| New AI service endpoint | `api/` | `aiservice.py` |
| New language support | `languages/` | Create new `languages/<lang>/support.py` |
| Context extraction change | `context/` | `code_context_extractor.py` |
| New CLI command | `cli_cmds/` | `cli.py` |
| New config option | `setup/`, `code_utils/` | `config_consts.py`, `setup/detector.py` |
| Discovery filter | `discovery/` | `functions_to_optimize.py` |
| PR/result changes | `github/`, `result/` | Relevant handlers |
## Step 2: Follow Result Type Pattern
Use the `Result[L, R]` type from `either.py` for error handling in pipeline operations:
```python
from codeflash.either import Failure, Result, Success, is_successful

def my_operation() -> Result[str, MyResultType]:
    if error_condition:
        return Failure("descriptive error message")
    return Success(result_value)

# Usage:
result = my_operation()
if not is_successful(result):
    logger.error(result.failure())
    return
value = result.unwrap()
```
## Step 3: Add Configuration Constants
If the feature needs configurable thresholds or limits:
1. Add constants to `code_utils/config_consts.py`
2. If effort-dependent, add to `EFFORT_VALUES` dict with values for `LOW`, `MEDIUM`, `HIGH`
3. Add a corresponding `EffortKeys` enum entry
4. Access via `get_effort_value(EffortKeys.MY_KEY, effort_level)`
## Step 4: Add Domain Types
If new data structures are needed:
1. Add Pydantic models or frozen dataclasses to `models/models.py` or `models/function_types.py`
2. Use `@dataclass(frozen=True)` for immutable data
3. Use `BaseModel` for models that need serialization
4. Keep `function_types.py` dependency-free (no imports from other codeflash modules)
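A sketch of the two modeling patterns from points 2 and 3 above; the class and field names are illustrative.
```python
from dataclasses import dataclass

from pydantic import BaseModel

@dataclass(frozen=True)
class MyFunctionRef:  # immutable value object
    qualified_name: str

class MyCandidatePayload(BaseModel):  # model that needs serialization
    source_code: str
    explanation: str
```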
## Step 5: Write Tests
Follow existing test patterns:
1. Create test files in the `tests/` directory mirroring the source structure
2. Use pytest's `tmp_path` fixture for temp directories
3. Always call `.resolve()` on Path objects
4. Assert full string equality for code context tests — no substring matching
5. Remember the pytest plugin patches `time`, `random`, `uuid`, `datetime` — don't rely on real values
## Step 6: Run Quality Checks
Run all validation before committing:
```bash
# Pre-commit checks (ruff format + lint)
uv run prek run
# Type checking
uv run mypy codeflash/
# Run relevant tests
uv run pytest tests/path/to/relevant/tests -x
```
## Step 7: Language Support Considerations
If the feature needs to work across languages:
1. Check if the feature uses language-specific APIs — use `get_language_support(identifier)` from `languages/registry.py`
2. Current language is a singleton: `set_current_language()` / `current_language()` from `languages/current.py`
3. Use `is_python()` / `is_javascript()` guards for language-specific branches
4. New language support classes must use `@register_language` decorator

@ -0,0 +1,95 @@
---
name: debug-optimization-failure
description: Debug why a codeflash optimization failed at any pipeline stage
---
# Debug Optimization Failure
Use this workflow when an optimization run fails or produces no results. Work through the stages sequentially — stop at the first failure found.
## Step 1: Check Function Discovery
Determine if the function was discovered by `FunctionVisitor`.
1. Look at the discovery output or logs for the function name
2. Check `discovery/functions_to_optimize.py` — the `FunctionVisitor` filters out:
- Functions that are too small or trivial
- Functions matching exclude patterns in config
- Functions already optimized (`was_function_previously_optimized()`)
3. Verify the function file is under the configured `module-root`
**If not discovered**: Check config patterns, file location, and function size.
## Step 2: Check Ranking
If trace data is used, check if the function was ranked high enough.
1. Look at `benchmarking/function_ranker.py` output
2. The function's **addressable time** must exceed `DEFAULT_IMPORTANCE_THRESHOLD=0.001`
3. Addressable time = own time + callee time / call count
**If ranked too low**: The function doesn't spend enough time to be worth optimizing.
## Step 3: Check Context Token Limits
Verify the function's context fits within token limits.
1. Check `OPTIMIZATION_CONTEXT_TOKEN_LIMIT=16000` and `TESTGEN_CONTEXT_TOKEN_LIMIT=16000` in `code_utils/config_consts.py`
2. Token counting is done by `encoded_tokens_len()` in `code_utils/code_utils.py`
3. Large helper function chains or deep dependency trees can blow the limit
**If context too large**: The function has too many dependencies. Consider refactoring to reduce context size.
## Step 4: Check AI Service Response
Verify the AI service returned valid candidates.
1. Check logs for `AiServiceClient` request/response
2. Look for HTTP errors (non-200 status codes)
3. Verify `_get_valid_candidates()` parsed the response — empty `code_strings` means invalid markdown code blocks
4. Check if all candidates were filtered out during parsing
**If no candidates returned**: Check API key, network connectivity, and service status.
## Step 5: Check Test Failures
Determine if candidates failed behavioral or benchmark tests.
1. **Behavioral failures**: Compare return values, stdout, pass/fail status between original baseline and candidate
- Check `TestDiffScope`: `RETURN_VALUE`, `STDOUT`, `DID_PASS`
- Look at JUnit XML results for specific test failures
2. **Benchmark failures**: Check if candidate met `MIN_IMPROVEMENT_THRESHOLD=0.05` (5% speedup)
3. **Stability failures**: Check if timing was stable within `STABILITY_WINDOW_SIZE=0.35`
**If behavioral failure**: The optimization changed the function's behavior. Check test diffs for specific mismatches.
**If benchmark failure**: The optimization didn't provide enough speedup.
## Step 6: Check Deduplication
Verify candidates weren't deduplicated away.
1. `CandidateEvaluationContext.ast_code_to_id` tracks normalized code → candidate mapping
2. `normalize_code()` from `code_utils/deduplicate_code.py` normalizes AST for comparison
3. If all candidates normalize to the same code, only one is actually tested
**If all duplicates**: The LLM generated the same optimization multiple times. Try higher effort level.
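A sketch of the duplicate check described in this step; the `normalize_code()` signature, its input, and the dict value type are assumptions.
```python
from codeflash.code_utils.deduplicate_code import normalize_code  # import path assumed

def is_duplicate(candidate_code: str, ast_code_to_id: dict[str, str]) -> bool:
    # Hypothetical shape of the check: normalized code is the deduplication key.
    return normalize_code(candidate_code) in ast_code_to_id
```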
## Step 7: Check Repair/Refinement
If initial candidates failed, check repair and refinement stages.
1. Repair only runs if fewer than `MIN_CORRECT_CANDIDATES=2` passed
2. Repair sends `AIServiceCodeRepairRequest` with test diffs
3. Check `REPAIR_UNMATCHED_PERCENTAGE_LIMIT` — if too many tests failed, repair is skipped
4. Refinement only runs on top valid candidates
**If repair also failed**: The optimization approach may not work for this function.
## Key Files to Check
- `optimization/function_optimizer.py` — Main optimization loop, `determine_best_candidate()`
- `verification/test_runner.py` — Test execution
- `api/aiservice.py` — AI service communication
- `code_utils/config_consts.py` — Thresholds
- `context/code_context_extractor.py` — Context extraction
- `models/models.py` — `CandidateEvaluationContext`, `TestResults`

@ -0,0 +1,14 @@
{
"name": "codeflash/codeflash-skills",
"version": "0.1.0",
"summary": "Procedural workflows for developing and debugging codeflash",
"private": true,
"skills": {
"debug-optimization-failure": {
"path": "skills/debug-optimization-failure/SKILL.md"
},
"add-codeflash-feature": {
"path": "skills/add-codeflash-feature/SKILL.md"
}
}
}