# Verification
How codeflash verifies candidate correctness and measures performance.
## Test Execution Architecture
Tests are executed in a subprocess to isolate the test environment from the main codeflash process. The test runner (verification/test_runner.py) invokes pytest (or Jest for JS/TS) with specific plugin configurations.
### Plugin Blocklists

- Behavioral tests: block `benchmark`, `codspeed`, `xdist`, `sugar`
- Benchmarking tests: block `codspeed`, `cov`, `benchmark`, `profiling`, `xdist`, `sugar`

These are defined as `BEHAVIORAL_BLOCKLISTED_PLUGINS` and `BENCHMARKING_BLOCKLISTED_PLUGINS` in `verification/test_runner.py`.
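As a rough illustration of how a blocklist like this can be applied, the sketch below disables each listed plugin via pytest's standard `-p no:<plugin>` flag when launching the test subprocess. This is a minimal sketch, not the actual `test_runner.py` implementation; the `run_behavioral_tests` helper and its signature are assumptions.

```python
import subprocess
import sys

# Plugin names from the behavioral blocklist described above.
BEHAVIORAL_BLOCKLISTED_PLUGINS = ["benchmark", "codspeed", "xdist", "sugar"]

def run_behavioral_tests(test_paths: list[str]) -> subprocess.CompletedProcess:
    """Run pytest in a subprocess with the blocklisted plugins disabled."""
    cmd = [sys.executable, "-m", "pytest", *test_paths]
    for plugin in BEHAVORAL_BLOCKLISTED_PLUGINS if False else BEHAVIORAL_BLOCKLISTED_PLUGINS:
        cmd += ["-p", f"no:{plugin}"]  # pytest's built-in way to disable a plugin
    return subprocess.run(cmd, capture_output=True, text=True)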
## Custom Pytest Plugin (verification/pytest_plugin.py)
The plugin is loaded into the test subprocess and provides:
### Deterministic Patches

`_apply_deterministic_patches()` replaces non-deterministic functions with fixed values to ensure reproducible test output:
| Module | Function | Fixed Value |
|---|---|---|
| `time` | `time()` | `1761717605.108106` |
| `time` | `perf_counter()` | Incrementing by 1 ms per call |
| `datetime` | `datetime.now()` | 2021-01-01 02:05:10 UTC |
| `datetime` | `datetime.utcnow()` | 2021-01-01 02:05:10 UTC |
| `uuid` | `uuid4()` / `uuid1()` | `12345678-1234-5678-9abc-123456789012` |
| `random` | `random()` | `0.123456789` (seeded with 42) |
| `os` | `urandom(n)` | `b"\x42" * n` |
| `numpy.random` | `seed` | 42 |
Patches call the original function first to maintain performance characteristics (same call overhead).
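A minimal sketch of one such patch, in the spirit of `_apply_deterministic_patches()` (the real implementation in `verification/pytest_plugin.py` may differ): the original function is still called so the call overhead stays the same, but a fixed value is returned.

```python
import time

_original_time = time.time
_FIXED_TIME = 1761717605.108106  # fixed value from the table above

def _patched_time() -> float:
    _original_time()   # call the original to preserve its call overhead
    return _FIXED_TIME # but always return the same value

time.time = _patched_time
```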
### Timing Markers

Test results include timing markers in stdout: `!######<id>:<duration_ns>######!`

The pattern `_TIMING_MARKER_PATTERN` extracts this timing data, which is used to calculate the function utilization fraction.
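A hedged sketch of how such markers could be parsed; the actual `_TIMING_MARKER_PATTERN` regex and the `extract_timings` helper below are assumptions based only on the marker format documented above.

```python
import re

_TIMING_MARKER_PATTERN = re.compile(r"!######(.+?):(\d+)######!")

def extract_timings(stdout: str) -> dict[str, int]:
    """Map each invocation id to its duration in nanoseconds."""
    return {m.group(1): int(m.group(2)) for m in _TIMING_MARKER_PATTERN.finditer(stdout)}
```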
### Loop Stability

Performance benchmarking uses configurable stability thresholds:

- `STABILITY_WINDOW_SIZE = 0.35` (35% of total iterations)
- `STABILITY_CENTER_TOLERANCE = 0.0025` (±0.25% around the median)
- `STABILITY_SPREAD_TOLERANCE = 0.0025` (0.25% window spread)
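The sketch below shows one plausible way these thresholds could be applied to decide when timing has converged; the `is_stable` function and its exact criteria are assumptions, not codeflash's actual loop-stability logic.

```python
import statistics

STABILITY_WINDOW_SIZE = 0.35         # window = 35% of total iterations
STABILITY_CENTER_TOLERANCE = 0.0025  # window center within ±0.25% of the median
STABILITY_SPREAD_TOLERANCE = 0.0025  # window spread within 0.25% of the median

def is_stable(runtimes_ns: list[int]) -> bool:
    """Check whether the most recent window of loop runtimes has converged."""
    window_len = max(2, int(len(runtimes_ns) * STABILITY_WINDOW_SIZE))
    window = runtimes_ns[-window_len:]
    median = statistics.median(runtimes_ns)
    center_ok = abs(statistics.median(window) - median) <= median * STABILITY_CENTER_TOLERANCE
    spread_ok = (max(window) - min(window)) <= median * STABILITY_SPREAD_TOLERANCE
    return center_ok and spread_ok
```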
### Memory Limits (Linux)
On Linux, the plugin sets RLIMIT_AS to 85% of total system memory (RAM + swap) to prevent OOM kills.
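A minimal sketch of how an address-space cap can be set with the standard `resource` module; the helper name and the way total memory is obtained are assumptions, but `RLIMIT_AS` and the 85% fraction come from the description above.

```python
import resource

def apply_memory_limit(total_memory_bytes: int, fraction: float = 0.85) -> None:
    """Cap the process address space (RLIMIT_AS) at a fraction of RAM + swap."""
    limit = int(total_memory_bytes * fraction)
    resource.setrlimit(resource.RLIMIT_AS, (limit, limit))
```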
## Test Result Processing
### TestResults (models/models.py)
Collects `FunctionTestInvocation` results with:

- Deduplicated insertion via `unique_invocation_loop_id`
- `total_passed_runtime()`: sum of the minimum runtimes per test case (nanoseconds)
- `number_of_loops()`: maximum loop index
- `usable_runtime_data_by_test_case()`: timing data grouped by test case
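A sketch of the aggregation `total_passed_runtime()` describes, assuming input shaped like the grouped output of `usable_runtime_data_by_test_case()`; see `models/models.py` for the real implementation.

```python
def total_passed_runtime(runtimes_by_test_case: dict[str, list[int]]) -> int:
    """Sum the minimum observed runtime (ns) of each passing test case."""
    return sum(min(runtimes) for runtimes in runtimes_by_test_case.values())
```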
### FunctionTestInvocation
Each invocation records:

- `loop_index`: iteration number (starts at 1)
- `id: InvocationId`: fully qualified test identifier
- `did_pass: bool`: pass/fail status
- `runtime: Optional[int]`: time in nanoseconds
- `return_value: Optional[object]`: captured return value
- `test_type: TestType`: which test category
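For orientation, the fields above can be pictured as a simple dataclass; this is a hedged sketch, not the actual model in `models/models.py`, which may be structured differently.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FunctionTestInvocation:
    loop_index: int                 # iteration number, starts at 1
    id: "InvocationId"              # fully qualified test identifier
    did_pass: bool                  # pass/fail status
    runtime: Optional[int]          # time in nanoseconds
    return_value: Optional[object]  # captured return value
    test_type: "TestType"           # which test category
```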
## Behavioral vs Performance Testing

- Behavioral: runs with `TestingMode.BEHAVIOR`. Compares return values and stdout between the original and the candidate; any difference rejects the candidate.
- Performance: runs with `TestingMode.PERFORMANCE`. Loops for `TOTAL_LOOPING_TIME = 10s` to get stable timing and calculates the speedup ratio.
- Line Profile: runs with `TestingMode.LINE_PROFILE`. Collects per-line timing data for refinement.
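One plausible form of the speedup-ratio calculation mentioned above; the exact formula codeflash uses is not specified in this doc, so treat this as an assumption.

```python
def speedup_ratio(original_runtime_ns: int, candidate_runtime_ns: int) -> float:
    """Relative improvement: e.g. 0.25 means the candidate is 25% faster."""
    return (original_runtime_ns - candidate_runtime_ns) / candidate_runtime_ns
```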
## Test Types

| TestType | Value | Description |
|---|---|---|
| `EXISTING_UNIT_TEST` | 1 | Pre-existing tests from the codebase |
| `INSPIRED_REGRESSION` | 2 | Tests inspired by existing tests |
| `GENERATED_REGRESSION` | 3 | AI-generated regression tests |
| `REPLAY_TEST` | 4 | Tests from recorded benchmark data |
| `CONCOLIC_COVERAGE_TEST` | 5 | Coverage-guided tests |
| `INIT_STATE_TEST` | 6 | Class init state verification |
## Coverage

Coverage is measured via `CoverageData` against a threshold of `COVERAGE_THRESHOLD = 60.0` percent. Low coverage lowers confidence in the optimization's correctness.
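A trivial sketch of applying that threshold; the real check uses `CoverageData`, and the helper below is only illustrative.

```python
COVERAGE_THRESHOLD = 60.0  # percent

def coverage_is_sufficient(covered_lines: int, total_lines: int) -> bool:
    """Return True when line coverage meets or exceeds the threshold."""
    if total_lines == 0:
        return False
    return (covered_lines / total_lines) * 100.0 >= COVERAGE_THRESHOLD
```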