Merge branch 'main' of github.com:codeflash-ai/codeflash

This commit is contained in:
ali 2026-02-17 23:10:28 +02:00
commit 7308afebc7
No known key found for this signature in database
GPG key ID: 44F9B42770617B9B
109 changed files with 4260 additions and 2254 deletions

View file

@ -26,3 +26,17 @@ codeflash/
├── result/ # Result types and handling
└── version.py # Version information
```
## Key Entry Points
| Task | Start here |
|------|------------|
| CLI arguments & commands | `cli_cmds/cli.py` |
| Optimization orchestration | `optimization/optimizer.py` → `run()` |
| Per-function optimization | `optimization/function_optimizer.py` |
| Function discovery | `discovery/functions_to_optimize.py` |
| Context extraction | `context/code_context_extractor.py` |
| Test execution | `verification/test_runner.py`, `verification/pytest_plugin.py` |
| Performance ranking | `benchmarking/function_ranker.py` |
| Domain types | `models/models.py`, `models/function_types.py` |
| Result handling | `either.py` (`Result`, `Success`, `Failure`, `is_successful`) |

View file

@ -2,6 +2,7 @@
- **Line length**: 120 characters
- **Python**: 3.9+ syntax
- **Package management**: Always use `uv`, never `pip`
- **Tooling**: Ruff for linting/formatting, mypy strict mode, prek for pre-commit checks
- **Comments**: Minimal - only explain "why", not "what"
- **Docstrings**: Do not add unless explicitly requested

View file

@ -1,5 +1,6 @@
# Git Commits & Pull Requests
- **Always create a new branch from `main` before starting any new work** — never commit directly to `main` or reuse an existing feature branch for unrelated changes
- Use conventional commit format: `fix:`, `feat:`, `refactor:`, `docs:`, `test:`, `chore:`
- Keep commits atomic - one logical change per commit
- Commit message body should be concise (1-2 sentences max)
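For example, a commit message following these conventions might look like:
```
feat: generate replay tests from recorded benchmark data

Replay tests reproduce real workloads during verification.
```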

View file

@ -0,0 +1,12 @@
---
paths:
- "codeflash/languages/**/*.py"
---
# Language Support Patterns
- Current language is a module-level singleton in `languages/current.py` — use `set_current_language()` / `current_language()`, never pass language as a parameter through call chains (see the sketch below)
- Use `get_language_support(identifier)` from `languages/registry.py` to get a `LanguageSupport` instance — never import language classes directly
- New language support classes must use the `@register_language` decorator to register with the extension and language registries
- `languages/__init__.py` uses `__getattr__` for lazy imports to avoid circular dependencies — follow this pattern when adding new exports
- `is_javascript()` returns `True` for both JavaScript and TypeScript
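
A short sketch of the singleton and registry patterns above (the identifier type accepted by `get_language_support()` is an assumption here):

```python
# Sketch only: exact signatures in languages/current.py and
# languages/registry.py may differ from this.
from codeflash.languages.base import Language
from codeflash.languages.current import current_language, set_current_language
from codeflash.languages.registry import get_language_support

set_current_language(Language.PYTHON)     # set once, near the entry point
support = get_language_support("python")  # never import PythonSupport directly

def deep_in_the_pipeline() -> None:
    lang = current_language()  # read the singleton; no parameter threading
```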

View file

@ -0,0 +1,17 @@
---
paths:
- "codeflash/optimization/**/*.py"
- "codeflash/verification/**/*.py"
- "codeflash/benchmarking/**/*.py"
- "codeflash/context/**/*.py"
---
# Optimization Pipeline Patterns
- All major operations return `Result[SuccessType, ErrorType]` — construct with `Success(value)` / `Failure(error)`, check with `is_successful()` before calling `unwrap()` (see the sketch below)
- Code context has token limits (`OPTIMIZATION_CONTEXT_TOKEN_LIMIT`, `TESTGEN_CONTEXT_TOKEN_LIMIT` in `config_consts.py`) — functions whose context exceeds them are rejected
- `read_writable_code` can span multiple files; `read_only_context_code` is reference-only
- Code is serialized as markdown code blocks: ` ```language:filepath\ncode\n``` ` (see `CodeStringsMarkdown`)
- Candidates form a forest (DAG): refinements/repairs reference `parent_id` on previous candidates
- Test generation and optimization run concurrently — coordinate through `CandidateEvaluationContext`
- Generated tests are instrumented with `codeflash_capture.py` to record return values and traces
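
A minimal sketch of the `Result` convention from the first bullet (`halve` and its error type are hypothetical; the other names come from `either.py`):

```python
# Illustrative only: the function is made up; the Result API names are real.
from codeflash.either import Failure, Success, is_successful

def halve(n: int):
    if n % 2:
        return Failure("odd input")  # error path
    return Success(n // 2)           # success path

result = halve(10)
if is_successful(result):
    value = result.unwrap()  # unwrap only after the is_successful() check
```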

View file

@ -6,6 +6,3 @@ paths:
# Source Code Rules
- Use `libcst` for code modification/transformation to preserve formatting. `ast` is acceptable for read-only analysis and parsing.
- NEVER use leading underscores for function names (e.g., `_helper`). Python has no true private functions. Always use public names.
- Any new feature or bug fix that can be tested automatically must have test cases.
- If changes affect existing test expectations, update the tests accordingly. Tests must always pass after changes.

View file

@ -13,3 +13,5 @@ paths:
- Use `.as_posix()` when converting resolved paths to strings (normalizes to forward slashes).
- Any new feature or bug fix that can be tested automatically must have test cases.
- If changes affect existing test expectations, update the tests accordingly. Tests must always pass after changes.
- The pytest plugin patches `time`, `random`, `uuid`, and `datetime` for deterministic test execution — never assume real randomness or real time in verification tests.
- `conftest.py` uses an autouse fixture that calls `reset_current_language()` — tests always start with Python as the default language.
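
The autouse pattern in the last bullet might look roughly like this (a sketch; the actual fixture in `conftest.py` and the import path for `reset_current_language()` are assumptions):

```python
# Sketch of an autouse reset fixture; import path assumed.
import pytest

from codeflash.languages.current import reset_current_language

@pytest.fixture(autouse=True)
def reset_language_fixture():
    reset_current_language()  # every test starts from the Python default
    yield
```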

View file

@ -0,0 +1,12 @@
# Fix mypy errors
When modifying code, fix any mypy type errors in the files you changed:
```bash
uv run mypy --non-interactive --config-file pyproject.toml <changed_files>
```
- Fix type annotation issues: missing return types, incorrect types, Optional/None unions, import errors for type hints
- Do NOT add `# type: ignore` comments — always fix the root cause
- Do NOT fix type errors that require logic changes, complex generic type rework, or anything that could change runtime behavior
- Files in `mypy_allowlist.txt` are checked in CI — ensure they remain error-free

View file

@ -0,0 +1,9 @@
# Fix prek failures
When prek (pre-commit) checks fail:
1. Run `uv run prek run` to see failures (local, checks staged files)
2. In CI, the equivalent is `uv run prek run --from-ref origin/main`
3. prek runs ruff format, ruff check, and mypy on changed files
4. Fix issues in order: formatting → lint → type errors
5. Re-run `uv run prek run` to verify all checks pass

2
.codex/skills/.gitignore vendored Normal file
View file

@ -0,0 +1,2 @@
# Managed by Tessl
tessl:*

2
.gemini/skills/.gitignore vendored Normal file
View file

@ -0,0 +1,2 @@
# Managed by Tessl
tessl:*

View file

@ -42,11 +42,17 @@ jobs:
uv venv --seed
uv sync
- name: Configure AWS Credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_ROLE_TO_ASSUME }}
aws-region: ${{ secrets.AWS_REGION }}
- name: Run Claude Code
id: claude
uses: anthropics/claude-code-action@v1
with:
use_foundry: "true"
use_bedrock: "true"
use_sticky_comment: true
allowed_bots: "claude[bot],codeflash-ai[bot]"
prompt: |
@ -173,12 +179,9 @@ jobs:
2. For each optimization PR:
- Check if CI is passing: `gh pr checks <number>`
- If all checks pass, merge it: `gh pr merge <number> --squash --delete-branch`
claude_args: '--model claude-opus-4-6 --allowedTools "mcp__github_inline_comment__create_inline_comment,Bash(gh pr comment:*),Bash(gh pr diff:*),Bash(gh pr view:*),Bash(gh pr list:*),Bash(gh pr checks:*),Bash(gh pr merge:*),Bash(gh issue view:*),Bash(gh issue list:*),Bash(gh api:*),Bash(uv run prek *),Bash(uv run mypy *),Bash(uv run coverage *),Bash(uv run pytest *),Bash(git status*),Bash(git add *),Bash(git commit *),Bash(git push*),Bash(git diff *),Bash(git checkout *),Read,Glob,Grep,Edit"'
claude_args: '--model us.anthropic.claude-opus-4-6-v1 --allowedTools "mcp__github_inline_comment__create_inline_comment,Bash(gh pr comment:*),Bash(gh pr diff:*),Bash(gh pr view:*),Bash(gh pr list:*),Bash(gh pr checks:*),Bash(gh pr merge:*),Bash(gh issue view:*),Bash(gh issue list:*),Bash(gh api:*),Bash(uv run prek *),Bash(uv run mypy *),Bash(uv run coverage *),Bash(uv run pytest *),Bash(git status*),Bash(git add *),Bash(git commit *),Bash(git push*),Bash(git diff *),Bash(git checkout *),Read,Glob,Grep,Edit"'
additional_permissions: |
actions: read
env:
ANTHROPIC_FOUNDRY_API_KEY: ${{ secrets.AZURE_ANTHROPIC_API_KEY }}
ANTHROPIC_FOUNDRY_BASE_URL: ${{ secrets.AZURE_ANTHROPIC_ENDPOINT }}
# @claude mentions (can edit and push) - restricted to maintainers only
claude-mention:
@ -240,14 +243,17 @@ jobs:
uv venv --seed
uv sync
- name: Configure AWS Credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_ROLE_TO_ASSUME }}
aws-region: ${{ secrets.AWS_REGION }}
- name: Run Claude Code
id: claude
uses: anthropics/claude-code-action@v1
with:
use_foundry: "true"
claude_args: '--model claude-opus-4-6 --allowedTools "Read,Edit,Write,Glob,Grep,Bash(git status*),Bash(git diff*),Bash(git add *),Bash(git commit *),Bash(git push*),Bash(git log*),Bash(git merge*),Bash(git fetch*),Bash(git checkout*),Bash(git branch*),Bash(uv run prek *),Bash(prek *),Bash(uv run ruff *),Bash(uv run pytest *),Bash(uv run mypy *),Bash(uv run coverage *),Bash(gh pr comment*),Bash(gh pr view*),Bash(gh pr diff*),Bash(gh pr merge*),Bash(gh pr close*)"'
use_bedrock: "true"
claude_args: '--model us.anthropic.claude-opus-4-6-v1 --allowedTools "Read,Edit,Write,Glob,Grep,Bash(git status*),Bash(git diff*),Bash(git add *),Bash(git commit *),Bash(git push*),Bash(git log*),Bash(git merge*),Bash(git fetch*),Bash(git checkout*),Bash(git branch*),Bash(uv run prek *),Bash(prek *),Bash(uv run ruff *),Bash(uv run pytest *),Bash(uv run mypy *),Bash(uv run coverage *),Bash(gh pr comment*),Bash(gh pr view*),Bash(gh pr diff*),Bash(gh pr merge*),Bash(gh pr close*)"'
additional_permissions: |
actions: read
env:
ANTHROPIC_FOUNDRY_API_KEY: ${{ secrets.AZURE_ANTHROPIC_API_KEY }}
ANTHROPIC_FOUNDRY_BASE_URL: ${{ secrets.AZURE_ANTHROPIC_ENDPOINT }}

View file

@ -0,0 +1,116 @@
name: Duplicate Code Detector
on:
workflow_dispatch:
pull_request:
types: [opened, synchronize]
jobs:
detect-duplicates:
if: github.event.pull_request.head.repo.full_name == github.repository || github.event_name == 'workflow_dispatch'
runs-on: ubuntu-latest
permissions:
contents: read
pull-requests: write
issues: write
id-token: write
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
fetch-depth: 0
ref: ${{ github.event.pull_request.head.ref || github.ref }}
- name: Start Serena MCP server
run: |
docker pull ghcr.io/github/serena-mcp-server:latest
docker run -d --name serena \
--network host \
-v "${{ github.workspace }}:${{ github.workspace }}:rw" \
ghcr.io/github/serena-mcp-server:latest \
serena start-mcp-server --context codex --project "${{ github.workspace }}"
mkdir -p /tmp/mcp-config
cat > /tmp/mcp-config/mcp-servers.json << 'EOF'
{
"mcpServers": {
"serena": {
"command": "docker",
"args": ["exec", "-i", "serena", "serena", "start-mcp-server", "--context", "codex", "--project", "${{ github.workspace }}"]
}
}
}
EOF
- name: Configure AWS Credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_ROLE_TO_ASSUME }}
aws-region: ${{ secrets.AWS_REGION }}
- name: Run Claude Code
uses: anthropics/claude-code-action@v1
with:
use_bedrock: "true"
use_sticky_comment: true
allowed_bots: "claude[bot],codeflash-ai[bot]"
claude_args: '--mcp-config /tmp/mcp-config/mcp-servers.json --allowedTools "Read,Glob,Grep,Bash(git diff:*),Bash(git log:*),Bash(git show:*),Bash(wc *),Bash(find *),mcp__serena__*"'
prompt: |
You are a duplicate code detector with access to Serena semantic code analysis.
## Setup
First activate the project in Serena:
- Use `mcp__serena__activate_project` with the workspace path `${{ github.workspace }}`
## Steps
1. Get the list of changed .py files (excluding tests):
`git diff --name-only origin/main...HEAD -- '*.py' | grep -v -E '(test_|_test\.py|/tests/|/test/)'`
2. Use Serena's semantic analysis on changed files:
- `mcp__serena__get_symbols_overview` to understand file structure
- `mcp__serena__find_symbol` to search for similarly named symbols across the codebase
- `mcp__serena__find_referencing_symbols` to understand usage patterns
- `mcp__serena__search_for_pattern` to find similar code patterns
3. For each changed file, look for:
- **Exact Duplication**: Identical code blocks (>10 lines) in multiple locations
- **Structural Duplication**: Same logic with minor variations (different variable names)
- **Functional Duplication**: Different implementations of the same functionality
- **Copy-Paste Programming**: Similar blocks that could be extracted into shared utilities
4. Cross-reference against the rest of the codebase using Serena:
- Search for similar function signatures and logic patterns
- Check if new code duplicates existing utilities or helpers
- Look for repeated patterns across modules
## What to Report
- Identical or nearly identical functions in different files
- Repeated code blocks that could be extracted to utilities
- Similar classes or modules with overlapping functionality
- Copy-pasted code with minor modifications
- Duplicated business logic across components
## What to Skip
- Standard boilerplate (imports, __init__, etc.)
- Test setup/teardown code
- Configuration with similar structure
- Language-specific patterns (constructors, getters/setters)
- Small snippets (<5 lines) unless highly repetitive
- Workflow files under .github/
## Output
Post a single PR comment with your findings. For each pattern found:
- Severity (High/Medium/Low)
- File locations with line numbers
- Code samples showing the duplication
- Concrete refactoring suggestion
If no significant duplication is found, say so briefly. Do not create issues — just comment on the PR.
- name: Stop Serena
if: always()
run: docker stop serena && docker rm serena || true

View file

@ -1,50 +0,0 @@
name: JavaScript/TypeScript Integration Tests
on:
push:
branches:
- main
pull_request:
workflow_dispatch:
concurrency:
group: ${{ github.workflow }}-${{ github.ref_name }}
cancel-in-progress: true
jobs:
js-integration-tests:
name: JS/TS Integration Tests
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
fetch-depth: 0
token: ${{ secrets.GITHUB_TOKEN }}
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: '20'
- name: Install uv
uses: astral-sh/setup-uv@v6
- name: Install Python dependencies
run: |
uv venv --seed
uv sync
- name: Install npm dependencies for test projects
run: |
npm install --prefix code_to_optimize/js/code_to_optimize_js
npm install --prefix code_to_optimize/js/code_to_optimize_ts
npm install --prefix code_to_optimize/js/code_to_optimize_vitest
- name: Run JavaScript integration tests
run: |
uv run pytest tests/languages/javascript/ -v
uv run pytest tests/test_languages/test_vitest_e2e.py -v
uv run pytest tests/test_languages/test_javascript_e2e.py -v
uv run pytest tests/test_languages/test_javascript_support.py -v
uv run pytest tests/code_utils/test_config_js.py -v

2
.gitignore vendored
View file

@ -268,3 +268,5 @@ tessl.json
# Tessl auto-generates AGENTS.md on install; ignore to avoid cluttering git status
AGENTS.md
.serena/
.codeflash/

12
.mcp.json Normal file
View file

@ -0,0 +1,12 @@
{
"mcpServers": {
"tessl": {
"type": "stdio",
"command": "tessl",
"args": [
"mcp",
"start"
]
}
}
}

View file

@ -1,54 +1,32 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
CodeFlash is an AI-powered Python code optimizer that automatically improves code performance while maintaining correctness. It uses LLMs to generate optimization candidates, verifies correctness through test execution, and benchmarks performance improvements.
## Common Commands
## Optimization Pipeline
```bash
# Package management (NEVER use pip)
uv sync # Install dependencies
uv sync --group dev # Install dev dependencies
uv add <package> # Add a package
# Running tests
uv run pytest tests/ # Run all tests
uv run pytest tests/test_foo.py # Run specific test file
uv run pytest tests/test_foo.py::test_bar -v # Run single test
# Type checking and linting
uv run mypy codeflash/ # Type check
uv run ruff check codeflash/ # Lint
uv run ruff format codeflash/ # Format
# Linting (run before committing)
uv run prek run --from-ref origin/main
# Mypy type checking (run on changed files before committing)
uv run mypy --non-interactive --config-file pyproject.toml <changed_files>
# Running the CLI
uv run codeflash --help
uv run codeflash init # Initialize in a project
uv run codeflash --all # Optimize entire codebase
```
Discovery → Ranking → Context Extraction → Test Gen + Optimization → Baseline → Candidate Evaluation → PR
```
## Mypy Type Checking
1. **Discovery** (`discovery/`): Find optimizable functions across the codebase
2. **Ranking** (`benchmarking/function_ranker.py`): Rank functions by addressable time using trace data
3. **Context** (`context/`): Extract code dependencies (read-writable code + read-only imports)
4. **Optimization** (`optimization/`, `api/`): Generate candidates via AI service, run in parallel with test generation
5. **Verification** (`verification/`): Run candidates against tests, compare outputs via custom pytest plugin
6. **Benchmarking** (`benchmarking/`): Measure performance, select best candidate by speedup
7. **Result** (`result/`, `github/`): Create PR with winning optimization
When modifying code, fix any mypy type errors in the files you changed. Run mypy on changed files:
## Domain Glossary
```bash
uv run mypy --non-interactive --config-file pyproject.toml <changed_files>
```
Rules:
- Fix type annotation issues: missing return types, incorrect types, Optional/None unions, import errors for type hints
- Do NOT add `# type: ignore` comments — always fix the root cause
- Do NOT fix type errors that require logic changes, complex generic type rework, or anything that could change runtime behavior
- Files in `mypy_allowlist.txt` are checked in CI — ensure they remain error-free
- **Optimization candidate**: A generated code variant that might be faster (`OptimizedCandidate`)
- **Function context**: All code needed for optimization — split into read-writable (modifiable) and read-only (reference)
- **Addressable time**: Time a function spends that could be optimized (own time + callee time / call count)
- **Candidate forest**: DAG of candidates where refinements/repairs build on previous candidates (see the sketch below)
- **Replay test**: Test generated from recorded benchmark data to reproduce real workloads
- **Tracer**: Profiling system that records function call trees and timings (`tracing/`, `tracer.py`)
- **Worktree mode**: Git worktree-based parallel optimization (`--worktree` flag)
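
A rough illustration of the candidate forest (the dataclass is hypothetical; `parent_id` is the field referenced by the pipeline docs):

```python
# Hypothetical sketch: each candidate may reference the candidate it
# refines or repairs via parent_id (None for root candidates).
from __future__ import annotations

from dataclasses import dataclass

@dataclass
class Candidate:
    id: str
    code: str
    parent_id: str | None = None  # refinement/repair lineage

def roots(candidates: list[Candidate]) -> list[Candidate]:
    return [c for c in candidates if c.parent_id is None]
```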
<!-- Section below is auto-generated by `tessl install` - do not edit manually -->

98
LICENSE Normal file
View file

@ -0,0 +1,98 @@
Business Source License 1.1
Parameters
Licensor: CodeFlash Inc.
Licensed Work: Codeflash Client version 0.20.x
The Licensed Work is (c) 2024 CodeFlash Inc.
Additional Use Grant: None. Production use of the Licensed Work is only permitted
if you have entered into a separate written agreement
with CodeFlash Inc. for production use in connection
with a subscription to CodeFlash's Code Optimization
Platform. Please visit codeflash.ai for further
information.
Change Date: 2030-01-26
Change License: MIT
Notice
The Business Source License (this document, or the “License”) is not an Open
Source license. However, the Licensed Work will eventually be made available
under an Open Source License, as stated in this License.
License text copyright (c) 2017 MariaDB Corporation Ab, All Rights Reserved.
“Business Source License” is a trademark of MariaDB Corporation Ab.
-----------------------------------------------------------------------------
Business Source License 1.1
Terms
The Licensor hereby grants you the right to copy, modify, create derivative
works, redistribute, and make non-production use of the Licensed Work. The
Licensor may make an Additional Use Grant, above, permitting limited
production use.
Effective on the Change Date, or the fourth anniversary of the first publicly
available distribution of a specific version of the Licensed Work under this
License, whichever comes first, the Licensor hereby grants you rights under
the terms of the Change License, and the rights granted in the paragraph
above terminate.
If your use of the Licensed Work does not comply with the requirements
currently in effect as described in this License, you must purchase a
commercial license from the Licensor, its affiliated entities, or authorized
resellers, or you must refrain from using the Licensed Work.
All copies of the original and modified Licensed Work, and derivative works
of the Licensed Work, are subject to this License. This License applies
separately for each version of the Licensed Work and the Change Date may vary
for each version of the Licensed Work released by Licensor.
You must conspicuously display this License on each original or modified copy
of the Licensed Work. If you receive the Licensed Work in original or
modified form from a third party, the terms and conditions set forth in this
License apply to your use of that work.
Any use of the Licensed Work in violation of this License will automatically
terminate your rights under this License for the current and all other
versions of the Licensed Work.
This License does not grant you any right in any trademark or logo of
Licensor or its affiliates (provided that you may use a trademark or logo of
Licensor as expressly required by this License).
TO THE EXTENT PERMITTED BY APPLICABLE LAW, THE LICENSED WORK IS PROVIDED ON
AN “AS IS” BASIS. LICENSOR HEREBY DISCLAIMS ALL WARRANTIES AND CONDITIONS,
EXPRESS OR IMPLIED, INCLUDING (WITHOUT LIMITATION) WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT, AND
TITLE.
MariaDB hereby grants you permission to use this License’s text to license
your works, and to refer to it using the trademark “Business Source License”,
as long as you comply with the Covenants of Licensor below.
Covenants of Licensor
In consideration of the right to use this License’s text and the “Business
Source License” name and trademark, Licensor covenants to MariaDB, and to all
other recipients of the licensed work to be provided by Licensor:
1. To specify as the Change License the GPL Version 2.0 or any later version,
or a license that is compatible with GPL Version 2.0 or a later version,
where “compatible” means that software provided under the Change License can
be included in a program with software provided under GPL Version 2.0 or a
later version. Licensor may specify additional Change Licenses without
limitation.
2. To either: (a) specify an additional grant of rights to use that does not
impose any additional restriction on the right granted in this License, as
the Additional Use Grant; or (b) insert the text “None”.
3. To specify a Change Date.
4. Not to modify this License in any other way.

View file

@ -0,0 +1,98 @@
Business Source License 1.1
Parameters
Licensor: CodeFlash Inc.
Licensed Work: Codeflash Client version 0.20.x
The Licensed Work is (c) 2024 CodeFlash Inc.
Additional Use Grant: None. Production use of the Licensed Work is only permitted
if you have entered into a separate written agreement
with CodeFlash Inc. for production use in connection
with a subscription to CodeFlash's Code Optimization
Platform. Please visit codeflash.ai for further
information.
Change Date: 2030-01-26
Change License: MIT
Notice
The Business Source License (this document, or the “License”) is not an Open
Source license. However, the Licensed Work will eventually be made available
under an Open Source License, as stated in this License.
License text copyright (c) 2017 MariaDB Corporation Ab, All Rights Reserved.
“Business Source License” is a trademark of MariaDB Corporation Ab.
-----------------------------------------------------------------------------
Business Source License 1.1
Terms
The Licensor hereby grants you the right to copy, modify, create derivative
works, redistribute, and make non-production use of the Licensed Work. The
Licensor may make an Additional Use Grant, above, permitting limited
production use.
Effective on the Change Date, or the fourth anniversary of the first publicly
available distribution of a specific version of the Licensed Work under this
License, whichever comes first, the Licensor hereby grants you rights under
the terms of the Change License, and the rights granted in the paragraph
above terminate.
If your use of the Licensed Work does not comply with the requirements
currently in effect as described in this License, you must purchase a
commercial license from the Licensor, its affiliated entities, or authorized
resellers, or you must refrain from using the Licensed Work.
All copies of the original and modified Licensed Work, and derivative works
of the Licensed Work, are subject to this License. This License applies
separately for each version of the Licensed Work and the Change Date may vary
for each version of the Licensed Work released by Licensor.
You must conspicuously display this License on each original or modified copy
of the Licensed Work. If you receive the Licensed Work in original or
modified form from a third party, the terms and conditions set forth in this
License apply to your use of that work.
Any use of the Licensed Work in violation of this License will automatically
terminate your rights under this License for the current and all other
versions of the Licensed Work.
This License does not grant you any right in any trademark or logo of
Licensor or its affiliates (provided that you may use a trademark or logo of
Licensor as expressly required by this License).
TO THE EXTENT PERMITTED BY APPLICABLE LAW, THE LICENSED WORK IS PROVIDED ON
AN “AS IS” BASIS. LICENSOR HEREBY DISCLAIMS ALL WARRANTIES AND CONDITIONS,
EXPRESS OR IMPLIED, INCLUDING (WITHOUT LIMITATION) WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT, AND
TITLE.
MariaDB hereby grants you permission to use this License’s text to license
your works, and to refer to it using the trademark “Business Source License”,
as long as you comply with the Covenants of Licensor below.
Covenants of Licensor
In consideration of the right to use this License’s text and the “Business
Source License” name and trademark, Licensor covenants to MariaDB, and to all
other recipients of the licensed work to be provided by Licensor:
1. To specify as the Change License the GPL Version 2.0 or any later version,
or a license that is compatible with GPL Version 2.0 or a later version,
where “compatible” means that software provided under the Change License can
be included in a program with software provided under GPL Version 2.0 or a
later version. Licensor may specify additional Change Licenses without
limitation.
2. To either: (a) specify an additional grant of rights to use that does not
impose any additional restriction on the right granted in this License, as
the Additional Use Grant; or (b) insert the text “None”.
3. To specify a Change Date.
4. Not to modify this License in any other way.

View file

@ -0,0 +1,15 @@
# CodeFlash Benchmark
A pytest benchmarking plugin for [CodeFlash](https://codeflash.ai) - automatic code performance optimization.
## Installation
```bash
pip install codeflash-benchmark
```
## Usage
This plugin provides benchmarking capabilities for pytest tests used by CodeFlash's optimization pipeline.
For more information, visit [codeflash.ai](https://codeflash.ai).

View file

@ -1,32 +1,32 @@
[project]
name = "codeflash-benchmark"
version = "0.2.0"
description = "Pytest benchmarking plugin for codeflash.ai - automatic code performance optimization"
authors = [{ name = "CodeFlash Inc.", email = "contact@codeflash.ai" }]
requires-python = ">=3.9"
readme = "README.md"
license = {text = "BSL-1.1"}
keywords = [
"codeflash",
"benchmark",
"pytest",
"performance",
"testing",
]
dependencies = [
"pytest>=7.0.0,!=8.3.4",
]
[project.urls]
Homepage = "https://codeflash.ai"
Repository = "https://github.com/codeflash-ai/codeflash-benchmark"
[project.entry-points.pytest11]
codeflash-benchmark = "codeflash_benchmark.plugin"
[build-system]
requires = ["setuptools>=45", "wheel"]
build-backend = "setuptools.build_meta"
[tool.setuptools]
packages = ["codeflash_benchmark"]
[project]
name = "codeflash-benchmark"
version = "0.2.0"
description = "Pytest benchmarking plugin for codeflash.ai - automatic code performance optimization"
authors = [{ name = "CodeFlash Inc.", email = "contact@codeflash.ai" }]
requires-python = ">=3.9"
readme = "README.md"
license-files = ["LICENSE"]
keywords = [
"codeflash",
"benchmark",
"pytest",
"performance",
"testing",
]
dependencies = [
"pytest>=7.0.0,!=8.3.4",
]
[project.urls]
Homepage = "https://codeflash.ai"
Repository = "https://github.com/codeflash-ai/codeflash-benchmark"
[project.entry-points.pytest11]
codeflash-benchmark = "codeflash_benchmark.plugin"
[build-system]
requires = ["setuptools>=45", "wheel"]
build-backend = "setuptools.build_meta"
[tool.setuptools]
packages = ["codeflash_benchmark"]

View file

@ -4,8 +4,8 @@ from enum import Enum
from typing import Any, Union
MAX_TEST_RUN_ITERATIONS = 5
OPTIMIZATION_CONTEXT_TOKEN_LIMIT = 16000
TESTGEN_CONTEXT_TOKEN_LIMIT = 16000
OPTIMIZATION_CONTEXT_TOKEN_LIMIT = 48000
TESTGEN_CONTEXT_TOKEN_LIMIT = 48000
INDIVIDUAL_TESTCASE_TIMEOUT = 15
MAX_FUNCTION_TEST_SECONDS = 60
MIN_IMPROVEMENT_THRESHOLD = 0.05

View file

@ -519,15 +519,6 @@ class LanguageSupport(Protocol):
"""
...
def get_comment_prefix(self) -> str:
"""Get the comment prefix for this language.
Returns:
Comment prefix (e.g., "//" for JS, "#" for Python).
"""
...
def find_test_root(self, project_root: Path) -> Path | None:
"""Find the test root directory for a project.

View file

@ -34,7 +34,7 @@ if TYPE_CHECKING:
from codeflash.languages.base import LanguageSupport
# Module-level singleton for the current language
_current_language: Language | None = None
_current_language: Language = Language.PYTHON
def current_language() -> Language:

View file

@ -1354,12 +1354,10 @@ def fix_jest_mock_paths(test_code: str, test_file_path: Path, source_file_path:
or source_relative_resolved.with_suffix(".jsx").exists()
):
# Calculate the correct relative path from test_dir to source_relative_resolved
new_rel_path = os.path.relpath(str(source_relative_resolved), str(test_dir))
new_rel_path = Path(os.path.relpath(source_relative_resolved, test_dir)).as_posix()
# Ensure it starts with ./ or ../
if not new_rel_path.startswith("../") and not new_rel_path.startswith("./"):
new_rel_path = f"./{new_rel_path}"
# Use forward slashes
new_rel_path = new_rel_path.replace("\\", "/")
logger.debug(f"Fixed jest.mock path: {rel_path} -> {new_rel_path}")
return f"{prefix}{new_rel_path}{suffix}"

View file

@ -527,10 +527,5 @@ def parse_jest_test_xml(
f"[LOOP-SUMMARY] Results loop_index: min={min_idx}, max={max_idx}, "
f"unique_count={len(unique_loop_indices)}, total_results={len(loop_indices)}"
)
if max_idx == 1 and len(loop_indices) > 1:
logger.warning(
f"[LOOP-WARNING] All {len(loop_indices)} results have loop_index=1. "
"Perf test markers may not have been parsed correctly."
)
return test_results

View file

@ -1805,15 +1805,6 @@ class JavaScriptSupport:
"""
return ".test.js"
def get_comment_prefix(self) -> str:
"""Get the comment prefix for JavaScript.
Returns:
JavaScript single-line comment prefix.
"""
return "//"
def find_test_root(self, project_root: Path) -> Path | None:
"""Find the test root directory for a JavaScript project.

View file

@ -803,8 +803,6 @@ def run_jest_behavioral_tests(
wall_clock_ns = time.perf_counter_ns() - start_time_ns
logger.debug(f"Jest behavioral tests completed in {wall_clock_ns / 1e9:.2f}s")
print(result.stdout)
return result_file_path, result, coverage_json_path, None
@ -1046,6 +1044,10 @@ def run_jest_benchmarking_tests(
# Create result with combined stdout
result = subprocess.CompletedProcess(args=result.args, returncode=result.returncode, stdout=stdout, stderr="")
if result.returncode != 0:
logger.info(f"Jest benchmarking failed with return code {result.returncode}")
logger.info(f"Jest benchmarking stdout: {result.stdout}")
logger.info(f"Jest benchmarking stderr: {result.stderr}")
except subprocess.TimeoutExpired:
logger.warning(f"Jest benchmarking timed out after {total_timeout}s")

View file

@ -15,6 +15,8 @@ from codeflash.languages import is_javascript
from codeflash.models.models import CodeString, CodeStringsMarkdown
if TYPE_CHECKING:
from collections.abc import Callable
from codeflash.discovery.functions_to_optimize import FunctionToOptimize
from codeflash.models.models import CodeOptimizationContext, FunctionSource
@ -49,6 +51,69 @@ def extract_names_from_targets(target: cst.CSTNode) -> list[str]:
return names
def is_assignment_used(node: cst.CSTNode, definitions: dict[str, UsageInfo], name_prefix: str = "") -> bool:
if isinstance(node, cst.Assign):
for target in node.targets:
names = extract_names_from_targets(target.target)
for name in names:
lookup = f"{name_prefix}{name}" if name_prefix else name
if lookup in definitions and definitions[lookup].used_by_qualified_function:
return True
return False
if isinstance(node, (cst.AnnAssign, cst.AugAssign)):
names = extract_names_from_targets(node.target)
for name in names:
lookup = f"{name_prefix}{name}" if name_prefix else name
if lookup in definitions and definitions[lookup].used_by_qualified_function:
return True
return False
return False
def recurse_sections(
node: cst.CSTNode,
section_names: list[str],
prune_fn: Callable[[cst.CSTNode], tuple[cst.CSTNode | None, bool]],
keep_non_target_children: bool = False,
) -> tuple[cst.CSTNode | None, bool]:
updates: dict[str, list[cst.CSTNode] | cst.CSTNode] = {}
found_any_target = False
for section in section_names:
original_content = getattr(node, section, None)
if isinstance(original_content, (list, tuple)):
new_children = []
section_found_target = False
for child in original_content:
filtered, found_target = prune_fn(child)
if filtered:
new_children.append(filtered)
section_found_target |= found_target
if keep_non_target_children:
if section_found_target or new_children:
found_any_target |= section_found_target
updates[section] = new_children
elif section_found_target:
found_any_target = True
updates[section] = new_children
elif original_content is not None:
filtered, found_target = prune_fn(original_content)
if keep_non_target_children:
found_any_target |= found_target
if filtered:
updates[section] = filtered
elif found_target:
found_any_target = True
if filtered:
updates[section] = filtered
if keep_non_target_children:
if updates:
return node.with_changes(**updates), found_any_target
return None, False
if not found_any_target:
return None, False
return (node.with_changes(**updates) if updates else node), True
def collect_top_level_definitions(
node: cst.CSTNode, definitions: Optional[dict[str, UsageInfo]] = None
) -> dict[str, UsageInfo]:
@ -423,27 +488,9 @@ def remove_unused_definitions_recursively(
elif isinstance(statement, (cst.Assign, cst.AnnAssign, cst.AugAssign)):
var_used = False
# Check if any variable in this assignment is used
if isinstance(statement, cst.Assign):
for target in statement.targets:
names = extract_names_from_targets(target.target)
for name in names:
class_var_name = f"{class_name}.{name}"
if (
class_var_name in definitions
and definitions[class_var_name].used_by_qualified_function
):
var_used = True
method_or_var_used = True
break
elif isinstance(statement, (cst.AnnAssign, cst.AugAssign)):
names = extract_names_from_targets(statement.target)
for name in names:
class_var_name = f"{class_name}.{name}"
if class_var_name in definitions and definitions[class_var_name].used_by_qualified_function:
var_used = True
method_or_var_used = True
break
if is_assignment_used(statement, definitions, name_prefix=f"{class_name}."):
var_used = True
method_or_var_used = True
if var_used or class_has_dependencies:
new_statements.append(statement)
@ -459,56 +506,19 @@ def remove_unused_definitions_recursively(
return node, method_or_var_used or class_has_dependencies
# Handle assignments (Assign and AnnAssign)
if isinstance(node, cst.Assign):
for target in node.targets:
names = extract_names_from_targets(target.target)
for name in names:
if name in definitions and definitions[name].used_by_qualified_function:
return node, True
return None, False
if isinstance(node, (cst.AnnAssign, cst.AugAssign)):
names = extract_names_from_targets(node.target)
for name in names:
if name in definitions and definitions[name].used_by_qualified_function:
return node, True
# Handle assignments (Assign, AnnAssign, AugAssign)
if isinstance(node, (cst.Assign, cst.AnnAssign, cst.AugAssign)):
if is_assignment_used(node, definitions):
return node, True
return None, False
# For other nodes, recursively process children
section_names = get_section_names(node)
if not section_names:
return node, False
updates = {}
found_used = False
for section in section_names:
original_content = getattr(node, section, None)
if isinstance(original_content, (list, tuple)):
new_children = []
section_found_used = False
for child in original_content:
filtered, used = remove_unused_definitions_recursively(child, definitions)
if filtered:
new_children.append(filtered)
section_found_used |= used
if new_children or section_found_used:
found_used |= section_found_used
updates[section] = new_children
elif original_content is not None:
filtered, used = remove_unused_definitions_recursively(original_content, definitions)
found_used |= used
if filtered:
updates[section] = filtered
if not found_used:
return None, False
if updates:
return node.with_changes(**updates), found_used
return node, False
return recurse_sections(
node, section_names, lambda child: remove_unused_definitions_recursively(child, definitions)
)
def collect_top_level_defs_with_usages(

View file

@ -21,9 +21,25 @@ from codeflash.languages.registry import register_language
if TYPE_CHECKING:
from collections.abc import Sequence
from codeflash.models.models import FunctionSource
logger = logging.getLogger(__name__)
def function_sources_to_helpers(sources: list[FunctionSource]) -> list[HelperFunction]:
return [
HelperFunction(
name=fs.only_function_name,
qualified_name=fs.qualified_name,
file_path=fs.file_path,
source_code=fs.source_code,
start_line=fs.jedi_definition.line if fs.jedi_definition else 1,
end_line=fs.jedi_definition.line if fs.jedi_definition else 1,
)
for fs in sources
]
@register_language
class PythonSupport:
"""Python language support implementation.
@ -171,127 +187,39 @@ class PythonSupport:
# === Code Analysis ===
def extract_code_context(self, function: FunctionToOptimize, project_root: Path, module_root: Path) -> CodeContext:
"""Extract function code and its dependencies.
"""Extract function code and its dependencies via the canonical context pipeline."""
from codeflash.languages.python.context.code_context_extractor import get_code_optimization_context
Uses jedi and libcst for Python code analysis.
Args:
function: The function to extract context for.
project_root: Root of the project.
module_root: Root of the module containing the function.
Returns:
CodeContext with target code and dependencies.
"""
try:
source = function.file_path.read_text()
result = get_code_optimization_context(function, project_root)
except Exception as e:
logger.exception("Failed to read %s: %s", function.file_path, e)
logger.warning("Failed to extract code context for %s: %s", function.function_name, e)
return CodeContext(target_code="", target_file=function.file_path, language=Language.PYTHON)
# Extract the function source
lines = source.splitlines(keepends=True)
if function.starting_line and function.ending_line:
target_lines = lines[function.starting_line - 1 : function.ending_line]
target_code = "".join(target_lines)
else:
target_code = ""
# Find helper functions
helpers = self.find_helper_functions(function, project_root)
# Extract imports
import_lines = []
for line in lines:
stripped = line.strip()
if stripped.startswith(("import ", "from ")):
import_lines.append(stripped)
elif stripped and not stripped.startswith("#"):
# Stop at first non-import, non-comment line
break
helpers = function_sources_to_helpers(result.helper_functions)
return CodeContext(
target_code=target_code,
target_code=result.read_writable_code.markdown,
target_file=function.file_path,
helper_functions=helpers,
read_only_context="",
imports=import_lines,
read_only_context=result.read_only_context_code,
imports=[],
language=Language.PYTHON,
)
def find_helper_functions(self, function: FunctionToOptimize, project_root: Path) -> list[HelperFunction]:
"""Find helper functions called by the target function.
Uses jedi for Python code analysis.
Args:
function: The target function to analyze.
project_root: Root of the project.
Returns:
List of HelperFunction objects.
"""
helpers: list[HelperFunction] = []
"""Find helper functions called by the target function via the canonical jedi pipeline."""
from codeflash.languages.python.context.code_context_extractor import get_function_sources_from_jedi
try:
import jedi
from codeflash.code_utils.code_utils import get_qualified_name, path_belongs_to_site_packages
from codeflash.optimization.function_context import belongs_to_function_qualified
script = jedi.Script(path=function.file_path, project=jedi.Project(path=project_root))
file_refs = script.get_names(all_scopes=True, definitions=False, references=True)
qualified_name = function.qualified_name
for ref in file_refs:
if not ref.full_name or not belongs_to_function_qualified(ref, qualified_name):
continue
try:
definitions = ref.goto(follow_imports=True, follow_builtin_imports=False)
except Exception:
continue
for definition in definitions:
definition_path = definition.module_path
if definition_path is None:
continue
# Check if it's a valid helper (in project, not in target function)
is_valid = (
str(definition_path).startswith(str(project_root))
and not path_belongs_to_site_packages(definition_path)
and definition.full_name
and not belongs_to_function_qualified(definition, qualified_name)
and definition.type == "function"
)
if is_valid:
helper_qualified_name = get_qualified_name(definition.module_name, definition.full_name)
# Get source code
try:
helper_source = definition.get_line_code()
except Exception:
helper_source = ""
helpers.append(
HelperFunction(
name=definition.name,
qualified_name=helper_qualified_name,
file_path=definition_path,
source_code=helper_source,
start_line=definition.line or 1,
end_line=definition.line or 1,
)
)
_dict, sources = get_function_sources_from_jedi(
{function.file_path: {function.qualified_name}}, project_root
)
except Exception as e:
logger.warning("Failed to find helpers for %s: %s", function.function_name, e)
return []
return helpers
return function_sources_to_helpers(sources)
def find_references(
self, function: FunctionToOptimize, project_root: Path, tests_root: Path | None = None, max_files: int = 500
@ -728,15 +656,6 @@ class PythonSupport:
"""
return ".py"
def get_comment_prefix(self) -> str:
"""Get the comment prefix for Python.
Returns:
Python single-line comment prefix.
"""
return "#"
def find_test_root(self, project_root: Path) -> Path | None:
"""Find the test root directory for a Python project.

View file

@ -72,8 +72,6 @@ from codeflash.code_utils.line_profile_utils import add_decorator_imports, conta
from codeflash.code_utils.shell_utils import make_env_with_project_root
from codeflash.code_utils.static_analysis import get_first_top_level_function_or_method_ast
from codeflash.code_utils.time_utils import humanize_runtime
from codeflash.context import code_context_extractor
from codeflash.context.unused_definition_remover import detect_unused_helper_functions, revert_unused_helper_functions
from codeflash.discovery.functions_to_optimize import was_function_previously_optimized
from codeflash.either import Failure, Success, is_successful
from codeflash.languages import is_python
@ -81,6 +79,11 @@ from codeflash.languages.base import Language
from codeflash.languages.current import current_language_support, is_typescript
from codeflash.languages.javascript.module_system import detect_module_system
from codeflash.languages.javascript.test_runner import clear_created_config_files, get_created_config_files
from codeflash.languages.python.context import code_context_extractor
from codeflash.languages.python.context.unused_definition_remover import (
detect_unused_helper_functions,
revert_unused_helper_functions,
)
from codeflash.lsp.helpers import is_LSP_enabled, report_to_markdown_table, tree_to_markdown
from codeflash.lsp.lsp_message import LspCodeMessage, LspMarkdownMessage, LSPMessageId
from codeflash.models.ExperimentMetadata import ExperimentMetadata

View file

@ -1,6 +1,5 @@
from __future__ import annotations
import contextlib
import os
import re
import sqlite3
@ -22,6 +21,9 @@ from codeflash.code_utils.code_utils import (
)
from codeflash.discovery.discover_unit_tests import discover_parameters_unittest
from codeflash.languages import is_javascript
# Import Jest-specific parsing from the JavaScript language module
from codeflash.languages.javascript.parse import parse_jest_test_xml as _parse_jest_test_xml
from codeflash.models.models import (
ConcurrencyMetrics,
FunctionTestInvocation,
@ -32,10 +34,6 @@ from codeflash.models.models import (
)
from codeflash.verification.coverage_utils import CoverageUtils, JestCoverageUtils
# Import Jest-specific parsing from the JavaScript language module
from codeflash.languages.javascript.parse import jest_end_pattern, jest_start_pattern
from codeflash.languages.javascript.parse import parse_jest_test_xml as _parse_jest_test_xml
if TYPE_CHECKING:
import subprocess

View file

@ -6,8 +6,8 @@ codeflash/result/explanation.py
codeflash/result/critic.py
codeflash/version.py
codeflash/optimization/__init__.py
codeflash/context/__init__.py
codeflash/context/code_context_extractor.py
codeflash/languages/python/context/__init__.py
codeflash/languages/python/context/code_context_extractor.py
codeflash/discovery/__init__.py
codeflash/__init__.py
codeflash/models/ExperimentMetadata.py

View file

@ -113,21 +113,26 @@ function checkSharedTimeLimit() {
/**
* Get the current loop index for a specific invocation.
* The loop index represents how many times ALL test files have been run through.
* This is the batch count from the loop-runner.
* When using external loop-runner (Jest), returns the batch number directly.
* When using internal looping (Vitest), tracks and returns the invocation count.
*
* @param {string} invocationKey - Unique key for this test invocation
* @returns {number} The current batch number (loop index)
* @returns {number} The loop index for timing markers (1-based)
*/
function getInvocationLoopIndex(invocationKey) {
// Track local loop count for stopping logic (increments on each call)
// When using external loop-runner, use the batch number directly
// This is reliable because Jest resets module state between batches
const currentBatch = process.env.CODEFLASH_PERF_CURRENT_BATCH;
if (currentBatch !== undefined) {
return parseInt(currentBatch, 10);
}
// For internal looping (Vitest), track the count locally
if (!sharedPerfState.invocationLoopCounts[invocationKey]) {
sharedPerfState.invocationLoopCounts[invocationKey] = 0;
}
++sharedPerfState.invocationLoopCounts[invocationKey];
// Return the batch number as the loop index for timing markers
// This represents how many times all test files have been run through
return parseInt(process.env.CODEFLASH_PERF_CURRENT_BATCH || '1', 10);
return sharedPerfState.invocationLoopCounts[invocationKey];
}
/**
@ -693,11 +698,9 @@ function capturePerf(funcName, lineId, fn, ...args) {
// If not set, we're in Vitest mode and need to do all loops internally
const hasExternalLoopRunner = process.env.CODEFLASH_PERF_CURRENT_BATCH !== undefined;
// Batched looping: run BATCH_SIZE loops per capturePerf call when using loop-runner
// When using external loop-runner (Jest), execute only once per call - the loop-runner handles batching
// For Vitest (no loop-runner), do all loops internally in a single call
const batchSize = shouldLoop
? (hasExternalLoopRunner ? getPerfBatchSize() : getPerfLoopCount())
: 1;
const batchSize = hasExternalLoopRunner ? 1 : (shouldLoop ? getPerfLoopCount() : 1);
// Initialize runtime tracking for this invocation if needed
if (!sharedPerfState.invocationRuntimes[invocationKey]) {
@ -719,7 +722,7 @@ function capturePerf(funcName, lineId, fn, ...args) {
break;
}
// Get the loop index (batch number) for timing markers
// Get the loop index for timing markers
const loopIndex = getInvocationLoopIndex(invocationKey);
// Check if we've exceeded max loops for this invocation

View file

@ -35,69 +35,113 @@ const path = require('path');
const fs = require('fs');
/**
* Validates that a jest-runner path is valid by checking for package.json.
* @param {string} jestRunnerPath - Path to check
* @returns {boolean} True if valid jest-runner package
* Recursively find jest-runner package in node_modules.
* Works with any package manager (npm, yarn, pnpm) by searching for
* jest-runner/package.json anywhere in the tree.
*
* @param {string} nodeModulesPath - Path to node_modules directory
* @param {number} maxDepth - Maximum recursion depth (default: 5)
* @returns {string|null} Path to jest-runner or null if not found
*/
function isValidJestRunnerPath(jestRunnerPath) {
if (!fs.existsSync(jestRunnerPath)) {
return false;
function findJestRunnerRecursive(nodeModulesPath, maxDepth = 5) {
function search(dir, depth) {
if (depth > maxDepth || !fs.existsSync(dir)) return null;
try {
let entries = fs.readdirSync(dir, { withFileTypes: true });
// Sort entries: prefer higher versions for jest-runner@X.Y.Z directories
entries = entries.slice().sort((a, b) => {
const aMatch = a.name.match(/^jest-runner@(\d+)/);
const bMatch = b.name.match(/^jest-runner@(\d+)/);
if (aMatch && bMatch) {
return parseInt(bMatch[1], 10) - parseInt(aMatch[1], 10);
}
return a.name.localeCompare(b.name);
});
for (const entry of entries) {
if (!entry.isDirectory()) continue;
const entryPath = path.join(dir, entry.name);
// Found jest-runner directory - check if it's a valid package
if (entry.name === 'jest-runner') {
const pkgJsonPath = path.join(entryPath, 'package.json');
if (fs.existsSync(pkgJsonPath)) {
try {
const pkgJson = JSON.parse(fs.readFileSync(pkgJsonPath, 'utf8'));
if (pkgJson.name === 'jest-runner') {
return entryPath;
}
} catch (e) {
// Ignore JSON parse errors
}
}
}
// Recurse into:
// - node_modules subdirectories
// - scoped packages (@org/pkg)
// - hidden directories (.pnpm, .yarn, etc.)
// - pnpm versioned directories (jest-runner@30.0.5)
const shouldRecurse = entry.name === 'node_modules' ||
entry.name.startsWith('@') ||
entry.name === '.pnpm' || entry.name === '.yarn' ||
entry.name.startsWith('jest-runner@');
if (shouldRecurse) {
const result = search(entryPath, depth + 1);
if (result) return result;
}
}
} catch (e) {
// Ignore permission errors
}
return null;
}
const packageJsonPath = path.join(jestRunnerPath, 'package.json');
return fs.existsSync(packageJsonPath);
return search(nodeModulesPath, 0);
}
/**
* Resolve jest-runner with monorepo support.
* Uses CODEFLASH_MONOREPO_ROOT environment variable if available,
* otherwise walks up the directory tree looking for node_modules/jest-runner.
* Resolve jest-runner from the PROJECT's node_modules (not codeflash's).
*
* Uses recursive search to find jest-runner anywhere in node_modules,
* working with any package manager (npm, yarn, pnpm).
*
* @returns {string} Path to jest-runner package
* @throws {Error} If jest-runner cannot be found
*/
function resolveJestRunner() {
// Try standard resolution first (works in simple projects)
try {
return require.resolve('jest-runner');
} catch (e) {
// Standard resolution failed - try monorepo-aware resolution
}
// If Python detected a monorepo root, check there first
const monorepoRoot = process.env.CODEFLASH_MONOREPO_ROOT;
if (monorepoRoot) {
const jestRunnerPath = path.join(monorepoRoot, 'node_modules', 'jest-runner');
if (isValidJestRunnerPath(jestRunnerPath)) {
return jestRunnerPath;
}
}
// Fallback: Walk up from cwd looking for node_modules/jest-runner
const monorepoMarkers = ['yarn.lock', 'pnpm-workspace.yaml', 'lerna.json', 'package-lock.json'];
// Walk up from cwd to find all potential node_modules locations
let currentDir = process.cwd();
const visitedDirs = new Set();
// If Python detected a monorepo root, check there first
const monorepoRoot = process.env.CODEFLASH_MONOREPO_ROOT;
if (monorepoRoot && !visitedDirs.has(monorepoRoot)) {
visitedDirs.add(monorepoRoot);
const result = findJestRunnerRecursive(path.join(monorepoRoot, 'node_modules'));
if (result) return result;
}
while (currentDir !== path.dirname(currentDir)) {
// Avoid infinite loops
if (visitedDirs.has(currentDir)) break;
visitedDirs.add(currentDir);
// Try node_modules/jest-runner at this level
const jestRunnerPath = path.join(currentDir, 'node_modules', 'jest-runner');
if (isValidJestRunnerPath(jestRunnerPath)) {
return jestRunnerPath;
}
const result = findJestRunnerRecursive(path.join(currentDir, 'node_modules'));
if (result) return result;
// Check if this is a workspace root (has monorepo markers)
// Check if this is a workspace root - stop after this
const isWorkspaceRoot = monorepoMarkers.some(marker =>
fs.existsSync(path.join(currentDir, marker))
);
if (isWorkspaceRoot) {
// Found workspace root but no jest-runner - stop searching
break;
}
if (isWorkspaceRoot) break;
currentDir = path.dirname(currentDir);
}
@ -120,10 +164,15 @@ let jestVersion = 0;
try {
const jestRunnerPath = resolveJestRunner();
const internalRequire = createRequire(jestRunnerPath);
// Try to get the TestRunner class (Jest 30+)
const jestRunner = internalRequire(jestRunnerPath);
// Read the package.json to find the actual entry point and version
const pkgJsonPath = path.join(jestRunnerPath, 'package.json');
const pkgJson = JSON.parse(fs.readFileSync(pkgJsonPath, 'utf8'));
// Require using the full path to the entry point
const entryPoint = path.join(jestRunnerPath, pkgJson.main || 'build/index.js');
const jestRunner = require(entryPoint);
TestRunner = jestRunner.default || jestRunner.TestRunner;
if (TestRunner && TestRunner.prototype && typeof TestRunner.prototype.runTests === 'function') {
@ -131,9 +180,11 @@ try {
jestVersion = 30;
jestRunnerAvailable = true;
} else {
// Try Jest 29 style import
// Try Jest 29 style import - runTest is in build/runTest.js
try {
runTest = internalRequire('./runTest').default;
const runTestPath = path.join(jestRunnerPath, 'build', 'runTest.js');
const runTestModule = require(runTestPath);
runTest = runTestModule.default;
if (typeof runTest === 'function') {
// Jest 29 - use direct runTest function
jestVersion = 29;
@ -141,17 +192,23 @@ try {
}
} catch (e29) {
// Neither Jest 29 nor 30 style import worked
const errorMsg = `Found jest-runner at ${jestRunnerPath} but could not load it. ` +
`This may indicate an unsupported Jest version. ` +
`Supported versions: Jest 29.x and Jest 30.x`;
console.error(errorMsg);
jestRunnerAvailable = false;
}
}
} catch (e) {
// jest-runner not installed - this is expected for Vitest projects
// The runner will throw a helpful error if someone tries to use it without jest-runner
jestRunnerAvailable = false;
// try to directly import jest-runner
try {
const jestRunner = require('jest-runner');
TestRunner = jestRunner.default || jestRunner.TestRunner;
if (TestRunner && TestRunner.prototype && typeof TestRunner.prototype.runTests === 'function') {
jestVersion = 30;
jestRunnerAvailable = true;
} else {
jestRunnerAvailable = false;
}
} catch (e2) {
jestRunnerAvailable = false;
}
}
// Configuration
@ -233,15 +290,12 @@ class CodeflashLoopRunner {
this._context = context || {};
this._eventEmitter = new SimpleEventEmitter();
// For Jest 30+, create an instance of the base TestRunner for delegation
if (jestVersion >= 30) {
if (!TestRunner) {
throw new Error(
`Jest ${jestVersion} detected but TestRunner class not available. ` +
`This indicates an internal error in loop-runner initialization.`
);
}
this._baseRunner = new TestRunner(globalConfig, context);
// For Jest 30+, verify TestRunner is available (we create fresh instances per batch)
if (jestVersion >= 30 && !TestRunner) {
throw new Error(
`Jest ${jestVersion} detected but TestRunner class not available. ` +
`This indicates an internal error in loop-runner initialization.`
);
}
}
@ -270,7 +324,7 @@ class CodeflashLoopRunner {
* @param {Object} options - Jest runner options
* @returns {Promise<void>}
*/
async runTests(tests, watcher, options) {
async runTests(tests, watcher, ...rest) {
const startTime = Date.now();
let batchCount = 0;
let hasFailure = false;
@ -289,13 +343,11 @@ class CodeflashLoopRunner {
// Check time limit BEFORE each batch
if (batchCount > MIN_BATCHES && checkTimeLimit()) {
console.log(`[codeflash] Time limit reached after ${batchCount - 1} batches (${Date.now() - startTime}ms elapsed)`);
break;
}
// Check if interrupted
if (watcher.isInterrupted()) {
console.log(`[codeflash] Watcher is interrupted`)
break;
}
@ -303,57 +355,54 @@ class CodeflashLoopRunner {
process.env.CODEFLASH_PERF_CURRENT_BATCH = String(batchCount);
// Run all test files in this batch
const batchResult = await this._runAllTestsOnce(tests, watcher, ...rest);
allConsoleOutput += batchResult.consoleOutput;
// if (batchResult.hasFailure) {
// hasFailure = true;
// break;
// }
// Check time limit AFTER each batch
if (checkTimeLimit()) {
console.log(`[codeflash] Time limit reached after ${batchCount} batches (${Date.now() - startTime}ms elapsed)`);
break;
}
}
const totalTimeMs = Date.now() - startTime;
console.log(`[codeflash] now: ${Date.now()}`);
// Output all collected console logs - this is critical for timing marker extraction
// The console output contains the !######...######! timing markers from capturePerf
if (allConsoleOutput) {
process.stdout.write(allConsoleOutput);
}
console.log(`[codeflash] Batched runner completed: ${batchCount} batches, ${tests.length} test files, ${totalTimeMs}ms total`);
}
/**
* Run all test files once (one batch).
* Uses different approaches for Jest 29 vs Jest 30.
*/
async _runAllTestsOnce(tests, watcher, ...args) {
if (jestVersion >= 30) {
return this._runAllTestsOnceJest30(tests, watcher, ...args);
} else {
return this._runAllTestsOnceJest29(tests, watcher);
}
}
/**
* Jest 30+ implementation - creates a fresh TestRunner for each batch to avoid
* state corruption issues that occur when reusing runners across batches.
*/
async _runAllTestsOnceJest30(tests, watcher, ...args) {
let hasFailure = false;
let allConsoleOutput = '';
// For Jest 30, we need to collect results through event listeners
const resultsCollector = [];
// Create a FRESH TestRunner instance for each batch
// Jest 30's TestRunner corrupts its internal state after running tests,
// so we cannot reuse the same instance across multiple batches
const batchRunner = new TestRunner(this._globalConfig, this._context);
// Subscribe to events from the batch runner
const unsubscribeSuccess = batchRunner.on('test-file-success', (testData) => {
const [test, result] = testData;
resultsCollector.push({ test, result, success: true });
@ -369,7 +418,7 @@ class CodeflashLoopRunner {
this._eventEmitter.emit('test-file-success', testData);
});
const unsubscribeFailure = batchRunner.on('test-file-failure', (testData) => {
const [test, error] = testData;
resultsCollector.push({ test, error, success: false });
hasFailure = true;
@ -378,14 +427,14 @@ class CodeflashLoopRunner {
this._eventEmitter.emit('test-file-failure', testData);
});
const unsubscribeStart = batchRunner.on('test-file-start', (testData) => {
// Forward to our event emitter
this._eventEmitter.emit('test-file-start', testData);
});
try {
// Run tests using the fresh batch runner (always serial for benchmarking)
await batchRunner.runTests(tests, watcher, ...args);
} finally {
// Cleanup subscriptions
if (typeof unsubscribeSuccess === 'function') unsubscribeSuccess();

@ -1,357 +1,358 @@
[project]
name = "codeflash"
dynamic = ["version"]
description = "Client for codeflash.ai - automatic code performance optimization, powered by AI"
authors = [{ name = "CodeFlash Inc.", email = "contact@codeflash.ai" }]
requires-python = ">=3.9"
readme = "README.md"
license-files = ["LICENSE"]
keywords = [
"codeflash",
"performance",
"optimization",
"ai",
"code",
"machine learning",
"LLM",
]
dependencies = [
"unidiff>=0.7.4",
"pytest>=7.0.0",
"gitpython>=3.1.31",
"libcst>=1.0.1",
"jedi>=0.19.1",
# Tree-sitter for multi-language support
"tree-sitter>=0.23.0",
"tree-sitter-javascript>=0.23.0",
"tree-sitter-typescript>=0.23.0",
"pytest-timeout>=2.1.0",
"tomlkit>=0.11.7",
"junitparser>=3.1.0",
"pydantic>=1.10.1",
"humanize>=4.0.0",
"posthog>=3.0.0",
"click>=8.1.0",
"inquirer>=3.0.0",
"sentry-sdk>=1.40.6,<3.0.0",
"parameterized>=0.9.0",
"isort>=5.11.0",
"dill>=0.3.8",
"rich>=13.8.1",
"lxml>=5.3.0",
"crosshair-tool>=0.0.78",
"coverage>=7.6.4",
"line_profiler>=4.2.0",
"platformdirs>=4.3.7",
"pygls>=2.0.0,<3.0.0",
"codeflash-benchmark",
"filelock",
"pytest-asyncio>=0.18.0",
]
[project.urls]
Homepage = "https://codeflash.ai"
[project.scripts]
codeflash = "codeflash.main:main"
[project.optional-dependencies]
[dependency-groups]
dev = [
"ipython>=8.12.0",
"mypy>=1.13",
"ruff>=0.7.0",
"lxml-stubs>=0.5.1",
"pandas-stubs>=2.2.2.240807, <2.2.3.241009",
"types-Pygments>=2.18.0.20240506",
"types-colorama>=0.4.15.20240311",
"types-decorator>=5.1.8.20240310",
"types-jsonschema>=4.23.0.20240813",
"types-requests>=2.32.0.20241016",
"types-six>=1.16.21.20241009",
"types-cffi>=1.16.0.20240331",
"types-openpyxl>=3.1.5.20241020",
"types-regex>=2024.9.11.20240912",
"types-python-dateutil>=2.9.0.20241003",
"types-gevent>=24.11.0.20241230,<25",
"types-greenlet>=3.1.0.20241221,<4",
"types-pexpect>=4.9.0.20241208,<5",
"types-unidiff>=0.7.0.20240505,<0.8",
"prek>=0.2.25",
"ty>=0.0.14",
"uv>=0.9.29",
]
tests = [
"black>=25.9.0",
"jax>=0.4.30",
"numpy>=2.0.2",
"pandas>=2.3.3",
"pyarrow>=15.0.0",
"pyrsistent>=0.20.0",
"scipy>=1.13.1",
"torch>=2.8.0",
"xarray>=2024.7.0",
"eval_type_backport",
"numba>=0.60.0",
"tensorflow>=2.20.0",
]
[tool.hatch.build.targets.sdist]
include = ["codeflash"]
exclude = [
"docs/*",
"experiments/*",
"tests/*",
"*.pyc",
"__pycache__",
"*.pyo",
"*.pyd",
"*.so",
"*.dylib",
"*.dll",
"*.exe",
"*.log",
"*.tmp",
".env",
".env.*",
"**/.env",
"**/.env.*",
".env.example",
"*.pem",
"*.key",
"secrets.*",
"config.yaml",
"config.json",
".git",
".gitignore",
".gitattributes",
".github",
"Dockerfile",
"docker-compose.yml",
"*.md",
"*.txt",
"*.csv",
"*.db",
"*.sqlite3",
"*.pdf",
"*.docx",
"*.xlsx",
"*.pptx",
"*.iml",
".idea",
".vscode",
".DS_Store",
"Thumbs.db",
"venv",
"env",
]
[tool.hatch.build.targets.wheel]
exclude = [
"docs/*",
"experiments/*",
"tests/*",
"*.pyc",
"__pycache__",
"*.pyo",
"*.pyd",
"*.so",
"*.dylib",
"*.dll",
"*.exe",
"*.log",
"*.tmp",
".env",
".env.*",
"**/.env",
"**/.env.*",
".env.example",
"*.pem",
"*.key",
"secrets.*",
"config.yaml",
"config.json",
".git",
".gitignore",
".gitattributes",
".github",
"Dockerfile",
"docker-compose.yml",
"*.md",
"*.txt",
"*.csv",
"*.db",
"*.sqlite3",
"*.pdf",
"*.docx",
"*.xlsx",
"*.pptx",
"*.iml",
".idea",
".vscode",
".DS_Store",
"Thumbs.db",
"venv",
"env",
]
[tool.mypy]
show_error_code_links = true
pretty = true
show_absolute_path = true
show_error_context = true
show_error_end = true
strict = true
warn_unreachable = true
install_types = true
plugins = ["pydantic.mypy"]
[[tool.mypy.overrides]]
module = ["jedi", "jedi.api.classes", "inquirer", "inquirer.themes", "numba"]
ignore_missing_imports = true
[tool.pydantic-mypy]
init_forbid_extra = true
init_typed = true
warn_required_dynamic_aliases = true
[tool.ruff]
target-version = "py39"
line-length = 120
fix = true
show-fixes = true
extend-exclude = ["code_to_optimize/", "pie_test_set/", "tests/", "experiments/"]
[tool.ruff.lint]
select = ["ALL"]
ignore = [
"N802",
"C901",
"D100",
"D101",
"D102",
"D103",
"D105",
"D107",
"D203", # incorrect-blank-line-before-class (incompatible with D211)
"D213", # multi-line-summary-second-line (incompatible with D212)
"S101",
"S603",
"S607",
"COM812",
"FIX002",
"PLR0912",
"PLR0913",
"PLR0915",
"TD002",
"TD003",
"TD004",
"PLR2004",
"UP007", # remove once we drop 3.9 support.
"E501",
"BLE001",
"ERA001",
"TRY003",
"EM101",
"T201",
"PGH004",
"S301",
"D104",
"PERF203",
"LOG015",
"PLC0415",
"UP045",
"TD007",
"D417",
"D401",
"S110", # try-except-pass - we do this a lot
"ARG002", # Unused method argument
# Added for multi-language branch
"FBT001", # Boolean positional argument
"FBT002", # Boolean default positional argument
"ANN401", # typing.Any disallowed
"ARG001", # Unused function argument (common in abstract/interface methods)
"TRY300", # Consider moving to else block
"FURB110", # if-exp-instead-of-or-operator - we prefer explicit if-else over "or"
"TRY401", # Redundant exception in logging.exception
"PLR0911", # Too many return statements
"PLW0603", # Global statement
"PLW2901", # Loop variable overwritten
"SIM102", # Nested if statements
"SIM103", # Return negated condition
"ANN001", # Missing type annotation
"PLC0206", # Dictionary items
"S314", # XML parsing (acceptable for dev tool)
"S608", # SQL injection (internal use only)
"S112", # try-except-continue
"PERF401", # List comprehension suggestion
"SIM108", # Ternary operator suggestion
"F841", # Unused variable (often intentional)
"ANN202", # Missing return type for private functions
"B009", # getattr-with-constant - needed to avoid mypy [misc] on dunder access
]
[tool.ruff.lint.flake8-type-checking]
strict = true
runtime-evaluated-base-classes = ["pydantic.BaseModel"]
runtime-evaluated-decorators = ["pydantic.validate_call", "pydantic.dataclasses.dataclass"]
[tool.ruff.lint.pep8-naming]
classmethod-decorators = [
# Allow Pydantic's `@validator` decorator to trigger class method treatment.
"pydantic.validator",
]
[tool.ruff.lint.isort]
split-on-trailing-comma = false
[tool.ruff.format]
docstring-code-format = true
skip-magic-trailing-comma = true
[tool.hatch.version]
source = "uv-dynamic-versioning"
[tool.uv]
workspace = { members = ["codeflash-benchmark"] }
[tool.uv.sources]
codeflash-benchmark = { workspace = true }
[tool.uv-dynamic-versioning]
enable = true
style = "pep440"
vcs = "git"
[tool.hatch.build.hooks.version]
path = "codeflash/version.py"
template = """# These version placeholders will be replaced by uv-dynamic-versioning during build.
__version__ = "{version}"
"""
#[tool.hatch.build.hooks.custom]
#path = "codeflash/update_license_version.py"
[tool.codeflash]
# All paths are relative to this pyproject.toml's directory.
module-root = "codeflash"
tests-root = "codeflash"
benchmarks-root = "tests/benchmarks"
ignore-paths = []
formatter-cmds = ["disabled"]
[tool.pytest.ini_options]
filterwarnings = [
"ignore::pytest.PytestCollectionWarning",
]
markers = [
"ci_skip: mark test to skip in CI environment",
]
[build-system]
requires = ["hatchling", "uv-dynamic-versioning"]
build-backend = "hatchling.build"

tessl.json Normal file
@ -0,0 +1,80 @@
{
"name": "codeflash",
"dependencies": {
"tessl/pypi-pytest": {
"version": "8.4.0"
},
"tessl/pypi-gitpython": {
"version": "3.1.0"
},
"tessl/pypi-libcst": {
"version": "1.8.0"
},
"tessl/pypi-jedi": {
"version": "0.19.0"
},
"tessl/pypi-tree-sitter": {
"version": "0.25.0"
},
"tessl/pypi-tomlkit": {
"version": "0.13.0"
},
"tessl/pypi-pydantic": {
"version": "1.10.0"
},
"tessl/pypi-humanize": {
"version": "4.13.0"
},
"tessl/pypi-posthog": {
"version": "6.7.0"
},
"tessl/pypi-click": {
"version": "8.2.0"
},
"tessl/pypi-inquirer": {
"version": "3.4.0"
},
"tessl/pypi-sentry-sdk": {
"version": "1.45.0"
},
"tessl/pypi-parameterized": {
"version": "0.9.0"
},
"tessl/pypi-dill": {
"version": "0.4.0"
},
"tessl/pypi-rich": {
"version": "13.9.0"
},
"tessl/pypi-lxml": {
"version": "5.4.0"
},
"tessl/pypi-crosshair-tool": {
"version": "0.0.0"
},
"tessl/pypi-coverage": {
"version": "7.10.0"
},
"tessl/pypi-platformdirs": {
"version": "4.4.0"
},
"tessl/pypi-pygls": {
"version": "1.3.0"
},
"tessl/pypi-filelock": {
"version": "3.19.0"
},
"codeflash/codeflash-rules": {
"version": "0.1.0"
},
"codeflash/codeflash-docs": {
"version": "0.1.0"
},
"codeflash/codeflash-skills": {
"version": "0.2.0"
},
"tessl-labs/tessl-skill-eval-scenarios": {
"version": "0.0.5"
}
}
}

@ -1,8 +1,8 @@
from argparse import Namespace
from pathlib import Path
from codeflash.discovery.functions_to_optimize import FunctionToOptimize
from codeflash.languages.python.context.code_context_extractor import get_code_optimization_context
from codeflash.models.models import FunctionParent
from codeflash.optimization.optimizer import Optimizer

@ -12,7 +12,7 @@ from pathlib import Path
import pytest
from junitparser import JUnitXml
from codeflash.languages.javascript.parse import jest_end_pattern, jest_start_pattern
class TestVitestJunitXmlFormat:

File diff suppressed because it is too large
@ -2,7 +2,7 @@ from textwrap import dedent
import pytest
from codeflash.languages.python.context.code_context_extractor import parse_code_and_prune_cst
from codeflash.models.models import CodeContextType

@ -2,7 +2,7 @@ from textwrap import dedent
import pytest
from codeflash.languages.python.context.code_context_extractor import parse_code_and_prune_cst
from codeflash.models.models import CodeContextType

@ -2,7 +2,7 @@ from textwrap import dedent
import pytest
from codeflash.languages.python.context.code_context_extractor import parse_code_and_prune_cst
from codeflash.models.models import CodeContextType

@ -20,14 +20,12 @@ All assertions use strict string equality to verify exact extraction output.
from __future__ import annotations
from pathlib import Path
import pytest
from codeflash.discovery.functions_to_optimize import FunctionToOptimize
from codeflash.languages.base import Language
from codeflash.languages.javascript.support import JavaScriptSupport, TypeScriptSupport
from codeflash.languages.python.context.code_context_extractor import get_code_optimization_context_for_language
@pytest.fixture

@ -106,9 +106,9 @@ class TestJavaScriptCodeContext:
def test_extract_code_context_for_javascript(self, js_project_dir):
"""Test extracting code context for a JavaScript function."""
skip_if_js_not_supported()
from codeflash.discovery.functions_to_optimize import find_all_functions_in_file
from codeflash.languages import current as lang_current
from codeflash.languages.python.context.code_context_extractor import get_code_optimization_context
lang_current._current_language = Language.JAVASCRIPT

@ -9,7 +9,6 @@ These tests verify the full optimization pipeline including:
This is the JavaScript equivalent of test_instrument_tests.py for Python.
"""
from pathlib import Path
from unittest.mock import MagicMock, patch
import pytest
@ -71,9 +70,9 @@ module.exports = { add };
def test_code_context_preserves_language(self, tmp_path):
"""Verify language is preserved in code context extraction."""
skip_if_js_not_supported()
from codeflash.discovery.functions_to_optimize import find_all_functions_in_file
from codeflash.languages import current as lang_current
from codeflash.languages.python.context.code_context_extractor import get_code_optimization_context
lang_current._current_language = Language.TYPESCRIPT
@ -164,7 +163,7 @@ export function add(a: number, b: number): number {
# Mock the AI service request
ai_client = AiServiceClient()
with patch.object(ai_client, "make_ai_service_request") as mock_request:
mock_response = MagicMock()
mock_response.status_code = 200
mock_response.json.return_value = {
@ -191,8 +190,8 @@ export function add(a: number, b: number): number {
# Verify the request was made with correct language
assert mock_request.called, "API request should have been made"
call_args = mock_request.call_args
payload = call_args[1].get("payload", call_args[0][1] if len(call_args[0]) > 1 else {})
assert payload.get("language") == "typescript", \
f"Expected language='typescript', got language='{payload.get('language')}'"
@ -462,7 +461,7 @@ class TestHelperFunctionLanguageAttribute:
"""Verify helper functions have language='javascript' for .js files."""
skip_if_js_not_supported()
from codeflash.discovery.functions_to_optimize import find_all_functions_in_file
from codeflash.languages import current as lang_current
from codeflash.optimization.function_optimizer import FunctionOptimizer
lang_current._current_language = Language.JAVASCRIPT

@ -69,7 +69,7 @@ class TestTypeScriptFunctionDiscovery:
from codeflash.discovery.functions_to_optimize import find_all_functions_in_file
with tempfile.NamedTemporaryFile(suffix=".ts", mode="w", delete=False) as f:
f.write("""
f.write(r"""
export function add(a: number, b: number): number {
return a + b;
}
@ -123,9 +123,9 @@ class TestTypeScriptCodeContext:
def test_extract_code_context_for_typescript(self, ts_project_dir):
"""Test extracting code context for a TypeScript function."""
skip_if_ts_not_supported()
from codeflash.discovery.functions_to_optimize import find_all_functions_in_file
from codeflash.languages import current as lang_current
from codeflash.languages.python.context.code_context_extractor import get_code_optimization_context
lang_current._current_language = Language.TYPESCRIPT
@ -201,7 +201,7 @@ function multiply(a: number, b: number): number {
from codeflash.languages import get_language_support
from codeflash.languages.base import FunctionInfo
original_source = """
original_source = r"""
interface Config {
timeout: number;
retries: number;
@ -212,7 +212,7 @@ function processConfig(config: Config): string {
}
"""
new_function = """function processConfig(config: Config): string {
new_function = r"""function processConfig(config: Config): string {
// Optimized with template caching
const { timeout, retries } = config;
return `timeout=\${timeout}, retries=\${retries}`;

@ -117,10 +117,10 @@ class TestVitestCodeContext:
def test_extract_code_context_for_typescript(self, vitest_project_dir):
"""Test extracting code context for a TypeScript function."""
skip_if_js_not_supported()
from codeflash.discovery.functions_to_optimize import find_all_functions_in_file
from codeflash.languages import current as lang_current
from codeflash.languages.base import Language
from codeflash.languages.python.context.code_context_extractor import get_code_optimization_context
lang_current._current_language = Language.TYPESCRIPT

@ -1,6 +1,6 @@
from codeflash.languages.python.context.unused_definition_remover import remove_unused_definitions_by_function_names
def test_variable_removal_only() -> None:

@ -5,8 +5,11 @@ from pathlib import Path
import pytest
from codeflash.discovery.functions_to_optimize import FunctionToOptimize
from codeflash.languages.python.context.unused_definition_remover import (
detect_unused_helper_functions,
revert_unused_helper_functions,
)
from codeflash.models.models import CodeStringsMarkdown
from codeflash.optimization.function_optimizer import FunctionOptimizer
from codeflash.verification.verification_utils import TestConfig

@ -0,0 +1,108 @@
# AI Service
How codeflash communicates with the AI optimization backend.
## `AiServiceClient` (`api/aiservice.py`)
The client connects to the AI service at `https://app.codeflash.ai` (or `http://localhost:8000` when `CODEFLASH_AIS_SERVER=local`).
Authentication uses Bearer token from `get_codeflash_api_key()`. All requests go through `make_ai_service_request()` which handles JSON serialization via Pydantic encoder.
Timeout: 90s for production, 300s for local.
## Endpoints
### `/ai/optimize` — Generate Candidates
Method: `optimize_code()`
Sends source code + dependency context to generate optimization candidates.
Payload:
- `source_code` — The read-writable code (markdown format)
- `dependency_code` — Read-only context code
- `trace_id` — Unique trace ID for the optimization run
- `language``"python"`, `"javascript"`, or `"typescript"`
- `n_candidates` — Number of candidates to generate (controlled by effort level)
- `is_async` — Whether the function is async
- `is_numerical_code` — Whether the code is numerical (affects optimization strategy)
Returns: `list[OptimizedCandidate]` with `source=OptimizedCandidateSource.OPTIMIZE`
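A minimal sketch of the payload shape, using the fields listed above (values are illustrative; `optimize_code()` assembles the real payload internally):

```python
# Hypothetical /ai/optimize payload; field names follow the list above.
payload = {
    "source_code": "```python:src/algo.py\ndef slow(n): ...\n```",  # markdown-formatted
    "dependency_code": "# read-only context code",
    "trace_id": "example-trace-id",  # unique per optimization run
    "language": "python",            # or "javascript" / "typescript"
    "n_candidates": 5,               # MEDIUM effort default (see Configuration)
    "is_async": False,
    "is_numerical_code": False,
}
```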
### `/ai/optimize_line_profiler` — Line-Profiler-Guided Candidates
Method: `optimize_python_code_line_profiler()`
Like `/optimize` but includes `line_profiler_results` to guide the LLM toward hot lines.
Returns: candidates with `source=OptimizedCandidateSource.OPTIMIZE_LP`
### `/ai/refine` — Refine Existing Candidate
Method: `refine_code()`
Request type: `AIServiceRefinerRequest`
Sends an existing candidate with runtime data and line profiler results to generate an improved version.
Key fields:
- `original_source_code` / `optimized_source_code` — Before and after
- `original_code_runtime` / `optimized_code_runtime` — Timing data
- `speedup` — Current speedup ratio
- `original_line_profiler_results` / `optimized_line_profiler_results`
Returns: candidates with `source=OptimizedCandidateSource.REFINE` and `parent_id` set to the refined candidate's ID
### `/ai/repair` — Fix Failed Candidate
Method: `repair_code()`
Request type: `AIServiceCodeRepairRequest`
Sends a failed candidate with test diffs showing what went wrong.
Key fields:
- `original_source_code` / `modified_source_code`
- `test_diffs: list[TestDiff]` — Each with `scope` (return_value/stdout/did_pass), original vs candidate values, and test source code
Returns: candidates with `source=OptimizedCandidateSource.REPAIR` and `parent_id` set
### `/ai/adaptive_optimize` — Multi-Candidate Adaptive
Method: `adaptive_optimize()`
Request type: `AIServiceAdaptiveOptimizeRequest`
Sends multiple previous candidates with their speedups for the LLM to learn from and generate better candidates.
Key fields:
- `candidates: list[AdaptiveOptimizedCandidate]` — Previous candidates with source code, explanation, source type, and speedup
Returns: candidates with `source=OptimizedCandidateSource.ADAPTIVE`
### `/ai/rewrite_jit` — JIT Rewrite
Method: `get_jit_rewritten_code()`
Rewrites code to use JIT compilation (e.g., Numba).
Returns: candidates with `source=OptimizedCandidateSource.JIT_REWRITE`
## Candidate Parsing
All endpoints return JSON with an `optimizations` array. Each entry has:
- `source_code` — Markdown-formatted code blocks
- `explanation` — LLM explanation
- `optimization_id` — Unique ID
- `parent_id` — Optional parent reference
- `model` — Which LLM model was used
`_get_valid_candidates()` parses the markdown code via `CodeStringsMarkdown.parse_markdown_code()` and filters out entries with empty code blocks.
## `LocalAiServiceClient`
Used when `CODEFLASH_EXPERIMENT_ID` is set. Mirrors `AiServiceClient` but sends to a separate experimental endpoint for A/B testing optimization strategies.
## LLM Call Sequencing
`AiServiceClient` tracks call sequence via `llm_call_counter` (itertools.count). Each request includes a `call_sequence` number, used by the backend to maintain conversation context across multiple calls for the same function.
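A sketch of that pattern (the starting value and exact attachment point are assumptions):

```python
import itertools

# Each client keeps one counter; every outgoing request records its position.
llm_call_counter = itertools.count(1)

def next_call_sequence() -> int:
    return next(llm_call_counter)  # 1, 2, 3, ... across calls for one function
```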

@ -0,0 +1,79 @@
# Configuration
Key configuration constants, effort levels, and thresholds.
## Constants (`code_utils/config_consts.py`)
### Test Execution
| Constant | Value | Description |
|----------|-------|-------------|
| `MAX_TEST_RUN_ITERATIONS` | 5 | Maximum test loop iterations |
| `INDIVIDUAL_TESTCASE_TIMEOUT` | 15s | Timeout per individual test case |
| `MAX_FUNCTION_TEST_SECONDS` | 60s | Max total time for function testing |
| `MAX_TEST_FUNCTION_RUNS` | 50 | Max test function executions |
| `MAX_CUMULATIVE_TEST_RUNTIME_NANOSECONDS` | 100ms | Max cumulative test runtime |
| `TOTAL_LOOPING_TIME` | 10s | Candidate benchmarking budget |
| `MIN_TESTCASE_PASSED_THRESHOLD` | 6 | Minimum test cases that must pass |
### Performance Thresholds
| Constant | Value | Description |
|----------|-------|-------------|
| `MIN_IMPROVEMENT_THRESHOLD` | 0.05 (5%) | Minimum speedup to accept a candidate |
| `MIN_THROUGHPUT_IMPROVEMENT_THRESHOLD` | 0.10 (10%) | Minimum async throughput improvement |
| `MIN_CONCURRENCY_IMPROVEMENT_THRESHOLD` | 0.20 (20%) | Minimum concurrency ratio improvement |
| `COVERAGE_THRESHOLD` | 60.0% | Minimum test coverage |
### Stability Thresholds
| Constant | Value | Description |
|----------|-------|-------------|
| `STABILITY_WINDOW_SIZE` | 0.35 | 35% of total iteration window |
| `STABILITY_CENTER_TOLERANCE` | 0.0025 | ±0.25% around median |
| `STABILITY_SPREAD_TOLERANCE` | 0.0025 | 0.25% window spread |
### Context Limits
| Constant | Value | Description |
|----------|-------|-------------|
| `OPTIMIZATION_CONTEXT_TOKEN_LIMIT` | 16000 | Max tokens for optimization context |
| `TESTGEN_CONTEXT_TOKEN_LIMIT` | 16000 | Max tokens for test generation context |
| `MAX_CONTEXT_LEN_REVIEW` | 1000 | Max context length for optimization review |
### Other
| Constant | Value | Description |
|----------|-------|-------------|
| `MIN_CORRECT_CANDIDATES` | 2 | Min correct candidates before skipping repair |
| `REPEAT_OPTIMIZATION_PROBABILITY` | 0.1 | Probability of re-optimizing a function |
| `DEFAULT_IMPORTANCE_THRESHOLD` | 0.001 | Minimum addressable time to consider a function |
| `CONCURRENCY_FACTOR` | 10 | Number of concurrent executions for concurrency benchmark |
| `REFINED_CANDIDATE_RANKING_WEIGHTS` | (2, 1) | (runtime, diff) weights — runtime 2x more important |
## Effort Levels
`EffortLevel` enum: `LOW`, `MEDIUM`, `HIGH`
Effort controls the number of candidates, repairs, and refinements:
| Key | LOW | MEDIUM | HIGH |
|-----|-----|--------|------|
| `N_OPTIMIZER_CANDIDATES` | 3 | 5 | 6 |
| `N_OPTIMIZER_LP_CANDIDATES` | 4 | 6 | 7 |
| `N_GENERATED_TESTS` | 2 | 2 | 2 |
| `MAX_CODE_REPAIRS_PER_TRACE` | 2 | 3 | 5 |
| `REPAIR_UNMATCHED_PERCENTAGE_LIMIT` | 0.2 | 0.3 | 0.4 |
| `TOP_VALID_CANDIDATES_FOR_REFINEMENT` | 2 | 3 | 4 |
| `ADAPTIVE_OPTIMIZATION_THRESHOLD` | 0 | 0 | 2 |
| `MAX_ADAPTIVE_OPTIMIZATIONS_PER_TRACE` | 0 | 0 | 4 |
Use `get_effort_value(EffortKeys.KEY, effort_level)` to retrieve values.
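For example (import paths are assumptions; the values come from the table above):

```python
from codeflash.code_utils.config_consts import EffortKeys, EffortLevel, get_effort_value

n_candidates = get_effort_value(EffortKeys.N_OPTIMIZER_CANDIDATES, EffortLevel.HIGH)
assert n_candidates == 6  # HIGH effort generates 6 optimizer candidates
```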
## Project Configuration
Configuration is read from `pyproject.toml` under `[tool.codeflash]`. Key settings are auto-detected by `setup/detector.py`:
- `module-root` — Root of the module to optimize
- `tests-root` — Root of test files
- `test-framework` — pytest, unittest, jest, etc.
- `formatter-cmds` — Code formatting commands

@ -0,0 +1,60 @@
# Context Extraction
How codeflash extracts and limits code context for optimization and test generation.
## Overview
Context extraction (`context/code_context_extractor.py`) builds a `CodeOptimizationContext` containing all code needed for the LLM to understand and optimize a function, split into:
- **Read-writable code** (`CodeContextType.READ_WRITABLE`): The function being optimized plus its helper functions — code the LLM is allowed to modify
- **Read-only context** (`CodeContextType.READ_ONLY`): Dependency code for reference — imports, type definitions, base classes
- **Testgen context** (`CodeContextType.TESTGEN`): Context for test generation, may include imported class definitions and external base class inits
- **Hashing context** (`CodeContextType.HASHING`): Used for deduplication of optimization runs
## Token Limits
Both optimization and test generation contexts are token-limited:
- `OPTIMIZATION_CONTEXT_TOKEN_LIMIT = 16000` tokens
- `TESTGEN_CONTEXT_TOKEN_LIMIT = 16000` tokens
Token counting uses `encoded_tokens_len()` from `code_utils/code_utils.py`. Functions whose context exceeds these limits are skipped.
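A sketch of the gating check, assuming direct access to the helper and constant named above:

```python
from codeflash.code_utils.code_utils import encoded_tokens_len
from codeflash.code_utils.config_consts import OPTIMIZATION_CONTEXT_TOKEN_LIMIT

def context_fits(context_code: str) -> bool:
    # Functions whose extracted context exceeds the limit are skipped.
    return encoded_tokens_len(context_code) <= OPTIMIZATION_CONTEXT_TOKEN_LIMIT
```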
## Context Building Process
### 1. Helper Discovery
For the target function (`FunctionToOptimize`), the extractor finds:
- **Helpers of the function**: Functions/classes in the same file that the target function calls
- **Helpers of helpers**: Transitive dependencies of the helper functions
These are organized as `dict[Path, set[FunctionSource]]` — mapping file paths to the set of helper functions found in each file.
### 2. Code Extraction
`extract_code_markdown_context_from_files()` builds `CodeStringsMarkdown` from the helper dictionaries. Each file's relevant code is extracted as a `CodeString` with its file path.
### 3. Testgen Context Enrichment
`build_testgen_context()` extends the basic context with:
- Imported class definitions (resolved from imports)
- External base class `__init__` methods
- External class `__init__` methods referenced in the context
### 4. Unused Definition Removal
`detect_unused_helper_functions()` and `remove_unused_definitions_by_function_names()` from `context/unused_definition_remover.py` prune definitions that are not transitively reachable from the target function, reducing token usage.
### 5. Deduplication
The hashing context (`hashing_code_context`) generates a hash (`hashing_code_context_hash`) used to detect when the same function context has already been optimized in a previous run, avoiding redundant work.
## Key Functions
| Function | Location | Purpose |
|----------|----------|---------|
| `build_testgen_context()` | `context/code_context_extractor.py` | Build enriched testgen context |
| `extract_code_markdown_context_from_files()` | `context/code_context_extractor.py` | Convert helper dicts to `CodeStringsMarkdown` |
| `detect_unused_helper_functions()` | `context/unused_definition_remover.py` | Find unused definitions |
| `remove_unused_definitions_by_function_names()` | `context/unused_definition_remover.py` | Remove unused definitions |
| `collect_top_level_defs_with_usages()` | `context/unused_definition_remover.py` | Analyze definition usage |
| `encoded_tokens_len()` | `code_utils/code_utils.py` | Count tokens in code |

@ -0,0 +1,153 @@
# Domain Types
Core data types used throughout the codeflash optimization pipeline.
## Function Representation
### `FunctionToOptimize` (`models/function_types.py`)
The canonical dataclass representing a function candidate for optimization. Works across Python, JavaScript, and TypeScript.
Key fields:
- `function_name: str` — The function name
- `file_path: Path` — Absolute file path where the function is located
- `parents: list[FunctionParent]` — Parent scopes (classes/functions), each with `name` and `type`
- `starting_line / ending_line: Optional[int]` — Line range (1-indexed)
- `is_async: bool` — Whether the function is async
- `is_method: bool` — Whether it belongs to a class
- `language: str` — Programming language (default: `"python"`)
Key properties:
- `qualified_name` — Full dotted name including parent classes (e.g., `MyClass.my_method`)
- `top_level_parent_name` — Name of outermost parent, or function name if no parents
- `class_name` — Immediate parent class name, or `None`
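A sketch tying the fields to the properties above (constructor usage is an assumption; names follow the lists above):

```python
from pathlib import Path

from codeflash.models.function_types import FunctionParent, FunctionToOptimize

fto = FunctionToOptimize(
    function_name="my_method",
    file_path=Path("/repo/src/my_class.py"),
    parents=[FunctionParent(name="MyClass", type="ClassDef")],
)
# fto.qualified_name        -> "MyClass.my_method"
# fto.top_level_parent_name -> "MyClass"
# fto.class_name            -> "MyClass"
```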
### `FunctionParent` (`models/function_types.py`)
Represents a parent scope: `name: str` (e.g., `"MyClass"`) and `type: str` (e.g., `"ClassDef"`).
### `FunctionSource` (`models/models.py`)
Represents a resolved function with source code. Used for helper functions in context extraction.
Fields: `file_path`, `qualified_name`, `fully_qualified_name`, `only_function_name`, `source_code`, `jedi_definition`.
## Code Representation
### `CodeString` (`models/models.py`)
A single code block with validated syntax:
- `code: str` — The source code
- `file_path: Optional[Path]` — Origin file path
- `language: str` — Language for validation (default: `"python"`)
Validates syntax on construction via `model_validator`.
### `CodeStringsMarkdown` (`models/models.py`)
A collection of `CodeString` blocks — the primary format for passing code through the pipeline.
Key properties:
- `.flat` — Combined source code with file-path comment prefixes (e.g., `# file: path/to/file.py`)
- `.markdown` — Markdown-formatted with fenced code blocks: `` ```python:filepath\ncode\n``` ``
- `.file_to_path()` — Dict mapping file path strings to code
Static method:
- `parse_markdown_code(markdown_code, expected_language)` — Parses markdown code blocks back into `CodeStringsMarkdown`
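A sketch of the round trip (file path and code are illustrative; signature per the description above):

```python
from codeflash.models.models import CodeStringsMarkdown

markdown = "```python:src/utils.py\ndef double(x):\n    return x * 2\n```"
blocks = CodeStringsMarkdown.parse_markdown_code(markdown, expected_language="python")
# blocks.markdown reproduces the fenced form; blocks.flat prefixes file-path comments.
```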
## Optimization Context
### `CodeOptimizationContext` (`models/models.py`)
Holds all code context needed for optimization:
- `read_writable_code: CodeStringsMarkdown` — Code the LLM can modify
- `read_only_context_code: str` — Reference-only dependency code
- `testgen_context: CodeStringsMarkdown` — Context for test generation
- `hashing_code_context: str` / `hashing_code_context_hash: str` — For deduplication
- `helper_functions: list[FunctionSource]` — Helper functions in the writable code
- `preexisting_objects: set[tuple[str, tuple[FunctionParent, ...]]]` — Objects that already exist in the code
### `CodeContextType` enum (`models/models.py`)
Defines context categories: `READ_WRITABLE`, `READ_ONLY`, `TESTGEN`, `HASHING`.
## Candidates
### `OptimizedCandidate` (`models/models.py`)
A generated code variant:
- `source_code: CodeStringsMarkdown` — The optimized code
- `explanation: str` — LLM explanation of the optimization
- `optimization_id: str` — Unique identifier
- `source: OptimizedCandidateSource` — How it was generated
- `parent_id: str | None` — ID of parent candidate (for refinements/repairs)
- `model: str | None` — Which LLM model generated it
### `OptimizedCandidateSource` enum (`models/models.py`)
How a candidate was generated: `OPTIMIZE`, `OPTIMIZE_LP` (line profiler), `REFINE`, `REPAIR`, `ADAPTIVE`, `JIT_REWRITE`.
### `CandidateEvaluationContext` (`models/models.py`)
Tracks state during candidate evaluation:
- `speedup_ratios` / `optimized_runtimes` / `is_correct` — Per-candidate results
- `ast_code_to_id` — Deduplication map (normalized AST → first seen candidate)
- `valid_optimizations` — Candidates that passed all checks
Key methods: `record_failed_candidate()`, `record_successful_candidate()`, `handle_duplicate_candidate()`, `register_new_candidate()`.
## Baseline & Results
### `OriginalCodeBaseline` (`models/models.py`)
Baseline measurements for the original code:
- `behavior_test_results: TestResults` / `benchmarking_test_results: TestResults`
- `line_profile_results: dict`
- `runtime: int` — Total runtime in nanoseconds
- `coverage_results: Optional[CoverageData]`
### `BestOptimization` (`models/models.py`)
The winning candidate after evaluation:
- `candidate: OptimizedCandidate`
- `helper_functions: list[FunctionSource]`
- `code_context: CodeOptimizationContext`
- `runtime: int`
- `winning_behavior_test_results` / `winning_benchmarking_test_results: TestResults`
## Test Types
### `TestType` enum (`models/test_type.py`)
- `EXISTING_UNIT_TEST` (1) — Pre-existing tests from the codebase
- `INSPIRED_REGRESSION` (2) — Tests inspired by existing tests
- `GENERATED_REGRESSION` (3) — AI-generated regression tests
- `REPLAY_TEST` (4) — Tests from recorded benchmark data
- `CONCOLIC_COVERAGE_TEST` (5) — Coverage-guided tests
- `INIT_STATE_TEST` (6) — Class init state verification
### `TestFile` / `TestFiles` (`models/models.py`)
`TestFile` represents a single test file with `instrumented_behavior_file_path`, optional `benchmarking_file_path`, `original_file_path`, `test_type`, and `tests_in_file`.
`TestFiles` is a collection with lookup methods: `get_by_type()`, `get_by_original_file_path()`, `get_test_type_by_instrumented_file_path()`.
### `TestResults` (`models/models.py`)
Collection of `FunctionTestInvocation` results with indexed lookup. Key methods:
- `add(invocation)` — Deduplicated insert
- `total_passed_runtime()` — Sum of minimum runtimes per test case (nanoseconds)
- `number_of_loops()` — Max loop index across all results
- `usable_runtime_data_by_test_case()` — Dict of invocation ID → list of runtimes
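A sketch of the `total_passed_runtime()` aggregation (minimum runtime per test case, then summed):

```python
# Nanosecond runtimes grouped by invocation ID (illustrative data).
by_test_case = {"test_a": [1200, 1100, 1150], "test_b": [800, 790]}
total = sum(min(times) for times in by_test_case.values())  # 1100 + 790 = 1890
```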
## Result Type
### `Result[L, R]` / `Success` / `Failure` (`either.py`)
Functional error handling type:
- `Success(value)` — Wraps a successful result
- `Failure(error)` — Wraps an error
- `result.is_successful()` / `result.is_failure()` — Check type
- `result.unwrap()` — Get success value (raises if Failure)
- `result.failure()` — Get failure value (raises if Success)
- `is_successful(result)` — Module-level helper function
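A usage sketch (assuming `either.py` sits at the package root):

```python
from codeflash.either import Failure, Success, is_successful

def safe_div(a: float, b: float):
    if b == 0:
        return Failure("division by zero")
    return Success(a / b)

result = safe_div(10, 4)
if is_successful(result):
    print(result.unwrap())   # 2.5
else:
    print(result.failure())  # error value
```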

@ -0,0 +1,41 @@
# Codeflash Internal Documentation
CodeFlash is an AI-powered code optimizer for Python, JavaScript, and TypeScript that automatically improves code performance while maintaining correctness. It uses LLMs to generate optimization candidates, verifies correctness through test execution, and benchmarks performance improvements.
## Pipeline Overview
```
Discovery → Ranking → Context Extraction → Test Gen + Optimization → Baseline → Candidate Evaluation → PR
```
1. **Discovery** (`discovery/`): Find optimizable functions across the codebase using `FunctionVisitor`
2. **Ranking** (`benchmarking/function_ranker.py`): Rank functions by addressable time using trace data
3. **Context** (`context/`): Extract code dependencies — split into read-writable (modifiable) and read-only (reference)
4. **Optimization** (`optimization/`, `api/`): Generate candidates via AI service, runs concurrently with test generation
5. **Verification** (`verification/`): Run candidates against tests via custom pytest plugin, compare outputs
6. **Benchmarking** (`benchmarking/`): Measure performance, select best candidate by speedup
7. **Result** (`result/`, `github/`): Create PR with winning optimization
## Key Entry Points
| Task | File |
|------|------|
| CLI arguments & commands | `cli_cmds/cli.py` |
| Optimization orchestration | `optimization/optimizer.py``Optimizer.run()` |
| Per-function optimization | `optimization/function_optimizer.py``FunctionOptimizer` |
| Function discovery | `discovery/functions_to_optimize.py` |
| Context extraction | `context/code_context_extractor.py` |
| Test execution | `verification/test_runner.py`, `verification/pytest_plugin.py` |
| Performance ranking | `benchmarking/function_ranker.py` |
| Domain types | `models/models.py`, `models/function_types.py` |
| AI service | `api/aiservice.py``AiServiceClient` |
| Configuration | `code_utils/config_consts.py` |
## Documentation Pages
- [Domain Types](domain-types.md) — Core data types and their relationships
- [Optimization Pipeline](optimization-pipeline.md) — Step-by-step data flow through the pipeline
- [Context Extraction](context-extraction.md) — How code context is extracted and token-limited
- [Verification](verification.md) — Test execution, pytest plugin, deterministic patches
- [AI Service](ai-service.md) — AI service client endpoints and request types
- [Configuration](configuration.md) — Config schema, effort levels, thresholds

@ -0,0 +1,84 @@
# Optimization Pipeline
Step-by-step data flow from function discovery to PR creation.
## 1. Entry Point: `Optimizer.run()` (`optimization/optimizer.py`)
The `Optimizer` class is initialized with CLI args and creates:
- `TestConfig` with test roots, project root, pytest command
- `AiServiceClient` for AI service communication
- Optional `LocalAiServiceClient` for experiments
`run()` orchestrates the full pipeline: discovers functions, optionally ranks them, then optimizes each in turn.
## 2. Function Discovery (`discovery/functions_to_optimize.py`)
`FunctionVisitor` traverses source files to find optimizable functions, producing `FunctionToOptimize` instances. Filters include:
- Skipping functions that are too small or trivial
- Skipping previously optimized functions (via `was_function_previously_optimized()`)
- Applying user-configured include/exclude patterns
## 3. Function Ranking (`benchmarking/function_ranker.py`)
When trace data is available, `FunctionRanker` ranks functions by **addressable time** — the time a function spends that could be optimized (own time + callee time / call count). Functions below `DEFAULT_IMPORTANCE_THRESHOLD=0.001` are skipped.
## 4. Per-Function Optimization: `FunctionOptimizer` (`optimization/function_optimizer.py`)
For each function, `FunctionOptimizer.optimize_function()` runs the full optimization loop:
### 4a. Context Extraction (`context/code_context_extractor.py`)
Extracts `CodeOptimizationContext` containing:
- `read_writable_code` — Code the LLM can modify (the function + helpers)
- `read_only_context_code` — Dependency code for reference only
- `testgen_context` — Context for test generation (may include imported class definitions)
Token limits are enforced: `OPTIMIZATION_CONTEXT_TOKEN_LIMIT=16000` and `TESTGEN_CONTEXT_TOKEN_LIMIT=16000`. Functions exceeding these are rejected.
### 4b. Concurrent Test Generation + LLM Optimization
These run in parallel using `concurrent.futures`:
- **Test generation**: Generates regression tests from the function context
- **LLM optimization**: Sends `read_writable_code.markdown` + `read_only_context_code` to the AI service
The number of candidates depends on effort level (see Configuration docs).
### 4c. Candidate Evaluation
For each `OptimizedCandidate`:
1. **Deduplication**: Normalize code AST and check against `CandidateEvaluationContext.ast_code_to_id`. If duplicate, copy results from previous evaluation.
2. **Code replacement**: Replace the original function with the candidate using `replace_function_definitions_in_module()`.
3. **Behavioral testing**: Run instrumented tests in subprocess. The custom pytest plugin applies deterministic patches. Compare return values, stdout, and pass/fail status against the original baseline.
4. **Benchmarking**: If behavior matches, run performance tests with looping (`TOTAL_LOOPING_TIME=10s`). Calculate speedup ratio.
5. **Validation**: Candidate must beat `MIN_IMPROVEMENT_THRESHOLD=0.05` (5% speedup) and pass stability checks.
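In sketch form (helper names are illustrative; the real loop lives in `FunctionOptimizer`):

```python
from typing import Callable, Iterable

MIN_IMPROVEMENT_THRESHOLD = 0.05  # from config_consts

def evaluate_candidates(
    candidates: Iterable,
    dedup_key: Callable,         # step 1: normalized-AST key
    apply_candidate: Callable,   # step 2: code replacement
    behavior_matches: Callable,  # step 3: behavioral testing
    speedup_of: Callable,        # step 4: benchmarking
) -> list:
    seen, valid = set(), []
    for cand in candidates:
        key = dedup_key(cand)
        if key in seen:
            continue  # duplicate: results are copied from the first evaluation
        seen.add(key)
        apply_candidate(cand)
        if not behavior_matches(cand):
            continue  # behavior mismatch rejects the candidate
        if speedup_of(cand) > MIN_IMPROVEMENT_THRESHOLD:
            valid.append(cand)  # step 5: must also pass stability checks
    return valid
```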
### 4d. Refinement & Repair
- **Repair**: If fewer than `MIN_CORRECT_CANDIDATES=2` pass, failed candidates can be repaired via `AIServiceCodeRepairRequest` (sends test diffs to LLM).
- **Refinement**: Top valid candidates are refined via `AIServiceRefinerRequest` (sends runtime data, line profiler results).
- **Adaptive**: At HIGH effort, additional adaptive optimization rounds via `AIServiceAdaptiveOptimizeRequest`.
### 4e. Best Candidate Selection
The winning candidate is selected by:
1. Highest speedup ratio
2. For tied speedups, shortest diff length from original
3. Refinement candidates use weighted ranking: `(2 * runtime_rank + 1 * diff_rank)`
Result is a `BestOptimization` with the candidate, context, test results, and runtime.
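A worked example of the refinement weighting (assuming lower rank is better):

```python
# REFINED_CANDIDATE_RANKING_WEIGHTS = (2, 1): runtime rank counts double.
ranks = {"A": (1, 2), "B": (2, 1)}  # candidate -> (runtime_rank, diff_rank)
scores = {name: 2 * rt + 1 * diff for name, (rt, diff) in ranks.items()}
best = min(scores, key=scores.get)  # A scores 4, B scores 5, so A wins
```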
## 5. PR Creation (`github/`)
If a winning candidate is found, a PR is created with:
- The optimized code diff
- Performance benchmark details
- Explanation from the LLM
## Worktree Mode
When `--worktree` is enabled, optimization runs in an isolated git worktree (`code_utils/git_worktree_utils.py`). This allows parallel optimization without affecting the working tree. Changes are captured as patch files.

@ -0,0 +1,93 @@
# Verification
How codeflash verifies candidate correctness and measures performance.
## Test Execution Architecture
Tests are executed in a **subprocess** to isolate the test environment from the main codeflash process. The test runner (`verification/test_runner.py`) invokes pytest (or Jest for JS/TS) with specific plugin configurations.
### Plugin Blocklists
- **Behavioral tests**: Block `benchmark`, `codspeed`, `xdist`, `sugar`
- **Benchmarking tests**: Block `codspeed`, `cov`, `benchmark`, `profiling`, `xdist`, `sugar`
These are defined as `BEHAVIORAL_BLOCKLISTED_PLUGINS` and `BENCHMARKING_BLOCKLISTED_PLUGINS` in `verification/test_runner.py`.
## Custom Pytest Plugin (`verification/pytest_plugin.py`)
The plugin is loaded into the test subprocess and provides:
### Deterministic Patches
`_apply_deterministic_patches()` replaces non-deterministic functions with fixed values to ensure reproducible test output:
| Module | Function | Fixed Value |
|--------|----------|-------------|
| `time` | `time()` | `1761717605.108106` |
| `time` | `perf_counter()` | Incrementing by 1ms per call |
| `datetime` | `datetime.now()` | `2021-01-01 02:05:10 UTC` |
| `datetime` | `datetime.utcnow()` | `2021-01-01 02:05:10 UTC` |
| `uuid` | `uuid4()` / `uuid1()` | `12345678-1234-5678-9abc-123456789012` |
| `random` | `random()` | `0.123456789` (seeded with 42) |
| `os` | `urandom(n)` | `b"\x42" * n` |
| `numpy.random` | seed | `42` |
Patches call the original function first to maintain performance characteristics (same call overhead).
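A sketch of that pattern for `time.time` (the real patches live in `_apply_deterministic_patches()`):

```python
import time

_original_time = time.time

def _deterministic_time() -> float:
    _original_time()          # keep the original call overhead
    return 1761717605.108106  # fixed value from the table above

time.time = _deterministic_time
```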
### Timing Markers
Test results include timing markers in stdout: `!######<id>:<duration_ns>######!`
The pattern `_TIMING_MARKER_PATTERN` extracts timing data for calculating function utilization fraction.
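An equivalent regex sketch (the plugin's actual `_TIMING_MARKER_PATTERN` may differ):

```python
import re

# Matches markers like "!######test_add:154000######!" in captured stdout.
TIMING_MARKER = re.compile(r"!######(?P<id>.+?):(?P<duration_ns>\d+)######!")

for m in TIMING_MARKER.finditer("noise !######test_add:154000######! noise"):
    print(m["id"], int(m["duration_ns"]))  # test_add 154000
```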
### Loop Stability
Performance benchmarking uses configurable stability thresholds:
- `STABILITY_WINDOW_SIZE = 0.35` (35% of total iterations)
- `STABILITY_CENTER_TOLERANCE = 0.0025` (±0.25% around median)
- `STABILITY_SPREAD_TOLERANCE = 0.0025` (0.25% window spread)
### Memory Limits (Linux)
On Linux, the plugin sets `RLIMIT_AS` to 85% of total system memory (RAM + swap) to prevent OOM kills.
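A sketch of the cap (total-memory detection is left as a parameter; the 85% fraction comes from the text above):

```python
import resource  # POSIX-only, matching the Linux-only behavior

def cap_address_space(total_memory_bytes: int, fraction: float = 0.85) -> None:
    limit = int(total_memory_bytes * fraction)
    resource.setrlimit(resource.RLIMIT_AS, (limit, limit))
```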
## Test Result Processing
### `TestResults` (`models/models.py`)
Collects `FunctionTestInvocation` results with:
- Deduplicated insertion via `unique_invocation_loop_id`
- `total_passed_runtime()` — Sum of minimum runtimes per test case (nanoseconds)
- `number_of_loops()` — Max loop index
- `usable_runtime_data_by_test_case()` — Grouped timing data
### `FunctionTestInvocation`
Each invocation records:
- `loop_index` — Iteration number (starts at 1)
- `id: InvocationId` — Fully qualified test identifier
- `did_pass: bool` — Pass/fail status
- `runtime: Optional[int]` — Time in nanoseconds
- `return_value: Optional[object]` — Captured return value
- `test_type: TestType` — Which test category
### Behavioral vs Performance Testing
1. **Behavioral**: Runs with `TestingMode.BEHAVIOR`. Compares return values and stdout between original and candidate. Any difference = candidate rejected.
2. **Performance**: Runs with `TestingMode.PERFORMANCE`. Loops for `TOTAL_LOOPING_TIME=10s` to get stable timing. Calculates speedup ratio.
3. **Line Profile**: Runs with `TestingMode.LINE_PROFILE`. Collects per-line timing data for refinement.
## Test Types
| TestType | Value | Description |
|----------|-------|-------------|
| `EXISTING_UNIT_TEST` | 1 | Pre-existing tests from the codebase |
| `INSPIRED_REGRESSION` | 2 | Tests inspired by existing tests |
| `GENERATED_REGRESSION` | 3 | AI-generated regression tests |
| `REPLAY_TEST` | 4 | Tests from recorded benchmark data |
| `CONCOLIC_COVERAGE_TEST` | 5 | Coverage-guided tests |
| `INIT_STATE_TEST` | 6 | Class init state verification |
## Coverage
Coverage is measured via `CoverageData` with a threshold of `COVERAGE_THRESHOLD=60.0%`. Low coverage may affect confidence in the optimization's correctness.

@ -0,0 +1,118 @@
{
"package_name": "codeflash-docs",
"total_capabilities": 16,
"capabilities": [
{
"id": 0,
"name": "pipeline-stage-ordering",
"description": "Know the correct ordering of codeflash pipeline stages: Discovery → Ranking → Context Extraction → Test Gen + Optimization (concurrent) → Baseline → Candidate Evaluation → PR",
"complexity": "basic",
"api_elements": ["Optimizer.run()", "FunctionOptimizer.optimize_function()"]
},
{
"id": 1,
"name": "function-to-optimize-fields",
"description": "Know FunctionToOptimize key fields (function_name, file_path, parents, starting_line/ending_line, is_async, is_method, language) and properties (qualified_name, top_level_parent_name, class_name)",
"complexity": "intermediate",
"api_elements": ["FunctionToOptimize", "FunctionParent", "models/function_types.py"]
},
{
"id": 2,
"name": "code-strings-markdown-format",
"description": "Know that code is serialized as markdown fenced blocks with language:filepath syntax (```python:filepath\\ncode\\n```) and parsed via CodeStringsMarkdown.parse_markdown_code()",
"complexity": "intermediate",
"api_elements": ["CodeStringsMarkdown", "CodeString", ".markdown", ".flat", "parse_markdown_code()"]
},
{
"id": 3,
"name": "read-writable-vs-read-only",
"description": "Distinguish read_writable_code (LLM can modify) from read_only_context_code (reference only) in CodeOptimizationContext",
"complexity": "basic",
"api_elements": ["CodeOptimizationContext", "read_writable_code", "read_only_context_code"]
},
{
"id": 4,
"name": "candidate-source-types",
"description": "Know OptimizedCandidateSource variants: OPTIMIZE, OPTIMIZE_LP, REFINE, REPAIR, ADAPTIVE, JIT_REWRITE and when each is used",
"complexity": "intermediate",
"api_elements": ["OptimizedCandidateSource", "OptimizedCandidate"]
},
{
"id": 5,
"name": "candidate-forest-dag",
"description": "Know that candidates form a forest/DAG via parent_id references where refinements and repairs build on previous candidates",
"complexity": "intermediate",
"api_elements": ["parent_id", "OptimizedCandidate", "CandidateForest"]
},
{
"id": 6,
"name": "concurrent-testgen-optimization",
"description": "Know that test generation and LLM optimization run concurrently using concurrent.futures, not sequentially",
"complexity": "intermediate",
"api_elements": ["concurrent.futures", "FunctionOptimizer.optimize_function()"]
},
{
"id": 7,
"name": "deterministic-patch-values",
"description": "Know the specific fixed values used by deterministic patches: time=1761717605.108106, datetime=2021-01-01 02:05:10 UTC, uuid=12345678-1234-5678-9abc-123456789012, random seeded with 42",
"complexity": "advanced",
"api_elements": ["_apply_deterministic_patches()", "pytest_plugin.py"]
},
{
"id": 8,
"name": "test-type-enum",
"description": "Know the 6 TestType variants: EXISTING_UNIT_TEST, INSPIRED_REGRESSION, GENERATED_REGRESSION, REPLAY_TEST, CONCOLIC_COVERAGE_TEST, INIT_STATE_TEST",
"complexity": "basic",
"api_elements": ["TestType", "models/test_type.py"]
},
{
"id": 9,
"name": "ai-service-endpoints",
"description": "Know the AI service endpoints: /ai/optimize, /ai/optimize_line_profiler, /ai/refine, /ai/repair, /ai/adaptive_optimize, /ai/rewrite_jit",
"complexity": "intermediate",
"api_elements": ["AiServiceClient", "api/aiservice.py"]
},
{
"id": 10,
"name": "repair-request-structure",
"description": "Know that AIServiceCodeRepairRequest includes TestDiff objects with scope (RETURN_VALUE/STDOUT/DID_PASS), original vs candidate values, and test source code",
"complexity": "advanced",
"api_elements": ["AIServiceCodeRepairRequest", "TestDiff", "TestDiffScope"]
},
{
"id": 11,
"name": "effort-level-values",
"description": "Know specific effort level values: LOW gets 3 candidates, MEDIUM gets 5, HIGH gets 6 (N_OPTIMIZER_CANDIDATES)",
"complexity": "intermediate",
"api_elements": ["EffortLevel", "N_OPTIMIZER_CANDIDATES", "EFFORT_VALUES"]
},
{
"id": 12,
"name": "context-token-limits",
"description": "Know OPTIMIZATION_CONTEXT_TOKEN_LIMIT=16000 and TESTGEN_CONTEXT_TOKEN_LIMIT=16000 and that encoded_tokens_len() is used for counting",
"complexity": "basic",
"api_elements": ["OPTIMIZATION_CONTEXT_TOKEN_LIMIT", "TESTGEN_CONTEXT_TOKEN_LIMIT", "encoded_tokens_len()"]
},
{
"id": 13,
"name": "best-candidate-selection",
"description": "Know the selection criteria: highest speedup, then shortest diff for ties, and refinement weighted ranking (2*runtime + 1*diff)",
"complexity": "advanced",
"api_elements": ["BestOptimization", "REFINED_CANDIDATE_RANKING_WEIGHTS"]
},
{
"id": 14,
"name": "plugin-blocklists",
"description": "Know behavioral test blocklisted plugins (benchmark, codspeed, xdist, sugar) and benchmarking blocklist (adds cov, profiling)",
"complexity": "intermediate",
"api_elements": ["BEHAVIORAL_BLOCKLISTED_PLUGINS", "BENCHMARKING_BLOCKLISTED_PLUGINS"]
},
{
"id": 15,
"name": "result-type-usage",
"description": "Know that Result[L,R] from either.py uses Success(value)/Failure(error) with is_successful() check before unwrap()",
"complexity": "basic",
"api_elements": ["Result", "Success", "Failure", "is_successful", "either.py"]
}
]
}

View file

@ -0,0 +1 @@
Code serialization format and context splitting

View file

@ -0,0 +1,21 @@
{
"context": "Tests whether the agent knows the CodeStringsMarkdown serialization format and the distinction between read-writable and read-only code context in the codeflash pipeline.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Markdown code block format",
"description": "Uses the correct fenced code block format with language:filepath syntax (```python:path/to/file.py) when constructing code for the AI service, NOT plain code blocks without file paths",
"max_score": 30
},
{
"name": "Read-writable vs read-only split",
"description": "Correctly separates code into read_writable_code (code the LLM can modify) and read_only_context_code (reference-only dependency code), NOT treating all code as modifiable",
"max_score": 35
},
{
"name": "parse_markdown_code usage",
"description": "Uses CodeStringsMarkdown.parse_markdown_code() to parse AI service responses back into structured code, NOT manual string splitting or regex",
"max_score": 35
}
]
}

View file

@ -0,0 +1,35 @@
# Format Code for AI Service Request
## Context
You are working on the codeflash optimization engine. The AI service accepts optimization requests with source code and dependency context. A function `calculate_total` in `analytics/metrics.py` needs to be optimized. It calls a helper `normalize_values` in the same file (both modifiable), and imports `BaseMetric` from `analytics/base.py` (not modifiable, just for reference).
```python
# analytics/metrics.py
from analytics.base import BaseMetric
def normalize_values(data: list[float]) -> list[float]:
max_val = max(data)
return [x / max_val for x in data]
def calculate_total(metrics: list[BaseMetric]) -> float:
values = [m.value for m in metrics]
normalized = normalize_values(values)
return sum(normalized)
```
```python
# analytics/base.py
class BaseMetric:
def __init__(self, name: str, value: float):
self.name = name
self.value = value
```
## Task
Write a Python function `prepare_optimization_payload` that constructs the code payload for an AI service optimization request for `calculate_total`. It should properly format the source code and dependency code, and include a function to parse the AI service response back into structured code objects.
## Expected Outputs
- A Python file `payload_builder.py` with the payload construction and response parsing logic

View file

@ -0,0 +1 @@
Candidate source types and DAG relationships

View file

@ -0,0 +1,26 @@
{
"context": "Tests whether the agent knows the different OptimizedCandidateSource types and how candidates form a DAG via parent_id references in the codeflash pipeline.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Lists source types",
"description": "Identifies at least 4 of the 6 OptimizedCandidateSource variants: OPTIMIZE, OPTIMIZE_LP, REFINE, REPAIR, ADAPTIVE, JIT_REWRITE",
"max_score": 25
},
{
"name": "Parent ID linkage",
"description": "Explains that REFINE and REPAIR candidates reference their parent via parent_id, creating a DAG/forest structure, NOT independent candidates",
"max_score": 25
},
{
"name": "Refinement uses runtime data",
"description": "States that refinement sends runtime data and line profiler results to the AI service (AIServiceRefinerRequest), NOT just the source code",
"max_score": 25
},
{
"name": "Repair uses test diffs",
"description": "States that repair sends test failure diffs (TestDiff with scope: RETURN_VALUE/STDOUT/DID_PASS) to the AI service, NOT just error messages",
"max_score": 25
}
]
}

View file

@ -0,0 +1,13 @@
# Document the Candidate Lifecycle
## Context
A new engineer is joining the codeflash team and needs to understand how optimization candidates are generated, improved, and related to each other throughout the pipeline. They've asked for a clear explanation of the different ways candidates are produced and how the system iterates on them.
## Task
Write a technical document explaining the full lifecycle of an optimization candidate in codeflash — from initial generation through improvement iterations. Cover all the different ways candidates can be created, what data is sent to the AI service for each type, and how candidates relate to each other structurally.
## Expected Outputs
- A markdown file `candidate-lifecycle.md`

View file

@ -0,0 +1 @@
Deterministic patch values and test execution architecture

View file

@ -0,0 +1,31 @@
{
"context": "Tests whether the agent knows the specific deterministic patch values used in codeflash's pytest plugin and the subprocess-based test execution architecture.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Subprocess isolation",
"description": "States that tests run in a subprocess to isolate the test environment from the main codeflash process, NOT in the same process",
"max_score": 20
},
{
"name": "Fixed time value",
"description": "References the specific fixed timestamp 1761717605.108106 for time.time() or the fixed datetime 2021-01-01 02:05:10 UTC for datetime.now()",
"max_score": 20
},
{
"name": "Fixed UUID value",
"description": "References the specific fixed UUID 12345678-1234-5678-9abc-123456789012 for uuid4/uuid1",
"max_score": 20
},
{
"name": "Random seed",
"description": "States that random is seeded with 42 (NOT a different seed value)",
"max_score": 20
},
{
"name": "Plugin blocklists",
"description": "Mentions that behavioral tests block specific pytest plugins (at least 2 of: benchmark, codspeed, xdist, sugar) to ensure deterministic execution",
"max_score": 20
}
]
}

View file

@ -0,0 +1,13 @@
# Explain Test Reproducibility Guarantees
## Context
A codeflash user notices that their optimization candidate passes behavioral tests on one run but fails on the next. They suspect non-determinism in the test execution. They want to understand what guarantees codeflash provides for test reproducibility and how the system ensures consistent results.
## Task
Write a technical explanation of how codeflash ensures deterministic test execution. Cover the execution environment setup, what sources of non-determinism are controlled, and any specific values or configurations used. Also explain the test execution architecture.
## Expected Outputs
- A markdown file `test-reproducibility.md`

View file

@ -0,0 +1 @@
Effort level configuration and candidate selection criteria

View file

@ -0,0 +1,26 @@
{
"context": "Tests whether the agent knows the specific effort level values for candidate generation and the criteria used to select the best optimization candidate.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Candidate counts by effort",
"description": "States correct N_OPTIMIZER_CANDIDATES values: LOW=3, MEDIUM=5, HIGH=6 (at least 2 of 3 correct)",
"max_score": 25
},
{
"name": "Speedup as primary selector",
"description": "States that the winning candidate is selected primarily by highest speedup ratio",
"max_score": 25
},
{
"name": "Diff length as tiebreaker",
"description": "States that for tied speedups, shortest diff length from original is used as tiebreaker",
"max_score": 25
},
{
"name": "Refinement ranking weights",
"description": "States that refinement candidates use weighted ranking with runtime weighted more heavily than diff (2:1 ratio or REFINED_CANDIDATE_RANKING_WEIGHTS=(2,1))",
"max_score": 25
}
]
}

View file

@ -0,0 +1,18 @@
# Design a Candidate Selection Dashboard
## Context
The codeflash team wants to build a dashboard that shows users how optimization candidates were evaluated and why a particular candidate won. The dashboard needs to display the selection process at each stage, from initial candidate pool through to the final winner.
## Task
Write a specification document for the dashboard that explains:
1. How many candidates are generated at each effort level
2. The exact criteria and order of operations used to pick the winning candidate
3. How refinement candidates are ranked differently from initial candidates
Include concrete examples showing how two hypothetical candidates would be compared.
## Expected Outputs
- A markdown file `selection-dashboard-spec.md`

View file

@ -0,0 +1 @@
Pipeline concurrency and FunctionToOptimize structure

View file

@ -0,0 +1,26 @@
{
"context": "Tests whether the agent knows the FunctionToOptimize data structure and the concurrent execution model for test generation and optimization.",
"type": "weighted_checklist",
"checklist": [
{
"name": "FunctionToOptimize fields",
"description": "Includes at least 4 of: function_name, file_path, parents (list of FunctionParent), starting_line, ending_line, is_async, is_method, language",
"max_score": 25
},
{
"name": "Qualified name property",
"description": "Mentions qualified_name as a property that produces the full dotted name including parent classes (e.g., MyClass.my_method)",
"max_score": 25
},
{
"name": "Concurrent execution",
"description": "States that test generation and LLM optimization run concurrently (in parallel), NOT sequentially one after the other",
"max_score": 25
},
{
"name": "Entry point identification",
"description": "Correctly identifies Optimizer.run() as the top-level entry point and FunctionOptimizer.optimize_function() as the per-function entry point",
"max_score": 25
}
]
}

View file

@ -0,0 +1,17 @@
# Implement a Function Optimization Status Tracker
## Context
The codeflash team needs a status tracker that logs what happens to each function during an optimization run. For each function, it should record the function identity, which pipeline stages it passed through, and how long each stage took.
## Task
Write a design document explaining:
1. What data structure represents a function being optimized, including its identity fields and how nested functions (methods inside classes) are represented
2. The full name resolution strategy for identifying functions uniquely
3. Which stages of the pipeline operate on a single function at a time vs. operating on multiple functions
4. Where in the codebase the per-function optimization is orchestrated and what the top-level entry point is
## Expected Outputs
- A markdown file `status-tracker-design.md`

View file

@ -0,0 +1,40 @@
{
"total_scenarios": 5,
"capabilities_coverage": {
"total_capabilities": 16,
"capabilities_tested": 12,
"coverage_percentage": 75.0
},
"complexity_distribution": {
"basic": 1,
"intermediate": 3,
"advanced": 1
},
"scenarios": [
{
"index": 1,
"capability": "code-strings-markdown-format, read-writable-vs-read-only",
"complexity": "intermediate"
},
{
"index": 2,
"capability": "candidate-source-types, candidate-forest-dag, repair-request-structure",
"complexity": "intermediate"
},
{
"index": 3,
"capability": "deterministic-patch-values, plugin-blocklists",
"complexity": "advanced"
},
{
"index": 4,
"capability": "effort-level-values, best-candidate-selection",
"complexity": "intermediate"
},
{
"index": 5,
"capability": "function-to-optimize-fields, concurrent-testgen-optimization, pipeline-stage-ordering",
"complexity": "basic"
}
]
}

View file

@ -0,0 +1,25 @@
{
"total_infeasible": 4,
"infeasible_capabilities": [
{
"capability": "ai-service-endpoints",
"complexity": "intermediate",
"reasoning": "Testing knowledge of specific API endpoints requires actual HTTP requests or mocking that bypasses the capability being tested"
},
{
"capability": "context-token-limits",
"complexity": "basic",
"reasoning": "Already covered by the skills tile eval (scenario-1). Testing token counting requires the actual tokenizer library"
},
{
"capability": "test-type-enum",
"complexity": "basic",
"reasoning": "Simple enum knowledge is better verified through skills that use test types rather than isolated recall"
},
{
"capability": "result-type-usage",
"complexity": "basic",
"reasoning": "Already covered by the skills tile eval (scenario-2). Testing Result type usage is better done through implementation tasks"
}
]
}

View file

@ -0,0 +1,7 @@
{
"name": "codeflash/codeflash-docs",
"version": "0.1.0",
"summary": "Internal documentation for the codeflash optimization engine",
"private": true,
"docs": "docs/index.md"
}

View file

@ -0,0 +1,45 @@
# Architecture
```
codeflash/
├── main.py # CLI entry point
├── cli_cmds/ # Command handling, console output (Rich)
├── discovery/ # Find optimizable functions
├── context/ # Extract code dependencies and imports
├── optimization/ # Generate optimized code via AI
│ ├── optimizer.py # Main optimization orchestration
│ └── function_optimizer.py # Per-function optimization logic
├── verification/ # Run deterministic tests (pytest plugin)
├── benchmarking/ # Performance measurement
├── github/ # PR creation
├── api/ # AI service communication
├── code_utils/ # Code parsing, git utilities
├── models/ # Pydantic models and types
├── languages/ # Multi-language support (Python, JavaScript/TypeScript)
├── setup/ # Config schema, auto-detection, first-run experience
├── picklepatch/ # Serialization/deserialization utilities
├── tracing/ # Function call tracing
├── tracer.py # Root-level tracer entry point for profiling
├── lsp/ # IDE integration (Language Server Protocol)
├── telemetry/ # Sentry, PostHog
├── either.py # Functional Result type for error handling
├── result/ # Result types and handling
└── version.py # Version information
```
## Key Entry Points
| Task | Start here |
|------|------------|
| CLI arguments & commands | `cli_cmds/cli.py` |
| Optimization orchestration | `optimization/optimizer.py` → `Optimizer.run()` |
| Per-function optimization | `optimization/function_optimizer.py` → `FunctionOptimizer` |
| Function discovery | `discovery/functions_to_optimize.py` |
| Context extraction | `context/code_context_extractor.py` |
| Test execution | `verification/test_runner.py`, `verification/pytest_plugin.py` |
| Performance ranking | `benchmarking/function_ranker.py` |
| Domain types | `models/models.py`, `models/function_types.py` |
| Result handling | `either.py` (`Result`, `Success`, `Failure`, `is_successful`) |
| AI service communication | `api/aiservice.py` → `AiServiceClient` |
| Configuration constants | `code_utils/config_consts.py` |
| Language support | `languages/registry.py` → `get_language_support()` |

View file

@ -0,0 +1,11 @@
# Code Style
- **Line length**: 120 characters
- **Python**: 3.9+ syntax (use `from __future__ import annotations` for type hints)
- **Package management**: Always use `uv`, never `pip` — run commands via `uv run`
- **Tooling**: Ruff for linting/formatting, mypy strict mode, prek for pre-commit checks (`uv run prek run`)
- **Comments**: Minimal — only explain "why", not "what"
- **Docstrings**: Do not add unless explicitly requested
- **Naming**: NEVER use leading underscores (`_function_name`) — Python has no true private functions; use public names
- **Paths**: Always use absolute `Path` objects, handle encoding explicitly (UTF-8)
- **Source transforms**: Use `libcst` for code modification/transformation to preserve formatting; `ast` is acceptable for read-only analysis and parsing
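A minimal sketch of the `libcst` transform pattern, using a hypothetical rename as the example:
```python
# Rename `foo` to `bar` while preserving the original formatting and comments.
import libcst as cst


class RenameFoo(cst.CSTTransformer):
    def leave_Name(self, original_node: cst.Name, updated_node: cst.Name) -> cst.Name:
        if updated_node.value == "foo":
            return updated_node.with_changes(value="bar")
        return updated_node


module = cst.parse_module("def foo():\n    return 1\n")
print(module.visit(RenameFoo()).code)  # prints the transformed source
```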

View file

@ -0,0 +1,9 @@
# Git Conventions
- **Always create a new branch from `main`** — never commit directly to `main` or reuse an existing feature branch for unrelated changes
- Use conventional commit format: `fix:`, `feat:`, `refactor:`, `docs:`, `test:`, `chore:`
- Keep commits atomic — one logical change per commit
- Commit message body should be concise (1-2 sentences max)
- PR titles should also use conventional format
- Branch naming: `cf-#-title` (lowercase, hyphenated) where `#` is the Linear issue number
- If related to a Linear issue, include `CF-#` in the PR body

View file

@ -0,0 +1,9 @@
# Language Support Rules
- Current language is a module-level singleton in `languages/current.py` — use `set_current_language()` / `current_language()`, never pass language as a parameter through call chains (see the sketch after this list)
- Use `get_language_support(identifier)` from `languages/registry.py` to get a `LanguageSupport` instance — accepts `Path`, `Language` enum, or string; never import language classes directly
- New language support classes must use the `@register_language` decorator to register with the extension and language registries
- `languages/__init__.py` uses `__getattr__` for lazy imports to avoid circular dependencies — follow this pattern when adding new exports
- `is_javascript()` returns `True` for both JavaScript and TypeScript
- Language modules are lazily imported on first `get_language_support()` call via `_ensure_languages_registered()` — the `@register_language` decorator fires on import and populates `_EXTENSION_REGISTRY` and `_LANGUAGE_REGISTRY`
- `LanguageSupport` instances are cached in `_SUPPORT_CACHE` — use `clear_cache()` only in tests
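A short sketch of the documented lookup and singleton calls; the import paths follow the rules above, while the argument to `set_current_language()` is an assumption.
```python
# Illustrative only: get_language_support() accepts a Path, Language enum,
# or string per the rule above; other details here are assumptions.
from pathlib import Path

from codeflash.languages.current import current_language, set_current_language
from codeflash.languages.registry import get_language_support

support = get_language_support(Path("src/app.ts"))  # lazy-registers languages on first call

# Set the process-wide language once, then read it anywhere instead of
# threading a language parameter through call chains.
set_current_language("typescript")  # hypothetical identifier form
assert current_language() is not None
```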

View file

@ -0,0 +1,11 @@
# Optimization Pipeline Patterns
- All major operations return `Result[SuccessType, ErrorType]` — construct with `Success(value)` / `Failure(error)`, check with `is_successful()` before calling `unwrap()` (see the sketch after this list)
- Code context has token limits (`OPTIMIZATION_CONTEXT_TOKEN_LIMIT=16000`, `TESTGEN_CONTEXT_TOKEN_LIMIT=16000` in `code_utils/config_consts.py`) — exceeding them rejects the function
- `read_writable_code` (modifiable code) can span multiple files; `read_only_context_code` is reference-only dependency code
- Code is serialized as markdown code blocks: `` ```language:filepath\ncode\n``` `` — see `CodeStringsMarkdown` in `models/models.py`
- Candidates form a forest (DAG): refinement and repair candidates reference `parent_id` on previous candidates; each candidate's origin is tracked by `OptimizedCandidateSource` (OPTIMIZE, OPTIMIZE_LP, REFINE, REPAIR, ADAPTIVE, JIT_REWRITE)
- Test generation and optimization run concurrently — coordinate through `CandidateEvaluationContext`
- Generated tests are instrumented with `codeflash_capture.py` to record return values and traces
- Minimum improvement threshold is 5% (`MIN_IMPROVEMENT_THRESHOLD=0.05`) — candidates below this are rejected
- Stability thresholds: `STABILITY_WINDOW_SIZE=0.35`, `STABILITY_CENTER_TOLERANCE=0.0025`, `STABILITY_SPREAD_TOLERANCE=0.0025`
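A minimal sketch of the Result convention, assuming only the names documented above from `either.py` (return-type generics are omitted to avoid assuming their parameter order):
```python
# Sketch: wrap success/failure explicitly instead of raising or returning None.
from codeflash.either import Failure, Success, is_successful


def parse_speedup(raw: str):
    try:
        return Success(float(raw))
    except ValueError as err:
        return Failure(err)


result = parse_speedup("1.25")
if is_successful(result):
    speedup = result.unwrap()  # safe: checked is_successful() first
else:
    ...  # handle the Failure branch without calling unwrap()
```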

View file

@ -0,0 +1,13 @@
# Testing Rules
- Code context extraction and replacement tests must assert full string equality — no substring matching
- Use pytest's `tmp_path` fixture for temp directories (it's a `Path` object)
- Write temp files inside `tmp_path`, never use `NamedTemporaryFile` (causes Windows file contention)
- Always call `.resolve()` on Path objects to ensure absolute paths and resolve symlinks
- Use `.as_posix()` when converting resolved paths to strings (normalizes to forward slashes); these path conventions are sketched after this list
- Any new feature or bug fix that can be tested automatically must have test cases
- If changes affect existing test expectations, update the tests accordingly — tests must always pass after changes
- The pytest plugin patches `time`, `random`, `uuid`, `datetime`, `os.urandom`, and `numpy.random` for deterministic test execution — never assume real randomness or real time in verification tests
- `conftest.py` uses an autouse fixture that calls `reset_current_language()` — tests always start with Python as the default language
- Test types are defined by the `TestType` enum: `EXISTING_UNIT_TEST`, `INSPIRED_REGRESSION`, `GENERATED_REGRESSION`, `REPLAY_TEST`, `CONCOLIC_COVERAGE_TEST`, `INIT_STATE_TEST`
- Verification runs tests in a subprocess using a custom pytest plugin (`verification/pytest_plugin.py`) — behavioral runs blocklist the `benchmark`, `codspeed`, `xdist`, and `sugar` plugins; benchmarking runs additionally block `cov` and `profiling`
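A compact sketch of those path conventions, runnable under pytest:
```python
# Follows the rules above: tmp_path fixture, .resolve(), explicit UTF-8,
# full string equality, and .as_posix() for string comparisons.
from pathlib import Path


def test_writes_module_source(tmp_path: Path) -> None:
    source = "def f():\n    return 1\n"
    module_path = (tmp_path / "module.py").resolve()
    module_path.write_text(source, encoding="utf-8")
    assert module_path.read_text(encoding="utf-8") == source  # no substring checks
    assert module_path.as_posix().endswith("/module.py")
```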

View file

@ -0,0 +1,26 @@
{
"name": "codeflash/codeflash-rules",
"version": "0.1.0",
"summary": "Coding standards and conventions for the codeflash codebase",
"private": true,
"rules": {
"code-style": {
"rules": "rules/code-style.md"
},
"architecture": {
"rules": "rules/architecture.md"
},
"optimization-patterns": {
"rules": "rules/optimization-patterns.md"
},
"git-conventions": {
"rules": "rules/git-conventions.md"
},
"testing-rules": {
"rules": "rules/testing-rules.md"
},
"language-rules": {
"rules": "rules/language-rules.md"
}
}
}

View file

@ -0,0 +1,104 @@
{
"package_name": "codeflash-skills",
"total_capabilities": 14,
"capabilities": [
{
"id": 0,
"name": "sequential-pipeline-debugging",
"description": "Debug optimization failures by walking through pipeline stages sequentially and stopping at the first failure found",
"complexity": "intermediate",
"api_elements": ["discovery", "ranking", "context", "AI service", "verification", "deduplication", "repair"]
},
{
"id": 1,
"name": "token-limit-awareness",
"description": "Know that OPTIMIZATION_CONTEXT_TOKEN_LIMIT and TESTGEN_CONTEXT_TOKEN_LIMIT are both 16000 tokens and that exceeding them causes function rejection",
"complexity": "basic",
"api_elements": ["OPTIMIZATION_CONTEXT_TOKEN_LIMIT", "TESTGEN_CONTEXT_TOKEN_LIMIT", "encoded_tokens_len()"]
},
{
"id": 2,
"name": "improvement-threshold",
"description": "Know that MIN_IMPROVEMENT_THRESHOLD is 0.05 (5%) and candidates below this speedup are rejected",
"complexity": "basic",
"api_elements": ["MIN_IMPROVEMENT_THRESHOLD", "STABILITY_WINDOW_SIZE"]
},
{
"id": 3,
"name": "ast-deduplication",
"description": "Know that candidates are deduplicated via AST normalization using normalize_code() and CandidateEvaluationContext.ast_code_to_id",
"complexity": "intermediate",
"api_elements": ["normalize_code()", "CandidateEvaluationContext.ast_code_to_id", "code_utils/deduplicate_code.py"]
},
{
"id": 4,
"name": "repair-trigger-conditions",
"description": "Know that repair only triggers when fewer than MIN_CORRECT_CANDIDATES=2 pass, and is skipped when REPAIR_UNMATCHED_PERCENTAGE_LIMIT is exceeded",
"complexity": "advanced",
"api_elements": ["MIN_CORRECT_CANDIDATES", "REPAIR_UNMATCHED_PERCENTAGE_LIMIT", "AIServiceCodeRepairRequest"]
},
{
"id": 5,
"name": "ai-service-error-patterns",
"description": "Know specific log patterns to search for when AI service fails: 'Error generating optimized candidates', 'cli-optimize-error-caught', 'cli-optimize-error-response'",
"complexity": "intermediate",
"api_elements": ["AiServiceClient", "api/aiservice.py"]
},
{
"id": 6,
"name": "behavioral-vs-benchmark-failures",
"description": "Distinguish between behavioral test failures (return value/stdout/pass-fail mismatches via TestDiffScope) and benchmark failures (speedup below threshold)",
"complexity": "intermediate",
"api_elements": ["TestDiffScope", "RETURN_VALUE", "STDOUT", "DID_PASS"]
},
{
"id": 7,
"name": "result-type-pattern",
"description": "Use Result[L, R] from either.py with Success/Failure constructors and is_successful() checks before unwrap()",
"complexity": "basic",
"api_elements": ["Result", "Success", "Failure", "is_successful", "unwrap()", "either.py"]
},
{
"id": 8,
"name": "effort-config-pattern",
"description": "Add effort-dependent config via EffortKeys enum, EFFORT_VALUES dict with LOW/MEDIUM/HIGH levels, and get_effort_value()",
"complexity": "intermediate",
"api_elements": ["EffortKeys", "EffortLevel", "EFFORT_VALUES", "get_effort_value()", "config_consts.py"]
},
{
"id": 9,
"name": "module-to-feature-mapping",
"description": "Know which codeflash module to modify for different feature types (optimization/ for strategies, api/ for endpoints, languages/ for language support, etc.)",
"complexity": "basic",
"api_elements": ["MODULE_REFERENCE.md"]
},
{
"id": 10,
"name": "domain-type-conventions",
"description": "Use @dataclass(frozen=True) for immutable data, BaseModel for serializable models, and keep function_types.py dependency-free",
"complexity": "intermediate",
"api_elements": ["@dataclass(frozen=True)", "BaseModel", "models/models.py", "models/function_types.py"]
},
{
"id": 11,
"name": "test-patterns",
"description": "Use tmp_path fixture, .resolve() on Paths, .as_posix() for string conversion, full string equality assertions, and awareness of deterministic patches",
"complexity": "basic",
"api_elements": ["tmp_path", ".resolve()", ".as_posix()", "pytest_plugin.py"]
},
{
"id": 12,
"name": "quality-check-commands",
"description": "Run uv run prek run for formatting/linting, uv run mypy for type checking, and uv run pytest for tests",
"complexity": "basic",
"api_elements": ["uv run prek run", "uv run mypy", "uv run pytest"]
},
{
"id": 13,
"name": "language-support-patterns",
"description": "Use @register_language decorator, get_language_support() for lookup, singleton pattern via set_current_language()/current_language(), and is_python()/is_javascript() guards",
"complexity": "advanced",
"api_elements": ["@register_language", "get_language_support()", "set_current_language()", "is_python()", "is_javascript()"]
}
]
}

View file

@ -0,0 +1 @@
Sequential pipeline debugging with specific thresholds

View file

@ -0,0 +1,26 @@
{
"context": "Tests whether the agent follows the sequential debugging workflow from the skill, checking pipeline stages in order and using correct threshold values when diagnosing an optimization that produced no results.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Sequential stage order",
"description": "Investigates pipeline stages in order: discovery before ranking before context before AI service before test failures. Does NOT jump to later stages without checking earlier ones first.",
"max_score": 25
},
{
"name": "Token limit value",
"description": "References the specific token limit of 16000 for OPTIMIZATION_CONTEXT_TOKEN_LIMIT or TESTGEN_CONTEXT_TOKEN_LIMIT when checking context extraction",
"max_score": 25
},
{
"name": "Importance threshold",
"description": "References DEFAULT_IMPORTANCE_THRESHOLD=0.001 when checking function ranking",
"max_score": 25
},
{
"name": "Stops at failure",
"description": "Identifies the failing stage and focuses investigation there rather than continuing through all remaining stages",
"max_score": 25
}
]
}

View file

@ -0,0 +1,13 @@
# Diagnose Silent Optimization Skip
## Context
A user reports that when running codeflash on their project, a specific function `calculate_metrics` in `analytics/processor.py` never appears in the optimization results. The function exists in the module root, is not in the exclude list, and has not been previously optimized. Trace data shows the function is called frequently but with very short execution times (averaging 0.0005 seconds total addressable time). The function has moderate dependencies.
## Task
Write a diagnostic report explaining why this function is being skipped and at which stage in the pipeline the function is filtered out. Include the specific threshold or condition that causes the skip.
## Expected Outputs
A markdown file `diagnostic-report.md` explaining the root cause.

View file

@ -0,0 +1 @@
Result type pattern and effort-dependent configuration

View file

@ -0,0 +1,31 @@
{
"context": "Tests whether the agent uses the codeflash Result type pattern from either.py and the effort-dependent configuration pattern when implementing a new pipeline feature.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Imports from either.py",
"description": "Imports Success, Failure, and is_successful from codeflash.either (NOT from a different error handling module)",
"max_score": 20
},
{
"name": "Result return type",
"description": "Function returns Result type using Success() for success and Failure() for errors, not exceptions or None",
"max_score": 20
},
{
"name": "is_successful check",
"description": "Calls is_successful() or .is_successful() before calling unwrap() on the result",
"max_score": 20
},
{
"name": "EffortKeys enum entry",
"description": "Adds a new entry to the EffortKeys enum in config_consts.py",
"max_score": 20
},
{
"name": "Three effort levels",
"description": "Adds values for all three EffortLevel variants (LOW, MEDIUM, HIGH) in EFFORT_VALUES dict",
"max_score": 20
}
]
}

View file

@ -0,0 +1,21 @@
# Add Candidate Timeout Feature
## Context
The codeflash optimization engine currently has no per-candidate timeout. Some candidates take too long during verification, wasting the optimization budget. A new feature is needed to skip candidates that exceed a configurable time limit during behavioral testing.
The timeout should vary based on the optimization effort setting — shorter timeouts for low effort runs (to save time) and longer for high effort runs (to allow more complex optimizations).
## Task
Implement a `check_candidate_timeout` function in `codeflash/optimization/function_optimizer.py` that:
1. Takes a candidate runtime and returns whether the candidate should be skipped
2. Uses a configurable timeout threshold that scales with optimization effort
3. Handles the error case where the runtime measurement is unavailable
Also add the necessary configuration constant to `codeflash/code_utils/config_consts.py`.
## Expected Outputs
- Modified `function_optimizer.py` with the new function
- Modified `config_consts.py` with the new configuration

View file

@ -0,0 +1 @@
Test patterns and deterministic patch awareness

View file

@ -0,0 +1,26 @@
{
"context": "Tests whether the agent follows codeflash test conventions when writing tests, including path handling, temp directory patterns, and awareness of the deterministic patching system.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Uses tmp_path fixture",
"description": "Test function uses pytest tmp_path fixture parameter, NOT tempfile.NamedTemporaryFile or tempfile.mkdtemp",
"max_score": 25
},
{
"name": "Calls resolve on paths",
"description": "Calls .resolve() on Path objects before using them in assertions or function calls",
"max_score": 25
},
{
"name": "Full string equality",
"description": "Uses exact equality assertions (== or assert_equal) for code string comparisons, NOT substring checks like 'in' or assertIn or contains",
"max_score": 25
},
{
"name": "No real time dependency",
"description": "Test does NOT depend on real time.time(), datetime.now(), random values, or uuid generation for correctness. Acknowledges or accounts for deterministic patches if time/random values are involved.",
"max_score": 25
}
]
}

View file

@ -0,0 +1,24 @@
# Write Tests for Context Hash Comparison
## Context
The codeflash context extraction module has a function `compare_context_hashes(context_a, context_b)` that takes two `CodeOptimizationContext` objects and returns whether their hashing contexts are identical. This is used to detect when the same function has already been optimized.
```python
# In codeflash/context/code_context_extractor.py
def compare_context_hashes(context_a: CodeOptimizationContext, context_b: CodeOptimizationContext) -> bool:
return context_a.hashing_code_context_hash == context_b.hashing_code_context_hash
```
## Task
Write a test file `tests/test_context/test_hash_comparison.py` with tests for this function. Include tests for:
1. Two contexts with identical code producing the same hash
2. Two contexts with different code producing different hashes
3. A context compared with itself
The tests should create temporary Python source files to build realistic context objects.
## Expected Outputs
- `tests/test_context/test_hash_comparison.py`

View file

@ -0,0 +1 @@
Domain type conventions and module identification

View file

@ -0,0 +1,26 @@
{
"context": "Tests whether the agent follows codeflash domain type conventions and correctly identifies the right module when adding a new data type for the optimization pipeline.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Placed in models/models.py",
"description": "New data type is added to codeflash/models/models.py (NOT models/function_types.py, since it has dependencies on other codeflash modules)",
"max_score": 25
},
{
"name": "Uses frozen dataclass",
"description": "Immutable data type uses @dataclass(frozen=True) decorator, NOT a regular class or unfrozen dataclass",
"max_score": 25
},
{
"name": "BaseModel for serializable",
"description": "If a serializable model is needed, uses Pydantic BaseModel (NOT dataclass or dict)",
"max_score": 25
},
{
"name": "Correct module for feature",
"description": "Places the main logic in the correct module for the feature type (e.g., verification/ for test-related, optimization/ for candidate-related, api/ for service-related)",
"max_score": 25
}
]
}

View file

@ -0,0 +1,21 @@
# Add Optimization Confidence Score
## Context
The codeflash team wants to add a confidence score to each optimization result. The score should capture how confident the system is that an optimization is both correct and beneficial. It combines test coverage percentage, number of passing test cases, and speedup stability into a single metric.
The score needs to be:
- Attached to each candidate during evaluation (immutable once computed)
- Included in the final PR report (needs JSON serialization)
- Computed during the candidate evaluation phase
## Task
1. Define the data types needed for the confidence score
2. Write a `compute_confidence_score` function that takes coverage percentage (float), passing test count (int), and stability ratio (float) and returns the confidence result
3. Place all code in the appropriate codeflash modules
## Expected Outputs
- New/modified type definitions in the appropriate models file
- New function in the appropriate module

View file

@ -0,0 +1 @@
Deduplication mechanics and repair trigger conditions

View file

@ -0,0 +1,26 @@
{
"context": "Tests whether the agent understands codeflash's candidate deduplication via AST normalization and the specific conditions under which code repair is triggered vs skipped.",
"type": "weighted_checklist",
"checklist": [
{
"name": "AST normalization",
"description": "Mentions that deduplication uses AST normalization (normalize_code from code_utils/deduplicate_code.py), NOT simple string comparison",
"max_score": 25
},
{
"name": "Duplicate result copying",
"description": "Explains that duplicate candidates copy results from the first-seen candidate rather than being re-tested",
"max_score": 25
},
{
"name": "Repair trigger threshold",
"description": "States that repair triggers when fewer than 2 candidates pass (MIN_CORRECT_CANDIDATES=2), NOT when zero candidates pass or when any candidate fails",
"max_score": 25
},
{
"name": "Unmatched percentage limit",
"description": "Mentions REPAIR_UNMATCHED_PERCENTAGE_LIMIT as a condition that can cause repair to be skipped entirely, with effort-dependent values (0.2/0.3/0.4)",
"max_score": 25
}
]
}

Some files were not shown because too many files have changed in this diff