Three fixes to baseline establishment:
- Include zero-nanosecond runtimes in usable_runtime_data_by_test_case
(use `is not None` instead of a truthiness check; see the sketch after this list)
- Check runtime_data dict instead of total_timing scalar for skip decision
- Use module_root directly in PYTHONPATH (not module_root.parent) so
src-layout projects resolve imports correctly in test subprocesses
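A minimal sketch of the truthiness fix, with assumed data shapes (the real structure of usable_runtime_data_by_test_case may differ):

```python
# Assumed shape: mapping of test case id -> list of runtimes in ns (or None).
runtime_data_by_test_case = {"test_a": [0, 1200, None], "test_b": [None]}

# Before: `if t` silently dropped legitimate 0 ns runtimes along with None.
usable_runtime_data_by_test_case = {
    case: [t for t in timings if t is not None]
    for case, timings in runtime_data_by_test_case.items()
}
# -> {"test_a": [0, 1200], "test_b": []}
```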
Server-side instrumentation wrote return values to .bin files, which
corrupted under concurrent pytest processes (interleaved records →
UnicodeDecodeError). Client-side instrumentation writes to SQLite,
which handles concurrent access safely.
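A sketch of why the SQLite path is safe, under an assumed schema (the real table layout may differ): each pytest process opens its own connection, and SQLite serializes writers with file locks, so records cannot interleave the way raw appends to a shared .bin file can.

```python
import sqlite3

def record_return_value(db_path: str, test_id: str, payload: bytes) -> None:
    # timeout=30 makes this connection wait out other processes' write locks
    # instead of failing immediately with "database is locked".
    con = sqlite3.connect(db_path, timeout=30)
    try:
        con.execute(
            "CREATE TABLE IF NOT EXISTS results (test_id TEXT, value BLOB)"
        )
        con.execute("INSERT INTO results VALUES (?, ?)", (test_id, payload))
        con.commit()
    finally:
        con.close()
```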
The client now ignores instrumented_behavior_tests and
instrumented_perf_tests from the AI service response and instruments
the plain generated_tests locally using inject_profiling_into_existing_test,
the same path used for discovered existing tests.
The AI service returns instrumented test code with a
{codeflash_run_tmp_dir_client_side} placeholder in file paths.
generate_ai_tests() wrote these files without replacing the
placeholder, causing FileNotFoundError at test runtime.
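A sketch of the missing substitution step; the helper name and signature are hypothetical:

```python
from pathlib import Path

PLACEHOLDER = "{codeflash_run_tmp_dir_client_side}"

def write_generated_test(raw_path: str, code: str, tmp_dir: Path) -> Path:
    # Substitute the server-side placeholder before touching disk;
    # previously the literal placeholder leaked into the path and the
    # file could never be found at test runtime.
    resolved = Path(raw_path.replace(PLACEHOLDER, str(tmp_dir)))
    resolved.parent.mkdir(parents=True, exist_ok=True)
    resolved.write_text(code)
    return resolved
```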
Pytest natively discovers and runs unittest.TestCase subclasses,
making the separate unittest discovery/linking/merging code paths
redundant. This removes ~150 lines of dead branching logic across
discovery, linking, result merging, and test generation.
The pytest collection subprocess only added module_root's parent to
PYTHONPATH, which works when module_root is a package (e.g. src/aviary)
but fails when it is the source root itself (e.g. src). Now both
module_root and its parent are added so imports like
`from mypkg.core import func` resolve correctly in either layout.
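A sketch of the env construction, assuming the subprocess env is built from os.environ (helper name hypothetical):

```python
import os

def collection_env(module_root: str) -> dict[str, str]:
    env = os.environ.copy()
    paths = [module_root, os.path.dirname(module_root)]  # src and its parent
    if env.get("PYTHONPATH"):
        paths.append(env["PYTHONPATH"])  # preserve anything already set
    env["PYTHONPATH"] = os.pathsep.join(paths)
    return env
```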
AIClient.post() now catches ValueError (parent of JSONDecodeError)
when the server returns 200 with an empty or malformed body, converting
it to AIServiceError. All callers already handle AIServiceError, so the
pipeline degrades cleanly instead of crashing with an unhandled exception.
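A sketch of the guard; AIClient internals are assumptions. JSONDecodeError subclasses ValueError, so one except clause covers the stdlib json error, the requests wrapper, and an empty 200 body alike.

```python
import requests

class AIServiceError(Exception):
    pass

def parse_ai_response(resp: requests.Response) -> dict:
    try:
        return resp.json()
    except ValueError as e:  # includes json.JSONDecodeError
        raise AIServiceError(f"malformed AI service response: {e}") from e
```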
TimeoutExpired in execute_test_subprocess was propagating unhandled,
crashing the pipeline instead of degrading to "no baseline." Now
caught and returned as CompletedProcess(returncode=-1). Subprocess
timeout scales with test file count (120s base + 60s/file, cap 600s)
instead of a fixed 10-minute wait even for small suites.
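A sketch of both changes together (function and argument names are assumptions):

```python
import subprocess

def execute_tests(cmd: list[str], n_test_files: int) -> subprocess.CompletedProcess:
    # 120s base + 60s per test file, capped at 600s.
    timeout = min(120 + 60 * n_test_files, 600)
    try:
        return subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # Degrade to "no baseline" instead of crashing the pipeline.
        return subprocess.CompletedProcess(cmd, returncode=-1, stdout=b"", stderr=b"")
```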
Notable upstream fixes:
- Fix working-tree review crash on untracked directories
- Avoid embedding large adversarial review diffs
- Inherit process.env in app-server spawn
- Scope implicit resume-last and cancel to current session
- Gracefully handle unsupported thread/name/set on older CLI
- Use app-server auth status for Codex readiness
The codex session-lifecycle-hook.mjs SessionStart path only appends
two env vars to CLAUDE_ENV_FILE. Rewrite it as a bash script to avoid
the ~100ms V8 startup overhead. SessionEnd stays in Node.js since it
needs async broker teardown and process-tree management.
Tests overlay isolation, concurrent dispatch, thread safety,
exception handling, and the full evaluate_candidate_isolated flow
with mocked subprocess execution.
Previously, a failed baseline would still block on
candidates_future.result(), hanging the process if the AI service
was slow. Now the baseline is checked first and the future cancelled.
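A sketch of the ordering fix; names follow the description above but the real control flow may differ:

```python
from concurrent.futures import Future

def await_candidates(baseline, candidates_future: Future):
    if baseline is None:
        # Baseline failed: don't block on a slow AI service.
        candidates_future.cancel()  # no-op if the request already started
        return None
    return candidates_future.result()
```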
Evaluates candidates concurrently using ThreadPoolExecutor with project
overlays for isolation. Each candidate gets its own symlinked copy of
the project so test subprocesses don't interfere with each other or
the original source. Shared result lists protected with threading.Lock.
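A sketch of the dispatch loop; evaluate_candidate_isolated is from the description above, everything else is assumed:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def evaluate_all(candidates, evaluate_candidate_isolated, max_workers=4):
    results, lock = [], threading.Lock()

    def worker(candidate):
        outcome = evaluate_candidate_isolated(candidate)  # runs in its own overlay
        with lock:  # shared list, guarded as described above
            results.append(outcome)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(worker, candidates))  # force evaluation, surface exceptions
    return results
```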
Introduces symlink-based temporary directories that mirror the project
root, replacing only the target module file with candidate code. This
allows test subprocesses to run against candidate code without mutating
the original source on disk, enabling safe parallel evaluation.
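A minimal sketch of building such an overlay; the real implementation likely differs (e.g. symlinking whole top-level entries rather than every file):

```python
import tempfile
from pathlib import Path

def make_overlay(project_root: Path, target: Path, candidate_code: str) -> Path:
    overlay = Path(tempfile.mkdtemp(prefix="overlay-"))
    for path in project_root.rglob("*"):
        dest = overlay / path.relative_to(project_root)
        if path.is_dir():
            dest.mkdir(parents=True, exist_ok=True)  # real dirs so entries can differ
        else:
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.symlink_to(path)  # files are shared with the original tree
    shadow = overlay / target.relative_to(project_root)
    shadow.unlink()                    # drop the link to the original module
    shadow.write_text(candidate_code)  # candidate shadows it with a real file
    return overlay
```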
Start the AI service HTTP call for candidate generation concurrently
with baseline subprocess runs. The HTTP call doesn't depend on
baseline results, so it runs in a background thread while behavioral
tests, line profiling, and benchmarking execute. Saves 10-20s per
function (the AI round-trip time that was previously sequential).
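A sketch of the overlap; helper names are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def optimize_function(request_candidates, establish_baseline):
    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(request_candidates)  # AI round-trip starts immediately
        baseline = establish_baseline()        # local subprocess work runs meanwhile
        return baseline, fut.result()          # usually already resolved here
```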
Run line profiling and benchmarking subprocesses concurrently via
ThreadPoolExecutor instead of sequentially. Each writes to a unique
JUnit XML file to avoid collisions. Saves 20-60s per function
(the duration of the shorter subprocess).
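A sketch with hypothetical runner callables, each taking the JUnit XML path it should write:

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

def profile_and_benchmark(run_line_profile, run_benchmark):
    with ThreadPoolExecutor(max_workers=2) as pool:
        lp = pool.submit(run_line_profile, f"junit-lp-{uuid.uuid4().hex}.xml")
        bm = pool.submit(run_benchmark, f"junit-bm-{uuid.uuid4().hex}.xml")
        return lp.result(), bm.result()  # unique paths, so no report collisions
```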
Add missing # noqa: PLC0415 comments to deferred imports that lost
them during ruff --fix reformatting. Add unittest to TYPE_CHECKING
block in discovery.py so annotations resolve. Fix import sorting.
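The pattern in question, shown on a made-up function:

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    import unittest  # annotation-only: free at runtime, visible to type checkers

def load_tests(start_dir: str) -> unittest.TestSuite:
    # Deferred to first use; the noqa keeps ruff from flagging PLC0415.
    import unittest  # noqa: PLC0415
    return unittest.defaultTestLoader.discover(start_dir)
```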
Move jedi, libcst, dill, and unittest imports from module level to
first use in _function_optimizer, _optimizer, _orchestrator, and
test_discovery. Pipeline __init__.py is now lazy via __getattr__.
Optimizer module load: 3.7s → 51ms. All 2591 tests pass.
Point attrs dependency at local fork (KRRT7/attrs perf/defer-inspect-import)
which defers the ~12ms inspect import until first class build. Temporary
override until upstream merges python-attrs/attrs#1547.
Also adds attrs optimization case study data (VM infra, status).
Replace eager imports of all submodules with a __getattr__-based
lookup table. Submodules now load only when a name is accessed,
dropping `import codeflash_core` from ~74ms to ~4ms (98% reduction
from the original ~230ms). TYPE_CHECKING imports preserved for
static analysis. All 65 core + 2526 downstream tests pass.
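A sketch of the PEP 562 pattern in the package __init__.py (the submodule table is illustrative, not the real one):

```python
import importlib

_SUBMODULES = {"discovery", "optimizer"}  # names exported lazily

def __getattr__(name: str):
    if name in _SUBMODULES:
        module = importlib.import_module(f".{name}", __package__)
        globals()[name] = module  # cache so __getattr__ fires once per name
        return module
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```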
Move requests, gitpython, sentry-sdk, and posthog imports from
module level into the functions that use them. This drops
`import codeflash_core` from ~230ms to ~74ms, making it viable
for lightweight consumers (e.g. the project detector) to depend
on core submodules without blowing startup budgets.
- _telemetry: defer sentry_sdk + posthog into init functions (20ms → 0.2ms)
- _git: use PEP 562 __getattr__ for lazy git import (59ms → 4ms)
- _platform: defer requests + sentry_sdk into methods (55ms → 1ms)
- _client: defer requests into post() method (72ms → 34ms)
- _http: add shared _make_session() factory for deferred Session creation
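A sketch of the deferred-session factory; the real signature and headers are assumptions:

```python
def _make_session():
    import requests  # deferred: the import cost is paid on first HTTP use only

    session = requests.Session()
    session.headers["User-Agent"] = "codeflash"
    return session
```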
Weave "optimizations reveal deeper issues" framing into engagement report
executive summary, case study, and optimization README. Add O(N²) text
extraction fix, per-request RSS creep (24 MB → 17 MB), and memray profiling
data that were previously undocumented.
Apply research-backed case study structure: headline anchoring on
biggest numbers, customer-as-hero framing, loss aversion, narrative
arc, methodology for developer credibility. Collapse PR inventory
to category summary, ~1,100 words in optimal range.
- Rewrite executive summary to reference Lawrence's PR #1465 lockfile fix and
existing tooling (Renovate, Anchore, Chainguard)
- Reorder findings by category priority (supply chain > container > CI/CD)
to lead with what matters most to the audience
- Add animated parallelogram background matching codeflash.ai aesthetic
- 6 research-backed UX changes: severity icons (WCAG 1.4.1), title-first
cards (F-pattern), loss-framed 85% CTA, distinct status colors, card
opacity for figure-ground separation
- Correct SEC-021 from 67% to 97% mutable Action pins per VM verification
(only 2 of 96 SHA-pinned in core-product)
- Add talking-points-lawrence.md with profile, pain points, pitch strategy
Split the 39-finding wall into tabbed views matching the engagement
report pattern: Summary, Critical & High (21), Medium & Low (18),
and By Category with both category and repository breakdowns.
- Add build_jpc_view() with clean standalone layout at /jpc for JPC
(no tabs, no hero — just the document that "stands on its own")
- Add URL routing via dcc.Location: / serves full report, /jpc serves summary
  (see the sketch after this list)
- Add methodology notes to exec view (How This Was Tested annotations)
- Add methodology notes to detail view (7-entry "why" card)
- Enrich team view Memory + Standalone vs. Cumulative explanations
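A sketch of the routing; build_jpc_view() is named above, the other names are assumptions:

```python
from dash import Dash, Input, Output, dcc, html

def build_jpc_view():     # stub: the real builder lives in the report module
    return html.Div("JPC summary")

def build_full_report():  # hypothetical name for the tabbed report builder
    return html.Div("Full report")

app = Dash(__name__)
app.layout = html.Div([dcc.Location(id="url"), html.Div(id="page")])

@app.callback(Output("page", "children"), Input("url", "pathname"))
def route(pathname):
    # /jpc gets the standalone summary (no tabs, no hero); everything
    # else falls through to the full report.
    return build_jpc_view() if pathname == "/jpc" else build_full_report()
```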
Team view:
- Add Engineering Impact Summary at top (4 metrics: memory, density,
latency, idle vCPU) with pointer to sections below
- Remove Production Context card (redundant with Impact Summary)
- Trim memory table to only metrics not shown in chart (RSS per
request, K8s allocation) — chart already shows pre/post/delta
- Fix "10-page scan" → "10-page scanned document" in methodology
Detail view:
- Add intro callout explaining this is the raw data backing the
other two views
Reorder based on persuasion research (Three-Talk Model, Prospect
Theory, Kotter):
1. "The Engagement" — collaborative shared context (team talk)
2. "What This Enables" — loss-framed enablement: 9.2x pod density,
41 idle vCPUs now available, -12.9% latency for agentic API
3. "The Results" — before/after proof of execution
4. Infrastructure Cost Impact (anchored on $100K/mo)
5. Workload Profiles + Methodology (credibility)
6. Delivered + Proposed Next Engagements
Key shift: lead with what the work unlocks (feature velocity,
platform capacity, API speed) rather than the technical achievement
(memory reduction). Cost savings is proof of execution, not the
headline.
The 1p/10p/16p benchmark rationale belongs in the exec view — JPC
needs to understand that page count != workload before seeing the
numbers. Added "Benchmark Workload Profiles" section before "How This
Was Tested" with the three profiles and the data punchline (#1505 at
-32.6% on 1 page vs -7.4% on 16 pages).
The 1p/10p/16p column headers weren't self-explanatory. Added a
"Benchmark Workload Profiles" card above the latency table in the
Detail view explaining that each document tests a distinct workload
shape (table-dense, scanned, mixed), not just different page counts.
Also added annotation below the table calling out that #1505 has 4x
the impact on the 1-page doc vs. the 16-page doc — letting the data
demonstrate that per-document cost depends on content, not page count.