Commit graph

198 commits

Author SHA1 Message Date
Kevin Turcios
bde8a7b782 Fix benchmark baseline for zero-runtime tests and src-layout projects
Three fixes to baseline establishment:
- Include zero-nanosecond runtimes in usable_runtime_data_by_test_case
  (use `is not None` instead of truthiness check)
- Check runtime_data dict instead of total_timing scalar for skip decision
- Use module_root directly in PYTHONPATH (not module_root.parent) so
  src-layout projects resolve imports correctly in test subprocesses
2026-04-21 08:59:32 -05:00
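
The first fix above can be illustrated with a small sketch. The function name and sample data are hypothetical and not taken from the repository; only the `is not None` idea comes from the commit message.

```python
from typing import Dict, Optional


def usable_runtimes(runtime_ns_by_test: Dict[str, Optional[int]]) -> Dict[str, int]:
    """Keep every test case that reported a runtime, including 0 ns."""
    # `is not None` keeps a legitimate 0 ns measurement; a bare truthiness
    # check (`if ns`) would silently drop it.
    return {name: ns for name, ns in runtime_ns_by_test.items() if ns is not None}


# Hypothetical measurements: one zero-runtime test, one normal, one missing.
measured = {"test_fast": 0, "test_slow": 1500, "test_not_run": None}
kept = usable_runtimes(measured)

# The pre-fix behaviour, for comparison: truthiness drops the 0 ns entry.
dropped_by_truthiness = {name: ns for name, ns in measured.items() if ns}
```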
Kevin Turcios
434e888571 Move AI-generated test instrumentation from server-side to client-side
Server-side instrumentation wrote return values to .bin files, which
corrupted under concurrent pytest processes (interleaved records →
UnicodeDecodeError). Client-side instrumentation writes to SQLite,
which handles concurrent access safely.

The client now ignores instrumented_behavior_tests and
instrumented_perf_tests from the AI service response and instruments
the plain generated_tests locally using inject_profiling_into_existing_test,
the same path used for discovered existing tests.
2026-04-21 07:38:48 -05:00
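
A rough sketch of why SQLite suits concurrent writers, under stated assumptions: the table schema and function names are invented for illustration, and threads stand in for the separate pytest processes the commit describes. Each writer opens its own connection and takes the write lock with BEGIN IMMEDIATE, so records cannot interleave.

```python
import sqlite3
import tempfile
import threading
from pathlib import Path


def record_results(db_path: Path, worker_id: int, rows: int) -> None:
    # One connection per writer; `timeout` makes a blocked writer wait for
    # the lock instead of failing immediately.
    conn = sqlite3.connect(str(db_path), timeout=30, isolation_level=None)
    try:
        for seq in range(rows):
            conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
            conn.execute(
                "INSERT INTO results (worker, seq, value) VALUES (?, ?, ?)",
                (worker_id, seq, f"return-value-{seq}"),
            )
            conn.execute("COMMIT")
    finally:
        conn.close()


def run_workers(db_path: Path, workers: int = 4, rows: int = 20) -> int:
    setup = sqlite3.connect(str(db_path))
    setup.execute(
        "CREATE TABLE IF NOT EXISTS results (worker INTEGER, seq INTEGER, value TEXT)"
    )
    setup.commit()
    setup.close()

    threads = [
        threading.Thread(target=record_results, args=(db_path, w, rows))
        for w in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    check = sqlite3.connect(str(db_path))
    count = check.execute("SELECT COUNT(*) FROM results").fetchone()[0]
    check.close()
    return count


demo_db = Path(tempfile.mkdtemp()) / "results.sqlite"
total_rows = run_workers(demo_db)
```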
Kevin Turcios
bc0323a46c Fix codeflash_run_tmp_dir_client_side placeholder not being substituted
The AI service returns instrumented test code with a
{codeflash_run_tmp_dir_client_side} placeholder in file paths.
generate_ai_tests() wrote these files without replacing the
placeholder, causing FileNotFoundError at test runtime.
2026-04-21 07:01:26 -05:00
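
A minimal sketch of the substitution this fix implies. The helper name is hypothetical; only the placeholder string comes from the commit message. Plain `str.replace` is used rather than `str.format`, on the assumption that generated test code contains other braces that must be left alone.

```python
from pathlib import Path

PLACEHOLDER = "{codeflash_run_tmp_dir_client_side}"


def substitute_tmp_dir(test_source: str, tmp_dir: Path) -> str:
    """Replace the placeholder with a concrete directory before writing the file."""
    # as_posix() keeps forward slashes so the path is safe inside a
    # Python string literal on any platform.
    return test_source.replace(PLACEHOLDER, tmp_dir.as_posix())


raw = 'db = "{codeflash_run_tmp_dir_client_side}/out.sqlite"\ndata = {"k": 1}\n'
fixed = substitute_tmp_dir(raw, Path("/tmp/run1"))
```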
Kevin Turcios
05cf34eaf8 Remove unittest framework support from Python test pipeline
Pytest natively discovers and runs unittest.TestCase subclasses,
making the separate unittest discovery/linking/merging code paths
redundant. This removes ~150 lines of dead branching logic across
discovery, linking, result merging, and test generation.
2026-04-21 06:41:41 -05:00
Kevin Turcios
ff310a7639 Add [tool.codeflash] config to layered eval template
The E2E pipeline needs module-root, tests-root, and test-framework
to discover and run tests in src-layout projects.
2026-04-21 06:41:31 -05:00
Kevin Turcios
26709f9b82 Fix test discovery for src-layout projects with module_root=src
The pytest collection subprocess only added module_root's parent to
PYTHONPATH, which works when module_root is a package (e.g. src/aviary)
but fails when it is the source root itself (e.g. src). Now both
module_root and its parent are added so imports like
`from mypkg.core import func` resolve correctly in either layout.
2026-04-21 06:18:27 -05:00
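
The PYTHONPATH change described above might look roughly like this. The function name and the de-duplication step are assumptions made for illustration; the commit only states that both module_root and its parent are added.

```python
import os
from pathlib import Path


def build_pythonpath(module_root: Path, existing: str = "") -> str:
    """Put module_root and its parent ahead of any existing PYTHONPATH entries."""
    candidates = [str(module_root), str(module_root.parent)]
    candidates += [p for p in existing.split(os.pathsep) if p]
    ordered = []
    for entry in candidates:
        if entry not in ordered:  # keep first occurrence, preserve order
            ordered.append(entry)
    return os.pathsep.join(ordered)


# With module_root=src, `from mypkg.core import func` resolves via "proj/src";
# with module_root=src/aviary, `import aviary` resolves via the parent "proj/src".
src_layout = build_pythonpath(Path("proj") / "src", "extra")
```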
Kevin Turcios
a4e7e9f5fa Handle empty JSON responses from AI service gracefully
AIClient.post() now catches ValueError (parent of JSONDecodeError)
when the server returns 200 with an empty or malformed body, converting
it to AIServiceError. All callers already handle AIServiceError, so the
pipeline degrades cleanly instead of crashing with an unhandled exception.
2026-04-21 05:51:13 -05:00
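
A sketch of the error conversion, using the standard-library json module in place of the real HTTP client. The AIServiceError class here is a stand-in defined locally, and the function name is hypothetical. Catching ValueError covers json.JSONDecodeError because the latter is a subclass of it.

```python
import json


class AIServiceError(Exception):
    """Local stand-in for the project's error type."""


def parse_response_body(body: str):
    """Decode a response body, converting decode failures into AIServiceError."""
    try:
        return json.loads(body)
    except ValueError as exc:  # includes json.JSONDecodeError
        raise AIServiceError(f"invalid JSON in response body: {exc}") from exc


def safe_parse(body: str):
    # How a caller that already handles AIServiceError would degrade.
    try:
        return parse_response_body(body)
    except AIServiceError:
        return None
```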
Kevin Turcios
87ed0011c2 Handle subprocess timeout gracefully and scale timeout dynamically
TimeoutExpired in execute_test_subprocess was propagating unhandled,
crashing the pipeline instead of degrading to "no baseline." Now
caught and returned as CompletedProcess(returncode=-1). Subprocess
timeout scales with test file count (120s base + 60s/file, cap 600s)
instead of a fixed 10-minute wait for small suites.
2026-04-21 05:47:32 -05:00
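
The timeout arithmetic and the graceful handling can be sketched as follows. Function names are hypothetical, and the formula assumes the 120s base applies before any per-file increment; the actual execute_test_subprocess may differ.

```python
import subprocess
import sys
from typing import List


def subprocess_timeout(test_file_count: int) -> int:
    """120s base plus 60s per test file, capped at 600s."""
    return min(120 + 60 * test_file_count, 600)


def run_with_timeout(cmd: List[str], timeout: float) -> subprocess.CompletedProcess:
    """Run cmd; on timeout return a failed CompletedProcess instead of raising."""
    try:
        return subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout, check=False
        )
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        return subprocess.CompletedProcess(
            args=cmd, returncode=-1, stdout="", stderr="timed out"
        )


slow = [sys.executable, "-c", "import time; time.sleep(5)"]
timed_out = run_with_timeout(slow, timeout=0.2)
```

Under this formula, a suite of 8 or more files reaches the 600s cap, which matches the previous fixed 10-minute wait.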
Kevin Turcios
9456cf48fa Add set -e to vendored session-start-env.sh
Match upstream fix: propagate write failures as non-zero exit.
2026-04-21 05:32:30 -05:00
Kevin Turcios
42c8310494 Update vendored codex plugin from v1.0.2 to v1.0.4
Notable upstream fixes:
- Fix working-tree review crash on untracked directories
- Avoid embedding large adversarial review diffs
- Inherit process.env in app-server spawn
- Scope implicit resume-last and cancel to current session
- Gracefully handle unsupported thread/name/set on older CLI
- Use app-server auth status for Codex readiness
2026-04-21 05:11:56 -05:00
Kevin Turcios
de6046df8d Replace Node.js SessionStart hook with bash (106ms → 5ms)
The codex session-lifecycle-hook.mjs SessionStart path only appends
two env vars to CLAUDE_ENV_FILE. Rewrite that as a bash script to
avoid ~100ms V8 startup overhead. SessionEnd stays in Node.js since
it needs async broker teardown and process tree management.
2026-04-21 05:07:03 -05:00
Kevin Turcios
42b855899f Reduce hook latency by consolidating subprocess calls
Collapse multiple jq/grep/sed invocations into single passes:
- post-compact-state-inject: 7 jq calls → 1 (-51%)
- session-start: 6 sed/grep → 1 sed pipeline (-37%)
- user-prompt-context-inject: 2 jq → 1 (-33%)
- pre-compact-state-save: 3 tail|grep → 1 awk (-42%)
- post-tool-benchmark-capture: 3 jq → 1 (-23%)
- stop-optimization-gate: 3 tail|grep → 1 awk
- pre-compact: fix macOS-incompatible sed newline escaping
2026-04-21 05:01:53 -05:00
Kevin Turcios
ebd239bbbc Add integration tests for parallel candidate evaluation
Tests overlay isolation, concurrent dispatch, thread safety,
exception handling, and the full evaluate_candidate_isolated flow
with mocked subprocess execution.
2026-04-21 04:34:44 -05:00
Kevin Turcios
2cea1ab784 Cancel AI generation future early when baseline fails
Previously, a failed baseline would still block on
candidates_future.result(), hanging the process if the AI service
was slow. Now checks baseline first and cancels the future.
2026-04-21 04:34:39 -05:00
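
A minimal sketch of the check-baseline-first ordering, with hypothetical names. One caveat worth stating: Future.cancel() only succeeds for work that has not started; for a call already running, the benefit comes from never calling result().

```python
from concurrent.futures import Future
from typing import List, Optional


def collect_candidates(baseline_ok: bool, candidates_future: Future) -> Optional[List[str]]:
    """Return candidates only when the baseline succeeded."""
    if not baseline_ok:
        # Do not block on result(); cancel if the work is still pending.
        candidates_future.cancel()
        return None
    return candidates_future.result()


pending: Future = Future()          # never started, so cancel() succeeds
skipped = collect_candidates(False, pending)

done: Future = Future()
done.set_result(["candidate-1"])
collected = collect_candidates(True, done)
```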
Kevin Turcios
66461ad4e7 Parallelize candidate evaluation across all optimization passes
Evaluates candidates concurrently using ThreadPoolExecutor with project
overlays for isolation. Each candidate gets its own symlinked copy of
the project so test subprocesses don't interfere with each other or
the original source. Shared result lists protected with threading.Lock.
2026-04-21 04:16:43 -05:00
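
A rough sketch of the concurrency pattern named above: a ThreadPoolExecutor dispatching one evaluation per candidate, with a threading.Lock guarding the shared result list. The function names and the exception-capturing behaviour are assumptions for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Tuple


def evaluate_all(
    candidates: List[Any], evaluate: Callable[[Any], Any], max_workers: int = 4
) -> List[Tuple[Any, Any]]:
    results: List[Tuple[Any, Any]] = []
    lock = threading.Lock()

    def worker(candidate: Any) -> None:
        try:
            outcome = evaluate(candidate)
        except Exception as exc:  # one failing candidate must not sink the rest
            outcome = exc
        with lock:  # serialize access to the shared list
            results.append((candidate, outcome))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Consume the iterator so every worker has finished before returning.
        list(pool.map(worker, candidates))
    return results


scores = evaluate_all(["a", "bb", "ccc"], len)
```

Results arrive in completion order, so callers that need a stable order should sort or key by candidate.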
Kevin Turcios
b455c1e69f Add project overlay infrastructure for isolated candidate evaluation
Introduces symlink-based temporary directories that mirror the project
root, replacing only the target module file with candidate code. This
allows test subprocesses to run against candidate code without mutating
the original source on disk, enabling safe parallel evaluation.
2026-04-21 04:16:34 -05:00
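
One way such an overlay could be built, as a sketch under assumptions: all names are hypothetical, directories on the path to the target are created as real directories, and everything else is symlinked. Creating symlinks may need extra privileges on Windows.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def _mirror(src_dir: Path, dst_dir: Path, target_parts: Tuple[str, ...], code: str) -> None:
    for entry in src_dir.iterdir():
        dst = dst_dir / entry.name
        if target_parts and entry.name == target_parts[0]:
            if len(target_parts) == 1:
                # The target module: write candidate code as a real file.
                dst.write_text(code, encoding="utf-8")
            else:
                # A directory on the way to the target: recurse into it.
                dst.mkdir()
                _mirror(entry, dst, target_parts[1:], code)
        else:
            # Everything else is shared with the original via a symlink.
            dst.symlink_to(entry.resolve(), target_is_directory=entry.is_dir())


def create_overlay(project_root: Path, target: Path, candidate_code: str) -> Path:
    """Return a temp directory mirroring project_root with target replaced."""
    overlay = Path(tempfile.mkdtemp(prefix="overlay-"))
    _mirror(project_root, overlay, target.relative_to(project_root).parts, candidate_code)
    return overlay
```

Because only the target file is a real copy, building an overlay costs roughly one symlink per sibling entry rather than a full project copy.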
Kevin Turcios
1d48b7792d Overlap AI candidate generation with baseline establishment
Start the AI service HTTP call for candidate generation concurrently
with baseline subprocess runs. The HTTP call doesn't depend on
baseline results, so it runs in a background thread while behavioral
tests, line profiling, and benchmarking execute. Saves 10-20s per
function (the AI round-trip time that was previously sequential).
2026-04-21 03:52:47 -05:00
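
The overlap described above can be sketched like this. Both callables are hypothetical placeholders for the AI service call and the baseline runs; the point is only that the request is submitted before the baseline starts and joined after it finishes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Tuple


def run_overlapped(
    fetch_candidates: Callable[[], Any], establish_baseline: Callable[[], Any]
) -> Tuple[Any, Any]:
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start the network round-trip in a background thread...
        future = pool.submit(fetch_candidates)
        # ...while the baseline work runs on the calling thread.
        baseline = establish_baseline()
        candidates = future.result()
    return baseline, candidates


baseline, candidates = run_overlapped(
    lambda: ["candidate-1", "candidate-2"],
    lambda: {"runtime_ns": 1000},
)
```

The total wall time then approaches the longer of the two tasks instead of their sum.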
Kevin Turcios
d9be7272e9 Parallelize baseline line profiling and benchmarking
Run line profiling and benchmarking subprocesses concurrently via
ThreadPoolExecutor instead of sequentially. Each writes to a unique
JUnit XML file to avoid collisions. Saves 20-60s per function
(the duration of the shorter subprocess).
2026-04-21 03:44:52 -05:00
Kevin Turcios
be4d51b886 Fix ruff lint errors from deferred import changes
Add missing # noqa: PLC0415 comments to deferred imports that lost
them during ruff --fix reformatting. Add unittest to TYPE_CHECKING
block in discovery.py so annotations resolve. Fix import sorting.
2026-04-21 03:30:00 -05:00
Kevin Turcios
eb639fb24e Defer heavy third-party imports in pipeline to cut startup 99%
Move jedi, libcst, dill, and unittest imports from module level to
first use in _function_optimizer, _optimizer, _orchestrator, and
test_discovery. Pipeline __init__.py is now lazy via __getattr__.

Optimizer module load: 3.7s → 51ms. All 2591 tests pass.
2026-04-21 03:18:53 -05:00
Kevin Turcios
4399fe0763 Make codeflash_python.__init__ fully lazy via __getattr__
Defer all imports (models, api, importlib.metadata) to first access.
Import time drops from ~100ms to ~5ms.
2026-04-21 02:50:39 -05:00
Kevin Turcios
d25d7bdad4 Point attrs source at GitHub release wheel for portability
Replace local path override with wheel URL from KRRT7/attrs release
so team members and CI get the optimized attrs on uv sync.
2026-04-21 02:35:45 -05:00
Kevin Turcios
edfdd231e0 Use attrs fork with deferred inspect import
Point attrs dependency at local fork (KRRT7/attrs perf/defer-inspect-import)
which defers the ~12ms inspect import until first class build. Temporary
override until upstream merges python-attrs/attrs#1547.

Also adds attrs optimization case study data (VM infra, status).
2026-04-21 02:27:50 -05:00
Kevin Turcios
4f206be46b Make codeflash_core.__init__ fully lazy via __getattr__
Replace eager imports of all submodules with a __getattr__-based
lookup table. Submodules now load only when a name is accessed,
dropping `import codeflash_core` from ~74ms to ~4ms (98% reduction
from the original ~230ms). TYPE_CHECKING imports preserved for
static analysis. All 65 core + 2526 downstream tests pass.
2026-04-21 01:22:18 -05:00
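
A self-contained sketch of the lookup-table approach (PEP 562). In a real package the `__getattr__` function would sit at module level in `__init__.py`; here a module object is built at runtime so the example runs standalone. The table contents and helper name are illustrative, not the actual codeflash_core exports.

```python
import importlib
import types
from typing import Dict, Tuple


def make_lazy_module(name: str, table: Dict[str, Tuple[str, str]]) -> types.ModuleType:
    """Build a module whose attributes are imported on first access."""
    mod = types.ModuleType(name)

    def _lazy_getattr(attr: str):
        try:
            module_name, obj_name = table[attr]
        except KeyError:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}") from None
        value = getattr(importlib.import_module(module_name), obj_name)
        # Cache on the module so later lookups bypass __getattr__ entirely.
        setattr(mod, attr, value)
        return value

    # PEP 562: a `__getattr__` in the module namespace handles missing names.
    mod.__getattr__ = _lazy_getattr
    return mod


demo = make_lazy_module("demo_pkg", {"sqrt": ("math", "sqrt"), "dumps": ("json", "dumps")})
loaded_before = "sqrt" in vars(demo)
root = demo.sqrt(9)
loaded_after = "sqrt" in vars(demo)
```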
Kevin Turcios
31cdba9e3b Defer heavy third-party imports in codeflash-core to cut import time 68%
Move requests, gitpython, sentry-sdk, and posthog imports from
module level into the functions that use them. This drops
`import codeflash_core` from ~230ms to ~74ms, making it viable
for lightweight consumers (e.g. the project detector) to depend
on core submodules without blowing startup budgets.

- _telemetry: defer sentry_sdk + posthog into init functions (20ms → 0.2ms)
- _git: use PEP 562 __getattr__ for lazy git import (59ms → 4ms)
- _platform: defer requests + sentry_sdk into methods (55ms → 1ms)
- _client: defer requests into post() method (72ms → 34ms)
- _http: add shared _make_session() factory for deferred Session creation
2026-04-21 01:08:17 -05:00
misrasaurabh1
cf67686b29 add codeflash-agent case study md
2026-04-19 19:59:23 -07:00
misrasaurabh1
d33e82b647 Merge remote-tracking branch 'origin/main'
# Conflicts:
#	reports/unstructured/engagement_report.py
2026-04-19 19:58:37 -07:00
misrasaurabh1
42debefd12 remove lightspeed canvas animation from report
2026-04-19 19:57:59 -07:00
Kevin Turcios
d63bb51800 Revert timeline phase duration back to 6 weeks
2026-04-19 03:28:46 -05:00
Kevin Turcios
c9a8e9b1ea Fix engagement duration from 6 weeks to 2 months
2026-04-19 03:28:06 -05:00
Kevin Turcios
0e248b865f no light speed anims
2026-04-19 03:24:41 -05:00
Kevin Turcios
b42417532d Add optimization project scaffolding for plotly/plotly.py
2026-04-16 23:57:06 -05:00
Kevin Turcios
a4276d658a Refine engagement report and case study for executive review
- Hero metrics: -89% cost, -52% peak memory, flat scaling, -12.9% latency
- Add lightspeed canvas animation via assets/lightspeed.js for Plotly Cloud
- Add platform-libs CI/CD migration to timeline (Phase 1b) with PR links
- Update next-engagement card with POC branch and PR references
- Replace RSS with peak memory in user-facing copy
- Add flat memory scaling to case study results table
2026-04-16 17:51:54 -05:00
Kevin Turcios
380bd59503 Add iterative-discovery narrative and missing findings across all reports
Weave "optimizations reveal deeper issues" framing into engagement report
executive summary, case study, and optimization README. Add O(N²) text
extraction fix, per-request RSS creep (24→17 MB), and memray profiling
data that were previously undocumented.
2026-04-16 15:02:39 -05:00
Kevin Turcios
3c705d4e2d Rewrite Unstructured case study for public-facing clarity
Apply research-backed case study structure: headline anchoring on
biggest numbers, customer-as-hero framing, loss aversion, narrative
arc, methodology for developer credibility. Collapse PR inventory
to category summary, ~1,100 words in optimal range.
2026-04-16 14:40:05 -05:00
Kevin Turcios
6d05aea09c Revamp engagement report layout and timeline for executive clarity
- Move Infrastructure Cost Impact above hero metrics and tab toggle
- Extract shared above-fold content into _above_fold_content() for /jpc parity
- Replace plotly Gantt chart with pure-HTML vertical timeline
- Fix cross-browser flex layout (explicit flex: 1 1 0%, minWidth: 0)
- Remove redundant "The Results" and "How This Was Tested" sections
- Rename Engineering Team → Engineering Details
- Rename Peak RSS → Peak Memory Usage
- Update timeline dates: 1-week buffer after Phase 1, cascade phases
- Rename section headers: Vertical Optimization Roadmap, Proposed Next Engagement
2026-04-16 14:31:32 -05:00
Kevin Turcios
aa259b4652 Update uv.lock for security audit app dependencies
2026-04-16 06:19:28 -05:00
Kevin Turcios
3e63326876 Add standalone security audit app for Plotly Cloud deployment
Separate deployment at https://19727fbf-a6a0-45ac-968f-680035ab6b3b.plotly.app
with its own pyproject.toml, lockfile, and plotly-cloud.toml config.
2026-04-16 06:18:33 -05:00
Kevin Turcios
514c1e28c9 Tailor security report for Lawrence, add UX improvements and talking points
- Rewrite executive summary to reference his PR #1465 lockfile fix and
  existing tooling (Renovate, Anchore, Chainguard)
- Reorder findings by category priority (supply chain > container > CI/CD)
  to lead with what matters most to the audience
- Add animated parallelogram background matching codeflash.ai aesthetic
- 6 research-backed UX changes: severity icons (WCAG 1.4.1), title-first
  cards (F-pattern), loss-framed 85% CTA, distinct status colors, card
  opacity for figure-ground separation
- Correct SEC-021 from 67% to 97% mutable Action pins per VM verification
  (only 2 of 96 SHA-pinned in core-product)
- Add talking-points-lawrence.md with profile, pain points, pitch strategy
2026-04-16 06:01:52 -05:00
Kevin Turcios
8c42f27eed Add 4-tab navigation to security audit report
Split the 39-finding wall into tabbed views matching the engagement
report pattern: Summary, Critical & High (21), Medium & Low (18),
and By Category with both category and repository breakdowns.
2026-04-16 05:05:32 -05:00
Kevin Turcios
3dc58775e3 Consolidate report into 4-tab view and clean up for production
- Replace Executive Brief with JPC Summary as default tab (Executive Summary)
- Add Timeline as 4th tab; standalone /jpc and /timeline routes preserved
- Remove dead code: build_exec_view, make_k8s_chart, unused latency vars
- Extract _logo_lockup helper, _TAB_BTN_STYLE constants to reduce duplication
- Use app.layout as function, env-configurable debug/port, update docstring
2026-04-16 04:48:16 -05:00
Kevin Turcios
c22c5babd1 Organize screenshots by date and session
- 2026-04-15: exec restructure, team view, engagements
- 2026-04-16-methodology: methodology notes across all views
- 2026-04-16-jpc: standalone JPC summary and route verification
- 2026-04-16-timeline: timeline iterations (reordering, date fixes, chart tuning)
2026-04-16 03:49:15 -05:00
Kevin Turcios
c3e7dba47b Add report screenshots to reports/unstructured/screenshots/
2026-04-16 03:48:14 -05:00
Kevin Turcios
b20c05a799 Add /timeline route with proposed engagement roadmap
- Gantt chart with 5 phases: Core-Product (completed), DevEx & CI/CD,
  Platform API, Security Hardening (concurrent with DevEx), Cost Discovery
- Phase detail cards with duration, dates, deliverables, dependencies
- DevEx as Phase 2 (POC already done, sets up faster CI for Phase 3)
- Security runs concurrent with Phase 2 (uv workspace enables lockfile)
- Investment summary with ~5 month total timeline
- Fixed x-axis range and removed rangeslider for clean proportional bars
2026-04-16 03:46:50 -05:00
Kevin Turcios
90091ccc12 Add /jpc standalone summary route and methodology notes
- Add build_jpc_view() with clean standalone layout at /jpc for JPC
  (no tabs, no hero — just the document that "stands on its own")
- Add URL routing via dcc.Location: / serves full report, /jpc serves summary
- Add methodology notes to exec view (How This Was Tested annotations)
- Add methodology notes to detail view (7-entry "why" card)
- Enrich team view Memory + Standalone vs. Cumulative explanations
2026-04-16 03:07:33 -05:00
Kevin Turcios
2da186d4df Apply learnings to team + detail views, remove redundancy
Team view:
- Add Engineering Impact Summary at top (4 metrics: memory, density,
  latency, idle vCPU) with pointer to sections below
- Remove Production Context card (redundant with Impact Summary)
- Trim memory table to only metrics not shown in chart (RSS per
  request, K8s allocation) — chart already shows pre/post/delta
- Fix "10-page scan" → "10-page scanned document" in methodology

Detail view:
- Add intro callout explaining this is the raw data backing the
  other two views
2026-04-16 02:46:01 -05:00
Kevin Turcios
c1b603afc4 Fix technical terminology in exec brief
- "CFS quota" → "1-CPU limit" (CFS is implementation detail, too
  technical for exec audience)
- "jemalloc" → "jemalloc, opt-in for 1-CPU pods" (missed instance)
- "requests 1 CPU / 32 GB RAM resource requests" → "per pod" (double
  "requests" was grammatically broken)
- "10-page scan" → "10-page scanned document" (consistent with
  workload profiles section)
2026-04-16 02:41:17 -05:00
Kevin Turcios
2c3aad4325 Restructure exec view: enablement-first flow for JPC audience
Reorder based on persuasion research (Three-Talk Model, Prospect
Theory, Kotter):

1. "The Engagement" — collaborative shared context (team talk)
2. "What This Enables" — loss-framed enablement: 9.2x pod density,
   41 idle vCPUs now available, -12.9% latency for agentic API
3. "The Results" — before/after proof of execution
4. Infrastructure Cost Impact (anchored on $100K/mo)
5. Workload Profiles + Methodology (credibility)
6. Delivered + Proposed Next Engagements

Key shift: lead with what the work unlocks (feature velocity,
platform capacity, API speed) rather than the technical achievement
(memory reduction). Cost savings is proof of execution, not the
headline.
2026-04-16 02:36:29 -05:00
Kevin Turcios
6143c38d78 Move workload profile explanations into Executive Brief
The 1p/10p/16p benchmark rationale belongs in the exec view — JPC
needs to understand that page count != workload before seeing the
numbers. Added "Benchmark Workload Profiles" section before "How This
Was Tested" with the three profiles and the data punchline (#1505 at
-32.6% on 1 page vs -7.4% on 16 pages).
2026-04-16 02:32:35 -05:00
Kevin Turcios
eeebf6eec2 Add workload profile explanations to latency benchmark table
The 1p/10p/16p column headers weren't self-explanatory. Added a
"Benchmark Workload Profiles" card above the latency table in the
Detail view explaining that each document tests a distinct workload
shape (table-dense, scanned, mixed), not just different page counts.

Also added annotation below the table calling out that #1505 has 4x
the impact on the 1-page doc vs. the 16-page doc — letting the data
demonstrate that per-document cost depends on content, not page count.
2026-04-16 02:27:00 -05:00