Three fixes to baseline establishment:
- Include zero-nanosecond runtimes in usable_runtime_data_by_test_case
(use `is not None` instead of a truthiness check; see the sketch after this list)
- Check runtime_data dict instead of total_timing scalar for skip decision
- Use module_root directly in PYTHONPATH (not module_root.parent) so
src-layout projects resolve imports correctly in test subprocesses
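A minimal sketch of the truthiness fix, with assumed data shapes (the real structure of usable_runtime_data_by_test_case may differ):

```python
# Assumed shape: mapping of test case id -> list of runtimes in ns (or None).
runtime_data_by_test_case = {"test_a": [0, 1200, None], "test_b": [None]}

# Before: `if t` silently dropped legitimate 0 ns runtimes along with None.
usable_runtime_data_by_test_case = {
    case: [t for t in timings if t is not None]
    for case, timings in runtime_data_by_test_case.items()
}
# -> {"test_a": [0, 1200], "test_b": []}
```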
Server-side instrumentation wrote return values to .bin files, which
corrupted under concurrent pytest processes (interleaved records →
UnicodeDecodeError). Client-side instrumentation writes to SQLite,
which handles concurrent access safely.
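A sketch of why the SQLite path is safe, under an assumed schema (the real table layout may differ): each pytest process opens its own connection, and SQLite serializes writers with file locks, so records cannot interleave the way raw appends to a shared .bin file can.

```python
import sqlite3

def record_return_value(db_path: str, test_id: str, payload: bytes) -> None:
    # timeout=30 makes this connection wait out other processes' write locks
    # instead of failing immediately with "database is locked".
    con = sqlite3.connect(db_path, timeout=30)
    try:
        con.execute(
            "CREATE TABLE IF NOT EXISTS results (test_id TEXT, value BLOB)"
        )
        con.execute("INSERT INTO results VALUES (?, ?)", (test_id, payload))
        con.commit()
    finally:
        con.close()
```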
The client now ignores instrumented_behavior_tests and
instrumented_perf_tests from the AI service response and instruments
the plain generated_tests locally using inject_profiling_into_existing_test,
the same path used for discovered existing tests.
The AI service returns instrumented test code with a
{codeflash_run_tmp_dir_client_side} placeholder in file paths.
generate_ai_tests() wrote these files without replacing the
placeholder, causing FileNotFoundError at test runtime.
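A sketch of the missing substitution step; the helper name and signature are hypothetical:

```python
from pathlib import Path

PLACEHOLDER = "{codeflash_run_tmp_dir_client_side}"

def write_generated_test(raw_path: str, code: str, tmp_dir: Path) -> Path:
    # Substitute the server-side placeholder before touching disk;
    # previously the literal placeholder leaked into the path and the
    # file could never be found at test runtime.
    resolved = Path(raw_path.replace(PLACEHOLDER, str(tmp_dir)))
    resolved.parent.mkdir(parents=True, exist_ok=True)
    resolved.write_text(code)
    return resolved
```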
Pytest natively discovers and runs unittest.TestCase subclasses,
making the separate unittest discovery/linking/merging code paths
redundant. This removes ~150 lines of dead branching logic across
discovery, linking, result merging, and test generation.
The pytest collection subprocess only added module_root's parent to
PYTHONPATH, which works when module_root is a package (e.g. src/aviary)
but fails when it is the source root itself (e.g. src). Now both
module_root and its parent are added so imports like
`from mypkg.core import func` resolve correctly in either layout.
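A sketch of the env construction, assuming the subprocess env is built from os.environ (helper name hypothetical):

```python
import os

def collection_env(module_root: str) -> dict[str, str]:
    env = os.environ.copy()
    paths = [module_root, os.path.dirname(module_root)]  # src and its parent
    if env.get("PYTHONPATH"):
        paths.append(env["PYTHONPATH"])  # preserve anything already set
    env["PYTHONPATH"] = os.pathsep.join(paths)
    return env
```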
AIClient.post() now catches ValueError (parent of JSONDecodeError)
when the server returns 200 with an empty or malformed body, converting
it to AIServiceError. All callers already handle AIServiceError, so the
pipeline degrades cleanly instead of crashing with an unhandled exception.
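A sketch of the guard; AIClient internals are assumptions. JSONDecodeError subclasses ValueError, so one except clause covers the stdlib json error, the requests wrapper, and an empty 200 body alike.

```python
import requests

class AIServiceError(Exception):
    pass

def parse_ai_response(resp: requests.Response) -> dict:
    try:
        return resp.json()
    except ValueError as e:  # includes json.JSONDecodeError
        raise AIServiceError(f"malformed AI service response: {e}") from e
```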
TimeoutExpired in execute_test_subprocess was propagating unhandled,
crashing the pipeline instead of degrading to "no baseline." Now
caught and returned as CompletedProcess(returncode=-1). Subprocess
timeout scales with test file count (120s base + 60s/file, cap 600s)
instead of a fixed 10-minute wait even for small suites.
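A sketch of both changes together (function and argument names are assumptions):

```python
import subprocess

def execute_tests(cmd: list[str], n_test_files: int) -> subprocess.CompletedProcess:
    # 120s base + 60s per test file, capped at 600s.
    timeout = min(120 + 60 * n_test_files, 600)
    try:
        return subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # Degrade to "no baseline" instead of crashing the pipeline.
        return subprocess.CompletedProcess(cmd, returncode=-1, stdout=b"", stderr=b"")
```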
Notable upstream fixes:
- Fix working-tree review crash on untracked directories
- Avoid embedding large adversarial review diffs
- Inherit process.env in app-server spawn
- Scope implicit resume-last and cancel to current session
- Gracefully handle unsupported thread/name/set on older CLI
- Use app-server auth status for Codex readiness
The codex session-lifecycle-hook.mjs SessionStart path only appends
two env vars to CLAUDE_ENV_FILE. Rewrite it as a bash script to avoid
the ~100ms V8 startup overhead. SessionEnd stays in Node.js since it
needs async broker teardown and process-tree management.
Tests overlay isolation, concurrent dispatch, thread safety,
exception handling, and the full evaluate_candidate_isolated flow
with mocked subprocess execution.
Previously, a failed baseline would still block on
candidates_future.result(), hanging the process if the AI service
was slow. Now the baseline is checked first and the future cancelled.
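A sketch of the ordering fix; names follow the description above but the real control flow may differ:

```python
from concurrent.futures import Future

def await_candidates(baseline, candidates_future: Future):
    if baseline is None:
        # Baseline failed: don't block on a slow AI service.
        candidates_future.cancel()  # no-op if the request already started
        return None
    return candidates_future.result()
```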
Evaluates candidates concurrently using ThreadPoolExecutor with project
overlays for isolation. Each candidate gets its own symlinked copy of
the project so test subprocesses don't interfere with each other or
the original source. Shared result lists protected with threading.Lock.
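A sketch of the dispatch loop; evaluate_candidate_isolated is from the description above, everything else is assumed:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def evaluate_all(candidates, evaluate_candidate_isolated, max_workers=4):
    results, lock = [], threading.Lock()

    def worker(candidate):
        outcome = evaluate_candidate_isolated(candidate)  # runs in its own overlay
        with lock:  # shared list, guarded as described above
            results.append(outcome)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(worker, candidates))  # force evaluation, surface exceptions
    return results
```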
Introduces symlink-based temporary directories that mirror the project
root, replacing only the target module file with candidate code. This
allows test subprocesses to run against candidate code without mutating
the original source on disk, enabling safe parallel evaluation.
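A minimal sketch of building such an overlay; the real implementation likely differs (e.g. symlinking whole top-level entries rather than every file):

```python
import tempfile
from pathlib import Path

def make_overlay(project_root: Path, target: Path, candidate_code: str) -> Path:
    overlay = Path(tempfile.mkdtemp(prefix="overlay-"))
    for path in project_root.rglob("*"):
        dest = overlay / path.relative_to(project_root)
        if path.is_dir():
            dest.mkdir(parents=True, exist_ok=True)  # real dirs so entries can differ
        else:
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.symlink_to(path)  # files are shared with the original tree
    shadow = overlay / target.relative_to(project_root)
    shadow.unlink()                    # drop the link to the original module
    shadow.write_text(candidate_code)  # candidate shadows it with a real file
    return overlay
```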
Start the AI service HTTP call for candidate generation concurrently
with baseline subprocess runs. The HTTP call doesn't depend on
baseline results, so it runs in a background thread while behavioral
tests, line profiling, and benchmarking execute. Saves 10-20s per
function (the AI round-trip time that was previously sequential).
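A sketch of the overlap; helper names are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def optimize_function(request_candidates, establish_baseline):
    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(request_candidates)  # AI round-trip starts immediately
        baseline = establish_baseline()        # local subprocess work runs meanwhile
        return baseline, fut.result()          # usually already resolved here
```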
Run line profiling and benchmarking subprocesses concurrently via
ThreadPoolExecutor instead of sequentially. Each writes to a unique
JUnit XML file to avoid collisions. Saves 20-60s per function
(the duration of the shorter subprocess).
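A sketch with hypothetical runner callables, each taking the JUnit XML path it should write:

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

def profile_and_benchmark(run_line_profile, run_benchmark):
    with ThreadPoolExecutor(max_workers=2) as pool:
        lp = pool.submit(run_line_profile, f"junit-lp-{uuid.uuid4().hex}.xml")
        bm = pool.submit(run_benchmark, f"junit-bm-{uuid.uuid4().hex}.xml")
        return lp.result(), bm.result()  # unique paths, so no report collisions
```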
Add missing # noqa: PLC0415 comments to deferred imports that lost
them during ruff --fix reformatting. Add unittest to TYPE_CHECKING
block in discovery.py so annotations resolve. Fix import sorting.
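The pattern in question, shown on a made-up function:

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    import unittest  # annotation-only: free at runtime, visible to type checkers

def load_tests(start_dir: str) -> unittest.TestSuite:
    # Deferred to first use; the noqa keeps ruff from flagging PLC0415.
    import unittest  # noqa: PLC0415
    return unittest.defaultTestLoader.discover(start_dir)
```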
Move jedi, libcst, dill, and unittest imports from module level to
first use in _function_optimizer, _optimizer, _orchestrator, and
test_discovery. Pipeline __init__.py is now lazy via __getattr__.
Optimizer module load: 3.7s → 51ms. All 2591 tests pass.
Point attrs dependency at local fork (KRRT7/attrs perf/defer-inspect-import)
which defers the ~12ms inspect import until first class build. Temporary
override until upstream merges python-attrs/attrs#1547.
Also adds attrs optimization case study data (VM infra, status).
Replace eager imports of all submodules with a __getattr__-based
lookup table. Submodules now load only when a name is accessed,
dropping `import codeflash_core` from ~74ms to ~4ms (98% reduction
from the original ~230ms). TYPE_CHECKING imports preserved for
static analysis. All 65 core + 2526 downstream tests pass.
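A sketch of the PEP 562 pattern in the package __init__.py (the submodule table is illustrative, not the real one):

```python
import importlib

_SUBMODULES = {"discovery", "optimizer"}  # names exported lazily

def __getattr__(name: str):
    if name in _SUBMODULES:
        module = importlib.import_module(f".{name}", __package__)
        globals()[name] = module  # cache so __getattr__ fires once per name
        return module
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```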
Move requests, gitpython, sentry-sdk, and posthog imports from
module level into the functions that use them. This drops
`import codeflash_core` from ~230ms to ~74ms, making it viable
for lightweight consumers (e.g. the project detector) to depend
on core submodules without blowing startup budgets.
- _telemetry: defer sentry_sdk + posthog into init functions (20ms → 0.2ms)
- _git: use PEP 562 __getattr__ for lazy git import (59ms → 4ms)
- _platform: defer requests + sentry_sdk into methods (55ms → 1ms)
- _client: defer requests into post() method (72ms → 34ms)
- _http: add shared _make_session() factory for deferred Session creation
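A sketch of the deferred-session factory; the real signature and headers are assumptions:

```python
def _make_session():
    import requests  # deferred: the import cost is paid on first HTTP use only

    session = requests.Session()
    session.headers["User-Agent"] = "codeflash"
    return session
```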
Weave "optimizations reveal deeper issues" framing into engagement report
executive summary, case study, and optimization README. Add O(N²) text
extraction fix, per-request RSS creep (24 MB → 17 MB), and memray profiling
data that were previously undocumented.
Apply research-backed case study structure: headline anchoring on
biggest numbers, customer-as-hero framing, loss aversion, narrative
arc, methodology for developer credibility. Collapse PR inventory
to category summary, ~1,100 words in optimal range.
- Rewrite executive summary to reference Lawrence's PR #1465 lockfile fix and
existing tooling (Renovate, Anchore, Chainguard)
- Reorder findings by category priority (supply chain > container > CI/CD)
to lead with what matters most to the audience
- Add animated parallelogram background matching codeflash.ai aesthetic
- 6 research-backed UX changes: severity icons (WCAG 1.4.1), title-first
cards (F-pattern), loss-framed 85% CTA, distinct status colors, card
opacity for figure-ground separation
- Correct SEC-021 from 67% to 97% mutable Action pins per VM verification
(only 2 of 96 SHA-pinned in core-product)
- Add talking-points-lawrence.md with profile, pain points, pitch strategy
Split the 39-finding wall into tabbed views matching the engagement
report pattern: Summary, Critical & High (21), Medium & Low (18),
and By Category with both category and repository breakdowns.
- Add build_jpc_view() with clean standalone layout at /jpc for JPC
(no tabs, no hero — just the document that "stands on its own")
- Add URL routing via dcc.Location: / serves full report, /jpc serves summary
  (see the sketch after this list)
- Add methodology notes to exec view (How This Was Tested annotations)
- Add methodology notes to detail view (7-entry "why" card)
- Enrich team view Memory + Standalone vs. Cumulative explanations
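A sketch of the routing; build_jpc_view() is named above, the other names are assumptions:

```python
from dash import Dash, Input, Output, dcc, html

def build_jpc_view():     # stub: the real builder lives in the report module
    return html.Div("JPC summary")

def build_full_report():  # hypothetical name for the tabbed report builder
    return html.Div("Full report")

app = Dash(__name__)
app.layout = html.Div([dcc.Location(id="url"), html.Div(id="page")])

@app.callback(Output("page", "children"), Input("url", "pathname"))
def route(pathname):
    # /jpc gets the standalone summary (no tabs, no hero); everything
    # else falls through to the full report.
    return build_jpc_view() if pathname == "/jpc" else build_full_report()
```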
Team view:
- Add Engineering Impact Summary at top (4 metrics: memory, density,
latency, idle vCPU) with pointer to sections below
- Remove Production Context card (redundant with Impact Summary)
- Trim memory table to only metrics not shown in chart (RSS per
request, K8s allocation) — chart already shows pre/post/delta
- Fix "10-page scan" → "10-page scanned document" in methodology
Detail view:
- Add intro callout explaining this is the raw data backing the
other two views
Reorder based on persuasion research (Three-Talk Model, Prospect
Theory, Kotter):
1. "The Engagement" — collaborative shared context (team talk)
2. "What This Enables" — loss-framed enablement: 9.2x pod density,
41 idle vCPUs now available, -12.9% latency for agentic API
3. "The Results" — before/after proof of execution
4. Infrastructure Cost Impact (anchored on $100K/mo)
5. Workload Profiles + Methodology (credibility)
6. Delivered + Proposed Next Engagements
Key shift: lead with what the work unlocks (feature velocity,
platform capacity, API speed) rather than the technical achievement
(memory reduction). Cost savings is proof of execution, not the
headline.
The 1p/10p/16p benchmark rationale belongs in the exec view — JPC
needs to understand that page count != workload before seeing the
numbers. Added "Benchmark Workload Profiles" section before "How This
Was Tested" with the three profiles and the data punchline (#1505 at
-32.6% on 1 page vs -7.4% on 16 pages).
The 1p/10p/16p column headers weren't self-explanatory. Added a
"Benchmark Workload Profiles" card above the latency table in the
Detail view explaining that each document tests a distinct workload
shape (table-dense, scanned, mixed), not just different page counts.
Also added annotation below the table calling out that #1505 has 4x
the impact on the 1-page doc vs. the 16-page doc — letting the data
demonstrate that per-document cost depends on content, not page count.