Commit graph

7890 commits

Author SHA1 Message Date
Sarthak Agarwal
3b8a2e5c82 update Docs for Plugin 2026-04-15 00:37:17 +05:30
Kevin Turcios
4d4cb5f517
Merge pull request #2059 from codeflash-ai/refactor/benchmarks-to-dotcodeflash
Move benchmarks to .codeflash/benchmarks/
2026-04-13 05:06:00 -05:00
Kevin Turcios
819a56c33e
Merge pull request #2058 from codeflash-ai/perf/reduce-java-tracer-e2e
perf: optimize Java tracing agent (E2E reduction + serialization + writes)
2026-04-10 18:43:58 -05:00
Kevin Turcios
b737f71e46 fix: update test assertions to match simplified Workload fixture
The Workload.java fixture was trimmed to only repeatString but test
files still asserted computeSum, filterEvens, and instanceMethod.
2026-04-10 16:05:27 -05:00
Kevin Turcios
0cb67c1a17 fix: add --no-pr to codeflash optimize workflow to prevent CI-opened PRs 2026-04-10 15:12:48 -05:00
Kevin Turcios
5c778dfad4 perf: trim tracer E2E workload to single function (repeatString)
Keep only repeatString which reliably produces 284% improvement.
Drop computeSum (marginal 16%), filterEvens and instanceMethod (no
optimization found). Reduces tracer E2E from ~1h27m to ~21m.
2026-04-10 15:08:03 -05:00
Kevin Turcios
40f16b565a ci: add standalone Java E2E workflow for isolated testing 2026-04-10 13:09:36 -05:00
Kevin Turcios
cb87763a2d fix: skip environment approval gate for trusted users on workflow_dispatch 2026-04-10 12:58:54 -05:00
Kevin Turcios
013c83f5e4 fix: drop jdk.ExecutionSample#period from combined JFR opts (unsupported on Java 11) 2026-04-10 09:11:02 -05:00
Kevin Turcios
0d928f2b49 perf: merge Java tracer into single-pass JVM invocation
Combine JFR profiling and argument capture agent into one
JAVA_TOOL_OPTIONS string, running the target program once instead of
twice. JFR and javaagent are orthogonal JVM features that coexist
without conflict. Keeps build_jfr_env/build_agent_env for standalone
use.
2026-04-10 09:05:30 -05:00
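A rough sketch of the single-pass idea in 0d928f2b49, assuming both options are joined with a space into one JAVA_TOOL_OPTIONS value. The helper name `build_combined_env` and the agent argument format are illustrative assumptions, not the project's code. `-XX:StartFlightRecording` and `-javaagent` are standard JVM options.

```python
def build_combined_env(base_env, jfr_file, agent_jar, agent_args=""):
    """Return a copy of base_env with JFR and the agent in JAVA_TOOL_OPTIONS.

    Hypothetical helper: paths containing spaces would need quoting,
    which this sketch does not handle.
    """
    # settings=profile is the preset mentioned elsewhere in this log.
    jfr_opt = f"-XX:StartFlightRecording=filename={jfr_file},settings=profile"
    agent_opt = f"-javaagent:{agent_jar}"
    if agent_args:
        agent_opt += f"={agent_args}"

    env = dict(base_env)
    existing = env.get("JAVA_TOOL_OPTIONS", "")
    # Preserve anything the caller already set, then append both options.
    parts = [p for p in (existing, jfr_opt, agent_opt) if p]
    env["JAVA_TOOL_OPTIONS"] = " ".join(parts)
    return env
```

Because the JVM reads JAVA_TOOL_OPTIONS once at startup, one process launched with this environment would get both the recording and the agent.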
Kevin Turcios
ecf4e63eca perf: reduce Java E2E looping time to 5s and cache runtime JAR build
Make TOTAL_LOOPING_TIME configurable via CODEFLASH_LOOPING_TIME env var
(defaults to 10s). Set to 5s in Java E2E CI jobs to cut verification
time per candidate. Also cache the codeflash-runtime JAR keyed on
source hash to skip mvn install when unchanged.
2026-04-10 09:02:45 -05:00
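A minimal sketch of the configurable looping time from ecf4e63eca, assuming the variable holds a number of seconds. The helper name `get_total_looping_time` and the fallback on invalid input are assumptions for illustration only.

```python
import os

DEFAULT_LOOPING_TIME_SECONDS = 10.0  # default stated in the commit message


def get_total_looping_time(env=None):
    """Read CODEFLASH_LOOPING_TIME in seconds (hypothetical helper)."""
    env = os.environ if env is None else env
    raw = env.get("CODEFLASH_LOOPING_TIME")
    if raw is None:
        return DEFAULT_LOOPING_TIME_SECONDS
    try:
        value = float(raw)
    except ValueError:
        # Assumption: an unparsable value falls back to the default.
        return DEFAULT_LOOPING_TIME_SECONDS
    return value if value > 0 else DEFAULT_LOOPING_TIME_SECONDS
```

A CI job would then export CODEFLASH_LOOPING_TIME=5 before running the E2E step.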
Kevin Turcios
8959ead2f9 fix: resolve Windows 8.3 short paths in get_run_tmp_file and fix ruff lint errors
Add .resolve() to TemporaryDirectory path to expand Windows 8.3 short
paths (e.g. RUNNER~1) to canonical long form, fixing test_pickle_patcher
failures on Windows CI. Also add missing return type annotations and
noqa suppressions for benchmark test file.
2026-04-10 08:51:10 -05:00
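The `.resolve()` fix in 8959ead2f9 can be sketched as follows. The function name `make_run_tmp_dir` and the prefix are hypothetical; only the pattern of resolving the TemporaryDirectory path comes from the commit message.

```python
import tempfile
from pathlib import Path


def make_run_tmp_dir():
    """Create a temp dir and return (handle, canonical_path).

    Path.resolve() returns the canonical absolute path; per the commit
    message, on Windows this expands 8.3 short components such as
    RUNNER~1 to their long form.
    """
    handle = tempfile.TemporaryDirectory(prefix="run_tmp_")
    return handle, Path(handle.name).resolve()
```

Keeping the handle alongside the path lets the caller clean up explicitly with `handle.cleanup()`.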
Kevin Turcios
ec14860d29 Move benchmarks to .codeflash/benchmarks/ and auto-discover
Move codeflash's own benchmarks to .codeflash/benchmarks/. Add
auto-discovery of .codeflash/benchmarks/ in codeflash compare and
benchmark mode -- when benchmarks-root is not explicitly configured,
the CLI checks for .codeflash/benchmarks/ before erroring.

Backwards compatible: users with existing benchmarks-root config
are unaffected. Docs continue to show tests/benchmarks as the
example path.
2026-04-10 08:39:15 -05:00
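A sketch of the lookup order described in ec14860d29, under the assumption that an explicit benchmarks-root always wins. The function name `resolve_benchmarks_root` and the error type are illustrative, not the CLI's actual code.

```python
from pathlib import Path


def resolve_benchmarks_root(configured, project_root):
    """Pick the benchmarks directory (hypothetical helper).

    1. An explicitly configured benchmarks-root is used unchanged.
    2. Otherwise .codeflash/benchmarks/ under the project root is tried.
    3. If neither exists, raise (the real CLI reports an error here).
    """
    if configured is not None:
        return Path(configured)
    candidate = Path(project_root) / ".codeflash" / "benchmarks"
    if candidate.is_dir():
        return candidate
    raise FileNotFoundError(
        "benchmarks-root is not configured and .codeflash/benchmarks/ was not found"
    )
```

Checking the configured value first is what keeps existing benchmarks-root setups unaffected.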
Kevin Turcios
151df774a4 perf: use --effort low for java-tracer E2E to reduce CI time 2026-04-10 08:29:46 -05:00
Kevin Turcios
b05561ef9e chore: replace console.print with logger.info for Java project detection 2026-04-10 07:51:08 -05:00
Kevin Turcios
70260f22b3 fix: ensure language_version is detected before optimization API calls
JavaSupport.ensure_runtime_environment() was never called during the
optimization flow, so _language_version stayed None and the backend
received language_version=null. The LLM had no Java version constraint,
causing it to generate Java 16+ APIs (e.g. Stream.toList()) for Java 11
projects.
2026-04-10 07:39:49 -05:00
Kevin Turcios
82ec301fad chore: remove diagnostic logging from compare_test_results 2026-04-10 06:49:43 -05:00
Kevin Turcios
986654b7e6 fix: pin PYTHONHASHSEED=0 in test env and enhance diff diagnostics
Set PYTHONHASHSEED=0 in test subprocess environments so original and
candidate runs use identical hash behavior, eliminating a source of
non-deterministic return-value comparisons.

Also upgrade diff logging from debug to info level with actual types
and repr values for DID_PASS, RETURN_VALUE, and STDOUT diffs.
2026-04-10 06:38:08 -05:00
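To illustrate the PYTHONHASHSEED pin from 986654b7e6: with the seed fixed, two separate interpreter runs produce the same string hashes. The helper name `run_with_pinned_hash_seed` is an assumption; the environment variable itself is standard CPython behavior.

```python
import os
import subprocess
import sys


def run_with_pinned_hash_seed(code):
    """Run `code` in a child interpreter with PYTHONHASHSEED=0."""
    env = os.environ.copy()
    env["PYTHONHASHSEED"] = "0"  # 0 disables hash randomization
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    snippet = "print(hash('codeflash'), list({'a', 'b', 'c'}))"
    first = run_with_pinned_hash_seed(snippet)
    second = run_with_pinned_hash_seed(snippet)
    print(first == second)
```

Without the pin, set iteration order over strings can differ between the original and candidate runs, which would show up as a spurious return-value diff.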
Kevin Turcios
e191f74aa6 chore: add diagnostic logging to compare_test_results
Temporary instrumentation to debug flaky futurehouse E2E test.
Logs matched/skipped/timed-out counts and did_all_timeout state.
2026-04-10 06:16:39 -05:00
Kevin Turcios
fefccd5935 fix: drop JFR inline event config that breaks JDK 11
The jdk.ExecutionSample#period=1ms syntax in -XX:StartFlightRecording
is only supported on JDK 13+. On JDK 11 (CI), it causes
"Failure when starting JFR on_create_vm_2" and no JFR file is created.
The settings=profile preset still provides 10ms CPU sampling.
2026-04-10 05:28:34 -05:00
Kevin Turcios
bfe6f3a828 Remove debug timing instrumentation from tracer
Strip AtomicLong accumulators, System.nanoTime() timing, and
getTimingSummary() that were added for profiling. No functional change.
2026-04-10 05:16:49 -05:00
Kevin Turcios
01e22152c7 flexing 2026-04-10 05:07:53 -05:00
Kevin Turcios
e81f25f825 fix: remove stale repeatString assertions from integration tests
repeatString was removed from Workload.java in the E2E reduction.
2026-04-10 05:05:17 -05:00
Kevin Turcios
0772398c59 perf: optimize Java tracing agent serialization and writes
- Reuse ThreadLocal Kryo Output buffers (eliminates #1 allocation hotspot)
- Fast-path inline serialization for safe arg types (bypasses executor)
- Skip verification roundtrip for known-safe containers (ArrayList, HashMap, etc.)
- Batch SQLite inserts (256/txn) with permanent autocommit-off
- Switch to ArrayBlockingQueue (no per-element Node allocation)
- Add opt-in in-memory SQLite mode (VACUUM INTO at shutdown), enabled in CI
- Add timing instrumentation (onEntry, serialization, writes, dump)
- Add ProfilingWorkload fixture for benchmarking

Benchmark (50k captures): onEntry 5200ms→1200ms (4.3x), avg/capture
0.43ms→0.02ms (21x), writes 3200ms→900ms (3.5x) with in-memory mode.
2026-04-10 04:55:36 -05:00
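The batching bullet in 0772398c59 refers to the Java agent. As a Python analog only, the sketch below commits once per 256 rows instead of once per row. The table layout and the function name `write_captures` are made up for illustration.

```python
import sqlite3

BATCH_SIZE = 256  # rows per transaction, as stated in the commit message


def write_captures(conn, rows, batch_size=BATCH_SIZE):
    """Insert rows in batches; return the number of commits issued."""
    cur = conn.cursor()
    batch = []
    commits = 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            cur.executemany("INSERT INTO captures(method, args) VALUES (?, ?)", batch)
            conn.commit()  # one transaction per full batch
            commits += 1
            batch.clear()
    if batch:
        # Flush the final partial batch.
        cur.executemany("INSERT INTO captures(method, args) VALUES (?, ?)", batch)
        conn.commit()
        commits += 1
    return commits


def open_in_memory_db():
    """In-memory database with a hypothetical captures table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE captures(method TEXT, args BLOB)")
    return conn
```

Grouping writes this way amortizes the per-transaction cost across the batch. The VACUUM INTO step that the commit uses to persist the in-memory database is not shown here.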
Kevin Turcios
08aa94c54a perf: reduce java-tracer E2E to single function for ~11 min target
Drop repeatString from the Workload fixture (2→1 function).
computeSum alone exercises the full tracer→optimizer pipeline
(trace → replay tests → optimize → evaluate → rank → explain → review).
The second function added no additional pipeline coverage.
2026-04-10 03:44:54 -05:00
Kevin Turcios
46957e190f fix: update java tracer unit tests for reduced Workload fixture
Remove assertions for filterEvens and instanceMethod which were removed
from the Workload fixture. Adjust expected invocation counts accordingly.
2026-04-10 03:17:46 -05:00
Kevin Turcios
21f61ec93d ci: add java_tracer_e2e fixture path to e2e_java change detection
The fixture directory wasn't in the path filter, so changes to
Workload.java didn't trigger the java E2E tests.
2026-04-10 03:08:03 -05:00
Kevin Turcios
2b0f633c0f perf: reduce java-tracer E2E from ~75 min to ~15 min
Remove filterEvens and instanceMethod from the Workload fixture (4→2
functions) and reduce main() loop from 1000→100 rounds. The E2E test
only needs to verify the tracer→optimizer pipeline works end-to-end;
it doesn't need 4 functions or 1604 replay tests to prove that.

Expected impact: ~2 functions × ~8 candidates × fewer replay tests
should bring the job from ~75 min down to ~10-15 min.
2026-04-10 03:04:29 -05:00
Kevin Turcios
5ee642e35e
Merge pull request #2057 from codeflash-ai/fix/api-read-timeout
fix: increase API read timeout to prevent flaky E2E failures
2026-04-10 02:45:31 -05:00
Kevin Turcios
4ac573f10f fix: increase API read timeout from 90s to 300s to prevent flaky E2E failures
The flat 90s timeout was too aggressive for LLM-powered endpoints
(/testgen, /optimize, /refinement) under load, causing ReadTimeoutError
and failing the async-optimization E2E test. Split into (10s connect,
300s read) tuple so connections fail fast but LLM inference gets adequate time.
2026-04-10 02:33:16 -05:00
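A sketch of the split timeout from 4ac573f10f, assuming the client uses the third-party `requests` library, which accepts a (connect, read) tuple for `timeout`. The function name `post_with_split_timeout` and the constant name are illustrative.

```python
# (connect seconds, read seconds), values taken from the commit message
API_TIMEOUT = (10, 300)


def post_with_split_timeout(url, payload):
    """POST JSON with a short connect timeout and a long read timeout."""
    # Imported here so the sketch loads even where requests is not installed.
    import requests

    # A connection that cannot be established fails after 10s, while a slow
    # LLM-backed endpoint gets up to 300s between bytes of the response.
    return requests.post(url, json=payload, timeout=API_TIMEOUT)
```

A single flat value such as `timeout=90` applies the same limit to both phases, which is the behavior the commit replaces.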
Kevin Turcios
72a41a5665
Merge pull request #2055 from codeflash-ai/perf/defer-cli-imports
perf: defer cli.py imports for 7.7x faster --help
2026-04-10 01:59:57 -05:00
Kevin Turcios
93810f8be6
Merge pull request #2056 from codeflash-ai/chore/delete-disabled-workflows
chore: delete disabled codeflash.yaml workflow
2026-04-10 01:52:47 -05:00
Kevin Turcios
79d47e0fae chore: delete disabled codeflash.yaml workflow
JS ESM integration test — disabled and superseded by ci.yaml's e2e-js matrix.
2026-04-10 01:51:52 -05:00
Kevin Turcios
381d1319ea fix: specify utf-8 encoding in benchmark read_text for Windows CI
Windows defaults to cp1252 which can't decode some source file bytes.
2026-04-10 01:48:31 -05:00
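The encoding fix in 381d1319ea amounts to passing `encoding="utf-8"` explicitly. The sketch below shows why the default can fail: the UTF-8 bytes of some characters include a byte that cp1252 leaves undefined. The helper names are hypothetical.

```python
import tempfile
from pathlib import Path


def read_source(path):
    """Read a source file as UTF-8 regardless of the platform default."""
    return Path(path).read_text(encoding="utf-8")


def roundtrip_demo():
    """Write and re-read a string whose UTF-8 bytes include 0x81."""
    text = 'name = "\u00c1ngel"\n'  # U+00C1 encodes to bytes C3 81 in UTF-8
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.py"
        path.write_text(text, encoding="utf-8")
        return read_source(path) == text
```

Decoding the same bytes as cp1252 raises UnicodeDecodeError in Python, because byte 0x81 has no mapping in that code page.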
Kevin Turcios
fe39d40e1b perf: add type identity fast-paths for str/list/tuple/dict in comparator
Move the 4 most common return-value types (str, list/tuple, dict) to
`orig_type is T` identity checks at the top of the dispatch chain,
before the frozenset lookup.  A single pointer comparison is cheaper
than a frozenset hash, and these types need special handling anyway
(temp-path normalization, recursive comparison, superset support).

Before: dict traversed ~8 isinstance checks before being handled.
After:  dict is handled at check #3 via `orig_type is dict`.

The isinstance fallbacks remain as slow-paths for subclasses (deque,
ChainMap, defaultdict, scipy dok_matrix, etc.).

Backported from codeflash-python dispatch ordering.
2026-04-10 01:25:05 -05:00
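The dispatch ordering in fe39d40e1b can be sketched as below. This is a simplified stand-in, not the project's comparator: it omits temp-path normalization and superset support, and the contents of `_IDENTITY_TYPES` are an assumption.

```python
from decimal import Decimal

# Assumed set of types that can be compared with plain ==.
_IDENTITY_TYPES = frozenset(
    {int, bool, float, complex, bytes, type(None), frozenset, range, Decimal}
)


def compare(orig, new):
    """Return True if two return values match (simplified sketch)."""
    orig_type = type(orig)
    if orig_type is not type(new):
        return False
    # Fast paths: identity checks for the most common types come first.
    if orig_type is str:
        return orig == new
    if orig_type is list or orig_type is tuple:
        return len(orig) == len(new) and all(compare(a, b) for a, b in zip(orig, new))
    if orig_type is dict:
        return orig.keys() == new.keys() and all(compare(v, new[k]) for k, v in orig.items())
    # Next: one hash lookup for types that need no special handling.
    if orig_type in _IDENTITY_TYPES:
        return orig == new
    # Slow paths: isinstance checks catch subclasses such as defaultdict.
    if isinstance(orig, dict):
        return orig.keys() == new.keys() and all(compare(v, new[k]) for k, v in orig.items())
    if isinstance(orig, (list, tuple)):
        return len(orig) == len(new) and all(compare(a, b) for a, b in zip(orig, new))
    return orig == new
```

An exact `dict` never reaches the isinstance chain here, while a `defaultdict` falls through to it, matching the fast-path and slow-path split the commit describes.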
Kevin Turcios
5a5b6e46ac bench: add dedicated comparator microbenchmark for frozenset fast-path
5 scenarios: primitives, nested dicts, DB rows, deep nesting,
and identity types (frozenset/range/complex/Decimal/OrderedDict).
2026-04-10 01:05:02 -05:00
Kevin Turcios
4c3c6ea167 perf: add frozenset fast-path for comparator type dispatch
Use O(1) frozenset membership test with type identity before falling
through to isinstance MRO traversal. Backported from codeflash-python.
2026-04-10 00:53:55 -05:00
Kevin Turcios
accbab4a16 fix: update test_cmd_auth patches for deferred imports
Imports in cmd_auth.py were moved into function bodies, so mock
patches must target the source modules instead of cmd_auth's namespace.
2026-04-10 00:36:02 -05:00
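Why the patch target changes in accbab4a16 can be shown with a stdlib stand-in. Here `platform.system` plays the role of a dependency imported inside a function body; the function name `describe_platform` is hypothetical.

```python
from unittest import mock


def describe_platform():
    # Deferred import: the name is looked up in the source module on each
    # call, so this module never holds a module-level `system` attribute.
    from platform import system

    return f"running on {system()}"


def check_patch_on_source_module():
    # Patch the attribute where it lives (the source module), because there
    # is no module-level name in the consumer's namespace to replace.
    with mock.patch("platform.system", return_value="TestOS"):
        return describe_platform()
```

With a module-level import, the usual advice is the opposite: patch the name in the consuming module. Moving the import into the function removes that name, so the source module is the only place left to patch.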
Kevin Turcios
2e2e19f7ae bench: add libcst visitor benchmarks for multi-file and full pipeline
- test_benchmark_libcst_multi_file: discover_functions + get_code_optimization_context across 10 real source files
- test_benchmark_libcst_pipeline: full discover → extract → replace → merge pipeline on one file
2026-04-10 00:21:45 -05:00
Kevin Turcios
1a25f05e14 fix: remove unnecessary Optimizer from benchmark test
The test only needs project_root, not a full Optimizer (which requires
an API key). Also adds missing __init__.py to tests/benchmarks/.
2026-04-10 00:10:36 -05:00
Kevin Turcios
2208e8ca77 bench: add CLI startup benchmark for codeflash compare --script
Measures median wall-clock time for --version, --help, auth status,
and compare --help across 30 runs with 3 warmups.

Usage:
  codeflash compare main codeflash/optimize \
    --script "python benchmarks/bench_cli_startup.py" \
    --script-output benchmarks/results.json
2026-04-09 23:59:26 -05:00
Kevin Turcios
b533f50bdc perf: backport libcst visitor dispatch cache from codeflash-python
Cache the visitor dispatch tables that libcst rebuilds on every
MatcherDecoratableTransformer/Visitor instantiation. The tables
depend only on the class, not the instance, so caching by type is
safe. Saves ~27ms per visitor instantiation (24x faster).

Also fix pre-existing ruff F821 in cli.py (missing exit_with_message
import in process_pyproject_config).
2026-04-09 23:46:45 -05:00
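The caching idea in b533f50bdc, sketched without libcst: a table derived only from the class is built once per type and shared by every instance. All names here (`_DISPATCH_CACHE`, `CachedVisitor`, `build_dispatch_table`) are illustrative and do not reflect libcst internals.

```python
_DISPATCH_CACHE = {}  # maps visitor class -> dispatch table


def build_dispatch_table(cls):
    """Map node kind -> method name by scanning the class for visit_* methods."""
    prefix = "visit_"
    return {
        name[len(prefix):]: name
        for name in dir(cls)
        if name.startswith(prefix) and callable(getattr(cls, name))
    }


class CachedVisitor:
    def __init__(self):
        cls = type(self)
        table = _DISPATCH_CACHE.get(cls)
        if table is None:
            # Built once per class; safe because it depends only on the class.
            table = build_dispatch_table(cls)
            _DISPATCH_CACHE[cls] = table
        self._dispatch = table

    def visit(self, node_kind, node):
        name = self._dispatch.get(node_kind)
        return getattr(self, name)(node) if name else None


class NameCollector(CachedVisitor):
    def visit_Name(self, node):
        return f"name:{node}"
```

Keying the cache by `type(self)` rather than by a base class matters: each subclass defines its own visit methods and so needs its own table.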
github-actions[bot]
61053be9ce style: auto-format with ruff 2026-04-10 04:39:45 +00:00
Kevin Turcios
436d642847 perf: defer libcst, Rich, comparator imports in models.py
Move libcst, rich.tree.Tree, console, comparator, code_utils, registry,
lsp.helpers, and LspMarkdownMessage from module-level to the methods that
use them. Only pydantic and TestType remain at module level (needed for
class definitions).

models.py import: 633ms → 125ms on Azure Standard_D4s_v5.
2026-04-09 23:38:40 -05:00
github-actions[bot]
88babfef25 style: auto-format with ruff 2026-04-10 04:30:36 +00:00
Kevin Turcios
2fc528ebda perf: defer heavy imports in env_utils and shell_utils
Defer console, formatter, code_utils, registry, and lsp.helpers imports
from module level into the functions that use them. Inline is_LSP_enabled
(a one-liner env var check) to avoid importing lsp.helpers on the happy
path of get_codeflash_api_key.

auth status: 237ms → 160ms on Azure Standard_D4s_v5.
2026-04-09 23:29:31 -05:00
Kevin Turcios
992e91abc7 fix: prevent ruff auto-format from rewriting version.py placeholders
uv-dynamic-versioning rewrites version.py on every `uv run`, so the
ruff auto-format job was inadvertently committing dev version strings.
Restore version.py files after formatting and revert the ones already
changed on this branch.
2026-04-09 23:21:25 -05:00
github-actions[bot]
1e8e5d2cc2 style: auto-format with ruff 2026-04-10 04:14:58 +00:00
Kevin Turcios
a8c004164e perf: skip telemetry/banner for auth and compare commands
Restructure main() command dispatch so auth and compare exit early
without loading telemetry (sentry, posthog), version_check, or the
banner. Defer cmd_auth.py imports into functions.

auth status: ~1000ms → 237ms (4.2x)
compare --help: ~297ms → 38ms (7.9x)
2026-04-09 23:14:03 -05:00
github-actions[bot]
05a7641405 style: auto-format with ruff 2026-04-10 04:09:00 +00:00