Keep only repeatString which reliably produces 284% improvement.
Drop computeSum (marginal 16%), filterEvens and instanceMethod (no
optimization found). Reduces tracer E2E from ~1h27m to ~21m.
Combine JFR profiling and argument capture agent into one
JAVA_TOOL_OPTIONS string, running the target program once instead of
twice. JFR and javaagent are orthogonal JVM features that coexist
without conflict. Keeps build_jfr_env/build_agent_env for standalone
use.
Make TOTAL_LOOPING_TIME configurable via CODEFLASH_LOOPING_TIME env var
(defaults to 10s). Set to 5s in Java E2E CI jobs to cut verification
time per candidate. Also cache the codeflash-runtime JAR keyed on
source hash to skip mvn install when unchanged.
Add .resolve() to TemporaryDirectory path to expand Windows 8.3 short
paths (e.g. RUNNER~1) to canonical long form, fixing test_pickle_patcher
failures on Windows CI. Also add missing return type annotations and
noqa suppressions for benchmark test file.
Move codeflash's own benchmarks to .codeflash/benchmarks/. Add
auto-discovery of .codeflash/benchmarks/ in codeflash compare and
benchmark mode -- when benchmarks-root is not explicitly configured,
the CLI checks for .codeflash/benchmarks/ before erroring.
Backwards compatible: users with existing benchmarks-root config
are unaffected. Docs continue to show tests/benchmarks as the
example path.
JavaSupport.ensure_runtime_environment() was never called during the
optimization flow, so _language_version stayed None and the backend
received language_version=null. The LLM had no Java version constraint,
causing it to generate Java 16+ APIs (e.g. Stream.toList()) for Java 11
projects.
Set PYTHONHASHSEED=0 in test subprocess environments so original and
candidate runs use identical hash behavior, eliminating a source of
non-deterministic return-value comparisons.
Also upgrade diff logging from debug to info level with actual types
and repr values for DID_PASS, RETURN_VALUE, and STDOUT diffs.
The jdk.ExecutionSample#period=1ms syntax in -XX:StartFlightRecording
is only supported on JDK 13+. On JDK 11 (CI), it causes
"Failure when starting JFR on_create_vm_2" and no JFR file is created.
The settings=profile preset still provides 10ms CPU sampling.
Drop repeatString from the Workload fixture (2→1 function).
computeSum alone exercises the full tracer→optimizer pipeline
(trace → replay tests → optimize → evaluate → rank → explain → review).
The second function added no additional pipeline coverage.
Remove filterEvens and instanceMethod from the Workload fixture (4→2
functions) and reduce main() loop from 1000→100 rounds. The E2E test
only needs to verify the tracer→optimizer pipeline works end-to-end;
it doesn't need 4 functions or 1604 replay tests to prove that.
Expected impact: ~2 functions × ~8 candidates × fewer replay tests
should bring the job from ~75 min down to ~10-15 min.
The flat 90s timeout was too aggressive for LLM-powered endpoints
(/testgen, /optimize, /refinement) under load, causing ReadTimeoutError
and failing the async-optimization E2E test. Split into (10s connect,
300s read) tuple so connections fail fast but LLM inference gets adequate time.
Move the 4 most common return-value types (str, list/tuple, dict) to
`orig_type is T` identity checks at the top of the dispatch chain,
before the frozenset lookup. A single pointer comparison is cheaper
than a frozenset hash, and these types need special handling anyway
(temp-path normalization, recursive comparison, superset support).
Before: dict traversed ~8 isinstance checks before being handled.
After: dict is handled at check #3 via `orig_type is dict`.
The isinstance fallbacks remain as slow-paths for subclasses (deque,
ChainMap, defaultdict, scipy dok_matrix, etc.).
Backported from codeflash-python dispatch ordering.
- test_benchmark_libcst_multi_file: discover_functions + get_code_optimization_context across 10 real source files
- test_benchmark_libcst_pipeline: full discover → extract → replace → merge pipeline on one file
Measures median wall-clock time for --version, --help, auth status,
and compare --help across 30 runs with 3 warmups.
Usage:
codeflash compare main codeflash/optimize \
--script "python benchmarks/bench_cli_startup.py" \
--script-output benchmarks/results.json
Cache the visitor dispatch tables that libcst rebuilds on every
MatcherDecoratableTransformer/Visitor instantiation. The tables
depend only on the class, not the instance, so caching by type is
safe. Saves ~27ms per visitor instantiation (24x faster).
Also fix pre-existing ruff F821 in cli.py (missing exit_with_message
import in process_pyproject_config).
Move libcst, rich.tree.Tree, console, comparator, code_utils, registry,
lsp.helpers, and LspMarkdownMessage from module-level to the methods that
use them. Only pydantic and TestType remain at module level (needed for
class definitions).
models.py import: 633ms → 125ms on Azure Standard_D4s_v5.
Defer console, formatter, code_utils, registry, and lsp.helpers imports
from module level into the functions that use them. Inline is_LSP_enabled
(a one-liner env var check) to avoid importing lsp.helpers on the happy
path of get_codeflash_api_key.
auth status: 237ms → 160ms on Azure Standard_D4s_v5.
uv-dynamic-versioning rewrites version.py on every `uv run`, so the
ruff auto-format job was inadvertently committing dev version strings.
Restore version.py files after formatting and revert the ones already
changed on this branch.