- Add PR template with required linked issue/discussion section
- Add check-linked-issue CI job that validates PR body contains a
reference (#123, Closes/Fixes/Relates, GitHub URL, or CF-# ticket)
- Wire into required-checks-passed gate so it blocks merge
- Update CONTRIBUTING.md with the policy and motivation
Closes #522
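The accepted reference forms can be validated with a small regex check; a minimal sketch, assuming the CI job scans the PR body text (the exact patterns beyond #123, Closes/Fixes/Relates, GitHub URLs, and CF-# tickets are illustrative):

```python
import re

# Accepted reference forms from the commit message; exact regexes are assumed.
LINK_PATTERNS = [
    r"#\d+",                                    # bare issue reference, e.g. #123
    r"\b(?:Closes|Fixes|Relates)\b",            # linking keywords
    r"https://github\.com/\S+/(?:issues|discussions)/\d+",  # full GitHub URL
    r"\bCF-\d+\b",                              # CF-# ticket reference
]

def has_linked_issue(pr_body: str) -> bool:
    """Return True if the PR body contains any accepted reference form."""
    return any(re.search(p, pr_body) for p in LINK_PATTERNS)
```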
Covers the two audiences the issue calls out:
1. People contributing changes back to Codeflash - development
environment setup with uv, the single-command verification via
uv run prek, test runner invocation, code-style pointers to
.claude/rules/code-style.md, and the branch / commit / PR
conventions from .claude/rules/git.md and CLAUDE.md.
2. People using Codeflash in editable mode from a source checkout
to optimize their own projects, including the install commands
for uv and pip, when editable mode is appropriate vs installing
the PyPI package, and a pointer to the README Quick Start for
the codeflash init flow.
Keep the combined JFR + tracing agent single JVM invocation from main while
preserving the fix's intent: raise when trace-db was not created, warn when
exit code is non-zero but trace-db exists. Integration tests rewritten to
match the combined-invocation semantics.
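The fix's intent described above can be sketched as follows (function and argument names are illustrative, not the actual helper):

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

def check_trace_result(exit_code: int, trace_db: Path) -> None:
    """A missing trace database is fatal; a non-zero exit with an existing
    trace database is only a warning, since the combined JVM run may still
    have produced usable data."""
    if not trace_db.exists():
        raise RuntimeError(f"trace database was not created at {trace_db}")
    if exit_code != 0:
        logger.warning(
            "JVM exited with code %d but trace database exists; continuing",
            exit_code,
        )
```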
Gradle evaluates all project configurations during the configuration
phase, even when only one module is targeted. Multi-module projects with
diverse toolchain requirements (e.g., OpenRewrite's rewrite-gradle needs
JDK 8) fail when an unrelated module's toolchain isn't available.
Adds --configure-on-demand to all 8 Gradle command construction sites
so Gradle only configures projects needed for the requested task.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_run_java_with_graceful_timeout() discarded the subprocess exit code in
both the no-timeout and timeout paths. If Maven/Gradle failed (compilation
error, OOM, etc.), the tracer silently continued with missing/stale data.
Now returns the exit code. Stage 1 (JFR profiling) warns on failure but
continues. Stage 2 (argument capture) raises RuntimeError since trace
data is essential for replay test generation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keep only repeatString, which reliably produces a 284% improvement.
Drop computeSum (marginal 16%), filterEvens and instanceMethod (no
optimization found). Reduces tracer E2E from ~1h27m to ~21m.
Combine JFR profiling and argument capture agent into one
JAVA_TOOL_OPTIONS string, running the target program once instead of
twice. JFR and javaagent are orthogonal JVM features that coexist
without conflict. Keeps build_jfr_env/build_agent_env for standalone
use.
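The combined environment can be sketched like this (helper name is illustrative; the JFR and -javaagent flag spellings are standard JVM options):

```python
def build_combined_env(jfr_file: str, agent_jar: str, base_env: dict[str, str]) -> dict[str, str]:
    """Merge JFR recording flags and the tracing agent into a single
    JAVA_TOOL_OPTIONS value so the target program runs once."""
    jfr_opts = f"-XX:StartFlightRecording=settings=profile,filename={jfr_file}"
    agent_opts = f"-javaagent:{agent_jar}"
    env = dict(base_env)
    env["JAVA_TOOL_OPTIONS"] = f"{jfr_opts} {agent_opts}"
    return env
```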
Make TOTAL_LOOPING_TIME configurable via CODEFLASH_LOOPING_TIME env var
(defaults to 10s). Set to 5s in Java E2E CI jobs to cut verification
time per candidate. Also cache the codeflash-runtime JAR keyed on
source hash to skip mvn install when unchanged.
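The configurable knob can be read like this (the env var name and 10s default come from the commit; parsing details are assumed):

```python
import os

def total_looping_time() -> float:
    """TOTAL_LOOPING_TIME in seconds, overridable via CODEFLASH_LOOPING_TIME."""
    return float(os.environ.get("CODEFLASH_LOOPING_TIME", "10"))
```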
Add .resolve() to TemporaryDirectory path to expand Windows 8.3 short
paths (e.g. RUNNER~1) to canonical long form, fixing test_pickle_patcher
failures on Windows CI. Also add missing return type annotations and
noqa suppressions for benchmark test file.
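The fix amounts to resolving the temp path before comparison; a sketch (helper name is illustrative):

```python
import tempfile
from pathlib import Path

def canonical_tmpdir() -> tempfile.TemporaryDirectory:
    """Create a TemporaryDirectory and attach its canonical path.
    Path.resolve() expands Windows 8.3 short names like RUNNER~1 to
    their long form, so downstream path comparisons match."""
    tmp = tempfile.TemporaryDirectory()
    tmp.resolved_path = Path(tmp.name).resolve()  # canonical long-form path
    return tmp
```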
Move codeflash's own benchmarks to .codeflash/benchmarks/. Add
auto-discovery of .codeflash/benchmarks/ in codeflash compare and
benchmark mode -- when benchmarks-root is not explicitly configured,
the CLI checks for .codeflash/benchmarks/ before erroring.
Backwards compatible: users with existing benchmarks-root config
are unaffected. Docs continue to show tests/benchmarks as the
example path.
JavaSupport.ensure_runtime_environment() was never called during the
optimization flow, so _language_version stayed None and the backend
received language_version=null. The LLM had no Java version constraint,
causing it to generate Java 16+ APIs (e.g. Stream.toList()) for Java 11
projects.
Set PYTHONHASHSEED=0 in test subprocess environments so original and
candidate runs use identical hash behavior, eliminating a source of
non-deterministic return-value comparisons.
Also upgrade diff logging from debug to info level with actual types
and repr values for DID_PASS, RETURN_VALUE, and STDOUT diffs.
The jdk.ExecutionSample#period=1ms syntax in -XX:StartFlightRecording
is only supported on JDK 13+. On JDK 11 (CI), it causes
"Failure when starting JFR on_create_vm_2" and no JFR file is created.
The settings=profile preset still provides 10ms CPU sampling.
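A version-gated flag builder is one way to express the constraint; a sketch (helper name is an assumption, the option syntax is from the commit):

```python
def jfr_recording_flag(jdk_major: int, jfr_file: str) -> str:
    """Per-event settings like jdk.ExecutionSample#period need JDK 13+;
    on older JDKs fall back to the settings=profile preset, which still
    provides 10ms CPU sampling."""
    base = f"-XX:StartFlightRecording=settings=profile,filename={jfr_file}"
    if jdk_major >= 13:
        return base + ",jdk.ExecutionSample#period=1ms"
    return base
```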
Drop repeatString from the Workload fixture (2→1 function).
computeSum alone exercises the full tracer→optimizer pipeline
(trace → replay tests → optimize → evaluate → rank → explain → review).
The second function added no additional pipeline coverage.
Remove filterEvens and instanceMethod from the Workload fixture (4→2
functions) and reduce main() loop from 1000→100 rounds. The E2E test
only needs to verify the tracer→optimizer pipeline works end-to-end;
it doesn't need 4 functions or 1604 replay tests to prove that.
Expected impact: ~2 functions × ~8 candidates × fewer replay tests
should bring the job from ~75 min down to ~10-15 min.
The flat 90s timeout was too aggressive for LLM-powered endpoints
(/testgen, /optimize, /refinement) under load, causing ReadTimeoutError
and failing the async-optimization E2E test. Split into (10s connect,
300s read) tuple so connections fail fast but LLM inference gets adequate time.
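A minimal sketch of the split timeout, assuming a requests/urllib3-style client that accepts a (connect, read) tuple:

```python
def split_timeout(connect: float = 10.0, read: float = 300.0) -> tuple[float, float]:
    """Replace the flat 90s timeout with a (connect, read) pair: TCP
    connects fail fast while LLM inference gets a generous read window."""
    return (connect, read)
```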
Move the 4 most common return-value types (str, list/tuple, dict) to
`orig_type is T` identity checks at the top of the dispatch chain,
before the frozenset lookup. A single pointer comparison is cheaper
than a frozenset hash, and these types need special handling anyway
(temp-path normalization, recursive comparison, superset support).
Before: dict traversed ~8 isinstance checks before being handled.
After: dict is handled at check #3 via `orig_type is dict`.
The isinstance fallbacks remain as slow-paths for subclasses (deque,
ChainMap, defaultdict, scipy dok_matrix, etc.).
Backported from codeflash-python dispatch ordering.
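The ordering described above can be sketched as follows (a simplified illustration; the real comparator's special-casing for temp paths, supersets, and scipy types is omitted):

```python
def compare_values(orig, candidate) -> bool:
    """Exact-type identity checks first (one pointer comparison each),
    isinstance fallbacks afterwards for subclasses like defaultdict."""
    orig_type = type(orig)
    # Fast path: identity checks for the most common return-value types.
    if orig_type is str:
        return orig == candidate
    if orig_type in (list, tuple):
        return (type(candidate) is orig_type
                and len(orig) == len(candidate)
                and all(compare_values(a, b) for a, b in zip(orig, candidate)))
    if orig_type is dict:
        return (type(candidate) is dict
                and orig.keys() == candidate.keys()
                and all(compare_values(v, candidate[k]) for k, v in orig.items()))
    # Slow path: isinstance fallbacks catch subclasses (deque, ChainMap,
    # defaultdict, etc.) that miss the identity checks above.
    if isinstance(orig, dict):
        return dict(orig) == dict(candidate)
    return orig == candidate
```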
- test_benchmark_libcst_multi_file: discover_functions + get_code_optimization_context across 10 real source files
- test_benchmark_libcst_pipeline: full discover → extract → replace → merge pipeline on one file
Measures median wall-clock time for --version, --help, auth status,
and compare --help across 30 runs with 3 warmups.
Usage:
codeflash compare main codeflash/optimize \
--script "python benchmarks/bench_cli_startup.py" \
--script-output benchmarks/results.json
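The measurement loop inside such a script might look like this sketch (the 30-run / 3-warmup parameters mirror the description; the real script's internals are assumed):

```python
import statistics
import subprocess
import time

def median_wall_clock(cmd: list[str], runs: int = 30, warmups: int = 3) -> float:
    """Median wall-clock time of a command: warmup runs are discarded,
    then each timed run spawns a fresh process so startup cost is included."""
    for _ in range(warmups):
        subprocess.run(cmd, capture_output=True)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, capture_output=True)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```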