codeflash-internal/experiments/rl_env/validation_report_final.md
2026-04-16 16:31:25 -07:00

Codeflash RL Environment — Batch Validation Report

Summary

| Metric | Count | % |
|---|---|---|
| Total tasks | 166 | 100% |
| Solve passes | 0 | 0% |
| Eval correct (all behavioral tests pass) | 153 | 92% |
| Faster than original (speedup > 1.0) | 122 | 73% |
| All test cases pass | 157 | 94% |
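
The percentages can be rechecked from the counts. The following is a minimal sketch, not the report generator's code: `counts` and `pct` are illustrative names, and truncating to a whole percent is an assumption, chosen because it reproduces the 94% shown for 157/166 (about 94.6%, which rounding would turn into 95%).

```python
# Counts copied from the Summary table.
TOTAL_TASKS = 166
counts = {
    "Solve passes": 0,
    "Eval correct": 153,
    "Faster than original": 122,
    "All test cases pass": 157,
}


def pct(count: int, total: int = TOTAL_TASKS) -> int:
    """Share of all tasks, truncated (not rounded) to a whole percent."""
    return count * 100 // total


for name, count in counts.items():
    print(f"{name}: {count} ({pct(count)}%)")
```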

Speedup Distribution (correct tasks only)

  • Slower (< 1x): 13 tasks
  • 1-1.5x: 85 tasks (including the 18 tasks measured at exactly 1.0000x)
  • 1.5-2x: 18 tasks
  • 2-5x: 14 tasks
  • 5-100x: 17 tasks
  • >100x: 6 tasks
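
The bins above behave as half-open intervals: the per-task table lists 18 tasks at exactly 1.0000x, and the 1-1.5x count of 85 only adds up if they are included there. A minimal sketch of that bucketing follows. `BINS` and `bucket` are illustrative names, and the handling of values that land exactly on the other edges (for example 100x) is an assumption, since no task in the table sits on them.

```python
from collections import Counter

# (label, inclusive lower edge, exclusive upper edge); labels copied from the list above.
BINS = [
    ("Slower (< 1x)", 0.0, 1.0),
    ("1-1.5x", 1.0, 1.5),
    ("1.5-2x", 1.5, 2.0),
    ("2-5x", 2.0, 5.0),
    ("5-100x", 5.0, 100.0),
    (">100x", 100.0, float("inf")),
]


def bucket(speedup: float) -> str:
    """Return the label of the bin that contains `speedup`."""
    for label, low, high in BINS:
        if low <= speedup < high:
            return label
    raise ValueError(f"speedup out of range: {speedup}")


# A few speedups taken from the Successful Tasks table.
sample = [74810.9605, 27.0112, 4.7753, 1.9788, 1.4808, 1.0000, 0.9999]
histogram = Counter(bucket(s) for s in sample)
for label, _, _ in BINS:
    print(f"{label}: {histogram[label]} tasks")
```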

Successful Tasks (correct=1.0)

Columns, in order: Task, Function, Speedup, Tests (passed/total), Coverage, Quality (absent on some rows), DB Speedup.
models-prepare_multi_label_classification_response prepare_multi_label_classification_response 74810.9605x 32/32 7.7% low 68465.46x
introspection-prepare_operators_descriptions prepare_operators_descriptions 15514.0489x 1057/1057 35.0% low 14150.75x
decorators-withfixedsizecache-memory_pressure_detected memory_pressure_detected 1747.8138x 132/132 37.8% 917.09x
depth_anything_v3-inferencemodelsdepthanythingv3adapter-predict predict 397.7983x 36/36 23.2% 391.52x
detection_event_log-detectioneventlogblockv1-_evict_oldest_video _evict_oldest_video 343.5465x 170/170 46.4% 15.92x
camera-_generate_grid_colors _generate_grid_colors 269.6811x 1901/1901 9.0% 218.30x
workflow_caller-_check_workflow_for_circular_references _check_workflow_for_circular_references 27.0112x 41/41 31.1% low 11.84x
semantic_segmentation-blockmanifest-describe_outputs describe_outputs 23.9434x 2539/2539 38.9% high 21.92x
dynamic_blocks-build_traceback_string build_traceback_string 20.3358x 2047/2047 16.0% low 13.38x
sort-sortmanifest-describe_outputs describe_outputs 19.1846x 8036/8036 89.7% low 17.16x
bytetrack-bytetrackmanifest-describe_outputs describe_outputs 19.0393x 3033/3033 90.2% low 17.56x
event_writer-_extract_detail _extract_detail 16.4218x 43/43 31.9% 13.46x
workflow_caller-_describe_outputs_from_spec _describe_outputs_from_spec 16.0714x 25/25 23.1% low 12.94x
managers-try_releasing_cuda_memory try_releasing_cuda_memory 15.0925x 1006/1006 10.8% 1.22x
cache-_slugify_model_id _slugify_model_id 13.9914x 1050/1050 26.1% medium 11.21x
s3-deduct_csv_header deduct_csv_header 13.1148x 54/54 38.6% high 8.90x
dynamic_blocks-create_dynamic_module create_dynamic_module 10.8808x 142/142 27.4% 12.31x
dataset_upload-roboflowdatasetuploadblockv2-run run 10.0212x 13/13 57.1% 9.14x
glm_ocr-blockmanifest-describe_outputs describe_outputs 8.9607x 1035/1035 51.9% 9.33x
qwen3_5vl-blockmanifest-describe_outputs describe_outputs 7.6146x 3228/3228 48.6% medium 7.30x
http-with_route_exceptions with_route_exceptions 6.7561x 1297/1297 8.1% 6.89x
qwen3_5vl-qwen35vlblockv1-run_remotely run_remotely 6.3813x 23/23 69.4% 5.64x
introspection-prepare_operations_descriptions prepare_operations_descriptions 6.2270x 147/147 82.5% high 6.26x
core_steps-load_kinds load_kinds 4.7753x 1153/1153 42.0% high 3.68x
depth_anything_v2-inferencemodelsdepthanythingv2adapter-predict predict 4.0460x 38/38 60.2% 5.03x
qwen3_5vl-inferencemodelsqwen35vladapter-predict predict 3.2474x 2275/2275 70.3% low 3.72x
core-_prepare_workflow_response_cache_key _prepare_workflow_response_cache_key 3.0359x 7539/7539 2.7% medium 2.39x
compiler-establish_step_execution_dimensionality establish_step_execution_dimensionality 2.6841x 47/47 23.2% 2.37x
semantic_segmentation-roboflowsemanticsegmentationmodelblockv1-_convert_to_sv_de _convert_to_sv_detections 2.6825x 13/13 71.7% 2.22x
managers-modelmanager-_dispose_model_lock _dispose_model_lock 2.5469x 2784/2784 14.7% 3.24x
text_display-clamp_box clamp_box 2.5308x 1210/1210 15.0% high 2.80x
event_writer-_detections_to_v2_instance_segmentations _detections_to_v2_instance_segmentations 2.3078x 36/36 41.2% 2.18x
models-baseinference-infer infer 2.2736x 1037/1037 2.8% low 2.32x
qwen3vl-inferencemodelsqwen3vladapter-map_inference_kwargs map_inference_kwargs 2.1968x 1125/1125 26.8% medium 2.39x
clip_comparison-blockmanifest-get_required_cache_artifacts get_required_cache_artifacts 2.1802x 130/130 26.6% low 2.04x
introspection-_get_property_name_options _get_property_name_options 2.0733x 1053/1053 57.5% 1.52x
compiler-verify_compatibility_of_input_data_lineage_with_control_flow_lineage verify_compatibility_of_input_data_lineage_with_control_flow_lineage 2.0635x 39/39 26.4% 2.11x
execution_data_manager-executiondatamanager-_register_control_flow_output_for_no _register_control_flow_output_for_non_simd_step 1.9788x 32/32 20.2% 2.65x
core-_forcetracerootsampler-get_description get_description 1.9079x 3244/3244 1.6% 2.03x
enterprise_blocks-load_enterprise_blocks load_enterprise_blocks 1.8940x 1936/1936 32.2% medium 1.45x
entities-workflowimagedata-copy_and_replace copy_and_replace 1.8862x 2336/2336 72.1% 2.04x
compiler-_collect_unique_control_flow_lineages_with_step_mapping _collect_unique_control_flow_lineages_with_step_mapping 1.8585x 33/33 24.3% 1.95x
mask_area_measurement-maskareameasurementblockv1-run run 1.8337x 39/39 93.0% 1.65x
compiler-separate_control_flow_predecessors_from_data_providers separate_control_flow_predecessors_from_data_providers 1.8200x 34/34 23.1% high 1.87x
event_writer-_build_event_data _build_event_data 1.8120x 4732/4732 34.7% medium 1.74x
compiler-step_definition_allows_control_flow_references step_definition_allows_control_flow_references 1.7192x 27/27 22.5% medium 1.86x
introspection-retrieve_selectors_from_union_definition retrieve_selectors_from_union_definition 1.6618x 36/36 22.2% high 1.98x
dataset_upload-maybe_register_datapoint_at_roboflow maybe_register_datapoint_at_roboflow 1.6392x 1039/1039 55.6% low 1.47x
cache-is_block_cached is_block_cached 1.6255x 53/53 27.9% low 1.36x
introspection-_ref_to_def_name _ref_to_def_name 1.6030x 1344/1344 27.5% high 1.51x
mask_area_measurement-compute_detection_areas compute_detection_areas 1.5829x 24/24 83.0% 1.46x
managers-list_files list_files 1.5797x 99/99 8.9% 1.66x
dynamic_blocks-assembly_custom_python_block assembly_custom_python_block 1.5787x 135/135 36.7% low 1.61x
cache-get_cached_foundation_models get_cached_foundation_models 1.5691x 32/32 34.7% low 1.46x
compiler-is_control_flow_step is_control_flow_step 1.5035x 1830/1830 15.3% medium 1.34x
execution_data_manager-construct_mask_for_all_inputs_dimensionalities construct_mask_for_all_inputs_dimensionalities 1.4808x 31/31 19.0% low 1.51x
common-deserialize_image_kind deserialize_image_kind 1.4732x 1506/1506 7.4% 1.42x
usage_tracking-usagecollector-_compute_execution_duration _compute_execution_duration 1.4709x 2017/2017 27.5% medium 1.55x
core-_url_for_safe_logging _url_for_safe_logging 1.4616x 1055/1055 2.8% 1.47x
dataset_upload-is_prediction_registration_forbidden is_prediction_registration_forbidden 1.4475x 2043/2043 31.7% 1.44x
qwen3_5vl-qwen35vlblockv1-run run 1.4215x 28/28 93.1% low 1.69x
execution_data_manager-construct_simd_step_input construct_simd_step_input 1.4138x 26/26 28.3% low 1.37x
cache-get_task_type_to_block_mapping get_task_type_to_block_mapping 1.4136x 30/30 29.6% low 1.39x
anthropic_claude-blockmanifest-get_air_gapped_availability get_air_gapped_availability 1.3907x 2243/2243 16.4% low 1.45x
qwen3_5vl-inferencemodelsqwen35vladapter-map_inference_kwargs map_inference_kwargs 1.3866x 1549/1549 64.9% low 1.53x
email_notification-format_email_message format_email_message 1.3730x 56/56 31.7% high 1.35x
dataset_upload-register_datapoint_at_roboflow register_datapoint_at_roboflow 1.3720x 2037/2037 38.6% low 1.32x
common-add_inference_keypoints_to_sv_detections add_inference_keypoints_to_sv_detections 1.3657x 30/30 4.1% 1.56x
core-get_workflow_specification get_workflow_specification 1.3586x 1157/1157 3.6% low 1.56x
sequences-sequence_apply sequence_apply 1.3540x 58/58 30.2% medium 1.48x
managers-modelmanager-infer_from_request_sync infer_from_request_sync 1.3382x 3041/3041 13.7% low 1.46x
entities-batch-remove_by_indices remove_by_indices 1.3240x 44/44 65.4% high 1.26x
cache-_is_model_cached _is_model_cached 1.3055x 45/45 27.0% 1.24x
workflow_caller-_extract_workflow_caller_ids_from_spec _extract_workflow_caller_ids_from_spec 1.3007x 44/44 25.8% 1.34x
openai-execute_gpt_4v_request execute_gpt_4v_request 1.3006x 37/37 25.8% medium 2.00x
cache-is_model_cached is_model_cached 1.2989x 55/55 28.7% high 1.22x
core-load_cached_workflow_response load_cached_workflow_response 1.2951x 12126/12126 2.8% low 1.38x
execution_data_manager-filter_to_valid_prefix_chains filter_to_valid_prefix_chains 1.2932x 32/32 15.3% 1.32x
execution_data_manager-intersect_masks_per_dimension intersect_masks_per_dimension 1.2908x 40/40 13.5% high 1.66x
webrtc_worker-videoframeprocessor-serialize_outputs_sync serialize_outputs_sync 1.2882x 48/48 17.9% low 1.37x
webrtc_worker-videoframeprocessor-_check_termination _check_termination 1.2650x 2029/2029 16.1% 1.36x
workflow_caller-_fetch_workflow_spec_for_validation _fetch_workflow_spec_for_validation 1.2390x 1547/1547 23.1% 1.33x
dataset_upload-roboflowdatasetuploadblockv1-run run 1.2328x 41/41 38.6% 1.26x
executor-_run_workflow _run_workflow 1.2137x 130/130 21.6% low 1.22x
managers-rank_for_deletion rank_for_deletion 1.2067x 106/106 7.3% 1.88x
detection_event_log-detectioneventlogblockv1-_get_relative_time _get_relative_time 1.2017x 41/41 43.0% 1.19x
http-_build_step_execution_error_response _build_step_execution_error_response 1.1955x 1029/1029 1.0% low 1.19x
common-serialise_sv_detections serialise_sv_detections 1.1825x 149/149 5.1% 1.19x
models-inferencemodelsobjectdetectionadapter-postprocess postprocess 1.1819x 33/33 8.7% medium 1.23x
text_display-draw_background_with_alpha draw_background_with_alpha 1.1776x 176/176 29.5% high 1.18x
core-record_inference record_inference 1.1693x 3033/3033 1.6% low 1.22x
webrtc_worker-default_encoder default_encoder 1.1565x 4071/4071 17.0% medium 1.12x
easy_ocr-blockmanifest-get_supported_model_variants get_supported_model_variants 1.1535x 2039/2039 57.5% low 1.31x
execution_data_manager-get_masks_intersection_for_dimensions get_masks_intersection_for_dimensions 1.1505x 36/36 16.9% low 1.23x
mask_area_measurement-get_detection_area get_detection_area 1.1443x 129/129 83.7% 1.19x
email_notification-apply_operations_to_message_parameters apply_operations_to_message_parameters 1.1389x 44/44 29.5% low 1.15x
dataset_upload-register_datapoint register_datapoint 1.1321x 1138/1138 42.5% low 1.16x
compiler-get_lineage_derived_from_control_flow get_lineage_derived_from_control_flow 1.1298x 33/33 23.8% low 1.25x
yolo_world-blockmanifest-get_supported_model_variants get_supported_model_variants 1.1252x 2232/2232 50.0% medium 1.29x
trackers-instancecache-record_instance record_instance 1.1226x 14857/14857 17.3% medium 1.14x
event_writer-_build_image_entry _build_image_entry 1.1222x 1337/1337 60.6% low 1.10x
workflow_caller-_convert_output_descriptions_to_kinds _convert_output_descriptions_to_kinds 1.1212x 37/37 24.6% medium 1.19x
workflow_caller-workflowcallerblockv1-run run 1.1159x 59/59 48.9% 1.13x
notification-blockmanifest-get_air_gapped_availability get_air_gapped_availability 1.1135x 1535/1535 43.0% low 1.14x
moondream2-inferencemodelsmoondream2adapter-caption caption 1.1118x 185/185 45.1% high 1.11x
cache-_get_block_type_identifier _get_block_type_identifier 1.1082x 34/34 26.5% 1.11x
models-inferencemodelsobjectdetectionadapter-preprocess preprocess 1.1014x 31/31 7.5% low 1.12x
workflow_caller-_deserialize_output_value _deserialize_output_value 1.0981x 139/139 27.4% medium 1.11x
workflow_caller-_resolve_output_kinds_for_run _resolve_output_kinds_for_run 1.0975x 1047/1047 26.2% 1.12x
lmm-blockmanifest-get_air_gapped_availability get_air_gapped_availability 1.0895x 4625/4625 46.1% low 1.12x
workflow_caller-build_workflow_url build_workflow_url 1.0892x 6137/6137 22.2% low 1.29x
custom_metadata-blockmanifest-get_air_gapped_availability get_air_gapped_availability 1.0780x 3724/3724 49.4% low 1.12x
models-inferencemodelskeypointsdetectionadapter-map_inference_kwargs map_inference_kwargs 1.0669x 2232/2232 7.4% low 1.13x
heatmap-heatmapvisualizationblockv1-getannotator getAnnotator 1.0615x 38/38 44.2% low 1.41x
detection_event_log-detectioneventlogblockv1-run run 1.0380x 4426/4426 97.3% low 3.40x
glm_ocr-inferencemodelsglmocradapter-postprocess postprocess 1.0374x 1788/1788 54.5% low 1.19x
sms-blockmanifest-get_air_gapped_availability get_air_gapped_availability 1.0320x 2743/2743 15.4% low 1.14x
handlers-handle_describe_workflows_blocks_request handle_describe_workflows_blocks_request 1.0280x 153/153 42.0% low 2.50x
dataset_upload-blockmanifest-get_air_gapped_availability get_air_gapped_availability 1.0160x 3532/3532 29.3% low 1.10x
core-get_workflow_cache_file get_workflow_cache_file 1.0093x 1551/1551 2.8% low 1.17x
dataset_upload-execute_registration execute_registration 1.0064x 1005/1005 38.1% low 1.17x
clip_comparison-blockmanifest-get_supported_model_variants get_supported_model_variants 1.0027x 3031/3031 25.9% low 1.28x
builder-get_cached_models get_cached_models 1.0000x 21/21 51.4% low 1.80x
cache-measure_memory_for_eviction measure_memory_for_eviction 1.0000x 2/2 N/A low 19.13x
compiler-establish_control_flow_edge establish_control_flow_edge 1.0000x 26/26 24.4% low 1.32x
compiler-find_longest_lineage_support find_longest_lineage_support 1.0000x 51/51 23.1% low 1.26x
core-wrap_roboflow_api_errors wrap_roboflow_api_errors 1.0000x 438/438 3.7% low 1.28x
core_steps-load_blocks load_blocks 1.0000x 2/2 N/A low 5.09x
dataset_upload-_expand_metadata_to_records _expand_metadata_to_records 1.0000x 2/2 50.6% low 1.62x
dataset_upload-_transpose_metadata_batches _transpose_metadata_batches 1.0000x 2/2 50.6% low 1.36x
dynamic_blocks-_create_clean_traceback _create_clean_traceback 1.0000x 2/2 13.4% low 3.22x
execution_data_manager-_transpose_dict_of_batches_if_needed _transpose_dict_of_batches_if_needed 1.0000x 2/2 12.7% low 1.61x
halo-halovisualizationblockv1-getannotator getAnnotator 1.0000x 2/2 34.2% low 11.57x
http-with_route_exceptions_async with_route_exceptions_async 1.0000x 1/1 0.8% 6.20x
managers-customcollector-_fetch_stream_metrics _fetch_stream_metrics 1.0000x 41/41 7.2% low 1.19x
managers-experimentalmodelmanager-is_loaded is_loaded 1.0000x 2/2 0.6% low 2.31x
mask_area_measurement-areameasurementblockv1-run run 1.0000x 2/2 N/A low 1.25x
models-semanticsegmentationbaseonnxroboflowinferencemodel-make_response make_response 1.0000x 5/5 11.5% low 1.33x
overlap-overlapmanifest-describe_outputs describe_outputs 1.0000x 2/2 N/A low 6.36x
qwen3vl-_is_flash_attn_usable _is_flash_attn_usable 1.0000x 2/2 17.4% low 141.41x
decorators-withfixedsizecache-add_model add_model 0.9999x 1224/1224 57.8% low 1.17x
object_detection-blockmanifest-get_compatible_task_types get_compatible_task_types 0.9982x 3874/3874 27.1% low 1.12x
instance_segmentation-blockmanifest-get_compatible_task_types get_compatible_task_types 0.9954x 3836/3836 27.6% low 1.15x
semantic_segmentation-blockmanifest-get_compatible_task_types get_compatible_task_types 0.9907x 4124/4124 38.9% low 1.11x
segment_anything3-blockmanifest-get_supported_model_variants get_supported_model_variants 0.9862x 5631/5631 9.5% low 1.13x
keypoint_detection-blockmanifest-get_compatible_task_types get_compatible_task_types 0.9836x 2631/2631 26.7% low 1.11x
gaze-blockmanifest-get_supported_model_variants get_supported_model_variants 0.9797x 2229/2229 33.1% low 1.21x
yolo26-yolo26instancesegmentation-predict predict 0.9786x 33/33 32.9% low 1.26x
operations-build_sequence_apply_operation build_sequence_apply_operation 0.9720x 25/25 30.3% low 1.35x
common-deserialize_detections_kind deserialize_detections_kind 0.9719x 8/8 5.5% low 1.15x
stream-inferencepipeline-init_with_workflow init_with_workflow 0.9698x 53/53 30.5% low 1.10x
moondream2-blockmanifest-get_supported_model_variants get_supported_model_variants 0.9669x 1238/1238 50.6% low 1.15x
multi_class_classification-blockmanifest-get_compatible_task_types get_compatible_task_types 0.9653x 2622/2622 26.0% low 1.14x
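
Rows in the table above are whitespace-separated, and the Quality cell is absent on some of them, so a row has either six or seven fields. A minimal parsing sketch under that assumption follows. `Row` and `parse_row` are illustrative names, not part of the report tooling, and a missing Quality or an `N/A` Coverage is kept as `None` rather than interpreted.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    task: str
    function: str
    speedup: float
    tests_passed: int
    tests_total: int
    coverage: Optional[float]  # percent; None when the cell reads "N/A"
    quality: Optional[str]     # None when the cell is absent
    db_speedup: float


def parse_row(line: str) -> Row:
    """Parse one whitespace-separated row of the Successful Tasks table."""
    tokens = line.split()
    if len(tokens) == 7:
        task, function, speedup, tests, coverage, quality, db = tokens
    elif len(tokens) == 6:
        task, function, speedup, tests, coverage, db = tokens
        quality = None
    else:
        raise ValueError(f"expected 6 or 7 fields, got {len(tokens)}")
    passed, total = tests.split("/")
    return Row(
        task=task,
        function=function,
        speedup=float(speedup.rstrip("x")),
        tests_passed=int(passed),
        tests_total=int(total),
        coverage=None if coverage == "N/A" else float(coverage.rstrip("%")),
        quality=quality,
        db_speedup=float(db.rstrip("x")),
    )


with_quality = parse_row(
    "cache-measure_memory_for_eviction measure_memory_for_eviction 1.0000x 2/2 N/A low 19.13x"
)
without_quality = parse_row(
    "event_writer-_extract_detail _extract_detail 16.4218x 43/43 31.9% 13.46x"
)
print(with_quality)
print(without_quality)
```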

Failed Tasks (13)

byte_tracker-bytetrackmanifest-describe_outputs

  • Function: describe_outputs

  • File: inference/core/workflows/core_steps/transformations/byte_tracker/v1.py

  • Commit: HEAD

  • Method: db_code_only

  • DB Speedup: 16.66x

  • Solve OK: False

  • Duration: 17.4s

  • Reward: correct=0.0, speedup=0.0, tests=6035/6036

Key errors
  PydanticDeprecatedSince20: `allow_reuse` is deprecated and will be ignored; it should no longer be necessary. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: `min_items` is deprecated and will be removed, use `min_length` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
FAILED tests/codeflash_generated/test_describe_outputs__behaviorinstrumented_0.py::test_describe_outputs_basic_structure_and_contents[ 1 ]
FAILED tests/codeflash_generated/test_describe_outputs__behaviorinstrumented_1.py::test_describe_outputs_already_seen_instances_kind[ 1 ]
INFO:   INCORRECT: 6035/6036 passed, 1 diffs

Reproduce: bash docker_e2e_test.sh byte_tracker-bytetrackmanifest-describe_outputs --debug
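
The `PydanticDeprecatedSince20` lines under Key errors are deprecation warnings, while the failing tests are named on the `FAILED` lines. A small sketch of separating the two when reading these logs follows. `split_key_errors` and its prefix list are hypothetical, chosen only from the line shapes that appear in this report.

```python
from typing import Dict, Iterable, List

# Prefixes of lines that name a failure (an assumption based on this report's logs).
FAILURE_PREFIXES = ("FAILED ", "ERROR ", "E   ", "ImportError")


def split_key_errors(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group log lines into failures, deprecation warnings, and everything else."""
    groups: Dict[str, List[str]] = {"failures": [], "warnings": [], "other": []}
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if "PydanticDeprecatedSince20" in text:
            groups["warnings"].append(text)
        elif text.startswith(FAILURE_PREFIXES):
            groups["failures"].append(text)
        else:
            groups["other"].append(text)
    return groups


# Three lines copied from the byte_tracker entry above.
sample_log = [
    "  PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. "
    "Deprecated in Pydantic V2.0 to be removed in V3.0. "
    "See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/",
    "FAILED tests/codeflash_generated/test_describe_outputs__behaviorinstrumented_0.py"
    "::test_describe_outputs_basic_structure_and_contents[ 1 ]",
    "INFO:   INCORRECT: 6035/6036 passed, 1 diffs",
]
groups = split_key_errors(sample_log)
for failure in groups["failures"]:
    print(failure)
```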


clip-inferencemodelsclipadapter-compare

  • Function: compare

  • File: inference/models/clip/clip_inference_models.py

  • Commit: 7648e452a70ff1aad09f017a0eb2ea4022b7e177

  • Method: db_code_match

  • DB Speedup: 3.37x

  • Solve OK: False

  • Duration: 23.7s

  • Reward: correct=0.0, speedup=0.0, tests=135/136

Key errors
  PydanticDeprecatedSince20: `allow_reuse` is deprecated and will be ignored; it should no longer be necessary. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: `min_items` is deprecated and will be removed, use `min_length` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: Using extra keyword arguments on `Field` is deprecated and will be removed. Use `json_schema_extra` instead. (Extra keys: 'optional'). Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
FAILED tests/codeflash_generated/test_compare__behaviorinstrumented_0.py::TestInferenceModelsClipAdapterCompare::test_compare_empty_prompt_list[ 1 ]
INFO:   INCORRECT: 135/136 passed, 1 diffs

Reproduce: bash docker_e2e_test.sh clip-inferencemodelsclipadapter-compare --debug


compiler-establish_batch_oriented_step_lineage

  • Function: establish_batch_oriented_step_lineage

  • File: inference/core/workflows/execution_engine/v1/compiler/graph_constructor.py

  • Commit: 90243bdc6278ef7d17b6db09dc1eb5b0d155b4be

  • Method: db_code_match

  • DB Speedup: 1.54x

  • Solve OK: False

  • Duration: 14.0s

  • Reward: correct=0.0, speedup=0.0, tests=33/36

Key errors
  /workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1220: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  /workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1236: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
FAILED tests/codeflash_generated/test_establish_batch_oriented_step_lineage__behaviorinstrumented_0.py::test_multiple_control_flow_lineages_with_same_min_length_raises_assumption_error[ 1 ]
FAILED tests/codeflash_generated/test_establish_batch_oriented_step_lineage__behaviorinstrumented_1.py::test_empty_lineage_lists[ 1 ]
FAILED tests/codeflash_generated/test_establish_batch_oriented_step_lineage__behaviorinstrumented_1.py::test_missing_dimensionality_reference_property[ 1 ]
FAILED tests/codeflash_generated/test_establish_batch_oriented_step_lineage__behaviorinstrumented_1.py::test_non_batch_oriented_property_raises_error[ 1 ]
FAILED tests/codeflash_generated/test_establish_batch_oriented_step_lineage__behaviorinstrumented_1.py::test_multiple_control_flow_same_min_length_raises_error[ 1 ]
FAILED tests/codeflash_generated/test_establish_batch_oriented_step_lineage__behaviorinstrumented_1.py::test_compound_input_no_batch_oriented_raises_error[ 1 ]
INFO:   INCORRECT: 33/36 passed, 3 diffs

Reproduce: bash docker_e2e_test.sh compiler-establish_batch_oriented_step_lineage --debug


compiler-get_reference_lineage

  • Function: get_reference_lineage

  • File: inference/core/workflows/execution_engine/v1/compiler/graph_constructor.py

  • Commit: HEAD

  • Method: db_code_only

  • DB Speedup: 1.61x

  • Solve OK: False

  • Duration: 14.6s

  • Reward: correct=0.0, speedup=0.0, tests=20/24

Key errors
  /workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1294: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  /workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1310: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
FAILED tests/codeflash_generated/test_get_reference_lineage__behaviorinstrumented_1.py::TestGetReferenceLineageBasic::test_batch_oriented_property_in_simple_input[ 1 ]
FAILED tests/codeflash_generated/test_get_reference_lineage__behaviorinstrumented_1.py::TestGetReferenceLineageEdge::test_compound_input_with_batch_oriented_nested[ 1 ]
FAILED tests/codeflash_generated/test_get_reference_lineage__behaviorinstrumented_1.py::TestGetReferenceLineageEdge::test_compound_input_no_batch_oriented_raises_error[ 1 ]
FAILED tests/codeflash_generated/test_get_reference_lineage__behaviorinstrumented_1.py::TestGetReferenceLineageLargeScale::test_large_compound_input_many_nested[ 1 ]
FAILED tests/codeflash_generated/test_get_reference_lineage__behaviorinstrumented_1.py::TestGetReferenceLineageLargeScale::test_many_input_data_keys[ 1 ]
INFO:   INCORRECT: 20/24 passed, 4 diffs

Reproduce: bash docker_e2e_test.sh compiler-get_reference_lineage --debug


core_steps-_should_filter_block

  • Function: _should_filter_block

  • File: inference/core/workflows/core_steps/loader.py

  • Commit: HEAD

  • Method: db_code_only

  • DB Speedup: 4.93x

  • Solve OK: False

  • Duration: 27.4s

  • Reward: correct=0.0, speedup=0.0, tests=41/41

Key errors
_ ERROR collecting tests/codeflash_generated/test__should_filter_block__behaviorinstrumented_1.py _
ImportError while importing test module '/workspace/inference/tests/codeflash_generated/test__should_filter_block__behaviorinstrumented_1.py'.
E   ImportError: cannot import name 'WORKFLOW_SELECTIVE_BLOCKS_DISABLE' from 'inference.core.env' (/workspace/inference/inference/core/env.py)
  /usr/local/lib/python3.12/site-packages/pydantic/fields.py:1093: PydanticDeprecatedSince20: Using extra keyword arguments on `Field` is deprecated and will be removed. Use `json_schema_extra` instead. (Extra keys: 'optional'). Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
ERROR tests/codeflash_generated/test__should_filter_block__behaviorinstrumented_0.py
ERROR tests/codeflash_generated/test__should_filter_block__behaviorinstrumented_1.py
!!!!!!!!!!!!!!!!!!! Interrupted: 2 errors during collection !!!!!!!!!!!!!!!!!!!!
1 warning, 2 errors in 0.28s
INFO:   INCORRECT: 41/41 passed, 0 diffs

Reproduce: bash docker_e2e_test.sh core_steps-_should_filter_block --debug
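
In this entry pytest stopped during collection, so the log carries `ERROR` lines and an `Interrupted` banner instead of `FAILED` lines. A hedged sketch of detecting that condition in captured output follows. The `collection_errors` helper is hypothetical and matches only the banner format shown above.

```python
import re
from typing import Optional

# Matches pytest's "Interrupted: N error(s) during collection" banner.
_BANNER = re.compile(r"Interrupted: (\d+) errors? during collection")


def collection_errors(output: str) -> Optional[int]:
    """Return the number of collection errors, or None if the banner is absent."""
    match = _BANNER.search(output)
    return int(match.group(1)) if match else None


# Lines copied from the core_steps-_should_filter_block entry above.
captured = "\n".join(
    [
        "ERROR tests/codeflash_generated/test__should_filter_block__behaviorinstrumented_0.py",
        "ERROR tests/codeflash_generated/test__should_filter_block__behaviorinstrumented_1.py",
        "!!!!!!!!!!!!!!!!!!! Interrupted: 2 errors during collection !!!!!!!!!!!!!!!!!!!!",
        "1 warning, 2 errors in 0.28s",
    ]
)
print(collection_errors(captured))
```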


execution_data_manager-prepare_parameters

  • Function: prepare_parameters

  • File: inference/core/workflows/execution_engine/v1/executor/execution_data_manager/step_input_assembler.py

  • Commit: HEAD

  • Method: db_code_only

  • DB Speedup: 1.12x

  • Solve OK: False

  • Duration: 15.7s

  • Reward: correct=0.0, speedup=0.0, tests=1/1

Key errors
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_empty_runtime_parameters[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_step_execution_dimensionality_zero[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_large_dimensionality[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_special_step_names[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_unicode_step_names[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_many_input_parameters[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_large_batch_size[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_deeply_nested_compound_inputs[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_many_masks[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_many_auto_batch_casting_configs[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_iteration_performance[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_complex_data_structures[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_mixed_parameter_types[ 1 ]
INFO:   INCORRECT: 1/1 passed, 0 diffs

Reproduce: bash docker_e2e_test.sh execution_data_manager-prepare_parameters --debug


glm_ocr-glmocrblockv1-run_remotely

  • Function: run_remotely

  • File: inference/core/workflows/core_steps/models/foundation/glm_ocr/v1.py

  • Commit: HEAD

  • Method: db_code_only

  • DB Speedup: 3.24x

  • Solve OK: False

  • Duration: 20.8s

  • Reward: correct=0.0, speedup=0.0, tests=1/1

Key errors
  PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: `allow_reuse` is deprecated and will be ignored; it should no longer be necessary. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: `min_items` is deprecated and will be removed, use `min_length` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
INFO:   INCORRECT: 1/1 passed, 0 diffs

Reproduce: bash docker_e2e_test.sh glm_ocr-glmocrblockv1-run_remotely --debug


ocsort-ocsortblockv1-run

  • Function: run

  • File: inference/core/workflows/core_steps/trackers/ocsort/v1.py

  • Commit: HEAD

  • Method: db_code_only

  • DB Speedup: 1.60x

  • Solve OK: False

  • Duration: 16.5s

  • Reward: correct=0.0, speedup=0.0, tests=408/408

Key errors
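  PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/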
  PydanticDeprecatedSince20: `allow_reuse` is deprecated and will be ignored; it should no longer be necessary. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: `min_items` is deprecated and will be removed, use `min_length` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
ERROR tests/codeflash_generated/test_run__behaviorinstrumented_0.py
ERROR tests/codeflash_generated/test_run__behaviorinstrumented_1.py
!!!!!!!!!!!!!!!!!!! Interrupted: 2 errors during collection !!!!!!!!!!!!!!!!!!!!
25 warnings, 2 errors in 0.84s
INFO:   INCORRECT: 408/408 passed, 0 diffs

Reproduce: bash docker_e2e_test.sh ocsort-ocsortblockv1-run --debug


perception_encoder-inferencemodelsperceptionencoderadapter-preprocess

  • Function: preprocess

  • File: inference/models/perception_encoder/perception_encoder_inference_models.py

  • Commit: 7648e452a70ff1aad09f017a0eb2ea4022b7e177

  • Method: db_code_match

  • DB Speedup: 2.47x

  • Solve OK: False

  • Duration: 37.7s

  • Reward: correct=0.0, speedup=0.0, tests=2031/2235

Key errors
  PydanticDeprecatedSince20: Using extra keyword arguments on `Field` is deprecated and will be removed. Use `json_schema_extra` instead. (Extra keys: 'optional'). Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_returns_tuple_with_correct_types[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_calls_preproc_image[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_metadata_is_empty_dict[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_preserves_image_dimensions[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_with_kwargs[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_multiple_calls_independence[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_with_1000_rapid_calls[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_with_varying_channel_counts[ 1 ]
INFO:   INCORRECT: 2031/2235 passed, 204 diffs

Reproduce: bash docker_e2e_test.sh perception_encoder-inferencemodelsperceptionencoderadapter-preprocess --debug


qwen3vl-qwen3vlblockv1-run

  • Function: run

  • File: inference/core/workflows/core_steps/models/foundation/qwen3vl/v1.py

  • Commit: c20359386c628a08bde69f5f3f780cedd782c50c

  • Method: db_code_match

  • DB Speedup: 1.45x

  • Solve OK: False

  • Duration: 27.9s

  • Reward: correct=0.0, speedup=0.0, tests=41/42

Key errors
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_local_execution_mode[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_remote_execution_mode[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_none_prompt_and_system_prompt_local[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_none_prompt_and_system_prompt_remote[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_single_image[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_multiple_images[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestEdgeCases::test_run_with_invalid_execution_mode[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestEdgeCases::test_run_with_empty_prompt_string[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestEdgeCases::test_run_locally_with_none_api_key[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestLargeScale::test_run_with_different_model_versions[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestLargeScale::test_run_locally_with_repeated_calls[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestLargeScale::test_run_with_batch_type_handling[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_1.py::test_run_local_various_image_reference_types[ 1 ]
INFO:   INCORRECT: 41/42 passed, 1 diffs

Reproduce: bash docker_e2e_test.sh qwen3vl-qwen3vlblockv1-run --debug
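
The entry above lists 13 `FAILED` lines while its summary reports 41/42 passed with 1 diff; this report does not state how the two counts relate. When triaging, it can help to group the `FAILED` node IDs by test class before reading them one by one. The sketch below assumes the pytest-style `path::Class::test[ param ]` shape seen in these logs; the helper names are hypothetical.

```python
from collections import Counter


def parse_failed(line):
    """Split a 'FAILED <node id>' log line into path, class, and test name.

    Returns None for lines that are not FAILED entries.
    """
    line = line.strip()
    if not line.startswith("FAILED "):
        return None
    node_id = line[len("FAILED "):].strip()
    # Drop the trailing parametrization suffix such as "[ 1 ]".
    base = node_id.partition("[")[0].strip()
    parts = base.split("::")
    return {
        "path": parts[0],
        # Module-level tests have no class component.
        "cls": parts[1] if len(parts) == 3 else None,
        "test": parts[-1],
    }


def count_by_class(lines):
    """Count FAILED entries per test class ('(no class)' for module level)."""
    counts = Counter()
    for line in lines:
        parsed = parse_failed(line)
        if parsed is not None:
            counts[parsed["cls"] or "(no class)"] += 1
    return counts


log_lines = [
    "FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_single_image[ 1 ]",
    "FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestEdgeCases::test_run_with_empty_prompt_string[ 1 ]",
    "FAILED tests/codeflash_generated/test_run__behaviorinstrumented_1.py::test_run_local_various_image_reference_types[ 1 ]",
    "INFO:   INCORRECT: 41/42 passed, 1 diffs",
]
counts = count_by_class(log_lines)
print(dict(counts))
```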


rfdetr-rfdetrobjectdetection-postprocess

  • Function: postprocess

  • File: inference/models/rfdetr/rfdetr.py

  • Commit: 6078c43bae0aa336aef12e324b9a9008a35d2408

  • Method: git_parent

  • DB Speedup: 1.13x

  • Solve OK: False

  • Duration: 12.4s

  • Reward: correct=0.0, speedup=0.0, tests=10/29

Key errors
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_bbox_format_conversion[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_sigmoid_stable_applied[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_empty_predictions[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_single_query_single_class[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_zero_confidence_threshold[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_very_small_image_dims[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_very_large_image_dims[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_bbox_clipping[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_class_id_filtering[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_data_type_conversion[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_negative_bbox_coordinates[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_large_batch_size[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_max_detections_large_value[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_precision_with_small_values[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_large_bbox_values[ 1 ]
INFO:   INCORRECT: 10/29 passed, 19 diffs

Reproduce: bash docker_e2e_test.sh rfdetr-rfdetrobjectdetection-postprocess --debug


s3-s3sinkblockv1-_upload_separate_file

  • Function: _upload_separate_file

  • File: inference/core/workflows/core_steps/sinks/s3/v1.py

  • Commit: 639c8e77ab90d6a43f32fe55a355373ae74e0924

  • Method: db_code_match

  • DB Speedup: 1.15x

  • Solve OK: False

  • Duration: 41.7s

  • Reward: correct=0.0, speedup=0.0, tests=1249/1252

Key errors
… Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  /workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1267: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  /workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1280: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  /workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1296: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  /workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1311: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
INFO:   INCORRECT: 1249/1252 passed, 3 diffs
INFO:     [stdout] WARNING  S3 connection error on attempt 1/4: An unspecified error occurred  vs  WARNING  Could not upload to S3: An unspecified error occurred
INFO:     [stdout] WARNING  Non-retryable S3 error (NoSuchBucket): An error occurred (NoSuchBucket)  vs  WARNING  Could not upload to S3: An error occurred (NoSuchBucket) when calling

Reproduce: bash docker_e2e_test.sh s3-s3sinkblockv1-_upload_separate_file --debug
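
The two `[stdout]` lines above show paired log messages separated by `vs`, which suggests the diffs in this entry are differences in captured warning text. The report does not say which side belongs to the original code and which to the candidate, so the sketch below only labels them `left` and `right`. The function name is a hypothetical helper, and the whitespace collapsing is an assumption made to ignore column padding.

```python
def split_stdout_diff(line):
    """Split an 'INFO: [stdout] A vs B' line into (left, right) messages.

    Runs of whitespace are collapsed so column padding does not matter.
    Returns None when the line is not a stdout diff line.
    """
    marker = "[stdout]"
    idx = line.find(marker)
    if idx == -1:
        return None
    payload = line[idx + len(marker):]
    left, sep, right = payload.partition(" vs ")
    if not sep:
        return None
    return " ".join(left.split()), " ".join(right.split())


line = (
    "INFO:     [stdout] WARNING  S3 connection error on attempt 1/4: "
    "An unspecified error occurred        vs  WARNING  Could not upload to S3: "
    "An unspecified error occurred"
)
left, right = split_stdout_diff(line)
print(left)
print(right)
```

Comparing the two sides after normalization makes it easier to see that only the warning wording differs, not the underlying error text.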


sort-sortblockv1-run

  • Function: run

  • File: inference/core/workflows/core_steps/trackers/sort/v1.py

  • Commit: HEAD

  • Method: db_code_only

  • DB Speedup: 2.07x

  • Solve OK: False

  • Duration: 19.8s

  • Reward: correct=0.0, speedup=0.0, tests=682/684

Key errors
  PydanticDeprecatedSince20: `allow_reuse` is deprecated and will be ignored; it should no longer be necessary. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: `min_items` is deprecated and will be removed, use `min_length` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::test_new_then_already_seen_instance_detection[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::test_filter_out_unmatched_tracks_with_negative_id[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::test_create_tracker_receives_default_fps_when_missing[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::test_large_scale_many_instances_and_cache_behavior[ 1 ]
INFO:   INCORRECT: 682/684 passed, 2 diffs

Reproduce: bash docker_e2e_test.sh sort-sortblockv1-run --debug