Codeflash RL Environment — Batch Validation Report
Summary
| Metric | Count | % |
|---|---|---|
| Total tasks | 166 | 100% |
| Solve passes | 0 | 0% |
| Eval correct (all behavioral tests pass) | 153 | 92% |
| Faster than original (speedup > 1.0) | 122 | 73% |
| All test cases pass | 157 | 94% |
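The percentages in the table can be reproduced from the counts. The sketch below assumes they are truncated rather than rounded, since 157/166 is about 94.6% and the table shows 94%. The helper name `truncated_percent` is hypothetical and not part of the reporting tool.

```python
# Counts copied from the summary table above.
TOTAL_TASKS = 166
counts = {
    "Solve passes": 0,
    "Eval correct": 153,
    "Faster than original": 122,
    "All test cases pass": 157,
}


def truncated_percent(count: int, total: int) -> int:
    """Integer percentage rounded down (assumed rounding rule)."""
    return count * 100 // total


percentages = {name: truncated_percent(n, TOTAL_TASKS) for name, n in counts.items()}
for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```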
Speedup Distribution (correct tasks only)
- Slower (< 1x): 13 tasks
- 1-1.5x: 85 tasks (18 of these measured exactly 1.0000x, so they are not counted as faster than original)
- 1.5-2x: 18 tasks
- 2-5x: 14 tasks
- 5-100x: 17 tasks
- >100x: 6 tasks
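A minimal sketch of how speedups could be assigned to the buckets listed above. Treating each bucket as half-open (lower edge included, upper edge excluded) is an assumption, as are the names `EDGES`, `bucket`, and `distribution`. The sample values are taken from the results table.

```python
from bisect import bisect_right
from collections import Counter

# Bucket edges inferred from the labels above; half-open [low, high) is assumed.
EDGES = [1.0, 1.5, 2.0, 5.0, 100.0]
LABELS = ["Slower (< 1x)", "1-1.5x", "1.5-2x", "2-5x", "5-100x", ">100x"]


def bucket(speedup: float) -> str:
    """Return the label of the bucket that contains `speedup`."""
    return LABELS[bisect_right(EDGES, speedup)]


def distribution(speedups):
    """Count speedups per bucket, keeping the report's bucket order."""
    counts = Counter(bucket(s) for s in speedups)
    return {label: counts.get(label, 0) for label in LABELS}


# A few speedups copied from the results table.
sample = [74810.9605, 27.0112, 4.7753, 1.9788, 1.0000, 0.9653]
histogram = distribution(sample)
```

Under this rule a speedup of exactly 1.0000x falls into the 1-1.5x bucket even though it does not count as faster than the original.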
Successful Tasks (correct=1.0)
| Task | Function | Speedup | Tests | Coverage | Quality | DB Speedup |
|---|---|---|---|---|---|---|
| models-prepare_multi_label_classification_response | prepare_multi_label_classification_response | 74810.9605x | 32/32 | 7.7% | low | 68465.46x |
| introspection-prepare_operators_descriptions | prepare_operators_descriptions | 15514.0489x | 1057/1057 | 35.0% | low | 14150.75x |
| decorators-withfixedsizecache-memory_pressure_detected | memory_pressure_detected | 1747.8138x | 132/132 | 37.8% | | 917.09x |
| depth_anything_v3-inferencemodelsdepthanythingv3adapter-predict | predict | 397.7983x | 36/36 | 23.2% | | 391.52x |
| detection_event_log-detectioneventlogblockv1-_evict_oldest_video | _evict_oldest_video | 343.5465x | 170/170 | 46.4% | | 15.92x |
| camera-_generate_grid_colors | _generate_grid_colors | 269.6811x | 1901/1901 | 9.0% | | 218.30x |
| workflow_caller-_check_workflow_for_circular_references | _check_workflow_for_circular_references | 27.0112x | 41/41 | 31.1% | low | 11.84x |
| semantic_segmentation-blockmanifest-describe_outputs | describe_outputs | 23.9434x | 2539/2539 | 38.9% | high | 21.92x |
| dynamic_blocks-build_traceback_string | build_traceback_string | 20.3358x | 2047/2047 | 16.0% | low | 13.38x |
| sort-sortmanifest-describe_outputs | describe_outputs | 19.1846x | 8036/8036 | 89.7% | low | 17.16x |
| bytetrack-bytetrackmanifest-describe_outputs | describe_outputs | 19.0393x | 3033/3033 | 90.2% | low | 17.56x |
| event_writer-_extract_detail | _extract_detail | 16.4218x | 43/43 | 31.9% | | 13.46x |
| workflow_caller-_describe_outputs_from_spec | _describe_outputs_from_spec | 16.0714x | 25/25 | 23.1% | low | 12.94x |
| managers-try_releasing_cuda_memory | try_releasing_cuda_memory | 15.0925x | 1006/1006 | 10.8% | | 1.22x |
| cache-_slugify_model_id | _slugify_model_id | 13.9914x | 1050/1050 | 26.1% | medium | 11.21x |
| s3-deduct_csv_header | deduct_csv_header | 13.1148x | 54/54 | 38.6% | high | 8.90x |
| dynamic_blocks-create_dynamic_module | create_dynamic_module | 10.8808x | 142/142 | 27.4% | | 12.31x |
| dataset_upload-roboflowdatasetuploadblockv2-run | run | 10.0212x | 13/13 | 57.1% | | 9.14x |
| glm_ocr-blockmanifest-describe_outputs | describe_outputs | 8.9607x | 1035/1035 | 51.9% | | 9.33x |
| qwen3_5vl-blockmanifest-describe_outputs | describe_outputs | 7.6146x | 3228/3228 | 48.6% | medium | 7.30x |
| http-with_route_exceptions | with_route_exceptions | 6.7561x | 1297/1297 | 8.1% | | 6.89x |
| qwen3_5vl-qwen35vlblockv1-run_remotely | run_remotely | 6.3813x | 23/23 | 69.4% | | 5.64x |
| introspection-prepare_operations_descriptions | prepare_operations_descriptions | 6.2270x | 147/147 | 82.5% | high | 6.26x |
| core_steps-load_kinds | load_kinds | 4.7753x | 1153/1153 | 42.0% | high | 3.68x |
| depth_anything_v2-inferencemodelsdepthanythingv2adapter-predict | predict | 4.0460x | 38/38 | 60.2% | | 5.03x |
| qwen3_5vl-inferencemodelsqwen35vladapter-predict | predict | 3.2474x | 2275/2275 | 70.3% | low | 3.72x |
| core-_prepare_workflow_response_cache_key | _prepare_workflow_response_cache_key | 3.0359x | 7539/7539 | 2.7% | medium | 2.39x |
| compiler-establish_step_execution_dimensionality | establish_step_execution_dimensionality | 2.6841x | 47/47 | 23.2% | | 2.37x |
| semantic_segmentation-roboflowsemanticsegmentationmodelblockv1-_convert_to_sv_de | _convert_to_sv_detections | 2.6825x | 13/13 | 71.7% | | 2.22x |
| managers-modelmanager-_dispose_model_lock | _dispose_model_lock | 2.5469x | 2784/2784 | 14.7% | | 3.24x |
| text_display-clamp_box | clamp_box | 2.5308x | 1210/1210 | 15.0% | high | 2.80x |
| event_writer-_detections_to_v2_instance_segmentations | _detections_to_v2_instance_segmentations | 2.3078x | 36/36 | 41.2% | | 2.18x |
| models-baseinference-infer | infer | 2.2736x | 1037/1037 | 2.8% | low | 2.32x |
| qwen3vl-inferencemodelsqwen3vladapter-map_inference_kwargs | map_inference_kwargs | 2.1968x | 1125/1125 | 26.8% | medium | 2.39x |
| clip_comparison-blockmanifest-get_required_cache_artifacts | get_required_cache_artifacts | 2.1802x | 130/130 | 26.6% | low | 2.04x |
| introspection-_get_property_name_options | _get_property_name_options | 2.0733x | 1053/1053 | 57.5% | | 1.52x |
| compiler-verify_compatibility_of_input_data_lineage_with_control_flow_lineage | verify_compatibility_of_input_data_lineage_with_control_flow_lineage | 2.0635x | 39/39 | 26.4% | | 2.11x |
| execution_data_manager-executiondatamanager-_register_control_flow_output_for_no | _register_control_flow_output_for_non_simd_step | 1.9788x | 32/32 | 20.2% | | 2.65x |
| core-_forcetracerootsampler-get_description | get_description | 1.9079x | 3244/3244 | 1.6% | | 2.03x |
| enterprise_blocks-load_enterprise_blocks | load_enterprise_blocks | 1.8940x | 1936/1936 | 32.2% | medium | 1.45x |
| entities-workflowimagedata-copy_and_replace | copy_and_replace | 1.8862x | 2336/2336 | 72.1% | | 2.04x |
| compiler-_collect_unique_control_flow_lineages_with_step_mapping | _collect_unique_control_flow_lineages_with_step_mapping | 1.8585x | 33/33 | 24.3% | | 1.95x |
| mask_area_measurement-maskareameasurementblockv1-run | run | 1.8337x | 39/39 | 93.0% | | 1.65x |
| compiler-separate_control_flow_predecessors_from_data_providers | separate_control_flow_predecessors_from_data_providers | 1.8200x | 34/34 | 23.1% | high | 1.87x |
| event_writer-_build_event_data | _build_event_data | 1.8120x | 4732/4732 | 34.7% | medium | 1.74x |
| compiler-step_definition_allows_control_flow_references | step_definition_allows_control_flow_references | 1.7192x | 27/27 | 22.5% | medium | 1.86x |
| introspection-retrieve_selectors_from_union_definition | retrieve_selectors_from_union_definition | 1.6618x | 36/36 | 22.2% | high | 1.98x |
| dataset_upload-maybe_register_datapoint_at_roboflow | maybe_register_datapoint_at_roboflow | 1.6392x | 1039/1039 | 55.6% | low | 1.47x |
| cache-is_block_cached | is_block_cached | 1.6255x | 53/53 | 27.9% | low | 1.36x |
| introspection-_ref_to_def_name | _ref_to_def_name | 1.6030x | 1344/1344 | 27.5% | high | 1.51x |
| mask_area_measurement-compute_detection_areas | compute_detection_areas | 1.5829x | 24/24 | 83.0% | | 1.46x |
| managers-list_files | list_files | 1.5797x | 99/99 | 8.9% | | 1.66x |
| dynamic_blocks-assembly_custom_python_block | assembly_custom_python_block | 1.5787x | 135/135 | 36.7% | low | 1.61x |
| cache-get_cached_foundation_models | get_cached_foundation_models | 1.5691x | 32/32 | 34.7% | low | 1.46x |
| compiler-is_control_flow_step | is_control_flow_step | 1.5035x | 1830/1830 | 15.3% | medium | 1.34x |
| execution_data_manager-construct_mask_for_all_inputs_dimensionalities | construct_mask_for_all_inputs_dimensionalities | 1.4808x | 31/31 | 19.0% | low | 1.51x |
| common-deserialize_image_kind | deserialize_image_kind | 1.4732x | 1506/1506 | 7.4% | | 1.42x |
| usage_tracking-usagecollector-_compute_execution_duration | _compute_execution_duration | 1.4709x | 2017/2017 | 27.5% | medium | 1.55x |
| core-_url_for_safe_logging | _url_for_safe_logging | 1.4616x | 1055/1055 | 2.8% | | 1.47x |
| dataset_upload-is_prediction_registration_forbidden | is_prediction_registration_forbidden | 1.4475x | 2043/2043 | 31.7% | | 1.44x |
| qwen3_5vl-qwen35vlblockv1-run | run | 1.4215x | 28/28 | 93.1% | low | 1.69x |
| execution_data_manager-construct_simd_step_input | construct_simd_step_input | 1.4138x | 26/26 | 28.3% | low | 1.37x |
| cache-get_task_type_to_block_mapping | get_task_type_to_block_mapping | 1.4136x | 30/30 | 29.6% | low | 1.39x |
| anthropic_claude-blockmanifest-get_air_gapped_availability | get_air_gapped_availability | 1.3907x | 2243/2243 | 16.4% | low | 1.45x |
| qwen3_5vl-inferencemodelsqwen35vladapter-map_inference_kwargs | map_inference_kwargs | 1.3866x | 1549/1549 | 64.9% | low | 1.53x |
| email_notification-format_email_message | format_email_message | 1.3730x | 56/56 | 31.7% | high | 1.35x |
| dataset_upload-register_datapoint_at_roboflow | register_datapoint_at_roboflow | 1.3720x | 2037/2037 | 38.6% | low | 1.32x |
| common-add_inference_keypoints_to_sv_detections | add_inference_keypoints_to_sv_detections | 1.3657x | 30/30 | 4.1% | | 1.56x |
| core-get_workflow_specification | get_workflow_specification | 1.3586x | 1157/1157 | 3.6% | low | 1.56x |
| sequences-sequence_apply | sequence_apply | 1.3540x | 58/58 | 30.2% | medium | 1.48x |
| managers-modelmanager-infer_from_request_sync | infer_from_request_sync | 1.3382x | 3041/3041 | 13.7% | low | 1.46x |
| entities-batch-remove_by_indices | remove_by_indices | 1.3240x | 44/44 | 65.4% | high | 1.26x |
| cache-_is_model_cached | _is_model_cached | 1.3055x | 45/45 | 27.0% | | 1.24x |
| workflow_caller-_extract_workflow_caller_ids_from_spec | _extract_workflow_caller_ids_from_spec | 1.3007x | 44/44 | 25.8% | | 1.34x |
| openai-execute_gpt_4v_request | execute_gpt_4v_request | 1.3006x | 37/37 | 25.8% | medium | 2.00x |
| cache-is_model_cached | is_model_cached | 1.2989x | 55/55 | 28.7% | high | 1.22x |
| core-load_cached_workflow_response | load_cached_workflow_response | 1.2951x | 12126/12126 | 2.8% | low | 1.38x |
| execution_data_manager-filter_to_valid_prefix_chains | filter_to_valid_prefix_chains | 1.2932x | 32/32 | 15.3% | | 1.32x |
| execution_data_manager-intersect_masks_per_dimension | intersect_masks_per_dimension | 1.2908x | 40/40 | 13.5% | high | 1.66x |
| webrtc_worker-videoframeprocessor-serialize_outputs_sync | serialize_outputs_sync | 1.2882x | 48/48 | 17.9% | low | 1.37x |
| webrtc_worker-videoframeprocessor-_check_termination | _check_termination | 1.2650x | 2029/2029 | 16.1% | | 1.36x |
| workflow_caller-_fetch_workflow_spec_for_validation | _fetch_workflow_spec_for_validation | 1.2390x | 1547/1547 | 23.1% | | 1.33x |
| dataset_upload-roboflowdatasetuploadblockv1-run | run | 1.2328x | 41/41 | 38.6% | | 1.26x |
| executor-_run_workflow | _run_workflow | 1.2137x | 130/130 | 21.6% | low | 1.22x |
| managers-rank_for_deletion | rank_for_deletion | 1.2067x | 106/106 | 7.3% | | 1.88x |
| detection_event_log-detectioneventlogblockv1-_get_relative_time | _get_relative_time | 1.2017x | 41/41 | 43.0% | | 1.19x |
| http-_build_step_execution_error_response | _build_step_execution_error_response | 1.1955x | 1029/1029 | 1.0% | low | 1.19x |
| common-serialise_sv_detections | serialise_sv_detections | 1.1825x | 149/149 | 5.1% | | 1.19x |
| models-inferencemodelsobjectdetectionadapter-postprocess | postprocess | 1.1819x | 33/33 | 8.7% | medium | 1.23x |
| text_display-draw_background_with_alpha | draw_background_with_alpha | 1.1776x | 176/176 | 29.5% | high | 1.18x |
| core-record_inference | record_inference | 1.1693x | 3033/3033 | 1.6% | low | 1.22x |
| webrtc_worker-default_encoder | default_encoder | 1.1565x | 4071/4071 | 17.0% | medium | 1.12x |
| easy_ocr-blockmanifest-get_supported_model_variants | get_supported_model_variants | 1.1535x | 2039/2039 | 57.5% | low | 1.31x |
| execution_data_manager-get_masks_intersection_for_dimensions | get_masks_intersection_for_dimensions | 1.1505x | 36/36 | 16.9% | low | 1.23x |
| mask_area_measurement-get_detection_area | get_detection_area | 1.1443x | 129/129 | 83.7% | | 1.19x |
| email_notification-apply_operations_to_message_parameters | apply_operations_to_message_parameters | 1.1389x | 44/44 | 29.5% | low | 1.15x |
| dataset_upload-register_datapoint | register_datapoint | 1.1321x | 1138/1138 | 42.5% | low | 1.16x |
| compiler-get_lineage_derived_from_control_flow | get_lineage_derived_from_control_flow | 1.1298x | 33/33 | 23.8% | low | 1.25x |
| yolo_world-blockmanifest-get_supported_model_variants | get_supported_model_variants | 1.1252x | 2232/2232 | 50.0% | medium | 1.29x |
| trackers-instancecache-record_instance | record_instance | 1.1226x | 14857/14857 | 17.3% | medium | 1.14x |
| event_writer-_build_image_entry | _build_image_entry | 1.1222x | 1337/1337 | 60.6% | low | 1.10x |
| workflow_caller-_convert_output_descriptions_to_kinds | _convert_output_descriptions_to_kinds | 1.1212x | 37/37 | 24.6% | medium | 1.19x |
| workflow_caller-workflowcallerblockv1-run | run | 1.1159x | 59/59 | 48.9% | | 1.13x |
| notification-blockmanifest-get_air_gapped_availability | get_air_gapped_availability | 1.1135x | 1535/1535 | 43.0% | low | 1.14x |
| moondream2-inferencemodelsmoondream2adapter-caption | caption | 1.1118x | 185/185 | 45.1% | high | 1.11x |
| cache-_get_block_type_identifier | _get_block_type_identifier | 1.1082x | 34/34 | 26.5% | | 1.11x |
| models-inferencemodelsobjectdetectionadapter-preprocess | preprocess | 1.1014x | 31/31 | 7.5% | low | 1.12x |
| workflow_caller-_deserialize_output_value | _deserialize_output_value | 1.0981x | 139/139 | 27.4% | medium | 1.11x |
| workflow_caller-_resolve_output_kinds_for_run | _resolve_output_kinds_for_run | 1.0975x | 1047/1047 | 26.2% | | 1.12x |
| lmm-blockmanifest-get_air_gapped_availability | get_air_gapped_availability | 1.0895x | 4625/4625 | 46.1% | low | 1.12x |
| workflow_caller-build_workflow_url | build_workflow_url | 1.0892x | 6137/6137 | 22.2% | low | 1.29x |
| custom_metadata-blockmanifest-get_air_gapped_availability | get_air_gapped_availability | 1.0780x | 3724/3724 | 49.4% | low | 1.12x |
| models-inferencemodelskeypointsdetectionadapter-map_inference_kwargs | map_inference_kwargs | 1.0669x | 2232/2232 | 7.4% | low | 1.13x |
| heatmap-heatmapvisualizationblockv1-getannotator | getAnnotator | 1.0615x | 38/38 | 44.2% | low | 1.41x |
| detection_event_log-detectioneventlogblockv1-run | run | 1.0380x | 4426/4426 | 97.3% | low | 3.40x |
| glm_ocr-inferencemodelsglmocradapter-postprocess | postprocess | 1.0374x | 1788/1788 | 54.5% | low | 1.19x |
| sms-blockmanifest-get_air_gapped_availability | get_air_gapped_availability | 1.0320x | 2743/2743 | 15.4% | low | 1.14x |
| handlers-handle_describe_workflows_blocks_request | handle_describe_workflows_blocks_request | 1.0280x | 153/153 | 42.0% | low | 2.50x |
| dataset_upload-blockmanifest-get_air_gapped_availability | get_air_gapped_availability | 1.0160x | 3532/3532 | 29.3% | low | 1.10x |
| core-get_workflow_cache_file | get_workflow_cache_file | 1.0093x | 1551/1551 | 2.8% | low | 1.17x |
| dataset_upload-execute_registration | execute_registration | 1.0064x | 1005/1005 | 38.1% | low | 1.17x |
| clip_comparison-blockmanifest-get_supported_model_variants | get_supported_model_variants | 1.0027x | 3031/3031 | 25.9% | low | 1.28x |
| builder-get_cached_models | get_cached_models | 1.0000x | 21/21 | 51.4% | low | 1.80x |
| cache-measure_memory_for_eviction | measure_memory_for_eviction | 1.0000x | 2/2 | N/A | low | 19.13x |
| compiler-establish_control_flow_edge | establish_control_flow_edge | 1.0000x | 26/26 | 24.4% | low | 1.32x |
| compiler-find_longest_lineage_support | find_longest_lineage_support | 1.0000x | 51/51 | 23.1% | low | 1.26x |
| core-wrap_roboflow_api_errors | wrap_roboflow_api_errors | 1.0000x | 438/438 | 3.7% | low | 1.28x |
| core_steps-load_blocks | load_blocks | 1.0000x | 2/2 | N/A | low | 5.09x |
| dataset_upload-_expand_metadata_to_records | _expand_metadata_to_records | 1.0000x | 2/2 | 50.6% | low | 1.62x |
| dataset_upload-_transpose_metadata_batches | _transpose_metadata_batches | 1.0000x | 2/2 | 50.6% | low | 1.36x |
| dynamic_blocks-_create_clean_traceback | _create_clean_traceback | 1.0000x | 2/2 | 13.4% | low | 3.22x |
| execution_data_manager-_transpose_dict_of_batches_if_needed | _transpose_dict_of_batches_if_needed | 1.0000x | 2/2 | 12.7% | low | 1.61x |
| halo-halovisualizationblockv1-getannotator | getAnnotator | 1.0000x | 2/2 | 34.2% | low | 11.57x |
| http-with_route_exceptions_async | with_route_exceptions_async | 1.0000x | 1/1 | 0.8% | | 6.20x |
| managers-customcollector-_fetch_stream_metrics | _fetch_stream_metrics | 1.0000x | 41/41 | 7.2% | low | 1.19x |
| managers-experimentalmodelmanager-is_loaded | is_loaded | 1.0000x | 2/2 | 0.6% | low | 2.31x |
| mask_area_measurement-areameasurementblockv1-run | run | 1.0000x | 2/2 | N/A | low | 1.25x |
| models-semanticsegmentationbaseonnxroboflowinferencemodel-make_response | make_response | 1.0000x | 5/5 | 11.5% | low | 1.33x |
| overlap-overlapmanifest-describe_outputs | describe_outputs | 1.0000x | 2/2 | N/A | low | 6.36x |
| qwen3vl-_is_flash_attn_usable | _is_flash_attn_usable | 1.0000x | 2/2 | 17.4% | low | 141.41x |
| decorators-withfixedsizecache-add_model | add_model | 0.9999x | 1224/1224 | 57.8% | low | 1.17x |
| object_detection-blockmanifest-get_compatible_task_types | get_compatible_task_types | 0.9982x | 3874/3874 | 27.1% | low | 1.12x |
| instance_segmentation-blockmanifest-get_compatible_task_types | get_compatible_task_types | 0.9954x | 3836/3836 | 27.6% | low | 1.15x |
| semantic_segmentation-blockmanifest-get_compatible_task_types | get_compatible_task_types | 0.9907x | 4124/4124 | 38.9% | low | 1.11x |
| segment_anything3-blockmanifest-get_supported_model_variants | get_supported_model_variants | 0.9862x | 5631/5631 | 9.5% | low | 1.13x |
| keypoint_detection-blockmanifest-get_compatible_task_types | get_compatible_task_types | 0.9836x | 2631/2631 | 26.7% | low | 1.11x |
| gaze-blockmanifest-get_supported_model_variants | get_supported_model_variants | 0.9797x | 2229/2229 | 33.1% | low | 1.21x |
| yolo26-yolo26instancesegmentation-predict | predict | 0.9786x | 33/33 | 32.9% | low | 1.26x |
| operations-build_sequence_apply_operation | build_sequence_apply_operation | 0.9720x | 25/25 | 30.3% | low | 1.35x |
| common-deserialize_detections_kind | deserialize_detections_kind | 0.9719x | 8/8 | 5.5% | low | 1.15x |
| stream-inferencepipeline-init_with_workflow | init_with_workflow | 0.9698x | 53/53 | 30.5% | low | 1.10x |
| moondream2-blockmanifest-get_supported_model_variants | get_supported_model_variants | 0.9669x | 1238/1238 | 50.6% | low | 1.15x |
| multi_class_classification-blockmanifest-get_compatible_task_types | get_compatible_task_types | 0.9653x | 2622/2622 | 26.0% | low | 1.14x |
Failed Tasks (13)
byte_tracker-bytetrackmanifest-describe_outputs
- Function: describe_outputs
- File: inference/core/workflows/core_steps/transformations/byte_tracker/v1.py
- Commit: HEAD
- Method: db_code_only
- DB Speedup: 16.66x
- Solve OK: False
- Duration: 17.4s
- Reward: correct=0.0, speedup=0.0, tests=6035/6036
Key errors
PydanticDeprecatedSince20: `allow_reuse` is deprecated and will be ignored; it should no longer be necessary. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
PydanticDeprecatedSince20: `min_items` is deprecated and will be removed, use `min_length` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
FAILED tests/codeflash_generated/test_describe_outputs__behaviorinstrumented_0.py::test_describe_outputs_basic_structure_and_contents[ 1 ]
FAILED tests/codeflash_generated/test_describe_outputs__behaviorinstrumented_1.py::test_describe_outputs_already_seen_instances_kind[ 1 ]
INFO: INCORRECT: 6035/6036 passed, 1 diffs
Reproduce: bash docker_e2e_test.sh byte_tracker-bytetrackmanifest-describe_outputs --debug
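Most lines in a "Key errors" excerpt are `PydanticDeprecatedSince20` warnings, not failures. The sketch below separates the pytest `FAILED`/`ERROR` lines from that noise. The function name `split_log` and the three-way split are illustrative assumptions, not part of the validation tooling.

```python
import re

# pytest's short summary prints "FAILED <node id>" or "ERROR <node id>".
# Node ids in this report can contain spaces (e.g. "[ 1 ]"), so capture the rest of the line.
OUTCOME_RE = re.compile(r"^(FAILED|ERROR)\s+(.+?)\s*$")


def split_log(lines):
    """Separate pytest outcome lines from deprecation warnings and other output."""
    outcomes, warnings, other = [], [], []
    for line in lines:
        match = OUTCOME_RE.match(line)
        if match:
            outcomes.append((match.group(1), match.group(2)))
        elif "PydanticDeprecatedSince20" in line:
            warnings.append(line)
        else:
            other.append(line)
    return outcomes, warnings, other


# Lines shortened from the excerpt above.
sample = [
    "PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead.",
    "FAILED tests/codeflash_generated/test_describe_outputs__behaviorinstrumented_0.py"
    "::test_describe_outputs_basic_structure_and_contents[ 1 ]",
    "INFO: INCORRECT: 6035/6036 passed, 1 diffs",
]
outcomes, warnings, other = split_log(sample)
```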
clip-inferencemodelsclipadapter-compare
- Function: compare
- File: inference/models/clip/clip_inference_models.py
- Commit: 7648e452a70ff1aad09f017a0eb2ea4022b7e177
- Method: db_code_match
- DB Speedup: 3.37x
- Solve OK: False
- Duration: 23.7s
- Reward: correct=0.0, speedup=0.0, tests=135/136
Key errors
PydanticDeprecatedSince20: `allow_reuse` is deprecated and will be ignored; it should no longer be necessary. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
PydanticDeprecatedSince20: `min_items` is deprecated and will be removed, use `min_length` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
PydanticDeprecatedSince20: Using extra keyword arguments on `Field` is deprecated and will be removed. Use `json_schema_extra` instead. (Extra keys: 'optional'). Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
FAILED tests/codeflash_generated/test_compare__behaviorinstrumented_0.py::TestInferenceModelsClipAdapterCompare::test_compare_empty_prompt_list[ 1 ]
INFO: INCORRECT: 135/136 passed, 1 diffs
Reproduce: bash docker_e2e_test.sh clip-inferencemodelsclipadapter-compare --debug
compiler-establish_batch_oriented_step_lineage
- Function: establish_batch_oriented_step_lineage
- File: inference/core/workflows/execution_engine/v1/compiler/graph_constructor.py
- Commit: 90243bdc6278ef7d17b6db09dc1eb5b0d155b4be
- Method: db_code_match
- DB Speedup: 1.54x
- Solve OK: False
- Duration: 14.0s
- Reward: correct=0.0, speedup=0.0, tests=33/36
Key errors
/workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1220: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
/workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1236: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
FAILED tests/codeflash_generated/test_establish_batch_oriented_step_lineage__behaviorinstrumented_0.py::test_multiple_control_flow_lineages_with_same_min_length_raises_assumption_error[ 1 ]
FAILED tests/codeflash_generated/test_establish_batch_oriented_step_lineage__behaviorinstrumented_1.py::test_empty_lineage_lists[ 1 ]
FAILED tests/codeflash_generated/test_establish_batch_oriented_step_lineage__behaviorinstrumented_1.py::test_missing_dimensionality_reference_property[ 1 ]
FAILED tests/codeflash_generated/test_establish_batch_oriented_step_lineage__behaviorinstrumented_1.py::test_non_batch_oriented_property_raises_error[ 1 ]
FAILED tests/codeflash_generated/test_establish_batch_oriented_step_lineage__behaviorinstrumented_1.py::test_multiple_control_flow_same_min_length_raises_error[ 1 ]
FAILED tests/codeflash_generated/test_establish_batch_oriented_step_lineage__behaviorinstrumented_1.py::test_compound_input_no_batch_oriented_raises_error[ 1 ]
INFO: INCORRECT: 33/36 passed, 3 diffs
Reproduce: bash docker_e2e_test.sh compiler-establish_batch_oriented_step_lineage --debug
compiler-get_reference_lineage
- Function: get_reference_lineage
- File: inference/core/workflows/execution_engine/v1/compiler/graph_constructor.py
- Commit: HEAD
- Method: db_code_only
- DB Speedup: 1.61x
- Solve OK: False
- Duration: 14.6s
- Reward: correct=0.0, speedup=0.0, tests=20/24
Key errors
/errors.pydantic.dev/2.11/migration/
/workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1294: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
/workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1310: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
FAILED tests/codeflash_generated/test_get_reference_lineage__behaviorinstrumented_1.py::TestGetReferenceLineageBasic::test_batch_oriented_property_in_simple_input[ 1 ]
FAILED tests/codeflash_generated/test_get_reference_lineage__behaviorinstrumented_1.py::TestGetReferenceLineageEdge::test_compound_input_with_batch_oriented_nested[ 1 ]
FAILED tests/codeflash_generated/test_get_reference_lineage__behaviorinstrumented_1.py::TestGetReferenceLineageEdge::test_compound_input_no_batch_oriented_raises_error[ 1 ]
FAILED tests/codeflash_generated/test_get_reference_lineage__behaviorinstrumented_1.py::TestGetReferenceLineageLargeScale::test_large_compound_input_many_nested[ 1 ]
FAILED tests/codeflash_generated/test_get_reference_lineage__behaviorinstrumented_1.py::TestGetReferenceLineageLargeScale::test_many_input_data_keys[ 1 ]
INFO: INCORRECT: 20/24 passed, 4 diffs
Reproduce: bash docker_e2e_test.sh compiler-get_reference_lineage --debug
core_steps-_should_filter_block
- Function: _should_filter_block
- File: inference/core/workflows/core_steps/loader.py
- Commit: HEAD
- Method: db_code_only
- DB Speedup: 4.93x
- Solve OK: False
- Duration: 27.4s
- Reward: correct=0.0, speedup=0.0, tests=41/41
Key errors
_ ERROR collecting tests/codeflash_generated/test__should_filter_block__behaviorinstrumented_1.py _
ImportError while importing test module '/workspace/inference/tests/codeflash_generated/test__should_filter_block__behaviorinstrumented_1.py'.
E ImportError: cannot import name 'WORKFLOW_SELECTIVE_BLOCKS_DISABLE' from 'inference.core.env' (/workspace/inference/inference/core/env.py)
/usr/local/lib/python3.12/site-packages/pydantic/fields.py:1093: PydanticDeprecatedSince20: Using extra keyword arguments on `Field` is deprecated and will be removed. Use `json_schema_extra` instead. (Extra keys: 'optional'). Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
ERROR tests/codeflash_generated/test__should_filter_block__behaviorinstrumented_0.py
ERROR tests/codeflash_generated/test__should_filter_block__behaviorinstrumented_1.py
!!!!!!!!!!!!!!!!!!! Interrupted: 2 errors during collection !!!!!!!!!!!!!!!!!!!!
1 warning, 2 errors in 0.28s
INFO: INCORRECT: 41/41 passed, 0 diffs
Reproduce: bash docker_e2e_test.sh core_steps-_should_filter_block --debug
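This task is marked incorrect even though its summary line reads `41/41 passed, 0 diffs`; the log shows that pytest was interrupted by collection errors instead. A rough way to triage such cases is sketched below. The category names and the `classify` helper are hypothetical, and the matching relies only on the two log patterns that appear in this report.

```python
import re

# Matches the harness summary line, e.g. "INFO: INCORRECT: 41/41 passed, 0 diffs".
SUMMARY_RE = re.compile(r"INFO: INCORRECT: (\d+)/(\d+) passed, (\d+) diffs")


def classify(log_lines):
    """Return a coarse failure category for one failed task's log excerpt."""
    text = "\n".join(log_lines)
    summary = SUMMARY_RE.search(text)
    if summary is None:
        return "unknown"
    passed, total, diffs = map(int, summary.groups())
    # pytest prints "Interrupted: N errors during collection" when test modules fail to import.
    if "errors during collection" in text:
        return "collection_error"
    if diffs > 0 or passed < total:
        return "behavioral_diff"
    return "unexplained"


collection_case = classify([
    "!!!!!!!!!!!!!!!!!!! Interrupted: 2 errors during collection !!!!!!!!!!!!!!!!!!!!",
    "INFO: INCORRECT: 41/41 passed, 0 diffs",
])
diff_case = classify(["INFO: INCORRECT: 6035/6036 passed, 1 diffs"])
```

A log that shows neither collection errors nor diffs falls into the `unexplained` category and needs a manual rerun with the `--debug` command.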
execution_data_manager-prepare_parameters
- Function: prepare_parameters
- File: inference/core/workflows/execution_engine/v1/executor/execution_data_manager/step_input_assembler.py
- Commit: HEAD
- Method: db_code_only
- DB Speedup: 1.12x
- Solve OK: False
- Duration: 15.7s
- Reward: correct=0.0, speedup=0.0, tests=1/1
Key errors
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_empty_runtime_parameters[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_step_execution_dimensionality_zero[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_large_dimensionality[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_special_step_names[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_unicode_step_names[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_many_input_parameters[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_large_batch_size[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_deeply_nested_compound_inputs[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_many_masks[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_many_auto_batch_casting_configs[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_iteration_performance[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_complex_data_structures[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_mixed_parameter_types[ 1 ]
INFO: INCORRECT: 1/1 passed, 0 diffs
Reproduce: bash docker_e2e_test.sh execution_data_manager-prepare_parameters --debug
glm_ocr-glmocrblockv1-run_remotely
- Function: run_remotely
- File: inference/core/workflows/core_steps/models/foundation/glm_ocr/v1.py
- Commit: HEAD
- Method: db_code_only
- DB Speedup: 3.24x
- Solve OK: False
- Duration: 20.8s
- Reward: correct=0.0, speedup=0.0, tests=1/1
Key errors
0: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
PydanticDeprecatedSince20: `allow_reuse` is deprecated and will be ignored; it should no longer be necessary. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
PydanticDeprecatedSince20: `min_items` is deprecated and will be removed, use `min_length` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
INFO: INCORRECT: 1/1 passed, 0 diffs
Reproduce: bash docker_e2e_test.sh glm_ocr-glmocrblockv1-run_remotely --debug
ocsort-ocsortblockv1-run
- Function: run
- File: inference/core/workflows/core_steps/trackers/ocsort/v1.py
- Commit: HEAD
- Method: db_code_only
- DB Speedup: 1.60x
- Solve OK: False
- Duration: 16.5s
- Reward: correct=0.0, speedup=0.0, tests=408/408
Key errors
@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
PydanticDeprecatedSince20: `allow_reuse` is deprecated and will be ignored; it should no longer be necessary. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
PydanticDeprecatedSince20: `min_items` is deprecated and will be removed, use `min_length` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
ERROR tests/codeflash_generated/test_run__behaviorinstrumented_0.py
ERROR tests/codeflash_generated/test_run__behaviorinstrumented_1.py
!!!!!!!!!!!!!!!!!!! Interrupted: 2 errors during collection !!!!!!!!!!!!!!!!!!!!
25 warnings, 2 errors in 0.84s
INFO: INCORRECT: 408/408 passed, 0 diffs
Reproduce: bash docker_e2e_test.sh ocsort-ocsortblockv1-run --debug
perception_encoder-inferencemodelsperceptionencoderadapter-preprocess
- Function: preprocess
- File: inference/models/perception_encoder/perception_encoder_inference_models.py
- Commit: 7648e452a70ff1aad09f017a0eb2ea4022b7e177
- Method: db_code_match
- DB Speedup: 2.47x
- Solve OK: False
- Duration: 37.7s
- Reward: correct=0.0, speedup=0.0, tests=2031/2235
Key errors
PydanticDeprecatedSince20: Using extra keyword arguments on `Field` is deprecated and will be removed. Use `json_schema_extra` instead. (Extra keys: 'optional'). Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_returns_tuple_with_correct_types[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_calls_preproc_image[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_metadata_is_empty_dict[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_preserves_image_dimensions[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_with_kwargs[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_multiple_calls_independence[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_with_1000_rapid_calls[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_with_varying_channel_counts[ 1 ]
INFO: INCORRECT: 2031/2235 passed, 204 diffs
Reproduce: bash docker_e2e_test.sh perception_encoder-inferencemodelsperceptionencoderadapter-preprocess --debug
qwen3vl-qwen3vlblockv1-run
- Function: run
- File: inference/core/workflows/core_steps/models/foundation/qwen3vl/v1.py
- Commit: c20359386c628a08bde69f5f3f780cedd782c50c
- Method: db_code_match
- DB Speedup: 1.45x
- Solve OK: False
- Duration: 27.9s
- Reward: correct=0.0, speedup=0.0, tests=41/42
Key errors
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_local_execution_mode[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_remote_execution_mode[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_none_prompt_and_system_prompt_local[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_none_prompt_and_system_prompt_remote[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_single_image[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_multiple_images[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestEdgeCases::test_run_with_invalid_execution_mode[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestEdgeCases::test_run_with_empty_prompt_string[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestEdgeCases::test_run_locally_with_none_api_key[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestLargeScale::test_run_with_different_model_versions[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestLargeScale::test_run_locally_with_repeated_calls[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestLargeScale::test_run_with_batch_type_handling[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_1.py::test_run_local_various_image_reference_types[ 1 ]
INFO: INCORRECT: 41/42 passed, 1 diffs
Reproduce: bash docker_e2e_test.sh qwen3vl-qwen3vlblockv1-run --debug
rfdetr-rfdetrobjectdetection-postprocess
- Function: postprocess
- File: inference/models/rfdetr/rfdetr.py
- Commit: 6078c43bae0aa336aef12e324b9a9008a35d2408
- Method: git_parent
- DB Speedup: 1.13x
- Solve OK: False
- Duration: 12.4s
- Reward: correct=0.0, speedup=0.0, tests=10/29
Key errors
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_bbox_format_conversion[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_sigmoid_stable_applied[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_empty_predictions[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_single_query_single_class[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_zero_confidence_threshold[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_very_small_image_dims[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_very_large_image_dims[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_bbox_clipping[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_class_id_filtering[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_data_type_conversion[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_negative_bbox_coordinates[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_large_batch_size[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_max_detections_large_value[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_precision_with_small_values[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_large_bbox_values[ 1 ]
INFO: INCORRECT: 10/29 passed, 19 diffs
Reproduce: bash docker_e2e_test.sh rfdetr-rfdetrobjectdetection-postprocess --debug
s3-s3sinkblockv1-_upload_separate_file
- Function: _upload_separate_file
- File: inference/core/workflows/core_steps/sinks/s3/v1.py
- Commit: 639c8e77ab90d6a43f32fe55a355373ae74e0924
- Method: db_code_match
- DB Speedup: 1.15x
- Solve OK: False
- Duration: 41.7s
- Reward: correct=0.0, speedup=0.0, tests=1249/1252
Key errors
.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
/workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1267: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
/workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1280: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
/workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1296: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
/workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1311: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
INFO: INCORRECT: 1249/1252 passed, 3 diffs
INFO: [stdout] WARNING S3 connection error on attempt 1/4: An unspecified error occurred vs WARNING Could not upload to S3: An unspecified error occurred
INFO: [stdout] WARNING Non-retryable S3 error (NoSuchBucket): An error occurred (NoSuchBucket) vs WARNING Could not upload to S3: An error occurred (NoSuchBucket) when calling
Reproduce: bash docker_e2e_test.sh s3-s3sinkblockv1-_upload_separate_file --debug
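The two `[stdout]` lines above each pair one captured warning with another, separated by `vs`. A minimal sketch of how such a pairwise comparison could be produced follows. It assumes, without confirmation from this report, that stdout is captured from two runs and compared line by line; `stdout_diffs` is a hypothetical name and the sample strings are shortened, illustrative inputs.

```python
from itertools import zip_longest
from typing import List


def stdout_diffs(first_run: str, second_run: str) -> List[str]:
    """Return 'a vs b' strings for every line position where the two captures differ."""
    diffs = []
    pairs = zip_longest(
        first_run.splitlines(), second_run.splitlines(), fillvalue="<missing>"
    )
    for a, b in pairs:
        if a != b:
            diffs.append(f"{a} vs {b}")
    return diffs


# Illustrative captures, shortened from the warning text in the log lines above.
run_a = "WARNING Non-retryable S3 error (NoSuchBucket)\nupload finished"
run_b = "WARNING Could not upload to S3\nupload finished"
for diff in stdout_diffs(run_a, run_b):
    print("[stdout]", diff)
```

Under this comparison, a change that only rewords a log message still counts as a behavioral diff, which is consistent with the 3 diffs reported for this task despite 1249 of 1252 tests passing.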
sort-sortblockv1-run
- Function: run
- File: inference/core/workflows/core_steps/trackers/sort/v1.py
- Commit: HEAD
- Method: db_code_only
- DB Speedup: 2.07x
- Solve OK: False
- Duration: 19.8s
- Reward: correct=0.0, speedup=0.0, tests=682/684
Key errors
PydanticDeprecatedSince20: `allow_reuse` is deprecated and will be ignored; it should no longer be necessary. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
PydanticDeprecatedSince20: `min_items` is deprecated and will be removed, use `min_length` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::test_new_then_already_seen_instance_detection[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::test_filter_out_unmatched_tracks_with_negative_id[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::test_create_tracker_receives_default_fps_when_missing[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::test_large_scale_many_instances_and_cache_behavior[ 1 ]
INFO: INCORRECT: 682/684 passed, 2 diffs
Reproduce: bash docker_e2e_test.sh sort-sortblockv1-run --debug