Codeflash RL Environment — Batch Validation Report
Summary
| Metric | Count | % |
|---|---|---|
| Total tasks | 123 | 100% |
| Solve passes | 0 | 0% |
| Eval correct (all behavioral tests pass) | 118 | 95% |
| Faster than original (speedup > 1.0) | 92 | 74% |
| All test cases pass | 119 | 96% |
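The percentages in the table appear to be truncated rather than rounded (118/123 is roughly 95.9% but is reported as 95%). A minimal sketch of that arithmetic, assuming floor truncation; the helper name `pct` is hypothetical and not part of the report tooling:

```python
def pct(count: int, total: int) -> int:
    """Return count/total as a whole-number percentage, truncated (not rounded)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (100 * count) // total


TOTAL_TASKS = 123

# Counts copied from the summary table above.
summary = {
    "Solve passes": 0,
    "Eval correct": 118,
    "Faster than original": 92,
    "All test cases pass": 119,
}

for metric, count in summary.items():
    print(f"{metric}: {count} ({pct(count, TOTAL_TASKS)}%)")
```

Under rounding, the 118 and 119 rows would read 96% and 97%, so the truncation assumption is what makes the sketch agree with the table.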
Speedup Distribution (correct tasks only)
- Slower (< 1x): 12 tasks
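- 1-1.5x: 66 tasks (includes the 14 tasks measured at exactly 1.0000x, which are not counted as faster than the original)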
- 1.5-2x: 15 tasks
- 2-5x: 11 tasks
- 5-100x: 11 tasks
- >100x: 3 tasks
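The bucket edges are not spelled out in the report. The sketch below assumes each bucket includes its lower bound and excludes its upper bound, which is consistent with the counts: tasks at exactly 1.0000x land in the 1-1.5x bucket. The function name `speedup_bucket` and the sample values are illustrative only.

```python
from collections import Counter


def speedup_bucket(speedup: float) -> str:
    """Map a speedup ratio to a distribution bucket (assumed half-open edges)."""
    if speedup < 1.0:
        return "Slower (< 1x)"
    if speedup < 1.5:
        return "1-1.5x"
    if speedup < 2.0:
        return "1.5-2x"
    if speedup < 5.0:
        return "2-5x"
    if speedup <= 100.0:
        return "5-100x"
    return ">100x"


# A handful of speedups taken from the successful-task table.
sample = [83318.0111, 27.1588, 4.1608, 1.9743, 1.4974, 1.0000, 0.9955, 0.4487]
histogram = Counter(speedup_bucket(s) for s in sample)

for bucket, count in histogram.items():
    print(f"{bucket}: {count}")
```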
Successful Tasks (correct=1.0)
| Task | Function | Speedup | Tests | Coverage | Quality | DB Speedup |
|---|---|---|---|---|---|---|
| models-prepare_multi_label_classification_response | prepare_multi_label_classification_response | 83318.0111x | 32/32 | 7.7% | low | 68465.46x |
| depth_anything_v3-inferencemodelsdepthanythingv3adapter-predict | predict | 393.3014x | 36/36 | 23.2% | | 391.52x |
| detection_event_log-detectioneventlogblockv1-_evict_oldest_video | _evict_oldest_video | 347.2878x | 170/170 | 46.4% | | 15.92x |
| workflow_caller-_check_workflow_for_circular_references | _check_workflow_for_circular_references | 27.1588x | 41/41 | 31.1% | low | 11.84x |
| semantic_segmentation-blockmanifest-describe_outputs | describe_outputs | 23.1126x | 2539/2539 | 38.9% | high | 21.92x |
| sort-sortmanifest-describe_outputs | describe_outputs | 19.9634x | 8036/8036 | 89.7% | low | 17.16x |
| event_writer-_extract_detail | _extract_detail | 17.5569x | 43/43 | 31.9% | high | 13.46x |
| s3-deduct_csv_header | deduct_csv_header | 15.2296x | 54/54 | 38.6% | | 8.90x |
| dataset_upload-roboflowdatasetuploadblockv2-run | run | 10.5161x | 13/13 | 57.1% | | 9.14x |
| glm_ocr-blockmanifest-describe_outputs | describe_outputs | 8.6396x | 1035/1035 | 51.9% | | 9.33x |
| qwen3_5vl-blockmanifest-describe_outputs | describe_outputs | 7.4974x | 3228/3228 | 48.6% | medium | 7.30x |
| http-with_route_exceptions | with_route_exceptions | 6.5982x | 1297/1297 | 8.1% | | 6.89x |
| introspection-prepare_operations_descriptions | prepare_operations_descriptions | 6.3132x | 147/147 | 82.5% | high | 6.26x |
| qwen3_5vl-qwen35vlblockv1-run_remotely | run_remotely | 6.1178x | 23/23 | 69.4% | | 5.64x |
| depth_anything_v2-inferencemodelsdepthanythingv2adapter-predict | predict | 4.1608x | 38/38 | 60.2% | high | 5.03x |
| qwen3_5vl-inferencemodelsqwen35vladapter-predict | predict | 3.8783x | 2275/2275 | 70.3% | low | 3.72x |
| managers-modelmanager-_dispose_model_lock | _dispose_model_lock | 2.6428x | 2784/2784 | 14.7% | | 3.24x |
| semantic_segmentation-roboflowsemanticsegmentationmodelblockv1-_convert_to_sv_de | _convert_to_sv_detections | 2.6393x | 13/13 | 71.7% | | 2.22x |
| compiler-establish_step_execution_dimensionality | establish_step_execution_dimensionality | 2.5414x | 47/47 | 23.2% | | 2.37x |
| text_display-clamp_box | clamp_box | 2.5233x | 1210/1210 | 15.0% | high | 2.80x |
| event_writer-_detections_to_v2_instance_segmentations | _detections_to_v2_instance_segmentations | 2.3732x | 36/36 | 41.2% | | 2.18x |
| clip_comparison-blockmanifest-get_required_cache_artifacts | get_required_cache_artifacts | 2.2455x | 130/130 | 26.6% | low | 2.04x |
| models-baseinference-infer | infer | 2.2264x | 1037/1037 | 2.8% | low | 2.32x |
| qwen3vl-inferencemodelsqwen3vladapter-map_inference_kwargs | map_inference_kwargs | 2.1823x | 1125/1125 | 26.8% | | 2.39x |
| compiler-verify_compatibility_of_input_data_lineage_with_control_flow_lineage | verify_compatibility_of_input_data_lineage_with_control_flow_lineage | 2.0308x | 39/39 | 26.4% | | 2.11x |
| execution_data_manager-executiondatamanager-_register_control_flow_output_for_no | _register_control_flow_output_for_non_simd_step | 1.9743x | 32/32 | 20.2% | | 2.65x |
| introspection-_get_property_name_options | _get_property_name_options | 1.9663x | 1053/1053 | 57.5% | medium | 1.52x |
| compiler-_collect_unique_control_flow_lineages_with_step_mapping | _collect_unique_control_flow_lineages_with_step_mapping | 1.9221x | 33/33 | 24.3% | medium | 1.95x |
| compiler-separate_control_flow_predecessors_from_data_providers | separate_control_flow_predecessors_from_data_providers | 1.8701x | 34/34 | 23.1% | high | 1.87x |
| core-_forcetracerootsampler-get_description | get_description | 1.8228x | 3244/3244 | 1.5% | | 2.03x |
| event_writer-_build_event_data | _build_event_data | 1.7964x | 4732/4732 | 34.7% | | 1.74x |
| mask_area_measurement-maskareameasurementblockv1-run | run | 1.7448x | 39/39 | 93.0% | medium | 1.65x |
| entities-workflowimagedata-copy_and_replace | copy_and_replace | 1.7324x | 2336/2336 | 72.1% | medium | 2.04x |
| compiler-step_definition_allows_control_flow_references | step_definition_allows_control_flow_references | 1.6712x | 27/27 | 22.5% | medium | 1.86x |
| mask_area_measurement-compute_detection_areas | compute_detection_areas | 1.6700x | 24/24 | 83.0% | | 1.46x |
| introspection-retrieve_selectors_from_union_definition | retrieve_selectors_from_union_definition | 1.6607x | 36/36 | 22.2% | high | 1.98x |
| dataset_upload-maybe_register_datapoint_at_roboflow | maybe_register_datapoint_at_roboflow | 1.6543x | 1039/1039 | 55.6% | low | 1.47x |
| introspection-_ref_to_def_name | _ref_to_def_name | 1.6183x | 1344/1344 | 27.5% | high | 1.51x |
| compiler-is_control_flow_step | is_control_flow_step | 1.5312x | 1830/1830 | 15.3% | high | 1.34x |
| easy_ocr-blockmanifest-get_supported_model_variants | get_supported_model_variants | 1.5086x | 2039/2039 | 57.5% | medium | 1.31x |
| qwen3_5vl-qwen35vlblockv1-run | run | 1.4974x | 28/28 | 93.1% | low | 1.69x |
| execution_data_manager-construct_mask_for_all_inputs_dimensionalities | construct_mask_for_all_inputs_dimensionalities | 1.4607x | 31/31 | 19.0% | | 1.51x |
| usage_tracking-usagecollector-_compute_execution_duration | _compute_execution_duration | 1.4589x | 2017/2017 | 27.5% | medium | 1.55x |
| core-_url_for_safe_logging | _url_for_safe_logging | 1.4244x | 1055/1055 | 2.8% | | 1.47x |
| execution_data_manager-construct_simd_step_input | construct_simd_step_input | 1.4230x | 26/26 | 28.3% | low | 1.37x |
| qwen3_5vl-inferencemodelsqwen35vladapter-map_inference_kwargs | map_inference_kwargs | 1.4221x | 1549/1549 | 64.9% | low | 1.53x |
| common-add_inference_keypoints_to_sv_detections | add_inference_keypoints_to_sv_detections | 1.4166x | 30/30 | 4.1% | | 1.56x |
| webrtc_worker-videoframeprocessor-serialize_outputs_sync | serialize_outputs_sync | 1.3984x | 48/48 | 17.9% | medium | 1.37x |
| webrtc_worker-videoframeprocessor-_check_termination | _check_termination | 1.3728x | 2029/2029 | 16.1% | | 1.36x |
| dataset_upload-is_prediction_registration_forbidden | is_prediction_registration_forbidden | 1.3720x | 2043/2043 | 31.7% | medium | 1.44x |
| managers-modelmanager-infer_from_request_sync | infer_from_request_sync | 1.3519x | 3041/3041 | 13.7% | low | 1.46x |
| email_notification-format_email_message | format_email_message | 1.3377x | 56/56 | 31.7% | high | 1.35x |
| sequences-sequence_apply | sequence_apply | 1.3332x | 58/58 | 30.2% | medium | 1.48x |
| entities-batch-remove_by_indices | remove_by_indices | 1.3254x | 44/44 | 65.4% | | 1.26x |
| execution_data_manager-filter_to_valid_prefix_chains | filter_to_valid_prefix_chains | 1.3225x | 32/32 | 15.3% | medium | 1.32x |
| managers-rank_for_deletion | rank_for_deletion | 1.2984x | 106/106 | 7.3% | | 1.88x |
| workflow_caller-_extract_workflow_caller_ids_from_spec | _extract_workflow_caller_ids_from_spec | 1.2910x | 44/44 | 25.8% | low | 1.34x |
| anthropic_claude-blockmanifest-get_air_gapped_availability | get_air_gapped_availability | 1.2866x | 2243/2243 | 16.4% | low | 1.45x |
| stream-inferencepipeline-init_with_workflow | init_with_workflow | 1.2696x | 53/53 | 30.5% | low | 1.10x |
| execution_data_manager-intersect_masks_per_dimension | intersect_masks_per_dimension | 1.2412x | 40/40 | 13.5% | medium | 1.66x |
| dataset_upload-roboflowdatasetuploadblockv1-run | run | 1.2300x | 41/41 | 38.6% | | 1.26x |
| openai-execute_gpt_4v_request | execute_gpt_4v_request | 1.2169x | 37/37 | 25.8% | medium | 2.00x |
| detection_event_log-detectioneventlogblockv1-_get_relative_time | _get_relative_time | 1.2079x | 41/41 | 43.0% | | 1.19x |
| http-_build_step_execution_error_response | _build_step_execution_error_response | 1.1919x | 1029/1029 | 1.0% | low | 1.19x |
| executor-_run_workflow | _run_workflow | 1.1903x | 130/130 | 21.6% | low | 1.22x |
| models-inferencemodelsobjectdetectionadapter-postprocess | postprocess | 1.1900x | 33/33 | 8.7% | high | 1.23x |
| lmm-blockmanifest-get_air_gapped_availability | get_air_gapped_availability | 1.1755x | 4625/4625 | 46.1% | | 1.12x |
| compiler-get_lineage_derived_from_control_flow | get_lineage_derived_from_control_flow | 1.1738x | 33/33 | 23.8% | low | 1.25x |
| notification-blockmanifest-get_air_gapped_availability | get_air_gapped_availability | 1.1688x | 1535/1535 | 43.0% | low | 1.14x |
| trackers-instancecache-record_instance | record_instance | 1.1657x | 14857/14857 | 17.3% | low | 1.14x |
| webrtc_worker-default_encoder | default_encoder | 1.1476x | 4071/4071 | 17.0% | medium | 1.12x |
| mask_area_measurement-get_detection_area | get_detection_area | 1.1469x | 129/129 | 83.7% | medium | 1.19x |
| common-serialise_sv_detections | serialise_sv_detections | 1.1455x | 149/149 | 5.1% | | 1.19x |
| execution_data_manager-get_masks_intersection_for_dimensions | get_masks_intersection_for_dimensions | 1.1377x | 36/36 | 16.9% | low | 1.23x |
| yolo_world-blockmanifest-get_supported_model_variants | get_supported_model_variants | 1.1281x | 2232/2232 | 50.0% | high | 1.29x |
| event_writer-_build_image_entry | _build_image_entry | 1.1116x | 1337/1337 | 60.6% | low | 1.10x |
| dataset_upload-register_datapoint | register_datapoint | 1.1109x | 1138/1138 | 42.5% | low | 1.16x |
| workflow_caller-workflowcallerblockv1-run | run | 1.1071x | 59/59 | 48.9% | | 1.13x |
| workflow_caller-_deserialize_output_value | _deserialize_output_value | 1.1008x | 139/139 | 27.4% | | 1.11x |
| cache-_get_block_type_identifier | _get_block_type_identifier | 1.1000x | 34/34 | 26.5% | medium | 1.11x |
| dataset_upload-blockmanifest-get_air_gapped_availability | get_air_gapped_availability | 1.0969x | 3532/3532 | 29.3% | low | 1.10x |
| moondream2-inferencemodelsmoondream2adapter-caption | caption | 1.0864x | 185/185 | 45.1% | | 1.11x |
| models-inferencemodelsobjectdetectionadapter-preprocess | preprocess | 1.0818x | 31/31 | 7.5% | medium | 1.12x |
| heatmap-heatmapvisualizationblockv1-getannotator | getAnnotator | 1.0738x | 38/38 | 44.2% | low | 1.41x |
| common-deserialize_detections_kind | deserialize_detections_kind | 1.0705x | 8/8 | 5.5% | | 1.15x |
| glm_ocr-inferencemodelsglmocradapter-postprocess | postprocess | 1.0592x | 1788/1788 | 54.5% | | 1.19x |
| sms-blockmanifest-get_air_gapped_availability | get_air_gapped_availability | 1.0319x | 2743/2743 | 15.4% | low | 1.14x |
| models-inferencemodelskeypointsdetectionadapter-map_inference_kwargs | map_inference_kwargs | 1.0299x | 2232/2232 | 7.4% | low | 1.13x |
| operations-build_sequence_apply_operation | build_sequence_apply_operation | 1.0281x | 25/25 | 30.3% | medium | 1.35x |
| text_display-draw_background_with_alpha | draw_background_with_alpha | 1.0173x | 176/176 | 29.5% | | 1.18x |
| instance_segmentation-blockmanifest-get_compatible_task_types | get_compatible_task_types | 1.0117x | 3836/3836 | 27.6% | low | 1.15x |
| core-get_workflow_cache_file | get_workflow_cache_file | 1.0009x | 1551/1551 | 2.8% | low | 1.17x |
| http-with_route_exceptions_async | with_route_exceptions_async | 1.0000x | 1/1 | 0.8% | low | 6.20x |
| cache-measure_memory_for_eviction | measure_memory_for_eviction | 1.0000x | 2/2 | N/A | | 19.13x |
| compiler-establish_control_flow_edge | establish_control_flow_edge | 1.0000x | 26/26 | 24.4% | low | 1.32x |
| compiler-find_longest_lineage_support | find_longest_lineage_support | 1.0000x | 51/51 | 23.1% | low | 1.26x |
| core_steps-load_blocks | load_blocks | 1.0000x | 2/2 | N/A | low | 5.09x |
| dataset_upload-_expand_metadata_to_records | _expand_metadata_to_records | 1.0000x | 2/2 | 50.6% | low | 1.62x |
| dataset_upload-_transpose_metadata_batches | _transpose_metadata_batches | 1.0000x | 2/2 | 50.6% | low | 1.36x |
| dynamic_blocks-_create_clean_traceback | _create_clean_traceback | 1.0000x | 2/2 | 13.4% | low | 3.22x |
| execution_data_manager-_transpose_dict_of_batches_if_needed | _transpose_dict_of_batches_if_needed | 1.0000x | 2/2 | 12.7% | low | 1.61x |
| managers-experimentalmodelmanager-is_loaded | is_loaded | 1.0000x | 2/2 | 0.6% | low | 2.31x |
| mask_area_measurement-areameasurementblockv1-run | run | 1.0000x | 2/2 | N/A | low | 1.25x |
| models-semanticsegmentationbaseonnxroboflowinferencemodel-make_response | make_response | 1.0000x | 5/5 | 11.5% | low | 1.33x |
| overlap-overlapmanifest-describe_outputs | describe_outputs | 1.0000x | 2/2 | N/A | low | 6.36x |
| qwen3vl-_is_flash_attn_usable | _is_flash_attn_usable | 1.0000x | 2/2 | 17.4% | low | 141.41x |
| segment_anything3-blockmanifest-get_supported_model_variants | get_supported_model_variants | 0.9955x | 5631/5631 | 9.5% | low | 1.13x |
| clip_comparison-blockmanifest-get_supported_model_variants | get_supported_model_variants | 0.9944x | 3031/3031 | 25.9% | low | 1.28x |
| object_detection-blockmanifest-get_compatible_task_types | get_compatible_task_types | 0.9912x | 3874/3874 | 27.1% | low | 1.12x |
| decorators-withfixedsizecache-add_model | add_model | 0.9912x | 1224/1224 | 57.8% | low | 1.17x |
| semantic_segmentation-blockmanifest-get_compatible_task_types | get_compatible_task_types | 0.9864x | 4124/4124 | 38.9% | low | 1.11x |
| keypoint_detection-blockmanifest-get_compatible_task_types | get_compatible_task_types | 0.9802x | 2631/2631 | 26.7% | low | 1.11x |
| multi_class_classification-blockmanifest-get_compatible_task_types | get_compatible_task_types | 0.9794x | 2622/2622 | 26.0% | low | 1.14x |
| gaze-blockmanifest-get_supported_model_variants | get_supported_model_variants | 0.9787x | 2229/2229 | 33.1% | low | 1.21x |
| yolo26-yolo26instancesegmentation-predict | predict | 0.9747x | 33/33 | 32.9% | low | 1.26x |
| moondream2-blockmanifest-get_supported_model_variants | get_supported_model_variants | 0.9730x | 1238/1238 | 50.6% | low | 1.15x |
| custom_metadata-blockmanifest-get_air_gapped_availability | get_air_gapped_availability | 0.7484x | 3724/3724 | 49.4% | low | 1.12x |
| managers-list_files | list_files | 0.4487x | 99/99 | 8.9% | | 1.66x |
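Several rows report a measured speedup that differs sharply from the DB Speedup column, for example qwen3vl-_is_flash_attn_usable (1.0000x measured, 141.41x in the DB) and managers-list_files (0.4487x measured, 1.66x in the DB). The sketch below flags such rows for manual review. The factor-of-two threshold, the `Row` type, and the `divergent` helper are assumptions made for illustration and are not part of the report tooling.

```python
from typing import List, NamedTuple, Tuple


class Row(NamedTuple):
    task: str
    speedup: float      # "Speedup" column, measured in this batch
    db_speedup: float   # "DB Speedup" column


def divergent(rows: List[Row], factor: float = 2.0) -> List[Tuple[str, float]]:
    """Return (task, ratio) for rows whose two speedups differ by >= factor."""
    flagged = []
    for row in rows:
        high = max(row.speedup, row.db_speedup)
        low = min(row.speedup, row.db_speedup)
        ratio = high / low
        if ratio >= factor:
            flagged.append((row.task, round(ratio, 2)))
    return flagged


# Values copied from the table above.
rows = [
    Row("qwen3vl-_is_flash_attn_usable", 1.0000, 141.41),
    Row("managers-list_files", 0.4487, 1.66),
    Row("detection_event_log-detectioneventlogblockv1-_evict_oldest_video", 347.2878, 15.92),
    Row("compiler-separate_control_flow_predecessors_from_data_providers", 1.8701, 1.87),
]

for task, ratio in divergent(rows):
    print(f"{task}: speedups differ by {ratio}x")
```

Rows with very few tests (such as the 2/2 entries at 1.0000x) may simply not exercise the optimized path, so a flag here is a prompt to look closer rather than evidence of a regression.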
Failed Tasks (5)
sort-sortblockv1-run
- Function: run
- File: inference/core/workflows/core_steps/trackers/sort/v1.py
- Commit: HEAD
- Method: db_code_only
- DB Speedup: 2.07x
- Solve OK: False
- Duration: 18.7s
- Reward: correct=0.0, speedup=0.0, tests=682/684
Key errors
PydanticDeprecatedSince20: `allow_reuse` is deprecated and will be ignored; it should no longer be necessary. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
PydanticDeprecatedSince20: `min_items` is deprecated and will be removed, use `min_length` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::test_new_then_already_seen_instance_detection[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::test_filter_out_unmatched_tracks_with_negative_id[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::test_create_tracker_receives_default_fps_when_missing[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::test_large_scale_many_instances_and_cache_behavior[ 1 ]
INFO: INCORRECT: 682/684 passed, 2 diffs
Reproduce: bash docker_e2e_test.sh sort-sortblockv1-run --debug
byte_tracker-bytetrackmanifest-describe_outputs
- Function: describe_outputs
- File: inference/core/workflows/core_steps/transformations/byte_tracker/v1.py
- Commit: HEAD
- Method: db_code_only
- DB Speedup: 16.66x
- Solve OK: False
- Duration: 18.5s
- Reward: correct=0.0, speedup=0.0, tests=6035/6036
Key errors
PydanticDeprecatedSince20: `allow_reuse` is deprecated and will be ignored; it should no longer be necessary. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
PydanticDeprecatedSince20: `min_items` is deprecated and will be removed, use `min_length` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
FAILED tests/codeflash_generated/test_describe_outputs__behaviorinstrumented_0.py::test_describe_outputs_basic_structure_and_contents[ 1 ]
FAILED tests/codeflash_generated/test_describe_outputs__behaviorinstrumented_1.py::test_describe_outputs_already_seen_instances_kind[ 1 ]
INFO: INCORRECT: 6035/6036 passed, 1 diffs
Reproduce: bash docker_e2e_test.sh byte_tracker-bytetrackmanifest-describe_outputs --debug
glm_ocr-glmocrblockv1-run_remotely
- Function: run_remotely
- File: inference/core/workflows/core_steps/models/foundation/glm_ocr/v1.py
- Commit: HEAD
- Method: db_code_only
- DB Speedup: 3.24x
- Solve OK: False
- Duration: 18.1s
- Reward: correct=0.0, speedup=0.0, tests=1/1
Key errors
0: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
PydanticDeprecatedSince20: `allow_reuse` is deprecated and will be ignored; it should no longer be necessary. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
PydanticDeprecatedSince20: `min_items` is deprecated and will be removed, use `min_length` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
INFO: INCORRECT: 1/1 passed, 0 diffs
Reproduce: bash docker_e2e_test.sh glm_ocr-glmocrblockv1-run_remotely --debug
qwen3vl-qwen3vlblockv1-run
- Function: run
- File: inference/core/workflows/core_steps/models/foundation/qwen3vl/v1.py
- Commit: c20359386c628a08bde69f5f3f780cedd782c50c
- Method: db_code_match
- DB Speedup: 1.45x
- Solve OK: False
- Duration: 26.6s
- Reward: correct=0.0, speedup=0.0, tests=41/42
Key errors
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_local_execution_mode[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_remote_execution_mode[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_none_prompt_and_system_prompt_local[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_none_prompt_and_system_prompt_remote[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_single_image[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_multiple_images[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestEdgeCases::test_run_with_invalid_execution_mode[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestEdgeCases::test_run_with_empty_prompt_string[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestEdgeCases::test_run_locally_with_none_api_key[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestLargeScale::test_run_with_different_model_versions[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestLargeScale::test_run_locally_with_repeated_calls[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestLargeScale::test_run_with_batch_type_handling[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_1.py::test_run_local_various_image_reference_types[ 1 ]
INFO: INCORRECT: 41/42 passed, 1 diffs
Reproduce: bash docker_e2e_test.sh qwen3vl-qwen3vlblockv1-run --debug
clip-inferencemodelsclipadapter-compare
- Function: compare
- File: inference/models/clip/clip_inference_models.py
- Commit: 7648e452a70ff1aad09f017a0eb2ea4022b7e177
- Method: db_code_match
- DB Speedup: 3.37x
- Solve OK: False
- Duration: 22.2s
- Reward: correct=0.0, speedup=0.0, tests=135/136
Key errors
PydanticDeprecatedSince20: `allow_reuse` is deprecated and will be ignored; it should no longer be necessary. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
PydanticDeprecatedSince20: `min_items` is deprecated and will be removed, use `min_length` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
PydanticDeprecatedSince20: Using extra keyword arguments on `Field` is deprecated and will be removed. Use `json_schema_extra` instead. (Extra keys: 'optional'). Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
FAILED tests/codeflash_generated/test_compare__behaviorinstrumented_0.py::TestInferenceModelsClipAdapterCompare::test_compare_empty_prompt_list[ 1 ]
INFO: INCORRECT: 135/136 passed, 1 diffs
Reproduce: bash docker_e2e_test.sh clip-inferencemodelsclipadapter-compare --debug
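Most of the "Key errors" lines in the failed entries are Pydantic deprecation warnings that recur across tasks, so they are probably noise rather than the cause of a failure. The sketch below separates those warnings from the `FAILED` test IDs and the `INCORRECT` summary line. The `triage` helper and its output shape are hypothetical; the matching rules are inferred only from the log lines shown in this report.

```python
import re
from typing import Dict, List

# Matches the summary line format seen above, e.g. "INFO: INCORRECT: 682/684 passed, 2 diffs".
SUMMARY_RE = re.compile(r"INCORRECT: (\d+)/(\d+) passed, (\d+) diffs")


def triage(lines: List[str]) -> Dict[str, object]:
    """Split 'Key errors' lines into failed tests, warning count, and summary."""
    result: Dict[str, object] = {"failed_tests": [], "warnings": 0, "summary": None}
    for raw in lines:
        line = raw.strip()
        if line.startswith("FAILED "):
            # Drop the trailing bracketed marker (e.g. "[ 1 ]") from the test ID.
            test_id = line[len("FAILED "):].split("[", 1)[0].strip()
            result["failed_tests"].append(test_id)
        elif "Deprecated in Pydantic V2.0" in line:
            result["warnings"] += 1
        else:
            match = SUMMARY_RE.search(line)
            if match:
                passed, total, diffs = (int(g) for g in match.groups())
                result["summary"] = {"passed": passed, "total": total, "diffs": diffs}
    return result


# Lines copied from the sort-sortblockv1-run entry.
log = [
    "PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. "
    "Deprecated in Pydantic V2.0 to be removed in V3.0. "
    "See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/",
    "FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::"
    "test_new_then_already_seen_instance_detection[ 1 ]",
    "FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::"
    "test_filter_out_unmatched_tracks_with_negative_id[ 1 ]",
    "INFO: INCORRECT: 682/684 passed, 2 diffs",
]

report = triage(log)
print(report["summary"])
for test_id in report["failed_tests"]:
    print("failed:", test_id)
```

Note that the split on `[` would also strip pytest parametrization suffixes, which is acceptable for grouping failures but loses the parameter ID.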