Codeflash RL Environment — Batch Validation Report

Summary

Metric	Count	%
Total tasks	123	100%
Solve passes	0	0%
Eval correct (all behavioral tests pass)	118	95%
Faster than original (speedup > 1.0)	92	74%
All test cases pass	119	96%
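
The percentage column appears to be truncated rather than rounded (118/123 is roughly 95.9%, shown as 95%). A minimal sketch that reproduces the column under that assumption; the helper name `truncated_percent` is hypothetical and not part of the report tooling:

```python
# Counts copied from the Summary table above.
TOTAL_TASKS = 123
COUNTS = {
    "Solve passes": 0,
    "Eval correct (all behavioral tests pass)": 118,
    "Faster than original (speedup > 1.0)": 92,
    "All test cases pass": 119,
}


def truncated_percent(count: int, total: int) -> int:
    """Integer percentage, truncated toward zero (assumed convention)."""
    return (100 * count) // total


for metric, count in COUNTS.items():
    print(f"{metric}\t{count}\t{truncated_percent(count, TOTAL_TASKS)}%")
```

With truncation, all four rows match the table; with conventional rounding the last three would read 96%, 75%, and 97%.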

Speedup Distribution (correct tasks only)

Slower (< 1x): 12 tasks
1-1.5x: 66 tasks (14 of them at exactly 1.0000x, i.e. no measurable change)
1.5-2x: 15 tasks
2-5x: 11 tasks
5-100x: 11 tasks
>100x: 3 tasks
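
The report does not state how bin boundaries are handled. The sketch below assumes half-open intervals with an inclusive lower bound, which is consistent with the 1.0000x rows being counted in the 1-1.5x bin. `BINS` and `bin_label` are illustrative names, not part of the report tooling:

```python
from collections import Counter

# (label, inclusive lower bound, exclusive upper bound).
# The half-open convention is an assumption; the report does not specify it.
BINS = [
    ("Slower (< 1x)", 0.0, 1.0),
    ("1-1.5x", 1.0, 1.5),
    ("1.5-2x", 1.5, 2.0),
    ("2-5x", 2.0, 5.0),
    ("5-100x", 5.0, 100.0),
    (">100x", 100.0, float("inf")),
]


def bin_label(speedup: float) -> str:
    """Return the distribution bin that a measured speedup falls into."""
    for label, lower, upper in BINS:
        if lower <= speedup < upper:
            return label
    raise ValueError(f"speedup must be non-negative, got {speedup}")


# A few measured speedups taken from the Successful Tasks table.
sample = [83318.0111, 27.1588, 2.6428, 1.9743, 1.0000, 0.4487]
print(Counter(bin_label(s) for s in sample))
```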

Successful Tasks (correct=1.0)

Task	Function	Speedup	Tests	Coverage	Quality	DB Speedup
models-prepare_multi_label_classification_response	`prepare_multi_label_classification_response`	83318.0111x	32/32	7.7%	low	68465.46x
depth_anything_v3-inferencemodelsdepthanythingv3adapter-predict	`predict`	393.3014x	36/36	23.2%		391.52x
detection_event_log-detectioneventlogblockv1-_evict_oldest_video	`_evict_oldest_video`	347.2878x	170/170	46.4%		15.92x
workflow_caller-_check_workflow_for_circular_references	`_check_workflow_for_circular_references`	27.1588x	41/41	31.1%	low	11.84x
semantic_segmentation-blockmanifest-describe_outputs	`describe_outputs`	23.1126x	2539/2539	38.9%	high	21.92x
sort-sortmanifest-describe_outputs	`describe_outputs`	19.9634x	8036/8036	89.7%	low	17.16x
event_writer-_extract_detail	`_extract_detail`	17.5569x	43/43	31.9%	high	13.46x
s3-deduct_csv_header	`deduct_csv_header`	15.2296x	54/54	38.6%		8.90x
dataset_upload-roboflowdatasetuploadblockv2-run	`run`	10.5161x	13/13	57.1%		9.14x
glm_ocr-blockmanifest-describe_outputs	`describe_outputs`	8.6396x	1035/1035	51.9%		9.33x
qwen3_5vl-blockmanifest-describe_outputs	`describe_outputs`	7.4974x	3228/3228	48.6%	medium	7.30x
http-with_route_exceptions	`with_route_exceptions`	6.5982x	1297/1297	8.1%		6.89x
introspection-prepare_operations_descriptions	`prepare_operations_descriptions`	6.3132x	147/147	82.5%	high	6.26x
qwen3_5vl-qwen35vlblockv1-run_remotely	`run_remotely`	6.1178x	23/23	69.4%		5.64x
depth_anything_v2-inferencemodelsdepthanythingv2adapter-predict	`predict`	4.1608x	38/38	60.2%	high	5.03x
qwen3_5vl-inferencemodelsqwen35vladapter-predict	`predict`	3.8783x	2275/2275	70.3%	low	3.72x
managers-modelmanager-_dispose_model_lock	`_dispose_model_lock`	2.6428x	2784/2784	14.7%		3.24x
semantic_segmentation-roboflowsemanticsegmentationmodelblockv1-_convert_to_sv_de	`_convert_to_sv_detections`	2.6393x	13/13	71.7%		2.22x
compiler-establish_step_execution_dimensionality	`establish_step_execution_dimensionality`	2.5414x	47/47	23.2%		2.37x
text_display-clamp_box	`clamp_box`	2.5233x	1210/1210	15.0%	high	2.80x
event_writer-_detections_to_v2_instance_segmentations	`_detections_to_v2_instance_segmentations`	2.3732x	36/36	41.2%		2.18x
clip_comparison-blockmanifest-get_required_cache_artifacts	`get_required_cache_artifacts`	2.2455x	130/130	26.6%	low	2.04x
models-baseinference-infer	`infer`	2.2264x	1037/1037	2.8%	low	2.32x
qwen3vl-inferencemodelsqwen3vladapter-map_inference_kwargs	`map_inference_kwargs`	2.1823x	1125/1125	26.8%		2.39x
compiler-verify_compatibility_of_input_data_lineage_with_control_flow_lineage	`verify_compatibility_of_input_data_lineage_with_control_flow_lineage`	2.0308x	39/39	26.4%		2.11x
execution_data_manager-executiondatamanager-_register_control_flow_output_for_no	`_register_control_flow_output_for_non_simd_step`	1.9743x	32/32	20.2%		2.65x
introspection-_get_property_name_options	`_get_property_name_options`	1.9663x	1053/1053	57.5%	medium	1.52x
compiler-_collect_unique_control_flow_lineages_with_step_mapping	`_collect_unique_control_flow_lineages_with_step_mapping`	1.9221x	33/33	24.3%	medium	1.95x
compiler-separate_control_flow_predecessors_from_data_providers	`separate_control_flow_predecessors_from_data_providers`	1.8701x	34/34	23.1%	high	1.87x
core-_forcetracerootsampler-get_description	`get_description`	1.8228x	3244/3244	1.5%		2.03x
event_writer-_build_event_data	`_build_event_data`	1.7964x	4732/4732	34.7%		1.74x
mask_area_measurement-maskareameasurementblockv1-run	`run`	1.7448x	39/39	93.0%	medium	1.65x
entities-workflowimagedata-copy_and_replace	`copy_and_replace`	1.7324x	2336/2336	72.1%	medium	2.04x
compiler-step_definition_allows_control_flow_references	`step_definition_allows_control_flow_references`	1.6712x	27/27	22.5%	medium	1.86x
mask_area_measurement-compute_detection_areas	`compute_detection_areas`	1.6700x	24/24	83.0%		1.46x
introspection-retrieve_selectors_from_union_definition	`retrieve_selectors_from_union_definition`	1.6607x	36/36	22.2%	high	1.98x
dataset_upload-maybe_register_datapoint_at_roboflow	`maybe_register_datapoint_at_roboflow`	1.6543x	1039/1039	55.6%	low	1.47x
introspection-_ref_to_def_name	`_ref_to_def_name`	1.6183x	1344/1344	27.5%	high	1.51x
compiler-is_control_flow_step	`is_control_flow_step`	1.5312x	1830/1830	15.3%	high	1.34x
easy_ocr-blockmanifest-get_supported_model_variants	`get_supported_model_variants`	1.5086x	2039/2039	57.5%	medium	1.31x
qwen3_5vl-qwen35vlblockv1-run	`run`	1.4974x	28/28	93.1%	low	1.69x
execution_data_manager-construct_mask_for_all_inputs_dimensionalities	`construct_mask_for_all_inputs_dimensionalities`	1.4607x	31/31	19.0%		1.51x
usage_tracking-usagecollector-_compute_execution_duration	`_compute_execution_duration`	1.4589x	2017/2017	27.5%	medium	1.55x
core-_url_for_safe_logging	`_url_for_safe_logging`	1.4244x	1055/1055	2.8%		1.47x
execution_data_manager-construct_simd_step_input	`construct_simd_step_input`	1.4230x	26/26	28.3%	low	1.37x
qwen3_5vl-inferencemodelsqwen35vladapter-map_inference_kwargs	`map_inference_kwargs`	1.4221x	1549/1549	64.9%	low	1.53x
common-add_inference_keypoints_to_sv_detections	`add_inference_keypoints_to_sv_detections`	1.4166x	30/30	4.1%		1.56x
webrtc_worker-videoframeprocessor-serialize_outputs_sync	`serialize_outputs_sync`	1.3984x	48/48	17.9%	medium	1.37x
webrtc_worker-videoframeprocessor-_check_termination	`_check_termination`	1.3728x	2029/2029	16.1%		1.36x
dataset_upload-is_prediction_registration_forbidden	`is_prediction_registration_forbidden`	1.3720x	2043/2043	31.7%	medium	1.44x
managers-modelmanager-infer_from_request_sync	`infer_from_request_sync`	1.3519x	3041/3041	13.7%	low	1.46x
email_notification-format_email_message	`format_email_message`	1.3377x	56/56	31.7%	high	1.35x
sequences-sequence_apply	`sequence_apply`	1.3332x	58/58	30.2%	medium	1.48x
entities-batch-remove_by_indices	`remove_by_indices`	1.3254x	44/44	65.4%		1.26x
execution_data_manager-filter_to_valid_prefix_chains	`filter_to_valid_prefix_chains`	1.3225x	32/32	15.3%	medium	1.32x
managers-rank_for_deletion	`rank_for_deletion`	1.2984x	106/106	7.3%		1.88x
workflow_caller-_extract_workflow_caller_ids_from_spec	`_extract_workflow_caller_ids_from_spec`	1.2910x	44/44	25.8%	low	1.34x
anthropic_claude-blockmanifest-get_air_gapped_availability	`get_air_gapped_availability`	1.2866x	2243/2243	16.4%	low	1.45x
stream-inferencepipeline-init_with_workflow	`init_with_workflow`	1.2696x	53/53	30.5%	low	1.10x
execution_data_manager-intersect_masks_per_dimension	`intersect_masks_per_dimension`	1.2412x	40/40	13.5%	medium	1.66x
dataset_upload-roboflowdatasetuploadblockv1-run	`run`	1.2300x	41/41	38.6%		1.26x
openai-execute_gpt_4v_request	`execute_gpt_4v_request`	1.2169x	37/37	25.8%	medium	2.00x
detection_event_log-detectioneventlogblockv1-_get_relative_time	`_get_relative_time`	1.2079x	41/41	43.0%		1.19x
http-_build_step_execution_error_response	`_build_step_execution_error_response`	1.1919x	1029/1029	1.0%	low	1.19x
executor-_run_workflow	`_run_workflow`	1.1903x	130/130	21.6%	low	1.22x
models-inferencemodelsobjectdetectionadapter-postprocess	`postprocess`	1.1900x	33/33	8.7%	high	1.23x
lmm-blockmanifest-get_air_gapped_availability	`get_air_gapped_availability`	1.1755x	4625/4625	46.1%		1.12x
compiler-get_lineage_derived_from_control_flow	`get_lineage_derived_from_control_flow`	1.1738x	33/33	23.8%	low	1.25x
notification-blockmanifest-get_air_gapped_availability	`get_air_gapped_availability`	1.1688x	1535/1535	43.0%	low	1.14x
trackers-instancecache-record_instance	`record_instance`	1.1657x	14857/14857	17.3%	low	1.14x
webrtc_worker-default_encoder	`default_encoder`	1.1476x	4071/4071	17.0%	medium	1.12x
mask_area_measurement-get_detection_area	`get_detection_area`	1.1469x	129/129	83.7%	medium	1.19x
common-serialise_sv_detections	`serialise_sv_detections`	1.1455x	149/149	5.1%		1.19x
execution_data_manager-get_masks_intersection_for_dimensions	`get_masks_intersection_for_dimensions`	1.1377x	36/36	16.9%	low	1.23x
yolo_world-blockmanifest-get_supported_model_variants	`get_supported_model_variants`	1.1281x	2232/2232	50.0%	high	1.29x
event_writer-_build_image_entry	`_build_image_entry`	1.1116x	1337/1337	60.6%	low	1.10x
dataset_upload-register_datapoint	`register_datapoint`	1.1109x	1138/1138	42.5%	low	1.16x
workflow_caller-workflowcallerblockv1-run	`run`	1.1071x	59/59	48.9%		1.13x
workflow_caller-_deserialize_output_value	`_deserialize_output_value`	1.1008x	139/139	27.4%		1.11x
cache-_get_block_type_identifier	`_get_block_type_identifier`	1.1000x	34/34	26.5%	medium	1.11x
dataset_upload-blockmanifest-get_air_gapped_availability	`get_air_gapped_availability`	1.0969x	3532/3532	29.3%	low	1.10x
moondream2-inferencemodelsmoondream2adapter-caption	`caption`	1.0864x	185/185	45.1%		1.11x
models-inferencemodelsobjectdetectionadapter-preprocess	`preprocess`	1.0818x	31/31	7.5%	medium	1.12x
heatmap-heatmapvisualizationblockv1-getannotator	`getAnnotator`	1.0738x	38/38	44.2%	low	1.41x
common-deserialize_detections_kind	`deserialize_detections_kind`	1.0705x	8/8	5.5%		1.15x
glm_ocr-inferencemodelsglmocradapter-postprocess	`postprocess`	1.0592x	1788/1788	54.5%		1.19x
sms-blockmanifest-get_air_gapped_availability	`get_air_gapped_availability`	1.0319x	2743/2743	15.4%	low	1.14x
models-inferencemodelskeypointsdetectionadapter-map_inference_kwargs	`map_inference_kwargs`	1.0299x	2232/2232	7.4%	low	1.13x
operations-build_sequence_apply_operation	`build_sequence_apply_operation`	1.0281x	25/25	30.3%	medium	1.35x
text_display-draw_background_with_alpha	`draw_background_with_alpha`	1.0173x	176/176	29.5%		1.18x
instance_segmentation-blockmanifest-get_compatible_task_types	`get_compatible_task_types`	1.0117x	3836/3836	27.6%	low	1.15x
core-get_workflow_cache_file	`get_workflow_cache_file`	1.0009x	1551/1551	2.8%	low	1.17x
http-with_route_exceptions_async	`with_route_exceptions_async`	1.0000x	1/1	0.8%	low	6.20x
cache-measure_memory_for_eviction	`measure_memory_for_eviction`	1.0000x	2/2	N/A		19.13x
compiler-establish_control_flow_edge	`establish_control_flow_edge`	1.0000x	26/26	24.4%	low	1.32x
compiler-find_longest_lineage_support	`find_longest_lineage_support`	1.0000x	51/51	23.1%	low	1.26x
core_steps-load_blocks	`load_blocks`	1.0000x	2/2	N/A	low	5.09x
dataset_upload-_expand_metadata_to_records	`_expand_metadata_to_records`	1.0000x	2/2	50.6%	low	1.62x
dataset_upload-_transpose_metadata_batches	`_transpose_metadata_batches`	1.0000x	2/2	50.6%	low	1.36x
dynamic_blocks-_create_clean_traceback	`_create_clean_traceback`	1.0000x	2/2	13.4%	low	3.22x
execution_data_manager-_transpose_dict_of_batches_if_needed	`_transpose_dict_of_batches_if_needed`	1.0000x	2/2	12.7%	low	1.61x
managers-experimentalmodelmanager-is_loaded	`is_loaded`	1.0000x	2/2	0.6%	low	2.31x
mask_area_measurement-areameasurementblockv1-run	`run`	1.0000x	2/2	N/A	low	1.25x
models-semanticsegmentationbaseonnxroboflowinferencemodel-make_response	`make_response`	1.0000x	5/5	11.5%	low	1.33x
overlap-overlapmanifest-describe_outputs	`describe_outputs`	1.0000x	2/2	N/A	low	6.36x
qwen3vl-_is_flash_attn_usable	`_is_flash_attn_usable`	1.0000x	2/2	17.4%	low	141.41x
segment_anything3-blockmanifest-get_supported_model_variants	`get_supported_model_variants`	0.9955x	5631/5631	9.5%	low	1.13x
clip_comparison-blockmanifest-get_supported_model_variants	`get_supported_model_variants`	0.9944x	3031/3031	25.9%	low	1.28x
object_detection-blockmanifest-get_compatible_task_types	`get_compatible_task_types`	0.9912x	3874/3874	27.1%	low	1.12x
decorators-withfixedsizecache-add_model	`add_model`	0.9912x	1224/1224	57.8%	low	1.17x
semantic_segmentation-blockmanifest-get_compatible_task_types	`get_compatible_task_types`	0.9864x	4124/4124	38.9%	low	1.11x
keypoint_detection-blockmanifest-get_compatible_task_types	`get_compatible_task_types`	0.9802x	2631/2631	26.7%	low	1.11x
multi_class_classification-blockmanifest-get_compatible_task_types	`get_compatible_task_types`	0.9794x	2622/2622	26.0%	low	1.14x
gaze-blockmanifest-get_supported_model_variants	`get_supported_model_variants`	0.9787x	2229/2229	33.1%	low	1.21x
yolo26-yolo26instancesegmentation-predict	`predict`	0.9747x	33/33	32.9%	low	1.26x
moondream2-blockmanifest-get_supported_model_variants	`get_supported_model_variants`	0.9730x	1238/1238	50.6%	low	1.15x
custom_metadata-blockmanifest-get_air_gapped_availability	`get_air_gapped_availability`	0.7484x	3724/3724	49.4%	low	1.12x
managers-list_files	`list_files`	0.4487x	99/99	8.9%		1.66x
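
For further analysis, the rows above can be loaded into typed records. This is a minimal sketch that assumes the columns are tab-separated, that an empty Quality cell means no rating was recorded, and that `N/A` means coverage was not measured. `TaskRow` and `parse_row` are hypothetical names:

```python
from typing import NamedTuple, Optional


class TaskRow(NamedTuple):
    task: str
    function: str
    speedup: float
    tests_passed: int
    tests_total: int
    coverage_pct: Optional[float]  # None when the report shows N/A
    quality: Optional[str]         # None when the cell is empty
    db_speedup: float


def parse_row(line: str) -> TaskRow:
    """Parse one tab-separated row of the Successful Tasks table."""
    task, function, speedup, tests, coverage, quality, db_speedup = (
        line.rstrip("\n").split("\t")
    )
    passed, total = tests.split("/")
    return TaskRow(
        task=task,
        function=function.strip("`"),
        speedup=float(speedup.rstrip("x")),
        tests_passed=int(passed),
        tests_total=int(total),
        coverage_pct=None if coverage == "N/A" else float(coverage.rstrip("%")),
        quality=quality or None,
        db_speedup=float(db_speedup.rstrip("x")),
    )


# Row copied from the table: measured speedup 1.0000x, DB speedup 19.13x.
row = parse_row(
    "cache-measure_memory_for_eviction\t`measure_memory_for_eviction`"
    "\t1.0000x\t2/2\tN/A\t\t19.13x"
)
print(row)
```

Records in this form make it straightforward to list rows where the measured Speedup and the DB Speedup disagree, such as the 1.0000x rows that report a much larger DB Speedup.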

Failed Tasks (5)

sort-sortblockv1-run

Function: run
File: inference/core/workflows/core_steps/trackers/sort/v1.py
Commit: HEAD
Method: db_code_only
DB Speedup: 2.07x
Solve OK: False
Duration: 18.7s
Reward: correct=0.0, speedup=0.0, tests=682/684

Key errors

  PydanticDeprecatedSince20: `allow_reuse` is deprecated and will be ignored; it should no longer be necessary. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: `min_items` is deprecated and will be removed, use `min_length` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::test_new_then_already_seen_instance_detection[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::test_filter_out_unmatched_tracks_with_negative_id[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::test_create_tracker_receives_default_fps_when_missing[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::test_large_scale_many_instances_and_cache_behavior[ 1 ]
INFO:   INCORRECT: 682/684 passed, 2 diffs

Reproduce: bash docker_e2e_test.sh sort-sortblockv1-run --debug

byte_tracker-bytetrackmanifest-describe_outputs

Function: describe_outputs
File: inference/core/workflows/core_steps/transformations/byte_tracker/v1.py
Commit: HEAD
Method: db_code_only
DB Speedup: 16.66x
Solve OK: False
Duration: 18.5s
Reward: correct=0.0, speedup=0.0, tests=6035/6036

Key errors

  PydanticDeprecatedSince20: `allow_reuse` is deprecated and will be ignored; it should no longer be necessary. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: `min_items` is deprecated and will be removed, use `min_length` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
FAILED tests/codeflash_generated/test_describe_outputs__behaviorinstrumented_0.py::test_describe_outputs_basic_structure_and_contents[ 1 ]
FAILED tests/codeflash_generated/test_describe_outputs__behaviorinstrumented_1.py::test_describe_outputs_already_seen_instances_kind[ 1 ]
INFO:   INCORRECT: 6035/6036 passed, 1 diffs

Reproduce: bash docker_e2e_test.sh byte_tracker-bytetrackmanifest-describe_outputs --debug

glm_ocr-glmocrblockv1-run_remotely

Function: run_remotely
File: inference/core/workflows/core_steps/models/foundation/glm_ocr/v1.py
Commit: HEAD
Method: db_code_only
DB Speedup: 3.24x
Solve OK: False
Duration: 18.1s
Reward: correct=0.0, speedup=0.0, tests=1/1

Key errors

0: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: `allow_reuse` is deprecated and will be ignored; it should no longer be necessary. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: `min_items` is deprecated and will be removed, use `min_length` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
INFO:   INCORRECT: 1/1 passed, 0 diffs

Reproduce: bash docker_e2e_test.sh glm_ocr-glmocrblockv1-run_remotely --debug

qwen3vl-qwen3vlblockv1-run

Function: run
File: inference/core/workflows/core_steps/models/foundation/qwen3vl/v1.py
Commit: c20359386c628a08bde69f5f3f780cedd782c50c
Method: db_code_match
DB Speedup: 1.45x
Solve OK: False
Duration: 26.6s
Reward: correct=0.0, speedup=0.0, tests=41/42

Key errors

FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_local_execution_mode[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_remote_execution_mode[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_none_prompt_and_system_prompt_local[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_none_prompt_and_system_prompt_remote[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_single_image[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_multiple_images[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestEdgeCases::test_run_with_invalid_execution_mode[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestEdgeCases::test_run_with_empty_prompt_string[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestEdgeCases::test_run_locally_with_none_api_key[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestLargeScale::test_run_with_different_model_versions[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestLargeScale::test_run_locally_with_repeated_calls[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestLargeScale::test_run_with_batch_type_handling[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_1.py::test_run_local_various_image_reference_types[ 1 ]
INFO:   INCORRECT: 41/42 passed, 1 diffs

Reproduce: bash docker_e2e_test.sh qwen3vl-qwen3vlblockv1-run --debug

clip-inferencemodelsclipadapter-compare

Function: compare
File: inference/models/clip/clip_inference_models.py
Commit: 7648e452a70ff1aad09f017a0eb2ea4022b7e177
Method: db_code_match
DB Speedup: 3.37x
Solve OK: False
Duration: 22.2s
Reward: correct=0.0, speedup=0.0, tests=135/136

Key errors

  PydanticDeprecatedSince20: `allow_reuse` is deprecated and will be ignored; it should no longer be necessary. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: `min_items` is deprecated and will be removed, use `min_length` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: Using extra keyword arguments on `Field` is deprecated and will be removed. Use `json_schema_extra` instead. (Extra keys: 'optional'). Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
FAILED tests/codeflash_generated/test_compare__behaviorinstrumented_0.py::TestInferenceModelsClipAdapterCompare::test_compare_empty_prompt_list[ 1 ]
INFO:   INCORRECT: 135/136 passed, 1 diffs

Reproduce: bash docker_e2e_test.sh clip-inferencemodelsclipadapter-compare --debug
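
The Key errors blocks above mix Pydantic deprecation warnings with the actual test failures. A minimal sketch for separating the two when triaging, assuming the line shapes shown in this report (`FAILED ...`, `PydanticDeprecatedSince20: ...`, and `INFO:   INCORRECT: a/b passed, n diffs`); `summarize_key_errors` is a hypothetical helper, not part of the environment:

```python
import re

# Matches the per-task summary line, e.g. "INCORRECT: 135/136 passed, 1 diffs".
SUMMARY_RE = re.compile(r"INCORRECT:\s+(\d+)/(\d+) passed,\s+(\d+) diffs")


def summarize_key_errors(lines):
    """Split a Key errors block into failures, warning count, and summary."""
    failed, warning_count, summary = [], 0, None
    for raw in lines:
        line = raw.strip()
        if line.startswith("FAILED "):
            failed.append(line[len("FAILED "):])
        elif "PydanticDeprecatedSince20" in line:
            warning_count += 1  # deprecation noise, not a test failure
        else:
            match = SUMMARY_RE.search(line)
            if match:
                summary = tuple(int(g) for g in match.groups())
    return {
        "failed": failed,
        "deprecation_warnings": warning_count,
        "summary": summary,  # (passed, total, diffs) or None
    }


# Lines copied from the clip-inferencemodelsclipadapter-compare block.
block = [
    "  PydanticDeprecatedSince20: The `dict` method is deprecated; use "
    "`model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. "
    "See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/",
    "FAILED tests/codeflash_generated/test_compare__behaviorinstrumented_0.py"
    "::TestInferenceModelsClipAdapterCompare::test_compare_empty_prompt_list[ 1 ]",
    "INFO:   INCORRECT: 135/136 passed, 1 diffs",
]
result = summarize_key_errors(block)
print(result["summary"], len(result["failed"]), result["deprecation_warnings"])
```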
