mirror of https://github.com/codeflash-ai/codeflash-internal.git synced 2026-05-04 18:25:18 +00:00

misrasaurabh1 b3f164dcda rl env files

2026-04-16 16:31:25 -07:00

Codeflash RL Environment — Batch Validation Report

Summary

Metric	Count	% of total
Total tasks	166	100%
Solve passes	0	0%
Eval correct (all behavioral tests pass)	153	92%
Faster than original (speedup > 1.0)	122	73%
All test cases pass	157	94%
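
The percentages in the summary are consistent with truncating each ratio to a whole number rather than rounding it (157/166 is about 94.6%, reported as 94%). The sketch below recomputes them under that assumption; the helper name `pct_of_total` and the truncation convention are assumptions, not something the report states.

```python
TOTAL_TASKS = 166

# Counts copied from the summary table.
counts = {
    "Solve passes": 0,
    "Eval correct": 153,
    "Faster than original": 122,
    "All test cases pass": 157,
}


def pct_of_total(count: int, total: int = TOTAL_TASKS) -> int:
    """Whole-number percentage, truncated (assumed convention)."""
    return count * 100 // total


percentages = {name: pct_of_total(n) for name, n in counts.items()}

for name, pct in percentages.items():
    print(f"{name}: {counts[name]} ({pct}%)")
```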

Speedup Distribution (correct tasks only)

Slower (< 1x): 13 tasks
1-1.5x: 85 tasks
1.5-2x: 18 tasks
2-5x: 14 tasks
5-100x: 17 tasks
>100x: 6 tasks
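
The bucket counts sum to the 153 correct tasks, and they match the per-task table if each range is read as a half-open interval (a speedup of exactly 1.0000x lands in "1-1.5x"). A minimal sketch of that bucketing follows; the interval boundaries and the name `bucket_for` are assumptions, since no task sits exactly on 1.5x, 2x, 5x, or 100x.

```python
from collections import Counter

# (label, inclusive lower bound, exclusive upper bound).
# Boundary handling is an assumption; the report does not define it.
BUCKETS = [
    ("Slower (< 1x)", 0.0, 1.0),
    ("1-1.5x", 1.0, 1.5),
    ("1.5-2x", 1.5, 2.0),
    ("2-5x", 2.0, 5.0),
    ("5-100x", 5.0, 100.0),
    (">100x", 100.0, float("inf")),
]


def bucket_for(speedup: float) -> str:
    """Return the label of the bucket that contains `speedup`."""
    for label, low, high in BUCKETS:
        if low <= speedup < high:
            return label
    raise ValueError(f"speedup must be non-negative, got {speedup}")


# A few speedups taken from the per-task table.
sample = [74810.9605, 27.0112, 4.7753, 1.9788, 1.0000, 0.9999]
histogram = Counter(bucket_for(s) for s in sample)
```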

Successful Tasks (correct=1.0)

Task	Function	Speedup	Tests	Coverage	Quality	DB Speedup
models-prepare_multi_label_classification_response	`prepare_multi_label_classification_response`	74810.9605x	32/32	7.7%	low	68465.46x
introspection-prepare_operators_descriptions	`prepare_operators_descriptions`	15514.0489x	1057/1057	35.0%	low	14150.75x
decorators-withfixedsizecache-memory_pressure_detected	`memory_pressure_detected`	1747.8138x	132/132	37.8%		917.09x
depth_anything_v3-inferencemodelsdepthanythingv3adapter-predict	`predict`	397.7983x	36/36	23.2%		391.52x
detection_event_log-detectioneventlogblockv1-_evict_oldest_video	`_evict_oldest_video`	343.5465x	170/170	46.4%		15.92x
camera-_generate_grid_colors	`_generate_grid_colors`	269.6811x	1901/1901	9.0%		218.30x
workflow_caller-_check_workflow_for_circular_references	`_check_workflow_for_circular_references`	27.0112x	41/41	31.1%	low	11.84x
semantic_segmentation-blockmanifest-describe_outputs	`describe_outputs`	23.9434x	2539/2539	38.9%	high	21.92x
dynamic_blocks-build_traceback_string	`build_traceback_string`	20.3358x	2047/2047	16.0%	low	13.38x
sort-sortmanifest-describe_outputs	`describe_outputs`	19.1846x	8036/8036	89.7%	low	17.16x
bytetrack-bytetrackmanifest-describe_outputs	`describe_outputs`	19.0393x	3033/3033	90.2%	low	17.56x
event_writer-_extract_detail	`_extract_detail`	16.4218x	43/43	31.9%		13.46x
workflow_caller-_describe_outputs_from_spec	`_describe_outputs_from_spec`	16.0714x	25/25	23.1%	low	12.94x
managers-try_releasing_cuda_memory	`try_releasing_cuda_memory`	15.0925x	1006/1006	10.8%		1.22x
cache-_slugify_model_id	`_slugify_model_id`	13.9914x	1050/1050	26.1%	medium	11.21x
s3-deduct_csv_header	`deduct_csv_header`	13.1148x	54/54	38.6%	high	8.90x
dynamic_blocks-create_dynamic_module	`create_dynamic_module`	10.8808x	142/142	27.4%		12.31x
dataset_upload-roboflowdatasetuploadblockv2-run	`run`	10.0212x	13/13	57.1%		9.14x
glm_ocr-blockmanifest-describe_outputs	`describe_outputs`	8.9607x	1035/1035	51.9%		9.33x
qwen3_5vl-blockmanifest-describe_outputs	`describe_outputs`	7.6146x	3228/3228	48.6%	medium	7.30x
http-with_route_exceptions	`with_route_exceptions`	6.7561x	1297/1297	8.1%		6.89x
qwen3_5vl-qwen35vlblockv1-run_remotely	`run_remotely`	6.3813x	23/23	69.4%		5.64x
introspection-prepare_operations_descriptions	`prepare_operations_descriptions`	6.2270x	147/147	82.5%	high	6.26x
core_steps-load_kinds	`load_kinds`	4.7753x	1153/1153	42.0%	high	3.68x
depth_anything_v2-inferencemodelsdepthanythingv2adapter-predict	`predict`	4.0460x	38/38	60.2%		5.03x
qwen3_5vl-inferencemodelsqwen35vladapter-predict	`predict`	3.2474x	2275/2275	70.3%	low	3.72x
core-_prepare_workflow_response_cache_key	`_prepare_workflow_response_cache_key`	3.0359x	7539/7539	2.7%	medium	2.39x
compiler-establish_step_execution_dimensionality	`establish_step_execution_dimensionality`	2.6841x	47/47	23.2%		2.37x
semantic_segmentation-roboflowsemanticsegmentationmodelblockv1-_convert_to_sv_de	`_convert_to_sv_detections`	2.6825x	13/13	71.7%		2.22x
managers-modelmanager-_dispose_model_lock	`_dispose_model_lock`	2.5469x	2784/2784	14.7%		3.24x
text_display-clamp_box	`clamp_box`	2.5308x	1210/1210	15.0%	high	2.80x
event_writer-_detections_to_v2_instance_segmentations	`_detections_to_v2_instance_segmentations`	2.3078x	36/36	41.2%		2.18x
models-baseinference-infer	`infer`	2.2736x	1037/1037	2.8%	low	2.32x
qwen3vl-inferencemodelsqwen3vladapter-map_inference_kwargs	`map_inference_kwargs`	2.1968x	1125/1125	26.8%	medium	2.39x
clip_comparison-blockmanifest-get_required_cache_artifacts	`get_required_cache_artifacts`	2.1802x	130/130	26.6%	low	2.04x
introspection-_get_property_name_options	`_get_property_name_options`	2.0733x	1053/1053	57.5%		1.52x
compiler-verify_compatibility_of_input_data_lineage_with_control_flow_lineage	`verify_compatibility_of_input_data_lineage_with_control_flow_lineage`	2.0635x	39/39	26.4%		2.11x
execution_data_manager-executiondatamanager-_register_control_flow_output_for_no	`_register_control_flow_output_for_non_simd_step`	1.9788x	32/32	20.2%		2.65x
core-_forcetracerootsampler-get_description	`get_description`	1.9079x	3244/3244	1.6%		2.03x
enterprise_blocks-load_enterprise_blocks	`load_enterprise_blocks`	1.8940x	1936/1936	32.2%	medium	1.45x
entities-workflowimagedata-copy_and_replace	`copy_and_replace`	1.8862x	2336/2336	72.1%		2.04x
compiler-_collect_unique_control_flow_lineages_with_step_mapping	`_collect_unique_control_flow_lineages_with_step_mapping`	1.8585x	33/33	24.3%		1.95x
mask_area_measurement-maskareameasurementblockv1-run	`run`	1.8337x	39/39	93.0%		1.65x
compiler-separate_control_flow_predecessors_from_data_providers	`separate_control_flow_predecessors_from_data_providers`	1.8200x	34/34	23.1%	high	1.87x
event_writer-_build_event_data	`_build_event_data`	1.8120x	4732/4732	34.7%	medium	1.74x
compiler-step_definition_allows_control_flow_references	`step_definition_allows_control_flow_references`	1.7192x	27/27	22.5%	medium	1.86x
introspection-retrieve_selectors_from_union_definition	`retrieve_selectors_from_union_definition`	1.6618x	36/36	22.2%	high	1.98x
dataset_upload-maybe_register_datapoint_at_roboflow	`maybe_register_datapoint_at_roboflow`	1.6392x	1039/1039	55.6%	low	1.47x
cache-is_block_cached	`is_block_cached`	1.6255x	53/53	27.9%	low	1.36x
introspection-_ref_to_def_name	`_ref_to_def_name`	1.6030x	1344/1344	27.5%	high	1.51x
mask_area_measurement-compute_detection_areas	`compute_detection_areas`	1.5829x	24/24	83.0%		1.46x
managers-list_files	`list_files`	1.5797x	99/99	8.9%		1.66x
dynamic_blocks-assembly_custom_python_block	`assembly_custom_python_block`	1.5787x	135/135	36.7%	low	1.61x
cache-get_cached_foundation_models	`get_cached_foundation_models`	1.5691x	32/32	34.7%	low	1.46x
compiler-is_control_flow_step	`is_control_flow_step`	1.5035x	1830/1830	15.3%	medium	1.34x
execution_data_manager-construct_mask_for_all_inputs_dimensionalities	`construct_mask_for_all_inputs_dimensionalities`	1.4808x	31/31	19.0%	low	1.51x
common-deserialize_image_kind	`deserialize_image_kind`	1.4732x	1506/1506	7.4%		1.42x
usage_tracking-usagecollector-_compute_execution_duration	`_compute_execution_duration`	1.4709x	2017/2017	27.5%	medium	1.55x
core-_url_for_safe_logging	`_url_for_safe_logging`	1.4616x	1055/1055	2.8%		1.47x
dataset_upload-is_prediction_registration_forbidden	`is_prediction_registration_forbidden`	1.4475x	2043/2043	31.7%		1.44x
qwen3_5vl-qwen35vlblockv1-run	`run`	1.4215x	28/28	93.1%	low	1.69x
execution_data_manager-construct_simd_step_input	`construct_simd_step_input`	1.4138x	26/26	28.3%	low	1.37x
cache-get_task_type_to_block_mapping	`get_task_type_to_block_mapping`	1.4136x	30/30	29.6%	low	1.39x
anthropic_claude-blockmanifest-get_air_gapped_availability	`get_air_gapped_availability`	1.3907x	2243/2243	16.4%	low	1.45x
qwen3_5vl-inferencemodelsqwen35vladapter-map_inference_kwargs	`map_inference_kwargs`	1.3866x	1549/1549	64.9%	low	1.53x
email_notification-format_email_message	`format_email_message`	1.3730x	56/56	31.7%	high	1.35x
dataset_upload-register_datapoint_at_roboflow	`register_datapoint_at_roboflow`	1.3720x	2037/2037	38.6%	low	1.32x
common-add_inference_keypoints_to_sv_detections	`add_inference_keypoints_to_sv_detections`	1.3657x	30/30	4.1%		1.56x
core-get_workflow_specification	`get_workflow_specification`	1.3586x	1157/1157	3.6%	low	1.56x
sequences-sequence_apply	`sequence_apply`	1.3540x	58/58	30.2%	medium	1.48x
managers-modelmanager-infer_from_request_sync	`infer_from_request_sync`	1.3382x	3041/3041	13.7%	low	1.46x
entities-batch-remove_by_indices	`remove_by_indices`	1.3240x	44/44	65.4%	high	1.26x
cache-_is_model_cached	`_is_model_cached`	1.3055x	45/45	27.0%		1.24x
workflow_caller-_extract_workflow_caller_ids_from_spec	`_extract_workflow_caller_ids_from_spec`	1.3007x	44/44	25.8%		1.34x
openai-execute_gpt_4v_request	`execute_gpt_4v_request`	1.3006x	37/37	25.8%	medium	2.00x
cache-is_model_cached	`is_model_cached`	1.2989x	55/55	28.7%	high	1.22x
core-load_cached_workflow_response	`load_cached_workflow_response`	1.2951x	12126/12126	2.8%	low	1.38x
execution_data_manager-filter_to_valid_prefix_chains	`filter_to_valid_prefix_chains`	1.2932x	32/32	15.3%		1.32x
execution_data_manager-intersect_masks_per_dimension	`intersect_masks_per_dimension`	1.2908x	40/40	13.5%	high	1.66x
webrtc_worker-videoframeprocessor-serialize_outputs_sync	`serialize_outputs_sync`	1.2882x	48/48	17.9%	low	1.37x
webrtc_worker-videoframeprocessor-_check_termination	`_check_termination`	1.2650x	2029/2029	16.1%		1.36x
workflow_caller-_fetch_workflow_spec_for_validation	`_fetch_workflow_spec_for_validation`	1.2390x	1547/1547	23.1%		1.33x
dataset_upload-roboflowdatasetuploadblockv1-run	`run`	1.2328x	41/41	38.6%		1.26x
executor-_run_workflow	`_run_workflow`	1.2137x	130/130	21.6%	low	1.22x
managers-rank_for_deletion	`rank_for_deletion`	1.2067x	106/106	7.3%		1.88x
detection_event_log-detectioneventlogblockv1-_get_relative_time	`_get_relative_time`	1.2017x	41/41	43.0%		1.19x
http-_build_step_execution_error_response	`_build_step_execution_error_response`	1.1955x	1029/1029	1.0%	low	1.19x
common-serialise_sv_detections	`serialise_sv_detections`	1.1825x	149/149	5.1%		1.19x
models-inferencemodelsobjectdetectionadapter-postprocess	`postprocess`	1.1819x	33/33	8.7%	medium	1.23x
text_display-draw_background_with_alpha	`draw_background_with_alpha`	1.1776x	176/176	29.5%	high	1.18x
core-record_inference	`record_inference`	1.1693x	3033/3033	1.6%	low	1.22x
webrtc_worker-default_encoder	`default_encoder`	1.1565x	4071/4071	17.0%	medium	1.12x
easy_ocr-blockmanifest-get_supported_model_variants	`get_supported_model_variants`	1.1535x	2039/2039	57.5%	low	1.31x
execution_data_manager-get_masks_intersection_for_dimensions	`get_masks_intersection_for_dimensions`	1.1505x	36/36	16.9%	low	1.23x
mask_area_measurement-get_detection_area	`get_detection_area`	1.1443x	129/129	83.7%		1.19x
email_notification-apply_operations_to_message_parameters	`apply_operations_to_message_parameters`	1.1389x	44/44	29.5%	low	1.15x
dataset_upload-register_datapoint	`register_datapoint`	1.1321x	1138/1138	42.5%	low	1.16x
compiler-get_lineage_derived_from_control_flow	`get_lineage_derived_from_control_flow`	1.1298x	33/33	23.8%	low	1.25x
yolo_world-blockmanifest-get_supported_model_variants	`get_supported_model_variants`	1.1252x	2232/2232	50.0%	medium	1.29x
trackers-instancecache-record_instance	`record_instance`	1.1226x	14857/14857	17.3%	medium	1.14x
event_writer-_build_image_entry	`_build_image_entry`	1.1222x	1337/1337	60.6%	low	1.10x
workflow_caller-_convert_output_descriptions_to_kinds	`_convert_output_descriptions_to_kinds`	1.1212x	37/37	24.6%	medium	1.19x
workflow_caller-workflowcallerblockv1-run	`run`	1.1159x	59/59	48.9%		1.13x
notification-blockmanifest-get_air_gapped_availability	`get_air_gapped_availability`	1.1135x	1535/1535	43.0%	low	1.14x
moondream2-inferencemodelsmoondream2adapter-caption	`caption`	1.1118x	185/185	45.1%	high	1.11x
cache-_get_block_type_identifier	`_get_block_type_identifier`	1.1082x	34/34	26.5%		1.11x
models-inferencemodelsobjectdetectionadapter-preprocess	`preprocess`	1.1014x	31/31	7.5%	low	1.12x
workflow_caller-_deserialize_output_value	`_deserialize_output_value`	1.0981x	139/139	27.4%	medium	1.11x
workflow_caller-_resolve_output_kinds_for_run	`_resolve_output_kinds_for_run`	1.0975x	1047/1047	26.2%		1.12x
lmm-blockmanifest-get_air_gapped_availability	`get_air_gapped_availability`	1.0895x	4625/4625	46.1%	low	1.12x
workflow_caller-build_workflow_url	`build_workflow_url`	1.0892x	6137/6137	22.2%	low	1.29x
custom_metadata-blockmanifest-get_air_gapped_availability	`get_air_gapped_availability`	1.0780x	3724/3724	49.4%	low	1.12x
models-inferencemodelskeypointsdetectionadapter-map_inference_kwargs	`map_inference_kwargs`	1.0669x	2232/2232	7.4%	low	1.13x
heatmap-heatmapvisualizationblockv1-getannotator	`getAnnotator`	1.0615x	38/38	44.2%	low	1.41x
detection_event_log-detectioneventlogblockv1-run	`run`	1.0380x	4426/4426	97.3%	low	3.40x
glm_ocr-inferencemodelsglmocradapter-postprocess	`postprocess`	1.0374x	1788/1788	54.5%	low	1.19x
sms-blockmanifest-get_air_gapped_availability	`get_air_gapped_availability`	1.0320x	2743/2743	15.4%	low	1.14x
handlers-handle_describe_workflows_blocks_request	`handle_describe_workflows_blocks_request`	1.0280x	153/153	42.0%	low	2.50x
dataset_upload-blockmanifest-get_air_gapped_availability	`get_air_gapped_availability`	1.0160x	3532/3532	29.3%	low	1.10x
core-get_workflow_cache_file	`get_workflow_cache_file`	1.0093x	1551/1551	2.8%	low	1.17x
dataset_upload-execute_registration	`execute_registration`	1.0064x	1005/1005	38.1%	low	1.17x
clip_comparison-blockmanifest-get_supported_model_variants	`get_supported_model_variants`	1.0027x	3031/3031	25.9%	low	1.28x
builder-get_cached_models	`get_cached_models`	1.0000x	21/21	51.4%	low	1.80x
cache-measure_memory_for_eviction	`measure_memory_for_eviction`	1.0000x	2/2	N/A	low	19.13x
compiler-establish_control_flow_edge	`establish_control_flow_edge`	1.0000x	26/26	24.4%	low	1.32x
compiler-find_longest_lineage_support	`find_longest_lineage_support`	1.0000x	51/51	23.1%	low	1.26x
core-wrap_roboflow_api_errors	`wrap_roboflow_api_errors`	1.0000x	438/438	3.7%	low	1.28x
core_steps-load_blocks	`load_blocks`	1.0000x	2/2	N/A	low	5.09x
dataset_upload-_expand_metadata_to_records	`_expand_metadata_to_records`	1.0000x	2/2	50.6%	low	1.62x
dataset_upload-_transpose_metadata_batches	`_transpose_metadata_batches`	1.0000x	2/2	50.6%	low	1.36x
dynamic_blocks-_create_clean_traceback	`_create_clean_traceback`	1.0000x	2/2	13.4%	low	3.22x
execution_data_manager-_transpose_dict_of_batches_if_needed	`_transpose_dict_of_batches_if_needed`	1.0000x	2/2	12.7%	low	1.61x
halo-halovisualizationblockv1-getannotator	`getAnnotator`	1.0000x	2/2	34.2%	low	11.57x
http-with_route_exceptions_async	`with_route_exceptions_async`	1.0000x	1/1	0.8%		6.20x
managers-customcollector-_fetch_stream_metrics	`_fetch_stream_metrics`	1.0000x	41/41	7.2%	low	1.19x
managers-experimentalmodelmanager-is_loaded	`is_loaded`	1.0000x	2/2	0.6%	low	2.31x
mask_area_measurement-areameasurementblockv1-run	`run`	1.0000x	2/2	N/A	low	1.25x
models-semanticsegmentationbaseonnxroboflowinferencemodel-make_response	`make_response`	1.0000x	5/5	11.5%	low	1.33x
overlap-overlapmanifest-describe_outputs	`describe_outputs`	1.0000x	2/2	N/A	low	6.36x
qwen3vl-_is_flash_attn_usable	`_is_flash_attn_usable`	1.0000x	2/2	17.4%	low	141.41x
decorators-withfixedsizecache-add_model	`add_model`	0.9999x	1224/1224	57.8%	low	1.17x
object_detection-blockmanifest-get_compatible_task_types	`get_compatible_task_types`	0.9982x	3874/3874	27.1%	low	1.12x
instance_segmentation-blockmanifest-get_compatible_task_types	`get_compatible_task_types`	0.9954x	3836/3836	27.6%	low	1.15x
semantic_segmentation-blockmanifest-get_compatible_task_types	`get_compatible_task_types`	0.9907x	4124/4124	38.9%	low	1.11x
segment_anything3-blockmanifest-get_supported_model_variants	`get_supported_model_variants`	0.9862x	5631/5631	9.5%	low	1.13x
keypoint_detection-blockmanifest-get_compatible_task_types	`get_compatible_task_types`	0.9836x	2631/2631	26.7%	low	1.11x
gaze-blockmanifest-get_supported_model_variants	`get_supported_model_variants`	0.9797x	2229/2229	33.1%	low	1.21x
yolo26-yolo26instancesegmentation-predict	`predict`	0.9786x	33/33	32.9%	low	1.26x
operations-build_sequence_apply_operation	`build_sequence_apply_operation`	0.9720x	25/25	30.3%	low	1.35x
common-deserialize_detections_kind	`deserialize_detections_kind`	0.9719x	8/8	5.5%	low	1.15x
stream-inferencepipeline-init_with_workflow	`init_with_workflow`	0.9698x	53/53	30.5%	low	1.10x
moondream2-blockmanifest-get_supported_model_variants	`get_supported_model_variants`	0.9669x	1238/1238	50.6%	low	1.15x
multi_class_classification-blockmanifest-get_compatible_task_types	`get_compatible_task_types`	0.9653x	2622/2622	26.0%	low	1.14x
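
Several rows show a measured Speedup that differs sharply from the DB Speedup (for example `_evict_oldest_video` at 343.5465x against 15.92x). The sketch below parses one tab-separated row and flags such divergence. The `TaskRow` type, the `diverges` helper, and the factor of 10 are hypothetical choices for illustration, not part of the report's tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRow:
    task: str
    function: str
    speedup: float
    tests_passed: int
    tests_total: int
    coverage_pct: Optional[float]  # None when the report shows N/A
    quality: Optional[str]         # None when the cell is empty
    db_speedup: float


def parse_row(line: str) -> TaskRow:
    """Parse one row: Task, Function, Speedup, Tests, Coverage, Quality, DB Speedup."""
    task, function, speedup, tests, coverage, quality, db_speedup = (
        line.rstrip("\n").split("\t")
    )
    passed, total = tests.split("/")
    return TaskRow(
        task=task,
        function=function.strip("`"),
        speedup=float(speedup.rstrip("x")),
        tests_passed=int(passed),
        tests_total=int(total),
        coverage_pct=None if coverage == "N/A" else float(coverage.rstrip("%")),
        quality=quality or None,
        db_speedup=float(db_speedup.rstrip("x")),
    )


def diverges(row: TaskRow, factor: float = 10.0) -> bool:
    """True when measured and DB speedups differ by at least `factor` (hypothetical threshold)."""
    high = max(row.speedup, row.db_speedup)
    low = min(row.speedup, row.db_speedup)
    return low > 0 and high / low >= factor


row = parse_row(
    "detection_event_log-detectioneventlogblockv1-_evict_oldest_video"
    "\t`_evict_oldest_video`\t343.5465x\t170/170\t46.4%\t\t15.92x"
)
```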

Failed Tasks (13)

byte_tracker-bytetrackmanifest-describe_outputs

Function: describe_outputs
File: inference/core/workflows/core_steps/transformations/byte_tracker/v1.py
Commit: HEAD
Method: db_code_only
DB Speedup: 16.66x
Solve OK: False
Duration: 17.4s
Reward: correct=0.0, speedup=0.0, tests=6035/6036

Key errors

  PydanticDeprecatedSince20: `allow_reuse` is deprecated and will be ignored; it should no longer be necessary. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: `min_items` is deprecated and will be removed, use `min_length` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
FAILED tests/codeflash_generated/test_describe_outputs__behaviorinstrumented_0.py::test_describe_outputs_basic_structure_and_contents[ 1 ]
FAILED tests/codeflash_generated/test_describe_outputs__behaviorinstrumented_1.py::test_describe_outputs_already_seen_instances_kind[ 1 ]
INFO:   INCORRECT: 6035/6036 passed, 1 diffs

Reproduce: bash docker_e2e_test.sh byte_tracker-bytetrackmanifest-describe_outputs --debug
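
Each failed-task block carries a `Reward:` line and an `INCORRECT:` line with test counts. A rough sketch of extracting those numbers with regular expressions follows; the patterns are inferred from the lines shown in this report and the function names are hypothetical.

```python
import re

REWARD_RE = re.compile(
    r"Reward: correct=(?P<correct>[\d.]+), speedup=(?P<speedup>[\d.]+), "
    r"tests=(?P<passed>\d+)/(?P<total>\d+)"
)
INCORRECT_RE = re.compile(
    r"INCORRECT: (?P<passed>\d+)/(?P<total>\d+) passed, (?P<diffs>\d+) diffs"
)


def parse_reward(line: str) -> dict:
    """Extract correct, speedup, and test counts from a 'Reward:' line."""
    m = REWARD_RE.search(line)
    if m is None:
        raise ValueError(f"not a Reward line: {line!r}")
    return {
        "correct": float(m["correct"]),
        "speedup": float(m["speedup"]),
        "passed": int(m["passed"]),
        "total": int(m["total"]),
    }


def parse_incorrect(line: str) -> dict:
    """Extract passed/total/diffs from an 'INCORRECT:' log line."""
    m = INCORRECT_RE.search(line)
    if m is None:
        raise ValueError(f"not an INCORRECT line: {line!r}")
    return {key: int(value) for key, value in m.groupdict().items()}


reward = parse_reward("Reward: correct=0.0, speedup=0.0, tests=6035/6036")
verdict = parse_incorrect("INFO:   INCORRECT: 6035/6036 passed, 1 diffs")
```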

clip-inferencemodelsclipadapter-compare

Function: compare
File: inference/models/clip/clip_inference_models.py
Commit: 7648e452a70ff1aad09f017a0eb2ea4022b7e177
Method: db_code_match
DB Speedup: 3.37x
Solve OK: False
Duration: 23.7s
Reward: correct=0.0, speedup=0.0, tests=135/136

Key errors

  PydanticDeprecatedSince20: `allow_reuse` is deprecated and will be ignored; it should no longer be necessary. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: `min_items` is deprecated and will be removed, use `min_length` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: Using extra keyword arguments on `Field` is deprecated and will be removed. Use `json_schema_extra` instead. (Extra keys: 'optional'). Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
FAILED tests/codeflash_generated/test_compare__behaviorinstrumented_0.py::TestInferenceModelsClipAdapterCompare::test_compare_empty_prompt_list[ 1 ]
INFO:   INCORRECT: 135/136 passed, 1 diffs

Reproduce: bash docker_e2e_test.sh clip-inferencemodelsclipadapter-compare --debug
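
The "Key errors" excerpts are dominated by Pydantic deprecation warnings, which are not test failures. One way to surface the relevant lines is sketched below; the marker strings are taken from the excerpts in this report, and the function name `split_key_errors` is hypothetical.

```python
from typing import Iterable, List, Tuple

# Substrings that identify Pydantic deprecation warnings in these logs,
# including truncated fragments that only keep the migration-guide URL.
NOISE_MARKERS = ("PydanticDeprecatedSince20", "errors.pydantic.dev/")


def split_key_errors(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (signal, noise): noise is deprecation warnings, signal is everything else."""
    signal, noise = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        if any(marker in stripped for marker in NOISE_MARKERS):
            noise.append(stripped)
        else:
            signal.append(stripped)
    return signal, noise


sample = [
    "  PydanticDeprecatedSince20: `min_items` is deprecated and will be removed, "
    "use `min_length` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. "
    "See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/",
    "FAILED tests/codeflash_generated/test_compare__behaviorinstrumented_0.py::"
    "TestInferenceModelsClipAdapterCompare::test_compare_empty_prompt_list[ 1 ]",
    "INFO:   INCORRECT: 135/136 passed, 1 diffs",
]
signal, noise = split_key_errors(sample)
```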

compiler-establish_batch_oriented_step_lineage

Function: establish_batch_oriented_step_lineage
File: inference/core/workflows/execution_engine/v1/compiler/graph_constructor.py
Commit: 90243bdc6278ef7d17b6db09dc1eb5b0d155b4be
Method: db_code_match
DB Speedup: 1.54x
Solve OK: False
Duration: 14.0s
Reward: correct=0.0, speedup=0.0, tests=33/36

Key errors

  /workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1220: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  /workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1236: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
FAILED tests/codeflash_generated/test_establish_batch_oriented_step_lineage__behaviorinstrumented_0.py::test_multiple_control_flow_lineages_with_same_min_length_raises_assumption_error[ 1 ]
FAILED tests/codeflash_generated/test_establish_batch_oriented_step_lineage__behaviorinstrumented_1.py::test_empty_lineage_lists[ 1 ]
FAILED tests/codeflash_generated/test_establish_batch_oriented_step_lineage__behaviorinstrumented_1.py::test_missing_dimensionality_reference_property[ 1 ]
FAILED tests/codeflash_generated/test_establish_batch_oriented_step_lineage__behaviorinstrumented_1.py::test_non_batch_oriented_property_raises_error[ 1 ]
FAILED tests/codeflash_generated/test_establish_batch_oriented_step_lineage__behaviorinstrumented_1.py::test_multiple_control_flow_same_min_length_raises_error[ 1 ]
FAILED tests/codeflash_generated/test_establish_batch_oriented_step_lineage__behaviorinstrumented_1.py::test_compound_input_no_batch_oriented_raises_error[ 1 ]
INFO:   INCORRECT: 33/36 passed, 3 diffs

Reproduce: bash docker_e2e_test.sh compiler-establish_batch_oriented_step_lineage --debug

compiler-get_reference_lineage

Function: get_reference_lineage
File: inference/core/workflows/execution_engine/v1/compiler/graph_constructor.py
Commit: HEAD
Method: db_code_only
DB Speedup: 1.61x
Solve OK: False
Duration: 14.6s
Reward: correct=0.0, speedup=0.0, tests=20/24

Key errors

/errors.pydantic.dev/2.11/migration/
  /workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1294: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  /workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1310: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
FAILED tests/codeflash_generated/test_get_reference_lineage__behaviorinstrumented_1.py::TestGetReferenceLineageBasic::test_batch_oriented_property_in_simple_input[ 1 ]
FAILED tests/codeflash_generated/test_get_reference_lineage__behaviorinstrumented_1.py::TestGetReferenceLineageEdge::test_compound_input_with_batch_oriented_nested[ 1 ]
FAILED tests/codeflash_generated/test_get_reference_lineage__behaviorinstrumented_1.py::TestGetReferenceLineageEdge::test_compound_input_no_batch_oriented_raises_error[ 1 ]
FAILED tests/codeflash_generated/test_get_reference_lineage__behaviorinstrumented_1.py::TestGetReferenceLineageLargeScale::test_large_compound_input_many_nested[ 1 ]
FAILED tests/codeflash_generated/test_get_reference_lineage__behaviorinstrumented_1.py::TestGetReferenceLineageLargeScale::test_many_input_data_keys[ 1 ]
INFO:   INCORRECT: 20/24 passed, 4 diffs

Reproduce: bash docker_e2e_test.sh compiler-get_reference_lineage --debug

core_steps-_should_filter_block

Function: _should_filter_block
File: inference/core/workflows/core_steps/loader.py
Commit: HEAD
Method: db_code_only
DB Speedup: 4.93x
Solve OK: False
Duration: 27.4s
Reward: correct=0.0, speedup=0.0, tests=41/41

Key errors

_ ERROR collecting tests/codeflash_generated/test__should_filter_block__behaviorinstrumented_1.py _
ImportError while importing test module '/workspace/inference/tests/codeflash_generated/test__should_filter_block__behaviorinstrumented_1.py'.
E   ImportError: cannot import name 'WORKFLOW_SELECTIVE_BLOCKS_DISABLE' from 'inference.core.env' (/workspace/inference/inference/core/env.py)
  /usr/local/lib/python3.12/site-packages/pydantic/fields.py:1093: PydanticDeprecatedSince20: Using extra keyword arguments on `Field` is deprecated and will be removed. Use `json_schema_extra` instead. (Extra keys: 'optional'). Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
ERROR tests/codeflash_generated/test__should_filter_block__behaviorinstrumented_0.py
ERROR tests/codeflash_generated/test__should_filter_block__behaviorinstrumented_1.py
!!!!!!!!!!!!!!!!!!! Interrupted: 2 errors during collection !!!!!!!!!!!!!!!!!!!!
1 warning, 2 errors in 0.28s
INFO:   INCORRECT: 41/41 passed, 0 diffs

Reproduce: bash docker_e2e_test.sh core_steps-_should_filter_block --debug
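
The block above reports `41/41 passed, 0 diffs` even though pytest was interrupted during collection by an `ImportError`, so the counts alone do not reveal the failure mode. A minimal sketch of classifying a "Key errors" excerpt by the kind of lines it contains follows; the three category labels and the function name are hypothetical.

```python
from typing import Iterable


def classify_failure(lines: Iterable[str]) -> str:
    """Classify a 'Key errors' excerpt by the pytest markers it contains."""
    stripped = [line.strip() for line in lines]
    collection_error = any(
        line.startswith("ERROR ")
        or "ERROR collecting" in line
        or "errors during collection" in line
        for line in stripped
    )
    if collection_error:
        return "collection_error"
    if any(line.startswith("FAILED ") for line in stripped):
        return "test_failure"
    # Only warnings or summary lines were captured in the excerpt.
    return "no_test_signal"


collection_case = [
    "ERROR tests/codeflash_generated/test__should_filter_block__behaviorinstrumented_0.py",
    "INFO:   INCORRECT: 41/41 passed, 0 diffs",
]
failure_case = [
    "FAILED tests/codeflash_generated/test_compare__behaviorinstrumented_0.py::"
    "TestInferenceModelsClipAdapterCompare::test_compare_empty_prompt_list[ 1 ]",
    "INFO:   INCORRECT: 135/136 passed, 1 diffs",
]
```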

execution_data_manager-prepare_parameters

Function: prepare_parameters
File: inference/core/workflows/execution_engine/v1/executor/execution_data_manager/step_input_assembler.py
Commit: HEAD
Method: db_code_only
DB Speedup: 1.12x
Solve OK: False
Duration: 15.7s
Reward: correct=0.0, speedup=0.0, tests=1/1

Key errors

FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_empty_runtime_parameters[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_step_execution_dimensionality_zero[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_large_dimensionality[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_special_step_names[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_unicode_step_names[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_many_input_parameters[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_large_batch_size[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_deeply_nested_compound_inputs[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_many_masks[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_many_auto_batch_casting_configs[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_iteration_performance[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_complex_data_structures[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_mixed_parameter_types[ 1 ]
INFO:   INCORRECT: 1/1 passed, 0 diffs

Reproduce: bash docker_e2e_test.sh execution_data_manager-prepare_parameters --debug

glm_ocr-glmocrblockv1-run_remotely

Function: run_remotely
File: inference/core/workflows/core_steps/models/foundation/glm_ocr/v1.py
Commit: HEAD
Method: db_code_only
DB Speedup: 3.24x
Solve OK: False
Duration: 20.8s
Reward: correct=0.0, speedup=0.0, tests=1/1

Key errors

0: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: `allow_reuse` is deprecated and will be ignored; it should no longer be necessary. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: `min_items` is deprecated and will be removed, use `min_length` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
INFO:   INCORRECT: 1/1 passed, 0 diffs

Reproduce: bash docker_e2e_test.sh glm_ocr-glmocrblockv1-run_remotely --debug

ocsort-ocsortblockv1-run

Function: run
File: inference/core/workflows/core_steps/trackers/ocsort/v1.py
Commit: HEAD
Method: db_code_only
DB Speedup: 1.60x
Solve OK: False
Duration: 16.5s
Reward: correct=0.0, speedup=0.0, tests=408/408

Key errors

@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: `allow_reuse` is deprecated and will be ignored; it should no longer be necessary. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: `min_items` is deprecated and will be removed, use `min_length` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
ERROR tests/codeflash_generated/test_run__behaviorinstrumented_0.py
ERROR tests/codeflash_generated/test_run__behaviorinstrumented_1.py
!!!!!!!!!!!!!!!!!!! Interrupted: 2 errors during collection !!!!!!!!!!!!!!!!!!!!
25 warnings, 2 errors in 0.84s
INFO:   INCORRECT: 408/408 passed, 0 diffs

Reproduce: bash docker_e2e_test.sh ocsort-ocsortblockv1-run --debug

perception_encoder-inferencemodelsperceptionencoderadapter-preprocess

Function: preprocess
File: inference/models/perception_encoder/perception_encoder_inference_models.py
Commit: 7648e452a70ff1aad09f017a0eb2ea4022b7e177
Method: db_code_match
DB Speedup: 2.47x
Solve OK: False
Duration: 37.7s
Reward: correct=0.0, speedup=0.0, tests=2031/2235

Key errors

  PydanticDeprecatedSince20: Using extra keyword arguments on `Field` is deprecated and will be removed. Use `json_schema_extra` instead. (Extra keys: 'optional'). Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_returns_tuple_with_correct_types[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_calls_preproc_image[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_metadata_is_empty_dict[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_preserves_image_dimensions[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_with_kwargs[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_multiple_calls_independence[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_with_1000_rapid_calls[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_with_varying_channel_counts[ 1 ]
INFO:   INCORRECT: 2031/2235 passed, 204 diffs

Reproduce: bash docker_e2e_test.sh perception_encoder-inferencemodelsperceptionencoderadapter-preprocess --debug

qwen3vl-qwen3vlblockv1-run

Function: run
File: inference/core/workflows/core_steps/models/foundation/qwen3vl/v1.py
Commit: c20359386c628a08bde69f5f3f780cedd782c50c
Method: db_code_match
DB Speedup: 1.45x
Solve OK: False
Duration: 27.9s
Reward: correct=0.0, speedup=0.0, tests=41/42

Key errors

FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_local_execution_mode[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_remote_execution_mode[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_none_prompt_and_system_prompt_local[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_none_prompt_and_system_prompt_remote[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_single_image[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestBasicFunctionality::test_run_with_multiple_images[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestEdgeCases::test_run_with_invalid_execution_mode[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestEdgeCases::test_run_with_empty_prompt_string[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestEdgeCases::test_run_locally_with_none_api_key[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestLargeScale::test_run_with_different_model_versions[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestLargeScale::test_run_locally_with_repeated_calls[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::TestLargeScale::test_run_with_batch_type_handling[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_1.py::test_run_local_various_image_reference_types[ 1 ]
INFO:   INCORRECT: 41/42 passed, 1 diffs

Reproduce: bash docker_e2e_test.sh qwen3vl-qwen3vlblockv1-run --debug

rfdetr-rfdetrobjectdetection-postprocess

Function: postprocess
File: inference/models/rfdetr/rfdetr.py
Commit: 6078c43bae0aa336aef12e324b9a9008a35d2408
Method: git_parent
DB Speedup: 1.13x
Solve OK: False
Duration: 12.4s
Reward: correct=0.0, speedup=0.0, tests=10/29

Key errors

FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_bbox_format_conversion[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_sigmoid_stable_applied[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_empty_predictions[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_single_query_single_class[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_zero_confidence_threshold[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_very_small_image_dims[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_very_large_image_dims[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_bbox_clipping[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_class_id_filtering[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_data_type_conversion[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_negative_bbox_coordinates[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_large_batch_size[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_max_detections_large_value[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_precision_with_small_values[ 1 ]
FAILED tests/codeflash_generated/test_postprocess__behaviorinstrumented_1.py::test_postprocess_large_bbox_values[ 1 ]
INFO:   INCORRECT: 10/29 passed, 19 diffs

Reproduce: bash docker_e2e_test.sh rfdetr-rfdetrobjectdetection-postprocess --debug

s3-s3sinkblockv1-_upload_separate_file

Function: _upload_separate_file
File: inference/core/workflows/core_steps/sinks/s3/v1.py
Commit: 639c8e77ab90d6a43f32fe55a355373ae74e0924
Method: db_code_match
DB Speedup: 1.15x
Solve OK: False
Duration: 41.7s
Reward: correct=0.0, speedup=0.0, tests=1249/1252

Key errors

.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  /workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1267: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  /workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1280: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  /workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1296: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  /workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1311: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
INFO:   INCORRECT: 1249/1252 passed, 3 diffs
INFO:     [stdout] WARNING  S3 connection error on attempt 1/4: An unspecified error occurred  vs  WARNING  Could not upload to S3: An unspecified error occurred
INFO:     [stdout] WARNING  Non-retryable S3 error (NoSuchBucket): An error occurred (NoSuchBucket)  vs  WARNING  Could not upload to S3: An error occurred (NoSuchBucket) when calling

Reproduce: bash docker_e2e_test.sh s3-s3sinkblockv1-_upload_separate_file --debug
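
The `[stdout]` lines in the s3 entry pair one captured log message against another, which suggests that the two runs differ only in what they log. The report does not say how the harness compares output or which side of `vs` is the original code. The sketch below therefore only illustrates the general idea with a plain line-by-line comparison; `stdout_diffs` is a hypothetical helper.

```python
from itertools import zip_longest


def stdout_diffs(first, second):
    """Pair up lines of two captured stdout strings and return the pairs that differ.

    Trailing whitespace is ignored so console padding does not count as a diff.
    """
    pairs = []
    for a, b in zip_longest(first.splitlines(), second.splitlines(), fillvalue=""):
        a, b = a.rstrip(), b.rstrip()
        if a != b:
            pairs.append((a, b))
    return pairs


# Messages copied from the report; which run produced which is an assumption.
run_a = "WARNING  S3 connection error on attempt 1/4: An unspecified error occurred\n"
run_b = "WARNING  Could not upload to S3: An unspecified error occurred\n"

for a, b in stdout_diffs(run_a, run_b):
    print(f"{a}  vs  {b}")
```

Under a comparison like this, a change that rewords or adds log messages counts as a behavioral difference even when return values are identical.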

sort-sortblockv1-run

Function: run
File: inference/core/workflows/core_steps/trackers/sort/v1.py
Commit: HEAD
Method: db_code_only
DB Speedup: 2.07x
Solve OK: False
Duration: 19.8s
Reward: correct=0.0, speedup=0.0, tests=682/684

Key errors

  PydanticDeprecatedSince20: `allow_reuse` is deprecated and will be ignored; it should no longer be necessary. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: `min_items` is deprecated and will be removed, use `min_length` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::test_new_then_already_seen_instance_detection[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::test_filter_out_unmatched_tracks_with_negative_id[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::test_create_tracker_receives_default_fps_when_missing[ 1 ]
FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::test_large_scale_many_instances_and_cache_behavior[ 1 ]
INFO:   INCORRECT: 682/684 passed, 2 diffs

Reproduce: bash docker_e2e_test.sh sort-sortblockv1-run --debug
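
The `Key errors` excerpts mix `PydanticDeprecatedSince20` deprecation warnings with `FAILED` test IDs, and only the latter name failing tests. A minimal sketch for separating the two during triage follows. It assumes the line shapes shown in this report and treats the trailing `[ 1 ]` as an opaque suffix, since the report does not explain it. `split_key_errors` is a hypothetical helper.

```python
import re

# Matches pytest-style IDs such as "FAILED path/to/test.py::Class::test_name[ 1 ]".
FAILED_RE = re.compile(r"^FAILED (?P<path>\S+?)::(?P<test>.+?)(?:\[ \d+ \])?$")


def split_key_errors(lines):
    """Split log lines into (failed tests, deprecation warnings).

    Failed tests are returned as (file path, test name) tuples; lines that
    match neither shape are dropped.
    """
    failed, warnings = [], []
    for line in lines:
        stripped = line.strip()
        m = FAILED_RE.match(stripped)
        if m:
            failed.append((m["path"], m["test"]))
        elif "PydanticDeprecatedSince20" in stripped:
            warnings.append(stripped)
    return failed, warnings


log = [
    "  PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead.",
    "FAILED tests/codeflash_generated/test_run__behaviorinstrumented_0.py::test_new_then_already_seen_instance_detection[ 1 ]",
    "INFO:   INCORRECT: 682/684 passed, 2 diffs",
]
failed, warnings = split_key_errors(log)
print(failed)
print(len(warnings))
```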
