codeflash-internal/experiments/rl_env/validation_report_retry.md
2026-04-16 16:31:25 -07:00

Codeflash RL Environment — Batch Validation Report

Summary

| Metric | Count | % |
|---|---|---|
| Total tasks | 13 | 100% |
| Solve passes | 0 | 0% |
| Eval correct (all behavioral tests pass) | 8 | 61% |
| Faster than original (speedup > 1.0) | 6 | 46% |
| All test cases pass | 11 | 84% |
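
The percentages above appear to be truncated rather than rounded (8 of 13 is 61.5%, reported as 61%). A minimal sketch of that arithmetic, assuming integer truncation; the helper name `pct` is hypothetical and not part of the report tooling:

```python
TOTAL_TASKS = 13  # "Total tasks" row of the summary


def pct(count: int, total: int = TOTAL_TASKS) -> int:
    """Percentage of `total`, truncated toward zero (assumed convention)."""
    return count * 100 // total


# Counts copied from the summary rows.
summary = {
    "Solve passes": 0,
    "Eval correct": 8,
    "Faster than original": 6,
    "All test cases pass": 11,
}

for name, count in summary.items():
    print(f"{name}: {count} ({pct(count)}%)")
```

If the tooling rounds instead, the 61% and 84% rows would read 62% and 85%, so truncation is the reading consistent with the reported values.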

Speedup Distribution (correct tasks only)

  • 1-1.5x: 5 tasks
  • 1.5-2x: 1 task
  • 2-5x: 1 task
  • >100x: 1 task
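
The distribution can be recomputed from the Speedup column of the Successful Tasks table. The sketch below is one way to do the binning; the bucket edges (lower bound inclusive, upper bound exclusive) and the `<1x` and `5-100x` buckets are assumptions, since the report lists only non-empty buckets and does not state how edge values are handled.

```python
from collections import Counter

# Speedups of the eight correct tasks, copied from the report.
speedups = [2023.8593, 2.4668, 1.8884, 1.4730, 1.0228, 1.0194, 1.0000, 1.0000]

# (label, inclusive lower bound, exclusive upper bound); edges are assumed.
BUCKETS = [
    ("<1x", 0.0, 1.0),
    ("1-1.5x", 1.0, 1.5),
    ("1.5-2x", 1.5, 2.0),
    ("2-5x", 2.0, 5.0),
    ("5-100x", 5.0, 100.0),
    (">100x", 100.0, float("inf")),
]


def bucket(speedup: float) -> str:
    """Return the label of the bucket containing `speedup`."""
    for label, low, high in BUCKETS:
        if low <= speedup < high:
            return label
    raise ValueError(f"speedup out of range: {speedup}")


counts = Counter(bucket(s) for s in speedups)
faster = sum(1 for s in speedups if s > 1.0)

for label, _, _ in BUCKETS:
    if counts[label]:
        print(f"{label}: {counts[label]}")
print(f"faster than original: {faster}")
```

Two of the five tasks in the 1-1.5x bucket sit at exactly 1.0000x, which is why only six tasks count as faster than the original.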

Successful Tasks (correct=1.0)

| Task | Function | Speedup | Tests | Coverage | Quality | DB Speedup |
|---|---|---|---|---|---|---|
| decorators-withfixedsizecache-memory_pressure_detected | memory_pressure_detected | 2023.8593x | 132/132 | 37.8% | | 917.09x |
| handlers-handle_describe_workflows_blocks_request | handle_describe_workflows_blocks_request | 2.4668x | 153/153 | N/A | low | 2.50x |
| enterprise_blocks-load_enterprise_blocks | load_enterprise_blocks | 1.8884x | 1936/1936 | 32.2% | medium | 1.45x |
| common-deserialize_image_kind | deserialize_image_kind | 1.4730x | 1506/1506 | 7.4% | medium | 1.42x |
| dataset_upload-execute_registration | execute_registration | 1.0228x | 1005/1005 | 38.1% | low | 1.17x |
| detection_event_log-detectioneventlogblockv1-run | run | 1.0194x | 4426/4426 | 97.3% | low | 3.40x |
| halo-halovisualizationblockv1-getannotator | getAnnotator | 1.0000x | 2/2 | 34.2% | low | 11.57x |
| managers-customcollector-_fetch_stream_metrics | _fetch_stream_metrics | 1.0000x | 41/41 | 7.2% | low | 1.19x |

Failed Tasks (5)

core_steps-_should_filter_block

  • Function: _should_filter_block

  • File: inference/core/workflows/core_steps/loader.py

  • Commit: HEAD

  • Method: db_code_only

  • DB Speedup: 4.93x

  • Solve OK: False

  • Duration: 36.6s

  • Reward: correct=0.0, speedup=0.0, tests=41/41

Key errors
_ ERROR collecting tests/codeflash_generated/test__should_filter_block__behaviorinstrumented_1.py _
ImportError while importing test module '/workspace/inference/tests/codeflash_generated/test__should_filter_block__behaviorinstrumented_1.py'.
E   ImportError: cannot import name 'WORKFLOW_SELECTIVE_BLOCKS_DISABLE' from 'inference.core.env' (/workspace/inference/inference/core/env.py)
  /usr/local/lib/python3.12/site-packages/pydantic/fields.py:1093: PydanticDeprecatedSince20: Using extra keyword arguments on `Field` is deprecated and will be removed. Use `json_schema_extra` instead. (Extra keys: 'optional'). Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
ERROR tests/codeflash_generated/test__should_filter_block__behaviorinstrumented_0.py
ERROR tests/codeflash_generated/test__should_filter_block__behaviorinstrumented_1.py
!!!!!!!!!!!!!!!!!!! Interrupted: 2 errors during collection !!!!!!!!!!!!!!!!!!!!
1 warning, 2 errors in 0.32s
INFO:   INCORRECT: 41/41 passed, 0 diffs

Reproduce: bash docker_e2e_test.sh core_steps-_should_filter_block --debug
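
The collection failure above is an `ImportError`: the generated test imports a name that the checked-out `inference/core/env.py` does not define. A pre-flight check along the following lines could surface this before pytest runs. It is a sketch only; `missing_names` is a hypothetical helper, not part of the Codeflash tooling, and the demonstration uses a stdlib module so it runs outside the container.

```python
import importlib


def missing_names(module_name: str, names: list[str]) -> list[str]:
    """Return the entries of `names` that are not attributes of the module."""
    module = importlib.import_module(module_name)
    # hasattr() covers constants, functions, and classes; submodules that
    # have not been imported yet would need a separate import attempt.
    return [name for name in names if not hasattr(module, name)]


# Stand-in demonstration with a stdlib module.
print(missing_names("os", ["getcwd", "NO_SUCH_NAME"]))
```

Inside the task container the equivalent call would be `missing_names("inference.core.env", ["WORKFLOW_SELECTIVE_BLOCKS_DISABLE"])`, which, given the error above, would be expected to return the missing name.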


execution_data_manager-prepare_parameters

  • Function: prepare_parameters

  • File: inference/core/workflows/execution_engine/v1/executor/execution_data_manager/step_input_assembler.py

  • Commit: HEAD

  • Method: db_code_only

  • DB Speedup: 1.12x

  • Solve OK: False

  • Duration: 31.8s

  • Reward: correct=0.0, speedup=0.0, tests=1/1

Key errors
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_empty_runtime_parameters[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_step_execution_dimensionality_zero[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_large_dimensionality[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_special_step_names[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_unicode_step_names[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_many_input_parameters[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_large_batch_size[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_deeply_nested_compound_inputs[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_many_masks[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_many_auto_batch_casting_configs[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_iteration_performance[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_complex_data_structures[ 1 ]
FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_mixed_parameter_types[ 1 ]
INFO:   INCORRECT: 1/1 passed, 0 diffs

Reproduce: bash docker_e2e_test.sh execution_data_manager-prepare_parameters --debug
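
The pytest output for this task lists 13 `FAILED` lines while the summary line reads `1/1 passed`, so the two counts appear to measure different things. A small parser, sketched below, can pull the failing test names out of a log tail for comparison; the regular expression is an assumption based on the line shapes shown in this report, and `failed_tests` is a hypothetical helper.

```python
import re

# Matches lines such as:
#   FAILED tests/x.py::test_name[ 1 ]
# capturing the file path and the test name without its parametrize suffix.
FAILED_RE = re.compile(r"^FAILED\s+(?P<file>\S+?)::(?P<test>[^\s\[]+)")


def failed_tests(log_lines):
    """Return (file, test_name) pairs for every FAILED line."""
    results = []
    for line in log_lines:
        match = FAILED_RE.match(line)
        if match:
            results.append((match.group("file"), match.group("test")))
    return results


# Three lines copied from the log above, plus the summary line.
sample = [
    "FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_empty_runtime_parameters[ 1 ]",
    "FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_large_dimensionality[ 1 ]",
    "FAILED tests/codeflash_generated/test_prepare_parameters__behaviorinstrumented_1.py::test_prepare_parameters_with_many_masks[ 1 ]",
    "INFO:   INCORRECT: 1/1 passed, 0 diffs",
]

for path, name in failed_tests(sample):
    print(name)
```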


ocsort-ocsortblockv1-run

  • Function: run

  • File: inference/core/workflows/core_steps/trackers/ocsort/v1.py

  • Commit: HEAD

  • Method: db_code_only

  • DB Speedup: 1.60x

  • Solve OK: False

  • Duration: 28.2s

  • Reward: correct=0.0, speedup=0.0, tests=408/408

Key errors
@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: `allow_reuse` is deprecated and will be ignored; it should no longer be necessary. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: `min_items` is deprecated and will be removed, use `min_length` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
ERROR tests/codeflash_generated/test_run__behaviorinstrumented_0.py
ERROR tests/codeflash_generated/test_run__behaviorinstrumented_1.py
!!!!!!!!!!!!!!!!!!! Interrupted: 2 errors during collection !!!!!!!!!!!!!!!!!!!!
25 warnings, 2 errors in 0.97s
INFO:   INCORRECT: 408/408 passed, 0 diffs

Reproduce: bash docker_e2e_test.sh ocsort-ocsortblockv1-run --debug
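
In this log tail the Pydantic deprecation warnings crowd out the actual cause of the two collection errors. One option is to filter warning noise before truncating the tail. The sketch below assumes that any line containing one of two marker strings is noise; `NOISE_MARKERS` and `strip_noise` are hypothetical names, and the sample lines are copied from the log above.

```python
# Substrings that identify Pydantic deprecation noise (assumed markers).
NOISE_MARKERS = ("PydanticDeprecatedSince20", "Pydantic V2 Migration Guide")


def strip_noise(lines):
    """Drop lines that contain any of the noise markers."""
    return [
        line for line in lines
        if not any(marker in line for marker in NOISE_MARKERS)
    ]


log_tail = [
    "  PydanticDeprecatedSince20: The `dict` method is deprecated; use "
    "`model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. "
    "See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/",
    "ERROR tests/codeflash_generated/test_run__behaviorinstrumented_0.py",
    "ERROR tests/codeflash_generated/test_run__behaviorinstrumented_1.py",
    "25 warnings, 2 errors in 0.97s",
]

for line in strip_noise(log_tail):
    print(line)
```

Filtering before truncation would leave more room in the tail for the traceback that explains the collection errors.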


perception_encoder-inferencemodelsperceptionencoderadapter-preprocess

  • Function: preprocess

  • File: inference/models/perception_encoder/perception_encoder_inference_models.py

  • Commit: 7648e452a70ff1aad09f017a0eb2ea4022b7e177

  • Method: db_code_match

  • DB Speedup: 2.47x

  • Solve OK: False

  • Duration: 64.0s

  • Reward: correct=0.0, speedup=0.0, tests=2031/2235

Key errors
  PydanticDeprecatedSince20: Using extra keyword arguments on `Field` is deprecated and will be removed. Use `json_schema_extra` instead. (Extra keys: 'optional'). Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_returns_tuple_with_correct_types[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_calls_preproc_image[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_metadata_is_empty_dict[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_preserves_image_dimensions[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_with_kwargs[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_multiple_calls_independence[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_with_1000_rapid_calls[ 1 ]
FAILED tests/codeflash_generated/test_preprocess__behaviorinstrumented_0.py::test_preprocess_with_varying_channel_counts[ 1 ]
INFO:   INCORRECT: 2031/2235 passed, 204 diffs

Reproduce: bash docker_e2e_test.sh perception_encoder-inferencemodelsperceptionencoderadapter-preprocess --debug


s3-s3sinkblockv1-_upload_separate_file

  • Function: _upload_separate_file

  • File: inference/core/workflows/core_steps/sinks/s3/v1.py

  • Commit: 639c8e77ab90d6a43f32fe55a355373ae74e0924

  • Method: db_code_match

  • DB Speedup: 1.15x

  • Solve OK: False

  • Duration: 60.3s

  • Reward: correct=0.0, speedup=0.0, tests=1249/1252

Key errors
.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  /workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1267: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  /workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1280: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  /workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1296: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  /workspace/inference/inference/core/workflows/execution_engine/entities/types.py:1311: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
INFO:   INCORRECT: 1249/1252 passed, 3 diffs
INFO:     [stdout] WARNING  Non-retryable S3 error (NoSuchBucket): An error occurred (NoSuchBucket)  vs  WARNING  Could not upload to S3: An error occurred (NoSuchBucket) when calling  
INFO:     [stdout] WARNING  S3 connection error on attempt 1/4: An unspecified error occurred        vs  WARNING  Could not upload to S3: An unspecified error occurred                  

Reproduce: bash docker_e2e_test.sh s3-s3sinkblockv1-_upload_separate_file --debug
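
The `[stdout]` lines above suggest that captured stdout is part of the behavioral comparison, so changing the text of a warning counts as a diff even when return values match. A minimal sketch of such a line-by-line comparison follows; `stdout_diffs` is a hypothetical helper, the report does not label which side is the original, and the sample strings are copied as shown, including their truncation.

```python
from itertools import zip_longest


def stdout_diffs(left, right):
    """Return (index, left_line, right_line) for each differing line pair."""
    diffs = []
    for index, (a, b) in enumerate(zip_longest(left, right, fillvalue="")):
        # Compare after stripping so padding alone does not count as a diff.
        if a.strip() != b.strip():
            diffs.append((index, a.strip(), b.strip()))
    return diffs


# Left and right sides of the two "vs" pairs shown in the log.
left = [
    "WARNING  Non-retryable S3 error (NoSuchBucket): An error occurred (NoSuchBucket)",
    "WARNING  S3 connection error on attempt 1/4: An unspecified error occurred",
]
right = [
    "WARNING  Could not upload to S3: An error occurred (NoSuchBucket) when calling",
    "WARNING  Could not upload to S3: An unspecified error occurred",
]

for index, a, b in stdout_diffs(left, right):
    print(f"line {index}: {a!r} vs {b!r}")
```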