# metaflow Performance Optimization
Upstream performance improvements to Netflix/metaflow, a human-centric framework for data science and ML workflows.
## Background
Metaflow is Netflix's open-source Python framework for building and managing real-world data science projects. It handles workflow orchestration, versioning, and execution across local, cloud, and Kubernetes environments.
We targeted the content-addressed store (CAS) — the artifact persistence layer that serializes, compresses, and hashes every artifact saved by a flow step. Every `current.card.append`, model checkpoint, or DataFrame save goes through this path.
Initial profiling identified two optimization surfaces:
- Import time (~513ms): Heavy optional dependencies loaded eagerly. Deprioritized — import deferral savings (~200ms estimated) are marginal vs. the CAS compression wins.
- CAS compression: `gzip.compress()` on every artifact. gzip is slow on the large payloads (embeddings, model weights, numpy arrays) that dominate ML workflows.
## What We Changed
### Replace gzip with lz4 in CAS
The CAS compresses every artifact with `gzip.compress()` before storing. lz4 frame compression trades a small loss in compression ratio for dramatically higher throughput — 7-18x on realistic ML payloads.
Added a `cas_version=2` header to new blobs so old metaflow installations can detect and error gracefully on lz4-compressed artifacts rather than silently corrupting data.
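A minimal sketch of the versioning idea, assuming a simple ASCII header layout (the PR's actual blob format may differ); the fallback-to-gzip branch is illustrative, not necessarily the PR's behavior:

```python
import gzip

try:
    import lz4.frame as _lz4  # optional third-party dependency
except ImportError:
    _lz4 = None

# Hypothetical header layout: a short ASCII marker before the payload.
V1_MAGIC = b"cas_version=1\n"  # gzip payload
V2_MAGIC = b"cas_version=2\n"  # lz4 frame payload

def pack_blob(data: bytes) -> bytes:
    """Compress with lz4 when available, else keep the gzip path."""
    if _lz4 is not None:
        return V2_MAGIC + _lz4.compress(data)
    return V1_MAGIC + gzip.compress(data)

def unpack_blob(blob: bytes) -> bytes:
    """Dispatch on the version header; fail loudly on unknown blobs."""
    if blob.startswith(V2_MAGIC):
        if _lz4 is None:
            # Installations without lz4 error clearly instead of
            # silently corrupting the artifact.
            raise RuntimeError("blob requires lz4; install the lz4 package")
        return _lz4.decompress(blob[len(V2_MAGIC):])
    if blob.startswith(V1_MAGIC):
        return gzip.decompress(blob[len(V1_MAGIC):])
    raise ValueError("unknown CAS blob version")
```

The marker-prefix approach means old readers fail on the first bytes of the blob, before any decompression is attempted.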
## Results
Azure Standard_D2s_v5, hyperfine + pytest-benchmark, realistic artifact payloads:
| Payload | Pickled Size | gzip | lz4 | Speedup |
|---|---|---|---|---|
| Small dict (config) | 233B | 0.341ms | 0.218ms | 1.6x |
| Metrics dict (feature stats) | 52KB | 2.278ms | 0.327ms | 7.0x |
| Numpy float64 (embeddings) | 800KB | 29.1ms | 1.56ms | 18.7x |
| Numpy float64 (model weights) | 8MB | 289ms | 15.8ms | 18.3x |
| Random bytes (opaque model) | 5MB | 118ms | 9.65ms | 12.3x |
For realistic ML workloads (embeddings, model weights), the improvement is 12-18x.
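The shape of these numbers can be sanity-checked with a few lines of Python. The sketch below times `gzip.compress` against `lz4.frame.compress` when the lz4 package is available; the payload and run count are illustrative, not the hyperfine setup used above:

```python
import gzip
import time

def time_compress(fn, payload, runs=5):
    """Best-of-N wall-clock time for one compression call."""
    best = float("inf")
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(payload)
        best = min(best, time.perf_counter() - t0)
    return best

# ~800 KB of structured, compressible bytes, loosely mimicking
# float64 embedding data.
payload = bytes(range(256)) * 3200

gz = time_compress(gzip.compress, payload)
print(f"gzip: {gz * 1000:.2f} ms")

try:
    import lz4.frame  # third-party: pip install lz4
    lz = time_compress(lz4.frame.compress, payload)
    print(f"lz4:  {lz * 1000:.2f} ms  ({gz / lz:.1f}x)")
except ImportError:
    print("lz4 not installed; `pip install lz4` to compare")
```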
## Upstream Contributions
| PR | Status | Description |
|---|---|---|
| Netflix/metaflow#3090 | Open, waiting for review | Replace gzip with lz4 in CAS |
| KRRT7/metaflow#1 | Draft (fork mirror) | Same, on fork |
### Open questions from the PR
- Hard vs soft dependency: Should lz4 be required or optional with gzip fallback?
- Forward compatibility: Old metaflow installations can't decompress `cas_version=2` blobs — the PR errors gracefully, but the migration story needs maintainer input
- Benchmark scripts: To be reverted before merge (included for reproducibility during review)
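One possible soft-dependency shape is sketched below: prefer lz4 when importable, fall back to gzip otherwise. The `METAFLOW_CAS_COMPRESSION` knob is hypothetical, not an existing metaflow setting:

```python
import gzip
import os

try:
    import lz4.frame
    HAVE_LZ4 = True
except ImportError:
    HAVE_LZ4 = False

def choose_compressor():
    """Soft-dependency policy sketch: prefer lz4, allow opt-out,
    keep gzip when lz4 is missing.

    METAFLOW_CAS_COMPRESSION is a hypothetical knob for illustration.
    """
    pref = os.environ.get("METAFLOW_CAS_COMPRESSION", "lz4")
    if pref == "lz4" and HAVE_LZ4:
        return lz4.frame.compress, lz4.frame.decompress
    return gzip.compress, gzip.decompress

compress, decompress = choose_compressor()
assert decompress(compress(b"artifact bytes")) == b"artifact bytes"
```

The hard-dependency alternative drops the fallback branch and moves the failure to install time, which is simpler but changes metaflow's dependency footprint; that trade-off is exactly what needs maintainer input.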
## Remaining Targets
Deprioritized pending maintainer response on #3090:
| Target | Approach | Priority |
|---|---|---|
| SHA1 content hashing → xxHash/BLAKE3 | Faster non-crypto hash for content addressing | Medium (proposal in `data/sha1-proposal.md`) |
| Import time (~200ms savings) | Defer `requests`, `kubernetes`, `asyncio`, YAML | Low (marginal vs. CAS wins) |
| Sleep-based polling in `multicore_utils` | Event-based waiting | Low (marginal impact) |
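For the polling target, event-based waiting replaces a sleep loop with a blocking wait that wakes as soon as work completes. A thread-based sketch (metaflow's `multicore_utils` coordinates processes, where `multiprocessing.Event` plays the same role):

```python
import threading
import time

done = threading.Event()

def worker():
    time.sleep(0.05)  # simulate the child finishing its work
    done.set()        # signal completion instead of making the parent poll

threading.Thread(target=worker).start()

# A sleep-polling loop would check a flag every N milliseconds, paying
# up to N ms of latency per check. Event.wait() blocks and wakes the
# moment set() fires, with a timeout as a safety net.
finished = done.wait(timeout=5.0)
print("finished:", finished)
```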
## Methodology
### Environment
- VM: Azure Standard_D2s_v5 (2 vCPU, 8 GB RAM, non-burstable)
- OS: Ubuntu 24.04 LTS
- Python: 3.12 via pip editable install
- Tooling: hyperfine (warmup 5, min-runs 30), pytest-benchmark
- Dependencies: lz4, xxhash, numpy installed for benchmarking
### Benchmark approach
Realistic artifact payloads chosen to match actual ML workflow data:
- Small dicts (config/hyperparameters)
- Medium dicts (feature statistics)
- Numpy arrays (embeddings at 800KB, model weights at 8MB)
- Random bytes (opaque serialized models)
Each benchmark measures the full save path: pickle → compress → hash.
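A sketch of that measured path, using stdlib gzip and SHA1 as stand-ins for the pre-PR pipeline. Whether metaflow hashes the raw or the compressed bytes is an internal detail; this sketch hashes the compressed blob:

```python
import gzip
import hashlib
import pickle
import time

def save_path(obj):
    """The measured pipeline: pickle -> compress -> hash.

    gzip and SHA1 are the pre-change stand-ins; the PR swaps gzip
    for lz4 frame compression on this path.
    """
    blob = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    compressed = gzip.compress(blob)
    digest = hashlib.sha1(compressed).hexdigest()  # content address
    return digest, compressed

t0 = time.perf_counter()
digest, compressed = save_path({"weights": list(range(10_000))})
elapsed_ms = (time.perf_counter() - t0) * 1000
print(f"sha1={digest[:12]} size={len(compressed)}B time={elapsed_ms:.2f}ms")
```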
## Lessons Learned
### 1. Dependency decisions are the real review blocker
The optimization itself is simple and the benchmarks are clear. But the open questions from the PR are all about dependency management: should lz4 be a hard or soft dependency? What's the forward-compatibility story for old installations reading new artifacts? These decisions require maintainer alignment, not just benchmark numbers.
### 2. Profile the actual workload, not synthetic data
Our initial profiling of metaflow import time (~513ms) identified several deferred-import opportunities. But the CAS compression turned out to be the far more impactful target — it runs on every artifact save, and ML artifacts are large. The import time savings (~200ms estimated) pale next to 18x on an 8MB model weight save.
### 3. SHA1 hashing is a potential follow-up
Content hashing uses SHA1, which is unnecessarily slow for non-cryptographic content addressing. xxHash or BLAKE3 would be faster, but this is a separate discussion with broader compatibility implications (prepared in data/sha1-proposal.md).
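A rough comparison is easy to run locally. The sketch below uses `hashlib.blake2b` as a stdlib stand-in (BLAKE3 and xxHash are third-party packages), and the payload size is illustrative:

```python
import hashlib
import time

payload = bytes(800_000)  # 800 KB of zeros, embedding-sized

def best_of(fn, runs=5):
    """Best-of-N wall-clock time for a zero-argument callable."""
    best = float("inf")
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

sha1_ms = best_of(lambda: hashlib.sha1(payload).digest()) * 1000
blake2_ms = best_of(lambda: hashlib.blake2b(payload).digest()) * 1000
print(f"sha1:    {sha1_ms:.3f} ms")
print(f"blake2b: {blake2_ms:.3f} ms")

try:
    import xxhash  # third-party: pip install xxhash
    xx_ms = best_of(lambda: xxhash.xxh3_128(payload).digest()) * 1000
    print(f"xxh3:    {xx_ms:.3f} ms")
except ImportError:
    print("xxhash not installed; skipping")
```

Note that switching hashes changes the content addresses themselves, which is why this needs its own compatibility discussion rather than riding along with the compression PR.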
## Repo Structure
```
.
├── README.md            # This file
├── status.md            # Current session state
├── bench/               # Benchmark scripts
├── data/                # Raw benchmark data + SHA1 proposal
│   ├── results.tsv
│   └── sha1-proposal.md
└── infra/               # VM provisioning
    ├── cloud-init.yaml
    └── vm-manage.sh
```