# metaflow Optimization — Lessons Learned
Full case study: `.codeflash/krrt7/netflix/metaflow/`
## Context
metaflow is Netflix's open-source Python framework for building and managing data science and ML workflows. It handles workflow orchestration, versioning, and artifact storage across local, cloud, and Kubernetes environments.
We targeted the content-addressed store (CAS) — the artifact persistence layer that serializes, compresses, and hashes every artifact saved by a flow step.
## What we did

### Replace gzip with lz4 in CAS
The CAS compresses every artifact with `gzip.compress()` before storing it. gzip is slow on the large payloads (embeddings, model weights, numpy arrays) that dominate ML workflows. We replaced it with lz4 frame compression, which trades a marginal loss in compression ratio for dramatically faster throughput.

We also added a `cas_version=2` header to new blobs so that old metaflow installations detect lz4-compressed artifacts and error gracefully, rather than silently corrupting data.
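A minimal sketch of the versioning idea, assuming a magic-prefix blob layout. The names (`pack_blob`, `unpack_blob`) and header bytes here are illustrative, not metaflow's actual on-disk format, and stdlib gzip stands in for `lz4.frame` to keep the example dependency-free:

```python
import gzip
import struct

MAGIC = b"CAS"   # hypothetical magic prefix, not metaflow's real format
CAS_V2 = 2       # mirrors the cas_version=2 header from the PR

def pack_blob(payload: bytes) -> bytes:
    """Compress and prefix with a magic + version header."""
    return MAGIC + struct.pack("B", CAS_V2) + gzip.compress(payload)

def unpack_blob(blob: bytes) -> bytes:
    """Refuse blobs written by a newer format instead of mis-decoding them."""
    if not blob.startswith(MAGIC):
        raise ValueError("not a CAS blob")
    version = blob[len(MAGIC)]
    if version > CAS_V2:
        raise ValueError(f"cas_version={version} is newer than this reader supports")
    return gzip.decompress(blob[len(MAGIC) + 1:])
```

The point of the explicit version byte is that an old reader fails loudly on a newer blob instead of handing gzip bytes it cannot decode to downstream code.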
## Results
Benchmarked on an Azure Standard_D2s_v5 with hyperfine and pytest-benchmark, using realistic artifact payloads:
| Payload | Pickled Size | gzip | lz4 | Speedup |
|---|---|---|---|---|
| Small dict (config) | 233B | 0.341ms | 0.218ms | 1.6x |
| Metrics dict (feature stats) | 52KB | 2.278ms | 0.327ms | 7.0x |
| Numpy float64 (embeddings) | 800KB | 29.1ms | 1.56ms | 18.7x |
| Numpy float64 (model weights) | 8MB | 289ms | 15.8ms | 18.3x |
| Random bytes (opaque model) | 5MB | 118ms | 9.65ms | 12.3x |
For realistic ML workloads (embeddings, model weights), the improvement is 12-18x.
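The speed-versus-ratio trade-off in the table can be sanity-checked with stdlib tools alone. Since lz4 may not be installed everywhere, this sketch uses zlib level 1 as a stand-in for a fast, lower-ratio codec and level 9 for gzip's slow end of the spectrum (the payload shapes are modelled loosely on the table, not taken from the actual benchmark suite):

```python
import os
import timeit
import zlib

# Compressible "embedding-like" data vs. incompressible random bytes.
payloads = {
    "repetitive 800KB": bytes(range(256)) * 3200,
    "random 1MB": os.urandom(1 << 20),
}

for name, data in payloads.items():
    slow = timeit.timeit(lambda: zlib.compress(data, 9), number=5) / 5
    fast = timeit.timeit(lambda: zlib.compress(data, 1), number=5) / 5
    print(f"{name}: level9={slow * 1e3:.2f}ms  level1={fast * 1e3:.2f}ms  "
          f"speedup={slow / fast:.1f}x")
```

Absolute numbers will differ from the table (different codec, different machine); what carries over is that the fast setting wins most on large, highly compressible payloads.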
## Upstream status
| PR | Status | Description |
|---|---|---|
| Netflix/metaflow#3090 | Open, waiting for review | Replace gzip with lz4 in CAS |
## Key takeaways

### 1. Dependency decisions are the real review blocker
The optimization itself is simple and the benchmarks are clear, but the open questions on the PR are all about dependency management: should lz4 be a hard or a soft dependency? What is the forward-compatibility story for old installations reading new artifacts? These decisions require maintainer alignment, not just benchmark numbers.
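One common pattern for keeping a codec like lz4 a soft dependency is an import-time fallback. This is a sketch of the general technique, not the PR's actual approach; a real implementation would also have to record which codec wrote each blob (e.g. via the version header) so readers can pick the matching decompressor:

```python
# Optional fast path: use lz4 if available, otherwise fall back to stdlib gzip.
try:
    import lz4.frame as _codec

    def compress(data: bytes) -> bytes:
        return _codec.compress(data)
except ImportError:
    import gzip

    def compress(data: bytes) -> bytes:
        return gzip.compress(data)  # slower, but dependency-free
```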
### 2. Profile the actual workload, not synthetic data
Our initial profiling of metaflow import time (~513ms) identified several deferred-import opportunities. But the CAS compression turned out to be the far more impactful target — it runs on every artifact save, and ML artifacts are large. The import time savings (~200ms estimated) pale next to 18x on an 8MB model weight save.
### 3. SHA1 hashing is a potential follow-up
Content hashing uses SHA1, which is unnecessarily slow for non-cryptographic content addressing. xxHash or BLAKE3 would be faster, but this is a separate discussion with broader compatibility implications (prepared in data/sha1-proposal.md).
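A rough sense of the headroom is available from stdlib `hashlib` alone. BLAKE2 is not BLAKE3 or xxHash, but it already illustrates that SHA1 is not the only option for non-cryptographic content addressing (the 8MB payload size mirrors the model-weight row above; timings are machine-dependent):

```python
import hashlib
import timeit

data = b"\x00" * (8 << 20)  # 8MB payload

for algo in ("sha1", "sha256", "blake2b"):
    digest = getattr(hashlib, algo)
    t = timeit.timeit(lambda: digest(data).hexdigest(), number=3) / 3
    print(f"{algo}: {t * 1e3:.1f} ms per digest")
```

Note that swapping the hash changes every content address, so unlike the compression change it cannot be gated by a version byte alone; that is why it is a separate proposal.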
## Applicable to codeflash
- Compression algorithm choice: If codeflash serializes or caches any data (ASTs, benchmark results, test outputs), audit the compression algorithm. gzip is rarely the right choice for IPC or local caching
- Dependency management for perf wins: When an optimization requires a new dependency, the technical win is only half the battle — plan for the maintainer's concerns about dependency footprint, compatibility, and optional vs. required
- Forward compatibility: Any format change to persisted data needs a versioning scheme so old versions fail gracefully