metaflow Performance Optimization

Upstream performance improvements to Netflix/metaflow, a human-centric framework for data science and ML workflows.

Background

Metaflow is Netflix's open-source Python framework for building and managing real-world data science projects. It handles workflow orchestration, versioning, and execution across local, cloud, and Kubernetes environments.

Profiling reveals two main optimization surfaces:

Import time (~513ms): Heavy modules that many code paths never use (requests, kubernetes, asyncio, yaml) are loaded eagerly. Plugin resolution alone accounts for 65% of import time.
Runtime hot paths: Every artifact is gzip-compressed twice, SHA1 is used for content hashing where a faster non-cryptographic hash would suffice, and the multiprocessing utilities poll with sleep calls.
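A per-module import breakdown like the one behind these figures can be collected with CPython's `-X importtime` flag. The sketch below is one possible way to do it, not the exact profiling setup used for the numbers above. The helper name `import_breakdown` is hypothetical, and `json` stands in for `metaflow` so the example runs without the package installed.

```python
import subprocess
import sys


def import_breakdown(module, top=5):
    """Return the `top` slowest imports as (cumulative_us, name) pairs."""
    # -X importtime makes the interpreter log per-module timings to stderr.
    proc = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", f"import {module}"],
        capture_output=True,
        text=True,
        check=True,
    )
    rows = []
    for line in proc.stderr.splitlines():
        if not line.startswith("import time:"):
            continue
        parts = line[len("import time:"):].split("|")
        if len(parts) != 3:
            continue
        try:
            cumulative_us = int(parts[1].strip())
        except ValueError:
            continue  # header row, not a timing row
        rows.append((cumulative_us, parts[2].strip()))
    # Largest cumulative time first.
    return sorted(rows, reverse=True)[:top]


if __name__ == "__main__":
    # Assumption: replace "json" with "metaflow" in an environment that has it.
    for cumulative_us, name in import_breakdown("json"):
        print(f"{cumulative_us / 1000:8.2f} ms  {name}")
```

Cumulative times include nested imports, so a parent package always ranks at or above its children.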

Optimization Targets

Import Time (Phase 1 — ~200ms savings estimated)

Target	Current cost	Estimated savings	Approach
Defer `requests` in metadata providers	128ms	~108ms	Lazy import inside ServiceMetadataProvider
Lazy-load Kubernetes clients	50ms	~48ms	Conditional import when K8s decorator used
Defer `asyncio` in subprocess_manager	91ms	~41ms	Import inside async functions only
Defer YAML/cards infrastructure	52ms	~37ms	Move YAML import to card render time
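The deferral approach in the table above amounts to moving an import from module level into the function that needs it. The following is a minimal sketch of that pattern, not Metaflow's actual code. `ExampleMetadataProvider` is a hypothetical class, and the standard-library `json` module stands in for `requests` so the example runs without third-party packages.

```python
import sys


class ExampleMetadataProvider:
    """Hypothetical stand-in for a provider that needs a heavy dependency."""

    def encode(self, record):
        # Deferred import: the module is loaded on the first call, not when
        # the enclosing module is imported. `json` stands in for `requests`.
        import json

        return json.dumps(record, sort_keys=True)


provider = ExampleMetadataProvider()
encoded = provider.encode({"run_id": 1, "flow": "demo"})
print(encoded)
print("json" in sys.modules)  # the module is loaded once encode() has run
```

Python caches loaded modules in `sys.modules`, so repeated calls pay only a lookup rather than a full import. The trade-off is that a missing dependency surfaces at first use instead of at startup.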

Runtime (Phase 2)

Target	File	Approach
Double gzip compression	`content_addressed_store.py`	Single compression, tune level
SHA1 content hashing	`content_addressed_store.py`	Switch to xxHash/BLAKE3
Sleep-based polling	`multicore_utils.py`	Event-based waiting
Extension loading cache	`extension_support/__init__.py`	Mtime-based cache
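As an illustration of the first row, a content-addressed write can hash the raw bytes and compress them exactly once at a configurable level. This is a sketch under assumptions, not the code in `content_addressed_store.py`. The function names and the default level of 3 are hypothetical, and SHA1 is kept as the hash because xxHash and BLAKE3 are not in the Python standard library.

```python
import gzip
import hashlib


def store_blob(raw, compresslevel=3):
    """Return (key, payload) for a content-addressed store.

    The key is derived from the uncompressed bytes, and the bytes are
    gzip-compressed exactly once at a tunable level.
    """
    key = hashlib.sha1(raw).hexdigest()
    payload = gzip.compress(raw, compresslevel=compresslevel)
    return key, payload


def load_blob(payload):
    """Inverse of the compression step in store_blob."""
    return gzip.decompress(payload)


data = b"example artifact " * 1000
key, payload = store_blob(data)
restored = load_blob(payload)
print(len(data), len(payload), restored == data)
```

Lower gzip levels trade compression ratio for speed, so the right level depends on artifact size and storage cost. Changing the hash function changes the key for identical content, so a switch would need a plan for artifacts that are already stored.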

Results

No optimizations applied yet.

Benchmark	Before	After	Speedup
`import metaflow`	513ms	—	—
`metaflow --version` CLI	~1.8s	—	—
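One way to reproduce the import benchmark is to time the import inside a fresh interpreter, so that previously loaded modules do not skew the result. The sketch below is an assumption about methodology, not the harness used for the numbers above. `time_import` is a hypothetical helper, and `json` stands in for `metaflow`.

```python
import statistics
import subprocess
import sys


def time_import(module, repeats=5):
    """Median seconds to import `module` in a fresh interpreter."""
    # The child process times only the import, excluding interpreter startup.
    snippet = (
        "import time\n"
        "start = time.perf_counter()\n"
        f"import {module}\n"
        "print(time.perf_counter() - start)\n"
    )
    samples = []
    for _ in range(repeats):
        out = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True,
            text=True,
            check=True,
        ).stdout
        # Use the last line in case the module prints during import.
        samples.append(float(out.strip().splitlines()[-1]))
    return statistics.median(samples)


if __name__ == "__main__":
    # Assumption: replace "json" with "metaflow" where it is installed.
    print(f"{time_import('json') * 1000:.1f} ms")
```

Taking the median over several runs reduces the effect of disk cache warm-up on the first sample.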

PRs

None yet.

PR	Branch	Status	Description
