Add metaflow and core-product case studies, rename pypa to python (#18)
- Rename case-studies/pypa/ → case-studies/python/ to match .codeflash/ convention
- Add case-studies/netflix/metaflow/summary.md (7-18x lz4 vs gzip)
- Add case-studies/unstructured/core-product/summary.md (14.6% latency, 2.1 GB memory)
- Update main README results table with all five case studies
Parent: 09ba9b44b2
Commit: 1734199e85
4 changed files with 120 additions and 1 deletion
@@ -7,8 +7,10 @@ Autonomous performance optimization platform. Profiles code, implements optimiza
 | Project | Result | Details |
 |---|---|---|
 | Rich | 2x Console import (79ms → 34ms) | [summary](case-studies/textualize/rich/summary.md) |
-| pip | 7x `--version` (138ms → 20ms), 1.81x resolver | [summary](case-studies/pypa/pip/summary.md) |
+| pip | 7x `--version` (138ms → 20ms), 1.81x resolver | [summary](case-studies/python/pip/summary.md) |
 | typeagent-py | 2.6x query path, 1.16x import + indexing | [summary](case-studies/microsoft/typeagent/summary.md) |
+| core-product | 14.6% latency, 2.1 GB memory savings | [summary](case-studies/unstructured/core-product/summary.md) |
+| metaflow | 7-18x artifact compression (lz4 vs gzip) | [summary](case-studies/netflix/metaflow/summary.md) |

 ## Domains
case-studies/netflix/metaflow/summary.md (new file, 57 lines)

@@ -0,0 +1,57 @@

# metaflow Optimization — Lessons Learned

Full case study: [.codeflash/krrt7/netflix/metaflow/](../../../.codeflash/krrt7/netflix/metaflow/)

## Context

[metaflow](https://github.com/Netflix/metaflow) is Netflix's open-source Python framework for building and managing data science and ML workflows. It handles workflow orchestration, versioning, and artifact storage across local, cloud, and Kubernetes environments.

We targeted the content-addressed store (CAS) — the artifact persistence layer that serializes, compresses, and hashes every artifact saved by a flow step.

## What we did

### Replace gzip with lz4 in CAS

The CAS compresses every artifact with `gzip.compress()` before storing it. gzip is slow on exactly the large payloads (embeddings, model weights, numpy arrays) that dominate ML workflows. We replaced it with lz4 frame compression, which trades a marginal loss in compression ratio for dramatically faster throughput.

We also added a `cas_version=2` header to new blobs so that old metaflow installations can detect lz4-compressed artifacts and fail gracefully rather than silently corrupting data.
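
The shape of the change, as a minimal sketch (the version byte and helper names are illustrative, not the PR's actual code):

```python
import gzip

import lz4.frame  # third-party: pip install lz4

# Illustrative version marker; the PR's real blob layout may differ.
# gzip output always starts with \x1f\x8b, so this byte cannot collide.
CAS_VERSION_LZ4 = b"\x02"

def compress_blob(payload: bytes) -> bytes:
    # New blobs carry the version header, letting older readers detect
    # the format and fail loudly instead of mis-decoding the bytes.
    return CAS_VERSION_LZ4 + lz4.frame.compress(payload)

def decompress_blob(blob: bytes) -> bytes:
    if blob.startswith(CAS_VERSION_LZ4):
        return lz4.frame.decompress(blob[1:])
    # Legacy blobs are plain gzip; keep reading them transparently.
    return gzip.decompress(blob)
```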

## Results

Benchmarked on Azure Standard_D2s_v5 with hyperfine and pytest-benchmark, using realistic artifact payloads:

| Payload | Pickled size | gzip | lz4 | Speedup |
|---|---:|---:|---:|---:|
| Small dict (config) | 233 B | 0.341ms | 0.218ms | **1.6x** |
| Metrics dict (feature stats) | 52 KB | 2.278ms | 0.327ms | **7.0x** |
| Numpy float64 (embeddings) | 800 KB | 29.1ms | 1.56ms | **18.7x** |
| Numpy float64 (model weights) | 8 MB | 289ms | 15.8ms | **18.3x** |
| Random bytes (opaque model) | 5 MB | 118ms | 9.65ms | **12.3x** |

For realistic ML workloads (embeddings, model weights), the improvement is 12-18x.

## Upstream status

| PR | Status | Description |
|---|---|---|
| [Netflix/metaflow#3090](https://github.com/Netflix/metaflow/pull/3090) | Open, awaiting review | Replace gzip with lz4 in CAS |

## Key takeaways

### 1. Dependency decisions are the real review blocker

The optimization itself is simple and the benchmarks are clear, but the open questions on the PR are all about dependency management: should lz4 be a hard or a soft dependency? What is the forward-compatibility story for old installations reading new artifacts? These decisions require maintainer alignment, not just benchmark numbers.
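
If the maintainers opt for a soft dependency, the usual pattern is an optional import with a gzip fallback; a sketch of that tradeoff (not what the PR currently does):

```python
import gzip

try:
    import lz4.frame
    HAS_LZ4 = True
except ImportError:
    HAS_LZ4 = False

def compress(payload: bytes) -> bytes:
    # Fast path when the optional dependency is present; otherwise fall
    # back to stdlib gzip so the install degrades instead of breaking.
    if HAS_LZ4:
        return lz4.frame.compress(payload)
    return gzip.compress(payload)
```

Note that a soft dependency sharpens the compatibility question: an installation without lz4 can write gzip blobs but cannot read lz4 ones, which is exactly the kind of decision that needs maintainer buy-in.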

### 2. Profile the actual workload, not synthetic data

Our initial profiling of metaflow's import time (~513ms) identified several deferred-import opportunities. But the CAS compression turned out to be the far more impactful target — it runs on every artifact save, and ML artifacts are large. The estimated ~200ms of import-time savings pale next to 18x on an 8 MB model-weight save.

### 3. SHA1 hashing is a potential follow-up

Content hashing uses SHA1, which is unnecessarily slow for non-cryptographic content addressing. xxHash or BLAKE3 would be faster, but that is a separate discussion with broader compatibility implications (a proposal is drafted in `data/sha1-proposal.md`).
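
A quick way to sanity-check the claim on your own machine (xxhash is a third-party package; absolute numbers will vary with hardware):

```python
import hashlib
import time

import xxhash  # third-party: pip install xxhash

payload = bytes(8 * 1024 * 1024)  # 8 MB, roughly a model-weights blob

for name, digest in [
    ("sha1", lambda d: hashlib.sha1(d).hexdigest()),
    ("xxh3-128", lambda d: xxhash.xxh3_128(d).hexdigest()),
]:
    start = time.perf_counter()
    digest(payload)
    print(f"{name}: {(time.perf_counter() - start) * 1000:.2f} ms")
```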

## Applicable to codeflash

- **Compression algorithm choice**: If codeflash serializes or caches any data (ASTs, benchmark results, test outputs), audit the compression algorithm. gzip is rarely the right choice for IPC or local caching.
- **Dependency management for perf wins**: When an optimization requires a new dependency, the technical win is only half the battle — plan for the maintainer's concerns about dependency footprint, compatibility, and optional vs. required.
- **Forward compatibility**: Any format change to persisted data needs a versioning scheme so that old versions fail gracefully.
case-studies/unstructured/core-product/summary.md (new file, 60 lines)

@@ -0,0 +1,60 @@

# core-product Optimization — Lessons Learned

Full case study: [.codeflash/krrt7/unstructured/core-product/](../../../.codeflash/krrt7/unstructured/core-product/)

## Context

[core-product](https://github.com/Unstructured-IO/core-product) is Unstructured-IO's document processing pipeline — PDF partitioning, OCR, layout detection, and element extraction. It runs on Knative pods (1 CPU, 32 GB RAM) as a multi-package uv workspace on Python 3.12.

We targeted latency and memory in the OCR-heavy PDF processing path, matching production pod constraints (single CPU, high memory).

## What we did (by impact)

### BMP render format (14.6% cumulative latency)

We replaced PNG with BMP in the pdfium process-isolation worker. BMP is uncompressed, which eliminates ~90ms/page of PNG compression on write and decompression on read. The rendered images are only transferred between processes via shared memory, never stored to disk or sent over the network, so the larger size has no cost.
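
The pattern, sketched with Pillow (the helper names are illustrative, not the worker's actual code):

```python
import io

from PIL import Image  # third-party: pip install Pillow

def encode_for_ipc(page: Image.Image) -> bytes:
    # BMP is a fixed header plus raw pixels: no compression pass on
    # write, no decompression pass on read. This only pays off because
    # the bytes cross a shared-memory boundary and are never persisted.
    buf = io.BytesIO()
    page.save(buf, format="BMP")
    return buf.getvalue()

def decode_from_ipc(data: bytes) -> Image.Image:
    page = Image.open(io.BytesIO(data))
    page.load()  # force the (cheap) decode before the buffer is reused
    return page
```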

### CPU-aware serial OCR (5.9% latency + 2.1 GB memory)

We used `os.sched_getaffinity(0)` to detect the available CPUs (it respects cgroup limits and taskset masks). On single-CPU pods, the OCR worker pool is never created, avoiding four idle workers that each load duplicate OCR/ONNX models into ~500 MB of private memory.
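
The detection logic, roughly (a sketch of the pattern rather than the merged code):

```python
import os
from concurrent.futures import ProcessPoolExecutor

# Unlike os.cpu_count(), sched_getaffinity(0) reflects cgroup CPU
# limits and taskset masks, i.e. what the Knative pod actually gets.
# (Linux-only; fall back to os.cpu_count() on other platforms.)
available_cpus = len(os.sched_getaffinity(0))

if available_cpus > 1:
    ocr_pool = ProcessPoolExecutor(max_workers=available_cpus)
else:
    # Single CPU: run OCR inline and skip the pool entirely, avoiding
    # duplicate model loads in workers that could never run in parallel.
    ocr_pool = None
```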

### Prior merges (before benchmarking infrastructure)

- Free page image before table OCR (#1448)
- Resize-first numpy preprocessing for YOLOX (#1441)
- Replace custom lazyproperty with stdlib `cached_property` (#1464)
- Reduce attribute lookups in `elements_intersect_vertically` (#1481)
- Fix blocking event loop in CSV merge (#1400)

## Results

| Benchmark | Baseline | Optimized | Improvement |
|---|---:|---:|---:|
| 10-page scan (latency) | 35.91s | 30.68s | **-14.6%** |
| 16-page mixed (latency) | 65.32s | 56.63s | **-13.3%** |
| 1-page tables (latency) | 2.93s | 2.72s | **-7.3%** |
| Process-tree RSS (10p scan) | 3,491 MB | 1,398 MB | **-2,093 MB (60%)** |

## Upstream status

Five PRs merged, three in draft. Everything was benchmarked with a consistent methodology: 5 rounds, 1 warmup, median reported, <0.4% stddev, CPU-pinned to match production.
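
With pytest-benchmark, that methodology maps onto pedantic mode roughly as follows (`partition_pdf` and the fixture path are hypothetical stand-ins):

```python
def test_partition_10_page_scan(benchmark):
    # Mirrors the methodology above: fixed rounds plus a warmup round;
    # pytest-benchmark reports the median alongside the other stats.
    benchmark.pedantic(
        partition_pdf,  # hypothetical function under test
        args=("fixtures/10_page_scan.pdf",),
        rounds=5,
        warmup_rounds=1,
    )
```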

## Key takeaways

### 1. Match your benchmark environment to production

CPU pinning with `taskset -c 0` on the VM exposed the OCR worker pool waste — without it, the 4-worker pool looked fine because the VM had 4 cores. Production pods get only 1 CPU, and there the pool's overhead was massive (2.1 GB of extra memory, 5.9% latency).

### 2. Image format choices matter in IPC paths

PNG compression/decompression was costing ~90ms/page in an IPC path where the image never leaves the machine. BMP eliminated that cost entirely. The lesson: audit your serialization formats on internal paths — compression that makes sense for network or disk transfer is pure overhead for shared-memory IPC.

### 3. Cumulative progression benchmarking catches interaction effects

Stacking optimizations and measuring the cumulative progression (not just individual deltas) revealed that BMP + serial OCR together gave more than the sum of their individual improvements, because serial OCR eliminated the per-worker PNG decode overhead entirely.

## Applicable to codeflash

- **Subprocess isolation overhead**: If codeflash uses process pools for isolation, check whether the pool size matches the available CPUs — over-provisioning wastes memory in constrained environments.
- **Serialization format audit**: Any IPC path serializing data between processes should use the cheapest format, not one designed for storage or network transfer.
- **Production-matching benchmarks**: Always benchmark under the same resource constraints as production — results on unconstrained hardware can be misleading.