# core-product Optimization — Lessons Learned

Full case study: `.codeflash/krrt7/unstructured/core-product/`
## Context
core-product is Unstructured-IO's document processing pipeline — PDF partitioning, OCR, layout detection, and element extraction. Runs on Knative pods (1 CPU, 32 GB RAM). Python 3.12, multi-package uv workspace.
We targeted latency and memory in the OCR-heavy PDF processing path, matching production pod constraints (single CPU, high memory).
## What we did (by impact)
### BMP render format (14.6% cumulative latency)
Replaced PNG with BMP in the pdfium process-isolation worker. BMP is uncompressed, which eliminates ~90 ms/page of PNG compression on write and decompression on read. The rendered images are only transferred between processes via shared memory, never stored to disk or sent over the network, so the larger payload has no cost.
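A minimal sketch of the format swap, assuming Pillow at the encode/decode boundary (the function names are illustrative, not the upstream worker's API):

```python
# Illustrative sketch, not the upstream pdfium worker implementation.
from io import BytesIO

from PIL import Image


def encode_for_ipc(image: Image.Image) -> bytes:
    """Serialize a rendered page for hand-off to another process.

    BMP is uncompressed: no deflate work on write, no inflate on read.
    The bytes only cross a shared-memory boundary, so the larger size is
    free while the PNG codec time per page is not.
    """
    buf = BytesIO()
    image.save(buf, format="BMP")  # previously format="PNG"
    return buf.getvalue()


def decode_from_ipc(payload: bytes) -> Image.Image:
    # BMP decode is essentially a copy of the pixel rows.
    return Image.open(BytesIO(payload)).convert("RGB")
```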
### CPU-aware serial OCR (5.9% latency + 2.1 GB memory)
Used `os.sched_getaffinity(0)` to detect available CPUs (it respects cpuset cgroups and taskset masks, which `os.cpu_count()` does not). On single-CPU pods, the OCR worker pool is never created, which avoids 4 idle workers each loading a duplicate OCR/ONNX model into ~500 MB of private memory.
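A sketch of the decision, assuming a `ProcessPoolExecutor`-style worker pool (`usable_cpus` and `ocr_pages` are illustrative names):

```python
# Illustrative sketch of the CPU-aware pool decision.
import os
from concurrent.futures import ProcessPoolExecutor


def usable_cpus() -> int:
    # sched_getaffinity(0) reflects cpuset cgroups and taskset masks
    # (Linux-only); os.cpu_count() would report the host's full core count.
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:
        return os.cpu_count() or 1


def ocr_pages(pages, ocr_one):
    n = usable_cpus()
    if n <= 1:
        # Single-CPU pod: run serially and never spawn workers that would
        # each load a duplicate ~500 MB OCR/ONNX model.
        return [ocr_one(page) for page in pages]
    with ProcessPoolExecutor(max_workers=n) as pool:
        return list(pool.map(ocr_one, pages))
```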
## Prior merges (before benchmarking infrastructure)
- Free page image before table OCR (#1448)
- Resize-first numpy preprocessing for YOLOX (#1441)
- Replace custom `lazyproperty` with stdlib `cached_property` (#1464)
- Reduce attribute lookups in `elements_intersect_vertically` (#1481)
- Fix blocking event loop in CSV merge (#1400)
## Results
| Benchmark | Baseline | Optimized | Improvement |
|---|---|---|---|
| 10-page scan (latency) | 35.91s | 30.68s | -14.6% |
| 16-page mixed (latency) | 65.32s | 56.63s | -13.3% |
| 1-page tables (latency) | 2.93s | 2.72s | -7.3% |
| Process-tree RSS (10-page scan) | 3,491 MB | 1,398 MB | -2,093 MB (60%) |
## Upstream status
5 PRs merged, 3 in draft. Benchmarks used 5 measurement rounds after 1 warmup run, reporting the median (<0.4% stddev), CPU-pinned to match production.
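A minimal harness matching that methodology; `run_pipeline` is a placeholder for the partitioning call under test:

```python
# Illustrative harness; round/warmup counts follow the methodology above.
import statistics
import time


def benchmark(run_pipeline, rounds: int = 5, warmup: int = 1) -> dict:
    # run_pipeline: zero-arg callable executing the pipeline under test (placeholder).
    # Warmup rounds pay one-time costs (model loads, caches) before measuring.
    for _ in range(warmup):
        run_pipeline()
    samples = []
    for _ in range(rounds):
        start = time.perf_counter()
        run_pipeline()
        samples.append(time.perf_counter() - start)
    median = statistics.median(samples)
    return {
        "median_s": median,
        "rel_stdev": statistics.stdev(samples) / median,
        "samples": samples,
    }
```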
## Key takeaways
### 1. Match your benchmark environment to production
CPU pinning with `taskset -c 0` on the benchmark VM exposed the OCR worker pool waste; without it, the 4-worker pool looked fine because the VM had 4 cores. Production pods get only 1 CPU, where the pool overhead was massive (2.1 GB of extra memory, 5.9% latency).
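We pinned from outside with taskset; the same constraint can also be applied from inside a Python benchmark driver (a Linux-only sketch, not what the case-study scripts actually do):

```python
import os

# Restrict this process and any children it spawns to CPU 0, mimicking a
# 1-CPU Knative pod; roughly equivalent to launching under `taskset -c 0`.
os.sched_setaffinity(0, {0})
```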
### 2. Image format choices matter in IPC paths
PNG compression/decompression was costing ~90 ms/page in an IPC path where the image never leaves the machine; BMP eliminated that entirely. The lesson: audit your serialization formats for internal paths, because compression that makes sense for network or disk transfer is pure overhead for shared-memory IPC.
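For shared-memory hand-offs the cheapest "format" is often just the raw pixel buffer plus its shape; a stdlib sketch of that idea (illustrative, not the pdfium worker's actual protocol):

```python
# Illustrative shared-memory hand-off; not the upstream worker protocol.
from multiprocessing import shared_memory

import numpy as np


def publish(pixels: np.ndarray) -> tuple[str, tuple, str]:
    """Copy an uncompressed pixel array into shared memory and return the
    name/shape/dtype the consumer needs to reattach. No encode step at all."""
    shm = shared_memory.SharedMemory(create=True, size=pixels.nbytes)
    np.ndarray(pixels.shape, dtype=pixels.dtype, buffer=shm.buf)[:] = pixels
    shm.close()
    return shm.name, pixels.shape, str(pixels.dtype)


def consume(name: str, shape: tuple, dtype: str) -> np.ndarray:
    shm = shared_memory.SharedMemory(name=name)
    # Copy out so the segment can be unlinked right away.
    pixels = np.ndarray(shape, dtype=dtype, buffer=shm.buf).copy()
    shm.close()
    shm.unlink()
    return pixels
```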
### 3. Cumulative progression benchmarking catches interaction effects
Stacking optimizations and measuring cumulative progression (not just individual deltas) revealed that BMP + serial OCR together gave more than the sum of their individual improvements, because serial OCR eliminated the per-worker PNG decode overhead entirely.
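A sketch of how a cumulative progression run differs from per-patch measurement, reusing the `benchmark` helper above (`make_pipeline` and the optimization labels are placeholders):

```python
# Illustrative sketch; make_pipeline(stack) is assumed to return a zero-arg
# callable that runs the pipeline with the given optimizations applied.
def cumulative_progression(make_pipeline, optimizations, benchmark):
    """Benchmark each prefix of the optimization stack rather than each patch
    in isolation, so interaction effects show up in the successive deltas."""
    results = {"baseline": benchmark(make_pipeline([]))["median_s"]}
    for i in range(1, len(optimizations) + 1):
        stack = optimizations[:i]
        results[" + ".join(stack)] = benchmark(make_pipeline(stack))["median_s"]
    return results


# e.g. cumulative_progression(build_variant, ["bmp-render", "serial-ocr"], benchmark)
```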
## Applicable to codeflash
- Subprocess isolation overhead: if codeflash uses process pools for isolation, check whether the pool size matches the available CPUs; over-provisioning wastes memory in constrained environments
- Serialization format audit: any IPC path serializing data between processes should use the cheapest format, not the one designed for storage/network
- Production-matching benchmarks: always benchmark under the same resource constraints as production; results on unconstrained hardware can be misleading