# core-product Optimization — Lessons Learned

Full case study: `.codeflash/krrt7/unstructured/core-product/`
## Context
core-product is Unstructured-IO's document processing pipeline — PDF partitioning, OCR, layout detection, and element extraction. Runs on Knative pods (1 CPU, 32 GB RAM). Python 3.12, multi-package uv workspace.
We targeted latency and memory in the OCR-heavy PDF processing path, matching production pod constraints (single CPU, high memory).
## What we did (by impact)
### BMP render format (14.6% cumulative latency)
Replaced PNG with BMP in the pdfium process-isolation worker. BMP is uncompressed, which eliminates ~90 ms/page of PNG compression on write and decompression on read. The rendered images are only transferred between processes via shared memory, never stored to disk or sent over the network, so the larger payload has no cost.
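A minimal sketch of the format swap, assuming Pillow at the encode/decode boundary (the function names are illustrative, not the upstream worker's API):

```python
# Illustrative sketch, not the upstream pdfium worker implementation.
from io import BytesIO

from PIL import Image


def encode_for_ipc(image: Image.Image) -> bytes:
    """Serialize a rendered page for hand-off to another process.

    BMP is uncompressed: no deflate work on write, no inflate on read.
    The bytes only cross a shared-memory boundary, so the larger size is
    free while the PNG codec time per page is not.
    """
    buf = BytesIO()
    image.save(buf, format="BMP")  # previously format="PNG"
    return buf.getvalue()


def decode_from_ipc(payload: bytes) -> Image.Image:
    # BMP decode is essentially a copy of the pixel rows.
    return Image.open(BytesIO(payload)).convert("RGB")
```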
### CPU-aware serial OCR (5.9% latency + 2.1 GB memory)
Used `os.sched_getaffinity(0)` to detect available CPUs (it respects cpuset cgroups and taskset masks, which `os.cpu_count()` does not). On single-CPU pods, the OCR worker pool is never created, which avoids 4 idle workers each loading a duplicate OCR/ONNX model into ~500 MB of private memory.
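A sketch of the decision, assuming a `ProcessPoolExecutor`-style worker pool (`usable_cpus` and `ocr_pages` are illustrative names):

```python
# Illustrative sketch of the CPU-aware pool decision.
import os
from concurrent.futures import ProcessPoolExecutor


def usable_cpus() -> int:
    # sched_getaffinity(0) reflects cpuset cgroups and taskset masks
    # (Linux-only); os.cpu_count() would report the host's full core count.
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:
        return os.cpu_count() or 1


def ocr_pages(pages, ocr_one):
    n = usable_cpus()
    if n <= 1:
        # Single-CPU pod: run serially and never spawn workers that would
        # each load a duplicate ~500 MB OCR/ONNX model.
        return [ocr_one(page) for page in pages]
    with ProcessPoolExecutor(max_workers=n) as pool:
        return list(pool.map(ocr_one, pages))
```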
## Prior merges (before benchmarking infrastructure)
- Free page image before table OCR (#1448)
- Resize-first numpy preprocessing for YOLOX (#1441)
- Replace custom `lazyproperty` with stdlib `cached_property` (#1464)
- Reduce attribute lookups in `elements_intersect_vertically` (#1481)
- Fix blocking event loop in CSV merge (#1400)
## Results
| Benchmark | Baseline | Optimized | Improvement |
|---|---|---|---|
| 10-page scan (latency) | 35.91s | 30.68s | -14.6% |
| 16-page mixed (latency) | 65.32s | 56.63s | -13.3% |
| 1-page tables (latency) | 2.93s | 2.72s | -7.3% |
| Process-tree RSS (10-page scan) | 3,491 MB | 1,398 MB | -2,093 MB (60%) |
## Upstream status
5 PRs merged, 3 in draft. Benchmarks used 5 measurement rounds after 1 warmup run, reporting the median (<0.4% stddev), CPU-pinned to match production.
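A minimal harness matching that methodology; `run_pipeline` is a placeholder for the partitioning call under test:

```python
# Illustrative harness; round/warmup counts follow the methodology above.
import statistics
import time


def benchmark(run_pipeline, rounds: int = 5, warmup: int = 1) -> dict:
    # run_pipeline: zero-arg callable executing the pipeline under test (placeholder).
    # Warmup rounds pay one-time costs (model loads, caches) before measuring.
    for _ in range(warmup):
        run_pipeline()
    samples = []
    for _ in range(rounds):
        start = time.perf_counter()
        run_pipeline()
        samples.append(time.perf_counter() - start)
    median = statistics.median(samples)
    return {
        "median_s": median,
        "rel_stdev": statistics.stdev(samples) / median,
        "samples": samples,
    }
```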
## Key takeaways
### 1. Match your benchmark environment to production
CPU pinning with `taskset -c 0` on the benchmark VM exposed the OCR worker pool waste; without it, the 4-worker pool looked fine because the VM had 4 cores. Production pods get only 1 CPU, where the pool overhead was massive (2.1 GB of extra memory, 5.9% latency).
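We pinned from outside with taskset; the same constraint can also be applied from inside a Python benchmark driver (a Linux-only sketch, not what the case-study scripts actually do):

```python
import os

# Restrict this process and any children it spawns to CPU 0, mimicking a
# 1-CPU Knative pod; roughly equivalent to launching under `taskset -c 0`.
os.sched_setaffinity(0, {0})
```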
### 2. Image format choices matter in IPC paths
PNG compression/decompression was costing ~90 ms/page in an IPC path where the image never leaves the machine; BMP eliminated that entirely. The lesson: audit your serialization formats for internal paths, because compression that makes sense for network or disk transfer is pure overhead for shared-memory IPC.
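For shared-memory hand-offs the cheapest "format" is often just the raw pixel buffer plus its shape; a stdlib sketch of that idea (illustrative, not the pdfium worker's actual protocol):

```python
# Illustrative shared-memory hand-off; not the upstream worker protocol.
from multiprocessing import shared_memory

import numpy as np


def publish(pixels: np.ndarray) -> tuple[str, tuple, str]:
    """Copy an uncompressed pixel array into shared memory and return the
    name/shape/dtype the consumer needs to reattach. No encode step at all."""
    shm = shared_memory.SharedMemory(create=True, size=pixels.nbytes)
    np.ndarray(pixels.shape, dtype=pixels.dtype, buffer=shm.buf)[:] = pixels
    shm.close()
    return shm.name, pixels.shape, str(pixels.dtype)


def consume(name: str, shape: tuple, dtype: str) -> np.ndarray:
    shm = shared_memory.SharedMemory(name=name)
    # Copy out so the segment can be unlinked right away.
    pixels = np.ndarray(shape, dtype=dtype, buffer=shm.buf).copy()
    shm.close()
    shm.unlink()
    return pixels
```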
### 3. Cumulative progression benchmarking catches interaction effects
Stacking optimizations and measuring cumulative progression (not just individual deltas) revealed that BMP + serial OCR together gave more than the sum of their individual improvements, because serial OCR eliminated the per-worker PNG decode overhead entirely.
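A sketch of how a cumulative progression run differs from per-patch measurement, reusing the `benchmark` helper above (`make_pipeline` and the optimization labels are placeholders):

```python
# Illustrative sketch; make_pipeline(stack) is assumed to return a zero-arg
# callable that runs the pipeline with the given optimizations applied.
def cumulative_progression(make_pipeline, optimizations, benchmark):
    """Benchmark each prefix of the optimization stack rather than each patch
    in isolation, so interaction effects show up in the successive deltas."""
    results = {"baseline": benchmark(make_pipeline([]))["median_s"]}
    for i in range(1, len(optimizations) + 1):
        stack = optimizations[:i]
        results[" + ".join(stack)] = benchmark(make_pipeline(stack))["median_s"]
    return results


# e.g. cumulative_progression(build_variant, ["bmp-render", "serial-ocr"], benchmark)
```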
## Applicable to codeflash
- Subprocess isolation overhead: if codeflash uses process pools for isolation, check whether the pool size matches the available CPUs; over-provisioning wastes memory in constrained environments
- Serialization format audit: any IPC path serializing data between processes should use the cheapest format, not the one designed for storage/network
- Production-matching benchmarks: always benchmark under the same resource constraints as production; results on unconstrained hardware can be misleading