
How Unstructured Reduced Pod Memory 87.5% and Saved $107K/Year on Document Processing Infrastructure

87.5% memory reduction | $107K/yr infrastructure savings | 24 PRs merged across 5 repos — in 7 weeks


The company

Unstructured builds document processing infrastructure — PDF partitioning, OCR, layout detection, and element extraction — used by enterprises to convert unstructured documents into structured data at scale. Their core pipeline runs on AKS (Azure Kubernetes Service), processing documents through a chain of ML models including ONNX-based layout detection, PaddleOCR, and tesseract.

The problem

Unstructured's document processing pods were requesting 32 GB of RAM each — and still occasionally OOM'ing. On Standard_D48s_v5 nodes (48 vCPU, 192 GB RAM), this meant only 5 pods per node, with RAM as the bottleneck. At ~$2,300/month per node, infrastructure costs were scaling linearly with document volume, and the team had limited visibility into why memory consumption was so high.

The root cause wasn't a single leak — it was a combination of factors that only became visible once optimization work began peeling back layers. The pods ran with a 1-CPU limit, but Python's os.cpu_count() returned the host's full 48-core count, so the pipeline spawned redundant OCR worker processes that each loaded the complete ONNX model set into memory. Fixing that revealed the next issue: RSS was still growing 24 MB with every request — not from a classical memory leak, but from heavy per-page PIL image churn and framework allocator fragmentation that had been masked by the much larger worker-pool overhead (memray confirmed 10 GB total allocated across a single 10-page scan, with a 1.0 GB peak on the serial path). And once the obvious memory waste was resolved, profiling exposed latency bottlenecks that had been invisible before: image serialization between processes used PNG compression, adding latency for data that never left the machine, and a text extraction function had O(N²) scaling that re-scanned the full character list on every patch operation, compounding on longer documents. These weren't problems anyone had reason to look for; they surfaced because each optimization shifted the profile and revealed what was hiding underneath.
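
The mismatch is easy to demonstrate. A two-line check, assuming a Linux process pinned to a single core (via taskset or a Kubernetes cpuset):

```python
import os

# On a pod confined to one core, os.cpu_count() still reports the host's
# full core count, while the scheduler affinity mask reflects what the
# process can actually run on.
print(os.cpu_count())                # 48 on a Standard_D48s_v5 host
print(len(os.sched_getaffinity(0)))  # 1 under `taskset -c 0` or a cpuset
```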

Without intervention, Unstructured's infrastructure spend on core-product compute alone would continue growing at $120K+ annually as document volume increased.

The approach

Over 7 weeks, Codeflash profiled the pipeline end-to-end under production-equivalent constraints — CPU-pinned to a single core on identical Azure hardware, matching the exact resource limits of production pods. This was critical: profiling on unconstrained hardware would have hidden the primary memory bottleneck entirely, since the 4-worker OCR pool performs fine when 4 cores are actually available.

The optimization touched 5 repositories across the Unstructured ecosystem (core-product, unstructured, unstructured-inference, unstructured-od-models, github-workflows). Each round of optimization shifted the performance profile, revealing issues that had been masked by larger problems upstream:

Memory: eliminating redundant model loading and per-request growth. The largest single win came from making OCR worker allocation CPU-aware. Using os.sched_getaffinity(0) (which respects cgroup limits and taskset masks), the pipeline now detects when it's running on a single-CPU pod and skips the worker pool entirely — eliminating 4 idle workers that each loaded ~500 MB of duplicate ONNX models into private memory. Combined with model loading optimizations across the inference stack (replacing HuggingFace image processors with torchvision, reading ONNX metadata from session objects instead of full protobuf loads, and switching to jemalloc to reduce Python heap fragmentation), peak memory dropped from 4,651 MB to 2,227 MB, per-request RSS growth fell from 24 MB to 17 MB, and memory is now constant regardless of document count — ~1,400 MB whether processing 10 or 400 pages, eliminating the slow memory creep that eventually triggered OOM kills under sustained load.
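
A minimal sketch of that CPU-aware sizing logic, with hypothetical function names standing in for the real pipeline code:

```python
import os
from concurrent.futures import ProcessPoolExecutor

def ocr_page(page):
    # Placeholder for the real OCR call (illustrative only).
    return f"text for {page}"

def available_cpus() -> int:
    # os.sched_getaffinity(0) reflects taskset masks and cpuset limits;
    # os.cpu_count() reports the host's full core count regardless.
    return len(os.sched_getaffinity(0))

def run_ocr(pages):
    if available_cpus() <= 1:
        # Single-CPU pod: run serially rather than spawning workers that
        # would each load a duplicate ONNX model set into private memory.
        return [ocr_page(p) for p in pages]
    with ProcessPoolExecutor(max_workers=available_cpus()) as pool:
        return list(pool.map(ocr_page, pages))
```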

Latency: removing unnecessary serialization and fixing algorithmic complexity. PDF pages rendered by pdfium were being compressed to PNG for inter-process transfer — then decompressed on the other side. Since these images only exist in shared memory between processes and are never stored or transmitted, the compression was pure overhead: ~90ms per page, adding up across multi-page documents. A text extraction function (_patch_current_chars_with_render_mode) had O(N²) complexity — re-scanning the full character list on every patch operation, causing processing time to grow quadratically on text-heavy documents. Switching to uncompressed BMP, passing file paths directly to tesseract (instead of PIL → numpy → PIL → temp-file round-trips), and fixing the quadratic scan reduced end-to-end latency from 50.8s to 44.3s on a 10-page production workload.
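
To make the quadratic pattern concrete, here is an illustrative before/after; the data model is invented for the sketch, while the real function (_patch_current_chars_with_render_mode) operates on pdfium character data:

```python
def patch_quadratic(chars, patches):
    # O(N^2): every patch re-scans the full character list to find its target.
    for patch in patches:
        for i, ch in enumerate(chars):
            if ch["index"] == patch["index"]:
                chars[i] = {**ch, **patch}
                break
    return chars

def patch_linear(chars, patches):
    # O(N): build a position lookup once, then apply each patch directly.
    pos = {ch["index"]: i for i, ch in enumerate(chars)}
    for patch in patches:
        i = pos.get(patch["index"])
        if i is not None:
            chars[i] = {**chars[i], **patch}
    return chars
```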

Reliability: fixing async bottlenecks. Three blocking event loop patterns in the FastAPI path (gzip decompression, PDF validation, CSV response merging) were identified and fixed, improving throughput under concurrent load.
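
A minimal sketch of the fix pattern for one of the three (gzip decompression), assuming a FastAPI handler; the route and handler names are illustrative, not the actual endpoint:

```python
import asyncio
import gzip

from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/process")  # hypothetical route
async def process(request: Request):
    body = await request.body()
    # gzip.decompress is synchronous and CPU-bound; calling it inline stalls
    # the event loop for every in-flight request. asyncio.to_thread moves it
    # to a worker thread so the loop keeps serving other requests.
    payload = await asyncio.to_thread(gzip.decompress, body)
    return {"decompressed_bytes": len(payload)}
```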

Every change passed through Unstructured's existing CI pipeline and code review process. All 24 PRs were reviewed and merged by the core-product engineering team, with the full test suite (354 tests) passing at each commit.

The results

| Metric | Before | After | Impact |
|---|---|---|---|
| K8s pod memory allocation | 32 GB | 4 GB | 87.5% reduction |
| Peak memory per pod | 4,651 MB | 2,227 MB | 52% reduction |
| Memory scaling | ~1,400 MB at 10 pages | ~1,400 MB at 400 pages | Flat (constant regardless of document count) |
| RSS growth per request | 24 MB | 17 MB | 29% reduction (eliminated OOM creep) |
| End-to-end latency (10-page scan) | 50.8 s | 44.3 s | 12.9% faster |
| Text extraction scaling | O(N²) | O(N) | Linear on long documents |
| Pods per node | 5 | 46 | 9.2x density |
| Monthly compute cost | ~$10,000 | ~$1,100 | ~$8,900/mo savings |
| Annual infrastructure savings | | | ~$107,000/year |

With pods now fitting in 4 GB, RAM is no longer the scheduling bottleneck — CPU is. The same Standard_D48s_v5 nodes that previously held 5 pods can now hold 46, meaning the same 30-pod workload fits on a single node instead of six.
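
As a rough sketch of the packing arithmetic (illustrative; exact allocatable capacity depends on AKS system reservations, which these figures only approximate):

```python
NODE_RAM_GB, NODE_VCPU = 192, 48  # Standard_D48s_v5

# Before: 32 GB memory requests make RAM the binding constraint.
print(NODE_RAM_GB // 32)  # 6 by raw division; ~5 in practice after node overhead

# After: 4 GB requests shift the binding constraint to CPU.
print(NODE_RAM_GB // 4)   # 48 by RAM alone; with 1-CPU pods and a couple of
                          # vCPUs reserved for system daemons, ~46 per node
```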

How it was measured

Results were benchmarked on Azure Standard_E4s_v5 instances (4 vCPU, 32 GB RAM) running Ubuntu 24.04 LTS, with taskset -c 0 pinning the process to a single core to match production pod constraints. Each measurement used pytest-benchmark with 5 measured rounds and 1 warmup round, reporting the median. Observed standard deviation was consistently below 0.4% of the median. Memory was measured via process-tree RSS (summing /proc/[pid]/status VmRSS across the main process and all children), which captures worker process memory that single-process profiling tools miss.
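
A minimal sketch of that process-tree RSS measurement (assumes Linux with /proc, including the per-task children file; the function name is illustrative):

```python
import os

def process_tree_rss_kb(root_pid: int) -> int:
    """Sum VmRSS (in kB) across a process and all of its descendants."""
    total, stack = 0, [root_pid]
    while stack:
        pid = stack.pop()
        try:
            with open(f"/proc/{pid}/status") as f:
                for line in f:
                    if line.startswith("VmRSS:"):
                        total += int(line.split()[1])  # reported in kB
                        break
            # Walk children of every thread so worker processes are included.
            for tid in os.listdir(f"/proc/{pid}/task"):
                with open(f"/proc/{pid}/task/{tid}/children") as f:
                    stack.extend(int(c) for c in f.read().split())
        except (FileNotFoundError, ProcessLookupError):
            continue  # process exited while we were walking the tree
    return total
```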

Optimizations were stacked and measured cumulatively rather than in isolation, which revealed interaction effects — for example, serial OCR mode combined with BMP rendering delivered more than the sum of their individual improvements, because eliminating the worker pool also eliminated per-worker PNG decode overhead. This iterative approach was essential: many of the most impactful findings — the per-request RSS creep, the O(N²) text extraction, the PNG serialization overhead — only became visible after earlier optimizations removed the noise that had been masking them.

What Unstructured's team gained

Beyond the direct cost savings, the optimization work delivered compounding benefits:

  • Reliability: Fewer OOM events mean fewer failed jobs, better SLA compliance, and reduced on-call burden for the platform team.
  • Capacity headroom: 9.2x pod density means the existing node pool can absorb significant document volume growth before requiring new infrastructure.
  • Codebase health: 24 merged PRs improved code quality across the ecosystem — replacing custom utilities with stdlib equivalents, fixing async anti-patterns, and removing dead framework overhead.
  • Security visibility: The same codebase audit surfaced 39 security findings across 8 repositories (7 critical, 13 high), including lockfile bypasses in Docker builds and unpinned CI dependencies — risks that were invisible without systematic code review.

Technical appendix

Full PR inventory, benchmark data, and methodology details available in the optimization data directory.

PRs by category

| Category | Merged | In Progress | Repos |
|---|---|---|---|
| Memory | 10 | 1 | core-product, unstructured, unstructured-inference, unstructured-od-models |
| Latency (incl. O(N²) fix) | 5 | 3 | core-product, unstructured, unstructured-od-models |
| Reliability | 3 | | core-product |
| Code quality | 4 | 1 | core-product, unstructured, unstructured-inference, unstructured-od-models |
| CI/CD | 2 | | github-workflows |