# core-product Performance Optimization
Unstructured-IO's document processing pipeline -- PDF partitioning, OCR, layout detection, and element extraction. Python 3.12, multi-package uv workspace.
## Results

Environment: Python 3.12, Ubuntu 24.04 LTS (Azure Standard_E4s_v5, `taskset -c 0`), pytest-benchmark (5 rounds, 1 warmup, median reported)
### Cumulative latency progression
| Stage | 1p-tables | 10p-scan | 16p-mixed |
|---|---|---|---|
| Baseline (main) | 2.93s | 35.91s | 65.32s |
| + CPU-aware serial OCR (#1502) | 2.92s (-0.3%) | 33.80s (-5.9%) | 61.23s (-6.3%) |
| + BMP render format (#1503) | 2.72s (-7.3%) | 30.68s (-14.6%) | 56.63s (-13.3%) |
### Memory
| Optimization | Metric | Before | After | Improvement |
|---|---|---|---|---|
| CPU-aware serial OCR (#1502) | Process-tree RSS (10p scan, post-partition) | 3,491 MB | 1,398 MB | -2,093 MB (60%) |
| CPU-aware serial OCR (#1502) | Process-tree RSS (10p scan, pre-partition) | 2,619 MB | 499 MB | -2,120 MB (81%) |
### Throughput
No throughput regressions detected. Serial OCR path matches pool-on-1-CPU throughput (pool provides no parallelism benefit when pinned to 1 core).
## What We Changed

### Latency
- O(N²) text extraction fix (#4266): `_patch_current_chars_with_render_mode` was re-scanning the full character list on every patch operation — quadratic scaling on text-heavy documents. Replaced with a single-pass approach.
- BMP render format (#1503): Replace PNG with BMP in the pdfium process-isolation worker. BMP is uncompressed — eliminates ~90ms/page of PNG compression on write and decompression on read. 9.2% incremental gain on 10p-scan (on top of serial OCR).
- CPU-aware serial OCR (#1502): Use `os.sched_getaffinity(0)` to detect available CPUs (respects cgroup limits + taskset masks). On single-CPU pods, the OCR worker pool is never created — avoids 4 idle workers each loading duplicate OCR/ONNX models into ~500 MB of private memory. 5.9% latency improvement + 2.1 GB memory savings.
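The quadratic pattern behind the text-extraction fix, and the single-pass replacement, can be sketched in isolation. Function and field names below are illustrative only, not the actual pdfium-path code:

```python
# Illustrative sketch of the O(N^2) -> O(N) change in #4266.
# `chars` is a list of per-character dicts; each patch updates one char.

def patch_chars_quadratic(chars, patches):
    # Original shape: every patch re-scans the full character list,
    # so N patches over N chars costs O(N^2) on text-heavy pages.
    for patch in patches:
        for ch in chars:
            if ch["index"] == patch["index"]:
                ch["render_mode"] = patch["render_mode"]
    return chars

def patch_chars_single_pass(chars, patches):
    # Fixed shape: index the characters once, then apply each patch in O(1),
    # for O(N + P) total work.
    by_index = {ch["index"]: ch for ch in chars}
    for patch in patches:
        ch = by_index.get(patch["index"])
        if ch is not None:
            ch["render_mode"] = patch["render_mode"]
    return chars
```

Both functions produce identical results; only the scan count differs.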
### Memory
- CPU-aware serial OCR (#1502): Saves ~2.1 GB with zero latency cost on single-CPU pods.
- Per-request RSS creep: Reduced from 24 MB/request to 17 MB/request (-29%) via jemalloc and PIL churn reduction. memray confirmed 10 GB total allocated across a single 10-page scan — heavy per-page allocation churn, not classical accumulation.
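The CPU-detection logic behind the serial-OCR change can be sketched as follows. `effective_cpu_count` mirrors the `os.sched_getaffinity(0)` approach described above; `make_ocr_executor` and its pool shape are hypothetical, not the actual #1502 code:

```python
import multiprocessing as mp
import os

def effective_cpu_count() -> int:
    # sched_getaffinity respects taskset masks and cgroup cpusets,
    # unlike os.cpu_count(), which reports all physical CPUs.
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:  # non-Linux platforms lack sched_getaffinity
        return os.cpu_count() or 1

def make_ocr_executor(max_workers: int = 4):
    # Hypothetical pool factory: on a 1-CPU pod, skip the pool entirely so
    # no duplicate OCR/ONNX models are loaded into worker private memory.
    cpus = effective_cpu_count()
    if cpus <= 1:
        return None  # caller falls back to serial, in-process OCR
    return mp.get_context("spawn").Pool(processes=min(max_workers, cpus))
```

Under `taskset -c 0`, `effective_cpu_count()` returns 1 and no pool is created.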
### Prior merges (before benchmarking infrastructure)
- Free page image before table OCR (#1448): Release PIL image memory before table extraction starts
- Resize-first preprocessing (#1441): Resize numpy arrays before YOLOX preprocessing instead of after
- Replace lazyproperty (#1464): Switch from custom lazyproperty to stdlib `functools.cached_property`
- Reduce attribute lookups (#1481): Optimize the `elements_intersect_vertically` inner loop
- Fix blocking event loop (#1400): Replace blocking CSV merge with async implementation
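The lazyproperty swap above follows the standard stdlib pattern. A minimal sketch, with `Page` and `elements` as illustrative names rather than the real classes:

```python
from functools import cached_property

class Page:
    # Before (#1464), a hand-rolled `lazyproperty` descriptor cached values
    # in the instance dict; functools.cached_property does the same with
    # less per-access overhead and no custom descriptor to maintain.
    def __init__(self, raw: str):
        self.raw = raw
        self.parse_count = 0

    @cached_property
    def elements(self):
        self.parse_count += 1  # expensive parse runs exactly once
        return [w.upper() for w in self.raw.split()]
```

Repeated reads of `page.elements` hit the cached value; the body runs once per instance.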
## Upstream Contributions
| PR | Status | Description |
|---|---|---|
| Unstructured-IO/core-product#1503 | Draft | Render PDF pages as BMP instead of PNG in pdfium pool |
| Unstructured-IO/core-product#1502 | Draft | Cap OCR workers to available CPUs — serial mode on 1-CPU pods |
| Unstructured-IO/core-product#1500 | Draft | Benchmark infrastructure and repo conventions |
| Unstructured-IO/core-product#1481 | Merged | Reduce attribute lookups in `elements_intersect_vertically` |
| Unstructured-IO/core-product#1464 | Merged | Replace `lazyproperty` with `functools.cached_property` |
| Unstructured-IO/core-product#1448 | Merged | Free page image before table extraction |
| Unstructured-IO/core-product#1441 | Merged | Resize-first numpy preprocessing for YOLOX |
| Unstructured-IO/core-product#1400 | Merged | Fix blocking event loop in CSV merge |
## Methodology

### Environment
- VM: Azure Standard_E4s_v5 (4 vCPU, 32 GB RAM, memory-optimized)
- OS: Ubuntu 24.04 LTS
- Region: westus2
- Python: 3.12 (project constraint: `>=3.12, <3.13`)
- Tooling: pytest-benchmark (5 rounds, 1 warmup, median reported), memray, cProfile
- CPU pinning: `taskset -c 0` to match the production pod profile (1 CPU request, 32 GB RAM limit)
Non-burstable VM + CPU pinning matches production Knative pod resources. 32 GB RAM matches the pod limit exactly.
### Benchmarking methodology

- `pedantic(rounds=5, warmup_rounds=1)` — 1 warmup round absorbs ONNX model JIT, page cache warming, and pool initialization overhead; 5 measured rounds enable median, IQR, and Tukey outlier detection
- Median reported as primary metric (robust to up to 2 outliers in 5 samples)
- Observed stddev <0.4% of median across all measurements
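The round statistics can be reproduced with the stdlib. This is a simplified take on what pytest-benchmark reports, not its exact internal implementation:

```python
import statistics

def summarize_rounds(samples):
    # Median + IQR summary with Tukey fences over the measured rounds.
    # method="inclusive" treats the 5 rounds as the whole population,
    # which is appropriate for so few samples.
    med = statistics.median(samples)
    q1, _, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # Tukey outlier fences
    return {
        "median": med,
        "iqr": iqr,
        "outliers": [s for s in samples if s < lo or s > hi],
    }
```

With 5 samples the median shrugs off up to 2 outliers, which is why it is the primary metric above.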
### Profiling approach
- cProfile + memray -- identify hot functions and peak memory allocators
- Per-stage benchmark instrumentation -- render, detect, OCR, merge timing breakdown
- Cumulative progression on stacked branch with proper statistical methodology
- Full unit test suite run before and after every change (348 tests)
### Memory measurement

Process-tree RSS measured by summing `/proc/[pid]/status` `VmRSS` for the main process and all direct children. This captures worker process memory that `resource.getrusage(RUSAGE_SELF)` misses.
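A minimal sketch of that measurement (Linux-only, since it reads procfs; the helper names are ours, not from the benchmark scripts):

```python
from pathlib import Path

def rss_kib(pid: int) -> int:
    # Parse VmRSS (reported in kB) from /proc/<pid>/status; 0 if gone.
    try:
        for line in Path(f"/proc/{pid}/status").read_text().splitlines():
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    except FileNotFoundError:
        pass
    return 0

def child_pids(pid: int) -> list[int]:
    # Direct children, via /proc/<pid>/task/<tid>/children (Linux >= 3.5).
    out: list[int] = []
    for task in Path(f"/proc/{pid}/task").iterdir():
        children = task / "children"
        if children.exists():
            out += [int(p) for p in children.read_text().split()]
    return out

def tree_rss_kib(pid: int) -> int:
    # Main process plus direct children, matching the methodology above.
    return rss_kib(pid) + sum(rss_kib(c) for c in child_pids(pid))
```

Summing over children is what catches the ~500 MB-per-worker OCR pool memory that `RUSAGE_SELF` never sees.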
### Runner convention

Benchmark scripts use `.venv/bin/python` directly for accuracy (`uv run` adds overhead). Upstream reproducers use `uv run python` for portability.
## Install notes

core-product is a multi-package uv workspace with three sub-packages (unstructured_prop, unstructured_inference_prop, unstructured-api). All share a single root-level `.venv`:

```sh
uv venv --python 3.12          # create root .venv
export VIRTUAL_ENV=$PWD/.venv  # all `uv sync --active` commands use this
make install                   # syncs all three sub-packages into shared venv
```

Private PyPI access requires:

```sh
export UV_INDEX_UNSTRUCTURED_USERNAME=unstructured
export UV_INDEX_UNSTRUCTURED_PASSWORD=<token from Keeper "Private PyPi Index">
```

System dependencies beyond `build-essential`: `tesseract-ocr libtesseract-dev libleptonica-dev poppler-utils libmagic1 libgl1 libglib2.0-0`
## Repo Structure

```
.
├── README.md          # This file
├── bench/             # Benchmark scripts
├── data/              # Raw benchmark data
│   └── results.tsv
└── infra/             # VM provisioning
    ├── cloud-init.yaml
    └── vm-manage.sh
```