codeflash-agent/.codeflash/unstructured/core-product/README.md
Kevin Turcios 3b59d97647 squash
2026-04-13 14:12:17 -05:00


core-product Performance Optimization

Performance optimization work on Unstructured-IO's document processing pipeline: PDF partitioning, OCR, layout detection, and element extraction. Python 3.12, multi-package uv workspace.

Results

Environment: Python 3.12, Ubuntu 24.04 LTS (Azure Standard_E4s_v5, taskset -c 0), pytest-benchmark (5 rounds, 1 warmup, median reported)

Cumulative latency progression

| Stage | 1p-tables | 10p-scan | 16p-mixed |
|---|---|---|---|
| Baseline (main) | 2.93s | 35.91s | 65.32s |
| + CPU-aware serial OCR (#1502) | 2.92s (-0.3%) | 33.80s (-5.9%) | 61.23s (-6.3%) |
| + BMP render format (#1503) | 2.72s (-7.3%) | 30.68s (-14.6%) | 56.63s (-13.3%) |

Memory

| Optimization | Metric | Before | After | Improvement |
|---|---|---|---|---|
| CPU-aware serial OCR (#1502) | Process-tree RSS (10p scan, post-partition) | 3,491 MB | 1,398 MB | -2,093 MB (60%) |
| CPU-aware serial OCR (#1502) | Process-tree RSS (10p scan, pre-partition) | 2,619 MB | 499 MB | -2,120 MB (81%) |

Throughput

No throughput regressions detected. Serial OCR path matches pool-on-1-CPU throughput (pool provides no parallelism benefit when pinned to 1 core).

What We Changed

Latency

  • BMP render format (#1503): Replace PNG with BMP in the pdfium process-isolation worker. BMP is uncompressed — eliminates ~90ms/page of PNG compression on write and decompression on read. 9.2% incremental gain on 10p-scan (on top of serial OCR).
  • CPU-aware serial OCR (#1502): Use os.sched_getaffinity(0) to detect available CPUs (respects cgroup limits + taskset masks). On single-CPU pods, the OCR worker pool is never created — avoids 4 idle workers each loading duplicate OCR/ONNX models into ~500 MB of private memory. 5.9% latency improvement + 2.1 GB memory savings.
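The CPU-detection logic behind #1502 can be sketched as follows. This is a minimal illustration, not the PR's actual code: `available_cpus`, `ocr_pages`, and `ocr_one` are hypothetical names, and the `cpus` override exists only to make the sketch testable.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def available_cpus() -> int:
    # sched_getaffinity respects taskset masks and cgroup cpusets,
    # unlike os.cpu_count(), which reports all physical CPUs.
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:  # non-Linux platforms
        return os.cpu_count() or 1


def ocr_pages(pages, ocr_one, cpus=None):
    # On a 1-CPU pod a worker pool buys no parallelism but still loads
    # duplicate OCR/ONNX models per worker, so take the serial path.
    n = cpus if cpus is not None else available_cpus()
    if n <= 1:
        return [ocr_one(page) for page in pages]
    with ProcessPoolExecutor(max_workers=n) as pool:
        return list(pool.map(ocr_one, pages))
```

The key design point is using `os.sched_getaffinity(0)` rather than `os.cpu_count()`: only the former reflects the effective CPU mask a pinned or cgroup-limited pod actually gets.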

Memory

  • CPU-aware serial OCR (#1502): Saves ~2.1 GB with zero latency cost on single-CPU pods.

Prior merges (before benchmarking infrastructure)

  • Free page image before table OCR (#1448): Release PIL image memory before table extraction starts
  • Resize-first preprocessing (#1441): Resize numpy arrays before YOLOX preprocessing instead of after
  • Replace lazyproperty (#1464): Switch from custom lazyproperty to stdlib functools.cached_property
  • Reduce attribute lookups (#1481): Optimize elements_intersect_vertically inner loop
  • Fix blocking event loop (#1400): Replace blocking CSV merge with async implementation
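As a minimal illustration of the `lazyproperty` swap (#1464), using a hypothetical `Page` class rather than the real call sites:

```python
import functools


class Page:
    def __init__(self, raw: str):
        self.raw = raw

    @functools.cached_property
    def elements(self) -> list[str]:
        # Computed once on first access, then cached on the instance;
        # the stdlib descriptor replaces the custom lazyproperty.
        return [token.upper() for token in self.raw.split()]
```

Repeated accesses return the same cached list object, so the (potentially expensive) computation runs only once per instance.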

Upstream Contributions

| PR | Status | Description |
|---|---|---|
| Unstructured-IO/core-product#1503 | Draft | Render PDF pages as BMP instead of PNG in pdfium pool |
| Unstructured-IO/core-product#1502 | Draft | Cap OCR workers to available CPUs — serial mode on 1-CPU pods |
| Unstructured-IO/core-product#1500 | Draft | Benchmark infrastructure and repo conventions |
| Unstructured-IO/core-product#1481 | Merged | Reduce attribute lookups in elements_intersect_vertically |
| Unstructured-IO/core-product#1464 | Merged | Replace lazyproperty with functools.cached_property |
| Unstructured-IO/core-product#1448 | Merged | Free page image before table extraction |
| Unstructured-IO/core-product#1441 | Merged | Resize-first numpy preprocessing for YOLOX |
| Unstructured-IO/core-product#1400 | Merged | Fix blocking event loop in CSV merge |

Methodology

Environment

  • VM: Azure Standard_E4s_v5 (4 vCPU, 32 GB RAM, memory-optimized)
  • OS: Ubuntu 24.04 LTS
  • Region: westus2
  • Python: 3.12 (project constraint: >=3.12, <3.13)
  • Tooling: pytest-benchmark (5 rounds, 1 warmup, median reported), memray, cProfile
  • CPU pinning: taskset -c 0 to match production pod profile (1 CPU request, 32 GB RAM limit)

Non-burstable VM + CPU pinning matches production Knative pod resources. 32 GB RAM matches the pod limit exactly.

Benchmarking methodology

  • pedantic(rounds=5, warmup_rounds=1) — 1 warmup absorbs ONNX model JIT, page cache warming, and pool initialization overhead. 5 measured rounds enable median, IQR, and Tukey outlier detection.
  • Median reported as primary metric (robust to up to 2 outliers in 5 samples)
  • Observed stddev <0.4% of median across all measurements

Profiling approach

  1. cProfile + memray -- identify hot functions and peak memory allocators
  2. Per-stage benchmark instrumentation -- render, detect, OCR, merge timing breakdown
  3. Cumulative progression on stacked branch with proper statistical methodology
  4. Full unit test suite run before and after every change (348 tests)

Memory measurement

Process-tree RSS measured by summing /proc/[pid]/status VmRSS for the main process and all direct children. This captures worker process memory that resource.getrusage(RUSAGE_SELF) misses.
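A Linux-only sketch of that measurement. Function names are illustrative (the real harness may differ), and direct children are found by scanning every `/proc/[pid]/status` for a matching `PPid:` field, which works without any optional kernel config:

```python
import os


def vm_rss_kib(pid: int) -> int:
    # Parse VmRSS from /proc/<pid>/status (reported in kB).
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except FileNotFoundError:
        pass  # process exited between listing and reading
    return 0


def children_of(pid: int) -> list[int]:
    # Direct children: every process whose PPid field matches pid.
    kids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/status") as f:
                for line in f:
                    if line.startswith("PPid:"):
                        if int(line.split()[1]) == pid:
                            kids.append(int(entry))
                        break
        except FileNotFoundError:
            continue  # raced with process exit
    return kids


def process_tree_rss_kib(pid: int) -> int:
    # Main process plus all direct children, as described above.
    return vm_rss_kib(pid) + sum(vm_rss_kib(c) for c in children_of(pid))
```

Unlike `resource.getrusage(RUSAGE_SELF)`, this sum includes the OCR worker processes, which is where most of the 2.1 GB saved by #1502 lived.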

Runner convention

Benchmark scripts use .venv/bin/python directly for accuracy (uv run adds overhead). Upstream reproducers use uv run python for portability.

Install notes

core-product is a multi-package uv workspace with three sub-packages (unstructured_prop, unstructured_inference_prop, unstructured-api). All share a single root-level .venv:

uv venv --python 3.12          # create root .venv
export VIRTUAL_ENV=$PWD/.venv   # all uv sync --active commands use this
make install                    # syncs all three sub-packages into shared venv

Private PyPI access requires:

export UV_INDEX_UNSTRUCTURED_USERNAME=unstructured
export UV_INDEX_UNSTRUCTURED_PASSWORD=<token from Keeper "Private PyPi Index">

System dependencies beyond build-essential: tesseract-ocr libtesseract-dev libleptonica-dev poppler-utils libmagic1 libgl1 libglib2.0-0

Repo Structure

.
├── README.md              # This file
├── bench/                 # Benchmark scripts
├── data/                  # Raw benchmark data
│   └── results.tsv
└── infra/                 # VM provisioning
    ├── cloud-init.yaml
    └── vm-manage.sh