codeflash-agent/.codeflash/unstructured/core-product/README.md
Kevin Turcios 3b59d97647 squash
2026-04-13 14:12:17 -05:00


core-product Performance Optimization

Performance optimization work on Unstructured-IO's document processing pipeline: PDF partitioning, OCR, layout detection, and element extraction. Python 3.12, multi-package uv workspace.

Results

Environment: Python 3.12, Ubuntu 24.04 LTS (Azure Standard_E4s_v5, taskset -c 0), pytest-benchmark (5 rounds, 1 warmup, median reported)

Cumulative latency progression

| Stage | 1p-tables | 10p-scan | 16p-mixed |
|---|---|---|---|
| Baseline (main) | 2.93s | 35.91s | 65.32s |
| + CPU-aware serial OCR (#1502) | 2.92s (-0.3%) | 33.80s (-5.9%) | 61.23s (-6.3%) |
| + BMP render format (#1503) | 2.72s (-7.3%) | 30.68s (-14.6%) | 56.63s (-13.3%) |

Memory

| Optimization | Metric | Before | After | Improvement |
|---|---|---|---|---|
| CPU-aware serial OCR (#1502) | Process-tree RSS (10p scan, post-partition) | 3,491 MB | 1,398 MB | -2,093 MB (60%) |
| CPU-aware serial OCR (#1502) | Process-tree RSS (10p scan, pre-partition) | 2,619 MB | 499 MB | -2,120 MB (81%) |

Throughput

No throughput regressions detected. Serial OCR path matches pool-on-1-CPU throughput (pool provides no parallelism benefit when pinned to 1 core).

What We Changed

Latency

  • BMP render format (#1503): Replace PNG with BMP in the pdfium process-isolation worker. BMP is uncompressed — eliminates ~90ms/page of PNG compression on write and decompression on read. 9.2% incremental gain on 10p-scan (on top of serial OCR).
  • CPU-aware serial OCR (#1502): Use os.sched_getaffinity(0) to detect available CPUs (respects cgroup limits + taskset masks). On single-CPU pods, the OCR worker pool is never created — avoids 4 idle workers each loading duplicate OCR/ONNX models into ~500 MB of private memory. 5.9% latency improvement + 2.1 GB memory savings.
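The CPU-detection logic behind #1502 can be sketched as follows. This is a minimal illustration, not the PR's actual code: `available_cpus`, `ocr_pages`, and `ocr_one` are hypothetical names, and the `cpus` override exists only to make the sketch testable.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def available_cpus() -> int:
    # sched_getaffinity respects taskset masks and cgroup cpusets,
    # unlike os.cpu_count(), which reports all physical CPUs.
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:  # non-Linux platforms
        return os.cpu_count() or 1


def ocr_pages(pages, ocr_one, cpus=None):
    # On a 1-CPU pod a worker pool buys no parallelism but still loads
    # duplicate OCR/ONNX models per worker, so take the serial path.
    n = cpus if cpus is not None else available_cpus()
    if n <= 1:
        return [ocr_one(page) for page in pages]
    with ProcessPoolExecutor(max_workers=n) as pool:
        return list(pool.map(ocr_one, pages))
```

The key design point is using `os.sched_getaffinity(0)` rather than `os.cpu_count()`: only the former reflects the effective CPU mask a pinned or cgroup-limited pod actually gets.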

Memory

  • CPU-aware serial OCR (#1502): Saves ~2.1 GB with zero latency cost on single-CPU pods.

Prior merges (before benchmarking infrastructure)

  • Free page image before table OCR (#1448): Release PIL image memory before table extraction starts
  • Resize-first preprocessing (#1441): Resize numpy arrays before YOLOX preprocessing instead of after
  • Replace lazyproperty (#1464): Switch from custom lazyproperty to stdlib functools.cached_property
  • Reduce attribute lookups (#1481): Optimize elements_intersect_vertically inner loop
  • Fix blocking event loop (#1400): Replace blocking CSV merge with async implementation
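As a minimal illustration of the `lazyproperty` swap (#1464), using a hypothetical `Page` class rather than the real call sites:

```python
import functools


class Page:
    def __init__(self, raw: str):
        self.raw = raw

    @functools.cached_property
    def elements(self) -> list[str]:
        # Computed once on first access, then cached on the instance;
        # the stdlib descriptor replaces the custom lazyproperty.
        return [token.upper() for token in self.raw.split()]
```

Repeated accesses return the same cached list object, so the (potentially expensive) computation runs only once per instance.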

Upstream Contributions

| PR | Status | Description |
|---|---|---|
| Unstructured-IO/core-product#1503 | Draft | Render PDF pages as BMP instead of PNG in pdfium pool |
| Unstructured-IO/core-product#1502 | Draft | Cap OCR workers to available CPUs — serial mode on 1-CPU pods |
| Unstructured-IO/core-product#1500 | Draft | Benchmark infrastructure and repo conventions |
| Unstructured-IO/core-product#1481 | Merged | Reduce attribute lookups in elements_intersect_vertically |
| Unstructured-IO/core-product#1464 | Merged | Replace lazyproperty with functools.cached_property |
| Unstructured-IO/core-product#1448 | Merged | Free page image before table extraction |
| Unstructured-IO/core-product#1441 | Merged | Resize-first numpy preprocessing for YOLOX |
| Unstructured-IO/core-product#1400 | Merged | Fix blocking event loop in CSV merge |

Methodology

Environment

  • VM: Azure Standard_E4s_v5 (4 vCPU, 32 GB RAM, memory-optimized)
  • OS: Ubuntu 24.04 LTS
  • Region: westus2
  • Python: 3.12 (project constraint: >=3.12, <3.13)
  • Tooling: pytest-benchmark (5 rounds, 1 warmup, median reported), memray, cProfile
  • CPU pinning: taskset -c 0 to match production pod profile (1 CPU request, 32 GB RAM limit)

Non-burstable VM + CPU pinning matches production Knative pod resources. 32 GB RAM matches the pod limit exactly.

Benchmarking methodology

  • pedantic(rounds=5, warmup_rounds=1) — 1 warmup absorbs ONNX model JIT, page cache warming, and pool initialization overhead. 5 measured rounds enable median, IQR, and Tukey outlier detection.
  • Median reported as primary metric (robust to up to 2 outliers in 5 samples)
  • Observed stddev <0.4% of median across all measurements

Profiling approach

  1. cProfile + memray -- identify hot functions and peak memory allocators
  2. Per-stage benchmark instrumentation -- render, detect, OCR, merge timing breakdown
  3. Cumulative progression on stacked branch with proper statistical methodology
  4. Full unit test suite run before and after every change (348 tests)

Memory measurement

Process-tree RSS measured by summing /proc/[pid]/status VmRSS for the main process and all direct children. This captures worker process memory that resource.getrusage(RUSAGE_SELF) misses.
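A Linux-only sketch of that measurement. Function names are illustrative (the real harness may differ), and direct children are found by scanning every `/proc/[pid]/status` for a matching `PPid:` field, which works without any optional kernel config:

```python
import os


def vm_rss_kib(pid: int) -> int:
    # Parse VmRSS from /proc/<pid>/status (reported in kB).
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except FileNotFoundError:
        pass  # process exited between listing and reading
    return 0


def children_of(pid: int) -> list[int]:
    # Direct children: every process whose PPid field matches pid.
    kids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/status") as f:
                for line in f:
                    if line.startswith("PPid:"):
                        if int(line.split()[1]) == pid:
                            kids.append(int(entry))
                        break
        except FileNotFoundError:
            continue  # raced with process exit
    return kids


def process_tree_rss_kib(pid: int) -> int:
    # Main process plus all direct children, as described above.
    return vm_rss_kib(pid) + sum(vm_rss_kib(c) for c in children_of(pid))
```

Unlike `resource.getrusage(RUSAGE_SELF)`, this sum includes the OCR worker processes, which is where most of the 2.1 GB saved by #1502 lived.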

Runner convention

Benchmark scripts use .venv/bin/python directly for accuracy (uv run adds overhead). Upstream reproducers use uv run python for portability.

Install notes

core-product is a multi-package uv workspace with three sub-packages (unstructured_prop, unstructured_inference_prop, unstructured-api). All share a single root-level .venv:

uv venv --python 3.12          # create root .venv
export VIRTUAL_ENV=$PWD/.venv   # all uv sync --active commands use this
make install                    # syncs all three sub-packages into shared venv

Private PyPI access requires:

export UV_INDEX_UNSTRUCTURED_USERNAME=unstructured
export UV_INDEX_UNSTRUCTURED_PASSWORD=<token from Keeper "Private PyPi Index">

System dependencies beyond build-essential: tesseract-ocr libtesseract-dev libleptonica-dev poppler-utils libmagic1 libgl1 libglib2.0-0

Repo Structure

.
├── README.md              # This file
├── bench/                 # Benchmark scripts
├── data/                  # Raw benchmark data
│   └── results.tsv
└── infra/                 # VM provisioning
    ├── cloud-init.yaml
    └── vm-manage.sh