# core-product Performance Optimization
Unstructured-IO's document processing pipeline -- PDF partitioning, OCR, layout detection, and element extraction. Python 3.12, multi-package uv workspace.
## Results

Environment: Python 3.12, Ubuntu 24.04 LTS (Azure Standard_E4s_v5, `taskset -c 0`), pytest-benchmark (5 rounds, 1 warmup, median reported)
### Cumulative latency progression
| Stage | 1p-tables | 10p-scan | 16p-mixed |
|---|---|---|---|
| Baseline (main) | 2.93s | 35.91s | 65.32s |
| + CPU-aware serial OCR (#1502) | 2.92s (-0.3%) | 33.80s (-5.9%) | 61.23s (-6.3%) |
| + BMP render format (#1503) | 2.72s (-7.3%) | 30.68s (-14.6%) | 56.63s (-13.3%) |
### Memory
| Optimization | Metric | Before | After | Improvement |
|---|---|---|---|---|
| CPU-aware serial OCR (#1502) | Process-tree RSS (10p scan, post-partition) | 3,491 MB | 1,398 MB | -2,093 MB (60%) |
| CPU-aware serial OCR (#1502) | Process-tree RSS (10p scan, pre-partition) | 2,619 MB | 499 MB | -2,120 MB (81%) |
### Throughput
No throughput regressions detected. Serial OCR path matches pool-on-1-CPU throughput (pool provides no parallelism benefit when pinned to 1 core).
## What We Changed

### Latency
- O(N²) text extraction fix (#4266): `_patch_current_chars_with_render_mode` was re-scanning the full character list on every patch operation — quadratic scaling on text-heavy documents. Replaced with a single-pass approach.
- BMP render format (#1503): Replace PNG with BMP in the pdfium process-isolation worker. BMP is uncompressed — eliminates ~90ms/page of PNG compression on write and decompression on read. 9.2% incremental gain on 10p-scan (on top of serial OCR).
- CPU-aware serial OCR (#1502): Use `os.sched_getaffinity(0)` to detect available CPUs (respects cgroup limits + taskset masks). On single-CPU pods, the OCR worker pool is never created — avoids 4 idle workers each loading duplicate OCR/ONNX models into ~500 MB of private memory. 5.9% latency improvement + 2.1 GB memory savings.
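The quadratic pattern behind the text-extraction fix, and the single-pass replacement, can be sketched in isolation. Function and field names below are illustrative only, not the actual pdfium-path code:

```python
# Illustrative sketch of the O(N^2) -> O(N) change in #4266.
# `chars` is a list of per-character dicts; each patch updates one char.

def patch_chars_quadratic(chars, patches):
    # Original shape: every patch re-scans the full character list,
    # so N patches over N chars costs O(N^2) on text-heavy pages.
    for patch in patches:
        for ch in chars:
            if ch["index"] == patch["index"]:
                ch["render_mode"] = patch["render_mode"]
    return chars

def patch_chars_single_pass(chars, patches):
    # Fixed shape: index the characters once, then apply each patch in O(1),
    # for O(N + P) total work.
    by_index = {ch["index"]: ch for ch in chars}
    for patch in patches:
        ch = by_index.get(patch["index"])
        if ch is not None:
            ch["render_mode"] = patch["render_mode"]
    return chars
```

Both functions produce identical results; only the scan count differs.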
### Memory
- CPU-aware serial OCR (#1502): Saves ~2.1 GB with zero latency cost on single-CPU pods.
- Per-request RSS creep: Reduced from 24 MB/request to 17 MB/request (-29%) via jemalloc and PIL churn reduction. memray confirmed 10 GB total allocated across a single 10-page scan — heavy per-page allocation churn, not classical accumulation.
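The CPU-detection logic behind the serial-OCR change can be sketched as follows. `effective_cpu_count` mirrors the `os.sched_getaffinity(0)` approach described above; `make_ocr_executor` and its pool shape are hypothetical, not the actual #1502 code:

```python
import multiprocessing as mp
import os

def effective_cpu_count() -> int:
    # sched_getaffinity respects taskset masks and cgroup cpusets,
    # unlike os.cpu_count(), which reports all physical CPUs.
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:  # non-Linux platforms lack sched_getaffinity
        return os.cpu_count() or 1

def make_ocr_executor(max_workers: int = 4):
    # Hypothetical pool factory: on a 1-CPU pod, skip the pool entirely so
    # no duplicate OCR/ONNX models are loaded into worker private memory.
    cpus = effective_cpu_count()
    if cpus <= 1:
        return None  # caller falls back to serial, in-process OCR
    return mp.get_context("spawn").Pool(processes=min(max_workers, cpus))
```

Under `taskset -c 0`, `effective_cpu_count()` returns 1 and no pool is created.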
### Prior merges (before benchmarking infrastructure)
- Free page image before table OCR (#1448): Release PIL image memory before table extraction starts
- Resize-first preprocessing (#1441): Resize numpy arrays before YOLOX preprocessing instead of after
- Replace lazyproperty (#1464): Switch from custom lazyproperty to stdlib `functools.cached_property`
- Reduce attribute lookups (#1481): Optimize the `elements_intersect_vertically` inner loop
- Fix blocking event loop (#1400): Replace blocking CSV merge with async implementation
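The lazyproperty swap above follows the standard stdlib pattern. A minimal sketch, with `Page` and `elements` as illustrative names rather than the real classes:

```python
from functools import cached_property

class Page:
    # Before (#1464), a hand-rolled `lazyproperty` descriptor cached values
    # in the instance dict; functools.cached_property does the same with
    # less per-access overhead and no custom descriptor to maintain.
    def __init__(self, raw: str):
        self.raw = raw
        self.parse_count = 0

    @cached_property
    def elements(self):
        self.parse_count += 1  # expensive parse runs exactly once
        return [w.upper() for w in self.raw.split()]
```

Repeated reads of `page.elements` hit the cached value; the body runs once per instance.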
## Upstream Contributions
| PR | Status | Description |
|---|---|---|
| Unstructured-IO/core-product#1503 | Draft | Render PDF pages as BMP instead of PNG in pdfium pool |
| Unstructured-IO/core-product#1502 | Draft | Cap OCR workers to available CPUs — serial mode on 1-CPU pods |
| Unstructured-IO/core-product#1500 | Draft | Benchmark infrastructure and repo conventions |
| Unstructured-IO/core-product#1481 | Merged | Reduce attribute lookups in `elements_intersect_vertically` |
| Unstructured-IO/core-product#1464 | Merged | Replace `lazyproperty` with `functools.cached_property` |
| Unstructured-IO/core-product#1448 | Merged | Free page image before table extraction |
| Unstructured-IO/core-product#1441 | Merged | Resize-first numpy preprocessing for YOLOX |
| Unstructured-IO/core-product#1400 | Merged | Fix blocking event loop in CSV merge |
## Methodology

### Environment
- VM: Azure Standard_E4s_v5 (4 vCPU, 32 GB RAM, memory-optimized)
- OS: Ubuntu 24.04 LTS
- Region: westus2
- Python: 3.12 (project constraint: `>=3.12, <3.13`)
- Tooling: pytest-benchmark (5 rounds, 1 warmup, median reported), memray, cProfile
- CPU pinning: `taskset -c 0` to match the production pod profile (1 CPU request, 32 GB RAM limit)
Non-burstable VM + CPU pinning matches production Knative pod resources. 32 GB RAM matches the pod limit exactly.
### Benchmarking methodology

- `pedantic(rounds=5, warmup_rounds=1)` — 1 warmup round absorbs ONNX model JIT, page cache warming, and pool initialization overhead; 5 measured rounds enable median, IQR, and Tukey outlier detection
- Median reported as primary metric (robust to up to 2 outliers in 5 samples)
- Observed stddev <0.4% of median across all measurements
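The round statistics can be reproduced with the stdlib. This is a simplified take on what pytest-benchmark reports, not its exact internal implementation:

```python
import statistics

def summarize_rounds(samples):
    # Median + IQR summary with Tukey fences over the measured rounds.
    # method="inclusive" treats the 5 rounds as the whole population,
    # which is appropriate for so few samples.
    med = statistics.median(samples)
    q1, _, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # Tukey outlier fences
    return {
        "median": med,
        "iqr": iqr,
        "outliers": [s for s in samples if s < lo or s > hi],
    }
```

With 5 samples the median shrugs off up to 2 outliers, which is why it is the primary metric above.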
### Profiling approach
- cProfile + memray -- identify hot functions and peak memory allocators
- Per-stage benchmark instrumentation -- render, detect, OCR, merge timing breakdown
- Cumulative progression on stacked branch with proper statistical methodology
- Full unit test suite run before and after every change (348 tests)
### Memory measurement

Process-tree RSS measured by summing `/proc/[pid]/status` `VmRSS` for the main process and all direct children. This captures worker process memory that `resource.getrusage(RUSAGE_SELF)` misses.
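A minimal sketch of that measurement (Linux-only, since it reads procfs; the helper names are ours, not from the benchmark scripts):

```python
from pathlib import Path

def rss_kib(pid: int) -> int:
    # Parse VmRSS (reported in kB) from /proc/<pid>/status; 0 if gone.
    try:
        for line in Path(f"/proc/{pid}/status").read_text().splitlines():
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    except FileNotFoundError:
        pass
    return 0

def child_pids(pid: int) -> list[int]:
    # Direct children, via /proc/<pid>/task/<tid>/children (Linux >= 3.5).
    out: list[int] = []
    for task in Path(f"/proc/{pid}/task").iterdir():
        children = task / "children"
        if children.exists():
            out += [int(p) for p in children.read_text().split()]
    return out

def tree_rss_kib(pid: int) -> int:
    # Main process plus direct children, matching the methodology above.
    return rss_kib(pid) + sum(rss_kib(c) for c in child_pids(pid))
```

Summing over children is what catches the ~500 MB-per-worker OCR pool memory that `RUSAGE_SELF` never sees.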
### Runner convention

Benchmark scripts use `.venv/bin/python` directly for accuracy (`uv run` adds overhead). Upstream reproducers use `uv run python` for portability.
## Install notes

core-product is a multi-package uv workspace with three sub-packages (unstructured_prop, unstructured_inference_prop, unstructured-api). All share a single root-level `.venv`:

```sh
uv venv --python 3.12          # create root .venv
export VIRTUAL_ENV=$PWD/.venv  # all `uv sync --active` commands use this
make install                   # syncs all three sub-packages into shared venv
```

Private PyPI access requires:

```sh
export UV_INDEX_UNSTRUCTURED_USERNAME=unstructured
export UV_INDEX_UNSTRUCTURED_PASSWORD=<token from Keeper "Private PyPi Index">
```

System dependencies beyond `build-essential`: `tesseract-ocr libtesseract-dev libleptonica-dev poppler-utils libmagic1 libgl1 libglib2.0-0`
## Repo Structure

```
.
├── README.md          # This file
├── bench/             # Benchmark scripts
├── data/              # Raw benchmark data
│   └── results.tsv
└── infra/             # VM provisioning
    ├── cloud-init.yaml
    └── vm-manage.sh
```