codeflash-agent/.codeflash/unstructured/core-product/status.md
Kevin Turcios 3b59d97647 squash
2026-04-13 14:12:17 -05:00


core-product Status

Last updated: 2026-04-10

Current state

Stacked PR #1500 updated with the cumulative progression (proper benchmarking: 5 rounds + 1 warmup, median reported, <0.4% stddev). Cumulative: 14.6% latency reduction on 10p-scan, 13.3% on 16p-mixed, 2.1 GB memory savings. Next: optimization #4 (direct numpy-to-BMP for tesseract).

Target repo

~/Desktop/work/unstructured_org/core-product on branch main (PR branch: perf/cpu-aware-serial-ocr)

PRs

| PR | Branch | Status | Description |
|---|---|---|---|
| #1503 | perf/bmp-render-format | Draft | Render PDF pages as BMP instead of PNG in pdfium pool |
| #1502 | perf/cpu-aware-serial-ocr | Draft | Cap OCR workers to available CPUs (serial mode on 1-CPU pods) |
| #1500 | codeflash-agent | Draft | Stacked optimizations + benchmark infra (cumulative progression) |
| #1481 | perf/elements-intersect-vertically | Merged | Reduce attribute lookups |
| #1464 | replace-lazyproperty-with-cached-property | Merged | Replace lazyproperty with functools.cached_property |
| #1448 | mem/free-pil-before-table-extraction | Merged | Free page image before table OCR |
| #1441 | mem/numpy-preprocessing-yolox | Merged | Resize-first preprocessing |
| #1400 | async-join-responses | Merged | Fix blocking event loop in CSV merge |
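The #1502 worker cap could be sketched roughly like this (function and parameter names are illustrative, not the repo's real API):

```python
import os
from concurrent.futures import ProcessPoolExecutor

def effective_cpus() -> int:
    """CPUs actually usable by this process. Unlike os.cpu_count(),
    sched_getaffinity respects the affinity mask a 1-CPU pod sees."""
    try:
        return len(os.sched_getaffinity(0))  # Linux only
    except AttributeError:                   # macOS/Windows fallback
        return os.cpu_count() or 1

def run_ocr(pages, ocr_one, requested_workers=8):
    """Cap workers at available CPUs; on 1-CPU pods fall back to plain
    serial execution so no pool is spawned at all."""
    workers = min(requested_workers, effective_cpus())
    if workers <= 1:
        return [ocr_one(page) for page in pages]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_one, pages))
```

The serial fast path is the point: spawning a pool on a 1-CPU pod pays fork and pickling overhead for zero parallelism.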

Optimization queue

  1. CPU-aware serial OCR — PR #1502 open (draft), benchmarked. Rebase after #1501 merges.
  2. Early memory release — skipped, codebase already well-optimized (context managers, per-page cleanup)
  3. BMP render format — PR #1503 open (draft), benchmarked. 14.9% latency improvement on 10p-scan.
  4. Direct numpy-to-BMP for tesseract — encode from numpy without PIL round-trip
  5. Skip remaining PIL↔numpy conversions in OCR path
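Optimization #4 could look something like the following pure-Python sketch — a hand-rolled 8-bit grayscale BMP encoder that goes straight from a numpy array to bytes. The real PR may encode differently (e.g. via cv2); this is only a minimal illustration of skipping the PIL round-trip:

```python
import struct
import numpy as np

def gray_to_bmp(img: np.ndarray) -> bytes:
    """Encode a 2-D uint8 grayscale array as 8-bit BMP bytes, no PIL."""
    h, w = img.shape
    row_size = (w + 3) & ~3  # BMP rows are padded to 4-byte boundaries
    # 8-bit BMPs need a palette: 256 grayscale BGRA entries (1024 bytes)
    palette = b"".join(struct.pack("<BBBB", i, i, i, 0) for i in range(256))
    pixel_offset = 14 + 40 + len(palette)
    file_size = pixel_offset + row_size * h
    # BITMAPFILEHEADER: magic, file size, two reserved words, pixel offset
    header = struct.pack("<2sIHHI", b"BM", file_size, 0, 0, pixel_offset)
    # BITMAPINFOHEADER: size, width, height, planes, bpp, compression,
    # image size, x/y pixels-per-meter, colors used, colors important
    info = struct.pack("<IiiHHIIiiII", 40, w, h, 1, 8, 0,
                       row_size * h, 2835, 2835, 256, 0)
    pad = b"\x00" * (row_size - w)
    # BMP pixel data is stored bottom-up
    rows = b"".join(img[y].tobytes() + pad for y in range(h - 1, -1, -1))
    return header + info + palette + rows
```

Tesseract reads BMP natively, so these bytes can go to a temp file (or stdin) with no intermediate PIL Image at all.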

Dependencies

  • PR #1501 (segfault fix, patched_convert_pdf_to_image refactor) must merge before #1502 rebase. Different functions, clean rebase expected.

VM

  • IP: 40.65.91.158
  • Size: Standard_E4s_v5
  • RG: core-product-BENCH-RG
  • State: Running (verified 2026-04-10)
  • Git auth: HTTPS with embedded token (set previously). Use ssh -A for agent forwarding if token expires.
  • Note: uv is at ~/.local/bin/uv — needs export PATH=$HOME/.local/bin:$PATH in non-login shells.
  • Note: pytest-benchmark installed in .venv (not in lockfile).

Next steps

  1. Implement "pass file path to tesseract" optimization (skip PIL→numpy→PIL→temp-file round-trip)
  2. Benchmark on VM, open draft PR
  3. Rebase #1502 once #1501 merges
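Step 1 above might look like this sketch — pytesseract does accept a filename string in place of a PIL Image, so the page bytes only touch disk once; the helper names here are hypothetical:

```python
import os
import tempfile

def write_temp_bmp(bmp_bytes: bytes) -> str:
    """Persist already-encoded BMP bytes to a temp file, return the path."""
    fd, path = tempfile.mkstemp(suffix=".bmp")
    with os.fdopen(fd, "wb") as f:
        f.write(bmp_bytes)
    return path

def ocr_via_path(bmp_bytes: bytes) -> str:
    """Hand tesseract a file path instead of a PIL Image, skipping the
    PIL -> numpy -> PIL -> temp-file round-trip otherwise done per page."""
    path = write_temp_bmp(bmp_bytes)
    try:
        import pytesseract  # assumed installed in the OCR environment
        return pytesseract.image_to_string(path)  # accepts filename strings
    finally:
        os.unlink(path)
```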

Notes

  • memray tree opens a TUI — do not run directly over SSH. Use memray stats, memray summary, or memray flamegraph --output file.html instead.
• memray peak is 1.0 GB (10p scan, serial path). The 10 GB total allocated reflects heavy per-page PIL churn, not accumulation across pages.
  • Benchmarking: use pedantic(rounds=5, warmup_rounds=1) — warmup absorbs ONNX JIT + page cache. Observed stddev <0.4% of median. Guest CPU frequency controls are ineffective on Azure Hyper-V — use statistical methods (more rounds + median) instead of trying to pin frequency.
  • Workflow: independent perf/<name> branch → open individual draft PR → cherry-pick to codeflash-agent → benchmark stacked progression → update #1500 body.
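The rounds/median recipe above, expressed outside pytest-benchmark as a minimal timing helper (mirrors the pedantic(rounds=5, warmup_rounds=1) setup; not the repo's actual harness):

```python
import statistics
import time

def bench(fn, rounds=5, warmup_rounds=1):
    """Median-of-rounds timing with warmup, mirroring
    benchmark.pedantic(rounds=5, warmup_rounds=1)."""
    for _ in range(warmup_rounds):
        fn()  # warmup absorbs one-time costs (ONNX JIT, page cache)
    samples = []
    for _ in range(rounds):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    median = statistics.median(samples)
    stddev = statistics.pstdev(samples)
    return median, stddev  # report median; sanity-check stddev/median < 0.4%
```

Reporting the median rather than the mean is what makes the noisy Hyper-V guest clock tolerable without frequency pinning.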