codeflash-agent/.codeflash/unstructured/core-product/status.md
Kevin Turcios 3b59d97647 squash
2026-04-13 14:12:17 -05:00


core-product Status

Last updated: 2026-04-10

Current state

Stacked PR #1500 updated with the cumulative progression (proper benchmarking: 5 rounds + 1 warmup, median reported, <0.4% stddev). Cumulative: 14.6% latency reduction on 10p-scan, 13.3% on 16p-mixed, 2.1 GB memory savings. Next: optimization #4 (direct numpy-to-BMP for tesseract).

Target repo

~/Desktop/work/unstructured_org/core-product on branch main (PR branch: perf/cpu-aware-serial-ocr)

PRs

| PR | Branch | Status | Description |
|---|---|---|---|
| #1503 | perf/bmp-render-format | Draft | Render PDF pages as BMP instead of PNG in pdfium pool |
| #1502 | perf/cpu-aware-serial-ocr | Draft | Cap OCR workers to available CPUs (serial mode on 1-CPU pods) |
| #1500 | codeflash-agent | Draft | Stacked optimizations + benchmark infra (cumulative progression) |
| #1481 | perf/elements-intersect-vertically | Merged | Reduce attribute lookups |
| #1464 | replace-lazyproperty-with-cached-property | Merged | Replace lazyproperty with functools.cached_property |
| #1448 | mem/free-pil-before-table-extraction | Merged | Free page image before table OCR |
| #1441 | mem/numpy-preprocessing-yolox | Merged | Resize-first preprocessing |
| #1400 | async-join-responses | Merged | Fix blocking event loop in CSV merge |
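The #1502 worker cap could be sketched roughly like this (function and parameter names are illustrative, not the repo's real API):

```python
import os
from concurrent.futures import ProcessPoolExecutor

def effective_cpus() -> int:
    """CPUs actually usable by this process. Unlike os.cpu_count(),
    sched_getaffinity respects the affinity mask a 1-CPU pod sees."""
    try:
        return len(os.sched_getaffinity(0))  # Linux only
    except AttributeError:                   # macOS/Windows fallback
        return os.cpu_count() or 1

def run_ocr(pages, ocr_one, requested_workers=8):
    """Cap workers at available CPUs; on 1-CPU pods fall back to plain
    serial execution so no pool is spawned at all."""
    workers = min(requested_workers, effective_cpus())
    if workers <= 1:
        return [ocr_one(page) for page in pages]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_one, pages))
```

The serial fast path is the point: spawning a pool on a 1-CPU pod pays fork and pickling overhead for zero parallelism.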

Optimization queue

  1. CPU-aware serial OCR — PR #1502 open (draft), benchmarked. Rebase after #1501 merges.
  2. Early memory release — skipped, codebase already well-optimized (context managers, per-page cleanup)
  3. BMP render format — PR #1503 open (draft), benchmarked. 14.9% latency improvement on 10p-scan.
  4. Direct numpy-to-BMP for tesseract — encode from numpy without PIL round-trip
  5. Skip remaining PIL↔numpy conversions in OCR path
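Optimization #4 could look something like the following pure-Python sketch — a hand-rolled 8-bit grayscale BMP encoder that goes straight from a numpy array to bytes. The real PR may encode differently (e.g. via cv2); this is only a minimal illustration of skipping the PIL round-trip:

```python
import struct
import numpy as np

def gray_to_bmp(img: np.ndarray) -> bytes:
    """Encode a 2-D uint8 grayscale array as 8-bit BMP bytes, no PIL."""
    h, w = img.shape
    row_size = (w + 3) & ~3  # BMP rows are padded to 4-byte boundaries
    # 8-bit BMPs need a palette: 256 grayscale BGRA entries (1024 bytes)
    palette = b"".join(struct.pack("<BBBB", i, i, i, 0) for i in range(256))
    pixel_offset = 14 + 40 + len(palette)
    file_size = pixel_offset + row_size * h
    # BITMAPFILEHEADER: magic, file size, two reserved words, pixel offset
    header = struct.pack("<2sIHHI", b"BM", file_size, 0, 0, pixel_offset)
    # BITMAPINFOHEADER: size, width, height, planes, bpp, compression,
    # image size, x/y pixels-per-meter, colors used, colors important
    info = struct.pack("<IiiHHIIiiII", 40, w, h, 1, 8, 0,
                       row_size * h, 2835, 2835, 256, 0)
    pad = b"\x00" * (row_size - w)
    # BMP pixel data is stored bottom-up
    rows = b"".join(img[y].tobytes() + pad for y in range(h - 1, -1, -1))
    return header + info + palette + rows
```

Tesseract reads BMP natively, so these bytes can go to a temp file (or stdin) with no intermediate PIL Image at all.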

Dependencies

  • PR #1501 (segfault fix, patched_convert_pdf_to_image refactor) must merge before #1502 rebase. Different functions, clean rebase expected.

VM

  • IP: 40.65.91.158
  • Size: Standard_E4s_v5
  • RG: core-product-BENCH-RG
  • State: Running (verified 2026-04-10)
  • Git auth: HTTPS with embedded token (set previously). Use ssh -A for agent forwarding if token expires.
  • Note: uv is at ~/.local/bin/uv — needs export PATH=$HOME/.local/bin:$PATH in non-login shells.
  • Note: pytest-benchmark installed in .venv (not in lockfile).

Next steps

  1. Implement "pass file path to tesseract" optimization (skip PIL→numpy→PIL→temp-file round-trip)
  2. Benchmark on VM, open draft PR
  3. Rebase #1502 once #1501 merges
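Step 1 above might look like this sketch — pytesseract does accept a filename string in place of a PIL Image, so the page bytes only touch disk once; the helper names here are hypothetical:

```python
import os
import tempfile

def write_temp_bmp(bmp_bytes: bytes) -> str:
    """Persist already-encoded BMP bytes to a temp file, return the path."""
    fd, path = tempfile.mkstemp(suffix=".bmp")
    with os.fdopen(fd, "wb") as f:
        f.write(bmp_bytes)
    return path

def ocr_via_path(bmp_bytes: bytes) -> str:
    """Hand tesseract a file path instead of a PIL Image, skipping the
    PIL -> numpy -> PIL -> temp-file round-trip otherwise done per page."""
    path = write_temp_bmp(bmp_bytes)
    try:
        import pytesseract  # assumed installed in the OCR environment
        return pytesseract.image_to_string(path)  # accepts filename strings
    finally:
        os.unlink(path)
```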

Notes

  • memray tree opens a TUI — do not run directly over SSH. Use memray stats, memray summary, or memray flamegraph --output file.html instead.
• memray peak is 1.0 GB (10p scan, serial path). The 10 GB total allocated reflects heavy per-page PIL churn, not accumulation across pages.
  • Benchmarking: use pedantic(rounds=5, warmup_rounds=1) — warmup absorbs ONNX JIT + page cache. Observed stddev <0.4% of median. Guest CPU frequency controls are ineffective on Azure Hyper-V — use statistical methods (more rounds + median) instead of trying to pin frequency.
  • Workflow: independent perf/<name> branch → open individual draft PR → cherry-pick to codeflash-agent → benchmark stacked progression → update #1500 body.
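The rounds/median recipe above, expressed outside pytest-benchmark as a minimal timing helper (mirrors the pedantic(rounds=5, warmup_rounds=1) setup; not the repo's actual harness):

```python
import statistics
import time

def bench(fn, rounds=5, warmup_rounds=1):
    """Median-of-rounds timing with warmup, mirroring
    benchmark.pedantic(rounds=5, warmup_rounds=1)."""
    for _ in range(warmup_rounds):
        fn()  # warmup absorbs one-time costs (ONNX JIT, page cache)
    samples = []
    for _ in range(rounds):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    median = statistics.median(samples)
    stddev = statistics.pstdev(samples)
    return median, stddev  # report median; sanity-check stddev/median < 0.4%
```

Reporting the median rather than the mean is what makes the noisy Hyper-V guest clock tolerable without frequency pinning.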