If you run ML in production,
we've probably optimized your stack.

Inference, preprocessing, and memory dominate the ML bill. Codeflash has shipped merged wins across PyTorch, JAX, ONNX, vLLM, Diffusers, YOLO, RF-DETR, SAM3, PaddleOCR, and spaCy, from algorithmic rewrites to container-aware scheduling.

ML costs hide in places profiling doesn't show you.

Most teams optimize the obvious things: model size, batch size, hardware tier. The real waste is usually elsewhere, and it compounds quietly until the bill forces a conversation.

  • Your worker count is wrong and you don't know it. Inside Kubernetes, os.cpu_count() returns the host CPU count, not the pod's limit. A single OCR service was spawning 4 ONNX workers on a 1-CPU pod, giving 4× the memory usage with zero extra throughput. We find this by default (see the cpu-count sketch after this list).
  • The model isn't the bottleneck. The preprocessing is. In vision and OCR pipelines, the GPU waits on the CPU: PIL, PNG decoding, or a redundant resize three layers deep. That's where the latency lives. Speeding up the model achieves nothing until you fix the pipeline feeding it.
  • Memory creep turns into over-provisioning. A 24 MB leak per request is invisible at low scale. At high scale it causes OOMs, which cause defensive over-provisioning, which causes a cloud bill no one can explain. We track RSS before and after on every PR (see the RSS sketch after this list).
  • Fix one bottleneck and the next one surfaces. This is why one-time profiling sessions rarely deliver the full savings. The agent follows the stack down layer by layer until the bill actually bends.
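
The cpu-count mismatch is worth seeing concretely. Below is a minimal sketch of a container-aware replacement for os.cpu_count(), assuming a cgroup v2 pod with a CPU limit; the path and fallback logic are illustrative, not Codeflash's actual implementation.

    import os
    from pathlib import Path

    def effective_cpu_count() -> int:
        """CPUs actually granted to this container (cgroup v2), not the host count."""
        try:
            # cgroup v2 exposes the quota as "<quota> <period>", or "max" when unlimited.
            quota, period = Path("/sys/fs/cgroup/cpu.max").read_text().split()
            if quota != "max":
                # e.g. "100000 100000" on a 1-CPU pod -> 1 worker, not 16
                return max(1, int(quota) // int(period))
        except (FileNotFoundError, ValueError):
            pass  # not in a container, or cgroup v1: fall through to the host count
        return os.cpu_count() or 1

    # On a 1-CPU pod scheduled onto a 16-core host:
    #   os.cpu_count()        -> 16  (what the OCR service used to size its ONNX pool)
    #   effective_cpu_count() -> 1   (what it should have used)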
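
The RSS tracking is equally simple in principle. Here is a simplified sketch using psutil; the function name and harness are hypothetical, not the measurement pipeline Codeflash runs on PRs.

    import gc

    import psutil

    def rss_delta_mb(fn, *args, **kwargs) -> float:
        """Run fn once and report how much resident memory it left behind, in MB."""
        proc = psutil.Process()  # current process
        gc.collect()
        before = proc.memory_info().rss
        fn(*args, **kwargs)
        gc.collect()
        after = proc.memory_info().rss
        return (after - before) / 1e6

    # A 24 MB-per-request leak is invisible on a single call; wrap the request handler
    # in a loop and the delta grows linearly, which is exactly what precedes the OOMs.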

Real projects. Merged PRs. Numbers you can verify.

Everything below is upstream, reviewed by maintainers, and running in production.

"Codeflash made our core object detection flow 25% faster. On the same GPU machine, the object detection throughput went up from 80 fps to 100 fps with a corresponding drop in latency from 12.2ms to 9.8ms."

Brad Dwyer · Founder & CTO, Roboflow

  • vLLM: 13.7× faster token decoding (merged upstream · PR #20413)
  • HF Diffusers: 9× faster WAN encoding (merged upstream · PR #11665)
  • Unstructured: 90% infra cost cut ($10K → $1.1K/mo · 9.2× pod density)
  • Roboflow: 25% faster object detection (80 fps → 100 fps · 12.2 ms → 9.8 ms · YOLOv8 on GPU)
  • pdfminer.six: 3× speedup (OSS library running in thousands of pipelines)
  • Pydantic: 34% speedup (16 PRs merged · zero regressions)

Frameworks we've shipped wins on.

If your stack is listed here, we've shipped merged, production-verified improvements on it.

  • PyTorch: Graph rewrites, custom ops, and memory-layout improvements that reduce GPU idle time.
  • JAX: JIT and pmap boundary fixes, and trace-stability issues that cause silent slowdowns at scale.
  • ONNX / ONNX Runtime: Worker-pool sizing and CPU-aware execution, the most common source of wasted memory in inference services.
  • vLLM: Token decoding and scheduler hotspots; 13.7× improvement merged upstream in PR #20413.
  • HF Diffusers: Encoder and decoder speedups; 9× improvement on the WAN encoding path, merged in PR #11665.
  • YOLO family: YOLOv8 inference throughput on GPU; Roboflow saw 80 fps go to 100 fps on the same hardware.
  • RF-DETR / SAM3: End-to-end inference latency on modern vision architectures.
  • PaddleOCR / spaCy: Preprocessing pipelines and CPU-bound bottlenecks that sit upstream of the model itself.

"A PR comes in, we measure RSS before and after on a running system, and the improvement is 2× or 3×. Real demonstrable progress, not theoretical."
Crag Wolfe · Chief Architect, Unstructured

Find the waste in your ML stack.

20-minute diagnostic. We'll tell you where the cost is hiding (inference, preprocessing, memory, or all three) before you commit to anything.