GPU/CUDA Inference Optimization Guide
Why cProfile Is Insufficient for GPU Workloads
cProfile measures Python-side CPU time. GPU work is asynchronous — torch.mm() returns immediately while the GPU computes. The real costs in GPU inference are invisible to Python profilers:
- H2D/D2H transfers: Data copies between CPU and GPU happen in the CUDA driver. A 24-byte tensor copy that takes microseconds of Python time can create a 100us pipeline bubble waiting for the driver to stage pageable memory.
- Kernel dispatch gaps: Each PyTorch operation launches a CUDA kernel with ~25us of dispatch overhead. 29 sequential operations = ~725us of dead time between kernels, but cProfile shows each op completing in microseconds.
- Stream synchronization stalls: stream.synchronize() blocks the CPU thread until the GPU finishes. This shows as idle CPU time in cProfile — not attributed to any function.
- Pipeline bubbles: Gaps where neither CPU nor GPU is doing useful work because one is waiting for the other. These are the "inter-iteration bubbles" that dominate real-world inference latency.
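A quick way to see this asynchrony directly (a minimal sketch, assuming a CUDA device is available): time a matmul with and without an explicit synchronize.

import time
import torch

x = torch.randn(4096, 4096, device="cuda")

# Python-side timing: the call returns as soon as the kernel is queued.
t0 = time.perf_counter()
y = x @ x
t1 = time.perf_counter()

# GPU-side timing: wait for the kernel to actually finish.
torch.cuda.synchronize()
t2 = time.perf_counter()

print(f"Queued in {(t1 - t0) * 1e3:.3f}ms, finished after {(t2 - t0) * 1e3:.3f}ms")

The first number is what cProfile attributes to the call; the second is the latency that actually lands on the pipeline.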
Rule: For GPU-bound workloads, always use torch.profiler or nsys as the primary profiler. cProfile is supplementary only — use it for CPU-side code paths (preprocessing, postprocessing, adapter serialization).
Profiling Tools
torch.profiler (primary)
The standard tool for profiling PyTorch GPU workloads. Captures both CPU and CUDA activities.
import torch
from torch.profiler import profile, ProfilerActivity, schedule
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_stack=True,
    record_shapes=True,
    profile_memory=True,
    schedule=schedule(wait=2, warmup=3, active=5, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("/tmp/gpu_trace"),
) as prof:
    for i in range(10):
        run_inference(model, input_data)
        prof.step()
# Print summary sorted by CUDA time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
# Print summary sorted by CPU time (for comparison)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
What to look for in the output:
- CUDA kernels with high total time — these are compute bottlenecks
- Many calls with short CUDA time — dispatch overhead candidates for fusion
- CPU time >> CUDA time on a function — the CPU is doing unnecessary work
- CUDA time >> CPU time on a function — large GPU compute or transfer
- aten::copy_ entries — these are memory transfers (H2D, D2H, D2D)
- cudaStreamSynchronize in the CPU column — blocking sync points
Nsight Systems (nsys) — authoritative validation
When torch.profiler findings need confirmation or when you need driver-level detail.
nsys profile \
--trace=cuda,nvtx,osrt \
--cuda-memory-usage=true \
--output=/tmp/nsys_profile \
$RUNNER inference_script.py
Reading the timeline:
- Stream rows: Each CUDA stream shows kernel execution as colored blocks with gaps between them. Large gaps = pipeline bubbles.
- Memory copy rows: H2D (green) and D2H (red) transfers. Pageable transfers show a staging copy; pinned transfers are direct.
- CPU row: Python thread activity. Look for idle periods that coincide with cudaStreamSynchronize.
- NVTX markers: If the code uses torch.cuda.nvtx.range_push/pop, these label pipeline stages (see the sketch below).
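If the script doesn't already emit NVTX markers, a minimal sketch of how to add them (preprocess and postprocess are hypothetical stage functions standing in for your own):

import torch

def run_inference_with_nvtx(model, input_data):
    # Each push/pop pair shows up as a labeled span on the NVTX row in nsys.
    torch.cuda.nvtx.range_push("preprocess")
    batch = preprocess(input_data)      # hypothetical stage function
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("inference")
    output = model(batch)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("postprocess")
    result = postprocess(output)        # hypothetical stage function
    torch.cuda.nvtx.range_pop()
    return result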
torch.cuda.Event timing (quick per-stage measurement)
For fast A/B comparisons without full profiling infrastructure.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
# Warm up
for _ in range(10):
    run_inference(model, input_data)
torch.cuda.synchronize()

# Measure
times = []
for _ in range(100):
    start.record()
    run_inference(model, input_data)
    end.record()
    torch.cuda.synchronize()
    times.append(start.elapsed_time(end))  # milliseconds
import statistics
print(f"Mean: {statistics.mean(times):.3f}ms, Median: {statistics.median(times):.3f}ms, Std: {statistics.stdev(times):.3f}ms")
Per-stage measurement — wrap each pipeline stage with its own event pair:
events = {stage: (torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True))
          for stage in ["preprocess", "inference", "postprocess"]}

events["preprocess"][0].record(preprocess_stream)
# ... preprocessing ...
events["preprocess"][1].record(preprocess_stream)

events["inference"][0].record(inference_stream)
# ... inference ...
events["inference"][1].record(inference_stream)

# After sync:
for stage, (s, e) in events.items():
    print(f"{stage}: {s.elapsed_time(e):.3f}ms")
The GPU Inference Pipeline Model
A typical GPU inference pipeline has 6 stages. Each transition between stages is an optimization surface:
CPU Prep ──H2D──> GPU Preprocess ──> GPU Inference ──D2H──> CPU Postprocess ──> Output
    │                   │                  │                       │
    ▼                   ▼                  ▼                       ▼
Image resize      Normalize,         Model forward          Threshold,
Format convert    Channel swap,      pass (TRT,             Sort, NMS,
Numpy→Tensor      Pad/letterbox      ONNX, PyTorch)         Class remap
Where time is typically spent (ranges for single-image inference on modern GPUs):
| Stage | Typical range | Optimization surface |
|---|---|---|
| CPU Prep | 0.5-3ms | Vectorization, avoid copies |
| H2D Transfer | 0.1-1ms | Pinned memory, non_blocking |
| GPU Preprocess | 0.1-2ms | Kernel fusion, cached constants |
| GPU Inference | 2-20ms | Model optimization (TRT, quantization) — usually not our target |
| D2H Transfer | 0.05-0.5ms | Direct-to-host writes, pinned buffers |
| CPU Postprocess | 0.1-2ms | Numpy for small arrays, vectorization |
| Sync overhead | 0.5-3ms | Event-based sync, pipeline overlap |
The model forward pass is usually the largest single cost, but it's typically already optimized (TensorRT, ONNX). The surrounding pipeline stages and their transitions often account for 30-60% of total E2E latency and are where Python-level optimization has the most impact.
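One way to attack the transition costs is to overlap stages on separate CUDA streams so the H2D copy and GPU preprocessing of the next frame hide behind the current forward pass. A simplified sketch (preprocess is a hypothetical stage function; real code also has to manage buffer reuse across frames):

import torch

preprocess_stream = torch.cuda.Stream()
inference_stream = torch.cuda.Stream()

def pipelined_step(model, frame_cpu, pinned_buf):
    # Stage the copy and preprocessing on a side stream.
    with torch.cuda.stream(preprocess_stream):
        pinned_buf.copy_(frame_cpu)                        # CPU -> pinned staging buffer
        gpu_raw = pinned_buf.to("cuda", non_blocking=True)
        batch = preprocess(gpu_raw)                        # hypothetical stage function
    # Make the inference stream wait on that work without blocking the CPU.
    inference_stream.wait_stream(preprocess_stream)
    batch.record_stream(inference_stream)                  # keep allocator lifetimes correct across streams
    with torch.cuda.stream(inference_stream):
        return model(batch)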
Monolithic kernel strategy
When individual micro-optimizations (tensor caching, pinned memory, etc.) show improvements in isolation but fail to register in E2E measurements — often because savings are below variance or masked by stream overlap — the monolithic approach fuses an entire pipeline stage into a single compiled operation. This eliminates ALL intermediate tensors, kernel launches, and dispatch gaps at once.
Preprocessing monolith: Replace (CPU resize + pageable H2D + permute + normalize + channel swap) with (pinned H2D + GPU F.interpolate + torch.compile(mode="reduce-overhead") fusing the rest). Moves compute onto GPU where it fuses into a single compiled kernel graph. Measured: 31% stage improvement on RF-DETR (2.47ms -> 1.70ms).
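A minimal sketch of what such a preprocessing monolith can look like; the target size, normalization constants, and input shape below are hypothetical placeholders, not values from the RF-DETR pipeline:

import torch
import torch.nn.functional as F

TARGET_HW = (640, 640)  # hypothetical model input size
MEAN = torch.tensor([0.485, 0.456, 0.406], device="cuda").view(1, 3, 1, 1)
STD = torch.tensor([0.229, 0.224, 0.225], device="cuda").view(1, 3, 1, 1)

@torch.compile(mode="reduce-overhead")
def preprocess_monolith(img_u8: torch.Tensor) -> torch.Tensor:
    # img_u8: (1, H, W, 3) uint8 tensor already on the GPU.
    x = img_u8.permute(0, 3, 1, 2).float() / 255.0          # HWC -> CHW, scale to [0, 1]
    x = F.interpolate(x, size=TARGET_HW, mode="bilinear", align_corners=False)
    return (x - MEAN) / STD                                  # fused into the compiled graph

# Usage: one pinned H2D copy, then a single compiled graph for everything else.
pinned = torch.empty(1, 1080, 1920, 3, dtype=torch.uint8, pin_memory=True)
batch = preprocess_monolith(pinned.to("cuda", non_blocking=True))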
Postprocessing monolith: Replace (per-call torch.tensor() + torch.isin() + external rescale function) with (pre-computed boolean lookup mask + cached GPU tensors + inlined in-place rescale). Eliminates all per-call allocations and expensive internal kernels. Measured: 33% stage improvement on RF-DETR (1.11ms -> 0.74ms).
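The lookup-mask piece of this can be sketched as follows (label space and class filter are hypothetical):

import torch

NUM_CLASSES = 91            # hypothetical label space
KEEP_CLASSES = [1, 3, 17]   # hypothetical classes of interest

# Built once at init: a boolean lookup table resident on the GPU.
keep_mask = torch.zeros(NUM_CLASSES, dtype=torch.bool, device="cuda")
keep_mask[KEEP_CLASSES] = True

def filter_detections(labels: torch.Tensor, boxes: torch.Tensor):
    # Indexing the cached mask replaces a per-call torch.tensor(KEEP_CLASSES)
    # allocation followed by torch.isin().
    keep = keep_mask[labels]
    return labels[keep], boxes[keep]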
When to use: After individual transfer/dispatch/sync optimizations are discarded or show sub-threshold gains. The monolithic approach often succeeds where piecemeal fixes fail because the combined savings exceed the measurement noise floor.
When NOT to use: If the stage has data-dependent control flow that prevents compilation (variable-length outputs, dynamic shapes). Postprocessing's confidence filtering creates variable-size tensors — fuse around it, not through it.
Decision Framework
Step 1: Is the workload GPU-bound or CPU-bound?
# Compare GPU active time to wall-clock time
with torch.profiler.profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    run_inference(model, input_data)
# If total CUDA time >> total CPU time: GPU-bound (optimize kernels, reduce transfers)
# If total CPU time >> total CUDA time: CPU-bound (optimize Python code, reduce sync waits)
# If they're close: pipeline-bound (optimize transitions between stages)
Step 2: Is the bottleneck data movement or kernel execution?
Look at the torch.profiler output:
- Transfer-bound: aten::copy_ and cudaMemcpy* dominate. Focus on pinned memory, non_blocking, reducing transfer count (see the sketch after this list).
- Compute-bound: CUDA kernels dominate. Focus on kernel fusion, precision reduction, algorithmic changes.
- Dispatch-bound: Many short kernels with gaps. Focus on kernel fusion via Triton or torch.compile.
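For the transfer-bound case, the fix usually looks like this (a sketch with a hypothetical 1x3x640x640 input):

import torch

# Pageable (default) host tensor: the driver stages the data through an internal
# bounce buffer and the copy is synchronous from the host's point of view.
pageable = torch.empty(1, 3, 640, 640)
gpu_in = pageable.to("cuda")

# Pinned (page-locked) host tensor: the copy is a direct DMA and, with
# non_blocking=True, can overlap with other work on the stream.
pinned = torch.empty(1, 3, 640, 640, pin_memory=True)
gpu_in = pinned.to("cuda", non_blocking=True)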
Step 3: Is the overhead per-call or proportional?
- Per-call fixed cost: Tensor creation, stream synchronization, small H2D transfers. These don't scale with batch size — they matter most for single-image real-time inference.
- Proportional cost: Large tensor transfers, compute-heavy kernels. These scale with batch size and image resolution.
Per-call fixed costs compound in streaming/real-time applications where every millisecond of latency matters. Proportional costs matter more in throughput-oriented batch processing.
GPU Warm-Up
Always warm up the GPU before benchmarking. The first few inference calls include:
- CUDA context initialization (~100-500ms)
- TensorRT engine deserialization (~1-10s)
- cuDNN autotuning (~100ms per unique input shape)
- JIT compilation for torch.compile or Triton kernels
# Standard warm-up: run 10-20 inference calls and discard timing
for _ in range(20):
    run_inference(model, input_data)
torch.cuda.synchronize()
# NOW start measuring
Never benchmark cold GPU state — the variance will mask real optimization effects.