GPU/CUDA Inference Optimization Guide
Why cProfile Is Insufficient for GPU Workloads
cProfile measures Python-side CPU time. GPU work is asynchronous — torch.mm() returns immediately while the GPU computes. The real costs in GPU inference are invisible to Python profilers:
- H2D/D2H transfers: Data copies between CPU and GPU happen in the CUDA driver. A 24-byte tensor copy that takes microseconds of Python time can create a 100us pipeline bubble waiting for the driver to stage pageable memory.
- Kernel dispatch gaps: Each PyTorch operation launches a CUDA kernel with ~25us of dispatch overhead. 29 sequential operations = ~725us of dead time between kernels, but cProfile shows each op completing in microseconds.
- Stream synchronization stalls: stream.synchronize() blocks the CPU thread until the GPU finishes. This shows as idle CPU time in cProfile — not attributed to any function.
- Pipeline bubbles: Gaps where neither CPU nor GPU is doing useful work because one is waiting for the other. These are the "inter-iteration bubbles" that dominate real-world inference latency.
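A quick way to see this asynchrony directly (a minimal sketch, assuming a CUDA device is available): time a matmul with and without an explicit synchronize.

import time
import torch

x = torch.randn(4096, 4096, device="cuda")

# Python-side timing: the call returns as soon as the kernel is queued.
t0 = time.perf_counter()
y = x @ x
t1 = time.perf_counter()

# GPU-side timing: wait for the kernel to actually finish.
torch.cuda.synchronize()
t2 = time.perf_counter()

print(f"Queued in {(t1 - t0) * 1e3:.3f}ms, finished after {(t2 - t0) * 1e3:.3f}ms")

The first number is what cProfile attributes to the call; the second is the latency that actually lands on the pipeline.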
Rule: For GPU-bound workloads, always use torch.profiler or nsys as the primary profiler. cProfile is supplementary only — use it for CPU-side code paths (preprocessing, postprocessing, adapter serialization).
Profiling Tools
torch.profiler (primary)
The standard tool for profiling PyTorch GPU workloads. Captures both CPU and CUDA activities.
import torch
from torch.profiler import profile, ProfilerActivity, schedule
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_stack=True,
    record_shapes=True,
    profile_memory=True,
    schedule=schedule(wait=2, warmup=3, active=5, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("/tmp/gpu_trace"),
) as prof:
    for i in range(10):
        run_inference(model, input_data)
        prof.step()
# Print summary sorted by CUDA time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
# Print summary sorted by CPU time (for comparison)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
What to look for in the output:
- CUDA kernels with high total time — these are compute bottlenecks
- Many calls with short CUDA time — dispatch overhead candidates for fusion
- CPU time >> CUDA time on a function — the CPU is doing unnecessary work
- CUDA time >> CPU time on a function — large GPU compute or transfer
- aten::copy_ entries — these are memory transfers (H2D, D2H, D2D)
- cudaStreamSynchronize in the CPU column — blocking sync points
Nsight Systems (nsys) — authoritative validation
When torch.profiler findings need confirmation or when you need driver-level detail.
nsys profile \
--trace=cuda,nvtx,osrt \
--cuda-memory-usage=true \
--output=/tmp/nsys_profile \
$RUNNER inference_script.py
Reading the timeline:
- Stream rows: Each CUDA stream shows kernel execution as colored blocks with gaps between them. Large gaps = pipeline bubbles.
- Memory copy rows: H2D (green) and D2H (red) transfers. Pageable transfers show a staging copy; pinned transfers are direct.
- CPU row: Python thread activity. Look for idle periods that coincide with cudaStreamSynchronize.
- NVTX markers: If the code uses torch.cuda.nvtx.range_push/pop, these label pipeline stages (see the sketch below).
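If the script doesn't already emit NVTX markers, a minimal sketch of how to add them (preprocess and postprocess are hypothetical stage functions standing in for your own):

import torch

def run_inference_with_nvtx(model, input_data):
    # Each push/pop pair shows up as a labeled span on the NVTX row in nsys.
    torch.cuda.nvtx.range_push("preprocess")
    batch = preprocess(input_data)      # hypothetical stage function
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("inference")
    output = model(batch)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("postprocess")
    result = postprocess(output)        # hypothetical stage function
    torch.cuda.nvtx.range_pop()
    return result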
torch.cuda.Event timing (quick per-stage measurement)
For fast A/B comparisons without full profiling infrastructure.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
# Warm up
for _ in range(10):
    run_inference(model, input_data)
torch.cuda.synchronize()

# Measure
times = []
for _ in range(100):
    start.record()
    run_inference(model, input_data)
    end.record()
    torch.cuda.synchronize()
    times.append(start.elapsed_time(end))  # milliseconds
import statistics
print(f"Mean: {statistics.mean(times):.3f}ms, Median: {statistics.median(times):.3f}ms, Std: {statistics.stdev(times):.3f}ms")
Per-stage measurement — wrap each pipeline stage with its own event pair:
events = {stage: (torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True))
          for stage in ["preprocess", "inference", "postprocess"]}

events["preprocess"][0].record(preprocess_stream)
# ... preprocessing ...
events["preprocess"][1].record(preprocess_stream)

events["inference"][0].record(inference_stream)
# ... inference ...
events["inference"][1].record(inference_stream)

# After sync:
for stage, (s, e) in events.items():
    print(f"{stage}: {s.elapsed_time(e):.3f}ms")
The GPU Inference Pipeline Model
A typical GPU inference pipeline has 6 stages. Each transition between stages is an optimization surface:
CPU Prep ──H2D──> GPU Preprocess ──> GPU Inference ──D2H──> CPU Postprocess ──> Output
    │                   │                  │                       │
    ▼                   ▼                  ▼                       ▼
Image resize      Normalize,         Model forward          Threshold,
Format convert    Channel swap,      pass (TRT,             Sort, NMS,
Numpy→Tensor      Pad/letterbox      ONNX, PyTorch)         Class remap
Where time is typically spent (ranges for single-image inference on modern GPUs):
| Stage | Typical range | Optimization surface |
|---|---|---|
| CPU Prep | 0.5-3ms | Vectorization, avoid copies |
| H2D Transfer | 0.1-1ms | Pinned memory, non_blocking |
| GPU Preprocess | 0.1-2ms | Kernel fusion, cached constants |
| GPU Inference | 2-20ms | Model optimization (TRT, quantization) — usually not our target |
| D2H Transfer | 0.05-0.5ms | Direct-to-host writes, pinned buffers |
| CPU Postprocess | 0.1-2ms | Numpy for small arrays, vectorization |
| Sync overhead | 0.5-3ms | Event-based sync, pipeline overlap |
The model forward pass is usually the largest single cost, but it's typically already optimized (TensorRT, ONNX). The surrounding pipeline stages and their transitions often account for 30-60% of total E2E latency and are where Python-level optimization has the most impact.
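One way to attack the transition costs is to overlap stages on separate CUDA streams so the H2D copy and GPU preprocessing of the next frame hide behind the current forward pass. A simplified sketch (preprocess is a hypothetical stage function; real code also has to manage buffer reuse across frames):

import torch

preprocess_stream = torch.cuda.Stream()
inference_stream = torch.cuda.Stream()

def pipelined_step(model, frame_cpu, pinned_buf):
    # Stage the copy and preprocessing on a side stream.
    with torch.cuda.stream(preprocess_stream):
        pinned_buf.copy_(frame_cpu)                        # CPU -> pinned staging buffer
        gpu_raw = pinned_buf.to("cuda", non_blocking=True)
        batch = preprocess(gpu_raw)                        # hypothetical stage function
    # Make the inference stream wait on that work without blocking the CPU.
    inference_stream.wait_stream(preprocess_stream)
    batch.record_stream(inference_stream)                  # keep allocator lifetimes correct across streams
    with torch.cuda.stream(inference_stream):
        return model(batch)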
Monolithic kernel strategy
When individual micro-optimizations (tensor caching, pinned memory, etc.) show improvements in isolation but fail to register in E2E measurements — often because savings are below variance or masked by stream overlap — the monolithic approach fuses an entire pipeline stage into a single compiled operation. This eliminates ALL intermediate tensors, kernel launches, and dispatch gaps at once.
Preprocessing monolith: Replace (CPU resize + pageable H2D + permute + normalize + channel swap) with (pinned H2D + GPU F.interpolate + torch.compile(mode="reduce-overhead") fusing the rest). Moves compute onto GPU where it fuses into a single compiled kernel graph. Measured: 31% stage improvement on RF-DETR (2.47ms -> 1.70ms).
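A minimal sketch of what such a preprocessing monolith can look like; the target size, normalization constants, and input shape below are hypothetical placeholders, not values from the RF-DETR pipeline:

import torch
import torch.nn.functional as F

TARGET_HW = (640, 640)  # hypothetical model input size
MEAN = torch.tensor([0.485, 0.456, 0.406], device="cuda").view(1, 3, 1, 1)
STD = torch.tensor([0.229, 0.224, 0.225], device="cuda").view(1, 3, 1, 1)

@torch.compile(mode="reduce-overhead")
def preprocess_monolith(img_u8: torch.Tensor) -> torch.Tensor:
    # img_u8: (1, H, W, 3) uint8 tensor already on the GPU.
    x = img_u8.permute(0, 3, 1, 2).float() / 255.0          # HWC -> CHW, scale to [0, 1]
    x = F.interpolate(x, size=TARGET_HW, mode="bilinear", align_corners=False)
    return (x - MEAN) / STD                                  # fused into the compiled graph

# Usage: one pinned H2D copy, then a single compiled graph for everything else.
pinned = torch.empty(1, 1080, 1920, 3, dtype=torch.uint8, pin_memory=True)
batch = preprocess_monolith(pinned.to("cuda", non_blocking=True))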
Postprocessing monolith: Replace (per-call torch.tensor() + torch.isin() + external rescale function) with (pre-computed boolean lookup mask + cached GPU tensors + inlined in-place rescale). Eliminates all per-call allocations and expensive internal kernels. Measured: 33% stage improvement on RF-DETR (1.11ms -> 0.74ms).
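The lookup-mask piece of this can be sketched as follows (label space and class filter are hypothetical):

import torch

NUM_CLASSES = 91            # hypothetical label space
KEEP_CLASSES = [1, 3, 17]   # hypothetical classes of interest

# Built once at init: a boolean lookup table resident on the GPU.
keep_mask = torch.zeros(NUM_CLASSES, dtype=torch.bool, device="cuda")
keep_mask[KEEP_CLASSES] = True

def filter_detections(labels: torch.Tensor, boxes: torch.Tensor):
    # Indexing the cached mask replaces a per-call torch.tensor(KEEP_CLASSES)
    # allocation followed by torch.isin().
    keep = keep_mask[labels]
    return labels[keep], boxes[keep]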
When to use: After individual transfer/dispatch/sync optimizations are discarded or show sub-threshold gains. The monolithic approach often succeeds where piecemeal fixes fail because the combined savings exceed the measurement noise floor.
When NOT to use: If the stage has data-dependent control flow that prevents compilation (variable-length outputs, dynamic shapes). Postprocessing's confidence filtering creates variable-size tensors — fuse around it, not through it.
Decision Framework
Step 1: Is the workload GPU-bound or CPU-bound?
# Compare GPU active time to wall-clock time
with torch.profiler.profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    run_inference(model, input_data)
# If total CUDA time >> total CPU time: GPU-bound (optimize kernels, reduce transfers)
# If total CPU time >> total CUDA time: CPU-bound (optimize Python code, reduce sync waits)
# If they're close: pipeline-bound (optimize transitions between stages)
Step 2: Is the bottleneck data movement or kernel execution?
Look at the torch.profiler output:
- Transfer-bound: aten::copy_ and cudaMemcpy* dominate. Focus on pinned memory, non_blocking, reducing transfer count (see the sketch after this list).
- Compute-bound: CUDA kernels dominate. Focus on kernel fusion, precision reduction, algorithmic changes.
- Dispatch-bound: Many short kernels with gaps. Focus on kernel fusion via Triton or torch.compile.
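For the transfer-bound case, the fix usually looks like this (a sketch with a hypothetical 1x3x640x640 input):

import torch

# Pageable (default) host tensor: the driver stages the data through an internal
# bounce buffer and the copy is synchronous from the host's point of view.
pageable = torch.empty(1, 3, 640, 640)
gpu_in = pageable.to("cuda")

# Pinned (page-locked) host tensor: the copy is a direct DMA and, with
# non_blocking=True, can overlap with other work on the stream.
pinned = torch.empty(1, 3, 640, 640, pin_memory=True)
gpu_in = pinned.to("cuda", non_blocking=True)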
Step 3: Is the overhead per-call or proportional?
- Per-call fixed cost: Tensor creation, stream synchronization, small H2D transfers. These don't scale with batch size — they matter most for single-image real-time inference.
- Proportional cost: Large tensor transfers, compute-heavy kernels. These scale with batch size and image resolution.
Per-call fixed costs compound in streaming/real-time applications where every millisecond of latency matters. Proportional costs matter more in throughput-oriented batch processing.
GPU Warm-Up
Always warm up the GPU before benchmarking. The first few inference calls include:
- CUDA context initialization (~100-500ms)
- TensorRT engine deserialization (~1-10s)
- cuDNN autotuning (~100ms per unique input shape)
- JIT compilation for torch.compile or Triton kernels
# Standard warm-up: run 10-20 inference calls and discard timing
for _ in range(20):
    run_inference(model, input_data)
torch.cuda.synchronize()
# NOW start measuring
Never benchmark cold GPU state — the variance will mask real optimization effects.