# Experiment Loop — GPU/CUDA Domain

Base framework: `../shared/experiment-loop-base.md`
## Reasoning Checklist

Before writing any code, answer these 10 questions. If you can't answer questions 3-7 concretely, profile deeper before coding.
- Pattern: What GPU antipattern? (see the `reference.md` catalog: [transfer], [dispatch], [sync], [memory], [compute], [correctness])
- Hot path? Confirmed by torch.profiler or nsys — NOT cProfile. cProfile cannot see GPU-side costs.
- Pipeline stage? Which stage: CPU prep, H2D, GPU preprocess, GPU inference, D2H, CPU postprocess?
- Transfer or compute? Is the bottleneck data movement (H2D/D2H/staging) or kernel execution time?
- Per-call vs amortized? Is this fixed overhead per inference call, or proportional to batch size / image resolution?
- GPU warm? Is the benchmark running after warmup (CUDA context init, TRT deserialization, cuDNN autotuning)?
- Mechanism: HOW does your change reduce latency? Be specific: fewer transfers, fewer kernel launches, less sync wait, direct-to-host writes, etc.
- Correctness: Does this change numerical output? Kernel fusion, precision changes, and operation reordering can produce different floating-point results. Verify bit-for-bit or within tolerance.
- CUDA graph compatibility: Does the project use CUDA graphs? If so, will this change break capture or replay? Event-based sync is incompatible with graph capture. Pinned memory and cached tensors are compatible.
- Verify cheaply: Can you validate with `torch.cuda.Event` timing before running the full benchmark?
## Profiling Methodology

### Mandatory: torch.profiler baseline
Run this BEFORE any code changes. This is equivalent to the cProfile gate for CPU optimization — entering the experiment loop without torch.profiler output is not permitted.
```python
# /tmp/gpu_baseline.py
import torch
from torch.profiler import profile, ProfilerActivity

# --- Adapt these to the project ---
# from model import load_model, run_inference, load_test_input
# model = load_model()
# input_data = load_test_input()
# ---

# Warmup (CUDA context init, TRT deserialization, cuDNN autotuning)
for _ in range(20):
    run_inference(model, input_data)
torch.cuda.synchronize()

# Profile
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_stack=True,
    record_shapes=True,
) as prof:
    for _ in range(10):
        run_inference(model, input_data)
    torch.cuda.synchronize()

# Ranked by CUDA time
print("=== CUDA time ranking ===")
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))

# Ranked by CPU time
print("\n=== CPU time ranking ===")
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))

# Memory copies (H2D, D2H)
print("\n=== Memory events ===")
for evt in prof.key_averages():
    if "copy" in evt.key.lower() or "memcpy" in evt.key.lower():
        print(f"  {evt.key}: count={evt.count}, cuda_time={evt.cuda_time_total}us")
```
Print the output as [gpu baseline] — this is a key deliverable.
### Event-based micro-timing (per-stage measurement)
For quick A/B comparisons of individual pipeline stages:
```python
import statistics

import torch

def time_stage(fn, *args, warmup=10, runs=100, stream=None):
    """Time a single pipeline stage using CUDA events.

    Returns (median_ms, stdev_ms) — elapsed_time() reports milliseconds.
    """
    s = stream or torch.cuda.current_stream()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    times = []
    for _ in range(runs):
        start.record(s)
        fn(*args)
        end.record(s)
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))
    return statistics.median(times), statistics.stdev(times)
```
### nsys validation (when torch.profiler is insufficient)
Use when you need to see driver-level detail — pageable vs pinned transfers, exact kernel launch gaps, stream activity overlap:
```shell
nsys profile --trace=cuda,nvtx,osrt --cuda-memory-usage=true -o /tmp/nsys_validate \
  $RUNNER /tmp/inference_bench.py
```
## Domain-Specific Loop Steps
**Step 1 — Choose target sources:**

- torch.profiler output: highest CUDA time function, or largest gap between GPU kernels
- nsys timeline: pipeline bubbles — idle gaps between stream activities
- Manual inspection: per-call tensor creation patterns, `synchronize()` calls, pageable H2D transfers
Print: `[experiment N] Target: <stage>.<pattern> (<est. latency>us, <antipattern category>)`

Example:

```
[experiment 1] Target: postprocess.dispatch_gaps (1021us, 29 kernels — [dispatch] fusion candidate)
[experiment 2] Target: preprocess.h2d_staging (300us — [transfer] pageable->pinned)
[experiment 3] Target: preprocess.tensor_create (150us — [transfer] loop-invariant hoisting)
[experiment 4] Target: sync.stream_block (1000us — [sync] synchronize->event)
```
**Step 6 — Read results:** Print BOTH stage-level AND E2E timings:

```
[experiment N] Stage: <stage> <before>us -> <after>us (<delta>%)
[experiment N] E2E: <before>ms -> <after>ms (<delta>%)
```
**Step 8 — Noise threshold:** GPU timing has higher variance than CPU due to GPU clock boosting, thermal throttling, and driver scheduling. If the speedup is <10% on a single stage, re-run 5x (not 3x) with the GPU warm to confirm it's not noise. Use the median, not the mean.
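The Step 8 re-run protocol can be sketched as a median comparison; all timings below are hypothetical:

```python
import statistics

# Hypothetical warm-GPU stage timings (us) from 5 re-runs each
baseline = [102.0, 99.5, 101.2, 100.8, 100.1]
optimized = [89.3, 90.1, 88.8, 91.0, 89.6]

# Compare medians, not means — robust to clock-boost outliers
speedup = statistics.median(baseline) / statistics.median(optimized)
confirmed = speedup >= 1.10  # the >=10% stage gain must survive re-runs
print(f"speedup={speedup:.3f} confirmed={confirmed}")
```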
**Step 10 — Record:** Log the result immediately in both `.codeflash/results.tsv` and `.codeflash/HANDOFF.md`. Include both stage-level and E2E metrics.
**Step 12 — E2E benchmark:** After a KEEP, measure E2E latency over 100+ inference calls to confirm the stage-level gain translates to a real-world improvement. Stage improvements that don't show in E2E may be masked by other bottlenecks — still KEEP if the stage improvement is confirmed, but note the masking.
Keep/Discard Thresholds
Tests passed?
+-- NO -> Fix or discard immediately
+-- YES -> Check E2E and stage metrics:
+-- E2E improved >=3% -> KEEP
+-- E2E <3% but stage improved >=10%:
| +-- Re-run 5x with GPU warm
| +-- Confirmed -> KEEP (stage gain is real; E2E may compound with other fixes)
| +-- Noise -> DISCARD
+-- Micro-benchmark (Event timing) >=20% on confirmed hot stage -> KEEP
+-- No measurable improvement -> DISCARD
Why 3% E2E (not 5% like CPU)? GPU inference pipelines have high fixed costs (model forward pass) that dilute the effect of transfer/sync optimizations on E2E numbers. A 50% reduction in a pipeline stage (e.g., postprocess 1ms -> 0.5ms) may only show as 5% E2E improvement on a 10ms pipeline. The stage-level improvement is real and compounds with other stage fixes.
## Strategy Rotation
If 3+ consecutive discards on the same type, switch:
- Transfer optimization (pinned memory, cached tensors, non_blocking) -> if exhausted:
- Dispatch/fusion (Triton kernels, torch.compile, batched ops) -> if exhausted:
- Monolithic kernel fusion (fuse entire pre/post stage into single compiled operation — see below) -> if exhausted:
- Synchronization (event-based sync, stream overlap, pipeline parallelism) -> if exhausted:
- Compute/precision (FP16, operator replacement, GPU-side preprocessing) -> if exhausted:
- Architectural (batch pipeline stages differently, pipeline parallelism across frames)
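As a sketch of the first rotation strategy ([transfer]), a pinned staging buffer cached at init with an async copy might look like this. Shapes and names are hypothetical; `pin_memory` is requested only when CUDA is present so the sketch also runs on CPU-only machines:

```python
import torch

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

# Allocate the staging buffer ONCE at init, not per call (hypothetical shape)
pinned = torch.empty((1, 3, 640, 640), pin_memory=use_cuda)

def to_device(frame_cpu: torch.Tensor) -> torch.Tensor:
    pinned.copy_(frame_cpu)                      # pageable -> pinned staging copy
    return pinned.to(device, non_blocking=True)  # async H2D when pinned + CUDA

out = to_device(torch.zeros(1, 3, 640, 640))
```

The async copy only overlaps with compute when the source is pinned; a `non_blocking=True` copy from pageable memory silently falls back to synchronous.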
### Monolithic kernel escalation
When individual micro-optimizations (the first two rotation strategies) are discarded because savings are below the E2E noise floor or masked by stream overlap, escalate to monolithic stage fusion. Instead of optimizing individual ops, replace the entire pipeline stage with a single compiled operation.
**Preprocessing monolith recipe:**

- Replace CPU resize with GPU `F.interpolate` (move compute to GPU)
- Use a pinned memory buffer + `non_blocking=True` for raw H2D (eliminate staging)
- Wrap permute + resize + channel swap + normalize in `torch.compile(mode="reduce-overhead")`
- Cache normalization tensors on GPU at init time
- Make the pinned buffer thread-local if the model serves concurrent requests
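A minimal sketch of the cached-normalization-tensor step (ImageNet constants used as a hypothetical example; the device defaults to CPU so the sketch runs anywhere — pass `device="cuda"` in a real pipeline):

```python
import torch

class Preprocessor:
    def __init__(self, device: str = "cpu"):  # "cuda" in a real pipeline
        # Built ONCE at init — never recreated inside the hot loop
        self.mean = torch.tensor([0.485, 0.456, 0.406], device=device).view(1, 3, 1, 1)
        self.std = torch.tensor([0.229, 0.224, 0.225], device=device).view(1, 3, 1, 1)

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        # Broadcast against cached constants; no per-call tensor creation
        return x.sub(self.mean).div(self.std)

pre = Preprocessor()
y = pre(torch.ones(1, 3, 2, 2))
```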
**Postprocessing monolith recipe:**

- Replace `torch.isin()` with a pre-computed boolean lookup mask (index by class ID)
- Cache `inference_size_whwh`, pad offsets, and scale tensors on GPU (keyed by metadata identity)
- Inline rescale: `boxes.sub_(offsets).div_(scale)` — in-place, no intermediate allocations
- Keep data-dependent ops (confidence filter, sort) as-is — they can't be compiled
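The lookup-mask replacement for `torch.isin()` can be sketched like this (class count and IDs are hypothetical):

```python
import torch

NUM_CLASSES = 91       # hypothetical class count
KEEP_IDS = [1, 3, 7]   # hypothetical classes to keep

# Build ONCE at init: a per-call O(1) gather instead of a torch.isin() scan
keep_mask = torch.zeros(NUM_CLASSES, dtype=torch.bool)
keep_mask[KEEP_IDS] = True

labels = torch.tensor([0, 3, 7, 5])   # per-detection class IDs (hypothetical)
keep = keep_mask[labels]              # same result as torch.isin(labels, keep_ids)
```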
When this works: Individual fixes gave 13x micro-benchmark gains but 0% E2E because stream overlap or variance masked savings. The monolithic approach combines ALL savings into one change that exceeds the noise floor. RF-DETR case: individual fixes = 0% E2E, monolithic = 8.3% E2E.
## Plateau Detection

Remaining targets are at the optimization floor when:

- Inside a TRT/ONNX engine: the model forward pass is opaque — individual kernels can't be optimized at the Python level. Recommend TRT profile/refit or model-level changes.
- PCIe bandwidth ceiling: H2D/D2H transfers are limited by PCIe bandwidth. Verify with `nvidia-smi topo -m` and a transfer size calculation.
- SM occupancy ceiling: GPU compute units are fully utilized. Verify with nsys or Nsight Compute occupancy metrics.
- CUDA context overhead: ~50-100us per kernel launch is irreducible driver overhead. It can only be reduced by fewer launches (fusion) or CUDA graphs.
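The transfer size calculation for the PCIe check might look like the following; the frame shape and measured time are hypothetical, and the result is compared against the link's effective bandwidth (roughly 25 GB/s for PCIe 4.0 x16):

```python
# Hypothetical FP32 1080p frame
bytes_moved = 1 * 3 * 1080 * 1920 * 4   # ~24.9 MB
measured_us = 2000                       # measured H2D time (hypothetical)

gb_per_s = bytes_moved / (measured_us * 1e-6) / 1e9
print(f"effective H2D bandwidth: {gb_per_s:.1f} GB/s")
# If this is near the link ceiling, the transfer is bandwidth-bound — no
# software fix helps short of moving less data or overlapping the copy.
```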
## Correctness Verification

### Numerical equivalence
Kernel fusion and operation reordering can produce different floating-point results due to:
- Different reduction order (sum, max, argmax)
- Different intermediate precision (FP32 vs FP16 intermediates)
- Fused multiply-add vs separate multiply then add
Always verify output equivalence after kernel fusion:

```python
# Run both paths on the same input
original_output = original_postprocess(logits, boxes)
fused_output = fused_postprocess(logits, boxes)

# Check equivalence within tolerance
assert torch.allclose(original_output, fused_output, atol=1e-5, rtol=1e-4), \
    f"Max diff: {(original_output - fused_output).abs().max()}"
```
### CUDA graph compatibility
If the project uses CUDA graph capture:
- Safe: cached tensors, pinned buffers, pre-allocated outputs — these are created outside capture
- Unsafe: event-based cross-stream sync — events recorded outside the graph are ignored during replay
- Depends: `torch.compile` — the compiled graph may or may not be capturable
## Logging Format

Tab-separated `.codeflash/results.tsv` columns:

```
commit  target_test  stage  gpu_baseline_us  gpu_optimized_us  gpu_speedup  e2e_baseline_ms  e2e_optimized_ms  e2e_speedup  tests_passed  tests_failed  status  pattern  description
```

- `stage`: pipeline stage (e.g., `preprocess`, `postprocess`, `sync`)
- `gpu_baseline_us` / `gpu_optimized_us`: stage-level timing in microseconds
- `e2e_baseline_ms` / `e2e_optimized_ms`: end-to-end timing in milliseconds
- `pattern`: antipattern tag (e.g., `loop-invariant-tensor`, `kernel-fusion`, `pageable-h2d`, `monolithic-preprocess`, `monolithic-postprocess`)
- `status`: `keep`, `discard`, or `crash`
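A small helper for appending rows in this schema can be sketched as follows; the column names come from the table above, but the writer itself and the sample row values are hypothetical, not part of the framework:

```python
import csv
import io

FIELDS = [
    "commit", "target_test", "stage",
    "gpu_baseline_us", "gpu_optimized_us", "gpu_speedup",
    "e2e_baseline_ms", "e2e_optimized_ms", "e2e_speedup",
    "tests_passed", "tests_failed", "status", "pattern", "description",
]

def append_row(fh, row: dict) -> None:
    """Append one tab-separated row; fh is an open text file (append mode)."""
    csv.DictWriter(fh, fieldnames=FIELDS, delimiter="\t").writerow(row)

# Demo with an in-memory buffer; in practice use open(".codeflash/results.tsv", "a")
buf = io.StringIO()
append_row(buf, {
    "commit": "abc1234", "target_test": "test_infer", "stage": "postprocess",
    "gpu_baseline_us": 1021, "gpu_optimized_us": 512, "gpu_speedup": 1.99,
    "e2e_baseline_ms": 10.2, "e2e_optimized_ms": 9.7, "e2e_speedup": 1.05,
    "tests_passed": 42, "tests_failed": 0, "status": "keep",
    "pattern": "kernel-fusion", "description": "fused postprocess ops",
})
```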