
Experiment Loop — GPU/CUDA Domain

Base framework: ../shared/experiment-loop-base.md

Reasoning Checklist

Before writing any code, answer these 10 questions. If you can't answer questions 3-7 concretely, profile deeper before coding.

  1. Pattern: What GPU antipattern? (see reference.md catalog: [transfer], [dispatch], [sync], [memory], [compute], [correctness])
  2. Hot path? Confirmed by torch.profiler or nsys — NOT cProfile. cProfile cannot see GPU-side costs.
  3. Pipeline stage? Which stage: CPU prep, H2D, GPU preprocess, GPU inference, D2H, CPU postprocess?
  4. Transfer or compute? Is the bottleneck data movement (H2D/D2H/staging) or kernel execution time?
  5. Per-call vs amortized? Is this fixed overhead per inference call, or proportional to batch size / image resolution?
  6. GPU warm? Is the benchmark running after warmup (CUDA context init, TRT deserialization, cuDNN autotuning)?
  7. Mechanism: HOW does your change reduce latency? Be specific: fewer transfers, fewer kernel launches, less sync wait, direct-to-host writes, etc.
  8. Correctness: Does this change numerical output? Kernel fusion, precision changes, and operation reordering can produce different floating-point results. Verify bit-for-bit or within tolerance.
  9. CUDA graph compatibility: Does the project use CUDA graphs? If so, will this change break capture or replay? Event-based sync is incompatible with graph capture. Pinned memory and cached tensors are compatible.
  10. Verify cheaply: Can you validate with torch.cuda.Event timing before running the full benchmark?

Profiling Methodology

Mandatory: torch.profiler baseline

Run this BEFORE any code changes. This is equivalent to the cProfile gate for CPU optimization — entering the experiment loop without torch.profiler output is not permitted.

# /tmp/gpu_baseline.py
import torch
from torch.profiler import profile, ProfilerActivity

# --- Adapt these to the project ---
# from model import load_model, load_test_input, run_inference
# model = load_model()
# input_data = load_test_input()
# ---

# Warmup
for _ in range(20):
    run_inference(model, input_data)
torch.cuda.synchronize()

# Profile
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_stack=True,
    record_shapes=True,
) as prof:
    for _ in range(10):
        run_inference(model, input_data)
    torch.cuda.synchronize()

# Ranked by CUDA time
print("=== CUDA time ranking ===")
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))

# Ranked by CPU time
print("\n=== CPU time ranking ===")
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))

# Memory copies (H2D, D2H); "copy" matches both aten::copy_ and Memcpy event names
print("\n=== Memory events ===")
for evt in prof.key_averages():
    if "copy" in evt.key.lower():
        print(f"  {evt.key}: count={evt.count}, cuda_time={evt.cuda_time_total}us")

Print the output as [gpu baseline] — this is a key deliverable.

Event-based micro-timing (per-stage measurement)

For quick A/B comparisons of individual pipeline stages:

import statistics
import torch

def time_stage(fn, *args, warmup=10, runs=100, stream=None):
    """Time a single pipeline stage using CUDA events; returns (median_ms, stdev_ms)."""
    s = stream or torch.cuda.current_stream()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()

    times = []
    for _ in range(runs):
        start.record(s)
        fn(*args)
        end.record(s)
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))  # milliseconds

    return statistics.median(times), statistics.stdev(times)
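
Usage, with a hypothetical postprocess stage and raw_outputs input (note that elapsed_time returns milliseconds, so multiply by 1000 to report microseconds):

median_ms, stdev_ms = time_stage(postprocess, raw_outputs, warmup=10, runs=100)
print(f"[experiment 1] Stage: postprocess {median_ms * 1000:.0f}us (stdev {stdev_ms * 1000:.0f}us)")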

nsys validation (when torch.profiler is insufficient)

Use when you need to see driver-level detail — pageable vs pinned transfers, exact kernel launch gaps, stream activity overlap:

nsys profile --trace=cuda,nvtx,osrt --cuda-memory-usage=true -o /tmp/nsys_validate \
    $RUNNER /tmp/inference_bench.py
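
To make the nsys timeline readable, wrap each pipeline stage in an NVTX range (picked up by --trace=nvtx). A minimal sketch; the stage functions and names here are illustrative:

import torch

def run_inference(model, input_data):
    torch.cuda.nvtx.range_push("preprocess")
    batch = preprocess(input_data)  # project-specific
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("forward")
    out = model(batch)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("postprocess")
    result = postprocess(out)  # project-specific
    torch.cuda.nvtx.range_pop()
    return result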

Domain-Specific Loop Steps

Step 1 — Choose target sources:

  • torch.profiler output: Highest CUDA time function, or largest gap between GPU kernels
  • nsys timeline: Pipeline bubbles — idle gaps between stream activities
  • Manual inspection: Per-call tensor creation patterns, synchronize() calls, pageable H2D transfers

Print: [experiment N] Target: <stage>.<pattern> (<est. latency>us, <antipattern category>)

Example:

[experiment 1] Target: postprocess.dispatch_gaps (1021us, 29 kernels — [dispatch] fusion candidate)
[experiment 2] Target: preprocess.h2d_staging (300us — [transfer] pageable->pinned)
[experiment 3] Target: preprocess.tensor_create (150us — [transfer] loop-invariant hoisting)
[experiment 4] Target: sync.stream_block (1000us — [sync] synchronize->event)

Step 6 — Read results: Print BOTH stage-level AND E2E timings:

[experiment N] Stage: <stage> <before>us -> <after>us (<delta>%)
[experiment N] E2E: <before>ms -> <after>ms (<delta>%)

Step 8 — Noise threshold: GPU timing has higher variance than CPU due to GPU clock boosting, thermal throttling, and driver scheduling. If speedup <10% on a single stage, re-run 5x (not 3x) with GPU warm to confirm it's not noise. Use median, not mean.

Step 10 — Record: Record immediately in both .codeflash/results.tsv and .codeflash/HANDOFF.md. Include both stage-level and E2E metrics.

Step 12 — E2E benchmark: After KEEP, measure E2E latency over 100+ inference calls to confirm the stage-level gain translates to real-world improvement. Stage improvements that don't show in E2E may be masked by other bottlenecks — still KEEP if the stage improvement is confirmed, but note the masking.
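
A minimal E2E timing sketch, assuming a run_inference(model, input_data) entry point as in the baseline script; synchronize before reading the wall clock so queued GPU work is counted:

import statistics
import time
import torch

def bench_e2e(model, input_data, warmup=20, runs=100):
    """Median end-to-end latency in milliseconds over `runs` inference calls."""
    for _ in range(warmup):
        run_inference(model, input_data)
    torch.cuda.synchronize()

    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        run_inference(model, input_data)
        torch.cuda.synchronize()  # flush queued GPU work into the measurement
        times.append((time.perf_counter() - t0) * 1000)
    return statistics.median(times)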

Keep/Discard Thresholds

Tests passed?
+-- NO -> Fix or discard immediately
+-- YES -> Check E2E and stage metrics:
    +-- E2E improved >=3% -> KEEP
    +-- E2E <3% but stage improved >=10%:
    |   +-- Re-run 5x with GPU warm
    |   +-- Confirmed -> KEEP (stage gain is real; E2E may compound with other fixes)
    |   +-- Noise -> DISCARD
    +-- Micro-benchmark (Event timing) >=20% on confirmed hot stage -> KEEP
    +-- No measurable improvement -> DISCARD

Why 3% E2E (not 5% like CPU)? GPU inference pipelines have high fixed costs (model forward pass) that dilute the effect of transfer/sync optimizations on E2E numbers. A 50% reduction in a pipeline stage (e.g., postprocess 1ms -> 0.5ms) may only show as 5% E2E improvement on a 10ms pipeline. The stage-level improvement is real and compounds with other stage fixes.

Strategy Rotation

After 3+ consecutive discards of the same optimization type, switch:

  1. Transfer optimization (pinned memory, cached tensors, non_blocking) -> if exhausted:
  2. Dispatch/fusion (Triton kernels, torch.compile, batched ops) -> if exhausted:
  3. Monolithic kernel fusion (fuse entire pre/post stage into single compiled operation — see below) -> if exhausted:
  4. Synchronization (event-based sync, stream overlap, pipeline parallelism) -> if exhausted:
  5. Compute/precision (FP16, operator replacement, GPU-side preprocessing) -> if exhausted:
  6. Architectural (batch pipeline stages differently, pipeline parallelism across frames)

Monolithic kernel escalation

When individual micro-optimizations (strategies 1-2) are discarded because savings are below the E2E noise floor or masked by stream overlap, escalate to monolithic stage fusion. Instead of optimizing individual ops, replace the entire pipeline stage with a single compiled operation.

Preprocessing monolith recipe (sketched after the list):

  1. Replace CPU resize with GPU F.interpolate (move compute to GPU)
  2. Use pinned memory buffer + non_blocking=True for raw H2D (eliminate staging)
  3. Wrap permute + resize + channel swap + normalize in torch.compile(mode="reduce-overhead")
  4. Cache normalization tensors on GPU at init time
  5. Make pinned buffer thread-local if model serves concurrent requests
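
A minimal sketch of this recipe, assuming 1080p uint8 HWC frames, ImageNet normalization constants, and BGR input; the buffer shape, output size, and channel flip are illustrative assumptions:

import torch
import torch.nn.functional as F

# Step 4: normalization constants cached on GPU at init time
MEAN = torch.tensor([0.485, 0.456, 0.406], device="cuda").view(1, 3, 1, 1)
STD = torch.tensor([0.229, 0.224, 0.225], device="cuda").view(1, 3, 1, 1)

# Step 2: reusable pinned staging buffer for raw frames
pinned = torch.empty((1080, 1920, 3), dtype=torch.uint8, pin_memory=True)

@torch.compile(mode="reduce-overhead")  # step 3: fuse the GPU-side ops
def _gpu_preprocess(frame_u8, out_h, out_w):
    x = frame_u8.permute(2, 0, 1).unsqueeze(0).float()          # HWC -> NCHW
    x = F.interpolate(x, size=(out_h, out_w), mode="bilinear")  # step 1: GPU resize
    x = x.flip(1)                                               # BGR -> RGB channel swap
    return (x / 255.0 - MEAN) / STD

def preprocess(frame_np, out_h=640, out_w=640):
    pinned.copy_(torch.from_numpy(frame_np))
    frame_gpu = pinned.to("cuda", non_blocking=True)  # pinned H2D, no staging copy
    return _gpu_preprocess(frame_gpu, out_h, out_w)

Step 5 is omitted here; wrap the pinned buffer in threading.local() if the model serves concurrent requests.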

Postprocessing monolith recipe (step 1 is sketched after the list):

  1. Replace torch.isin() with pre-computed boolean lookup mask (index by class ID)
  2. Cache inference_size_whwh, pad offsets, and scale tensors on GPU (keyed by metadata identity)
  3. Inline rescale: boxes.sub_(offsets).div_(scale) — in-place, no intermediate allocations
  4. Keep data-dependent ops (confidence filter, sort) as-is — they can't be compiled
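
Step 1 as a minimal sketch; the class count and allowed-class IDs are illustrative:

import torch

NUM_CLASSES = 91                                   # illustrative
ALLOWED = torch.tensor([1, 3, 17], device="cuda")  # illustrative class IDs to keep

# Built once at init: boolean lookup table indexed by class ID
CLASS_MASK = torch.zeros(NUM_CLASSES, dtype=torch.bool, device="cuda")
CLASS_MASK[ALLOWED] = True

def filter_by_class(class_ids):
    # Equivalent to torch.isin(class_ids, ALLOWED), but a single
    # gather instead of a search kernel
    return CLASS_MASK[class_ids]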

When this works: Individual fixes gave 13x micro-benchmark gains but 0% E2E because stream overlap or variance masked savings. The monolithic approach combines ALL savings into one change that exceeds the noise floor. RF-DETR case: individual fixes = 0% E2E, monolithic = 8.3% E2E.

Plateau Detection

Remaining targets are at the optimization floor when:

  • Inside TRT/ONNX engine: The model forward pass is opaque — can't optimize individual kernels at the Python level. Recommend TRT profile/refit or model-level changes.
  • PCIe bandwidth ceiling: H2D/D2H transfers are limited by PCIe bandwidth. Verify with nvidia-smi topo -m and transfer size calculation.
  • SM occupancy ceiling: GPU compute units are fully utilized. Verify with nsys or Nsight Compute occupancy metrics.
  • CUDA context overhead: ~50-100us per kernel launch is irreducible driver overhead. Can only be reduced by fewer launches (fusion) or CUDA graphs.
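
A back-of-envelope check for the PCIe ceiling, assuming a 1080p FP32 RGB frame and ~25 GB/s effective PCIe 4.0 x16 bandwidth (both illustrative):

frame_bytes = 1920 * 1080 * 3 * 4        # ~24.9 MB per FP32 RGB frame
pcie_bps = 25e9                          # effective bytes/s, assumed
floor_us = frame_bytes / pcie_bps * 1e6  # ~995 us

# If measured H2D time is already near floor_us, the transfer is
# bandwidth-bound: shrink the payload (ship uint8, convert/resize on GPU)
# rather than tuning the copy itself.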

Correctness Verification

Numerical equivalence

Kernel fusion and operation reordering can produce different floating-point results due to:

  • Different reduction order (sum, max, argmax)
  • Different intermediate precision (FP32 vs FP16 intermediates)
  • Fused multiply-add vs separate multiply then add

Always verify output equivalence after kernel fusion:

# Run both paths on the same input (adapt function names to the project)
original_output = original_postprocess(logits, boxes)
fused_output = fused_postprocess(logits, boxes)

# Check equivalence within tolerance
assert torch.allclose(original_output, fused_output, atol=1e-5, rtol=1e-4), \
    f"Max diff: {(original_output - fused_output).abs().max()}"

CUDA graph compatibility

If the project uses CUDA graph capture:

  • Safe: Cached tensors, pinned buffers, pre-allocated outputs — these are created outside capture
  • Unsafe: Event-based cross-stream sync — events recorded outside the graph are ignored during replay
  • Depends: torch.compile — the compiled graph may or may not be capturable
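
A quick capturability check for a changed stage, assuming it is exposed as a callable over static GPU tensors; a minimal sketch using torch.cuda.CUDAGraph:

import torch

def check_graph_capturable(fn, *static_inputs):
    """Return True if fn captures and replays under a CUDA graph."""
    # Warm up on a side stream, as required before capture
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            fn(*static_inputs)
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    try:
        with torch.cuda.graph(g):
            fn(*static_inputs)
        g.replay()
        torch.cuda.synchronize()
        return True
    except RuntimeError as e:
        print(f"Not capturable: {e}")
        return False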

Logging Format

Tab-separated .codeflash/results.tsv:

commit	target_test	stage	gpu_baseline_us	gpu_optimized_us	gpu_speedup	e2e_baseline_ms	e2e_optimized_ms	e2e_speedup	tests_passed	tests_failed	status	pattern	description
  • stage: pipeline stage (e.g., preprocess, postprocess, sync)
  • gpu_baseline_us / gpu_optimized_us: stage-level timing in microseconds
  • e2e_baseline_ms / e2e_optimized_ms: end-to-end timing in milliseconds
  • pattern: antipattern tag (e.g., loop-invariant-tensor, kernel-fusion, pageable-h2d, monolithic-preprocess, monolithic-postprocess)
  • status: keep, discard, or crash
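
An illustrative row (all values and formats made up):

a1b2c3d	test_e2e_inference	postprocess	1021	412	2.48	12.4	11.6	1.07	42	0	keep	kernel-fusion	fused postprocess dispatch gaps via torch.compile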