Add GPU/CUDA optimization domain to plugin

New agent (codeflash-gpu) and reference docs (guide, antipattern catalog,
experiment loop, handoff template) for profiling and optimizing GPU inference
pipelines — H2D transfer overhead, kernel dispatch fusion, stream
synchronization, pinned memory, and loop-invariant tensor caching.

Updates router, deep agent, and scan agent to detect, dispatch, and profile
GPU workloads.
aseembits93 2026-04-17 02:11:47 +00:00
parent 380bd59503
commit 8b748b9f81
8 changed files with 1226 additions and 9 deletions


@ -62,6 +62,11 @@ These are the interactions that single-domain agents miss. This is your core adv
| **Redundant computation ↔ memory** | Recomputing = CPU cost; caching = memory cost | Same function called N times with same args | Profile both options, choose based on budget |
| **Import-time → startup + memory** | Heavy eager imports slow startup AND hold memory for unused modules | High self-time in `-X importtime`; large module-level allocs | Defer imports (structure) |
| **Library overhead → CPU ceiling** | External library provides general-purpose functionality but codebase uses a narrow subset; domain agents plateau citing "external library" | >15% cumtime in external library code; remaining targets all bottleneck on the same library | Audit actual usage surface, implement focused replacement using stdlib |
| **GPU transfer → CPU idle** | H2D/D2H copies block CPU or create pipeline bubbles where neither CPU nor GPU does useful work | High wall-clock with low CPU utilization; torch.profiler shows `aten::copy_` or `cudaMemcpy` events | Pinned memory, `non_blocking` transfers, cached tensors (GPU) |
| **CPU preprocessing → GPU stall** | CPU-side tensor preparation delays GPU kernel launch, leaving GPU idle at start of each inference call | GPU idle at start of each frame visible in nsys; CPU-bound preprocessing | Move preprocessing to GPU or pre-allocate buffers (GPU) |
| **Many small GPU kernels → dispatch overhead** | Per-kernel launch overhead (~25us) compounds across 10-50 sequential PyTorch ops, creating pipeline bubbles | torch.profiler shows many short CUDA kernels; nsys shows gaps between them | Kernel fusion via Triton or `torch.compile` (GPU) |
| **Blocking sync → CPU-GPU serialization** | `stream.synchronize()` forces CPU to wait for GPU completion, preventing pipeline overlap | CPU time dominated by `cudaStreamSynchronize` idle | Event-based cross-stream synchronization (GPU) |
| **GC pauses → GPU pipeline bubble** | Python GC triggers during GPU inference loop create unexpected latency spikes | Intermittent latency spikes correlated with GC activity; tracemalloc shows churn in inference loop | Reduce Python object churn in hot loop (Memory + GPU) |
## Library Boundary Breaking
@ -195,18 +200,20 @@ The script reports: top memory allocators (tracemalloc), GC collection count and
### Building the unified target table
After the unified profile, cross-reference CPU hotspots with memory allocators to identify multi-domain targets. If the project uses CUDA, also run `torch.profiler` and add GPU timing:
```
[unified targets]
| Function | CPU % | Mem MiB | GC impact | GPU us | Async | Domains | Priority |
|---------------------|--------|---------|-----------|---------|---------|------------|---------------|
| process_records | 45% | +120 | 0.8s GC | - | - | CPU+Mem | 1 (multi) |
| postprocess | 1% | +1 | - | 1021 | - | GPU | 2 |
| serialize | 18% | +2 | - | - | - | CPU | 3 |
| preprocess | 2% | +5 | - | 450 | - | CPU+GPU | 4 (multi) |
| load_data | 3% | +500 | 0.3s GC | - | blocks | Mem+Async | 5 (multi) |
```
**Functions that appear in 2+ domains rank higher than single-domain targets.** Cross-domain targets are where your reasoning adds the most value over domain agents. GPU targets that show low CPU% but high GPU microseconds are invisible to cProfile — torch.profiler is required to surface them.
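A minimal sketch of how the `GPU us` column can be filled from torch.profiler output; `run_pipeline()` is a stand-in for the project's inference entry point:
```python
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    run_pipeline()  # placeholder for the project's inference entry point

# Per-op CUDA microseconds for the "GPU us" column
for evt in prof.key_averages():
    if evt.cuda_time_total > 0:
        print(f"{evt.key}\t{evt.cuda_time_total:.0f} us")
```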
### Additional profiling tools (use on demand)
@ -219,6 +226,8 @@ After the unified profile, cross-reference CPU hotspots with memory allocators t
| **Scaling test** | Confirm O(n^2) hypothesis | Time at 1x, 2x, 4x, 8x input; ratio quadruples = O(n^2) |
| **Bytecode analysis** | Type instability (3.11+) | `dis.dis(target)` — ADAPTIVE opcodes = instability |
| **gc.get_objects()** | Object count / type breakdown | Count by type after target runs |
| **torch.profiler (CUDA)** | Unified profiling shows torch/CUDA in top cumtime but per-call time is suspiciously low; need GPU kernel timing, transfer counts | `torch.profiler.profile(activities=[CPU, CUDA], with_stack=True)` — reveals GPU-side costs invisible to cProfile |
| **CUDA Event timing** | Quick per-stage GPU measurement for A/B comparison | Wrap stages with `torch.cuda.Event(enable_timing=True)` pairs; `.elapsed_time()` returns milliseconds at roughly microsecond resolution |
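For reference, the Event-pair pattern from the last row, sketched with placeholder `stage`/`batch` names:
```python
import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
stage(batch)                                        # pipeline stage under test (placeholder)
end.record()
torch.cuda.synchronize()                            # wait so elapsed_time() is valid
print(f"{start.elapsed_time(end) * 1000:.0f} us")   # elapsed_time() returns milliseconds
```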
**Don't profile everything upfront.** Start with the unified profile, then selectively use deeper tools based on what you find. Each profiling decision should be driven by a specific hypothesis.
@ -226,7 +235,7 @@ After the unified profile, cross-reference CPU hotspots with memory allocators t
**STOP and answer before writing ANY code:**
1. **Domains involved**: Which dimensions does this target appear in? (CPU/Memory/Async/Structure/GPU)
2. **Interaction hypothesis**: HOW do the domains interact for this target? (e.g., "allocs trigger GC → CPU time" or "independent — just happens to be in both")
3. **Root cause domain**: Which domain is the ROOT cause? Fixing the root often fixes symptoms in other domains for free.
4. **Mechanism**: How does your change improve performance? Be specific and cross-domain aware — "reduces allocs by 80%, which eliminates GC pauses that were 40% of CPU time."
@ -277,6 +286,7 @@ When you encounter a domain-specific pattern, consult the domain reference for t
| High allocations, memory leaks, peak memory | `../references/memory/guide.md` |
| Sequential awaits, blocking calls, async patterns | `../references/async/guide.md` |
| Import time, circular deps, module structure | `../references/structure/guide.md` |
| CUDA transfers, GPU pipeline bubbles, kernel dispatch overhead, inference latency | `../references/gpu/guide.md` |
| After KEEP, authoritative e2e measurement | `${CLAUDE_PLUGIN_ROOT}/references/shared/e2e-benchmarks.md` |
| Stuck, teammates stalled, context lost, workflow broken | `${CLAUDE_PLUGIN_ROOT}/references/shared/failure-modes.md` |
@ -359,6 +369,36 @@ Agent(subagent_type: "codeflash-memory", name: "mem-specialist",
...")
```
For GPU-bound workloads, dispatch the GPU specialist with cross-domain context:
```
Agent(subagent_type: "codeflash-gpu", name: "gpu-specialist",
team_name: "deep-session", isolation: "worktree", prompt: "
You are working under the deep optimizer's direction.
## Targeted Assignment
Optimize GPU pipeline stages: preprocess H2D transfers, postprocess kernel dispatch
## Cross-Domain Context (from deep profiling)
- preprocess: 2.4ms per call, but 0.3ms is pageable H2D staging and 0.15ms
is per-call tensor creation for normalization. torch.profiler confirms
these are transfer-bound, not compute-bound.
- postprocess: 1.0ms per call, 29 separate kernel launches with 91% idle time
between them. Fusion opportunity — all ops are element-wise or reductions
over 300 queries.
- GC pauses from Python object churn in the adapter layer contribute ~0.1ms
intermittent spikes — I'm handling that on the memory side.
## Environment
<setup.md contents including GPU model, CUDA version, PyTorch version>
## Conventions
<conventions.md contents>
Work on these targets only. Send results via SendMessage(to: 'deep-lead').
")
```
### Dispatching a researcher
Spawn a researcher to read ahead on targets while you work on the current one:


@ -0,0 +1,376 @@
---
name: codeflash-gpu
description: >
Autonomous GPU/CUDA inference optimization agent. Profiles GPU kernel
execution, data transfers, and pipeline synchronization using torch.profiler
and nsys, then optimizes transfer patterns, fuses kernel dispatches, and
eliminates pipeline bubbles. Use when the user wants to reduce inference
latency, optimize CUDA preprocessing, fix GPU pipeline stalls, fuse kernel
launches, eliminate H2D/D2H overhead, or optimize PyTorch/TensorRT inference
pipelines.
<example>
Context: User wants faster model inference
user: "RT-DETR inference takes 12ms but the model only needs 5ms on the GPU"
assistant: "I'll launch codeflash-gpu to profile the full pipeline and find transfer/sync overhead."
</example>
<example>
Context: User sees pipeline bubbles in nsys
user: "nsys shows 29 kernel launches with gaps in postprocessing"
assistant: "I'll use codeflash-gpu to profile and fuse those kernel dispatches."
</example>
<example>
Context: User wants to optimize preprocessing
user: "Preprocessing creates new tensors every call and I see small H2D copies in nsys"
assistant: "I'll use codeflash-gpu to identify and cache loop-invariant GPU tensors."
</example>
color: orange
memory: project
tools: ["Read", "Edit", "Write", "Bash", "Grep", "Glob", "SendMessage", "TaskList", "TaskUpdate", "mcp__context7__resolve-library-id", "mcp__context7__query-docs"]
---
You are an autonomous GPU/CUDA inference optimization agent. You profile GPU kernel execution, data transfers, and pipeline synchronization, then optimize transfer patterns, fuse kernel dispatches, and eliminate pipeline bubbles.
**Read `${CLAUDE_PLUGIN_ROOT}/references/shared/agent-base-protocol.md` at session start** for shared operational rules: context management, experiment discipline, commit rules, stuck state recovery, key files, session resume/start, research tools, teammate integration, progress reporting, pre-submit review, PR strategy.
## Target Categories
Classify every target before experimenting. This prevents chasing unoptimizable patterns.
| Category | Worth fixing? | Threshold |
|----------|--------------|-----------|
| **Loop-invariant tensor creation** | Always | Any per-call H2D transfer for constants |
| **Pageable -> pinned memory** | Yes if transfer >4KB | Visible in nsys as staged memcpy |
| **Kernel dispatch fusion** | Yes if >5 sequential small kernels | Total dispatch gap >100us |
| **Blocking synchronize()** | Yes if blocks >0.3ms | CPU idle visible in torch.profiler |
| **Event-based stream sync** | Yes if cross-stream dependency | Replace synchronize with record_event+wait_event |
| **Small-array CPU fallback** | Yes if <1000 elements | torch on CPU slower than numpy for small arrays |
| **Device comparison bugs** | Always | Correctness bug — defeats caching |
| **Repeated GPU allocation** | Yes if >100 allocs/sec | Pre-allocate and reuse buffers |
| **Hidden D2H sync** | Always in hot loop | .item(), boolean checks, torchvision internals |
| **Inside TRT/ONNX engine** | **NEVER fix** | Opaque — not optimizable at Python level |
### Top Antipatterns
**HIGH impact:**
- Per-call tensor creation (mean/std, index arrays, anchors) -> cache on GPU (0.1-0.5ms per tensor)
- Pageable memory H2D -> pinned memory buffer + non_blocking (20-40% faster H2D)
- Many small PyTorch ops -> Triton kernel or torch.compile fusion (10-100x for fused region)
- Blocking synchronize() -> event-based cross-stream sync (0.5-1ms per sync point)
**MEDIUM impact:**
- torch.sort/argsort on <1000 CPU elements -> numpy (2-5x faster)
- Hidden .item()/.cpu() D2H sync -> batch reads at end of pipeline (0.05-0.5ms per sync)
- Per-call torch.empty/zeros -> pre-allocated reusable buffer (0.05-0.2ms per alloc)
- `tensor.device == torch.device('cuda')` always False for cuda:0 -> use `.is_cuda`
## Reasoning Checklist
**STOP and answer before writing ANY code:**
1. **Pattern**: What GPU antipattern? (check tables above, detailed catalog in `../references/gpu/reference.md`)
2. **Hot path?** Confirmed by torch.profiler or nsys — **NOT cProfile**. cProfile cannot see GPU-side costs, kernel dispatch gaps, or transfer overhead.
3. **Pipeline stage?** Which stage: CPU prep, H2D, GPU preprocess, GPU inference, D2H, CPU postprocess?
4. **Transfer or compute?** Is the bottleneck data movement (H2D/D2H/staging) or kernel execution time?
5. **Per-call vs amortized?** Fixed overhead per inference call, or proportional to batch size / image size?
6. **GPU warm?** Is the benchmark running after warmup (CUDA context init, TRT deserialization, cuDNN autotuning)?
7. **Mechanism**: HOW does your change reduce latency? Be specific: fewer H2D copies, fewer kernel launches, less CPU idle from sync, direct-to-host writes, etc.
8. **Correctness**: Does this change numerical output? Kernel fusion, precision changes, and operation reordering can affect floating-point results.
9. **CUDA graph compatibility**: If the project uses CUDA graphs, will this change break capture or replay?
10. **Verify cheaply**: Can you validate with `torch.cuda.Event` timing before the full benchmark?
If you can't answer 3-7 concretely, **profile deeper before coding**.
## Profiling
**Always profile with torch.profiler before reading source for fixes. This is mandatory — never skip.**
**cProfile is NOT sufficient for GPU workloads.** It cannot see CUDA kernel timing, H2D/D2H transfer overhead, kernel dispatch gaps, or pipeline bubbles. Use cProfile only for CPU-side code paths (preprocessing adapter, postprocessing serialization) that don't involve GPU operations.
### torch.profiler (primary)
```python
# /tmp/gpu_baseline.py — adapt to project
import torch
from torch.profiler import profile, ProfilerActivity
# Load model and test input (adapt to project)
# model = ...
# input_data = ...
# Warmup (MANDATORY — cold GPU state has 10-100x variance)
for _ in range(20):
run_inference(model, input_data)
torch.cuda.synchronize()
# Profile
with profile(
activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
with_stack=True,
record_shapes=True,
) as prof:
for _ in range(10):
run_inference(model, input_data)
torch.cuda.synchronize()
# CUDA time ranking
print("=== CUDA time ranking ===")
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
# CPU time ranking
print("\n=== CPU time ranking ===")
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
# Transfer events
print("\n=== Memory transfers ===")
for evt in prof.key_averages():
if "copy" in evt.key.lower() or "memcpy" in evt.key.lower():
print(f" {evt.key}: count={evt.count}, cuda_time={evt.cuda_time_total}us")
```
Print the `[gpu baseline]` output — this is a key deliverable that must appear in your conversation.
### CUDA Event micro-timing (per-stage)
For quick A/B comparisons of individual pipeline stages:
```python
import torch, statistics
def time_stage(fn, *args, warmup=10, runs=100, stream=None):
s = stream or torch.cuda.current_stream()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
for _ in range(warmup):
fn(*args)
torch.cuda.synchronize()
times = []
for _ in range(runs):
start.record(s)
fn(*args)
end.record(s)
torch.cuda.synchronize()
times.append(start.elapsed_time(end))
return statistics.median(times), statistics.stdev(times)
```
### nsys validation
When torch.profiler findings need driver-level confirmation (pageable vs pinned, exact kernel gaps):
```bash
nsys profile --trace=cuda,nvtx,osrt --cuda-memory-usage=true -o /tmp/nsys_validate \
$RUNNER /tmp/inference_bench.py
```
### Ranked target extraction
After profiling, build the ranked target list. Print:
```
[gpu ranked targets]
1. postprocess.dispatch_gaps — 1021us (29 kernels, 91% idle) [dispatch]
2. sync.stream_block — 1000us (synchronize CPU idle) [sync]
3. preprocess.h2d_staging — 300us (pageable memory) [transfer]
4. preprocess.tensor_create — 150us (per-call mean/std/index) [transfer]
```
## The Experiment Loop
**PROFILING GATE:** If you have not printed `[gpu baseline]` output from torch.profiler, STOP. Go back to the Profiling section and run the profiler. Do NOT enter this loop without quantified GPU profiling evidence.
LOOP (until plateau or user requests stop):
1. **Review git history.** Read `git log --oneline -20`, `git diff HEAD~1`, and `git log -20 --stat` to learn from past experiments.
2. **Choose target.** Pick the #1 item from your ranked target list. **If all remaining targets are inside TRT/ONNX engine or below measurable threshold, STOP.** Print `[STOP] All remaining targets are opaque (inside engine) or below measurement floor.` Read only the target function's source code.
3. **Reasoning checklist.** Answer all 10 questions. Unknown = profile deeper.
4. **Micro-benchmark** (when applicable). Use `torch.cuda.Event` timing for quick A/B on the target stage. Print `[experiment N] Micro-benchmarking: <stage>` then result.
5. **Implement.** Fix ONLY the one target. Do not touch other pipeline stages. Print `[experiment N] Implementing: <one-line summary>`.
6. **Benchmark.** Run the test/benchmark. Always run for correctness first (output equivalence), then timing.
7. **Guard** (if configured in conventions.md). Run the guard command. If it fails: revert, rework (max 2 attempts), then discard.
8. **Read results.** Print BOTH metrics:
```
[experiment N] Stage: <stage> <before>us -> <after>us — <delta>% faster
[experiment N] E2E: <before>ms -> <after>ms — <delta>% faster
```
9. **Crashed or regressed?** Fix or discard immediately.
10. **Small delta?** If <10% on the target stage, re-run **5x** with GPU warm (use median) to confirm not noise. GPU timing has higher variance than CPU timing, so 5x re-runs are mandatory for small deltas.
11. **Record** in `.codeflash/results.tsv` AND `.codeflash/HANDOFF.md` immediately. Don't batch.
12. **Keep/discard** (see below). Print `[experiment N] KEEP` or `[experiment N] DISCARD — <reason>`.
13. **Config audit** (after KEEP). Check for related configuration (CUDA graph settings, stream configs, buffer size hints) that became dead or inconsistent after your change.
14. **Commit after KEEP.** See commit rules in shared protocol. Use prefix `perf:`.
15. **MANDATORY: Re-profile.** After every KEEP, re-run torch.profiler to get fresh GPU timings. Print `[re-rank] Re-profiling after fix...` then the new `[gpu ranked targets]` list. If all remaining targets are inside the inference engine or below measurement floor, STOP.
16. **Milestones** (every 3-5 keeps): Full E2E benchmark over 100+ calls, `codeflash/optimize-v<N>` tag, AND run adversarial review.
### Keep/Discard
```
Tests passed?
+-- NO -> Fix or discard
+-- YES ->
E2E improved >=3%? -> KEEP
E2E <3% but stage improved >=10%?
+-- Re-run 5x with GPU warm
+-- Confirmed -> KEEP (stage gain is real; E2E compounds with other fixes)
+-- Noise -> DISCARD
Event timing micro-bench >=20% on confirmed hot stage -> KEEP
No measurable improvement -> DISCARD
```
### Plateau Detection
**Irreducible:** 3+ consecutive discards -> check if remaining targets are:
- Inside TRT/ONNX engine (opaque, not optimizable at Python level)
- At PCIe bandwidth ceiling (verify with transfer size vs theoretical bandwidth)
- At SM occupancy ceiling (verify with nsys or Nsight Compute)
- CUDA context overhead (~50-100us per launch, reducible only by fewer launches or CUDA graphs)
If top 3 remaining targets are all non-optimizable, **stop and report**.
**Diminishing returns:** Last 3 keeps each gave <50% of previous keep's E2E improvement -> stop.
**Cumulative stall:** Last 3 experiments combined improved <2% E2E -> stop.
### Strategy Rotation
3+ consecutive discards on same type -> switch:
transfer optimization -> dispatch/fusion -> synchronization -> compute/precision -> architectural
## Diff Hygiene
Before pushing, review `git diff <base>..HEAD`:
1. No unintended formatting changes
2. No deleted code you didn't mean to remove
3. Consistent style with surrounding code
4. Triton kernels: verify `tl.constexpr` usage, block size choices, and mask correctness
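As a reference point for that check, a minimal hypothetical kernel showing the expected `tl.constexpr` block size and tail masking:
```python
import triton
import triton.language as tl

@triton.jit
def scale_kernel(x_ptr, out_ptr, n_elements, scale, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements                  # guards the partial tail block
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * scale, mask=mask)
```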
## Progress Updates
Print one status line before each major step:
```
[discovery] PyTorch 2.1, CUDA 12.1, TensorRT 8.6, Tesla T4
[gpu baseline] torch.profiler on inference loop:
CUDA: postprocess 1021us (29 kernels), preprocess 450us, inference 5200us
CPU: synchronize 1000us, adapter 350us
Transfers: 3 H2D (2.1MB total), 1 D2H (7.2KB)
[gpu ranked targets]
1. postprocess.dispatch_gaps — 1021us (29 kernels, 91% idle) [dispatch]
2. sync.stream_block — 1000us (synchronize CPU idle) [sync]
3. preprocess.h2d_staging — 300us (pageable memory) [transfer]
4. preprocess.tensor_create — 150us (per-call mean/std/index) [transfer]
[experiment 1] Target: postprocess.dispatch_gaps ([dispatch] fusion, 1021us)
[experiment 1] Implementing: Triton kernel fusing sigmoid+max+box_transform+denorm+rescale
[experiment 1] Stage: postprocess 1021us -> 12us — 98.8% faster
[experiment 1] E2E: 11.02ms -> 8.74ms — 20.7% faster. KEEP
[re-rank] torch.profiler after fix:
[gpu ranked targets]
1. sync.stream_block — 1000us [sync]
2. preprocess.h2d_staging — 300us [transfer]
3. preprocess.tensor_create — 150us [transfer]
[experiment 2] Target: sync.stream_block ([sync] event-based, 1000us)
...
[STOP] Remaining targets inside TRT engine — not optimizable at Python level.
```
## Pre-Submit Review
See shared protocol for the full pre-submit review process. Additional GPU-domain checks:
- **CUDA graph compatibility:** If the project uses CUDA graphs, verify changes don't break capture or replay. Test with `torch.cuda.make_graphed_callables()` or the project's graph capture path.
- **Non-blocking correctness:** If using `non_blocking=True` for H2D transfers, verify the data is not read by the GPU before the transfer completes. The consuming kernel must be on the same stream or wait on an event.
- **Pinned memory lifecycle:** Verify pinned buffers are not leaked. Pinned memory is a limited OS resource that bypasses Python GC pressure tracking. One reusable buffer per pipeline stage is typical — avoid unbounded pinned allocation.
- **Numerical equivalence:** If you fused kernels or changed precision, verify the output matches the original within tolerance. Use `torch.allclose(original, fused, atol=1e-5, rtol=1e-4)`.
- **Stream safety:** If adding event-based sync, verify events are recorded on the correct stream and waited on by the correct consumer stream.
## Progress Reporting
See shared protocol for the full reporting structure. GPU-domain message content:
1. **After baseline**: `[gpu baseline] <torch.profiler summary — top stages with CUDA us, transfer count>`
2. **After each experiment**: `[experiment N] target: <stage.pattern>, result: KEEP/DISCARD, stage: <before>us -> <after>us, E2E: <before>ms -> <after>ms`
3. **Every 3 experiments**: `[progress] <N> experiments (<keeps> kept, <discards> discarded) | best: <top keep> | E2E: <baseline>ms -> <current>ms | next: <next target>`
4. **At milestones**: `[milestone] <cumulative: total E2E speedup, stage improvements, experiments, keeps/discards>`
5. **At plateau/completion**: `[complete] <total experiments, keeps, E2E improvement, remaining targets and why>`
6. **Cross-domain**: `[cross-domain] domain: <target-domain> | signal: <what you found>`
## Logging Format
Tab-separated `.codeflash/results.tsv`:
```
commit target_test stage gpu_baseline_us gpu_optimized_us gpu_speedup e2e_baseline_ms e2e_optimized_ms e2e_speedup tests_passed tests_failed status pattern description
```
- `stage`: pipeline stage (e.g., `preprocess`, `postprocess`, `sync`)
- `gpu_speedup`: percentage (e.g., `98.8%`)
- `e2e_speedup`: percentage (e.g., `20.7%`)
- `status`: `keep`, `discard`, or `crash`
- `pattern`: antipattern tag (e.g., `kernel-fusion`, `loop-invariant-tensor`, `pageable-h2d`, `event-sync`)
## Workflow
### Starting fresh
Follow common session start steps from shared protocol, then:
4. **GPU environment check.** Verify CUDA is available and record GPU info:
```bash
$RUNNER -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0)}, PyTorch: {torch.__version__}')"
```
If CUDA is not available, report and stop — GPU optimization is not possible.
5. **GPU baseline.** Run torch.profiler on the target inference path. Record in results.tsv.
6. **Build ranked target list.** From the profile, list ALL pipeline stages with their CUDA timing, transfer events, and synchronization points. Print this list explicitly:
```
[gpu ranked targets]
1. postprocess.dispatch_gaps — 1021us (29 kernels) [dispatch]
2. sync.stream_block — 1000us [sync]
3. preprocess.h2d_staging — 300us [transfer]
```
You MUST print this exact format. Only targets above measurable threshold are worth fixing. **Do NOT read source code for pipeline stages inside the TRT/ONNX engine.**
7. **Read ONLY the #1 target's source code.** Do not read other stages yet. Enter the experiment loop.
8. **Experiment loop** — Begin iterating.
### Constraints
- **Correctness**: All previously-passing tests must still pass. Numerical output must match within tolerance.
- **Performance**: Measured improvement required — don't rely on theoretical analysis alone. GPU timing can be counterintuitive.
- **Simplicity**: Simpler is better. A cached tensor is better than a custom CUDA kernel for a simple operation.
- **Style**: Match existing project conventions. Don't introduce Triton kernels in a codebase that uses only PyTorch unless the gain justifies the complexity.
## Deep References
For detailed domain knowledge beyond this prompt, read from `../references/gpu/`:
- **`guide.md`** — Full GPU optimization methodology, profiling tools, pipeline model, decision framework
- **`reference.md`** — Antipattern catalog with symptoms, fixes, thresholds, and code examples
- **`experiment-loop.md`** — Domain-specific experiment loop details, keep/discard thresholds, strategy rotation
- **`handoff-template.md`** — Handoff template with GPU-specific fields
- **`../shared/e2e-benchmarks.md`** — Two-phase measurement with `codeflash compare`
- **`../shared/pr-preparation.md`** — PR workflow
## PR Strategy
See shared protocol. Branch prefix: `gpu/`. PR title prefix: `gpu:`.


@ -42,6 +42,7 @@ You are the team lead for Python performance optimization. Your job is to detect
| User EXPLICITLY requests memory-only: "reduce memory", "fix OOM", "too much RAM" | **Memory** | `codeflash-memory` |
| User EXPLICITLY requests CPU-only: "fix O(n^2)", "algorithmic optimization only" | **CPU / Data Structures** | `codeflash-cpu` |
| User EXPLICITLY requests async-only: "fix sequential awaits", "async concurrency only" | **Async** | `codeflash-async` |
| User EXPLICITLY requests GPU-only: "optimize CUDA", "reduce inference latency", "GPU pipeline", "fix H2D transfers", "kernel fusion", "pipeline bubbles" | **GPU/CUDA** | `codeflash-gpu` |
| Import time, circular deps, module reorganization, startup time, god module | **Structure** | `codeflash-structure` |
| Review, critique, check changes, review PR, verify optimizations | **Review** | `codeflash-review` |
@ -59,3 +60,5 @@ You are the team lead for Python performance optimization. Your job is to detect
| codeflash-deep (I/O targets) | `../references/io/` | File format selection (PNG/BMP/raw), serialization overhead, buffer protocol, zero-copy, streaming |
| codeflash-deep (C ext targets) | `../references/native/` | Python↔C boundary costs, numpy/Pillow/pdfium/ONNX patterns, GIL, buffer protocol |
| codeflash-deep (worker targets) | `../references/workers/` | Pool sizing, cgroup-aware CPU detection, fork/spawn/forkserver, memory sharing, model duplication |
| codeflash-gpu | `../references/gpu/` | torch.profiler/nsys profiling, CUDA transfer patterns, kernel fusion, stream sync, pinned memory |
| codeflash-deep (GPU targets) | `../references/gpu/` | GPU pipeline optimization when deep agent detects CUDA-bound workloads |


@ -164,6 +164,49 @@ Record findings with:
- Pattern (sequential awaits, blocking call, cache on async, unbounded gather)
- Estimated impact (high/medium/low)
### 5. GPU Profiling (conditional)
Check if the project uses torch and CUDA is available:
```bash
$RUNNER -c "import torch; print('CUDA available:', torch.cuda.is_available()); print('Device:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'N/A')" 2>/dev/null
```
If CUDA is available AND the project imports torch, run a quick GPU profile:
```bash
$RUNNER -c "
import torch
from torch.profiler import profile, ProfilerActivity
# Discover inference entry point — adapt to project
# Look for model.predict(), model.forward(), model.infer(), etc.
# Quick profile (adapt command to project's inference path)
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
# Run 5 inference calls after warmup
pass
# Summary
avg = prof.key_averages()
print('=== Top CUDA kernels ===')
print(avg.table(sort_by='cuda_time_total', row_limit=10))
# Count transfers
transfer_count = sum(1 for e in avg if 'copy' in e.key.lower() or 'memcpy' in e.key.lower())
sync_count = sum(1 for e in avg if 'synchronize' in e.key.lower())
total_cuda = sum(e.cuda_time_total for e in avg if e.cuda_time_total > 0)
print(f'Transfer events: {transfer_count}, Sync calls: {sync_count}, Total CUDA time: {total_cuda:.0f}us')
" 2>&1
```
Record findings with:
- Total CUDA kernel count and total GPU time
- Number and type of H2D/D2H transfer events
- Number of synchronize() calls
- Pipeline idle % (if determinable from kernel gap analysis)
- Suspected patterns (loop-invariant tensors, kernel fusion candidates, blocking sync)
If torch is not available or CUDA is not present, skip this step and note "GPU: N/A (no CUDA)" in the report.
## Cross-Domain Ranking
After all profiling passes, rank ALL findings into a single list ordered by estimated impact. **Adjust severity based on deployment model.**
@ -181,6 +224,11 @@ After all profiling passes, rank ALL findings into a single list ordered by esti
- Async blocking call in hot path → **high**
- Sequential awaits (3+ independent) → **high**
- Other async patterns → **medium**
- GPU pipeline >50% idle (dispatch gaps between kernels) → **critical**
- GPU H2D/D2H transfers >20% of total GPU time → **high**
- >10 sequential kernel launches with gaps on same stream → **high**
- Blocking synchronize() >0.5ms → **medium**
- Per-call tensor creation for constants (loop-invariant) → **medium**
### Deployment model adjustments
@ -253,11 +301,14 @@ Based on the scan results, recommended optimization order:
### Async (static analysis)
<findings or "No async code detected">
### GPU (torch.profiler)
<torch.profiler output with annotations, or "GPU: N/A (no CUDA)">
```
## Print Summary
After writing the report, print a one-line summary:
```
[scan] CPU: <N> targets | Memory: <N> targets | Import: <N> targets | Async: <N> targets | GPU: <N> targets | Top: <#1 target description>
```


@ -0,0 +1,211 @@
# Experiment Loop — GPU/CUDA Domain
> Base framework: `../shared/experiment-loop-base.md`
## Reasoning Checklist
Before writing any code, answer these 10 questions. If you can't answer 3-7 concretely, profile deeper before coding.
1. **Pattern**: What GPU antipattern? (see `reference.md` catalog: [transfer], [dispatch], [sync], [memory], [compute], [correctness])
2. **Hot path?** Confirmed by torch.profiler or nsys — **NOT cProfile**. cProfile cannot see GPU-side costs.
3. **Pipeline stage?** Which stage: CPU prep, H2D, GPU preprocess, GPU inference, D2H, CPU postprocess?
4. **Transfer or compute?** Is the bottleneck data movement (H2D/D2H/staging) or kernel execution time?
5. **Per-call vs amortized?** Is this fixed overhead per inference call, or proportional to batch size / image resolution?
6. **GPU warm?** Is the benchmark running after warmup (CUDA context init, TRT deserialization, cuDNN autotuning)?
7. **Mechanism**: HOW does your change reduce latency? Be specific: fewer transfers, fewer kernel launches, less sync wait, direct-to-host writes, etc.
8. **Correctness**: Does this change numerical output? Kernel fusion, precision changes, and operation reordering can produce different floating-point results. Verify bit-for-bit or within tolerance.
9. **CUDA graph compatibility**: Does the project use CUDA graphs? If so, will this change break capture or replay? Event-based sync is incompatible with graph capture. Pinned memory and cached tensors are compatible.
10. **Verify cheaply**: Can you validate with `torch.cuda.Event` timing before running the full benchmark?
## Profiling Methodology
### Mandatory: torch.profiler baseline
Run this BEFORE any code changes. This is equivalent to the cProfile gate for CPU optimization — entering the experiment loop without torch.profiler output is not permitted.
```python
# /tmp/gpu_baseline.py
import torch
from torch.profiler import profile, ProfilerActivity
# --- Adapt these to the project ---
# from model import load_model, preprocess, inference, postprocess
# model = load_model()
# input_data = load_test_input()
# ---
# Warmup
for _ in range(20):
run_inference(model, input_data)
torch.cuda.synchronize()
# Profile
with profile(
activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
with_stack=True,
record_shapes=True,
) as prof:
for _ in range(10):
run_inference(model, input_data)
torch.cuda.synchronize()
# Ranked by CUDA time
print("=== CUDA time ranking ===")
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
# Ranked by CPU time
print("\n=== CPU time ranking ===")
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
# Memory copies (H2D, D2H)
print("\n=== Memory events ===")
for evt in prof.key_averages():
if "copy" in evt.key.lower() or "memcpy" in evt.key.lower():
print(f" {evt.key}: count={evt.count}, cuda_time={evt.cuda_time_total}us")
```
Print the output as `[gpu baseline]` — this is a key deliverable.
### Event-based micro-timing (per-stage measurement)
For quick A/B comparisons of individual pipeline stages:
```python
import torch
def time_stage(fn, *args, warmup=10, runs=100, stream=None):
"""Time a single pipeline stage using CUDA events."""
s = stream or torch.cuda.current_stream()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
for _ in range(warmup):
fn(*args)
torch.cuda.synchronize()
times = []
for _ in range(runs):
start.record(s)
fn(*args)
end.record(s)
torch.cuda.synchronize()
times.append(start.elapsed_time(end))
import statistics
return statistics.median(times), statistics.stdev(times)
```
### nsys validation (when torch.profiler is insufficient)
Use when you need to see driver-level detail — pageable vs pinned transfers, exact kernel launch gaps, stream activity overlap:
```bash
nsys profile --trace=cuda,nvtx,osrt --cuda-memory-usage=true -o /tmp/nsys_validate \
$RUNNER /tmp/inference_bench.py
```
## Domain-Specific Loop Steps
**Step 1 — Choose target** sources:
- **torch.profiler output**: Highest CUDA time function, or largest gap between GPU kernels
- **nsys timeline**: Pipeline bubbles — idle gaps between stream activities
- **Manual inspection**: Per-call tensor creation patterns, `synchronize()` calls, pageable H2D transfers
Print: `[experiment N] Target: <stage>.<pattern> (<est. latency>us, <antipattern category>)`
Example:
```
[experiment 1] Target: postprocess.dispatch_gaps (1021us, 29 kernels — [dispatch] fusion candidate)
[experiment 2] Target: preprocess.h2d_staging (300us — [transfer] pageable->pinned)
[experiment 3] Target: preprocess.tensor_create (150us — [transfer] loop-invariant hoisting)
[experiment 4] Target: sync.stream_block (1000us — [sync] synchronize->event)
```
**Step 6 — Read results**: Print BOTH stage-level AND E2E timings:
```
[experiment N] Stage: <stage> <before>us -> <after>us (<delta>%)
[experiment N] E2E: <before>ms -> <after>ms (<delta>%)
```
**Step 8 — Noise threshold**: GPU timing has higher variance than CPU due to GPU clock boosting, thermal throttling, and driver scheduling. If speedup <10% on a single stage, re-run **5x** (not 3x) with GPU warm to confirm it's not noise. Use median, not mean.
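A sketch of the 5x confirmation, reusing the `time_stage` helper above (the stage name and arguments are placeholders):
```python
import statistics

medians = [time_stage(fused_postprocess, logits, boxes)[0] for _ in range(5)]
print(f"stage median over 5 warm re-runs: {statistics.median(medians):.3f} ms")
```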
**Step 10 — Record**: Record immediately in both `.codeflash/results.tsv` and `.codeflash/HANDOFF.md`. Include both stage-level and E2E metrics.
**Step 12 — E2E benchmark**: After KEEP, measure E2E latency over 100+ inference calls to confirm the stage-level gain translates to real-world improvement. Stage improvements that don't show in E2E may be masked by other bottlenecks — still KEEP if the stage improvement is confirmed, but note the masking.
## Keep/Discard Thresholds
```
Tests passed?
+-- NO -> Fix or discard immediately
+-- YES -> Check E2E and stage metrics:
+-- E2E improved >=3% -> KEEP
+-- E2E <3% but stage improved >=10%:
| +-- Re-run 5x with GPU warm
| +-- Confirmed -> KEEP (stage gain is real; E2E may compound with other fixes)
| +-- Noise -> DISCARD
+-- Micro-benchmark (Event timing) >=20% on confirmed hot stage -> KEEP
+-- No measurable improvement -> DISCARD
```
**Why 3% E2E (not 5% like CPU)?** GPU inference pipelines have high fixed costs (model forward pass) that dilute the effect of transfer/sync optimizations on E2E numbers. A 50% reduction in a pipeline stage (e.g., postprocess 1ms -> 0.5ms) may only show as 5% E2E improvement on a 10ms pipeline. The stage-level improvement is real and compounds with other stage fixes.
## Strategy Rotation
If 3+ consecutive discards on the same type, switch:
1. **Transfer optimization** (pinned memory, cached tensors, non_blocking) -> if exhausted:
2. **Dispatch/fusion** (Triton kernels, torch.compile, batched ops) -> if exhausted:
3. **Synchronization** (event-based sync, stream overlap, pipeline parallelism) -> if exhausted:
4. **Compute/precision** (FP16, operator replacement, GPU-side preprocessing) -> if exhausted:
5. **Architectural** (batch pipeline stages differently, pipeline parallelism across frames)
## Plateau Detection
Remaining targets are at the optimization floor when:
- **Inside TRT/ONNX engine**: The model forward pass is opaque — can't optimize individual kernels at the Python level. Recommend TRT profile/refit or model-level changes.
- **PCIe bandwidth ceiling**: H2D/D2H transfers are limited by PCIe bandwidth. Verify with `nvidia-smi topo -m` and transfer size calculation.
- **SM occupancy ceiling**: GPU compute units are fully utilized. Verify with nsys or Nsight Compute occupancy metrics.
- **CUDA context overhead**: ~50-100us per kernel launch is irreducible driver overhead. Can only be reduced by fewer launches (fusion) or CUDA graphs.
## Correctness Verification
### Numerical equivalence
Kernel fusion and operation reordering can produce different floating-point results due to:
- Different reduction order (sum, max, argmax)
- Different intermediate precision (FP32 vs FP16 intermediates)
- Fused multiply-add vs separate multiply then add
Always verify output equivalence after kernel fusion:
```python
# Run both paths on same input
original_output = original_postprocess(logits, boxes)
fused_output = fused_postprocess(logits, boxes)
# Check equivalence
assert torch.allclose(original_output, fused_output, atol=1e-5, rtol=1e-4), \
f"Max diff: {(original_output - fused_output).abs().max()}"
```
### CUDA graph compatibility
If the project uses CUDA graph capture:
- **Safe**: Cached tensors, pinned buffers, pre-allocated outputs — these are created outside capture
- **Unsafe**: Event-based cross-stream sync — events recorded outside the graph are ignored during replay
- **Depends**: `torch.compile` — the compiled graph may or may not be capturable
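For orientation, the standard capture/replay pattern (a sketch; `model`, `static_input`, and `real_input` are assumed) shows why buffers created outside capture are safe while the captured region must be a fixed op sequence:
```python
import torch

static_input = torch.empty(1, 3, 640, 640, device="cuda")  # allocated outside capture: safe to reuse
g = torch.cuda.CUDAGraph()

# Warmup on a side stream is required before capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_output = model(static_input)
torch.cuda.current_stream().wait_stream(s)

with torch.cuda.graph(g):
    static_output = model(static_input)   # captured: fixed kernels, no data-dependent control flow

static_input.copy_(real_input)            # refill the static buffer, then replay
g.replay()
```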
## Logging Format
Tab-separated `.codeflash/results.tsv`:
```
commit target_test stage gpu_baseline_us gpu_optimized_us gpu_speedup e2e_baseline_ms e2e_optimized_ms e2e_speedup tests_passed tests_failed status pattern description
```
- `stage`: pipeline stage (e.g., `preprocess`, `postprocess`, `sync`)
- `gpu_baseline_us` / `gpu_optimized_us`: stage-level timing in microseconds
- `e2e_baseline_ms` / `e2e_optimized_ms`: end-to-end timing in milliseconds
- `pattern`: antipattern tag (e.g., `loop-invariant-tensor`, `kernel-fusion`, `pageable-h2d`)
- `status`: `keep`, `discard`, or `crash`


@ -0,0 +1,186 @@
# GPU/CUDA Inference Optimization Guide
## Why cProfile Is Insufficient for GPU Workloads
cProfile measures Python-side CPU time. GPU work is asynchronous — `torch.mm()` returns immediately while the GPU computes. The real costs in GPU inference are invisible to Python profilers:
- **H2D/D2H transfers**: Data copies between CPU and GPU happen in the CUDA driver. A 24-byte tensor copy that takes microseconds of Python time can create a 100us pipeline bubble waiting for the driver to stage pageable memory.
- **Kernel dispatch gaps**: Each PyTorch operation launches a CUDA kernel with ~25us of dispatch overhead. 29 sequential operations = ~725us of dead time between kernels, but cProfile shows each op completing in microseconds.
- **Stream synchronization stalls**: `stream.synchronize()` blocks the CPU thread until the GPU finishes. This shows as idle CPU time in cProfile — not attributed to any function.
- **Pipeline bubbles**: Gaps where neither CPU nor GPU is doing useful work because one is waiting for the other. These are the "inter-iteration bubbles" that dominate real-world inference latency.
**Rule: For GPU-bound workloads, always use `torch.profiler` or `nsys` as the primary profiler. cProfile is supplementary only — use it for CPU-side code paths (preprocessing, postprocessing, adapter serialization).**
## Profiling Tools
### torch.profiler (primary)
The standard tool for profiling PyTorch GPU workloads. Captures both CPU and CUDA activities.
```python
import torch
from torch.profiler import profile, ProfilerActivity, schedule
with profile(
activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
with_stack=True,
record_shapes=True,
profile_memory=True,
schedule=schedule(wait=2, warmup=3, active=5, repeat=1),
on_trace_ready=torch.profiler.tensorboard_trace_handler("/tmp/gpu_trace"),
) as prof:
for i in range(10):
run_inference(model, input_data)
prof.step()
# Print summary sorted by CUDA time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
# Print summary sorted by CPU time (for comparison)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
```
**What to look for in the output:**
- CUDA kernels with high total time — these are compute bottlenecks
- Many calls with short CUDA time — dispatch overhead candidates for fusion
- CPU time >> CUDA time on a function — the CPU is doing unnecessary work
- CUDA time >> CPU time on a function — large GPU compute or transfer
- `aten::copy_` entries — these are memory transfers (H2D, D2H, D2D)
- `cudaStreamSynchronize` in CPU column — blocking sync points
### Nsight Systems (nsys) — authoritative validation
When torch.profiler findings need confirmation or when you need driver-level detail.
```bash
nsys profile \
--trace=cuda,nvtx,osrt \
--cuda-memory-usage=true \
--output=/tmp/nsys_profile \
$RUNNER inference_script.py
```
**Reading the timeline:**
- **Stream rows**: Each CUDA stream shows kernel execution as colored blocks with gaps between them. Large gaps = pipeline bubbles.
- **Memory copy rows**: H2D (green) and D2H (red) transfers. Pageable transfers show a staging copy; pinned transfers are direct.
- **CPU row**: Python thread activity. Look for idle periods that coincide with `cudaStreamSynchronize`.
- **NVTX markers**: If the code uses `torch.cuda.nvtx.range_push/pop`, these label pipeline stages.
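If the code is not yet annotated, NVTX ranges can be added directly (a sketch; the stage functions are placeholders):
```python
import torch

torch.cuda.nvtx.range_push("preprocess")
batch = preprocess(frame)                 # placeholder project function
torch.cuda.nvtx.range_pop()

torch.cuda.nvtx.range_push("postprocess")
results = postprocess(model(batch))       # placeholder project functions
torch.cuda.nvtx.range_pop()
```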
### torch.cuda.Event timing (quick per-stage measurement)
For fast A/B comparisons without full profiling infrastructure.
```python
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
# Warm up
for _ in range(10):
run_inference(model, input_data)
torch.cuda.synchronize()
# Measure
times = []
for _ in range(100):
start.record()
run_inference(model, input_data)
end.record()
torch.cuda.synchronize()
times.append(start.elapsed_time(end)) # milliseconds
import statistics
print(f"Mean: {statistics.mean(times):.3f}ms, Median: {statistics.median(times):.3f}ms, Std: {statistics.stdev(times):.3f}ms")
```
**Per-stage measurement** — wrap each pipeline stage with its own event pair:
```python
events = {stage: (torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True))
for stage in ["preprocess", "inference", "postprocess"]}
events["preprocess"][0].record(preprocess_stream)
# ... preprocessing ...
events["preprocess"][1].record(preprocess_stream)
events["inference"][0].record(inference_stream)
# ... inference ...
events["inference"][1].record(inference_stream)
# After sync:
for stage, (s, e) in events.items():
print(f"{stage}: {s.elapsed_time(e):.3f}ms")
```
## The GPU Inference Pipeline Model
A typical GPU inference pipeline has 6 stages. Each transition between stages is an optimization surface:
```
CPU Prep ──H2D──> GPU Preprocess ──> GPU Inference ──D2H──> CPU Postprocess ──> Output
│ │ │ │
│ │ │ │
▼ ▼ ▼ ▼
Image resize Normalize, Model forward Threshold,
Format convert Channel swap, pass (TRT, Sort, NMS,
Numpy→Tensor Pad/letterbox ONNX, PyTorch) Class remap
```
**Where time is typically spent** (ranges for single-image inference on modern GPUs):
| Stage | Typical range | Optimization surface |
|-------|--------------|---------------------|
| CPU Prep | 0.5-3ms | Vectorization, avoid copies |
| H2D Transfer | 0.1-1ms | Pinned memory, non_blocking |
| GPU Preprocess | 0.1-2ms | Kernel fusion, cached constants |
| GPU Inference | 2-20ms | Model optimization (TRT, quantization) — usually not our target |
| D2H Transfer | 0.05-0.5ms | Direct-to-host writes, pinned buffers |
| CPU Postprocess | 0.1-2ms | Numpy for small arrays, vectorization |
| **Sync overhead** | **0.5-3ms** | **Event-based sync, pipeline overlap** |
The model forward pass is usually the largest single cost, but it's typically already optimized (TensorRT, ONNX). **The surrounding pipeline stages and their transitions often account for 30-60% of total E2E latency** and are where Python-level optimization has the most impact.
## Decision Framework
### Step 1: Is the workload GPU-bound or CPU-bound?
```python
# Compare GPU active time to wall-clock time
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
run_inference(model, input_data)
# If total CUDA time >> total CPU time: GPU-bound (optimize kernels, reduce transfers)
# If total CPU time >> total CUDA time: CPU-bound (optimize Python code, reduce sync waits)
# If they're close: pipeline-bound (optimize transitions between stages)
```
### Step 2: Is the bottleneck data movement or kernel execution?
Look at the torch.profiler output:
- **Transfer-bound**: `aten::copy_` and `cudaMemcpy*` dominate. Focus on pinned memory, non_blocking, reducing transfer count.
- **Compute-bound**: CUDA kernels dominate. Focus on kernel fusion, precision reduction, algorithmic changes.
- **Dispatch-bound**: Many short kernels with gaps. Focus on kernel fusion via Triton or torch.compile.
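One rough way to quantify the split from the same `key_averages()` output (a sketch that ignores copy/kernel overlap):
```python
avg = prof.key_averages()
copy_us = sum(e.cuda_time_total for e in avg
              if "copy" in e.key.lower() or "memcpy" in e.key.lower())
total_us = sum(e.cuda_time_total for e in avg if e.cuda_time_total > 0)
print(f"transfer: {copy_us:.0f} us, kernels: {total_us - copy_us:.0f} us")
```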
### Step 3: Is the overhead per-call or proportional?
- **Per-call fixed cost**: Tensor creation, stream synchronization, small H2D transfers. These don't scale with batch size — they matter most for single-image real-time inference.
- **Proportional cost**: Large tensor transfers, compute-heavy kernels. These scale with batch size and image resolution.
Per-call fixed costs compound in streaming/real-time applications where every millisecond of latency matters. Proportional costs matter more in throughput-oriented batch processing.
## GPU Warm-Up
Always warm up the GPU before benchmarking. The first few inference calls include:
- CUDA context initialization (~100-500ms)
- TensorRT engine deserialization (~1-10s)
- cuDNN autotuning (~100ms per unique input shape)
- JIT compilation for torch.compile or Triton kernels
```python
# Standard warm-up: run 10-20 inference calls and discard timing
for _ in range(20):
run_inference(model, input_data)
torch.cuda.synchronize()
# NOW start measuring
```
**Never benchmark cold GPU state** — the variance will mask real optimization effects.


@ -0,0 +1,47 @@
# GPU/CUDA Handoff Template
> Base template: `../shared/handoff-template.md` (use with `{{DOMAIN_PREFIX}}` = `gpu`)
## Additional Environment Fields
Add these to the Environment section of the base handoff template:
```markdown
## Environment
| Key | Value |
|-----|-------|
| GPU model | e.g., Tesla T4, A100 80GB, RTX 4090 |
| CUDA version | e.g., 12.1 |
| cuDNN version | e.g., 8.9.7 |
| TensorRT version | e.g., 8.6.1 (or N/A) |
| Triton version | e.g., 2.1.0 (or N/A) |
| PyTorch version | e.g., 2.1.0+cu121 |
| CUDA graphs | enabled / disabled / N/A |
| Inference backend | PyTorch / TensorRT / ONNX Runtime |
```
## Hotspot Summary Extension
The "Impact" column for GPU targets should report **stage latency** rather than cumtime percentage:
```markdown
## Hotspot Summary
| # | Target | Stage | Latency (us) | Pattern | Status |
|---|--------|-------|-------------|---------|--------|
| 1 | postprocess dispatch gaps | postprocess | 1021 | kernel-fusion | keep — 12us after |
| 2 | preprocess H2D staging | preprocess.h2d | 300 | pageable-to-pinned | keep — 180us after |
| 3 | norm tensor creation | preprocess.tensor | 150 | loop-invariant | keep — 0us after |
| 4 | stream synchronize | sync | 1000 | event-based-sync | keep — 50us after |
```
## GPU-Specific State to Record
At session end, record in HANDOFF.md:
- **Current E2E latency**: median over 100+ warm inference calls
- **Per-stage breakdown**: preprocess, H2D, inference, D2H, postprocess (in microseconds)
- **Transfer summary**: number of H2D/D2H events per call, total bytes transferred
- **Kernel count**: number of CUDA kernel launches per inference call
- **Remaining targets**: what's left and why it can't be optimized (inside TRT engine, at PCIe ceiling, etc.)


@ -0,0 +1,303 @@
# GPU/CUDA Antipattern Catalog
## Profiling Tools
```bash
# torch.profiler — primary GPU profiling tool:
$RUNNER -c "
import torch
from torch.profiler import profile, ProfilerActivity
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], with_stack=True, record_shapes=True) as prof:
# ... run inference ...
pass
print(prof.key_averages().table(sort_by='cuda_time_total', row_limit=20))
"
# nsys — authoritative validation:
nsys profile --trace=cuda,nvtx,osrt --cuda-memory-usage=true -o /tmp/nsys_profile $RUNNER inference_script.py
# CUDA Event micro-benchmark — quick per-stage timing:
# See guide.md for template
```
## Antipattern Catalog
All patterns tagged: **[transfer]** = data movement, **[dispatch]** = kernel launch overhead, **[sync]** = synchronization, **[memory]** = GPU memory management, **[compute]** = kernel/operation efficiency, **[correctness]** = bug that defeats optimization.
### HIGH impact
**[transfer] Loop-invariant tensor creation on GPU** — Per-call H2D for constants
Symptom: Small H2D transfers visible in nsys every inference call. Tensors for normalization parameters, index arrays, or constant lookup tables re-created per call.
```python
# BEFORE: creates mean/std tensors on CPU, transfers to GPU every call
def preprocess(image, device):
tensor = torch.from_numpy(image).to(device)
mean = torch.tensor([0.485, 0.456, 0.406], device=device).view(1, -1, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225], device=device).view(1, -1, 1, 1)
return (tensor - mean) / std
```
```python
# AFTER: cache on GPU, reuse across calls
_norm_mean = None
_norm_std = None
def preprocess(image, device):
global _norm_mean, _norm_std
if _norm_mean is None or not _norm_mean.is_cuda:
_norm_mean = torch.tensor([0.485, 0.456, 0.406], device=device, dtype=torch.float32).view(1, -1, 1, 1)
_norm_std = torch.tensor([0.229, 0.224, 0.225], device=device, dtype=torch.float32).view(1, -1, 1, 1)
tensor = torch.from_numpy(image).to(device)
return (tensor - _norm_mean) / _norm_std
```
Variants:
- **Channel swap index**: `tensor[:, [2,1,0], :, :]` creates a temporary CPU tensor from the Python list, then transfers it. Cache: `_idx = torch.tensor([2,1,0], device=device)`
- **Anchor grids**: Object detection anchor/prior tensors computed per call
- **Positional encodings**: Transformer positional encoding tensors
- Typical impact: 0.1-0.5ms per call per cached tensor
- Why cProfile misses it: Python-side cost is microseconds; the pipeline bubble is in the CUDA driver H2D path
---
**[transfer] Pageable memory for H2D transfers** — Driver-side staging overhead
Symptom: H2D transfers in nsys show a staging copy (pageable -> pinned -> GPU). `torch.from_numpy()` on a regular numpy array creates a pageable CPU tensor.
```python
# BEFORE: pageable memory — requires driver staging for H2D
tensor = torch.from_numpy(image_np).to(device)
```
```python
# AFTER: pinned memory buffer reuse — direct DMA transfer
_pinned_buffer = None
def to_gpu_pinned(image_np, device):
global _pinned_buffer
needed = image_np.shape
if _pinned_buffer is None or _pinned_buffer.shape != needed:
_pinned_buffer = torch.empty(needed, dtype=torch.uint8, pin_memory=True)
_pinned_buffer.numpy()[:] = image_np # copy into pinned buffer (CPU-side, fast)
return _pinned_buffer.to(device, non_blocking=True)
```
- Typical impact: 20-40% faster H2D for medium tensors (1-10 MB). Larger tensors see less relative gain (DMA bandwidth dominates).
- Crossover: Worth it for transfers > ~4 KB. Below that, the overhead of managing the pinned buffer isn't justified.
- Why cProfile misses it: The staging overhead is inside the CUDA driver, not in Python
- Caveat: Pinned memory is a limited resource. Don't pin large amounts (>1GB) or many small buffers. One reusable buffer per pipeline stage is typical.
---
**[dispatch] Many small PyTorch operations where a single fused kernel suffices** — Kernel dispatch gap accumulation
Symptom: torch.profiler shows 10-50+ consecutive CUDA kernels on the same stream with short durations (<100us each). nsys timeline shows visible gaps (~25us) between each kernel. Total wall time for the sequence is 5-20x the sum of kernel compute times.
```python
# BEFORE: 29 separate kernel launches for post-processing
probs = torch.sigmoid(logits) # kernel 1
confidence, class_id = probs.max(dim=-1) # kernels 2-3
mask = confidence > threshold # kernel 4
confidence = confidence[mask] # kernel 5
class_id = class_id[mask] # kernel 6
sorted_idx = torch.argsort(confidence, descending=True) # kernels 7-8
# ... box transform, denormalization, rescaling ... # kernels 9-29
```
```python
# AFTER: single Triton kernel fusing all operations
@triton.jit
def fused_postprocess_kernel(logits_ptr, bboxes_ptr, output_ptr,
num_classes, dw, dh, inv_sw, inv_sh,
pad_l, pad_t, crop_x, crop_y, BLOCK_C: tl.constexpr):
query_id = tl.program_id(0)
# Fused: sigmoid + max + box transform + denorm + rescale
# Writes directly to pinned CPU output buffer
...
```
Alternative to Triton: `torch.compile(fullgraph=True)` can fuse many operations automatically, though with less control over the output.
- Typical impact: 10-100x for the fused region. Post-processing 1021us -> 12us in the RT-DETR case.
- Worth fusing when: >5 sequential kernel launches with <100us each, total dispatch gap >100us
- Why cProfile misses it: Each PyTorch op's Python overhead is microseconds. The dispatch gaps between GPU kernel launches are invisible to Python profilers.
---
**[sync] Blocking stream.synchronize() between dependent GPU stages** — CPU-GPU serialization
Symptom: CPU blocks for 0.5-2ms waiting for GPU work to complete before launching the next stage. In torch.profiler: high CPU time on `cudaStreamSynchronize`. In nsys: CPU idle period aligned with GPU work on another stream.
```python
# BEFORE: CPU waits for GPU preprocessing to finish before launching inference
self._preprocess_stream.synchronize() # CPU blocks ~1ms
with torch.cuda.stream(self._inference_stream):
    self._run_forward()   # placeholder forward launch; can't start until the sync returns
```
```python
# AFTER: GPU-to-GPU dependency via events — CPU proceeds immediately
event = self._preprocess_stream.record_event()
self._inference_stream.wait_event(event) # GPU waits, CPU does not
with torch.cuda.stream(self._inference_stream):
    self._run_forward()   # placeholder forward launch; CPU proceeds immediately
```
- Typical impact: 0.5-1ms per synchronization point removed
- When to keep synchronize(): CUDA graph capture/replay requires it (graphs are fixed op sequences that ignore cross-stream events outside capture). Also needed when CPU must read GPU results before proceeding.
- Why cProfile misses it: `synchronize()` shows as idle CPU time, not as a hot function.
---
### MEDIUM impact
**[compute] torch operations on small CPU arrays** — Dispatch overhead > compute time
Symptom: `torch.sort()`, `torch.argsort()`, `torch.isin()` called on CPU tensors with <1000 elements. torch's per-op dispatch overhead (~10-50us) dominates the actual computation time for small arrays.
```python
# BEFORE: torch on CPU, slow for small arrays
confidence_sorted, sorted_idx = torch.sort(confidence_cpu, descending=True)
mask = torch.isin(class_ids, valid_class_ids_tensor)
```
```python
# AFTER: numpy for small CPU arrays — 2-5x faster
conf_np = confidence_cpu.numpy()
order = np.argsort(-conf_np)
mask = np.isin(class_ids_np, valid_class_ids_np)  # pass an array, not a Python set (np.isin mishandles sets)
```
- Crossover: numpy wins below ~1000 elements for sort/argsort/isin. Below ~100 elements, numpy is 3-5x faster.
- Typical impact: 2-5x for individual operations, 0.1-0.5ms per pipeline stage
- When NOT to do this: If the data is already a GPU tensor and will continue as a GPU tensor — don't move it to CPU just for numpy. This is for CPU-side post-processing after D2H.
---
**[transfer] Unnecessary D2H synchronization for scalar reads** — Hidden GPU->CPU roundtrips
Symptom: Unexpected `cudaStreamSynchronize` in nsys caused by seemingly innocuous Python code. Common hidden triggers:
```python
# HIDDEN SYNC 1: .item() forces D2H + sync
if tensor.max().item() > threshold: # sync!
...
# HIDDEN SYNC 2: torchvision.transforms.functional.normalize has an internal check
# (std == 0).any() — transfers a boolean from GPU to CPU
output = torchvision.transforms.functional.normalize(tensor, mean, std) # sync!
# HIDDEN SYNC 3: print/logging forces sync
print(f"Shape: {tensor.shape}, Max: {tensor.max()}") # sync!
# HIDDEN SYNC 4: Python boolean check on GPU tensor
if tensor.any(): # sync!
...
```
Fix: Batch reads at end of pipeline, use `non_blocking=True` where possible, replace library functions that have hidden syncs with inline equivalents.
- Typical impact: 0.05-0.5ms per hidden sync point
- Why cProfile misses it: The sync shows as CPU idle time, not attributable to the line that triggered it
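A minimal sketch of the batched-read fix (`model` and `batches` are placeholders):
```python
maxes = []
for batch in batches:
    maxes.append(model(batch).max())      # result stays on GPU: no per-iteration sync
maxes_cpu = torch.stack(maxes).cpu()      # single D2H read at the end of the pipeline
```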
---
**[memory] Repeated GPU tensor allocation in inference loop** — Allocation overhead per call
Symptom: `torch.empty()`, `torch.zeros()`, or `torch.ones()` called inside the per-frame inference loop for buffers that could be pre-allocated.
```python
# BEFORE: allocate output buffer every call
def postprocess(logits, num_queries):
output = torch.empty(num_queries, 6, dtype=torch.float32) # allocation per call
...
```
```python
# AFTER: pre-allocate and reuse
def __init__(self):
self._output_buffer = torch.empty(300, 6, dtype=torch.float32, pin_memory=True)
def postprocess(self, logits, num_queries):
output = self._output_buffer[:num_queries] # view into pre-allocated buffer
...
```
- Typical impact: 0.05-0.2ms per avoided allocation (depends on tensor size)
- Pinned buffer variant: For D2H outputs, pre-allocate with `pin_memory=True` so the GPU kernel can write directly to host memory.
---
**[correctness] Device comparison with == instead of .is_cuda** — Caching defeated
Symptom: Tensors that should be cached on GPU are re-created every call. The caching check uses device equality, which fails due to device index mismatch.
```python
# BUG: a cuda:0 tensor's .device != torch.device('cuda') is always True, so the cache is rebuilt every call
if _cached_tensor is None or _cached_tensor.device != torch.device('cuda'):
_cached_tensor = torch.tensor(data, device=target_device) # re-created every call
```
```python
# FIX: .is_cuda checks for any CUDA device index
if _cached_tensor is None or not _cached_tensor.is_cuda:
_cached_tensor = torch.tensor(data, device=target_device) # cached after first call
```
- Impact: Defeats caching, causing per-call tensor creation and H2D transfer overhead
- Pattern class: "caching defeated by incorrect cache key" — a class of bugs specific to PyTorch device handling
- Always fix: This is a correctness bug, not just a performance issue
---
### LOW impact (only on confirmed hot path)
**[dispatch] CPU-side preprocessing that could be a CUDA kernel**
When CPU preprocessing (resize, color conversion, normalization) takes >1ms and the input is going to GPU anyway, consider doing the work on GPU. Libraries: `torchvision.transforms` on CUDA tensors, NVIDIA DALI, custom CUDA kernels.
Caveat: Moving preprocessing to GPU increases GPU utilization — only beneficial if GPU is underutilized during preprocessing.
---
**[sync] Missing non_blocking on tensor.to(device)**
```python
# BEFORE: blocks CPU until transfer complete
gpu_tensor = cpu_tensor.to(device)
# AFTER: CPU proceeds immediately, sync later when needed
gpu_tensor = cpu_tensor.to(device, non_blocking=True)
```
Requires the source tensor to be in pinned memory for `non_blocking` to be effective. If the source is pageable memory, `non_blocking=True` is silently ignored.
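A two-line sketch of the combination that actually overlaps (`cpu_tensor` and `device` assumed):
```python
pinned = cpu_tensor.pin_memory()                     # one-time cost, or reuse a pinned buffer
gpu_tensor = pinned.to(device, non_blocking=True)    # true async DMA on the current stream
```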
---
**[memory] FP32 inference when FP16 would suffice**
When the model supports half precision and numerical accuracy is acceptable:
```python
model = model.half()
# or
with torch.autocast(device_type='cuda', dtype=torch.float16):
output = model(input_tensor)
```
Caveat: Not all models are stable in FP16. Test numerical output equivalence.
---
## Data Size Thresholds
| Pattern | Worth fixing when |
|---------|------------------|
| Loop-invariant tensor creation | Always (fixed per-call cost) |
| Pageable -> pinned memory | Transfer size > ~4 KB |
| Kernel fusion | >5 sequential kernel launches with <100us each |
| Event-based stream sync | `synchronize()` blocks CPU for >0.3ms |
| Small-array CPU fallback | Array has <1000 elements on CPU-side torch |
| Device comparison bug | Always (correctness bug) |
| Repeated GPU allocation | Buffer allocated >100 times/sec |
| Hidden D2H sync | Any occurrence in hot loop |