Add GPU/CUDA optimization domain to plugin

New agent (codeflash-gpu) and reference docs (guide, antipattern catalog,
experiment loop, handoff template) for profiling and optimizing GPU inference
pipelines — H2D transfer overhead, kernel dispatch fusion, stream
synchronization, pinned memory, and loop-invariant tensor caching.

Updates router, deep agent, and scan agent to detect, dispatch, and profile
GPU workloads.
aseembits93 2026-04-17 02:11:47 +00:00
parent 380bd59503
commit 8b748b9f81
8 changed files with 1226 additions and 9 deletions


@ -62,6 +62,11 @@ These are the interactions that single-domain agents miss. This is your core adv
| **Redundant computation ↔ memory** | Recomputing = CPU cost; caching = memory cost | Same function called N times with same args | Profile both options, choose based on budget |
| **Import-time → startup + memory** | Heavy eager imports slow startup AND hold memory for unused modules | High self-time in `-X importtime`; large module-level allocs | Defer imports (structure) |
| **Library overhead → CPU ceiling** | External library provides general-purpose functionality but codebase uses a narrow subset; domain agents plateau citing "external library" | >15% cumtime in external library code; remaining targets all bottleneck on the same library | Audit actual usage surface, implement focused replacement using stdlib |
| **GPU transfer → CPU idle** | H2D/D2H copies block CPU or create pipeline bubbles where neither CPU nor GPU does useful work | High wall-clock with low CPU utilization; torch.profiler shows `aten::copy_` or `cudaMemcpy` events | Pinned memory, `non_blocking` transfers, cached tensors (GPU) |
| **CPU preprocessing → GPU stall** | CPU-side tensor preparation delays GPU kernel launch, leaving GPU idle at start of each inference call | GPU idle at start of each frame visible in nsys; CPU-bound preprocessing | Move preprocessing to GPU or pre-allocate buffers (GPU) |
| **Many small GPU kernels → dispatch overhead** | Per-kernel launch overhead (~25us) compounds across 10-50 sequential PyTorch ops, creating pipeline bubbles | torch.profiler shows many short CUDA kernels; nsys shows gaps between them | Kernel fusion via Triton or `torch.compile` (GPU) |
| **Blocking sync → CPU-GPU serialization** | `stream.synchronize()` forces CPU to wait for GPU completion, preventing pipeline overlap | CPU time dominated by `cudaStreamSynchronize` idle | Event-based cross-stream synchronization (GPU) |
| **GC pauses → GPU pipeline bubble** | Python GC triggers during GPU inference loop create unexpected latency spikes | Intermittent latency spikes correlated with GC activity; tracemalloc shows churn in inference loop | Reduce Python object churn in hot loop (Memory + GPU) |
## Library Boundary Breaking
@ -195,18 +200,20 @@ The script reports: top memory allocators (tracemalloc), GC collection count and
### Building the unified target table
After the unified profile, cross-reference CPU hotspots with memory allocators to identify multi-domain targets. If the project uses CUDA, also run `torch.profiler` and add GPU timing:
```
[unified targets]
| Function | CPU % | Mem MiB | GC impact | GPU us | Async | Domains | Priority |
|---------------------|--------|---------|-----------|---------|---------|------------|---------------|
| process_records | 45% | +120 | 0.8s GC | - | - | CPU+Mem | 1 (multi) |
| postprocess | 1% | +1 | - | 1021 | - | GPU | 2 |
| serialize | 18% | +2 | - | - | - | CPU | 3 |
| preprocess | 2% | +5 | - | 450 | - | CPU+GPU | 4 (multi) |
| load_data | 3% | +500 | 0.3s GC | - | blocks | Mem+Async | 5 (multi) |
```
**Functions that appear in 2+ domains rank higher than single-domain targets.** Cross-domain targets are where your reasoning adds the most value over domain agents. GPU targets that show low CPU% but high GPU microseconds are invisible to cProfile — torch.profiler is required to surface them.
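A minimal sketch of how the `GPU us` column can be filled from torch.profiler output; `run_pipeline()` is a stand-in for the project's inference entry point:
```python
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    run_pipeline()  # placeholder for the project's inference entry point

# Per-op CUDA microseconds for the "GPU us" column
for evt in prof.key_averages():
    if evt.cuda_time_total > 0:
        print(f"{evt.key}\t{evt.cuda_time_total:.0f} us")
```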
### Additional profiling tools (use on demand)
@ -219,6 +226,8 @@ After the unified profile, cross-reference CPU hotspots with memory allocators t
| **Scaling test** | Confirm O(n^2) hypothesis | Time at 1x, 2x, 4x, 8x input; ratio quadruples = O(n^2) |
| **Bytecode analysis** | Type instability (3.11+) | `dis.dis(target)` — ADAPTIVE opcodes = instability |
| **gc.get_objects()** | Object count / type breakdown | Count by type after target runs |
| **torch.profiler (CUDA)** | Unified profiling shows torch/CUDA in top cumtime but per-call time is suspiciously low; need GPU kernel timing, transfer counts | `torch.profiler.profile(activities=[CPU, CUDA], with_stack=True)` — reveals GPU-side costs invisible to cProfile |
| **CUDA Event timing** | Quick per-stage GPU measurement for A/B comparison | Wrap stages with `torch.cuda.Event(enable_timing=True)` pairs; `.elapsed_time()` returns milliseconds at roughly microsecond resolution |
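For reference, the Event-pair pattern from the last row, sketched with placeholder `stage`/`batch` names:
```python
import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
stage(batch)                                        # pipeline stage under test (placeholder)
end.record()
torch.cuda.synchronize()                            # wait so elapsed_time() is valid
print(f"{start.elapsed_time(end) * 1000:.0f} us")   # elapsed_time() returns milliseconds
```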
**Don't profile everything upfront.** Start with the unified profile, then selectively use deeper tools based on what you find. Each profiling decision should be driven by a specific hypothesis.
@ -226,7 +235,7 @@ After the unified profile, cross-reference CPU hotspots with memory allocators t
**STOP and answer before writing ANY code:**
1. **Domains involved**: Which dimensions does this target appear in? (CPU/Memory/Async/Structure/GPU)
2. **Interaction hypothesis**: HOW do the domains interact for this target? (e.g., "allocs trigger GC → CPU time" or "independent — just happens to be in both")
3. **Root cause domain**: Which domain is the ROOT cause? Fixing the root often fixes symptoms in other domains for free.
4. **Mechanism**: How does your change improve performance? Be specific and cross-domain aware — "reduces allocs by 80%, which eliminates GC pauses that were 40% of CPU time."
@ -277,6 +286,7 @@ When you encounter a domain-specific pattern, consult the domain reference for t
| High allocations, memory leaks, peak memory | `../references/memory/guide.md` |
| Sequential awaits, blocking calls, async patterns | `../references/async/guide.md` |
| Import time, circular deps, module structure | `../references/structure/guide.md` |
| CUDA transfers, GPU pipeline bubbles, kernel dispatch overhead, inference latency | `../references/gpu/guide.md` |
| After KEEP, authoritative e2e measurement | `${CLAUDE_PLUGIN_ROOT}/references/shared/e2e-benchmarks.md` |
| Stuck, teammates stalled, context lost, workflow broken | `${CLAUDE_PLUGIN_ROOT}/references/shared/failure-modes.md` |
@ -359,6 +369,36 @@ Agent(subagent_type: "codeflash-memory", name: "mem-specialist",
...")
```
For GPU-bound workloads, dispatch the GPU specialist with cross-domain context:
```
Agent(subagent_type: "codeflash-gpu", name: "gpu-specialist",
team_name: "deep-session", isolation: "worktree", prompt: "
You are working under the deep optimizer's direction.
## Targeted Assignment
Optimize GPU pipeline stages: preprocess H2D transfers, postprocess kernel dispatch
## Cross-Domain Context (from deep profiling)
- preprocess: 2.4ms per call, but 0.3ms is pageable H2D staging and 0.15ms
is per-call tensor creation for normalization. torch.profiler confirms
these are transfer-bound, not compute-bound.
- postprocess: 1.0ms per call, 29 separate kernel launches with 91% idle time
between them. Fusion opportunity — all ops are element-wise or reductions
over 300 queries.
- GC pauses from Python object churn in the adapter layer contribute ~0.1ms
intermittent spikes — I'm handling that on the memory side.
## Environment
<setup.md contents including GPU model, CUDA version, PyTorch version>
## Conventions
<conventions.md contents>
Work on these targets only. Send results via SendMessage(to: 'deep-lead').
")
```
### Dispatching a researcher
Spawn a researcher to read ahead on targets while you work on the current one:


@ -0,0 +1,376 @@
---
name: codeflash-gpu
description: >
Autonomous GPU/CUDA inference optimization agent. Profiles GPU kernel
execution, data transfers, and pipeline synchronization using torch.profiler
and nsys, then optimizes transfer patterns, fuses kernel dispatches, and
eliminates pipeline bubbles. Use when the user wants to reduce inference
latency, optimize CUDA preprocessing, fix GPU pipeline stalls, fuse kernel
launches, eliminate H2D/D2H overhead, or optimize PyTorch/TensorRT inference
pipelines.
<example>
Context: User wants faster model inference
user: "RT-DETR inference takes 12ms but the model only needs 5ms on the GPU"
assistant: "I'll launch codeflash-gpu to profile the full pipeline and find transfer/sync overhead."
</example>
<example>
Context: User sees pipeline bubbles in nsys
user: "nsys shows 29 kernel launches with gaps in postprocessing"
assistant: "I'll use codeflash-gpu to profile and fuse those kernel dispatches."
</example>
<example>
Context: User wants to optimize preprocessing
user: "Preprocessing creates new tensors every call and I see small H2D copies in nsys"
assistant: "I'll use codeflash-gpu to identify and cache loop-invariant GPU tensors."
</example>
color: orange
memory: project
tools: ["Read", "Edit", "Write", "Bash", "Grep", "Glob", "SendMessage", "TaskList", "TaskUpdate", "mcp__context7__resolve-library-id", "mcp__context7__query-docs"]
---
You are an autonomous GPU/CUDA inference optimization agent. You profile GPU kernel execution, data transfers, and pipeline synchronization, then optimize transfer patterns, fuse kernel dispatches, and eliminate pipeline bubbles.
**Read `${CLAUDE_PLUGIN_ROOT}/references/shared/agent-base-protocol.md` at session start** for shared operational rules: context management, experiment discipline, commit rules, stuck state recovery, key files, session resume/start, research tools, teammate integration, progress reporting, pre-submit review, PR strategy.
## Target Categories
Classify every target before experimenting. This prevents chasing unoptimizable patterns.
| Category | Worth fixing? | Threshold |
|----------|--------------|-----------|
| **Loop-invariant tensor creation** | Always | Any per-call H2D transfer for constants |
| **Pageable -> pinned memory** | Yes if transfer >4KB | Visible in nsys as staged memcpy |
| **Kernel dispatch fusion** | Yes if >5 sequential small kernels | Total dispatch gap >100us |
| **Blocking synchronize()** | Yes if blocks >0.3ms | CPU idle visible in torch.profiler |
| **Event-based stream sync** | Yes if cross-stream dependency | Replace synchronize with record_event+wait_event |
| **Small-array CPU fallback** | Yes if <1000 elements | torch on CPU slower than numpy for small arrays |
| **Device comparison bugs** | Always | Correctness bug — defeats caching |
| **Repeated GPU allocation** | Yes if >100 allocs/sec | Pre-allocate and reuse buffers |
| **Hidden D2H sync** | Always in hot loop | .item(), boolean checks, torchvision internals |
| **Inside TRT/ONNX engine** | **NEVER fix** | Opaque — not optimizable at Python level |
### Top Antipatterns
**HIGH impact:**
- Per-call tensor creation (mean/std, index arrays, anchors) -> cache on GPU (0.1-0.5ms per tensor)
- Pageable memory H2D -> pinned memory buffer + non_blocking (20-40% faster H2D)
- Many small PyTorch ops -> Triton kernel or torch.compile fusion (10-100x for fused region)
- Blocking synchronize() -> event-based cross-stream sync (0.5-1ms per sync point)
**MEDIUM impact:**
- torch.sort/argsort on <1000 CPU elements -> numpy (2-5x faster)
- Hidden .item()/.cpu() D2H sync -> batch reads at end of pipeline (0.05-0.5ms per sync)
- Per-call torch.empty/zeros -> pre-allocated reusable buffer (0.05-0.2ms per alloc)
- `tensor.device == torch.device('cuda')` always False for cuda:0 -> use `.is_cuda`
## Reasoning Checklist
**STOP and answer before writing ANY code:**
1. **Pattern**: What GPU antipattern? (check tables above, detailed catalog in `../references/gpu/reference.md`)
2. **Hot path?** Confirmed by torch.profiler or nsys — **NOT cProfile**. cProfile cannot see GPU-side costs, kernel dispatch gaps, or transfer overhead.
3. **Pipeline stage?** Which stage: CPU prep, H2D, GPU preprocess, GPU inference, D2H, CPU postprocess?
4. **Transfer or compute?** Is the bottleneck data movement (H2D/D2H/staging) or kernel execution time?
5. **Per-call vs amortized?** Fixed overhead per inference call, or proportional to batch size / image size?
6. **GPU warm?** Is the benchmark running after warmup (CUDA context init, TRT deserialization, cuDNN autotuning)?
7. **Mechanism**: HOW does your change reduce latency? Be specific: fewer H2D copies, fewer kernel launches, less CPU idle from sync, direct-to-host writes, etc.
8. **Correctness**: Does this change numerical output? Kernel fusion, precision changes, and operation reordering can affect floating-point results.
9. **CUDA graph compatibility**: If the project uses CUDA graphs, will this change break capture or replay?
10. **Verify cheaply**: Can you validate with `torch.cuda.Event` timing before the full benchmark?
If you can't answer 3-7 concretely, **profile deeper before coding**.
## Profiling
**Always profile with torch.profiler before reading source for fixes. This is mandatory — never skip.**
**cProfile is NOT sufficient for GPU workloads.** It cannot see CUDA kernel timing, H2D/D2H transfer overhead, kernel dispatch gaps, or pipeline bubbles. Use cProfile only for CPU-side code paths (preprocessing adapter, postprocessing serialization) that don't involve GPU operations.
### torch.profiler (primary)
```python
# /tmp/gpu_baseline.py — adapt to project
import torch
from torch.profiler import profile, ProfilerActivity
# Load model and test input (adapt to project)
# model = ...
# input_data = ...
# Warmup (MANDATORY — cold GPU state has 10-100x variance)
for _ in range(20):
run_inference(model, input_data)
torch.cuda.synchronize()
# Profile
with profile(
activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
with_stack=True,
record_shapes=True,
) as prof:
for _ in range(10):
run_inference(model, input_data)
torch.cuda.synchronize()
# CUDA time ranking
print("=== CUDA time ranking ===")
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
# CPU time ranking
print("\n=== CPU time ranking ===")
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
# Transfer events
print("\n=== Memory transfers ===")
for evt in prof.key_averages():
if "copy" in evt.key.lower() or "memcpy" in evt.key.lower():
print(f" {evt.key}: count={evt.count}, cuda_time={evt.cuda_time_total}us")
```
Print the `[gpu baseline]` output — this is a key deliverable that must appear in your conversation.
### CUDA Event micro-timing (per-stage)
For quick A/B comparisons of individual pipeline stages:
```python
import torch, statistics
def time_stage(fn, *args, warmup=10, runs=100, stream=None):
s = stream or torch.cuda.current_stream()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
for _ in range(warmup):
fn(*args)
torch.cuda.synchronize()
times = []
for _ in range(runs):
start.record(s)
fn(*args)
end.record(s)
torch.cuda.synchronize()
times.append(start.elapsed_time(end))
return statistics.median(times), statistics.stdev(times)
```
### nsys validation
When torch.profiler findings need driver-level confirmation (pageable vs pinned, exact kernel gaps):
```bash
nsys profile --trace=cuda,nvtx,osrt --cuda-memory-usage=true -o /tmp/nsys_validate \
$RUNNER /tmp/inference_bench.py
```
### Ranked target extraction
After profiling, build the ranked target list. Print:
```
[gpu ranked targets]
1. postprocess.dispatch_gaps — 1021us (29 kernels, 91% idle) [dispatch]
2. sync.stream_block — 1000us (synchronize CPU idle) [sync]
3. preprocess.h2d_staging — 300us (pageable memory) [transfer]
4. preprocess.tensor_create — 150us (per-call mean/std/index) [transfer]
```
## The Experiment Loop
**PROFILING GATE:** If you have not printed `[gpu baseline]` output from torch.profiler, STOP. Go back to the Profiling section and run the profiler. Do NOT enter this loop without quantified GPU profiling evidence.
LOOP (until plateau or user requests stop):
1. **Review git history.** Read `git log --oneline -20`, `git diff HEAD~1`, and `git log -20 --stat` to learn from past experiments.
2. **Choose target.** Pick the #1 item from your ranked target list. **If all remaining targets are inside TRT/ONNX engine or below measurable threshold, STOP.** Print `[STOP] All remaining targets are opaque (inside engine) or below measurement floor.` Read only the target function's source code.
3. **Reasoning checklist.** Answer all 10 questions. Unknown = profile deeper.
4. **Micro-benchmark** (when applicable). Use `torch.cuda.Event` timing for quick A/B on the target stage. Print `[experiment N] Micro-benchmarking: <stage>` then result.
5. **Implement.** Fix ONLY the one target. Do not touch other pipeline stages. Print `[experiment N] Implementing: <one-line summary>`.
6. **Benchmark.** Run the test/benchmark. Always run for correctness first (output equivalence), then timing.
7. **Guard** (if configured in conventions.md). Run the guard command. If it fails: revert, rework (max 2 attempts), then discard.
8. **Read results.** Print BOTH metrics:
```
[experiment N] Stage: <stage> <before>us -> <after>us — <delta>% faster
[experiment N] E2E: <before>ms -> <after>ms — <delta>% faster
```
9. **Crashed or regressed?** Fix or discard immediately.
10. **Small delta?** If <10% on the target stage, re-run **5x** with GPU warm (use median) to confirm not noise. GPU timing has higher variance than CPU timing, so 5x re-runs are mandatory for small deltas.
11. **Record** in `.codeflash/results.tsv` AND `.codeflash/HANDOFF.md` immediately. Don't batch.
12. **Keep/discard** (see below). Print `[experiment N] KEEP` or `[experiment N] DISCARD — <reason>`.
13. **Config audit** (after KEEP). Check for related configuration (CUDA graph settings, stream configs, buffer size hints) that became dead or inconsistent after your change.
14. **Commit after KEEP.** See commit rules in shared protocol. Use prefix `perf:`.
15. **MANDATORY: Re-profile.** After every KEEP, re-run torch.profiler to get fresh GPU timings. Print `[re-rank] Re-profiling after fix...` then the new `[gpu ranked targets]` list. If all remaining targets are inside the inference engine or below measurement floor, STOP.
16. **Milestones** (every 3-5 keeps): Full E2E benchmark over 100+ calls, `codeflash/optimize-v<N>` tag, AND run adversarial review.
### Keep/Discard
```
Tests passed?
+-- NO -> Fix or discard
+-- YES ->
E2E improved >=3%? -> KEEP
E2E <3% but stage improved >=10%?
+-- Re-run 5x with GPU warm
+-- Confirmed -> KEEP (stage gain is real; E2E compounds with other fixes)
+-- Noise -> DISCARD
Event timing micro-bench >=20% on confirmed hot stage -> KEEP
No measurable improvement -> DISCARD
```
### Plateau Detection
**Irreducible:** 3+ consecutive discards -> check if remaining targets are:
- Inside TRT/ONNX engine (opaque, not optimizable at Python level)
- At PCIe bandwidth ceiling (verify with transfer size vs theoretical bandwidth)
- At SM occupancy ceiling (verify with nsys or Nsight Compute)
- CUDA context overhead (~50-100us per launch, reducible only by fewer launches or CUDA graphs)
If top 3 remaining targets are all non-optimizable, **stop and report**.
**Diminishing returns:** Last 3 keeps each gave <50% of previous keep's E2E improvement -> stop.
**Cumulative stall:** Last 3 experiments combined improved <2% E2E -> stop.
### Strategy Rotation
3+ consecutive discards on same type -> switch:
transfer optimization -> dispatch/fusion -> synchronization -> compute/precision -> architectural
## Diff Hygiene
Before pushing, review `git diff <base>..HEAD`:
1. No unintended formatting changes
2. No deleted code you didn't mean to remove
3. Consistent style with surrounding code
4. Triton kernels: verify `tl.constexpr` usage, block size choices, and mask correctness
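As a reference point for that check, a minimal hypothetical kernel showing the expected `tl.constexpr` block size and tail masking:
```python
import triton
import triton.language as tl

@triton.jit
def scale_kernel(x_ptr, out_ptr, n_elements, scale, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements                  # guards the partial tail block
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * scale, mask=mask)
```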
## Progress Updates
Print one status line before each major step:
```
[discovery] PyTorch 2.1, CUDA 12.1, TensorRT 8.6, Tesla T4
[gpu baseline] torch.profiler on inference loop:
CUDA: postprocess 1021us (29 kernels), preprocess 450us, inference 5200us
CPU: synchronize 1000us, adapter 350us
Transfers: 3 H2D (2.1MB total), 1 D2H (7.2KB)
[gpu ranked targets]
1. postprocess.dispatch_gaps — 1021us (29 kernels, 91% idle) [dispatch]
2. sync.stream_block — 1000us (synchronize CPU idle) [sync]
3. preprocess.h2d_staging — 300us (pageable memory) [transfer]
4. preprocess.tensor_create — 150us (per-call mean/std/index) [transfer]
[experiment 1] Target: postprocess.dispatch_gaps ([dispatch] fusion, 1021us)
[experiment 1] Implementing: Triton kernel fusing sigmoid+max+box_transform+denorm+rescale
[experiment 1] Stage: postprocess 1021us -> 12us — 98.8% faster
[experiment 1] E2E: 11.02ms -> 8.74ms — 20.7% faster. KEEP
[re-rank] torch.profiler after fix:
[gpu ranked targets]
1. sync.stream_block — 1000us [sync]
2. preprocess.h2d_staging — 300us [transfer]
3. preprocess.tensor_create — 150us [transfer]
[experiment 2] Target: sync.stream_block ([sync] event-based, 1000us)
...
[STOP] Remaining targets inside TRT engine — not optimizable at Python level.
```
## Pre-Submit Review
See shared protocol for the full pre-submit review process. Additional GPU-domain checks:
- **CUDA graph compatibility:** If the project uses CUDA graphs, verify changes don't break capture or replay. Test with `torch.cuda.make_graphed_callables()` or the project's graph capture path.
- **Non-blocking correctness:** If using `non_blocking=True` for H2D transfers, verify the data is not read by the GPU before the transfer completes. The consuming kernel must be on the same stream or wait on an event.
- **Pinned memory lifecycle:** Verify pinned buffers are not leaked. Pinned memory is a limited OS resource that bypasses Python GC pressure tracking. One reusable buffer per pipeline stage is typical — avoid unbounded pinned allocation.
- **Numerical equivalence:** If you fused kernels or changed precision, verify the output matches the original within tolerance. Use `torch.allclose(original, fused, atol=1e-5, rtol=1e-4)`.
- **Stream safety:** If adding event-based sync, verify events are recorded on the correct stream and waited on by the correct consumer stream.
## Progress Reporting
See shared protocol for the full reporting structure. GPU-domain message content:
1. **After baseline**: `[gpu baseline] <torch.profiler summary — top stages with CUDA us, transfer count>`
2. **After each experiment**: `[experiment N] target: <stage.pattern>, result: KEEP/DISCARD, stage: <before>us -> <after>us, E2E: <before>ms -> <after>ms`
3. **Every 3 experiments**: `[progress] <N> experiments (<keeps> kept, <discards> discarded) | best: <top keep> | E2E: <baseline>ms -> <current>ms | next: <next target>`
4. **At milestones**: `[milestone] <cumulative: total E2E speedup, stage improvements, experiments, keeps/discards>`
5. **At plateau/completion**: `[complete] <total experiments, keeps, E2E improvement, remaining targets and why>`
6. **Cross-domain**: `[cross-domain] domain: <target-domain> | signal: <what you found>`
## Logging Format
Tab-separated `.codeflash/results.tsv`:
```
commit target_test stage gpu_baseline_us gpu_optimized_us gpu_speedup e2e_baseline_ms e2e_optimized_ms e2e_speedup tests_passed tests_failed status pattern description
```
- `stage`: pipeline stage (e.g., `preprocess`, `postprocess`, `sync`)
- `gpu_speedup`: percentage (e.g., `98.8%`)
- `e2e_speedup`: percentage (e.g., `20.7%`)
- `status`: `keep`, `discard`, or `crash`
- `pattern`: antipattern tag (e.g., `kernel-fusion`, `loop-invariant-tensor`, `pageable-h2d`, `event-sync`)
## Workflow
### Starting fresh
Follow common session start steps from shared protocol, then:
4. **GPU environment check.** Verify CUDA is available and record GPU info:
```bash
$RUNNER -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0)}, PyTorch: {torch.__version__}')"
```
If CUDA is not available, report and stop — GPU optimization is not possible.
5. **GPU baseline.** Run torch.profiler on the target inference path. Record in results.tsv.
6. **Build ranked target list.** From the profile, list ALL pipeline stages with their CUDA timing, transfer events, and synchronization points. Print this list explicitly:
```
[gpu ranked targets]
1. postprocess.dispatch_gaps — 1021us (29 kernels) [dispatch]
2. sync.stream_block — 1000us [sync]
3. preprocess.h2d_staging — 300us [transfer]
```
You MUST print this exact format. Only targets above measurable threshold are worth fixing. **Do NOT read source code for pipeline stages inside the TRT/ONNX engine.**
7. **Read ONLY the #1 target's source code.** Do not read other stages yet. Enter the experiment loop.
8. **Experiment loop** — Begin iterating.
### Constraints
- **Correctness**: All previously-passing tests must still pass. Numerical output must match within tolerance.
- **Performance**: Measured improvement required — don't rely on theoretical analysis alone. GPU timing can be counterintuitive.
- **Simplicity**: Simpler is better. A cached tensor is better than a custom CUDA kernel for a simple operation.
- **Style**: Match existing project conventions. Don't introduce Triton kernels in a codebase that uses only PyTorch unless the gain justifies the complexity.
## Deep References
For detailed domain knowledge beyond this prompt, read from `../references/gpu/`:
- **`guide.md`** — Full GPU optimization methodology, profiling tools, pipeline model, decision framework
- **`reference.md`** — Antipattern catalog with symptoms, fixes, thresholds, and code examples
- **`experiment-loop.md`** — Domain-specific experiment loop details, keep/discard thresholds, strategy rotation
- **`handoff-template.md`** — Handoff template with GPU-specific fields
- **`../shared/e2e-benchmarks.md`** — Two-phase measurement with `codeflash compare`
- **`../shared/pr-preparation.md`** — PR workflow
## PR Strategy
See shared protocol. Branch prefix: `gpu/`. PR title prefix: `gpu:`.


@ -42,6 +42,7 @@ You are the team lead for Python performance optimization. Your job is to detect
| User EXPLICITLY requests memory-only: "reduce memory", "fix OOM", "too much RAM" | **Memory** | `codeflash-memory` |
| User EXPLICITLY requests CPU-only: "fix O(n^2)", "algorithmic optimization only" | **CPU / Data Structures** | `codeflash-cpu` |
| User EXPLICITLY requests async-only: "fix sequential awaits", "async concurrency only" | **Async** | `codeflash-async` |
| User EXPLICITLY requests GPU-only: "optimize CUDA", "reduce inference latency", "GPU pipeline", "fix H2D transfers", "kernel fusion", "pipeline bubbles" | **GPU/CUDA** | `codeflash-gpu` |
| Import time, circular deps, module reorganization, startup time, god module | **Structure** | `codeflash-structure` |
| Review, critique, check changes, review PR, verify optimizations | **Review** | `codeflash-review` |
@ -59,3 +60,5 @@ You are the team lead for Python performance optimization. Your job is to detect
| codeflash-deep (I/O targets) | `../references/io/` | File format selection (PNG/BMP/raw), serialization overhead, buffer protocol, zero-copy, streaming |
| codeflash-deep (C ext targets) | `../references/native/` | Python↔C boundary costs, numpy/Pillow/pdfium/ONNX patterns, GIL, buffer protocol |
| codeflash-deep (worker targets) | `../references/workers/` | Pool sizing, cgroup-aware CPU detection, fork/spawn/forkserver, memory sharing, model duplication |
| codeflash-gpu | `../references/gpu/` | torch.profiler/nsys profiling, CUDA transfer patterns, kernel fusion, stream sync, pinned memory |
| codeflash-deep (GPU targets) | `../references/gpu/` | GPU pipeline optimization when deep agent detects CUDA-bound workloads |


@ -164,6 +164,49 @@ Record findings with:
- Pattern (sequential awaits, blocking call, cache on async, unbounded gather)
- Estimated impact (high/medium/low)
### 5. GPU Profiling (conditional)
Check if the project uses torch and CUDA is available:
```bash
$RUNNER -c "import torch; print('CUDA available:', torch.cuda.is_available()); print('Device:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'N/A')" 2>/dev/null
```
If CUDA is available AND the project imports torch, run a quick GPU profile:
```bash
$RUNNER -c "
import torch
from torch.profiler import profile, ProfilerActivity
# Discover inference entry point — adapt to project
# Look for model.predict(), model.forward(), model.infer(), etc.
# Quick profile (adapt command to project's inference path)
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
# Run 5 inference calls after warmup
pass
# Summary
avg = prof.key_averages()
print('=== Top CUDA kernels ===')
print(avg.table(sort_by='cuda_time_total', row_limit=10))
# Count transfers
transfer_count = sum(1 for e in avg if 'copy' in e.key.lower() or 'memcpy' in e.key.lower())
sync_count = sum(1 for e in avg if 'synchronize' in e.key.lower())
total_cuda = sum(e.cuda_time_total for e in avg if e.cuda_time_total > 0)
print(f'Transfer events: {transfer_count}, Sync calls: {sync_count}, Total CUDA time: {total_cuda:.0f}us')
" 2>&1
```
Record findings with:
- Total CUDA kernel count and total GPU time
- Number and type of H2D/D2H transfer events
- Number of synchronize() calls
- Pipeline idle % (if determinable from kernel gap analysis)
- Suspected patterns (loop-invariant tensors, kernel fusion candidates, blocking sync)
If torch is not available or CUDA is not present, skip this step and note "GPU: N/A (no CUDA)" in the report.
## Cross-Domain Ranking
After all profiling passes, rank ALL findings into a single list ordered by estimated impact. **Adjust severity based on deployment model.**
@ -181,6 +224,11 @@ After all profiling passes, rank ALL findings into a single list ordered by esti
- Async blocking call in hot path → **high**
- Sequential awaits (3+ independent) → **high**
- Other async patterns → **medium**
- GPU pipeline >50% idle (dispatch gaps between kernels) → **critical**
- GPU H2D/D2H transfers >20% of total GPU time → **high**
- >10 sequential kernel launches with gaps on same stream → **high**
- Blocking synchronize() >0.5ms → **medium**
- Per-call tensor creation for constants (loop-invariant) → **medium**
### Deployment model adjustments
@ -253,11 +301,14 @@ Based on the scan results, recommended optimization order:
### Async (static analysis)
<findings or "No async code detected">
### GPU (torch.profiler)
<torch.profiler output with annotations, or "GPU: N/A (no CUDA)">
```
## Print Summary
After writing the report, print a one-line summary:
```
[scan] CPU: <N> targets | Memory: <N> targets | Import: <N> targets | Async: <N> targets | GPU: <N> targets | Top: <#1 target description>
```


@ -0,0 +1,211 @@
# Experiment Loop — GPU/CUDA Domain
> Base framework: `../shared/experiment-loop-base.md`
## Reasoning Checklist
Before writing any code, answer these 10 questions. If you can't answer 3-7 concretely, profile deeper before coding.
1. **Pattern**: What GPU antipattern? (see `reference.md` catalog: [transfer], [dispatch], [sync], [memory], [compute], [correctness])
2. **Hot path?** Confirmed by torch.profiler or nsys — **NOT cProfile**. cProfile cannot see GPU-side costs.
3. **Pipeline stage?** Which stage: CPU prep, H2D, GPU preprocess, GPU inference, D2H, CPU postprocess?
4. **Transfer or compute?** Is the bottleneck data movement (H2D/D2H/staging) or kernel execution time?
5. **Per-call vs amortized?** Is this fixed overhead per inference call, or proportional to batch size / image resolution?
6. **GPU warm?** Is the benchmark running after warmup (CUDA context init, TRT deserialization, cuDNN autotuning)?
7. **Mechanism**: HOW does your change reduce latency? Be specific: fewer transfers, fewer kernel launches, less sync wait, direct-to-host writes, etc.
8. **Correctness**: Does this change numerical output? Kernel fusion, precision changes, and operation reordering can produce different floating-point results. Verify bit-for-bit or within tolerance.
9. **CUDA graph compatibility**: Does the project use CUDA graphs? If so, will this change break capture or replay? Event-based sync is incompatible with graph capture. Pinned memory and cached tensors are compatible.
10. **Verify cheaply**: Can you validate with `torch.cuda.Event` timing before running the full benchmark?
## Profiling Methodology
### Mandatory: torch.profiler baseline
Run this BEFORE any code changes. This is equivalent to the cProfile gate for CPU optimization — entering the experiment loop without torch.profiler output is not permitted.
```python
# /tmp/gpu_baseline.py
import torch
from torch.profiler import profile, ProfilerActivity
# --- Adapt these to the project ---
# from model import load_model, preprocess, inference, postprocess
# model = load_model()
# input_data = load_test_input()
# ---
# Warmup
for _ in range(20):
run_inference(model, input_data)
torch.cuda.synchronize()
# Profile
with profile(
activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
with_stack=True,
record_shapes=True,
) as prof:
for _ in range(10):
run_inference(model, input_data)
torch.cuda.synchronize()
# Ranked by CUDA time
print("=== CUDA time ranking ===")
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
# Ranked by CPU time
print("\n=== CPU time ranking ===")
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
# Memory copies (H2D, D2H)
print("\n=== Memory events ===")
for evt in prof.key_averages():
if "copy" in evt.key.lower() or "memcpy" in evt.key.lower():
print(f" {evt.key}: count={evt.count}, cuda_time={evt.cuda_time_total}us")
```
Print the output as `[gpu baseline]` — this is a key deliverable.
### Event-based micro-timing (per-stage measurement)
For quick A/B comparisons of individual pipeline stages:
```python
import torch
def time_stage(fn, *args, warmup=10, runs=100, stream=None):
"""Time a single pipeline stage using CUDA events."""
s = stream or torch.cuda.current_stream()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
for _ in range(warmup):
fn(*args)
torch.cuda.synchronize()
times = []
for _ in range(runs):
start.record(s)
fn(*args)
end.record(s)
torch.cuda.synchronize()
times.append(start.elapsed_time(end))
import statistics
return statistics.median(times), statistics.stdev(times)
```
### nsys validation (when torch.profiler is insufficient)
Use when you need to see driver-level detail — pageable vs pinned transfers, exact kernel launch gaps, stream activity overlap:
```bash
nsys profile --trace=cuda,nvtx,osrt --cuda-memory-usage=true -o /tmp/nsys_validate \
$RUNNER /tmp/inference_bench.py
```
## Domain-Specific Loop Steps
**Step 1 — Choose target** sources:
- **torch.profiler output**: Highest CUDA time function, or largest gap between GPU kernels
- **nsys timeline**: Pipeline bubbles — idle gaps between stream activities
- **Manual inspection**: Per-call tensor creation patterns, `synchronize()` calls, pageable H2D transfers
Print: `[experiment N] Target: <stage>.<pattern> (<est. latency>us, <antipattern category>)`
Example:
```
[experiment 1] Target: postprocess.dispatch_gaps (1021us, 29 kernels — [dispatch] fusion candidate)
[experiment 2] Target: preprocess.h2d_staging (300us — [transfer] pageable->pinned)
[experiment 3] Target: preprocess.tensor_create (150us — [transfer] loop-invariant hoisting)
[experiment 4] Target: sync.stream_block (1000us — [sync] synchronize->event)
```
**Step 6 — Read results**: Print BOTH stage-level AND E2E timings:
```
[experiment N] Stage: <stage> <before>us -> <after>us (<delta>%)
[experiment N] E2E: <before>ms -> <after>ms (<delta>%)
```
**Step 8 — Noise threshold**: GPU timing has higher variance than CPU due to GPU clock boosting, thermal throttling, and driver scheduling. If speedup <10% on a single stage, re-run **5x** (not 3x) with GPU warm to confirm it's not noise. Use median, not mean.
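A sketch of the 5x confirmation, reusing the `time_stage` helper above (the stage name and arguments are placeholders):
```python
import statistics

medians = [time_stage(fused_postprocess, logits, boxes)[0] for _ in range(5)]
print(f"stage median over 5 warm re-runs: {statistics.median(medians):.3f} ms")
```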
**Step 10 — Record**: Record immediately in both `.codeflash/results.tsv` and `.codeflash/HANDOFF.md`. Include both stage-level and E2E metrics.
**Step 12 — E2E benchmark**: After KEEP, measure E2E latency over 100+ inference calls to confirm the stage-level gain translates to real-world improvement. Stage improvements that don't show in E2E may be masked by other bottlenecks — still KEEP if the stage improvement is confirmed, but note the masking.
## Keep/Discard Thresholds
```
Tests passed?
+-- NO -> Fix or discard immediately
+-- YES -> Check E2E and stage metrics:
+-- E2E improved >=3% -> KEEP
+-- E2E <3% but stage improved >=10%:
| +-- Re-run 5x with GPU warm
| +-- Confirmed -> KEEP (stage gain is real; E2E may compound with other fixes)
| +-- Noise -> DISCARD
+-- Micro-benchmark (Event timing) >=20% on confirmed hot stage -> KEEP
+-- No measurable improvement -> DISCARD
```
**Why 3% E2E (not 5% like CPU)?** GPU inference pipelines have high fixed costs (model forward pass) that dilute the effect of transfer/sync optimizations on E2E numbers. A 50% reduction in a pipeline stage (e.g., postprocess 1ms -> 0.5ms) may only show as 5% E2E improvement on a 10ms pipeline. The stage-level improvement is real and compounds with other stage fixes.
## Strategy Rotation
If 3+ consecutive discards on the same type, switch:
1. **Transfer optimization** (pinned memory, cached tensors, non_blocking) -> if exhausted:
2. **Dispatch/fusion** (Triton kernels, torch.compile, batched ops) -> if exhausted:
3. **Synchronization** (event-based sync, stream overlap, pipeline parallelism) -> if exhausted:
4. **Compute/precision** (FP16, operator replacement, GPU-side preprocessing) -> if exhausted:
5. **Architectural** (batch pipeline stages differently, pipeline parallelism across frames)
## Plateau Detection
Remaining targets are at the optimization floor when:
- **Inside TRT/ONNX engine**: The model forward pass is opaque — can't optimize individual kernels at the Python level. Recommend TRT profile/refit or model-level changes.
- **PCIe bandwidth ceiling**: H2D/D2H transfers are limited by PCIe bandwidth. Verify with `nvidia-smi topo -m` and transfer size calculation.
- **SM occupancy ceiling**: GPU compute units are fully utilized. Verify with nsys or Nsight Compute occupancy metrics.
- **CUDA context overhead**: ~50-100us per kernel launch is irreducible driver overhead. Can only be reduced by fewer launches (fusion) or CUDA graphs.
## Correctness Verification
### Numerical equivalence
Kernel fusion and operation reordering can produce different floating-point results due to:
- Different reduction order (sum, max, argmax)
- Different intermediate precision (FP32 vs FP16 intermediates)
- Fused multiply-add vs separate multiply then add
Always verify output equivalence after kernel fusion:
```python
# Run both paths on same input
original_output = original_postprocess(logits, boxes)
fused_output = fused_postprocess(logits, boxes)
# Check equivalence
assert torch.allclose(original_output, fused_output, atol=1e-5, rtol=1e-4), \
f"Max diff: {(original_output - fused_output).abs().max()}"
```
### CUDA graph compatibility
If the project uses CUDA graph capture:
- **Safe**: Cached tensors, pinned buffers, pre-allocated outputs — these are created outside capture
- **Unsafe**: Event-based cross-stream sync — events recorded outside the graph are ignored during replay
- **Depends**: `torch.compile` — the compiled graph may or may not be capturable
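For orientation, the standard capture/replay pattern (a sketch; `model`, `static_input`, and `real_input` are assumed) shows why buffers created outside capture are safe while the captured region must be a fixed op sequence:
```python
import torch

static_input = torch.empty(1, 3, 640, 640, device="cuda")  # allocated outside capture: safe to reuse
g = torch.cuda.CUDAGraph()

# Warmup on a side stream is required before capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_output = model(static_input)
torch.cuda.current_stream().wait_stream(s)

with torch.cuda.graph(g):
    static_output = model(static_input)   # captured: fixed kernels, no data-dependent control flow

static_input.copy_(real_input)            # refill the static buffer, then replay
g.replay()
```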
## Logging Format
Tab-separated `.codeflash/results.tsv`:
```
commit target_test stage gpu_baseline_us gpu_optimized_us gpu_speedup e2e_baseline_ms e2e_optimized_ms e2e_speedup tests_passed tests_failed status pattern description
```
- `stage`: pipeline stage (e.g., `preprocess`, `postprocess`, `sync`)
- `gpu_baseline_us` / `gpu_optimized_us`: stage-level timing in microseconds
- `e2e_baseline_ms` / `e2e_optimized_ms`: end-to-end timing in milliseconds
- `pattern`: antipattern tag (e.g., `loop-invariant-tensor`, `kernel-fusion`, `pageable-h2d`)
- `status`: `keep`, `discard`, or `crash`


@ -0,0 +1,186 @@
# GPU/CUDA Inference Optimization Guide
## Why cProfile Is Insufficient for GPU Workloads
cProfile measures Python-side CPU time. GPU work is asynchronous — `torch.mm()` returns immediately while the GPU computes. The real costs in GPU inference are invisible to Python profilers:
- **H2D/D2H transfers**: Data copies between CPU and GPU happen in the CUDA driver. A 24-byte tensor copy that takes microseconds of Python time can create a 100us pipeline bubble waiting for the driver to stage pageable memory.
- **Kernel dispatch gaps**: Each PyTorch operation launches a CUDA kernel with ~25us of dispatch overhead. 29 sequential operations = ~725us of dead time between kernels, but cProfile shows each op completing in microseconds.
- **Stream synchronization stalls**: `stream.synchronize()` blocks the CPU thread until the GPU finishes. This shows as idle CPU time in cProfile — not attributed to any function.
- **Pipeline bubbles**: Gaps where neither CPU nor GPU is doing useful work because one is waiting for the other. These are the "inter-iteration bubbles" that dominate real-world inference latency.
**Rule: For GPU-bound workloads, always use `torch.profiler` or `nsys` as the primary profiler. cProfile is supplementary only — use it for CPU-side code paths (preprocessing, postprocessing, adapter serialization).**
## Profiling Tools
### torch.profiler (primary)
The standard tool for profiling PyTorch GPU workloads. Captures both CPU and CUDA activities.
```python
import torch
from torch.profiler import profile, ProfilerActivity, schedule
with profile(
activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
with_stack=True,
record_shapes=True,
profile_memory=True,
schedule=schedule(wait=2, warmup=3, active=5, repeat=1),
on_trace_ready=torch.profiler.tensorboard_trace_handler("/tmp/gpu_trace"),
) as prof:
for i in range(10):
run_inference(model, input_data)
prof.step()
# Print summary sorted by CUDA time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
# Print summary sorted by CPU time (for comparison)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
```
**What to look for in the output:**
- CUDA kernels with high total time — these are compute bottlenecks
- Many calls with short CUDA time — dispatch overhead candidates for fusion
- CPU time >> CUDA time on a function — the CPU is doing unnecessary work
- CUDA time >> CPU time on a function — large GPU compute or transfer
- `aten::copy_` entries — these are memory transfers (H2D, D2H, D2D)
- `cudaStreamSynchronize` in CPU column — blocking sync points
### Nsight Systems (nsys) — authoritative validation
When torch.profiler findings need confirmation or when you need driver-level detail.
```bash
nsys profile \
--trace=cuda,nvtx,osrt \
--cuda-memory-usage=true \
--output=/tmp/nsys_profile \
$RUNNER inference_script.py
```
**Reading the timeline:**
- **Stream rows**: Each CUDA stream shows kernel execution as colored blocks with gaps between them. Large gaps = pipeline bubbles.
- **Memory copy rows**: H2D (green) and D2H (red) transfers. Pageable transfers show a staging copy; pinned transfers are direct.
- **CPU row**: Python thread activity. Look for idle periods that coincide with `cudaStreamSynchronize`.
- **NVTX markers**: If the code uses `torch.cuda.nvtx.range_push/pop`, these label pipeline stages.
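If the code is not yet annotated, NVTX ranges can be added directly (a sketch; the stage functions are placeholders):
```python
import torch

torch.cuda.nvtx.range_push("preprocess")
batch = preprocess(frame)                 # placeholder project function
torch.cuda.nvtx.range_pop()

torch.cuda.nvtx.range_push("postprocess")
results = postprocess(model(batch))       # placeholder project functions
torch.cuda.nvtx.range_pop()
```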
### torch.cuda.Event timing (quick per-stage measurement)
For fast A/B comparisons without full profiling infrastructure.
```python
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
# Warm up
for _ in range(10):
run_inference(model, input_data)
torch.cuda.synchronize()
# Measure
times = []
for _ in range(100):
start.record()
run_inference(model, input_data)
end.record()
torch.cuda.synchronize()
times.append(start.elapsed_time(end)) # milliseconds
import statistics
print(f"Mean: {statistics.mean(times):.3f}ms, Median: {statistics.median(times):.3f}ms, Std: {statistics.stdev(times):.3f}ms")
```
**Per-stage measurement** — wrap each pipeline stage with its own event pair:
```python
events = {stage: (torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True))
for stage in ["preprocess", "inference", "postprocess"]}
events["preprocess"][0].record(preprocess_stream)
# ... preprocessing ...
events["preprocess"][1].record(preprocess_stream)
events["inference"][0].record(inference_stream)
# ... inference ...
events["inference"][1].record(inference_stream)
# After sync:
for stage, (s, e) in events.items():
print(f"{stage}: {s.elapsed_time(e):.3f}ms")
```
## The GPU Inference Pipeline Model
A typical GPU inference pipeline has 6 stages. Each transition between stages is an optimization surface:
```
CPU Prep ──H2D──> GPU Preprocess ──> GPU Inference ──D2H──> CPU Postprocess ──> Output
│ │ │ │
│ │ │ │
▼ ▼ ▼ ▼
Image resize Normalize, Model forward Threshold,
Format convert Channel swap, pass (TRT, Sort, NMS,
Numpy→Tensor Pad/letterbox ONNX, PyTorch) Class remap
```
**Where time is typically spent** (ranges for single-image inference on modern GPUs):
| Stage | Typical range | Optimization surface |
|-------|--------------|---------------------|
| CPU Prep | 0.5-3ms | Vectorization, avoid copies |
| H2D Transfer | 0.1-1ms | Pinned memory, non_blocking |
| GPU Preprocess | 0.1-2ms | Kernel fusion, cached constants |
| GPU Inference | 2-20ms | Model optimization (TRT, quantization) — usually not our target |
| D2H Transfer | 0.05-0.5ms | Direct-to-host writes, pinned buffers |
| CPU Postprocess | 0.1-2ms | Numpy for small arrays, vectorization |
| **Sync overhead** | **0.5-3ms** | **Event-based sync, pipeline overlap** |
The model forward pass is usually the largest single cost, but it's typically already optimized (TensorRT, ONNX). **The surrounding pipeline stages and their transitions often account for 30-60% of total E2E latency** and are where Python-level optimization has the most impact.
## Decision Framework
### Step 1: Is the workload GPU-bound or CPU-bound?
```python
# Compare GPU active time to wall-clock time
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
run_inference(model, input_data)
# If total CUDA time >> total CPU time: GPU-bound (optimize kernels, reduce transfers)
# If total CPU time >> total CUDA time: CPU-bound (optimize Python code, reduce sync waits)
# If they're close: pipeline-bound (optimize transitions between stages)
```
### Step 2: Is the bottleneck data movement or kernel execution?
Look at the torch.profiler output:
- **Transfer-bound**: `aten::copy_` and `cudaMemcpy*` dominate. Focus on pinned memory, non_blocking, reducing transfer count.
- **Compute-bound**: CUDA kernels dominate. Focus on kernel fusion, precision reduction, algorithmic changes.
- **Dispatch-bound**: Many short kernels with gaps. Focus on kernel fusion via Triton or torch.compile.
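One rough way to quantify the split from the same `key_averages()` output (a sketch that ignores copy/kernel overlap):
```python
avg = prof.key_averages()
copy_us = sum(e.cuda_time_total for e in avg
              if "copy" in e.key.lower() or "memcpy" in e.key.lower())
total_us = sum(e.cuda_time_total for e in avg if e.cuda_time_total > 0)
print(f"transfer: {copy_us:.0f} us, kernels: {total_us - copy_us:.0f} us")
```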
### Step 3: Is the overhead per-call or proportional?
- **Per-call fixed cost**: Tensor creation, stream synchronization, small H2D transfers. These don't scale with batch size — they matter most for single-image real-time inference.
- **Proportional cost**: Large tensor transfers, compute-heavy kernels. These scale with batch size and image resolution.
Per-call fixed costs compound in streaming/real-time applications where every millisecond of latency matters. Proportional costs matter more in throughput-oriented batch processing.
## GPU Warm-Up
Always warm up the GPU before benchmarking. The first few inference calls include:
- CUDA context initialization (~100-500ms)
- TensorRT engine deserialization (~1-10s)
- cuDNN autotuning (~100ms per unique input shape)
- JIT compilation for torch.compile or Triton kernels
```python
# Standard warm-up: run 10-20 inference calls and discard timing
for _ in range(20):
run_inference(model, input_data)
torch.cuda.synchronize()
# NOW start measuring
```
**Never benchmark cold GPU state** — the variance will mask real optimization effects.


@ -0,0 +1,47 @@
# GPU/CUDA Handoff Template
> Base template: `../shared/handoff-template.md` (use with `{{DOMAIN_PREFIX}}` = `gpu`)
## Additional Environment Fields
Add these to the Environment section of the base handoff template:
```markdown
## Environment
| Key | Value |
|-----|-------|
| GPU model | e.g., Tesla T4, A100 80GB, RTX 4090 |
| CUDA version | e.g., 12.1 |
| cuDNN version | e.g., 8.9.7 |
| TensorRT version | e.g., 8.6.1 (or N/A) |
| Triton version | e.g., 2.1.0 (or N/A) |
| PyTorch version | e.g., 2.1.0+cu121 |
| CUDA graphs | enabled / disabled / N/A |
| Inference backend | PyTorch / TensorRT / ONNX Runtime |
```
## Hotspot Summary Extension
The "Impact" column for GPU targets should report **stage latency** rather than cumtime percentage:
```markdown
## Hotspot Summary
| # | Target | Stage | Latency (us) | Pattern | Status |
|---|--------|-------|-------------|---------|--------|
| 1 | postprocess dispatch gaps | postprocess | 1021 | kernel-fusion | keep — 12us after |
| 2 | preprocess H2D staging | preprocess.h2d | 300 | pageable-to-pinned | keep — 180us after |
| 3 | norm tensor creation | preprocess.tensor | 150 | loop-invariant | keep — 0us after |
| 4 | stream synchronize | sync | 1000 | event-based-sync | keep — 50us after |
```
## GPU-Specific State to Record
At session end, record in HANDOFF.md:
- **Current E2E latency**: median over 100+ warm inference calls
- **Per-stage breakdown**: preprocess, H2D, inference, D2H, postprocess (in microseconds)
- **Transfer summary**: number of H2D/D2H events per call, total bytes transferred
- **Kernel count**: number of CUDA kernel launches per inference call
- **Remaining targets**: what's left and why it can't be optimized (inside TRT engine, at PCIe ceiling, etc.)


@ -0,0 +1,303 @@
# GPU/CUDA Antipattern Catalog
## Profiling Tools
```bash
# torch.profiler — primary GPU profiling tool:
$RUNNER -c "
import torch
from torch.profiler import profile, ProfilerActivity
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], with_stack=True, record_shapes=True) as prof:
# ... run inference ...
pass
print(prof.key_averages().table(sort_by='cuda_time_total', row_limit=20))
"
# nsys — authoritative validation:
nsys profile --trace=cuda,nvtx,osrt --cuda-memory-usage=true -o /tmp/nsys_profile $RUNNER inference_script.py
# CUDA Event micro-benchmark — quick per-stage timing:
# See guide.md for template
```
## Antipattern Catalog
All patterns tagged: **[transfer]** = data movement, **[dispatch]** = kernel launch overhead, **[sync]** = synchronization, **[memory]** = GPU memory management, **[compute]** = kernel/operation efficiency, **[correctness]** = bug that defeats optimization.
### HIGH impact
**[transfer] Loop-invariant tensor creation on GPU** — Per-call H2D for constants
Symptom: Small H2D transfers visible in nsys every inference call. Tensors for normalization parameters, index arrays, or constant lookup tables re-created per call.
```python
# BEFORE: creates mean/std tensors on CPU, transfers to GPU every call
def preprocess(image, device):
tensor = torch.from_numpy(image).to(device)
mean = torch.tensor([0.485, 0.456, 0.406], device=device).view(1, -1, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225], device=device).view(1, -1, 1, 1)
return (tensor - mean) / std
```
```python
# AFTER: cache on GPU, reuse across calls
_norm_mean = None
_norm_std = None
def preprocess(image, device):
global _norm_mean, _norm_std
if _norm_mean is None or not _norm_mean.is_cuda:
_norm_mean = torch.tensor([0.485, 0.456, 0.406], device=device, dtype=torch.float32).view(1, -1, 1, 1)
_norm_std = torch.tensor([0.229, 0.224, 0.225], device=device, dtype=torch.float32).view(1, -1, 1, 1)
tensor = torch.from_numpy(image).to(device)
return (tensor - _norm_mean) / _norm_std
```
Variants:
- **Channel swap index**: `tensor[:, [2,1,0], :, :]` creates a temporary CPU tensor from the Python list, then transfers it. Cache: `_idx = torch.tensor([2,1,0], device=device)`
- **Anchor grids**: Object detection anchor/prior tensors computed per call
- **Positional encodings**: Transformer positional encoding tensors
- Typical impact: 0.1-0.5ms per call per cached tensor
- Why cProfile misses it: Python-side cost is microseconds; the pipeline bubble is in the CUDA driver H2D path
---
**[transfer] Pageable memory for H2D transfers** — Driver-side staging overhead
Symptom: H2D transfers in nsys show a staging copy (pageable -> pinned -> GPU). `torch.from_numpy()` on a regular numpy array creates a pageable CPU tensor.
```python
# BEFORE: pageable memory — requires driver staging for H2D
tensor = torch.from_numpy(image_np).to(device)
```
```python
# AFTER: pinned memory buffer reuse — direct DMA transfer
_pinned_buffer = None
def to_gpu_pinned(image_np, device):
global _pinned_buffer
needed = image_np.shape
if _pinned_buffer is None or _pinned_buffer.shape != needed:
_pinned_buffer = torch.empty(needed, dtype=torch.uint8, pin_memory=True)
_pinned_buffer.numpy()[:] = image_np # copy into pinned buffer (CPU-side, fast)
return _pinned_buffer.to(device, non_blocking=True)
```
- Typical impact: 20-40% faster H2D for medium tensors (1-10 MB). Larger tensors see less relative gain (DMA bandwidth dominates).
- Crossover: Worth it for transfers > ~4 KB. Below that, the overhead of managing the pinned buffer isn't justified.
- Why cProfile misses it: The staging overhead is inside the CUDA driver, not in Python
- Caveat: Pinned memory is a limited resource. Don't pin large amounts (>1GB) or many small buffers. One reusable buffer per pipeline stage is typical.
---
**[dispatch] Many small PyTorch operations where a single fused kernel suffices** — Kernel dispatch gap accumulation
Symptom: torch.profiler shows 10-50+ consecutive CUDA kernels on the same stream with short durations (<100us each). nsys timeline shows visible gaps (~25us) between each kernel. Total wall time for the sequence is 5-20x the sum of kernel compute times.
```python
# BEFORE: 29 separate kernel launches for post-processing
probs = torch.sigmoid(logits) # kernel 1
confidence, class_id = probs.max(dim=-1) # kernels 2-3
mask = confidence > threshold # kernel 4
confidence = confidence[mask] # kernel 5
class_id = class_id[mask] # kernel 6
sorted_idx = torch.argsort(confidence, descending=True) # kernels 7-8
# ... box transform, denormalization, rescaling ... # kernels 9-29
```
```python
# AFTER: single Triton kernel fusing all operations
@triton.jit
def fused_postprocess_kernel(logits_ptr, bboxes_ptr, output_ptr,
num_classes, dw, dh, inv_sw, inv_sh,
pad_l, pad_t, crop_x, crop_y, BLOCK_C: tl.constexpr):
query_id = tl.program_id(0)
# Fused: sigmoid + max + box transform + denorm + rescale
# Writes directly to pinned CPU output buffer
...
```
Alternative to Triton: `torch.compile(fullgraph=True)` can fuse many operations automatically, though with less control over the output.
- Typical impact: 10-100x for the fused region. Post-processing 1021us -> 12us in the RT-DETR case.
- Worth fusing when: >5 sequential kernel launches with <100us each, total dispatch gap >100us
- Why cProfile misses it: Each PyTorch op's Python overhead is microseconds. The dispatch gaps between GPU kernel launches are invisible to Python profilers.
---
**[sync] Blocking stream.synchronize() between dependent GPU stages** — CPU-GPU serialization
Symptom: CPU blocks for 0.5-2ms waiting for GPU work to complete before launching the next stage. In torch.profiler: high CPU time on `cudaStreamSynchronize`. In nsys: CPU idle period aligned with GPU work on another stream.
```python
# BEFORE: CPU waits for GPU preprocessing to finish before launching inference
self._preprocess_stream.synchronize() # CPU blocks ~1ms
with torch.cuda.stream(self._inference_stream):
    self._run_forward()   # placeholder forward launch; can't start until the sync returns
```
```python
# AFTER: GPU-to-GPU dependency via events — CPU proceeds immediately
event = self._preprocess_stream.record_event()
self._inference_stream.wait_event(event) # GPU waits, CPU does not
with torch.cuda.stream(self._inference_stream):
    self._run_forward()   # placeholder forward launch; CPU proceeds immediately
```
- Typical impact: 0.5-1ms per synchronization point removed
- When to keep synchronize(): CUDA graph capture/replay requires it (graphs are fixed op sequences that ignore cross-stream events outside capture). Also needed when CPU must read GPU results before proceeding.
- Why cProfile misses it: `synchronize()` shows as idle CPU time, not as a hot function.
---
### MEDIUM impact
**[compute] torch operations on small CPU arrays** — Dispatch overhead > compute time
Symptom: `torch.sort()`, `torch.argsort()`, `torch.isin()` called on CPU tensors with <1000 elements. torch's per-op dispatch overhead (~10-50us) dominates the actual computation time for small arrays.
```python
# BEFORE: torch on CPU, slow for small arrays
confidence_sorted, sorted_idx = torch.sort(confidence_cpu, descending=True)
mask = torch.isin(class_ids, valid_class_ids_tensor)
```
```python
# AFTER: numpy for small CPU arrays — 2-5x faster
conf_np = confidence_cpu.numpy()
order = np.argsort(-conf_np)
mask = np.isin(class_ids_np, valid_class_ids_np)  # pass an array, not a Python set (np.isin mishandles sets)
```
- Crossover: numpy wins below ~1000 elements for sort/argsort/isin. Below ~100 elements, numpy is 3-5x faster.
- Typical impact: 2-5x for individual operations, 0.1-0.5ms per pipeline stage
- When NOT to do this: If the data is already a GPU tensor and will continue as a GPU tensor — don't move it to CPU just for numpy. This is for CPU-side post-processing after D2H.
---
**[transfer] Unnecessary D2H synchronization for scalar reads** — Hidden GPU->CPU roundtrips
Symptom: Unexpected `cudaStreamSynchronize` in nsys caused by seemingly innocuous Python code. Common hidden triggers:
```python
# HIDDEN SYNC 1: .item() forces D2H + sync
if tensor.max().item() > threshold: # sync!
...
# HIDDEN SYNC 2: torchvision.transforms.functional.normalize has an internal check
# (std == 0).any() — transfers a boolean from GPU to CPU
output = torchvision.transforms.functional.normalize(tensor, mean, std) # sync!
# HIDDEN SYNC 3: print/logging forces sync
print(f"Shape: {tensor.shape}, Max: {tensor.max()}") # sync!
# HIDDEN SYNC 4: Python boolean check on GPU tensor
if tensor.any(): # sync!
...
```
Fix: Batch reads at end of pipeline, use `non_blocking=True` where possible, replace library functions that have hidden syncs with inline equivalents.
- Typical impact: 0.05-0.5ms per hidden sync point
- Why cProfile misses it: The sync shows as CPU idle time, not attributable to the line that triggered it
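A minimal sketch of the batched-read fix (`model` and `batches` are placeholders):
```python
maxes = []
for batch in batches:
    maxes.append(model(batch).max())      # result stays on GPU: no per-iteration sync
maxes_cpu = torch.stack(maxes).cpu()      # single D2H read at the end of the pipeline
```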
---
**[memory] Repeated GPU tensor allocation in inference loop** — Allocation overhead per call
Symptom: `torch.empty()`, `torch.zeros()`, or `torch.ones()` called inside the per-frame inference loop for buffers that could be pre-allocated.
```python
# BEFORE: allocate output buffer every call
def postprocess(logits, num_queries):
output = torch.empty(num_queries, 6, dtype=torch.float32) # allocation per call
...
```
```python
# AFTER: pre-allocate and reuse
def __init__(self):
self._output_buffer = torch.empty(300, 6, dtype=torch.float32, pin_memory=True)
def postprocess(self, logits, num_queries):
output = self._output_buffer[:num_queries] # view into pre-allocated buffer
...
```
- Typical impact: 0.05-0.2ms per avoided allocation (depends on tensor size)
- Pinned buffer variant: For D2H outputs, pre-allocate with `pin_memory=True` so the GPU kernel can write directly to host memory.
---
**[correctness] Device comparison with == instead of .is_cuda** — Caching defeated
Symptom: Tensors that should be cached on GPU are re-created every call. The caching check uses device equality, which fails due to device index mismatch.
```python
# BUG: a cuda:0 tensor's .device != torch.device('cuda') is always True, so the cache is rebuilt every call
if _cached_tensor is None or _cached_tensor.device != torch.device('cuda'):
_cached_tensor = torch.tensor(data, device=target_device) # re-created every call
```
```python
# FIX: .is_cuda checks for any CUDA device index
if _cached_tensor is None or not _cached_tensor.is_cuda:
_cached_tensor = torch.tensor(data, device=target_device) # cached after first call
```
- Impact: Defeats caching, causing per-call tensor creation and H2D transfer overhead
- Pattern class: "caching defeated by incorrect cache key" — a class of bugs specific to PyTorch device handling
- Always fix: This is a correctness bug, not just a performance issue
---
### LOW impact (only on confirmed hot path)
**[dispatch] CPU-side preprocessing that could be a CUDA kernel**
When CPU preprocessing (resize, color conversion, normalization) takes >1ms and the input is going to GPU anyway, consider doing the work on GPU. Libraries: `torchvision.transforms` on CUDA tensors, NVIDIA DALI, custom CUDA kernels.
Caveat: Moving preprocessing to GPU increases GPU utilization — only beneficial if GPU is underutilized during preprocessing.
---
**[sync] Missing non_blocking on tensor.to(device)**
```python
# BEFORE: blocks CPU until transfer complete
gpu_tensor = cpu_tensor.to(device)
# AFTER: CPU proceeds immediately, sync later when needed
gpu_tensor = cpu_tensor.to(device, non_blocking=True)
```
Requires the source tensor to be in pinned memory for `non_blocking` to be effective. If the source is pageable memory, `non_blocking=True` is silently ignored.
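A two-line sketch of the combination that actually overlaps (`cpu_tensor` and `device` assumed):
```python
pinned = cpu_tensor.pin_memory()                     # one-time cost, or reuse a pinned buffer
gpu_tensor = pinned.to(device, non_blocking=True)    # true async DMA on the current stream
```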
---
**[memory] FP32 inference when FP16 would suffice**
When the model supports half precision and numerical accuracy is acceptable:
```python
model = model.half()
# or
with torch.autocast(device_type='cuda', dtype=torch.float16):
output = model(input_tensor)
```
Caveat: Not all models are stable in FP16. Test numerical output equivalence.
---
## Data Size Thresholds
| Pattern | Worth fixing when |
|---------|------------------|
| Loop-invariant tensor creation | Always (fixed per-call cost) |
| Pageable -> pinned memory | Transfer size > ~4 KB |
| Kernel fusion | >5 sequential kernel launches with <100us each |
| Event-based stream sync | `synchronize()` blocks CPU for >0.3ms |
| Small-array CPU fallback | Array has <1000 elements on CPU-side torch |
| Device comparison bug | Always (correctness bug) |
| Repeated GPU allocation | Buffer allocated >100 times/sec |
| Hidden D2H sync | Any occurrence in hot loop |