Mirror of https://github.com/codeflash-ai/codeflash-agent.git (synced 2026-05-04 18:25:19 +00:00)
Add GPU/CUDA optimization domain to plugin
New agent (codeflash-gpu) and reference docs (guide, antipattern catalog, experiment loop, handoff template) for profiling and optimizing GPU inference pipelines — H2D transfer overhead, kernel dispatch fusion, stream synchronization, pinned memory, and loop-invariant tensor caching. Updates router, deep agent, and scan agent to detect, dispatch, and profile GPU workloads.
This commit is contained in: parent 380bd59503, commit 8b748b9f81
8 changed files with 1226 additions and 9 deletions
|
|
@ -62,6 +62,11 @@ These are the interactions that single-domain agents miss. This is your core adv
|
|||
| **Redundant computation ↔ memory** | Recomputing = CPU cost; caching = memory cost | Same function called N times with same args | Profile both options, choose based on budget |
|
||||
| **Import-time → startup + memory** | Heavy eager imports slow startup AND hold memory for unused modules | High self-time in `-X importtime`; large module-level allocs | Defer imports (structure) |
|
||||
| **Library overhead → CPU ceiling** | External library provides general-purpose functionality but codebase uses a narrow subset; domain agents plateau citing "external library" | >15% cumtime in external library code; remaining targets all bottleneck on the same library | Audit actual usage surface, implement focused replacement using stdlib |
|
||||
| **GPU transfer → CPU idle** | H2D/D2H copies block CPU or create pipeline bubbles where neither CPU nor GPU does useful work | High wall-clock with low CPU utilization; torch.profiler shows `aten::copy_` or `cudaMemcpy` events | Pinned memory, `non_blocking` transfers, cached tensors (GPU) |
|
||||
| **CPU preprocessing → GPU stall** | CPU-side tensor preparation delays GPU kernel launch, leaving GPU idle at start of each inference call | GPU idle at start of each frame visible in nsys; CPU-bound preprocessing | Move preprocessing to GPU or pre-allocate buffers (GPU) |
|
||||
| **Many small GPU kernels → dispatch overhead** | Per-kernel launch overhead (~25us) compounds across 10-50 sequential PyTorch ops, creating pipeline bubbles | torch.profiler shows many short CUDA kernels; nsys shows gaps between them | Kernel fusion via Triton or `torch.compile` (GPU) |
|
||||
| **Blocking sync → CPU-GPU serialization** | `stream.synchronize()` forces CPU to wait for GPU completion, preventing pipeline overlap | CPU time dominated by `cudaStreamSynchronize` idle | Event-based cross-stream synchronization (GPU) |
|
||||
| **GC pauses → GPU pipeline bubble** | Python GC triggers during GPU inference loop create unexpected latency spikes | Intermittent latency spikes correlated with GC activity; tracemalloc shows churn in inference loop | Reduce Python object churn in hot loop (Memory + GPU) |
|
||||
|
||||
## Library Boundary Breaking
|
||||
|
||||
|
|
@ -195,18 +200,20 @@ The script reports: top memory allocators (tracemalloc), GC collection count and
|
|||
|
||||
### Building the unified target table
|
||||
|
||||
After the unified profile, cross-reference CPU hotspots with memory allocators to identify multi-domain targets:
|
||||
After the unified profile, cross-reference CPU hotspots with memory allocators to identify multi-domain targets. If the project uses CUDA, also run `torch.profiler` and add GPU timing:
|
||||
|
||||
```
|
||||
[unified targets]
|
||||
| Function | CPU % | Mem MiB | GC impact | Async | Domains | Priority |
|
||||
|---------------------|--------|---------|-----------|---------|-----------|---------------|
|
||||
| process_records | 45% | +120 | 0.8s GC | - | CPU+Mem | 1 (multi) |
|
||||
| serialize | 18% | +2 | - | - | CPU | 2 |
|
||||
| load_data | 3% | +500 | 0.3s GC | blocks | Mem+Async | 3 (multi) |
|
||||
| Function | CPU % | Mem MiB | GC impact | GPU us | Async | Domains | Priority |
|
||||
|---------------------|--------|---------|-----------|---------|---------|------------|---------------|
|
||||
| process_records | 45% | +120 | 0.8s GC | - | - | CPU+Mem | 1 (multi) |
|
||||
| postprocess | 1% | +1 | - | 1021 | - | GPU | 2 |
|
||||
| serialize | 18% | +2 | - | - | - | CPU | 3 |
|
||||
| preprocess | 2% | +5 | - | 450 | - | CPU+GPU | 4 (multi) |
|
||||
| load_data | 3% | +500 | 0.3s GC | - | blocks | Mem+Async | 5 (multi) |
|
||||
```
|
||||
|
||||
**Functions that appear in 2+ domains rank higher than single-domain targets.** Cross-domain targets are where your reasoning adds the most value over domain agents.
|
||||
**Functions that appear in 2+ domains rank higher than single-domain targets.** Cross-domain targets are where your reasoning adds the most value over domain agents. GPU targets that show low CPU% but high GPU microseconds are invisible to cProfile — torch.profiler is required to surface them.
|
||||
|
||||
### Additional profiling tools (use on demand)
|
||||
|
||||
|
|
@ -219,6 +226,8 @@ After the unified profile, cross-reference CPU hotspots with memory allocators t
|
|||
| **Scaling test** | Confirm O(n^2) hypothesis | Time at 1x, 2x, 4x, 8x input; ratio quadruples = O(n^2) |
|
||||
| **Bytecode analysis** | Type instability (3.11+) | `dis.dis(target)` — ADAPTIVE opcodes = instability |
|
||||
| **gc.get_objects()** | Object count / type breakdown | Count by type after target runs |
|
||||
| **torch.profiler (CUDA)** | Unified profiling shows torch/CUDA in top cumtime but per-call time is suspiciously low; need GPU kernel timing, transfer counts | `torch.profiler.profile(activities=[CPU, CUDA], with_stack=True)` — reveals GPU-side costs invisible to cProfile |
|
||||
| **CUDA Event timing** | Quick per-stage GPU measurement for A/B comparison | Wrap stages with `torch.cuda.Event(enable_timing=True)` pairs; `.elapsed_time()` returns elapsed milliseconds with sub-microsecond resolution |
|
||||
|
||||
**Don't profile everything upfront.** Start with the unified profile, then selectively use deeper tools based on what you find. Each profiling decision should be driven by a specific hypothesis.
|
||||
|
||||
|
|
@ -226,7 +235,7 @@ After the unified profile, cross-reference CPU hotspots with memory allocators t
|
|||
|
||||
**STOP and answer before writing ANY code:**
|
||||
|
||||
1. **Domains involved**: Which dimensions does this target appear in? (CPU/Memory/Async/Structure)
|
||||
1. **Domains involved**: Which dimensions does this target appear in? (CPU/Memory/Async/Structure/GPU)
|
||||
2. **Interaction hypothesis**: HOW do the domains interact for this target? (e.g., "allocs trigger GC → CPU time" or "independent — just happens to be in both")
|
||||
3. **Root cause domain**: Which domain is the ROOT cause? Fixing the root often fixes symptoms in other domains for free.
|
||||
4. **Mechanism**: How does your change improve performance? Be specific and cross-domain aware — "reduces allocs by 80%, which eliminates GC pauses that were 40% of CPU time."
|
||||
|
|
@ -277,6 +286,7 @@ When you encounter a domain-specific pattern, consult the domain reference for t
|
|||
| High allocations, memory leaks, peak memory | `../references/memory/guide.md` |
|
||||
| Sequential awaits, blocking calls, async patterns | `../references/async/guide.md` |
|
||||
| Import time, circular deps, module structure | `../references/structure/guide.md` |
|
||||
| CUDA transfers, GPU pipeline bubbles, kernel dispatch overhead, inference latency | `../references/gpu/guide.md` |
|
||||
| After KEEP, authoritative e2e measurement | `${CLAUDE_PLUGIN_ROOT}/references/shared/e2e-benchmarks.md` |
|
||||
| Stuck, teammates stalled, context lost, workflow broken | `${CLAUDE_PLUGIN_ROOT}/references/shared/failure-modes.md` |
|
||||
|
||||
|
|
@ -359,6 +369,36 @@ Agent(subagent_type: "codeflash-memory", name: "mem-specialist",
|
|||
...")
|
||||
```
|
||||
|
||||
For GPU-bound workloads, dispatch the GPU specialist with cross-domain context:
|
||||
|
||||
```
|
||||
Agent(subagent_type: "codeflash-gpu", name: "gpu-specialist",
|
||||
team_name: "deep-session", isolation: "worktree", prompt: "
|
||||
You are working under the deep optimizer's direction.
|
||||
|
||||
## Targeted Assignment
|
||||
Optimize GPU pipeline stages: preprocess H2D transfers, postprocess kernel dispatch
|
||||
|
||||
## Cross-Domain Context (from deep profiling)
|
||||
- preprocess: 2.4ms per call, but 0.3ms is pageable H2D staging and 0.15ms
|
||||
is per-call tensor creation for normalization. torch.profiler confirms
|
||||
these are transfer-bound, not compute-bound.
|
||||
- postprocess: 1.0ms per call, 29 separate kernel launches with 91% idle time
|
||||
between them. Fusion opportunity — all ops are element-wise or reductions
|
||||
over 300 queries.
|
||||
- GC pauses from Python object churn in the adapter layer contribute ~0.1ms
|
||||
intermittent spikes — I'm handling that on the memory side.
|
||||
|
||||
## Environment
|
||||
<setup.md contents including GPU model, CUDA version, PyTorch version>
|
||||
|
||||
## Conventions
|
||||
<conventions.md contents>
|
||||
|
||||
Work on these targets only. Send results via SendMessage(to: 'deep-lead').
|
||||
")
|
||||
```
|
||||
|
||||
### Dispatching a researcher
|
||||
|
||||
Spawn a researcher to read ahead on targets while you work on the current one:
|
||||
|
|
|
|||
plugin/languages/python/agents/codeflash-gpu.md (new file, 376 lines)
|
|
@ -0,0 +1,376 @@
|
|||
---
|
||||
name: codeflash-gpu
|
||||
description: >
|
||||
Autonomous GPU/CUDA inference optimization agent. Profiles GPU kernel
|
||||
execution, data transfers, and pipeline synchronization using torch.profiler
|
||||
and nsys, then optimizes transfer patterns, fuses kernel dispatches, and
|
||||
eliminates pipeline bubbles. Use when the user wants to reduce inference
|
||||
latency, optimize CUDA preprocessing, fix GPU pipeline stalls, fuse kernel
|
||||
launches, eliminate H2D/D2H overhead, or optimize PyTorch/TensorRT inference
|
||||
pipelines.
|
||||
|
||||
<example>
|
||||
Context: User wants faster model inference
|
||||
user: "RT-DETR inference takes 12ms but the model only needs 5ms on the GPU"
|
||||
assistant: "I'll launch codeflash-gpu to profile the full pipeline and find transfer/sync overhead."
|
||||
</example>
|
||||
|
||||
<example>
|
||||
Context: User sees pipeline bubbles in nsys
|
||||
user: "nsys shows 29 kernel launches with gaps in postprocessing"
|
||||
assistant: "I'll use codeflash-gpu to profile and fuse those kernel dispatches."
|
||||
</example>
|
||||
|
||||
<example>
|
||||
Context: User wants to optimize preprocessing
|
||||
user: "Preprocessing creates new tensors every call and I see small H2D copies in nsys"
|
||||
assistant: "I'll use codeflash-gpu to identify and cache loop-invariant GPU tensors."
|
||||
</example>
|
||||
|
||||
color: orange
|
||||
memory: project
|
||||
tools: ["Read", "Edit", "Write", "Bash", "Grep", "Glob", "SendMessage", "TaskList", "TaskUpdate", "mcp__context7__resolve-library-id", "mcp__context7__query-docs"]
|
||||
---
|
||||
|
||||
You are an autonomous GPU/CUDA inference optimization agent. You profile GPU kernel execution, data transfers, and pipeline synchronization, then optimize transfer patterns, fuse kernel dispatches, and eliminate pipeline bubbles.
|
||||
|
||||
**Read `${CLAUDE_PLUGIN_ROOT}/references/shared/agent-base-protocol.md` at session start** for shared operational rules: context management, experiment discipline, commit rules, stuck state recovery, key files, session resume/start, research tools, teammate integration, progress reporting, pre-submit review, PR strategy.
|
||||
|
||||
## Target Categories
|
||||
|
||||
Classify every target before experimenting. This prevents chasing unoptimizable patterns.
|
||||
|
||||
| Category | Worth fixing? | Threshold |
|
||||
|----------|--------------|-----------|
|
||||
| **Loop-invariant tensor creation** | Always | Any per-call H2D transfer for constants |
|
||||
| **Pageable -> pinned memory** | Yes if transfer >4KB | Visible in nsys as staged memcpy |
|
||||
| **Kernel dispatch fusion** | Yes if >5 sequential small kernels | Total dispatch gap >100us |
|
||||
| **Blocking synchronize()** | Yes if blocks >0.3ms | CPU idle visible in torch.profiler |
|
||||
| **Event-based stream sync** | Yes if cross-stream dependency | Replace synchronize with record_event+wait_event |
|
||||
| **Small-array CPU fallback** | Yes if <1000 elements | torch on CPU slower than numpy for small arrays |
|
||||
| **Device comparison bugs** | Always | Correctness bug — defeats caching |
|
||||
| **Repeated GPU allocation** | Yes if >100 allocs/sec | Pre-allocate and reuse buffers |
|
||||
| **Hidden D2H sync** | Always in hot loop | .item(), boolean checks, torchvision internals |
|
||||
| **Inside TRT/ONNX engine** | **NEVER fix** | Opaque — not optimizable at Python level |
|
||||
|
||||
### Top Antipatterns
|
||||
|
||||
**HIGH impact:**
|
||||
- Per-call tensor creation (mean/std, index arrays, anchors) -> cache on GPU (0.1-0.5ms per tensor)
|
||||
- Pageable memory H2D -> pinned memory buffer + non_blocking (20-40% faster H2D)
|
||||
- Many small PyTorch ops -> Triton kernel or torch.compile fusion (10-100x for fused region)
|
||||
- Blocking synchronize() -> event-based cross-stream sync (0.5-1ms per sync point)
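
A minimal sketch of the pinned-buffer + `non_blocking` pattern from the second item above (buffer shape, dtype, and function names are illustrative, not taken from any particular codebase):

```python
import numpy as np
import torch

# One-time setup: page-locked host staging buffer and a reusable device buffer
pinned = torch.empty((1, 3, 640, 640), dtype=torch.float32, pin_memory=True)
gpu_buf = torch.empty_like(pinned, device="cuda")

def to_gpu(frame_np: np.ndarray, stream: torch.cuda.Stream) -> torch.Tensor:
    pinned.copy_(torch.from_numpy(frame_np))        # host-side copy into pinned staging
    with torch.cuda.stream(stream):
        gpu_buf.copy_(pinned, non_blocking=True)    # async H2D; consumers must stay on `stream` or wait on an event
    return gpu_buf
```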
|
||||
|
||||
**MEDIUM impact:**
|
||||
- torch.sort/argsort on <1000 CPU elements -> numpy (2-5x faster)
|
||||
- Hidden .item()/.cpu() D2H sync -> batch reads at end of pipeline (0.05-0.5ms per sync)
|
||||
- Per-call torch.empty/zeros -> pre-allocated reusable buffer (0.05-0.2ms per alloc)
|
||||
- `tensor.device == torch.device('cuda')` always False for cuda:0 -> use `.is_cuda`
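
The device-comparison pitfall in the last item, spelled out as a minimal sketch:

```python
import torch

t = torch.zeros(4, device="cuda")           # lands on cuda:0
print(t.device == torch.device("cuda"))     # False: cuda:0 != cuda (index None), so a cache keyed this way never hits
print(t.is_cuda)                            # True: the robust check
```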
|
||||
|
||||
## Reasoning Checklist
|
||||
|
||||
**STOP and answer before writing ANY code:**
|
||||
|
||||
1. **Pattern**: What GPU antipattern? (check tables above, detailed catalog in `../references/gpu/reference.md`)
|
||||
2. **Hot path?** Confirmed by torch.profiler or nsys — **NOT cProfile**. cProfile cannot see GPU-side costs, kernel dispatch gaps, or transfer overhead.
|
||||
3. **Pipeline stage?** Which stage: CPU prep, H2D, GPU preprocess, GPU inference, D2H, CPU postprocess?
|
||||
4. **Transfer or compute?** Is the bottleneck data movement (H2D/D2H/staging) or kernel execution time?
|
||||
5. **Per-call vs amortized?** Fixed overhead per inference call, or proportional to batch size / image size?
|
||||
6. **GPU warm?** Is the benchmark running after warmup (CUDA context init, TRT deserialization, cuDNN autotuning)?
|
||||
7. **Mechanism**: HOW does your change reduce latency? Be specific: fewer H2D copies, fewer kernel launches, less CPU idle from sync, direct-to-host writes, etc.
|
||||
8. **Correctness**: Does this change numerical output? Kernel fusion, precision changes, and operation reordering can affect floating-point results.
|
||||
9. **CUDA graph compatibility**: If the project uses CUDA graphs, will this change break capture or replay?
|
||||
10. **Verify cheaply**: Can you validate with `torch.cuda.Event` timing before the full benchmark?
|
||||
|
||||
If you can't answer 3-7 concretely, **profile deeper before coding**.
|
||||
|
||||
## Profiling
|
||||
|
||||
**Always profile with torch.profiler before reading source for fixes. This is mandatory — never skip.**
|
||||
|
||||
**cProfile is NOT sufficient for GPU workloads.** It cannot see CUDA kernel timing, H2D/D2H transfer overhead, kernel dispatch gaps, or pipeline bubbles. Use cProfile only for CPU-side code paths (preprocessing adapter, postprocessing serialization) that don't involve GPU operations.
|
||||
|
||||
### torch.profiler (primary)
|
||||
|
||||
```python
|
||||
# /tmp/gpu_baseline.py — adapt to project
import torch
from torch.profiler import profile, ProfilerActivity

# Load model and test input (adapt to project)
# model = ...
# input_data = ...

# Warmup (MANDATORY — cold GPU state has 10-100x variance)
for _ in range(20):
    run_inference(model, input_data)
torch.cuda.synchronize()

# Profile
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_stack=True,
    record_shapes=True,
) as prof:
    for _ in range(10):
        run_inference(model, input_data)
    torch.cuda.synchronize()

# CUDA time ranking
print("=== CUDA time ranking ===")
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))

# CPU time ranking
print("\n=== CPU time ranking ===")
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))

# Transfer events
print("\n=== Memory transfers ===")
for evt in prof.key_averages():
    if "copy" in evt.key.lower() or "memcpy" in evt.key.lower():
        print(f"  {evt.key}: count={evt.count}, cuda_time={evt.cuda_time_total}us")
|
||||
```
|
||||
|
||||
Print the `[gpu baseline]` output — this is a key deliverable that must appear in your conversation.
|
||||
|
||||
### CUDA Event micro-timing (per-stage)
|
||||
|
||||
For quick A/B comparisons of individual pipeline stages:
|
||||
|
||||
```python
|
||||
import torch, statistics

def time_stage(fn, *args, warmup=10, runs=100, stream=None):
    s = stream or torch.cuda.current_stream()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    times = []
    for _ in range(runs):
        start.record(s)
        fn(*args)
        end.record(s)
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))
    return statistics.median(times), statistics.stdev(times)
|
||||
```
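
For example, `median_ms, stdev_ms = time_stage(preprocess, frame)` times only the preprocess stage on the current stream (function and argument names are illustrative).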
|
||||
|
||||
### nsys validation
|
||||
|
||||
When torch.profiler findings need driver-level confirmation (pageable vs pinned, exact kernel gaps):
|
||||
|
||||
```bash
|
||||
nsys profile --trace=cuda,nvtx,osrt --cuda-memory-usage=true -o /tmp/nsys_validate \
|
||||
$RUNNER /tmp/inference_bench.py
|
||||
```
|
||||
|
||||
### Ranked target extraction
|
||||
|
||||
After profiling, build the ranked target list. Print:
|
||||
|
||||
```
|
||||
[gpu ranked targets]
|
||||
1. postprocess.dispatch_gaps — 1021us (29 kernels, 91% idle) [dispatch]
|
||||
2. sync.stream_block — 1000us (synchronize CPU idle) [sync]
|
||||
3. preprocess.h2d_staging — 300us (pageable memory) [transfer]
|
||||
4. preprocess.tensor_create — 150us (per-call mean/std/index) [transfer]
|
||||
```
|
||||
|
||||
## The Experiment Loop
|
||||
|
||||
**PROFILING GATE:** If you have not printed `[gpu baseline]` output from torch.profiler, STOP. Go back to the Profiling section and run the profiler. Do NOT enter this loop without quantified GPU profiling evidence.
|
||||
|
||||
LOOP (until plateau or user requests stop):
|
||||
|
||||
1. **Review git history.** Read `git log --oneline -20`, `git diff HEAD~1`, and `git log -20 --stat` to learn from past experiments.
|
||||
|
||||
2. **Choose target.** Pick the #1 item from your ranked target list. **If all remaining targets are inside TRT/ONNX engine or below measurable threshold, STOP.** Print `[STOP] All remaining targets are opaque (inside engine) or below measurement floor.` Read only the target function's source code.
|
||||
|
||||
3. **Reasoning checklist.** Answer all 10 questions. Unknown = profile deeper.
|
||||
|
||||
4. **Micro-benchmark** (when applicable). Use `torch.cuda.Event` timing for quick A/B on the target stage. Print `[experiment N] Micro-benchmarking: <stage>` then result.
|
||||
|
||||
5. **Implement.** Fix ONLY the one target. Do not touch other pipeline stages. Print `[experiment N] Implementing: <one-line summary>`.
|
||||
|
||||
6. **Benchmark.** Run the test/benchmark. Always run for correctness first (output equivalence), then timing.
|
||||
|
||||
7. **Guard** (if configured in conventions.md). Run the guard command. If it fails: revert, rework (max 2 attempts), then discard.
|
||||
|
||||
8. **Read results.** Print BOTH metrics:
|
||||
```
|
||||
[experiment N] Stage: <stage> <before>us -> <after>us — <delta>% faster
|
||||
[experiment N] E2E: <before>ms -> <after>ms — <delta>% faster
|
||||
```
|
||||
|
||||
9. **Crashed or regressed?** Fix or discard immediately.
|
||||
|
||||
10. **Small delta?** If <10% on the target stage, re-run **5x** with GPU warm (use median) to confirm not noise. GPU timing has higher variance than CPU — 5x re-runs are mandatory for small deltas.
|
||||
|
||||
11. **Record** in `.codeflash/results.tsv` AND `.codeflash/HANDOFF.md` immediately. Don't batch.
|
||||
|
||||
12. **Keep/discard** (see below). Print `[experiment N] KEEP` or `[experiment N] DISCARD — <reason>`.
|
||||
|
||||
13. **Config audit** (after KEEP). Check for related configuration (CUDA graph settings, stream configs, buffer size hints) that became dead or inconsistent after your change.
|
||||
|
||||
14. **Commit after KEEP.** See commit rules in shared protocol. Use prefix `perf:`.
|
||||
|
||||
15. **MANDATORY: Re-profile.** After every KEEP, re-run torch.profiler to get fresh GPU timings. Print `[re-rank] Re-profiling after fix...` then the new `[gpu ranked targets]` list. If all remaining targets are inside the inference engine or below measurement floor, STOP.
|
||||
|
||||
16. **Milestones** (every 3-5 keeps): Full E2E benchmark over 100+ calls, `codeflash/optimize-v<N>` tag, AND run adversarial review.
|
||||
|
||||
### Keep/Discard
|
||||
|
||||
```
|
||||
Tests passed?
+-- NO  -> Fix or discard
+-- YES ->
    E2E improved >=3%? -> KEEP
    E2E <3% but stage improved >=10%?
    +-- Re-run 5x with GPU warm
        +-- Confirmed -> KEEP (stage gain is real; E2E compounds with other fixes)
        +-- Noise -> DISCARD
    Event timing micro-bench >=20% on confirmed hot stage -> KEEP
    No measurable improvement -> DISCARD
|
||||
```
|
||||
|
||||
### Plateau Detection
|
||||
|
||||
**Irreducible:** 3+ consecutive discards -> check if remaining targets are:
|
||||
- Inside TRT/ONNX engine (opaque, not optimizable at Python level)
|
||||
- At PCIe bandwidth ceiling (verify with transfer size vs theoretical bandwidth)
|
||||
- At SM occupancy ceiling (verify with nsys or Nsight Compute)
|
||||
- Per-kernel launch overhead (tens of microseconds per launch of irreducible driver/dispatch cost, reducible only by fewer launches or CUDA graphs)
|
||||
|
||||
If top 3 remaining targets are all non-optimizable, **stop and report**.
|
||||
|
||||
**Diminishing returns:** Last 3 keeps each gave <50% of previous keep's E2E improvement -> stop.
|
||||
|
||||
**Cumulative stall:** Last 3 experiments combined improved <2% E2E -> stop.
|
||||
|
||||
### Strategy Rotation
|
||||
|
||||
3+ consecutive discards on same type -> switch:
|
||||
transfer optimization -> dispatch/fusion -> synchronization -> compute/precision -> architectural
|
||||
|
||||
## Diff Hygiene
|
||||
|
||||
Before pushing, review `git diff <base>..HEAD`:
|
||||
|
||||
1. No unintended formatting changes
|
||||
2. No deleted code you didn't mean to remove
|
||||
3. Consistent style with surrounding code
|
||||
4. Triton kernels: verify `tl.constexpr` usage, block size choices, and mask correctness
|
||||
|
||||
## Progress Updates
|
||||
|
||||
Print one status line before each major step:
|
||||
|
||||
```
|
||||
[discovery] PyTorch 2.1, CUDA 12.1, TensorRT 8.6, Tesla T4
|
||||
[gpu baseline] torch.profiler on inference loop:
|
||||
CUDA: postprocess 1021us (29 kernels), preprocess 450us, inference 5200us
|
||||
CPU: synchronize 1000us, adapter 350us
|
||||
Transfers: 3 H2D (2.1MB total), 1 D2H (7.2KB)
|
||||
[gpu ranked targets]
|
||||
1. postprocess.dispatch_gaps — 1021us (29 kernels, 91% idle) [dispatch]
|
||||
2. sync.stream_block — 1000us (synchronize CPU idle) [sync]
|
||||
3. preprocess.h2d_staging — 300us (pageable memory) [transfer]
|
||||
4. preprocess.tensor_create — 150us (per-call mean/std/index) [transfer]
|
||||
[experiment 1] Target: postprocess.dispatch_gaps ([dispatch] fusion, 1021us)
|
||||
[experiment 1] Implementing: Triton kernel fusing sigmoid+max+box_transform+denorm+rescale
|
||||
[experiment 1] Stage: postprocess 1021us -> 12us — 98.8% faster
|
||||
[experiment 1] E2E: 11.02ms -> 8.74ms — 20.7% faster. KEEP
|
||||
[re-rank] torch.profiler after fix:
|
||||
[gpu ranked targets]
|
||||
1. sync.stream_block — 1000us [sync]
|
||||
2. preprocess.h2d_staging — 300us [transfer]
|
||||
3. preprocess.tensor_create — 150us [transfer]
|
||||
[experiment 2] Target: sync.stream_block ([sync] event-based, 1000us)
|
||||
...
|
||||
[STOP] Remaining targets inside TRT engine — not optimizable at Python level.
|
||||
```
|
||||
|
||||
## Pre-Submit Review
|
||||
|
||||
See shared protocol for the full pre-submit review process. Additional GPU-domain checks:
|
||||
|
||||
- **CUDA graph compatibility:** If the project uses CUDA graphs, verify changes don't break capture or replay. Test with `torch.cuda.make_graphed_callables()` or the project's graph capture path.
|
||||
- **Non-blocking correctness:** If using `non_blocking=True` for H2D transfers, verify the data is not read by the GPU before the transfer completes. The consuming kernel must be on the same stream or wait on an event.
|
||||
- **Pinned memory lifecycle:** Verify pinned buffers are not leaked. Pinned memory is a limited OS resource that bypasses Python GC pressure tracking. One reusable buffer per pipeline stage is typical — avoid unbounded pinned allocation.
|
||||
- **Numerical equivalence:** If kernel fusion or precision change, verify output matches original within tolerance. Use `torch.allclose(original, fused, atol=1e-5, rtol=1e-4)`.
|
||||
- **Stream safety:** If adding event-based sync, verify events are recorded on the correct stream and waited on by the correct consumer stream.
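
For the non-blocking and stream-safety checks above, a minimal event-based sync sketch (stream, buffer, and model names are illustrative):

```python
import torch

copy_done = torch.cuda.Event()

with torch.cuda.stream(copy_stream):
    gpu_buf.copy_(pinned, non_blocking=True)   # producer: async H2D on the copy stream
    copy_done.record(copy_stream)              # mark "transfer finished" on that stream

compute_stream.wait_event(copy_done)           # consumer stream waits on the GPU; the CPU thread does not block
with torch.cuda.stream(compute_stream):
    out = model(gpu_buf)
```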
|
||||
|
||||
## Progress Reporting
|
||||
|
||||
See shared protocol for the full reporting structure. GPU-domain message content:
|
||||
|
||||
1. **After baseline**: `[gpu baseline] <torch.profiler summary — top stages with CUDA us, transfer count>`
|
||||
2. **After each experiment**: `[experiment N] target: <stage.pattern>, result: KEEP/DISCARD, stage: <before>us -> <after>us, E2E: <before>ms -> <after>ms`
|
||||
3. **Every 3 experiments**: `[progress] <N> experiments (<keeps> kept, <discards> discarded) | best: <top keep> | E2E: <baseline>ms -> <current>ms | next: <next target>`
|
||||
4. **At milestones**: `[milestone] <cumulative: total E2E speedup, stage improvements, experiments, keeps/discards>`
|
||||
5. **At plateau/completion**: `[complete] <total experiments, keeps, E2E improvement, remaining targets and why>`
|
||||
6. **Cross-domain**: `[cross-domain] domain: <target-domain> | signal: <what you found>`
|
||||
|
||||
## Logging Format
|
||||
|
||||
Tab-separated `.codeflash/results.tsv`:
|
||||
|
||||
```
|
||||
commit target_test stage gpu_baseline_us gpu_optimized_us gpu_speedup e2e_baseline_ms e2e_optimized_ms e2e_speedup tests_passed tests_failed status pattern description
|
||||
```
|
||||
|
||||
- `stage`: pipeline stage (e.g., `preprocess`, `postprocess`, `sync`)
|
||||
- `gpu_speedup`: percentage (e.g., `98.8%`)
|
||||
- `e2e_speedup`: percentage (e.g., `20.7%`)
|
||||
- `status`: `keep`, `discard`, or `crash`
|
||||
- `pattern`: antipattern tag (e.g., `kernel-fusion`, `loop-invariant-tensor`, `pageable-h2d`, `event-sync`)
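
A hypothetical example row, reusing the numbers from the progress example above (test name and test counts are illustrative; fields shown space-separated here, the real file uses tabs):

```
8b748b9  test_inference_e2e  postprocess  1021  12  98.8%  11.02  8.74  20.7%  42  0  keep  kernel-fusion  Fused 29 postprocess kernels via Triton
```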
|
||||
|
||||
## Workflow
|
||||
|
||||
### Starting fresh
|
||||
|
||||
Follow common session start steps from shared protocol, then:
|
||||
|
||||
4. **GPU environment check.** Verify CUDA is available and record GPU info:
|
||||
```bash
|
||||
$RUNNER -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0)}, PyTorch: {torch.__version__}')"
|
||||
```
|
||||
If CUDA is not available, report and stop — GPU optimization is not possible.
|
||||
|
||||
5. **GPU baseline.** Run torch.profiler on the target inference path. Record in results.tsv.
|
||||
|
||||
6. **Build ranked target list.** From the profile, list ALL pipeline stages with their CUDA timing, transfer events, and synchronization points. Print this list explicitly:
|
||||
```
|
||||
[gpu ranked targets]
|
||||
1. postprocess.dispatch_gaps — 1021us (29 kernels) [dispatch]
|
||||
2. sync.stream_block — 1000us [sync]
|
||||
3. preprocess.h2d_staging — 300us [transfer]
|
||||
```
|
||||
You MUST print this exact format. Only targets above measurable threshold are worth fixing. **Do NOT read source code for pipeline stages inside the TRT/ONNX engine.**
|
||||
|
||||
7. **Read ONLY the #1 target's source code.** Do not read other stages yet. Enter the experiment loop.
|
||||
|
||||
8. **Experiment loop** — Begin iterating.
|
||||
|
||||
### Constraints
|
||||
|
||||
- **Correctness**: All previously-passing tests must still pass. Numerical output must match within tolerance.
|
||||
- **Performance**: Measured improvement required — don't rely on theoretical analysis alone. GPU timing can be counterintuitive.
|
||||
- **Simplicity**: Simpler is better. A cached tensor is better than a custom CUDA kernel for a simple operation.
|
||||
- **Style**: Match existing project conventions. Don't introduce Triton kernels in a codebase that uses only PyTorch unless the gain justifies the complexity.
|
||||
|
||||
## Deep References
|
||||
|
||||
For detailed domain knowledge beyond this prompt, read from `../references/gpu/`:
|
||||
- **`guide.md`** — Full GPU optimization methodology, profiling tools, pipeline model, decision framework
|
||||
- **`reference.md`** — Antipattern catalog with symptoms, fixes, thresholds, and code examples
|
||||
- **`experiment-loop.md`** — Domain-specific experiment loop details, keep/discard thresholds, strategy rotation
|
||||
- **`handoff-template.md`** — Handoff template with GPU-specific fields
|
||||
- **`../shared/e2e-benchmarks.md`** — Two-phase measurement with `codeflash compare`
|
||||
- **`../shared/pr-preparation.md`** — PR workflow
|
||||
|
||||
## PR Strategy
|
||||
|
||||
See shared protocol. Branch prefix: `gpu/`. PR title prefix: `gpu:`.
|
||||
|
|
@ -42,6 +42,7 @@ You are the team lead for Python performance optimization. Your job is to detect
|
|||
| User EXPLICITLY requests memory-only: "reduce memory", "fix OOM", "too much RAM" | **Memory** | `codeflash-memory` |
|
||||
| User EXPLICITLY requests CPU-only: "fix O(n^2)", "algorithmic optimization only" | **CPU / Data Structures** | `codeflash-cpu` |
|
||||
| User EXPLICITLY requests async-only: "fix sequential awaits", "async concurrency only" | **Async** | `codeflash-async` |
|
||||
| User EXPLICITLY requests GPU-only: "optimize CUDA", "reduce inference latency", "GPU pipeline", "fix H2D transfers", "kernel fusion", "pipeline bubbles" | **GPU/CUDA** | `codeflash-gpu` |
|
||||
| Import time, circular deps, module reorganization, startup time, god module | **Structure** | `codeflash-structure` |
|
||||
| Review, critique, check changes, review PR, verify optimizations | **Review** | `codeflash-review` |
|
||||
|
||||
|
|
@ -59,3 +60,5 @@ You are the team lead for Python performance optimization. Your job is to detect
|
|||
| codeflash-deep (I/O targets) | `../references/io/` | File format selection (PNG/BMP/raw), serialization overhead, buffer protocol, zero-copy, streaming |
|
||||
| codeflash-deep (C ext targets) | `../references/native/` | Python↔C boundary costs, numpy/Pillow/pdfium/ONNX patterns, GIL, buffer protocol |
|
||||
| codeflash-deep (worker targets) | `../references/workers/` | Pool sizing, cgroup-aware CPU detection, fork/spawn/forkserver, memory sharing, model duplication |
|
||||
| codeflash-gpu | `../references/gpu/` | torch.profiler/nsys profiling, CUDA transfer patterns, kernel fusion, stream sync, pinned memory |
|
||||
| codeflash-deep (GPU targets) | `../references/gpu/` | GPU pipeline optimization when deep agent detects CUDA-bound workloads |
|
||||
|
|
|
|||
|
|
@ -164,6 +164,49 @@ Record findings with:
|
|||
- Pattern (sequential awaits, blocking call, cache on async, unbounded gather)
|
||||
- Estimated impact (high/medium/low)
|
||||
|
||||
### 5. GPU Profiling (conditional)
|
||||
|
||||
Check if the project uses torch and CUDA is available:
|
||||
```bash
|
||||
$RUNNER -c "import torch; print('CUDA available:', torch.cuda.is_available()); print('Device:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'N/A')" 2>/dev/null
|
||||
```
|
||||
|
||||
If CUDA is available AND the project imports torch, run a quick GPU profile:
|
||||
```bash
|
||||
$RUNNER -c "
import torch
from torch.profiler import profile, ProfilerActivity

# Discover inference entry point — adapt to project
# Look for model.predict(), model.forward(), model.infer(), etc.

# Quick profile (adapt command to project's inference path)
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    # Run 5 inference calls after warmup
    pass

# Summary
avg = prof.key_averages()
print('=== Top CUDA kernels ===')
print(avg.table(sort_by='cuda_time_total', row_limit=10))

# Count transfers and syncs
transfer_count = sum(1 for e in avg if 'copy' in e.key.lower() or 'memcpy' in e.key.lower())
sync_count = sum(1 for e in avg if 'synchronize' in e.key.lower())
total_cuda = sum(e.cuda_time_total for e in avg if e.cuda_time_total > 0)
print(f'Transfer events: {transfer_count}, Sync calls: {sync_count}, Total CUDA time: {total_cuda}us')
" 2>&1
|
||||
```
|
||||
|
||||
Record findings with:
|
||||
- Total CUDA kernel count and total GPU time
|
||||
- Number and type of H2D/D2H transfer events
|
||||
- Number of synchronize() calls
|
||||
- Pipeline idle % (if determinable from kernel gap analysis)
|
||||
- Suspected patterns (loop-invariant tensors, kernel fusion candidates, blocking sync)
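
A hypothetical findings entry (numbers are illustrative):

```
GPU: 38 kernels/call, 6.8ms total CUDA time
Transfers: 3 H2D (2.1MB total), 1 D2H (7.2KB); 2 synchronize() calls per inference
Suspected: per-call mean/std tensor creation in preprocess; 29 sequential postprocess kernels (fusion candidate)
```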
|
||||
|
||||
If torch is not available or CUDA is not present, skip this step and note "GPU: N/A (no CUDA)" in the report.
|
||||
|
||||
## Cross-Domain Ranking
|
||||
|
||||
After all profiling passes, rank ALL findings into a single list ordered by estimated impact. **Adjust severity based on deployment model.**
|
||||
|
|
@ -181,6 +224,11 @@ After all profiling passes, rank ALL findings into a single list ordered by esti
|
|||
- Async blocking call in hot path → **high**
|
||||
- Sequential awaits (3+ independent) → **high**
|
||||
- Other async patterns → **medium**
|
||||
- GPU pipeline >50% idle (dispatch gaps between kernels) → **critical**
|
||||
- GPU H2D/D2H transfers >20% of total GPU time → **high**
|
||||
- >10 sequential kernel launches with gaps on same stream → **high**
|
||||
- Blocking synchronize() >0.5ms → **medium**
|
||||
- Per-call tensor creation for constants (loop-invariant) → **medium**
|
||||
|
||||
### Deployment model adjustments
|
||||
|
||||
|
|
@ -253,11 +301,14 @@ Based on the scan results, recommended optimization order:
|
|||
|
||||
### Async (static analysis)
|
||||
<findings or "No async code detected">
|
||||
|
||||
### GPU (torch.profiler)
|
||||
<torch.profiler output with annotations, or "GPU: N/A (no CUDA)">
|
||||
```
|
||||
|
||||
## Print Summary
|
||||
|
||||
After writing the report, print a one-line summary:
|
||||
```
|
||||
[scan] CPU: <N> targets | Memory: <N> targets | Import: <N> targets | Async: <N> targets | Top: <#1 target description>
|
||||
[scan] CPU: <N> targets | Memory: <N> targets | Import: <N> targets | Async: <N> targets | GPU: <N> targets | Top: <#1 target description>
|
||||
```
|
||||
|
|
|
|||
plugin/languages/python/references/gpu/experiment-loop.md (new file, 211 lines)
|
|
@ -0,0 +1,211 @@
|
|||
# Experiment Loop — GPU/CUDA Domain
|
||||
|
||||
> Base framework: `../shared/experiment-loop-base.md`
|
||||
|
||||
## Reasoning Checklist
|
||||
|
||||
Before writing any code, answer these 10 questions. If you can't answer 3-7 concretely, profile deeper before coding.
|
||||
|
||||
1. **Pattern**: What GPU antipattern? (see `reference.md` catalog: [transfer], [dispatch], [sync], [memory], [compute], [correctness])
|
||||
2. **Hot path?** Confirmed by torch.profiler or nsys — **NOT cProfile**. cProfile cannot see GPU-side costs.
|
||||
3. **Pipeline stage?** Which stage: CPU prep, H2D, GPU preprocess, GPU inference, D2H, CPU postprocess?
|
||||
4. **Transfer or compute?** Is the bottleneck data movement (H2D/D2H/staging) or kernel execution time?
|
||||
5. **Per-call vs amortized?** Is this fixed overhead per inference call, or proportional to batch size / image resolution?
|
||||
6. **GPU warm?** Is the benchmark running after warmup (CUDA context init, TRT deserialization, cuDNN autotuning)?
|
||||
7. **Mechanism**: HOW does your change reduce latency? Be specific: fewer transfers, fewer kernel launches, less sync wait, direct-to-host writes, etc.
|
||||
8. **Correctness**: Does this change numerical output? Kernel fusion, precision changes, and operation reordering can produce different floating-point results. Verify bit-for-bit or within tolerance.
|
||||
9. **CUDA graph compatibility**: Does the project use CUDA graphs? If so, will this change break capture or replay? Event-based sync is incompatible with graph capture. Pinned memory and cached tensors are compatible.
|
||||
10. **Verify cheaply**: Can you validate with `torch.cuda.Event` timing before running the full benchmark?
|
||||
|
||||
## Profiling Methodology
|
||||
|
||||
### Mandatory: torch.profiler baseline
|
||||
|
||||
Run this BEFORE any code changes. This is equivalent to the cProfile gate for CPU optimization — entering the experiment loop without torch.profiler output is not permitted.
|
||||
|
||||
```python
|
||||
# /tmp/gpu_baseline.py
import torch
from torch.profiler import profile, ProfilerActivity

# --- Adapt these to the project ---
# from model import load_model, preprocess, inference, postprocess
# model = load_model()
# input_data = load_test_input()
# ---

# Warmup
for _ in range(20):
    run_inference(model, input_data)
torch.cuda.synchronize()

# Profile
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_stack=True,
    record_shapes=True,
) as prof:
    for _ in range(10):
        run_inference(model, input_data)
    torch.cuda.synchronize()

# Ranked by CUDA time
print("=== CUDA time ranking ===")
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))

# Ranked by CPU time
print("\n=== CPU time ranking ===")
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))

# Memory copies (H2D, D2H)
print("\n=== Memory events ===")
for evt in prof.key_averages():
    if "copy" in evt.key.lower() or "memcpy" in evt.key.lower():
        print(f"  {evt.key}: count={evt.count}, cuda_time={evt.cuda_time_total}us")
|
||||
```
|
||||
|
||||
Print the output as `[gpu baseline]` — this is a key deliverable.
|
||||
|
||||
### Event-based micro-timing (per-stage measurement)
|
||||
|
||||
For quick A/B comparisons of individual pipeline stages:
|
||||
|
||||
```python
|
||||
import torch
import statistics

def time_stage(fn, *args, warmup=10, runs=100, stream=None):
    """Time a single pipeline stage using CUDA events."""
    s = stream or torch.cuda.current_stream()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()

    times = []
    for _ in range(runs):
        start.record(s)
        fn(*args)
        end.record(s)
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))

    return statistics.median(times), statistics.stdev(times)
|
||||
```
|
||||
|
||||
### nsys validation (when torch.profiler is insufficient)
|
||||
|
||||
Use when you need to see driver-level detail — pageable vs pinned transfers, exact kernel launch gaps, stream activity overlap:
|
||||
|
||||
```bash
|
||||
nsys profile --trace=cuda,nvtx,osrt --cuda-memory-usage=true -o /tmp/nsys_validate \
|
||||
$RUNNER /tmp/inference_bench.py
|
||||
```
|
||||
|
||||
## Domain-Specific Loop Steps
|
||||
|
||||
**Step 1 — Choose target** sources:
|
||||
- **torch.profiler output**: Highest CUDA time function, or largest gap between GPU kernels
|
||||
- **nsys timeline**: Pipeline bubbles — idle gaps between stream activities
|
||||
- **Manual inspection**: Per-call tensor creation patterns, `synchronize()` calls, pageable H2D transfers
|
||||
|
||||
Print: `[experiment N] Target: <stage>.<pattern> (<est. latency>us, <antipattern category>)`
|
||||
|
||||
Example:
|
||||
```
|
||||
[experiment 1] Target: postprocess.dispatch_gaps (1021us, 29 kernels — [dispatch] fusion candidate)
|
||||
[experiment 2] Target: preprocess.h2d_staging (300us — [transfer] pageable->pinned)
|
||||
[experiment 3] Target: preprocess.tensor_create (150us — [transfer] loop-invariant hoisting)
|
||||
[experiment 4] Target: sync.stream_block (1000us — [sync] synchronize->event)
|
||||
```
|
||||
|
||||
**Step 6 — Read results**: Print BOTH stage-level AND E2E timings:
|
||||
```
|
||||
[experiment N] Stage: <stage> <before>us -> <after>us (<delta>%)
|
||||
[experiment N] E2E: <before>ms -> <after>ms (<delta>%)
|
||||
```
|
||||
|
||||
**Step 8 — Noise threshold**: GPU timing has higher variance than CPU due to GPU clock boosting, thermal throttling, and driver scheduling. If speedup <10% on a single stage, re-run **5x** (not 3x) with GPU warm to confirm it's not noise. Use median, not mean.
|
||||
|
||||
**Step 10 — Record**: Record immediately in both `.codeflash/results.tsv` and `.codeflash/HANDOFF.md`. Include both stage-level and E2E metrics.
|
||||
|
||||
**Step 12 — E2E benchmark**: After KEEP, measure E2E latency over 100+ inference calls to confirm the stage-level gain translates to real-world improvement. Stage improvements that don't show in E2E may be masked by other bottlenecks — still KEEP if the stage improvement is confirmed, but note the masking.
|
||||
|
||||
## Keep/Discard Thresholds
|
||||
|
||||
```
|
||||
Tests passed?
+-- NO  -> Fix or discard immediately
+-- YES -> Check E2E and stage metrics:
    +-- E2E improved >=3% -> KEEP
    +-- E2E <3% but stage improved >=10%:
    |   +-- Re-run 5x with GPU warm
    |       +-- Confirmed -> KEEP (stage gain is real; E2E may compound with other fixes)
    |       +-- Noise -> DISCARD
    +-- Micro-benchmark (Event timing) >=20% on confirmed hot stage -> KEEP
    +-- No measurable improvement -> DISCARD
|
||||
```
|
||||
|
||||
**Why 3% E2E (not 5% like CPU)?** GPU inference pipelines have high fixed costs (model forward pass) that dilute the effect of transfer/sync optimizations on E2E numbers. A 50% reduction in a pipeline stage (e.g., postprocess 1ms -> 0.5ms) may only show as 5% E2E improvement on a 10ms pipeline. The stage-level improvement is real and compounds with other stage fixes.
|
||||
|
||||
## Strategy Rotation
|
||||
|
||||
If 3+ consecutive discards on the same type, switch:
|
||||
|
||||
1. **Transfer optimization** (pinned memory, cached tensors, non_blocking) -> if exhausted:
|
||||
2. **Dispatch/fusion** (Triton kernels, torch.compile, batched ops) -> if exhausted:
|
||||
3. **Synchronization** (event-based sync, stream overlap, pipeline parallelism) -> if exhausted:
|
||||
4. **Compute/precision** (FP16, operator replacement, GPU-side preprocessing) -> if exhausted:
|
||||
5. **Architectural** (batch pipeline stages differently, pipeline parallelism across frames)
|
||||
|
||||
## Plateau Detection
|
||||
|
||||
Remaining targets are at the optimization floor when:
|
||||
- **Inside TRT/ONNX engine**: The model forward pass is opaque — can't optimize individual kernels at the Python level. Recommend TRT profile/refit or model-level changes.
|
||||
- **PCIe bandwidth ceiling**: H2D/D2H transfers are limited by PCIe bandwidth. Verify with `nvidia-smi topo -m` and transfer size calculation.
|
||||
- **SM occupancy ceiling**: GPU compute units are fully utilized. Verify with nsys or Nsight Compute occupancy metrics.
|
||||
- **Per-kernel launch overhead**: tens of microseconds of irreducible driver/dispatch cost per launch. Can only be reduced by fewer launches (fusion) or CUDA graphs.
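
Where CUDA graphs are the remedy, a minimal capture/replay sketch for a static-shape inference step (tensor shape and model name are illustrative; assumes warm-up has already run):

```python
import torch

static_in = torch.empty((1, 3, 640, 640), device="cuda")
graph = torch.cuda.CUDAGraph()

with torch.cuda.graph(graph):
    static_out = model(static_in)    # kernels are recorded once during capture

def run(x: torch.Tensor) -> torch.Tensor:
    static_in.copy_(x)               # refresh the captured input buffer in place
    graph.replay()                   # replays all captured kernels with a single launch
    return static_out
```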
|
||||
|
||||
## Correctness Verification
|
||||
|
||||
### Numerical equivalence
|
||||
|
||||
Kernel fusion and operation reordering can produce different floating-point results due to:
|
||||
- Different reduction order (sum, max, argmax)
|
||||
- Different intermediate precision (FP32 vs FP16 intermediates)
|
||||
- Fused multiply-add vs separate multiply then add
|
||||
|
||||
Always verify output equivalence after kernel fusion:
|
||||
|
||||
```python
|
||||
# Run both paths on same input
original_output = original_postprocess(logits, boxes)
fused_output = fused_postprocess(logits, boxes)

# Check equivalence
assert torch.allclose(original_output, fused_output, atol=1e-5, rtol=1e-4), \
    f"Max diff: {(original_output - fused_output).abs().max()}"
|
||||
```
|
||||
|
||||
### CUDA graph compatibility
|
||||
|
||||
If the project uses CUDA graph capture:
|
||||
- **Safe**: Cached tensors, pinned buffers, pre-allocated outputs — these are created outside capture
|
||||
- **Unsafe**: Event-based cross-stream sync — events recorded outside the graph are ignored during replay
|
||||
- **Depends**: `torch.compile` — the compiled graph may or may not be capturable
|
||||
|
||||
## Logging Format
|
||||
|
||||
Tab-separated `.codeflash/results.tsv`:
|
||||
|
||||
```
|
||||
commit target_test stage gpu_baseline_us gpu_optimized_us gpu_speedup e2e_baseline_ms e2e_optimized_ms e2e_speedup tests_passed tests_failed status pattern description
|
||||
```
|
||||
|
||||
- `stage`: pipeline stage (e.g., `preprocess`, `postprocess`, `sync`)
|
||||
- `gpu_baseline_us` / `gpu_optimized_us`: stage-level timing in microseconds
|
||||
- `e2e_baseline_ms` / `e2e_optimized_ms`: end-to-end timing in milliseconds
|
||||
- `pattern`: antipattern tag (e.g., `loop-invariant-tensor`, `kernel-fusion`, `pageable-h2d`)
|
||||
- `status`: `keep`, `discard`, or `crash`
|
||||
plugin/languages/python/references/gpu/guide.md (new file, 186 lines)
|
|
@ -0,0 +1,186 @@
|
|||
# GPU/CUDA Inference Optimization Guide
|
||||
|
||||
## Why cProfile Is Insufficient for GPU Workloads
|
||||
|
||||
cProfile measures Python-side CPU time. GPU work is asynchronous — `torch.mm()` returns immediately while the GPU computes. The real costs in GPU inference are invisible to Python profilers:
|
||||
|
||||
- **H2D/D2H transfers**: Data copies between CPU and GPU happen in the CUDA driver. A 24-byte tensor copy that takes microseconds of Python time can create a 100us pipeline bubble waiting for the driver to stage pageable memory.
|
||||
- **Kernel dispatch gaps**: Each PyTorch operation launches a CUDA kernel with ~25us of dispatch overhead. 29 sequential operations = ~725us of dead time between kernels, but cProfile shows each op completing in microseconds.
|
||||
- **Stream synchronization stalls**: `stream.synchronize()` blocks the CPU thread until the GPU finishes. This shows as idle CPU time in cProfile — not attributed to any function.
|
||||
- **Pipeline bubbles**: Gaps where neither CPU nor GPU is doing useful work because one is waiting for the other. These are the "inter-iteration bubbles" that dominate real-world inference latency.
|
||||
|
||||
**Rule: For GPU-bound workloads, always use `torch.profiler` or `nsys` as the primary profiler. cProfile is supplementary only — use it for CPU-side code paths (preprocessing, postprocessing, adapter serialization).**
|
||||
|
||||
## Profiling Tools
|
||||
|
||||
### torch.profiler (primary)
|
||||
|
||||
The standard tool for profiling PyTorch GPU workloads. Captures both CPU and CUDA activities.
|
||||
|
||||
```python
|
||||
import torch
from torch.profiler import profile, ProfilerActivity, schedule

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_stack=True,
    record_shapes=True,
    profile_memory=True,
    schedule=schedule(wait=2, warmup=3, active=5, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("/tmp/gpu_trace"),
) as prof:
    for i in range(10):
        run_inference(model, input_data)
        prof.step()

# Print summary sorted by CUDA time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))

# Print summary sorted by CPU time (for comparison)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
|
||||
```
|
||||
|
||||
**What to look for in the output:**
|
||||
- CUDA kernels with high total time — these are compute bottlenecks
|
||||
- Many calls with short CUDA time — dispatch overhead candidates for fusion
|
||||
- CPU time >> CUDA time on a function — the CPU is doing unnecessary work
|
||||
- CUDA time >> CPU time on a function — large GPU compute or transfer
|
||||
- `aten::copy_` entries — these are memory transfers (H2D, D2H, D2D)
|
||||
- `cudaStreamSynchronize` in CPU column — blocking sync points
|
||||
|
||||
### Nsight Systems (nsys) — authoritative validation
|
||||
|
||||
When torch.profiler findings need confirmation or when you need driver-level detail.
|
||||
|
||||
```bash
|
||||
nsys profile \
|
||||
--trace=cuda,nvtx,osrt \
|
||||
--cuda-memory-usage=true \
|
||||
--output=/tmp/nsys_profile \
|
||||
$RUNNER inference_script.py
|
||||
```
|
||||
|
||||
**Reading the timeline:**
|
||||
- **Stream rows**: Each CUDA stream shows kernel execution as colored blocks with gaps between them. Large gaps = pipeline bubbles.
|
||||
- **Memory copy rows**: H2D (green) and D2H (red) transfers. Pageable transfers show a staging copy; pinned transfers are direct.
|
||||
- **CPU row**: Python thread activity. Look for idle periods that coincide with `cudaStreamSynchronize`.
|
||||
- **NVTX markers**: If the code uses `torch.cuda.nvtx.range_push/pop`, these label pipeline stages.
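
A minimal NVTX labeling sketch so pipeline stages appear as named ranges on the nsys timeline (stage and function names are illustrative):

```python
import torch.cuda.nvtx as nvtx

nvtx.range_push("preprocess")
gpu_input = preprocess(frame)
nvtx.range_pop()

nvtx.range_push("inference")
output = model(gpu_input)
nvtx.range_pop()
```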
|
||||
|
||||
### torch.cuda.Event timing (quick per-stage measurement)
|
||||
|
||||
For fast A/B comparisons without full profiling infrastructure.
|
||||
|
||||
```python
|
||||
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

# Warm up
for _ in range(10):
    run_inference(model, input_data)
torch.cuda.synchronize()

# Measure
times = []
for _ in range(100):
    start.record()
    run_inference(model, input_data)
    end.record()
    torch.cuda.synchronize()
    times.append(start.elapsed_time(end))  # milliseconds

import statistics
print(f"Mean: {statistics.mean(times):.3f}ms, Median: {statistics.median(times):.3f}ms, Std: {statistics.stdev(times):.3f}ms")
|
||||
```
|
||||
|
||||
**Per-stage measurement** — wrap each pipeline stage with its own event pair:
|
||||
|
||||
```python
|
||||
events = {stage: (torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True))
          for stage in ["preprocess", "inference", "postprocess"]}

events["preprocess"][0].record(preprocess_stream)
# ... preprocessing ...
events["preprocess"][1].record(preprocess_stream)

events["inference"][0].record(inference_stream)
# ... inference ...
events["inference"][1].record(inference_stream)

# After sync:
for stage, (s, e) in events.items():
    print(f"{stage}: {s.elapsed_time(e):.3f}ms")
|
||||
```
|
||||
|
||||
## The GPU Inference Pipeline Model
|
||||
|
||||
A typical GPU inference pipeline has 6 stages. Each transition between stages is an optimization surface:
|
||||
|
||||
```
|
||||
CPU Prep ──H2D──> GPU Preprocess ──> GPU Inference ──D2H──> CPU Postprocess ──> Output
    │                  │                   │                      │
    ▼                  ▼                   ▼                      ▼
Image resize       Normalize,          Model forward          Threshold,
Format convert     Channel swap,       pass (TRT,             Sort, NMS,
Numpy→Tensor       Pad/letterbox       ONNX, PyTorch)         Class remap
|
||||
```
|
||||
|
||||
**Where time is typically spent** (ranges for single-image inference on modern GPUs):
|
||||
|
||||
| Stage | Typical range | Optimization surface |
|
||||
|-------|--------------|---------------------|
|
||||
| CPU Prep | 0.5-3ms | Vectorization, avoid copies |
|
||||
| H2D Transfer | 0.1-1ms | Pinned memory, non_blocking |
|
||||
| GPU Preprocess | 0.1-2ms | Kernel fusion, cached constants |
|
||||
| GPU Inference | 2-20ms | Model optimization (TRT, quantization) — usually not our target |
|
||||
| D2H Transfer | 0.05-0.5ms | Direct-to-host writes, pinned buffers |
|
||||
| CPU Postprocess | 0.1-2ms | Numpy for small arrays, vectorization |
|
||||
| **Sync overhead** | **0.5-3ms** | **Event-based sync, pipeline overlap** |
|
||||
|
||||
The model forward pass is usually the largest single cost, but it's typically already optimized (TensorRT, ONNX). **The surrounding pipeline stages and their transitions often account for 30-60% of total E2E latency** and are where Python-level optimization has the most impact.
|
||||
|
||||
## Decision Framework
|
||||
|
||||
### Step 1: Is the workload GPU-bound or CPU-bound?
|
||||
|
||||
```python
|
||||
from torch.profiler import profile, ProfilerActivity

# Compare GPU active time to wall-clock time
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    run_inference(model, input_data)

# If total CUDA time >> total CPU time: GPU-bound (optimize kernels, reduce transfers)
# If total CPU time >> total CUDA time: CPU-bound (optimize Python code, reduce sync waits)
# If they're close: pipeline-bound (optimize transitions between stages)
|
||||
```
|
||||
|
||||
### Step 2: Is the bottleneck data movement or kernel execution?
|
||||
|
||||
Look at the torch.profiler output:
|
||||
- **Transfer-bound**: `aten::copy_` and `cudaMemcpy*` dominate. Focus on pinned memory, non_blocking, reducing transfer count.
|
||||
- **Compute-bound**: CUDA kernels dominate. Focus on kernel fusion, precision reduction, algorithmic changes.
|
||||
- **Dispatch-bound**: Many short kernels with gaps. Focus on kernel fusion via Triton or torch.compile.
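
For the dispatch-bound case, a minimal `torch.compile` fusion sketch (the ops and function are illustrative stand-ins for a small-kernel postprocess region):

```python
import torch

def postprocess(logits: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
    scores = logits.sigmoid()                  # each of these small ops would otherwise launch its own kernel
    best = scores.max(dim=-1).values
    return boxes * best.unsqueeze(-1)

postprocess_fused = torch.compile(postprocess, mode="reduce-overhead")  # compiled once; later calls hit fused kernels
```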
|
||||
|
||||
### Step 3: Is the overhead per-call or proportional?
|
||||
|
||||
- **Per-call fixed cost**: Tensor creation, stream synchronization, small H2D transfers. These don't scale with batch size — they matter most for single-image real-time inference.
|
||||
- **Proportional cost**: Large tensor transfers, compute-heavy kernels. These scale with batch size and image resolution.
|
||||
|
||||
Per-call fixed costs compound in streaming/real-time applications where every millisecond of latency matters. Proportional costs matter more in throughput-oriented batch processing.
|
||||
|
||||
## GPU Warm-Up
|
||||
|
||||
Always warm up the GPU before benchmarking. The first few inference calls include:
|
||||
- CUDA context initialization (~100-500ms)
|
||||
- TensorRT engine deserialization (~1-10s)
|
||||
- cuDNN autotuning (~100ms per unique input shape)
|
||||
- JIT compilation for torch.compile or Triton kernels
|
||||
|
||||
```python
|
||||
# Standard warm-up: run 10-20 inference calls and discard timing
for _ in range(20):
    run_inference(model, input_data)
torch.cuda.synchronize()
# NOW start measuring
|
||||
```
|
||||
|
||||
**Never benchmark cold GPU state** — the variance will mask real optimization effects.
|
||||
plugin/languages/python/references/gpu/handoff-template.md (new file, 47 lines)
|
|
@ -0,0 +1,47 @@
|
|||
# GPU/CUDA Handoff Template

> Base template: `../shared/handoff-template.md` (use with `{{DOMAIN_PREFIX}}` = `gpu`)

## Additional Environment Fields

Add these to the Environment section of the base handoff template:

```markdown
## Environment

| Key | Value |
|-----|-------|
| GPU model | e.g., Tesla T4, A100 80GB, RTX 4090 |
| CUDA version | e.g., 12.1 |
| cuDNN version | e.g., 8.9.7 |
| TensorRT version | e.g., 8.6.1 (or N/A) |
| Triton version | e.g., 2.1.0 (or N/A) |
| PyTorch version | e.g., 2.1.0+cu121 |
| CUDA graphs | enabled / disabled / N/A |
| Inference backend | PyTorch / TensorRT / ONNX Runtime |
```
## Hotspot Summary Extension

The "Impact" column for GPU targets should report **stage latency** rather than cumtime percentage:

```markdown
## Hotspot Summary

| # | Target | Stage | Latency (us) | Pattern | Status |
|---|--------|-------|--------------|---------|--------|
| 1 | postprocess dispatch gaps | postprocess | 1021 | kernel-fusion | keep — 12us after |
| 2 | preprocess H2D staging | preprocess.h2d | 300 | pageable-to-pinned | keep — 180us after |
| 3 | norm tensor creation | preprocess.tensor | 150 | loop-invariant | keep — 0us after |
| 4 | stream synchronize | sync | 1000 | event-based-sync | keep — 50us after |
```
## GPU-Specific State to Record

At session end, record in HANDOFF.md:

- **Current E2E latency**: median over 100+ warm inference calls
- **Per-stage breakdown**: preprocess, H2D, inference, D2H, postprocess (in microseconds)
- **Transfer summary**: number of H2D/D2H events per call, total bytes transferred
- **Kernel count**: number of CUDA kernel launches per inference call
- **Remaining targets**: what's left and why it can't be optimized (inside TRT engine, at PCIe ceiling, etc.)
# plugin/languages/python/references/gpu/reference.md (new file, 303 lines)
# GPU/CUDA Antipattern Catalog

## Profiling Tools

```bash
# torch.profiler — primary GPU profiling tool:
$RUNNER -c "
import torch
from torch.profiler import profile, ProfilerActivity
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], with_stack=True, record_shapes=True) as prof:
    # ... run inference ...
    pass
print(prof.key_averages().table(sort_by='cuda_time_total', row_limit=20))
"

# nsys — authoritative validation:
nsys profile --trace=cuda,nvtx,osrt --cuda-memory-usage=true -o /tmp/nsys_profile $RUNNER inference_script.py

# CUDA Event micro-benchmark — quick per-stage timing:
# See guide.md for the full template; a minimal sketch follows this code block
```
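A minimal per-stage timing sketch with CUDA events, under the assumption that the pipeline exposes separate `preprocess` / `infer` / `postprocess` callables (illustrative names; see guide.md for the full template):

```python
import torch

def time_stages(preprocess, infer, postprocess, image, iters=100):
    # One CUDA event per stage boundary; elapsed_time() returns milliseconds
    events = [torch.cuda.Event(enable_timing=True) for _ in range(4)]
    totals = [0.0, 0.0, 0.0]
    for _ in range(iters):
        events[0].record()
        batch = preprocess(image)
        events[1].record()
        raw = infer(batch)
        events[2].record()
        _ = postprocess(raw)
        events[3].record()
        torch.cuda.synchronize()  # make all four timestamps valid before reading them
        for i in range(3):
            totals[i] += events[i].elapsed_time(events[i + 1])
    for name, total in zip(["preprocess", "inference", "postprocess"], totals):
        print(f"{name}: {1000 * total / iters:.0f}us/call")
```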
## Antipattern Catalog

All patterns tagged: **[transfer]** = data movement, **[dispatch]** = kernel launch overhead, **[sync]** = synchronization, **[memory]** = GPU memory management, **[compute]** = kernel/operation efficiency, **[correctness]** = bug that defeats optimization.

### HIGH impact
**[transfer] Loop-invariant tensor creation on GPU** — Per-call H2D for constants

Symptom: Small H2D transfers visible in nsys every inference call. Tensors for normalization parameters, index arrays, or constant lookup tables re-created per call.

```python
# BEFORE: creates mean/std tensors on CPU, transfers to GPU every call
def preprocess(image, device):
    tensor = torch.from_numpy(image).to(device)
    mean = torch.tensor([0.485, 0.456, 0.406], device=device).view(1, -1, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225], device=device).view(1, -1, 1, 1)
    return (tensor - mean) / std
```

```python
# AFTER: cache on GPU, reuse across calls
_norm_mean = None
_norm_std = None

def preprocess(image, device):
    global _norm_mean, _norm_std
    if _norm_mean is None or not _norm_mean.is_cuda:
        _norm_mean = torch.tensor([0.485, 0.456, 0.406], device=device, dtype=torch.float32).view(1, -1, 1, 1)
        _norm_std = torch.tensor([0.229, 0.224, 0.225], device=device, dtype=torch.float32).view(1, -1, 1, 1)
    tensor = torch.from_numpy(image).to(device)
    return (tensor - _norm_mean) / _norm_std
```

Variants:

- **Channel swap index**: `tensor[:, [2,1,0], :, :]` creates a temporary CPU tensor from the Python list, then transfers it. Cache: `_idx = torch.tensor([2,1,0], device=device)` (see the sketch after this list)
- **Anchor grids**: Object detection anchor/prior tensors computed per call
- **Positional encodings**: Transformer positional encoding tensors
- Typical impact: 0.1-0.5ms per call per cached tensor
- Why cProfile misses it: Python-side cost is microseconds; the pipeline bubble is in the CUDA driver H2D path
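A minimal sketch of the channel-swap variant: build the index tensor once and reuse it, instead of re-materializing it from a Python list on every call (the `_bgr_to_rgb_idx` and `swap_channels` names are illustrative):

```python
# Cache the BGR -> RGB channel-permutation index on the GPU once.
_bgr_to_rgb_idx = None

def swap_channels(tensor):
    global _bgr_to_rgb_idx
    if _bgr_to_rgb_idx is None or not _bgr_to_rgb_idx.is_cuda:
        # Built from a Python list only on the first call, reused afterwards
        _bgr_to_rgb_idx = torch.tensor([2, 1, 0], device=tensor.device)
    # Same result as tensor[:, [2, 1, 0], :, :] without a per-call H2D transfer
    return tensor.index_select(1, _bgr_to_rgb_idx)
```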
---

**[transfer] Pageable memory for H2D transfers** — Driver-side staging overhead

Symptom: H2D transfers in nsys show a staging copy (pageable -> pinned -> GPU). `torch.from_numpy()` on a regular numpy array creates a pageable CPU tensor.

```python
# BEFORE: pageable memory — requires driver staging for H2D
tensor = torch.from_numpy(image_np).to(device)
```

```python
# AFTER: pinned memory buffer reuse — direct DMA transfer
_pinned_buffer = None

def to_gpu_pinned(image_np, device):
    global _pinned_buffer
    needed = image_np.shape
    if _pinned_buffer is None or _pinned_buffer.shape != needed:
        # dtype must match image_np (uint8 assumed here)
        _pinned_buffer = torch.empty(needed, dtype=torch.uint8, pin_memory=True)
    _pinned_buffer.numpy()[:] = image_np  # copy into pinned buffer (CPU-side, fast)
    return _pinned_buffer.to(device, non_blocking=True)
```

- Typical impact: 20-40% faster H2D for medium tensors (1-10 MB). Larger tensors see less relative gain (DMA bandwidth dominates).
- Crossover: Worth it for transfers > ~4 KB. Below that, the overhead of managing the pinned buffer isn't justified.
- Why cProfile misses it: The staging overhead is inside the CUDA driver, not in Python
- Caveat: Pinned memory is a limited resource. Don't pin large amounts (>1GB) or many small buffers. One reusable buffer per pipeline stage is typical.
---

**[dispatch] Many small PyTorch operations where a single fused kernel suffices** — Kernel dispatch gap accumulation

Symptom: torch.profiler shows 10-50+ consecutive CUDA kernels on the same stream with short durations (<100us each). nsys timeline shows visible gaps (~25us) between each kernel. Total wall time for the sequence is 5-20x the sum of kernel compute times.

```python
# BEFORE: 29 separate kernel launches for post-processing
probs = torch.sigmoid(logits)                             # kernel 1
confidence, class_id = probs.max(dim=-1)                  # kernels 2-3
mask = confidence > threshold                             # kernel 4
confidence = confidence[mask]                             # kernel 5
class_id = class_id[mask]                                 # kernel 6
sorted_idx = torch.argsort(confidence, descending=True)   # kernels 7-8
# ... box transform, denormalization, rescaling ...       # kernels 9-29
```

```python
# AFTER: single Triton kernel fusing all operations
@triton.jit
def fused_postprocess_kernel(logits_ptr, bboxes_ptr, output_ptr,
                             num_classes, dw, dh, inv_sw, inv_sh,
                             pad_l, pad_t, crop_x, crop_y, BLOCK_C: tl.constexpr):
    query_id = tl.program_id(0)
    # Fused: sigmoid + max + box transform + denorm + rescale
    # Writes directly to pinned CPU output buffer
    ...
```

Alternative to Triton: `torch.compile(fullgraph=True)` can fuse many operations automatically, though with less control over the output (a usage sketch follows this entry).

- Typical impact: 10-100x for the fused region. Post-processing 1021us -> 12us in the RT-DETR case.
- Worth fusing when: >5 sequential kernel launches with <100us each, total dispatch gap >100us
- Why cProfile misses it: Each PyTorch op's Python overhead is microseconds. The dispatch gaps between GPU kernel launches are invisible to Python profilers.
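A hedged sketch of the torch.compile route, assuming the post-processing steps live in a plain function over GPU tensors (`postprocess_fn` is an illustrative name):

```python
import torch

# fullgraph=True fails loudly on graph breaks, which is what you want
# when the goal is one fused region rather than best-effort speedup.
fused_postprocess = torch.compile(postprocess_fn, fullgraph=True)

out = fused_postprocess(logits, bboxes)  # first call compiles (slow); later calls run the fused code
```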
---

**[sync] Blocking stream.synchronize() between dependent GPU stages** — CPU-GPU serialization

Symptom: CPU blocks for 0.5-2ms waiting for GPU work to complete before launching the next stage. In torch.profiler: high CPU time on `cudaStreamSynchronize`. In nsys: CPU idle period aligned with GPU work on another stream.

```python
# BEFORE: CPU waits for GPU preprocessing to finish before launching inference
# (run_forward() = the application's wrapper that enqueues the model forward pass)
self._preprocess_stream.synchronize()    # CPU blocks ~1ms
self._inference_stream.run_forward()     # Can't start until sync completes
```

```python
# AFTER: GPU-to-GPU dependency via events — CPU proceeds immediately
event = self._preprocess_stream.record_event()
self._inference_stream.wait_event(event)  # GPU waits, CPU does not
self._inference_stream.run_forward()      # CPU launches immediately
```

- Typical impact: 0.5-1ms per synchronization point removed
- When to keep synchronize(): CUDA graph capture/replay requires it (graphs are fixed op sequences that ignore cross-stream events outside capture). Also needed when CPU must read GPU results before proceeding.
- Why cProfile misses it: `synchronize()` shows as idle CPU time, not as a hot function.
---

### MEDIUM impact

**[compute] torch operations on small CPU arrays** — Dispatch overhead > compute time

Symptom: `torch.sort()`, `torch.argsort()`, `torch.isin()` called on CPU tensors with <1000 elements. torch's per-op dispatch overhead (~10-50us) dominates the actual computation time for small arrays.

```python
# BEFORE: torch on CPU, slow for small arrays
confidence_sorted, sorted_idx = torch.sort(confidence_cpu, descending=True)
mask = torch.isin(class_ids, valid_class_ids_tensor)
```

```python
# AFTER: numpy for small CPU arrays — 2-5x faster
conf_np = confidence_cpu.numpy()
order = np.argsort(-conf_np)
mask = np.isin(class_ids_np, valid_class_ids_np)  # pass an array, not a set: np.isin mishandles sets
```

- Crossover: numpy wins below ~1000 elements for sort/argsort/isin. Below ~100 elements, numpy is 3-5x faster.
- Typical impact: 2-5x for individual operations, 0.1-0.5ms per pipeline stage
- When NOT to do this: If the data is already a GPU tensor and will continue as a GPU tensor — don't move it to CPU just for numpy. This is for CPU-side post-processing after D2H.
---

**[transfer] Unnecessary D2H synchronization for scalar reads** — Hidden GPU->CPU roundtrips

Symptom: Unexpected `cudaStreamSynchronize` in nsys caused by seemingly innocuous Python code. Common hidden triggers:

```python
# HIDDEN SYNC 1: .item() forces D2H + sync
if tensor.max().item() > threshold:  # sync!
    ...

# HIDDEN SYNC 2: torchvision.transforms.functional.normalize has an internal check,
# (std == 0).any() — which transfers a boolean from GPU to CPU
output = torchvision.transforms.functional.normalize(tensor, mean, std)  # sync!

# HIDDEN SYNC 3: print/logging forces sync
print(f"Shape: {tensor.shape}, Max: {tensor.max()}")  # sync!

# HIDDEN SYNC 4: Python boolean check on GPU tensor
if tensor.any():  # sync!
    ...
```

Fix: Batch reads at end of pipeline, use `non_blocking=True` where possible, replace library functions that have hidden syncs with inline equivalents.
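A minimal sketch of the "batch reads at the end" fix, assuming pre-allocated pinned output buffers and illustrative names (`finalize`, `conf_pinned`, `boxes_pinned`): keep everything on the GPU through the decision logic, then pay for one explicit sync per call instead of several hidden ones.

```python
def finalize(confidence_gpu, boxes_gpu, conf_pinned, boxes_pinned):
    n = boxes_gpu.shape[0]
    # Two async D2H copies into pre-allocated pinned buffers, then ONE sync point
    conf_pinned[:n].copy_(confidence_gpu, non_blocking=True)
    boxes_pinned[:n].copy_(boxes_gpu, non_blocking=True)
    torch.cuda.synchronize()  # the only explicit sync in the call
    return conf_pinned[:n].numpy(), boxes_pinned[:n].numpy()
```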
- Typical impact: 0.05-0.5ms per hidden sync point
- Why cProfile misses it: The sync shows as CPU idle time, not attributable to the line that triggered it
---

**[memory] Repeated GPU tensor allocation in inference loop** — Allocation overhead per call

Symptom: `torch.empty()`, `torch.zeros()`, or `torch.ones()` called inside the per-frame inference loop for buffers that could be pre-allocated.

```python
# BEFORE: allocate output buffer every call
def postprocess(logits, num_queries):
    output = torch.empty(num_queries, 6, dtype=torch.float32)  # allocation per call
    ...
```

```python
# AFTER: pre-allocate and reuse
def __init__(self):
    self._output_buffer = torch.empty(300, 6, dtype=torch.float32, pin_memory=True)

def postprocess(self, logits, num_queries):
    output = self._output_buffer[:num_queries]  # view into pre-allocated buffer
    ...
```

- Typical impact: 0.05-0.2ms per avoided allocation (depends on tensor size)
- Pinned buffer variant: For D2H outputs, pre-allocate with `pin_memory=True` so the GPU kernel can write directly to host memory.
---

**[correctness] Device comparison with == instead of .is_cuda** — Caching defeated

Symptom: Tensors that should be cached on GPU are re-created every call. The caching check uses device equality, which fails due to device index mismatch.

```python
# BUG: torch.device('cuda') != torch.device('cuda:0') — always True!
if _cached_tensor is None or _cached_tensor.device != torch.device('cuda'):
    _cached_tensor = torch.tensor(data, device=target_device)  # re-created every call
```

```python
# FIX: .is_cuda checks for any CUDA device index
if _cached_tensor is None or not _cached_tensor.is_cuda:
    _cached_tensor = torch.tensor(data, device=target_device)  # cached after first call
```

- Impact: Defeats caching, causing per-call tensor creation and H2D transfer overhead
- Pattern class: "caching defeated by incorrect cache key" — a class of bugs specific to PyTorch device handling
- Always fix: This is a correctness bug, not just a performance issue
---

### LOW impact (only on confirmed hot path)

**[dispatch] CPU-side preprocessing that could be a CUDA kernel**

When CPU preprocessing (resize, color conversion, normalization) takes >1ms and the input is going to GPU anyway, consider doing the work on GPU. Libraries: `torchvision.transforms` on CUDA tensors, NVIDIA DALI, custom CUDA kernels. A sketch follows the caveat below.

Caveat: Moving preprocessing to GPU increases GPU utilization — only beneficial if GPU is underutilized during preprocessing.
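A minimal sketch of GPU-side resize + normalize with plain torch ops (target size and normalization constants are illustrative; only worth doing if the GPU has headroom while the CPU would otherwise preprocess):

```python
import torch
import torch.nn.functional as F

_mean = torch.tensor([0.485, 0.456, 0.406], device="cuda").view(1, 3, 1, 1)
_std = torch.tensor([0.229, 0.224, 0.225], device="cuda").view(1, 3, 1, 1)

def preprocess_on_gpu(frame_u8):  # frame_u8: HxWx3 uint8 CPU tensor
    x = frame_u8.to("cuda", non_blocking=True)            # one small H2D copy
    x = x.permute(2, 0, 1).unsqueeze(0).float() / 255.0   # NCHW, scaled to [0, 1]
    x = F.interpolate(x, size=(640, 640), mode="bilinear", align_corners=False)
    return (x - _mean) / _std                              # resize + normalize on the GPU
```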
---

**[sync] Missing non_blocking on tensor.to(device)**

```python
# BEFORE: blocks CPU until transfer complete
gpu_tensor = cpu_tensor.to(device)

# AFTER: CPU proceeds immediately, sync later when needed
gpu_tensor = cpu_tensor.to(device, non_blocking=True)
```

Requires the source tensor to be in pinned memory for `non_blocking` to be effective. If the source is pageable memory, `non_blocking=True` is silently ignored. A sketch of the pinning precondition follows.
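A minimal sketch of that precondition, assuming the CPU tensor is reused across calls (pinning a fresh tensor every call costs more than it saves):

```python
# non_blocking only overlaps the copy when the source lives in pinned host memory.
if not cpu_tensor.is_pinned():
    cpu_tensor = cpu_tensor.pin_memory()  # one-time cost; keep and reuse this tensor

gpu_tensor = cpu_tensor.to(device, non_blocking=True)
# ... enqueue more GPU work here; synchronize only when the result is actually needed ...
```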
---

**[memory] FP32 inference when FP16 would suffice**

When the model supports half precision and numerical accuracy is acceptable:

```python
model = model.half()
# or
with torch.autocast(device_type='cuda', dtype=torch.float16):
    output = model(input_tensor)
```

Caveat: Not all models are stable in FP16. Test numerical output equivalence (a minimal check is sketched below).
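A minimal sketch of that equivalence check on a held-out batch, assuming the model starts in FP32 (`sample_input` is an illustrative name; the acceptable tolerance is task-dependent):

```python
import torch

with torch.no_grad():
    ref = model(sample_input)  # FP32 reference
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        out = model(sample_input)  # mixed-precision run

max_abs_diff = (out.float() - ref.float()).abs().max().item()
print(f"max abs diff FP16 vs FP32: {max_abs_diff:.5f}")
# Also gate the switch on a task-level metric (e.g., mAP delta), not only tensor differences.
```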
---

## Data Size Thresholds

| Pattern | Worth fixing when |
|---------|-------------------|
| Loop-invariant tensor creation | Always (fixed per-call cost) |
| Pageable -> pinned memory | Transfer size > ~4 KB |
| Kernel fusion | >5 sequential kernel launches with <100us each |
| Event-based stream sync | `synchronize()` blocks CPU for >0.3ms |
| Small-array CPU fallback | Array has <1000 elements on CPU-side torch |
| Device comparison bug | Always (correctness bug) |
| Repeated GPU allocation | Buffer allocated >100 times/sec |
| Hidden D2H sync | Any occurrence in hot loop |