aseembits93 2026-04-17 23:23:57 +00:00
parent 8b748b9f81
commit 2734b42f96
4 changed files with 159 additions and 5 deletions

View file

@@ -45,6 +45,7 @@ Classify every target before experimenting. This prevents chasing unoptimizable
| **Loop-invariant tensor creation** | Always | Any per-call H2D transfer for constants |
| **Pageable -> pinned memory** | Yes if transfer >4KB | Visible in nsys as staged memcpy |
| **Kernel dispatch fusion** | Yes if >5 sequential small kernels | Total dispatch gap >100us |
| **Monolithic pre/post kernel** | Yes when stage has >3 ops totaling >1ms | Fuse entire stage into one compiled kernel |
| **Blocking synchronize()** | Yes if blocks >0.3ms | CPU idle visible in torch.profiler |
| **Event-based stream sync** | Yes if cross-stream dependency | Replace synchronize with record_event+wait_event |
| **Small-array CPU fallback** | Yes if <1000 elements | torch on CPU slower than numpy for small arrays |
@@ -60,12 +61,15 @@ Classify every target before experimenting. This prevents chasing unoptimizable
- Pageable memory H2D -> pinned memory buffer + non_blocking (20-40% faster H2D)
- Many small PyTorch ops -> Triton kernel or torch.compile fusion (10-100x for fused region)
- Blocking synchronize() -> event-based cross-stream sync (0.5-1ms per sync point; sketch after this list)
- Monolithic preprocessing kernel: pinned H2D + GPU resize (F.interpolate) + torch.compile fused normalize/permute/channel-swap (31% stage improvement, 2.47ms -> 1.70ms on RF-DETR)
- Monolithic postprocessing kernel: boolean class-lookup mask replacing torch.isin + cached GPU rescale tensors + inlined rescale ops (33% stage improvement, 1.11ms -> 0.74ms on RF-DETR)
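The event-based sync pattern as a minimal sketch (the `preprocess`/`model` calls and stream names are illustrative, not from the measured pipeline):

```python
import torch

s_pre = torch.cuda.Stream()
s_infer = torch.cuda.Stream()
done = torch.cuda.Event()

with torch.cuda.stream(s_pre):
    gpu_input = preprocess(batch)  # enqueue preprocessing kernels
    done.record(s_pre)             # mark completion on the producer stream

s_infer.wait_event(done)           # dependency resolved on-device; CPU never blocks
with torch.cuda.stream(s_infer):
    output = model(gpu_input)
```

The CPU only enqueues work; the GPU scheduler enforces the inter-stream ordering, which is what recovers the 0.5-1ms per sync point.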
**MEDIUM impact:**
- torch.sort/argsort on <1000 CPU elements -> numpy (2-5x faster)
- Hidden .item()/.cpu() D2H sync -> batch reads at end of pipeline (0.05-0.5ms per sync)
- Per-call torch.empty/zeros -> pre-allocated reusable buffer (0.05-0.2ms per alloc)
- `tensor.device == torch.device('cuda')` always False for cuda:0 -> use `.is_cuda` (snippet after this list)
- torch.isin on GPU -> pre-computed boolean lookup mask (13x faster in isolation, eliminates internal unique+searchsorted kernels)
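The device-comparison pitfall in a self-contained snippet (behavior follows `torch.device` equality semantics: `cuda` with index `None` never equals `cuda:0`):

```python
import torch

t = torch.zeros(4, device="cuda")          # allocates on cuda:0
print(t.device == torch.device("cuda"))    # False: cuda:0 vs cuda (index None)
print(t.is_cuda)                           # True: the correct check
print(t.device.type == "cuda")             # True: works for any device index
```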
## Reasoning Checklist
@@ -252,7 +256,9 @@ If top 3 remaining targets are all non-optimizable, **stop and report**.
### Strategy Rotation
3+ consecutive discards on the same type -> switch:
transfer optimization -> dispatch/fusion -> **monolithic stage fusion** -> synchronization -> compute/precision -> architectural
When individual micro-optimizations (transfer, dispatch) are discarded because savings are below the E2E noise floor, escalate to monolithic stage fusion: fuse the entire pre/post stage into a single compiled operation. See `../references/gpu/experiment-loop.md` for recipes.
## Diff Hygiene

View file

@@ -156,9 +156,29 @@ If 3+ consecutive discards on the same type, switch:
1. **Transfer optimization** (pinned memory, cached tensors, non_blocking) -> if exhausted:
2. **Dispatch/fusion** (Triton kernels, torch.compile, batched ops) -> if exhausted:
3. **Monolithic kernel fusion** (fuse entire pre/post stage into single compiled operation — see below) -> if exhausted:
4. **Synchronization** (event-based sync, stream overlap, pipeline parallelism) -> if exhausted:
5. **Compute/precision** (FP16, operator replacement, GPU-side preprocessing) -> if exhausted:
6. **Architectural** (batch pipeline stages differently, pipeline parallelism across frames)
### Monolithic kernel escalation
When individual micro-optimizations (steps 1-2) are discarded because savings are below the E2E noise floor or masked by stream overlap, escalate to monolithic stage fusion. Instead of optimizing individual ops, replace the entire pipeline stage with a single compiled operation.
**Preprocessing monolith recipe**:
1. Replace CPU resize with GPU `F.interpolate` (move compute to GPU)
2. Use pinned memory buffer + `non_blocking=True` for raw H2D (eliminate staging)
3. Wrap permute + resize + channel swap + normalize in `torch.compile(mode="reduce-overhead")`
4. Cache normalization tensors on GPU at init time
5. Make the pinned buffer thread-local if the model serves concurrent requests (sketch below)
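A minimal sketch of step 5, assuming a single fixed frame shape (class and attribute names are illustrative):

```python
import threading

import torch

class Preprocessor:
    def __init__(self, height, width):
        self._tls = threading.local()
        self._shape = (height, width, 3)  # HWC uint8 frame

    def _pinned_buf(self):
        # One pinned staging buffer per serving thread, created lazily,
        # so concurrent requests never race on the same staging memory.
        if not hasattr(self._tls, "buf"):
            self._tls.buf = torch.empty(
                self._shape, dtype=torch.uint8, pin_memory=True)
        return self._tls.buf
```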
**Postprocessing monolith recipe** (compact sketch of steps 1-3 after the list):
1. Replace `torch.isin()` with pre-computed boolean lookup mask (index by class ID)
2. Cache `inference_size_whwh`, pad offsets, and scale tensors on GPU (keyed by metadata identity)
3. Inline rescale: `boxes.sub_(offsets).div_(scale)` — in-place, no intermediate allocations
4. Keep data-dependent ops (confidence filter, sort) as-is — they can't be compiled
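A compact sketch of steps 1-3 (tensor and field names are illustrative; step 4's data-dependent filtering stays outside):

```python
import torch

def build_class_mask(remaining_ids, num_classes, device="cuda"):
    # Step 1: one-time boolean lookup table indexed directly by class ID.
    # Per call, valid[classes] replaces torch.isin(classes, remaining_ids).
    valid = torch.zeros(num_classes, dtype=torch.bool, device=device)
    valid[remaining_ids] = True
    return valid

def rescale_inplace(boxes_pct, inf_size_whwh, pad_offsets, scale):
    # Steps 2-3: the three constants are cached GPU tensors built once;
    # the arithmetic mutates in place after a single output allocation.
    boxes_abs = boxes_pct * inf_size_whwh
    return boxes_abs.sub_(pad_offsets).div_(scale)
```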
**When this works**: Individual fixes gave 13x micro-benchmark gains but 0% E2E because stream overlap or variance masked the savings. The monolithic approach combines ALL savings into one change that exceeds the noise floor. RF-DETR case: individual fixes = 0% E2E, monolithic = 8.3% E2E.
## Plateau Detection
@@ -207,5 +227,5 @@ commit target_test stage gpu_baseline_us gpu_optimized_us gpu_speedup e2e_baseli
- `stage`: pipeline stage (e.g., `preprocess`, `postprocess`, `sync`)
- `gpu_baseline_us` / `gpu_optimized_us`: stage-level timing in microseconds
- `e2e_baseline_ms` / `e2e_optimized_ms`: end-to-end timing in milliseconds
- `pattern`: antipattern tag (e.g., `loop-invariant-tensor`, `kernel-fusion`, `pageable-h2d`, `monolithic-preprocess`, `monolithic-postprocess`)
- `status`: `keep`, `discard`, or `crash`

View file

@@ -139,6 +139,18 @@ CPU Prep ──H2D──> GPU Preprocess ──> GPU Inference ──D2H──>
The model forward pass is usually the largest single cost, but it's typically already optimized (TensorRT, ONNX). **The surrounding pipeline stages and their transitions often account for 30-60% of total E2E latency** and are where Python-level optimization has the most impact.
### Monolithic kernel strategy
When individual micro-optimizations (tensor caching, pinned memory, etc.) show improvements in isolation but fail to register in E2E measurements — often because savings are below variance or masked by stream overlap — the monolithic approach fuses an entire pipeline stage into a single compiled operation. This eliminates ALL intermediate tensors, kernel launches, and dispatch gaps at once.
**Preprocessing monolith**: Replace (CPU resize + pageable H2D + permute + normalize + channel swap) with (pinned H2D + GPU `F.interpolate` + `torch.compile(mode="reduce-overhead")` fusing the rest). Moves compute onto GPU where it fuses into a single compiled kernel graph. Measured: 31% stage improvement on RF-DETR (2.47ms -> 1.70ms).
**Postprocessing monolith**: Replace (per-call `torch.tensor()` + `torch.isin()` + external rescale function) with (pre-computed boolean lookup mask + cached GPU tensors + inlined in-place rescale). Eliminates all per-call allocations and expensive internal kernels. Measured: 33% stage improvement on RF-DETR (1.11ms -> 0.74ms).
**When to use**: After individual transfer/dispatch/sync optimizations are discarded or show sub-threshold gains. The monolithic approach often succeeds where piecemeal fixes fail because the combined savings exceed the measurement noise floor.
**When NOT to use**: If the stage has data-dependent control flow that prevents compilation (variable-length outputs, dynamic shapes). Postprocessing's confidence filtering creates variable-size tensors — fuse around it, not through it.
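A sketch of "fuse around it, not through it": compile the fixed-shape math, keep the variable-size filter eager (function boundaries and threshold are illustrative):

```python
import torch

@torch.compile(mode="reduce-overhead")
def _fixed_shape_math(logits, boxes, denorm):
    # Static shapes in, static shapes out: safe to compile
    scores = logits.sigmoid()
    boxes_abs = boxes * denorm
    return scores, boxes_abs

def post_process(logits, boxes, denorm, conf_th=0.5):
    scores, boxes_abs = _fixed_shape_math(logits, boxes, denorm)
    keep = scores.amax(dim=-1) > conf_th    # data-dependent: output size varies
    return boxes_abs[keep], scores[keep]    # eager indexing, outside the graph
```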
## Decision Framework
### Step 1: Is the workload GPU-bound or CPU-bound?

View file

@@ -126,6 +126,120 @@ Alternative to Triton: `torch.compile(fullgraph=True)` can fuse many operations
---
**[dispatch] Monolithic fused preprocessing kernel** — Eliminate CPU resize + pageable H2D + scattered GPU ops
Symptom: Preprocessing performs CPU-side resize (cv2/PIL), then pageable H2D transfer, then separate GPU kernels for permute, normalize, channel swap, padding. torch.profiler shows 2-3ms total for preprocessing with multiple `aten::copy_` and small compute kernels. The CPU resize dominates but runs sequentially before GPU work.
```python
# BEFORE: CPU resize -> pageable H2D -> separate GPU ops (2.5ms total)
def pre_process(self, image_np):
    resized = cv2.resize(image_np, (target_w, target_h))   # 0.8ms CPU
    tensor = torch.from_numpy(resized).to(device)          # 0.5ms pageable H2D
    tensor = tensor.permute(2, 0, 1).unsqueeze(0).float()  # kernel 1
    tensor = tensor / 255.0                                # kernel 2
    tensor = (tensor - self.mean) / self.std               # kernels 3-4
    return tensor
```
```python
# AFTER: pinned H2D + GPU resize + torch.compile fused kernel (1.7ms total)
def __init__(self):
    self._pinned_buf = torch.empty((...), dtype=torch.uint8, pin_memory=True)
    self._fused_mean = torch.tensor(mean, device=device).view(1, 3, 1, 1)
    self._fused_std = torch.tensor(std, device=device).view(1, 3, 1, 1)
    self._compiled_preprocess = self._build_compiled_preprocess()

def _build_compiled_preprocess(self):
    target_h, target_w = self._target_h, self._target_w
    sf = self._scaling_factor

    @torch.compile(mode="reduce-overhead")
    def _kernel(t, mean_t, std_t):
        t = t.unsqueeze(0).permute(0, 3, 1, 2).float()
        t = F.interpolate(t, size=[target_h, target_w], mode="bilinear")
        t = t[:, [2, 1, 0], :, :]  # BGR->RGB
        t = t / sf
        return ((t - mean_t) / std_t).contiguous()

    return _kernel

def pre_process(self, image_np):
    self._pinned_buf.copy_(torch.from_numpy(image_np))        # fast CPU copy
    gpu_raw = self._pinned_buf.to(device, non_blocking=True)  # DMA H2D
    return self._compiled_preprocess(gpu_raw, self._fused_mean, self._fused_std)
```
Three techniques combined:
1. **Pinned memory buffer** (reusable, thread-local) for DMA H2D — eliminates driver staging copy
2. **GPU bilinear resize** (`F.interpolate`) — moves the resize off the CPU, eliminating the intermediate numpy -> tensor conversion
3. **`torch.compile(mode="reduce-overhead")`** — fuses permute + interpolate + channel swap + scale + normalize into a compiled kernel graph with CUDA graph replay under the hood
- Measured impact: 31% faster preprocessing (2.47ms -> 1.70ms on RF-DETR large, Tesla T4)
- E2E impact: compounds with postprocess optimization for 8.3% total E2E improvement
- Eligibility: fixed input size after resize (STRETCH_TO mode), standard normalization, single numpy image input
- Correctness: GPU bilinear interpolation differs slightly from CPU cv2.resize (max ~0.013 normalized). Verify detection output quality — typically no visible impact.
- Thread safety: pinned buffer must be thread-local if the model serves concurrent requests
- Warmup cost: first call triggers torch.compile JIT (~2-5s). Include in warmup iterations (sketch below).
- Why cProfile misses it: the CPU resize appears fast in Python, while H2D staging and kernel dispatch gaps happen below the Python layer, so the full cost of the preprocessing chain is invisible to it
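A typical warmup loop, assuming a `pre_process` entry point like the one above (frame size, iteration count, and the `model` handle are placeholders):

```python
import numpy as np
import torch

dummy = np.zeros((1080, 1920, 3), dtype=np.uint8)
for _ in range(5):            # first call pays the JIT; extras let CUDA graphs settle
    model.pre_process(dummy)
torch.cuda.synchronize()      # ensure warmup kernels actually completed
```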
---
**[dispatch] Monolithic fused postprocessing with cached tensors** — Eliminate per-call allocations + expensive torch.isin
Symptom: Postprocessing creates `torch.tensor()` / `torch.as_tensor()` for rescale constants on every call, uses `torch.isin()` for class remapping (which internally launches unique + searchsorted + multiple index kernels), and calls a general-purpose rescale function that allocates intermediates.
```python
# BEFORE: per-call tensor creation + torch.isin + external rescale (1.1ms)
def post_process(self, logits, boxes, image_meta):
    # ... confidence filtering ...
    if self._classes_re_mapping is not None:
        mask = torch.isin(classes, self._remaining_ids)  # expensive: unique+searchsorted
        classes = self._class_mapping[classes[mask]]
        boxes = boxes[mask]
    denorm = torch.tensor([w, h, w, h], device=device)   # per-call H2D
    boxes_abs = boxes_pct * denorm
    boxes_abs = rescale_image_detections(boxes_abs, meta)  # allocates pad/scale tensors
    return boxes_abs, classes, confidence
```
```python
# AFTER: boolean lookup + cached GPU tensors + inlined rescale (0.74ms)
def __init__(self):
    # Pre-compute boolean class mask (replaces torch.isin)
    max_id = int(self._class_mapping.shape[0])
    self._class_valid_mask = torch.zeros(max_id, dtype=torch.bool, device=device)
    self._class_valid_mask[self._remaining_ids] = True
    # Cache for rescale tensors (keyed by metadata identity)
    self._cached_meta_id = None
    self._cached_inf_size = None
    self._cached_offsets = None
    self._cached_scale = None

def post_process(self, logits, boxes, image_meta):
    # ... confidence filtering ...
    if self._classes_re_mapping is not None:
        mask = self._class_valid_mask[classes]  # O(1) boolean lookup
        classes = self._class_mapping[classes[mask]]
        boxes = boxes[mask]
    inf_size, offsets, scale = self._get_cached_tensors(meta)  # reuses GPU tensors
    boxes_abs = boxes_pct * inf_size
    boxes_abs.sub_(offsets).div_(scale)  # in-place, no alloc
    return boxes_abs, classes, confidence
```
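`_get_cached_tensors` is not shown in the diff; a plausible sketch of the identity-keyed cache it implies (metadata field names such as `inf_w` and `pad_left` are assumptions):

```python
def _get_cached_tensors(self, meta):
    # Rebuild only when the metadata object itself changes; id() keying
    # matches the "keyed by metadata identity" scheme described above.
    if self._cached_meta_id != id(meta):
        self._cached_meta_id = id(meta)
        self._cached_inf_size = torch.tensor(
            [meta.inf_w, meta.inf_h, meta.inf_w, meta.inf_h],
            dtype=torch.float32, device=device)
        self._cached_offsets = torch.tensor(
            [meta.pad_left, meta.pad_top, meta.pad_left, meta.pad_top],
            dtype=torch.float32, device=device)
        self._cached_scale = torch.tensor(
            [meta.scale] * 4, dtype=torch.float32, device=device)
    return self._cached_inf_size, self._cached_offsets, self._cached_scale
```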
Three techniques combined:
1. **Boolean class lookup mask** replaces `torch.isin` — pre-computes a bool tensor indexed by class ID. Avoids 4+ kernel launches (unique, searchsorted, index_select, gather).
2. **Cached GPU rescale tensors** — `inference_size_whwh`, pad offsets, and scale factors are cached on GPU and reused when metadata hasn't changed (keyed by `id(metadata)`).
3. **Inlined rescale** with in-place ops — replaces `rescale_image_detections()` which allocates pad/scale tensors per call. Uses `sub_()` and `div_()` to avoid intermediate allocations.
- Measured impact: 33% faster postprocessing (1.11ms -> 0.74ms on RF-DETR large, Tesla T4)
- torch.isin replacement alone: 13x faster in micro-benchmark (but masked by stream overlap in E2E)
- E2E impact: compounds with preprocess optimization for 8.3% total E2E improvement
- Correctness: Bitwise identical output — same operations, same order, just cached tensors and boolean indexing
- Thread safety: Cached tensors are immutable after creation; metadata keying uses `id()` which is safe across threads
- Pattern tag: `monolithic-postprocess`
---
**[sync] Blocking stream.synchronize() between dependent GPU stages** — CPU-GPU serialization
Symptom: CPU blocks for 0.5-2ms waiting for GPU work to complete before launching the next stage. In torch.profiler: high CPU time on `cudaStreamSynchronize`. In nsys: CPU idle period aligned with GPU work on another stream.
@@ -296,6 +410,8 @@ Caveat: Not all models are stable in FP16. Test numerical output equivalence.
| Loop-invariant tensor creation | Always (fixed per-call cost) |
| Pageable -> pinned memory | Transfer size > ~4 KB |
| Kernel fusion | >5 sequential kernel launches with <100us each |
| Monolithic preprocess kernel | Stage has >3 ops (resize+H2D+normalize+permute) totaling >1ms |
| Monolithic postprocess kernel | Stage has per-call tensor allocs + torch.isin + external rescale |
| Event-based stream sync | `synchronize()` blocks CPU for >0.3ms |
| Small-array CPU fallback | Array has <1000 elements on CPU-side torch |
| Device comparison bug | Always (correctness bug) |