diff --git a/docs/codeflash-concepts/benchmarking-gpu-code.mdx b/docs/codeflash-concepts/benchmarking-gpu-code.mdx
new file mode 100644
index 000000000..9a54e700c
--- /dev/null
+++ b/docs/codeflash-concepts/benchmarking-gpu-code.mdx
@@ -0,0 +1,118 @@
+---
+title: "How Codeflash Measures Code Runtime on GPUs"
+description: "Learn how Codeflash accurately measures code performance on GPUs"
+icon: "stopwatch"
+sidebarTitle: "GPU Runtime Measurement"
+keywords: ["benchmarking", "performance", "timing", "measurement", "runtime", "noise reduction", "GPU", "MPS"]
+---
+
+## Accurate Benchmarking on GPU Devices (NVIDIA GPUs and Mac Metal Performance Shaders)
+
+GPU operations execute **asynchronously**. The CPU queues up work for the GPU and immediately continues to the next line of code - it doesn't wait for the GPU to finish. Accurately measuring GPU code therefore requires inserting synchronization barriers so that no GPU work is still pending before the timer starts and none is still running when it stops.
+
+## Illustration
+
+### Without Synchronization
+
+```mermaid actions={false}
+%%{init: {'gantt': {'useWidth': 1200}}}%%
+gantt
+    title CPU vs CUDA Stream Timeline (Without Synchronization)
+    dateFormat X
+    axisFormat %s
+
+    section CPU
+    Timer Start :milestone, m1, 0, 0
+    Launch Kernel 1 :active, cpu0, 0, 4
+    Launch Kernel 2 :active, cpu1, 4, 8
+    Launch Kernel 3 :active, cpu2, 8, 12
+    Timer End :milestone, m2, 12, 12
+
+    section CUDA Stream
+    Waiting :done, wait, 0, 4
+    Kernel 1 :active, k1, 4, 11
+    Kernel 2 :active, k2, 11, 18
+    Kernel 3 :active, k3, 18, 25
+
+    section Problem
+    Timer ends too early :done, p1, after m2, 25
+```
+
+Here you can see that the timing statements only measure the duration up to the end of the final kernel launch. The GPU computation hasn't completed yet, so the measurement is inaccurate and would mislead any decision based on it.
+
+### With Synchronization
+
+```mermaid actions={false}
+%%{init: {'gantt': {'useWidth': 1200}}}%%
+gantt
+    title CPU vs CUDA Stream Timeline (With Synchronization)
+    dateFormat X
+    axisFormat %s
+
+    section CPU
+    Device Synchronization :done, wait, 0, 4
+    Timer Start :milestone, m1, 4, 4
+    Launch Kernel 1 :active, cpu0, 4, 8
+    Launch Kernel 2 :active, cpu1, 8, 12
+    Launch Kernel 3 :active, cpu2, 12, 16
+    Device Synchronization :done, wait, 16, 29
+    Timer End :milestone, m2, 29, 29
+
+    section CUDA Stream
+    Previous Work :done, wait, 0, 4
+    Kernel 1 :active, k1, 4, 11
+    Kernel 2 :active, k2, 11, 18
+    Kernel 3 :active, k3, 18, 29
+```
+
+Here you can see that a device synchronization call is made before executing the code. This ensures that the CPU waits for any pending GPU tasks to finish before starting the timer. After the final kernel is launched, another device synchronization call ensures that all pending GPU tasks have finished before the runtime is measured.
+
+
+## PyTorch Example
+
+Execute the following code in your Python interpreter to get the kernel launch time (replace `cuda` with `mps` everywhere to run on your Mac).
+```python
+import torch
+import time
+device = "cuda"
+x = torch.randn(8192, 8192, device=device)
+y = torch.randn(8192, 8192, device=device)
+t0 = time.perf_counter_ns()
+z = torch.matmul(x, y)
+t1 = time.perf_counter_ns()
+print(f"Without synchronize: {(t1 - t0) / 1e6:.3f} ms")
+```
+
+Now, **restart** your interpreter and execute the following code to get the kernel execution time (replace `cuda` with `mps` everywhere to run on your Mac).
+```python
+import torch
+import time
+device = "cuda"
+x = torch.randn(8192, 8192, device=device)
+y = torch.randn(8192, 8192, device=device)
+torch.cuda.synchronize() # clear any pending work
+t0 = time.perf_counter_ns()
+z = torch.matmul(x, y)
+torch.cuda.synchronize() # wait for GPU to finish
+t1 = time.perf_counter_ns()
+print(f"With synchronize: {(t1 - t0) / 1e6:.3f} ms")
+```
+
+Output on an NVIDIA GPU:
+
+```
+Without synchronize: 69.157 ms
+With synchronize: 152.277 ms
+```
+
+## How Codeflash Measures Execution Time Involving GPUs
+
+Codeflash automatically inserts synchronization barriers around the code it measures. It currently supports GPU code written in `PyTorch`, `TensorFlow`, and `JAX`.
+
+- **PyTorch**: Uses `torch.cuda.synchronize()` (NVIDIA GPUs) or `torch.mps.synchronize()` (macOS Metal Performance Shaders) depending on the device.
+- **JAX**: Uses `jax.block_until_ready()` to wait for computation to complete. It works for both CUDA and MPS devices.
+- **TensorFlow**: Uses `tf.test.experimental.sync_devices()` for device synchronization. It works for both CUDA and MPS devices.
+
+Codeflash will add support for ROCm and TPU devices in the near future.
\ No newline at end of file
diff --git a/docs/docs.json b/docs/docs.json
index 43949abb2..e81e21cd4 100644
--- a/docs/docs.json
+++ b/docs/docs.json
@@ -67,7 +67,8 @@
       "pages": [
         "codeflash-concepts/how-codeflash-works",
         "codeflash-concepts/benchmarking",
-        "support-for-jit/index"
+        "support-for-jit/index",
+        "codeflash-concepts/benchmarking-gpu-code"
       ]
     },
     {
diff --git a/docs/support-for-jit/index.mdx b/docs/support-for-jit/index.mdx
index d84a75912..2321609d8 100644
--- a/docs/support-for-jit/index.mdx
+++ b/docs/support-for-jit/index.mdx
@@ -1,70 +1,14 @@
---
-title: "Support for Just-in-Time Compilation"
+title: "Just-in-Time Compilation"
description: "Learn how Codeflash optimizes code using JIT compilation with Numba, PyTorch, TensorFlow, and JAX"
icon: "bolt"
sidebarTitle: "JIT Compilation"
-keywords: ["JIT", "just-in-time", "numba", "pytorch", "tensorflow", "jax", "GPU", "CUDA", "compilation", "performance"]
+keywords: ["JIT", "just-in-time", "numba", "pytorch", "tensorflow", "jax", "GPU", "CUDA", "MPS", "compilation", "performance"]
---

-# Support for Just-in-Time Compilation
+# Just-in-Time Compilation

-Codeflash supports optimizing numerical code using Just-in-Time (JIT) compilation via leveraging JIT compilers from popular frameworks including **Numba**, **PyTorch**, **TensorFlow**, and **JAX**.
-
-## Supported JIT Frameworks
-
-Each framework uses different compilation strategies to accelerate Python code:
-
-### Numba (CPU Code)
-
-Numba compiles Python functions to optimized machine code using the LLVM compiler infrastructure. Codeflash can suggest Numba optimizations that use:
-
-- **`@jit`** - General-purpose JIT compilation with optional flags.
-  - **`nopython=True`** - Compiles to machine code without falling back to the Python interpreter.
-  - **`fastmath=True`** - Uses aggressive floating-point optimizations via LLVM's fastmath flag.
-  - **`cache=True`** - cache compiled function to disk which reduces future runtimes.
-  - **`parallel=True`** - Parallelizes code inside loops.
-
-### PyTorch
-
-PyTorch provides JIT compilation through `torch.compile()`, the recommended compilation API introduced in PyTorch 2.0. It uses TorchDynamo to capture Python bytecode and TorchInductor to generate optimized kernels.
-
-- **`torch.compile()`** - Compiles a function or module for optimized execution.
-  - **`mode`** - Controls the compilation strategy:
-    - `"default"` - Balanced compilation with moderate optimization.
-    - `"reduce-overhead"` - Minimizes Python overhead using CUDA graphs, ideal for small batches.
-    - `"max-autotune"` - Spends more time autotuning to find the fastest kernels.
-  - **`fullgraph=True`** - Requires the entire function to be captured as a single graph. Raises an error if graph breaks occur, useful for ensuring complete optimization.
-  - **`dynamic=True`** - Enables dynamic shape support, allowing the compiled function to handle varying input sizes without recompilation.
-
-### TensorFlow
-
-TensorFlow uses `@tf.function` to compile Python functions into optimized TensorFlow graphs. When combined with XLA (Accelerated Linear Algebra), it can generate highly optimized machine code for both CPU and GPU.
-
-- **`@tf.function`** - Converts Python functions into TensorFlow graphs for optimized execution.
-  - **`jit_compile=True`** - Enables XLA compilation, which performs whole-function optimization including operation fusion, memory layout optimization, and target-specific code generation.
-
-### JAX
-
-JAX uses XLA to JIT compile pure functions into optimized machine code. It emphasizes functional programming patterns and captures side-effect-free operations for optimization.
-
-- **`@jax.jit`** - JIT compiles functions using XLA with automatic operation fusion.
-
-## How Codeflash Optimizes with JIT
-
-When Codeflash identifies a function that could benefit from JIT compilation, it:
-
-1. Rewrites the code in a JIT-compatible format, which may involve breaking down complex functions into separate JIT-compiled components.
-2. Generates appropriate tests that are compatible with JIT-compiled code, carefully handling data types since JIT compilers have stricter input type requirements.
-3. Disables JIT compilation while running coverage and tracer to get accurate coverage and trace information. Both of them rely on Python bytecode execution but JIT compiled code stops running as Python bytecode.
-4. Disables Line Profiler information collection whenever presented with JIT compiled code. It could be possible to disable JIT compilation and run the line profiler, but that would lead to inaccurate information which could misguide the optimization process.
-
-## Accurate Benchmarking on Non-CPU devices
-
-Since Non-CPU operations execute asynchronously, Codeflash automatically inserts synchronization barriers before measuring performance. This ensures timing measurements reflect actual computation time rather than just the time to queue operations:
-
-- **PyTorch**: Uses `torch.cuda.synchronize()` (NVIDIA GPUs) or `torch.mps.synchronize()` (MacOS Metal Performance Shaders) depending on the device.
-- **JAX**: Uses `jax.block_until_ready()` to wait for computation to complete.
-- **TensorFlow**: Uses `tf.test.experimental.sync_devices()` for device synchronization.
+Just-in-time (JIT) compilation is a runtime technique where code is compiled into machine code on the fly, right before it is executed, to improve performance. Codeflash optimizes numerical code by leveraging the JIT compilers of popular frameworks, including **Numba**, **PyTorch**, **TensorFlow**, and **JAX**.

## When JIT Compilation Helps

@@ -157,7 +101,7 @@ JIT compilation may not provide speedups when:

#### Function Definition

-```
+```python
def adaptive_processing(x, threshold=0.5):
    """Function with data-dependent control flow - compile struggles here"""
    # Check how many values exceed threshold (data-dependent!)
@@ -177,7 +121,7 @@ def adaptive_processing(x, threshold=0.5):

#### Benchmarking Snippet (replace `cuda` with `mps` to run on your Mac)

-```
+```python
# Create compiled version
adaptive_processing_compiled = torch.compile(adaptive_processing)

@@ -253,7 +197,6 @@ Optimized: 0.0277s
Speedup compared to Uncompiled: 1.57x
```

-
Key improvements:

1. Eliminate `.item()` - Keep computation on GPU.
@@ -261,6 +204,54 @@ Key improvements:
3. Vectorization - Replace conditionals with masked operations.
4. Reduce Python overhead - Minimize host-device synchronization.

+## Supported JIT Frameworks
+
+Each framework uses different compilation strategies to accelerate Python code (a short usage sketch follows these descriptions):
+
+### Numba (CPU Code)
+
+Numba compiles Python functions to optimized machine code using the LLVM compiler infrastructure. Codeflash can suggest Numba optimizations that use:
+
+- **`@jit`** - General-purpose JIT compilation with optional flags.
+  - **`nopython=True`** - Compiles to machine code without falling back to the Python interpreter.
+  - **`fastmath=True`** - Uses aggressive floating-point optimizations via LLVM's fastmath flag.
+  - **`cache=True`** - Caches the compiled function to disk, avoiding recompilation on future runs.
+  - **`parallel=True`** - Parallelizes code inside loops.
+
+### PyTorch
+
+PyTorch provides JIT compilation through `torch.compile()`, the recommended compilation API introduced in PyTorch 2.0. It uses TorchDynamo to capture Python bytecode and TorchInductor to generate optimized kernels.
+
+- **`torch.compile()`** - Compiles a function or module for optimized execution.
+  - **`mode`** - Controls the compilation strategy:
+    - `"default"` - Balanced compilation with moderate optimization.
+    - `"reduce-overhead"` - Minimizes Python overhead using CUDA graphs, ideal for small batches.
+    - `"max-autotune"` - Spends more time auto-tuning to find the fastest kernels.
+  - **`fullgraph=True`** - Requires the entire function to be captured as a single graph. Raises an error if graph breaks occur, which is useful for ensuring complete optimization.
+  - **`dynamic=True`** - Enables dynamic shape support, allowing the compiled function to handle varying input sizes without recompilation.
+
+### TensorFlow
+
+TensorFlow uses `@tf.function` to compile Python functions into optimized TensorFlow graphs. When combined with XLA (Accelerated Linear Algebra), it can generate highly optimized machine code for both CPU and GPU.
+
+- **`@tf.function`** - Converts Python functions into TensorFlow graphs for optimized execution.
+  - **`jit_compile=True`** - Enables XLA compilation, which performs whole-function optimization including operation fusion, memory layout optimization, and target-specific code generation.
+
+### JAX
+
+JAX uses XLA to JIT compile pure functions into optimized machine code. It emphasizes functional programming patterns and captures side-effect-free operations for optimization.
+
+- **`@jax.jit`** - JIT compiles functions using XLA with automatic operation fusion.
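+
+As a rough sketch of how these entry points are applied in practice, the snippet below decorates a made-up example function with Numba and JAX (it assumes `numba` and `jax` are installed and is illustrative, not Codeflash output); `torch.compile()` and `@tf.function` are applied analogously, as wrappers or decorators around the function to compile.
+
+```python
+import numpy as np
+from numba import jit
+import jax
+import jax.numpy as jnp
+
+@jit(nopython=True, fastmath=True, cache=True)
+def sum_of_squares_numba(values):
+    # Plain Python loop that Numba compiles to machine code via LLVM.
+    total = 0.0
+    for v in values:
+        total += v * v
+    return total
+
+@jax.jit
+def sum_of_squares_jax(values):
+    # Pure, side-effect-free function that XLA fuses into optimized kernels.
+    return jnp.sum(values * values)
+
+data = np.arange(1_000_000, dtype=np.float64)
+print(sum_of_squares_numba(data))             # first call compiles, later calls reuse the disk cache
+print(sum_of_squares_jax(jnp.asarray(data)))  # first call traces and compiles through XLA
+```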
+
+## How Codeflash Optimizes with JIT
+
+When Codeflash identifies a function that could benefit from JIT compilation, it:
+
+1. Rewrites the code in a JIT-compatible format, which may involve breaking down complex functions into separate JIT-compiled components.
+2. Generates appropriate tests that are compatible with JIT-compiled code, carefully handling data types since JIT compilers have stricter input type requirements.
+3. Disables JIT compilation while running coverage and the tracer to get accurate coverage and trace information. Both rely on Python bytecode execution, which JIT-compiled code bypasses.
+4. Disables Line Profiler information collection whenever presented with JIT-compiled code. It would be possible to disable JIT compilation and run the line profiler, but that would produce inaccurate information that could misguide the optimization process.
+
## Configuration

JIT compilation support is **enabled automatically** in Codeflash. You don't need to modify any configuration to enable JIT-based optimizations. Codeflash will automatically detect when JIT compilation could improve performance and suggest appropriate optimizations.
\ No newline at end of file