    [Feature] Introduce Persistent Loop and Update GEMM Example (#563) · e7b97be2
    Yu Cheng authored
    * [Feature] Added Support for Synchronizing Grids and Persistent Threadblock Transformation
    
    - Defined the sync_grid operation in builtin.cc and builtin.h, allowing synchronization of all threads within a grid.
    - Implemented support for sync_grid in codegen_cuda.cc, ensuring proper handling of this operation in the generated CUDA code.
    - Added the PersistThreadblock transformation, enabling the conversion of thread blocks to persistent thread blocks, enhancing support for persistent kernels.
    - Updated relevant documentation and comments to reflect the addition of new features and usage instructions.
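The idea behind a persistent thread block can be sketched on the CPU. Rather than launching one block per tile, a fixed number of resident blocks each walk the tile space with a stride equal to the number of blocks. The helper below is a hypothetical illustration, not code from this commit: `tiles_for_persistent_block`, `block_id`, `num_blocks`, and `num_tiles` are assumed names, and the index mapping the PersistThreadblock pass actually emits may differ.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of the commit): lists the logical tile
// indices that a single persistent block would process.
std::vector<int> tiles_for_persistent_block(int block_id, int num_blocks, int num_tiles) {
    assert(num_blocks > 0 && block_id >= 0 && block_id < num_blocks);
    std::vector<int> tiles;
    // Non-persistent launch: tile t is handled by its own block t.
    // Persistent launch: a resident block strides over the tile space instead.
    for (int tile = block_id; tile < num_tiles; tile += num_blocks) {
        tiles.push_back(tile);
    }
    return tiles;
}
```

With 4 resident blocks and 10 tiles, block 1 would handle tiles 1, 5, and 9 under this assumed mapping, and every tile is covered exactly once across the four blocks.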
    
    * [Example] Add MLA Decode With Persistent Threadblock Example
    
    * [Feature] Introduce Persistent Loop and Update GEMM Example
    
    - Added a new persistent loop construct in the TIR framework, enabling more efficient kernel execution.
    - Updated the GEMM example to utilize the new persistent primitive, enhancing performance for matrix multiplication.
    - Introduced a `loop_break` intrinsic for better control flow within persistent loops.
    - Updated relevant files to support the new features, including changes in code generation and language interface.
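As a rough picture of how a persistent loop and an early exit could fit together in a GEMM, the following CPU model runs a fixed number of "blocks" over several waves of output tiles and leaves the loop once a block has no tile left. It is a sketch under stated assumptions: `persistent_gemm`, the wave-major tile ordering, and the use of a plain `break` to stand in for `loop_break` are illustrative choices, not the lowering this commit implements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU model of a persistent GEMM: C (M x N) = A (M x K) * B (K x N).
// Matrices are row-major; tile sizes are assumed to divide M and N evenly.
std::vector<float> persistent_gemm(const std::vector<float>& A, const std::vector<float>& B,
                                   int M, int N, int K, int tile_m, int tile_n,
                                   int num_blocks) {
    assert(M % tile_m == 0 && N % tile_n == 0 && num_blocks > 0);
    std::vector<float> C(static_cast<std::size_t>(M) * N, 0.0f);
    const int tiles_n = N / tile_n;
    const int num_tiles = (M / tile_m) * tiles_n;
    const int waves = (num_tiles + num_blocks - 1) / num_blocks;  // ceiling division

    for (int block = 0; block < num_blocks; ++block) {  // stands in for the resident blocks
        for (int wave = 0; wave < waves; ++wave) {      // the persistent loop
            const int tile = wave * num_blocks + block;
            if (tile >= num_tiles) break;               // role assumed for loop_break
            const int row0 = (tile / tiles_n) * tile_m; // top-left corner of this tile
            const int col0 = (tile % tiles_n) * tile_n;
            for (int i = row0; i < row0 + tile_m; ++i) {
                for (int j = col0; j < col0 + tile_n; ++j) {
                    float acc = 0.0f;
                    for (int k = 0; k < K; ++k) acc += A[i * K + k] * B[k * N + j];
                    C[i * N + j] = acc;
                }
            }
        }
    }
    return C;
}
```

In this model the tile count need not be a multiple of the block count: with 4 tiles and 3 blocks, blocks 1 and 2 hit the early exit in the second wave while block 0 computes the last tile.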
    
    * lint fix
builtin.cc 4.8 KB