    [Bugfix] Rename SM75_U16x8_LDSM_N into SM75_U16x8_LDSM_T for correctness (#499) · 2837878f
    Lei Wang authored
    * Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.
    
    * Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.
    
    * Add merge shared memory allocations pass and related configurations
    
    - Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
    - Registered configuration options for debugging and controlling the merging behavior.
    - Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
    - Adjusted import paths and added documentation for the new functionality.
    
    * Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py
    
    * Update Copy type in OperandTraits for GEMM templates to use conditional selection based on num_warp_n. This change enhances memory access patterns for different configurations in CUDA kernels.
    
    * lint fix
    
    * Update Copy type in OperandTraits for GEMM templates to use SM75_U16x4_LDSM_T and SM75_U16x8_LDSM_T for improved memory access patterns across CUDA architectures.
gemm_sm89.h 20.3 KB