- 25 Apr, 2025 1 commit
Lei Wang authored
* [Refactor] Adjust layout inference calculations in Gemm and ParallelOp
  * Updated block size calculation in Gemm to account for the range of thread bounds, improving accuracy in layout inference.
  * Simplified layout conflict error messages in ParallelOp for better clarity, enhancing debugging experience.
  * Removed redundant buffer checks in ParallelOp layout inference logic, streamlining the code.
* [Refactor] Clean up layout inference logic in Gemm and ParallelOp
  * Removed unnecessary warning log in Gemm related to WGMMA conditions, streamlining the layout inference process.
  * Commented out redundant checks in ParallelOp's layout inference, improving code clarity while maintaining functionality.
  * Enhanced error messages in ParallelOp to provide clearer context for layout conflicts, aiding in debugging efforts.
* lint fix
* [Enhancement] Improve cumulative sum functionality and annotations handling
  * Updated the `cumsum` function to include detailed documentation and error handling for dimension bounds.
  * Modified the `run_cumsum` test to utilize a random tensor supply type for profiling, enhancing test robustness.
  * Added annotations to the fused loop in `loop_fusion_utils.h`, ensuring proper metadata is preserved during loop fusion.
* lint fix
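
The "error handling for dimension bounds" mentioned for `cumsum` might look roughly like the sketch below. This is an illustration only, not the code from the commit: the helper name `normalize_dim` is hypothetical, and accepting negative axes (counting from the last dimension) is an assumption.

```python
def normalize_dim(dim: int, ndim: int) -> int:
    """Map a possibly negative axis index onto [0, ndim), or raise ValueError.

    Hypothetical helper; the real check in TileLang may differ.
    """
    if not -ndim <= dim < ndim:
        raise ValueError(
            f"dim {dim} is out of bounds for a buffer with {ndim} dimension(s)"
        )
    # Negative axes count from the end (assumed convention).
    return dim + ndim if dim < 0 else dim
```

Rejecting a bad axis at the language-interface level, before lowering, keeps the failure close to the user's call site instead of surfacing later as a layout or codegen error.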
- 22 Apr, 2025 1 commit
Lei Wang authored
* [Feature] Implement CumSum operation in TileLang
  * Added CumSumOp class for cumulative sum operations, including argument validation and lowering logic.
  * Introduced CumSum2D template for CUDA, supporting both forward and reverse cumulative sums.
  * Created tests for CumSum functionality in shared memory and fragment contexts.
  * Updated language interface to include cumsum operation, enhancing the reduction capabilities of TileLang.
  * Refactored reduce.py to support cumsum functionality with appropriate memory allocation and copying mechanisms.
* lint fix
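
To make "forward and reverse cumulative sums" concrete, here is a small pure-Python reference for a 2D input. It is a sketch for checking results by hand, not the CUDA `CumSum2D` template: the function name `cumsum_2d_reference` is hypothetical, and treating "reverse" as a suffix sum (accumulating from the last element toward the first) is an assumption.

```python
from itertools import accumulate
from typing import List, Sequence


def cumsum_2d_reference(
    rows: Sequence[Sequence[float]], dim: int = 0, reverse: bool = False
) -> List[List[float]]:
    """Cumulative sum of a 2D list along axis `dim` (0 = down columns, 1 = along rows)."""
    if dim not in (0, 1):
        raise ValueError(f"dim must be 0 or 1 for a 2D input, got {dim}")

    # Arrange the data so that each inner list runs along the scan axis.
    lines = [list(col) for col in zip(*rows)] if dim == 0 else [list(r) for r in rows]

    scanned = []
    for line in lines:
        seq = line[::-1] if reverse else line
        acc = list(accumulate(seq))
        # Flip back so the output keeps the input's element order.
        scanned.append(acc[::-1] if reverse else acc)

    # Undo the transpose applied for dim == 0.
    return [list(col) for col in zip(*scanned)] if dim == 0 else scanned
```

For example, with `[[1, 2, 3], [4, 5, 6]]` the forward scan along `dim=1` gives `[[1, 3, 6], [4, 9, 15]]`, and the reverse scan along the same axis gives `[[6, 5, 3], [15, 11, 6]]`.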