- 18 Nov, 2025 1 commit
-
-
Jay Zhuang authored
* fix argument order for fla chunk_gated_delta_rule_fwd_h * explicit import assert_similar from utils * rename utils module to avoid name clash * set store_final_state and save_new_value to True * fix --------- Co-authored-by:LeiWang1999 <leiwang1999@outlook.com>
-
- 14 Nov, 2025 1 commit
-
-
Zhengju Tang authored
* [BugFix] Add autotune and exp2 for GDN kernel * [Lint] * [Lint]
-
- 03 Nov, 2025 1 commit
-
-
Kurisu authored
* tilelang frontend v2 * syntax sugar: defining a local var by annotation * [Refactor] fix type linting warning like `T.float32` * Add tl.local_var_init for new tl.float32 * allow passing default argument as function annotation * allow default arguments as annotation * fix lint error * minor fix * [Refactor] refactor tilelang.jit and tilelang.autotune * minor fix * minor fix * minor fix * fix metal get function name * add par_compile impl and tests * Type consistency on tvm datatype 1. isinstance(tl.float32, tvm.DataType) == True 2. Allow `tl.float32` as function annotations 3. Allow `tl.float32` as argument to be passed to `tl.alloc` or other functions * fix lint error * add more warning in frontend * update tvm version * Minor fix on tvm_ffi annotations * add document and examples * fix lint error * Simplify index calculations in example_chunk_o_bwd.py Refactor index calculations for dg_last_fragment assignment. * minor fix * lint fix --------- Co-authored-by:
Lei Wang <leiwang1999@outlook.com> Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 21 Oct, 2025 1 commit
-
-
Tong WU authored
* [Cleanup] Remove `tilelang.disable_cache()` calls from example scripts * lint * lint
-
- 14 Oct, 2025 1 commit
-
-
Xuehai Pan authored
Co-authored-by:LeiWang1999 <leiwang1999@outlook.com>
-
- 02 Oct, 2025 1 commit
-
-
Lei Wang authored
* [Layout] Add IsCompletedReplicated method and enhance layout inference in ParallelOpNode - Introduced IsCompletedReplicated method in FragmentNode to check if a buffer is fully replicated. - Enhanced InferLayout in ParallelOpNode to handle layout inference for replicated buffers, ensuring only fragment[0] access is allowed. - Updated error handling for non-zero index access in fragment buffers to improve robustness. * [Layout] Improve code formatting and readability in layout.cc and parallel.cc - Enhanced formatting in FragmentNode's IsCompletedReplicated method for better clarity. - Updated InferLayout method in ParallelOpNode to improve code readability by adjusting line breaks and indentation. - Ensured consistent formatting across conditional statements and comments for improved maintainability. * updt * optimize const index related op * bug fix * reduce gdn test * test fix * lintfix * lint fix * test fix
-
- 13 Sep, 2025 1 commit
-
-
Yichen Yan authored
* update lint config * Remove spaces for blank line * update
-
- 28 Aug, 2025 1 commit
-
-
Zhengju Tang authored
* [Feature] Add 1D TMA support - Check the contiguous conditions of 1D TMA copy - Add new interface and params order of `tma_load` and `tma_store` call - Add 1D `tma_store` interface in sm90 template - Add elementwise kernel for 1D TMA example * [Lint] * [BugFix] Add conditions for 1D TMA copy on non-swizzle shared tensors * [Lint] * [BugFix] 1D TMA load * [README] Update GDN README for clarity and add acknowledgements (#758) - Improved formatting and clarity of the GDN kernel implementation description. - Updated requirement section to list dependencies in a clearer format. - Added an acknowledgements section to credit the developers and the Xiaomi LLM-Core Team for their contributions. * cutlass v4.2.0 supporting cuda 13 (#760) * [Lint] * [Lint] * [MXFP4] Add test for bf16&mxfp4 gemm * [BugFix] * [Lint] --------- Co-authored-by:
Yu Cheng <54519279+chengyupku@users.noreply.github.com> Co-authored-by:
Johnny <johnnync13@gmail.com>
-
- 25 Aug, 2025 1 commit
-
-
Yu Cheng authored
- Improved formatting and clarity of the GDN kernel implementation description. - Updated requirement section to list dependencies in a clearer format. - Added an acknowledgements section to credit the developers and the Xiaomi LLM-Core Team for their contributions.
-
- 22 Aug, 2025 1 commit
-
-
Lei Wang authored
* [Refactor] Merge bulk copy into copy and refactor layout inference for bulk copy * Deleted the `bulk_copy` operator implementation and its header file as it is no longer needed. * Introduced a new function `cuTensorMapType()` to return the data type for CUDA tensor mapping. * Updated related files to reflect these changes, ensuring that the codebase remains clean and maintainable. * lint fix * Fix typos in intrinsic names and remove unused print statement in block_sparse_attn_tilelang.py. Updated references from `ptx_ldmatirx` to `ptx_ldmatrix` across multiple files for consistency. * remove bulk copy * Refactor copy and atomic add operations to support TMA lower configuration - Updated `GetCopyInst` to accept a `disable_tma_lower` parameter, allowing for conditional usage of TMA in bulk load/store operations. - Modified `Lower` method in `Copy` to incorporate the new TMA configuration. - Refactored `AtomicAdd::Lower` to streamline layout inference and vectorization logic. - Removed unused `disable_tma_lower` field from `LowerArgs` structure for clarity. - Enhanced atomic add vectorization by replacing the buggy implementation with a more robust loop vectorization approach. * Enhance TMA bulk copy logic in `LowerBulkCopy` method - Added a condition to set `desc.swizzle` to `CU_TENSOR_MAP_SWIZZLE_NONE` when `shared_layout` matches `linear_layout`, improving clarity in layout handling. - Updated warning log to provide more detailed information about fallback scenarios, including source and destination buffer names and shapes, enhancing debugging capabilities. * lint fix * Remove fallback logging for non-swizzled global layout in `LowerBulkCopy` method to streamline the bulk copy logic. This change enhances code clarity by eliminating unnecessary warning messages related to inner box dimensions. * Enhance reshape kernel compilation in `run_reshape` and `run_reshape_smem_1d_2_2d` functions - Updated the `tl.compile` method to include `pass_configs` that disable TMA lower and warp specialization, addressing shared memory layout transformation limitations. - Added TODO comments to indicate the need for further improvements in shared memory handling. * Update `native_sparse_attention` function to include TMA configuration options - Added `pass_configs` to the JIT decorator to disable TMA lower and warp specialization, addressing potential issues with shared memory layout transformations. - Updated comments to clarify modifications in tensor shapes for inference, specifically setting `q` sequence length to 1. * Refactor JIT decorator formatting in `native_sparse_attention` function - Improved readability by reformatting the JIT decorator parameters for `native_sparse_attention`, ensuring consistent style across the codebase. - No functional changes were made; this update focuses on code clarity and maintainability. * Enhance thread management and logging in TileLang compilation - Added a method to check if printing is enabled during compilation, improving control over logging behavior. - Updated the JIT kernel class to utilize the new method for logging compilation status, ensuring consistent and clear output. - Added comments to clarify the purpose of changes and improve code readability. * Add warp specialization scope and refactor register management in TileLang - Introduced a new constant `kWarpSpecializationScope` in `builtin.h` for better attribute management. - Removed the `SetMaxNRegCollector` class and its related logic from `warp_specialized_rewriter.cc`, streamlining the warp specialization process. - Added functions `annotate_producer_reg_dealloc` and `annotate_consumer_reg_alloc` in `builtin.py` to facilitate register management. - Implemented `AnnotateWarpGroupRegAlloc` in `__init__.py` to inject register allocation calls into warp-specialized functions, enhancing the overall register handling in the compilation process. * Refactor test for InjectSetMaxNReg pass in TileLang - Improved readability by restructuring conditional checks and assertions in the test cases. - Enhanced clarity in the collection of `set_max_nreg` calls by simplifying the logic. - Ensured consistent formatting and spacing throughout the test functions for better maintainability. * Enhance bulk copy and store checks in `Copy` class - Updated scope validation for source and destination tensors in `CheckBulkLoad` and `CheckBulkStore` methods to include both `shared.dyn` and `shared` as valid options. - Modified `CheckLDSMCopy` and `CheckSTSMCopy` methods to accommodate the new scope validation, ensuring compatibility with shared memory configurations. - Improved logging in `LowerBulkCopy` to provide clearer warnings regarding unsupported swizzle layouts, including source and destination names for better debugging. * lint fix
-
- 07 Aug, 2025 1 commit
-
-
Zhengju Tang authored
* [GDN] Add examples for GDN forward and backward kernels * [Refactor] Folder structure refactor for duplicated utils * [Test] Add test script for kernels * [Refactor] Rename examples to align with the repo * [Lint] Modify README * [Update] Modified README to align upstream repo * [BugFix] Path of FLA * [Fix] Copyright and test * [Lint] * [CI] Add GDN compilation test CI * [Lint] * [BugFix] Import error of fla
-