1. 17 Dec, 2025 1 commit
    • [Enhancement] Update examples and tests for improved type handling functionality (#1448) · c750fb8a
      Lei Wang authored
      * [Enhancement] Update examples and tests for improved type handling and functionality
      
      - Enhanced various example scripts to support new data types and improve compatibility with PyTorch.
      - Updated tests across multiple modules to ensure correct functionality with the latest changes in type handling.
      - Refactored code in examples to streamline operations and improve clarity, particularly in tensor operations and memory management.
      - Added comprehensive tests for new features and fixed existing issues related to type conversions and buffer handling.
      
      * [Refactor] Update accumulation data type to float32 across examples
      
      - Changed accumulation data type from "float" to T.float32 in multiple example scripts to ensure consistency and improve numerical stability.
      - This update affects various modules including flash attention, GEMM analysis, convolution, and deepseek MLA examples, enhancing type handling across the board.
      
      * [Refactor] Standardize data type usage across benchmark scripts
      
      - Updated data type definitions in benchmark scripts to use T.float16 and T.float32 consistently, enhancing clarity and type handling.
      - Adjusted dtype assignments in matmul functions and configuration setups to align with the new standard.
      - Improved overall code consistency and maintainability by ensuring uniform data type usage across various modules.
      
      * [Refactor] Standardize data type usage in templates and scripts
      
      - Updated data type definitions in various templates and scripts to use string representations (e.g., "float16", "int32") instead of T.float16 and T.int32 for improved consistency and clarity.
      - Enhanced overall code maintainability by ensuring uniform data type usage across multiple modules, including convolution, elementwise operations, and matrix multiplication templates.
      - This change aims to streamline type handling and improve compatibility with existing workflows.
      
      * [Refactor] Standardize data type usage in examples and benchmarks
      
      - Updated data type definitions in various example and benchmark scripts to use T.float16 and T.int32 consistently, enhancing clarity and maintainability.
      - Adjusted dtype assignments in kernel functions and configuration setups to align with the new standard.
      - Improved overall code consistency by ensuring uniform data type usage across multiple modules, including attention mechanisms, matrix multiplication, and GEMM examples.
      
      * [Refactor] Import dtypes from language.v2 module
      
      - Added import statement for dtypes from the language.v2 module to enhance type handling and maintain consistency across the codebase.
      - This change aims to streamline data type management and improve overall code clarity.
      
      * fix
      
      * [Refactor] Standardize data type usage across scripts
      
      - Updated data type definitions in various scripts to use string representations (e.g., "float16", "int8") instead of T.float16 and T.int8 for improved consistency and clarity.
      - Adjusted dtype assignments in functions and configuration setups to align with the new standard, enhancing overall code maintainability.
      - This change affects multiple modules, including benchmark and attention mechanisms, ensuring uniform data type usage throughout the codebase.
      
      * [Refactor] Update data type handling for consistency and clarity
      
      - Changed string representations of data types in the Hint class to use T.float32 and T.int32 for improved consistency.
      - Added new data types "int4" and "int16" to the dtypes module, enhancing type support across the codebase.
      - Updated function signatures and assertions in the lop3 and mxfp modules to utilize the new data types, ensuring uniformity in type handling.
      - This refactor aims to streamline data type management and improve overall code clarity and maintainability.
      
      * [Enhancement] Improve data type handling and error messaging
      
      - Introduced a mapping for canonical data types to their display strings, enhancing clarity in type representation.
      - Updated the dtype creation logic to utilize the new mapping, ensuring more intuitive handling of string inputs.
      - Refined error messages in the lop3 module to provide clearer feedback on invalid source formats, improving debugging and user experience.
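
The string handling described in this entry could be sketched roughly as follows. This is a minimal illustration, not the dtypes module itself: the alias table, the set of known names, and the function name `canonical_dtype_name` are all hypothetical. Only the "float" to float32 correspondence and the int4/int16 names come from the log above.

```python
# Hypothetical alias table: maps loose spellings to one canonical name.
# The real table in the dtypes module may differ.
_ALIASES = {"float": "float32", "half": "float16", "int": "int32"}

# Illustrative subset of dtype names mentioned in this commit log.
_KNOWN = frozenset(
    {"bool", "int4", "int8", "int16", "int32", "float16", "bfloat16", "float32"}
)


def canonical_dtype_name(name: str) -> str:
    """Return the canonical spelling for a dtype string, or raise ValueError."""
    key = name.strip().lower()
    key = _ALIASES.get(key, key)  # fall through unchanged if not an alias
    if key not in _KNOWN:
        raise ValueError(
            f"unknown dtype {name!r}; expected one of {sorted(_KNOWN)}"
        )
    return key


if __name__ == "__main__":
    print(canonical_dtype_name("float"))
    print(canonical_dtype_name("int16"))
```

Normalizing once at the point where a dtype is created keeps later comparisons simple, since every downstream check sees a single spelling per type.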
      
      * [Fix] Correct boolean flag in GEMM SP test case
      
      - Updated the boolean flag in the test_gemm_sp_sm90 function to ensure proper functionality in the test case.
      - This change enhances the accuracy of the test and aligns it with expected behavior for the GEMM SP implementation.
      
      * [Refactor] Standardize data type usage across scripts
      
      - Updated data type definitions in various scripts to use T.float16 and T.bfloat16 consistently, enhancing clarity and maintainability.
      - Adjusted dtype assignments in function signatures and argument parsing to align with the new standard, ensuring uniform data type usage throughout the codebase.
      - This change affects multiple modules, including benchmarks and examples, improving overall code consistency and readability.
      
      * [Refactor] Standardize data type usage in various modules
      
      - Updated data type assignments in multiple scripts to utilize T.float32, T.int8, and T.int32 consistently, enhancing clarity and maintainability.
      - Adjusted function signatures and parameter types across benchmarks, examples, and tests to align with the new standard, ensuring uniform data type usage throughout the codebase.
      - This change improves overall code consistency and readability, impacting modules related to matrix multiplication, GEMM, and tensor operations.
      
      * [Refactor] Update argument parsing for data types in benchmarks
      
      - Changed argument parsing for data types in benchmark_matmul_intrinsic.py and benchmark_matmul_sp.py to use string representations ("float16", "int8", "float") instead of T.float16 and T.float.
      - This update enhances consistency in data type handling across benchmark scripts, improving clarity and maintainability.
      
      * [Refactor] Update data type handling in benchmark and example scripts
      
      - Changed data type arguments in benchmark and example scripts to use string representations ("float16") instead of T.float16 for improved consistency.
      - Updated function signatures and argument parsing to align with the new standard, enhancing clarity and maintainability across the codebase.
      - This change affects multiple modules related to attention mechanisms and tensor operations, ensuring uniform data type usage throughout the examples.
      
      * [Refactor] Fix data type conversion in multiple scripts
      
      - Corrected the usage of the data type conversion method from dtype..as_torch() to dtype.as_torch() across various benchmark and example scripts.
      - This change enhances consistency in data type handling and improves code readability, impacting modules related to attention mechanisms and tensor operations.
      
      * [Refactor] Update float8 data type usage across multiple scripts
      
      - Changed instances of T.float8_e4m3 to T.float8_e4m3fn in various benchmark, example, and test scripts to ensure consistency in data type handling.
      - This update enhances clarity and maintainability across the codebase, particularly in modules related to matrix multiplication and tensor operations.
      
      * [Refactor] Enhance float8 data type handling in CUDA code generation
      
      - Updated the handling of float8 data types in the CUDA code generation to include additional float8 variants, improving type conversion logic.
      - Adjusted conditions to ensure proper type checks for float8 conversions, enhancing clarity and maintainability in the codebase.
      - Modified layout inference to streamline float8 type checks, ensuring consistency across the implementation.
      - This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.
      
      * [Refactor] Streamline float8 data type handling in CUDA and related modules
      
      - Enhanced float8 data type handling in CUDA code generation by refining type conversion logic and ensuring consistent type checks.
      - Updated layout inference for float8 types to improve clarity and maintainability across the implementation.
      - This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.
      
      * [Refactor] Remove unnecessary cache disabling in float8 example script
      
      - Eliminated the call to tilelang.disable_cache() in example_group_per_split_token_cast_to_fp8.py to streamline the code.
      - This change enhances clarity and maintainability of the example script without affecting its functionality.
      
      * [Refactor] Update data type usage in debug print tests
      
      - Changed the argument for dtype in the test_debug_print_buffer function from a string representation to the corresponding T.bool type.
      - This update enhances consistency in data type handling within the test suite, improving clarity and maintainability.
      
      * lint fix
      
      * Update function parameter types from `str` to `T.dtype` for improved type safety in attention sink and related examples
      
      * Refactor `gemv_alloc_reducer` function signature for improved readability by formatting parameters across multiple lines.
  2. 12 Dec, 2025 1 commit
  3. 18 Nov, 2025 1 commit
    • [FFI] Use tvm ffi as the default execution backend (#1259) · 74da3696
      Lei Wang authored
      * [Refactor] Update FFI type handling and simplify argument management
      
      - Refactored FFI type definitions in runtime and code generation files to use `TVMFFIAny` instead of `TVMValue`, enhancing type clarity.
      - Updated function registration in `runtime.cc` to utilize canonical names for better consistency.
      - Simplified argument handling in the `simplify` transformation, ensuring unused buffer parameters are removed only when simplification is enabled.
      - Adjusted autotuner and profiler parameters to standardize the execution backend to `tvm_ffi`, improving clarity in backend selection.
      - Removed obsolete `adapt_torch2tvm` function from tensor utilities to streamline the codebase and reduce complexity.
      
      * [Update] Sync TVM submodule and enhance kernel source handling
      
      - Updated the TVM submodule to commit cdc2aced, ensuring compatibility with recent changes.
      - Added functionality to print kernel source in `example_blocksparse_gemm.py` for better debugging.
      - Commented out the main execution call in test files to prevent unintended execution during testing.
      - Introduced `tilelang.disable_cache()` in various test files to streamline testing and avoid cache-related issues.
      - Refactored kernel source retrieval methods to improve clarity and consistency across different execution backends.
      
      * [Refactor] Clean up imports and improve code formatting
      
      - Removed unused import of `tilelang.testing` in `test_example_blocksparse_gemm.py` to streamline the code.
      - Reformatted several lines in `arg_binder.cc`, `make_packed_api.cc`, `tvm_ffi.py`, and `adapter.py` for improved readability and consistency.
      - Updated comments and spacing in `tvm_ffi.py` to enhance clarity without altering functionality.
      
      * Update execution backend options and improve resolution logic
      
      - Changed default execution backend from "cython" to "auto" in multiple locations to allow automatic selection based on the target.
      - Expanded the list of supported execution backends to include "torch" and "nvrtc" across various classes and functions.
      - Enhanced backend resolution logic in `KernelCache` and `AutoTuner` to ensure appropriate backend selection based on the target.
      - Updated documentation to reflect changes in execution backend options and their defaults.
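
The "auto" resolution described in this entry might look something like the sketch below. It is an assumption-laden illustration: the function name `resolve_execution_backend` and the per-target preference table are hypothetical, and the actual logic in `KernelCache` and `AutoTuner` may choose differently. The backend names themselves are the ones listed in this log.

```python
# Backend names mentioned in this commit log.
SUPPORTED_BACKENDS = ("auto", "tvm_ffi", "cython", "nvrtc", "torch")

# Assumed per-target preference, for illustration only.
_DEFAULT_FOR_TARGET = {"cuda": "tvm_ffi", "hip": "tvm_ffi", "metal": "torch"}


def resolve_execution_backend(requested: str = "auto", target: str = "cuda") -> str:
    """Turn a requested backend into a concrete one.

    An explicit request is returned unchanged; "auto" is resolved from the
    target, falling back to "tvm_ffi" for targets not in the table.
    """
    if requested not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"unsupported execution backend {requested!r}; "
            f"expected one of {SUPPORTED_BACKENDS}"
        )
    if requested != "auto":
        return requested
    return _DEFAULT_FOR_TARGET.get(target, "tvm_ffi")


if __name__ == "__main__":
    print(resolve_execution_backend("auto", "cuda"))
    print(resolve_execution_backend("nvrtc", "cuda"))
```

Resolving "auto" in one helper means the cache and the tuner agree on the concrete backend, which matters if the backend name is part of a cache key.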
      
      * lint fix
      
      * fix
      
      * Enhance argument handling in CUDA and HIP runtime modules
      
      - Updated `ExtractFuncInfo` in `rt_mod_cuda.cc` and `rt_mod_hip.cc` to map boolean argument types to int32, ensuring compatibility with device runtime.
      - Refactored `BindDLTensor` in `arg_binder.cc` to improve null handling and validation checks for DLTensor parameters, utilizing expression-level guards to prevent dereferencing null pointers.
      - Enhanced error checking for buffer shape, strides, and data fields, ensuring robust handling of optional inputs and maintaining consistency across various checks.
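
The boolean-to-int32 mapping in the first bullet amounts to a substitution over the kernel's argument type list. The sketch below is a Python stand-in for that idea, not the C++ `ExtractFuncInfo` code; the function name `device_arg_types` and the use of plain strings for types are assumptions made for illustration.

```python
from typing import List


def device_arg_types(arg_types: List[str]) -> List[str]:
    """Replace bool argument types with int32; leave other types untouched."""
    return ["int32" if t == "bool" else t for t in arg_types]


if __name__ == "__main__":
    print(device_arg_types(["handle", "bool", "int64"]))
```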
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * minor fix
      
      * fix
      
      * recover check
      
      * Refactor argument binding and validation in `arg_binder.cc`
      
      - Improved null handling and validation checks in `BindDLTensor`, ensuring safe dereferencing of pointers.
      - Enhanced consistency checks for buffer shape, strides, and data fields, utilizing expression-level guards.
      - Updated `MakePackedAPI` to maintain code clarity and consistency in argument handling.
      - Minor adjustments in test files to streamline kernel execution and improve readability.
      
      * lint fix
      
      * stride fix
      
      * minor fix
      
      * fix
      
      * lint fix
      
      * lint fix
      
      * Add CUDA stream access policy window helpers and integrate with L2 persistent cache management
      
      - Introduced functions to set and reset the CUDA stream access policy window, allowing for better control over L2 cache usage.
      - Updated runtime files to include new FFI packed functions for managing stream attributes.
      - Modified lower_hopper_intrin to incorporate prologue and epilogue statements for L2 cache setup and teardown.
      - Enhanced tests to verify the inclusion of new FFI calls in the generated kernel source.
      
      * check with symbolic
      
      * support null ptr
      
      * Update CMakeLists and lower.py for code generation and subproject status
      
      - Added `codegen_c_host.cc` to the list of source files in CMakeLists.txt for improved code generation support.
      - Updated the function call in `lower.py` to use `target.build.tilelang_c` for C target host code generation, enhancing compatibility.
      - Marked the TVM subproject as dirty to indicate local modifications.
      
      * lint fix
      
      * Update comments for clarity in quickstart.py
  4. 02 Oct, 2025 1 commit
    • [Layout] Strict annotate completed replicated layout for fragment with constant index (#929) · fc4bd452
      Lei Wang authored
      * [Layout] Add IsCompletedReplicated method and enhance layout inference in ParallelOpNode
      
      - Introduced IsCompletedReplicated method in FragmentNode to check if a buffer is fully replicated.
      - Enhanced InferLayout in ParallelOpNode to handle layout inference for replicated buffers, ensuring only fragment[0] access is allowed.
      - Updated error handling for non-zero index access in fragment buffers to improve robustness.
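
The access rule in the second bullet can be modeled loosely as a check over constant indices. The sketch below is only an illustration in Python: the function name `check_replicated_fragment_access` is hypothetical, and the actual check in `ParallelOpNode`'s `InferLayout` works on IR expressions and may differ in detail.

```python
from typing import Sequence


def check_replicated_fragment_access(buffer_name: str, indices: Sequence[int]) -> None:
    """Raise ValueError unless every constant index is 0.

    Models the rule that a fully replicated fragment may only be
    accessed at element 0.
    """
    for dim, idx in enumerate(indices):
        if idx != 0:
            raise ValueError(
                f"{buffer_name}: index {idx} in dimension {dim} is not allowed; "
                "a fully replicated fragment may only be accessed at 0"
            )


if __name__ == "__main__":
    check_replicated_fragment_access("acc", [0])  # passes silently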
      
      * [Layout] Improve code formatting and readability in layout.cc and parallel.cc
      
      - Enhanced formatting in FragmentNode's IsCompletedReplicated method for better clarity.
      - Updated InferLayout method in ParallelOpNode to improve code readability by adjusting line breaks and indentation.
      - Ensured consistent formatting across conditional statements and comments for improved maintainability.
      
      * update
      
      * optimize const index related op
      
      * bug fix
      
      * reduce gdn test
      
      * test fix
      
      * lint fix
      
      * lint fix
      
      * test fix
  5. 07 Aug, 2025 1 commit
    • Gated Delta Net(GDN) kernel implementation in TileLang (#695) · 6f59668d
      Zhengju Tang authored
      * [GDN] Add examples for GDN forward and backward kernels
      
      * [Refactor] Folder structure refactor for duplicated utils
      
      * [Test] Add test script for kernels
      
      * [Refactor] Rename examples to align with the repo
      
      * [Lint] Modify README
      
      * [Update] Modified README to align upstream repo
      
      * [BugFix] Path of FLA
      
      * [Fix] Copyright and test
      
      * [Lint]
      
      * [CI] Add GDN compilation test CI
      
      * [Lint]
      
      * [BugFix] Import error of fla