1. 17 Dec, 2025 1 commit
• [Enhancement] Update examples and tests for improved type handling functionality (#1448) · c750fb8a
      Lei Wang authored
      * [Enhancement] Update examples and tests for improved type handling and functionality
      
      - Enhanced various example scripts to support new data types and improve compatibility with PyTorch.
      - Updated tests across multiple modules to ensure correct functionality with the latest changes in type handling.
      - Refactored code in examples to streamline operations and improve clarity, particularly in tensor operations and memory management.
      - Added comprehensive tests for new features and fixed existing issues related to type conversions and buffer handling.
      
      * [Refactor] Update accumulation data type to float32 across examples
      
      - Changed accumulation data type from "float" to T.float32 in multiple example scripts to ensure consistency and improve numerical stability.
      - This update affects various modules including flash attention, GEMM analysis, convolution, and deepseek MLA examples, enhancing type handling across the board.
      
      * [Refactor] Standardize data type usage across benchmark scripts
      
      - Updated data type definitions in benchmark scripts to use T.float16 and T.float32 consistently, enhancing clarity and type handling.
      - Adjusted dtype assignments in matmul functions and configuration setups to align with the new standard.
      - Improved overall code consistency and maintainability by ensuring uniform data type usage across various modules.
      
      * [Refactor] Standardize data type usage in templates and scripts
      
      - Updated data type definitions in various templates and scripts to use string representations (e.g., "float16", "int32") instead of T.float16 and T.int32 for improved consistency and clarity.
      - Enhanced overall code maintainability by ensuring uniform data type usage across multiple modules, including convolution, elementwise operations, and matrix multiplication templates.
      - This change aims to streamline type handling and improve compatibility with existing workflows.
      
      * [Refactor] Standardize data type usage in examples and benchmarks
      
      - Updated data type definitions in various example and benchmark scripts to use T.float16 and T.int32 consistently, enhancing clarity and maintainability.
      - Adjusted dtype assignments in kernel functions and configuration setups to align with the new standard.
      - Improved overall code consistency by ensuring uniform data type usage across multiple modules, including attention mechanisms, matrix multiplication, and GEMM examples.
      
      * [Refactor] Import dtypes from language.v2 module
      
      - Added import statement for dtypes from the language.v2 module to enhance type handling and maintain consistency across the codebase.
      - This change aims to streamline data type management and improve overall code clarity.
      
      * fix
      
      * [Refactor] Standardize data type usage across scripts
      
      - Updated data type definitions in various scripts to use string representations (e.g., "float16", "int8") instead of T.float16 and T.int8 for improved consistency and clarity.
      - Adjusted dtype assignments in functions and configuration setups to align with the new standard, enhancing overall code maintainability.
      - This change affects multiple modules, including benchmark and attention mechanisms, ensuring uniform data type usage throughout the codebase.
      
      * [Refactor] Update data type handling for consistency and clarity
      
      - Changed string representations of data types in the Hint class to use T.float32 and T.int32 for improved consistency.
      - Added new data types "int4" and "int16" to the dtypes module, enhancing type support across the codebase.
      - Updated function signatures and assertions in the lop3 and mxfp modules to utilize the new data types, ensuring uniformity in type handling.
      - This refactor aims to streamline data type management and improve overall code clarity and maintainability.
      
      * [Enhancement] Improve data type handling and error messaging
      
      - Introduced a mapping for canonical data types to their display strings, enhancing clarity in type representation.
      - Updated the dtype creation logic to utilize the new mapping, ensuring more intuitive handling of string inputs.
      - Refined error messages in the lop3 module to provide clearer feedback on invalid source formats, improving debugging and user experience.
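The canonical-dtype mapping described above can be sketched in plain Python. The table contents and the `create_dtype` name below are illustrative assumptions, not tilelang's actual implementation:

```python
# Illustrative sketch: map user-facing dtype spellings to a canonical
# display string, and normalize string input through the table.
_CANONICAL_DTYPES = {
    "float": "float32",      # bare "float" displays as float32
    "float32": "float32",
    "float16": "float16",
    "bfloat16": "bfloat16",
    "int32": "int32",
    "int8": "int8",
}

def create_dtype(name: str) -> str:
    """Resolve a user-supplied dtype string to its canonical display name."""
    try:
        return _CANONICAL_DTYPES[name]
    except KeyError:
        raise ValueError(
            f"Unknown dtype {name!r}; expected one of "
            f"{sorted(_CANONICAL_DTYPES)}") from None
```

The refined error message lists the valid spellings, which is the kind of clearer feedback the commit describes for invalid source formats.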
      
      * [Fix] Correct boolean flag in GEMM SP test case
      
      - Updated the boolean flag in the test_gemm_sp_sm90 function to ensure proper functionality in the test case.
      - This change enhances the accuracy of the test and aligns it with expected behavior for the GEMM SP implementation.
      
      * [Refactor] Standardize data type usage across scripts
      
      - Updated data type definitions in various scripts to use T.float16 and T.bfloat16 consistently, enhancing clarity and maintainability.
      - Adjusted dtype assignments in function signatures and argument parsing to align with the new standard, ensuring uniform data type usage throughout the codebase.
      - This change affects multiple modules, including benchmarks and examples, improving overall code consistency and readability.
      
      * [Refactor] Standardize data type usage in various modules
      
      - Updated data type assignments in multiple scripts to utilize T.float32, T.int8, and T.int32 consistently, enhancing clarity and maintainability.
      - Adjusted function signatures and parameter types across benchmarks, examples, and tests to align with the new standard, ensuring uniform data type usage throughout the codebase.
      - This change improves overall code consistency and readability, impacting modules related to matrix multiplication, GEMM, and tensor operations.
      
      * [Refactor] Update argument parsing for data types in benchmarks
      
      - Changed argument parsing for data types in benchmark_matmul_intrinsic.py and benchmark_matmul_sp.py to use string representations ("float16", "int8", "float") instead of T.float16 and T.float.
      - This update enhances consistency in data type handling across benchmark scripts, improving clarity and maintainability.
      
      * [Refactor] Update data type handling in benchmark and example scripts
      
      - Changed data type arguments in benchmark and example scripts to use string representations ("float16") instead of T.float16 for improved consistency.
      - Updated function signatures and argument parsing to align with the new standard, enhancing clarity and maintainability across the codebase.
      - This change affects multiple modules related to attention mechanisms and tensor operations, ensuring uniform data type usage throughout the examples.
      
      * [Refactor] Fix data type conversion in multiple scripts
      
      - Corrected the usage of the data type conversion method from dtype..as_torch() to dtype.as_torch() across various benchmark and example scripts.
      - This change enhances consistency in data type handling and improves code readability, impacting modules related to attention mechanisms and tensor operations.
      
      * [Refactor] Update float8 data type usage across multiple scripts
      
      - Changed instances of T.float8_e4m3 to T.float8_e4m3fn in various benchmark, example, and test scripts to ensure consistency in data type handling.
      - This update enhances clarity and maintainability across the codebase, particularly in modules related to matrix multiplication and tensor operations.
      
      * [Refactor] Enhance float8 data type handling in CUDA code generation
      
      - Updated the handling of float8 data types in the CUDA code generation to include additional float8 variants, improving type conversion logic.
      - Adjusted conditions to ensure proper type checks for float8 conversions, enhancing clarity and maintainability in the codebase.
      - Modified layout inference to streamline float8 type checks, ensuring consistency across the implementation.
      - This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.
      
      * [Refactor] Streamline float8 data type handling in CUDA and related modules
      
      - Enhanced float8 data type handling in CUDA code generation by refining type conversion logic and ensuring consistent type checks.
      - Updated layout inference for float8 types to improve clarity and maintainability across the implementation.
      - This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.
      
      * [Refactor] Remove unnecessary cache disabling in float8 example script
      
      - Eliminated the call to tilelang.disable_cache() in example_group_per_split_token_cast_to_fp8.py to streamline the code.
      - This change enhances clarity and maintainability of the example script without affecting its functionality.
      
      * [Refactor] Update data type usage in debug print tests
      
      - Changed the argument for dtype in the test_debug_print_buffer function from a string representation to the corresponding T.bool type.
      - This update enhances consistency in data type handling within the test suite, improving clarity and maintainability.
      
      * lint fix
      
      * Update function parameter types from `str` to `T.dtype` for improved type safety in attention sink and related examples
      
      * Refactor `gemv_alloc_reducer` function signature for improved readability by formatting parameters across multiple lines.
      c750fb8a
  2. 12 Dec, 2025 1 commit
  3. 19 Nov, 2025 1 commit
  4. 05 Oct, 2025 1 commit
  5. 18 Sep, 2025 1 commit
• [Refactor] Turn off `ENABLE_FAST_MATH` by default (#846) · e7e38355
      Lei Wang authored
      * [Enhancement] Enable fast math optimization in tilelang JIT configurations
      
      - Updated multiple examples and kernel functions to include `pass_configs` for enabling fast math optimization.
      - Added support for the `TL_ENABLE_FAST_MATH` configuration option in the built-in operations.
      - Enhanced the `LibraryGenerator` to handle the new fast math configuration, ensuring compatibility with existing settings.
      - Updated documentation to reflect the changes in fast math handling and deprecation of the `TL_DISABLE_FAST_MATH` option.
      
      * lint fix
      
      * [Refactor] Introduce deprecated_warning utility for improved deprecation handling
      
      - Added a new `deprecated_warning` function to streamline deprecation messages.
      - Updated the `LibraryGenerator` to utilize the new function for warning about the deprecated `TL_DISABLE_FAST_MATH` configuration.
      - Enhanced the `deprecated` decorator to support phaseout version messaging, improving clarity for users.
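A minimal sketch of such a pair of utilities, using the names from the commit message (`deprecated_warning`, a `deprecated` decorator with phase-out version messaging); the actual tilelang implementation may differ:

```python
import functools
import warnings

def deprecated_warning(message: str) -> None:
    # Emit a DeprecationWarning attributed to the caller's call site.
    warnings.warn(message, DeprecationWarning, stacklevel=2)

def deprecated(reason: str, phaseout_version: str = None):
    """Mark a function as deprecated, optionally naming the release in
    which it will be removed (sketch, not the exact tilelang utility)."""
    def decorator(func):
        msg = f"{func.__name__} is deprecated: {reason}"
        if phaseout_version:
            msg += f" (will be removed in {phaseout_version})"
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            deprecated_warning(msg)
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

This is the shape of utility a `LibraryGenerator` could call to warn once about the deprecated `TL_DISABLE_FAST_MATH` configuration.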
      e7e38355
  6. 25 Jun, 2025 1 commit
• [Example] Update examples to use @tilelang.jit (#597) · 3db18726
      Cunxiao Ni authored
      
      
      * [Example] Update kernel compilation in examples to use @tilelang.jit
      
      - Refactored multiple examples to eliminate the use of `tilelang.compile` for kernel creation, directly invoking the functions instead.
      - Added `@tilelang.jit` decorators with appropriate output indices to enhance performance and maintainability.
      - Improved code clarity by simplifying the kernel invocation process across various examples, ensuring consistency in how kernels are defined and executed.
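The usage pattern being adopted, a decorator that compiles on first call instead of an explicit `tilelang.compile(...)` step, can be mimicked in plain Python. This is only a structural sketch of the decorator-with-arguments shape; `tilelang.jit`'s real behavior (codegen, caching, `out_idx` semantics) is more involved:

```python
import functools

def jit(out_idx=None):
    """Decorator-style kernel wrapper: 'compiles' on first call and caches
    the result, mimicking the @tilelang.jit usage pattern (sketch only)."""
    def decorator(kernel_fn):
        cache = {}
        @functools.wraps(kernel_fn)
        def wrapper(*args):
            if args not in cache:
                cache[args] = kernel_fn(*args)  # stand-in for compilation
            return cache[args]
        wrapper.out_idx = out_idx  # which outputs the runtime should allocate
        return wrapper
    return decorator
```

With this shape, call sites shrink from `kernel = tilelang.compile(fn(...)); kernel(...)` to decorating `fn` once and invoking it directly.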
      
      * format
      
      * Update example_tilelang_sparse_gqa_decode_varlen_indice.py
      
      * Update example_dequant_gemm_fine_grained.py
      
      * Update example_gemm_autotune.py
      
      ---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      3db18726
  7. 04 Jun, 2025 1 commit
• [Refactor] Include several examples into ci (#531) · 3ca3a8af
      Lei Wang authored
      * Remove unused 2D continuous cumulative sum example and related functions from the cumsum module.
      
      * lint fix
      
      * fix split k example
      
      * Enable cache disabling in gemm_streamk example and add validation checks in if_stmt_binding transformation
      
      * Update gemm_streamk example to use tilelang's cdiv function for block calculations and add copyright notice
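The helper being adopted here is ceiling division for block counts; a self-contained equivalent of the arithmetic (tilelang ships its own `cdiv`):

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division: number of blocks of size b needed to cover a."""
    return (a + b - 1) // b
```

For example, covering 1000 elements with blocks of 128 needs 8 blocks, not the 7 that floor division would give.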
      3ca3a8af
  8. 12 Apr, 2025 1 commit
• [Enhancement][Pipeline] More precise copy code block detection in pipeline (#384) · abaacde5
      Lei Wang authored
      * Update legalize_safe_memory_access.cc
      
      * Add cache path handling and file locking in Cython adapter
      
      - Introduced a new cache path based on the code hash for the Cython JIT adapter, enhancing cache management.
      - Added a lock file mechanism to ensure safe access during cache operations, improving concurrency handling.
      - These changes aim to optimize the compilation process and prevent race conditions during library loading.
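The two mechanisms described, a cache path derived from a hash of the kernel source and a lock file guarding concurrent access, can be sketched with the standard library. Names here are hypothetical; the real adapter's locking is likely built on `fcntl`/`flock` rather than this crude atomic-create loop:

```python
import hashlib
import os
import time
from contextlib import contextmanager

def cache_path_for(code: str, cache_root: str) -> str:
    """Derive a per-kernel cache directory from a hash of the source code,
    so identical kernels share a compiled artifact."""
    code_hash = hashlib.sha256(code.encode()).hexdigest()[:16]
    return os.path.join(cache_root, code_hash)

@contextmanager
def file_lock(lock_file: str):
    """Advisory lock via atomic file creation: O_CREAT|O_EXCL fails if the
    file already exists, so only one process enters at a time."""
    while True:
        try:
            fd = os.open(lock_file, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            time.sleep(0.01)  # another process holds the lock; retry
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(lock_file)
```

The lock window would wrap the compile-and-load step, preventing two processes from racing to build the same cached library.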
      
      * lint fix
      
      * refactor
      
      * refactor
      
      * Add GlobalCopyPatternDetector to identify global memory copy patterns
      
      - Introduced a new class, GlobalCopyPatternDetector, to detect specific memory copy patterns in statements.
      - Enhanced the PipelinePlanner to utilize this detector for determining copy stages based on global and local memory scopes.
      - Improved code clarity and maintainability by encapsulating detection logic within the new class.
      
      * Refactor copy stage detection logic in pipeline planning
      
      - Simplified the determination of copy stages by directly assigning the result of GlobalCopyPatternDetector to pinfo.copy_stage.
      - Removed redundant checks for read and write scopes, enhancing code clarity and maintainability.
      
      * lint fix
      abaacde5
  9. 26 Mar, 2025 1 commit
• [Refactor] Deprecated `T.Buffer` as arguments and rename related calls into `T.Tensor` (#281) · bf8a6fc1
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
      
      * [Refactor] Remove deprecated decorator and enhance Cython kernel handling
      
      - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
      - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
      - Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
      - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
      - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.
      
      * [Feature] Add matrix multiplication test and kernel implementation
      
      - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
      - Minor formatting improvements in `deprecated.py` for better readability.
      
      * lint fix
      
      * [Refactor] Update tensor creation in matrix multiplication test
      
      - Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
      - Updated imports in `__init__.py` to include `make_tensor`.
      - Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.
      
      * [Refactor] Update tensor definitions across multiple files
      
      - Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
      - Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
      - Improved documentation in README and example files to reflect changes in tensor usage.
      
      * lint fix
      
      * [Refactor] Update tensor types in attention and matrix multiplication examples
      
      - Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
      - Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
      - Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.
      
      * lint fix
      
      * [Refactor] Update tensor types in GEMM example and test files
      
      - Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
      - Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.
      
      * [Refactor] Update tensor usage in customize.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the file.
      
      * [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the test file.
      
      * [Refactor] Update tensor types to SharedBuffer and FragmentBuffer
      
      - Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
      - Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.
      
      * [Refactor] Introduce Tensor alias for Buffer in proxy.py
      
      - Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
      - This change enhances clarity and consistency in tensor usage across the codebase.
      bf8a6fc1
  10. 13 Mar, 2025 1 commit
• [Enhancement] Enhancing the handling of conditional statements in the pipeline (#201) · dda8ebff
      Lei Wang authored
      * Optimize CMake build process with dynamic job count calculation
      
      - Modify build_csrc function to use 90% of available CPU cores
      - Ensure at least one job is used during compilation
      - Improve build performance by dynamically adjusting parallel job count
      
      * Optimize build_csrc function with multiprocessing module
      
      - Replace os.cpu_count() with multiprocessing.cpu_count()
      - Maintain existing 90% CPU utilization logic
      - Improve CPU core count calculation for build process
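The two commits above converge on the following logic, 90% of the available cores with a floor of one job; a sketch using the final `multiprocessing`-based form:

```python
import multiprocessing

def num_build_jobs(utilization: float = 0.9) -> int:
    """Parallel job count for the C++ build: ~90% of available cores,
    but never fewer than one job."""
    return max(1, int(multiprocessing.cpu_count() * utilization))
```

Passing the result as `make -j{num_build_jobs()}` (or the CMake equivalent) keeps the build fast without saturating the machine.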
      
      * Add dynamic shape support with out_idx in Cython JIT kernel compilation
      
      - Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
      - Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
      - Add support for resolving dynamic shape dimensions using input tensor references
      - Enhance flexibility of JIT kernel compilation with symbolic shape handling
      
      * Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel
      
      - Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
      - Improve debugging by providing context about missing symbolic dimensions
      - Maintain existing dynamic shape resolution logic
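The error-reporting idea, name the missing symbol and the symbols that are known, can be sketched as follows. The function and parameter names (`resolve_dynamic_shape`, `dynamic_symbolic_map` as symbol → (tensor index, axis)) are assumptions for illustration, not the Cython wrapper's actual interface:

```python
def resolve_dynamic_shape(shape, dynamic_symbolic_map, input_shapes):
    """Resolve symbolic dimensions against concrete input shapes, raising a
    descriptive error when a symbol has no known source dimension."""
    resolved = []
    for dim in shape:
        if isinstance(dim, int):
            resolved.append(dim)
            continue
        if dim not in dynamic_symbolic_map:
            raise KeyError(
                f"Dynamic symbolic dimension {dim!r} not found in "
                f"dynamic_symbolic_map; known symbols: "
                f"{sorted(dynamic_symbolic_map)}")
        tensor_idx, axis = dynamic_symbolic_map[dim]
        resolved.append(input_shapes[tensor_idx][axis])
    return resolved
```

Listing the known symbols in the message is what turns a bare lookup failure into a debuggable report.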
      
      * Fix Copy operation handling for scalar and multi-dimensional tensors
      
      - Add special handling for scalar tensor copy operations
      - Enhance error reporting in MakeIndices method with more detailed diagnostic information
      - Improve SIMT loop generation to support zero-dimensional tensors
      - Add explicit check and handling for scalar tensor scenarios
      
      * Refactor Copy operation code formatting and improve readability
      
      - Improve code formatting in MakeIndices and MakeSIMTLoop methods
      - Add line breaks to enhance readability of complex ICHECK statements
      - Simplify code structure in scalar tensor handling
      - Remove unnecessary whitespace and improve code alignment
      
      * Simplify GEMM example with direct kernel compilation
      
      - Update copyright header to Tile-AI Corporation
      - Remove Profiler import and usage
      - Replace tilelang.lower() with tilelang.compile()
      - Simplify kernel execution workflow
      - Update kernel source retrieval method
      
      * Enhance block sparse attention implementation
      
      - Update `blocksparse_flashattn` to use 2 stages for improved performance.
      - Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency.
      - Modify condition checks in the kernel to utilize boolean values.
      - Introduce a new example for top-k sparse attention and a benchmark for native sparse attention.
      - Add support for asynchronous copy in PTX and improve pipeline planning with condition handling.
      
      * Refactor and clean up code formatting across multiple files
      
      - Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`.
      - Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`.
      - Updated comments and documentation for clarity in `__init__.py` and `phase.py`.
      - Ensured consistent formatting and style across the codebase.
      
      * Add kernel source printing in example_tilelang_nsa_fwd.py and implement IfThenElse node replacement in inject_pipeline.cc
      
      - Added a print statement to output the kernel source in `example_tilelang_nsa_fwd.py` for debugging purposes.
      - Introduced a new function `replace_if_then_else` in `inject_pipeline.cc` to transform IfThenElse nodes while preserving attributes, enhancing the handling of conditional statements in the pipeline.
      
      * Refactor condition handling in inject_pipeline.cc
      
      - Change the data structure for mapping conditions to statements from a Map to an Array for improved performance and simplicity.
      - Update condition comparison logic to use StructuralEqual for better accuracy.
      - Enhance logging to provide detailed insights into condition changes and statement processing.
      - Adjust final statement construction to utilize the new data structure, ensuring correct handling of conditions and statements.
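The Map-to-Array change amounts to grouping statements in an association list keyed by structural condition equality rather than hashing. A plain-Python sketch of that grouping, with TVM's `StructuralEqual` stood in for by a caller-supplied predicate:

```python
def append_stmt(cond_groups, cond, stmt, equal=lambda a, b: a == b):
    """Group statements by condition in an association list: scan for a
    structurally equal condition, appending to its group, else start a
    new (condition, statements) pair."""
    for existing_cond, stmts in cond_groups:
        if equal(existing_cond, cond):
            stmts.append(stmt)
            return
    cond_groups.append((cond, [stmt]))
```

A linear scan with structural comparison is correct where hashing expressions is not, since two structurally equal conditions need not be the same object.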
      
      * Improve logging and formatting in inject_pipeline.cc
      
      - Enhance logging statements for better clarity on condition changes and statement processing.
      - Adjust formatting for improved readability, including line breaks and consistent spacing.
      - Ensure accurate condition comparison and handling in the pipeline logic.
      
      * Refactor logging and clean up inject_pipeline.cc
      
      - Remove excessive logging statements to streamline the code and improve performance.
      - Simplify condition handling by eliminating unnecessary log outputs related to condition changes and statement processing.
      - Maintain the core functionality while enhancing code readability and maintainability.
      dda8ebff
  11. 12 Mar, 2025 1 commit
• [Feature] Support Async Pipeline inference within if scope (#198) · 7ccec53b
      Lei Wang authored
      * Optimize CMake build process with dynamic job count calculation
      
      - Modify build_csrc function to use 90% of available CPU cores
      - Ensure at least one job is used during compilation
      - Improve build performance by dynamically adjusting parallel job count
      
      * Optimize build_csrc function with multiprocessing module
      
      - Replace os.cpu_count() with multiprocessing.cpu_count()
      - Maintain existing 90% CPU utilization logic
      - Improve CPU core count calculation for build process
      
      * Add dynamic shape support with out_idx in Cython JIT kernel compilation
      
      - Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
      - Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
      - Add support for resolving dynamic shape dimensions using input tensor references
      - Enhance flexibility of JIT kernel compilation with symbolic shape handling
      
      * Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel
      
      - Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
      - Improve debugging by providing context about missing symbolic dimensions
      - Maintain existing dynamic shape resolution logic
      
      * Fix Copy operation handling for scalar and multi-dimensional tensors
      
      - Add special handling for scalar tensor copy operations
      - Enhance error reporting in MakeIndices method with more detailed diagnostic information
      - Improve SIMT loop generation to support zero-dimensional tensors
      - Add explicit check and handling for scalar tensor scenarios
      
      * Refactor Copy operation code formatting and improve readability
      
      - Improve code formatting in MakeIndices and MakeSIMTLoop methods
      - Add line breaks to enhance readability of complex ICHECK statements
      - Simplify code structure in scalar tensor handling
      - Remove unnecessary whitespace and improve code alignment
      
      * Simplify GEMM example with direct kernel compilation
      
      - Update copyright header to Tile-AI Corporation
      - Remove Profiler import and usage
      - Replace tilelang.lower() with tilelang.compile()
      - Simplify kernel execution workflow
      - Update kernel source retrieval method
      
      * Enhance block sparse attention implementation
      
      - Update `blocksparse_flashattn` to use 2 stages for improved performance.
      - Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency.
      - Modify condition checks in the kernel to utilize boolean values.
      - Introduce a new example for top-k sparse attention and a benchmark for native sparse attention.
      - Add support for asynchronous copy in PTX and improve pipeline planning with condition handling.
      
      * Refactor and clean up code formatting across multiple files
      
      - Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`.
      - Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`.
      - Updated comments and documentation for clarity in `__init__.py` and `phase.py`.
      - Ensured consistent formatting and style across the codebase.
      7ccec53b
  12. 10 Mar, 2025 1 commit
• [Examples] Implement NSA Backward kernels (#180) · 6891d3ec
      Lei Wang authored
      
      * Update native sparse attention example with scale parameter handling
      
      - Add scale parameter processing in native_sparse_attention function
      - Modify example script to include custom scale value
      - Update function calls to pass scale parameter
      - Enhance flexibility of sparse attention implementation
      
      * Refactor Triton Native Sparse Attention Example
      
      - Improve code formatting and readability in example_triton_nsa_bwd.py
      - Standardize function and parameter alignment
      - Remove unnecessary whitespaces and optimize imports
      - Enhance code style consistency with previous commits
      6891d3ec
  13. 07 Mar, 2025 2 commits
• [Example] Implement tilelang native sparse attention varlen example (#170) · 8e1845d2
      Lei Wang authored
      * [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation
      
      - Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py
      - Modify roller hints generation using new TileLang Carver template and utility functions
      - Update get_roller_hints_from_func to handle None cases and improve return logic
      - Adjust DefaultPolicy to handle different codegen dictionary formats
      
      * [Refactor] Update Thread Binding and Import Statements in TileLang Kernels
      
      - Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files
      - Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples
      - Move map_torch_type utility function to tilelang.utils.tensor
      - Remove unnecessary imports and improve code organization
      
      * Refactor Native Sparse Attention Example with Enhanced Triton Kernel
      
      - Update parallel_nsa_fwd_kernel to support more flexible sparse attention computation
      - Add support for block counts and offsets in the Triton kernel
      - Modify kernel grid and computation logic for improved performance
      - Update example script to use naive_nsa_simple reference implementation
      - Improve type hints and kernel configuration
      
      * Add Native Sparse Attention Examples with Tilelang and Triton Implementations
      
      - Introduce new example scripts for native sparse attention:
        * example_tilelang_nsa_fwd.py: Forward pass implementation using TileLang
        * example_tilelang_nsa_decode.py: Decoding-specific sparse attention implementation
        * example_triton_nsa_fwd.py: Triton-based sparse attention forward pass
      - Update reference.py with naive implementations for sparse attention
      - Support different sparse attention scenarios including forward pass and inference
      - Add comprehensive testing and validation against reference implementations
      
      * lint fix
      
      * Add Variable-Length Native Sparse Attention Examples for TileLang and Triton
      
      - Introduce new example scripts for variable-length native sparse attention:
        * example_tilelang_nsa_fwd_varlen.py: TileLang implementation with variable sequence lengths
        * example_triton_nsa_fwd_varlen.py: Triton implementation with variable sequence lengths
      - Update reference.py to support variable-length sparse attention scenarios
      - Enhance existing sparse attention implementations to handle variable-length inputs
      - Add comprehensive testing and validation for variable-length sparse attention
      
      * Refactor Native Sparse Attention Examples: Code Style and Formatting Improvements
      
      - Standardize function and parameter formatting across NSA example files
      - Improve code readability by adjusting indentation and line breaks
      - Enhance type hints and parameter alignment
      - Remove unnecessary whitespaces and optimize imports
      - Maintain consistent code style across TileLang and Triton implementations
      8e1845d2
• [Example] Implement NSA Decode tilelang examples (#168) · 69f35439
      Lei Wang authored
      * [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation
      
      - Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py
      - Modify roller hints generation using new TileLang Carver template and utility functions
      - Update get_roller_hints_from_func to handle None cases and improve return logic
      - Adjust DefaultPolicy to handle different codegen dictionary formats
      
      * [Refactor] Update Thread Binding and Import Statements in TileLang Kernels
      
      - Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files
      - Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples
      - Move map_torch_type utility function to tilelang.utils.tensor
      - Remove unnecessary imports and improve code organization
      
      * Refactor Native Sparse Attention Example with Enhanced Triton Kernel
      
      - Update parallel_nsa_fwd_kernel to support more flexible sparse attention computation
      - Add support for block counts and offsets in the Triton kernel
      - Modify kernel grid and computation logic for improved performance
      - Update example script to use naive_nsa_simple reference implementation
      - Improve type hints and kernel configuration
      
      * Add Native Sparse Attention Examples with Tilelang and Triton Implementations
      
      - Introduce new example scripts for native sparse attention:
        * example_tilelang_nsa_fwd.py: Forward pass implementation using TileLang
        * example_tilelang_nsa_decode.py: Decoding-specific sparse attention implementation
        * example_triton_nsa_fwd.py: Triton-based sparse attention forward pass
      - Update reference.py with naive implementations for sparse attention
      - Support different sparse attention scenarios including forward pass and inference
      - Add comprehensive testing and validation against reference implementations
      
      * lint fix
      69f35439