1. 04 Jun, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Include several examples into ci (#531) · 3ca3a8af
      Lei Wang authored
      * Remove unused 2D continuous cumulative sum example and related functions from the cumsum module.
      
      * lint fix
      
      * fix split k example
      
      * Enable cache disabling in gemm_streamk example and add validation checks in if_stmt_binding transformation
      
      * Update gemm_streamk example to use tilelang's cdiv function for block calculations and add copyright notice
      3ca3a8af
  2. 12 Apr, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement][Pipeline] More precise copy code block detection in pipeline (#384) · abaacde5
      Lei Wang authored
      * Update legalize_safe_memory_access.cc
      
      * Add cache path handling and file locking in Cython adapter
      
      - Introduced a new cache path based on the code hash for the Cython JIT adapter, enhancing cache management.
      - Added a lock file mechanism to ensure safe access during cache operations, improving concurrency handling.
      - These changes aim to optimize the compilation process and prevent race conditions during library loading.
      
      * lint fix
      
      * refactor
      
      * refactor
      
      * Add GlobalCopyPatternDetector to identify global memory copy patterns
      
      - Introduced a new class, GlobalCopyPatternDetector, to detect specific memory copy patterns in statements.
      - Enhanced the PipelinePlanner to utilize this detector for determining copy stages based on global and local memory scopes.
      - Improved code clarity and maintainability by encapsulating detection logic within the new class.
      
      * Refactor copy stage detection logic in pipeline planning
      
      - Simplified the determination of copy stages by directly assigning the result of GlobalCopyPatternDetector to pinfo.copy_stage.
      - Removed redundant checks for read and write scopes, enhancing code clarity and maintainability.
      
      * lint fix
      abaacde5
  3. 26 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Deprecated `T.Buffer` as arguments and rename related calls into `T.Tensor` (#281) · bf8a6fc1
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
      
      * [Refactor] Remove deprecated decorator and enhance Cython kernel handling
      
      - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
      - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
      - Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
      - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
      - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.
      
      * [Feature] Add matrix multiplication test and kernel implementation
      
      - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
      - Minor formatting improvements in `deprecated.py` for better readability.
      
      * lint fix
      
      * [Refactor] Update tensor creation in matrix multiplication test
      
      - Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
      - Updated imports in `__init__.py` to include `make_tensor`.
      - Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.
      
      * [Refactor] Update tensor definitions across multiple files
      
      - Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
      - Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
      - Improved documentation in README and example files to reflect changes in tensor usage.
      
      * lint fix
      
      * [Refactor] Update tensor types in attention and matrix multiplication examples
      
      - Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
      - Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
      - Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.
      
      * lint fix
      
      * [Refactor] Update tensor types in GEMM example and test files
      
      - Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
      - Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.
      
      * [Refactor] Update tensor usage in customize.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the file.
      
      * [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the test file.
      
      * [Refactor] Update tensor types to SharedBuffer and FragmentBuffer
      
      - Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
      - Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.
      
      * [Refactor] Introduce Tensor alias for Buffer in proxy.py
      
      - Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
      - This change enhances clarity and consistency in tensor usage across the codebase.
      bf8a6fc1
  4. 13 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Enhancing the handling of conditional statements in the pipeline (#201) · dda8ebff
      Lei Wang authored
      * Optimize CMake build process with dynamic job count calculation
      
      - Modify build_csrc function to use 90% of available CPU cores
      - Ensure at least one job is used during compilation
      - Improve build performance by dynamically adjusting parallel job count
      
      * Optimize build_csrc function with multiprocessing module
      
      - Replace os.cpu_count() with multiprocessing.cpu_count()
      - Maintain existing 90% CPU utilization logic
      - Improve CPU core count calculation for build process
      
      * Add dynamic shape support with out_idx in Cython JIT kernel compilation
      
      - Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
      - Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
      - Add support for resolving dynamic shape dimensions using input tensor references
      - Enhance flexibility of JIT kernel compilation with symbolic shape handling
      
      * Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel
      
      - Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
      - Improve debugging by providing context about missing symbolic dimensions
      - Maintain existing dynamic shape resolution logic
      
      * Fix Copy operation handling for scalar and multi-dimensional tensors
      
      - Add special handling for scalar tensor copy operations
      - Enhance error reporting in MakeIndices method with more detailed diagnostic information
      - Improve SIMT loop generation to support zero-dimensional tensors
      - Add explicit check and handling for scalar tensor scenarios
      
      * Refactor Copy operation code formatting and improve readability
      
      - Improve code formatting in MakeIndices and MakeSIMTLoop methods
      - Add line breaks to enhance readability of complex ICHECK statements
      - Simplify code structure in scalar tensor handling
      - Remove unnecessary whitespace and improve code alignment
      
      * Simplify GEMM example with direct kernel compilation
      
      - Update copyright header to Tile-AI Corporation
      - Remove Profiler import and usage
      - Replace tilelang.lower() with tilelang.compile()
      - Simplify kernel execution workflow
      - Update kernel source retrieval method
      
      * Enhance block sparse attention implementation
      
      - Update `blocksparse_flashattn` to use 2 stages for improved performance.
      - Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency.
      - Modify condition checks in the kernel to utilize boolean values.
      - Introduce a new example for top-k sparse attention and a benchmark for native sparse attention.
      - Add support for asynchronous copy in PTX and improve pipeline planning with condition handling.
      
      * Refactor and clean up code formatting across multiple files
      
      - Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`.
      - Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`.
      - Updated comments and documentation for clarity in `__init__.py` and `phase.py`.
      - Ensured consistent formatting and style across the codebase.
      
      * Add kernel source printing in example_tilelang_nsa_fwd.py and implement IfThenElse node replacement in inject_pipeline.cc
      
      - Added a print statement to output the kernel source in `example_tilelang_nsa_fwd.py` for debugging purposes.
      - Introduced a new function `replace_if_then_else` in `inject_pipeline.cc` to transform IfThenElse nodes while preserving attributes, enhancing the handling of conditional statements in the pipeline.
      
      * Refactor condition handling in inject_pipeline.cc
      
      - Change the data structure for mapping conditions to statements from a Map to an Array for improved performance and simplicity.
      - Update condition comparison logic to use StructuralEqual for better accuracy.
      - Enhance logging to provide detailed insights into condition changes and statement processing.
      - Adjust final statement construction to utilize the new data structure, ensuring correct handling of conditions and statements.
      
      * Improve logging and formatting in inject_pipeline.cc
      
      - Enhance logging statements for better clarity on condition changes and statement processing.
      - Adjust formatting for improved readability, including line breaks and consistent spacing.
      - Ensure accurate condition comparison and handling in the pipeline logic.
      
      * Refactor logging and clean up inject_pipeline.cc
      
      - Remove excessive logging statements to streamline the code and improve performance.
      - Simplify condition handling by eliminating unnecessary log outputs related to condition changes and statement processing.
      - Maintain the core functionality while enhancing code readability and maintainability.
      dda8ebff
  5. 12 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Feature] Support Async Pipeline inference within if scope (#198) · 7ccec53b
      Lei Wang authored
      * Optimize CMake build process with dynamic job count calculation
      
      - Modify build_csrc function to use 90% of available CPU cores
      - Ensure at least one job is used during compilation
      - Improve build performance by dynamically adjusting parallel job count
      
      * Optimize build_csrc function with multiprocessing module
      
      - Replace os.cpu_count() with multiprocessing.cpu_count()
      - Maintain existing 90% CPU utilization logic
      - Improve CPU core count calculation for build process
      
      * Add dynamic shape support with out_idx in Cython JIT kernel compilation
      
      - Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
      - Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
      - Add support for resolving dynamic shape dimensions using input tensor references
      - Enhance flexibility of JIT kernel compilation with symbolic shape handling
      
      * Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel
      
      - Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
      - Improve debugging by providing context about missing symbolic dimensions
      - Maintain existing dynamic shape resolution logic
      
      * Fix Copy operation handling for scalar and multi-dimensional tensors
      
      - Add special handling for scalar tensor copy operations
      - Enhance error reporting in MakeIndices method with more detailed diagnostic information
      - Improve SIMT loop generation to support zero-dimensional tensors
      - Add explicit check and handling for scalar tensor scenarios
      
      * Refactor Copy operation code formatting and improve readability
      
      - Improve code formatting in MakeIndices and MakeSIMTLoop methods
      - Add line breaks to enhance readability of complex ICHECK statements
      - Simplify code structure in scalar tensor handling
      - Remove unnecessary whitespace and improve code alignment
      
      * Simplify GEMM example with direct kernel compilation
      
      - Update copyright header to Tile-AI Corporation
      - Remove Profiler import and usage
      - Replace tilelang.lower() with tilelang.compile()
      - Simplify kernel execution workflow
      - Update kernel source retrieval method
      
      * Enhance block sparse attention implementation
      
      - Update `blocksparse_flashattn` to use 2 stages for improved performance.
      - Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency.
      - Modify condition checks in the kernel to utilize boolean values.
      - Introduce a new example for top-k sparse attention and a benchmark for native sparse attention.
      - Add support for asynchronous copy in PTX and improve pipeline planning with condition handling.
      
      * Refactor and clean up code formatting across multiple files
      
      - Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`.
      - Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`.
      - Updated comments and documentation for clarity in `__init__.py` and `phase.py`.
      - Ensured consistent formatting and style across the codebase.
      7ccec53b
  6. 10 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Examples] Implement NSA Backward kernels (#180) · 6891d3ec
      Lei Wang authored
      
      * Update native sparse attention example with scale parameter handling
      
      - Add scale parameter processing in native_sparse_attention function
      - Modify example script to include custom scale value
      - Update function calls to pass scale parameter
      - Enhance flexibility of sparse attention implementation
      
      * Refactor Triton Native Sparse Attention Example
      
      - Improve code formatting and readability in example_triton_nsa_bwd.py
      - Standardize function and parameter alignment
      - Remove unnecessary whitespaces and optimize imports
      - Enhance code style consistency with previous commits
      6891d3ec
  7. 07 Mar, 2025 2 commits
    • Lei Wang's avatar
      [Example] Implement tilelang native sparse attention varlen example (#170) · 8e1845d2
      Lei Wang authored
      * [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation
      
      - Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py
      - Modify roller hints generation using new TileLang Carver template and utility functions
      - Update get_roller_hints_from_func to handle None cases and improve return logic
      - Adjust DefaultPolicy to handle different codegen dictionary formats
      
      * [Refactor] Update Thread Binding and Import Statements in TileLang Kernels
      
      - Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files
      - Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples
      - Move map_torch_type utility function to tilelang.utils.tensor
      - Remove unnecessary imports and improve code organization
      
      * Refactor Native Sparse Attention Example with Enhanced Triton Kernel
      
      - Update parallel_nsa_fwd_kernel to support more flexible sparse attention computation
      - Add support for block counts and offsets in the Triton kernel
      - Modify kernel grid and computation logic for improved performance
      - Update example script to use naive_nsa_simple reference implementation
      - Improve type hints and kernel configuration
      
      * Add Native Sparse Attention Examples with Tilelang and Triton Implementations
      
      - Introduce new example scripts for native sparse attention:
        * example_tilelang_nsa_fwd.py: Forward pass implementation using TileLang
        * example_tilelang_nsa_decode.py: Decoding-specific sparse attention implementation
        * example_triton_nsa_fwd.py: Triton-based sparse attention forward pass
      - Update reference.py with naive implementations for sparse attention
      - Support different sparse attention scenarios including forward pass and inference
      - Add comprehensive testing and validation against reference implementations
      
      * lint fix
      
      * Add Variable-Length Native Sparse Attention Examples for TileLang and Triton
      
      - Introduce new example scripts for variable-length native sparse attention:
        * example_tilelang_nsa_fwd_varlen.py: TileLang implementation with variable sequence lengths
        * example_triton_nsa_fwd_varlen.py: Triton implementation with variable sequence lengths
      - Update reference.py to support variable-length sparse attention scenarios
      - Enhance existing sparse attention implementations to handle variable-length inputs
      - Add comprehensive testing and validation for variable-length sparse attention
      
      * Refactor Native Sparse Attention Examples: Code Style and Formatting Improvements
      
      - Standardize function and parameter formatting across NSA example files
      - Improve code readability by adjusting indentation and line breaks
      - Enhance type hints and parameter alignment
      - Remove unnecessary whitespaces and optimize imports
      - Maintain consistent code style across TileLang and Triton implementations
      8e1845d2
    • Lei Wang's avatar
      [Example] Implement NSA Decode tilelang exampls (#168) · 69f35439
      Lei Wang authored
      * [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation
      
      - Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py
      - Modify roller hints generation using new TileLang Carver template and utility functions
      - Update get_roller_hints_from_func to handle None cases and improve return logic
      - Adjust DefaultPolicy to handle different codegen dictionary formats
      
      * [Refactor] Update Thread Binding and Import Statements in TileLang Kernels
      
      - Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files
      - Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples
      - Move map_torch_type utility function to tilelang.utils.tensor
      - Remove unnecessary imports and improve code organization
      
      * Refactor Native Sparse Attention Example with Enhanced Triton Kernel
      
      - Update parallel_nsa_fwd_kernel to support more flexible sparse attention computation
      - Add support for block counts and offsets in the Triton kernel
      - Modify kernel grid and computation logic for improved performance
      - Update example script to use naive_nsa_simple reference implementation
      - Improve type hints and kernel configuration
      
      * Add Native Sparse Attention Examples with Tilelang and Triton Implementations
      
      - Introduce new example scripts for native sparse attention:
        * example_tilelang_nsa_fwd.py: Forward pass implementation using TileLang
        * example_tilelang_nsa_decode.py: Decoding-specific sparse attention implementation
        * example_triton_nsa_fwd.py: Triton-based sparse attention forward pass
      - Update reference.py with naive implementations for sparse attention
      - Support different sparse attention scenarios including forward pass and inference
      - Add comprehensive testing and validation against reference implementations
      
      * lint fix
      69f35439