Commits · 45559a1fc30ba27626c7a29226feec245415bfc3 · OpenDAS / tilelang

14 Mar, 2025 1 commit

[Enhancement] Allow mma fallback when wgmma is not supported (#206) · 45559a1f

Lei Wang authored Mar 14, 2025

* Enhance error message for constant size stack allocation in CUDA codegen. Include the actual constant size and buffer variable name in the error output for better debugging.

* Refactor GEMM and Bulk Copy operations to enhance layout handling and support for Hopper architecture

- Update `ComputeWarpPartition` to include a new parameter for Hopper WGMMA support.
- Modify layout checks in `LowerBulkCopy` to accommodate new GEMM layout types.
- Enhance layout inference logic in `InferLayout` for better compatibility with Hopper architecture.
- Include necessary header files for built-in operations and layout inference improvements.

* lint fix

* Remove unused builtin.h include directive

* Update include path for builtin.h

45559a1f

20 Jul, 2025 1 commit

[LICENSE] Switch to Tile-AI and declare the copyright time for the Microsoft team. (#208) · 6ffae0f2

Lei Wang authored Mar 14, 2025

6ffae0f2

13 Mar, 2025 5 commits

[Dev] Add GQA backward example (#205) · a55f3686

Yu Cheng authored Mar 13, 2025

- Introduce `example_gqa_bwd.py` demonstrating the backward pass of FlashAttention with pipelined execution.
- Implement forward and backward functions for FlashAttention, including preprocessing and postprocessing steps.
- Enhance argument parsing for batch size, heads, context size, and dimensions.
- Include a reference implementation for validation and performance benchmarking.

a55f3686

[Docker] Update Dockerfiles to specify exact version of libstdcxx-ng (#203) · 05d72dfc

Lei Wang authored Mar 13, 2025

* Optimize CMake build process with dynamic job count calculation

- Modify build_csrc function to use 90% of available CPU cores
- Ensure at least one job is used during compilation
- Improve build performance by dynamically adjusting parallel job count

* Optimize build_csrc function with multiprocessing module

- Replace os.cpu_count() with multiprocessing.cpu_count()
- Maintain existing 90% CPU utilization logic
- Improve CPU core count calculation for build process

* Add dynamic shape support with out_idx in Cython JIT kernel compilation

- Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
- Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
- Add support for resolving dynamic shape dimensions using input tensor references
- Enhance flexibility of JIT kernel compilation with symbolic shape handling

* Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel

- Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
- Improve debugging by providing context about missing symbolic dimensions
- Maintain existing dynamic shape resolution logic

* Fix Copy operation handling for scalar and multi-dimensional tensors

- Add special handling for scalar tensor copy operations
- Enhance error reporting in MakeIndices method with more detailed diagnostic information
- Improve SIMT loop generation to support zero-dimensional tensors
- Add explicit check and handling for scalar tensor scenarios

* Refactor Copy operation code formatting and improve readability

- Improve code formatting in MakeIndices and MakeSIMTLoop methods
- Add line breaks to enhance readability of complex ICHECK statements
- Simplify code structure in scalar tensor handling
- Remove unnecessary whitespace and improve code alignment

* Simplify GEMM example with direct kernel compilation

- Update copyright header to Tile-AI Corporation
- Remove Profiler import and usage
- Replace tilelang.lower() with tilelang.compile()
- Simplify kernel execution workflow
- Update kernel source retrieval method

* Enhance block sparse attention implementation

- Update `blocksparse_flashattn` to use 2 stages for improved performance.
- Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency.
- Modify condition checks in the kernel to utilize boolean values.
- Introduce a new example for top-k sparse attention and a benchmark for native sparse attention.
- Add support for asynchronous copy in PTX and improve pipeline planning with condition handling.

* Refactor and clean up code formatting across multiple files

- Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`.
- Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`.
- Updated comments and documentation for clarity in `__init__.py` and `phase.py`.
- Ensured consistent formatting and style across the codebase.

* Add kernel source printing in example_tilelang_nsa_fwd.py and implement IfThenElse node replacement in inject_pipeline.cc

- Added a print statement to output the kernel source in `example_tilelang_nsa_fwd.py` for debugging purposes.
- Introduced a new function `replace_if_then_else` in `inject_pipeline.cc` to transform IfThenElse nodes while preserving attributes, enhancing the handling of conditional statements in the pipeline.

* Refactor condition handling in inject_pipeline.cc

- Change the data structure for mapping conditions to statements from a Map to an Array for improved performance and simplicity.
- Update condition comparison logic to use StructuralEqual for better accuracy.
- Enhance logging to provide detailed insights into condition changes and statement processing.
- Adjust final statement construction to utilize the new data structure, ensuring correct handling of conditions and statements.

* Improve logging and formatting in inject_pipeline.cc

- Enhance logging statements for better clarity on condition changes and statement processing.
- Adjust formatting for improved readability, including line breaks and consistent spacing.
- Ensure accurate condition comparison and handling in the pipeline logic.

* Refactor logging and clean up inject_pipeline.cc

- Remove excessive logging statements to streamline the code and improve performance.
- Simplify condition handling by eliminating unnecessary log outputs related to condition changes and statement processing.
- Maintain the core functionality while enhancing code readability and maintainability.

* Update Dockerfiles to specify exact version of libstdcxx-ng

- Change installation command in multiple Dockerfiles to use `libstdcxx-ng=12` instead of `libstdcxx-ng-12` for consistency and to avoid potential issues with package resolution.
- Ensure all Dockerfiles from cu118 to cu126 reflect this change for uniformity across builds.

05d72dfc

[Feature] Upgrade cutlass version and support fp8 T.gemm (#202) · 2cccf1f5

zqh-wz authored Mar 13, 2025



* upgrade cutlass to upstream v3.8.0

* Implement fp8 gemm and add example script

* Fix dtype retrieval with map_torch_type for fp8 inputs

* Disable vectorization of fp8 values

* Make MMA declaration compatible with cutlass 3.4.0+

* Add test for fp8 T.gemm

* fix indent

* fix indent

* Add copyright and license header

* Add copyright and license header

* lint fix

* Refactor matmul_nt and assert_matmul_correctness functions for improved readability by consolidating parameter definitions and adjusting formatting.

* clang format lint

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

2cccf1f5

[Enhancement] Enhancing the handling of conditional statements in the pipeline (#201) · dda8ebff

Lei Wang authored Mar 13, 2025

* Optimize CMake build process with dynamic job count calculation

- Modify build_csrc function to use 90% of available CPU cores
- Ensure at least one job is used during compilation
- Improve build performance by dynamically adjusting parallel job count

* Optimize build_csrc function with multiprocessing module

- Replace os.cpu_count() with multiprocessing.cpu_count()
- Maintain existing 90% CPU utilization logic
- Improve CPU core count calculation for build process

* Add dynamic shape support with out_idx in Cython JIT kernel compilation

- Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
- Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
- Add support for resolving dynamic shape dimensions using input tensor references
- Enhance flexibility of JIT kernel compilation with symbolic shape handling

* Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel

- Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
- Improve debugging by providing context about missing symbolic dimensions
- Maintain existing dynamic shape resolution logic

* Fix Copy operation handling for scalar and multi-dimensional tensors

- Add special handling for scalar tensor copy operations
- Enhance error reporting in MakeIndices method with more detailed diagnostic information
- Improve SIMT loop generation to support zero-dimensional tensors
- Add explicit check and handling for scalar tensor scenarios

* Refactor Copy operation code formatting and improve readability

- Improve code formatting in MakeIndices and MakeSIMTLoop methods
- Add line breaks to enhance readability of complex ICHECK statements
- Simplify code structure in scalar tensor handling
- Remove unnecessary whitespace and improve code alignment

* Simplify GEMM example with direct kernel compilation

- Update copyright header to Tile-AI Corporation
- Remove Profiler import and usage
- Replace tilelang.lower() with tilelang.compile()
- Simplify kernel execution workflow
- Update kernel source retrieval method

* Enhance block sparse attention implementation

- Update `blocksparse_flashattn` to use 2 stages for improved performance.
- Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency.
- Modify condition checks in the kernel to utilize boolean values.
- Introduce a new example for top-k sparse attention and a benchmark for native sparse attention.
- Add support for asynchronous copy in PTX and improve pipeline planning with condition handling.

* Refactor and clean up code formatting across multiple files

- Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`.
- Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`.
- Updated comments and documentation for clarity in `__init__.py` and `phase.py`.
- Ensured consistent formatting and style across the codebase.

* Add kernel source printing in example_tilelang_nsa_fwd.py and implement IfThenElse node replacement in inject_pipeline.cc

- Added a print statement to output the kernel source in `example_tilelang_nsa_fwd.py` for debugging purposes.
- Introduced a new function `replace_if_then_else` in `inject_pipeline.cc` to transform IfThenElse nodes while preserving attributes, enhancing the handling of conditional statements in the pipeline.

* Refactor condition handling in inject_pipeline.cc

- Change the data structure for mapping conditions to statements from a Map to an Array for improved performance and simplicity.
- Update condition comparison logic to use StructuralEqual for better accuracy.
- Enhance logging to provide detailed insights into condition changes and statement processing.
- Adjust final statement construction to utilize the new data structure, ensuring correct handling of conditions and statements.
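As a rough illustration of the Map-to-Array change described above, the sketch below keeps an ordered list of (condition, statements) pairs and merges a statement into the previous entry when its condition is structurally equal to that entry's condition. This is only a sketch under assumptions: the real pass operates on TVM IR nodes in C++, whereas here conditions are modelled as nested tuples, and `structural_equal` and `group_by_condition` are hypothetical names.

```python
from typing import Any, List, Tuple

Expr = Any  # hypothetical stand-in for an IR expression; nested tuples here


def structural_equal(a: Expr, b: Expr) -> bool:
    """Compare two expressions by structure rather than by object identity."""
    return a == b  # Python tuples already compare element by element


def group_by_condition(pairs: List[Tuple[Expr, str]]) -> List[Tuple[Expr, List[str]]]:
    """Group consecutive statements whose conditions are structurally equal.

    An ordered list is used instead of a map so that statement order is preserved.
    """
    groups: List[Tuple[Expr, List[str]]] = []
    for cond, stmt in pairs:
        if groups and structural_equal(groups[-1][0], cond):
            groups[-1][1].append(stmt)
        else:
            groups.append((cond, [stmt]))
    return groups


# Two copies guarded by separately built but identical conditions, then an unguarded gemm.
pairs = [
    (("lt", "k", 4), "copy_A"),
    (("lt", "k", 4), "copy_B"),
    (None, "gemm"),
]
groups = group_by_condition(pairs)
```

With this input the two guarded copies end up in one group and the unguarded statement in a second one, in the original order.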

* Improve logging and formatting in inject_pipeline.cc

- Enhance logging statements for better clarity on condition changes and statement processing.
- Adjust formatting for improved readability, including line breaks and consistent spacing.
- Ensure accurate condition comparison and handling in the pipeline logic.

* Refactor logging and clean up inject_pipeline.cc

- Remove excessive logging statements to streamline the code and improve performance.
- Simplify condition handling by eliminating unnecessary log outputs related to condition changes and statement processing.
- Maintain the core functionality while enhancing code readability and maintainability.

dda8ebff

[Dev] Add new example for FlashAttention with pipelined execution (#200) · c2b9b59d

Yu Cheng authored Mar 13, 2025

- Introduce `example_gqa_fwd_bshd_wgmma_pipelined.py` demonstrating a pipelined implementation of FlashAttention.
- Update sequence length parameter in existing example to 8192 and adjust number of stages for improved performance.
- Enhance argument parsing to accommodate new configurations for batch size, heads, and groups.

c2b9b59d

12 Mar, 2025 9 commits

[Feature] Support Async Pipeline inference within if scope (#198) · 7ccec53b

Lei Wang authored Mar 13, 2025

* Optimize CMake build process with dynamic job count calculation

- Modify build_csrc function to use 90% of available CPU cores
- Ensure at least one job is used during compilation
- Improve build performance by dynamically adjusting parallel job count

* Optimize build_csrc function with multiprocessing module

- Replace os.cpu_count() with multiprocessing.cpu_count()
- Maintain existing 90% CPU utilization logic
- Improve CPU core count calculation for build process

* Add dynamic shape support with out_idx in Cython JIT kernel compilation

- Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
- Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
- Add support for resolving dynamic shape dimensions using input tensor references
- Enhance flexibility of JIT kernel compilation with symbolic shape handling

* Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel

- Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
- Improve debugging by providing context about missing symbolic dimensions
- Maintain existing dynamic shape resolution logic

* Fix Copy operation handling for scalar and multi-dimensional tensors

- Add special handling for scalar tensor copy operations
- Enhance error reporting in MakeIndices method with more detailed diagnostic information
- Improve SIMT loop generation to support zero-dimensional tensors
- Add explicit check and handling for scalar tensor scenarios

* Refactor Copy operation code formatting and improve readability

- Improve code formatting in MakeIndices and MakeSIMTLoop methods
- Add line breaks to enhance readability of complex ICHECK statements
- Simplify code structure in scalar tensor handling
- Remove unnecessary whitespace and improve code alignment

* Simplify GEMM example with direct kernel compilation

- Update copyright header to Tile-AI Corporation
- Remove Profiler import and usage
- Replace tilelang.lower() with tilelang.compile()
- Simplify kernel execution workflow
- Update kernel source retrieval method

* Enhance block sparse attention implementation

- Update `blocksparse_flashattn` to use 2 stages for improved performance.
- Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency.
- Modify condition checks in the kernel to utilize boolean values.
- Introduce a new example for top-k sparse attention and a benchmark for native sparse attention.
- Add support for asynchronous copy in PTX and improve pipeline planning with condition handling.

* Refactor and clean up code formatting across multiple files

- Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`.
- Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`.
- Updated comments and documentation for clarity in `__init__.py` and `phase.py`.
- Ensured consistent formatting and style across the codebase.

7ccec53b

[CMake] Add CUDA Major Version Detection for Conditional Compilation (#197) · 20f19611

Yu Cheng authored Mar 12, 2025

* [Feature] Add TMA Store Synchronization Support

- Introduce TMAStoreArrive and TMAStoreWait operations for CUDA TMA store synchronization
- Add new builtin operations in op/builtin.cc and op/builtin.h
- Implement TMAStoreSyncInjector to automatically inject TMA store synchronization calls
- Update CUDA codegen to support new TMA store synchronization intrinsics
- Add Python language bindings for new TMA store synchronization operations

* [CMake] Add CUDA Major Version Detection for Conditional Compilation

- Introduce CUDA_MAJOR_VERSION CMake variable to dynamically detect CUDA toolkit version
- Update runtime and transform files to use CUDA_MAJOR_VERSION for version-specific code paths
- Replace hardcoded __CUDACC_VER_MAJOR__ with dynamically set CUDA_MAJOR_VERSION
- Improve cross-version compatibility for CUDA-dependent code sections

20f19611

Update expired example code. (#196) · 6ab29ffc

66RING authored Mar 12, 2025

Expired example code, update readme.

6ab29ffc

[Feature] Add TMA Store Synchronization Support (#195) · eba7dd5a

Yu Cheng authored Mar 12, 2025

- Introduce TMAStoreArrive and TMAStoreWait operations for CUDA TMA store synchronization
- Add new builtin operations in op/builtin.cc and op/builtin.h
- Implement TMAStoreSyncInjector to automatically inject TMA store synchronization calls
- Update CUDA codegen to support new TMA store synchronization intrinsics
- Add Python language bindings for new TMA store synchronization operations

eba7dd5a

[Refactor] Add SetMaxNRegCollector to Improve Register Hint Handling in Warp... · 94c758ad

Yu Cheng authored Mar 12, 2025

[Refactor] Add SetMaxNRegCollector to Improve Register Hint Handling in Warp Specialized Rewriter (#194)

* [Refactor] Add SetMaxNRegCollector to Improve Register Hint Handling in Warp Specialized Rewriter

- Introduce `SetMaxNRegCollector` to collect register hints from SetMaxNReg calls
- Modify `WarpSpecializedRewriter` to use collected register hints for producer and consumer code
- Add validation checks for register hint values in the collector
- Remove SetMaxNReg calls during code transformation
- Enhance flexibility of register allocation in warp specialized rewriting

* Temporarily remove check in lower_hopper_intrin

94c758ad

[Language] Support clamp in language (#192) · 94c941fc

_HYX_ authored Mar 12, 2025

* [Dev] Support clamp in language.

* [Bugfix]: Fix clamp

* [Refactor]

94c941fc

[Bugfix] Make quickstart work properly on cu118 (#193) · efb2b1d5

penguin_wwy authored Mar 12, 2025

efb2b1d5

[Enhancement] Simplify GEMM example with direct kernel compilation (#191) · 79ea77e8

Lei Wang authored Mar 12, 2025

* Optimize CMake build process with dynamic job count calculation

- Modify build_csrc function to use 90% of available CPU cores
- Ensure at least one job is used during compilation
- Improve build performance by dynamically adjusting parallel job count

* Optimize build_csrc function with multiprocessing module

- Replace os.cpu_count() with multiprocessing.cpu_count()
- Maintain existing 90% CPU utilization logic
- Improve CPU core count calculation for build process

* Add dynamic shape support with out_idx in Cython JIT kernel compilation

- Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
- Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
- Add support for resolving dynamic shape dimensions using input tensor references
- Enhance flexibility of JIT kernel compilation with symbolic shape handling

* Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel

- Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
- Improve debugging by providing context about missing symbolic dimensions
- Maintain existing dynamic shape resolution logic

* Fix Copy operation handling for scalar and multi-dimensional tensors

- Add special handling for scalar tensor copy operations
- Enhance error reporting in MakeIndices method with more detailed diagnostic information
- Improve SIMT loop generation to support zero-dimensional tensors
- Add explicit check and handling for scalar tensor scenarios

* Refactor Copy operation code formatting and improve readability

- Improve code formatting in MakeIndices and MakeSIMTLoop methods
- Add line breaks to enhance readability of complex ICHECK statements
- Simplify code structure in scalar tensor handling
- Remove unnecessary whitespace and improve code alignment

* Simplify GEMM example with direct kernel compilation

- Update copyright header to Tile-AI Corporation
- Remove Profiler import and usage
- Replace tilelang.lower() with tilelang.compile()
- Simplify kernel execution workflow
- Update kernel source retrieval method

79ea77e8

[Bugfix] Fix `T.copy` for scalar datatypes (#190) · 454248c7

Lei Wang authored Mar 12, 2025

* Optimize CMake build process with dynamic job count calculation

- Modify build_csrc function to use 90% of available CPU cores
- Ensure at least one job is used during compilation
- Improve build performance by dynamically adjusting parallel job count

* Optimize build_csrc function with multiprocessing module

- Replace os.cpu_count() with multiprocessing.cpu_count()
- Maintain existing 90% CPU utilization logic
- Improve CPU core count calculation for build process

* Add dynamic shape support with out_idx in Cython JIT kernel compilation

- Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
- Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
- Add support for resolving dynamic shape dimensions using input tensor references
- Enhance flexibility of JIT kernel compilation with symbolic shape handling

* Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel

- Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
- Improve debugging by providing context about missing symbolic dimensions
- Maintain existing dynamic shape resolution logic

* Fix Copy operation handling for scalar and multi-dimensional tensors

- Add special handling for scalar tensor copy operations
- Enhance error reporting in MakeIndices method with more detailed diagnostic information
- Improve SIMT loop generation to support zero-dimensional tensors
- Add explicit check and handling for scalar tensor scenarios

* Refactor Copy operation code formatting and improve readability

- Improve code formatting in MakeIndices and MakeSIMTLoop methods
- Add line breaks to enhance readability of complex ICHECK statements
- Simplify code structure in scalar tensor handling
- Remove unnecessary whitespace and improve code alignment

454248c7

11 Mar, 2025 4 commits

[Dev] Add the failed nvcc command to the exception message (#189) · 5fafcb32

penguin_wwy authored Mar 12, 2025

5fafcb32

[Dev][Bugfix] Add RMS Normalization Kernels and Fix Reduce Bug (#188) · fe0de672

Yu Cheng authored Mar 12, 2025

* [Dev][Bugfix] Add RMS Normalization Kernels and Fix Reduce Bug

- Implement two RMS normalization implementations in TileLang:
  * `rms_norm_splitk`: Split-K reduction approach for large matrices
  * `rms_norm`: Full reduction kernel with simplified implementation
- Add reference implementation using PyTorch for validation
- Include performance benchmarking for both kernel variants
- Demonstrate flexible block size and matrix size configurations
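For orientation, the computation behind the two kernel variants can be sketched in plain Python. The commit's own reference implementation uses PyTorch; the version below is a dependency-free sketch under assumptions: the `eps` term and the function names `rms_norm_reference` and `rms_norm_splitk_reference` are hypothetical, and no learnable scale is applied. The split-K variant accumulates per-block partial sums of squares before combining them, which is the general idea of a split-K reduction.

```python
import math
from typing import List

Matrix = List[List[float]]


def rms_norm_reference(x: Matrix, eps: float = 1e-12) -> Matrix:
    """Normalize each row by its root mean square (full reduction per row)."""
    out = []
    for row in x:
        mean_sq = sum(v * v for v in row) / len(row)
        inv_rms = 1.0 / math.sqrt(mean_sq + eps)
        out.append([v * inv_rms for v in row])
    return out


def rms_norm_splitk_reference(x: Matrix, block_k: int, eps: float = 1e-12) -> Matrix:
    """Same result, but the sum of squares is built from per-block partial sums."""
    out = []
    for row in x:
        partial = [
            sum(v * v for v in row[k:k + block_k])
            for k in range(0, len(row), block_k)
        ]
        inv_rms = 1.0 / math.sqrt(sum(partial) / len(row) + eps)
        out.append([v * inv_rms for v in row])
    return out


x = [[3.0, 4.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
full = rms_norm_reference(x)
split = rms_norm_splitk_reference(x, block_k=2)
```

Both functions should agree up to floating-point rounding, which is the kind of check a validation step against a reference would perform.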

* [Examples] Simplify RMS Normalization Kernel Compilation

- Remove commented-out code for split-K RMS normalization
- Simplify kernel compilation by removing explicit TMA lowering configuration
- Update copyright header to Tile-AI Corporation
- Streamline main script for RMS normalization example

fe0de672

[Bugfix] Add dynamic shape support with out_idx in Cython JIT kernel compilation (#185) · d34601ab

Lei Wang authored Mar 11, 2025

* Optimize CMake build process with dynamic job count calculation

- Modify build_csrc function to use 90% of available CPU cores
- Ensure at least one job is used during compilation
- Improve build performance by dynamically adjusting parallel job count

* Optimize build_csrc function with multiprocessing module

- Replace os.cpu_count() with multiprocessing.cpu_count()
- Maintain existing 90% CPU utilization logic
- Improve CPU core count calculation for build process

* Add dynamic shape support with out_idx in Cython JIT kernel compilation

- Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
- Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
- Add support for resolving dynamic shape dimensions using input tensor references
- Enhance flexibility of JIT kernel compilation with symbolic shape handling

* Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel

- Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
- Improve debugging by providing context about missing symbolic dimensions
- Maintain existing dynamic shape resolution logic
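The resolution step described above can be sketched as follows. The name `dynamic_symbolic_map` comes from the commit message, but its layout here (symbol name mapped to an input tensor index and a dimension index) is an assumption, as is the helper name `resolve_output_shape`; the actual Cython wrapper may differ.

```python
from typing import Dict, Sequence, Tuple, Union

Dim = Union[int, str]  # an int is a static extent, a str names a symbolic dimension


def resolve_output_shape(
    out_shape: Sequence[Dim],
    dynamic_symbolic_map: Dict[str, Tuple[int, int]],
    input_shapes: Sequence[Sequence[int]],
) -> Tuple[int, ...]:
    """Replace each symbolic dimension with the concrete extent of an input tensor."""
    resolved = []
    for dim in out_shape:
        if isinstance(dim, int):
            resolved.append(dim)
            continue
        if dim not in dynamic_symbolic_map:
            # Report which symbol is missing and which ones are known.
            raise ValueError(
                f"Cannot resolve dynamic symbolic dimension {dim!r}; "
                f"known symbols: {sorted(dynamic_symbolic_map)}"
            )
        tensor_idx, dim_idx = dynamic_symbolic_map[dim]
        resolved.append(input_shapes[tensor_idx][dim_idx])
    return tuple(resolved)


# A is (m, 64) and B is (64, 128); the output is (m, 128) with m known only at call time.
symbol_map = {"m": (0, 0)}
shape = resolve_output_shape(["m", 128], symbol_map, [(512, 64), (64, 128)])
```

An output tensor could then be allocated with the resolved shape before the kernel is launched.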

d34601ab

[Enhancement] Optimize CMake build process with dynamic job count calculation (#183) · c2192780

Lei Wang authored Mar 11, 2025

* Optimize CMake build process with dynamic job count calculation

- Modify build_csrc function to use 90% of available CPU cores
- Ensure at least one job is used during compilation
- Improve build performance by dynamically adjusting parallel job count

* Optimize build_csrc function with multiprocessing module

- Replace os.cpu_count() with multiprocessing.cpu_count()
- Maintain existing 90% CPU utilization logic
- Improve CPU core count calculation for build process
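The job-count rule described above amounts to a single expression. The sketch below is a minimal illustration, not the project's `build_csrc` code: the helper name `build_job_count` and the use of truncation when taking 90% of the cores are assumptions.

```python
import multiprocessing


def build_job_count(cpu_fraction: float = 0.9) -> int:
    """Return the number of parallel build jobs: a fraction of the cores, never below one."""
    cores = multiprocessing.cpu_count()
    return max(1, int(cores * cpu_fraction))


jobs = build_job_count()
# The value would typically be forwarded to the build tool, for example as `-j{jobs}`.
print(f"parallel build jobs: {jobs}")
```

The `max(1, ...)` guard matters on single-core machines, where 90% of one core would otherwise truncate to zero jobs.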

c2192780

10 Mar, 2025 3 commits

[Examples] Implement NSA Backward kernels (#180) · 6891d3ec

Lei Wang authored Mar 10, 2025


* Update native sparse attention example with scale parameter handling

- Add scale parameter processing in native_sparse_attention function
- Modify example script to include custom scale value
- Update function calls to pass scale parameter
- Enhance flexibility of sparse attention implementation

* Refactor Triton Native Sparse Attention Example

- Improve code formatting and readability in example_triton_nsa_bwd.py
- Standardize function and parameter alignment
- Remove unnecessary whitespace and optimize imports
- Enhance code style consistency with previous commits

6891d3ec

[Bugfix] Improve Thread Variable Handling in Layout Inference (#179) · c39e540a

Lei Wang authored Mar 10, 2025

* [Refactor] Improve Thread Variable Handling in Layout Inference

- Update layout inference to handle thread variables more robustly
- Add explicit size check between infer_list_ and thread_var_vec_
- Modify thread variable access to use per-iteration thread variable
- Simplify thread predicate retrieval logic
- Add minor code cleanup and return variable assignment

* [Refactor] Update Layout Inference Copyright and Simplify Return Logic

- Replace Apache License header with Microsoft Corporation copyright notice
- Simplify LayoutInference function by directly returning substituted function
- Remove unnecessary variable assignment in return statement

* [Refactor] Update Layout Inference Copyright to Tile-AI Corporation

- Change copyright notice from Microsoft Corporation to Tile-AI Corporation
- Maintain existing file structure and licensing header

c39e540a

[Refactor] Enhance GPU Kernel Launch with Environment Thread Creation (#178) · 8ccf6ea2

Lei Wang authored Mar 10, 2025

- Introduce `CreateEnvThread` function to generate environment threads for GPU kernel launches
- Modify `KernelLaunch` to use `CreateEnvThread` for block and thread indices
- Improve thread variable naming with shorter, more descriptive identifiers (bx, by, bz, tx, ty, tz)
- Ensure proper thread environment setup within PrimFunc context

8ccf6ea2

09 Mar, 2025 4 commits

[Feat] Introduce new caching mechanism for compiled kernels (#176) · 7bde63d5

Lei Wang authored Mar 09, 2025

* Add kernel caching mechanism to TileLang

- Implement a new `cached` function in `tilelang/cache/__init__.py` to cache and reuse compiled kernels
- Expose the `cached` function in the main `tilelang/__init__.py`
- Add a test case for cached matrix multiplication in `testing/python/cache/test_tilelang_cache_matmul.py`
- Provide a `clear_cache()` function to reset the kernel cache when needed

* Refactor kernel caching test and implementation

- Simplify the `cached` function in `tilelang/cache/__init__.py`
- Update test script `test_tilelang_cache_matmul.py` to use `tilelang.testing.main()`
- Remove unnecessary whitespace and improve code formatting

* Update import for `cached` function in MHA examples

- Modify import statement in `example_mha_bwd.py` and `test_tilelang_kernel_mha_bwd.py`
- Change import from `tilelang.profiler import cached` to `tilelang import cached`
- Align with recent refactoring of kernel caching mechanism

* Refactor `cached` function signature in kernel caching

- Update function signature to use keyword-only arguments for `target` and `target_host`
- Improve parameter order and readability of the `cached` decorator
- Maintain existing functionality while enhancing function definition
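The general shape of such a cache can be sketched as a dictionary keyed by the function and its compilation targets. The names `cached`, `clear_cache`, `target`, and `target_host` come from the commit message; everything else below, including the `compile_fn` parameter, the default values, and the cache key, is a hypothetical simplification and not the signature of `tilelang.cached`.

```python
from typing import Any, Callable, Dict, Hashable, List, Optional, Tuple

_kernel_cache: Dict[Tuple[Hashable, str, Optional[str]], Any] = {}


def cached(
    func: Hashable,
    compile_fn: Callable[..., Any],
    *,
    target: str = "auto",
    target_host: Optional[str] = None,
) -> Any:
    """Compile `func` once per (func, target, target_host) and reuse the result."""
    key = (func, target, target_host)
    if key not in _kernel_cache:
        _kernel_cache[key] = compile_fn(func, target=target, target_host=target_host)
    return _kernel_cache[key]


def clear_cache() -> None:
    """Drop every cached kernel so the next call compiles again."""
    _kernel_cache.clear()


# Stand-in compiler that records how often it is invoked.
compile_calls: List[Hashable] = []


def fake_compile(func: Hashable, *, target: str, target_host: Optional[str]) -> str:
    compile_calls.append(func)
    return f"kernel<{func}@{target}>"


k1 = cached("matmul", fake_compile, target="cuda")
k2 = cached("matmul", fake_compile, target="cuda")  # served from the cache
```

Making `target` and `target_host` keyword-only, as the last change describes, keeps call sites explicit about which argument is which.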

7bde63d5

[Feat] Append Pass Context and TMA lowering configuration option (#175) · fb6b101c

Lei Wang authored Mar 09, 2025

* Add TMA lowering configuration option and update copyright notices

This commit introduces a new configuration option to disable TMA (Tensor Memory Accelerator) lowering and updates copyright notices across multiple files. Key changes include:

- Add `kDisableTMALower` configuration option in builtin.h and builtin.cc
- Update copyright notices from Microsoft Corporation to Tile-AI Corporation
- Modify `LowerArgs` struct to include `disable_tma_lower` flag
- Update JIT compilation interfaces to support pass configuration
- Enhance error reporting in bulk copy lowering
- Propagate pass configuration through various adapter layers

* lint fix

fb6b101c

[AutoTune] Enable config-performance trace (#174) · e6f77253

Lei Wang authored Mar 09, 2025

* Improve Autotuner and CUDA Compatibility for Tensor Core Policies

- Enhance autotuner with robust parallel compilation and error handling
- Add logging for better debugging during configuration compilation
- Support SM90 compute capabilities in TensorCore and matmul analysis policies
- Improve future handling and result tracking in autotuner
- Add more flexible SM version checks for pipeline and async copy stages

* Refactor Autotuner Parallel Compilation with Improved Error Handling

- Enhance tqdm progress bar formatting for concurrent configuration compilation
- Simplify exception handling in parallel compilation process
- Remove unnecessary logging and improve code readability
- Optimize thread pool shutdown and result processing
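A parallel compile loop with per-configuration error handling might look like the sketch below. It uses only the standard library's `concurrent.futures`; the helper `compile_configs`, the `compile_fn` callback, and the example configurations are hypothetical, and the tqdm progress bar mentioned above is omitted.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Dict, List, Tuple

Config = Dict[str, int]


def compile_configs(
    configs: List[Config],
    compile_fn: Callable[[Config], Any],
    max_workers: int = 4,
) -> Tuple[List[Tuple[Config, Any]], List[Tuple[Config, Exception]]]:
    """Compile every config in a thread pool; a failure only drops that config."""
    compiled: List[Tuple[Config, Any]] = []
    failed: List[Tuple[Config, Exception]] = []
    # Leaving the `with` block shuts the pool down after all futures finish.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        future_to_config = {pool.submit(compile_fn, cfg): cfg for cfg in configs}
        for future in as_completed(future_to_config):
            cfg = future_to_config[future]
            try:
                compiled.append((cfg, future.result()))
            except Exception as exc:  # keep tuning even if one config fails
                failed.append((cfg, exc))
    return compiled, failed


def fake_compile(cfg: Config) -> str:
    """Stand-in compiler that rejects oversized tiles."""
    if cfg["block_M"] > 128:
        raise ValueError(f"block_M={cfg['block_M']} exceeds the supported tile size")
    return f"kernel_{cfg['block_M']}"


configs = [{"block_M": 64}, {"block_M": 128}, {"block_M": 256}]
compiled, failed = compile_configs(configs, fake_compile)
```

Results arrive in completion order rather than submission order, so each result is paired with its configuration through the `future_to_config` mapping.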

e6f77253

[Bugfix] Implement boundary check for the buffer shape with dynamic symbolic (#173) · 8344af52

Lei Wang authored Mar 09, 2025

* [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation

- Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py
- Modify roller hints generation using new TileLang Carver template and utility functions
- Update get_roller_hints_from_func to handle None cases and improve return logic
- Adjust DefaultPolicy to handle different codegen dictionary formats

* [Refactor] Update Thread Binding and Import Statements in TileLang Kernels

- Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files
- Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples
- Move map_torch_type utility function to tilelang.utils.tensor
- Remove unnecessary imports and improve code organization

* Refactor Native Sparse Attention Example with Enhanced Triton Kernel

- Update parallel_nsa_fwd_kernel to support more flexible sparse attention computation
- Add support for block counts and offsets in the Triton kernel
- Modify kernel grid and computation logic for improved performance
- Update example script to use naive_nsa_simple reference implementation
- Improve type hints and kernel configuration

* Add Native Sparse Attention Examples with Tilelang and Triton Implementations

- Introduce new example scripts for native sparse attention:
  * example_tilelang_nsa_fwd.py: Forward pass implementation using TileLang
  * example_tilelang_nsa_decode.py: Decoding-specific sparse attention implementation
  * example_triton_nsa_fwd.py: Triton-based sparse attention forward pass
- Update reference.py with naive implementations for sparse attention
- Support different sparse attention scenarios including forward pass and inference
- Add comprehensive testing and validation against reference implementations

* lint fix

* Add Variable-Length Native Sparse Attention Examples for TileLang and Triton

- Introduce new example scripts for variable-length native sparse attention:
  * example_tilelang_nsa_fwd_varlen.py: TileLang implementation with variable sequence lengths
  * example_triton_nsa_fwd_varlen.py: Triton implementation with variable sequence lengths
- Update reference.py to support variable-length sparse attention scenarios
- Enhance existing sparse attention implementations to handle variable-length inputs
- Add comprehensive testing and validation for variable-length sparse attention

* Refactor Native Sparse Attention Examples: Code Style and Formatting Improvements

- Standardize function and parameter formatting across NSA example files
- Improve code readability by adjusting indentation and line breaks
- Enhance type hints and parameter alignment
- Remove unnecessary whitespaces and optimize imports
- Maintain consistent code style across TileLang and Triton implementations

* Add debug logging and extend execution backend in JIT and loop vectorization

- Add detailed logging in loop vectorization to help diagnose buffer shape handling
- Extend JIT execution backend to include 'cython' option
- Improve boundary condition checks in BufferLoadNode visit method

* Remove debug logging in loop vectorization BufferLoadNode visit method

- Remove unnecessary INFO log statements in VisitExpr_ method
- Simplify code by eliminating redundant logging
- Maintain core logic for handling buffer load node visits

8344af52

07 Mar, 2025 7 commits

[Example] Implement tilelang native sparse attention varlen example (#170) · 8e1845d2

Lei Wang authored Mar 08, 2025

* [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation

- Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py
- Modify roller hints generation using new TileLang Carver template and utility functions
- Update get_roller_hints_from_func to handle None cases and improve return logic
- Adjust DefaultPolicy to handle different codegen dictionary formats

* [Refactor] Update Thread Binding and Import Statements in TileLang Kernels

- Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files
- Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples
- Move map_torch_type utility function to tilelang.utils.tensor
- Remove unnecessary imports and improve code organization

* Refactor Native Sparse Attention Example with Enhanced Triton Kernel

- Update parallel_nsa_fwd_kernel to support more flexible sparse attention computation
- Add support for block counts and offsets in the Triton kernel
- Modify kernel grid and computation logic for improved performance
- Update example script to use naive_nsa_simple reference implementation
- Improve type hints and kernel configuration

* Add Native Sparse Attention Examples with Tilelang and Triton Implementations

- Introduce new example scripts for native sparse attention:
  * example_tilelang_nsa_fwd.py: Forward pass implementation using TileLang
  * example_tilelang_nsa_decode.py: Decoding-specific sparse attention implementation
  * example_triton_nsa_fwd.py: Triton-based sparse attention forward pass
- Update reference.py with naive implementations for sparse attention
- Support different sparse attention scenarios including forward pass and inference
- Add comprehensive testing and validation against reference implementations

* lint fix

* Add Variable-Length Native Sparse Attention Examples for TileLang and Triton

- Introduce new example scripts for variable-length native sparse attention:
  * example_tilelang_nsa_fwd_varlen.py: TileLang implementation with variable sequence lengths
  * example_triton_nsa_fwd_varlen.py: Triton implementation with variable sequence lengths
- Update reference.py to support variable-length sparse attention scenarios
- Enhance existing sparse attention implementations to handle variable-length inputs
- Add comprehensive testing and validation for variable-length sparse attention

* Refactor Native Sparse Attention Examples: Code Style and Formatting Improvements

- Standardize function and parameter formatting across NSA example files
- Improve code readability by adjusting indentation and line breaks
- Enhance type hints and parameter alignment
- Remove unnecessary whitespaces and optimize imports
- Maintain consistent code style across TileLang and Triton implementations

8e1845d2

[Dev] Use SS-GEMM for PV in mla (#165) · 166a9585

You Jiacheng authored Mar 08, 2025

It's slightly faster than T.copy then RS-GEMM, and simpler.

166a9585

Add Docker build scripts for local and PyPI distribution (#166) · d3f26ef8

Lei Wang authored Mar 07, 2025

d3f26ef8

[Example] Implement NSA Decode tilelang examples (#168) · 69f35439

Lei Wang authored Mar 07, 2025

* [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation

- Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py
- Modify roller hints generation using new TileLang Carver template and utility functions
- Update get_roller_hints_from_func to handle None cases and improve return logic
- Adjust DefaultPolicy to handle different codegen dictionary formats

* [Refactor] Update Thread Binding and Import Statements in TileLang Kernels

- Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files
- Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples
- Move map_torch_type utility function to tilelang.utils.tensor
- Remove unnecessary imports and improve code organization

* Refactor Native Sparse Attention Example with Enhanced Triton Kernel

- Update parallel_nsa_fwd_kernel to support more flexible sparse attention computation
- Add support for block counts and offsets in the Triton kernel
- Modify kernel grid and computation logic for improved performance
- Update example script to use naive_nsa_simple reference implementation
- Improve type hints and kernel configuration

* Add Native Sparse Attention Examples with Tilelang and Triton Implementations

- Introduce new example scripts for native sparse attention:
  * example_tilelang_nsa_fwd.py: Forward pass implementation using TileLang
  * example_tilelang_nsa_decode.py: Decoding-specific sparse attention implementation
  * example_triton_nsa_fwd.py: Triton-based sparse attention forward pass
- Update reference.py with naive implementations for sparse attention
- Support different sparse attention scenarios including forward pass and inference
- Add comprehensive testing and validation against reference implementations

* lint fix

69f35439

[Bugfix] Cast bool dtype into int8 in blocksparse examples (#167) · b6c48453

Lei Wang authored Mar 07, 2025

* [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation

- Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py
- Modify roller hints generation using new TileLang Carver template and utility functions
- Update get_roller_hints_from_func to handle None cases and improve return logic
- Adjust DefaultPolicy to handle different codegen dictionary formats

* [Refactor] Update Thread Binding and Import Statements in TileLang Kernels

- Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files
- Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples
- Move map_torch_type utility function to tilelang.utils.tensor
- Remove unnecessary imports and improve code organization

* Refactor Native Sparse Attention Example with Enhanced Triton Kernel

- Update parallel_nsa_fwd_kernel to support more flexible sparse attention computation
- Add support for block counts and offsets in the Triton kernel
- Modify kernel grid and computation logic for improved performance
- Update example script to use naive_nsa_simple reference implementation
- Improve type hints and kernel configuration

b6c48453

[Refactor] Replace `T.thread_binding` with `T.get_thread_binding` in examples and test cases (#163) · de1ba1e4

Lei Wang authored Mar 07, 2025

* [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation

- Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py
- Modify roller hints generation using new TileLang Carver template and utility functions
- Update get_roller_hints_from_func to handle None cases and improve return logic
- Adjust DefaultPolicy to handle different codegen dictionary formats

* [Refactor] Update Thread Binding and Import Statements in TileLang Kernels

- Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files
- Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples
- Move map_torch_type utility function to tilelang.utils.tensor
- Remove unnecessary imports and improve code organization

de1ba1e4

[Enhancement] Improve CUDA path detection (#157) · 901deae1

Wenhao Xie authored Mar 07, 2025

* [Typo] Fix formatting in installation instructions in README.md

* [Enhancement] Improve CUDA path detection and update configuration handling

* fix typo

* remove IS_WINDOWS constant

* lint fix

* Improve error messages for CUDA detection failure

* lint fix

* lint fix

* Fix .gitignore to correctly include venv directory
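
A rough sketch of CUDA path detection with a descriptive failure message, in the spirit of the changes listed above. The search order (environment variables, then `nvcc` on `PATH`, then `/usr/local/cuda`) and the function name `find_cuda_home` are assumptions for illustration; the commit's actual logic may differ.

```python
import os
import shutil


def find_cuda_home(env=None, which=shutil.which, isdir=os.path.isdir):
    """Return a CUDA installation directory or raise with a helpful message.

    `env`, `which` and `isdir` are injectable so the logic can be tested
    without a real CUDA installation.
    """
    env = os.environ if env is None else env

    # 1. Explicit environment variables take priority.
    for var in ("CUDA_HOME", "CUDA_PATH"):
        path = env.get(var)
        if path and isdir(path):
            return path

    # 2. Fall back to nvcc on PATH, assuming the layout <cuda_home>/bin/nvcc.
    nvcc = which("nvcc")
    if nvcc:
        return os.path.dirname(os.path.dirname(nvcc))

    # 3. Conventional default location on Linux.
    default = "/usr/local/cuda"
    if isdir(default):
        return default

    raise RuntimeError(
        "Could not locate CUDA. Checked CUDA_HOME, CUDA_PATH, nvcc on PATH, "
        "and /usr/local/cuda. Set CUDA_HOME to the installation directory."
    )
```

Listing every location that was checked in the error message makes a detection failure much easier to diagnose than a bare "CUDA not found".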

901deae1

06 Mar, 2025 6 commits

Refactor MLA decode kernel: Replace T.If with native Python if statement (#162) · cfcbcf1e

Lei Wang authored Mar 07, 2025

Simplify the control flow in the MLA decode kernel by replacing TileLang's T.If construct with a standard Python if statement. This change improves code readability and maintains the existing logic for handling sequence length constraints during block-wise computation.

cfcbcf1e

[Carver] Multi-Threads Compilation for Fast Auto Tuning (#156) · 18be9e07

Chaofan Lin authored Mar 07, 2025

* [Carver] Multi-Threads Compilation for Fast Auto Tuning

* Add progress bar for compilation

* lint

18be9e07

Add cpu jit with backend ctypes (#154) · 782ca9f6

xs-keju authored Mar 06, 2025

* Add cpu jit with backend ctypes

* Resolve some lint issues

* Apply PR feedback on header file and kernel example

* Add test cases

* Resolve formatting issues

* Resolve formatting issues

---------
Co-authored-by: xxw <1990389406@qq.con>

782ca9f6

Add libstdcxx-ng-12 to Dockerfiles for CUDA versions (#160) · 3486e27e

Lei Wang authored Mar 06, 2025

Update Dockerfiles for CUDA 118, 120, 121, 123, 124, 125, and 126 to install libstdcxx-ng-12 from conda-forge, ensuring consistent standard library support across different CUDA versions

3486e27e

[Release] Bump Version to v0.1.2 (#155) · 237dab0d

Lei Wang authored Mar 06, 2025

* Remove Torch CPP backend and update execution backend options

- Remove TorchCPPKernelAdapter and related code from JIT modules
- Update execution backend options in jit/__init__.py, kernel.py, and adapter/__init__.py
- Remove "torch_cpp" from supported execution backend literals
- Simplify backend validation and remove unused torch_cpp-related code

* lint fix

* Add block sparse attention implementations for TileLang and Triton

- Implement block sparse attention kernels for TileLang and Triton
- Add example scripts for block sparse attention with top-k and threshold-based masking
- Include utility functions for generating sparse attention masks
- Demonstrate causal attention with block-level sparsity
- Add test cases to validate sparse attention implementations against PyTorch reference
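
To illustrate the two masking strategies mentioned above, here is a small pure-Python sketch that builds causal block-level masks from a matrix of block scores. The function names and the score matrix are hypothetical; the example scripts in the commit may construct their masks differently (for instance with tensor operations).

```python
def topk_causal_block_mask(scores, k):
    """Keep the k highest-scoring key blocks per query block, causally.

    scores[i][j] is the importance of key block j for query block i.
    """
    n = len(scores)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        # Causality at block level: query block i only sees key blocks 0..i.
        candidates = range(i + 1)
        # Highest score first; ties go to the lower block index.
        keep = sorted(candidates, key=lambda j: (-scores[i][j], j))[:k]
        for j in keep:
            mask[i][j] = True
    return mask


def threshold_causal_block_mask(scores, threshold):
    """Keep every causally visible key block whose score exceeds the threshold."""
    n = len(scores)
    return [[j <= i and scores[i][j] > threshold for j in range(n)] for i in range(n)]


scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.5],
    [0.7, 0.1, 0.6],
]
print(topk_causal_block_mask(scores, k=2))
# [[True, False, False], [True, True, False], [True, False, True]]
print(threshold_causal_block_mask(scores, threshold=0.5))
# [[True, False, False], [False, True, False], [True, False, True]]
```

Top-k fixes the number of blocks each query block attends to, while a threshold lets that number vary with the score distribution.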

* Bump version to 0.1.1

* Bump version to 0.1.2

237dab0d

[Dev][Benchmark] Add MLA paged decoding example and benchmark script (#158) · be9abf18

Yu Cheng authored Mar 06, 2025

* [Dev] Adjust computation logic to avoid precision loss when casting acc_s from float to float16

- Remove redundant `acc_s_0` fragment in flash attention kernel
- Simplify memory copy and reduction operations
- Reorder memory copy and scaling steps for improved performance
- Add Hopper-specific synchronization method in CUDA reduce template
- Update reduce operation to use architecture-specific synchronization

* [Dev] Add DeepSeek MLA Decoding (Paged+Varlen) kernel and Performance Benchmark Script

- Implement comprehensive MLA (Multi-Head Latent Attention) decoding benchmark script
- Add support for multiple implementations: Torch, TileLang, FlashMLA, FlashInfer, and Triton
- Create flexible configuration for benchmarking different batch sizes, sequence lengths, and head configurations
- Implement performance comparison and CSV output for detailed performance analysis
- Add command-line argument support for targeted benchmarking and comparison

* [Dev] Refactor MLA Paged Decoding Kernel with Improved Block Handling and Precision

- Replace `d` parameter with `dv` to clarify value dimension in MLA decoding
- Enhance block distribution logic for split KV processing
- Improve handling of remaining blocks in split KV computation
- Add initialization of `lse_max_local` to prevent potential precision issues
- Optimize block start and range calculations for more accurate sequence processing
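
One common way to distribute KV blocks across splits while handling the remainder is shown below. This is only a sketch of the general idea, assuming an even split with leftover blocks assigned to the first splits; the name `split_block_ranges` is hypothetical and the kernel's real scheme may differ.

```python
def split_block_ranges(num_blocks, num_splits):
    """Return a half-open (start, end) block range for each split.

    Every split gets num_blocks // num_splits blocks, and the first
    num_blocks % num_splits splits each take one extra block.
    """
    base, remainder = divmod(num_blocks, num_splits)
    ranges = []
    start = 0
    for split in range(num_splits):
        count = base + (1 if split < remainder else 0)
        ranges.append((start, start + count))
        start += count
    return ranges


print(split_block_ranges(10, 4))  # [(0, 3), (3, 6), (6, 8), (8, 10)]
print(split_block_ranges(2, 4))   # [(0, 1), (1, 2), (2, 2), (2, 2)]
```

Splits that receive an empty range, as in the second call, need their running maximum and log-sum-exp initialized so that they contribute nothing when partial results are merged.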

* lint

be9abf18