1. 05 Oct, 2025 1 commit
  2. 30 Sep, 2025 1 commit
    • [Example] Specify a fixed commit for the flash-linear-attention repository and... · 3ad6202d
      Lei Wang authored
      [Example] Specify a fixed commit for the flash-linear-attention repository and optimize nsa examples (#913)
      
      - Updated the requirements.txt to specify a fixed commit for the flash-linear-attention repository.
      - Refactored import paths in benchmark_nsa_fwd.py for better organization.
      - Added a new function to generate configurations for autotuning (see the sketch after this list).
      - Modified the tilelang_sparse_attention function to accept parameters for block size, number of stages, and threads, enhancing flexibility.
      - Changed allocation of shared memory for accumulators to optimize performance.
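      A minimal sketch of the two changes above, assuming illustrative value ranges and keyword spellings (the real generator and the `tilelang_sparse_attention` signature live in benchmark_nsa_fwd.py):

      ```python
      import itertools

      def get_configs():
          # Hypothetical autotuning candidate generator; the value ranges below
          # are placeholders, not the ones used in benchmark_nsa_fwd.py.
          block_sizes = [32, 64, 128]
          num_stages = [1, 2, 3]
          threads = [128, 256]
          return [
              dict(block_size=bs, num_stages=ns, threads=th)
              for bs, ns, th in itertools.product(block_sizes, num_stages, threads)
          ]

      # Each candidate is then forwarded to the parameterized kernel builder,
      # e.g. tilelang_sparse_attention(**config), so the autotuner can sweep the space.
      ```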
  3. 12 Apr, 2025 1 commit
    • [Enhancement][Pipeline] More precise copy code block detection in pipeline (#384) · abaacde5
      Lei Wang authored
      * Update legalize_safe_memory_access.cc
      
      * Add cache path handling and file locking in Cython adapter
      
      - Introduced a new cache path based on the code hash for the Cython JIT adapter, enhancing cache management.
      - Added a lock file mechanism to ensure safe access during cache operations, improving concurrency handling (see the sketch after this list).
      - These changes aim to optimize the compilation process and prevent race conditions during library loading.
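      A sketch of the caching scheme described above, assuming a SHA-256 hash of the generated code and a POSIX flock-based lock file; the directory layout and names are illustrative, not the adapter's actual code:

      ```python
      import fcntl
      import hashlib
      from pathlib import Path

      def cache_paths(code: str, cache_root: str = "~/.tilelang/cython_cache"):
          # One cache directory per hash of the generated code.
          code_hash = hashlib.sha256(code.encode()).hexdigest()
          cache_dir = Path(cache_root).expanduser() / code_hash
          cache_dir.mkdir(parents=True, exist_ok=True)
          return cache_dir, cache_dir / "lock"

      def with_cache_lock(lock_path: Path, build_or_load):
          # Hold an exclusive lock while building/loading so concurrent processes
          # cannot race on the same cache entry.
          with open(lock_path, "w") as lock_file:
              fcntl.flock(lock_file, fcntl.LOCK_EX)
              try:
                  return build_or_load()
              finally:
                  fcntl.flock(lock_file, fcntl.LOCK_UN)

      cache_dir, lock_path = cache_paths("// generated kernel source ...")
      lib_path = with_cache_lock(lock_path, lambda: cache_dir / "kernel.so")
      ```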
      
      * lint fix
      
      * refactor
      
      * refactor
      
      * Add GlobalCopyPatternDetector to identify global memory copy patterns
      
      - Introduced a new class, GlobalCopyPatternDetector, to detect specific memory copy patterns in statements (a conceptual sketch follows this list).
      - Enhanced the PipelinePlanner to utilize this detector for determining copy stages based on global and local memory scopes.
      - Improved code clarity and maintainability by encapsulating detection logic within the new class.
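      A conceptual Python analogue of that detection logic (the detector in the PR is a C++ StmtVisitor; the function name and the exact scope rules below are illustrative):

      ```python
      from tvm import tir
      from tvm.tir import stmt_functor

      def looks_like_global_copy(stmt: tir.Stmt) -> bool:
          # Treat a statement as a copy-stage candidate when it loads from a
          # global-scope buffer and stores into a non-global (shared/local) buffer.
          reads_global = False
          writes_nonglobal = False

          def visit(node):
              nonlocal reads_global, writes_nonglobal
              if isinstance(node, tir.BufferLoad) and node.buffer.scope() == "global":
                  reads_global = True
              if isinstance(node, tir.BufferStore) and node.buffer.scope() != "global":
                  writes_nonglobal = True

          stmt_functor.post_order_visit(stmt, visit)
          return reads_global and writes_nonglobal
      ```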
      
      * Refactor copy stage detection logic in pipeline planning
      
      - Simplified the determination of copy stages by directly assigning the result of GlobalCopyPatternDetector to pinfo.copy_stage.
      - Removed redundant checks for read and write scopes, enhancing code clarity and maintainability.
      
      * lint fix
  4. 26 Mar, 2025 1 commit
    • [Refactor] Deprecated `T.Buffer` as arguments and rename related calls into `T.Tensor` (#281) · bf8a6fc1
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation (see the sketch after this list).
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
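      A small illustrative sketch of the case this enables at the language level: filling a shared-scope allocation directly (shapes, dtype, and kernel structure are placeholders):

      ```python
      import tilelang.language as T

      M, N = 128, 128

      @T.prim_func
      def fill_shared_example(Out: T.Tensor((M, N), "float16")):
          with T.Kernel(1, threads=128) as bx:
              buf = T.alloc_shared((M, N), "float16")  # shared-scope buffer
              T.fill(buf, 0)                           # Fill now handles shared / shared.dyn scopes
              T.copy(buf, Out)
      ```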
      
      * [Refactor] Remove deprecated decorator and enhance Cython kernel handling
      
      - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
      - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
      - Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
      - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
      - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.
      
      * [Feature] Add matrix multiplication test and kernel implementation
      
      - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types (see the sketch after this list).
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
      - Minor formatting improvements in `deprecated.py` for better readability.
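      A sketch of a tile-level GEMM kernel in the style the new test exercises (block sizes and dtypes are illustrative, and the pointer-based argument handling that test_tilelang_language_ptr.py actually targets is omitted here):

      ```python
      import tilelang.language as T

      def matmul(M, N, K, block_M=128, block_N=128, block_K=32,
                 dtype="float16", accum_dtype="float"):

          @T.prim_func
          def main(
                  A: T.Tensor((M, K), dtype),
                  B: T.Tensor((K, N), dtype),
                  C: T.Tensor((M, N), dtype),
          ):
              with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
                  A_shared = T.alloc_shared((block_M, block_K), dtype)
                  B_shared = T.alloc_shared((block_K, block_N), dtype)
                  C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
                  T.clear(C_local)
                  for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                      T.copy(A[by * block_M, k * block_K], A_shared)
                      T.copy(B[k * block_K, bx * block_N], B_shared)
                      T.gemm(A_shared, B_shared, C_local)
                  T.copy(C_local, C[by * block_M, bx * block_N])

          return main
      ```

      A run_matmul-style helper would then compile this kernel and validate its output against a torch.matmul reference, as the added test does.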
      
      * lint fix
      
      * [Refactor] Update tensor creation in matrix multiplication test
      
      - Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
      - Updated imports in `__init__.py` to include `make_tensor`.
      - Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.
      
      * [Refactor] Update tensor definitions across multiple files
      
      - Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
      - Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
      - Improved documentation in README and example files to reflect changes in tensor usage.
      
      * lint fix
      
      * [Refactor] Update tensor types in attention and matrix multiplication examples
      
      - Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
      - Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
      - Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.
      
      * lint fix
      
      * [Refactor] Update tensor types in GEMM example and test files
      
      - Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
      - Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.
      
      * [Refactor] Update tensor usage in customize.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the file.
      
      * [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the test file.
      
      * [Refactor] Update tensor types to SharedBuffer and FragmentBuffer
      
      - Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
      - Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.
      
      * [Refactor] Introduce Tensor alias for Buffer in proxy.py
      
      - Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
      - This change enhances clarity and consistency in tensor usage across the codebase.
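      A minimal sketch of the resulting convention: kernel parameters are annotated with T.Tensor (formerly T.Buffer) and are mapped to torch.Tensor at the JIT boundary; the kernel body below is purely illustrative:

      ```python
      import tilelang.language as T

      M, N = 1024, 1024

      @T.prim_func
      def fill_zero(
              # Previously written as: A: T.Buffer((M, N), "float16")
              A: T.Tensor((M, N), "float16"),
      ):
          with T.Kernel(T.ceildiv(M, 128), threads=128) as bx:
              for i, j in T.Parallel(128, N):
                  A[bx * 128 + i, j] = T.float16(0)
      ```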
  5. 18 Mar, 2025 1 commit
    • [Refactor] Refactor for Better Layout Conflict Handling (#240) · 2a286ae6
      Lei Wang authored
      * [Feature] Add reduce_max functionality and corresponding tests
      
      * Introduced a new test file for the reduce_max operation in the tilelang language module.
      * Implemented the reduce_max functionality using T.prim_func, including local memory allocation and result copying (see the sketch after this list).
      * Added tests for various input sizes and data types to ensure correctness of the reduce_max implementation.
      * Enhanced profiling assertions to validate the output against reference implementations.
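      A sketch of the shape such a reduce_max kernel might take (sizes, dtype, and the single-block launch are illustrative; the actual test sweeps several input sizes and data types):

      ```python
      import tilelang.language as T

      M, N = 256, 256

      @T.prim_func
      def reduce_max_kernel(A: T.Tensor((M, N), "float32"), B: T.Tensor((M,), "float32")):
          with T.Kernel(1, threads=128) as bx:
              A_local = T.alloc_fragment((M, N), "float32")  # local allocation
              B_local = T.alloc_fragment((M,), "float32")
              T.copy(A, A_local)
              T.reduce_max(A_local, B_local, dim=1)          # row-wise maximum
              T.copy(B_local, B)                             # copy the result back out
      ```

      The test then checks this output against a torch reference through the profiler's assertion helpers.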
      
      * Fix whitespace issues in reduce_max test file for improved readability
      
      * [Refactor] Update DebugOutput methods to return strings instead of void
      
      * Modified DebugOutput methods in LayoutNode, FragmentNode, and SwizzledLayoutNode to return std::string instead of void, enhancing usability for logging and debugging.
      * Updated corresponding header files to reflect the new return types.
      * Improved layout inference error messages by incorporating DebugOutput for better clarity in layout conflicts.
      
      * lint fix
      
      * Fix typo in matmul function: changed loop from T.Parallel to T.grid for correct parallel execution in webgpu code generation tests.
      
      * [Enhancement] Improve layout inference conflict handling in ParallelOp
      
      * Updated the layout inference logic in ParallelOp to better handle conflicts for local.fragment buffers.
      * Added checks to ensure that layout conflicts are reported only when both source and destination buffers are defined, improving clarity in error messages.
      * Enhanced the overall robustness of the layout inference process by addressing specific cases where conflicts may arise.
      
      * [Feature] Add IsEqual methods for layout comparison
      
      * Introduced IsEqual methods in LayoutNode, FragmentNode, and SwizzledLayoutNode to facilitate structural equality checks, allowing for optional index comparison.
      * Enhanced layout inference logic in Copy and ParallelOp to utilize the new IsEqual methods for better conflict detection in local.fragment layouts.
      * Improved error messages for layout conflicts to provide clearer guidance on potential issues.
      
      * [Refactor] Update profiler usage in benchmark_nsa_fwd.py and improve layout inference in elem.cc and parallel.cc
      
      * Modified the profiler call in benchmark_nsa_fwd.py to streamline latency measurement.
      * Updated layout inference logic in elem.cc and parallel.cc to use const pointers for FragmentNode, enhancing type safety and clarity.
      * Improved error messages in layout conflict checks to provide better guidance on potential issues.
      
      * [Refactor] Clean up pointer formatting in layout inference files
      
      * Standardized pointer formatting for FragmentNode in elem.cc and parallel.cc to improve code readability.
      * Minor adjustments to error message formatting in layout conflict checks for better clarity.
  6. 14 Mar, 2025 1 commit
    • [Enhancement] Avoid tvm ffi handling when out_idx is specified (#209) · 227ed7ec
      Lei Wang authored
      * Optimize CMake build process with dynamic job count calculation
      
      - Modify build_csrc function to use 90% of available CPU cores
      - Ensure at least one job is used during compilation
      - Improve build performance by dynamically adjusting parallel job count
      
      * Optimize build_csrc function with multiprocessing module
      
      - Replace os.cpu_count() with multiprocessing.cpu_count()
      - Maintain existing 90% CPU utilization logic
      - Improve CPU core count calculation for build process
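      A sketch of the described job-count logic (the helper name is illustrative):

      ```python
      import multiprocessing

      def num_build_jobs() -> int:
          # Use roughly 90% of the available cores, but never fewer than one job.
          return max(1, int(multiprocessing.cpu_count() * 0.9))

      # e.g. forwarded to the native build as: make -j {num_build_jobs()}
      ```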
      
      * Add dynamic shape support with out_idx in Cython JIT kernel compilation
      
      - Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
      - Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
      - Add support for resolving dynamic shape dimensions using input tensor references
      - Enhance flexibility of JIT kernel compilation with symbolic shape handling
      
      * Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel
      
      - Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map (see the sketch after this list)
      - Improve debugging by providing context about missing symbolic dimensions
      - Maintain existing dynamic shape resolution logic
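      A hypothetical sketch of the resolution step these changes describe: symbolic output dimensions are looked up in a dynamic_symbolic_map recording which input tensor and axis define them, and a missing symbol now raises a descriptive error (names and structure are assumptions, not the wrapper's actual code):

      ```python
      def resolve_dim(dim, inputs, dynamic_symbolic_map):
          # Static dimensions pass through unchanged.
          if isinstance(dim, int):
              return dim
          # Symbolic dimensions are resolved from the referenced input tensor.
          if dim not in dynamic_symbolic_map:
              raise KeyError(
                  f"Dynamic symbolic dimension {dim!r} not found in dynamic_symbolic_map; "
                  f"known symbols: {sorted(map(str, dynamic_symbolic_map))}")
          tensor_idx, axis = dynamic_symbolic_map[dim]
          return inputs[tensor_idx].shape[axis]
      ```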
      
      * Fix Copy operation handling for scalar and multi-dimensional tensors
      
      - Add special handling for scalar tensor copy operations
      - Enhance error reporting in MakeIndices method with more detailed diagnostic information
      - Improve SIMT loop generation to support zero-dimensional tensors
      - Add explicit check and handling for scalar tensor scenarios
      
      * Refactor Copy operation code formatting and improve readability
      
      - Improve code formatting in MakeIndices and MakeSIMTLoop methods
      - Add line breaks to enhance readability of complex ICHECK statements
      - Simplify code structure in scalar tensor handling
      - Remove unnecessary whitespace and improve code alignment
      
      * Simplify GEMM example with direct kernel compilation
      
      - Update copyright header to Tile-AI Corporation
      - Remove Profiler import and usage
      - Replace tilelang.lower() with tilelang.compile()
      - Simplify kernel execution workflow
      - Update kernel source retrieval method
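      A sketch of the simplified workflow, using a deliberately tiny kernel in place of the GEMM example (kernel body and names are illustrative; out_idx marks the parameter the JIT should allocate and return):

      ```python
      import torch
      import tilelang
      import tilelang.language as T

      M, N = 1024, 1024

      @T.prim_func
      def scale(A: T.Tensor((M, N), "float16"), B: T.Tensor((M, N), "float16")):
          with T.Kernel(T.ceildiv(M, 128), threads=128) as bx:
              for i, j in T.Parallel(128, N):
                  B[bx * 128 + i, j] = A[bx * 128 + i, j] * T.float16(2)

      kernel = tilelang.compile(scale, out_idx=[1])   # replaces the old tilelang.lower() + Profiler flow
      a = torch.randn(M, N, dtype=torch.float16, device="cuda")
      b = kernel(a)                                   # output tensor is allocated and returned directly
      print(kernel.get_kernel_source())               # retrieve the generated kernel source
      ```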
      
      * Enhance block sparse attention implementation
      
      - Update `blocksparse_flashattn` to use 2 stages for improved performance.
      - Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency.
      - Modify condition checks in the kernel to utilize boolean values.
      - Introduce a new example for top-k sparse attention and a benchmark for native sparse attention.
      - Add support for asynchronous copy in PTX and improve pipeline planning with condition handling.
      
      * Refactor and clean up code formatting across multiple files
      
      - Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`.
      - Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`.
      - Updated comments and documentation for clarity in `__init__.py` and `phase.py`.
      - Ensured consistent formatting and style across the codebase.
      
      * Add kernel source printing in example_tilelang_nsa_fwd.py and implement IfThenElse node replacement in inject_pipeline.cc
      
      - Added a print statement to output the kernel source in `example_tilelang_nsa_fwd.py` for debugging purposes.
      - Introduced a new function `replace_if_then_else` in `inject_pipeline.cc` to transform IfThenElse nodes while preserving attributes, enhancing the handling of conditional statements in the pipeline.
      
      * Refactor condition handling in inject_pipeline.cc
      
      - Change the data structure for mapping conditions to statements from a Map to an Array for improved performance and simplicity.
      - Update condition comparison logic to use StructuralEqual for better accuracy.
      - Enhance logging to provide detailed insights into condition changes and statement processing.
      - Adjust final statement construction to utilize the new data structure, ensuring correct handling of conditions and statements.
      
      * Improve logging and formatting in inject_pipeline.cc
      
      - Enhance logging statements for better clarity on condition changes and statement processing.
      - Adjust formatting for improved readability, including line breaks and consistent spacing.
      - Ensure accurate condition comparison and handling in the pipeline logic.
      
      * Refactor logging and clean up inject_pipeline.cc
      
      - Remove excessive logging statements to streamline the code and improve performance.
      - Simplify condition handling by eliminating unnecessary log outputs related to condition changes and statement processing.
      - Maintain the core functionality while enhancing code readability and maintainability.
      
      * Update Dockerfiles to specify exact version of libstdcxx-ng
      
      - Change installation command in multiple Dockerfiles to use `libstdcxx-ng=12` instead of `libstdcxx-ng-12` for consistency and to avoid potential issues with package resolution.
      - Ensure all Dockerfiles from cu118 to cu126 reflect this change for uniformity across builds.
      
      * Refactor and enhance examples and kernel handling
      
      - Adjusted the pipeline stages in `example_blocksparse_gemm.py` from 2 to 1 for improved performance.
      - Added kernel source printing in `benchmark_nsa_fwd.py` for better debugging and profiling insights.
      - Updated tensor allocation and parameter handling in `CtypesKernelAdapter` and `CythonKernelWrapper` to cache parameter dtypes and shapes, improving efficiency and clarity.
      - Enhanced the handling of dynamic shapes in the Cython JIT kernel compilation process.
      - Modified the benchmark script to accommodate new tensor output parameters and improved batch size defaults for testing.
      
      * Update copyright header in Cython wrapper to reflect Tile-AI Corporation
      
      * revert change
  7. 12 Mar, 2025 1 commit
    • [Feature] Support Async Pipeline inference within if scope (#198) · 7ccec53b
      Lei Wang authored
      * Optimize CMake build process with dynamic job count calculation
      
      - Modify build_csrc function to use 90% of available CPU cores
      - Ensure at least one job is used during compilation
      - Improve build performance by dynamically adjusting parallel job count
      
      * Optimize build_csrc function with multiprocessing module
      
      - Replace os.cpu_count() with multiprocessing.cpu_count()
      - Maintain existing 90% CPU utilization logic
      - Improve CPU core count calculation for build process
      
      * Add dynamic shape support with out_idx in Cython JIT kernel compilation
      
      - Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
      - Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
      - Add support for resolving dynamic shape dimensions using input tensor references
      - Enhance flexibility of JIT kernel compilation with symbolic shape handling
      
      * Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel
      
      - Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
      - Improve debugging by providing context about missing symbolic dimensions
      - Maintain existing dynamic shape resolution logic
      
      * Fix Copy operation handling for scalar and multi-dimensional tensors
      
      - Add special handling for scalar tensor copy operations
      - Enhance error reporting in MakeIndices method with more detailed diagnostic information
      - Improve SIMT loop generation to support zero-dimensional tensors
      - Add explicit check and handling for scalar tensor scenarios
      
      * Refactor Copy operation code formatting and improve readability
      
      - Improve code formatting in MakeIndices and MakeSIMTLoop methods
      - Add line breaks to enhance readability of complex ICHECK statements
      - Simplify code structure in scalar tensor handling
      - Remove unnecessary whitespace and improve code alignment
      
      * Simplify GEMM example with direct kernel compilation
      
      - Update copyright header to Tile-AI Corporation
      - Remove Profiler import and usage
      - Replace tilelang.lower() with tilelang.compile()
      - Simplify kernel execution workflow
      - Update kernel source retrieval method
      
      * Enhance block sparse attention implementation
      
      - Update `blocksparse_flashattn` to use 2 stages for improved performance.
      - Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency.
      - Modify condition checks in the kernel to utilize boolean values.
      - Introduce a new example for top-k sparse attention and a benchmark for native sparse attention.
      - Add support for asynchronous copy in PTX and improve pipeline planning with condition handling.
      
      * Refactor and clean up code formatting across multiple files
      
      - Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`.
      - Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`.
      - Updated comments and documentation for clarity in `__init__.py` and `phase.py`.
      - Ensured consistent formatting and style across the codebase.