Commits · f2e99180d54cf24559998281614296ddcfdf3000 · OpenDAS / tilelang

20 Mar, 2025 1 commit

[Refactor] Phase out LLVM Dependency by Making it Optional (#247) · f2e99180

Lei Wang authored Mar 20, 2025

* remove llvm build

* [Refactor] Update kernel compilation and profiling in examples

- Replaced `tilelang.lower` with `tilelang.compile` in multiple example scripts to streamline kernel compilation.
- Updated profiling calls to utilize the new `get_profiler` method, enhancing performance measurement consistency.
- Adjusted assertions and benchmarking methods to align with the new profiling structure across various examples, ensuring correctness and clarity in performance evaluations.

* lint fix

* License Update

* [Refactor] Improve code formatting and documentation in CUDA header and HIP runtime files

- Adjusted formatting in `cuda.h` for better readability, including alignment of comments and struct fields.
- Cleaned up whitespace and improved comment clarity in `rt_mod_hip.cc` to enhance code maintainability.

* [Refactor] Enhance formatting and clarity in CUDA header and HIP runtime files

- Improved comment alignment and readability in `cuda.h`.
- Cleaned up whitespace and formatting in `rt_mod_hip.cc` to enhance maintainability.

* lint fix

* fix

* License update

* [Enhancement] Update JITKernel to use artifact for kernel source

- Assigned the generated artifact to `self.artifact` for better management.
- Updated kernel source references to use `artifact.kernel_source` for consistency in execution backend handling.

* lint fix

* Add @tilelang.testing.requires_llvm decorator to vectorization tests

* Enhance setup.py and env.py for library management

- Added functionality to remove original files after copying in CMakeBuild.
- Updated TVM_LIBRARY_PATH in env.py to include the PyPI build library path for better integration.

* Refactor TVM_LIBRARY_PATH assignment for improved readability in env.py

* Refactor CMakeBuild file handling in setup.py

- Added a check to ensure the target library directory exists before copying .so files.
- Improved the logic for creating the target directory and copying files to enhance robustness.

* bugfix

* Rename BuildTLDebug to BuildTileLangCUDAWithoutCompile and update registration. Add @tilelang.testing.requires_llvm decorator to multiple tests for LLVM requirement.

* lint fix

* Enhance TileLang code generation by adding support for device code generation without compilation. Updated `host_codegen` and `device_codegen` functions to include new transformations and registration for `tilelang_hip_without_compile`. Refactored JIT kernel adapters to accommodate host and device modules, improving overall integration and flexibility.

* lint fix

* Add support for C target in device code generation

- Updated `device_codegen_without_compile` to include handling for the C target by registering the `tilelang_cpp` function.

* [Enhancement] Implement auto-clear cache feature based on environment variable

- Added `TILELANG_CLEAR_CACHE` environment variable to control cache clearing.
- Updated CI workflow to set `TILELANG_CLEAR_CACHE` during testing.
- Modified cache initialization to clear cache if `TILELANG_CLEAR_CACHE` is set to true.
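
The environment-variable check described in this item can be sketched roughly as follows. This is a minimal illustration and not the TileLang implementation: only the variable name `TILELANG_CLEAR_CACHE` comes from the commit message, while `should_clear_cache`, `init_cache_dir`, and the set of accepted truthy strings are hypothetical.

```python
import os
import shutil
from pathlib import Path

# Strings treated as "true". This set is an assumption; the commit message
# only says the cache is cleared when the variable "is set to true".
_TRUTHY = {"1", "true", "yes", "on"}


def should_clear_cache(env=None):
    """Return True when TILELANG_CLEAR_CACHE holds a truthy value."""
    env = os.environ if env is None else env
    return env.get("TILELANG_CLEAR_CACHE", "").strip().lower() in _TRUTHY


def init_cache_dir(cache_dir, env=None):
    """Create the cache directory, wiping it first if clearing is requested."""
    cache_dir = Path(cache_dir)
    if should_clear_cache(env) and cache_dir.exists():
        shutil.rmtree(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir
```

Passing `env` explicitly keeps the sketch testable without mutating the real process environment; a CI job would simply export the variable before running the tests.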

* [Refactor] Update kernel invocation and import paths in tests and cache

- Changed kernel invocation in `test_tilelang_kernel_dequantize_gemm.py` to return the result.
- Updated import statements in `test_tilelang_kernel_int4_gemm_mma.py` to use `bitblas` instead of `tilelang`.
- Refactored paths for artifact and parameters in `kernel_cache.py` for better maintainability.

* [Refactor] Clean up whitespace and improve code formatting in kernel_cache.py

- Removed unnecessary blank lines and adjusted spacing for better readability in the KernelCache class.
- Enhanced overall code formatting to align with project standards.

* [Enhancement] Add bfloat16 test case and improve kernel caching logic

- Introduced a new test case for bfloat16 matrix multiplication in `test_tilelang_kernel_gemm_mma_intrinsic.py`.
- Updated `KernelCache` to handle multiple kernel source files and improve error handling during saving and loading.
- Refactored `JITKernel` to support instantiation from a database, enhancing flexibility in kernel management.
- Adjusted `CtypesKernelAdapter` and `CythonKernelAdapter` to utilize the new kernel loading mechanism from the database.
- Improved code formatting and readability across several files.

* lint fix

* Update bfloat16 matrix multiplication test case to use larger dimensions for improved coverage


19 Mar, 2025 1 commit

[Feature] Add database storage for JITKernel cache with Cython and Ctypes adapters (#213) · e789808b

alex_xiao authored Mar 19, 2025



* [Dev] Add database mechanism to cache

* [Dev] Fix database cache and test for it

* [Dev] Refactor env.py to use TILELANG_CACHE_DIR and remove extra comment.

* [Refactor] Improve code formatting and readability in multiple files

* [Enhancement] Add execution backend options and improve kernel adapter initialization

* [Refactor] Rename cached function to cached_kernel and update related references

* [Enhancement] Enable target and target_host parameters in kernel loading and improve gemm test case

* [Enhancement] Update kernel compilation to specify execution backend as "cython"

* [Refactor] Rename cached_kernel to cached and update references in the codebase

* [Enhancement] Un-comment and add test cases for matrix multiplication correctness; improve kernel caching logic and remove redundant code

* [Refactor] Clean up code formatting and improve readability in cache and adapter modules

* [Refactor] Remove unused imports

* [Refactor] Update cached function signature to use PrimFunc and Optional types for improved type safety

* [Refactor] Update cached function calls to use PrimFunc and improve parameter handling

* [Refactor] Clean up import statements and improve code formatting in cache and kernel test files

* Update tilelang/jit/kernel.py

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>


09 Mar, 2025 1 commit

[Feat] Introduce new caching mechanism for compiled kernels (#176) · 7bde63d5

Lei Wang authored Mar 09, 2025

* Add kernel caching mechanism to TileLang

- Implement a new `cached` function in `tilelang/cache/__init__.py` to cache and reuse compiled kernels
- Expose the `cached` function in the main `tilelang/__init__.py`
- Add a test case for cached matrix multiplication in `testing/python/cache/test_tilelang_cache_matmul.py`
- Provide a `clear_cache()` function to reset the kernel cache when needed
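
The idea of caching and reusing compiled kernels, with a way to reset the cache, can be illustrated with an in-memory sketch like the one below. It is not the code in `tilelang/cache/__init__.py`: `cached_sketch`, its signature, and the tuple-based cache key are assumptions made for illustration, and the compile step is passed in as an ordinary callable.

```python
from typing import Any, Callable, Dict, Hashable, Tuple

# Process-wide store mapping a cache key to an already compiled kernel.
_KERNEL_CACHE: Dict[Tuple[Hashable, ...], Any] = {}


def cached_sketch(compile_fn: Callable[..., Any], *key: Hashable) -> Any:
    """Compile on the first request for `key`, then reuse the stored result."""
    if key not in _KERNEL_CACHE:
        _KERNEL_CACHE[key] = compile_fn(*key)
    return _KERNEL_CACHE[key]


def clear_cache() -> None:
    """Drop every cached kernel so the next request recompiles."""
    _KERNEL_CACHE.clear()
```

A caller would see the compile function run once per distinct key, and once more after `clear_cache()` is called.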

* Refactor kernel caching test and implementation

- Simplify the `cached` function in `tilelang/cache/__init__.py`
- Update test script `test_tilelang_cache_matmul.py` to use `tilelang.testing.main()`
- Remove unnecessary whitespace and improve code formatting

* Update import for `cached` function in MHA examples

- Modify import statement in `example_mha_bwd.py` and `test_tilelang_kernel_mha_bwd.py`
- Change the import from `from tilelang.profiler import cached` to `from tilelang import cached`
- Align with recent refactoring of kernel caching mechanism

* Refactor `cached` function signature in kernel caching

- Update function signature to use keyword-only arguments for `target` and `target_host`
- Improve parameter order and readability of the `cached` decorator
- Maintain existing functionality while enhancing function definition
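
For readers unfamiliar with keyword-only arguments, the following sketch shows what such a signature looks like in Python. Only the parameter names `target` and `target_host` are taken from the commit message; the function name `cached_signature_sketch`, the other parameters, and all default values are hypothetical.

```python
from typing import Any, Dict, Optional


def cached_signature_sketch(
    func: Any,
    out_idx: Optional[list] = None,
    *,  # everything after the bare star must be passed by name
    target: str = "auto",
    target_host: Optional[str] = None,
) -> Dict[str, Any]:
    """Echo the arguments back so the binding rules are easy to inspect."""
    return {
        "func": func,
        "out_idx": out_idx,
        "target": target,
        "target_host": target_host,
    }
```

With this shape, `cached_signature_sketch(f, [2], target="cuda")` is accepted, while passing `"cuda"` as a third positional argument raises `TypeError`, which prevents the two target arguments from being swapped silently.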


07 Mar, 2025 2 commits

[Example] Implement tilelang native sparse attention varlen example (#170) · 8e1845d2

Lei Wang authored Mar 08, 2025

* [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation

- Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py
- Modify roller hints generation using new TileLang Carver template and utility functions
- Update get_roller_hints_from_func to handle None cases and improve return logic
- Adjust DefaultPolicy to handle different codegen dictionary formats

* [Refactor] Update Thread Binding and Import Statements in TileLang Kernels

- Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files
- Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples
- Move map_torch_type utility function to tilelang.utils.tensor
- Remove unnecessary imports and improve code organization

* Refactor Native Sparse Attention Example with Enhanced Triton Kernel

- Update parallel_nsa_fwd_kernel to support more flexible sparse attention computation
- Add support for block counts and offsets in the Triton kernel
- Modify kernel grid and computation logic for improved performance
- Update example script to use naive_nsa_simple reference implementation
- Improve type hints and kernel configuration

* Add Native Sparse Attention Examples with Tilelang and Triton Implementations

- Introduce new example scripts for native sparse attention:
  * example_tilelang_nsa_fwd.py: Forward pass implementation using TileLang
  * example_tilelang_nsa_decode.py: Decoding-specific sparse attention implementation
  * example_triton_nsa_fwd.py: Triton-based sparse attention forward pass
- Update reference.py with naive implementations for sparse attention
- Support different sparse attention scenarios including forward pass and inference
- Add comprehensive testing and validation against reference implementations

* lint fix

* Add Variable-Length Native Sparse Attention Examples for TileLang and Triton

- Introduce new example scripts for variable-length native sparse attention:
  * example_tilelang_nsa_fwd_varlen.py: TileLang implementation with variable sequence lengths
  * example_triton_nsa_fwd_varlen.py: Triton implementation with variable sequence lengths
- Update reference.py to support variable-length sparse attention scenarios
- Enhance existing sparse attention implementations to handle variable-length inputs
- Add comprehensive testing and validation for variable-length sparse attention

* Refactor Native Sparse Attention Examples: Code Style and Formatting Improvements

- Standardize function and parameter formatting across NSA example files
- Improve code readability by adjusting indentation and line breaks
- Enhance type hints and parameter alignment
- Remove unnecessary whitespace and optimize imports
- Maintain consistent code style across TileLang and Triton implementations


[Refactor] Replace `T.thread_binding` with `T.get_thread_binding` in examples and test cases (#163) · de1ba1e4

Lei Wang authored Mar 07, 2025

* [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation

- Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py
- Modify roller hints generation using new TileLang Carver template and utility functions
- Update get_roller_hints_from_func to handle None cases and improve return logic
- Adjust DefaultPolicy to handle different codegen dictionary formats

* [Refactor] Update Thread Binding and Import Statements in TileLang Kernels

- Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files
- Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples
- Move map_torch_type utility function to tilelang.utils.tensor
- Remove unnecessary imports and improve code organization


05 Mar, 2025 1 commit

[Enhancement] Enable runtime tensor data type validation (#146) · d0434c3e

Lei Wang authored Mar 05, 2025

* Fix debug print buffer template for unsigned char type

- Update debug_print_buffer_value template specialization for unsigned char
- Modify test_tilelang_debug_print.py to include additional dtype tests
- Add test case for uint8 dtype in debug print buffer function

* Refactor debug print buffer template formatting for unsigned char

- Improve code formatting for debug_print_buffer_value template specialization
- Adjust line breaks and indentation for better readability
- Maintain consistent code style with other template specializations

* Extract map_torch_type utility function to tilelang.utils.tensor

- Move map_torch_type function from multiple test files to a centralized location
- Import map_torch_type from tilelang.utils.tensor in kernel test files
- Improve code reusability by creating a shared utility function for type mapping

* Add buffer dtype mapping for Cython kernel adapter

- Introduce buffer_dtype_map in CythonKernelAdapter to track buffer variable dtypes
- Add _process_buffer_dtype method to extract dtype information from TIR function
- Update CythonKernelWrapper to support setting and validating buffer dtypes
- Enhance type checking during kernel execution with dtype verification
- Improve logging message for Cython JIT adapter compilation

* Add static shape mapping for Cython kernel adapter

- Introduce static_shape_map in CythonKernelAdapter to track buffer variable static shapes
- Add _process_static_shape method to extract static shape information from TIR function
- Update CythonKernelWrapper to support setting and validating static shapes
- Enhance type checking during kernel execution with static shape verification
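
The dtype and static-shape checks described in the two items above can be pictured with the following sketch. It is an illustration under stated assumptions, not the `CythonKernelWrapper` code: `FakeTensor` and `validate_args` are hypothetical, and the layout of the two maps (argument index to dtype, and argument index to a `{dimension: extent}` dictionary) is assumed rather than taken from the source.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass
class FakeTensor:
    """Stand-in for a framework tensor: only dtype and shape matter here."""
    dtype: str
    shape: Tuple[int, ...]


def validate_args(
    tensors: Sequence[FakeTensor],
    buffer_dtype_map: Dict[int, str],
    static_shape_map: Dict[int, Dict[int, int]],
) -> None:
    """Raise ValueError on the first dtype or static-shape mismatch."""
    for index, tensor in enumerate(tensors):
        expected_dtype = buffer_dtype_map.get(index)
        if expected_dtype is not None and tensor.dtype != expected_dtype:
            raise ValueError(
                f"argument {index}: expected dtype {expected_dtype}, got {tensor.dtype}"
            )
        # Dynamic dimensions are simply absent from the static shape map.
        for dim, extent in static_shape_map.get(index, {}).items():
            if dim >= len(tensor.shape) or tensor.shape[dim] != extent:
                raise ValueError(
                    f"argument {index}: expected extent {extent} at dim {dim}, "
                    f"got shape {tensor.shape}"
                )
```

Checking before launch turns a silent misinterpretation of memory into an early, readable error, at the cost of a small amount of per-call overhead.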

* Add Multi-Head Attention (MHA) Backward Pass Test for TileLang Kernel

- Implement comprehensive test for Multi-Head Attention backward pass
- Support both causal and non-causal attention scenarios
- Add reference implementation for comparing kernel outputs
- Test different batch sizes, head counts, sequence lengths, and head dimensions
- Verify forward and backward pass correctness using torch.testing.assert_close

* Set random seed for MHA backward pass test

- Add random seed initialization for consistent test reproducibility
- Use tilelang.testing.set_random_seed(42) to ensure deterministic test results
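
Why a fixed seed makes the test reproducible can be shown with the standard library alone. The helper below is a hypothetical stand-in named `set_random_seed_sketch`; the real `tilelang.testing.set_random_seed` presumably also seeds the numerical libraries the tests use, which is an assumption and not shown here.

```python
import random


def set_random_seed_sketch(seed: int = 42) -> None:
    """Seed Python's global random generator (stdlib only in this sketch)."""
    random.seed(seed)


def sample_inputs(count: int = 3) -> list:
    """Draw a few pseudo-random values, as a test might for its inputs."""
    return [random.random() for _ in range(count)]
```

Calling `set_random_seed_sketch(42)` before each `sample_inputs()` call yields the same values every time, so a failing comparison can be replayed exactly.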


04 Mar, 2025 1 commit

[Bugfix] Add missing definition for AtomicAdd (#138) · 3960d3d0

Lei Wang authored Mar 04, 2025

* Change default log level from WARNING to INFO in TileLang initialization

* Refactor Flash Attention Variable-Length MHA Example with Cython Backend Support

- Update `example_mha_fwd_varlen.py` to use Cython backend for kernel compilation
- Remove unused imports and simplify function signature
- Modify `flashattn` function to handle max sequence length as a separate argument
- Update kernel call to include max sequence length parameter
- Improve code readability and remove commented-out code
- Add print statement to confirm successful assertion

* Refactor code formatting in TileLang lowering and example files

- Improve line breaks and code formatting in `lower.py`, `wrapper.py`, and `tensor.py`
- Simplify line breaks and reduce unnecessary whitespace
- Enhance code readability by adjusting indentation and line breaks
- Update example MHA forward pass script with cleaner tensor initialization

* Update TileLang kernel test with import path changes for MMA layout and macro generator

- Modify import statements in test_tilelang_kernel_dequantize_gemm.py
- Replace bitblas imports with tilelang.intrinsics imports for MMA-related utilities
- Update main function to use tilelang.testing.main()

* Add Block Sparse Attention Examples for TileLang and Triton

- Implement block sparse attention kernels for both TileLang and Triton
- Add utility functions for generating sparse attention masks using top-k and threshold methods
- Support causal and variable-length attention scenarios
- Include test cases for different sequence length configurations
- Demonstrate block-level sparse attention with configurable parameters
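
The two mask-generation strategies named above, top-k and threshold, can be sketched in plain Python as follows. This is a simplified illustration that works on nested lists of per-block scores; the function names and the score layout (one row per query block, one column per key block) are assumptions, and the example files themselves may differ.

```python
from typing import List


def topk_block_mask(scores: List[List[float]], k: int) -> List[List[bool]]:
    """Keep the k highest-scoring key blocks for each query block."""
    mask = []
    for row in scores:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        keep = set(order[:k])
        mask.append([j in keep for j in range(len(row))])
    return mask


def threshold_block_mask(scores: List[List[float]], threshold: float) -> List[List[bool]]:
    """Keep every key block whose score reaches the threshold."""
    return [[score >= threshold for score in row] for row in scores]
```

Top-k gives every query block the same number of attended key blocks, which keeps the work per block uniform; thresholding lets the sparsity vary with the scores.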

* Refactor Block Sparse Attention Examples with Code Style Improvements

- Improve code formatting in block_sparse_attn_tilelang.py and block_sparse_attn_triton.py
- Enhance readability by adjusting line breaks and indentation
- Simplify kernel and function calls with better formatting
- Add whitespace and line break improvements for better code clarity

* Enhance Layout Plotting with Multi-Replication and Dynamic Visualization

- Update plot_layout function to support multiple replications in thread and value mapping
- Improve thread and value mapping to handle replicated layouts
- Dynamically adjust figure size and legend positioning
- Add print statements for saved plot file paths
- Modify example fragment_mma_load_a.py to uncomment and enable warp and block layout plotting

* Refactor AtomicAdd functions in CUDA common header

- Implement a generic template for AtomicAdd function
- Specialize templates for half_t, bfloat16_t, and pointer types
- Reorganize and clean up existing AtomicAdd implementations
- Improve type handling and conversion in atomic operations

* Remove unused import in MHA backward test file

- Remove unnecessary argparse import from test_tilelang_kernel_mha_bwd.py
- Add blank line for improved code formatting
- Minor code cleanup in test file


11 Feb, 2025 1 commit

[Dev] Add mha backward example (#77) · a6fe61e2

Yu Cheng authored Feb 12, 2025

* [CI][Test] Add test cases for tilelang transform MultiVersionBuffer and WarpSpecialized

* Relax the mismatch ratio restrictions in the flash_linear_attention and mha tests

* [Dev] Add mha backward example
