- 20 Mar, 2025 1 commit
Lei Wang authored
* remove llvm build
* [Refactor] Update kernel compilation and profiling in examples
  - Replaced `tilelang.lower` with `tilelang.compile` in multiple example scripts to streamline kernel compilation.
  - Updated profiling calls to utilize the new `get_profiler` method, enhancing performance measurement consistency.
  - Adjusted assertions and benchmarking methods to align with the new profiling structure across various examples, ensuring correctness and clarity in performance evaluations.
* lint fix
* License Update
* [Refactor] Improve code formatting and documentation in CUDA header and HIP runtime files
  - Adjusted formatting in `cuda.h` for better readability, including alignment of comments and struct fields.
  - Cleaned up whitespace and improved comment clarity in `rt_mod_hip.cc` to enhance code maintainability.
* [Refactor] Enhance formatting and clarity in CUDA header and HIP runtime files
  - Improved comment alignment and readability in `cuda.h`.
  - Cleaned up whitespace and formatting in `rt_mod_hip.cc` to enhance maintainability.
* lint fix
* lint fix
* lint fix
* lint fix
* fix
* License update
* [Enhancement] Update JITKernel to use artifact for kernel source
  - Assigned the generated artifact to `self.artifact` for better management.
  - Updated kernel source references to use `artifact.kernel_source` for consistency in execution backend handling.
* lint fix
* Add @tilelang.testing.requires_llvm decorator to vectorization tests
* Enhance setup.py and env.py for library management
  - Added functionality to remove original files after copying in CMakeBuild.
  - Updated TVM_LIBRARY_PATH in env.py to include the PyPI build library path for better integration.
* Refactor TVM_LIBRARY_PATH assignment for improved readability in env.py
* Refactor CMakeBuild file handling in setup.py
  - Added a check to ensure the target library directory exists before copying .so files.
  - Improved the logic for creating the target directory and copying files to enhance robustness.
* bugfix
* Rename BuildTLDebug to BuildTileLangCUDAWithoutCompile and update registration. Add @tilelang.testing.requires_llvm decorator to multiple tests for LLVM requirement.
* lint fix
* Enhance TileLang code generation by adding support for device code generation without compilation. Updated `host_codegen` and `device_codegen` functions to include new transformations and registration for `tilelang_hip_without_compile`. Refactored JIT kernel adapters to accommodate host and device modules, improving overall integration and flexibility.
* lint fix
* Add support for C target in device code generation
  - Updated `device_codegen_without_compile` to include handling for the C target by registering the `tilelang_cpp` function.
* [Enhancement] Implement auto-clear cache feature based on environment variable
  - Added TILELANG_CLEAR_CACHE environment variable to control cache clearing.
  - Updated CI workflow to set TILELANG_CLEAR_CACHE during testing.
  - Modified cache initialization to clear cache if TILELANG_CLEAR_CACHE is set to true.
* [Refactor] Update kernel invocation and import paths in tests and cache
  - Changed kernel invocation in `test_tilelang_kernel_dequantize_gemm.py` to return the result.
  - Updated import statements in `test_tilelang_kernel_int4_gemm_mma.py` to use `bitblas` instead of `tilelang`.
  - Refactored paths for artifact and parameters in `kernel_cache.py` for better maintainability.
* [Refactor] Clean up whitespace and improve code formatting in kernel_cache.py
  - Removed unnecessary blank lines and adjusted spacing for better readability in the KernelCache class.
  - Enhanced overall code formatting to align with project standards.
* [Enhancement] Add bfloat16 test case and improve kernel caching logic
  - Introduced a new test case for bfloat16 matrix multiplication in `test_tilelang_kernel_gemm_mma_intrinsic.py`.
  - Updated `KernelCache` to handle multiple kernel source files and improve error handling during saving and loading.
  - Refactored `JITKernel` to support instantiation from a database, enhancing flexibility in kernel management.
  - Adjusted `CtypesKernelAdapter` and `CythonKernelAdapter` to utilize the new kernel loading mechanism from the database.
  - Improved code formatting and readability across several files.
* lint fix
* Update bfloat16 matrix multiplication test case to use larger dimensions for improved coverage
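
The auto-clear cache entry in this commit (cache cleared at initialization when TILELANG_CLEAR_CACHE is set to true) could look roughly like the sketch below. Only the variable name comes from the log; the helper names `should_clear_cache` and `init_cache_dir`, the set of values treated as true, and the directory layout are assumptions made for illustration, not TileLang's actual code.

```python
import os
import shutil
import tempfile
from pathlib import Path

# Assumption: which strings count as "true" is not stated in the commit log.
TRUTHY = {"1", "true", "yes", "on"}


def should_clear_cache(env=None):
    """Return True when TILELANG_CLEAR_CACHE holds a truthy value."""
    env = os.environ if env is None else env
    return env.get("TILELANG_CLEAR_CACHE", "").strip().lower() in TRUTHY


def init_cache_dir(cache_dir, env=None):
    """Hypothetical cache initialization: wipe the directory first if requested."""
    cache_dir = Path(cache_dir)
    if should_clear_cache(env) and cache_dir.exists():
        shutil.rmtree(cache_dir)  # drop every cached artifact
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = init_cache_dir(Path(tmp) / "cache")
        (cache / "kernel.cu").write_text("// stale artifact")
        init_cache_dir(cache, env={"TILELANG_CLEAR_CACHE": "true"})
        print(sorted(p.name for p in cache.iterdir()))
```

Passing `env` explicitly keeps the check testable without mutating the real process environment, which is presumably what a CI workflow would set instead.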
- 19 Mar, 2025 1 commit
alex_xiao authored
* [Dev] Add database mechanism to cache
* [Dev] Fix database cache and test for it
* [Dev] Refactor env.py to use TILELANG_CACHE_DIR and remove extra comment.
* [Refactor] Improve code formatting and readability in multiple files
* [Enhancement] Add execution backend options and improve kernel adapter initialization
* [Refactor] Rename cached function to cached_kernel and update related references
* [Enhancement] Enable target and target_host parameters in kernel loading and improve gemm test case
* [Enhancement] Update kernel compilation to specify execution backend as "cython"
* [Refactor] Rename cached_kernel to cached and update references in the codebase
* [Enhancement] Un-comment and add test cases for matrix multiplication correctness; improve kernel caching logic and remove redundant code
* [Refactor] Clean up code formatting and improve readability in cache and adapter modules
* [Refactor] Remove unused imports
* [Refactor] Update cached function signature to use PrimFunc and Optional types for improved type safety
* [Refactor] Update cached function calls to use PrimFunc and improve parameter handling
* [Refactor] Clean up import statements and improve code formatting in cache and kernel test files
* Update tilelang/jit/kernel.py

---------

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
- 09 Mar, 2025 1 commit
Lei Wang authored
* Add kernel caching mechanism to TileLang
  - Implement a new `cached` function in `tilelang/cache/__init__.py` to cache and reuse compiled kernels
  - Expose the `cached` function in the main `tilelang/__init__.py`
  - Add a test case for cached matrix multiplication in `testing/python/cache/test_tilelang_cache_matmul.py`
  - Provide a `clear_cache()` function to reset the kernel cache when needed
* Refactor kernel caching test and implementation
  - Simplify the `cached` function in `tilelang/cache/__init__.py`
  - Update test script `test_tilelang_cache_matmul.py` to use `tilelang.testing.main()`
  - Remove unnecessary whitespace and improve code formatting
* Update import for `cached` function in MHA examples
  - Modify import statement in `example_mha_bwd.py` and `test_tilelang_kernel_mha_bwd.py`
  - Change import from `tilelang.profiler import cached` to `tilelang import cached`
  - Align with recent refactoring of kernel caching mechanism
* Refactor `cached` function signature in kernel caching
  - Update function signature to use keyword-only arguments for `target` and `target_host`
  - Improve parameter order and readability of the `cached` decorator
  - Maintain existing functionality while enhancing function definition
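
The caching idea in this commit (reuse a compiled kernel, take `target` and `target_host` as keyword-only arguments, and offer `clear_cache()`) can be illustrated with a small in-memory sketch. The names `cached`, `clear_cache`, `target`, and `target_host` come from the log; the cache key, the default values, and the `fake_compile` stand-in are assumptions, and this is not the code in `tilelang/cache/__init__.py`.

```python
from typing import Any, Callable, Dict, Optional, Tuple

# Hypothetical in-memory store keyed by (function, target, target_host).
_KERNEL_CACHE: Dict[Tuple[Callable, str, Optional[str]], Any] = {}

# Counts compilations so that cache hits are observable in this sketch.
COMPILE_COUNT = {"n": 0}


def fake_compile(func: Callable, target: str, target_host: Optional[str]) -> str:
    """Stand-in for a real, expensive kernel compilation step."""
    COMPILE_COUNT["n"] += 1
    return f"compiled<{func.__name__}@{target}>"


def cached(func: Callable, *, target: str = "auto",
           target_host: Optional[str] = None) -> Any:
    """Compile `func` once per (func, target, target_host) and reuse the result.

    The bare `*` makes `target` and `target_host` keyword-only.
    """
    key = (func, target, target_host)
    if key not in _KERNEL_CACHE:
        _KERNEL_CACHE[key] = fake_compile(func, target, target_host)
    return _KERNEL_CACHE[key]


def clear_cache() -> None:
    """Forget every cached kernel so the next call recompiles."""
    _KERNEL_CACHE.clear()


def matmul():
    """Placeholder for a kernel definition."""


if __name__ == "__main__":
    cached(matmul, target="cuda")
    cached(matmul, target="cuda")  # served from the cache
    print(COMPILE_COUNT["n"])
```

Keyword-only parameters mean a call such as `cached(matmul, "cuda")` fails with a `TypeError` instead of silently binding the target to the wrong position.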
- 07 Mar, 2025 2 commits
Lei Wang authored
* [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation
  - Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py
  - Modify roller hints generation using new TileLang Carver template and utility functions
  - Update get_roller_hints_from_func to handle None cases and improve return logic
  - Adjust DefaultPolicy to handle different codegen dictionary formats
* [Refactor] Update Thread Binding and Import Statements in TileLang Kernels
  - Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files
  - Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples
  - Move map_torch_type utility function to tilelang.utils.tensor
  - Remove unnecessary imports and improve code organization
* Refactor Native Sparse Attention Example with Enhanced Triton Kernel
  - Update parallel_nsa_fwd_kernel to support more flexible sparse attention computation
  - Add support for block counts and offsets in the Triton kernel
  - Modify kernel grid and computation logic for improved performance
  - Update example script to use naive_nsa_simple reference implementation
  - Improve type hints and kernel configuration
* Add Native Sparse Attention Examples with Tilelang and Triton Implementations
  - Introduce new example scripts for native sparse attention:
    * example_tilelang_nsa_fwd.py: Forward pass implementation using TileLang
    * example_tilelang_nsa_decode.py: Decoding-specific sparse attention implementation
    * example_triton_nsa_fwd.py: Triton-based sparse attention forward pass
  - Update reference.py with naive implementations for sparse attention
  - Support different sparse attention scenarios including forward pass and inference
  - Add comprehensive testing and validation against reference implementations
* lint fix
* Add Variable-Length Native Sparse Attention Examples for TileLang and Triton
  - Introduce new example scripts for variable-length native sparse attention:
    * example_tilelang_nsa_fwd_varlen.py: TileLang implementation with variable sequence lengths
    * example_triton_nsa_fwd_varlen.py: Triton implementation with variable sequence lengths
  - Update reference.py to support variable-length sparse attention scenarios
  - Enhance existing sparse attention implementations to handle variable-length inputs
  - Add comprehensive testing and validation for variable-length sparse attention
* Refactor Native Sparse Attention Examples: Code Style and Formatting Improvements
  - Standardize function and parameter formatting across NSA example files
  - Improve code readability by adjusting indentation and line breaks
  - Enhance type hints and parameter alignment
  - Remove unnecessary whitespaces and optimize imports
  - Maintain consistent code style across TileLang and Triton implementations
Lei Wang authored
* [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation
  - Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py
  - Modify roller hints generation using new TileLang Carver template and utility functions
  - Update get_roller_hints_from_func to handle None cases and improve return logic
  - Adjust DefaultPolicy to handle different codegen dictionary formats
* [Refactor] Update Thread Binding and Import Statements in TileLang Kernels
  - Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files
  - Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples
  - Move map_torch_type utility function to tilelang.utils.tensor
  - Remove unnecessary imports and improve code organization
- 05 Mar, 2025 1 commit
Lei Wang authored
* Fix debug print buffer template for unsigned char type
  - Update debug_print_buffer_value template specialization for unsigned char
  - Modify test_tilelang_debug_print.py to include additional dtype tests
  - Add test case for uint8 dtype in debug print buffer function
* Refactor debug print buffer template formatting for unsigned char
  - Improve code formatting for debug_print_buffer_value template specialization
  - Adjust line breaks and indentation for better readability
  - Maintain consistent code style with other template specializations
* Extract map_torch_type utility function to tilelang.utils.tensor
  - Move map_torch_type function from multiple test files to a centralized location
  - Import map_torch_type from tilelang.utils.tensor in kernel test files
  - Improve code reusability by creating a shared utility function for type mapping
* Add buffer dtype mapping for Cython kernel adapter
  - Introduce buffer_dtype_map in CythonKernelAdapter to track buffer variable dtypes
  - Add _process_buffer_dtype method to extract dtype information from TIR function
  - Update CythonKernelWrapper to support setting and validating buffer dtypes
  - Enhance type checking during kernel execution with dtype verification
  - Improve logging message for Cython JIT adapter compilation
* Add static shape mapping for Cython kernel adapter
  - Introduce static_shape_map in CythonKernelAdapter to track buffer variable static shapes
  - Add _process_static_shape method to extract static shape information from TIR function
  - Update CythonKernelWrapper to support setting and validating static shapes
  - Enhance type checking during kernel execution with static shape verification
* Add Multi-Head Attention (MHA) Backward Pass Test for TileLang Kernel
  - Implement comprehensive test for Multi-Head Attention backward pass
  - Support both causal and non-causal attention scenarios
  - Add reference implementation for comparing kernel outputs
  - Test different batch sizes, head counts, sequence lengths, and head dimensions
  - Verify forward and backward pass correctness using torch.testing.assert_close
* Set random seed for MHA backward pass test
  - Add random seed initialization for consistent test reproducibility
  - Use tilelang.testing.set_random_seed(42) to ensure deterministic test results
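
The dtype and static-shape checks described for the Cython kernel adapter amount to validating each call argument against metadata extracted ahead of time. The sketch below shows one way such a check might work in plain Python. The names `buffer_dtype_map` and `static_shape_map` appear in the log, but their layout here (name to argument index plus expected dtype, and name to argument index plus a list of fixed dimensions) is an assumption, as are `FakeTensor` and `validate_args`.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class FakeTensor:
    """Minimal stand-in for a tensor: only dtype and shape matter here."""
    dtype: str
    shape: Tuple[int, ...]


def validate_args(
    args: Sequence[FakeTensor],
    buffer_dtype_map: Dict[str, Tuple[int, str]],
    static_shape_map: Dict[str, Tuple[int, List[Tuple[int, int]]]],
) -> None:
    """Raise if an argument's dtype or one of its static dimensions mismatches.

    Assumed layouts (hypothetical):
      buffer_dtype_map: name -> (argument index, expected dtype)
      static_shape_map: name -> (argument index, [(dim index, extent), ...])
    Dimensions missing from static_shape_map are treated as dynamic.
    """
    for name, (index, dtype) in buffer_dtype_map.items():
        if args[index].dtype != dtype:
            raise TypeError(
                f"{name}: expected dtype {dtype}, got {args[index].dtype}")
    for name, (index, dims) in static_shape_map.items():
        for dim, extent in dims:
            if args[index].shape[dim] != extent:
                raise ValueError(
                    f"{name}: dim {dim} expected {extent}, "
                    f"got {args[index].shape[dim]}")


if __name__ == "__main__":
    dtype_map = {"A": (0, "float16"), "B": (1, "float16")}
    # A has a dynamic first dimension; only its second dimension is fixed.
    shape_map = {"A": (0, [(1, 64)]), "B": (1, [(0, 64), (1, 128)])}
    validate_args(
        [FakeTensor("float16", (7, 64)), FakeTensor("float16", (64, 128))],
        dtype_map,
        shape_map,
    )
    print("arguments accepted")
```

Checking before launch turns a silent out-of-bounds access or a misread buffer into an early, readable Python exception.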
- 04 Mar, 2025 1 commit
Lei Wang authored
* Change default log level from WARNING to INFO in TileLang initialization
* Refactor Flash Attention Variable-Length MHA Example with Cython Backend Support
  - Update `example_mha_fwd_varlen.py` to use Cython backend for kernel compilation
  - Remove unused imports and simplify function signature
  - Modify `flashattn` function to handle max sequence length as a separate argument
  - Update kernel call to include max sequence length parameter
  - Improve code readability and remove commented-out code
  - Add print statement to confirm successful assertion
* Refactor code formatting in TileLang lowering and example files
  - Improve line breaks and code formatting in `lower.py`, `wrapper.py`, and `tensor.py`
  - Simplify line breaks and reduce unnecessary whitespace
  - Enhance code readability by adjusting indentation and line breaks
  - Update example MHA forward pass script with cleaner tensor initialization
* Update TileLang kernel test with import path changes for MMA layout and macro generator
  - Modify import statements in test_tilelang_kernel_dequantize_gemm.py
  - Replace bitblas imports with tilelang.intrinsics imports for MMA-related utilities
  - Update main function to use tilelang.testing.main()
* Add Block Sparse Attention Examples for TileLang and Triton
  - Implement block sparse attention kernels for both TileLang and Triton
  - Add utility functions for generating sparse attention masks using top-k and threshold methods
  - Support causal and variable-length attention scenarios
  - Include test cases for different sequence length configurations
  - Demonstrate block-level sparse attention with configurable parameters
* Refactor Block Sparse Attention Examples with Code Style Improvements
  - Improve code formatting in block_sparse_attn_tilelang.py and block_sparse_attn_triton.py
  - Enhance readability by adjusting line breaks and indentation
  - Simplify kernel and function calls with better formatting
  - Add whitespace and line break improvements for better code clarity
* Enhance Layout Plotting with Multi-Replication and Dynamic Visualization
  - Update plot_layout function to support multiple replications in thread and value mapping
  - Improve thread and value mapping to handle replicated layouts
  - Dynamically adjust figure size and legend positioning
  - Add print statements for saved plot file paths
  - Modify example fragment_mma_load_a.py to uncomment and enable warp and block layout plotting
* Refactor AtomicAdd functions in CUDA common header
  - Implement a generic template for AtomicAdd function
  - Specialize templates for half_t, bfloat16_t, and pointer types
  - Reorganize and clean up existing AtomicAdd implementations
  - Improve type handling and conversion in atomic operations
* Remove unused import in MHA backward test file
  - Remove unnecessary argparse import from test_tilelang_kenrel_mha_bwd.py
  - Add blank line for improved code formatting
  - Minor code cleanup in test file
- 11 Feb, 2025 1 commit
Yu Cheng authored
* [CI][Test] Add test cases for tilelang transform MultiVersionBuffer and WarpSpecialized
* Relax the mismatch ratio restrictions in the flash_linear_attention and mha tests
* [Dev] Add mha backward example