- 15 Mar, 2025 2 commits
-
-
Lei Wang authored
- Modified `format.sh` to first check for the upstream repository before falling back to the local `origin/main` branch for determining the merge base. - Improved logic to handle cases where the upstream repository is not available, ensuring robust file formatting across branches.
-
Lei Wang authored
* [Enhancement] Improve device handling in Cython kernel adapter and wrapper - Updated `CythonKernelAdapter` to support dynamic device assignment based on target type (CUDA, HIP, or CPU). - Enhanced `CythonKernelWrapper` to include device management, ensuring tensors are allocated on the correct device. - Added error handling for unsupported target types to improve robustness. * [Enhancement] Add buffer device mapping in Cython kernel adapter and wrapper - Introduced `buffer_device_map` in `CythonKernelAdapter` to associate buffer variables with their respective devices. - Updated `CythonKernelWrapper` to utilize the new buffer device mapping for device checks during tensor allocation. - Enhanced error handling for device mismatches to ensure tensors are allocated on the correct device, improving robustness and flexibility in device management.
-
- 14 Mar, 2025 6 commits
-
-
Lei Wang authored
* [Feature] Add reshape and view functionalities to tilelang language module - Introduced new test files for reshape and view operations in the tilelang language. - Implemented reshape and view functions in the customize module, enhancing buffer manipulation capabilities. - Updated the language initialization to include the new functionalities. - Removed unnecessary import from test_tilelang_language_clamp.py for cleaner code. * Update copyright to Tile-AI Corporation * [Refactor] Clean up whitespace in test files for reshape and view functionalities - Removed unnecessary blank lines in `test_tilelang_language_reshape.py` and `test_tilelang_language_view.py` for improved readability. - Ensured consistent formatting across test files to enhance code clarity.
-
Yu Cheng authored
* [Dev] Implement IfStmtBinding and MergeIfStmt transformations - Add IfStmtBinding to bind If statements to each statement in SeqStmt, enhancing the handling of conditional statements. - Introduce MergeIfStmt to merge consecutive If statements within SeqStmt, optimizing the structure of conditional logic. - Update phase.py to apply IfStmtBinding and MergeIfStmt transformations for the "sm_90" target. - Enhance __init__.py with new functions for IfStmtBinding and MergeIfStmt, providing a clear interface for these transformations. * Update license header in if_stmt_binding.cc * Update license header in merge_if_stmt.cc --------- Co-authored-by:Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Yuxuan Hu authored
-
Lei Wang authored
* Optimize CMake build process with dynamic job count calculation - Modify build_csrc function to use 90% of available CPU cores - Ensure at least one job is used during compilation - Improve build performance by dynamically adjusting parallel job count * Optimize build_csrc function with multiprocessing module - Replace os.cpu_count() with multiprocessing.cpu_count() - Maintain existing 90% CPU utilization logic - Improve CPU core count calculation for build process * Add dynamic shape support with out_idx in Cython JIT kernel compilation - Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py - Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation - Add support for resolving dynamic shape dimensions using input tensor references - Enhance flexibility of JIT kernel compilation with symbolic shape handling * Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel - Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map - Improve debugging by providing context about missing symbolic dimensions - Maintain existing dynamic shape resolution logic * Fix Copy operation handling for scalar and multi-dimensional tensors - Add special handling for scalar tensor copy operations - Enhance error reporting in MakeIndices method with more detailed diagnostic information - Improve SIMT loop generation to support zero-dimensional tensors - Add explicit check and handling for scalar tensor scenarios * Refactor Copy operation code formatting and improve readability - Improve code formatting in MakeIndices and MakeSIMTLoop methods - Add line breaks to enhance readability of complex ICHECK statements - Simplify code structure in scalar tensor handling - Remove unnecessary whitespace and improve code alignment * Simplify GEMM example with direct kernel compilation - Update copyright header to Tile-AI Corporation - Remove Profiler import and usage - Replace tilelang.lower() with tilelang.compile() - Simplify kernel execution workflow - Update kernel source retrieval method * Enhance block sparse attention implementation - Update `blocksparse_flashattn` to use 2 stages for improved performance. - Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency. - Modify condition checks in the kernel to utilize boolean values. - Introduce a new example for top-k sparse attention and a benchmark for native sparse attention. - Add support for asynchronous copy in PTX and improve pipeline planning with condition handling. * Refactor and clean up code formatting across multiple files - Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`. - Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`. - Updated comments and documentation for clarity in `__init__.py` and `phase.py`. - Ensured consistent formatting and style across the codebase. * Add kernel source printing in example_tilelang_nsa_fwd.py and implement IfThenElse node replacement in inject_pipeline.cc - Added a print statement to output the kernel source in `example_tilelang_nsa_fwd.py` for debugging purposes. - Introduced a new function `replace_if_then_else` in `inject_pipeline.cc` to transform IfThenElse nodes while preserving attributes, enhancing the handling of conditional statements in the pipeline. * Refactor condition handling in inject_pipeline.cc - Change the data structure for mapping conditions to statements from a Map to an Array for improved performance and simplicity. - Update condition comparison logic to use StructuralEqual for better accuracy. - Enhance logging to provide detailed insights into condition changes and statement processing. - Adjust final statement construction to utilize the new data structure, ensuring correct handling of conditions and statements. * Improve logging and formatting in inject_pipeline.cc - Enhance logging statements for better clarity on condition changes and statement processing. - Adjust formatting for improved readability, including line breaks and consistent spacing. - Ensure accurate condition comparison and handling in the pipeline logic. * Refactor logging and clean up inject_pipeline.cc - Remove excessive logging statements to streamline the code and improve performance. - Simplify condition handling by eliminating unnecessary log outputs related to condition changes and statement processing. - Maintain the core functionality while enhancing code readability and maintainability. * Update Dockerfiles to specify exact version of libstdcxx-ng - Change installation command in multiple Dockerfiles to use `libstdcxx-ng=12` instead of `libstdcxx-ng-12` for consistency and to avoid potential issues with package resolution. - Ensure all Dockerfiles from cu118 to cu126 reflect this change for uniformity across builds. * Refactor and enhance examples and kernel handling - Adjusted the pipeline stages in `example_blocksparse_gemm.py` from 2 to 1 for improved performance. - Added kernel source printing in `benchmark_nsa_fwd.py` for better debugging and profiling insights. - Updated tensor allocation and parameter handling in `CtypesKernelAdapter` and `CythonKernelWrapper` to cache parameter dtypes and shapes, improving efficiency and clarity. - Enhanced the handling of dynamic shapes in the Cython JIT kernel compilation process. - Modified the benchmark script to accommodate new tensor output parameters and improved batch size defaults for testing. * Update copyright header in Cython wrapper to reflect Tile-AI Corporation * revert change
-
Chenghua authored
* [Example] Modify tuning configurations for FlashAttention example * [Examples] formatting example_gqa_fwd_bshd.py
-
Lei Wang authored
* Enhance error message for constant size stack allocation in CUDA codegen. Include the actual constant size and buffer variable name in the error output for better debugging. * Refactor GEMM and Bulk Copy operations to enhance layout handling and support for Hopper architecture - Update `ComputeWarpPartition` to include a new parameter for Hopper WGMMA support. - Modify layout checks in `LowerBulkCopy` to accommodate new GEMM layout types. - Enhance layout inference logic in `InferLayout` for better compatibility with Hopper architecture. - Include necessary header files for built-in operations and layout inference improvements. * lint fix * Remove unused builtin.h include directive * Update include path for builtin.h
-
- 20 Jul, 2025 1 commit
-
-
Lei Wang authored
-
- 13 Mar, 2025 5 commits
-
-
Yu Cheng authored
- Introduce `example_gqa_bwd.py` demonstrating the backward pass of FlashAttention with pipelined execution. - Implement forward and backward functions for FlashAttention, including preprocessing and postprocessing steps. - Enhance argument parsing for batch size, heads, context size, and dimensions. - Include a reference implementation for validation and performance benchmarking.
-
Lei Wang authored
* Optimize CMake build process with dynamic job count calculation - Modify build_csrc function to use 90% of available CPU cores - Ensure at least one job is used during compilation - Improve build performance by dynamically adjusting parallel job count * Optimize build_csrc function with multiprocessing module - Replace os.cpu_count() with multiprocessing.cpu_count() - Maintain existing 90% CPU utilization logic - Improve CPU core count calculation for build process * Add dynamic shape support with out_idx in Cython JIT kernel compilation - Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py - Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation - Add support for resolving dynamic shape dimensions using input tensor references - Enhance flexibility of JIT kernel compilation with symbolic shape handling * Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel - Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map - Improve debugging by providing context about missing symbolic dimensions - Maintain existing dynamic shape resolution logic * Fix Copy operation handling for scalar and multi-dimensional tensors - Add special handling for scalar tensor copy operations - Enhance error reporting in MakeIndices method with more detailed diagnostic information - Improve SIMT loop generation to support zero-dimensional tensors - Add explicit check and handling for scalar tensor scenarios * Refactor Copy operation code formatting and improve readability - Improve code formatting in MakeIndices and MakeSIMTLoop methods - Add line breaks to enhance readability of complex ICHECK statements - Simplify code structure in scalar tensor handling - Remove unnecessary whitespace and improve code alignment * Simplify GEMM example with direct kernel compilation - Update copyright header to Tile-AI Corporation - Remove Profiler import and usage - Replace tilelang.lower() with tilelang.compile() - Simplify kernel execution workflow - Update kernel source retrieval method * Enhance block sparse attention implementation - Update `blocksparse_flashattn` to use 2 stages for improved performance. - Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency. - Modify condition checks in the kernel to utilize boolean values. - Introduce a new example for top-k sparse attention and a benchmark for native sparse attention. - Add support for asynchronous copy in PTX and improve pipeline planning with condition handling. * Refactor and clean up code formatting across multiple files - Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`. - Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`. - Updated comments and documentation for clarity in `__init__.py` and `phase.py`. - Ensured consistent formatting and style across the codebase. * Add kernel source printing in example_tilelang_nsa_fwd.py and implement IfThenElse node replacement in inject_pipeline.cc - Added a print statement to output the kernel source in `example_tilelang_nsa_fwd.py` for debugging purposes. - Introduced a new function `replace_if_then_else` in `inject_pipeline.cc` to transform IfThenElse nodes while preserving attributes, enhancing the handling of conditional statements in the pipeline. * Refactor condition handling in inject_pipeline.cc - Change the data structure for mapping conditions to statements from a Map to an Array for improved performance and simplicity. - Update condition comparison logic to use StructuralEqual for better accuracy. - Enhance logging to provide detailed insights into condition changes and statement processing. - Adjust final statement construction to utilize the new data structure, ensuring correct handling of conditions and statements. * Improve logging and formatting in inject_pipeline.cc - Enhance logging statements for better clarity on condition changes and statement processing. - Adjust formatting for improved readability, including line breaks and consistent spacing. - Ensure accurate condition comparison and handling in the pipeline logic. * Refactor logging and clean up inject_pipeline.cc - Remove excessive logging statements to streamline the code and improve performance. - Simplify condition handling by eliminating unnecessary log outputs related to condition changes and statement processing. - Maintain the core functionality while enhancing code readability and maintainability. * Update Dockerfiles to specify exact version of libstdcxx-ng - Change installation command in multiple Dockerfiles to use `libstdcxx-ng=12` instead of `libstdcxx-ng-12` for consistency and to avoid potential issues with package resolution. - Ensure all Dockerfiles from cu118 to cu126 reflect this change for uniformity across builds.
-
zqh-wz authored
* upgrade cutlass to upstream v3.8.0 * Implement fp8 gemm and add example script * Fix dtype retrieval with map_torch_type for fp8 inputs * Disable vectorization of fp8 values * Make MMA declaration compatible with cutlass 3.4.0+ * Add test for fp8 T.gemm * fix indent * fix indent * Add copyright and license header * Add copyright and license header * lint fix * Refactor matmul_nt and assert_matmul_correctness functions for improved readability by consolidating parameter definitions and adjusting formatting. * clang format lint --------- Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by:
LeiWang1999 <leiwang1999@outlook.com>
-
Lei Wang authored
* Optimize CMake build process with dynamic job count calculation - Modify build_csrc function to use 90% of available CPU cores - Ensure at least one job is used during compilation - Improve build performance by dynamically adjusting parallel job count * Optimize build_csrc function with multiprocessing module - Replace os.cpu_count() with multiprocessing.cpu_count() - Maintain existing 90% CPU utilization logic - Improve CPU core count calculation for build process * Add dynamic shape support with out_idx in Cython JIT kernel compilation - Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py - Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation - Add support for resolving dynamic shape dimensions using input tensor references - Enhance flexibility of JIT kernel compilation with symbolic shape handling * Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel - Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map - Improve debugging by providing context about missing symbolic dimensions - Maintain existing dynamic shape resolution logic * Fix Copy operation handling for scalar and multi-dimensional tensors - Add special handling for scalar tensor copy operations - Enhance error reporting in MakeIndices method with more detailed diagnostic information - Improve SIMT loop generation to support zero-dimensional tensors - Add explicit check and handling for scalar tensor scenarios * Refactor Copy operation code formatting and improve readability - Improve code formatting in MakeIndices and MakeSIMTLoop methods - Add line breaks to enhance readability of complex ICHECK statements - Simplify code structure in scalar tensor handling - Remove unnecessary whitespace and improve code alignment * Simplify GEMM example with direct kernel compilation - Update copyright header to Tile-AI Corporation - Remove Profiler import and usage - Replace tilelang.lower() with tilelang.compile() - Simplify kernel execution workflow - Update kernel source retrieval method * Enhance block sparse attention implementation - Update `blocksparse_flashattn` to use 2 stages for improved performance. - Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency. - Modify condition checks in the kernel to utilize boolean values. - Introduce a new example for top-k sparse attention and a benchmark for native sparse attention. - Add support for asynchronous copy in PTX and improve pipeline planning with condition handling. * Refactor and clean up code formatting across multiple files - Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`. - Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`. - Updated comments and documentation for clarity in `__init__.py` and `phase.py`. - Ensured consistent formatting and style across the codebase. * Add kernel source printing in example_tilelang_nsa_fwd.py and implement IfThenElse node replacement in inject_pipeline.cc - Added a print statement to output the kernel source in `example_tilelang_nsa_fwd.py` for debugging purposes. - Introduced a new function `replace_if_then_else` in `inject_pipeline.cc` to transform IfThenElse nodes while preserving attributes, enhancing the handling of conditional statements in the pipeline. * Refactor condition handling in inject_pipeline.cc - Change the data structure for mapping conditions to statements from a Map to an Array for improved performance and simplicity. - Update condition comparison logic to use StructuralEqual for better accuracy. - Enhance logging to provide detailed insights into condition changes and statement processing. - Adjust final statement construction to utilize the new data structure, ensuring correct handling of conditions and statements. * Improve logging and formatting in inject_pipeline.cc - Enhance logging statements for better clarity on condition changes and statement processing. - Adjust formatting for improved readability, including line breaks and consistent spacing. - Ensure accurate condition comparison and handling in the pipeline logic. * Refactor logging and clean up inject_pipeline.cc - Remove excessive logging statements to streamline the code and improve performance. - Simplify condition handling by eliminating unnecessary log outputs related to condition changes and statement processing. - Maintain the core functionality while enhancing code readability and maintainability.
-
Yu Cheng authored
- Introduce `example_gqa_fwd_bshd_wgmma_pipelined.py` demonstrating a pipelined implementation of FlashAttention. - Update sequence length parameter in existing example to 8192 and adjust number of stages for improved performance. - Enhance argument parsing to accommodate new configurations for batch size, heads, and groups.
-
- 12 Mar, 2025 9 commits
-
-
Lei Wang authored
* Optimize CMake build process with dynamic job count calculation - Modify build_csrc function to use 90% of available CPU cores - Ensure at least one job is used during compilation - Improve build performance by dynamically adjusting parallel job count * Optimize build_csrc function with multiprocessing module - Replace os.cpu_count() with multiprocessing.cpu_count() - Maintain existing 90% CPU utilization logic - Improve CPU core count calculation for build process * Add dynamic shape support with out_idx in Cython JIT kernel compilation - Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py - Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation - Add support for resolving dynamic shape dimensions using input tensor references - Enhance flexibility of JIT kernel compilation with symbolic shape handling * Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel - Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map - Improve debugging by providing context about missing symbolic dimensions - Maintain existing dynamic shape resolution logic * Fix Copy operation handling for scalar and multi-dimensional tensors - Add special handling for scalar tensor copy operations - Enhance error reporting in MakeIndices method with more detailed diagnostic information - Improve SIMT loop generation to support zero-dimensional tensors - Add explicit check and handling for scalar tensor scenarios * Refactor Copy operation code formatting and improve readability - Improve code formatting in MakeIndices and MakeSIMTLoop methods - Add line breaks to enhance readability of complex ICHECK statements - Simplify code structure in scalar tensor handling - Remove unnecessary whitespace and improve code alignment * Simplify GEMM example with direct kernel compilation - Update copyright header to Tile-AI Corporation - Remove Profiler import and usage - Replace tilelang.lower() with tilelang.compile() - Simplify kernel execution workflow - Update kernel source retrieval method * Enhance block sparse attention implementation - Update `blocksparse_flashattn` to use 2 stages for improved performance. - Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency. - Modify condition checks in the kernel to utilize boolean values. - Introduce a new example for top-k sparse attention and a benchmark for native sparse attention. - Add support for asynchronous copy in PTX and improve pipeline planning with condition handling. * Refactor and clean up code formatting across multiple files - Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`. - Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`. - Updated comments and documentation for clarity in `__init__.py` and `phase.py`. - Ensured consistent formatting and style across the codebase.
-
Yu Cheng authored
* [Feature] Add TMA Store Synchronization Support - Introduce TMAStoreArrive and TMAStoreWait operations for CUDA TMA store synchronization - Add new builtin operations in op/builtin.cc and op/builtin.h - Implement TMAStoreSyncInjector to automatically inject TMA store synchronization calls - Update CUDA codegen to support new TMA store synchronization intrinsics - Add Python language bindings for new TMA store synchronization operations * [CMake] Add CUDA Major Version Detection for Conditional Compilation - Introduce CUDA_MAJOR_VERSION CMake variable to dynamically detect CUDA toolkit version - Update runtime and transform files to use CUDA_MAJOR_VERSION for version-specific code paths - Replace hardcoded __CUDACC_VER_MAJOR__ with dynamically set CUDA_MAJOR_VERSION - Improve cross-version compatibility for CUDA-dependent code sections
-
66RING authored
Expired example code, update readme.
-
Yu Cheng authored
- Introduce TMAStoreArrive and TMAStoreWait operations for CUDA TMA store synchronization - Add new builtin operations in op/builtin.cc and op/builtin.h - Implement TMAStoreSyncInjector to automatically inject TMA store synchronization calls - Update CUDA codegen to support new TMA store synchronization intrinsics - Add Python language bindings for new TMA store synchronization operations
-
Yu Cheng authored
[Refactor] Add SetMaxNRegCollector to Improve Register Hint Handling in Warp Specialized Rewriter (#194) * [Refactor] Add SetMaxNRegCollector to Improve Register Hint Handling in Warp Specialized Rewriter - Introduce `SetMaxNRegCollector` to collect register hints from SetMaxNReg calls - Modify `WarpSpecializedRewriter` to use collected register hints for producer and consumer code - Add validation checks for register hint values in the collector - Remove SetMaxNReg calls during code transformation - Enhance flexibility of register allocation in warp specialized rewriting * temporary remove check in lower_hopper_intrin
-
_HYX_ authored
* [Dev] Support clamp in language. * [Bugfix]: Fix clamp * [Refactor]
-
penguin_wwy authored
-
Lei Wang authored
* Optimize CMake build process with dynamic job count calculation - Modify build_csrc function to use 90% of available CPU cores - Ensure at least one job is used during compilation - Improve build performance by dynamically adjusting parallel job count * Optimize build_csrc function with multiprocessing module - Replace os.cpu_count() with multiprocessing.cpu_count() - Maintain existing 90% CPU utilization logic - Improve CPU core count calculation for build process * Add dynamic shape support with out_idx in Cython JIT kernel compilation - Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py - Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation - Add support for resolving dynamic shape dimensions using input tensor references - Enhance flexibility of JIT kernel compilation with symbolic shape handling * Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel - Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map - Improve debugging by providing context about missing symbolic dimensions - Maintain existing dynamic shape resolution logic * Fix Copy operation handling for scalar and multi-dimensional tensors - Add special handling for scalar tensor copy operations - Enhance error reporting in MakeIndices method with more detailed diagnostic information - Improve SIMT loop generation to support zero-dimensional tensors - Add explicit check and handling for scalar tensor scenarios * Refactor Copy operation code formatting and improve readability - Improve code formatting in MakeIndices and MakeSIMTLoop methods - Add line breaks to enhance readability of complex ICHECK statements - Simplify code structure in scalar tensor handling - Remove unnecessary whitespace and improve code alignment * Simplify GEMM example with direct kernel compilation - Update copyright header to Tile-AI Corporation - Remove Profiler import and usage - Replace tilelang.lower() with tilelang.compile() - Simplify kernel execution workflow - Update kernel source retrieval method
-
Lei Wang authored
* Optimize CMake build process with dynamic job count calculation - Modify build_csrc function to use 90% of available CPU cores - Ensure at least one job is used during compilation - Improve build performance by dynamically adjusting parallel job count * Optimize build_csrc function with multiprocessing module - Replace os.cpu_count() with multiprocessing.cpu_count() - Maintain existing 90% CPU utilization logic - Improve CPU core count calculation for build process * Add dynamic shape support with out_idx in Cython JIT kernel compilation - Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py - Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation - Add support for resolving dynamic shape dimensions using input tensor references - Enhance flexibility of JIT kernel compilation with symbolic shape handling * Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel - Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map - Improve debugging by providing context about missing symbolic dimensions - Maintain existing dynamic shape resolution logic * Fix Copy operation handling for scalar and multi-dimensional tensors - Add special handling for scalar tensor copy operations - Enhance error reporting in MakeIndices method with more detailed diagnostic information - Improve SIMT loop generation to support zero-dimensional tensors - Add explicit check and handling for scalar tensor scenarios * Refactor Copy operation code formatting and improve readability - Improve code formatting in MakeIndices and MakeSIMTLoop methods - Add line breaks to enhance readability of complex ICHECK statements - Simplify code structure in scalar tensor handling - Remove unnecessary whitespace and improve code alignment
-
- 11 Mar, 2025 4 commits
-
-
penguin_wwy authored
-
Yu Cheng authored
* [Dev][Bugfix] Add RMS Normalization Kernels and Fix Reduce Bug - Implement two RMS normalization implementations in TileLang: * `rms_norm_splitk`: Split-K reduction approach for large matrices * `rms_norm`: Full reduction kernel with simplified implementation - Add reference implementation using PyTorch for validation - Include performance benchmarking for both kernel variants - Demonstrate flexible block size and matrix size configurations * [Examples] Simplify RMS Normalization Kernel Compilation - Remove commented-out code for split-K RMS normalization - Simplify kernel compilation by removing explicit TMA lowering configuration - Update copyright header to Tile-AI Corporation - Streamline main script for RMS normalization example
-
Lei Wang authored
* Optimize CMake build process with dynamic job count calculation - Modify build_csrc function to use 90% of available CPU cores - Ensure at least one job is used during compilation - Improve build performance by dynamically adjusting parallel job count * Optimize build_csrc function with multiprocessing module - Replace os.cpu_count() with multiprocessing.cpu_count() - Maintain existing 90% CPU utilization logic - Improve CPU core count calculation for build process * Add dynamic shape support with out_idx in Cython JIT kernel compilation - Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py - Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation - Add support for resolving dynamic shape dimensions using input tensor references - Enhance flexibility of JIT kernel compilation with symbolic shape handling * Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel - Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map - Improve debugging by providing context about missing symbolic dimensions - Maintain existing dynamic shape resolution logic
-
Lei Wang authored
* Optimize CMake build process with dynamic job count calculation - Modify build_csrc function to use 90% of available CPU cores - Ensure at least one job is used during compilation - Improve build performance by dynamically adjusting parallel job count * Optimize build_csrc function with multiprocessing module - Replace os.cpu_count() with multiprocessing.cpu_count() - Maintain existing 90% CPU utilization logic - Improve CPU core count calculation for build process
-
- 10 Mar, 2025 3 commits
-
-
Lei Wang authored
* Update native sparse attention example with scale parameter handling - Add scale parameter processing in native_sparse_attention function - Modify example script to include custom scale value - Update function calls to pass scale parameter - Enhance flexibility of sparse attention implementation * Refactor Triton Native Sparse Attention Example - Improve code formatting and readability in example_triton_nsa_bwd.py - Standardize function and parameter alignment - Remove unnecessary whitespaces and optimize imports - Enhance code style consistency with previous commits
-
Lei Wang authored
* [Refactor] Improve Thread Variable Handling in Layout Inference - Update layout inference to handle thread variables more robustly - Add explicit size check between infer_list_ and thread_var_vec_ - Modify thread variable access to use per-iteration thread variable - Simplify thread predicate retrieval logic - Add minor code cleanup and return variable assignment * [Refactor] Update Layout Inference Copyright and Simplify Return Logic - Replace Apache License header with Microsoft Corporation copyright notice - Simplify LayoutInference function by directly returning substituted function - Remove unnecessary variable assignment in return statement * [Refactor] Update Layout Inference Copyright to Tile-AI Corporation - Change copyright notice from Microsoft Corporation to Tile-AI Corporation - Maintain existing file structure and licensing header
-
Lei Wang authored
- Introduce `CreateEnvThread` function to generate environment threads for GPU kernel launches - Modify `KernelLaunch` to use `CreateEnvThread` for block and thread indices - Improve thread variable naming with shorter, more descriptive identifiers (bx, by, bz, tx, ty, tz) - Ensure proper thread environment setup within PrimFunc context
-
- 09 Mar, 2025 4 commits
-
-
Lei Wang authored
* Add kernel caching mechanism to TileLang - Implement a new `cached` function in `tilelang/cache/__init__.py` to cache and reuse compiled kernels - Expose the `cached` function in the main `tilelang/__init__.py` - Add a test case for cached matrix multiplication in `testing/python/cache/test_tilelang_cache_matmul.py` - Provide a `clear_cache()` function to reset the kernel cache when needed * Refactor kernel caching test and implementation - Simplify the `cached` function in `tilelang/cache/__init__.py` - Update test script `test_tilelang_cache_matmul.py` to use `tilelang.testing.main()` - Remove unnecessary whitespace and improve code formatting * Update import for `cached` function in MHA examples - Modify import statement in `example_mha_bwd.py` and `test_tilelang_kernel_mha_bwd.py` - Change import from `tilelang.profiler import cached` to `tilelang import cached` - Align with recent refactoring of kernel caching mechanism * Refactor `cached` function signature in kernel caching - Update function signature to use keyword-only arguments for `target` and `target_host` - Improve parameter order and readability of the `cached` decorator - Maintain existing functionality while enhancing function definition
-
Lei Wang authored
* Add TMA lowering configuration option and update copyright notices This commit introduces a new configuration option to disable TMA (Tensor Memory Access) lowering and updates copyright notices across multiple files. Key changes include: - Add `kDisableTMALower` configuration option in builtin.h and builtin.cc - Update copyright notices from Microsoft Corporation to Tile-AI Corporation - Modify `LowerArgs` struct to include `disable_tma_lower` flag - Update JIT compilation interfaces to support pass configuration - Enhance error reporting in bulk copy lowering - Propagate pass configuration through various adapter layers * lint fix
-
Lei Wang authored
* Improve Autotuner and CUDA Compatibility for Tensor Core Policies - Enhance autotuner with robust parallel compilation and error handling - Add logging for better debugging during configuration compilation - Support SM90 compute capabilities in TensorCore and matmul analysis policies - Improve future handling and result tracking in autotuner - Add more flexible SM version checks for pipeline and async copy stages * Refactor Autotuner Parallel Compilation with Improved Error Handling - Enhance tqdm progress bar formatting for concurrent configuration compilation - Simplify exception handling in parallel compilation process - Remove unnecessary logging and improve code readability - Optimize thread pool shutdown and result processing
-
Lei Wang authored
* [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation - Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py - Modify roller hints generation using new TileLang Carver template and utility functions - Update get_roller_hints_from_func to handle None cases and improve return logic - Adjust DefaultPolicy to handle different codegen dictionary formats * [Refactor] Update Thread Binding and Import Statements in TileLang Kernels - Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files - Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples - Move map_torch_type utility function to tilelang.utils.tensor - Remove unnecessary imports and improve code organization * Refactor Native Sparse Attention Example with Enhanced Triton Kernel - Update parallel_nsa_fwd_kernel to support more flexible sparse attention computation - Add support for block counts and offsets in the Triton kernel - Modify kernel grid and computation logic for improved performance - Update example script to use naive_nsa_simple reference implementation - Improve type hints and kernel configuration * Add Native Sparse Attention Examples with Tilelang and Triton Implementations - Introduce new example scripts for native sparse attention: * example_tilelang_nsa_fwd.py: Forward pass implementation using TileLang * example_tilelang_nsa_decode.py: Decoding-specific sparse attention implementation * example_triton_nsa_fwd.py: Triton-based sparse attention forward pass - Update reference.py with naive implementations for sparse attention - Support different sparse attention scenarios including forward pass and inference - Add comprehensive testing and validation against reference implementations * lint fix * Add Variable-Length Native Sparse Attention Examples for TileLang and Triton - Introduce new example scripts for variable-length native sparse attention: * example_tilelang_nsa_fwd_varlen.py: TileLang implementation with variable sequence lengths * example_triton_nsa_fwd_varlen.py: Triton implementation with variable sequence lengths - Update reference.py to support variable-length sparse attention scenarios - Enhance existing sparse attention implementations to handle variable-length inputs - Add comprehensive testing and validation for variable-length sparse attention * Refactor Native Sparse Attention Examples: Code Style and Formatting Improvements - Standardize function and parameter formatting across NSA example files - Improve code readability by adjusting indentation and line breaks - Enhance type hints and parameter alignment - Remove unnecessary whitespaces and optimize imports - Maintain consistent code style across TileLang and Triton implementations * Add debug logging and extend execution backend in JIT and loop vectorization - Add detailed logging in loop vectorization to help diagnose buffer shape handling - Extend JIT execution backend to include 'cython' option - Improve boundary condition checks in BufferLoadNode visit method * Remove debug logging in loop vectorization BufferLoadNode visit method - Remove unnecessary INFO log statements in VisitExpr_ method - Simplify code by eliminating redundant logging - Maintain core logic for handling buffer load node visits
-
- 07 Mar, 2025 6 commits
-
-
Lei Wang authored
* [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation - Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py - Modify roller hints generation using new TileLang Carver template and utility functions - Update get_roller_hints_from_func to handle None cases and improve return logic - Adjust DefaultPolicy to handle different codegen dictionary formats * [Refactor] Update Thread Binding and Import Statements in TileLang Kernels - Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files - Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples - Move map_torch_type utility function to tilelang.utils.tensor - Remove unnecessary imports and improve code organization * Refactor Native Sparse Attention Example with Enhanced Triton Kernel - Update parallel_nsa_fwd_kernel to support more flexible sparse attention computation - Add support for block counts and offsets in the Triton kernel - Modify kernel grid and computation logic for improved performance - Update example script to use naive_nsa_simple reference implementation - Improve type hints and kernel configuration * Add Native Sparse Attention Examples with Tilelang and Triton Implementations - Introduce new example scripts for native sparse attention: * example_tilelang_nsa_fwd.py: Forward pass implementation using TileLang * example_tilelang_nsa_decode.py: Decoding-specific sparse attention implementation * example_triton_nsa_fwd.py: Triton-based sparse attention forward pass - Update reference.py with naive implementations for sparse attention - Support different sparse attention scenarios including forward pass and inference - Add comprehensive testing and validation against reference implementations * lint fix * Add Variable-Length Native Sparse Attention Examples for TileLang and Triton - Introduce new example scripts for variable-length native sparse attention: * example_tilelang_nsa_fwd_varlen.py: TileLang implementation with variable sequence lengths * example_triton_nsa_fwd_varlen.py: Triton implementation with variable sequence lengths - Update reference.py to support variable-length sparse attention scenarios - Enhance existing sparse attention implementations to handle variable-length inputs - Add comprehensive testing and validation for variable-length sparse attention * Refactor Native Sparse Attention Examples: Code Style and Formatting Improvements - Standardize function and parameter formatting across NSA example files - Improve code readability by adjusting indentation and line breaks - Enhance type hints and parameter alignment - Remove unnecessary whitespaces and optimize imports - Maintain consistent code style across TileLang and Triton implementations
-
You Jiacheng authored
It's slightly faster than T.copy then RS-GEMM, and simpler.
-
Lei Wang authored
-
Lei Wang authored
* [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation - Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py - Modify roller hints generation using new TileLang Carver template and utility functions - Update get_roller_hints_from_func to handle None cases and improve return logic - Adjust DefaultPolicy to handle different codegen dictionary formats * [Refactor] Update Thread Binding and Import Statements in TileLang Kernels - Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files - Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples - Move map_torch_type utility function to tilelang.utils.tensor - Remove unnecessary imports and improve code organization * Refactor Native Sparse Attention Example with Enhanced Triton Kernel - Update parallel_nsa_fwd_kernel to support more flexible sparse attention computation - Add support for block counts and offsets in the Triton kernel - Modify kernel grid and computation logic for improved performance - Update example script to use naive_nsa_simple reference implementation - Improve type hints and kernel configuration * Add Native Sparse Attention Examples with Tilelang and Triton Implementations - Introduce new example scripts for native sparse attention: * example_tilelang_nsa_fwd.py: Forward pass implementation using TileLang * example_tilelang_nsa_decode.py: Decoding-specific sparse attention implementation * example_triton_nsa_fwd.py: Triton-based sparse attention forward pass - Update reference.py with naive implementations for sparse attention - Support different sparse attention scenarios including forward pass and inference - Add comprehensive testing and validation against reference implementations * lint fix
-
Lei Wang authored
* [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation - Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py - Modify roller hints generation using new TileLang Carver template and utility functions - Update get_roller_hints_from_func to handle None cases and improve return logic - Adjust DefaultPolicy to handle different codegen dictionary formats * [Refactor] Update Thread Binding and Import Statements in TileLang Kernels - Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files - Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples - Move map_torch_type utility function to tilelang.utils.tensor - Remove unnecessary imports and improve code organization * Refactor Native Sparse Attention Example with Enhanced Triton Kernel - Update parallel_nsa_fwd_kernel to support more flexible sparse attention computation - Add support for block counts and offsets in the Triton kernel - Modify kernel grid and computation logic for improved performance - Update example script to use naive_nsa_simple reference implementation - Improve type hints and kernel configuration
-
Lei Wang authored
* [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation - Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py - Modify roller hints generation using new TileLang Carver template and utility functions - Update get_roller_hints_from_func to handle None cases and improve return logic - Adjust DefaultPolicy to handle different codegen dictionary formats * [Refactor] Update Thread Binding and Import Statements in TileLang Kernels - Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files - Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples - Move map_torch_type utility function to tilelang.utils.tensor - Remove unnecessary imports and improve code organization
-