"git@developer.sourcefind.cn:change/sglang.git" did not exist on "60b0503227d01fbeb21a9ad270445d38229611c1"
  1. 16 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Introduce KernelParam integration across modules (#223) · 3de9f13c
      Lei Wang authored
      * [Refactor] Update KernelParam integration across modules
      
      - Replaced instances of TensorType with KernelParam in various modules to standardize parameter handling.
      - Updated JITKernel, BaseKernelAdapter, and CythonKernelAdapter to utilize KernelParam for improved type consistency.
      - Enhanced Profiler class to include KernelParam in its parameters, ensuring better integration with the new parameter structure.
      - Adjusted tensor handling in utility functions to accommodate the new KernelParam type, improving overall code clarity and maintainability.
      - Updated copyright headers to reflect the correct organization.
      
      * [Refactor] Clean up whitespace in kernel, profiler, and tensor modules
      
      - Added blank lines for improved readability in kernel.py, __init__.py, and tensor.py.
      - Enhanced code clarity by ensuring consistent formatting across these modules.
      
      * [Enhancement] Add detailed docstrings to KernelParam and Profiler classes
      
      - Enhanced KernelParam class with comprehensive docstrings for better understanding of its purpose and methods.
      - Updated Profiler class to include detailed docstrings for its attributes and methods, improving code documentation and usability.
      - Removed unused do_bench function to streamline the profiler module and improve clarity.
      
      * [Refactor] Update type hints in do_bench function and clean up whitespace in profiler module
      
      - Changed type hints for grad_to_none and quantiles parameters in do_bench function to use Optional for better clarity.
      - Added a blank line in __init__.py for improved readability and consistency in the profiler module.
      
      * [Refactor] Update type hint in do_bench function for consistency
      
      - Changed the return type hint in the do_bench function from a union type to a more explicit List type for better clarity and consistency in type annotations.
      
      * [Refactor] Update return type hint in do_bench function for clarity
      
      - Changed the return type hint in the do_bench function from a union type to Union[float, List[float]] for improved clarity and consistency in type annotations.
      
      * [Enhancement] Add func property to Profiler class for adapter access
      
      - Introduced a new property `func` in the Profiler class to provide access to the adapter, ensuring that the adapter is set before retrieval. This enhancement improves the usability of the Profiler class by allowing easier access to the adapter functionality.
      
      * [Refactor] Update kernel compilation and profiling in tests
      
      - Replaced instances of `TL.lower` and `TL.Profiler` with `tilelang.compile` and the new profiler interface across multiple test files.
      - Enhanced the kernel compilation process to utilize the updated API, improving consistency and maintainability in the testing framework.
      - Updated assertions to use the new profiler methods for better clarity and functionality in performance testing.
      
      * [Refactor] Simplify kernel invocation and remove unused parameters in tests
      
      - Updated the kernel invocation in `test_tilelang_dynamic_symbolic.py` to directly assign the result to `C`, improving clarity.
      - Removed the `execution_backend` parameter from `tilelang.compile` calls in `test_tilelang_jit_callback.py` and `test_tilelang_jit_gemm.py` for consistency with the updated API.
      - Commented out the call to `tilelang.testing.main()` in `test_tilelang_jit_callback.py` and replaced it with a direct call to `test_gemm_jit_kernel()` to streamline test execution.
      - Adjusted the dtype mapping in `TorchDLPackKernelAdapter` to use the parameter's dtype directly, enhancing code simplicity.
      
      * [Refactor] Remove unused imports in test files for cleaner code
      
      - Eliminated unnecessary imports of `tilelang` as `TL` in various test files to enhance code clarity and maintainability.
      - Updated multiple test files to streamline the codebase and reduce potential confusion from unused references.
      
      * [Refactor] Simplify kernel invocation in tilelang kernel test
      
      - Updated the kernel invocation in `test_tilelang_kernel_bf16_gemm_mma.py` to directly assign the result to `C`, enhancing code clarity and consistency with recent changes in the API.
      
      * [Refactor] Simplify kernel invocation in tilelang kernel tests
      
      - Updated kernel invocations in multiple test files to directly assign the result to `C`, improving code clarity and consistency with the updated API.
      - Removed unnecessary initialization of `C` as a zero tensor, streamlining the code further.
      
      * [Refactor] Update kernel invocation in tilelang transform tests
      
      - Replaced the use of `TL.Profiler` with `tilelang.compile` in `test_tilelang_transform_simplify.py`, enhancing code clarity and consistency with the updated API.
      - Streamlined the kernel invocation process by directly assigning the result to `C`, improving readability and maintainability of the test code.
      3de9f13c
  2. 15 Mar, 2025 2 commits
    • Lei Wang's avatar
      [Enhancement] Update format script to support upstream repository for file formatting (#221) · 6bc8d6d3
      Lei Wang authored
      - Modified `format.sh` to first check for the upstream repository before falling back to the local `origin/main` branch for determining the merge base.
      - Improved logic to handle cases where the upstream repository is not available, ensuring robust file formatting across branches.
      6bc8d6d3
    • Lei Wang's avatar
      [Enhancement] Improve device handling in Cython kernel adapter (#220) · 01526645
      Lei Wang authored
      * [Enhancement] Improve device handling in Cython kernel adapter and wrapper
      
      - Updated `CythonKernelAdapter` to support dynamic device assignment based on target type (CUDA, HIP, or CPU).
      - Enhanced `CythonKernelWrapper` to include device management, ensuring tensors are allocated on the correct device.
      - Added error handling for unsupported target types to improve robustness.
      
      * [Enhancement] Add buffer device mapping in Cython kernel adapter and wrapper
      
      - Introduced `buffer_device_map` in `CythonKernelAdapter` to associate buffer variables with their respective devices.
      - Updated `CythonKernelWrapper` to utilize the new buffer device mapping for device checks during tensor allocation.
      - Enhanced error handling for device mismatches to ensure tensors are allocated on the correct device, improving robustness and flexibility in device management.
      01526645
  3. 14 Mar, 2025 6 commits
    • Lei Wang's avatar
      [Language] Introduce `T.reshape` and `T.view` (#212) · 872f5613
      Lei Wang authored
      * [Feature] Add reshape and view functionalities to tilelang language module
      
      - Introduced new test files for reshape and view operations in the tilelang language.
      - Implemented reshape and view functions in the customize module, enhancing buffer manipulation capabilities.
      - Updated the language initialization to include the new functionalities.
      - Removed unnecessary import from test_tilelang_language_clamp.py for cleaner code.
      
      * Update copyright to Tile-AI Corporation
      
      * [Refactor] Clean up whitespace in test files for reshape and view functionalities
      
      - Removed unnecessary blank lines in `test_tilelang_language_reshape.py` and `test_tilelang_language_view.py` for improved readability.
      - Ensured consistent formatting across test files to enhance code clarity.
      872f5613
    • Yu Cheng's avatar
      [Dev] Implement IfStmtBinding and MergeIfStmt transformations (#211) · 86f96f8a
      Yu Cheng authored
      
      
      * [Dev] Implement IfStmtBinding and MergeIfStmt transformations
      
      - Add IfStmtBinding to bind If statements to each statement in SeqStmt, enhancing the handling of conditional statements.
      - Introduce MergeIfStmt to merge consecutive If statements within SeqStmt, optimizing the structure of conditional logic.
      - Update phase.py to apply IfStmtBinding and MergeIfStmt transformations for the "sm_90" target.
      - Enhance __init__.py with new functions for IfStmtBinding and MergeIfStmt, providing a clear interface for these transformations.
      
      * Update license header in if_stmt_binding.cc
      
      * Update license header in merge_if_stmt.cc
      
      ---------
      Co-authored-by: default avatarLei Wang <34334180+LeiWang1999@users.noreply.github.com>
      86f96f8a
    • Yuxuan Hu's avatar
    • Lei Wang's avatar
      [Enhancement] Avoid tvm ffi handling when out_idx is specified (#209) · 227ed7ec
      Lei Wang authored
      * Optimize CMake build process with dynamic job count calculation
      
      - Modify build_csrc function to use 90% of available CPU cores
      - Ensure at least one job is used during compilation
      - Improve build performance by dynamically adjusting parallel job count
      
      * Optimize build_csrc function with multiprocessing module
      
      - Replace os.cpu_count() with multiprocessing.cpu_count()
      - Maintain existing 90% CPU utilization logic
      - Improve CPU core count calculation for build process
      
      * Add dynamic shape support with out_idx in Cython JIT kernel compilation
      
      - Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
      - Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
      - Add support for resolving dynamic shape dimensions using input tensor references
      - Enhance flexibility of JIT kernel compilation with symbolic shape handling
      
      * Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel
      
      - Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
      - Improve debugging by providing context about missing symbolic dimensions
      - Maintain existing dynamic shape resolution logic
      
      * Fix Copy operation handling for scalar and multi-dimensional tensors
      
      - Add special handling for scalar tensor copy operations
      - Enhance error reporting in MakeIndices method with more detailed diagnostic information
      - Improve SIMT loop generation to support zero-dimensional tensors
      - Add explicit check and handling for scalar tensor scenarios
      
      * Refactor Copy operation code formatting and improve readability
      
      - Improve code formatting in MakeIndices and MakeSIMTLoop methods
      - Add line breaks to enhance readability of complex ICHECK statements
      - Simplify code structure in scalar tensor handling
      - Remove unnecessary whitespace and improve code alignment
      
      * Simplify GEMM example with direct kernel compilation
      
      - Update copyright header to Tile-AI Corporation
      - Remove Profiler import and usage
      - Replace tilelang.lower() with tilelang.compile()
      - Simplify kernel execution workflow
      - Update kernel source retrieval method
      
      * Enhance block sparse attention implementation
      
      - Update `blocksparse_flashattn` to use 2 stages for improved performance.
      - Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency.
      - Modify condition checks in the kernel to utilize boolean values.
      - Introduce a new example for top-k sparse attention and a benchmark for native sparse attention.
      - Add support for asynchronous copy in PTX and improve pipeline planning with condition handling.
      
      * Refactor and clean up code formatting across multiple files
      
      - Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`.
      - Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`.
      - Updated comments and documentation for clarity in `__init__.py` and `phase.py`.
      - Ensured consistent formatting and style across the codebase.
      
      * Add kernel source printing in example_tilelang_nsa_fwd.py and implement IfThenElse node replacement in inject_pipeline.cc
      
      - Added a print statement to output the kernel source in `example_tilelang_nsa_fwd.py` for debugging purposes.
      - Introduced a new function `replace_if_then_else` in `inject_pipeline.cc` to transform IfThenElse nodes while preserving attributes, enhancing the handling of conditional statements in the pipeline.
      
      * Refactor condition handling in inject_pipeline.cc
      
      - Change the data structure for mapping conditions to statements from a Map to an Array for improved performance and simplicity.
      - Update condition comparison logic to use StructuralEqual for better accuracy.
      - Enhance logging to provide detailed insights into condition changes and statement processing.
      - Adjust final statement construction to utilize the new data structure, ensuring correct handling of conditions and statements.
      
      * Improve logging and formatting in inject_pipeline.cc
      
      - Enhance logging statements for better clarity on condition changes and statement processing.
      - Adjust formatting for improved readability, including line breaks and consistent spacing.
      - Ensure accurate condition comparison and handling in the pipeline logic.
      
      * Refactor logging and clean up inject_pipeline.cc
      
      - Remove excessive logging statements to streamline the code and improve performance.
      - Simplify condition handling by eliminating unnecessary log outputs related to condition changes and statement processing.
      - Maintain the core functionality while enhancing code readability and maintainability.
      
      * Update Dockerfiles to specify exact version of libstdcxx-ng
      
      - Change installation command in multiple Dockerfiles to use `libstdcxx-ng=12` instead of `libstdcxx-ng-12` for consistency and to avoid potential issues with package resolution.
      - Ensure all Dockerfiles from cu118 to cu126 reflect this change for uniformity across builds.
      
      * Refactor and enhance examples and kernel handling
      
      - Adjusted the pipeline stages in `example_blocksparse_gemm.py` from 2 to 1 for improved performance.
      - Added kernel source printing in `benchmark_nsa_fwd.py` for better debugging and profiling insights.
      - Updated tensor allocation and parameter handling in `CtypesKernelAdapter` and `CythonKernelWrapper` to cache parameter dtypes and shapes, improving efficiency and clarity.
      - Enhanced the handling of dynamic shapes in the Cython JIT kernel compilation process.
      - Modified the benchmark script to accommodate new tensor output parameters and improved batch size defaults for testing.
      
      * Update copyright header in Cython wrapper to reflect Tile-AI Corporation
      
      * revert change
      227ed7ec
    • Chenghua's avatar
      [Examples] Expand tuning configurations for FlashAttention example (#204) · 8678aac0
      Chenghua authored
      * [Example] Modify tuning configurations for FlashAttention example
      
      * [Examples] formatting example_gqa_fwd_bshd.py
      8678aac0
    • Lei Wang's avatar
      [Enhancement] Allow mma fallback when wgmma is not supported (#206) · 45559a1f
      Lei Wang authored
      * Enhance error message for constant size stack allocation in CUDA codegen. Include the actual constant size and buffer variable name in the error output for better debugging.
      
      * Refactor GEMM and Bulk Copy operations to enhance layout handling and support for Hopper architecture
      
      - Update `ComputeWarpPartition` to include a new parameter for Hopper WGMMA support.
      - Modify layout checks in `LowerBulkCopy` to accommodate new GEMM layout types.
      - Enhance layout inference logic in `InferLayout` for better compatibility with Hopper architecture.
      - Include necessary header files for built-in operations and layout inference improvements.
      
      * lint fix
      
      * Remove unused builtin.h include directive
      
      * Update include path for builtin.h
      45559a1f
  4. 20 Jul, 2025 1 commit
  5. 13 Mar, 2025 5 commits
    • Yu Cheng's avatar
      [Dev] Add GQA backward example (#205) · a55f3686
      Yu Cheng authored
      - Introduce `example_gqa_bwd.py` demonstrating the backward pass of FlashAttention with pipelined execution.
      - Implement forward and backward functions for FlashAttention, including preprocessing and postprocessing steps.
      - Enhance argument parsing for batch size, heads, context size, and dimensions.
      - Include a reference implementation for validation and performance benchmarking.
      a55f3686
    • Lei Wang's avatar
      [Docker] Update Dockerfiles to specify exact version of libstdcxx-ng (#203) · 05d72dfc
      Lei Wang authored
      * Optimize CMake build process with dynamic job count calculation
      
      - Modify build_csrc function to use 90% of available CPU cores
      - Ensure at least one job is used during compilation
      - Improve build performance by dynamically adjusting parallel job count
      
      * Optimize build_csrc function with multiprocessing module
      
      - Replace os.cpu_count() with multiprocessing.cpu_count()
      - Maintain existing 90% CPU utilization logic
      - Improve CPU core count calculation for build process
      
      * Add dynamic shape support with out_idx in Cython JIT kernel compilation
      
      - Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
      - Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
      - Add support for resolving dynamic shape dimensions using input tensor references
      - Enhance flexibility of JIT kernel compilation with symbolic shape handling
      
      * Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel
      
      - Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
      - Improve debugging by providing context about missing symbolic dimensions
      - Maintain existing dynamic shape resolution logic
      
      * Fix Copy operation handling for scalar and multi-dimensional tensors
      
      - Add special handling for scalar tensor copy operations
      - Enhance error reporting in MakeIndices method with more detailed diagnostic information
      - Improve SIMT loop generation to support zero-dimensional tensors
      - Add explicit check and handling for scalar tensor scenarios
      
      * Refactor Copy operation code formatting and improve readability
      
      - Improve code formatting in MakeIndices and MakeSIMTLoop methods
      - Add line breaks to enhance readability of complex ICHECK statements
      - Simplify code structure in scalar tensor handling
      - Remove unnecessary whitespace and improve code alignment
      
      * Simplify GEMM example with direct kernel compilation
      
      - Update copyright header to Tile-AI Corporation
      - Remove Profiler import and usage
      - Replace tilelang.lower() with tilelang.compile()
      - Simplify kernel execution workflow
      - Update kernel source retrieval method
      
      * Enhance block sparse attention implementation
      
      - Update `blocksparse_flashattn` to use 2 stages for improved performance.
      - Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency.
      - Modify condition checks in the kernel to utilize boolean values.
      - Introduce a new example for top-k sparse attention and a benchmark for native sparse attention.
      - Add support for asynchronous copy in PTX and improve pipeline planning with condition handling.
      
      * Refactor and clean up code formatting across multiple files
      
      - Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`.
      - Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`.
      - Updated comments and documentation for clarity in `__init__.py` and `phase.py`.
      - Ensured consistent formatting and style across the codebase.
      
      * Add kernel source printing in example_tilelang_nsa_fwd.py and implement IfThenElse node replacement in inject_pipeline.cc
      
      - Added a print statement to output the kernel source in `example_tilelang_nsa_fwd.py` for debugging purposes.
      - Introduced a new function `replace_if_then_else` in `inject_pipeline.cc` to transform IfThenElse nodes while preserving attributes, enhancing the handling of conditional statements in the pipeline.
      
      * Refactor condition handling in inject_pipeline.cc
      
      - Change the data structure for mapping conditions to statements from a Map to an Array for improved performance and simplicity.
      - Update condition comparison logic to use StructuralEqual for better accuracy.
      - Enhance logging to provide detailed insights into condition changes and statement processing.
      - Adjust final statement construction to utilize the new data structure, ensuring correct handling of conditions and statements.
      
      * Improve logging and formatting in inject_pipeline.cc
      
      - Enhance logging statements for better clarity on condition changes and statement processing.
      - Adjust formatting for improved readability, including line breaks and consistent spacing.
      - Ensure accurate condition comparison and handling in the pipeline logic.
      
      * Refactor logging and clean up inject_pipeline.cc
      
      - Remove excessive logging statements to streamline the code and improve performance.
      - Simplify condition handling by eliminating unnecessary log outputs related to condition changes and statement processing.
      - Maintain the core functionality while enhancing code readability and maintainability.
      
      * Update Dockerfiles to specify exact version of libstdcxx-ng
      
      - Change installation command in multiple Dockerfiles to use `libstdcxx-ng=12` instead of `libstdcxx-ng-12` for consistency and to avoid potential issues with package resolution.
      - Ensure all Dockerfiles from cu118 to cu126 reflect this change for uniformity across builds.
      05d72dfc
    • zqh-wz's avatar
      [Feature] Upgrade cutlass version and support fp8 T.gemm (#202) · 2cccf1f5
      zqh-wz authored
      
      
      * upgrade cutlass to upstream v3.8.0
      
      * Implement fp8 gemm and add example script
      
      * Fix dtype retrieval with map_torch_type for fp8 inputs
      
      * Disable vectorization of fp8 values
      
      * Make MMA declaration compatible with cutlass 3.4.0+
      
      * Add test for fp8 T.gemm
      
      * fix indent
      
      * fix indent
      
      * Add copyright and license header
      
      * Add copyright and license header
      
      * lint fix
      
      * Refactor matmul_nt and assert_matmul_correctness functions for improved readability by consolidating parameter definitions and adjusting formatting.
      
      * clang format lint
      
      ---------
      Co-authored-by: default avatarLei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: default avatarLeiWang1999 <leiwang1999@outlook.com>
      2cccf1f5
    • Lei Wang's avatar
      [Enhancement] Enhancing the handling of conditional statements in the pipeline (#201) · dda8ebff
      Lei Wang authored
      * Optimize CMake build process with dynamic job count calculation
      
      - Modify build_csrc function to use 90% of available CPU cores
      - Ensure at least one job is used during compilation
      - Improve build performance by dynamically adjusting parallel job count
      
      * Optimize build_csrc function with multiprocessing module
      
      - Replace os.cpu_count() with multiprocessing.cpu_count()
      - Maintain existing 90% CPU utilization logic
      - Improve CPU core count calculation for build process
      
      * Add dynamic shape support with out_idx in Cython JIT kernel compilation
      
      - Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
      - Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
      - Add support for resolving dynamic shape dimensions using input tensor references
      - Enhance flexibility of JIT kernel compilation with symbolic shape handling
      
      * Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel
      
      - Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
      - Improve debugging by providing context about missing symbolic dimensions
      - Maintain existing dynamic shape resolution logic
      
      * Fix Copy operation handling for scalar and multi-dimensional tensors
      
      - Add special handling for scalar tensor copy operations
      - Enhance error reporting in MakeIndices method with more detailed diagnostic information
      - Improve SIMT loop generation to support zero-dimensional tensors
      - Add explicit check and handling for scalar tensor scenarios
      
      * Refactor Copy operation code formatting and improve readability
      
      - Improve code formatting in MakeIndices and MakeSIMTLoop methods
      - Add line breaks to enhance readability of complex ICHECK statements
      - Simplify code structure in scalar tensor handling
      - Remove unnecessary whitespace and improve code alignment
      
      * Simplify GEMM example with direct kernel compilation
      
      - Update copyright header to Tile-AI Corporation
      - Remove Profiler import and usage
      - Replace tilelang.lower() with tilelang.compile()
      - Simplify kernel execution workflow
      - Update kernel source retrieval method
      
      * Enhance block sparse attention implementation
      
      - Update `blocksparse_flashattn` to use 2 stages for improved performance.
      - Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency.
      - Modify condition checks in the kernel to utilize boolean values.
      - Introduce a new example for top-k sparse attention and a benchmark for native sparse attention.
      - Add support for asynchronous copy in PTX and improve pipeline planning with condition handling.
      
      * Refactor and clean up code formatting across multiple files
      
      - Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`.
      - Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`.
      - Updated comments and documentation for clarity in `__init__.py` and `phase.py`.
      - Ensured consistent formatting and style across the codebase.
      
      * Add kernel source printing in example_tilelang_nsa_fwd.py and implement IfThenElse node replacement in inject_pipeline.cc
      
      - Added a print statement to output the kernel source in `example_tilelang_nsa_fwd.py` for debugging purposes.
      - Introduced a new function `replace_if_then_else` in `inject_pipeline.cc` to transform IfThenElse nodes while preserving attributes, enhancing the handling of conditional statements in the pipeline.
      
      * Refactor condition handling in inject_pipeline.cc
      
      - Change the data structure for mapping conditions to statements from a Map to an Array for improved performance and simplicity.
      - Update condition comparison logic to use StructuralEqual for better accuracy.
      - Enhance logging to provide detailed insights into condition changes and statement processing.
      - Adjust final statement construction to utilize the new data structure, ensuring correct handling of conditions and statements.
      
      * Improve logging and formatting in inject_pipeline.cc
      
      - Enhance logging statements for better clarity on condition changes and statement processing.
      - Adjust formatting for improved readability, including line breaks and consistent spacing.
      - Ensure accurate condition comparison and handling in the pipeline logic.
      
      * Refactor logging and clean up inject_pipeline.cc
      
      - Remove excessive logging statements to streamline the code and improve performance.
      - Simplify condition handling by eliminating unnecessary log outputs related to condition changes and statement processing.
      - Maintain the core functionality while enhancing code readability and maintainability.
      dda8ebff
    • Yu Cheng's avatar
      [Dev] Add new example for FlashAttention with pipelined execution (#200) · c2b9b59d
      Yu Cheng authored
      - Introduce `example_gqa_fwd_bshd_wgmma_pipelined.py` demonstrating a pipelined implementation of FlashAttention.
      - Update sequence length parameter in existing example to 8192 and adjust number of stages for improved performance.
      - Enhance argument parsing to accommodate new configurations for batch size, heads, and groups.
      c2b9b59d
  6. 12 Mar, 2025 9 commits
    • Lei Wang's avatar
      [Feature] Support Async Pipeline inference within if scope (#198) · 7ccec53b
      Lei Wang authored
      * Optimize CMake build process with dynamic job count calculation
      
      - Modify build_csrc function to use 90% of available CPU cores
      - Ensure at least one job is used during compilation
      - Improve build performance by dynamically adjusting parallel job count
      
      * Optimize build_csrc function with multiprocessing module
      
      - Replace os.cpu_count() with multiprocessing.cpu_count()
      - Maintain existing 90% CPU utilization logic
      - Improve CPU core count calculation for build process
      
      * Add dynamic shape support with out_idx in Cython JIT kernel compilation
      
      - Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
      - Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
      - Add support for resolving dynamic shape dimensions using input tensor references
      - Enhance flexibility of JIT kernel compilation with symbolic shape handling
      
      * Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel
      
      - Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
      - Improve debugging by providing context about missing symbolic dimensions
      - Maintain existing dynamic shape resolution logic
      
      * Fix Copy operation handling for scalar and multi-dimensional tensors
      
      - Add special handling for scalar tensor copy operations
      - Enhance error reporting in MakeIndices method with more detailed diagnostic information
      - Improve SIMT loop generation to support zero-dimensional tensors
      - Add explicit check and handling for scalar tensor scenarios
      
      * Refactor Copy operation code formatting and improve readability
      
      - Improve code formatting in MakeIndices and MakeSIMTLoop methods
      - Add line breaks to enhance readability of complex ICHECK statements
      - Simplify code structure in scalar tensor handling
      - Remove unnecessary whitespace and improve code alignment
      
      * Simplify GEMM example with direct kernel compilation
      
      - Update copyright header to Tile-AI Corporation
      - Remove Profiler import and usage
      - Replace tilelang.lower() with tilelang.compile()
      - Simplify kernel execution workflow
      - Update kernel source retrieval method
      
      * Enhance block sparse attention implementation
      
      - Update `blocksparse_flashattn` to use 2 stages for improved performance.
      - Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency.
      - Modify condition checks in the kernel to utilize boolean values.
      - Introduce a new example for top-k sparse attention and a benchmark for native sparse attention.
      - Add support for asynchronous copy in PTX and improve pipeline planning with condition handling.
      
      * Refactor and clean up code formatting across multiple files
      
      - Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`.
      - Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`.
      - Updated comments and documentation for clarity in `__init__.py` and `phase.py`.
      - Ensured consistent formatting and style across the codebase.
      7ccec53b
    • Yu Cheng's avatar
      [CMake] Add CUDA Major Version Detection for Conditional Compilation (#197) · 20f19611
      Yu Cheng authored
      * [Feature] Add TMA Store Synchronization Support
      
      - Introduce TMAStoreArrive and TMAStoreWait operations for CUDA TMA store synchronization
      - Add new builtin operations in op/builtin.cc and op/builtin.h
      - Implement TMAStoreSyncInjector to automatically inject TMA store synchronization calls
      - Update CUDA codegen to support new TMA store synchronization intrinsics
      - Add Python language bindings for new TMA store synchronization operations
      
      * [CMake] Add CUDA Major Version Detection for Conditional Compilation
      
      - Introduce CUDA_MAJOR_VERSION CMake variable to dynamically detect CUDA toolkit version
      - Update runtime and transform files to use CUDA_MAJOR_VERSION for version-specific code paths
      - Replace hardcoded __CUDACC_VER_MAJOR__ with dynamically set CUDA_MAJOR_VERSION
      - Improve cross-version compatibility for CUDA-dependent code sections
      20f19611
    • 66RING's avatar
      Update expired example code. (#196) · 6ab29ffc
      66RING authored
      Expired example code, update readme.
      6ab29ffc
    • Yu Cheng's avatar
      [Feature] Add TMA Store Synchronization Support (#195) · eba7dd5a
      Yu Cheng authored
      - Introduce TMAStoreArrive and TMAStoreWait operations for CUDA TMA store synchronization
      - Add new builtin operations in op/builtin.cc and op/builtin.h
      - Implement TMAStoreSyncInjector to automatically inject TMA store synchronization calls
      - Update CUDA codegen to support new TMA store synchronization intrinsics
      - Add Python language bindings for new TMA store synchronization operations
      eba7dd5a
    • Yu Cheng's avatar
      [Refactor] Add SetMaxNRegCollector to Improve Register Hint Handling in Warp... · 94c758ad
      Yu Cheng authored
      [Refactor] Add SetMaxNRegCollector to Improve Register Hint Handling in Warp Specialized Rewriter (#194)
      
      * [Refactor] Add SetMaxNRegCollector to Improve Register Hint Handling in Warp Specialized Rewriter
      
      - Introduce `SetMaxNRegCollector` to collect register hints from SetMaxNReg calls
      - Modify `WarpSpecializedRewriter` to use collected register hints for producer and consumer code
      - Add validation checks for register hint values in the collector
      - Remove SetMaxNReg calls during code transformation
      - Enhance flexibility of register allocation in warp specialized rewriting
      
      * temporary remove check in lower_hopper_intrin
      94c758ad
    • _HYX_'s avatar
      [Language] Support clamp in language (#192) · 94c941fc
      _HYX_ authored
      * [Dev] Support clamp in language.
      
      * [Bugfix]: Fix clamp
      
      * [Refactor]
      94c941fc
    • penguin_wwy's avatar
      efb2b1d5
    • Lei Wang's avatar
      [Enhancement] Simplify GEMM example with direct kernel compilation (#191) · 79ea77e8
      Lei Wang authored
      * Optimize CMake build process with dynamic job count calculation
      
      - Modify build_csrc function to use 90% of available CPU cores
      - Ensure at least one job is used during compilation
      - Improve build performance by dynamically adjusting parallel job count
      
      * Optimize build_csrc function with multiprocessing module
      
      - Replace os.cpu_count() with multiprocessing.cpu_count()
      - Maintain existing 90% CPU utilization logic
      - Improve CPU core count calculation for build process
      
      * Add dynamic shape support with out_idx in Cython JIT kernel compilation
      
      - Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
      - Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
      - Add support for resolving dynamic shape dimensions using input tensor references
      - Enhance flexibility of JIT kernel compilation with symbolic shape handling
      
      * Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel
      
      - Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
      - Improve debugging by providing context about missing symbolic dimensions
      - Maintain existing dynamic shape resolution logic
      
      * Fix Copy operation handling for scalar and multi-dimensional tensors
      
      - Add special handling for scalar tensor copy operations
      - Enhance error reporting in MakeIndices method with more detailed diagnostic information
      - Improve SIMT loop generation to support zero-dimensional tensors
      - Add explicit check and handling for scalar tensor scenarios
      
      * Refactor Copy operation code formatting and improve readability
      
      - Improve code formatting in MakeIndices and MakeSIMTLoop methods
      - Add line breaks to enhance readability of complex ICHECK statements
      - Simplify code structure in scalar tensor handling
      - Remove unnecessary whitespace and improve code alignment
      
      * Simplify GEMM example with direct kernel compilation
      
      - Update copyright header to Tile-AI Corporation
      - Remove Profiler import and usage
      - Replace tilelang.lower() with tilelang.compile()
      - Simplify kernel execution workflow
      - Update kernel source retrieval method
      79ea77e8
    • Lei Wang's avatar
      [Bugfix] Fix `T.copy` for scalar datatypes (#190) · 454248c7
      Lei Wang authored
      * Optimize CMake build process with dynamic job count calculation
      
      - Modify build_csrc function to use 90% of available CPU cores
      - Ensure at least one job is used during compilation
      - Improve build performance by dynamically adjusting parallel job count
      
      * Optimize build_csrc function with multiprocessing module
      
      - Replace os.cpu_count() with multiprocessing.cpu_count()
      - Maintain existing 90% CPU utilization logic
      - Improve CPU core count calculation for build process
      
      * Add dynamic shape support with out_idx in Cython JIT kernel compilation
      
      - Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
      - Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
      - Add support for resolving dynamic shape dimensions using input tensor references
      - Enhance flexibility of JIT kernel compilation with symbolic shape handling
      
      * Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel
      
      - Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
      - Improve debugging by providing context about missing symbolic dimensions
      - Maintain existing dynamic shape resolution logic
      
      * Fix Copy operation handling for scalar and multi-dimensional tensors
      
      - Add special handling for scalar tensor copy operations
      - Enhance error reporting in MakeIndices method with more detailed diagnostic information
      - Improve SIMT loop generation to support zero-dimensional tensors
      - Add explicit check and handling for scalar tensor scenarios
      
      * Refactor Copy operation code formatting and improve readability
      
      - Improve code formatting in MakeIndices and MakeSIMTLoop methods
      - Add line breaks to enhance readability of complex ICHECK statements
      - Simplify code structure in scalar tensor handling
      - Remove unnecessary whitespace and improve code alignment
      454248c7
  7. 11 Mar, 2025 4 commits
    • penguin_wwy's avatar
    • Yu Cheng's avatar
      [Dev][Bugfix] Add RMS Normalization Kernels and Fix Reduce Bug (#188) · fe0de672
      Yu Cheng authored
      * [Dev][Bugfix] Add RMS Normalization Kernels and Fix Reduce Bug
      
      - Implement two RMS normalization implementations in TileLang:
        * `rms_norm_splitk`: Split-K reduction approach for large matrices
        * `rms_norm`: Full reduction kernel with simplified implementation
      - Add reference implementation using PyTorch for validation
      - Include performance benchmarking for both kernel variants
      - Demonstrate flexible block size and matrix size configurations
      
      * [Examples] Simplify RMS Normalization Kernel Compilation
      
      - Remove commented-out code for split-K RMS normalization
      - Simplify kernel compilation by removing explicit TMA lowering configuration
      - Update copyright header to Tile-AI Corporation
      - Streamline main script for RMS normalization example
      fe0de672
    • Lei Wang's avatar
      [Bugfix] Add dynamic shape support with out_idx in Cython JIT kernel compilation (#185) · d34601ab
      Lei Wang authored
      * Optimize CMake build process with dynamic job count calculation
      
      - Modify build_csrc function to use 90% of available CPU cores
      - Ensure at least one job is used during compilation
      - Improve build performance by dynamically adjusting parallel job count
      
      * Optimize build_csrc function with multiprocessing module
      
      - Replace os.cpu_count() with multiprocessing.cpu_count()
      - Maintain existing 90% CPU utilization logic
      - Improve CPU core count calculation for build process
      
      * Add dynamic shape support with out_idx in Cython JIT kernel compilation
      
      - Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
      - Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
      - Add support for resolving dynamic shape dimensions using input tensor references
      - Enhance flexibility of JIT kernel compilation with symbolic shape handling
      
      * Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel
      
      - Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
      - Improve debugging by providing context about missing symbolic dimensions
      - Maintain existing dynamic shape resolution logic
      d34601ab
    • Lei Wang's avatar
      [Enhancement] Optimize CMake build process with dynamic job count calculation (#183) · c2192780
      Lei Wang authored
      * Optimize CMake build process with dynamic job count calculation
      
      - Modify build_csrc function to use 90% of available CPU cores
      - Ensure at least one job is used during compilation
      - Improve build performance by dynamically adjusting parallel job count
      
      * Optimize build_csrc function with multiprocessing module
      
      - Replace os.cpu_count() with multiprocessing.cpu_count()
      - Maintain existing 90% CPU utilization logic
      - Improve CPU core count calculation for build process
      c2192780
  8. 10 Mar, 2025 3 commits
    • Lei Wang's avatar
      [Examples] Implement NSA Backward kernels (#180) · 6891d3ec
      Lei Wang authored
      
      * Update native sparse attention example with scale parameter handling
      
      - Add scale parameter processing in native_sparse_attention function
      - Modify example script to include custom scale value
      - Update function calls to pass scale parameter
      - Enhance flexibility of sparse attention implementation
      
      * Refactor Triton Native Sparse Attention Example
      
      - Improve code formatting and readability in example_triton_nsa_bwd.py
      - Standardize function and parameter alignment
      - Remove unnecessary whitespaces and optimize imports
      - Enhance code style consistency with previous commits
      6891d3ec
    • Lei Wang's avatar
      [Bugfix] Improve Thread Variable Handling in Layout Inference (#179) · c39e540a
      Lei Wang authored
      * [Refactor] Improve Thread Variable Handling in Layout Inference
      
      - Update layout inference to handle thread variables more robustly
      - Add explicit size check between infer_list_ and thread_var_vec_
      - Modify thread variable access to use per-iteration thread variable
      - Simplify thread predicate retrieval logic
      - Add minor code cleanup and return variable assignment
      
      * [Refactor] Update Layout Inference Copyright and Simplify Return Logic
      
      - Replace Apache License header with Microsoft Corporation copyright notice
      - Simplify LayoutInference function by directly returning substituted function
      - Remove unnecessary variable assignment in return statement
      
      * [Refactor] Update Layout Inference Copyright to Tile-AI Corporation
      
      - Change copyright notice from Microsoft Corporation to Tile-AI Corporation
      - Maintain existing file structure and licensing header
      c39e540a
    • Lei Wang's avatar
      [Refactor] Enhance GPU Kernel Launch with Environment Thread Creation (#178) · 8ccf6ea2
      Lei Wang authored
      - Introduce `CreateEnvThread` function to generate environment threads for GPU kernel launches
      - Modify `KernelLaunch` to use `CreateEnvThread` for block and thread indices
      - Improve thread variable naming with shorter, more descriptive identifiers (bx, by, bz, tx, ty, tz)
      - Ensure proper thread environment setup within PrimFunc context
      8ccf6ea2
  9. 09 Mar, 2025 4 commits
    • Lei Wang's avatar
      [Feat] Introduce new caching mechanism for compiled kernels (#176) · 7bde63d5
      Lei Wang authored
      * Add kernel caching mechanism to TileLang
      
      - Implement a new `cached` function in `tilelang/cache/__init__.py` to cache and reuse compiled kernels
      - Expose the `cached` function in the main `tilelang/__init__.py`
      - Add a test case for cached matrix multiplication in `testing/python/cache/test_tilelang_cache_matmul.py`
      - Provide a `clear_cache()` function to reset the kernel cache when needed
      
      * Refactor kernel caching test and implementation
      
      - Simplify the `cached` function in `tilelang/cache/__init__.py`
      - Update test script `test_tilelang_cache_matmul.py` to use `tilelang.testing.main()`
      - Remove unnecessary whitespace and improve code formatting
      
      * Update import for `cached` function in MHA examples
      
      - Modify import statement in `example_mha_bwd.py` and `test_tilelang_kernel_mha_bwd.py`
      - Change import from `tilelang.profiler import cached` to `tilelang import cached`
      - Align with recent refactoring of kernel caching mechanism
      
      * Refactor `cached` function signature in kernel caching
      
      - Update function signature to use keyword-only arguments for `target` and `target_host`
      - Improve parameter order and readability of the `cached` decorator
      - Maintain existing functionality while enhancing function definition
      7bde63d5
    • Lei Wang's avatar
      [Feat] Append Pass Context and TMA lowering configuration option (#175) · fb6b101c
      Lei Wang authored
      * Add TMA lowering configuration option and update copyright notices
      
      This commit introduces a new configuration option to disable TMA (Tensor Memory Access) lowering and updates copyright notices across multiple files. Key changes include:
      
      - Add `kDisableTMALower` configuration option in builtin.h and builtin.cc
      - Update copyright notices from Microsoft Corporation to Tile-AI Corporation
      - Modify `LowerArgs` struct to include `disable_tma_lower` flag
      - Update JIT compilation interfaces to support pass configuration
      - Enhance error reporting in bulk copy lowering
      - Propagate pass configuration through various adapter layers
      
      * lint fix
      fb6b101c
    • Lei Wang's avatar
      [AutoTune] Enable config-performance trace (#174) · e6f77253
      Lei Wang authored
      * Improve Autotuner and CUDA Compatibility for Tensor Core Policies
      
      - Enhance autotuner with robust parallel compilation and error handling
      - Add logging for better debugging during configuration compilation
      - Support SM90 compute capabilities in TensorCore and matmul analysis policies
      - Improve future handling and result tracking in autotuner
      - Add more flexible SM version checks for pipeline and async copy stages
      
      * Refactor Autotuner Parallel Compilation with Improved Error Handling
      
      - Enhance tqdm progress bar formatting for concurrent configuration compilation
      - Simplify exception handling in parallel compilation process
      - Remove unnecessary logging and improve code readability
      - Optimize thread pool shutdown and result processing
      e6f77253
    • Lei Wang's avatar
      [Bugfix] Implement boundary check for the buffer shape with dynamic symbolic (#173) · 8344af52
      Lei Wang authored
      * [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation
      
      - Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py
      - Modify roller hints generation using new TileLang Carver template and utility functions
      - Update get_roller_hints_from_func to handle None cases and improve return logic
      - Adjust DefaultPolicy to handle different codegen dictionary formats
      
      * [Refactor] Update Thread Binding and Import Statements in TileLang Kernels
      
      - Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files
      - Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples
      - Move map_torch_type utility function to tilelang.utils.tensor
      - Remove unnecessary imports and improve code organization
      
      * Refactor Native Sparse Attention Example with Enhanced Triton Kernel
      
      - Update parallel_nsa_fwd_kernel to support more flexible sparse attention computation
      - Add support for block counts and offsets in the Triton kernel
      - Modify kernel grid and computation logic for improved performance
      - Update example script to use naive_nsa_simple reference implementation
      - Improve type hints and kernel configuration
      
      * Add Native Sparse Attention Examples with Tilelang and Triton Implementations
      
      - Introduce new example scripts for native sparse attention:
        * example_tilelang_nsa_fwd.py: Forward pass implementation using TileLang
        * example_tilelang_nsa_decode.py: Decoding-specific sparse attention implementation
        * example_triton_nsa_fwd.py: Triton-based sparse attention forward pass
      - Update reference.py with naive implementations for sparse attention
      - Support different sparse attention scenarios including forward pass and inference
      - Add comprehensive testing and validation against reference implementations
      
      * lint fix
      
      * Add Variable-Length Native Sparse Attention Examples for TileLang and Triton
      
      - Introduce new example scripts for variable-length native sparse attention:
        * example_tilelang_nsa_fwd_varlen.py: TileLang implementation with variable sequence lengths
        * example_triton_nsa_fwd_varlen.py: Triton implementation with variable sequence lengths
      - Update reference.py to support variable-length sparse attention scenarios
      - Enhance existing sparse attention implementations to handle variable-length inputs
      - Add comprehensive testing and validation for variable-length sparse attention
      
      * Refactor Native Sparse Attention Examples: Code Style and Formatting Improvements
      
      - Standardize function and parameter formatting across NSA example files
      - Improve code readability by adjusting indentation and line breaks
      - Enhance type hints and parameter alignment
      - Remove unnecessary whitespaces and optimize imports
      - Maintain consistent code style across TileLang and Triton implementations
      
      * Add debug logging and extend execution backend in JIT and loop vectorization
      
      - Add detailed logging in loop vectorization to help diagnose buffer shape handling
      - Extend JIT execution backend to include 'cython' option
      - Improve boundary condition checks in BufferLoadNode visit method
      
      * Remove debug logging in loop vectorization BufferLoadNode visit method
      
      - Remove unnecessary INFO log statements in VisitExpr_ method
      - Simplify code by eliminating redundant logging
      - Maintain core logic for handling buffer load node visits
      8344af52
  10. 07 Mar, 2025 5 commits
    • Lei Wang's avatar
      [Example] Implement tilelang native sparse attention varlen example (#170) · 8e1845d2
      Lei Wang authored
      * [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation
      
      - Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py
      - Modify roller hints generation using new TileLang Carver template and utility functions
      - Update get_roller_hints_from_func to handle None cases and improve return logic
      - Adjust DefaultPolicy to handle different codegen dictionary formats
      
      * [Refactor] Update Thread Binding and Import Statements in TileLang Kernels
      
      - Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files
      - Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples
      - Move map_torch_type utility function to tilelang.utils.tensor
      - Remove unnecessary imports and improve code organization
      
      * Refactor Native Sparse Attention Example with Enhanced Triton Kernel
      
      - Update parallel_nsa_fwd_kernel to support more flexible sparse attention computation
      - Add support for block counts and offsets in the Triton kernel
      - Modify kernel grid and computation logic for improved performance
      - Update example script to use naive_nsa_simple reference implementation
      - Improve type hints and kernel configuration
      
      * Add Native Sparse Attention Examples with Tilelang and Triton Implementations
      
      - Introduce new example scripts for native sparse attention:
        * example_tilelang_nsa_fwd.py: Forward pass implementation using TileLang
        * example_tilelang_nsa_decode.py: Decoding-specific sparse attention implementation
        * example_triton_nsa_fwd.py: Triton-based sparse attention forward pass
      - Update reference.py with naive implementations for sparse attention
      - Support different sparse attention scenarios including forward pass and inference
      - Add comprehensive testing and validation against reference implementations
      
      * lint fix
      
      * Add Variable-Length Native Sparse Attention Examples for TileLang and Triton
      
      - Introduce new example scripts for variable-length native sparse attention:
        * example_tilelang_nsa_fwd_varlen.py: TileLang implementation with variable sequence lengths
        * example_triton_nsa_fwd_varlen.py: Triton implementation with variable sequence lengths
      - Update reference.py to support variable-length sparse attention scenarios
      - Enhance existing sparse attention implementations to handle variable-length inputs
      - Add comprehensive testing and validation for variable-length sparse attention
      
      * Refactor Native Sparse Attention Examples: Code Style and Formatting Improvements
      
      - Standardize function and parameter formatting across NSA example files
      - Improve code readability by adjusting indentation and line breaks
      - Enhance type hints and parameter alignment
      - Remove unnecessary whitespaces and optimize imports
      - Maintain consistent code style across TileLang and Triton implementations
      8e1845d2
    • You Jiacheng's avatar
      [Dev] Use SS-GEMM for PV in mla (#165) · 166a9585
      You Jiacheng authored
      It's slightly faster than T.copy then RS-GEMM, and simpler.
      166a9585
    • Lei Wang's avatar
    • Lei Wang's avatar
      [Example] Implement NSA Decode tilelang exampls (#168) · 69f35439
      Lei Wang authored
      * [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation
      
      - Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py
      - Modify roller hints generation using new TileLang Carver template and utility functions
      - Update get_roller_hints_from_func to handle None cases and improve return logic
      - Adjust DefaultPolicy to handle different codegen dictionary formats
      
      * [Refactor] Update Thread Binding and Import Statements in TileLang Kernels
      
      - Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files
      - Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples
      - Move map_torch_type utility function to tilelang.utils.tensor
      - Remove unnecessary imports and improve code organization
      
      * Refactor Native Sparse Attention Example with Enhanced Triton Kernel
      
      - Update parallel_nsa_fwd_kernel to support more flexible sparse attention computation
      - Add support for block counts and offsets in the Triton kernel
      - Modify kernel grid and computation logic for improved performance
      - Update example script to use naive_nsa_simple reference implementation
      - Improve type hints and kernel configuration
      
      * Add Native Sparse Attention Examples with Tilelang and Triton Implementations
      
      - Introduce new example scripts for native sparse attention:
        * example_tilelang_nsa_fwd.py: Forward pass implementation using TileLang
        * example_tilelang_nsa_decode.py: Decoding-specific sparse attention implementation
        * example_triton_nsa_fwd.py: Triton-based sparse attention forward pass
      - Update reference.py with naive implementations for sparse attention
      - Support different sparse attention scenarios including forward pass and inference
      - Add comprehensive testing and validation against reference implementations
      
      * lint fix
      69f35439
    • Lei Wang's avatar
      [Bugfix] Cast bool dtype into int8 in blocksparse examples (#167) · b6c48453
      Lei Wang authored
      * [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation
      
      - Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py
      - Modify roller hints generation using new TileLang Carver template and utility functions
      - Update get_roller_hints_from_func to handle None cases and improve return logic
      - Adjust DefaultPolicy to handle different codegen dictionary formats
      
      * [Refactor] Update Thread Binding and Import Statements in TileLang Kernels
      
      - Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files
      - Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples
      - Move map_torch_type utility function to tilelang.utils.tensor
      - Remove unnecessary imports and improve code organization
      
      * Refactor Native Sparse Attention Example with Enhanced Triton Kernel
      
      - Update parallel_nsa_fwd_kernel to support more flexible sparse attention computation
      - Add support for block counts and offsets in the Triton kernel
      - Modify kernel grid and computation logic for improved performance
      - Update example script to use naive_nsa_simple reference implementation
      - Improve type hints and kernel configuration
      b6c48453