1. 05 Jun, 2025 1 commit
    • Gabriel Wu's avatar
      [Enhancement] Add nvrtc execution backend (#461) · 17f7394f
      Gabriel Wu authored
      
      
      * [wip] feat: add nvrtc backend
      
      * [wip] fix: handle out_idx
      
      * [wip] refactor: move lib logic to libgen
      
      * feat: cache for nvrtc backend
      
      * fmt: run format
      
      * fix: handle cuda bindings import error
      
      * fix: handle cuda bindings import error
      
      * fix: handle cuda bindings import error
      
      * fix: handle cuda bindings import error
      
      * fix: get kernel source
      
      * refactor: speedup pyimport
      
      * Improve error handling for missing cuda-python dependency in nvrtc backend. Raise ImportError with detailed installation instructions instead of logging a warning.
      
      * Enhance nvrtc backend error handling by introducing a flag to check for cuda-python availability. Raise ImportError with detailed installation instructions during initialization if the nvrtc backend is unavailable, improving user experience and clarity.
      
      * Update README.md to include recent NVRTC Backend addition, highlighting reduced compilation time for CUDA templates.
      
      * fix tl_templates
      
      * ensure CUDA context
      
      ---------
      Co-authored-by: default avatarLeiWang1999 <leiwang1999@outlook.com>
      17f7394f
  2. 26 May, 2025 2 commits
    • Lei Wang's avatar
      [Enhancement] Add atomicAdd for FLOAT16x2 and FLOAT16x4 (#522) · 46798f25
      Lei Wang authored
      * [Enhancement] Add atomic addition functions for FLOAT16x2 and FLOAT16x4 in CUDA
      
      * Introduced `AtomicAddx2` and `AtomicAddx4` functions for performing atomic addition operations on double-width float types in CUDA.
      * Updated `customize.py` to include the new `atomic_addx4` function for external calls.
      * Modified `__init__.py` to export the new atomic addition function, ensuring accessibility in the module.
      
      * lint fix
      46798f25
    • Lei Wang's avatar
      [Refactor] Replace default fp8 dtype with cute to perform fast cast (#520) · 6addc509
      Lei Wang authored
      * [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516)
      
      * Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions.
      * Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity.
      * Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process.
      * Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability.
      * Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies.
      
      * [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature
      
      * Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions.
      * Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise.
      * Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios.
      
      * Add dynamic matrix multiplication example and corresponding test
      
      * Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
      * Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example.
      * The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness.
      
      * lint fix
      
      * Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device
      
      * Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device.
      * Updated the `__init__.py` file to include the new function in the module exports.
      
      * lint fix
      
      * Add global barrier state and expectation handling in CUDA code generation
      
      * Introduced `vid_global_barrier_state_` and `vid_global_barrier_expect_` to manage global barrier synchronization in the CUDA code generator.
      * Updated `Finish` method to declare the global barrier state if needed.
      * Implemented handling for `EvaluateNode` to initialize the barrier expectation.
      * Removed unnecessary extern declaration for the global barrier state in `PrintStorageSync` method.
      * Enhanced CUDA FP8 type definitions for better alignment and structure.
      
      * Enhance CUDA FP8 type handling and debug printing
      
      * Updated `cuda_fp8.h` to replace NVidia's FP8 types with Cute's FP8 types for better compatibility and structure.
      * Added specializations for `debug_print_var` and `debug_print_buffer_value` functions to support the new FP8 types, improving debugging capabilities for these data types.
      * Updated `debug.h` to include the new `cuda_fp8.h` header for access to the FP8 type definitions.
      
      * Refactor CUDA code generation to remove unnecessary managed qualifier for global barrier state
      
      * Updated the `Finish` method in `codegen_cuda.cc` to declare the global barrier state without the `__managed__` qualifier, simplifying the declaration.
      * Added a new `sync_global` function in `builtin.py` to synchronize all threads in a block, enhancing synchronization capabilities in the TileLang framework.
      
      * Remove deprecated CUDA kernel and Python script for FP8 E4M3 casting
      
      * Deleted the `cast_to_fp8_e4m3_kernel` CUDA kernel implementation and its corresponding Python script, streamlining the codebase by removing unused components related to FP8 E4M3 type casting.
      * This cleanup enhances maintainability and reduces potential confusion regarding obsolete code.
      
      * lint fix
      6addc509
  3. 25 May, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Support auto synchronization for global memory access (#519) · 623edf4c
      Lei Wang authored
      * [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516)
      
      * Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions.
      * Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity.
      * Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process.
      * Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability.
      * Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies.
      
      * [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature
      
      * Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions.
      * Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise.
      * Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios.
      
      * Add dynamic matrix multiplication example and corresponding test
      
      * Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
      * Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example.
      * The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness.
      
      * lint fix
      
      * Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device
      
      * Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device.
      * Updated the `__init__.py` file to include the new function in the module exports.
      
      * lint fix
      
      * Add global barrier state and expectation handling in CUDA code generation
      
      * Introduced `vid_global_barrier_state_` and `vid_global_barrier_expect_` to manage global barrier synchronization in the CUDA code generator.
      * Updated `Finish` method to declare the global barrier state if needed.
      * Implemented handling for `EvaluateNode` to initialize the barrier expectation.
      * Removed unnecessary extern declaration for the global barrier state in `PrintStorageSync` method.
      * Enhanced CUDA FP8 type definitions for better alignment and structure.
      623edf4c
  4. 22 May, 2025 1 commit
    • Lei Wang's avatar
      [Bugfix] Enhance smem copy selector for uncommon shape (#510) · dbe8689f
      Lei Wang authored
      * [Refactor] Enhance GEMM warp partitioning logic for improved performance and flexibility
      
      * Updated the warp partitioning logic in `Gemm::ComputeWarpPartition` to better handle various GEMM policies, including FullRow, FullCol, and Square.
      * Implemented checks to dynamically adjust warp allocation based on matrix dimensions, ensuring optimal performance.
      * Introduced a new `SelectCopy` template to streamline memory access patterns in CUDA templates, enhancing compatibility across different architectures.
      * Refactored the Python `GemmWarpPolicy` class to align with the updated C++ logic, improving clarity and maintainability in warp allocation strategies.
      
      * [Refactor] Optimize matrix multiplication parameters and performance in quickstart example
      
      * Updated thread count in the kernel context from 256 to 128 to enhance performance.
      * Increased block sizes for matrix dimensions (M, N, block_M, block_N) to 1024 and 128 respectively, improving computational efficiency.
      * Adjusted the pipeline stages in the GEMM loop from 0 to 3 for better parallel execution.
      * Cleaned up comments for clarity and corrected a typo in the memory copy comment.
      
      * [Refactor] Simplify Copy type selection in OperandTraits for improved clarity
      
      * Replaced the conditional Copy type definition with a new SelectCopy template in OperandTraits, enhancing readability and maintainability of the code.
      * This change streamlines the logic for selecting memory copy patterns based on matrix dimensions and warp configurations.
      dbe8689f
  5. 17 May, 2025 3 commits
    • Lei Wang's avatar
      [Refactor] Update GEMM layout and operand traits for improved CUDA compatibility (#500) · 33937683
      Lei Wang authored
      * [Enhancement] Improve GEMM layout function and documentation
      
      * Added detailed documentation for the makeGemmABLayout function, explaining parameters and layout selection strategies.
      * Updated the layout selection logic to use mat_continuous consistently, enhancing clarity and correctness in memory layout calculations.
      * Adjusted the InferLayout method to reflect changes in the layout function, ensuring accurate matrix dimension handling for transposed cases.
      
      * lint fix
      
      * [Refactor] Update GEMM layout and operand traits for improved CUDA compatibility
      
      * Adjusted the InferLayout method in gemm.cc to include trans_A in fragment creation, enhancing layout inference for transposed matrices.
      * Updated OperandTraits in gemm_sm89.h and gemm_sm90.h to change the Copy type from SM75_U16x4_LDSM_N to SM75_U16x4_LDSM_T, optimizing memory access patterns for different warp configurations.
      * Enhanced static assertions in gemm_sm90.h to clarify requirements for num_warp_m, ensuring compatibility with Hopper architecture.
      
      * [Refactor] Clean up formatting in GEMM implementation and CUDA templates
      
      * Simplified the formatting of the fragment creation in the InferLayout method of gemm.cc for better readability.
      * Adjusted the static assertion message in gemm_sm90.h to enhance clarity regarding the num_warp_m requirement for Hopper architecture.
      33937683
    • Lei Wang's avatar
      [Bugfix] Rename SM75_U16x8_LDSM_N into SM75_U16x8_LDSM_T for correctness (#499) · 2837878f
      Lei Wang authored
      * Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.
      
      * Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.
      
      * Add merge shared memory allocations pass and related configurations
      
      - Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
      - Registered configuration options for debugging and controlling the merging behavior.
      - Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
      - Adjusted import paths and added documentation for the new functionality.
      
      * Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py
      
      * Update Copy type in OperandTraits for GEMM templates to use conditional selection based on num_warp_n. This change enhances memory access patterns for different configurations in CUDA kernels.
      
      * lint fix
      
      * Update Copy type in OperandTraits for GEMM templates to use SM75_U16x4_LDSM_T and SM75_U16x8_LDSM_T for improved memory access patterns across CUDA architectures.
      2837878f
    • Lei Wang's avatar
      [Enhancement] Fallback transposed_ldmatrix into `SM75_U16x4_LDSM_N` when warp_n is 8 (#498) · 68a3c4f3
      Lei Wang authored
      * Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.
      
      * Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.
      
      * Add merge shared memory allocations pass and related configurations
      
      - Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
      - Registered configuration options for debugging and controlling the merging behavior.
      - Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
      - Adjusted import paths and added documentation for the new functionality.
      
      * Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py
      
      * Update Copy type in OperandTraits for GEMM templates to use conditional selection based on num_warp_n. This change enhances memory access patterns for different configurations in CUDA kernels.
      
      * lint fix
      68a3c4f3
  6. 09 May, 2025 1 commit
    • Lei Wang's avatar
      [Feature] Implement fast integer power operation and related API (#466) · 1f5eb492
      Lei Wang authored
      * [Refactor] Enhance TMA barrier validation and support for additional architectures (#463)
      
      * Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`.
      * Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic.
      
      * [Feature] Implement fast integer power operation and related API
      
      * Added a new math operation `tl.power_of_int` in `math.cc` for efficient integer exponentiation.
      * Introduced a corresponding Python API `pow_of_int` in `tir/op.py` to facilitate usage in TileLang.
      * Enhanced `common.h` with a template function for integer power calculations.
      * Updated documentation to reflect the new functionality and usage examples.
      1f5eb492
  7. 06 May, 2025 1 commit
    • Lei Wang's avatar
      [Feature] Add TILELANG_CHECK_LAST_ERROR macro for improved error handling in CUDA and HIP (#450) · 0a8c8b99
      Lei Wang authored
      * [Feature] Add TILELANG_CHECK_LAST_ERROR macro for improved error handling in CUDA and HIP
      
      * Introduced TILELANG_CHECK_LAST_ERROR macro to streamline error checking for kernel launches in both CUDA and HIP.
      * Updated kernel launch code in wrapper.py to utilize the new macro, enhancing readability and maintainability.
      * This change improves error reporting by providing detailed messages when kernel execution fails.
      
      * [Refactor] Standardize error message formatting in TILELANG_CHECK_LAST_ERROR macro
      
      * Updated the TILELANG_CHECK_LAST_ERROR macro in both CUDA and HIP implementations to ensure consistent formatting of error messages.
      * Enhanced readability by aligning the error message structure across different platforms, improving maintainability of error handling code.
      0a8c8b99
  8. 29 Apr, 2025 1 commit
    • Lei Wang's avatar
      [Bugfix] Fix layout inference for free fragment buffer (#443) · 2ea45ae9
      Lei Wang authored
      * [Enhancement] Improve layout inference accuracy in ParallelOp (#441)
      
      * Added logic to use non-replicated buffers as source buffers for more accurate layout inference.
      * Enhanced comments to clarify the rationale behind buffer selection in layout inference process.
      
      * [Enhancement] Add error handling macros and refactor loop partitioning logic
      
      * Introduced TILELANG_CHECK macro for improved error handling in CUDA and HIP code, providing detailed error messages for kernel launches.
      * Enhanced loop partitioning logic to handle fragment buffers more effectively, ensuring correct replication based on thread extent.
      * Added logging for thread range in PlanLoopPartition to aid in debugging and performance analysis.
      * Updated pass configuration management to streamline vectorization control in the optimization process.
      
      * lint fix
      
      * remove debug print
      2ea45ae9
  9. 25 Apr, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Support cute mma tile mxn8ky (#434) · d1c15bc5
      Lei Wang authored
      * [Enhancement] Improve error handling in layout inference and update profiler type in tests
      
      * Added a detailed error message in the layout inference for local.fragment to clarify the requirement for trans_B.
      * Updated the profiler type in the cumulative sum test from TensorSupplyType.One to TensorDistributionType.Randn for better profiling accuracy.
      
      * lint fix
      
      * [Refactor] Update OperandTraits to include num_warp_n parameter
      
      * Modified OperandTraits templates across gemm_sm80.h, gemm_sm89.h, and gemm_sm90.h to include an additional num_warp_n parameter for improved flexibility in layout and copy operations.
      * Adjusted Copy type selection based on the new parameter to enhance performance and adaptability in various scenarios.
      
      * lint fix
      
      * [Refactor] Update DispatchInstruction templates to include N parameter
      
      * Modified DispatchInstruction templates in gemm_sm80.h, gemm_sm89.h, and gemm_sm90.h to include an additional N parameter, enhancing flexibility in tile size calculations.
      * Adjusted MMA_Group definitions to use std::min for improved handling of warp sizes, ensuring better performance and adaptability in various scenarios.
      d1c15bc5
  10. 22 Apr, 2025 1 commit
    • Lei Wang's avatar
      [Language] Support tile operator `T.cumsum` (#423) · 88747fcd
      Lei Wang authored
      * [Feature] Implement CumSum operation in TileLang
      
      * Added CumSumOp class for cumulative sum operations, including argument validation and lowering logic.
      * Introduced CumSum2D template for CUDA, supporting both forward and reverse cumulative sums.
      * Created tests for CumSum functionality in shared memory and fragment contexts.
      * Updated language interface to include cumsum operation, enhancing the reduction capabilities of TileLang.
      * Refactored reduce.py to support cumsum functionality with appropriate memory allocation and copying mechanisms.
      
      * lint fix
      88747fcd
  11. 11 Apr, 2025 1 commit
    • Lei Wang's avatar
      [Language] Introduce `T.any_of` and `T.all_of` to reduce a bool arrary (#371) · c4638d65
      Lei Wang authored
      
      
      * [Enhancement] Introduce logical operations `any_of` and `all_of` for buffer checks
      
      - Added new logical operations `any_of` and `all_of` to the TileLang language interface, allowing users to check conditions across buffer elements.
      - Implemented corresponding intrinsic calls for CUDA, enhancing the functionality of the TileLang framework.
      - Updated the `allocate.py` to handle boolean types correctly in shared memory allocations.
      - Introduced tests for the new logical operations to ensure correctness and performance.
      Co-authored-by: default avatarZhiwen Mo <zhiwen.mo25@ic.ac.uk>
      
      * lint fix
      
      ---------
      Co-authored-by: default avatarZhiwen Mo <zhiwen.mo25@ic.ac.uk>
      c4638d65
  12. 03 Apr, 2025 1 commit
  13. 30 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Add support for CUDA architecture 8.9 in GEMM template (#304) · edbb9b6d
      Lei Wang authored
      * [Enhancement] Add support for CUDA architecture 8.9 in GEMM template
      
      - Introduced conditional inclusion of "gemm_sm89.h" for CUDA architectures 8.9 and above, enhancing compatibility with newer hardware.
      - This change ensures that the GEMM template can leverage optimizations specific to the 8.9 architecture, improving performance for users with compatible GPUs.
      
      * lintfix
      
      * [Refactor] Clean up includes in gemm_sm89.h
      
      - Removed duplicate inclusion of "common.h" and added "cuda_fp8.h" for improved clarity and organization.
      - This change enhances the maintainability of the code by ensuring that header files are included only once and in a logical order.
      edbb9b6d
  14. 28 Mar, 2025 1 commit
  15. 27 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Bugfix] Enable bfloat16 atomic operations only for CUDA architectures greater than 7.5 (#291) · 83412458
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
      
      * [Refactor] Remove deprecated decorator and enhance Cython kernel handling
      
      - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
      - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
      - Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
      - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
      - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.
      
      * [Feature] Add matrix multiplication test and kernel implementation
      
      - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
      - Minor formatting improvements in `deprecated.py` for better readability.
      
      * lint fix
      
      * [Refactor] Update tensor creation in matrix multiplication test
      
      - Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
      - Updated imports in `__init__.py` to include `make_tensor`.
      - Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.
      
      * [Refactor] Update tensor definitions across multiple files
      
      - Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
      - Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
      - Improved documentation in README and example files to reflect changes in tensor usage.
      
      * lint fix
      
      * [Refactor] Update tensor types in attention and matrix multiplication examples
      
      - Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
      - Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
      - Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.
      
      * lint fix
      
      * [Refactor] Update tensor types in GEMM example and test files
      
      - Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
      - Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.
      
      * [Refactor] Update tensor usage in customize.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the file.
      
      * [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the test file.
      
      * [Refactor] Update tensor types to SharedBuffer and FragmentBuffer
      
      - Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
      - Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.
      
      * [Refactor] Introduce Tensor alias for Buffer in proxy.py
      
      - Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
      - This change enhances clarity and consistency in tensor usage across the codebase.
      
      * [Refactor] Revamp cache management and enhance documentation in env.py and proxy.py
      
      - Replaced global cache functions with a CacheState class to improve encapsulation and management of kernel caching.
      - Updated the `from_ptr` method in BufferProxy and BaseTensorProxy classes to include detailed docstrings for better clarity on parameters and return values.
      - Enhanced class docstrings across various proxy classes to provide clearer descriptions of their purpose and functionality, improving overall code documentation.
      
      * [Refactor] Update imports in __init__.py for tir compatibility
      
      - Added imports for `prim_func` and `tir.op` to enhance compatibility with the upstream tir script.
      - Marked imports with `# noqa: F401` to suppress linting warnings for unused imports, indicating future removal once compatibility is achieved.
      
      * lint fix
      
      * [Refactor] Update imports in tir.ir.py for improved compatibility
      
      - Removed unused import of `PrimExpr` from `tvm.script.ir_builder.tir` and replaced it with the correct import from `tvm.tir`.
      - Added import for `tir.ir` in `__init__.py` to enhance module accessibility and maintain compatibility with upstream changes.
      
      * [Refactor] Update function calls in tir.ir.py to return values
      
      - Modified the `serial`, `parallel`, `vectorized`, `unroll`, `thread_binding`, and `grid` functions to return the results of their respective calls to `_ir` methods, enhancing clarity and ensuring proper value propagation.
      
      * bugfix
      
      * [Enhancement] Add support for uint16 data type in TLCUDASourceWrapper
      
      - Introduced the "uint16" mapping to the type dictionary in the TLCUDASourceWrapper class, expanding the range of supported data types for CUDA operations.
      
      * bugfix
      
      * [Update] Sync subproject commit and modify CUDA atomic add functions
      
      - Updated the subproject commit for TVM to edd35139a0481e9359aa269e3e50450b95ba2f5a.
      - Commented out the CUDA capability check in the example convolution script to prevent execution errors.
      - Refactored atomic add functions for BFLOAT16 in common.h to include a conditional compilation directive for improved compatibility with CUDA architectures.
      83412458
  16. 20 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Phaseout LLVM Dependency by Making it Optional (#247) · f2e99180
      Lei Wang authored
      * remove llvm build
      
      * [Refactor] Update kernel compilation and profiling in examples
      
      - Replaced `tilelang.lower` with `tilelang.compile` in multiple example scripts to streamline kernel compilation.
      - Updated profiling calls to utilize the new `get_profiler` method, enhancing performance measurement consistency.
      - Adjusted assertions and benchmarking methods to align with the new profiling structure across various examples, ensuring correctness and clarity in performance evaluations.
      
      * lint fix
      
      * License Update
      
      * [Refactor] Improve code formatting and documentation in CUDA header and HIP runtime files
      
      - Adjusted formatting in `cuda.h` for better readability, including alignment of comments and struct fields.
      - Cleaned up whitespace and improved comment clarity in `rt_mod_hip.cc` to enhance code maintainability.
      
      * [Refactor] Enhance formatting and clarity in CUDA header and HIP runtime files
      
      - Improved comment alignment and readability in `cuda.h`.
      - Cleaned up whitespace and formatting in `rt_mod_hip.cc` to enhance maintainability.
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * fix
      
      * License update
      
      * [Enhancement] Update JITKernel to use artifact for kernel source
      
      - Assigned the generated artifact to `self.artifact` for better management.
      - Updated kernel source references to use `artifact.kernel_source` for consistency in execution backend handling.
      
      * lint fix
      
      * Add @tilelang.testing.requires_llvm decorator to vectorization tests
      
      * Enhance setup.py and env.py for library management
      
      - Added functionality to remove original files after copying in CMakeBuild.
      - Updated TVM_LIBRARY_PATH in env.py to include the PyPI build library path for better integration.
      
      * Refactor TVM_LIBRARY_PATH assignment for improved readability in env.py
      
      * Refactor CMakeBuild file handling in setup.py
      
      - Added a check to ensure the target library directory exists before copying .so files.
      - Improved the logic for creating the target directory and copying files to enhance robustness.
      
      * bugfix
      
      * Rename BuildTLDebug to BuildTileLangCUDAWithoutCompile and update registration. Add @tilelang.testing.requires_llvm decorator to multiple tests for LLVM requirement.
      
      * lint fix
      
      * Enhance TileLang code generation by adding support for device code generation without compilation. Updated `host_codegen` and `device_codegen` functions to include new transformations and registration for `tilelang_hip_without_compile`. Refactored JIT kernel adapters to accommodate host and device modules, improving overall integration and flexibility.
      
      * lint fix
      
      * Add support for C target in device code generation
      
      - Updated `device_codegen_without_compile` to include handling for the C target by registering the `tilelang_cpp` function.
      
      * [Enhancement] Implement auto-clear cache feature based on environment variable
      
      * Added TILELANG_CLEAR_CACHE environment variable to control cache clearing.
      * Updated CI workflow to set TILELANG_CLEAR_CACHE during testing.
      * Modified cache initialization to clear cache if TILELANG_CLEAR_CACHE is set to true.
      
      * [Refactor] Update kernel invocation and import paths in tests and cache
      
      * Changed kernel invocation in `test_tilelang_kernel_dequantize_gemm.py` to return the result.
      * Updated import statements in `test_tilelang_kernel_int4_gemm_mma.py` to use `bitblas` instead of `tilelang`.
      * Refactored paths for artifact and parameters in `kernel_cache.py` for better maintainability.
      
      * [Refactor] Clean up whitespace and improve code formatting in kernel_cache.py
      
      * Removed unnecessary blank lines and adjusted spacing for better readability in the KernelCache class.
      * Enhanced overall code formatting to align with project standards.
      
      * [Enhancement] Add bfloat16 test case and improve kernel caching logic
      
      * Introduced a new test case for bfloat16 matrix multiplication in `test_tilelang_kernel_gemm_mma_intrinsic.py`.
      * Updated `KernelCache` to handle multiple kernel source files and improve error handling during saving and loading.
      * Refactored `JITKernel` to support instantiation from a database, enhancing flexibility in kernel management.
      * Adjusted `CtypesKernelAdapter` and `CythonKernelAdapter` to utilize the new kernel loading mechanism from the database.
      * Improved code formatting and readability across several files.
      
      * lint fix
      
      * Update bfloat16 matrix multiplication test case to use larger dimensions for improved coverage
      f2e99180
  17. 19 Mar, 2025 1 commit
    • Yu Cheng's avatar
      [Enhancement] Add zero initialization option to GEMM operations (#246) · 701e9234
      Yu Cheng authored
      * [Enhancement] Add zero initialization option to GEMM operations
      
      - Introduced a new `zero_init` parameter to the GEMM function, allowing for optional zero initialization of the accumulator.
      - Updated the GEMM implementation across various CUDA architectures to support the new parameter.
      - Modified the Python interface for GEMM to include the `zero_init` argument, enhancing flexibility in kernel execution.
      - Ensured compatibility with existing functionality while improving initialization control for performance optimization.
      
      * rename zero_init to clear_accum
      
      * lint
      701e9234
  18. 17 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Bugfix] Disable force inline for ldmatrix (#227) · a1da26f2
      Lei Wang authored
      * Refactor GEMM and Bulk Copy operations to enhance layout handling and support for Hopper architecture
      
      - Update `ComputeWarpPartition` to include a new parameter for Hopper WGMMA support.
      - Modify layout checks in `LowerBulkCopy` to accommodate new GEMM layout types.
      - Enhance layout inference logic in `InferLayout` for better compatibility with Hopper architecture.
      - Include necessary header files for built-in operations and layout inference improvements.
      
      * Refactor parameter formatting in CUDA matrix load functions for consistency
      
      - Adjusted parameter alignment in `ptx_ldmatrix_x1`, `ptx_ldmatrix_x2`, `ptx_ldmatrix_x4`, and their transposed counterparts for improved readability.
      - Added a blank line in `get_tensor_supply` function in `tensor.py` to enhance code clarity.
      
      * Enhance tensor supply generation in `get_tensor_supply` function
      
      - Introduced handling for unsigned integer and float8 tensor types, allowing for specific random tensor generation based on data type.
      - Updated logic to return appropriate random tensors for different data types, improving flexibility and functionality of tensor supply generation.
      - Refactored existing conditions for clarity and maintainability.
      
      * Fix tensor supply generation logic in `get_tensor_supply` function
      
      - Updated the variable reference from `tensor` to `param` to ensure correct handling of tensor data types.
      - Improved the accuracy of unsigned integer and float8 checks for tensor supply generation, enhancing functionality and reliability.
      
      * Enhance tensor supply checks in `get_tensor_supply` function
      
      - Updated the logic for identifying unsigned integers and float8 types by using `removeprefix` on the dtype string, improving accuracy in tensor supply generation.
      - Ensured better handling of tensor data types for more reliable random tensor generation based on the updated checks.
      
      * Enhance KernelParam functionality and improve tensor supply checks
      
      - Added methods `is_unsigned` and `is_float8` to the `KernelParam` class for better type identification of parameters.
      - Updated the `get_tensor_supply` function to utilize the new methods, improving clarity and accuracy in tensor supply generation based on parameter types.
      a1da26f2
  19. 16 Mar, 2025 1 commit
  20. 14 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Allow mma fallback when wgmma is not supported (#206) · 45559a1f
      Lei Wang authored
      * Enhance error message for constant size stack allocation in CUDA codegen. Include the actual constant size and buffer variable name in the error output for better debugging.
      
      * Refactor GEMM and Bulk Copy operations to enhance layout handling and support for Hopper architecture
      
      - Update `ComputeWarpPartition` to include a new parameter for Hopper WGMMA support.
      - Modify layout checks in `LowerBulkCopy` to accommodate new GEMM layout types.
      - Enhance layout inference logic in `InferLayout` for better compatibility with Hopper architecture.
      - Include necessary header files for built-in operations and layout inference improvements.
      
      * lint fix
      
      * Remove unused builtin.h include directive
      
      * Update include path for builtin.h
      45559a1f
  21. 13 Mar, 2025 1 commit
    • zqh-wz's avatar
      [Feature] Upgrade cutlass version and support fp8 T.gemm (#202) · 2cccf1f5
      zqh-wz authored
      
      
      * upgrade cutlass to upstream v3.8.0
      
      * Implement fp8 gemm and add example script
      
      * Fix dtype retrieval with map_torch_type for fp8 inputs
      
      * Disable vectorization of fp8 values
      
      * Make MMA declaration compatible with cutlass 3.4.0+
      
      * Add test for fp8 T.gemm
      
      * fix indent
      
      * fix indent
      
      * Add copyright and license header
      
      * Add copyright and license header
      
      * lint fix
      
      * Refactor matmul_nt and assert_matmul_correctness functions for improved readability by consolidating parameter definitions and adjusting formatting.
      
      * clang format lint
      
      ---------
      Co-authored-by: default avatarLei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: default avatarLeiWang1999 <leiwang1999@outlook.com>
      2cccf1f5
  22. 12 Mar, 2025 1 commit
    • Yu Cheng's avatar
      [Feature] Add TMA Store Synchronization Support (#195) · eba7dd5a
      Yu Cheng authored
      - Introduce TMAStoreArrive and TMAStoreWait operations for CUDA TMA store synchronization
      - Add new builtin operations in op/builtin.cc and op/builtin.h
      - Implement TMAStoreSyncInjector to automatically inject TMA store synchronization calls
      - Update CUDA codegen to support new TMA store synchronization intrinsics
      - Add Python language bindings for new TMA store synchronization operations
      eba7dd5a
  23. 11 Mar, 2025 1 commit
    • Yu Cheng's avatar
      [Dev][Bugfix] Add RMS Normalization Kernels and Fix Reduce Bug (#188) · fe0de672
      Yu Cheng authored
      * [Dev][Bugfix] Add RMS Normalization Kernels and Fix Reduce Bug
      
      - Implement two RMS normalization implementations in TileLang:
        * `rms_norm_splitk`: Split-K reduction approach for large matrices
        * `rms_norm`: Full reduction kernel with simplified implementation
      - Add reference implementation using PyTorch for validation
      - Include performance benchmarking for both kernel variants
      - Demonstrate flexible block size and matrix size configurations
      
      * [Examples] Simplify RMS Normalization Kernel Compilation
      
      - Remove commented-out code for split-K RMS normalization
      - Simplify kernel compilation by removing explicit TMA lowering configuration
      - Update copyright header to Tile-AI Corporation
      - Streamline main script for RMS normalization example
      fe0de672
  24. 05 Mar, 2025 2 commits
    • Lei Wang's avatar
      [Enhancement] Support debug print for unsigned char datatype (#145) · bb60f6ce
      Lei Wang authored
      * Fix debug print buffer template for unsigned char type
      
      - Update debug_print_buffer_value template specialization for unsigned char
      - Modify test_tilelang_debug_print.py to include additional dtype tests
      - Add test case for uint8 dtype in debug print buffer function
      
      * Refactor debug print buffer template formatting for unsigned char
      
      - Improve code formatting for debug_print_buffer_value template specialization
      - Adjust line breaks and indentation for better readability
      - Maintain consistent code style with other template specializations
      bb60f6ce
    • Yu Cheng's avatar
      [Dev] Adjust computation logic to avoid precision loss when casting acc_s from... · e1d82bf3
      Yu Cheng authored
      [Dev] Adjust computation logic to avoid precision loss when casting acc_s from float to float16 (#141)
      
      - Remove redundant `acc_s_0` fragment in flash attention kernel
      - Simplify memory copy and reduction operations
      - Reorder memory copy and scaling steps for improved performance
      - Add Hopper-specific synchronization method in CUDA reduce template
      - Update reduce operation to use architecture-specific synchronization
      e1d82bf3
  25. 04 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Bugfix] Add missing definition for AtomicAdd (#138) · 3960d3d0
      Lei Wang authored
      * Change default log level from WARNING to INFO in TileLang initialization
      
      * Refactor Flash Attention Variable-Length MHA Example with Cython Backend Support
      
      - Update `example_mha_fwd_varlen.py` to use Cython backend for kernel compilation
      - Remove unused imports and simplify function signature
      - Modify `flashattn` function to handle max sequence length as a separate argument
      - Update kernel call to include max sequence length parameter
      - Improve code readability and remove commented-out code
      - Add print statement to confirm successful assertion
      
      * Refactor code formatting in TileLang lowering and example files
      
      - Improve line breaks and code formatting in `lower.py`, `wrapper.py`, and `tensor.py`
      - Simplify line breaks and reduce unnecessary whitespace
      - Enhance code readability by adjusting indentation and line breaks
      - Update example MHA forward pass script with cleaner tensor initialization
      
      * Update TileLang kernel test with import path changes for MMA layout and macro generator
      
      - Modify import statements in test_tilelang_kernel_dequantize_gemm.py
      - Replace bitblas imports with tilelang.intrinsics imports for MMA-related utilities
      - Update main function to use tilelang.testing.main()
      
      * Add Block Sparse Attention Examples for TileLang and Triton
      
      - Implement block sparse attention kernels for both TileLang and Triton
      - Add utility functions for generating sparse attention masks using top-k and threshold methods
      - Support causal and variable-length attention scenarios
      - Include test cases for different sequence length configurations
      - Demonstrate block-level sparse attention with configurable parameters
      
      * Refactor Block Sparse Attention Examples with Code Style Improvements
      
      - Improve code formatting in block_sparse_attn_tilelang.py and block_sparse_attn_triton.py
      - Enhance readability by adjusting line breaks and indentation
      - Simplify kernel and function calls with better formatting
      - Add whitespace and line break improvements for better code clarity
      
      * Enhance Layout Plotting with Multi-Replication and Dynamic Visualization
      
      - Update plot_layout function to support multiple replications in thread and value mapping
      - Improve thread and value mapping to handle replicated layouts
      - Dynamically adjust figure size and legend positioning
      - Add print statements for saved plot file paths
      - Modify example fragment_mma_load_a.py to uncomment and enable warp and block layout plotting
      
      * Refactor AtomicAdd functions in CUDA common header
      
      - Implement a generic template for AtomicAdd function
      - Specialize templates for half_t, bfloat16_t, and pointer types
      - Reorganize and clean up existing AtomicAdd implementations
      - Improve type handling and conversion in atomic operations
      
      * Remove unused import in MHA backward test file
      
      - Remove unnecessary argparse import from test_tilelang_kenrel_mha_bwd.py
      - Add blank line for improved code formatting
      - Minor code cleanup in test file
      3960d3d0
  26. 27 Feb, 2025 1 commit
    • Lei Wang's avatar
      [JIT] Enhance cython/ctypes wrapper for tma descriptor (#126) · 7b74bb01
      Lei Wang authored
      
      
      * refactor code
      
      * enhance tutorial
      
      * Enhance error handling and code generation in CUDA and TileLang components
      
      This commit introduces several improvements across multiple files:
      - Added more informative error messages in GEMM layout checks
      - Updated CUDA codegen to support more flexible function signature generation
      - Improved TMA descriptor initialization and kernel dispatch logic
      - Refined library generation and source code parsing utilities
      - Enhanced error handling in various adapter and wrapper classes
      
      * Add thread tag validation for warp specialization
      
      Introduce a ThreadTagChecker to validate that a PrimFunc only uses threadIdx.x before applying warp specialization. This prevents unintended transformations on kernels with complex thread binding and provides a clear warning to users about potential issues with warp specialization.
      
      * Update TileLang Profiling and Compilation in Flash Decoding Examples
      
      Refactor the profiling and compilation workflow in two flash decoding example scripts:
      - Replace `tilelang.lower()` and `tilelang.Profiler()` with `tilelang.compile()`
      - Simplify profiler initialization using `get_profiler()`
      - Update method calls to use the new profiler and compiled kernel objects
      - Maintain existing performance benchmarking and validation logic
      
      * Refactor and clean up code formatting in TileLang testing and adapter modules
      
      This commit includes several code style and formatting improvements:
      - Adjust whitespace and line breaks in test files
      - Improve code formatting in CUDA source wrapper and adapter utilities
      - Enhance readability of function calls and argument handling
      - Remove unnecessary whitespace and standardize indentation
      - Simplify function signatures and argument parsing
      
      * Refactor CUDA codegen and improve code formatting
      
      This commit includes several improvements to CUDA code generation and formatting:
      - Enhance function signature generation in CodeGenTileLangCUDA
      - Improve code formatting and readability in CUDA-related files
      - Simplify parameter handling and type annotations
      - Clean up whitespace and line breaks in codegen and layout files
      
      ---------
      Co-authored-by: default avatarUbuntu <dlisuser@h100testl730RPS.xu5snccwrbtejcqqalluoku5hb.xx.internal.cloudapp.net>
      7b74bb01
  27. 24 Feb, 2025 1 commit
    • Lei Wang's avatar
      [Dev] Support vectorized value pack and atomicAdd for BFloat16 DType (#116) · 62843b88
      Lei Wang authored
      * Add DeepSeek MLA decode example with Flash Attention implementation
      
      * Add GEMM SplitK and StreamK example implementations
      
      This commit introduces two new example scripts demonstrating advanced GEMM (matrix multiplication) techniques:
      - `example_tilelang_gemm_splitk.py`: Implements a Split-K GEMM kernel using TileLang
      - `example_tilelang_gemm_streamk.py`: Implements a Stream-K GEMM kernel using TileLang
      
      Both examples showcase different parallel computation strategies for matrix multiplication, with comprehensive testing using PyTorch reference implementations.
      
      * Refactor GEMM SplitK and StreamK example implementations
      
      Clean up and improve code formatting for the SplitK and StreamK GEMM example scripts:
      - Remove unused import (Profiler) in splitk example
      - Simplify line breaks and improve code readability
      - Standardize indentation and remove unnecessary whitespace
      - Optimize atomic add and copy operations for better clarity
      
      * Add block sparse attention benchmarks for multiple libraries
      
      This commit introduces comprehensive block sparse attention benchmarks for different libraries:
      - TileLang block sparse FMHA implementation
      - Triton block sparse FMHA implementation
      - PyTorch reference block sparse FMHA implementation
      - FlashAttention dense FMHA reference implementation
      
      The benchmarks include:
      - Configurable benchmark parameters (batch size, heads, sequence length, etc.)
      - Sparse mask generation using top-k and threshold methods
      - Performance measurement for different sparse attention configurations
      - Utility functions for mask generation and benchmarking
      
      * Refactor block sparse attention benchmarks with code style improvements
      
      - Add Ruff linter ignore comments to benchmark files
      - Improve code formatting and line breaks
      - Remove unused imports
      - Standardize print statement formatting
      - Enhance code readability across multiple library benchmarks
      
      * lint fix
      
      * Add CUDA atomic operations for BFLOAT16 and update function naming
      
      - Implement AtomicAdd functions for BFLOAT16 and BFLOAT16x2 in CUDA common header
      - Rename existing atomic add functions to use PascalCase (atomicAdd -> AtomicAdd)
      - Add a new __pack_nv_bfloat162 function for packing BFLOAT16 values
      - Update kernel and language customization to use new function names
      - Add return type annotations in profiler module
      
      * lint fix
      62843b88
  28. 22 Feb, 2025 1 commit
    • Lei Wang's avatar
      [Example] Implement simple block sparse kernel (#106) · c7462abf
      Lei Wang authored
      * Remove Torch CPP backend and update execution backend options
      
      - Remove TorchCPPKernelAdapter and related code from JIT modules
      - Update execution backend options in jit/__init__.py, kernel.py, and adapter/__init__.py
      - Remove "torch_cpp" from supported execution backend literals
      - Simplify backend validation and remove unused torch_cpp-related code
      。
      
      * lint fix
      
      * Add block sparse attention implementations for TileLang and Triton
      
      - Implement block sparse attention kernels for TileLang and Triton
      - Add example scripts for block sparse attention with top-k and threshold-based masking
      - Include utility functions for generating sparse attention masks
      - Demonstrate causal attention with block-level sparsity
      - Add test cases to validate sparse attention implementations against PyTorch reference
      c7462abf
  29. 09 Feb, 2025 1 commit
    • Lei Wang's avatar
      [Tools] Introduce `plot_layout` to visualize the fragment layout (#68) · f9b6a92e
      Lei Wang authored
      * [Enhancement] Add VectorizeLoop function and update imports for compatibility
      
      * [CI][Test] Improve test cases for vectorization and fix typos in parser comments
      
      * lint fix
      
      * Fix incorrect module reference for VectorizeLoop transformation
      
      * Refactor vectorize_loop transformation by removing unused extent mutation logic
      
      * [Enhancement] Add support for FP8 data types and global barriers in CUDA codegen
      
      * Fix formatting in CUDA FP8 header file for consistency
      
      * Refactor CI workflow to use 'tilelang_ci' virtual environment and update CUDA type printing for better clarity
      
      * Update submodule 'tvm' to latest commit for improved functionality
      
      * Refactor execution backend references from 'dl_pack' to 'dlpack' for consistency and clarity; add apply_simplify function to simplify PrimFunc or IRModule.
      
      * Refactor CUDA code for improved readability; clean up formatting and remove unnecessary whitespace in multiple files.
      
      * Refactor import statement in test_tilelang_kernel_dequantize_gemm.py to use 'tilelang.language' for consistency
      
      * Add CUDA requirements to FP8 test cases and update references for clarity
      
      * Add a blank line for improved readability in test_tilelang_kernel_fp8_gemm_mma.py
      
      * Fix data type in reference result calculation for consistency in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Add CUDA requirements and FP8 test cases for matmul and gemv simulations
      
      * Remove debug print statements and use tilelang's testing assertion for result validation in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Remove outdated comment regarding FP8 tests in test_tilelang_kernel_gemv_simt.py
      
      * Add BF16 support to matrix multiplication and introduce corresponding test cases
      
      * Add a blank line for improved readability in BF16 GEMM test
      
      * Update acknowledgements in README to include supervision by Zhi Yang at Peking University
      
      * enhance acknowledgement
      
      * Replace tutorial on memory layout optimization with new tutorial on writing high-performance kernels with thread primitives
      
      * Update subproject commit for TVM dependency
      
      * Update subproject commit for TVM dependency
      
      * Add int4_t type and functions for packing char values in CUDA common header
      
      * Add plot_layout example and implement GetForwardVars method in layout classes
      
      * Refactor code for improved readability by adjusting line breaks and formatting in layout and test files
      
      * Fix formatting by removing unnecessary line break in layout.h
      
      * Refactor make_int4 function for improved readability by adjusting parameter formatting
      f9b6a92e
  30. 06 Feb, 2025 1 commit
    • Lei Wang's avatar
      [Dev] Support FP8 Codegen for cuda backend (#64) · 61de5288
      Lei Wang authored
      * [Enhancement] Add VectorizeLoop function and update imports for compatibility
      
      * [CI][Test] Improve test cases for vectorization and fix typos in parser comments
      
      * lint fix
      
      * Fix incorrect module reference for VectorizeLoop transformation
      
      * Refactor vectorize_loop transformation by removing unused extent mutation logic
      
      * [Enhancement] Add support for FP8 data types and global barriers in CUDA codegen
      
      * Fix formatting in CUDA FP8 header file for consistency
      
      * Refactor CI workflow to use 'tilelang_ci' virtual environment and update CUDA type printing for better clarity
      
      * Update submodule 'tvm' to latest commit for improved functionality
      
      * Refactor execution backend references from 'dl_pack' to 'dlpack' for consistency and clarity; add apply_simplify function to simplify PrimFunc or IRModule.
      
      * Refactor CUDA code for improved readability; clean up formatting and remove unnecessary whitespace in multiple files.
      
      * Refactor import statement in test_tilelang_kernel_dequantize_gemm.py to use 'tilelang.language' for consistency
      
      * Add CUDA requirements to FP8 test cases and update references for clarity
      
      * Add a blank line for improved readability in test_tilelang_kernel_fp8_gemm_mma.py
      
      * Fix data type in reference result calculation for consistency in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Add CUDA requirements and FP8 test cases for matmul and gemv simulations
      
      * Remove debug print statements and use tilelang's testing assertion for result validation in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Remove outdated comment regarding FP8 tests in test_tilelang_kernel_gemv_simt.py
      61de5288
  31. 24 Jan, 2025 1 commit
    • Lei Wang's avatar
      [Debug] Introduce `T.print` for buffer and variables logging on frontend (#45) · 8cdc185b
      Lei Wang authored
      * [Doc] Update documentation structure and content: add overview section, revise project name, and change theme to Furo
      
      * [Feature] Add device-side debug printing functions and integrate into kernel interface
      
      * lint fix
      
      * remove debug print
      
      * implement test for debug
      
      * lint fix
      
      * add some comments
      
      * Enhance fragment design and assert fragment print
      
      * enhance debug print
      
      * add test for msg
      
      * lint fix
      8cdc185b
  32. 11 Jan, 2025 2 commits
    • Lei Wang's avatar
      [Lint] Overall Typo and Linting Fixes (#13) · fa511857
      Lei Wang authored
      * README.md fixed
      
      * update test ci
      
      * Lint and Typo Fix
      
      * Clang Format Lint Fix
      fa511857
    • Lei Wang's avatar
      [Initialization] Migration of Codebase from Dev Branch into Main (#10) · 57ab687c
      Lei Wang authored
      
      
      * Add format.sh script for code formatting and linting
      
      * docs update
      
      * center align the title
      
      * lint fix
      
      * add ignore
      
      * Add .gitignore for 3rdparty directory
      
      * Add requirements-dev.txt, requirements-test.txt, and requirements.txt
      
      * 3rdparty
      
      * Add gemm.h, CMakeLists.txt, _ffi_api.py, __init__.py, runtime.h, reduce.h, loop_partition.h, utils.h, and loop_vectorize.h
      
      * Refactor CMakeLists.txt and include statements
      
      - Update CMakeLists.txt to use a newer version of CMake and add project name
      - Remove unnecessary include directories
      
      Fix include paths in layout.cc, codegen.cc, codegen.h, rt_mod.cc, frontend_legalize.cc, inject_pipeline.cc, layout_inference.cc, loop_vectorize.cc, and lower_tile_op.cc
      
      - Update include paths to use relative paths instead of absolute paths
      
      * Update submodule for 3rdparty/tvm
      
      * update
      
      * load dll first
      
      * Refactor CMakeLists.txt and include statements
      
      * Refactor CMakeLists.txt and include statements
      
      * git keep update
      
      * Refactor CMakeLists.txt and include statements
      
      * Refactor CMakeLists.txt and include statements
      
      * refactor code structure
      
      * Update Readme
      
      * CMakeLists Customized
      
      * update readme
      
      * update README
      
      * update readme
      
      * update usage
      
      * with TVM_IMPORT_PYTHON_PATH to handle own tvm build python import
      
      * annotate lower transform global func with `transform` prefix
      
      * Migrate Simplify Pass from tilelang tvm branch
      
      * enhance system environment handling with __init__ and CMake
      
      * Initial commit
      
      * CODE_OF_CONDUCT.md committed
      
      * LICENSE committed
      
      * README.md committed
      
      * SECURITY.md committed
      
      * SUPPORT.md committed
      
      * CODE_OF_CONDUCT Commit
      
      * LICENSE Commit
      
      * SECURITY Commit
      
      * SUPPORT Commit
      
      * Modify Support
      
      * Update README.md
      
      * security ci update
      
      * remove examples
      
      * Update and implement clang-format
      
      * add composable kernel components
      
      * Migrate from latest update
      
      * submodule update
      
      * Test update
      
      * Update License
      
      * Spell check
      
      * lint fix
      
      * add clang-tidy to apply static analysis for c source
      
      * update tilelang examples
      
      * Update Install Docs
      
      * Refactor filetree
      
      * Enhance Install
      
      * conflict resloved
      
      * annotate_version
      
      * Initial Update
      
      * test fix
      
      * install
      
      * Implement setup.py
      
      * lint fix
      
      * Separate Init
      
      * Separate test
      
      * docker file commit
      
      * add logo
      
      * Update Readme and Examples
      
      * update readme
      
      * update logo
      
      * Implement AMD Installation
      
      * Add License
      
      * Update AMD MI300x Benchmark
      
      * update README
      
      * update mi300 benchmark scripts
      
      * update ignore
      
      * enhance build scirpt
      
      * update image
      
      * enhance setup.py to remove duplicated libraries
      
      * remove debug files
      
      * update readme
      
      * update image
      
      * update gemm examples
      
      * update flashattention README
      
      * readme update
      
      * add cmake into requirements
      
      * libinfo fix
      
      * auto update submodule
      
      * lint fix
      
      * Fix AMD Build and Test
      
      * Update check for transpose attribute for CDNA Arch
      
      * typo fix for amd
      
      * Implement Matmul Benchmark
      
      * Refactor Code
      
      * [TypoFix] Fix GEMM Example
      
      * [Docs] Init Linear Attention README
      
      * [TYPO] Typo fix
      
      * [Lint] Lint Fix
      
      * enhance example with intrinsics
      
      * [Enhancement] Improve Buffer Collection during IR Parser
      
      * [Dev] Introduce Current classmethod to get current frame
      
      * submodule update
      
      * fake test pass update
      
      * support thread_extent_api
      
      * code optimize
      
      * Add GEMM function implementation for matrix multiplication
      
      * Update logging format to reflect TileLang in logger messages
      
      * Refactor CMakeLists.txt for improved readability and set default build type to Release
      
      * Support Gemm SS Primitives Implementation
      
      * [README] Upload Tile Language Logo (#5)
      
      * update logo
      
      * Update README.md to enhance formatting and center the title
      
      ---------
      Co-authored-by: default avatarmicrosoft-github-operations[bot] <55726097+microsoft-github-operations[bot]@users.noreply.github.com>
      Co-authored-by: default avatarMicrosoft Open Source <microsoftopensource@users.noreply.github.com>
      Co-authored-by: default avatarYu Cheng <yu.cheng@pku.edu.cn>
      57ab687c