  1. 24 Sep, 2025 1 commit
  2. 23 Sep, 2025 6 commits
  3. 22 Sep, 2025 4 commits
    • [AMD][MLA] Fix mla autotune for rocm (#861) · 3b21a67d
      Lei Wang authored
      * Refactor matmul example to include ReLU activation and update batch size in benchmark script
      
      * lint fix
      
      * Enhance autotuning capabilities in benchmark script and update argument defaults
      
      - Introduced a new `get_configs` function to generate autotuning configurations for the benchmark.
      - Updated the default batch size and kv context length in the argument parser for improved performance.
      - Renamed the `--auto_tune` argument to `--autotune` for consistency.
      - Modified the kernel invocation logic to support autotuning based on the new configurations.
      
      * lint fix
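A configuration generator like the `get_configs` function mentioned above is commonly written as a Cartesian product over candidate values. The following is only a minimal sketch: the parameter names (`block_N`, `block_H`, `num_split`, `threads`) and their candidate values are hypothetical and are not taken from the benchmark script.

```python
import itertools


def get_configs():
    """Enumerate candidate tuning configurations as a list of dicts.

    The parameter names and values below are illustrative assumptions,
    not the ones used by the actual benchmark script.
    """
    search_space = {
        "block_N": [32, 64, 128],
        "block_H": [16, 32],
        "num_split": [1, 2, 4],
        "threads": [128, 256],
    }
    keys = list(search_space)
    # One dict per point in the Cartesian product of all candidate values.
    return [
        dict(zip(keys, values))
        for values in itertools.product(*search_space.values())
    ]


configs = get_configs()
print(len(configs))  # 3 * 2 * 3 * 2 = 36
```

Each dict can then be handed to whatever autotuner the script uses. Keeping the search space in one mapping makes it easy to shrink when tuning time matters.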
    • [TMA] Bugfix when a shared buffer is both issued with tma store and tma load (#857) · b9a51c43
      Lei Wang authored
      - Updated `init_desc_arg_map` to use `Var` as the key instead of `String` in `lower_hopper_intrin.cc`.
      - Enhanced `func_call_args` method in `TLCUDASourceWrapper` to accept additional parameters for better argument mapping.
      - Added assertions to ensure consistency between function parameters and arguments during kernel launches.
      - Modified `generate_tma_descriptor_args` to utilize a mapping of variable names for TMA descriptor initialization.
    • [Doc] Optimize the quickstart guide for clarity and not just for CUDA (#858) · 058a670b
      Lei Wang authored
      * Refactor matmul example to include ReLU activation and update batch size in benchmark script
      
      * lint fix
  4. 21 Sep, 2025 1 commit
    • [PATCH] Static libg++ linking fix (#854) · a3497ebc
      Lei Wang authored
      * bump version to 0.1.6
      
      * phaseout py38
      
      * py39
      
      * Update submodule 'tvm' to latest commit adc0e48
      
      * [Build] Update CMake and Python environment settings
      
      - Added static linking flags for GCC and libstdc++ in CMakeLists.txt to enhance library linking.
      - Removed the cmake version requirement from pyproject.toml to allow for broader compatibility.
      - Updated the tox command in the Docker distribution script to include Python 3.8 for testing environments.
      
      * [Build] Update Python version requirements in scripts and documentation
      
      - Changed Python version requirement in README.md from 3.9+ to 3.8+.
      - Updated installation and testing scripts to use Python 3.8 instead of 3.9, ensuring compatibility with the new minimum version.
      - Adjusted tox commands in local and PyPI distribution scripts to include Python 3.8 in the testing environments.
      
      * [Build] Update Python and CMake requirements in Dockerfile and pyproject.toml
      
      - Added CMake version requirement (>=3.26) to pyproject.toml for build compatibility.
      - Created a Python 3.8 environment in the Dockerfile and added a symlink for easier access to the Python 3.8 executable.
      
      * [Build] Update CMake and Dockerfile for improved compatibility
      
      - Removed static linking flags from CMakeLists.txt to simplify build configuration.
      - Updated Dockerfile to use Ubuntu 20.04 and streamlined the installation of dependencies, removing gcc-9 and g++-9.
      - Adjusted symlink creation for Python environments to use the `-sf` option for safer linking.
      
      * [Build] Bump version to 0.1.6.post1 for post-release updates
      
      * [Build] Remove static linking flags from CMakeLists.txt
      
      - Eliminated static linking flags for GCC and libstdc++ to simplify build configuration and avoid potential conflicts with Python extensions.
      
      * [Build] Update Docker distribution scripts for manylinux compatibility
      
      - Changed base image from `tilelang-builder:18.04` to `tilelang-builder:manylinux` in both local and PyPI distribution scripts.
      - Updated Dockerfile references to use `pypi.manylinux.Dockerfile`.
      - Added `--gpus all` flag to the Docker run command to enable GPU support during execution.
      
      * lint fix
      
      * add cmake
  5. 19 Sep, 2025 3 commits
    • [Release] Bump Version to 0.1.6 (#818) · 1ad6e461
      Lei Wang authored
      * bump version to 0.1.6
      
      * phaseout py38
      
      * py39
      
      * Update submodule 'tvm' to latest commit adc0e48
      
      * [Build] Update CMake and Python environment settings
      
      - Added static linking flags for GCC and libstdc++ in CMakeLists.txt to enhance library linking.
      - Removed the cmake version requirement from pyproject.toml to allow for broader compatibility.
      - Updated the tox command in the Docker distribution script to include Python 3.8 for testing environments.
      
      * [Build] Update Python version requirements in scripts and documentation
      
      - Changed Python version requirement in README.md from 3.9+ to 3.8+.
      - Updated installation and testing scripts to use Python 3.8 instead of 3.9, ensuring compatibility with the new minimum version.
      - Adjusted tox commands in local and PyPI distribution scripts to include Python 3.8 in the testing environments.
      
      * [Build] Update Python and CMake requirements in Dockerfile and pyproject.toml
      
      - Added CMake version requirement (>=3.26) to pyproject.toml for build compatibility.
      - Created a Python 3.8 environment in the Dockerfile and added a symlink for easier access to the Python 3.8 executable.
    • [Refactor] Enhance buffer store transformation in TIR pass (#851) · 094e2298
      Lei Wang authored
      - Updated the `AddWrapperForSingleBufStore` function to improve the handling of buffer stores by adding detailed checks for fragment buffer accesses and ensuring only index 0 is used.
      - Introduced new helper functions for collecting buffer accesses and indices, enhancing code readability and maintainability.
      - Refined the logic for determining tile operations and thread bindings to ensure accurate transformations without affecting existing parallel structures.
    • [Py38] Revert typing and parser updates for Python 3.8 compatibility (#850) · bc9623fc
      Lei Wang authored
      * Update submodule TVM to commit 872e32c1 and adjust type hints in nvcc.py and utils.py for compatibility with Python typing standards.
      
      * Update requirements.txt to specify ml_dtypes without a version constraint, indicating that versions greater than 0.5.1 are needed for fp4 support.
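For context on the type-hint adjustment: Python 3.8 cannot subscript built-in containers such as `list[int]` at runtime (that arrived in 3.9) and does not support the `X | None` union syntax (3.10). A minimal sketch of the 3.8-compatible spelling follows. The function `parse_version` is a hypothetical example and is not taken from `nvcc.py` or `utils.py`.

```python
from typing import List, Optional


def parse_version(raw: Optional[str] = None) -> List[int]:
    """Split a dotted version string into integer components.

    Written with typing.List / typing.Optional so the annotations
    evaluate on Python 3.8, where `list[int]` and `str | None` fail.
    """
    if raw is None:
        return []
    return [int(part) for part in raw.split(".")]


print(parse_version("12.4"))  # [12, 4]
```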
  6. 18 Sep, 2025 6 commits
    • [AMD] fix bf16x2 dtype codegen (#847) · 6efeb743
      Jiaxing Ding authored
    • [Refactor] Turn off `ENABLE_FAST_MATH` by default (#846) · e7e38355
      Lei Wang authored
      * [Enhancement] Enable fast math optimization in tilelang JIT configurations
      
      - Updated multiple examples and kernel functions to include `pass_configs` for enabling fast math optimization.
      - Added support for the `TL_ENABLE_FAST_MATH` configuration option in the built-in operations.
      - Enhanced the `LibraryGenerator` to handle the new fast math configuration, ensuring compatibility with existing settings.
      - Updated documentation to reflect the changes in fast math handling and deprecation of the `TL_DISABLE_FAST_MATH` option.
      
      * lint fix
      
      * [Refactor] Introduce deprecated_warning utility for improved deprecation handling
      
      - Added a new `deprecated_warning` function to streamline deprecation messages.
      - Updated the `LibraryGenerator` to utilize the new function for warning about the deprecated `TL_DISABLE_FAST_MATH` configuration.
      - Enhanced the `deprecated` decorator to support phaseout version messaging, improving clarity for users.
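A deprecation helper and decorator of the kind described above can be sketched with the standard `warnings` module. The signatures below are assumptions made for illustration and may differ from the project's actual `deprecated_warning` and `deprecated`. The names `old_api`, `new_api`, and the version string `0.2.0` are likewise hypothetical.

```python
import functools
import warnings
from typing import Optional


def deprecated_warning(name: str, replacement: str,
                       phaseout_version: Optional[str] = None) -> None:
    """Emit a DeprecationWarning, optionally naming the removal version."""
    message = f"{name} is deprecated; use {replacement} instead."
    if phaseout_version is not None:
        message += f" It will be removed in version {phaseout_version}."
    # stacklevel=2 attributes the warning to the caller of this helper.
    warnings.warn(message, DeprecationWarning, stacklevel=2)


def deprecated(replacement: str, phaseout_version: Optional[str] = None):
    """Decorator that warns on every call to the wrapped function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            deprecated_warning(func.__name__, replacement, phaseout_version)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated("new_api", phaseout_version="0.2.0")
def old_api(x):
    return x + 1
```

Routing both the decorator and ad hoc call sites through one helper keeps the message format consistent.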
    • [CI] Test Fix: Handle BufferLoad nodes when T.gemm input has a stride (#843) · ebea77d9
      Lei Wang authored
      * bugfix
      
      * fix
      
      * test fix
    • [Refactor] Refactor some build related configurations (#827) · 232782dd
      Lei Wang authored
      * bugfix
      
      * [Build] Update build dependencies and Dockerfile configuration
      
      - Updated `pyproject.toml` and `requirements-build.txt` to specify Cython version as `Cython>=3.0.0`.
      - Removed unnecessary dependencies from the build system.
      - Enhanced `pypi.Dockerfile` to install gcc-9 and g++-9, and added ninja-build for improved build performance.
      - Updated conda environment creation to include Python 3.9 to 3.12, while removing the Python 3.8 environment.
      
      * cmake fix
      
      * fix
      
      * fix
  7. 17 Sep, 2025 5 commits
    • [Enhancement] Add a MXFP4 grouped GEMM example for FusedMoE (#811) · 8554cb01
      Tong WU authored
      
      
      * [Enhancement] Enhance dequantization examples and utilities
      
      - Added a new example for grouped matrix multiplication with experts in `example_dequant_groupgemm_bf16_mxfp4_hopper.py`.
      - Improved dequantization logic in existing examples by replacing nested loops with vectorized operations for better performance.
      - Updated `torch_convert_bit_twiddling` function in `utils.py` to utilize parallel processing, enhancing efficiency and clarity in the conversion process.
      Co-authored-by: Zhengju Tang <97930865+tzj-fxz@users.noreply.github.com>
      
      * fix typos in docstrings
      
      * remove redundant code
      
      * [Format] Unreproducible debug with T.print
      
      * [BugFix] Correct dtype in ref dequantize; larger data distribution
      
      * [Format]
      
      * [Refactor] Clean up and optimize example_dequant_groupgemm_bf16_mxfp4_hopper.py and utils.py
      
      - Removed unnecessary cache disabling and manual seed setting in the example.
      - Simplified nested loops into parallelized operations for better readability and performance.
      - Updated the assertion function in utils.py to print detailed error messages.
      - Adjusted tensor sizes in examples
      
      * [Refactor] Update import path in example_dequant_gemm_fine_grained.py
      
      - Changed the import statement for `_tir_packed_to_unsigned_convert` from `bitblas.quantization` to `tilelang.quantize` to reflect the new module structure.
      
      * lint
      
      * rename and add test
      
      * lint
      
      * [Feature] Enhance autotuning and configuration generation in example_dequant_groupedgemm_bf16_mxfp4_hopper.py
      
      - Added a new function `get_configs()` to generate hyperparameter configurations for tuning.
      - Updated the `matmul` function to utilize autotuning with the new configurations.
      - Improved kernel performance via vectorization and threadblock swizzle.
      - Enhanced the main function to support the new autotuning inputs and updated parameters for better performance.
      
      * lint
      
      * fix typo
      
      * fix typo and lint
      
      * make ci format check happy
      
      * fix ci
      
      ---------
      Co-authored-by: Zhengju Tang <97930865+tzj-fxz@users.noreply.github.com>
      Co-authored-by: tzj-fxz <tzjfxz@gmail.com>
    • [Bugfix] Skip fp4 dtype binding when using older versions of ml_dtypes (#824) · e4a346fe
      Lei Wang authored
      * bug fix when git is not installed
      
      * ml_dtypes_fix
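Skipping a dtype binding on older library versions is typically done by probing for the attribute rather than comparing version strings. The sketch below illustrates that pattern only. The helper `optional_attr` is hypothetical, and `float4_e2m1fn` is assumed here to be the fp4 dtype name exposed by newer `ml_dtypes` releases.

```python
import importlib


def optional_attr(module_name, attr_name):
    """Return module.attr if the module imports and has the attribute, else None."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, attr_name, None)


# Assumption: newer ml_dtypes releases expose fp4 under this name.
fp4_dtype = optional_attr("ml_dtypes", "float4_e2m1fn")

dtype_bindings = {}
if fp4_dtype is not None:
    # Only register the fp4 binding when the installed library provides it.
    dtype_bindings["float4_e2m1fn"] = fp4_dtype
```

With this guard, environments that have an older `ml_dtypes` (or none at all) still import cleanly and simply lack the fp4 entry.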
    • a57f8270
      Lei Wang authored
    • [DSL] Support python ternary if then else expression (#822) · 15479958
      Lei Wang authored
      * support python ternary if then else expression
      
      * lint fix
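For readers unfamiliar with how a DSL front end sees a ternary: Python parses `a if cond else b` into a single `ast.IfExp` node, which a parser can then map onto its own conditional construct. The sketch below uses only the standard `ast` module. The `if_then_else(...)` string it produces is an illustrative target and is not presented as the project's actual lowering.

```python
import ast

source = "a if cond else b"
node = ast.parse(source, mode="eval").body

# A conditional expression parses to one IfExp node with three children:
# `test` (the condition), `body` (value if true), `orelse` (value if false).
print(type(node).__name__)  # IfExp


def lower_if_exp(node):
    """Render an IfExp whose operands are plain names as a call string.

    Illustrative only: handles ast.Name operands and nothing else.
    """
    assert isinstance(node, ast.IfExp)
    names = [node.test.id, node.body.id, node.orelse.id]
    return "if_then_else({}, {}, {})".format(*names)


print(lower_if_exp(node))  # if_then_else(cond, a, b)
```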
  8. 16 Sep, 2025 3 commits
    • [Example] Remove redundant param (#821) · 907c3ff0
      botbw authored
    • [CI] fix rocm ci (#819) · d3e75b70
      Cunxiao Ni authored
      * [CI] fix rocm ci
      
      * Trigger CI
    • [Example] add w4a8 gemm kernel (#815) · 4bcb1593
      Cunxiao Ni authored
      * [Bugfix] fix autotune bug
      
      * [Example] add w4a8 gemm kernel
      
      * fix lint: pinned the version of `ml_dtypes`
      The version of ml_dtypes should be pinned in the dependency specification. If the version of ml_dtypes is too low, it may result in errors such as fp4 not being defined.
      
      * Renames example for dequantization GEMM
      
      * format
      
      * add w4a8 example to ci
      
      * fix lint
  9. 15 Sep, 2025 4 commits
    • [Refactor] Update TVM subproject and streamline buffer store handling (#816) · 85d1a6b3
      Yu Cheng authored
      - Updated the TVM subproject to the latest commit for improved functionality.
      - Refactored `warp_specialized_rewriter.cc` to replace placeholder implementations for `BlockNode` and `BlockRealizeNode` with proper role filtering, enhancing code clarity and maintainability.
      - Ensured consistent handling of the `cp_async_barrier_noinc` function in `builtin.py` by adding a newline at the end of the file.
    • [Refactor] Reopen #794 Fix lower bug when buffer store is not guarded by any tile op (#817) · 5c869bc7
      Kurisu authored
      * [Refactor] Rewrite AddWrapper pass by ir_transform
      PyStmtExprVisitor and PyStmtExprMutator seem buggy
      
      * fix lint error
    • [Refactor] Update TVM subproject and refactor BlockNode handling in warp_specialized_rewriter.cc (#812) · 8b005226
      Yu Cheng authored
      
      * [Feature] Introduce custom warp specialization attribute and enhance warp group register allocation
      
      - Added a new attribute `kCustomWarpSpecialization` to support custom warp specialization in the TileLang framework.
      - Updated the `Collect` method in `SetMaxNRegCollector` to handle cases where warp specialization is detected, returning an empty array accordingly.
      - Enhanced the `SetMaxNRegInjector` to skip processing when no registers are needed, improving efficiency.
      - Modified the `WarpSpecialized` pass to include the new attribute in the function body when warp specialization is enabled, ensuring proper handling in transformations.
      
      * lint
      
      * lint
    • [feat] support gemm_sp for ampere and ada arch (#691) · 0b3683bf
      botbw authored
      
      
      * [feat] add an example mma atom
      
      * [fix] fix typo naming
      
      * [feat] add a template to enable compilation
      
      * [feat] add print util
      
      * [WIP] pass on single block tile
      
      * [feat] add sm80 metadata layout
      
      * [chore] clean codebase
      
      * [CI] format.sh
      
      * [feat] add sm80 compress utils
      
      * [bugfix] fix C fragment layout
      
      * [refactor] use nvcc version instead of str
      
      * [test] add test cases
      
      * [chore] add a param check
      
      * [chore] format a bit
      
      * [chore] rename func to satisfy PEP 8 and appease gemini
      
      * [chore] add check
      
      * [feat] support sm75 layout && add assertion && chore
      
      * [bug] fix illegal memory access when using two warps over N=32
      
      This could be a missing check related to the cutlass 2.x implementation.
      The cutlass example can't trigger this because it's bypassed by
      padding the input.

      For now I think it might be safe to increase the atom size and
      investigate in the future.
      
      * [chore] add example
      
      * [chore] format
      
      * [example] update benchmark
      
      * [bugfix] fix namespace and format
      
      * [bugfix] fix incorrect param passing
      
      * [refactor] update variable declaration for clarity in gemm_layouts and gemm_sp
      
      * [Cleanup] Remove unnecessary blank lines in metadata layout functions in gemm_sp.py
      
      * [CI] fix arch
      
      * [example] add torch sparse benchmark
      
      * [misc] polish && add reference && apply review suggestions && format
      
      * [CI] format with clang-tidy
      
      * [Cleanup] Format and align template struct definitions in half.hpp, common.h, and gemm_sp_sm80.h
      
      * [Update] Modify CUDA version requirements in test_gemm_sp_sm80 and mark cutlass subproject as dirty
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
  10. 14 Sep, 2025 2 commits
    • [Fix] Fix lower bug when buffer store is not guarded by any tile op (#794) · f0d66698
      Kurisu authored
      * [Fix] Fix lower bug when buffer store is not guarded by any tile op
      
      * fix lint error
      
      * Fix typo in  pass
      
      * fix lint error
      
      * Ignore custom thread binding
    • [Feature] Add ptx_cp_async_barrier_noinc intrinsic and related functionality (#809) · ae9b7063
      Yu Cheng authored
      - Introduced a new intrinsic `ptx_cp_async_barrier_noinc` for handling the `cp.async.mbarrier.arrive.noinc` operation in TileLang.
      - Updated the CUDA code generation to support the new barrier operation.
      - Added a corresponding function in the TileLang Python API for ease of use.
      - Enhanced the barrier handling in CUDA templates to include the new no-increment operation, improving synchronization capabilities in parallel execution contexts.
  11. 13 Sep, 2025 1 commit
  12. 12 Sep, 2025 2 commits
  13. 11 Sep, 2025 2 commits
    • [AMD] support fp8 T.gemm (#804) · 409ab83d
      Tang Xinsheng authored
      
      
      * [AMD] support fp8 T.gemm
      
      * format
      
      ---------
      Co-authored-by: tangxinsheng.txs <tangxinsheng.txs@alibaba-inc.com>
    • [Refactor] Use new namespace and enhance dispatch macros for mma (#801) · b62a0b43
      Lei Wang authored
      * Refactor CUDA GEMM operations to use new namespace and enhance dispatch macros
      
      - Moved GEMM-related dispatch instructions to the `cute::tl_mma` namespace for better organization.
      - Introduced `TL_DISPATCH_MMA` and `TL_DISPATCH_MMA_TEMPLATE` macros to streamline the definition of dispatch instructions for various data types and architectures.
      - Updated the handling of CUDA architecture checks to include additional support for newer architectures.
      - Improved clarity and maintainability of the code by restructuring the layout and organization of dispatch instructions.
      - Ensured consistent usage of tensor views and memory clearing operations across different GEMM implementations.
      
      * Remove deprecated `DispatchInstruction` templates and `tl_mma` namespace from CUDA GEMM implementation. This cleanup enhances code clarity and maintainability by eliminating unused structures and streamlining the overall organization of the GEMM operations.