- 23 Sep, 2025 6 commits
-
-
Tong WU authored
* [Example] Add a new example to support attention sink for MHA
  - Introduced a new example script for multi-head attention (MHA) with sliding window attention and sink tokens.
  - Added a reference attention function to validate the implementation against PyTorch.
  - Included argument parsing for command-line execution of the example.
* [Example] Replace MHA sink forward example with updated implementation
  - Removed the old example script for multi-head attention (MHA) with sliding window attention and sink tokens.
  - Introduced a new example script that modifies the attention mechanism to enhance performance and maintainability.
  - Updated argument parsing and reference functions to align with the new implementation.
* Enhance MHA sink example with sliding window support
  - Added a `window_size` parameter to the `flashattn` function to enable sliding window attention.
  - Implemented assertions to ensure `window_size` is compatible with `block_N`.
  - Updated the main function to include a `tune` option for performance tuning.
  - Introduced a new test file to validate both full attention and sliding window scenarios.
  - Adjusted FLOPS calculation to account for the sliding window configuration.
* lint
* [Fix] Add checkinf process to fix the bug of swa
* Migrate to BSHD layout to align with triton baselines
* lint
* fix typo
* Refactor MHA sink example to use seq_q and seq_kv parameters to accommodate the new sequence length parameters.
* Add GQA sink example for optimized attention mechanism & lint fix
* fix several typos and bugs
* lint
* fix speed issues of swa
* Update examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py
  Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
* Update examples/attention_sink/example_mha_sink_fwd_bhsd_wgmma_pipelined.py
  Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
---------
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
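The commit above describes sliding window attention combined with a sink. As a rough illustration only, the following pure-Python sketch computes the attention weights for a single query row. It assumes that the sink is a single extra logit that contributes to the softmax denominator but carries no value, and that query and key positions are aligned (`seq_q == seq_kv`); the function name and arguments are hypothetical and are not taken from the example scripts.

```python
import math


def sink_attention_row(scores, sink_logit, q_pos, window_size=None):
    """Attention weights for one query row with a causal sliding window and a sink.

    scores[k] is the already-scaled q.k score for key position k.
    Assumption of this sketch: the sink logit only enlarges the softmax
    denominator, so the returned weights can sum to less than 1.
    """
    visible = []
    for k_pos, score in enumerate(scores):
        causal = k_pos <= q_pos
        in_window = window_size is None or q_pos - k_pos < window_size
        visible.append(score if (causal and in_window) else -math.inf)

    # Subtract the running maximum for numerical stability.
    row_max = max(max(visible), sink_logit)
    exps = [math.exp(s - row_max) for s in visible]
    denom = sum(exps) + math.exp(sink_logit - row_max)
    return [e / denom for e in exps]


# Four keys with equal scores, query at position 3, window of 2 keys.
weights = sink_attention_row([0.0, 0.0, 0.0, 0.0], sink_logit=0.0, q_pos=3, window_size=2)
```

With a window of 2, only key positions 2 and 3 are visible, and the sink absorbs a third of the probability mass in this toy input.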
-
Lei Wang authored
-
Lei Wang authored
* Enhance LayoutNode::Forward method to handle variable transformations more robustly - Updated the method to check for a minimum number of input dimensions. - Introduced a mechanism to transform the last InputDim() elements of the input variables. - Concatenated transformed variables with the remaining input variables for a comprehensive output. * Refactor LayoutNode::Forward method for improved readability - Removed unnecessary whitespace to enhance code clarity. - Maintained existing functionality while streamlining the transformation process of input variables.
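The behaviour described for `LayoutNode::Forward` can be pictured with a small Python sketch. This is not the C++ implementation: `forward`, `input_dim`, and `transform` are hypothetical stand-ins, and the sketch assumes the untouched leading variables are placed before the transformed ones in the output.

```python
def forward(indices, input_dim, transform):
    """Apply `transform` to the last `input_dim` indices and keep the rest.

    Assumption of this sketch: leading indices pass through unchanged and
    come first in the result.
    """
    if len(indices) < input_dim:
        raise ValueError(
            f"expected at least {input_dim} indices, got {len(indices)}"
        )
    split = len(indices) - input_dim
    leading, trailing = indices[:split], indices[split:]
    return list(leading) + list(transform(trailing))


def swap(ij):
    """Hypothetical 2-D layout that swaps its two indices."""
    return [ij[1], ij[0]]


result = forward([7, 1, 2], input_dim=2, transform=swap)
```

Here the leading index `7` is carried through and only the last two indices are remapped.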
-
Tong WU authored
-
Jiaxing Ding authored
-
Tong WU authored
* fix flash attention examples for `seqlen_q<seqlen_kv` cases * lint
-
- 22 Sep, 2025 4 commits
-
-
Lei Wang authored
* Refactor matmul example to include ReLU activation and update batch size in benchmark script * lint fix * Enhance autotuning capabilities in benchmark script and update argument defaults - Introduced a new `get_configs` function to generate autotuning configurations for the benchmark. - Updated the default batch size and kv context length in the argument parser for improved performance. - Renamed the `--auto_tune` argument to `--autotune` for consistency. - Modified the kernel invocation logic to support autotuning based on the new configurations. * lint fix
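A `get_configs` helper of the kind mentioned above is commonly written as a Cartesian product over a small search space. The sketch below shows that pattern in isolation; the parameter names and candidate values are assumptions for illustration and are not the ones used in the benchmark script.

```python
import itertools


def get_configs():
    """Enumerate every combination of a (hypothetical) tuning search space."""
    search_space = {
        "block_M": [64, 128],
        "block_N": [64, 128],
        "num_stages": [2, 3],
        "threads": [128, 256],
    }
    keys = list(search_space)
    return [
        dict(zip(keys, values))
        for values in itertools.product(*search_space.values())
    ]


configs = get_configs()
```

Each entry is a plain dictionary, so it can be passed as keyword arguments to whatever kernel factory the tuner calls.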
-
Lei Wang authored
- Updated `init_desc_arg_map` to use `Var` as the key instead of `String` in `lower_hopper_intrin.cc`. - Enhanced `func_call_args` method in `TLCUDASourceWrapper` to accept additional parameters for better argument mapping. - Added assertions to ensure consistency between function parameters and arguments during kernel launches. - Modified `generate_tma_descriptor_args` to utilize a mapping of variable names for TMA descriptor initialization.
-
Lei Wang authored
* Refactor matmul example to include ReLU activation and update batch size in benchmark script * lint fix
-
Lei Wang authored
-
- 21 Sep, 2025 1 commit
-
-
Lei Wang authored
* bump version to 0.1.6 * phaseout py38 * py39 * Update submodule 'tvm' to latest commit adc0e48 * [Build] Update CMake and Python environment settings - Added static linking flags for GCC and libstdc++ in CMakeLists.txt to enhance library linking. - Removed the cmake version requirement from pyproject.toml to allow for broader compatibility. - Updated the tox command in the Docker distribution script to include Python 3.8 for testing environments. * [Build] Update Python version requirements in scripts and documentation - Changed Python version requirement in README.md from 3.9+ to 3.8+. - Updated installation and testing scripts to use Python 3.8 instead of 3.9, ensuring compatibility with the new minimum version. - Adjusted tox commands in local and PyPI distribution scripts to include Python 3.8 in the testing environments. * [Build] Update Python and CMake requirements in Dockerfile and pyproject.toml - Added CMake version requirement (>=3.26) to pyproject.toml for build compatibility. - Created a Python 3.8 environment in the Dockerfile and added a symlink for easier access to the Python 3.8 executable. * [Build] Update CMake and Dockerfile for improved compatibility - Removed static linking flags from CMakeLists.txt to simplify build configuration. - Updated Dockerfile to use Ubuntu 20.04 and streamlined the installation of dependencies, removing gcc-9 and g++-9. - Adjusted symlink creation for Python environments to use the `-sf` option for safer linking. * [Build] Bump version to 0.1.6.post1 for post-release updates * [Build] Remove static linking flags from CMakeLists.txt - Eliminated static linking flags for GCC and libstdc++ to simplify build configuration and avoid potential conflicts with Python extensions. * [Build] Update Docker distribution scripts for manylinux compatibility - Changed base image from `tilelang-builder:18.04` to `tilelang-builder:manylinux` in both local and PyPI distribution scripts. 
- Updated Dockerfile references to use `pypi.manylinux.Dockerfile`. - Added `--gpus all` flag to the Docker run command to enable GPU support during execution. * lint fix * add cmake
-
- 19 Sep, 2025 3 commits
-
-
Lei Wang authored
* bump version to 0.1.6 * phaseout py38 * py39 * Update submodule 'tvm' to latest commit adc0e48 * [Build] Update CMake and Python environment settings - Added static linking flags for GCC and libstdc++ in CMakeLists.txt to enhance library linking. - Removed the cmake version requirement from pyproject.toml to allow for broader compatibility. - Updated the tox command in the Docker distribution script to include Python 3.8 for testing environments. * [Build] Update Python version requirements in scripts and documentation - Changed Python version requirement in README.md from 3.9+ to 3.8+. - Updated installation and testing scripts to use Python 3.8 instead of 3.9, ensuring compatibility with the new minimum version. - Adjusted tox commands in local and PyPI distribution scripts to include Python 3.8 in the testing environments. * [Build] Update Python and CMake requirements in Dockerfile and pyproject.toml - Added CMake version requirement (>=3.26) to pyproject.toml for build compatibility. - Created a Python 3.8 environment in the Dockerfile and added a symlink for easier access to the Python 3.8 executable.
-
Lei Wang authored
- Updated the `AddWrapperForSingleBufStore` function to improve the handling of buffer stores by adding detailed checks for fragment buffer accesses and ensuring only index 0 is used. - Introduced new helper functions for collecting buffer accesses and indices, enhancing code readability and maintainability. - Refined the logic for determining tile operations and thread bindings to ensure accurate transformations without affecting existing parallel structures.
-
Lei Wang authored
* Update submodule TVM to commit 872e32c1 and adjust type hints in nvcc.py and utils.py for compatibility with Python typing standards. * Update requirements.txt to specify ml_dtypes without a version constraint, indicating that versions greater than 0.5.1 are needed for fp4 support.
-
- 18 Sep, 2025 6 commits
-
-
Lei Wang authored
-
Lei Wang authored
-
Jiaxing Ding authored
-
Lei Wang authored
* [Enhancement] Enable fast math optimization in tilelang JIT configurations - Updated multiple examples and kernel functions to include `pass_configs` for enabling fast math optimization. - Added support for the `TL_ENABLE_FAST_MATH` configuration option in the built-in operations. - Enhanced the `LibraryGenerator` to handle the new fast math configuration, ensuring compatibility with existing settings. - Updated documentation to reflect the changes in fast math handling and deprecation of the `TL_DISABLE_FAST_MATH` option. * lint fix * [Refactor] Introduce deprecated_warning utility for improved deprecation handling - Added a new `deprecated_warning` function to streamline deprecation messages. - Updated the `LibraryGenerator` to utilize the new function for warning about the deprecated `TL_DISABLE_FAST_MATH` configuration. - Enhanced the `deprecated` decorator to support phaseout version messaging, improving clarity for users.
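To make the deprecation handling concrete, here is a minimal sketch of a `deprecated_warning` helper and a `deprecated` decorator built on Python's standard `warnings` module. The signatures, the message wording, and the placeholder version string are assumptions of this sketch; the actual utility in the repository may differ.

```python
import functools
import warnings

PLACEHOLDER_VERSION = "x.y.z"  # hypothetical phaseout version, not from the source


def deprecated_warning(name, replacement, phaseout_version=None):
    """Emit a DeprecationWarning that names the replacement option."""
    message = f"{name} is deprecated, use {replacement} instead"
    if phaseout_version is not None:
        message += f" (scheduled for removal in {phaseout_version})"
    warnings.warn(message, DeprecationWarning, stacklevel=2)


def deprecated(replacement, phaseout_version=None):
    """Decorator that warns on every call of the wrapped function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            deprecated_warning(func.__name__, replacement, phaseout_version)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated("new_api", phaseout_version=PLACEHOLDER_VERSION)
def old_api(x):
    return x + 1


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_api(1)
    deprecated_warning("TL_DISABLE_FAST_MATH", "TL_ENABLE_FAST_MATH")
```

The last call shows how a configuration key, rather than a function, could be reported through the same helper.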
-
Lei Wang authored
* bugfix * fix * test fix
-
Lei Wang authored
* bugfix * [Build] Update build dependencies and Dockerfile configuration - Updated `pyproject.toml` and `requirements-build.txt` to specify Cython version as `Cython>=3.0.0`. - Removed unnecessary dependencies from the build system. - Enhanced `pypi.Dockerfile` to install gcc-9 and g++-9, and added ninja-build for improved build performance. - Updated conda environment creation to include Python 3.9 to 3.12, while removing the Python 3.8 environment. * cmake fix * fix * fix
-
- 17 Sep, 2025 5 commits
-
-
Lei Wang authored
-
Tong WU authored
* [Enhancement] Enhance dequantization examples and utilities - Added a new example for grouped matrix multiplication with experts in `example_dequant_groupgemm_bf16_mxfp4_hopper.py`. - Improved dequantization logic in existing examples by replacing nested loops with vectorized operations for better performance. - Updated `torch_convert_bit_twiddling` function in `utils.py` to utilize parallel processing, enhancing efficiency and clarity in the conversion process. Co-authored-by:
Zhengju Tang <97930865+tzj-fxz@users.noreply.github.com> * fix typos in docstrings * remove redundant code * [Format] Unreproducible debug with T.print * [BugFix] Correct dtype in ref dequantize; larger data distribution * [Format] * [Refactor] Clean up and optimize example_dequant_groupgemm_bf16_mxfp4_hopper.py and utils.py - Removed unnecessary cache disabling and manual seed setting in the example. - Simplified nested loops into parallelized operations for better readability and performance. - Updated the assertion function in utils.py to print detailed error messages. - Adjusted tensor sizes in examples * [Refactor] Update import path in example_dequant_gemm_fine_grained.py - Changed the import statement for `_tir_packed_to_unsigned_convert` from `bitblas.quantization` to `tilelang.quantize` to reflect the new module structure. * lint * rename and add test * lint * [Feature] Enhance autotuning and configuration generation in example_dequant_groupedgemm_bf16_mxfp4_hopper.py - Added a new function `get_configs()` to generate hyperparameter configurations for tuning. - Updated the `matmul` function to utilize autotuning with the new configurations. - Improve kernel performance via vectorization and threadblock swizzle. - Enhanced the main function to support the new autotuning inputs and updated parameters for better performance. * lint * fix typo * fix typo and lint * make ci format check happy * fix ci --------- Co-authored-by:
Zhengju Tang <97930865+tzj-fxz@users.noreply.github.com> Co-authored-by:
tzj-fxz <tzjfxz@gmail.com>
-
Lei Wang authored
* bug fix when git is not installed * ml_dtypes_fix
-
Lei Wang authored
-
Lei Wang authored
* support python ternary if then else expression * lint fix
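A Python conditional expression appears in the syntax tree as an `ast.IfExp` node, which a DSL frontend can map onto a functional select. The sketch below only demonstrates that mapping as text using the standard `ast` module (Python 3.9+ for `ast.unparse`); the helper name is hypothetical, and emitting an `if_then_else(...)` call is an assumption about how such lowering might look, not a description of the parser in this commit.

```python
import ast


def lower_ternary(source):
    """Rewrite `<body> if <test> else <orelse>` into call form, as text."""
    node = ast.parse(source, mode="eval").body
    if not isinstance(node, ast.IfExp):
        raise TypeError("expected a conditional expression")
    test, body, orelse = (
        ast.unparse(part) for part in (node.test, node.body, node.orelse)
    )
    return f"if_then_else({test}, {body}, {orelse})"


lowered = lower_ternary("a[i] if i < n else 0")
```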
-
- 16 Sep, 2025 3 commits
-
-
botbw authored
-
Cunxiao Ni authored
* [CI] fix rocm ci * Trigger CI
-
Cunxiao Ni authored
* [Bugfix] fix autotune bug * [Example] add w4a8 gemm kernel * fix lint: pinned the version of `ml_dtypes` The version of ml_dtypes should be pinned in the dependency specification. If the version of ml_dtypes is too low, it may result in errors such as fp4 not being defined. * Renames example for dequantization GEMM * format * add w4a8 example to ci * fix lint
-
- 15 Sep, 2025 4 commits
-
-
Yu Cheng authored
- Updated the TVM subproject to the latest commit for improved functionality. - Refactored `warp_specialized_rewriter.cc` to replace placeholder implementations for `BlockNode` and `BlockRealizeNode` with proper role filtering, enhancing code clarity and maintainability. - Ensured consistent handling of the `cp_async_barrier_noinc` function in `builtin.py` by adding a newline at the end of the file.
-
Kurisu authored
* [Refactor] Rewrite AddWrapper pass by ir_transform. PyStmtExprVisitor and PyStmtExprMutator seem buggy. * fix lint error
-
Yu Cheng authored
[Refactor] Update TVM subproject and refactor BlockNode handling in warp_specialized_rewriter.cc (#812) * [Feature] Introduce custom warp specialization attribute and enhance warp group register allocation - Added a new attribute `kCustomWarpSpecialization` to support custom warp specialization in the TileLang framework. - Updated the `Collect` method in `SetMaxNRegCollector` to handle cases where warp specialization is detected, returning an empty array accordingly. - Enhanced the `SetMaxNRegInjector` to skip processing when no registers are needed, improving efficiency. - Modified the `WarpSpecialized` pass to include the new attribute in the function body when warp specialization is enabled, ensuring proper handling in transformations. * lint * lint
-
botbw authored
* [feat] add an example mma atom
* [fix] fix typo naming
* [feat] add a template to enable compilation
* [feat] add print util
* [WIP] pass on single block tile
* [feat] add sm80 metadata layout
* [chore] clean codebase
* [CI] format.sh
* [feat] add sm80 compress utils
* [bugfix] fix C fragment layout
* [refactor] use nvcc version instead of str
* [test] add test cases
* [chore] add a param check
* [chore] format a bit
* [chore] rename func to satisfy PEP 8 and appease gemini
* [chore] add check
* [feat] support sm75 layout && add assertion && chore
* [bug] fix illegal memory access when using two warps over N=32
  This could be a missing check related to the cutlass 2.x implementation. Using the cutlass example can't trigger this because it's bypassed by padding the input. For now I think it might be safe to increase the atom size and investigate in the future.
* [chore] add example
* [chore] format
* [example] update benchmark
* [bugfix] fix namespace and format
* [bugfix] fix incorrect param passing
* [refactor] update variable declaration for clarity in gemm_layouts and gemm_sp
* [Cleanup] Remove unnecessary blank lines in metadata layout functions in gemm_sp.py
* [CI] fix arch
* [example] add torch sparse benchmark
* [misc] polish && add reference && apply review suggestions && format
* [CI] format with clang-tidy
* [Cleanup] Format and align template struct definitions in half.hpp, common.h, and gemm_sp_sm80.h
* [Update] Modify CUDA version requirements in test_gemm_sp_sm80 and mark cutlass subproject as dirty
---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
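For readers unfamiliar with what a sparse GEMM "compress" step does, the following sketch shows the general idea of 2:4 structured sparsity, where two of every four consecutive values are kept together with their positions. It is an assumption that the sm80 utilities in this commit follow this scheme, and the sketch does not reproduce the hardware metadata layout; `compress_2_4` is a hypothetical name.

```python
def compress_2_4(row):
    """Keep the two largest-magnitude values in each group of four.

    Returns (values, meta), where meta holds the in-group index (0..3)
    of every kept value. Real kernels pack these indices into bit fields
    with an architecture-specific layout, which this sketch omits.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    values, meta = [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        by_magnitude = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        keep = sorted(by_magnitude[:2])  # restore ascending index order
        values.extend(group[i] for i in keep)
        meta.extend(keep)
    return values, meta


values, meta = compress_2_4([0, 3, 0, 5, 1, 0, 0, 2])
```

The compressed row is half the original length, and `meta` is what lets a kernel pair each kept value with the matching element of the dense operand.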
-
- 14 Sep, 2025 2 commits
-
-
Kurisu authored
* [Fix] Fix lower bug when buffer store is not guarded by any tile op * fix lint error * Fix typo in pass * fix lint error * Ignore custom thread binding
-
Yu Cheng authored
- Introduced a new intrinsic `ptx_cp_async_barrier_noinc` for handling the `cp.async.mbarrier.arrive.noinc` operation in TileLang. - Updated the CUDA code generation to support the new barrier operation. - Added a corresponding function in the TileLang Python API for ease of use. - Enhanced the barrier handling in CUDA templates to include the new no-increment operation, improving synchronization capabilities in parallel execution contexts.
-
- 13 Sep, 2025 1 commit
-
-
Yichen Yan authored
* update lint config * Remove spaces for blank line * update
-
- 12 Sep, 2025 2 commits
-
-
alex_xiao authored
-
Jiaxing Ding authored
Co-authored-by: Jiaxing Ding <jiaxing.ding@bytedance.com>
-
- 11 Sep, 2025 3 commits
-
-
Tang Xinsheng authored
* [AMD] support fp8 T.gemm * format --------- Co-authored-by: tangxinsheng.txs <tangxinsheng.txs@alibaba-inc.com>
-
Lei Wang authored
* Refactor CUDA GEMM operations to use new namespace and enhance dispatch macros - Moved GEMM-related dispatch instructions to the `cute::tl_mma` namespace for better organization. - Introduced `TL_DISPATCH_MMA` and `TL_DISPATCH_MMA_TEMPLATE` macros to streamline the definition of dispatch instructions for various data types and architectures. - Updated the handling of CUDA architecture checks to include additional support for newer architectures. - Improved clarity and maintainability of the code by restructuring the layout and organization of dispatch instructions. - Ensured consistent usage of tensor views and memory clearing operations across different GEMM implementations. * Remove deprecated `DispatchInstruction` templates and `tl_mma` namespace from CUDA GEMM implementation. This cleanup enhances code clarity and maintainability by eliminating unused structures and streamlining the overall organization of the GEMM operations.
-
Lei Wang authored
- Introduced a new function `alloc_reducer` to allocate a reducer buffer with specified shape, data type, and reduction operation (sum, max, min). - Added detailed documentation for the function, including usage instructions and parameter descriptions. - Ensured that the function supports replication strategies and includes assertions for valid operation types and replication options. This enhancement improves the functionality of buffer management in TileLang, facilitating efficient reduction operations in parallel loops.
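The idea behind a reducer buffer can be shown without any TileLang code: the buffer starts at the identity element of the chosen operation, and every update combines the stored value with a new one. The sketch below is plain Python with hypothetical helper names (`make_reducer`, `accumulate`); it does not model replication strategies and is not the `alloc_reducer` API itself.

```python
import math

# Identity element and combine function for each supported operation.
_OPS = {
    "sum": (0.0, lambda acc, x: acc + x),
    "max": (-math.inf, max),
    "min": (math.inf, min),
}


def make_reducer(length, op):
    """Return a buffer of `length` slots initialised to the identity of `op`."""
    assert op in _OPS, f"unsupported reduction op: {op}"
    identity, _ = _OPS[op]
    return [identity] * length


def accumulate(buffer, op, index, value):
    """Fold `value` into buffer[index] using the reduction `op`."""
    _, combine = _OPS[op]
    buffer[index] = combine(buffer[index], value)


rows = [[1.0, 4.0], [3.0, 2.0]]
row_max = make_reducer(len(rows), "max")
row_sum = make_reducer(len(rows), "sum")
for i, row in enumerate(rows):
    for x in row:
        accumulate(row_max, "max", i, x)
        accumulate(row_sum, "sum", i, x)
```

Starting from the identity is what makes the order of updates irrelevant, which is the property a parallel loop relies on.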
-