- 19 Oct, 2025 1 commit
Tong WU authored
* [Refactor][Example] Update linear attention examples and add tests
  - Refactored the backward and forward linear attention kernels to use shared memory and atomic additions for improved performance.
  - Introduced L2 normalization in the main functions of both examples.
  - Added a new test suite for the linear attention examples to ensure correctness and performance.
  - Updated argument parsing in the main functions for better usability.
* upd docstring for tma atomic add
* lint
* Add flash-linear-attention dependency to requirements.txt
* Rename main function to chunk_linear_attn_bwd
* Rename main function to chunk_linear_attn_fwd
* chore

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
- 15 Oct, 2025 1 commit
Xuehai Pan authored
* refactor: merge test CI workflow files into one
* chore: set `UV_INDEX_STRATEGY=unsafe-best-match`
* feat: add AST test with Python 3.8
* feat: implement manual caching mechanism for self-hosted runners
* refactor: simplify cache logic for self-hosted runners
* chore: clear uv cache on failure
* chore: print format.sh output to logs
* chore: improve uv caching
* chore: disable parallel test
* chore: use `PYTHONDEVMODE=1` in CI
* feat: enable coredump generation
* fix: fix perfbench condition
* Revert "feat: enable coredump generation"
  This reverts commit c52da65cb572932e09905d08c43a39ec3cf47c54.
* chore: move example CI down
* Revert "chore: move example CI down"
  This reverts commit 9d8e65055e01d955c5268a9a6705d270c2de0d57.
* chore: skip example `test_example_mha_sink_bwd_bhsd`
* chore: skip example `test_example_gqa_sink_bwd_bhsd`
* fix: fix example argument passing
* fix: loosen test criteria
* chore: rename `CMAKE_CONFIG...
- 13 Oct, 2025 1 commit
Yichen Yan authored
* cleanup
* init
* build first wheel that may not work
* build cython ext
* fix tvm build
* use sabi
* update rpath to support auditwheel
* pass editable build
* update ci
* fix warnings
* do not use ccache in self host runner
* test local uv cache
* test pip index
* update lib search to respect new lib location
* fix
* update ci
* enable cuda by default
* update src map
* fix
* fix
* fix
* Generate version with backend and git information at build time
* copy tvm_cython to wheels
* fix tvm lib search
* fmt
* remove unused
* auto detect ccache
* add back backend-related files
* remove jit cython adaptor to simplify code
* fmt
* fix ci
* ci fix 2
* ci fix 3
* workaround metal
* ci fix 4
* fmt
* fmt
* Revert "ci fix 4"
  This reverts commit d1de8291c3e40927955f3ad3cf87a75c78813676.
* tmp
* fix metal
* trivial cleanup
* add detailed build-time version for cuda
* add back mlc
* Restore wheel info and other trivial updates
* update
* fix cuda
* upd
* fix metal ci
* test for ga build
* test for nvidia/cuda
* test ubuntu 20
* fix
* fix
* Do not use `uv build`
* fix
* fix
* log toolchain version
* merge wheel
* update
* debug
* fix
* update
* skip rocm
* update artifacts each
* fix
* fix
* add mac
* fix cache
* fix cache
* fix cache
* reset and add comment
* upd
* fix git version
* update deps
* trivial update
* use in-tree build dir and install to src to speedup editable build
* Revert "use in-tree build dir and install to src to speedup editable build"
  This reverts commit 6ab87b05c5eed811210136b8dca4fc3677dd51f2.
* add build-dir
* update docs
* remove old scripts
* [1/n] cleanup scripts
* [Lint]: [pre-commit.ci] auto fixes [...]
* fix and update
* wait for tvm fix
* revert some tmp fix
* fix
* fix
* spell
* doc update
* test cibuildwheel
* fix and test macos on ci
* Update .github/workflows/dist.yml
  Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
* fix
* test ga event
* cleanup
* bump tvm to support api3
* test final version
* add cron
* Update .github/workflows/dist.yml
  Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
* fix
* test ccache for metal cibuildwheel
* test newer macos
* finish

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
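The "Generate version with backend and git information at build time" step in this commit could look roughly like the sketch below. The `+backend.gitHASH` layout, the helper name `build_version`, and the seven-character hash length are assumptions made for illustration, not the project's confirmed scheme. The only rule relied on is that PEP 440 local version labels are built from ASCII letters, digits, and dots.

```python
import re


def build_version(base: str, backend: str, git_hash: str) -> str:
    """Hypothetical helper: append backend and short git hash as a PEP 440 local label."""
    short = git_hash[:7]  # assumed abbreviation length
    local = f"{backend}.git{short}".lower()
    # Local labels may only contain alphanumerics separated by dots.
    if not re.fullmatch(r"[a-z0-9]+(\.[a-z0-9]+)*", local):
        raise ValueError(f"invalid local version label: {local!r}")
    return f"{base}+{local}"


print(build_version("0.1.0", "cuda", "1a2b3c4d5e6f"))
```

Putting the backend in the local label keeps the public version comparable across builds while still letting a user tell a CUDA wheel from a Metal one.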
- 19 Sep, 2025 1 commit
Lei Wang authored
* Update submodule TVM to commit 872e32c1 and adjust type hints in nvcc.py and utils.py for compatibility with Python typing standards.
* Update requirements.txt to specify ml_dtypes without a version constraint, indicating that versions greater than 0.5.1 are needed for fp4 support.
- 18 Sep, 2025 1 commit
Lei Wang authored
* bugfix
* [Build] Update build dependencies and Dockerfile configuration
  - Updated `pyproject.toml` and `requirements-build.txt` to specify Cython version as `Cython>=3.0.0`.
  - Removed unnecessary dependencies from the build system.
  - Enhanced `pypi.Dockerfile` to install gcc-9 and g++-9, and added ninja-build for improved build performance.
  - Updated conda environment creation to include Python 3.9 to 3.12, while removing the Python 3.8 environment.
* cmake fix
* fix
* fix
- 16 Sep, 2025 1 commit
Cunxiao Ni authored
* [Bugfix] fix autotune bug
* [Example] add w4a8 gemm kernel
* fix lint: pinned the version of `ml_dtypes`
  The version of ml_dtypes should be pinned in the dependency specification. If the version of ml_dtypes is too low, it may result in errors such as fp4 not being defined.
* Renames example for dequantization GEMM
* format
* add w4a8 example to ci
* fix lint
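The `ml_dtypes` note above describes a failure that only shows up at import or use time. A minimal sketch of an early check is given below; the helper names are hypothetical, the `0.5.1` floor is taken from the 19 Sep, 2025 commit message in this history, and the parser deliberately ignores pre-release and post-release suffixes.

```python
from importlib import metadata


def parse_release(version: str) -> tuple:
    """Parse the leading numeric release segment, e.g. '0.5.1rc1' -> (0, 5, 1)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def newer_than(installed: str, floor: str = "0.5.1") -> bool:
    """True when the installed release is strictly newer than the floor."""
    return parse_release(installed) > parse_release(floor)


def check_ml_dtypes(floor: str = "0.5.1") -> str:
    """Hypothetical startup check that reports a too-old or missing ml_dtypes."""
    try:
        installed = metadata.version("ml_dtypes")
    except metadata.PackageNotFoundError:
        return "ml_dtypes is not installed"
    if newer_than(installed, floor):
        return f"ml_dtypes {installed} is new enough"
    return f"ml_dtypes {installed} is too old; a version newer than {floor} is needed"


print(check_ml_dtypes())
```

A check like this turns a confusing "fp4 is not defined" error into a message that names the dependency and the required version.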
- 06 May, 2025 1 commit
Lei Wang authored
- 19 Apr, 2025 1 commit
Lei Wang authored
* Phase out attr
* Remove unused dependencies from requirements files
* Update TVM submodule to latest commit 4776d31
- 21 Feb, 2025 1 commit
Lei Wang authored
* [Feature] Add CTypes JIT kernel support for dynamic shapes and multi-stream execution
  - Enhance CtypesKernelAdapter to handle dynamic symbolic shapes
  - Add support for multi-stream kernel execution in CTypes backend
  - Implement dynamic shape handling in test_tilelang_jit_gemm_ctypes.py
  - Add symbolic shape utility function in tilelang.language
  - Update profiler to improve flexibility in benchmark selection
* Remove redundant thread binding in GEMM kernel implementations
  - Remove unnecessary `thread_binding` line in GEMM kernel functions
  - Clean up code in `examples/gemm/README.md` and `testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py`
  - Enhance code readability by removing redundant thread binding annotation
* Fix indentation in int4 GEMM kernel test file
  - Correct indentation for function calls in `test_tilelang_kernel_int4_gemm_mma.py`
  - Remove extra indentation in `mma_emitter.ldmatrix_a()` and `mma_emitter.ldmatrix_b()` calls
  - Improve code formatting for better readability
* [Feature] Add Cython JIT kernel support for dynamic shapes and multi-stream execution
  - Implement CythonKernelAdapter to handle dynamic symbolic shapes
  - Add support for multi-stream kernel execution in Cython backend
  - Create comprehensive test suite for Cython GEMM kernel in test_tilelang_jit_gemm_cython.py
  - Update JITKernel to include "cython" as a valid execution backend
  - Add Cython-specific wrapper and library generation modules
  - Update .gitignore to exclude Cython cache directory
  - Modify setup.py to include Cython source files in package data
* lint fix
* [Refactor] Replace JITKernel with compile() function for kernel compilation
  - Add new `compile()` function in tilelang/jit/__init__.py as a wrapper for JITKernel
  - Update multiple test files and examples to use `tilelang.compile()` instead of `tilelang.JITKernel()`
  - Modify kernel adapters to support optional kernel-only source retrieval
  - Update `__init__.py` to import the new `compile()` function
  - Improve kernel source retrieval for different execution backends
* lint fix
* remove debug print
* Add C/C++ compiler utility module and update Cython JIT kernel support
  - Introduce new `tilelang/contrib/cc.py` module with cross-platform C/C++ compiler utilities
  - Add functions to detect and retrieve system C/C++ compilers
  - Implement cross-compilation and shared library creation support
  - Update Cython JIT kernel to validate C++ compiler availability
  - Modify Cython adapter to use detected C++ compiler for library generation
* Refactor float8 dtype mapping in tensor utility module
  - Move float8_dtype_map inside adapt_torch2tvm function
  - Simplify global scope by localizing the dtype mapping
  - Maintain existing functionality for converting torch float8 tensors to TVM ndarray
* Refactor float8 dtype mapping in tensor utility module
  - Move float8_dtype_map inside adapt_torch2tvm function
  - Simplify global scope by localizing the dtype mapping
  - Maintain existing functionality for converting torch float8 tensors to TVM ndarray
* revert
* Enhance Cython JIT adapter with Cython compiler detection
  - Add `get_cython_compiler()` function to dynamically locate Cython executable
  - Update Cython adapter to use detected Cython compiler instead of hardcoded command
  - Raise an exception if no Cython compiler is found
  - Update requirements.txt to specify minimum PyTorch version (>=2.2.0)
* Fix Cython kernel wrapper stream handling and type annotations
  - Update stream parameter type to int64_t for better compatibility
  - Directly use torch.cuda.current_stream().cuda_stream instead of casting
  - Improve type safety and precision in Cython kernel wrapper
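The compiler-detection items above (detect a system C/C++ compiler, raise if none is found) can be illustrated with a short standard-library sketch. This is not the code in `tilelang/contrib/cc.py`; the function names, the candidate list, and the use of the `CXX` environment variable are assumptions, and a `CXX` value that carries extra arguments (for example a ccache prefix) is not handled.

```python
import os
import shutil


def find_cxx_compiler(env=None, which=shutil.which):
    """Hypothetical lookup: honor $CXX first, then try common compiler names on PATH."""
    env = os.environ if env is None else env
    candidates = []
    if env.get("CXX"):
        candidates.append(env["CXX"])
    candidates += ["g++", "clang++", "c++"]  # assumed search order
    for name in candidates:
        path = which(name)
        if path:
            return path
    return None


def require_cxx_compiler(env=None, which=shutil.which):
    """Raise early, with an actionable message, when no compiler is available."""
    path = find_cxx_compiler(env, which)
    if path is None:
        raise RuntimeError("no C++ compiler found; set CXX or install g++ or clang++")
    return path


print(find_cxx_compiler())
```

Passing `env` and `which` as parameters keeps the lookup testable without touching the real environment or `PATH`.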
- 10 Feb, 2025 1 commit
Lei Wang authored
* [Enhancement] Add VectorizeLoop function and update imports for compatibility
* [CI][Test] Improve test cases for vectorization and fix typos in parser comments
* lint fix
* Fix incorrect module reference for VectorizeLoop transformation
* Refactor vectorize_loop transformation by removing unused extent mutation logic
* [Enhancement] Add support for FP8 data types and global barriers in CUDA codegen
* Fix formatting in CUDA FP8 header file for consistency
* Refactor CI workflow to use 'tilelang_ci' virtual environment and update CUDA type printing for better clarity
* Update submodule 'tvm' to latest commit for improved functionality
* Refactor execution backend references from 'dl_pack' to 'dlpack' for consistency and clarity; add apply_simplify function to simplify PrimFunc or IRModule.
* Refactor CUDA code for improved readability; clean up formatting and remove unnecessary whitespace in multiple files.
* Refactor import statement in test_tilelang_kernel_dequantize_gemm.py to use 'tilelang.language' for consistency
* Add CUDA requirements to FP8 test cases and update references for clarity
* Add a blank line for improved readability in test_tilelang_kernel_fp8_gemm_mma.py
* Fix data type in reference result calculation for consistency in test_tilelang_kernel_gemm_mma_intrinsic.py
* Add CUDA requirements and FP8 test cases for matmul and gemv simulations
* Remove debug print statements and use tilelang's testing assertion for result validation in test_tilelang_kernel_gemm_mma_intrinsic.py
* Remove outdated comment regarding FP8 tests in test_tilelang_kernel_gemv_simt.py
* Add BF16 support to matrix multiplication and introduce corresponding test cases
* Add a blank line for improved readability in BF16 GEMM test
* Update acknowledgements in README to include supervision by Zhi Yang at Peking University
* enhance acknowledgement
* Replace tutorial on memory layout optimization with new tutorial on writing high-performance kernels with thread primitives
* Update subproject commit for TVM dependency
* Update subproject commit for TVM dependency
* Add int4_t type and functions for packing char values in CUDA common header
* Add plot_layout example and implement GetForwardVars method in layout classes
* Refactor code for improved readability by adjusting line breaks and formatting in layout and test files
* Fix formatting by removing unnecessary line break in layout.h
* Refactor make_int4 function for improved readability by adjusting parameter formatting
* Add legend to plot_layout for improved clarity of thread and local IDs
* Remove unnecessary dependencies from requirements files for cleaner setup
* Remove flash_mha.py and add .gitkeep to deepseek_mla directory
* Add build requirements and update installation scripts for improved setup
- 11 Jan, 2025 1 commit
Lei Wang authored
* Add format.sh script for code formatting and linting
* docs update
* center align the title
* lint fix
* add ignore
* Add .gitignore for 3rdparty directory
* Add requirements-dev.txt, requirements-test.txt, and requirements.txt
* 3rdparty
* Add gemm.h, CMakeLists.txt, _ffi_api.py, __init__.py, runtime.h, reduce.h, loop_partition.h, utils.h, and loop_vectorize.h
* Refactor CMakeLists.txt and include statements
  - Update CMakeLists.txt to use a newer version of CMake and add project name
  - Remove unnecessary include directories
  Fix include paths in layout.cc, codegen.cc, codegen.h, rt_mod.cc, frontend_legalize.cc, inject_pipeline.cc, layout_inference.cc, loop_vectorize.cc, and lower_tile_op.cc
  - Update include paths to use relative paths instead of absolute paths
* Update submodule for 3rdparty/tvm
* update
* load dll first
* Refactor CMakeLists.txt and include statements
* Refactor CMakeLists.txt and include statements
* git keep update
* Refactor CMakeLists.txt and include statements
* Refactor CMakeLists.txt and include statements
* refactor code structure
* Update Readme
* CMakeLists Customized
* update readme
* update README
* update readme
* update usage
* with TVM_IMPORT_PYTHON_PATH to handle own tvm build python import
* annotate lower transform global func with `transform` prefix
* Migrate Simplify Pass from tilelang tvm branch
* enhance system environment handling with __init__ and CMake
* Initial commit
* CODE_OF_CONDUCT.md committed
* LICENSE committed
* README.md committed
* SECURITY.md committed
* SUPPORT.md committed
* CODE_OF_CONDUCT Commit
* LICENSE Commit
* SECURITY Commit
* SUPPORT Commit
* Modify Support
* Update README.md
* security ci update
* remove examples
* Update and implement clang-format
* add composable kernel components
* Migrate from latest update
* submodule update
* Test update
* Update License
* Spell check
* lint fix
* add clang-tidy to apply static analysis for c source
* update tilelang examples
* Update Install Docs
* Refactor filetree
* Enhance Install
* conflict resolved
* annotate_version
* Initial Update
* test fix
* install
* Implement setup.py
* lint fix
* Separate Init
* Separate test
* docker file commit
* add logo
* Update Readme and Examples
* update readme
* update logo
* Implement AMD Installation
* Add License
* Update AMD MI300x Benchmark
* update README
* update mi300 benchmark scripts
* update ignore
* enhance build script
* update image
* enhance setup.py to remove duplicated libraries
* remove debug files
* update readme
* update image
* update gemm examples
* update flashattention README
* readme update
* add cmake into requirements
* libinfo fix
* auto update submodule
* lint fix
* Fix AMD Build and Test
* Update check for transpose attribute for CDNA Arch
* typo fix for amd
* Implement Matmul Benchmark
* Refactor Code
* [TypoFix] Fix GEMM Example
* [Docs] Init Linear Attention README
* [TYPO] Typo fix
* [Lint] Lint Fix
* enhance example with intrinsics
* [Enhancement] Improve Buffer Collection during IR Parser
* [Dev] Introduce Current classmethod to get current frame
* submodule update
* fake test pass update
* support thread_extent_api
* code optimize
* Add GEMM function implementation for matrix multiplication
* Update logging format to reflect TileLang in logger messages
* Refactor CMakeLists.txt for improved readability and set default build type to Release
* Support Gemm SS Primitives Implementation
* [README] Upload Tile Language Logo (#5)
* update logo
* Update README.md to enhance formatting and center the title

Co-authored-by: microsoft-github-operations[bot] <55726097+microsoft-github-operations[bot]@users.noreply.github.com>
Co-authored-by: Microsoft Open Source <microsoftopensource@users.noreply.github.com>
Co-authored-by: Yu Cheng <yu.cheng@pku.edu.cn>