- 17 Dec, 2025 1 commit
-
-
Kuris authored
-
- 13 Dec, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Add read-only parameter annotation for CUDA codegen * Introduced the `AnnotateReadOnlyParams` transformation to annotate read-only handle parameters in PrimFuncs, enabling the generation of `const` qualifiers in CUDA codegen. * Updated `PrintFunctionSignature` and `AddFunction` methods to utilize the new attribute `tl.readonly_param_indices`, enhancing performance by allowing read-only cache loads. * Modified the optimization pipeline to include the new annotation step, improving the overall efficiency of the code generation process. * lint fix * [Dependency] Update apache-tvm-ffi version to >=0.1.3 * Updated the version of apache-tvm-ffi in pyproject.toml, requirements.txt, and requirements-dev.txt to ensure compatibility with the latest features and fixes. * Made adjustments in CUDA and HIP template files to use `const` qualifiers for global pointer parameters, enhancing code safety and clarity. * lint fix * [Enhancement] Refactor ReadWriteMarker for improved parameter handling * Updated the ReadWriteMarker class to accept a set of parameter or data variables, enhancing its ability to track written variables. * Introduced a new method, ResolveDataVarFromPtrArg, to resolve underlying buffer data from pointer-like arguments, improving accuracy in identifying written variables. * Modified the MarkReadOnlyParams function to gather handle parameters and their corresponding buffer data variables, streamlining the process of determining read-only parameters. * Enhanced the logic for identifying written variables to account for aliased data variables, ensuring comprehensive tracking of modifications. * lint fix * Update tma_load function to use const qualifier for global memory pointer * Changed the parameter type of gmem_ptr in the tma_load function from void* to void const* to enhance type safety and clarity in memory operations. * This modification ensures that the function correctly handles read-only global memory pointers, aligning with best practices in CUDA programming. * Remove commented-out code and reorder transformations in OptimizeForTarget function for clarity * Refactor buffer marking logic in annotate_read_only_params.cc to improve accuracy in identifying written variables. Update OptimizeForTarget function to reorder transformations for better clarity.
-
- 11 Dec, 2025 1 commit
-
-
Lei Wang authored
* [Dependency] Update apache-tvm-ffi version to >=0.1.2 in project files * [Dependency] Update subproject commit for TVM to latest version afc07935 * [Enhancement] Add support for optional step parameter in loop constructs - Updated loop creation functions to accept an optional step parameter, enhancing flexibility in loop definitions. - Modified ForFrame implementations to utilize the new step parameter across various loop types including serial, parallel, and pipelined loops. - Adjusted related vectorization transformations to accommodate the step parameter, ensuring consistent behavior in loop vectorization processes. * lint fix
-
- 05 Nov, 2025 1 commit
-
-
Lei Wang authored
* [Feature] Add support for SM70 tensor core MMA instructions - Introduced new intrinsic `ptx_mma_sm70` for Volta GPUs, enabling m16n16k4 shape with FP16 inputs and FP16/FP32 accumulation. - Added `GemmMMASm70` class for handling GEMM operations specific to SM70 architecture. - Implemented layout functions for Volta swizzled layouts and updated existing GEMM layout inference logic. - Updated `requirements-dev.txt` to include `apache-tvm-ffi` dependency. - Added correctness evaluation script for testing GEMM operations on SM70. * [Refactor] Update formatting and installation commands in scripts - Modified `format.sh` to install `pre-commit` and `clang-tidy` with the `--user` flag for user-specific installations. - Improved readability in `correctness_evaluation_sm70.py` by adjusting the formatting of pytest parameters. - Cleaned up spacing and formatting in various C++ source files for better consistency and readability. - Removed unnecessary comments and improved layout function definitions in `mma_sm70_layout.py` and `mma_sm70_macro_generator.py` for clarity. - Ensured consistent formatting in layout initialization and swizzle functions. * typo fix
-
- 15 Oct, 2025 1 commit
-
-
Xuehai Pan authored
* refactor: merge test CI workflow files into one * chore: set `UV_INDEX_STRATEGY=unsafe-best-match` * feat: add AST test with Python 3.8 * feat: implement manual caching mechanism for self-hosted runners * refactor: simplify cache logic for self-hosted runners * chore: clear uv cache on failure * chore: print format.sh output to logs * chore: improve uv caching * chore: disable parallel test * chore: use `PYTHONDEVMODE=1` in CI * feat: enable coredump generation * fix: fix perfbench condition * Revert "feat: enable coredump generation" This reverts commit c52da65cb572932e09905d08c43a39ec3cf47c54. * chore: move example CI down * Revert "chore: move example CI down" This reverts commit 9d8e65055e01d955c5268a9a6705d270c2de0d57. * chore: skip example `test_example_mha_sink_bwd_bhsd` * chore: skip example `test_example_gqa_sink_bwd_bhsd` * fix: fix example argument passing * fix: loosen test criteria * chore: rename `CMAKE_CONFIG...
-
- 14 Oct, 2025 1 commit
-
-
Yichen Yan authored
* Load libs from build dir, if present, to support faster rebuild. * typo * upd * refine check * md lint
-
- 13 Oct, 2025 1 commit
-
-
Yichen Yan authored
* cleanup * init * build first wheel that may not work * build cython ext * fix tvm build * use sabi * update rpath to support auditwheel * pass editible build * update ci * fix warnings * do not use ccache in self host runner * test local uv cache * test pip index * update lib search to respect new lib location * fix * update ci * enable cuda by default * update src map * fix * fix * fix * Generate version with backend and git information at build time * copy tvm_cython to wheels * fix tvm lib search * fmt * remove unused * auto detect ccache * add back backend-related files * remove jit cython adaptor to simplify code * fmt * fix ci * ci fix 2 * ci fix 3 * workaround metal * ci fix 4 * fmt * fmt * Revert "ci fix 4" This reverts commit d1de8291c3e40927955f3ad3cf87a75c78813676. * tmp * fix metal * trivial cleanup * add detailed build-time version for cuda * add back mlc * Restore wheel info and other trivial updates * update * fix cuda * upd * fix metal ci * test for ga build * test for nvidia/cuda * test ubuntu 20 * fix * fix * Do not use `uv build` * fix * fix * log toolchain version * merge wheel * update * debug * fix * update * skip rocm * update artifacts each * fix * fix * add mac * fix cache * fix cache * fix cache * reset and add comment * upd * fix git version * update deps * trivial update * use in-tree build dir and install to src to speedup editable build * Revert "use in-tree build dir and install to src to speedup editable build" This reverts commit 6ab87b05c5eed811210136b8dca4fc3677dd51f2. * add build-dir * update docs * remove old scrips * [1/n] cleanup scripts * [Lint]: [pre-commit.ci] auto fixes [...] * fix and update * wait for tvm fix * revert some tmp fix * fix * fix * spell * doc update * test cibuildwheel * fix and test macos on ci * Update .github/workflows/dist.yml Co-authored-by:
Xuehai Pan <XuehaiPan@outlook.com> * fix * test ga event * cleanup * bump tvm to support api3 * test final version * add cron * Update .github/workflows/dist.yml Co-authored-by:
Xuehai Pan <XuehaiPan@outlook.com> * fix * test ccache for metal cibuildwheel * test newer macos * finish --------- Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Xuehai Pan <XuehaiPan@outlook.com>
-
- 08 Aug, 2025 1 commit
-
-
Lei Wang authored
* Implement new free stage layout inference. * Fix bug * Make replication upcasting and unnormalizable iterators safe. * Better handling of updating with more replica * Remove unnecessary check. * Fix compilation. * Fix setup.py. * Simplify development mode. * Allow ParallelOp layout when there's already a compatible layout specified * lint fix * Add ProveFragmentContains function to validate thread access between small and large fragments This function checks if the threads accessing elements of a smaller fragment are a subset of those accessing a larger fragment, ensuring valid access during updates. The implementation includes deriving thread indices, computing logical indices, and verifying thread mappings. * Update dependencies in requirements files * Remove 'thefuzz' from requirements-dev.txt * Specify exact versions for 'torch' and add 'flash_attn' in requirements-test.txt * Update CI workflow to use SHA256 hash for requirements file * Update requirements and CI workflow for flash attention * Removed specific version for 'torch' in requirements-test.txt * Added installation of 'flash_attn==2.5.8' in CI workflow to ensure compatibility * Refactor flash attention import handling in examples * Removed availability checks for 'flash_attn' in multiple example scripts. * Simplified import statements for 'flash_attn' to ensure consistent usage across examples. --------- Co-authored-by:Huanqi Cao <caohuanqi@deepseek.com>
-
- 19 Apr, 2025 1 commit
-
-
Lei Wang authored
* Phase out attr * Remove unused dependencies from requirements files * Update TVM submodule to latest commit 4776d31
-
- 22 Feb, 2025 1 commit
-
-
Lei Wang authored
* [Build] Improve build configuration and package distribution support - Add `build` to requirements-build.txt for package building - Update MANIFEST.in to include Cython wrapper source file - Enhance setup.py to improve Cython file copying logic - Update build scripts to support multi-Python version distribution * [Build] Improve Cython file handling in setup.py and MANIFEST.in - Remove Cython wrapper from MANIFEST.in - Enhance setup.py to create target directory if it doesn't exist when copying Cython files - Improve file copying logic for Cython source files during build process * [Build] Remove Cython file copying logic in setup.py - Comment out Cython file copying code in TileLangBuilPydCommand - Simplify setup.py build process by removing redundant Cython file handling * [Build] Enhance Docker distribution scripts for multi-Python version support - Refactor local and PyPI distribution Docker scripts - Replace hardcoded Python installation with Miniconda-based multi-version Python environment - Improve Docker image setup with dynamic Python version creation - Simplify build process by using Miniconda for Python environment management * [Build] Separate lint requirements into a dedicated file - Create new requirements-lint.txt for formatting and linting tools - Update format.sh to use requirements-lint.txt instead of requirements-dev.txt - Update requirements-dev.txt and requirements-test.txt to reference requirements-lint.txt - Improve dependency management by isolating lint-specific requirements * [Build] Restore Cython file copying logic in setup.py - Re-add Cython file copying mechanism in TileLangBuilPydCommand - Implement robust file search across multiple potential directories - Add warning for cases where Cython source files cannot be found - Improve build process reliability for Cython source files * [Build] Refactor Cython file copying logic in setup.py - Simplify Cython file copying mechanism in TileLangBuilPydCommand - Improve directory creation and file copying for Cython source files - Relocate potential directories list to a more logical position - Enhance robustness of file and directory handling during build process * [Build] Refine Cython file copying logic in setup.py - Improve file existence check when copying Cython source files - Use os.path.join to construct full path for more robust file checking - Enhance file copying mechanism in TileLangBuilPydCommand
-
- 10 Feb, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Add VectorizeLoop function and update imports for compatibility * [CI][Test] Improve test cases for vectorization and fix typos in parser comments * lint fix * Fix incorrect module reference for VectorizeLoop transformation * Refactor vectorize_loop transformation by removing unused extent mutation logic * [Enhancement] Add support for FP8 data types and global barriers in CUDA codegen * Fix formatting in CUDA FP8 header file for consistency * Refactor CI workflow to use 'tilelang_ci' virtual environment and update CUDA type printing for better clarity * Update submodule 'tvm' to latest commit for improved functionality * Refactor execution backend references from 'dl_pack' to 'dlpack' for consistency and clarity; add apply_simplify function to simplify PrimFunc or IRModule. * Refactor CUDA code for improved readability; clean up formatting and remove unnecessary whitespace in multiple files. * Refactor import statement in test_tilelang_kernel_dequantize_gemm.py to use 'tilelang.language' for consistency * Add CUDA requirements to FP8 test cases and update references for clarity * Add a blank line for improved readability in test_tilelang_kernel_fp8_gemm_mma.py * Fix data type in reference result calculation for consistency in test_tilelang_kernel_gemm_mma_intrinsic.py * Add CUDA requirements and FP8 test cases for matmul and gemv simulations * Remove debug print statements and use tilelang's testing assertion for result validation in test_tilelang_kernel_gemm_mma_intrinsic.py * Remove outdated comment regarding FP8 tests in test_tilelang_kernel_gemv_simt.py * Add BF16 support to matrix multiplication and introduce corresponding test cases * Add a blank line for improved readability in BF16 GEMM test * Update acknowledgements in README to include supervision by Zhi Yang at Peking University * enhance acknowledgement * Replace tutorial on memory layout optimization with new tutorial on writing high-performance kernels with thread primitives * Update subproject commit for TVM dependency * Update subproject commit for TVM dependency * Add int4_t type and functions for packing char values in CUDA common header * Add plot_layout example and implement GetForwardVars method in layout classes * Refactor code for improved readability by adjusting line breaks and formatting in layout and test files * Fix formatting by removing unnecessary line break in layout.h * Refactor make_int4 function for improved readability by adjusting parameter formatting * Add legend to plot_layout for improved clarity of thread and local IDs * Remove unnecessary dependencies from requirements files for cleaner setup * Remove flash_mha.py and add .gitkeep to deepseek_mla directory * Add build requirements and update installation scripts for improved setup
-
- 25 Jan, 2025 1 commit
-
-
Yu Cheng authored
* [Dev] Add FlashDecoding example * [CI][Test] Add test cases for tilelang kernel convolution * [CI][Test] Add test cases for tilelang kernel FlashAttention * Reduce the number of stages to ensure the shared memory allocation is valid * Temporarily remove the dim128 case * lint * update einops in requirements-dev.txt * update einops in requirements-test.txt * remove einops in requirements-dev.txt
-
- 11 Jan, 2025 1 commit
-
-
Lei Wang authored
* Add format.sh script for code formatting and linting * docs update * center align the title * lint fix * add ignore * Add .gitignore for 3rdparty directory * Add requirements-dev.txt, requirements-test.txt, and requirements.txt * 3rdparty * Add gemm.h, CMakeLists.txt, _ffi_api.py, __init__.py, runtime.h, reduce.h, loop_partition.h, utils.h, and loop_vectorize.h * Refactor CMakeLists.txt and include statements - Update CMakeLists.txt to use a newer version of CMake and add project name - Remove unnecessary include directories Fix include paths in layout.cc, codegen.cc, codegen.h, rt_mod.cc, frontend_legalize.cc, inject_pipeline.cc, layout_inference.cc, loop_vectorize.cc, and lower_tile_op.cc - Update include paths to use relative paths instead of absolute paths * Update submodule for 3rdparty/tvm * update * load dll first * Refactor CMakeLists.txt and include statements * Refactor CMakeLists.txt and include statements * git keep update * Refactor CMakeLists.txt and include statements * Refactor CMakeLists.txt and include statements * refactor code structure * Update Readme * CMakeLists Customized * update readme * update README * update readme * update usage * with TVM_IMPORT_PYTHON_PATH to handle own tvm build python import * annotate lower transform global func with `transform` prefix * Migrate Simplify Pass from tilelang tvm branch * enhance system environment handling with __init__ and CMake * Initial commit * CODE_OF_CONDUCT.md committed * LICENSE committed * README.md committed * SECURITY.md committed * SUPPORT.md committed * CODE_OF_CONDUCT Commit * LICENSE Commit * SECURITY Commit * SUPPORT Commit * Modify Support * Update README.md * security ci update * remove examples * Update and implement clang-format * add composable kernel components * Migrate from latest update * submodule update * Test update * Update License * Spell check * lint fix * add clang-tidy to apply static analysis for c source * update tilelang examples * Update Install Docs * Refactor filetree * Enhance Install * conflict resloved * annotate_version * Initial Update * test fix * install * Implement setup.py * lint fix * Separate Init * Separate test * docker file commit * add logo * Update Readme and Examples * update readme * update logo * Implement AMD Installation * Add License * Update AMD MI300x Benchmark * update README * update mi300 benchmark scripts * update ignore * enhance build scirpt * update image * enhance setup.py to remove duplicated libraries * remove debug files * update readme * update image * update gemm examples * update flashattention README * readme update * add cmake into requirements * libinfo fix * auto update submodule * lint fix * Fix AMD Build and Test * Update check for transpose attribute for CDNA Arch * typo fix for amd * Implement Matmul Benchmark * Refactor Code * [TypoFix] Fix GEMM Example * [Docs] Init Linear Attention README * [TYPO] Typo fix * [Lint] Lint Fix * enhance example with intrinsics * [Enhancement] Improve Buffer Collection during IR Parser * [Dev] Introduce Current classmethod to get current frame * submodule update * fake test pass update * support thread_extent_api * code optimize * Add GEMM function implementation for matrix multiplication * Update logging format to reflect TileLang in logger messages * Refactor CMakeLists.txt for improved readability and set default build type to Release * Support Gemm SS Primitives Implementation * [README] Upload Tile Language Logo (#5) * update logo * Update README.md to enhance formatting and center the title --------- Co-authored-by:
microsoft-github-operations[bot] <55726097+microsoft-github-operations[bot]@users.noreply.github.com> Co-authored-by:
Microsoft Open Source <microsoftopensource@users.noreply.github.com> Co-authored-by:
Yu Cheng <yu.cheng@pku.edu.cn>
-