- 10 May, 2025 1 commit
-
-
Yuxuan Hu authored
-
- 09 May, 2025 1 commit
-
-
Rinne authored
-
- 18 Apr, 2025 1 commit
-
-
Andy Luo authored
* Update Installation.md * Update installation prerequisites in documentation --------- Co-authored-by:Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 12 Apr, 2025 2 commits
-
-
Lei Wang authored
* Add deepseek_mla to documentation index (#380) * lint fix
-
Lei Wang authored
* [Add] Introduce deepseek_mla documentation for high-performance FlashMLA with TileLang - Added a comprehensive guide on writing high-performance kernels using TileLang, focusing on the Multi-Head Latent Attention (MLA) mechanism. - Included benchmark results comparing FlashMLA, TileLang, Torch, Triton, and FlashInfer, highlighting TileLang's efficiency and ease of use. - Detailed implementation strategies, including layout inference, threadblock swizzling, shared memory swizzling, and warp specialization. - Provided examples and explanations of optimization techniques to enhance performance in GPU kernel programming. * doc update * [Add] Enhance AMD FlashMLA implementation and documentation - Refactored variable names in `benchmark_mla_decode_amd_tilelang.py` for clarity, changing `Q_shared` and `Q_pe_shared` to `Q_local` and `Q_pe_local` to reflect their usage in register allocation. - Added a new `README.md` detailing the high-performance FlashMLA implementation on AMD MI300X accelerators, including architectural considerations, optimization strategies, and performance evaluation. - Introduced a performance comparison figure to illustrate the efficiency of the TileLang implementation against other frameworks. * lint fix * [Add] Expand deepseek_mla documentation for AMD MI300X optimization strategies - Introduced a new section detailing architectural differences and optimization strategies for implementing FlashMLA on AMD MI300X accelerators. - Highlighted key considerations such as instruction set variations, shared memory constraints, tile size flexibility, and memory bank conflict swizzling. - Included performance evaluation results demonstrating TileLang's efficiency compared to other frameworks. - Discussed future optimization opportunities for memory bank conflict mitigation and dimension parallelization.
-
- 05 Apr, 2025 1 commit
-
-
yeh-sudo authored
This pull request includes a change to the `gemv.md` file. The changes add heading level to title in the document to make the heading level right.
-
- 28 Mar, 2025 1 commit
-
-
botbw authored
* [doc/example] init gemv doc and examples * [example] add vectorized read * [example] use local register instead of smem * [example] add bench * [doc] update doc * [doc] refine doc * [lint] format code * [doc] add tips * [doc/example] fix typo * [example] use tmv_all_reduce * [doc] update doc accordingly * [doc] add benchmark table * [lint] format code
-
- 27 Mar, 2025 1 commit
-
-
Wenhao Xie authored
* fix bug * update performance.py * update python api docs * test workflow * fix dependency * fix bug * fix * update correct git config * test workflow * clear cache * lint fix * fix exclude path
-
- 26 Mar, 2025 3 commits
-
-
Yu Cheng authored
-
Lei Wang authored
* [Refactor] Improve flash attention example and layout comparison logic - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code. - Updated the handling of `lse_local_split` to utilize parallel processing for better performance. - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example. - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons. * lint fix * [Enhancement] Add support for shared memory scope in Fill operation - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation. - Implemented parallel operation and layout inference for improved performance in shared memory scenarios. - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling. * [Refactor] Remove deprecated decorator and enhance Cython kernel handling - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization. - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution. - Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments. - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced. - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs. * [Feature] Add matrix multiplication test and kernel implementation - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives. - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types. - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation. - Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs. - Minor formatting improvements in `deprecated.py` for better readability. * lint fix * [Refactor] Update tensor creation in matrix multiplication test - Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency. - Updated imports in `__init__.py` to include `make_tensor`. - Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers. * [Refactor] Update tensor definitions across multiple files - Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity. - Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations. - Improved documentation in README and example files to reflect changes in tensor usage. * lint fix * [Refactor] Update tensor types in attention and matrix multiplication examples - Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity. - Adjusted tensor definitions in benchmark and example files to align with the new tensor types. - Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files. * lint fix * [Refactor] Update tensor types in GEMM example and test files - Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity. - Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions. * [Refactor] Update tensor usage in customize.py - Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions. - Improved code clarity by standardizing buffer usage across the file. * [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py - Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions. - Improved code clarity by standardizing buffer usage across the test file. * [Refactor] Update tensor types to SharedBuffer and FragmentBuffer - Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions. - Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions. * [Refactor] Introduce Tensor alias for Buffer in proxy.py - Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`. - This change enhances clarity and consistency in tensor usage across the codebase.
-
yyttt6 authored
-
- 22 Mar, 2025 1 commit
-
-
Lei Wang authored
* Add GPU kernel for 2D continuous cumulative sum in TileLang example - Introduced a new example script `example_tilelang_cumsum.py` that generates a GPU kernel for 2D continuous cumulative sum. - Implemented functions to handle kernel configuration, memory allocation, and inclusive scan operations. - Added a main execution block to demonstrate the kernel's functionality using PyTorch for tensor operations. - Enhanced the example with error handling for power-of-two configurations and validation of results against PyTorch's built-in cumulative sum function. * Refactor TileLang examples and enhance kernel compilation - Updated `example_tilelang_cumsum.py` to improve GPU kernel generation for 2D continuous cumulative sum, including better parameter handling and error checking. - Refactored `example_mha_bwd.py` to enhance kernel compilation readability and maintainability. - Modified `kernel_cache.py` to prevent saving kernels to disk when using the DLPack backend, ensuring proper cache management. - Added `get_block_bindings` function to `kernel.py` for improved access to block bindings in kernel launch frames. - Cleaned up import statements in `__init__.py` for better organization and clarity. * Enhance GPU kernel for 2D continuous cumulative sum in TileLang example - Added additional spacing for improved readability in `example_tilelang_cumsum.py`. - Refined kernel structure to enhance clarity and maintainability during GPU kernel generation for cumulative sum operations. * Refactor CUDA post-processing callback registration in TileLang - Introduced a new decorator `register_cuda_postproc_callback` for registering CUDA post-processing functions, enhancing usability and flexibility. - Updated existing callback implementations to utilize the new decorator, improving code clarity and maintainability. - Added debug prints to the CUDA code generation process for better traceability during development. - Refactored the `OptimizeForTarget` function to streamline conditional statement handling in the pipeline transformation. - Cleaned up the `inject_pipeline.cc` file by removing redundant code related to statement grouping and condition handling. * lint fix * Enhance BlockSparse GEMM Example with Autotuning and Configurable Parameters - Added argument parsing to allow dynamic configuration of matrix dimensions and sparsity ratio. - Implemented a function to generate various kernel configurations for autotuning. - Refactored the main execution block to support both autotuned and default configurations. - Improved the block mask generation to accommodate specified sparsity levels. - Updated the kernel compilation process to utilize the new configurations and ensure accurate results verification.
-
- 19 Mar, 2025 1 commit
-
-
Chenghua authored
* [Example] Modify tuning configurations for FlashAttention example * [Examples] formatting example_gqa_fwd_bshd.py * [Examples] Implement elementwise add kernel * [Doc] Update ElementWise Operators document * [Examples] Replace the example of elementwise add.
-
- 17 Mar, 2025 1 commit
-
-
Wenhao Xie authored
* [Typo] Fix formatting in installation instructions in README.md * [Enhancement] Improve CUDA path detection and update configuration handling * fix typo * remove IS_WINDOWS constant * lint fix * Improve error messages for CUDA detection failure * lint fix * lint fix * Fix .gitignore to correctly include venv directory * [Doc] Add instructions for installing nightly version of TileLang * update installation instructions * update install instruction
-
- 21 Feb, 2025 1 commit
-
-
Lei Wang authored
* [Feature] Add CTypes JIT kernel support for dynamic shapes and multi-stream execution - Enhance CtypesKernelAdapter to handle dynamic symbolic shapes - Add support for multi-stream kernel execution in CTypes backend - Implement dynamic shape handling in test_tilelang_jit_gemm_ctypes.py - Add symbolic shape utility function in tilelang.language - Update profiler to improve flexibility in benchmark selection * Remove redundant thread binding in GEMM kernel implementations - Remove unnecessary `thread_binding` line in GEMM kernel functions - Clean up code in `examples/gemm/README.md` and `testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py` - Enhance code readability by removing redundant thread binding annotation * Fix indentation in int4 GEMM kernel test file - Correct indentation for function calls in `test_tilelang_kernel_int4_gemm_mma.py` - Remove extra indentation in `mma_emitter.ldmatrix_a()` and `mma_emitter.ldmatrix_b()` calls - Improve code formatting for better readability * [Feature] Add Cython JIT kernel support for dynamic shapes and multi-stream execution - Implement CythonKernelAdapter to handle dynamic symbolic shapes - Add support for multi-stream kernel execution in Cython backend - Create comprehensive test suite for Cython GEMM kernel in test_tilelang_jit_gemm_cython.py - Update JITKernel to include "cython" as a valid execution backend - Add Cython-specific wrapper and library generation modules - Update .gitignore to exclude Cython cache directory - Modify setup.py to include Cython source files in package data * lint fix * [Refactor] Replace JITKernel with compile() function for kernel compilation - Add new `compile()` function in tilelang/jit/__init__.py as a wrapper for JITKernel - Update multiple test files and examples to use `tilelang.compile()` instead of `tilelang.JITKernel()` - Modify kernel adapters to support optional kernel-only source retrieval - Update `__init__.py` to import the new `compile()` function - Improve kernel source retrieval for different execution backends * lint fix * remove debug print * Add C/C++ compiler utility module and update Cython JIT kernel support - Introduce new `tilelang/contrib/cc.py` module with cross-platform C/C++ compiler utilities - Add functions to detect and retrieve system C/C++ compilers - Implement cross-compilation and shared library creation support - Update Cython JIT kernel to validate C++ compiler availability - Modify Cython adapter to use detected C++ compiler for library generation * Refactor float8 dtype mapping in tensor utility module - Move float8_dtype_map inside adapt_torch2tvm function - Simplify global scope by localizing the dtype mapping - Maintain existing functionality for converting torch float8 tensors to TVM ndarray * Refactor float8 dtype mapping in tensor utility module - Move float8_dtype_map inside adapt_torch2tvm function - Simplify global scope by localizing the dtype mapping - Maintain existing functionality for converting torch float8 tensors to TVM ndarray * revert * Enhance Cython JIT adapter with Cython compiler detection - Add `get_cython_compiler()` function to dynamically locate Cython executable - Update Cython adapter to use detected Cython compiler instead of hardcoded command - Raise an exception if no Cython compiler is found - Update requirements.txt to specify minimum PyTorch version (>=2.2.0) * Fix Cython kernel wrapper stream handling and type annotations - Update stream parameter type to int64_t for better compatibility - Directly use torch.cuda.current_stream().cuda_stream instead of casting - Improve type safety and precision in Cython kernel wrapper
-
- 13 Feb, 2025 2 commits
-
-
Lei Wang authored
* bump version into v0.1.0 * [Enhancement] Add custom develop command for editable installs and update .gitignore * [Documentation] Update README to include system dependencies installation instructions * [Build] Update setup.py to support library file copying for both release and develop modes * [Build] Refactor library file copying logic in setup.py * [Documentation] Remove unnecessary install section header in Installation.md
-
Wenhao Xie authored
* [CI] Clean up target repository before publishing documentation. * [Doc] Convert docs from rst format to Markdown format.
-
- 09 Feb, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Add VectorizeLoop function and update imports for compatibility * [CI][Test] Improve test cases for vectorization and fix typos in parser comments * lint fix * Fix incorrect module reference for VectorizeLoop transformation * Refactor vectorize_loop transformation by removing unused extent mutation logic * [Enhancement] Add support for FP8 data types and global barriers in CUDA codegen * Fix formatting in CUDA FP8 header file for consistency * Refactor CI workflow to use 'tilelang_ci' virtual environment and update CUDA type printing for better clarity * Update submodule 'tvm' to latest commit for improved functionality * Refactor execution backend references from 'dl_pack' to 'dlpack' for consistency and clarity; add apply_simplify function to simplify PrimFunc or IRModule. * Refactor CUDA code for improved readability; clean up formatting and remove unnecessary whitespace in multiple files. * Refactor import statement in test_tilelang_kernel_dequantize_gemm.py to use 'tilelang.language' for consistency * Add CUDA requirements to FP8 test cases and update references for clarity * Add a blank line for improved readability in test_tilelang_kernel_fp8_gemm_mma.py * Fix data type in reference result calculation for consistency in test_tilelang_kernel_gemm_mma_intrinsic.py * Add CUDA requirements and FP8 test cases for matmul and gemv simulations * Remove debug print statements and use tilelang's testing assertion for result validation in test_tilelang_kernel_gemm_mma_intrinsic.py * Remove outdated comment regarding FP8 tests in test_tilelang_kernel_gemv_simt.py * Add BF16 support to matrix multiplication and introduce corresponding test cases * Add a blank line for improved readability in BF16 GEMM test * Update acknowledgements in README to include supervision by Zhi Yang at Peking University * enhance acknowledgement * Replace tutorial on memory layout optimization with new tutorial on writing high-performance kernels with thread primitives * Update subproject commit for TVM dependency * Update subproject commit for TVM dependency * Add int4_t type and functions for packing char values in CUDA common header * Add plot_layout example and implement GetForwardVars method in layout classes * Refactor code for improved readability by adjusting line breaks and formatting in layout and test files * Fix formatting by removing unnecessary line break in layout.h * Refactor make_int4 function for improved readability by adjusting parameter formatting
-
- 02 Feb, 2025 1 commit
-
-
Lei Wang authored
* implement jit test case * [Dev] implement auto tune test case for matrix multiplication * Implement test for legalize memory access and vectorized loop * lint fix * introduce run_once * Refactor callback function names for consistency and improve code readability * enhance documentations * lint fix * lint fix * lint fix * lint fix * fix formatting issues in rt_mod_hip.cc * add random seed initialization for deterministic testing * Add documentation images and comprehensive GEMM tutorial for TileLang * Update MATMUL documentation title to highlight Tile Library
-
- 26 Jan, 2025 1 commit
-
-
Lei Wang authored
* implement jit test case * [Dev] implement auto tune test case for matrix multiplication * Implement test for legalize memory access and vectorized loop * lint fix * introduce run_once * Refactor callback function names for consistency and improve code readability * enhance documentations * lint fix * lint fix * lint fix * lint fix * fix formatting issues in rt_mod_hip.cc * add random seed initialization for deterministic testing
-
- 24 Jan, 2025 1 commit
-
-
Lei Wang authored
[Doc] Update documentation structure and content: add overview section, revise project name, and change theme to Furo (#39)
-
- 23 Jan, 2025 1 commit
-
-
Wenhao Xie authored
* [Doc] Use sphinx to generate docs. * [Doc] Fix a bug on tlcpack_sphinx_addon. * [Doc] Fix linting issues. * [Doc] Create a workflow to host docs using GitHub Pages. * [Doc] Remove all deprecated docs. * [Doc] Update GitHub Actions workflow for documentation deployment and add CNAME file. * [CI] Remove conditional check for GitHub Pages deployment on main branch.
-
- 22 Jan, 2025 1 commit
-
-
Cunxiao Ni authored
* [Doc] fix installation.rst * [Doc] fix installation
-
- 21 Jan, 2025 2 commits
-
-
Cunxiao Ni authored
* [Doc] update installation.md and readme * solve conflicts * change readme * fix installation.rst * fix readme * fix installation
-
Wenhao Xie authored
* [Doc] Use sphinx to generate docs. * [Doc] Fix a bug on tlcpack_sphinx_addon. * [Doc] Fix linting issues.
-
- 20 Jan, 2025 1 commit
-
-
Lei Wang authored
* instruction update * replace link with TileLang/tile-lang * [Dev][Adapter] Implement Torch DLPack Kernel Adapter and related utilities * lint fix * Implement JIT Compiler Components * Documents update * lint fix * update logo * install script fix
-
- 18 Jan, 2025 1 commit
-
-
Lei Wang authored
* instruction update * replace link with TileLang/tile-lang
-
- 11 Jan, 2025 2 commits
-
-
Lei Wang authored
* README.md fixed * update test ci * Lint and Typo Fix * Clang Format Lint Fix
-
Lei Wang authored
* Add format.sh script for code formatting and linting * docs update * center align the title * lint fix * add ignore * Add .gitignore for 3rdparty directory * Add requirements-dev.txt, requirements-test.txt, and requirements.txt * 3rdparty * Add gemm.h, CMakeLists.txt, _ffi_api.py, __init__.py, runtime.h, reduce.h, loop_partition.h, utils.h, and loop_vectorize.h * Refactor CMakeLists.txt and include statements - Update CMakeLists.txt to use a newer version of CMake and add project name - Remove unnecessary include directories Fix include paths in layout.cc, codegen.cc, codegen.h, rt_mod.cc, frontend_legalize.cc, inject_pipeline.cc, layout_inference.cc, loop_vectorize.cc, and lower_tile_op.cc - Update include paths to use relative paths instead of absolute paths * Update submodule for 3rdparty/tvm * update * load dll first * Refactor CMakeLists.txt and include statements * Refactor CMakeLists.txt and include statements * git keep update * Refactor CMakeLists.txt and include statements * Refactor CMakeLists.txt and include statements * refactor code structure * Update Readme * CMakeLists Customized * update readme * update README * update readme * update usage * with TVM_IMPORT_PYTHON_PATH to handle own tvm build python import * annotate lower transform global func with `transform` prefix * Migrate Simplify Pass from tilelang tvm branch * enhance system environment handling with __init__ and CMake * Initial commit * CODE_OF_CONDUCT.md committed * LICENSE committed * README.md committed * SECURITY.md committed * SUPPORT.md committed * CODE_OF_CONDUCT Commit * LICENSE Commit * SECURITY Commit * SUPPORT Commit * Modify Support * Update README.md * security ci update * remove examples * Update and implement clang-format * add composable kernel components * Migrate from latest update * submodule update * Test update * Update License * Spell check * lint fix * add clang-tidy to apply static analysis for c source * update tilelang examples * Update Install Docs * Refactor filetree * Enhance Install * conflict resloved * annotate_version * Initial Update * test fix * install * Implement setup.py * lint fix * Separate Init * Separate test * docker file commit * add logo * Update Readme and Examples * update readme * update logo * Implement AMD Installation * Add License * Update AMD MI300x Benchmark * update README * update mi300 benchmark scripts * update ignore * enhance build scirpt * update image * enhance setup.py to remove duplicated libraries * remove debug files * update readme * update image * update gemm examples * update flashattention README * readme update * add cmake into requirements * libinfo fix * auto update submodule * lint fix * Fix AMD Build and Test * Update check for transpose attribute for CDNA Arch * typo fix for amd * Implement Matmul Benchmark * Refactor Code * [TypoFix] Fix GEMM Example * [Docs] Init Linear Attention README * [TYPO] Typo fix * [Lint] Lint Fix * enhance example with intrinsics * [Enhancement] Improve Buffer Collection during IR Parser * [Dev] Introduce Current classmethod to get current frame * submodule update * fake test pass update * support thread_extent_api * code optimize * Add GEMM function implementation for matrix multiplication * Update logging format to reflect TileLang in logger messages * Refactor CMakeLists.txt for improved readability and set default build type to Release * Support Gemm SS Primitives Implementation * [README] Upload Tile Language Logo (#5) * update logo * Update README.md to enhance formatting and center the title --------- Co-authored-by:
microsoft-github-operations[bot] <55726097+microsoft-github-operations[bot]@users.noreply.github.com> Co-authored-by:
Microsoft Open Source <microsoftopensource@users.noreply.github.com> Co-authored-by:
Yu Cheng <yu.cheng@pku.edu.cn>
-