1. 16 May, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Introduce flag to visualize shared memory merge plan (#496) · dca2fb48
      Lei Wang authored
      * Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.
      
      * Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.
      
      * Add merge shared memory allocations pass and related configurations
      
      - Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
      - Registered configuration options for debugging and controlling the merging behavior.
      - Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
      - Adjusted import paths and added documentation for the new functionality.
      
      * Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py
      dca2fb48
  2. 06 May, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Add new examples for warp specialization and TMA integration (#448) · b5faf25a
      Lei Wang authored
      * [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic
      
      * Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
      * Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.
      
      * [Refactor] Rename operations for consistency in lower_hopper_intrin and related files
      
      * Updated function names from CamelCase to snake_case for better consistency across the codebase.
      * Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
      * Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.
      
      * [Refactor] Rename operations to snake_case for consistency
      
      * Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
      * Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
      * Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.
      
      * [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier
      
      * Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
      * Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
      * Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
      * Enhanced the TileLang API with new methods for retrieving block and thread extents.
      * Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
      * Improved layout inference and kernel launch logic for better performance and clarity.
      
      * [Refactor] Clean up code formatting and improve readability
      
      * Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
      * Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
      * Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
      * Ensured consistent spacing and formatting across multiple files to enhance overall code readability.
      
      * lint fix
      
      * [Refactor] Update mbarrier functions for improved clarity and consistency
      
      * Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
      * Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
      * Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
      * Added detailed docstrings to clarify usage examples for memory barrier functions.
      
      * Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.
      
      * [Feature] Add examples for warp specialization and TMA barrier integration
      
      * Introduced three new example scripts: `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, and `example_warp_specialize_mla.py` demonstrating matrix multiplication with warp specialization and TMA barriers.
      * Implemented kernel functions with shared memory allocation and memory barrier synchronization for improved performance.
      * Enhanced the TileLang API with new methods for compiling and testing kernels in Python using PyTorch.
      * Updated the `phase.py` to include TMA barrier injection in the optimization process.
      * Improved documentation and comments for better clarity on usage and functionality.
      
      * [Feature] Add example for warp specialization in GEMM with TMA barriers
      
      * Introduced a new example script `example_warp_specialize_gemm_stage2.py` demonstrating matrix multiplication using warp specialization and TMA barriers.
      * Implemented a kernel function with shared memory allocation and memory barrier synchronization for enhanced performance.
      * Included functionality to compile the kernel into a PyTorch-compatible function and validate its correctness against PyTorch's reference implementation.
      * Enhanced documentation and comments for clarity on usage and functionality.
      
      * lint fix
      
      * [Feature] Implement WarpSpecializedDetector for TMA and MBarrier Detection
      
      * Added the `WarpSpecializedDetector` class to identify the presence of TMA operations and memory barrier operations within a given TIR statement.
      * Enhanced the `WarpSpecialized` pass to utilize the detector, allowing for conditional substitution based on the detection results.
      * Improved code organization by including necessary headers and utilizing the `IRVisitorWithAnalyzer` for analysis.
      * This addition aims to optimize warp specialization by ensuring that only relevant functions are transformed, enhancing performance and correctness.
      
      * lint fix
      
      * [Feature] Add new examples for warp specialization and TMA integration
      
      * Introduced multiple new example scripts demonstrating warp specialization techniques, including `example_warp_specialize_flashmla.py`, `example_warp_specialize_gemm_barrierpipe_stage2.py`, `example_warp_specialize_gemm_copy_0_gemm_1.py`, `example_warp_specialize_gemm_copy_1_gemm_0.py`, and `example_warp_specialize_gemm_softpipe_stage2.py`.
      * Each example showcases matrix multiplication with warp specialization and TMA barriers, implementing kernel functions with shared memory allocation and memory barrier synchronization for enhanced performance.
      * Added a test suite in `test_example_warp_specialize.py` to validate the functionality of the new examples.
      * Updated the TileLang API to support these examples and improve kernel compilation and testing processes.
      * Removed outdated example scripts to streamline the codebase and enhance clarity on available functionalities.
      
      * lint fix
      
      * Remove outdated example scripts for warp specialization and TMA integration to streamline the codebase. This includes `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, `example_warp_specialize_gemm_stage2.py`, and `example_warp_specialize_mla.py`, which are no longer needed following recent updates and improvements in the TileLang API.
      b5faf25a
  3. 30 Apr, 2025 1 commit
    • Lei Wang's avatar
      [Language] Support explicit programming for identified warp groups (#445) · 6972aed7
      Lei Wang authored
      * [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic
      
      * Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
      * Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.
      
      * [Refactor] Rename operations for consistency in lower_hopper_intrin and related files
      
      * Updated function names from CamelCase to snake_case for better consistency across the codebase.
      * Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
      * Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.
      
      * [Refactor] Rename operations to snake_case for consistency
      
      * Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
      * Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
      * Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.
      
      * [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier
      
      * Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
      * Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
      * Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
      * Enhanced the TileLang API with new methods for retrieving block and thread extents.
      * Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
      * Improved layout inference and kernel launch logic for better performance and clarity.
      
      * [Refactor] Clean up code formatting and improve readability
      
      * Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
      * Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
      * Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
      * Ensured consistent spacing and formatting across multiple files to enhance overall code readability.
      
      * lint fix
      
      * [Refactor] Update mbarrier functions for improved clarity and consistency
      
      * Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
      * Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
      * Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
      * Added detailed docstrings to clarify usage examples for memory barrier functions.
      
      * Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.
      6972aed7
  4. 13 Apr, 2025 1 commit
  5. 08 Apr, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Support pass config `disable_warp_specialize` to disable auto... · 7fdcedd0
      Lei Wang authored
      [Enhancement] Support pass config `disable_warp_specialize` to disable auto specialization on hopper (#357)
      
      * [Enhancement] Add warp specialization configuration option and update related functionality
      
      * [Add] Introduced a new pass configuration option `kDisableWarpSpecialized` to control warp specialization behavior.
      * [Refactor] Updated `WarpSpecializedRewriter` and `WSCodeEmitter` to utilize the new configuration option, allowing for more flexible optimization strategies.
      * [Update] Modified the optimization pipeline in `phase.py` to include pipeline planning when warp specialization is disabled, enhancing performance with async copy.
      * [Documentation] Updated JIT compilation parameters to reflect the new configuration option for better clarity.
      
      * lint fix
      
      * [Add] Implement test for GEMM with warp specialization configuration
      
      * Introduced a new test file `test_tilelang_pass_config_disable_warp_specialized.py` to validate the functionality of the warp specialization configuration option.
      * Added a `run_gemm` function to execute matrix multiplication tests with and without warp specialization, ensuring correctness through profiling against reference results.
      * Included a specific test case for GEMM with float16 data types, enhancing test coverage for the new configuration feature.
      
      * [Refactor] Improve formatting in test_tilelang_pass_config_disable_warp_specialized.py
      
      * Reformatted the `tilelang.compile` call in the `run_gemm` function for better readability by breaking it into multiple lines.
      * Added a blank line for improved code structure and clarity in the `test_gemm_f16f16f16_nn` function.
      7fdcedd0
  6. 06 Apr, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Support index bit width configuration (#343) · 70546adc
      Lei Wang authored
      
      
      * [Refactor] Clean up whitespace in CUDA-related files
      
      - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
      - This change enhances the overall organization of the codebase without altering functionality.
      
      * [Benchmark] Add FP8 Matrix Multiplication Benchmark Script
      
      - Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
      - The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
      - Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
      - The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations.
      
      * lint fix
      
      * Enhance variable creation by associating data types in IR and layout files, and introduce ExpandIndexDataType transformation
      
      - Updated variable creation in `ir.cc`, `gemm_layouts.cc`, and `elem.cc` to include data types for better type safety.
      - Added a new transformation `ExpandIndexDataType` to promote integer types to int64 where necessary, improving compatibility and performance.
      - Integrated the new transformation into the optimization pipeline in `phase.py`.
      - Documented the new transformation in `__init__.py` for clarity.
      
      * lint fix
      
      * Add configuration option for index bitwidth and remove ExpandIndexDataType transformation
      
      - Introduced a new pass configuration option `kConfigIndexBitwidth` to allow customization of index bitwidth.
      - Updated the optimization pipeline in `phase.py` to utilize the new configuration option instead of the removed `ExpandIndexDataType` transformation.
      - Documented the new configuration option in the JIT compilation function's parameters for clarity.
      - Removed the `ExpandIndexDataType` transformation implementation from the codebase to streamline the transformation process.
      
      * lint fix
      
      * Refactor index bitwidth configuration handling
      
      - Updated the `ConfigIndexBitwidth` pass to only apply the bitwidth transformation if the configuration option is defined, preventing potential errors with undefined values.
      - Changed the default value of `tl.config_index_bitwidth` in the JIT compilation function's parameters from 32 to None for better clarity and flexibility.
      
      * lint fix
      
      * lint fix
      
      ---------
      Co-authored-by: default avatarLeiWang1999 <wyatuestc@gmail.com>
      70546adc
  7. 26 Mar, 2025 1 commit
    • Yu Cheng's avatar
      [Feature] Introduce NoSetMaxNReg for warp specialization (#289) · 76435ca8
      Yu Cheng authored
      - Added NoSetMaxNReg as a new TIR built-in to indicate no register hint for warp-specialized branches.
      - Updated the warp specialization rewriter to handle the new NoSetMaxNReg operation, allowing for improved register management.
      - Enhanced the Python interface to include NoSetMaxNReg for consistency with TIR operations.
      76435ca8
  8. 12 Mar, 2025 1 commit
    • Yu Cheng's avatar
      [Feature] Add TMA Store Synchronization Support (#195) · eba7dd5a
      Yu Cheng authored
      - Introduce TMAStoreArrive and TMAStoreWait operations for CUDA TMA store synchronization
      - Add new builtin operations in op/builtin.cc and op/builtin.h
      - Implement TMAStoreSyncInjector to automatically inject TMA store synchronization calls
      - Update CUDA codegen to support new TMA store synchronization intrinsics
      - Add Python language bindings for new TMA store synchronization operations
      eba7dd5a
  9. 09 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Feat] Append Pass Context and TMA lowering configuration option (#175) · fb6b101c
      Lei Wang authored
      * Add TMA lowering configuration option and update copyright notices
      
      This commit introduces a new configuration option to disable TMA (Tensor Memory Access) lowering and updates copyright notices across multiple files. Key changes include:
      
      - Add `kDisableTMALower` configuration option in builtin.h and builtin.cc
      - Update copyright notices from Microsoft Corporation to Tile-AI Corporation
      - Modify `LowerArgs` struct to include `disable_tma_lower` flag
      - Update JIT compilation interfaces to support pass configuration
      - Enhance error reporting in bulk copy lowering
      - Propagate pass configuration through various adapter layers
      
      * lint fix
      fb6b101c
  10. 27 Jan, 2025 1 commit
    • Yu Cheng's avatar
      [CI][Test] Add test cases for tilelang transform LowerHopperIntrin (#59) · 7d4156df
      Yu Cheng authored
      * [Dev] Add FlashDecoding example
      
      * [CI][Test] Add test cases for tilelang kernel convolution
      
      * [CI][Test] Add test cases for tilelang kernel FlashAttention
      
      * Reduce the number of stages to ensure the shared memory allocation is valid
      
      * Temporarily remove the dim128 case
      
      * lint
      
      * update einops in requirements-dev.txt
      
      * update einops in requirements-test.txt
      
      * remove einops in requirements-dev.txt
      
      * [CI][Test] Add test cases for tilelang transform ClusterPlanning
      
      * [CI][Test] Add test cases for tilelang transform LowerHopperIntrin
      7d4156df
  11. 11 Jan, 2025 2 commits
    • Lei Wang's avatar
      [Lint] Overall Typo and Linting Fixes (#13) · fa511857
      Lei Wang authored
      * README.md fixed
      
      * update test ci
      
      * Lint and Typo Fix
      
      * Clang Format Lint Fix
      fa511857
    • Lei Wang's avatar
      [Initialization] Migration of Codebase from Dev Branch into Main (#10) · 57ab687c
      Lei Wang authored
      
      
      * Add format.sh script for code formatting and linting
      
      * docs update
      
      * center align the title
      
      * lint fix
      
      * add ignore
      
      * Add .gitignore for 3rdparty directory
      
      * Add requirements-dev.txt, requirements-test.txt, and requirements.txt
      
      * 3rdparty
      
      * Add gemm.h, CMakeLists.txt, _ffi_api.py, __init__.py, runtime.h, reduce.h, loop_partition.h, utils.h, and loop_vectorize.h
      
      * Refactor CMakeLists.txt and include statements
      
      - Update CMakeLists.txt to use a newer version of CMake and add project name
      - Remove unnecessary include directories
      
      Fix include paths in layout.cc, codegen.cc, codegen.h, rt_mod.cc, frontend_legalize.cc, inject_pipeline.cc, layout_inference.cc, loop_vectorize.cc, and lower_tile_op.cc
      
      - Update include paths to use relative paths instead of absolute paths
      
      * Update submodule for 3rdparty/tvm
      
      * update
      
      * load dll first
      
      * Refactor CMakeLists.txt and include statements
      
      * Refactor CMakeLists.txt and include statements
      
      * git keep update
      
      * Refactor CMakeLists.txt and include statements
      
      * Refactor CMakeLists.txt and include statements
      
      * refactor code structure
      
      * Update Readme
      
      * CMakeLists Customized
      
      * update readme
      
      * update README
      
      * update readme
      
      * update usage
      
      * with TVM_IMPORT_PYTHON_PATH to handle own tvm build python import
      
      * annotate lower transform global func with `transform` prefix
      
      * Migrate Simplify Pass from tilelang tvm branch
      
      * enhance system environment handling with __init__ and CMake
      
      * Initial commit
      
      * CODE_OF_CONDUCT.md committed
      
      * LICENSE committed
      
      * README.md committed
      
      * SECURITY.md committed
      
      * SUPPORT.md committed
      
      * CODE_OF_CONDUCT Commit
      
      * LICENSE Commit
      
      * SECURITY Commit
      
      * SUPPORT Commit
      
      * Modify Support
      
      * Update README.md
      
      * security ci update
      
      * remove examples
      
      * Update and implement clang-format
      
      * add composable kernel components
      
      * Migrate from latest update
      
      * submodule update
      
      * Test update
      
      * Update License
      
      * Spell check
      
      * lint fix
      
      * add clang-tidy to apply static analysis for c source
      
      * update tilelang examples
      
      * Update Install Docs
      
      * Refactor filetree
      
      * Enhance Install
      
      * conflict resloved
      
      * annotate_version
      
      * Initial Update
      
      * test fix
      
      * install
      
      * Implement setup.py
      
      * lint fix
      
      * Separate Init
      
      * Separate test
      
      * docker file commit
      
      * add logo
      
      * Update Readme and Examples
      
      * update readme
      
      * update logo
      
      * Implement AMD Installation
      
      * Add License
      
      * Update AMD MI300x Benchmark
      
      * update README
      
      * update mi300 benchmark scripts
      
      * update ignore
      
      * enhance build scirpt
      
      * update image
      
      * enhance setup.py to remove duplicated libraries
      
      * remove debug files
      
      * update readme
      
      * update image
      
      * update gemm examples
      
      * update flashattention README
      
      * readme update
      
      * add cmake into requirements
      
      * libinfo fix
      
      * auto update submodule
      
      * lint fix
      
      * Fix AMD Build and Test
      
      * Update check for transpose attribute for CDNA Arch
      
      * typo fix for amd
      
      * Implement Matmul Benchmark
      
      * Refactor Code
      
      * [TypoFix] Fix GEMM Example
      
      * [Docs] Init Linear Attention README
      
      * [TYPO] Typo fix
      
      * [Lint] Lint Fix
      
      * enhance example with intrinsics
      
      * [Enhancement] Improve Buffer Collection during IR Parser
      
      * [Dev] Introduce Current classmethod to get current frame
      
      * submodule update
      
      * fake test pass update
      
      * support thread_extent_api
      
      * code optimize
      
      * Add GEMM function implementation for matrix multiplication
      
      * Update logging format to reflect TileLang in logger messages
      
      * Refactor CMakeLists.txt for improved readability and set default build type to Release
      
      * Support Gemm SS Primitives Implementation
      
      * [README] Upload Tile Language Logo (#5)
      
      * update logo
      
      * Update README.md to enhance formatting and center the title
      
      ---------
      Co-authored-by: default avatarmicrosoft-github-operations[bot] <55726097+microsoft-github-operations[bot]@users.noreply.github.com>
      Co-authored-by: default avatarMicrosoft Open Source <microsoftopensource@users.noreply.github.com>
      Co-authored-by: default avatarYu Cheng <yu.cheng@pku.edu.cn>
      57ab687c