1. 11 Aug, 2025 1 commit
  2. 16 Jun, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Phaseout tf32 Casting from GEMM Templates (#573) · 9ba8b480
      Lei Wang authored
      * [Feature] Add Quarter Bank Swizzle Layout and Update GEMM Layout Logic
      
      - Introduced a new `makeQuarterBankSwizzleLayout` function for layout swizzling of 32 bytes.
      - Updated `makeGemmABLayout` to include an `enable_padding` parameter, allowing for conditional layout selection between padded and quarter bank swizzle layouts.
      - Adjusted layout inference in GEMM operations to utilize the new quarter bank swizzle layout when appropriate.
      - Enhanced bulk copy operations to recognize and handle the new layout type, improving memory access patterns.
      
      * lint fix
      
      * [Refactor] Update GEMM Layout Functions and Inference Logic
      
      - Removed the `enable_padding` parameter from `makeGemmABLayout` to simplify its signature.
      - Introduced `makeGemmABLayoutHopper` for enhanced layout handling specific to Hopper architecture.
      - Updated layout inference in GEMM operations to utilize the new `makeGemmABLayoutHopper` function, improving clarity and maintainability in layout selection.
      - Adjusted related layout functions to ensure consistent behavior across different architectures.
      
      * [Refactor] Remove tf32 Casting Logic from GEMM Templates
      
      - Eliminated the `cast_float_to_tf32` function from `gemm_sm80`, `gemm_sm89`, and `gemm_sm90` templates to streamline the code.
      - Removed conditional casting logic for float32 to tfloat32 conversion, enhancing clarity and maintainability.
      - Updated relevant sections in GEMM operations to reflect the removal of casting, ensuring consistent behavior across templates.
      - Adjusted tensor view handling to improve performance and accuracy in matrix operations.
      
      * Update bulk_copy.cc
      
      * Fix profiler initialization in GEMM test by removing TensorSupplyType argument for improved flexibility.
      9ba8b480
  3. 07 Jun, 2025 1 commit
    • Lei Wang's avatar
      [Bugfix] Add tf32 casting to GEMM templates (#556) · 8cc8db52
      Lei Wang authored
      * Add tf32 casting functionality to GEMM templates
      
      - Introduced a `cast_float_to_tf32` function to convert float32 values to tfloat32 format across gemm_sm80, gemm_sm89, and gemm_sm90 templates.
      - Implemented conditional casting in relevant sections of the GEMM operations to ensure compatibility with tfloat32 types.
      - Enhanced the handling of tensor views to support the new casting logic, improving performance and accuracy in matrix operations.
      
      * lint fix
      
      * Refactor tfloat32 casting logic in GEMM templates
      
      - Replaced the `is_tfloat32` boolean with `need_tfloat32_cast` to improve clarity and accuracy in determining when to cast float32 to tfloat32.
      - Updated relevant sections in `gemm_sm80`, `gemm_sm89`, and `gemm_sm90` to utilize the new casting logic, enhancing compatibility with tfloat32 types.
      - Ensured consistent application of casting across tensor views, improving performance and correctness in matrix operations.
      
      * Refactor GEMM template functions for improved readability
      
      - Simplified the function signature of `body_rs` in both `gemm_sm80` and `gemm_sm90` templates for better clarity.
      - Adjusted the casting logic in `gemm_sm90` to ensure consistent application of `cast_float_to_tf32` across tensor views, enhancing performance and maintainability.
      
      * Enhance tf32 casting logic in GEMM templates
      
      - Updated the `cast_float_to_tf32` function in `gemm_sm80`, `gemm_sm89`, and `gemm_sm90` to conditionally apply the casting only if the input is finite, improving robustness.
      - Simplified the `need_tfloat32_cast` logic to clarify the conditions under which tfloat32 casting is required, enhancing code readability and maintainability.
      
      * Refactor GEMM template functions and layout inference logic
      
      - Removed the `cast_float_to_tf32` function from `gemm_sm90` and updated the `body_sr` function to streamline the casting process for tensor views, enhancing code clarity and maintainability.
      - Improved layout inference in `layout_inference.cc` by adding checks for the layout map's definition, ensuring robustness in handling layout annotations.
      - Simplified the handling of layout maps in the `annotate_layout` function, allowing for more flexible layout definitions and error handling.
      8cc8db52
  4. 22 May, 2025 1 commit
    • Lei Wang's avatar
      [Bugfix] Enhance smem copy selector for uncommon shape (#510) · dbe8689f
      Lei Wang authored
      * [Refactor] Enhance GEMM warp partitioning logic for improved performance and flexibility
      
      * Updated the warp partitioning logic in `Gemm::ComputeWarpPartition` to better handle various GEMM policies, including FullRow, FullCol, and Square.
      * Implemented checks to dynamically adjust warp allocation based on matrix dimensions, ensuring optimal performance.
      * Introduced a new `SelectCopy` template to streamline memory access patterns in CUDA templates, enhancing compatibility across different architectures.
      * Refactored the Python `GemmWarpPolicy` class to align with the updated C++ logic, improving clarity and maintainability in warp allocation strategies.
      
      * [Refactor] Optimize matrix multiplication parameters and performance in quickstart example
      
      * Updated thread count in the kernel context from 256 to 128 to enhance performance.
      * Increased block sizes for matrix dimensions (M, N, block_M, block_N) to 1024 and 128 respectively, improving computational efficiency.
      * Adjusted the pipeline stages in the GEMM loop from 0 to 3 for better parallel execution.
      * Cleaned up comments for clarity and corrected a typo in the memory copy comment.
      
      * [Refactor] Simplify Copy type selection in OperandTraits for improved clarity
      
      * Replaced the conditional Copy type definition with a new SelectCopy template in OperandTraits, enhancing readability and maintainability of the code.
      * This change streamlines the logic for selecting memory copy patterns based on matrix dimensions and warp configurations.
      dbe8689f
  5. 17 May, 2025 2 commits
    • Lei Wang's avatar
      [Bugfix] Rename SM75_U16x8_LDSM_N into SM75_U16x8_LDSM_T for correctness (#499) · 2837878f
      Lei Wang authored
      * Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.
      
      * Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.
      
      * Add merge shared memory allocations pass and related configurations
      
      - Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
      - Registered configuration options for debugging and controlling the merging behavior.
      - Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
      - Adjusted import paths and added documentation for the new functionality.
      
      * Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py
      
      * Update Copy type in OperandTraits for GEMM templates to use conditional selection based on num_warp_n. This change enhances memory access patterns for different configurations in CUDA kernels.
      
      * lint fix
      
      * Update Copy type in OperandTraits for GEMM templates to use SM75_U16x4_LDSM_T and SM75_U16x8_LDSM_T for improved memory access patterns across CUDA architectures.
      2837878f
    • Lei Wang's avatar
      [Enhancement] Fallback transposed_ldmatrix into `SM75_U16x4_LDSM_N` when warp_n is 8 (#498) · 68a3c4f3
      Lei Wang authored
      * Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.
      
      * Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.
      
      * Add merge shared memory allocations pass and related configurations
      
      - Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
      - Registered configuration options for debugging and controlling the merging behavior.
      - Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
      - Adjusted import paths and added documentation for the new functionality.
      
      * Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py
      
      * Update Copy type in OperandTraits for GEMM templates to use conditional selection based on num_warp_n. This change enhances memory access patterns for different configurations in CUDA kernels.
      
      * lint fix
      68a3c4f3
  6. 25 Apr, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Support cute mma tile mxn8ky (#434) · d1c15bc5
      Lei Wang authored
      * [Enhancement] Improve error handling in layout inference and update profiler type in tests
      
      * Added a detailed error message in the layout inference for local.fragment to clarify the requirement for trans_B.
      * Updated the profiler type in the cumulative sum test from TensorSupplyType.One to TensorDistributionType.Randn for better profiling accuracy.
      
      * lint fix
      
      * [Refactor] Update OperandTraits to include num_warp_n parameter
      
      * Modified OperandTraits templates across gemm_sm80.h, gemm_sm89.h, and gemm_sm90.h to include an additional num_warp_n parameter for improved flexibility in layout and copy operations.
      * Adjusted Copy type selection based on the new parameter to enhance performance and adaptability in various scenarios.
      
      * lint fix
      
      * [Refactor] Update DispatchInstruction templates to include N parameter
      
      * Modified DispatchInstruction templates in gemm_sm80.h, gemm_sm89.h, and gemm_sm90.h to include an additional N parameter, enhancing flexibility in tile size calculations.
      * Adjusted MMA_Group definitions to use std::min for improved handling of warp sizes, ensuring better performance and adaptability in various scenarios.
      d1c15bc5
  7. 20 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Phaseout LLVM Dependency by Making it Optional (#247) · f2e99180
      Lei Wang authored
      * remove llvm build
      
      * [Refactor] Update kernel compilation and profiling in examples
      
      - Replaced `tilelang.lower` with `tilelang.compile` in multiple example scripts to streamline kernel compilation.
      - Updated profiling calls to utilize the new `get_profiler` method, enhancing performance measurement consistency.
      - Adjusted assertions and benchmarking methods to align with the new profiling structure across various examples, ensuring correctness and clarity in performance evaluations.
      
      * lint fix
      
      * License Update
      
      * [Refactor] Improve code formatting and documentation in CUDA header and HIP runtime files
      
      - Adjusted formatting in `cuda.h` for better readability, including alignment of comments and struct fields.
      - Cleaned up whitespace and improved comment clarity in `rt_mod_hip.cc` to enhance code maintainability.
      
      * [Refactor] Enhance formatting and clarity in CUDA header and HIP runtime files
      
      - Improved comment alignment and readability in `cuda.h`.
      - Cleaned up whitespace and formatting in `rt_mod_hip.cc` to enhance maintainability.
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * fix
      
      * License update
      
      * [Enhancement] Update JITKernel to use artifact for kernel source
      
      - Assigned the generated artifact to `self.artifact` for better management.
      - Updated kernel source references to use `artifact.kernel_source` for consistency in execution backend handling.
      
      * lint fix
      
      * Add @tilelang.testing.requires_llvm decorator to vectorization tests
      
      * Enhance setup.py and env.py for library management
      
      - Added functionality to remove original files after copying in CMakeBuild.
      - Updated TVM_LIBRARY_PATH in env.py to include the PyPI build library path for better integration.
      
      * Refactor TVM_LIBRARY_PATH assignment for improved readability in env.py
      
      * Refactor CMakeBuild file handling in setup.py
      
      - Added a check to ensure the target library directory exists before copying .so files.
      - Improved the logic for creating the target directory and copying files to enhance robustness.
      
      * bugfix
      
      * Rename BuildTLDebug to BuildTileLangCUDAWithoutCompile and update registration. Add @tilelang.testing.requires_llvm decorator to multiple tests for LLVM requirement.
      
      * lint fix
      
      * Enhance TileLang code generation by adding support for device code generation without compilation. Updated `host_codegen` and `device_codegen` functions to include new transformations and registration for `tilelang_hip_without_compile`. Refactored JIT kernel adapters to accommodate host and device modules, improving overall integration and flexibility.
      
      * lint fix
      
      * Add support for C target in device code generation
      
      - Updated `device_codegen_without_compile` to include handling for the C target by registering the `tilelang_cpp` function.
      
      * [Enhancement] Implement auto-clear cache feature based on environment variable
      
      * Added TILELANG_CLEAR_CACHE environment variable to control cache clearing.
      * Updated CI workflow to set TILELANG_CLEAR_CACHE during testing.
      * Modified cache initialization to clear cache if TILELANG_CLEAR_CACHE is set to true.
      
      * [Refactor] Update kernel invocation and import paths in tests and cache
      
      * Changed kernel invocation in `test_tilelang_kernel_dequantize_gemm.py` to return the result.
      * Updated import statements in `test_tilelang_kernel_int4_gemm_mma.py` to use `bitblas` instead of `tilelang`.
      * Refactored paths for artifact and parameters in `kernel_cache.py` for better maintainability.
      
      * [Refactor] Clean up whitespace and improve code formatting in kernel_cache.py
      
      * Removed unnecessary blank lines and adjusted spacing for better readability in the KernelCache class.
      * Enhanced overall code formatting to align with project standards.
      
      * [Enhancement] Add bfloat16 test case and improve kernel caching logic
      
      * Introduced a new test case for bfloat16 matrix multiplication in `test_tilelang_kernel_gemm_mma_intrinsic.py`.
      * Updated `KernelCache` to handle multiple kernel source files and improve error handling during saving and loading.
      * Refactored `JITKernel` to support instantiation from a database, enhancing flexibility in kernel management.
      * Adjusted `CtypesKernelAdapter` and `CythonKernelAdapter` to utilize the new kernel loading mechanism from the database.
      * Improved code formatting and readability across several files.
      
      * lint fix
      
      * Update bfloat16 matrix multiplication test case to use larger dimensions for improved coverage
      f2e99180
  8. 19 Mar, 2025 1 commit
    • Yu Cheng's avatar
      [Enhancement] Add zero initialization option to GEMM operations (#246) · 701e9234
      Yu Cheng authored
      * [Enhancement] Add zero initialization option to GEMM operations
      
      - Introduced a new `zero_init` parameter to the GEMM function, allowing for optional zero initialization of the accumulator.
      - Updated the GEMM implementation across various CUDA architectures to support the new parameter.
      - Modified the Python interface for GEMM to include the `zero_init` argument, enhancing flexibility in kernel execution.
      - Ensured compatibility with existing functionality while improving initialization control for performance optimization.
      
      * rename zero_init to clear_accum
      
      * lint
      701e9234
  9. 13 Mar, 2025 1 commit
    • zqh-wz's avatar
      [Feature] Upgrade cutlass version and support fp8 T.gemm (#202) · 2cccf1f5
      zqh-wz authored
      
      
      * upgrade cutlass to upstream v3.8.0
      
      * Implement fp8 gemm and add example script
      
      * Fix dtype retrieval with map_torch_type for fp8 inputs
      
      * Disable vectorization of fp8 values
      
      * Make MMA declaration compatible with cutlass 3.4.0+
      
      * Add test for fp8 T.gemm
      
      * fix indent
      
      * fix indent
      
      * Add copyright and license header
      
      * Add copyright and license header
      
      * lint fix
      
      * Refactor matmul_nt and assert_matmul_correctness functions for improved readability by consolidating parameter definitions and adjusting formatting.
      
      * clang format lint
      
      ---------
      Co-authored-by: default avatarLei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: default avatarLeiWang1999 <leiwang1999@outlook.com>
      2cccf1f5
  10. 11 Jan, 2025 2 commits
    • Lei Wang's avatar
      [Lint] Overall Typo and Linting Fixes (#13) · fa511857
      Lei Wang authored
      * README.md fixed
      
      * update test ci
      
      * Lint and Typo Fix
      
      * Clang Format Lint Fix
      fa511857
    • Lei Wang's avatar
      [Initialization] Migration of Codebase from Dev Branch into Main (#10) · 57ab687c
      Lei Wang authored
      
      
      * Add format.sh script for code formatting and linting
      
      * docs update
      
      * center align the title
      
      * lint fix
      
      * add ignore
      
      * Add .gitignore for 3rdparty directory
      
      * Add requirements-dev.txt, requirements-test.txt, and requirements.txt
      
      * 3rdparty
      
      * Add gemm.h, CMakeLists.txt, _ffi_api.py, __init__.py, runtime.h, reduce.h, loop_partition.h, utils.h, and loop_vectorize.h
      
      * Refactor CMakeLists.txt and include statements
      
      - Update CMakeLists.txt to use a newer version of CMake and add project name
      - Remove unnecessary include directories
      
      Fix include paths in layout.cc, codegen.cc, codegen.h, rt_mod.cc, frontend_legalize.cc, inject_pipeline.cc, layout_inference.cc, loop_vectorize.cc, and lower_tile_op.cc
      
      - Update include paths to use relative paths instead of absolute paths
      
      * Update submodule for 3rdparty/tvm
      
      * update
      
      * load dll first
      
      * Refactor CMakeLists.txt and include statements
      
      * Refactor CMakeLists.txt and include statements
      
      * git keep update
      
      * Refactor CMakeLists.txt and include statements
      
      * Refactor CMakeLists.txt and include statements
      
      * refactor code structure
      
      * Update Readme
      
      * CMakeLists Customized
      
      * update readme
      
      * update README
      
      * update readme
      
      * update usage
      
      * with TVM_IMPORT_PYTHON_PATH to handle own tvm build python import
      
      * annotate lower transform global func with `transform` prefix
      
      * Migrate Simplify Pass from tilelang tvm branch
      
      * enhance system environment handling with __init__ and CMake
      
      * Initial commit
      
      * CODE_OF_CONDUCT.md committed
      
      * LICENSE committed
      
      * README.md committed
      
      * SECURITY.md committed
      
      * SUPPORT.md committed
      
      * CODE_OF_CONDUCT Commit
      
      * LICENSE Commit
      
      * SECURITY Commit
      
      * SUPPORT Commit
      
      * Modify Support
      
      * Update README.md
      
      * security ci update
      
      * remove examples
      
      * Update and implement clang-format
      
      * add composable kernel components
      
      * Migrate from latest update
      
      * submodule update
      
      * Test update
      
      * Update License
      
      * Spell check
      
      * lint fix
      
      * add clang-tidy to apply static analysis for c source
      
      * update tilelang examples
      
      * Update Install Docs
      
      * Refactor filetree
      
      * Enhance Install
      
      * conflict resloved
      
      * annotate_version
      
      * Initial Update
      
      * test fix
      
      * install
      
      * Implement setup.py
      
      * lint fix
      
      * Separate Init
      
      * Separate test
      
      * docker file commit
      
      * add logo
      
      * Update Readme and Examples
      
      * update readme
      
      * update logo
      
      * Implement AMD Installation
      
      * Add License
      
      * Update AMD MI300x Benchmark
      
      * update README
      
      * update mi300 benchmark scripts
      
      * update ignore
      
      * enhance build scirpt
      
      * update image
      
      * enhance setup.py to remove duplicated libraries
      
      * remove debug files
      
      * update readme
      
      * update image
      
      * update gemm examples
      
      * update flashattention README
      
      * readme update
      
      * add cmake into requirements
      
      * libinfo fix
      
      * auto update submodule
      
      * lint fix
      
      * Fix AMD Build and Test
      
      * Update check for transpose attribute for CDNA Arch
      
      * typo fix for amd
      
      * Implement Matmul Benchmark
      
      * Refactor Code
      
      * [TypoFix] Fix GEMM Example
      
      * [Docs] Init Linear Attention README
      
      * [TYPO] Typo fix
      
      * [Lint] Lint Fix
      
      * enhance example with intrinsics
      
      * [Enhancement] Improve Buffer Collection during IR Parser
      
      * [Dev] Introduce Current classmethod to get current frame
      
      * submodule update
      
      * fake test pass update
      
      * support thread_extent_api
      
      * code optimize
      
      * Add GEMM function implementation for matrix multiplication
      
      * Update logging format to reflect TileLang in logger messages
      
      * Refactor CMakeLists.txt for improved readability and set default build type to Release
      
      * Support Gemm SS Primitives Implementation
      
      * [README] Upload Tile Language Logo (#5)
      
      * update logo
      
      * Update README.md to enhance formatting and center the title
      
      ---------
      Co-authored-by: default avatarmicrosoft-github-operations[bot] <55726097+microsoft-github-operations[bot]@users.noreply.github.com>
      Co-authored-by: default avatarMicrosoft Open Source <microsoftopensource@users.noreply.github.com>
      Co-authored-by: default avatarYu Cheng <yu.cheng@pku.edu.cn>
      57ab687c