1. 10 Sep, 2025 1 commit
  2. 09 Sep, 2025 2 commits
  3. 06 Sep, 2025 3 commits
    • [CI] Adds pytest timeout to CI (#792) · bcfc8343
      Cunxiao Ni authored
      * [CI] Adds pytest timeout to CI
      
      Adds a timeout to pytest runs in CI to prevent jobs from hanging indefinitely.
      It also adds `pytest-timeout` to the test requirements; a usage sketch follows below.
      
      * fix lint
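      A minimal usage sketch, assuming `pytest-timeout` is installed as this PR requires. The CI-wide limit is set on the pytest command line (e.g. `pytest --timeout=...`), while individual tests can override it with a marker; the values here are illustrative:

```python
import time

import pytest


@pytest.mark.timeout(60)  # fail this test if it runs longer than 60 seconds
def test_does_not_hang():
    time.sleep(0.1)  # stand-in for real work, well under the limit
```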
    • [TMA] Automatically lower 1D TMA in appropriate cases (#788) · 9d7d45be
      Lei Wang authored
      * Enhance layout inference and copy operations with 1D TMA support
      
      - Updated `CopyNode` to introduce separate handling for 1D bulk load/store operations, including new methods for checking and lowering these operations.
      - Modified `InferLayout` and `GetCopyInst` to accommodate additional parameters for layout maps and analyzers.
      - Enhanced `AtomicAddNode` and `FillNode` to utilize the updated layout inference logic.
      - Improved buffer out-of-bounds checks during layout inference to ensure safe memory access.
      
      This update improves the efficiency and correctness of memory operations in the TileLang framework.
      
      * Refactor layout inference calls for improved readability
      
      - Updated `InferLayout` calls in `AtomicAddNode`, `CopyNode`, and `FillNode` to enhance code clarity by formatting parameters across multiple lines.
      - Cleaned up whitespace and formatting in `copy.h` and `layout_inference.cc` to adhere to coding standards and improve maintainability.
      
      This refactor aims to streamline the layout inference logic and improve overall code organization.
      
      * Fix shared tensor check in CopyNode for bulk copy operations
      
      - Updated the condition in `CheckBulkCopy1D` to verify contiguity of `shared_tensor` instead of `dst`, ensuring correct handling of shared memory layouts during bulk copy operations.
      - This change enhances the accuracy of memory operations in the TileLang framework.
      
      * Update test_example_gdn_compilation.py to invoke test function directly
      
      - Commented out the call to `tilelang.testing.main()` in `test_example_gdn_compilation.py` and replaced it with a direct call to `test_example_chunk_delta_bwd_compilation()`. This change simplifies the test execution flow and focuses on the specific test case.
      
      * Enhance bulk load/store checks in CopyNode with last dimension validation
      
      - Updated `CheckBulkLoad` and `CheckBulkStore` methods in `CopyNode` to include an optional parameter for validating the last dimension during bulk copy operations.
      - Adjusted related methods `CheckBulkLoad1D` and `CheckBulkStore1D` to pass the new parameter, improving the accuracy of bulk copy checks.
      - This change enhances the robustness of memory operations in the TileLang framework by ensuring compliance with dimensional requirements.
      
      * Refactor CheckBulkLoad and CheckBulkStore methods for improved readability
      
      - Reformatted the parameter lists of `CheckBulkLoad` and `CheckBulkStore` methods in `CopyNode` to enhance code clarity by aligning parameters across multiple lines.
      - This change improves the maintainability of the code and adheres to coding standards.
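      A minimal sketch of the kind of copy this lowering targets: a fully contiguous global-to-shared transfer with no layout remapping. The kernel assumes the standard TileLang `T.copy` API and an SM90+ target; it is illustrative, not a test from this PR:

```python
import tilelang.language as T


def copy_1d_kernel(N, block_N, dtype="float16"):

    @T.prim_func
    def main(A: T.Tensor((N,), dtype), B: T.Tensor((N,), dtype)):
        with T.Kernel(T.ceildiv(N, block_N), threads=128) as bx:
            A_shared = T.alloc_shared((block_N,), dtype)
            # Contiguous 1D transfers like these are what CheckBulkLoad1D /
            # CheckBulkStore1D can now lower to 1D TMA bulk copies.
            T.copy(A[bx * block_N], A_shared)   # global -> shared
            T.copy(A_shared, B[bx * block_N])   # shared -> global

    return main
```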
    • [AMD] Fix MFMA op interface (#791) · b6b02dab
      Jiaxing Ding authored
      
      Co-authored-by: Jiaxing Ding <jiaxing.ding@bytedance.com>
  4. 05 Sep, 2025 3 commits
  5. 04 Sep, 2025 3 commits
    • [Nvidia][SM121] Add intrin.h include to gemm_mma.h for sm120+ (#785) · 6e0c3500
      Hao Kang authored
      Makes the sm120 arch runnable.
    • [AMD] Fix AMD TIR & add examples (#784) · f07f31c1
      alex_xiao authored
      
      
      * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
      
      - Enhanced buffer index handling to address precision issues by removing redundant operations.
      - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
      - Updated related documentation to reflect changes in buffer management practices.
      
      * Remove obsolete test script for AMD example, streamlining the examples directory.
      
      * Remove unused dtype_size variable in AMD example script to streamline code.
      
      * Add input configuration file and update AMD example script for enhanced flexibility
      
      - Introduced a new input.txt file for configurable parameters.
      - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
      - Streamlined the main function for better clarity and organization.
      - Added a new test script to facilitate running the example with specified parameters.
      
      * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
      
      - Deleted input.txt and test.sh files as they are no longer needed.
      - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
      - Reintroduced swizzle usage in the kernel for better performance.
      
      * Refactor AMD example script for FlashAttention-2
      
      - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
      - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
      - Removed outdated comments and improved code organization for better readability.
      
      * Refactor formatting in AMD FlashAttention example script
      
      - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
      - Streamlined the `main` function parameter formatting for consistency.
      - Removed unnecessary blank lines to enhance overall code organization.
      
      * Update example_amd_flash_attn_fwd.py
      
      * Enhance AMD example script and update CI workflows
      
      - Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
      - Added new CI workflows for AMD and documentation publishing.
      - Updated various requirements files to include necessary dependencies.
      - Introduced new test cases and examples for better coverage and functionality.
      - Refactored existing code for improved readability and maintainability.
      
      * Remove redundant tool cache cleanup step in AMD CI workflow
      
      * Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.
      
      * Add new AMD FlashAttention example and test script
      
      - Introduced `example_amd_flash_attn_bwd.py` for backward attention computation using TileLang.
      - Added `test.sh` script to facilitate running the new example with specified parameters.
      - Enhanced the overall structure and organization of the example for better clarity and usability.
      
      * Update configurations in `example_amd_flash_attn_fwd.py` for autotuner
      
      - Reduced the number of threads and `num_split_q` options for improved performance.
      - Adjusted `panel_size` options to streamline configuration settings.
      
      * Update submodule 'tvm' to commit 6ccc74f622c7ec4ac25d430d0f6546e7b9edb217
      
      * Update submodule 'tvm' to commit 14ff70ab142b9e5a31bbf9c7923c8a697d41e86c
      
      * Add example for AMD Flash Attention backward pass implementation
      
      - Introduced a new example script `example_amd_flash_attn_bwd.py` demonstrating the forward and backward operations of Flash Attention using TileLang.
      - Implemented JIT-compiled functions for both forward and backward passes, including preprocessing and postprocessing steps.
      - Added a main function to facilitate testing and benchmarking of the attention mechanism with configurable parameters.
      - Included reference implementation for validation against PyTorch's attention mechanism.
      
      This addition enhances the examples directory by providing a comprehensive guide for users to understand and utilize Flash Attention in their applications.
      
      * Enhance AMD Flash Attention example with additional testing capabilities
      
      - Updated `example_amd_flash_attn_bwd.py` to include more comprehensive testing features for the Flash Attention implementation.
      - Improved the main function to allow for better parameter configuration and benchmarking.
      - Added validation checks against PyTorch's attention mechanism to ensure accuracy and reliability of the example.
      
      This update aims to provide users with a more robust tool for understanding and utilizing Flash Attention in their applications.
      
      * Update submodule TVM to commit a64a5926a6e59f5417ef2501f9d88b467337cf6a
      
      * Refactor HIP intrinsic rules to CUDA
      
      - Updated file name from `intrin_rule_hip.cc` to `intrin_rule_cuda.cc` to reflect the change in focus from HIP to CUDA intrinsic rules.
      - Adjusted include paths for better organization and clarity in the code structure.
      
      * Update AMD CI workflow to uninstall specific PyTorch packages before installation
      
      - Removed the installation of `flash_attn==2.5.8` to streamline the CI process.
      - Added a step to uninstall `torch`, `torchvision`, and `torchaudio` prior to installing pre-release versions, ensuring compatibility and reducing potential conflicts.
      
      * Remove unused shared memory allocations in AMD Flash Attention backward example
      
      - Eliminated the allocation of shared memory for `dv_shared` and `dk_shared` in `example_amd_flash_attn_bwd.py` to streamline memory usage and improve performance.
      - This change focuses on optimizing the backward pass implementation by reducing unnecessary memory overhead.
      
      * Remove unnecessary pip uninstall command from AMD CI workflow
      
      - Eliminated the step to uninstall `torch`, `torchvision`, and `torchaudio` in the AMD CI workflow, as it is no longer required for the installation of pre-release versions.
      - This change simplifies the CI process and reduces potential overhead during package management.
      
      * Refactor DispatchHIPWarpActiveMask function in HIP intrinsic rules
      
      - Updated the return statement to use std::string for concatenation in the case of 16-bit types, improving code clarity.
      - Added a null check for the CallNode pointer in DispatchHIPWarpActiveMask to enhance robustness and prevent potential dereferencing issues.
      
      * Refactor formatting of HIP intrinsic rule registrations
      
      - Adjusted the formatting of TVM_REGISTER_OP calls for better readability by aligning method chaining.
      - No functional changes were made; this update focuses on code style improvements to enhance maintainability.
      
      * Update file name and documentation for HIP intrinsic rules
      
      - Renamed the file from `intrin_rule_cuda.cc` to `intrin_rule_hip.cc` to accurately reflect the focus on HIP intrinsic rules.
      - Updated the file documentation to clarify its purpose as related to HIP rather than CUDA.
      
      * Enhance DispatchHIPShuffle function with clang-analyzer comments
      
      - Added NOLINTBEGIN and NOLINTEND comments to the DispatchHIPShuffle function to suppress clang-analyzer warnings related to inner pointer usage.
      - This change improves code clarity and maintains compliance with static analysis tools.
      
      * lint fix
      
      * fix
      
      ---------
      Co-authored-by: xinxyxiao <xinyxiao@amd.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
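      The autotuner changes in this PR read like the sketch below: a `get_configs` helper enumerating the knobs named above (`num_split_q`, `threads`, `num_stages`, `enable_rasterization`, `k_pack`, `panel_size`). The candidate value lists are assumptions, not the shipped configuration:

```python
import itertools


def get_configs():
    space = {
        "block_M": [32, 64, 128],
        "block_N": [32, 64],
        "num_split_q": [1, 2],                # reduced set, per the autotuner update
        "threads": [128, 256],                # reduced set, per the autotuner update
        "num_stages": [0, 1, 2],
        "enable_rasterization": [True, False],
        "k_pack": [1, 2],
        "panel_size": [7, 8],                 # streamlined options
    }
    keys = list(space)
    return [dict(zip(keys, vals)) for vals in itertools.product(*space.values())]
```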
    • [Refactor] Support Python reflection for tile operators (#783) · 3cfefc8e
      Lei Wang authored
      * Implement Fill operator and related reflection methods in TileLang
      
      - Added Fill operator implementation in `fill.cc` and `fill.h` for element-wise filling of buffers.
      - Introduced reflection methods for Fill, AtomicAdd, Copy, Conv2DIm2Col, FinalizeReducer, Gemm, and Parallel operators to enhance introspection capabilities.
      - Updated relevant files to register reflection methods and ensure proper initialization in static blocks.
      - Removed outdated comments and unnecessary code in various operator files to improve clarity and maintainability.
      - Added new Python bindings for the Fill operator in `tilelang/ir/fill.py` and updated the module imports accordingly.
      
      * Refactor operator reflection methods and improve code clarity
      
      - Updated reflection methods for AtomicAdd, Copy, FinalizeReducer, Gemm, and Parallel operators to enhance readability by using `empty()` instead of size checks.
      - Consolidated static initialization blocks for various operators to a single line for improved consistency.
      - Cleaned up whitespace and formatting in multiple files to adhere to coding standards and improve maintainability.
      - Added new Python bindings for operators in the `tilelang/ir` module, ensuring proper registration and organization of imports.
      
      * Refactor GEMM and AtomicAdd operations for improved clarity
      
      - Updated the `GetArchInt` function in `atomic_add.cc` to use `std::string` and `std::stoi` for better readability and type safety.
      - Removed unnecessary variables and comments in `gemm_sp.cc` and `gemm.cc` to streamline the `ComputeWarpPartition` method.
      - Cleaned up the `layout_reducer.cc` file by removing unused variable declarations, enhancing code clarity.
      - Added import for the `ir` module in `tilelang/__init__.py` to ensure proper organization of module imports.
      
      * Remove deprecated operator files from the tilelang IR module
      
      - Deleted files for Fill, AtomicAdd, Copy, Gemm, GemmSP, FinalizeReducer, Parallel, Reduce, and Region operators to streamline the codebase.
      - This cleanup enhances maintainability by removing unused code and improving overall organization of the module.
      
      * Refactor imports in tilelang IR module for improved organization
      
      - Updated import statements in `tilelang/ir.py` to reflect changes in the TVM library structure, enhancing clarity and maintainability of the codebase.
      
      * lint fix
      
      * Refactor GEMM and GEMM-SP operations to enhance clarity and maintainability
      
      - Updated the `Gemm` and `GemmSP` classes to utilize a new `GemmWarpPolicy` object for warp partitioning, improving encapsulation and readability.
      - Removed deprecated `ComputeWarpPartition` methods and replaced them with calls to the new policy object, streamlining the code.
      - Cleaned up comments and unnecessary code in `gemm.cc`, `gemm_sp.cc`, and related header files to enhance overall clarity.
      - Introduced a new `GemmWarpPolicyNode` class to manage warp policy attributes and methods, facilitating better organization of related functionalities.
      - Updated reflection methods to include the new policy structure, ensuring proper registration and introspection capabilities.
      
      * Refactor Reduce operation to utilize ReduceType class for improved clarity and maintainability
      
      - Replaced multiple conditional checks for reduce types with a single ReduceType object, simplifying the code structure.
      - Introduced a new ReduceTypeNode class to encapsulate reduce type logic and methods, enhancing organization.
      - Updated MakeInitValue, MakeReduce, and Lower methods to leverage the new ReduceType class, improving readability.
      - Added Python bindings for the ReduceType class in tilelang IR module to ensure proper registration and usability.
      
      * comment
      
      * Refactor operator header files for improved readability
      
      - Cleaned up formatting and whitespace in `atomic_add.h`, `copy.h`, `fill.h`, `reduce.cc`, and `reduce.h` to enhance code clarity.
      - Consolidated comments and adjusted line breaks for better organization and maintainability across multiple operator definitions.
      
      * Refactor MakeReduce method in ReduceOpNode for clarity
      
      - Updated the parameter name in the MakeReduce method from `rhs` to `b` and assigned it to `rhs` for improved readability.
      - This change enhances the clarity of the method's purpose and aligns with the overall refactoring efforts in the Reduce operation.
      
      * Update Reduce operation type checks for consistency
      
      - Changed string comparisons for reduce types in the MakeReduce method from "abs_sum" to "abssum" and "abs_max" to "absmax" for uniformity.
      - This adjustment enhances the clarity and consistency of the reduce type handling in the codebase.
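      On the Python side, bindings like these follow TVM's object-reflection pattern. A minimal sketch of that pattern is below; the registration key mirrors the Fill operator described above, but treat the exact key, exposed fields, and decorator location (`tvm._ffi` vs `tvm.ffi`, depending on TVM version) as assumptions, with the real bindings in `tilelang/ir`:

```python
import tvm
from tvm.runtime import Object


@tvm._ffi.register_object("tl.Fill")  # key assumed; see tilelang/ir for the real ones
class Fill(Object):
    """Python mirror of the C++ FillNode; fields are populated via reflection."""
```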
  6. 03 Sep, 2025 1 commit
    • [CI] Adds pytest-durations for test timing (#782) · 141e01fb
      Cunxiao Ni authored
      * [CI] Adds pytest-durations for test timing
      
      Adds `pytest-durations` to the test requirements and configures pytest to display test durations.
      
      This helps in identifying slow-running tests and optimizing the test suite for faster feedback.
      
      * add amd ci durations
      
      * Removes flash_attn installation from CI
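      For context, the core of per-test timing can be approximated with stock pytest hooks in a `conftest.py`; `pytest-durations` itself additionally breaks out setup, teardown, and fixture phases. A rough sketch:

```python
import time

import pytest

_durations = {}


@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_call(item):
    # time only the call phase of each test
    start = time.perf_counter()
    yield
    _durations[item.nodeid] = time.perf_counter() - start


def pytest_terminal_summary(terminalreporter):
    # print the ten slowest tests at the end of the run
    slowest = sorted(_durations.items(), key=lambda kv: kv[1], reverse=True)[:10]
    for nodeid, seconds in slowest:
        terminalreporter.write_line(f"{seconds:8.3f}s  {nodeid}")
```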
  7. 02 Sep, 2025 4 commits
    • [Math] Dispatch `T.rsqrt(x)` to a CUDA intrinsic instead of `1 / T.sqrt(x)` (#781) · b66f9aae
      Lei Wang authored
      * Fix type hint for target_host parameter in compile function to allow None value
      
      * Refactor target handling in compile function to utilize determine_target for improved clarity and consistency
      
      * Update PrintConst function in codegen_cuda.cc to use hexfloat format for bfloat16 and float8/float4 types, while adding scientific notation comments for clarity. This change enhances the representation of floating-point constants in the generated code.
      
      * Refactor PrintType function in codegen_cuda.cc to remove unnecessary failure conditions for floating-point types with lane counts greater than 4. This change simplifies the logic and improves code clarity.
      
      * Enhance benchmark_matmul.py to conditionally print Reference TFlops only if ref_latency is not None. Update param.py to ensure target is converted to string for consistency. Refactor tuner.py to utilize determine_target for improved clarity in target handling.
      
      * Remove automatic commit and push step from AMD and NVIDIA CI workflows to streamline the process and avoid unnecessary commits.
      
      * Add intrin_rule source files to CMakeLists.txt and implement hrsqrt function for half_t in common.h
      
      * lint fix
      
      * remove cmake dep in pyproject as it may lead to different cmake paths in diff stages
      
      * lint fix
      
      * Add cmake dependency to pyproject.toml and improve build logging in setup.py
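      The effect of the dispatch, sketched as a tiny kernel: with this change `T.rsqrt` should emit a hardware rsqrt intrinsic (via the new `hrsqrt` helper for `half_t`) rather than a division plus `sqrt`. The kernel itself is illustrative:

```python
import tilelang.language as T


def rsqrt_kernel(N, block_N, dtype="float16"):

    @T.prim_func
    def main(A: T.Tensor((N,), dtype), B: T.Tensor((N,), dtype)):
        with T.Kernel(T.ceildiv(N, block_N), threads=block_N) as bx:
            for i in T.Parallel(block_N):
                # lowers to an rsqrt intrinsic instead of 1 / sqrt(x)
                B[bx * block_N + i] = T.rsqrt(A[bx * block_N + i])

    return main
```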
    • [Example] Adds example for top-k operation (#775) · 021e44e3
      Cunxiao Ni authored
      * [Example] Adds example for top-k operation
      
      Adds an example demonstrating the top-k operation using TileLang.
      
      * format
      
      * Adds topk tilelang example test
      
      * fix lint
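      Such examples are typically validated against `torch.topk` as ground truth; a hedged sketch of that check (the actual example file may structure it differently, and `topk_kernel` is a hypothetical handle):

```python
import torch

batch, n, k = 32, 128, 8
x = torch.randn(batch, n, device="cuda", dtype=torch.float16)

ref_vals, ref_idx = torch.topk(x, k, dim=-1)  # reference top-k

# vals = topk_kernel(x)  # hypothetical handle to the tilelang kernel
# torch.testing.assert_close(vals, ref_vals, rtol=1e-2, atol=1e-2)
```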
    • [Cache] Introduce detailed target information for the disk kernel cache (#780) · 7ffc5b44
      Lei Wang authored
      * Fix type hint for target_host parameter in compile function to allow None value
      
      * Refactor target handling in compile function to utilize determine_target for improved clarity and consistency
      
      * Update PrintConst function in codegen_cuda.cc to use hexfloat format for bfloat16 and float8/float4 types, while adding scientific notation comments for clarity. This change enhances the representation of floating-point constants in the generated code.
      
      * Refactor PrintType function in codegen_cuda.cc to remove unnecessary failure conditions for floating-point types with lane counts greater than 4. This change simplifies the logic and improves code clarity.
      
      * Enhance benchmark_matmul.py to conditionally print Reference TFlops only if ref_latency is not None. Update param.py to ensure target is converted to string for consistency. Refactor tuner.py to utilize determine_target for improved clarity in target handling.
      
      * Remove automatic commit and push step from AMD and NVIDIA CI workflows to streamline the process and avoid unnecessary commits.
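      The gist of keying the disk cache on detailed target information, as an illustrative sketch (not the actual tilelang cache code): folding the full target and host strings into the hash keeps kernels compiled for one GPU from being served to another:

```python
import hashlib


def kernel_cache_key(func_source: str, target: str, target_host: str | None) -> str:
    # any change to the kernel source, the target (arch + features), or the
    # host target produces a different cache entry
    parts = [func_source, str(target), str(target_host or "")]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()
```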
    • [Lint] Introduce clang-tidy into format.sh (#777) · cdc5d8d3
      Lei Wang authored
      * [Refactor] Update Clang-Tidy Checks and Improve Code Consistency
      
      - Enhanced .clang-tidy configuration by adding specific checks for better bug detection and performance optimization.
      - Refactored function signatures across multiple files to use `const` references for parameters, improving performance and code clarity.
      - Updated various methods to ensure consistent handling of parameters, particularly in `AddPredicate`, `Substitute`, and `PlanLoopPartition` functions.
      - Improved readability by replacing size checks with `empty()` method calls in several locations, ensuring clearer intent in the code.
      - General code cleanup and adherence to best practices for better maintainability.
      
      * [Refactor] Enhance Code Consistency and Clang-Tidy Configuration
      
      - Updated .clang-tidy configuration to include additional checks for improved code quality and performance.
      - Refactored function signatures across multiple files to use `const` references, enhancing performance and clarity.
      - Replaced size checks with `empty()` method calls in various locations for clearer intent.
      - Improved handling of parameters in several functions, ensuring consistent usage of `std::move` where applicable.
      - General code cleanup to adhere to best practices and improve maintainability.
      
      * [Refactor] Integrate Clang-Tidy Checks and Enhance Code Consistency
      
      - Added clang-tidy checks to the format script for improved code quality assurance.
      - Refactored function signatures across multiple files to consistently use `const` references, enhancing performance and clarity.
      - Updated the requirements-lint.txt file to include clang-tidy as a dependency.
      - General code cleanup to adhere to best practices and improve maintainability.
      
      * [CI] Update AMD CI Workflow to Include Build Directory Creation
      
      - Added steps to create a build directory and configure CMake with ROCm support during the format check process.
      - Ensured cleanup of the build directory after the format check to maintain a clean workspace.
      
      * [Refactor] Remove Unused Member Variables in AtomicAddNode and CopyNode
      
      - Removed the `args_` member variable from both `AtomicAddNode` and `CopyNode` classes to streamline the code and eliminate unnecessary data members.
      - This change enhances code clarity and maintainability by focusing on relevant attributes for each class.
      
      * [Refactor] Update Clang-Tidy Integration and Code Improvements
      
      - Modified the format script to include the `-fix` option in the clang-tidy command for automatic code fixes.
      - Refactored the `AtomicAddVectorizePlanner` class to improve variable handling and consistency, including changes to member variable types and function signatures.
      - Enhanced code clarity by removing unnecessary `std::move` calls and ensuring consistent usage of types across the class.
      - General code cleanup to adhere to best practices and improve maintainability.
      
      * [Refactor] Improve Parameter Handling and Consistency in AtomicAddVectorize
      
      - Updated function signatures in `AtomicAddVectorizePlanResult` and `AtomicAddVectorizeRewriter` to use `const` references and `std::move` for better performance and clarity.
      - Enhanced the `UpdateVectorSize` method to accept `const Array<PrimExpr>&` for improved efficiency.
      - General code cleanup to maintain consistency and adhere to best practices.
      
      * [CI] Add Git Submodule Initialization to CI Workflow
      
      - Included a step to initialize and update git submodules recursively in the CI workflow.
      - This change ensures that all necessary submodules are available during the format check process, improving build reliability.
      
      * [CI] Add Git Submodule Update Step to Format Check
      
      - Included a command to initialize and update git submodules recursively in the CI workflow during the format check process.
      - This enhancement ensures that all required submodules are available, contributing to improved build reliability.
      
      * [Refactor] Update Function Signatures in AtomicAddVectorize
      
      - Modified the `VectorizeAtomicAdd` function signature to use `const` references for `thread_var` and `thread_bounds`, enhancing performance and code clarity.
      - This change aligns with previous refactoring efforts to improve parameter handling and consistency across the codebase.
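      The format-check flow this PR wires up, sketched in Python for brevity (format.sh itself is shell): clang-tidy needs a compile database, which is why the CI now configures a CMake build first, and `-fix` applies suggested fixes in place. Paths and flags are illustrative:

```python
import subprocess

# configure a build to produce compile_commands.json for clang-tidy
subprocess.run(
    ["cmake", "-S", ".", "-B", "build", "-DCMAKE_EXPORT_COMPILE_COMMANDS=ON"],
    check=True,
)
# run clang-tidy across the C++ sources, applying fixes automatically
subprocess.run(["run-clang-tidy", "-p", "build", "-fix", "src/"], check=True)
```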
  8. 01 Sep, 2025 3 commits
  9. 31 Aug, 2025 4 commits
    • 📝 Add docstrings to `reducer_0825` (#772) · 9a869396
      coderabbitai[bot] authored
      * 📝 Add docstrings to `reducer_0825`
      
      Docstrings generation was requested by @LeiWang1999.
      
      * https://github.com/tile-ai/tilelang/pull/757#issuecomment-3219088118
      
      
      
      The following files were modified:
      
      * `setup.py`
      * `src/op/builtin.h`
      * `src/op/finalize_reducer.cc`
      * `src/op/finalize_reducer.h`
      * `src/op/parallel.cc`
      * `src/op/parallel.h`
      * `src/op/reduce.cc`
      * `src/target/codegen_cuda.cc`
      * `src/tl_templates/cuda/common.h`
      * `src/transform/layout_inference.cc`
      * `src/transform/layout_reducer.cc`
      * `src/transform/layout_reducer.h`
      * `src/transform/merge_shared_memory_allocations.cc`
      * `src/transform/storage_access.cc`
      * `src/transform/warp_specialized_rewriter.cc`
      * `testing/python/autotune/test_tilelang_autotune_with_inputs.py`
      * `tilelang/engine/phase.py`
      * `tilelang/language/customize.py`
      * `tilelang/language/reduce.py`
      * `tilelang/transform/__init__.py`
      
      * lint fix
      
      * lint fix
      
      ---------
      Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    • [Bugfix] Fix atomic add auto-vectorize negative optimization (#765) · a7a29c09
      yyttt6 authored
      * [Bugfix] Fix atomic add auto-vectorize negative optimization
      
      * fixbug
      
      * format
      
      * fix bug
    • 📝 Add docstrings to `pytile_0826` (#770) · 2af3f22e
      coderabbitai[bot] authored
      * 📝 Add docstrings to `pytile_0826`
      
      Docstrings generation was requested by @LeiWang1999.
      
      * https://github.com/tile-ai/tilelang/pull/763#issuecomment-3224197814
      
      
      
      The following files were modified:
      
      * `src/op/atomic_add.cc`
      * `src/op/atomic_add.h`
      * `src/op/copy.cc`
      * `src/op/copy.h`
      * `src/op/elem.cc`
      * `src/op/elem.h`
      * `src/op/gemm.cc`
      * `src/op/gemm.h`
      * `src/op/gemm_sp.cc`
      * `src/op/gemm_sp.h`
      * `src/op/operator.cc`
      * `src/op/operator.h`
      * `src/op/parallel.cc`
      * `src/op/parallel.h`
      * `src/op/reduce.cc`
      * `src/op/reduce.h`
      * `src/op/region.cc`
      * `src/op/region.h`
      * `src/transform/layout_inference.cc`
      * `src/transform/lower_tile_op.cc`
      
      * lint fix
      
      ---------
      Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    • [Reducer] Introduce `alloc_reducer` to separate inter- and intra-warp reduction (#757) · 8eab7755
      Lei Wang authored
      
      
      * [Enhancement] Introduce finalize_reducer operator and layout reducer support
      
      - Added `FinalizeReducer` operator to handle reduction finalization in the TileLang framework, allowing for efficient reduction operations.
      - Implemented layout inference for local.reducer buffers, enhancing the handling of layout mappings and reducing complexity in buffer management.
      - Updated `setup.py` to include logging for build directory paths, improving build process visibility.
      - Enhanced atomic operations with new functions for atomic max, min, load, and store, providing more robust atomicity control in memory operations.
      - Refactored parallel loop handling to incorporate reducer information, ensuring proper management of reduction operations in parallel contexts.
      - Cleaned up test cases by removing unnecessary cache disabling and optimizing test parameters for better performance.
      
      * Refactor code formatting and improve readability in multiple files
      
      - Cleaned up whitespace in `setup.py` to enhance logging clarity.
      - Reformatted `AtomicMax` and `AtomicMin` functions in `common.h` for better alignment and readability.
      - Adjusted `debug_print_var` function in `debug.h` to improve code structure and maintainability.
      - Enhanced readability of the `atomic_add` function in `customize.py` by breaking long lines for better clarity.
      
      * Remove debug print statements from `copy.cc` and `inject_tma_barrier.cc` to enhance code clarity and maintainability.
      
      * [Enhancement] Disable reuse of small arrays in shared memory allocation
      
      - Added logic to prevent the reuse of small arrays (<= 32 bits) in `merge_shared_memory_allocations.cc`, ensuring they are lowered to registers in LLVM for improved performance and memory management.
      
      * Refactor `setup.py` to remove duplicate logging statements and enhance clarity. Update `finalize_reducer` function documentation in `reduce.py` to include detailed parameter and return descriptions, improving code readability and maintainability.
      
      * Refactor `finalize_reducer` and `reduce` functions to remove redundant target checks. Simplified conditionals by retaining only the `TargetIsHopper` check, enhancing code clarity and maintainability.
      
      * bug fix
      
      * Add thread checks workaround for replicated cases
      
      * Remove the is_one check
      
      * fix lint error
      
      * lint fix
      
      * Update autotune tests to use smaller matrix sizes for improved performance and reliability
      
      * [Refactor] Update FinalizeReducer to FinalizeReducerOp and adjust related methods
      
      - Refactored FinalizeReducer class to FinalizeReducerOp, updating constructor and method signatures for consistency with the new TileOperator structure.
      - Enhanced layout inference and cloning methods in FinalizeReducerOpNode.
      - Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main.
      - Adjusted header inclusions for improved organization and clarity across multiple files.
      
      * [Refactor] Update atomic operations in common.h and modify test_example_flash_attention.py
      
      - Enhanced atomic operations (Add, Min, Max) in common.h to handle half and bfloat16 types more efficiently.
      - Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main, improving test organization.
      
      * [Refactor] Simplify CopyNode::LowerBulkCopy logic and update test execution
      
      - Removed redundant checks for contiguous memory access in CopyNode::LowerBulkCopy, streamlining the logic for TMA copy operations.
      - Updated test_tilelang_kernel_gemm.py to comment out the main testing function and call a specific test for i8i8i32 tensor operations instead, improving test focus.
      
      ---------
      Co-authored-by: Huanqi Cao <caohuanqi@deepseek.com>
      Co-authored-by: Freebase6912 <amid-gauze-racing@duck.com>
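      A loud-assumption sketch of the workflow this PR names: allocate a `local.reducer` buffer, accumulate partials from a parallel loop, then call `finalize_reducer` to combine across warps. The signatures below are guesses for illustration only; `tilelang/language/reduce.py` holds the real API:

```python
import tilelang.language as T


def rowsum_kernel(M, N, dtype="float32"):

    @T.prim_func
    def main(A: T.Tensor((M, N), dtype), Out: T.Tensor((M,), dtype)):
        with T.Kernel(M, threads=128) as by:
            acc = T.alloc_reducer((1,), dtype)  # hypothetical local.reducer buffer
            T.fill(acc, 0)
            for i in T.Parallel(N):
                acc[0] += A[by, i]              # intra-warp partial sums
            T.finalize_reducer(acc)             # hypothetical inter-warp combine
            Out[by] = acc[0]

    return main
```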
  10. 29 Aug, 2025 2 commits
    • [Refactor] Refactor `Operator` into `TileOperator` with TVM reflection (#763) · b38bd69e
      Lei Wang authored
      * Refactor operator classes to inherit from TileOperator and update layout inference methods
      
      - Changed base class of several operator classes (AtomicAdd, Copy, Gemm, etc.) from Operator to TileOperator for better alignment with tile operations.
      - Updated InferLayout and Lower methods to use 'override' specifier for clarity and consistency.
      - Adjusted header inclusions to replace "op.h" with "operator.h" across multiple files for improved organization.
      - Added missing layout inference implementations for Fill and Conv2DIm2ColOp.
      - Removed deprecated op.cc and op.h files to streamline the codebase.
      
      * lint fix
      
      * Refactor operator classes to use Node pattern and improve memory management
      
      - Updated several operator classes (AtomicAdd, Copy, Gemm, etc.) to utilize the Node pattern for better memory management and encapsulation.
      - Changed constructors to initialize member variables through a node object, enhancing clarity and reducing direct member access.
      - Updated Clone methods to return TileOperator instances instead of unique pointers, aligning with the new design.
      - Refactored InferLayout and Lower methods to ensure consistency across operator implementations.
      - Adjusted header files to reflect the new class structure and removed deprecated code for a cleaner codebase.
      
      * Enhance Clone methods in AtomicAdd and Copy classes to support parallel operation cloning
      
      - Updated the Clone methods in AtomicAddNode and CopyNode to ensure that the parallel operation (par_op_) is properly cloned when defined, improving the integrity of cloned objects.
      - Refactored the FillNode class to use ParallelOp directly instead of std::make_unique, streamlining the creation of parallel operations.
      - Made minor adjustments in layout inference and other related methods for consistency and clarity.
      
      * Refactor FillNode::Lower method to remove unused global function call
      
      - Eliminated the call to the global function "tl.fill.lower" in the FillNode::Lower method, streamlining the code and improving clarity.
      - Retained the core functionality of the method while enhancing maintainability by reducing unnecessary dependencies.
    • Hot fix for Blackwell (#768) · 277ed53c
      Johnny authored
  11. 28 Aug, 2025 4 commits
  12. 26 Aug, 2025 1 commit
  13. 25 Aug, 2025 1 commit
    • [README] Update GDN README for clarity and add acknowledgements (#758) · e0cf5fee
      Yu Cheng authored
      - Improved formatting and clarity of the GDN kernel implementation description.
      - Updated requirement section to list dependencies in a clearer format.
      - Added an acknowledgements section to credit the developers and the Xiaomi LLM-Core Team for their contributions.
  14. 24 Aug, 2025 6 commits
    • [Typo] Remove `disable_cache` in some tests (#755) · 556d411e
      Lei Wang authored
      * Update test parameters and remove debug print statement
      
      - Adjusted test cases in `test_tilelang_dynamic_symbolic_bench.py` to use smaller matrix sizes (1024x1024) for improved performance and quicker execution.
      - Removed a debug print statement from `phase.py` to clean up the code and enhance clarity.
      
      * Refactor loop stack management in warp_specialized_rewriter
      
      - Introduced a new `LoopInfo` struct to encapsulate loop variable details, including `loop_var`, `extent`, and `min`, enhancing clarity and maintainability.
      - Updated the `loop_stack_` to utilize `LoopInfo` instead of a pair, improving type safety and readability.
      - Adjusted linear index calculations to account for the new structure, ensuring correct behavior in loop transformations.
      
      * Remove unused `torch.backends` import and `tilelang.disable_cache()` calls from multiple test files to enhance code clarity and maintainability.
    • [Bugfix][WS] Consider loop min extent when computing phase id (#754) · b39aaf5b
      Lei Wang authored
      * Update test parameters and remove debug print statement
      
      - Adjusted test cases in `test_tilelang_dynamic_symbolic_bench.py` to use smaller matrix sizes (1024x1024) for improved performance and quicker execution.
      - Removed a debug print statement from `phase.py` to clean up the code and enhance clarity.
      
      * Refactor loop stack management in warp_specialized_rewriter
      
      - Introduced a new `LoopInfo` struct to encapsulate loop variable details, including `loop_var`, `extent`, and `min`, enhancing clarity and maintainability.
      - Updated the `loop_stack_` to utilize `LoopInfo` instead of a pair, improving type safety and readability.
      - Adjusted linear index calculations to account for the new structure, ensuring correct behavior in loop transformations.
    • [MXFP4] Add bias to MXFP4 GEMM kernel (#753) · fd199a4a
      Zhengju Tang authored
      * [MXFP4] Add bias to gemm kernel
      
      * [Lint]
      
      * [Lint] Rename "bias" to "Bias"
    • [Bugfix] Add missing FP8 header include (#752) · cf7be057
      Lei Wang authored
      
      
      * [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h
      
      - Introduced specialized DispatchInstruction templates for fp8_e4_t and fp8_e5_t types, enhancing support for new data formats in CUDA GEMM operations.
      - Each specialization defines the corresponding MMA and MMA_Group types, optimizing performance for specific configurations.
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
      
      * [Enhancement] Include cuda_fp8.h in gemm_sm90.h
      
      - Added the inclusion of the "cuda_fp8.h" header file to support new data formats in CUDA GEMM operations, enhancing compatibility with recent updates for fp8 types.
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
      
      * lint fix
      
      * [Refactor] Remove unused tl_shuffle_elect and related functions from common.h
      
      - Deleted the `tl_shuffle_elect` function and its associated comments to streamline the codebase.
      - Added inclusion of "intrin.h" for improved intrinsic support in CUDA operations.
      - Cleaned up the file by removing unnecessary template parameters and functions, enhancing clarity and maintainability.
      
      * lint fix
      
      * [Refactor] Update header inclusions in common.h and gemm_sm90.h
      
      - Removed the inclusion of "intrin.h" from common.h to streamline dependencies.
      - Added "intrin.h" inclusion in gemm_sm90.h to ensure intrinsic support for CUDA operations, enhancing functionality and maintainability.
      
      * bug fix
    • [Enhancement] Add shape checking for reduce options (#748) · c2fe91e0
      Kurisu authored
      
      
      * Add shape checking for reduce options
      
      * lint fix
      
      * Handle special case reducing into shape-1 tensor
      
      Allow reducing [X, d, Y] into [X, Y] or [X, 1, Y]; see the worked example below.
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
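      The accepted shapes as a concrete worked example (torch tensors used purely to illustrate the rule; the check itself runs on tile-region shapes):

```python
import torch

x = torch.randn(4, 7, 5)  # [X, d, Y] with X=4, d=7, Y=5

assert torch.sum(x, dim=1).shape == (4, 5)                   # [X, Y]
assert torch.sum(x, dim=1, keepdim=True).shape == (4, 1, 5)  # [X, 1, Y]
```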
    • [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h (#751) · e68fdab8
      Lei Wang authored
      - Introduced specialized DispatchInstruction templates for fp8_e4_t and fp8_e5_t types, enhancing support for new data formats in CUDA GEMM operations.
      - Each specialization defines the corresponding MMA and MMA_Group types, optimizing performance for specific configurations.
  15. 23 Aug, 2025 2 commits