"docs/source/vscode:/vscode.git/clone" did not exist on "d6a4975524f9dbbc7de1eb363d97a073cbdb9ea3"
  1. 22 Aug, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Merge bulk copy into copy and improve layout inference for bulk copy (#746) · 5c11d245
      Lei Wang authored
      * [Refactor] Merge bulk copy into copy and refactor layout inference for bulk copy
      
      * Deleted the `bulk_copy` operator implementation and its header file as it is no longer needed.
      * Introduced a new function `cuTensorMapType()` to return the data type for CUDA tensor mapping.
      * Updated related files to reflect these changes, ensuring that the codebase remains clean and maintainable.
      
      * lint fix
      
      * Fix typos in intrinsic names and remove unused print statement in block_sparse_attn_tilelang.py. Updated references from `ptx_ldmatirx` to `ptx_ldmatrix` across multiple files for consistency.
      
      * remove bulk copy
      
      * Refactor copy and atomic add operations to support TMA lower configuration
      
      - Updated `GetCopyInst` to accept a `disable_tma_lower` parameter, allowing for conditional usage of TMA in bulk load/store operations.
      - Modified `Lower` method in `Copy` to incorporate the new TMA configuration.
      - Refactored `AtomicAdd::Lower` to streamline layout inference and vectorization logic.
      - Removed unused `disable_tma_lower` field from `LowerArgs` structure for clarity.
      - Enhanced atomic add vectorization by replacing the buggy implementation with a more robust loop vectorization approach.
      
      * Enhance TMA bulk copy logic in `LowerBulkCopy` method
      
      - Added a condition to set `desc.swizzle` to `CU_TENSOR_MAP_SWIZZLE_NONE` when `shared_layout` matches `linear_layout`, improving clarity in layout handling.
      - Updated warning log to provide more detailed information about fallback scenarios, including source and destination buffer names and shapes, enhancing debugging capabilities.
      
      * lint fix
      
      * Remove fallback logging for non-swizzled global layout in `LowerBulkCopy` method to streamline the bulk copy logic. This change enhances code clarity by eliminating unnecessary warning messages related to inner box dimensions.
      
      * Enhance reshape kernel compilation in `run_reshape` and `run_reshape_smem_1d_2_2d` functions
      
      - Updated the `tl.compile` method to include `pass_configs` that disable TMA lower and warp specialization, addressing shared memory layout transformation limitations.
      - Added TODO comments to indicate the need for further improvements in shared memory handling.
      
      * Update `native_sparse_attention` function to include TMA configuration options
      
      - Added `pass_configs` to the JIT decorator to disable TMA lower and warp specialization, addressing potential issues with shared memory layout transformations.
      - Updated comments to clarify modifications in tensor shapes for inference, specifically setting `q` sequence length to 1.
      
      * Refactor JIT decorator formatting in `native_sparse_attention` function
      
      - Improved readability by reformatting the JIT decorator parameters for `native_sparse_attention`, ensuring consistent style across the codebase.
      - No functional changes were made; this update focuses on code clarity and maintainability.
      
      * Enhance thread management and logging in TileLang compilation
      
      - Added a method to check if printing is enabled during compilation, improving control over logging behavior.
      - Updated the JIT kernel class to utilize the new method for logging compilation status, ensuring consistent and clear output.
      - Added comments to clarify the purpose of changes and improve code readability.
      
      * Add warp specialization scope and refactor register management in TileLang
      
      - Introduced a new constant `kWarpSpecializationScope` in `builtin.h` for better attribute management.
      - Removed the `SetMaxNRegCollector` class and its related logic from `warp_specialized_rewriter.cc`, streamlining the warp specialization process.
      - Added functions `annotate_producer_reg_dealloc` and `annotate_consumer_reg_alloc` in `builtin.py` to facilitate register management.
      - Implemented `AnnotateWarpGroupRegAlloc` in `__init__.py` to inject register allocation calls into warp-specialized functions, enhancing the overall register handling in the compilation process.
      
      * Refactor test for InjectSetMaxNReg pass in TileLang
      
      - Improved readability by restructuring conditional checks and assertions in the test cases.
      - Enhanced clarity in the collection of `set_max_nreg` calls by simplifying the logic.
      - Ensured consistent formatting and spacing throughout the test functions for better maintainability.
      
      * Enhance bulk copy and store checks in `Copy` class
      
      - Updated scope validation for source and destination tensors in `CheckBulkLoad` and `CheckBulkStore` methods to include both `shared.dyn` and `shared` as valid options.
      - Modified `CheckLDSMCopy` and `CheckSTSMCopy` methods to accommodate the new scope validation, ensuring compatibility with shared memory configurations.
      - Improved logging in `LowerBulkCopy` to provide clearer warnings regarding unsupported swizzle layouts, including source and destination names for better debugging.
      
      * lint fix
      5c11d245
  2. 21 Aug, 2025 2 commits
    • Lei Wang's avatar
      [Refactor] Refactor barrier management (#744) · cb37bfef
      Lei Wang authored
      * Introduce Barrier
      
      * Enhance CUDA kernel with new barrier management and post-processing support
      
      - Added a new CUDA kernel implementation in `example_mla_decode.py` for improved performance with shared memory barriers.
      - Refactored barrier handling in `codegen_cuda.cc` and `codegen_hip.cc` to utilize a more flexible mbarrier structure.
      - Updated intrinsic definitions from `ptx_stmatirx` to `ptx_stmatrix` across multiple files for consistency.
      - Introduced additional print statements for debugging in the lowering phase of the TileLang engine.
      - Enhanced the overall structure and readability of the codebase.
      
      * Remove unused barrier handling code in CUDA and HIP code generators to streamline the implementation. This change enhances code clarity and reduces complexity in the barrier management logic.
      
      * Enhance barrier management in TileLang
      
      - Introduced a new intrinsic `allocate_barrier` for dynamic barrier allocation in the TileLang framework.
      - Updated CUDA code generation to support the new barrier structure, allowing for improved synchronization in shared memory.
      - Refactored existing barrier handling logic to accommodate the new intrinsic and streamline code.
      - Added print statements for debugging purposes in various examples and the lowering phase of the TileLang engine.
      - Removed deprecated memory scope handling code to enhance clarity and maintainability.
      
      * lint fix
      
      * lint fix
      
      * Remove `allocate_barrier` intrinsic and related code from TileLang to streamline barrier management. This includes updates to CUDA code generation and the removal of associated Python wrappers, enhancing code clarity and maintainability.
      
      * Refactor logging in JITKernel to improve kernel compilation tracking
      
      - Removed unused import of `torch.backends` in the example file.
      - Introduced logging for kernel compilation in `JITKernel`, replacing print statements with structured logging for better traceability and debugging.
      - Added an assertion to ensure the presence of the `global_symbol` attribute in the kernel function.
      
      * Refactor dequantization tests and update barrier function
      
      - Removed the test for `example_dequant_gemm_bf16_fp4_hopper_serial` to streamline the testing suite.
      - Updated the `mbarrier_cp_async_arrive` function to support both pointer and non-pointer types, enhancing flexibility in barrier management.
      
      * Update CI configuration to increase pytest parallelism from 4 to 8 threads for improved test execution speed.
      
      * Fix typos in rasterization parameters and update import path for cached module
      
      - Corrected the spelling of `enable_rasteration` to `enable_rasterization` in the matmul function and its usage.
      - Updated the import statement for the `cached` module to reflect the new path in the cache submodule.
      - Added `StridedTensor` import in the language module for enhanced tensor functionality.
      
      * Update ci.yml
      cb37bfef
    • coderabbitai[bot]'s avatar
      📝 Add docstrings to PR #744 (#745) · eccdfe17
      coderabbitai[bot] authored
      * 📝 Add docstrings to `main`
      
      Docstrings generation was requested by @LeiWang1999.
      
      * https://github.com/tile-ai/tilelang/pull/742#issuecomment-3205103559
      
      
      
      The following files were modified:
      
      * `src/transform/atomicadd_vectorize.cc`
      
      * lint fix
      
      ---------
      Co-authored-by: default avatarcoderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
      Co-authored-by: default avatarLeiWang1999 <leiwang1999@outlook.com>
      eccdfe17
  3. 20 Aug, 2025 2 commits
  4. 19 Aug, 2025 3 commits
    • Lei Wang's avatar
      [Refactor] Refactor env into a more flexible version (#740) · 72be4909
      Lei Wang authored
      * Fix environment variable name for compilation print setting in `env.py`
      
      * Remove deprecated test file for warp specialized pass configuration and refactor environment variable access in `env.py` to utilize a centralized `EnvVar` class for better management and clarity.
      
      * lint fix
      
      * Refactor cache check to use `env.is_cache_enabled()` for consistency in `tuner.py`
      72be4909
    • coderabbitai[bot]'s avatar
      📝 Add docstrings to `mxfp4` (#732) · e3a80b70
      coderabbitai[bot] authored
      * 📝 Add docstrings to `mxfp4`
      
      Docstrings generation was requested by @LeiWang1999.
      
      * https://github.com/tile-ai/tilelang/pull/725#issuecomment-3191656561
      
      
      
      The following files were modified:
      
      * `examples/bitnet-1.58b/kernel_benchmark/tilelang_bitnet_158_int8xint2_prefill.py`
      * `examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py`
      * `examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py`
      * `examples/dequantize_gemm/utils.py`
      * `examples/gemm/example_gemm_autotune.py`
      * `tilelang/intrinsics/utils.py`
      * `tilelang/language/__init__.py`
      * `tilelang/language/utils.py`
      * `tilelang/quantize/mxfp.py`
      * `tilelang/quantize/quantization.py`
      
      * [Lint] More accurate docstring
      
      * [Lint]
      
      ---------
      Co-authored-by: default avatarcoderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
      Co-authored-by: default avatartzj-fxz <tzjfxz@gmail.com>
      e3a80b70
    • Zhengju Tang's avatar
      [Feature] Low-bit twiddling dequantization and FP4 GEMM (#725) · 24603e4a
      Zhengju Tang authored
      
      
      * [Dequant] Add bit-twiddling dequantize cuda for fp4-->bf16
      
      * [Dequant] Add extern call and serial dequantization
      
      * [Dequant] Parallel Dequant wait for fence debug.
      
      * [Scale] Add scale matrix to mxfp4 gemm
      
      * [Remove] Remove fence-buggy example and some generated source cuda code
      
      * [MXFP4] Update initial version of MXFP4 GEMM
      
      * [Scale] Add scale to latest mxfp4 gemm
      
      * [Lint]
      
      * [BugFix] Load Scale, disabe TMA to recover performance
      
      * [Lint]
      
      * [Lint]
      
      * [Scale] Use L2 to hold Scale and enable TMA will slightly boost performance
      
      * [Lint]
      
      * Update example_dequant_gemm_bf16_fp4_hopper_serial.py
      
      * Remove deprecated dequantization examples for BF16 and MXFP4 in the dequantize_gemm directory.
      
      * Refactor dequantization examples for improved readability and consistency. Adjusted formatting in matmul function and added spacing for clarity. Updated function signatures and comments for better understanding.
      
      * Refactor index_to_coordinates usage in bitnet example and update dequantization example configurations. Removed the custom index_to_coordinates function and replaced it with the built-in version. Adjusted block_K parameter in dequantization example for consistency.
      
      * lint fix
      
      * ci fix
      
      * Remove non-existent example
      
      * [BugFix] Add smem swizzle to recover performance of TMA
      
      * [BugFix] Enough reg for producer when threads=512
      
      ---------
      Co-authored-by: default avatarLei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: default avatarLeiWang1999 <leiwang1999@outlook.com>
      24603e4a
  5. 18 Aug, 2025 3 commits
    • alex_xiao's avatar
      [CI] Fix AMD CI (#729) · a86223f4
      alex_xiao authored
      
      
      * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
      
      - Enhanced buffer index handling to address precision issues by removing redundant operations.
      - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
      - Updated related documentation to reflect changes in buffer management practices.
      
      * Remove obsolete test script for AMD example, streamlining the examples directory.
      
      * Remove unused dtype_size variable in AMD example script to streamline code.
      
      * Add input configuration file and update AMD example script for enhanced flexibility
      
      - Introduced a new input.txt file for configurable parameters.
      - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
      - Streamlined the main function for better clarity and organization.
      - Added a new test script to facilitate running the example with specified parameters.
      
      * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
      
      - Deleted input.txt and test.sh files as they are no longer needed.
      - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
      - Reintroduced swizzle usage in the kernel for better performance.
      
      * Refactor AMD example script for FlashAttention-2
      
      - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
      - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
      - Removed outdated comments and improved code organization for better readability.
      
      * Refactor formatting in AMD FlashAttention example script
      
      - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
      - Streamlined the `main` function parameter formatting for consistency.
      - Removed unnecessary blank lines to enhance overall code organization.
      
      * Update example_amd_flash_attn_fwd.py
      
      * Enhance AMD example script and update CI workflows
      
      - Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
      - Added new CI workflows for AMD and documentation publishing.
      - Updated various requirements files to include necessary dependencies.
      - Introduced new test cases and examples for better coverage and functionality.
      - Refactored existing code for improved readability and maintainability.
      
      * Remove redundant tool cache cleanup step in AMD CI workflow
      
      * Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.
      
      ---------
      Co-authored-by: default avatarxinxyxiao <xinyxiao@amd.com>
      Co-authored-by: default avatarLei Wang <34334180+LeiWang1999@users.noreply.github.com>
      a86223f4
    • coderabbitai[bot]'s avatar
      📝 Add docstrings to `fix` (#726) · a5074fd5
      coderabbitai[bot] authored
      Docstrings generation was requested by @LeiWang1999.
      
      * https://github.com/tile-ai/tilelang/pull/712#issuecomment-3190680851
      
      
      
      The following files were modified:
      
      * `src/op/gemm.cc`
      * `src/tl_templates/cuda/gemm_sm90.h`
      * `src/transform/warp_specialized_rewriter.cc`
      Co-authored-by: default avatarcoderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
      a5074fd5
    • Wenhao Xie's avatar
      [Enhancement][Bugfix] Fix bug in warp specialized pass and add gemm_sr... · f4a828f6
      Wenhao Xie authored
      
      [Enhancement][Bugfix] Fix bug in warp specialized pass and add gemm_sr fallback support for Hopper (#712)
      
      * bug fix and support gemm_sr fallback for hopper
      
      * Update gemm.cc
      
      ---------
      Co-authored-by: default avatarLei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: default avatarLeiWang1999 <leiwang1999@outlook.com>
      f4a828f6
  6. 17 Aug, 2025 1 commit
    • Lei Wang's avatar
      [Language] Introduce `StridedTensor` to support non contigious torch inputs (#722) · 1b308baf
      Lei Wang authored
      
      
      * Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107
      
      * Support strided tensors
      
      * Refactor target attribute helper functions for improved clarity
      
      * No code changes made in proxy.py and setup.py
      
      * lint fix
      
      * lint fix via gemini
      
      * lint fix
      
      * test fix
      
      * test fix
      
      * lint fix
      
      * Update wrapper.py
      
      * test fix
      
      * Enhance test for InjectSoftwarePipeline by adding LowerOpaqueBlock transformation and updating expected function signature to use match_buffer for better clarity.
      
      * lint fix
      
      ---------
      Co-authored-by: default avatarChenggang Zhao <chenggangz@deepseek.com>
      1b308baf
  7. 16 Aug, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Refactor CUDA code generation to simplify eviction policy handling (#721) · c369d690
      Lei Wang authored
      * Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107
      
      * Refactor CUDA code generation to simplify eviction policy handling
      
      - Updated `VisitExpr_` methods in `codegen_cuda.cc` to use default eviction policy for `tma_load`, `tma_load_im2col`, and `tma_store` functions, reducing complexity.
      - Removed conditional assembly code for `EVICT_NORMAL` in `copy_sm90.h`, streamlining the assembly calls for tensor memory operations.
      
      * lint fix
      c369d690
  8. 15 Aug, 2025 4 commits
    • NaOHCC's avatar
      [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy (#724) · 2bd2d69e
      NaOHCC authored
      * [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy
      
      * [Typo] Correct architecture selection for CUDA and CDNA
      2bd2d69e
    • alex_xiao's avatar
      [CI][AMD] Add AMD GPU CI and fix some related bugs (#694) · 8e1b88f3
      alex_xiao authored
      
      
      * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
      
      - Enhanced buffer index handling to address precision issues by removing redundant operations.
      - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
      - Updated related documentation to reflect changes in buffer management practices.
      
      * Remove obsolete test script for AMD example, streamlining the examples directory.
      
      * Remove unused dtype_size variable in AMD example script to streamline code.
      
      * Add input configuration file and update AMD example script for enhanced flexibility
      
      - Introduced a new input.txt file for configurable parameters.
      - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
      - Streamlined the main function for better clarity and organization.
      - Added a new test script to facilitate running the example with specified parameters.
      
      * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
      
      - Deleted input.txt and test.sh files as they are no longer needed.
      - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
      - Reintroduced swizzle usage in the kernel for better performance.
      
      * Refactor AMD example script for FlashAttention-2
      
      - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
      - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
      - Removed outdated comments and improved code organization for better readability.
      
      * Refactor formatting in AMD FlashAttention example script
      
      - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
      - Streamlined the `main` function parameter formatting for consistency.
      - Removed unnecessary blank lines to enhance overall code organization.
      
      * Update example_amd_flash_attn_fwd.py
      
      * Update AMD FlashAttention example and TVM submodule
      
      - Added a new example script `example_amd_flash_attn_fwd_k_block.py` for FlashAttention with K-blocking support.
      - Enhanced `example_amd_flash_attn_fwd.py` by expanding configuration options for block sizes and threads.
      - Updated the TVM submodule to the latest commit for improved functionality.
      - Introduced a new test script `test.sh` to facilitate running the new example with specified parameters.
      
      * Add CI workflow for automated format checking and testing
      
      - Introduced a new GitHub Actions workflow in `amd_ci.yml` to automate format checks and testing for pull requests.
      - The workflow includes steps for setting up a Python environment, running format checks, and executing tests.
      - Removed obsolete example script `example_amd_flash_attn_fwd_k_block.py` and test script `test.sh` to streamline the examples directory.
      
      * Rename CI workflow from "CI" to "AMD CI" for clarity and specificity.
      
      * Update AMD CI workflow to include copying PyTorch, TorchVision, and Torchaudio packages to the virtual environment for improved dependency management.
      
      * Update AMD CI workflow to install pytest directly instead of using requirements-test.txt
      
      * Update AMD CI workflow to remove 'flash-attn' from requirements and install dependencies from requirements-test.txt
      
      * Refactor AMD CI workflow to enhance clarity in removing 'flash-attn' from requirements-test.txt before installation
      
      * Remove Torchaudio package copying from AMD CI workflow to streamline dependency management.
      
      * Refactor AMD CI workflow to remove the format-check job and streamline the build-test process by directly copying PyTorch and TorchVision packages to the virtual environment.
      
      * Add installation of ROCm in AMD CI workflow
      
      - Included a step to execute the `install_rocm.sh` script for improved setup.
      - Removed unnecessary blank line for better readability in the workflow script.
      
      * Remove installation step for ROCm in AMD CI workflow to simplify the setup process.
      
      * Update AMD CI workflow to run specific test file with verbose output instead of all tests.
      
      * Add new tilelang built-in operations for AMD architecture
      
      - Introduced `tvm_mfma`, `tvm_mfma_store`, `tvm_rdna_wmma`, and `tvm_rdna_wmma_store` built-in operations to enhance support for matrix multiplication and storage in tilelang.
      - Each operation is configured with the appropriate number of inputs and marked as opaque in terms of call effects.
      
      * Enhance autotuner configurations and GEMM operations in AMD example
      
      - Updated block sizes and num_split_q parameters in `get_configs` for improved autotuning.
      - Modified `T.gemm` calls in `fast_flashattn` to utilize `GemmWarpPolicy.FullRow`, optimizing performance for matrix multiplications.
      
      * Update autotuner configurations in AMD example for enhanced performance
      
      - Refined block sizes, thread counts, and added new parameters in `get_configs` to optimize autotuning.
      - Adjusted `fast_flashattn` function to incorporate new parameters for panel size and coalesced widths, improving memory access patterns.
      
      * Enhance autotuner configurations and memory handling in AMD example
      
      - Expanded block sizes and thread counts in `get_configs` for improved autotuning capabilities.
      - Updated `fast_flashattn` to utilize a new shared memory allocation strategy, optimizing memory access patterns during GEMM operations.
      
      * Refine autotuner configurations and memory usage in AMD example
      
      - Reduced block sizes and adjusted thread counts in `get_configs` for optimized autotuning.
      - Updated `fast_flashattn` to utilize register fragments for accumulation, minimizing LDS usage and enhancing performance during GEMM operations.
      
      * Update autotuner configurations in AMD example for enhanced performance
      
      - Expanded block sizes and thread counts in `get_configs` to improve autotuning capabilities.
      - Adjusted `num_split_q` and `v_coalesced_width` parameters for better optimization during GEMM operations.
      
      * Enhance autotuner configurations and GEMM operations in AMD example
      
      - Expanded thread counts in `get_configs` to include higher values for improved autotuning.
      - Updated `fast_flashattn` to adjust accumulation logic and ensure proper handling of causal conditions, optimizing performance during matrix multiplications.
      
      * Update AMD CI workflow and remove obsolete test script
      
      - Modified the CI workflow to run on multiple environments: self-hosted, amd, and gpu.
      - Deleted the outdated `test.sh` script from the examples directory, streamlining the project structure.
      
      * Remove TVM subproject from 3rdparty directory
      
      * Refactor configuration generation and accumulation logic in AMD example
      
      - Reformatted the `get_configs` function for improved readability by aligning parameters.
      - Adjusted the `fast_flashattn` function to enhance clarity in the conditional logic for accumulation, ensuring better handling of causal conditions.
      
      * Enhance AMD CI workflow with additional logging and setup steps
      
      - Added echo statements to provide feedback during the CI process, indicating when the environment is running on an AMD GPU, copying necessary packages, and installing requirements.
      - Improved clarity in the workflow by explicitly stating when the project is being installed and when tests are being executed.
      
      * Comment out package copying in AMD CI workflow to prevent potential issues during environment setup
      
      * Update AMD CI workflow to install nightly versions of PyTorch and remove obsolete package copying steps
      
      * Enhance BuildTileLangHIP function by adding whitespace for improved readability
      
      * Refactor kTVMGridConstant definition for clarity and remove unnecessary comment
      
      * Update TVM subproject to latest commit a64a5926a6e59f5417ef2501f9d88b467337cf6a
      
      * lint fix
      
      * Update AMD CI workflow to use requirements-rocm.txt for dependency installation
      
      * fix ci
      
      * Remove dependency on format-check from AMD CI workflow
      
      * fix ci
      
      * fix ci
      
      * fix ci
      
      * Remove format-check job from AMD CI workflow
      
      * Add torch to requirements-rocm.txt and remove explicit pip install commands from AMD CI workflow
      
      * Add dependency on format-check job in AMD CI workflow
      
      * Add format-check job to AMD CI workflow
      
      * Update format-check job in AMD CI workflow to run on self-hosted environment
      
      * Enhance format-check job in AMD CI workflow with improved Python environment setup and automatic commit of lint changes
      
      * Update amd_ci.yml
      
      ---------
      Co-authored-by: default avatarxinxyxiao <xinyxiao@amd.com>
      Co-authored-by: default avatarLei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: default avatarLeiWang1999 <leiwang1999@outlook.com>
      8e1b88f3
    • Gabriel Wu's avatar
      [Chore] fix typos (#719) · d0742860
      Gabriel Wu authored
      * chore: fix typos
      
      * chore: fix ruff
      
      * chore: fix clang-format
      d0742860
    • Wenhao Xie's avatar
      [CI] fix docs ci (#720) · 6545b084
      Wenhao Xie authored
      6545b084
  9. 14 Aug, 2025 3 commits
  10. 13 Aug, 2025 3 commits
    • Lei Wang's avatar
      [Index] Relocate Int64 Auto Promoter to ConfigBitWidth Pass, removing it from FlattenBuffer (#714) · a9611738
      Lei Wang authored
      * Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107
      
      * Refactor inject_pipeline.cc to enhance pipeline body rewriting and condition handling
      
      - Introduced a new function to replace IfThenElse nodes with their then_case while preserving attributes.
      - Streamlined the PipelineBodyRewriter to improve buffer access rewriting and async state management.
      - Enhanced the handling of pipeline loop conditions and added support for predicate conditions in the pipeline body.
      - Removed obsolete code and improved overall code clarity and maintainability.
      
      * lint fix
      
      * Refactor return statements in inject_pipeline.cc to remove unnecessary std::move calls
      
      - Updated return statements in multiple methods to return objects directly instead of using std::move, improving code clarity and potentially avoiding unnecessary moves.
      - Ensured consistent handling of BufferStore and BufferLoad nodes during pipeline transformations.
      
      * test fix
      
      * Enhance global read detection in pipeline planning
      
      - Updated the handling of global reads to account for condition expressions within IfThenElse nodes, ensuring accurate identification of global memory accesses.
      - Introduced a new flag to track whether the visitor is within a condition expression, improving the correctness of buffer access analysis.
      - Refactored the VisitStmt_ method to properly handle the structure of IfThenElse nodes, enhancing the clarity and maintainability of the code.
      
      * Add IndexLegalizer to enforce int64 for out-of-bound indices
      
      - Introduced the IndexLegalizer class to ensure that indices in BufferStore and BufferLoad nodes are promoted to int64 when they exceed their type bounds.
      - Refactored the Int64Promoter logic from flatten_buffer.cc into IndexLegalizer, improving code organization and reusability.
      - Updated the ConfigIndexBitwidth pass to apply IndexLegalizer after rewriting the body, enhancing the handling of index bitwidths in transformations.
      a9611738
    • Lei Wang's avatar
      [Pipeline] Skip condition expression analysis for global reading (#713) · c1eef511
      Lei Wang authored
      * Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107
      
      * Refactor inject_pipeline.cc to enhance pipeline body rewriting and condition handling
      
      - Introduced a new function to replace IfThenElse nodes with their then_case while preserving attributes.
      - Streamlined the PipelineBodyRewriter to improve buffer access rewriting and async state management.
      - Enhanced the handling of pipeline loop conditions and added support for predicate conditions in the pipeline body.
      - Removed obsolete code and improved overall code clarity and maintainability.
      
      * lint fix
      
      * Refactor return statements in inject_pipeline.cc to remove unnecessary std::move calls
      
      - Updated return statements in multiple methods to return objects directly instead of using std::move, improving code clarity and potentially avoiding unnecessary moves.
      - Ensured consistent handling of BufferStore and BufferLoad nodes during pipeline transformations.
      
      * test fix
      
      * Enhance global read detection in pipeline planning
      
      - Updated the handling of global reads to account for condition expressions within IfThenElse nodes, ensuring accurate identification of global memory accesses.
      - Introduced a new flag to track whether the visitor is within a condition expression, improving the correctness of buffer access analysis.
      - Refactored the VisitStmt_ method to properly handle the structure of IfThenElse nodes, enhancing the clarity and maintainability of the code.
      c1eef511
    • Lei Wang's avatar
      [Pipeline] Phaseout fragment and double buffer info from pipeline pass (#711) · 49d5d80e
      Lei Wang authored
      * Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107
      
      * Refactor inject_pipeline.cc to enhance pipeline body rewriting and condition handling
      
      - Introduced a new function to replace IfThenElse nodes with their then_case while preserving attributes.
      - Streamlined the PipelineBodyRewriter to improve buffer access rewriting and async state management.
      - Enhanced the handling of pipeline loop conditions and added support for predicate conditions in the pipeline body.
      - Removed obsolete code and improved overall code clarity and maintainability.
      
      * lint fix
      
      * Refactor return statements in inject_pipeline.cc to remove unnecessary std::move calls
      
      - Updated return statements in multiple methods to return objects directly instead of using std::move, improving code clarity and potentially avoiding unnecessary moves.
      - Ensured consistent handling of BufferStore and BufferLoad nodes during pipeline transformations.
      
      * test fix
      49d5d80e
  11. 12 Aug, 2025 2 commits
  12. 11 Aug, 2025 2 commits
    • Wenhao Xie's avatar
      [Enhancement] Add eviction policy support for TMA operations, enhance CUDA... · 6664d170
      Wenhao Xie authored
      [Enhancement] Add eviction policy support for TMA operations, enhance CUDA codegen, and introduce new pass config (#690)
      
      * Enhance TMA and barrier handling in CUDA code generation
      
      - Updated `CodeGenTileLangCUDA` to support eviction policies for TMA operations, allowing for more flexible memory management.
      - Introduced a new `CacheHintSm90` enum to define eviction strategies in `copy_sm90.h`.
      - Modified TMA load/store functions to accept eviction policies, improving performance on different architectures.
      - Enhanced `TmaBarrierCollector` and `TmaBarrierRewriter` to account for SIMT copies, ensuring correct barrier insertion.
      - Refactored thread synchronization logic to utilize barrier IDs, improving the efficiency of partial thread synchronization.
      - Updated Python interface for `copy` and `c2d_im2col` to include optional eviction policy parameters, enhancing usability.
      
      * update shuffle and elect optimization
      
      * fix bug
      
      * fix bug
      
      * fix potential bug
      
      * lint fix
      
      * lint fix
      
      * update shuffle_elect template
      
      * fix bug
      
      * fix bug
      
      * fix template
      
      * lint and fix
      
      * fix typo
      6664d170
    • FeiyangChen's avatar
      [Feat] Support mma gemm with stride (#701) · fe70549f
      FeiyangChen authored
      
      
      * gemm_with_stride sm89
      
      * fix offset issue
      
      * bug fix
      
      * format
      
      * sm80 support
      
      * add sm90
      
      * add testing
      
      * format
      
      * add static_assert for wgmma
      
      * Enhance error message for inner_box_dim validation in LowerBulkCopy
      
      * lint fix
      
      ---------
      Co-authored-by: default avatarLeiWang1999 <leiwang1999@outlook.com>
      fe70549f
  13. 10 Aug, 2025 2 commits
    • Zhengju Tang's avatar
      Low-bit kernels fix and implementation (#704) · 569b0127
      Zhengju Tang authored
      
      
      * [MXFP4] Dequantize FP4 kernel example, MX scale todo
      
      * [BugFix] Fix the bug of fp4&fp16 exponential bias
      
      * [MXFP4] Add group scale factor for BF16xMXFP4 gemm
      
      * [Lint]
      
      * [Test] Add test script for BF16xMXFP4 gemm
      
      * [Lint]
      
      * [BugFix] Fix the shape of scale tensor
      
      * Update example_dequant_gemm_fp4_hopper.py
      
      ---------
      Co-authored-by: default avatarLeiWang1999 <leiwang1999@outlook.com>
      Co-authored-by: default avatarLei Wang <34334180+LeiWang1999@users.noreply.github.com>
      569b0127
    • Lei Wang's avatar
      [Pipeline] Optimize inject software pipeline and pipeline planing pass (#706) · 376ba9eb
      Lei Wang authored
      * Refactor inject_pipeline.cc to improve version handling and add unique producer head tracking
      
      - Updated version check to allow for cases with two or more versions.
      - Adjusted logic to decrement num_versions when multi-versioning is not needed.
      - Introduced a helper function to ensure unique producer heads are added to the commit group.
      - Removed obsolete AddAllocBuffers method to streamline code.
      
      * lint fix
      
      * Refactor pipeline planning logic to enhance copy stage dependency management
      
      - Removed obsolete conditional expression handling from the pipeline planning code.
      - Introduced a new structure to manage copy stage dependency reads, improving clarity and efficiency.
      - Updated logic to correctly identify producer stages for copy stages, ensuring accurate pipeline stage assignment.
      - Added a new block sparse matrix multiplication function in the testing suite to validate the pipeline planning changes.
      
      * Update ci.yml
      
      * Fix structural equality checks in AddUnique and Contains methods to compare buffer references instead of entire regions in pipeline planning.
      
      * Refactor pipeline planning logic to improve copy stage dependency propagation
      
      - Updated structural equality checks in AddUnique and Contains methods to use buffer reference comparison.
      - Enhanced the iteration logic for managing copy stage dependencies, ensuring accurate identification of producer stages.
      - Added safeguards against exceeding maximum iterations during dependency propagation.
      376ba9eb
  14. 08 Aug, 2025 3 commits
    • Lei Wang's avatar
      [Layout] Introduce a new layout inference mechanism (#699) · 407117e1
      Lei Wang authored
      
      
      * Implement new free stage layout inference.
      
      * Fix bug
      
      * Make replication upcasting and unnormalizable iterators safe.
      
      * Better handling of updating with more replica
      
      * Remove unnecessary check.
      
      * Fix compilation.
      
      * Fix setup.py.
      
      * Simplify development mode.
      
      * Allow ParallelOp layout when there's already a compatible layout specified
      
      * lint fix
      
      * Add ProveFragmentContains function to validate thread access between small and large fragments
      
      This function checks if the threads accessing elements of a smaller fragment are a subset of those accessing a larger fragment, ensuring valid access during updates. The implementation includes deriving thread indices, computing logical indices, and verifying thread mappings.
      
      * Update dependencies in requirements files
      
      * Remove 'thefuzz' from requirements-dev.txt
      * Specify exact versions for 'torch' and add 'flash_attn' in requirements-test.txt
      
      * Update CI workflow to use SHA256 hash for requirements file
      
      * Update requirements and CI workflow for flash attention
      
      * Removed specific version for 'torch' in requirements-test.txt
      * Added installation of 'flash_attn==2.5.8' in CI workflow to ensure compatibility
      
      * Refactor flash attention import handling in examples
      
      * Removed availability checks for 'flash_attn' in multiple example scripts.
      * Simplified import statements for 'flash_attn' to ensure consistent usage across examples.
      
      ---------
      Co-authored-by: default avatarHuanqi Cao <caohuanqi@deepseek.com>
      407117e1
    • Lei Wang's avatar
      [CI] Remove Flash Attention dependency (#705) · 87aae294
      Lei Wang authored
      * Update flash-attn version in requirements-test.txt from <=2.2.0 to ==2.5.8
      
      * lint fix
      
      * Remove unused dependencies from requirements-test.txt
      
      * Update import path for padding functions in example MHA forward variable length script
      
      * Refactor code formatting in bert_padding.py for improved readability
      87aae294
    • Yichen Yan's avatar
      Trivial update to calculate target arch (#702) · da74c09d
      Yichen Yan authored
      
      
      * Trivial update to calculate target arch
      
      * Update tilelang/contrib/nvrtc.py
      Co-authored-by: default avatargemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      * fmt
      
      ---------
      Co-authored-by: default avatargemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      da74c09d
  15. 07 Aug, 2025 2 commits
  16. 06 Aug, 2025 2 commits
    • Lei Wang's avatar
      [Example] Optimize warp specialize flashmla example (#698) · a1149cab
      Lei Wang authored
      * [Enhancement] Disable cache and append git commit ID to version in tilelang (#688)
      
      * Disabled caching in quickstart example for improved performance.
      * Added a function to retrieve the current git commit ID and appended it to the version string if not already present, enhancing version tracking and debugging capabilities.
      
      * revert quickstart
      
      * optimize code.
      a1149cab
    • Lei Wang's avatar
      [Version] Keep local commit id as it somehow help with debugging (#697) · ed1b96d5
      Lei Wang authored
      * [Enhancement] Disable cache and append git commit ID to version in tilelang (#688)
      
      * Disabled caching in quickstart example for improved performance.
      * Added a function to retrieve the current git commit ID and appended it to the version string if not already present, enhancing version tracking and debugging capabilities.
      
      * revert quickstart
      ed1b96d5
  17. 05 Aug, 2025 1 commit
    • Lei Wang's avatar
      [Smem Reuse] Optimize to do memory alignment on identical buffers. (#693) · 17fafc1b
      Lei Wang authored
      * [Enhancement] Refactor GEMM operations for improved warp partitioning and target instruction handling
      
      - Introduced a new `GetGemmInst` method to determine the appropriate GEMM instruction based on block size and target architecture.
      - Updated `ComputeWarpPartition` to accept the GEMM instruction type, enhancing flexibility in warp partitioning logic.
      - Added `TargetGetWarpSize` utility to streamline warp size retrieval based on target architecture.
      - Refactored layout inference and lowering methods to utilize the new GEMM instruction handling, improving clarity and maintainability of the codebase.
      
      * bug fix
      
      * test fix
      
      * lint fix
      
      * phase out Canonialize
      
      * add option --expt-relaxed-constexpr
      
      * [Enhancement] Introduce tilelang intrinsic operations for GEMM
      
      - Added `tl_gemm` and `tl_gemm_sp` built-in operations to support general and sparse matrix multiplication in tilelang.
      - Updated the lowering logic in `Gemm` and `GemmSP` to utilize the new tilelang operations.
      - Enhanced CUDA and HIP code generation to handle the new GEMM operations, ensuring proper argument validation and external call printing.
      - Implemented shared memory alignment planning for GEMM operations to optimize performance on supported architectures.
      
      * lint fix
      
      * lint fix
      
      * test fix
      
      * test fix
      
      * rebase
      
      * Update builtin.cc
      17fafc1b
  18. 04 Aug, 2025 1 commit
  19. 03 Aug, 2025 2 commits
    • Lei Wang's avatar
      [Refactor] Introduce GemmInst for different targets handling (#688) · d2afb513
      Lei Wang authored
      * [Enhancement] Refactor GEMM operations for improved warp partitioning and target instruction handling
      
      - Introduced a new `GetGemmInst` method to determine the appropriate GEMM instruction based on block size and target architecture.
      - Updated `ComputeWarpPartition` to accept the GEMM instruction type, enhancing flexibility in warp partitioning logic.
      - Added `TargetGetWarpSize` utility to streamline warp size retrieval based on target architecture.
      - Refactored layout inference and lowering methods to utilize the new GEMM instruction handling, improving clarity and maintainability of the codebase.
      
      * bug fix
      
      * test fix
      
      * lint fix
      d2afb513
    • Lei Wang's avatar
      [Refactor] Rebase pipeline injector from upstream tvm (#687) · 73bf8346
      Lei Wang authored
      * [Enhancement] Introduce software pipeline rewriter and refactor buffer access handling
      
      - Added a new `PipelineOpaqueAccessRewriter` class to manage opaque buffer accesses in the software pipeline.
      - Refactored the `PipelineBodyRewriter` to utilize the new rewriter for improved buffer access handling.
      - Enhanced the `PipelineRewriter` to support additional fragment information and streamline pipeline construction.
      - Updated tests to reflect changes in buffer management and access patterns, ensuring compatibility with the new structure.
      - Removed obsolete code related to previous buffer access methods for clarity and maintainability.
      
      * test fix
      73bf8346