1. 20 Aug, 2025 1 commit
  2. 19 Aug, 2025 3 commits
    • [Refactor] Refactor env into a more flexible version (#740) · 72be4909
      Lei Wang authored
      * Fix environment variable name for compilation print setting in `env.py`
      
      * Remove deprecated test file for warp specialized pass configuration and refactor environment variable access in `env.py` to utilize a centralized `EnvVar` class for better management and clarity.
      
      * lint fix
      
      * Refactor cache check to use `env.is_cache_enabled()` for consistency in `tuner.py`
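The centralized `EnvVar` pattern named in this commit can be sketched as below. This is a minimal illustration, not TileLang's actual `env.py`: only the names `EnvVar` and `is_cache_enabled()` come from the commit message, while the descriptor layout, the `Environment` class, and the `EXAMPLE_DISABLE_CACHE` variable name are assumptions made for the example.

```python
import os


class EnvVar:
    """Descriptor that reads one environment variable on every access."""

    def __init__(self, key: str, default: str):
        self.key = key
        self.default = default

    def __get__(self, instance, owner=None) -> str:
        # Look the value up lazily so later changes to os.environ are visible.
        return os.environ.get(self.key, self.default)


class Environment:
    # Hypothetical variable name, used only for this sketch.
    DISABLE_CACHE = EnvVar("EXAMPLE_DISABLE_CACHE", "0")

    def is_cache_enabled(self) -> bool:
        # The cache is on unless the variable is set to a truthy string.
        return self.DISABLE_CACHE.lower() not in ("1", "true", "yes", "on")


env = Environment()
```

Callers then ask `env.is_cache_enabled()` instead of parsing `os.environ` themselves, which keeps the parsing rules in one place.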
    • 📝 Add docstrings to `mxfp4` (#732) · e3a80b70
      coderabbitai[bot] authored
      * 📝 Add docstrings to `mxfp4`
      
      Docstrings generation was requested by @LeiWang1999.
      
      * https://github.com/tile-ai/tilelang/pull/725#issuecomment-3191656561
      
      
      
      The following files were modified:
      
      * `examples/bitnet-1.58b/kernel_benchmark/tilelang_bitnet_158_int8xint2_prefill.py`
      * `examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py`
      * `examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py`
      * `examples/dequantize_gemm/utils.py`
      * `examples/gemm/example_gemm_autotune.py`
      * `tilelang/intrinsics/utils.py`
      * `tilelang/language/__init__.py`
      * `tilelang/language/utils.py`
      * `tilelang/quantize/mxfp.py`
      * `tilelang/quantize/quantization.py`
      
      * [Lint] More accurate docstring
      
      * [Lint]
      
      ---------
      Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
      Co-authored-by: tzj-fxz <tzjfxz@gmail.com>
    • [Feature] Low-bit twiddling dequantization and FP4 GEMM (#725) · 24603e4a
      Zhengju Tang authored
      
      
      * [Dequant] Add bit-twiddling dequantize cuda for fp4-->bf16
      
      * [Dequant] Add extern call and serial dequantization
      
      * [Dequant] Parallel Dequant wait for fence debug.
      
      * [Scale] Add scale matrix to mxfp4 gemm
      
      * [Remove] Remove fence-buggy example and some generated source cuda code
      
      * [MXFP4] Update initial version of MXFP4 GEMM
      
      * [Scale] Add scale to latest mxfp4 gemm
      
      * [Lint]
      
      * [BugFix] Load Scale, disable TMA to recover performance
      
      * [Lint]
      
      * [Lint]
      
      * [Scale] Use L2 to hold Scale; enabling TMA slightly boosts performance
      
      * [Lint]
      
      * Update example_dequant_gemm_bf16_fp4_hopper_serial.py
      
      * Remove deprecated dequantization examples for BF16 and MXFP4 in the dequantize_gemm directory.
      
      * Refactor dequantization examples for improved readability and consistency. Adjusted formatting in matmul function and added spacing for clarity. Updated function signatures and comments for better understanding.
      
      * Refactor index_to_coordinates usage in bitnet example and update dequantization example configurations. Removed the custom index_to_coordinates function and replaced it with the built-in version. Adjusted block_K parameter in dequantization example for consistency.
      
      * lint fix
      
      * ci fix
      
      * Remove non-existent example
      
      * [BugFix] Add smem swizzle to recover performance of TMA
      
      * [BugFix] Enough reg for producer when threads=512
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
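The fp4-to-bf16 bit twiddling mentioned in this commit can be illustrated on the host side. The sketch below is not the CUDA kernel from the commit. It assumes the E2M1 FP4 encoding (1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit) and a low-nibble-first packing order; all function names are hypothetical.

```python
import struct
from typing import Tuple


def fp4_e2m1_to_bf16_bits(nibble: int) -> int:
    """Map a 4-bit E2M1 code to the 16-bit pattern of the same value in bf16."""
    sign = (nibble >> 3) & 0x1
    exp = (nibble >> 1) & 0x3
    man = nibble & 0x1
    if exp == 0:
        # Subnormal FP4: the value is 0 or 0.5 (0.5 = 2**-1, biased exponent 126).
        bits = 0 if man == 0 else (126 << 7)
    else:
        # Normal FP4: rebias the exponent from 1 to 127, left-align the mantissa.
        bits = ((exp - 1 + 127) << 7) | (man << 6)
    return (sign << 15) | bits


def bf16_bits_to_float(bits: int) -> float:
    """Interpret 16 bf16 bits as the upper half of a float32."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


def unpack_fp4_pair(byte: int) -> Tuple[float, float]:
    """Decode two FP4 values packed in one byte (low nibble first, an assumption)."""
    lo = bf16_bits_to_float(fp4_e2m1_to_bf16_bits(byte & 0xF))
    hi = bf16_bits_to_float(fp4_e2m1_to_bf16_bits(byte >> 4))
    return lo, hi
```

Because bf16 has a much wider exponent range than FP4, every FP4 code maps exactly to a bf16 value using only shifts and ORs, which is what makes this conversion attractive inside a GEMM inner loop.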
  3. 18 Aug, 2025 3 commits
    • [CI] Fix AMD CI (#729) · a86223f4
      alex_xiao authored
      
      
      * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
      
      - Enhanced buffer index handling to address precision issues by removing redundant operations.
      - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
      - Updated related documentation to reflect changes in buffer management practices.
      
      * Remove obsolete test script for AMD example, streamlining the examples directory.
      
      * Remove unused dtype_size variable in AMD example script to streamline code.
      
      * Add input configuration file and update AMD example script for enhanced flexibility
      
      - Introduced a new input.txt file for configurable parameters.
      - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
      - Streamlined the main function for better clarity and organization.
      - Added a new test script to facilitate running the example with specified parameters.
      
      * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
      
      - Deleted input.txt and test.sh files as they are no longer needed.
      - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
      - Reintroduced swizzle usage in the kernel for better performance.
      
      * Refactor AMD example script for FlashAttention-2
      
      - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
      - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
      - Removed outdated comments and improved code organization for better readability.
      
      * Refactor formatting in AMD FlashAttention example script
      
      - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
      - Streamlined the `main` function parameter formatting for consistency.
      - Removed unnecessary blank lines to enhance overall code organization.
      
      * Update example_amd_flash_attn_fwd.py
      
      * Enhance AMD example script and update CI workflows
      
      - Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
      - Added new CI workflows for AMD and documentation publishing.
      - Updated various requirements files to include necessary dependencies.
      - Introduced new test cases and examples for better coverage and functionality.
      - Refactored existing code for improved readability and maintainability.
      
      * Remove redundant tool cache cleanup step in AMD CI workflow
      
      * Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.
      
      ---------
      Co-authored-by: xinxyxiao <xinyxiao@amd.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
    • 📝 Add docstrings to `fix` (#726) · a5074fd5
      coderabbitai[bot] authored
      Docstrings generation was requested by @LeiWang1999.
      
      * https://github.com/tile-ai/tilelang/pull/712#issuecomment-3190680851
      
      
      
      The following files were modified:
      
      * `src/op/gemm.cc`
      * `src/tl_templates/cuda/gemm_sm90.h`
      * `src/transform/warp_specialized_rewriter.cc`
      Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
    • [Enhancement][Bugfix] Fix bug in warp specialized pass and add gemm_sr... · f4a828f6
      Wenhao Xie authored
      
      [Enhancement][Bugfix] Fix bug in warp specialized pass and add gemm_sr fallback support for Hopper (#712)
      
      * bug fix and support gemm_sr fallback for hopper
      
      * Update gemm.cc
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
  4. 17 Aug, 2025 1 commit
    • [Language] Introduce `StridedTensor` to support non-contiguous torch inputs (#722) · 1b308baf
      Lei Wang authored
      
      
      * Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107
      
      * Support strided tensors
      
      * Refactor target attribute helper functions for improved clarity
      
      * No code changes made in proxy.py and setup.py
      
      * lint fix
      
      * lint fix via gemini
      
      * lint fix
      
      * test fix
      
      * test fix
      
      * lint fix
      
      * Update wrapper.py
      
      * test fix
      
      * Enhance test for InjectSoftwarePipeline by adding LowerOpaqueBlock transformation and updating expected function signature to use match_buffer for better clarity.
      
      * lint fix
      
      ---------
      Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
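Supporting non-contiguous inputs comes down to addressing elements through explicit strides rather than assuming a dense row-major layout. The sketch below shows only that addressing rule in plain Python; it does not use or reproduce the `StridedTensor` API from this commit, and both helper names are hypothetical.

```python
from typing import Sequence, Tuple


def flat_offset(index: Sequence[int], strides: Sequence[int]) -> int:
    """Element offset of `index` in a buffer addressed through explicit strides."""
    if len(index) != len(strides):
        raise ValueError("index and strides must have the same rank")
    return sum(i * s for i, s in zip(index, strides))


def contiguous_strides(shape: Sequence[int]) -> Tuple[int, ...]:
    """Row-major strides (in elements) for a densely packed tensor."""
    strides, running = [], 1
    for extent in reversed(shape):
        strides.append(running)
        running *= extent
    return tuple(reversed(strides))
```

For example, a 4x8 row-major tensor has strides `(8, 1)`, while its transposed view has shape `(8, 4)` and strides `(1, 8)`. Element `(2, 3)` of the view and element `(3, 2)` of the original resolve to the same offset, so a kernel that honors strides can read the view without a copy.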
  5. 16 Aug, 2025 1 commit
    • [Refactor] Refactor CUDA code generation to simplify eviction policy handling (#721) · c369d690
      Lei Wang authored
      * Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107
      
      * Refactor CUDA code generation to simplify eviction policy handling
      
      - Updated `VisitExpr_` methods in `codegen_cuda.cc` to use default eviction policy for `tma_load`, `tma_load_im2col`, and `tma_store` functions, reducing complexity.
      - Removed conditional assembly code for `EVICT_NORMAL` in `copy_sm90.h`, streamlining the assembly calls for tensor memory operations.
      
      * lint fix
  6. 15 Aug, 2025 4 commits
    • [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy (#724) · 2bd2d69e
      NaOHCC authored
      * [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy
      
      * [Typo] Correct architecture selection for CUDA and CDNA
    • [CI][AMD] Add AMD GPU CI and fix some related bugs (#694) · 8e1b88f3
      alex_xiao authored
      
      
      * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
      
      - Enhanced buffer index handling to address precision issues by removing redundant operations.
      - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
      - Updated related documentation to reflect changes in buffer management practices.
      
      * Remove obsolete test script for AMD example, streamlining the examples directory.
      
      * Remove unused dtype_size variable in AMD example script to streamline code.
      
      * Add input configuration file and update AMD example script for enhanced flexibility
      
      - Introduced a new input.txt file for configurable parameters.
      - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
      - Streamlined the main function for better clarity and organization.
      - Added a new test script to facilitate running the example with specified parameters.
      
      * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
      
      - Deleted input.txt and test.sh files as they are no longer needed.
      - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
      - Reintroduced swizzle usage in the kernel for better performance.
      
      * Refactor AMD example script for FlashAttention-2
      
      - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
      - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
      - Removed outdated comments and improved code organization for better readability.
      
      * Refactor formatting in AMD FlashAttention example script
      
      - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
      - Streamlined the `main` function parameter formatting for consistency.
      - Removed unnecessary blank lines to enhance overall code organization.
      
      * Update example_amd_flash_attn_fwd.py
      
      * Update AMD FlashAttention example and TVM submodule
      
      - Added a new example script `example_amd_flash_attn_fwd_k_block.py` for FlashAttention with K-blocking support.
      - Enhanced `example_amd_flash_attn_fwd.py` by expanding configuration options for block sizes and threads.
      - Updated the TVM submodule to the latest commit for improved functionality.
      - Introduced a new test script `test.sh` to facilitate running the new example with specified parameters.
      
      * Add CI workflow for automated format checking and testing
      
      - Introduced a new GitHub Actions workflow in `amd_ci.yml` to automate format checks and testing for pull requests.
      - The workflow includes steps for setting up a Python environment, running format checks, and executing tests.
      - Removed obsolete example script `example_amd_flash_attn_fwd_k_block.py` and test script `test.sh` to streamline the examples directory.
      
      * Rename CI workflow from "CI" to "AMD CI" for clarity and specificity.
      
      * Update AMD CI workflow to include copying PyTorch, TorchVision, and Torchaudio packages to the virtual environment for improved dependency management.
      
      * Update AMD CI workflow to install pytest directly instead of using requirements-test.txt
      
      * Update AMD CI workflow to remove 'flash-attn' from requirements and install dependencies from requirements-test.txt
      
      * Refactor AMD CI workflow to enhance clarity in removing 'flash-attn' from requirements-test.txt before installation
      
      * Remove Torchaudio package copying from AMD CI workflow to streamline dependency management.
      
      * Refactor AMD CI workflow to remove the format-check job and streamline the build-test process by directly copying PyTorch and TorchVision packages to the virtual environment.
      
      * Add installation of ROCm in AMD CI workflow
      
      - Included a step to execute the `install_rocm.sh` script for improved setup.
      - Removed unnecessary blank line for better readability in the workflow script.
      
      * Remove installation step for ROCm in AMD CI workflow to simplify the setup process.
      
      * Update AMD CI workflow to run specific test file with verbose output instead of all tests.
      
      * Add new tilelang built-in operations for AMD architecture
      
      - Introduced `tvm_mfma`, `tvm_mfma_store`, `tvm_rdna_wmma`, and `tvm_rdna_wmma_store` built-in operations to enhance support for matrix multiplication and storage in tilelang.
      - Each operation is configured with the appropriate number of inputs and marked as opaque in terms of call effects.
      
      * Enhance autotuner configurations and GEMM operations in AMD example
      
      - Updated block sizes and num_split_q parameters in `get_configs` for improved autotuning.
      - Modified `T.gemm` calls in `fast_flashattn` to utilize `GemmWarpPolicy.FullRow`, optimizing performance for matrix multiplications.
      
      * Update autotuner configurations in AMD example for enhanced performance
      
      - Refined block sizes, thread counts, and added new parameters in `get_configs` to optimize autotuning.
      - Adjusted `fast_flashattn` function to incorporate new parameters for panel size and coalesced widths, improving memory access patterns.
      
      * Enhance autotuner configurations and memory handling in AMD example
      
      - Expanded block sizes and thread counts in `get_configs` for improved autotuning capabilities.
      - Updated `fast_flashattn` to utilize a new shared memory allocation strategy, optimizing memory access patterns during GEMM operations.
      
      * Refine autotuner configurations and memory usage in AMD example
      
      - Reduced block sizes and adjusted thread counts in `get_configs` for optimized autotuning.
      - Updated `fast_flashattn` to utilize register fragments for accumulation, minimizing LDS usage and enhancing performance during GEMM operations.
      
      * Update autotuner configurations in AMD example for enhanced performance
      
      - Expanded block sizes and thread counts in `get_configs` to improve autotuning capabilities.
      - Adjusted `num_split_q` and `v_coalesced_width` parameters for better optimization during GEMM operations.
      
      * Enhance autotuner configurations and GEMM operations in AMD example
      
      - Expanded thread counts in `get_configs` to include higher values for improved autotuning.
      - Updated `fast_flashattn` to adjust accumulation logic and ensure proper handling of causal conditions, optimizing performance during matrix multiplications.
      
      * Update AMD CI workflow and remove obsolete test script
      
      - Modified the CI workflow to run on multiple environments: self-hosted, amd, and gpu.
      - Deleted the outdated `test.sh` script from the examples directory, streamlining the project structure.
      
      * Remove TVM subproject from 3rdparty directory
      
      * Refactor configuration generation and accumulation logic in AMD example
      
      - Reformatted the `get_configs` function for improved readability by aligning parameters.
      - Adjusted the `fast_flashattn` function to enhance clarity in the conditional logic for accumulation, ensuring better handling of causal conditions.
      
      * Enhance AMD CI workflow with additional logging and setup steps
      
      - Added echo statements to provide feedback during the CI process, indicating when the environment is running on an AMD GPU, copying necessary packages, and installing requirements.
      - Improved clarity in the workflow by explicitly stating when the project is being installed and when tests are being executed.
      
      * Comment out package copying in AMD CI workflow to prevent potential issues during environment setup
      
      * Update AMD CI workflow to install nightly versions of PyTorch and remove obsolete package copying steps
      
      * Enhance BuildTileLangHIP function by adding whitespace for improved readability
      
      * Refactor kTVMGridConstant definition for clarity and remove unnecessary comment
      
      * Update TVM subproject to latest commit a64a5926a6e59f5417ef2501f9d88b467337cf6a
      
      * lint fix
      
      * Update AMD CI workflow to use requirements-rocm.txt for dependency installation
      
      * fix ci
      
      * Remove dependency on format-check from AMD CI workflow
      
      * fix ci
      
      * fix ci
      
      * fix ci
      
      * Remove format-check job from AMD CI workflow
      
      * Add torch to requirements-rocm.txt and remove explicit pip install commands from AMD CI workflow
      
      * Add dependency on format-check job in AMD CI workflow
      
      * Add format-check job to AMD CI workflow
      
      * Update format-check job in AMD CI workflow to run on self-hosted environment
      
      * Enhance format-check job in AMD CI workflow with improved Python environment setup and automatic commit of lint changes
      
      * Update amd_ci.yml
      
      ---------
      Co-authored-by: xinxyxiao <xinyxiao@amd.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    • [Chore] fix typos (#719) · d0742860
      Gabriel Wu authored
      * chore: fix typos
      
      * chore: fix ruff
      
      * chore: fix clang-format
    • [CI] fix docs ci (#720) · 6545b084
      Wenhao Xie authored
  7. 14 Aug, 2025 3 commits
  8. 13 Aug, 2025 3 commits
    • [Index] Relocate Int64 Auto Promoter to ConfigBitWidth Pass, removing it from FlattenBuffer (#714) · a9611738
      Lei Wang authored
      * Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107
      
      * Refactor inject_pipeline.cc to enhance pipeline body rewriting and condition handling
      
      - Introduced a new function to replace IfThenElse nodes with their then_case while preserving attributes.
      - Streamlined the PipelineBodyRewriter to improve buffer access rewriting and async state management.
      - Enhanced the handling of pipeline loop conditions and added support for predicate conditions in the pipeline body.
      - Removed obsolete code and improved overall code clarity and maintainability.
      
      * lint fix
      
      * Refactor return statements in inject_pipeline.cc to remove unnecessary std::move calls
      
      - Updated return statements in multiple methods to return objects directly instead of using std::move, improving code clarity and potentially avoiding unnecessary moves.
      - Ensured consistent handling of BufferStore and BufferLoad nodes during pipeline transformations.
      
      * test fix
      
      * Enhance global read detection in pipeline planning
      
      - Updated the handling of global reads to account for condition expressions within IfThenElse nodes, ensuring accurate identification of global memory accesses.
      - Introduced a new flag to track whether the visitor is within a condition expression, improving the correctness of buffer access analysis.
      - Refactored the VisitStmt_ method to properly handle the structure of IfThenElse nodes, enhancing the clarity and maintainability of the code.
      
      * Add IndexLegalizer to enforce int64 for out-of-bound indices
      
      - Introduced the IndexLegalizer class to ensure that indices in BufferStore and BufferLoad nodes are promoted to int64 when they exceed their type bounds.
      - Refactored the Int64Promoter logic from flatten_buffer.cc into IndexLegalizer, improving code organization and reusability.
      - Updated the ConfigIndexBitwidth pass to apply IndexLegalizer after rewriting the body, enhancing the handling of index bitwidths in transformations.
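The promotion rule described for `IndexLegalizer` (switch an index to int64 once it can exceed the bounds of its type) can be sketched as a range check. This is an illustration of the idea only, not the C++ pass: the function names are hypothetical, and the real pass works on symbolic bounds of TIR expressions rather than Python integers.

```python
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def index_dtype(min_value: int, max_value: int) -> str:
    """Pick the narrowest of int32/int64 that can hold every value in the range."""
    if min_value > max_value:
        raise ValueError("empty range")
    if INT32_MIN <= min_value and max_value <= INT32_MAX:
        return "int32"
    return "int64"


def flattened_index_dtype(shape) -> str:
    """Dtype needed for the largest row-major flat index of a buffer of `shape`."""
    total = 1
    for extent in shape:
        total *= extent
    # The largest valid flat index is total - 1 (clamped for empty buffers).
    return index_dtype(0, max(total - 1, 0))
```

A 65536 x 32768 buffer still fits, since its largest flat index is 2**31 - 1, while a 65536 x 65536 buffer needs int64.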
    • [Pipeline] Skip condition expression analysis for global reading (#713) · c1eef511
      Lei Wang authored
      * Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107
      
      * Refactor inject_pipeline.cc to enhance pipeline body rewriting and condition handling
      
      - Introduced a new function to replace IfThenElse nodes with their then_case while preserving attributes.
      - Streamlined the PipelineBodyRewriter to improve buffer access rewriting and async state management.
      - Enhanced the handling of pipeline loop conditions and added support for predicate conditions in the pipeline body.
      - Removed obsolete code and improved overall code clarity and maintainability.
      
      * lint fix
      
      * Refactor return statements in inject_pipeline.cc to remove unnecessary std::move calls
      
      - Updated return statements in multiple methods to return objects directly instead of using std::move, improving code clarity and potentially avoiding unnecessary moves.
      - Ensured consistent handling of BufferStore and BufferLoad nodes during pipeline transformations.
      
      * test fix
      
      * Enhance global read detection in pipeline planning
      
      - Updated the handling of global reads to account for condition expressions within IfThenElse nodes, ensuring accurate identification of global memory accesses.
      - Introduced a new flag to track whether the visitor is within a condition expression, improving the correctness of buffer access analysis.
      - Refactored the VisitStmt_ method to properly handle the structure of IfThenElse nodes, enhancing the clarity and maintainability of the code.
    • [Pipeline] Phaseout fragment and double buffer info from pipeline pass (#711) · 49d5d80e
      Lei Wang authored
      * Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107
      
      * Refactor inject_pipeline.cc to enhance pipeline body rewriting and condition handling
      
      - Introduced a new function to replace IfThenElse nodes with their then_case while preserving attributes.
      - Streamlined the PipelineBodyRewriter to improve buffer access rewriting and async state management.
      - Enhanced the handling of pipeline loop conditions and added support for predicate conditions in the pipeline body.
      - Removed obsolete code and improved overall code clarity and maintainability.
      
      * lint fix
      
      * Refactor return statements in inject_pipeline.cc to remove unnecessary std::move calls
      
      - Updated return statements in multiple methods to return objects directly instead of using std::move, improving code clarity and potentially avoiding unnecessary moves.
      - Ensured consistent handling of BufferStore and BufferLoad nodes during pipeline transformations.
      
      * test fix
  9. 12 Aug, 2025 2 commits
  10. 11 Aug, 2025 2 commits
    • [Enhancement] Add eviction policy support for TMA operations, enhance CUDA... · 6664d170
      Wenhao Xie authored
      [Enhancement] Add eviction policy support for TMA operations, enhance CUDA codegen, and introduce new pass config (#690)
      
      * Enhance TMA and barrier handling in CUDA code generation
      
      - Updated `CodeGenTileLangCUDA` to support eviction policies for TMA operations, allowing for more flexible memory management.
      - Introduced a new `CacheHintSm90` enum to define eviction strategies in `copy_sm90.h`.
      - Modified TMA load/store functions to accept eviction policies, improving performance on different architectures.
      - Enhanced `TmaBarrierCollector` and `TmaBarrierRewriter` to account for SIMT copies, ensuring correct barrier insertion.
      - Refactored thread synchronization logic to utilize barrier IDs, improving the efficiency of partial thread synchronization.
      - Updated Python interface for `copy` and `c2d_im2col` to include optional eviction policy parameters, enhancing usability.
      
      * update shuffle and elect optimization
      
      * fix bug
      
      * fix bug
      
      * fix potential bug
      
      * lint fix
      
      * lint fix
      
      * update shuffle_elect template
      
      * fix bug
      
      * fix bug
      
      * fix template
      
      * lint and fix
      
      * fix typo
    • [Feat] Support mma gemm with stride (#701) · fe70549f
      FeiyangChen authored
      
      
      * gemm_with_stride sm89
      
      * fix offset issue
      
      * bug fix
      
      * format
      
      * sm80 support
      
      * add sm90
      
      * add testing
      
      * format
      
      * add static_assert for wgmma
      
      * Enhance error message for inner_box_dim validation in LowerBulkCopy
      
      * lint fix
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
  11. 10 Aug, 2025 2 commits
    • Low-bit kernels fix and implementation (#704) · 569b0127
      Zhengju Tang authored
      
      
      * [MXFP4] Dequantize FP4 kernel example, MX scale todo
      
      * [BugFix] Fix the bug of fp4&fp16 exponential bias
      
      * [MXFP4] Add group scale factor for BF16xMXFP4 gemm
      
      * [Lint]
      
      * [Test] Add test script for BF16xMXFP4 gemm
      
      * [Lint]
      
      * [BugFix] Fix the shape of scale tensor
      
      * Update example_dequant_gemm_fp4_hopper.py
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
    • [Pipeline] Optimize inject software pipeline and pipeline planning pass (#706) · 376ba9eb
      Lei Wang authored
      * Refactor inject_pipeline.cc to improve version handling and add unique producer head tracking
      
      - Updated version check to allow for cases with two or more versions.
      - Adjusted logic to decrement num_versions when multi-versioning is not needed.
      - Introduced a helper function to ensure unique producer heads are added to the commit group.
      - Removed obsolete AddAllocBuffers method to streamline code.
      
      * lint fix
      
      * Refactor pipeline planning logic to enhance copy stage dependency management
      
      - Removed obsolete conditional expression handling from the pipeline planning code.
      - Introduced a new structure to manage copy stage dependency reads, improving clarity and efficiency.
      - Updated logic to correctly identify producer stages for copy stages, ensuring accurate pipeline stage assignment.
      - Added a new block sparse matrix multiplication function in the testing suite to validate the pipeline planning changes.
      
      * Update ci.yml
      
      * Fix structural equality checks in AddUnique and Contains methods to compare buffer references instead of entire regions in pipeline planning.
      
      * Refactor pipeline planning logic to improve copy stage dependency propagation
      
      - Updated structural equality checks in AddUnique and Contains methods to use buffer reference comparison.
      - Enhanced the iteration logic for managing copy stage dependencies, ensuring accurate identification of producer stages.
      - Added safeguards against exceeding maximum iterations during dependency propagation.
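The last two items (reference-based `AddUnique`/`Contains` checks and a bounded dependency-propagation loop) can be sketched as a fixed-point iteration. This is a simplified model and not the C++ pass: stages are plain dictionary keys, buffers are arbitrary Python objects compared by identity, and the function names and the `max_iters` default are assumptions.

```python
def add_unique(items: list, candidate: object) -> bool:
    """Append `candidate` unless the same object (by identity) is already present."""
    if any(existing is candidate for existing in items):
        return False
    items.append(candidate)
    return True


def propagate_copy_dependencies(stage_reads, stage_writes, copy_stage, max_iters=100):
    """Collect buffers the copy stage depends on, following producers transitively.

    stage_reads / stage_writes map a stage name to the list of buffer objects
    it reads / writes. Iterates to a fixed point, bounded by `max_iters`.
    """
    deps = []
    for buf in stage_reads[copy_stage]:
        add_unique(deps, buf)
    for _ in range(max_iters):
        changed = False
        for stage, writes in stage_writes.items():
            # A stage that writes any dependency is a producer: its reads matter too.
            if any(w is d for w in writes for d in deps):
                for buf in stage_reads[stage]:
                    changed |= add_unique(deps, buf)
        if not changed:
            return deps
    raise RuntimeError("dependency propagation did not converge")
```

Comparing by reference keeps two accesses to different regions of the same buffer from being counted as separate dependencies, and the iteration cap turns a potential infinite loop into an explicit error.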
  12. 08 Aug, 2025 3 commits
    • [Layout] Introduce a new layout inference mechanism (#699) · 407117e1
      Lei Wang authored
      
      
      * Implement new free stage layout inference.
      
      * Fix bug
      
      * Make replication upcasting and unnormalizable iterators safe.
      
      * Better handling of updating with more replicas
      
      * Remove unnecessary check.
      
      * Fix compilation.
      
      * Fix setup.py.
      
      * Simplify development mode.
      
      * Allow ParallelOp layout when there's already a compatible layout specified
      
      * lint fix
      
      * Add ProveFragmentContains function to validate thread access between small and large fragments
      
      This function checks if the threads accessing elements of a smaller fragment are a subset of those accessing a larger fragment, ensuring valid access during updates. The implementation includes deriving thread indices, computing logical indices, and verifying thread mappings.
      
      * Update dependencies in requirements files
      
      * Remove 'thefuzz' from requirements-dev.txt
      * Specify exact versions for 'torch' and add 'flash_attn' in requirements-test.txt
      
      * Update CI workflow to use SHA256 hash for requirements file
      
      * Update requirements and CI workflow for flash attention
      
      * Removed specific version for 'torch' in requirements-test.txt
      * Added installation of 'flash_attn==2.5.8' in CI workflow to ensure compatibility
      
      * Refactor flash attention import handling in examples
      
      * Removed availability checks for 'flash_attn' in multiple example scripts.
      * Simplified import statements for 'flash_attn' to ensure consistent usage across examples.
      
      ---------
      Co-authored-by: Huanqi Cao <caohuanqi@deepseek.com>
      407117e1
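The subset check that this commit attributes to `ProveFragmentContains` can be illustrated with a brute-force toy. This is only a sketch under stated assumptions: the real function works on layout expressions, whereas here each fragment is modeled as a plain Python function from an element index to the set of threads holding it, and the names `fragment_contains` and `small_to_large` are hypothetical.

```python
from typing import Callable, Set

# Hypothetical model: element index -> set of threads that hold that element.
ThreadMap = Callable[[int], Set[int]]


def fragment_contains(small_threads: ThreadMap,
                      large_threads: ThreadMap,
                      small_size: int,
                      small_to_large: Callable[[int], int]) -> bool:
    """Check that every thread touching a small-fragment element also holds
    the corresponding element of the large fragment."""
    for i in range(small_size):
        j = small_to_large(i)  # logical index of element i inside the large fragment
        if not small_threads(i) <= large_threads(j):
            return False
    return True


# Large fragment: element j lives on thread j % 4.
def large(j: int) -> Set[int]:
    return {j % 4}


# Small fragment: 4 elements, element i lives on thread i.
def small(i: int) -> Set[int]:
    return {i}


aligned = fragment_contains(small, large, 4, lambda i: i + 4)
shifted = fragment_contains(small, large, 4, lambda i: i + 1)
```

With the offset of 4 the thread mappings line up, so the update is valid; with the offset of 1 element `i` would need data held by another thread, so the check fails.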
    • [CI] Remove Flash Attention dependency (#705) · 87aae294
      Lei Wang authored
      * Update flash-attn version in requirements-test.txt from <=2.2.0 to ==2.5.8
      
      * lint fix
      
      * Remove unused dependencies from requirements-test.txt
      
      * Update import path for padding functions in example MHA forward variable length script
      
      * Refactor code formatting in bert_padding.py for improved readability
      87aae294
    • Trivial update to calculate target arch (#702) · da74c09d
      Yichen Yan authored
      
      
      * Trivial update to calculate target arch
      
      * Update tilelang/contrib/nvrtc.py
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      * fmt
      
      ---------
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      da74c09d
  13. 07 Aug, 2025 2 commits
  14. 06 Aug, 2025 2 commits
    • [Example] Optimize warp specialize flashmla example (#698) · a1149cab
      Lei Wang authored
      * [Enhancement] Disable cache and append git commit ID to version in tilelang (#688)
      
      * Disabled caching in quickstart example for improved performance.
      * Added a function to retrieve the current git commit ID and appended it to the version string if not already present, enhancing version tracking and debugging capabilities.
      
      * revert quickstart
      
      * optimize code.
      a1149cab
    • [Version] Keep local commit id as it somehow helps with debugging (#697) · ed1b96d5
      Lei Wang authored
      * [Enhancement] Disable cache and append git commit ID to version in tilelang (#688)
      
      * Disabled caching in quickstart example for improved performance.
      * Added a function to retrieve the current git commit ID and appended it to the version string if not already present, enhancing version tracking and debugging capabilities.
      
      * revert quickstart
      ed1b96d5
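Appending the git commit ID to a version string, as these two commits describe, could look roughly like the sketch below. The helper names `get_git_commit_id` and `append_commit_id`, the `git` prefix, and the 8-character truncation are assumptions made for illustration and are not taken from the tilelang source.

```python
import subprocess
from typing import Optional


def get_git_commit_id(cwd: Optional[str] = None) -> Optional[str]:
    """Return the current commit hash, or None outside a git checkout."""
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], cwd=cwd, stderr=subprocess.DEVNULL)
    except (subprocess.CalledProcessError, OSError):
        return None
    return out.decode("utf-8").strip() or None


def append_commit_id(version: str, commit_id: Optional[str], length: int = 8) -> str:
    """Append a short commit ID as a local version label unless already present."""
    if not commit_id:
        return version
    tag = "git" + commit_id[:length]
    if tag in version:
        return version  # already tagged, keep the string stable
    # PEP 440 allows one '+'; later local segments are joined with '.'.
    separator = "." if "+" in version else "+"
    return f"{version}{separator}{tag}"


tagged = append_commit_id("0.1.5", "ed1b96d5aaaabbbbccccdddd")
```

Checking for the tag before appending keeps the function idempotent, which matches the "if not already present" wording in the commit message.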
  15. 05 Aug, 2025 1 commit
    • [Smem Reuse] Optimize to do memory alignment on identical buffers. (#693) · 17fafc1b
      Lei Wang authored
      * [Enhancement] Refactor GEMM operations for improved warp partitioning and target instruction handling
      
      - Introduced a new `GetGemmInst` method to determine the appropriate GEMM instruction based on block size and target architecture.
      - Updated `ComputeWarpPartition` to accept the GEMM instruction type, enhancing flexibility in warp partitioning logic.
      - Added `TargetGetWarpSize` utility to streamline warp size retrieval based on target architecture.
      - Refactored layout inference and lowering methods to utilize the new GEMM instruction handling, improving clarity and maintainability of the codebase.
      
      * bug fix
      
      * test fix
      
      * lint fix
      
      * phase out Canonicalize
      
      * add option --expt-relaxed-constexpr
      
      * [Enhancement] Introduce tilelang intrinsic operations for GEMM
      
      - Added `tl_gemm` and `tl_gemm_sp` built-in operations to support general and sparse matrix multiplication in tilelang.
      - Updated the lowering logic in `Gemm` and `GemmSP` to utilize the new tilelang operations.
      - Enhanced CUDA and HIP code generation to handle the new GEMM operations, ensuring proper argument validation and external call printing.
      - Implemented shared memory alignment planning for GEMM operations to optimize performance on supported architectures.
      
      * lint fix
      
      * lint fix
      
      * test fix
      
      * test fix
      
      * rebase
      
      * Update builtin.cc
      17fafc1b
  16. 04 Aug, 2025 1 commit
  17. 03 Aug, 2025 3 commits
    • [Refactor] Introduce GemmInst for different targets handling (#688) · d2afb513
      Lei Wang authored
      * [Enhancement] Refactor GEMM operations for improved warp partitioning and target instruction handling
      
      - Introduced a new `GetGemmInst` method to determine the appropriate GEMM instruction based on block size and target architecture.
      - Updated `ComputeWarpPartition` to accept the GEMM instruction type, enhancing flexibility in warp partitioning logic.
      - Added `TargetGetWarpSize` utility to streamline warp size retrieval based on target architecture.
      - Refactored layout inference and lowering methods to utilize the new GEMM instruction handling, improving clarity and maintainability of the codebase.
      
      * bug fix
      
      * test fix
      
      * lint fix
      d2afb513
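The idea of choosing a GEMM instruction from the block size and target, together with a per-target warp size lookup, can be sketched as follows. This is an illustrative toy, not the logic of `GetGemmInst` or `TargetGetWarpSize`: the target strings, the `GemmInst` enum members, and the selection thresholds are assumptions. The only hardware facts relied on are that NVIDIA warps have 32 threads, AMD CDNA wavefronts have 64, and Hopper warpgroup instructions operate on groups of 4 warps with 64-row tiles.

```python
from enum import Enum


class GemmInst(Enum):
    MMA = "mma"      # warp-level tensor core instruction
    WGMMA = "wgmma"  # Hopper warpgroup instruction
    MFMA = "mfma"    # AMD CDNA matrix instruction


def target_warp_size(target_kind: str) -> int:
    """Threads per warp: 64 on CDNA wavefronts, 32 elsewhere."""
    return 64 if target_kind == "cdna" else 32


def select_gemm_inst(target_kind: str, block_m: int, block_threads: int) -> GemmInst:
    """Pick an instruction family from the target and block shape (assumed rules)."""
    num_warps = block_threads // target_warp_size(target_kind)
    if target_kind == "cdna":
        return GemmInst.MFMA
    # A warpgroup is 4 warps and works on 64-row tiles, so require both.
    if target_kind == "hopper" and block_m >= 64 and num_warps % 4 == 0:
        return GemmInst.WGMMA
    return GemmInst.MMA


choice = select_gemm_inst("hopper", block_m=128, block_threads=128)
```

Passing the selected instruction into the warp partitioning step, as the commit describes for `ComputeWarpPartition`, lets the partitioning respect constraints such as the 4-warp grouping without re-deriving the target.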
    • [Refactor] Rebase pipeline injector from upstream tvm (#687) · 73bf8346
      Lei Wang authored
      * [Enhancement] Introduce software pipeline rewriter and refactor buffer access handling
      
      - Added a new `PipelineOpaqueAccessRewriter` class to manage opaque buffer accesses in the software pipeline.
      - Refactored the `PipelineBodyRewriter` to utilize the new rewriter for improved buffer access handling.
      - Enhanced the `PipelineRewriter` to support additional fragment information and streamline pipeline construction.
      - Updated tests to reflect changes in buffer management and access patterns, ensuring compatibility with the new structure.
      - Removed obsolete code related to previous buffer access methods for clarity and maintainability.
      
      * test fix
      73bf8346
    • [Feature]:Add auto vectorize for atomic add (#686) · b45e9c45
      yyttt6 authored
      * [Feature]:Add auto vectorize for atomic add
      
      * fix
      
      * fix2
      
      * format
      b45e9c45
  18. 01 Aug, 2025 1 commit
  19. 31 Jul, 2025 2 commits