1. 26 Jun, 2025 1 commit
    • [Enhancement] Refine error messaging in LowerBulkCopy for global and shared range checks (#599) · a664c998
      Lei Wang authored
      * [Enhancement] Improve error messaging for global and shared range legality checks in LowerBulkCopy
      
      - Updated error messages in the LowerBulkCopy function to provide clearer context when global and shared ranges are illegal.
      - Enhanced the readability of the error output by including tensor names, improving debugging and validation processes during bulk copy operations.
      
      * [Enhancement] Refine error messaging in LowerBulkCopy for global and shared range checks
      
      - Improved the clarity of error messages in the LowerBulkCopy function by enhancing the output format.
      - Included additional context in error messages to aid debugging when global and shared ranges are found to be illegal, ensuring better traceability during bulk copy operations.
  2. 25 Jun, 2025 1 commit
    • [Example] Update examples to use @tilelang.jit (#597) · 3db18726
      Cunxiao Ni authored
      
      
      * [Example] Update kernel compilation in examples to use @tilelang.jit
      
      - Refactored multiple examples to eliminate the use of `tilelang.compile` for kernel creation, directly invoking the functions instead.
      - Added `@tilelang.jit` decorators with appropriate output indices to enhance performance and maintainability.
      - Improved code clarity by simplifying the kernel invocation process across various examples, ensuring consistency in how kernels are defined and executed.
      
      * format
      
      * Update example_tilelang_sparse_gqa_decode_varlen_indice.py
      
      * Update example_dequant_gemm_fine_grained.py
      
      * Update example_gemm_autotune.py
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
  3. 24 Jun, 2025 1 commit
    • [Enhancement] Add strict layout map for improved buffer layout inference (#594) · 18889821
      Lei Wang authored
      - Introduced a `strict_layout_map` to enhance layout inference by ensuring that buffers with strict layout requirements are properly accounted for during the inference process.
      - Updated the inference logic to check for the presence of buffers in the `strict_layout_map` before applying layout changes, improving the accuracy of layout assignments.
      - Refactored the layout inference steps to include the copying of layouts into the new strict map, ensuring a clear separation of layout handling based on inference levels.
  4. 23 Jun, 2025 2 commits
    • [Example] Add an easy version for online softmax (#593) · a6b52c52
      Jianqiao Lu authored
      
      
      * feat: add an easy version for online softmax
      
      * fix: set x & y to fragment memory to load data from global memory
      
      * feat: apply format check
      
      * Add License
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
    • [Enhancement] Add legality checks for shared memory and global range in LowerBulkCopy (#592) · 78651bae
      Lei Wang authored
      * [Enhancement] Improve memory access condition checks in GlobalMemChecker
      
      - Updated the condition checks in the GlobalMemChecker to utilize symbolic bounds in the CanProve method, enhancing the accuracy of memory access validations.
      - This change ensures that both upper and lower bound conditions are evaluated with improved proof strength, contributing to more robust memory access analysis.
      
      * lintfix
      
      * [Enhancement] Add legality checks for shared memory and global range in LowerBulkCopy
      
      - Implemented checks to ensure that the shared memory range and global range are legal during the bulk copy operation.
      - Added assertions to validate that the extents of global and shared ranges match, improving the robustness of memory access validation in the LowerBulkCopy function.
  5. 22 Jun, 2025 1 commit
    • [Enhancement] Improve memory access condition checks in GlobalMemChecker (#591) · 41ec2bc6
      Lei Wang authored
      * [Enhancement] Improve memory access condition checks in GlobalMemChecker
      
      - Updated the condition checks in the GlobalMemChecker to utilize symbolic bounds in the CanProve method, enhancing the accuracy of memory access validations.
      - This change ensures that both upper and lower bound conditions are evaluated with improved proof strength, contributing to more robust memory access analysis.
      
      * lintfix
  6. 21 Jun, 2025 2 commits
    • [Refactor] Improve tensor shape compatibility checks in AutoTuner (#590) · 804735bf
      Lei Wang authored
      - Simplified the shape comparison logic in the AutoTuner class to enhance readability and maintainability.
      - Ensured that the shape compatibility checks are more concise while preserving functionality, contributing to overall code clarity.
    • [Bugfix] Fix input tensor compatibility checks in AutoTuner (#588) · cce6aed8
      Lei Wang authored
      
      
      * [Refactor] Remove cache existence check in kernel saving logic
      
      - Eliminated redundant checks for existing cache paths in `AutotuneResult` and `AutoTunerCache` classes, simplifying the kernel saving process.
      - Ensured that the cache directory is always created before saving kernel source code, improving reliability in kernel storage.
      
      * [Enhancement] Improve input tensor compatibility checks in AutoTuner
      
      - Enhanced the input tensor caching logic in the AutoTuner class to ensure compatibility between cached tensors and newly generated tensors during configuration trials.
      - Added detailed logging to warn users about potential mismatches in tensor properties, including shape and dtype, when caching is enabled.
      - Implemented a mechanism to regenerate input tensors if compatibility issues are detected, improving the robustness of the autotuning process.
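The compatibility check and regeneration step described above can be sketched roughly as follows. This is only an illustration, not the AutoTuner's code: `TensorSpec`, `is_compatible`, and `select_inputs` are hypothetical names, and a plain dataclass stands in for a real tensor so the sketch has no dependencies.

```python
import logging
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

logger = logging.getLogger("autotuner_sketch")


@dataclass(frozen=True)
class TensorSpec:
    """Hypothetical stand-in for a tensor: only shape and dtype matter here."""
    shape: Tuple[int, ...]
    dtype: str


def is_compatible(cached: List[TensorSpec], fresh: List[TensorSpec]) -> bool:
    """Cached inputs are reusable only if count, shape, and dtype all match."""
    if len(cached) != len(fresh):
        return False
    return all(c.shape == f.shape and c.dtype == f.dtype
               for c, f in zip(cached, fresh))


def select_inputs(cached: Optional[List[TensorSpec]],
                  generate: Callable[[], List[TensorSpec]]) -> List[TensorSpec]:
    """Reuse cached inputs when compatible, otherwise warn and regenerate."""
    fresh = generate()
    if cached is None:
        return fresh
    if is_compatible(cached, fresh):
        return cached
    logger.warning("Cached inputs do not match this configuration "
                   "(shape or dtype differs); regenerating inputs.")
    return fresh
```

Generating the fresh inputs first keeps the comparison simple; a real implementation might compare against expected metadata instead to avoid the extra allocation.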
      
      * [Refactor] Update L2 persistent map initialization in CUDA wrapper
      
      - Adjusted the L2 persistent map initialization function to use a consistent size parameter for cache limits and byte counts, improving clarity and reducing potential errors in memory management.
      - Simplified the formatting of the initialization function to enhance readability and maintainability of the code.
      
      * Update tilelang/autotuner/__init__.py
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      ---------
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
  7. 20 Jun, 2025 3 commits
    • [Bugfix] fix missing node in ws role maker (#587) · c15d35e4
      botbw authored
    • [Bugfix] Fix device type validation for input tensors (#586) · 7b474fbe
      Lei Wang authored
      * Enhancement: Update `pythonic_expr` to accept `tvm.tir.PrimExpr` and improve type handling
      
      - Modified the `pythonic_expr` function to check for `tvm.tir.PrimExpr` type, ensuring proper handling of expressions.
      - Refactored device and dtype checks in `CythonKernelWrapper` for better clarity and error messaging, enhancing robustness in tensor validation.
      
      * Enhancement: Refine `pythonic_expr` function to support additional expression types
      
      - Updated the `pythonic_expr` function to accept `tvm.tir.PrimExpr` and handle both integer and float immediate types, improving expression representation and type handling.
    • [Enhancement] align shared memory allocations (#583) · fecc8336
      Lei Wang authored
      * [Enhancement] Update `pythonic_expr` to format type casts and improve tensor validation in Cython wrapper
      
      - Enhanced `pythonic_expr` to represent type casts as `(type)value` for better clarity in expression representation.
      - Modified tensor validation in `CythonKernelWrapper` to conditionally check for tensor contiguity based on a new `skip_tensor_validation` parameter.
      - Improved type mapping in `map_torch_type` to include version checks for new float8 types, ensuring compatibility with specific PyTorch versions.
      
      * [Feature] Implement dynamic shared memory allocation alignment
      
      - Added a new transformation pass `AlignDynamicSharedMemoryAllocations` to align dynamic shared memory allocations to specified byte boundaries, enhancing memory access efficiency.
      - Introduced a new utility class `TileLangAlignDynamicSharedMemoryAllocations` to handle the alignment logic for both allocation and buffer operations.
      - Updated the `LowerAndLegalize` function to apply the alignment transformation based on the target device's capabilities, ensuring compatibility with different architectures.
      
      * [Enhancement] Update dtype and argument defaults in GEMM autotuning example
      
      - Changed data type from `float16` to `bfloat16` (`bfloat16` offers a wider dynamic range than `float16`, at lower mantissa precision).
      - Updated the default value of the `--with_roller` argument from `True` to `False` to modify the behavior of the autotuning process.
      
      * [Enhancement] Improve thread range computation in storage access
      
      - Added a new method `ComputeThreadRange` to calculate the range of threads for better access tracking.
      - Updated `AccessEntry` structure to include `thread_range`.
      - Modified various visitor methods to utilize `IRVisitorWithAnalyzer` for improved analysis during expression and statement visits.
      - Ensured thread range is computed and stored during buffer load and store operations, enhancing memory access efficiency.
      
      * [Refactor] Update comments for clarity in dynamic shared memory allocation alignment
      
      - Translated comments in `align_dynamic_shared_memory_allocations.cc` from Chinese to English for better understanding.
      - Removed an unnecessary call to `IRVisitorWithAnalyzer::VisitStmt_` in `storage_access.cc`.
      - Added a blank line for improved readability in `thread_storage_sync.cc`.
      
      * [Refactor] Enhance storage access analysis and thread range computation
      
      - Introduced `ExtractRealCondition` to improve condition handling in `IfThenElseNode` visits.
      - Updated `ComputeThreadRange` to use `Var` instead of `IterVar` for thread range mapping, enhancing clarity and consistency.
      - Wrapped statement visits in `With<arith::ConstraintContext>` to ensure proper analysis context during condition evaluations.
      
      * [Enhancement] Update default matrix dimensions in GEMM autotune example
      
      - Changed default values for matrix dimensions M, N, and K from 16384 to 4096 in `example_gemm_autotune.py` to facilitate quicker testing and benchmarking.
      
      * typo fix
      
      * enhancement
      
      * [Fix] Add conflict detection for buffer index size mismatch in thread storage sync
      
      - Implemented a check to return true if the sizes of previous and current buffer indices do not match, indicating a conflict.
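The byte-boundary alignment performed by the `AlignDynamicSharedMemoryAllocations` pass in this commit comes down to rounding each allocation offset up to a multiple of the alignment. The sketch below shows only that arithmetic. The function names and the 16-byte default are assumptions for illustration; the actual pass rewrites TIR allocations rather than Python lists.

```python
from typing import List, Sequence, Tuple


def align_up(offset: int, alignment: int) -> int:
    """Round `offset` up to the next multiple of a power-of-two `alignment`."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (offset + alignment - 1) & ~(alignment - 1)


def assign_offsets(sizes: Sequence[int], alignment: int = 16) -> Tuple[List[int], int]:
    """Place buffers back to back, starting each one on an aligned offset.

    Returns the per-buffer offsets and the total (aligned) size in bytes.
    """
    offsets = []
    cursor = 0
    for size in sizes:
        cursor = align_up(cursor, alignment)
        offsets.append(cursor)
        cursor += size
    return offsets, align_up(cursor, alignment)
```

For example, buffers of 100, 36, and 8 bytes with 16-byte alignment start at offsets 0, 112, and 160, for a total of 176 bytes.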
  8. 19 Jun, 2025 1 commit
    • [Bugfix] Fix autotuning params (#585) · f4bb9f6c
      Lei Wang authored
      * [Enhancement] Update AutoTuner and Profiler for improved kernel handling and output validation
      
      - Modified AutoTuner to store cache in a dedicated "autotuner" directory.
      - Enhanced kernel source code saving logic in AutotuneResult and AutoTunerCache to check for None before writing.
      - Updated Profiler to handle None outputs gracefully during tensor comparisons, improving robustness in output validation.
      
      * lint fix
      
      * [Enhancement] Improve error handling and documentation in AutoTuner
      
      - Added traceback logging for exceptions during configuration testing to enhance debugging.
      - Expanded the AutoTuner class docstring to include detailed descriptions of new parameters for input tensor generation and validation, improving clarity for users.
  9. 18 Jun, 2025 1 commit
    • [Enhancement] Update warp specialization checking (#580) · 6cede73d
      Lei Wang authored
      * Fix L2 cache size calculation to handle symbolic expressions and ensure float conversion of hit ratios in annotation
      
      * [Enhancement] Update warp specialization check in phase.py
      
      * lint fix
      
      * [Enhancement] Add ContainsSeqStmt method to improve statement handling in merge_shared_memory_allocations.cc
      
      * [Refactor] Simplify memory copy operations in GEMM kernel tests
      
      - Updated memory copy operations in `test_tilelang_kernel_gemm.py` to use shared memory allocations for both A and B matrices, improving clarity and performance.
      - Adjusted the main execution block to include a new `run_gemm_rs` function call for testing, enhancing the test structure.
      
      * revert memory reuse pass.
      
      * revert the memory reuse and thread sync passes
      
      * Update test_tilelang_kernel_gemm.py
      
      * Update test_tilelang_kernel_mha_bwd.py
  10. 17 Jun, 2025 2 commits
    • [Enhancement] Update dtype handling in KernelParam and CythonKernelWrapper (#582) · 44508e59
      Lei Wang authored
      - Modified `KernelParam.from_var` to map Torch data types to a more appropriate format.
      - Enhanced `CythonKernelWrapper` to support additional tensor types and ensure proper conversion of tensor dtypes to C types, improving error handling for unsupported types.
    • [Enhancement] Update `pythonic_expr` to format type casts and improve tensor validation in Cython wrapper (#581) · 05fc9cd5
      Lei Wang authored
      
      - Enhanced `pythonic_expr` to represent type casts as `(type)value` for better clarity in expression representation.
      - Modified tensor validation in `CythonKernelWrapper` to conditionally check for tensor contiguity based on a new `skip_tensor_validation` parameter.
      - Improved type mapping in `map_torch_type` to include version checks for new float8 types, ensuring compatibility with specific PyTorch versions.
  11. 16 Jun, 2025 6 commits
    • [Enhancement] Introduce wrapper util `pythonic_expr` to transform a PrimExpr into python string (#577) · 916ee60e
      Lei Wang authored
      
      * [Feature] Add Quarter Bank Swizzle Layout and Update GEMM Layout Logic
      
      - Introduced a new `makeQuarterBankSwizzleLayout` function for layout swizzling of 32 bytes.
      - Updated `makeGemmABLayout` to include an `enable_padding` parameter, allowing for conditional layout selection between padded and quarter bank swizzle layouts.
      - Adjusted layout inference in GEMM operations to utilize the new quarter bank swizzle layout when appropriate.
      - Enhanced bulk copy operations to recognize and handle the new layout type, improving memory access patterns.
      
      * lint fix
      
      * lint fix
      
      * rebase
      
      * rebase
      
      * typo
      
      * requirement fix
      
      * revert flash-attn requirements
    • [BugFix] Fix precision issue in GQA decode when block_N exceeds seqlen/num_split (#575) · 67d0b677
      徐畅 authored
      * [CI] Add flash_decoding example to CI
      
      * Add output of ref latency
      
      * format example_gqa_decode.py
      
      * [BugFix] Fix precision issue in GQA decode when block_N exceeds seqlen/num_split
      
      * format example_gqa_decode.py
    • Fix L2 cache size calculation to handle symbolic expressions and ensure float conversion of hit ratios in annotation (#576) · 837b6398
      Lei Wang authored
      
    • [BugFix] Fix import error in nsa examples when `fla.__version__ >=0.2.1` (#579) · 18ab72c9
      Tong WU authored
      * Update FLA import path for `prepare_token_indices`
      
      * Update FLA import path for `prepare_token_indices`
      
      * Compare versions via packaging.version.parse
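The last item matters because plain string comparison orders versions lexicographically, which gives wrong answers once a component reaches two digits. A minimal sketch of the parsed comparison is shown below. The helper name `is_at_least` is hypothetical, and the fallback parser exists only so the sketch runs where `packaging` is not installed.

```python
try:
    # packaging.version.parse understands PEP 440 versions.
    from packaging.version import parse as parse_version
except ImportError:
    # Fallback for illustration only: handles plain dotted integers.
    def parse_version(text):
        return tuple(int(part) for part in text.split("."))


def is_at_least(installed: str, threshold: str) -> bool:
    """Return True when `installed` is the same as or newer than `threshold`."""
    return parse_version(installed) >= parse_version(threshold)


# String comparison gets this wrong: "0.10.0" sorts before "0.2.1".
lexicographic = "0.10.0" >= "0.2.1"
parsed = is_at_least("0.10.0", "0.2.1")
```

Here `lexicographic` is `False` while `parsed` is `True`, which is why a version gate such as `fla.__version__ >= 0.2.1` should go through a version parser.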
    • [CI] Modify test requirements and CI workflow (#578) · 4c24a69e
      Yu Cheng authored
      * [CI] Modify test requirements and CI workflow
      
      - Replaced `flash-attn` with `packaging` and `wheel` in `requirements-test.txt` to ensure proper package management.
      - Updated the CI workflow to install `flash-attn` without build isolation, improving the installation process.
      
      * [CI] remove redundant packages
      
      * [CI] Update test requirements and CI workflow
      
      - Added `flash-attn` to `requirements-test.txt` to ensure it is included in the testing environment.
      - Modified the CI workflow to install packages from `requirements-test.txt` with `PIP_NO_BUILD_ISOLATION=1`, improving the installation process.
    • [Refactor] Phaseout tf32 Casting from GEMM Templates (#573) · 9ba8b480
      Lei Wang authored
      * [Feature] Add Quarter Bank Swizzle Layout and Update GEMM Layout Logic
      
      - Introduced a new `makeQuarterBankSwizzleLayout` function for layout swizzling of 32 bytes.
      - Updated `makeGemmABLayout` to include an `enable_padding` parameter, allowing for conditional layout selection between padded and quarter bank swizzle layouts.
      - Adjusted layout inference in GEMM operations to utilize the new quarter bank swizzle layout when appropriate.
      - Enhanced bulk copy operations to recognize and handle the new layout type, improving memory access patterns.
      
      * lint fix
      
      * [Refactor] Update GEMM Layout Functions and Inference Logic
      
      - Removed the `enable_padding` parameter from `makeGemmABLayout` to simplify its signature.
      - Introduced `makeGemmABLayoutHopper` for enhanced layout handling specific to Hopper architecture.
      - Updated layout inference in GEMM operations to utilize the new `makeGemmABLayoutHopper` function, improving clarity and maintainability in layout selection.
      - Adjusted related layout functions to ensure consistent behavior across different architectures.
      
      * [Refactor] Remove tf32 Casting Logic from GEMM Templates
      
      - Eliminated the `cast_float_to_tf32` function from `gemm_sm80`, `gemm_sm89`, and `gemm_sm90` templates to streamline the code.
      - Removed conditional casting logic for float32 to tfloat32 conversion, enhancing clarity and maintainability.
      - Updated relevant sections in GEMM operations to reflect the removal of casting, ensuring consistent behavior across templates.
      - Adjusted tensor view handling to improve performance and accuracy in matrix operations.
      
      * Update bulk_copy.cc
      
      * Fix profiler initialization in GEMM test by removing TensorSupplyType argument for improved flexibility.
  12. 13 Jun, 2025 4 commits
  13. 11 Jun, 2025 4 commits
    • [Feature] Implement Swizzle 32B (#566) · ae9668a8
      Lei Wang authored
      * [Feature] Add Quarter Bank Swizzle Layout and Update GEMM Layout Logic
      
      - Introduced a new `makeQuarterBankSwizzleLayout` function for layout swizzling of 32 bytes.
      - Updated `makeGemmABLayout` to include an `enable_padding` parameter, allowing for conditional layout selection between padded and quarter bank swizzle layouts.
      - Adjusted layout inference in GEMM operations to utilize the new quarter bank swizzle layout when appropriate.
      - Enhanced bulk copy operations to recognize and handle the new layout type, improving memory access patterns.
      
      * lint fix
      
      * [Refactor] Update GEMM Layout Functions and Inference Logic
      
      - Removed the `enable_padding` parameter from `makeGemmABLayout` to simplify its signature.
      - Introduced `makeGemmABLayoutHopper` for enhanced layout handling specific to Hopper architecture.
      - Updated layout inference in GEMM operations to utilize the new `makeGemmABLayoutHopper` function, improving clarity and maintainability in layout selection.
      - Adjusted related layout functions to ensure consistent behavior across different architectures.
      
      * Update bulk_copy.cc
      
      * Update __init__.py
    • [Bugfix] Add `__tune_params` into key hash for autotuning (#565) · ae386a7b
      Lei Wang authored
      * [Enhancement] Update AutoTuner and Profiler for improved kernel handling and output validation
      
      - Modified AutoTuner to store cache in a dedicated "autotuner" directory.
      - Enhanced kernel source code saving logic in AutotuneResult and AutoTunerCache to check for None before writing.
      - Updated Profiler to handle None outputs gracefully during tensor comparisons, improving robustness in output validation.
      
      * lint fix
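The commit title refers to including the tuning parameters in the cache key. The general idea can be sketched as below, assuming a JSON-serialised payload hashed with SHA-256. The function name, the payload layout, and the choice of hash are all illustrative assumptions rather than the AutoTuner's actual scheme.

```python
import hashlib
import json
from typing import Any, Dict


def cache_key(func_source: str, tune_params: Dict[str, Any]) -> str:
    """Build a cache key that changes whenever the tuning parameters change."""
    payload = json.dumps(
        {"source": func_source, "tune_params": tune_params},
        sort_keys=True,   # make the key independent of dict insertion order
        default=str,      # fall back to str() for non-JSON values
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Without the parameters in the key, two tuning runs of the same kernel source with different parameters would map to the same cache entry.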
    • [Refactor] Improve dtype handling in KernelParam class (#564) · 59172ff6
      Yu Cheng authored
      - Updated the dtype handling logic in the KernelParam class to enhance clarity and maintainability. The dtype string is now modified only if it starts with "torch.", simplifying the return statement for boolean type checks.
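A minimal sketch of the prefix handling described above, using hypothetical helper names (`normalize_dtype`, `is_boolean`) rather than the actual `KernelParam` methods:

```python
def normalize_dtype(dtype) -> str:
    """Strip a leading "torch." from a dtype's string form, if present."""
    name = str(dtype)
    if name.startswith("torch."):
        name = name[len("torch."):]
    return name


def is_boolean(dtype) -> bool:
    """With the prefix normalised, the boolean check is a single comparison."""
    return normalize_dtype(dtype) == "bool"
```

Strings without the prefix, such as `"int32"`, pass through unchanged.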
    • [Feature] Introduce Persistent Loop and Update GEMM Example (#563) · e7b97be2
      Yu Cheng authored
      * [Feature] Added Support for Synchronizing Grids and Persistent Threadblock Transformation
      
      - Defined the sync_grid operation in builtin.cc and builtin.h, allowing synchronization of all threads within a grid.
      - Implemented support for sync_grid in codegen_cuda.cc, ensuring proper handling of this operation in the generated CUDA code.
      - Added the PersistThreadblock transformation, enabling the conversion of thread blocks to persistent thread blocks, enhancing support for persistent kernels.
      - Updated relevant documentation and comments to reflect the addition of new features and usage instructions.
      
      * [Example] Add MLA Decode With Persistent Threadblock Example
      
      * [Feature] Introduce Persistent Loop and Update GEMM Example
      
      - Added a new persistent loop construct in the TIR framework, enabling more efficient kernel execution.
      - Updated the GEMM example to utilize the new persistent primitive, enhancing performance for matrix multiplication.
      - Introduced a `loop_break` intrinsic for better control flow within persistent loops.
      - Updated relevant files to support the new features, including changes in code generation and language interface.
      
      * lint fix
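To illustrate the persistent loop idea from this commit: a persistent kernel launches a fixed number of blocks, and each block loops over several output tiles instead of handling exactly one. The plain-Python simulation below assumes a simple strided tile assignment, which is one common scheme; the order used by the actual persistent primitive may differ, and `tiles_for_block` is a hypothetical name.

```python
from typing import List


def tiles_for_block(block_id: int, num_blocks: int, num_tiles: int) -> List[int]:
    """Return the tile indices one persistent block would process."""
    processed = []
    tile = block_id
    while True:
        if tile >= num_tiles:
            break  # early exit, similar in spirit to a `loop_break`
        processed.append(tile)
        tile += num_blocks  # stride by the number of resident blocks
    return processed
```

With 4 blocks and 10 tiles, block 0 handles tiles 0, 4, and 8, while block 3 handles tiles 3 and 7, so every tile is covered exactly once.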
  14. 09 Jun, 2025 1 commit
    • [Enhancement] Optimize debug info for auto tuning (#560) · e5e36dbf
      Lei Wang authored
      
      
      * [Enhancement] Update AutoTuner and JIT compilation arguments
      
      * Added functionality to return compile arguments in the JIT implementation, enhancing the autotuner's caching capabilities.
      * Modified `CompileArgs` and `AutotuneResult` classes to support optional `out_idx` parameter, improving flexibility in compile argument handling.
      * Refactored the `_AutoTunerImplementation` to utilize the new compile arguments, ensuring better integration and performance during tuning processes.
      
      * Update tilelang/autotuner/param.py
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      * remove redundant comments
      
      * Refactor kernel source retrieval and logging levels
      
      * Updated `AutotuneResult` to use `kernel.get_kernel_source()` instead of `kernel.adapter.get_kernel_source()`.
      * Changed logging level in `KernelCache` from `ERROR` to `DEBUG` for improved verbosity during kernel caching operations.
      * Removed unnecessary verbose logging in JIT compilation process to streamline output.
      
      * Merge branch 'main' of https://github.com/tile-ai/tilelang into bugfix_autotune_0604
      
      * lint fix
      
      ---------
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
  15. 07 Jun, 2025 2 commits
    • [Feature] Support persistent kernels and add persistent GEMM examples (#559) · 225aca61
      Yu Cheng authored
      * [Enhancement] Fix multi-version buffer index in nested-loop
      
      * [Feature] Support persistent kernels and add persistent GEMM example
      
      * lint fix
      
      * lint fix
      
      * [CI] Remove test_tilelang_transform_annotate_device_regions.py
    • [Bugfix] Add tf32 casting to GEMM templates (#556) · 8cc8db52
      Lei Wang authored
      * Add tf32 casting functionality to GEMM templates
      
      - Introduced a `cast_float_to_tf32` function to convert float32 values to tfloat32 format across gemm_sm80, gemm_sm89, and gemm_sm90 templates.
      - Implemented conditional casting in relevant sections of the GEMM operations to ensure compatibility with tfloat32 types.
      - Enhanced the handling of tensor views to support the new casting logic, improving performance and accuracy in matrix operations.
      
      * lint fix
      
      * Refactor tfloat32 casting logic in GEMM templates
      
      - Replaced the `is_tfloat32` boolean with `need_tfloat32_cast` to improve clarity and accuracy in determining when to cast float32 to tfloat32.
      - Updated relevant sections in `gemm_sm80`, `gemm_sm89`, and `gemm_sm90` to utilize the new casting logic, enhancing compatibility with tfloat32 types.
      - Ensured consistent application of casting across tensor views, improving performance and correctness in matrix operations.
      
      * Refactor GEMM template functions for improved readability
      
      - Simplified the function signature of `body_rs` in both `gemm_sm80` and `gemm_sm90` templates for better clarity.
      - Adjusted the casting logic in `gemm_sm90` to ensure consistent application of `cast_float_to_tf32` across tensor views, enhancing performance and maintainability.
      
      * Enhance tf32 casting logic in GEMM templates
      
      - Updated the `cast_float_to_tf32` function in `gemm_sm80`, `gemm_sm89`, and `gemm_sm90` to conditionally apply the casting only if the input is finite, improving robustness.
      - Simplified the `need_tfloat32_cast` logic to clarify the conditions under which tfloat32 casting is required, enhancing code readability and maintainability.
      
      * Refactor GEMM template functions and layout inference logic
      
      - Removed the `cast_float_to_tf32` function from `gemm_sm90` and updated the `body_sr` function to streamline the casting process for tensor views, enhancing code clarity and maintainability.
      - Improved layout inference in `layout_inference.cc` by adding checks for the layout map's definition, ensuring robustness in handling layout annotations.
      - Simplified the handling of layout maps in the `annotate_layout` function, allowing for more flexible layout definitions and error handling.
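The finite-only tf32 cast mentioned in this commit can be illustrated at the bit level. TF32 keeps float32's sign and 8-bit exponent but only 10 mantissa bits, so the low 13 mantissa bits are dropped. The sketch below rounds by adding half a unit in the last kept place before clearing those bits, and leaves infinities and NaNs untouched. The rounding scheme and the name `round_to_tf32` are assumptions for illustration; the templates' `cast_float_to_tf32` may differ in detail.

```python
import math
import struct


def round_to_tf32(value: float) -> float:
    """Round a float32-representable value to TF32 precision (10 mantissa bits)."""
    if not math.isfinite(value):
        return value  # leave inf and NaN untouched
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    bits += 0x1000          # half of the smallest kept mantissa step (1 << 12)
    bits &= 0xFFFFE000      # clear the 13 low mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

Skipping the adjustment for non-finite inputs avoids turning a NaN or infinity into a different bit pattern when the rounding increment is added.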
  16. 06 Jun, 2025 1 commit
  17. 05 Jun, 2025 3 commits
    • [Enhancement] Add nvrtc execution backend (#461) · 17f7394f
      Gabriel Wu authored
      
      
      * [wip] feat: add nvrtc backend
      
      * [wip] fix: handle out_idx
      
      * [wip] refactor: move lib logic to libgen
      
      * feat: cache for nvrtc backend
      
      * fmt: run format
      
      * fix: handle cuda bindings import error
      
      * fix: handle cuda bindings import error
      
      * fix: handle cuda bindings import error
      
      * fix: handle cuda bindings import error
      
      * fix: get kernel source
      
      * refactor: speedup pyimport
      
      * Improve error handling for missing cuda-python dependency in nvrtc backend. Raise ImportError with detailed installation instructions instead of logging a warning.
      
      * Enhance nvrtc backend error handling by introducing a flag to check for cuda-python availability. Raise ImportError with detailed installation instructions during initialization if the nvrtc backend is unavailable, improving user experience and clarity.
      
      * Update README.md to include recent NVRTC Backend addition, highlighting reduced compilation time for CUDA templates.
      
      * fix tl_templates
      
      * ensure CUDA context
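The optional-dependency handling described in this commit (probe for the bindings, then raise `ImportError` with installation instructions when the backend is initialised) follows a common pattern. A generic sketch is shown below. `is_importable` and `OptionalBackend` are hypothetical names, and the module name and hint are passed in rather than hard-coded, since the exact import path and message used by the nvrtc backend are not shown in the log.

```python
import importlib


def is_importable(module_name: str) -> bool:
    """Probe for an optional dependency without failing at import time."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


class OptionalBackend:
    """Backend that fails loudly at construction if its dependency is missing."""

    def __init__(self, module_name: str, install_hint: str) -> None:
        if not is_importable(module_name):
            raise ImportError(
                f"This backend requires the '{module_name}' module, "
                f"which could not be imported. {install_hint}"
            )
        self.module_name = module_name
```

Raising at initialisation instead of logging a warning at import time means users who never select the backend are unaffected, while those who do get an actionable message.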
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    • [CI] Add FusedMoE example (#555) · 88c622c9
      Zhengju Tang authored
      
      
      * [CI] Add FusedMoE example
      
      * Lint
      
      * Fix import bug
      
      * Fix comment bug
      
      * Update example_fusedmoe_torch.py
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
    • [Release] Bump Version to 0.1.5 (#551) · cc5e9f7c
      Lei Wang authored
      * Update VERSION to 0.1.5
      
      * Add DEBUG_MODE support in setup.py and update CMake build type; enhance pypi.Dockerfile with git installation
  18. 04 Jun, 2025 4 commits
    • [CI]Add norm and layout_plot (#534) · c9e503be
      alex_xiao authored
      
      
      * [CI]Add norm and layout_plot
      
      * fix lint
      
      * Remove obsolete test files for RMS normalization and plot layout, streamlining the testing suite.
      
      * Add make_mma_load_base_layout function to create MMA result layouts
      
      - Introduced a new function `make_mma_load_base_layout` for generating layout functions for storing MMA results in fragment buffers.
      - Added detailed docstring explaining parameters, return values, and potential exceptions.
      - Implemented logic for handling different data types and matrix configurations, including assertions for input validation.
      - Defined internal functions for mapping fragment indices to threads and local indices, enhancing the layout functionality.
      
      * Enhance MMA load test with additional imports and functionality
      
      - Added imports for `tilelang.language`, `Literal`, `Callable`, `DataType`, `IndexMap`, and `get_mma_micro_size` to support extended functionality.
      - Improved the `make_mma_load_base_layout` function by ensuring it can handle various data types and configurations.
      - Updated the test function `test_mma_load_base_layout` to validate the layout for float16 matrix A.
      
      * Fix formatting in test_fragment_mma_load_a.py by adding a blank line for improved readability.
      
      * Add RMS normalization functions to test_rms_norm.py
      
      - Introduced `rms_norm` and `rms_norm_splitk` functions for RMS normalization, enhancing the testing capabilities.
      - Implemented kernel functions with shared memory allocation and parallel processing for improved performance.
      - Updated the test function to validate the new RMS normalization implementations.
      
      * Add reference program for RMS normalization in test_rms_norm.py
      
      - Introduced `ref_program` function to provide a reference implementation for RMS normalization.
      - This addition enhances the testing framework by allowing comparisons against a known reference output.
      
      * Enhance RMS normalization tests with additional imports and formatting
      
      - Added import for `tilelang.language` to support extended functionality in `test_rms_norm.py`.
      - Improved code readability by adding blank lines for better separation of code sections.
      
      * Update RMS normalization test parameters and enhance layout plotting
      
      - Increased matrix dimensions in `test_rms_norm` to 8192 for improved performance testing.
      - Removed obsolete test functions in `test_fragment_mma_load_a.py` to streamline the test suite.
      - Enhanced layout plotting functionality by ensuring proper visualization of base, warp, and block layouts in `test_fragment_mma_load_a.py`.
      
      * Refactor RMS normalization test parameters and improve layout plotting readability
      
      - Simplified the parameters in `test_rms_norm` by removing `blk_k` for clarity.
      - Enhanced code readability in `test_fragment_mma_load_a.py` by adjusting the formatting of the `block_layout` definition and removing the unused `warp_cols` variable.
      
      * Enhance RMS normalization with split-k implementation and additional profiling
      
      - Added a new function `test_rms_norm_splitk` to test the split-k variant of RMS normalization.
      - Updated the main RMS normalization script to include profiling for the split-k implementation.
      - Ensured all checks pass with appropriate latency measurements for both reference and tile-lang implementations.
      
      * Remove obsolete test file `test_fragment_mma_load_a.py` to streamline the test suite.
      
      * Refactor `rms_norm.py` to streamline benchmarking output and remove redundant code. Comment out the `plot_layout` call in `fragment_mma_load_a.py` for clarity.
      
      * Refactor `test_rms_norm.py` by removing redundant test function `test_rms_norm_splitk` to streamline the test suite and improve clarity.
      
      ---------
      Co-authored-by: Your Name <you@example.com>
      c9e503be
    • Tong WU's avatar
      [CI] Add linear attention examples to CI (#552) · eec07578
      Tong WU authored
      * Add linear attention examples.
      
      * Add license
      
      * Remove comments
      
      * Run yapf and ruff
      eec07578
    • Lei Wang's avatar
      [Autotune] Remove the out_idx argument from the autotune cache (#553) · 5fbfb80b
      Lei Wang authored
      
      
      * [Enhancement] Update AutoTuner and JIT compilation arguments
      
      * Added functionality to return compile arguments in the JIT implementation, enhancing the autotuner's caching capabilities.
      * Modified `CompileArgs` and `AutotuneResult` classes to support optional `out_idx` parameter, improving flexibility in compile argument handling.
      * Refactored the `_AutoTunerImplementation` to utilize the new compile arguments, ensuring better integration and performance during tuning processes.
      
      * Update tilelang/autotuner/param.py
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      * remove redundant comments
      
      * Update tilelang/jit/__init__.py
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      ---------
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      5fbfb80b
    • Lei Wang's avatar
      [Bugfix] Enhance layout inference pass for flexibility (#550) · 444b7c4e
      Lei Wang authored
      * Enhance Layout
      
      * strict update
      
      * lint fix
      
      * Refactor layout inference by removing unnecessary logging statements in `parallel.cc` and `layout_inference.cc`. This cleanup enhances code readability and reduces log clutter during layout inference steps.
      
      * lint fix
      
      * Refactor file copying logic in setup.py to simplify directory creation and file copying process. Removed unnecessary existence check before copying source files to the target directory.
      444b7c4e