  1. 18 Dec, 2025 3 commits
    • feat(cutedsl): add CuTeDSL backend (#1421) · 7248a810
      Gabriel Wu authored
      
      
      * feat: CuTeDSL backend
      
      * fix: clang-tidy
      
      * fix: clang-format
      
      * fix: ci
      
      * fix: revert example gemm fp8
      
      * fix: remove duplicate code
      
      * fix: switch-case
      
      * fix: fp16 silence
      
      * fix: TVM IR print
      
      * fix: useless tir
      
      * fix: clang-format
      
      * fix: remove tilelang/contrib/cutedsl/.gitignore
      
      * fix: use hexfloat
      
      * fix: gsym guard
      
      * fix: unknown storage sync type
      
      * fix: string literal
      
      * fix: add args guard
      
      * fix: name hint dedup
      
      * fix: better find_kernel_by_pattern
      
      * fix: set libpath for from_database path
      
      * fix: guard buffer.strides
      
      * fix: from guard
      
      * fix: eviction guard
      
      * fix: use thread local tma descs
      
      * fix: ruff
      
      * fix: drop tma_init_cpp
      
      * fix: exc_info
      
      * fix: negative unmatch early return
      
      * fix: rename postproc func and add test
      
      * fix: handle fast math according to pass config
      
      * fix: dyn_sym parse
      
      * fix: wrap_forward
      
      * fix: use tvm_ffi.libinfo instead of cli
      
      * fix: keep signature
      
      * fix: C++ string safety
      
      * fix: mark tma_store_add as unsupported
      
      * fix: tvm version
      
      * resolve ldsm and cpasync issues.
      
      * fix: minor fixes
      
      * fix: parse signature using ast
      
      * fix: guard global_addr
      
      * fix: create tempfile only when necessary
      
      * fix: use logger.exception for exceptions
      
      * fix: guard lib_path and host_func
      
      * fix: remove tma_cpp_init and add timeout for cpp compile
      
      * add timeout for mbarrier_wait.
      
      * fix: _load_kernel_from_disk signature
      
      * resolve codegen issues.
      
      * fix: logger.exception
      
      * add comment for div_by=1
      
      * merge
      
      * fix: reserve cutlass,cute,tl
      
      * fix: guard tma_store
      
      * fix: allow int64 offset in make_tensor_at_offset
      
      * fix: guard barrier
      
      * fix: add comments for div_by=16
      
      * fix: div_by=1 issue
      
      * delete div_by when offset is 0
      
      * use tl.make_tensor when offset is 0
      
      * fix: explicitly check cutedsl target
      
      * fix: use param.torch_dtype()
      
      ---------
      Co-authored-by: yuxic <yuxic@nvidia.com>
      Co-authored-by: Yong <yong@local>
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    • remove unused duplicated type check (#1462) · a6f59f31
      Jinjie Liu authored
      
      Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>
    • [Language]Adds a random number generation capability through curand_kernel (#1461) · cae06edd
      silentCoder-dev authored
      
      
      * add curand.{curand_init, curand}
      
      * run format.sh
      
      * add default value for curand_init & add test for curand
      
      * Update testing/python/language/test_rand.py
      
      Remove unused thread binding
      Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
      
      * remove unused library
      
      * enable tilelang cache for testing
      
      * run format.sh
      
      * Revert "run format.sh"
      
      This reverts commit 5afaff782f31cdf653e2c45b469da8dead228b8a.
      
      * Revert "enable tilelang cache for testing"
      
      This reverts commit c277a43e77938bd88d47a108dd1bd65734d4a1ae.
      
      * Revert "remove unused library"
      
      This reverts commit 568ad20611f039380113937fd131151a2bffd801.
      
      * run format.sh
      
      * ensure FreshName for __philox_state
      
      * ensure FreshName for __philox_state
      
      * change the return type of T.rng_init
      
      ---------
      Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
  2. 17 Dec, 2025 2 commits
    • [Enhancement] Update examples and tests for improved type handling functionality (#1448) · c750fb8a
      Lei Wang authored
      * [Enhancement] Update examples and tests for improved type handling and functionality
      
      - Enhanced various example scripts to support new data types and improve compatibility with PyTorch.
      - Updated tests across multiple modules to ensure correct functionality with the latest changes in type handling.
      - Refactored code in examples to streamline operations and improve clarity, particularly in tensor operations and memory management.
      - Added comprehensive tests for new features and fixed existing issues related to type conversions and buffer handling.
      
      * [Refactor] Update accumulation data type to float32 across examples
      
      - Changed accumulation data type from "float" to T.float32 in multiple example scripts to ensure consistency and improve numerical stability.
      - This update affects various modules including flash attention, GEMM analysis, convolution, and deepseek MLA examples, enhancing type handling across the board.
      
      * [Refactor] Standardize data type usage across benchmark scripts
      
      - Updated data type definitions in benchmark scripts to use T.float16 and T.float32 consistently, enhancing clarity and type handling.
      - Adjusted dtype assignments in matmul functions and configuration setups to align with the new standard.
      - Improved overall code consistency and maintainability by ensuring uniform data type usage across various modules.
      
      * [Refactor] Standardize data type usage in templates and scripts
      
      - Updated data type definitions in various templates and scripts to use string representations (e.g., "float16", "int32") instead of T.float16 and T.int32 for improved consistency and clarity.
      - Enhanced overall code maintainability by ensuring uniform data type usage across multiple modules, including convolution, elementwise operations, and matrix multiplication templates.
      - This change aims to streamline type handling and improve compatibility with existing workflows.
      
      * [Refactor] Standardize data type usage in examples and benchmarks
      
      - Updated data type definitions in various example and benchmark scripts to use T.float16 and T.int32 consistently, enhancing clarity and maintainability.
      - Adjusted dtype assignments in kernel functions and configuration setups to align with the new standard.
      - Improved overall code consistency by ensuring uniform data type usage across multiple modules, including attention mechanisms, matrix multiplication, and GEMM examples.
      
      * [Refactor] Import dtypes from language.v2 module
      
      - Added import statement for dtypes from the language.v2 module to enhance type handling and maintain consistency across the codebase.
      - This change aims to streamline data type management and improve overall code clarity.
      
      * fix
      
      * [Refactor] Standardize data type usage across scripts
      
      - Updated data type definitions in various scripts to use string representations (e.g., "float16", "int8") instead of T.float16 and T.int8 for improved consistency and clarity.
      - Adjusted dtype assignments in functions and configuration setups to align with the new standard, enhancing overall code maintainability.
      - This change affects multiple modules, including benchmark and attention mechanisms, ensuring uniform data type usage throughout the codebase.
      
      * [Refactor] Update data type handling for consistency and clarity
      
      - Changed string representations of data types in the Hint class to use T.float32 and T.int32 for improved consistency.
      - Added new data types "int4" and "int16" to the dtypes module, enhancing type support across the codebase.
      - Updated function signatures and assertions in the lop3 and mxfp modules to utilize the new data types, ensuring uniformity in type handling.
      - This refactor aims to streamline data type management and improve overall code clarity and maintainability.
      
      * [Enhancement] Improve data type handling and error messaging
      
      - Introduced a mapping for canonical data types to their display strings, enhancing clarity in type representation.
      - Updated the dtype creation logic to utilize the new mapping, ensuring more intuitive handling of string inputs.
      - Refined error messages in the lop3 module to provide clearer feedback on invalid source formats, improving debugging and user experience.
      
      * [Fix] Correct boolean flag in GEMM SP test case
      
      - Updated the boolean flag in the test_gemm_sp_sm90 function to ensure proper functionality in the test case.
      - This change enhances the accuracy of the test and aligns it with expected behavior for the GEMM SP implementation.
      
      * [Refactor] Standardize data type usage across scripts
      
      - Updated data type definitions in various scripts to use T.float16 and T.bfloat16 consistently, enhancing clarity and maintainability.
      - Adjusted dtype assignments in function signatures and argument parsing to align with the new standard, ensuring uniform data type usage throughout the codebase.
      - This change affects multiple modules, including benchmarks and examples, improving overall code consistency and readability.
      
      * [Refactor] Standardize data type usage in various modules
      
      - Updated data type assignments in multiple scripts to utilize T.float32, T.int8, and T.int32 consistently, enhancing clarity and maintainability.
      - Adjusted function signatures and parameter types across benchmarks, examples, and tests to align with the new standard, ensuring uniform data type usage throughout the codebase.
      - This change improves overall code consistency and readability, impacting modules related to matrix multiplication, GEMM, and tensor operations.
      
      * [Refactor] Update argument parsing for data types in benchmarks
      
      - Changed argument parsing for data types in benchmark_matmul_intrinsic.py and benchmark_matmul_sp.py to use string representations ("float16", "int8", "float") instead of T.float16 and T.float.
      - This update enhances consistency in data type handling across benchmark scripts, improving clarity and maintainability.
      
      * [Refactor] Update data type handling in benchmark and example scripts
      
      - Changed data type arguments in benchmark and example scripts to use string representations ("float16") instead of T.float16 for improved consistency.
      - Updated function signatures and argument parsing to align with the new standard, enhancing clarity and maintainability across the codebase.
      - This change affects multiple modules related to attention mechanisms and tensor operations, ensuring uniform data type usage throughout the examples.
      
      * [Refactor] Fix data type conversion in multiple scripts
      
      - Corrected the usage of the data type conversion method from dtype..as_torch() to dtype.as_torch() across various benchmark and example scripts.
      - This change enhances consistency in data type handling and improves code readability, impacting modules related to attention mechanisms and tensor operations.
      
      * [Refactor] Update float8 data type usage across multiple scripts
      
      - Changed instances of T.float8_e4m3 to T.float8_e4m3fn in various benchmark, example, and test scripts to ensure consistency in data type handling.
      - This update enhances clarity and maintainability across the codebase, particularly in modules related to matrix multiplication and tensor operations.
      
      * [Refactor] Enhance float8 data type handling in CUDA code generation
      
      - Updated the handling of float8 data types in the CUDA code generation to include additional float8 variants, improving type conversion logic.
      - Adjusted conditions to ensure proper type checks for float8 conversions, enhancing clarity and maintainability in the codebase.
      - Modified layout inference to streamline float8 type checks, ensuring consistency across the implementation.
      - This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.
      
      * [Refactor] Streamline float8 data type handling in CUDA and related modules
      
      - Enhanced float8 data type handling in CUDA code generation by refining type conversion logic and ensuring consistent type checks.
      - Updated layout inference for float8 types to improve clarity and maintainability across the implementation.
      - This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.
      
      * [Refactor] Remove unnecessary cache disabling in float8 example script
      
      - Eliminated the call to tilelang.disable_cache() in example_group_per_split_token_cast_to_fp8.py to streamline the code.
      - This change enhances clarity and maintainability of the example script without affecting its functionality.
      
      * [Refactor] Update data type usage in debug print tests
      
      - Changed the argument for dtype in the test_debug_print_buffer function from a string representation to the corresponding T.bool type.
      - This update enhances consistency in data type handling within the test suite, improving clarity and maintainability.
      
      * lint fix
      
      * Update function parameter types from `str` to `T.dtype` for improved type safety in attention sink and related examples
      
      * Refactor `gemv_alloc_reducer` function signature for improved readability by formatting parameters across multiple lines.
    • [Language] Introduce `T.annotate_restrict_buffers` (#1428) · 0814b171
      Lei Wang authored
      * [Enhancement] Introduce non-restrict parameter support in code generation
      
      - Added a new PrimFunc-level attribute `tl.non_restrict_params` to specify handle Vars that should not be marked with the restrict qualifier during code generation.
      - Updated `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP` to handle non-restrict parameters, ensuring proper treatment of overlapping buffer aliases.
      - Implemented a new annotation function `annotate_restrict_buffers` to facilitate the marking of buffer parameters as non-restrict.
      - Enhanced the `SplitHostDevice` transformation to propagate non-restrict parameters from host to device functions.
      - Added a new transform function `HoistNonRestrictParams` to manage non-restrict parameters effectively.
      
      * [Enhancement] Improve HoistNonRestrictParams transformation
      
      - Updated the HoistNonRestrictParams function to recursively collect all `tl.non_restrict_params` annotations from nested blocks, enhancing flexibility in annotation placement.
      - Introduced a new NonRestrictCollector class to manage the collection and deduplication of non-restrict parameters.
      - Modified the SplitHostDevice transformation to remove the non-restrict attribute from the host-side PrimFunc after propagation to device kernels.
      - Adjusted the LowerAndLegalize function to directly apply the HoistNonRestrictParams transformation without exception handling, streamlining the process.
      
      * [Refactor] Simplify non-restrict parameter handling in code generation
      
      - Removed unnecessary normalization logic and associated data structures from `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP`.
      - Streamlined the handling of non-restrict parameters by directly inserting them into the `non_restrict` set, improving code clarity and maintainability.
      - Updated conditional checks to eliminate redundant checks against normalized names, enhancing performance and readability.
      
      * [Dependency] Update TVM subproject to latest commit 68aa8461
      
      - Updated the TVM subproject to the latest commit, ensuring compatibility with recent changes and improvements.
      - Refactored non-restrict parameter handling in `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP` to enhance code clarity and maintainability.
      - Adjusted the `SplitHostDevice` transformation to streamline the propagation of non-restrict parameters.
      
      * fix
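
The recursive collection and deduplication of `tl.non_restrict_params` annotations described in this commit can be pictured with a small Python sketch. This is only an illustration of the idea, not the C++ pass: the nested-dict block layout, the `children` key, and the function name `collect_non_restrict_params` are hypothetical.

```python
def collect_non_restrict_params(root):
    """Gather annotated parameter names from a nested block tree, without duplicates."""
    ordered = []   # first-seen order, so the result is deterministic
    seen = set()   # fast membership test for deduplication

    def visit(block):
        # Hypothetical layout: each block is a dict with optional
        # "annotations" and "children" entries.
        names = block.get("annotations", {}).get("tl.non_restrict_params", [])
        for name in names:
            if name not in seen:
                seen.add(name)
                ordered.append(name)
        for child in block.get("children", []):
            visit(child)

    visit(root)
    return ordered


tree = {
    "annotations": {"tl.non_restrict_params": ["A"]},
    "children": [
        {"annotations": {"tl.non_restrict_params": ["B", "A"]}},
        {"children": [{"annotations": {"tl.non_restrict_params": ["C"]}}]},
    ],
}
hoisted = collect_non_restrict_params(tree)
```

In this toy model the hoisted list would then be attached once at the function level, which is the role the commit assigns to the PrimFunc attribute.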
  3. 16 Dec, 2025 2 commits
    • [Refactor] Reduce direct dependency on PyTorch due to its limited type support (#1444) · dda45126
      Lei Wang authored
      
      
      * [Enhancement] Update KernelParam to use tvm.DataType directly and add torch_dtype conversion method
      
      - Changed dtype in KernelParam from torch.dtype to tvm.DataType to support a wider range of data types and prevent information loss during conversions.
      - Added a new method, torch_dtype, to convert tvm.DataType back to torch.dtype for tensor creation.
      - Updated various adapters to utilize the new torch_dtype method for parameter type conversion during initialization.
      
      * [Enhancement] Refactor CUDA type handling and add support for FP4 and FP8 types
      
      - Renamed functions for clarity: GetFP8Type, GetFP6Type, and GetFP4Type are now GetTileLangFP8Type, GetTileLangFP6Type, and GetTileLangFP4Type respectively.
      - Enhanced FP4 type handling to support additional lane sizes (2, 4, 8, 16, 32, 64).
      - Updated CUDA code generation to include new FP8 and FP4 types, ensuring proper type handling in PrintType and related functions.
      - Introduced new structures for FP8 types in cuda_fp8.h to facilitate better memory management and type packing.
      - Added methods in KernelParam and tensor utilities to recognize and handle float4 types, improving compatibility with PyTorch.
      - Enhanced logging for debugging purposes in various CUDA functions to track type handling and memory operations more effectively.
      
      * lint fix
      
      * Remove unnecessary logging statements from CUDA code generation and delete obsolete matrix multiplication test file.
      
      * [Enhancement] Add support for FP4 and FP8 types in CUDA code generation
      
      - Enhanced PrintVecElemLoad and PrintVecElemStore functions to handle new FP4 types.
      - Updated arg_binder to allow float4 to match int8 at runtime, improving compatibility with PyTorch.
      - Modified loop_vectorize to account for buffer dtype lanes in vectorization calculations.
      - Refactored tensor type mapping to support new float4 and float8 types, ensuring correct type handling in tensor operations.
      - Added tests for FP4 and FP8 copy operations to validate functionality and integration with existing workflows.
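
Why keeping the precise kernel-side dtype avoids information loss can be sketched in plain Python. Everything here is a hypothetical illustration: the `Param` class, the `STORAGE_DTYPE` table, and the dtype strings are stand-ins, and the only fact taken from the commit is that float4 is allowed to match int8 at runtime.

```python
from dataclasses import dataclass

# Illustrative table only: which framework-side storage type stands in
# for each kernel-side dtype.
STORAGE_DTYPE = {
    "float16": "float16",
    "float32": "float32",
    "int8": "int8",
    "float4_e2m1fn": "int8",  # assumption: float4 values travel in int8 storage
}


@dataclass(frozen=True)
class Param:
    dtype: str  # precise kernel-side dtype, kept as the source of truth

    def storage_dtype(self) -> str:
        """Convert to the framework-side type only when a tensor is created."""
        return STORAGE_DTYPE[self.dtype]


fp4_param = Param("float4_e2m1fn")
int8_param = Param("int8")
same_storage = fp4_param.storage_dtype() == int8_param.storage_dtype()
still_distinct = fp4_param.dtype != int8_param.dtype
```

If only the storage type were stored, the two parameters above would be indistinguishable; keeping the original dtype and converting on demand preserves the difference.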
      
      ---------
      Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
    • [Fix] Fix analyzer bind conflicting (#1446) · 81b8c1b7
      Kuris authored
  4. 15 Dec, 2025 7 commits
    • [Feature] Support region as input of T.cumsum (#1426) · 869f021b
      Dayuxiaoshui authored
      
      
      * [Feature] Support region as input of T.cumsum
      
      - Extend T.cumsum to accept BufferRegion and BufferLoad inputs in addition to Buffer
      - This enables operations on buffer slices/regions like:
        T.cumsum(InputG_fragment[i * chunk_size:(i + 1) * chunk_size], dim=0)
      - Update cumsum_fragment to handle region inputs properly
      - Add comprehensive tests for 1D and 2D region inputs including normal and reverse modes
      
      Fixes #879
      
      * Fix formatting and add docstring for cumsum_fragment
      
      - Add comprehensive docstring for cumsum_fragment function
      - Format code according to ruff style guidelines
      
      * Fix CodeRabbit review issues
      
      - Fix negative dimension bounds check (dim < -len(shape) instead of dim <= -len(shape))
      - Add src/dst shape compatibility validation for out-of-place cumsum
      - Update copy() type annotation to accept BufferRegion as dst parameter
      - Fix test in-place mutation issues by using out-of-place cumsum operations
      - Add non-divisible size test cases for tail region coverage
      
      * Fix out-of-bounds access in region tests
      
      - Add bounds clamping using T.min() for chunk_end calculations
      - Prevents accessing beyond tensor bounds for non-divisible sizes
      - Matches reference implementation behavior
      - Fixes both 1D and 2D region test cases
      
      * Fix region test: use simple slice expressions instead of T.min()
      
      - Remove T.min() which cannot be used directly in slice indices
      - Use chunk_start + chunk_size form instead
      - Rely on system's automatic bounds checking for non-divisible sizes
      - Update comments to reflect this approach
      
      * Fix cumsum region: use region extents in lowering and update tests for shared memory
      
      * Simplify fragment scope check using is_fragment()
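
As a plain-Python reference for what this commit describes, the sketch below shows a cumulative sum applied per slice and the corrected negative-dimension bound. It is not TileLang code: `normalize_dim` and `chunked_cumsum` are hypothetical helpers, and reverse mode is assumed to accumulate from the end of each chunk.

```python
from itertools import accumulate


def normalize_dim(dim, ndim):
    """Map a possibly negative axis to [0, ndim); -ndim is the lowest valid value."""
    if dim < -ndim or dim >= ndim:
        raise ValueError(f"dim {dim} out of range for a {ndim}-D shape")
    return dim + ndim if dim < 0 else dim


def chunked_cumsum(values, chunk_size, reverse=False):
    """Run an independent cumulative sum over each slice of length chunk_size."""
    out = []
    for start in range(0, len(values), chunk_size):
        # Python slicing clamps the end index, so a shorter tail chunk is fine.
        chunk = values[start:start + chunk_size]
        if reverse:
            # Assumed meaning of reverse mode: accumulate from the end of the chunk.
            out.extend(list(accumulate(reversed(chunk)))[::-1])
        else:
            out.extend(accumulate(chunk))
    return out


data = [1, 2, 3, 4, 5, 6, 7]  # length is not divisible by the chunk size
forward = chunked_cumsum(data, chunk_size=3)
backward = chunked_cumsum(data, chunk_size=3, reverse=True)
```

The seven-element input exercises the non-divisible tail case that the added tests are said to cover.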
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    • bcae814e
      Xiangwen Wang authored
    • [Enhancement] Refactor vectorization checks in loop_vectorize (#1440) · e387102c
      Lei Wang authored
      * Introduced a new function, IsExprInvariantInVectorBoundary, to encapsulate the logic for checking if an expression is invariant within vector boundaries, improving code clarity and reusability.
      * Updated the existing vectorization logic to utilize this new function, streamlining the process of determining vectorization feasibility based on boundary conditions.
      * Enhanced comments for better understanding of the vectorization criteria and mathematical rationale behind the checks.
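
One way to read "invariant within vector boundaries" is that an expression takes a single value across every aligned group of `vector_size` consecutive loop indices. The brute-force Python check below illustrates that reading only. The real function works symbolically inside the compiler, and the name `is_invariant_in_vector_boundary` and its parameters are hypothetical.

```python
def is_invariant_in_vector_boundary(expr, vector_size, extent):
    """Return True if expr(i) is constant inside each aligned group of lanes."""
    for base in range(0, extent, vector_size):
        lanes = range(base, min(base + vector_size, extent))
        # More than one distinct value means the expression varies across lanes.
        if len({expr(i) for i in lanes}) > 1:
            return False
    return True


# i // 4 is constant across lanes 0..3, 4..7, and so on.
ok_for_4 = is_invariant_in_vector_boundary(lambda i: i // 4, vector_size=4, extent=16)
# With 8 lanes, each group spans two different values of i // 4.
ok_for_8 = is_invariant_in_vector_boundary(lambda i: i // 4, vector_size=8, extent=16)
```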
    • [Enhancement] Improve InjectAssumes logic and make assumes work after SplitHostDevice (#1405) · 2feaa41e
      Chaofan Lin authored
      * [Refactor] Refactor InjectAssumes logic and make assumes work after SplitHostDevice
      
      * address comments
      
      * fix
      
      * fix submodule
      
      * fix
      
      * fix 3rdparty
    • [Enhancement] Improve buffer usage tracking in MakePackedAPI (#1435) · 0788feb8
      Lei Wang authored
      * Added detailed logging for data and shape variable parameters during buffer usage detection in the MakePackedAPI function.
      * Refactored the UsedBufferDetector to differentiate between used parameters by data and shape variables, enhancing clarity in buffer management.
      * Updated logic to ensure minimal carrier buffers are selected for shape symbols, improving the efficiency of parameter handling.
    • [Bugfix] Convey `compile_flags` to ffi compilation path with pass_configs (#1434) · fba12a5f
      Lei Wang authored
      * [Enhancement] Add device compile flags support in pass configuration
      
      * Introduced `kDeviceCompileFlags` option in the pass configuration to allow additional device compiler flags for CUDA compilation.
      * Updated the `tilelang_callback_cuda_compile` function to merge extra flags from the pass configuration, enhancing flexibility in compiler options.
      * Modified the `JITKernel` class to handle device compile flags appropriately, ensuring they are included during compilation.
      * Documented the new pass configuration key for clarity on usage and expected input formats.
      
      * lint fix
      
      * [Refactor] Simplify compile_flags handling in JIT functions
      
      * Removed redundant string check for compile_flags in the compile, jit, and lazy_jit functions, ensuring compile_flags is consistently treated as a list.
      * Updated the JITKernel class to handle compile_flags as a list when a string is provided, enhancing code clarity and maintainability.
      
      * lint fix
      
      * fix
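
The flag handling described in this commit, where `compile_flags` is always treated as a list and extra flags from the pass configuration are merged in, might look roughly like the sketch below. The helper names are hypothetical, and wrapping a lone string as a one-element list (rather than splitting it) is an assumption.

```python
def normalize_compile_flags(compile_flags):
    """Return compile flags as a list, whatever form the caller passed."""
    if compile_flags is None:
        return []
    if isinstance(compile_flags, str):
        # Assumption: a single string is one flag, not a space-separated group.
        return [compile_flags]
    return list(compile_flags)


def merge_compile_flags(compile_flags, extra_flags=None):
    """Append extra flags (for example from a pass config) without duplicates."""
    merged = normalize_compile_flags(compile_flags)
    for flag in normalize_compile_flags(extra_flags):
        if flag not in merged:
            merged.append(flag)
    return merged


flags = merge_compile_flags("-O3", ["--use_fast_math", "-O3"])
```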
    • [Refactor] Phase out the primitives folder since its design has been merged into tileop (#1429) · 89521e63
      Lei Wang authored
      * Phase out primitives
      
      * revert changes
      
      * Refactor GemmWarpPolicy method signature for clarity
      
      Updated the `from_warp_partition` method in the `GemmWarpPolicy` class to return the type `GemmWarpPolicy` instead of a string, enhancing type safety and clarity in the codebase. Removed an unnecessary blank line for improved readability.
      
      * fix
  5. 13 Dec, 2025 2 commits
    • [CUDA] Add read-only parameter annotation for CUDA codegen (#1416) · 00dd7388
      Lei Wang authored
      * [Enhancement] Add read-only parameter annotation for CUDA codegen
      
      * Introduced the `AnnotateReadOnlyParams` transformation to annotate read-only handle parameters in PrimFuncs, enabling the generation of `const` qualifiers in CUDA codegen.
      * Updated `PrintFunctionSignature` and `AddFunction` methods to utilize the new attribute `tl.readonly_param_indices`, enhancing performance by allowing read-only cache loads.
      * Modified the optimization pipeline to include the new annotation step, improving the overall efficiency of the code generation process.
      
      * lint fix
      
      * [Dependency] Update apache-tvm-ffi version to >=0.1.3
      
      * Updated the version of apache-tvm-ffi in pyproject.toml, requirements.txt, and requirements-dev.txt to ensure compatibility with the latest features and fixes.
      * Made adjustments in CUDA and HIP template files to use `const` qualifiers for global pointer parameters, enhancing code safety and clarity.
      
      * lint fix
      
      * [Enhancement] Refactor ReadWriteMarker for improved parameter handling
      
      * Updated the ReadWriteMarker class to accept a set of parameter or data variables, enhancing its ability to track written variables.
      * Introduced a new method, ResolveDataVarFromPtrArg, to resolve underlying buffer data from pointer-like arguments, improving accuracy in identifying written variables.
      * Modified the MarkReadOnlyParams function to gather handle parameters and their corresponding buffer data variables, streamlining the process of determining read-only parameters.
      * Enhanced the logic for identifying written variables to account for aliased data variables, ensuring comprehensive tracking of modifications.
      
      * lint fix
      
      * Update tma_load function to use const qualifier for global memory pointer
      
      * Changed the parameter type of gmem_ptr in the tma_load function from void* to void const* to enhance type safety and clarity in memory operations.
      * This modification ensures that the function correctly handles read-only global memory pointers, aligning with best practices in CUDA programming.
      
      * Remove commented-out code and reorder transformations in OptimizeForTarget function for clarity
      
      * Refactor buffer marking logic in annotate_read_only_params.cc to improve accuracy in identifying written variables. Update OptimizeForTarget function to reorder transformations for better clarity.
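
The core idea behind marking read-only parameters, including the handling of aliased data variables, can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the actual pass: `readonly_param_indices`, the list of written variable names, and the `alias_of` mapping are all hypothetical.

```python
def readonly_param_indices(params, written_vars, alias_of=None):
    """Return the indices of parameters that are never written, directly or via an alias."""
    alias_of = alias_of or {}
    # Resolve every written variable back to the parameter it refers to.
    written_params = {alias_of.get(var, var) for var in written_vars}
    return [i for i, name in enumerate(params) if name not in written_params]


params = ["A", "B", "C"]
# The kernel stores through "C_data", which is an alias of parameter "C".
indices = readonly_param_indices(params, written_vars=["C_data"], alias_of={"C_data": "C"})
```

In this model the returned indices play the role of `tl.readonly_param_indices`, which the code generator can then use to emit `const` on the matching pointer parameters.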
    • [Atomic] Use ptr for atomicAdd dst instead of reference (#1425) · 3546e2ee
      Lei Wang authored
      * [Enhancement] Update AtomicAdd function signature to accept pointer to destination
      
      * Modified AtomicAdd in CUDA to take a pointer instead of a reference for the destination argument.
      * Updated related code in atomicadd_vectorize.cc to ensure compatibility with the new signature.
      * Adjusted Python interface in atomic.py to pass the destination by pointer, aligning with device function requirements.
      
      * [Enhancement] Refactor AtomicAddRet function signature to accept pointer
      
      * Updated AtomicAddRet in both CUDA and HIP to take a pointer instead of a reference for the address argument, improving consistency with the AtomicAdd function.
      * Adjusted the implementation to ensure proper reinterpretation of the address type for atomic operations.
      
      * lint fix
      
      * [Enhancement] Refactor AtomicAddNode::MakeSIMTLoop to use destination pointer
      
      * Updated the MakeSIMTLoop function to build a pointer to the destination element using tvm_access_ptr instead of loading the destination value directly.
      * Simplified the handling of source and destination predicates, improving clarity and maintainability of the code.
      * Ensured compatibility with the new pointer-based approach for atomic operations.
      
      * lint fix
      
      * test fix
      
      * lint fix
  6. 12 Dec, 2025 2 commits
    • [Enhancement] Improve vectorization invariant check (#1398) · e84b24bc
      Xiangwen Wang authored
      * Improve loop vectorize
      
      * Improve loop vectorize
      
      * Improve loop vectorize
      
      * Improve loop vectorize
      
      * Improve loop vectorize
      
      * Add some vectorize tests and comments
    • [Enhancement] Introduce `T.__ldg` (#1414) · 6f67da84
      Lei Wang authored
      * [Enhancement] Add __ldg intrinsic for CUDA read-only cache loads
      
      * Introduced the __ldg intrinsic to enable explicit read-only cached loads from global memory in CUDA.
      * Updated the corresponding documentation and added support in both CUDA and HIP code generation.
      * Enhanced the Python interface for __ldg to accept BufferLoad and Buffer types, improving usability.
      
      * [Enhancement] Update formatting and linting rules in pyproject.toml; minor test adjustment
      
      * Added new formatting rules in pyproject.toml to enforce consistent code style, including hanging indents and argument splitting.
      * Updated test_tilelang_language_intrinsics_codegen.py to improve readability by adding a blank line before the main execution block.
      * Refactored error messages in builtin.py for better clarity and consistency, ensuring proper formatting in function definitions and raising ValueErrors.
      
      * lint fix
  7. 11 Dec, 2025 3 commits
    • [TypoFix] fix typo for SM120 (#1408) · ede9eaa3
      Cunxiao Ni authored
    • [AMD] Enable FA2 fwd on AMD MI300X (#1406) · 53be59dc
      danielhua23 authored
      * enable FA2 on AMD MI300X
      
      * make lint happy
    • [Dependency] Update apache-tvm-ffi version to >=0.1.2 (#1400) · 0eb33f28
      Lei Wang authored
      * [Dependency] Update apache-tvm-ffi version to >=0.1.2 in project files
      
      * [Dependency] Update subproject commit for TVM to latest version afc07935
      
      * [Enhancement] Add support for optional step parameter in loop constructs
      
      - Updated loop creation functions to accept an optional step parameter, enhancing flexibility in loop definitions.
      - Modified ForFrame implementations to utilize the new step parameter across various loop types including serial, parallel, and pipelined loops.
      - Adjusted related vectorization transformations to accommodate the step parameter, ensuring consistent behavior in loop vectorization processes.
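
The effect of an optional step on a loop's iteration space can be sketched in plain Python. This is only an illustration and does not use the TileLang API: `loop_indices` is a hypothetical helper, and the fallback to a step of 1 when none is given is an assumption.

```python
from typing import List, Optional


def loop_indices(start: int, stop: int, step: Optional[int] = None) -> List[int]:
    """Return the indices a serial loop over [start, stop) would visit.

    Hypothetical helper: a missing step falls back to 1, which is an
    assumption about how an optional step would behave.
    """
    if step is None:
        step = 1
    if step <= 0:
        raise ValueError("step must be positive in this sketch")
    # Number of iterations is ceil((stop - start) / step), clamped at zero.
    extent = max(0, -(-(stop - start) // step))
    return [start + i * step for i in range(extent)]
```

For example, `loop_indices(0, 10, 3)` visits four indices, so a pass that reasons about the loop extent has to use the ceiling division rather than `stop - start`.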
      
      * lint fix
      0eb33f28
  8. 10 Dec, 2025 2 commits
    • Lei Wang's avatar
      [Enhancement] Refactor inflight computing to support dynamic pipeline extents (#1399) · f2858fa1
      Lei Wang authored
      * [Build] Update CMake configuration for tilelang_cython_wrapper installation
      
      - Adjusted output directories for the tilelang_cython_wrapper to ensure that development builds place the extension in build/lib.
      - Updated installation paths to place the extension in tilelang/lib within the wheel, improving organization and avoiding potential conflicts with other modules.
      - Modified the internal library path exposure in env.py to prevent shadowing of common module names, enhancing compatibility and usability in user projects.
      
      * [Build] Standardize output directories for tilelang libraries
      
      - Set output directories for both tilelang and tilelang_module libraries to "${CMAKE_BINARY_DIR}/lib" for consistency in development builds.
      - This change enhances organization and ensures that all build artifacts are located in a unified directory structure.
      
      * [Refactor] Update TVM subproject and enhance pipeline loop handling
      
      - Updated the TVM subproject to commit 90581fe9e5287bbcf1844ad14255a1e1e8cdf7f0.
      - Added new fields to `PipelineAnnotation` and `RewrittenBlockInfo` structures to track original statement indices and improve async state management.
      - Refactored `EmitImpl` and `PopulateWaitCounts` methods to enhance clarity and functionality, including better handling of commit groups and wait counts.
      - Simplified access index calculations and strengthened analyzer constraints for loop bounds.
      
      * [Cleanup] Remove license block and unused includes from inject_pipeline.cc
      
      - Eliminated the Apache license block from the top of the file to streamline the code.
      - Removed unused include directives for memory and stringstream to enhance code clarity and reduce unnecessary dependencies.
      
      * [Refactor] Enhance transformation pipeline and test execution
      
      - Added an additional Simplify transformation in the InjectSoftwarePipeline to improve optimization.
      - Updated the test file to call `test_trival_pipeline()` directly, commenting out the previous main execution for better test isolation.
      f2858fa1
    • Kuris's avatar
  9. 06 Dec, 2025 2 commits
    • Cunxiao Ni's avatar
      [Tool] Provide layout visualization tool (#1353) · 924225ed
      Cunxiao Ni authored
      * Provide layout visualization tool
      
      Adds a layout visualization tool to TileLang, which helps users understand and debug the layout transformations applied during compilation.
      
      This tool visualizes the memory layout of tensors at different stages of the compilation process, allowing developers to identify potential inefficiencies and optimize their code for better performance.
      
      The visualization can be enabled via a pass config option.
      
      * format
      
      * add layout visual example
      
      * Adds vis extra with matplotlib dependency
      
      * refactor pass config name
      
      * fix lint
      
      * Enables configurable layout visualization formats
      
      Allows users to specify the output formats (png, pdf, svg) for layout visualization through a pass config option.
      
      This change provides more flexibility in how layout visualizations are generated, allowing users to choose the formats that best suit their needs.
      
      It also fixes a bug where layout visualization was not correctly disabled when the config option was set to "false".
      
      * Adds visual layout inference tool docs
      
      * fix lint
      
      * fix lint
      
      * Refactor configurable layout visualization formats
      
      * fix lint
      
      * fix typo
      
      * add some comments
      
      * fix lints
      
      * add some warnings for user
      
      * Moves layout visualization
      
      * Refactors layout visualization pass configuration
      
      Updates the layout visualization pass configuration to use a boolean flag for enabling and a string for specifying formats.
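
A boolean flag plus a formats string could be resolved roughly as follows. This is a sketch under stated assumptions, not the pass's actual code: `resolve_formats` is a hypothetical helper, and the comma separator and the `"png"` default are guesses. Only the format names (png, pdf, svg) come from the commit messages.

```python
from typing import List

SUPPORTED_FORMATS = ("png", "pdf", "svg")  # formats named in the commit message


def resolve_formats(enabled: bool, formats: str = "png") -> List[str]:
    """Hypothetical helper: turn (flag, formats string) into a list of formats.

    The comma separator and the "png" default are assumptions.
    """
    if not enabled:
        # A disabled flag wins over whatever the formats string says.
        return []
    requested = [f.strip().lower() for f in formats.split(",") if f.strip()]
    unknown = [f for f in requested if f not in SUPPORTED_FORMATS]
    if unknown:
        raise ValueError(f"unsupported layout visualization format(s): {unknown}")
    # Drop duplicates while preserving the requested order.
    return list(dict.fromkeys(requested))
```

Keeping the enable flag separate from the formats string means that turning visualization off never depends on how the string is parsed.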
      
      * Enables multiple layout visualization formats
      
      * Updates layout visualization docs
      
      * Moves layout visualization to analysis
      924225ed
    • Lei Wang's avatar
      [Enhancement] Introduce buffer var lca analysis for pass plan buffer allocations (#1376) · f8e7fef5
      Lei Wang authored
      * Update submodule TVM to latest commit and add PlanAndUpdateBufferAllocationLocation function to transform module
      
      - Updated the TVM submodule to commit 3a32b763.
      - Added a new function `PlanAndUpdateBufferAllocationLocation` in the transform module to facilitate buffer allocation planning within PrimFuncs.
      
      * Refactor buffer allocation code for improved readability and consistency
      
      - Updated formatting and spacing in `plan_update_buffer_allocation_location.cc` for better code clarity.
      - Standardized the use of pointer and reference syntax across various class methods.
      - Enhanced comments for better understanding of buffer allocation logic.
      - Removed unnecessary lines and improved overall code structure.
      
      * Refactor buffer allocation checks for improved clarity
      
      - Replaced size checks with empty checks for `ffi::Array<Buffer>` in `plan_update_buffer_allocation_location.cc` to enhance code readability.
      - Updated conditions in multiple methods to use `empty()` instead of comparing size to zero, streamlining the logic.
      f8e7fef5
  10. 05 Dec, 2025 1 commit
    • Lei Wang's avatar
      [Layout] Enhance Free Layout Inference (#1375) · 6654064d
      Lei Wang authored
      * [Refactor] Update condition for benchmarking in example_gemv.py and simplify cached library path handling in sparse.py
      
      * [Enhancement] Extend support for float8 data types in GEMM operations
      
      - Updated GEMM operations to recognize additional float8 data types: `float8_e4m3fn` and `float8_e5m2fnuz`.
      - Refactored condition checks in `checkWgmma` methods to simplify float8 type handling.
      - Adjusted test cases to ensure compatibility with the new float8 types in tile language examples.
      
      * lint fix
      
      * [Enhancement] Add injective layout detection and exception handling
      
      - Introduced `DetectInjective` method in `FragmentNode` to check for injective layouts.
      - Added `LoopLayoutInjectiveException` to handle errors related to non-injective layouts.
      - Updated `InferLayout` methods in `ParallelOpNode` to utilize injective checks and log relevant information.
      - Refactored layout inference queue management to use `std::deque` for improved performance and added prioritization logic for buffer layouts.
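
What "injective" means for a layout can be shown with a brute-force check over a static shape. This is only a sketch of the property, not the `DetectInjective` implementation: `is_injective` is a hypothetical helper, and a layout is modelled as a plain Python function from iteration indices to an output tuple.

```python
from itertools import product
from typing import Callable, Sequence, Tuple


def is_injective(shape: Sequence[int], layout: Callable[..., Tuple[int, ...]]) -> bool:
    """Brute-force check that no two indices in `shape` map to the same output.

    Hypothetical helper; it needs concrete extents, so it cannot handle
    symbolic dimensions.
    """
    seen = set()
    for idx in product(*(range(n) for n in shape)):
        out = layout(*idx)
        if out in seen:
            # Two distinct iteration points collide on one output position.
            return False
        seen.add(out)
    return True
```

A row-major mapping such as `lambda i, j: (i * 8 + j,)` over shape `(4, 8)` passes the check, while `lambda i, j: ((i + j) % 4,)` does not, because several `(i, j)` pairs land on the same position.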
      
      * remove debug print
      
      * remove debug print
      
      * remove debug print
      
      * minor layout fix
      
      * fix for T.view
      
      * [Enhancement] Improve injective layout detection in FragmentNode
      
      - Updated the `DetectInjective` method to handle symbolic dimensions more effectively by introducing a mechanism to collect symbolic shapes and adjust the detection level accordingly.
      - Added logging for cases where the layout detection falls back to NoCheck due to symbolic dimensions.
      - Minor update to the test file to include the tilelang testing module.
      
      * [Refactor] Simplify layout inference for bulk copy operations
      
      - Removed unnecessary conditions for bulk load/store operations in the layout inference logic.
      - Streamlined the handling of layout application for bulk copy instances to enhance clarity and maintainability.
      
      * remove debug print
      
      * [Enhancement] Introduce layout-related exceptions and improve error handling
      
      - Added `LayoutConflictException` and `LoopLayoutInjectiveException` classes for better exception management in layout operations.
      - Updated `InferLayout` method in `ParallelOpNode` to throw `LoopLayoutInjectiveException` with detailed error information when injective layout checks fail.
      - Removed redundant exception class definitions from `parallel.h` to streamline code organization.
      6654064d
  11. 03 Dec, 2025 2 commits
    • Lei Wang's avatar
      [Refactor] Generalize fp8 process (#1372) · 92121fc6
      Lei Wang authored
      * [Refactor] Update condition for benchmarking in example_gemv.py and simplify cached library path handling in sparse.py
      
      * [Enhancement] Extend support for float8 data types in GEMM operations
      
      - Updated GEMM operations to recognize additional float8 data types: `float8_e4m3fn` and `float8_e5m2fnuz`.
      - Refactored condition checks in `checkWgmma` methods to simplify float8 type handling.
      - Adjusted test cases to ensure compatibility with the new float8 types in tile language examples.
      
      * lint fix
      92121fc6
    • Yuqi Dong's avatar
  12. 01 Dec, 2025 5 commits
    • Lei Wang's avatar
      [Enhancement] Implement dynamic unroll factor in CUDA code generation (#1360) · 388ee7ee
      Lei Wang authored
      * [Enhancement] Implement dynamic unroll factor in CUDA code generation
      
      This commit introduces support for specifying a dynamic unroll factor in the CUDA code generation. The `unroll_factor` map is added to store unroll factors for loop variables, allowing for more flexible and optimized loop unrolling. Additionally, the `unroll` function is integrated into the loop language, enabling users to define unroll factors directly in their code. This enhancement improves performance by allowing tailored unrolling strategies based on specific loop characteristics.
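
One way a per-variable unroll factor map could drive code generation is sketched below. It is an assumption that the generated code uses CUDA's `#pragma unroll N` form; `emit_for_loop` is a hypothetical helper, and only the `unroll_factor` name is taken from the commit message.

```python
from typing import Dict


def emit_for_loop(var: str, extent: int, body: str, unroll_factor: Dict[str, int]) -> str:
    """Emit a CUDA-style for loop as text.

    Hypothetical helper: when `var` has an entry in `unroll_factor`, the loop
    is prefixed with `#pragma unroll N`; otherwise no pragma is emitted.
    """
    lines = []
    factor = unroll_factor.get(var)
    if factor is not None:
        lines.append(f"#pragma unroll {factor}")
    lines.append(f"for (int {var} = 0; {var} < {extent}; ++{var}) {{")
    lines.append(f"  {body}")
    lines.append("}")
    return "\n".join(lines)
```

Calling `emit_for_loop("i", 16, "acc += a[i];", {"i": 4})` yields a loop whose first line is `#pragma unroll 4`, while a variable missing from the map gets a plain loop.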
      
      * lint fix
      
      * [Bugfix] Correct initialization of non-zero counters in custom compress kernel and update TIR registration for gemm_sp_py to use the correct tile operation
      388ee7ee
    • Lei Wang's avatar
    • botbw's avatar
      [Language] support `T.gemm_sp_v2` on sm80 and sm89 (#1056) · 283a9a00
      botbw authored
      * [misc] add a cpp side wrapper for gemm_sp_py
      
      * [misc] typing
      
      * [IR] bind GemmSPWarpPolicy
      
      * [chore] add wrapper code
      
      * [IR] fix GemmSPWarpPolicy
      
      * [codegen] apply ptxas instructions
      
      * [intrinsic] add typical (unused) mma layout
      
      * [template] add uint16 debug func
      
      * [intrinsic] add b matrix layout
      
      * [gemm_sp] enable fp16/bf16 on sm8x
      
      * [layout] refactor fp16/bf16 layout
      
      * [gemm_sp] enable int8
      
      * [chore] update test case dtype
      
      * [gemm_sp] enable fp32
      
      * [layout] refactor layouts
      
      * [intrinsic] enable ldmatrix for mat A
      
      * [layout] enable ldsm for matrix b
      
      * [layout] add ldmatrix for fp32 and fp8
      
      * [chore] refine
      
      * [chore] refactor
      
      * [chore] add fp8 refactor
      
      * [chore] refactor
      
      * [chore] add remove negative zero util
      
      * [example] add a custom compress kernel
      
      * [chore] minor update
      
      * [test] refactor gemm_sp test
      
      * [refactor] make metadata layout func
      
      * [example] add option for using cutlass layout
      
      * [doc] add a gemm_sp doc
      
      * [doc] minor polish
      
      * [chore] remove unused
      
      * [bugfix] fix non replicate b case
      
      * [test] refactor
      
      * [chore] add a check
      
      * [bugfix] fix util bug
      
      * [wip] init a new test case for v2
      
      * [chore] minor refactor
      
      * [chore] minor update
      
      * [bugfix] enable 16bit rs
      
      * [language] enable rs
      
      * [language] enable gemm_sp_sr
      
      * [language] enable gemm_sp_rr
      
      * [test] enable more tests
      
      * [tvm] update ffi binding
      
      * [chore] remove print
      
      * [chore] fix benchmark script
      
      * [lint] precommit lint
      
      * [chore] apply feedback
      
      * [test] use arch 8.0
      
      * [chore] rollback ::ordered_metadata for backward compatibility
      
      * [bugfix] fix capitalized
      
      * [example] keep gemm_sp on hopper
      
      * [test] fix no fp8 normal kernel
      
      * [test] reduce matmul size to satisfy accum error
      
      * [test] use cal_diff for assertion
      
      * [bugfix] expand float8 type
      
      * [lib] add make_int4 for short type
      
      * [language] add transpose E
      
      * [bugfix] fix wrong var
      
      * [format] format
      
      * [chore] refactor binding
      
      * [chore] fix wrong passing var
      283a9a00
    • Chaofan Lin's avatar
      [Analysis] Enhance NestedLoopChecker with tile op cases (#1358) · b10ef75f
      Chaofan Lin authored
      * [Analysis] Enhance NestedLoopChecker with tile op cases
      
      * fix tileop issue
      b10ef75f
    • Lei Wang's avatar
      [Refactor] Update Fragment Indexing in ParallelOpNode's InferLayout Method (#1359) · 1b42c87b
      Lei Wang authored
      This commit refines the Fragment creation process in the InferLayout method of ParallelOpNode. It removes the unnecessary forward_index array and utilizes default fragment indexing for consistency with other operations. Additionally, it binds the thread range to enhance comparability across different operations.
      1b42c87b
  13. 28 Nov, 2025 3 commits
    • LJC00118's avatar
      [Bugfix] Disable floordiv optimization due to integer overflow risk (#1355) · a4ea7da9
      LJC00118 authored
      * disable overflow-prone floordiv optimization in lower_intrin.cc
      
      * disable overflow-prone floordiv optimization in lower_intrin.cc
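
The kind of overflow at stake can be reproduced with a small simulation of 32-bit arithmetic. The rewrite shown, `floordiv(a, b) == truncdiv(a + b*c, b) - c` with `c` chosen so that the shifted numerator is non-negative, is a common way to lower floor division to truncating division. Whether it is exactly the optimization this commit disables is an assumption, and all helper names are hypothetical.

```python
def wrap_int32(x: int) -> int:
    """Wrap a Python int to signed 32-bit two's-complement range."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x


def trunc_div(a: int, b: int) -> int:
    """C-style division that rounds toward zero."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def floordiv_shifted_int32(a: int, b: int, c: int) -> int:
    """Evaluate truncdiv(a + b*c, b) - c with every step wrapped to int32."""
    shifted = wrap_int32(a + wrap_int32(b * c))
    return wrap_int32(trunc_div(shifted, b) - c)
```

With small operands the rewrite agrees with true floor division, for example `floordiv_shifted_int32(-5, 3, 2)` equals `-5 // 3`. With `a = 2_000_000_000`, `b = 1_000_000_000`, `c = 1`, the shifted numerator no longer fits in 32 bits and the result is `-2` instead of `2`.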
      a4ea7da9
    • Lei Wang's avatar
      [Enhancement] Improve error handling and assertion messages across runtime and argument binding (#1356) · 17cfeb76
      Lei Wang authored
      This commit enhances the error handling mechanisms in the runtime by introducing CPU-safe runtime helpers and refining assertion messages in the CodeGenCHost and ArgBinder. It includes structured packed error messages for various conditions, improving clarity in diagnostics. Additionally, the CMake configuration is updated to always include necessary runtime helpers, ensuring consistent error reporting. The changes aim to provide clearer feedback during runtime errors and improve the overall robustness of the argument binding process.
      17cfeb76
    • Lei Wang's avatar
      [Refactor] Simplify index sign state handling in LegalizeNegativeIndex (#1354) · 36a2b2f3
      Lei Wang authored
      This commit refines the logic for determining the sign state of indices in the LegalizeNegativeIndex transformation. It prioritizes vector patterns, specifically Ramp and Broadcast nodes, to avoid compile-time lane queries. The handling of scalar indices is also streamlined, ensuring clearer diagnostics when non-negativity cannot be proven. These changes enhance the robustness and clarity of index handling in the transformation pass.
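
The sign-state idea for scalar indices can be sketched as follows. This is a simplified model, not the pass itself: the three states, the use of proven integer bounds, and raising an error in the unknown case are all assumptions made for illustration, and every name is hypothetical.

```python
from enum import Enum


class SignState(Enum):
    NON_NEGATIVE = "non_negative"
    NEGATIVE = "negative"
    UNKNOWN = "unknown"


def sign_state(lo: int, hi: int) -> SignState:
    """Classify an index from its proven inclusive bounds [lo, hi]."""
    if lo >= 0:
        return SignState.NON_NEGATIVE
    if hi < 0:
        return SignState.NEGATIVE
    return SignState.UNKNOWN


def legalize(index: int, lo: int, hi: int, extent: int) -> int:
    """Rewrite a provably negative index as index + extent (hypothetical helper)."""
    state = sign_state(lo, hi)
    if state is SignState.NON_NEGATIVE:
        return index
    if state is SignState.NEGATIVE:
        return index + extent
    # Assumption: this sketch reports the unprovable case as an error.
    raise ValueError(f"cannot prove sign of index with bounds [{lo}, {hi}]")
```

For instance, `legalize(-1, -4, -1, 16)` returns `15`, the last element of a 16-element buffer, while an index whose bounds straddle zero is reported rather than silently rewritten.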
      36a2b2f3
  14. 27 Nov, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Improve assertion handling in CodeGenCHost and ArgBinder (#1352) · 1e92d11c
      Lei Wang authored
      * [Refactor] Improve assertion handling in CodeGenCHost and ArgBinder
      
      This commit refines the assertion message generation in CodeGenCHost by optimizing the handling of equality checks and reducing buffer size for error messages. Additionally, it enhances the ArgBinder by introducing a nullable guard mechanism for assertions, allowing for more precise error handling when binding arguments. The changes improve the clarity and efficiency of assertion handling across the codebase.
      
      * [Enhancement] Update matmul kernel and optimize argument binding
      
      This commit enhances the matmul kernel by introducing additional tensor parameters and refining the pipeline stages for improved performance. It also updates the argument binding mechanism to include a flag indicating whether buffers are used, enhancing the efficiency of buffer management. Furthermore, the optimization phase in the engine is improved by adding a simplification step, ensuring better performance and clarity in the generated code.
      
      * lint fix
      
      * [Enhancement] Add tensor checks documentation and improve argument binding assertions
      
      This commit introduces a new documentation page for host-side tensor checks, detailing the automatic validations performed by TileLang on kernel arguments. It enhances the ArgBinder by adding assertions for non-null pointers when arguments are used, improving error handling. Additionally, the optimization phase in the engine is updated to include a simplification step, ensuring better performance and clarity in the generated code.
      
      * [Enhancement] Update .gitignore and refine matmul kernel for improved performance
      
      This commit adds host checks logs to the .gitignore file to prevent unnecessary log files from being tracked. Additionally, it refines the matmul kernel by adjusting pipeline stages, updating tensor parameters, and enhancing argument handling for better performance. The changes also include improved error messages in the argument binding process, ensuring clearer diagnostics for users.
      
      * lint fix
      
      * lint fix
      
      * [Refactor] Simplify tensor_null_test function and remove ptr_null_test
      
      This commit refactors the tensor_null_test function by adding a with_bias parameter and removing the ptr_null_test function, which was previously unused. The run_test function is updated to reflect these changes, streamlining the testing process for tensor operations.
      
      * lint fix
      
      * fix
      1e92d11c
  15. 26 Nov, 2025 3 commits
    • Gongen-Ali's avatar
      [Enhancement] Add support for k_pack in gemm_mfma (#1344) · 6bae64f6
      Gongen-Ali authored
      * add support for k_pack
      
      * support benchmark on ROCm
      
      * fix format
      6bae64f6
    • Lei Wang's avatar
      [Refactor] Enhance CopyNode's IterVar Creation and Range Handling (#1346) · 17718bec
      Lei Wang authored
      * [Refactor] Enhance CopyNode's IterVar Creation and Range Handling
      
      This commit refines the `MakeIterVars` method in `CopyNode` to select base ranges based on memory scope levels, ensuring that the chosen ranges are not smaller than the original source ranges. Additionally, it updates the Python `copy` function to clarify range handling, including broadcasting logic and extent alignment. These changes improve the robustness and clarity of the copy operation's implementation.
      
      * test fix
      17718bec
    • Yunqian Fan's avatar
      [Enhancement] add more dtype and fix mma.ws for fp16 for tcgen05 (#1327) · f0c721a4
      Yunqian Fan authored
      * feat: add fp8 variants; add placeholder for fp6/fp4 in meta
      
      support ld with pack for fp32 dtype
      
      add dump
      
      add template expand
      
      remove unused dtype and change to rebased apis
      
      * fix: when atom-m!=128, enable_ws
      
      * fix: typo in tcgen05 meta; dispatch in gemm sm100
      f0c721a4