"src/git@developer.sourcefind.cn:OpenDAS/nni.git" did not exist on "a9711e24a150d2cf4e5cb3ed3d2224789369feda"
  1. 19 Nov, 2025 1 commit
    • Kuris's avatar
      [Fix] Fix memory leak bug (#1281) · cd681e63
      Kuris authored
      * add typing stub for tir.ir
      
      * remove idents
      
      * minor update
      
      * [Refactor] add numpy conversion for dtype
      
      * fix lint error
      
      * remove unused np.float_ in dtype conversion
      
      * fix type in np.int_
      
      * fix typo
      
      * minor fix
      
      * remove debug files
      
      * fix memory leak bug
      
      * fix lint error
      
      * add comments
      
      * fix lint error
      
      * remove duplicated, because tilelang doesn't dependent deprecated
      cd681e63
  2. 18 Nov, 2025 3 commits
    • Lei Wang's avatar
      [FFI] Use tvm ffi as the default execution backend (#1259) · 74da3696
      Lei Wang authored
      * [Refactor] Update FFI type handling and simplify argument management
      
      * Refactored FFI type definitions in runtime and code generation files to use `TVMFFIAny` instead of `TVMValue`, enhancing type clarity.
      * Updated function registration in `runtime.cc` to utilize canonical names for better consistency.
      * Simplified argument handling in the `simplify` transformation, ensuring unused buffer parameters are removed only when simplification is enabled.
      * Adjusted autotuner and profiler parameters to standardize the execution backend to `tvm_ffi`, improving clarity in backend selection.
      * Removed obsolete `adapt_torch2tvm` function from tensor utilities to streamline the codebase and reduce complexity.
      
      * [Update] Sync TVM submodule and enhance kernel source handling
      
      * Updated the TVM submodule to commit cdc2aced, ensuring compatibility with recent changes.
      * Added functionality to print kernel source in `example_blocksparse_gemm.py` for better debugging.
      * Commented out the main execution call in test files to prevent unintended execution during testing.
      * Introduced `tilelang.disable_cache()` in various test files to streamline testing and avoid cache-related issues.
      * Refactored kernel source retrieval methods to improve clarity and consistency across different execution backends.
      
      * [Refactor] Clean up imports and improve code formatting
      
      * Removed unused import of `tilelang.testing` in `test_example_blocksparse_gemm.py` to streamline the code.
      * Reformatted several lines in `arg_binder.cc`, `make_packed_api.cc`, `tvm_ffi.py`, and `adapter.py` for improved readability and consistency.
      * Updated comments and spacing in `tvm_ffi.py` to enhance clarity without altering functionality.
      
      * Update execution backend options and improve resolution logic
      
      - Changed default execution backend from "cython" to "auto" in multiple locations to allow automatic selection based on the target.
      - Expanded the list of supported execution backends to include "torch" and "nvrtc" across various classes and functions.
      - Enhanced backend resolution logic in `KernelCache` and `AutoTuner` to ensure appropriate backend selection based on the target.
      - Updated documentation to reflect changes in execution backend options and their defaults.
      
      * lint fix
      
      * fix
      
      * Enhance argument handling in CUDA and HIP runtime modules
      
      - Updated `ExtractFuncInfo` in `rt_mod_cuda.cc` and `rt_mod_hip.cc` to map boolean argument types to int32, ensuring compatibility with device runtime.
      - Refactored `BindDLTensor` in `arg_binder.cc` to improve null handling and validation checks for DLTensor parameters, utilizing expression-level guards to prevent dereferencing null pointers.
      - Enhanced error checking for buffer shape, strides, and data fields, ensuring robust handling of optional inputs and maintaining consistency across various checks.
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * minor fix
      
      * fix
      
      * recover check
      
      * Refactor argument binding and validation in `arg_binder.cc`
      
      - Improved null handling and validation checks in `BindDLTensor`, ensuring safe dereferencing of pointers.
      - Enhanced consistency checks for buffer shape, strides, and data fields, utilizing expression-level guards.
      - Updated `MakePackedAPI` to maintain code clarity and consistency in argument handling.
      - Minor adjustments in test files to streamline kernel execution and improve readability.
      
      * lint fix
      
      * stride fix
      
      * minor fix
      
      * fix
      
      * lint fix
      
      * lint fix
      
      * Add CUDA stream access policy window helpers and integrate with L2 persistent cache management
      
      - Introduced functions to set and reset the CUDA stream access policy window, allowing for better control over L2 cache usage.
      - Updated runtime files to include new FFI packed functions for managing stream attributes.
      - Modified lower_hopper_intrin to incorporate prologue and epilogue statements for L2 cache setup and teardown.
      - Enhanced tests to verify the inclusion of new FFI calls in the generated kernel source.
      
      * check with symbolic
      
      * support null ptr
      
      * Update CMakeLists and lower.py for code generation and subproject status
      
      - Added `codegen_c_host.cc` to the list of source files in CMakeLists.txt for improved code generation support.
      - Updated the function call in `lower.py` to use `target.build.tilelang_c` for C target host code generation, enhancing compatibility.
      - Marked the TVM subproject as dirty to indicate local modifications.
      
      * lint fix
      
      * Update comments for clarity in quickstart.py
      74da3696
    • Chaofan Lin's avatar
      [Language] Add shape check in `T.view/reshape` (#1277) · 921b96a3
      Chaofan Lin authored
      * [Language] Add shape check in T.view/reshape
      
      * address comments
      921b96a3
    • Elevator14B's avatar
      Fix various issues under `int64_t` static and dynamic shape. (#1218) · 49c85715
      Elevator14B authored
      
      
      * Fix various issues under int64_t static and dynamic shape.
      
      * Resolve reviewed issues.
      
      * Add unit test.
      
      * fix
      
      ---------
      Co-authored-by: default avatarLeiWang1999 <leiwang1999@outlook.com>
      49c85715
  3. 17 Nov, 2025 1 commit
    • Kuris's avatar
      [Refactor] add support for numpy dtype conversion (#1255) · 041d4a06
      Kuris authored
      * add typing stub for tir.ir
      
      * remove idents
      
      * minor update
      
      * [Refactor] add numpy conversion for dtype
      
      * fix lint error
      
      * remove unused np.float_ in dtype conversion
      
      * fix type in np.int_
      
      * fix typo
      
      * minor fix
      
      * remove debug files
      041d4a06
  4. 16 Nov, 2025 1 commit
  5. 14 Nov, 2025 1 commit
  6. 12 Nov, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Support Layout/Fragment Reshape (#1241) · 4370309b
      Lei Wang authored
      
      
      * Update layout handling and introduce reshape functionality
      
      - Updated the `LayoutNode` class to include a new `Reshape` method, allowing for dynamic reshaping of layouts based on input shapes.
      - Enhanced the `OutputShape` method to provide better handling of cases where the analyzer cannot form an `IntervalSet`, implementing fallback mechanisms to ensure safe extents.
      - Refactored the `ReduceOpNode` to utilize `BufferRegion` for improved memory handling during reduction operations.
      - Added tests for reshaping functionality and layout transformations to ensure correctness and performance in various scenarios.
      
      * lint fix
      
      * Revert tvm submodule pointer to 1815c3e0b6ec4ead36370bbd1562025d8529017c; keep src unchanged
      
      * Update tvm submodule to commit f0bbd3bf741413c35c389ba5dedd5be206000ad1
      
      * Update tvm submodule to commit f0bbd3bf741413c35c389ba5dedd5be206000ad1
      
      * remove useless prove
      
      * remove comment
      
      ---------
      Co-authored-by: default avatartilelang-bot <bot@tilelang>
      4370309b
  7. 11 Nov, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Add thread count validation for ReduceOp fragment layout inference (#1225) · 67cc8611
      Lei Wang authored
      * [Enhancement] Add thread count validation for ReduceOp fragment layout inference
      
      * Introduced a check to ensure that the thread count is divisible by the replicate extent during layout inference in ReduceOpNode. This validation prevents layout inference failures and provides detailed error messages to guide users in resolving issues related to thread block sizes and fragment layouts.
      * Updated tests to remove unsupported configurations that could lead to layout inference errors, ensuring more robust testing scenarios.
      
      * lint fix
      67cc8611
  8. 06 Nov, 2025 2 commits
  9. 05 Nov, 2025 2 commits
    • Tong WU's avatar
      [Feature] Add `tl.infinity` operator for infinity handling of bfloat16 (#1175) · 11456de2
      Tong WU authored
      
      
      * Update dependency version for apache-tvm-ffi in pyproject.toml to fix CI
      
      * [Math] Add `tl.infinity` operation and update Python interface for infinity handling
      
      - Implemented `infinity_op` in C++ to return infinity values for supported data types.
      - Registered new operation `tl.infinity` with appropriate attributes.
      - Updated Python interface to call the new `tl.infinity` operation instead of the previous method.
      
      * Add unit tests for `tl.infinity` operation in TileLang
      
      - Introduced a new test file `test_tilelang_language_infinity.py` to validate the behavior of the `tl.infinity` operation across multiple data types (float16, bfloat16, float32, float64).
      - Implemented a kernel to fill a tensor with infinity values and asserted the correctness of the output against PyTorch's `torch.inf`.
      
      * lint
      
      ---------
      Co-authored-by: default avatarZhiwen Mo <zm125@ic.ac.uk>
      11456de2
    • Kurisu's avatar
      [Feat] Add swap like grammar in tuple assignment (#1185) · 055f8500
      Kurisu authored
      * [Feat] add 2 phase binding to allow swap two var
      
      * Minor update tvm dtype constructor
      
      * fix lint error
      055f8500
  10. 03 Nov, 2025 1 commit
    • Kurisu's avatar
      [Language] Initial version of tilelang frontend v2 (#1120) · 5f202fe5
      Kurisu authored
      
      
      * tilelang frontend v2
      
      * syntax sugar: defining a local var by annotation
      
      * [Refactor] fix type linting warning like `T.float32`
      
      * Add tl.local_var_init for new tl.float32
      
      * allow passing default argument as function annotation
      
      * allow default arguments as annotation
      
      * fix lint error
      
      * minor fix
      
      * [Refactor] refactor tilelang.jit and tilelang.autotune
      
      * minor fix
      
      * minor fix
      
      * minor fix
      
      * fix metal get function name
      
      * add par_compile impl and tests
      
      * Type consistency on tvm datatype
      1. isinstance(tl.float32, tvm.DataType) == True
      2. Allow `tl.float32` as function annotations
      3. Allow `tl.float32` as argument to be passed to `tl.alloc` or other functions
      
      * fix lint error
      
      * add more warning in frontend
      
      * update tvm version
      
      * Minor fix on tvm_ffi annotations
      
      * add document and examples
      
      * fix lint error
      
      * Simplify index calculations in example_chunk_o_bwd.py
      
      Refactor index calculations for dg_last_fragment assignment.
      
      * minor fix
      
      * lint fix
      
      ---------
      Co-authored-by: default avatarLei Wang <leiwang1999@outlook.com>
      Co-authored-by: default avatarLei Wang <34334180+LeiWang1999@users.noreply.github.com>
      5f202fe5
  11. 02 Nov, 2025 2 commits
    • Lei Wang's avatar
      [Language] Expose `T.warpgroup_fence_operand` for nvcc code motion (#986) · aef0a6bb
      Lei Wang authored
      
      
      * remove debug print
      
      * pipeline fix
      
      * use the correct buffer access scope
      
      * rs support
      
      * warp warpgroup_fence_operand
      
      * fix
      
      * fp8 dtype ptx enhance
      
      * mma fix
      
      * TCGEN05 Interface
      
      * tcgen05 support
      
      * rebase
      
      * update
      
      * Enhance TCGEN05 support by adding new intrinsic operations and descriptors. Introduced `ptx_tcgen05_mma_ts` for tensor-memory to shared-memory instructions and `tcgen05_mma_arrive` for signaling barrier completion. Updated existing descriptors and code generation logic to accommodate these changes, ensuring compatibility with new instruction sets. Refactored related allocation functions and improved handling of shared memory descriptors.
      
      * lint fix
      
      * Refactor buffer reference handling in CUDA code generation and update test execution in tilelang. Ensure default annotations for unrolling are set correctly in TIR IR module.
      
      * wgmma fix
      
      ---------
      Co-authored-by: default avatarZhiwen Mo <zm125@ic.ac.uk>
      aef0a6bb
    • Lei Wang's avatar
      c85bb3ac
  12. 01 Nov, 2025 1 commit
  13. 29 Oct, 2025 2 commits
  14. 28 Oct, 2025 1 commit
  15. 23 Oct, 2025 1 commit
    • Tong WU's avatar
      [Feature] Enhance vectorized conversion support in CUDA codegen (#1095) · a148d62a
      Tong WU authored
      * [Feature] Add vectorized float16 and float32 conversion support in CUDA codegen
      
      * Implemented handling for conversions between float16 and float32 types, specifically for vectorized operations using __half22float2 and __float22half2_rn.
      * Enhanced the existing code to support both directions of conversion based on the lane count.
      * Improved overall type handling in the VisitExpr_ method for better compatibility with TileLang.
      
      * [Feature] Add float32 to float8 conversion support in CUDA codegen
      
      * Implemented handling for conversion from float32 to float8 (E4M3/E5M2) in the VisitExpr_ method.
      * Added vectorized conversion support using __nv_cvt_float2_to_fp8x2 for float2 to fp8x2 transformations.
      * Enhanced type handling for better compatibility with TileLang, particularly for float8 types.
      
      * lint
      
      * fix a bug
      
      * [Enhancement] Support lanes=4 cases and add unit test for vectorized cast
      
      * lint
      
      * [Feature] Refactor bf16 convertion operations and remove legacy compile flags
      
      * lint
      a148d62a
  16. 21 Oct, 2025 1 commit
    • Lei Wang's avatar
      [Language] Support tilelang `alloc_var(dtype, init=x)` (#1092) · bddb125e
      Lei Wang authored
      * - carry existing local-var initializer map into OpaqueBlockLower, reattach it to
          generated Allocates and the PrimFunc attrs
        - thread the map through FlattenBuffer and StorageRewrite so flattened/merged
          allocations keep their tl.local_var_init annotations
        - teach annotation handling to accept scalar initializers, resolve buffers, and merge
          with existing stat
      
      * lint fix
      
      * enhance
      
      * lint fix
      
      * lint fix
      bddb125e
  17. 20 Oct, 2025 3 commits
  18. 17 Oct, 2025 2 commits
  19. 15 Oct, 2025 1 commit
  20. 14 Oct, 2025 2 commits
  21. 11 Oct, 2025 1 commit
  22. 07 Oct, 2025 1 commit
  23. 25 Sep, 2025 2 commits
    • Lei Wang's avatar
      [Language] Support atomic add with ret (#870) · aa0b1090
      Lei Wang authored
      * Add atomic operations for CUDA templates in new atomic.h file
      
      - Introduced atomic functions including AtomicMax, AtomicMin, AtomicAdd, and their return variants for various data types.
      - Implemented support for half, bfloat16, and float types with appropriate memory ordering.
      - Moved atomic-related utilities from common.h to the new atomic.h file for better organization.
      - Added Python bindings for atomic operations in tilelang, including atomic_max, atomic_min, atomic_add, and their vectorized counterparts.
      - Updated customize.py to utilize the new atomic functions, enhancing modularity and maintainability.
      
      * Refactor atomic operations in CUDA templates for improved readability
      
      - Reformatted atomic operation implementations in atomic.h for better code clarity.
      - Adjusted function signatures in tilelang's atomic.py to enhance readability by aligning parameters.
      - Cleaned up unnecessary whitespace and comments in customize.py to streamline the codebase.
      
      * Add thread storage synchronization configuration option
      
      - Introduced a new configuration option `tl.disable_thread_storage_sync` to control the automatic insertion of thread synchronization barriers in shared memory access.
      - Updated the `ThreadSync` pass to check this configuration and bypass synchronization if disabled.
      - Enhanced documentation in `builtin.h` and `pass_config.py` to clarify the purpose and usage of the new option.
      
      * Refactor thread storage sync configuration retrieval
      
      - Simplified the retrieval of the thread storage sync configuration in the `ThreadSync` pass by removing unnecessary intermediate variables.
      - Ensured that the inclusion of `builtin.h` is consistent by moving it to the appropriate location in the file.
      
      * test fix
      
      * Update atomic operations and tests for improved functionality
      
      - Updated atomic operations in CUDA templates to remove unnecessary address_of calls, enhancing performance and readability.
      - Refactored atomic operation signatures in tilelang's atomic.py to accept references instead of pointers.
      - Added new atomic operations and corresponding test cases for atomic add, max, min, and load/store functionalities in the testing suite.
      - Updated the TVM subproject to the latest commit for better compatibility.
      
      * Update attention sink examples to use 32 heads
      
      - Modified the `heads` parameter in both `example_gqa_sink_fwd_bhsd_wgmma_pipelined.py` and `example_mha_sink_fwd_bhsd_wgmma_pipelined.py` from 1 to 32 to enhance performance in attention mechanisms.
      - Ensured consistency across example scripts for improved usability and testing.
      
      * Refactor atomic add handling in vectorization
      
      - Simplified the extraction of buffer loads for atomic add operations by removing unnecessary address_of calls, improving code clarity and performance.
      - Updated the data type retrieval for vectorization size calculation to directly access the buffer load node, enhancing efficiency.
      
      * Add loop break functionality and enhance thread synchronization
      
      - Introduced a new `loop_break` function in `customize.py` to allow breaking out of loops, returning a call to the `tl.loop_break` intrinsic.
      - Updated the `sync_threads` function in `builtin.py` to accept optional parameters for `barrier_id` and `arrive_count`, improving its flexibility for thread synchronization.
      - Added necessary imports in `__init__.py` to include the new `loop_break` function for broader accessibility.
      
      * test fix
      aa0b1090
    • Lei Wang's avatar
      [Language] Support sequence comparisons (#872) · c538d8ab
      Lei Wang authored
      * Update submodule 'tvm' to latest commit 7a71ee34
      
      * lint fix
      c538d8ab
  24. 17 Sep, 2025 1 commit
  25. 24 Aug, 2025 1 commit
    • Lei Wang's avatar
      [Typo] Remove `disable_cache` in some tests (#755) · 556d411e
      Lei Wang authored
      * Update test parameters and remove debug print statement
      
      - Adjusted test cases in `test_tilelang_dynamic_symbolic_bench.py` to use smaller matrix sizes (1024x1024) for improved performance and quicker execution.
      - Removed a debug print statement from `phase.py` to clean up the code and enhance clarity.
      
      * Refactor loop stack management in warp_specialized_rewriter
      
      - Introduced a new `LoopInfo` struct to encapsulate loop variable details, including `loop_var`, `extent`, and `min`, enhancing clarity and maintainability.
      - Updated the `loop_stack_` to utilize `LoopInfo` instead of a pair, improving type safety and readability.
      - Adjusted linear index calculations to account for the new structure, ensuring correct behavior in loop transformations.
      
      * Remove unused `torch.backends` import and `tilelang.disable_cache()` calls from multiple test files to enhance code clarity and maintainability.
      556d411e
  26. 23 Aug, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Merge ThreadPartialSync and ThreadStorageSync (#741) · 6b125028
      Lei Wang authored
      * Remove `thread_partial_sync.cc` and refactor `thread_storage_sync.cc` to streamline synchronization handling. Introduce `thread_sync_types.h` for thread-bound key definitions and reserved named barriers. Update related logic in `ThreadSyncInserter` and `TileLangThreadSync` for improved clarity and efficiency.
      
      * Remove `sync_thread_partial` references and related documentation from the codebase. Update CUDA and HIP code generation files to eliminate calls to the removed function. Refactor `__sync_thread_partial` to `sync_thread_partial` in CUDA common header for consistency.
      
      * Remove unused import of `bulk_copy.h` in `codegen_hip.cc` to enhance code clarity and maintainability.
      
      * Add import of `bulk_copy.h` in `codegen_hip.cc` to support new functionality.
      
      * typo fix
      
      * Update data type in reduce_sum tests from float16 to float32 for consistency and clarity. Remove redundant dtype tests and streamline run functions. Enhance reshape kernel compilation with pass configurations to address shared memory layout issues.
      
      * lint fix
      
      * test fix
      
      * Enhance CI configuration by adding verbose output to pip install command for better visibility during installation.
      
      * use ninja instead of make
      
      * Add CMake configuration step for Ninja build system in setup.py
      
      * Update pyproject.toml to include additional build dependencies: build, torch, tox, auditwheel, patchelf, and ninja.
      
      * Enhance CI configuration by adding verbose output to pytest commands for improved test visibility.
      
      * Update pyproject.toml to add Cython as a build dependency. Enhance thread storage synchronization in thread_storage_sync.cc by introducing new thread variable handling and improving index disjointness checks.
      
      * Update data type in cumulative sum tests from float16 to float32 for consistency. Modify run_cumsum function to utilize the updated dtype and enhance result validation with assertions. Adjust test cases accordingly.
      
      * Refactor storage access handling by introducing buffer data mapping in TileLangStorageAccessVisitor. Enhance access entry structure to include pointer access flag. Update thread storage synchronization to accommodate new buffer data mappings. Adjust quickstart example to print kernel source for debugging purposes.
      
      * Refactor linear index conversion in TileLangStorageAccessVisitor to utilize the analyzer for simplification. Update buffer index calculations to ensure consistent simplification of range expressions.
      
      * bugfix
      
      * Refactor buffer index calculation in TileLangStorageAccessVisitor to simplify access handling. Removed unused buffer mapping logic, ensuring consistent buffer index generation with a default ramp.
      
      * Refactor TileLangStorageAccessVisitor to replace buffer indices with buffer ranges for improved pointer access handling. Update AccessEntry structure to include buffer_ranges and adjust thread storage synchronization logic to account for pointer access conflicts.
      
      * Refactor thread storage synchronization to replace 'shared.dyn' with 'shared' for consistency in memory allocation. Update related test cases to reflect this change and ensure proper functionality.
      6b125028
  27. 22 Aug, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Merge bulk copy into copy and improve layout inference for bulk copy (#746) · 5c11d245
      Lei Wang authored
      * [Refactor] Merge bulk copy into copy and refactor layout inference for bulk copy
      
      * Deleted the `bulk_copy` operator implementation and its header file as it is no longer needed.
      * Introduced a new function `cuTensorMapType()` to return the data type for CUDA tensor mapping.
      * Updated related files to reflect these changes, ensuring that the codebase remains clean and maintainable.
      
      * lint fix
      
      * Fix typos in intrinsic names and remove unused print statement in block_sparse_attn_tilelang.py. Updated references from `ptx_ldmatirx` to `ptx_ldmatrix` across multiple files for consistency.
      
      * remove bulk copy
      
      * Refactor copy and atomic add operations to support TMA lower configuration
      
      - Updated `GetCopyInst` to accept a `disable_tma_lower` parameter, allowing for conditional usage of TMA in bulk load/store operations.
      - Modified `Lower` method in `Copy` to incorporate the new TMA configuration.
      - Refactored `AtomicAdd::Lower` to streamline layout inference and vectorization logic.
      - Removed unused `disable_tma_lower` field from `LowerArgs` structure for clarity.
      - Enhanced atomic add vectorization by replacing the buggy implementation with a more robust loop vectorization approach.
      
      * Enhance TMA bulk copy logic in `LowerBulkCopy` method
      
      - Added a condition to set `desc.swizzle` to `CU_TENSOR_MAP_SWIZZLE_NONE` when `shared_layout` matches `linear_layout`, improving clarity in layout handling.
      - Updated warning log to provide more detailed information about fallback scenarios, including source and destination buffer names and shapes, enhancing debugging capabilities.
      
      * lint fix
      
      * Remove fallback logging for non-swizzled global layout in `LowerBulkCopy` method to streamline the bulk copy logic. This change enhances code clarity by eliminating unnecessary warning messages related to inner box dimensions.
      
      * Enhance reshape kernel compilation in `run_reshape` and `run_reshape_smem_1d_2_2d` functions
      
      - Updated the `tl.compile` method to include `pass_configs` that disable TMA lower and warp specialization, addressing shared memory layout transformation limitations.
      - Added TODO comments to indicate the need for further improvements in shared memory handling.
      
      * Update `native_sparse_attention` function to include TMA configuration options
      
      - Added `pass_configs` to the JIT decorator to disable TMA lower and warp specialization, addressing potential issues with shared memory layout transformations.
      - Updated comments to clarify modifications in tensor shapes for inference, specifically setting `q` sequence length to 1.
      
      * Refactor JIT decorator formatting in `native_sparse_attention` function
      
      - Improved readability by reformatting the JIT decorator parameters for `native_sparse_attention`, ensuring consistent style across the codebase.
      - No functional changes were made; this update focuses on code clarity and maintainability.
      
      * Enhance thread management and logging in TileLang compilation
      
      - Added a method to check if printing is enabled during compilation, improving control over logging behavior.
      - Updated the JIT kernel class to utilize the new method for logging compilation status, ensuring consistent and clear output.
      - Added comments to clarify the purpose of changes and improve code readability.
      
      * Add warp specialization scope and refactor register management in TileLang
      
      - Introduced a new constant `kWarpSpecializationScope` in `builtin.h` for better attribute management.
      - Removed the `SetMaxNRegCollector` class and its related logic from `warp_specialized_rewriter.cc`, streamlining the warp specialization process.
      - Added functions `annotate_producer_reg_dealloc` and `annotate_consumer_reg_alloc` in `builtin.py` to facilitate register management.
      - Implemented `AnnotateWarpGroupRegAlloc` in `__init__.py` to inject register allocation calls into warp-specialized functions, enhancing the overall register handling in the compilation process.
      
      * Refactor test for InjectSetMaxNReg pass in TileLang
      
      - Improved readability by restructuring conditional checks and assertions in the test cases.
      - Enhanced clarity in the collection of `set_max_nreg` calls by simplifying the logic.
      - Ensured consistent formatting and spacing throughout the test functions for better maintainability.
      
      * Enhance bulk copy and store checks in `Copy` class
      
      - Updated scope validation for source and destination tensors in `CheckBulkLoad` and `CheckBulkStore` methods to include both `shared.dyn` and `shared` as valid options.
      - Modified `CheckLDSMCopy` and `CheckSTSMCopy` methods to accommodate the new scope validation, ensuring compatibility with shared memory configurations.
      - Improved logging in `LowerBulkCopy` to provide clearer warnings regarding unsupported swizzle layouts, including source and destination names for better debugging.
      
      * lint fix
      5c11d245
  28. 17 Aug, 2025 1 commit
    • Lei Wang's avatar
      [Language] Introduce `StridedTensor` to support non contigious torch inputs (#722) · 1b308baf
      Lei Wang authored
      
      
      * Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107
      
      * Support strided tensors
      
      * Refactor target attribute helper functions for improved clarity
      
      * No code changes made in proxy.py and setup.py
      
      * lint fix
      
      * lint fix via gemini
      
      * lint fix
      
      * test fix
      
      * test fix
      
      * lint fix
      
      * Update wrapper.py
      
      * test fix
      
      * Enhance test for InjectSoftwarePipeline by adding LowerOpaqueBlock transformation and updating expected function signature to use match_buffer for better clarity.
      
      * lint fix
      
      ---------
      Co-authored-by: default avatarChenggang Zhao <chenggangz@deepseek.com>
      1b308baf
  29. 15 Aug, 2025 1 commit