"docs/git@developer.sourcefind.cn:lacacy/qwen_lmdeploy.git" did not exist on "c3290cadcd8818deeaf9281f0494174308e04d79"
  1. 20 Nov, 2025 3 commits
  2. 19 Nov, 2025 5 commits
  3. 18 Nov, 2025 7 commits
    • [FFI] Use tvm ffi as the default execution backend (#1259) · 74da3696
      Lei Wang authored
      * [Refactor] Update FFI type handling and simplify argument management
      
      * Refactored FFI type definitions in runtime and code generation files to use `TVMFFIAny` instead of `TVMValue`, enhancing type clarity.
      * Updated function registration in `runtime.cc` to utilize canonical names for better consistency.
      * Simplified argument handling in the `simplify` transformation, ensuring unused buffer parameters are removed only when simplification is enabled.
      * Adjusted autotuner and profiler parameters to standardize the execution backend to `tvm_ffi`, improving clarity in backend selection.
      * Removed obsolete `adapt_torch2tvm` function from tensor utilities to streamline the codebase and reduce complexity.
      
      * [Update] Sync TVM submodule and enhance kernel source handling
      
      * Updated the TVM submodule to commit cdc2aced, ensuring compatibility with recent changes.
      * Added functionality to print kernel source in `example_blocksparse_gemm.py` for better debugging.
      * Commented out the main execution call in test files to prevent unintended execution during testing.
      * Introduced `tilelang.disable_cache()` in various test files to streamline testing and avoid cache-related issues.
      * Refactored kernel source retrieval methods to improve clarity and consistency across different execution backends.
      
      * [Refactor] Clean up imports and improve code formatting
      
      * Removed unused import of `tilelang.testing` in `test_example_blocksparse_gemm.py` to streamline the code.
      * Reformatted several lines in `arg_binder.cc`, `make_packed_api.cc`, `tvm_ffi.py`, and `adapter.py` for improved readability and consistency.
      * Updated comments and spacing in `tvm_ffi.py` to enhance clarity without altering functionality.
      
      * Update execution backend options and improve resolution logic
      
      - Changed default execution backend from "cython" to "auto" in multiple locations to allow automatic selection based on the target.
      - Expanded the list of supported execution backends to include "torch" and "nvrtc" across various classes and functions.
      - Enhanced backend resolution logic in `KernelCache` and `AutoTuner` to ensure appropriate backend selection based on the target.
      - Updated documentation to reflect changes in execution backend options and their defaults.
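A minimal sketch of how an "auto" backend could be resolved from the target. Only the backend names ("auto", "cython", "tvm_ffi", "torch", "nvrtc") come from the commit message. The function name `resolve_execution_backend`, the per-target table, and the rule that "auto" picks the first entry are assumptions made for illustration, not the logic in `KernelCache` or `AutoTuner`.

```python
# Hypothetical table: which backends each target kind accepts, in an
# assumed order of preference. Not taken from the TileLang sources.
SUPPORTED_BACKENDS = {
    "cuda": ["tvm_ffi", "cython", "nvrtc"],
    "hip": ["tvm_ffi", "cython"],
    "metal": ["torch"],
}


def resolve_execution_backend(requested: str, target_kind: str) -> str:
    """Return a concrete backend name for the given target kind."""
    allowed = SUPPORTED_BACKENDS.get(target_kind)
    if allowed is None:
        raise ValueError(f"unknown target kind: {target_kind!r}")
    if requested == "auto":
        # Assumed rule: "auto" takes the first (preferred) backend.
        return allowed[0]
    if requested not in allowed:
        raise ValueError(
            f"backend {requested!r} is not supported on {target_kind!r}; "
            f"choose one of {allowed}"
        )
    return requested
```

Resolving once, up front, means an unsupported combination fails with a clear message before any compilation starts.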
      
      * lint fix
      
      * fix
      
      * Enhance argument handling in CUDA and HIP runtime modules
      
      - Updated `ExtractFuncInfo` in `rt_mod_cuda.cc` and `rt_mod_hip.cc` to map boolean argument types to int32, ensuring compatibility with device runtime.
      - Refactored `BindDLTensor` in `arg_binder.cc` to improve null handling and validation checks for DLTensor parameters, utilizing expression-level guards to prevent dereferencing null pointers.
      - Enhanced error checking for buffer shape, strides, and data fields, ensuring robust handling of optional inputs and maintaining consistency across various checks.
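The boolean-to-int32 mapping can be pictured with a small sketch. Dtypes are modelled as plain strings here, and `device_arg_dtype` is a hypothetical helper, not the `ExtractFuncInfo` code in `rt_mod_cuda.cc` or `rt_mod_hip.cc`.

```python
def device_arg_dtype(dtype: str) -> str:
    """Map a kernel argument dtype to the type handed to the device launch."""
    # Booleans are widened to int32; every other dtype passes through.
    return "int32" if dtype == "bool" else dtype


def device_arg_dtypes(dtypes):
    """Apply the mapping to a whole argument list."""
    return [device_arg_dtype(d) for d in dtypes]
```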
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * minor fix
      
      * fix
      
      * recover check
      
      * Refactor argument binding and validation in `arg_binder.cc`
      
      - Improved null handling and validation checks in `BindDLTensor`, ensuring safe dereferencing of pointers.
      - Enhanced consistency checks for buffer shape, strides, and data fields, utilizing expression-level guards.
      - Updated `MakePackedAPI` to maintain code clarity and consistency in argument handling.
      - Minor adjustments in test files to streamline kernel execution and improve readability.
      
      * lint fix
      
      * stride fix
      
      * minor fix
      
      * fix
      
      * lint fix
      
      * lint fix
      
      * Add CUDA stream access policy window helpers and integrate with L2 persistent cache management
      
      - Introduced functions to set and reset the CUDA stream access policy window, allowing for better control over L2 cache usage.
      - Updated runtime files to include new FFI packed functions for managing stream attributes.
      - Modified lower_hopper_intrin to incorporate prologue and epilogue statements for L2 cache setup and teardown.
      - Enhanced tests to verify the inclusion of new FFI calls in the generated kernel source.
      
      * check with symbolic
      
      * support null ptr
      
      * Update CMakeLists and lower.py for code generation and subproject status
      
      - Added `codegen_c_host.cc` to the list of source files in CMakeLists.txt for improved code generation support.
      - Updated the function call in `lower.py` to use `target.build.tilelang_c` for C target host code generation, enhancing compatibility.
      - Marked the TVM subproject as dirty to indicate local modifications.
      
      * lint fix
      
      * Update comments for clarity in quickstart.py
    • [Language] Add shape check in `T.view/reshape` (#1277) · 921b96a3
      Chaofan Lin authored
      * [Language] Add shape check in T.view/reshape
      
      * address comments
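The commit message does not say which conditions the new check enforces. A plausible sketch, assuming a reshape must preserve the element count and a view (which may change the dtype) must preserve the total size in bits. Both helper names are hypothetical and work on static integer shapes only.

```python
from math import prod


def check_reshape(src_shape, dst_shape):
    """Raise ValueError unless both shapes hold the same number of elements."""
    src_n, dst_n = prod(src_shape), prod(dst_shape)
    if src_n != dst_n:
        raise ValueError(
            f"cannot reshape {tuple(src_shape)} ({src_n} elements) "
            f"into {tuple(dst_shape)} ({dst_n} elements)"
        )


def check_view(src_shape, src_bits, dst_shape, dst_bits):
    """Raise ValueError unless both views cover the same number of bits."""
    if prod(src_shape) * src_bits != prod(dst_shape) * dst_bits:
        raise ValueError("view changes the total size in bits")
```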
    • [Bugfix] Minor fix for some cases (#1278) · 1b0efb65
      Lei Wang authored
    • Bug fix for Gated Delta Net benchmark script (#1267) · 0f980f15
      Jay Zhuang authored
      
      
      * fix argument order for fla chunk_gated_delta_rule_fwd_h
      
      * explicit import assert_similar from utils
      
      * rename utils module to avoid name clash
      
      * set store_final_state and save_new_value to True
      
      * fix
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    • Fix various issues under `int64_t` static and dynamic shape. (#1218) · 49c85715
      Elevator14B authored
      
      
      * Fix various issues under int64_t static and dynamic shape.
      
      * Resolve reviewed issues.
      
      * Add unit test.
      
      * fix
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    • [BugFix] Adding extra parameters into autotune hashkey (#1274) · e805f8e5
      Chaofan Lin authored
      * [BugFix] Adding extra parameters into autotune hashkey
      
      * lint
      
      * None check
      
      * check serializable
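One way to read the steps above (extra parameters in the key, a None check, a serializability check) is sketched below. The function `make_autotune_key`, its arguments, and the use of JSON plus SHA-256 are assumptions for illustration. The actual key format used by the autotuner is not described in the commit message.

```python
import hashlib
import json


def make_autotune_key(func_name, tune_args, extra_params=None):
    """Build a stable hash key that also covers extra parameters."""
    # None check: treat a missing mapping as empty.
    extra = {} if extra_params is None else dict(extra_params)
    payload = {"func": func_name, "args": tune_args, "extra": extra}
    try:
        # Serializability check; sort_keys makes the encoding deterministic.
        encoded = json.dumps(payload, sort_keys=True)
    except TypeError as err:
        raise ValueError(f"autotune key is not serializable: {err}") from err
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()
```

Without the extra parameters in the key, two tuning runs that differ only in those parameters would collide on the same cache entry.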
    • Yichen Yan's avatar
  4. 17 Nov, 2025 5 commits
  5. 16 Nov, 2025 2 commits
  6. 15 Nov, 2025 3 commits
    • [AMD] Update CK for ROCm7 (#1262) · 729e66ca
      Jiaxing Ding authored
    • [fix] NVRTC execution backend (#1256) · eb415744
      Gabriel Wu authored
      * [fix] NVRTC execution backend
      
      * [fmt] run pre-commit
      
      * [fix] coderabbit reviews
      
      * [test] add cuda-python to test dep
      
      * [fix] coderabbit reviews
      
      * [fix] CUDA 13 compatibility
      
      * [fix] sm90
      
      * [fix] CUDA 13 compatibility
      
      * [fix] pre-commit
      
      * [fix] always use cuda::std::__atomic_ref_impl
      
      * [fix] restore to external API
      
      * Revert "[fix] restore to external API"
      
      This reverts commit 49bd875638fb631d270015f408991d38fd1e9a5d.
      
      * [fmt] use spaces instead of tabs for py codegen
      
      * [fix] im2col API
      
      * [fix] revert atomic.h
      
      * [fix] dynamic shape
      
      * [refactor] extract common utils
      
      * [feat] support L2 persistent map
      
      * [fix] l2 persistent map
      
      * [fix] pre-commit
      
      * [fix] restore _TYPE_MAP
      
      * [fix] pre-commit
      
      * [fix] avoid duplicate TMA descs
      
      * [docs] add docstring
      
      * [fix] coderabbit
      
      * [fix] coderabbit
      
      * [fix] coderabbit
      
      * [fix] coderabbit
    • [BugFix] Refactor attention kernel to handle OOB positions by filling with `-inf` instead of clearing accumulators. (#1222) · 0af3fd7c
      Tong WU authored
      
      * Refactor attention kernel to handle OOB positions by filling with `-inf` instead of clearing accumulators.
      
      * lint
      
      * pre-commit
      
      * Update imports in flash attention test file to use new backward and forward examples for better clarity and consistency.
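The numerical reason for filling with `-inf` can be seen in a plain-Python sketch. This is not the TileLang kernel, and the helper names are made up for illustration. A score of `-inf` contributes `exp(-inf) = 0` to the softmax, whereas a score cleared to zero still contributes `exp(0) = 1`.

```python
import math


def softmax(scores):
    """Numerically stable softmax; -inf entries get weight 0."""
    m = max(scores)
    if m == -math.inf:
        # Fully masked row: return zeros (a choice made for this sketch).
        return [0.0] * len(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mask_oob(scores, valid_len, fill):
    """Replace positions at or beyond valid_len with `fill`."""
    return [s if i < valid_len else fill for i, s in enumerate(scores)]


scores = [1.0, 2.0, 0.5, 0.7]  # the last two positions are out of bounds
cleared = softmax(mask_oob(scores, 2, 0.0))       # OOB slots keep some weight
filled = softmax(mask_oob(scores, 2, -math.inf))  # OOB slots get zero weight
```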
  7. 14 Nov, 2025 2 commits
  8. 13 Nov, 2025 6 commits
    • [Refactor] Update buffer handling in copy and atomic operations (#1247) · 2c0072a8
      Lei Wang authored
      * [Refactor] Update buffer handling in copy and atomic operations
      
      * Refactored the `copy` and `atomic_add` functions to use element-wise minimum for defining copy extents, ensuring correct handling of overlapping regions.
      * Updated utility functions to create `BufferLoad` instances with explicit extents, improving memory management and clarity.
      * Removed unused imports from `atomic.py` and `copy.py` to streamline the codebase.
      * Adjusted logging in `copy.cc` to provide clearer warnings for fallback scenarios in bulk copy operations.
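The element-wise minimum rule for copy extents, sketched for static integer extents of equal rank. The real code works on symbolic TIR expressions, and `copy_extents` is a hypothetical name.

```python
def copy_extents(src_extents, dst_extents):
    """Element-wise minimum of two equal-rank extent lists."""
    if len(src_extents) != len(dst_extents):
        raise ValueError("this sketch assumes equal rank")
    # Copy only the region both sides can hold, dimension by dimension.
    return [min(s, d) for s, d in zip(src_extents, dst_extents)]
```

For example, copying a 128x64 source region into a 96x64 destination would use extents `[96, 64]` under this rule.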
      
      * Remove obsolete .git_commit.txt file
      
      * Add unit test for dynamic copy extent handling in TileLang
      
      * Introduced a new test file `test_tilelang_issue_1237.py` to verify that the `T.copy` function correctly manages dynamic extents during primitive function building.
      * The test reproduces a specific issue related to dynamic slice lengths and static buffer sizes, ensuring robustness in the handling of such scenarios.
      * The test does not require execution of the kernel, as building the primitive function is sufficient to validate the fix.
      
      * lint fix
      
      * fix
      
      * Revert "fix"
      
      This reverts commit 828b4c1e4de76a7d11e4d4092927303fbbe00097.
      
      * Update TVM submodule and refactor atomic and copy functions
      
      * Updated the TVM submodule to a dirty state.
      * Refactored `atomic_add` and `copy` functions to pass extents explicitly to the `_to_region` helper, improving clarity and correctness in handling buffer regions.
      * Commented out the main execution call in the test example for `cast` and added a new function call to better demonstrate the example usage.
      
      * Enhance extent handling in atomic and copy functions
      
      * Introduced `legalize_pairwise_extents` utility to align and broadcast extent lists for `atomic_add` and `copy` functions, ensuring compatibility and correctness in buffer operations.
      * Updated both functions to utilize the new utility, improving clarity and robustness in handling dynamic and static extents.
      * Added comments to clarify the extent handling logic.
      
      * Enhance `legalize_pairwise_extents` function with early-exit rule
      
      * Added an early-exit condition to the `legalize_pairwise_extents` function to return original extents if the number of non-1 dimensions in both source and destination extents is equal, improving performance by avoiding unnecessary adjustments.
      * Updated the function's documentation to clarify the new behavior and maintain clarity in the extent handling logic.
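A rough sketch of the early-exit rule. The early exit (return the original extents when both lists have the same number of non-1 dimensions) follows the commit message. The fallback shown here (left-pad the shorter list with 1s, then broadcast 1s against the other side) is only an assumption about what "align and broadcast" means, and the sketch handles static integers rather than TIR expressions.

```python
def legalize_pairwise_extents(src, dst):
    """Sketch: align two extent lists, with the early-exit rule."""

    def non_unit(extents):
        return sum(1 for e in extents if e != 1)

    # Early exit from the commit message: equal count of non-1 dimensions.
    if non_unit(src) == non_unit(dst):
        return list(src), list(dst)

    # Assumed fallback: right-align by left-padding with 1s...
    rank = max(len(src), len(dst))
    src = [1] * (rank - len(src)) + list(src)
    dst = [1] * (rank - len(dst)) + list(dst)
    # ...then broadcast each unit extent to the other side's extent.
    out_src = [d if s == 1 else s for s, d in zip(src, dst)]
    out_dst = [s if d == 1 else d for s, d in zip(src, dst)]
    return out_src, out_dst
```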
      
      * lint fix
    • [Language][Reshape] Improve variable handling and ensure correctness during Layout Reshape (#1248) · d7164abf
      Lei Wang authored
      * fix
      
      * Refactor tensor reshaping in fp8_lighting_indexer.py
      
      - Replaced the allocation of `s_reshaped` with a reshape operation to improve clarity and performance.
      - Updated the logic in the computation of `s_reshaped` to utilize the reshaped tensor, enhancing the overall functionality of the attention mechanism.
      
      * Refactor analyzer usage in Layout and Fragment reshaping
      
      - Consolidated analyzer logic in the `Reshape` methods of `LayoutNode` and `FragmentNode` to utilize a fallback analyzer, improving code clarity and preventing potential null dereference issues.
      - Updated variable binding and simplification calls to use the selected analyzer consistently, enhancing robustness in shape validation and index computation.
    • [Minor] Remove git_commit.txt (#1249) · c1398550
      Chaofan Lin authored
    • [Bugfix] Fix fp8 dtype for some cases (#1246) · 63bf1609
      Lei Wang authored
      * [Enhancement] Add FP8 support and reproducibility in lighting indexer
      
      * Introduced a manual seed in `test_fp8_lighting_indexer` to ensure reproducible performance.
      * Added specializations for `cute::float_e4m3_t` and `cute::float_e5m2_t` in `gemm_mma.h` for enhanced FP8 support across multiple CUDA architectures, ensuring compatibility and improved functionality.
      
      * Fix typos in `fp8_lighting_indexer.py` and improve formatting in `gemm_mma.h`
      
      * Corrected a typo in the comment for `test_fp8_lighting_indexer` to enhance clarity.
      * Reformatted lines in `gemm_mma.h` for better readability by aligning template specializations across multiple CUDA architectures.
      
      * test fix
      
      * bug fix
    • [Refactor] Phaseout legacy loop vectorize dynamic pass (#1245) · f550a58d
      Lei Wang authored
      
      
      * Deleted the LoopVectorizeDynamic implementation from the transform module.
      * Removed associated references in the phase and initialization files to streamline the codebase.
      * This change simplifies the transformation pipeline by eliminating unused functionality.
      Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
    • Jiaxing Ding's avatar
  9. 12 Nov, 2025 7 commits
    • RMSNorm epsilon refine in the example (#1243) · 468b1b70
      pengxin99 authored
      * Fix division by zero in RMS normalization
      
      * Fix rsqrt calculation to avoid division by zero
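A plain-Python sketch of the guarded computation: adding a small epsilon inside the square root keeps the reciprocal finite even for an all-zero input. The default `eps` value and the function name are assumptions, not values taken from the example script.

```python
import math


def rms_norm(x, eps=1e-12):
    """Normalize x by its root mean square, with eps inside the square root."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)  # rsqrt(mean(x^2) + eps)
    return [v * inv_rms for v in x]
```

With `eps = 0`, an all-zero row would raise `ZeroDivisionError` here (and produce `inf` or `NaN` on a GPU).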
    • [Bugfix] Minor fix for tcgen05 (#1242) · 6882bd50
      Lei Wang authored
      
      
      * Add correctness evaluation script for GEMM v2
      
      - Introduced a new Python script `correctness_evaluation_tcgen05.py` for testing the correctness of GEMM v2 implementations using pytest.
      - Implemented matrix multiplication and compilation checks, along with parameterized tests for various input configurations.
      - Enhanced the testing framework to validate GEMM operations with different data types and configurations, ensuring robustness in the implementation.
      - Updated logging in `legalize_negative_index.cc` to reduce verbosity by changing from WARNING to DLOG.
      - Adjusted assertions in `tcgen05_macro_generator.py` to accommodate new warp size requirements for improved performance.
      - Removed unused variable in `gemm_tcgen05.py` to streamline the codebase.
      
      * lint fix
      
      ---------
      Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
    • [Enhancement] Support Layout/Fragment Reshape (#1241) · 4370309b
      Lei Wang authored
      
      
      * Update layout handling and introduce reshape functionality
      
      - Updated the `LayoutNode` class to include a new `Reshape` method, allowing for dynamic reshaping of layouts based on input shapes.
      - Enhanced the `OutputShape` method to provide better handling of cases where the analyzer cannot form an `IntervalSet`, implementing fallback mechanisms to ensure safe extents.
      - Refactored the `ReduceOpNode` to utilize `BufferRegion` for improved memory handling during reduction operations.
      - Added tests for reshaping functionality and layout transformations to ensure correctness and performance in various scenarios.
      
      * lint fix
      
      * Revert tvm submodule pointer to 1815c3e0b6ec4ead36370bbd1562025d8529017c; keep src unchanged
      
      * Update tvm submodule to commit f0bbd3bf741413c35c389ba5dedd5be206000ad1
      
      * Update tvm submodule to commit f0bbd3bf741413c35c389ba5dedd5be206000ad1
      
      * remove useless prove
      
      * remove comment
      
      ---------
      Co-authored-by: tilelang-bot <bot@tilelang>
    • [Language] Add type stubs for tir op (#1239) · 02cfc2a3
      Kuris authored
      * add typing stub for tir.ir
      
      * remove idents
      
      * minor update
    • [Bugfix] Minor fix in `builder.py` (#1235) · 30d8dedd
      LJC00118 authored
    • [Refactor] Add kernel selection option for GEMM v1 in environment settings (#1200) · 8fbe1b3a
      Lei Wang authored
      * Add kernel selection option for GEMM v1 in environment settings
      
      - Introduced `TILELANG_USE_GEMM_V1` environment variable to control the selection of GEMM version.
      - Added `use_gemm_v1` method in the `Environment` class to determine if GEMM v1 should be used based on the environment variable.
      - Updated GEMM function assignment to default to v2, allowing for v1 to be forced via the new environment variable.
      
      * bug fix
      
      * Add kernel selection option for GEMM in environment settings
      
      - Introduced `TILELANG_USE_GEMM_V1` environment variable to allow users to select between GEMM v1 and v2 implementations.
      - Updated `gemm` function to default to v2 but switch to v1 if the environment variable is set to a truthy value.
      - Added a method `use_gemm_v1` in the `Environment` class to facilitate this selection based on the environment variable.
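A sketch of the selection logic described above. The variable name `TILELANG_USE_GEMM_V1` and the method name `use_gemm_v1` come from the commit message. The set of accepted truthy spellings and the `select_gemm` helper are assumptions, and the real `Environment` class may parse the value differently.

```python
import os

# Assumed truthy spellings; the real parser may accept others.
_TRUTHY = {"1", "true", "yes", "on"}


def use_gemm_v1(environ=os.environ) -> bool:
    """Return True when TILELANG_USE_GEMM_V1 is set to a truthy value."""
    value = environ.get("TILELANG_USE_GEMM_V1", "")
    return value.strip().lower() in _TRUTHY


def select_gemm(environ=os.environ) -> str:
    """Default to v2; fall back to v1 only when explicitly requested."""
    return "gemm_v1" if use_gemm_v1(environ) else "gemm_v2"
```

Passing the environment mapping as a parameter keeps the sketch testable without touching the process environment.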
      
      * Refactor GEMM macro generator to use BufferRegion instead of Buffer
      
      - Updated `wgmma` and `wgmma_rs` methods in `TensorCoreIntrinEmitter` to accept `BufferRegion` parameters instead of `Buffer`.
      - Adjusted related calls in `GemmWGMMA` to ensure compatibility with the new parameter types.
      - Simplified buffer access logic for better clarity and maintainability.
      
      * Refactor GEMM functions to utilize BufferRegion for improved memory handling
      
      - Updated `run_gemm`, `run_gemm_rs`, `run_gemm_sr`, and `run_gemm_rr` functions to set `num_stages` based on block dimensions, enhancing performance for larger matrices.
      - Simplified calls to GEMM functions by removing redundant parameters and ensuring compatibility with BufferRegion.
      - Introduced utility functions for converting between Buffer, BufferLoad, and BufferRegion, improving code clarity and maintainability.
      - Enhanced error handling for full region checks in GEMM operations to ensure correctness in memory access.
      
      * Refactor GEMM code for improved readability and consistency
      
      - Cleaned up formatting and spacing in GEMM-related files for better readability.
      - Standardized comments and code structure across various GEMM functions and macros.
      - Enhanced error messages for clarity in buffer region checks.
      - Removed redundant lines and improved overall code maintainability.
      
      * Update GEMM correctness evaluation and macro generator for improved functionality
      
      - Modified `N_VALUES` in `correctness_evaluation_sm70.py` to include only relevant sizes for tests.
      - Updated test function call in `correctness_evaluation.py` to use `test_gemm_false_true` for better accuracy in testing.
      - Refactored buffer handling in `mma_sm70_macro_generator.py` to improve clarity and consistency in shared buffer access.
      - Enhanced `gemm_mma_sm70.py` to ensure full region checks for input and output buffers, improving correctness in GEMM operations.
      
      * Refactor GEMM and intrinsic files for improved clarity and functionality
      
      - Removed unused variable `A_stride_last` in `mma_sm70_macro_generator.py` to streamline code.
      - Adjusted function signature formatting in `swizzle.py` for better readability.
      - Restored the return of `GemmWGMMA` in `__init__.py` for correct GEMM instantiation.
      - Removed unused variable `B_buf` in `gemm_mma_sm70.py` to enhance code cleanliness.
      - Improved function signature formatting in `language.py` for consistency.
      
      * Enhance GEMM and MMA functionality for FP64 support
      
      - Refactored `GemmNode` to streamline the decision-making process for GEMM instruction selection.
      - Added support for FP64 inputs in the MMA dispatcher, enabling new tensor operations.
      - Introduced a new layout function for FP64 in `mma_layout.py` to facilitate shared memory storage.
      - Updated `TensorCoreIntrinEmitter` to handle FP64 data types, including adjustments for micro tile dimensions and loading mechanisms.
      - Enhanced utility functions to accommodate FP64 index mapping for shared memory operations.
      
      * lint fix
      
      * Refactor GEMM correctness evaluation and shared memory alignment handling
      
      - Reverted the GEMM function call in `correctness_evaluation.py` to the original implementation for consistency.
      - Added a helper function in `merge_shared_memory_allocations.cc` to streamline the marking of shared variables under alignment scope.
      - Enhanced the `VisitExpr_` methods to ensure proper handling of shared memory alignment for `BufferLoadNode` and `VarNode` types.
      - Cleaned up commented-out test code in `correctness_evaluation.py` for better readability.
      
      * Enhance GEMM and MMA implementations with region-based memory handling
      
      - Updated GEMM and MMA classes to utilize BufferRegion for input and output buffers, improving memory management and supporting strided GEMM operations.
      - Added checks to ensure full region compliance for input buffers, enhancing correctness in matrix multiplication.
      - Implemented clear accumulation functionality to reset output buffers before accumulation, ensuring accurate results in GEMM operations.
      
      * Refactor test_tilelang_example_deepseek_v32.py to improve import structure and function calls
      
      - Updated import statements to directly reference modules instead of individual test functions, enhancing clarity.
      - Modified function calls to use the new module structure for better organization and maintainability in testing examples.
      
      * Enhance OnArrayDeclaration method to handle repeated buffer declarations
      
      - Updated the OnArrayDeclaration method to merge metadata for buffers that may appear in multiple Allocate statements, improving robustness against upstream transformations.
      - Added logic to prefer concrete element data types and record extents when previously unknown, enhancing the handling of buffer declarations.
      
      * Add abbreviation for bfloat16 data type in mfma_macro_generator.py
      
      - Introduced a new abbreviation "bf16" for the bfloat16 data type in the mfma_macro_generator.py file, enhancing clarity and consistency in data type representation.
      
      * Refactor CodeGenTileLangHIP to enhance dtype handling and mfma call generation
      
      - Introduced a mapping function to normalize input data types to their corresponding scalar types, improving compatibility with MfmaTraits.
      - Updated the mfma call generation to utilize the new mapping, streamlining the code and enhancing clarity.
      - Removed outdated dtype mapping and replaced it with a more flexible approach to support additional data types like FP8.
      
      * lint fix
      
      * Enhance backend configuration in CMakeLists.txt and improve dtype handling in CodeGenTileLangHIP
      
      - Introduced a macro to define backend options for CUDA, ROCM, and Metal, allowing user overrides and caching of settings.
      - Updated logic to track user-selected backends and conditionally enable defaults based on environment variables.
      - Refactored dtype handling in CodeGenTileLangHIP to streamline mfma call generation and improve clarity.
      - Added support for bfloat16 in the mfma_macro_generator.py, enhancing data type representation consistency.
      
      * Update bfloat16 handling in CodeGenTileLangHIP and mfma_macro_generator.py
      
      - Changed the representation of bfloat16 in CodeGenTileLangHIP from "bfloat16x4" to "bfloat16x4_vec" for improved clarity.
      - Adjusted the mfma_suffix generation in mfma_macro_generator.py to remove the underscore before "bf16", aligning with HIP intrinsic requirements.
      
      * Change logging level from WARNING to DLOG in LegalizeNegativeIndex for non-negative index checks to reduce log verbosity.
      
      * Refactor attention sink examples to simplify index calculations
      
      - Updated index handling in `example_gqa_sink_bwd_bhsd.py` and `example_mha_sink_bwd_bhsd.py` to eliminate unnecessary local allocations and streamline logic for determining start and end indices.
      - Improved readability by using direct calculations instead of local variables for index bounds in pipelined loops.
      
      * Refactor attention sink examples to streamline index calculations
      
      - Simplified index handling in `example_gqa_sink_bwd_bhsd.py`, `example_gqa_sink_fwd_bhsd_wgmma_pipelined.py`, `example_mha_sink_bwd_bhsd.py`, `example_mha_sink_fwd_bhsd_wgmma_pipelined.py`, and `example_mha_sink_fwd_bhsd.py` by removing unnecessary local allocations for start and end indices.
      - Enhanced readability by directly calculating index bounds for pipelined loops, improving overall code clarity.
      
      * lint fix
      
      * bugfix
      
      * Refactor reduce operation handling in CUDA and Python
      
      - Removed outdated shared memory reduction logic from `reduce.cc`.
      - Introduced fragment allocation and improved buffer handling in `reduce.py` to support shared and fragment scopes.
      - Updated CUDA header to define a wider accumulator type for better numerical accuracy.
      - Enhanced error handling for buffer scope validation in the reduction process.
      
      * Fix ReduceOpNode to correctly compute AbsMax by using absolute values of inputs
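A short sketch of why the absolute value matters. An AbsMax reduction must compare magnitudes, so a plain max over the raw inputs gives the wrong answer whenever the largest magnitude is negative. The function name is illustrative and is not the `ReduceOpNode` code.

```python
def absmax(values):
    """Running max over |v|, starting from 0 (the smallest possible magnitude)."""
    acc = 0.0
    for v in values:
        acc = max(acc, abs(v))
    return acc
```

For the input `[-5.0, 2.0, 3.0]`, a plain `max` returns `3.0`, while the AbsMax is `5.0`.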
      
      * Enhance unit loop handling by refining annotation checks
      
      - Updated the condition for identifying effectively empty annotations in unit loops to include cases where only the `pragma_unroll_explicit` hint is present.
      - Introduced a new method, `IsEffectivelyEmptyAnnotation`, to encapsulate this logic, improving code clarity and maintainability.
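The annotation check can be sketched in Python by modelling loop annotations as a dict. The real `IsEffectivelyEmptyAnnotation` is a C++ method, so this only mirrors the rule stated above: an annotation map counts as empty when it has no entries or when its only entry is the `pragma_unroll_explicit` hint.

```python
def is_effectively_empty_annotation(annotations: dict) -> bool:
    """True if there are no annotations, or only the unroll hint."""
    # Subset test: every key present must be the ignorable hint.
    return set(annotations) <= {"pragma_unroll_explicit"}
```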
      
      * clean code
    • Kuris's avatar
      2b1f5990