"vscode:/vscode.git/clone" did not exist on "8b9bcf163f7704821ef430e94f0b8367d8434af2"
  1. 26 Nov, 2025 1 commit
    • [Refactor] Phaseout vmap for Tile Operators (#1334) · f5d9da46
      Lei Wang authored
      
      
      * Refactor GEMM and Reduce operations by moving NormalizeToBufferRegion and MakeAccessPtrFromRegion to utils.{h,cc} for better code organization and reuse.
      
      * lint fix
      
      * Refactor region handling by removing the RegionOp and updating NormalizeToBufferRegion to only accept BufferLoad and BufferRegion. This change improves code organization and simplifies the handling of memory regions across various operations.
      
      * fix
      
      * Refactor memory region handling by introducing `tl.region` calls across various operations, including GEMM and fill functions. This change enhances the consistency of region management and improves code organization by utilizing utility functions for buffer region conversions.
      
      * fix
      
      * fix
      
      * test fix
      
      * lint fix
      
      * Refactor GEMM operations to improve memory region handling by replacing `mbarPtr_` with `mbarRegion_` and updating related logic in both C++ and Python implementations. This change enhances the clarity and consistency of buffer region management.
      
      * fix
      
      * lint fix
      
      * fix
      
      * fix
      
      * test fix
      
      * lint fix
      
      * lint fix
      
      * minor fix
      
      * fix
      
      ---------
      Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
      f5d9da46
  2. 24 Nov, 2025 1 commit
  3. 21 Nov, 2025 1 commit
  4. 20 Nov, 2025 1 commit
  5. 19 Nov, 2025 1 commit
  6. 18 Nov, 2025 2 commits
    • [FFI] Use tvm ffi as the default execution backend (#1259) · 74da3696
      Lei Wang authored
      * [Refactor] Update FFI type handling and simplify argument management
      
      * Refactored FFI type definitions in runtime and code generation files to use `TVMFFIAny` instead of `TVMValue`, enhancing type clarity.
      * Updated function registration in `runtime.cc` to utilize canonical names for better consistency.
      * Simplified argument handling in the `simplify` transformation, ensuring unused buffer parameters are removed only when simplification is enabled.
      * Adjusted autotuner and profiler parameters to standardize the execution backend to `tvm_ffi`, improving clarity in backend selection.
      * Removed obsolete `adapt_torch2tvm` function from tensor utilities to streamline the codebase and reduce complexity.
      
      * [Update] Sync TVM submodule and enhance kernel source handling
      
      * Updated the TVM submodule to commit cdc2aced, ensuring compatibility with recent changes.
      * Added functionality to print kernel source in `example_blocksparse_gemm.py` for better debugging.
      * Commented out the main execution call in test files to prevent unintended execution during testing.
      * Introduced `tilelang.disable_cache()` in various test files to streamline testing and avoid cache-related issues.
      * Refactored kernel source retrieval methods to improve clarity and consistency across different execution backends.
      
      * [Refactor] Clean up imports and improve code formatting
      
      * Removed unused import of `tilelang.testing` in `test_example_blocksparse_gemm.py` to streamline the code.
      * Reformatted several lines in `arg_binder.cc`, `make_packed_api.cc`, `tvm_ffi.py`, and `adapter.py` for improved readability and consistency.
      * Updated comments and spacing in `tvm_ffi.py` to enhance clarity without altering functionality.
      
      * Update execution backend options and improve resolution logic
      
      - Changed default execution backend from "cython" to "auto" in multiple locations to allow automatic selection based on the target.
      - Expanded the list of supported execution backends to include "torch" and "nvrtc" across various classes and functions.
      - Enhanced backend resolution logic in `KernelCache` and `AutoTuner` to ensure appropriate backend selection based on the target.
      - Updated documentation to reflect changes in execution backend options and their defaults.
      
      * lint fix
      
      * fix
      
      * Enhance argument handling in CUDA and HIP runtime modules
      
      - Updated `ExtractFuncInfo` in `rt_mod_cuda.cc` and `rt_mod_hip.cc` to map boolean argument types to int32, ensuring compatibility with device runtime.
      - Refactored `BindDLTensor` in `arg_binder.cc` to improve null handling and validation checks for DLTensor parameters, utilizing expression-level guards to prevent dereferencing null pointers.
      - Enhanced error checking for buffer shape, strides, and data fields, ensuring robust handling of optional inputs and maintaining consistency across various checks.
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * minor fix
      
      * fix
      
      * recover check
      
      * Refactor argument binding and validation in `arg_binder.cc`
      
      - Improved null handling and validation checks in `BindDLTensor`, ensuring safe dereferencing of pointers.
      - Enhanced consistency checks for buffer shape, strides, and data fields, utilizing expression-level guards.
      - Updated `MakePackedAPI` to maintain code clarity and consistency in argument handling.
      - Minor adjustments in test files to streamline kernel execution and improve readability.
      
      * lint fix
      
      * stride fix
      
      * minor fix
      
      * fix
      
      * lint fix
      
      * lint fix
      
      * Add CUDA stream access policy window helpers and integrate with L2 persistent cache management
      
      - Introduced functions to set and reset the CUDA stream access policy window, allowing for better control over L2 cache usage.
      - Updated runtime files to include new FFI packed functions for managing stream attributes.
      - Modified lower_hopper_intrin to incorporate prologue and epilogue statements for L2 cache setup and teardown.
      - Enhanced tests to verify the inclusion of new FFI calls in the generated kernel source.
      
      * check with symbolic
      
      * support null ptr
      
      * Update CMakeLists and lower.py for code generation and subproject status
      
      - Added `codegen_c_host.cc` to the list of source files in CMakeLists.txt for improved code generation support.
      - Updated the function call in `lower.py` to use `target.build.tilelang_c` for C target host code generation, enhancing compatibility.
      - Marked the TVM subproject as dirty to indicate local modifications.
      
      * lint fix
      
      * Update comments for clarity in quickstart.py
      74da3696
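The "auto" execution-backend option introduced above resolves to a concrete backend based on the target when the user does not pick one explicitly. A minimal sketch of that resolution pattern; only the backend names come from the commit, while the function name and the fallback policy are assumptions for illustration:

```python
# Backend names taken from the commit message; the resolution policy
# below is a hypothetical sketch, not the project's actual logic.
SUPPORTED_BACKENDS = ("tvm_ffi", "cython", "torch", "nvrtc")

def resolve_execution_backend(requested: str, target: str) -> str:
    """Map the user-facing backend option to a concrete backend.

    'auto' selects a backend based on the target; explicit choices
    are validated against the supported list.
    """
    if requested == "auto":
        # Hypothetical policy: default everything to the tvm_ffi backend.
        # A real implementation would inspect `target` (e.g. CUDA vs. host).
        return "tvm_ffi"
    if requested not in SUPPORTED_BACKENDS:
        raise ValueError(f"unknown execution backend: {requested!r}")
    return requested
```

This keeps "auto" as the user-visible default while ensuring caches and autotuners always see a concrete backend name.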
    • Bug fix for Gated Delta Net benchmark script (#1267) · 0f980f15
      Jay Zhuang authored
      
      
      * fix argument order for fla chunk_gated_delta_rule_fwd_h
      
      * explicit import assert_similar from utils
      
      * rename utils module to avoid name clash
      
      * set store_final_state and save_new_value to True
      
      * fix
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
      0f980f15
  7. 17 Nov, 2025 2 commits
  8. 16 Nov, 2025 2 commits
  9. 15 Nov, 2025 1 commit
    • [BugFix] Refactor attention kernel to handle OOB positions by filling with `-inf` instead of clearing accumulators. (#1222) · 0af3fd7c
      Tong WU authored
      
      * Refactor attention kernel to handle OOB positions by filling with `-inf` instead of clearing accumulators.
      
      * lint
      
      * pre-commit
      
      * Update imports in flash attention test file to use new backward and forward examples for better clarity and consistency.
      0af3fd7c
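Filling out-of-bounds score positions with `-inf` (rather than clearing accumulators) lets softmax assign those positions exactly zero weight. A minimal NumPy illustration of the masking idea, not the kernel itself:

```python
import numpy as np

def masked_softmax(scores: np.ndarray, valid_len: int) -> np.ndarray:
    """Softmax over the last axis with positions >= valid_len set to -inf,
    so out-of-bounds entries contribute zero probability."""
    masked = scores.copy()
    masked[..., valid_len:] = -np.inf
    # numerically stable softmax: subtract the row max before exponentiating
    masked -= masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)
    return e / e.sum(axis=-1, keepdims=True)

s = np.array([[1.0, 2.0, 3.0, 4.0]])
p = masked_softmax(s, valid_len=2)  # last two positions get probability 0
```

Zeroing accumulators instead would leave OOB lanes contributing a weight of `exp(0 - max)`, which is nonzero; the `-inf` fill removes them from the distribution entirely.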
  10. 14 Nov, 2025 1 commit
  11. 13 Nov, 2025 2 commits
    • [Language][Reshape] Improve variable handling and ensure correctness during Layout Reshape (#1248) · d7164abf
      Lei Wang authored
      * fix
      
      * Refactor tensor reshaping in fp8_lighting_indexer.py
      
      - Replaced the allocation of `s_reshaped` with a reshape operation to improve clarity and performance.
      - Updated the logic in the computation of `s_reshaped` to utilize the reshaped tensor, enhancing the overall functionality of the attention mechanism.
      
      * Refactor analyzer usage in Layout and Fragment reshaping
      
      - Consolidated analyzer logic in the `Reshape` methods of `LayoutNode` and `FragmentNode` to utilize a fallback analyzer, improving code clarity and preventing potential null dereference issues.
      - Updated variable binding and simplification calls to use the selected analyzer consistently, enhancing robustness in shape validation and index computation.
      d7164abf
    • [Bugfix] Fix fp8 dtype for some cases (#1246) · 63bf1609
      Lei Wang authored
      * [Enhancement] Add FP8 support and reproducibility in lighting indexer
      
      * Introduced a manual seed in `test_fp8_lighting_indexer` to ensure reproducible results.
      * Added specializations for `cute::float_e4m3_t` and `cute::float_e5m2_t` in `gemm_mma.h` for enhanced FP8 support across multiple CUDA architectures, ensuring compatibility and improved functionality.
      
      * Fix typos in `fp8_lighting_indexer.py` and improve formatting in `gemm_mma.h`
      
      * Corrected a typo in the comment for `test_fp8_lighting_indexer` to enhance clarity.
      * Reformatted lines in `gemm_mma.h` for better readability by aligning template specializations across multiple CUDA architectures.
      
      * test fix
      
      * bug fix
      63bf1609
  12. 12 Nov, 2025 2 commits
    • RMSNorm epsilon refine in the example (#1243) · 468b1b70
      pengxin99 authored
      * Fix division by zero in RMS normalization
      
      * Fix rsqrt calculation to avoid division by zero
      468b1b70
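The fix above moves epsilon inside the root-mean-square computation so the rsqrt argument can never be zero. A minimal NumPy sketch of the pattern, assuming a generic RMSNorm (the names here are illustrative, not the example's actual code):

```python
import numpy as np

def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMS normalization with eps added *under* the rsqrt, so an all-zero
    row of x cannot trigger a division by zero."""
    # mean of squares along the last axis; eps keeps the argument positive
    ms = np.mean(x * x, axis=-1, keepdims=True)
    return x * (1.0 / np.sqrt(ms + eps)) * weight

# An all-zero row would divide by zero if eps were applied after the sqrt
x = np.zeros((2, 4), dtype=np.float32)
w = np.ones(4, dtype=np.float32)
out = rms_norm(x, w)
```

Placing `eps` outside the square root (e.g. `1.0 / (np.sqrt(ms) + eps)`) also avoids the division by zero but changes the normalization scale; adding it under the root is the conventional formulation.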
    • [Refactor] Add kernel selection option for GEMM v1 in environment settings (#1200) · 8fbe1b3a
      Lei Wang authored
      * Add kernel selection option for GEMM v1 in environment settings
      
      - Introduced `TILELANG_USE_GEMM_V1` environment variable to control the selection of GEMM version.
      - Added `use_gemm_v1` method in the `Environment` class to determine if GEMM v1 should be used based on the environment variable.
      - Updated GEMM function assignment to default to v2, allowing for v1 to be forced via the new environment variable.
      
      * bug fix
      
      * Add kernel selection option for GEMM in environment settings
      
      - Introduced `TILELANG_USE_GEMM_V1` environment variable to allow users to select between GEMM v1 and v2 implementations.
      - Updated `gemm` function to default to v2 but switch to v1 if the environment variable is set to a truthy value.
      - Added a method `use_gemm_v1` in the `Environment` class to facilitate this selection based on the environment variable.
      
      * Refactor GEMM macro generator to use BufferRegion instead of Buffer
      
      - Updated `wgmma` and `wgmma_rs` methods in `TensorCoreIntrinEmitter` to accept `BufferRegion` parameters instead of `Buffer`.
      - Adjusted related calls in `GemmWGMMA` to ensure compatibility with the new parameter types.
      - Simplified buffer access logic for better clarity and maintainability.
      
      * Refactor GEMM functions to utilize BufferRegion for improved memory handling
      
      - Updated `run_gemm`, `run_gemm_rs`, `run_gemm_sr`, and `run_gemm_rr` functions to set `num_stages` based on block dimensions, enhancing performance for larger matrices.
      - Simplified calls to GEMM functions by removing redundant parameters and ensuring compatibility with BufferRegion.
      - Introduced utility functions for converting between Buffer, BufferLoad, and BufferRegion, improving code clarity and maintainability.
      - Enhanced error handling for full region checks in GEMM operations to ensure correctness in memory access.
      
      * Refactor GEMM code for improved readability and consistency
      
      - Cleaned up formatting and spacing in GEMM-related files for better readability.
      - Standardized comments and code structure across various GEMM functions and macros.
      - Enhanced error messages for clarity in buffer region checks.
      - Removed redundant lines and improved overall code maintainability.
      
      * Update GEMM correctness evaluation and macro generator for improved functionality
      
      - Modified `N_VALUES` in `correctness_evaluation_sm70.py` to include only relevant sizes for tests.
      - Updated test function call in `correctness_evaluation.py` to use `test_gemm_false_true` for better accuracy in testing.
      - Refactored buffer handling in `mma_sm70_macro_generator.py` to improve clarity and consistency in shared buffer access.
      - Enhanced `gemm_mma_sm70.py` to ensure full region checks for input and output buffers, improving correctness in GEMM operations.
      
      * Refactor GEMM and intrinsic files for improved clarity and functionality
      
      - Removed unused variable `A_stride_last` in `mma_sm70_macro_generator.py` to streamline code.
      - Adjusted function signature formatting in `swizzle.py` for better readability.
      - Restored the return of `GemmWGMMA` in `__init__.py` for correct GEMM instantiation.
      - Removed unused variable `B_buf` in `gemm_mma_sm70.py` to enhance code cleanliness.
      - Improved function signature formatting in `language.py` for consistency.
      
      * Enhance GEMM and MMA functionality for FP64 support
      
      - Refactored `GemmNode` to streamline the decision-making process for GEMM instruction selection.
      - Added support for FP64 inputs in the MMA dispatcher, enabling new tensor operations.
      - Introduced a new layout function for FP64 in `mma_layout.py` to facilitate shared memory storage.
      - Updated `TensorCoreIntrinEmitter` to handle FP64 data types, including adjustments for micro tile dimensions and loading mechanisms.
      - Enhanced utility functions to accommodate FP64 index mapping for shared memory operations.
      
      * lint fix
      
      * Refactor GEMM correctness evaluation and shared memory alignment handling
      
      - Reverted the GEMM function call in `correctness_evaluation.py` to the original implementation for consistency.
      - Added a helper function in `merge_shared_memory_allocations.cc` to streamline the marking of shared variables under alignment scope.
      - Enhanced the `VisitExpr_` methods to ensure proper handling of shared memory alignment for `BufferLoadNode` and `VarNode` types.
      - Cleaned up commented-out test code in `correctness_evaluation.py` for better readability.
      
      * Enhance GEMM and MMA implementations with region-based memory handling
      
      - Updated GEMM and MMA classes to utilize BufferRegion for input and output buffers, improving memory management and supporting strided GEMM operations.
      - Added checks to ensure full region compliance for input buffers, enhancing correctness in matrix multiplication.
      - Implemented clear accumulation functionality to reset output buffers before accumulation, ensuring accurate results in GEMM operations.
      
      * Refactor test_tilelang_example_deepseek_v32.py to improve import structure and function calls
      
      - Updated import statements to directly reference modules instead of individual test functions, enhancing clarity.
      - Modified function calls to use the new module structure for better organization and maintainability in testing examples.
      
      * Enhance OnArrayDeclaration method to handle repeated buffer declarations
      
      - Updated the OnArrayDeclaration method to merge metadata for buffers that may appear in multiple Allocate statements, improving robustness against upstream transformations.
      - Added logic to prefer concrete element data types and record extents when previously unknown, enhancing the handling of buffer declarations.
      
      * Add abbreviation for bfloat16 data type in mfma_macro_generator.py
      
      - Introduced a new abbreviation "bf16" for the bfloat16 data type in the mfma_macro_generator.py file, enhancing clarity and consistency in data type representation.
      
      * Refactor CodeGenTileLangHIP to enhance dtype handling and mfma call generation
      
      - Introduced a mapping function to normalize input data types to their corresponding scalar types, improving compatibility with MfmaTraits.
      - Updated the mfma call generation to utilize the new mapping, streamlining the code and enhancing clarity.
      - Removed outdated dtype mapping and replaced it with a more flexible approach to support additional data types like FP8.
      
      * lint fix
      
      * Enhance backend configuration in CMakeLists.txt and improve dtype handling in CodeGenTileLangHIP
      
      - Introduced a macro to define backend options for CUDA, ROCM, and Metal, allowing user overrides and caching of settings.
      - Updated logic to track user-selected backends and conditionally enable defaults based on environment variables.
      - Refactored dtype handling in CodeGenTileLangHIP to streamline mfma call generation and improve clarity.
      - Added support for bfloat16 in the mfma_macro_generator.py, enhancing data type representation consistency.
      
      * Update bfloat16 handling in CodeGenTileLangHIP and mfma_macro_generator.py
      
      - Changed the representation of bfloat16 in CodeGenTileLangHIP from "bfloat16x4" to "bfloat16x4_vec" for improved clarity.
      - Adjusted the mfma_suffix generation in mfma_macro_generator.py to remove the underscore before "bf16", aligning with HIP intrinsic requirements.
      
      * Change logging level from WARNING to DLOG in LegalizeNegativeIndex for non-negative index checks to reduce log verbosity.
      
      * Refactor attention sink examples to simplify index calculations
      
      - Updated index handling in `example_gqa_sink_bwd_bhsd.py` and `example_mha_sink_bwd_bhsd.py` to eliminate unnecessary local allocations and streamline logic for determining start and end indices.
      - Improved readability by using direct calculations instead of local variables for index bounds in pipelined loops.
      
      * Refactor attention sink examples to streamline index calculations
      
      - Simplified index handling in `example_gqa_sink_bwd_bhsd.py`, `example_gqa_sink_fwd_bhsd_wgmma_pipelined.py`, `example_mha_sink_bwd_bhsd.py`, `example_mha_sink_fwd_bhsd_wgmma_pipelined.py`, and `example_mha_sink_fwd_bhsd.py` by removing unnecessary local allocations for start and end indices.
      - Enhanced readability by directly calculating index bounds for pipelined loops, improving overall code clarity.
      
      * lint fix
      
      * bugfix
      
      * Refactor reduce operation handling in CUDA and Python
      
      - Removed outdated shared memory reduction logic from `reduce.cc`.
      - Introduced fragment allocation and improved buffer handling in `reduce.py` to support shared and fragment scopes.
      - Updated CUDA header to define a wider accumulator type for better numerical accuracy.
      - Enhanced error handling for buffer scope validation in the reduction process.
      
      * Fix ReduceOpNode to correctly compute AbsMax by using absolute values of inputs
      
      * Enhance unit loop handling by refining annotation checks
      
      - Updated the condition for identifying effectively empty annotations in unit loops to include cases where only the `pragma_unroll_explicit` hint is present.
      - Introduced a new method, `IsEffectivelyEmptyAnnotation`, to encapsulate this logic, improving code clarity and maintainability.
      
      * clean code
      8fbe1b3a
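The environment-variable gate described in this commit can be sketched as follows. `TILELANG_USE_GEMM_V1` and the v1/v2 default come from the commit message; the helper names and the truthy-value parsing are assumptions about the implementation:

```python
import os

_TRUTHY = {"1", "true", "yes", "on"}

def use_gemm_v1() -> bool:
    """Return True when TILELANG_USE_GEMM_V1 is set to a truthy value.
    The default is False, i.e. the v2 implementation is selected."""
    return os.environ.get("TILELANG_USE_GEMM_V1", "").strip().lower() in _TRUTHY

def select_gemm():
    """Pick the GEMM entry point based on the environment variable."""
    # Hypothetical stand-ins for the real v1/v2 entry points:
    def gemm_v1():
        return "gemm_v1"

    def gemm_v2():
        return "gemm_v2"

    return gemm_v1 if use_gemm_v1() else gemm_v2
```

Resolving the variable at selection time (rather than at import time) lets users flip implementations between runs without reinstalling.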
  13. 11 Nov, 2025 1 commit
  14. 05 Nov, 2025 2 commits
    • [Example] Update GQA varlen fwd (#1173) · a9d823b8
      Yu Cheng authored
      * [Example] Update GQA varlen fwd
      
      * fix
      a9d823b8
    • [GQA] Use TMA in GQA bwd kernel to boost performance (#1176) · 298ab480
      Zhengju Tang authored
      
      
      * [Test] Add cp async to avoid register spill
      
      * [BugFix] GQA fwd and bwd
      - Fix the undefined behavior of -inf in acc_s
      - Fix the causal loop range in varlen scenario
      
      * [TMA] Move on to TMA and locate the register spill issue
      
      * [Debug] Zero-assignment is not the cause; probably the combination of Parallel op & conditional qkT
      
      * [Debug] The SIMT copy in producer occupies too many registers
      
      * [BugFix] Use 3D lse and delta to avoid illegal instruction
      
      * [Perf] Relaxed order for dQ and SIMT store for dKdV
      
      * [Feat] For atomic add version
      
      * [Lint]
      
      * [Bugfix] Enable code lowering with producer-copy-only program (#1168)
      
      * bugfix
      
      * lint fix
      
      * Enhance warp group register allocation to handle missing consumer bodies gracefully. Updated logic to annotate producer side when consumer is absent, ensuring robustness in degenerate warp-specialized patterns.
      
      * Refactor VisitExpr_ method in inject_tma_barrier.cc for improved readability. Adjusted formatting and spacing for clarity in barrier handling logic.
      
      * Update barrier handling in inject_tma_barrier.cc to accommodate newly appended entries. Adjusted the size of the replace vector to ensure it covers the full needed length, and modified the logic for appending barriers based on the updated replace conditions.
      
      * [Bugfix] Support 16bits shfl_sync (#1169)
      
      * Add type-safe warp shuffle helpers for 16-bit float types in common.h
      
      - Introduced generic passthrough functions for warp shuffle operations: `shfl_xor_sync`, `shfl_down_sync`, `shfl_up_sync`, and `shfl_sync`.
      - Added specializations for `cutlass::half_t` and `cutlass::bfloat16_t` to ensure type safety during shuffle operations.
      - Updated `reduce.h` to utilize the new shuffle functions, enhancing code clarity and maintainability.
      
      * lint fix
      
      * [Testing] Move TMA 1D and test for its functionality (#1167)
      
      * [Testing] Move TMA 1D and test for its functionality
      
      * [Lint]
      
      * [Refactor]: Change the params in pytest to avoid oom error during ci (#1170)
      
      * [Refactor]: Change the params in pytest to avoid oom error during ci
      
      * format
      
      * fix
      
      * Update test_example_cast.py
      
      * Update parameters in test_example_cast
      
      * Update test_example_flash_attention.py
      
      * update
      
      * format
      
      * fix
      
      * fix
      
      * format
      
      * [Bugfix] Fix tvm import path for editable build (#1172)
      
      * [Language] Expose `T.warpgroup_fence_operand` for nvcc code motion (#986)
      
      * remove debug print
      
      * pipeline fix
      
      * use the correct buffer access scope
      
      * rs support
      
      * warp warpgroup_fence_operand
      
      * fix
      
      * fp8 dtype ptx enhance
      
      * mma fix
      
      * TCGEN05 Interface
      
      * tcgen05 support
      
      * rebase
      
      * update
      
      * Enhance TCGEN05 support by adding new intrinsic operations and descriptors. Introduced `ptx_tcgen05_mma_ts` for tensor-memory to shared-memory instructions and `tcgen05_mma_arrive` for signaling barrier completion. Updated existing descriptors and code generation logic to accommodate these changes, ensuring compatibility with new instruction sets. Refactored related allocation functions and improved handling of shared memory descriptors.
      
      * lint fix
      
      * Refactor buffer reference handling in CUDA code generation and update test execution in tilelang. Ensure default annotations for unrolling are set correctly in TIR IR module.
      
      * wgmma fix
      
      ---------
      Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
      
      * [Language] Add Correctness and performance check scripts for V2 (#1174)
      
      * fix
      
      * lint fix
      
      * fix
      
      * lint fix
      
      * fix
      
      * upd
      
      * [Bugfix] Legalize Datatype for mma intrinsic codegen (#1179)
      
      * fix
      
      * lint fix
      
      * Enhance CUDA code generation by updating register type handling for float data types. Introduced a workaround for TF32 type compatibility and improved the registration of MMA register types for A and B operands.
      
      * [Perf] Add layout and use_tma to boost performance
      
      * [Lint]
      
      * [Note]
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: Yuqi Dong <134183314+yyttt6@users.noreply.github.com>
      Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
      298ab480
  15. 04 Nov, 2025 1 commit
    • [Refactor] Improve Python3.9 compatibility for ParamSpec and Self (#1190) · 7d961892
      Lei Wang authored
      * [Feature] Enhance fill operation to support various buffer types
      
      - Added support for `BufferLoad` in the `fill` function to handle different buffer types.
      - Updated `Fill` class to process region descriptors and buffer regions, improving flexibility in buffer handling.
      - Introduced checks for static bounds in region definitions to ensure safety during operations.
      - Refactored loop induction variable handling in `FillNode` to accommodate sliced regions.
      
      * lint fix
      
      * [Refactor] Improve Python compatibility for ParamSpec and Self
      
      - Added compatibility handling for ParamSpec and Self to support Python versions below 3.10 and 3.11 respectively.
      - Updated type annotations across multiple files to ensure consistent usage of typing features.
      
      * [Update] Require Python 3.9 and enhance type annotations
      
      - Updated the minimum required Python version from 3.8 to 3.9 in `pyproject.toml`.
      - Removed references to Python 3.8 in classifiers.
      - Changed type annotations from `int | None` to `Optional[int]` in multiple example files for better clarity and compatibility.
      - Improved import statements to use `collections.abc` for `Iterable` and `contextlib` for `AbstractContextManager` in relevant files.
      
      * [Refactor] Update import statements to enhance type annotations
      
      - Replaced imports from `typing` with `collections.abc` for `Iterable` and `Mapping` in relevant files to improve compatibility and clarity.
      - Updated the caching decorator from `functools.lru_cache` to `functools.cache` for better performance in the C++ compiler retrieval function.
      - Adjusted import statements in the language proxy file to maintain consistency in type annotations.
      
      * disable rocm rs nt test.
      
      * lint fix
      7d961892
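The compatibility handling described above follows the standard pattern: import `ParamSpec` (added to `typing` in 3.10) and `Self` (added in 3.11) from `typing` when available, falling back to `typing_extensions` otherwise. A sketch of the shim, assuming `typing_extensions` is installed on older interpreters:

```python
import sys
from typing import Callable, TypeVar

if sys.version_info >= (3, 10):
    from typing import ParamSpec
else:  # Python 3.9: ParamSpec landed in typing only in 3.10
    from typing_extensions import ParamSpec

if sys.version_info >= (3, 11):
    from typing import Self  # noqa: F401
else:  # Self landed in typing only in 3.11
    from typing_extensions import Self  # noqa: F401

P = ParamSpec("P")
R = TypeVar("R")

def passthrough(fn: Callable[P, R]) -> Callable[P, R]:
    """A decorator whose wrapped signature is preserved via ParamSpec."""
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
        return fn(*args, **kwargs)
    return wrapper
```

The same pattern explains the related changes in the commit: `int | None` syntax (3.10+) becomes `Optional[int]`, and abstract containers are imported from `collections.abc` per current typing guidance.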
  16. 03 Nov, 2025 1 commit
    • [Language] Initial version of tilelang frontend v2 (#1120) · 5f202fe5
      Kurisu authored
      
      
      * tilelang frontend v2
      
      * syntax sugar: defining a local var by annotation
      
      * [Refactor] fix type linting warning like `T.float32`
      
      * Add tl.local_var_init for new tl.float32
      
      * allow passing default argument as function annotation
      
      * allow default arguments as annotation
      
      * fix lint error
      
      * minor fix
      
      * [Refactor] refactor tilelang.jit and tilelang.autotune
      
      * minor fix
      
      * minor fix
      
      * minor fix
      
      * fix metal get function name
      
      * add par_compile impl and tests
      
      * Type consistency on tvm datatype
      1. isinstance(tl.float32, tvm.DataType) == True
      2. Allow `tl.float32` as function annotations
      3. Allow `tl.float32` as argument to be passed to `tl.alloc` or other functions
      
      * fix lint error
      
      * add more warning in frontend
      
      * update tvm version
      
      * Minor fix on tvm_ffi annotations
      
      * add document and examples
      
      * fix lint error
      
      * Simplify index calculations in example_chunk_o_bwd.py
      
      Refactor index calculations for dg_last_fragment assignment.
      
      * minor fix
      
      * lint fix
      
      ---------
      Co-authored-by: Lei Wang <leiwang1999@outlook.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      5f202fe5
  17. 02 Nov, 2025 1 commit
  18. 01 Nov, 2025 1 commit
  19. 31 Oct, 2025 1 commit
    • [FFI] Rebase tvm to v0.22.0 to utilize tvm-ffi (#1108) · 10911e28
      Lei Wang authored
      
      
      * 3rdparty tvm bump
      
      * bump tvm into v0.22.0
      
      * lint fix
      
      * rebase tvm
      
      * Update submodule tvm to latest commit 3085bc4
      
      * Refactor: Update configuration retrieval in CopyNode and adjust test registration in tilelang
      
      * test fix
      
      * add requirement
      
      * atomic_fix
      
      * atomic_fix
      
      * phaseout py39
      
      * optimize
      
      * optimize
      
      * lint fix
      
      * do not clean cache
      
      * do not clean cache
      
      * [Minor] Minor update for Python versions and dependencies
      
      * [Lint] fix lint for py39
      
      * [Lint] fix lint for ROCm
      
      * [Build][CI] Sync CI changes from upstream/sdist
      
      * [Lint] fix lint for ROCm
      
      * [Build][CI] Update `repair-wheel-command`
      
      * [Minor] update abi3audit result format
      
      * [Lint] fix lint for ROCm
      
      * [BugFix] fix build
      
      * [Lint] fix lint for ROCm
      
      * [BugFix] set rpath for libtvm and libtvm_runtime
      
      * [Deps] pin apache-tvm-ffi version
      
      * [Build] set Python 3.9 Limited API for Cython target
      
      * [Build] set Python 3.9 Limited API for Cython target
      
      * [Deps] Restore Python 3.8 support
      
      * [Build] use `apache-tvm-ffi`'s `libtvm_ffi`
      
      * [BugFix] use `;` as delimiter for RPATH on macOS
      
      * [BugFix] use `--ignore-missing-dependencies` for `delocate-wheel`
      
      * [Build] support `sccache` if available
      
      * [Build] add CIBW import test
      
      * [Build][CI] enable ccache for CIBW on Linux
      
      * [BugFix] set rpath for libtvm and libtvm_runtime
      
      * Revert "[Build][CI] enable ccache for CIBW on Linux"
      
      This reverts commit cd9ab57bb5ddd2572c60bcbbebde81480a658fd3.
      
      * [CI] fix perfbench bot
      
      * [BugFix] use Python 3.9 to build wheel
      
      * [Minor] update perfbench bot envs
      
      * [BugFix] fix CIBW environment on Linux
      
      * [CI] skip import test on CentOS 7
      
      * [CI] use Python urllib to download file instead of Wget
      
      ---------
      Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
      10911e28
  20. 28 Oct, 2025 1 commit
  21. 27 Oct, 2025 1 commit
  22. 23 Oct, 2025 1 commit
    • [Feature] Enhance vectorized conversion support in CUDA codegen (#1095) · a148d62a
      Tong WU authored
      * [Feature] Add vectorized float16 and float32 conversion support in CUDA codegen
      
      * Implemented handling for conversions between float16 and float32 types, specifically for vectorized operations using __half22float2 and __float22half2_rn.
      * Enhanced the existing code to support both directions of conversion based on the lane count.
      * Improved overall type handling in the VisitExpr_ method for better compatibility with TileLang.
      
      * [Feature] Add float32 to float8 conversion support in CUDA codegen
      
      * Implemented handling for conversion from float32 to float8 (E4M3/E5M2) in the VisitExpr_ method.
      * Added vectorized conversion support using __nv_cvt_float2_to_fp8x2 for float2 to fp8x2 transformations.
      * Enhanced type handling for better compatibility with TileLang, particularly for float8 types.
      
      * lint
      
      * fix a bug
      
      * [Enhancement] Support lanes=4 cases and add unit test for vectorized cast
      
      * lint
      
      * [Feature] Refactor bf16 conversion operations and remove legacy compile flags
      
      * lint
      a148d62a
  23. 22 Oct, 2025 2 commits
  24. 21 Oct, 2025 4 commits
  25. 20 Oct, 2025 3 commits
  26. 19 Oct, 2025 2 commits
  27. 18 Oct, 2025 1 commit