1. 29 Sep, 2025 8 commits
    • [Example] Add topk into sparse mla example and append some docs (#901) · 6021ef32
      Lei Wang authored
      * Remove unused `fp8_mqa_logits.py` file and update README.md to reflect new directory structure and file descriptions for deepseek_v32 example. Added sections for architecture overview, Lightning Indexer, Top-k Selector, and Sparse MLA Forward implementations.
      
      * Update linting configurations and improve code formatting in deepseek_v32 example scripts
      
      - Added per-file ignores for the inference directory in `pyproject.toml`.
      - Refactored code in `topk_selector.py`, `convert.py`, `generate.py`, `kernel.py`, and `model.py` to enhance readability by adjusting spacing and line breaks.
      - Ensured consistent formatting across function definitions and assertions for better clarity.
      
      * Refactor test functions in deepseek_v32 example scripts for improved clarity and consistency
      
      - Updated `fp8_lighting_indexer.py` to define a dedicated test function for the Lightning Indexer.
      - Refactored `sparse_mla_fwd_pipelined.py` and `sparse_mla_fwd.py` to standardize test function parameters and improve readability.
      - Enhanced `topk_selector.py` by introducing a test function with parameters for batch size and sequence length.
      - Ensured all test functions are invoked correctly in the main execution block.
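To illustrate the kind of check a test parameterized by batch size and sequence length might perform, here is a minimal pure-Python sketch. It is not the code in `topk_selector.py`: `topk_indices` and `test_topk_selector` are hypothetical names, and the selection uses the standard library's `heapq` on the CPU rather than a GPU kernel.

```python
import heapq
from typing import List


def topk_indices(scores: List[List[float]], k: int) -> List[List[int]]:
    """Return, for each row, the indices of the k largest scores (descending)."""
    result = []
    for row in scores:
        if k > len(row):
            raise ValueError("k must not exceed the sequence length")
        # Ties are broken in favour of the lower index via the (-i) component.
        best = heapq.nlargest(k, range(len(row)), key=lambda i: (row[i], -i))
        result.append(best)
    return result


def test_topk_selector(batch: int = 2, seq_len: int = 8, k: int = 3) -> None:
    # Deterministic pseudo-scores so the check is reproducible.
    scores = [
        [float((7 * i + 3 * b) % seq_len) for i in range(seq_len)]
        for b in range(batch)
    ]
    picked = topk_indices(scores, k)
    for row, idx in zip(scores, picked):
        # The selected values must match a plain sort-based reference.
        assert [row[i] for i in idx] == sorted(row, reverse=True)[:k]


if __name__ == "__main__":
    test_topk_selector()
```

Keeping the shape arguments as defaulted parameters lets the same function be called from a `__main__` block and from a test runner with small shapes.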
      
      * Enhance test functions in deepseek_v32 example scripts with CUDA requirements and parameterization
      
      - Added CUDA requirements decorators to `test_example_sparse_mla_fwd` and `test_example_sparse_mla_fwd_pipelined`.
      - Parameterized test functions to use specific small shapes for testing, improving test coverage and clarity.
      
      * lint fix
      
      * Update README.md to correct image path for DeepSeek V3.2 architecture diagram
    • [Bugfix] Fix flops comp and softmax scale in mla (#900) · 16561159
      Wenxuan Tan authored
      * fix flops comp and softmax scale
      
      * format
    • [CI] Legalize math related test (#899) · 54fc6ba0
      Lei Wang authored
    • [Typo] Fix backend name for Huawei Ascend (#898) · d19fe1ae
      Wenhao Xie authored
      * [Typo] Fix backend name for Huawei Ascend chips
      
      * update
    • [Example] Add sparse mla examples (#896) · 65ac7454
      Lei Wang authored
      * Update README.md to include directory structure and file descriptions for deepseek_v32 example
      
      * Refactor and clean up deepseek_v32 example scripts
      
      - Removed unused imports and functions from `fp8_mqa_logits.py` to streamline the code.
      - Improved formatting and readability in `sparse_mla_fwd_pipelined.py` and `sparse_mla_fwd.py` by adjusting function signatures and indentation.
      - Added `# ruff: noqa` comments to suppress linting warnings in multiple files.
      - Enhanced the `generate_random_cu_seqlens` function in `utils.py` for better clarity and organization.
      - Updated print statements for consistency in output formatting.
    • [Example] Add example (#894) · 4424fa9a
      Lei Wang authored
      * [Refactor] Enhance CopyNode Lower method to support disable_tma flag and improve flash attention implementation
      
      * Updated the CopyNode Lower method to correctly include the disable_tma flag in the GetCopyInst call.
      * Refactored the flash attention implementation to selectively disable TMA for specific copy operations while allowing it for others.
      * Addressed linting issues for improved code quality
      
      * sparse mla kernels
      
      * Remove deprecated sparse MLA and utility files to streamline the codebase.
    • [Layout] fix plot layout (#890) · 6c67a77f
      Jiaxing Ding authored
  2. 28 Sep, 2025 2 commits
    • [Bugfix] Fix CopyNode Lower method to include disable_tma flag in GetCopyInst (#888) · 599264ca
      Tong WU authored
      * Fix CopyNode Lower method to include disable_tma flag in GetCopyInst call
      
      * Refactor flash attention implementation to disable TMA for specific copy and allow TMA for other operations
      
      * attempt to fix lint
    • [SM100] Add sm100 GEMM layouts and tcgen05 support (#887) · f58bcd43
      Zhiwen Mo authored
      * update sm100 related utcmma, tmem, ld/st256 in src
      * update sm100 related utcmma, tmem, ld/st256 in tilelang
      * Remove deprecated GEMM examples and related README documentation for SM100 architecture support
      * Update GEMM implementation to replace UTCMMA with TCGEN5MMA across relevant files
      * Remove gemm_umma.py example and update README to reflect TCGEN5MMA terminology changes
      * Update README.md for gemm_sm100 example by removing outdated API sections and streamlining documentation
      * Update README and source files to reflect TCGEN5.MMA terminology changes
      * Refactor CUDA GEMM header for improved readability
  3. 26 Sep, 2025 8 commits
    • [Layout] Introduce Flexible Parallel to Support T.serial and local buffers inside T.Parallel loop (#844) · c382dcbc
      Lei Wang authored
      
      * Support T.serial and local buffers inside T.Parallel loop.
      
      * Fix reducer layout in T.Parallel nested inside other loops
      
      * Debug output with LOG(INFO)
      
      * Add disable option for WGMMA.
      
      * fix
      
      * Use DLOG; fix missing registration for new pass config
      
      * bug fix
      
      * lint fix
      
      * Enhance GEMM instruction set with UTCMMA and improve local buffer handling in casting example
      
      * Update format.sh shebang, improve logging in layout inference, and enhance buffer store wrapper with detailed comments
      
      * Enhance GEMM instantiation logic and improve layout inference for local buffer detection
      
      - Updated the GEMM instantiation logic to include a check for WGMMA compatibility, ensuring that the conditions for using WGMMA are more robust.
      - Refined the layout inference process to better identify when loops manipulate only local buffers, improving the accuracy of thread binding decisions in parallel loops.
      
      ---------
      Co-authored-by: Huanqi Cao <caohuanqi@deepseek.com>
    • [Example] Optimize sink attention forward via swizzled layout and report benchmark results (#885) · bf67fb19
      Tong WU authored
      
      
      * Enhance attention sink examples with swizzled layout and performance metrics
      
      - Added `make_swizzled_layout` annotations for shared tensors in the `flashattn` function across MHA and GQA examples to optimize memory access patterns.
      - Updated benchmark outputs to include speedup calculations comparing Triton and TileLang implementations.
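The idea behind a swizzled shared-memory layout can be sketched with a generic XOR swizzle. This is an assumption-laden illustration, not the mapping that `make_swizzled_layout` produces: it assumes 32 banks, one 4-byte element per bank, and a power-of-two row length, and the names `bank_of` and `banks_for_column` are hypothetical.

```python
ROW_LEN = 32    # elements per shared-memory row (assumed, power of two)
NUM_BANKS = 32  # assumed bank count, one element per bank


def bank_of(row: int, col: int, swizzled: bool) -> int:
    """Bank hit when accessing logical element (row, col)."""
    # XOR-ing the column with the row index permutes each row differently,
    # so elements of one logical column land in different banks.
    phys_col = col ^ (row % ROW_LEN) if swizzled else col
    return (row * ROW_LEN + phys_col) % NUM_BANKS


def banks_for_column(col: int, swizzled: bool, rows: int = 32) -> set:
    """Distinct banks touched when `rows` threads read one logical column."""
    return {bank_of(r, col, swizzled) for r in range(rows)}


if __name__ == "__main__":
    print(len(banks_for_column(5, swizzled=False)))  # all threads hit one bank
    print(len(banks_for_column(5, swizzled=True)))   # accesses spread over banks
```

Without the swizzle, a column read serializes on a single bank; with it, the same read is spread across all banks in this model.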
      
      * Add README for Attention Sink example with algorithm details and benchmark results
      
      - Introduced a new README.md file for the Attention Sink example, outlining the forward and backward algorithms, including the computation of `dsinks`.
      - Provided benchmark results comparing performance metrics of the optimized implementation against Triton, highlighting speedup across various configurations.
      
      * Update README.md for Attention Sink example to include link to Triton implementation
      
      * Update examples/attention_sink/README.md
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      * Update examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      * typo
      
      ---------
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
    • [Dist] Provide an option to include commit ID in version (#884) · c861d8a2
      Lei Wang authored
      * Update MANIFEST.in and setup.py to include commit ID in versioning and adjust included files
      
      - Modified MANIFEST.in to include shared library files `libtvm.so` and `libtvm_runtime.so`.
      - Updated setup.py to conditionally include the commit ID in the package version based on the `WITH_COMMITID` environment variable.
      - Enhanced versioning logic in version.py to use a truncated commit ID for better compatibility.
      
      * Update setup.py and related scripts to enable commit ID inclusion in package metadata
      
      - Changed the default value of the `WITH_COMMITID` environment variable in setup.py to "True".
      - Updated tox.ini to set `WITH_COMMITID` to "TRUE" for the testing environment and "FALSE" for the build environment.
      - Modified pypi_distribution.sh to pass `WITH_COMMITID=FALSE` during the wheel build process.
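The `WITH_COMMITID` switch described above can be sketched as follows. This is a hedged illustration rather than the logic in `setup.py` or `version.py`: the function name `build_version`, the accepted truthy spellings, the 8-character truncation, and the `+<commit>` local-version suffix are all assumptions.

```python
import os
from typing import Mapping, Optional


def build_version(base: str, commit_id: Optional[str],
                  env: Mapping[str, str] = os.environ) -> str:
    """Append a truncated commit ID to `base` when WITH_COMMITID is enabled."""
    # Assumption: "1", "true" and "yes" (case-insensitive) enable the suffix,
    # and the default is "True" as the commit message states.
    flag = env.get("WITH_COMMITID", "True").strip().lower()
    if flag not in ("1", "true", "yes") or not commit_id:
        return base
    # Assumption: keep 8 hex characters; the upstream length is not stated.
    return f"{base}+{commit_id[:8]}"


if __name__ == "__main__":
    print(build_version("0.1.6", "c861d8a2deadbeef", {"WITH_COMMITID": "TRUE"}))
    print(build_version("0.1.6", "c861d8a2deadbeef", {"WITH_COMMITID": "FALSE"}))
```

Passing the environment as a parameter keeps the function easy to test, which mirrors how the commit sets the variable differently for the tox testing and build environments.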
      
      * Update MANIFEST.in to include additional files and directories for packaging
      
      - Added VERSION, CMakeLists.txt, and various requirements files to the package.
      - Included recursive inclusion of source files and third-party libraries, while excluding specific clang and llvm directories.
    • [Precision] Introduce `T.ieee_rsqrt` and related high precision op (#882) · a58bf9b6
      Lei Wang authored
      * Add fast math operations for CUDA: exp, exp10, log, log2, log10, tan, cos, and sin (#865)
      
      * Refactor fast math operation definitions for consistency and readability in CUDA code. Consolidated multiple definitions into single lines and improved formatting in related test files for better clarity.
      
      * Remove unnecessary pass configurations for warp specialization and TMA lowering in fast math operation tests for CUDA. This simplifies the test setup while maintaining the focus on fast math functionality.
      
      * Update fastmath tests to reflect that tl.* intrinsics generate no fastmath versions and disable cache in main execution.
      
      * Fix formatting in fastmath test comments for clarity on tl.* intrinsics behavior.
      
      * Add precision comparison tool for CUDA operations
      
      This commit introduces a new Python script and CUDA source file for a precision comparison tool that evaluates the accuracy of various CUDA operations (including division, reciprocal, exponential, logarithmic, and trigonometric functions) across different implementations: CUDA Precise, CUDA Fast, Triton, Triton LibDevice, and TileLang. The tool generates test data, executes the operations, and summarizes the error statistics for each implementation against a double precision reference. Additionally, a README file is added to document the results of the comparisons for various operations.
      
      * Add precision comparison tool for CUDA operations
      
      This commit introduces a new precision comparison tool implemented in Python and CUDA, designed to evaluate the accuracy of various mathematical operations (division, reciprocal, exponential, logarithmic, trigonometric, square root, etc.) across different frameworks including CUDA Precise/Fast, Triton, Triton LibDevice, PyTorch, and TileLang. The tool includes functionality for generating test data, executing operations, and summarizing error statistics for each implementation. Additionally, it provides a comprehensive README with error metrics for each operation tested.
      
      * Add IEEE-compliant mathematical operations and refactor fast math module
      
      This commit introduces new high precision mathematical operations including ieee_add, ieee_sub, ieee_mul, ieee_fmaf, ieee_frcp, ieee_fsqrt, ieee_frsqrt, and ieee_fdiv to the TileLang framework. The fast math module has been refactored to remove the deprecated fastmath.py file and update the import paths accordingly. Additionally, the CUDA code generation has been enhanced to support these new operations, ensuring compatibility with IEEE standards for floating-point arithmetic.
      
      * debug removed
      
      * Refactor IEEE math tests for improved readability and consistency
      
      This commit enhances the formatting of the `test_ieee_math.py` and `test_mathops_fastmath.py` files by adjusting line breaks for better clarity. It also removes unnecessary comments and ensures that the main execution of tests is streamlined. These changes aim to improve the overall maintainability of the test code.
      
      * Update README.md to enhance formatting of precision comparison results
      
      This commit reformats the precision comparison results in the README.md file, converting the error statistics tables into a more structured markdown format. This change improves readability and accessibility of the data for various mathematical operations across different implementations, including FP32 Precise, Triton, TileLang, and CUDA.
    • [Example] Add efficient attention sink backward implementations and tests (#877) · ec24561a
      Tong WU authored
      * [Example] Add a new example to support attention sink for MHA
      
      - Introduced a new example script for multi-head attention (MHA) with sliding window attention and sink tokens.
      - Added a reference attention function to validate the implementation against PyTorch.
      - Included argument parsing for command-line execution of the example.
      
      * [Example] Replace MHA sink forward example with updated implementation
      
      - Removed the old example script for multi-head attention (MHA) with sliding window attention and sink tokens.
      - Introduced a new example script that modifies the attention mechanism to enhance performance and maintainability.
      - Updated argument parsing and reference functions to align with the new implementation.
      
      * Enhance MHA sink example with sliding window support
      
      - Added a `window_size` parameter to the `flashattn` function to enable sliding window attention.
      - Implemented assertions to ensure `window_size` is compatible with `block_N`.
      - Updated the main function to include a `tune` option for performance tuning.
      - Introduced a new test file to validate both full attention and sliding window scenarios.
      - Adjusted FLOPS calculation to account for the sliding window configuration.
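A reference for one query row of sliding-window attention with a sink might look like the sketch below. It is an assumption-based illustration, not the example's reference function: it assumes the sink contributes only to the softmax denominator, and that a window of `window_size` covers the query position and the `window_size - 1` keys before it. The name `sink_softmax_row` is hypothetical.

```python
import math
from typing import List, Optional


def sink_softmax_row(scores: List[float], q_idx: int, sink: float,
                     window_size: Optional[int] = None) -> List[float]:
    """Causal (optionally sliding-window) softmax for one query row with a sink logit."""
    # Keys visible to this query: everything up to q_idx, or only the last
    # `window_size` positions when a window is given.
    lo = 0 if window_size is None else max(0, q_idx - window_size + 1)
    visible = range(lo, q_idx + 1)
    # Subtract the running maximum (including the sink) for numerical stability.
    m = max([scores[j] for j in visible] + [sink])
    # The sink adds mass to the denominator but has no output column,
    # so the returned probabilities sum to less than 1.
    denom = sum(math.exp(scores[j] - m) for j in visible) + math.exp(sink - m)
    probs = [0.0] * len(scores)
    for j in visible:
        probs[j] = math.exp(scores[j] - m) / denom
    return probs


if __name__ == "__main__":
    row = sink_softmax_row([0.0] * 6, q_idx=5, sink=0.0, window_size=3)
    print(row, sum(row))
```

Setting `window_size=None` gives the full-attention case, so one reference can validate both scenarios mentioned in the list above.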
      
      * lint
      
      * [Fix] Add checkinf process to fix the bug of swa
      
      * Migrate to BSHD layout to align with triton baselines
      
      * lint
      
      * fix typo
      
      * Refactor MHA sink example to take separate `seq_q` and `seq_kv` sequence-length parameters.
      
      * Add GQA sink example for optimized attention mechanism & lint fix
      
      * fix several typos and bugs
      
      * lint
      
      * fix speed issues of swa
      
      * Add flash attention example with backward pass for BHSD layout and corresponding test cases
      
      * Add backward pass implementation for flash attention with sinks and corresponding test case
      
      * fix lint and typo
      
      * Optimize the calculation of `dsinks`
      
      * Add support for swa backward and update examples
      
      * fix previous typos
      
      * Add example for GQA sink backward pass and update tests for both MHA and GQA sinks
      
      * fix lint
      
      * fix previous typos
      
      * typo
    • [FastMath] Disable default TVM fastmath intrinsic dispatch and add explicit fastmath op to invoke (#875) · 95c373f5
      Lei Wang authored
      
      * Add fast math operations for CUDA: exp, exp10, log, log2, log10, tan, cos, and sin (#865)
      
      * Refactor fast math operation definitions for consistency and readability in CUDA code. Consolidated multiple definitions into single lines and improved formatting in related test files for better clarity.
      
      * Remove unnecessary pass configurations for warp specialization and TMA lowering in fast math operation tests for CUDA. This simplifies the test setup while maintaining the focus on fast math functionality.
      
      * Update fastmath tests to reflect that tl.* intrinsics generate no fastmath versions and disable cache in main execution.
      
      * Fix formatting in fastmath test comments for clarity on tl.* intrinsics behavior.
      
      * Add precision comparison tool for CUDA operations
      
      This commit introduces a new Python script and CUDA source file for a precision comparison tool that evaluates the accuracy of various CUDA operations (including division, reciprocal, exponential, logarithmic, and trigonometric functions) across different implementations: CUDA Precise, CUDA Fast, Triton, Triton LibDevice, and TileLang. The tool generates test data, executes the operations, and summarizes the error statistics for each implementation against a double precision reference. Additionally, a README file is added to document the results of the comparisons for various operations.
      
      * Add precision comparison tool for CUDA operations
      
      This commit introduces a new precision comparison tool implemented in Python and CUDA, designed to evaluate the accuracy of various mathematical operations (division, reciprocal, exponential, logarithmic, trigonometric, square root, etc.) across different frameworks including CUDA Precise/Fast, Triton, Triton LibDevice, PyTorch, and TileLang. The tool includes functionality for generating test data, executing operations, and summarizing error statistics for each implementation. Additionally, it provides a comprehensive README with error metrics for each operation tested.
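The core of such a comparison, summarizing an implementation's error against a double-precision reference, can be sketched in plain Python. This is not the tool added by the commit: float32 is emulated with `struct`, and `fast_rsqrt` is the classic bit-trick approximation used here only as a stand-in for a lower-precision implementation (it is not the CUDA fast intrinsic or `T.ieee_rsqrt`). All function names are hypothetical.

```python
import math
import struct
from typing import Callable, Dict, Iterable


def to_f32(x: float) -> float:
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def reference_rsqrt(x: float) -> float:
    """Double-precision reference."""
    return 1.0 / math.sqrt(x)


def precise_rsqrt(x: float) -> float:
    """Reference result rounded once to float32."""
    return to_f32(reference_rsqrt(x))


def fast_rsqrt(x: float) -> float:
    """Stand-in approximation: bit trick plus one Newton step, in float32."""
    bits = struct.unpack("I", struct.pack("f", x))[0]
    bits = 0x5F3759DF - (bits >> 1)
    y = struct.unpack("f", struct.pack("I", bits))[0]
    return to_f32(y * (1.5 - 0.5 * x * y * y))


def error_stats(impl: Callable[[float], float], ref: Callable[[float], float],
                inputs: Iterable[float]) -> Dict[str, float]:
    """Max/mean absolute error and max relative error of `impl` against `ref`."""
    abs_errs, rel_errs = [], []
    for x in inputs:
        expected = ref(x)
        abs_err = abs(impl(x) - expected)
        abs_errs.append(abs_err)
        rel_errs.append(abs_err / abs(expected) if expected != 0.0 else 0.0)
    return {
        "max_abs": max(abs_errs),
        "mean_abs": sum(abs_errs) / len(abs_errs),
        "max_rel": max(rel_errs),
    }


if __name__ == "__main__":
    xs = [to_f32(0.01 + 0.37 * i) for i in range(1000)]  # float32-exact inputs
    for name, impl in [("precise", precise_rsqrt), ("fast", fast_rsqrt)]:
        print(name, error_stats(impl, reference_rsqrt, xs))
```

Feeding every implementation the same float32-exact inputs keeps input rounding out of the measured error.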
    • [CI][AMD] Remove amd Timeout test (#881) · 56f7494f
      alex_xiao authored
    • [Cython] Remove an incorrect check (#880) · 6f6ef7ad
      LJC00118 authored
  4. 25 Sep, 2025 4 commits
    • [Language] Support atomic add with ret (#870) · aa0b1090
      Lei Wang authored
      * Add atomic operations for CUDA templates in new atomic.h file
      
      - Introduced atomic functions including AtomicMax, AtomicMin, AtomicAdd, and their return variants for various data types.
      - Implemented support for half, bfloat16, and float types with appropriate memory ordering.
      - Moved atomic-related utilities from common.h to the new atomic.h file for better organization.
      - Added Python bindings for atomic operations in tilelang, including atomic_max, atomic_min, atomic_add, and their vectorized counterparts.
      - Updated customize.py to utilize the new atomic functions, enhancing modularity and maintainability.
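The difference between a plain atomic add and its return variant can be modeled on the CPU as below. This is only a conceptual sketch, assuming the return variant yields the value held before the addition (as CUDA's `atomicAdd` does). It does not use the TileLang bindings, and `AtomicInt`, `add_ret`, and `claim_slots` are hypothetical names.

```python
import threading
from typing import List


class AtomicInt:
    """Lock-based model of an integer supporting atomic add."""

    def __init__(self, value: int = 0) -> None:
        self._value = value
        self._lock = threading.Lock()

    def add(self, delta: int) -> None:
        """Atomic add that discards the previous value."""
        with self._lock:
            self._value += delta

    def add_ret(self, delta: int) -> int:
        """Atomic add that returns the value held before the addition."""
        with self._lock:
            old = self._value
            self._value += delta
            return old


def claim_slots(num_threads: int = 8, per_thread: int = 100) -> List[int]:
    """Each worker reserves output slots; the returned old value is its slot."""
    cursor = AtomicInt()
    claimed: List[List[int]] = [[] for _ in range(num_threads)]

    def worker(tid: int) -> None:
        for _ in range(per_thread):
            claimed[tid].append(cursor.add_ret(1))

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(slot for slots in claimed for slot in slots)


if __name__ == "__main__":
    slots = claim_slots()
    print(len(slots), len(set(slots)))  # every claimed slot is unique
```

The return value is what makes patterns such as slot reservation possible: each caller learns which index it obtained, which a fire-and-forget add cannot provide.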
      
      * Refactor atomic operations in CUDA templates for improved readability
      
      - Reformatted atomic operation implementations in atomic.h for better code clarity.
      - Adjusted function signatures in tilelang's atomic.py to enhance readability by aligning parameters.
      - Cleaned up unnecessary whitespace and comments in customize.py to streamline the codebase.
      
      * Add thread storage synchronization configuration option
      
      - Introduced a new configuration option `tl.disable_thread_storage_sync` to control the automatic insertion of thread synchronization barriers in shared memory access.
      - Updated the `ThreadSync` pass to check this configuration and bypass synchronization if disabled.
      - Enhanced documentation in `builtin.h` and `pass_config.py` to clarify the purpose and usage of the new option.
      
      * Refactor thread storage sync configuration retrieval
      
      - Simplified the retrieval of the thread storage sync configuration in the `ThreadSync` pass by removing unnecessary intermediate variables.
      - Ensured that the inclusion of `builtin.h` is consistent by moving it to the appropriate location in the file.
      
      * test fix
      
      * Update atomic operations and tests for improved functionality
      
      - Updated atomic operations in CUDA templates to remove unnecessary address_of calls, enhancing performance and readability.
      - Refactored atomic operation signatures in tilelang's atomic.py to accept references instead of pointers.
      - Added new atomic operations and corresponding test cases for atomic add, max, min, and load/store functionalities in the testing suite.
      - Updated the TVM subproject to the latest commit for better compatibility.
      
      * Update attention sink examples to use 32 heads
      
      - Modified the `heads` parameter in both `example_gqa_sink_fwd_bhsd_wgmma_pipelined.py` and `example_mha_sink_fwd_bhsd_wgmma_pipelined.py` from 1 to 32 to enhance performance in attention mechanisms.
      - Ensured consistency across example scripts for improved usability and testing.
      
      * Refactor atomic add handling in vectorization
      
      - Simplified the extraction of buffer loads for atomic add operations by removing unnecessary address_of calls, improving code clarity and performance.
      - Updated the data type retrieval for vectorization size calculation to directly access the buffer load node, enhancing efficiency.
      
      * Add loop break functionality and enhance thread synchronization
      
      - Introduced a new `loop_break` function in `customize.py` to allow breaking out of loops, returning a call to the `tl.loop_break` intrinsic.
      - Updated the `sync_threads` function in `builtin.py` to accept optional parameters for `barrier_id` and `arrive_count`, improving its flexibility for thread synchronization.
      - Added necessary imports in `__init__.py` to include the new `loop_break` function for broader accessibility.
      
      * test fix
    • [Bugfix] Use `ExprDeepEqual` instead of `StructuralEqual` when merge consecutive If stmt (#876) · 1dfac2e8
      Lei Wang authored
      * Update submodule TVM to latest commit and fix condition comparison in merge_if_stmt.cc
      
      * Update submodule TVM to latest commit 0524f760
      
      * lint fix
    • [Language] Support loop_break primitive (#873) · 15a303d2
      Yu Cheng authored
    • [Language] Support sequence comparisons (#872) · c538d8ab
      Lei Wang authored
      * Update submodule 'tvm' to latest commit 7a71ee34
      
      * lint fix
  5. 24 Sep, 2025 2 commits
  6. 23 Sep, 2025 6 commits
  7. 22 Sep, 2025 4 commits
    • [AMD][MLA] Fix mla autotune for rocm (#861) · 3b21a67d
      Lei Wang authored
      * Refactor matmul example to include ReLU activation and update batch size in benchmark script
      
      * lint fix
      
      * Enhance autotuning capabilities in benchmark script and update argument defaults
      
      - Introduced a new `get_configs` function to generate autotuning configurations for the benchmark.
      - Updated the default batch size and kv context length in the argument parser for improved performance.
      - Renamed the `--auto_tune` argument to `--autotune` for consistency.
      - Modified the kernel invocation logic to support autotuning based on the new configurations.
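A `get_configs` helper of the kind mentioned above is often written as a Cartesian product over per-parameter candidate lists. The sketch below assumes that shape. The parameter names and values are illustrative guesses, not the search space used in the benchmark script.

```python
import itertools
from typing import Dict, List


def get_configs() -> List[Dict[str, int]]:
    """Enumerate every combination of the candidate tuning parameters."""
    # Hypothetical search space: names and values are for illustration only.
    space = {
        "block_N": [32, 64, 128],
        "block_H": [32, 64],
        "num_split": [1, 2, 4],
        "threads": [128, 256],
    }
    keys = list(space)
    return [dict(zip(keys, values)) for values in itertools.product(*space.values())]


if __name__ == "__main__":
    configs = get_configs()
    print(len(configs), configs[0])
```

An autotuner can then compile and time the kernel once per dictionary and keep the fastest configuration.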
      
      * lint fix
    • [TMA] Bugfix when a shared buffer is both issued with tma store and tma load (#857) · b9a51c43
      Lei Wang authored
      - Updated `init_desc_arg_map` to use `Var` as the key instead of `String` in `lower_hopper_intrin.cc`.
      - Enhanced `func_call_args` method in `TLCUDASourceWrapper` to accept additional parameters for better argument mapping.
      - Added assertions to ensure consistency between function parameters and arguments during kernel launches.
      - Modified `generate_tma_descriptor_args` to utilize a mapping of variable names for TMA descriptor initialization.
    • [Doc] Optimize the quickstart guide for clarity and not just for CUDA (#858) · 058a670b
      Lei Wang authored
      * Refactor matmul example to include ReLU activation and update batch size in benchmark script
      
      * lint fix
  8. 21 Sep, 2025 1 commit
    • [PATCH] Static libg++ linking fix (#854) · a3497ebc
      Lei Wang authored
      * bump version to 0.1.6
      
      * phaseout py38
      
      * py39
      
      * Update submodule 'tvm' to latest commit adc0e48
      
      * [Build] Update CMake and Python environment settings
      
      - Added static linking flags for GCC and libstdc++ in CMakeLists.txt to enhance library linking.
      - Removed the cmake version requirement from pyproject.toml to allow for broader compatibility.
      - Updated the tox command in the Docker distribution script to include Python 3.8 for testing environments.
      
      * [Build] Update Python version requirements in scripts and documentation
      
      - Changed Python version requirement in README.md from 3.9+ to 3.8+.
      - Updated installation and testing scripts to use Python 3.8 instead of 3.9, ensuring compatibility with the new minimum version.
      - Adjusted tox commands in local and PyPI distribution scripts to include Python 3.8 in the testing environments.
      
      * [Build] Update Python and CMake requirements in Dockerfile and pyproject.toml
      
      - Added CMake version requirement (>=3.26) to pyproject.toml for build compatibility.
      - Created a Python 3.8 environment in the Dockerfile and added a symlink for easier access to the Python 3.8 executable.
      
      * [Build] Update CMake and Dockerfile for improved compatibility
      
      - Removed static linking flags from CMakeLists.txt to simplify build configuration.
      - Updated Dockerfile to use Ubuntu 20.04 and streamlined the installation of dependencies, removing gcc-9 and g++-9.
      - Adjusted symlink creation for Python environments to use the `-sf` option for safer linking.
      
      * [Build] Bump version to 0.1.6.post1 for post-release updates
      
      * [Build] Remove static linking flags from CMakeLists.txt
      
      - Eliminated static linking flags for GCC and libstdc++ to simplify build configuration and avoid potential conflicts with Python extensions.
      
      * [Build] Update Docker distribution scripts for manylinux compatibility
      
      - Changed base image from `tilelang-builder:18.04` to `tilelang-builder:manylinux` in both local and PyPI distribution scripts.
      - Updated Dockerfile references to use `pypi.manylinux.Dockerfile`.
      - Added `--gpus all` flag to the Docker run command to enable GPU support during execution.
      
      * lint fix
      
      * add cmake
  9. 19 Sep, 2025 3 commits
    • [Release] Bump Version to 0.1.6 (#818) · 1ad6e461
      Lei Wang authored
      * bump version to 0.1.6
      
      * phaseout py38
      
      * py39
      
      * Update submodule 'tvm' to latest commit adc0e48
      
      * [Build] Update CMake and Python environment settings
      
      - Added static linking flags for GCC and libstdc++ in CMakeLists.txt to enhance library linking.
      - Removed the cmake version requirement from pyproject.toml to allow for broader compatibility.
      - Updated the tox command in the Docker distribution script to include Python 3.8 for testing environments.
      
      * [Build] Update Python version requirements in scripts and documentation
      
      - Changed Python version requirement in README.md from 3.9+ to 3.8+.
      - Updated installation and testing scripts to use Python 3.8 instead of 3.9, ensuring compatibility with the new minimum version.
      - Adjusted tox commands in local and PyPI distribution scripts to include Python 3.8 in the testing environments.
      
      * [Build] Update Python and CMake requirements in Dockerfile and pyproject.toml
      
      - Added CMake version requirement (>=3.26) to pyproject.toml for build compatibility.
      - Created a Python 3.8 environment in the Dockerfile and added a symlink for easier access to the Python 3.8 executable.
    • [Refactor] Enhance buffer store transformation in TIR pass (#851) · 094e2298
      Lei Wang authored
      - Updated the `AddWrapperForSingleBufStore` function to improve the handling of buffer stores by adding detailed checks for fragment buffer accesses and ensuring only index 0 is used.
      - Introduced new helper functions for collecting buffer accesses and indices, enhancing code readability and maintainability.
      - Refined the logic for determining tile operations and thread bindings to ensure accurate transformations without affecting existing parallel structures.
    • [Py38] Revert typing and parser updates for Python 3.8 compatibility (#850) · bc9623fc
      Lei Wang authored
      * Update submodule TVM to commit 872e32c1 and adjust type hints in nvcc.py and utils.py for compatibility with Python typing standards.
      
      * Update requirements.txt to specify ml_dtypes without a version constraint, indicating that versions greater than 0.5.1 are needed for fp4 support.
  10. 18 Sep, 2025 2 commits