Commits · fecc833699322f27f9618ca221c23fbac3b0979d · OpenDAS / tilelang

20 Jun, 2025 1 commit

[Enhancement] align shared memory allocations (#583) · fecc8336

Lei Wang authored Jun 20, 2025

* [Enhancement] Update `pythonic_expr` to format type casts and improve tensor validation in Cython wrapper

- Enhanced `pythonic_expr` to represent type casts as `(type)value` for better clarity in expression representation.
- Modified tensor validation in `CythonKernelWrapper` to conditionally check for tensor contiguity based on a new `skip_tensor_validation` parameter.
- Improved type mapping in `map_torch_type` to include version checks for new float8 types, ensuring compatibility with specific PyTorch versions.

* [Feature] Implement dynamic shared memory allocation alignment

- Added a new transformation pass `AlignDynamicSharedMemoryAllocations` to align dynamic shared memory allocations to specified byte boundaries, enhancing memory access efficiency.
- Introduced a new utility class `TileLangAlignDynamicSharedMemoryAllocations` to handle the alignment logic for both allocation and buffer operations.
- Updated the `LowerAndLegalize` function to apply the alignment transformation based on the target device's capabilities, ensuring compatibility with different architectures.
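
The arithmetic behind aligning allocations to a byte boundary can be sketched as follows. This is a simplified model, not the pass itself: the real `AlignDynamicSharedMemoryAllocations` pass rewrites IR, while `align_up` and `plan_offsets` are hypothetical helpers. The assumptions that allocations are packed back to back and that the alignment is a power of two are made here for illustration.

```python
from typing import List, Tuple


def align_up(offset: int, alignment: int) -> int:
    """Round `offset` up to the next multiple of `alignment` (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (offset + alignment - 1) & ~(alignment - 1)


def plan_offsets(sizes: List[int], alignment: int) -> Tuple[List[int], int]:
    """Assign each allocation an aligned start offset.

    Returns the per-allocation offsets and the total (aligned) byte count.
    """
    offsets = []
    cursor = 0
    for size in sizes:
        cursor = align_up(cursor, alignment)  # pad up to the boundary
        offsets.append(cursor)
        cursor += size
    return offsets, align_up(cursor, alignment)
```

With three allocations of 100, 40, and 8 bytes and a 16-byte boundary, `plan_offsets([100, 40, 8], 16)` returns `([0, 112, 160], 176)`: each buffer starts on a 16-byte boundary, at the cost of a few bytes of padding.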

* [Enhancement] Update dtype and argument defaults in GEMM autotuning example

- Changed data type from `float16` to `bfloat16`, which trades mantissa precision for a wider dynamic range.
- Updated the default value of the `--with_roller` argument from `True` to `False` to modify the behavior of the autotuning process.

* [Enhancement] Improve thread range computation in storage access

- Added a new method `ComputeThreadRange` to calculate the range of threads for better access tracking.
- Updated `AccessEntry` structure to include `thread_range`.
- Modified various visitor methods to utilize `IRVisitorWithAnalyzer` for improved analysis during expression and statement visits.
- Ensured thread range is computed and stored during buffer load and store operations, so each `AccessEntry` carries the thread range needed for access tracking.

* [Refactor] Update comments for clarity in dynamic shared memory allocation alignment

- Translated comments in `align_dynamic_shared_memory_allocations.cc` from Chinese to English for better understanding.
- Removed an unnecessary call to `IRVisitorWithAnalyzer::VisitStmt_` in `storage_access.cc`.
- Added a blank line for improved readability in `thread_storage_sync.cc`.

* [Refactor] Enhance storage access analysis and thread range computation

- Introduced `ExtractRealCondition` to improve condition handling in `IfThenElseNode` visits.
- Updated `ComputeThreadRange` to use `Var` instead of `IterVar` for thread range mapping, enhancing clarity and consistency.
- Wrapped statement visits in `With<arith::ConstraintContext>` to ensure proper analysis context during condition evaluations.

* [Enhancement] Update default matrix dimensions in GEMM autotune example

- Changed default values for matrix dimensions M, N, and K from 16384 to 4096 in `example_gemm_autotune.py` to facilitate quicker testing and benchmarking.

* typo fix

* enhancement

* [Fix] Add conflict detection for buffer index size mismatch in thread storage sync

- Implemented a check to return true if the sizes of previous and current buffer indices do not match, indicating a conflict.
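
The size-mismatch rule can be sketched in a few lines. This is a simplified model under stated assumptions: `indices_conflict` is a hypothetical helper, indices are represented as plain strings, and the fallback comparison (identical index expressions mean no conflict) is an assumption for illustration rather than the logic in `thread_storage_sync.cc`.

```python
from typing import Sequence


def indices_conflict(prev_indices: Sequence[str], curr_indices: Sequence[str]) -> bool:
    """Conservatively decide whether two buffer accesses may conflict."""
    # The rule from the commit: indices of different sizes cannot be
    # compared element by element, so treat the pair as a conflict.
    if len(prev_indices) != len(curr_indices):
        return True
    # Assumption for this sketch: identical index expressions mean both
    # accesses touch the same element per thread, so no sync is needed.
    return any(p != c for p, c in zip(prev_indices, curr_indices))
```

Returning `True` on a mismatch is the safe default: a spurious conflict only costs an extra synchronization, while a missed one could allow a data race.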


11 Apr, 2025 1 commit

[Language] Introduce `T.any_of` and `T.all_of` to reduce a bool array (#371) · c4638d65

Lei Wang authored Apr 11, 2025



* [Enhancement] Introduce logical operations `any_of` and `all_of` for buffer checks

- Added new logical operations `any_of` and `all_of` to the TileLang language interface, allowing users to check conditions across buffer elements.
- Implemented corresponding intrinsic calls for CUDA, enhancing the functionality of the TileLang framework.
- Updated the `allocate.py` to handle boolean types correctly in shared memory allocations.
- Introduced tests for the new logical operations to ensure correctness and performance.
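
The intended results of the two reductions can be modelled in plain Python. This is only a reference for the semantics: `any_of` and `all_of` below are standalone functions, not the `T.any_of` and `T.all_of` operations, whose signatures and CUDA lowering are not shown in this log. The result for an empty input follows Python's convention and is an assumption here.

```python
from typing import Iterable


def any_of(flags: Iterable[bool]) -> bool:
    """True if at least one element is true."""
    for flag in flags:
        if flag:
            return True  # one true element decides the result
    return False


def all_of(flags: Iterable[bool]) -> bool:
    """True if no element is false."""
    for flag in flags:
        if not flag:
            return False  # one false element decides the result
    return True
```

A reference like this is useful mainly as an oracle when testing a kernel that reduces a boolean buffer.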

* lint fix

---------
Co-authored-by: Zhiwen Mo <zhiwen.mo25@ic.ac.uk>


14 Feb, 2025 1 commit

[Refactor] Separate tilelang Pass Thread Sync (with Hopper support) from tvm (#85) · ec84188f

Lei Wang authored Feb 14, 2025

* bump version to v0.1.0

* [Enhancement] Add custom develop command for editable installs and update .gitignore

* [Documentation] Update README to include system dependencies installation instructions

* [Build] Update setup.py to support library file copying for both release and develop modes

* [Build] Refactor library file copying logic in setup.py

* [Documentation] Remove unnecessary install section header in Installation.md

* [Build] Add tox configuration and local distribution script for multi-Python version support

* [Build] Improve git submodule update function with better error handling

* [Build] Update LLVM configuration path in ROCm installation script

* [Build] Add .tox/ to .gitignore for tox testing environment

* [Build] Add support for TVM prebuild path configuration in CMakeLists.txt

* [Cleanup] Remove unused TVM runtime error codes header

* [Cleanup] Fix TVM grid constant type reference in CUDA module

* [Cleanup] Remove unused customized_code function from IR module

* [Feature] Add TileLang thread synchronization and storage access analysis passes

* [Build] Reorder DLL search path directories for more flexible library loading

* [Refactor] Improve thread synchronization and library path handling

- Rename ThreadSync and TileLangThreadSync functions in C++ code
- Update Python docstring for ThreadSync with more detailed description
- Reorder library path detection in tilelang environment setup
- Minor comment and code cleanup in CUDA and warp specialization modules

* [Refactor] Improve thread synchronization code style and formatting

- Standardize pointer type spacing in storage_access.h and storage_access.cc
- Update whitespace and indentation in thread_storage_sync.cc
- Reorder include statements in thread_partial_sync.cc
- Minor code formatting improvements across thread synchronization files

* [Refactor] Fix global function registration for ThreadSync

- Correct global function registration to use ThreadSync instead of TileLangThreadSync
- Update TVM global registration to match recent refactoring efforts

* [Refactor] Simplify ThreadSync global function registration

- Remove unnecessary whitespace in global function registration
- Compact the TVM global registration line for ThreadSync
