src/transform/thread_storage_sync.cc · fecc833699322f27f9618ca221c23fbac3b0979d · OpenDAS / tilelang

[Enhancement] align shared memory allocations (#583) · fecc8336

Lei Wang authored Jun 20, 2025

* [Enhancement] Update `pythonic_expr` to format type casts and improve tensor validation in Cython wrapper

- Enhanced `pythonic_expr` to represent type casts as `(type)value` for better clarity in expression representation.
- Modified tensor validation in `CythonKernelWrapper` to conditionally check for tensor contiguity based on a new `skip_tensor_validation` parameter.
- Improved type mapping in `map_torch_type` to include version checks for new float8 types, ensuring compatibility with specific PyTorch versions.
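As a rough illustration of the `(type)value` cast format described above, the following sketch prints casts on a toy expression tree. The `Var`, `Cast`, `Add`, and `format_expr` names are hypothetical and are not taken from tilelang's `pythonic_expr`; the parenthesization rule for compound operands is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Var:
    name: str


@dataclass
class Cast:
    dtype: str
    value: "Expr"


@dataclass
class Add:
    lhs: "Expr"
    rhs: "Expr"


# Hypothetical toy expression type; real TIR expressions are far richer.
Expr = Union[Var, Cast, Add, int]


def format_expr(expr: Expr) -> str:
    """Render a toy expression tree, printing casts as `(type)value`."""
    if isinstance(expr, Cast):
        inner = format_expr(expr.value)
        # Assumption: wrap compound operands so the cast binds unambiguously.
        if isinstance(expr.value, Add):
            inner = f"({inner})"
        return f"({expr.dtype}){inner}"
    if isinstance(expr, Add):
        return f"{format_expr(expr.lhs)} + {format_expr(expr.rhs)}"
    if isinstance(expr, Var):
        return expr.name
    return str(expr)
```

For example, `format_expr(Cast("float32", Var("x")))` yields `(float32)x`.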

* [Feature] Implement dynamic shared memory allocation alignment

- Added a new transformation pass `AlignDynamicSharedMemoryAllocations` to align dynamic shared memory allocations to specified byte boundaries, enhancing memory access efficiency.
- Introduced a new utility class `TileLangAlignDynamicSharedMemoryAllocations` to handle the alignment logic for both allocation and buffer operations.
- Updated the `LowerAndLegalize` function to apply the alignment transformation based on the target device's capabilities, ensuring compatibility with different architectures.
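The alignment idea above can be sketched as plain offset arithmetic: each buffer's start offset is rounded up to the next multiple of the boundary. This is a minimal sketch, not the logic of `AlignDynamicSharedMemoryAllocations`; `align_up` and `layout_buffers` are hypothetical helpers, and the 128-byte boundary used in the usage note is an assumed value.

```python
def align_up(offset: int, alignment: int) -> int:
    """Round `offset` up to the next multiple of a power-of-two `alignment`."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (offset + alignment - 1) & ~(alignment - 1)


def layout_buffers(sizes, alignment):
    """Assign an aligned start offset to each buffer, in order.

    Returns the list of offsets and the total (aligned) size in bytes.
    """
    offsets = []
    cursor = 0
    for size in sizes:
        cursor = align_up(cursor, alignment)  # pad up to the boundary
        offsets.append(cursor)
        cursor += size
    return offsets, align_up(cursor, alignment)
```

With buffer sizes `[100, 260, 16]` and an assumed 128-byte boundary, the offsets come out as `[0, 128, 512]` and the total size as `640`.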

* [Enhancement] Update dtype and argument defaults in GEMM autotuning example

- Changed data type from `float16` to `bfloat16`, trading mantissa precision for a wider dynamic range.
- Changed the default value of the `--with_roller` argument from `True` to `False`, so the roller is disabled unless explicitly requested.

* [Enhancement] Improve thread range computation in storage access

- Added a new method `ComputeThreadRange` to calculate the range of threads for better access tracking.
- Updated `AccessEntry` structure to include `thread_range`.
- Modified various visitor methods to utilize `IRVisitorWithAnalyzer` for improved analysis during expression and statement visits.
- Ensured thread range is computed and stored during buffer load and store operations, so each recorded access carries the range of threads that performs it.
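One way to picture an access record that carries a thread range is sketched below. It assumes the range is stored per thread variable as `(min, extent)` and derived from inclusive bounds; the field layout and the `compute_thread_range` and `record_access` helpers are hypothetical and do not mirror the C++ `AccessEntry` or `ComputeThreadRange`.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Inclusive (lo, hi) bounds per thread variable, e.g. {"threadIdx.x": (0, 31)}.
Bounds = Dict[str, Tuple[int, int]]


@dataclass
class AccessEntry:
    buffer: str
    kind: str  # "load" or "store"
    # Assumed representation: thread variable -> (min, extent).
    thread_range: Dict[str, Tuple[int, int]] = field(default_factory=dict)


def compute_thread_range(bounds: Bounds) -> Dict[str, Tuple[int, int]]:
    """Convert inclusive bounds into (min, extent) ranges."""
    return {name: (lo, hi - lo + 1) for name, (lo, hi) in bounds.items()}


def record_access(buffer: str, kind: str, bounds: Bounds) -> AccessEntry:
    """Build an access entry that remembers which threads perform it."""
    return AccessEntry(buffer, kind, compute_thread_range(bounds))
```

Recording the range at each load and store keeps the thread information next to the access it describes, so later analysis does not need to recompute it.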

* [Refactor] Update comments for clarity in dynamic shared memory allocation alignment

- Translated comments in `align_dynamic_shared_memory_allocations.cc` from Chinese to English for better understanding.
- Removed an unnecessary call to `IRVisitorWithAnalyzer::VisitStmt_` in `storage_access.cc`.
- Added a blank line for improved readability in `thread_storage_sync.cc`.

* [Refactor] Enhance storage access analysis and thread range computation

- Introduced `ExtractRealCondition` to improve condition handling in `IfThenElseNode` visits.
- Updated `ComputeThreadRange` to use `Var` instead of `IterVar` for thread range mapping, enhancing clarity and consistency.
- Wrapped statement visits in `With<arith::ConstraintContext>` to ensure proper analysis context during condition evaluations.
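The scoped-constraint pattern behind `With<arith::ConstraintContext>` can be approximated with a Python context manager: a condition narrows a variable's bounds only while the guarded body is visited, and the old bounds return on exit. `ToyAnalyzer` and `constraint_lt` are hypothetical names for illustration and are not TVM's API.

```python
from contextlib import contextmanager


class ToyAnalyzer:
    """Tracks inclusive (lo, hi) bounds for named integer variables."""

    def __init__(self, bounds):
        self.bounds = dict(bounds)

    @contextmanager
    def constraint_lt(self, var, limit):
        """Temporarily assume `var < limit`; restore the old bounds on exit."""
        saved = self.bounds[var]
        lo, hi = saved
        self.bounds[var] = (lo, min(hi, limit - 1))
        try:
            yield self
        finally:
            # Restore even if visiting the guarded body raises.
            self.bounds[var] = saved
```

A visitor for an `if tx < 32:` branch would enter `constraint_lt("tx", 32)` before visiting the body, so any thread range computed inside sees `tx` bounded by `(0, 31)` rather than the full launch extent.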

* [Enhancement] Update default matrix dimensions in GEMM autotune example

- Changed default values for matrix dimensions M, N, and K from 16384 to 4096 in `example_gemm_autotune.py` to facilitate quicker testing and benchmarking.

* typo fix

* enhancement

* [Fix] Add conflict detection for buffer index size mismatch in thread storage sync

- Implemented a check to return true if the sizes of previous and current buffer indices do not match, indicating a conflict.
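The size-mismatch rule can be sketched as an early return ahead of any element-wise comparison. This is a simplified stand-in: `indices_conflict` is a hypothetical function, and treating any differing index pair as a conflict is an assumption, not the full analysis in `thread_storage_sync.cc`.

```python
def indices_conflict(prev_indices, curr_indices) -> bool:
    """Conservatively decide whether two buffer accesses may conflict."""
    # Different number of indices: the accesses cannot be compared
    # element by element, so report a conflict.
    if len(prev_indices) != len(curr_indices):
        return True
    # Assumption for this sketch: any differing index pair is a conflict.
    return any(p != c for p, c in zip(prev_indices, curr_indices))
```

Returning `True` on a mismatch is the conservative choice: an extra synchronization costs some performance, while a missed one risks a data race.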


thread_storage_sync.cc 18.2 KB
