"git@developer.sourcefind.cn:gaoqiong/migraphx.git" did not exist on "2d0896e7d3fcfc74e15f5b58180bf1f8cc084ab7"
Commit 83412458 authored by Lei Wang, committed by LeiWang1999

[Bugfix] Enable bfloat16 atomic operations only for CUDA architectures 7.5 and newer (#291)

* [Refactor] Improve flash attention example and layout comparison logic

- Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
- Updated the handling of `lse_local_split` to use parallel processing for better performance.
- Refactored the kernel compilation and profiling logic in the flash attention example for clarity and maintainability.
- Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.

* lint fix

* [Enhancement] Add support for shared memory scope in Fill operation

- Introduced handling for the `shared.dyn` and `shared` memory scopes in the Fill operation.
- Implemented parallel operation and layout inference for better performance in shared-memory scenarios.
- Updated the thread-loop partitioning and vectorization logic to accommodate the new memory-scope handling; a usage sketch follows this list.
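
A hedged usage sketch of what the new scope handling permits (the kernel structure, names, and sizes are illustrative assumptions, not taken from the commit):

```python
import tilelang.language as T


def fill_shared_example(M=1024, N=1024, block_M=64, block_N=64, dtype="float16"):

    @T.prim_func
    def main(A: T.Buffer((M, N), dtype)):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            # A buffer in shared scope; Fill on it can now lower to a
            # parallelized, vectorized thread loop per this change.
            A_shared = T.alloc_shared((block_M, block_N), dtype)
            T.fill(A_shared, 0)
            T.copy(A_shared, A[by * block_M, bx * block_N])

    return main
```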

* [Refactor] Remove deprecated decorator and enhance Cython kernel handling

- Moved the `deprecated` decorator out of the main module into a new implementation in the utils module for better organization.
- Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
- Updated the Cython kernel wrapper to use the new pointer map when handling kernel arguments.
- Enhanced error checking in the tensor utility functions so that static shapes are enforced (see the sketch after this list).
- Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.
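
A minimal sketch of the kind of static-shape check described above, assuming TVM's `tir.IntImm` marks a concrete dimension (the function name and message are illustrative, not the commit's exact code):

```python
from tvm import tir


def assert_static_shape(buffer: tir.Buffer) -> None:
    """Reject buffers whose shape contains symbolic (non-constant) dimensions."""
    for dim in buffer.shape:
        if not isinstance(dim, tir.IntImm):
            raise ValueError(f"buffer {buffer.name} requires a static shape, "
                             f"but dimension {dim} is symbolic")
```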

* [Feature] Add matrix multiplication test and kernel implementation

- Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
- The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
- Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
- Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
- Minor formatting improvements in `deprecated.py` for better readability.

* lint fix

* [Refactor] Update tensor creation in matrix multiplication test

- Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
- Updated the imports in `__init__.py` to include `make_tensor`.
- Added a `make_tensor` function in `proxy.py` to streamline tensor creation from pointers; a usage sketch follows.
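
A hedged sketch of the call-site change (argument names, shapes, and dtypes are assumptions for illustration; only `T.make_tensor` and the pointer-based style come from the commit):

```python
import tilelang.language as T


def matmul_from_ptrs(M, N, K, dtype="float16"):

    @T.prim_func
    def main(a_ptr: T.handle, b_ptr: T.handle, c_ptr: T.handle):
        # Previously: A = T.Tensor.from_ptr(a_ptr, (M, K), dtype)
        A = T.make_tensor(a_ptr, (M, K), dtype)
        B = T.make_tensor(b_ptr, (K, N), dtype)
        C = T.make_tensor(c_ptr, (M, N), dtype)
        ...  # tile-level GEMM over A and B into C, as in matmul_test

    return main
```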

* [Refactor] Update tensor definitions across multiple files

- Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
- Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
- Improved documentation in README and example files to reflect changes in tensor usage.

* lint fix

* [Refactor] Update tensor types in attention and matrix multiplication examples

- Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
- Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
- Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.

* lint fix

* [Refactor] Update tensor types in GEMM example and test files

- Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
- Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.

* [Refactor] Update tensor usage in customize.py

- Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
- Improved code clarity by standardizing buffer usage across the file.

* [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py

- Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
- Improved code clarity by standardizing buffer usage across the test file.

* [Refactor] Update tensor types to SharedBuffer and FragmentBuffer

- Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files for consistency with the latest tensor definitions.
- Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions; a sketch of the resulting annotation style follows.
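
A hedged sketch of the annotation style these renames converge on (the macro body and shapes are illustrative; only the `T.SharedBuffer` and `T.FragmentBuffer` names come from the commit):

```python
import tilelang.language as T

block_M, block_N, dim = 64, 64, 128
dtype = "float16"


@T.macro
def score_tile(
        Q_shared: T.SharedBuffer([block_M, dim], dtype),
        K_shared: T.SharedBuffer([block_N, dim], dtype),
        acc_s: T.FragmentBuffer([block_M, block_N], "float32"),
):
    # Shared buffers live in shared memory; fragment buffers in registers.
    T.gemm(Q_shared, K_shared, acc_s, transpose_B=True)
```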

* [Refactor] Introduce Tensor alias for Buffer in proxy.py

- Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs map to `torch.Tensor` (see the one-line sketch below).
- This change improves clarity and consistency of tensor usage across the codebase.
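
The alias itself, as described (a one-line sketch; the surrounding context in `proxy.py` is assumed):

```python
# proxy.py (sketch): Tensor is now simply Buffer under another name, so
# JIT signatures keep mapping inputs and outputs to torch.Tensor.
Tensor = Buffer
```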

* [Refactor] Revamp cache management and enhance documentation in env.py and proxy.py

- Replaced the global cache functions with a `CacheState` class to improve encapsulation and management of kernel caching (a minimal sketch follows this list).
- Updated the `from_ptr` method in BufferProxy and BaseTensorProxy classes to include detailed docstrings for better clarity on parameters and return values.
- Enhanced class docstrings across various proxy classes to provide clearer descriptions of their purpose and functionality, improving overall code documentation.
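
A minimal sketch of the `CacheState` idea under stated assumptions (the method names are illustrative; only the class name and its role come from the commit):

```python
class CacheState:
    """Encapsulates the kernel-cache toggle formerly held in module globals."""

    def __init__(self, enabled: bool = True):
        self._enabled = enabled

    def enable(self) -> None:
        self._enabled = True

    def disable(self) -> None:
        self._enabled = False

    def is_enabled(self) -> bool:
        return self._enabled


# One module-level instance replaces the old free functions.
_cache_state = CacheState()
```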

* [Refactor] Update imports in __init__.py for tir compatibility

- Added imports for `prim_func` and `tir.op` to enhance compatibility with the upstream tir script.
- Marked the imports with `# noqa: F401` to suppress linting warnings about unused imports; they will be removed once compatibility is achieved.

* lint fix

* [Refactor] Update imports in tir.ir.py for improved compatibility

- Removed unused import of `PrimExpr` from `tvm.script.ir_builder.tir` and replaced it with the correct import from `tvm.tir`.
- Added import for `tir.ir` in `__init__.py` to enhance module accessibility and maintain compatibility with upstream changes.

* [Refactor] Update function calls in tir.ir.py to return values

- Modified the `serial`, `parallel`, `vectorized`, `unroll`, `thread_binding`, and `grid` functions to return the results of their respective `_ir` calls, so the frames they create propagate back to callers (see the sketch below).
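
A hedged sketch of the pattern applied, assuming the wrappers delegate to TVM's `tvm.script.ir_builder.tir` (aliased `_ir` here); without the `return`, a loop like `for i in T.serial(n):` would receive `None`:

```python
from tvm.script.ir_builder import tir as _ir


def serial(start, stop=None, *, annotations=None):
    # Return the frame so callers can iterate it or use it as a context.
    return _ir.serial(start, stop, annotations=annotations)
```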

* bugfix

* [Enhancement] Add support for uint16 data type in TLCUDASourceWrapper

- Introduced the "uint16" mapping to the type dictionary in the TLCUDASourceWrapper class, expanding the range of supported data types for CUDA operations.

* bugfix

* [Update] Sync subproject commit and modify CUDA atomic add functions

- Updated the subproject commit for TVM to edd35139a0481e9359aa269e3e50450b95ba2f5a.
- Commented out the body of the CUDA capability check in the example convolution script (it now always returns False), preventing execution errors on unsupported devices.
- Refactored the BFLOAT16 atomic add functions in common.h to sit behind a conditional compilation guard (`__CUDA_ARCH_LIST__ >= 750`) for improved compatibility across CUDA architectures.
parent be0bf36d
```diff
-Subproject commit 9ddb7a1753b7af7a0917fb1914563fddb9794879
+Subproject commit edd35139a0481e9359aa269e3e50450b95ba2f5a
```
```diff
@@ -8,11 +8,12 @@ from functools import partial
 def check_hopper():
-    if not torch.cuda.is_available():
-        return None
-    props = torch.cuda.get_device_properties(0)
-    compute_capability = props.major, props.minor
-    return compute_capability == (9, 0)
+    # if not torch.cuda.is_available():
+    #     return None
+    # props = torch.cuda.get_device_properties(0)
+    # compute_capability = props.major, props.minor
+    # return compute_capability == (9, 0)
+    return False
 
 
 def get_configs():
```
```diff
@@ -126,25 +126,27 @@ template <> TL_DEVICE void AtomicAdd(bfloat16_t *address, float val) {
   atomicAdd(reinterpret_cast<__nv_bfloat16 *>(address), __float2bfloat16(val));
 }
 
-// AtomicAdd Functions for BFLOAT16
-template <> TL_DEVICE void AtomicAdd(bfloat16_t *address, bfloat16_t val) {
-  atomicAdd(reinterpret_cast<__nv_bfloat16 *>(address),
-            static_cast<__nv_bfloat16>(val));
-}
-
 // AtomicAdd Functions for FP16x2
 TL_DEVICE void AtomicAddx2(half_t *address, half_t *val) {
   atomicAdd(reinterpret_cast<half2 *>(address),
             static_cast<half2>(*reinterpret_cast<half2 *>(val)));
 }
 
+#if (defined(__CUDA_ARCH_LIST__) && (__CUDA_ARCH_LIST__ >= 750))
+// AtomicAdd Functions for BFLOAT16
+template <> TL_DEVICE void AtomicAdd(bfloat16_t *address, bfloat16_t val) {
+  atomicAdd(reinterpret_cast<__nv_bfloat16 *>(address),
+            static_cast<__nv_bfloat16>(val));
+}
+
 // AtomicAdd Functions for BFLOAT16x2
 TL_DEVICE void AtomicAddx2(bfloat16_t *address, bfloat16_t *val) {
   atomicAdd(
       reinterpret_cast<__nv_bfloat162 *>(address),
       static_cast<__nv_bfloat162>(*reinterpret_cast<__nv_bfloat162 *>(val)));
 }
+#endif
 
 // DP4A
 template <typename InDatatype, typename OutDatatype>
 TL_DEVICE void DP4A(InDatatype *a, InDatatype *b, OutDatatype *c) {
```
```diff
@@ -90,4 +90,5 @@ def test_loop_tail_split(block_M, block_N, block_K, threads, vec_load_b, dtype):
 if __name__ == "__main__":
-    tilelang.testing.main()
+    # tilelang.testing.main()
+    test_loop_tail_split(64, 64, 32, 128, 8, "float16")
```