[ROCBLAS_GEMM] Default use of hipMallocAsync

Default use of hipMallocAsync rather than hipMalloc in rocblas_gemm and add support of fp16_fp16_fp32 in rocblas_gemm. Signed-off-by: wenjh <wenjh@sugon.com>

[ROCBLAS_GEMM] Default use of hipMallocAsync
Default use of hipMallocAsync rather than hipMalloc in rocblas_gemm and add support of fp16_fp16_fp32 in rocblas_gemm. Signed-off-by: wenjh <wenjh@sugon.com>
7a47930f · wenjh · e8f92b93 · 7a47930f · 7a47930f
Commit 7a47930f authored May 08, 2025 by wenjh
Hide whitespace changes
Inline Side-by-side

Showing with 5 additions and 2 deletions

tests/pytorch/test_numerics.py tests/pytorch/test_numerics.py +3 -1

transformer_engine/common/gemm/rocm_gemm.cu transformer_engine/common/gemm/rocm_gemm.cu +2 -1

No files found.
--- a/tests/pytorch/test_numerics.py
+++ b/tests/pytorch/test_numerics.py
@@ -2087,7 +2087,9 @@ def test_transformer_layer_hidden_states_format(dtype, bs, model):
 @pytest.mark.parametrize("is_paged", [False, True])
 def test_kv_cache_accuracy(dtype, bs, model_key, use_RoPE, input_format, module, backend, is_paged):
    reset_rng_states()
+    if backend in ["FusedAttention"]:
+        pytest.skip("Not support FusedAttention")
    if backend in ["FusedAttention", "FlashAttention"] and dtype == torch.float32:
        pytest.skip("FusedAttention and FlashAttention do not support FP32")
    if use_RoPE:

--- a/transformer_engine/common/gemm/rocm_gemm.cu
+++ b/transformer_engine/common/gemm/rocm_gemm.cu
@@ -1459,7 +1459,7 @@ void rocblas_gemm(const Tensor *inputA,
  // extract the stream order alloc env
  bool stream_order_alloc = false;
  if (const char* env_p = std::getenv("ROCBLAS_STREAM_ORDER_ALLOC") ) {
-    if (env_p != nullptr && std::string(env_p) == "1")
+    if (env_p == nullptr || std::string(env_p) == "1")
      stream_order_alloc = true;
  }
@@ -1467,6 +1467,7 @@ void rocblas_gemm(const Tensor *inputA,
  NVTE_CHECK((A_type==rocblas_datatype_f16_r && B_type==rocblas_datatype_f16_r && D_type==rocblas_datatype_f16_r) || 
+  (A_type==rocblas_datatype_f16_r && B_type==rocblas_datatype_f16_r && D_type==rocblas_datatype_f32_r) ||
       (A_type==rocblas_datatype_bf16_r && B_type==rocblas_datatype_bf16_r && D_type==rocblas_datatype_bf16_r) || 
       (A_type==rocblas_datatype_bf16_r && B_type==rocblas_datatype_bf16_r && D_type==rocblas_datatype_f32_r) ||
       (A_type==rocblas_datatype_f32_r && B_type==rocblas_datatype_f32_r && D_type==rocblas_datatype_f32_r) ||