Unverified commit 4e9c2c39, authored by Alp Dener, committed by GitHub

[PyTorch] Fix cuBLAS workspace leak in applications that initialize+destroy Userbuffers more than once (#1715)

Safeguarded the cuBLAS workspace expansion in initialize_ub() to avoid exponential growth across repeated initializations.
Signed-off-by: Alp Dener <adener@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
parent 8ace813c
@@ -230,9 +230,15 @@ def initialize_ub(
         flush=True,
     )
-    # Increase the workspace by the number of maximum concurrent streams
+    # Allocate cuBLAS workspace with expanded size for chunking in overlapping GEMM calls
+    global _cublas_workspace
+    if _cublas_workspace is None:
+        _cublas_workspace = get_workspace().repeat(_NUM_MAX_UB_STREAMS)
+    elif _cublas_workspace.numel() != get_cublas_workspace_size_bytes() * _NUM_MAX_UB_STREAMS:
+        # This ensures we don't do `.repeat()` on an already expanded workspace
+        _cublas_workspace = torch.empty(
+            get_cublas_workspace_size_bytes(), dtype=torch.uint8, device="cuda"
+        ).repeat(_NUM_MAX_UB_STREAMS)
     # Default buffer precision: AllGather buffers use fp8 when using fp8 recipe
     layers_all_gather_overlap = [
......