Unverified commit c449c6cf authored by Zhongbo Zhu, committed by GitHub
Browse files

[PyTorch][MOE] Tentative Fix For Replacing from_blob with empty for experts...


[PyTorch][MOE] Tentative Fix For Replacing from_blob with empty for experts receiving zero tokens (#2134)

use torch empty for empty shape instead of from_blob
Signed-off-by: zhongboz <zhongboz@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
parent 06a38cc0
...@@ -205,11 +205,8 @@ std::tuple<std::vector<py::object>, std::vector<TensorWrapper>> bulk_allocate_fp ...@@ -205,11 +205,8 @@ std::tuple<std::vector<py::object>, std::vector<TensorWrapper>> bulk_allocate_fp
auto make_torch_view = [](std::shared_ptr<at::Tensor> &buffer, const std::vector<size_t> &shape, auto make_torch_view = [](std::shared_ptr<at::Tensor> &buffer, const std::vector<size_t> &shape,
size_t offset, at::ScalarType dtype) -> at::Tensor { size_t offset, at::ScalarType dtype) -> at::Tensor {
std::vector<int64_t> shape_int64(shape.begin(), shape.end()); std::vector<int64_t> shape_int64(shape.begin(), shape.end());
// in the case where full buffer is empty because local rank receives no tokens for all the experts bool is_empty_shape = product(shape) == 0;
// then the data_ptr is nullptr, we need to return an empty tensor instead of calling from_blob if (buffer->data_ptr<uint8_t>() == nullptr || is_empty_shape) {
// but in the case where some experts receive tokens, some not, we want to leverage from_blob
// as much as possible to avoid CPU overhead
if (buffer->data_ptr<uint8_t>() == nullptr) {
return at::empty(shape_int64, at::device(at::kCUDA).dtype(dtype)); return at::empty(shape_int64, at::device(at::kCUDA).dtype(dtype));
} }
return at::from_blob( return at::from_blob(
...@@ -359,11 +356,8 @@ std::tuple<std::vector<py::object>, std::vector<TensorWrapper>> bulk_allocate_mx ...@@ -359,11 +356,8 @@ std::tuple<std::vector<py::object>, std::vector<TensorWrapper>> bulk_allocate_mx
auto make_torch_view = [](std::shared_ptr<at::Tensor> &buffer, const std::vector<size_t> &shape, auto make_torch_view = [](std::shared_ptr<at::Tensor> &buffer, const std::vector<size_t> &shape,
size_t offset, at::ScalarType dtype) -> at::Tensor { size_t offset, at::ScalarType dtype) -> at::Tensor {
std::vector<int64_t> shape_int64(shape.begin(), shape.end()); std::vector<int64_t> shape_int64(shape.begin(), shape.end());
// in the case where full buffer is empty because local rank receives no tokens for all the experts bool is_empty_shape = product(shape) == 0;
// then the data_ptr is nullptr, we need to return an empty tensor instead of calling from_blob if (buffer->data_ptr<uint8_t>() == nullptr || is_empty_shape) {
// but in the case where some experts receive tokens, some not, we want to leverage from_blob
// as much as possible to avoid CPU overhead
if (buffer->data_ptr<uint8_t>() == nullptr) {
return at::empty(shape_int64, at::device(at::kCUDA).dtype(dtype)); return at::empty(shape_int64, at::device(at::kCUDA).dtype(dtype));
} }
return at::from_blob( return at::from_blob(
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or sign in to comment