Unverified commit b7acb6e1, authored by Trevor Morris and committed by GitHub

Add TensorFlow module and extensions (#85)



* Add tensorflow build

Improve build instructions

Fix pybind enum usage

Fix Python_EXECUTABLE cmake var

Move scale_inv calculations to FW
Signed-off-by: Trevor Morris <tmorris@nvidia.com>
Signed-off-by: kaixih <kaixih@nvidia.com>

* Apply clang-format
Signed-off-by: Trevor Morris <tmorris@nvidia.com>
Signed-off-by: kaixih <kaixih@nvidia.com>

* Format python files
Signed-off-by: Trevor Morris <tmorris@nvidia.com>
Signed-off-by: kaixih <kaixih@nvidia.com>

* Add TF build CI
Signed-off-by: Trevor Morris <tmorris@nvidia.com>
Signed-off-by: kaixih <kaixih@nvidia.com>

* Lint checks
Signed-off-by: kaixih <kaixih@nvidia.com>

* Another round of lint checks
Signed-off-by: kaixih <kaixih@nvidia.com>

* Fix TF image tag
Signed-off-by: Trevor Morris <tmorris@nvidia.com>
Signed-off-by: kaixih <kaixih@nvidia.com>

* Use the existing recipe file
Signed-off-by: kaixih <kaixih@nvidia.com>

* Add license claim blocks
Signed-off-by: kaixih <kaixih@nvidia.com>

* Fix a bug about bias dtype conversion
Signed-off-by: kaixih <kaixih@nvidia.com>

* Add mnist example and cleanup old examples
Signed-off-by: kaixih <kaixih@nvidia.com>

* Autopep8 the tests
Signed-off-by: kaixih <kaixih@nvidia.com>

* Autopep8 the examples
Signed-off-by: kaixih <kaixih@nvidia.com>

* Add example in Readme
Signed-off-by: kaixih <kaixih@nvidia.com>

* Add unit tests and linting for TensorFlow
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add causal mask for non-fused case
Signed-off-by: kaixih <kaixih@nvidia.com>

* Fix the mismatched TF vs TE masks
Signed-off-by: kaixih <kaixih@nvidia.com>

* Addressing CI tests
Signed-off-by: kaixih <kaixih@nvidia.com>

* Run lint test
Signed-off-by: kaixih <kaixih@nvidia.com>

* Add missing import
Signed-off-by: kaixih <kaixih@nvidia.com>

* Skip fp8 tests for pre-Hopper GPUs
Signed-off-by: kaixih <kaixih@nvidia.com>

* Remove non-pytest tests
Signed-off-by: kaixih <kaixih@nvidia.com>

* Fix license
Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------
Signed-off-by: Trevor Morris <tmorris@nvidia.com>
Signed-off-by: kaixih <kaixih@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: kaixih <kaixih@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
parent 0963b288
/*************************************************************************
* Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
*
* See LICENSE for license information.
************************************************************************/
#include "tensorflow/cc/ops/standard_ops.h"
#include "tensorflow/compiler/xla/stream_executor/gpu/gpu_stream.h"
#include "tensorflow/core/framework/common_shape_fns.h"
#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/shape_inference.h"
namespace tensorflow {

// Exposes the CUDA stream of TensorFlow's GPU device as an int64 handle so
// that the Transformer Engine extension can launch its kernels on the same
// stream.
class GetStreamOp : public OpKernel {
 public:
  explicit GetStreamOp(OpKernelConstruction* context) : OpKernel(context) {}

  void Compute(OpKernelContext* ctx) override {
    Tensor* output = nullptr;
    OP_REQUIRES_OK(ctx, ctx->allocate_output("stream_id", {1}, &output));
    auto vec = output->vec<int64_t>();
    // Unwrap the StreamExecutor stream to the raw GPU stream handle and
    // return its pointer value as an integer.
    se::Stream* stream = ctx->op_device_context()->stream();
    auto gpu_stream = se::gpu::AsGpuStreamValue(stream);
    vec(0) = static_cast<int64_t>(reinterpret_cast<uintptr_t>(gpu_stream));
  }
};

REGISTER_OP("GetStream")
    .Output("stream_id: int64")
    .SetShapeFn(shape_inference::UnknownShape);
REGISTER_OP_NO_GRADIENT("GetStream");

// The stream handle lives in host memory so Python can read it directly.
REGISTER_KERNEL_BUILDER(
    Name("GetStream").Device(DEVICE_GPU).HostMemory("stream_id"), GetStreamOp);

}  // namespace tensorflow
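The op above hands TensorFlow's CUDA stream pointer to Python as a plain `int64`, so the pybind extension can launch its kernels on the same stream. A minimal sketch of how such an op might be consumed, assuming the compiled extension can be loaded with `tf.load_op_library` (the library filename is hypothetical; the module below wires this up through `get_stream_id()`):

import tensorflow as tf

# Hypothetical shared-object name; the real build emits its own library.
_ops = tf.load_op_library("libtransformer_engine_tf_ops.so")

def get_stream_id() -> int:
    """Return the CUDA stream pointer of the current GPU device."""
    # GetStream takes no inputs and returns a length-1 int64 tensor placed
    # in host memory, so it can be read back without a device copy.
    with tf.device("/GPU:0"):
        return int(_ops.get_stream()[0])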
# Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# See LICENSE for license information.
"""FP8 utilies for TransformerEngine"""
from contextlib import contextmanager
from typing import Optional, Dict, Any
import tensorflow as tf
import transformer_engine_tensorflow as tex
from transformer_engine.common.recipe import DelayedScaling, Format
_FP8_ENABLED = False
_FP8_RECIPE = None
_FP8_DISTRIBUTED_GROUP = None
_IS_FIRST_FP8_MODULE = False
_FP8_AUTOCAST_COUNTER = 0
_FP8_CURRENT_CONTEXT_ID = 0
_FP8_AUTOCAST_DEPTH = 0
_global_fp8_buffer = {}
_amax_forward_global_reduce_func = lambda: None
_buffer_delete_key_fwd = None
_buffer_delete_key_bwd = None
def get_meta_tensor_key(forward: bool = True) -> str:
"""Returns scaling key in `fp8_meta`."""
if forward:
return "scaling_fwd"
return "scaling_bwd"
def get_autocast_key(forward: bool = True) -> str:
"""Returns module position key in `fp8_meta`."""
if forward:
return "autocast_id_fwd"
return "autocast_id_bwd"
def get_amax_buffer_key(fp8_meta: Dict[str, Any], forward: bool = True) -> str:
"""Return a key in `_global_fp8_buffer` for the AMAX storage."""
if forward:
return f"FWD_AMAX_{fp8_meta['autocast_id_fwd']}"
return f"BWD_AMAX_{fp8_meta['autocast_id_bwd']}"
def set_amax_buffer_key_deletion(
fp8_meta: Dict[str, Any], forward: bool = True
) -> None:
"""Delete this amax key from global buffer during autocast end."""
if get_autocast_key(forward=forward) not in fp8_meta:
return
global _buffer_delete_key_fwd, _buffer_delete_key_bwd
if forward:
_buffer_delete_key_fwd = get_amax_buffer_key(fp8_meta, forward=forward)
else:
_buffer_delete_key_bwd = get_amax_buffer_key(fp8_meta, forward=forward)
def get_default_fp8_recipe():
    """Default FP8 recipe used when the user does not provide one:
    margin = 0, interval = 1, E4M3 format.
    """
    return DelayedScaling()
@contextmanager
def fp8_autocast(
enabled: bool = False,
fp8_recipe: Optional[DelayedScaling] = None,
) -> None:
"""
Context manager for FP8 usage.
.. code-block:: python
with fp8_autocast(enabled=True):
out = model(inp)
    .. note::
        Support for FP8 in the Dense layer of Transformer Engine is currently
        limited to tensors where both dimensions are divisible by 16. For the
        input to the full Transformer network, this typically requires padding
        the sequence length to a multiple of 16.
Parameters
----------
enabled: bool, default = `False`
whether or not to enable fp8
fp8_recipe: recipe.DelayedScaling, default = `None`
recipe used for FP8 training.
"""
global _FP8_ENABLED, _FP8_RECIPE, _FP8_DISTRIBUTED_GROUP, _FP8_AUTOCAST_DEPTH
global _IS_FIRST_FP8_MODULE, _FP8_AUTOCAST_COUNTER
global _global_fp8_buffer, _buffer_delete_key_fwd
fp8_state = (_FP8_ENABLED, _FP8_RECIPE, _FP8_DISTRIBUTED_GROUP)
try:
_FP8_ENABLED = enabled
_FP8_RECIPE = get_default_fp8_recipe() if fp8_recipe is None else fp8_recipe
if _FP8_AUTOCAST_DEPTH == 0:
_IS_FIRST_FP8_MODULE = True
_FP8_AUTOCAST_COUNTER += 1
_FP8_AUTOCAST_DEPTH += 1
yield
finally:
_FP8_ENABLED, _FP8_RECIPE, _FP8_DISTRIBUTED_GROUP = fp8_state
_IS_FIRST_FP8_MODULE = False
_FP8_AUTOCAST_DEPTH -= 1
if _FP8_AUTOCAST_DEPTH == 0:
if callable(_amax_forward_global_reduce_func):
_amax_forward_global_reduce_func()
delete_key_from_amax_buffer(forward=True)
def get_fp8_context_id() -> int:
"""Returns an ID for the current FP8 context."""
return _FP8_CURRENT_CONTEXT_ID
def set_fp8_context_id(ctx_id: int) -> None:
"""Sets the current FP8 context."""
global _FP8_CURRENT_CONTEXT_ID
_FP8_CURRENT_CONTEXT_ID = ctx_id
def new_fp8_context_id() -> int:
"""Returns global autocast counter as a proxy to be used
as the autocast ID for FP8 modules.
"""
return _FP8_AUTOCAST_COUNTER
def is_fp8_enabled():
"""Is FP8 enabled"""
return _FP8_ENABLED
def is_first_fp8_module():
    """Return `True` only on the first call within a given
    `fp8_autocast` context.
    """
global _IS_FIRST_FP8_MODULE
tmp = _IS_FIRST_FP8_MODULE
_IS_FIRST_FP8_MODULE = False
return tmp
def get_fp8_recipe():
"""Return the fp8 recipe"""
return _FP8_RECIPE
def _default_sf_compute(amax, scale, fp8_max, margin):
    """Default function to convert amax to a scaling factor."""
    # Choose the largest power-of-two scale such that amax * sf stays
    # within fp8_max, leaving `margin` bits of headroom.
    exp = tf.math.floor(tf.experimental.numpy.log2(fp8_max / amax)) - margin
    sf = tf.math.round(tf.math.pow(2.0, tf.math.abs(exp)))
    # Fall back to the previous scale when amax is zero or non-finite.
    sf = tf.where(amax > 0.0, sf, scale)
    sf = tf.where(tf.math.is_finite(amax), sf, scale)
    # A negative exponent means values must be scaled down, not up.
    sf = tf.where(exp < 0, 1.0 / sf, sf)
    return sf
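# Worked example (illustrative): with fp8_max = 448.0 (E4M3) and margin = 0,
# an amax of 3.0 gives exp = floor(log2(448 / 3)) = 7, so sf = 2**7 = 128.
# Scaled values then reach at most 3.0 * 128 = 384, within FP8 range; had
# exp been negative, sf would instead be inverted to scale values down.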
def _roll_and_zero_out(amax_history):
"""Update amax history and set next amax to zero."""
amax_history = tf.roll(amax_history, -1, 0)
zeros = tf.zeros(shape=amax_history[0].shape)
updated = tf.tensor_scatter_nd_update(amax_history, [[0]], [zeros])
return updated
@tf.function(jit_compile=True)
def _reduce_max_and_default_sf_compute(amax_history, scale, fp8_max, margin):
"""Get amax using max algorithm and compute scaling factor."""
amax = tf.reduce_max(amax_history, axis=0)
sf = _default_sf_compute(amax, scale, fp8_max, margin)
updated = _roll_and_zero_out(amax_history)
return updated, sf
@tf.function(jit_compile=True)
def _most_recent_and_default_sf_compute(amax_history, scale, fp8_max, margin):
"""Get amax using most-recent algorithm and compute scaling factor."""
amax = amax_history[0]
sf = _default_sf_compute(amax, scale, fp8_max, margin)
updated = _roll_and_zero_out(amax_history)
return updated, sf
def fused_amax_and_scale_update(
amax_history: tf.Variable,
scale: tf.Variable,
scale_inv: tf.Variable,
fp8_max: float,
margin: int,
amax_compute_algo: str,
):
"""Amax to scale conversion."""
if amax_compute_algo == "max":
updated, sf = _reduce_max_and_default_sf_compute(
amax_history, scale, fp8_max, margin
)
else:
assert amax_compute_algo == "most_recent"
updated, sf = _most_recent_and_default_sf_compute(
amax_history, scale, fp8_max, margin
)
amax_history.assign(updated)
scale.assign(sf)
scale_inv.assign(1.0 / sf)
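# Usage sketch (hypothetical shapes): per-tensor FP8 state could be stored as
#   amax_history = tf.Variable(tf.zeros([16, 1]))  # history window of 16
#   scale = tf.Variable(tf.ones([1]))
#   scale_inv = tf.Variable(tf.ones([1]))
# and refreshed once per training step with
#   fused_amax_and_scale_update(amax_history, scale, scale_inv,
#                               fp8_max=448.0, margin=0,
#                               amax_compute_algo="max")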
def amax_and_scale_update(
fp8_meta: Dict[str, Any],
fwd_update: bool,
) -> None:
"""Updates fp8 amaxes/scales for fwd | bwd."""
amax_compute = fp8_meta["recipe"].amax_compute_algo
sf_compute = fp8_meta["recipe"].scaling_factor_compute_algo
fp8_meta_tensor_key = "scaling_fwd" if fwd_update else "scaling_bwd"
fp8_max_key = "fp8_max_fwd" if fwd_update else "fp8_max_bwd"
if not callable(amax_compute) and sf_compute is None:
fused_amax_and_scale_update(
fp8_meta[fp8_meta_tensor_key]["amax_history"],
fp8_meta[fp8_meta_tensor_key]["scale"],
fp8_meta[fp8_meta_tensor_key]["scale_inv"],
fp8_meta[fp8_max_key],
fp8_meta["recipe"].margin,
fp8_meta["recipe"].amax_compute_algo,
)
else:
raise ValueError(
"We only support the fp8 recipe with 'max' or 'most_recent' "
"amax_compute_algo and default scaling_factor_compute_algo at this "
"moment."
)
def get_fp8_te_dtype(fp8_recipe: DelayedScaling, fprop_tensor: bool = True):
"""Get fp8 data type according to recipe and tensor"""
if fp8_recipe.fp8_format == Format.E4M3 or (
fp8_recipe.fp8_format == Format.HYBRID and fprop_tensor
):
return tex.DType.kFloat8E4M3
return tex.DType.kFloat8E5M2
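# For example, with Format.HYBRID the forward-pass tensors use E4M3 (more
# mantissa bits for activations and weights) while gradient tensors use
# E5M2 (more exponent range for large gradient magnitudes).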
def delete_key_from_amax_buffer(forward: bool = True) -> None:
"""Delete the key from global amax buffer."""
global _global_fp8_buffer, _buffer_delete_key_fwd, _buffer_delete_key_bwd
if forward:
if (
_buffer_delete_key_fwd is not None
and _buffer_delete_key_fwd in _global_fp8_buffer
):
del _global_fp8_buffer[_buffer_delete_key_fwd]
else:
if (
_buffer_delete_key_bwd is not None
and _buffer_delete_key_bwd in _global_fp8_buffer
):
del _global_fp8_buffer[_buffer_delete_key_bwd]
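Putting the pieces of this file together, a training step might look like the following sketch. Only `fp8_autocast`, `DelayedScaling`, and `Format` are defined above; the `te.Dense` layer and the `transformer_engine.tensorflow` import path are assumptions based on this commit's description.

import tensorflow as tf
import transformer_engine.tensorflow as te  # assumed module path
from transformer_engine.common.recipe import DelayedScaling, Format

recipe = DelayedScaling(margin=0, interval=1, fp8_format=Format.HYBRID)
dense = te.Dense(1024)  # hypothetical TE layer from this commit

@tf.function
def train_step(x):
    with tf.GradientTape() as tape:
        # GEMMs inside TE modules run in FP8 under this context.
        with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
            y = dense(x)
        loss = tf.reduce_sum(y)
    grads = tape.gradient(loss, dense.trainable_variables)
    return loss, grads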
# Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# See LICENSE for license information.
"""XLA functions and JIT utilities"""
from typing import Callable
import tensorflow as tf
@tf.function(jit_compile=True)
def _bgrad_dgelu_fused(grad_output, inp):
"""Bgrad-Dgelu fused"""
x = inp
tanh_out = tf.math.tanh(0.79788456 * x * (1 + 0.044715 * x * x))
# sqrt(2/pi) * 3 * 0.044715 -> 0.1070322243
ff = 0.5 * x * (
(1 - tanh_out * tanh_out) * (0.79788456 + 0.1070322243 * x * x)
) + 0.5 * (1 + tanh_out)
dgelu = ff * grad_output
bgrad = tf.math.reduce_sum(dgelu, axis=0)
return bgrad, dgelu
def bgrad_dgelu_fused(grad_output, inp):
"""Bgrad-Dgelu fused"""
return _bgrad_dgelu_fused(grad_output, inp)
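# The expression in _bgrad_dgelu_fused is the derivative of the tanh-based
# GELU approximation
#   gelu(x) = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x**3)))
# with sqrt(2/pi) ~= 0.79788456; bgrad additionally reduces over axis 0
# (the flattened batch dimension) to produce the bias gradient.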
def bias_dropout_add(
x: tf.Tensor,
bias: tf.Variable,
residual: tf.Tensor,
prob: float,
training: bool,
) -> tf.Tensor:
"""dropout(inp + bias) + residual"""
    # TODO(kaixih): We use stateless_dropout with a fixed seed mainly for
    # debugging; this should eventually accept a user-provided random seed.
out = (
tf.nn.experimental.stateless_dropout(
x + bias,
rate=prob,
seed=[1, 0],
)
if training
else x + bias
)
out = residual + out
return out
def get_bias_dropout_add(training: bool) -> Callable:
"""bias_dropout_add based on training or not"""
def _bias_dropout_add(x, bias, residual, prob):
return bias_dropout_add(x, bias, residual, prob, training)
return _bias_dropout_add
@tf.function(jit_compile=True)
def bias_dropout_add_fused_train_(
x: tf.Tensor,
bias: tf.Variable,
residual: tf.Tensor,
prob: float,
) -> tf.Tensor:
"""Jit fused bias_dropout_add for training"""
return bias_dropout_add(x, bias, residual, prob, True)
def bias_dropout_add_fused_train(
x: tf.Tensor,
bias: tf.Variable,
residual: tf.Tensor,
prob: float,
) -> tf.Tensor:
"""Jit fused bias_dropout_add for training"""
return bias_dropout_add_fused_train_(x, bias, residual, prob)
@tf.function(jit_compile=True)
def bias_dropout_add_fused_inference_(
x: tf.Tensor,
bias: tf.Variable,
residual: tf.Tensor,
prob: float,
) -> tf.Tensor:
"""Jit fused bias_dropout_add for inference"""
return bias_dropout_add(x, bias, residual, prob, False)
def bias_dropout_add_fused_inference(
x: tf.Tensor,
bias: tf.Variable,
residual: tf.Tensor,
prob: float,
) -> tf.Tensor:
"""Jit fused bias_dropout_add for inference"""
return bias_dropout_add_fused_inference_(x, bias, residual, prob)
# Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# See LICENSE for license information.
"""Fused scaled masked softmax functions"""
from typing import Callable
import os
import transformer_engine_tensorflow as tex
import tensorflow as tf
from .module import get_stream_id
THREADS_PER_WARP = 32
THREADS_PER_BLOCK = 128
_default_causal_mask = {}
def _get_default_causal_mask(sq: int) -> tf.Tensor:
    """Return the causal (lower-triangular keep) mask for softmax input."""
    if sq not in _default_causal_mask:
        # In TF, the mask specifies 1 to keep and 0 to mask out. In "causal"
        # mode we therefore keep the lower triangle, so each position only
        # attends to itself and earlier positions.
        mask_operator = tf.linalg.LinearOperatorLowerTriangular(
            tf.ones((sq, sq), dtype=tf.bool))
        mask = mask_operator.to_dense()
        _default_causal_mask[sq] = mask
    return _default_causal_mask[sq]
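# For example, _get_default_causal_mask(3) returns (True = keep):
#   [[ True, False, False],
#    [ True,  True, False],
#    [ True,  True,  True]]
# so each query position attends only to itself and earlier key positions.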
class FusedScaleMaskSoftmax(tf.keras.Model):
"""
fused operation: scaling + mask + softmax
Arguments:
attn_mask_type: attention mask type (pad or causal)
mask_func: mask function to be applied.
        softmax_in_fp32: if true, softmax is performed in fp32 precision.
scale: scaling factor used in input tensor scaling.
"""
def __init__(
self,
attn_mask_type: str,
mask_func: Callable,
softmax_in_fp32: bool,
scale: float,
) -> None:
super().__init__()
self.attn_mask_type = attn_mask_type
self.scaled_masked_softmax_fusion = bool(
int(os.getenv("NVTE_MASKED_SOFTMAX_FUSION", "1"))
)
self.mask_func = mask_func
self.softmax_in_fp32 = softmax_in_fp32
self.scale = scale
self.stream = get_stream_id()
assert (
self.scale is None or softmax_in_fp32
), "softmax should be in fp32 when scaled"
def __call__(self, inp: tf.Tensor, mask: tf.Tensor) -> tf.Tensor:
"""FusedScaleMaskSoftmax fprop"""
# [b, np, sq, sk]
assert len(inp.shape) == 4
self.input_in_fp16 = inp.dtype == tf.float16
self.input_in_bf16 = inp.dtype == tf.bfloat16
self.input_in_float16 = self.input_in_fp16 or self.input_in_bf16
if self.is_kernel_available(*inp.shape):
return self.forward_fused_softmax(inp, mask)
return self.forward_tf_softmax(inp, mask)
def is_kernel_available(self, b: int, np: int, sq: int, sk: int) -> bool:
"""Check FusedScaleMaskSoftmax kernel availability based on size"""
attn_batches = b * np
        if (
            self.scaled_masked_softmax_fusion  # user wants to fuse
            and self.input_in_float16  # input must be fp16 or bf16
            and 16 < sk <= 4096  # sk must be in (16, 4096]
            and sq % 4 == 0  # sq must be a multiple of 4
            and attn_batches % 4 == 0  # np * b must be a multiple of 4
        ):
if 0 <= sk <= 4096:
batch_per_block = self.get_batch_per_block(int(sk))
if self.attn_mask_type == "causal":
if attn_batches % batch_per_block == 0:
return True
else:
if sq % batch_per_block == 0:
return True
return False
@tf.custom_gradient
def scaled_masked_softmax(self, x: tf.Tensor, mask: tf.Tensor,
scale: float):
"""Scaled masked softmax."""
y = tex.scaled_masked_softmax_forward(x, mask, scale, self.stream)
def grad_fn(upstream):
dx = tex.scaled_masked_softmax_backward(upstream, y, scale,
self.stream)
return dx, None, None
return y, grad_fn
@tf.custom_gradient
def scaled_softmax(self, x: tf.Tensor, scale: float):
"""Scaled softmax."""
y = tex.scaled_softmax_forward(x, scale, self.stream)
def grad_fn(upstream):
dx = tex.scaled_softmax_backward(upstream, y, scale, self.stream)
return dx, None
return y, grad_fn
@tf.custom_gradient
def scaled_upper_triang_masked_softmax(self, x: tf.Tensor, scale: float):
"""Scaled upper triangular masked softmax."""
y = tex.scaled_upper_triang_masked_softmax_forward(x, scale,
self.stream)
def grad_fn(upstream):
dx = tex.scaled_upper_triang_masked_softmax_backward(
upstream, y, scale, self.stream
)
return dx, None
return y, grad_fn
def forward_fused_softmax(
self,
inp: tf.Tensor,
mask: tf.Tensor,
) -> tf.Tensor:
"""Fused masked softmax kernel"""
sq, sk = inp.shape[2], inp.shape[3]
scale = self.scale if self.scale is not None else 1.0
        if self.attn_mask_type == "causal":
            assert sq == sk, "causal mask is only for self attention"
            # Flatten the 4D input to a 3D tensor (attn_batches, sq, sk) for
            # the upper-triangular kernel, then restore the original shape.
            orig_shape = inp.shape
            inp = tf.reshape(inp, (-1, sq, sk))
            probs = self.scaled_upper_triang_masked_softmax(inp, scale)
            return tf.reshape(probs, orig_shape)
# input is 4D tensor (b, np, sq, sk)
if mask is not None:
            # The mask convention in the TE kernels differs from TF's: in TE,
            # 1 means mask out and 0 means keep.
            mask = tf.math.logical_not(mask)
ndims = len(mask.shape)
assert ndims <= 4, "mask ndims should be <= 4"
if len(mask.shape) < 4:
# Broadcasting the first dims of mask to match the input ndims.
broadcast_shape = [1] * (4 - ndims) + mask.shape[:]
mask = tf.reshape(mask, broadcast_shape)
return self.scaled_masked_softmax(inp, mask, scale)
return self.scaled_softmax(inp, scale)
def forward_tf_softmax(self, inp: tf.Tensor, mask: tf.Tensor) -> tf.Tensor:
"""Framework softmax"""
if self.input_in_float16 and self.softmax_in_fp32:
inp = tf.cast(inp, tf.float32)
if self.scale is not None:
inp = inp * self.scale
if self.attn_mask_type == "causal":
mask = _get_default_causal_mask(inp.shape[2])
mask_output = self.mask_func(inp, mask) if mask is not None else inp
probs = tf.nn.softmax(mask_output, axis=-1)
        if self.input_in_float16 and self.softmax_in_fp32:
            if self.input_in_fp16:
                probs = tf.cast(probs, tf.float16)
            else:
                probs = tf.cast(probs, tf.bfloat16)
return probs
    @staticmethod
    def get_batch_per_block(key_seq_len: int) -> int:
        """Softmax utility"""
        pow2 = 1 << (key_seq_len - 1).bit_length()
        warp_size = pow2 if pow2 < THREADS_PER_WARP else THREADS_PER_WARP
        batches_per_warp = 2 if pow2 <= 128 else 1
        # Integer division: the kernel launches whole warps per block.
        warps_per_block = THREADS_PER_BLOCK // warp_size
        batches_per_block = warps_per_block * batches_per_warp
        return batches_per_block
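# Worked example: key_seq_len = 384 rounds up to pow2 = 512, so
# warp_size = 32, batches_per_warp = 1 (pow2 > 128), and
# warps_per_block = 128 // 32 = 4, giving 4 batches per CUDA block.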
# Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# See LICENSE for license information.
"""Utility functions for Transformer Engine modules"""
import tensorflow as tf
def attention_mask_func(
    attention_scores: tf.Tensor, attention_mask: tf.Tensor
) -> tf.Tensor:
    """Apply the attention mask: keep scores where the mask is True and
    replace masked positions with a large negative value."""
    return tf.where(attention_mask, attention_scores, -10000.0)
def ensure_divisibility(numerator: int, denominator: int) -> None:
"""Ensure that numerator is divisible by the denominator."""
assert (
numerator % denominator == 0
), f"{numerator} is not divisible by {denominator}"
def divide(numerator: int, denominator: int) -> int:
"""Ensure that numerator is divisible by the denominator and return
the division value."""
ensure_divisibility(numerator, denominator)
return numerator // denominator
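# Example: divide(4096, 16) returns 256, while divide(10, 3) raises an
# AssertionError because 10 is not divisible by 3.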