Unverified Commit 85a91997 authored by Kirthi Shankar Sivamani, committed by GitHub

Generalize quantization APIs for FP8/FP4/.. recipes (#2256)



* Initial API change
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Change all imports and api
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* format
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix typo
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix recipe tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix more tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix docs, tests, and make Jax change as well
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Change internal uses of fp8_autocast
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Address nits
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* rename file
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* CG function, and small test fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Change instances of make_graphed_callables internally
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix distributed tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Review
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Review
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix test and add more docs
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Cleanup test imports and minimize internal file imports
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Make is_bf16_available public
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Better docs and better api
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* format
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Apply suggestions from code review
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* fix nvfp4 test
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
parent ca6fedcf
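In short, this commit replaces the FP8-specific entry points with recipe-agnostic ones: `fp8_autocast` becomes `autocast` (with `fp8_recipe` renamed to `recipe`), `fp8_model_init` becomes `quantized_model_init`, and the availability checks move from `FP8GlobalStateManager` static methods to public top-level functions. A minimal before/after sketch based on the call sites changed in the diff below; `DelayedScaling` stands in for any supported recipe, and the shapes are illustrative only:

    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import DelayedScaling

    recipe = DelayedScaling()
    x = torch.randn([16, 16], device="cuda")

    # Old API (removed call sites in this diff):
    #   with te.fp8_model_init(enabled=True, recipe=recipe):
    #       linear = te.Linear(16, 16)
    #   with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    #       y = linear(x)

    # New API (added call sites in this diff):
    with te.quantized_model_init(enabled=True, recipe=recipe):
        linear = te.Linear(16, 16)
    with te.autocast(enabled=True, recipe=recipe):
        y = linear(x)

    # Availability checks are now public functions:
    fp8_available, reason = te.is_fp8_available(return_reason=True)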
@@ -5,7 +5,7 @@
 import pytest
 import torch
-import transformer_engine.pytorch as te
+import transformer_engine.pytorch
 import transformer_engine_torch as tex
 from transformer_engine.pytorch.optimizers import MultiTensorApply
...
@@ -12,18 +12,15 @@ import torch
 import torch.nn as nn
 from torch.nn import Parameter
-from transformer_engine.pytorch.fp8 import (
-    FP8GlobalStateManager,
-    fp8_autocast,
-    fp8_model_init,
-)
+from transformer_engine.pytorch.quantization import FP8GlobalStateManager
 from transformer_engine.pytorch.utils import (
     init_method_normal,
     scaled_init_method_normal,
     attention_mask_func,
-    is_bf16_compatible,
 )
 from transformer_engine.pytorch import (
+    autocast,
+    quantized_model_init,
     DotProductAttention,
     LayerNormLinear,
     LayerNormMLP,
@@ -35,26 +32,28 @@ from transformer_engine.pytorch import (
     LayerNorm,
     Fp8Padding,
     Fp8Unpadding,
-)
-from transformer_engine.pytorch.distributed import checkpoint as te_checkpoint
-from transformer_engine.pytorch.cpp_extensions import general_gemm, general_grouped_gemm
-from transformer_engine.pytorch.cpp_extensions.fused_attn import FusedAttnBackend
-from transformer_engine.pytorch.tensor.float8_tensor import (
     Float8Quantizer,
     Float8CurrentScalingQuantizer,
+    MXFP8Quantizer,
+    get_device_compute_capability,
+    is_fp8_available,
+    is_mxfp8_available,
+    is_fp8_block_scaling_available,
+    is_bf16_available,
 )
-from transformer_engine.pytorch.tensor.mxfp8_tensor import MXFP8Quantizer
+from transformer_engine.pytorch import checkpoint as te_checkpoint
+from transformer_engine.pytorch.cpp_extensions import general_gemm, general_grouped_gemm
+from transformer_engine.pytorch.cpp_extensions.fused_attn import FusedAttnBackend
 from transformer_engine.pytorch.module.base import get_multi_stream_cublas_workspace, get_workspace
-from transformer_engine.pytorch.utils import get_device_compute_capability
 from transformer_engine.common import recipe
 import transformer_engine_torch as tex
 from utils import ModelConfig, reset_rng_states, get_available_attention_backends

 # Only run FP8 tests on supported devices.
-fp8_available, reason_for_no_fp8 = FP8GlobalStateManager.is_fp8_available()
-mxfp8_available, reason_for_no_mxfp8 = FP8GlobalStateManager.is_mxfp8_available()
-fp8_block_scaling_available, _ = FP8GlobalStateManager.is_fp8_block_scaling_available()
+fp8_available, reason_for_no_fp8 = is_fp8_available(return_reason=True)
+mxfp8_available, reason_for_no_mxfp8 = is_mxfp8_available(return_reason=True)
+fp8_block_scaling_available = is_fp8_block_scaling_available()
 sm_80plus = get_device_compute_capability() >= (8, 0)
@@ -77,7 +76,7 @@ module_inference = ["TransformerLayer", "MultiheadAttention"]
 input_formats_inference = ["sbhd", "bshd"]
 param_types = [torch.float32, torch.float16]
-if is_bf16_compatible():  # bf16 requires sm_80 or higher
+if is_bf16_available():  # bf16 requires sm_80 or higher
     param_types.append(torch.bfloat16)
 batch_sizes = [1, 2]
@@ -548,7 +547,7 @@ def _test_e2e_selective_recompute(
     init_method = init_method_normal(sigma)
     output_layer_init_method = scaled_init_method_normal(sigma, config.num_layers)
-    with fp8_model_init(enabled=fp8 and fp8_model_params, recipe=recipe):
+    with quantized_model_init(enabled=fp8 and fp8_model_params, recipe=recipe):
         block = TransformerLayer(
             config.hidden_size,
             4 * config.hidden_size,
@@ -575,7 +574,7 @@ def _test_e2e_selective_recompute(
     te_inp_hidden_states.retain_grad()
     te_inp_attn_mask = get_causal_attn_mask(config.max_seqlen_q)
-    with fp8_autocast(enabled=fp8, fp8_recipe=recipe):
+    with autocast(enabled=fp8, recipe=recipe):
         te_out = block(
             te_inp_hidden_states,
             attention_mask=te_inp_attn_mask,
@@ -637,7 +636,7 @@ def _test_e2e_full_recompute(
     init_method = init_method_normal(sigma)
     output_layer_init_method = scaled_init_method_normal(sigma, config.num_layers)
-    with fp8_model_init(enabled=fp8 and fp8_model_params, recipe=recipe):
+    with quantized_model_init(enabled=fp8 and fp8_model_params, recipe=recipe):
         block = TransformerLayer(
             config.hidden_size,
             4 * config.hidden_size,
@@ -665,7 +664,7 @@ def _test_e2e_full_recompute(
     te_inp_hidden_states.retain_grad()
     te_inp_attn_mask = get_causal_attn_mask(config.max_seqlen_q)
-    with fp8_autocast(enabled=fp8, fp8_recipe=recipe):
+    with autocast(enabled=fp8, recipe=recipe):
         if recompute:
             te_out = te_checkpoint(
                 block,
@@ -1088,7 +1087,7 @@ def _test_granular_accuracy(block, bs, dtype, config, delay_wgrad_compute=False,
     )
     inp_hidden_states.retain_grad()
-    with fp8_autocast(enabled=fp8, fp8_recipe=recipe):
+    with autocast(enabled=fp8, recipe=recipe):
         out = block(inp_hidden_states)
         if isinstance(out, (List, Tuple)):
             out = out[0]
@@ -1304,7 +1303,7 @@ def test_linear_accuracy_save_original_input(dtype, model, recipe):
     if config.max_seqlen_q % 16 != 0 and fp8:
         pytest.skip("FP8 requires sequence length to be divisible by 16.")
-    with fp8_model_init(enabled=fp8 and fp8_model_params, recipe=recipe):
+    with quantized_model_init(enabled=fp8 and fp8_model_params, recipe=recipe):
         te_linear_ref = Linear(
             config.hidden_size,
             4 * config.hidden_size,
@@ -1758,7 +1757,7 @@ def _test_grouped_linear_accuracy(
     else:
         m_splits = torch.tensor([config.max_seqlen_q])
-    with fp8_autocast(enabled=fp8, fp8_recipe=recipe):
+    with autocast(enabled=fp8, recipe=recipe):
         if isinstance(block, GroupedLinear):
             m_splits = m_splits * bs
             out = block(inp_hidden_states, m_splits.tolist())
@@ -1820,7 +1819,7 @@ def test_grouped_linear_accuracy(
     if config.max_seqlen_q % 16 != 0 and fp8:
         pytest.skip("FP8 requires sequence length to be divisible by 16.")
-    with fp8_model_init(enabled=fp8 and fp8_model_params, recipe=recipe):
+    with quantized_model_init(enabled=fp8 and fp8_model_params, recipe=recipe):
         grouped_linear = GroupedLinear(
             num_gemms,
             config.hidden_size,
@@ -1956,7 +1955,7 @@ def test_grouped_linear_accuracy_save_original_input(
     if config.max_seqlen_q % 16 != 0 and fp8:
         pytest.skip("FP8 requires sequence length to be divisible by 16.")
-    with fp8_model_init(enabled=fp8 and fp8_model_params, recipe=recipe):
+    with quantized_model_init(enabled=fp8 and fp8_model_params, recipe=recipe):
         grouped_linear = GroupedLinear(
             num_gemms,
             config.hidden_size,
@@ -2110,7 +2109,7 @@ def _test_padding_grouped_linear_accuracy(block, num_gemms, bs, dtype, config, r
     m_splits = _generate_random_numbers(num_gemms, config.max_seqlen_q * bs)
-    with fp8_autocast(enabled=fp8, fp8_recipe=recipe):
+    with autocast(enabled=fp8, recipe=recipe):
         if isinstance(block, TorchGroupedLinearWithPadding):
             out = block(inp_hidden_states, m_splits)
         else:
@@ -2158,7 +2157,7 @@ def test_padding_grouped_linear_accuracy(
     if config.max_seqlen_q % 16 != 0 and fp8:
         pytest.skip("FP8 requires sequence length to be divisible by 16.")
-    with fp8_model_init(enabled=fp8 and fp8_model_params, recipe=recipe):
+    with quantized_model_init(enabled=fp8 and fp8_model_params, recipe=recipe):
         grouped_linear = TorchGroupedLinearWithPadding(
             num_gemms,
             config.hidden_size,
@@ -2169,7 +2168,7 @@ def test_padding_grouped_linear_accuracy(
         fp8=fp8,
     ).eval()
-    with fp8_model_init(enabled=fp8 and fp8_model_params, recipe=recipe):
+    with quantized_model_init(enabled=fp8 and fp8_model_params, recipe=recipe):
         ref_grouped_linear = GroupedLinear(
             num_gemms,
             config.hidden_size,
@@ -2229,7 +2228,7 @@ def test_padding_grouped_linear_accuracy_save_original_input(
     if config.max_seqlen_q % 16 != 0 and fp8:
         pytest.skip("FP8 requires sequence length to be divisible by 16.")
-    with fp8_model_init(enabled=fp8 and fp8_model_params, recipe=recipe):
+    with quantized_model_init(enabled=fp8 and fp8_model_params, recipe=recipe):
         grouped_linear = TorchGroupedLinearWithPadding(
             num_gemms,
             config.hidden_size,
@@ -2240,7 +2239,7 @@ def test_padding_grouped_linear_accuracy_save_original_input(
         fp8=fp8,
     ).eval()
-    with fp8_model_init(enabled=fp8 and fp8_model_params, recipe=recipe):
+    with quantized_model_init(enabled=fp8 and fp8_model_params, recipe=recipe):
         ref_grouped_linear = GroupedLinear(
             num_gemms,
             config.hidden_size,
@@ -2390,7 +2389,7 @@ def _test_gpt_fp8_parameters(bs, dtype, config, fp8_model_params, recipe):
     init_method = init_method_normal(sigma)
     output_layer_init_method = scaled_init_method_normal(sigma, config.num_layers)
-    with fp8_model_init(enabled=fp8_model_params, recipe=recipe):
+    with quantized_model_init(enabled=fp8_model_params, recipe=recipe):
         block = TransformerLayer(
             config.hidden_size,
             4 * config.hidden_size,
@@ -2417,7 +2416,7 @@ def _test_gpt_fp8_parameters(bs, dtype, config, fp8_model_params, recipe):
     te_inp_hidden_states.retain_grad()
     te_inp_attn_mask = get_causal_attn_mask(config.max_seqlen_q)
-    with fp8_autocast(enabled=True, fp8_recipe=recipe):
+    with autocast(enabled=True, recipe=recipe):
         te_out = block(te_inp_hidden_states, attention_mask=te_inp_attn_mask)
     loss = te_out.sum()
     loss.backward()
...
@@ -34,7 +34,7 @@ import transformer_engine.pytorch as te
 from transformer_engine.common import recipe
 import transformer_engine_torch as tex
 from transformer_engine.pytorch.export import is_in_onnx_export_mode, te_translation_table
-from transformer_engine.pytorch.fp8 import FP8GlobalStateManager
+from transformer_engine.pytorch.quantization import FP8GlobalStateManager
 from transformer_engine.pytorch.utils import get_default_init_method
 import tensorrt as trt
@@ -57,8 +57,8 @@ NVTE_TEST_ARTIFACTS_DIR = NVTE_TEST_ARTIFACTS_DIR or os.path.join(
 # The directory where this file is stored.
 TESTS_DIR = os.path.dirname(os.path.abspath(__file__))
-fp8_available, reason_for_no_fp8 = FP8GlobalStateManager.is_fp8_available()
-mxfp8_available, reason_for_no_mxfp8 = FP8GlobalStateManager.is_mxfp8_available()
+fp8_available, reason_for_no_fp8 = te.is_fp8_available(return_reason=True)
+mxfp8_available, reason_for_no_mxfp8 = te.is_mxfp8_available(return_reason=True)
 fp8_recipes = []
 if mxfp8_available:
@@ -178,8 +178,8 @@ def do_export(
     input_names = input_names or ["input"]
     output_names = output_names or ["output"]
-    with torch.inference_mode(), te.fp8_autocast(
-        enabled=fp8_recipe is not None, fp8_recipe=fp8_recipe
+    with torch.inference_mode(), te.autocast(
+        enabled=fp8_recipe is not None, recipe=fp8_recipe
     ), warnings.catch_warnings():
         warnings.filterwarnings(action="ignore", category=torch.jit.TracerWarning, module=r".*")
@@ -233,8 +233,8 @@ def te_infer(
     fp8_recipe: recipe.Recipe,
 ):
     """Transformer Engine forward propagation."""
-    with torch.inference_mode(), te.fp8_autocast(
-        enabled=is_fp8, fp8_recipe=fp8_recipe
+    with torch.inference_mode(), te.autocast(
+        enabled=is_fp8, recipe=fp8_recipe
     ), warnings.catch_warnings():
         te_outputs = model(*inps if isinstance(inps, tuple) else (inps,))
         if not isinstance(te_outputs, tuple):
@@ -440,7 +440,7 @@ def _test_export_linear(
     bias_str = "_bias" if use_bias else ""
     high_prec_str = dtype2str(precision)
     fname = f"te.linear{fp8_str}{bias_str}{high_prec_str}.onnx"
-    with te.fp8_autocast(enabled=fp8_recipe is not None, fp8_recipe=fp8_recipe):
+    with te.autocast(enabled=fp8_recipe is not None, recipe=fp8_recipe):
         model = Test_Linear(in_features, out_features, use_bias, return_bias, precision).to(
             device="cuda"
         )
@@ -500,7 +500,7 @@ def _test_export_layernorm(
     fname = f"te.layernorm_linear{fp8_str}{high_prec_str}.onnx"
     with torch.no_grad():
-        with te.fp8_autocast(enabled=fp8_recipe is not None, fp8_recipe=fp8_recipe):
+        with te.autocast(enabled=fp8_recipe is not None, recipe=fp8_recipe):
             layernorm_cls = te.LayerNorm if normalization == "LayerNorm" else te.RMSNorm
             model = layernorm_cls(
                 hidden_size,
@@ -568,7 +568,7 @@ def _test_export_layernorm_linear(
     fname = f"te.layernorm_linear{fp8_str}{bias_str}{high_prec_str}.onnx"
     with torch.no_grad():
-        with te.fp8_autocast(enabled=fp8_recipe is not None, fp8_recipe=fp8_recipe):
+        with te.autocast(enabled=fp8_recipe is not None, recipe=fp8_recipe):
             model = te.LayerNormLinear(
                 hidden_size,
                 3 * hidden_size,
@@ -654,7 +654,7 @@ def _test_export_layernorm_mlp(
     bias_str = "_bias" if use_bias else ""
     high_prec_str = dtype2str(precision)
     fname = f"te.layernorm_mlp{fp8_str}{bias_str}{high_prec_str}_{activation}.onnx"
-    with te.fp8_autocast(enabled=fp8_recipe is not None, fp8_recipe=fp8_recipe):
+    with te.autocast(enabled=fp8_recipe is not None, recipe=fp8_recipe):
         model = te.LayerNormMLP(
             hidden_size,
             ffn_hidden_size,
@@ -1160,13 +1160,13 @@ def test_trt_integration(fp8_recipe: recipe.Recipe):
     inps = (torch.randn([16, 16, 128], device="cuda", requires_grad=False),)
-    with te.fp8_autocast(enabled=fp8_recipe is not None, fp8_recipe=fp8_recipe):
+    with te.autocast(enabled=fp8_recipe is not None, recipe=fp8_recipe):
         out_ref = model(*inps)
     onnx_fd, onnx_path = tempfile.mkstemp(suffix=".onnx")
     os.close(onnx_fd)
     try:
-        with te.fp8_autocast(enabled=fp8_recipe is not None, fp8_recipe=fp8_recipe):
+        with te.autocast(enabled=fp8_recipe is not None, recipe=fp8_recipe):
             with te.onnx_export(enabled=True):
                 torch.onnx.export(
                     model,
...
@@ -4,7 +4,7 @@
 import random
 import torch
-from transformer_engine.pytorch.cross_entropy import parallel_cross_entropy
+from transformer_engine.pytorch import parallel_cross_entropy
 from utils import dtype_tols
...
@@ -8,6 +8,7 @@ import torch
 import pytest
 from typing import Dict, List
+import transformer_engine.pytorch as te
 from transformer_engine.common import recipe
 from transformer_engine.pytorch import (
     moe_permute as te_permute,
@@ -16,14 +17,12 @@ from transformer_engine.pytorch import (
     moe_sort_chunks_by_index as te_sort_chunks_by_index,
     moe_sort_chunks_by_index_with_probs as te_sort_chunks_by_index_with_probs,
 )
-from transformer_engine.pytorch.utils import is_bf16_compatible
-from transformer_engine.pytorch.fp8 import FP8GlobalStateManager
-from transformer_engine.pytorch.tensor.float8_tensor import (
+from transformer_engine.pytorch import (
     Float8Quantizer,
     Float8CurrentScalingQuantizer,
+    Float8BlockQuantizer,
+    MXFP8Quantizer,
 )
-from transformer_engine.pytorch.tensor.float8_blockwise_tensor import Float8BlockQuantizer
-from transformer_engine.pytorch.tensor.mxfp8_tensor import MXFP8Quantizer
 import transformer_engine_torch as tex
 import copy
@@ -1119,7 +1118,7 @@ def perf_test_cuda_kernel(cuda_kernel_fn):
 # TE tensor dtypes
 _te_dtypes: List[tex.DType] = [tex.DType.kFloat32, tex.DType.kFloat16]
-if is_bf16_compatible():
+if te.is_bf16_available():
     _te_dtypes.append(tex.DType.kBFloat16)
@@ -1239,10 +1238,10 @@ def test_permutation_mask_map_alongside_probs_empty_input(te_dtype):
 # Only run FP8 tests on H100.
-fp8_available, reason_for_no_fp8 = FP8GlobalStateManager.is_fp8_available()
-mxfp8_available, reason_for_no_mxfp8 = FP8GlobalStateManager.is_mxfp8_available()
-fp8_block_scaling_available, reason_for_no_fp8_block_scaling = (
-    FP8GlobalStateManager.is_fp8_block_scaling_available()
+fp8_available, reason_for_no_fp8 = te.is_fp8_available(return_reason=True)
+mxfp8_available, reason_for_no_mxfp8 = te.is_mxfp8_available(return_reason=True)
+fp8_block_scaling_available, reason_for_no_fp8_block_scaling = te.is_fp8_block_scaling_available(
+    return_reason=True
 )
 fp8_recipes = [
     recipe.MXFP8BlockScaling(),
...
@@ -2,7 +2,7 @@
 #
 # See LICENSE for license information.
-from typing import Iterable, Optional
+from typing import Optional
 import pytest
 import torch
@@ -10,28 +10,34 @@ import warnings
 import transformer_engine.common.recipe
 import transformer_engine.pytorch as te
-from transformer_engine.pytorch.tensor.float8_blockwise_tensor import Float8BlockQuantizer
-from transformer_engine.pytorch.tensor.mxfp8_tensor import MXFP8Quantizer
+from transformer_engine.pytorch import (
+    Float8BlockQuantizer,
+    MXFP8Quantizer,
+    Float8Quantizer,
+    NVFP4Quantizer,
+    quantized_model_init,
+    Linear,
+    LayerNormLinear,
+    LayerNormMLP,
+    GroupedLinear,
+)
 import transformer_engine_torch as tex
-from transformer_engine.pytorch.fp8 import (
+from transformer_engine.pytorch.quantization import (
     FP8GlobalStateManager,
     _amax_and_scale_update,
-    fp8_model_init,
 )
-from transformer_engine.pytorch.tensor.float8_tensor import Float8Quantizer
-from transformer_engine.pytorch.tensor.nvfp4_tensor import NVFP4Quantizer
 import transformer_engine.pytorch.ops as te_ops
-from transformer_engine.pytorch import Linear, LayerNormLinear, LayerNormMLP, GroupedLinear
-from transformer_engine.pytorch.distributed import fp8_autocast
 from transformer_engine.common.recipe import DelayedScaling, Float8BlockScaling, MXFP8BlockScaling
 import transformer_engine_torch as tex

 # Check if FP8 is supported
-fp8_available, reason_for_no_fp8 = FP8GlobalStateManager.is_fp8_available()
-mxfp8_available, reason_for_no_mxfp8 = FP8GlobalStateManager.is_mxfp8_available()
-fp8_block_scaling_available, reason_for_no_fp8_block_scaling = (
-    FP8GlobalStateManager.is_fp8_block_scaling_available()
+fp8_available, reason_for_no_fp8 = te.is_fp8_available(return_reason=True)
+mxfp8_available, reason_for_no_mxfp8 = te.is_mxfp8_available(return_reason=True)
+fp8_block_scaling_available, reason_for_no_fp8_block_scaling = te.is_fp8_block_scaling_available(
+    return_reason=True
 )
+fp4_available, reason_for_no_fp4 = te.is_nvfp4_available(return_reason=True)

 # FP8 per tensor delayed scaling
@@ -64,7 +70,7 @@ class TestFP8Recipe:
             amax_history_len=amax_history_len,
             amax_compute_algo=amax_compute_algo,
         )
-        with te.fp8_autocast(fp8_recipe=recipe):
+        with te.autocast(recipe=recipe):
             module = te.Linear(16, 16)
             y = module(
                 torch.randn([16, 16], device="cuda"),
@@ -120,7 +126,7 @@ class TestFP8Recipe:
         # ref_scale_inv_backward = torch.reciprocal(ref_scale_backward)

         # Perform forward, backward, and optimizer steps to update fp8_meta
-        with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
+        with te.autocast(enabled=True, recipe=recipe):
             x = torch.randn([16, 16], device="cuda")
             y = module(x, is_first_microbatch=is_first_microbatch)
             y.backward(torch.randn_like(y))
@@ -219,7 +225,7 @@ class TestFP8Recipe:
             op.weight.fill_(w_history[-1])

         # Forward and backward pass
-        with te.fp8_autocast(fp8_recipe=recipe):
+        with te.autocast(recipe=recipe):
             y = op(x)
         y.backward(dy)
@@ -301,7 +307,7 @@ class TestFP8Recipe:
         scaling_factor_compute_algo = None
         if fused_update:
             scaling_factor_compute_algo = (
-                lambda amax, scale, fp8_max, recipe: te.fp8._default_sf_compute(
+                lambda amax, scale, fp8_max, recipe: te.quantization._default_sf_compute(
                     amax, scale, fp8_max, recipe.margin
                 )
             )
@@ -311,7 +317,7 @@ class TestFP8Recipe:
         # Setup fp8_meta dictionary
         def setup_fp8_meta():
-            with te.fp8_autocast(fp8_recipe=recipe):
+            with te.autocast(recipe=recipe):
                 module = te.Linear(16, 16)
                 y = module(torch.zeros([16, 16], device="cuda"))
                 y.backward(torch.zeros_like(y))
@@ -393,11 +399,11 @@ class TestFP8Recipe:
         ],
     )
     def test_check_for_weight_tensor_and_recipe_correspondence(self, model_init_recipe):
-        with fp8_model_init(enabled=True, recipe=model_init_recipe):
+        with quantized_model_init(enabled=True, recipe=model_init_recipe):
             linear = Linear(32, 32).cuda()
         x = torch.randn(32, 32, device="cuda")
-        with fp8_autocast(enabled=True, fp8_recipe=DelayedScaling()):
+        with te.autocast(enabled=True, recipe=DelayedScaling()):
             with pytest.raises(RuntimeError) as excinfo:
                 _ = linear(x)
             assert "Recipe mismatch for " in str(excinfo.value)
@@ -436,7 +442,7 @@ class TestFP8Recipe:
         # Run initial iterations with DelayedScaling
        for _ in range(3):
             x = torch.randn(batch_size, in_features, device="cuda")
-            with fp8_autocast(enabled=True, fp8_recipe=initial_recipe):
+            with te.autocast(enabled=True, recipe=initial_recipe):
                 y = linear(x)
             loss = y.mean()
             loss.backward()
@@ -453,7 +459,7 @@ class TestFP8Recipe:
             if i == 0:
                 # Expect a warning on the first iteration with the new recipe
                 with pytest.warns(UserWarning, match="Recipe type changed"):
-                    with fp8_autocast(enabled=True, fp8_recipe=target_recipe):
+                    with te.autocast(enabled=True, recipe=target_recipe):
                         y = linear(x)
                 for quantizer in linear.quantizers["scaling_fwd"]:
                     assert isinstance(quantizer, expected_quantizer_type)
@@ -461,7 +467,7 @@ class TestFP8Recipe:
                 # No warning expected on subsequent iterations
                 with warnings.catch_warnings():
                     warnings.simplefilter("error")  # Raise error if unexpected warning occurs
-                    with fp8_autocast(enabled=True, fp8_recipe=target_recipe):
+                    with te.autocast(enabled=True, recipe=target_recipe):
                         y = linear(x)
             loss = y.mean()
             loss.backward()
@@ -485,7 +491,7 @@ class TestFP8Recipe:
         batch_size = 32
         recipe = DelayedScaling(amax_history_len=1024)
-        with fp8_model_init(recipe=recipe):
+        with quantized_model_init(recipe=recipe):
             if module_class == GroupedLinear:
                 module = module_class(1, in_features, out_features).cuda()
             else:
@@ -493,7 +499,7 @@ class TestFP8Recipe:
         x = torch.randn(batch_size, in_features, device="cuda")
         recipe = DelayedScaling(amax_history_len=1)
-        with fp8_autocast(enabled=True, fp8_recipe=recipe):
+        with te.autocast(enabled=True, recipe=recipe):
             warn_msg = "Quantizer is being updated, this may affect model behavior"
             with pytest.warns(UserWarning, match=warn_msg):
                 if module_class == GroupedLinear:
@@ -502,9 +508,6 @@ class TestFP8Recipe:
                     y = module(x)

-fp4_available, reason_for_no_fp4 = FP8GlobalStateManager.is_nvfp4_available()
-
 @pytest.mark.skipif(not fp4_available, reason=reason_for_no_fp4)
 @pytest.mark.parametrize("dtype", [torch.float32, torch.bfloat16], ids=str)
 @pytest.mark.parametrize(
...
@@ -8,18 +8,16 @@ import torch
 import pytest
 import os

-import transformer_engine.pytorch
-from transformer_engine.pytorch.fp8 import (
-    fp8_autocast,
-    FP8GlobalStateManager,
-    fp8_model_init,
-)
+import transformer_engine
+import transformer_engine.pytorch as te
+from transformer_engine.pytorch.quantization import FP8GlobalStateManager
 from transformer_engine.pytorch.utils import (
     init_method_normal,
     scaled_init_method_normal,
-    is_bf16_compatible,
 )
 from transformer_engine.pytorch import (
+    autocast,
+    quantized_model_init,
     LayerNormLinear,
     Linear,
     GroupedLinear,
@@ -27,26 +25,25 @@ from transformer_engine.pytorch import (
     TransformerLayer,
     RMSNorm,
     LayerNorm,
+    Float8CurrentScalingQuantizer,
+    Float8Quantizer,
+    Float8Tensor,
+    MXFP8Tensor,
+    checkpoint,
+    QuantizedTensor,
+    is_bf16_available,
 )
 from transformer_engine.common import recipe
 import transformer_engine_torch as tex
 from transformer_engine.pytorch.cpp_extensions import general_gemm
 from transformer_engine.pytorch.module.base import get_workspace
-from transformer_engine.pytorch.tensor import QuantizedTensor
-from transformer_engine.pytorch.tensor.float8_tensor import (
-    Float8CurrentScalingQuantizer,
-    Float8Quantizer,
-    Float8Tensor,
-)
-from transformer_engine.pytorch.tensor.mxfp8_tensor import MXFP8Tensor
 from transformer_engine.pytorch.tensor.utils import replace_raw_data
-from transformer_engine.pytorch.distributed import checkpoint
 from utils import ModelConfig

 # Only run FP8 tests on supported devices.
-fp8_available, reason_for_no_fp8 = FP8GlobalStateManager.is_fp8_available()
-fp8_block_scaling_available, _ = FP8GlobalStateManager.is_fp8_block_scaling_available()
-mxfp8_available, reason_for_no_mxfp8 = FP8GlobalStateManager.is_mxfp8_available()
+fp8_available, reason_for_no_fp8 = te.is_fp8_available(return_reason=True)
+fp8_block_scaling_available, _ = te.is_fp8_block_scaling_available(return_reason=True)
+mxfp8_available, reason_for_no_mxfp8 = te.is_mxfp8_available(return_reason=True)

 # Record initial RNG state from script run.
 seed = 1234
@@ -108,7 +105,7 @@ if fp8_available:
     fp8_recipes.append(None)

 param_types = [torch.float32, torch.float16]
-if is_bf16_compatible():  # bf16 requires sm_80 or higher
+if is_bf16_available():  # bf16 requires sm_80 or higher
     param_types.append(torch.bfloat16)

 all_boolean = [True, False]
@@ -160,7 +157,7 @@ def _test_sanity_e2e_amp(block, dtype, config, fp8_recipe, skip_wgrad):
     use_fp8 = fp8_recipe is not None
     with torch.autocast(device_type="cuda", enabled=True, dtype=dtype):
-        with fp8_autocast(enabled=use_fp8, fp8_recipe=fp8_recipe):
+        with autocast(enabled=use_fp8, recipe=fp8_recipe):
             te_out = block(te_inp_hidden_states, attention_mask=te_inp_attn_mask)
             loss = te_out.sum()
@@ -199,7 +196,7 @@ def _test_sanity_e2e_gradient_accumulation_fusion(block, dtype, config, fp8_reci
         p.main_grad = torch.zeros_like(p)

     use_fp8 = fp8_recipe is not None
-    with fp8_autocast(enabled=use_fp8, fp8_recipe=fp8_recipe):
+    with autocast(enabled=use_fp8, recipe=fp8_recipe):
         te_out = block(te_inp_hidden_states, attention_mask=te_inp_attn_mask)
         loss = te_out.sum()
         loss.backward()
@@ -227,7 +224,7 @@ def _test_sanity_e2e(block, dtype, config, fp8_recipe, skip_wgrad):
         _disable_wgrads(block)

     use_fp8 = fp8_recipe is not None
-    with fp8_autocast(enabled=use_fp8, fp8_recipe=fp8_recipe):
+    with autocast(enabled=use_fp8, recipe=fp8_recipe):
         te_out = block(te_inp_hidden_states)
         loss = te_out.sum()
         loss.backward()
@@ -253,7 +250,7 @@ def _test_sanity_e2e_bert(block, dtype, config, fp8_recipe, skip_wgrad):
         _disable_wgrads(block)

     use_fp8 = fp8_recipe is not None
-    with fp8_autocast(enabled=use_fp8, fp8_recipe=fp8_recipe):
+    with autocast(enabled=use_fp8, recipe=fp8_recipe):
         te_out = block(te_inp_hidden_states, attention_mask=te_inp_attn_mask)
         loss = te_out.sum()
         loss.backward()
@@ -285,7 +282,7 @@ def _test_sanity_e2e_T5(block, dtype, config, fp8_recipe, skip_wgrad):
         _disable_wgrads(block)

     use_fp8 = fp8_recipe is not None
-    with fp8_autocast(enabled=use_fp8, fp8_recipe=fp8_recipe):
+    with autocast(enabled=use_fp8, recipe=fp8_recipe):
         te_out = block(
             te_inp_hidden_states,
             attention_mask=te_inp_attn_mask,
@@ -314,7 +311,7 @@ def _test_sanity_common(
         _disable_wgrads(block)

     use_fp8 = fp8_recipe is not None
-    with fp8_autocast(enabled=use_fp8, fp8_recipe=fp8_recipe):
+    with autocast(enabled=use_fp8, recipe=fp8_recipe):
         if not microbatching:
             te_out = block(te_inp)
         else:
@@ -455,7 +452,7 @@ def test_sanity_linear_with_zero_tokens(dtype, bs, model, fp8_recipe, fp8_model_
         pytest.skip("FP16 output for NVFP4 not supported")
     use_fp8 = fp8_recipe is not None
-    with fp8_model_init(enabled=use_fp8 and fp8_model_params, recipe=fp8_recipe):
+    with quantized_model_init(enabled=use_fp8 and fp8_model_params, recipe=fp8_recipe):
         te_linear = Linear(
             config.hidden_size, ffn_hidden_size, bias=use_bias, params_dtype=dtype
         ).cuda()
@@ -463,7 +460,7 @@ def test_sanity_linear_with_zero_tokens(dtype, bs, model, fp8_recipe, fp8_model_
     inp_hidden_states = torch.randn(
         num_tokens, config.hidden_size, dtype=dtype, requires_grad=True
     ).cuda()
-    with fp8_autocast(enabled=use_fp8, fp8_recipe=fp8_recipe):
+    with autocast(enabled=use_fp8, recipe=fp8_recipe):
         out = te_linear(inp_hidden_states)
     loss = out.sum()
     loss.backward()
@@ -496,7 +493,7 @@ def test_sanity_grouped_linear(
         pytest.skip("NVFP4 not supported for grouped linear")
     use_fp8 = fp8_recipe is not None
-    with fp8_model_init(enabled=use_fp8 and fp8_model_params, recipe=fp8_recipe):
+    with quantized_model_init(enabled=use_fp8 and fp8_model_params, recipe=fp8_recipe):
         te_grouped_linear = GroupedLinear(
             num_gemms, config.hidden_size, ffn_hidden_size, bias=use_bias, params_dtype=dtype
         ).cuda()
@@ -512,7 +509,7 @@ def test_sanity_grouped_linear(
     elif empty_split == "middle":
         m_splits[num_gemms // 2] = 0
-    with fp8_autocast(enabled=use_fp8, fp8_recipe=fp8_recipe):
+    with autocast(enabled=use_fp8, recipe=fp8_recipe):
         out = te_grouped_linear(inp_hidden_states, m_splits)
     loss = out.sum()
     loss.backward()
@@ -976,9 +973,9 @@ def test_replace_raw_data_for_float8tensor():

 @pytest.mark.skipif(not fp8_available, reason=reason_for_no_fp8)
-def test_fp8_model_init_high_precision_init_val():
-    """Test fp8_model_init with preserve_high_precision_init_val=True"""
-    with fp8_model_init(preserve_high_precision_init_val=True):
+def test_quantized_model_init_high_precision_init_val():
+    """Test quantized_model_init with preserve_high_precision_init_val=True"""
+    with quantized_model_init(preserve_high_precision_init_val=True):
         model = Linear(768, 768)
     weight = model.weight
@@ -1051,7 +1048,7 @@ def test_linear_frozen_weights_memory_default_recipe():
     linear.weight.requires_grad = False

     # Forward and backward pass with FP8
-    with fp8_autocast():
+    with autocast():
         o = linear(x)
     g_o = torch.randn_like(o)
@@ -1105,7 +1102,7 @@ def test_inference_mode(
     # Construct module
     module = None
     with torch.no_grad():
-        with fp8_model_init(enabled=with_quantization, recipe=quantization_recipe):
+        with quantized_model_init(enabled=with_quantization, recipe=quantization_recipe):
             if module_name == "Linear":
                 module = Linear(hidden_size, hidden_size)
             elif module_name == "LayerNormLinear":
@@ -1140,6 +1137,6 @@ def test_inference_mode(
     kwargs = {}
     if module_name == "GroupedLinear":
         kwargs["m_splits"] = [sequence_length]
-    with fp8_autocast(enabled=with_quantization, fp8_recipe=quantization_recipe):
+    with autocast(enabled=with_quantization, recipe=quantization_recipe):
         y = module(x, **kwargs)
     check_weights()
...
@@ -7,14 +7,14 @@ from __future__ import annotations
 import logging
 import os
 from contextlib import contextmanager
-from typing import Optional, Tuple, Dict, Any, List

+import pytest
 import torch

 import transformer_engine
+import transformer_engine.common.recipe
+import transformer_engine.pytorch as te
 import transformer_engine_torch as tex
-from transformer_engine.common.recipe import Recipe
-from transformer_engine.pytorch import InferenceParams
 from transformer_engine.pytorch.attention.dot_product_attention import _attention_backends
 from transformer_engine.pytorch.attention.dot_product_attention.utils import (
     get_attention_backend,
...
@@ -161,7 +161,7 @@ class DelayedScaling(Recipe):
                 where `Tensor` is a framework tensor type.
     reduce_amax: bool, default = `True`
                 By default, if `torch.distributed` is initialized, the `amax` value for FP8
-                tensors is reduced across the `fp8_group` (specified in the `fp8_autocast`
+                tensors is reduced across the `amax_reduction_group` (specified in the `autocast`
                 call). This keeps the amaxes and scaling factors synced across the given
                 distributed group. If set to `False`, this reduction is skipped and every
                 GPU maintains local amaxes and scaling factors. To ensure results are
@@ -169,7 +169,7 @@ class DelayedScaling(Recipe):
                 ranks must checkpoint in order to store the local tensors.
     fp8_dpa: bool, default = `False`
                 Whether to enable FP8 dot product attention (DPA). When the model is placed in an
-                `fp8_autocast(enabled=True)` region and `fp8_dpa` is set to `True`, DPA casts the
+                `autocast(enabled=True)` region and `fp8_dpa` is set to `True`, DPA casts the
                 inputs from higher precision to FP8, performs attention in FP8, and casts tensors
                 back to higher precision as outputs. FP8 DPA currently is only supported in the
                 `FusedAttention` backend.
...
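The updated `DelayedScaling` docstring above refers to an `amax_reduction_group` argument on `autocast`. A hedged sketch of distributed amax reduction under that reading; `model` and `x` are placeholder stand-ins, and `torch.distributed` must already be initialized:

    import torch.distributed as dist
    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import DelayedScaling

    # reduce_amax=True (the default) reduces amaxes across the group,
    # keeping scaling factors synced between ranks.
    recipe = DelayedScaling(reduce_amax=True)
    group = dist.new_group(ranks=list(range(dist.get_world_size())))  # assumed group
    with te.autocast(enabled=True, recipe=recipe, amax_reduction_group=group):
        y = model(x)  # placeholder module and input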
@@ -19,7 +19,7 @@ from transformer_engine.common.recipe import Format
 from transformer_engine.pytorch.tensor import Quantizer
 from transformer_engine.pytorch.tensor.float8_tensor import Float8Quantizer
 from transformer_engine.pytorch.tensor.mxfp8_tensor import MXFP8Quantizer
-from transformer_engine.pytorch.fp8 import _default_sf_compute
+from transformer_engine.pytorch.quantization import _default_sf_compute

 def fake_quantize(tensor: torch.Tensor, fp8_format: tex.DType, out=None):
...
@@ -34,7 +34,7 @@ load_framework_extension("jax")
 from . import flax
 from . import quantize
-from .quantize import fp8_autocast, update_collections
+from .quantize import autocast, fp8_autocast, update_collections
 from .quantize import NVTE_FP8_COLLECTION_NAME
 from .sharding import MeshResource
@@ -45,6 +45,7 @@ from ..common.utils import DeprecatedEnum
 __all__ = [
     "NVTE_FP8_COLLECTION_NAME",
+    "autocast",
     "fp8_autocast",
     "update_collections",
     "MeshResource",
...
@@ -66,7 +66,7 @@ def extend_logical_axis_rules(rules: LogicalRules) -> LogicalRules:
         for 1D-sharding tensor parallelism.

     .. warning::
-        Please make sure ShardingResource is set via fp8_autocast before calling this function.
+        Please make sure ShardingResource is set via autocast before calling this function.

     .. note::
         This function is only needed when using TransformerLayer. For other modules, such as
...
...@@ -7,6 +7,7 @@ Config module for quantization metadata management ...@@ -7,6 +7,7 @@ Config module for quantization metadata management
This module provides configuration and helper functions for managing quantization metadata This module provides configuration and helper functions for managing quantization metadata
in JAX, including support for different scaling modes and datatypes. in JAX, including support for different scaling modes and datatypes.
""" """
from abc import ABC, abstractmethod from abc import ABC, abstractmethod
from contextlib import contextmanager from contextlib import contextmanager
from dataclasses import dataclass from dataclasses import dataclass
@@ -23,7 +24,14 @@ import jax.numpy as jnp
 from flax.core.frozen_dict import FrozenDict
 from transformer_engine_jax import DType, get_cublasLt_version, get_cuda_version
-from transformer_engine.common import recipe
+from transformer_engine.common.recipe import (
+    Recipe,
+    DelayedScaling,
+    Format,
+    MXFP8BlockScaling,
+    Float8CurrentScaling,
+    NVFP4BlockScaling,
+)
 from transformer_engine.jax.sharding import (
     global_shard_guard,
     MeshResource,
@@ -39,6 +47,7 @@ from .device_utils import get_device_compute_capability
 __all__ = [
     "get_quantize_config",
     "get_quantize_config_with_recipe",
+    "autocast",
     "fp8_autocast",
     "is_fp8_available",
     "is_scaling_mode_supported",
@@ -51,8 +60,6 @@ __all__ = [
     "TensorSource",
 ]

-_is_fp8_available = None
-_reason_for_no_fp8 = ""
 _is_scaling_mode_supported = None
 _reason_for_no_scaling_mode = ""
 Collection = Union[Dict, FrozenDict]
@@ -195,22 +202,22 @@ def get_supported_scaling_modes() -> List[ScalingMode]:
     ]


-def get_supported_quantization_recipes() -> List[recipe.Recipe]:
+def get_supported_quantization_recipes() -> List[Recipe]:
     """Get all supported quantization recipes."""
     # We don't support all the recipes TE/Common supports yet
     # return [get_quantize_config_class(recipe)() for recipe in recipe.Recipe.__subclasses__()]
     all_recipes = [
-        recipe.DelayedScaling(),
-        recipe.Float8CurrentScaling(),
-        recipe.MXFP8BlockScaling(),
-        recipe.NVFP4BlockScaling(),
+        DelayedScaling(),
+        Float8CurrentScaling(),
+        MXFP8BlockScaling(),
+        NVFP4BlockScaling(),
     ]
     return [
         recipe for recipe in all_recipes if get_quantize_config_class(recipe)().is_supported()[0]
     ]


-def _format2dtypes(format_: recipe.Format):
+def _format2dtypes(format_: Format):
     """Convert recipe.Format.dtype to corresponding JAX dtypes.

     Args:
@@ -219,13 +226,13 @@ def _format2dtypes(format_: recipe.Format):
     Returns:
         A tuple of (forward_dtype, backward_dtype) for the given format
     """
-    if format_ == recipe.Format.E4M3:
+    if format_ == Format.E4M3:
         return jnp.float8_e4m3fn, jnp.float8_e4m3fn
-    if format_ == recipe.Format.E5M2:
+    if format_ == Format.E5M2:
         return jnp.float8_e5m2, jnp.float8_e5m2
-    if format_ == recipe.Format.HYBRID:
+    if format_ == Format.HYBRID:
         return jnp.float8_e4m3fn, jnp.float8_e5m2
-    if format_ == recipe.Format.E2M1:
+    if format_ == Format.E2M1:
         return jnp.float4_e2m1fn, jnp.float4_e2m1fn
     return jnp.bfloat16, jnp.bfloat16
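The mapping that `_format2dtypes` encodes is easiest to read as a table. A standalone sketch, illustrative only: `jnp.float4_e2m1fn` requires a recent JAX/ml_dtypes build, and `Format.E2M1` is assumed to exist exactly as this commit uses it.

    import jax.numpy as jnp
    from transformer_engine.common.recipe import Format

    # (forward dtype, backward dtype) per recipe format; HYBRID keeps E4M3 for
    # activations/weights in the forward pass and E5M2 for gradients in backward.
    FORMAT_TO_DTYPES = {
        Format.E4M3: (jnp.float8_e4m3fn, jnp.float8_e4m3fn),
        Format.E5M2: (jnp.float8_e5m2, jnp.float8_e5m2),
        Format.HYBRID: (jnp.float8_e4m3fn, jnp.float8_e5m2),
        Format.E2M1: (jnp.float4_e2m1fn, jnp.float4_e2m1fn),  # NVFP4 element type
    }

    fwd_dtype, bwd_dtype = FORMAT_TO_DTYPES[Format.HYBRID]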
@@ -289,7 +296,7 @@ class BaseQuantizeConfig(ABC):
     AMAX_HISTORY_LEN: int = 1024
     AMAX_COMPUTE_ALGO: AmaxComputeAlgo = AmaxComputeAlgo.MAX

-    def initialize_from_recipe(self, fp8_recipe: recipe.Recipe) -> None:
+    def initialize_from_recipe(self, fp8_recipe: Recipe) -> None:
        """Initialize the quantization configuration from a given recipe.

        Args:
@@ -359,7 +366,7 @@ class BaseQuantizeConfig(ABC):
 class NoOpQuantizeConfig(BaseQuantizeConfig):
     """Configuration class higher-precision non-quantized operation."""

-    def initialize_from_recipe(self, fp8_recipe: recipe.Recipe) -> None:
+    def initialize_from_recipe(self, fp8_recipe: Recipe) -> None:
         """Initialize no-op configuration."""
         raise NotImplementedError(
             "NoOpQuantizeConfig cannot be initialize from a recipe as it represents"
@@ -399,7 +406,7 @@ class DelayedScalingQuantizeConfig(BaseQuantizeConfig):
     FP8 quantization mode.
     """

-    def initialize_from_recipe(self, fp8_recipe: recipe.Recipe) -> None:
+    def initialize_from_recipe(self, fp8_recipe: Recipe) -> None:
         """Initialize delayed scaling FP8 configuration.

         Args:
@@ -477,7 +484,7 @@ class CurrentScalingQuantizeConfig(BaseQuantizeConfig):
     FP8 quantization mode.
     """

-    def initialize_from_recipe(self, fp8_recipe: recipe.Recipe) -> None:
+    def initialize_from_recipe(self, fp8_recipe: Recipe) -> None:
         """Initialize current scaling FP8 configuration.

         Args:
@@ -519,7 +526,7 @@ class BlockScalingQuantizeConfig(BaseQuantizeConfig):
     FP8 quantization mode.
     """

-    def initialize_from_recipe(self, fp8_recipe: recipe.Recipe) -> None:
+    def initialize_from_recipe(self, fp8_recipe: Recipe) -> None:
         """Initialize block scaling FP8 configuration.

         Args:
@@ -560,7 +567,7 @@ class NVFP4ScalingQuantizeConfig(BaseQuantizeConfig):
     This class provides specific initialization and finalization for NVFP4 scaling quantization mode.
     """

-    def initialize_from_recipe(self, fp8_recipe: recipe.Recipe) -> None:
+    def initialize_from_recipe(self, fp8_recipe: Recipe) -> None:
         """Initialize block scaling FP8 configuration.

         Args:
@@ -622,12 +629,12 @@ _QUANTIZE_CONFIG = NoOpQuantizeConfig()


 def get_quantize_config():
-    """Global instance of BaseQuantizeConfig set by fp8_autocast context."""
+    """Global instance of BaseQuantizeConfig set by autocast context."""
     return _QUANTIZE_CONFIG


 def get_quantize_config_class(
-    fp8_recipe: recipe.Recipe,
+    fp8_recipe: Recipe,
 ) -> Type[BaseQuantizeConfig]:
     """Get the quantization configuration class based on the FP8 recipe.
@@ -636,18 +643,18 @@ def get_quantize_config_class(
     Returns:
         The quantization config class corresponding to the given recipe.
     """
-    if isinstance(fp8_recipe, recipe.DelayedScaling):
+    if isinstance(fp8_recipe, DelayedScaling):
         return DelayedScalingQuantizeConfig
-    if isinstance(fp8_recipe, recipe.MXFP8BlockScaling):
+    if isinstance(fp8_recipe, MXFP8BlockScaling):
         return BlockScalingQuantizeConfig
-    if isinstance(fp8_recipe, recipe.Float8CurrentScaling):
+    if isinstance(fp8_recipe, Float8CurrentScaling):
         return CurrentScalingQuantizeConfig
-    if isinstance(fp8_recipe, recipe.NVFP4BlockScaling):
+    if isinstance(fp8_recipe, NVFP4BlockScaling):
         return NVFP4ScalingQuantizeConfig
     raise ValueError(f"Unsupported recipe type: {type(fp8_recipe)}")


-def get_quantize_config_with_recipe(fp8_recipe: recipe.Recipe):
+def get_quantize_config_with_recipe(fp8_recipe: Recipe):
     """Get the quantization configuration object based on the FP8 recipe."""
     config = get_quantize_config_class(fp8_recipe)()
     config.initialize_from_recipe(fp8_recipe)
@@ -655,14 +662,14 @@ def get_quantize_config_with_recipe(fp8_recipe: recipe.Recipe):

 @contextmanager
-def fp8_autocast(
+def autocast(
     enabled: bool = False,
-    fp8_recipe: Optional[recipe.Recipe] = None,
+    recipe: Optional[Recipe] = None,
     mesh_resource: Optional[MeshResource] = None,
 ) -> None:
-    r"""Context manager for FP8 automatic mixed precision.
+    r"""Context manager for FP8 or FP4 usage.

-    This context manager enables FP8 quantization for the duration of its context.
+    This context manager enables quantization for the duration of its context.

     .. code-block:: python

         mesh_shape = (4, 2)
@@ -673,7 +680,7 @@ def fp8_autocast(
         with maps.Mesh(devices, (dp_mesh_axis_name, tp_mesh_axis_name)):
             mesh_resource=MeshResource(dp_mesh_axis_name, tp_mesh_axis_name)

-            with fp8_autocast(enabled=True, mesh_resource=mesh_resource):
+            with autocast(enabled=True, mesh_resource=mesh_resource):
                 rules = extend_logical_axis_rules(tuple())
                 transformer = TransformerLayer()
@@ -690,15 +697,15 @@ def fp8_autocast(
    ----------
    enabled: bool, default = False
        Whether or not to enable fp8
-    fp8_recipe: recipe.DelayedScaling, default = None
-        Recipe used for FP8 training.
+    recipe: recipe.DelayedScaling, default = None
+        recipe used for low precision quantization.
    mesh_resource: MeshResource, default = None
        Specify the mesh axes for data and tensor parallelism to shard along.
        If set to None, then no data or tensor parallelism will be used.
    """
-    if fp8_recipe is None:
-        fp8_recipe = recipe.DelayedScaling()
+    if recipe is None:
+        recipe = DelayedScaling()

     global _QUANTIZE_CONFIG
@@ -709,15 +716,45 @@ def fp8_autocast(
     try:
         with global_shard_guard(mesh_resource):
             if enabled:
-                _QUANTIZE_CONFIG = get_quantize_config_class(fp8_recipe)()
+                _QUANTIZE_CONFIG = get_quantize_config_class(recipe)()
                 is_supported, reason = _QUANTIZE_CONFIG.is_supported()
                 assert is_supported, reason
-                _QUANTIZE_CONFIG.initialize_from_recipe(fp8_recipe)
+                _QUANTIZE_CONFIG.initialize_from_recipe(recipe)
             yield
     finally:
         _QUANTIZE_CONFIG = old_quantize_config


+@contextmanager
+def fp8_autocast(
+    enabled: bool = False,
+    fp8_recipe: Optional[Recipe] = None,
+    mesh_resource: Optional[MeshResource] = None,
+) -> None:
+    """
+    .. warning::
+        fp8_autocast is deprecated and will be removed in a future release.
+        Use autocast(enabled=..., recipe=..., mesh_resource=...) instead.
+    """
+    warnings.warn(
+        "fp8_autocast is deprecated and will be removed in a future release. "
+        "Use autocast(enabled=..., recipe=..., mesh_resource=...) instead.",
+        category=DeprecationWarning,
+        stacklevel=2,
+    )
+    # Call new implementation.
+    with autocast(
+        enabled=enabled,
+        recipe=fp8_recipe,
+        mesh_resource=mesh_resource,
+    ):
+        yield


 def update_collections(new: Collection, original: Collection) -> Collection:
     r"""Update collections with new values while preserving original structure.
......
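Net effect of this file: the JAX package gains a recipe-agnostic `autocast` and keeps `fp8_autocast` as a deprecation shim. A minimal usage sketch, assuming both names are re-exported at the package root the way `fp8_autocast` was previously:

    import transformer_engine.jax as te
    from transformer_engine.common.recipe import MXFP8BlockScaling

    # New spelling: any supported recipe is passed through `recipe=`.
    with te.autocast(enabled=True, recipe=MXFP8BlockScaling()):
        ...  # build and run TE Flax modules here

    # Old spelling still works via the shim above, but emits a DeprecationWarning.
    with te.fp8_autocast(enabled=True, fp8_recipe=MXFP8BlockScaling()):
        ...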
@@ -46,8 +46,18 @@ from transformer_engine.pytorch.permutation import (
     moe_sort_chunks_by_index,
     moe_sort_chunks_by_index_with_probs,
 )
-from transformer_engine.pytorch.fp8 import fp8_autocast
-from transformer_engine.pytorch.fp8 import fp8_model_init
+from transformer_engine.pytorch.quantization import fp8_autocast
+from transformer_engine.pytorch.quantization import fp8_model_init
+from transformer_engine.pytorch.quantization import autocast
+from transformer_engine.pytorch.quantization import quantized_model_init
+from transformer_engine.pytorch.quantization import is_fp8_available
+from transformer_engine.pytorch.quantization import is_mxfp8_available
+from transformer_engine.pytorch.quantization import is_fp8_block_scaling_available
+from transformer_engine.pytorch.quantization import is_nvfp4_available
+from transformer_engine.pytorch.quantization import get_default_recipe
+from transformer_engine.pytorch.utils import get_cudnn_version
+from transformer_engine.pytorch.utils import get_device_compute_capability
+from transformer_engine.pytorch.utils import is_bf16_available
 from transformer_engine.pytorch.graph import make_graphed_callables
 from transformer_engine.pytorch.distributed import checkpoint
 from transformer_engine.pytorch.distributed import CudaRNGStatesTracker
@@ -61,14 +71,17 @@ from transformer_engine.pytorch.tensor import Float8Quantizer
 from transformer_engine.pytorch.tensor import Float8CurrentScalingQuantizer
 from transformer_engine.pytorch.tensor import MXFP8Quantizer
 from transformer_engine.pytorch.tensor import Float8BlockQuantizer
+from transformer_engine.pytorch.tensor import NVFP4Quantizer
 from transformer_engine.pytorch.tensor import QuantizedTensorStorage
 from transformer_engine.pytorch.tensor import Float8TensorStorage
 from transformer_engine.pytorch.tensor import MXFP8TensorStorage
 from transformer_engine.pytorch.tensor import Float8BlockwiseQTensorStorage
+from transformer_engine.pytorch.tensor import NVFP4TensorStorage
 from transformer_engine.pytorch.tensor import QuantizedTensor
 from transformer_engine.pytorch.tensor import Float8Tensor
 from transformer_engine.pytorch.tensor import MXFP8Tensor
 from transformer_engine.pytorch.tensor import Float8BlockwiseQTensor
+from transformer_engine.pytorch.tensor import NVFP4Tensor
 from transformer_engine.pytorch.tensor import prepare_for_saving
 from transformer_engine.pytorch.tensor import restore_from_saved
......
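The test imports above double as a map of the new public surface in `transformer_engine.pytorch.quantization`. A migration sketch using only names imported in this hunk; the keyword arguments of `quantized_model_init` are assumed to mirror `fp8_model_init`, and `is_nvfp4_available` is assumed here to return a plain bool, so check the real signatures before relying on them:

    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.pytorch.quantization import (
        autocast,
        quantized_model_init,
        is_nvfp4_available,
        get_default_recipe,
    )
    from transformer_engine.common.recipe import NVFP4BlockScaling

    # Prefer NVFP4 where the hardware supports it, else fall back.
    recipe = NVFP4BlockScaling() if is_nvfp4_available() else get_default_recipe()

    # quantized_model_init generalizes fp8_model_init: parameters are created
    # directly in the recipe's quantized representation.
    with quantized_model_init(enabled=True, recipe=recipe):
        model = te.Linear(768, 768)

    with autocast(enabled=True, recipe=recipe):
        out = model(torch.randn(16, 768, device="cuda", dtype=torch.bfloat16))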
@@ -42,7 +42,7 @@ from transformer_engine.pytorch.cpp_extensions.fused_attn import (
     META_O,
     META_QKV,
 )
-from transformer_engine.pytorch.fp8 import get_fp8_torch_dtype, FP8GlobalStateManager
+from transformer_engine.pytorch.quantization import get_fp8_torch_dtype, FP8GlobalStateManager
 from transformer_engine.pytorch.distributed import get_distributed_world_size
 from transformer_engine.pytorch.jit import no_torch_dynamo
 from transformer_engine.pytorch.attention.dot_product_attention.context_parallel import (
@@ -1074,7 +1074,7 @@ class FusedAttnFunc(torch.autograd.Function):
         nvtx_label = "transformer_engine.FusedAttnFunc.forward"
         nvtx_range_push(f"{nvtx_label}")
-        # recipe passed in through fp8_autocast or set by NVTE_DPA_FP8_RECIPE;
+        # recipe passed in through autocast or set by NVTE_DPA_FP8_RECIPE;
         # may be different from fp8_meta["recipe"]
         fp8_recipe = FP8GlobalStateManager.get_fp8_recipe()
         if fp8_meta is not None and fp8_meta.get("local_recipes", None) is not None:
......
@@ -19,7 +19,7 @@ from transformer_engine.pytorch.cpp_extensions.fused_attn import (
     fused_attn_bwd,
     FusedAttnBackend,
 )
-from transformer_engine.pytorch.fp8 import FP8GlobalStateManager
+from transformer_engine.pytorch.quantization import FP8GlobalStateManager
 from transformer_engine.pytorch.tensor.float8_tensor import Float8Tensor
 from transformer_engine.pytorch.tensor.quantized_tensor import QuantizedTensorStorage
 from transformer_engine.pytorch.jit import jit_fuser
@@ -1164,7 +1164,7 @@ class AttnFuncWithCPAndKVP2P(torch.autograd.Function):
         is_input_fp8 = isinstance(q, Float8Tensor)
         is_output_fp8 = fp8_output
         is_bwd_fp8 = int(os.getenv("NVTE_FP8_DPA_BWD", "1"))
-        # recipe passed in through fp8_autocast or set by NVTE_DPA_FP8_RECIPE;
+        # recipe passed in through autocast or set by NVTE_DPA_FP8_RECIPE;
         # may be different from fp8_meta["recipe"]
         fp8_recipe = FP8GlobalStateManager.get_fp8_recipe()
         if fp8_meta is not None and fp8_meta.get("local_recipes", None) is not None:
@@ -3151,7 +3151,7 @@ class AttnFuncWithCPAndQKVOA2A(torch.autograd.Function):
         is_input_fp8 = isinstance(q, Float8Tensor)
         is_output_fp8 = fp8_output
         is_bwd_fp8 = int(os.getenv("NVTE_FP8_DPA_BWD", "1"))
-        # recipe passed in through fp8_autocast or set by NVTE_DPA_FP8_RECIPE;
+        # recipe passed in through autocast or set by NVTE_DPA_FP8_RECIPE;
         # may be different from fp8_meta["recipe"]
         fp8_recipe = FP8GlobalStateManager.get_fp8_recipe()
         if fp8_meta is not None and fp8_meta.get("local_recipes", None) is not None:
......
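The `NVTE_FP8_DPA_BWD` flag read in these hunks controls whether the context-parallel attention backward also runs in FP8; it defaults to on. A one-liner to keep the backward pass in higher precision:

    import os

    # Read at runtime by the attention autograd functions above; "0" keeps the
    # backward pass out of FP8 while the forward still quantizes.
    os.environ["NVTE_FP8_DPA_BWD"] = "0"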
@@ -21,7 +21,7 @@ from transformer_engine.common.recipe import (
     Float8CurrentScaling,
 )
 from transformer_engine.pytorch.utils import get_cudnn_version
-from transformer_engine.pytorch.fp8 import (
+from transformer_engine.pytorch.quantization import (
     get_fp8_te_dtype,
     FP8GlobalStateManager,
     RecipeState,
@@ -91,26 +91,26 @@ _alibi_cache = {
 This feature is **experimental** and subject to change.

 Some models may use different FP8 recipes for their linear layers and attention layers. To support this,
-users can either use multiple, nested fp8_autocast() contexts to assign a distinct recipe for each layer,
-or use a single fp8_autocast() for the non-attention layers and configure the recipe for the attention
+users can either use multiple, nested autocast() contexts to assign a distinct recipe for each layer,
+or use a single autocast() for the non-attention layers and configure the recipe for the attention
 layers as follows.

 +-------------------+-----------+-----------------------------------------------------------------------------------+
 | Linear            | Attention | Configuration                                                                     |
 +===================+===========+===================================================================================+
-| FP8DS/FP8CS/NVFP4 | FP16/BF16 | Pass FP8DS, FP8CS or NVFP4 to fp8_autocast();                                     |
+| FP8DS/FP8CS/NVFP4 | FP16/BF16 | Pass FP8DS, FP8CS or NVFP4 to autocast();                                         |
 |                   |           | export NVTE_DPA_FP8_RECIPE="F16"                                                  |
 +-------------------+-----------+-----------------------------------------------------------------------------------+
-| FP8DS             | FP8DS     | Pass FP8DS to fp8_autocast();                                                     |
+| FP8DS             | FP8DS     | Pass FP8DS to autocast();                                                         |
 +-------------------+-----------+-----------------------------------------------------------------------------------+
-| FP8CS             | FP8DS     | Pass FP8CS to fp8_autocast();                                                     |
+| FP8CS             | FP8DS     | Pass FP8CS to autocast();                                                         |
 |                   |           | Attention FP8DS reuses the fp8_format, fp8_dpa, fp8_mha values from linear FP8CS; |
 |                   |           | export NVTE_DPA_FP8_RECIPE="DelayedScaling"    # switch to DS                     |
 |                   |           | export NVTE_DPA_FP8DS_AMAX_ALGO="most_recent"  # or "max"                         |
 |                   |           | export NVTE_DPA_FP8DS_AMAX_HISTLEN=1           # or any other integer             |
 |                   |           | export NVTE_DPA_FP8DS_REDUCE_AMAX=1            # or 0                             |
 +-------------------+-----------+-----------------------------------------------------------------------------------+
-| NVFP4             | FP8DS     | Pass NVFP4 to fp8_autocast();                                                     |
+| NVFP4             | FP8DS     | Pass NVFP4 to autocast();                                                         |
 |                   |           | Attention FP8DS reuses the fp8_dpa, fp8_mha values from linear NVFP4;             |
 |                   |           | export NVTE_DPA_FP8_RECIPE="DelayedScaling"    # switch to DS                     |
 |                   |           | export NVTE_DPA_FP8_FORMAT="HYBRID"            # or "E4M3", "E5M2"                |
@@ -118,19 +118,19 @@ layers as follows.
 |                   |           | export NVTE_DPA_FP8DS_AMAX_HISTLEN=1           # or any other integer             |
 |                   |           | export NVTE_DPA_FP8DS_REDUCE_AMAX=1            # or 0                             |
 +-------------------+-----------+-----------------------------------------------------------------------------------+
-| FP8DS             | FP8CS     | Pass FP8DS to fp8_autocast();                                                     |
+| FP8DS             | FP8CS     | Pass FP8DS to autocast();                                                         |
 |                   |           | Attention uses FP8DS for S, dP tensors, and creates a new FP8CS recipe for QKV, O,|
 |                   |           | dO, dQKV tensors based on fp8_format, fp8_dpa, fp8_mha from linear FP8DS;         |
 |                   |           | export NVTE_DPA_FP8_RECIPE="Float8CurrentScaling"  # switch to CS                 |
 +-------------------+-----------+-----------------------------------------------------------------------------------+
-| FP8CS             | FP8CS     | Pass FP8CS to fp8_autocast();                                                     |
+| FP8CS             | FP8CS     | Pass FP8CS to autocast();                                                         |
 |                   |           | Attention uses FP8CS for QKV, O, dO, dQKV tensors, and creates a new FP8DS recipe |
 |                   |           | for S, dP tensors based on fp8_format, fp8_dpa, fp8_mha from linear FP8CS and:    |
 |                   |           | export NVTE_DPA_FP8DS_AMAX_ALGO="most_recent"  # or "max"                         |
 |                   |           | export NVTE_DPA_FP8DS_AMAX_HISTLEN=1           # or any other integer             |
 |                   |           | export NVTE_DPA_FP8DS_REDUCE_AMAX=1            # or 0                             |
 +-------------------+-----------+-----------------------------------------------------------------------------------+
-| NVFP4             | FP8CS     | Pass NVFP4 to fp8_autocast();                                                     |
+| NVFP4             | FP8CS     | Pass NVFP4 to autocast();                                                         |
 |                   |           | Attention creates a new FP8CS recipe for QKV, O, dO, dQKV, and a new FP8DS recipe |
 |                   |           | for S, dP, based on the fp8_dpa, fp8_mha values from linear NVFP4 and:            |
 |                   |           | export NVTE_DPA_FP8_RECIPE="Float8CurrentScaling"  # switch to CS                 |
@@ -544,7 +544,7 @@ class DotProductAttention(TransformerEngineBaseModule):
         """
         _original_recipe = self.fp8_meta.get("recipe", None)
-        # global recipe set in fp8_autocast()
+        # global recipe set in autocast()
         fp8_recipe = FP8GlobalStateManager.get_fp8_recipe()
         if fp8_recipe.custom():
             return
@@ -560,7 +560,7 @@ class DotProductAttention(TransformerEngineBaseModule):
             fp8_recipe_dpa = fp8_recipe
             fp8_recipes = fp8_recipe
             if _dpa_fp8_recipe == "F16":
-                # ignore the recipe from fp8_autocast, set fp8_dpa = False, fp8_mha = False
+                # ignore the recipe from autocast, set fp8_dpa = False, fp8_mha = False
                 fp8_recipe.fp8_dpa = False
                 fp8_recipe.fp8_mha = False
             elif fp8_recipe.float8_current_scaling() and _dpa_fp8_recipe == "DelayedScaling":
......
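End to end, the first table row (quantized GEMMs, higher-precision attention) might look as follows. `NVTE_DPA_FP8_RECIPE` is read at import time (see the `_dpa_fp8_recipe = os.getenv(...)` hunk further down), so it must be set before Transformer Engine is imported; layer sizes are illustrative:

    import os

    os.environ["NVTE_DPA_FP8_RECIPE"] = "F16"  # attention stays in F16/BF16

    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.pytorch.quantization import autocast
    from transformer_engine.common.recipe import Float8CurrentScaling

    layer = te.TransformerLayer(1024, 4096, 16).cuda()
    x = torch.randn(8, 2, 1024, device="cuda", dtype=torch.bfloat16)

    # Linear layers quantize with FP8 current scaling; attention does not.
    with autocast(enabled=True, recipe=Float8CurrentScaling()):
        y = layer(x)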
@@ -40,7 +40,7 @@ from transformer_engine.pytorch.tensor.float8_tensor import (
     Float8Quantizer,
     Float8CurrentScalingQuantizer,
 )
-from transformer_engine.pytorch.fp8 import get_fp8_te_dtype
+from transformer_engine.pytorch.quantization import get_fp8_te_dtype
 from transformer_engine.pytorch.constants import TE_DType
@@ -222,7 +222,7 @@ class AttentionParams:
     is_training: bool, default = `True`
         Whether in training mode (`True`) or inference mode (`False`)
     fp8: bool, default = `False`
-        Whether `DotProductAttention` is in an `fp8_autocast` region.
+        Whether `DotProductAttention` is in an `autocast` region.
     fp8_meta: Optional[Dict[str Any]], default = `None`
         The FP8 metadata tensor of `DotProductAttention`.
     inference_params: Optional[InferenceParams], default = `None`
......
@@ -9,7 +9,7 @@ from typing import Callable, List, Optional, Tuple, Union
 import torch
 from transformer_engine.debug.pytorch.debug_state import TEDebugState
-from transformer_engine.pytorch.fp8 import FP8GlobalStateManager
+from transformer_engine.pytorch.quantization import FP8GlobalStateManager
 from transformer_engine.pytorch.tensor.float8_tensor import Float8Tensor
 from transformer_engine.pytorch.module.base import TransformerEngineBaseModule
 from transformer_engine.pytorch.module import LayerNormLinear, Linear, RMSNorm, LayerNorm
@@ -33,7 +33,7 @@ from transformer_engine.pytorch.attention.dot_product_attention import DotProduc
 from transformer_engine.pytorch.attention.inference import InferenceParams
 from transformer_engine.pytorch.attention.rope import apply_rotary_pos_emb

-# Force DotProductAttention to use a different recipe than the fp8_recipe set in fp8_autocast().
+# Force DotProductAttention to use a different recipe than the fp8_recipe set in autocast().
 # Useful when GEMMs and attention use different recipes. Supported values are "DelayedScaling"
 # and "Float8CurrentScaling". Use other relevant variables here to define the recipe, e.g. fp8_dpa.
 _dpa_fp8_recipe = os.getenv("NVTE_DPA_FP8_RECIPE", "")
......
@@ -36,7 +36,7 @@ from .utils import (
     needs_quantized_gemm,
 )
 from .constants import dist_group_type
-from .fp8 import FP8GlobalStateManager, fp8_autocast
+from .quantization import FP8GlobalStateManager, autocast
 from .tensor.float8_tensor import Float8Quantizer, Float8Tensor, Float8CurrentScalingQuantizer
 from .tensor.mxfp8_tensor import MXFP8Quantizer
 from .tensor.nvfp4_tensor import NVFP4Quantizer
@@ -419,8 +419,8 @@ class _CheckpointFunction(torch.autograd.Function):
         detached_inputs = detach_variable(inputs)
         with torch.enable_grad(), ctx.recompute_ctx, ctx.torch_gpu_amp_ctx, ctx.torch_cpu_amp_ctx, activation_recompute_forward(
             activation_recompute=True, recompute_phase=True
-        ), fp8_autocast(
-            enabled=ctx.fp8, fp8_recipe=ctx.fp8_recipe
+        ), autocast(
+            enabled=ctx.fp8, recipe=ctx.fp8_recipe
         ):
             outputs = ctx.run_function(*detached_inputs, **ctx.kwargs)
@@ -754,8 +754,8 @@ def checkpoint(
     def recompute_fn(*args, **kwargs):
         with torch.autograd.enable_grad(), (
             te_recompute_ctx
-        ), user_recompute_ctx, torch_gpu_amp_forward_ctx, torch_cpu_amp_forward_ctx, fp8_autocast(
-            enabled=fp8, fp8_recipe=fp8_recipe
+        ), user_recompute_ctx, torch_gpu_amp_forward_ctx, torch_cpu_amp_forward_ctx, autocast(
+            enabled=fp8, recipe=fp8_recipe
         ):
             function(*args, **kwargs)
@@ -1969,7 +1969,7 @@ def prepare_te_modules_for_fsdp(fsdp_root: torch.nn.Module) -> None:
     if hasattr(fsdp_root, "primary_weights_in_fp8"):
         assert not fsdp_root.primary_weights_in_fp8, (
             "TE modules with primary weights in FP8 cannot be FSDP-wrapped. "
-            "Please initialize your model without the te.fp8_model_init(...) context."
+            "Please initialize your model without the te.quantized_model_init(...) context."
         )
     root_state = _get_module_fsdp_state(fsdp_root)
     assert root_state is not None, "Root module does not have a valid _FSDPState."
@@ -1982,7 +1982,7 @@ def prepare_te_modules_for_fsdp(fsdp_root: torch.nn.Module) -> None:
         if hasattr(fsdp_module.module, "primary_weights_in_fp8"):
             assert not fsdp_module.module.primary_weights_in_fp8, (
                 "TE modules with primary weights in FP8 cannot be FSDP-wrapped. "
-                "Please initialize your model without the te.fp8_model_init(...) context."
+                "Please initialize your model without the te.quantized_model_init(...) context."
             )
         setattr(fsdp_module.module, "fsdp_group", state.process_group)
......
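For completeness: the checkpointing path above re-enters `autocast` with the captured recipe during recomputation, so user code only wraps the outer forward. A sketch under assumed shapes (chosen to satisfy FP8 GEMM divisibility; the module choice is illustrative):

    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.pytorch.distributed import checkpoint
    from transformer_engine.pytorch.quantization import autocast
    from transformer_engine.common.recipe import DelayedScaling

    layer = te.LayerNormMLP(1024, 4096).cuda()
    x = torch.randn(16, 4, 1024, device="cuda", dtype=torch.bfloat16, requires_grad=True)

    # Recompute-in-backward re-applies autocast(enabled=..., recipe=...) itself.
    with autocast(enabled=True, recipe=DelayedScaling()):
        y = checkpoint(layer, x)
    y.sum().backward()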