================================  ==============
Configuration                     Mean time (ms)
================================  ==============
Baseline PyTorch                  48.280
TE Unfused                        49.342
TE Unfused + TE Attention         35.709
TE Unfused + TE Attention + FP8   23.406
TE Fused + TE Attention + FP8     22.964
TE TransformerLayer + FP8         21.670
================================  ==============

Summary written to ``getting_started_pytorch_summary.csv``.
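The mean times above come from averaging repeated runs after a warm-up phase. As a rough sketch of such a timing loop (the ``benchmark`` helper and the workload here are illustrative, not the tutorial's actual benchmark code):

```python
import time

def benchmark(fn, warmup=3, iters=10):
    """Return the mean wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are excluded from the measurement
    # When timing GPU work, you would call torch.cuda.synchronize()
    # here (and again after the loop) so queued kernels finish
    # before the clock is read.
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1e3

mean_ms = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"Mean time: {mean_ms:.3f} ms")
```

Warming up first matters because the initial iterations can include one-off costs (allocations, kernel compilation, cache misses) that would skew the mean.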
Now let's also replace the attention mechanism with TE's optimized ``DotProductAttention``.
TE's attention automatically selects the best available backend, for example FlashAttention or cuDNN fused attention, based on your hardware and input configuration, delivering optimal performance without manual tuning.

.. tabs::

   .. tab:: PyTorch

      Replace the custom attention with TE's optimized implementation:
Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
See LICENSE for license information.
...
...
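To see concretely what these fused backends compute, here is a minimal, self-contained sketch (the tensor shapes are illustrative, not from the tutorial): the manual ``softmax(QK^T / sqrt(d)) V`` computation that a hand-written attention module performs matches PyTorch's fused ``scaled_dot_product_attention`` kernel, which, like TE's ``DotProductAttention``, dispatches to an optimized backend when one is available.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: batch 2, 4 heads, sequence length 8, head dim 16.
q = torch.randn(2, 4, 8, 16)
k = torch.randn(2, 4, 8, 16)
v = torch.randn(2, 4, 8, 16)

# Manual scaled dot-product attention, as a custom module might write it.
scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
manual = F.softmax(scores, dim=-1) @ v

# Fused kernel: numerically equivalent, but backend-dispatched.
fused = F.scaled_dot_product_attention(q, k, v)

assert torch.allclose(manual, fused, atol=1e-5)
```

The numerical equivalence is the point: swapping in a fused implementation changes only how the attention is executed, not what it computes.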
@@ -28,7 +28,7 @@ on `NVIDIA GPU Cloud <https://ngc.nvidia.com>`_.
pip - from PyPI
---------------
Transformer Engine can be directly installed from `our PyPI <https://pypi.org/project/transformer-engine/>`_, e.g.
...
...
@@ -47,7 +47,7 @@ The core package from Transformer Engine (without any framework extensions) can
By default, this installs the core library compiled for CUDA 12. The CUDA major version can be selected by changing the extra dependency to `core_cu12` or `core_cu13`.