- 09 Jan, 2026 1 commit
wuyf1 authored
## Summary

Fix swizzle / swap_first_dims RTC build and normalization test issues on `release_v2.7` (ROCm/HIP).

## Background

- The ROCm/HIP path currently hits build/runtime/test issues in:
  - `swizzle_scaling_factors` (HIP compile constraints with `__device__ __host__` constexpr)
  - RTC `swap_first_dims` source selection
  - `test_normalization` when `use_cudnn` is enabled for LayerNorm/RMSNorm
  - the PyTorch L0 unittest environment, which relies on `PYTHONPATH`

## Changes

1. **qa/L0_pytorch_unittest/test.sh**
   - Export `PYTHONPATH` to include `${TE_PATH}` so tests can import from the source tree without a reinstall.
   - Removed the explicit `pip3 install pytest==8.2.1` from the script.
2. **tests/cpp/operator/test_normalization.cu**
   - Skip LayerNorm/RMSNorm cases when `use_cudnn` is enabled:
     - `GTEST_SKIP() << "CudnnLayerNorm and CudnnRmsNorm are disabled.";`
   - Avoids running unsupported/disabled cuDNN normalization paths in this configuration.
3. **transformer_engine/common/CMakeLists.txt**
   - Fix RTC header generation for `swap_first_dims` on ROCm: use `transpose/rtc/swap_first_dims.hip` instead of `.cu`.
4. **transformer_engine/common/swizzle/swizzle.cu**
   - For `__HIP_PLATFORM_AMD__`, replace `constexpr __device__ __host__ int ...` with plain `constexpr int ...`.
   - Keeps the CUDA path unchanged.
   - Addresses HIP compilation constraints while preserving the constants' values and usage.

## Verification

- [x] Build in the 10.16.4.9 rocky_8.6 Docker environment
- [x] Run `qa/L0_pytorch_unittest/test.sh`
- [x] Run the C++ operator tests related to normalization/swizzle, as applicable

## Notes

- Branch synced with the latest `origin/release_v2.7` before opening this MR.

See merge request dcutoolkit/deeplearing/TransformerEngine!66
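The `swizzle.cu` change described in item 4 can be sketched as follows; the constant name `kTileDim` and its value are hypothetical, chosen only to illustrate the pattern, not copied from the actual file:

```cuda
// Sketch of the HIP fix, assuming an illustrative constant kTileDim.
#if defined(__HIP_PLATFORM_AMD__)
// hipcc rejects __device__ __host__ qualifiers on namespace-scope constexpr
// variables in this configuration; plain constexpr keeps the value usable
// in both host and device code on HIP, with the same value as before.
constexpr int kTileDim = 32;
#else
// CUDA path unchanged.
constexpr __device__ __host__ int kTileDim = 32;
#endif
```

The guard keeps the CUDA build byte-for-byte identical while satisfying hipcc, which is why the fix is confined to the `__HIP_PLATFORM_AMD__` branch.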
- 14 Aug, 2025 1 commit
Kirthi Shankar Sivamani authored
Add launch bounds to swizzle kernel, use empty scale inv

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
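For context, adding launch bounds to a kernel looks roughly like this; the kernel name, signature, and block size below are assumptions for illustration, not Transformer Engine's actual code:

```cuda
#include <cstdint>

constexpr int kThreadsPerBlock = 256;  // assumed block size

// __launch_bounds__(N) promises the compiler that the kernel is never
// launched with more than N threads per block, so it can budget registers
// for that block size instead of the worst case, reducing spills.
__global__ void __launch_bounds__(kThreadsPerBlock)
swizzle_kernel(const uint8_t* __restrict__ in, uint8_t* __restrict__ out,
               int n) {
  const int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) out[idx] = in[idx];  // placeholder body
}
```

A launch would then use a matching block size, e.g. `swizzle_kernel<<<grid, kThreadsPerBlock>>>(in, out, n);`; launching with more threads per block than the declared bound is a runtime error.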
- 06 Aug, 2025 1 commit
Xin Yao authored
* for loop
* bulk alloc
* multi-tensor swizzle
* pad zeros in swizzle kernels
* unify single- and multi-tensor swizzle
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* fix empty tensor list
* fix bug for col swizzle
* check context & fix signifiers

---------

Signed-off-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- 29 May, 2025 1 commit
Przemyslaw Tredak authored
* Changed the Tensor allocation strategy
* Fixes
* Disable debug flag
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fix the double free error
* Fix
* Fixed pyTorch recipe extension
* Fix
* Fix
* Hide TensorAllocator and fix the usage in LayerNorm
* Cleaning
* Fix
* Fix permutation

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- 27 Mar, 2025 1 commit
yuguo authored
- 07 Feb, 2025 1 commit
Przemek Tredak authored
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>