- 16 Jun, 2023 3 commits
-
-
Kirthi Shankar Sivamani authored
Changed VERSION to 0.11.0dev Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Neta Zmora authored
* Fix softmax ONNX export * BF16 is validated using "fake i/o": i.e., instead of using BF16 as input/output, use FP32 input/output and convert to/from BF16 in the forward method. * Wrap softmax symbolic functions with conversion to/from FP32 to produce the same semantics as TE's softmax (compute is performed at FP32 precision regardless of input/output data type). Signed-off-by:
Neta Zmora <nzmora@nvidia.com> * ONNX export Code refactoring Share function compute_in_fp32 between softmax.py (softmax symbolic functions) and te_onnx_extensions.py (the rest of the symbolic functions). Signed-off-by:
Neta Zmora <nzmora@nvidia.com> * Fix exports Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * lint Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Neta Zmora <nzmora@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
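The shared compute_in_fp32 idea above — run the op at FP32 regardless of the I/O dtype, and cast at the boundaries — can be sketched as a generic wrapper. This is an illustrative sketch, not TE's actual helper; FP16 stands in for BF16 since NumPy has no bfloat16 dtype:

```python
import numpy as np

def softmax_fp32(x):
    # Numerically stable softmax, computed in FP32.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def compute_in_fp32(fn, x):
    # Upcast the input to FP32, run the op, cast back to the original dtype.
    # This mirrors the "fake i/o" validation: the caller sees low-precision
    # tensors, but the compute itself happens at FP32.
    out = fn(x.astype(np.float32))
    return out.astype(x.dtype)

x = np.array([[1.0, 2.0, 3.0]], dtype=np.float16)  # FP16 as a BF16 stand-in
y = compute_in_fp32(softmax_fp32, x)
```

Wrapping the ONNX symbolic functions the same way keeps the exported graph's semantics aligned with TE's kernels, which always compute softmax at FP32.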
-
Kirthi Shankar Sivamani authored
* ReLU ONNX export Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Add GLU variants Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * linter check Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Export reglu, geglu, swiglu Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Review comments Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 14 Jun, 2023 1 commit
-
-
Santosh Bhavani authored
* Update README.rst Added additional integrations Signed-off-by:
Santosh Bhavani <santosh.bhavani@live.com> * Update README.rst Added DeepSpeed integration Signed-off-by:
Santosh Bhavani <santosh.bhavani@live.com> --------- Signed-off-by:
Santosh Bhavani <santosh.bhavani@live.com>
-
- 13 Jun, 2023 3 commits
-
-
Kirthi Shankar Sivamani authored
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Przemyslaw Tredak authored
* Added ReLU and GLU variants to common Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * PyTorch changes Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * PyTorch C++ lint Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Bug fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * More fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix storage errors Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Compute bgrad Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix numerical tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix ONNX export tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Review comments Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
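The GLU variants added in this commit all share one pattern: split the projected activations in half along the last dimension, apply the gating nonlinearity to one half, and multiply elementwise by the other. A minimal NumPy sketch of that pattern (function names are illustrative, not TE's API):

```python
import numpy as np

def _silu(x):
    # SiLU/Swish: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def _gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def glu_variant(x, act):
    # Split the last dimension in half: one half is gated, the other scales it.
    a, b = np.split(x, 2, axis=-1)
    return act(a) * b

def reglu(x):
    return glu_variant(x, lambda t: np.maximum(t, 0.0))

def geglu(x):
    return glu_variant(x, _gelu)

def swiglu(x):
    return glu_variant(x, _silu)
```

Note that the output feature dimension is half the input's, which is why the preceding projection doubles its width when a GLU variant is used.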
-
zlsh80826 authored
Move jax.experimental.maps.Mesh to jax.sharding.Mesh Signed-off-by: Reese Wang <rewang@nvidia.com>
-
- 12 Jun, 2023 1 commit
-
-
Tim Moon authored
This reverts commit 914f3841 . Signed-off-by:
Tim Moon <tmoon@nvidia.com>
-
- 11 Jun, 2023 1 commit
-
-
asfiyab-nvidia authored
* fix BF16 onnx export for ort verification Signed-off-by:
Asfiya Baig <asfiyab@nvidia.com> * Update transformer_engine/pytorch/attention.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
asfiyab-nvidia <117682710+asfiyab-nvidia@users.noreply.github.com> --------- Signed-off-by:
Asfiya Baig <asfiyab@nvidia.com> Signed-off-by:
asfiyab-nvidia <117682710+asfiyab-nvidia@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 08 Jun, 2023 1 commit
-
-
Kaixi Hou authored
* Only use one gpu for tensorflow tests Signed-off-by:
kaixih <kaixih@nvidia.com> * Simplify the change Signed-off-by:
kaixih <kaixih@nvidia.com> * Final fix Signed-off-by:
kaixih <kaixih@nvidia.com> --------- Signed-off-by:
kaixih <kaixih@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 07 Jun, 2023 3 commits
-
-
Santosh Bhavani authored
* Update README.rst 1/ added a nav header with links 2/ added integrations section 3/ minor grammatical changes 4/ added link to release notes Signed-off-by:
Santosh Bhavani <santosh.bhavani@live.com> * Update README.rst Update NGC PyT container usage instructions Signed-off-by:
Santosh Bhavani <santosh.bhavani@live.com> * Update README.rst - added pre-reqs under installation - reorganized useful links as papers and videos - updated integrations to include upcoming work - updated copy in contributing section - updated highlights section - updated nav header - added latest news section Signed-off-by:
Santosh Bhavani <santosh.bhavani@live.com> * Update README.rst Co-authored-by:
Santosh Bhavani <santosh.bhavani@live.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Update README.rst - updated integrations section - add DL FW support info Signed-off-by:
Santosh Bhavani <santosh.bhavani@live.com> --------- Signed-off-by:
Santosh Bhavani <santosh.bhavani@live.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Kirthi Shankar Sivamani authored
* Use torch.compile for version 2.0 and higher Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Address review Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Remove unused import Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * use torch.__version__ Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Use NVFuser for dropout fusions Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix onnx tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
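Dispatching on the framework version, as this commit does for torch.compile, reduces to a comparison on the leading numeric components of the version string. A hedged sketch of that check (the helper name is invented; the real code inspects torch.__version__, which can carry suffixes like "+cu118" or "a0"):

```python
def use_torch_compile(version: str) -> bool:
    # Strip any local-build suffix ("2.0.1+cu118" -> "2.0.1"), then keep
    # only the leading digits of each dotted component ("0a0" -> 0).
    core = version.split("+")[0]
    parts = []
    for piece in core.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        parts.append(int(digits) if digits else 0)
    # torch.compile exists from 2.0 onward; fall back to TorchScript below that.
    return tuple(parts[:2]) >= (2, 0)
```

Comparing parsed tuples rather than raw strings avoids the classic "10" < "2" lexicographic trap.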
-
Frédéric Bastien authored
* Use the same default in the function as the class default. Signed-off-by:
Frederic Bastien <fbastien@nvidia.com> * Assert instead of silently ignoring unsupported variations. Small doc correction: amax_compute_algo is partially supported. Signed-off-by:
Frederic Bastien <fbastien@nvidia.com> * Fix line length to fix the CI. Signed-off-by:
Frederic Bastien <fbastien@nvidia.com> * Apply suggestions from code review Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Frédéric Bastien <frederic.bastien@gmail.com> * grammar Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Clarify that it is only TE/JAX that doesn't support that feature. Signed-off-by:
Frederic Bastien <fbastien@nvidia.com> * Update transformer_engine/jax/fp8.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Frédéric Bastien <frederic.bastien@gmail.com> * Update the test following the change in default value Signed-off-by:
Frederic Bastien <fbastien@nvidia.com> * Fix ci Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Frederic Bastien <fbastien@nvidia.com> Signed-off-by:
Frédéric Bastien <frederic.bastien@gmail.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 06 Jun, 2023 4 commits
-
-
Tian Zheng authored
* First step of PaddlePaddle integration - Add build option for paddle - Add basic test framework - Add 3 basic operators: cast_from_fp8, cast_to_fp8, gemm Signed-off-by:
Tian Zheng <tizheng@nvidia.com> Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Fix review comments Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Support paddle build Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Add paddle build support for new building framework Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Fix review comments Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Clean up build process for Paddle stub file Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Minor fixes Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Fix pylint "wrong-import-order" warning Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Fix review comments Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Skip BF16 GEMM tests for unsupported arch Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> --------- Signed-off-by:
Tian Zheng <tizheng@nvidia.com> Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
galagam authored
* add bf16 subgraph tests Signed-off-by:
Asfiya Baig <asfiyab@nvidia.com> * changes: 1. Add normal mode BF16 tests for all subgraphs 2. Add fake BF16 tests for low-level subgraphs 3. Separate IO serialization from validation Signed-off-by:
Asfiya Baig <asfiyab@nvidia.com> * ONNX export test - BF16 support part 1 TE infer returns torch.tensor to support bf16 output, which is currently not supported in numpy Signed-off-by:
Gal Hubara Agam <ghubaraagam@nvidia.com> * ONNX export test - BF16 support part 2 - Separate TE infer from serialize - Fix serialize function to use full path - Set unique filenames for fake bf16 (avoid overriding standard bf16) - Remove overwriting fake_bf16_io value Signed-off-by:
Gal Hubara Agam <ghubaraagam@nvidia.com> * Export test: slight tolerance increase in test_export_gpt_generation; the tight tolerance caused sporadic failures in ~1% of all runs Signed-off-by:
Gal Hubara Agam <ghubaraagam@nvidia.com> * Remove GEMM fake-bf16 export test and patch to enable it Signed-off-by:
Gal Hubara Agam <ghubaraagam@nvidia.com> * Review Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Asfiya Baig <asfiyab@nvidia.com> Signed-off-by:
Gal Hubara Agam <ghubaraagam@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Asfiya Baig <asfiyab@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
cyanguwa authored
* fix headers for doxygen Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix description f16 and use half precision instead Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
cyanguwa <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Signed-off-by:
cyanguwa <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Ming-Xu Huang authored
Signed-off-by: Ming Huang <mingh@nvidia.com>
-
- 02 Jun, 2023 1 commit
-
-
Jan Bielak authored
* Ignore IDE files Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Fix typing errors Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Ignore devcontainer files Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Avoid import from private module Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Apply @timmoon10 's suggestions Signed-off-by:
Jan Bielak <jbielak@nvidia.com> --------- Signed-off-by:
Jan Bielak <jbielak@nvidia.com>
-
- 01 Jun, 2023 2 commits
-
-
Sudhakar Singh authored
* extend fp8 weight placeholders logic for Linear, LNLinear, LNMLP Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update transformer_engine/pytorch/module/base.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update transformer_engine/pytorch/module/base.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update transformer_engine/pytorch/module/base.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update transformer_engine/pytorch/module/base.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update transformer_engine/pytorch/module/base.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update transformer_engine/pytorch/module/layernorm_linear.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update transformer_engine/pytorch/module/layernorm_mlp.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update transformer_engine/pytorch/module/linear.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update linear.py Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update layernorm_linear.py Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update layernorm_mlp.py Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * lint Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Kirthi Shankar Sivamani authored
pin FA version Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 31 May, 2023 1 commit
-
-
Tim Moon authored
* Refactor Setuptools build system Successfully launches CMake install, but installs CMake extensions in temp dir. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug JAX build Fix pybind11 import. Distinguish between build-time and run-time dependencies. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add helper function to determine dependencies Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add missing license Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug case where system CMake is too old Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add missing license Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Simplify sanity import tests Just importing modules provides richer error messages. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Properly install submodules Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Install helper library for TensorFlow Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Update documentation Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Do not install Ninja by default Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Include Git commit hash in version string Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Override build_ext.build_extensions instead of build_ext.run Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix incorrect include path Restore Ninja dependency. Restore overriding build_ext.run func. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Review suggestions from @nouiz Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Disable parallel Ninja jobs in GitHub actions Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Properly install userbuffers lib Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Tweak install docs Review suggestion from @ksivaman Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add examples for specifying framework in docs Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com>
-
- 26 May, 2023 1 commit
-
-
Kirthi Shankar Sivamani authored
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 25 May, 2023 2 commits
-
-
Kirthi Shankar Sivamani authored
* Rotary Position Embedding Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * remove einops Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Improve docstr Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
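Rotary position embeddings, added in this commit, rotate pairs of feature channels by a position-dependent angle so absolute position is encoded as a rotation. A compact NumPy sketch of the idea (illustrative only; TE's implementation works on torch tensors and uses a different memory layout):

```python
import numpy as np

def rope(x, base=10000.0):
    # x: (seq_len, dim) with even dim. Channel pair (2i, 2i+1) at position
    # pos is rotated by angle pos * base**(-2i/dim).
    seq_len, dim = x.shape
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    ang = np.outer(np.arange(seq_len), inv_freq)       # (seq_len, dim/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair undergoes a pure 2-D rotation, vector norms are preserved and position 0 is left unchanged, which makes the transform easy to sanity-check.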
-
Carlos Mocholí authored
* Clearer error messages for dtype and shape assertions Signed-off-by:
Carlos Mocholí <carlossmocholi@gmail.com> * Update transformer_engine/pytorch/utils.py Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Carlos Mocholí <carlossmocholi@gmail.com> * Update transformer_engine/pytorch/utils.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Carlos Mocholí <carlossmocholi@gmail.com> --------- Signed-off-by:
Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 23 May, 2023 1 commit
-
-
zlsh80826 authored
* Unfused scale+softmax if bias is present Signed-off-by:
Reese Wang <rewang@nvidia.com> * WAR a causal masking + no_bias bug and add the unittests Signed-off-by:
Reese Wang <rewang@nvidia.com> * Fix the optional args (bias) sharding Signed-off-by:
Reese Wang <rewang@nvidia.com> * Disable fused attn in JAX by default, enable it with NVTE_USE_FUSED_ATTN Signed-off-by:
Reese Wang <rewang@nvidia.com> * Add thread local for the plan cache Signed-off-by:
Reese Wang <rewang@nvidia.com> * Rename dbeta to dbias for readability Signed-off-by:
Reese Wang <rewang@nvidia.com> * Add scaled softmax with dropout test cases Signed-off-by:
Reese Wang <rewang@nvidia.com> * Updated NVTE_FUSED_ATTN variable name Signed-off-by:
Reese Wang <rewang@nvidia.com> --------- Signed-off-by:
Reese Wang <rewang@nvidia.com>
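Making the plan cache thread-local, as one bullet above does, gives each thread a private cache so concurrently executing threads never share or race on a cached plan. A Python rendering of the concept (the actual cache is C++ and stores cuDNN execution plans; `get_plan` and `build` are invented names):

```python
import threading

_local = threading.local()

def get_plan(key, build):
    # Each thread lazily creates its own cache dict on first use, so no
    # locking is needed and plans are never shared across threads.
    cache = getattr(_local, "plans", None)
    if cache is None:
        cache = _local.plans = {}
    if key not in cache:
        cache[key] = build(key)   # build once per (thread, key)
    return cache[key]
```

The trade-off is memory: a plan may be built once per thread instead of once per process, in exchange for thread safety without a mutex on the hot path.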
-
- 22 May, 2023 2 commits
-
-
cyanguwa authored
* relax attn mask type checks for FlashAttention Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * disable flash attn if mask tensor is not None Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix the logic for flash attn Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fix for lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
Przemek Tredak authored
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
-
- 19 May, 2023 1 commit
-
-
Tim Moon authored
* Initial implementation of NVRTC infrastructure Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Initial NVRTC impl for transpose NVRTC gives compilation errors at runtime. Everything else compiles and passes tests as expected. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug NVRTC transpose impl NVRTC kernel compiles, runs, and passes tests with FP32. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use variadic template for kernel arguments in RTC kernel launch func Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Refactoring Added utility header for CUDA Runtime API. Optimized concat_strings function. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add helper function for regex substitutions in strings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add option to disable NVRTC support Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add support for header includes in NVRTC kernels Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Access lazily-initialized CUDA driver lib and add option to specify CUDA header dir Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Configure NVRTC transpose kernel with simple perf model Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Revert change to tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Style fixes Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add prime-valued test cases Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix multiple definition error Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Optimize NVRTC transpose kernel for small data sizes Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Mention NVRTC in docs Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add unit tests for NVRTC and string utils Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add comment in install docs about NVRTC Review suggestion from @nouiz Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug perf model for RTC transpose kernel Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove NVRTC discussion from docs Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Require CUDA headers unless NVRTC is explicitly disabled Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use diagonal coords in transpose kernel to avoid partition camping Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use std::call_once for thread-safety Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Minor fixes Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug CMake error Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove unnecessary call_once Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove diagonal coordinates from transpose kernel Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use size_t indices instead of int Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Review suggestions from @ptrendx Check build-time CUDA include path for run-time CUDA headers. Handle case where CUDA context is initially uninitialized. Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com>
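Runtime-compiled kernels like the RTC transpose above are typically specialized by substituting type and tile-size placeholders into a CUDA source string before handing it to NVRTC; this commit mentions a regex-substitution helper for exactly that. A toy Python sketch of the templating step (the placeholder scheme and names here are invented, not TE's):

```python
import re

KERNEL_TEMPLATE = """
__global__ void transpose_kernel(const __TYPE__ *in, __TYPE__ *out,
                                 int rows, int cols) {
  __shared__ __TYPE__ tile[__TILE__][__TILE__ + 1];  // +1 pad avoids bank conflicts
  /* kernel body elided */
}
"""

def specialize(template, substitutions):
    # Replace each __NAME__ placeholder (uppercase only, so CUDA keywords
    # like __global__ and __shared__ are left alone) with its value.
    def repl(match):
        return substitutions[match.group(1)]
    return re.sub(r"__([A-Z]+)__", repl, template)

src = specialize(KERNEL_TEMPLATE, {"TYPE": "float", "TILE": "32"})
```

The specialized string would then be compiled with NVRTC and cached per (dtype, tile) configuration, which is what lets a simple perf model pick tile sizes at runtime.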
-
- 16 May, 2023 2 commits
-
-
Frédéric Bastien authored
Signed-off-by: Frederic Bastien <fbastien@nvidia.com>
-
Ming-Xu Huang authored
* Adding JAX/Praxis modules and dependencies. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding UTs to JAX/Praxis modules. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Remove praxis as a dependency due to not strictly needed Signed-off-by:
Ming Huang <mingh@nvidia.com> * Replace is_fp8_supported with is_fp8_available Signed-off-by:
Ming Huang <mingh@nvidia.com> * Make Praxis an optional dependency. 1. Removed 'from . import praxis' in __init__.py. 1.1 Note: kept 'from . import flax' for the deprecation warning. 2. Changed te.flax to te_flax in examples and README.rst. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding a workaround to FP8 training on Praxis. Signed-off-by:
Ming Huang <mingh@nvidia.com> --------- Signed-off-by:
Ming Huang <mingh@nvidia.com>
-
- 13 May, 2023 1 commit
-
-
Neta Zmora authored
* Dynamically-generated causal attention mask (for ONNX export) TE's default causal mask is square (seq_len, seq_len) and is dynamically allocated for different sequence sizes. Dynamic allocation and dictionary lookups are not supported by ONNX. GPT's generative phase uses rectangular masks. This commit forces softmax to use `forward_torch_softmax` and to dynamically generate an attention mask when exporting to ONNX. The mask is generated without conditional control-flow, by generating a (k_seq_len, k_seq_len) mask and slicing it to (q_seq_len, k_seq_len). An alternate implementation is to pre-allocate a mask of shape (max_seq, max_seq) and to slice that. This solution is more performant at the expense of space, but the problem is that TE doesn't have a concept of max_seq. * Add to test_export_softmax a test for te.softmax.FusedScaleMaskSoftmax. * Add test_softmax_mask_fn to test that TE's default attention mask and the new ONNX-compatible mask produce the same behavior. * Add test_export_gpt_generation to test that the ONNX model can correctly handle inputs with different shapes and that the attention mask is adjusted on-the-fly for different sequence lengths. Misc: * Add a PRNG seeding fixture for more stability in tests. * Add dynamic shapes for ONNX input/output tests. * Allow validate_result to compare ORT output to pre-computed TE outputs. Signed-off-by:
Neta Zmora <nzmora@nvidia.com> * Add NVTE_ONNX_KVCACHE_MAX_SEQ_LEN for efficient text-generation in inference * Introduce an environment variable (NVTE_ONNX_KVCACHE_MAX_SEQ_LEN) to set the maximum sequence length. In ONNX inference with KV-cache optimizations for GPT text generation, the attention mask shape can be square (context phase) or rectangular (generation phase). When exporting to ONNX and this variable is set, TE preallocates an upper triangular (k=1) matrix with a size as prescribed by the variable, and dynamically slices the mask to the required shape. TE models can be exported to ONNX when NVTE_ONNX_KVCACHE_MAX_SEQ_LEN is not configured, but then the attention masking is always square and not suited to efficient text generation. * Work around a torch.onnx.export bug that incorrectly folds layer_norm(data, scale=add(gamma,1)) to layer_norm(data, scale=gamma) when we use LN with zero-centered gamma. * ONNX export tests * Add a fixture (seed_default_rng) to seed the PRNG * Add a fixture (set_max_seq_len) to set the max sequence length when exporting to ONNX for GPT text generation Signed-off-by:
Neta Zmora <nzmora@nvidia.com> * Fix linting errors Signed-off-by:
Neta Zmora <nzmora@nvidia.com> * Remove immutable default values from a couple of function signatures Signed-off-by:
Neta Zmora <nzmora@nvidia.com> * Add @skip_FP8 to test_export_gpt_generation Signed-off-by:
Neta Zmora <nzmora@nvidia.com> * Update transformer_engine/pytorch/softmax.py Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix CI error for softmax export Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Lint Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Neta Zmora <nzmora@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
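The mask trick described in the commit above — build a square (k_seq_len, k_seq_len) upper-triangular mask and slice it to (q_seq_len, k_seq_len) without conditional control flow — can be sketched in NumPy (a sketch of the idea only, not TE's exported graph):

```python
import numpy as np

def causal_mask(q_seq_len, k_seq_len):
    # Square upper-triangular (k=1) boolean mask: True marks positions
    # a query must NOT attend to.
    full = np.triu(np.ones((k_seq_len, k_seq_len), dtype=bool), k=1)
    # Keep the last q_seq_len rows. In the context phase (q == k) this is
    # the usual causal mask; in the generation phase (q < k) each of the q
    # new queries attends to all earlier keys, matching the KV-cache layout.
    return full[k_seq_len - q_seq_len:, :]
```

Because both the construction and the slice are plain tensor ops, the same code path exports cleanly to ONNX for either mask shape.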
-
- 12 May, 2023 3 commits
-
-
Jeng Bai-Cheng authored
bugfix for softmax lowering Signed-off-by: Ryan Jeng <rjeng@nvidia.com>
-
Kirthi Shankar Sivamani authored
* deterministic JIT warmup Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * review comments Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Kirthi Shankar Sivamani authored
* LayerNormMLP numeric test Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * DotProductAttention numeric test Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 11 May, 2023 1 commit
-
-
Przemek Tredak authored
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
-
- 10 May, 2023 2 commits
-
-
Kirthi Shankar Sivamani authored
Check input dimensions for SP Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Shriya Palsamudram authored
userbuffer pushsend/recv fix with atomicAdd_system Signed-off-by:
Sangkug Lym <slym@nvidia.com> Co-authored-by:
Sangkug Lym <slym@nvidia.com>
-
- 09 May, 2023 3 commits
-
-
Ming-Xu Huang authored
[JAX] Fix missing axes parameters in TransformerLayer and the wrong shape of bias in LayerNormMLP (#196) Fixed missing axes and wrong shape of bias in LayerNormMLP Signed-off-by: Ming Huang <mingh@nvidia.com>
-
Jeng Bai-Cheng authored
* add mp example Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * refactor Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * update doc-string Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * better FP8 checker Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * update readme Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * replace te.* with te.flax* to remove deprecated warning Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * remove unused os.environ Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * remove unused Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * fix typo Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * Update examples/jax/encoder/README.md Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Jeng Bai-Cheng <jeng1220@users.noreply.github.com> Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * Update examples/jax/encoder/README.md Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Jeng Bai-Cheng <jeng1220@users.noreply.github.com> Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * Update examples/jax/encoder/README.md Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Jeng Bai-Cheng <jeng1220@users.noreply.github.com> Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * Update examples/jax/encoder/README.md Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Jeng Bai-Cheng <jeng1220@users.noreply.github.com> Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * Update examples/jax/encoder/test_multiprocessing_encoder.py Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Jeng Bai-Cheng <jeng1220@users.noreply.github.com> Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * remove cuda-python Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * adjust readme Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * Update examples/jax/encoder/README.md Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * fix cpp lint: fix issue of "Could not find a newline character at the end of the file." Signed-off-by:
Jeng Bai-Cheng <jeng1220@users.noreply.github.com> Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * fix AssertionError: 1 GPU per process Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * replace tfds with datasets The Flax application crashes if it uses TensorFlow Datasets (tfds) in the NVIDIA JAX container. tfds is very useful for downloading well-known datasets (e.g., MNIST, GLUE) and is commonly used by the TF/JAX community. However, it seems that it is NOT compatible with NVIDIA TensorFlow in the NVIDIA JAX container and somehow affects JAX. It triggers random errors at JAX initialization depending on the version, making CI unstable. Thus, this commit replaces tfds with "huggingface datasets" to download the needed datasets. See "nvbugs 4039266" for more details. Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * fix input sharding Unlike SPMD mode, in multiprocessing mode the input tensor must be sharded manually. Using DP=4, TP=2 as an example, the device mesh looks like: mesh.device_ids = [[0, 1], [2, 3], [4, 5], [6, 7]] Assume that the process ID is mapped to the GPU ID. Processes 0 and 1 are grouped for model parallelism, processes 2 and 3 are grouped together too, and so on. Processes 0 and 1 need to share the same micro-batch in the training step, while processes 0, 2, 4, and 6 each get a different micro-batch. Thus, `shard_array_wrapper` partitions inputs into 4 parts (and sets up the needed arguments for jax.make_array_from_single_device_arrays). Processes 0 and 1 take the first quarter, processes 2 and 3 take the second quarter, and so on. Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * refactor UT for multiprocess example Use Python `multiprocessing` to test the multiprocessing example if the system has multiple GPUs, 1 GPU per process. Because `jax.distributed.initialize` must be called before any other JAX or Flax API, GPU info cannot be queried by calling jax.local_devices() in TestEncoder. Thus, `unittest_query_gpu()` forks another process to query the number of GPUs and FP8 capability. Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * remove unused arg `--num-gpu` JAX doesn't have an API to set the number of GPUs used in SPMD mode. The only way is to use `CUDA_VISIBLE_DEVICES` for now. Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * fix typo Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * fix ut Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * simplify the mask setting Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * increase batch-size for multigpu example The batch-size of 64 is too small to be partitioned for 8xH100. If the batch-size is 64, the GEMM shape is 256x8192x8 per GPU. The 8 is too small for the FP8 GEMM kernel, and cuBLASLt will throw "Failed to query heuristics". Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * fix downloading mnist error Downloading MNIST via `huggingface datasets` requires Pillow; otherwise it throws `An error occurred while generating the dataset` Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> --------- Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> Signed-off-by:
Jeng Bai-Cheng <jeng1220@users.noreply.github.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
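The sharding rule described in the "fix input sharding" bullet above — with DP=4 and TP=2, every process in the same tensor-parallel group consumes the same micro-batch — reduces to indexing the data-parallel shard by process_id // tp_size. A minimal sketch (names are illustrative; the real example feeds each slice to jax.make_array_from_single_device_arrays):

```python
def local_microbatch(batch, process_id, dp_size, tp_size):
    # Split the global batch into dp_size contiguous shards. Processes in
    # the same tensor-parallel group share a data-parallel rank, so they
    # receive the identical shard; different DP ranks get different shards.
    assert len(batch) % dp_size == 0
    shard = len(batch) // dp_size
    dp_rank = process_id // tp_size
    return batch[dp_rank * shard:(dp_rank + 1) * shard]
```

With 8 processes, processes 0 and 1 take the first quarter of the batch, processes 2 and 3 the second quarter, and so on, exactly as the commit describes.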
-
galagam authored
* ONNX export - input names fix * Fix discrepancies due to input names not being defined correctly / not passed to export * Refactor ORT input feed creation for simplicity * Control whether to save test IO files via an environment variable Signed-off-by:
Gal Hubara Agam <ghubaraagam@nvidia.com> * ONNX export test: minor refactor Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
galagam <96368689+galagam@users.noreply.github.com> --------- Signed-off-by:
Gal Hubara Agam <ghubaraagam@nvidia.com> Signed-off-by:
galagam <96368689+galagam@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-