- 16 Jun, 2023 3 commits
-
-
Kirthi Shankar Sivamani authored
Changed VERSION to 0.11.0dev Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Neta Zmora authored
* Fix softmax ONNX export * BF16 is validated using "fake i/o": i.e., instead of using BF16 as input/output, use FP32 input/output and convert to/from BF16 in the forward method. * Wrap softmax symbolic functions with conversion to/from FP32 to produce the same semantics as TE's softmax (compute is performed at FP32 precision regardless of input/output data type). Signed-off-by:
Neta Zmora <nzmora@nvidia.com> * ONNX export Code refactoring Share function compute_in_fp32 between softmax.py (softmax symbolic functions) and te_onnx_extensions.py (the rest of the symbolic functions). Signed-off-by:
Neta Zmora <nzmora@nvidia.com> * Fix exports Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * lint Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Neta Zmora <nzmora@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
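The shared compute_in_fp32 idea above — run the op at FP32 regardless of the I/O dtype, and cast at the boundaries — can be sketched as a generic wrapper. This is an illustrative sketch, not TE's actual helper; FP16 stands in for BF16 since NumPy has no bfloat16 dtype:

```python
import numpy as np

def softmax_fp32(x):
    # Numerically stable softmax, computed in FP32.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def compute_in_fp32(fn, x):
    # Upcast the input to FP32, run the op, cast back to the original dtype.
    # This mirrors the "fake i/o" validation: the caller sees low-precision
    # tensors, but the compute itself happens at FP32.
    out = fn(x.astype(np.float32))
    return out.astype(x.dtype)

x = np.array([[1.0, 2.0, 3.0]], dtype=np.float16)  # FP16 as a BF16 stand-in
y = compute_in_fp32(softmax_fp32, x)
```

Wrapping the ONNX symbolic functions the same way keeps the exported graph's semantics aligned with TE's kernels, which always compute softmax at FP32.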
-
Kirthi Shankar Sivamani authored
* ReLU ONNX export Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Add GLU variants Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * linter check Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Export reglu, geglu, swiglu Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Review comments Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 14 Jun, 2023 1 commit
-
-
Santosh Bhavani authored
* Update README.rst Added additional integrations Signed-off-by:
Santosh Bhavani <santosh.bhavani@live.com> * Update README.rst Added DeepSpeed integration Signed-off-by:
Santosh Bhavani <santosh.bhavani@live.com> --------- Signed-off-by:
Santosh Bhavani <santosh.bhavani@live.com>
-
- 13 Jun, 2023 3 commits
-
-
Kirthi Shankar Sivamani authored
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Przemyslaw Tredak authored
* Added ReLU and GLU variants to common Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * PyTorch changes Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * PyTorch C++ lint Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Bug fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * More fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix storage errors Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Compute bgrad Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix numerical tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix ONNX export tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Review comments Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
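The GLU variants added in this commit all share one pattern: split the projected activations in half along the last dimension, apply the gating nonlinearity to one half, and multiply elementwise by the other. A minimal NumPy sketch of that pattern (function names are illustrative, not TE's API):

```python
import numpy as np

def _silu(x):
    # SiLU/Swish: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def _gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def glu_variant(x, act):
    # Split the last dimension in half: one half is gated, the other scales it.
    a, b = np.split(x, 2, axis=-1)
    return act(a) * b

def reglu(x):
    return glu_variant(x, lambda t: np.maximum(t, 0.0))

def geglu(x):
    return glu_variant(x, _gelu)

def swiglu(x):
    return glu_variant(x, _silu)
```

Note that the output feature dimension is half the input's, which is why the preceding projection doubles its width when a GLU variant is used.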
-
zlsh80826 authored
Move jax.experimental.maps.Mesh to jax.sharding.Mesh Signed-off-by: Reese Wang <rewang@nvidia.com>
-
- 12 Jun, 2023 1 commit
-
-
Tim Moon authored
This reverts commit 914f3841 . Signed-off-by:
Tim Moon <tmoon@nvidia.com>
-
- 11 Jun, 2023 1 commit
-
-
asfiyab-nvidia authored
* fix BF16 onnx export for ort verification Signed-off-by:
Asfiya Baig <asfiyab@nvidia.com> * Update transformer_engine/pytorch/attention.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
asfiyab-nvidia <117682710+asfiyab-nvidia@users.noreply.github.com> --------- Signed-off-by:
Asfiya Baig <asfiyab@nvidia.com> Signed-off-by:
asfiyab-nvidia <117682710+asfiyab-nvidia@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 08 Jun, 2023 1 commit
-
-
Kaixi Hou authored
* Only use one gpu for tensorflow tests Signed-off-by:
kaixih <kaixih@nvidia.com> * Simplify the change Signed-off-by:
kaixih <kaixih@nvidia.com> * Final fix Signed-off-by:
kaixih <kaixih@nvidia.com> --------- Signed-off-by:
kaixih <kaixih@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 07 Jun, 2023 3 commits
-
-
Santosh Bhavani authored
* Update README.rst 1/ added a nav header with links 2/ added integrations section 3/ minor grammatical changes 4/ added link to release notes Signed-off-by:
Santosh Bhavani <santosh.bhavani@live.com> * Update README.rst Update NGC PyT container usage instructions Signed-off-by:
Santosh Bhavani <santosh.bhavani@live.com> * Update README.rst - added pre-reqs under installation - reorganized useful links as papers and videos - updated integrations to include upcoming work - updated copy in contributing section - updated highlights section - updated nav header - added latest news section Signed-off-by:
Santosh Bhavani <santosh.bhavani@live.com> * Update README.rst Co-authored-by:
Santosh Bhavani <santosh.bhavani@live.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Update README.rst - updated integrations section - add DL FW support info Signed-off-by:
Santosh Bhavani <santosh.bhavani@live.com> --------- Signed-off-by:
Santosh Bhavani <santosh.bhavani@live.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Kirthi Shankar Sivamani authored
* Use torch.compile for version 2.0 and higher Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Address review Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Remove unused import Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * use torch.__version__ Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Use NVFuser for dropout fusions Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix onnx tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
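Dispatching on the framework version, as this commit does for torch.compile, reduces to a comparison on the leading numeric components of the version string. A hedged sketch of that check (the helper name is invented; the real code inspects torch.__version__, which can carry suffixes like "+cu118" or "a0"):

```python
def use_torch_compile(version: str) -> bool:
    # Strip any local-build suffix ("2.0.1+cu118" -> "2.0.1"), then keep
    # only the leading digits of each dotted component ("0a0" -> 0).
    core = version.split("+")[0]
    parts = []
    for piece in core.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        parts.append(int(digits) if digits else 0)
    # torch.compile exists from 2.0 onward; fall back to TorchScript below that.
    return tuple(parts[:2]) >= (2, 0)
```

Comparing parsed tuples rather than raw strings avoids the classic "10" < "2" lexicographic trap.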
-
Frédéric Bastien authored
* Use the same default in the function as the class default. Signed-off-by:
Frederic Bastien <fbastien@nvidia.com> * Assert instead of silently ignoring unsupported variations. Small doc correction: amax_compute_algo is partially supported. Signed-off-by:
Frederic Bastien <fbastien@nvidia.com> * Fix line length to fix the CI. Signed-off-by:
Frederic Bastien <fbastien@nvidia.com> * Apply suggestions from code review Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Frédéric Bastien <frederic.bastien@gmail.com> * grammar Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Clarify that it is only TE/JAX that doesn't support that feature. Signed-off-by:
Frederic Bastien <fbastien@nvidia.com> * Update transformer_engine/jax/fp8.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Frédéric Bastien <frederic.bastien@gmail.com> * Update the test following the change in default value Signed-off-by:
Frederic Bastien <fbastien@nvidia.com> * Fix ci Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Frederic Bastien <fbastien@nvidia.com> Signed-off-by:
Frédéric Bastien <frederic.bastien@gmail.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 06 Jun, 2023 4 commits
-
-
Tian Zheng authored
* First step of PaddlePaddle integration - Add build option for paddle - Add basic test framework - Add 3 basic operators: cast_from_fp8, cast_to_fp8, gemm Signed-off-by:
Tian Zheng <tizheng@nvidia.com> Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Fix review comments Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Support paddle build Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Add paddle build support for new building framework Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Fix review comments Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Clean up build process for Paddle stub file Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Minor fixes Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Fix pylint "wrong-import-order" warning Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Fix review comments Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Skip BF16 GEMM tests for unsupported arch Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> --------- Signed-off-by:
Tian Zheng <tizheng@nvidia.com> Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
galagam authored
* add bf16 subgraph tests Signed-off-by:
Asfiya Baig <asfiyab@nvidia.com> * changes: 1. Add normal mode BF16 tests for all subgraphs 2. Add fake BF16 tests for low-level subgraphs 3. Separate IO serialization from validation Signed-off-by:
Asfiya Baig <asfiyab@nvidia.com> * ONNX export test - BF16 support part 1 TE infer returns torch.tensor to support bf16 output, which is currently not supported in numpy Signed-off-by:
Gal Hubara Agam <ghubaraagam@nvidia.com> * ONNX export test - BF16 support part 2 - Separate TE infer from serialize - Fix serialize function to use full path - Set unique filenames for fake bf16 (avoid overriding standard bf16) - Remove overwriting fake_bf16_io value Signed-off-by:
Gal Hubara Agam <ghubaraagam@nvidia.com> * Export test: slight tolerance increase in test_export_gpt_generation; the tight tolerance caused sporadic failures in ~1% of all runs Signed-off-by:
Gal Hubara Agam <ghubaraagam@nvidia.com> * Remove GEMM fake-bf16 export test and patch to enable it Signed-off-by:
Gal Hubara Agam <ghubaraagam@nvidia.com> * Review Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Asfiya Baig <asfiyab@nvidia.com> Signed-off-by:
Gal Hubara Agam <ghubaraagam@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Asfiya Baig <asfiyab@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
cyanguwa authored
* fix headers for doxygen Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix description f16 and use half precision instead Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
cyanguwa <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Signed-off-by:
cyanguwa <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Ming-Xu Huang authored
Signed-off-by: Ming Huang <mingh@nvidia.com>
-
- 02 Jun, 2023 1 commit
-
-
Jan Bielak authored
* Ignore IDE files Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Fix typing errors Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Ignore devcontainer files Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Avoid import from private module Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Apply @timmoon10 's suggestions Signed-off-by:
Jan Bielak <jbielak@nvidia.com> --------- Signed-off-by:
Jan Bielak <jbielak@nvidia.com>
-
- 01 Jun, 2023 2 commits
-
-
Sudhakar Singh authored
* extend fp8 weight placeholders logic for Linear, LNLinear, LNMLP Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update transformer_engine/pytorch/module/base.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update transformer_engine/pytorch/module/base.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update transformer_engine/pytorch/module/base.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update transformer_engine/pytorch/module/base.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update transformer_engine/pytorch/module/base.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update transformer_engine/pytorch/module/layernorm_linear.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update transformer_engine/pytorch/module/layernorm_mlp.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update transformer_engine/pytorch/module/linear.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update linear.py Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update layernorm_linear.py Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update layernorm_mlp.py Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * lint Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Kirthi Shankar Sivamani authored
pin FA version Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 31 May, 2023 1 commit
-
-
Tim Moon authored
* Refactor Setuptools build system Successfully launches CMake install, but installs CMake extensions in temp dir. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug JAX build Fix pybind11 import. Distinguish between build-time and run-time dependencies. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add helper function to determine dependencies Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add missing license Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug case where system CMake is too old Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add missing license Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Simplify sanity import tests Just importing modules provides richer error messages. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Properly install submodules Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Install helper library for TensorFlow Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Update documentation Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Do not install Ninja by default Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Include Git commit hash in version string Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Override build_ext.build_extensions instead of build_ext.run Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix incorrect include path Restore Ninja dependency. Restore overriding build_ext.run func. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Review suggestions from @nouiz Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Disable parallel Ninja jobs in GitHub actions Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Properly install userbuffers lib Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Tweak install docs Review suggestion from @ksivaman Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add examples for specifying framework in docs Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com>
-
- 26 May, 2023 1 commit
-
-
Kirthi Shankar Sivamani authored
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 25 May, 2023 2 commits
-
-
Kirthi Shankar Sivamani authored
* Rotary Position Embedding Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * remove einops Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Improve docstr Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
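Rotary position embeddings, added in this commit, rotate pairs of feature channels by a position-dependent angle so absolute position is encoded as a rotation. A compact NumPy sketch of the idea (illustrative only; TE's implementation works on torch tensors and uses a different memory layout):

```python
import numpy as np

def rope(x, base=10000.0):
    # x: (seq_len, dim) with even dim. Channel pair (2i, 2i+1) at position
    # pos is rotated by angle pos * base**(-2i/dim).
    seq_len, dim = x.shape
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    ang = np.outer(np.arange(seq_len), inv_freq)       # (seq_len, dim/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair undergoes a pure 2-D rotation, vector norms are preserved and position 0 is left unchanged, which makes the transform easy to sanity-check.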
-
Carlos Mocholí authored
* Clearer error messages for dtype and shape assertions Signed-off-by:
Carlos Mocholí <carlossmocholi@gmail.com> * Update transformer_engine/pytorch/utils.py Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Carlos Mocholí <carlossmocholi@gmail.com> * Update transformer_engine/pytorch/utils.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Carlos Mocholí <carlossmocholi@gmail.com> --------- Signed-off-by:
Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 23 May, 2023 1 commit
-
-
zlsh80826 authored
* Unfused scale+softmax if bias is present Signed-off-by:
Reese Wang <rewang@nvidia.com> * WAR a causal masking + no_bias bug and add the unittests Signed-off-by:
Reese Wang <rewang@nvidia.com> * Fix the optional args (bias) sharding Signed-off-by:
Reese Wang <rewang@nvidia.com> * Disable fused attn in JAX by default, enable it with NVTE_USE_FUSED_ATTN Signed-off-by:
Reese Wang <rewang@nvidia.com> * Add thread local for the plan cache Signed-off-by:
Reese Wang <rewang@nvidia.com> * Rename dbeta to dbias for readability Signed-off-by:
Reese Wang <rewang@nvidia.com> * Add scaled softmax with dropout test cases Signed-off-by:
Reese Wang <rewang@nvidia.com> * Updated NVTE_FUSED_ATTN variable name Signed-off-by:
Reese Wang <rewang@nvidia.com> --------- Signed-off-by:
Reese Wang <rewang@nvidia.com>
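Making the plan cache thread-local, as one bullet above does, gives each thread a private cache so concurrently executing threads never share or race on a cached plan. A Python rendering of the concept (the actual cache is C++ and stores cuDNN execution plans; `get_plan` and `build` are invented names):

```python
import threading

_local = threading.local()

def get_plan(key, build):
    # Each thread lazily creates its own cache dict on first use, so no
    # locking is needed and plans are never shared across threads.
    cache = getattr(_local, "plans", None)
    if cache is None:
        cache = _local.plans = {}
    if key not in cache:
        cache[key] = build(key)   # build once per (thread, key)
    return cache[key]
```

The trade-off is memory: a plan may be built once per thread instead of once per process, in exchange for thread safety without a mutex on the hot path.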
-
- 22 May, 2023 2 commits
-
-
cyanguwa authored
* relax attn mask type checks for FlashAttention Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * disable flash attn if mask tensor is not None Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix the logic for flash attn Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fix for lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
Przemek Tredak authored
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
-
- 19 May, 2023 1 commit
-
-
Tim Moon authored
* Initial implementation of NVRTC infrastructure Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Initial NVRTC impl for transpose NVRTC gives compilation errors at runtime. Everything else compiles and passes tests as expected. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug NVRTC transpose impl NVRTC kernel compiles, runs, and passes tests with FP32. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use variadic template for kernel arguments in RTC kernel launch func Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Refactoring Added utility header for CUDA Runtime API. Optimized concat_strings function. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add helper function for regex substitutions in strings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add option to disable NVRTC support Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add support for header includes in NVRTC kernels Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Access lazily-initialized CUDA driver lib and add option to specify CUDA header dir Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Configure NVRTC transpose kernel with simple perf model Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Revert change to tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Style fixes Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add prime-valued test cases Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix multiple definition error Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Optimize NVRTC transpose kernel for small data sizes Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Mention NVRTC in docs Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add unit tests for NVRTC and string utils Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add comment in install docs about NVRTC Review suggestion from @nouiz Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug perf model for RTC transpose kernel Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove NVRTC discussion from docs Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Require CUDA headers unless NVRTC is explicitly disabled Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use diagonal coords in transpose kernel to avoid partition camping Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use std::call_once for thread-safety Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Minor fixes Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug CMake error Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove unnecessary call_once Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove diagonal coordinates from transpose kernel Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use size_t indices instead of int Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Review suggestions from @ptrendx Check build-time CUDA include path for run-time CUDA headers. Handle case where CUDA context is initially uninitialized. Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com>
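Runtime-compiled kernels like the RTC transpose above are typically specialized by substituting type and tile-size placeholders into a CUDA source string before handing it to NVRTC; this commit mentions a regex-substitution helper for exactly that. A toy Python sketch of the templating step (the placeholder scheme and names here are invented, not TE's):

```python
import re

KERNEL_TEMPLATE = """
__global__ void transpose_kernel(const __TYPE__ *in, __TYPE__ *out,
                                 int rows, int cols) {
  __shared__ __TYPE__ tile[__TILE__][__TILE__ + 1];  // +1 pad avoids bank conflicts
  /* kernel body elided */
}
"""

def specialize(template, substitutions):
    # Replace each __NAME__ placeholder (uppercase only, so CUDA keywords
    # like __global__ and __shared__ are left alone) with its value.
    def repl(match):
        return substitutions[match.group(1)]
    return re.sub(r"__([A-Z]+)__", repl, template)

src = specialize(KERNEL_TEMPLATE, {"TYPE": "float", "TILE": "32"})
```

The specialized string would then be compiled with NVRTC and cached per (dtype, tile) configuration, which is what lets a simple perf model pick tile sizes at runtime.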
-
- 16 May, 2023 2 commits
-
-
Frédéric Bastien authored
Signed-off-by: Frederic Bastien <fbastien@nvidia.com>
-
Ming-Xu Huang authored
* Adding JAX/Praxis modules and dependencies. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding UTs to JAX/Praxis modules. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Remove praxis as a dependency due to not strictly needed Signed-off-by:
Ming Huang <mingh@nvidia.com> * Replace is_fp8_supported with is_fp8_available Signed-off-by:
Ming Huang <mingh@nvidia.com> * Make Praxis an optional dependency. 1. Removed 'from . import praxis' in __init__.py. 1.1 Note: kept 'from . import flax' for the deprecation warning. 2. Changed te.flax to te_flax in examples and README.rst. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding a workaround to FP8 training on Praxis. Signed-off-by:
Ming Huang <mingh@nvidia.com> --------- Signed-off-by:
Ming Huang <mingh@nvidia.com>
-
- 13 May, 2023 1 commit
-
-
Neta Zmora authored
* Dynamically-generated causal attention mask (for ONNX export) TE's default causal mask is square (seq_len, seq_len) and is dynamically allocated for different sequence sizes. Dynamic allocation and dictionary lookups are not supported by ONNX. GPT's generative phase uses rectangular masks. This commit forces softmax to use `forward_torch_softmax` and to dynamically generate an attention mask when exporting to ONNX. The mask is generated without conditional control-flow, by generating a (k_seq_len, k_seq_len) mask and slicing it to (q_seq_len, k_seq_len). An alternate implementation is to pre-allocate a mask of shape (max_seq, max_seq) and to slice that. This solution is more performant at the expense of space, but the problem is that TE doesn't have a concept of max_seq. * Add to test_export_softmax a test for te.softmax.FusedScaleMaskSoftmax. * Add test_softmax_mask_fn to test that TE's default attention mask and the new ONNX-compatible mask produce the same behavior. * Add test_export_gpt_generation to test that the ONNX model can correctly handle inputs with different shapes and that the attention mask is adjusted on-the-fly for different sequence lengths. Misc: * Add a PRNG seeding fixture for more stability in tests. * Add dynamic shapes for ONNX input/output tests. * Allow validate_result to compare ORT output to pre-computed TE outputs. Signed-off-by:
Neta Zmora <nzmora@nvidia.com> * Add NVTE_ONNX_KVCACHE_MAX_SEQ_LEN for efficient text-generation in inference * Introduce an environment variable (NVTE_ONNX_KVCACHE_MAX_SEQ_LEN) to set the maximum sequence length. In ONNX inference with KV-cache optimizations for GPT text generation, the attention mask shape can be square (context phase) or rectangular (generation phase). When exporting to ONNX and this variable is set, TE preallocates an upper triangular (k=1) matrix with a size as prescribed by the variable, and dynamically slices the mask to the required shape. TE models can be exported to ONNX when NVTE_ONNX_KVCACHE_MAX_SEQ_LEN is not configured, but then the attention masking is always square and not suited to efficient text generation. * Work around a torch.onnx.export bug that incorrectly folds layer_norm(data, scale=add(gamma,1)) to layer_norm(data, scale=gamma) when we use LN with zero-centered gamma. * ONNX export tests * Add a fixture (seed_default_rng) to seed the PRNG * Add a fixture (set_max_seq_len) to set the max sequence length when exporting to ONNX for GPT text generation Signed-off-by:
Neta Zmora <nzmora@nvidia.com> * Fix linting errors Signed-off-by:
Neta Zmora <nzmora@nvidia.com> * Remove immutable default values from a couple of function signatures Signed-off-by:
Neta Zmora <nzmora@nvidia.com> * Add @skip_FP8 to test_export_gpt_generation Signed-off-by:
Neta Zmora <nzmora@nvidia.com> * Update transformer_engine/pytorch/softmax.py Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix CI error for softmax export Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Lint Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Neta Zmora <nzmora@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
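The mask trick described in the commit above — build a square (k_seq_len, k_seq_len) upper-triangular mask and slice it to (q_seq_len, k_seq_len) without conditional control flow — can be sketched in NumPy (a sketch of the idea only, not TE's exported graph):

```python
import numpy as np

def causal_mask(q_seq_len, k_seq_len):
    # Square upper-triangular (k=1) boolean mask: True marks positions
    # a query must NOT attend to.
    full = np.triu(np.ones((k_seq_len, k_seq_len), dtype=bool), k=1)
    # Keep the last q_seq_len rows. In the context phase (q == k) this is
    # the usual causal mask; in the generation phase (q < k) each of the q
    # new queries attends to all earlier keys, matching the KV-cache layout.
    return full[k_seq_len - q_seq_len:, :]
```

Because both the construction and the slice are plain tensor ops, the same code path exports cleanly to ONNX for either mask shape.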
-
- 12 May, 2023 3 commits
-
-
Jeng Bai-Cheng authored
bugfix for softmax lowering Signed-off-by: Ryan Jeng <rjeng@nvidia.com>
-
Kirthi Shankar Sivamani authored
* deterministic JIT warmup Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * review comments Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Kirthi Shankar Sivamani authored
* LayerNormMLP numeric test Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * DotProductAttention numeric test Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 11 May, 2023 1 commit
-
-
Przemek Tredak authored
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
-
- 10 May, 2023 2 commits
-
-
Kirthi Shankar Sivamani authored
Check input dimensions for SP Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Shriya Palsamudram authored
userbuffer pushsend/recv fix with atomicAdd_system Signed-off-by:
Sangkug Lym <slym@nvidia.com> Co-authored-by:
Sangkug Lym <slym@nvidia.com>
-
- 09 May, 2023 3 commits
-
-
Ming-Xu Huang authored
[JAX] Fix missing axes parameters in TransformerLayer and the wrong shape of bias in LayerNormMLP (#196) Fixed missing axes and wrong shape of bias in LayerNormMLP Signed-off-by: Ming Huang <mingh@nvidia.com>
-
Jeng Bai-Cheng authored
* add mp example Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * refactor Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * update doc-string Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * better FP8 checker Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * update readme Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * replace te.* with te.flax* to remove deprecated warning Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * remove unused os.environ Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * remove unused Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * fix typo Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * Update examples/jax/encoder/README.md Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Jeng Bai-Cheng <jeng1220@users.noreply.github.com> Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * Update examples/jax/encoder/README.md Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Jeng Bai-Cheng <jeng1220@users.noreply.github.com> Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * Update examples/jax/encoder/README.md Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Jeng Bai-Cheng <jeng1220@users.noreply.github.com> Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * Update examples/jax/encoder/README.md Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Jeng Bai-Cheng <jeng1220@users.noreply.github.com> Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * Update examples/jax/encoder/test_multiprocessing_encoder.py Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Jeng Bai-Cheng <jeng1220@users.noreply.github.com> Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * remove cuda-python Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * adjust readme Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * Update examples/jax/encoder/README.md Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * fix cpp lint: fix issue of "Could not find a newline character at the end of the file." Signed-off-by:
Jeng Bai-Cheng <jeng1220@users.noreply.github.com> Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * fix AssertionError: 1 GPU per process Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * replace tfds with datasets The Flax application crashes if it uses TensorFlow Datasets (tfds) in the NVIDIA JAX container. tfds is very useful for downloading well-known datasets (e.g., MNIST, GLUE) and is commonly used by the TF/JAX community. However, it seems that it is NOT compatible with NVIDIA TensorFlow in the NVIDIA JAX container and somehow affects JAX. It triggers random errors at JAX initialization depending on the version, making CI unstable. Thus, this commit replaces tfds with "huggingface datasets" to download the needed datasets. See "nvbugs 4039266" for more details. Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * fix input sharding Unlike SPMD mode, in multiprocessing mode the input tensor must be sharded manually. Using DP=4, TP=2 as an example, the device mesh looks like: mesh.device_ids = [[0, 1], [2, 3], [4, 5], [6, 7]] Assume that the process ID is mapped to the GPU ID. Processes 0 and 1 are grouped for model parallelism, processes 2 and 3 are grouped together too, and so on. Processes 0 and 1 need to share the same micro-batch in the training step, while processes 0, 2, 4, and 6 each get a different micro-batch. Thus, `shard_array_wrapper` partitions inputs into 4 parts (and sets up the needed arguments for jax.make_array_from_single_device_arrays). Processes 0 and 1 take the first quarter, processes 2 and 3 take the second quarter, and so on. Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * refactor UT for multiprocess example Use Python `multiprocessing` to test the multiprocessing example if the system has multiple GPUs, 1 GPU per process. Because `jax.distributed.initialize` must be called before any other JAX or Flax API, GPU info cannot be queried by calling jax.local_devices() in TestEncoder. Thus, `unittest_query_gpu()` forks another process to query the number of GPUs and FP8 capability. Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * remove unused arg `--num-gpu` JAX doesn't have an API to set the number of GPUs used in SPMD mode. The only way is to use `CUDA_VISIBLE_DEVICES` for now. Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * fix typo Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * fix ut Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * simplify the mask setting Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * increase batch-size for multigpu example The batch-size of 64 is too small to be partitioned for 8xH100. If the batch-size is 64, the GEMM shape is 256x8192x8 per GPU. The 8 is too small for the FP8 GEMM kernel, and cuBLASLt will throw "Failed to query heuristics". Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * fix downloading mnist error Downloading MNIST via `huggingface datasets` requires Pillow; otherwise it throws `An error occurred while generating the dataset` Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> --------- Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> Signed-off-by:
Jeng Bai-Cheng <jeng1220@users.noreply.github.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
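The sharding rule described in the "fix input sharding" bullet above — with DP=4 and TP=2, every process in the same tensor-parallel group consumes the same micro-batch — reduces to indexing the data-parallel shard by process_id // tp_size. A minimal sketch (names are illustrative; the real example feeds each slice to jax.make_array_from_single_device_arrays):

```python
def local_microbatch(batch, process_id, dp_size, tp_size):
    # Split the global batch into dp_size contiguous shards. Processes in
    # the same tensor-parallel group share a data-parallel rank, so they
    # receive the identical shard; different DP ranks get different shards.
    assert len(batch) % dp_size == 0
    shard = len(batch) // dp_size
    dp_rank = process_id // tp_size
    return batch[dp_rank * shard:(dp_rank + 1) * shard]
```

With 8 processes, processes 0 and 1 take the first quarter of the batch, processes 2 and 3 the second quarter, and so on, exactly as the commit describes.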
-
galagam authored
* ONNX export - input names fix * Fix discrepancies due to input names not being defined correctly / not passed to export * Refactor ORT input feed creation for simplicity * Control whether to save test IO files via an environment variable Signed-off-by:
Gal Hubara Agam <ghubaraagam@nvidia.com> * ONNX export test: minor refactor Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
galagam <96368689+galagam@users.noreply.github.com> --------- Signed-off-by:
Gal Hubara Agam <ghubaraagam@nvidia.com> Signed-off-by:
galagam <96368689+galagam@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-