- 12 Jun, 2024 1 commit
-
-
Oleg Goncharov authored
* Merged CT+dbias+dact into a single template Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Moved gated activations from the cast_transpose_fused into a separate cpp file Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Code clean up Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Code clean up Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Code clean up Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Code clean up Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Update transformer_engine/common/transpose/cast_transpose_fusion.cu Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com> * Update transformer_engine/common/transpose/cast_transpose_fusion.cu Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com> * Reverted the change with the file split Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Implemented JIT compiled kernels Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Code clean up Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Replaced aligned statically compiled kernels with JIT kernels. Added support for various activation functions in JIT kernels. Cleaned up the code per the code review Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Code clean up Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> --------- Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> Signed-off-by:
Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 05 Jun, 2024 1 commit
-
-
Oleg Goncharov authored
* Merged CT+dbias+dact into a single template Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Moved gated activations from the cast_transpose_fused into a separate cpp file Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Code clean up Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Code clean up Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Code clean up Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Code clean up Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Update transformer_engine/common/transpose/cast_transpose_fusion.cu Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com> * Update transformer_engine/common/transpose/cast_transpose_fusion.cu Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com> * Reverted the change with the file split Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> --------- Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> Signed-off-by:
Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 13 May, 2024 1 commit
-
-
Phuong Nguyen authored
* renamed gelu to act * added relu, srelu, qgelu * fixes initialization for layernorm_fp8_mlp tests * moved activation_fp8 prim into testunit file * Moved NVTE_Activation_Enum to common/.../activation.h --------- Signed-off-by:Phuong Nguyen <phuonguyen@nvidia.com>
-
- 03 May, 2024 1 commit
-
-
Phuong Nguyen authored
* templated primitives and respective C++ functions Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * fixes for LayerNormMLP, tests in test_custom_compute all passed Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * added default arg for pybind get_workspace_size funcs Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * fixes for TestTransFormer with non-gated act tests Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * renamed gelu to act Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * improved enum implementation, avoid using magic numbers Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * Exposed C++ ActivationEnum to python side Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * Changed error messages Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * changed conditional check on input shape for dbias_cast_transpose Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * changed dtype (tol) for bias grad tests Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * fixes so that layer_norm_fp8_mlp can take bias = None Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * Set bias = None in flax modules Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 24 Apr, 2024 2 commits
-
-
Phuong Nguyen authored
* Implemented swiglu and silu Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * Renamed nvte-*silu to nvte-*swish + generalized GetDBiasDact functions Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
Phuong Nguyen authored
* combined layernorm_geglu with layernorm_gelu into fused_layernorm Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * fixes to pass all unit tests in test_custom_call_compute.py, test_layer.py, and test_praxis_layer.py Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * cleaning and formatting Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * renaming based on reviewers suggestions Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * implemented partial fused layernorm Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * geglu + bias passed tests Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * added partial fused calculation for dbias_1 Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * clean up Co-authored-by:
Alp Dener <adener@nvidia.com> Signed-off-by:
Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Signed-off-by:
Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com> Co-authored-by:
Alp Dener <adener@nvidia.com>
-
- 19 Apr, 2024 1 commit
-
-
Tim Moon authored
* Add NVRTC kernels for cast-transpose Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Update copyright year Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add noop flag to NVRTC cast-transpose kernel Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Apply suggestions from code review Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 12 Apr, 2024 1 commit
-
-
Kirthi Shankar Sivamani authored
* FP8 cuda graphs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Charlene Yang <charleney@nvidia.com> * Fix numerics Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * exclude torch compile from numerics tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * More numerics fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix CI Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * rm fusion from unfused path Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Charlene Yang <charleney@nvidia.com>
-
- 02 Feb, 2024 1 commit
-
-
Ming-Xu Huang authored
* Adding support of sequence parallelism Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding RoPE Signed-off-by:
Ming Huang <mingh@nvidia.com> * Fix wrong batch_logical_axes Signed-off-by:
Ming Huang <mingh@nvidia.com> * Renaming FSDP outer env var Signed-off-by:
Ming Huang <mingh@nvidia.com> * Porting RoPE to Praxis layers. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Porting GeLU + [FP8 Cast]. Signed-off-by:
Ming Huang <mingh@nvidia.com> * WAR to make XLA successfully match FP8 GEMM on FFN1 with GeLU. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Allowing arbitrary dimension of NVShape for the workspace allocation Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding checkpoint_name to fused functions of mlp.py to get better perf with nn.scan. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Modify with review feedback. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Fix bugs Signed-off-by:
Ming Huang <mingh@nvidia.com> * Fix typo. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Fixed for lint Signed-off-by:
Ming Huang <mingh@nvidia.com> * Follow review feedback to modify code. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Fix typo. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Port SP to Praxis Signed-off-by:
Ming-Xu Huang <mingh@nvidia.com> * Fix an issue when enabling both GQA and RoPE. Signed-off-by:
Ming-Xu Huang <mingh@nvidia.com> * Update docs Signed-off-by:
Ming-Xu Huang <mingh@nvidia.com> --------- Signed-off-by:
Ming Huang <mingh@nvidia.com> Signed-off-by:
Ming-Xu Huang <mingh@nvidia.com>
-
- 03 Jan, 2024 1 commit
-
-
Przemyslaw Tredak authored
Signed-off-by:Przemek Tredak <ptredak@nvidia.com>
-
- 25 Aug, 2023 1 commit
-
-
cyanguwa authored
fix rng_state issue and minor compiler warning Signed-off-by:Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
- 13 Jun, 2023 1 commit
-
-
Przemyslaw Tredak authored
* Added ReLU and GLU variants to common Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * pyTorch changes Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * PyTorch C++ lint Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Bug fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * More fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix storage errors Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Compute bgrad Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix numerical tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix ONNX export tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Review comments Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 19 May, 2023 1 commit
-
-
Tim Moon authored
* Initial implementation of NVRTC infrastructure Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Initial NVRTC impl for transpose NVRTC gives compilation errors at runtime. Everything else compiles and passes tests as expected. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug NVRTC transpose impl NVRTC kernel compiles, runs, and passes tests with FP32. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use variadic template for kernel arguments in RTC kernel launch func Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Refactoring Added utility header for CUDA Runtime API. Optimized concat_strings function. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add helper function for regex substitutions in strings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add option to disable NVRTC support Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add support for header includes in NVRTC kernels Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Access lazily-initialized CUDA driver lib and add option to specify CUDA header dir Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Configure NVRTC transpose kernel with simple perf model Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Revert change to tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Style fixes Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add prime-valued test cases Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix multiple definition error Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Optimize NVRTC transpose kernel for small data sizes Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Mention NVRTC in docs Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add unit tests for NVRTC and string utils Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add comment in install docs about NVRTC Review suggestion from @nouiz Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug perf model for RTC transpose kernel Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove NVRTC discussion from docs Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Require CUDA headers unless NVRTC is explicitly disabled Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use diagonal coords in transpose kernel to avoid partition camping Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use std::call_once for thread-safety Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Minor fixes Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug CMake error Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove unnecessary call_once Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove diagonal coordinates from transpose kernel Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use size_t indices instead of int Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Review suggestions from @ptrendx Check build-time CUDA include path for run-time CUDA headers. Handle case where CUDA context is initially uninitialized. Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com>
-
- 21 Mar, 2023 1 commit
-
-
vasunvidia authored
* Initial commit for fp8_transpose_dbias kernel Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * lint fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Suggestions and fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 17 Mar, 2023 1 commit
-
-
Tim Moon authored
Signed-off-by:Tim Moon <tmoon@nvidia.com>
-
- 17 Jan, 2023 1 commit
-
-
Kirthi Shankar Sivamani authored
* Move scale inverse calculation to framework Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * cleanup Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix RMSNorm Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix gated kernel/geglu Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 12 Jan, 2023 1 commit
-
-
Przemyslaw Tredak authored
* Add NVTX to TE modules Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fix Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fix pylint Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fix NVTX in _prepare_backward Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Add NVTX to C API Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fix cpplint and link nvToolsExt Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Add NVTX to GeGlu Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Signed-off-by:
Przemek Tredak <ptredak@nvidia.com>
-
- 10 Jan, 2023 1 commit
-
-
zlsh80826 authored
* Add GeGLU and DGeGLU Signed-off-by:
Reese Wang <rewang@nvidia.com> * Add DGeGLUCT Signed-off-by:
Reese Wang <rewang@nvidia.com> * Update copyright year Signed-off-by:
Reese Wang <rewang@nvidia.com> * Refine shape check Signed-off-by:
Reese Wang <rewang@nvidia.com> * Code refine Signed-off-by:
Reese Wang <rewang@nvidia.com> Signed-off-by:
Reese Wang <rewang@nvidia.com>
-
- 03 Jan, 2023 1 commit
-
-
Przemyslaw Tredak authored
Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Signed-off-by:
Przemek Tredak <ptredak@nvidia.com>
-
- 08 Dec, 2022 1 commit
-
-
Przemyslaw Tredak authored
* Move the amax/scale/scale_inv into the TE Tensor struct. Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Handle multi_cast_transpose Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Changed softmax to new Tensor Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * First pass at the cpp tests Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Round of fixes Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fix multi_cast_transpose Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fix cast_to_fp8 Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Signed-off-by:
Przemyslaw Tredak <ptrendx@gmail.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 28 Nov, 2022 1 commit
-
-
Tim Moon authored
* Add kernel for multi-tensor cast-transpose Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix incorrect test function in multi-tensor cast-transpose unit test Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove std::vector from multi-tensor cast-transpose function signature Makes sure the main header is C-compatible. Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Przemyslaw Tredak <ptredak@nvidia.com>
-
- 08 Nov, 2022 1 commit
-
-
Przemyslaw Tredak authored
Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Signed-off-by:
Przemek Tredak <ptredak@nvidia.com>
-
- 28 Sep, 2022 1 commit
-
-
Przemek Tredak authored
Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Przemek Tredak <ptredak@nvidia.com>
-