- 25 Jun, 2024 1 commit
-
-
Xin Yao authored
* GroupedGEMM via multi-stream cublas * fix A/B is nullptr while D is not nullptr * add fp8 grouped gemm * register with TorchScript * add the GroupedLinear layer --------- Signed-off-by:
Xin Yao <xiny@nvidia.com> Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Co-authored-by:
Jiang Shao <jiangs@nvidia.com> Co-authored-by:
Qi Zhang <qizhang@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com>
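A grouped GEMM, as named in the commit above, runs several independent matrix multiplications that may each have different shapes. The sketch below shows only those reference semantics in plain Python. The helper names `matmul` and `grouped_gemm` are hypothetical, the loop is sequential, and nothing here reflects how the commit dispatches work to cuBLAS or to multiple CUDA streams.

```python
def matmul(a, b):
    """Plain nested-list matrix product: (m x k) @ (k x n) -> (m x n)."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def grouped_gemm(a_list, b_list):
    """Multiply each (A_i, B_i) pair independently; shapes may differ per group."""
    assert len(a_list) == len(b_list), "need one B matrix per A matrix"
    return [matmul(a, b) for a, b in zip(a_list, b_list)]


# Two groups with different shapes: (1 x 2) @ (2 x 1) and (2 x 2) @ (2 x 2).
outputs = grouped_gemm(
    [[[1, 2]], [[1, 0], [0, 1]]],
    [[[3], [4]], [[5, 6], [7, 8]]],
)
```

Because the groups do not depend on one another, they can in principle be issued concurrently, which is presumably what the multi-stream approach exploits.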
-
- 18 Jun, 2024 4 commits
-
-
Tim Moon authored
Release GIL in PyTorch pybind11 functions Signed-off-by: Tim Moon <tmoon@nvidia.com>
-
Charlene Yang authored
* simplify offset tensors Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fixes; tests pass Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix C lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * replace with_offset with with_padding Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * replace with_padding with padded Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fixes after merge Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fix for fused attn fwd/bwd calls Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Jax Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * adjust spacing in docstring Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix pytorch tests; fix paddle api Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix attn_biases Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix AttnFuncWithCP backward Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix jax Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix attn with CP Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix paddle Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Charlene Yang authored
fix tp_initialized error Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
Kirthi Shankar Sivamani authored
* Remove optional UB build leftovers Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * rm unused import Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 17 Jun, 2024 1 commit
-
-
Sangkug Lym authored
* Add the option to use SM for P2P comm in TP overlap Signed-off-by:
Sangkug Lym <slym@nvidia.com> * cleanup Signed-off-by:
Sangkug Lym <slym@nvidia.com> * Python formatting with black Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Format C++ with clang-format Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update transformer_engine/pytorch/csrc/comm_gemm_overlap.h Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> --------- Signed-off-by:
Sangkug Lym <slym@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 15 Jun, 2024 1 commit
-
-
Charlene Yang authored
* subclass DPA with BaseModule and test with test_gpt_checkpointing Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * test DPA only Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * test save and load Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove debug info Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor tweaks Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor tweak Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add hook in case core_attention._extra_state is missing Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * check named buffers in BaseModule; remove FP8 scratchpad override function; test FP8 for sm90+ Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fixes: test size, interval in recipe, named_buffer loop Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * move BaseModule from FusedAttention to DPA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 14 Jun, 2024 4 commits
-
-
Pavel Shamis (Pasha) authored
* A hot fix to disable CE deadlock check Signed-off-by:
Pavel Shamis (Pasha) <pasharesearch@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Pavel Shamis (Pasha) <pasharesearch@gmail.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Kirthi Shankar Sivamani authored
* Apply formatting Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Apply formatting Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Kirthi Shankar Sivamani authored
* Initial config test Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * remove linters, fix clang-format Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix clang-format Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix clang-format Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Remove lint Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Adjust config Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * use config file Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * adjust pylintrc Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * pre-format fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Python only Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Add FA module Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Update CI configs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * CRLF -> LF Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * format Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * revert accidental formatting changes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * try with sudo Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * cpp formatting Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix pylint error properly Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * some review comments Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * lint fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * add fp8 attn include in the correct file Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * autofix PRs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Charlene Yang authored
* add attention docs Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: update attention doc Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: update attention doc Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: update attention doc Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: update attn doc Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: update attn doc Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: update attn doc Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: update attention doc Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * first draft Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor tweak to first draft Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up pictures Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * first draft for review Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fixes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add logging info/debug Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fix of an SWA message Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * use subprocess instead of os.sys Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up benchmark script Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add example script and update notebook Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor tweak Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor tweaks Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix Jax/Paddle related comments Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * rerun H100 benchmark Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * restrict fp8 tests to sm90+ Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * move get_cudnn_version from common to pytorch utils Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
- 13 Jun, 2024 4 commits
-
-
Alp Dener authored
* added DL framework callbacks for bootstrapping userbuffers without MPI Signed-off-by:
Alp Dener <adener@nvidia.com> * removed userbuffers availability check in TE modules since userbuffers is now always compiled Signed-off-by:
Alp Dener <adener@nvidia.com> * added comm+GEMM overlap example with LayerNormMLP Signed-off-by:
Alp Dener <adener@nvidia.com> * linting and review fixes Signed-off-by:
Alp Dener <adener@nvidia.com> * linting and review fixes Signed-off-by:
Alp Dener <adener@nvidia.com> * added header guards Signed-off-by:
Alp Dener <adener@nvidia.com> * removed defunct userbuffers checks in build_utils and setup.py Signed-off-by:
Alp Dener <adener@nvidia.com> * added exposed API in modules/base.py to __all__ Signed-off-by:
Alp Dener <adener@nvidia.com> * removed transformer_engine/CMakeLists.txt and shifted all TE/common compile into transformer_engine/common/CMakeLists.txt Signed-off-by:
Alp Dener <adener@nvidia.com> --------- Signed-off-by:
Alp Dener <adener@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
BoxiangW authored
* Add norm_factor arg into DotProductAttention Signed-off-by:
Boxiang Wang <boxiangw@nvidia.com> * Change kwarg name from `norm_factor` to `softmax_scale` Signed-off-by:
Boxiang Wang <boxiangw@nvidia.com> * Change all norm_factor representation into softmax_scale Signed-off-by:
Boxiang Wang <boxiangw@nvidia.com> * Update transformer_engine/pytorch/attention.py Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * Update attention.py changing typo Signed-off-by:
BoxiangW <45734921+BoxiangW@users.noreply.github.com> --------- Signed-off-by:
Boxiang Wang <boxiangw@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
BoxiangW <45734921+BoxiangW@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
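The `softmax_scale` keyword introduced in the commit above is the factor applied to the query-key scores before the softmax. Below is a minimal plain-Python sketch for a single query vector. The function name `attention_weights` is hypothetical, and the fallback of `1 / sqrt(head_dim)` when no scale is given is the conventional choice, assumed here and not taken from the commit.

```python
import math


def attention_weights(query, keys, softmax_scale=None):
    """Return softmax(softmax_scale * <query, key_i>) over all keys."""
    if softmax_scale is None:
        # Assumed default: the usual 1/sqrt(d) attention scaling.
        softmax_scale = 1.0 / math.sqrt(len(query))
    scores = [
        softmax_scale * sum(q * k for q, k in zip(query, key)) for key in keys
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


keys = [[1.0, 0.0], [0.0, 1.0]]
uniform = attention_weights([1.0, 0.0], keys, softmax_scale=0.0)
sharp = attention_weights([1.0, 0.0], keys, softmax_scale=4.0)
```

A scale of zero flattens the distribution to uniform, while a larger scale concentrates weight on the best-matching key.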
-
Alp Dener authored
reverting autocast API back to PyTorch v2.3.1 and below Signed-off-by: Alp Dener <adener@nvidia.com>
-
Xin Yao authored
* expose multi_tensor_* kernels Signed-off-by:
Xin Yao <xiny@nvidia.com> * fix lint Signed-off-by:
Xin Yao <xiny@nvidia.com> --------- Signed-off-by:
Xin Yao <xiny@nvidia.com>
-
- 12 Jun, 2024 3 commits
-
-
Sudhakar Singh authored
skip switching to nvfuser for torch >= 2.2 Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
-
Alp Dener authored
added @torch._disable_dynamo; fixed deprecation warnings with torch autocast API for TE checkpoint Signed-off-by: Alp Dener <adener@nvidia.com>
-
Alp Dener authored
restricted fsdp asserts on primary fp8 weights to TE modules Signed-off-by: Alp Dener <adener@nvidia.com>
-
- 10 Jun, 2024 2 commits
-
-
Xiaowei Ren authored
* add seq_offsets_qkvo for cudnn thd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add seq_offsets_qkvo to AttnFuncWithCP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix seq_offsets calculation of cudnn thd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove a thd assert Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix bias for thd test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add thd test for cudnn FA with CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * skip GQA/MQA test for cuDNN THD Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * make sure seq_offsets are computed with qkv_group of hd_hd_hd while CP>1 Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix seq_offsets inputs Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove two comments Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix attn mask type for cudnn thd with cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix attn_mask_type check Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix attn_mask_type for cudnn fa with thd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix a typo Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix out dout in bwd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * assert cudnn+thd does not support attn bias Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * check if attn_mask_type has padding Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * change cp test batch size to 2 Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix code format Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix two assert info Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix assert comment Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix assert comments Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix assert comments Signed-off-by:
Xiaowei Ren <xren@nvidia.com> --------- Signed-off-by:
Xiaowei Ren <xren@nvidia.com> Co-authored-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
Tim Moon authored
* Avoid select operation in cast-transpose extension Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Avoid select operation in cast-transpose-dbias extensions Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Avoid select op in LayerNorm and RMSNorm Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix linter errors Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com>
-
- 07 Jun, 2024 2 commits
-
-
Alp Dener authored
* New TE wrapper for PyTorch FullyShardedDataParallel to make TE modules distribute their activations after the forward pass and gather them before the backward pass Signed-off-by:
Alp Dener <adener@nvidia.com> * simplified TE module setup for FSDP comms Signed-off-by:
Alp Dener <adener@nvidia.com> * FSDP scatter/gather for tensors saved into autograd ctx now working for base TE modules Signed-off-by:
Alp Dener <adener@nvidia.com> * make sure activation recompute disables FSDP scatter/gather Signed-off-by:
Alp Dener <adener@nvidia.com> * make sure Fp8 weight buffers are sharded at the end of the backward pass and gathered before forward Signed-off-by:
Alp Dener <adener@nvidia.com> * Fixed typo in attribute name Signed-off-by:
Alp Dener <adener@nvidia.com> * fixed bug in finding FSDP-wrapped TE modules Signed-off-by:
Alp Dener <adener@nvidia.com> * fixed typo in fp8 weight tensor name Signed-off-by:
Alp Dener <adener@nvidia.com> * fixed incorrect # of gradients Signed-off-by:
Alp Dener <adener@nvidia.com> * Added fp8 amax gradient hook tensor to the parameter reset Signed-off-by:
Alp Dener <adener@nvidia.com> * get rid of erroneous dummy tensor leftover from incorrect rebase Signed-off-by:
Alp Dener <adener@nvidia.com> * Linting fixes Signed-off-by:
Alp Dener <adener@nvidia.com> * fixing git snafu and removing debug statements Signed-off-by:
Alp Dener <adener@nvidia.com> --------- Signed-off-by:
Alp Dener <adener@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Kirthi Shankar Sivamani authored
* Remove interval arg from recipe Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Remove usage of interval and use explicit kwarg for testing recipes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 06 Jun, 2024 1 commit
-
-
Kirthi Shankar Sivamani authored
Cleanup Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 03 Jun, 2024 1 commit
-
-
Tim Moon authored
* Modify CUDA graph tests to use grad accumulation steps Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Initialize grad buffers before capturing CUDA graph in CUDA graph tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Only use BS=2 in CUDA graph tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Update tests/pytorch/test_cuda_graphs.py Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 31 May, 2024 1 commit
-
-
Tim Moon authored
Replace int8_t in PyTorch extensions with int64_t Signed-off-by: Tim Moon <tmoon@nvidia.com>
-
- 30 May, 2024 3 commits
-
-
Charlene Yang authored
* add THD support Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add seq_offsets_o and use new offset calculation Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * addition to previous commit; fix unit test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add None for offset_o gradient Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: test padding between sequences Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: fix tests for padding between sequences Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix tests for sbhd/bshd layouts; clean up Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update cudnn-frontend and add tests for max_seqlen_q=1 and d=256 for inference Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * test sbhd/bshd layouts for sq1, d256 inference case Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * replace wording from accumulative to cumulative Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add offset tensors to custom fp8 mha tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add version control for cuDNN Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add sm>=90 constraint for thd support Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix cuDNN support for sq=1, d=256 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint and minor tweak for fp8 tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * modify cudnn version and restrict MQA/GQA support for THD Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add notes for seq offset tensors Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add dummy tensor to pass jax build Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add dummy tensor to pass paddle build Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix Jax CI Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Signed-off-by:
cyanguwa <8636796+cyanguwa@users.noreply.github.com>
-
Xin Yao authored
* add multi-tensor kernels Signed-off-by:
Xin Yao <xiny@nvidia.com> * add FusedAdam Signed-off-by:
Xin Yao <xiny@nvidia.com> * add test to qa Signed-off-by:
Xin Yao <xiny@nvidia.com> * add FusedSGD Signed-off-by:
Xin Yao <xiny@nvidia.com> * fix lint Signed-off-by:
Xin Yao <xiny@nvidia.com> --------- Signed-off-by:
Xin Yao <xiny@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
Tim Moon authored
* Initial refactor of FP8 workspaces in Linear module Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove extra kernel launch Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Minor perf optimizations: Tensor base class functions in Float8Tensor have significant overhead. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug FP8 recipe test Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Refactor FP8 workspaces in LayerNormLinear and LayerNormMLP Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Document FP8 workspace function Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Revert changes to FP8 recipe tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add support for lazy FP8 transpose caching: the previous caching behavior (always fill cache) incorrectly filled the cache during CUDA graph warmup steps. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix Pylint warnings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug ONNX export: ONNX FP8 cast ops assumed that FP8 scales were created during model export (i.e. not initialized during training). Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug fused attention tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Make sure Float8Tensor.transpose_2d is backward compatible Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Revert changes to ONNX export operations: work around ONNX test failures by filling FP8 scale tensors instead of copying into them. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug scale factor update in Float8Tensor transpose_2d Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com>
-
- 29 May, 2024 1 commit
-
-
Tim Moon authored
Make sure RoPE frequencies are in FP32 Signed-off-by: Tim Moon <tmoon@nvidia.com>
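The commit message above does not state its motivation. One plausible reason to keep RoPE frequencies in FP32 is that half precision cannot represent large position indices exactly, so the rotation angle drifts at long sequence lengths. The sketch below illustrates that effect with the standard library only. The helpers `to_fp16` and `rope_angle` are hypothetical, the base of 10000 is the conventional RoPE value assumed here, and Python's native float stands in for the higher-precision path.

```python
import struct


def to_fp16(value):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("e", struct.pack("e", value))[0]


def rope_angle(position, dim_pair, head_dim, base=10000.0, half=False):
    """Rotation angle position * base^(-2 * dim_pair / head_dim)."""
    inv_freq = base ** (-2.0 * dim_pair / head_dim)
    if half:
        # Emulate doing every step in half precision.
        return to_fp16(to_fp16(float(position)) * to_fp16(inv_freq))
    return position * inv_freq


# For dim_pair 0 the inverse frequency is 1, so the angle equals the position.
exact = rope_angle(4097, 0, 64)
rounded = rope_angle(4097, 0, 64, half=True)
```

Half precision has an 11-bit significand, so integers above 2048 are no longer all representable, and neighbouring positions can collapse to the same angle.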
-
- 25 May, 2024 1 commit
-
-
Paweł Gadziński authored
* Fixed Llama tutorial. Changed batch size and added fused=True. Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> * Tutorial updated but not complete yet. Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> * Tutorial notebook reseted - removed fuse=true Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> * Removed fused=true Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> * Batch size back to 8 Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> * Typo and commented out line Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> * fixed whitespace Signed-off-by:
root <root@ipp2-0037.nvidia.com> * fixed whitespace Signed-off-by:
root <root@ipp2-0037.nvidia.com> * Added comment to attention line. Fixed potential bug with loading weights - now loading works correctly, confirmed by the generation code. Signed-off-by:
root <root@ipp2-1661.nvidia.com> * Comments Signed-off-by:
root <root@ipp2-1661.nvidia.com> * Models cast added again Signed-off-by:
root <root@ipp2-1661.nvidia.com> * Weight download info Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Moved parameter gate_proj_size to config Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * gate_proj_size removed and put immediate_size instead Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Llama 3 added to tutorial Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Typos fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Typos fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Fixed model loading Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Loading fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Different dim for attention Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Reversed other commit Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Changed name to kv_channels Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Fixed typo Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Back to kv_channels in transformer layer Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Back to kv_channels in transformer layer Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Small bug fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Small bug fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Test fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * changed file modes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * lint fix and resolved conflict Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * lint fix and resolved conflict Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Lint fix, hopefully last Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> Signed-off-by:
root <root@ipp2-1661.nvidia.com> Co-authored-by:
root <root@ipp2-2373.nvidia.com> Co-authored-by:
root <root@ipp2-1588.nvidia.com> Co-authored-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
root <root@ipp2-0037.nvidia.com> Co-authored-by:
root <root@ipp2-1661.nvidia.com> Co-authored-by:
root <root@ipp2-2371.nvidia.com> Co-authored-by:
root <root@ipp2-1589.nvidia.com> Co-authored-by:
Sudhakar Singh <sudhakars@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 22 May, 2024 1 commit
-
-
Alp Dener authored
TE checkpoint now preserves the torch autocast context from the forward pass during the recompute phase Signed-off-by: Alp Dener <adener@nvidia.com>
-
- 21 May, 2024 2 commits
-
-
Alp Dener authored
replaced deprecated pkg_resources with packaging Signed-off-by: Alp Dener <adener@nvidia.com>
-
Pavel Shamis (Pasha) authored
-
- 20 May, 2024 1 commit
-
-
Paweł Gadziński authored
* Calibration fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Lint fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
Pawel Gadzinski <pgadzinski@nvidia.com>
-
- 17 May, 2024 1 commit
-
-
Charlene Yang authored
* fix inconsistency for attn mask; now True means participating in attn Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix sliding window window_size for decoder+padding combination Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert paddle changes regarding mask Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert softmax to 1-mask;0-keep Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * enforce 1-mask out; 0-keep rule for jax masks Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix jax lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert pytorch mask changes; some kept in tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert to jax fused attn on main Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * inverse mask logic for get_cu_seqlens/_and_indices in PyTorch implementation and mask generation in unit tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * temporarily disable update_weight_scale_inv Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * enforce window_size for decoder Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add docstring for mask definition 1-mask out;0-keep Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add aux_ctx_tensors to save_for_backward Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * tweak make_decoder_mask and make_mask in jax tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * skip dBias for shapes other than 1HSS; otherwise dq/dk/dv NaNs Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * expand attn_biases from list to variables in save_for_backward Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix use of variable before assignment in jax dact_lu Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove window size definition for decoder Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add change notes in README for padding mask in PyTorch Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * tweak padding mask notes in README Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * expand list to tensors for save_for_backwards Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Signed-off-by:
cyanguwa <8636796+cyanguwa@users.noreply.github.com>
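The commit above settles on a padding-mask convention of "1 means mask out, 0 means keep" and adjusts the `get_cu_seqlens` logic to match. The sketch below shows how cumulative sequence lengths follow from a mask under that convention. The function `cu_seqlens_from_padding_mask` is a hypothetical plain-Python stand-in, not the library's implementation, and it works on nested lists of booleans instead of tensors.

```python
def cu_seqlens_from_padding_mask(mask):
    """Cumulative valid-token counts for a batch.

    mask[b][s] is True where token s of sequence b is padding (masked out)
    and False where the token takes part in attention.
    """
    cu_seqlens = [0]
    for row in mask:
        valid_tokens = sum(1 for is_masked in row if not is_masked)
        cu_seqlens.append(cu_seqlens[-1] + valid_tokens)
    return cu_seqlens


# Two sequences padded to length 3, with 2 and 1 real tokens respectively.
padding_mask = [
    [False, False, True],
    [False, True, True],
]
cu = cu_seqlens_from_padding_mask(padding_mask)
```

Under the opposite convention (True means "participates"), the same code would count padding instead of real tokens, which is why the mask definition has to be fixed in one place.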
-
- 16 May, 2024 1 commit
-
-
Phuong Nguyen authored
* added squared relu in te-torch Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 13 May, 2024 1 commit
-
-
Kunlun Li authored
Signed-off-by:
kunlunl <kunlunl@nvidia.com> Co-authored-by:
cyanguwa <8636796+cyanguwa@users.noreply.github.com>
-
- 09 May, 2024 1 commit
-
-
Kirthi Shankar Sivamani authored
Bump FA version to 2.5.8 Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 02 May, 2024 1 commit
-
-
cyanguwa authored
* initialize tp_group for FP8 DPA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix cuDNN version in unit tests for cuDNN v9 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add hook to ignore missing fused_attn._extra_states if training from old checkpoints Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove test and redundant implementation from last commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove warning message and replace with docstring Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove tp_size/tp_group in FusedAttention; amax reduction is handled with fp8_group Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * move core_attention.fused_attention._extra_state to core_attention._extra_state Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * simplify post_state_dict_hooks between FU and DPA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add temporary test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove previous attempts to move core_attention.fused_attention to core_attention; keep the test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove the test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * disable pylint self arg for hook which is required by hook Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Signed-off-by:
cyanguwa <8636796+cyanguwa@users.noreply.github.com>
-
- 01 May, 2024 1 commit
-
-
Jinze Xue authored
* Handle the scaling factor when amax is so tiny that it leads to an infinite scale Signed-off-by:
Jinze Xue <jinzex@nvidia.com> * revert formatting changes Signed-off-by:
Jinze Xue <jinzex@nvidia.com> * fix comments Signed-off-by:
Jinze Xue <jinzex@nvidia.com> * Apply review suggestion Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Jinze Xue <155670984+jinzex@users.noreply.github.com> * Apply review suggestion Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Jinze Xue <155670984+jinzex@users.noreply.github.com> * Apply review suggestion Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Jinze Xue <155670984+jinzex@users.noreply.github.com> * apply review suggestion Signed-off-by:
Jinze Xue <jinzex@nvidia.com> * add test_recipe.py to qa/L0_pytorch_unittest/test.sh; fix unittest for is_first_microbatch=False Signed-off-by:
Jinze Xue <jinzex@nvidia.com> * revert changes to update_weight_scale_inv Signed-off-by:
Jinze Xue <jinzex@nvidia.com> * Debug test failures Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Jinze Xue <jinzex@nvidia.com> Signed-off-by:
Jinze Xue <155670984+jinzex@users.noreply.github.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Jinze Xue <jinzex@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com>
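The first item in the commit above concerns FP8 scaling factors that overflow when amax is extremely small. The sketch below illustrates the failure mode and one way to guard against it, under stated assumptions. The function `compute_scale` is hypothetical, the formula `fp8_max / amax / 2**margin` is a common delayed-scaling form assumed here, and clamping to the largest finite FP32 value is one possible remedy, not necessarily what the commit implements. Python floats are 64-bit, so the FP32 overflow is emulated with an explicit comparison.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value of the E4M3 format
FLOAT32_MAX = 3.4028234663852886e38  # largest finite FP32 value


def compute_scale(amax, prev_scale, fp8_max=FP8_E4M3_MAX, margin=0):
    """Return a finite scaling factor for the given amax.

    Falls back to the previous scale when amax carries no information,
    and clamps the result when it would overflow FP32.
    """
    if amax <= 0.0 or not math.isfinite(amax):
        return prev_scale
    scale = (fp8_max / amax) / (2.0 ** margin)
    if scale > FLOAT32_MAX:
        # In FP32 this division would have produced inf.
        return FLOAT32_MAX
    return scale


normal = compute_scale(448.0, prev_scale=1.0)
tiny = compute_scale(1e-40, prev_scale=1.0)
empty = compute_scale(0.0, prev_scale=2.0)
```

Without the clamp, an infinite scale would turn every nonzero value into inf (and zeros into NaN) once multiplied in.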
-