- 06 Jun, 2024 1 commit
-
-
Kirthi Shankar Sivamani authored
Cleanup Signed-off-by:Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 30 May, 2024 1 commit
-
-
Tim Moon authored
* Initial refactor of FP8 workspaces in Linear module Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove extra kernel launch Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Minor perf optimizations Tensor base class functions in Float8Tensor have significant overhead. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug FP8 recipe test Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Refactor FP8 workspaces in LayerNormLinear and LayerNormMLP Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Document FP8 workspace function Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Revert changes to FP8 recipe tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add support for lazy FP8 transpose caching Previous caching behavior (always fill cache) incorrectly filled cache during CUDA graph warmup steps. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix Pylint warnings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug ONNX export ONNX FP8 cast ops assumed that FP8 scales were created during model export (i.e. not initialized during training). Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug fused attention tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Make sure Float8Tensor.transpose_2d is backward compatible Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Revert changes to ONNX export operations Work around ONNX test failures by filling FP8 scale tensors instead of copying into them. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug scale factor update in Float8Tensor transpose_2d Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com>
-
- 20 May, 2024 1 commit
-
-
Paweł Gadziński authored
* Calibration fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Lint fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
Pawel Gadzinski <pgadzinski@nvidia.com>
-
- 29 Apr, 2024 1 commit
-
-
Zhenhuan Liu authored
* Add support for MoE with FP8. Signed-off-by:
Dennis Liu <denliu@nvidia.com> * Fix unittest. Signed-off-by:
Dennis Liu <denliu@nvidia.com> * Fix error in linear backward. Signed-off-by:
Dennis Liu <denliu@nvidia.com> --------- Signed-off-by:
Dennis Liu <denliu@nvidia.com> Co-authored-by:
Przemyslaw Tredak <ptredak@nvidia.com>
-
- 22 Apr, 2024 1 commit
-
-
Tim Moon authored
* Remove unnecessary Pylint overrides Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fixes to lint Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 16 Apr, 2024 1 commit
-
-
cyanguwa authored
* WIP: fp8 v1 fprop integration Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: minor fixes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add debug info Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add more debug info Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fprop working for h1; w/ debug info Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: add bprop Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * cleanup; bprop running but has mismatches Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add gitlab frontend as submodule Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up and add back v0.9.2 FE support; fprop/bprop passing with 5e-2 tols Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix after merge; add bias_b/h to caching descriptor Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * distinguish fwd/bwd tensor types for bprop Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fix for F16 cases; include added dqkv_type and d_scale_dp Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * adjust out shape for bwd in test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add casting from/to FP8 to DPA module Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: bshd_bshd_bshd layout Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: support all sbhd/bshd layouts Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add qkvpacked and kvpacked support in both FusedAttnFunc and C levels Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove qkvpacked/kvpacked calls in DPA module (used for testing) Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove tp setup; add allow_non_contiguous; update FE; revert to sbh3d in tests; clean up Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add NVTE_FP8_DPA_BWD to control whether to use FP8 bwd or F16 bwd Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix MQA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix MQA/GQA in FP8 v1 API Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update FE to 705d8e3, with API change Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * test causal mask Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * restrict mha_fill for THD format Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix fused attn with CP and comment out is_alibi code Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up FE0.9 vs FE1.0 FP8 implementations, and related unit tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * change NVTE_FP8_DPA_BWD default to 1, and fix its use in qkvpacked/kvpacked APIs Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint and self.tp_size/group in FusedAttention() Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update FE to 6902c94 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add FP8 MHA support Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update to FE v1.3.0 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fixes for FP8 MHA with different configs Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * emit stats regardless of is_training Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix linear when input is not Float8Tensor Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix d_out type when f16 bprop Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix user buffer for layernorm_linear/linear and revert two FP8 casts in MHA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add docstring for fp8_dpa/mha in recipe Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix backend selection to avoid FA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * replace transpose with transpose_2d Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * use RMSE for FP8 unit tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * replace two more transpose with transpose_2d Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add FP8 initialization to FusedAttention Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * rm docs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Revert "add FP8 initialization to FusedAttention" This reverts commit 15fffd825d6f23f31ea709b16ba01dfd61efabf8. Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Change order of ctxs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * minor fixes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add back docs and mark as beta Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fixes for tests and docs Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 12 Apr, 2024 1 commit
-
-
Kirthi Shankar Sivamani authored
* FP8 cuda graphs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Charlene Yang <charleney@nvidia.com> * Fix numerics Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * exclude torch compile from numerics tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * More numerics fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix CI Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * rm fusion from unfused path Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Charlene Yang <charleney@nvidia.com>
-
- 06 Apr, 2024 2 commits
-
-
Sangkug Lym authored
fix the default userbuffer communicator init settings Signed-off-by:Sangkug Lym <slym@nvidia.com>
-
Jaemin Choi authored
* Enable DGRAD RS overlap Signed-off-by:
Jaemin Choi <jaeminc@nvidia.com> * fix lint; apply suggestions Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Jaemin Choi <jaeminc@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 04 Apr, 2024 1 commit
-
-
Sangkug Lym authored
* userbuffer fp8 reduction support for individual overlap Signed-off-by:
Sangkug Lym <slym@nvidia.com> * cleanup dict ub_cfg dict value load Signed-off-by:
Sangkug Lym <slym@nvidia.com> * cleanup Signed-off-by:
Sangkug Lym <slym@nvidia.com> * Remove unnecessary fence from producer From @erhoo82 Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Sangkug Lym <slym@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 03 Apr, 2024 1 commit
-
-
Sangkug Lym authored
* Atomic gemm for TP-AR and TP-RS overlap with P2P exchanges Signed-off-by:
Sangkug Lym <slym@nvidia.com> * FP8 reduction for atomic TP-RS with p2p exchange Signed-off-by:
Sangkug Lym <slym@nvidia.com> * Fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Sangkug Lym <slym@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 21 Mar, 2024 1 commit
-
-
Sangkug Lym authored
* TP-RS overlap with send/recv Atomic GEMM based TP-RS overlap with send/recv Signed-off-by:
Sangkug Lym <slym@nvidia.com> Specify userbuffer overlap method of each overlap instance Signed-off-by:
Sangkug Lym <slym@nvidia.com> P2P TP-RS overlap with fp8 GEMM outputs Signed-off-by:
Sangkug Lym <slym@nvidia.com> Fix TP-RS overlap with send/recv Signed-off-by:
Sangkug Lym <slym@nvidia.com> * cleanup Signed-off-by:
Sangkug Lym <slym@nvidia.com> * cleanup Signed-off-by:
Sangkug Lym <slym@nvidia.com> * linting Signed-off-by:
Sangkug Lym <slym@nvidia.com> * fix typo Signed-off-by:
Sangkug Lym <slym@nvidia.com> --------- Signed-off-by:
Sangkug Lym <slym@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 07 Mar, 2024 1 commit
-
-
Hongbin Liu authored
* add_dtype_for_userbuf Signed-off-by:
Hongbin Liu <hongbinl@nvidia.com> * Update transformer_engine/pytorch/module/base.py Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * Fix syntax Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix lint Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Hongbin Liu <hongbinl@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Hongbin Liu <hongbinl@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 01 Mar, 2024 1 commit
-
-
Kirthi Shankar Sivamani authored
* Avoid updating real during param cast Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Review comments Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 31 Jan, 2024 1 commit
-
-
Tim Moon authored
Do not allocate FP8 workspace buffers when params are FP8 Signed-off-by:Tim Moon <tmoon@nvidia.com>
-
- 23 Jan, 2024 1 commit
-
-
Alp Dener authored
* added missing parameter materialization on real device for LayerNorm and RMSNorm Signed-off-by:
Alp Dener <adener@nvidia.com> * added new unittest for deferred initialization and modified parameter materialization to support standalone execution outside of FSDP Signed-off-by:
Alp Dener <adener@nvidia.com> * restored tensor parallel attributes that were being wiped out by the parameter reset Signed-off-by:
Alp Dener <adener@nvidia.com> * fixed incorrect order of fp8 metadata initialization Signed-off-by:
Alp Dener <adener@nvidia.com> * added deferred init unittest to the QA script Signed-off-by:
Alp Dener <adener@nvidia.com> --------- Signed-off-by:
Alp Dener <adener@nvidia.com>
-
- 17 Jan, 2024 1 commit
-
-
Alp Dener authored
* Implemented deferred initialization via `device='meta'` option for te.Linear and added new PyTorch example to demonstrate its use with FullyShardedDataParallel execution. Signed-off-by:
Alp Dener <adener@nvidia.com> * correcting Float8Tensor initialization and fixing linting errors Signed-off-by:
Alp Dener <adener@nvidia.com> * removed duplicate code from upstream rebase, local tests passing Signed-off-by:
Alp Dener <adener@nvidia.com> * improved comments/documentation for FSDP example Signed-off-by:
Alp Dener <adener@nvidia.com> * converted reset_parameters() into a base module function Signed-off-by:
Alp Dener <adener@nvidia.com> * fixed Float8Tensor creation with deferred init, all tests passing locally Signed-off-by:
Alp Dener <adener@nvidia.com> * extended deferred initialization to all TE modules Signed-off-by:
Alp Dener <adener@nvidia.com> * fixed linting errors Signed-off-by:
Alp Dener <adener@nvidia.com> * removed unnecessary reference to the parent module of parameter, added clarifying comments in parameter reset Signed-off-by:
Alp Dener <adener@nvidia.com> --------- Signed-off-by:
Alp Dener <adener@nvidia.com>
-
- 08 Jan, 2024 1 commit
-
-
Tim Moon authored
* Refactor parameter split in Linear module Remove module state from noop_cat. Support arbitrary names in parameter split. Handle tensor parallelism. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Make noop_cat a standalone operation Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Update parameter splits in LayerNormLinear Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug case without bias Fix pylint complaints. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove unused import Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com>
-
- 03 Jan, 2024 1 commit
-
-
Przemyslaw Tredak authored
Signed-off-by:Przemek Tredak <ptredak@nvidia.com>
-
- 07 Dec, 2023 2 commits
-
-
cyanguwa authored
* Integrate cuDNN frontend v1 to fused attention and miscellaneous fixes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix jax/paddle for unit tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix jax/pytorch lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * simplify stride generation Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix and/or logic in get_backend Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix flag_max512 and test_numerics Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove v.contiguous() since get_qkv_layout covers it Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * skip fp8 tests for sm89 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * further fix jax CI Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix jax CI Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert mask type to comma-separated list Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix last two commits Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * integrate v1/pre-release-5 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * cleanup prerelease5 integration and fix FA2.1 commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * force dropout to 0 if not training Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix Jax CI Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * testing bias/alibi and padding+causal; add alibi to unfused DPA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * set flag_arb to false when non determinism is not allowed Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * followup on prev commit; remove redundant python env var setting Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: minor tweaks for tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * prepare for tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix determinism logic for fused attn Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add bias to bwd Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix gpt_checkpointing/dpa_accuracy problem Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix some seg fault issues Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add failure notes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove use of non-deter var for backend selection Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fix for lint and CI Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix workspace size in bwd and uncomment bias test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix get_alibi and remove check_support Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update tests status Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove workspace_opt from FADescriptor_v1 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * disable arbitrary backend + post scale bias in Jax; waiting on PR 525 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up bhsd order Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * swap bias/rng_state order in aux_ctx_tensor and add bias to aux_ctx_tensor in _qkvpacked/_kvpacked API Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove support for padding_causal + cross for max512 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * change alibi bias to float32 for bias_1_4/5 tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * further clean up tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix thd fwd output shape for FlashAttention and add backend info for DPA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix definition of workspace limit when dbias is present Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * further tweak DP_WORKSPACE_LIMIT definition Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * disallow alibi+no_mask for sdpa flash and update alibi tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update jax/paddle after PR525 and fix DP_WORKSPACE_LIMIT for dbias Jax tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * disable dbias for non-hopper archs Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix layernorm lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remode unused arg for lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove build dir in setup.py Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * change selection logic to prefer fused attn on sm90 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix distributed jax test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix h and s order in header Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update to cudnn fe v1 branch Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove manual setting of workopt path due to dbias after v1 update Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix paddle CI Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add post_scale_bias and alibi to sdpa flash support matrix Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix support matrix in header files Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * move headers back to .cu and change seed/offset to int64 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update Megatron commit in L1 test and remove all prints in fused attn test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix L1 Megatron test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix fp8 arg in L1 Megatron script Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * print only when debug flag is on Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove checkpointing loading to avoid loading other tests results Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Signed-off-by:
cyanguwa <8636796+cyanguwa@users.noreply.github.com>
-
Tim Moon authored
* Float8Tensor uses cached transpose if available Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix bug with non-2D transpose Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Custom pickling for Float8Tensor Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug test for pickling Float8Tensor Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix merge conflict Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Review suggestions from @sudhakarsingh27 Avoid FP8 casts when copying between Float8Tensors. Make make_like a class function. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add unit test for checkpointing model with FP8 params Debugged pickling and copy functions. Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com>
-
- 31 Oct, 2023 1 commit
-
-
Tim Moon authored
* Experimental FP8 tensor Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Sudhakar Singh <sudhakars@nvidia.com> Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Add fp8 tensor to ci test Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * review comments and tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Minor changes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Default to FP8 usage Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix docs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Naming changes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * minor fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix transpose caching Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Debug transpose caching Handle case where transpose cache is updated externally. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Rename FP8GlobalStateManager.with_fp8_parameters Signed-off-by:
Tim Moon <tmoon@nvidia.com> * remove Float8Tensor from import API Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Avoid caching FP8 transposes if not required Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix import error in FP8 tensor tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix tranpose caching and checkpointing bug Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Improve caching and fix distopt case Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Update transformer_engine/pytorch/float8_tensor.py Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * Remove recursive logic Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix cache reset bug Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Store FP8 attributes in dict Easier for multiple tensors to share, e.g. detached tensors. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Make sure scale_inv is 1D tensor Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Make sure scale_inv is 1D tensor Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fixes and detach recipe Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Set default fp8 data type Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Sudhakar Singh <sudhakars@nvidia.com> Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com>
-
- 23 Oct, 2023 1 commit
-
-
Kirthi Shankar Sivamani authored
* initial test fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Drop eval for selective checkpointing tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Remove redundant recompute for FA Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * CI fix; Decouple fused attention and numerics tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 17 Oct, 2023 1 commit
-
-
Kirthi Shankar Sivamani authored
Improve docs Signed-off-by:Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 10 Oct, 2023 1 commit
-
-
Kirthi Shankar Sivamani authored
Signed-off-by:Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 05 Oct, 2023 1 commit
-
-
vasunvidia authored
* Initial commit Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Repro for RS output mismatch with Single GEMM + Split pipelined RS Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * minor changes for AG->GEMM pipelined overlap Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Add Atomic Gemm cublasApi attributes and initial implementation of AG->Atomic GEMM Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * AtomicGemm+RS functional with workaround Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * add amax update to layernorm_linear for FP8 unit test accuracy Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Enable reducescatter2_userbuff_strided variants Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Bug fix Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * AG+AtomicGemm overlap functional but gemm doesnt overlap with comm Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Add userbuffers_sendrecv kernel variants Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * TransformerLayer API changes to enable AtomicGemm+RS overlap Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Code cleanup Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Code cleanup2 Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * [UB] AllGather Atomic GEMM overlap using userbuffer_sendrecv kernels Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Code cleanup + bug fix for multiatomic sendrecv kernel Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * cleanup Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Bug fixes Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * [UB] Add shuffling for better AG AtomicGEMM overlap Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Bug fix for AG AtomicGemm overlap Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Bug fix for multiAtomicAG and singleAtomicAG Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Use chunk_i+1 as recv_chunk for multiatomic_AG with shuffling Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Launch AtomicGEMM after first-chunk AG Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Rebase to main Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Add FP8 ReduceScatter kernels, AtomicGEMM+FP8 RS not functional Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Revert "Add FP8 ReduceScatter kernels, AtomicGEMM+FP8 RS not functional" This reverts commit 80a47a76355440cd5fb4314c96fe9fda632d87f9. Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Add support for NVLS-MC and FP8 Reduce Scatter Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Bug fix Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Atomic and Multiatomic FP8 RS functional Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Remove debug print Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * UB comm initialization hang fix Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Code cleanup Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Create new GEMM API for Atomic GEMM Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * CI ready Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * more fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * license Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Bug fix Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Revert NVLS-MC Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Check cu* versions for running atomic gemms Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * lint Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Cleanup Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Add experimental warning Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Better wording Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Add warning to c api Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix wording Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 27 Sep, 2023 1 commit
-
-
Kirthi Shankar Sivamani authored
Change deprecation warnings Signed-off-by:Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 23 Sep, 2023 1 commit
-
-
cyanguwa authored
* [PyTorch] Implement GQA based on fused q, k, v projection. Additionally fixes #392 Signed-off-by:
Markus Schnoes <markus.schnoes@gmx.de> * [PyTorch] Extend parameters_split option in Linear and LayerNormLinear to support splitting with different sizes as required by unfused GQA. Signed-off-by:
Markus Schnoes <markus.schnoes@gmx.de> * fix parameters split Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix noop cat to bypass torch.cat and support uneven split Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix unit tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix torch.split args Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix cuda graph due to noop_cat Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove the use of enumerate when possible Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix strides in SplitAlongDim Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Markus Schnoes <markus.schnoes@gmx.de> Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
Markus Schnoes <markus.schnoes@gmx.de>
-
- 20 Sep, 2023 1 commit
-
-
Przemyslaw Tredak authored
* Enable the model to be change precision between iterations Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Add test Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fix for the test Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> --------- Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 18 Aug, 2023 1 commit
-
-
Sudhakar Singh authored
Signed-off-by:Sudhakar Singh <sudhakars@nvidia.com>
-
- 16 Aug, 2023 1 commit
-
-
Kirthi Shankar Sivamani authored
* Initial refactor Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Reorder methods by purpose Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Save full global state Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * More fixes to test Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 08 Aug, 2023 1 commit
-
-
Kirthi Shankar Sivamani authored
* Optimize calls to .cpu() during checkpointing Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fixes for ONNX Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 02 Aug, 2023 1 commit
-
-
Kirthi Shankar Sivamani authored
Signed-off-by:Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 14 Jul, 2023 1 commit
-
-
Kirthi Shankar Sivamani authored
Bug fix for checkpointing Signed-off-by:Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 02 Jun, 2023 1 commit
-
-
Jan Bielak authored
* Ignore IDE files Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Fix typing errors Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Ignore devcontainer files Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Avoid import from private module Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Apply @timmoon10 's suggestions Signed-off-by:
Jan Bielak <jbielak@nvidia.com> --------- Signed-off-by:
Jan Bielak <jbielak@nvidia.com>
-
- 01 Jun, 2023 1 commit
-
-
Sudhakar Singh authored
* extend fp8 weight placeholders logic for Linear, LNLinear, LNMLP Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update transformer_engine/pytorch/module/base.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update transformer_engine/pytorch/module/base.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update transformer_engine/pytorch/module/base.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update transformer_engine/pytorch/module/base.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update transformer_engine/pytorch/module/base.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update transformer_engine/pytorch/module/layernorm_linear.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update transformer_engine/pytorch/module/layernorm_mlp.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update transformer_engine/pytorch/module/linear.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update linear.py Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update layernorm_linear.py Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update layernorm_mlp.py Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * lint Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 25 May, 2023 1 commit
-
-
Carlos Mocholí authored
* Clearer error messages for dtype and shape assertions Signed-off-by:
Carlos Mocholí <carlossmocholi@gmail.com> * Update transformer_engine/pytorch/utils.py Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Carlos Mocholí <carlossmocholi@gmail.com> * Update transformer_engine/pytorch/utils.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Carlos Mocholí <carlossmocholi@gmail.com> --------- Signed-off-by:
Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 09 May, 2023 1 commit
-
-
Kirthi Shankar Sivamani authored
* Initial refactor Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * refactor attention out of transformer.py Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix ONNX export Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * linting Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-