- 14 Jun, 2024 2 commits
-
-
Kirthi Shankar Sivamani authored
* Apply formatting Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Apply formatting Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Charlene Yang authored
* add attention docs Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: update attention doc Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: update attention doc Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: update attention doc Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: update attn doc Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: update attn doc Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: update attn doc Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: update attention doc Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * first draft Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor tweak to first draft Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up pictures Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * first draft for review Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fixes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add logging info/debug Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fix of an SWA message Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * use subprocess instead of os.sys Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up benchmark script Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add example script and update notebook Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor tweak Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor tweaks Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix Jax/Paddle related comments Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * rerun H100 benchmark Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * restrict fp8 tests to sm90+ Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * move get_cudnn_version from common to pytorch utils Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
- 12 Jun, 2024 1 commit
-
-
rybakov authored
Signed-off-by:
Oleg Rybakov <orybakov@nvidia.com> Co-authored-by:
Oleg Rybakov <orybakov@nvidia.com>
-
- 10 Jun, 2024 1 commit
-
-
Xiaowei Ren authored
* add seq_offsets_qkvo for cudnn thd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add seq_offsets_qkvo to AttnFuncWithCP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix seq_offsets calculation of cudnn thd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove a thd assert Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix bias for thd test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add thd test for cudnn FA with CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * skip GQA/MQA test for cuDNN THD Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * make sure seq_offsets are computed with qkv_group of hd_hd_hd while CP>1 Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix seq_offsets inputs Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove two comments Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix attn mask type for cudnn thd with cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix attn_mask_type check Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix attn_mask_type for cudnn fa with thd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix a typo Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix out dout in bwd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * assert cudnn+thd does not support attn bias Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * check if attn_mask_type has padding Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * change cp test batch size to 2 Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix code format Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix two assert info Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix assert comment Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix assert comments Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix assert comments Signed-off-by:
Xiaowei Ren <xren@nvidia.com> --------- Signed-off-by:
Xiaowei Ren <xren@nvidia.com> Co-authored-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
- 07 Jun, 2024 1 commit
-
-
Kirthi Shankar Sivamani authored
* Remove interval arg from recipe Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Remove usage of interval and use explicit kwarg for testing recipes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 06 Jun, 2024 1 commit
-
-
Kirthi Shankar Sivamani authored
Cleanup Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 03 Jun, 2024 1 commit
-
-
Tim Moon authored
* Modify CUDA graph tests to use grad accumulation steps Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Initialize grad buffers before capturing CUDA graph in CUDA graph tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Only use BS=2 in CUDA graph tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Update tests/pytorch/test_cuda_graphs.py Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 30 May, 2024 3 commits
-
-
Charlene Yang authored
* add THD support Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add seq_offsets_o and use new offset calculation Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * addition to previous commit; fix unit test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add None for offset_o gradient Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: test padding between sequences Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: fix tests for padding between sequences Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix tests for sbhd/bshd layouts; clean up Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update cudnn-frontend and add tests for max_seqlen_q=1 and d=256 for inference Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * test sbhd/bshd layouts for sq1, d256 inference case Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * replace wording from accumulative to cumulative Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add offset tensors to custom fp8 mha tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add version control for cuDNN Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add sm>=90 constraint for thd support Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix cuDNN support for sq=1, d=256 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint and minor tweak for fp8 tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * modify cudnn version and restrict MQA/GQA support for THD Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add notes for seq offset tensors Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add dummy tensor to pass jax build Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add dummy tensor to pass paddle build Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix Jax CI Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Signed-off-by:
cyanguwa <8636796+cyanguwa@users.noreply.github.com>
-
Xin Yao authored
* add multi-tensor kernels Signed-off-by:
Xin Yao <xiny@nvidia.com> * add FusedAdam Signed-off-by:
Xin Yao <xiny@nvidia.com> * add test to qa Signed-off-by:
Xin Yao <xiny@nvidia.com> * add FusedSGD Signed-off-by:
Xin Yao <xiny@nvidia.com> * fix lint Signed-off-by:
Xin Yao <xiny@nvidia.com> --------- Signed-off-by:
Xin Yao <xiny@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
Tim Moon authored
* Initial refactor of FP8 workspaces in Linear module Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove extra kernel launch Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Minor perf optimizations Tensor base class functions in Float8Tensor have significant overhead. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug FP8 recipe test Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Refactor FP8 workspaces in LayerNormLinear and LayerNormMLP Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Document FP8 workspace function Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Revert changes to FP8 recipe tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add support for lazy FP8 transpose caching Previous caching behavior (always fill cache) incorrectly filled cache during CUDA graph warmup steps. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix Pylint warnings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug ONNX export ONNX FP8 cast ops assumed that FP8 scales were created during model export (i.e. not initialized during training). Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug fused attention tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Make sure Float8Tensor.transpose_2d is backward compatible Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Revert changes to ONNX export operations Work around ONNX test failures by filling FP8 scale tensors instead of copying into them. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug scale factor update in Float8Tensor transpose_2d Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com>
-
- 25 May, 2024 1 commit
-
-
Paweł Gadziński authored
* Fixed Llama tutorial. Changed batch size and added fused=True. Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> * Tutorial updated but not complete yet. Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> * Tutorial notebook reset - removed fuse=true Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> * Removed fused=true Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> * Batch size back to 8 Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> * Typo and commented out line Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> * fixed whitespace Signed-off-by:
root <root@ipp2-0037.nvidia.com> * fixed whitespace Signed-off-by:
root <root@ipp2-0037.nvidia.com> * Added comment to attention line. Fixed potential bug with loading weights - now loading works correctly, confirmed by the generation code. Signed-off-by:
root <root@ipp2-1661.nvidia.com> * Comments Signed-off-by:
root <root@ipp2-1661.nvidia.com> * Models cast added again Signed-off-by:
root <root@ipp2-1661.nvidia.com> * Weight download info Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Moved parameter gate_proj_size to config Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * gate_proj_size removed and put immediate_size instead Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Llama 3 added to tutorial Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Typos fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Typos fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Fixed model loading Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Loading fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Different dim for attention Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Reversed other commit Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Changed name to kv_channels Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Fixed typo Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Back to kv_channels in transformer layer Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Back to kv_channels in transformer layer Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Small bug fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Small bug fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Test fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * changed file modes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * lint fix and resolved conflict Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * lint fix and resolved conflict Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Lint fix, hopefully last Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> Signed-off-by:
root <root@ipp2-1661.nvidia.com> Co-authored-by:
root <root@ipp2-2373.nvidia.com> Co-authored-by:
root <root@ipp2-1588.nvidia.com> Co-authored-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
root <root@ipp2-0037.nvidia.com> Co-authored-by:
root <root@ipp2-1661.nvidia.com> Co-authored-by:
root <root@ipp2-2371.nvidia.com> Co-authored-by:
root <root@ipp2-1589.nvidia.com> Co-authored-by:
Sudhakar Singh <sudhakars@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 22 May, 2024 1 commit
-
-
Phuong Nguyen authored
* added alignment requirements for CuBLAS heuristics Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * minor rewords Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * added unit test for gemm with unaligned inputs Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * added pytest skip if fp8 is not available Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * changed offset so that it has alignment with 128 Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 20 May, 2024 1 commit
-
-
Paweł Gadziński authored
* Calibration fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Lint fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
Pawel Gadzinski <pgadzinski@nvidia.com>
-
- 17 May, 2024 1 commit
-
-
Charlene Yang authored
* fix inconsistency for attn mask; now True means participating in attn Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix sliding window window_size for decoder+padding combination Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert paddle changes regarding mask Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert softmax to 1-mask;0-keep Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * enforce 1-mask out; 0-keep rule for jax masks Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix jax lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert pytorch mask changes; some kept in tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert to jax fused attn on main Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * inverse mask logic for get_cu_seqlens/_and_indices in PyTorch implementation and mask generation in unit tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * temporarily disable update_weight_scale_inv Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * enforce window_size for decoder Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add docstring for mask definition 1-mask out;0-keep Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add aux_ctx_tensors to save_for_backward Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * tweak make_decoder_mask and make_mask in jax tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * skip dBias for shapes other than 1HSS; otherwise dq/dk/dv NaNs Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * expand attn_biases from list to variables in save_for_backward Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix use of variable before assignment in jax dact_lu Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove window size definition for decoder Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add change notes in README for padding mask in PyTorch Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * tweak padding mask notes in README Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * expand list to tensors for save_for_backwards Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Signed-off-by:
cyanguwa <8636796+cyanguwa@users.noreply.github.com>
-
- 16 May, 2024 1 commit
-
-
Phuong Nguyen authored
* added squared relu in te-torch Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 13 May, 2024 1 commit
-
-
Kunlun Li authored
Signed-off-by:
kunlunl <kunlunl@nvidia.com> Co-authored-by:
cyanguwa <8636796+cyanguwa@users.noreply.github.com>
-
- 07 May, 2024 1 commit
-
-
Tim Moon authored
Update FP8 recipe test to handle recipe changes Signed-off-by: Tim Moon <tmoon@nvidia.com>
-
- 02 May, 2024 1 commit
-
-
cyanguwa authored
* initialize tp_group for FP8 DPA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix cuDNN version in unit tests for cuDNN v9 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add hook to ignore missing fused_attn._extra_states if training from old checkpoints Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove test and redundant implementation from last commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove warning message and replace with docstring Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove tp_size/tp_group in FusedAttention; amax reduction is handled with fp8_group Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * move core_attention.fused_attention._extra_state to core_attention._extra_state Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * simplify post_state_dict_hooks between FU and DPA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add temporary test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove previous attempts to move core_attention.fused_attention to core_attention; keep the test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove the test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * disable pylint self arg for hook which is required by hook Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Signed-off-by:
cyanguwa <8636796+cyanguwa@users.noreply.github.com>
-
- 01 May, 2024 1 commit
-
-
Jinze Xue authored
* Handle the scaling factor when amax is so tiny that it leads to an infinite scale Signed-off-by:
Jinze Xue <jinzex@nvidia.com> * revert formatting changes Signed-off-by:
Jinze Xue <jinzex@nvidia.com> * fix comments Signed-off-by:
Jinze Xue <jinzex@nvidia.com> * Apply review suggestion Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Jinze Xue <155670984+jinzex@users.noreply.github.com> * Apply review suggestion Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Jinze Xue <155670984+jinzex@users.noreply.github.com> * Apply review suggestion Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Jinze Xue <155670984+jinzex@users.noreply.github.com> * apply review suggestion Signed-off-by:
Jinze Xue <jinzex@nvidia.com> * add test_recipe.py to qa/L0_pytorch_unittest/test.sh; fix unittest for is_first_microbatch=False Signed-off-by:
Jinze Xue <jinzex@nvidia.com> * revert changes to update_weight_scale_inv Signed-off-by:
Jinze Xue <jinzex@nvidia.com> * Debug test failures Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Jinze Xue <jinzex@nvidia.com> Signed-off-by:
Jinze Xue <155670984+jinzex@users.noreply.github.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Jinze Xue <jinzex@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com>
-
- 29 Apr, 2024 2 commits
-
-
cyanguwa authored
restrict context parallel tests to sm80+ as fused/flash attn backends require sm80+ Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
Zhenhuan Liu authored
* Add support for MoE with FP8. Signed-off-by:
Dennis Liu <denliu@nvidia.com> * Fix unittest. Signed-off-by:
Dennis Liu <denliu@nvidia.com> * Fix error in linear backward. Signed-off-by:
Dennis Liu <denliu@nvidia.com> --------- Signed-off-by:
Dennis Liu <denliu@nvidia.com> Co-authored-by:
Przemyslaw Tredak <ptredak@nvidia.com>
-
- 26 Apr, 2024 1 commit
-
-
Xiaowei Ren authored
* make FusedAttn with CP support bias Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * assert Alibi cannot work with CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * syntax fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix variable name Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix tensor shapes Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * a typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix bias indexing for CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * bug fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add attn bias tests Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * change dbias update location Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix CP test model configs Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * change CP test sequence length Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * make AttnFuncWithCP support qkv format of sbhd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * make sure qkv are contiguous for CP in cuDNN fused attn Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * change assert message Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix code format Signed-off-by:
Xiaowei Ren <xren@nvidia.com> --------- Signed-off-by:
Xiaowei Ren <xren@nvidia.com> Co-authored-by:
cyanguwa <8636796+cyanguwa@users.noreply.github.com>
-
- 19 Apr, 2024 1 commit
-
-
Tim Moon authored
* Support noop concat without providing full tensor Stop storing fused buffers in linear modules. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug noop cat func Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Construct TE modules in tests with correct dtypes Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add tolerances to numerical tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use plain PyTorch concat when exporting to ONNX Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 16 Apr, 2024 1 commit
-
-
cyanguwa authored
* WIP: fp8 v1 fprop integration Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: minor fixes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add debug info Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add more debug info Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fprop working for h1; w/ debug info Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: add bprop Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * cleanup; bprop running but has mismatches Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add gitlab frontend as submodule Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up and add back v0.9.2 FE support; fprop/bprop passing with 5e-2 tols Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix after merge; add bias_b/h to caching descriptor Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * distinguish fwd/bwd tensor types for bprop Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fix for F16 cases; include added dqkv_type and d_scale_dp Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * adjust out shape for bwd in test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add casting from/to FP8 to DPA module Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: bshd_bshd_bshd layout Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: support all sbhd/bshd layouts Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add qkvpacked and kvpacked support in both FusedAttnFunc and C levels Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove qkvpacked/kvpacked calls in DPA module (used for testing) Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove tp setup; add allow_non_contiguous; update FE; revert to sbh3d in tests; clean up Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add NVTE_FP8_DPA_BWD to control whether to use FP8 bwd or F16 bwd Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix MQA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix MQA/GQA in FP8 v1 API Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update FE to 705d8e3, with API change Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * test causal mask Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * restrict mha_fill for THD format Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix fused attn with CP and comment out is_alibi code Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up FE0.9 vs FE1.0 FP8 implementations, and related unit tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * change NVTE_FP8_DPA_BWD default to 1, and fix its use in qkvpacked/kvpacked APIs Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint and self.tp_size/group in FusedAttention() Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update FE to 6902c94 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add FP8 MHA support Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update to FE v1.3.0 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fixes for FP8 MHA with different configs Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * emit stats regardless of is_training Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix linear when input is not Float8Tensor Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix d_out type when f16 bprop Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix user buffer for layernorm_linear/linear and revert two FP8 casts in MHA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add docstring for fp8_dpa/mha in recipe Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix backend selection to avoid FA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * replace transpose with transpose_2d Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * use RMSE for FP8 unit tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * replace two more transpose with transpose_2d Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add FP8 initialization to FusedAttention Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * rm docs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Revert "add FP8 initialization to FusedAttention" This reverts commit 15fffd825d6f23f31ea709b16ba01dfd61efabf8. Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Change order of ctxs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * minor fixes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add back docs and mark as beta Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fixes for tests and docs Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 12 Apr, 2024 2 commits
-
-
Sangkug Lym authored
* Add LN margin to inference Signed-off-by:
Sangkug Lym <slym@nvidia.com> * cleanup Signed-off-by:
Sangkug Lym <slym@nvidia.com> * Fix symbolic func registration Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix grads Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Sangkug Lym <slym@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Kirthi Shankar Sivamani authored
* FP8 cuda graphs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Charlene Yang <charleney@nvidia.com> * Fix numerics Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * exclude torch compile from numerics tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * More numerics fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix CI Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * rm fusion from unfused path Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Charlene Yang <charleney@nvidia.com>
-
- 06 Mar, 2024 1 commit
-
-
Oleg Goncharov authored
* Modified MHA and DPA logic to use causal softmax and FA for inference Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Adjusted unfused attention and softmax logic for inference Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Cleaned up the code per pylint Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Added test cases to evaluate numerics of incremental decoding Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Apply suggestions from code review Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com> * Apply suggestions from code review [sequence start-end] Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com> * Apply suggestions from code review [inference_params offset update] Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com> * Fixed bug in KV-cache indices and updated test suite Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Added inference_params description and applied suggestions from the code review Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Adjusted absolute tolerances in numerics tests Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Cleaned up the files per pylint Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> --------- Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> Signed-off-by:
Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com> Co-authored-by:
Przemyslaw Tredak <ptredak@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 28 Feb, 2024 1 commit
-
-
cyanguwa authored
* added support for arbitrary bias shapes for fused_attn Signed-off-by:
Alp Dener <adener@nvidia.com> * Fix linting Signed-off-by:
Alp Dener <adener@nvidia.com> * Add b1ss/bhss/11ss bias shapes when not requiring dBias Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add bias_b/h to plan cache Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fixed compile errors after PR653 merge Signed-off-by:
Alp Dener <adener@nvidia.com> * updated JAX unittests for new bias shapes Signed-off-by:
Alp Dener <adener@nvidia.com> * fixed mismatched mask type checking Signed-off-by:
Alp Dener <adener@nvidia.com> * corrected skip condition Signed-off-by:
Alp Dener <adener@nvidia.com> * fix selection logic for A100s Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * corrected skip checks for bias shapes Signed-off-by:
Alp Dener <adener@nvidia.com> * resolved test issues but neginf with float16 is still problematic with JAX Signed-off-by:
Alp Dener <adener@nvidia.com> * new bias shapes passing TE JAX CI for seqlen <= 512, seq_q == seq_kv and h_q == h_kv conditions Signed-off-by:
Alp Dener <adener@nvidia.com> * TE/JAX fused attn tests for new bias shapes passing with neg_inf=-2**27 for Bfloat16 and -2**15 for Float16 Signed-off-by:
Alp Dener <adener@nvidia.com> * code style fixes and test parameter ID cleanup Signed-off-by:
Alp Dener <adener@nvidia.com> * fixed incorrect skip condition for backward fused attn test Signed-off-by:
Alp Dener <adener@nvidia.com> --------- Signed-off-by:
Alp Dener <adener@nvidia.com> Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
Alp Dener <adener@nvidia.com>
-
- 24 Feb, 2024 1 commit
-
-
Alp Dener authored
* added non-reentrant mode support to TE checkpoint Signed-off-by:
Alp Dener <adener@nvidia.com> * updated get_cuda_rng_tracker kwarg to get_rng_state_tracker to remain consistent with other TE API Signed-off-by:
Alp Dener <adener@nvidia.com> * docstring cleanup Signed-off-by:
Alp Dener <adener@nvidia.com> * added mechanism to disable bias_gelu_nvfusion in LayerNormMLP when checkpointing in non-reentrant mode Signed-off-by:
Alp Dener <adener@nvidia.com> * refactored checkpoint and recompute hook names to match PyTorch implementation Signed-off-by:
Alp Dener <adener@nvidia.com> * Fixed incorrect reference before assignment Signed-off-by:
Alp Dener <adener@nvidia.com> * fixed argument error in calling native PyTorch checkpoint Signed-off-by:
Alp Dener <adener@nvidia.com> * fixed linting errors for missing docstrings Signed-off-by:
Alp Dener <adener@nvidia.com> * Fix lint Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * bias GELU fusion consistency between checkpoint test and reference comparison Signed-off-by:
Alp Dener <adener@nvidia.com> --------- Signed-off-by:
Alp Dener <adener@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 17 Feb, 2024 1 commit
-
-
Alp Dener authored
* Added QuickGELUActivation from HuggingFace/Transformers to common and pytorch Signed-off-by:
Alp Dener <adener@nvidia.com> * Removing 'qgelu' from double-size activations list in LayerNormMLP. Signed-off-by:
Alp Dener <adener@nvidia.com> * indent fix Signed-off-by:
Alp Dener <adener@nvidia.com> --------- Signed-off-by:
Alp Dener <adener@nvidia.com> Co-authored-by:
Przemyslaw Tredak <ptredak@nvidia.com>
-
- 15 Feb, 2024 2 commits
-
-
Przemyslaw Tredak authored
* Use fused implementation of RoPE in MultiHeadAttention Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fix freqs dtype Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Tim Moon authored
* Add option to avoid updating transpose cache when possible Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix typo Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use string kwarg for FP8 transpose caching Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove unused attr Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 08 Feb, 2024 2 commits
-
-
Tim Moon authored
* Implement fused kernel for FP8 scale update Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add fused kernel for amax and scale update Add unit test. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Replace paddle.fluid imports with paddle.base Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Move fused kernel to core library Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug test Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use FP8 update kernel in Paddle Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug FP8 scale update in Paddle Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix lint errors Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug Paddle test failures Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Make update kernel in-place for PyTorch Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Revert cudnn-frontend commit Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
cyanguwa authored
* test alibi between fa and fu Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * move alibi slopes and bias to global to avoid repeating calculation Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix alibi slopes/bias generation Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix _is_flash_attention_supported to allow alibi type Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * disable padding mask when alibi is used for fused attn arbi backend Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add support for custom [n_heads] alibi_slopes in flash, fused, unfused attention Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up last commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove alibi_type=none tests as they are unnecessary Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update cudnn-frontend to 1.0.2 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * change bias/dbias shape to allow b,1/1,h/b,h in arbi backend Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * tweak tests for arbi post_scale_bias [1,h,s,s] or alibi_slopes [n_heads] Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * change bias/dbias shape in max512 backend - incomplete Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove max512 changes from last commit and disable max512 (and arbi temporarily) for [b, h, s, s]; pending cuDNN backend support Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up and tweak backend selection logic Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * replace || with () in docstring Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix bias shape for max512 backend Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * combine slopes/bias generation to one function get_alibi() and fix alibi tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix PR557 bugs Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * Update transformer_engine/pytorch/attention.py Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
cyanguwa <8636796+cyanguwa@users.noreply.github.com> * encapsulate global alibi tensors into a dict cache Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * reduce alibi slopes test size Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update to cudnn-frontend 1.0.3 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * use dBias shape to define bias_b/bias_h because jax materializes dBias rather than Bias in bwd abstract Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Signed-off-by:
cyanguwa <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 03 Feb, 2024 1 commit
-
-
Przemyslaw Tredak authored
* Add zero_centered_gamma option to RMSNorm Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Improving tests Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * More improvements to tests Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Tweaking the tolerances Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fix LayerNormMLP test Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Update transformer_engine/common/rmsnorm/rmsnorm_api.cpp Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Update transformer_engine/common/rmsnorm/rmsnorm_api.cpp Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * docs suggestions Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Tweak tolerances with bfloat16 Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> --------- Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 25 Jan, 2024 1 commit
-
-
Xin Yao authored
* fused apply rope Signed-off-by:
Xin Yao <xiny@nvidia.com> * Apply suggestions from code review Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Xin Yao <yaox12@outlook.com> * resolve comments Signed-off-by:
Xin Yao <xiny@nvidia.com> * make rotary_percent optional Signed-off-by:
Xin Yao <xiny@nvidia.com> * fix ci Signed-off-by:
Xin Yao <xiny@nvidia.com> * fix test Signed-off-by:
Xin Yao <xiny@nvidia.com> * add rope test to qa Signed-off-by:
Xin Yao <xiny@nvidia.com> * fix linting Signed-off-by:
Xin Yao <xiny@nvidia.com> * sync apex: add transpose_output_memory Signed-off-by:
Xin Yao <xiny@nvidia.com> * small fix Signed-off-by:
Xin Yao <xiny@nvidia.com> * sync apex: fuse sin/cos Signed-off-by:
Xin Yao <xiny@nvidia.com> * sync apex: fused rope for thd format Signed-off-by:
Xin Yao <xiny@nvidia.com> * fix lint Signed-off-by:
Xin Yao <xiny@nvidia.com> * Fix license headers Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * add support for bshd format Signed-off-by:
Xin Yao <xiny@nvidia.com> * support different seq length Signed-off-by:
Xin Yao <xiny@nvidia.com> * update Signed-off-by:
Xin Yao <xiny@nvidia.com> * update copyright Signed-off-by:
Xin Yao <xiny@nvidia.com> * remove transpose_output_memory Signed-off-by:
Xin Yao <xiny@nvidia.com> * Make outputs contiguous in SBHD case Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> --------- Signed-off-by:
Xin Yao <xiny@nvidia.com> Signed-off-by:
Xin Yao <yaox12@outlook.com> Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Przemyslaw Tredak <ptredak@nvidia.com>
-
- 24 Jan, 2024 1 commit
-
-
Alp Dener authored
[PyTorch] Workaround for incorrect output from torch.cuda.is_bf16_compatible() on V100s and TU102s (#626) * replaced torch.cuda.is_bf16_compatible() with explicit sm_80 check via torch.cuda.get_device_capability() Signed-off-by:
Alp Dener <adener@nvidia.com> * implement te.utils.is_bf16_compatible() to replace torch.cuda counterpart Signed-off-by:
Alp Dener <adener@nvidia.com> --------- Signed-off-by:
Alp Dener <adener@nvidia.com>
-
- 23 Jan, 2024 1 commit
-
-
Alp Dener authored
* added missing parameter materialization on real device for LayerNorm and RMSNorm Signed-off-by:
Alp Dener <adener@nvidia.com> * added new unittest for deferred initialization and modified parameter materialization to support standalone execution outside of FSDP Signed-off-by:
Alp Dener <adener@nvidia.com> * restored tensor parallel attributes that were being wiped out by the parameter reset Signed-off-by:
Alp Dener <adener@nvidia.com> * fixed incorrect order of fp8 metadata initialization Signed-off-by:
Alp Dener <adener@nvidia.com> * added deferred init unittest to the QA script Signed-off-by:
Alp Dener <adener@nvidia.com> --------- Signed-off-by:
Alp Dener <adener@nvidia.com>
-
- 21 Jan, 2024 1 commit
-
-
Selvaraj Anandaraj authored
Activation offloading to CPUs for the Linear, Layernorm Linear and the Layernorm MLP modules (#571) * Added support for activation offloading to CPUs Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com> * Moving CPU offloading library to TE Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com> * Restructured code, added switch to choose between weight/activation offloading Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com> * Removed arg during constructor Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com> * Fix nit-pick errors Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com> * Documentation fixes Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fix to the code block in docs Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Added offloading unit test Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com> * Fixed formatting Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com> * wgrad fusion fix, minor errors and lint Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Errors, test, lint Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * RM test file Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fixed stray PyT tensors in LayernormMLP getting offloaded Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com> * Fixed typo Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com> * Fix offloading for rmsnorm, rm test Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix errors Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Float8Tensor compatible offloading Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Cleanup Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com> Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com> Co-authored-by:
Przemyslaw Tredak <ptredak@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 20 Jan, 2024 1 commit
-
-
Sudhakar Singh authored
fix failing tests due to PR #557 Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> Co-authored-by:
cyanguwa <8636796+cyanguwa@users.noreply.github.com>
-