- 06 Jan, 2026 1 commit
-
-
Paweł Gadziński authored
* docs: Add comprehensive Getting Started guide with benchmarks - Add new Getting Started documentation with PyTorch and JAX tutorials - Include benchmark scripts demonstrating TE performance benefits - Add CSS styling for code output and tabs - Replace old quickstart notebooks with improved documentation - Add transformer layer diagram (SVG) - Update docs configuration and workflow Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * 2026 in copyright Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
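
For reference, a minimal sketch of the kind of PyTorch snippet a Getting Started guide like this opens with; te.Linear and its params_dtype argument are part of the existing Transformer Engine PyTorch API, but the exact content of the new guide may differ and the sizes below are arbitrary assumptions.

    import torch
    import transformer_engine.pytorch as te

    # Transformer Engine's drop-in replacement for torch.nn.Linear (requires a CUDA GPU).
    layer = te.Linear(1024, 1024, bias=True, params_dtype=torch.bfloat16).cuda()

    x = torch.randn(2048, 1024, device="cuda", dtype=torch.bfloat16)
    y = layer(x)  # forward pass through the TE layer
    print(y.shape)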
-
- 02 Jan, 2026 1 commit
-
-
Kirthi Shankar Sivamani authored
Update copyright to include 2026 Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 17 Dec, 2025 1 commit
-
-
jberchtold-nvidia authored
* Tutorial for integrating te/jax quantization into an existing framework Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * add todos Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * support nvfp4 sr rng key, move wrapper module into TE itself, fix bfloat16 cast Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * update docstrings Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix QKV proj and out proj in Flax example transformer layer Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Use fused attention in quickstart_jax example Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remat policy Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * add tutorial to docs Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * update title Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * remove unused dtype from TE DPA module Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix notebook title Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Add explanation of flax module wrapper Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
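
For reference, a minimal sketch of wrapping a TE/JAX Flax module in FP8 quantization, in the spirit of this tutorial; te.fp8_autocast and te_flax.DenseGeneral are existing Transformer Engine JAX APIs, while the shapes and feature sizes below are arbitrary assumptions and the tutorial's own wrapper-module approach may differ.

    import jax
    import jax.numpy as jnp
    import transformer_engine.jax as te
    import transformer_engine.jax.flax as te_flax

    rng = jax.random.PRNGKey(0)
    x = jnp.ones((32, 128, 1024), dtype=jnp.bfloat16)

    # Quantized execution is enabled by initializing and applying the module
    # inside the fp8_autocast context manager.
    with te.fp8_autocast(enabled=True):
        layer = te_flax.DenseGeneral(features=1024)
        params = layer.init(rng, x)
        y = layer.apply(params, x)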
-
- 06 Dec, 2025 1 commit
-
-
Kshitij Lakhani authored
* Add generic stripe_height support for load balancing Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Fix imports in test for deprecated jax.experimental.pjit Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Add test case for stripe_height greater than 1. Add stripe_height arg to reordering methods Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * Add Striped 1 and 4 test cases. Refactor the Load Balancing test case. Fix the incorrect shape in striping inverse reordering Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * Modify test code for CP + AG + THD + stripe height greater than 1 Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * Add stripe_height arg to fused attn and fused attn fwd API. Add appropriate mask checks for AG+THD+CP and pick BRCM to be executed per rank. Add Fused Attn Primitive for CP + THD +AG + Striping. Add a method to reorder and all gather segment ids and offsets for kv Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * TMP: Throwaway testing commit Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * Add comments in primitive registration process Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * TMP: Throwaway test commit Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Undoing incorrect rebase/merge leftovers Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * TMP: Throwaway test commits Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * Add support for calculating q and kv seqlens and offsets per rank for CP+THD+AG+SW+Striped>1 primitive Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * Augment jax primitive register code comments Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * Fix the array sizes and padding values returned for seqlens and offsets to fit what the fused attn primitive non cp computation Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Add support in the new primitive for softmax_offset related changes. Put the missing primitive registration line back in. Increase the seqoffsets array lengths by 1 Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Add new set of helper functions for seqlens and seqoffsets for AG+THD+CP+Stripe>1 which account for batching and a seq offsets size of b+1 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Add backward primitive for CP+THD+AG+Striped>1 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Modify tests for backward primitive for CP+THD+AG+Striped>1 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Move stripe_height along with other static args in fused_attn_bwd rule. Fix typo in CP+AG+THD+Striped>1 primitive Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Code clean up: remove older version for calculating seqlens and offsets for CP+AG+THD+striped>1 primitive Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Add test for CP+THD+AG+Striped>1 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Fix missing var Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add SWA tests for AG+Striped>1+CP+THD+SWA Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Restoring test code Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Remove assert preventing SWA code path in CP+AG+Striped primitive Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Parametrize num_segments_per_seq in tests Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Clean up test code Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Clean up test code in TE common Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Clean up debug statements Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Rename stripe_height to stripe_size Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Code clean up and add additional comments Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> nit: Apply suggestions from code review Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by:
Kshitij Lakhani <33047503+KshitijLakhani@users.noreply.github.com> Fix type on fused attn tests Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Fix seqoffsets length passed onto the FusedAttn primitive, as it is b and not the b+1 needed by cuDNN Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Remove commented code Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by:
Kshitij Lakhani <33047503+KshitijLakhani@users.noreply.github.com> Fix linting issues Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Fix incorrect greptile change Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Skip THD test cases for CP + AG + Dual chunk. Skip BSHD cases for CP + AG + Striped>1. Correct the layout and shape parameters passed to the tests Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Pass stripe_size explicitly for ring attn tests for THD cases Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Remove TODO Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Explicitly fail if THD + AG is being used with a non padding causal mask Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * nit: Correct the ID for the test dist fused attn tests to account for cp*2 which is done under the hood Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Set num_segments_per_seq defaults to None instead of 0 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Augment comments. Add ValueError for stripe_size=0 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Test only 1 num_segments_per_seq combination for CP+AG+THD+Striped>1+SWA instead of 2. Modify the num segments and window size to easy-to-debug values Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Default stripe_size to None instead of 0. Modify stripe_size check for <=0 instead of ==0 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove incorrectly added file Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Explicitly pass zero sized arrays for seg ids and pos in the CP + AG + Striped primitive rather than using the seqlens or the offsets as placeholders Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Fix linting errors Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add a deep dive doc for CP+THD+AG+Stripe>1+SWA regarding design considerations and decisions Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Put docs and pngs into their own separate dir Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Replace png screenshots with markdown code blocks for the attention patterns. Remove unnecessary pngs Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Add doc file to index.rst. Fix grammatical errors Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> --------- Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> Co-authored-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> Co-authored-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 27 Nov, 2025 1 commit
-
-
Teddy Do authored
Signed-off-by: tdophung <tdophung@nvidia.com>
-
- 26 Nov, 2025 1 commit
-
-
Paweł Gadziński authored
* init Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * line lengths Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * subtitle --- fix in many files: Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * cross entropy _input -> input rename Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * cross entropy _input -> input rename Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * a lot of small fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * torch_version() change Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add missing module and fix warnings Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * removed trailing whitespace Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Update docs/api/pytorch.rst Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by:
Paweł Gadziński <62263673+pggPL@users.noreply.github.com> * Fix import Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix more imports Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix NumPy docstring parameter spacing and indentation - Standardize parameter documentation to use 'param : type' format (space before and after colon) per NumPy style guide - Fix inconsistent indentation in cpu_offload.py docstring - Modified 51 Python files across transformer_engine/pytorch Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
Paweł Gadziński <62263673+pggPL@users.noreply.github.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 21 Nov, 2025 1 commit
-
-
Kshitij Lakhani authored
* Make BSHD default for Unfused DPA, DPA and MHA in TE JAX Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Remove explicit transpose_batch set for BSHD for DPA in JAX quickstart Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Add warnings in DPA and MHA to warn users of the changed defaults to BSHD instead of SBHD Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Minimize the scope of when to trigger warnings for changed defaults for transpose_batch_sequence Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
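
For reference, a sketch of what the changed default means for callers; transpose_batch_sequence is the flag named in this commit, while the other constructor arguments shown are illustrative assumptions rather than the exact TE/JAX Flax signature.

    import transformer_engine.jax.flax as te_flax

    # With BSHD now the default, inputs are laid out batch-first:
    # [batch, sequence, heads, head_dim] rather than [sequence, batch, ...].
    # Passing the flag explicitly documents the intent and avoids relying on
    # the newly changed default.
    attn = te_flax.DotProductAttention(
        head_dim=64,                     # illustrative argument names
        num_attention_heads=16,
        transpose_batch_sequence=False,  # False => batch-first (BSHD) inputs
    )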
-
- 15 Nov, 2025 1 commit
-
-
Teddy Do authored
* jax quickstart guide first commit Signed-off-by:
tdophung <tdophung@nvidia.com> * fix the syntax errors and remove unnecessary comments in utils. Add some footnotes in the quick start notebook Signed-off-by:
tdophung <tdophung@nvidia.com> * Address greptile comments on spelling, deepcopy, and vjp function signature compatibility with speedometer Signed-off-by:
tdophung <tdophung@nvidia.com> * Add Copyright to utils and fix some more greptiles complaints Signed-off-by:
tdophung <tdophung@nvidia.com> * Add comments to alternative of layers Signed-off-by:
tdophung <tdophung@nvidia.com> * Remove weight sharing between different iterations of the transformerLayer Signed-off-by:
tdophung <tdophung@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
tdophung <tdophung@nvidia.com> * Add enum for attention implementations. Fix inconsistency between fused and unfused TE impls to achieve the same performance (removing the extra dropout layer in fused layers). Also some minor wording changes Signed-off-by:
tdophung <tdophung@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
tdophung <tdophung@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix bug in TransformerLayer expected input shape being [sequence, batch, ...] instead of [batch, sequence,...] Signed-off-by:
tdophung <tdophung@nvidia.com> * Change structure of notebook to bring fp8 ahead of fusion, to allow fusion to take effect once quantization exists, as suggested. Also make TransformerLayer perf get closer to Fused by setting hidden_dropout=0 Signed-off-by:
tdophung <tdophung@nvidia.com> * add option to choose between different attention implementations in the call of BasicTETransformerLayer and demonstrate the difference in runtime between using flax and using TE's attention implementation Signed-off-by:
tdophung <tdophung@nvidia.com> * Fix missing attention_implementation in FuseTETransformerLayer Signed-off-by:
tdophung <tdophung@nvidia.com> * Remove AttentionWrapper and custom-built DPA, using flax and TE's impl only; remove last mention of PyTorch Signed-off-by:
tdophung <tdophung@nvidia.com> * More changes to markdown to remove PyTorch Signed-off-by:
tdophung <tdophung@nvidia.com> * cosmetic fixes Signed-off-by:
tdophung <tdophung@nvidia.com> * changing names of all implementations Signed-off-by:
tdophung <tdophung@nvidia.com> * change fp8_autocast to autocast, make causal mask, and some wording changes Signed-off-by:
tdophung <tdophung@nvidia.com> --------- Signed-off-by:
tdophung <tdophung@nvidia.com> Co-authored-by:
tdophung <tdophung@dc2-container-xterm-034.prd.it.nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
-
- 14 Oct, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
* Initial API change Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Change all imports and api Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * format Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix typo Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix recipe tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix more tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix docs, tests, and make Jax change as well Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Change internal uses of fp8_autocast Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Address nits Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * rename file Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * CG function, and small test fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Change instances of make_graphed_callables internally Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix distributed tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Review Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Review Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix test and add more docs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Cleanup test imports and minimize internal file imports Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Make is_bf16_available public Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Better docs and better api Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * format Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Apply suggestions from code review Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * fix nvfp4 test Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
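
For context, a sketch of the pre-rename API that this change migrates away from internally; fp8_autocast and DelayedScaling are the long-standing TE PyTorch entry points, and the new public names introduced by this PR are not spelled out in the message, so they are not shown here.

    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import DelayedScaling, Format

    layer = te.Linear(1024, 1024).cuda()
    recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)

    x = torch.randn(2048, 1024, device="cuda")
    # FP8 execution is enabled by wrapping forward passes in the autocast
    # context manager together with a scaling recipe.
    with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
        y = layer(x)
    y.sum().backward()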
-
- 05 Oct, 2025 1 commit
-
-
Przemyslaw Tredak authored
* Added the NVFP4 part to the low precision tutorial Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Added the runtime results Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Update docs/examples/fp8_primer.ipynb Co-authored-by:
Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Update docs/examples/fp8_primer.ipynb Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Update docs/examples/fp8_primer.ipynb Co-authored-by:
Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Update docs/examples/fp8_primer.ipynb Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Update docs/examples/fp8_primer.ipynb Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Update docs/examples/fp8_primer.ipynb Co-authored-by:
Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Copilot <175728472+Copilot@users.noreply.github.com>
-
- 18 Sep, 2025 1 commit
-
-
zhujian authored
feature(FA3,MLA,CP): 1. Update FA3 to commit-id 3ba6f82 (tag 2.8.0.post2 with compile error fixed), PR-1604 supports hdimQK != hdimV backward 2. Update get_attention_backend method because FA3 supports MLA now 3. Add CP MLA support for FA3 4. Add unit tests for FA3 MLA CP 5. Update attention doc Signed-off-by: zhujian <zhujian.whu.cs@gmail.com>
-
- 17 Sep, 2025 1 commit
-
-
Sudhakar Singh authored
* add tutorial files and other local changes Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * remove extraneous code for easier debugging Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * make cuda graphs work with non-paged and paged attention Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * perf imp for kv cache ops Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * add code for calibration Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * optimize kv_cache reindex and copy kernels Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * changes to make quantizers work with fp8_calibration Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * avoid reindexing from python side Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * rename variable from previous commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fix Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fix Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * use quantizer only if needed Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * functionality of the tutorial tested and perf checked Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * remove files and update headers/licenses Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * update header/license Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update tutorial for review Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * make weights downloadable on the fly; remove extra print statements Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix lint and update comments Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add comma back, typo Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * sequence_start_positions should be None for training Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add paged attention numbers and update requirements.txt file Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * more fixes Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * make tutorial work on blackwell Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * remove gemma FT tutorial for now Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * fixing the headings placement and rewording attention -> kv caching Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * fixes from comments Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix the images Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * misc fixes Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * add more comments to te_gemma.py and cleanup utils.py Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add more information about the hierarchy of the classes used in the tutorial Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add better cuda graphs picture Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * add updated cuda graphs pictures Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * add illustrated cuda graphs Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * fix Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * small fixes in documentation Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * add torch.no_grad() to force reduced memory usage Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * some fixes from recent comments Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * more fixes from remaining comments Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * add te_rope_emb to class desc Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * fix tutorial wording; add calibration fix to grouped_linear.py Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> --------- Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
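
For context, a minimal sketch of the CUDA-graph capture pattern this tutorial builds on; te.TransformerLayer and te.make_graphed_callables are existing TE PyTorch APIs, while the shapes below are arbitrary assumptions and the tutorial's generation loop (KV cache, paged attention) is more involved.

    import torch
    import transformer_engine.pytorch as te

    layer = te.TransformerLayer(
        hidden_size=1024, ffn_hidden_size=4096, num_attention_heads=16
    ).cuda()

    # Static-shape sample input used to capture the CUDA graph
    # (default layout is [sequence, batch, hidden]).
    sample_args = (torch.randn(128, 2, 1024, device="cuda"),)

    # Capture the layer into a CUDA graph; the returned callable replays the
    # graph on later calls with identically shaped inputs.
    graphed_layer = te.make_graphed_callables(layer, sample_args)

    out = graphed_layer(torch.randn(128, 2, 1024, device="cuda"))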
-
- 28 Aug, 2025 1 commit
-
-
Paweł Gadziński authored
* Compute amax in normalization forward in current scaling in untuned kernels Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * code drop Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * apply Tim's suggestions Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Jan Bielak <jbielak@nvidia.com> Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
Jan Bielak <jbielak@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 31 Jul, 2025 1 commit
-
-
Paweł Gadziński authored
* code drop Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 21 Jul, 2025 1 commit
-
-
Charlene Yang authored
* exclude 9.10.0/.1 for certain configs Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix kv_channels Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add get_backend to tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add init files Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix numerics and cuda graph tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix jax tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove prints Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor changes after renaming Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix import structure and rename get_attention_backends Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix docs and benchmarks Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix get backend calls Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * Revert "fix get backend calls" This reverts commit 653cbb51c697bc2f975416bb3aac1d85f76c36dc. Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * Revert "fix docs and benchmarks" This reverts commit 98cd52e04ff7c53e26b412195f5744e39f7ed0e9. Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix docs, benchmarks and pre-commit ci Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix dpa/mha flash attn selection Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix rng states Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix ModelConfig Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix backend selection on Ampere Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix issues from last merge Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * Update tests/pytorch/utils.py Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove initialization of rng_states to None Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * redefine ModelConfig Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix typo Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix ModelConfig Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix seed for CP tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * Update tests/pytorch/test_sanity.py Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * move fixture from utils to individual tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix CI Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 09 Jun, 2025 1 commit
-
-
Jan Bielak authored
Use public API instead of removed private function * replaced use of _load_state_dict_into_model with model.load_state_dict because the private function _load_state_dict_into_model was removed in https://github.com/huggingface/transformers/pull/36335 Signed-off-by:
Jan Bielak <jbielak@nvidia.com>
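
For reference, a sketch of the replacement described above, assuming a Hugging Face style checkpoint; the model name and state-dict path are placeholders, not taken from the tutorial.

    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("my-checkpoint")      # placeholder name
    state_dict = torch.load("pytorch_model.bin", map_location="cpu")   # placeholder path

    # Public PyTorch API used after the fix; the tutorial previously relied on
    # transformers' private _load_state_dict_into_model, removed upstream.
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    print(missing, unexpected)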
-
- 07 May, 2025 1 commit
-
-
Santosh Bhavani authored
* added a direct link to the quickstart notebook right after the code examples section Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * updated link in README for HF Accelerate docs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * update DeepSpeed integration link Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Update Release Notes link to documentation archive Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * updated latest news and moved older news under a dropdown caret Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * moved previous news to bottom of readme Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fixed previous news link Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * added gtc videos Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * added TE GTC 2025 talk to latest news Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Santosh Bhavani <sbhavani@nvidia.com>
-
- 28 Apr, 2025 1 commit
-
-
Kshitij Lakhani authored
* Move MultiHeadAttention into its own file. Modify tests and files in t_e/pytorch to import from the new MHA module Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Resolving lost MHA changes from PR 1614 as a result of rebase Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Move context parallelism code into its own file. Modify test and local imports of cp code accordingly Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Move softmax.py from pytorch/ to pytorch/d_p_a Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Move Unfused and Fused attention to backends.py and some utils functions to pytorch/utils.py Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Resolving lost mark_activation_offload changes from PR 1678 as a result of rebase Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Code clean up Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Refactor attention dir Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Refactor dir structure. Make relevant symbols public in __init__ for attention and d_p_a dirs. Move FA package imports to backends.py. Code cleanup Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Modify tests to import attention modules correctly Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Lint fixes Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Code clean up and fix typo Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Allowing InferenceParams and RoPE imports from attention module and pytorch module Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Allow InferenceParams and RoPE imports via transformer_engine.pytorch and transformer_engine.pytorch.attention modules. Remove unnecessary checks for check_set_window_size in MHA and TL. Reorder backends so that smaller classes come first and larger ones last. Code clean up Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Reinstating changes from PR 1478 for rope.py lost during rebase conflict resolution Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix lint issues Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * nit: Code clean up Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Make imports leaner Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 14 Mar, 2025 1 commit
-
-
Kshitij Lakhani authored
* Create pytorch/dot_product_attention module and pytorch/d_p_a/utils.py Move attention logging into a separate class in pytorch/d_p_a/utils.py Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Create FlashAttentionUtils class in pytorch/d_p_a/utils/py for versioning info Move versioning info out of pytorch/attention.py Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Move AttentionParams and get_attention_backend from attention.py to d_p_a/utils.py Fix tests and imports for the above refactor change Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Move get_qkv_layout(), get_full_mask(), get_alibi(), get_attention_quantizers() to d_p_a/utils.py Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Move tensor packing and unpacking helper functions from pyt/attention.py to d_p_a/utils.py Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Move cumulative seqlens and indices methods from pyt/attention.py to d_p_a/utils.py. Rename cumulative functions from using _cu_ to using _cumul_ to differentiate them from the CUDA cu call convention. Rename tensor packing methods with a leading underscore to mark them as internal to the file Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove unnecessary imports in pytorch/attention.py and d_p_a/utils.py Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Create d_p_a/inference.py and move InferenceParams from pyt/attention.py to it Modify tests and other files to import InferenceParams correctly Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> Modify docs api for InferenceParams Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Create d_p_a/rope.py and move RoPE methods from pytorch/attention.py to it Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Code cleanup Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix qa testing induced bug Code clean up Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix incorrect pack_tensor arg type Code clean up Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * nit: Resolve lint errors Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove typedef FAUtils for FlashAttentionUtils Use attn_log instead of att_log Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> Fix lint error Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * nit: Fix the function name from get_cumul to the earlier get_cu Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * nit: Fix typos, explicit imports and remove extra comments Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> --------- Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
- 12 Feb, 2025 1 commit
-
-
Przemyslaw Tredak authored
* Updated docs for TE 2.0 Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Do not expose comm_gemm_overlap and cast_transpose_noop Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Made the figures larger Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Apply suggestions from code review Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Przemyslaw Tredak <ptrendx@gmail.com> * Update quickstart_utils.py Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Change from review Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Przemyslaw Tredak <ptrendx@gmail.com> --------- Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Signed-off-by:
Przemyslaw Tredak <ptrendx@gmail.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 07 Feb, 2025 1 commit
-
-
Przemek Tredak authored
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
-
- 02 Jan, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 20 Sep, 2024 1 commit
-
-
Sudhakar Singh authored
* allow tutorial to download the model weights automatically Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * allow users to provide weight cache directory Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 13 Aug, 2024 1 commit
-
-
Charlene Yang authored
* update example/benchmark scripts Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix head_dim after MLA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update notebook Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 02 Aug, 2024 1 commit
-
-
Przemyslaw Tredak authored
* Link attention docs to the main docs and fix errors reported by Sphinx Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Lower the version of nbsphinx Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * More fixes Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Change the URL of example_attention.py to GitHub Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * More fixes in the attention tutorial Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> --------- Signed-off-by:
Przemek Tredak <ptredak@nvidia.com>
-
- 10 Jul, 2024 1 commit
-
-
Charlene Yang authored
* add cuDNN swa Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix SWA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add set_deterministic and minor fixes for swa Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add AttentionParams Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * change window_size to int64_t; fix swa/determinism tests; cache _attention_backends Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add window_size to get_backend; fix jax and paddle Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fixes; add set_deter to bwd_impl Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix unit tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix FP8 tests due to determinism Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add support matrix for SWA and bias Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fixes and lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fixes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add wording on window_size special cases Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor tweak on wording Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix jax assertion error Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix wording Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * call bwd with deterministic=true for jax/paddle Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add determinism words in documentation Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
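
For reference, a sketch of the sliding-window usage this change enables; te.DotProductAttention and its window_size=(left, right) argument follow the TE PyTorch API, while the head counts and shapes below are arbitrary assumptions.

    import torch
    import transformer_engine.pytorch as te

    # Sliding-window causal attention: each query attends to at most the
    # previous 128 tokens, i.e. (left, right) = (128, 0).
    dpa = te.DotProductAttention(
        num_attention_heads=16,
        kv_channels=64,
        attn_mask_type="causal",
        window_size=(128, 0),
    )

    s, b, h, d = 512, 2, 16, 64  # default qkv_format is "sbhd"
    q = torch.randn(s, b, h, d, device="cuda", dtype=torch.bfloat16)
    k = torch.randn(s, b, h, d, device="cuda", dtype=torch.bfloat16)
    v = torch.randn(s, b, h, d, device="cuda", dtype=torch.bfloat16)
    out = dpa(q, k, v)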
-
- 03 Jul, 2024 1 commit
-
-
Charlene Yang authored
* update to FE 1.5.1 and add bottom right causal Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * adjust logic for backend selection Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update FE to 1.5.2 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add get_attention_backend function Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update get_attention_backend Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix get_attention_backend Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tweak get_attention_backend and fix unit tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fixes for unfused, get_backend, etc Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update transformer_engine/pytorch/attention.py Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix cpu offload Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fixes for get_attention_backend Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * explicitly skip FP32 and padding tests because there is no support Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fix for window size check Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update check_set_window_size and add enc_dec_attn_mask_type/enc_dec_window_size Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fixes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 14 Jun, 2024 2 commits
-
-
Kirthi Shankar Sivamani authored
* Apply formatting Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Apply formatting Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Charlene Yang authored
* add attention docs Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: update attention doc Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: update attention doc Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: update attention doc Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: update attn doc Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: update attn doc Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: update attn doc Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: update attention doc Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * first draft Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor tweak to first draft Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up pictures Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * first draft for review Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fixes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add logging info/debug Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fix of an SWA message Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * use subprocess instead of os.sys Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up benchmark script Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add example script and update notebook Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor tweak Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor tweaks Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix Jax/Paddle related comments Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * rerun H100 benchmark Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * restrict fp8 tests to sm90+ Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * move get_cudnn_version from common to pytorch utils Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
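
For reference, a sketch of the backend-selection logging the doc describes; NVTE_DEBUG and NVTE_DEBUG_LEVEL are the environment variables documented for attention in TE, set here before Transformer Engine is imported, as in the doc's examples.

    import os

    # 1 enables attention backend logging; level 2 prints DEBUG-level detail.
    os.environ["NVTE_DEBUG"] = "1"
    os.environ["NVTE_DEBUG_LEVEL"] = "2"

    import transformer_engine.pytorch as te  # noqa: E402  (import after setting env vars)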
-
- 06 Jun, 2024 1 commit
-
-
Kirthi Shankar Sivamani authored
Cleanup Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 01 Jun, 2024 1 commit
-
-
Paweł Gadziński authored
* Llama 3 update Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Times update Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Times update Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * utils.py fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * utils.py fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * utils.py fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * update te llama tutorial to allow running with llama 3 weights Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * small fixes Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * small fix Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * small fix Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * add llama 3 vs llama 2 distinctions Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * paraphrasing and corrected facts Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * fix Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * fix Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> Co-authored-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
Sudhakar Singh <sudhakars@nvidia.com>
-
- 28 May, 2024 1 commit
-
-
Tim Moon authored
* Use correct FP8 group in multi-GPU docs; FP8 process group should be the tensor-parallel group Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Synchronize FP8 scales over world group in multi-GPU docs Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com>
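
For reference, a sketch of the pattern the doc fix describes; fp8_autocast's fp8_group argument controls which ranks reduce FP8 amax/scale statistics, and the world group shown here follows the second commit above (the process-group setup assumes a simple single-node job).

    import torch
    import torch.distributed as dist
    import transformer_engine.pytorch as te

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    layer = te.Linear(1024, 1024).cuda()
    x = torch.randn(2048, 1024, device="cuda")

    # Synchronize FP8 scaling factors across all ranks (the world group).
    with te.fp8_autocast(enabled=True, fp8_group=dist.group.WORLD):
        y = layer(x)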
-
- 25 May, 2024 1 commit
-
-
Paweł Gadziński authored
* Fixed Llama tutorial. Changed batch size and added fused=True. Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> * Tutorial updated but not complete yet. Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> * Tutorial notebook reset - removed fuse=true Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> * Removed fused=true Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> * Batch size back to 8 Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> * Typo and commented out line Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> * fixed whitespace Signed-off-by:
root <root@ipp2-0037.nvidia.com> * fixed whitespace Signed-off-by:
root <root@ipp2-0037.nvidia.com> * Added comment to attention line. Fixed potential bug with loading weights - now loading works correctly, confirmed by the generation code. Signed-off-by:
root <root@ipp2-1661.nvidia.com> * Comments Signed-off-by:
root <root@ipp2-1661.nvidia.com> * Models cast added again Signed-off-by:
root <root@ipp2-1661.nvidia.com> * Weight download info Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Moved parameter gate_proj_size to config Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * gate_proj_size removed and immediate_size put in instead Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Llama 3 added to tutorial Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Typos fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Typos fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Fixed model loading Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Loading fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Different dim for attention Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Reversed other commit Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Changed name to kv_channels Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Fixed typo Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Back to kv_channels in transformer layer Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Back to kv_channels in transformer layer Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Small bug fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Small bug fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Test fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * changed file modes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * lint fix and resolved conflict Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * lint fix and resolved conflict Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Lint fix, hopefully last Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> Signed-off-by:
root <root@ipp2-1661.nvidia.com> Co-authored-by:
root <root@ipp2-2373.nvidia.com> Co-authored-by:
root <root@ipp2-1588.nvidia.com> Co-authored-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
root <root@ipp2-0037.nvidia.com> Co-authored-by:
root <root@ipp2-1661.nvidia.com> Co-authored-by:
root <root@ipp2-2371.nvidia.com> Co-authored-by:
root <root@ipp2-1589.nvidia.com> Co-authored-by:
Sudhakar Singh <sudhakars@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 31 Mar, 2024 1 commit
-
-
Paweł Gadziński authored
Llama tutorial fixes - all Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
Pawel Gadzinski <pgadzinski@nvidia.com>
-
- 20 Mar, 2024 1 commit
-
-
Sudhakar Singh authored
* tutorial and doc fixes Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * remove extra code Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * fix typos Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> --------- Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com>
-
- 01 Mar, 2024 1 commit
-
-
Sudhakar Singh authored
-
- 08 Feb, 2024 1 commit
-
-
Quentin Anthony authored
Signed-off-by: Quentin Anthony <qganthony@yahoo.com>
-
- 19 Jan, 2024 1 commit
-
-
hugo-syn authored
Signed-off-by: hugo-syn <hugo.vincent@synacktiv.com>
-
- 03 Jan, 2024 1 commit
-
-
Przemyslaw Tredak authored
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
-
- 06 Dec, 2023 1 commit
-
-
Santosh Bhavani authored
* Add H200 perf non-alpha image Signed-off-by:
Santosh Bhavani <santosh@semantic.md> * Update README.rst - non-transparent H200 plot Signed-off-by:
Santosh Bhavani <santosh@semantic.md> --------- Signed-off-by:
Santosh Bhavani <santosh@semantic.md>
-