Commits · 855fa6530ea87b3c5833e4d4cb269ccf5bd1b8a3 · OpenDAS / TransformerEngine

29 May, 2025 1 commit

[JAX] Support SWA in CP Ring Attn THD striped sharding (#1810) · 855fa653

Hua Huang authored May 29, 2025



* Support SWA in CP Ring Attn THD striped sharding
Signed-off-by: Hua Huang <huah@nvidia.com>

* Add some comments; move check to _FusedAttnCPWithP2PHelper.check_supported()
Signed-off-by: Hua Huang <huah@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Remove unused check
Signed-off-by: Hua Huang <huah@nvidia.com>

---------
Signed-off-by: Hua Huang <huah@nvidia.com>

855fa653

22 May, 2025 1 commit
- [JAX] Fix incorrectly skipped test_quantize_dbias tests (#1808) · 0cd1cd8e
  jberchtold-nvidia authored May 22, 2025
```
Fix incorrectly skipped test_quantize_dbias tests
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
```
  0cd1cd8e
16 May, 2025 1 commit

[JAX] Support logical partitioning axes in TE Flax modules (#1772) · 27612051

jberchtold-nvidia authored May 16, 2025



* [JAX] Update flax module param initialization to support logical partitioning axes
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix ffn1 intermediate result being replicated
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Add documentation and assert when logical_axes=None
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix bias in LayerNormMLP flax module
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix layer tests to not use nn_partitioning and instead use nn.with_logical_axes
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

27612051

06 May, 2025 1 commit

[JAX] Fix failing L2 JAX unit tests (#1735) · fe31af80

jberchtold-nvidia authored May 06, 2025



* Fix test_custom_call_compute.py L2 tests
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix test_helper.py
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Address comments
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

fe31af80

01 May, 2025 1 commit

[JAX] Exclude GroupedGemm APIs for TE 2.3 (#1737) · 221fedc2

Phuong Nguyen authored Apr 30, 2025



* exclude GroupedGemm APIs
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

221fedc2

30 Apr, 2025 1 commit
- [JAX] Fix distributed Layernorm test failure (#1734) · dac098d8
  jberchtold-nvidia authored Apr 30, 2025
```
Fix distributed layernorm test failure
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
```
  dac098d8
29 Apr, 2025 1 commit

[JAX] Distributed Current Scaling (#1699) · 4ceb3d4c

jberchtold-nvidia authored Apr 28, 2025



* Update test_helper.py and add QuantizeConfig class for CurrentScaling
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* WIP distributed current scaling
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Distributed Current Scaling (debugging). Distributed implementation with replicated scale_inv works for layernorm_mlp but feels like a hack.

This has different per-device scale_inv values, but jax.debug.print only shows one of them. Since we're telling JAX/XLA that this scale is replicated, I think it assumes all the values are equal. However, it doesn't actually check this, so it seems we can get away with per-device scales for current scaling. I am not sure how stable this will be: it may randomly fail if we or the user change the partitioning at all, or if XLA decides to actually act on the assumption that all these scale_invs are the same.
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Implement distributed current scaling by computing a global amax and scale before quantization
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Add encoder and mnist tests for current scaling
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Add primitive prefix to shardy unique_vars to prevent factor conflicts when performing unfused primitives for current scaling
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Remove scale_shape primitive arg that is no longer used
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Format
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix expected result on multiprocessing encoder test
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Lint fix
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Update multiprocessing current scaling tolerances
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Uncomment test case that was disabled for testing
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Remove commented out debug line
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

4ceb3d4c
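
The final approach named in #1699, computing a global amax and scale before quantization, can be illustrated with a minimal plain-Python sketch. Devices are simulated as a list of shards; `FP8_E4M3_MAX`, `local_amax`, `global_scale`, and the shard layout are illustrative assumptions, not TransformerEngine APIs. In real JAX code the max reduction would be a cross-device collective.

```python
# Illustrative sketch only: simulated devices, not TransformerEngine code.
FP8_E4M3_MAX = 448.0  # largest finite value of the FP8 E4M3 (fn) format


def local_amax(shard):
    """Largest absolute value held by one simulated device."""
    return max(abs(x) for x in shard)


def global_scale(shards, fp8_max=FP8_E4M3_MAX):
    """Reduce per-device amax values to one global amax and derive one scale.

    Every device then quantizes with the same scale, so a scale_inv that is
    declared replicated really is identical on all devices.
    """
    amax = max(local_amax(s) for s in shards)  # stands in for an all-reduce max
    return fp8_max / amax if amax > 0.0 else 1.0


shards = [[0.5, -2.0], [7.0, 1.0]]  # two simulated devices
scale = global_scale(shards)
scaled = [[x * scale for x in s] for s in shards]
```

With per-device scales, the second shard would use 448 / 7 and the first 448 / 2; the global reduction removes that mismatch at the cost of one extra collective.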

22 Apr, 2025 1 commit

[JAX] JAX Current Scaling (#1647) · 9a819334

jberchtold-nvidia authored Apr 22, 2025



* [JAX-Q] Single GPU current scaling for JAX
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix scale check dtype for MXFP8 scales affecting tests using assert_bitwise_scaled_tensors
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Address comments
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Remove cast to fp32 for norm primitives now that zero-centered gamma dtype issue is fixed
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix lint issue
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Remove unnecessary cast to fp32
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

9a819334

21 Apr, 2025 1 commit

[JAX] WAR for CuDNN MXFP8 norm incorrect result (#1700) · a1c18bc8

jberchtold-nvidia authored Apr 21, 2025



Check CuDNN version and apply unfused norm if below a version with the fix
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

a1c18bc8

16 Apr, 2025 1 commit

Fix #1524 and other softmax mask functionality (#1681) · 0994fb48

Kshitij Lakhani authored Apr 15, 2025



* Add test cases for full coverage in jax/test_layer.py
- causal and window size None
- causal and window size default (-1,1)
- no_mask and window size default (-1,1)
- no_mask and window size default (2,2)
- padding and window size None
- padding_causal and window_size (2,2)
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Correct the condition where padding_causal_mask was being mapped to scaled upper triangle
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Fix Issue #1524
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Add a runner and test cases for jax.flax.module.Softmax class for fwd pass only
Segregate runner classes for Softmax module and softmax primitives
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Simplify logic when picking softmax primitives and softmax jax framework calls
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Simplify the logic for performing jax based softmax
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Code clean up
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add support table for mask, SWA and Softmax type. Code linting
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Explicit SWA conditions in comments. Fix typo
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Resolve typo to remove None in SWA comments section
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

0994fb48

14 Apr, 2025 1 commit

Add experimental Shardy support. (#1642) · 6117b20c

Johannes Reifferscheid authored Apr 14, 2025



* Add experimental Shardy support.

Production use is not yet recommended.

---------
Signed-off-by: Johannes Reifferscheid <jreiffers@nvidia.com>

6117b20c

09 Apr, 2025 1 commit

[JAX] Scaling Enum Abstracting (#1655) · 962d9c53

Phuong Nguyen authored Apr 09, 2025



* scaling enum abstract

* rm NVTE_ from ScalingMode names

* rework scaling mode enum in grouped gemm

* fix norm sharding

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

962d9c53

04 Apr, 2025 2 commits

[JAX] Flatten_axis for quantization and Sharding propagation fixes (#1644) · ff884e20

Phuong Nguyen authored Apr 04, 2025



* rename QuantizeAxis to QuantizeLayout, get_layout to get_data_layout, q_axis to q_layout

* add flatten_axis option

* added gated act to test encoder

* sharding constraint fixes

* fix padding when flattening first dim needs to be padded

* update test sizes so that padding is tested

* rm output sharding as it can be done in the flax module

* sharding scale_inv for mxfp8

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

ff884e20

[JAX-Q] Distributed MXFP8 flax layer tests (#1643) · be1f647c
jberchtold-nvidia authored Apr 04, 2025
```
MXFP8 flax layer tests
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
```
be1f647c

01 Apr, 2025 1 commit

[JAX] Refactor + MXFP8 + GroupedGEMM (#1627) · cf9a7c2f

Phuong Nguyen authored Mar 31, 2025



* refactor + mxfp8

* added grouped gemm

* rename linear to dense

* added cublas init phase for groupedGemm

* relax the tol of test encoder multiprocessing mxfp8 by 0.001
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: Hua Huang <huah@nvidia.com>
Co-authored-by: Jeremy Berchtold <jberchtold@nvidia.com>

cf9a7c2f

12 Mar, 2025 1 commit
- Remove xla_ignore_channel_id check and ignore Scan loop warning in un… (#1540) · ab4fd3cf
  Reese Wang authored Mar 12, 2025
```
Remove xla_ignore_channel_id check and ignore Scan loop warning in unit test
Signed-off-by: Reese Wang <rewang@nvidia.com>
```
  ab4fd3cf
05 Mar, 2025 1 commit

Fix installation from PyPI wheels after a source install (#1526) · a3e6ed80

Kirthi Shankar Sivamani authored Mar 05, 2025



* Fix wheel install after src install
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix JAX imports
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* switch order of dirs for finding so
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Use existing dir src build
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix lint
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

a3e6ed80

03 Mar, 2025 1 commit

[JAX] THD ring attention (#1454) · c5d6a069

Reese Wang authored Mar 03, 2025



* Support THD + ring attention for self attn
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Consolidate reorder strategy
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Fix dataclass frozen issue
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Remove redundant code
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Use AttnBiasType, AttnMaskType, QKVLayout in cpp_extension
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Fix lint
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Refine P2P helper check_supported
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Add segment_ids/pos check
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Fixup
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Add dual chunk swap example
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Align different reorder code structure
Signed-off-by: Reese Wang <rewang@nvidia.com>

---------
Signed-off-by: Reese Wang <rewang@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>

c5d6a069

18 Feb, 2025 1 commit
- [JAX] Flax with compute dtype inferred from input dtype. (#1485) · 6673f165
  Phuong Nguyen authored Feb 18, 2025
```
flax module with compute dtype inferred from the inputs
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
```
  6673f165
14 Feb, 2025 2 commits

[JAX] Fix issues when mask/sequence_descriptor is None (#1477) · dfbf4dde

Reese Wang authored Feb 15, 2025



Fix issues when mask/sequence_descriptor is None
Signed-off-by: Reese Wang <rewang@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>

dfbf4dde

[JAX] Fixes for CI failures with the latest JAX (#1469) · e19b8281

Phuong Nguyen authored Feb 14, 2025



* fixes L1 test

* fix test_multigpu_encoder

* fixes for other multi-encoder tests

* jax.extend.ffi to jax.ffi

* initialization with float32

* add init_dtype as an optional arg to all modules

* update use_scan query from xla flags

* relax threshold for test_encoder fp8

* relax the tols

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

e19b8281

07 Feb, 2025 1 commit
- Update main branch with TE 2.0 code, update version to 2.1.0.dev0 · 544dd14b
  Przemek Tredak authored Feb 07, 2025
```
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
```
  544dd14b
24 Jan, 2025 1 commit

[JAX] Support segment_ids/pos as FA inputs (#1406) · c2c3d540

Reese Wang authored Jan 24, 2025



* POC for segment_ids/segment_pos
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Change segment_pos position
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Use RemainingArgs to solve number of parameters mismatches
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Test mask_descriptor for accommodating different mask representations
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Fix bugs
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Use descriptor in bwd
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Primitives only accept pure jnp arrays
Signed-off-by: Reese Wang <rewang@nvidia.com>

* segment_ids/pos support POC
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Move seqlens/offsets generation to mask descriptor
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Rename MaskDescriptor to SequenceDescriptor
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Generalize get_seqlens_and_offsets
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Utilize sequence desc on FA bwd
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Migrate to new API
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Add docstrings
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Remove small inputs and test different input format
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Fix lint
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Fix seed shardings
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Optimize sequence converting overhead
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Optimize seq_offsets calculation
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Fix up
Signed-off-by: Reese Wang <rewang@nvidia.com>

* fix lint
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Fix conflicts
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Remove redundant line
Signed-off-by: Reese Wang <rewang@nvidia.com>

---------
Signed-off-by: Reese Wang <rewang@nvidia.com>

c2c3d540
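
As a rough illustration of what deriving sequence lengths and offsets from segment_ids involves, here is a minimal plain-Python sketch. It is not the `get_seqlens_and_offsets` implementation from the commit: the function name `seqlens_and_offsets`, the convention that 0 marks padding, and the assumption that each segment is a contiguous run of equal non-zero ids are all hypothetical simplifications.

```python
def seqlens_and_offsets(segment_ids):
    """Derive per-segment lengths and start offsets from one row of segment_ids.

    Assumes 0 marks padding and each segment occupies a contiguous run of
    equal non-zero ids (illustrative convention, not confirmed by the source).
    """
    seqlens, offsets = [], []
    prev = 0
    for pos, seg in enumerate(segment_ids):
        if seg != 0 and seg != prev:
            # A new segment starts here: record its offset and open a counter.
            offsets.append(pos)
            seqlens.append(0)
        if seg != 0:
            seqlens[-1] += 1
        prev = seg
    return seqlens, offsets


lens, offs = seqlens_and_offsets([1, 1, 1, 2, 2, 0, 3, 3, 3, 3, 0, 0])
```

Here `lens` holds the token count of each packed sequence and `offs` its starting position, which is the kind of information a THD-layout attention kernel needs alongside the packed tensor.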

17 Jan, 2025 1 commit

[JAX] Consolidate the distributed fused attention test code (#1405) · 6e848924

Michael Goldfarb authored Jan 16, 2025



Consolidate the distributed fused attention tests to shared input generation and execution logic.
Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>

6e848924

08 Jan, 2025 2 commits

[JAX] Correct fused attention output after each step of ring attention (#1393) · a4cb1d17

Michael Goldfarb authored Jan 08, 2025



Correct fused attention output after each step to reduce intermediate memory use.
Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>

a4cb1d17
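
The commit message does not spell out the correction itself. Assuming it is the usual log-sum-exp rescaling used to combine partial attention results, the idea can be sketched in plain Python as follows; `attend` and `merge` are hypothetical helper names, and a single query vector stands in for a full tensor. Folding each step's result into a running output means only one partial output and one log-sum-exp need to be kept, rather than every step's results.

```python
import math


def attend(q, keys, values):
    """Single-query attention over one KV chunk; returns (output, log-sum-exp)."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, m + math.log(total)


def merge(out_a, lse_a, out_b, lse_b):
    """Fold a new partial result into the running one using log-sum-exp weights."""
    m = max(lse_a, lse_b)
    lse = m + math.log(math.exp(lse_a - m) + math.exp(lse_b - m))
    wa, wb = math.exp(lse_a - lse), math.exp(lse_b - lse)
    return [wa * a + wb * b for a, b in zip(out_a, out_b)], lse


q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.0, -1.0], [0.3, 0.9], [-0.5, 0.4]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]

# Process the KV sequence in two ring steps, correcting the output after each.
out, lse = attend(q, keys[:2], values[:2])
step_out, step_lse = attend(q, keys[2:], values[2:])
out, lse = merge(out, lse, step_out, step_lse)

# Attention over the full KV sequence in one pass, for comparison.
reference, reference_lse = attend(q, keys, values)
```

The merged `out` matches `reference` up to floating-point rounding, which is what makes the step-by-step correction safe.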

[JAX] Add THD + SWA unit tests (#1390) · b898cbe1

Reese Wang authored Jan 08, 2025



* Fix SWA mask for THD and forcing seqlen_kv >= seqlen_q for SWA
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Generalize sliding window mask
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Fix pylint
Signed-off-by: Reese Wang <rewang@nvidia.com>

---------
Signed-off-by: Reese Wang <rewang@nvidia.com>

b898cbe1
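
For readers unfamiliar with the `(left, right)` window convention used by the SWA tests, a minimal plain-Python sketch of a sliding window mask follows. The function name `sliding_window_mask` and the top-left alignment of query and key positions are assumptions made for illustration; the actual reference code in the tests may align positions differently when seqlen_kv differs from seqlen_q.

```python
def sliding_window_mask(seqlen_q, seqlen_kv, window_size):
    """Return mask[i][j] == True where query i may attend to key j.

    window_size is (left, right); -1 on a side means unbounded on that side.
    Positions are aligned top-left here, an assumption made for simplicity.
    """
    left, right = window_size
    mask = []
    for i in range(seqlen_q):
        row = []
        for j in range(seqlen_kv):
            ok_left = left < 0 or j >= i - left
            ok_right = right < 0 or j <= i + right
            row.append(ok_left and ok_right)
        mask.append(row)
    return mask


causal = sliding_window_mask(4, 4, (-1, 0))  # unbounded left, no lookahead
local = sliding_window_mask(4, 4, (1, 0))    # one token of left context
```

Under this convention `(-1, -1)` allows every position and `(-1, 0)` reduces to an ordinary causal mask.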

02 Jan, 2025 1 commit
- Update copyright to include 2025 (#1388) · c9ea6be9
  Kirthi Shankar Sivamani authored Jan 02, 2025
```
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
```
  c9ea6be9
20 Dec, 2024 1 commit

[common/PyTorch] Add cuDNN SWA (left, 0) + padding + bottom right causal (#1378) · 838345eb

Charlene Yang authored Dec 19, 2024



* add swa (left,0) + padding + brcm support
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* final fixes
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* upgrade to FE 1.9-rc
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix jax tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* skip thd + CP + fused attn tests for cuDNN 9.6+ due to different stats shapes
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

838345eb

17 Dec, 2024 1 commit

[JAX] Fused attention unit tests fixes and refinements (#1352) · 7f5c784e

Reese Wang authored Dec 17, 2024



* Add util functions to attn_mask_type
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Add util functions to qkv_layout
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Fix THD cross reference code
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Remove explicit segment_pad, encoding it to segment_ids
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Add jax.jit, replace _token with segment_ids, rename bias shape enum
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Add comment for make_mask
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Clean code
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Add doc strings for the added functions
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Remove cache for fa deterministic which causes UT failures
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Rename fixture to avoid conflict
Signed-off-by: Reese Wang <rewang@nvidia.com>

---------
Signed-off-by: Reese Wang <rewang@nvidia.com>

7f5c784e

04 Dec, 2024 1 commit
- [JAX] Scale sequence length in CP tests to avoid tiny sizes. (#1347) · d3cbccdf
  Michael Goldfarb authored Dec 04, 2024
```
Scale sequence length in CP tests to avoid tiny sizes.
Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>
```
  d3cbccdf
11 Nov, 2024 1 commit

[JAX] Support Ring Attention (Context Parallelism) (#1059) · bfddb483

Ming-Xu Huang authored Nov 11, 2024



* Implement ring attention primitive for JAX.
Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>
Signed-off-by: Ming Huang <mingh@nvidia.com>

---------
Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>
Signed-off-by: Ming Huang <mingh@nvidia.com>
Co-authored-by: Michael Goldfarb <mgoldfarb@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

bfddb483

06 Nov, 2024 1 commit

[TE/JAX] XLA FFI calls for three cast transpose functions (#1310) · 4d65073f

Hua Huang authored Nov 06, 2024



* FFI for some transpose & activation functions
Signed-off-by: Hua Huang <huah@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Remove comments in transformer_engine/jax/csrc/extensions/activation.cpp
Co-authored-by: Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com>
Signed-off-by: Hua Huang <huangh1994@outlook.com>

---------
Signed-off-by: Hua Huang <huah@nvidia.com>
Signed-off-by: Hua Huang <huangh1994@outlook.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com>

4d65073f

04 Nov, 2024 1 commit

[JAX] Expose context parallel params to jax DPA api (#1292) · d7256866

Md Fahim Faysal Khan authored Nov 04, 2024



Exposed context parallel params to DPA api
Signed-off-by: Md Fahim Faysal Khan <mdfahimfaysa@nvidia.com>
Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>

---------
Signed-off-by: Md Fahim Faysal Khan <mdfahimfaysa@nvidia.com>
Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>
Co-authored-by: Michael Goldfarb <mgoldfarb@nvidia.com>

d7256866

24 Oct, 2024 2 commits

[JAX] XLA Custom Calls with FFI for FusedAttnFwd, Quantize, Transpose,... · 18c2234c

Hua Huang authored Oct 24, 2024


[JAX] XLA Custom Calls with FFI for FusedAttnFwd, Quantize, Transpose, ActLuFP8, LayerNormForwardFP8FFI, and LayerNormBackwardFFI (#1263)

* Add TransposeFFI, test passed
Signed-off-by: Hua Huang <huah@nvidia.com>

* Add ActLuFP8FFI; fix TransposeFFI
Signed-off-by: Hua Huang <huah@nvidia.com>

* Add QuantizeFFI
Signed-off-by: Hua Huang <huah@nvidia.com>

* Add FusedAttnForwardFFI and some unit tests
Signed-off-by: Hua Huang <huah@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Minor fix
Signed-off-by: Hua Huang <huah@nvidia.com>

* Add LayerNormForwardFP8FFI & LayerNormBackwardFFI
Signed-off-by: Hua Huang <huah@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Revise FusedAttnForwardFFI()
Signed-off-by: Hua Huang <huah@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add FFI_CudaGraph_Traits

All tests passed, ready for merge
Signed-off-by: Hua Huang <huah@nvidia.com>

* Bug fix for FFI data type mismatch

Also add a safeguard on the entrance to FFI function
Signed-off-by: Hua Huang <huah@nvidia.com>

---------
Signed-off-by: Hua Huang <huah@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

18c2234c

[JAX] Fix correctness of JAX fused attention with CP and improve numerics... · 20c75295

Michael Goldfarb authored Oct 24, 2024


[JAX] Fix correctness of JAX fused attention with CP and improve numerics check in unit tests (#1282)

Fix correctness of JAX fused attention with CP.
Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>

20c75295

22 Oct, 2024 1 commit

Add THD + GQA support (#1260) · d9b4bfb5

Reese Wang authored Oct 23, 2024



Add THD + GQA support for cuDNN >= 9.6
Signed-off-by: Reese Wang <rewang@nvidia.com>

d9b4bfb5

15 Oct, 2024 1 commit
- Check for backend support in Jax context parallel fused attention test (#1227) · 20c55e46
  Michael Goldfarb authored Oct 15, 2024
```
Update test to check support for context parallel attention.
Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>
```
  20c55e46
10 Oct, 2024 1 commit

[JAX] Expose sliding window attn to TE-JAX API (#1205) · 85e60e64

Hua Huang authored Oct 10, 2024



* Expose JAX sliding window attn API
Signed-off-by: Hua Huang <huah@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* No SWA in context parallel; fix RNG seed in test
Signed-off-by: Hua Huang <huah@nvidia.com>

* Handle SWA API discrepancy in cuDNN and Python
Signed-off-by: Hua Huang <huah@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add SWA API for flax, all tests passed

Will update tests/jax/test_praxis_layers.py next
Signed-off-by: Hua Huang <huah@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Update test_praxis_layers.py for SWA, test passed
Signed-off-by: Hua Huang <huah@nvidia.com>

* Use tuple window_size; update for PR #1212
Signed-off-by: Hua Huang <huah@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add and adjust some pytest.skip
Signed-off-by: Hua Huang <huah@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Revised following Reese Wang's comments

Still need further debugging:
FAILED test_fused_attn.py::TestFusedAttn::test_backward[NO_SWA-DROP_0.0-4-128-256-16-16-64-BF16-CROSS-KV_PACKED-NO_MASK-NO_BIAS] - AssertionError:
FAILED test_fused_attn.py::TestFusedAttn::test_backward[NO_SWA-DROP_0.0-4-128-256-16-16-64-BF16-CROSS-KV_PACKED-NO_MASK-POST_SCALE_BIAS-1HSS] - AssertionError:
FAILED test_fused_attn.py::TestFusedAttn::test_backward[NO_SWA-DROP_0.0-4-128-256-16-16-64-BF16-CROSS-SEPARATE-NO_MASK-NO_BIAS] - AssertionError:
FAILED test_fused_attn.py::TestFusedAttn::test_backward[NO_SWA-DROP_0.0-4-128-256-16-16-64-BF16-CROSS-SEPARATE-NO_MASK-POST_SCALE_BIAS-1HSS] - AssertionError:

These errors do not exist in the previous commit
Signed-off-by: Hua Huang <huah@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix no-SWA test case errors in previous commit
Signed-off-by: Hua Huang <huah@nvidia.com>

* Add Padding mask w/ sliding windows sanity tests
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Use float32 for the reference code softmax calculation
Signed-off-by: Reese Wang <rewang@nvidia.com>

---------
Signed-off-by: Hua Huang <huah@nvidia.com>
Signed-off-by: Reese Wang <rewang@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Reese Wang <rewang@nvidia.com>

85e60e64

17 Sep, 2024 1 commit

[JAX] Context Parallel Attention with All-Gather (#1106) · 9101a78f

Michael Goldfarb authored Sep 17, 2024



Implementation of context parallel fused attention using all-gather.
Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>

9101a78f

16 Sep, 2024 1 commit
- [JAX] Fix unit tests to work around cuDNN 9.4 regression of 0 length sequences (#1179) · df699655
  Michael Goldfarb authored Sep 16, 2024
```
Modify unit tests to work around cuDNN 9.4 regression.
Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>
```
  df699655