1. 23 Jun, 2023 1 commit
  2. 22 Jun, 2023 2 commits
  3. 21 Jun, 2023 1 commit
  4. 20 Jun, 2023 3 commits
  5. 16 Jun, 2023 2 commits
  6. 13 Jun, 2023 2 commits
  7. 12 Jun, 2023 1 commit
  8. 11 Jun, 2023 1 commit
  9. 08 Jun, 2023 1 commit
  10. 07 Jun, 2023 2 commits
  11. 06 Jun, 2023 3 commits
  12. 02 Jun, 2023 1 commit
  13. 01 Jun, 2023 1 commit
  14. 31 May, 2023 1 commit
  15. 26 May, 2023 1 commit
  16. 25 May, 2023 2 commits
  17. 23 May, 2023 1 commit
  18. 22 May, 2023 1 commit
  19. 19 May, 2023 1 commit
  20. 16 May, 2023 2 commits
  21. 13 May, 2023 1 commit
    • Add env. var. for efficient text-generation in inference (#214) · 68f60b89
      Neta Zmora authored
      
      
      * Dynamically-generated causal attention mask (for ONNX export)
      
      TE's default causal mask is square (seq_len, seq_len) and is
      dynamically allocated for different sequence sizes. Dynamic
      allocation and dictionary lookups are not supported by ONNX, and
      the GPT generative phase uses rectangular masks.
      
      This commit forces softmax to use `forward_torch_softmax` and
      to dynamically generate an attention mask when exporting to ONNX.
      The mask is generated without conditional control-flow, by generating
      a (k_seq_len, k_seq_len) mask and slicing it to (q_seq_len, k_seq_len).
      
      An alternate implementation is to pre-allocate a mask of shape
      (max_seq, max_seq) and to slice that. This solution is more performant
      at the expense of space, but the problem is that TE doesn't have a
      concept of max_seq.
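      
      A minimal sketch of the slicing approach (illustrative only, not TE's
      exact code; the function name is hypothetical):
      
      ```python
      import torch

      def onnx_causal_mask(q_seq_len: int, k_seq_len: int) -> torch.Tensor:
          # Build a square (k_seq_len, k_seq_len) mask with True above the
          # diagonal (future positions), then keep only the last q_seq_len
          # rows. The slice replaces the if/else that choosing between a
          # square and a rectangular mask would otherwise require, keeping
          # the traced ONNX graph free of data-dependent control flow.
          mask = torch.triu(torch.ones(k_seq_len, k_seq_len), diagonal=1).bool()
          return mask[k_seq_len - q_seq_len : k_seq_len, :]
      ```
      
      In the context phase q_seq_len == k_seq_len and the slice is the full
      square mask; in the generation phase q_seq_len == 1 and it is the last
      row.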
      
      * Add to test_export_softmax a test for te.softmax.FusedScaleMaskSoftmax.
      * Add test_softmax_mask_fn to test that TE's default attention mask and
      the new ONNX-compatible mask produce the same behavior.
      * Add test_export_gpt_generation to test that the ONNX model can correctly
      handle inputs with different shapes and that the attention mask is adjusted
      on-the-fly to different sequence lengths.
      
      Misc:
      * Add a PRNG seeding fixture for more stability in tests.
      * Add dynamic shapes for ONNX input/output tests (see the export sketch
      below).
      * Allow validate_result to compare ORT output to pre-computed TE outputs.
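      
      A hedged example of exporting with dynamic input/output shapes, in the
      spirit of these tests (the tiny model and axis names are illustrative,
      not TE's actual test code):
      
      ```python
      import torch

      model = torch.nn.Linear(16, 16)
      inp = torch.randn(2, 8, 16)  # (batch, seq_len, hidden)

      torch.onnx.export(
          model,
          (inp,),
          "model.onnx",
          input_names=["input"],
          output_names=["output"],
          # Mark batch and sequence dims as symbolic so ONNX Runtime accepts
          # inputs whose shapes differ from the export-time example.
          dynamic_axes={
              "input": {0: "batch", 1: "seq_len"},
              "output": {0: "batch", 1: "seq_len"},
          },
      )
      ```
      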
      Signed-off-by: Neta Zmora <nzmora@nvidia.com>
      
      * Add NVTE_ONNX_KVCACHE_MAX_SEQ_LEN for efficient text-generation in inference
      
      * Introduce an environment variable (NVTE_ONNX_KVCACHE_MAX_SEQ_LEN) to set the maximum sequence length.
      In ONNX inference with KV-cache optimizations for GPT text generation, the attention-mask shape can be square (context phase) or rectangular (generation phase).
      When exporting to ONNX with this variable set, TE preallocates an upper-triangular (k=1) matrix of the prescribed size and dynamically slices the mask to the required shape (a sketch follows).
      TE models can still be exported to ONNX when NVTE_ONNX_KVCACHE_MAX_SEQ_LEN is not configured, but the attention mask is then always square and not suitable for efficient text generation.
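      
      A hedged illustration of this mechanism. Only the environment-variable
      name comes from this commit; the helper name and structure below are
      assumptions:
      
      ```python
      import os
      import torch

      # Read the maximum sequence length at export time (assumed default: unset).
      _MAX_SEQ_LEN = int(os.getenv("NVTE_ONNX_KVCACHE_MAX_SEQ_LEN", "0"))

      # Preallocate one (max, max) upper-triangular (k=1) mask so the traced
      # ONNX graph only records a cheap, shape-dependent slice.
      _PREALLOC = (
          torch.triu(torch.ones(_MAX_SEQ_LEN, _MAX_SEQ_LEN), diagonal=1).bool()
          if _MAX_SEQ_LEN > 0
          else None
      )

      def kvcache_attention_mask(q_seq_len: int, k_seq_len: int) -> torch.Tensor:
          # Square (q == k) in the context phase; rectangular (q == 1) in the
          # generation phase, where only the newest query attends over the
          # KV cache.
          assert _PREALLOC is not None and k_seq_len <= _MAX_SEQ_LEN
          return _PREALLOC[k_seq_len - q_seq_len : k_seq_len, :k_seq_len]
      ```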
      
      * Work around a torch.onnx.export bug that incorrectly folds
      layer_norm(data, scale=add(gamma, 1)) to layer_norm(data, scale=gamma)
      when LayerNorm is used with zero-centered gamma (see the sketch below).
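      
      A sketch of the pattern that triggers the mis-folding (illustrative
      only, not TE's implementation):
      
      ```python
      import torch
      import torch.nn.functional as F

      def ln_zero_centered_gamma(x, gamma, beta, eps=1e-5):
          # With zero-centered gamma the stored weight is (gamma - 1), so the
          # effective LayerNorm scale is add(gamma, 1). The add below is the
          # op torch.onnx.export incorrectly folded away, exporting
          # weight=gamma instead of weight=gamma + 1.
          return F.layer_norm(x, x.shape[-1:], weight=gamma + 1, bias=beta, eps=eps)
      ```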
      
      * ONNX export tests
        * Add a fixture (seed_default_rng) to seed the PRNG
        * Add a fixture (set_max_seq_len) to set the max sequence length when exporting to ONNX for GPT text generation (a sketch of both fixtures follows)
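      
      A hedged sketch of what such fixtures could look like; the fixture
      names come from this commit, the bodies and the chosen values are
      assumptions:
      
      ```python
      import pytest
      import torch

      @pytest.fixture
      def seed_default_rng():
          # Seed the default PRNG so test outputs are reproducible across runs.
          torch.manual_seed(1234)

      @pytest.fixture
      def set_max_seq_len(monkeypatch):
          # Set the env var so the export path preallocates the KV-cache mask.
          monkeypatch.setenv("NVTE_ONNX_KVCACHE_MAX_SEQ_LEN", "128")
      ```
      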
      Signed-off-by: Neta Zmora <nzmora@nvidia.com>
      
      * Fix linting errors
      Signed-off-by: Neta Zmora <nzmora@nvidia.com>
      
      * Remove immutable default values from a couple of function signatures
      Signed-off-by: Neta Zmora <nzmora@nvidia.com>
      
      * Add @skip_FP8 to test_export_gpt_generation
      Signed-off-by: Neta Zmora <nzmora@nvidia.com>
      
      * Update transformer_engine/pytorch/softmax.py
      Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
      
      * Fix CI error for softmax export
      Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
      
      * Lint
      Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
      
      ---------
      Signed-off-by: Neta Zmora <nzmora@nvidia.com>
      Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
      Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
  22. 12 May, 2023 2 commits
  23. 10 May, 2023 2 commits
  24. 09 May, 2023 5 commits