    Add env. var. for efficient text-generation in inference (#214)

    * Dynamically generated causal attention mask (for ONNX export)
    
    TE's default causal mask is square, (seq_len, seq_len), and is
    dynamically allocated for different sequence lengths. Dynamic
    allocation and dictionary lookups are not supported by ONNX. The
    GPT generative phase uses rectangular masks.
    
    This commit forces softmax to use `forward_torch_softmax` and
    to dynamically generate an attention mask when exporting to ONNX.
    The mask is generated without conditional control-flow, by generating
    a (k_seq_len, k_seq_len) mask and slicing it to (q_seq_len, k_seq_len).
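
    A minimal sketch of the idea (the helper name and the choice of which
    rows to keep are illustrative assumptions, not the actual TE code):

    ```python
    import torch

    def onnx_friendly_causal_mask(q_seq_len: int, k_seq_len: int) -> torch.Tensor:
        """Causal mask built without data-dependent control flow.

        A full (k_seq_len, k_seq_len) upper-triangular mask is generated
        and then sliced to (q_seq_len, k_seq_len), so the traced graph
        contains only tensor ops -- no If/Loop nodes.
        """
        # True marks positions to mask out (future tokens).
        mask = torch.triu(
            torch.ones(k_seq_len, k_seq_len, dtype=torch.bool), diagonal=1
        )
        # Assumption: the queries are the newest tokens, so keep the last
        # q_seq_len rows; each query then attends to the whole KV history.
        return mask[k_seq_len - q_seq_len : k_seq_len, :k_seq_len]
    ```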
    
    An alternate implementation is to pre-allocate a mask of shape
    (max_seq, max_seq) and to slice that. This solution is more performant
    at the expense of space, but the problem is that TE doesn't have a
    concept of max_seq.
    
    * Add to test_export_softmax a test for te.softmax.FusedScaleMaskSoftmax.
    * Add test_softmax_mask_fn to test that TE's default attention mask and
    the new ONNX-compatible mask produce the same behavior.
    * Add test_export_gpt_generation to test that the ONNX model can correctly
    handle inputs with different shapes and that the attention mask is adjusted
    on-the-fly to different sequence lengths.
    
    Misc:
    * Add a PRNG seeding fixture for more stability in tests.
    * Add dynamic shapes for ONNX input/output tests (see the export sketch below).
    * Allow validate_result to compare ORT output to pre-computed TE outputs.
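
    For reference, declaring dynamic shapes in an ONNX export test looks
    roughly like this (the model and tensor names here are placeholders,
    not TE code):

    ```python
    import torch

    model = torch.nn.Linear(16, 16)
    inp = torch.randn(4, 8, 16)  # (batch, seq, hidden)

    torch.onnx.export(
        model,
        (inp,),
        "model.onnx",
        input_names=["input"],
        output_names=["output"],
        # Mark batch and sequence dims as dynamic so ONNX Runtime accepts
        # inputs whose shapes differ from the ones used during tracing.
        dynamic_axes={
            "input": {0: "batch", 1: "seq"},
            "output": {0: "batch", 1: "seq"},
        },
    )
    ```
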
    Signed-off-by: Neta Zmora <nzmora@nvidia.com>
    
    * Add NVTE_ONNX_KVCACHE_MAX_SEQ_LEN for efficient text-generation in inference
    
    * Introduce an environment variable (NVTE_ONNX_KVCACHE_MAX_SEQ_LEN) to set the maximum sequence length.
    In ONNX inference with KV-cache optimizations for GPT text generation, the attention mask shape can be square (context phase) or rectangular (generation phase).
    When exporting to ONNX with this variable set, TE preallocates an upper-triangular (k=1) matrix of the size prescribed by the variable and dynamically slices the mask to the required shape, as sketched below.
    TE models can still be exported to ONNX when NVTE_ONNX_KVCACHE_MAX_SEQ_LEN is not configured, but the attention mask is then always square and not suitable for efficient text generation.
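
    A sketch of the preallocation scheme (NVTE_ONNX_KVCACHE_MAX_SEQ_LEN is
    the real knob; the surrounding helper and the row-slicing choice are
    illustrative assumptions):

    ```python
    import os

    import torch

    # Read once at export time; unset means the square-mask fallback path.
    max_seq_len = int(os.getenv("NVTE_ONNX_KVCACHE_MAX_SEQ_LEN", "0"))

    if max_seq_len > 0:
        # Upper triangular with diagonal=1: True marks future (masked) positions.
        full_mask = torch.triu(
            torch.ones(max_seq_len, max_seq_len, dtype=torch.bool), diagonal=1
        )

    def sliced_causal_mask(q_seq_len: int, k_seq_len: int) -> torch.Tensor:
        # Context phase: q_seq_len == k_seq_len (square slice).
        # Generation phase: q_seq_len < k_seq_len (rectangular slice);
        # assuming the queries are the newest tokens, keep the last rows.
        return full_mask[k_seq_len - q_seq_len : k_seq_len, :k_seq_len]
    ```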
    
    * Work around a torch.onnx.export bug that incorrectly folds
    layer_norm(data, scale=add(gamma, 1)) to layer_norm(data, scale=gamma)
    when using LayerNorm with zero-centered gamma.
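
    The problematic pattern, roughly (a hedged sketch: ZeroCenteredLayerNorm
    is illustrative, and disabling constant folding is a generic mitigation,
    not necessarily the workaround this commit implements):

    ```python
    import torch
    import torch.nn.functional as F

    class ZeroCenteredLayerNorm(torch.nn.Module):
        """LayerNorm whose learnable scale is stored as (gamma - 1)."""

        def __init__(self, hidden: int, eps: float = 1e-5):
            super().__init__()
            self.weight = torch.nn.Parameter(torch.zeros(hidden))  # zero-centered
            self.bias = torch.nn.Parameter(torch.zeros(hidden))
            self.eps = eps

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # The "+ 1.0" is the add that the exporter's constant folding
            # erroneously drops, leaving scale=gamma in the ONNX graph.
            return F.layer_norm(
                x, x.shape[-1:], self.weight + 1.0, self.bias, self.eps
            )

    # Generic mitigation: skip constant folding during export.
    torch.onnx.export(
        ZeroCenteredLayerNorm(16),
        (torch.randn(2, 4, 16),),
        "ln.onnx",
        do_constant_folding=False,
    )
    ```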
    
    * ONNX export tests
      * Add a fixture (seed_default_rng) to seed the PRNG
      * Add a fixture (set_max_seq_len) to set the max sequence length when exporting to ONNX for GPT text generation (both fixtures are sketched below)
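
    A hedged sketch of what such fixtures might look like (the seed value
    and default length are arbitrary; the real fixtures live in the test
    suite):

    ```python
    import pytest
    import torch

    @pytest.fixture
    def seed_default_rng():
        # Seed the PRNG so tensor initializations are reproducible.
        torch.manual_seed(1234)

    @pytest.fixture
    def set_max_seq_len(monkeypatch):
        # Expose the KV-cache max sequence length to the ONNX export path.
        def _set(max_seq_len: int = 128):
            monkeypatch.setenv("NVTE_ONNX_KVCACHE_MAX_SEQ_LEN", str(max_seq_len))
        return _set
    ```
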
    Signed-off-by: Neta Zmora <nzmora@nvidia.com>
    
    * Fix linting errors
    Signed-off-by: Neta Zmora <nzmora@nvidia.com>
    
    * Remove mutable default values from a couple of function signatures
    Signed-off-by: Neta Zmora <nzmora@nvidia.com>
    
    * Add @skip_FP8 to test_export_gpt_generation
    Signed-off-by: Neta Zmora <nzmora@nvidia.com>
    
    * Update transformer_engine/pytorch/softmax.py
    Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
    
    * Fix CI error for softmax export
    Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
    
    * Lint
    Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
    
    ---------
    Signed-off-by: Neta Zmora <nzmora@nvidia.com>
    Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
    Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>