Commit 2a413829 authored by Xiaoyu Zhang, committed by GitHub

Add triton version as a fused_moe_triton config search key to avoid performance decrease across different Triton versions (#5955)
parent d5c097a2
@@ -3,6 +3,9 @@ For different settings of
- E (number of experts)
- N (intermediate size)
- device_name (torch.cuda.get_device_name())
- dtype: The data type used by the fused MoE kernel for computation. Supported types include fp8_w8a8, int8_w8a8, int8_w8a16, int4_w4a16, etc. This determines the precision and quantization scheme for both weights and activations.
- block_shape: The block quantization shape introduced with the DeepSeek V3/R1 models. This parameter defines the granularity of block-wise quantization, typically specified as `[block_n, block_k]`, where `block_n` and `block_k` are the block dimensions. For example, DeepSeek V3 commonly uses a `[128, 128]` block shape for efficient block-wise FP8 quantization.
the JSON file contains a mapping from M (batch size) to the chosen configuration.
The example configurations provided are for the Mixtral model for TP2 on H100.
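As an illustration of how these fields (plus the Triton version added by this commit) could be combined into a config search key, here is a minimal Python sketch. The helper name `get_config_file_name` and the exact filename format are assumptions for this example, not the repository's actual implementation:

```python
# Hedged sketch: build a fused MoE config filename from the search keys
# described above. Names and formatting here are illustrative only.
from typing import List, Optional

import triton  # provides triton.__version__


def get_config_file_name(
    E: int,                                   # number of experts
    N: int,                                   # intermediate size
    device_name: str,                         # e.g. torch.cuda.get_device_name()
    dtype: Optional[str] = None,              # e.g. "fp8_w8a8", "int8_w8a16"
    block_shape: Optional[List[int]] = None,  # e.g. [128, 128] for DeepSeek V3
) -> str:
    """Assemble a JSON config filename keyed on the fields above plus the
    installed Triton version, so configs tuned under one Triton release are
    not silently reused under another (the motivation for this commit)."""
    parts = [f"E={E}", f"N={N}", f"device_name={device_name.replace(' ', '_')}"]
    if dtype is not None:
        parts.append(f"dtype={dtype}")
    if block_shape is not None:
        parts.append(f"block_shape={block_shape}")
    # Include the Triton version as part of the search key: the best tile
    # sizes and stage counts can differ across Triton releases.
    parts.append(f"triton_version={triton.__version__}")
    return ",".join(parts) + ".json"


# Hypothetical usage:
# get_config_file_name(8, 14336, "NVIDIA H100 80GB HBM3",
#                      dtype="fp8_w8a8", block_shape=[128, 128])
```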
...