Unverified Commit 2a413829 authored by Xiaoyu Zhang, committed by GitHub

Add triton version as a fused_moe_triton config search key to avoid performance decrease across different Triton versions (#5955)
parent d5c097a2
@@ -3,6 +3,9 @@ For different settings of
- E (number of experts)
- N (intermediate size)
- device_name (torch.cuda.get_device_name())
- dtype: The data type used by the fused MoE kernel for computation. Supported types include fp8_w8a8, int8_w8a8, int8_w8a16, int4_w4a16, etc. This determines the precision and quantization scheme for both weights and activations.
- block_shape: The block quantization shape introduced with the DeepSeek V3/R1 models. This parameter defines the granularity of block-wise quantization, typically specified as `[block_n, block_k]`, where `block_n` and `block_k` are the block dimensions. For example, DeepSeek V3 commonly uses a `[128, 128]` block shape for efficient block-wise FP8 quantization.
the JSON file contains a mapping from M (batch size) to the chosen configuration.
The example configurations provided are for the Mixtral model with TP2 on H100.
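
As a rough, hedged sketch of how these search keys and the per-M mapping fit together: the variable names, the device-name fallback, and all kernel parameter values below are illustrative placeholders, not taken from the repository, and the exact filename layout used by fused_moe_triton (including where the Triton version is added by this change) may differ.

```python
import json

import torch
import triton

# Illustrative search keys described above; names and values are placeholders.
E, N = 8, 14336                      # number of experts, intermediate size
dtype = "fp8_w8a8"                   # quantization scheme for weights/activations
block_shape = [128, 128]             # [block_n, block_k], e.g. DeepSeek V3 FP8
device_name = (
    torch.cuda.get_device_name().replace(" ", "_")
    if torch.cuda.is_available()
    else "NVIDIA_H100_80GB_HBM3"     # placeholder when no GPU is present
)
triton_version = triton.__version__  # the new search key added by this commit

# Each config JSON maps M (batch size) to a chosen Triton kernel configuration.
# Parameter names follow the common fused-MoE tuning convention; the values
# below are made up, not tuned results.
example_config = {
    "1": {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 128,
          "GROUP_SIZE_M": 1, "num_warps": 4, "num_stages": 3},
    "1024": {"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 256, "BLOCK_SIZE_K": 128,
             "GROUP_SIZE_M": 16, "num_warps": 8, "num_stages": 4},
}
print(json.dumps(example_config, indent=2))
```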