**Note: The following focuses on sharing the differences between MX-Formats quantization and Per-Row/Per-Column quantization, as well as the layout requirements for compatibility with Cutlass Block Scaled GEMMs.**
### Data Formats and Quantization Factors
Target data format reference: [MX-Formats](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf). Note that we do not need to pack raw data and scale factors together here.
### Data Formats and Quantization Factors
Target data format reference: [MX-Formats](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf). Note that we do not need to pack raw data and scale factors together here.
Source data format: fp16/bf16
Target data format: mxfp4/6/8
Quantization factor data format: E8M0, *Per-Row/Per-Column quantization typically stores quantization factors in fp32, whereas E8M0 has the same numerical range as fp32. After rounding, the quantization factors can be stored directly, though the loss of mantissa bits may affect precision.*
Quantization granularity: \[1X32\]
Quantization dimension: Following Cutlass GEMM conventions, where M, N, K represent the three dimensions of matrix multiplication, we should quantize along K dimension.
Source data format: fp16/bf16
Target data format: mxfp4/6/8
Quantization factor data format: E8M0, *Per-Row/Per-Column quantization typically stores quantization factors in fp32, whereas E8M0 has the same numerical range as fp32. After rounding, the quantization factors can be stored directly, though the loss of mantissa bits may affect precision.*
Quantization granularity: \[1X32\]
Quantization dimension: Following Cutlass GEMM conventions, where M, N, K represent the three dimensions of matrix multiplication, we should quantize along K dimension.
### Rounding and Clamp
Unlike software emulation, CUDA can efficiently handle complex rounding and clamping operations using PTX or built-in functions.
For example, `cvt.rn.satfinite.e2m1x2.f32` can convert two fp32 inputs into two fp4 outputs.
Rounding mode: `rn` (round-to-nearest-even)
Clamp mode: `satfinite` (clamped to the maximum finite value within the target range, excluding infinities and NaN)
For more data types and modes, refer to: [PTX cvt Instructions](https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cvt)
### Rounding and Clamp
Unlike software emulation, CUDA can efficiently handle complex rounding and clamping operations using PTX or built-in functions.
For example, `cvt.rn.satfinite.e2m1x2.f32` can convert two fp32 inputs into two fp4 outputs.
Rounding mode: `rn` (round-to-nearest-even)
Clamp mode: `satfinite` (clamped to the maximum finite value within the target range, excluding infinities and NaN)
For more data types and modes, refer to: [PTX cvt Instructions](https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cvt)
### Data Layout and Quantization Factor Layout
**Data Layout**
- mxfp4 requires packing two values into a uint8.
- mxfp6 requires packing every four values into three uint8s. For the format, refer to: [mxfp6 cutlass mm format packing](https://github.com/ModelTC/LightX2V/blob/main/lightx2v_kernel/csrc/gemm/mxfp6_quant_kernels_sm120.cu#L74).
### Data Layout and Quantization Factor Layout
**Data Layout**
- mxfp4 requires packing two values into a uint8.
- mxfp6 requires packing every four values into three uint8s. For the format, refer to: [mxfp6 cutlass mm format packing](https://github.com/ModelTC/LightX2V/blob/main/lightx2v_kernel/csrc/gemm/mxfp6_quant_kernels_sm120.cu#L74).
**Quantization Factor Layout**
Cutlass Block Scaled GEMMs impose special swizzle requirements on quantization factor layouts to optimize matrix operations.
After understanding the above, the calculation of the target data and quantization factor values can refer to [nvfp4 Quantization Basics](https://github.com/theNiemand/lightx2v/blob/main/lightx2v_kernel/docs/zh_CN/nvfp4%E9%87%8F%E5%8C%96%E5%9F%BA%E7%A1%80.md). Note that MX-Formats do not require quantizing the scale itself.
\ No newline at end of file
### Quantization Method
After understanding the above, the calculation of the target data and quantization factor values can refer to [nvfp4 Quantization Basics](https://github.com/theNiemand/lightx2v/blob/main/lightx2v_kernel/docs/zh_CN/nvfp4%E9%87%8F%E5%8C%96%E5%9F%BA%E7%A1%80.md). Note that MX-Formats do not require quantizing the scale itself.
- mxfp6需要每4个打包为3个uint8,格式参考:[mxfp6 cutlass mm 格式打包](https://github.com/ModelTC/LightX2V/blob/main/lightx2v_kernel/csrc/gemm/mxfp6_quant_kernels_sm120.cu#L74)