"scripts/convert_original_musicldm_to_diffusers.py" did not exist on "039958eae55ff0700cfb42a7e72739575ab341f1"
Commit 6a09ef10 authored by helloyongyang's avatar helloyongyang
Browse files

fix ci

parent 58c321cc
...@@ -2,30 +2,30 @@ ...@@ -2,30 +2,30 @@
**Note: The following focuses on the differences between MX-Formats quantization and Per-Row/Per-Column quantization, as well as the layout requirements for compatibility with Cutlass Block Scaled GEMMs.**
### Data Formats and Quantization Factors
Target data format reference: [MX-Formats](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf). Note that we do not need to pack the raw data and scale factors together here.
Source data format: fp16/bf16
Target data format: mxfp4/6/8
Quantization factor data format: E8M0. *Per-Row/Per-Column quantization typically stores quantization factors in fp32; E8M0 covers the same numerical range as fp32, so after rounding the quantization factors can be stored directly, though the loss of mantissa bits may affect precision.*
Quantization granularity: \[1X32\]
Quantization dimension: Following Cutlass GEMM conventions, where M, N, and K denote the three dimensions of the matrix multiplication, we quantize along the K dimension.
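To make the scale computation concrete, here is a minimal CUDA sketch of the shared E8M0 scale for one \[1X32\] block, following the OCP MX spec formula `scale = 2^(floor(log2(amax)) - emax_elem)`. The helper name and the all-zero-block handling are assumptions of this sketch, not taken from any referenced kernel.

```cuda
#include <cstdint>

// Minimal sketch (hypothetical helper, not from the referenced kernels):
// compute the shared E8M0 scale of one 1x32 block per the OCP MX spec,
//   scale = 2^( floor(log2(amax)) - emax_elem ),
// stored as a biased 8-bit exponent (bias 127; 0xFF is reserved for NaN).
__device__ uint8_t block_e8m0_scale(const float (&block)[32],
                                    float elem_max /* 6.0f for e2m1 */)
{
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i)
        amax = fmaxf(amax, fabsf(block[i]));
    if (amax == 0.0f)
        return 127;                        // all-zero block: scale = 2^0 (a choice, not spec-mandated)
    // ilogbf(x) == floor(log2(|x|)); emax_elem == ilogbf(elem_max)
    int e = ilogbf(amax) - ilogbf(elem_max);
    e = max(-127, min(127, e));            // clamp into E8M0's representable exponent range
    return static_cast<uint8_t>(e + 127);  // biased exponent
}
```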
### Rounding and Clamp
Unlike software emulation, CUDA can efficiently handle complex rounding and clamping operations using PTX or built-in functions.
For example, `cvt.rn.satfinite.e2m1x2.f32` can convert two fp32 inputs into two fp4 outputs.
Rounding mode: `rn` (round-to-nearest-even)
Clamp mode: `satfinite` (clamped to the maximum finite value within the target range, excluding infinities and NaN)
For more data types and modes, refer to: [PTX cvt Instructions](https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cvt)
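As a sketch of how such an instruction can be wrapped from CUDA C++ (the helper name is hypothetical; the operand semantics, first source to the upper 4 bits and second source to the lower 4 bits, follow the PTX ISA):

```cuda
#include <cstdint>

// Sketch of wrapping the PTX conversion in CUDA C++ (hypothetical helper).
// `cvt.rn.satfinite.e2m1x2.f32 d, a, b` writes a into the upper 4 bits of d
// and b into the lower 4 bits; it requires a GPU that supports the
// instruction (Blackwell-class architectures).
__device__ uint8_t cvt_fp32x2_to_e2m1x2(float lo, float hi)
{
    uint16_t packed;
    asm volatile(
        "{\n"
        ".reg .b8 b;\n"
        "cvt.rn.satfinite.e2m1x2.f32 b, %1, %2;\n"  // hi -> upper nibble, lo -> lower nibble
        "cvt.u16.u8 %0, b;\n"                       // inline asm has no 8-bit operands, so widen
        "}"
        : "=h"(packed)
        : "f"(hi), "f"(lo));
    return static_cast<uint8_t>(packed);
}
```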
### Data Layout and Quantization Factor Layout
**Data Layout**
- mxfp4 requires packing every two values into one uint8.
- mxfp6 requires packing every four values into three uint8s; for the format, refer to: [mxfp6 cutlass mm format packing](https://github.com/ModelTC/LightX2V/blob/main/lightx2v_kernel/csrc/gemm/mxfp6_quant_kernels_sm120.cu#L74). A generic bit-packing sketch follows this list.
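For illustration only, here is a generic 4-into-3 bit-packing sketch; the exact bit ordering of the cutlass mm format is defined by the kernel linked above and may differ from this layout.

```cuda
#include <cstdint>

// Generic illustration of 4 x fp6 -> 3 x uint8 packing: four 6-bit codes laid
// out little-endian across 24 bits. The actual cutlass mm format may order the
// bits differently -- the linked mxfp6_quant_kernels_sm120.cu is authoritative.
__host__ __device__ void pack4_fp6(const uint8_t v[4] /* 6-bit codes */,
                                   uint8_t out[3])
{
    uint32_t bits = (uint32_t)(v[0] & 0x3F)
                  | ((uint32_t)(v[1] & 0x3F) << 6)
                  | ((uint32_t)(v[2] & 0x3F) << 12)
                  | ((uint32_t)(v[3] & 0x3F) << 18);
    out[0] = (uint8_t)(bits & 0xFF);
    out[1] = (uint8_t)((bits >> 8) & 0xFF);
    out[2] = (uint8_t)((bits >> 16) & 0xFF);
}
```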
**Quantization Factor Layout**
Cutlass Block Scaled GEMMs impose special swizzle requirements on quantization factor layouts to optimize matrix operations.
Reference: [Scale Factor Layouts](https://github.com/NVIDIA/cutlass/blob/main/media/docs/cpp/blackwell_functionality.md#scale-factor-layouts)
### Quantization Method
With the above in place, the computation of the target data values and the quantization factor values can follow [nvfp4 Quantization Basics](https://github.com/theNiemand/lightx2v/blob/main/lightx2v_kernel/docs/zh_CN/nvfp4%E9%87%8F%E5%8C%96%E5%9F%BA%E7%A1%80.md). Note that MX-Formats do not require quantizing the scale itself.
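Putting the pieces together, a per-block mxfp4 quantizer could look like the following sketch, which reuses the two hypothetical helpers above; it is an illustration under those assumptions, not the referenced implementation.

```cuda
// End-to-end sketch for one 1x32 block of mxfp4 quantization, reusing the two
// hypothetical helpers sketched above. Unlike nvfp4, the E8M0 scale is stored
// as-is: MX-Formats do not quantize the scale itself. Swizzling the scale
// factors into the cutlass layout is a separate, later step (see the
// Scale Factor Layouts reference above).
__device__ void quantize_block_mxfp4(const float (&block)[32],
                                     uint8_t out[16], uint8_t* scale_out)
{
    uint8_t s = block_e8m0_scale(block, /*elem_max=*/6.0f);  // 6.0f = max finite e2m1 value
    *scale_out = s;                                          // stored directly, no further quantization
    float inv = exp2f((float)(127 - (int)s));                // 1 / 2^e with e = s - 127
    for (int i = 0; i < 16; ++i)                             // two fp4 values per output byte
        out[i] = cvt_fp32x2_to_e2m1x2(block[2 * i] * inv, block[2 * i + 1] * inv);
}
```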