Unverified Commit f4f11133 authored by Muyang Li, committed by GitHub

feat: support AWQ 4-bit T5 and single-file model loading in ComfyUI (#421)

* better linter

* update

* remove merge t5; update the nightly-build.yaml workflow

* fix the workflow name

* no __metadata__ key

* remember to remove the files

* make linter happy

* check hardware compatibility

* ready to add tests

* update the README

* update the README
parent 02683930
......@@ -12,10 +12,10 @@ To launch the application, simply run:
python run_gradio.py
```
* The demo also defaults to the FLUX.1-schnell model. To switch to the FLUX.1-dev model, use `-m dev`.
* By default, the Gemma-2B model is loaded as a safety checker. To disable this feature and save GPU memory, use `--no-safety-checker`.
* To further reduce GPU memory usage, you can enable the W4A16 text encoder by specifying `--use-qencoder`.
* By default, only the INT4 DiT is loaded. Use `-p int4 bf16` to add a BF16 DiT for side-by-side comparison, or `-p bf16` to load only the BF16 model.
- The demo also defaults to the FLUX.1-schnell model. To switch to the FLUX.1-dev model, use `-m dev`.
- By default, the Gemma-2B model is loaded as a safety checker. To disable this feature and save GPU memory, use `--no-safety-checker`.
- To further reduce GPU memory usage, you can enable the W4A16 text encoder by specifying `--use-qencoder`.
- By default, only the INT4 DiT is loaded. Use `-p int4 bf16` to add a BF16 DiT for side-by-side comparison, or `-p bf16` to load only the BF16 model.
## Command Line Inference
......@@ -25,13 +25,17 @@ We provide a script, [generate.py](generate.py), that generates an image from a
python generate.py --prompt "Your Text Prompt"
```
* The generated image will be saved as `output.png` by default. You can specify a different path using the `-o` or `--output-path` options.
* The script defaults to using the FLUX.1-schnell model. To switch to the FLUX.1-dev model, use `-m dev`.
* By default, the script uses our INT4 model. To use the BF16 model instead, specify `-p bf16`.
* You can specify `--use-qencoder` to use our W4A16 text encoder.
* You can adjust the number of inference steps and guidance scale with `-t` and `-g`, respectively. For the FLUX.1-schnell model, the defaults are 4 steps and a guidance scale of 0; for the FLUX.1-dev model, the defaults are 50 steps and a guidance scale of 3.5.
- The generated image will be saved as `output.png` by default. You can specify a different path using the `-o` or `--output-path` options.
* When using the FLUX.1-dev model, you also have the option to load a LoRA adapter with `--lora-name`. Available choices are `None`, [`Anime`](https://huggingface.co/alvdansen/sonny-anime-fixed), [`GHIBSKY Illustration`](https://huggingface.co/aleksa-codes/flux-ghibsky-illustration), [`Realism`](https://huggingface.co/XLabs-AI/flux-RealismLora), [`Children Sketch`](https://huggingface.co/Shakker-Labs/FLUX.1-dev-LoRA-Children-Simple-Sketch), and [`Yarn Art`](https://huggingface.co/linoyts/yarn_art_Flux_LoRA), with the default set to `None`. You can also specify the LoRA weight with `--lora-weight`, which defaults to 1.
- The script defaults to using the FLUX.1-schnell model. To switch to the FLUX.1-dev model, use `-m dev`.
- By default, the script uses our INT4 model. To use the BF16 model instead, specify `-p bf16`.
- You can specify `--use-qencoder` to use our W4A16 text encoder.
- You can adjust the number of inference steps and guidance scale with `-t` and `-g`, respectively. For the FLUX.1-schnell model, the defaults are 4 steps and a guidance scale of 0; for the FLUX.1-dev model, the defaults are 50 steps and a guidance scale of 3.5.
- When using the FLUX.1-dev model, you also have the option to load a LoRA adapter with `--lora-name`. Available choices are `None`, [`Anime`](https://huggingface.co/alvdansen/sonny-anime-fixed), [`GHIBSKY Illustration`](https://huggingface.co/aleksa-codes/flux-ghibsky-illustration), [`Realism`](https://huggingface.co/XLabs-AI/flux-RealismLora), [`Children Sketch`](https://huggingface.co/Shakker-Labs/FLUX.1-dev-LoRA-Children-Simple-Sketch), and [`Yarn Art`](https://huggingface.co/linoyts/yarn_art_Flux_LoRA), with the default set to `None`. You can also specify the LoRA weight with `--lora-weight`, which defaults to 1.
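These options can be combined. A hedged example (the prompt, LoRA choice, step count, and output name are illustrative):

```bash
# Illustrative: FLUX.1-dev with the INT4 DiT, W4A16 text encoder, and the Anime LoRA at weight 0.8.
python generate.py -m dev --use-qencoder \
    --lora-name Anime --lora-weight 0.8 \
    -t 25 -g 3.5 \
    --prompt "a watercolor fox in a snowy forest" \
    -o fox.png
```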
## Latency Benchmark
......@@ -41,12 +45,12 @@ To measure the latency of our INT4 models, use the following command:
python latency.py
```
* The script defaults to the INT4 FLUX.1-schnell model. To switch to FLUX.1-dev, use the `-m dev` option. For BF16 precision, add `-p bf16`.
* Adjust the number of inference steps and the guidance scale using `-t` and `-g`, respectively.
- The script defaults to the INT4 FLUX.1-schnell model. To switch to FLUX.1-dev, use the `-m dev` option. For BF16 precision, add `-p bf16`.
- Adjust the number of inference steps and the guidance scale using `-t` and `-g`, respectively.
- For FLUX.1-schnell, the defaults are 4 steps and a guidance scale of 0.
- For FLUX.1-dev, the defaults are 50 steps and a guidance scale of 3.5.
* By default, the script measures the end-to-end latency for generating a single image. To measure the latency of a single DiT forward step instead, use the `--mode step` flag.
* Specify the number of warmup and test runs using `--warmup-times` and `--test-times`. The defaults are 2 warmup runs and 10 test runs.
- By default, the script measures the end-to-end latency for generating a single image. To measure the latency of a single DiT forward step instead, use the `--mode step` flag.
- Specify the number of warmup and test runs using `--warmup-times` and `--test-times`. The defaults are 2 warmup runs and 10 test runs.
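For example, a hedged benchmark run (the values are illustrative) that measures per-step latency of the INT4 FLUX.1-dev model with extra warmup:

```bash
# Illustrative: per-step latency of INT4 FLUX.1-dev with 5 warmup and 20 test runs.
python latency.py -m dev --mode step --warmup-times 5 --test-times 20
```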
## Quality Results
......@@ -63,12 +67,12 @@ python evaluate.py -p int4
python evaluate.py -p bf16
```
* The commands above will generate images from FLUX.1-schnell on both datasets. Use `-m dev` to switch to FLUX.1-dev, or specify a single dataset with `-d MJHQ` or `-d DCI`.
* By default, generated images are saved to `results/$MODEL/$PRECISION`. Customize the output path using the `-o` option if desired.
* You can also adjust the number of inference steps and the guidance scale using `-t` and `-g`, respectively.
- The commands above will generate images from FLUX.1-schnell on both datasets. Use `-m dev` to switch to FLUX.1-dev, or specify a single dataset with `-d MJHQ` or `-d DCI`.
- By default, generated images are saved to `results/$MODEL/$PRECISION`. Customize the output path using the `-o` option if desired.
- You can also adjust the number of inference steps and the guidance scale using `-t` and `-g`, respectively.
- For FLUX.1-schnell, the defaults are 4 steps and a guidance scale of 0.
- For FLUX.1-dev, the defaults are 50 steps and a guidance scale of 3.5.
* To accelerate the generation process, you can distribute the workload across multiple GPUs. For instance, if you have $N$ GPUs, on GPU $i (0 \le i < N)$ , you can add the options `--chunk-start $i --chunk-step $N`. This setup ensures each GPU handles a distinct portion of the workload, enhancing overall efficiency.
- To accelerate the generation process, you can distribute the workload across multiple GPUs. For instance, if you have $N$ GPUs, on GPU $i (0 \\le i < N)$ , you can add the options `--chunk-start $i --chunk-step $N`. This setup ensures each GPU handles a distinct portion of the workload, enhancing overall efficiency.
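A minimal sketch of the multi-GPU setup described above, assuming 4 GPUs and that each process is pinned to a GPU via `CUDA_VISIBLE_DEVICES` (the pinning mechanism is an assumption, not stated above):

```bash
# Illustrative: split MJHQ generation with the INT4 model across N GPUs.
N=4
for i in $(seq 0 $((N - 1))); do
    CUDA_VISIBLE_DEVICES=$i python evaluate.py -p int4 -d MJHQ \
        --chunk-start $i --chunk-step $N &
done
wait
```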
Finally, you can compute the metrics for the images with:
......
......@@ -10,8 +10,8 @@ This interactive Gradio application can generate an image based on your provided
python run_gradio.py
```
* By default, the Gemma-2B model is loaded as a safety checker. To disable this feature and save GPU memory, use `--no-safety-checker`.
* By default, only the INT4 DiT is loaded. Use `-p int4 bf16` to add a BF16 DiT for side-by-side comparison, or `-p bf16` to load only the BF16 model.
- By default, the Gemma-2B model is loaded as a safety checker. To disable this feature and save GPU memory, use `--no-safety-checker`.
- By default, only the INT4 DiT is loaded. Use `-p int4 bf16` to add a BF16 DiT for side-by-side comparison, or `-p bf16` to load only the BF16 model.
## Command Line Inference
......@@ -21,10 +21,10 @@ We provide a script, [generate.py](generate.py), that generates an image from a
python generate.py --prompt "Your Text Prompt"
```
* The generated image will be saved as `output.png` by default. You can specify a different path using the `-o` or `--output-path` options.
* By default, the script uses our INT4 model. To use the BF16 model instead, specify `-p bf16`.
* You can adjust the number of inference steps and classifier-free guidance scale with `-t` and `-g`, respectively. The defaults are 20 steps and a guidance scale of 5.
* In addition to the classifier-free guidance, you can also adjust the [PAG guidance](https://arxiv.org/abs/2403.17377) scale with `--pag-scale`. The default is 2.
- The generated image will be saved as `output.png` by default. You can specify a different path using the `-o` or `--output-path` options.
- By default, the script uses our INT4 model. To use the BF16 model instead, specify `-p bf16`.
- You can adjust the number of inference steps and classifier-free guidance scale with `-t` and `-g`, respectively. The defaults are 20 steps and a guidance scale of 5.
- In addition to the classifier-free guidance, you can also adjust the [PAG guidance](https://arxiv.org/abs/2403.17377) scale with `--pag-scale`. The default is 2.
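A hedged example combining these options (the prompt and output path are illustrative):

```bash
# Illustrative: INT4 SANA with 20 steps, CFG scale 5, and PAG scale 2.
python generate.py -p int4 -t 20 -g 5 --pag-scale 2 \
    --prompt "a cozy cabin at dusk" -o cabin.png
```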
## Latency Benchmark
......@@ -34,7 +34,7 @@ To measure the latency of our INT4 models, use the following command:
python latency.py
```
* Adjust the number of inference steps and the guidance scale using `-t` and `-g`, respectively. The defaults are 20 steps and a guidance scale of 5.
* You can also adjust the [PAG guidance](https://arxiv.org/abs/2403.17377) scale with `--pag-scale`. The default is 2.
* By default, the script measures the end-to-end latency for generating a single image. To measure the latency of a single DiT forward step instead, use the `--mode step` flag.
* Specify the number of warmup and test runs using `--warmup-times` and `--test-times`. The defaults are 2 warmup runs and 10 test runs.
- Adjust the number of inference steps and the guidance scale using `-t` and `-g`, respectively. The defaults are 20 steps and a guidance scale of 5.
- You can also adjust the [PAG guidance](https://arxiv.org/abs/2403.17377) scale with `--pag-scale`. The default is 2.
- By default, the script measures the end-to-end latency for generating a single image. To measure the latency of a single DiT forward step instead, use the `--mode step` flag.
- Specify the number of warmup and test runs using `--warmup-times` and `--test-times`. The defaults are 2 warmup runs and 10 test runs.
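For example, a hedged per-step benchmark run (the values are illustrative):

```bash
# Illustrative: per-step latency with 3 warmup and 20 test runs.
python latency.py -t 20 -g 5 --pag-scale 2 --mode step --warmup-times 3 --test-times 20
```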
### ❗ Import Error: `ImportError: cannot import name 'to_diffusers' from 'nunchaku.lora.flux' (...)` (e.g., mit-han-lab/nunchaku#250)
This error usually indicates that the nunchaku library was not installed correctly. We’ve prepared step-by-step installation guides for Windows users:
📺 [English tutorial](https://youtu.be/YHAVe-oM7U8?si=cM9zaby_aEHiFXk0) | 📺 [Chinese tutorial](https://www.bilibili.com/video/BV1BTocYjEk5/?share_source=copy_web&vd_source=8926212fef622f25cc95380515ac74ee) | 📖 [Corresponding Text guide](https://github.com/mit-han-lab/nunchaku/blob/main/docs/setup_windows.md)
Please also check the following common causes:
* **You only installed the ComfyUI plugin (`ComfyUI-nunchaku`) but not the core `nunchaku` library.** Please follow the [installation instructions in our README](https://github.com/mit-han-lab/nunchaku?tab=readme-ov-file#installation) to install the correct version of the `nunchaku` library.
* **You installed `nunchaku` using `pip install nunchaku`, but this is the wrong package.**
- **You only installed the ComfyUI plugin (`ComfyUI-nunchaku`) but not the core `nunchaku` library.** Please follow the [installation instructions in our README](https://github.com/mit-han-lab/nunchaku?tab=readme-ov-file#installation) to install the correct version of the `nunchaku` library.
- **You installed `nunchaku` using `pip install nunchaku`, but this is the wrong package.**
The `nunchaku` name on PyPI is already taken by an unrelated project. Please uninstall the incorrect package and follow our [installation guide](https://github.com/mit-han-lab/nunchaku?tab=readme-ov-file#installation) to install the correct version.
* **(MOST LIKELY) You installed `nunchaku` correctly, but into the wrong Python environment.**
- **(MOST LIKELY) You installed `nunchaku` correctly, but into the wrong Python environment.**
If you're using the ComfyUI portable package, its Python interpreter is very likely not the system default. To identify the correct Python path, launch ComfyUI and check the first few lines of the log. For example, you will find
```text
......@@ -28,28 +30,33 @@ Please also check the following common causes:
"G:\ComfyUI\python\python.exe" -m pip install https://github.com/mit-han-lab/nunchaku/releases/download/v0.2.0/nunchaku-0.2.0+torch2.6-cp311-cp311-linux_x86_64.whl
```
* **You have a folder named `nunchaku` in your working directory.**
- **You have a folder named `nunchaku` in your working directory.**
Python may mistakenly load from that local folder instead of the installed library. Also, make sure your plugin folder under `custom_nodes` is named `ComfyUI-nunchaku`, not `nunchaku`.
### ❗ Runtime Error: `Assertion failed: this->shape.dataExtent == other.shape.dataExtent, file ...Tensor.h` (e.g., mit-han-lab/nunchaku#212)
This error is typically due to using the wrong model for your GPU.
- If you're using a **Blackwell GPU (e.g., RTX 50-series)**, please use our **FP4** models.
- For all other GPUs, use our **INT4** models.
### ❗ System crash or blue screen (e.g., mit-han-lab/nunchaku#57)
We have observed some cases where memory is not properly released after image generation, especially when using ComfyUI. This may lead to system instability or crashes.
We’re actively investigating this issue. If you have experience or insights into memory management in ComfyUI, we would appreciate your help!
### ❗ Out of Memory or Slow Model Loading (e.g., mit-han-lab/nunchaku#249, mit-han-lab/nunchaku#311, mit-han-lab/nunchaku#276)
Try upgrading your CUDA driver and setting the environment variable `NUNCHAKU_LOAD_METHOD` to either `READ` or `READNOPIN`.
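For example (the exact syntax depends on your shell):

```bash
# Linux/macOS
export NUNCHAKU_LOAD_METHOD=READNOPIN
# Windows (CMD)
set NUNCHAKU_LOAD_METHOD=READ
```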
### ❗ Same Seeds Produce Slightly Different Images (e.g., mit-han-lab/nunchaku#229, mit-han-lab/nunchaku#294)
This behavior is due to minor precision noise introduced by the GPU’s accumulation order. Because modern GPUs execute operations out of order for better performance, small variations in output can occur, even with the same seed.
Enforcing strict accumulation order would reduce this variability but significantly hurt performance, so we do not plan to change this behavior.
### ❓ PuLID Support (e.g., mit-han-lab/nunchaku#258)
PuLID support is currently in development and will be included in the next major release.
### ~~❗ Assertion Error: `Assertion failed: a.dtype() == b.dtype(), file ...misc_kernels.cu` (e.g., mit-han-lab/nunchaku#30)~~
......
### ❗ Import Error: `ImportError: cannot import name 'to_diffusers' from 'nunchaku.lora.flux' (...)` (e.g., mit-han-lab/nunchaku#250)
This error usually indicates that the `nunchaku` library was not installed correctly. We have prepared step-by-step installation guides for Windows users:
📺 [English tutorial](https://youtu.be/YHAVe-oM7U8?si=cM9zaby_aEHiFXk0) | 📺 [Chinese tutorial](https://www.bilibili.com/video/BV1BTocYjEk5/?share_source=copy_web&vd_source=8926212fef622f25cc95380515ac74ee) | 📖 [Corresponding text guide](https://github.com/mit-han-lab/nunchaku/blob/main/docs/setup_windows.md)
Please also check the following common causes:
* **You only installed the ComfyUI plugin (`ComfyUI-nunchaku`) but not the core `nunchaku` library.** Please follow the [installation instructions in our README](https://github.com/mit-han-lab/nunchaku?tab=readme-ov-file#installation) to install the correct version of the `nunchaku` library.
* **You installed `nunchaku` using `pip install nunchaku`, but this is the wrong package.**
- **You only installed the ComfyUI plugin (`ComfyUI-nunchaku`) but not the core `nunchaku` library.** Please follow the [installation instructions in our README](https://github.com/mit-han-lab/nunchaku?tab=readme-ov-file#installation) to install the correct version of the `nunchaku` library.
- **You installed `nunchaku` using `pip install nunchaku`, but this is the wrong package.**
The `nunchaku` name on PyPI is already taken by an unrelated project. Please uninstall the incorrect package and follow our [installation guide](https://github.com/mit-han-lab/nunchaku?tab=readme-ov-file#installation) to install the correct version.
* **(MOST LIKELY) You installed `nunchaku` correctly, but into the wrong Python environment.**
- **(MOST LIKELY) You installed `nunchaku` correctly, but into the wrong Python environment.**
If you're using the ComfyUI portable package, its Python interpreter is very likely not the system default. After launching ComfyUI, check the Python path at the beginning of the log, for example:
```text
** Python executable: G:\ComfyuI\python\python.exe
```
Then install into that environment with:
```shell
"G:\ComfyUI\python\python.exe" -m pip install <your-wheel-file>.whl
```
Example (Python 3.11 + Torch 2.6):
```shell
"G:\ComfyUI\python\python.exe" -m pip install https://github.com/mit-han-lab/nunchaku/releases/download/v0.2.0/nunchaku-0.2.0+torch2.6-cp311-cp311-linux_x86_64.whl
```
* **You have a folder named `nunchaku` in your working directory.**
- **You have a folder named `nunchaku` in your working directory.**
Python may mistakenly load from that local folder instead of the installed library. Also, make sure your plugin folder under `custom_nodes` is named `ComfyUI-nunchaku`, not `nunchaku`.
### ❗ Runtime Error: `Assertion failed: this->shape.dataExtent == other.shape.dataExtent, file ...Tensor.h` (e.g., mit-han-lab/nunchaku#212)
This error is typically caused by using a model that does not match your GPU:
- If you're using a **Blackwell GPU (e.g., RTX 50-series)**, please use our **FP4** models.
- For all other GPUs, use our **INT4** models.
### ❗ System Crash or Blue Screen (e.g., mit-han-lab/nunchaku#57)
We have observed cases in ComfyUI where memory is not properly released after image generation, which may lead to system instability or crashes. We are actively investigating this issue. If you have experience with memory management in ComfyUI, we would appreciate your help!
### ❗ Out of Memory or Slow Model Loading (e.g., mit-han-lab/nunchaku#249, mit-han-lab/nunchaku#311, mit-han-lab/nunchaku#276)
Try upgrading your CUDA driver and setting the environment variable `NUNCHAKU_LOAD_METHOD` to either `READ` or `READNOPIN`.
### ❗ Same Seeds Produce Slightly Different Images (e.g., mit-han-lab/nunchaku#229, mit-han-lab/nunchaku#294)
This behavior is caused by minor precision noise from the GPU's accumulation order. Enforcing a fixed accumulation order would significantly hurt performance, so we do not plan to change this behavior.
### ❓ PuLID Support (e.g., mit-han-lab/nunchaku#258)
PuLID support is currently in development and will be included in the next major release.
### ~~❗ Assertion Error: `Assertion failed: a.dtype() == b.dtype(), file ...misc_kernels.cu` (e.g., mit-han-lab/nunchaku#30)~~
~~We currently **only support the 16-bit version of [ControlNet-Union-Pro](https://huggingface.co/Shakker-Labs/FLUX.1-dev-ControlNet-Union-Pro)**. FP8 and other ControlNets will be supported in a future release.~~ ✅ This issue has been resolved.
### ~~❗ Assertion Error: `assert image_rotary_emb.shape[2] == batch_size * (txt_tokens + img_tokens)` (e.g., mit-han-lab/nunchaku#24)~~
~~**Batch sizes larger than 1 are currently not supported** for inference. We plan to support this in a future major release.~~ ✅ Multi-batch inference has been supported since [v0.3.0dev0](https://github.com/mit-han-lab/nunchaku/releases/tag/v0.3.0dev0).
......@@ -63,16 +63,15 @@ Install PyTorch appropriate for your setup
- **For most users**:
```bash
"G:\ComfyuI\python\python.exe" -m pip install torch==2.6 torchvision==0.21 torchaudio==2.6
```
```bash
"G:\ComfyuI\python\python.exe" -m pip install torch==2.6 torchvision==0.21 torchaudio==2.6
```
- **For RTX 50-series GPUs** (requires PyTorch ≥2.7 with CUDA 12.8):
```bash
"G:\ComfyuI\python\python.exe" -m pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128
```
```bash
"G:\ComfyuI\python\python.exe" -m pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128
```
## Step 3: Install Nunchaku
......@@ -109,62 +108,62 @@ Please use CMD instead of PowerShell for building.
- Step 1: Install Build Tools
```bash
C:\Users\muyang\miniconda3\envs\comfyui\python.exe
"G:\ComfyuI\python\python.exe" -m pip install ninja setuptools wheel build
```
```bash
C:\Users\muyang\miniconda3\envs\comfyui\python.exe
"G:\ComfyuI\python\python.exe" -m pip install ninja setuptools wheel build
```
- Step 2: Clone the Repository
```bash
git clone https://github.com/mit-han-lab/nunchaku.git
cd nunchaku
git submodule init
git submodule update
```
```bash
git clone https://github.com/mit-han-lab/nunchaku.git
cd nunchaku
git submodule init
git submodule update
```
- Step 3: Set Up Visual Studio Environment
Locate the `VsDevCmd.bat` script on your system. Example path:
Locate the `VsDevCmd.bat` script on your system. Example path:
```
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\Common7\Tools\VsDevCmd.bat
```
```
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\Common7\Tools\VsDevCmd.bat
```
Then run:
Then run:
```bash
"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\Common7\Tools\VsDevCmd.bat" -startdir=none -arch=x64 -host_arch=x64
set DISTUTILS_USE_SDK=1
```
```bash
"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\Common7\Tools\VsDevCmd.bat" -startdir=none -arch=x64 -host_arch=x64
set DISTUTILS_USE_SDK=1
```
- Step 4: Build Nunchaku
```bash
"G:\ComfyuI\python\python.exe" setup.py develop
```
```bash
"G:\ComfyuI\python\python.exe" setup.py develop
```
Verify with:
Verify with:
```bash
"G:\ComfyuI\python\python.exe" -c "import nunchaku"
```
```bash
"G:\ComfyuI\python\python.exe" -c "import nunchaku"
```
You can also run a test (requires a Hugging Face token for downloading the models):
You can also run a test (requires a Hugging Face token for downloading the models):
```bash
"G:\ComfyuI\python\python.exe" -m huggingface-cli login
"G:\ComfyuI\python\python.exe" -m nunchaku.test
```
```bash
"G:\ComfyuI\python\python.exe" -m huggingface-cli login
"G:\ComfyuI\python\python.exe" -m nunchaku.test
```
- (Optional) Step 5: Building wheel for Portable Python
If building directly with portable Python fails, you can first build the wheel in a working Conda environment, then install the `.whl` file using your portable Python:
If building directly with portable Python fails, you can first build the wheel in a working Conda environment, then install the `.whl` file using your portable Python:
```shell
set NUNCHAKU_INSTALL_MODE=ALL
"G:\ComfyuI\python\python.exe" python -m build --wheel --no-isolation
```
```shell
set NUNCHAKU_INSTALL_MODE=ALL
"G:\ComfyuI\python\python.exe" python -m build --wheel --no-isolation
```
# Use Nunchaku in ComfyUI
......@@ -183,37 +182,36 @@ Alternatively, install using [ComfyUI-Manager](https://github.com/Comfy-Org/Comf
- **Standard FLUX.1-dev Models**
Start by downloading the standard [FLUX.1-dev text encoders](https://huggingface.co/comfyanonymous/flux_text_encoders/tree/main) and [VAE](https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/ae.safetensors). You can also optionally download the original [BF16 FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/flux1-dev.safetensors) model. An example command:
Start by downloading the standard [FLUX.1-dev text encoders](https://huggingface.co/comfyanonymous/flux_text_encoders/tree/main) and [VAE](https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/ae.safetensors). You can also optionally download the original [BF16 FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/flux1-dev.safetensors) model. An example command:
```bash
huggingface-cli download comfyanonymous/flux_text_encoders clip_l.safetensors --local-dir models/text_encoders
huggingface-cli download comfyanonymous/flux_text_encoders t5xxl_fp16.safetensors --local-dir models/text_encoders
huggingface-cli download black-forest-labs/FLUX.1-schnell ae.safetensors --local-dir models/vae
huggingface-cli download black-forest-labs/FLUX.1-dev flux1-dev.safetensors --local-dir models/diffusion_models
```
```bash
huggingface-cli download comfyanonymous/flux_text_encoders clip_l.safetensors --local-dir models/text_encoders
huggingface-cli download comfyanonymous/flux_text_encoders t5xxl_fp16.safetensors --local-dir models/text_encoders
huggingface-cli download black-forest-labs/FLUX.1-schnell ae.safetensors --local-dir models/vae
huggingface-cli download black-forest-labs/FLUX.1-dev flux1-dev.safetensors --local-dir models/diffusion_models
```
- **SVDQuant 4-bit FLUX.1-dev Models**
Next, download the SVDQuant 4-bit models:
Next, download the SVDQuant 4-bit models:
- For **50-series GPUs**, use the [FP4 model](https://huggingface.co/mit-han-lab/svdq-fp4-flux.1-dev).
- For **other GPUs**, use the [INT4 model](https://huggingface.co/mit-han-lab/svdq-int4-flux.1-dev).
- For **50-series GPUs**, use the [FP4 model](https://huggingface.co/mit-han-lab/svdq-fp4-flux.1-dev).
- For **other GPUs**, use the [INT4 model](https://huggingface.co/mit-han-lab/svdq-int4-flux.1-dev).
Make sure to place the **entire downloaded folder** into `models/diffusion_models`. For example:
Make sure to place the **entire downloaded folder** into `models/diffusion_models`. For example:
```bash
huggingface-cli download mit-han-lab/svdq-int4-flux.1-dev --local-dir models/diffusion_models/svdq-int4-flux.1-dev
```
```bash
huggingface-cli download mit-han-lab/svdq-int4-flux.1-dev --local-dir models/diffusion_models/svdq-int4-flux.1-dev
```
- **(Optional): Download Sample LoRAs**
You can test with some sample LoRAs like [FLUX.1-Turbo](https://huggingface.co/alimama-creative/FLUX.1-Turbo-Alpha/blob/main/diffusion_pytorch_model.safetensors) and [Ghibsky](https://huggingface.co/aleksa-codes/flux-ghibsky-illustration/blob/main/lora.safetensors). Place these files in the `models/loras` directory:
```bash
huggingface-cli download alimama-creative/FLUX.1-Turbo-Alpha diffusion_pytorch_model.safetensors --local-dir models/loras
huggingface-cli download aleksa-codes/flux-ghibsky-illustration lora.safetensors --local-dir models/loras
```
You can test with some sample LoRAs like [FLUX.1-Turbo](https://huggingface.co/alimama-creative/FLUX.1-Turbo-Alpha/blob/main/diffusion_pytorch_model.safetensors) and [Ghibsky](https://huggingface.co/aleksa-codes/flux-ghibsky-illustration/blob/main/lora.safetensors). Place these files in the `models/loras` directory:
```bash
huggingface-cli download alimama-creative/FLUX.1-Turbo-Alpha diffusion_pytorch_model.safetensors --local-dir models/loras
huggingface-cli download aleksa-codes/flux-ghibsky-illustration lora.safetensors --local-dir models/loras
```
## 3. Set Up Workflows
......
import argparse
import json
import os
from pathlib import Path
......@@ -9,10 +10,11 @@ from safetensors.torch import save_file
from .utils import load_state_dict_in_safetensors
def merge_models_into_a_single_file(
def merge_safetensors(
pretrained_model_name_or_path: str | os.PathLike[str], **kwargs
) -> tuple[dict[str, torch.Tensor], dict[str, str]]:
subfolder = kwargs.get("subfolder", None)
comfy_config_path = kwargs.get("comfy_config_path", None)
if isinstance(pretrained_model_name_or_path, str):
pretrained_model_name_or_path = Path(pretrained_model_name_or_path)
......@@ -21,7 +23,8 @@ def merge_models_into_a_single_file(
unquantized_part_path = dirpath / "unquantized_layers.safetensors"
transformer_block_path = dirpath / "transformer_blocks.safetensors"
config_path = dirpath / "config.json"
comfy_config_path = dirpath / "comfy_config.json"
if comfy_config_path is None:
comfy_config_path = dirpath / "comfy_config.json"
else:
download_kwargs = {
"subfolder": subfolder,
......@@ -59,9 +62,35 @@ def merge_models_into_a_single_file(
state_dict = unquantized_part_sd
state_dict.update(transformer_block_sd)
precision = "int4"
for v in state_dict.values():
assert isinstance(v, torch.Tensor)
if v.dtype in [
torch.float8_e4m3fn,
torch.float8_e4m3fnuz,
torch.float8_e5m2,
torch.float8_e5m2fnuz,
torch.float8_e8m0fnu,
]:
precision = "fp4"
quantization_config = {
"method": "svdquant",
"weight": {
"dtype": "fp4_e2m1_all" if precision == "fp4" else "int4",
"scale_dtype": [None, "fp8_e4m3_nan"] if precision == "fp4" else None,
"group_size": 16 if precision == "fp4" else 64,
},
"activation": {
"dtype": "fp4_e2m1_all" if precision == "fp4" else "int4",
"scale_dtype": "fp8_e4m3_nan" if precision == "fp4" else None,
"group_size": 16 if precision == "fp4" else 64,
},
}
return state_dict, {
"config": Path(config_path).read_text(),
"comfy_config": Path(comfy_config_path).read_text(),
"model_class": "NunchakuFluxTransformer2dModel",
"quantization_config": json.dumps(quantization_config),
}
......@@ -76,7 +105,7 @@ if __name__ == "__main__":
)
parser.add_argument("-o", "--output-path", type=Path, required=True, help="Path to output path")
args = parser.parse_args()
state_dict, metadata = merge_models_into_a_single_file(args.input_path)
state_dict, metadata = merge_safetensors(args.input_path)
output_path = Path(args.output_path)
dirpath = output_path.parent
dirpath.mkdir(parents=True, exist_ok=True)
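A hedged CLI sketch of this merge utility (assuming the input flag is `-i`/`--input-path`, mirroring the T5 script below; the repository and output path are illustrative):

```bash
python -m nunchaku.merge_safetensors \
    -i mit-han-lab/svdq-int4-flux.1-dev \
    -o models/diffusion_models/svdq-int4_r32-flux.1-dev.safetensors
```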
......
import argparse
import os
from pathlib import Path
import torch
from huggingface_hub import constants, hf_hub_download
from safetensors.torch import save_file
from .utils import load_state_dict_in_safetensors
def merge_config_into_model(
pretrained_model_name_or_path: str | os.PathLike[str], **kwargs
) -> tuple[dict[str, torch.Tensor], dict[str, str]]:
subfolder = kwargs.get("subfolder", None)
if isinstance(pretrained_model_name_or_path, str):
pretrained_model_name_or_path = Path(pretrained_model_name_or_path)
if pretrained_model_name_or_path.exists():
dirpath = pretrained_model_name_or_path if subfolder is None else pretrained_model_name_or_path / subfolder
model_path = dirpath / "awq-int4-flux.1-t5xxl.safetensors"
config_path = dirpath / "config.json"
else:
download_kwargs = {
"subfolder": subfolder,
"repo_type": "model",
"revision": kwargs.get("revision", None),
"cache_dir": kwargs.get("cache_dir", None),
"local_dir": kwargs.get("local_dir", None),
"user_agent": kwargs.get("user_agent", None),
"force_download": kwargs.get("force_download", False),
"proxies": kwargs.get("proxies", None),
"etag_timeout": kwargs.get("etag_timeout", constants.DEFAULT_ETAG_TIMEOUT),
"token": kwargs.get("token", None),
"local_files_only": kwargs.get("local_files_only", None),
"headers": kwargs.get("headers", None),
"endpoint": kwargs.get("endpoint", None),
"resume_download": kwargs.get("resume_download", None),
"force_filename": kwargs.get("force_filename", None),
"local_dir_use_symlinks": kwargs.get("local_dir_use_symlinks", "auto"),
}
model_path = hf_hub_download(
repo_id=str(pretrained_model_name_or_path), filename="awq-int4-flux.1-t5xxl.safetensors", **download_kwargs
)
config_path = hf_hub_download(
repo_id=str(pretrained_model_name_or_path), filename="config.json", **download_kwargs
)
model_path = Path(model_path)
config_path = Path(config_path)
state_dict = load_state_dict_in_safetensors(model_path)
metadata = {"config": config_path.read_text()}
return state_dict, metadata
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"-i",
"--input-path",
type=Path,
default="mit-han-lab/nunchaku-t5",
help="Path to model directory. It can also be a huggingface repo.",
)
parser.add_argument("-o", "--output-path", type=Path, required=True, help="Path to output path")
args = parser.parse_args()
state_dict, metadata = merge_config_into_model(args.input_path)
output_path = Path(args.output_path)
dirpath = output_path.parent
dirpath.mkdir(parents=True, exist_ok=True)
save_file(state_dict, output_path, metadata=metadata)
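A hedged Python sketch of calling this helper directly, reusing the imports in this file (the output path is illustrative):

```python
# Sketch: fetch the AWQ 4-bit T5 from the Hub and write a single-file
# checkpoint with the T5 config embedded as safetensors metadata.
state_dict, metadata = merge_config_into_model("mit-han-lab/nunchaku-t5")
save_file(state_dict, "models/text_encoders/awq-int4-flux.1-t5xxl.safetensors", metadata=metadata)
```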
......@@ -23,10 +23,9 @@ class NunchakuT5EncoderModel(T5EncoderModel):
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: str | os.PathLike[str], **kwargs):
pretrained_model_name_or_path = Path(pretrained_model_name_or_path)
state_dict = load_state_dict_in_safetensors(pretrained_model_name_or_path, return_metadata=True)
state_dict, metadata = load_state_dict_in_safetensors(pretrained_model_name_or_path, return_metadata=True)
# Load the config file
metadata = state_dict.pop("__metadata__", {})
config = json.loads(metadata["config"])
config = T5Config(**config)
......
import json
import logging
import os
from pathlib import Path
......@@ -17,7 +18,7 @@ from ..._C import QuantizedFluxModel
from ..._C import utils as cutils
from ...lora.flux.nunchaku_converter import fuse_vectors, to_nunchaku
from ...lora.flux.utils import is_nunchaku_format
from ...utils import get_precision, load_state_dict_in_safetensors
from ...utils import check_hardware_compatibility, get_precision, load_state_dict_in_safetensors
from .utils import NunchakuModelLoaderMixin, pad_tensor
SVD_RANK = 32
......@@ -315,7 +316,7 @@ class NunchakuFluxTransformer2dModel(FluxTransformer2DModel, NunchakuModelLoader
self._quantized_part_vectors: dict[str, torch.Tensor] = {}
self._original_in_channels = in_channels
# Comfyui LoRA related
# ComfyUI LoRA related
self.comfy_lora_meta_list = []
self.comfy_lora_sd_list = []
......@@ -328,13 +329,14 @@ class NunchakuFluxTransformer2dModel(FluxTransformer2DModel, NunchakuModelLoader
offload = kwargs.get("offload", False)
torch_dtype = kwargs.get("torch_dtype", torch.bfloat16)
precision = get_precision(kwargs.get("precision", "auto"), device, pretrained_model_name_or_path)
metadata = None
if isinstance(pretrained_model_name_or_path, str):
pretrained_model_name_or_path = Path(pretrained_model_name_or_path)
if pretrained_model_name_or_path.is_file() or pretrained_model_name_or_path.name.endswith(
(".safetensors", ".sft")
):
transformer, model_state_dict = cls._build_model(pretrained_model_name_or_path, **kwargs)
transformer, model_state_dict, metadata = cls._build_model(pretrained_model_name_or_path, **kwargs)
quantized_part_sd = {}
unquantized_part_sd = {}
for k, v in model_state_dict.items():
......@@ -342,6 +344,9 @@ class NunchakuFluxTransformer2dModel(FluxTransformer2DModel, NunchakuModelLoader
quantized_part_sd[k] = v
else:
unquantized_part_sd[k] = v
precision = get_precision(device=device)
quantization_config = json.loads(metadata["quantization_config"])
check_hardware_compatibility(quantization_config, device)
else:
transformer, unquantized_part_path, transformer_block_path = cls._build_model_legacy(
pretrained_model_name_or_path, **kwargs
......@@ -384,7 +389,10 @@ class NunchakuFluxTransformer2dModel(FluxTransformer2DModel, NunchakuModelLoader
transformer.load_state_dict(unquantized_part_sd, strict=False)
transformer._unquantized_part_sd = unquantized_part_sd
return transformer
if kwargs.get("return_metadata", False):
return transformer, metadata
else:
return transformer
def inject_quantized_module(self, m: QuantizedFluxModel, device: str | torch.device = "cuda"):
print("Injecting quantized module")
......
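A minimal Python sketch of the new single-file loading path with `return_metadata=True` (the checkpoint path is illustrative and assumes a merged single-file SVDQuant model whose metadata was written by `merge_safetensors`):

```python
import json

import torch

from nunchaku import NunchakuFluxTransformer2dModel

# Illustrative path: any merged single-file SVDQuant FLUX checkpoint should work.
transformer, metadata = NunchakuFluxTransformer2dModel.from_pretrained(
    "models/diffusion_models/svdq-int4_r32-flux.1-dev.safetensors",
    torch_dtype=torch.bfloat16,
    return_metadata=True,
)
print(metadata["model_class"])  # e.g. "NunchakuFluxTransformer2dModel"
print(json.loads(metadata["quantization_config"])["weight"]["dtype"])  # "int4" or "fp4_e2m1_all"
```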
......@@ -146,13 +146,14 @@ class NunchakuSanaTransformer2DModel(SanaTransformer2DModel, NunchakuModelLoader
device = torch.device(device)
pag_layers = kwargs.get("pag_layers", [])
precision = get_precision(kwargs.get("precision", "auto"), device, pretrained_model_name_or_path)
metadata = None
if isinstance(pretrained_model_name_or_path, str):
pretrained_model_name_or_path = Path(pretrained_model_name_or_path)
if pretrained_model_name_or_path.is_file() or pretrained_model_name_or_path.name.endswith(
(".safetensors", ".sft")
):
transformer, model_state_dict = cls._build_model(pretrained_model_name_or_path)
transformer, model_state_dict, metadata = cls._build_model(pretrained_model_name_or_path)
quantized_part_sd = {}
unquantized_part_sd = {}
for k, v in model_state_dict.items():
......@@ -177,7 +178,10 @@ class NunchakuSanaTransformer2DModel(SanaTransformer2DModel, NunchakuModelLoader
transformer.to_empty(device=device)
unquantized_state_dict = load_file(unquantized_part_path)
transformer.load_state_dict(unquantized_state_dict, strict=False)
return transformer
if kwargs.get("return_metadata", False):
return transformer, metadata
else:
return transformer
def inject_quantized_module(self, m: QuantizedSanaModel, device: str | torch.device = "cuda"):
self.transformer_blocks = torch.nn.ModuleList([NunchakuSanaTransformerBlocks(m, self.dtype, device)])
......
......@@ -24,19 +24,18 @@ class NunchakuModelLoaderMixin:
@classmethod
def _build_model(
cls, pretrained_model_name_or_path: str | os.PathLike[str], **kwargs
) -> tuple[nn.Module, dict[str, torch.Tensor]]:
) -> tuple[nn.Module, dict[str, torch.Tensor], dict[str, str]]:
if isinstance(pretrained_model_name_or_path, str):
pretrained_model_name_or_path = Path(pretrained_model_name_or_path)
state_dict = load_state_dict_in_safetensors(pretrained_model_name_or_path, return_metadata=True)
state_dict, metadata = load_state_dict_in_safetensors(pretrained_model_name_or_path, return_metadata=True)
# Load the config file
metadata = state_dict.pop("__metadata__", {})
config = json.loads(metadata["config"])
with torch.device("meta"):
transformer = cls.from_config(config).to(kwargs.get("torch_dtype", torch.bfloat16))
return transformer, state_dict
return transformer, state_dict, metadata
@classmethod
def _build_model_legacy(
......@@ -45,8 +44,8 @@ class NunchakuModelLoaderMixin:
logger.warning(
"Loading models from a folder will be deprecated in v0.4. "
"Please download the latest safetensors model, or use one of the following tools to "
"merge your model into a single file: the CLI utility `python -m nunchaku.merge_models` "
"or the ComfyUI node `MergeFolderIntoSingleFile`."
"merge your model into a single file: the CLI utility `python -m nunchaku.merge_safetensors` "
"or the ComfyUI workflow `merge_safetensors.json`."
)
subfolder = kwargs.get("subfolder", None)
if os.path.exists(pretrained_model_name_or_path):
......
import os
import warnings
from os import PathLike
from pathlib import Path
import safetensors
......@@ -43,35 +43,22 @@ def ceil_divide(x: int, divisor: int) -> int:
def load_state_dict_in_safetensors(
path: str | PathLike[str],
path: str | os.PathLike[str],
device: str | torch.device = "cpu",
filter_prefix: str = "",
return_metadata: bool = False,
) -> dict[str, torch.Tensor]:
"""Load state dict in SafeTensors.
Args:
path (`str`):
file path.
device (`str` | `torch.device`, optional, defaults to `"cpu"`):
device.
filter_prefix (`str`, optional, defaults to `""`):
filter prefix.
Returns:
`dict`:
loaded SafeTensors.
"""
) -> dict[str, torch.Tensor] | tuple[dict[str, torch.Tensor], dict[str, str]]:
state_dict = {}
with safetensors.safe_open(fetch_or_download(path), framework="pt", device=device) as f:
metadata = f.metadata()
if return_metadata:
state_dict["__metadata__"] = metadata
for k in f.keys():
if filter_prefix and not k.startswith(filter_prefix):
continue
state_dict[k.removeprefix(filter_prefix)] = f.get_tensor(k)
return state_dict
if return_metadata:
return state_dict, metadata
else:
return state_dict
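For example, a minimal sketch of both call forms (the filename is illustrative):

```python
from nunchaku.utils import load_state_dict_in_safetensors

# Tensors only (previous behavior).
sd = load_state_dict_in_safetensors("svdq-int4_r32-flux.1-dev.safetensors")

# Tensors plus the safetensors metadata dict (new behavior).
sd, metadata = load_state_dict_in_safetensors(
    "svdq-int4_r32-flux.1-dev.safetensors", return_metadata=True
)
```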
def filter_state_dict(state_dict: dict[str, torch.Tensor], filter_prefix: str = "") -> dict[str, torch.Tensor]:
......@@ -91,7 +78,9 @@ def filter_state_dict(state_dict: dict[str, torch.Tensor], filter_prefix: str =
def get_precision(
precision: str = "auto", device: str | torch.device = "cuda", pretrained_model_name_or_path: str | None = None
precision: str = "auto",
device: str | torch.device = "cuda",
pretrained_model_name_or_path: str | os.PathLike[str] | None = None,
) -> str:
assert precision in ("auto", "int4", "fp4")
if precision == "auto":
......@@ -102,10 +91,10 @@ def get_precision(
precision = "fp4" if sm == "120" else "int4"
if pretrained_model_name_or_path is not None:
if precision == "int4":
if "fp4" in pretrained_model_name_or_path:
if "fp4" in str(pretrained_model_name_or_path):
warnings.warn("The model may be quantized to fp4, but you are loading it with int4 precision.")
elif precision == "fp4":
if "int4" in pretrained_model_name_or_path:
if "int4" in str(pretrained_model_name_or_path):
warnings.warn("The model may be quantized to int4, but you are loading it with fp4 precision.")
return precision
......@@ -146,3 +135,21 @@ def get_gpu_memory(device: str | torch.device = "cuda", unit: str = "GiB") -> in
return memory // (1024**2)
else:
return memory
def check_hardware_compatibility(quantization_config: dict, device: str | torch.device = "cuda"):
if isinstance(device, str):
device = torch.device(device)
capability = torch.cuda.get_device_capability(0 if device.index is None else device.index)
sm = f"{capability[0]}{capability[1]}"
if sm == "120": # you can only use the fp4 models
if quantization_config["weight"]["dtype"] != "fp4_e2m1_all":
raise ValueError('Please use "fp4" quantization for Blackwell GPUs. ')
elif sm in ["75", "80", "86", "89"]:
if quantization_config["weight"]["dtype"] != "int4":
raise ValueError('Please use "int4" quantization for Turing, Ampere and Ada GPUs. ')
else:
raise ValueError(
f"Unsupported GPU architecture {sm} due to the lack of 4-bit tensorcores. "
"Please use a Turing, Ampere, Ada or Blackwell GPU for this quantization configuration."
)
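A hedged sketch of driving this check from a checkpoint's embedded metadata (the filename is illustrative):

```python
import json

from safetensors import safe_open

from nunchaku.utils import check_hardware_compatibility

# Read the quantization config written into the single-file checkpoint's metadata.
with safe_open("svdq-int4_r32-flux.1-dev.safetensors", framework="pt", device="cpu") as f:
    quantization_config = json.loads(f.metadata()["quantization_config"])

# Raises ValueError if the checkpoint's 4-bit format does not match the GPU architecture.
check_hardware_compatibility(quantization_config, "cuda")
```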
diffusion_models:
- repo_id: "mit-han-lab/nunchaku-t5"
filename: "awq-int4-flux.1-t5xxl.safetensors"
sub_folder: "text_encoders"
new_filename: null
- repo_id: "mit-han-lab/nunchaku-flux.1-dev"
filename: "svdq-{precision}_r32-flux.1-dev.safetensors"
sub_folder: "diffusion_models"
new_filename: null
- repo_id: "mit-han-lab/nunchaku-flux.1-schnell"
filename: "svdq-{precision}_r32-flux.1-schnell.safetensors"
sub_folder: "diffusion_models"
new_filename: null
- repo_id: "mit-han-lab/nunchaku-flux.1-depth-dev"
filename: "svdq-{precision}_r32-flux.1-depth-dev.safetensors"
sub_folder: "diffusion_models"
new_filename: null
- repo_id: "mit-han-lab/nunchaku-flux.1-canny-dev"
filename: "svdq-{precision}_r32-flux.1-canny-dev.safetensors"
sub_folder: "diffusion_models"
new_filename: null
- repo_id: "mit-han-lab/nunchaku-flux.1-fill-dev"
filename: "svdq-{precision}_r32-flux.1-fill-dev.safetensors"
sub_folder: "diffusion_models"
new_filename: null
- repo_id: "mit-han-lab/nunchaku-shuttle-jaguar"
filename: "svdq-{precision}_r32-shuttle-jaguar.safetensors"
sub_folder: "diffusion_models"
new_filename: null
# Nunchaku Tests
Nunchaku uses pytest as its testing framework.
## Setting Up Test Environments
After installing `nunchaku` as described in the [README](../README.md#installation), you can install the test dependencies with:
```shell
pip install -r tests/requirements.txt
```
## Running the Tests
```shell
HF_TOKEN=$YOUR_HF_TOKEN pytest -v tests/flux/test_flux_memory.py
HF_TOKEN=$YOUR_HF_TOKEN pytest -v tests/flux --ignore=tests/flux/test_flux_memory.py
......@@ -15,7 +19,7 @@ HF_TOKEN=$YOUR_HF_TOKEN pytest -v tests/sana
```
> **Note:** `$YOUR_HF_TOKEN` refers to your Hugging Face access token, required to download models and datasets. You can create one at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).
> If you've already logged in using `huggingface-cli login`, you can skip setting this environment variable.
> If you've already logged in using `huggingface-cli login`, you can skip setting this environment variable.
Some tests generate images using the original 16-bit models. You can cache these results to speed up future test runs by setting the environment variable `NUNCHAKU_TEST_CACHE_ROOT`. If not set, the images will be saved in `test_results/ref`.
......@@ -27,9 +31,9 @@ To test visual output correctness, you can:
1. **Generate reference images:** Use the original 16-bit model to produce a small number of reference images (e.g., 4).
2. **Generate comparison images:** Run your method using the **same inputs and seeds** to ensure deterministic outputs. You can control the seed by setting the `generator` parameter in the diffusers pipeline.
1. **Generate comparison images:** Run your method using the **same inputs and seeds** to ensure deterministic outputs. You can control the seed by setting the `generator` parameter in the diffusers pipeline.
3. **Compute similarity:** Evaluate the similarity between your outputs and the reference images using the [LPIPS](https://arxiv.org/abs/1801.03924) metric. Use the `compute_lpips` function provided in [`tests/flux/utils.py`](flux/utils.py):
1. **Compute similarity:** Evaluate the similarity between your outputs and the reference images using the [LPIPS](https://arxiv.org/abs/1801.03924) metric. Use the `compute_lpips` function provided in [`tests/flux/utils.py`](flux/utils.py):
```python
lpips = compute_lpips(dir1, dir2)
......
......@@ -9,7 +9,7 @@ from .utils import run_test
@pytest.mark.parametrize(
"height,width,attention_impl,cpu_offload,expected_lpips",
[
(1024, 1024, "flashattn2", False, 0.126 if get_precision() == "int4" else 0.126),
(1024, 1024, "flashattn2", False, 0.141 if get_precision() == "int4" else 0.126),
(1024, 1024, "nunchaku-fp16", False, 0.139 if get_precision() == "int4" else 0.126),
(1920, 1080, "nunchaku-fp16", False, 0.190 if get_precision() == "int4" else 0.138),
(2048, 2048, "nunchaku-fp16", True, 0.166 if get_precision() == "int4" else 0.120),
......
......@@ -44,7 +44,7 @@ from .utils import already_generate, compute_lpips, offload_pipeline
"muppets",
42,
0.3,
0.360 if get_precision() == "int4" else 0.495,
0.507 if get_precision() == "int4" else 0.495,
),
(
1024,
......
import os
from huggingface_hub import HfApi, HfFolder, create_repo, upload_folder
# Configuration
LOCAL_MODELS_DIR = "nunchaku-models"
HUGGINGFACE_NAMESPACE = "mit-han-lab"
PRIVATE = False # Set to True if you want the repos to be private
# Initialize API
api = HfApi()
# Get your token from local cache
token = HfFolder.get_token()
# Iterate over all folders in the models directory
for model_name in os.listdir(LOCAL_MODELS_DIR):
model_path = os.path.join(LOCAL_MODELS_DIR, model_name)
if not os.path.isdir(model_path):
continue # Skip non-folder files
repo_id = f"{HUGGINGFACE_NAMESPACE}/{model_name}"
print(f"\n📦 Uploading {model_path} to {repo_id}")
# Create the repo (skip if it exists)
try:
create_repo(repo_id, token=token, repo_type="model", private=PRIVATE, exist_ok=True)
except Exception as e:
print(f"⚠️ Failed to create repo {repo_id}: {e}")
continue
# Upload the local model folder
try:
upload_folder(
folder_path=model_path,
repo_id=repo_id,
token=token,
repo_type="model",
path_in_repo="", # root of repo
)
print(f"✅ Uploaded {model_name} successfully.")
except Exception as e:
print(f"❌ Upload failed for {model_name}: {e}")