README.md

# Nunchaku INT4 FLUX.1 Models

## Text-to-Image Gradio Demo

```shell
python run_gradio.py
```

* The demo also defaults to the FLUX.1-schnell model. To switch to the FLUX.1-dev model, use `-m dev`.
* By default, the Gemma-2B model is loaded as a safety checker. To disable this feature and save GPU memory, use `--no-safety-checker`.
* To further reduce GPU memory usage, you can enable the W4A16 text encoder by specifying `--use-qencoder`.
* By default, only the INT4 DiT is loaded. Use `-p int4 bf16` to add a BF16 DiT for side-by-side comparison, or `-p bf16` to load only the BF16 model.

## Command Line Inference

We provide a script, [generate.py](generate.py), that generates an image from a text prompt directly from the command line, similar to the demo. Simply run:

```shell
python generate.py --prompt "You Text Prompt"
```

* The generated image will be saved as `output.png` by default. You can specify a different path using the `-o` or `--output-path` options.
* The script defaults to using the FLUX.1-schnell model. To switch to the FLUX.1-dev model, use `-m dev`.
* By default, the script uses our INT4 model. To use the BF16 model instead, specify `-p bf16`.
* You can specify `--use-qencoder` to use our W4A16 text encoder.
* You can adjust the number of inference steps and guidance scale with `-t` and `-g`, respectively. For the FLUX.1-schnell model, the defaults are 4 steps and a guidance scale of 0; for the FLUX.1-dev model, the defaults are 50 steps and a guidance scale of 3.5.

* When using the FLUX.1-dev model, you also have the option to load a LoRA adapter with `--lora-name`. Available choices are `None`, [`Anime`](https://huggingface.co/alvdansen/sonny-anime-fixed), [`GHIBSKY Illustration`](https://huggingface.co/aleksa-codes/flux-ghibsky-illustration), [`Realism`](https://huggingface.co/XLabs-AI/flux-RealismLora), [`Children Sketch`](https://huggingface.co/Shakker-Labs/FLUX.1-dev-LoRA-Children-Simple-Sketch), and [`Yarn Art`](https://huggingface.co/linoyts/yarn_art_Flux_LoRA), with the default set to `None`. You can also specify the LoRA weight with `--lora-weight`, which defaults to 1.

## Latency Benchmark

To measure the latency of our INT4 models, use the following command:

```shell
python latency.py
```

* The script defaults to the INT4 FLUX.1-schnell model. To switch to FLUX.1-dev, use the `-m dev` option. For BF16 precision, add `-p bf16`.
* Adjust the number of inference steps and the guidance scale using `-t` and `-g`, respectively.
  - For FLUX.1-schnell, the defaults are 4 steps and a guidance scale of 0.
  - For FLUX.1-dev, the defaults are 50 steps and a guidance scale of 3.5.
* By default, the script measures the end-to-end latency for generating a single image. To measure the latency of a single DiT forward step instead, use the `--mode step` flag.
* Specify the number of warmup and test runs using `--warmup_times` and `--test_times`. The defaults are 2 warmup runs and 10 test runs.

## Quality Results

Below are the steps to reproduce the quality metrics reported in our paper. Firstly, you will need to install slightly more packages for the image quality metrics:

```shell
pip install clean-fid torchmetrics image-reward clip datasets
```

Then generate images using both the original BF16 model and our INT4 model on the [MJHQ](https://huggingface.co/datasets/playgroundai/MJHQ-30K) and [DCI](https://github.com/facebookresearch/DCI) datasets:

```shell
python evaluate.py -p int4
python evaluate.py -p bf16
```

* The commands above will generate images from FLUX.1-schnell on both datasets. Use `-m dev` to switch to FLUX.1-dev, or specify a single dataset with `-d MJHQ` or `-d DCI`.
* By default, generated images are saved to `results/$MODEL/$PRECISION`. Customize the output path using the `-o` option if desired.
* You can also adjust the number of inference steps and the guidance scale using `-t` and `-g`, respectively.
  - For FLUX.1-schnell, the defaults are 4 steps and a guidance scale of 0.
  - For FLUX.1-dev, the defaults are 50 steps and a guidance scale of 3.5.
* To accelerate the generation process, you can distribute the workload across multiple GPUs. For instance, if you have $N$ GPUs, on GPU $i (0 \le i < N)$ , you can add the options `--chunk-start $i --chunk-step $N`. This setup ensures each GPU handles a distinct portion of the workload, enhancing overall efficiency.

Finally you can compute the metrics for the images with

```shell
python get_metrics.py results/schnell/int4 results/schnell/bf16
```

Remember to replace the example paths with the actual paths to your image folders.

**Notes:**

- The script will calculate quality metrics (CLIP IQA, CLIP Score, Image Reward, FID) only for the first folder specified. Ensure the INT4 results folder is listed first.
- **Similarity Metrics**: If a second folder path is not provided, similarity metrics (LPIPS, PSNR, SSIM) will be skipped.
- **Output File**: Metric results are saved in `metrics.json` by default. Use `-o` to specify a custom output file if needed.