
# SAT CogView3 & CogView-3-Plus

[Read this in Chinese](./README_zh.md)

This folder contains inference and fine-tuning code for the [SAT](https://github.com/THUDM/SwissArmyTransformer) weights.

This is the framework the team used during model training. It contains few comments, so expect to read the code carefully.

## Step-by-step guide to running the model

### 1. Environment setup

Ensure you have installed the dependencies required by this folder:

```shell
pip install -r requirements.txt
```

### 2. Download model weights

Download the weights for the model you want to run from the links below:

### CogView-3-Plus-3B

+ transformer: https://cloud.tsinghua.edu.cn/d/f913eabd3f3b4e28857c
+ vae: https://cloud.tsinghua.edu.cn/d/af4cc066ce8a4cf2ab79

### CogView-3-Base-3B

+ transformer:
    + cogview3-base: https://cloud.tsinghua.edu.cn/d/242b66daf4424fa99bf0
    + cogview3-base-distill-4step: https://cloud.tsinghua.edu.cn/d/d10032a94db647f5aa0e
    + cogview3-base-distill-8step: https://cloud.tsinghua.edu.cn/d/1598d4fe4ebf4afcb6ae
  
  **These three versions are interchangeable. Choose the one that suits your needs and run it with the corresponding configuration file.**

+ vae: https://cloud.tsinghua.edu.cn/d/c8b9497fc5124d71818a/ 

### CogView-3-Base-3B-Relay

+ transformer:
    + cogview3-relay: https://cloud.tsinghua.edu.cn/d/134951acced949c1a9e1/
    + cogview3-relay-distill-2step: https://cloud.tsinghua.edu.cn/d/6a902976fcb94ac48402
    + cogview3-relay-distill-1step: https://cloud.tsinghua.edu.cn/d/4d50ec092c64418f8418/
  
  **These three versions are interchangeable. Choose the one that suits your needs and run it with the corresponding configuration file.**

+ vae: Same as CogView-3-Base-3B

Next, arrange the model files in the following directory structure (shown here for CogView-3-Plus-3B):

```
.cogview3-plus-3b
├── transformer
│   ├── 1
│   │   └── mp_rank_00_model_states.pt
│   └── latest
└── vae
    └── imagekl_ch16.pt
```
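
Before moving on, it can help to confirm the layout programmatically. The following is a minimal sketch, not part of this repository: the helper name `missing_weight_files` is hypothetical, and the expected paths are simply the ones shown in the tree above.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED_FILES = (
    "transformer/1/mp_rank_00_model_states.pt",
    "transformer/latest",
    "vae/imagekl_ch16.pt",
)


def missing_weight_files(model_root):
    """Return the expected paths that do not exist under model_root."""
    root = Path(model_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).exists()]


if __name__ == "__main__":
    # The folder name here is only an example; pass your own model folder.
    missing = missing_weight_files("cogview3-plus-3b")
    print("layout OK" if not missing else f"missing: {missing}")
```

An empty list means every file from the tree is in place.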

Clone the T5 model. T5 is not trained or fine-tuned, but it is still required to run the model. You can download it separately, but the weights must be in `safetensors` format, not `bin` format; otherwise an error may occur.

Because we have already uploaded T5 in `safetensors` format as part of `CogVideoX`, a simple approach is to clone the `CogVideoX-2b` repository and move the T5 files into their own folder.

```shell
git clone https://huggingface.co/THUDM/CogVideoX-2b.git
# git clone https://www.modelscope.cn/ZhipuAI/CogVideoX-2b.git
mkdir t5-v1_1-xxl
mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl
```

With this setup, you will have the T5 model in `safetensors` format, which avoids errors during DeepSpeed fine-tuning. The `t5-v1_1-xxl` folder should contain:

```
├── added_tokens.json
├── config.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── special_tokens_map.json
├── spiece.model
└── tokenizer_config.json

0 directories, 8 files
```
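
The format requirement can be checked with a few lines of Python. This is a hedged sketch under stated assumptions: `check_t5_dir` is a hypothetical helper, and it only inspects file names from the listing above; it does not load the model.

```python
from pathlib import Path


def check_t5_dir(t5_dir):
    """Return a list of problems found in a T5 folder (empty means OK)."""
    root = Path(t5_dir)
    problems = []
    if not (root / "config.json").exists():
        problems.append("config.json not found")
    if not list(root.glob("*.safetensors")):
        problems.append("no .safetensors weight shards found")
    # The text above warns that weights in `bin` format may cause an error.
    for bin_file in sorted(root.glob("*.bin")):
        problems.append(f"unexpected bin weight file: {bin_file.name}")
    return problems


if __name__ == "__main__":
    print(check_t5_dir("t5-v1_1-xxl"))
```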

### 3. Modify the files in `configs`

Here is an example using `CogView3-Base`, with explanations for some of the parameters:

```yaml
args:
  mode: inference
  relay_model: False # Set to True when using CogView-3-Relay
  load: "cogview3_base/transformer" # Path to the transformer folder
  batch_size: 8 # Number of images per inference
  grid_num_columns: 2 # Number of columns in grid.png output
  input_type: txt # Input can be from command line or TXT file
  input_file: configs/test.txt # Not needed for command line input
  fp16: True # Set to bf16 for CogView-3-Plus inference
  # bf16: True
  sampling_image_size: 512 # Fixed size, supports 512x512 resolution images
  # For CogView-3-Plus, use the following:
  # sampling_image_size_x: 1024 (width)
  # sampling_image_size_y: 1024 (height)

  output_dir: "outputs/cogview3_base-512x512"
  # This section is for CogView-3-Relay. Set the input_dir to the folder with base model generated images.
  # input_dir: "outputs/cogview3_base-512x512" 
  deepspeed_config: { }

model:
  conditioner_config:
    target: sgm.modules.GeneralConditioner
    params:
      emb_models:
        - is_trainable: False
          input_key: txt
          target: sgm.modules.encoders.modules.FrozenT5Embedder
          params:
            model_dir: "google/t5-v1_1-xxl" # Path to the T5 safetensors folder
            max_length: 225 # Maximum prompt length

  first_stage_config:
    target: sgm.models.autoencoder.AutoencodingEngine
    params:
      ckpt_path: "cogview3_base/vae/imagekl_ch16.pt" # Path to VAE PT file
      monitor: val/rec_loss
```
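
Several of the `args` settings depend on one another. The sketch below encodes the relationships implied by the comments in the example as simple checks. It is illustrative only: `check_args` is a hypothetical helper that is not part of the repository, and treating `fp16` and `bf16` as mutually exclusive is an assumption drawn from the "set to bf16" comment.

```python
def check_args(args):
    """Return a list of warnings for an `args` mapping like the one above."""
    warnings = []
    # Assumption: only one of the two precision flags should be enabled.
    if args.get("fp16") and args.get("bf16"):
        warnings.append("fp16 and bf16 are both enabled; pick one")
    # CogView-3-Relay reads images generated by the base model from input_dir.
    if args.get("relay_model") and not args.get("input_dir"):
        warnings.append("relay_model is True but input_dir is not set")
    # input_file is only needed when prompts come from a TXT file.
    if args.get("input_type") == "txt" and not args.get("input_file"):
        warnings.append("input_type is txt but input_file is not set")
    return warnings


base_args = {
    "mode": "inference",
    "relay_model": False,
    "input_type": "txt",
    "input_file": "configs/test.txt",
    "fp16": True,
}
print(check_args(base_args))  # prints []
print(check_args({**base_args, "relay_model": True}))
```

The second call reports the missing `input_dir`, which is the setting to uncomment when switching the example to CogView-3-Relay.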

### 4. Running the model

Each model uses its own inference script and configuration file. The commands for each model are:

### CogView-3-Plus

```shell
python sample_dit.py --base configs/cogview3_plus.yaml
```

### CogView-3-Base

+ Original model

```shell
python sample_unet.py --base configs/cogview3_base.yaml
```

+ Distilled model

```shell
python sample_unet.py --base configs/cogview3_base_distill_4step.yaml
```

### CogView-3-Relay

+ Original model

```shell
python sample_unet.py --base configs/cogview3_relay.yaml
```

+ Distilled model

```shell
python sample_unet.py --base configs/cogview3_relay_distill_1step.yaml 
```
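
If you launch these commands from a script, the script and config pairings above can be kept in one table. This is a minimal sketch: the variant keys and the `build_command` helper are hypothetical names chosen for illustration, and only the script and config names that appear in the commands above are included.

```python
import shlex

# Script and config pairs copied from the commands above.
COMMANDS = {
    "plus": ("sample_dit.py", "configs/cogview3_plus.yaml"),
    "base": ("sample_unet.py", "configs/cogview3_base.yaml"),
    "base-distill-4step": ("sample_unet.py", "configs/cogview3_base_distill_4step.yaml"),
    "relay": ("sample_unet.py", "configs/cogview3_relay.yaml"),
    "relay-distill-1step": ("sample_unet.py", "configs/cogview3_relay_distill_1step.yaml"),
}


def build_command(variant):
    """Return the argument list for the given model variant."""
    script, config = COMMANDS[variant]
    return ["python", script, "--base", config]


print(shlex.join(build_command("plus")))
# prints: python sample_dit.py --base configs/cogview3_plus.yaml
```

The returned list can be passed directly to `subprocess.run`. CogView-3-Plus is the only variant that uses `sample_dit.py`; the Base and Relay variants all use `sample_unet.py`.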

Each prompt's output is written to its own folder, named with a sequence number followed by the first 15 characters of the prompt. The folder contains the generated images plus a combined `grid.png`; the number of images is set by the `batch_size` parameter. The structure should look like this:

```
.
├── 000000000.png
├── 000000001.png
├── 000000002.png
├── 000000003.png
├── 000000004.png
├── 000000005.png
├── 000000006.png
├── 000000007.png
└── grid.png

1 directory, 9 files
```

In this example, `batch_size` is 8, so there are 8 images along with one `grid.png`.
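
A quick way to confirm that a run completed is to count the files in one output folder. The sketch below assumes only the layout shown above (numbered PNG files plus `grid.png`); `check_output_folder` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def check_output_folder(folder, batch_size):
    """Return True if folder holds batch_size sample images and a grid.png."""
    root = Path(folder)
    # Count every PNG except the combined grid image.
    samples = [p for p in root.glob("*.png") if p.name != "grid.png"]
    return len(samples) == batch_size and (root / "grid.png").exists()
```

For the example above, calling it on the prompt's output folder with `batch_size` set to 8 should return `True`.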