# SAT CogView3 & CogView-3-Plus

[Read this in Chinese](./README_zh.md)

This folder contains inference and fine-tuning code for the [SAT](https://github.com/THUDM/SwissArmyTransformer) weights.

This code is the framework the team used during model training. It has few comments, so expect to read it carefully.

## Step-by-step guide to running the model

### 1. Environment setup

Ensure you have installed the dependencies required by this folder:

```shell
pip install -r requirements.txt
```

### 2. Download model weights

The weights for each model can be downloaded from the following links:

### CogView-3-Plus-3B

+ transformer: https://cloud.tsinghua.edu.cn/d/f913eabd3f3b4e28857c
+ vae: https://cloud.tsinghua.edu.cn/d/af4cc066ce8a4cf2ab79

### CogView-3-Base-3B

+ transformer:
    + cogview3-base: https://cloud.tsinghua.edu.cn/d/242b66daf4424fa99bf0
    + cogview3-base-distill-4step: https://cloud.tsinghua.edu.cn/d/d10032a94db647f5aa0e
    + cogview3-base-distill-8step: https://cloud.tsinghua.edu.cn/d/1598d4fe4ebf4afcb6ae
  
  **These three versions are interchangeable. Choose the one that suits your needs and run it with the corresponding configuration file.**

+ vae: https://cloud.tsinghua.edu.cn/d/c8b9497fc5124d71818a/ 

### CogView-3-Base-3B-Relay

+ transformer:
    + cogview3-relay: https://cloud.tsinghua.edu.cn/d/134951acced949c1a9e1/
    + cogview3-relay-distill-2step: https://cloud.tsinghua.edu.cn/d/6a902976fcb94ac48402
    + cogview3-relay-distill-1step: https://cloud.tsinghua.edu.cn/d/4d50ec092c64418f8418/
  
  **These three versions are interchangeable. Choose the one that suits your needs and run it with the corresponding configuration file.**

+ vae: Same as CogView-3-Base-3B

Next, arrange the model files in the following layout:

```
.cogview3-plus-3b
├── transformer
│   ├── 1
│   │   └── mp_rank_00_model_states.pt
│   └── latest
└── vae
    └── imagekl_ch16.pt
```
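
As a quick sanity check before editing the configs, the layout above can be verified with a short script. This is a minimal sketch and not part of this repository: `missing_weight_files` is a hypothetical helper, the expected paths are copied from the tree above, and the folder name passed in is an assumption that should be replaced with your own path.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED_FILES = [
    "transformer/1/mp_rank_00_model_states.pt",
    "transformer/latest",
    "vae/imagekl_ch16.pt",
]


def missing_weight_files(model_root):
    """Return the expected entries that do not exist under model_root."""
    root = Path(model_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).exists()]


# Hypothetical folder name; an empty list means nothing is missing.
print(missing_weight_files("cogview3-plus-3b"))
```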

Next, clone the T5 model. It is not trained or fine-tuned, but it is required. You can also download the T5 model separately, but the weights must be in `safetensors` format rather than `bin` format; otherwise an error may occur.

Since we have already uploaded the T5 model in `safetensors` format with `CogVideoX`, a simple approach is to clone the `CogVideoX-2b` repository and move the T5 files into a separate folder.

```shell
git clone https://huggingface.co/THUDM/CogVideoX-2b.git
# git clone https://www.modelscope.cn/ZhipuAI/CogVideoX-2b.git
mkdir t5-v1_1-xxl
mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl
```

With this setup, you will have the T5 files in `safetensors` format, which avoids errors during DeepSpeed fine-tuning.

```
├── added_tokens.json
├── config.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── special_tokens_map.json
├── spiece.model
└── tokenizer_config.json

0 directories, 8 files
```
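
The format requirement can be checked before launching anything. The following is a minimal sketch under stated assumptions: `t5_format_problems` is a hypothetical helper, not a function from this repository, and the files it looks for are taken from the listing above.

```python
from pathlib import Path


def t5_format_problems(t5_dir):
    """Return human-readable problems found in the T5 folder."""
    folder = Path(t5_dir)
    problems = []
    # The weights are expected as .safetensors shards.
    if not list(folder.glob("*.safetensors")):
        problems.append("no .safetensors weight files found")
    # .bin weights are the format this README warns against.
    bin_files = sorted(p.name for p in folder.glob("*.bin"))
    if bin_files:
        problems.append("found .bin weight files: " + ", ".join(bin_files))
    # Two of the non-weight files shown in the listing above.
    for name in ("config.json", "spiece.model"):
        if not (folder / name).is_file():
            problems.append("missing " + name)
    return problems


for problem in t5_format_problems("t5-v1_1-xxl"):
    print(problem)
```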

### 3. Modify the files in `configs`

Here is an example using `CogView3-Base`, with explanations for some of the parameters:

```yaml
args:
  mode: inference
  relay_model: False # Set to True when using CogView-3-Relay
  load: "cogview3_base/transformer" # Path to the transformer folder
  batch_size: 8 # Number of images per inference
  grid_num_columns: 2 # Number of columns in grid.png output
  input_type: txt # Input can be from command line or TXT file
  input_file: configs/test.txt # Not needed for command line input
  fp16: True # Set to bf16 for CogView-3-Plus inference
  # bf16: True
  sampling_image_size: 512 # Fixed size, supports 512x512 resolution images
  # For CogView-3-Plus, use the following:
  # sampling_image_size_x: 1024 (width)
  # sampling_image_size_y: 1024 (height)

  output_dir: "outputs/cogview3_base-512x512"
  # This section is for CogView-3-Relay. Set the input_dir to the folder with base model generated images.
  # input_dir: "outputs/cogview3_base-512x512" 
  deepspeed_config: { }

model:
  conditioner_config:
    target: sgm.modules.GeneralConditioner
    params:
      emb_models:
        - is_trainable: False
          input_key: txt
          target: sgm.modules.encoders.modules.FrozenT5Embedder
          params:
            model_dir: "google/t5-v1_1-xxl" # Path to T5 safetensors
            max_length: 225 # Maximum prompt length

  first_stage_config:
    target: sgm.models.autoencoder.AutoencodingEngine
    params:
      ckpt_path: "cogview3_base/vae/imagekl_ch16.pt" # Path to VAE PT file
      monitor: val/rec_loss
```
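
The comments in the `args` section imply a few consistency rules, which can be expressed as a small check. This is only a sketch: `check_args` is a hypothetical helper that works on a plain dictionary, the rules are assumptions read off the comments above, and the inference scripts may or may not enforce them.

```python
def check_args(args):
    """Return warnings for an args mapping shaped like the YAML above."""
    warnings = []
    # Assumption: the precision flags are alternatives, not a combination.
    if args.get("fp16") and args.get("bf16"):
        warnings.append("fp16 and bf16 are both enabled; pick one")
    # The relay model reads images generated by the base model.
    if args.get("relay_model") and not args.get("input_dir"):
        warnings.append("relay_model is True but input_dir is not set")
    # input_file is only needed when prompts come from a TXT file.
    if args.get("input_type") == "txt" and not args.get("input_file"):
        warnings.append("input_type is txt but input_file is not set")
    return warnings


# Values copied from the CogView3-Base example above.
base_args = {
    "mode": "inference",
    "relay_model": False,
    "input_type": "txt",
    "input_file": "configs/test.txt",
    "fp16": True,
}
print(check_args(base_args))  # prints []
```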

### 4. Running the model

Different models use different inference scripts and configuration files. The commands for each model are listed below:

### CogView-3-Plus

```shell
python sample_dit.py --base configs/cogview3_plus.yaml
```

### CogView-3-Base

+ Original model

```shell
python sample_unet.py --base configs/cogview3_base.yaml
```

+ Distilled model

```shell
python sample_unet.py --base configs/cogview3_base_distill_4step.yaml
```

### CogView-3-Relay

+ Original model

```shell
python sample_unet.py --base configs/cogview3_relay.yaml
```

+ Distilled model

```shell
python sample_unet.py --base configs/cogview3_relay_distill_1step.yaml 
```
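
For scripting, the commands above can be collected into one lookup table. This is a minimal sketch: `inference_command` and the `(model, variant)` keys are hypothetical names introduced for illustration, and only the script and config pairs that appear in this README are included.

```python
# (model, variant) -> (script, config), copied from the commands above.
COMMANDS = {
    ("plus", "original"): ("sample_dit.py", "configs/cogview3_plus.yaml"),
    ("base", "original"): ("sample_unet.py", "configs/cogview3_base.yaml"),
    ("base", "distill"): ("sample_unet.py", "configs/cogview3_base_distill_4step.yaml"),
    ("relay", "original"): ("sample_unet.py", "configs/cogview3_relay.yaml"),
    ("relay", "distill"): ("sample_unet.py", "configs/cogview3_relay_distill_1step.yaml"),
}


def inference_command(model, variant="original"):
    """Return the argument list for one of the commands listed above."""
    script, config = COMMANDS[(model, variant)]
    return ["python", script, "--base", config]


print(" ".join(inference_command("plus")))
# prints: python sample_dit.py --base configs/cogview3_plus.yaml
```

The returned list can be passed to `subprocess.run(..., check=True)` from this folder to launch the run.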

The output for a prompt is a folder. The folder name consists of a sequence number and the first 15 characters of the prompt, and the folder contains multiple images. The number of images is set by the `batch_size` parameter. The structure should look like this:

```
.
├── 000000000.png
├── 000000001.png
├── 000000002.png
├── 000000003.png
├── 000000004.png
├── 000000005.png
├── 000000006.png
├── 000000007.png
└── grid.png

1 directory, 9 files
```

In this example, `batch_size` is 8, so there are 8 images along with one `grid.png`.
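
To confirm that a run produced what the config asked for, the contents of an output folder can be counted. This is a minimal sketch: `summarize_output` is a hypothetical helper, and it only assumes the file naming shown in the listing above (numbered PNG files plus `grid.png`).

```python
from pathlib import Path


def summarize_output(folder):
    """Return (number of sample images, whether grid.png exists)."""
    names = sorted(p.name for p in Path(folder).glob("*.png"))
    samples = [name for name in names if name != "grid.png"]
    return len(samples), "grid.png" in names
```

For the listing above, the first value should equal the `batch_size` set in the config and the second should be `True`.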