# BAGEL-7B-MoT

Source: <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/bagel>.

## Set up

Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.

## Run examples

**Note**: These examples work with the default configuration on an **NVIDIA A100 (80GB)**. We also tested on dual **NVIDIA RTX 5000 Ada (32GB each)**. For dual-GPU setups, please modify the stage configuration to distribute the model across devices.
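For a dual-GPU split, each stage gets its own `devices` entry in a copy of the stage configuration, which you then pass via `--stage-configs-path`. The fragment below is a sketch only, built from the `stage_type`/`devices` keys documented in the parameter tables later in this file; the surrounding YAML structure may differ, so check the actual config file:

```yaml
# Sketch of a dual-GPU split: one stage per device.
- stage_type: llm               # Stage 0 (Thinker)
  devices: "0"
  gpu_memory_utilization: 0.8   # can likely be raised when a stage has a GPU to itself
- stage_type: diffusion         # Stage 1 (DiT)
  devices: "1"
  gpu_memory_utilization: 0.8
```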

Navigate to the bagel folder:

```bash
cd examples/offline_inference/bagel
```

### Modality Control

BAGEL-7B-MoT supports multiple modality modes. You can control the mode using the `--modality` argument:

#### Text to Image (text2img)

- **Pipeline**: Text → Thinker → DiT → VAE Decode → Image
- **Stages Used**: Stage 0 (Thinker) + Stage 1 (DiT)
- **KV Transfer**: Thinker sends KV cache to DiT for conditioned generation

Generate images from text prompts:

```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
                  --modality text2img \
                  --prompts "A cute cat"
```

#### Image to Image (img2img)

- **Pipeline**: Image → VAE Encode → DiT → VAE Decode → New Image
- **Stages Used**: Stage 1 (DiT) only
- **Special**: Bypasses the Thinker stage for direct image-to-image transformation

Transform images based on text prompts:

```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
                  --modality img2img \
                  --image-path /path/to/image.jpg \
                  --prompts "Let the woman wear a blue dress"
```

#### Image to Text (img2text)

- **Pipeline**: Image → ViT + VAE Encode → Thinker → Text Output
- **Stages Used**: Stage 0 (Thinker) only
- **Special**: Uses both VAE latent encoding and ViT semantic encoding for comprehensive image understanding

Generate text descriptions from images:

```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
                  --modality img2text \
                  --image-path /path/to/image.jpg \
                  --prompts "Describe this image in detail"
```

#### Text to Text (text2text)

- **Pipeline**: Text → Thinker → Text Output
- **Stages Used**: Stage 0 (Thinker) only
- **Special**: No visual components involved; operates as a pure language model

Pure text generation:

```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
                  --modality text2text \
                  --prompts "What is the capital of France?"

# You can load prompts from a text file (one prompt per line):  
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
                  --modality text2text \
                  --txt-prompts /path/to/prompts.txt
```
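As a minimal sketch of the `--txt-prompts` format, a prompts file can be created like this (the filename `prompts.txt` is just an example):

```bash
# One prompt per line; pass the file via --txt-prompts
cat > prompts.txt <<'EOF'
What is the capital of France?
Summarize the water cycle in one sentence.
EOF
```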

### Inference Steps

Control the number of inference steps for image generation:

```bash
# Increase steps (e.g. from the default 50 to 100) to improve image quality
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
                  --modality text2img \
                  --steps 100 \
                  --prompts "A cute cat"
```

### Key arguments

The default YAML configuration deploys the Thinker and DiT stages on the same GPU: [`bagel.yaml`](../../../vllm_omni/model_executor/stage_configs/bagel.yaml). To override it, pass a custom file via `--stage-configs-path`.

#### 📌 Command Line Arguments (end2end.py)

| Argument               | Type   | Default                       | Description                                                  |
| :--------------------- | :----- | :---------------------------- | :----------------------------------------------------------- |
| `--model`              | string | `ByteDance-Seed/BAGEL-7B-MoT` | Model path or name                                           |
| `--modality`           | choice | `text2img`                    | Modality mode: `text2img`, `img2img`, `img2text`, `text2text` |
| `--prompts`            | list   | `None`                        | Input text prompts directly                                  |
| `--txt-prompts`        | string | `None`                        | Path to txt file with one prompt per line                    |
| `--image-path`         | string | `None`                        | Input image path (for `img2img`/`img2text`)                  |
| `--steps`              | int    | `50`                          | Number of inference steps                                    |
| `--stage-configs-path` | string | `None`                        | Custom stage config file path                                |
| `--worker-backend`     | choice | `process`                     | Worker backend: `process` or `ray`                           |
| `--ray-address`        | string | `None`                        | Ray cluster address                                          |
| `--enable-stats`       | flag   | `False`                       | Enable statistics logging                                    |
| `--init-sleep-seconds` | int    | `20`                          | Initialization sleep time                                    |
| `--batch-timeout`      | int    | `5`                           | Batch timeout                                                |
| `--init-timeout`       | int    | `300`                         | Initialization timeout                                       |

------

#### ⚙️ Stage Configuration Parameters (bagel.yaml)

**Stage 0 - Thinker (LLM Stage)**

| Parameter                        | Value                           | Description              |
| :------------------------------- | :------------------------------ | :----------------------- |
| `stage_type`                     | `llm`                           | Stage type               |
| `devices`                        | `"0"`                           | GPU device ID            |
| `max_batch_size`                 | `1`                             | Maximum batch size       |
| `model_stage`                    | `thinker`                       | Model stage identifier   |
| `model_arch`                     | `BagelForConditionalGeneration` | Model architecture       |
| `gpu_memory_utilization`         | `0.4`                           | GPU memory utilization   |
| `tensor_parallel_size`           | `1`                             | Tensor parallel size     |
| `max_num_batched_tokens`         | `32768`                         | Maximum batched tokens   |
| `omni_kv_config.need_send_cache` | `true`                          | Whether to send KV cache |

------

**Stage 1 - DiT (Diffusion Stage)**

| Parameter                        | Value       | Description                 |
| :------------------------------- | :---------- | :-------------------------- |
| `stage_type`                     | `diffusion` | Stage type                  |
| `devices`                        | `"0"`       | GPU device ID               |
| `max_batch_size`                 | `1`         | Maximum batch size          |
| `model_stage`                    | `dit`       | Model stage identifier      |
| `gpu_memory_utilization`         | `0.4`       | GPU memory utilization      |
| `omni_kv_config.need_recv_cache` | `true`      | Whether to receive KV cache |
| `engine_input_source`            | `[0]`       | Input source from Stage 0   |

------

#### 🔗 Runtime Configuration

| Parameter             | Value   | Description                      |
| :-------------------- | :------ | :------------------------------- |
| `window_size`         | `-1`    | Window size (-1 means unlimited) |
| `max_inflight`        | `1`     | Maximum inflight requests        |
| `shm_threshold_bytes` | `65536` | Shared memory threshold (64KB)   |
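Putting the three tables together, the shipped `bagel.yaml` can be sketched roughly as below. This is a reconstruction from the parameter tables, not a copy of the file; the `stages`/`runtime` top-level keys and the exact nesting are assumptions, so treat it as illustrative only:

```yaml
# Illustrative sketch reconstructed from the tables above -- not the literal file.
stages:
  - stage_type: llm                 # Stage 0: Thinker
    devices: "0"
    max_batch_size: 1
    model_stage: thinker
    model_arch: BagelForConditionalGeneration
    gpu_memory_utilization: 0.4
    tensor_parallel_size: 1
    max_num_batched_tokens: 32768
    omni_kv_config:
      need_send_cache: true         # hand the KV cache to the DiT stage
  - stage_type: diffusion           # Stage 1: DiT
    devices: "0"                    # same GPU as Stage 0 by default
    max_batch_size: 1
    model_stage: dit
    gpu_memory_utilization: 0.4
    omni_kv_config:
      need_recv_cache: true         # receive the Thinker's KV cache
    engine_input_source: [0]        # consume Stage 0 output
runtime:
  window_size: -1                   # -1 means unlimited
  max_inflight: 1
  shm_threshold_bytes: 65536        # 64 KiB
```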

## FAQ

- If you encounter an error about the librosa backend, try installing ffmpeg:

```bash
sudo apt update
sudo apt install ffmpeg
```

- If you are unsure how much VRAM the model needs, or you encounter an out-of-memory (OOM) error, try decreasing `max_model_len`. Approximate VRAM usage per stage:

| Stage               | VRAM                         |
| :------------------ | :--------------------------- |
| Stage-0 (Thinker)   | **15.04 GiB** **+ KV Cache** |
| Stage-1 (DiT)       | **26.50 GiB**                |
| Total               | **~42 GiB + KV Cache**       |