# BAGEL-7B-MoT

## Set up

Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.

## Run examples

**Note**: These examples work with the default configuration on an **NVIDIA A100 (80GB)**. We also tested on dual **NVIDIA RTX 5000 Ada (32GB each)**. For dual-GPU setups, please modify the stage configuration to distribute the model across devices.

Change into the bagel folder:

```bash
cd examples/offline_inference/bagel
```

### Modality Control

BAGEL-7B-MoT supports multiple modality modes. You can control the mode using the `--modality` argument:

#### Text to Image (text2img)

- **Pipeline**: Text → Thinker → DiT → VAE Decode → Image
- **Stages Used**: Stage 0 (Thinker) + Stage 1 (DiT)
- **KV Transfer**: The Thinker sends its KV cache to the DiT for conditioned generation

Generate images from text prompts:

```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --modality text2img \
    --prompts "A cute cat"
```

#### Image to Image (img2img)

- **Pipeline**: Image → VAE Encode → DiT → VAE Decode → New Image
- **Stages Used**: Stage 1 (DiT) only
- **Special**: Bypasses the Thinker stage for direct image-to-image transformation

Transform images based on text prompts:

```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --modality img2img \
    --image-path /path/to/image.jpg \
    --prompts "Let the woman wear a blue dress"
```

#### Image to Text (img2text)

- **Pipeline**: Image → ViT + VAE Encode → Thinker → Text Output
- **Stages Used**: Stage 0 (Thinker) only
- **Special**: Uses both VAE latent encoding and ViT semantic encoding for comprehensive image understanding

Generate text descriptions from images:

```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --modality img2text \
    --image-path /path/to/image.jpg \
    --prompts "Describe this image in detail"
```

#### Text to Text (text2text)

- **Pipeline**: Text → Thinker → Text Output
- **Stages Used**: Stage 0 (Thinker) only
- **Special**: No visual components involved; operates as a pure language model

Pure text generation:

```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --modality text2text \
    --prompts "What is the capital of France?"

# You can load prompts from a text file (one prompt per line):
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --modality text2text \
    --txt-prompts /path/to/prompts.txt
```

### Inference Steps

Control the number of inference steps for image generation:

```bash
# Increase --steps (e.g., to 100) to improve image quality
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --modality text2img \
    --steps 50 \
    --prompts "A cute cat"
```

### Key arguments

BAGEL-7B-MoT supports **multiple modality modes** for different use cases. The default YAML configuration deploys the Thinker and DiT stages on the same GPU; for dual-GPU setups you can pass a modified config via `--stage-configs-path`, as in the sketch below.
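Below is a minimal, hypothetical sketch of a dual-GPU variant. It assumes the stage config is a list of per-stage entries using the field names from the parameter tables that follow; the shipped `bagel.yaml` (linked below) is the authoritative reference, and fields not shown here are omitted for brevity.

```yaml
# Hypothetical dual-GPU stage config sketch: Stage 0 (Thinker) stays on
# GPU 0 while Stage 1 (DiT) moves to GPU 1. Field names follow the
# parameter tables below; the list-of-stages layout is an assumption.
- stage_type: llm
  model_stage: thinker
  devices: "0"
  gpu_memory_utilization: 0.4
  omni_kv_config:
    need_send_cache: true
- stage_type: diffusion
  model_stage: dit
  devices: "1"               # moved off GPU 0
  gpu_memory_utilization: 0.4
  omni_kv_config:
    need_recv_cache: true
  engine_input_source: [0]   # consume Stage 0's output
```

You would then run with `--stage-configs-path /path/to/your_config.yaml` (path is illustrative).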
You can use the default configuration file: [`bagel.yaml`](../../../vllm_omni/model_executor/stage_configs/bagel.yaml)

#### 📌 Command Line Arguments (end2end.py)

| Argument | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `--model` | string | `ByteDance-Seed/BAGEL-7B-MoT` | Model path or name |
| `--modality` | choice | `text2img` | Modality mode: `text2img`, `img2img`, `img2text`, `text2text` |
| `--prompts` | list | `None` | Input text prompts given directly on the command line |
| `--txt-prompts` | string | `None` | Path to a txt file with one prompt per line |
| `--image-path` | string | `None` | Input image path (for `img2img`/`img2text`) |
| `--steps` | int | `50` | Number of inference steps |
| `--stage-configs-path` | string | `None` | Custom stage config file path |
| `--worker-backend` | choice | `process` | Worker backend: `process` or `ray` |
| `--ray-address` | string | `None` | Ray cluster address |
| `--enable-stats` | flag | `False` | Enable statistics logging |
| `--init-sleep-seconds` | int | `20` | Initialization sleep time (seconds) |
| `--batch-timeout` | int | `5` | Batch timeout |
| `--init-timeout` | int | `300` | Initialization timeout |

------

#### ⚙️ Stage Configuration Parameters (bagel.yaml)

**Stage 0 - Thinker (LLM Stage)**

| Parameter | Value | Description |
| :--- | :--- | :--- |
| `stage_type` | `llm` | Stage type |
| `devices` | `"0"` | GPU device ID |
| `max_batch_size` | `1` | Maximum batch size |
| `model_stage` | `thinker` | Model stage identifier |
| `model_arch` | `BagelForConditionalGeneration` | Model architecture |
| `gpu_memory_utilization` | `0.4` | GPU memory utilization |
| `tensor_parallel_size` | `1` | Tensor parallel size |
| `max_num_batched_tokens` | `32768` | Maximum batched tokens |
| `omni_kv_config.need_send_cache` | `true` | Whether to send the KV cache |

------

**Stage 1 - DiT (Diffusion Stage)**

| Parameter | Value | Description |
| :--- | :--- | :--- |
| `stage_type` | `diffusion` | Stage type |
| `devices` | `"0"` | GPU device ID |
| `max_batch_size` | `1` | Maximum batch size |
| `model_stage` | `dit` | Model stage identifier |
| `gpu_memory_utilization` | `0.4` | GPU memory utilization |
| `omni_kv_config.need_recv_cache` | `true` | Whether to receive the KV cache |
| `engine_input_source` | `[0]` | Input source (Stage 0) |

------

#### 🔗 Runtime Configuration

| Parameter | Value | Description |
| :--- | :--- | :--- |
| `window_size` | `-1` | Window size (`-1` means unlimited) |
| `max_inflight` | `1` | Maximum in-flight requests |
| `shm_threshold_bytes` | `65536` | Shared memory threshold (64 KiB) |

## FAQ

- If you encounter an error about librosa's audio backend, try installing ffmpeg:

    ```bash
    sudo apt update
    sudo apt install ffmpeg
    ```

- If you are unsure how much VRAM the model needs, or you encounter an out-of-memory (OOM) error, try decreasing `max_model_len`. Approximate per-stage VRAM usage:

    | Stage | VRAM |
    | :--- | :--- |
    | Stage-0 (Thinker) | **15.04 GiB + KV cache** |
    | Stage-1 (DiT) | **26.50 GiB** |
    | Total | **~42 GiB + KV cache** |
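As a hedged illustration of that adjustment, the excerpt below assumes the stage config accepts the standard vLLM engine argument `max_model_len`; the field placement and values are illustrative, so check `bagel.yaml` for the exact schema:

```yaml
# Hypothetical excerpt of a custom stage config: cap the Thinker stage's
# context length so its KV cache needs less VRAM. Whether this config
# exposes max_model_len under this exact key is an assumption.
- stage_type: llm
  model_stage: thinker
  gpu_memory_utilization: 0.4
  max_model_len: 8192             # lower value => smaller KV cache
  max_num_batched_tokens: 8192    # optionally reduce alongside max_model_len
```

Pass the modified file to `end2end.py` with `--stage-configs-path`.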