# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
from typing import Optional
from pydantic import BaseModel
# Omni models need raw_request parsing and JSON output, so these protocol models are defined here for serialization and deserialization.
# TODO: Replace these Pydantic models with Python bindings to the Rust protocol types once PyO3 bindings are available.
class ImageNvExt(BaseModel):
    """NVIDIA extensions for image generation requests.

    Matches Rust NvExt in lib/llm/src/protocols/openai/images/nvext.rs.
    """

    annotations: Optional[list[str]] = None
    """Annotations for SSE stream events."""
    negative_prompt: Optional[str] = None
    """Optional negative prompt."""
    num_inference_steps: Optional[int] = None
    """Number of denoising steps."""
    guidance_scale: Optional[float] = None
    """CFG guidance scale."""
    seed: Optional[int] = None
    """Random seed for reproducibility."""
class NvCreateImageRequest(BaseModel):
    """Request for image generation (/v1/images/generations endpoint).

    Matches the flattened Rust NvCreateImageRequest in lib/llm/src/protocols/openai/images.rs.
This mode returns OpenAI-compatible streaming chunks
) -> AsyncGenerator[Dict[str, Any], None]:
Text input -> Text output / Image output
"""Single generation path for all request protocols and output modalities."""
"""
# (ayushag) TODO: Support all types of OmniPrompt. Right now this works only for text prompts.
parsed_request, request_type = parse_request_type(
# (ayushag) TODO: Document all I/O formats from vLLM-Omni.
    request, self.config.output_modalities
# OmniText prompts also accept negative prompts; that needs to be supported as well.
)
# Support multimodal content as well. That will involve applying the tokenizer to the prompt and loading images. Follow the general multimodal support pattern.
Dynamo supports omni (multimodal generation) models via the [vLLM-Omni](https://github.com/vllm-project/vllm-omni) backend. This enables multi-stage pipelines for tasks like text-to-text and text-to-image generation through an OpenAI-compatible API.
Dynamo supports multimodal generation through the [vLLM-Omni](https://github.com/vllm-project/vllm-omni) backend. This integration exposes text-to-text, text-to-image, and text-to-video capabilities via OpenAI-compatible API endpoints.
## Prerequisites
This guide assumes familiarity with deploying Dynamo with vLLM as described in the [vLLM backend guide](/docs/pages/backends/vllm/README.md).
Launch an aggregated deployment (frontend + omni worker) using the provided script:
The `--output-modalities` flag determines which endpoint(s) the worker registers. When set to `image`, both `/v1/chat/completions` (returns inline base64 images) and `/v1/images/generations` are available. When set to `video`, the worker serves `/v1/videos`.
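
The mapping described above can be sketched as a small lookup. This is only an illustration of the documented behavior, not Dynamo's registration code: the `ENDPOINTS_BY_MODALITY` table and the `endpoints_for` helper are hypothetical names, and only the `image` and `video` values mentioned above are covered.

```python
# Hypothetical lookup mirroring the endpoint behavior described above.
# Not part of Dynamo; the names here are illustrative only.
ENDPOINTS_BY_MODALITY = {
    "image": ["/v1/chat/completions", "/v1/images/generations"],
    "video": ["/v1/videos"],
}


def endpoints_for(output_modality: str) -> list[str]:
    """Return the endpoints documented for one --output-modalities value."""
    try:
        return list(ENDPOINTS_BY_MODALITY[output_modality])
    except KeyError:
        # Other values may exist; this sketch covers only those documented here.
        raise ValueError(
            f"no documented endpoints for {output_modality!r}"
        ) from None


print(endpoints_for("image"))
print(endpoints_for("video"))
```

Because each worker serves a single output modality at a time, the helper takes one value rather than a list.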
@@ -41,34 +60,29 @@ curl -X POST http://localhost:8000/v1/chat/completions \
...
}'
```
```
### Text-to-Image
This script uses a custom stage config (`stage_configs/single_stage_llm.yaml`) that configures the thinker stage for text generation. See [Stage Configuration](#stage-configuration) for details.
## Text-to-Image
Text-to-image uses vLLM-Omni's built-in default stage configs (no custom YAML needed). Launch without a stage config path so vLLM-Omni loads the model's default multi-stage pipeline:
Launch using the provided script with `Qwen/Qwen-Image`:
| `--connector none` | Disable KV connector (recommended for omni workers) |
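
As a rough illustration of a text-to-image request, the sketch below assembles a JSON body for `/v1/images/generations` using only the Python standard library. The option names come from the `ImageNvExt` model; the top-level `model` and `prompt` keys, the `nvext` key name, and the `build_image_request` helper are assumptions made for this example, and the host and port are borrowed from the chat completions example. The request is only built here, not sent.

```python
import json
import urllib.request
from typing import Any, Optional

# Assumed base URL, matching the chat completions example in this guide.
BASE_URL = "http://localhost:8000"


def build_image_request(
    prompt: str,
    model: str = "Qwen/Qwen-Image",
    negative_prompt: Optional[str] = None,
    num_inference_steps: Optional[int] = None,
    guidance_scale: Optional[float] = None,
    seed: Optional[int] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/images/generations."""
    nvext = {
        "negative_prompt": negative_prompt,
        "num_inference_steps": num_inference_steps,
        "guidance_scale": guidance_scale,
        "seed": seed,
    }
    # Drop unset options so they are omitted from the JSON body.
    nvext = {k: v for k, v in nvext.items() if v is not None}

    body: dict[str, Any] = {"model": model, "prompt": prompt}
    if nvext:
        # Assumption: the extension fields are nested under an "nvext" key.
        body["nvext"] = nvext

    return urllib.request.Request(
        f"{BASE_URL}/v1/images/generations",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_image_request("a red bicycle", num_inference_steps=20, seed=42)
print(req.full_url)
print(json.loads(req.data))
```

To actually submit the request against a running deployment, pass `req` to `urllib.request.urlopen` and decode the JSON response.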
## Stage Configuration
Omni pipelines are configured via YAML stage configs. See [`examples/backends/vllm/launch/stage_configs/single_stage_llm.yaml`](/examples/backends/vllm/launch/stage_configs/single_stage_llm.yaml) for an example. Key fields:
Omni pipelines are configured via YAML stage configs. See [`examples/backends/vllm/launch/stage_configs/single_stage_llm.yaml`](/examples/backends/vllm/launch/stage_configs/single_stage_llm.yaml) for an example. For full documentation on stage config format and multi-stage pipelines, refer to the [vLLM-Omni Stage Configs documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/).
- **`model_stage`**: Pipeline stage name (e.g., `thinker`, `talker`, `code2wav`)
- **`final_output_type`**: Output format, either `text` or `image`
- **`is_comprehension`**: Whether this stage processes input text/multimodal content
For full documentation on stage config format, supported fields, and multi-stage pipeline examples, see the [vLLM-Omni Stage Configs documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/).
## Current Limitations
- Only text prompts are supported as input (no multimodal input yet).
- KV cache events are not published for omni workers.
- Each worker supports a single output modality at a time.