# This is a workaround for now to pass in the raw request to stage 0. StageRequest validates it but ignores any unknown keys, so it gets passed through.
@@ -318,6 +318,95 @@ For S3 credential configuration, set the standard AWS environment variables (`AW
...
Omni pipelines are configured via YAML stage configs. See [`examples/backends/vllm/launch/stage_configs/single_stage_llm.yaml`](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/vllm/launch/stage_configs/single_stage_llm.yaml) for an example. For full documentation on stage config format and multi-stage pipelines, refer to the [vLLM-Omni Stage Configs documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/).
## Disaggregated Multi-Stage Serving
For models with multiple pipeline stages (e.g., AR + Diffusion), Dynamo supports disaggregated serving where each stage runs as an independent process on its own GPU. This enables independent scaling, GPU isolation, and multi-worker replicas per stage.
### Architecture
A lightweight router coordinates the stages, acting as a **pure message broker**: it never inspects or transforms inter-stage data.
```mermaid
flowchart LR
client(Client) --> frontend(Frontend)
frontend --> router(Router)
router -->|request| s0(Stage 0)
s0 -->|ref| router
router -->|ref| s1(Stage 1)
s1 -->|result| router
router --> frontend --> client
s0 <-->|bulk data| conn[(Connector)]
conn <--> s1
```
**How it works:**
- The router sends the initial request to Stage 0 and receives back a lightweight connector reference (pointer to the output in shared memory).
- The router forwards that reference — unchanged — to Stage 1. It never reads the bulk data.
- Each stage fetches its inputs from the connector, runs any model-specific processor (e.g., `ar2diffusion`, `thinker2talker`), then runs its engine.
- The final stage returns its result to the router, which formats it and sends the response to the client.
- Connector references accumulate as the pipeline progresses, so any stage can access outputs from all previous stages.
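
The flow described in the list above can be sketched in a few lines of Python. This is a toy, in-process illustration only: `ConnectorRef`, `Connector`, `make_ar_stage`, `make_dit_stage`, and `route` are hypothetical names, not Dynamo or vLLM-Omni APIs, and a plain dict stands in for the shared-memory connector. The point it tries to show is that the router only handles references, while the stages are the only code that touches bulk data.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class ConnectorRef:
    """Opaque handle to a stage output (hypothetical; the router only forwards it)."""
    stage_id: int
    key: str


class Connector:
    """Toy stand-in for the shared-memory connector that holds bulk outputs."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def put(self, stage_id: int, request_id: str, payload: Any) -> ConnectorRef:
        key = f"{request_id}/stage-{stage_id}"
        self._store[key] = payload
        return ConnectorRef(stage_id=stage_id, key=key)

    def get(self, ref: ConnectorRef) -> Any:
        return self._store[ref.key]


Stage = Callable[[str, dict, Tuple[ConnectorRef, ...]], Any]


def make_ar_stage(conn: Connector) -> Stage:
    def stage(request_id: str, request: dict, refs: Tuple[ConnectorRef, ...]) -> ConnectorRef:
        # Fake "AR engine": turn the prompt into token IDs, store them as bulk data.
        token_ids = [ord(ch) for ch in request["prompt"]]
        return conn.put(0, request_id, token_ids)
    return stage


def make_dit_stage(conn: Connector) -> Stage:
    def stage(request_id: str, request: dict, refs: Tuple[ConnectorRef, ...]) -> dict:
        prior = conn.get(refs[-1])   # fetch inputs from the connector
        preview = prior[:4]          # fake model-specific processor step
        # Fake "engine" result, shaped like the final response body.
        return {"data": [{"num_prior_tokens": len(prior), "preview": preview}]}
    return stage


def route(request_id: str, request: dict, stages: Sequence[Stage]) -> Any:
    """Router: forwards the request and accumulated refs, never reads bulk data."""
    refs: List[ConnectorRef] = []
    *intermediate, final = stages
    for stage in intermediate:
        # Each intermediate stage returns a reference; references accumulate.
        refs.append(stage(request_id, request, tuple(refs)))
    return final(request_id, request, tuple(refs))


if __name__ == "__main__":
    conn = Connector()
    stages = [make_ar_stage(conn), make_dit_stage(conn)]
    print(route("req-1", {"prompt": "a cat"}, stages))
```

Note that `route` has no access to the `Connector` at all; only the stage closures do, which mirrors the "pure message broker" role described above.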
### Data Flow
```mermaid
sequenceDiagram
participant C as Client
participant R as Router
participant S0 as Stage 0 (AR)
participant SHM as Connector
participant S1 as Stage 1 (DiT)
C->>R: POST /v1/images/generations
R->>S0: request + prompt
S0->>SHM: store output
S0-->>R: connector ref
R->>S1: connector ref (opaque)
S1->>SHM: fetch output
S1->>S1: processor → engine
S1-->>R: result
R-->>C: {"data": [...]}
```
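
From the client's side, the sequence above starts with a single HTTP request. The sketch below only builds that request with the standard library and does not send it. The base URL `http://localhost:8000` and the `model` and `prompt` payload fields are assumptions (they follow the common OpenAI-style images request shape and are not confirmed by this document); only the `/v1/images/generations` path and the `{"data": [...]}` response shape come from the diagram.

```python
import json
import urllib.request

# Assumption: the frontend listens here; adjust to your deployment.
BASE_URL = "http://localhost:8000"


def build_image_request(prompt: str, model: str = "zai-org/GLM-Image") -> urllib.request.Request:
    """Build (but do not send) a POST to the images endpoint shown in the diagram."""
    # Assumed payload fields; check the frontend's API reference for the real schema.
    body = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/images/generations",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_image_request("a watercolor fox")
    print(req.get_method(), req.full_url)
    # To actually send it against a running deployment:
    #   with urllib.request.urlopen(req) as resp:
    #       images = json.load(resp)["data"]
```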
### Quick Start: GLM-Image (2-Stage, 2 GPUs)
GLM-Image is a 2-stage text-to-image model with an AR stage (generates prior token IDs) and a DiT stage (diffusion denoising + VAE decode). The built-in vLLM-Omni stage config already assigns each stage to a separate GPU.
Each stage registers independently with Dynamo's service discovery. To scale a bottleneck stage, launch additional workers with the same `--stage-id` on different GPUs — the router automatically load-balances across all replicas for that stage. Other stages are unaffected.
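
To make the scaling behavior concrete, here is a minimal sketch of per-stage replica selection. It assumes a simple round-robin policy, which this document does not specify; the router's actual balancing strategy may differ. `StageRegistry` and the worker names are hypothetical and stand in for Dynamo's service discovery.

```python
from collections import defaultdict
from typing import Dict, List


class StageRegistry:
    """Hypothetical registry: maps each stage ID to the workers registered for it."""

    def __init__(self) -> None:
        self._workers: Dict[int, List[str]] = defaultdict(list)
        self._cursor: Dict[int, int] = defaultdict(int)

    def register(self, stage_id: int, worker: str) -> None:
        # Workers launched with the same stage ID become replicas of that stage.
        self._workers[stage_id].append(worker)

    def pick(self, stage_id: int) -> str:
        # Assumed policy: round-robin across the replicas of one stage.
        workers = self._workers.get(stage_id)
        if not workers:
            raise LookupError(f"no workers registered for stage {stage_id}")
        index = self._cursor[stage_id] % len(workers)
        self._cursor[stage_id] += 1
        return workers[index]


if __name__ == "__main__":
    registry = StageRegistry()
    registry.register(0, "ar-worker-gpu0")
    registry.register(1, "dit-worker-gpu1")
    registry.register(1, "dit-worker-gpu2")  # extra replica for the bottleneck stage
    print([registry.pick(1) for _ in range(3)])
    print([registry.pick(0) for _ in range(2)])
```

Each stage keeps its own cursor, so adding replicas to one stage does not change how requests are distributed to the others.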
### Tested Models
| Model | Stages | Output | Stage Config |
|---|---|---|---|
| GLM-Image (`zai-org/GLM-Image`) | AR -> DiT | Image | `glm_image.yaml` (built-in) |
### CLI Flags (Disaggregated Mode)
| Flag | Description |
|---|---|
| `--stage-id <int>` | Run as a single-stage worker for the given stage ID. Requires `--stage-configs-path`. |
| `--omni-router` | Run as the stage router. Requires `--stage-configs-path`. Mutually exclusive with `--stage-id`. |
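
The constraints in the table can be expressed with `argparse`. The sketch below is illustrative only and is not the parser Dynamo uses: the program name and the `parse_args` helper are hypothetical, and only the three flag names and their stated relationships come from this document.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="omni-worker-sketch")  # hypothetical name
    # --stage-id and --omni-router are mutually exclusive.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--stage-id", type=int, default=None,
                      help="Run as a single-stage worker for this stage ID.")
    mode.add_argument("--omni-router", action="store_true",
                      help="Run as the stage router.")
    parser.add_argument("--stage-configs-path", default=None,
                        help="Path to the YAML stage config.")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Both disaggregated modes require a stage config.
    disaggregated = args.stage_id is not None or args.omni_router
    if disaggregated and args.stage_configs_path is None:
        parser.error("--stage-id and --omni-router require --stage-configs-path")
    return args


if __name__ == "__main__":
    print(parse_args(["--stage-id", "1", "--stage-configs-path", "glm_image.yaml"]))
```

Passing both `--stage-id` and `--omni-router`, or either one without `--stage-configs-path`, makes `argparse` print a usage error and exit with status 2.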