Unverified commit 6642e23e, authored by ishandhanani, committed by GitHub

feat: sglang to 0.5.9 + updated docs (#6518)


Co-authored-by: baihuitian <baihuitian.bht@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
parent 1df620b4
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Profiling
---
# Profiling SGLang Workers in Dynamo
> [!NOTE]
> **See also**: [Profiler Component Overview](../../components/profiler/README.md) for SLA-driven profiling and deployment optimization.
Dynamo exposes profiling endpoints for SGLang workers via the system server's `/engine/*` routes. This allows you to start and stop PyTorch profiling on running inference workers without restarting them.
These endpoints wrap SGLang's internal `TokenizerManager.start_profile()` and `stop_profile()` methods. See SGLang's documentation for the full list of supported parameters.
## Quick Start
1. **Start profiling:**
```bash
curl -X POST http://localhost:9090/engine/start_profile \
-H "Content-Type: application/json" \
-d '{"output_dir": "/tmp/profiler_output"}'
```
2. **Run some inference requests to generate profiling data**
3. **Stop profiling:**
```bash
curl -X POST http://localhost:9090/engine/stop_profile
```
4. **View the traces:**
The profiler outputs Chrome trace files in the specified `output_dir`. You can view them using:
- Chrome's `chrome://tracing`
- [Perfetto UI](https://ui.perfetto.dev/)
- TensorBoard with the PyTorch Profiler plugin
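Steps 1 to 3 can also be scripted. The following is a minimal sketch, not the bundled test script: it assumes the system server listens on `localhost:9090` as in the `curl` examples, and the helper names `build_engine_request` and `profile` are hypothetical.

```python
import json
import urllib.request
from typing import Callable, Optional

SYSTEM_URL = "http://localhost:9090"  # assumption: same address as the curl examples


def build_engine_request(route: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a POST request for an /engine/* route."""
    data = None
    headers = {}
    if payload is not None:
        # Only attach a JSON body when there is something to send,
        # mirroring the curl call for stop_profile, which has no body.
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        url=f"{SYSTEM_URL}/engine/{route}",
        data=data,
        headers=headers,
        method="POST",
    )


def profile(run_workload: Callable[[], None], output_dir: str = "/tmp/profiler_output") -> None:
    """Start profiling, run the caller's workload, and always stop profiling."""
    urllib.request.urlopen(build_engine_request("start_profile", {"output_dir": output_dir}))
    try:
        run_workload()  # e.g. send a few inference requests
    finally:
        urllib.request.urlopen(build_engine_request("stop_profile"))
```

Wrapping the workload in `try`/`finally` keeps the profiler from running indefinitely if a request in the workload raises.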
## Test Script
A test script is provided at [`examples/backends/sglang/test_sglang_profile.py`](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/test_sglang_profile.py) that demonstrates the full profiling workflow:
```bash
python examples/backends/sglang/test_sglang_profile.py
```
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Diffusion
---
# Diffusion Models
Dynamo SGLang supports three types of diffusion-based generation: **LLM diffusion** (text generation via iterative refinement), **image diffusion** (text-to-image), and **video generation** (text-to-video). Each uses a different worker flag and handler, but all integrate with SGLang's `DiffGenerator`.
## Overview
| Type | Worker Flag | API Endpoint |
|------|------------|--------------|
| LLM Diffusion | `--dllm-algorithm <algo>` | `/v1/chat/completions`, `/v1/completions` |
| Image Diffusion | `--image-diffusion-worker` | `/v1/images/generations` |
| Video Generation | `--video-generation-worker` | `/v1/videos` |
<Note>
If startup fails with a cuDNN version mismatch error (`cuDNN frontend 1.8.1 requires cuDNN lib >= 9.5.0`), set `SGLANG_DISABLE_CUDNN_CHECK=1` before launching. This is common when PyTorch ships a cuDNN version older than what SGLang requires for Conv3d operations.
</Note>
## LLM Diffusion
Diffusion Language Models generate text through iterative refinement rather than autoregressive token-by-token generation. The model starts with masked tokens and progressively replaces them with predictions, refining low-confidence tokens each step.
LLM diffusion is auto-detected: when `--dllm-algorithm` is set, the worker automatically uses `DiffusionWorkerHandler` without needing a separate flag. For more details on diffusion algorithms, see the [SGLang Diffusion Language Models documentation](https://github.com/sgl-project/sglang/blob/main/docs/supported_models/text_generation/diffusion_language_models.md).
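The refinement loop described above can be illustrated with a toy sketch. This is not SGLang's implementation: `toy_predict` is a hypothetical stand-in for the model, and the schedule (commit the most confident predictions, leave the rest masked for the next step) is one simple assumption about how low-confidence positions are revisited.

```python
import random

MASK = "<mask>"


def toy_predict(position: int, rng: random.Random) -> tuple[str, float]:
    """Hypothetical stand-in for a model call: returns (token, confidence)."""
    return f"tok{position}", rng.random()


def iterative_unmask(length: int, steps: int, seed: int = 0) -> list[str]:
    """Fill a fully masked sequence over at most `steps` refinement steps."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            break
        # Predict every masked position in parallel.
        predictions = {i: toy_predict(i, rng) for i in masked}
        # Commit enough positions per step to finish by the last step (ceil division).
        remaining_steps = steps - step
        keep = -(-len(masked) // remaining_steps)
        ranked = sorted(masked, key=lambda i: predictions[i][1], reverse=True)
        for i in ranked[:keep]:
            tokens[i] = predictions[i][0]
        # Low-confidence positions stay masked and are predicted again next step.
    return tokens
```

Unlike autoregressive decoding, the positions are filled in confidence order rather than left to right, so several tokens can be committed in a single step.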
### Launch
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/diffusion_llada.sh
```
See the [launch script](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/launch/diffusion_llada.sh) for configuration options.
### Test
```bash
curl -X POST http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "inclusionAI/LLaDA2.0-mini-preview",
"messages": [{"role": "user", "content": "Hello! How are you?"}],
"temperature": 0.7,
"max_tokens": 512
}'
```
## Image Diffusion
Image diffusion workers generate images from text prompts using SGLang's `DiffGenerator`. Generated images are returned as either URLs (when using `--media-output-fs-url` for storage) or base64 data, in an OpenAI-compatible response format.
### Launch
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/image_diffusion.sh
```
The launch script supports local storage (`--fs-url file:///tmp/images`) and S3 (`--fs-url s3://bucket`). Pass `--http-url` to set the base URL for serving stored images. See the launch script for all configuration options.
### Test
```bash
curl http://localhost:8000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "black-forest-labs/FLUX.1-dev",
"prompt": "A sunset over the ocean",
"size": "1024x1024",
"response_format": "url",
"nvext": {
"num_inference_steps": 15
}
}'
```
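A client has to handle both response shapes (URL or base64). The sketch below assumes the response follows the OpenAI Images layout, with a `data` list whose items carry either `url` or `b64_json`; those field names are an assumption here, so check an actual response before relying on them.

```python
import base64
import json


def extract_images(response_body: str) -> list[dict]:
    """Return one {"kind": ..., "value": ...} entry per generated image."""
    results = []
    for item in json.loads(response_body).get("data", []):
        if item.get("url"):
            # Stored output: the caller downloads the image from this URL.
            results.append({"kind": "url", "value": item["url"]})
        elif item.get("b64_json"):
            # Inline output: decode the base64 payload into raw image bytes.
            results.append({"kind": "bytes", "value": base64.b64decode(item["b64_json"])})
    return results
```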
## Video Generation
Video generation workers produce videos from text or image prompts using SGLang's `DiffGenerator` with frame-to-video encoding. Supports text-to-video (T2V) and image-to-video (I2V) workflows.
### Launch
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/text-to-video-diffusion.sh
```
Use `--wan-size 1b` (default, 1 GPU) or `--wan-size 14b` (2 GPUs). See the launch script for all configuration options.
### Test
```bash
curl http://localhost:8000/v1/videos \
-H "Content-Type: application/json" \
-d '{
"prompt": "A curious raccoon exploring a garden",
"model": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
"seconds": 2,
"size": "832x480",
"response_format": "url",
"nvext": {
"fps": 8,
"num_frames": 17,
"num_inference_steps": 50
}
}'
```
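In the request above, `seconds`, `fps`, and `num_frames` are related: 2 seconds at 8 fps, counting both the first and the last frame, gives 17 frames. Whether the worker requires these fields to agree is not stated here, so the helper below is only a sketch under that assumption; `frames_for_clip` and `build_video_request` are hypothetical names.

```python
def frames_for_clip(seconds: float, fps: int) -> int:
    """Frame count for a clip, counting both the first and the last frame."""
    return round(seconds * fps) + 1


def build_video_request(prompt: str, model: str, seconds: int, fps: int) -> dict:
    """Build a /v1/videos payload with a frame count consistent with seconds and fps."""
    return {
        "prompt": prompt,
        "model": model,
        "seconds": seconds,
        "size": "832x480",
        "response_format": "url",
        "nvext": {
            "fps": fps,
            "num_frames": frames_for_clip(seconds, fps),
            "num_inference_steps": 50,
        },
    }
```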
## See Also
- **[Examples](sglang-examples.md)**: Launch scripts for all deployment patterns
- **[Reference Guide](sglang-reference-guide.md)**: Worker types and argument reference
- **[SGLang Diffusion LMs (upstream)](https://github.com/sgl-project/sglang/blob/main/docs/supported_models/text_generation/diffusion_language_models.md)**: SGLang diffusion documentation
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Examples
---
# SGLang Examples
For quick start instructions, see the [SGLang README](README.md). This document covers the deployment patterns for running SGLang with Dynamo: LLM, multimodal, and diffusion serving, plus Kubernetes deployment.
## Table of Contents
- [Infrastructure Setup](#infrastructure-setup)
- [LLM Serving](#llm-serving)
- [Multimodal Serving](#multimodal-serving)
- [Diffusion Models](#diffusion-models)
- [Kubernetes Deployment](#kubernetes-deployment)
- [Troubleshooting](#troubleshooting)
## Infrastructure Setup
For local/bare-metal development, start etcd and optionally NATS using Docker Compose:
```bash
docker compose -f deploy/docker-compose.yml up -d
```
<Note>
- **etcd** is optional but is the default local discovery backend. You can also pass `--discovery-backend file` to use file-system-based discovery.
- **NATS** is only needed when using KV routing with events (`--kv-events-config`). Use `--no-router-kv-events` on the frontend for prediction-based routing without NATS.
- **On Kubernetes**, neither is required when using the Dynamo operator (`DYN_DISCOVERY_BACKEND=kubernetes`).
</Note>
<Tip>
Each launch script runs the frontend and worker(s) in a single terminal. You can run each command separately in different terminals for testing. For AI agents working with Dynamo, you can run the launch script in the background and use the `curl` commands to test the deployment.
</Tip>
## LLM Serving
### Aggregated Serving
The simplest deployment pattern: a single worker handles both prefill and decode.
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/agg.sh
```
### Aggregated Serving with KV Routing
Two workers behind a [KV-aware router](../../components/router/README.md) that maximizes cache reuse:
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/agg_router.sh
```
This launches the frontend with `--router-mode kv` and two workers with ZMQ-based KV event publishing.
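The idea of routing for cache reuse can be sketched as picking the worker whose cache already holds the longest prefix of the request. This is a toy illustration, not Dynamo's router: the block hashes and worker names are hypothetical, and a production router would likely weigh other signals such as load as well.

```python
def shared_prefix_blocks(request_blocks: list[str], cached_blocks: set[str]) -> int:
    """Count how many leading blocks of the request are already cached."""
    count = 0
    for block in request_blocks:
        if block not in cached_blocks:
            break  # reuse stops at the first missing block
        count += 1
    return count


def pick_worker(request_blocks: list[str], worker_caches: dict[str, set[str]]) -> str:
    """Pick the worker with the longest cached prefix; ties go to the first name in sorted order."""
    return max(
        sorted(worker_caches),
        key=lambda worker: shared_prefix_blocks(request_blocks, worker_caches[worker]),
    )
```

In this picture, the KV events published by each worker are what would keep a table like `worker_caches` up to date on the router side.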
### Disaggregated Serving
Separates prefill and decode into independent workers connected via NIXL for KV cache transfer. Requires 2 GPUs.
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/disagg.sh
```
For details on how SGLang disaggregation works with Dynamo, including the bootstrap mechanism and RDMA transfer flow, see [SGLang Disaggregation](sglang-disaggregation.md).
### Disaggregated Serving with KV-Aware Prefill Routing
Scales to 2 prefill + 2 decode workers with KV-aware routing on both pools. Requires 4 GPUs.
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/disagg_router.sh
```
The frontend uses `--router-mode kv` and automatically detects prefill workers to activate an internal prefill router. Each worker publishes KV events over ZMQ on unique ports.
## Multimodal Serving
### Aggregated Multimodal
Serve multimodal models using SGLang's built-in multimodal support:
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/agg_vision.sh
```
<Accordion title="Verify the deployment">
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-VL-8B-Instruct",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "Describe the image."},
{"type": "image_url", "image_url": {"url": "http://images.cocodataset.org/test2017/000000155781.jpg"}}
]
}
],
"max_tokens": 50,
"stream": false
}' | jq
```
</Accordion>
### Multimodal with Disaggregated Components
For advanced multimodal deployments with separate encoder, prefill, and decode workers (E/PD and E/P/D patterns), see the dedicated [SGLang Multimodal](../../features/multimodal/multimodal-sglang.md) documentation.
| Pattern | Script | Description |
|---------|--------|-------------|
| E/PD | `./launch/multimodal_epd.sh` | Separate vision encoder + combined PD worker |
| E/P/D | `./launch/multimodal_disagg.sh` | Separate encoder, prefill, and decode workers |
## Diffusion Models
### Diffusion LM
Run diffusion language models like [LLaDA2.0](https://github.com/inclusionAI/LLaDA2.0):
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/diffusion_llada.sh
```
### Image Diffusion
Generate images from text prompts using [FLUX](https://huggingface.co/black-forest-labs/FLUX.1-dev) or other diffusion models:
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/image_diffusion.sh
```
Options: `--model-path`, `--fs-url` (local or S3), `--http-url`.
### Video Generation
Generate videos from text prompts using [Wan2.1](https://huggingface.co/Wan-AI) models:
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/text-to-video-diffusion.sh
```
Options: `--wan-size 1b|14b`, `--num-frames`, `--height`, `--width`, `--num-inference-steps`.
For full details on all diffusion worker types (LLM, image, video), see [Diffusion](sglang-diffusion.md).
## Kubernetes Deployment
For complete K8s deployment examples, see:
- [SGLang K8s deployment guide](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy)
- [SGLang aggregated router K8s example](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/sglang/deploy/agg_router.yaml)
- [Kubernetes Deployment Guide](../../kubernetes/README.md)
## Troubleshooting
### cuDNN Version Check Fails
```
RuntimeError: cuDNN frontend 1.8.1 requires cuDNN lib >= 9.5.0
```
Set `SGLANG_DISABLE_CUDNN_CHECK=1` before launching. This is common when PyTorch ships a cuDNN version older than what SGLang's Conv3d models require. It affects vision and diffusion models.
### Model Registration Fails with `config.json` Error
```
unable to extract config.json from directory ...
```
This happens with diffusers models (FLUX.1-dev, Wan2.1, etc.) that use `model_index.json` instead of `config.json`. Ensure you are using the correct worker flag (`--image-diffusion-worker` or `--video-generation-worker`) rather than the standard LLM worker mode. These flags use a registration path that does not require `config.json`.
### GPU OOM on Startup
If a previous run left orphaned GPU processes, the next launch may run out of GPU memory. Check for leftover processes and terminate them:
```bash
nvidia-smi # look for lingering sgl_diffusion::scheduler or python processes
kill -9 <PID>
```
### Disaggregated Workers Cannot Connect
Ensure both prefill and decode workers can reach each other over TCP. The bootstrap mechanism uses `--disaggregation-bootstrap-port` (default: 12345). For multi-node setups, ensure the port is reachable across hosts and set `--host 0.0.0.0`.
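A quick way to confirm reachability is a plain TCP connection attempt from one worker host to the other. The sketch below uses only the standard library; the host name in the usage note is a placeholder.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, `can_connect("prefill-host", 12345)` run from the decode node checks the bootstrap port named above. A `False` result usually points at a firewall rule or a worker bound to `127.0.0.1` rather than `0.0.0.0`.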
## See Also
- **[SGLang README](README.md)**: Quick start and feature overview
- **[Reference Guide](sglang-reference-guide.md)**: Architecture, configuration, and operational details
- **[SGLang Multimodal](../../features/multimodal/multimodal-sglang.md)**: Vision model deployment patterns
- **[SGLang HiCache](../../integrations/sglang-hicache.md)**: Hierarchical cache integration
- **[Benchmarking](../../benchmarks/benchmarking.md)**: Performance benchmarking tools
- **[Tuning Disaggregated Performance](../../performance/tuning.md)**: P/D tuning guide
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Reference Guide
subtitle: Architecture, configuration, and operational details for the SGLang backend
---
# Reference Guide
## Overview
The SGLang backend in Dynamo uses a modular architecture where `main.py` dispatches to specialized initialization modules based on the worker type. Each worker type has its own init module, request handler, health check, and registration logic.
Dynamo SGLang uses SGLang's native argument parser -- all SGLang engine arguments (e.g., `--model-path`, `--tp`, `--trust-remote-code`) are passed through directly. Dynamo adds its own arguments for worker mode selection, tokenizer control, and disaggregation configuration.
### Worker Types
| Worker Type | Description |
|------------|-------------|
| **Decode** *(default)* | Standard LLM inference (aggregated or disaggregated decode) |
| **Prefill** | Disaggregated prefill phase (`--disaggregation-mode prefill`) |
| **Embedding** | Text embedding models (`--embedding-worker`) |
| **Multimodal Processor** | HTTP entry point for multimodal, OpenAI-to-SGLang conversion (`--multimodal-processor`) |
| **Multimodal Encode** | Vision encoder and embeddings generation (`--multimodal-encode-worker`) |
| **Multimodal Worker** | LLM inference with multimodal data (`--multimodal-worker`) |
| **Multimodal Prefill** | Prefill phase for multimodal disaggregation (`--multimodal-worker --disaggregation-mode prefill`) |
| **Image Diffusion** | Image generation via DiffGenerator (`--image-diffusion-worker`) |
| **Video Generation** | Text/image-to-video via DiffGenerator (`--video-generation-worker`) |
| **LLM Diffusion** | Diffusion language models like LLaDA (`--dllm-algorithm <algo>`) |
## Argument Reference
### Dynamo-Specific Arguments
These arguments are added by Dynamo on top of SGLang's native arguments.
| Argument | Env Var | Default | Description |
|----------|---------|---------|-------------|
| `--endpoint` | `DYN_ENDPOINT` | Auto-generated | Dynamo endpoint in `dyn://namespace.component.endpoint` format |
| `--use-sglang-tokenizer` | `DYN_SGL_USE_TOKENIZER` | `false` | Use SGLang's tokenizer instead of Dynamo's |
| `--dyn-tool-call-parser` | `DYN_TOOL_CALL_PARSER` | `None` | [Tool call](../../agents/tool-calling.md) parser (overrides SGLang's `--tool-call-parser`) |
| `--dyn-reasoning-parser` | `DYN_REASONING_PARSER` | `None` | Reasoning parser for chain-of-thought models |
| `--custom-jinja-template` | `DYN_CUSTOM_JINJA_TEMPLATE` | `None` | Custom chat template path (incompatible with `--use-sglang-tokenizer`) |
| `--embedding-worker` | `DYN_SGL_EMBEDDING_WORKER` | `false` | Run as embedding worker (also sets SGLang's `--is-embedding`) |
| `--multimodal-processor` | `DYN_SGL_MULTIMODAL_PROCESSOR` | `false` | Run as [multimodal](../../features/multimodal/multimodal-sglang.md) processor |
| `--multimodal-encode-worker` | `DYN_SGL_MULTIMODAL_ENCODE_WORKER` | `false` | Run as multimodal encode worker |
| `--multimodal-worker` | `DYN_SGL_MULTIMODAL_WORKER` | `false` | Run as multimodal LLM worker |
| `--image-diffusion-worker` | `DYN_SGL_IMAGE_DIFFUSION_WORKER` | `false` | Run as [image diffusion](sglang-diffusion.md#image-diffusion) worker |
| `--video-generation-worker` | `DYN_SGL_VIDEO_GENERATION_WORKER` | `false` | Run as [video generation](sglang-diffusion.md#video-generation) worker |
| `--disagg-config` | `DYN_SGL_DISAGG_CONFIG` | `None` | Path to YAML disaggregation config file |
| `--disagg-config-key` | `DYN_SGL_DISAGG_CONFIG_KEY` | `None` | Key to select from disaggregation config (e.g., `prefill`, `decode`) |
<Note>
`--disagg-config` and `--disagg-config-key` must be provided together. The selected section is written to a temp YAML file and passed to SGLang's `--config` flag.
</Note>
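The `--endpoint` value in the table above has three dot-separated parts after the `dyn://` scheme. A minimal parsing sketch follows; `DynEndpoint` and `parse_dyn_endpoint` are hypothetical names, not part of Dynamo's API, and the example value in the usage note is made up.

```python
from typing import NamedTuple


class DynEndpoint(NamedTuple):
    namespace: str
    component: str
    endpoint: str


def parse_dyn_endpoint(value: str) -> DynEndpoint:
    """Split a dyn://namespace.component.endpoint string into its three parts."""
    prefix = "dyn://"
    if not value.startswith(prefix):
        raise ValueError(f"expected a {prefix} address, got {value!r}")
    parts = value[len(prefix):].split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected namespace.component.endpoint, got {value!r}")
    return DynEndpoint(*parts)
```

For instance, `parse_dyn_endpoint("dyn://demo.worker.generate")` yields `namespace="demo"`, `component="worker"`, and `endpoint="generate"`.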
## Tokenizer Behavior
By default, Dynamo handles tokenization and detokenization through its Rust-based frontend, passing `input_ids` to SGLang. This enables all frontend endpoints (`/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`).
With `--use-sglang-tokenizer`, SGLang handles tokenization internally and Dynamo passes raw prompts. This restricts the frontend to `/v1/chat/completions` only.
<Warning>
`--custom-jinja-template` and `--use-sglang-tokenizer` are mutually exclusive. Custom templates require Dynamo's preprocessor.
</Warning>
## Request Cancellation
When a client disconnects, Dynamo automatically cancels the in-flight request across all workers, freeing compute resources. A background cancellation monitor detects disconnection and aborts the SGLang request.
| Mode | Prefill | Decode |
|------|---------|--------|
| **Aggregated** | ✅ | ✅ |
| **Disaggregated** | ⚠️ | ✅ |
<Warning>Cancellation during remote prefill in disaggregated mode is not currently supported.</Warning>
For details on the cancellation architecture, see [Request Cancellation](../../fault-tolerance/request-cancellation.md).
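The monitor pattern can be sketched with `asyncio`: the request and a disconnect watcher run side by side, and whichever finishes first decides the outcome. This is an illustrative sketch only; `generate` is a hypothetical stand-in for an in-flight engine request, and Dynamo's actual monitor is described in the linked page.

```python
import asyncio


async def generate(tokens_out: list[str]) -> None:
    """Hypothetical stand-in for an in-flight engine request."""
    for i in range(1000):
        tokens_out.append(f"tok{i}")
        await asyncio.sleep(0.01)


async def serve_with_cancellation(disconnected: asyncio.Event, tokens_out: list[str]) -> str:
    """Run the request, but abort it as soon as the client disconnects."""
    request = asyncio.create_task(generate(tokens_out))
    monitor = asyncio.create_task(disconnected.wait())
    done, _ = await asyncio.wait({request, monitor}, return_when=asyncio.FIRST_COMPLETED)
    if monitor in done and not request.done():
        request.cancel()  # abort the in-flight work to free compute
        try:
            await request
        except asyncio.CancelledError:
            pass
        return "cancelled"
    monitor.cancel()  # request finished first; the watcher is no longer needed
    return "completed"
```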
## Graceful Shutdown
SGLang workers use Dynamo's graceful shutdown mechanism. When a `SIGTERM` or `SIGINT` is received:
1. **Discovery unregister**: The worker is removed from service discovery so no new requests are routed to it
2. **Grace period**: In-flight requests are allowed to complete
3. **Deferred handlers**: SGLang's internal signal handlers (captured during startup via monkey-patching `loop.add_signal_handler`) are invoked after the grace period
This lets in-flight requests finish during rolling updates or scale-down events instead of being dropped.
For more details, see [Graceful Shutdown](../../fault-tolerance/graceful-shutdown.md).
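The ordering of the three steps can be sketched as a single coroutine. This is a simplified illustration, not Dynamo's shutdown code: `unregister`, `inflight`, and `deferred_handlers` are hypothetical parameters standing in for the discovery client, the set of running request tasks, and the captured SGLang handlers.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, Set


async def graceful_shutdown(
    unregister: Callable[[], Awaitable[None]],
    inflight: Set["asyncio.Task"],
    deferred_handlers: Iterable[Callable[[], None]],
    grace_period_s: float,
) -> None:
    # 1. Leave service discovery first so no new requests are routed here.
    await unregister()
    # 2. Give in-flight requests up to the grace period to complete.
    if inflight:
        await asyncio.wait(inflight, timeout=grace_period_s)
    # 3. Only then run the engine's own (deferred) signal handlers.
    for handler in deferred_handlers:
        handler()
```

In a real worker this coroutine would be scheduled from a `SIGTERM`/`SIGINT` handler registered on the event loop.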
## Health Checks
Each worker type has a specialized health check payload that validates the full inference pipeline:
| Worker Type | Health Check Strategy |
|------------|----------------------|
| Decode / Aggregated | Short generation request (`max_new_tokens=1`) |
| Prefill | Wrapped prefill-specific request structure |
| Image Diffusion | Minimal image generation request |
| Video Generation | Minimal video generation request |
| Embedding | Standard embedding request |
Health checks are registered with the Dynamo runtime and called by the frontend or Kubernetes liveness probes. See [Health Checks](../../observability/health-checks.md) for the broader health check architecture.
## Metrics and KV Events
### Prometheus Metrics
Enable metrics with `--enable-metrics` on the worker. Set `DYN_SYSTEM_PORT` to expose the `/metrics` endpoint:
```bash
DYN_SYSTEM_PORT=8081 python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --enable-metrics
```
Both SGLang engine metrics (`sglang:*` prefix) and Dynamo runtime metrics (`dynamo_*` prefix) are served from the same endpoint.
For metric details, see [SGLang Prometheus Metrics](sglang-prometheus.md). For visualization setup, see [Prometheus + Grafana](../../observability/prometheus-grafana.md).
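Because both metric families share one endpoint, a scrape can be split by prefix. The sketch below parses Prometheus text exposition output; the metric names in the usage note are illustrative placeholders, not a list of what the worker actually exports.

```python
def group_metric_names(exposition_text: str) -> dict[str, list[str]]:
    """Group metric names from a /metrics scrape by their prefix."""
    groups: dict[str, list[str]] = {"sglang": [], "dynamo": [], "other": []}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first "{" (labels) or whitespace (value).
        name = line.split("{", 1)[0].split()[0]
        if name.startswith("sglang:"):
            groups["sglang"].append(name)
        elif name.startswith("dynamo_"):
            groups["dynamo"].append(name)
        else:
            groups["other"].append(name)
    return groups
```

Given a scrape containing `sglang:example_gauge{model="m"} 1` and `dynamo_example_total 5`, the first name lands under `"sglang"` and the second under `"dynamo"`.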
### KV Events
When configured with `--kv-events-config`, workers publish KV cache events (block creation/deletion) for the [KV-aware router](../../components/router/README.md). Events are published via ZMQ from SGLang's scheduler and relayed through Dynamo's event plane.
For DP attention mode (`--enable-dp-attention`), the publisher handles multiple DP ranks per node, each with its own KV event stream.
## Engine Routes
SGLang workers expose operational endpoints via Dynamo's system server:
| Route | Description |
|-------|-------------|
| `/engine/start_profile` | Start PyTorch profiling |
| `/engine/stop_profile` | Stop profiling and save traces |
| `/engine/release_memory_occupation` | Release GPU memory for maintenance |
| `/engine/resume_memory_occupation` | Resume GPU memory after release |
| `/engine/update_weights_from_distributor` | Update model weights from distributor |
| `/engine/update_weights_from_disk` | Update model weights from disk |
| `/engine/update_weight_version` | Update weight version metadata |
## See Also
- **[Examples](sglang-examples.md)**: All deployment patterns
- **[Disaggregation](sglang-disaggregation.md)**: P/D architecture and KV transfer
- **[Diffusion](sglang-diffusion.md)**: LLM, image, and video diffusion models
- **[Router Guide](../../components/router/router-guide.md)**: KV-aware routing configuration
@@ -86,7 +86,7 @@ Dynamo exposes several categories of metrics:
- **Frontend Metrics** (`dynamo_frontend_*`) - Request handling, token processing, and latency measurements
- **Component Metrics** (`dynamo_component_*`) - Request counts, processing times, byte transfers, and system uptime
- **Specialized Component Metrics** (e.g., `dynamo_preprocessor_*`) - Component-specific metrics
- **Engine Metrics** (Pass-through) - Backend engines expose their own metrics: [vLLM](../backends/vllm/prometheus.md) (`vllm:*`), [SGLang](../backends/sglang/sglang-prometheus.md) (`sglang:*`), [TensorRT-LLM](../backends/trtllm/prometheus.md) (`trtllm_*`)
## Runtime Hierarchy
......
@@ -16,7 +16,7 @@ The following table shows the backend framework versions included with each Dyna
| **Dynamo** | **vLLM** | **SGLang** | **TensorRT-LLM** | **NIXL** |
| :--- | :--- | :--- | :--- | :--- |
| **main (ToT)** | `0.15.1` | `0.5.9` | `1.3.0rc3` | `0.9.0` |
| **v1.0.0** *(planned)* | `0.15.0` | *Latest as of 2/17* | *Latest as of 2/17* | `0.10.0` |
| **v0.9.1** *(in progress)* | `0.14.1` | `0.5.8` | `1.3.0rc3` | `0.9.0` |
| **v0.9.0** *(in progress)* | `0.14.1` | `0.5.8` | `1.3.0rc1` | `0.9.0` |
......
@@ -137,8 +137,19 @@ navigation:
contents:
- page: vLLM
path: ../pages/backends/vllm/README.md
- section: SGLang
path: ../pages/backends/sglang/README.md
contents:
- page: Reference Guide
path: ../pages/backends/sglang/sglang-reference-guide.md
- page: Examples
path: ../pages/backends/sglang/sglang-examples.md
- page: Disaggregation
path: ../pages/backends/sglang/sglang-disaggregation.md
- page: Diffusion
path: ../pages/backends/sglang/sglang-diffusion.md
- page: Prometheus
path: ../pages/backends/sglang/sglang-prometheus.md
- page: TensorRT-LLM
path: ../pages/backends/trtllm/README.md
- section: Frontend
@@ -280,20 +291,6 @@ navigation:
path: ../pages/backends/vllm/prompt-embeddings.md
- page: vLLM-Omni
path: ../pages/backends/vllm/vllm-omni.md
- section: SGLang Details
contents:
- page: Expert Distribution (EPLB)
path: ../pages/backends/sglang/expert-distribution-eplb.md
- page: GPT-OSS
path: ../pages/backends/sglang/gpt-oss.md
- page: Diffusion LM
path: ../pages/backends/sglang/diffusion-lm.md
- page: Profiling
path: ../pages/backends/sglang/profiling.md
- page: Disaggregation
path: ../pages/backends/sglang/sglang-disaggregation.md
- page: Prometheus
path: ../pages/backends/sglang/prometheus.md
- section: TensorRT-LLM Details
contents:
- page: Multinode Examples
......
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Aggregated serving: single worker handles both prefill and decode.
# GPUs: 1
# Setup cleanup trap
cleanup() {
@@ -54,6 +57,26 @@ if [ "$ENABLE_OTEL" = true ]; then
TRACE_ARGS+=(--enable-trace --otlp-traces-endpoint localhost:4317)
fi
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
echo "=========================================="
echo "Launching Aggregated LLM Worker"
echo "=========================================="
echo "Model: $MODEL"
echo "Frontend: http://localhost:$HTTP_PORT"
echo "=========================================="
echo ""
echo "Example test command:"
echo ""
echo " curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
echo " -H 'Content-Type: application/json' \\"
echo " -d '{"
echo " \"model\": \"${MODEL}\","
echo " \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}],"
echo " \"max_tokens\": 32"
echo " }'"
echo ""
echo "=========================================="
# run ingress
# dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
OTEL_SERVICE_NAME=dynamo-frontend \
......
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Aggregated embedding model serving.
# GPUs: 1
# Setup cleanup trap
cleanup() {
@@ -31,6 +34,26 @@ while [[ $# -gt 0 ]]; do
esac
done
MODEL="Qwen/Qwen3-Embedding-4B"
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
echo "=========================================="
echo "Launching Embedding Worker"
echo "=========================================="
echo "Model: $MODEL"
echo "Frontend: http://localhost:$HTTP_PORT"
echo "=========================================="
echo ""
echo "Example test command:"
echo ""
echo " curl http://localhost:${HTTP_PORT}/v1/embeddings \\"
echo " -H 'Content-Type: application/json' \\"
echo " -d '{"
echo " \"model\": \"${MODEL}\","
echo " \"input\": \"Hello world\""
echo " }'"
echo ""
echo "=========================================="
# run ingress
# dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
python3 -m dynamo.frontend &
......
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Two aggregated workers behind a KV-aware router.
# GPUs: 2
# Setup cleanup trap
cleanup() {
@@ -51,6 +54,27 @@ if [ "$ENABLE_OTEL" = true ]; then
TRACE_ARGS+=(--enable-trace --otlp-traces-endpoint localhost:4317)
fi
MODEL="Qwen/Qwen3-0.6B"
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
echo "=========================================="
echo "Launching Aggregated Router (2 workers)"
echo "=========================================="
echo "Model: $MODEL"
echo "Frontend: http://localhost:$HTTP_PORT"
echo "=========================================="
echo ""
echo "Example test command:"
echo ""
echo " curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
echo " -H 'Content-Type: application/json' \\"
echo " -d '{"
echo " \"model\": \"${MODEL}\","
echo " \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}],"
echo " \"max_tokens\": 32"
echo " }'"
echo ""
echo "=========================================="
# run ingress
# dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
FRONTEND_ARGS=(--router-mode kv)
......
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Aggregated multimodal (vision + LLM) serving.
# GPUs: 1
# Setup cleanup trap
cleanup() {
@@ -60,6 +63,29 @@ if [ "$ENABLE_OTEL" = true ]; then
TRACE_ARGS+=(--enable-trace --otlp-traces-endpoint localhost:4317)
fi
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
echo "=========================================="
echo "Launching Aggregated Vision Worker"
echo "=========================================="
echo "Model: $MODEL"
echo "Frontend: http://localhost:$HTTP_PORT"
echo "=========================================="
echo ""
echo "Example test command:"
echo ""
echo " curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
echo " -H 'Content-Type: application/json' \\"
echo " -d '{"
echo " \"model\": \"${MODEL}\","
echo " \"messages\": [{\"role\": \"user\", \"content\": ["
echo " {\"type\": \"text\", \"text\": \"Describe the image.\"},"
echo " {\"type\": \"image_url\", \"image_url\": {\"url\": \"http://images.cocodataset.org/test2017/000000155781.jpg\"}}"
echo " ]}],"
echo " \"max_tokens\": 50"
echo " }'"
echo ""
echo "=========================================="
# run ingress
# dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
OTEL_SERVICE_NAME=dynamo-frontend \
......
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Diffusion language model (LLaDA2.0). Text generation via iterative refinement.
# GPUs: 1
set -e
@@ -38,6 +41,19 @@ echo "TP Size: $TP_SIZE"
echo "Diffusion Algorithm: ${DLLM_ALGORITHM:-LowConfidence}"
echo "Algorithm Config: ${DLLM_ALGORITHM_CONFIG:-default}"
echo "=========================================="
echo ""
echo "Example test command:"
echo ""
echo " curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
echo " -H 'Content-Type: application/json' \\"
echo " -d '{"
echo " \"model\": \"${MODEL_PATH}\","
echo " \"messages\": [{\"role\": \"user\", \"content\": \"Hello! How are you?\"}],"
echo " \"temperature\": 0.7,"
echo " \"max_tokens\": 512"
echo " }'"
echo ""
echo "=========================================="
# Launch frontend (OpenAI-compatible API server)
echo "Starting Dynamo Frontend on port $HTTP_PORT..."
......
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Disaggregated serving: prefill on GPU 0, decode on GPU 1.
# GPUs: 2
# Setup cleanup trap
cleanup() {
@@ -45,6 +48,27 @@ if [ "$ENABLE_OTEL" = true ]; then
    TRACE_ARGS+=(--enable-trace --otlp-traces-endpoint localhost:4317)
fi
MODEL="Qwen/Qwen3-0.6B"
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
echo "=========================================="
echo "Launching Disaggregated Workers (P/D)"
echo "=========================================="
echo "Model: $MODEL"
echo "Frontend: http://localhost:$HTTP_PORT"
echo "=========================================="
echo ""
echo "Example test command:"
echo ""
echo " curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
echo " -H 'Content-Type: application/json' \\"
echo " -d '{"
echo " \"model\": \"${MODEL}\","
echo " \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}],"
echo " \"max_tokens\": 32"
echo " }'"
echo ""
echo "=========================================="
# run ingress
# dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
OTEL_SERVICE_NAME=dynamo-frontend \
......
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Disaggregated serving with KV-aware routing: 2 prefill + 2 decode workers.
# GPUs: 4
# Setup cleanup trap
cleanup() {
@@ -46,6 +49,27 @@ if [ "$ENABLE_OTEL" = true ]; then
    TRACE_ARGS+=(--enable-trace --otlp-traces-endpoint localhost:4317)
fi
MODEL="Qwen/Qwen3-0.6B"
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
echo "=========================================="
echo "Launching Disaggregated Router (2P + 2D)"
echo "=========================================="
echo "Model: $MODEL"
echo "Frontend: http://localhost:$HTTP_PORT"
echo "=========================================="
echo ""
echo "Example test command:"
echo ""
echo " curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
echo " -H 'Content-Type: application/json' \\"
echo " -d '{"
echo " \"model\": \"${MODEL}\","
echo " \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}],"
echo " \"max_tokens\": 32"
echo " }'"
echo ""
echo "=========================================="
# Start frontend with KV routing
# The frontend will automatically detect prefill workers and activate an internal prefill router
# No standalone prefill router needed - the frontend handles prefill routing internally
......
@@ -2,6 +2,9 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Disaggregated serving on a single GPU (prefill + decode share memory).
# GPUs: 1 (requires 16+ GB VRAM)
#
# Usage: ./disagg_same_gpu.sh [GPU_MEM_FRACTION]
# GPU_MEM_FRACTION: Fraction of GPU memory to use per worker (default: 0.45)
# Example: ./disagg_same_gpu.sh 0.45
@@ -36,6 +39,28 @@ cleanup() {
trap cleanup EXIT INT TERM
MODEL="Qwen/Qwen3-0.6B"
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
echo "=========================================="
echo "Launching Disaggregated (same GPU)"
echo "=========================================="
echo "Model: $MODEL"
echo "Frontend: http://localhost:$HTTP_PORT"
echo "GPU Mem: ${GPU_MEM_FRACTION} per worker"
echo "=========================================="
echo ""
echo "Example test command:"
echo ""
echo " curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
echo " -H 'Content-Type: application/json' \\"
echo " -d '{"
echo " \"model\": \"${MODEL}\","
echo " \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}],"
echo " \"max_tokens\": 32"
echo " }'"
echo ""
echo "=========================================="
# run ingress with KV router mode for disaggregated setup
# dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
python3 -m dynamo.frontend --router-mode kv &
......
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Image diffusion worker (text-to-image). Default model: FLUX.1-dev (~38 GB VRAM).
# GPUs: 1
set -e
# Setup cleanup trap
cleanup() {
    echo "Cleaning up background processes..."
    kill $FRONTEND_PID 2>/dev/null || true
    wait $FRONTEND_PID 2>/dev/null || true
    echo "Cleanup complete."
}
trap cleanup EXIT INT TERM
# Defaults
MODEL_PATH="black-forest-labs/FLUX.1-dev"
FS_URL="file:///tmp/dynamo_media"
HTTP_URL=""
HTTP_PORT="${HTTP_PORT:-8000}"
# Parse command line arguments
EXTRA_ARGS=()
while [[ $# -gt 0 ]]; do
    case $1 in
        --model-path)
            MODEL_PATH="$2"
            shift 2
            ;;
        --fs-url)
            FS_URL="$2"
            shift 2
            ;;
        --http-url)
            HTTP_URL="$2"
            shift 2
            ;;
        --http-port)
            HTTP_PORT="$2"
            shift 2
            ;;
        -h|--help)
            echo "Usage: $0 [OPTIONS]"
            echo ""
            echo "Launch a Dynamo image diffusion worker."
            echo ""
            echo "Options:"
            echo " --model-path <path> Model path (default: black-forest-labs/FLUX.1-dev)"
            echo " --fs-url <url> Filesystem URL for image storage (default: file:///tmp/dynamo_media)"
            echo " --http-url <url> Base URL for serving images over HTTP (optional)"
            echo " --http-port <port> Frontend HTTP port (default: 8000)"
            echo " -h, --help Show this help message"
            echo ""
            echo "Additional flags are forwarded to dynamo.sglang."
            echo ""
            echo "Examples:"
            echo " # Local file storage"
            echo " $0 --model-path black-forest-labs/FLUX.1-dev --fs-url file:///tmp/images"
            echo ""
            echo " # S3 storage (set FSSPEC_S3_KEY, FSSPEC_S3_SECRET, optionally FSSPEC_S3_ENDPOINT_URL)"
            echo " $0 --fs-url s3://my-bucket/images"
            exit 0
            ;;
        *)
            EXTRA_ARGS+=("$1")
            shift
            ;;
    esac
done
echo "=========================================="
echo "Launching Image Diffusion Worker"
echo "=========================================="
echo "Model: $MODEL_PATH"
echo "Frontend: http://localhost:$HTTP_PORT"
echo "FS URL: $FS_URL"
[ -n "$HTTP_URL" ] && echo "HTTP URL: $HTTP_URL"
echo "=========================================="
echo ""
echo "Example test command:"
echo ""
echo " curl http://localhost:${HTTP_PORT}/v1/images/generations \\"
echo " -H 'Content-Type: application/json' \\"
echo " -d '{"
echo " \"prompt\": \"A curious raccoon exploring a garden\","
echo " \"model\": \"${MODEL_PATH}\","
echo " \"size\": \"1024x1024\","
echo " \"response_format\": \"url\","
echo " \"nvext\": {"
echo " \"num_inference_steps\": 15"
echo " }"
echo " }'"
echo ""
echo "=========================================="
# Build optional HTTP URL arg
HTTP_URL_ARGS=()
if [ -n "$HTTP_URL" ]; then
    HTTP_URL_ARGS=(--media-output-http-url "$HTTP_URL")
fi
# Launch frontend
echo "Starting Dynamo Frontend on port $HTTP_PORT..."
python3 -m dynamo.frontend \
    --http-port "$HTTP_PORT" &
FRONTEND_PID=$!
sleep 2
# Launch image diffusion worker
echo "Starting Image Diffusion Worker..."
python3 -m dynamo.sglang \
    --model-path "$MODEL_PATH" \
    --served-model-name "$MODEL_PATH" \
    --image-diffusion-worker \
    --media-output-fs-url "$FS_URL" \
    "${HTTP_URL_ARGS[@]}" \
    --trust-remote-code \
    --skip-tokenizer-init \
    --enable-metrics \
    "${EXTRA_ARGS[@]}"
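
The image diffusion script above pauses for a fixed `sleep 2` after starting the frontend. One possible alternative is to poll the frontend until it answers. The sketch below is not part of this change: `wait_for_http` is a hypothetical helper name, and the URL to poll is left to the caller because the right readiness endpoint is an assumption.

```shell
#!/bin/bash
# Hypothetical helper (not part of the scripts in this change):
# poll a URL until it responds successfully or a timeout (seconds) expires.
wait_for_http() {
    local url="$1"
    local timeout="${2:-30}"
    local deadline=$((SECONDS + timeout))
    while (( SECONDS < deadline )); do
        # -s: silent, -f: treat HTTP errors as failure, -o: discard the body
        if curl -sf -o /dev/null --max-time 2 "$url"; then
            return 0
        fi
        sleep 1
    done
    return 1
}
```

A call such as `wait_for_http "http://localhost:${HTTP_PORT}/v1/models" 60 || exit 1` could replace the fixed sleep, assuming the frontend serves an OpenAI-style `/v1/models` route; that route is an assumption here and should be checked against the frontend's documentation.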
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Multimodal E/P/D: encoder (GPU 0), prefill (GPU 1), decode (GPU 2).
# GPUs: 3
# Setup cleanup trap
cleanup() {
@@ -59,6 +62,29 @@ if [[ -n "$SERVED_MODEL_NAME" ]]; then
    SERVED_MODEL_ARG="--served-model-name $SERVED_MODEL_NAME"
fi
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
echo "=========================================="
echo "Launching Multimodal E/P/D Workers"
echo "=========================================="
echo "Model: $MODEL_NAME"
echo "Frontend: http://localhost:$HTTP_PORT"
echo "=========================================="
echo ""
echo "Example test command:"
echo ""
echo " curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
echo " -H 'Content-Type: application/json' \\"
echo " -d '{"
echo " \"model\": \"${MODEL_NAME}\","
echo " \"messages\": [{\"role\": \"user\", \"content\": ["
echo " {\"type\": \"text\", \"text\": \"Describe the image.\"},"
echo " {\"type\": \"image_url\", \"image_url\": {\"url\": \"http://images.cocodataset.org/test2017/000000155781.jpg\"}}"
echo " ]}],"
echo " \"max_tokens\": 50"
echo " }'"
echo ""
echo "=========================================="
# run ingress
# dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
python3 -m dynamo.frontend &
......
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Multimodal E/PD: separate vision encoder (GPU 0) + combined PD worker (GPU 1).
# GPUs: 2
# Setup cleanup trap
cleanup() {
@@ -59,6 +62,29 @@ if [[ -n "$SERVED_MODEL_NAME" ]]; then
    SERVED_MODEL_ARG="--served-model-name $SERVED_MODEL_NAME"
fi
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
echo "=========================================="
echo "Launching Multimodal E/PD Workers"
echo "=========================================="
echo "Model: $MODEL_NAME"
echo "Frontend: http://localhost:$HTTP_PORT"
echo "=========================================="
echo ""
echo "Example test command:"
echo ""
echo " curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
echo " -H 'Content-Type: application/json' \\"
echo " -d '{"
echo " \"model\": \"${MODEL_NAME}\","
echo " \"messages\": [{\"role\": \"user\", \"content\": ["
echo " {\"type\": \"text\", \"text\": \"Describe the image.\"},"
echo " {\"type\": \"image_url\", \"image_url\": {\"url\": \"http://images.cocodataset.org/test2017/000000155781.jpg\"}}"
echo " ]}],"
echo " \"max_tokens\": 50"
echo " }'"
echo ""
echo "=========================================="
# run ingress
# dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
python3 -m dynamo.frontend &
......
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Text-to-video generation with Wan2.1 models.
# GPUs: 1 (--wan-size 1b) or 2 (--wan-size 14b)
set -e
......
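
Each launch script in this change prints its example request with a run of `echo` lines and escaped quotes. As a sketch of an equivalent form, a here-document keeps the JSON readable without per-quote escaping. `print_example_request` is a hypothetical function name and is not defined in the scripts; the port variable and model name are taken from the scripts above.

```shell
#!/bin/bash
# Hypothetical helper: print an example chat-completions request.
# Arguments: $1 = frontend HTTP port, $2 = served model name.
print_example_request() {
    local port="$1"
    local model="$2"
    # Unquoted delimiter: ${port} and ${model} are expanded, and "\\" prints
    # a single backslash so the output can be pasted as a multi-line command.
    cat <<EOF
curl http://localhost:${port}/v1/chat/completions \\
  -H 'Content-Type: application/json' \\
  -d '{
    "model": "${model}",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 32
  }'
EOF
}

print_example_request "${DYN_HTTP_PORT:-8000}" "Qwen/Qwen3-0.6B"
```

The trade-off is that the closing `EOF` must start at column zero (or the here-document must use `<<-` with tab indentation), which some style guides dislike inside functions.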