"tests/vscode:/vscode.git/clone" did not exist on "a75021dbe72cf2aee1ed80e3345c1de6b745fb6c"
Commit 7dfbe4fd (unverified), authored by Alec, committed by GitHub

chore: remove stale example assets (#7059)

parent 310f8ca9
......@@ -200,6 +200,42 @@ Dynamo is built in the open with an OSS-first development model. We welcome cont
<details>
<summary>Older news</summary>
Dynamo provides comprehensive benchmarking tools:
- **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies using AIPerf
- **[SLA-Driven Deployments](docs/components/planner/planner-guide.md)** – Optimize deployments to meet SLA requirements
## Frontend OpenAPI Specification
The OpenAI-compatible frontend exposes an OpenAPI 3 spec at `/openapi.json`. To generate without running the server:
```bash
cargo run -p dynamo-llm --bin generate-frontend-openapi
```
This writes to `docs/reference/api/openapi.json`.
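As a quick sanity check on the generated file, the sketch below looks for the top-level `openapi` version field that an OpenAPI 3 document carries. The function name `check_openapi_spec` is hypothetical, and the `grep` match is a coarse textual test, not a full JSON or schema validation.

```bash
# Sketch: confirm the generated file declares an OpenAPI 3.x version.
# check_openapi_spec is a hypothetical helper, not part of the repository.
check_openapi_spec() {
    spec_file="${1:-docs/reference/api/openapi.json}"
    if [ ! -s "$spec_file" ]; then
        echo "missing or empty: $spec_file" >&2
        return 1
    fi
    # An OpenAPI 3 document has a top-level "openapi": "3.x.y" field.
    if grep -Eq '"openapi"[[:space:]]*:[[:space:]]*"3\.' "$spec_file"; then
        echo "ok: $spec_file"
    else
        echo "no OpenAPI 3 version field in $spec_file" >&2
        return 1
    fi
}
```

Calling `check_openapi_spec` with no argument checks the default output path named above; pass a path to check a copy elsewhere.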
## Service Discovery and Messaging
Dynamo uses TCP for inter-component communication. On Kubernetes, native resources ([CRDs + EndpointSlices](docs/kubernetes/service-discovery.md)) handle service discovery. External services are optional for most deployments:
| Deployment | etcd | NATS | Notes |
|------------|------|------|-------|
| **Local Development** | ❌ Not required | ❌ Not required | Pass `--discovery-backend file`; vLLM also needs `--kv-events-config '{"enable_kv_cache_events": false}'` |
| **Kubernetes** | ❌ Not required | ❌ Not required | K8s-native discovery; TCP request plane |
> **Note:** KV-Aware Routing requires NATS for prefix caching coordination.
For Slurm or other distributed deployments (and KV-aware routing):
- [etcd](https://etcd.io/) can be run directly as `./etcd`.
- [nats](https://nats.io/) needs JetStream enabled: `nats-server -js`.
To quickly set up both: `docker compose -f deploy/docker-compose.yml up -d`
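One way to read the table and note above is as a small lookup from deployment type to the external services it needs. The sketch below is illustrative only: `external_services` and its argument values (`local`, `kubernetes`, `slurm`) are hypothetical names, not Dynamo flags, and the mapping is an interpretation of the table, not an authoritative rule.

```bash
# Sketch: which external services a deployment needs, per the table above.
# Function name and argument values are hypothetical.
external_services() {
    deployment="$1"           # local | kubernetes | slurm (or other distributed)
    kv_routing="${2:-false}"  # "true" when KV-aware routing is enabled
    case "$deployment" in
        local|kubernetes)
            # Neither etcd nor NATS is required, unless KV-aware routing is on.
            if [ "$kv_routing" = "true" ]; then
                echo "nats"
            else
                echo "none"
            fi
            ;;
        *)
            # Slurm or other distributed deployments use both services.
            echo "etcd nats"
            ;;
    esac
}

external_services kubernetes true   # prints: nats
```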
## More News
- [11/20] [Dell integrates PowerScale with Dynamo's NIXL for 19x faster TTFT](https://www.dell.com/en-us/dt/corporate/newsroom/announcements/detailpage.press-releases~usa~2025~11~dell-technologies-and-nvidia-advance-enterprise-ai-innovation.htm)
- [11/20] [WEKA partners with NVIDIA on KV cache storage for Dynamo](https://siliconangle.com/2025/11/20/nvidia-weka-kv-cache-solution-ai-inferencing-sc25/)
- [11/13] [Dynamo Office Hours Playlist](https://www.youtube.com/playlist?list=PL5B692fm6--tgryKu94h2Zb7jTFM3Go4X)
- [10/16] [How Baseten achieved 2x faster inference with NVIDIA Dynamo](https://www.baseten.co/blog/how-baseten-achieved-2x-faster-inference-with-nvidia-dynamo/)
......
......@@ -4,281 +4,79 @@
title: Multinode Examples
---
For general TensorRT-LLM features and configuration, see the [Reference Guide](../trtllm-reference-guide.md).
For general TensorRT-LLM features and engine configuration, see the
[Reference Guide](../trtllm-reference-guide.md).
---
> **Note:** The scripts referenced in this example (such as `srun_aggregated.sh` and `srun_disaggregated.sh`) can be found in [`examples/basics/multinode/trtllm/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/multinode/trtllm/).
To run a single Dynamo+TRTLLM worker that spans multiple nodes (e.g., TP16),
the set of nodes needs to be launched together in the same MPI world, such as
via `mpirun` or `srun`. This is true regardless of whether the worker is
aggregated, prefill-only, or decode-only.
In this document we will demonstrate two examples launching multinode workers
on a slurm cluster with `srun`:
1. Deploying an aggregated nvidia/DeepSeek-R1 model as a multi-node TP16/EP16
worker across 4 GB200 nodes
2. Deploying a disaggregated nvidia/DeepSeek-R1 model with a multi-node
TP16/EP16 prefill worker (4 nodes) and a multi-node TP16/EP16 decode
worker (4 nodes) across a total of 8 GB200 nodes.
NOTE: Some of the scripts used in this example like `start_frontend_services.sh` and
`start_trtllm_worker.sh` should be translatable to other environments like Kubernetes, or
using `mpirun` directly, with relative ease.
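The node counts in the two examples follow from dividing each worker's total GPU count by the GPUs available per node. A minimal sketch of that arithmetic, assuming 4 GPUs per GB200 node as in this example; `required_nodes` is a hypothetical helper, not one of the referenced scripts:

```bash
# Sketch: nodes needed to host a worker with a given total GPU count.
# required_nodes is a hypothetical helper for illustration.
required_nodes() {
    total_gpus="$1"      # e.g. 16 for TP16/EP16
    gpus_per_node="$2"   # e.g. 4 on the GB200 nodes used here
    if [ "$gpus_per_node" -le 0 ] || [ $((total_gpus % gpus_per_node)) -ne 0 ]; then
        echo "error: ${total_gpus} GPUs do not divide evenly over ${gpus_per_node}-GPU nodes" >&2
        return 1
    fi
    echo $((total_gpus / gpus_per_node))
}

# Aggregated: one TP16 worker.
required_nodes 16 4                                       # prints 4
# Disaggregated: one TP16 prefill worker plus one TP16 decode worker.
echo $(( $(required_nodes 16 4) + $(required_nodes 16 4) ))   # prints 8
```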
## Setup
For simplicity of the example, we will make some assumptions about your slurm cluster:
1. First, we assume you have access to a slurm cluster with multiple GPU nodes
available. For functional testing, most setups should be fine. For performance
testing, you should aim to allocate groups of nodes that share a high-bandwidth
interconnect, such as those in an NVL72 setup.
2. Second, we assume this slurm cluster has the [Pyxis](https://github.com/NVIDIA/pyxis)
SPANK plugin setup. In particular, the `srun_aggregated.sh` script in this
example will use `srun` arguments like `--container-image`,
`--container-mounts`, and `--container-env` that are added to `srun` by Pyxis.
If your cluster supports similar container based plugins, you may be able to
modify the script to use that instead.
3. Third, we assume you have a Dynamo+TRTLLM container image available.
You can use the [prebuilt container](../README.md#quick-start) or [build a custom one](../trtllm-building-custom-container.md).
This is the image that can be set to the `IMAGE` environment variable in later steps.
4. Fourth, we assume you pre-allocate a group of nodes using `salloc`. We
will allocate 8 nodes below as a reference command to have enough capacity
to run both examples. If you plan to only run the aggregated example, you
will only need 4 nodes. If you customize the configurations to require a
different number of nodes, you can adjust the number of allocated nodes
accordingly. Pre-allocating nodes is technically not a requirement,
but it makes iterations of testing/experimenting easier.
## Recommended Path
Make sure to set your `PARTITION` and `ACCOUNT` according to your slurm cluster setup:
```bash
# Set partition manually based on your slurm cluster's partition names
PARTITION=""
# Set account manually if this command doesn't work on your cluster
ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)"
salloc \
--partition="${PARTITION}" \
--account="${ACCOUNT}" \
--job-name="${ACCOUNT}-dynamo.trtllm" \
-t 05:00:00 \
--nodes 8
```
5. Lastly, we will assume you are inside an interactive shell on one of your allocated
nodes, which may be the default behavior after executing the `salloc` command above
depending on the cluster setup. If not, then you should SSH into one of the allocated nodes.
For multinode TensorRT-LLM deployments, start from the checked-in Kubernetes
recipes under [`recipes/`](../../../../recipes/README.md). Those manifests are
the supported entrypoints for launching multi-node workers, frontend services,
and related routing components.
### Environment Variable Setup
The main TRT-LLM recipe entrypoints are:
This example aims to automate as much of the environment setup as possible,
but all slurm clusters and environments are different, and you may need to
dive into the scripts to make modifications based on your specific environment.
- [DeepSeek-R1 WideEP on GB200](../../../../recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml)
- [Qwen3-235B-A22B-FP8 aggregated](../../../../recipes/qwen3-235b-a22b-fp8/trtllm/agg/deploy.yaml)
- [Qwen3-235B-A22B-FP8 disaggregated](../../../../recipes/qwen3-235b-a22b-fp8/trtllm/disagg/deploy.yaml)
- [Qwen3-32B-FP8 aggregated](../../../../recipes/qwen3-32b-fp8/trtllm/agg/deploy.yaml)
- [Qwen3-32B-FP8 disaggregated](../../../../recipes/qwen3-32b-fp8/trtllm/disagg/deploy.yaml)
- [GPT-OSS-120B aggregated](../../../../recipes/gpt-oss-120b/trtllm/agg/deploy.yaml)
- [GPT-OSS-120B disaggregated](../../../../recipes/gpt-oss-120b/trtllm/disagg/deploy.yaml)
- [Nemotron-3-Super-FP8 disaggregated](../../../../recipes/nemotron-3-super-fp8/trtllm/disagg/deploy.yaml)
Assuming you have already allocated your nodes via `salloc`, and are
inside an interactive shell on one of the allocated nodes, set the
following environment variables:
```bash
# NOTE: IMAGE must be set manually for now
# Use the prebuilt container from NGC (see ../README.md#quick-start):
# export IMAGE="nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0"
# Or build a custom one (see ../trtllm-building-custom-container.md)
# Or you can also download the image to shared storage and point
# IMAGE to the local path.
export IMAGE="<dynamo_trtllm_image>"
For model-level setup, prerequisites, and hardware notes, use the recipe
README files:
# MOUNTS are the host:container path pairs that are mounted into the containers
# launched by each `srun` command.
#
# If you want to reference files, such as $MODEL_PATH below, in a
# different location, you can customize MOUNTS or specify additional
# comma-separated mount pairs here.
#
# NOTE: Currently, this example assumes that the local bash scripts and configs
# referenced are mounted into /mnt inside the container. If you want to
# customize the location of the scripts, make sure to modify `srun_aggregated.sh`
# accordingly for the new locations of `start_frontend_services.sh` and
# `start_trtllm_worker.sh`.
#
# For example, assuming your cluster had a `/lustre` directory on the host, you
# could add that as a mount like so:
#
# export MOUNTS="${PWD}/../../../../:/mnt,/lustre:/lustre"
export MOUNTS="${PWD}/../../../../:/mnt"
- [DeepSeek-R1 recipes](../../../../recipes/deepseek-r1/README.md)
- [Qwen3-235B-A22B-FP8 recipes](../../../../recipes/qwen3-235b-a22b-fp8/README.md)
- [Qwen3-32B-FP8 recipes](../../../../recipes/qwen3-32b-fp8/README.md)
- [GPT-OSS-120B recipes](../../../../recipes/gpt-oss-120b/README.md)
- [Kimi-K2.5 recipes](../../../../recipes/kimi-k2.5/README.md)
# NOTE: In general, Deepseek R1 is very large, so it is recommended to
# pre-download the model weights and save them in some shared location,
# NFS storage, HF_HOME, etc. and modify the `--model-path` below
# to reuse the pre-downloaded weights instead.
#
# On Blackwell systems (ex: GB200), it is recommended to use the FP4 weights:
# https://huggingface.co/nvidia/DeepSeek-R1-FP4
#
# On Hopper systems, FP4 isn't supported so you'll need to use the default weights:
# https://huggingface.co/deepseek-ai/DeepSeek-R1
export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
## Quick Start
# The name the model will be served/queried under, matching what's
# returned by the /v1/models endpoint.
#
# By default this is inferred from MODEL_PATH, but when using locally downloaded
# model weights, it can be nice to have explicit control over the name.
export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
```
At a high level, the Kubernetes workflow is:
## Aggregated WideEP
1. Install the Dynamo platform on Kubernetes. See the
[Kubernetes Deployment Guide](../../../kubernetes/README.md).
2. Create a namespace and any required secrets such as a Hugging Face token.
3. Apply the recipe's model cache and model download manifests when the recipe
includes them.
4. Apply the recipe's `deploy.yaml`.
5. Port-forward the frontend service and send test requests to `/v1/models` or
`/v1/chat/completions`.
Assuming you have at least 4 nodes allocated following the setup steps above,
follow these steps below to launch an **aggregated** deployment across 4 nodes:
Example flow:
```bash
# Default set in srun_aggregated.sh, but can customize here.
# export ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/deepseek-r1/agg/wide_ep/wide_ep_agg.yaml"
# Customize NUM_NODES to match the desired parallelism in ENGINE_CONFIG
# The product of NUM_NODES*NUM_GPUS_PER_NODE should match the number of
# total GPUs necessary to satisfy the requested parallelism. For example,
# 4 nodes x 4 gpus/node = 16 gpus total for TP16/EP16.
# export NUM_NODES=4
# GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this.
# export NUM_GPUS_PER_NODE=4
# Launches:
# - frontend + etcd/nats on current (head) node
# - one large aggregated trtllm worker across multiple nodes via MPI tasks
./srun_aggregated.sh
export NAMESPACE=dynamo-demo
kubectl create namespace ${NAMESPACE}
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="your-token-here" \
-n ${NAMESPACE}
# Example: deploy DeepSeek-R1 TRT-LLM WideEP on GB200.
kubectl apply -f recipes/deepseek-r1/model-cache/model-cache.yaml -n ${NAMESPACE}
kubectl apply -f recipes/deepseek-r1/model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=7200s
kubectl apply -f recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml -n ${NAMESPACE}
```
## Disaggregated WideEP
Assuming you have at least 8 nodes allocated (4 for prefill, 4 for decode)
following the setup above, follow these steps below to launch a **disaggregated**
deployment across 8 nodes:
> [!Tip]
> Make sure you have a fresh environment and that the aggregated
> example above is no longer deployed on the same set of nodes.
After the deployment is ready, port-forward the frontend service named by the
recipe and send a test request:
```bash
# Defaults set in srun_disaggregated.sh, but can customize here.
# export PREFILL_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/deepseek-r1/disagg/wide_ep/wide_ep_prefill.yaml"
# export DECODE_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/deepseek-r1/disagg/wide_ep/wide_ep_decode.yaml"
kubectl port-forward svc/<frontend-service> 8000:8000 -n ${NAMESPACE}
# Customize NUM_PREFILL_NODES to match the desired parallelism in PREFILL_ENGINE_CONFIG
# Customize NUM_DECODE_NODES to match the desired parallelism in DECODE_ENGINE_CONFIG
# The products of NUM_PREFILL_NODES*NUM_GPUS_PER_NODE and
# NUM_DECODE_NODES*NUM_GPUS_PER_NODE should match the respective number of
# GPUs necessary to satisfy the requested parallelism in each config.
# export NUM_PREFILL_NODES=4
# export NUM_DECODE_NODES=4
# GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this.
# export NUM_GPUS_PER_NODE=4
# Launches:
# - frontend + etcd/nats on current (head) node.
# - one large prefill trtllm worker across multiple nodes via MPI tasks
# - one large decode trtllm worker across multiple nodes via MPI tasks
./srun_disaggregated.sh
```
> [!Tip]
> To launch multiple replicas of the configured prefill/decode workers, you can set
> NUM_PREFILL_WORKERS and NUM_DECODE_WORKERS respectively (default: 1).
## Understanding the Output
1. `srun_aggregated.sh` launches two `srun` jobs. The first launches
etcd, NATS, and the OpenAI frontend on the head node only,
called "node1" in the example output below. The second launches
a single TP16 Dynamo+TRTLLM worker spread across 4 nodes, with each node
using 4 GPUs.
```
# Frontend/etcd/nats services
srun: launching StepId=453374.17 on host node1, 1 tasks: 0
...
# TP16 TRTLLM worker split across 4 nodes with 4 gpus each
srun: launching StepId=453374.18 on host node1, 4 tasks: [0-3]
srun: launching StepId=453374.18 on host node2, 4 tasks: [4-7]
srun: launching StepId=453374.18 on host node3, 4 tasks: [8-11]
srun: launching StepId=453374.18 on host node4, 4 tasks: [12-15]
```
2. The OpenAI frontend will listen for and dynamically discover workers as
they register themselves with Dynamo's distributed runtime:
```
0: 2025-06-13T02:36:48.161Z INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000 address="0.0.0.0:8000"
```
3. The TRTLLM worker will consist of N (N=16 for TP16) MPI ranks, 1 rank on each
GPU on each node, which will each output their progress while loading the model.
You can see each rank's output prefixed with the rank at the start of each log line
until the model successfully finishes loading:
```
8: rank8 run mgmn worker node with mpi_world_size: 16 ...
10: rank10 run mgmn worker node with mpi_world_size: 16 ...
9: rank9 run mgmn worker node with mpi_world_size: 16 ...
11: rank11 run mgmn worker node with mpi_world_size: 16 ...
...
15: Model init total -- 55.42s
11: Model init total -- 55.91s
12: Model init total -- 55.24s
```
4. After the model fully finishes loading on all ranks, the worker will register itself,
and the OpenAI frontend will detect it, signaled by this output:
```
0: 2025-06-13T02:46:35.040Z INFO dynamo_llm::discovery::watcher: added model model_name="nvidia/DeepSeek-R1-FP4"
```
5. At this point, with the worker fully initialized and detected by the frontend,
it is now ready for inference.
6. `srun_disaggregated.sh` follows a very similar flow, but launches
three `srun` jobs instead of two: one for the frontend, one for the prefill worker,
and one for the decode worker.
## Example Request
To verify the deployed model is working, send a `curl` request:
```bash
# NOTE: $HOST assumes running on head node, but can be changed to $HEAD_NODE_IP instead.
HOST=localhost
PORT=8000
# "model" here should match the model name returned by the /v1/models endpoint
curl -w "%{http_code}" ${HOST}:${PORT}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "'${SERVED_MODEL_NAME}'",
"messages": [
{
"role": "user",
"content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
}
],
"stream": true,
"max_tokens": 30
}'
```
## Cleanup
To clean up background `srun` processes launched by `srun_aggregated.sh` or
`srun_disaggregated.sh`, you can run:
```bash
pkill srun
curl http://localhost:8000/v1/models
```
## Known Issues
## Notes
- This example has only been tested on a 4xGB200 node setup with 16 GPUs using
FP4 weights. In theory, the example should work on alternative setups such as
H100 nodes with FP8 weights, but this hasn't been tested yet.
- WideEP configs in this directory are still being tested. A WideEP specific
example with documentation will be added once ready.
- There are known issues where WideEP workers may not cleanly shut down:
- This may lead to leftover shared memory files in `/dev/shm/moe_*`. For
now, you must manually clean these up before deploying again on the
same set of nodes.
- Similarly, there may be GPU memory left in-use after killing the `srun`
jobs. After cleaning up any leftover shared memory files as described
above, the GPU memory may slowly come back. You can run `watch nvidia-smi`
to check on this behavior. If you don't free the GPU memory before the
next deployment, you may get a CUDA OOM error while loading the model.
- There is mention of this issue in the relevant TRT-LLM blog
[here](https://github.com/NVIDIA/TensorRT-LLM/blob/6021a439ab9c29f4c46f721eeb59f6b992c425ea/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md#miscellaneous).
- The TRT-LLM engine config files used by launch and deploy flows live under
[`examples/backends/trtllm/engine_configs/`](../../../../examples/backends/trtllm/engine_configs/README.md).
- If you need to customize model parallelism, replica counts, or routing mode,
edit the recipe-local manifest rather than introducing a separate scheduler-specific guide.
- For the current catalog of supported recipes, see [recipes/README.md](../../../../recipes/README.md).
......@@ -104,10 +104,6 @@ For comprehensive instructions on multinode serving, see the [Multinode Examples
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see the [TensorRT-LLM Kubernetes Deployment Guide](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md).
### Performance Sweep
For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/performance_sweeps/README.md).
## Client
See the [client](../sglang/README.md#testing-the-deployment) section to learn how to send requests to the deployment.
......
......@@ -529,7 +529,6 @@ flowchart TD
## Next Steps
- **Production Deployment**: For multi-node deployments, see the [Multi-node Guide](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/multinode/README.md)
- **Advanced Configuration**: Explore TensorRT-LLM engine building options for further optimization
- **Monitoring**: Set up Prometheus and Grafana for production monitoring
- **Performance Benchmarking**: Use AIPerf to measure and optimize your deployment performance
......@@ -105,7 +105,7 @@ For basic Kubernetes deployment with the KV Router, see the [Kubernetes Deployme
- [TRT-LLM aggregated router example](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/agg_router.yaml)
- [vLLM aggregated router example](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/agg_router.yaml)
- [SGLang aggregated router example](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy/agg_router.yaml)
- [Distributed inference tutorial](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/kubernetes/Distributed_Inference/agg_router.yaml)
- [Kubernetes deployment guide](../../kubernetes/README.md)
**For A/B Testing and Advanced K8s Setup:**
See the comprehensive [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md) for step-by-step instructions on deploying, configuring, and benchmarking the KV router in Kubernetes.
......
......@@ -327,65 +327,6 @@ sequenceDiagram
Frontend->>Client: Stream response
```
## Multi-node Deployment (Slurm)
This section demonstrates how to deploy large multimodal models that require a multi-node setup using Slurm.
> **Note:** The scripts referenced in this section can be found in [`examples/basics/multinode/trtllm/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/multinode/trtllm/).
### Environment Setup
Assuming you have allocated your nodes via `salloc` and are inside an interactive shell:
```bash
# Container image (build using docs/backends/trtllm/README.md#build-container)
export IMAGE="<dynamo_trtllm_image>"
# Host:container path pairs for mounting
export MOUNTS="${PWD}/../../../../:/mnt"
# Model configuration
export MODEL_PATH="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
export SERVED_MODEL_NAME="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
export MODALITY=${MODALITY:-"multimodal"}
```
### Multi-node Disaggregated Launch
For four 4xGB200 nodes (2 for prefill, 2 for decode):
```bash
# Customize parallelism to match your engine configs
# export PREFILL_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/multimodal/prefill.yaml"
# export DECODE_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/multimodal/decode.yaml"
# export NUM_PREFILL_NODES=2
# export NUM_DECODE_NODES=2
# export NUM_GPUS_PER_NODE=4
# Launches frontend + etcd/nats on head node, plus prefill and decode workers
./srun_disaggregated.sh
```
### Understanding the Output
1. `srun_disaggregated.sh` launches three srun jobs: frontend, prefill worker, and decode worker
2. The OpenAI frontend will dynamically discover workers as they register:
```text
INFO dynamo_run::input::http: Watching for remote model at models
INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000
```
3. TRT-LLM workers output progress from each MPI rank while loading
4. When ready, the frontend logs:
```text
INFO dynamo_llm::discovery::watcher: added model model_name="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
```
### Cleanup
```bash
pkill srun
```
## Embedding Cache
Dynamo supports embedding cache in both aggregated and disaggregated settings for TRT-LLM:
......@@ -499,4 +440,3 @@ Common examples:
| `components/src/dynamo/trtllm/request_handlers/handler_base.py` | Base handler with disaggregated params encoding/decoding |
| `components/src/dynamo/trtllm/utils/disagg_utils.py` | DisaggregatedParamsCodec for network transfer |
| `components/src/dynamo/trtllm/utils/trtllm_utils.py` | Command-line argument parsing |
......@@ -10,7 +10,7 @@ title: Inference Gateway (GAIE)
Integrate Dynamo with the Gateway API Inference Extension for intelligent KV-aware request routing at the gateway layer.
EPP's default kv-routing approach is not token-aware because the prompt is not tokenized. But the Dynamo plugin uses a token-aware KV algorithm. It employs the dynamo router which implements kv routing by running your model's tokenizer inline. The EPP plugin configuration lives in [`helm/dynamo-gaie/epp-config-dynamo.yaml`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/inference-gateway/standalone/helm/dynamo-gaie/epp-config-dynamo.yaml) per EPP [convention](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/).
EPP's default kv-routing approach is not token-aware because the prompt is not tokenized. But the Dynamo plugin uses a token-aware KV algorithm. It employs the dynamo router which implements kv routing by running your model's tokenizer inline. The EPP plugin configuration lives in [`helm/dynamo-gaie/epp-config-dynamo.yaml`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/inference-gateway/standalone/helm/dynamo-gaie/epp-config-dynamo.yaml), following the checked-in GAIE/EPP configuration layout used by this repository.
Dynamo Integration with the Inference Gateway supports Aggregated and Disaggregated Serving. A request only exercises disaggregated routing when the EPP config defines a `prefill` profile and prefill workers are available. The standalone [`epp-config-dynamo.yaml`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/inference-gateway/standalone/helm/dynamo-gaie/epp-config-dynamo.yaml) currently only defines a `decode` profile, while the recipe examples use separate aggregated and disaggregated configs under `recipes/llama-3-70b/vllm/agg/gaie/` and `recipes/llama-3-70b/vllm/disagg-single-node/gaie/`. Unless `DYN_ENFORCE_DISAGG=true`, deployments without a `prefill` profile or prefill workers fall back to aggregated serving.
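The fallback rule above can be sketched as a small decision function. This is an illustration, not the EPP plugin's code: `select_serving_mode` and its arguments are hypothetical, and the `rejected` outcome under `DYN_ENFORCE_DISAGG=true` is an assumption, since the text does not say what an enforced deployment does when prefill is unavailable.

```bash
# Sketch of the serving-mode decision described above.
# Arguments are "true"/"false" strings; the function name is hypothetical.
select_serving_mode() {
    has_prefill_profile="$1"   # EPP config defines a `prefill` profile
    has_prefill_workers="$2"   # prefill workers are available
    # Third argument overrides the environment variable, for easy testing.
    enforce_disagg="${3:-${DYN_ENFORCE_DISAGG:-false}}"
    if [ "$has_prefill_profile" = "true" ] && [ "$has_prefill_workers" = "true" ]; then
        echo "disaggregated"
    elif [ "$enforce_disagg" = "true" ]; then
        # Assumption: with enforcement on, there is no aggregated fallback.
        echo "rejected"
    else
        echo "aggregated"
    fi
}

select_serving_mode true false false   # prints: aggregated
```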
If you want to use LoRA, deploy Dynamo without the Inference Gateway.
......
......@@ -26,9 +26,9 @@ This directory contains practical examples demonstrating how to deploy and use D
Learn fundamental Dynamo concepts through these introductory examples:
- **[Quickstart](/examples/basics/quickstart/README.md)** - Simple aggregated serving example with vLLM backend
- **[Disaggregated Serving](/examples/basics/disaggregated_serving/README.md)** - Prefill/decode separation for enhanced performance and scalability
- **[Multi-node](/examples/basics/multinode/README.md)** - Distributed inference across multiple nodes and GPUs
- **[Quickstart](/docs/getting-started/quickstart.md)** - Simple local Dynamo setup across supported backends
- **[Disaggregated Serving](/docs/features/disaggregated-serving/README.md)** - Prefill/decode separation for enhanced performance and scalability
- **[Multi-node TensorRT-LLM](/docs/backends/trtllm/multinode/trtllm-multinode-examples.md)** - Distributed inference across multiple nodes and GPUs
## Framework Support
......@@ -56,7 +56,7 @@ Low-level runtime examples for developers using Python<>Rust bindings:
## Getting Started
1. **Choose your deployment pattern**: Start with the [Quickstart](/examples/basics/quickstart/README.md) for a simple local deployment, or explore [Disaggregated Serving](/examples/basics/disaggregated_serving/README.md) for advanced architectures.
1. **Choose your deployment pattern**: Start with the [Quickstart](/docs/getting-started/quickstart.md) for a simple local deployment, or explore [Disaggregated Serving](/docs/features/disaggregated-serving/README.md) for advanced architectures.
2. **Set up prerequisites**: Most examples require etcd and NATS services. You can start them using:
```bash
......
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
tensor_parallel_size: 1
moe_expert_parallel_size: 1
enable_attention_dp: false
max_num_tokens: 8192
max_batch_size: 16
trust_remote_code: true
backend: pytorch
enable_chunked_prefill: true
kv_cache_config:
  free_gpu_memory_fraction: 0.85
# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
# NOTE: overlap_scheduler enabled by default since this commit and changed
# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
cuda_graph_config:
  max_batch_size: 16
\ No newline at end of file
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
tensor_parallel_size: 1
moe_expert_parallel_size: 1
enable_attention_dp: false
max_num_tokens: 8192
trust_remote_code: true
backend: pytorch
enable_chunked_prefill: true
disable_overlap_scheduler: false
cuda_graph_config:
  max_batch_size: 16
kv_cache_config:
  free_gpu_memory_fraction: 0.85
cache_transceiver_config:
  backend: DEFAULT
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
tensor_parallel_size: 1
moe_expert_parallel_size: 1
enable_attention_dp: false
max_num_tokens: 8192
trust_remote_code: true
backend: pytorch
enable_chunked_prefill: true
# Overlap scheduler not currently supported in prefill only workers.
disable_overlap_scheduler: true
cuda_graph_config:
  max_batch_size: 16
kv_cache_config:
  free_gpu_memory_fraction: 0.85
cache_transceiver_config:
  backend: DEFAULT
\ No newline at end of file
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# NOTE: FP4 only supported starting with Blackwell GPUs.
# https://huggingface.co/nvidia/DeepSeek-R1-FP4
# You can also specify the full path to locally downloaded weights
# instead of a HuggingFace ID here.
backend: pytorch
tensor_parallel_size: 4
moe_expert_parallel_size: 4
enable_attention_dp: true
max_batch_size: 256
# 8448 = 8192 ISL + 256 OSL
max_num_tokens: 8448
max_seq_len: 8448
kv_cache_config:
  free_gpu_memory_fraction: 0.30
  dtype: fp8
# Enable MTP (Multi-Token Prediction) in the model engine
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 1
cuda_graph_config:
  enable_padding: true
  batch_sizes:
    - 1
    - 2
    - 4
    - 8
    - 16
    - 32
    - 64
    - 128
    - 256
print_iter_log: true
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
backend: pytorch
# TP/EP/PP/DP
tensor_parallel_size: 4
moe_expert_parallel_size: 4
pipeline_parallel_size: 1
enable_attention_dp: false
max_batch_size: 256
# 8448 = 8192 ISL + 256 OSL
max_num_tokens: 8448
max_seq_len: 8448
kv_cache_config:
  # With dp attention disabled: high free_gpu_memory_fraction is fine.
  free_gpu_memory_fraction: 0.85
  # With dp attention enabled: large ISL at high concurrency may need
  # free_gpu_memory_fraction low to have enough available memory.
  # free_gpu_memory_fraction: 0.30
  dtype: fp8
# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
# NOTE: overlap_scheduler enabled by default since this commit and changed
# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
cuda_graph_config:
  enable_padding: true
  # NOTE: For larger max batch size, you may want to add larger cuda graph
  # batch sizes below to match.
  batch_sizes:
    - 1
    - 2
    - 4
    - 8
    - 16
    - 32
    - 64
    - 128
    - 256
print_iter_log: true
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Example of a Multi-node worker, but no WideEP or EPLB.
# See wide_ep*.yaml for WideEP example configs.
backend: pytorch
tensor_parallel_size: 16
moe_expert_parallel_size: 16
enable_attention_dp: true
max_batch_size: 256
max_num_tokens: 256
max_seq_len: 8448
kv_cache_config:
  free_gpu_memory_fraction: 0.7
  dtype: fp8
cuda_graph_config:
  enable_padding: true
  batch_sizes:
    - 1
    - 2
    - 4
    - 8
    - 16
    - 32
    - 64
    - 128
    - 256
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# moe_load_balancer settings for TRTLLM based on:
# https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/ep_load_balancer/README.md#online-ep-load-balancer
num_slots: 288
layer_updates_per_iter: 2
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
backend: pytorch
# WideEP related settings
moe_config:
  backend: WIDEEP
  # moe_max_num_tokens will default to max_num_tokens if left unspecified.
  #
  # If you want to set this value explicitly, one recommendation is below:
  #   moe_max_num_tokens = max_batch_size * moe_expert_parallel_size
  #   4096 = 256 * 16
  # moe_max_num_tokens: 4096
  load_balancer: /mnt/examples/backends/trtllm/engine_configs/deepseek-r1/agg/wide_ep/eplb.yaml
tensor_parallel_size: 16
moe_expert_parallel_size: 16
enable_attention_dp: true
max_batch_size: 256
max_num_tokens: 256
max_seq_len: 8448
kv_cache_config:
  free_gpu_memory_fraction: 0.3
  dtype: fp8
cuda_graph_config:
  enable_padding: true
  batch_sizes:
    - 1
    - 2
    - 4
    - 8
    - 16
    - 32
    - 64
    - 128
    - 256
\ No newline at end of file
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# NOTE: FP4 only supported starting with Blackwell GPUs.
# https://huggingface.co/nvidia/DeepSeek-R1-FP4
# You can also specify the full path to locally downloaded weights
# instead of a HuggingFace ID here.
backend: pytorch
tensor_parallel_size: 4
moe_expert_parallel_size: 4
enable_attention_dp: false
max_batch_size: 256
# Note: When MTP is enabled and `cuda_graph_batch_sizes` is specified, `max_num_tokens` must satisfy the following formula:
# max_num_tokens >= max(cuda_graph_batch_sizes) * (num_nextn_predict_layers + 1)
# This is a known issue in TensorRT-LLM and will be resolved in the next release.
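# Worked check for the values in this file (a sketch, assuming the largest
# CUDA graph batch size listed below is 256 and num_nextn_predict_layers is 1):
#   256 * (1 + 1) = 512
# so 512 is the smallest max_num_tokens that satisfies the formula above.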
max_num_tokens: 512
# 8704 = 8192 ISL + 512 OSL
max_seq_len: 8704
kv_cache_config:
  free_gpu_memory_fraction: 0.85
  dtype: fp8
# Enable MTP (Multi-Token Prediction) in the decode model engine
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 1
cuda_graph_config:
  enable_padding: true
  batch_sizes:
    - 1
    - 2
    - 4
    - 8
    - 16
    - 32
    - 64
    - 128
    - 256
print_iter_log: true
cache_transceiver_config:
  backend: DEFAULT
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# NOTE: FP4 only supported starting with Blackwell GPUs.
# https://huggingface.co/nvidia/DeepSeek-R1-FP4
# You can also specify the full path to locally downloaded weights
# instead of a HuggingFace ID here.
backend: pytorch
tensor_parallel_size: 4
moe_expert_parallel_size: 4
enable_attention_dp: true
max_batch_size: 1
max_num_tokens: 8192
max_seq_len: 8192
kv_cache_config:
  free_gpu_memory_fraction: 0.75
  dtype: fp8
print_iter_log: true
disable_overlap_scheduler: true
# Enable MTP (Multi-Token Prediction) in the prefill model engine
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 1
cache_transceiver_config:
  backend: DEFAULT
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
backend: pytorch
# TP/EP/PP/DP
tensor_parallel_size: 4
moe_expert_parallel_size: 4
pipeline_parallel_size: 1
enable_attention_dp: false
max_batch_size: 256
max_num_tokens: 256
# 8448 = 8192 ISL + 256 OSL
max_seq_len: 8448
kv_cache_config:
  # With dp attention disabled: high free_gpu_memory_fraction is fine.
  free_gpu_memory_fraction: 0.85
  # With dp attention enabled: large ISL at high concurrency may need
  # free_gpu_memory_fraction low to have enough available memory.
  # free_gpu_memory_fraction: 0.30
  dtype: fp8
# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
# NOTE: overlap_scheduler enabled by default since this commit and changed
# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
disable_overlap_scheduler: false
cuda_graph_config:
  enable_padding: true
  # NOTE: For larger max batch size, you may want to
  # add larger cuda graph batch sizes below to match.
  batch_sizes:
    - 1
    - 2
    - 4
    - 8
    - 16
    - 32
    - 64
    - 128
    - 256
print_iter_log: true
cache_transceiver_config:
  backend: DEFAULT
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
backend: pytorch
# TP/EP/PP/DP
tensor_parallel_size: 4
moe_expert_parallel_size: 4
pipeline_parallel_size: 1
enable_attention_dp: true
max_batch_size: 1
max_num_tokens: 8192
max_seq_len: 8192
kv_cache_config:
  free_gpu_memory_fraction: 0.75
  dtype: fp8 # NOTE: This dtype must match in both prefill/decode configs
# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
# NOTE: overlap_scheduler enabled by default since this commit and changed
# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
disable_overlap_scheduler: true
print_iter_log: true
cache_transceiver_config:
  backend: DEFAULT