@@ -200,6 +200,42 @@ Dynamo is built in the open with an OSS-first development model. We welcome cont
<details>
<details>
<summary>Older news</summary>
<summary>Older news</summary>
Dynamo provides comprehensive benchmarking tools:
- **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies using AIPerf
- **[SLA-Driven Deployments](docs/components/planner/planner-guide.md)** – Optimize deployments to meet SLA requirements
## Frontend OpenAPI Specification
The OpenAI-compatible frontend exposes an OpenAPI 3 spec at `/openapi.json`. To generate the spec without running the server:
```bash
cargo run -p dynamo-llm --bin generate-frontend-openapi
```
This writes to `docs/reference/api/openapi.json`.
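As a quick sanity check of the generated file, a short script can list the operations it declares. This is a minimal sketch, not part of the repository: the helper name `list_operations` is hypothetical, and the only details taken from the text above are the output path `docs/reference/api/openapi.json` and the OpenAPI 3 format.

```python
import json
from pathlib import Path

# HTTP methods that OpenAPI 3 allows as keys under a path item.
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def list_operations(spec: dict) -> list[tuple[str, str]]:
    """Return sorted (METHOD, path) pairs declared in an OpenAPI 3 document."""
    operations = []
    for path, item in spec.get("paths", {}).items():
        for method in item:
            # Skip non-operation keys such as "parameters" or "summary".
            if method.lower() in HTTP_METHODS:
                operations.append((method.upper(), path))
    return sorted(operations)


if __name__ == "__main__":
    # Path taken from the text above; run the generator first so the file exists.
    spec_path = Path("docs/reference/api/openapi.json")
    if spec_path.exists():
        for method, path in list_operations(json.loads(spec_path.read_text())):
            print(f"{method:7} {path}")
    else:
        print(f"{spec_path} not found; generate it first")
```

Run it from the repository root so the relative path resolves.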
## Service Discovery and Messaging
Dynamo uses TCP for inter-component communication. On Kubernetes, native resources ([CRDs + EndpointSlices](docs/kubernetes/service-discovery.md)) handle service discovery. External services are optional for most deployments:
| Deployment | etcd | NATS | Notes |
|------------|------|------|-------|
| **Local Development** | ❌ Not required | ❌ Not required | Pass `--discovery-backend file`; vLLM also needs `--kv-events-config '{"enable_kv_cache_events": false}'` |
| **Kubernetes** | ❌ Not required | ❌ Not required | K8s-native discovery; TCP request plane |
> **Note:** KV-Aware Routing requires NATS for prefix caching coordination.
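The local-development row of the table can be captured in a small helper that assembles the extra flags. This is an illustrative sketch: `local_dev_flags` is a hypothetical name, and only the two flags and the JSON payload come from the table. Building the payload with `json.dumps` avoids hand-quoting mistakes.

```python
import json


def local_dev_flags(backend: str) -> list[str]:
    """Extra flags for running locally without etcd or NATS (hypothetical helper)."""
    # File-based discovery, so etcd is not required.
    flags = ["--discovery-backend", "file"]
    if backend == "vllm":
        # Per the table, vLLM additionally needs KV cache events disabled.
        flags += ["--kv-events-config", json.dumps({"enable_kv_cache_events": False})]
    return flags
```

Passing the resulting list to a process launcher as separate arguments keeps the JSON intact without any shell quoting.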
For Slurm or other distributed deployments (and KV-aware routing):
- [etcd](https://etcd.io/) can be run directly as `./etcd`.
To quickly set up both: `docker compose -f deploy/docker-compose.yml up -d`
## More News
- [11/20] [Dell integrates PowerScale with Dynamo's NIXL for 19x faster TTFT](https://www.dell.com/en-us/dt/corporate/newsroom/announcements/detailpage.press-releases~usa~2025~11~dell-technologies-and-nvidia-advance-enterprise-ai-innovation.htm)
- [11/20] [WEKA partners with NVIDIA on KV cache storage for Dynamo](https://siliconangle.com/2025/11/20/nvidia-weka-kv-cache-solution-ai-inferencing-sc25/)
For general TensorRT-LLM features and configuration, see the [Reference Guide](../trtllm-reference-guide.md).
For general TensorRT-LLM features and engine configuration, see the
[Reference Guide](../trtllm-reference-guide.md).
---
## Recommended Path
> **Note:** The scripts referenced in this example (such as `srun_aggregated.sh` and `srun_disaggregated.sh`) can be found in [`examples/basics/multinode/trtllm/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/multinode/trtllm/).
To run a single Dynamo+TRTLLM worker that spans multiple nodes (e.g., TP16),
the set of nodes needs to be launched together in the same MPI world, such as
via `mpirun` or `srun`. This is true regardless of whether the worker is
aggregated, prefill-only, or decode-only.
In this document we demonstrate two examples of launching multinode workers
on a Slurm cluster with `srun`:
1. Deploying an aggregated nvidia/DeepSeek-R1 model as a multi-node TP16/EP16
worker across 4 GB200 nodes
2. Deploying a disaggregated nvidia/DeepSeek-R1 model with a multi-node
TP16/EP16 prefill worker (4 nodes) and a multi-node TP16/EP16 decode
worker (4 nodes) across a total of 8 GB200 nodes.
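The node counts in these two examples follow from dividing the parallelism degree by the number of GPUs per node. Below is a minimal sketch of that arithmetic, assuming 4 GPUs per GB200 node (implied by TP16 spanning 4 nodes, but not stated explicitly here); `nodes_needed` is a hypothetical helper.

```python
import math


def nodes_needed(parallel_size: int, gpus_per_node: int) -> int:
    """Number of nodes required to host one worker of the given TP/EP size."""
    if parallel_size <= 0 or gpus_per_node <= 0:
        raise ValueError("sizes must be positive")
    return math.ceil(parallel_size / gpus_per_node)


GPUS_PER_NODE = 4  # assumption: 4 GPUs per GB200 node

aggregated = nodes_needed(16, GPUS_PER_NODE)         # one TP16/EP16 worker
disaggregated = 2 * nodes_needed(16, GPUS_PER_NODE)  # prefill worker + decode worker
print(aggregated, disaggregated)  # prints "4 8"
```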
NOTE: Some of the scripts used in this example, such as `start_frontend_services.sh` and
`start_trtllm_worker.sh`, should translate to other environments, such as Kubernetes or
direct `mpirun` launches, with relative ease.
## Setup
For simplicity, this example makes some assumptions about your Slurm cluster:
1. First, we assume you have access to a Slurm cluster with multiple GPU nodes
available. For functional testing, most setups should be fine. For performance
testing, you should aim to allocate groups of nodes that share a fast
interconnect, such as those in an NVL72 setup.
2. Second, we assume this Slurm cluster has the [Pyxis](https://github.com/NVIDIA/pyxis)
SPANK plugin set up. In particular, the `srun_aggregated.sh` script in this
example uses `srun` arguments like `--container-image`,
`--container-mounts`, and `--container-env` that Pyxis adds to `srun`.
If your cluster supports a similar container-based plugin, you may be able to
modify the script to use that instead.
3. Third, we assume you have a Dynamo+TRTLLM container image available.
You can use the [prebuilt container](../README.md#quick-start) or [build a custom one](../trtllm-building-custom-container.md).
Set the `IMAGE` environment variable to this image in later steps.
4. Fourth, we assume you pre-allocate a group of nodes using `salloc`. We
will allocate 8 nodes below as a reference command to have enough capacity
to run both examples. If you plan to only run the aggregated example, you
will only need 4 nodes. If you customize the configurations to require a
different number of nodes, you can adjust the number of allocated nodes
accordingly. Pre-allocating nodes is technically not a requirement,
but it makes iterations of testing/experimenting easier.
Make sure to set your `PARTITION` and `ACCOUNT` according to your Slurm cluster setup:
```bash
# Set partition manually based on your Slurm cluster's partition names
PARTITION=""
# Set account manually if this command doesn't work on your cluster
ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)"
salloc \
  --partition="${PARTITION}" \
  --account="${ACCOUNT}" \
  --job-name="${ACCOUNT}-dynamo.trtllm" \
  -t 05:00:00 \
  --nodes 8
```
For multinode TensorRT-LLM deployments, start from the checked-in Kubernetes
recipes under [`recipes/`](../../../../recipes/README.md). Those manifests are
the supported entrypoints for launching multi-node workers, frontend services,
and related routing components.
5. Lastly, we will assume you are inside an interactive shell on one of your allocated
nodes, which may be the default behavior after executing the `salloc` command above
depending on the cluster setup. If not, then you should SSH into one of the allocated nodes.
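When allocations are scripted rather than typed by hand, the same `salloc` invocation can be assembled as an argument list, which avoids shell quoting and line-continuation issues. This is a sketch using only the flags from the Setup steps; the function name and the example partition and account values are placeholders.

```python
import shlex


def salloc_command(partition: str, account: str, nodes: int = 8,
                   time_limit: str = "05:00:00") -> list[str]:
    """Build the salloc argument list described in the Setup steps."""
    return [
        "salloc",
        f"--partition={partition}",
        f"--account={account}",
        f"--job-name={account}-dynamo.trtllm",
        "-t", time_limit,
        "--nodes", str(nodes),
    ]


# Placeholder values; substitute your cluster's partition and account.
cmd = salloc_command("my_partition", "my_account", nodes=4)
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd)`; `shlex.join` is used here only to show the equivalent shell command.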
### Environment Variable Setup
This example aims to automate as much of the environment setup as possible,
but all Slurm clusters and environments are different, and you may need to
The main TRT-LLM recipe entrypoints are:
- [DeepSeek-R1 WideEP on GB200](../../../../recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml)
@@ -104,10 +104,6 @@ For comprehensive instructions on multinode serving, see the [Multinode Examples
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see the [TensorRT-LLM Kubernetes Deployment Guide](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md).
### Performance Sweep
For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/performance_sweeps/README.md).
## Client
See the [client](../sglang/README.md#testing-the-deployment) section to learn how to send requests to the deployment.
- **Production Deployment**: For multi-node deployments, see the [Multi-node Guide](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/multinode/README.md)
- **Advanced Configuration**: Explore TensorRT-LLM engine building options for further optimization
- **Monitoring**: Set up Prometheus and Grafana for production monitoring
- **Performance Benchmarking**: Use AIPerf to measure and optimize your deployment performance
See the comprehensive [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md) for step-by-step instructions on deploying, configuring, and benchmarking the KV router in Kubernetes.
This section demonstrates how to deploy large multimodal models that require a multi-node setup using Slurm.
> **Note:** The scripts referenced in this section can be found in [`examples/basics/multinode/trtllm/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/multinode/trtllm/).
### Environment Setup
Assuming you have allocated your nodes via `salloc` and are inside an interactive shell:
```bash
# Container image (build using docs/backends/trtllm/README.md#build-container)
Integrate Dynamo with the Gateway API Inference Extension for intelligent KV-aware request routing at the gateway layer.
EPP's default kv-routing approach is not token-aware because the prompt is not tokenized. But the Dynamo plugin uses a token-aware KV algorithm. It employs the dynamo router which implements kv routing by running your model's tokenizer inline. The EPP plugin configuration lives in [`helm/dynamo-gaie/epp-config-dynamo.yaml`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/inference-gateway/standalone/helm/dynamo-gaie/epp-config-dynamo.yaml) per EPP [convention](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/).
EPP's default kv-routing approach is not token-aware because the prompt is not tokenized. But the Dynamo plugin uses a token-aware KV algorithm. It employs the dynamo router which implements kv routing by running your model's tokenizer inline. The EPP plugin configuration lives in [`helm/dynamo-gaie/epp-config-dynamo.yaml`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/inference-gateway/standalone/helm/dynamo-gaie/epp-config-dynamo.yaml), following the checked-in GAIE/EPP configuration layout used by this repository.
Dynamo Integration with the Inference Gateway supports Aggregated and Disaggregated Serving. A request only exercises disaggregated routing when the EPP config defines a `prefill` profile and prefill workers are available. The standalone [`epp-config-dynamo.yaml`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/inference-gateway/standalone/helm/dynamo-gaie/epp-config-dynamo.yaml) currently only defines a `decode` profile, while the recipe examples use separate aggregated and disaggregated configs under `recipes/llama-3-70b/vllm/agg/gaie/` and `recipes/llama-3-70b/vllm/disagg-single-node/gaie/`. Unless `DYN_ENFORCE_DISAGG=true`, deployments without a `prefill` profile or prefill workers fall back to aggregated serving.
If you want to use LoRA, deploy Dynamo without the Inference Gateway.
@@ -26,9 +26,9 @@ This directory contains practical examples demonstrating how to deploy and use D
Learn fundamental Dynamo concepts through these introductory examples:
- **[Quickstart](/examples/basics/quickstart/README.md)** - Simple aggregated serving example with vLLM backend
- **[Quickstart](/docs/getting-started/quickstart.md)** - Simple local Dynamo setup across supported backends
- **[Disaggregated Serving](/examples/basics/disaggregated_serving/README.md)** - Prefill/decode separation for enhanced performance and scalability
- **[Disaggregated Serving](/docs/features/disaggregated-serving/README.md)** - Prefill/decode separation for enhanced performance and scalability
- **[Multi-node](/examples/basics/multinode/README.md)** - Distributed inference across multiple nodes and GPUs
- **[Multi-node TensorRT-LLM](/docs/backends/trtllm/multinode/trtllm-multinode-examples.md)** - Distributed inference across multiple nodes and GPUs
## Framework Support
...
@@ -56,7 +56,7 @@ Low-level runtime examples for developers using Python<>Rust bindings:
## Getting Started
1. **Choose your deployment pattern**: Start with the [Quickstart](/examples/basics/quickstart/README.md) for a simple local deployment, or explore [Disaggregated Serving](/examples/basics/disaggregated_serving/README.md) for advanced architectures.
1. **Choose your deployment pattern**: Start with the [Quickstart](/docs/getting-started/quickstart.md) for a simple local deployment, or explore [Disaggregated Serving](/docs/features/disaggregated-serving/README.md) for advanced architectures.
2. **Set up prerequisites**: Most examples require etcd and NATS services. You can start them using: