This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.
See [deployment architectures](../llm/README.md#deployment-architectures) to learn about the general idea of the architecture.
Note that this TensorRT-LLM version does not support all the options yet.
Note: TensorRT-LLM disaggregation does not support conditional disaggregation yet. You can only configure the deployment to always use aggregate or disaggregated serving.
Note: TensorRT-LLM disaggregation does not support conditional disaggregation yet. You can configure the deployment to always use either aggregate or disaggregated serving.
## Getting Started
...
...
@@ -83,8 +101,8 @@ This figure shows an overview of the major components to deploy:
@@ -96,201 +114,73 @@ This figure shows an overview of the major components to deploy:
```
Note: The above architecture illustrates all the components. The final components
that get spawned depend upon the chosen graph.
**Note:** The diagram above shows all possible components in a deployment. Depending on the chosen disaggregation strategy, you can configure whether Worker1 handles prefill and Worker2 handles decode, or vice versa. For more information on how to select and configure these strategies, see the [Disaggregation Strategy](#disaggregation-strategy) section below.
### Example architectures
### Single-Node Deployments
#### Aggregated serving
```bash
cd /workspace/examples/tensorrt_llm
dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml
```
#### Aggregated serving with KV Routing
```bash
cd /workspace/examples/tensorrt_llm
dynamo serve graphs.agg:Frontend -f ./configs/agg_router.yaml
```
#### Disaggregated serving
```bash
cd /workspace/examples/tensorrt_llm
dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml
```
> [!IMPORTANT]
> Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `dynamo-run` to start up the ingress and using `python3` to start up the workers. You can easily take each command and run them in separate terminals.
#### Disaggregated serving with KV Routing
#### Aggregated
```bash
cd/workspace/examples/tensorrt_llm
dynamo serve graphs.disagg:Frontend -f ./configs/disagg_router.yaml
cd$DYNAMO_ROOT/examples/tensorrt_llm
./launch/agg.sh
```
#### Aggregated serving with Multi-Token Prediction (MTP) and DeepSeek R1
#### Aggregated with KV Routing
```bash
cd/workspace/examples/tensorrt_llm
dynamo serve graphs.agg:Frontend -f configs/deepseek_r1/mtp/mtp_agg.yaml
cd$DYNAMO_ROOT/examples/tensorrt_llm
./launch/agg_router.sh
```
Notes:
- MTP is only available within the container built with the experimental TensorRT-LLM commit. Please add --use-default-experimental-tensorrtllm-commit to the arguments of the build.sh script.
- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
#### Multi-Node Disaggregated Serving
In the following example, we will demonstrate how to run a Disaggregated Serving
deployment across multiple nodes. For simplicity, we will demonstrate how to
deploy a single Decode worker on one node, and a single Prefill worker on the other node.
However, the instance counts, TP sizes, other configs, and responsibilities of each node
can be customized and deployed in similar ways.
#### Disaggregated
For example, to deploy Deepseek R1, you could replace the referenced example
configs (`configs/agg.yaml`, `configs/disagg.yaml`) with corresponding Deepseek R1
example configs (`configs/deepseek_r1/agg.yaml`, `configs/deepseek_r1/disagg.yaml`).
You can find the example Deepseek R1 configs for GB200
[here](configs/deepseek_r1), but the config settings can be customized for testing
other hardware configurations or parallelism strategies.
> [!IMPORTANT]
> Disaggregated serving supports two strategies for request flow: `"prefill_first"` and `"decode_first"`. By default, the script below uses the `"decode_first"` strategy, which can reduce response latency by minimizing extra hops in the return path. You can switch strategies by setting the `DISAGGREGATION_STRATEGY` environment variable.
This "multi-node" example demonstrates how to generally connect dynamo workers from
different nodes, but for simplicity, each worker individually fits on a single node.
For details on how to launch a worker that spans multiple nodes due to sheer model
size, or for features like large scale expert parallelism, see the
# NOTE: Clearing out the etcd and nats jetstream data directories across runs
# helps to guarantee a clean and reproducible results.
cd$DYNAMO_ROOT/examples/tensorrt_llm
./launch/disagg.sh
```
Launch graph of Frontend and TensorRTLLMWorker (decode) on head node:
#### Disaggregated with KV Routing
```bash
cd /workspace/examples/tensorrt_llm
dynamo serve graphs.agg:Frontend -f ./configs/disagg.yaml &
```
Notes:
- The aggregated graph (`graphs.agg`) is chosen here because it also describes
our desired deployment settings for the head node: launching the utility components
(Frontend, Processor), and only the decode worker (TensorRTLLMWorker configured with
`remote-prefill` enabled). We plan to launch the `TensorRTLLMPrefillWorker`
independently on a separate node in the next step of this demonstration.
You are free to customize the graph and configuration of components launched on
each node.
- The disaggregated config `configs/disagg.yaml` is intentionally chosen here as a
single source of truth to be used for deployments on all of our nodes, describing
the configurations for all of our components, including both decode and prefill
workers, but can be customized based on your deployment needs.
##### Worker Node(s)
Set environment variables pointing at the etcd/nats endpoints on the head node
so the Dynamo Distributed Runtime can orchestrate communication and
discoverability between the head node and worker nodes:
```bash
# if not head node
export HEAD_NODE_IP="<head-node-ip>"
export NATS_SERVER="nats://${HEAD_NODE_IP}:4222"
export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
```
> [!IMPORTANT]
> Disaggregated serving with KV routing uses a "prefill first" workflow by default. Currently, Dynamo supports KV routing to only one endpoint per model. In disaggregated workflow, it is generally more effective to route requests to the prefill worker. If you wish to use a "decode first" workflow instead, you can simply set the `DISAGGREGATION_STRATEGY` environment variable accordingly.
Deploy a Prefill worker:
```bash
cd/workspace/examples/tensorrt_llm
dynamo serve components.prefill_worker:TensorRTLLMPrefillWorker -f ./configs/disagg.yaml --service-name TensorRTLLMPrefillWorker &
cd$DYNAMO_ROOT/examples/tensorrt_llm
./launch/disagg_router.sh
```
Now you have a 2-node deployment with 1 Decode worker on the head node, and 1 Prefill worker on a worker node!
##### Additional Notes for Multi-Node Deployments
Notes:
- To include a router in this deployment, change the graph to one that includes the router, such as `graphs.agg_router`,
and change the config to one that includes the router, such as `configs/disagg_router.yaml`
- This step is assuming you're disaggregated serving and planning to launch prefill workers on separate nodes.
Howerver, for an aggregated deployment with additional aggregated worker replicas on other nodes, this step
remains mostly the same. The primary difference between aggregation and disaggregation for this step is
whether or not the `TensorRTLLMWorker` is configured to do `remote-prefill` or not in the config file
(ex: `configs/disagg.yaml` vs `configs/agg.yaml`).
- To apply the same concept for launching additional decode workers on worker nodes, you can
directly start them, similar to the prefill worker step above:
```bash
# Example: deploy decode worker only
cd /workspace/examples/tensorrt_llm
dynamo serve components.worker:TensorRTLLMWorker -f ./configs/disagg.yaml --service-name TensorRTLLMWorker &
```
- If you see an error about MPI Spawn failing during TRTLLM Worker initialziation on a Slurm-based cluster,
try unsetting the following environment variables before launching the TRTLLM worker. If you intend to
run other slurm-based commands or processes on the same node after deploying the TRTLLM worker, you may
want to save these values into temporary variables and then restore them afterwards.
```bash
# Workaround for error: `mpi4py.MPI.Exception: MPI_ERR_SPAWN: could not spawn processes`
unset SLURM_JOBID SLURM_JOB_ID SLURM_NODELIST
```
#### Multi-Node Disaggregated Serving with Multi-Token Prediction (MTP) and DeepSeek R1
Most of the steps remain the same as the above example, but this time we will have `dynamo serve` point to different config files that contains the MTP configurations
##### Head Node
Start nats/etcd
#### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1
- MTP is only available within the container built with the experimental TensorRT-LLM commit. Please add --use-default-experimental-tensorrtllm-commit to the arguments of the build.sh script.
- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
### Multinode Deployment
For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode/multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see [Llama4+eagle](./llama4_plus_eagle.md) guide to learn how to use these scripts when a single worker fits on the single node.
### Client
See [client](../llm/README.md#client) section to learn how to send request to the deployment.
NOTE: To send a request to a multi-node deployment, target the node which deployed the `Frontend` component.
### Close deployment
See [close deployment](../../docs/guides/dynamo_serve.md#close-deployment) section to learn about how to close the deployment.
NOTE: To send a request to a multi-node deployment, target the node which is running `dynamo-run in=http`.
### Benchmarking
...
...
@@ -298,103 +188,23 @@ To benchmark your deployment with GenAI-Perf, see this utility script, configuri
`model` name and `host` based on your deployment: [perf.sh](../../benchmarks/llm/perf.sh)
### KV Cache Transfer for Disaggregated Serving
In disaggregated serving architectures, KV cache must be transferred between prefill and decode nodes. TensorRT-LLM supports two methods for this transfer:
#### Default Method: UCX
By default, TensorRT-LLM uses UCX (Unified Communication X) for KV cache transfer between prefill and decode nodes. UCX provides high-performance communication optimized for GPU-to-GPU transfers.
#### Experimental Method: NIXL
TensorRT-LLM also provides experimental support for using **NIXL** (NVIDIA Inference Xfer Library) for KV cache transfer. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments.
**Note:** NIXL support in TensorRT-LLM is experimental and is not suitable for production environments yet.
#### Using NIXL for KV Cache Transfer
**Note:** NIXL backend for TensorRT-LLM is currently only supported on AMD64 (x86_64) architecture. If you're running on ARM64, you'll need to use the default UCX method for KV cache transfer.
To enable NIXL for KV cache transfer in disaggregated serving:
1.**Build the container with NIXL support:**
The TensorRT-LLM wheel must be built from source with NIXL support. The `./container/build.sh` script caches previously built TensorRT-LLM wheels to reduce build time. If you have previously built a TensorRT-LLM wheel without NIXL support, you must delete the cached wheel to force a rebuild with NIXL support.
**Remove cached TensorRT-LLM wheel (only if previously built without NIXL support):**
```bash
rm-rf /tmp/trtllm_wheel
```
## Disaggregation Strategy
**Build the container with NIXL support:**
```bash
./container/build.sh --framework tensorrtllm \
--use-default-experimental-tensorrtllm-commit\
--trtllm-use-nixl-kvcache-experimental
```
The disaggregation strategy controls how requests are distributed between the prefill and decode workers in a disaggregated deployment.
**Note:** Both `--use-default-experimental-tensorrtllm-commit` and `--trtllm-use-nixl-kvcache-experimental` flags are required to enable NIXL support.
By default, Dynamo uses a `decode first` strategy: incoming requests are initially routed to the decode worker, which then forwards them to the prefill worker in round-robin fashion. The prefill worker processes the request and returns results to the decode worker for any remaining decode operations.
2.**Run the containerized environment:**
See [run container](#run-container) section to learn how to start the container image built in previous step.
When using KV routing, however, Dynamo switches to a `prefill first` strategy. In this mode, requests are routed directly to the prefill worker, which can help maximize KV cache reuse and improve overall efficiency for certain workloads. Choosing the appropriate strategy can have a significant impact on performance, depending on your use case.
3.**Start the disaggregated service:**
See [disaggregated serving](#disaggregated-serving) to see how to start the deployment.
4.**Send the request:**
See [client](#client) section to learn how to send the request to deployment.
**Important:** Ensure that ETCD and NATS services are running before starting the service.
The container will automatically configure the appropriate environment variables (`TRTLLM_USE_NIXL_KVCACHE=1`) when built with the NIXL flag. The same container image can be used to use UCX for KV cache transfer.
The disaggregation strategy can be set using the `DISAGGREGATION_STRATEGY` environment variable. You can set the strategy before launching your deployment, for example:
### Example architectures for Llama 4 Maverick Instruct + Eagle Speculative Decoding
Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV cache transfer guide](./kv-cache-tranfer.md).
#### Notes
* Testing for the current example used:
* One GB200x4 node for aggregate serving
* Two GB200x4 nodes for disaggregate serving
* To run Eagle Speculative Decoding with Llama 4, ensure the container meets the following criteria:
* Built with a version of TensorRT-LLM based on the 0.21 release [Link](https://github.com/NVIDIA/TensorRT-LLM/tree/release/0.21)
* The TensorRT-LLM build includes the changes from this PR [Link](https://github.com/NVIDIA/TensorRT-LLM/pull/5975)
* If you need to download model weights off huggingface, make sure you run the command `huggingface-cli login` and have access to the necessary gated models.
##### Aggregated Serving
```bash
cd /workspace/examples/tensorrt_llm
dynamo serve graphs.disagg:Frontend -f configs/llama4/eagle/eagle_agg.yaml
```
* Known Issue: In Aggregated Serving, setting `max_num_tokens` to higher values (e.g. `max_num_tokens: 8448`) can lead to Out of Memory (OOM) errors. This is being investigated by the TRTLLM team.