Skip to content
GitLab
Menu
Projects
Groups
Snippets
Loading...
Help
Help
Support
Community forum
Keyboard shortcuts
?
Submit feedback
Contribute to GitLab
Sign in / Register
Toggle navigation
Menu
Open sidebar
OpenDAS
dynamo
Commits
f0ac8e2b
Unverified
Commit
f0ac8e2b
authored
May 02, 2025
by
Ryan McCormick
Committed by
GitHub
May 02, 2025
Browse files
docs: Add multi-node TRTLLM steps to README (#930)
parent
58df5aca
Changes
1
Hide whitespace changes
Inline
Side-by-side
Showing
1 changed file
with
91 additions
and
3 deletions
+91
-3
examples/tensorrt_llm/README.md
examples/tensorrt_llm/README.md
+91
-3
No files found.
examples/tensorrt_llm/README.md
View file @
f0ac8e2b
...
...
@@ -143,20 +143,108 @@ dynamo serve graphs.disagg_router:Frontend -f ./configs/disagg_router.yaml
We are defining TRTLLM_USE_UCX_KVCACHE so that TRTLLM uses UCX for transfering the KV
cache between the context and generation workers.
#### Multi-Node Disaggregated Serving
In the following example, we will demonstrate how to run a Disaggregated Serving
deployment across multiple nodes. For simplicity, we will demonstrate how to
deploy a single Decode worker on one node, and a single Prefill worker on the other node.
However, the instance counts, TP sizes, other configs, and responsibilities of each node
can be customized and deployed in similar ways.
##### Head Node
Start nats/etcd:
```
bash
# NATS data persisted to /tmp/nats/jetstream by default
nats-server
-js
&
# Persist data to /tmp/etcd, otherwise defaults to ${PWD}/default.etcd if left unspecified
etcd
--listen-client-urls
http://0.0.0.0:2379
--advertise-client-urls
http://0.0.0.0:2379
--data-dir
/tmp/etcd &
# NOTE: Clearing out the etcd and nats jetstream data directories across runs
# helps to guarantee a clean and reproducible results.
```
Launch graph of Frontend, Processor, and TensorRTLLMWorker (decode) on head node:
```
bash
cd
/workspace/examples/tensorrt_llm
dynamo serve graphs.agg:Frontend
-f
./configs/disagg.yaml &
```
Notes:
-
The aggregated graph (
`graphs.agg`
) is chosen here because it also describes
our desired deployment settings for the head node: launching the utility components
(Frontend, Processor), and only the decode worker (TensorRTLLMWorker configured with
`remote-prefill`
enabled). We plan to launch the
`TensorRTLLMPrefillWorker`
independently on a separate node in the next step of this demonstration.
You are free to customize the graph and configuration of components launched on
each node.
-
The disaggregated config
`configs/disagg.yaml`
is intentionally chosen here as a
single source of truth to be used for deployments on all of our nodes, describing
the configurations for all of our components, including both decode and prefill
workers, but can be customized based on your deployment needs.
##### Worker Node(s)
Set environment variables pointing at the etcd/nats endpoints on the head node
so the Dynamo Distributed Runtime can orchestrate communication and
discoverability between the head node and worker nodes:
```
bash
# if not head node
export
HEAD_NODE_IP
=
"<head-node-ip>"
export
NATS_SERVER
=
"nats://
${
HEAD_NODE_IP
}
:4222"
export
ETCD_ENDPOINTS
=
"
${
HEAD_NODE_IP
}
:2379"
```
Deploy a Prefill worker:
```
cd /workspace/examples/tensorrt_llm
dynamo serve components.prefill_worker:TensorRTLLMPrefillWorker -f ./configs/disagg.yaml --service-name TensorRTLLMPrefillWorker &
```
Now you have a 2-node deployment with 1 Decode worker on the head node, and 1 Prefill worker on a worker node!
##### Additional Notes for Multi-Node Deployments
Notes:
-
To include a router in this deployment, change the graph to one that includes the router, such as
`graphs.agg_router`
,
and change the config to one that includes the router, such as
`configs/disagg_router.yaml`
-
This step is assuming you're disaggregated serving and planning to launch prefill workers on separate nodes.
Howerver, for an aggregated deployment with additional aggregated worker replicas on other nodes, this step
remains mostly the same. The primary difference between aggregation and disaggregation for this step is
whether or not the
`TensorRTLLMWorker`
is configured to do
`remote-prefill`
or not in the config file
(ex:
`configs/disagg.yaml`
vs
`configs/agg.yaml`
).
-
To apply the same concept for launching additional decode workers on worker nodes, you can
directly start them, similar to the prefill worker step above:
```
bash
# Example: deploy decode worker only
cd
/workspace/examples/tensorrt_llm
dynamo serve components.worker:TensorRTLLMWorker
-f
./configs/disagg.yaml
--service-name
TensorRTLLMWorker &
```
### Client
See
[
client
](
../llm/README.md#client
)
section to learn how to send request to the deployment.
NOTE: To send a request to a multi-node deployment, target the node which deployed the
`Frontend`
component.
### Close deployment
See
[
close deployment
](
../../docs/guides/dynamo_serve.md#close-deployment
)
section to learn about how to close the deployment.
Remaining tasks:
### Benchmarking
To benchmark your deployment with GenAI-Perf, see this utility script, configuring the
`model`
name and
`host`
based on your deployment:
[
perf.sh
](
../llm/benchmarks/perf.sh
)
### Future Work
Remaining tasks:
-
[x] Add support for the disaggregated serving.
-
[x] Add multi-node support.
-
[x] Add instructions for benchmarking.
-
[ ] Add integration test coverage.
-
[ ] Add instructions for benchmarking.
-
[ ] Add multi-node support.
-
[ ] Merge the code base with llm example to reduce the code duplication.
-
[ ] Use processor from dynamo-llm framework.
-
[ ] Enable NIXL integration with TensorRT-LLM once available. Currently, TensorRT-LLM uses UCX to transfer KV cache.
Write
Preview
Markdown
is supported
0%
Try again
or
attach a new file
.
Attach a file
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Cancel
Please
register
or
sign in
to comment