# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title:"ToolCallingwithDynamo"
---
# Tool Calling with Dynamo
You can connect Dynamo to external tools and services using function calling (also known as tool calling). By providing a list of available functions, Dynamo can choose
to output function arguments for the relevant function(s) which you can execute to augment the prompt with relevant external information.
...
...
@@ -21,14 +22,12 @@ To enable this feature, you should set the following flag while launching the ba
python -m dynamo.<backend> --help"
```
<Note>
If no tool call parser is provided by the user, Dynamo will try to use default tool call parsing based on `<TOOLCALL>` and `<|python_tag|>` tool tags.
</Note>
> [!NOTE]
> If no tool call parser is provided by the user, Dynamo will try to use default tool call parsing based on `<TOOLCALL>` and `<|python_tag|>` tool tags.
<Tip>
If your model's default chat template doesn't support tool calling, but the model itself does, you can specify a custom chat template per worker
with `python -m dynamo.<backend> --custom-jinja-template </path/to/template.jinja>`.
</Tip>
> [!TIP]
> If your model's default chat template doesn't support tool calling, but the model itself does, you can specify a custom chat template per worker
> with `python -m dynamo.<backend> --custom-jinja-template </path/to/template.jinja>`.
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title:"DynamoNIXLConnect"
---
# Dynamo NIXL Connect
Dynamo NIXL Connect specializes in moving data between models/workers in a Dynamo Graph, and for the use cases where registration and memory regions need to be dynamic.
Dynamo connect provides utilities for such use cases, using the NIXL-based I/O subsystem via a set of Python classes.
The relaxed registration comes with some performance overheads, but simplifies the integration process.
Especially for larger data transfer operations, such as between models in a multi-model graph, the overhead would be marginal.
The `dynamo.nixl_connect` library can be imported by any Dynamo container hosted application.
<Note>
Dynamo NIXL Connect will pick the best available method of data transfer available to it.
The available methods depend on the hardware and software configuration of the machines and network running the graph.
GPU Direct RDMA operations require that both ends of the operation have:
- NIC and GPU capable of performing RDMA operations
- Device drivers that support GPU-NIC direct interactions (aka "zero copy") and RDMA operations
- Network that supports InfiniBand or RoCE
With any of the above not satisfied, GPU Direct RDMA will not be available to the graph's workers, and less-optimal methods will be utilized to ensure basic functionality.
For additional information, please read this [GPUDirect RDMA](https://docs.nvidia.com/cuda/pdf/GPUDirect_RDMA.pdf) document.
</Note>
> [!NOTE]
> Dynamo NIXL Connect will pick the best available method of data transfer available to it.
> The available methods depend on the hardware and software configuration of the machines and network running the graph.
> GPU Direct RDMA operations require that both ends of the operation have:
> - NIC and GPU capable of performing RDMA operations
> - Device drivers that support GPU-NIC direct interactions (aka "zero copy") and RDMA operations
> - Network that supports InfiniBand or RoCE
> With any of the above not satisfied, GPU Direct RDMA will not be available to the graph's workers, and less-optimal methods will be utilized to ensure basic functionality.
> For additional information, please read this [GPUDirect RDMA](https://docs.nvidia.com/cuda/pdf/GPUDirect_RDMA.pdf) document.
```python
importdynamo.nixl_connect
...
...
@@ -85,9 +85,8 @@ flowchart LR
e2@{ animate: true; }
```
<Note>
When RDMA isn't available, the NIXL data transfer will still complete using non-accelerated methods.
</Note>
> [!NOTE]
> When RDMA isn't available, the NIXL data transfer will still complete using non-accelerated methods.
### Multimodal Example
...
...
@@ -135,10 +134,9 @@ flowchart LR
o2@{ animate: true; }
```
<Note>
In this example, it is the data transfer between the Prefill Worker and the Encode Worker that utilizes the Dynamo NIXL Connect library.
The KV Cache transfer between Decode Worker and Prefill Worker utilizes a different connector that also uses the NIXL-based I/O subsystem underneath.
</Note>
> [!NOTE]
> In this example, it is the data transfer between the Prefill Worker and the Encode Worker that utilizes the Dynamo NIXL Connect library.
> The KV Cache transfer between Decode Worker and Prefill Worker utilizes a different connector that also uses the NIXL-based I/O subsystem underneath.
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title:"dynamo.nixl_connect.RdmaMetadata"
---
# dynamo.nixl_connect.RdmaMetadata
A Pydantic type intended to provide JSON serialized NIXL metadata about a [`ReadableOperation`](readable-operation.md) or [`WritableOperation`](writable-operation.md) object.
NIXL metadata contains detailed information about a worker process and how to access memory regions registered with the corresponding agent.
This data is required to perform data transfers using the NIXL-based I/O subsystem.
<Warning>
NIXL metadata contains information to connect corresponding backends across agents, as well as identification keys to access specific registered memory regions.
This data provides direct memory access between workers, and should be considered sensitive and therefore handled accordingly.
</Warning>
> [!WARNING]
> NIXL metadata contains information to connect corresponding backends across agents, as well as identification keys to access specific registered memory regions.
> This data provides direct memory access between workers, and should be considered sensitive and therefore handled accordingly.
Use the respective class's `.metadata()` method to generate an `RdmaMetadata` object for an operation.
<Tip>
Classes using `RdmaMetadata` objects must be paired correctly.
[`ReadableOperation`](readable-operation.md) with [`ReadOperation`](read-operation.md), and
[`WritableOperation`](write-operation.md) with [`WriteOperation`](write-operation.md).
Incorrect pairing will result in an error being raised.
</Tip>
> [!TIP]
> Classes using `RdmaMetadata` objects must be paired correctly.
> [`ReadableOperation`](readable-operation.md) with [`ReadOperation`](read-operation.md), and
> [`WritableOperation`](write-operation.md) with [`WriteOperation`](write-operation.md).
> Incorrect pairing will result in an error being raised.
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title:"dynamo.nixl_connect.ReadableOperation"
---
# dynamo.nixl_connect.ReadableOperation
An operation which enables a remote worker to read data from the local worker.
To create the operation, a set of local [`Descriptor`](descriptor.md) objects must be provided that reference memory intended to be transferred to a remote worker.
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title:"dynamo.nixl_connect.WritableOperation"
---
# dynamo.nixl_connect.WritableOperation
An operation which enables a remote worker to write data to the local worker.
To create the operation, a set of local [`Descriptor`](descriptor.md) objects must be provided which reference memory intended to receive data from a remote worker.
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title:"RunningSGLangwithDynamo"
---
# Running SGLang with Dynamo
## Use the Latest Release
We recommend using the latest stable release of dynamo to avoid breaking changes:
...
...
@@ -65,9 +66,8 @@ Dynamo SGLang uses SGLang's native argument parser, so **most SGLang engine argu
-**Default (`--use-sglang-tokenizer` not set)**: Dynamo handles tokenization/detokenization via our blazing fast frontend and passes `input_ids` to SGLang
-**With `--use-sglang-tokenizer`**: SGLang handles tokenization/detokenization, Dynamo passes raw prompts
<Note>
When using `--use-sglang-tokenizer`, only `v1/chat/completions` is available through Dynamo's frontend.
</Note>
> [!NOTE]
> When using `--use-sglang-tokenizer`, only `v1/chat/completions` is available through Dynamo's frontend.
### Request Cancellation
...
...
@@ -80,9 +80,8 @@ When a user cancels a request (e.g., by disconnecting from the frontend), the re
| **Aggregated** | ✅ | ✅ |
| **Disaggregated** | ⚠️ | ✅ |
<Warning>
⚠️ SGLang backend currently does not support cancellation during remote prefill phase in disaggregated mode.
</Warning>
> [!WARNING]
> ⚠️ SGLang backend currently does not support cancellation during remote prefill phase in disaggregated mode.
For more details, see the [Request Cancellation Architecture](../../fault-tolerance/request-cancellation.md) documentation.
...
...
@@ -164,18 +163,22 @@ docker run \
Below we provide a guide that lets you run all of our common deployment patterns on a single node.
### Start NATS and ETCD in the background
### Start Infrastructure Services (Local Development Only)
Start using [Docker Compose](https://github.com/ai-dynamo/dynamo/tree/main/deploy/docker-compose.yml)
For local/bare-metal development, start etcd and optionally NATS using [Docker Compose](https://github.com/ai-dynamo/dynamo/tree/main/deploy/docker-compose.yml):
```bash
docker compose -f deploy/docker-compose.yml up -d
```
<Tip>
Each example corresponds to a simple bash script that runs the OpenAI compatible server, processor, and optional router (written in Rust) and LLM engine (written in Python) in a single terminal. You can easily take each command and run them in separate terminals.
Additionally - because we use sglang's argument parser, you can pass in any argument that sglang supports to the worker!
</Tip>
> [!NOTE]
> - **etcd** is optional but is the default local discovery backend. You can also use `--kv_store file` to use file system based discovery.
> - **NATS** is optional - only needed if using KV routing with events (default). You can disable it with `--no-kv-events` flag for prediction-based routing
> - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD)
> [!TIP]
> Each example corresponds to a simple bash script that runs the OpenAI compatible server, processor, and optional router (written in Rust) and LLM engine (written in Python) in a single terminal. You can easily take each command and run them in separate terminals.
> Additionally - because we use sglang's argument parser, you can pass in any argument that sglang supports to the worker!
# Expert Parallelism Load Balancer (EPLB) in SGLang
Mixture-of-Experts (MoE) models utilize a technique called Expert Parallelism (EP), where experts are distributed across multiple GPUs. While this allows for much larger and more powerful models, it can lead to an uneven workload distribution. Because the load on different experts may vary depending on the workload, some GPUs can become bottlenecks, forcing the entire system to wait. This imbalance leads to wasted compute cycles and increased memory usage.
To address this, SGLang implements an Expert Parallelism Load Balancer (EPLB) inspired by the work in the DeepSeek-V3 paper. EPLB analyzes expert usage patterns and dynamically re-arranges the experts across the available GPUs to ensure a more balanced workload.
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title:"ProfilingSGLangWorkersinDynamo"
---
# Profiling SGLang Workers in Dynamo
Dynamo exposes profiling endpoints for SGLang workers via the system server's `/engine/*` routes. This allows you to start and stop PyTorch profiling on running inference workers without restarting them.
These endpoints wrap SGLang's internal `TokenizerManager.start_profile()` and `stop_profile()` methods. See SGLang's documentation for the full list of supported parameters.
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title:"SGLangPrometheusMetrics"
---
# SGLang Prometheus Metrics
## Overview
When running SGLang through Dynamo, SGLang engine metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both SGLang engine metrics (prefixed with `sglang:`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.