Commit 39d645e5 authored by Jonathan Tong, committed by GitHub

docs: migrate Fern docs from fern/ into docs/ (#6206)


Signed-off-by: Jont828 <jt572@cornell.edu>
parent d381e6ff
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Disaggregation and Performance Tuning
Disaggregation improves performance by running prefill and decode in separate engines, which reduces interference between the two phases.
However, performant disaggregation requires careful tuning of the inference parameters.
Specifically, three sets of parameters need to be tuned:
1. Engine configuration and options (e.g. parallelization mapping, maximum number of tokens, etc.).
2. Disaggregated router configuration and options.
3. Number of prefill and decode engines.
This guide describes the process of tuning these parameters.
## Engine Configuration and Tuning
The most important engine configuration to tune is the parallelization mapping.
For most dense models, the best setting is to use TP within node and PP across nodes.
For example, for Llama-405b w8a8 on H100, TP8 on a single node or TP8PP2 on two nodes is usually the best choice.
The next decision is how many GPUs to use to serve the model.
Typically, performance varies with the number of GPUs as follows:
| Number of GPUs | Performance |
| :-------------------------------------------------- | :---------------------------------------------------------------------------------------- |
| Cannot hold weights in VRAM | OOM |
| (Barely hold weights in VRAM) | (KV cache is too small to maintain large enough sequence length or reasonable batch size) |
| Minimum number with fair amount of KV cache | Best overall throughput/GPU, worst latency/user |
| Between minimum and maximum | Tradeoff between throughput/GPU and latency/user |
| Maximum number limited by communication scalability | Worst overall throughput/GPU, best latency/user |
| More than maximum | Communication overhead dominates, poor performance |
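The first rows of the table can be approximated with simple arithmetic before running anything. The following is a minimal sketch, not part of Dynamo: the helper names, the 80 GB per-GPU memory figure, the 1 byte per parameter for w8a8 weights, and the 0.9 usable-memory fraction are all assumptions made for illustration. It only estimates how many GPUs are needed to hold the weights; KV cache and activations need additional headroom.

```python
import math


def min_gpus_for_weights(params_billion, bytes_per_param, gpu_memory_gb,
                         usable_fraction=0.9):
    """Smallest GPU count whose combined usable memory fits the weights."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes = GB
    usable_gb_per_gpu = gpu_memory_gb * usable_fraction
    return math.ceil(weights_gb / usable_gb_per_gpu)


def next_power_of_two(n):
    """Round up to a power of two, a common constraint for TP sizes."""
    return 1 << (n - 1).bit_length()


# Assumed inputs: 405B parameters, 8-bit weights, 80 GB GPUs.
lower_bound = min_gpus_for_weights(405, 1, 80)
tp_size = next_power_of_two(lower_bound)
print(f"weights need >= {lower_bound} GPUs; smallest power-of-two TP: {tp_size}")
```

With these assumed inputs the estimate lands on 8 GPUs, which is consistent with the TP8 single-node example above. Whatever memory remains after the weights is what is available for KV cache.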
> [!Note]
> For decode-only engines, a larger number of GPUs sometimes leads to a larger KV cache per GPU and more decoding requests running in parallel, which improves both throughput/GPU and latency/user.
>
> For example, for Llama-3.3-70b NVFP4 quantization on B200 in vLLM with 0.9 free GPU memory fraction:
| TP Size | KV Cache Size (GB) | KV Cache per GPU (GB) | Per GPU Improvement over TP1 |
| ------: | -----------------: | --------------------: | ---------------------------: |
| 1 | 113 | 113 | 1.00x |
| 2 | 269 | 135 | 1.19x |
| 4 | 578 | 144 | 1.28x |
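The last column of the table follows directly from the first two. As a small worked check (the function name is ours, and the input numbers are copied from the table), the per-GPU improvement is the KV cache per GPU at a given TP size divided by the KV cache per GPU at TP1:

```python
# Total KV cache size (GB) per TP size, copied from the table above.
kv_cache_gb = {1: 113, 2: 269, 4: 578}


def per_gpu_improvement(total_gb_by_tp, baseline_tp=1):
    """Ratio of KV cache per GPU at each TP size to the baseline TP size."""
    baseline = total_gb_by_tp[baseline_tp] / baseline_tp
    return {tp: (total / tp) / baseline for tp, total in total_gb_by_tp.items()}


improvement = per_gpu_improvement(kv_cache_gb)
for tp, ratio in improvement.items():
    print(f"TP{tp}: {kv_cache_gb[tp] / tp:.1f} GB per GPU, {ratio:.2f}x over TP1")
```

The per-GPU values computed here are unrounded, so they can differ from the table's whole-GB column by a fraction of a gigabyte.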
The best number of GPUs for the prefill and decode engines can be determined by running a few tests with fixed ISL (input sequence length), OSL (output sequence length), and concurrency using [AIPerf](https://github.com/ai-dynamo/aiperf/tree/main) and comparing the results against the SLA.
AIPerf is pre-installed in the Dynamo container.
> [!Tip]
> If you are unfamiliar with AIPerf, please see this helpful [tutorial](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorial.md) to get you started.
Besides the parallelization mapping, other common knobs to tune are the maximum batch size, the maximum number of tokens, and the block size.
For prefill engines, a small batch size and a large `max_num_token` are usually preferred.
For decode engines, a large batch size and a medium `max_num_token` are usually preferred.
For details on tuning `max_num_token` and `max_batch_size`, see the next section.
If the block size is too small, the P->D KV cache transfer is split into small memory chunks, which hurts performance.
A block size that is too small also causes memory fragmentation in the attention calculation, but the impact is usually insignificant.
If the block size is too large, the prefix cache hit ratio drops.
For most dense models, we find that a block size of 128 is a good choice.
## Disaggregated Router
The disaggregated router decides whether to prefill a request in the remote prefill engine or locally in the decode engine using chunked prefill.
For most frameworks, when chunked prefill is enabled and one forward iteration contains a mixture of prefill and decode requests, three kernels are launched:
1. The attention kernel for context tokens (context_fmha kernel in TRTLLM).
2. The attention kernel for decode tokens (xqa kernel in TRTLLM).
3. The dense kernel for the combined active tokens in prefills and decodes.
### Prefill Engine
In the prefill engine, the best strategy is to operate at the smallest batch size that saturates the GPUs so that the average time to first token (TTFT) is minimized.
For example, for Llama-3.3-70b NVFP4 quantization on B200 TP1 in vLLM, the figure below shows the prefill time at different ISLs (prefix caching is turned off):
![Combined bar and line chart showing "Prefill Time". Bar chart represents TTFT (Time To First Token) in milliseconds against ISL (Input Sequence Length). The line chart shows TTFT/ISL (milliseconds per token) against ISL.](../images/prefill_time.png)
For ISL less than 1000, the prefill efficiency is low because the GPU is not fully saturated.
For ISL larger than 4000, the prefill time per token increases because the attention takes longer to compute with a longer history.
Currently, prefill engines in Dynamo operate at a batch size of 1.
To keep the prefill engine saturated, users can set `max-local-prefill-length` to the saturation point, so that shorter prefills are handled locally in the decode engine and only prefills long enough to saturate the GPU are sent to the prefill engine.
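One way to picture how `max-local-prefill-length` steers requests is as a simple length threshold. The sketch below is an illustration only and not Dynamo's router implementation: the class and function names are hypothetical, and the real router may weigh additional signals that this guide does not describe.

```python
from dataclasses import dataclass


@dataclass
class RouterConfig:
    # Hypothetical container mirroring the `max-local-prefill-length` knob.
    max_local_prefill_length: int


def should_prefill_remotely(prefill_tokens: int, config: RouterConfig) -> bool:
    """Send a request to a prefill engine only if its prefill is long enough.

    Assumed rule: prefills at or below the threshold are piggybacked onto
    the decode engine; longer ones go to the remote prefill engine.
    """
    if prefill_tokens < 0:
        raise ValueError("prefill_tokens must be non-negative")
    return prefill_tokens > config.max_local_prefill_length


# Threshold set near the saturation point read off the figure above.
config = RouterConfig(max_local_prefill_length=1000)
for isl in (200, 1000, 4000):
    target = "remote prefill engine" if should_prefill_remotely(isl, config) else "decode engine"
    print(f"ISL {isl}: prefill on {target}")
```

Under this assumed rule, raising the threshold moves more prefill work onto the decode engines, and lowering it moves more onto the prefill engines.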
### Decode Engine
In the decode engine, the maximum batch size and the maximum number of tokens affect the size of the intermediate tensors.
With a larger batch size and number of tokens, the intermediate tensors grow and the space left for the KV cache shrinks.
TensorRT-LLM (TRTLLM) has a good [summary](https://nvidia.github.io/TensorRT-LLM/reference/memory.html) of the memory footprint; similar ideas also apply to other LLM frameworks.
With chunked prefill enabled, the maximum number of tokens controls the longest prefill that can be piggybacked onto decode and therefore controls the inter-token latency (ITL).
For the same prefill requests, a large maximum number of tokens leads to fewer but longer stalls in the generation, while a small maximum number of tokens leads to more but shorter stalls in the generation.
However, chunked prefill is currently not supported in Dynamo (vLLM backend).
Hence, the current best strategy is to set the maximum batch size to the optimized KV cache size and set the maximum number of tokens to the maximum local prefill length + maximum batch size (since one decode request has one active token).
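The token budget in the last sentence is a plain sum. The following sketch spells it out; the function name and the example values are hypothetical and chosen only to show the arithmetic, not recommended settings.

```python
def decode_max_num_tokens(max_local_prefill_length: int, max_batch_size: int) -> int:
    """Per-iteration token budget for a decode engine.

    One locally prefilled request contributes up to
    `max_local_prefill_length` tokens, and each decode request in the
    batch contributes one active token.
    """
    if max_local_prefill_length < 0 or max_batch_size < 1:
        raise ValueError("expected a non-negative prefill length and a batch size >= 1")
    return max_local_prefill_length + max_batch_size


# Hypothetical values for illustration.
budget = decode_max_num_tokens(max_local_prefill_length=1000, max_batch_size=256)
print(budget)  # 1256
```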
## Number of Prefill and Decode Engines
The best choice of Dynamo knobs depends on the operating condition of the model.
Based on the load, we define three operating conditions:
1. **Low load**:
The endpoint is hit by a single user (single-stream) most of the time.
2. **Medium load**:
The endpoint is hit by multiple users, but the KV cache of the decode engines is never fully utilized.
3. **High load**:
The endpoint is hit by multiple users and the requests are queued up due to no available KV cache in the decode engines.
At low load, disaggregation offers little benefit because prefill and decode are usually computed separately anyway.
It is usually better to use a single monolithic engine.
At medium load, disaggregation gives each user better ITL than prefill-prioritized and chunked prefill engines, and better TTFT than chunked prefill and decode-only engines.
Dynamo users can adjust the number of prefill and decode engines based on the TTFT and ITL SLAs.
At high load, where KV cache capacity is the bottleneck, disaggregation has the following effects on KV cache usage in the decode engines:
* Increases the total amount of KV cache:
  * Being able to use larger TP values in decode engines leads to more KV cache per GPU and a higher prefix cache hit rate.
  * When a request is prefilled remotely, the decode engine does not need to maintain its KV cache (currently not implemented in Dynamo).
  * Lower ITL reduces the decode time and allows the same amount of KV cache to serve more requests.
* Decreases the total amount of KV cache:
  * Some GPUs are configured as prefill engines whose KV cache is not used in the decode phase.
Since Dynamo currently allocates the KV blocks as soon as the decode engine receives a request,
it is advisable to use as few prefill engines as possible (even none) to maximize the available KV cache in decode engines.
To prevent queueing at prefill engines, users can set a large `max-local-prefill-length` and piggyback more prefill requests onto decode engines.
{"name": "NVIDIA Dynamo", "version": "dev"}
\ No newline at end of file
{
"openapi": "3.1.0",
"info": {
"title": "NVIDIA Dynamo OpenAI Frontend",
"description": "OpenAI-compatible HTTP API for NVIDIA Dynamo.",
"contact": {
"name": "NVIDIA Dynamo",
"url": "https://github.com/ai-dynamo/dynamo"
},
"license": {
"name": "Apache-2.0"
},
"version": "0.7.0"
},
"servers": [
{
"url": "/",
"description": "Current server"
}
],
"paths": {
"/busy_threshold": {
"get": {
"summary": "Endpoint: /busy_threshold",
"description": "Endpoint for path: /busy_threshold",
"operationId": "get_busy_threshold",
"responses": {
"200": {
"description": "Successful response"
},
"400": {
"description": "Bad request - invalid input"
},
"404": {
"description": "Model not found"
},
"503": {
"description": "Service unavailable"
}
}
}
},
"/docs": {
"get": {
"summary": "API documentation",
"description": "Interactive API documentation powered by Swagger UI.",
"operationId": "get_docs",
"responses": {
"200": {
"description": "Successful response"
},
"400": {
"description": "Bad request - invalid input"
},
"404": {
"description": "Model not found"
},
"503": {
"description": "Service unavailable"
}
}
}
},
"/health": {
"get": {
"summary": "Health check",
"description": "Returns the health status of the service. Used for readiness probes.",
"operationId": "get_health",
"responses": {
"200": {
"description": "Successful response"
},
"400": {
"description": "Bad request - invalid input"
},
"404": {
"description": "Model not found"
},
"503": {
"description": "Service unavailable"
}
}
}
},
"/live": {
"get": {
"summary": "Liveness check",
"description": "Returns the liveness status of the service. Used for liveness probes.",
"operationId": "get_live",
"responses": {
"200": {
"description": "Successful response"
},
"400": {
"description": "Bad request - invalid input"
},
"404": {
"description": "Model not found"
},
"503": {
"description": "Service unavailable"
}
}
}
},
"/metrics": {
"get": {
"summary": "Prometheus metrics",
"description": "Returns Prometheus metrics for monitoring the service.",
"operationId": "get_metrics",
"responses": {
"200": {
"description": "Successful response"
},
"400": {
"description": "Bad request - invalid input"
},
"404": {
"description": "Model not found"
},
"503": {
"description": "Service unavailable"
}
}
}
},
"/openapi.json": {
"get": {
"summary": "OpenAPI specification",
"description": "Returns the OpenAPI 3.0 specification for this API in JSON format.",
"operationId": "get_openapi.json",
"responses": {
"200": {
"description": "Successful response"
},
"400": {
"description": "Bad request - invalid input"
},
"404": {
"description": "Model not found"
},
"503": {
"description": "Service unavailable"
}
}
}
},
"/v1/chat/completions": {
"post": {
"summary": "Create chat completion",
"description": "Creates a completion for a chat conversation. Supports both streaming and non-streaming modes. Compatible with OpenAI's chat completions API.",
"operationId": "post_v1_chat_completions",
"requestBody": {
"description": "Chat completion request with model, messages, and optional parameters",
"content": {
"application/json": {
"schema": {
"allOf": [
{
"$ref": "#/components/schemas/CreateChatCompletionRequest"
},
{
"$ref": "#/components/schemas/CommonExt"
},
{
"type": "object",
"properties": {
"chat_template_args": {
"type": [
"object",
"null"
],
"description": "Extra args to pass to the chat template rendering context",
"additionalProperties": {},
"propertyNames": {
"type": "string"
}
},
"nvext": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/NvExt"
}
]
}
},
"additionalProperties": {
"description": "Catch-all for unsupported fields - checked during validation"
}
}
],
"description": "A request structure for creating a chat completion, extending OpenAI's\n`CreateChatCompletionRequest` with [`NvExt`] extensions and common fields.\n\n# Fields\n- `inner`: The base OpenAI chat completion request, embedded using `serde(flatten)`.\n- `common`: Common extension fields (ignore_eos, min_tokens) at root level, embedded using `serde(flatten)`.\n- `nvext`: The optional NVIDIA extension field. See [`NvExt`] for more details.\n Note: If ignore_eos is specified in both common and nvext, the common (root-level) value takes precedence."
},
"example": {
"model": "Qwen/Qwen3-0.6B",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello! Can you help me understand what this API does?"
}
],
"temperature": 0.7,
"max_tokens": 50,
"stream": false
}
}
},
"required": true
},
"responses": {
"200": {
"description": "Successful response"
},
"400": {
"description": "Bad request - invalid input"
},
"404": {
"description": "Model not found"
},
"503": {
"description": "Service unavailable"
}
}
}
},
"/v1/completions": {
"post": {
"summary": "Create text completion",
"description": "Creates a completion for a given prompt. Supports both streaming and non-streaming modes. Compatible with OpenAI's completions API.",
"operationId": "post_v1_completions",
"requestBody": {
"description": "Text completion request with model, prompt, and optional parameters",
"content": {
"application/json": {
"schema": {
"allOf": [
{
"$ref": "#/components/schemas/CreateCompletionRequest"
},
{
"$ref": "#/components/schemas/CommonExt"
},
{
"type": "object",
"properties": {
"metadata": {},
"nvext": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/NvExt"
}
]
}
},
"additionalProperties": {
"description": "Catch-all for unsupported fields - checked during validation"
}
}
]
},
"example": {
"model": "Qwen/Qwen3-0.6B",
"prompt": "Once upon a time",
"temperature": 0.7,
"max_tokens": 50,
"stream": false
}
}
},
"required": true
},
"responses": {
"200": {
"description": "Successful response"
},
"400": {
"description": "Bad request - invalid input"
},
"404": {
"description": "Model not found"
},
"503": {
"description": "Service unavailable"
}
}
}
},
"/v1/embeddings": {
"post": {
"summary": "Create embeddings",
"description": "Creates an embedding vector representing the input text. Compatible with OpenAI's embeddings API.",
"operationId": "post_v1_embeddings",
"requestBody": {
"description": "Embedding request with model and input text",
"content": {
"application/json": {
"schema": {
"allOf": [
{
"$ref": "#/components/schemas/CreateEmbeddingRequest"
},
{
"type": "object",
"properties": {
"nvext": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/NvExt"
}
]
}
}
}
]
},
"example": {
"model": "Qwen/Qwen3-Embedding-4B",
"input": "The quick brown fox jumps over the lazy dog"
}
}
},
"required": true
},
"responses": {
"200": {
"description": "Successful response"
},
"400": {
"description": "Bad request - invalid input"
},
"404": {
"description": "Model not found"
},
"503": {
"description": "Service unavailable"
}
}
}
},
"/v1/models": {
"get": {
"summary": "List available models",
"description": "Lists the currently available models and provides basic information about each.",
"operationId": "get_v1_models",
"responses": {
"200": {
"description": "Successful response"
},
"400": {
"description": "Bad request - invalid input"
},
"404": {
"description": "Model not found"
},
"503": {
"description": "Service unavailable"
}
}
}
},
"/v1/responses": {
"post": {
"summary": "Create response",
"description": "Creates a response for a given input. Compatible with OpenAI's responses API.",
"operationId": "post_v1_responses",
"requestBody": {
"description": "Response request with model and input",
"content": {
"application/json": {
"schema": {
"allOf": [
{
"$ref": "#/components/schemas/CreateResponse",
"description": "Flattened CreateResponse fields (model, input, temperature, etc.)"
},
{
"type": "object",
"properties": {
"nvext": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/NvExt"
}
]
}
}
}
]
},
"example": {
"model": "Qwen/Qwen3-0.6B",
"input": "What is the capital of France?"
}
}
},
"required": true
},
"responses": {
"200": {
"description": "Successful response"
},
"400": {
"description": "Bad request - invalid input"
},
"404": {
"description": "Model not found"
},
"503": {
"description": "Service unavailable"
}
}
}
}
},
"components": {
"schemas": {
"AudioUrl": {
"type": "object",
"required": [
"url"
],
"properties": {
"url": {
"type": "string",
"format": "uri",
"description": "URL of the audio file"
},
"uuid": {
"type": [
"string",
"null"
],
"format": "uuid",
"description": "Optional unique identifier for the audio."
}
}
},
"ChatCompletionAudio": {
"type": "object",
"required": [
"voice",
"format"
],
"properties": {
"format": {
"$ref": "#/components/schemas/ChatCompletionAudioFormat",
"description": "Specifies the output audio format. Must be one of `wav`, `mp3`, `flac`, `opus`, or `pcm16`."
},
"voice": {
"$ref": "#/components/schemas/ChatCompletionAudioVoice",
"description": "The voice the model uses to respond. Supported voices are `ash`, `ballad`, `coral`, `sage`, and `verse` (also supported but not recommended are `alloy`, `echo`, and `shimmer`; these voices are less expressive)."
}
}
},
"ChatCompletionAudioFormat": {
"type": "string",
"enum": [
"wav",
"mp3",
"flac",
"opus",
"pcm16"
]
},
"ChatCompletionAudioVoice": {
"type": "string",
"enum": [
"alloy",
"ash",
"ballad",
"coral",
"echo",
"sage",
"shimmer",
"verse"
]
},
"ChatCompletionFunctionCall": {
"oneOf": [
{
"type": "string",
"description": "The model does not call a function, and responds to the end-user.",
"enum": [
"none"
]
},
{
"type": "string",
"description": "The model can pick between an end-user or calling a function.",
"enum": [
"auto"
]
},
{
"type": "object",
"description": "Forces the model to call the specified function.",
"required": [
"Function"
],
"properties": {
"Function": {
"type": "object",
"description": "Forces the model to call the specified function.",
"required": [
"name"
],
"properties": {
"name": {
"type": "string"
}
}
}
}
}
]
},
"ChatCompletionFunctions": {
"type": "object",
"required": [
"name",
"parameters"
],
"properties": {
"description": {
"type": [
"string",
"null"
],
"description": "A description of what the function does, used by the model to choose when and how to call the function."
},
"name": {
"type": "string",
"description": "The name of the function to be called. Must be a-z, A-Z, 0-9, or contain underscores and dashes, with a maximum length of 64."
},
"parameters": {
"description": "The parameters the functions accepts, described as a JSON Schema object. See the [guide](https://platform.openai.com/docs/guides/text-generation/function-calling) for examples, and the [JSON Schema reference](https://json-schema.org/understanding-json-schema/) for documentation about the format.\n\nOmitting `parameters` defines a function with an empty parameter list."
}
},
"deprecated": true
},
"ChatCompletionMessageToolCall": {
"type": "object",
"required": [
"id",
"type",
"function"
],
"properties": {
"function": {
"$ref": "#/components/schemas/FunctionCall",
"description": "The function that the model called."
},
"id": {
"type": "string",
"description": "The ID of the tool call."
},
"type": {
"$ref": "#/components/schemas/ChatCompletionToolType",
"description": "The type of the tool. Currently, only `function` is supported."
}
}
},
"ChatCompletionModalities": {
"type": "string",
"description": "Output types that you would like the model to generate for this request.\n\nMost models are capable of generating text, which is the default: `[\"text\"]`\n\nThe `gpt-4o-audio-preview` model can also be used to [generate\naudio](https://platform.openai.com/docs/guides/audio). To request that this model generate both text and audio responses, you can use: `[\"text\", \"audio\"]`",
"enum": [
"text",
"audio"
]
},
"ChatCompletionNamedToolChoice": {
"type": "object",
"description": "Specifies a tool the model should use. Use to force the model to call a specific function.",
"required": [
"type",
"function"
],
"properties": {
"function": {
"$ref": "#/components/schemas/FunctionName"
},
"type": {
"$ref": "#/components/schemas/ChatCompletionToolType",
"description": "The type of the tool. Currently, only `function` is supported."
}
}
},
"ChatCompletionRequestAssistantMessage": {
"type": "object",
"properties": {
"audio": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/ChatCompletionRequestAssistantMessageAudio",
"description": "Data about a previous audio response from the model.\n[Learn more](https://platform.openai.com/docs/guides/audio)."
}
]
},
"content": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/ChatCompletionRequestAssistantMessageContent",
"description": "The contents of the assistant message. Required unless `tool_calls` or `function_call` is specified."
}
]
},
"function_call": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/FunctionCall",
"description": "Deprecated and replaced by `tool_calls`. The name and arguments of a function that should be called, as generated by the model."
}
]
},
"name": {
"type": [
"string",
"null"
],
"description": "An optional name for the participant. Provides the model information to differentiate between participants of the same role."
},
"refusal": {
"type": [
"string",
"null"
],
"description": "The refusal message by the assistant."
},
"tool_calls": {
"type": [
"array",
"null"
],
"items": {
"$ref": "#/components/schemas/ChatCompletionMessageToolCall"
}
}
}
},
"ChatCompletionRequestAssistantMessageAudio": {
"type": "object",
"required": [
"id"
],
"properties": {
"id": {
"type": "string",
"description": "Unique identifier for a previous audio response from the model."
}
}
},
"ChatCompletionRequestAssistantMessageContent": {
"oneOf": [
{
"type": "string",
"description": "The text contents of the message."
},
{
"type": "array",
"items": {
"$ref": "#/components/schemas/ChatCompletionRequestAssistantMessageContentPart"
},
"description": "An array of content parts with a defined type. Can be one or more of type `text`, or exactly one of type `refusal`."
}
]
},
"ChatCompletionRequestAssistantMessageContentPart": {
"oneOf": [
{
"allOf": [
{
"$ref": "#/components/schemas/ChatCompletionRequestMessageContentPartText"
},
{
"type": "object",
"required": [
"type"
],
"properties": {
"type": {
"type": "string",
"enum": [
"text"
]
}
}
}
]
},
{
"allOf": [
{
"$ref": "#/components/schemas/ChatCompletionRequestMessageContentPartRefusal"
},
{
"type": "object",
"required": [
"type"
],
"properties": {
"type": {
"type": "string",
"enum": [
"refusal"
]
}
}
}
]
}
]
},
"ChatCompletionRequestDeveloperMessage": {
"type": "object",
"required": [
"content"
],
"properties": {
"content": {
"$ref": "#/components/schemas/ChatCompletionRequestDeveloperMessageContent",
"description": "The contents of the developer message."
},
"name": {
"type": [
"string",
"null"
],
"description": "An optional name for the participant. Provides the model information to differentiate between participants of the same role."
}
}
},
"ChatCompletionRequestDeveloperMessageContent": {
"oneOf": [
{
"type": "string"
},
{
"type": "array",
"items": {
"$ref": "#/components/schemas/ChatCompletionRequestMessageContentPartText"
}
}
]
},
"ChatCompletionRequestFunctionMessage": {
"type": "object",
"required": [
"name"
],
"properties": {
"content": {
"type": [
"string",
"null"
],
"description": "The return value from the function call, to return to the model."
},
"name": {
"type": "string",
"description": "The name of the function to call."
}
}
},
"ChatCompletionRequestMessage": {
"oneOf": [
{
"allOf": [
{
"$ref": "#/components/schemas/ChatCompletionRequestDeveloperMessage"
},
{
"type": "object",
"required": [
"role"
],
"properties": {
"role": {
"type": "string",
"enum": [
"developer"
]
}
}
}
]
},
{
"allOf": [
{
"$ref": "#/components/schemas/ChatCompletionRequestSystemMessage"
},
{
"type": "object",
"required": [
"role"
],
"properties": {
"role": {
"type": "string",
"enum": [
"system"
]
}
}
}
]
},
{
"allOf": [
{
"$ref": "#/components/schemas/ChatCompletionRequestUserMessage"
},
{
"type": "object",
"required": [
"role"
],
"properties": {
"role": {
"type": "string",
"enum": [
"user"
]
}
}
}
]
},
{
"allOf": [
{
"$ref": "#/components/schemas/ChatCompletionRequestAssistantMessage"
},
{
"type": "object",
"required": [
"role"
],
"properties": {
"role": {
"type": "string",
"enum": [
"assistant"
]
}
}
}
]
},
{
"allOf": [
{
"$ref": "#/components/schemas/ChatCompletionRequestToolMessage"
},
{
"type": "object",
"required": [
"role"
],
"properties": {
"role": {
"type": "string",
"enum": [
"tool"
]
}
}
}
]
},
{
"allOf": [
{
"$ref": "#/components/schemas/ChatCompletionRequestFunctionMessage"
},
{
"type": "object",
"required": [
"role"
],
"properties": {
"role": {
"type": "string",
"enum": [
"function"
]
}
}
}
]
}
]
},
"ChatCompletionRequestMessageContentPartAudio": {
"type": "object",
"description": "Learn about [audio inputs](https://platform.openai.com/docs/guides/audio).",
"required": [
"input_audio"
],
"properties": {
"input_audio": {
"$ref": "#/components/schemas/InputAudio"
}
}
},
"ChatCompletionRequestMessageContentPartAudioUrl": {
"type": "object",
"required": [
"audio_url"
],
"properties": {
"audio_url": {
"$ref": "#/components/schemas/AudioUrl"
}
}
},
"ChatCompletionRequestMessageContentPartImage": {
"type": "object",
"required": [
"image_url"
],
"properties": {
"image_url": {
"$ref": "#/components/schemas/ImageUrl"
}
}
},
"ChatCompletionRequestMessageContentPartRefusal": {
"type": "object",
"required": [
"refusal"
],
"properties": {
"refusal": {
"type": "string",
"description": "The refusal message generated by the model."
}
}
},
"ChatCompletionRequestMessageContentPartText": {
"type": "object",
"required": [
"text"
],
"properties": {
"text": {
"type": "string"
}
}
},
"ChatCompletionRequestMessageContentPartVideo": {
"type": "object",
"required": [
"video_url"
],
"properties": {
"video_url": {
"$ref": "#/components/schemas/VideoUrl"
}
}
},
"ChatCompletionRequestSystemMessage": {
"type": "object",
"required": [
"content"
],
"properties": {
"content": {
"$ref": "#/components/schemas/ChatCompletionRequestSystemMessageContent",
"description": "The contents of the system message."
},
"name": {
"type": [
"string",
"null"
],
"description": "An optional name for the participant. Provides the model information to differentiate between participants of the same role."
}
}
},
"ChatCompletionRequestSystemMessageContent": {
"oneOf": [
{
"type": "string",
"description": "The text contents of the system message."
},
{
"type": "array",
"items": {
"$ref": "#/components/schemas/ChatCompletionRequestSystemMessageContentPart"
},
"description": "An array of content parts with a defined type. For system messages, only type `text` is supported."
}
]
},
"ChatCompletionRequestSystemMessageContentPart": {
"oneOf": [
{
"allOf": [
{
"$ref": "#/components/schemas/ChatCompletionRequestMessageContentPartText"
},
{
"type": "object",
"required": [
"type"
],
"properties": {
"type": {
"type": "string",
"enum": [
"text"
]
}
}
}
]
}
]
},
"ChatCompletionRequestToolMessage": {
"type": "object",
"description": "Tool message",
"required": [
"content",
"tool_call_id"
],
"properties": {
"content": {
"$ref": "#/components/schemas/ChatCompletionRequestToolMessageContent",
"description": "The contents of the tool message."
},
"tool_call_id": {
"type": "string"
}
}
},
"ChatCompletionRequestToolMessageContent": {
"oneOf": [
{
"type": "string",
"description": "The text contents of the tool message."
},
{
"type": "array",
"items": {
"$ref": "#/components/schemas/ChatCompletionRequestToolMessageContentPart"
},
"description": "An array of content parts with a defined type. For tool messages, only type `text` is supported."
}
]
},
"ChatCompletionRequestToolMessageContentPart": {
"oneOf": [
{
"allOf": [
{
"$ref": "#/components/schemas/ChatCompletionRequestMessageContentPartText"
},
{
"type": "object",
"required": [
"type"
],
"properties": {
"type": {
"type": "string",
"enum": [
"text"
]
}
}
}
]
}
]
},
"ChatCompletionRequestUserMessage": {
"type": "object",
"required": [
"content"
],
"properties": {
"content": {
"$ref": "#/components/schemas/ChatCompletionRequestUserMessageContent",
"description": "The contents of the user message."
},
"name": {
"type": [
"string",
"null"
],
"description": "An optional name for the participant. Provides the model information to differentiate between participants of the same role."
}
}
},
"ChatCompletionRequestUserMessageContent": {
"oneOf": [
{
"type": "string",
"description": "The text contents of the message."
},
{
"type": "array",
"items": {
"$ref": "#/components/schemas/ChatCompletionRequestUserMessageContentPart"
},
"description": "An array of content parts with a defined type. Supported options differ based on the [model](https://platform.openai.com/docs/models) being used to generate the response. Can contain text, image, or audio inputs."
}
]
},
"ChatCompletionRequestUserMessageContentPart": {
"oneOf": [
{
"allOf": [
{
"$ref": "#/components/schemas/ChatCompletionRequestMessageContentPartText"
},
{
"type": "object",
"required": [
"type"
],
"properties": {
"type": {
"type": "string",
"enum": [
"text"
]
}
}
}
]
},
{
"allOf": [
{
"$ref": "#/components/schemas/ChatCompletionRequestMessageContentPartImage"
},
{
"type": "object",
"required": [
"type"
],
"properties": {
"type": {
"type": "string",
"enum": [
"image_url"
]
}
}
}
]
},
{
"allOf": [
{
"$ref": "#/components/schemas/ChatCompletionRequestMessageContentPartVideo"
},
{
"type": "object",
"required": [
"type"
],
"properties": {
"type": {
"type": "string",
"enum": [
"video_url"
]
}
}
}
]
},
{
"allOf": [
{
"$ref": "#/components/schemas/ChatCompletionRequestMessageContentPartAudioUrl"
},
{
"type": "object",
"required": [
"type"
],
"properties": {
"type": {
"type": "string",
"enum": [
"audio_url"
]
}
}
}
]
},
{
"allOf": [
{
"$ref": "#/components/schemas/ChatCompletionRequestMessageContentPartAudio"
},
{
"type": "object",
"required": [
"type"
],
"properties": {
"type": {
"type": "string",
"enum": [
"input_audio"
]
}
}
}
]
}
]
},
"ChatCompletionStreamOptions": {
"type": "object",
"description": "Options for streaming response. Only set this when you set `stream: true`.",
"required": [
"include_usage"
],
"properties": {
"include_usage": {
"type": "boolean",
"description": "If set, an additional chunk will be streamed before the `data: [DONE]` message. The `usage` field on this chunk shows the token usage statistics for the entire request, and the `choices` field will always be an empty array. All other chunks will also include a `usage` field, but with a null value."
}
}
},
"ChatCompletionTool": {
"type": "object",
"required": [
"type",
"function"
],
"properties": {
"function": {
"$ref": "#/components/schemas/FunctionObject"
},
"type": {
"$ref": "#/components/schemas/ChatCompletionToolType"
}
}
},
"ChatCompletionToolChoiceOption": {
"oneOf": [
{
"type": "string",
"enum": [
"none"
]
},
{
"type": "string",
"enum": [
"auto"
]
},
{
"type": "string",
"enum": [
"required"
]
},
{
"type": "object",
"required": [
"named"
],
"properties": {
"named": {
"$ref": "#/components/schemas/ChatCompletionNamedToolChoice"
}
}
}
],
"description": "Controls which (if any) tool is called by the model.\n`none` means the model will not call any tool and instead generates a message.\n`auto` means the model can pick between generating a message or calling one or more tools.\n`required` means the model must call one or more tools.\nSpecifying a particular tool via `{\"type\": \"function\", \"function\": {\"name\": \"my_function\"}}` forces the model to call that tool.\n\n`none` is the default when no tools are present. `auto` is the default if tools are present."
},
"ChatCompletionToolType": {
"type": "string",
"enum": [
"function"
]
},
"CommonExt": {
"type": "object",
"description": "Common extensions for OpenAI API requests that are not part of the standard OpenAI spec\nbut are commonly needed across different request types.",
"properties": {
"guided_choice": {
"type": [
"array",
"null"
],
"items": {
"type": "string"
},
"description": "If specified, the output will be exactly one of the choices."
},
"guided_decoding_backend": {
"type": [
"string",
"null"
],
"description": "If specified, the backend to use for guided decoding, can be backends like xgrammar or custom guided decoding backend"
},
"guided_grammar": {
"type": [
"string",
"null"
],
"description": "If specified, the output will follow the context-free grammar. Can be a string or null."
},
"guided_json": {
"description": "Guided Decoding Options\nIf specified, the output will be a JSON object. Can be a string, an object, or null."
},
"guided_regex": {
"type": [
"string",
"null"
],
"description": "If specified, the output will follow the regex pattern. Can be a string or null."
},
"guided_whitespace_pattern": {
"type": [
"string",
"null"
],
"description": "If specified, the output will follow the whitespace pattern. Can be a string or null."
},
"ignore_eos": {
"type": [
"boolean",
"null"
],
"description": "If true, the model will ignore the end of string token and generate to max_tokens.\nThis field can also be specified in nvext, but the root-level value takes precedence."
},
"include_stop_str_in_output": {
"type": [
"boolean",
"null"
],
"description": "include_stop_str_in_output"
},
"min_p": {
"type": [
"number",
"null"
],
"format": "float",
"description": "Relative probability floor"
},
"min_tokens": {
"type": [
"integer",
"null"
],
"format": "int32",
"description": "The minimum number of tokens to generate.\nThis is a common parameter needed across different request types.",
"minimum": 0
},
"repetition_penalty": {
"type": [
"number",
"null"
],
"format": "float",
"description": "How much to penalize tokens based on how frequently they occur in the text.\nA value of 1 means no penalty, while values larger than 1 discourage and values smaller encourage."
},
"skip_special_tokens": {
"type": [
"boolean",
"null"
],
"description": "Whether to skip special tokens in the decoded output.\nWhen true, special tokens (like EOS, BOS, PAD) are removed from the output text.\nWhen false, special tokens are included in the output text.\nDefaults to false if not specified."
},
"top_k": {
"type": [
"integer",
"null"
],
"format": "int32",
"description": "Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens."
}
}
},
"CreateChatCompletionRequest": {
"type": "object",
"required": [
"messages",
"model"
],
"properties": {
"audio": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/ChatCompletionAudio",
"description": "Parameters for audio output. Required when audio output is requested with `modalities: [\"audio\"]`. [Learn more](https://platform.openai.com/docs/guides/audio)."
}
]
},
"frequency_penalty": {
"type": [
"number",
"null"
],
"format": "float",
"description": "Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim."
},
"function_call": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/ChatCompletionFunctionCall",
"description": "Deprecated in favor of `tool_choice`.\n\nControls which (if any) function is called by the model.\n`none` means the model will not call a function and instead generates a message.\n`auto` means the model can pick between generating a message or calling a function.\nSpecifying a particular function via `{\"name\": \"my_function\"}` forces the model to call that function.\n\n`none` is the default when no functions are present. `auto` is the default if functions are present."
}
]
},
"functions": {
"type": [
"array",
"null"
],
"items": {
"$ref": "#/components/schemas/ChatCompletionFunctions"
},
"description": "Deprecated in favor of `tools`.\n\nA list of functions the model may generate JSON inputs for.",
"deprecated": true
},
"logit_bias": {
"type": [
"object",
"null"
],
"description": "Modify the likelihood of specified tokens appearing in the completion.\n\nAccepts a json object that maps tokens (specified by their token ID in the tokenizer) to an associated bias value from -100 to 100.\nMathematically, the bias is added to the logits generated by the model prior to sampling.\nThe exact effect will vary per model, but values between -1 and 1 should decrease or increase likelihood of selection;\nvalues like -100 or 100 should result in a ban or exclusive selection of the relevant token.",
"additionalProperties": {},
"propertyNames": {
"type": "string"
}
},
"logprobs": {
"type": [
"boolean",
"null"
],
"description": "Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned in the `content` of `message`."
},
"max_completion_tokens": {
"type": [
"integer",
"null"
],
"format": "int32",
"description": "An upper bound for the number of tokens that can be generated for a completion, including visible output tokens and [reasoning tokens](https://platform.openai.com/docs/guides/reasoning).",
"minimum": 0
},
"max_tokens": {
"type": [
"integer",
"null"
],
"format": "int32",
"description": "The maximum number of [tokens](https://platform.openai.com/tokenizer) that can be generated in the chat completion.\n\nThis value can be used to control [costs](https://openai.com/api/pricing/) for text generated via API.\nThis value is now deprecated in favor of `max_completion_tokens`, and is\nnot compatible with [o1 series models](https://platform.openai.com/docs/guides/reasoning).",
"deprecated": true,
"minimum": 0
},
"messages": {
"type": "array",
"items": {
"$ref": "#/components/schemas/ChatCompletionRequestMessage"
},
"description": "A list of messages comprising the conversation so far. Depending on the [model](https://platform.openai.com/docs/models) you use, different message types (modalities) are supported, like [text](https://platform.openai.com/docs/guides/text-generation), [images](https://platform.openai.com/docs/guides/vision), and [audio](https://platform.openai.com/docs/guides/audio)."
},
"metadata": {
"description": "Developer-defined tags and values used for filtering completions in the [dashboard](https://platform.openai.com/chat-completions)."
},
"mm_processor_kwargs": {
"description": "Multimodal processor configuration parameters"
},
"modalities": {
"type": [
"array",
"null"
],
"items": {
"$ref": "#/components/schemas/ChatCompletionModalities"
}
},
"model": {
"type": "string",
"description": "ID of the model to use.\nSee the [model endpoint compatibility](https://platform.openai.com/docs/models#model-endpoint-compatibility) table for details on which models work with the Chat API."
},
"n": {
"type": [
"integer",
"null"
],
"format": "int32",
"description": "How many chat completion choices to generate for each input message. Note that you will be charged based on the number of generated tokens across all of the choices. Keep `n` as `1` to minimize costs.",
"minimum": 0
},
"parallel_tool_calls": {
"type": [
"boolean",
"null"
],
"description": "Whether to enable [parallel function calling](https://platform.openai.com/docs/guides/function-calling/parallel-function-calling) during tool use."
},
"prediction": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/PredictionContent",
"description": "Configuration for a [Predicted Output](https://platform.openai.com/docs/guides/predicted-outputs),which can greatly improve response times when large parts of the model response are known ahead of time. This is most common when you are regenerating a file with only minor changes to most of the content."
}
]
},
"presence_penalty": {
"type": [
"number",
"null"
],
"format": "float",
"description": "Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics."
},
"reasoning_effort": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/ReasoningEffort",
"description": "**o1 models only**\n\nConstrains effort on reasoning for\n[reasoning models](https://platform.openai.com/docs/guides/reasoning).\n\nCurrently supported values are `low`, `medium`, and `high`. Reducing\n\nreasoning effort can result in faster responses and fewer tokens\nused on reasoning in a response."
}
]
},
"response_format": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/ResponseFormat",
"description": "An object specifying the format that the model must output. Compatible with [GPT-4o](https://platform.openai.com/docs/models/gpt-4o), [GPT-4o mini](https://platform.openai.com/docs/models/gpt-4o-mini), [GPT-4 Turbo](https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo) and all GPT-3.5 Turbo models newer than `gpt-3.5-turbo-1106`.\n\nSetting to `{ \"type\": \"json_schema\", \"json_schema\": {...} }` enables Structured Outputs which guarantees the model will match your supplied JSON schema. Learn more in the [Structured Outputs guide](https://platform.openai.com/docs/guides/structured-outputs).\n\nSetting to `{ \"type\": \"json_object\" }` enables JSON mode, which guarantees the message the model generates is valid JSON.\n\n**Important:** when using JSON mode, you **must** also instruct the model to produce JSON yourself via a system or user message. Without this, the model may generate an unending stream of whitespace until the generation reaches the token limit, resulting in a long-running and seemingly \"stuck\" request. Also note that the message content may be partially cut off if `finish_reason=\"length\"`, which indicates the generation exceeded `max_tokens` or the conversation exceeded the max context length."
}
]
},
"seed": {
"type": [
"integer",
"null"
],
"format": "int64",
"description": " This feature is in Beta.\nIf specified, our system will make a best effort to sample deterministically, such that repeated requests\nwith the same `seed` and parameters should return the same result.\nDeterminism is not guaranteed, and you should refer to the `system_fingerprint` response parameter to monitor changes in the backend."
},
"service_tier": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/ServiceTier",
"description": "Specifies the latency tier to use for processing the request. This parameter is relevant for customers subscribed to the scale tier service:\n- If set to 'auto', the system will utilize scale tier credits until they are exhausted.\n- If set to 'default', the request will be processed using the default service tier with a lower uptime SLA and no latency guarentee.\n- When not set, the default behavior is 'auto'.\n\nWhen this parameter is set, the response body will include the `service_tier` utilized."
}
]
},
"stop": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/Stop",
"description": "Up to 4 sequences where the API will stop generating further tokens."
}
]
},
"store": {
"type": [
"boolean",
"null"
],
"description": "Whether or not to store the output of this chat completion request\n\nfor use in our [model distillation](https://platform.openai.com/docs/guides/distillation) or [evals](https://platform.openai.com/docs/guides/evals) products."
},
"stream": {
"type": [
"boolean",
"null"
],
"description": "If set, partial message deltas will be sent, like in ChatGPT.\nTokens will be sent as data-only [server-sent events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events#Event_stream_format)\nas they become available, with the stream terminated by a `data: [DONE]` message. [Example Python code](https://cookbook.openai.com/examples/how_to_stream_completions)."
},
"stream_options": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/ChatCompletionStreamOptions"
}
]
},
"temperature": {
"type": [
"number",
"null"
],
"format": "float",
"description": "What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random,\nwhile lower values like 0.2 will make it more focused and deterministic.\n\nWe generally recommend altering this or `top_p` but not both."
},
"tool_choice": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/ChatCompletionToolChoiceOption"
}
]
},
"tools": {
"type": [
"array",
"null"
],
"items": {
"$ref": "#/components/schemas/ChatCompletionTool"
},
"description": "A list of tools the model may call. Currently, only functions are supported as a tool.\nUse this to provide a list of functions the model may generate JSON inputs for. A max of 128 functions are supported."
},
"top_logprobs": {
"type": [
"integer",
"null"
],
"format": "int32",
"description": "An integer between 0 and 20 specifying the number of most likely tokens to return at each token position, each with an associated log probability. `logprobs` must be set to `true` if this parameter is used.",
"minimum": 0
},
"top_p": {
"type": [
"number",
"null"
],
"format": "float",
"description": "An alternative to sampling with temperature, called nucleus sampling,\nwhere the model considers the results of the tokens with top_p probability mass.\nSo 0.1 means only the tokens comprising the top 10% probability mass are considered.\n\n We generally recommend altering this or `temperature` but not both."
},
"user": {
"type": [
"string",
"null"
],
"description": "A unique identifier representing your end-user, which can help OpenAI to monitor and detect abuse. [Learn more](https://platform.openai.com/docs/guides/safety-best-practices#end-user-ids)."
},
"web_search_options": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/WebSearchOptions",
"description": "This tool searches the web for relevant results to use in a response.\nLearn more about the [web search tool](https://platform.openai.com/docs/guides/tools-web-search?api-mode=chat)."
}
]
}
}
},
"CreateCompletionRequest": {
"type": "object",
"required": [
"model",
"prompt"
],
"properties": {
"best_of": {
"type": [
"integer",
"null"
],
"format": "int32",
"description": "Generates `best_of` completions server-side and returns the \"best\" (the one with the highest log probability per token). Results cannot be streamed.\n\nWhen used with `n`, `best_of` controls the number of candidate completions and `n` specifies how many to return – `best_of` must be greater than `n`.\n\n**Note:** Because this parameter generates many completions, it can quickly consume your token quota. Use carefully and ensure that you have reasonable settings for `max_tokens` and `stop`.",
"minimum": 0
},
"echo": {
"type": [
"boolean",
"null"
],
"description": "Echo back the prompt in addition to the completion"
},
"frequency_penalty": {
"type": [
"number",
"null"
],
"format": "float",
"description": "Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.\n\n[See more information about frequency and presence penalties.](https://platform.openai.com/docs/guides/text-generation/parameter-details)"
},
"logit_bias": {
"type": [
"object",
"null"
],
"description": "Modify the likelihood of specified tokens appearing in the completion.\n\nAccepts a json object that maps tokens (specified by their token ID in the GPT tokenizer) to an associated bias value from -100 to 100. You can use this [tokenizer tool](/tokenizer?view=bpe) (which works for both GPT-2 and GPT-3) to convert text to token IDs. Mathematically, the bias is added to the logits generated by the model prior to sampling. The exact effect will vary per model, but values between -1 and 1 should decrease or increase likelihood of selection; values like -100 or 100 should result in a ban or exclusive selection of the relevant token.\n\nAs an example, you can pass `{\"50256\": -100}` to prevent the <|endoftext|> token from being generated.",
"additionalProperties": {},
"propertyNames": {
"type": "string"
}
},
"logprobs": {
"type": [
"integer",
"null"
],
"format": "int32",
"description": "Include the log probabilities on the `logprobs` most likely output tokens, as well the chosen tokens. For example, if `logprobs` is 5, the API will return a list of the 5 most likely tokens. The API will always return the `logprob` of the sampled token, so there may be up to `logprobs+1` elements in the response.\n\nThe maximum value for `logprobs` is 5.",
"minimum": 0
},
"max_tokens": {
"type": [
"integer",
"null"
],
"format": "int32",
"description": "The maximum number of [tokens](https://platform.openai.com/tokenizer) that can be generated in the completion.\n\nThe token count of your prompt plus `max_tokens` cannot exceed the model's context length. [Example Python code](https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken) for counting tokens.",
"minimum": 0
},
"model": {
"type": "string",
"description": "ID of the model to use. You can use the [List models](https://platform.openai.com/docs/api-reference/models/list) API to see all of your available models, or see our [Model overview](https://platform.openai.com/docs/models/overview) for descriptions of them."
},
"n": {
"type": [
"integer",
"null"
],
"format": "int32",
"description": "How many completions to generate for each prompt.\n**Note:** Because this parameter generates many completions, it can quickly consume your token quota. Use carefully and ensure that you have reasonable settings for `max_tokens` and `stop`.\n",
"minimum": 0
},
"presence_penalty": {
"type": [
"number",
"null"
],
"format": "float",
"description": "Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.\n\n[See more information about frequency and presence penalties.](https://platform.openai.com/docs/guides/text-generation/parameter-details)"
},
"prompt": {
"$ref": "#/components/schemas/Prompt",
"description": "The prompt(s) to generate completions for, encoded as a string, array of strings, array of tokens, or array of token arrays.\n\nNote that <|endoftext|> is the document separator that the model sees during training, so if a prompt is not specified the model will generate as if from the beginning of a new document."
},
"seed": {
"type": [
"integer",
"null"
],
"format": "int64",
"description": "If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same `seed` and parameters should return the same result.\n\nDeterminism is not guaranteed, and you should refer to the `system_fingerprint` response parameter to monitor changes in the backend."
},
"stop": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/Stop",
"description": "Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence."
}
]
},
"stream": {
"type": [
"boolean",
"null"
],
"description": "Whether to stream back partial progress. If set, tokens will be sent as data-only [server-sent events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events#Event_stream_format)\nas they become available, with the stream terminated by a `data: [DONE]` message."
},
"stream_options": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/ChatCompletionStreamOptions"
}
]
},
"suffix": {
"type": [
"string",
"null"
],
"description": "The suffix that comes after a completion of inserted text.\n\nThis parameter is only supported for `gpt-3.5-turbo-instruct`."
},
"temperature": {
"type": [
"number",
"null"
],
"format": "float",
"description": "What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.\n\nWe generally recommend altering this or `top_p` but not both."
},
"top_p": {
"type": [
"number",
"null"
],
"format": "float",
"description": "An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.\n\n We generally recommend altering this or `temperature` but not both."
},
"user": {
"type": [
"string",
"null"
],
"description": "A unique identifier representing your end-user, which will help OpenAI to monitor and detect abuse. [Learn more](https://platform.openai.com/docs/usage-policies/end-user-ids)."
}
}
},
"CreateEmbeddingRequest": {
"type": "object",
"required": [
"model",
"input"
],
"properties": {
"dimensions": {
"type": [
"integer",
"null"
],
"format": "int32",
"description": "The number of dimensions the resulting output embeddings should have. Only supported in `text-embedding-3` and later models.",
"minimum": 0
},
"encoding_format": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/EncodingFormat",
"description": "The format to return the embeddings in. Can be either `float` or [`base64`](https://pypi.org/project/pybase64/). Defaults to float"
}
]
},
"input": {
"$ref": "#/components/schemas/EmbeddingInput",
"description": "Input text to embed, encoded as a string or array of tokens. To embed multiple inputs in a single request, pass an array of strings or array of token arrays. The input must not exceed the max input tokens for the model (8192 tokens for `text-embedding-ada-002`), cannot be an empty string, and any array must be 2048 dimensions or less. [Example Python code](https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken) for counting tokens."
},
"model": {
"type": "string",
"description": "ID of the model to use. You can use the\n[List models](https://platform.openai.com/docs/api-reference/models/list)\nAPI to see all of your available models, or see our\n[Model overview](https://platform.openai.com/docs/models/overview)\nfor descriptions of them."
},
"user": {
"type": [
"string",
"null"
],
"description": "A unique identifier representing your end-user, which will help OpenAI\n to monitor and detect abuse. [Learn more](https://platform.openai.com/docs/usage-policies/end-user-ids)."
}
}
},
"CreateResponse": {
"type": "object",
"description": "Builder for a Responses API request.",
"required": [
"input",
"model"
],
"properties": {
"background": {
"type": [
"boolean",
"null"
],
"description": "Whether to run the model response in the background.\nboolean or null."
},
"include": {
"type": [
"array",
"null"
],
"items": {
"type": "string"
},
"description": "Specify additional output data to include in the model response.\n\nSupported values:\n- `file_search_call.results`\n Include the search results of the file search tool call.\n- `message.input_image.image_url`\n Include image URLs from the input message.\n- `computer_call_output.output.image_url`\n Include image URLs from the computer call output.\n- `reasoning.encrypted_content`\n Include an encrypted version of reasoning tokens in reasoning item outputs.\n This enables reasoning items to be used in multi-turn conversations when\n using the Responses API statelessly (for example, when the `store` parameter\n is set to `false`, or when an organization is enrolled in the zero-data-\n retention program).\n\nIf `None`, no additional data is returned."
},
"input": {
"type": "object",
"description": "Text, image, or file inputs to the model, used to generate a response.\nUsing value_type to prevent deep schema recursion from Input's nested content types."
},
"instructions": {
"type": [
"string",
"null"
],
"description": "Inserts a system (or developer) message as the first item in the model's context.\n\nWhen using along with previous_response_id, the instructions from a previous response will\nnot be carried over to the next response. This makes it simple to swap out system\n(or developer) messages in new responses."
},
"max_output_tokens": {
"type": [
"integer",
"null"
],
"format": "int32",
"description": "An upper bound for the number of tokens that can be generated for a\nresponse, including visible output tokens and reasoning tokens.",
"minimum": 0
},
"max_tool_calls": {
"type": [
"integer",
"null"
],
"format": "int32",
"description": "The maximum number of total calls to built-in tools that can be processed in a response.\nThis maximum number applies across all built-in tool calls, not per individual tool.\nAny further attempts to call a tool by the model will be ignored.",
"minimum": 0
},
"metadata": {
"description": "Arbitrary JSON metadata used as a passthrough parameter"
},
"model": {
"type": "string",
"description": "Model ID used to generate the response, like `gpt-4o`.\nOpenAI offers a wide range of models with different capabilities,\nperformance characteristics, and price points."
},
"parallel_tool_calls": {
"type": [
"boolean",
"null"
],
"description": "Whether to allow the model to run tool calls in parallel."
},
"previous_response_id": {
"type": [
"string",
"null"
],
"description": "The unique ID of the previous response to the model. Use this to create\nmulti-turn conversations."
},
"prompt": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/PromptConfig",
"description": "Reference to a prompt template and its variables."
}
]
},
"reasoning": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/ReasoningConfig",
"description": "**o-series models only**: Configuration options for reasoning models."
}
]
},
"service_tier": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/ServiceTier",
"description": "Specifies the latency tier to use for processing the request.\n\nThis parameter is relevant for customers subscribed to the Scale tier service.\n\nSupported values:\n- `auto`\n - If the Project is Scale tier enabled, the system will utilize Scale tier credits until\n they are exhausted.\n - If the Project is not Scale tier enabled, the request will be processed using the\n default service tier with a lower uptime SLA and no latency guarantee.\n- `default`\n The request will be processed using the default service tier with a lower uptime SLA and\n no latency guarantee.\n- `flex`\n The request will be processed with the Flex Processing service tier. Learn more.\n\nWhen not set, the default behavior is `auto`.\n\nWhen this parameter is set, the response body will include the `service_tier` utilized."
}
]
},
"store": {
"type": [
"boolean",
"null"
],
"description": "Whether to store the generated model response for later retrieval via API."
},
"stream": {
"type": [
"boolean",
"null"
],
"description": "If set to true, the model response data will be streamed to the client as it is\ngenerated using server-sent events."
},
"temperature": {
"type": [
"number",
"null"
],
"format": "float",
"description": "What sampling temperature to use, between 0 and 2. Higher values like 0.8\nwill make the output more random, while lower values like 0.2 will make it\nmore focused and deterministic. We generally recommend altering this or\n`top_p` but not both."
},
"text": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/TextConfig",
"description": "Configuration options for a text response from the model. Can be plain text\nor structured JSON data."
}
]
},
"tool_choice": {
"type": "object",
"description": "How the model should select which tool (or tools) to use when generating\na response."
},
"tools": {
"type": "array",
"items": {
"type": "object"
},
"description": "An array of tools the model may call while generating a response.\nCan include built-in tools (file_search, web_search_preview,\ncomputer_use_preview) or custom function definitions."
},
"top_logprobs": {
"type": [
"integer",
"null"
],
"format": "int32",
"description": "An integer between 0 and 20 specifying the number of most likely tokens to return\nat each token position, each with an associated log probability.",
"minimum": 0
},
"top_p": {
"type": [
"number",
"null"
],
"format": "float",
"description": "An alternative to sampling with temperature, called nucleus sampling,\nwhere the model considers the results of the tokens with top_p probability\nmass. So 0.1 means only the tokens comprising the top 10% probability mass\nare considered. We generally recommend altering this or `temperature` but\nnot both."
},
"truncation": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/Truncation",
"description": "The truncation strategy to use for the model response:\n- `auto`: drop items in the middle to fit context window.\n- `disabled`: error if exceeding context window."
}
]
},
"user": {
"type": [
"string",
"null"
],
"description": "A unique identifier representing your end-user, which can help OpenAI to\nmonitor and detect abuse."
}
}
},
"EmbeddingInput": {
"oneOf": [
{
"type": "string"
},
{
"type": "array",
"items": {
"type": "string"
}
},
{
"type": "array",
"items": {
"type": "integer",
"format": "int32",
"minimum": 0
}
},
{
"type": "array",
"items": {
"type": "array",
"items": {
"type": "integer",
"format": "int32",
"minimum": 0
}
}
}
]
},
"EncodingFormat": {
"type": "string",
"enum": [
"float",
"base64"
]
},
"FunctionCall": {
"type": "object",
"description": "The name and arguments of a function that should be called, as generated by the model.",
"required": [
"name",
"arguments"
],
"properties": {
"arguments": {
"type": "string",
"description": "The arguments to call the function with, as generated by the model in JSON format. Note that the model does not always generate valid JSON, and may hallucinate parameters not defined by your function schema. Validate the arguments in your code before calling your function."
},
"name": {
"type": "string",
"description": "The name of the function to call."
}
}
},
"FunctionName": {
"type": "object",
"required": [
"name"
],
"properties": {
"name": {
"type": "string",
"description": "The name of the function to call."
}
}
},
"FunctionObject": {
"type": "object",
"required": [
"name"
],
"properties": {
"description": {
"type": [
"string",
"null"
],
"description": "A description of what the function does, used by the model to choose when and how to call the function."
},
"name": {
"type": "string",
"description": "The name of the function to be called. Must be a-z, A-Z, 0-9, or contain underscores and dashes, with a maximum length of 64."
},
"parameters": {
"description": "The parameters the functions accepts, described as a JSON Schema object. See the [guide](https://platform.openai.com/docs/guides/text-generation/function-calling) for examples, and the [JSON Schema reference](https://json-schema.org/understanding-json-schema/) for documentation about the format.\n\nOmitting `parameters` defines a function with an empty parameter list."
},
"strict": {
"type": [
"boolean",
"null"
],
"description": "Whether to enable strict schema adherence when generating the function call. If set to true, the model will follow the exact schema defined in the `parameters` field. Only a subset of JSON Schema is supported when `strict` is `true`. Learn more about Structured Outputs in the [function calling guide](https://platform.openai.com/docs/guides/function-calling)."
}
}
},
"ImageDetail": {
"type": "string",
"enum": [
"auto",
"low",
"high"
]
},
"ImageUrl": {
"type": "object",
"required": [
"url"
],
"properties": {
"detail": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/ImageDetail",
"description": "Specifies the detail level of the image. Learn more in the [Vision guide](https://platform.openai.com/docs/guides/vision/low-or-high-fidelity-image-understanding)."
}
]
},
"url": {
"type": "string",
"format": "uri",
"description": "Either a URL of the image or the base64 encoded image data."
},
"uuid": {
"type": [
"string",
"null"
],
"format": "uuid",
"description": "Optional unique identifier for the image."
}
}
},
"InputAudio": {
"type": "object",
"required": [
"data",
"format"
],
"properties": {
"data": {
"type": "string",
"description": "Base64 encoded audio data."
},
"format": {
"$ref": "#/components/schemas/InputAudioFormat",
"description": "The format of the encoded audio data. Currently supports \"wav\" and \"mp3\"."
}
}
},
"InputAudioFormat": {
"type": "string",
"enum": [
"wav",
"mp3"
]
},
"NvCreateChatCompletionRequest": {
"allOf": [
{
"$ref": "#/components/schemas/CreateChatCompletionRequest"
},
{
"$ref": "#/components/schemas/CommonExt"
},
{
"type": "object",
"properties": {
"chat_template_args": {
"type": [
"object",
"null"
],
"description": "Extra args to pass to the chat template rendering context",
"additionalProperties": {},
"propertyNames": {
"type": "string"
}
},
"nvext": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/NvExt"
}
]
}
},
"additionalProperties": {
"description": "Catch-all for unsupported fields - checked during validation"
}
}
],
"description": "A request structure for creating a chat completion, extending OpenAI's\n`CreateChatCompletionRequest` with [`NvExt`] extensions and common fields.\n\n# Fields\n- `inner`: The base OpenAI chat completion request, embedded using `serde(flatten)`.\n- `common`: Common extension fields (ignore_eos, min_tokens) at root level, embedded using `serde(flatten)`.\n- `nvext`: The optional NVIDIA extension field. See [`NvExt`] for more details.\n Note: If ignore_eos is specified in both common and nvext, the common (root-level) value takes precedence."
},
"NvCreateCompletionRequest": {
"allOf": [
{
"$ref": "#/components/schemas/CreateCompletionRequest"
},
{
"$ref": "#/components/schemas/CommonExt"
},
{
"type": "object",
"properties": {
"metadata": {},
"nvext": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/NvExt"
}
]
}
},
"additionalProperties": {
"description": "Catch-all for unsupported fields - checked during validation"
}
}
]
},
"NvCreateEmbeddingRequest": {
"allOf": [
{
"$ref": "#/components/schemas/CreateEmbeddingRequest"
},
{
"type": "object",
"properties": {
"nvext": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/NvExt"
}
]
}
}
}
]
},
"NvCreateResponse": {
"allOf": [
{
"$ref": "#/components/schemas/CreateResponse",
"description": "Flattened CreateResponse fields (model, input, temperature, etc.)"
},
{
"type": "object",
"properties": {
"nvext": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/NvExt"
}
]
}
}
}
]
},
"NvExt": {
"type": "object",
"description": "NVIDIA LLM extensions to the OpenAI API",
"properties": {
"annotations": {
"type": [
"array",
"null"
],
"items": {
"type": "string"
},
"description": "Annotations\nAnnotations requested by the user. Each one triggers the server to send out-of-band information back in the SSE\nstream using the `event:` field."
},
"backend_instance_id": {
"type": [
"integer",
"null"
],
"format": "int64",
"description": "Targeted backend instance ID for the request\nIf set, the request will be routed to backend instance with the given ID.\nIf not set, the request will be routed to the best matching instance.",
"minimum": 0
},
"extra_fields": {
"type": [
"array",
"null"
],
"items": {
"type": "string"
},
"description": "Extra fields to be included in the response's nvext\nThis is a list of field names that should be populated in the response\nSupported fields: \"worker_id\""
},
"greed_sampling": {
"type": [
"boolean",
"null"
],
"description": "If true, sampling will be forced to be greedy.\nThe backend is responsible for selecting the correct backend-specific options to\nimplement this."
},
"max_thinking_tokens": {
"type": [
"integer",
"null"
],
"format": "int32",
"description": "Maximum number of thinking tokens allowed\nNOTE: Currently passed through to backends as a no-op for future implementation",
"minimum": 0
},
"token_data": {
"type": [
"array",
"null"
],
"items": {
"type": "integer",
"format": "int32",
"minimum": 0
},
"description": "Pre-tokenized data to use instead of tokenizing the prompt\nIf provided along with backend_instance_id, these tokens will be used directly\nand tokenization will be skipped."
},
"use_raw_prompt": {
"type": [
"boolean",
"null"
],
"description": "If true, the preprocessor will try to bypass the prompt template and pass the prompt directly\nto the tokenizer."
}
}
},
"PredictionContent": {
"oneOf": [
{
"type": "object",
"description": "The type of the predicted content you want to provide. This type is\ncurrently always `content`.",
"required": [
"content",
"type"
],
"properties": {
"content": {
"$ref": "#/components/schemas/PredictionContentContent",
"description": "The type of the predicted content you want to provide. This type is\ncurrently always `content`."
},
"type": {
"type": "string",
"enum": [
"content"
]
}
}
}
],
"description": "Static predicted output content, such as the content of a text file that is being regenerated."
},
"PredictionContentContent": {
"oneOf": [
{
"type": "string",
"description": "The content used for a Predicted Output. This is often the text of a file you are regenerating with minor changes."
},
{
"type": "array",
"items": {
"$ref": "#/components/schemas/ChatCompletionRequestMessageContentPartText"
},
"description": "An array of content parts with a defined type. Supported options differ based on the [model](https://platform.openai.com/docs/models) being used to generate the response. Can contain text inputs."
}
],
"description": "The content that should be matched when generating a model response. If generated tokens would match this content, the entire model response can be returned much more quickly."
},
"Prompt": {
"oneOf": [
{
"type": "string"
},
{
"type": "array",
"items": {
"type": "string"
}
},
{
"type": "array",
"items": {
"type": "integer",
"format": "int32",
"minimum": 0
}
},
{
"type": "array",
"items": {
"type": "array",
"items": {
"type": "integer",
"format": "int32",
"minimum": 0
}
}
}
]
},
"PromptConfig": {
"type": "object",
"description": "Reference to a prompt template and the variables to substitute into it.",
"required": [
"id"
],
"properties": {
"id": {
"type": "string",
"description": "The unique identifier of the prompt template to use."
},
"variables": {
"type": [
"object",
"null"
],
"description": "Optional map of values to substitute in for variables in your prompt. The substitution\nvalues can either be strings, or other Response input types like images or files.\nFor now only supporting Strings.",
"additionalProperties": {
"type": "string"
},
"propertyNames": {
"type": "string"
}
},
"version": {
"type": [
"string",
"null"
],
"description": "Optional version of the prompt template."
}
}
},
"ReasoningConfig": {
"type": "object",
"description": "o-series reasoning settings.",
"properties": {
"effort": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/ReasoningEffort",
"description": "Constrain effort on reasoning."
}
]
},
"summary": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/ReasoningSummary",
"description": "Summary mode for reasoning."
}
]
}
}
},
"ReasoningEffort": {
"type": "string",
"enum": [
"minimal",
"low",
"medium",
"high"
]
},
"ReasoningSummary": {
"type": "string",
"enum": [
"auto",
"concise",
"detailed"
]
},
"ResponseFormat": {
"oneOf": [
{
"type": "object",
"description": "The type of response format being defined: `text`",
"required": [
"type"
],
"properties": {
"type": {
"type": "string",
"enum": [
"text"
]
}
}
},
{
"type": "object",
"description": "The type of response format being defined: `json_object`",
"required": [
"type"
],
"properties": {
"type": {
"type": "string",
"enum": [
"json_object"
]
}
}
},
{
"type": "object",
"description": "The type of response format being defined: `json_schema`",
"required": [
"json_schema",
"type"
],
"properties": {
"json_schema": {
"$ref": "#/components/schemas/ResponseFormatJsonSchema"
},
"type": {
"type": "string",
"enum": [
"json_schema"
]
}
}
}
]
},
"ResponseFormatJsonSchema": {
"type": "object",
"required": [
"name"
],
"properties": {
"description": {
"type": [
"string",
"null"
],
"description": "A description of what the response format is for, used by the model to determine how to respond in the format."
},
"name": {
"type": "string",
"description": "The name of the response format. Must be a-z, A-Z, 0-9, or contain underscores and dashes, with a maximum length of 64."
},
"schema": {
"description": "The schema for the response format, described as a JSON Schema object."
},
"strict": {
"type": [
"boolean",
"null"
],
"description": "Whether to enable strict schema adherence when generating the output. If set to true, the model will always follow the exact schema defined in the `schema` field. Only a subset of JSON Schema is supported when `strict` is `true`. To learn more, read the [Structured Outputs guide](https://platform.openai.com/docs/guides/structured-outputs)."
}
}
},
"ServiceTier": {
"type": "string",
"description": "Service tier request options.",
"enum": [
"auto",
"default",
"flex"
]
},
"Stop": {
"oneOf": [
{
"type": "string"
},
{
"type": "array",
"items": {
"type": "string"
}
}
]
},
"TextConfig": {
"type": "object",
"description": "Configuration for text response format.",
"required": [
"format"
],
"properties": {
"format": {
"$ref": "#/components/schemas/TextResponseFormat",
"description": "Defines the format: plain text, JSON object, or JSON schema."
}
}
},
"TextResponseFormat": {
"oneOf": [
{
"type": "object",
"description": "The type of response format being defined: `text`",
"required": [
"type"
],
"properties": {
"type": {
"type": "string",
"enum": [
"text"
]
}
}
},
{
"type": "object",
"description": "The type of response format being defined: `json_object`",
"required": [
"type"
],
"properties": {
"type": {
"type": "string",
"enum": [
"json_object"
]
}
}
},
{
"allOf": [
{
"$ref": "#/components/schemas/ResponseFormatJsonSchema",
"description": "The type of response format being defined: `json_schema`"
},
{
"type": "object",
"required": [
"type"
],
"properties": {
"type": {
"type": "string",
"enum": [
"json_schema"
]
}
}
}
],
"description": "The type of response format being defined: `json_schema`"
}
]
},
"Truncation": {
"type": "string",
"description": "Truncation strategies.",
"enum": [
"auto",
"disabled"
]
},
"VideoUrl": {
"type": "object",
"required": [
"url"
],
"properties": {
"detail": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/ImageDetail",
"description": "Specifies the detail level of the video processing."
}
]
},
"url": {
"type": "string",
"format": "uri",
"description": "Either a URL of the video or the base64 encoded video data."
},
"uuid": {
"type": [
"string",
"null"
],
"format": "uuid",
"description": "Optional unique identifier for the video."
}
}
},
"WebSearchContextSize": {
"type": "string",
"description": "The amount of context window space to use for the search.",
"enum": [
"low",
"medium",
"high"
]
},
"WebSearchLocation": {
"type": "object",
"description": "Approximate location parameters for the search.",
"properties": {
"city": {
"type": [
"string",
"null"
],
"description": "Free text input for the city of the user, e.g. `San Francisco`."
},
"country": {
"type": [
"string",
"null"
],
"description": "The two-letter [ISO country code](https://en.wikipedia.org/wiki/ISO_3166-1) of the user, e.g. `US`."
},
"region": {
"type": [
"string",
"null"
],
"description": "Free text input for the region of the user, e.g. `California`."
},
"timezone": {
"type": [
"string",
"null"
],
"description": "The [IANA timezone](https://timeapi.io/documentation/iana-timezones) of the user, e.g. `America/Los_Angeles`."
}
}
},
"WebSearchOptions": {
"type": "object",
"description": "Options for the web search tool.",
"properties": {
"search_context_size": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/WebSearchContextSize",
"description": "High level guidance for the amount of context window space to use for the search. One of `low`, `medium`, or `high`. `medium` is the default."
}
]
},
"user_location": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/components/schemas/WebSearchUserLocation",
"description": "Approximate location parameters for the search."
}
]
}
}
},
"WebSearchUserLocation": {
"type": "object",
"required": [
"type",
"approximate"
],
"properties": {
"approximate": {
"$ref": "#/components/schemas/WebSearchLocation"
},
"type": {
"$ref": "#/components/schemas/WebSearchUserLocationType"
}
}
},
"WebSearchUserLocationType": {
"type": "string",
"enum": [
"approximate"
]
}
}
}
}
# Dynamo Run
`dynamo-run` is a Rust binary that lets you easily run a model and explore the Dynamo components, and that demonstrates the Rust API. It supports the `mistral.rs` engine, as well as the testing engines `echo` and `mocker`.
It is primarily for development and rapid prototyping. For production use, we recommend the Python-wrapped components; see the main project README.
## Basics
Usage: See `dynamo-run --help`
Example: `dynamo-run Qwen/Qwen3-0.6B`
Set the environment variable `DYN_LOG` to adjust the logging level; for example, `export DYN_LOG=debug`. It has the same syntax as `RUST_LOG`.
To adjust verbosity, use `-v` to enable debug logging or `-vv` to enable full trace logging. For example:
```bash
dynamo-run in=http out=mistralrs <model> -v # enables debug logging
```
### Use model from Hugging Face
To automatically download Qwen3 4B from Hugging Face (16 GiB download) and start it in interactive text mode:
```
dynamo-run Qwen/Qwen3-4B
```
The general format for HF download follows this pattern:
```
dynamo-run out=<engine> <HUGGING_FACE_ORGANIZATION/MODEL_NAME>
```
For gated models (such as meta-llama/Llama-3.2-3B-Instruct), you must set an `HF_TOKEN` environment variable.
The parameter can be the ID of a HuggingFace repository (which will be downloaded) or a folder containing safetensors, config.json, or similar (perhaps a locally checked out HuggingFace repository).
### Run a model from local file
To run a model from local file:
- Download the model from Hugging Face
- Run the model from local file
See the following sections for details.
#### Download model from Hugging Face
This model, available from Hugging Face, should be high quality and fast on almost any machine: https://huggingface.co/Qwen/Qwen3-0.6B
To run the model:
*Text interface*
```
dynamo-run Qwen/Qwen3-0.6B
```
You can also pipe a prompt into `dynamo-run`:
```
echo 'What is the capital of Tuvalu?' | dynamo-run Qwen/Qwen3-0.6B --context-length 4096
```
*HTTP interface*
```
dynamo-run in=http out=mistralrs Qwen/Qwen3-0.6B
```
You can also list models or send a request:
*List the models*
```
curl localhost:8080/v1/models
```
*Send a request*
```
curl -d '{"model": "Qwen/Qwen3-0.6B", "max_completion_tokens": 2049, "messages":[{"role":"user", "content": "What is the capital of South Africa?" }]}' -H 'Content-Type: application/json' http://localhost:8080/v1/chat/completions
```
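The same request can be sent from Python. The following is a minimal sketch using only the standard library, assuming the HTTP interface from the example above is listening on `localhost:8080`. The helper names `build_chat_request` and `send` are illustrative and not part of Dynamo.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: address used in the curl examples


def build_chat_request(model: str, prompt: str,
                       max_completion_tokens: int = 2049) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "max_completion_tokens": max_completion_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `send(build_chat_request("Qwen/Qwen3-0.6B", "What is the capital of South Africa?"))` should return the decoded chat completion.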
## Distributed System
You can run the ingress side (HTTP server and pre-processing) on one machine, for example a CPU node, and the worker on a different machine (a GPU node).
You will need [etcd](https://etcd.io/) and [nats](https://nats.io) with jetstream installed and accessible from both nodes. For development I run NATS like this: `nats-server -js --trace --store_dir $(mktemp -d)`.
**Node 1:** OpenAI compliant HTTP server, optional pre-processing, worker discovery:
```
dynamo-run in=http out=auto
```
**Node 2:** Engine. Receives and returns requests over the network:
```
dynamo-run in=dyn://llama3B.backend.generate out=mistralrs ~/llms/Llama-3.2-3B-Instruct
```
This uses etcd to auto-discover the model and NATS to talk to it. You can
run multiple instances on the same endpoint; it picks one based on the
`--router-mode` (round-robin by default if left unspecified).
Run `dynamo-run --help` for more options.
### Network names
The `in=dyn://` URLs have the format `dyn://namespace.component.endpoint`. For a quick start, use any string, such as `dyn://test`; `dynamo-run` fills in defaults for any missing parts. The individual pieces matter in a larger system.
* *Namespace*: A pipeline. Usually a model, e.g. "llama_8b". Just a name.
* *Component*: A load balanced service needed to run that pipeline. "backend", "prefill", "decode", "preprocessor", "draft", etc. This typically has some configuration (which model to use, for example).
* *Endpoint*: Like a URL. "generate", "load_metrics".
* *Instance*: A process. Unique. Dynamo assigns each one a unique instance_id. The thing that is running is always an instance. Namespace/component/endpoint can refer to multiple instances.
If you run two models, that is two pipelines. Speculative decoding is an exception: the draft model is part of the pipeline of the bigger model.
If you run two instances of the same model ("data parallel") they are the same namespace+component+endpoint but different instances. The router will spread traffic over all the instances of a namespace+component+endpoint. If you have four prefill workers in a pipeline, they all have the same namespace+component+endpoint and are automatically assigned unique instance_ids.
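To make the naming scheme concrete, here is a small Python sketch that splits a `dyn://` URL into its parts. `DynAddress` and `parse_dyn_url` are hypothetical names for illustration. Missing parts are returned as `None` because the defaults that `dynamo-run` applies are not specified here.

```python
from typing import NamedTuple, Optional


class DynAddress(NamedTuple):
    namespace: str
    component: Optional[str]
    endpoint: Optional[str]


def parse_dyn_url(url: str) -> DynAddress:
    """Split dyn://namespace.component.endpoint into its parts.

    Missing parts come back as None; dynamo-run fills in its own defaults,
    which this sketch does not try to reproduce.
    """
    prefix = "dyn://"
    if not url.startswith(prefix):
        raise ValueError(f"expected a dyn:// URL, got {url!r}")
    parts = url[len(prefix):].split(".")
    if not 1 <= len(parts) <= 3 or not all(parts):
        raise ValueError(f"expected 1 to 3 non-empty dot-separated parts in {url!r}")
    parts += [None] * (3 - len(parts))
    return DynAddress(*parts)
```

Two workers started with the same URL parse to the same `DynAddress`; they differ only in the instance ID that Dynamo assigns at runtime.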
Example 1: Data parallel load balanced, one model one pipeline two instances.
```
Node 1: dynamo-run in=dyn://qwen3-32b.backend.generate /data/Qwen3-32B
Node 2: dynamo-run in=dyn://qwen3-32b.backend.generate /data/Qwen3-32B
```
Example 2: Two models, two pipelines.
```
Node 1: dynamo-run in=dyn://qwen3-32b.backend.generate /data/Qwen3-32B
Node 2: dynamo-run in=dyn://llama3-1-8b.backend.generate /data/Llama-3.1-8B-Instruct/
```
Example 3: Different endpoints.
The KV metrics publisher in vLLM adds a `load_metrics` endpoint to the current component. If the `llama3-1-8b.backend` component above is using patched vLLM, it will also expose `llama3-1-8b.backend.load_metrics`.
Example 4: Multiple components in a pipeline.
In the P/D disaggregated setup you would have `deepseek-distill-llama8b.prefill.generate` (possibly multiple instances of this) and `deepseek-distill-llama8b.decode.generate`.
On the ingress node, the output is always `out=auto`. This tells Dynamo to auto-discover the instances, group them by model, and load balance appropriately (depending on the `--router-mode` flag).
### KV-aware routing
```
dynamo-run in=http out=auto --router-mode kv
```
The only difference from the distributed system above is `--router-mode kv`. vLLM announces when a KV block is created or removed. The Dynamo router finds the worker with the best match for those KV blocks and directs the traffic to that node.
For performance testing, compare a typical workload with `--router-mode random|round-robin` to see if it can benefit from KV-aware routing.
The KV-aware routing arguments:
- `--kv-overlap-score-weight`: Sets the amount of weighting on overlaps with prefix caches, which directly contributes to the prefill cost. A large weight is expected to yield a better TTFT (at the expense of worse ITL). When set to 0, prefix caches are not considered at all (falling back to pure load balancing behavior on the active blocks).
- `--router-temperature`: Sets the temperature when randomly selecting workers to route to via softmax sampling on the router cost logits. Setting it to 0 recovers the deterministic behavior where the min logit is picked.
- `--use-kv-events`: Sets whether to listen to KV events for maintaining the global view of cached blocks. If true, the router uses KV events to track block creation and deletion from workers. If false, the router predicts cache state based on routing decisions with TTL-based expiration (default 120s) and pruning. Set false if your backend engine does not emit KV events.
### Request Migration
In a [Distributed System](#distributed-system), you can enable [request migration](../fault_tolerance/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
```bash
dynamo-run in=dyn://... out=<engine> ... --migration-limit=3
```
This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../fault_tolerance/request_migration.md) documentation for details on how this works.
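The retry budget implied by `--migration-limit` can be illustrated with a short Python sketch. It models only the counting: how Dynamo picks the next worker and carries request state across workers is described in the linked documentation. `WorkerFailed` and `run_with_migration` are hypothetical names.

```python
from typing import Callable, Sequence


class WorkerFailed(Exception):
    """Raised by a (hypothetical) worker call when the worker fails mid-request."""


def run_with_migration(workers: Sequence[Callable[[str], str]], prompt: str,
                       migration_limit: int) -> str:
    """Try workers in order, migrating to the next one on failure.

    A limit of N allows N migrations, that is, at most N + 1 attempts in total.
    """
    migrations = 0
    for worker in workers:
        try:
            return worker(prompt)
        except WorkerFailed:
            if migrations >= migration_limit:
                raise  # budget exhausted: the request fails
            migrations += 1
    raise WorkerFailed("no workers left to migrate to")
```

With a limit of 3, a request that fails on three workers in a row can still succeed on a fourth; with a limit of 2, the third failure is final.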
### Request Cancellation
When using the HTTP interface (`in=http`), if the HTTP request connection is dropped by the client, Dynamo automatically cancels the downstream request to the worker. This ensures that computational resources are not wasted on generating responses that are no longer needed.
For detailed information about how request cancellation works across the system, see the [Request Cancellation Architecture](../fault_tolerance/request_cancellation.md) documentation.
## Development
`dynamo-run` is also an example of what can be built in Rust with the `dynamo-llm` and `dynamo-runtime` crates. The following guide shows how to build from source with all the features.
### Step 1: Install libraries
**Ubuntu:**
```
sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libssl-dev libclang-dev protobuf-compiler python3-dev cmake
```
**macOS:**
- [Homebrew](https://brew.sh/)
```
# if brew is not installed on your system, install it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```
- [Xcode](https://developer.apple.com/xcode/)
```
brew install cmake protobuf
## Check that Metal is accessible
xcrun -sdk macosx metal
```
If Metal is accessible, you should see an error like `metal: error: no input files`, which confirms it is installed correctly.
### Step 2: Install Rust
```
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
```
### Step 3: Build
- Linux with GPU and CUDA (tested on Ubuntu):
```
cargo build --features cuda
```
- macOS with Metal:
```
cargo build --features metal
```
- CPU only:
```
cargo build
```
Optionally you can run `cargo build` from any location with arguments:
```
--target-dir /path/to/target_directory # specify target_directory with write privileges
--manifest-path /path/to/project/Cargo.toml # if cargo build is run outside of `launch/` directory
```
The binary is called `dynamo-run` in `target/debug`
```
cd target/debug
```
Build with `--release` for a smaller binary and better performance, but longer build times. The binary will be in `target/release`.
## Engines
The input defaults to `in=text`. The output defaults to the `out=mistralrs` engine, unless it is disabled with `--no-default-features`, in which case an engine that echoes back your input is used.
### mistralrs
[mistral.rs](https://github.com/EricLBuehler/mistral.rs) is a pure Rust engine that is fast to run and fast to load, and runs well on CPU as well as GPU. For those reasons it is the default engine.
```
dynamo-run Qwen/Qwen3-4B
```
is equivalent to
```
dynamo-run in=text out=mistralrs Qwen/Qwen3-4B
```
If you have multiple GPUs, `mistral.rs` does automatic tensor parallelism. You do not need to pass any extra flags to dynamo-run to enable it.
### Mocker engine
The mocker engine is a mock vLLM implementation designed for testing and development purposes. It simulates realistic token generation timing without requiring actual model inference, making it useful for:
- Testing distributed system components without GPU resources
- Benchmarking infrastructure and networking overhead
- Developing and debugging Dynamo components
- Load testing and performance analysis
**Basic usage:**
The `--model-path` is required but can point to any valid model path; the mocker doesn't load the model weights (but the pre-processor needs the tokenizer). The arguments `block_size`, `num_gpu_blocks`, `max_num_seqs`, `max_num_batched_tokens`, `enable_prefix_caching`, and `enable_chunked_prefill` are common arguments shared with the real vLLM engine.
The following arguments are mocker-specific:
- `speedup_ratio`: Speed multiplier for token generation (default: 1.0). Higher values make the simulated engine run faster.
- `dp_size`: Number of data parallel workers to simulate (default: 1).
- `watermark`: KV cache watermark threshold as a fraction (default: 0.01). This argument also exists for the real vLLM engine but cannot be passed as an engine arg.
```bash
echo '{"speedup_ratio": 10.0}' > mocker_args.json
dynamo-run in=dyn://dynamo.mocker.generate out=mocker --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 --extra-engine-args mocker_args.json
dynamo-run in=http out=auto --router-mode kv
```
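For more than one or two arguments, it can be easier to generate the JSON file from Python. This sketch writes the same kind of file as the `echo` command above. `write_mocker_args` is a hypothetical helper, and any override values you pass are your own choices rather than recommended settings.

```python
import json
from pathlib import Path


def write_mocker_args(path: str, **overrides) -> dict:
    """Write a mocker engine-args JSON file and return the dict that was written."""
    args = {
        "speedup_ratio": 10.0,  # same value as the shell example
        "dp_size": 1,           # documented default
        "watermark": 0.01,      # documented default
    }
    args.update(overrides)  # e.g. block_size=..., max_num_seqs=...
    Path(path).write_text(json.dumps(args, indent=2))
    return args
```

The resulting file is passed to `dynamo-run` with `--extra-engine-args`, exactly as in the shell example.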
### echo
The `echo` engine echoes the prompt back as the response.
```
dynamo-run in=http out=echo --model-name my_model
```
The echo engine uses a configurable delay between tokens to simulate generation speed. You can adjust this using the `DYN_TOKEN_ECHO_DELAY_MS` environment variable:
```
# Set token echo delay to 1ms (1000 tokens per second)
DYN_TOKEN_ECHO_DELAY_MS=1 dynamo-run in=http out=echo
```
The default delay is 10ms, which produces approximately 100 tokens per second.
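The relationship between the delay and the throughput is simply 1000 divided by the delay in milliseconds. The sketch below spells this out. The function names are illustrative, and the way the variable is parsed here (as a float) is an assumption rather than a description of how `dynamo-run` reads it.

```python
import os

DEFAULT_DELAY_MS = 10.0  # default stated in the text


def echo_tokens_per_second(delay_ms: float) -> float:
    """Approximate echo-engine throughput for a given per-token delay."""
    if delay_ms <= 0:
        raise ValueError("delay must be positive")
    return 1000.0 / delay_ms


def configured_delay_ms(environ=os.environ) -> float:
    """Read DYN_TOKEN_ECHO_DELAY_MS, falling back to the documented default."""
    return float(environ.get("DYN_TOKEN_ECHO_DELAY_MS", DEFAULT_DELAY_MS))
```

For example, `echo_tokens_per_second(configured_delay_ms())` gives the approximate rate for the current environment.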
### Other engines, multi-node, production
The production-grade engines `vllm`, `sglang` and `trtllm` are available in `examples/backends`. They run as Python components, using the Rust bindings. See the main README.
`dynamo-run` is an exploration, development and prototyping tool, as well as an example of using the Rust API. Multi-node and production setups should use the main engine components.
## Batch mode
`dynamo-run` can take a jsonl file full of prompts and evaluate them all:
```
dynamo-run in=batch:prompts.jsonl out=mistralrs <model>
```
The input file should look like this:
```
{"text": "What is the capital of France?"}
{"text": "What is the capital of Spain?"}
```
Each one is passed as a prompt to the model. The output is written back to the same folder in `output.jsonl`. At the end of the run some statistics are printed.
The output looks like this:
```
{"text":"What is the capital of France?","response":"The capital of France is Paris.","tokens_in":7,"tokens_out":7,"elapsed_ms":1566}
{"text":"What is the capital of Spain?","response":".The capital of Spain is Madrid.","tokens_in":7,"tokens_out":7,"elapsed_ms":855}
```
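Both file formats are plain JSON Lines, so they are easy to produce and post-process in Python. The following sketch assumes only the fields shown in the samples above; `to_jsonl` and `summarize` are hypothetical helper names.

```python
import json
from typing import Iterable


def to_jsonl(prompts: Iterable[str]) -> str:
    """Render prompts in the batch input format: one {"text": ...} object per line."""
    return "".join(json.dumps({"text": prompt}) + "\n" for prompt in prompts)


def summarize(output_lines: Iterable[str]) -> dict:
    """Aggregate the per-prompt fields shown in the sample output."""
    records = [json.loads(line) for line in output_lines if line.strip()]
    total_ms = sum(record["elapsed_ms"] for record in records)
    return {
        "prompts": len(records),
        "tokens_in": sum(record["tokens_in"] for record in records),
        "tokens_out": sum(record["tokens_out"] for record in records),
        "mean_elapsed_ms": total_ms / len(records) if records else 0.0,
    }
```

Write the result of `to_jsonl(...)` to `prompts.jsonl` before the run, and pass the lines of `output.jsonl` to `summarize` afterwards.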
## Writing your own engine in Python
The [dynamo](https://pypi.org/project/ai-dynamo/) Python library allows you to build your own engine and attach it to Dynamo. All of the main backend components in `examples/backends/` work like this.
The Python file must do three things:
1. Decorate a function to get the runtime
2. Register on the network
3. Attach a request handler
```
import asyncio

import uvloop
from dynamo.llm import ModelInput, ModelType, register_llm
from dynamo.runtime import DistributedRuntime, dynamo_worker


# 1. Decorate a function to get the runtime
#
@dynamo_worker()
async def worker(runtime: DistributedRuntime):

    # 2. Register ourselves on the network
    #
    component = runtime.namespace("namespace").component("component")
    model_path = "Qwen/Qwen3-0.6B"  # or "/data/models/Qwen3-0.6B"
    model_input = ModelInput.Tokens  # or ModelInput.Text if engine handles pre-processing
    model_type = ModelType.Chat  # or ModelType.Chat | ModelType.Completions if model can be deployed on chat and completions endpoints
    endpoint = component.endpoint("endpoint")
    # Optional last param to register_llm is model_name. If not present derives it from model_path
    await register_llm(model_input, model_type, endpoint, model_path)

    # Initialize your engine here
    engine = None  # placeholder: replace with your engine object

    # 3. Attach request handler
    #
    await endpoint.serve_endpoint(RequestHandler(engine).generate)


class RequestHandler:

    def __init__(self, engine):
        ...

    async def generate(self, request):
        # Call the engine
        # yield result dict
        ...


if __name__ == "__main__":
    uvloop.install()
    asyncio.run(worker())
```
The `model_path` can be:
- A HuggingFace repo ID, optionally prefixed with `hf://`. It is downloaded and cached locally.
- The path to a checkout of a HuggingFace repo - any folder containing safetensor files as well as `config.json`, `tokenizer.json` and `tokenizer_config.json`.
The `model_input` can be:
- ModelInput.Tokens. Your engine expects pre-processed input (token IDs). Dynamo handles tokenization and pre-processing.
- ModelInput.Text. Your engine expects raw text input and handles its own tokenization and pre-processing.
The `model_type` can be:
- ModelType.Chat. Your `generate` method receives a `request` and must return a response dict of type [OpenAI Chat Completion](https://platform.openai.com/docs/api-reference/chat).
- ModelType.Completions. Your `generate` method receives a `request` and must return a response dict of the older [Completions](https://platform.openai.com/docs/api-reference/completions).
`register_llm` can also take the following kwargs:
- `model_name`: The name to call the model. The model name in incoming HTTP requests must match this. Defaults to the Hugging Face repo name, or the folder name.
- `context_length`: Max model length in tokens. Defaults to the model's set max. Only set this if you need to reduce KV cache allocation to fit into VRAM.
- `kv_cache_block_size`: Size of a KV block for the engine, in tokens. Defaults to 16.
- `user_data`: Optional dictionary containing custom metadata for worker behavior (e.g., LoRA configuration). Defaults to None.
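To make the request handler step more concrete, here is a self-contained sketch of a streaming `generate` method that runs without Dynamo installed. `ReverseEngine` is a made-up stand-in engine, and the request and response shapes (`token_ids`, `finish_reason`) are assumptions for illustration; check the example engines listed below for the actual contract.

```python
import asyncio


class ReverseEngine:
    """Stand-in engine (hypothetical): emits the prompt's token IDs in reverse order."""

    async def stream(self, token_ids):
        for token_id in reversed(token_ids):
            await asyncio.sleep(0)  # hand control back to the event loop between tokens
            yield token_id


class RequestHandler:
    def __init__(self, engine):
        self.engine = engine

    async def generate(self, request):
        # Assumed request shape for ModelInput.Tokens: {"token_ids": [...]}
        async for token_id in self.engine.stream(request["token_ids"]):
            yield {"token_ids": [token_id]}
        # Assumed way to signal completion.
        yield {"token_ids": [], "finish_reason": "stop"}


async def collect(handler, request):
    """Drain the async generator, much as a caller consuming the stream would."""
    return [chunk async for chunk in handler.generate(request)]
```

The important design point is that `generate` is an async generator: it yields one dict per chunk rather than returning a single response, which is what allows streaming.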
Here are some example engines:
- Backend:
* [vllm](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/hello_world/server_vllm.py)
* [sglang](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/hello_world/server_sglang.py)
- Chat:
* [sglang](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/hello_world/server_sglang_tok.py)
More fully-featured Python engines are in `examples/backends`.
## Debugging
`dynamo-run` and `dynamo-runtime` support [tokio-console](https://github.com/tokio-rs/console). Build with the feature to enable:
```
cargo build --features cuda,tokio-console -p dynamo-run
```
The listener uses the default tokio console port, and all interfaces (0.0.0.0).
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Dynamo Feature Compatibility Matrices
This document provides a comprehensive compatibility matrix for key Dynamo features across the supported backends.
*Updated for Dynamo v0.9.0*
**Legend:**
* ✅ : Supported
* 🚧 : Work in Progress / Experimental / Limited
## Quick Comparison
| Feature | vLLM | TensorRT-LLM | SGLang | Source |
| :--- | :---: | :---: | :---: | :--- |
| **Disaggregated Serving** | ✅ | ✅ | ✅ | [Design Doc][disagg] |
| **KV-Aware Routing** | ✅ | ✅ | ✅ | [Router Doc][kv-routing] |
| **SLA-Based Planner** | ✅ | ✅ | ✅ | [Planner Doc][planner] |
| **KV Block Manager** | ✅ | ✅ | 🚧 | [KVBM Doc][kvbm] |
| **Multimodal (Image)** | ✅ | ✅ | ✅ | [Multimodal Doc][mm] |
| **Multimodal (Video)** | ✅ | | | [Multimodal Doc][mm] |
| **Multimodal (Audio)** | 🚧 | | | [Multimodal Doc][mm] |
| **Request Migration** | ✅ | 🚧 | ✅ | [Migration Doc][migration] |
| **Request Cancellation** | ✅ | ✅ | 🚧 | Backend READMEs |
| **LoRA** | ✅ | | | [K8s Guide][lora] |
| **Tool Calling** | ✅ | ✅ | ✅ | [Tool Calling Doc][tools] |
| **Speculative Decoding** | ✅ | ✅ | 🚧 | Backend READMEs |
## 1. vLLM Backend
vLLM offers the broadest feature coverage in Dynamo, with full support for disaggregated serving, KV-aware routing, KV block management, LoRA adapters, and multimodal inference including video and audio.
*Source: [docs/backends/vllm/README.md][vllm-readme]*
| Feature | Disaggregated Serving | KV-Aware Routing | SLA-Based Planner | KV Block Manager | Multimodal | Request Migration | Request Cancellation | LoRA | Tool Calling | Speculative Decoding |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Disaggregated Serving** | — | | | | | | | | | |
| **KV-Aware Routing** | ✅ | — | | | | | | | | |
| **SLA-Based Planner** | ✅ | ✅ | — | | | | | | | |
| **KV Block Manager** | ✅ | ✅ | ✅ | — | | | | | | |
| **Multimodal** | ✅ | <sup>1</sup> | — | ✅ | — | | | | | |
| **Request Migration** | ✅ | ✅ | ✅ | ✅ | ✅ | — | | | | |
| **Request Cancellation** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | — | | | |
| **LoRA** | ✅ | ✅<sup>2</sup> | — | ✅ | — | ✅ | ✅ | — | | |
| **Tool Calling** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | — | |
| **Speculative Decoding** | ✅ | ✅ | — | ✅ | — | ✅ | ✅ | — | ✅ | — |
> **Notes:**
> 1. **Multimodal + KV-Aware Routing**: The KV router uses token-based hashing and does not yet support image/video hashes, so it falls back to random/round-robin routing. ([Source][kv-routing])
> 2. **KV-Aware LoRA Routing**: vLLM supports routing requests based on LoRA adapter affinity.
> 3. **Audio Support**: vLLM supports audio models like Qwen2-Audio (experimental). ([Source][mm-vllm])
> 4. **Video Support**: vLLM supports video input with frame sampling. ([Source][mm-vllm])
> 5. **Speculative Decoding**: Eagle3 support documented. ([Source][vllm-spec])
## 2. SGLang Backend
SGLang is optimized for high-throughput serving with fast primitives, providing robust support for disaggregated serving, KV-aware routing, and request migration.
*Source: [docs/backends/sglang/README.md][sglang-readme]*
| Feature | Disaggregated Serving | KV-Aware Routing | SLA-Based Planner | KV Block Manager | Multimodal | Request Migration | Request Cancellation | LoRA | Tool Calling | Speculative Decoding |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Disaggregated Serving** | — | | | | | | | | | |
| **KV-Aware Routing** | ✅ | — | | | | | | | | |
| **SLA-Based Planner** | ✅ | ✅ | — | | | | | | | |
| **KV Block Manager** | 🚧 | 🚧 | 🚧 | — | | | | | | |
| **Multimodal** | ✅<sup>2</sup> | <sup>1</sup> | — | 🚧 | — | | | | | |
| **Request Migration** | ✅ | ✅ | ✅ | 🚧 | ✅ | — | | | | |
| **Request Cancellation** | 🚧<sup>3</sup> | ✅ | ✅ | 🚧 | 🚧 | ✅ | — | | | |
| **LoRA** | | | | 🚧 | | | | — | | |
| **Tool Calling** | ✅ | ✅ | ✅ | 🚧 | ✅ | ✅ | ✅ | | — | |
| **Speculative Decoding** | 🚧 | 🚧 | — | 🚧 | — | 🚧 | — | | 🚧 | — |
> **Notes:**
> 1. **Multimodal + KV-Aware Routing**: Not supported. ([Source][kv-routing])
> 2. **Multimodal Patterns**: Supports **E/PD** and **E/P/D** only (requires separate vision encoder). Does **not** support simple Aggregated (EPD) or Traditional Disagg (EP/D). ([Source][mm-sglang])
> 3. **Request Cancellation**: Cancellation during the remote prefill phase is not supported in disaggregated mode. ([Source][sglang-readme])
> 4. **Speculative Decoding**: Code hooks exist (`spec_decode_stats` in publisher), but no examples or documentation yet.
## 3. TensorRT-LLM Backend
TensorRT-LLM delivers maximum inference performance and optimization, with full KVBM integration and robust disaggregated serving support.
*Source: [docs/backends/trtllm/README.md][trtllm-readme]*
| Feature | Disaggregated Serving | KV-Aware Routing | SLA-Based Planner | KV Block Manager | Multimodal | Request Migration | Request Cancellation | LoRA | Tool Calling | Speculative Decoding |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Disaggregated Serving** | — | | | | | | | | | |
| **KV-Aware Routing** | ✅ | — | | | | | | | | |
| **SLA-Based Planner** | ✅ | ✅ | — | | | | | | | |
| **KV Block Manager** | ✅ | ✅ | ✅ | — | | | | | | |
| **Multimodal** | ✅<sup>1</sup> | <sup>2</sup> | — | ✅ | — | | | | | |
| **Request Migration** | ✅ | ✅ | ✅ | ✅ | 🚧 | — | | | | |
| **Request Cancellation** | ✅<sup>3</sup> | ✅<sup>3</sup> | ✅<sup>3</sup> | ✅<sup>3</sup> | ✅<sup>3</sup> | ✅<sup>3</sup> | — | | | |
| **LoRA** | | | | | | | | — | | |
| **Tool Calling** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | — | |
| **Speculative Decoding** | ✅ | ✅ | — | ✅ | — | ✅ | ✅ | | ✅ | — |
> **Notes:**
> 1. **Multimodal Disaggregation**: Fully supports **EP/D** (Traditional) pattern. **E/P/D** (Full Disaggregation) is WIP and currently supports pre-computed embeddings only. ([Source][mm-trtllm])
> 2. **Multimodal + KV-Aware Routing**: Not supported. The KV router currently tracks token-based blocks only. ([Source][kv-routing])
> 3. **Request Cancellation**: Due to known issues, the TensorRT-LLM engine is temporarily not notified of request cancellations, meaning allocated resources for cancelled requests are not freed.
---
## Source References
<!-- Backend READMEs -->
[vllm-readme]: docs/backends/vllm/README.md
[sglang-readme]: docs/backends/sglang/README.md
[trtllm-readme]: docs/backends/trtllm/README.md
<!-- Design Docs -->
[disagg]: docs/design_docs/disagg_serving.md
[kv-routing]: docs/components/router/router_guide.md
[planner]: docs/components/planner/README.md
[kvbm]: docs/components/kvbm/README.md
[migration]: docs/fault_tolerance/request_migration.md
[tools]: docs/agents/tool-calling.md
<!-- Multimodal -->
[mm]: docs/features/multimodal/README.md
[mm-vllm]: docs/features/multimodal/multimodal_vllm.md
[mm-trtllm]: docs/features/multimodal/multimodal_trtllm.md
[mm-sglang]: docs/features/multimodal/multimodal_sglang.md
<!-- Feature-specific -->
[lora]: docs/kubernetes/deployment/dynamomodel-guide.md
[vllm-spec]: docs/features/speculative_decoding/speculative_decoding_vllm.md
[trtllm-eagle]: docs/backends/trtllm/llama4_plus_eagle.md
# NVIDIA Dynamo Glossary
## B
**Block** - A fixed-size chunk of tokens (typically 16 or 64 tokens) used for efficient KV cache management and memory allocation, serving as the fundamental unit for techniques like PagedAttention.
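As an illustration of the block concept, the sketch below splits a token sequence into fixed-size blocks. It is not taken from any engine: `split_into_blocks` is a hypothetical helper, and how a trailing partial block is handled is engine-specific, so the sketch only separates it from the full blocks.

```python
def split_into_blocks(token_ids, block_size=16):
    """Split a token sequence into full blocks plus a trailing partial block."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # Index where the last full block ends.
    full_end = len(token_ids) - len(token_ids) % block_size
    full_blocks = [
        token_ids[start:start + block_size]
        for start in range(0, full_end, block_size)
    ]
    partial_block = token_ids[full_end:]
    return full_blocks, partial_block


full, partial = split_into_blocks(list(range(40)), block_size=16)
print(len(full), len(partial))
```

A 40-token prompt with a block size of 16 yields two full blocks and a partial block of 8 tokens.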
## C
**Component** - The fundamental deployable unit in Dynamo. A discoverable service entity that can host multiple endpoints and typically maps to a Docker container (such as VllmWorker, Router, Processor).
**Conditional Disaggregation** - Dynamo's intelligent decision-making process within disaggregated serving that determines whether a request is processed locally or sent to a remote prefill engine based on prefill length and queue status.
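A rough sketch of such a decision is shown below. The definition only says that the decision depends on prefill length and queue status; the direction of each comparison, the threshold values, and the names `DisaggPolicy` and `should_prefill_remotely` are assumptions made for illustration, not Dynamo's implementation or defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisaggPolicy:
    # Both thresholds are illustrative assumptions, not Dynamo defaults.
    max_local_prefill_length: int = 512
    max_prefill_queue_size: int = 4


def should_prefill_remotely(prefill_length, prefill_queue_size, policy=None):
    """Return True when the request should be sent to a remote prefill engine.

    Assumed rule: only long prefills are worth sending away, and only
    while the remote prefill queue still has room.
    """
    policy = policy or DisaggPolicy()
    long_prefill = prefill_length > policy.max_local_prefill_length
    queue_has_room = prefill_queue_size < policy.max_prefill_queue_size
    return long_prefill and queue_has_room


print(should_prefill_remotely(prefill_length=2048, prefill_queue_size=1))
print(should_prefill_remotely(prefill_length=128, prefill_queue_size=0))
```

Under these assumed thresholds, a 2048-token prompt with a short queue goes to remote prefill, while a 128-token prompt is prefilled locally.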
## D
**Decode Phase** - The second phase of LLM inference that generates output tokens one at a time.
**Disaggregated Serving** - Dynamo's core architecture that separates prefill and decode phases into specialized engines to maximize GPU throughput and improve performance.
**Distributed Runtime** - Dynamo's Rust-based core system that manages service discovery, communication, and component lifecycle across distributed clusters.
**Dynamo** - NVIDIA's high-performance distributed inference framework for Large Language Models (LLMs) and generative AI models, designed for multinode environments with disaggregated serving and cache-aware routing.
**Dynamo Kubernetes Platform** - A Kubernetes platform providing managed deployment experience for Dynamo inference graphs.
## E
**Endpoint** - A specific network-accessible API within a Dynamo component, such as `generate` or `load_metrics`.
## F
**Frontend** - Dynamo's API server component that receives user requests and provides OpenAI-compatible HTTP endpoints.
## G
**Graph** - A collection of interconnected Dynamo components that form a complete inference pipeline with request paths (single-in) and response paths (many-out for streaming). A graph can be packaged into a Dynamo Artifact for deployment.
## I
**Instance** - A running process with a unique `instance_id`. Multiple instances can serve the same namespace, component, and endpoint for load balancing.
## K
**KV Block Manager (KVBM)** - Dynamo's scalable runtime component that handles memory allocation, management, and remote sharing of Key-Value blocks across heterogeneous and distributed environments.
**KV Cache** - Key-Value cache that stores computed attention states from previous tokens to avoid recomputation during inference.
**KV Router** - Dynamo's intelligent routing system that directs requests to workers with the highest cache overlap to maximize KV cache reuse. It determines routing based on KV cache hit rates and worker metrics.
**KVIndexer** - Dynamo component that maintains a global view of cached blocks across all workers using a prefix tree structure to calculate cache hit rates.
**KVPublisher** - Dynamo component that emits KV cache events (stored/removed) from individual workers to the global KVIndexer.
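The interplay of the three KV entries above can be sketched in a few lines. This is a toy model and not Dynamo's code: `ToyIndexer` and `block_hashes` are hypothetical names, the block size of 4 is chosen for readability, and a flat map of chained block hashes is swapped in for the prefix tree (chaining each hash to its parent gives the same prefix-matching behavior in this small example).

```python
import hashlib


def block_hashes(token_ids, block_size=4):
    """Hash each full block together with its parent hash, so that equal
    hashes imply an equal token prefix up to that block."""
    hashes, parent = [], b""
    full_end = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_end, block_size):
        block = ",".join(map(str, token_ids[start:start + block_size])).encode()
        parent = hashlib.sha256(parent + block).digest()
        hashes.append(parent)
    return hashes


class ToyIndexer:
    """Toy global view: maps a chained block hash to the workers caching it."""

    def __init__(self):
        self._holders = {}

    def stored(self, worker, token_ids):
        # Corresponds to a "stored" event emitted by a worker's publisher.
        for h in block_hashes(token_ids):
            self._holders.setdefault(h, set()).add(worker)

    def removed(self, worker, token_ids):
        # Corresponds to a "removed" event emitted by a worker's publisher.
        for h in block_hashes(token_ids):
            self._holders.get(h, set()).discard(worker)

    def overlap(self, token_ids):
        """Count, per worker, how many leading blocks of the request are cached."""
        scores, active = {}, None
        for h in block_hashes(token_ids):
            holders = self._holders.get(h, set())
            active = set(holders) if active is None else active & holders
            if not active:
                break
            for worker in active:
                scores[worker] = scores.get(worker, 0) + 1
        return scores


indexer = ToyIndexer()
indexer.stored("worker-0", [1, 2, 3, 4, 5, 6, 7, 8])
indexer.stored("worker-1", [1, 2, 3, 4, 9, 9, 9, 9])
scores = indexer.overlap([1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13])
print(scores, max(scores, key=scores.get))
```

A router built on this view would prefer `worker-0`, which already holds two leading blocks of the request, over `worker-1`, which holds one. A real router would also weigh worker metrics, as the KV Router entry notes.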
## M
**Model Deployment Card (MDC)** - A configuration structure containing all information required for distributed model serving. When a worker loads a model, it creates an MDC containing references to components such as the tokenizer, templates, and runtime config. Workers publish their MDC to make the model discoverable to frontends. Frontends use the MDC to configure request preprocessing (tokenization, prompt formatting).
## N
**Namespace** - Dynamo's logical grouping mechanism for related components. Similar to directories in a file system, namespaces prevent collisions between different deployments.
**NIXL (NVIDIA Inference tranXfer Library)** - High-performance data transfer library optimized for inference workloads, supporting direct GPU-to-GPU transfers and multiple memory hierarchies.
## P
**PagedAttention** - Memory management technique from vLLM that efficiently manages KV cache by chunking requests into blocks.
**Planner** - Dynamo component that performs dynamic resource scaling based on real-time demand signals and system metrics.
**Prefill Phase** - The first phase of LLM inference that processes the input prompt and generates KV cache.
**Prefix Caching** - Optimization technique that reuses previously computed KV cache for common prompt prefixes.
**Processor** - Dynamo component that handles request preprocessing, tokenization, and routing decisions.
## R
**RadixAttention** - Technique from SGLang that uses a prefix tree structure for efficient KV cache matching, insertion, and eviction.
**RDMA (Remote Direct Memory Access)** - Technology that allows direct memory access between distributed systems, used for efficient KV cache transfers.
## S
**SGLang** - Fast LLM inference framework with native embedding support and RadixAttention.
## T
**Tensor Parallelism (TP)** - Model parallelism technique where model weights are distributed across multiple GPUs.
**TensorRT-LLM** - NVIDIA's optimized LLM inference engine with multinode MPI distributed support.
**Time-To-First-Token (TTFT)** - The latency from receiving a request to generating the first output token.
## V
**vLLM** - High-throughput LLM serving engine with distributed tensor/pipeline parallelism and PagedAttention.
## W
**Wide Expert Parallelism (WideEP)** - Mixture-of-Experts deployment strategy that spreads experts across many GPUs (e.g., 64-way EP) so each GPU hosts only a few experts.
## X
**xPyD (x Prefill y Decode)** - Dynamo notation describing disaggregated serving configurations where x prefill workers serve y decode workers. Dynamo supports runtime-reconfigurable xPyD.
# Dynamo Release Artifacts
This document provides a comprehensive inventory of all Dynamo release artifacts including container images, Python wheels, Helm charts, and Rust crates.
> **See also:** [Support Matrix](support-matrix.md) for hardware and platform compatibility | [Feature Matrix](feature-matrix.md) for backend feature support
Release history in this document begins at v0.6.0.
## Current Release: Dynamo v0.8.1
- **GitHub Release:** [v0.8.1](https://github.com/ai-dynamo/dynamo/releases/tag/v0.8.1)
- **Docs:** [v0.8.1](https://docs.nvidia.com/dynamo/archive/0.8.1/index.html)
- **NGC Collection:** [ai-dynamo](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo)
### Patch Release: v0.8.1.post1 (Jan 23, 2026)
**v0.8.1.post1** is a patch release for the PyPI wheels and the TRT-LLM container only (no GitHub release). All other artifacts remain at v0.8.1.
| Artifact | Version | Change | Link |
|----------|---------|--------|------|
| `ai-dynamo` | `0.8.1.post1` | Updated TRT-LLM to `v1.2.0rc6.post2` | [PyPI](https://pypi.org/project/ai-dynamo/0.8.1.post1/) |
| `ai-dynamo-runtime` | `0.8.1.post1` | Updated TRT-LLM to `v1.2.0rc6.post2` | [PyPI](https://pypi.org/project/ai-dynamo-runtime/0.8.1.post1/) |
| `tensorrtllm-runtime` | `0.8.1.post1` | TRT-LLM `v1.2.0rc6.post2` | [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/tensorrtllm-runtime?version=0.8.1.post1) |
### Container Images
| Image:Tag | Description | Backend | CUDA | Arch | NGC | Notes |
|-----------|-------------|---------|------|------|-----|-------|
| `vllm-runtime:0.8.1` | Runtime container for vLLM backend | vLLM `v0.12.0` | `v12.9` | AMD64/ARM64 | [link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime?version=0.8.1) | |
| `vllm-runtime:0.8.1-cuda13` | Runtime container for vLLM backend (CUDA 13) | vLLM `v0.12.0` | `v13.0` | AMD64/ARM64* | — | Fails to launch |
| `sglang-runtime:0.8.1` | Runtime container for SGLang backend | SGLang `v0.5.6.post2` | `v12.9` | AMD64/ARM64 | [link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/sglang-runtime?version=0.8.1) | |
| `sglang-runtime:0.8.1-cuda13` | Runtime container for SGLang backend (CUDA 13) | SGLang `v0.5.6.post2` | `v13.0` | AMD64/ARM64* | [link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/sglang-runtime?version=0.8.1-cuda13) | Experimental |
| `tensorrtllm-runtime:0.8.1` | Runtime container for TensorRT-LLM backend | TRT-LLM `v1.2.0rc6.post1` | `v13.0` | AMD64/ARM64 | [link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/tensorrtllm-runtime?version=0.8.1) | |
| `dynamo-frontend:0.8.1` | API gateway with Endpoint Prediction Protocol (EPP) | — | — | AMD64/ARM64 | [link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/dynamo-frontend?version=0.8.1) | |
| `kubernetes-operator:0.8.1` | Kubernetes operator for Dynamo deployments | — | — | AMD64/ARM64 | [link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/kubernetes-operator?version=0.8.1) | |
\* Multimodal inference on CUDA 13 images: works on AMD64 for all backends; works on ARM64 only for TensorRT-LLM (`vllm-runtime:*-cuda13` and `sglang-runtime:*-cuda13` do not support multimodality on ARM64).
### Python Wheels
We recommend using the TensorRT-LLM NGC container instead of the `ai-dynamo[trtllm]` wheel. See the [NGC container collection](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) for supported images.
| Package | Description | Python | Platform | PyPI |
|---------|-------------|--------|----------|------|
| `ai-dynamo==0.8.1` | Main package with backend integrations (vLLM, SGLang, TRT-LLM) | `3.10`–`3.12` | Linux (glibc `v2.28+`) | [link](https://pypi.org/project/ai-dynamo/0.8.1/) |
| `ai-dynamo-runtime==0.8.1` | Core Python bindings for Dynamo runtime | `3.10`–`3.12` | Linux (glibc `v2.28+`) | [link](https://pypi.org/project/ai-dynamo-runtime/0.8.1/) |
| `kvbm==0.8.1` | KV Block Manager for disaggregated KV cache | `3.12` | Linux (glibc `v2.28+`) | [link](https://pypi.org/project/kvbm/0.8.1/) |
### Helm Charts
| Chart | Description | NGC |
|-------|-------------|-----|
| `dynamo-crds-0.8.1` | Custom Resource Definitions for Dynamo Kubernetes resources | [link](https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-0.8.1.tgz) |
| `dynamo-platform-0.8.1` | Platform services (etcd, NATS) for Dynamo cluster | [link](https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-0.8.1.tgz) |
| `dynamo-graph-0.8.1` | Deployment graph controller for Dynamo workloads | [link](https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-graph-0.8.1.tgz) |
### Rust Crates
| Crate | Description | MSRV (Rust) | crates.io |
|-------|-------------|-------------|-----------|
| `dynamo-runtime@0.8.1` | Core distributed runtime library | `v1.82` | [link](https://crates.io/crates/dynamo-runtime/0.8.1) |
| `dynamo-llm@0.8.1` | LLM inference engine | `v1.82` | [link](https://crates.io/crates/dynamo-llm/0.8.1) |
| `dynamo-async-openai@0.8.1` | Async OpenAI-compatible API client | `v1.82` | [link](https://crates.io/crates/dynamo-async-openai/0.8.1) |
| `dynamo-parsers@0.8.1` | Protocol parsers (SSE, JSON streaming) | `v1.82` | [link](https://crates.io/crates/dynamo-parsers/0.8.1) |
| `dynamo-memory@0.8.1` | Memory management utilities | `v1.82` | [link](https://crates.io/crates/dynamo-memory/0.8.1) |
| `dynamo-config@0.8.1` | Configuration management | `v1.82` | [link](https://crates.io/crates/dynamo-config/0.8.1) |
## Quick Install Commands
### Container Images (NGC)
> For detailed run instructions, see the [Container README](../../container/README.md) or backend-specific guides: [vLLM](../backends/vllm/README.md) | [SGLang](../backends/sglang/README.md) | [TensorRT-LLM](../backends/trtllm/README.md)
```bash
# Runtime containers
docker pull nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.1
docker pull nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1
docker pull nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1
# CUDA 13 variants (experimental)
# vLLM CUDA 13 image fails to launch (known issue)
# docker pull nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.1-cuda13
docker pull nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1-cuda13
# Infrastructure containers
docker pull nvcr.io/nvidia/ai-dynamo/dynamo-frontend:0.8.1
docker pull nvcr.io/nvidia/ai-dynamo/kubernetes-operator:0.8.1
```
### Python Wheels (PyPI)
> For detailed installation instructions, see the [Local Quick Start](https://github.com/ai-dynamo/dynamo#local-quick-start) in the README.
```bash
# Install Dynamo with a specific backend (Recommended)
uv pip install "ai-dynamo[vllm]==0.8.1.post1"
uv pip install "ai-dynamo[sglang]==0.8.1.post1"
# TensorRT-LLM requires the NVIDIA PyPI index and pip
pip install --pre --extra-index-url https://pypi.nvidia.com "ai-dynamo[trtllm]==0.8.1.post1"
# Install Dynamo core only
uv pip install ai-dynamo==0.8.1.post1
# Install standalone KVBM (Python 3.12 only)
uv pip install kvbm==0.8.1
```
### Helm Charts (NGC)
> For Kubernetes deployment instructions, see the [Kubernetes Installation Guide](../kubernetes/installation_guide.md).
```bash
helm install dynamo-crds oci://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds --version 0.8.1
helm install dynamo-platform oci://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform --version 0.8.1
helm install dynamo-graph oci://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-graph --version 0.8.1
```
### Rust Crates (crates.io)
> For API documentation, see each crate on [docs.rs](https://docs.rs/). To build Dynamo from source, see [Building from Source](https://github.com/ai-dynamo/dynamo#building-from-source).
```bash
cargo add dynamo-runtime@0.8.1
cargo add dynamo-llm@0.8.1
cargo add dynamo-async-openai@0.8.1
cargo add dynamo-parsers@0.8.1
cargo add dynamo-memory@0.8.1
cargo add dynamo-config@0.8.1
```
## CUDA and Driver Requirements
For detailed CUDA toolkit versions and minimum driver requirements for each container image, see the [Support Matrix](support-matrix.md#cuda-and-driver-requirements).
## Known Issues
For a complete list of known issues, refer to the release notes for each release:
- [v0.8.0 Release Notes](https://github.com/ai-dynamo/dynamo/releases/tag/v0.8.0)
- [v0.8.1 Release Notes](https://github.com/ai-dynamo/dynamo/releases/tag/v0.8.1)
### Known Artifact Issues
| Version | Artifact | Issue | Status |
|---------|----------|-------|--------|
| v0.8.1 | `vllm-runtime:0.8.1-cuda13` | Container fails to launch. | Known issue |
| v0.8.1 | `sglang-runtime:0.8.1-cuda13`, `vllm-runtime:0.8.1-cuda13` | Multimodality not expected to work on ARM64. Works on AMD64. | Known limitation |
| v0.8.0 | `sglang-runtime:0.8.0-cuda13` | CuDNN installation issue caused PyTorch `v2.9.1` compatibility problems with `nn.Conv3d`, resulting in performance degradation and excessive memory usage in multimodal workloads. | Fixed in v0.8.1 ([#5461](https://github.com/ai-dynamo/dynamo/pull/5461)) |
---
## Release History
- **v0.8.1.post1 Patch**: Updated TRT-LLM to `v1.2.0rc6.post2` (PyPI wheels and TRT-LLM container only)
- **Standalone Frontend Container**: `dynamo-frontend` added in v0.8.0
- **CUDA 13 Runtimes**: Experimental CUDA 13 runtime for vLLM and SGLang in v0.8.0
- **New Rust Crates**: `dynamo-memory` and `dynamo-config` added in v0.8.0
### GitHub Releases
| Version | Release Date | GitHub | Docs |
|---------|--------------|--------|------|
| `v0.8.1` | Jan 23, 2026 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.8.1) | [Docs](https://docs.nvidia.com/dynamo/archive/0.8.1/index.html) |
| `v0.8.0` | Jan 15, 2026 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.8.0) | [Docs](https://docs.nvidia.com/dynamo/archive/0.8.0/index.html) |
| `v0.7.1` | Dec 15, 2025 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.7.1) | [Docs](https://docs.nvidia.com/dynamo/archive/0.7.1/index.html) |
| `v0.7.0` | Nov 26, 2025 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.7.0) | [Docs](https://docs.nvidia.com/dynamo/archive/0.7.0/index.html) |
| `v0.6.1` | Nov 6, 2025 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.6.1) | [Docs](https://docs.nvidia.com/dynamo/archive/0.6.1/index.html) |
| `v0.6.0` | Oct 28, 2025 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.6.0) | [Docs](https://docs.nvidia.com/dynamo/archive/0.6.0/index.html) |
### Container Images
> **NGC Collection:** [ai-dynamo](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo)
>
> To access a specific version, append `?version=TAG` to the container URL:
> `https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/{container}?version={tag}`
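The URL template above, together with the `nvcr.io/nvidia/ai-dynamo/...` pull references used in the Quick Install Commands, can be filled in programmatically. The helper names `ngc_container_url` and `pull_reference` below are hypothetical; only the two string patterns come from this document.

```python
from urllib.parse import quote

# Catalog page template, copied from the note above.
NGC_CONTAINER_URL = (
    "https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/"
    "containers/{container}?version={tag}"
)
# Pull reference pattern, as used by the docker pull commands in this document.
NGC_PULL_REFERENCE = "nvcr.io/nvidia/ai-dynamo/{container}:{tag}"


def ngc_container_url(container, tag):
    """Build the NGC catalog URL for one container image tag."""
    return NGC_CONTAINER_URL.format(
        container=quote(container, safe=""), tag=quote(tag, safe="")
    )


def pull_reference(container, tag):
    """Build the image reference to pass to `docker pull`."""
    return NGC_PULL_REFERENCE.format(container=container, tag=tag)


print(ngc_container_url("sglang-runtime", "0.8.1-cuda13"))
print(pull_reference("tensorrtllm-runtime", "0.8.1.post1"))
```

The first call reproduces the catalog link listed for `sglang-runtime:0.8.1-cuda13`, and the second reproduces the pull reference for the patched TensorRT-LLM container.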
#### vllm-runtime
| Image:Tag | vLLM | Arch | CUDA | Notes |
|-----------|------|------|------|-------|
| `vllm-runtime:0.8.1` | `v0.12.0` | AMD64/ARM64 | `v12.9` | |
| `vllm-runtime:0.8.0` | `v0.12.0` | AMD64/ARM64 | `v12.9` | |
| `vllm-runtime:0.8.0-cuda13` | `v0.12.0` | AMD64/ARM64 | `v13.0` | Experimental |
| `vllm-runtime:0.7.0.post2` | `v0.11.2` | AMD64/ARM64 | `v12.8` | Patch |
| `vllm-runtime:0.7.1` | `v0.11.0` | AMD64/ARM64 | `v12.8` | |
| `vllm-runtime:0.7.0.post1` | `v0.11.0` | AMD64/ARM64 | `v12.8` | Patch |
| `vllm-runtime:0.7.0` | `v0.11.0` | AMD64/ARM64 | `v12.8` | |
| `vllm-runtime:0.6.1.post1` | `v0.11.0` | AMD64/ARM64 | `v12.8` | Patch |
| `vllm-runtime:0.6.1` | `v0.11.0` | AMD64/ARM64 | `v12.8` | |
| `vllm-runtime:0.6.0` | `v0.11.0` | AMD64 | `v12.8` | |
#### sglang-runtime
| Image:Tag | SGLang | Arch | CUDA | Notes |
|-----------|--------|------|------|-------|
| `sglang-runtime:0.8.1` | `v0.5.6.post2` | AMD64/ARM64 | `v12.9` | |
| `sglang-runtime:0.8.1-cuda13` | `v0.5.6.post2` | AMD64/ARM64 | `v13.0` | Experimental |
| `sglang-runtime:0.8.0` | `v0.5.6.post2` | AMD64/ARM64 | `v12.9` | |
| `sglang-runtime:0.8.0-cuda13` | `v0.5.6.post2` | AMD64/ARM64 | `v13.0` | Experimental |
| `sglang-runtime:0.7.1` | `v0.5.4.post3` | AMD64/ARM64 | `v12.9` | |
| `sglang-runtime:0.7.0.post1` | `v0.5.4.post3` | AMD64/ARM64 | `v12.9` | Patch |
| `sglang-runtime:0.7.0` | `v0.5.4.post3` | AMD64/ARM64 | `v12.9` | |
| `sglang-runtime:0.6.1.post1` | `v0.5.3.post2` | AMD64/ARM64 | `v12.9` | Patch |
| `sglang-runtime:0.6.1` | `v0.5.3.post2` | AMD64/ARM64 | `v12.9` | |
| `sglang-runtime:0.6.0` | `v0.5.3.post2` | AMD64 | `v12.8` | |
#### tensorrtllm-runtime
| Image:Tag | TRT-LLM | Arch | CUDA | Notes |
|-----------|---------|------|------|-------|
| `tensorrtllm-runtime:0.8.1.post1` | `v1.2.0rc6.post2` | AMD64/ARM64 | `v13.0` | Patch |
| `tensorrtllm-runtime:0.8.1` | `v1.2.0rc6.post1` | AMD64/ARM64 | `v13.0` | |
| `tensorrtllm-runtime:0.8.0` | `v1.2.0rc6.post1` | AMD64/ARM64 | `v13.0` | |
| `tensorrtllm-runtime:0.7.0.post2` | `v1.2.0rc2` | AMD64/ARM64 | `v13.0` | Patch |
| `tensorrtllm-runtime:0.7.1` | `v1.2.0rc3` | AMD64/ARM64 | `v13.0` | |
| `tensorrtllm-runtime:0.7.0.post1` | `v1.2.0rc3` | AMD64/ARM64 | `v13.0` | Patch |
| `tensorrtllm-runtime:0.7.0` | `v1.2.0rc2` | AMD64/ARM64 | `v13.0` | |
| `tensorrtllm-runtime:0.6.1-cuda13` | `v1.2.0rc1` | AMD64/ARM64 | `v13.0` | Experimental |
| `tensorrtllm-runtime:0.6.1.post1` | `v1.1.0rc5` | AMD64/ARM64 | `v12.9` | Patch |
| `tensorrtllm-runtime:0.6.1` | `v1.1.0rc5` | AMD64/ARM64 | `v12.9` | |
| `tensorrtllm-runtime:0.6.0` | `v1.1.0rc5` | AMD64/ARM64 | `v12.9` | |
#### dynamo-frontend
| Image:Tag | Arch | Notes |
|-----------|------|-------|
| `dynamo-frontend:0.8.1` | AMD64/ARM64 | |
| `dynamo-frontend:0.8.0` | AMD64/ARM64 | Initial |
#### kubernetes-operator
| Image:Tag | Arch | Notes |
|-----------|------|-------|
| `kubernetes-operator:0.8.1` | AMD64/ARM64 | |
| `kubernetes-operator:0.8.0` | AMD64/ARM64 | |
| `kubernetes-operator:0.7.1` | AMD64/ARM64 | |
| `kubernetes-operator:0.7.0.post1` | AMD64/ARM64 | Patch |
| `kubernetes-operator:0.7.0` | AMD64/ARM64 | |
| `kubernetes-operator:0.6.1` | AMD64/ARM64 | |
| `kubernetes-operator:0.6.0` | AMD64/ARM64 | |
### Python Wheels
> **PyPI:** [ai-dynamo](https://pypi.org/project/ai-dynamo/) | [ai-dynamo-runtime](https://pypi.org/project/ai-dynamo-runtime/) | [kvbm](https://pypi.org/project/kvbm/)
>
> To access a specific version: `https://pypi.org/project/{package}/{version}/`
#### ai-dynamo (wheel)
| Package | Python | Platform | Notes |
|---------|--------|----------|-------|
| `ai-dynamo==0.8.1.post1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | TRT-LLM `v1.2.0rc6.post2` |
| `ai-dynamo==0.8.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo==0.8.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo==0.7.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo==0.7.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo==0.6.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo==0.6.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
#### ai-dynamo-runtime (wheel)
| Package | Python | Platform | Notes |
|---------|--------|----------|-------|
| `ai-dynamo-runtime==0.8.1.post1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | TRT-LLM `v1.2.0rc6.post2` |
| `ai-dynamo-runtime==0.8.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo-runtime==0.8.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo-runtime==0.7.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo-runtime==0.7.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo-runtime==0.6.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo-runtime==0.6.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
#### kvbm (wheel)
| Package | Python | Platform | Notes |
|---------|--------|----------|-------|
| `kvbm==0.8.1` | `3.12` | Linux (glibc `v2.28+`) | |
| `kvbm==0.8.0` | `3.12` | Linux (glibc `v2.28+`) | |
| `kvbm==0.7.1` | `3.12` | Linux (glibc `v2.28+`) | |
| `kvbm==0.7.0` | `3.12` | Linux (glibc `v2.28+`) | Initial |
### Helm Charts
> **NGC Helm Registry:** [ai-dynamo](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo)
>
> Direct download: `https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/{chart}-{version}.tgz`
#### dynamo-crds (Helm chart)
| Chart | Notes |
|-------|-------|
| `dynamo-crds-0.8.1` | |
| `dynamo-crds-0.8.0` | |
| `dynamo-crds-0.7.1` | |
| `dynamo-crds-0.7.0` | |
| `dynamo-crds-0.6.1` | |
| `dynamo-crds-0.6.0` | |
#### dynamo-platform (Helm chart)
| Chart | Notes |
|-------|-------|
| `dynamo-platform-0.8.1` | |
| `dynamo-platform-0.8.0` | |
| `dynamo-platform-0.7.1` | |
| `dynamo-platform-0.7.0` | |
| `dynamo-platform-0.6.1` | |
| `dynamo-platform-0.6.0` | |
#### dynamo-graph (Helm chart)
| Chart | Notes |
|-------|-------|
| `dynamo-graph-0.8.1` | |
| `dynamo-graph-0.8.0` | |
| `dynamo-graph-0.7.1` | |
| `dynamo-graph-0.7.0` | |
| `dynamo-graph-0.6.1` | |
| `dynamo-graph-0.6.0` | |
### Rust Crates
> **crates.io:** [dynamo-runtime](https://crates.io/crates/dynamo-runtime) | [dynamo-llm](https://crates.io/crates/dynamo-llm) | [dynamo-async-openai](https://crates.io/crates/dynamo-async-openai) | [dynamo-parsers](https://crates.io/crates/dynamo-parsers) | [dynamo-memory](https://crates.io/crates/dynamo-memory) | [dynamo-config](https://crates.io/crates/dynamo-config)
>
> To access a specific version: `https://crates.io/crates/{crate}/{version}`
#### dynamo-runtime (crate)
| Crate | MSRV (Rust) | Notes |
|-------|-------------|-------|
| `dynamo-runtime@0.8.1` | `v1.82` | |
| `dynamo-runtime@0.8.0` | `v1.82` | |
| `dynamo-runtime@0.7.1` | `v1.82` | |
| `dynamo-runtime@0.7.0` | `v1.82` | |
| `dynamo-runtime@0.6.1` | `v1.82` | |
| `dynamo-runtime@0.6.0` | `v1.82` | |
#### dynamo-llm (crate)
| Crate | MSRV (Rust) | Notes |
|-------|-------------|-------|
| `dynamo-llm@0.8.1` | `v1.82` | |
| `dynamo-llm@0.8.0` | `v1.82` | |
| `dynamo-llm@0.7.1` | `v1.82` | |
| `dynamo-llm@0.7.0` | `v1.82` | |
| `dynamo-llm@0.6.1` | `v1.82` | |
| `dynamo-llm@0.6.0` | `v1.82` | |
#### dynamo-async-openai (crate)
| Crate | MSRV (Rust) | Notes |
|-------|-------------|-------|
| `dynamo-async-openai@0.8.1` | `v1.82` | |
| `dynamo-async-openai@0.8.0` | `v1.82` | |
| `dynamo-async-openai@0.7.1` | `v1.82` | |
| `dynamo-async-openai@0.7.0` | `v1.82` | |
| `dynamo-async-openai@0.6.1` | `v1.82` | |
| `dynamo-async-openai@0.6.0` | `v1.82` | |
#### dynamo-parsers (crate)
| Crate | MSRV (Rust) | Notes |
|-------|-------------|-------|
| `dynamo-parsers@0.8.1` | `v1.82` | |
| `dynamo-parsers@0.8.0` | `v1.82` | |
| `dynamo-parsers@0.7.1` | `v1.82` | |
| `dynamo-parsers@0.7.0` | `v1.82` | |
| `dynamo-parsers@0.6.1` | `v1.82` | |
| `dynamo-parsers@0.6.0` | `v1.82` | |
#### dynamo-memory (crate)
| Crate | MSRV (Rust) | Notes |
|-------|-------------|-------|
| `dynamo-memory@0.8.1` | `v1.82` | |
| `dynamo-memory@0.8.0` | `v1.82` | Initial |
#### dynamo-config (crate)
| Crate | MSRV (Rust) | Notes |
|-------|-------------|-------|
| `dynamo-config@0.8.1` | `v1.82` | |
| `dynamo-config@0.8.0` | `v1.82` | Initial |
# Dynamo Support Matrix
This document provides the support matrix for Dynamo, covering hardware and software compatibility and build instructions.
**See also:** [Release Artifacts](release-artifacts.md) for container images, wheels, Helm charts, and crates | [Feature Matrix](feature-matrix.md) for backend feature support
## Backend Dependencies
The following table shows the backend framework versions included with each Dynamo release:
| **Dynamo** | **vLLM** | **SGLang** | **TensorRT-LLM** | **NIXL** |
| :--- | :--- | :--- | :--- | :--- |
| **main (ToT)** | `0.15.1` | `0.5.8` | `1.3.0rc1` | `0.9.0` |
| **v1.0.0** *(planned)* | `0.15.0` | *Latest as of 2/17* | *Latest as of 2/17* | `0.10.0` |
| **v0.9.0** *(in progress)* | `0.14.1` | `0.5.8` | `1.3.0rc1` | `0.9.0` |
| **v0.8.1.post3** *(in progress)* | `0.12.0` | `0.5.6.post2` | `1.2.0rc6.post3` | `0.8.0` |
| **v0.8.1.post2** | `0.12.0` | `0.5.6.post2` | `1.2.0rc6.post2` | `0.8.0` |
| **v0.8.1.post1** | `0.12.0` | `0.5.6.post2` | `1.2.0rc6.post1` | `0.8.0` |
| **v0.8.1** | `0.12.0` | `0.5.6.post2` | `1.2.0rc6.post1` | `0.8.0` |
| **v0.8.0** | `0.12.0` | `0.5.6.post2` | `1.2.0rc6.post1` | `0.8.0` |
| **v0.7.1** | `0.11.0` | `0.5.4.post3` | `1.2.0rc3` | `0.8.0` |
| **v0.7.0.post1** | `0.11.0` | `0.5.4.post3` | `1.2.0rc3` | `0.8.0` |
| **v0.7.0** | `0.11.0` | `0.5.4.post3` | `1.2.0rc2` | `0.8.0` |
| **v0.6.1.post1** | `0.11.0` | `0.5.3.post2` | `1.1.0rc5` | `0.6.0` |
| **v0.6.1** | `0.11.0` | `0.5.3.post2` | `1.1.0rc5` | `0.6.0` |
| **v0.6.0** | `0.11.0` | `0.5.3.post2` | `1.1.0rc5` | `0.6.0` |
### Version Labels
- **main (ToT)** reflects the current development branch.
- Releases marked *(in progress)* or *(planned)* show target versions that may change before final release.
### Version Compatibility
- Backend versions listed are the only versions tested and supported for each release.
- TensorRT-LLM does not support Python 3.11; installation of the `ai-dynamo[trtllm]` wheel will fail on Python 3.11.
### CUDA Versions by Backend
| **Dynamo** | **vLLM** | **SGLang** | **TensorRT-LLM** | **Notes** |
| :--- | :--- | :--- | :--- | :--- |
| **v0.8.1** | `12.9`, `13.0` | `12.9`, `13.0` | `13.0` | Experimental vLLM/SGLang CUDA 13 support |
| **v0.8.0** | `12.9`, `13.0` | `12.9`, `13.0` | `13.0` | Experimental vLLM/SGLang CUDA 13 support |
| **v0.7.1** | `12.9` | `12.8` | `13.0` | |
| **v0.7.0** | `12.8` | `12.9` | `13.0` | TensorRT-LLM adds CUDA 13 support; CUDA 12.9 deprecated |
| **v0.6.1** | `12.8` | `12.9` | `12.9` | |
| **v0.6.0** | `12.8` | `12.8` | `12.9` | |
Patch versions (e.g., v0.8.1.post1, v0.7.0.post1) have the same CUDA support as their base version.
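
The patch-version rule can be applied mechanically. The following is a minimal sketch, assuming release tags always take the form `vX.Y.Z` with an optional `.postN` suffix; the names `CUDA_BY_RELEASE`, `base_release`, and `cuda_versions` are illustrative and not part of Dynamo. The data is copied from the table above.

```python
import re

# CUDA versions per backend for each base release, copied from the table above.
CUDA_BY_RELEASE = {
    "v0.8.1": {"vLLM": ["12.9", "13.0"], "SGLang": ["12.9", "13.0"], "TensorRT-LLM": ["13.0"]},
    "v0.8.0": {"vLLM": ["12.9", "13.0"], "SGLang": ["12.9", "13.0"], "TensorRT-LLM": ["13.0"]},
    "v0.7.1": {"vLLM": ["12.9"], "SGLang": ["12.8"], "TensorRT-LLM": ["13.0"]},
    "v0.7.0": {"vLLM": ["12.8"], "SGLang": ["12.9"], "TensorRT-LLM": ["13.0"]},
    "v0.6.1": {"vLLM": ["12.8"], "SGLang": ["12.9"], "TensorRT-LLM": ["12.9"]},
    "v0.6.0": {"vLLM": ["12.8"], "SGLang": ["12.8"], "TensorRT-LLM": ["12.9"]},
}


def base_release(version: str) -> str:
    """Strip a trailing '.postN' suffix, e.g. 'v0.8.1.post2' -> 'v0.8.1'."""
    return re.sub(r"\.post\d+$", "", version)


def cuda_versions(version: str, backend: str) -> list[str]:
    """Return the CUDA versions listed for a release, resolving patch versions to their base."""
    return CUDA_BY_RELEASE[base_release(version)][backend]
```

For example, `cuda_versions("v0.7.0.post1", "vLLM")` resolves to the `v0.7.0` row.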
For detailed artifact versions and NGC links (including container images, Python wheels, Helm charts, and Rust crates), see the [Release Artifacts](release-artifacts.md) page.
## Hardware Compatibility
| **CPU Architecture** | **Status** |
| :------------------- | :----------- |
| **x86_64** | Supported |
| **ARM64** | Supported |
Dynamo provides multi-arch container images supporting both AMD64 (x86_64) and ARM64 architectures. See [Release Artifacts](release-artifacts.md) for available images.
### GPU Compatibility
The following NVIDIA GPU architectures are supported:
| **GPU Architecture** | **Status** |
| :----------------------------------- | :--------- |
| **NVIDIA Blackwell Architecture** | Supported |
| **NVIDIA Hopper Architecture** | Supported |
| **NVIDIA Ada Lovelace Architecture** | Supported |
| **NVIDIA Ampere Architecture** | Supported |
## Platform Architecture Compatibility
**Dynamo** is compatible with the following platforms:
| **Operating System** | **Version** | **Architecture** | **Status** |
| :------------------- | :---------- | :--------------- | :----------- |
| **Ubuntu** | 22.04 | x86_64 | Supported |
| **Ubuntu** | 24.04 | x86_64 | Supported |
| **Ubuntu** | 24.04 | ARM64 | Supported |
| **CentOS Stream** | 9 | x86_64 | Experimental |
Wheels are built using a manylinux_2_28-compatible environment and validated on CentOS Stream 9 and Ubuntu (22.04, 24.04). Compatibility with other Linux distributions is expected but not officially verified.
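
On a distribution that is not listed, a quick glibc check gives a first indication of whether the wheels can load, since `manylinux_2_28` targets glibc 2.28 or newer. This is a rough sketch using only the Python standard library; the helper names are illustrative, and passing the check does not make a distribution officially verified.

```python
import platform

# manylinux_2_28 wheels target glibc 2.28 or newer.
MANYLINUX_GLIBC = (2, 28)


def parse_version(text):
    """Turn a version string such as '2.35' into a comparable tuple (2, 35)."""
    return tuple(int(part) for part in text.split(".")[:2])


def glibc_ok(libc_name, libc_version, minimum=MANYLINUX_GLIBC):
    """Return True if the reported C library is glibc at or above the minimum."""
    if libc_name != "glibc" or not libc_version:
        return False  # musl or unknown libc: manylinux wheels do not apply
    return parse_version(libc_version) >= minimum


def host_supports_wheels():
    """Check the C library of the running interpreter."""
    name, version = platform.libc_ver()
    return glibc_ok(name, version)
```
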
> [!Caution]
> KV Block Manager is supported only with Python 3.12. Python 3.12 support is currently limited to Ubuntu 24.04.
## Software Compatibility
### CUDA and Driver Requirements
Dynamo container images include CUDA toolkit libraries. The host machine must have a compatible NVIDIA GPU driver installed.
| Dynamo Version | Backend | CUDA Toolkit | Min Driver (Linux) | Min Driver (Windows) | Notes |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **0.8.1** | **vLLM** | 12.9 | 575.xx+ | 576.xx+ | |
| | | 13.0 | 580.xx+ | 581.xx+ | Experimental |
| | **SGLang** | 12.9 | 575.xx+ | 576.xx+ | |
| | | 13.0 | 580.xx+ | 581.xx+ | Experimental |
| | **TensorRT-LLM** | 13.0 | 580.xx+ | 581.xx+ | |
| **0.8.0** | **vLLM** | 12.9 | 575.xx+ | 576.xx+ | |
| | | 13.0 | 580.xx+ | 581.xx+ | Experimental |
| | **SGLang** | 12.9 | 575.xx+ | 576.xx+ | |
| | | 13.0 | 580.xx+ | 581.xx+ | Experimental |
| | **TensorRT-LLM** | 13.0 | 580.xx+ | 581.xx+ | |
| **0.7.1** | **vLLM** | 12.9 | 575.xx+ | 576.xx+ | |
| | **SGLang** | 12.8 | 570.xx+ | 571.xx+ | |
| | **TensorRT-LLM** | 13.0 | 580.xx+ | 581.xx+ | |
| **0.7.0** | **vLLM** | 12.8 | 570.xx+ | 571.xx+ | |
| | **SGLang** | 12.9 | 575.xx+ | 576.xx+ | |
| | **TensorRT-LLM** | 13.0 | 580.xx+ | 581.xx+ | |
Experimental CUDA 13 images are not published for all versions. Check [Release Artifacts](release-artifacts.md) for availability.
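
The minimum-driver column can be checked programmatically. The sketch below assumes that `575.xx+` means "driver branch 575 or newer" and that the host driver version is supplied as a string (for example, from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`). The function name and the sample version strings are illustrative.

```python
# Minimum driver branch per CUDA toolkit version, copied from the table above.
MIN_DRIVER_BRANCH = {
    "12.8": {"linux": 570, "windows": 571},
    "12.9": {"linux": 575, "windows": 576},
    "13.0": {"linux": 580, "windows": 581},
}


def driver_meets_minimum(driver_version, cuda_toolkit, host_os="linux"):
    """Compare the driver's major branch (e.g. 580 in '580.65.06') to the table."""
    branch = int(driver_version.split(".")[0])
    return branch >= MIN_DRIVER_BRANCH[cuda_toolkit][host_os]
```

With this mapping, a hypothetical Linux driver `570.124.06` satisfies the CUDA 12.8 images but not the CUDA 12.9 or 13.0 images.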
#### CUDA Compatibility Resources
For detailed information on CUDA driver compatibility, forward compatibility, and troubleshooting:
- [CUDA Compatibility Overview](https://docs.nvidia.com/deploy/cuda-compatibility/)
- [Why CUDA Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/why-cuda-compatibility.html)
- [Minor Version Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/minor-version-compatibility.html)
- [Forward Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/forward-compatibility.html)
- [FAQ](https://docs.nvidia.com/deploy/cuda-compatibility/frequently-asked-questions.html)
For extended driver compatibility beyond the minimum versions listed above, consider using `cuda-compat` packages on the host. See [Forward Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/forward-compatibility.html) for details.
## Cloud Service Provider Compatibility
### AWS
| **Host Operating System** | **Version** | **Architecture** | **Status** |
| :------------------------ | :---------- | :--------------- | :--------- |
| **Amazon Linux** | 2023 | x86_64 | Supported |
> [!Caution]
> **AL2023 TensorRT-LLM Limitation:** Running the AL2023 container locally with `docker run --network host ...` triggers a known TensorRT-LLM issue caused by a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid it, replace the `--network host` flag with explicit port mappings for only the ports you need (e.g., 4222 for NATS, 2379/2380 for etcd, 8000 for the frontend).
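
As an illustration of the workaround, the sketch below assembles a `docker run` command that publishes only the ports named in the caution instead of using `--network host`. The image name and any extra arguments are placeholders, and the port list is an assumption that should be adjusted to the services your container actually exposes.

```python
import shlex

# Ports named in the caution above; adjust to match your deployment.
REQUIRED_PORTS = {"nats": [4222], "etcd": [2379, 2380], "frontend": [8000]}


def docker_run_command(image, extra_args=()):
    """Build a 'docker run' argument list with explicit -p mappings."""
    cmd = ["docker", "run"]
    for ports in REQUIRED_PORTS.values():
        for port in ports:
            cmd += ["-p", f"{port}:{port}"]  # host:container, same port on both sides
    cmd += list(extra_args)
    cmd.append(image)
    return cmd


if __name__ == "__main__":
    # "my-image:tag" is a placeholder, not a real Dynamo image name.
    print(shlex.join(docker_run_command("my-image:tag", ["--rm", "--gpus", "all"])))
```
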
## Build Support
For version-specific artifact details, installation commands, and release history, see [Release Artifacts](release-artifacts.md).
**Dynamo** currently provides build support in the following ways:
- **Wheels**: We distribute Python wheels of Dynamo and KV Block Manager:
  - [ai-dynamo](https://pypi.org/project/ai-dynamo/)
  - [ai-dynamo-runtime](https://pypi.org/project/ai-dynamo-runtime/)
  - [kvbm](https://pypi.org/project/kvbm/) as a standalone implementation.
- **Dynamo Container Images**: We distribute multi-arch images (x86 & ARM64 compatible) on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo):
  - [Dynamo Frontend](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/dynamo-frontend) *(New in v0.8.0)*
  - [SGLang Runtime](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/sglang-runtime)
  - [SGLang Runtime (CUDA 13)](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/sglang-runtime-cu13)
  - [TensorRT-LLM Runtime](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/tensorrtllm-runtime)
  - [vLLM Runtime](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime)
  - [vLLM Runtime (CUDA 13)](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime-cu13)
  - [Kubernetes Operator](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/kubernetes-operator)
- **Helm Charts**: [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) hosts the Helm charts supporting Kubernetes deployments of Dynamo:
  - [Dynamo CRDs](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-crds)
  - [Dynamo Platform](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-platform)
  - [Dynamo Graph](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-graph)
- **Rust Crates**:
  - [dynamo-runtime](https://crates.io/crates/dynamo-runtime/)
  - [dynamo-llm](https://crates.io/crates/dynamo-llm/)
  - [dynamo-async-openai](https://crates.io/crates/dynamo-async-openai/)
  - [dynamo-parsers](https://crates.io/crates/dynamo-parsers/)
  - [dynamo-config](https://crates.io/crates/dynamo-config/) *(New in v0.8.0)*
  - [dynamo-memory](https://crates.io/crates/dynamo-memory/) *(New in v0.8.0)*
Once you've confirmed that your platform and architecture are compatible, you can install **Dynamo** by following the [Local Quick Start](https://github.com/ai-dynamo/dynamo/blob/main/README.md#local-quick-start) in the README.
---
orphan: true
---
# Documentation Templates
Templates for creating consistent Dynamo documentation.
## Directory Hierarchy
### Components (Router, Planner, KVBM, Frontend, Profiler)
```
┌──────────────────────────────────────────────────────────────┐
│ Tier 1: components/src/dynamo/<component>/README.md │ ← Redirect stub
│ Content: 1-5 lines pointing to docs/components/<component>/│
│ Template: incode_readme.md │
└─────────────────────┬────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ Tier 2: docs/components/<component>/ │ ← User docs
│ • README.md ← component_readme.md │
│ • <component>_guide.md ← component_guide.md │
│ • <component>_examples.md ← component_examples.md │
└─────────────────────┬────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ Tier 3: docs/design_docs/<component>_design.md │ ← Contributor docs
│ Template: component_design.md │
└──────────────────────────────────────────────────────────────┘
```
### Backends (vLLM, SGLang, TRT-LLM)
```
┌─────────────────────────────────────────────────────┐
│ Tier 1: components/src/dynamo/<backend>/README.md │ ← Redirect stub
│ Content: 1-5 lines pointing to docs/backends/ │
│ Template: incode_readme.md │
└─────────────────────┬───────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ Tier 2: docs/backends/<backend>/ │ ← User docs
│ • README.md ← backend_readme.md │
│ • <backend>_guide.md ← backend_guide.md │
│ │
│ Tier 2.5: docs/backends/README.md (exists) │
│ • Backend comparison table │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ Tier 3: External │
│ Backend internals documented in upstream repos │
└─────────────────────────────────────────────────────┘
```
### Features (Multimodal, LoRA, Speculative Decoding)
```
┌─────────────────────────────────────────────────────┐
│ Tier 1: N/A │
│ No in-code README (features are not components) │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ Tier 2: docs/features/<feature>/ │ ← User docs
│ • README.md ← feature_readme.md │
│ • <feature>_vllm.md ← feature_backend.md │
│ • <feature>_sglang.md ← feature_backend.md │
│ • <feature>_trtllm.md ← feature_backend.md │
└─────────────────────┬───────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ Tier 3: docs/design_docs/<feature>_design.md │ ← Optional
│ Only if significant architecture │
└─────────────────────────────────────────────────────┘
```
### Integrations (LMCache, HiCache, NIXL)
```
┌─────────────────────────────────────────────────────┐
│ Tier 1: N/A │
│ No in-code README (external tools) │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ Tier 2: docs/integrations/<integration>/ │ ← User docs
│ • README.md ← integration_readme.md │
│ • <integration>_setup.md (custom) │
│ • <integration>_<backend>.md (custom) │
└─────────────────────────────────────────────────────┘
```
### Deploy (Kubernetes, Helm, Operator)
```
┌─────────────────────────────────────────────────────┐
│ Tier 1: N/A │
│ No in-code README (deployment topics) │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ Tier 2: docs/deploy/ │ ← User docs
│ • README.md (deployment overview) │
│ • installation_guide.md, dynamo_operator.md │
│ • helm.md, examples/ │
└─────────────────────────────────────────────────────┘
```
### Performance (Tuning, Benchmarks)
```
┌─────────────────────────────────────────────────────┐
│ Tier 1: N/A │
│ No in-code README (performance topics) │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ Tier 2: docs/performance/ │ ← User docs
│ • README.md (performance overview) │
│ • tuning.md, benchmarking.md, etc. │
└─────────────────────────────────────────────────────┘
```
### Infrastructure (Observability, Fault Tolerance, Development)
```
┌─────────────────────────────────────────────────────┐
│ Tier 1: N/A │
│ No in-code README (operations topics) │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ Tier 2: docs/infrastructure/<topic>/ │ ← User docs
│ • README.md ← infrastructure_readme.md │
│ • <subtopic>.md (detailed guides) │
└─────────────────────────────────────────────────────┘
```
## Three-Tier Pattern
| Tier | Purpose | Audience | Location |
|------|---------|----------|----------|
| **Tier 1** | Redirect stub (1-5 lines) | Developers browsing code | `components/src/dynamo/<name>/README.md` |
| **Tier 2** | User documentation | Users, operators | `docs/<category>/<name>/` (e.g., `docs/components/router/`) |
| **Tier 3** | Design documentation | Contributors | `docs/design_docs/<name>_design.md` |
## Template Selection
| What you're documenting | Templates to use |
|------------------------|------------------|
| New component | `incode_readme.md` + `component_*.md` (all 4) |
| New backend | `incode_readme.md` + `backend_*.md` (both) |
| New feature | `feature_readme.md` + `feature_backend.md` (per backend) |
| New integration | `integration_readme.md` |
| New deploy topic | Custom (follows `docs/deploy/` structure) |
| New performance topic | Custom (follows `docs/performance/` structure) |
| New infrastructure topic | `infrastructure_readme.md` |
| Migrating existing docs | Use the template matching your target file |
## Usage
1. Identify which category your documentation belongs to (component, backend, feature, integration, deploy, performance, or infrastructure)
2. Create the directory structure shown above
3. Copy templates to the correct locations with correct filenames
4. Replace all `<placeholders>` with actual values
5. Replace `<!-- comments -->` with actual content
6. Remove sections that don't apply
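
Steps 2 through 4 can be scripted. The following is a minimal sketch for a new component, assuming the templates live in a single directory and use the literal placeholder `<component>`; the function name, the directory arguments, and the placeholder spelling are assumptions, so check them against the actual template files. Steps 5 and 6 still need manual editing.

```python
from pathlib import Path

# Tier 2 component files from the hierarchy above: target filename <- template.
COMPONENT_TEMPLATES = {
    "README.md": "component_readme.md",
    "{name}_guide.md": "component_guide.md",
    "{name}_examples.md": "component_examples.md",
}


def scaffold_component_docs(templates_dir, docs_root, name):
    """Copy the component templates into docs/components/<name>/ and fill in the name."""
    target_dir = Path(docs_root) / "components" / name
    target_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for target_pattern, template_name in COMPONENT_TEMPLATES.items():
        text = (Path(templates_dir) / template_name).read_text(encoding="utf-8")
        # Assumption: the placeholder is spelled literally "<component>".
        text = text.replace("<component>", name)
        target = target_dir / target_pattern.format(name=name)
        target.write_text(text, encoding="utf-8")
        created.append(target)
    return created
```

Calling `scaffold_component_docs("path/to/templates", "docs", "router")` would create `docs/components/router/README.md`, `router_guide.md`, and `router_examples.md`.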
## Updating Navigation
After adding new documentation:
1. **Sphinx (current):** Update `docs/index.rst` or the appropriate `_sections/*.rst` file to include your new docs in the navigation
2. **Fern (future):** Update `fern/docs.yml` with your new pages
See [docs/README.md](../README.md) for documentation build instructions.