..
    SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
    SPDX-License-Identifier: Apache-2.0

    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.

Welcome to NVIDIA Dynamo
========================

NVIDIA Dynamo is a high-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.

Dive in: Examples
-----------------

.. grid:: 1 2 2 2
    :gutter: 3
    :margin: 0
    :padding: 3 4 0 0

    .. grid-item-card:: :doc:`Hello World </examples/hello_world>`
        :link: /examples/hello_world
        :link-type: doc

        Demonstrates the basic concepts of Dynamo by creating a simple multi-service pipeline.

    .. grid-item-card:: :doc:`LLM Deployment </examples/llm_deployment>`
        :link: /examples/llm_deployment
        :link-type: doc

        Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations.

    .. grid-item-card:: :doc:`Multinode </examples/multinode>`
        :link: /examples/multinode
        :link-type: doc

        Demonstrates deployment for disaggregated serving on 3 nodes using ``nvidia/Llama-3.1-405B-Instruct-FP8``.

    .. grid-item-card:: :doc:`TensorRT-LLM </examples/trtllm>`
        :link: /examples/trtllm
        :link-type: doc

        Presents TensorRT-LLM examples and reference implementations for deploying Large Language Models (LLMs) in various configurations.

Overview
--------

The NVIDIA Dynamo Platform is a high-performance, low-latency inference platform designed to serve all AI models across any framework, architecture, or deployment scale.
Dynamo is inference-engine agnostic, supporting TRT-LLM, vLLM, SGLang, and others, and captures LLM-specific capabilities such as:

* **Disaggregated prefill & decode inference** - Maximizes GPU throughput and facilitates trade-offs between throughput and latency.
* **Dynamic GPU scheduling** - Optimizes performance based on fluctuating demand.
* **LLM-aware request routing** - Eliminates unnecessary KV cache re-computation.
* **Accelerated data transfer** - Reduces inference response time using NIXL.
* **KV cache offloading** - Leverages multiple memory hierarchies for higher system throughput.

Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, OSS-first (Open Source Software first) development approach.

.. toctree::
    :hidden:

    Welcome to Dynamo
    Support Matrix
    Getting Started

.. toctree::
    :hidden:
    :caption: Architecture & Features

    High Level Architecture
    Distributed Runtime
    Disaggregated Serving
    KV Block Manager
    KV Cache Routing
    Planner

.. toctree::
    :hidden:
    :caption: Dynamo Command Line Interface

    CLI Overview
    Running Dynamo (dynamo run)
    Serving Inference Graphs (dynamo serve)
    Building Dynamo (dynamo build)
    Deploying Inference Graphs (dynamo deploy)

.. toctree::
    :hidden:
    :caption: Usage Guides

    Writing Python Workers in Dynamo
    Disaggregation and Performance Tuning
    KV Cache Router Performance Tuning
    Planner Benchmark Example

.. toctree::
    :hidden:
    :caption: Deployment Guides

    Dynamo Cloud Kubernetes Platform
    Deploying Dynamo Inference Graphs to Kubernetes using the Dynamo Cloud Platform
    Manual Helm Deployment
    Minikube Setup Guide

.. toctree::
    :hidden:
    :caption: API

    SDK Reference
    Python API

.. toctree::
    :hidden:
    :caption: Examples

    Hello World Example
    LLM Deployment Examples
    Multinode Examples
    LLM Deployment Examples using TensorRT-LLM