Unverified commit 91a8d07f, authored by dagil-nvidia, committed by GitHub

docs: README updates for 0.8.0 (#5395)


Signed-off-by: Dan Gil <dagil@nvidia.com>
Signed-off-by: dagil-nvidia <dagil@nvidia.com>
parent bbb79afb
...@@ -43,11 +43,13 @@ High-throughput, low-latency inference framework designed for serving generative
## Latest News
- [12/05] [Moonshot AI's Kimi K2 achieves 10x inference speedup with Dynamo on GB200](https://quantumzeitgeist.com/kimi-k2-nvidia-ai-ai-breakthrough/)
- [12/02] [Mistral AI runs Mistral Large 3 with 10x faster inference using Dynamo](https://www.marktechpost.com/2025/12/02/nvidia-and-mistral-ai-bring-10x-faster-inference-for-the-mistral-3-family-on-gb200-nvl72-gpu-systems/)
- [12/01] [InfoQ: NVIDIA Dynamo simplifies Kubernetes deployment for LLM inference](https://www.infoq.com/news/2025/12/nvidia-dynamo-kubernetes/)
- [11/20] [Dell integrates PowerScale with Dynamo's NIXL for 19x faster TTFT](https://www.dell.com/en-us/dt/corporate/newsroom/announcements/detailpage.press-releases~usa~2025~11~dell-technologies-and-nvidia-advance-enterprise-ai-innovation.htm)
- [11/20] [WEKA partners with NVIDIA on KV cache storage for Dynamo](https://siliconangle.com/2025/11/20/nvidia-weka-kv-cache-solution-ai-inferencing-sc25/)
- [11/13] [Dynamo Office Hours Playlist](https://www.youtube.com/playlist?list=PL5B692fm6--tgryKu94h2Zb7jTFM3Go4X)
- [10/16] [How Baseten achieved 2x faster inference with NVIDIA Dynamo](https://www.baseten.co/blog/how-baseten-achieved-2x-faster-inference-with-nvidia-dynamo/#qwen3-coder-benchmarks-with-kv-routing) - [10/16] [How Baseten achieved 2x faster inference with NVIDIA Dynamo](https://www.baseten.co/blog/how-baseten-achieved-2x-faster-inference-with-nvidia-dynamo/)
- [10/13] [NVIDIA Blackwell Leads on New SemiAnalysis InferenceMax Benchmarks](https://developer.nvidia.com/blog/nvidia-blackwell-leads-on-new-semianalysis-inferencemax-benchmarks/)
- [09/09] [Dynamo + NVIDIA Blackwell Ultra Sets New MLPerf Inference Benchmark Record](https://blogs.nvidia.com/blog/mlperf-inference-blackwell-ultra/)
- [08/05] Deploy `openai/gpt-oss-120b` with disaggregated serving on NVIDIA Blackwell GPUs using Dynamo [➡️ link](./docs/backends/trtllm/gpt-oss.md)
## The Era of Multi-GPU, Multi-Node
...@@ -92,23 +94,6 @@ Backend engines require Python development headers for JIT compilation. Install
sudo apt install python3-dev
```
### Install etcd (optional) and NATS (required)
To coordinate across a data center, Dynamo relies on etcd and NATS. These will be used in production. To run Dynamo locally etcd is optional.
- [etcd](https://etcd.io/) can be run directly as `./etcd`.
- [nats](https://nats.io/) needs jetstream enabled: `nats-server -js`.
To quickly setup etcd & NATS, you can also run:
```bash
# At the root of the repository:
docker compose -f deploy/docker-compose.yml up -d
```
To run locally without etcd, pass `--store-kv file` to both the frontend and workers. The directory used for key-value data can be configured via the `DYN_FILE_KV` environment variable (example: `export DYN_FILE_KV=/data/kv/dynamo`). Defaults to `$TMPDIR/dynamo_store_kv`.
## 2. Select an engine
We publish Python wheels specialized for each of our supported engines: vllm, sglang, and trtllm. The examples that follow use SGLang; continue reading for other engines.
...@@ -143,7 +128,7 @@ Dynamo provides a simple way to spin up a local set of inference components incl
- **Workers** – Set of pre-configured LLM serving engines.
```
# Start an OpenAI compatible HTTP server, a pre-processor (prompt templating and tokenization) and a router. # Start an OpenAI compatible HTTP server with prompt templating, tokenization, and routing.
# Pass the TLS certificate and key paths to use HTTPS instead of HTTP.
# Pass --store-kv to use the filesystem instead of etcd. The workers and frontend must share a disk.
python -m dynamo.frontend --http-port 8000 [--tls-cert-path cert.pem] [--tls-key-path key.pem] [--store-kv file]
...@@ -178,6 +163,23 @@ Rerun with `curl -N` and change `stream` in the request to `true` to get the res
- Check out [Backends](examples/backends) to deploy various workflow configurations (e.g. SGLang with router, vLLM with disaggregated serving, etc.)
- Run some [Examples](examples) to learn about building components in Dynamo and exploring various integrations.
### Service Discovery and Messaging
Dynamo uses TCP for inter-component communication. External services are optional for most deployments:
| Deployment | etcd | NATS | Notes |
|------------|------|------|-------|
| **Kubernetes** | ❌ Not required | ❌ Not required | K8s-native discovery; TCP request plane |
| **Local development** | ❌ Not required | ❌ Not required | Pass `--store-kv file`; TCP request plane |
| **KV-aware routing** | — | ✅ Required | Add NATS for KV event messaging |
For local development, pass `--store-kv file` to both the frontend and workers. For distributed non-Kubernetes deployments or KV-aware routing:
- [etcd](https://etcd.io/) can be run directly as `./etcd`.
- [nats](https://nats.io/) needs JetStream enabled: `nats-server -js`.
To quickly set up both: `docker compose -f deploy/docker-compose.yml up -d`
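As a rough illustration of the table above, the following shell sketch maps a deployment type to the external services it needs. The function name `required_services` and its arguments are hypothetical and not part of Dynamo. Treating a distributed non-Kubernetes deployment as needing etcd is an assumption drawn from the guidance above, not a confirmed rule.

```bash
# Illustrative helper (not part of Dynamo): list the external services a
# deployment needs, following the Service Discovery and Messaging table.
#   $1: deployment type: kubernetes | local | distributed
#   $2: "yes" if KV-aware routing is enabled (default: no)
required_services() {
  local deployment="$1" kv_routing="${2:-no}" services=""
  case "$deployment" in
    kubernetes|local) ;;              # no external services by default
    distributed) services="etcd" ;;   # assumption: etcd is used outside Kubernetes
    *) echo "unknown deployment: $deployment" >&2; return 1 ;;
  esac
  if [ "$kv_routing" = "yes" ]; then
    # KV-aware routing adds NATS for KV event messaging.
    services="${services:+$services }nats"
  fi
  echo "${services:-none}"
}

required_services local            # prints: none
required_services kubernetes yes   # prints: nats
required_services distributed yes  # prints: etcd nats
```

A helper like this could gate whether a setup script runs the `docker compose` command at all, so that local development with `--store-kv file` starts nothing extra.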
### Benchmarking Dynamo
Dynamo provides comprehensive benchmarking tools to evaluate and optimize your deployments:
...@@ -198,7 +200,7 @@ This writes the current frontend spec to `docs/frontends/openapi.json` at the re
# Engines
Dynamo is designed to be inference engine agnostic. To use any engine with Dynamo, NATS and etcd need to be installed, along with a Dynamo frontend (`python -m dynamo.frontend [--interactive]`). Dynamo is designed to be inference engine agnostic. To use any engine with Dynamo, start a Dynamo frontend (`python -m dynamo.frontend`). For local development, pass `--store-kv file` to avoid etcd dependency. NATS is optional and only required for KV-aware routing.
## vLLM
...@@ -355,8 +357,18 @@ uv pip install -e .
You should now be able to run `python -m dynamo.frontend`.
Remember that nats and etcd must typically be running (see earlier). For local development, pass `--store-kv file` to avoid external dependencies (see Service Discovery and Messaging section).
Set the environment variable `DYN_LOG` to adjust the logging level; for example, `export DYN_LOG=debug`. It has the same syntax as `RUST_LOG`.
If you use VS Code or Cursor, we have a .devcontainer folder built on [Microsoft's Extension](https://code.visualstudio.com/docs/devcontainers/containers). For instructions, see the [README](.devcontainer/README.md).
<!-- Reference links for Feature Compatibility Matrix -->
[disagg]: docs/design_docs/disagg_serving.md
[kv-routing]: docs/router/kv_cache_routing.md
[planner]: docs/planner/sla_planner.md
[kvbm]: docs/kvbm/kvbm_architecture.md
[mm]: examples/multimodal/
[migration]: docs/fault_tolerance/request_migration.md
[lora]: examples/backends/vllm/deploy/lora/README.md
[tools]: docs/agents/tool-calling.md
...@@ -5,11 +5,8 @@ This document provides a comprehensive compatibility matrix for key Dynamo featu
*Updated for Dynamo v0.8.0*
**Legend:**
* ✅ : Fully Supported / Compatible * ✅ : Supported
* ❌ : Not Supported / Incompatible * 🚧 : Work in Progress / Experimental / Limited
* 🚧 : Work in Progress
* ⚠️ : Limited Support (see notes)
* 🧪 : Experimental
## Quick Comparison
...@@ -20,11 +17,11 @@ This document provides a comprehensive compatibility matrix for key Dynamo featu
| **SLA-Based Planner** | ✅ | ✅ | ✅ | [Planner Doc][planner] | | **SLA-Based Planner** | ✅ | ✅ | ✅ | [Planner Doc][planner] |
| **KV Block Manager** | ✅ | ✅ | 🚧 | [KVBM Doc][kvbm] | | **KV Block Manager** | ✅ | ✅ | 🚧 | [KVBM Doc][kvbm] |
| **Multimodal (Image)** | ✅ | ✅ | ✅ | [Multimodal Doc][mm] | | **Multimodal (Image)** | ✅ | ✅ | ✅ | [Multimodal Doc][mm] |
| **Multimodal (Video)** | ✅ | ❌ | ❌ | [Multimodal Doc][mm] | | **Multimodal (Video)** | ✅ | | | [Multimodal Doc][mm] |
| **Multimodal (Audio)** | 🧪 | ❌ | ❌ | [Multimodal Doc][mm] | | **Multimodal (Audio)** | 🚧 | | | [Multimodal Doc][mm] |
| **Request Migration** | ✅ | ⚠️ | ✅ | [Migration Doc][migration] | | **Request Migration** | ✅ | 🚧 | ✅ | [Migration Doc][migration] |
| **Request Cancellation** | ✅ | ✅ | ⚠️ | Backend READMEs | | **Request Cancellation** | ✅ | ✅ | 🚧 | Backend READMEs |
| **LoRA** | ✅ | ❌ | ❌ | [K8s Guide][lora] | | **LoRA** | ✅ | | | [K8s Guide][lora] |
| **Tool Calling** | ✅ | ✅ | ✅ | [Tool Calling Doc][tools] | | **Tool Calling** | ✅ | ✅ | ✅ | [Tool Calling Doc][tools] |
| **Speculative Decoding** | ✅ | ✅ | 🚧 | Backend READMEs | | **Speculative Decoding** | ✅ | ✅ | 🚧 | Backend READMEs |
...@@ -40,7 +37,7 @@ vLLM offers the broadest feature coverage in Dynamo, with full support for disag
| **KV-Aware Routing** | ✅ | — | | | | | | | | |
| **SLA-Based Planner** | ✅ | ✅ | — | | | | | | | |
| **KV Block Manager** | ✅ | ✅ | ✅ | — | | | | | | |
| **Multimodal** | ✅ | <sup>1</sup> | — | ✅ | — | | | | | |
| **Request Migration** | ✅ | ✅ | ✅ | ✅ | ✅ | — | | | | |
| **Request Cancellation** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | — | | | |
| **LoRA** | ✅ | ✅<sup>2</sup> | — | ✅ | — | ✅ | ✅ | — | | |
...@@ -54,55 +51,55 @@ vLLM offers the broadest feature coverage in Dynamo, with full support for disag
> 4. **Video Support**: vLLM supports video input with frame sampling. ([Source][mm-vllm])
> 5. **Speculative Decoding**: Eagle3 support documented. ([Source][vllm-spec])
## 2. TensorRT-LLM Backend ## 2. SGLang Backend
TensorRT-LLM delivers maximum inference performance and optimization, with full KVBM integration and robust disaggregated serving support. SGLang is optimized for high-throughput serving with fast primitives, providing robust support for disaggregated serving, KV-aware routing, and request migration.
*Source: [docs/backends/trtllm/README.md][trtllm-readme]* *Source: [docs/backends/sglang/README.md][sglang-readme]*
| Feature | Disaggregated Serving | KV-Aware Routing | SLA-Based Planner | KV Block Manager | Multimodal | Request Migration | Request Cancellation | LoRA | Tool Calling | Speculative Decoding | | Feature | Disaggregated Serving | KV-Aware Routing | SLA-Based Planner | KV Block Manager | Multimodal | Request Migration | Request Cancellation | LoRA | Tool Calling | Speculative Decoding |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Disaggregated Serving** | — | | | | | | | | | | | **Disaggregated Serving** | — | | | | | | | | | |
| **KV-Aware Routing** | ✅ | — | | | | | | | | | | **KV-Aware Routing** | ✅ | — | | | | | | | | |
| **SLA-Based Planner** | ✅ | ✅ | — | | | | | | | | | **SLA-Based Planner** | ✅ | ✅ | — | | | | | | | |
| **KV Block Manager** | | | | — | | | | | | | | **KV Block Manager** | 🚧 | 🚧 | 🚧 | — | | | | | | |
| **Multimodal** | ✅<sup>1</sup> | <sup>2</sup> | — | | — | | | | | | | **Multimodal** | ✅<sup>2</sup> | <sup>1</sup> | — | 🚧 | — | | | | | |
| **Request Migration** | ⚠️<sup>3</sup> | ✅ | ✅ | ✅ | ⚠️ | — | | | | | | **Request Migration** | ✅ | ✅ | ✅ | 🚧 | ✅ | — | | | | |
| **Request Cancellation** | | ✅ | ✅ | | | ✅ | — | | | | | **Request Cancellation** | 🚧<sup>3</sup> | ✅ | ✅ | 🚧 | 🚧 | ✅ | — | | | |
| **LoRA** | ❌ | ❌ | | | ❌ | ❌ | ❌ | — | | | | **LoRA** | | | | 🚧 | | | | — | | |
| **Tool Calling** | ✅ | ✅ | ✅ | | ✅ | ✅ | ✅ | | — | | | **Tool Calling** | ✅ | ✅ | ✅ | 🚧 | ✅ | ✅ | ✅ | | — | |
| **Speculative Decoding** | | | — | | — | | | | | — | | **Speculative Decoding** | 🚧 | 🚧 | — | 🚧 | — | 🚧 | | | 🚧 | — |
> **Notes:** > **Notes:**
> 1. **Multimodal Disaggregation**: Fully supports **EP/D** (Traditional) pattern. **E/P/D** (Full Disaggregation) is WIP and currently supports pre-computed embeddings only. ([Source][mm-trtllm]) > 1. **Multimodal + KV-Aware Routing**: Not supported. ([Source][kv-routing])
> 2. **Multimodal + KV-Aware Routing**: Not supported. The KV router currently tracks token-based blocks only. ([Source][kv-routing]) > 2. **Multimodal Patterns**: Supports **E/PD** and **E/P/D** only (requires separate vision encoder). Does **not** support simple Aggregated (EPD) or Traditional Disagg (EP/D). ([Source][mm-sglang])
> 3. **Request Migration**: Supported on **Decode/Aggregated** workers only. **Prefill** workers do not support migration. ([Source][trtllm-readme]) > 3. **Request Cancellation**: Cancellation during the remote prefill phase is not supported in disaggregated mode. ([Source][sglang-readme])
> 4. **Speculative Decoding**: Llama 4 + Eagle support documented. ([Source][trtllm-eagle]) > 4. **Speculative Decoding**: Code hooks exist (`spec_decode_stats` in publisher), but no examples or documentation yet.
## 3. SGLang Backend ## 3. TensorRT-LLM Backend
SGLang is optimized for high-throughput serving with fast primitives, providing robust support for disaggregated serving, KV-aware routing, and request migration. TensorRT-LLM delivers maximum inference performance and optimization, with full KVBM integration and robust disaggregated serving support.
*Source: [docs/backends/sglang/README.md][sglang-readme]* *Source: [docs/backends/trtllm/README.md][trtllm-readme]*
| Feature | Disaggregated Serving | KV-Aware Routing | SLA-Based Planner | KV Block Manager | Multimodal | Request Migration | Request Cancellation | LoRA | Tool Calling | Speculative Decoding | | Feature | Disaggregated Serving | KV-Aware Routing | SLA-Based Planner | KV Block Manager | Multimodal | Request Migration | Request Cancellation | LoRA | Tool Calling | Speculative Decoding |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Disaggregated Serving** | — | | | | | | | | | | | **Disaggregated Serving** | — | | | | | | | | | |
| **KV-Aware Routing** | ✅ | — | | | | | | | | | | **KV-Aware Routing** | ✅ | — | | | | | | | | |
| **SLA-Based Planner** | ✅ | ✅ | — | | | | | | | | | **SLA-Based Planner** | ✅ | ✅ | — | | | | | | | |
| **KV Block Manager** | 🚧 | 🚧 | 🚧 | — | | | | | | | | **KV Block Manager** | | | | — | | | | | | |
| **Multimodal** | ✅<sup>2</sup> | <sup>1</sup> | — | 🚧 | — | | | | | | | **Multimodal** | ✅<sup>1</sup> | <sup>2</sup> | — | | — | | | | | |
| **Request Migration** | ✅ | ✅ | ✅ | 🚧 | ✅ | — | | | | | | **Request Migration** | 🚧<sup>3</sup> | ✅ | ✅ | ✅ | 🚧 | — | | | | |
| **Request Cancellation** | ⚠️<sup>3</sup> | ✅ | ✅ | 🚧 | ⚠️ | ✅ | — | | | | | **Request Cancellation** | | ✅ | ✅ | | | ✅ | — | | | |
| **LoRA** | ❌ | ❌ | ❌ | 🚧 | ❌ | ❌ | ❌ | — | | | | **LoRA** | | | | | | | | — | | |
| **Tool Calling** | ✅ | ✅ | ✅ | 🚧 | ✅ | ✅ | ✅ | | — | | | **Tool Calling** | ✅ | ✅ | ✅ | | ✅ | ✅ | ✅ | | — | |
| **Speculative Decoding** | 🚧 | 🚧 | — | 🚧 | — | 🚧 | | | 🚧 | — | | **Speculative Decoding** | | | — | | — | | | | | — |
> **Notes:** > **Notes:**
> 1. **Multimodal + KV-Aware Routing**: Not supported. ([Source][kv-routing]) > 1. **Multimodal Disaggregation**: Fully supports **EP/D** (Traditional) pattern. **E/P/D** (Full Disaggregation) is WIP and currently supports pre-computed embeddings only. ([Source][mm-trtllm])
> 2. **Multimodal Patterns**: Supports **E/PD** and **E/P/D** only (requires separate vision encoder). Does **not** support simple Aggregated (EPD) or Traditional Disagg (EP/D). ([Source][mm-sglang]) > 2. **Multimodal + KV-Aware Routing**: Not supported. The KV router currently tracks token-based blocks only. ([Source][kv-routing])
> 3. **Request Cancellation**: Cancellation during the remote prefill phase is not supported in disaggregated mode. ([Source][sglang-readme]) > 3. **Request Migration**: Supported on **Decode/Aggregated** workers only. **Prefill** workers do not support migration. ([Source][trtllm-readme])
> 4. **Speculative Decoding**: Code hooks exist (`spec_decode_stats` in publisher), but no examples or documentation yet. > 4. **Speculative Decoding**: Llama 4 + Eagle support documented. ([Source][trtllm-eagle])
---
...@@ -110,8 +107,8 @@ SGLang is optimized for high-throughput serving with fast primitives, providing
<!-- Backend READMEs -->
[vllm-readme]: docs/backends/vllm/README.md
[trtllm-readme]: docs/backends/trtllm/README.md
[sglang-readme]: docs/backends/sglang/README.md
[trtllm-readme]: docs/backends/trtllm/README.md
<!-- Design Docs -->
[disagg]: docs/design_docs/disagg_serving.md
......