docs: README updates for 0.8.0 (#5395)

Signed-off-by: Dan Gil <dagil@nvidia.com>
Signed-off-by: dagil-nvidia <dagil@nvidia.com>

Commit 91a8d07f (unverified), authored Jan 13, 2026 by dagil-nvidia; committed by GitHub Jan 13, 2026

Showing with 73 additions and 64 deletions

README.md +36 -24

feature-matrix.md +37 -40

--- a/README.md
+++ b/README.md
@@ -43,11 +43,13 @@ High-throughput, low-latency inference framework designed for serving generative
 ## Latest News
+- [12/05] [Moonshot AI's Kimi K2 achieves 10x inference speedup with Dynamo on GB200](https://quantumzeitgeist.com/kimi-k2-nvidia-ai-ai-breakthrough/)
+- [12/02] [Mistral AI runs Mistral Large 3 with 10x faster inference using Dynamo](https://www.marktechpost.com/2025/12/02/nvidia-and-mistral-ai-bring-10x-faster-inference-for-the-mistral-3-family-on-gb200-nvl72-gpu-systems/)
+- [12/01] [InfoQ: NVIDIA Dynamo simplifies Kubernetes deployment for LLM inference](https://www.infoq.com/news/2025/12/nvidia-dynamo-kubernetes/)
+- [11/20] [Dell integrates PowerScale with Dynamo's NIXL for 19x faster TTFT](https://www.dell.com/en-us/dt/corporate/newsroom/announcements/detailpage.press-releases~usa~2025~11~dell-technologies-and-nvidia-advance-enterprise-ai-innovation.htm)
+- [11/20] [WEKA partners with NVIDIA on KV cache storage for Dynamo](https://siliconangle.com/2025/11/20/nvidia-weka-kv-cache-solution-ai-inferencing-sc25/)
 - [11/13] [Dynamo Office Hours Playlist](https://www.youtube.com/playlist?list=PL5B692fm6--tgryKu94h2Zb7jTFM3Go4X)
-- [10/16] [How Baseten achieved 2x faster inference with NVIDIA Dynamo](https://www.baseten.co/blog/how-baseten-achieved-2x-faster-inference-with-nvidia-dynamo/#qwen3-coder-benchmarks-with-kv-routing)
+- [10/16] [How Baseten achieved 2x faster inference with NVIDIA Dynamo](https://www.baseten.co/blog/how-baseten-achieved-2x-faster-inference-with-nvidia-dynamo/)
-- [10/13] [NVIDIA Blackwell Leads on New SemiAnalysis InferenceMax Benchmarks](https://developer.nvidia.com/blog/nvidia-blackwell-leads-on-new-semianalysis-inferencemax-benchmarks/)
-- [09/09] [Dynamo + NVIDIA Blackwell Ultra Sets New MLPerf Inference Benchmark Record](https://blogs.nvidia.com/blog/mlperf-inference-blackwell-ultra/)
-- [08/05] Deploy `openai/gpt-oss-120b` with disaggregated serving on NVIDIA Blackwell GPUs using Dynamo [➡️ link](./docs/backends/trtllm/gpt-oss.md)
 ## The Era of Multi-GPU, Multi-Node
@@ -92,23 +94,6 @@ Backend engines require Python development headers for JIT compilation. Install
 sudo apt install python3-dev
 ```
-### Install etcd (optional) and NATS (required)
-To coordinate across a data center, Dynamo relies on etcd and NATS. These will be used in production. To run Dynamo locally etcd is optional.
-- [etcd](https://etcd.io/) can be run directly as `./etcd`.
-- [nats](https://nats.io/) needs jetstream enabled: `nats-server -js`.
-To quickly setup etcd & NATS, you can also run:
-```bash
-# At the root of the repository:
-docker compose -f deploy/docker-compose.yml up -d
-```
-To run locally without etcd, pass `--store-kv file` to both the frontend and workers. The directory used for key-value data can be configured via the `DYN_FILE_KV` environment variable (example: `export DYN_FILE_KV=/data/kv/dynamo`). Defaults to `$TMPDIR/dynamo_store_kv`.
 ## 2. Select an engine
 We publish Python wheels specialized for each of our supported engines: vllm, sglang, and trtllm. The examples that follow use SGLang; continue reading for other engines.
@@ -143,7 +128,7 @@ Dynamo provides a simple way to spin up a local set of inference components incl
 - **Workers** – Set of pre-configured LLM serving engines.
 ```
-# Start an OpenAI compatible HTTP server, a pre-processor (prompt templating and tokenization) and a router.
+# Start an OpenAI compatible HTTP server with prompt templating, tokenization, and routing.
 # Pass the TLS certificate and key paths to use HTTPS instead of HTTP.
 # Pass --store-kv to use the filesystem instead of etcd. The workers and frontend must share a disk.
 python -m dynamo.frontend --http-port 8000 [--tls-cert-path cert.pem] [--tls-key-path key.pem] [--store-kv file]
@@ -178,6 +163,23 @@ Rerun with `curl -N` and change `stream` in the request to `true` to get the res
 - Check out [Backends](examples/backends) to deploy various workflow configurations (e.g. SGLang with router, vLLM with disaggregated serving, etc.)
 - Run some [Examples](examples) to learn about building components in Dynamo and exploring various integrations.
+### Service Discovery and Messaging
+Dynamo uses TCP for inter-component communication. External services are optional for most deployments:
+| Deployment | etcd | NATS | Notes |
+|------------|------|------|-------|
+| **Kubernetes** | ❌ Not required | ❌ Not required | K8s-native discovery; TCP request plane |
+| **Local development** | ❌ Not required | ❌ Not required | Pass `--store-kv file`; TCP request plane |
+| **KV-aware routing** | — | ✅ Required | Add NATS for KV event messaging |
+For local development, pass `--store-kv file` to both the frontend and workers. For distributed non-Kubernetes deployments or KV-aware routing:
+- [etcd](https://etcd.io/) can be run directly as `./etcd`.
+- [nats](https://nats.io/) needs JetStream enabled: `nats-server -js`.
+To quickly set up both: `docker compose -f deploy/docker-compose.yml up -d`
 ### Benchmarking Dynamo
 Dynamo provides comprehensive benchmarking tools to evaluate and optimize your deployments:
@@ -198,7 +200,7 @@ This writes the current frontend spec to `docs/frontends/openapi.json` at the re
 # Engines
-Dynamo is designed to be inference engine agnostic. To use any engine with Dynamo, NATS and etcd need to be installed, along with a Dynamo frontend (`python -m dynamo.frontend [--interactive]`).
+Dynamo is designed to be inference engine agnostic. To use any engine with Dynamo, start a Dynamo frontend (`python -m dynamo.frontend`). For local development, pass `--store-kv file` to avoid etcd dependency. NATS is optional and only required for KV-aware routing.
 ## vLLM
@@ -355,8 +357,18 @@ uv pip install -e .
 You should now be able to run `python -m dynamo.frontend`.
-Remember that nats and etcd must typically be running (see earlier).
+For local development, pass `--store-kv file` to avoid external dependencies (see Service Discovery and Messaging section).
 Set the environment variable `DYN_LOG` to adjust the logging level; for example, `export DYN_LOG=debug`. It has the same syntax as `RUST_LOG`.
 If you use VS Code or Cursor, we have a .devcontainer folder built on [Microsoft's Extension](https://code.visualstudio.com/docs/devcontainers/containers). See the [ReadMe](.devcontainer/README.md) for instructions.
+<!-- Reference links for Feature Compatibility Matrix -->
+[disagg]: docs/design_docs/disagg_serving.md
+[kv-routing]: docs/router/kv_cache_routing.md
+[planner]: docs/planner/sla_planner.md
+[kvbm]: docs/kvbm/kvbm_architecture.md
+[mm]: examples/multimodal/
+[migration]: docs/fault_tolerance/request_migration.md
+[lora]: examples/backends/vllm/deploy/lora/README.md
+[tools]: docs/agents/tool-calling.md
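Reviewer sketch (not part of the commit): the new "Service Discovery and Messaging" table and the frontend invocation shown in the README hunks can be summarized in a few lines of Python. This is a minimal illustration under stated assumptions: the dictionary keys and the helper name `frontend_command` are hypothetical, and only the flags that appear in the README text (`--http-port`, `--tls-cert-path`, `--tls-key-path`, `--store-kv file`) are used. The helper only builds an argument list; it does not start anything.

```python
# Service requirements transcribed from the README table in this diff.
# True = required, False = not required, None = the table gives no value.
# The dictionary keys are hypothetical labels chosen for this sketch.
SERVICE_REQUIREMENTS = {
    "kubernetes": {"etcd": False, "nats": False},
    "local-development": {"etcd": False, "nats": False},
    "kv-aware-routing": {"etcd": None, "nats": True},
}


def frontend_command(http_port=8000, store_kv_file=False, tls_cert=None, tls_key=None):
    """Build the argv for `python -m dynamo.frontend` from flags shown in the README."""
    argv = ["python", "-m", "dynamo.frontend", "--http-port", str(http_port)]
    # HTTPS is only configured when both the certificate and the key are given.
    if tls_cert and tls_key:
        argv += ["--tls-cert-path", tls_cert, "--tls-key-path", tls_key]
    # Local development: use the filesystem store instead of etcd.
    if store_kv_file:
        argv += ["--store-kv", "file"]
    return argv
```

For example, `frontend_command(store_kv_file=True)` yields the local-development invocation from the table, and `SERVICE_REQUIREMENTS["kv-aware-routing"]["nats"]` reflects that NATS is the one service the README still marks as required for that mode.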
--- a/feature-matrix.md
+++ b/feature-matrix.md
@@ -5,11 +5,8 @@ This document provides a comprehensive compatibility matrix for key Dynamo featu
 *Updated for Dynamo v0.8.0*
 **Legend:**
-*   ✅ : Fully Supported / Compatible
+*   ✅ : Supported
-*   ❌ : Not Supported / Incompatible
+*   🚧 : Work in Progress / Experimental / Limited
-*   🚧 : Work in Progress
-*   ⚠️ : Limited Support (see notes)
-*   🧪 : Experimental
 ## Quick Comparison
@@ -20,11 +17,11 @@ This document provides a comprehensive compatibility matrix for key Dynamo featu
 | **SLA-Based Planner** | ✅ | ✅ | ✅ | [Planner Doc][planner] |
 | **KV Block Manager** | ✅ | ✅ | 🚧 | [KVBM Doc][kvbm] |
 | **Multimodal (Image)** | ✅ | ✅ | ✅ | [Multimodal Doc][mm] |
-| **Multimodal (Video)** | ✅ | ❌ | ❌ | [Multimodal Doc][mm] |
+| **Multimodal (Video)** | ✅ | | | [Multimodal Doc][mm] |
-| **Multimodal (Audio)** | 🧪 | ❌ | ❌ | [Multimodal Doc][mm] |
+| **Multimodal (Audio)** | 🚧 | | | [Multimodal Doc][mm] |
-| **Request Migration** | ✅ | ⚠️ | ✅ | [Migration Doc][migration] |
+| **Request Migration** | ✅ | 🚧 | ✅ | [Migration Doc][migration] |
-| **Request Cancellation** | ✅ | ✅ | ⚠️ | Backend READMEs |
+| **Request Cancellation** | ✅ | ✅ | 🚧 | Backend READMEs |
-| **LoRA** | ✅ | ❌ | ❌ | [K8s Guide][lora] |
+| **LoRA** | ✅ | | | [K8s Guide][lora] |
 | **Tool Calling** | ✅ | ✅ | ✅ | [Tool Calling Doc][tools] |
 | **Speculative Decoding** | ✅ | ✅ | 🚧 | Backend READMEs |
@@ -40,7 +37,7 @@ vLLM offers the broadest feature coverage in Dynamo, with full support for disag
 | **KV-Aware Routing** | ✅ | — | | | | | | | | |
 | **SLA-Based Planner** | ✅ | ✅ | — | | | | | | | |
 | **KV Block Manager** | ✅ | ✅ | ✅ | — | | | | | | |
-| **Multimodal** | ✅ | ❌<sup>1</sup> | — | ✅ | — | | | | | |
+| **Multimodal** | ✅ | <sup>1</sup> | — | ✅ | — | | | | | |
 | **Request Migration** | ✅ | ✅ | ✅ | ✅ | ✅ | — | | | | |
 | **Request Cancellation** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | — | | | |
 | **LoRA** | ✅ | ✅<sup>2</sup> | — | ✅ | — | ✅ | ✅ | — | | |
@@ -54,55 +51,55 @@ vLLM offers the broadest feature coverage in Dynamo, with full support for disag
 > 4. **Video Support**: vLLM supports video input with frame sampling. ([Source][mm-vllm])
 > 5. **Speculative Decoding**: Eagle3 support documented. ([Source][vllm-spec])
-## 2. TensorRT-LLM Backend
+## 2. SGLang Backend
-TensorRT-LLM delivers maximum inference performance and optimization, with full KVBM integration and robust disaggregated serving support.
+SGLang is optimized for high-throughput serving with fast primitives, providing robust support for disaggregated serving, KV-aware routing, and request migration.
-*Source: [docs/backends/trtllm/README.md][trtllm-readme]*
+*Source: [docs/backends/sglang/README.md][sglang-readme]*
 | Feature | Disaggregated Serving | KV-Aware Routing | SLA-Based Planner | KV Block Manager | Multimodal | Request Migration | Request Cancellation | LoRA | Tool Calling | Speculative Decoding |
 | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
 | **Disaggregated Serving** | — | | | | | | | | | |
 | **KV-Aware Routing** | ✅ | — | | | | | | | | |
 | **SLA-Based Planner** | ✅ | ✅ | — | | | | | | | |
-| **KV Block Manager** | ✅ | ✅ | ✅ | — | | | | | | |
+| **KV Block Manager** | 🚧 | 🚧 | 🚧 | — | | | | | | |
-| **Multimodal** | ✅<sup>1</sup> | ❌<sup>2</sup> | — | ✅ | — | | | | | |
+| **Multimodal** | ✅<sup>2</sup> | <sup>1</sup> | — | 🚧 | — | | | | | |
-| **Request Migration** | ⚠️<sup>3</sup> | ✅ | ✅ | ✅ | ⚠️ | — | | | | |
+| **Request Migration** | ✅ | ✅ | ✅ | 🚧 | ✅ | — | | | | |
-| **Request Cancellation** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | — | | | |
+| **Request Cancellation** | 🚧<sup>3</sup> | ✅ | ✅ | 🚧 | 🚧 | ✅ | — | | | |
-| **LoRA** | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | — | | |
+| **LoRA** | | | | 🚧 | | | | — | | |
-| **Tool Calling** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | — | |
+| **Tool Calling** | ✅ | ✅ | ✅ | 🚧 | ✅ | ✅ | ✅ | | — | |
-| **Speculative Decoding** | ✅ | ✅ | — | ✅ | — | ✅ | ✅ | ❌ | ✅ | — |
+| **Speculative Decoding** | 🚧 | 🚧 | — | 🚧 | — | 🚧 | — | | 🚧 | — |
 > **Notes:**
-> 1. **Multimodal Disaggregation**: Fully supports **EP/D** (Traditional) pattern. **E/P/D** (Full Disaggregation) is WIP and currently supports pre-computed embeddings only. ([Source][mm-trtllm])
+> 1. **Multimodal + KV-Aware Routing**: Not supported. ([Source][kv-routing])
-> 2. **Multimodal + KV-Aware Routing**: Not supported. The KV router currently tracks token-based blocks only. ([Source][kv-routing])
+> 2. **Multimodal Patterns**: Supports **E/PD** and **E/P/D** only (requires separate vision encoder). Does **not** support simple Aggregated (EPD) or Traditional Disagg (EP/D). ([Source][mm-sglang])
-> 3. **Request Migration**: Supported on **Decode/Aggregated** workers only. **Prefill** workers do not support migration. ([Source][trtllm-readme])
+> 3. **Request Cancellation**: Cancellation during the remote prefill phase is not supported in disaggregated mode. ([Source][sglang-readme])
-> 4. **Speculative Decoding**: Llama 4 + Eagle support documented. ([Source][trtllm-eagle])
+> 4. **Speculative Decoding**: Code hooks exist (`spec_decode_stats` in publisher), but no examples or documentation yet.
-## 3. SGLang Backend
+## 3. TensorRT-LLM Backend
-SGLang is optimized for high-throughput serving with fast primitives, providing robust support for disaggregated serving, KV-aware routing, and request migration.
+TensorRT-LLM delivers maximum inference performance and optimization, with full KVBM integration and robust disaggregated serving support.
-*Source: [docs/backends/sglang/README.md][sglang-readme]*
+*Source: [docs/backends/trtllm/README.md][trtllm-readme]*
 | Feature | Disaggregated Serving | KV-Aware Routing | SLA-Based Planner | KV Block Manager | Multimodal | Request Migration | Request Cancellation | LoRA | Tool Calling | Speculative Decoding |
 | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
 | **Disaggregated Serving** | — | | | | | | | | | |
 | **KV-Aware Routing** | ✅ | — | | | | | | | | |
 | **SLA-Based Planner** | ✅ | ✅ | — | | | | | | | |
-| **KV Block Manager** | 🚧 | 🚧 | 🚧 | — | | | | | | |
+| **KV Block Manager** | ✅ | ✅ | ✅ | — | | | | | | |
-| **Multimodal** | ✅<sup>2</sup> | ❌<sup>1</sup> | — | 🚧 | — | | | | | |
+| **Multimodal** | ✅<sup>1</sup> | <sup>2</sup> | — | ✅ | — | | | | | |
-| **Request Migration** | ✅ | ✅ | ✅ | 🚧 | ✅ | — | | | | |
+| **Request Migration** | 🚧<sup>3</sup> | ✅ | ✅ | ✅ | 🚧 | — | | | | |
-| **Request Cancellation** | ⚠️<sup>3</sup> | ✅ | ✅ | 🚧 | ⚠️ | ✅ | — | | | |
+| **Request Cancellation** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | — | | | |
-| **LoRA** | ❌ | ❌ | ❌ | 🚧 | ❌ | ❌ | ❌ | — | | |
+| **LoRA** | | | | | | | | — | | |
-| **Tool Calling** | ✅ | ✅ | ✅ | 🚧 | ✅ | ✅ | ✅ | ❌ | — | |
+| **Tool Calling** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | — | |
-| **Speculative Decoding** | 🚧 | 🚧 | — | 🚧 | — | 🚧 | — | ❌ | 🚧 | — |
+| **Speculative Decoding** | ✅ | ✅ | — | ✅ | — | ✅ | ✅ | | ✅ | — |
 > **Notes:**
-> 1. **Multimodal + KV-Aware Routing**: Not supported. ([Source][kv-routing])
+> 1. **Multimodal Disaggregation**: Fully supports **EP/D** (Traditional) pattern. **E/P/D** (Full Disaggregation) is WIP and currently supports pre-computed embeddings only. ([Source][mm-trtllm])
-> 2. **Multimodal Patterns**: Supports **E/PD** and **E/P/D** only (requires separate vision encoder). Does **not** support simple Aggregated (EPD) or Traditional Disagg (EP/D). ([Source][mm-sglang])
+> 2. **Multimodal + KV-Aware Routing**: Not supported. The KV router currently tracks token-based blocks only. ([Source][kv-routing])
-> 3. **Request Cancellation**: Cancellation during the remote prefill phase is not supported in disaggregated mode. ([Source][sglang-readme])
+> 3. **Request Migration**: Supported on **Decode/Aggregated** workers only. **Prefill** workers do not support migration. ([Source][trtllm-readme])
-> 4. **Speculative Decoding**: Code hooks exist (`spec_decode_stats` in publisher), but no examples or documentation yet.
+> 4. **Speculative Decoding**: Llama 4 + Eagle support documented. ([Source][trtllm-eagle])
 ---
@@ -110,8 +107,8 @@ SGLang is optimized for high-throughput serving with fast primitives, providing
 <!-- Backend READMEs -->
 [vllm-readme]: docs/backends/vllm/README.md
-[trtllm-readme]: docs/backends/trtllm/README.md
 [sglang-readme]: docs/backends/sglang/README.md
+[trtllm-readme]: docs/backends/trtllm/README.md
 <!-- Design Docs -->
 [disagg]: docs/design_docs/disagg_serving.md
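Reviewer sketch (not part of the commit): the legend change in `feature-matrix.md` appears to follow a mechanical rule. The "not supported" mark is dropped so the cell is left blank, and the "limited" and "experimental" marks are folded into the "work in progress" mark. The Python below is one way to express that rule for a single table row. The function names are hypothetical, and the sketch assumes each row starts and ends with a pipe, as the rows in this diff do.

```python
CHECK = "\u2705"      # supported (kept as-is)
WIP = "\U0001F6A7"    # new legend: Work in Progress / Experimental / Limited
CROSS = "\u274C"      # old legend: not supported (now a blank cell)
WARN = "\u26A0"       # old legend: limited support
VS16 = "\uFE0F"       # emoji variation selector that may follow WARN
FLASK = "\U0001F9EA"  # old legend: experimental


def normalize_cell(cell):
    """Apply the consolidated legend to one table cell."""
    cell = cell.replace(WARN + VS16, WIP).replace(WARN, WIP)
    cell = cell.replace(FLASK, WIP)
    cell = cell.replace(CROSS, "")
    return cell.strip()


def normalize_row(row):
    """Rewrite one Markdown table row; assumes it starts and ends with '|'."""
    inner = row.strip()[1:-1].split("|")
    cells = [normalize_cell(c) for c in inner]
    # Empty cells are rendered as a single space between pipes, as in the diff.
    return "|" + "|".join(f" {c} " if c else " " for c in cells) + "|"
```

Applied to the old LoRA row of the Quick Comparison table, `normalize_row` produces the new row with two blank cells. Footnote markers such as `<sup>1</sup>` survive because only the legend symbols are touched.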