docs: migrate existing docs to fern (#5445)

Signed-off-by: Jont828 <jt572@cornell.edu> Signed-off-by: Neal Vaidya <nealv@nvidia.com> Co-authored-by: Neal Vaidya <nealv@nvidia.com>

Unverified Commit f9050aae authored Jan 26, 2026 by Jonathan Tong Committed by GitHub Jan 26, 2026
20 changed files
--- a/fern/pages/backends/trtllm/README.md
+++ b/fern/pages/backends/trtllm/README.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "LLM Deployment using TensorRT-LLM"
+---
+This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.
+## Use the Latest Release
+We recommend using the latest stable release of dynamo to avoid breaking changes:
+[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
+You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
+```bash
+git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
+```
+---
+## Table of Contents
+- [Feature Support Matrix](#feature-support-matrix)
+- [Quick Start](#tensorrt-llm-quick-start)
+- [Single Node Examples](#single-node-examples)
+- [Advanced Examples](#advanced-examples)
+- [KV Cache Transfer](#kv-cache-transfer-in-disaggregated-serving)
+- [Client](#client)
+- [Benchmarking](#benchmarking)
+- [Multimodal Support](#multimodal-support)
+- [Logits Processing](#logits-processing)
+- [Performance Sweep](#performance-sweep)
+## Feature Support Matrix
+### Core Dynamo Features
+| Feature | TensorRT-LLM | Notes |
+|---------|--------------|-------|
+| [**Disaggregated Serving**](../../design-docs/disagg-serving.md) | ✅ |  |
+| [**Conditional Disaggregation**](../../design-docs/disagg-serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
+| [**KV-Aware Routing**](../../router/kv-cache-routing.md) | ✅ |  |
+| [**SLA-Based Planner**](../../planner/sla-planner.md) | ✅ |  |
+| [**Load Based Planner**](../../planner/load-planner.md) | 🚧 | Planned |
+| [**KVBM**](../../kvbm/kvbm-architecture.md) | ✅ | |
+### Large Scale P/D and WideEP Features
+| Feature            | TensorRT-LLM | Notes                                                           |
+|--------------------|--------------|-----------------------------------------------------------------|
+| **WideEP**         | ✅           |                                                                 |
+| **DP Rank Routing**| ✅           |                                                                 |
+| **GB200 Support**  | ✅           |                                                                 |
+## TensorRT-LLM Quick Start
+Below we provide a guide that lets you run all of the common deployment patterns on a single node.
+### Start NATS and ETCD in the background
+Start using [Docker Compose](https://github.com/ai-dynamo/dynamo/tree/main/deploy/docker-compose.yml)
+```bash
+docker compose -f deploy/docker-compose.yml up -d
+```
+### Build container
+```bash
+# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
+apt-get update && apt-get -y install git git-lfs
+# On an x86 machine:
+./container/build.sh --framework trtllm
+# On an ARM machine:
+./container/build.sh --framework trtllm --platform linux/arm64
+# Build the container with the default experimental TensorRT-LLM commit
+# WARNING: This is for experimental feature testing only.
+# The container should not be used in a production environment.
+./container/build.sh --framework trtllm --tensorrtllm-git-url https://github.com/NVIDIA/TensorRT-LLM.git --tensorrtllm-commit main
+```
+### Run container
+```bash
+./container/run.sh --framework trtllm -it
+```
+## Single Node Examples
+<Warning>
+The shell scripts below launch the components for each configuration. Each script runs `python3 -m dynamo.frontend <args>` to start the ingress and `python3 -m dynamo.trtllm <args>` to start the workers. You can also run each command in its own terminal.
+</Warning>
+For detailed information about the architecture and how KV-aware routing works, see the [KV Cache Routing documentation](../../router/kv-cache-routing.md).
+### Aggregated
+```bash
+cd $DYNAMO_HOME/examples/backends/trtllm
+./launch/agg.sh
+```
+### Aggregated with KV Routing
+```bash
+cd $DYNAMO_HOME/examples/backends/trtllm
+./launch/agg_router.sh
+```
+### Disaggregated
+```bash
+cd $DYNAMO_HOME/examples/backends/trtllm
+./launch/disagg.sh
+```
+### Disaggregated with KV Routing
+<Warning>
+In the disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse.
+</Warning>
+```bash
+cd $DYNAMO_HOME/examples/backends/trtllm
+./launch/disagg_router.sh
+```
+### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1
+```bash
+cd $DYNAMO_HOME/examples/backends/trtllm
+export AGG_ENGINE_ARGS=./engine_configs/deepseek-r1/agg/mtp/mtp_agg.yaml
+export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
+# nvidia/DeepSeek-R1-FP4 is a large model
+export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
+./launch/agg.sh
+```
+Notes:
+- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
+- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
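To see why the acceptance rate matters so much, the following sketch estimates how many tokens a single decoding step yields under a deliberately simplified model. It assumes each predicted token is accepted independently with the same probability, which is not how TensorRT-LLM measures acceptance; the function name and the draft length of 3 are hypothetical and chosen only for illustration.

```python
def expected_tokens_per_step(acceptance_rate: float, num_draft_tokens: int) -> float:
    """Expected tokens emitted per decoding step under an i.i.d. acceptance model.

    Each predicted token is assumed to be accepted independently with
    probability `acceptance_rate`. Verification stops at the first rejection,
    and every step still yields one token from the target model itself.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if num_draft_tokens < 0:
        raise ValueError("num_draft_tokens must be non-negative")
    # 1 + a + a^2 + ... + a^k, where k is the number of predicted tokens
    return sum(acceptance_rate ** i for i in range(num_draft_tokens + 1))


if __name__ == "__main__":
    # Hypothetical draft length; the real value comes from your engine config.
    for rate in (0.5, 0.7, 0.9):
        print(f"acceptance={rate:.1f} -> {expected_tokens_per_step(rate, 3):.2f} tokens/step")
```

Under this model, an inflated acceptance rate (for example from degenerate output produced with `ignore_eos`) directly inflates the apparent tokens per step, which is why benchmark prompts should resemble real traffic.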
+## Advanced Examples
+Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!
+### Multinode Deployment
+For comprehensive instructions on multinode serving, see the [multinode-examples.md](multinode/multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can adapt the process for any supported model by updating the relevant configuration files. See the [Llama4+eagle](llama4-plus-eagle.md) guide to learn how to use these scripts when a single worker fits on a single node.
+### Speculative Decoding
+- **[Llama 4 Maverick Instruct + Eagle Speculative Decoding](llama4-plus-eagle.md)**
+### Kubernetes Deployment
+For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [TensorRT-LLM Kubernetes Deployment Guide](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md).
+### Client
+See the [client](../sglang/README.md#testing-the-deployment) section to learn how to send requests to the deployment.
+NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
+### Benchmarking
+To benchmark your deployment with AIPerf, see this utility script, configuring the
+`model` name and `host` based on your deployment: [perf.sh](https://github.com/ai-dynamo/dynamo/tree/main/benchmarks/llm/perf.sh)
+## KV Cache Transfer in Disaggregated Serving
+Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV cache transfer guide](kv-cache-transfer.md).
+## Request Migration
+You can enable [request migration](../../fault-tolerance/request-migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
+```bash
+# For decode and aggregated workers
+python3 -m dynamo.trtllm ... --migration-limit=3
+```
+<Warning>
+**Prefill workers do not support request migration** and must use `--migration-limit=0` (the default). Prefill workers only process prompts and return KV cache state - they don't maintain long-running generation requests that would benefit from migration.
+</Warning>
+See the [Request Migration Architecture](../../fault-tolerance/request-migration.md) documentation for details on how this works.
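If you generate worker commands from a launcher script, the prefill restriction can be enforced before anything starts. The sketch below is not part of Dynamo: `worker_args` is a hypothetical helper, and it assumes the `--disaggregation-mode` values `prefill` and `decode` used in the gpt-oss guide in this change set. Omitting that flag for aggregated workers is also an assumption.

```python
from typing import List, Optional


def worker_args(mode: Optional[str], migration_limit: int = 0) -> List[str]:
    """Build a dynamo.trtllm argument list (hypothetical helper).

    `mode` is "prefill", "decode", or None for an aggregated worker.
    """
    if migration_limit < 0:
        raise ValueError("migration limit must be non-negative")
    if mode == "prefill" and migration_limit != 0:
        # Prefill workers do not support request migration.
        raise ValueError("prefill workers must use --migration-limit=0")
    args = ["python3", "-m", "dynamo.trtllm"]
    if mode is not None:
        args += ["--disaggregation-mode", mode]
    args.append(f"--migration-limit={migration_limit}")
    return args


if __name__ == "__main__":
    print(" ".join(worker_args("decode", migration_limit=3)))
    print(" ".join(worker_args("prefill")))
```

Failing fast here avoids discovering the misconfiguration only after the workers have loaded the model.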
+## Request Cancellation
+When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
+### Cancellation Support Matrix
+| | Prefill | Decode |
+|-|---------|--------|
+| **Aggregated** | ✅ | ✅ |
+| **Disaggregated** | ✅ | ✅ |
+For more details, see the [Request Cancellation Architecture](../../fault-tolerance/request-cancellation.md) documentation.
+## Client
+See the [client](../sglang/README.md#testing-the-deployment) section to learn how to send requests to the deployment.
+NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
+## Benchmarking
+To benchmark your deployment with AIPerf, see this utility script, configuring the
+`model` name and `host` based on your deployment: [perf.sh](https://github.com/ai-dynamo/dynamo/tree/main/benchmarks/llm/perf.sh)
+## Multimodal Support
+Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../multimodal/trtllm.md).
+## Logits Processing
+Logits processors let you modify the next-token logits at every decoding step (e.g., to apply custom constraints or sampling transforms). Dynamo provides a backend-agnostic interface and an adapter for TensorRT-LLM so you can plug in custom processors.
+### How it works
+- **Interface**: Implement `dynamo.logits_processing.BaseLogitsProcessor` which defines `__call__(input_ids, logits)` and modifies `logits` in-place.
+- **TRT-LLM adapter**: Use `dynamo.trtllm.logits_processing.adapter.create_trtllm_adapters(...)` to convert Dynamo processors into TRT-LLM-compatible processors and assign them to `SamplingParams.logits_processor`.
+- **Examples**: See example processors in `lib/bindings/python/src/dynamo/logits_processing/examples/` ([temperature](https://github.com/ai-dynamo/dynamo/tree/main/lib/bindings/python/src/dynamo/logits_processing/examples/temperature.py), [hello_world](https://github.com/ai-dynamo/dynamo/tree/main/lib/bindings/python/src/dynamo/logits_processing/examples/hello_world.py)).
+### Quick test: HelloWorld processor
+You can enable a test-only processor that forces the model to respond with "Hello world!". This is useful to verify the wiring without modifying your model or engine code.
+```bash
+cd $DYNAMO_HOME/examples/backends/trtllm
+export DYNAMO_ENABLE_TEST_LOGITS_PROCESSOR=1
+./launch/agg.sh
+```
+Notes:
+- When enabled, Dynamo initializes the tokenizer so the HelloWorld processor can map text to token IDs.
+- Expected chat response contains "Hello world".
+### Bring your own processor
+Implement a processor by conforming to `BaseLogitsProcessor` and modify logits in-place. For example, temperature scaling:
+```python
+from typing import Sequence
+import torch
+from dynamo.logits_processing import BaseLogitsProcessor
+class TemperatureProcessor(BaseLogitsProcessor):
+    def __init__(self, temperature: float = 1.0):
+        if temperature <= 0:
+            raise ValueError("Temperature must be positive")
+        self.temperature = temperature
+    def __call__(self, input_ids: Sequence[int], logits: torch.Tensor):
+        if self.temperature == 1.0:
+            return
+        logits.div_(self.temperature)
+```
+Wire it into TRT-LLM by adapting and attaching to `SamplingParams`:
+```python
+from dynamo.trtllm.logits_processing.adapter import create_trtllm_adapters
+from dynamo.logits_processing.examples import TemperatureProcessor
+processors = [TemperatureProcessor(temperature=0.7)]
+sampling_params.logits_processor = create_trtllm_adapters(processors)
+```
+### Current limitations
+- Per-request processing only (batch size must be 1); beam width > 1 is not supported.
+- Processors must modify logits in-place and not return a new tensor.
+- If your processor needs tokenization, ensure the tokenizer is initialized (do not skip tokenizer init).
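The in-place requirement is the easiest of these limitations to violate by accident. The sketch below illustrates it with a processor that masks a set of token IDs. To stay runnable without Dynamo or PyTorch installed, the class is duck-typed rather than derived from `BaseLogitsProcessor`, and it assumes `logits` is a one-dimensional, vocabulary-sized sequence that supports item assignment; both points are assumptions for illustration, and `BanTokensProcessor` is a hypothetical name.

```python
from typing import Iterable, Sequence


class BanTokensProcessor:
    """Masks banned token IDs in place (sketch; a real processor would
    subclass dynamo.logits_processing.BaseLogitsProcessor)."""

    def __init__(self, banned_token_ids: Iterable[int]):
        self.banned = sorted(set(banned_token_ids))

    def __call__(self, input_ids: Sequence[int], logits) -> None:
        # Mutate the caller's buffer; do not build and return a new one.
        for token_id in self.banned:
            if 0 <= token_id < len(logits):
                logits[token_id] = float("-inf")


if __name__ == "__main__":
    logits = [0.1, 0.2, 0.3, 0.4]  # stand-in for a 1-D logits tensor
    BanTokensProcessor([2])(input_ids=[7, 8], logits=logits)
    print(logits)
```

Because the processor returns `None`, any change that is not written into the original `logits` object is lost.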
+## Performance Sweep
+For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/performance_sweeps/README.md). This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.
+## Dynamo KV Block Manager Integration
+Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.
+For setup instructions, see [Running KVBM in TensorRT-LLM](../../kvbm/trtllm-setup.md).
--- a/fern/pages/backends/trtllm/gemma3-sliding-window-attention.md
+++ b/fern/pages/backends/trtllm/gemma3-sliding-window-attention.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Gemma 3 with Variable Sliding Window Attention"
+---
+This guide demonstrates how to deploy google/gemma-3-1b-it with Variable Sliding Window Attention (VSWA) using Dynamo. Since google/gemma-3-1b-it is a small model, each aggregated, decode, or prefill worker only requires one H100 GPU or one GB200 GPU.
+VSWA is a mechanism in which a model’s layers alternate between multiple sliding window sizes. An example of this is Gemma 3, which incorporates both global attention layers and sliding window layers.
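As a rough illustration of the idea, the sketch below lists which key positions a query token can attend to in layers with different windows. The layer pattern and window size are hypothetical and much smaller than anything a real model uses; they are not Gemma 3's actual configuration.

```python
from typing import List, Optional


def visible_positions(query_pos: int, window: Optional[int]) -> List[int]:
    """Causal key positions visible to the token at `query_pos`.

    `window=None` means global attention (every earlier position); otherwise
    only the most recent `window` positions, including the query itself.
    """
    start = 0 if window is None else max(0, query_pos - window + 1)
    return list(range(start, query_pos + 1))


# Hypothetical pattern: two sliding-window layers followed by one global layer.
LAYER_WINDOWS = [4, 4, None]

if __name__ == "__main__":
    for layer, window in enumerate(LAYER_WINDOWS):
        print(f"layer {layer} (window={window}): {visible_positions(6, window)}")
```

Sliding-window layers only ever look at the most recent tokens, so their KV cache needs differ from those of global layers, which is why the engine configs used below are specific to VSWA.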
+<Note>
+- Ensure that required services such as `nats` and `etcd` are running before starting.
+- Request access to `google/gemma-3-1b-it` on Hugging Face and set your `HF_TOKEN` environment variable for authentication.
+- For VSWA, we recommend staying on the Dynamo 0.5.0 release and the TensorRT-LLM Dynamo runtime image `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.0`. The 0.5.1 release bundles TensorRT-LLM v1.1.0rc5, which has a regression that breaks VSWA.
+</Note>
+## Aggregated Serving
+```bash
+cd $DYNAMO_HOME/examples/backends/trtllm
+export MODEL_PATH=google/gemma-3-1b-it
+export SERVED_MODEL_NAME=$MODEL_PATH
+export AGG_ENGINE_ARGS=$DYNAMO_HOME/examples/backends/trtllm/engine_configs/gemma3/vswa_agg.yaml
+./launch/agg.sh
+```
+## Aggregated Serving with KV Routing
+```bash
+cd $DYNAMO_HOME/examples/backends/trtllm
+export MODEL_PATH=google/gemma-3-1b-it
+export SERVED_MODEL_NAME=$MODEL_PATH
+export AGG_ENGINE_ARGS=$DYNAMO_HOME/examples/backends/trtllm/engine_configs/gemma3/vswa_agg.yaml
+./launch/agg_router.sh
+```
+## Disaggregated Serving
+```bash
+cd $DYNAMO_HOME/examples/backends/trtllm
+export MODEL_PATH=google/gemma-3-1b-it
+export SERVED_MODEL_NAME=$MODEL_PATH
+export PREFILL_ENGINE_ARGS=$DYNAMO_HOME/examples/backends/trtllm/engine_configs/gemma3/vswa_prefill.yaml
+export DECODE_ENGINE_ARGS=$DYNAMO_HOME/examples/backends/trtllm/engine_configs/gemma3/vswa_decode.yaml
+./launch/disagg.sh
+```
+## Disaggregated Serving with KV Routing
+```bash
+cd $DYNAMO_HOME/examples/backends/trtllm
+export MODEL_PATH=google/gemma-3-1b-it
+export SERVED_MODEL_NAME=$MODEL_PATH
+export PREFILL_ENGINE_ARGS=$DYNAMO_HOME/examples/backends/trtllm/engine_configs/gemma3/vswa_prefill.yaml
+export DECODE_ENGINE_ARGS=$DYNAMO_HOME/examples/backends/trtllm/engine_configs/gemma3/vswa_decode.yaml
+./launch/disagg_router.sh
+```
--- a/fern/pages/backends/trtllm/gpt-oss.md
+++ b/fern/pages/backends/trtllm/gpt-oss.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Running gpt-oss-120b Disaggregated with TensorRT-LLM"
+---
+Dynamo supports disaggregated serving of gpt-oss-120b with TensorRT-LLM. This guide demonstrates how to deploy gpt-oss-120b using disaggregated prefill/decode serving on a single B200 node with 8 GPUs, running 1 prefill worker on 4 GPUs and 1 decode worker on 4 GPUs.
+## Overview
+This deployment uses disaggregated serving in TensorRT-LLM where:
+- **Prefill Worker**: Processes input prompts efficiently using 4 GPUs with tensor parallelism
+- **Decode Worker**: Generates output tokens using 4 GPUs, optimized for token generation throughput
+- **Frontend**: Provides OpenAI-compatible API endpoint with round-robin routing
+The disaggregated approach optimizes for both low-latency (maximizing tokens per second per user) and high-throughput (maximizing total tokens per GPU per second) use cases by separating the compute-intensive prefill phase from the memory-bound decode phase.
+## Prerequisites
+- 1x NVIDIA B200 node with 8 GPUs (this guide focuses on single-node B200 deployment)
+- CUDA Toolkit 12.8 or later
+- Docker with [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) installed
+- Fast SSD storage for model weights (~240GB required)
+- HuggingFace account and [access token](https://huggingface.co/settings/tokens)
+- [HuggingFace CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli)
+Ensure that the `etcd` and `nats` services are running with the following command:
+```bash
+docker compose -f deploy/docker-compose.yml up -d
+```
+## Instructions
+### 1. Download the Model
+```bash
+export MODEL_PATH=<LOCAL_MODEL_DIRECTORY>
+export HF_TOKEN=<INSERT_TOKEN_HERE>
+pip install -U "huggingface_hub[cli]"
+huggingface-cli download openai/gpt-oss-120b --exclude "original/*" --exclude "metal/*" --local-dir $MODEL_PATH
+```
+### 2. Run the Container
+Set the container image:
+```bash
+export DYNAMO_CONTAINER_IMAGE=nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+```
+Launch the Dynamo TensorRT-LLM container with the necessary configurations:
+```bash
+docker run \
+    --gpus all \
+    -it \
+    --rm \
+    --network host \
+    --volume $MODEL_PATH:/model \
+    --volume $PWD:/workspace \
+    --shm-size=10G \
+    --ulimit memlock=-1 \
+    --ulimit stack=67108864 \
+    --ulimit nofile=65536:65536 \
+    --cap-add CAP_SYS_PTRACE \
+    --ipc host \
+    -e HF_TOKEN=$HF_TOKEN \
+    -e TRTLLM_ENABLE_PDL=1 \
+    -e TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=True \
+    $DYNAMO_CONTAINER_IMAGE
+```
+This command:
+- Automatically removes the container when stopped (`--rm`)
+- Allows the container to use the host's IPC resources for optimal performance (`--ipc host`)
+- Runs the container in interactive mode (`-it`)
+- Sets up shared memory and stack limits for optimal performance
+- Mounts your model directory into the container at `/model`
+- Mounts the current Dynamo workspace into the container at `/workspace`
+- Enables [PDL](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programmatic-dependent-launch-and-synchronization) and disables parallel weight loading
+- Sets HuggingFace token as environment variable in the container
+### 3. Understanding the Configuration
+The deployment uses configuration files and command-line arguments to control behavior:
+#### Configuration Files
+**Prefill Configuration (`examples/backends/trtllm/engine_configs/gpt-oss-120b/prefill.yaml`)**:
+- `enable_attention_dp: false` - Attention data parallelism disabled for prefill
+- `enable_chunked_prefill: true` - Enables efficient chunked prefill processing
+- `moe_config.backend: CUTLASS` - Uses optimized CUTLASS kernels for MoE layers
+- `cache_transceiver_config.backend: ucx` - Uses UCX for efficient KV cache transfer
+- `cuda_graph_config.max_batch_size: 32` - Maximum batch size for CUDA graphs
+**Decode Configuration (`examples/backends/trtllm/engine_configs/gpt-oss-120b/decode.yaml`)**:
+- `enable_attention_dp: true` - Attention data parallelism enabled for decode
+- `disable_overlap_scheduler: false` - Enables overlapping for decode efficiency
+- `moe_config.backend: CUTLASS` - Uses optimized CUTLASS kernels for MoE layers
+- `cache_transceiver_config.backend: ucx` - Uses UCX for efficient KV cache transfer
+- `cuda_graph_config.max_batch_size: 128` - Maximum batch size for CUDA graphs
+#### Command-Line Arguments
+Both workers receive these key arguments:
+- `--tensor-parallel-size 4` - Uses 4 GPUs for tensor parallelism
+- `--expert-parallel-size 4` - Expert parallelism across 4 GPUs
+- `--free-gpu-memory-fraction 0.9` - Allocates 90% of GPU memory
+Prefill-specific arguments:
+- `--max-num-tokens 20000` - Maximum tokens for prefill processing
+- `--max-batch-size 32` - Maximum batch size for prefill
+Decode-specific arguments:
+- `--max-num-tokens 16384` - Maximum tokens for decode processing
+- `--max-batch-size 128` - Maximum batch size for decode
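One way to keep the shared and role-specific values from drifting apart is to generate both worker commands from a single table. The sketch below does this with the values listed above. `build_worker_command` is a hypothetical helper, not part of Dynamo, and it covers only the flags discussed in this section plus `--disaggregation-mode` and `--extra-engine-args` from the manual launch commands in this guide; a real command also needs the model path and parser flags.

```python
import shlex
from typing import Dict, List

# Values copied from the argument lists above.
SHARED_ARGS: Dict[str, str] = {
    "--tensor-parallel-size": "4",
    "--expert-parallel-size": "4",
    "--free-gpu-memory-fraction": "0.9",
}
ROLE_ARGS: Dict[str, Dict[str, str]] = {
    "prefill": {"--max-num-tokens": "20000", "--max-batch-size": "32"},
    "decode": {"--max-num-tokens": "16384", "--max-batch-size": "128"},
}


def build_worker_command(role: str, engine_config: str) -> List[str]:
    """Assemble a partial dynamo.trtllm command for one worker role."""
    if role not in ROLE_ARGS:
        raise ValueError(f"unknown role: {role!r}")
    cmd = [
        "python3", "-m", "dynamo.trtllm",
        "--disaggregation-mode", role,
        "--extra-engine-args", engine_config,
    ]
    for flag, value in {**ROLE_ARGS[role], **SHARED_ARGS}.items():
        cmd += [flag, value]
    return cmd


if __name__ == "__main__":
    base = "examples/backends/trtllm/engine_configs/gpt-oss-120b"
    for role in ("prefill", "decode"):
        print(shlex.join(build_worker_command(role, f"{base}/{role}.yaml")))
```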
+### 4. Launch the Deployment
+Note that GPT-OSS is a reasoning model with tool calling support. To ensure responses are processed correctly, launch the workers with the appropriate `--dyn-reasoning-parser` and `--dyn-tool-call-parser` values.
+You can use the provided launch script or run the components manually:
+#### Option A: Using the Launch Script
+```bash
+cd /workspace/examples/backends/trtllm
+./launch/gpt_oss_disagg.sh
+```
+#### Option B: Manual Launch
+1. **Start frontend**:
+```bash
+# Start frontend with round-robin routing
+python3 -m dynamo.frontend --router-mode round-robin --http-port 8000 &
+```
+2. **Launch prefill worker**:
+```bash
+CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m dynamo.trtllm \
+  --model-path /model \
+  --served-model-name openai/gpt-oss-120b \
+  --extra-engine-args examples/backends/trtllm/engine_configs/gpt-oss-120b/prefill.yaml \
+  --dyn-reasoning-parser gpt_oss \
+  --dyn-tool-call-parser harmony \
+  --disaggregation-mode prefill \
+  --max-num-tokens 20000 \
+  --max-batch-size 32 \
+  --free-gpu-memory-fraction 0.9 \
+  --tensor-parallel-size 4 \
+  --expert-parallel-size 4 &
+```
+3. **Launch decode worker**:
+```bash
+CUDA_VISIBLE_DEVICES=4,5,6,7 python3 -m dynamo.trtllm \
+  --model-path /model \
+  --served-model-name openai/gpt-oss-120b \
+  --extra-engine-args examples/backends/trtllm/engine_configs/gpt-oss-120b/decode.yaml \
+  --dyn-reasoning-parser gpt_oss \
+  --dyn-tool-call-parser harmony \
+  --disaggregation-mode decode \
+  --max-num-tokens 16384 \
+  --max-batch-size 128 \
+  --free-gpu-memory-fraction 0.9 \
+  --tensor-parallel-size 4 \
+  --expert-parallel-size 4
+```
+### 6. Verify the Deployment is Ready
+Poll the `/health` endpoint to verify that both the prefill and decode worker endpoints have started:
+```bash
+curl http://localhost:8000/health
+```
+Make sure that both of the endpoints are available before sending an inference request:
+```json
+{
+  "endpoints": [
+    "dyn://dynamo.tensorrt_llm.generate",
+    "dyn://dynamo.prefill.generate"
+  ],
+  "status": "healthy"
+}
+```
+If only one worker endpoint is listed, the other may still be starting up. Monitor the worker logs to track startup progress.
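For scripted deployments, the same check can be automated. The following is a minimal sketch using only the Python standard library. It assumes the `/health` response has the shape shown above and that the two endpoint names match your deployment; the function names, timeout, and polling interval are arbitrary choices.

```python
import json
import time
import urllib.request
from typing import Iterable

# Endpoint names taken from the sample response above.
REQUIRED_ENDPOINTS = (
    "dyn://dynamo.tensorrt_llm.generate",
    "dyn://dynamo.prefill.generate",
)


def is_ready(health: dict, required: Iterable[str] = REQUIRED_ENDPOINTS) -> bool:
    """True when the status is healthy and every required endpoint is listed."""
    listed = set(health.get("endpoints", []))
    return health.get("status") == "healthy" and set(required) <= listed


def wait_until_ready(url: str = "http://localhost:8000/health",
                     timeout_s: float = 600.0,
                     interval_s: float = 5.0) -> bool:
    """Poll `url` until both workers are registered or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if is_ready(json.load(resp)):
                    return True
        except (OSError, ValueError):
            pass  # frontend not reachable yet, error status, or non-JSON body
        time.sleep(interval_s)
    return False


if __name__ == "__main__":
    print("ready" if wait_until_ready() else "timed out waiting for workers")
```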
+### 7. Test the Deployment
+Send a test request to verify the deployment:
+```bash
+curl -X POST http://localhost:8000/v1/responses \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "openai/gpt-oss-120b",
+    "input": "Explain the concept of disaggregated serving in LLM inference in 3 sentences.",
+    "max_output_tokens": 200,
+    "stream": false
+  }'
+```
+The server exposes a standard OpenAI-compatible API endpoint that accepts JSON requests. You can adjust parameters such as `max_output_tokens` and `temperature` according to your needs.
+### 8. Reasoning and Tool Calling
+Dynamo supports reasoning and tool calling through the OpenAI Chat Completions endpoint. In a typical workflow, an application built on top of Dynamo
+exposes a set of tools that help the assistant provide accurate answers. The exchange is usually
+multi-turn, because it involves tool selection followed by generation based on the tool result.
+In addition, the reasoning effort can be configured through `chat_template_args`. Increasing the reasoning effort makes the model more accurate but also slower. It supports three levels: `low`, `medium`, and `high`.
+Below is an example of sending multi-round requests to complete a user query with reasoning and tool calling:
+**Application setup (pseudocode)**
+```Python
+# The tool defined by the application; `system` stands in for the application's own state
+def get_system_health():
+    for component in system.components:
+        if not component.health():
+            return False
+    return True
+# The tool declaration in Chat Completions style, sent in the request's "tools" list
+get_system_health_tool = {
+    "type": "function",
+    "function": {
+        "name": "get_system_health",
+        "description": "Returns the current health status of the LLM runtime—use before critical operations to verify the service is live.",
+        "parameters": {
+            "type": "object",
+            "properties": {}
+        }
+    }
+}
+# On user query, perform the workflow below.
+def user_query(app_request):
+    # first round
+    # create chat completion with prompt and tool declaration
+    request = ...
+    response = send(request)
+    if response["choices"][0]["finish_reason"] == "tool_calls":
+        # second round
+        function, params = parse_tool_call(response)
+        function_result = function(**params)
+        # create request with prompt, assistant response, and function result
+        request = ...
+        response = send(request)
+    return app_response(response)
+```
+**First request with tools**
+```bash
+curl localhost:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '
+{
+  "model": "openai/gpt-oss-120b",
+  "messages": [
+    {
+      "role": "user",
+      "content": "Hey, quick check: is everything up and running?"
+    }
+  ],
+  "chat_template_args": {
+      "reasoning_effort": "low"
+  },
+  "tools": [
+    {
+      "type": "function",
+      "function": {
+        "name": "get_system_health",
+        "description": "Returns the current health status of the LLM runtime—use before critical operations to verify the service is live.",
+        "parameters": {
+          "type": "object",
+          "properties": {}
+        }
+      }
+    }
+  ],
+  "response_format": {
+    "type": "text"
+  },
+  "stream": false,
+  "max_tokens": 300
+}'
+```
+**First response with tool choice**
+```JSON
+{
+  "id": "chatcmpl-d1c12219-6298-4c83-a6e3-4e7cef16e1a9",
+  "choices": [
+    {
+      "index": 0,
+      "message": {
+        "tool_calls": [
+          {
+            "id": "call-1",
+            "type": "function",
+            "function": {
+              "name": "get_system_health",
+              "arguments": "{}"
+            }
+          }
+        ],
+        "role": "assistant",
+        "reasoning_content": "We need to check system health. Use function."
+      },
+      "finish_reason": "tool_calls"
+    }
+  ],
+  "created": 1758758741,
+  "model": "openai/gpt-oss-120b",
+  "object": "chat.completion",
+  "usage": null
+}
+```
+**Second request with tool calling result**
+```bash
+curl localhost:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '
+{
+  "model": "openai/gpt-oss-120b",
+  "messages": [
+    {
+      "role": "user",
+      "content": "Hey, quick check: is everything up and running?"
+    },
+    {
+      "role": "assistant",
+      "tool_calls": [
+        {
+          "id": "call-1",
+          "type": "function",
+          "function": {
+            "name": "get_system_health",
+            "arguments": "{}"
+          }
+        }
+      ]
+    },
+    {
+      "role": "tool",
+      "tool_call_id": "call-1",
+      "content": "{\"status\":\"ok\",\"uptime_seconds\":372045}"
+    }
+  ],
+  "chat_template_args": {
+      "reasoning_effort": "low"
+  },
+  "tools": [
+    {
+      "type": "function",
+      "function": {
+        "name": "get_system_health",
+        "description": "Returns the current health status of the LLM runtime—use before critical operations to verify the service is live.",
+        "parameters": {
+          "type": "object",
+          "properties": {}
+        }
+      }
+    }
+  ],
+  "response_format": {
+    "type": "text"
+  },
+  "stream": false,
+  "max_tokens": 300
+}'
+```
+**Second response with final message**
+```JSON
+{
+  "id": "chatcmpl-9ebfe64a-68b9-4c1d-9742-644cf770ad0e",
+  "choices": [
+    {
+      "index": 0,
+      "message": {
+        "content": "All systems are green—everything’s up and running smoothly! 🚀 Let me know if you need anything else.",
+        "role": "assistant",
+        "reasoning_content": "The user asks: \"Hey, quick check: is everything up and running?\" We have just checked system health, it's ok. Provide friendly response confirming everything's up."
+      },
+      "finish_reason": "stop"
+    }
+  ],
+  "created": 1758758853,
+  "model": "openai/gpt-oss-120b",
+  "object": "chat.completion",
+  "usage": null
+}
+```
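The step between the first response and the second request can be sketched as a small function that runs the requested tool and appends the assistant and tool messages. This is an illustrative sketch rather than a Dynamo API: `build_followup_messages` and the `tools` registry are hypothetical names, and the message layout simply mirrors the requests and responses shown above.

```python
import json
from typing import Any, Callable, Dict, List


def build_followup_messages(
    messages: List[Dict[str, Any]],
    response: Dict[str, Any],
    tools: Dict[str, Callable[..., Any]],
) -> List[Dict[str, Any]]:
    """Return the message list for the second round, or the original list
    unchanged when the model did not request a tool call."""
    choice = response["choices"][0]
    if choice["finish_reason"] != "tool_calls":
        return messages
    tool_calls = choice["message"]["tool_calls"]
    followup = list(messages)
    # Echo the assistant's tool call so the model sees its own request.
    followup.append({"role": "assistant", "tool_calls": tool_calls})
    for call in tool_calls:
        function = tools[call["function"]["name"]]
        kwargs = json.loads(call["function"]["arguments"] or "{}")
        followup.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(function(**kwargs)),
        })
    return followup


if __name__ == "__main__":
    first_response = {"choices": [{
        "finish_reason": "tool_calls",
        "message": {"role": "assistant", "tool_calls": [{
            "id": "call-1", "type": "function",
            "function": {"name": "get_system_health", "arguments": "{}"},
        }]},
    }]}
    user = [{"role": "user", "content": "Hey, quick check: is everything up and running?"}]
    registry = {"get_system_health": lambda: {"status": "ok", "uptime_seconds": 372045}}
    print(json.dumps(build_followup_messages(user, first_response, registry), indent=2))
```

The returned list becomes the `messages` field of the second request; the `tools` declaration and `chat_template_args` are sent again unchanged.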
+## Benchmarking
+### Performance Testing with AIPerf
+The Dynamo container includes [AIPerf](https://github.com/ai-dynamo/aiperf/tree/main?tab=readme-ov-file#aiperf), NVIDIA's tool for benchmarking generative AI models. This tool helps measure throughput, latency, and other performance metrics for your deployment.
+**Run the following benchmark from inside the container** (after completing the deployment steps above):
+```bash
+# Create a directory for benchmark results
+mkdir -p /tmp/benchmark-results
+# Run the benchmark - this command tests the deployment with high-concurrency synthetic workload
+aiperf profile \
+    --model openai/gpt-oss-120b \
+    --tokenizer /model \
+    --endpoint-type chat \
+    --endpoint /v1/chat/completions \
+    --streaming \
+    --url localhost:8000 \
+    --synthetic-input-tokens-mean 32000 \
+    --synthetic-input-tokens-stddev 0 \
+    --output-tokens-mean 256 \
+    --output-tokens-stddev 0 \
+    --extra-inputs max_tokens:256 \
+    --extra-inputs min_tokens:256 \
+    --extra-inputs ignore_eos:true \
+    --extra-inputs "{\"nvext\":{\"ignore_eos\":true}}" \
+    --concurrency 256 \
+    --request-count 6144 \
+    --warmup-request-count 1000 \
+    --num-dataset-entries 8000 \
+    --random-seed 100 \
+    --artifact-dir /tmp/benchmark-results \
+    -H 'Authorization: Bearer NOT USED' \
+    -H 'Accept: text/event-stream'
+```
+### What This Benchmark Does
+This command:
+- **Tests chat completions** with streaming responses against the disaggregated deployment
+- **Simulates high load** with 256 concurrent requests and 6144 total requests
+- **Uses long context inputs** (32K tokens) to test prefill performance
+- **Generates consistent outputs** (256 tokens) to measure decode throughput
+- **Includes warmup period** (1000 requests) to stabilize performance metrics
+- **Saves detailed results** to `/tmp/benchmark-results` for analysis
+Key parameters you can adjust:
+- `--concurrency`: Number of simultaneous requests (impacts GPU utilization)
+- `--synthetic-input-tokens-mean`: Average input length (tests prefill capacity)
+- `--output-tokens-mean`: Average output length (tests decode throughput)
+- `--request-count`: Total number of requests for the benchmark
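Before changing these parameters, it can help to estimate the size of the workload they describe. The sketch below works through the arithmetic for the command above. It assumes warmup requests are sent in addition to the measured requests, and the duration passed to the throughput helper is a made-up placeholder, not a measured result.

```python
# Values taken from the aiperf command above.
CONCURRENCY = 256
REQUEST_COUNT = 6144
WARMUP_REQUESTS = 1000
INPUT_TOKENS = 32000
OUTPUT_TOKENS = 256
NUM_GPUS = 8  # 4 prefill + 4 decode


def workload_summary() -> dict:
    """Totals for the measured phase of the benchmark."""
    return {
        "waves_at_full_concurrency": REQUEST_COUNT / CONCURRENCY,
        "measured_input_tokens": REQUEST_COUNT * INPUT_TOKENS,
        "measured_output_tokens": REQUEST_COUNT * OUTPUT_TOKENS,
        # Assumption: warmup traffic is extra and excluded from the metrics.
        "requests_including_warmup": REQUEST_COUNT + WARMUP_REQUESTS,
    }


def output_tokens_per_gpu_per_second(duration_s: float) -> float:
    """Output-token throughput per GPU for a given measured duration."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return REQUEST_COUNT * OUTPUT_TOKENS / duration_s / NUM_GPUS


if __name__ == "__main__":
    for key, value in workload_summary().items():
        print(f"{key}: {value}")
    # Placeholder duration purely to show the calculation.
    print(output_tokens_per_gpu_per_second(duration_s=600.0))
```

With the defaults above, the measured phase processes 196,608,000 input tokens and 1,572,864 output tokens, so a run at these settings is dominated by prefill work.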
+### Installing AIPerf Outside the Container
+If you prefer to run benchmarks from outside the container:
+```bash
+# Install AIPerf
+pip install aiperf
+# Then run the same benchmark command, adjusting the tokenizer path if needed
+```
+## Architecture Overview
+The disaggregated architecture separates prefill and decode phases:
+```mermaid
+flowchart TD
+    Client["Users/Clients<br/>(HTTP)"] --> Frontend["Frontend<br/>Round-Robin Router"]
+    Frontend --> Prefill["Prefill Worker<br/>(GPUs 0-3)"]
+    Frontend --> Decode["Decode Worker<br/>(GPUs 4-7)"]
+    Prefill -.->|KV Cache Transfer<br/>via UCX| Decode
+```
+## Key Features
+1. **Disaggregated Serving**: Separates compute-intensive prefill from memory-bound decode operations
+2. **Optimized Resource Usage**: Different parallelism strategies for prefill vs decode
+3. **Scalable Architecture**: Easy to adjust worker counts based on workload
+4. **TensorRT-LLM Optimizations**: Leverages TensorRT-LLM's efficient kernels and memory management
+## Troubleshooting
+### Common Issues
+1. **CUDA Out-of-Memory Errors**
+   - Reduce `--max-num-tokens` in the launch commands (currently 20000 for prefill, 16384 for decode)
+   - Lower `--free-gpu-memory-fraction` from 0.9 to 0.8 or 0.7
+   - Ensure model checkpoints are compatible with the expected format
+2. **Workers Not Connecting**
+   - Ensure etcd and NATS services are running: `docker ps | grep -E "(etcd|nats)"`
+   - Check network connectivity between containers
+   - Verify CUDA_VISIBLE_DEVICES settings match your GPU configuration
+   - Check that no other processes are using the assigned GPUs
+3. **Performance Issues**
+   - Monitor GPU utilization with `nvidia-smi` while the deployment is running
+   - Check worker logs for bottlenecks or errors
+   - Ensure that batch sizes in manual commands match those in configuration files
+   - Adjust chunked prefill settings based on your workload
+   - For connection issues, ensure port 8000 is not being used by another application
+4. **Container Startup Issues**
+   - Verify that the NVIDIA Container Toolkit is properly installed
+   - Check that the Docker daemon is running with GPU support
+   - Ensure sufficient disk space for model weights and container images
+## Next Steps
+- **Production Deployment**: For multi-node deployments, see the [Multi-node Guide](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/multinode/README.md)
+- **Advanced Configuration**: Explore TensorRT-LLM engine building options for further optimization
+- **Monitoring**: Set up Prometheus and Grafana for production monitoring
+- **Performance Benchmarking**: Use AIPerf to measure and optimize your deployment performance
--- a/fern/pages/backends/trtllm/kv-cache-transfer.md
+++ b/fern/pages/backends/trtllm/kv-cache-transfer.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "KV Cache Transfer in Disaggregated Serving"
+---
+In disaggregated serving architectures, KV cache must be transferred between prefill and decode workers. TensorRT-LLM supports two methods for this transfer:
+## Default Method: NIXL
+By default, TensorRT-LLM uses **NIXL** (NVIDIA Inference Xfer Library) with UCX (Unified Communication X) as the backend for KV cache transfer between prefill and decode workers. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments.
+### Specify Backends for NIXL
+TODO: Add instructions for how to specify different backends for NIXL.
+## Alternative Method: UCX
+TensorRT-LLM can also leverage **UCX** (Unified Communication X) directly for KV cache transfer between prefill and decode workers. There are two ways to enable UCX as the KV cache transfer backend:
+1. **Recommended:** Set `cache_transceiver_config.backend: UCX` in your engine configuration YAML file.
+2. Alternatively, set the environment variable `TRTLLM_USE_UCX_KV_CACHE=1` and configure `cache_transceiver_config.backend: DEFAULT` in the engine configuration YAML.
+This flexibility allows users to choose the most suitable method for their deployment and compatibility requirements.
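As a minimal sketch of the recommended option, the dotted setting `cache_transceiver_config.backend: UCX` can be written as a nested YAML fragment. The temporary file used here is a hypothetical location chosen for illustration; in practice the two lines would be merged into your existing engine configuration YAML.

```bash
# Sketch: write a minimal engine-config fragment that selects UCX.
# CONFIG_FRAGMENT is a throwaway path used only for this example.
CONFIG_FRAGMENT="$(mktemp)"
cat > "${CONFIG_FRAGMENT}" <<'EOF'
cache_transceiver_config:
  backend: UCX
EOF
cat "${CONFIG_FRAGMENT}"

# Option 2 from the list above: keep `backend: DEFAULT` in the YAML
# and select UCX through the environment instead.
# export TRTLLM_USE_UCX_KV_CACHE=1
```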
--- a/fern/pages/backends/trtllm/llama4-plus-eagle.md
+++ b/fern/pages/backends/trtllm/llama4-plus-eagle.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Llama 4 Maverick Instruct with Eagle Speculative Decoding on SLURM"
+---
+This guide demonstrates how to deploy Llama 4 Maverick Instruct with Eagle Speculative Decoding on GB200x4 nodes. We will be following the [multi-node deployment instructions](multinode/multinode-examples.md) to set up the environment for the following scenarios:
+- **Aggregated Serving:**
+  Deploy the entire Llama 4 model on a single GB200x4 node for end-to-end serving.
+- **Disaggregated Serving:**
+  Distribute the workload across two GB200x4 nodes:
+    - One node runs the decode worker.
+    - The other node runs the prefill worker.
+## Notes
+* Make sure `eagle3_one_model: true` is set in the LLM API config inside the `examples/backends/trtllm/engine_configs/llama4/eagle` folder.
+## Setup
+Assuming you have already allocated your nodes via `salloc`, and are
+inside an interactive shell on one of the allocated nodes, set the
+following environment variables:
+```bash
+cd $DYNAMO_HOME/examples/backends/trtllm
+export IMAGE="<dynamo_trtllm_image>"
+# export MOUNTS="${PWD}/:/mnt,/lustre:/lustre"
+export MOUNTS="${PWD}/:/mnt"
+export MODEL_PATH="nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8"
+export SERVED_MODEL_NAME="nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8"
+```
+See the [Setup section](multinode/multinode-examples.md#setup) of the multi-node guide to learn more about the options above.
+## Aggregated Serving
+```bash
+export NUM_NODES=1
+export ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/eagle/eagle_agg.yml"
+./multinode/srun_aggregated.sh
+```
+## Disaggregated Serving
+```bash
+export NUM_PREFILL_NODES=1
+export PREFILL_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/eagle/eagle_prefill.yml"
+export NUM_DECODE_NODES=1
+export DECODE_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/eagle/eagle_decode.yml"
+./multinode/srun_disaggregated.sh
+```
+## Example Request
+See [here](multinode/multinode-examples.md#example-request) to learn how to send a request to the deployment.
+```
+curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
+        "model": "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8",
+        "messages": [{"role": "user", "content": "Why is NVIDIA a great company?"}],
+        "max_tokens": 1024
+    }' -w "\n"
+# output:
+{"id":"cmpl-3e87ea5c-010e-4dd2-bcc4-3298ebd845a8","choices":[{"text":"NVIDIA is considered a great company for several reasons:\n\n1. **Technological Innovation**: NVIDIA is a leader in the field of graphics processing units (GPUs) and has been at the forefront of technological innovation.
+...
+and the broader tech industry.\n\nThese factors combined have contributed to NVIDIA's status as a great company in the technology sector.","index":0,"logprobs":null,"finish_reason":"stop"}],"created":1753329671,"model":"nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8","system_fingerprint":null,"object":"text_completion","usage":{"prompt_tokens":16,"completion_tokens":562,"total_tokens":578,"prompt_tokens_details":null,"completion_tokens_details":null}}
+```
--- a/fern/pages/backends/trtllm/multinode/multinode-examples.md
+++ b/fern/pages/backends/trtllm/multinode/multinode-examples.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Example: Multi-node TRTLLM Workers with Dynamo on Slurm"
+---
+> **Note:** The scripts referenced in this example (such as `srun_aggregated.sh` and `srun_disaggregated.sh`) can be found in [`examples/basics/multinode/trtllm/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/multinode/trtllm/).
+To run a single Dynamo+TRTLLM Worker that spans multiple nodes (e.g., TP16),
+the set of nodes needs to be launched together in the same MPI world, such as
+via `mpirun` or `srun`. This is true regardless of whether the worker is
+aggregated, prefill-only, or decode-only.
+In this document we will demonstrate two examples launching multinode workers
+on a slurm cluster with `srun`:
+1. Deploying an aggregated nvidia/DeepSeek-R1 model as a multi-node TP16/EP16
+   worker across 4 GB200 nodes
+2. Deploying a disaggregated nvidia/DeepSeek-R1 model with a multi-node
+   TP16/EP16 prefill worker (4 nodes) and a multi-node TP16/EP16 decode
+   worker (4 nodes) across a total of 8 GB200 nodes.
+NOTE: Some of the scripts used in this example like `start_frontend_services.sh` and
+`start_trtllm_worker.sh` should be translatable to other environments like Kubernetes, or
+using `mpirun` directly, with relative ease.
+## Setup
+For simplicity of the example, we will make some assumptions about your slurm cluster:
+1. First, we assume you have access to a slurm cluster with multiple GPU nodes
+   available. For functional testing, most setups should be fine. For performance
+   testing, you should aim to allocate groups of nodes that are performantly
+   inter-connected, such as those in an NVL72 setup.
+2. Second, we assume this slurm cluster has the [Pyxis](https://github.com/NVIDIA/pyxis)
+   SPANK plugin setup. In particular, the `srun_aggregated.sh` script in this
+   example will use `srun` arguments like `--container-image`,
+   `--container-mounts`, and `--container-env` that are added to `srun` by Pyxis.
+   If your cluster supports similar container based plugins, you may be able to
+   modify the script to use that instead.
+3. Third, we assume you have already built a recent Dynamo+TRTLLM container image as
+   described [here](https://github.com/ai-dynamo/dynamo/tree/main/docs/backends/trtllm/README.md#build-container).
+   This is the image that can be set to the `IMAGE` environment variable in later steps.
+4. Fourth, we assume you pre-allocate a group of nodes using `salloc`. We
+   will allocate 8 nodes below as a reference command to have enough capacity
+   to run both examples. If you plan to only run the aggregated example, you
+   will only need 4 nodes. If you customize the configurations to require a
+   different number of nodes, you can adjust the number of allocated nodes
+   accordingly. Pre-allocating nodes is technically not a requirement,
+   but it makes iterations of testing/experimenting easier.
+   Make sure to set your `PARTITION` and `ACCOUNT` according to your slurm cluster setup:
+    ```bash
+    # Set partition manually based on your slurm cluster's partition names
+    PARTITION=""
+    # Set account manually if this command doesn't work on your cluster
+    ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)"
+    salloc \
+      --partition="${PARTITION}" \
+      --account="${ACCOUNT}" \
+      --job-name="${ACCOUNT}-dynamo.trtllm" \
+      -t 05:00:00 \
+      --nodes 8
+    ```
+5. Lastly, we will assume you are inside an interactive shell on one of your allocated
+   nodes, which may be the default behavior after executing the `salloc` command above
+   depending on the cluster setup. If not, then you should SSH into one of the allocated nodes.
+### Environment Variable Setup
+This example aims to automate as much of the environment setup as possible,
+but all slurm clusters and environments are different, and you may need to
+dive into the scripts to make modifications based on your specific environment.
+Assuming you have already allocated your nodes via `salloc`, and are
+inside an interactive shell on one of the allocated nodes, set the
+following environment variables:
+```bash
+# NOTE: IMAGE must be set manually for now
+# To build an image, see the steps here:
+# https://github.com/ai-dynamo/dynamo/tree/main/docs/backends/trtllm/README.md#build-container
+export IMAGE="<dynamo_trtllm_image>"
+# MOUNTS are the host:container path pairs that are mounted into the containers
+# launched by each `srun` command.
+#
+# If you want to reference files, such as $MODEL_PATH below, in a
+# different location, you can customize MOUNTS or specify additional
+# comma-separated mount pairs here.
+#
+# NOTE: Currently, this example assumes that the local bash scripts and configs
+# referenced are mounted into /mnt inside the container. If you want to
+# customize the location of the scripts, make sure to modify `srun_aggregated.sh`
+# accordingly for the new locations of `start_frontend_services.sh` and
+# `start_trtllm_worker.sh`.
+#
+# For example, assuming your cluster had a `/lustre` directory on the host, you
+# could add that as a mount like so:
+#
+# export MOUNTS="${PWD}/../../../../:/mnt,/lustre:/lustre"
+export MOUNTS="${PWD}/../../../../:/mnt"
+# NOTE: In general, Deepseek R1 is very large, so it is recommended to
+# pre-download the model weights and save them in some shared location,
+# NFS storage, HF_HOME, etc. and modify the `--model-path` below
+# to reuse the pre-downloaded weights instead.
+#
+# On Blackwell systems (ex: GB200), it is recommended to use the FP4 weights:
+# https://huggingface.co/nvidia/DeepSeek-R1-FP4
+#
+# On Hopper systems, FP4 isn't supported so you'll need to use the default weights:
+# https://huggingface.co/deepseek-ai/DeepSeek-R1
+export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
+# The name the model will be served/queried under, matching what's
+# returned by the /v1/models endpoint.
+#
+# By default this is inferred from MODEL_PATH, but when using locally downloaded
+# model weights, it can be nice to have explicit control over the name.
+export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
+```
+## Aggregated WideEP
+Assuming you have at least 4 nodes allocated following the setup steps above,
+follow these steps below to launch an **aggregated** deployment across 4 nodes:
+```bash
+# Default set in srun_aggregated.sh, but can customize here.
+# export ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/deepseek-r1/agg/wide_ep/wide_ep_agg.yaml"
+# Customize NUM_NODES to match the desired parallelism in ENGINE_CONFIG
+# The product of NUM_NODES*NUM_GPUS_PER_NODE should match the number of
+# total GPUs necessary to satisfy the requested parallelism. For example,
+# 4 nodes x 4 gpus/node = 16 gpus total for TP16/EP16.
+# export NUM_NODES=4
+# GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this.
+# export NUM_GPUS_PER_NODE=4
+# Launches:
+# - frontend + etcd/nats on current (head) node
+# - one large aggregated trtllm worker across multiple nodes via MPI tasks
+./srun_aggregated.sh
+```
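The node-count rule in the comments above (`NUM_NODES * NUM_GPUS_PER_NODE` must equal the GPU count the engine config asks for) can be checked before launching. This is a minimal sketch: `check_gpu_count` is a hypothetical helper, and the tensor-parallel size is passed in by hand rather than read from `ENGINE_CONFIG`.

```bash
# Sketch: sanity-check that the allocation matches the requested parallelism.
# check_gpu_count is an illustrative helper, not part of the example scripts.
check_gpu_count() {
    local num_nodes="$1" gpus_per_node="$2" tp_size="$3"
    local total=$(( num_nodes * gpus_per_node ))
    if [ "${total}" -ne "${tp_size}" ]; then
        echo "Mismatch: ${total} GPUs allocated, TP${tp_size} requested" >&2
        return 1
    fi
    echo "OK: ${num_nodes} nodes x ${gpus_per_node} GPUs = ${total} GPUs for TP${tp_size}"
}

# Defaults mirror the GB200 example: 4 nodes x 4 GPUs for TP16/EP16.
check_gpu_count "${NUM_NODES:-4}" "${NUM_GPUS_PER_NODE:-4}" 16
```

The same check applies to the disaggregated case by running it once with `NUM_PREFILL_NODES` and once with `NUM_DECODE_NODES`.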
+## Disaggregated WideEP
+Assuming you have at least 8 nodes allocated (4 for prefill, 4 for decode)
+following the setup above, follow these steps below to launch a **disaggregated**
+deployment across 8 nodes:
+<Tip>
+Make sure you have a fresh environment and that the aggregated
+example above is no longer deployed on the same set of nodes.
+</Tip>
+```bash
+# Defaults set in srun_disaggregated.sh, but can customize here.
+# export PREFILL_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/deepseek-r1/disagg/wide_ep/wide_ep_prefill.yaml"
+# export DECODE_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/deepseek-r1/disagg/wide_ep/wide_ep_decode.yaml"
+# Customize NUM_PREFILL_NODES to match the desired parallelism in PREFILL_ENGINE_CONFIG
+# Customize NUM_DECODE_NODES to match the desired parallelism in DECODE_ENGINE_CONFIG
+# The products of NUM_PREFILL_NODES*NUM_GPUS_PER_NODE and
+# NUM_DECODE_NODES*NUM_GPUS_PER_NODE should match the respective number of
+# GPUs necessary to satisfy the requested parallelism in each config.
+# export NUM_PREFILL_NODES=4
+# export NUM_DECODE_NODES=4
+# GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this.
+# export NUM_GPUS_PER_NODE=4
+# Launches:
+# - frontend + etcd/nats on current (head) node.
+# - one large prefill trtllm worker across multiple nodes via MPI tasks
+# - one large decode trtllm worker across multiple nodes via MPI tasks
+./srun_disaggregated.sh
+```
+<Tip>
+To launch multiple replicas of the configured prefill/decode workers, you can set
+NUM_PREFILL_WORKERS and NUM_DECODE_WORKERS respectively (default: 1).
+</Tip>
+## Understanding the Output
+1. The `srun_aggregated.sh` script launches two `srun` jobs. The first launches
+   etcd, NATS, and the OpenAI frontend on the head node only,
+   called "node1" in the example output below. The second launches
+   a single TP16 Dynamo+TRTLLM worker spread across 4 nodes, with each node
+   using 4 GPUs.
+    ```
+    # Frontend/etcd/nats services
+    srun: launching StepId=453374.17 on host node1, 1 tasks: 0
+    ...
+    # TP16 TRTLLM worker split across 4 nodes with 4 gpus each
+    srun: launching StepId=453374.18 on host node1, 4 tasks: [0-3]
+    srun: launching StepId=453374.18 on host node2, 4 tasks: [4-7]
+    srun: launching StepId=453374.18 on host node3, 4 tasks: [8-11]
+    srun: launching StepId=453374.18 on host node4, 4 tasks: [12-15]
+   ```
+2. The OpenAI frontend will listen for and dynamically discover workers as
+   they register themselves with Dynamo's distributed runtime:
+   ```
+   0: 2025-06-13T02:36:48.160Z  INFO dynamo_run::input::http: Watching for remote model at models
+   0: 2025-06-13T02:36:48.161Z  INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000 address="0.0.0.0:8000"
+   ```
+3. The TRTLLM worker will consist of N (N=16 for TP16) MPI ranks, 1 rank on each
+   GPU on each node, which will each output their progress while loading the model.
+   You can see each rank's output prefixed with the rank at the start of each log line
+   until the model successfully finishes loading:
+    ```
+     8: rank8 run mgmn worker node with mpi_world_size: 16 ...
+    10: rank10 run mgmn worker node with mpi_world_size: 16 ...
+     9: rank9 run mgmn worker node with mpi_world_size: 16 ...
+    11: rank11 run mgmn worker node with mpi_world_size: 16 ...
+    ...
+    15: Model init total -- 55.42s
+    11: Model init total -- 55.91s
+    12: Model init total -- 55.24s
+    ```
+4. After the model fully finishes loading on all ranks, the worker will register itself,
+   and the OpenAI frontend will detect it, signaled by this output:
+    ```
+    0: 2025-06-13T02:46:35.040Z  INFO dynamo_llm::discovery::watcher: added model model_name="nvidia/DeepSeek-R1-FP4"
+    ```
+5. At this point, with the worker fully initialized and detected by the frontend,
+   it is now ready for inference.
+6. `srun_disaggregated.sh` follows a very similar flow, but launches
+   three `srun` jobs instead of two: one for the frontend, one for the prefill worker,
+   and one for the decode worker.
+## Example Request
+To verify the deployed model is working, send a `curl` request:
+```bash
+# NOTE: $HOST assumes running on head node, but can be changed to $HEAD_NODE_IP instead.
+HOST=localhost
+PORT=8000
+# "model" here should match the model name returned by the /v1/models endpoint
+curl -w "%{http_code}" ${HOST}:${PORT}/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+  "model": "'${SERVED_MODEL_NAME}'",
+  "messages": [
+  {
+    "role": "user",
+    "content": "Tell me a story as if we were playing dungeons and dragons."
+  }
+  ],
+  "stream": true,
+  "max_tokens": 30
+}'
+```
+## Cleanup
+To clean up background `srun` processes launched by `srun_aggregated.sh` or
+`srun_disaggregated.sh`, you can run:
+```bash
+pkill srun
+```
+## Known Issues
+- This example has only been tested on a 4xGB200 node setup with 16 GPUs using
+  FP4 weights. In theory, the example should work on alternative setups such as
+  H100 nodes with FP8 weights, but this hasn't been tested yet.
+- WideEP configs in this directory are still being tested. A WideEP specific
+  example with documentation will be added once ready.
+- There are known issues where WideEP workers may not cleanly shut down:
+    - This may lead to leftover shared memory files in `/dev/shm/moe_*`. For
+      now, you must manually clean these up before deploying again on the
+      same set of nodes.
+    - Similarly, there may be GPU memory left in-use after killing the `srun`
+      jobs. After cleaning up any leftover shared memory files as described
+      above, the GPU memory may slowly come back. You can run `watch nvidia-smi`
+      to check on this behavior. If you don't free the GPU memory before the
+      next deployment, you may get a CUDA OOM error while loading the model.
+    - There is mention of this issue in the relevant TRT-LLM blog
+      [here](https://github.com/NVIDIA/TensorRT-LLM/blob/6021a439ab9c29f4c46f721eeb59f6b992c425ea/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md#miscellaneous).
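The manual cleanup of leftover `/dev/shm/moe_*` files described above could be scripted along these lines. This is a sketch under assumptions: `cleanup_moe_shm` is a hypothetical helper, it defaults to listing rather than deleting, and it would need to be run on every node that hosted a worker.

```bash
# Sketch: list (and optionally remove) leftover WideEP shared memory files.
# cleanup_moe_shm is an illustrative helper, not part of the example scripts.
cleanup_moe_shm() {
    local shm_dir="${1:-/dev/shm}"
    local mode="${2:-list}"
    local f
    for f in "${shm_dir}"/moe_*; do
        [ -e "${f}" ] || continue   # the glob matched nothing
        if [ "${mode}" = "remove" ]; then
            rm -f -- "${f}"
        else
            echo "${f}"
        fi
    done
}

# Inspect first; switch "list" to "remove" once the output looks right.
cleanup_moe_shm /dev/shm list
```

Listing before removing is a deliberate choice here, since other jobs on a shared node may own files in `/dev/shm`.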
--- a/fern/pages/backends/trtllm/prometheus.md
+++ b/fern/pages/backends/trtllm/prometheus.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "TensorRT-LLM Prometheus Metrics"
+---
+## Overview
+When running TensorRT-LLM through Dynamo, TensorRT-LLM's Prometheus metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both TensorRT-LLM engine metrics (prefixed with `trtllm_`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.
+Additional performance metrics are available via non-Prometheus APIs (see [Non-Prometheus Performance Metrics](#non-prometheus-performance-metrics) below).
+As of the date of this documentation, the included TensorRT-LLM version 1.1.0rc5 exposes **5 basic Prometheus metrics**. Note that the `trtllm_` prefix is added by Dynamo.
+**For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](../../observability/metrics.md).
+**For visualization setup instructions**, see the [Prometheus and Grafana Setup Guide](../../observability/prometheus-grafana.md).
+## Environment Variables
+| Variable | Description | Default | Example |
+|----------|-------------|---------|---------|
+| `DYN_SYSTEM_PORT` | System metrics/health port | `-1` (disabled) | `8081` |
+## Getting Started Quickly
+This is a single machine example.
+### Start Observability Stack
+For visualizing metrics with Prometheus and Grafana, start the observability stack. See [Observability Getting Started](../../observability/README.md#getting-started-quickly) for instructions.
+### Launch Dynamo Components
+Launch a frontend and TensorRT-LLM backend to test metrics:
+```bash
+# Start frontend (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
+$ python -m dynamo.frontend
+# Enable system metrics server on port 8081 and enable metrics collection
+$ DYN_SYSTEM_PORT=8081 python -m dynamo.trtllm --model <model_name> --publish-events-and-metrics
+```
+**Note:** The `backend` must be set to `"pytorch"` for metrics collection (enforced in `components/src/dynamo/trtllm/main.py`). TensorRT-LLM's `MetricsCollector` integration has only been tested/validated with the PyTorch backend.
+Wait for the TensorRT-LLM worker to start, then send requests and check metrics:
+```bash
+# Send a request
+curl -H 'Content-Type: application/json' \
+-d '{
+  "model": "<model_name>",
+  "max_completion_tokens": 100,
+  "messages": [{"role": "user", "content": "Hello"}]
+}' \
+http://localhost:8000/v1/chat/completions
+# Check metrics from the worker
+curl -s localhost:8081/metrics | grep "^trtllm_"
+```
+## Exposed Metrics
+TensorRT-LLM exposes metrics in Prometheus Exposition Format text at the `/metrics` HTTP endpoint. All TensorRT-LLM engine metrics use the `trtllm_` prefix and include labels (e.g., `model_name`, `engine_type`, `finished_reason`) to identify the source.
+**Note:** TensorRT-LLM uses `model_name` instead of Dynamo's standard `model` label convention.
+**Example Prometheus Exposition Format text:**
+```
+# HELP trtllm_request_success_total Count of successfully processed requests.
+# TYPE trtllm_request_success_total counter
+trtllm_request_success_total{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm",finished_reason="stop"} 150.0
+trtllm_request_success_total{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm",finished_reason="length"} 5.0
+# HELP trtllm_time_to_first_token_seconds Histogram of time to first token in seconds.
+# TYPE trtllm_time_to_first_token_seconds histogram
+trtllm_time_to_first_token_seconds_bucket{le="0.01",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 0.0
+trtllm_time_to_first_token_seconds_bucket{le="0.05",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 12.0
+trtllm_time_to_first_token_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0
+trtllm_time_to_first_token_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 8.75
+# HELP trtllm_e2e_request_latency_seconds Histogram of end to end request latency in seconds.
+# TYPE trtllm_e2e_request_latency_seconds histogram
+trtllm_e2e_request_latency_seconds_bucket{le="0.5",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 25.0
+trtllm_e2e_request_latency_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0
+trtllm_e2e_request_latency_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 45.2
+# HELP trtllm_time_per_output_token_seconds Histogram of time per output token in seconds.
+# TYPE trtllm_time_per_output_token_seconds histogram
+trtllm_time_per_output_token_seconds_bucket{le="0.1",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 120.0
+trtllm_time_per_output_token_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0
+trtllm_time_per_output_token_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 12.5
+# HELP trtllm_request_queue_time_seconds Histogram of time spent in WAITING phase for request.
+# TYPE trtllm_request_queue_time_seconds histogram
+trtllm_request_queue_time_seconds_bucket{le="1.0",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 140.0
+trtllm_request_queue_time_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0
+trtllm_request_queue_time_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 32.1
+```
+**Note:** The specific metrics shown above are examples and may vary depending on your TensorRT-LLM version. Always inspect your actual `/metrics` endpoint for the current list.
+### Metric Categories
+TensorRT-LLM provides metrics in the following categories (all prefixed with `trtllm_`):
+- **Request metrics** - Request success tracking and latency measurements
+- **Performance metrics** - Time to first token (TTFT), time per output token (TPOT), and queue time
+**Note:** Metrics may change between TensorRT-LLM versions. Always inspect the `/metrics` endpoint for your version.
+## Available Metrics
+The following metrics are exposed via Dynamo's `/metrics` endpoint (with the `trtllm_` prefix added by Dynamo) for TensorRT-LLM version 1.1.0rc5:
+- `trtllm_request_success_total` (Counter) — Count of successfully processed requests by finish reason
+  - Labels: `model_name`, `engine_type`, `finished_reason`
+- `trtllm_e2e_request_latency_seconds` (Histogram) — End-to-end request latency (seconds)
+  - Labels: `model_name`, `engine_type`
+- `trtllm_time_to_first_token_seconds` (Histogram) — Time to first token, TTFT (seconds)
+  - Labels: `model_name`, `engine_type`
+- `trtllm_time_per_output_token_seconds` (Histogram) — Time per output token, TPOT (seconds)
+  - Labels: `model_name`, `engine_type`
+- `trtllm_request_queue_time_seconds` (Histogram) — Time a request spends waiting in the queue (seconds)
+  - Labels: `model_name`, `engine_type`
+These metric names and availability are subject to change with TensorRT-LLM version updates.
+TensorRT-LLM provides Prometheus metrics through the `MetricsCollector` class (see [tensorrt_llm/metrics/collector.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/metrics/collector.py)).
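For the histogram metrics listed above, a mean value can be derived as `<metric>_sum / <metric>_count`, which is standard for Prometheus histograms. The sketch below does this with `awk` on exposition-format text. `histogram_mean` is a hypothetical helper, and it assumes a single label set per metric (with several label sets, the last matching line wins).

```bash
# Sketch: mean of a Prometheus histogram = <metric>_sum / <metric>_count.
# Reads exposition-format text on stdin; histogram_mean is an illustrative helper.
histogram_mean() {
    local metric="$1"
    awk -v m="${metric}" '
        index($0, m "_sum{") == 1   { sum = $NF }
        index($0, m "_count{") == 1 { count = $NF }
        END { if (count > 0) printf "%.4f\n", sum / count; else print "n/a" }
    '
}

# Against a live worker (requires the deployment to be running):
# curl -s localhost:8081/metrics | histogram_mean trtllm_time_to_first_token_seconds

# Offline, using two sample lines from the example exposition text:
histogram_mean trtllm_time_to_first_token_seconds <<'EOF'
trtllm_time_to_first_token_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0
trtllm_time_to_first_token_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 8.75
EOF
```

This mean covers the whole lifetime of the counter; for windowed averages, a Prometheus `rate()` over both series is the usual approach.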
+## Non-Prometheus Performance Metrics
+TensorRT-LLM provides extensive performance data beyond the basic Prometheus metrics. These are not currently exposed to Prometheus.
+### Available via Code References
+- **RequestPerfMetrics Structure**: [tensorrt_llm/executor/result.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/executor/result.py) - KV cache, timing, speculative decoding metrics
+- **Engine Statistics**: `engine.llm.get_stats_async()` - System-wide aggregate statistics
+- **KV Cache Events**: `engine.llm.get_kv_cache_events_async()` - Real-time cache operations
+### Example RequestPerfMetrics JSON Structure
+```json
+{
+  "timing_metrics": {
+    "arrival_time": 1234567890.123,
+    "first_scheduled_time": 1234567890.135,
+    "first_token_time": 1234567890.150,
+    "last_token_time": 1234567890.300,
+    "kv_cache_size": 2048576,
+    "kv_cache_transfer_start": 1234567890.140,
+    "kv_cache_transfer_end": 1234567890.145
+  },
+  "kv_cache_metrics": {
+    "num_total_allocated_blocks": 100,
+    "num_new_allocated_blocks": 10,
+    "num_reused_blocks": 90,
+    "num_missed_blocks": 5
+  },
+  "speculative_decoding": {
+    "acceptance_rate": 0.85,
+    "total_accepted_draft_tokens": 42,
+    "total_draft_tokens": 50
+  }
+}
+```
+**Note:** These structures are valid as of the date of this documentation but are subject to change with TensorRT-LLM version updates.
+## Implementation Details
+- **Prometheus Integration**: Uses the `MetricsCollector` class from `tensorrt_llm.metrics` (see [collector.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/metrics/collector.py))
+- **Dynamo Integration**: Uses `register_engine_metrics_callback()` function with `add_prefix="trtllm_"`
+- **Engine Configuration**: `return_perf_metrics` set to `True` when `--publish-events-and-metrics` is enabled
+- **Initialization**: Metrics appear after TensorRT-LLM engine initialization completes
+- **Metadata**: `MetricsCollector` initialized with model metadata (model name, engine type)
+## Related Documentation
+### TensorRT-LLM Metrics
+- See the [Non-Prometheus Performance Metrics](#non-prometheus-performance-metrics) section above for detailed performance data and source code references
+- [TensorRT-LLM Metrics Collector](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/metrics/collector.py) - Source code reference
+### Dynamo Metrics
+- [Dynamo Metrics Guide](../../observability/metrics.md) - Complete documentation on Dynamo runtime metrics
+- [Prometheus and Grafana Setup](../../observability/prometheus-grafana.md) - Visualization setup instructions
+- Dynamo runtime metrics (prefixed with `dynamo_*`) are available at the same `/metrics` endpoint alongside TensorRT-LLM metrics
+  - Implementation: `lib/runtime/src/metrics.rs` (Rust runtime metrics)
+  - Metric names: `lib/runtime/src/metrics/prometheus_names.rs` (metric name constants)
+  - Integration code: `components/src/dynamo/common/utils/prometheus.py` - Prometheus utilities and callback registration
--- a/fern/pages/backends/vllm/LMCache-Integration.md
+++ b/fern/pages/backends/vllm/LMCache-Integration.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "LMCache Integration in Dynamo"
+---
+## Introduction
+LMCache is a high-performance KV cache layer that supercharges LLM serving by enabling **prefill-once, reuse-everywhere** semantics. As described in the [official documentation](https://docs.lmcache.ai/index.html), LMCache lets LLMs prefill each text only once by storing the KV caches of all reusable texts, allowing reuse of KV caches for any reused text (not necessarily prefix) across any serving engine instance.
+This document describes how LMCache is integrated into Dynamo's vLLM backend to provide enhanced performance and memory efficiency.
+### Key Benefits
+- **Reduced Time to First Token (TTFT)**: Eliminates redundant prefill computations
+- **Memory Offloading**: Intelligent KV cache placement across CPU/GPU/storage tiers
+- **Improved Throughput**: Reduced GPU memory pressure enables higher batch sizes
+## Platform Support
+**Important Note**: LMCache integration currently only supports x86 architecture. ARM64 is not supported at this time.
+## Aggregated Serving
+### Configuration
+LMCache is enabled using the `--connector lmcache` flag:
+```bash
+python -m dynamo.vllm --model <model_name> --connector lmcache
+```
+### Customization
+LMCache configuration can be customized via environment variables listed [here](https://docs.lmcache.ai/api_reference/configurations.html).
+For advanced configurations, LMCache supports multiple [storage backends](https://docs.lmcache.ai/index.html):
+- **CPU RAM**: Fast local memory offloading
+- **Local Storage**: Disk-based persistence
+- **Redis**: Distributed cache sharing
+- **GDS Backend**: GPU Direct Storage for high throughput
+- **InfiniStore/Mooncake**: Cloud-native storage solutions
+### Deployment
+Use the provided launch script for quick setup:
+```bash
+./examples/backends/vllm/launch/agg_lmcache.sh
+```
+This will:
+1. Start the dynamo frontend
+2. Launch a single vLLM worker with LMCache enabled
+### Architecture for Aggregated Mode
+In aggregated mode, the system uses:
+- **KV Connector**: `LMCacheConnectorV1`
+- **KV Role**: `kv_both` (handles both reading and writing)
+## Disaggregated Serving
+Disaggregated serving separates prefill and decode operations into dedicated workers. This provides better resource utilization and scalability for production deployments.
+### Deployment
+Use the provided disaggregated launch script (requires at least 2 GPUs):
+```bash
+./examples/backends/vllm/launch/disagg_lmcache.sh
+```
+This will:
+1. Start the dynamo frontend
+2. Launch a decode worker on GPU 0
+3. Wait for initialization
+4. Launch a prefill worker on GPU 1 with LMCache enabled
+### Worker Roles
+#### Decode Worker
+- **Purpose**: Handles token generation (decode phase)
+- **GPU Assignment**: CUDA_VISIBLE_DEVICES=0
+- **LMCache Config**: Uses `NixlConnector` only, for KV transfer between prefill and decode workers
+#### Prefill Worker
+- **Purpose**: Handles prompt processing (prefill phase)
+- **GPU Assignment**: CUDA_VISIBLE_DEVICES=1
+- **LMCache Config**: Uses `MultiConnector` with both LMCache and NIXL connectors. This lets the prefill worker use LMCache for KV offloading and NIXL for KV transfer between prefill and decode workers.
+- **Flag**: `--is-prefill-worker`
+## Architecture
+### KV Transfer Configuration
+The system automatically configures KV transfer based on the deployment mode and worker type:
+#### Prefill Worker (Disaggregated Mode)
+```python
+kv_transfer_config = KVTransferConfig(
+    kv_connector="PdConnector",
+    kv_role="kv_both",
+    kv_connector_extra_config={
+        "connectors": [
+            {"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"},
+            {"kv_connector": "NixlConnector", "kv_role": "kv_both"}
+        ]
+    }
+)
+```
+#### Decode Worker or Aggregated Mode
+```python
+kv_transfer_config = KVTransferConfig(
+    kv_connector="LMCacheConnectorV1",
+    kv_role="kv_both"
+)
+```
+#### Fallback (No LMCache)
+```python
+kv_transfer_config = KVTransferConfig(
+    kv_connector="NixlConnector",
+    kv_role="kv_both"
+)
+```
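The selection among the three configurations above can be summarized in a small sketch. The function name `select_kv_transfer_config` and its two parameters are hypothetical, and plain dictionaries stand in for vLLM's `KVTransferConfig`; the connector names and roles are copied from the blocks in this section, not from Dynamo's source.

```python
from typing import Any, Dict


def select_kv_transfer_config(use_lmcache: bool, is_prefill_worker: bool) -> Dict[str, Any]:
    """Return a dict mirroring the KV transfer settings documented above.

    Hypothetical helper for illustration only; not Dynamo's actual code.
    """
    if not use_lmcache:
        # Fallback (no LMCache): NIXL only.
        return {"kv_connector": "NixlConnector", "kv_role": "kv_both"}
    if is_prefill_worker:
        # Prefill worker in disaggregated mode: LMCache plus NIXL.
        return {
            "kv_connector": "PdConnector",
            "kv_role": "kv_both",
            "kv_connector_extra_config": {
                "connectors": [
                    {"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"},
                    {"kv_connector": "NixlConnector", "kv_role": "kv_both"},
                ]
            },
        }
    # Decode worker or aggregated mode with LMCache enabled.
    return {"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}


prefill_config = select_kv_transfer_config(use_lmcache=True, is_prefill_worker=True)
print(prefill_config["kv_connector"])
```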
+### Integration Points
+1. **Argument Parsing** (`args.py`):
+   - Configures appropriate KV transfer settings
+   - Sets up connector configurations based on worker type
+2. **Engine Setup** (`main.py`):
+   - Initializes LMCache environment variables
+   - Creates vLLM engine with proper KV transfer config
+   - Handles both aggregated and disaggregated modes
+### Best Practices
+1. **Chunk Size Tuning**: Adjust `LMCACHE_CHUNK_SIZE` based on your use case:
+   - Smaller chunks (128-256): Better reuse granularity for varied content
+   - Larger chunks (512-1024): More efficient for repetitive content patterns
+2. **Memory Allocation**: Set `LMCACHE_MAX_LOCAL_CPU_SIZE` conservatively:
+   - Leave sufficient RAM for other system processes
+   - Monitor memory usage during peak loads
+3. **Workload Optimization**: LMCache performs best with:
+   - Repeated prompt patterns (RAG, multi-turn conversations)
+   - Shared context across sessions
+   - Long-running services with warm caches
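As a minimal sketch of applying the tuning variables above, the helper below builds the environment and command line for a worker with LMCache enabled. The function name is hypothetical and the values `256` and `20` are illustrative assumptions, not recommendations; check the LMCache configuration reference for the units of `LMCACHE_MAX_LOCAL_CPU_SIZE`.

```python
import os
from typing import Dict, List, Tuple


def build_lmcache_launch(model: str, chunk_size: int, max_cpu_size: float) -> Tuple[List[str], Dict[str, str]]:
    """Return (command, environment) for a worker launched with LMCache enabled.

    Hypothetical helper; it only assembles values and does not start a process.
    """
    env = dict(os.environ)
    env["LMCACHE_CHUNK_SIZE"] = str(chunk_size)
    env["LMCACHE_MAX_LOCAL_CPU_SIZE"] = str(max_cpu_size)
    command = ["python", "-m", "dynamo.vllm", "--model", model, "--connector", "lmcache"]
    return command, env


# Illustrative values only.
command, env = build_lmcache_launch("Qwen/Qwen3-0.6B", chunk_size=256, max_cpu_size=20)
print(" ".join(command))
```

Passing both results to `subprocess.run(command, env=env)` would start the worker with these settings.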
+## Metrics and Monitoring
+When LMCache is enabled with `--connector lmcache` and `DYN_SYSTEM_PORT` is set, LMCache metrics are automatically exposed via Dynamo's `/metrics` endpoint alongside vLLM and Dynamo metrics.
+**Requirements to access LMCache metrics:**
+- `--connector lmcache` - Enables LMCache
+- `DYN_SYSTEM_PORT=8081` - Enables metrics HTTP endpoint
+- `PROMETHEUS_MULTIPROC_DIR` (optional) - If not set, Dynamo manages it internally. Only set explicitly if you need control over the metrics directory.
+For detailed information on LMCache metrics, including the complete list of available metrics and how to access them, see the **[LMCache Metrics section](prometheus.md#lmcache-metrics)** in the vLLM Prometheus Metrics Guide.
+### Troubleshooting
+#### LMCache log: `PrometheusLogger instance already created with different metadata`
+You may see an error like:
+```text
+LMCache ERROR: PrometheusLogger instance already created with different metadata. This should not happen except in test
+```
+**Version note**: We reproduced this behavior with **vLLM v0.12.0**. We have not reproduced it with **vLLM v0.11.0**, so it may be specific to (or introduced in) v0.12.0.
+This is emitted by LMCache when the LMCache connector is initialized more than once in the same process (for example, once for a `WORKER` role and later for a `SCHEDULER` role). LMCache uses a process-global singleton for its Prometheus logger, so the second initialization logs this error if its metadata differs from the first.
+- **Impact**: This is a log-only error; in our testing it does not prevent vLLM/Dynamo from serving requests. If you care about LMCache metric labels, be aware the logger singleton uses the first-seen metadata.
+- **Repro without Dynamo** (vLLM v0.12.0):
+```bash
+vllm serve Qwen/Qwen3-0.6B \
+  --host 127.0.0.1 --port 18000 \
+  --gpu-memory-utilization 0.24 \
+  --enforce-eager \
+  --no-enable-prefix-caching \
+  --max-num-seqs 2 \
+  --kv-offloading-backend lmcache \
+  --kv-offloading-size 1 \
+  --disable-hybrid-kv-cache-manager
+```
+- **Mitigation (silence)**: set `LMCACHE_LOG_LEVEL=CRITICAL`.
+- **Upstream issue**: [vLLM issue #30996](https://github.com/vllm-project/vllm/issues/30996).
+#### vLLM log: `Found PROMETHEUS_MULTIPROC_DIR was set by user`
+vLLM v1 uses `prometheus_client.multiprocess` and stores intermediate metric values in `PROMETHEUS_MULTIPROC_DIR`.
+- If you **set `PROMETHEUS_MULTIPROC_DIR` yourself**, vLLM warns that the directory must be wiped between runs to avoid stale/incorrect metrics.
+- When running via Dynamo, the vLLM wrapper may set `PROMETHEUS_MULTIPROC_DIR` internally to a temporary directory to avoid vLLM cleanup issues. If you still see the warning, confirm you are not exporting `PROMETHEUS_MULTIPROC_DIR` in your shell or container environment.
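If you do manage `PROMETHEUS_MULTIPROC_DIR` yourself, one way to satisfy the "wipe between runs" requirement is to clear the directory before each launch. The sketch below assumes a hypothetical helper name, `reset_multiproc_dir`, and demonstrates it on a throwaway directory; it is not something Dynamo or vLLM provides.

```python
import os
import shutil
import tempfile


def reset_multiproc_dir(path: str) -> str:
    """Delete any stale metric files under path and recreate it empty."""
    shutil.rmtree(path, ignore_errors=True)
    os.makedirs(path)
    return path


# Demonstration on a throwaway directory with one simulated stale file.
demo_dir = os.path.join(tempfile.mkdtemp(), "prom_multiproc")
os.makedirs(demo_dir)
open(os.path.join(demo_dir, "counter_123.db"), "w").close()

reset_multiproc_dir(demo_dir)
remaining = os.listdir(demo_dir)
print(remaining)
```

In practice you would call the helper on the directory you export as `PROMETHEUS_MULTIPROC_DIR`, before starting the worker.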
+## References and Additional Resources
+- [LMCache Documentation](https://docs.lmcache.ai/index.html) - Comprehensive guide and API reference
+- [Configuration Reference](https://docs.lmcache.ai/api_reference/configurations.html) - Detailed configuration options
+- [LMCache Observability Guide](https://docs.lmcache.ai/production/observability/vllm_endpoint.html) - Metrics and monitoring details
--- a/fern/pages/backends/vllm/README.md
+++ b/fern/pages/backends/vllm/README.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "LLM Deployment using vLLM"
+---
+This directory contains reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL-based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.
+## Use the Latest Release
+We recommend using the latest stable release of Dynamo to avoid breaking changes:
+[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
+You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
+```bash
+git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
+```
+---
+## Table of Contents
+- [Feature Support Matrix](#feature-support-matrix)
+- [Quick Start](#vllm-quick-start)
+- [Single Node Examples](#run-single-node-examples)
+- [Advanced Examples](#advanced-examples)
+- [Deploy on Kubernetes](#kubernetes-deployment)
+- [Configuration](#configuration)
+## Feature Support Matrix
+### Core Dynamo Features
+| Feature | vLLM | Notes |
+|---------|------|-------|
+| [**Disaggregated Serving**](../../design-docs/disagg-serving.md) | ✅ |  |
+| [**Conditional Disaggregation**](../../design-docs/disagg-serving.md#conditional-disaggregation) | 🚧 | WIP |
+| [**KV-Aware Routing**](../../router/kv-cache-routing.md) | ✅ |  |
+| [**SLA-Based Planner**](../../planner/sla-planner.md) | ✅ |  |
+| [**Load Based Planner**](../../planner/load-planner.md) | 🚧 | WIP |
+| [**KVBM**](../../kvbm/kvbm-architecture.md) | ✅ |  |
+| [**LMCache**](LMCache-Integration.md) | ✅ |  |
+| [**Prompt Embeddings**](prompt-embeddings.md) | ✅ | Requires `--enable-prompt-embeds` flag |
+### Large Scale P/D and WideEP Features
+| Feature            | vLLM | Notes                                                                 |
+|--------------------|------|-----------------------------------------------------------------------|
+| **WideEP**         | ✅   | Support for PPLX / DeepEP not verified                                           |
+| **DP Rank Routing**| ✅   | Supported via external control of DP ranks |
+| **GB200 Support**  | 🚧   | Container functional on main |
+## vLLM Quick Start
+Below we provide a guide that lets you run all of our common deployment patterns on a single node.
+### Start NATS and ETCD in the background
+Start both services using [Docker Compose](https://github.com/ai-dynamo/dynamo/tree/main/deploy/docker-compose.yml):
+```bash
+docker compose -f deploy/docker-compose.yml up -d
+```
+### Pull or build container
+We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd like to build your own container from source:
+```bash
+./container/build.sh --framework VLLM
+```
+### Run container
+```bash
+./container/run.sh -it --framework VLLM [--mount-workspace]
+```
+The container includes the specific commit [vllm-project/vllm#19790](https://github.com/vllm-project/vllm/pull/19790), which enables support for external control of the DP ranks.
+## Run Single Node Examples
+<Warning>
+Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `python3 -m dynamo.frontend` to start the ingress and uses `python3 -m dynamo.vllm` to start the vLLM workers. You can also run each command in separate terminals for better log visibility.
+</Warning>
+### Aggregated Serving
+```bash
+# requires one gpu
+cd examples/backends/vllm
+bash launch/agg.sh
+```
+### Aggregated Serving with KV Routing
+```bash
+# requires two gpus
+cd examples/backends/vllm
+bash launch/agg_router.sh
+```
+### Disaggregated Serving
+```bash
+# requires two gpus
+cd examples/backends/vllm
+bash launch/disagg.sh
+```
+### Disaggregated Serving with KV Routing
+```bash
+# requires three gpus
+cd examples/backends/vllm
+bash launch/disagg_router.sh
+```
+### Single Node Data Parallel Attention / Expert Parallelism
+This example is not meant to be performant, but it showcases Dynamo routing to data parallel workers:
+```bash
+# requires four gpus
+cd examples/backends/vllm
+bash launch/dep.sh
+```
+<Tip>
+Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker.
+</Tip>
+## Advanced Examples
+Below we provide a selected list of advanced deployments. Please open up an issue if you'd like to see a specific example!
+### Speculative Decoding with Aggregated Serving (Meta-Llama-3.1-8B-Instruct + Eagle3)
+Run **Meta-Llama-3.1-8B-Instruct** with **Eagle3** as a draft model using **aggregated speculative decoding** on a single node.
+This setup demonstrates how to use Dynamo to create an instance using Eagle-based speculative decoding under the **vLLM aggregated serving framework** for faster inference while maintaining accuracy.
+**Guide:** [Speculative Decoding Quickstart](speculative-decoding.md)
+### Kubernetes Deployment
+For complete Kubernetes deployment instructions, configurations, and troubleshooting, see the [vLLM Kubernetes Deployment Guide](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/README.md).
+## Configuration
+vLLM workers are configured through command-line arguments. Key parameters include:
+- `--model`: Model to serve (e.g., `Qwen/Qwen3-0.6B`)
+- `--is-prefill-worker`: Enable prefill-only mode for disaggregated serving
+- `--metrics-endpoint-port`: Port for publishing KV metrics to Dynamo
+- `--connector`: Specify which `kv_transfer_config` you want vLLM to use `[nixl, lmcache, kvbm, none]`. This is a helper flag that overwrites the engine's `KVTransferConfig`.
+- `--enable-prompt-embeds`: **Enable prompt embeddings feature** (opt-in, default: disabled)
+  - **Required for:** Accepting pre-computed prompt embeddings via API
+  - **Default behavior:** Prompt embeddings DISABLED - requests with `prompt_embeds` will fail
+  - **Error without flag:** `ValueError: You must set --enable-prompt-embeds to input prompt_embeds`
+See `args.py` for the full list of configuration options and their defaults.
+The [documentation](https://docs.vllm.ai/en/v0.9.2/configuration/serve_args.html?h=serve+arg) for the vLLM CLI args points to running `vllm serve --help` to see which CLI args can be added. We use the same argument parser as vLLM.
+### Hashing Consistency for KV Events
+When using KV-aware routing, ensure deterministic hashing across processes to avoid radix tree mismatches. Choose one of the following:
+- Set `PYTHONHASHSEED=0` for all vLLM processes when relying on Python's builtin hashing for prefix caching.
+- If your vLLM version supports it, configure a deterministic prefix caching algorithm, for example:
+```bash
+vllm serve ... --enable-prefix-caching --prefix-caching-algo sha256
+```
+See the high-level notes in [KV Cache Routing](../../router/kv-cache-routing.md) on deterministic event IDs.
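To see why a fixed `PYTHONHASHSEED` matters, the sketch below computes Python's builtin string hash in two separate interpreter processes. It only demonstrates Python's hash randomization behavior, not vLLM's prefix caching internals; the helper name and the sample string are made up for illustration.

```python
import os
import subprocess
import sys


def string_hash_in_fresh_process(text: str, hash_seed: str) -> int:
    """Return hash(text) as computed by a new interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


# With the seed pinned to 0, separate processes agree on the hash value.
first = string_hash_in_fresh_process("shared system prompt", "0")
second = string_hash_in_fresh_process("shared system prompt", "0")
print(first == second)
```

Passing `"random"` as the seed instead restores per-process randomization, in which case the two values will generally differ.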
+## Request Migration
+You can enable [request migration](../../fault-tolerance/request-migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
+```bash
+python3 -m dynamo.vllm ... --migration-limit=3
+```
+This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../fault-tolerance/request-migration.md) documentation for details on how this works.
+## Request Cancellation
+When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
+### Cancellation Support Matrix
+| | Prefill | Decode |
+|-|---------|--------|
+| **Aggregated** | ✅ | ✅ |
+| **Disaggregated** | ✅ | ✅ |
+For more details, see the [Request Cancellation Architecture](../../fault-tolerance/request-cancellation.md) documentation.
--- a/fern/pages/backends/vllm/deepseek-r1.md
+++ b/fern/pages/backends/vllm/deepseek-r1.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Running Deepseek R1 with Wide EP"
+---
+Dynamo supports running Deepseek R1 with data parallel attention and wide expert parallelism. Each data parallel attention rank is a separate Dynamo component that emits its own KV events and metrics. vLLM controls expert parallelism with the `--enable-expert-parallel` flag.
+## Instructions
+The following script can be adapted to run Deepseek R1 with a variety of configurations. The current configuration uses 2 nodes, 16 GPUs, and a dp of 16. Follow the [vLLM Backend](README.md) Getting Started section on each node, and then run these two commands.
+node 0
+```bash
+./launch/dsr1_dep.sh --num-nodes 2 --node-rank 0 --gpus-per-node 8 --master-addr <node 0 addr>
+```
+node 1
+```bash
+./launch/dsr1_dep.sh --num-nodes 2 --node-rank 1 --gpus-per-node 8 --master-addr <node 0 addr>
+```
+### Testing the Deployment
+On node 0 (where the frontend was started), send a test request to verify your deployment:
+```bash
+curl localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "deepseek-ai/DeepSeek-R1",
+    "messages": [
+    {
+        "role": "user",
+        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
+    }
+    ],
+    "stream": false,
+    "max_tokens": 30
+  }'
+```
--- a/fern/pages/backends/vllm/gpt-oss.md
+++ b/fern/pages/backends/vllm/gpt-oss.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Running gpt-oss-120b Disaggregated with vLLM"
+---
+Dynamo supports disaggregated serving of gpt-oss-120b with vLLM. This guide demonstrates how to deploy gpt-oss-120b using disaggregated prefill/decode serving on a single H100 node with 8 GPUs, running 1 prefill worker on 4 GPUs and 1 decode worker on 4 GPUs.
+## Overview
+This deployment uses disaggregated serving in vLLM where:
+- **Prefill Worker**: Processes input prompts efficiently using 4 GPUs with tensor parallelism
+- **Decode Worker**: Generates output tokens using 4 GPUs, optimized for token generation throughput
+- **Frontend**: Provides OpenAI-compatible API endpoint with round-robin routing
+## Prerequisites
+This guide assumes readers already know how to deploy Dynamo disaggregated serving with vLLM, as illustrated in [README.md](README.md).
+## Instructions
+### 1. Launch the Deployment
+Note that GPT-OSS is a reasoning model with tool calling support. To
+ensure responses are processed correctly, launch the workers with the
+appropriate `--dyn-reasoning-parser` and `--dyn-tool-call-parser` values.
+**Start frontend**
+```bash
+python3 -m dynamo.frontend --http-port 8000 &
+```
+**Run decode worker**
+```bash
+CUDA_VISIBLE_DEVICES=0,1,2,3  python -m dynamo.vllm \
+  --model openai/gpt-oss-120b \
+  --tensor-parallel-size 4 \
+  --dyn-reasoning-parser gpt_oss \
+  --dyn-tool-call-parser harmony
+```
+**Run prefill worker**
+```bash
+CUDA_VISIBLE_DEVICES=4,5,6,7  python -m dynamo.vllm \
+  --model openai/gpt-oss-120b \
+  --tensor-parallel-size 4 \
+  --is-prefill-worker \
+  --dyn-reasoning-parser gpt_oss \
+  --dyn-tool-call-parser harmony
+```
+### 2. Verify the Deployment is Ready
+Poll the `/health` endpoint to verify that both the prefill and decode worker endpoints have started:
+```bash
+curl http://localhost:8000/health
+```
+Make sure that both of the `generate` endpoints are available before sending an inference request:
+```
+{
+  "status": "healthy",
+  "endpoints": [
+    "dyn://dynamo.backend.generate"
+  ],
+  "instances": [
+    {
+      "component": "backend",
+      "endpoint": "generate",
+      "namespace": "dynamo",
+      "instance_id": 7587889712474989333,
+      "transport": {
+        "nats_tcp": "dynamo_backend.generate-694d997dbae9a315"
+      }
+    },
+    {
+      "component": "prefill",
+      "endpoint": "generate",
+      "namespace": "dynamo",
+      "instance_id": 7587889712474989350,
+      "transport": {
+        "nats_tcp": "dynamo_prefill.generate-694d997dbae9a326"
+      }
+    },
+    ...
+  ]
+}
+```
+If only one worker endpoint is listed, the other may still be starting up. Monitor the worker logs to track startup progress.
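The readiness check can also be automated. The following is a minimal sketch that inspects a `/health` payload and reports which workers have not yet registered a `generate` endpoint. The helper name is hypothetical, and the component names `backend` and `prefill` are taken from the sample output above.

```python
from typing import Any, Dict, Iterable, Set


def missing_generate_workers(
    health: Dict[str, Any], required: Iterable[str] = ("backend", "prefill")
) -> Set[str]:
    """Return the required components that have no `generate` instance yet."""
    ready = {
        instance["component"]
        for instance in health.get("instances", [])
        if instance.get("endpoint") == "generate"
    }
    return set(required) - ready


# Trimmed-down payload shaped like the sample response above.
sample_health = {
    "status": "healthy",
    "instances": [
        {"component": "backend", "endpoint": "generate", "namespace": "dynamo"},
        {"component": "prefill", "endpoint": "generate", "namespace": "dynamo"},
    ],
}
print(missing_generate_workers(sample_health))
```

The payload itself can be fetched with `urllib.request.urlopen("http://localhost:8000/health")` and decoded with `json.load`; an empty result means both workers are ready.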
+### 3. Test the Deployment
+Send a test request to verify the deployment:
+```bash
+curl -X POST http://localhost:8000/v1/responses \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "openai/gpt-oss-120b",
+    "input": "Explain the concept of disaggregated serving in LLM inference in 3 sentences.",
+    "max_output_tokens": 200,
+    "stream": false
+  }'
+```
+The server exposes a standard OpenAI-compatible API endpoint that accepts JSON requests. You can adjust parameters like `max_output_tokens`, `temperature`, and others according to your needs.
+### 4. Reasoning and Tool Calling
+Dynamo supports reasoning and tool calling in the OpenAI Chat Completions endpoint. In a typical application built on top of Dynamo,
+the application defines a set of tools that help the assistant give accurate answers. The exchange is usually
+multi-turn, because it involves tool selection followed by generation based on the tool result. Below is an example
+of sending multi-round requests to complete a user query with reasoning and tool calling:
+**Application setup (pseudocode)**
+```Python
+# The tool defined by the application
+def get_system_health():
+    for component in system.components:
+        if not component.health():
+            return False
+    return True
+# The JSON representation of the declaration in ChatCompletion tool style
+# (triple quotes keep the multi-line string valid Python)
+tool_choice = '''{
+  "type": "function",
+  "function": {
+    "name": "get_system_health",
+    "description": "Returns the current health status of the LLM runtime—use before critical operations to verify the service is live.",
+    "parameters": {
+      "type": "object",
+      "properties": {}
+    }
+  }
+}'''
+# On user query, perform below workflow.
+def user_query(app_request):
+    # first round
+    # create chat completion with prompt and tool choice
+    request = ...
+    response = send(request)
+    if response["choices"][0]["finish_reason"] == "tool_calls":
+        # second round
+        function, params = parse_tool_call(response)
+        function_result = function(params)
+        # create request with prompt, assistant response, and function result
+        request = ...
+        response = send(request)
+    return app_response(response)
+```
+**First request with tools**
+```bash
+curl localhost:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '
+{
+  "model": "openai/gpt-oss-120b",
+  "messages": [
+    {
+      "role": "user",
+      "content": "Hey, quick check: is everything up and running?"
+    }
+  ],
+  "tools": [
+    {
+      "type": "function",
+      "function": {
+        "name": "get_system_health",
+        "description": "Returns the current health status of the LLM runtime—use before critical operations to verify the service is live.",
+        "parameters": {
+          "type": "object",
+          "properties": {}
+        }
+      }
+    }
+  ],
+  "response_format": {
+    "type": "text"
+  },
+  "stream": false,
+  "max_tokens": 300
+}'
+```
+**First response with tool choice**
+```JSON
+{
+  "id": "chatcmpl-d1c12219-6298-4c83-a6e3-4e7cef16e1a9",
+  "choices": [
+    {
+      "index": 0,
+      "message": {
+        "tool_calls": [
+          {
+            "id": "call-1",
+            "type": "function",
+            "function": {
+              "name": "get_system_health",
+              "arguments": "{}"
+            }
+          }
+        ],
+        "role": "assistant",
+        "reasoning_content": "We need to check system health. Use function."
+      },
+      "finish_reason": "tool_calls"
+    }
+  ],
+  "created": 1758758741,
+  "model": "openai/gpt-oss-120b",
+  "object": "chat.completion",
+  "usage": null
+}
+```
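The follow-up request is the original conversation plus the assistant's tool call and one `tool` message per call. As a rough sketch of that assembly step (the helper name `build_followup_messages` and the `tool_results` mapping are hypothetical, and the tool output is the illustrative value used in this guide):

```python
import json
from typing import Any, Dict, List


def build_followup_messages(
    messages: List[Dict[str, Any]],
    response: Dict[str, Any],
    tool_results: Dict[str, Any],
) -> List[Dict[str, Any]]:
    """Append the assistant tool calls and their results to the conversation.

    tool_results maps a function name to the object that function returned.
    """
    assistant = response["choices"][0]["message"]
    followup = list(messages)
    followup.append({"role": "assistant", "tool_calls": assistant["tool_calls"]})
    for call in assistant["tool_calls"]:
        result = tool_results[call["function"]["name"]]
        followup.append(
            {"role": "tool", "tool_call_id": call["id"], "content": json.dumps(result)}
        )
    return followup


# Trimmed version of the first response shown above.
first_response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "tool_calls": [
                    {
                        "id": "call-1",
                        "type": "function",
                        "function": {"name": "get_system_health", "arguments": "{}"},
                    }
                ],
            },
            "finish_reason": "tool_calls",
        }
    ]
}
conversation = [{"role": "user", "content": "Hey, quick check: is everything up and running?"}]
followup = build_followup_messages(
    conversation,
    first_response,
    {"get_system_health": {"status": "ok", "uptime_seconds": 372045}},
)
print(json.dumps(followup, indent=2))
```

The resulting list is what goes in the `messages` field of the second request.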
+**Second request with tool calling result**
+```bash
+curl localhost:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '
+{
+  "model": "openai/gpt-oss-120b",
+  "messages": [
+    {
+      "role": "user",
+      "content": "Hey, quick check: is everything up and running?"
+    },
+    {
+      "role": "assistant",
+      "tool_calls": [
+        {
+          "id": "call-1",
+          "type": "function",
+          "function": {
+            "name": "get_system_health",
+            "arguments": "{}"
+          }
+        }
+      ]
+    },
+    {
+      "role": "tool",
+      "tool_call_id": "call-1",
+      "content": "{\"status\":\"ok\",\"uptime_seconds\":372045}"
+    }
+  ],
+  "tools": [
+    {
+      "type": "function",
+      "function": {
+        "name": "get_system_health",
+        "description": "Returns the current health status of the LLM runtime—use before critical operations to verify the service is live.",
+        "parameters": {
+          "type": "object",
+          "properties": {}
+        }
+      }
+    }
+  ],
+  "response_format": {
+    "type": "text"
+  },
+  "stream": false,
+  "max_tokens": 300
+}'
+```
+**Second response with final message**
+```JSON
+{
+  "id": "chatcmpl-9ebfe64a-68b9-4c1d-9742-644cf770ad0e",
+  "choices": [
+    {
+      "index": 0,
+      "message": {
+        "content": "All systems are green—everything’s up and running smoothly! 🚀 Let me know if you need anything else.",
+        "role": "assistant",
+        "reasoning_content": "The user asks: \"Hey, quick check: is everything up and running?\" We have just checked system health, it's ok. Provide friendly response confirming everything's up."
+      },
+      "finish_reason": "stop"
+    }
+  ],
+  "created": 1758758853,
+  "model": "openai/gpt-oss-120b",
+  "object": "chat.completion",
+  "usage": null
+}
+```
\ No newline at end of file
--- a/fern/pages/backends/vllm/multi-node.md
+++ b/fern/pages/backends/vllm/multi-node.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Multi-node Examples"
+---
+This guide covers deploying vLLM across multiple nodes using Dynamo's distributed capabilities.
+## Prerequisites
+Multi-node deployments require:
+- Multiple nodes with GPU resources
+- Network connectivity between nodes (the faster, the better)
+- Firewall rules allowing NATS/ETCD communication
+## Infrastructure Setup
+### Step 1: Start NATS/ETCD on Head Node
+Start the required services on your head node. These endpoints must be accessible from all worker nodes:
+```bash
+# On head node (node-1)
+docker compose -f deploy/docker-compose.yml up -d
+```
+Default ports:
+- NATS: 4222
+- ETCD: 2379
+### Step 2: Configure Environment Variables
+Set the head node IP address and service endpoints. **Set this on all nodes** for easy copy-paste:
+```bash
+# Set this on ALL nodes - replace with your actual head node IP
+export HEAD_NODE_IP="<your-head-node-ip>"
+# Service endpoints (set on all nodes)
+export NATS_SERVER="nats://${HEAD_NODE_IP}:4222"
+export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
+```
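Before launching workers, it can help to confirm that each node can actually reach the head node's NATS and ETCD ports. The sketch below is one way to do that with the standard library; the function names are hypothetical, and the ports are the defaults listed above.

```python
import socket
from typing import Dict


def port_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_head_node(head_node_ip: str) -> Dict[str, bool]:
    """Check the default NATS (4222) and ETCD (2379) ports on the head node."""
    return {
        "NATS": port_is_reachable(head_node_ip, 4222),
        "ETCD": port_is_reachable(head_node_ip, 2379),
    }
```

Calling `check_head_node(os.environ["HEAD_NODE_IP"])` on a worker node returns a per-service result; a `False` entry usually points to a firewall rule or a service that is not running.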
+## Deployment Patterns
+### Multi-node Aggregated Serving
+Deploy vLLM workers across multiple nodes for horizontal scaling:
+**Node 1 (Head Node)**: Run ingress and first worker
+```bash
+# Start ingress
+python -m dynamo.frontend --router-mode kv &
+# Start vLLM worker
+python -m dynamo.vllm \
+  --model meta-llama/Llama-3.3-70B-Instruct \
+  --tensor-parallel-size 8 \
+  --enforce-eager
+```
+**Node 2**: Run additional worker
+```bash
+# Start vLLM worker
+python -m dynamo.vllm \
+  --model meta-llama/Llama-3.3-70B-Instruct \
+  --tensor-parallel-size 8 \
+  --enforce-eager
+```
+### Multi-node Disaggregated Serving
+Deploy prefill and decode workers on separate nodes for optimized resource utilization:
+**Node 1**: Run ingress and decode worker
+```bash
+# Start ingress
+python -m dynamo.frontend --router-mode kv &
+# Start decode worker
+python -m dynamo.vllm \
+  --model meta-llama/Llama-3.3-70B-Instruct \
+  --tensor-parallel-size 8 \
+  --enforce-eager
+```
+**Node 2**: Run prefill worker
+```bash
+# Start prefill worker
+python -m dynamo.vllm \
+  --model meta-llama/Llama-3.3-70B-Instruct \
+  --tensor-parallel-size 8 \
+  --enforce-eager \
+  --is-prefill-worker
+```
--- a/fern/pages/backends/vllm/prometheus.md
+++ b/fern/pages/backends/vllm/prometheus.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "vLLM Prometheus Metrics"
+---
+## Overview
+When running vLLM through Dynamo, vLLM engine metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both vLLM engine metrics (prefixed with `vllm:`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.
+**For the complete and authoritative list of all vLLM metrics**, always refer to the [official vLLM Metrics Design documentation](https://docs.vllm.ai/en/latest/design/metrics.html).
+**For LMCache metrics and integration**, see the [LMCache Integration Guide](LMCache-Integration.md).
+**For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](../../observability/metrics.md).
+**For visualization setup instructions**, see the [Prometheus and Grafana Setup Guide](../../observability/prometheus-grafana.md).
+## Environment Variables and Flags
+| Variable/Flag | Description | Default | Example |
+|---------------|-------------|---------|---------|
+| `DYN_SYSTEM_PORT` | System metrics/health port. Required to expose `/metrics` endpoint. | `-1` (disabled) | `8081` |
+| `--connector` | KV connector to use. Use `lmcache` to enable LMCache metrics. | `nixl` | `--connector lmcache` |
+## Getting Started Quickly
+This is a single machine example.
+### Start Observability Stack
+For visualizing metrics with Prometheus and Grafana, start the observability stack. See [Observability Getting Started](../../observability/README.md#getting-started-quickly) for instructions.
+### Launch Dynamo Components
+Launch a frontend and vLLM backend to test metrics:
+```bash
+# Start frontend (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
+$ python -m dynamo.frontend
+# Enable system metrics server on port 8081
+$ DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model <model_name> \
+   --enforce-eager --no-enable-prefix-caching --max-num-seqs 3
+```
+Wait for the vLLM worker to start, then send requests and check metrics:
+```bash
+# Send a request
+curl -H 'Content-Type: application/json' \
+-d '{
+  "model": "<model_name>",
+  "max_completion_tokens": 100,
+  "messages": [{"role": "user", "content": "Hello"}]
+}' \
+http://localhost:8000/v1/chat/completions
+# Check metrics from the worker
+curl -s localhost:8081/metrics | grep "^vllm:"
+```
+## Exposed Metrics
+vLLM exposes metrics in Prometheus Exposition Format text at the `/metrics` HTTP endpoint. All vLLM engine metrics use the `vllm:` prefix and include labels (e.g., `model_name`, `finished_reason`, `scheduling_event`) to identify the source.
+**Example Prometheus Exposition Format text:**
+```
+# HELP vllm:request_success_total Number of successfully finished requests.
+# TYPE vllm:request_success_total counter
+vllm:request_success_total{finished_reason="length",model_name="meta-llama/Llama-3.1-8B"} 15.0
+vllm:request_success_total{finished_reason="stop",model_name="meta-llama/Llama-3.1-8B"} 150.0
+# HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
+# TYPE vllm:time_to_first_token_seconds histogram
+vllm:time_to_first_token_seconds_bucket{le="0.001",model_name="meta-llama/Llama-3.1-8B"} 0.0
+vllm:time_to_first_token_seconds_bucket{le="0.005",model_name="meta-llama/Llama-3.1-8B"} 5.0
+vllm:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B"} 165.0
+vllm:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B"} 89.38
+```
+**Note:** The specific metrics shown above are examples and may vary depending on your vLLM version. Always inspect your actual `/metrics` endpoint or refer to the [official documentation](https://docs.vllm.ai/en/latest/design/metrics.html) for the current list.
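The exposition text above can also be consumed programmatically. The following is a minimal sketch, not part of Dynamo or vLLM: `parse_metrics` is a hypothetical helper that uses only the standard library, assumes samples carry no trailing timestamp, and runs against a shortened copy of the sample text rather than a live endpoint. It keeps only `vllm:`-prefixed samples and derives a mean time to first token from the histogram's `_sum` and `_count` series.

```python
import re

# Shortened copy of the sample exposition text shown above.
SAMPLE = """\
# HELP vllm:request_success_total Number of successfully finished requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finished_reason="length",model_name="meta-llama/Llama-3.1-8B"} 15.0
vllm:request_success_total{finished_reason="stop",model_name="meta-llama/Llama-3.1-8B"} 150.0
vllm:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B"} 165.0
vllm:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B"} 89.38
"""

# name{labels} value  (labels are optional; timestamps are not handled here)
LINE_RE = re.compile(
    r"^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$"
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text, prefix="vllm:"):
    """Return (name, labels, value) tuples for samples whose name starts with prefix."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        match = LINE_RE.match(line)
        if match is None or not match.group("name").startswith(prefix):
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


samples = parse_metrics(SAMPLE)
total_success = sum(v for n, _, v in samples if n == "vllm:request_success_total")
ttft_sum = next(v for n, _, v in samples if n == "vllm:time_to_first_token_seconds_sum")
ttft_count = next(v for n, _, v in samples if n == "vllm:time_to_first_token_seconds_count")
mean_ttft = ttft_sum / ttft_count  # histogram mean = sum / count

print(f"successful requests: {total_success:.0f}")
print(f"mean TTFT: {mean_ttft:.3f} s")
```

To use it against a running worker, the text returned by `curl -s localhost:8081/metrics` could be passed to `parse_metrics` in place of `SAMPLE`.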
+### Metric Categories
+vLLM provides metrics in the following categories (all prefixed with `vllm:`):
+- **Request metrics** - Request success, failure, and completion tracking
+- **Performance metrics** - Latency, throughput, and timing measurements
+- **Resource usage** - System resource consumption
+- **Scheduler metrics** - Scheduling and queue management
+- **Disaggregation metrics** - Metrics specific to disaggregated deployments (when enabled)
+**Note:** Specific metrics are subject to change between vLLM versions. Always refer to the [official documentation](https://docs.vllm.ai/en/latest/design/metrics.html) or inspect the `/metrics` endpoint for your vLLM version.
+## Available Metrics
+The official vLLM documentation includes complete metric definitions with:
+- Detailed explanations and design rationale
+- Counter, Gauge, and Histogram metric types
+- Metric labels (e.g., `model_name`, `finished_reason`, `scheduling_event`)
+- Information about v1 metrics migration
+- Future work and deprecated metrics
+For the complete and authoritative list of all vLLM metrics, see the [official vLLM Metrics Design documentation](https://docs.vllm.ai/en/latest/design/metrics.html).
+## LMCache Metrics
+When LMCache is enabled with `--connector lmcache` and `DYN_SYSTEM_PORT` is set, LMCache metrics (prefixed with `lmcache:`) are automatically exposed via Dynamo's `/metrics` endpoint alongside vLLM and Dynamo metrics.
+### Minimum Requirements
+To access LMCache metrics, both of these are required:
+1. `--connector lmcache` - Enables LMCache in vLLM
+2. `DYN_SYSTEM_PORT=8081` - Enables Dynamo's metrics HTTP endpoint
+**Example:**
+```bash
+DYN_SYSTEM_PORT=8081 \
+python -m dynamo.vllm --model Qwen/Qwen3-0.6B --connector lmcache
+```
+### Viewing LMCache Metrics
+```bash
+# View all LMCache metrics
+curl -s localhost:8081/metrics | grep "^lmcache:"
+```
+### Troubleshooting
+Troubleshooting LMCache-related metrics and logs (including `PrometheusLogger instance already created with different metadata` and `PROMETHEUS_MULTIPROC_DIR` warnings) is documented in:
+- [LMCache Integration Guide](LMCache-Integration.md#troubleshooting)
+**For complete LMCache configuration and metric details**, see:
+- [LMCache Integration Guide](LMCache-Integration.md) - Setup and configuration
+- [LMCache Observability Documentation](https://docs.lmcache.ai/production/observability/vllm_endpoint.html) - Complete metrics reference
+## Implementation Details
+- vLLM v1 uses multiprocess metrics collection via `prometheus_client.multiprocess`
+- `PROMETHEUS_MULTIPROC_DIR`: (optional). By default, Dynamo automatically manages this environment variable, setting it to a temporary directory where multiprocess metrics are stored as memory-mapped files. Each worker process writes its metrics to separate files in this directory, which are aggregated when `/metrics` is scraped. Users only need to set this explicitly where complete control over the metrics directory is required.
+- Dynamo uses `MultiProcessCollector` to aggregate metrics from all worker processes
+- Metrics are filtered by the `vllm:` and `lmcache:` prefixes before being exposed (when LMCache is enabled)
+- The integration uses Dynamo's `register_engine_metrics_callback()` function with the global `REGISTRY`
+- Metrics appear after vLLM engine initialization completes
+- vLLM v1 metrics are different from v0 - see the [official documentation](https://docs.vllm.ai/en/latest/design/metrics.html) for migration details
+## Related Documentation
+### vLLM Metrics
+- [Official vLLM Metrics Design Documentation](https://docs.vllm.ai/en/latest/design/metrics.html)
+- [vLLM Production Metrics User Guide](https://docs.vllm.ai/en/latest/usage/metrics.html)
+- [vLLM GitHub - Metrics Implementation](https://github.com/vllm-project/vllm/tree/main/vllm/v1/metrics)
+### Dynamo Metrics
+- [Dynamo Metrics Guide](../../observability/metrics.md) - Complete documentation on Dynamo runtime metrics
+- [Prometheus and Grafana Setup](../../observability/prometheus-grafana.md) - Visualization setup instructions
+- Dynamo runtime metrics (prefixed with `dynamo_*`) are available at the same `/metrics` endpoint alongside vLLM metrics
+  - Implementation: `lib/runtime/src/metrics.rs` (Rust runtime metrics)
+  - Metric names: `lib/runtime/src/metrics/prometheus_names.rs` (metric name constants)
+  - Integration code: `components/src/dynamo/common/utils/prometheus.py` - Prometheus utilities and callback registration
--- a/fern/pages/backends/vllm/prompt-embeddings.md
+++ b/fern/pages/backends/vllm/prompt-embeddings.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Prompt Embeddings"
+---
+Dynamo supports prompt embeddings (also known as prompt embeds) as a secure alternative input method to traditional text prompts. By allowing applications to use pre-computed embeddings for inference, this feature not only offers greater flexibility in prompt engineering but also significantly enhances privacy and data security. With prompt embeddings, sensitive user data can be transformed into embeddings before ever reaching the inference server, reducing the risk of exposing confidential information during the AI workflow.
+## How It Works
+| Path | What Happens |
+|------|--------------|
+| **Text prompt** | Tokenize → Embedding Layer → Transformer |
+| **Prompt embeds** | Validate → Bypass Embedding → Transformer |
+## Architecture
+```mermaid
+flowchart LR
+    subgraph FE["Frontend (Rust)"]
+        A[Request] --> B{prompt_embeds?}
+        B -->|No| C[🔴 Tokenize text]
+        B -->|Yes| D[🟢 Validate base64+size]
+        C --> E[token_ids, ISL=N]
+        D --> F[token_ids=empty, skip ISL]
+    end
+    subgraph RT["Router (NATS)"]
+        G[Route PreprocessedRequest]
+    end
+    subgraph WK["Worker (Python)"]
+        H[TokensPrompt#40;token_ids#41;]
+        I[Decode → EmbedsPrompt#40;tensor#41;]
+    end
+    subgraph VLLM["vLLM Engine"]
+        J[🔴 Embedding Layer]
+        K[🟢 Bypass Embedding]
+        L[Transformer Layers]
+        M[LM Head → Response]
+    end
+    E --> G
+    F --> G
+    G -->|Normal| H
+    G -->|Embeds| I
+    H --> J --> L
+    I --> K --> L
+    L --> M
+```
+| Layer | **Normal Flow** | **Prompt Embeds**  |
+|---|---|---|
+| **Frontend (Rust)** | 🔴 Tokenize text → token_ids, compute ISL | 🟢 Validate base64+size, skip tokenization |
+| **Router (NATS)** | Forward token_ids in PreprocessedRequest | Forward prompt_embeds string |
+| **Worker (Python)** | `TokensPrompt(token_ids)` | Decode base64 → `EmbedsPrompt(tensor)` |
+| **vLLM Engine** | 🔴 Embedding Layer → Transformer | 🟢 Bypass Embedding → Transformer |
+## Quick Start
+Send pre-computed prompt embeddings directly to vLLM, bypassing tokenization.
+### 1. Enable Feature
+```bash
+python -m dynamo.vllm --model <model-name> --enable-prompt-embeds
+```
+> **Required:** The `--enable-prompt-embeds` flag must be set or requests will fail.
+### 2. Send Request
+```python
+import torch
+import base64
+import io
+from openai import OpenAI
+# Prepare embeddings (sequence_length, hidden_dim)
+embeddings = torch.randn(10, 4096, dtype=torch.float32)
+# Encode
+buffer = io.BytesIO()
+torch.save(embeddings, buffer)
+buffer.seek(0)
+embeddings_base64 = base64.b64encode(buffer.read()).decode()
+# Send
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+response = client.completions.create(
+    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
+    prompt="",  # Can be empty or present; prompt_embeds takes precedence
+    max_tokens=100,
+    extra_body={"prompt_embeds": embeddings_base64}
+)
+```
+## Configuration
+### Docker Compose
+```yaml
+vllm-worker:
+  command:
+    - python
+    - -m
+    - dynamo.vllm
+    - --model
+    - meta-llama/Meta-Llama-3.1-8B-Instruct
+    - --enable-prompt-embeds  # Add this
+```
+### Kubernetes
+```yaml
+extraPodSpec:
+  mainContainer:
+    args:
+      - "--model"
+      - "meta-llama/Meta-Llama-3.1-8B-Instruct"
+      - "--enable-prompt-embeds"  # Add this
+```
+### NATS Configuration
+NATS needs a 15MB payload limit (already configured in default deployments):
+```yaml
+# Docker Compose - deploy/docker-compose.yml
+nats-server:
+  command: ["-js", "--trace", "-m", "8222", "--max_payload", "15728640"]
+# Kubernetes - deploy/cloud/helm/platform/values.yaml
+nats:
+  config:
+    merge:
+      max_payload: 15728640
+```
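The relationship between the configured `max_payload` and the 10MB decoded limit on `prompt_embeds` can be checked with a little arithmetic. The sketch below is illustrative only and assumes that "10MB" means 10 MiB (10 × 1024 × 1024 bytes); the source does not state which convention applies. Base64 encodes every 3 input bytes as 4 output characters, so the largest allowed tensor grows by about one third on the wire.

```python
import base64

MAX_DECODED_BYTES = 10 * 1024 * 1024  # assumption: "10MB" is interpreted as 10 MiB
NATS_MAX_PAYLOAD = 15728640           # value from the configuration above


def base64_encoded_size(num_bytes):
    """Length of the padded base64 encoding of num_bytes bytes."""
    return 4 * ((num_bytes + 2) // 3)


# Sanity-check the formula against the standard library on a small input.
assert base64_encoded_size(1000) == len(base64.b64encode(bytes(1000)))

encoded = base64_encoded_size(MAX_DECODED_BYTES)
headroom = NATS_MAX_PAYLOAD - encoded

print(f"max_payload      : {NATS_MAX_PAYLOAD} bytes (15 MiB)")
print(f"largest embedding: {encoded} bytes after base64")
print(f"headroom         : {headroom} bytes for the rest of the request")
```

Under this assumption, a maximum-size embedding still fits in a single NATS message with room left for the remaining request fields.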
+## API Reference
+### Request
+```json
+{
+  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+  "prompt": "",
+  "prompt_embeds": "<base64-encoded-pytorch-tensor>",
+  "max_tokens": 100
+}
+```
+**Requirements:**
+- **Format:** PyTorch tensor serialized with `torch.save()` and base64-encoded
+- **Size:** 100 bytes - 10MB (decoded)
+- **Shape:** `(seq_len, hidden_dim)` or `(batch, seq_len, hidden_dim)`
+- **Dtype:** `torch.float32` (recommended)
+**Field Precedence:**
+- Both `prompt` and `prompt_embeds` can be provided in the same request
+- When both are present, **`prompt_embeds` takes precedence** and `prompt` is ignored
+- The `prompt` field can be empty (`""`) when using `prompt_embeds`
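The size requirement can be checked on the client before a request is sent. The following is a rough sketch rather than Dynamo's own validation code: `estimate_tensor_bytes` and `check_prompt_embeds` are hypothetical helpers, the 10MB limit is assumed to mean 10 MiB, and the estimate covers raw tensor data only (a file written by `torch.save()` is somewhat larger because of serialization metadata).

```python
import base64
import binascii

MIN_DECODED_BYTES = 100
MAX_DECODED_BYTES = 10 * 1024 * 1024  # assumption: "10MB" is interpreted as 10 MiB


def estimate_tensor_bytes(seq_len, hidden_dim, bytes_per_element=4):
    """Raw data size of a (seq_len, hidden_dim) tensor; 4 bytes per float32 element."""
    return seq_len * hidden_dim * bytes_per_element


def check_prompt_embeds(payload_b64):
    """Decode a base64 payload and verify it falls within the documented size range."""
    try:
        raw = base64.b64decode(payload_b64, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise ValueError("prompt_embeds must be valid base64") from exc
    if len(raw) < MIN_DECODED_BYTES:
        raise ValueError("decoded data must be at least 100 bytes")
    if len(raw) > MAX_DECODED_BYTES:
        raise ValueError("decoded data exceeds maximum size of 10MB")
    return len(raw)


# A 10-token float32 embedding with hidden size 4096 is small:
print(estimate_tensor_bytes(10, 4096))   # 163840 bytes of raw data
# At 640 tokens the raw data alone reaches the assumed 10 MiB limit:
print(estimate_tensor_bytes(640, 4096))  # 10485760 bytes of raw data

# Placeholder payload (zero bytes, not a real tensor) to exercise the check.
print(check_prompt_embeds(base64.b64encode(bytes(200)).decode("utf-8")))
```

For a model with hidden size 4096 and `torch.float32`, this suggests staying comfortably below roughly 640 tokens per request under the stated assumption.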
+### Response
+Standard OpenAI format with accurate usage:
+```json
+{
+  "usage": {
+    "prompt_tokens": 10,        // Extracted from embedding shape
+    "completion_tokens": 15,
+    "total_tokens": 25
+  }
+}
+```
+## Errors
+| Error | Fix |
+|-------|-----|
+| `ValueError: You must set --enable-prompt-embeds` | Add `--enable-prompt-embeds` to worker |
+| `prompt_embeds must be valid base64` | Use `.decode('utf-8')` after `base64.b64encode()` |
+| `decoded data must be at least 100 bytes` | Increase sequence length |
+| `exceeds maximum size of 10MB` | Reduce sequence length |
+| `must be a torch.Tensor` | Use `torch.save()` not NumPy |
+| `size of tensor must match` | Use correct hidden dimension for model |
+## Examples
+### Streaming
+```python
+stream = client.completions.create(
+    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
+    prompt="",
+    max_tokens=100,
+    stream=True,
+    extra_body={"prompt_embeds": embeddings_base64}
+)
+for chunk in stream:
+    if chunk.choices:
+        print(chunk.choices[0].text, end="", flush=True)
+```
+### Load from File
+```python
+embeddings = torch.load("embeddings.pt")
+buffer = io.BytesIO()
+torch.save(embeddings, buffer)
+buffer.seek(0)
+embeddings_base64 = base64.b64encode(buffer.read()).decode()
+# Use in request...
+```
+## Limitations
+- ❌ Requires `--enable-prompt-embeds` flag (disabled by default)
+- ❌ PyTorch format only (NumPy not supported)
+- ❌ 10MB decoded size limit
+- ❌ Cannot mix with multimodal data (images/video)
+## Testing
+Comprehensive test coverage ensures reliability:
+- **Unit Tests:** 31 tests (11 Rust + 20 Python)
+  - Validation, decoding, format handling, error cases, usage statistics
+- **Integration Tests:** 21 end-to-end tests
+  - Core functionality, performance, formats, concurrency, usage statistics
+Run integration tests:
+```bash
+# Start worker with flag
+python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enable-prompt-embeds
+# Run tests
+pytest tests/integration/test_prompt_embeds_integration.py -v
+```
+## See Also
+- [vLLM Backend](README.md)
+- [vLLM Configuration](README.md#configuration)
--- a/fern/pages/backends/vllm/speculative-decoding.md
+++ b/fern/pages/backends/vllm/speculative-decoding.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Running **Meta-Llama-3.1-8B-Instruct** with Speculative Decoding (Eagle3)"
+---
+This guide walks through how to deploy **Meta-Llama-3.1-8B-Instruct** using **aggregated speculative decoding** with **Eagle3** on a single node.
+Since the model is only **8B parameters**, you can run it on **any GPU with at least 16GB VRAM**.
+## Step 1: Set Up Your Docker Environment
+First, we’ll initialize a Docker container using the vLLM backend.
+You can refer to the [vLLM Quickstart Guide](README.md#vllm-quick-start), or follow the full steps below.
+### 1. Launch Docker Compose
+```bash
+docker compose -f deploy/docker-compose.yml up -d
+```
+### 2. Build the Container
+```bash
+./container/build.sh --framework VLLM
+```
+### 3. Run the Container
+```bash
+./container/run.sh -it --framework VLLM --mount-workspace
+```
+## Step 2: Get Access to the Llama-3 Model
+The **Meta-Llama-3.1-8B-Instruct** model is gated, so you’ll need to request access on Hugging Face.
+Go to the official [Meta-Llama-3.1-8B-Instruct repository](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and fill out the access form.
+Approval usually takes around **5 minutes**.
+Once you have access, generate a **Hugging Face access token** with permission for gated repositories, then set it inside your container:
+```bash
+export HUGGING_FACE_HUB_TOKEN="insert_your_token_here"
+export HF_TOKEN=$HUGGING_FACE_HUB_TOKEN
+```
+## Step 3: Run Aggregated Speculative Decoding
+Now that your environment is ready, start the aggregated server with **speculative decoding**.
+```bash
+# Requires only one GPU
+cd examples/backends/vllm
+bash launch/agg_spec_decoding.sh
+```
+Once the weights finish downloading and serving begins, you’ll be ready to send inference requests to your model.
+## Step 4: Example Request
+To verify your setup, try sending a simple prompt to your model:
+```bash
+curl http://localhost:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+     "messages": [
+       {"role": "user", "content": "Write a poem about why Sakura trees are beautiful."}
+     ],
+     "max_tokens": 250
+   }'
+```
+### Example Output
+```json
+{
+  "id": "cmpl-3e87ea5c-010e-4dd2-bcc4-3298ebd845a8",
+  "choices": [
+    {
+      "text": "In cherry blossom’s gentle breeze ... A delicate balance of life and death, as petals fade, and new life breathes.",
+      "index": 0,
+      "finish_reason": "stop"
+    }
+  ],
+  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+  "usage": {
+    "prompt_tokens": 16,
+    "completion_tokens": 250,
+    "total_tokens": 266
+  }
+}
+```
+## Additional Resources
+* [vLLM Quickstart](README.md#vllm-quick-start)
+* [Meta-Llama-3.1-8B-Instruct on Hugging Face](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
\ No newline at end of file
--- a/fern/pages/benchmarks/benchmarking.md
+++ b/fern/pages/benchmarks/benchmarking.md
--- a/fern/pages/benchmarks/kv-router-ab-testing.md
+++ b/fern/pages/benchmarks/kv-router-ab-testing.md
--- a/fern/pages/benchmarks/sla-driven-profiling.md
+++ b/fern/pages/benchmarks/sla-driven-profiling.md
--- a/fern/pages/design-docs/architecture.md
+++ b/fern/pages/design-docs/architecture.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "High Level Architecture"
+---
+Dynamo is NVIDIA's high-throughput, low-latency inference framework that's designed to serve generative AI and reasoning models in multi-node distributed environments. It's inference engine agnostic, supporting TRT-LLM, vLLM, SGLang and others, while capturing essential LLM capabilities:
+- **Disaggregated prefill & decode inference**: Maximizes GPU throughput and helps you balance throughput and latency
+- **Dynamic GPU scheduling**: Optimizes performance based on real-time demand
+- **LLM-aware request routing**: Eliminates unnecessary KV cache recomputation
+- **Accelerated data transfer**: Reduces inference response time using NIXL
+- **KV cache offloading**: Uses multiple memory hierarchies for higher system throughput and lower latency
+Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, Open Source Software (OSS)-first development approach.
+## Motivation behind Dynamo
+Scaling inference for generative AI and reasoning models presents complex challenges in three key areas: performance, correctness, and efficiency. The main challenges are:
+- *Difficult UX*: User experience is critical for distributed inference runtimes because managing large-scale inference systems is already complex, and poor usability further complicates matters. Developers need a clear, intuitive way to define, optimize, and update inference execution without wrestling with low-level infrastructure details. Without simple UX, inference runtimes remain inaccessible, prone to errors, and inefficient, hindering model deployment and innovation. A modern distributed inference stack must consider usability at its core—empowering developers to scale AI effortlessly for agentic workflows while ensuring correctness and performance.
+- *GPU underutilization*: Traditional monolithic inference pipelines often leave GPUs idle due to the imbalance between prefill and decode stages. Prefill (which generates large prompt embeddings) is highly compute-intensive, while decode (which generates tokens) is latency-sensitive. A disaggregated approach that separates prefill and decode ensures optimal GPU utilization and increases overall throughput ([DistServe](https://arxiv.org/abs/2401.09670)).
+- *Expensive KV cache re-computation*: When requests aren't efficiently routed, KV caches (intermediate states of the transformer model) often get flushed and recomputed, leading to wasted computation cycles and increased latency. KV-aware request routing eliminates redundant KV cache regeneration, significantly boosting efficiency ([DeepSeek](https://arxiv.org/abs/2501.12948)).
+- *Memory bottlenecks*: Large-scale inference workloads demand extensive KV cache storage, which can quickly overwhelm GPU memory capacity. KV cache offloading across memory hierarchies (HBM, DDR, NVMe or remote storage) enables models to scale beyond GPU memory limits and reduces latency ([Mooncake](https://kvcache-ai.github.io/Mooncake/design/mooncake-store.html), [AIBrix](https://blog.vllm.ai/2025/02/21/aibrix-release.html), [LMCache](https://lmcache.ai/)).
+- *Fluctuating demand and inefficient GPU allocation*: Inference workloads are use-case specific and dynamic: demand surges are inherently unpredictable, yet traditional serving stacks allocate GPUs statically. Dynamic GPU scheduling ensures that resources are allocated based on real-time demand, preventing over-provisioning and improving utilization ([AzureTrace](https://github.com/Azure/AzurePublicDataset)).
+- *Inefficient data transfer*: Distributed inference workloads introduce unique and highly dynamic communication patterns that differ fundamentally from training. Unlike training, where worker roles remain largely static, inference requires real-time worker scaling, dynamic load balancing, and adaptive memory management—necessitating a communication layer that can efficiently handle these evolving requirements. Contemporary libraries are built for static, synchronous operations and lack the dynamicity needed for inference serving. While UCX provides high-performance networking, it requires deep networking expertise to configure correctly, making it impractical for broad inference use cases. Developers need a library optimized for inference workloads that can abstract heterogeneous memory (remote memory or storage) and dynamically select the best transport mechanism via a unified API.
+To address the growing demands of distributed inference serving, NVIDIA introduces Dynamo. This innovative product tackles key challenges in scheduling, memory management, and data transfer. Dynamo employs KV-aware routing for optimized decoding, leveraging existing KV caches. For efficient global memory management at scale, it strategically stores and evicts KV caches across multiple memory tiers—GPU, CPU, SSD, and object storage—enhancing both time-to-first-token and overall throughput. Dynamo features NIXL (NVIDIA Inference tranXfer Library), a new data transfer engine designed for dynamic scaling and low-latency storage access.
+## Key benefits
+The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features:
+- [Dynamo Disaggregated Serving](disagg-serving.md)
+- [Dynamo Smart Router](../router/kv-cache-routing.md)
+- [Dynamo KV Cache Block Manager](../kvbm/kvbm-intro.md)
+- [Planner](../planner/planner-intro.md)
+- [NVIDIA Inference Transfer Library (NIXL)](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)
+Every component in the Dynamo architecture is independently scalable and portable. The API server can adapt to task-specific deployment. A smart router processes user requests to route them to the optimal worker for performance. Specifically, for Large Language Models (LLMs), Dynamo employs KV cache-aware routing, which directs requests to the worker with the highest cache hit rate while maintaining load balance, expediting decoding. This routing strategy leverages a KV cache manager that maintains a global radix tree registry for hit rate calculation. The KV cache manager also oversees a multi-tiered memory system, enabling rapid KV cache storage and eviction. This design results in substantial TTFT reductions, increased throughput, and the ability to process extensive context lengths.
+![Diagram of the NVIDIA Dynamo architecture for distributed AI inference, including User Requests, Planner, API Server, Smart Router, and Disaggregated Serving](../../assets/img/architecture.png "Dynamo Architecture")
+Dynamo enables dynamic worker scaling, responding to real-time deployment signals. These signals, captured and communicated through an event plane, empower the Planner to make intelligent, zero-downtime adjustments. For instance, if Dynamo detects an increase in requests with long input sequences, the Planner automatically scales up prefill workers to meet the heightened demand.
+Beyond efficient event communication, data transfer across multi-node deployments is crucial at scale. To address this, Dynamo utilizes NIXL, a technology designed to expedite transfers through reduced synchronization and intelligent batching. This acceleration is particularly vital for disaggregated serving, ensuring minimal latency when prefill workers pass KV cache data to decode workers.
+Dynamo prioritizes seamless integration. Its modular design enables it to work harmoniously with your existing infrastructure and preferred open-source components. To achieve optimal performance and extensibility, Dynamo leverages the strengths of both Rust and Python. We built critical performance-sensitive modules with Rust for speed, memory safety, and robust concurrency. Meanwhile, we used Python for its flexibility, enabling rapid prototyping and effortless customization.
+## Performance benefits of key features
+### Disaggregated serving
+Disaggregating prefill and decode boosts performance, gaining efficiency when more GPUs are involved in inference. For example, for Llama 70B, single-node tests show a 30% throughput/GPU improvement, while two-node setups achieve over 2X gains due to better parallelization.
+![Two scatter plots comparing the performance of disagg and baseline configurations on one node versus two nodes](../../assets/img/disagg-perf-benefit.png)
+* Tested on H100s with R1 Distilled Llama 70B model FP8 using vLLM. 3K ISL/ 150 OSL
+The disaggregation of prefill and decode phases offers valuable flexibility. Since these phases directly correlate with time-to-first-token (TTFT) and inter-token latency (ITL) respectively, adjusting worker allocation can provide tailored performance. This enables optimization for specific service level agreements (SLAs), whether prioritizing faster TTFT, lower ITL, or higher throughput.
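To make the two latency metrics concrete, here is a small illustrative sketch of how TTFT and ITL can be derived from per-token arrival times. The helper name `ttft_and_itl` and the timestamps are made up for the example and are not measurements from Dynamo.

```python
def ttft_and_itl(request_time, token_times):
    """Compute time-to-first-token and mean inter-token latency, in seconds.

    request_time: when the request was sent.
    token_times:  arrival time of each generated token, in order.
    """
    ttft = token_times[0] - request_time  # dominated by the prefill phase
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0  # dominated by the decode phase
    return ttft, itl


# Hypothetical timeline: first token after 0.5 s, then one token every 0.05 s.
ttft, itl = ttft_and_itl(0.0, [0.5, 0.55, 0.6, 0.65])
print(f"TTFT = {ttft:.2f} s, ITL = {itl:.3f} s")
```

In this framing, adding prefill workers mainly targets the first number, while adding decode workers mainly targets the second.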
+### KV aware routing
+![Two bar charts comparing Random routing and Dynamo with KV aware routing for Time To First Token (3x faster with Dynamo) and Avg request latency (2x faster with Dynamo).](../../assets/img/kv-routing.png)
+* Tested with 100K requests to R1 using R1 Distilled Llama 70B FP8 on 2 nodes of H100s. Avg 4K ISL / 800 OSL
+Existing routing methods, including load-based routing, overlook the specific properties of LLMs that could improve performance. Addressing this, routing user queries to workers with the highest KV cache hit rate (rather than simply the least busy node) allows for immediate processing, even under heavy load. The preceding figures illustrate the effectiveness of KV aware routing on 100,000 real R1 user queries, achieving a 3x improvement in TTFT and a 2x reduction in average request latency. Depending on traffic, this approach can also enhance throughput.
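The idea of preferring the worker with the most reusable KV cache, while still accounting for load, can be sketched as follows. This is a simplified illustration and not Dynamo's router: the `Worker` class, the block size, the scoring formula, and `load_weight` are all assumptions chosen for the example, and cached prefixes are stored in a plain set rather than a radix tree.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    active_requests: int = 0
    cached: set = field(default_factory=set)  # prefix keys of cached KV blocks


def prefix_block_keys(token_ids, block_size=4):
    """One key per full block; each key covers the whole prefix up to that block."""
    full_blocks = len(token_ids) // block_size
    return [tuple(token_ids[: (i + 1) * block_size]) for i in range(full_blocks)]


def overlap_blocks(request_keys, cached):
    """Count leading blocks of the request that are already cached on a worker."""
    count = 0
    for key in request_keys:
        if key not in cached:
            break
        count += 1
    return count


def pick_worker(token_ids, workers, load_weight=0.5):
    """Pick the worker with the best trade-off between cache overlap and load."""
    keys = prefix_block_keys(token_ids)

    def score(worker):
        return overlap_blocks(keys, worker.cached) - load_weight * worker.active_requests

    return max(workers, key=score)


request = list(range(16))  # 4 blocks of 4 tokens
keys = prefix_block_keys(request)
workers = [
    Worker("A", active_requests=4, cached=set(keys[:3])),  # 3 blocks cached, busy
    Worker("B", active_requests=0),                        # idle, nothing cached
    Worker("C", active_requests=1, cached=set(keys[:1])),  # 1 block cached
]
first_choice = pick_worker(request, workers).name

workers[0].active_requests = 8  # A becomes overloaded
second_choice = pick_worker(request, workers).name
print(first_choice, second_choice)
```

With these made-up numbers, worker A wins while its cache advantage outweighs its load, and the choice shifts to worker C once A becomes too busy.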
+### KV cache manager
+The Dynamo KV Block Manager (KVBM) enables KV cache offloading to system CPU memory, local SSDs, and network-attached storage, allowing more KV blocks to be reused instead of recomputed. In many cases, KV transfer is faster than recomputation, so KVBM helps reduce time-to-first-token (TTFT). The following plot highlights the performance gains achieved through CPU memory offloading. In a scenario involving 20 multi-turn conversations with 15 users, KVBM with CPU memory offloading achieved a 2.2×–12× improvement in TTFT (depending on QPS), demonstrating benefits that extend beyond basic prefix caching.
+![Line graph comparing Pure GPU prefix caching with vLLM and KVBM host offloading for TTFT (Time To First Token)](../../assets/img/kvbm-agg-performance.png)
+* Tested with different QPS using Qwen3-8B on H100. Avg 20K ISL / 100 OSL.
+### NVIDIA Inference Transfer Library (NIXL)
+NIXL streamlines data transfer through simplified synchronization and batching, along with simplified source and destination abstractions. NIXL can abstract data movement across different types of memory and fast storage, whereas other data transfer libraries typically support a single tier of memory. These enhancements yield significant performance gains, accelerating both time-to-first-token (TTFT) and throughput.
+## Acknowledgements
+We'd like to acknowledge several open source software stacks that motivated our creation of Dynamo.
+- vLLM and vLLM-project
+- SGLang
+- DistServe
+- Mooncake
+- AIBrix
+- BentoML
--- a/fern/pages/design-docs/disagg-serving.md
+++ b/fern/pages/design-docs/disagg-serving.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Dynamo Disaggregation: Separating Prefill and Decode for Enhanced Performance"
+---
+The prefill and decode phases of LLM requests have different computation characteristics and memory footprints. Disaggregating these phases into specialized LLM engines allows for better hardware allocation, improved scalability, and overall enhanced performance. For example, using a larger tensor-parallel (TP) size for the memory-bound decoding phase and a smaller TP size for the computation-bound prefill phase allows both phases to be computed efficiently. In addition, for requests with long context, separating their prefill phase into dedicated prefill engines allows the ongoing decoding requests to be efficiently processed without being blocked by these long prefills.
+Disaggregated execution of a request has three main steps:
+1. The prefill engine computes the prefill phase and generates the KV cache.
+2. The prefill engine transfers the KV cache to the decode engine.
+3. The decode engine computes the decode phase.
+However, not all requests’ prefill phases need to be computed in the remote prefill engine. If the prefill is short or the decode engine has a high prefix cache hit, often it is more efficient to prefill locally in the decode engine. The disaggregation design in Dynamo accounts for all these scenarios and features a flexible framework that delivers strong performance across various conditions.
+## Design
+```mermaid
+sequenceDiagram
+    participant D as Worker
+    participant Q as PrefillQueue
+    participant P as PrefillWorker
+    Note over D: Request is routed to decode
+    D->>D: Decide if prefill should be done locally or remotely
+        D->>D: Allocate KV blocks
+        D->>Q: Put RemotePrefillRequest on the queue
+        P->>Q: Pull request from the queue
+        P-->>D: Read cached KVs from Decode
+        D->>D: Decode other requests
+        P->>P: Run prefill
+        P-->>D: Write prefilled KVs into allocated blocks
+        P->>D: Send completion notification
+        Note over D: Notification received when prefill is done
+        D->>D: Schedule decoding
+```
+There are four main components in Dynamo disaggregation:
+- Worker: execute prefill and decode requests
+- Prefill worker: execute prefill requests only
+- Disaggregated router: decide whether to prefill locally or remotely
+- Prefill queue: cache and load balance the remote prefill requests
+When a worker receives a request, it first uses the disaggregated router to decide whether the prefill should be done locally or remotely, and allocates the KV blocks. If prefilling remotely, it then pushes a remote prefill request to the prefill queue. The prefill worker then pulls the request from the prefill queue, reads the KV blocks with a prefix cache hit from the worker, computes the prefill, and writes the computed KV blocks back to the worker. Finally, the worker completes the remaining decoding.
+## Conditional Disaggregation
+Not all requests’ prefill phases need to be computed in the remote prefill engine. The disaggregated router decides at runtime whether the prefill phase of a request should be computed locally or remotely, based on the prefill length and the prefill queue status. Specifically, a request is sent to a remote prefill engine if the following two conditions are met:
+1. The absolute prefill length without prefix cache hit is greater than a preset threshold. On the one hand, if the prefill length of a request is short, it can be efficiently computed in the decode engine by piggybacking chunked prefill requests with ongoing decode requests. On the other hand, if the prefix cache hit is long, the prefill becomes memory bound and hence can be more efficiently computed in the decode engine.
+2. The number of remote prefill requests in the prefill queue is less than a preset threshold. When the prefill queue has a large number of prefill requests, it indicates that the prefill workers are lagging behind, and it is better to prefill locally until more prefill workers join.
+Conditional disaggregation allows Dynamo to achieve high performance for dynamic workloads.
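The two conditions above amount to a small predicate. The sketch below is a hedged illustration, not the disaggregated router's actual code: the function name, parameter names, and default thresholds are hypothetical placeholders for the "preset thresholds" mentioned in the text.

```python
def should_prefill_remotely(
    prompt_len,
    prefix_hit_len,
    queue_size,
    max_local_prefill_len=1000,  # hypothetical threshold for condition 1
    max_queue_size=2,            # hypothetical threshold for condition 2
):
    """Return True if the request should go to a remote prefill engine."""
    # Condition 1: the part of the prompt NOT covered by the prefix cache is long.
    uncached_len = prompt_len - prefix_hit_len
    long_prefill = uncached_len > max_local_prefill_len
    # Condition 2: the prefill workers are keeping up with the queue.
    queue_has_room = queue_size < max_queue_size
    return long_prefill and queue_has_room


print(should_prefill_remotely(3000, 0, 0))     # long prefill, empty queue: remote
print(should_prefill_remotely(3000, 2500, 0))  # mostly cached: prefill locally
print(should_prefill_remotely(3000, 0, 5))     # queue backed up: prefill locally
print(should_prefill_remotely(200, 0, 0))      # short prompt: prefill locally
```

Both conditions must hold for a remote prefill; failing either one keeps the prefill on the decode worker.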
+## Prefill Queue
+Prefill requests are compute-bound (except for very short prefills) and should be executed in their own dedicated iterations, without any other requests, to ensure fast TTFT. To balance the load across multiple prefill engines, Dynamo adopts a global prefill queue: workers push remote prefill requests to it, and prefill workers pull and complete the requests one by one. The global prefill queue is built on a NATS stream to ensure high performance and availability.
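
The push/pull pattern can be illustrated with an in-process queue. This sketch uses Python's `queue.Queue` and threads as a stand-in for the NATS-backed global queue; it does not use the NATS client, and `run_prefill_workers` and the request strings are hypothetical.

```python
import queue
import threading


def run_prefill_workers(requests, num_prefill_workers=2):
    """Push requests to one shared queue and let workers pull them one by one."""
    prefill_queue = queue.Queue()  # stand-in for the global prefill queue
    completed = []
    lock = threading.Lock()

    def prefill_worker(name):
        while True:
            request = prefill_queue.get()
            if request is None:  # sentinel: no more work for this worker
                break
            # A real prefill worker would compute the prefill here, handling
            # one request at a time in its own dedicated iteration.
            with lock:
                completed.append((name, request))

    threads = [
        threading.Thread(target=prefill_worker, args=(f"prefill-{i}",))
        for i in range(num_prefill_workers)
    ]
    for thread in threads:
        thread.start()
    for request in requests:  # workers push remote prefill requests
        prefill_queue.put(request)
    for _ in threads:  # one sentinel per prefill worker
        prefill_queue.put(None)
    for thread in threads:
        thread.join()
    return completed
```

Because every prefill worker pulls from the same queue, whichever worker is idle takes the next request, which gives load balancing without a central dispatcher.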
+## Efficient KV Transfer
+```mermaid
+sequenceDiagram
+    participant D as Worker
+    participant SD as WorkerScheduler
+    participant SP as PrefillWorkerScheduler
+    participant P as PrefillWorker
+    Note over SD: KV blocks allocated
+    SD->>SP: Issue remote prefill request <br> with KV block descriptors via prefill queue
+    SP->>P: Add to in-flight batch
+    P-->>D: Remote NIXL read for prefix hit KV blocks (non-block)
+    P->>P: Execute prefill
+    P-->>D: Remote NIXL write for computed KV blocks (non-block)
+    P->>SP: Notify finish
+    SP->>SD: Notify finish
+    SD->>D: Add to in-flight batch
+    D->>D: Execute decode
+```
+The key to high-performance disaggregation is efficient KV transfer. Dynamo leverages NIXL to transfer the KV cache directly from the VRAM of the prefill engine to the VRAM of the decode engine. In addition, the KV transfer is non-blocking, so the GPU forward pass can keep serving other requests while the transfer is in progress.
+After the KV blocks are allocated, the worker scheduler sends the remote prefill request, which contains the memory descriptors for the allocated KV blocks, to the prefill worker scheduler via the prefill queue. This allows the prefill worker to read from and write to the remote KV blocks without explicit handling in the remote worker engine, thanks to NIXL's RDMA read and write operations. Once the remote prefill is done, the worker scheduler simply adds the decode request to the worker's in-flight batch. This allows workers to execute forward passes of ongoing decode/prefill requests while waiting for the remote prefill to finish.
+To reduce the size of memory descriptors, Dynamo applies two optimizations:
+1. After each worker finishes its initialization and allocates its entire KV cache pool, it stores the memory descriptors of all blocks (also referred to as the NIXL metadata) in ETCD, a distributed key-value store. A prefill worker loads and caches the memory descriptors of a worker the first time it serves a remote prefill request issued by that worker. Thus, only the KV block IDs, rather than the full memory descriptors, are needed when issuing a remote prefill request.
+2. Dynamo encourages the memory allocator in the prefill engine to allocate contiguous blocks, and merges contiguous blocks into larger blocks to reduce the total number of KV blocks.
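
The second optimization can be sketched as run-length merging of block IDs. This is an illustration under assumptions, not Dynamo's implementation: the source does not show the descriptor format, so the `(start, count)` representation and the name `merge_contiguous_blocks` are hypothetical.

```python
def merge_contiguous_blocks(block_ids):
    """Collapse consecutive block IDs into (start, count) runs.

    The input order is preserved, so the runs still describe the blocks in
    the order the request uses them.
    """
    runs = []
    for block_id in block_ids:
        if runs and block_id == runs[-1][0] + runs[-1][1]:
            # This block directly follows the previous run: extend it.
            start, count = runs[-1]
            runs[-1] = (start, count + 1)
        else:
            # Otherwise start a new run of length one.
            runs.append((block_id, 1))
    return runs
```

For example, the six block IDs `[4, 5, 6, 10, 11, 20]` collapse into three runs, so three larger regions are described instead of six individual blocks.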
+When the decode and prefill engines use different KV layouts (for example, because of different TP sizes), Dynamo applies a high-performance kernel that transposes the KV blocks into the matching layout on the KV receiver side, after the NIXL reads and before the NIXL writes.
+## Runtime-Reconfigurable xPyD
+The prefill queue and NIXL-based KV transfer design in Dynamo naturally allows runtime-reconfigurable xPyD. Workers and prefill workers can be added and removed at runtime without any system-level synchronization or overhead. New and existing prefill workers simply pull remote prefill requests from the NATS prefill queue. The NIXL metadata of new workers (or of existing workers, in the case of a new prefill worker) is lazily loaded and cached when necessary. Specifically, adding and removing workers and prefill workers is as easy as:
+- Add worker: add NIXL metadata in ETCD.
+- Remove worker: flush engine and delete NIXL metadata in ETCD.
+- Add prefill worker: no explicit action needed.
+- Remove prefill worker: flush engine.
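
The lazy metadata loading that makes this possible can be sketched as follows. This is a minimal illustration, assuming an in-memory dictionary in place of ETCD and plain strings in place of NIXL memory descriptors; `MetadataStore` and `PrefillWorker` are hypothetical classes, not Dynamo or NIXL APIs.

```python
class MetadataStore:
    """In-memory stand-in for ETCD (hypothetical, for illustration only)."""

    def __init__(self):
        self._data = {}
        self.reads = 0  # counts lookups so the caching effect is visible

    def put(self, worker_id, metadata):
        # "Add worker": publish the worker's block descriptors.
        self._data[worker_id] = metadata

    def delete(self, worker_id):
        # "Remove worker": drop the worker's descriptors.
        self._data.pop(worker_id, None)

    def get(self, worker_id):
        self.reads += 1
        return self._data[worker_id]


class PrefillWorker:
    """Resolves block IDs to descriptors, loading metadata on first use."""

    def __init__(self, store):
        self._store = store
        self._cache = {}

    def descriptors_for(self, worker_id, block_ids):
        # Load and cache the metadata the first time this worker is seen.
        if worker_id not in self._cache:
            self._cache[worker_id] = self._store.get(worker_id)
        metadata = self._cache[worker_id]
        return [metadata[block_id] for block_id in block_ids]
```

In this sketch a new prefill worker starts with an empty cache and needs no registration step, which mirrors the "no explicit action needed" entry in the list above; a request only has to carry the worker ID and block IDs.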