chore: remove stale example assets (#7059)

Unverified commit 7dfbe4fd, authored Mar 27, 2026 by Alec, committed by GitHub Mar 28, 2026
20 changed files
--- a/README.md
+++ b/README.md
@@ -200,6 +200,42 @@ Dynamo is built in the open with an OSS-first development model. We welcome cont
 <details>
 <summary>Older news</summary>
+Dynamo provides comprehensive benchmarking tools:
+- **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies using AIPerf
+- **[SLA-Driven Deployments](docs/components/planner/planner-guide.md)** – Optimize deployments to meet SLA requirements
+## Frontend OpenAPI Specification
+The OpenAI-compatible frontend exposes an OpenAPI 3 spec at `/openapi.json`. To generate without running the server:
+```bash
+cargo run -p dynamo-llm --bin generate-frontend-openapi
+```
+This writes to `docs/reference/api/openapi.json`.
+## Service Discovery and Messaging
+Dynamo uses TCP for inter-component communication. On Kubernetes, native resources ([CRDs + EndpointSlices](docs/kubernetes/service-discovery.md)) handle service discovery. External services are optional for most deployments:
+| Deployment | etcd | NATS | Notes |
+|------------|------|------|-------|
+| **Local Development** | ❌ Not required | ❌ Not required | Pass `--discovery-backend file`; vLLM also needs `--kv-events-config '{"enable_kv_cache_events": false}'` |
+| **Kubernetes** | ❌ Not required | ❌ Not required | K8s-native discovery; TCP request plane |
+> **Note:** KV-Aware Routing requires NATS for prefix caching coordination.
+For Slurm or other distributed deployments (and KV-aware routing):
+- [etcd](https://etcd.io/) can be run directly as `./etcd`.
+- [nats](https://nats.io/) needs JetStream enabled: `nats-server -js`.
+To quickly set up both: `docker compose -f deploy/docker-compose.yml up -d`
+## More News
+- [11/20] [Dell integrates PowerScale with Dynamo's NIXL for 19x faster TTFT](https://www.dell.com/en-us/dt/corporate/newsroom/announcements/detailpage.press-releases~usa~2025~11~dell-technologies-and-nvidia-advance-enterprise-ai-innovation.htm)
 - [11/20] [WEKA partners with NVIDIA on KV cache storage for Dynamo](https://siliconangle.com/2025/11/20/nvidia-weka-kv-cache-solution-ai-inferencing-sc25/)
 - [11/13] [Dynamo Office Hours Playlist](https://www.youtube.com/playlist?list=PL5B692fm6--tgryKu94h2Zb7jTFM3Go4X)
 - [10/16] [How Baseten achieved 2x faster inference with NVIDIA Dynamo](https://www.baseten.co/blog/how-baseten-achieved-2x-faster-inference-with-nvidia-dynamo/)
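
The service discovery table added to `README.md` in the hunk above can be read as a small decision rule. The following is a minimal sketch and not part of the commit: `required_services` is a hypothetical helper, and it assumes that KV-aware routing adds only NATS on local and Kubernetes deployments (the note names NATS; the hunk does not say whether etcd is also needed in that case).

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the external services a deployment needs,
# following the table in the README hunk.
# Arguments: <deployment> [kv_aware_routing: true|false]
required_services() {
  local deployment="$1" kv_routing="${2:-false}"
  local services=""
  case "$deployment" in
    # Local development and Kubernetes need neither etcd nor NATS.
    local|kubernetes) services="" ;;
    # Slurm or other distributed deployments use both.
    *) services="etcd nats" ;;
  esac
  # Assumption: KV-aware routing adds NATS where nothing else is required.
  if [ "$kv_routing" = "true" ] && [ -z "$services" ]; then
    services="nats"
  fi
  printf '%s\n' "${services:-none}"
}
```

For example, `required_services kubernetes true` prints `nats`, which is the case where `nats-server -js` from the hunk would still be needed.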

--- a/docs/backends/trtllm/multinode/trtllm-multinode-examples.md
+++ b/docs/backends/trtllm/multinode/trtllm-multinode-examples.md
@@ -4,281 +4,79 @@
 title: Multinode Examples
 ---
-For general TensorRT-LLM features and configuration, see the [Reference Guide](../trtllm-reference-guide.md).
+For general TensorRT-LLM features and engine configuration, see the
+[Reference Guide](../trtllm-reference-guide.md).
---
+## Recommended Path
-> **Note:** The scripts referenced in this example (such as `srun_aggregated.sh` and `srun_disaggregated.sh`) can be found in [`examples/basics/multinode/trtllm/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/multinode/trtllm/).
-To run a single Dynamo+TRTLLM Worker that spans multiple nodes (ex: TP16),
-the set of nodes need to be launched together in the same MPI world, such as
-via `mpirun` or `srun`. This is true regardless of whether the worker is
-aggregated, prefill-only, or decode-only.
-In this document we will demonstrate two examples launching multinode workers
-on a slurm cluster with `srun`:
-1. Deploying an aggregated nvidia/DeepSeek-R1 model as a multi-node TP16/EP16
-   worker across 4 GB200 nodes
-2. Deploying a disaggregated nvidia/DeepSeek-R1 model with a multi-node
-   TP16/EP16 prefill worker (4 nodes) and a multi-node TP16/EP16 decode
-   worker (4 nodes) across a total of 8 GB200 nodes.
-NOTE: Some of the scripts used in this example like `start_frontend_services.sh` and
-`start_trtllm_worker.sh` should be translatable to other environments like Kubernetes, or
-using `mpirun` directly, with relative ease.
-## Setup
-For simplicity of the example, we will make some assumptions about your slurm cluster:
-1. First, we assume you have access to a slurm cluster with multiple GPU nodes
-   available. For functional testing, most setups should be fine. For performance
-   testing, you should aim to allocate groups of nodes that are performantly
-   inter-connected, such as those in an NVL72 setup.
-2. Second, we assume this slurm cluster has the [Pyxis](https://github.com/NVIDIA/pyxis)
-   SPANK plugin setup. In particular, the `srun_aggregated.sh` script in this
-   example will use `srun` arguments like `--container-image`,
-   `--container-mounts`, and `--container-env` that are added to `srun` by Pyxis.
-   If your cluster supports similar container based plugins, you may be able to
-   modify the script to use that instead.
-3. Third, we assume you have a Dynamo+TRTLLM container image available.
-   You can use the [prebuilt container](../README.md#quick-start) or [build a custom one](../trtllm-building-custom-container.md).
-   This is the image that can be set to the `IMAGE` environment variable in later steps.
-4. Fourth, we assume you pre-allocate a group of nodes using `salloc`. We
-   will allocate 8 nodes below as a reference command to have enough capacity
-   to run both examples. If you plan to only run the aggregated example, you
-   will only need 4 nodes. If you customize the configurations to require a
-   different number of nodes, you can adjust the number of allocated nodes
-   accordingly. Pre-allocating nodes is technically not a requirement,
-   but it makes iterations of testing/experimenting easier.
-   Make sure to set your `PARTITION` and `ACCOUNT` according to your slurm cluster setup:
+For multinode TensorRT-LLM deployments, start from the checked-in Kubernetes
-    ```bash
+recipes under [`recipes/`](../../../../recipes/README.md). Those manifests are
-    # Set partition manually based on your slurm cluster's partition names
+the supported entrypoints for launching multi-node workers, frontend services,
-    PARTITION=""
+and related routing components.
-    # Set account manually if this command doesn't work on your cluster
-    ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)"
-    salloc \
-      --partition="${PARTITION}" \
-      --account="${ACCOUNT}" \
-      --job-name="${ACCOUNT}-dynamo.trtllm" \
-      -t 05:00:00 \
-      --nodes 8
-    ```
-5. Lastly, we will assume you are inside an interactive shell on one of your allocated
-   nodes, which may be the default behavior after executing the `salloc` command above
-   depending on the cluster setup. If not, then you should SSH into one of the allocated nodes.
-### Environment Variable Setup
+The main TRT-LLM recipe entrypoints are:
-This example aims to automate as much of the environment setup as possible,
+- [DeepSeek-R1 WideEP on GB200](../../../../recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml)
-but all slurm clusters and environments are different, and you may need to
+- [Qwen3-235B-A22B-FP8 aggregated](../../../../recipes/qwen3-235b-a22b-fp8/trtllm/agg/deploy.yaml)
-dive into the scripts to make modifications based on your specific environment.
+- [Qwen3-235B-A22B-FP8 disaggregated](../../../../recipes/qwen3-235b-a22b-fp8/trtllm/disagg/deploy.yaml)
+- [Qwen3-32B-FP8 aggregated](../../../../recipes/qwen3-32b-fp8/trtllm/agg/deploy.yaml)
+- [Qwen3-32B-FP8 disaggregated](../../../../recipes/qwen3-32b-fp8/trtllm/disagg/deploy.yaml)
+- [GPT-OSS-120B aggregated](../../../../recipes/gpt-oss-120b/trtllm/agg/deploy.yaml)
+- [GPT-OSS-120B disaggregated](../../../../recipes/gpt-oss-120b/trtllm/disagg/deploy.yaml)
+- [Nemotron-3-Super-FP8 disaggregated](../../../../recipes/nemotron-3-super-fp8/trtllm/disagg/deploy.yaml)
-Assuming you have already allocated your nodes via `salloc`, and are
+For model-level setup, prerequisites, and hardware notes, use the recipe
-inside an interactive shell on one of the allocated nodes, set the
+README files:
-following environment variables based:
-```bash
-# NOTE: IMAGE must be set manually for now
-# Use the prebuilt container from NGC (see ../README.md#quick-start):
-#   export IMAGE="nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0"
-# Or build a custom one (see ../trtllm-building-custom-container.md)
-# Or you can also download the image to shared storage and point
-# IMAGE to the local path.
-export IMAGE="<dynamo_trtllm_image>"
-# MOUNTS are the host:container path pairs that are mounted into the containers
+- [DeepSeek-R1 recipes](../../../../recipes/deepseek-r1/README.md)
-# launched by each `srun` command.
+- [Qwen3-235B-A22B-FP8 recipes](../../../../recipes/qwen3-235b-a22b-fp8/README.md)
-#
+- [Qwen3-32B-FP8 recipes](../../../../recipes/qwen3-32b-fp8/README.md)
-# If you want to reference files, such as $MODEL_PATH below, in a
+- [GPT-OSS-120B recipes](../../../../recipes/gpt-oss-120b/README.md)
-# different location, you can customize MOUNTS or specify additional
+- [Kimi-K2.5 recipes](../../../../recipes/kimi-k2.5/README.md)
-# comma-separated mount pairs here.
-#
-# NOTE: Currently, this example assumes that the local bash scripts and configs
-# referenced are mounted into /mnt inside the container. If you want to
-# customize the location of the scripts, make sure to modify `srun_aggregated.sh`
-# accordingly for the new locations of `start_frontend_services.sh` and
-# `start_trtllm_worker.sh`.
-#
-# For example, assuming your cluster had a `/lustre` directory on the host, you
-# could add that as a mount like so:
-#
-# export MOUNTS="${PWD}/../../../../:/mnt,/lustre:/lustre"
-export MOUNTS="${PWD}/../../../../:/mnt"
-# NOTE: In general, Deepseek R1 is very large, so it is recommended to
+## Quick Start
-# pre-download the model weights and save them in some shared location,
-# NFS storage, HF_HOME, etc. and modify the `--model-path` below
-# to reuse the pre-downloaded weights instead.
-#
-# On Blackwell systems (ex: GB200), it is recommended to use the FP4 weights:
-# https://huggingface.co/nvidia/DeepSeek-R1-FP4
-#
-# On Hopper systems, FP4 isn't supported so you'll need to use the default weights:
-# https://huggingface.co/deepseek-ai/DeepSeek-R1
-export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
-# The name the model will be served/queried under, matching what's
+At a high level, the Kubernetes workflow is:
-# returned by the /v1/models endpoint.
-#
-# By default this is inferred from MODEL_PATH, but when using locally downloaded
-# model weights, it can be nice to have explicit control over the name.
-export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
-```
-## Aggregated WideEP
+1. Install the Dynamo platform on Kubernetes. See the
+   [Kubernetes Deployment Guide](../../../kubernetes/README.md).
+2. Create a namespace and any required secrets such as a Hugging Face token.
+3. Apply the recipe's model cache and model download manifests when the recipe
+   includes them.
+4. Apply the recipe's `deploy.yaml`.
+5. Port-forward the frontend service and send test requests to `/v1/models` or
+   `/v1/chat/completions`.
-Assuming you have at least 4 nodes allocated following the setup steps above,
+Example flow:
-follow these steps below to launch an **aggregated** deployment across 4 nodes:
 ```bash
-# Default set in srun_aggregated.sh, but can customize here.
+export NAMESPACE=dynamo-demo
-# export ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/deepseek-r1/agg/wide_ep/wide_ep_agg.yaml"
+kubectl create namespace ${NAMESPACE}
-# Customize NUM_NODES to match the desired parallelism in ENGINE_CONFIG
+kubectl create secret generic hf-token-secret \
-# The product of NUM_NODES*NUM_GPUS_PER_NODE should match the number of
+  --from-literal=HF_TOKEN="your-token-here" \
-# total GPUs necessary to satisfy the requested parallelism. For example,
+  -n ${NAMESPACE}
-# 4 nodes x 4 gpus/node = 16 gpus total for TP16/EP16.
-# export NUM_NODES=4
+# Example: deploy DeepSeek-R1 TRT-LLM WideEP on GB200.
+kubectl apply -f recipes/deepseek-r1/model-cache/model-cache.yaml -n ${NAMESPACE}
-# GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this.
+kubectl apply -f recipes/deepseek-r1/model-cache/model-download.yaml -n ${NAMESPACE}
-# export NUM_GPUS_PER_NODE=4
+kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=7200s
+kubectl apply -f recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml -n ${NAMESPACE}
-# Launches:
-# - frontend + etcd/nats on current (head) node
-# - one large aggregated trtllm worker across multiple nodes via MPI tasks
-./srun_aggregated.sh
 ```
-## Disaggregated WideEP
+After the deployment is ready, port-forward the frontend service named by the
+recipe and send a test request:
-Assuming you have at least 8 nodes allocated (4 for prefill, 4 for decode)
-following the setup above, follow these steps below to launch a **disaggregated**
-deployment across 8 nodes:
-> [!Tip]
-> Make sure you have a fresh environment and don't still have the aggregated
-> example above still deployed on the same set of nodes.
 ```bash
-# Defaults set in srun_disaggregated.sh, but can customize here.
+kubectl port-forward svc/<frontend-service> 8000:8000 -n ${NAMESPACE}
-# export PREFILL_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/deepseek-r1/disagg/wide_ep/wide_ep_prefill.yaml"
-# export DECODE_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/deepseek-r1/disagg/wide_ep/wide_ep_decode.yaml"
-# Customize NUM_PREFILL_NODES to match the desired parallelism in PREFILL_ENGINE_CONFIG
+curl http://localhost:8000/v1/models
-# Customize NUM_DECODE_NODES to match the desired parallelism in DECODE_ENGINE_CONFIG
-# The products of NUM_PREFILL_NODES*NUM_GPUS_PER_NODE and
-# NUM_DECODE_NODES*NUM_GPUS_PER_NODE should match the respective number of
-# GPUs necessary to satisfy the requested parallelism in each config.
-# export NUM_PREFILL_NODES=4
-# export NUM_DECODE_NODES=4
-# GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this.
-# export NUM_GPUS_PER_NODE=4
-# Launches:
-# - frontend + etcd/nats on current (head) node.
-# - one large prefill trtllm worker across multiple nodes via MPI tasks
-# - one large decode trtllm worker across multiple nodes via MPI tasks
-./srun_disaggregated.sh
-```
-> [!Tip]
-> To launch multiple replicas of the configured prefill/decode workers, you can set
-> NUM_PREFILL_WORKERS and NUM_DECODE_WORKERS respectively (default: 1).
-## Understanding the Output
-1. The `srun_aggregated.sh` launches two `srun` jobs. The first launches
-   etcd, NATS, and the OpenAI frontend on the head node only
-   called "node1" in the example output below. The second launches
-   a single TP16 Dynamo+TRTLLM worker spread across 4 nodes, each node
-   using 4 GPUs each.
-    ```
-    # Frontend/etcd/nats services
-    srun: launching StepId=453374.17 on host node1, 1 tasks: 0
-    ...
-    # TP16 TRTLLM worker split across 4 nodes with 4 gpus each
-    srun: launching StepId=453374.18 on host node1, 4 tasks: [0-3]
-    srun: launching StepId=453374.18 on host node2, 4 tasks: [4-7]
-    srun: launching StepId=453374.18 on host node3, 4 tasks: [8-11]
-    srun: launching StepId=453374.18 on host node4, 4 tasks: [12-15]
-   ```
-2. The OpenAI frontend will listen for and dynamically discover workers as
-   they register themselves with Dynamo's distributed runtime:
-   ```
-   0: 2025-06-13T02:36:48.161Z  INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000 address="0.0.0.0:8000"
-   ```
-3. The TRTLLM worker will consist of N (N=16 for TP16) MPI ranks, 1 rank on each
-   GPU on each node, which will each output their progress while loading the model.
-   You can see each rank's output prefixed with the rank at the start of each log line
-   until the model successfully finishes loading:
-    ```
-     8: rank8 run mgmn worker node with mpi_world_size: 16 ...
-    10: rank10 run mgmn worker node with mpi_world_size: 16 ...
-     9: rank9 run mgmn worker node with mpi_world_size: 16 ...
-    11: rank11 run mgmn worker node with mpi_world_size: 16 ...
-    ...
-    15: Model init total -- 55.42s
-    11: Model init total -- 55.91s
-    12: Model init total -- 55.24s
-    ```
-4. After the model fully finishes loading on all ranks, the worker will register itself,
-   and the OpenAI frontend will detect it, signaled by this output:
-    ```
-    0: 2025-06-13T02:46:35.040Z  INFO dynamo_llm::discovery::watcher: added model model_name="nvidia/DeepSeek-R1-FP4"
-    ```
-5. At this point, with the worker fully initialized and detected by the frontend,
-   it is now ready for inference.
-6. For `srun_disaggregated.sh`, it follows a very similar flow, but instead launches
-   three srun jobs instead of two. One for frontend, one for prefill worker,
-   and one for decode worker.
-## Example Request
-To verify the deployed model is working, send a `curl` request:
-```bash
-# NOTE: $HOST assumes running on head node, but can be changed to $HEAD_NODE_IP instead.
-HOST=localhost
-PORT=8000
-# "model" here should match the model name returned by the /v1/models endpoint
-curl -w "%{http_code}" ${HOST}:${PORT}/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-  "model": "'${SERVED_MODEL_NAME}'",
-  "messages": [
-  {
-    "role": "user",
-    "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
-  }
-  ],
-  "stream": true,
-  "max_tokens": 30
-}'
-```
-## Cleanup
-To cleanup background `srun` processes launched by `srun_aggregated.sh` or
-`srun_disaggregated.sh`, you can run:
-```bash
-pkill srun
 ```
-## Known Issues
+## Notes
-- This example has only been tested on a 4xGB200 node setup with 16 GPUs using
+- The TRT-LLM engine config files used by launch and deploy flows live under
-  FP4 weights. In theory, the example should work on alternative setups such as
+  [`examples/backends/trtllm/engine_configs/`](../../../../examples/backends/trtllm/engine_configs/README.md).
-  H100 nodes with FP8 weights, but this hasn't been tested yet.
+- If you need to customize model parallelism, replica counts, or routing mode,
-- WideEP configs in this directory are still being tested. A WideEP specific
+  edit the recipe-local manifest rather than introducing a separate scheduler-specific guide.
-  example with documentation will be added once ready.
+- For the current catalog of supported recipes, see [recipes/README.md](../../../../recipes/README.md).
-- There are known issues where WideEP workers may not cleanly shut down:
-    - This may lead to leftover shared memory files in `/dev/shm/moe_*`. For
-      now, you must manually clean these up before deploying again on the
-      same set of nodes.
-    - Similarly, there may be GPU memory left in-use after killing the `srun`
-      jobs. After cleaning up any leftover shared memory files as described
-      above, the GPU memory may slowly come back. You can run `watch nvidia-smi`
-      to check on this behavior. If you don't free the GPU memory before the
-      next deployment, you may get a CUDA OOM error while loading the model.
-    - There is mention of this issue in the relevant TRT-LLM blog
-      [here](https://github.com/NVIDIA/TensorRT-LLM/blob/6021a439ab9c29f4c46f721eeb59f6b992c425ea/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md#miscellaneous).
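
The rewritten guide's Quick Start ends by sending test requests to `/v1/models` or `/v1/chat/completions`. A minimal sketch of building the chat request body follows. `chat_payload` is a hypothetical helper, not something in the repository, and it assumes the model name and prompt contain no characters that need JSON escaping.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a minimal chat completions request body.
# Arguments: <model> <prompt> [max_tokens, default 30]
# Assumes the arguments contain no quotes, backslashes, or control characters.
chat_payload() {
  local model="$1" prompt="$2" max_tokens="${3:-30}"
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"max_tokens":%s}\n' \
    "$model" "$prompt" "$max_tokens"
}
```

With the port-forward from the Quick Start active, the body can be passed to `curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "$(chat_payload "<served-model-name>" "Hello" 30)"`. The model name should match what `/v1/models` returns.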
--- a/docs/backends/trtllm/trtllm-examples.md
+++ b/docs/backends/trtllm/trtllm-examples.md
@@ -104,10 +104,6 @@ For comprehensive instructions on multinode serving, see the [Multinode Examples
 For complete Kubernetes deployment instructions, configurations, and troubleshooting, see the [TensorRT-LLM Kubernetes Deployment Guide](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md).
-### Performance Sweep
-For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/performance_sweeps/README.md).
 ## Client
 See the [client](../sglang/README.md#testing-the-deployment) section to learn how to send requests to the deployment.

--- a/docs/backends/trtllm/trtllm-gpt-oss.md
+++ b/docs/backends/trtllm/trtllm-gpt-oss.md
@@ -529,7 +529,6 @@ flowchart TD
 ## Next Steps
-- **Production Deployment**: For multi-node deployments, see the [Multi-node Guide](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/multinode/README.md)
 - **Advanced Configuration**: Explore TensorRT-LLM engine building options for further optimization
 - **Monitoring**: Set up Prometheus and Grafana for production monitoring
 - **Performance Benchmarking**: Use AIPerf to measure and optimize your deployment performance
--- a/docs/components/router/router-examples.md
+++ b/docs/components/router/router-examples.md
@@ -105,7 +105,7 @@ For basic Kubernetes deployment with the KV Router, see the [Kubernetes Deployme
 - [TRT-LLM aggregated router example](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/agg_router.yaml)
 - [vLLM aggregated router example](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/agg_router.yaml)
 - [SGLang aggregated router example](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy/agg_router.yaml)
-- [Distributed inference tutorial](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/kubernetes/Distributed_Inference/agg_router.yaml)
+- [Kubernetes deployment guide](../../kubernetes/README.md)
 **For A/B Testing and Advanced K8s Setup:**
 See the comprehensive [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md) for step-by-step instructions on deploying, configuring, and benchmarking the KV router in Kubernetes.

--- a/docs/features/multimodal/multimodal-trtllm.md
+++ b/docs/features/multimodal/multimodal-trtllm.md
@@ -327,65 +327,6 @@ sequenceDiagram
    Frontend->>Client: Stream response
 ```
-## Multi-node Deployment (Slurm)
-This section demonstrates how to deploy large multimodal models that require a multi-node setup using Slurm.
-> **Note:** The scripts referenced in this section can be found in [`examples/basics/multinode/trtllm/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/multinode/trtllm/).
-### Environment Setup
-Assuming you have allocated your nodes via `salloc` and are inside an interactive shell:
-```bash
-# Container image (build using docs/backends/trtllm/README.md#build-container)
-export IMAGE="<dynamo_trtllm_image>"
-# Host:container path pairs for mounting
-export MOUNTS="${PWD}/../../../../:/mnt"
-# Model configuration
-export MODEL_PATH="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
-export SERVED_MODEL_NAME="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
-export MODALITY=${MODALITY:-"multimodal"}
-```
-### Multi-node Disaggregated Launch
-For 4 4xGB200 nodes (2 for prefill, 2 for decode):
-```bash
-# Customize parallelism to match your engine configs
-# export PREFILL_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/multimodal/prefill.yaml"
-# export DECODE_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/multimodal/decode.yaml"
-# export NUM_PREFILL_NODES=2
-# export NUM_DECODE_NODES=2
-# export NUM_GPUS_PER_NODE=4
-# Launches frontend + etcd/nats on head node, plus prefill and decode workers
-./srun_disaggregated.sh
-```
-### Understanding the Output
-1. `srun_disaggregated.sh` launches three srun jobs: frontend, prefill worker, and decode worker
-2. The OpenAI frontend will dynamically discover workers as they register:
-   ```text
-   INFO dynamo_run::input::http: Watching for remote model at models
-   INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000
-   ```
-3. TRT-LLM workers output progress from each MPI rank while loading
-4. When ready, the frontend logs:
-   ```text
-   INFO dynamo_llm::discovery::watcher: added model model_name="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
-   ```
-### Cleanup
-```bash
-pkill srun
-```
 ## Embedding Cache
 Dynamo supports embedding cache in both aggregated and disaggregated settings for TRT-LLM:
@@ -499,4 +440,3 @@ Common examples:
 | `components/src/dynamo/trtllm/request_handlers/handler_base.py` | Base handler with disaggregated params encoding/decoding |
 | `components/src/dynamo/trtllm/utils/disagg_utils.py` | DisaggregatedParamsCodec for network transfer |
 | `components/src/dynamo/trtllm/utils/trtllm_utils.py` | Command-line argument parsing |
--- a/docs/kubernetes/inference-gateway.md
+++ b/docs/kubernetes/inference-gateway.md
@@ -10,7 +10,7 @@ title: Inference Gateway (GAIE)
 Integrate Dynamo with the Gateway API Inference Extension for intelligent KV-aware request routing at the gateway layer.
-EPP's default kv-routing approach is not token-aware because the prompt is not tokenized. But the Dynamo plugin uses a token-aware KV algorithm. It employs the dynamo router which implements kv routing by running your model's tokenizer inline. The EPP plugin configuration lives in [`helm/dynamo-gaie/epp-config-dynamo.yaml`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/inference-gateway/standalone/helm/dynamo-gaie/epp-config-dynamo.yaml) per EPP [convention](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/).
+EPP's default kv-routing approach is not token-aware because the prompt is not tokenized. But the Dynamo plugin uses a token-aware KV algorithm. It employs the dynamo router which implements kv routing by running your model's tokenizer inline. The EPP plugin configuration lives in [`helm/dynamo-gaie/epp-config-dynamo.yaml`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/inference-gateway/standalone/helm/dynamo-gaie/epp-config-dynamo.yaml), following the checked-in GAIE/EPP configuration layout used by this repository.
 Dynamo Integration with the Inference Gateway supports Aggregated and Disaggregated Serving. A request only exercises disaggregated routing when the EPP config defines a `prefill` profile and prefill workers are available. The standalone [`epp-config-dynamo.yaml`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/inference-gateway/standalone/helm/dynamo-gaie/epp-config-dynamo.yaml) currently only defines a `decode` profile, while the recipe examples use separate aggregated and disaggregated configs under `recipes/llama-3-70b/vllm/agg/gaie/` and `recipes/llama-3-70b/vllm/disagg-single-node/gaie/`. Unless `DYN_ENFORCE_DISAGG=true`, deployments without a `prefill` profile or prefill workers fall back to aggregated serving.
 If you want to use LoRA, deploy Dynamo without the Inference Gateway.
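
The fallback rule described in the hunk above (disaggregated routing only when a `prefill` profile and prefill workers both exist, otherwise aggregated serving unless `DYN_ENFORCE_DISAGG=true`) can be sketched as a small function. `routing_mode` is a hypothetical helper written for illustration. The hunk does not say what happens when enforcement is on and prefill is unavailable, so the sketch only reports `no-fallback` for that case.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report which serving path a request would take.
# Arguments: <has_prefill_profile> <has_prefill_workers> [enforce_disagg]
# Each argument is the literal string "true" or "false".
routing_mode() {
  local has_profile="$1" has_workers="$2" enforce="${3:-false}"
  if [ "$has_profile" = "true" ] && [ "$has_workers" = "true" ]; then
    echo "disaggregated"
  elif [ "$enforce" = "true" ]; then
    # DYN_ENFORCE_DISAGG=true disables the aggregated fallback; the
    # resulting behavior is not described in the source.
    echo "no-fallback"
  else
    echo "aggregated"
  fi
}
```

Calling `routing_mode false true` prints `aggregated`, matching the standalone config that currently defines only a `decode` profile.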

--- a/examples/README.md
+++ b/examples/README.md
@@ -26,9 +26,9 @@ This directory contains practical examples demonstrating how to deploy and use D
 Learn fundamental Dynamo concepts through these introductory examples:
-- **[Quickstart](/examples/basics/quickstart/README.md)** - Simple aggregated serving example with vLLM backend
+- **[Quickstart](/docs/getting-started/quickstart.md)** - Simple local Dynamo setup across supported backends
-- **[Disaggregated Serving](/examples/basics/disaggregated_serving/README.md)** - Prefill/decode separation for enhanced performance and scalability
+- **[Disaggregated Serving](/docs/features/disaggregated-serving/README.md)** - Prefill/decode separation for enhanced performance and scalability
-- **[Multi-node](/examples/basics/multinode/README.md)** - Distributed inference across multiple nodes and GPUs
+- **[Multi-node TensorRT-LLM](/docs/backends/trtllm/multinode/trtllm-multinode-examples.md)** - Distributed inference across multiple nodes and GPUs
 ## Framework Support
@@ -56,7 +56,7 @@ Low-level runtime examples for developers using Python<>Rust bindings:
 ## Getting Started
-1. **Choose your deployment pattern**: Start with the [Quickstart](/examples/basics/quickstart/README.md) for a simple local deployment, or explore [Disaggregated Serving](/examples/basics/disaggregated_serving/README.md) for advanced architectures.
+1. **Choose your deployment pattern**: Start with the [Quickstart](/docs/getting-started/quickstart.md) for a simple local deployment, or explore [Disaggregated Serving](/docs/features/disaggregated-serving/README.md) for advanced architectures.
 2. **Set up prerequisites**: Most examples require etcd and NATS services. You can start them using:
   ```bash

--- a/examples/backends/trtllm/engine_configs/deepseek-r1-distill-llama-8b/agg.yaml
+++ b/examples/backends/trtllm/engine_configs/deepseek-r1-distill-llama-8b/agg.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-tensor_parallel_size: 1
-moe_expert_parallel_size: 1
-enable_attention_dp: false
-max_num_tokens: 8192
-max_batch_size: 16
-trust_remote_code: true
-backend: pytorch
-enable_chunked_prefill: true
-kv_cache_config:
-  free_gpu_memory_fraction: 0.85
-# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
-# NOTE: overlap_scheduler enabled by default since this commit and changed
-# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
-# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
-cuda_graph_config:
-  max_batch_size: 16
\ No newline at end of file
--- a/examples/backends/trtllm/engine_configs/deepseek-r1-distill-llama-8b/decode.yaml
+++ b/examples/backends/trtllm/engine_configs/deepseek-r1-distill-llama-8b/decode.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-tensor_parallel_size: 1
-moe_expert_parallel_size: 1
-enable_attention_dp: false
-max_num_tokens: 8192
-trust_remote_code: true
-backend: pytorch
-enable_chunked_prefill: true
-disable_overlap_scheduler: false
-cuda_graph_config:
-  max_batch_size: 16
-kv_cache_config:
-  free_gpu_memory_fraction: 0.85
-cache_transceiver_config:
-  backend: DEFAULT
--- a/examples/backends/trtllm/engine_configs/deepseek-r1-distill-llama-8b/prefill.yaml
+++ b/examples/backends/trtllm/engine_configs/deepseek-r1-distill-llama-8b/prefill.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-tensor_parallel_size: 1
-moe_expert_parallel_size: 1
-enable_attention_dp: false
-max_num_tokens: 8192
-trust_remote_code: true
-backend: pytorch
-enable_chunked_prefill: true
-# Overlap scheduler not currently supported in prefill only workers.
-disable_overlap_scheduler: true
-cuda_graph_config:
-  max_batch_size: 16
-kv_cache_config:
-  free_gpu_memory_fraction: 0.85
-cache_transceiver_config:
-  backend: DEFAULT
\ No newline at end of file
--- a/examples/backends/trtllm/engine_configs/deepseek-r1/agg/mtp/mtp_agg.yaml
+++ b/examples/backends/trtllm/engine_configs/deepseek-r1/agg/mtp/mtp_agg.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# NOTE: FP4 only supported starting with Blackwell GPUs.
-# https://huggingface.co/nvidia/DeepSeek-R1-FP4
-# You can also specify the full path to locally downloaded weights
-# instead of a HuggingFace ID here.
-backend: pytorch
-tensor_parallel_size: 4
-moe_expert_parallel_size: 4
-enable_attention_dp: true
-max_batch_size: 256
-# 8448 = 8192 ISL + 256 OSL
-max_num_tokens: 8448
-max_seq_len: 8448
-kv_cache_config:
-  free_gpu_memory_fraction: 0.30
-  dtype: fp8
-# Enable the MTP(Multi-Token Prediction) in the model engine
-speculative_config:
-  decoding_type: MTP
-  num_nextn_predict_layers: 1
-cuda_graph_config:
-  enable_padding: true
-  batch_sizes:
-  - 1
-  - 2
-  - 4
-  - 8
-  - 16
-  - 32
-  - 64
-  - 128
-  - 256
-print_iter_log: true
--- a/examples/backends/trtllm/engine_configs/deepseek-r1/agg/simple/agg.yaml
+++ b/examples/backends/trtllm/engine_configs/deepseek-r1/agg/simple/agg.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-backend: pytorch
-# TP/EP/PP/DP
-tensor_parallel_size: 4
-moe_expert_parallel_size: 4
-pipeline_parallel_size: 1
-enable_attention_dp: false
-max_batch_size: 256
-# 8448 = 8192 ISL + 256 OSL
-max_num_tokens: 8448
-max_seq_len: 8448
-kv_cache_config:
-  # With dp attention disabled: high free_gpu_memory_fraction is fine.
-  free_gpu_memory_fraction: 0.85
-  # With dp attention enabled: large ISL at high concurrency may need
-  # free_gpu_memory_fraction low to have enough available memory.
-  # free_gpu_memory_fraction: 0.30
-  dtype: fp8
-# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
-# NOTE: overlap_scheduler enabled by default since this commit and changed
-# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
-# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
-cuda_graph_config:
-  enable_padding: true
-# NOTE: For larger max batch size, you may want to add larger cuda graph
-# batch sizes below to match.
-  batch_sizes:
-  - 1
-  - 2
-  - 4
-  - 8
-  - 16
-  - 32
-  - 64
-  - 128
-  - 256
-print_iter_log: true
--- a/examples/backends/trtllm/engine_configs/deepseek-r1/agg/wide_ep/dep16_agg.yaml
+++ b/examples/backends/trtllm/engine_configs/deepseek-r1/agg/wide_ep/dep16_agg.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Example of a Multi-node worker, but no WideEP or EPLB.
-# See wide_ep*.yaml for WideEP example configs.
-backend: pytorch
-tensor_parallel_size: 16
-moe_expert_parallel_size: 16
-enable_attention_dp: true
-max_batch_size: 256
-max_num_tokens: 256
-max_seq_len: 8448
-kv_cache_config:
-  free_gpu_memory_fraction: 0.7
-  dtype: fp8
-cuda_graph_config:
-  enable_padding: true
-  batch_sizes:
-  - 1
-  - 2
-  - 4
-  - 8
-  - 16
-  - 32
-  - 64
-  - 128
-  - 256
--- a/examples/backends/trtllm/engine_configs/deepseek-r1/agg/wide_ep/eplb.yaml
+++ b/examples/backends/trtllm/engine_configs/deepseek-r1/agg/wide_ep/eplb.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-# moe_load_balancer settings for TRTLLM based on:
-# https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/ep_load_balancer/README.md#online-ep-load-balancer
-num_slots: 288
-layer_updates_per_iter: 2
--- a/examples/backends/trtllm/engine_configs/deepseek-r1/agg/wide_ep/wide_ep_agg.yaml
+++ b/examples/backends/trtllm/engine_configs/deepseek-r1/agg/wide_ep/wide_ep_agg.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-backend: pytorch
-# WideEP related settings
-moe_config:
-  backend: WIDEEP
-  # moe_max_num_tokens will default to max_num_tokens if left unspecified.
-  #
-  # If you want to set this value explicitly, one recommendation is below:
-  #   moe_max_num_tokens = max_batch_size * moe_expert_parallel_size
-  #   4096 = 256 * 16
-  # moe_max_num_tokens: 4096
-  load_balancer: /mnt/examples/backends/trtllm/engine_configs/deepseek-r1/agg/wide_ep/eplb.yaml
-tensor_parallel_size: 16
-moe_expert_parallel_size: 16
-enable_attention_dp: true
-max_batch_size: 256
-max_num_tokens: 256
-max_seq_len: 8448
-kv_cache_config:
-  free_gpu_memory_fraction: 0.3
-  dtype: fp8
-cuda_graph_config:
-  enable_padding: true
-  batch_sizes:
-  - 1
-  - 2
-  - 4
-  - 8
-  - 16
-  - 32
-  - 64
-  - 128
-  - 256
\ No newline at end of file
--- a/examples/backends/trtllm/engine_configs/deepseek-r1/disagg/mtp/mtp_decode.yaml
+++ b/examples/backends/trtllm/engine_configs/deepseek-r1/disagg/mtp/mtp_decode.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# NOTE: FP4 only supported starting with Blackwell GPUs.
-# https://huggingface.co/nvidia/DeepSeek-R1-FP4
-# You can also specify the full path to locally downloaded weights
-# instead of a HuggingFace ID here.
-backend: pytorch
-tensor_parallel_size: 4
-moe_expert_parallel_size: 4
-enable_attention_dp: false
-max_batch_size: 256
-# Note: When MTP is enabled and `cuda_graph_batch_sizes` is specified, `max_num_tokens` must satisfy the following formula:
-# max_num_tokens >= max(cuda_graph_batch_sizes) * (num_nextn_predict_layers + 1)
-# This is a known issue in TensorRT-LLM and will be resolved in the next release.
-max_num_tokens: 512
-# 8704 = 8192 ISL + 512 OSL
-max_seq_len: 8704
-kv_cache_config:
-  free_gpu_memory_fraction: 0.85
-  dtype: fp8
-# Enable the MTP(Multi-Token Prediction) in decode model engine
-speculative_config:
-  decoding_type: MTP
-  num_nextn_predict_layers: 1
-cuda_graph_config:
-  enable_padding: true
-  batch_sizes:
-  - 1
-  - 2
-  - 4
-  - 8
-  - 16
-  - 32
-  - 64
-  - 128
-  - 256
-print_iter_log: true
-cache_transceiver_config:
-  backend: DEFAULT
--- a/examples/backends/trtllm/engine_configs/deepseek-r1/disagg/mtp/mtp_prefill.yaml
+++ b/examples/backends/trtllm/engine_configs/deepseek-r1/disagg/mtp/mtp_prefill.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# NOTE: FP4 only supported starting with Blackwell GPUs.
-# https://huggingface.co/nvidia/DeepSeek-R1-FP4
-# You can also specify the full path to locally downloaded weights
-# instead of a HuggingFace ID here.
-backend: pytorch
-tensor_parallel_size: 4
-moe_expert_parallel_size: 4
-enable_attention_dp: true
-max_batch_size: 1
-max_num_tokens: 8192
-max_seq_len: 8192
-kv_cache_config:
-  free_gpu_memory_fraction: 0.75
-  dtype: fp8
-print_iter_log: true
-disable_overlap_scheduler: true
-# Enable the MTP(Multi-Token Prediction) in the prefill model engine
-speculative_config:
-  decoding_type: MTP
-  num_nextn_predict_layers: 1
-cache_transceiver_config:
-  backend: DEFAULT
--- a/examples/backends/trtllm/engine_configs/deepseek-r1/disagg/simple/decode.yaml
+++ b/examples/backends/trtllm/engine_configs/deepseek-r1/disagg/simple/decode.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-backend: pytorch
-# TP/EP/PP/DP
-tensor_parallel_size: 4
-moe_expert_parallel_size: 4
-pipeline_parallel_size: 1
-enable_attention_dp: false
-max_batch_size: 256
-max_num_tokens: 256
-# 8448 = 8192 ISL + 256 OSL
-max_seq_len: 8448
-kv_cache_config:
-  # With dp attention disabled: high free_gpu_memory_fraction is fine.
-  free_gpu_memory_fraction: 0.85
-  # With dp attention enabled: large ISL at high concurrency may need
-  # free_gpu_memory_fraction low to have enough available memory.
-  # free_gpu_memory_fraction: 0.30
-  dtype: fp8
-# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
-# NOTE: overlap_scheduler enabled by default since this commit and changed
-# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
-# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
-disable_overlap_scheduler: false
-cuda_graph_config:
-  enable_padding: true
-  # NOTE: For larger max batch size, you may want to
-  # add larger cuda graph batch sizes below to match.
-  batch_sizes:
-  - 1
-  - 2
-  - 4
-  - 8
-  - 16
-  - 32
-  - 64
-  - 128
-  - 256
-print_iter_log: true
-cache_transceiver_config:
-  backend: DEFAULT
--- a/examples/backends/trtllm/engine_configs/deepseek-r1/disagg/simple/prefill.yaml
+++ b/examples/backends/trtllm/engine_configs/deepseek-r1/disagg/simple/prefill.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-backend: pytorch
-# TP/EP/PP/DP
-tensor_parallel_size: 4
-moe_expert_parallel_size: 4
-pipeline_parallel_size: 1
-enable_attention_dp: true
-max_batch_size: 1
-max_num_tokens: 8192
-max_seq_len: 8192
-kv_cache_config:
-  free_gpu_memory_fraction: 0.75
-  dtype: fp8 # NOTE: This dtype must match in both prefill/decode configs
-# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
-# NOTE: overlap_scheduler enabled by default since this commit and changed
-# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
-# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
-disable_overlap_scheduler: true
-print_iter_log: true
-cache_transceiver_config:
-  backend: DEFAULT