docs: move all md files from components to docs (#3440)

Signed-off-by: Anant Sharma <anants@nvidia.com>
Co-authored-by: Anish <80174047+athreesh@users.noreply.github.com>

Commit 0a2a820b (unverified), authored Oct 09, 2025 by Anant Sharma; committed by GitHub Oct 10, 2025
20 changed files
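
Most hunks in this diff apply one rule: once a Markdown file moves, each relative link is recomputed from the linking file's directory to the target's new location. A minimal Python sketch of that rule follows. The helper name `relink` is hypothetical and not part of the repository, and the commit itself may have been prepared by hand or with other tooling.

```python
import posixpath


def relink(source_file: str, target_file: str, anchor: str = "") -> str:
    """Return the relative Markdown link from source_file to target_file.

    Both paths are repository-relative and use forward slashes.
    `relink` is a hypothetical helper for illustration only.
    """
    # Anchor both paths at "/" so the result does not depend on the
    # current working directory.
    start_dir = posixpath.dirname("/" + source_file)
    link = posixpath.relpath("/" + target_file, start=start_dir)
    return f"{link}#{anchor}" if anchor else link


# Paths taken from hunks in this commit.
print(relink("components/backends/trtllm/deploy/README.md",
             "docs/backends/trtllm/kv-cache-transfer.md"))
print(relink("docs/backends/sglang/README.md",
             "docs/architecture/disagg_serving.md"))
print(relink("components/backends/sglang/slurm_jobs/README.md",
             "docs/backends/sglang/dsr1-wideep-gb200.md",
             anchor="instructions"))
```

Root-anchored links such as `/docs/backends/vllm/README.md` in `components/README.md` do not depend on the linking file's location, so they fall outside this rule.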
--- a/README.md
+++ b/README.md
@@ -30,7 +30,7 @@ High-throughput, low-latency inference framework designed for serving generative
 ## Latest News
-- [08/05] Deploy `openai/gpt-oss-120b` with disaggregated serving on NVIDIA Blackwell GPUs using Dynamo [➡️ link](./components/backends/trtllm/gpt-oss.md)
+- [08/05] Deploy `openai/gpt-oss-120b` with disaggregated serving on NVIDIA Blackwell GPUs using Dynamo [➡️ link](./docs/backends/trtllm/gpt-oss.md)
 ## The Era of Multi-GPU, Multi-Node
@@ -65,9 +65,9 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa
 To learn more about each framework and their capabilities, check out each framework's README!
-- **[vLLM](components/backends/vllm/README.md)**
+- **[vLLM](docs/backends/vllm/README.md)**
-- **[SGLang](components/backends/sglang/README.md)**
+- **[SGLang](docs/backends/sglang/README.md)**
-- **[TensorRT-LLM](components/backends/trtllm/README.md)**
+- **[TensorRT-LLM](docs/backends/trtllm/README.md)**
 Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, OSS (Open Source Software) first development approach.

--- a/components/README.md
+++ b/components/README.md
@@ -23,9 +23,9 @@ This directory contains the core components that make up the Dynamo inference fr
 Dynamo supports multiple inference engines (with a focus on SGLang, vLLM, and TensorRT-LLM), each with their own deployment configurations and capabilities:
-- **[vLLM](backends/vllm/README.md)** - High-performance LLM inference with native KV cache events and NIXL-based transfer mechanisms
+- **[vLLM](/docs/backends/vllm/README.md)** - High-performance LLM inference with native KV cache events and NIXL-based transfer mechanisms
-- **[SGLang](backends/sglang/README.md)** - Structured generation language framework with ZMQ-based communication
+- **[SGLang](/docs/backends/sglang/README.md)** - Structured generation language framework with ZMQ-based communication
-- **[TensorRT-LLM](backends/trtllm/README.md)** - NVIDIA's optimized LLM inference engine with TensorRT acceleration
+- **[TensorRT-LLM](/docs/backends/trtllm/README.md)** - NVIDIA's optimized LLM inference engine with TensorRT acceleration
 Each engine provides launch scripts for different deployment patterns in their respective `/launch` & `/deploy` directories.

--- a/components/backends/sglang/slurm_jobs/README.md
+++ b/components/backends/sglang/slurm_jobs/README.md
@@ -17,7 +17,7 @@ For this example, we will make some assumptions about your SLURM cluster:
   If your cluster supports similar container based plugins, you may be able to
   modify the template to use that instead.
 3. We assume you have already built a recent Dynamo+SGLang container image as
-   described [here](../docs/dsr1-wideep-gb200.md#instructions).
+   described [here](../../../../docs/backends/sglang/dsr1-wideep-gb200.md#instructions).
   This is the image that can be passed to the `--container-image` argument in later steps.
 ## Scripts Overview

--- a/components/backends/trtllm/deploy/README.md
+++ b/components/backends/trtllm/deploy/README.md
@@ -232,7 +232,7 @@ envs:
 ## Testing the Deployment
-Send a test request to verify your deployment. See the [client section](../../../../components/backends/vllm/README.md#client) for detailed instructions.
+Send a test request to verify your deployment. See the [client section](../../../../docs/backends/vllm/README.md#client) for detailed instructions.
 **Note:** For multi-node deployments, target the node running `python3 -m dynamo.frontend <args>`.
@@ -254,7 +254,7 @@ TensorRT-LLM supports two methods for KV cache transfer in disaggregated serving
 - **UCX** (default): Standard method for KV cache transfer
 - **NIXL** (experimental): Alternative transfer method
-For detailed configuration instructions, see the [KV cache transfer guide](../kv-cache-transfer.md).
+For detailed configuration instructions, see the [KV cache transfer guide](../../../../docs/backends/trtllm/kv-cache-transfer.md).
 ## Request Migration
@@ -282,8 +282,8 @@ Configure the `model` name and `host` based on your deployment.
 - **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/kubernetes/installation_guide.md)
 - **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
 - **Architecture Docs**: [Disaggregated Serving](../../../../docs/architecture/disagg_serving.md), [KV-Aware Routing](../../../../docs/architecture/kv_cache_routing.md)
-- **Multinode Deployment**: [Multinode Examples](../multinode/multinode-examples.md)
+- **Multinode Deployment**: [Multinode Examples](../../../../docs/backends/trtllm/multinode/multinode-examples.md)
-- **Speculative Decoding**: [Llama 4 + Eagle Guide](../llama4_plus_eagle.md)
+- **Speculative Decoding**: [Llama 4 + Eagle Guide](../../../../docs/backends/trtllm/llama4_plus_eagle.md)
 - **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
 ## Troubleshooting

--- a/components/backends/trtllm/performance_sweeps/README.md
+++ b/components/backends/trtllm/performance_sweeps/README.md
@@ -41,7 +41,7 @@ Please note that:
 3. `post_process.py` - Scan the genai-perf results to produce a json with entries to each config point.
 4. `plot_performance_comparison.py` - Takes the json result file for disaggregated and/or aggregated configuration sweeps and plots a pareto line for better visualization.
-For more finer grained details on how to launch TRTLLM backend workers with DeepSeek R1 on GB200 slurm, please refer [multinode-examples.md](../multinode/multinode-examples.md). This guide shares similar assumption to the multinode examples guide.
+For more finer grained details on how to launch TRTLLM backend workers with DeepSeek R1 on GB200 slurm, please refer [multinode-examples.md](../../../../docs/backends/trtllm/multinode/multinode-examples.md). This guide shares similar assumption to the multinode examples guide.
 ## Usage

--- a/docs/_includes/dive_in_examples.rst
+++ b/docs/_includes/dive_in_examples.rst
@@ -11,20 +11,20 @@ The examples below assume you build the latest image yourself from source. If us
        Demonstrates the basic concepts of Dynamo by creating a simple GPU-unaware graph
-    .. grid-item-card:: :doc:`vLLM <../components/backends/vllm/README>`
+    .. grid-item-card:: :doc:`vLLM <../backends/vllm/README>`
-        :link: ../components/backends/vllm/README
+        :link: ../backends/vllm/README
        :link-type: doc
        Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with VLLM.
-    .. grid-item-card:: :doc:`SGLang <../components/backends/sglang/README>`
+    .. grid-item-card:: :doc:`SGLang <../backends/sglang/README>`
-        :link: ../components/backends/sglang/README
+        :link: ../backends/sglang/README
        :link-type: doc
        Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with SGLang.
-    .. grid-item-card:: :doc:`TensorRT-LLM <../components/backends/trtllm/README>`
+    .. grid-item-card:: :doc:`TensorRT-LLM <../backends/trtllm/README>`
-        :link: ../components/backends/trtllm/README
+        :link: ../backends/trtllm/README
        :link-type: doc
        Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with TensorRT-LLM.

--- a/docs/_sections/backends.rst
+++ b/docs/_sections/backends.rst
@@ -37,6 +37,6 @@ Dynamo currently supports the following high-performance inference backends:
 .. toctree::
   :maxdepth: 1
-   vLLM <../components/backends/vllm/README>
+   vLLM <../backends/vllm/README>
-   SGLang <../components/backends/sglang/README>
+   SGLang <../backends/sglang/README>
-   TensorRT-LLM <../components/backends/trtllm/README>
+   TensorRT-LLM <../backends/trtllm/README>
--- a/docs/architecture/kvbm_intro.rst
+++ b/docs/architecture/kvbm_intro.rst
@@ -63,4 +63,4 @@ The Dynamo KV Block Manager serves as a reference implementation that emphasizes
   KVBM Architecture <kvbm_architecture.md>
   Understanding KVBM components <kvbm_components.md>
   KVBM Further Reading <kvbm_reading>
-   LMCache Integration <../components/backends/vllm/LMCache_Integration.md>
+   LMCache Integration <../backends/vllm/LMCache_Integration>
--- a/components/backends/sglang/README.md
+++ b/components/backends/sglang/README.md
@@ -35,13 +35,13 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 | Feature | SGLang | Notes |
 |---------|--------|-------|
-| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ |  |
+| [**Disaggregated Serving**](../../architecture/disagg_serving.md) | ✅ |  |
-| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
+| [**Conditional Disaggregation**](../../architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
-| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ |  |
+| [**KV-Aware Routing**](../../architecture/kv_cache_routing.md) | ✅ |  |
-| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | ✅ |  |
+| [**SLA-Based Planner**](../../architecture/sla_planner.md) | ✅ |  |
-| [**Multimodal EPD Disaggregation**](docs/multimodal_epd.md) | ✅ |  |
+| [**Multimodal EPD Disaggregation**](multimodal_epd.md) | ✅ |  |
-| [**Load Based Planner**](../../../docs/architecture/load_planner.md) | ❌ | Planned |
+| [**Load Based Planner**](../../architecture/load_planner.md) | ❌ | Planned |
-| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | ❌ | Planned |
+| [**KVBM**](../../architecture/kvbm_architecture.md) | ❌ | Planned |
 ### Large Scale P/D and WideEP Features
@@ -229,7 +229,7 @@ cd $DYNAMO_HOME/components/backends/sglang
 ./launch/disagg_dp_attn.sh
 ```
-When using MoE models, you can also use the our implementation of the native SGLang endpoints to record expert distribution data. The `disagg_dp_attn.sh` script automatically sets up the SGLang HTTP server, the environment variable that controls the expert distribution recording directory, and sets up the expert distribution recording mode to `stat`. You can learn more about expert parallelism load balancing [here](docs/expert-distribution-eplb.md).
+When using MoE models, you can also use the our implementation of the native SGLang endpoints to record expert distribution data. The `disagg_dp_attn.sh` script automatically sets up the SGLang HTTP server, the environment variable that controls the expert distribution recording directory, and sets up the expert distribution recording mode to `stat`. You can learn more about expert parallelism load balancing [here](expert-distribution-eplb.md).
 ### Testing the Deployment
@@ -266,24 +266,24 @@ This allows a request to be migrated up to 3 times before failing. See the [Requ
 Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!
 ### Run a multi-node sized model
-- **[Run a multi-node model](docs/multinode-examples.md)**
+- **[Run a multi-node model](multinode-examples.md)**
 ### Large scale P/D disaggregation with WideEP
-- **[Run DeepSeek-R1 on 104+ H100s](docs/dsr1-wideep-h100.md)**
+- **[Run DeepSeek-R1 on 104+ H100s](dsr1-wideep-h100.md)**
-- **[Run DeepSeek-R1-FP8 on GB200s](docs/dsr1-wideep-gb200.md)**
+- **[Run DeepSeek-R1-FP8 on GB200s](dsr1-wideep-gb200.md)**
 ### Hierarchical Cache (HiCache)
-- **[Enable SGLang Hierarchical Cache (HiCache)](docs/sgl-hicache-example.md)**
+- **[Enable SGLang Hierarchical Cache (HiCache)](sgl-hicache-example.md)**
 ### Multimodal Encode-Prefill-Decode (EPD) Disaggregation with NIXL
-- **[Run a multimodal model with EPD Disaggregation](docs/multimodal_epd.md)**
+- **[Run a multimodal model with EPD Disaggregation](multimodal_epd.md)**
 ## Deployment
 We currently provide deployment examples for Kubernetes and SLURM.
 ## Kubernetes
-- **[Deploying Dynamo with SGLang on Kubernetes](deploy/README.md)**
+- **[Deploying Dynamo with SGLang on Kubernetes](../../../components/backends/sglang/deploy/README.md)**
 ## SLURM
-- **[Deploying Dynamo with SGLang on SLURM](slurm_jobs/README.md)**
+- **[Deploying Dynamo with SGLang on SLURM](../../../components/backends/sglang/slurm_jobs/README.md)**
--- a/components/backends/sglang/docs/dsr1-wideep-gb200.md
+++ b/components/backends/sglang/docs/dsr1-wideep-gb200.md
--- a/components/backends/sglang/docs/dsr1-wideep-h100.md
+++ b/components/backends/sglang/docs/dsr1-wideep-h100.md
--- a/components/backends/sglang/docs/expert-distribution-eplb.md
+++ b/components/backends/sglang/docs/expert-distribution-eplb.md
--- a/components/backends/sglang/gpt-oss.md
+++ b/components/backends/sglang/gpt-oss.md
@@ -5,7 +5,7 @@ SPDX-License-Identifier: Apache-2.0
 # Running gpt-oss-120b Disaggregated with SGLang
-The gpt-oss-120b guide for SGLang is largely identical to the [guide for vLLM](/components/backends/vllm/gpt-oss.md),
+The gpt-oss-120b guide for SGLang is largely identical to the [guide for vLLM](/docs/backends/vllm/gpt-oss.md),
 please ues the vLLM guide as a reference with the different deployment steps as highlighted below:
 # Launch the Deployment

--- a/components/backends/sglang/docs/multimodal_epd.md
+++ b/components/backends/sglang/docs/multimodal_epd.md
@@ -31,7 +31,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 The MultimodalEncodeWorker is responsible for encoding the image and passing the embeddings to the MultimodalWorker via a combination of NATS and RDMA.
 The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
-Its MultimodalWorker then prefills and decodes the prompt, just like the [LLM aggregated serving](../README.md) example.
+Its MultimodalWorker then prefills and decodes the prompt, just like the [LLM aggregated serving](README.md) example.
 By separating the encode from the prefill and decode stages, we can have a more flexible deployment and scale the
 MultimodalEncodeWorker independently from the prefill and decode workers if needed.
@@ -116,7 +116,7 @@ You should see a response similar to this:
 For the Qwen2.5-VL model, embeddings are only required during the prefill stage. As such, the image embeddings are transferred using a NIXL descriptor from the encode worker to the worker and then passed to the prefill worker for processing.
 The prefill worker performs the prefilling step and forwards the KV cache to the worker for decoding.
-For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](../README.md) example.
+For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](README.md) example.
 This figure illustrates the workflow:
 ```mermaid

--- a/components/backends/sglang/docs/multinode-examples.md
+++ b/components/backends/sglang/docs/multinode-examples.md
--- a/components/backends/sglang/docs/sgl-hicache-example.md
+++ b/components/backends/sglang/docs/sgl-hicache-example.md
--- a/components/backends/trtllm/README.md
+++ b/components/backends/trtllm/README.md
@@ -186,11 +186,11 @@ For comprehensive instructions on multinode serving, see the [multinode-examples
 ### Kubernetes Deployment
-For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [TensorRT-LLM Kubernetes Deployment Guide](deploy/README.md).
+For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [TensorRT-LLM Kubernetes Deployment Guide](../../../components/backends/trtllm/deploy/README.md).
 ### Client
-See [client](../sglang/README.md#testing-the-deployment) section to learn how to send request to the deployment.
+See [client](../../../docs/backends/sglang/README.md#testing-the-deployment) section to learn how to send request to the deployment.
 NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
@@ -230,7 +230,7 @@ This allows a request to be migrated up to 3 times before failing. See the [Requ
 ## Client
-See [client](../sglang/README.md#testing-the-deployment) section to learn how to send request to the deployment.
+See [client](../../../docs/backends/sglang/README.md#testing-the-deployment) section to learn how to send request to the deployment.
 NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
@@ -302,7 +302,7 @@ sampling_params.logits_processor = create_trtllm_adapters(processors)
 ## Performance Sweep
-For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](./performance_sweeps/README.md). This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.
+For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](../../../components/backends/trtllm/performance_sweeps/README.md). This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.
 ## Dynamo KV Block Manager Integration

--- a/components/backends/trtllm/gemma3_sliding_window_attention.md
+++ b/components/backends/trtllm/gemma3_sliding_window_attention.md
@@ -23,9 +23,9 @@ VSWA is a mechanism in which a model’s layers alternate between multiple slidi
 > [!Note]
 > - Ensure that required services such as `nats` and `etcd` are running before starting.
 > - Request access to `google/gemma-3-1b-it` on Hugging Face and set your `HF_TOKEN` environment variable for authentication.
-> - It’s recommended to continue using the VSWA feature with the Dynamo 0.5.0 release and the TensorRT-LLM dynamo runtime image nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.0. The 0.5.1 release bundles TensorRT-LLM v1.1.0rc5, which has a regression that breaks VSWA.
+> - It's recommended to continue using the VSWA feature with the Dynamo 0.5.0 release and the TensorRT-LLM dynamo runtime image nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.0. The 0.5.1 release bundles TensorRT-LLM v1.1.0rc5, which has a regression that breaks VSWA.
-### Aggregated Serving
+## Aggregated Serving
 ```bash
 cd $DYNAMO_HOME/components/backends/trtllm
 export MODEL_PATH=google/gemma-3-1b-it
@@ -34,7 +34,7 @@ export AGG_ENGINE_ARGS=engine_configs/gemma3/vswa_agg.yaml
 ./launch/agg.sh
 ```
-### Aggregated Serving with KV Routing
+## Aggregated Serving with KV Routing
 ```bash
 cd $DYNAMO_HOME/components/backends/trtllm
 export MODEL_PATH=google/gemma-3-1b-it
@@ -43,7 +43,7 @@ export AGG_ENGINE_ARGS=engine_configs/gemma3/vswa_agg.yaml
 ./launch/agg_router.sh
 ```
-#### Disaggregated Serving
+## Disaggregated Serving
 ```bash
 cd $DYNAMO_HOME/components/backends/trtllm
 export MODEL_PATH=google/gemma-3-1b-it
@@ -53,7 +53,7 @@ export DECODE_ENGINE_ARGS=engine_configs/gemma3/vswa_decode.yaml
 ./launch/disagg.sh
 ```
-#### Disaggregated Serving with KV Routing
+## Disaggregated Serving with KV Routing
 ```bash
 cd $DYNAMO_HOME/components/backends/trtllm
 export MODEL_PATH=google/gemma-3-1b-it

--- a/components/backends/trtllm/gpt-oss.md
+++ b/components/backends/trtllm/gpt-oss.md
--- a/components/backends/trtllm/kv-cache-transfer.md
+++ b/components/backends/trtllm/kv-cache-transfer.md