docs: address Harry/VDR feedback + fixing broken links across repository (#3802)

Signed-off-by: Harry Kim <harry_kim@live.com>
Signed-off-by: athreesh <anish.maddipoti@utexas.edu>
Signed-off-by: akshatha-k <33278067+akshatha-k@users.noreply.github.com>
Signed-off-by: Harrison Saturley-Hall <hsaturleyhal@nvidia.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Co-authored-by: Harry Kim <harry_kim@live.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: akshatha-k <33278067+akshatha-k@users.noreply.github.com>
Co-authored-by: Harrison Saturley-Hall <hsaturleyhal@nvidia.com>

Unverified Commit c6b59045 authored Oct 22, 2025 by Anish Committed by GitHub Oct 22, 2025
20 changed files
--- a/docs/backends/trtllm/README.md
+++ b/docs/backends/trtllm/README.md
@@ -52,9 +52,9 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 | Feature | TensorRT-LLM | Notes |
 |---------|--------------|-------|
-| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ |  |
+| [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ |  |
-| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
+| [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
-| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ |  |
+| [**KV-Aware Routing**](../../../docs/router/kv_cache_routing.md) | ✅ |  |
 | [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ |  |
 | [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | Planned |
 | [**KVBM**](../../../docs/kvbm/kvbm_architecture.md) | ✅ | |
@@ -220,13 +220,13 @@ Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disag
 ## Request Migration
-You can enable [request migration](../../../docs/architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
+You can enable [request migration](../../../docs/fault_tolerance/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
 ```bash
 python3 -m dynamo.trtllm ... --migration-limit=3
 ```
-This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../../docs/architecture/request_migration.md) documentation for details on how this works.
+This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../../docs/fault_tolerance/request_migration.md) documentation for details on how this works.
 ## Request Cancellation
@@ -240,7 +240,7 @@ When a user cancels a request (e.g., by disconnecting from the frontend), the re
 | **Disaggregated (Decode-First)** | ✅ | ✅ |
 | **Disaggregated (Prefill-First)** | ✅ | ✅ |
-For more details, see the [Request Cancellation Architecture](../../../docs/architecture/request_cancellation.md) documentation.
+For more details, see the [Request Cancellation Architecture](../../fault_tolerance/request_cancellation.md) documentation.
 ## Client

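Review note: the last hunk above rewrites one link as `../../fault_tolerance/request_cancellation.md`, while the other links in the same file keep the longer `../../../docs/...` form. Both should point at the same file. The sketch below is one way to check that, assuming links are resolved relative to the directory of the Markdown file that contains them; `resolve_link` is a hypothetical helper, not part of the repository.

```python
import posixpath


def resolve_link(source_file: str, link: str) -> str:
    """Resolve a relative Markdown link against the file that contains it.

    Hypothetical helper for illustration; any #anchor is dropped first.
    """
    target = link.split("#", 1)[0]
    base_dir = posixpath.dirname(source_file)
    return posixpath.normpath(posixpath.join(base_dir, target))


source = "docs/backends/trtllm/README.md"

# Long form used by most links in this file.
long_form = resolve_link(
    source, "../../../docs/fault_tolerance/request_cancellation.md"
)
# Short form introduced by the Request Cancellation hunk.
short_form = resolve_link(
    source, "../../fault_tolerance/request_cancellation.md"
)
# Links with an anchor resolve to the file itself.
anchored = resolve_link(
    source, "../../../docs/design_docs/disagg_serving.md#conditional-disaggregation"
)

print(long_form)
print(short_form)
print(anchored)
```

Under that assumption the two forms are equivalent, so the mixed style is cosmetic rather than a broken link.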
--- a/docs/backends/vllm/README.md
+++ b/docs/backends/vllm/README.md
@@ -35,9 +35,9 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 | Feature | vLLM | Notes |
 |---------|------|-------|
-| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ |  |
+| [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ |  |
-| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP |
+| [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP |
-| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ |  |
+| [**KV-Aware Routing**](../../../docs/router/kv_cache_routing.md) | ✅ |  |
 | [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ |  |
 | [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | WIP |
 | [**KVBM**](../../../docs/kvbm/kvbm_architecture.md) | ✅ |  |
@@ -153,7 +153,7 @@ Below we provide a selected list of advanced deployments. Please open up an issu
 ### Kubernetes Deployment
-For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [vLLM Kubernetes Deployment Guide](/components/backends/vllm/deploy/README.md)
+For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [vLLM Kubernetes Deployment Guide](../../../components/backends/vllm/deploy/README.md)
 ## Configuration
@@ -178,17 +178,17 @@ When using KV-aware routing, ensure deterministic hashing across processes to av
 ```bash
 vllm serve ... --enable-prefix-caching --prefix-caching-algo sha256
 ```
-See the high-level notes in [KV Cache Routing](../../../docs/architecture/kv_cache_routing.md) on deterministic event IDs.
+See the high-level notes in [KV Cache Routing](../../../docs/router/kv_cache_routing.md) on deterministic event IDs.
 ## Request Migration
-You can enable [request migration](../../../docs/architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
+You can enable [request migration](../../../docs/fault_tolerance/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
 ```bash
 python3 -m dynamo.vllm ... --migration-limit=3
 ```
-This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../../docs/architecture/request_migration.md) documentation for details on how this works.
+This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../../docs/fault_tolerance/request_migration.md) documentation for details on how this works.
 ## Request Cancellation
@@ -201,4 +201,4 @@ When a user cancels a request (e.g., by disconnecting from the frontend), the re
 | **Aggregated** | ✅ | ✅ |
 | **Disaggregated** | ✅ | ✅ |
-For more details, see the [Request Cancellation Architecture](../../../docs/architecture/request_cancellation.md) documentation.
+For more details, see the [Request Cancellation Architecture](../../../docs/fault_tolerance/request_cancellation.md) documentation.
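Review note: the Request Migration hunks above say that `--migration-limit=3` lets a request be migrated up to 3 times before failing. A minimal sketch of that counting rule follows. It is not Dynamo's implementation: `WorkerFailed`, `run_with_migration`, and the toy workers are hypothetical names, and the sketch ignores the partially generated output that real migration would carry over.

```python
from typing import Callable, Sequence


class WorkerFailed(Exception):
    """Raised by a hypothetical worker that dies while serving a request."""


def run_with_migration(
    workers: Sequence[Callable[[str], str]],
    prompt: str,
    migration_limit: int,
) -> str:
    """Try workers in order, migrating at most `migration_limit` times.

    The first attempt is not a migration, so a limit of 3 allows up to
    four attempts in total.
    """
    migrations = 0
    for worker in workers:
        try:
            return worker(prompt)
        except WorkerFailed:
            if migrations >= migration_limit:
                raise  # limit reached: the request fails
            migrations += 1  # move the request to the next worker
    raise WorkerFailed("no workers left to migrate to")


def healthy(prompt: str) -> str:
    return "done:" + prompt


def failing(prompt: str) -> str:
    raise WorkerFailed("worker went away")


# Three failures followed by a healthy worker fits within a limit of 3.
print(run_with_migration([failing, failing, failing, healthy], "hi", 3))
```

With a limit of 0, the first worker failure is returned to the caller, which matches the `migration_limit` default described in the backend guide.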
--- a/docs/benchmarks/benchmarking.md
+++ b/docs/benchmarks/benchmarking.md
@@ -521,3 +521,18 @@ The built-in Python workflow connects to endpoints, benchmarks with aiperf, and
 3. **Direct module usage**: Use individual Python modules (`benchmarks.utils.benchmark`, `benchmarks.utils.plot`) for granular control over each step of the benchmarking process.
 The Python benchmarking module provides a complete end-to-end benchmarking experience with full control over the workflow.
+---
+## Testing with Mocker Backend
+For development and testing purposes, Dynamo provides a [mocker backend](../../components/src/dynamo/mocker/) that simulates LLM inference without requiring actual GPU resources. This is useful for:
+- **Testing deployments** without expensive GPU infrastructure
+- **Developing and debugging** router, planner, or frontend logic
+- **CI/CD pipelines** that need to validate infrastructure without model execution
+- **Benchmarking framework validation** to ensure your setup works before using real backends
+The mocker backend mimics the API and behavior of real backends (vLLM, SGLang, TensorRT-LLM) but generates mock responses instead of running actual inference.
+See the [mocker directory](../../components/src/dynamo/mocker/) for usage examples and configuration options.
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -70,7 +70,7 @@ html_theme_options = {
        }
    ],
    "switcher": {
-        "json_url": "../versions1.json",
+        "json_url": "versions1.json",
        "version_match": release,
    },
    "extra_head": {

--- a/docs/architecture/architecture.md
+++ b/docs/architecture/architecture.md
@@ -53,7 +53,7 @@ To address the growing demands of distributed inference serving, NVIDIA introduc
 The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features:
 - [Dynamo Disaggregated Serving](disagg_serving.md)
-- [Dynamo Smart Router](kv_cache_routing.md)
+- [Dynamo Smart Router](../router/kv_cache_routing.md)
 - [Dynamo KV Cache Block Manager](../kvbm/kvbm_intro.rst)
 - [Planner](../planner/planner_intro.rst)
 - [NVIDIA Inference Transfer Library (NIXL)](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)

--- a/docs/architecture/disagg_serving.md
+++ b/docs/architecture/disagg_serving.md
--- a/docs/architecture/distributed_runtime.md
+++ b/docs/architecture/distributed_runtime.md
--- a/docs/architecture/dynamo_flow.md
+++ b/docs/architecture/dynamo_flow.md
--- a/docs/development/backend-guide.md
+++ b/docs/development/backend-guide.md
@@ -74,7 +74,7 @@ The `model_type` can be:
 - `model_name`: The name to call the model. Your incoming HTTP requests model name must match this. Defaults to the hugging face repo name or the folder name.
 - `context_length`: Max model length in tokens. Defaults to the model's set max. Only set this if you need to reduce KV cache allocation to fit into VRAM.
 - `kv_cache_block_size`: Size of a KV block for the engine, in tokens. Defaults to 16.
-- `migration_limit`: Maximum number of times a request may be [migrated to another Instance](../architecture/request_migration.md). Defaults to 0.
+- `migration_limit`: Maximum number of times a request may be [migrated to another Instance](../fault_tolerance/request_migration.md). Defaults to 0.
 - `user_data`: Optional dictionary containing custom metadata for worker behavior (e.g., LoRA configuration). Defaults to None.
 See `components/backends` for full code examples.
@@ -116,7 +116,7 @@ In the P/D disaggregated setup you would have `deepseek-distill-llama8b.prefill.
 A Python worker may need to be shut down promptly, for example when the node running the worker is to be reclaimed and there isn't enough time to complete all ongoing requests before the shutdown deadline.
-In such cases, you can signal incomplete responses by raising a `GeneratorExit` exception in your generate loop. This will immediately close the response stream, signaling to the frontend that the stream is incomplete. With request migration enabled (see the [`migration_limit`](../architecture/request_migration.md) parameter), the frontend will automatically migrate the partially completed request to another worker instance, if available, to be completed.
+In such cases, you can signal incomplete responses by raising a `GeneratorExit` exception in your generate loop. This will immediately close the response stream, signaling to the frontend that the stream is incomplete. With request migration enabled (see the [`migration_limit`](../fault_tolerance/request_migration.md) parameter), the frontend will automatically migrate the partially completed request to another worker instance, if available, to be completed.
 > [!WARNING]
 > We will update the `GeneratorExit` exception to a new Dynamo exception. Please expect minor code breaking change in the near future.
@@ -139,7 +139,7 @@ class RequestHandler:
 When `GeneratorExit` is raised, the frontend receives the incomplete response and can seamlessly continue generation on another available worker instance, preserving the user experience even during worker shutdowns.
-For more information about how request migration works, see the [Request Migration Architecture](../architecture/request_migration.md) documentation.
+For more information about how request migration works, see the [Request Migration Architecture](../fault_tolerance/request_migration.md) documentation.
 ## Request Cancellation
@@ -161,4 +161,4 @@ class RequestHandler:
 The context parameter is optional - if your generate method doesn't include it in its signature, Dynamo will call your method without the context argument.
-For detailed information about request cancellation, including async cancellation monitoring and context propagation patterns, see the [Request Cancellation Architecture](../architecture/request_cancellation.md) documentation.
+For detailed information about request cancellation, including async cancellation monitoring and context propagation patterns, see the [Request Cancellation Architecture](../fault_tolerance/request_cancellation.md) documentation.
--- a/docs/architecture/request_cancellation.md
+++ b/docs/architecture/request_cancellation.md
--- a/docs/architecture/request_migration.md
+++ b/docs/architecture/request_migration.md
--- a/docs/hidden_toctree.rst
+++ b/docs/hidden_toctree.rst
@@ -25,10 +25,9 @@
   api/nixl_connect/README.md
   kubernetes/api_reference.md
-   kubernetes/create_deployment.md
+   kubernetes/deployment/create_deployment.md
   kubernetes/fluxcd.md
-   kubernetes/gke_setup.md
   kubernetes/grove.md
   kubernetes/model_caching_with_fluid.md
   kubernetes/README.md
@@ -36,12 +35,12 @@
   observability/metrics.md
   kvbm/vllm-setup.md
   kvbm/trtllm-setup.md
-   guides/tool-calling.md
+   agents/tool-calling.md
-   architecture/kv_cache_routing.md
+   router/kv_cache_routing.md
   planner/load_planner.md
-   architecture/request_migration.md
+   fault_tolerance/request_migration.md
-   architecture/request_cancellation.md
+   fault_tolerance/request_cancellation.md
   backends/trtllm/multinode/multinode-examples.md
   backends/trtllm/multinode/multinode-multimodal-example.md
@@ -66,14 +65,16 @@
   examples/README.md
   examples/runtime/hello_world/README.md
-   architecture/distributed_runtime.md
+   design_docs/distributed_runtime.md
-   architecture/dynamo_flow.md
+   design_docs/dynamo_flow.md
   backends/vllm/deepseek-r1.md
   backends/vllm/gpt-oss.md
   backends/vllm/multi-node.md
   backends/vllm/prometheus.md
+   benchmarks/kv-router-ab-testing.md
 ..   TODO: architecture/distributed_runtime.md and architecture/dynamo_flow.md
     have some outdated names/references and need a refresh.
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -43,22 +43,28 @@ Quickstart
   Quickstart <self>
   Installation <_sections/installation>
   Support Matrix <reference/support-matrix.md>
-   Architecture <_sections/architecture>
   Examples <_sections/examples>
 .. toctree::
   :hidden:
   :caption: Kubernetes Deployment
-   Quickstart (K8s) <../kubernetes/README.md>
+   Deployment Guide <_sections/k8s_deployment>
-   Detailed Installation Guide <../kubernetes/installation_guide.md>
+   Observability (K8s) <_sections/k8s_observability>
-   Creating Deployments <../kubernetes/create_deployment.md>
+   Multinode <_sections/k8s_multinode>
-   API Reference <../kubernetes/api_reference.md>
-   Dynamo Operator <../kubernetes/dynamo_operator.md>
+.. toctree::
-   Metrics <../kubernetes/metrics.md>
+   :hidden:
-   Logging <../kubernetes/logging.md>
+   :caption: User Guides
-   Multinode <../kubernetes/multinode-deployment.md>
-   Minikube Setup <../kubernetes/minikube.md>
+   Tool Calling <agents/tool-calling.md>
+   Multimodality Support <multimodal/multimodal_intro.md>
+   Finding Best Initial Configs <performance/aiconfigurator.md>
+   Benchmarking <benchmarks/benchmarking.md>
+   Tuning Disaggregated Performance <performance/tuning.md>
+   Writing Python Workers in Dynamo <development/backend-guide.md>
+   Observability (Local) <_sections/observability>
+   Glossary <reference/glossary.md>
 .. toctree::
   :hidden:
@@ -71,13 +77,9 @@ Quickstart
 .. toctree::
   :hidden:
-   :caption: Developer Guide
+   :caption: Design Docs
-   Benchmarking Guide <benchmarks/benchmarking.md>
+   Overall Architecture <design_docs/architecture.md>
-   KV Router A/B Testing <benchmarks/kv-router-ab-testing.md>
+   Architecture Flow <design_docs/dynamo_flow.md>
-   SLA Planner (Autoscaling) Quickstart <planner/sla_planner_quickstart>
+   Disaggregated Serving <design_docs/disagg_serving.md>
-   Logging <observability/logging.md>
+   Distributed Runtime <design_docs/distributed_runtime.md>
-   Health Checks <observability/health-checks.md>
-   Tuning Disaggregated Serving Performance <performance/tuning.md>
-   Writing Python Workers in Dynamo <development/backend-guide.md>
-   Glossary <reference/glossary.md>
--- a/docs/kubernetes/README.md
+++ b/docs/kubernetes/README.md
@@ -15,14 +15,26 @@ See the License for the specific language governing permissions and
 limitations under the License.
 -->
-# Deploying Inference Graphs to Kubernetes
+# Deploying Dynamo on Kubernetes
 High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.
-## Pre-deployment Checks
+## Important Terminology
+**Kubernetes Namespace**: The K8s namespace where your DynamoGraphDeployment resource is created.
+- Used for: Resource isolation, RBAC, organizing deployments
+- Example: `dynamo-system`, `dynamo-cloud`, `team-a-namespace`
+**Dynamo Namespace**: The logical namespace used by Dynamo components for service discovery via etcd.
+- Used for: Runtime component communication, service discovery
+- Specified in: `.spec.services.<ServiceName>.dynamoNamespace` field
+- Example: `my-llm`, `production-model`, `dynamo-dev`
-Before deploying the platform, it is recommended to run the pre-deployment checks to ensure the cluster is ready for deployment. Please refer to the [pre-deployment checks](/deploy/cloud/pre-deployment/README.md) for more details.
+These are independent. A single Kubernetes namespace can host multiple Dynamo namespaces, and vice versa.
+## Pre-deployment Checks
+Before deploying the platform, it is recommended to run the pre-deployment checks to ensure the cluster is ready for deployment. Please refer to the [pre-deployment checks](../../deploy/cloud/pre-deployment/README.md) for more details.
 ## 1. Install Platform First
@@ -31,7 +43,7 @@ Before deploying the platform, it is recommended to run the pre-deployment check
 export NAMESPACE=dynamo-system
 export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
-# 2. Install CRDs
+# 2. Install CRDs (skip if on shared cluster where CRDs already exist)
 helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
 helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default
@@ -40,22 +52,29 @@ helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-$
 helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
 ```
-For more details or customization options (including multinode deployments), see **[Installation Guide for Dynamo Kubernetes Platform](/docs/kubernetes/installation_guide.md)**.
+**For Shared/Multi-Tenant Clusters:**
+If your cluster has namespace-restricted Dynamo operators, add this flag to step 3:
+```bash
+--set dynamo-operator.namespaceRestriction.enabled=true
+```
+For more details or customization options (including multinode deployments), see **[Installation Guide for Dynamo Kubernetes Platform](./installation_guide.md)**.
 ## 2. Choose Your Backend
 Each backend has deployment examples and configuration options:
-| Backend | Available Configurations |
+| Backend      | Aggregated | Aggregated + Router | Disaggregated | Disaggregated + Router | Disaggregated + Planner | Disaggregated Multi-node |
-|---------|--------------------------|
+|--------------|:----------:|:-------------------:|:-------------:|:----------------------:|:-----------------------:|:------------------------:|
-| **[vLLM](/components/backends/vllm/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router, Disaggregated + Planner, Disaggregated Multi-node |
+| **[SGLang](../../components/backends/sglang/deploy/README.md)**       | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| **[SGLang](/components/backends/sglang/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Planner, Disaggregated Multi-node |
+| **[TensorRT-LLM](../../components/backends/trtllm/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ |
-| **[TensorRT-LLM](/components/backends/trtllm/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router, Disaggregated Multi-node |
+| **[vLLM](../../components/backends/vllm/deploy/README.md)**           | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
 ## 3. Deploy Your First Model
 ```bash
-export NAMESPACE=dynamo-cloud
+export NAMESPACE=dynamo-system
 kubectl create namespace ${NAMESPACE}
 # to pull model from HF
@@ -75,6 +94,8 @@ kubectl port-forward svc/vllm-agg-frontend 8000:8000 -n ${NAMESPACE}
 curl http://localhost:8000/v1/models
 ```
+For SLA-based autoscaling, see [SLA Planner Quick Start Guide](../planner/sla_planner_quickstart.md).
 ## Understanding Dynamo's Custom Resources
 Dynamo provides two main Kubernetes Custom Resources for deploying models:
@@ -103,15 +124,15 @@ A lower-level interface that defines your complete inference pipeline:
 Use this when you need fine-grained control or have already completed profiling.
-Refer to the [API Reference and Documentation](/docs/kubernetes/api_reference.md) for more details.
+Refer to the [API Reference and Documentation](./api_reference.md) for more details.
 ## 📖 API Reference & Documentation
 For detailed technical specifications of Dynamo's Kubernetes resources:
-- **[API Reference](/docs/kubernetes/api_reference.md)** - Complete CRD field specifications for all Dynamo resources
+- **[API Reference](./api_reference.md)** - Complete CRD field specifications for all Dynamo resources
-- **[Create Deployment](/docs/kubernetes/create_deployment.md)** - Step-by-step deployment creation with DynamoGraphDeployment
+- **[Create Deployment](./deployment/create_deployment.md)** - Step-by-step deployment creation with DynamoGraphDeployment
-- **[Operator Guide](/docs/kubernetes/dynamo_operator.md)** - Dynamo operator configuration and management
+- **[Operator Guide](./dynamo_operator.md)** - Dynamo operator configuration and management
 ### Choosing Your Architecture Pattern
@@ -194,13 +215,13 @@ Key customization points include:
 ## Additional Resources
-- **[Examples](/examples/README.md)** - Complete working examples
+- **[Examples](../examples/README.md)** - Complete working examples
-- **[Create Custom Deployments](/docs/kubernetes/create_deployment.md)** - Build your own CRDs
+- **[Create Custom Deployments](./deployment/create_deployment.md)** - Build your own CRDs
-- **[Operator Documentation](/docs/kubernetes/dynamo_operator.md)** - How the platform works
+- **[Operator Documentation](./dynamo_operator.md)** - How the platform works
-- **[Helm Charts](/deploy/helm/README.md)** - For advanced users
+- **[Helm Charts](../../deploy/helm/README.md)** - For advanced users
-- **[GitOps Deployment with FluxCD](/docs/kubernetes/fluxcd.md)** - For advanced users
+- **[GitOps Deployment with FluxCD](./fluxcd.md)** - For advanced users
-- **[Logging](/docs/kubernetes/logging.md)** - For logging setup
+- **[Logging](./observability/logging.md)** - For logging setup
-- **[Multinode Deployment](/docs/kubernetes/multinode-deployment.md)** - For multinode deployment
+- **[Multinode Deployment](./deployment/multinode-deployment.md)** - For multinode deployment
-- **[Grove](/docs/kubernetes/grove.md)** - For grove details and custom installation
+- **[Grove](./grove.md)** - For grove details and custom installation
-- **[Monitoring](/docs/kubernetes/metrics.md)** - For monitoring setup
+- **[Monitoring](./observability/metrics.md)** - For monitoring setup
-- **[Model Caching with Fluid](/docs/kubernetes/model_caching_with_fluid.md)** - For model caching with Fluid
+- **[Model Caching with Fluid](./model_caching_with_fluid.md)** - For model caching with Fluid
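Review note: the "Important Terminology" section added to `docs/kubernetes/README.md` states that the Kubernetes namespace and the Dynamo namespace are independent. The sketch below illustrates that claim with plain dictionaries shaped like a DynamoGraphDeployment. Only `metadata.namespace` (standard Kubernetes) and `.spec.services.<ServiceName>.dynamoNamespace` (from the text above) are taken as given; the deployment names, service names, and the helper function are hypothetical.

```python
from collections import defaultdict

# Hypothetical deployments; names and services are illustrative only.
deployments = [
    {
        "metadata": {"name": "llm-a", "namespace": "dynamo-system"},
        "spec": {"services": {
            "Frontend": {"dynamoNamespace": "my-llm"},
            "Worker": {"dynamoNamespace": "my-llm"},
        }},
    },
    {
        "metadata": {"name": "llm-b", "namespace": "dynamo-system"},
        "spec": {"services": {
            "Frontend": {"dynamoNamespace": "dynamo-dev"},
        }},
    },
    {
        "metadata": {"name": "llm-c", "namespace": "team-a-namespace"},
        "spec": {"services": {
            "Frontend": {"dynamoNamespace": "my-llm"},
        }},
    },
]


def dynamo_namespaces_by_k8s_namespace(items):
    """Group the Dynamo namespaces in use under each Kubernetes namespace."""
    grouped = defaultdict(set)
    for item in items:
        k8s_namespace = item["metadata"]["namespace"]
        for service in item["spec"]["services"].values():
            grouped[k8s_namespace].add(service["dynamoNamespace"])
    return {name: sorted(values) for name, values in grouped.items()}


mapping = dynamo_namespaces_by_k8s_namespace(deployments)
print(mapping)
```

Here `dynamo-system` hosts two Dynamo namespaces, and `my-llm` appears under two Kubernetes namespaces, which is the "and vice versa" case.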
--- a/docs/kubernetes/create_deployment.md
+++ b/docs/kubernetes/create_deployment.md
 # Creating Kubernetes Deployments
 The scripts in the `components/<backend>/launch` folder like [agg.sh](../../../components/backends/vllm/launch/agg.sh) demonstrate how you can serve your models locally.
-The corresponding YAML files like [agg.yaml](../../../components/backends/vllm/deploy/agg.yaml) show you how you could create a kubernetes deployment for your inference graph.
+The corresponding YAML files like [agg.yaml](../../../components/backends/vllm/deploy/agg.yaml) show you how you could create a Kubernetes deployment for your inference graph.
 This guide explains how to create your own deployment files.
 ## Step 1: Choose Your Architecture Pattern
+Before choosing a template, understand the different architecture patterns:
+### Aggregated Serving (agg.yaml)
+**Pattern**: Prefill and decode on the same GPU in a single process.
+**Suggested to use for**:
+- Small to medium models (under 70B parameters)
+- Development and testing
+- Low to moderate traffic
+- Simplicity is prioritized over maximum throughput
+**Tradeoffs**:
+- Simpler setup and debugging
+- Lower operational complexity
+- GPU utilization may not be optimal (prefill and decode compete for resources)
+- Lower throughput ceiling compared to disaggregated
+**Example**: [`agg.yaml`](../../../components/backends/vllm/deploy/agg.yaml)
+### Aggregated + Router (agg_router.yaml)
+**Pattern**: Load balancer routing across multiple aggregated worker instances.
+**Suggested to use for**:
+- Medium traffic requiring high availability
+- Need horizontal scaling
+- Want some load balancing without disaggregation complexity
+**Tradeoffs**:
+- Better scalability than plain aggregated
+- High availability through multiple replicas
+- Still has GPU underutilization issues of aggregated serving
+- More complex than plain aggregated but simpler than disaggregated
+**Example**: [`agg_router.yaml`](../../../components/backends/vllm/deploy/agg_router.yaml)
+### Disaggregated Serving (disagg_router.yaml)
+**Pattern**: Separate prefill and decode workers with specialized optimization.
+**Suggested to use for**:
+- Production-style deployments
+- High throughput requirements
+- Large models (70B+ parameters)
+- Maximum GPU utilization needed
+**Tradeoffs**:
+- Maximum performance and throughput
+- Better GPU utilization (prefill and decode specialized)
+- Independent scaling of prefill and decode
+- More complex setup and debugging
+- Requires understanding of prefill/decode separation
+**Example**: [`disagg_router.yaml`](../../../components/backends/vllm/deploy/disagg_router.yaml)
+### Quick Selection Guide
 Select the architecture pattern as your template that best fits your use case.
-For example, when using the `VLLM` inference backend:
+For example, when using the `vLLM` backend:
-- **Development / Testing**
+- **Development / Testing**: Use [`agg.yaml`](../../../components/backends/vllm/deploy/agg.yaml) as the base configuration.
-  Use [`agg.yaml`](/components/backends/vllm/deploy/agg.yaml) as the base configuration.
-- **Production with Load Balancing**
+- **Production with Load Balancing**: Use [`agg_router.yaml`](../../../components/backends/vllm/deploy/agg_router.yaml) to enable scalable, load-balanced inference.
-  Use [`agg_router.yaml`](/components/backends/vllm/deploy/agg_router.yaml) to enable scalable, load-balanced inference.
-- **High Performance / Disaggregated Deployment**
+- **High Performance / Disaggregated Deployment**: Use [`disagg_router.yaml`](../../../components/backends/vllm/deploy/disagg_router.yaml) for maximum throughput and modular scalability.
-  Use [`disagg_router.yaml`](/components/backends/vllm/deploy/disagg_router.yaml) for maximum throughput and modular scalability.
 ## Step 2: Customize the Template
@@ -90,7 +144,7 @@ Consult the corresponding sh file. Each of the python commands to launch a compo
 The front end is launched with "python3 -m dynamo.frontend [--http-port 8000] [--router-mode kv]"
 Each worker will launch `python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flags `command.
-If you are a Dynamo contributor the [dynamo run guide](/docs/reference/cli.md) for details on how to run this command.
+If you are a Dynamo contributor the [dynamo run guide](../../reference/cli.md) for details on how to run this command.
 ## Step 3: Key Customization Points

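Review note: the "Quick Selection Guide" added to `create_deployment.md` can be read as a small decision rule. The sketch below encodes one possible reading of it. The 70B threshold and the three template file names come from the text above; the function name, its parameters, and the order in which the conditions are checked are assumptions made for illustration.

```python
def choose_template(
    model_params_b: float,
    needs_horizontal_scaling: bool = False,
    needs_max_throughput: bool = False,
) -> str:
    """Pick a vLLM deployment template from rough workload traits.

    Hypothetical helper: it encodes one reading of the selection guide
    and is not an official recommendation.
    """
    # Large models (70B+ parameters) or maximum-throughput goals
    # point to disaggregated serving.
    if needs_max_throughput or model_params_b >= 70:
        return "disagg_router.yaml"
    # Smaller models that still need replicas and load balancing.
    if needs_horizontal_scaling:
        return "agg_router.yaml"
    # Development, testing, and low to moderate traffic.
    return "agg.yaml"


print(choose_template(8))
print(choose_template(8, needs_horizontal_scaling=True))
print(choose_template(70))
```

Treat the output as a starting template only; the tradeoffs listed for each pattern still need to be weighed per deployment.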
--- a/docs/kubernetes/minikube.md
+++ b/docs/kubernetes/minikube.md
@@ -58,5 +58,5 @@ kubectl get storageclass
 ## Next Steps
-Once your local environment is set up, you can proceed with the [Dynamo Kubernetes Platform installation guide](./installation_guide.md) to deploy the platform to your local cluster.
+Once your local environment is set up, you can proceed with the [Dynamo Kubernetes Platform installation guide](../installation_guide.md) to deploy the platform to your local cluster.
--- a/docs/kubernetes/multinode-deployment.md
+++ b/docs/kubernetes/multinode-deployment.md
--- a/docs/kubernetes/dynamo_operator.md
+++ b/docs/kubernetes/dynamo_operator.md
@@ -23,11 +23,57 @@ Dynamo operator is a Kubernetes operator that simplifies the deployment, configu
 For the complete technical API reference for Dynamo Custom Resource Definitions, see:
-**📖 [Dynamo CRD API Reference](/docs/kubernetes/api_reference.md)**
+**📖 [Dynamo CRD API Reference](./api_reference.md)**
 ## Installation
-[See installation steps](/docs/kubernetes/installation_guide.md#overview)
+### Quick Install with Helm
+```bash
+# Set environment
+export NAMESPACE=dynamo-system
+export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
+# Install Platform (includes operator)
+helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
+helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
+```
+For namespace-restricted installations (shared clusters):
+```bash
+helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
+  --namespace ${NAMESPACE} \
+  --create-namespace \
+  --set dynamo-operator.namespaceRestriction.enabled=true
+```
+### Building from Source
+```bash
+# Set environment
+export NAMESPACE=dynamo-system
+export DOCKER_SERVER=your-registry.com/  # your container registry
+export IMAGE_TAG=latest
+# Build operator image
+cd deploy/cloud/operator
+docker build -t $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG .
+docker push $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG
+cd -
+# Install CRDs
+cd deploy/cloud/helm
+helm install dynamo-crds ./crds/ --namespace default
+# Install platform with custom operator image
+helm install dynamo-platform ./platform/ \
+  --namespace ${NAMESPACE} \
+  --create-namespace \
+  --set "dynamo-operator.controllerManager.manager.image.repository=${DOCKER_SERVER}/dynamo-operator" \
+  --set "dynamo-operator.controllerManager.manager.image.tag=${IMAGE_TAG}"
+```
+For detailed installation options, see the [Installation Guide](./installation_guide.md)
 ## Development

--- a/docs/kubernetes/fluxcd.md
+++ b/docs/kubernetes/fluxcd.md
 # GitOps Deployment with FluxCD
-This section describes how to use FluxCD for GitOps-based deployment of Dynamo inference graphs. GitOps enables you to manage your Dynamo deployments declaratively using Git as the source of truth. We'll use the [aggregated vLLM example](/docs/backends/vllm/README.md) to demonstrate the workflow.
+This section describes how to use FluxCD for GitOps-based deployment of Dynamo inference graphs. GitOps enables you to manage your Dynamo deployments declaratively using Git as the source of truth. We'll use the [aggregated vLLM example](../backends/vllm/README.md) to demonstrate the workflow.
 ## Prerequisites
- A Kubernetes cluster with [Dynamo Cloud](/docs/kubernetes/installation_guide.md) installed
+- A Kubernetes cluster with [Dynamo Cloud](./installation_guide.md) installed
 - [FluxCD](https://fluxcd.io/flux/installation/) installed in your cluster
 - A Git repository to store your deployment configurations
@@ -18,7 +18,7 @@ The GitOps workflow for Dynamo deployments consists of three main steps:
 ## Step 1: Build and Push Dynamo Cloud Operator
-First, follow to [See Install Dynamo Cloud](/docs/kubernetes/installation_guide.md).
+First, follow the steps in [Install Dynamo Cloud](./installation_guide.md).
 ## Step 2: Create Initial Deployment

--- a/docs/kubernetes/gke_setup.md
+++ b/docs/kubernetes/gke_setup.md
-# GKE Workload Identity and Artifact Registry Setup Guide
-This guide explains how to set up Workload Identity in GKE and configure access to Google Artifact Registry.
-## Prerequisites
- Google Cloud SDK installed
- Access to a GKE cluster
- Required permissions to create and manage service accounts
-## Project Setup
-Set your project:
-```bash
-export NAMESPACE=your-k8s-namespace
-export RELEASE=your-helm-release-name
-export PROJECT=$(gcloud config get-value project)
-# set the cluster related info (you can list cluster using gcloud container clusters list)
-export CLUSTER_NAME=your-cluster-name
-export CLUSTER_REGION=$(gcloud container clusters list --filter="name=${CLUSTER_NAME}" --format="get(location)")
-gcloud config set project ${PROJECT}
-# Retrieve the Workload Identifier Namespace associated with your cluster:
-export CLUSTER_WIN=$(gcloud container clusters describe ${CLUSTER_NAME} \
-  --region=${CLUSTER_REGION} \
-  --format="value(workloadIdentityConfig.workloadPool)")
-```
-```{important}
-Make sure Workload Identity is enabled in your cluster!
-```
-## Service Account Creation and Configuration
-1. Create a service account for Workload Identity:
-Go to the GCP console and create a new service account (or reuse an existing one)
-```bash
-gcloud iam service-accounts create workload-identity-sa\
-    --display-name="workload identity service account" \
-    --description="Service account to use for Workload Identity in GKE"
-export SA=workload-identity-sa@${PROJECT}.iam.gserviceaccount.com
-```
-2. Configure Workload Identity bindings for Kubernetes service accounts:
-```bash
-gcloud iam service-accounts add-iam-policy-binding \
-    ${SA} \
-    --role roles/iam.workloadIdentityUser \
-    --member "serviceAccount:${CLUSTER_WIN}[${NAMESPACE}/${RELEASE}-dynamo-operator-controller-manager]"
-gcloud iam service-accounts add-iam-policy-binding \
-    ${SA} \
-    --role roles/iam.workloadIdentityUser \
-    --member "serviceAccount:${CLUSTER_WIN}[${NAMESPACE}/${RELEASE}-dynamo-operator-image-builder]"
-gcloud iam service-accounts add-iam-policy-binding \
-    ${SA} \
-    --role roles/iam.workloadIdentityUser \
-    --member "serviceAccount:${CLUSTER_WIN}[${NAMESPACE}/${RELEASE}-dynamo-operator-component]"
-```
-## Artifact Registry Access
-### Option 1: Project-Level Access
-Grant read and write access at the project level:
-```bash
-# Grant reader role
-gcloud projects add-iam-policy-binding ${PROJECT} \
-  --member="serviceAccount:${SA}" \
-  --role="roles/artifactregistry.reader"
-# Grant writer role
-gcloud projects add-iam-policy-binding ${PROJECT} \
-  --member="serviceAccount:${SA}" \
-  --role="roles/artifactregistry.writer"
-```
-### Option 2: Repository-Level Access
-Grant access to specific repository:
-```bash
-gcloud artifacts repositories add-iam-policy-binding your-artifact-repository \
-  --location=${CLUSTER_REGION} \
-  --project=${PROJECT} \
-  --member="serviceAccount:${SA}" \
-  --role="roles/artifactregistry.reader"
-```
-## GKE Node Access to Artifact Registry
-This is needed to make sure pods can pull images from Artifact Registry without needing to specify an imagePullSecret
-### For GKE Autopilot
-```bash
-# Get project number
-export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT} --format='value(projectNumber)')
-# Grant access to the default compute service account
-gcloud projects add-iam-policy-binding ${PROJECT} \
-  --member="serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
-  --role="roles/artifactregistry.reader"
-```
-### For Standard GKE
-```bash
-# Get node service account
-export NODE_SERVICE_ACCOUNT=$(gcloud container clusters describe ${CLUSTER_NAME} \
-  --region ${CLUSTER_REGION} \
-  --format="get(nodeConfig.serviceAccount)")
-# Grant access to node service account
-gcloud projects add-iam-policy-binding ${PROJECT} \
-  --member="serviceAccount:${NODE_SERVICE_ACCOUNT}" \
-  --role="roles/artifactregistry.reader"
-```
-## Adding annotations to enable Workload Identity
-This is an example of values.yaml used to deploy Dynamo Cloud using custom GCP annotations to enable Workload Identity.
-```yaml
-dynamo-operator:
-  ...
-  controllerManager:
-    serviceAccount:
-      create: true
-      annotations:
-        iam.gke.io/gcp-service-account: your-sa@your-gcp-project.iam.gserviceaccount.com
-  ...
-  dynamo:
-    components:
-      serviceAccount:
-        annotations:
-          iam.gke.io/gcp-service-account: your-sa@your-gcp-project.iam.gserviceaccount.com
-    ...
-....
-```
-You can use it during helm installation:
-```bash
-helm upgrade --install ${RELEASE} platform/ -f values.yaml --namespace ${NAMESPACE}
-```
-## Important Notes
-1. **Prerequisites for Image Pulling**:
-   - Workload Identity must be enabled on your GKE cluster
-   - GKE nodes' service account must have the `artifactregistry.reader` role
-2. **Troubleshooting**:
-   - If pods can't pull images, verify both Workload Identity and node service account configurations
-   - Check service account annotations on Kubernetes service accounts
-   - Verify IAM bindings are correctly set up
-## References
- [GKE Workload Identity Documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity)
- [Artifact Registry Authentication](https://cloud.google.com/artifact-registry/docs/docker/authentication)
- [IAM Roles for Artifact Registry](https://cloud.google.com/artifact-registry/docs/access-control)
\ No newline at end of file