docs: clean up toctree navigation and add disaggregated serving guide (#6024)

Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

07db5895 · dagil-nvidia · GitHub · 219e5c45
Unverified Commit 07db5895 authored Feb 06, 2026 by dagil-nvidia Committed by GitHub Feb 06, 2026
20 changed files
--- a/docs/_sections/fault_tolerance.rst
+++ b/docs/_sections/fault_tolerance.rst
-Fault Tolerance
-===============
-
-.. toctree::
-   :maxdepth: 1
-
-   Overview <../fault_tolerance/README.md>
-   Request Migration <../fault_tolerance/request_migration.md>
-   Request Cancellation <../fault_tolerance/request_cancellation.md>
-   Graceful Shutdown <../fault_tolerance/graceful_shutdown.md>
-   Request Rejection <../fault_tolerance/request_rejection.md>
-   Testing <../fault_tolerance/testing.md>
--- a/docs/_sections/k8s_deployment.rst
+++ b/docs/_sections/k8s_deployment.rst
-Deployment Guide
-================
-
-.. toctree::
-   :hidden:
-
-   Kubernetes Quickstart <../kubernetes/README>
-   Detailed Installation Guide <../kubernetes/installation_guide>
-   Dynamo Operator <../kubernetes/dynamo_operator>
-   Service Discovery <../kubernetes/service_discovery>
-   Webhooks <../kubernetes/webhooks>
-   Minikube Setup <../kubernetes/deployment/minikube>
-   Managing Models with DynamoModel <../kubernetes/deployment/dynamomodel-guide>
-   Autoscaling <../kubernetes/autoscaling>
-   Checkpointing <../kubernetes/chrek/README>
--- a/docs/_sections/k8s_multinode.rst
+++ b/docs/_sections/k8s_multinode.rst
-Multinode
-=========
-
-.. toctree::
-   :hidden:
-
-   Multinode Deployments <../kubernetes/deployment/multinode-deployment>
-   Grove <../kubernetes/grove>
--- a/docs/_sections/k8s_observability.rst
+++ b/docs/_sections/k8s_observability.rst
-Observability
-=============
-
-.. toctree::
-   :hidden:
-
-   Metrics <../kubernetes/observability/metrics>
-   Logging <../kubernetes/observability/logging>
-   Operator Metrics <../kubernetes/observability/operator-metrics>
--- a/docs/_sections/observability.rst
+++ b/docs/_sections/observability.rst
-Observability
-=============
-
-.. toctree::
-   :hidden:
-
-   Overview <../observability/README>
-   Prometheus + Grafana Setup <../observability/prometheus-grafana>
-   Metrics <../observability/metrics>
-   Metrics Developer Guide <../observability/metrics-developer-guide>
-   Health Checks <../observability/health-checks>
-   Tracing <../observability/tracing>
-   Logging <../observability/logging>
--- a/docs/components/frontend/README.md
+++ b/docs/components/frontend/README.md
@@ -79,3 +79,9 @@ See the [Frontend Guide](frontend_guide.md) for full configuration options.
 |----------|-------------|
 | [Frontend Guide](frontend_guide.md) | KServe gRPC configuration and integration |
 | [Router Documentation](../router/README.md) | KV-aware routing configuration |
+
+```{toctree}
+:hidden:
+
+frontend_guide
+```
--- a/docs/components/kvbm/README.md
+++ b/docs/components/kvbm/README.md
@@ -74,3 +74,9 @@ KVBM has three primary logical layers:
 - **[FlexKV Integration](../../integrations/flexkv_integration.md)** — Use FlexKV for KV cache management
 - **[SGLang HiCache](../../integrations/sglang_hicache.md)** — Enable SGLang's hierarchical cache with NIXL
 - **[NIXL Documentation](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)** — NIXL communication library details
+
+```{toctree}
+:hidden:
+
+kvbm_guide
+```
--- a/docs/components/planner/README.md
+++ b/docs/components/planner/README.md
@@ -134,3 +134,10 @@ The planner queries the frontend's `/metrics` endpoint via Prometheus. Required
 - Request count and duration
 - TTFT and ITL distributions
 - Input/output sequence lengths
+
+```{toctree}
+:hidden:
+
+planner_guide
+planner_examples
+```
--- a/docs/components/router/README.md
+++ b/docs/components/router/README.md
@@ -101,3 +101,10 @@ For basic model registration without KV routing, use `--router-mode round-robin`
 - **[Router Guide](router_guide.md)**: Deep dive into KV cache routing, configuration, disaggregated serving, and tuning
 - **[Router Examples](router_examples.md)**: Python API usage, K8s examples, and custom routing patterns
 - **[Router Design](../../design_docs/router_design.md)**: Architecture details, algorithms, and event transport modes
+
+```{toctree}
+:hidden:
+
+router_guide
+router_examples
+```
--- a/docs/diagrams/arch_comparison.d2
+++ b/docs/diagrams/arch_comparison.d2
+direction: right
+
+aggregated: Aggregated {
+  width: 600
+  height: 450
+
+  frontend: Frontend {
+    width: 180
+    height: 60
+    style.font-size: 20
+  }
+  router: Router {
+    width: 180
+    height: 60
+    style.font-size: 20
+  }
+  w1: "W1 (TP2)" {
+    width: 180
+    height: 60
+    style.font-size: 20
+  }
+  w2: "W2 (TP2)" {
+    width: 180
+    height: 60
+    style.font-size: 20
+  }
+  w3: "W3 (TP2)" {
+    width: 180
+    height: 60
+    style.font-size: 20
+  }
+  w4: "W4 (TP2)" {
+    width: 180
+    height: 60
+    style.font-size: 20
+  }
+
+  frontend -> router
+  router -> w1
+  router -> w2
+  router -> w3
+  router -> w4
+
+  note: |md
+    Each worker handles both prefill and decode.
+  |
+  note.style.font-size: 18
+}
+
+disaggregated: Disaggregated {
+  width: 600
+  height: 450
+
+  frontend: Frontend {
+    width: 180
+    height: 60
+    style.font-size: 20
+  }
+  router: Router {
+    width: 180
+    height: 60
+    style.font-size: 20
+  }
+  p1: "Prefill 1 (TP2)" {
+    width: 220
+    height: 60
+    style.font-size: 20
+  }
+  p2: "Prefill 2 (TP2)" {
+    width: 220
+    height: 60
+    style.font-size: 20
+  }
+  decode: "Decode (TP4)" {
+    width: 220
+    height: 60
+    style.font-size: 20
+  }
+
+  frontend -> router
+  router -> p1
+  router -> p2
+  p1 -> decode: "KV Cache via RDMA"
+  p2 -> decode: "KV Cache via RDMA"
+
+  note: |md
+    Prefill and decode on separate workers.
+  |
+  note.style.font-size: 18
+}
+
+aggregated.style.font-size: 24
+disaggregated.style.font-size: 24
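Reviewer note on `arch_comparison.d2`: the two layouts in the diagram appear to use the same GPU budget. The sketch below checks that arithmetic, assuming a `TPn` label means the worker spans `n` GPUs; the dictionary names and the helper are illustrative and not part of the commit.

```python
# Tensor-parallel degree per worker, copied from the diagram labels.
# Assumption: "TPn" means the worker occupies n GPUs.
aggregated = {"W1": 2, "W2": 2, "W3": 2, "W4": 2}
disaggregated = {"Prefill 1": 2, "Prefill 2": 2, "Decode": 4}


def total_gpus(workers: dict[str, int]) -> int:
    """Sum the GPUs used by every worker in one layout."""
    return sum(workers.values())


agg_gpus = total_gpus(aggregated)
disagg_gpus = total_gpus(disaggregated)
print(f"aggregated: {agg_gpus} GPUs, disaggregated: {disagg_gpus} GPUs")
```

Under that assumption both layouts total eight GPUs, so the diagram compares how the same hardware is partitioned rather than two different cluster sizes.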
--- a/docs/diagrams/decision_flowchart.d2
+++ b/docs/diagrams/decision_flowchart.d2
+direction: down
+
+q1: "AIC shows disagg > agg\nthroughput?" {
+  shape: diamond
+}
+
+q2: "RDMA available\nin cluster?" {
+  shape: diamond
+}
+
+q3: "ISL/OSL ratio > 8:1?" {
+  shape: diamond
+}
+
+q4: "Disagg > 20%\nfaster?" {
+  shape: diamond
+}
+
+agg: "AGGREGATED\nSimpler, no RDMA needed" {
+  shape: rectangle
+  width: 380
+  height: 80
+}
+
+disagg: "DISAGGREGATED\nHigher throughput, needs RDMA" {
+  shape: rectangle
+  width: 380
+  height: 80
+}
+
+q1 -> q2: Yes
+q1 -> agg: No
+q2 -> q3: Yes
+q2 -> agg: No
+q3 -> disagg: Yes
+q3 -> q4: No
+q4 -> disagg: Yes
+q4 -> agg: No
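Reviewer note on `decision_flowchart.d2`: the four diamonds and their Yes/No edges can be read as a short chain of checks. The following is a minimal sketch of that reading, not code from the repository. The function name, the parameter names, and the choice to express "Disagg > 20% faster" as a fractional speedup are all assumptions made for illustration.

```python
def choose_serving_mode(
    aic_prefers_disagg: bool,   # q1: AIC shows disagg > agg throughput?
    rdma_available: bool,       # q2: RDMA available in cluster?
    isl_osl_ratio: float,       # q3: input/output sequence length ratio
    disagg_speedup: float,      # q4: assumed to be a fraction, 0.25 == 25% faster
) -> str:
    """Walk the flowchart edges in order and return the chosen mode."""
    if not aic_prefers_disagg:
        return "aggregated"      # q1 -> agg: No
    if not rdma_available:
        return "aggregated"      # q2 -> agg: No
    if isl_osl_ratio > 8.0:
        return "disaggregated"   # q3 -> disagg: Yes
    if disagg_speedup > 0.20:
        return "disaggregated"   # q4 -> disagg: Yes
    return "aggregated"          # q4 -> agg: No


# ISL 4000 / OSL 500 is exactly 8:1, which does not satisfy "> 8:1",
# so the outcome then depends on the speedup check.
print(choose_serving_mode(True, True, 4000 / 500, 0.25))
```

Note that both comparisons are strict, matching the `>` in the diagram labels, so a ratio of exactly 8:1 falls through to the last question.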
--- a/docs/diagrams/e2e_workflow.d2
+++ b/docs/diagrams/e2e_workflow.d2
+direction: right
+
+install: {
+  label: "Install"
+  shape: rectangle
+  install_detail: |md
+    pip3 install
+    aiconfigurator
+  |
+}
+
+configure: {
+  label: "Configure"
+  shape: rectangle
+  configure_detail: |md
+    Model, GPUs,
+    SLA targets
+  |
+}
+
+compare: {
+  label: "Compare"
+  shape: rectangle
+  compare_detail: |md
+    Agg vs disagg
+    rankings
+  |
+}
+
+deploy: {
+  label: "Deploy"
+  shape: rectangle
+  deploy_detail: |md
+    Apply DGD
+    manifest
+  |
+}
+
+validate: {
+  label: "Validate"
+  shape: rectangle
+  validate_detail: |md
+    Benchmark
+    with AIPerf
+  |
+}
+
+install -> configure
+configure -> compare
+compare -> deploy
+deploy -> validate
--- a/docs/diagrams/param_mapping.d2
+++ b/docs/diagrams/param_mapping.d2
+direction: right
+
+aic: "AIC Output" {
+  shape: rectangle
+  c1: "concurrency: 56 (=14x4)" {
+    width: 280
+  }
+  c2: "ISL: 4000, OSL: 500" {
+    width: 280
+  }
+  c3: "Model: Qwen3-32B-FP8" {
+    width: 280
+  }
+  c4: "concurrency x ~14" {
+    width: 280
+  }
+  c5: "(best practice)" {
+    width: 280
+  }
+}
+
+aiperf: "AIPerf Argument" {
+  shape: rectangle
+  a1: "--concurrency 56" {
+    width: 320
+  }
+  a2: "--isl 4000 --osl 500" {
+    width: 320
+  }
+  a3: "-m Qwen/Qwen3-32B-FP8" {
+    width: 320
+  }
+  a4: "--num-requests 800" {
+    width: 320
+  }
+  a5: "--extra-inputs \"ignore_eos:true\"" {
+    width: 320
+  }
+}
+
+aic.c1 -> aiperf.a1
+aic.c2 -> aiperf.a2
+aic.c3 -> aiperf.a3
+aic.c4 -> aiperf.a4
+aic.c5 -> aiperf.a5
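Reviewer note on `param_mapping.d2`: the mapping from AIC output to AIPerf arguments can be sketched as a small helper. Only the flags shown in the diagram are used. The function name, the `requests_per_concurrency` parameter, and rounding up to a multiple of 100 are assumptions. The rounding is one guess at how 56 x ~14 becomes 800; the diagram does not state the rule.

```python
import math


def build_aiperf_args(
    concurrency: int,
    isl: int,
    osl: int,
    model: str,
    requests_per_concurrency: int = 14,  # "concurrency x ~14" in the diagram
    round_to: int = 100,                 # hypothetical rounding step
) -> list[str]:
    """Build the argument list shown on the AIPerf side of the diagram."""
    raw_requests = concurrency * requests_per_concurrency
    # Round up so the request count is never below the raw product.
    num_requests = math.ceil(raw_requests / round_to) * round_to
    return [
        "--concurrency", str(concurrency),
        "--isl", str(isl),
        "--osl", str(osl),
        "-m", model,
        "--num-requests", str(num_requests),
        "--extra-inputs", "ignore_eos:true",
    ]


args = build_aiperf_args(56, 4000, 500, "Qwen/Qwen3-32B-FP8")
print(" ".join(args))
```

The helper returns only the arguments and deliberately leaves out the executable and any subcommand, since the diagram does not show them.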
--- a/docs/fault_tolerance/README.md
+++ b/docs/fault_tolerance/README.md
@@ -140,3 +140,13 @@ See [Fault Tolerance Testing](testing.md) for details.
 - [Observability](../observability/README.md) - Metrics and monitoring
 - [Distributed Runtime](../design_docs/distributed_runtime.md) - Service discovery architecture
 - [Event Plane](../design_docs/event_plane.md) - etcd and NATS coordination
+
+```{toctree}
+:hidden:
+
+Request Migration <request_migration>
+Request Cancellation <request_cancellation>
+Graceful Shutdown <graceful_shutdown>
+Request Rejection <request_rejection>
+Testing <testing>
+```
--- a/docs/features/disaggregated_serving/README.md
+++ b/docs/features/disaggregated_serving/README.md
--- a/docs/hidden_toctree.rst
+++ b/docs/hidden_toctree.rst
@@ -12,6 +12,8 @@
   :hidden:

   development/runtime-guide.md
+   development/jail_stream.md
+
   api/nixl_connect/connector.md
   api/nixl_connect/descriptor.md
   api/nixl_connect/device.md
@@ -26,48 +28,20 @@

   kubernetes/api_reference.md
   kubernetes/deployment/create_deployment.md
-   kubernetes/deployment/dynamomodel-guide.md
-   kubernetes/chrek/README.md
   kubernetes/chrek/dynamo.md
   kubernetes/chrek/standalone.md
-
   kubernetes/fluxcd.md
-   kubernetes/grove.md
   kubernetes/model_caching_with_fluid.md
-   kubernetes/README.md
-   reference/cli.md
-   observability/metrics.md
-   integrations/kv_events_custom_engines.md
-   agents/tool-calling.md
-   development/jail_stream.md

-   components/planner/README.md
-   components/planner/planner_guide.md
-   components/planner/planner_examples.md
-   components/kvbm/README.md
-   components/kvbm/kvbm_guide.md
-   components/router/README.md
-   components/router/router_guide.md
-   components/router/router_examples.md
-   components/frontend/frontend_guide.md
-   design_docs/kvbm_design.md
-   integrations/flexkv_integration.md
-   integrations/sglang_hicache.md
-   fault_tolerance/README.md
-   fault_tolerance/request_migration.md
-   fault_tolerance/request_cancellation.md
-   fault_tolerance/graceful_shutdown.md
-   fault_tolerance/request_rejection.md
-   fault_tolerance/testing.md
-   design_docs/request_plane.md
-   design_docs/event_plane.md
+   reference/cli.md
+   reference/glossary.md
+   performance/tuning.md

-   backends/trtllm/multinode/multinode-examples.md
-   backends/trtllm/llama4_plus_eagle.md
-   backends/trtllm/kv-cache-transfer.md
-   backends/trtllm/gemma3_sliding_window_attention.md
-   backends/trtllm/gpt-oss.md
-   backends/trtllm/prometheus.md
+   backends/vllm/deepseek-r1.md
+   backends/vllm/gpt-oss.md
+   backends/vllm/multi-node.md
+   backends/vllm/prometheus.md
+   backends/vllm/prompt-embeddings.md

   backends/sglang/expert-distribution-eplb.md
   backends/sglang/gpt-oss.md
@@ -76,25 +50,19 @@
   backends/sglang/sglang-disaggregation.md
   backends/sglang/prometheus.md

-   examples/README.md
-   examples/runtime/hello_world/README.md
-
-   design_docs/distributed_runtime.md
-   design_docs/dynamo_flow.md
-
-   backends/vllm/deepseek-r1.md
-   backends/vllm/gpt-oss.md
-   integrations/lmcache_integration.md
-   backends/vllm/multi-node.md
-   backends/vllm/prometheus.md
-   backends/vllm/prompt-embeddings.md
+   backends/trtllm/multinode/multinode-examples.md
+   backends/trtllm/llama4_plus_eagle.md
+   backends/trtllm/kv-cache-transfer.md
+   backends/trtllm/gemma3_sliding_window_attention.md
+   backends/trtllm/gpt-oss.md
+   backends/trtllm/prometheus.md

   features/speculative_decoding/README.md
   features/speculative_decoding/speculative_decoding_vllm.md

+   examples/README.md
+   examples/runtime/hello_world/README.md
+
   benchmarks/kv-router-ab-testing.md

   mocker/mocker.md
-
-..   TODO: architecture/distributed_runtime.md and architecture/dynamo_flow.md
-     have some outdated names/references and need a refresh.
--- a/docs/images/arch_comparison.svg
+++ b/docs/images/arch_comparison.svg
--- a/docs/images/decision_flowchart.svg
+++ b/docs/images/decision_flowchart.svg
--- a/docs/images/e2e_workflow.svg
+++ b/docs/images/e2e_workflow.svg
--- a/docs/images/param_mapping.svg
+++ b/docs/images/param_mapping.svg