docs: fix broken links and unconverted callout in dev docs (#6365)

Signed-off-by: Dan Gil <dagil@nvidia.com>
Signed-off-by: dagil-nvidia <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

Commit 78436fbf (unverified), authored Feb 25, 2026 by dagil-nvidia; committed by GitHub Feb 26, 2026
4 changed files
--- a/docs/pages/backends/vllm/gpt-oss.md
+++ b/docs/pages/backends/vllm/gpt-oss.md
@@ -17,7 +17,7 @@ This deployment uses disaggregated serving in vLLM where:

 ## Prerequisites

-This guide assumes readers already knows how to deploy Dynamo disaggregated serving with vLLM as illustrated in [README.md](README.md)
+This guide assumes readers already knows how to deploy Dynamo disaggregated serving with vLLM as illustrated in [the vLLM Backend guide](README.md)

 ## Instructions


--- a/docs/pages/components/router/router-guide.md
+++ b/docs/pages/components/router/router-guide.md
@@ -180,25 +180,28 @@ The main KV-aware routing arguments (frontend uses the same `--router-*` flag na

 - `--router-event-threads`: Number of event processing threads for the KV indexer. When set to 1 (default), the router uses a single-threaded radix tree with channel-based event processing, which supports TTL-based expiration and pruning. When set to a value greater than 1, the router uses a concurrent radix tree with a thread pool of the specified size for higher event throughput. Note: the concurrent indexer does not support TTL/pruning (`--router-ttl-secs`, `--router-max-tree-size`, `--router-prune-target-ratio` are ignored when `--router-event-threads > 1`). Can be set via `DYN_ROUTER_EVENT_THREADS` env var. For details on the underlying index data structures (`RadixTree`, `ConcurrentRadixTree`, `PositionalIndexer`) and their concurrency model (inline reads, sticky-routed writes via thread pool), see the [KV Router Index documentation](../../../../lib/kv-router/README.md).

->[!Note]
-> **State persistence** depends on the event transport mode:
-> - **NATS Core / Event Plane mode** (default): State persists on workers—router rebuilds state by querying workers on startup. This is the default when workers have `local_indexer` enabled (which is the default). Works with both NATS Core and ZMQ event planes.
-> - **JetStream mode** (`--router-durable-kv-events` on **both** frontend **and** workers): State persists across router restarts via JetStream and NATS object store snapshots.
-> - **No KV events** (`--no-router-kv-events`): State persistence is not supported.
->
-> **Request plane is independent of KV event transport.**
-> The request plane (`DYN_REQUEST_PLANE` / `--request-plane`) controls how requests reach workers (TCP/HTTP/NATS), while KV events travel over a separate path. KV events use NATS in JetStream or NATS Core modes, or ZMQ when `--event-plane zmq` is set. With `--event-plane zmq` and `--discovery-backend file` or `mem`, the router can run entirely without etcd or NATS. When using a NATS-based event plane (the default), NATS is initialized automatically; set `NATS_SERVER=nats://...` to override the default `localhost:4222`. Use `--no-router-kv-events` to disable KV event transport entirely.
->
-> When `--router-kv-overlap-score-weight` is set to 0, no KVIndexer is created and prefix matching is disabled (pure load balancing). When `--no-router-kv-events` is set, a KVIndexer is still created but no event subscriber is launched to consume KV events from workers. Instead, the router predicts cache state based on its own routing decisions with TTL-based expiration and pruning.
->
-> **Backend Configuration:** When using `--no-router-kv-events`, no additional backend flags are needed — SGLang and TRT-LLM disable KV events by default. For vLLM, KV events are currently enabled by default when prefix caching is active (deprecated — will change in a future release). Use `--kv-events-config` explicitly to control behavior:
-> - **vLLM**: Use `--kv-events-config '{"enable_kv_cache_events": false}'` to disable, or omit (auto-enabled, deprecated)
-> - **SGLang**: Do not use `--kv-events-config`
-> - **TRT-LLM**: Do not use `--publish-events-and-metrics`
->
-> The cli args `--router-ttl-secs`, `--router-max-tree-size`, and `--router-prune-target-ratio` control local cache management when the router operates without receiving events from workers. When workers are configured to publish KV events (via `--kv-events-config`), the router relies on worker-side eviction events and these parameters are ignored.
->
-> **Queue threshold vs. busy rejection thresholds:** `--router-queue-threshold` and the busy thresholds (`--active-decode-blocks-threshold`, `--active-prefill-tokens-threshold`, `--active-prefill-tokens-threshold-frac`) serve different purposes. The busy thresholds **reject** a worker entirely from the candidate set when it exceeds a utilization limit — no traffic is sent until it drops below the threshold. In contrast, `--router-queue-threshold` does not reject workers; it **defers the entire routing decision** until at least one worker has capacity, so the request is routed with the freshest load metrics. The queue also enables priority scheduling via `nvext.agent_hints.latency_sensitivity`.
+<Note>
+
+**State persistence** depends on the event transport mode:
+- **NATS Core / Event Plane mode** (default): State persists on workers—router rebuilds state by querying workers on startup. This is the default when workers have `local_indexer` enabled (which is the default). Works with both NATS Core and ZMQ event planes.
+- **JetStream mode** (`--router-durable-kv-events` on **both** frontend **and** workers): State persists across router restarts via JetStream and NATS object store snapshots.
+- **No KV events** (`--no-router-kv-events`): State persistence is not supported.
+
+**Request plane is independent of KV event transport.**
+The request plane (`DYN_REQUEST_PLANE` / `--request-plane`) controls how requests reach workers (TCP/HTTP/NATS), while KV events travel over a separate path. KV events use NATS in JetStream or NATS Core modes, or ZMQ when `--event-plane zmq` is set. With `--event-plane zmq` and `--discovery-backend file` or `mem`, the router can run entirely without etcd or NATS. When using a NATS-based event plane (the default), NATS is initialized automatically; set `NATS_SERVER=nats://...` to override the default `localhost:4222`. Use `--no-router-kv-events` to disable KV event transport entirely.
+
+When `--router-kv-overlap-score-weight` is set to 0, no KVIndexer is created and prefix matching is disabled (pure load balancing). When `--no-router-kv-events` is set, a KVIndexer is still created but no event subscriber is launched to consume KV events from workers. Instead, the router predicts cache state based on its own routing decisions with TTL-based expiration and pruning.
+
+**Backend Configuration:** When using `--no-router-kv-events`, no additional backend flags are needed — SGLang and TRT-LLM disable KV events by default. For vLLM, KV events are currently enabled by default when prefix caching is active (deprecated — will change in a future release). Use `--kv-events-config` explicitly to control behavior:
+- **vLLM**: Use `--kv-events-config '{"enable_kv_cache_events": false}'` to disable, or omit (auto-enabled, deprecated)
+- **SGLang**: Do not use `--kv-events-config`
+- **TRT-LLM**: Do not use `--publish-events-and-metrics`
+
+The cli args `--router-ttl-secs`, `--router-max-tree-size`, and `--router-prune-target-ratio` control local cache management when the router operates without receiving events from workers. When workers are configured to publish KV events (via `--kv-events-config`), the router relies on worker-side eviction events and these parameters are ignored.
+
+**Queue threshold vs. busy rejection thresholds:** `--router-queue-threshold` and the busy thresholds (`--active-decode-blocks-threshold`, `--active-prefill-tokens-threshold`, `--active-prefill-tokens-threshold-frac`) serve different purposes. The busy thresholds **reject** a worker entirely from the candidate set when it exceeds a utilization limit — no traffic is sent until it drops below the threshold. In contrast, `--router-queue-threshold` does not reject workers; it **defers the entire routing decision** until at least one worker has capacity, so the request is routed with the freshest load metrics. The queue also enables priority scheduling via `nvext.agent_hints.latency_sensitivity`.
+
+</Note>

 To implement KV event publishing for custom inference engines, enabling them to participate in Dynamo's KV cache-aware routing, see [KV Event Publishing for Custom Engines](../../integrations/kv-events-custom-engines.md).

@@ -318,8 +321,7 @@ await register_model(
 await prefill_endpoint.serve_endpoint(prefill_handler.generate)
 ```

-> [!Note]
-> The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. The standalone router (`python -m dynamo.router`) uses `--router-*`-prefixed flags (e.g., `--router-block-size`, `--router-kv-events`). See the [Standalone Router README](../../../../components/src/dynamo/router/README.md) and example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh).
+<Note>The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. The standalone router (`python -m dynamo.router`) uses `--router-*`-prefixed flags (e.g., `--router-block-size`, `--router-kv-events`). See the [Standalone Router README](../../../../components/src/dynamo/router/README.md) and example script: [`examples/backends/sglang/launch/disagg_router.sh`](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/sglang/launch/disagg_router.sh).</Note>

 ### Request Flow

@@ -399,10 +401,13 @@ Persistence behavior depends on which event transport mode is active:
 python -m dynamo.frontend --router-mode kv --http-port 8002 --router-replica-sync
 ```

->[!Note]
-> If you need to start with a fresh state in JetStream mode, you have two options:
-> 1. **Recommended**: Use a different namespace/component (see [Distributed Runtime](../../design-docs/distributed-runtime.md)) which will start a new stream and NATS object store path
-> 2. **Use with caution**: Launch a router with the `--router-reset-states` flag, which will purge the entire stream and radix snapshot. This should only be done when launching the first router replica in a component, as it can bring existing router replicas into an inconsistent state.
+<Note>
+
+If you need to start with a fresh state in JetStream mode, you have two options:
+1. **Recommended**: Use a different namespace/component (see [Distributed Runtime](../../design-docs/distributed-runtime.md)) which will start a new stream and NATS object store path
+2. **Use with caution**: Launch a router with the `--router-reset-states` flag, which will purge the entire stream and radix snapshot. This should only be done when launching the first router replica in a component, as it can bring existing router replicas into an inconsistent state.
+
+</Note>

 ## Dynamic Threshold Configuration


--- a/docs/pages/features/multimodal/multimodal-vllm.md
+++ b/docs/pages/features/multimodal/multimodal-vllm.md
@@ -8,9 +8,10 @@ title: vLLM Multimodal

 This document provides a comprehensive guide for multimodal inference using vLLM backend in Dynamo.

-> [!IMPORTANT]
-> **Security Requirement**: All multimodal workers require the `--enable-multimodal` flag to be explicitly set at startup. This is a security feature to prevent unintended processing of multimodal data from untrusted sources. Workers will fail at startup if multimodal flags (e.g., `--multimodal-worker`, `--multimodal-processor`) are used without `--enable-multimodal`.
-> This flag is analogous to `--enable-mm-embeds` in vllm serve but also extends it to all multimodal content (url, embeddings, b64).
+<Warning>
+**Security Requirement**: All multimodal workers require the `--enable-multimodal` flag to be explicitly set at startup. This is a security feature to prevent unintended processing of multimodal data from untrusted sources. Workers will fail at startup if multimodal flags (e.g., `--multimodal-worker`, `--multimodal-processor`) are used without `--enable-multimodal`.
+This flag is analogous to `--enable-mm-embeds` in vllm serve but also extends it to all multimodal content (url, embeddings, b64).
+</Warning>

 ## Support Matrix

@@ -158,7 +159,9 @@ cd $DYNAMO_HOME/examples/backends/vllm
 bash launch/disagg_multimodal_epd.sh --model llava-hf/llava-1.5-7b-hf
 ```

-> [!NOTE] Disaggregation is currently only confirmed to work with LLaVA. Qwen2.5-VL is not confirmed to be supported.
+<Note>
+Disaggregation is currently only confirmed to work with LLaVA. Qwen2.5-VL is not confirmed to be supported.
+</Note>

 ## Llama 4 Serving
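
The callout changes in the hunks above follow one pattern: a GitHub-style alert block (`> [!Note]`, `> [!IMPORTANT]`) becomes a paired component tag (`<Note>`, `<Warning>`). The sketch below shows roughly how such a conversion could be automated. It is not the tooling used for this commit: the function name `convert_callouts` and the `CALLOUT_TAGS` mapping are hypothetical, and only the two alert kinds that appear in this diff are mapped.

```python
import re

# Hypothetical mapping from alert kind to component tag. Only the two kinds
# seen in this diff are covered: Note -> <Note>, Important -> <Warning>.
CALLOUT_TAGS = {"note": "Note", "important": "Warning"}

# Matches "> [!Note]", ">[!NOTE] inline text", and similar header lines.
_HEADER = re.compile(r"^>\s*\[!(\w+)\]\s*(.*)$")


def convert_callouts(markdown: str) -> str:
    """Rewrite blockquote alerts as paired component tags."""
    lines = markdown.split("\n")
    out = []
    i = 0
    while i < len(lines):
        match = _HEADER.match(lines[i])
        tag = CALLOUT_TAGS.get(match.group(1).lower()) if match else None
        if tag is None:
            # Not a known alert header: copy the line through unchanged.
            out.append(lines[i])
            i += 1
            continue
        # Text that follows the marker on the header line starts the body.
        body = [match.group(2)] if match.group(2) else []
        i += 1
        # Consume the quoted continuation lines, dropping the "> " prefix.
        while i < len(lines) and lines[i].startswith(">"):
            text = lines[i][1:]
            body.append(text[1:] if text.startswith(" ") else text)
            i += 1
        # Blank lines inside the tags let the body parse as Markdown.
        out.extend([f"<{tag}>", "", *body, "", f"</{tag}>"])
    return "\n".join(out)
```

Unmapped alert kinds (for example `> [!TIP]`) are left untouched, so a converter like this would not silently drop callouts it does not understand.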


--- a/docs/pages/reference/feature-matrix.md
+++ b/docs/pages/reference/feature-matrix.md
@@ -108,28 +108,27 @@ TensorRT-LLM delivers maximum inference performance and optimization, with full

 ---

-## Source References

 {/* Backend READMEs */}
-[vllm-readme]: docs/backends/vllm/README.md
-[sglang-readme]: docs/backends/sglang/README.md
-[trtllm-readme]: docs/backends/trtllm/README.md
+[vllm-readme]: ../backends/vllm
+[sglang-readme]: ../backends/sglang
+[trtllm-readme]: ../backends/trtllm

 {/* Design Docs */}
-[disagg]: docs/design_docs/disagg_serving.md
-[kv-routing]: docs/components/router/router_guide.md
-[planner]: docs/components/planner/README.md
-[kvbm]: docs/components/kvbm/README.md
-[migration]: docs/fault_tolerance/request_migration.md
-[tools]: docs/agents/tool-calling.md
+[disagg]: ../design-docs/disagg-serving
+[kv-routing]: ../components/router/router-guide
+[planner]: ../components/planner
+[kvbm]: ../components/kvbm
+[migration]: ../fault-tolerance/request-migration
+[tools]: ../agents/tool-calling

 {/* Multimodal */}
-[mm]: docs/features/multimodal/README.md
-[mm-vllm]: docs/features/multimodal/multimodal_vllm.md
-[mm-trtllm]: docs/features/multimodal/multimodal_trtllm.md
-[mm-sglang]: docs/features/multimodal/multimodal_sglang.md
+[mm]: ../features/multimodal
+[mm-vllm]: ../features/multimodal/multimodal-vllm
+[mm-trtllm]: ../features/multimodal/multimodal-trtllm
+[mm-sglang]: ../features/multimodal/multimodal-sglang

 {/* Feature-specific */}
-[lora]: docs/kubernetes/deployment/dynamomodel-guide.md
-[vllm-spec]: docs/features/speculative_decoding/speculative_decoding_vllm.md
-[trtllm-eagle]: docs/backends/trtllm/llama4_plus_eagle.md
+[lora]: ../kubernetes/deployment/dynamomodel-guide
+[vllm-spec]: ../features/speculative-decoding/speculative-decoding-vllm
+[trtllm-eagle]: ../backends/trtllm/llama4-plus-eagle
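
The link-definition changes in `feature-matrix.md` all apply the same transformation: drop the `docs/` prefix, drop the `.md` extension, collapse a trailing `README` into its directory, and replace underscores with hyphens. A minimal sketch of that rule follows. The helper name `rewrite_link` is hypothetical, and the fixed `../` prefix assumes the referencing page sits one directory below the docs pages root, as `reference/feature-matrix.md` does.

```python
def rewrite_link(target: str) -> str:
    """Map an old repo-relative docs path to the new page-relative slug."""
    path = target
    # Drop the repo-root "docs/" prefix used by the old link definitions.
    if path.startswith("docs/"):
        path = path[len("docs/"):]
    # Drop the Markdown extension.
    if path.endswith(".md"):
        path = path[: -len(".md")]
    # A README stands for its directory, so link to the directory instead.
    if path.endswith("/README"):
        path = path[: -len("/README")]
    # Assumption: the referencing page is one level below the pages root,
    # and the new slugs use hyphens where the old paths used underscores.
    return "../" + path.replace("_", "-")
```

Applied to the removed lines, this reproduces the added ones; for example, `rewrite_link("docs/design_docs/disagg_serving.md")` yields `../design-docs/disagg-serving`.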