Unverified commit b19de4ed authored by dagil-nvidia, committed by GitHub

docs: cleanup of docs refactor for components, integrations, and features (#6019)


Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
parent 80e7bafd
......@@ -144,7 +144,7 @@ The frontend includes an integrated router for request distribution. Configure r
python -m dynamo.frontend --router-mode kv --http-port 8000
```
See [Router Documentation](../../router/README.md) for routing configuration details.
See [Router Documentation](../router/README.md) for routing configuration details.
### With Backends
......@@ -159,4 +159,4 @@ Backends auto-register with the frontend when they call `register_llm()`. Suppor
| Document | Description |
|----------|-------------|
| [Frontend Overview](README.md) | Quick start and feature matrix |
| [Router Documentation](../../router/README.md) | KV-aware routing configuration |
| [Router Documentation](../router/README.md) | KV-aware routing configuration |
......@@ -53,7 +53,7 @@ Offloading KV cache to CPU or storage is most effective when KV Cache exceeds GP
## Architecture
![KVBM Architecture](../images/kvbm-architecture.png)
![KVBM Architecture](../../images/kvbm-architecture.png)
*High-level layered architecture view of Dynamo KV Block Manager and how it interfaces with different components of the LLM inference ecosystem*
KVBM has three primary logical layers:
......@@ -64,13 +64,13 @@ KVBM has three primary logical layers:
**NIXL Layer** — The bottom layer provides unified support for all data and storage transactions. NIXL enables P2P GPU transfers, RDMA and NVLink remote memory sharing, dynamic block registration and metadata exchange, and provides a plugin interface for storage backends including block memory (GPU HBM, Host DRAM, Remote DRAM, Local SSD), local/remote filesystems, object stores, and cloud storage.
> **Learn more:** See the [KVBM Design Document](kvbm_design.md) for detailed architecture, components, and data flows.
> **Learn more:** See the [KVBM Design Document](../../design_docs/kvbm_design.md) for detailed architecture, components, and data flows.
## Next Steps
- **[KVBM Guide](kvbm_guide.md)** — Installation, configuration, and deployment instructions
- **[KVBM Design](kvbm_design.md)** — Architecture deep dive, components, and data flows
- **[LMCache Integration](../integrations/lmcache_integration.md)** — Use LMCache with Dynamo vLLM backend
- **[FlexKV Integration](../integrations/flexkv_integration.md)** — Use FlexKV for KV cache management
- **[SGLang HiCache](../integrations/sglang_hicache.md)** — Enable SGLang's hierarchical cache with NIXL
- **[KVBM Design](../../design_docs/kvbm_design.md)** — Architecture deep dive, components, and data flows
- **[LMCache Integration](../../integrations/lmcache_integration.md)** — Use LMCache with Dynamo vLLM backend
- **[FlexKV Integration](../../integrations/flexkv_integration.md)** — Use FlexKV for KV cache management
- **[SGLang HiCache](../../integrations/sglang_hicache.md)** — Enable SGLang's hierarchical cache with NIXL
- **[NIXL Documentation](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)** — NIXL communication library details
......@@ -43,11 +43,11 @@ KVBM can be used independently without using the rest of the Dynamo stack:
pip install kvbm
```
See the [support matrix](../reference/support-matrix.md) for version compatibility.
See the [support matrix](../../reference/support-matrix.md) for version compatibility.
### Build from Source
To build KVBM from source, see the detailed instructions in the [KVBM bindings README](../../lib/bindings/kvbm/README.md#build-from-source).
To build KVBM from source, see the detailed instructions in the [KVBM bindings README](../../../lib/bindings/kvbm/README.md#build-from-source).
## Run KVBM in Dynamo with vLLM
......@@ -189,7 +189,7 @@ curl localhost:8000/v1/chat/completions \
}'
```
> **Learn more:** See the [SGLang HiCache Integration Guide](../integrations/sglang_hicache.md) for detailed configuration, deployment examples, and troubleshooting.
> **Learn more:** See the [SGLang HiCache Integration Guide](../../integrations/sglang_hicache.md) for detailed configuration, deployment examples, and troubleshooting.
## Disaggregated Serving with KVBM
......@@ -369,7 +369,7 @@ trtllm-serve Qwen/Qwen3-0.6B --host localhost --port 8000 --backend pytorch --ex
**Solution:** Enable KVBM metrics and check the Grafana dashboard for `Onboard Blocks - Host to Device` and `Onboard Blocks - Disk to Device`. Large numbers of onboarded KV blocks indicate good cache reuse:
![Grafana Example](../images/kvbm_metrics_grafana.png)
![Grafana Example](../../images/kvbm_metrics_grafana.png)
### KVBM Worker Initialization Timeout
......@@ -413,7 +413,7 @@ uv pip install --upgrade --force-reinstall --no-deps /workspace/dist/kvbm*.whl
## See Also
- [KVBM Overview](README.md)
- [KVBM Design](kvbm_design.md)
- [LMCache Integration](../integrations/lmcache_integration.md)
- [FlexKV Integration](../integrations/flexkv_integration.md)
- [SGLang HiCache](../integrations/sglang_hicache.md)
- [KVBM Design](../../design_docs/kvbm_design.md)
- [LMCache Integration](../../integrations/lmcache_integration.md)
- [FlexKV Integration](../../integrations/flexkv_integration.md)
- [SGLang HiCache](../../integrations/sglang_hicache.md)
......@@ -19,7 +19,7 @@ limitations under the License.
The Planner monitors system performance and automatically scales prefill/decode workers to meet latency SLAs. It runs as a component inside the Dynamo inference graph on Kubernetes.
> **New to the Planner?** Start with the [SLA Planner Quick Start Guide](sla_planner_quickstart.md) for a complete workflow including profiling and deployment.
> **New to the Planner?** Start with the [SLA Planner Quick Start Guide](planner_guide.md) for a complete workflow including profiling and deployment.
## Feature Matrix
......@@ -47,7 +47,7 @@ The Planner monitors system performance and automatically scales prefill/decode
- Dynamo platform installed on Kubernetes ([Installation Guide](/docs/kubernetes/installation_guide.md))
- kube-prometheus-stack installed ([Metrics Setup](/docs/kubernetes/observability/metrics.md))
- Pre-deployment profiling completed ([Profiling Guide](/docs/benchmarks/sla_driven_profiling.md))
- Pre-deployment profiling completed ([Profiling Guide](/docs/components/profiler/profiler_guide.md))
### Deploy with DGDR (Recommended)
......@@ -57,7 +57,7 @@ The fastest path to a planner-enabled deployment is through a DynamoGraphDeploym
kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
```
This automatically profiles your model and deploys with the SLA planner. See [SLA Planner Quick Start](sla_planner_quickstart.md) for the full workflow.
This automatically profiles your model and deploys with the SLA planner. See [SLA Planner Guide](planner_guide.md) for the full workflow.
### Deploy with DGD (Manual)
......@@ -74,10 +74,10 @@ kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE
|----------|-------------|
| [Planner Guide](planner_guide.md) | Deployment, configuration, integration, troubleshooting |
| [Planner Examples](planner_examples.md) | DGDR YAML examples, sample configurations, advanced patterns |
| [SLA Planner Quick Start](sla_planner_quickstart.md) | End-to-end DGDR workflow: define SLAs, profile, deploy, monitor |
| [SLA-based Planner](sla_planner.md) | Scaling algorithm, correction factors, load prediction details |
| [Load-based Planner](load_planner.md) | Legacy load-based scaling (deprecated) |
| [SLA-Driven Profiling](/docs/benchmarks/sla_driven_profiling.md) | Pre-deployment profiling process and configuration |
| [SLA Planner Guide](planner_guide.md) | End-to-end DGDR workflow: define SLAs, profile, deploy, monitor |
| [SLA-based Planner](planner_guide.md) | Scaling algorithm, correction factors, load prediction details |
| [Load-based Planner](README.md) | Legacy load-based scaling (deprecated) |
| [SLA-Driven Profiling](/docs/components/profiler/profiler_guide.md) | Pre-deployment profiling process and configuration |
| [Planner Design](/docs/design_docs/planner_design.md) | Architecture deep-dive for contributors |
## Configuration Reference
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Planner Examples
Practical examples for deploying the SLA Planner with different configurations. For deployment concepts, see the [Planner Guide](planner_guide.md). For a quick overview, see the [Planner README](README.md).
......@@ -229,7 +235,7 @@ Profiling runs against the real backend (via GPUs or AIC). The mocker deployment
For large models, use a pre-populated PVC instead of downloading from HuggingFace:
See [SLA-Driven Profiling](/docs/benchmarks/sla_driven_profiling.md) for configuration details.
See [SLA-Driven Profiling](/docs/components/profiler/profiler_guide.md) for configuration details.
## Advanced Examples
......@@ -374,5 +380,5 @@ kubectl delete pod pvc-access-pod -n $NAMESPACE
- [Planner README](README.md) -- Overview and quick start
- [Planner Guide](planner_guide.md) -- Deployment, configuration, integration
- [Planner Design](/docs/design_docs/planner_design.md) -- Architecture deep-dive
- [DGDR Configuration Reference](/docs/benchmarks/sla_driven_profiling.md#dgdr-configuration-reference)
- [SLA-Driven Profiling](/docs/benchmarks/sla_driven_profiling.md)
- [DGDR Configuration Reference](/docs/components/profiler/profiler_guide.md#dgdr-configuration-reference)
- [SLA-Driven Profiling](/docs/components/profiler/profiler_guide.md)
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Planner Guide
Deployment, configuration, and integration guide for the Dynamo SLA Planner. For a quick overview, see the [Planner README](README.md). For architecture internals, see [Planner Design](/docs/design_docs/planner_design.md).
......@@ -162,7 +168,7 @@ sla:
- **ITL**: Token generation latency target (lower = more GPUs needed)
- **Trade-offs**: Tighter SLAs require more GPU resources
For comprehensive documentation of all configuration options, see the [DGDR Configuration Reference](/docs/benchmarks/sla_driven_profiling.md#dgdr-configuration-reference).
For comprehensive documentation of all configuration options, see the [DGDR Configuration Reference](/docs/components/profiler/profiler_guide.md#dgdr-configuration-reference).
### Profiling Methods
......@@ -181,7 +187,7 @@ sweep:
aicBackendVersion: "0.20.0"
```
For detailed comparison, supported configurations, and limitations, see [SLA-Driven Profiling Documentation](/docs/benchmarks/sla_driven_profiling.md#profiling-methods).
For detailed comparison, supported configurations, and limitations, see [SLA-Driven Profiling Documentation](/docs/components/profiler/profiler_guide.md#profiling-methods).
### Load Predictors
......@@ -440,7 +446,7 @@ kubectl logs -l job-name=profile-sla-aic -n $NAMESPACE
| **DGD not deployed** | Verify `autoApply: true` in DGDR spec |
| **Prometheus errors** | Ensure `PROMETHEUS_ENDPOINT` env var points to your Prometheus service |
For comprehensive troubleshooting including AI Configurator constraints, performance debugging, and backend-specific issues, see [SLA-Driven Profiling Troubleshooting](/docs/benchmarks/sla_driven_profiling.md#troubleshooting).
For comprehensive troubleshooting including AI Configurator constraints, performance debugging, and backend-specific issues, see [SLA-Driven Profiling Troubleshooting](/docs/components/profiler/profiler_guide.md#troubleshooting).
## Related Documentation
......@@ -448,5 +454,5 @@ For comprehensive troubleshooting including AI Configurator constraints, perform
- [Planner Examples](planner_examples.md) -- DGDR YAML examples and sample configurations
- [Planner Design](/docs/design_docs/planner_design.md) -- Architecture deep-dive for contributors
- [DGDR API Reference](/docs/kubernetes/api_reference.md)
- [Pre-Deployment Profiling](/docs/benchmarks/sla_driven_profiling.md)
- [Pre-Deployment Profiling](/docs/components/profiler/profiler_guide.md)
- [Dynamo Operator Guide](/docs/kubernetes/dynamo_operator.md)
......@@ -124,8 +124,8 @@ Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
|----------|-------------|
| [Profiler Guide](profiler_guide.md) | Configuration, methods, and troubleshooting |
| [Profiler Examples](profiler_examples.md) | Complete DGDR YAMLs, WebUI, script examples |
| [SLA Planner Quick Start](/docs/planner/sla_planner_quickstart.md) | End-to-end deployment workflow |
| [SLA Planner Architecture](/docs/planner/sla_planner.md) | How the Planner uses profiling data |
| [SLA Planner Guide](/docs/components/planner/planner_guide.md) | End-to-end deployment workflow |
| [SLA Planner Architecture](/docs/components/planner/planner_guide.md) | How the Planner uses profiling data |
```{toctree}
:hidden:
......
......@@ -336,7 +336,7 @@ planner:
```
> [!NOTE]
> Planner arguments use `planner_` prefix. See [SLA Planner documentation](/docs/planner/sla_planner.md) for full list.
> Planner arguments use `planner_` prefix. See [SLA Planner documentation](/docs/components/planner/planner_guide.md) for full list.
### Model Cache PVC (Advanced)
......@@ -641,7 +641,7 @@ kubectl create secret docker-registry nvcr-imagepullsecret \
## See Also
- [Profiler Examples](profiler_examples.md) - Complete DGDR YAML examples
- [SLA Planner Quick Start](/docs/planner/sla_planner_quickstart.md) - End-to-end deployment workflow
- [SLA Planner Architecture](/docs/planner/sla_planner.md) - How the Planner uses profiling data
- [SLA Planner Guide](/docs/components/planner/planner_guide.md) - End-to-end deployment workflow
- [SLA Planner Architecture](/docs/components/planner/planner_guide.md) - How the Planner uses profiling data
- [DGDR API Reference](/docs/kubernetes/api_reference.md) - DGDR specification
- [Profiler Arguments Reference](/benchmarks/profiler/utils/profiler_argparse.py) - Full CLI reference
......@@ -75,7 +75,7 @@ All CLI arguments can be configured via environment variables using the `DYN_` p
For complete K8s examples and advanced configuration, see [K8s Examples](router_examples.md#k8s-examples).
For A/B testing and advanced K8s setup, see the [KV Router A/B Benchmarking Guide](../benchmarks/kv-router-ab-testing.md).
For A/B testing and advanced K8s setup, see the [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md).
For more configuration options and tuning guidelines, see the [Router Guide](router_guide.md).
......@@ -83,7 +83,7 @@ For more configuration options and tuning guidelines, see the [Router Guide](rou
**Requirements:**
- **Dynamic endpoints only**: KV router requires `register_llm()` with `model_input=ModelInput.Tokens`. Your backend handler receives pre-tokenized requests with `token_ids` instead of raw text.
- Backend workers must call `register_llm()` with `model_input=ModelInput.Tokens` (see [Backend Guide](../development/backend-guide.md))
- Backend workers must call `register_llm()` with `model_input=ModelInput.Tokens` (see [Backend Guide](../../development/backend-guide.md))
- You cannot use `--static-endpoint` mode with KV routing (use dynamic discovery instead)
**Multimodal Support:**
......@@ -100,4 +100,4 @@ For basic model registration without KV routing, use `--router-mode round-robin`
- **[Router Guide](router_guide.md)**: Deep dive into KV cache routing, configuration, disaggregated serving, and tuning
- **[Router Examples](router_examples.md)**: Python API usage, K8s examples, and custom routing patterns
- **[Router Design](../design_docs/router_design.md)**: Architecture details, algorithms, and event transport modes
- **[Router Design](../../design_docs/router_design.md)**: Architecture details, algorithms, and event transport modes
......@@ -113,7 +113,7 @@ For basic Kubernetes deployment with the KV Router, see the [Kubernetes Deployme
- [Distributed inference tutorial](../../examples/basics/kubernetes/Distributed_Inference/agg_router.yaml)
**For A/B Testing and Advanced K8s Setup:**
See the comprehensive [KV Router A/B Benchmarking Guide](../benchmarks/kv-router-ab-testing.md) for step-by-step instructions on deploying, configuring, and benchmarking the KV router in Kubernetes.
See the comprehensive [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md) for step-by-step instructions on deploying, configuring, and benchmarking the KV router in Kubernetes.
### Example with Advanced Configuration
......@@ -270,7 +270,7 @@ This approach gives you complete control over routing decisions, allowing you to
- **Maximize cache reuse**: Use `best_worker()` which considers both prefill and decode loads
- **Balance load**: Consider both `potential_prefill_tokens` and `potential_decode_blocks` together
See [Router Design](../design_docs/router_design.md) for architecture details and the cost function algorithm.
See [Router Design](../../design_docs/router_design.md) for architecture details and the cost function algorithm.
## KV Event Publishing for Custom Engines
......@@ -547,4 +547,4 @@ Each event in the payload is a dictionary with `type` field (`BlockStored`, `Blo
- **[Router README](README.md)**: Quick start guide for the KV Router
- **[Router Guide](router_guide.md)**: Configuration, tuning, and production setup
- **[Router Design](../design_docs/router_design.md)**: Architecture details and event transport modes
- **[Router Design](../../design_docs/router_design.md)**: Architecture details and event transport modes
......@@ -115,7 +115,7 @@ The main KV-aware routing arguments:
>
> The cli args `--router-ttl`, `--router-max-tree-size`, and `--router-prune-target-ratio` control local cache management when the router operates without receiving events from workers. When KV events are enabled (default), the router relies on worker-side eviction events and these parameters are ignored.
To implement KV event publishing for custom inference engines, enabling them to participate in Dynamo's KV cache-aware routing, see [KV Event Publishing for Custom Engines](../integrations/kv_events_custom_engines.md).
To implement KV event publishing for custom inference engines, enabling them to participate in Dynamo's KV cache-aware routing, see [KV Event Publishing for Custom Engines](../../integrations/kv_events_custom_engines.md).
## Basic Routing
......@@ -135,7 +135,7 @@ We can then use the default routing methods exposed by the client class to send
KV Cache routing uses direct routing with a special worker selection algorithm.
For benchmarking KV router performance, see the [KV Router A/B Benchmarking Guide](../benchmarks/kv-router-ab-testing.md).
For benchmarking KV router performance, see the [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md).
For custom routing logic and advanced patterns, see [Routing Patterns](router_examples.md#routing-patterns) in the examples documentation.
......@@ -177,7 +177,7 @@ The `router_temperature` parameter controls routing randomness:
## Disaggregated Serving
Dynamo supports disaggregated serving where prefill (prompt processing) and decode (token generation) are handled by separate worker pools. When you register workers with `ModelType.Prefill` (see [Backend Guide](../development/backend-guide.md)), the frontend automatically detects them and activates an internal prefill router.
Dynamo supports disaggregated serving where prefill (prompt processing) and decode (token generation) are handled by separate worker pools. When you register workers with `ModelType.Prefill` (see [Backend Guide](../../development/backend-guide.md)), the frontend automatically detects them and activates an internal prefill router.
### Automatic Prefill Router Activation
......@@ -260,7 +260,7 @@ For improved fault tolerance, you can launch multiple frontend + router replicas
### Router State Management
The KV Router tracks two types of state (see [Router Design](../design_docs/router_design.md) for details):
The KV Router tracks two types of state (see [Router Design](../../design_docs/router_design.md) for details):
1. **Prefix blocks (cached KV blocks)**: Maintained in a radix tree, tracking which blocks are cached on each worker. This state is **persistent** - backed by NATS JetStream events and object store snapshots. New router replicas automatically sync this state on startup, ensuring consistent cache awareness across restarts.
......@@ -346,5 +346,5 @@ curl http://localhost:8000/busy_threshold
- **[Router README](README.md)**: Quick start guide for the KV Router
- **[Router Examples](router_examples.md)**: Python API usage, K8s examples, and custom routing patterns
- **[Router Design](../design_docs/router_design.md)**: Architecture details and event transport modes
- **[KV Event Publishing for Custom Engines](../integrations/kv_events_custom_engines.md)**: Integrate custom inference engines with KV-aware routing
- **[Router Design](../../design_docs/router_design.md)**: Architecture details and event transport modes
- **[KV Event Publishing for Custom Engines](../../integrations/kv_events_custom_engines.md)**: Integrate custom inference engines with KV-aware routing
......@@ -53,7 +53,7 @@ redirects = {
"kubernetes/multinode-deployment": "../kubernetes/deployment/multinode-deployment.html",
"kubernetes/logging": "../kubernetes/observability/logging.html",
"kubernetes/metrics": "../kubernetes/observability/metrics.html",
"architecture/kv_cache_routing": "../router/kv_cache_routing.html",
"architecture/kv_cache_routing": "../components/router/router_guide.html",
# PR #3658
"API/nixl_connect/README": "../../api/nixl_connect/README.html",
"API/nixl_connect/connector": "../../api/nixl_connect/connector.html",
......@@ -69,34 +69,33 @@ redirects = {
"guides/backend": "../development/backend-guide.html",
"runtime/README": "../development/runtime-guide.html",
"guides/tool_calling": "../agents/tool-calling.html",
"architecture/kvbm_architecture": "../kvbm/kvbm_architecture.html",
"architecture/kvbm_components": "../kvbm/kvbm_components.html",
"architecture/kvbm_intro": "../kvbm/kvbm_intro.html",
"architecture/kvbm_motivation": "../kvbm/kvbm_motivation.html",
"architecture/kvbm_reading": "../kvbm/kvbm_reading.html",
"guides/run_kvbm_in_trtllm": "../kvbm/trtllm-setup.html",
"guides/run_kvbm_in_vllm": "../kvbm/vllm-setup.html",
"architecture/kvbm_architecture": "../design_docs/kvbm_design.html",
"architecture/kvbm_components": "../design_docs/kvbm_design.html",
"architecture/kvbm_intro": "../components/kvbm/README.html",
"architecture/kvbm_motivation": "../design_docs/kvbm_design.html",
"architecture/kvbm_reading": "../design_docs/kvbm_design.html",
"guides/run_kvbm_in_trtllm": "../components/kvbm/kvbm_guide.html",
"guides/run_kvbm_in_vllm": "../components/kvbm/kvbm_guide.html",
"guides/health_check": "../observability/health-checks.html",
"guides/logging": "../observability/logging.html",
"guides/metrics": "../observability/metrics.html",
"guides/disagg_perf_tuning": "../performance/tuning.html",
"architecture/load_planner": "../planner/load_planner.html",
"architecture/planner_intro": "../planner/planner_intro.html",
"architecture/sla_planner": "../planner/sla_planner.html",
"kubernetes/sla_planner_quickstart": "../planner/sla_planner_quickstart.html",
"architecture/load_planner": "../components/planner/README.html",
"architecture/planner_intro": "../components/planner/README.html",
"architecture/sla_planner": "../components/planner/planner_guide.html",
"kubernetes/sla_planner_quickstart": "../components/planner/planner_guide.html",
"guides/dynamo_run": "../reference/cli.html",
"dynamo_glossary": "../reference/glossary.html",
"support_matrix": "../reference/support-matrix.html",
"components/router/README": "../router/README.html",
# Multimodal documentation consolidation
"backends/vllm/multimodal": "../../multimodal/vllm.html",
"backends/vllm/multimodal_vllm_guide": "../../multimodal/vllm.html",
"backends/trtllm/multimodal_support": "../../multimodal/trtllm.html",
"backends/trtllm/multimodal_trtllm_guide": "../../multimodal/trtllm.html",
"backends/trtllm/multinode/multinode-multimodal-example": "../../../multimodal/trtllm.html",
"backends/sglang/multimodal_epd": "../../multimodal/sglang.html",
"backends/sglang/multimodal_sglang_guide": "../../multimodal/sglang.html",
"multimodal/multimodal_intro": "index.html",
# Multimodal documentation consolidation (all redirect to features/multimodal/)
"backends/vllm/multimodal": "../../features/multimodal/multimodal_vllm.html",
"backends/vllm/multimodal_vllm_guide": "../../features/multimodal/multimodal_vllm.html",
"backends/trtllm/multimodal_support": "../../features/multimodal/multimodal_trtllm.html",
"backends/trtllm/multimodal_trtllm_guide": "../../features/multimodal/multimodal_trtllm.html",
"backends/trtllm/multinode/multinode-multimodal-example": "../../../features/multimodal/multimodal_trtllm.html",
"backends/sglang/multimodal_epd": "../../features/multimodal/multimodal_sglang.html",
"backends/sglang/multimodal_sglang_guide": "../../features/multimodal/multimodal_sglang.html",
"multimodal/multimodal_intro": "../features/multimodal/README.html",
# Speculative decoding consolidation
"backends/vllm/speculative_decoding": "../../features/speculative_decoding/speculative_decoding_vllm.html",
# Multimodal migration to features/multimodal/
......@@ -104,6 +103,23 @@ redirects = {
"multimodal/vllm": "../features/multimodal/multimodal_vllm.html",
"multimodal/sglang": "../features/multimodal/multimodal_sglang.html",
"multimodal/trtllm": "../features/multimodal/multimodal_trtllm.html",
# Component consolidation into docs/components/
"router/README": "../components/router/README.html",
"router/kv_cache_routing": "../components/router/router_guide.html",
"router/kv_events": "../integrations/kv_events_custom_engines.html",
"planner/planner_intro": "../components/planner/README.html",
"planner/README": "../components/planner/README.html",
"planner/planner_guide": "../components/planner/planner_guide.html",
"planner/planner_examples": "../components/planner/planner_examples.html",
"planner/sla_planner_quickstart": "../components/planner/planner_guide.html",
"planner/sla_planner": "../components/planner/planner_guide.html",
"planner/load_planner": "../components/planner/README.html",
"kvbm/kvbm_intro": "../components/kvbm/README.html",
"kvbm/README": "../components/kvbm/README.html",
"kvbm/kvbm_guide": "../components/kvbm/kvbm_guide.html",
"kvbm/kvbm_design": "../design_docs/kvbm_design.html",
# Profiler consolidation
"benchmarks/sla_driven_profiling": "../components/profiler/profiler_guide.html",
}
# Custom extensions
......
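The redirect entries above can be sanity-checked with a small script. This is a minimal sketch, assuming each target is resolved relative to the directory of its source document (as the `../` prefixes suggest); `resolve_redirect`, `missing_targets`, and the `built_pages` set are hypothetical names used only for illustration and are not part of `conf.py`.

```python
import posixpath


def resolve_redirect(source: str, target: str) -> str:
    """Resolve a redirect target relative to the source document's directory.

    Assumption: targets are relative to the directory containing the source
    page, e.g. "architecture/kv_cache_routing" lives in "architecture/".
    """
    source_dir = posixpath.dirname(source)
    return posixpath.normpath(posixpath.join(source_dir, target))


def missing_targets(redirects: dict[str, str], built_pages: set[str]) -> dict[str, str]:
    """Return {source: resolved_target} for redirects whose target was not built."""
    missing = {}
    for source, target in redirects.items():
        resolved = resolve_redirect(source, target)
        if resolved not in built_pages:
            missing[source] = resolved
    return missing


if __name__ == "__main__":
    # A few entries copied from the redirects table in this change.
    redirects = {
        "architecture/kv_cache_routing": "../components/router/router_guide.html",
        "planner/sla_planner": "../components/planner/planner_guide.html",
        "kvbm/kvbm_design": "../design_docs/kvbm_design.html",
    }
    # Hypothetical list of pages produced by the docs build.
    built_pages = {
        "components/router/router_guide.html",
        "components/planner/planner_guide.html",
    }
    for source, resolved in missing_targets(redirects, built_pages).items():
        print(f"{source} -> {resolved} (target not found)")
```

Running a check like this after the build would flag any redirect that still points at a page removed by the refactor.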
......@@ -53,7 +53,7 @@ To address the growing demands of distributed inference serving, NVIDIA introduc
The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features:
- [Dynamo Disaggregated Serving](disagg_serving.md)
- [Dynamo Smart Router](../router/README.md)
- [Dynamo Smart Router](../components/router/README.md)
- [Dynamo KV Cache Block Manager](../kvbm/kvbm_intro.rst)
- [Planner](../planner/planner_intro.rst)
- [NVIDIA Inference Transfer Library (NIXL)](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)
......
......@@ -361,6 +361,6 @@ There are two components of the interface:
## See Also
- [KVBM Overview](README.md)
- [KVBM Guide](kvbm_guide.md)
- [KVBM Overview](../components/kvbm/README.md)
- [KVBM Guide](../components/kvbm/kvbm_guide.md)
- [NIXL Documentation](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Planner Design
> **Tier 3 design documentation** for contributors and architects. For user-facing docs, see [docs/planner/](/docs/planner/).
> **Tier 3 design documentation** for contributors and architects. For user-facing docs, see [docs/components/planner/](/docs/components/planner/).
## Overview
......
......@@ -304,7 +304,7 @@ This dual-layer approach—persistent global KV cache state via JetStream and ep
## See Also
- **[Router README](../router/README.md)**: Quick start guide for the KV Router
- **[Router Guide](../router/router_guide.md)**: Configuration, tuning, and production setup
- **[Router Examples](../router/router_examples.md)**: Python API usage and custom routing patterns
- **[Router README](../components/router/README.md)**: Quick start guide for the KV Router
- **[Router Guide](../components/router/router_guide.md)**: Configuration, tuning, and production setup
- **[Router Examples](../components/router/router_examples.md)**: Python API usage and custom routing patterns
- **[KV Event Publishing for Custom Engines](../integrations/kv_events_custom_engines.md)**: Integrate custom inference engines with KV-aware routing
......@@ -311,4 +311,4 @@ kubectl logs deployment/my-worker | grep -i lora
- [Feature Matrix](../../reference/feature-matrix.md) - Backend compatibility overview
- [vLLM Backend](../../backends/vllm/README.md) - vLLM-specific configuration
- [Dynamo Operator](../../kubernetes/dynamo_operator.md) - Kubernetes operator overview
- [KV-Aware Routing](../../router/router_guide.md) - LoRA-aware request routing
- [KV-Aware Routing](../../components/router/router_guide.md) - LoRA-aware request routing
# KServe gRPC frontend
> **Note**: This content has moved to [Frontend Guide](../components/frontend/frontend_guide.md).
> This file will be removed in a future release.
## Motivation
[KServe v2 API](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2) is one of the industry-standard protocols for machine learning model inference. Triton Inference Server is one of the inference solutions that complies with the KServe v2 API, and it has been widely adopted. To let Triton users quickly explore Dynamo's benefits, Dynamo provides a KServe gRPC frontend.
This documentation assumes readers are familiar with the KServe v2 API. It focuses on the Dynamo components that work together to support the KServe API and on how users can migrate an existing KServe deployment to Dynamo.
## Supported Endpoints
* `ModelInfer` endpoint: KServe standard endpoint as described [here](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/required_api.md#inference-1)
* `ModelStreamInfer` endpoint: Triton extension endpoint that provides a bidirectional streaming version of the inference RPC, allowing a sequence of inference requests/responses to be sent over a gRPC stream, as described [here](https://github.com/triton-inference-server/common/blob/main/protobuf/grpc_service.proto#L84-L92)
* `ModelMetadata` endpoint: KServe standard endpoint as described [here](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/required_api.md#model-metadata-1)
* `ModelConfig` endpoint: Triton extension endpoint as described [here](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_model_configuration.md)
## Starting the Frontend
To start the KServe frontend, run the following command:
```
python -m dynamo.frontend --kserve-grpc-server
```
## gRPC Performance Tuning
The gRPC server supports optional HTTP/2 flow control tuning via environment variables. These can be set before starting the server to optimize for high-throughput streaming workloads.
| Environment Variable | Description | Default |
|---------------------|-------------|---------|
| `DYN_GRPC_INITIAL_CONNECTION_WINDOW_SIZE` | HTTP/2 connection-level flow control window size in bytes | tonic default (64KB) |
| `DYN_GRPC_INITIAL_STREAM_WINDOW_SIZE` | HTTP/2 per-stream flow control window size in bytes | tonic default (64KB) |
### Example: High-ISL/OSL configuration for streaming workloads
```bash
# For 128 concurrent 15k-token requests
export DYN_GRPC_INITIAL_CONNECTION_WINDOW_SIZE=16777216 # 16MB
export DYN_GRPC_INITIAL_STREAM_WINDOW_SIZE=1048576 # 1MB
python -m dynamo.frontend --kserve-grpc-server
```
If these variables are not set, the server uses tonic's default values.
> **Note**: Tune these values based on your workload. Connection window should accommodate `concurrent_requests × request_size`. Memory overhead equals the connection window size (shared across all streams). See [gRPC performance best practices](https://grpc.io/docs/guides/performance/) and [gRPC channel arguments](https://grpc.github.io/grpc/core/group__grpc__arg__keys.html) for more details.
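
The sizing rule in the note above can be turned into a small helper. This is a minimal sketch, not part of Dynamo: the function names are hypothetical, the 128 KiB per-request payload is an assumed figure chosen for illustration, and the stream window is passed in explicitly because the note gives a rule of thumb only for the connection window. The cap reflects the HTTP/2 limit on flow-control window size (2^31 - 1 bytes).

```python
# Largest flow-control window HTTP/2 allows (2^31 - 1 bytes).
HTTP2_MAX_WINDOW = 2**31 - 1


def connection_window_bytes(concurrent_requests: int, request_bytes: int) -> int:
    """Rule of thumb: concurrent_requests x request_size, capped at the HTTP/2 maximum."""
    if concurrent_requests < 1 or request_bytes < 1:
        raise ValueError("both arguments must be positive")
    return min(concurrent_requests * request_bytes, HTTP2_MAX_WINDOW)


def grpc_window_env(
    concurrent_requests: int, request_bytes: int, stream_window_bytes: int
) -> dict[str, str]:
    """Build the environment variables to export before starting the frontend."""
    connection = connection_window_bytes(concurrent_requests, request_bytes)
    stream = min(stream_window_bytes, HTTP2_MAX_WINDOW)
    return {
        "DYN_GRPC_INITIAL_CONNECTION_WINDOW_SIZE": str(connection),
        "DYN_GRPC_INITIAL_STREAM_WINDOW_SIZE": str(stream),
    }


# Assumed workload: 128 concurrent requests of roughly 128 KiB each,
# with a 1 MiB per-stream window.
env = grpc_window_env(128, 128 * 1024, 1024 * 1024)
for name, value in env.items():
    print(f"export {name}={value}")
```

With these assumed inputs the helper yields the same 16 MB connection window and 1 MB stream window used in the example configuration; measure your own payload sizes before adopting any value.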
## Registering a Backend
As with the HTTP frontend, a registered backend is auto-discovered and added to the frontend's list of served models. To register a backend, use the same `register_llm()` API. The frontend currently supports the following combinations of model type and model input:
* `ModelType::Completions` and `ModelInput::Text`: for an LLM backend that uses a custom preprocessor
* `ModelType::Completions` and `ModelInput::Token`: for an LLM backend that uses the Dynamo preprocessor (i.e. the Dynamo vLLM / SGLang / TRTLLM backends)
* `ModelType::TensorBased` and `ModelInput::Tensor`: for a backend that performs generic tensor-based inference
The first two combinations are backed by the OpenAI Completions API; see the [OpenAI Completions section](#openai-completions) for details. The last combination aligns most closely with the KServe API, and users can replace an existing deployment with Dynamo once their backend implements an adaptor for `NvCreateTensorRequest`/`NvCreateTensorResponse`; see the [Tensor section](#tensor) for details.
### OpenAI Completions
Most Dynamo features are tailored for LLM inference. The combinations backed by the OpenAI API enable those features and are the best fit for exploring them. However, they require a specific conversion between generic tensor-based messages and OpenAI messages, which imposes a specific structure on the KServe request message.
#### Model Metadata / Config
The metadata and config endpoints report the following for a registered backend. Note that this is not the exact response.
```
{
  name: $MODEL_NAME,
  version: 1,
  platform: "dynamo",
  backend: "dynamo", # model config specific
  inputs: [
    {
      name: "text_input",
      datatype: "BYTES",
      shape: [1]
    },
    {
      name: "streaming",
      datatype: "BOOL",
      shape: [1],
      optional: true
    }
  ],
  outputs: [
    {
      name: "text_output",
      datatype: "BYTES",
      shape: [-1]
    },
    {
      name: "finish_reason",
      datatype: "BYTES",
      shape: [-1],
      optional: true
    }
  ]
}
```
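
A client can check a request against the advertised inputs before sending it. The sketch below is illustrative only and is not Dynamo's own validation: `ADVERTISED_INPUTS` and `validate_inputs` are hypothetical names, and tensors are represented as plain dictionaries rather than protobuf messages.

```python
# Inputs advertised by the metadata endpoint for Completions-backed models.
ADVERTISED_INPUTS = {
    "text_input": {"datatype": "BYTES", "shape": [1], "optional": False},
    "streaming": {"datatype": "BOOL", "shape": [1], "optional": True},
}


def validate_inputs(inputs: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the inputs match."""
    problems = []
    by_name = {tensor["name"]: tensor for tensor in inputs}
    for name, spec in ADVERTISED_INPUTS.items():
        tensor = by_name.get(name)
        if tensor is None:
            # Only required inputs are reported when absent.
            if not spec["optional"]:
                problems.append(f"missing required input '{name}'")
            continue
        if tensor["datatype"] != spec["datatype"]:
            problems.append(f"'{name}' must have datatype {spec['datatype']}")
        if list(tensor["shape"]) != spec["shape"]:
            problems.append(f"'{name}' must have shape {spec['shape']}")
    return problems


request_inputs = [{"name": "text_input", "datatype": "BYTES", "shape": [1]}]
print(validate_inputs(request_inputs))
```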
#### Inference
When the frontend receives an inference request, it performs the following conversions:
* `text_input`: the element is expected to contain the user prompt string and is converted to the `prompt` field of the OpenAI Completions request.
* `streaming`: the element is converted to the `stream` field of the OpenAI Completions request.
When the frontend receives the model response, it performs the following conversions:
* `text_output`: each element corresponds to one choice in the OpenAI Completions response, and its content is set to the `text` of that choice.
* `finish_reason`: each element corresponds to one choice in the OpenAI Completions response, and its content is set to the `finish_reason` of that choice.
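
The field mapping above can be sketched in a few lines of Python. This is an illustration of the mapping only, not the frontend's implementation: the function names are hypothetical, tensors are modeled as a dictionary from tensor name to a list of elements, and passing the model name through as the `model` field is an assumption.

```python
def kserve_to_completion_request(model: str, inputs: dict) -> dict:
    """Map KServe input tensors (name -> list of elements) to Completions fields."""
    request = {"model": model, "prompt": inputs["text_input"][0]}
    if "streaming" in inputs:  # optional input
        request["stream"] = bool(inputs["streaming"][0])
    return request


def completion_to_kserve_outputs(response: dict) -> dict:
    """Map each Completions choice to one element of the output tensors."""
    choices = response["choices"]
    outputs = {"text_output": [choice["text"] for choice in choices]}
    reasons = [choice.get("finish_reason") for choice in choices]
    if all(reason is not None for reason in reasons):  # optional output
        outputs["finish_reason"] = reasons
    return outputs


request = kserve_to_completion_request(
    "example-model", {"text_input": ["Hello"], "streaming": [True]}
)
outputs = completion_to_kserve_outputs(
    {
        "choices": [
            {"text": "Hi there", "finish_reason": "stop"},
            {"text": "Hey", "finish_reason": "length"},
        ]
    }
)
print(request)
print(outputs)
```

Because each choice becomes one element, the output tensors have one entry per choice, which is consistent with the `[-1]` output shape reported by the metadata endpoint.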
### Tensor
Use this combination when migrating an existing KServe-based backend into the Dynamo ecosystem.
#### Model Metadata / Config
When registering, the backend must provide the model's metadata, because a tensor-based deployment is generic and the frontend cannot make the assumptions it makes for an OpenAI Completions model. There are two ways to provide model metadata:
* [TensorModelConfig](../../lib/llm/src/protocols/tensor.rs): This is a Dynamo-defined structure for model metadata; the backend can provide the metadata as shown in this [example](../../lib/bindings/python/tests/test_tensor.py). For metadata provided this way, the following fields are set to fixed values: `version: 1`, `platform: "dynamo"`, `backend: "dynamo"`. Note that for the model config endpoint, the remaining fields are set to their default values.
* [triton_model_config](../../lib/llm/src/protocols/tensor.rs): Users who already have a Triton model config and need the full config returned for client-side logic can set it in `TensorModelConfig::triton_model_config`, which supersedes the other fields in `TensorModelConfig` and is used for endpoint responses. `triton_model_config` is expected to be the serialized string of the `ModelConfig` protobuf message; see [echo_tensor_worker.py](../../tests/frontend/grpc/echo_tensor_worker.py) for an example.
#### Inference
For each inference request, the backend receives an [NvCreateTensorRequest](../../lib/llm/src/protocols/tensor.rs) and is expected to return an [NvCreateTensorResponse](../../lib/llm/src/protocols/tensor.rs). These are the Dynamo counterparts of the `ModelInferRequest` / `ModelInferResponse` protobuf messages.
## Python Bindings
The frontend can also be started through the Python bindings. This is useful when integrating Dynamo into an existing system that needs the frontend to run in the same process as other components. See [server.py](../../lib/bindings/python/examples/kserve_grpc_service/server.py) for an example.
......@@ -41,8 +41,18 @@
agents/tool-calling.md
development/jail_stream.md
router/router_examples.md
planner/load_planner.md
components/planner/README.md
components/planner/planner_guide.md
components/planner/planner_examples.md
components/kvbm/README.md
components/kvbm/kvbm_guide.md
components/router/README.md
components/router/router_guide.md
components/router/router_examples.md
components/frontend/frontend_guide.md
design_docs/kvbm_design.md
integrations/flexkv_integration.md
integrations/sglang_hicache.md
fault_tolerance/README.md
fault_tolerance/request_migration.md
fault_tolerance/request_cancellation.md
......@@ -63,7 +73,6 @@
backends/sglang/gpt-oss.md
backends/sglang/diffusion-lm.md
backends/sglang/profiling.md
backends/sglang/sgl-hicache-example.md
backends/sglang/sglang-disaggregation.md
backends/sglang/prometheus.md
......@@ -79,7 +88,6 @@
backends/vllm/multi-node.md
backends/vllm/prometheus.md
backends/vllm/prompt-embeddings.md
backends/vllm/speculative_decoding.md
features/speculative_decoding/README.md
features/speculative_decoding/speculative_decoding_vllm.md
......@@ -88,15 +96,5 @@
mocker/mocker.md
multimodal/index.md
multimodal/vllm.md
multimodal/sglang.md
multimodal/trtllm.md
frontends/kserve.md
_sections/frontends.rst
.. TODO: architecture/distributed_runtime.md and architecture/dynamo_flow.md
have some outdated names/references and need a refresh.
.. TODO: Add an OpenAI frontend doc to complement the KServe GRPC doc
in the Frontends section.