Commit 07db5895 authored by dagil-nvidia, committed by GitHub

docs: clean up toctree navigation and add disaggregated serving guide (#6024)


Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
parent 219e5c45
Fault Tolerance
===============

.. toctree::
   :maxdepth: 1

   Overview <../fault_tolerance/README.md>
   Request Migration <../fault_tolerance/request_migration.md>
   Request Cancellation <../fault_tolerance/request_cancellation.md>
   Graceful Shutdown <../fault_tolerance/graceful_shutdown.md>
   Request Rejection <../fault_tolerance/request_rejection.md>
   Testing <../fault_tolerance/testing.md>
Deployment Guide
================

.. toctree::
   :hidden:

   Kubernetes Quickstart <../kubernetes/README>
   Detailed Installation Guide <../kubernetes/installation_guide>
   Dynamo Operator <../kubernetes/dynamo_operator>
   Service Discovery <../kubernetes/service_discovery>
   Webhooks <../kubernetes/webhooks>
   Minikube Setup <../kubernetes/deployment/minikube>
   Managing Models with DynamoModel <../kubernetes/deployment/dynamomodel-guide>
   Autoscaling <../kubernetes/autoscaling>
   Checkpointing <../kubernetes/chrek/README>
Multinode
=========

.. toctree::
   :hidden:

   Multinode Deployments <../kubernetes/deployment/multinode-deployment>
   Grove <../kubernetes/grove>
Observability
=============

.. toctree::
   :hidden:

   Metrics <../kubernetes/observability/metrics>
   Logging <../kubernetes/observability/logging>
   Operator Metrics <../kubernetes/observability/operator-metrics>
Observability
=============

.. toctree::
   :hidden:

   Overview <../observability/README>
   Prometheus + Grafana Setup <../observability/prometheus-grafana>
   Metrics <../observability/metrics>
   Metrics Developer Guide <../observability/metrics-developer-guide>
   Health Checks <../observability/health-checks>
   Tracing <../observability/tracing>
   Logging <../observability/logging>
@@ -79,3 +79,9 @@ See the [Frontend Guide](frontend_guide.md) for full configuration options.
|----------|-------------|
| [Frontend Guide](frontend_guide.md) | KServe gRPC configuration and integration |
| [Router Documentation](../router/README.md) | KV-aware routing configuration |
```{toctree}
:hidden:
frontend_guide
```
@@ -74,3 +74,9 @@ KVBM has three primary logical layers:
- **[FlexKV Integration](../../integrations/flexkv_integration.md)** — Use FlexKV for KV cache management
- **[SGLang HiCache](../../integrations/sglang_hicache.md)** — Enable SGLang's hierarchical cache with NIXL
- **[NIXL Documentation](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)** — NIXL communication library details
```{toctree}
:hidden:
kvbm_guide
```
@@ -134,3 +134,10 @@ The planner queries the frontend's `/metrics` endpoint via Prometheus. Required
- Request count and duration
- TTFT and ITL distributions
- Input/output sequence lengths
```{toctree}
:hidden:
planner_guide
planner_examples
```
@@ -101,3 +101,10 @@ For basic model registration without KV routing, use `--router-mode round-robin`
- **[Router Guide](router_guide.md)**: Deep dive into KV cache routing, configuration, disaggregated serving, and tuning
- **[Router Examples](router_examples.md)**: Python API usage, K8s examples, and custom routing patterns
- **[Router Design](../../design_docs/router_design.md)**: Architecture details, algorithms, and event transport modes
```{toctree}
:hidden:
router_guide
router_examples
```
direction: right
aggregated: Aggregated {
width: 600
height: 450
frontend: Frontend {
width: 180
height: 60
style.font-size: 20
}
router: Router {
width: 180
height: 60
style.font-size: 20
}
w1: "W1 (TP2)" {
width: 180
height: 60
style.font-size: 20
}
w2: "W2 (TP2)" {
width: 180
height: 60
style.font-size: 20
}
w3: "W3 (TP2)" {
width: 180
height: 60
style.font-size: 20
}
w4: "W4 (TP2)" {
width: 180
height: 60
style.font-size: 20
}
frontend -> router
router -> w1
router -> w2
router -> w3
router -> w4
note: |md
Each worker handles both prefill and decode.
|
note.style.font-size: 18
}
disaggregated: Disaggregated {
width: 600
height: 450
frontend: Frontend {
width: 180
height: 60
style.font-size: 20
}
router: Router {
width: 180
height: 60
style.font-size: 20
}
p1: "Prefill 1 (TP2)" {
width: 220
height: 60
style.font-size: 20
}
p2: "Prefill 2 (TP2)" {
width: 220
height: 60
style.font-size: 20
}
decode: "Decode (TP4)" {
width: 220
height: 60
style.font-size: 20
}
frontend -> router
router -> p1
router -> p2
p1 -> decode: "KV Cache via RDMA"
p2 -> decode: "KV Cache via RDMA"
note: |md
Prefill and decode on separate workers.
|
note.style.font-size: 18
}
aggregated.style.font-size: 24
disaggregated.style.font-size: 24
direction: down
q1: "AIC shows disagg > agg\nthroughput?" {
shape: diamond
}
q2: "RDMA available\nin cluster?" {
shape: diamond
}
q3: "ISL/OSL ratio > 8:1?" {
shape: diamond
}
q4: "Disagg > 20%\nfaster?" {
shape: diamond
}
agg: "AGGREGATED\nSimpler, no RDMA needed" {
shape: rectangle
width: 380
height: 80
}
disagg: "DISAGGREGATED\nHigher throughput, needs RDMA" {
shape: rectangle
width: 380
height: 80
}
q1 -> q2: Yes
q1 -> agg: No
q2 -> q3: Yes
q2 -> agg: No
q3 -> disagg: Yes
q3 -> q4: No
q4 -> disagg: Yes
q4 -> agg: No
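The decision flow in the diagram above can be read as a short chain of checks. The following is a minimal sketch of that chain, not code from the repository: the function name and parameter names are hypothetical, and the thresholds (an ISL/OSL ratio above 8:1, a speedup above 20%) are taken directly from the diagram labels.

```python
def choose_serving_mode(aic_prefers_disagg: bool,
                        rdma_available: bool,
                        isl_osl_ratio: float,
                        disagg_speedup: float) -> str:
    """Walk the q1..q4 decision nodes from the diagram in order.

    All names here are illustrative assumptions. disagg_speedup is the
    fractional throughput gain of disaggregated over aggregated serving
    (0.25 means 25% faster).
    """
    # q1: does AIC show disaggregated throughput above aggregated?
    if not aic_prefers_disagg:
        return "aggregated"
    # q2: is RDMA available in the cluster?
    if not rdma_available:
        return "aggregated"
    # q3: is the ISL/OSL ratio greater than 8:1?
    if isl_osl_ratio > 8:
        return "disaggregated"
    # q4: is disaggregated serving more than 20% faster?
    if disagg_speedup > 0.20:
        return "disaggregated"
    return "aggregated"


# An ISL of 4000 and an OSL of 500 give a ratio of exactly 8, which does
# not pass q3, so the result depends on the measured speedup in q4.
mode = choose_serving_mode(True, True, 4000 / 500, 0.25)
```

Without RDMA the function returns "aggregated" no matter how large the projected speedup is, which matches the "No" edge out of q2.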
direction: right
install: {
label: "Install"
shape: rectangle
install_detail: |md
pip3 install
aiconfigurator
|
}
configure: {
label: "Configure"
shape: rectangle
configure_detail: |md
Model, GPUs,
SLA targets
|
}
compare: {
label: "Compare"
shape: rectangle
compare_detail: |md
Agg vs disagg
rankings
|
}
deploy: {
label: "Deploy"
shape: rectangle
deploy_detail: |md
Apply DGD
manifest
|
}
validate: {
label: "Validate"
shape: rectangle
validate_detail: |md
Benchmark
with AIPerf
|
}
install -> configure
configure -> compare
compare -> deploy
deploy -> validate
direction: right
aic: "AIC Output" {
shape: rectangle
c1: "concurrency: 56 (=14x4)" {
width: 280
}
c2: "ISL: 4000, OSL: 500" {
width: 280
}
c3: "Model: Qwen3-32B-FP8" {
width: 280
}
c4: "concurrency x ~14" {
width: 280
}
c5: "(best practice)" {
width: 280
}
}
aiperf: "AIPerf Argument" {
shape: rectangle
a1: "--concurrency 56" {
width: 320
}
a2: "--isl 4000 --osl 500" {
width: 320
}
a3: "-m Qwen/Qwen3-32B-FP8" {
width: 320
}
a4: "--num-requests 800" {
width: 320
}
a5: "--extra-inputs \"ignore_eos:true\"" {
width: 320
}
}
aic.c1 -> aiperf.a1
aic.c2 -> aiperf.a2
aic.c3 -> aiperf.a3
aic.c4 -> aiperf.a4
aic.c5 -> aiperf.a5
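The mapping in the diagram above can be sketched as a small helper that turns AIC output values into an AIPerf argument list. This is an illustrative sketch only: the helper name and its parameters are hypothetical, the flags are the ones shown in the diagram, and rounding the request count up to the next multiple of 100 is an assumption made here to reproduce the diagram's 800 from 56 x 14 = 784.

```python
import math


def aiperf_args(model: str, concurrency: int, isl: int, osl: int,
                requests_per_slot: int = 14, round_to: int = 100) -> list[str]:
    """Build the AIPerf arguments shown in the diagram (hypothetical helper).

    requests_per_slot mirrors the "concurrency x ~14" rule of thumb.
    round_to is an assumed rounding step, not something the source states.
    """
    raw_requests = concurrency * requests_per_slot
    num_requests = math.ceil(raw_requests / round_to) * round_to
    return [
        "--concurrency", str(concurrency),
        "--isl", str(isl),
        "--osl", str(osl),
        "-m", model,
        "--num-requests", str(num_requests),
        # Listed as a best practice in the diagram.
        "--extra-inputs", "ignore_eos:true",
    ]


args = aiperf_args("Qwen/Qwen3-32B-FP8", concurrency=56, isl=4000, osl=500)
```

The list is returned without an executable name or subcommand so that it can be appended to whatever AIPerf invocation the deployment already uses.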
@@ -140,3 +140,13 @@ See [Fault Tolerance Testing](testing.md) for details.
- [Observability](../observability/README.md) - Metrics and monitoring
- [Distributed Runtime](../design_docs/distributed_runtime.md) - Service discovery architecture
- [Event Plane](../design_docs/event_plane.md) - etcd and NATS coordination
```{toctree}
:hidden:
Request Migration <request_migration>
Request Cancellation <request_cancellation>
Graceful Shutdown <graceful_shutdown>
Request Rejection <request_rejection>
Testing <testing>
```
@@ -12,6 +12,8 @@
:hidden:
development/runtime-guide.md
development/jail_stream.md
api/nixl_connect/connector.md
api/nixl_connect/descriptor.md
api/nixl_connect/device.md
@@ -26,48 +28,20 @@
kubernetes/api_reference.md
kubernetes/deployment/create_deployment.md
kubernetes/deployment/dynamomodel-guide.md
kubernetes/chrek/README.md
kubernetes/chrek/dynamo.md
kubernetes/chrek/standalone.md
kubernetes/fluxcd.md
kubernetes/grove.md
kubernetes/model_caching_with_fluid.md
kubernetes/README.md
reference/cli.md
observability/metrics.md
integrations/kv_events_custom_engines.md
agents/tool-calling.md
development/jail_stream.md
components/planner/README.md
components/planner/planner_guide.md
components/planner/planner_examples.md
components/kvbm/README.md
components/kvbm/kvbm_guide.md
components/router/README.md
components/router/router_guide.md
components/router/router_examples.md
components/frontend/frontend_guide.md
design_docs/kvbm_design.md
integrations/flexkv_integration.md
integrations/sglang_hicache.md
fault_tolerance/README.md
fault_tolerance/request_migration.md
fault_tolerance/request_cancellation.md
fault_tolerance/graceful_shutdown.md
fault_tolerance/request_rejection.md
fault_tolerance/testing.md
design_docs/request_plane.md
design_docs/event_plane.md
reference/cli.md
reference/glossary.md
performance/tuning.md
backends/trtllm/multinode/multinode-examples.md
backends/trtllm/llama4_plus_eagle.md
backends/trtllm/kv-cache-transfer.md
backends/trtllm/gemma3_sliding_window_attention.md
backends/trtllm/gpt-oss.md
backends/trtllm/prometheus.md
backends/vllm/deepseek-r1.md
backends/vllm/gpt-oss.md
backends/vllm/multi-node.md
backends/vllm/prometheus.md
backends/vllm/prompt-embeddings.md
backends/sglang/expert-distribution-eplb.md
backends/sglang/gpt-oss.md
@@ -76,25 +50,19 @@
backends/sglang/sglang-disaggregation.md
backends/sglang/prometheus.md
examples/README.md
examples/runtime/hello_world/README.md
design_docs/distributed_runtime.md
design_docs/dynamo_flow.md
backends/vllm/deepseek-r1.md
backends/vllm/gpt-oss.md
integrations/lmcache_integration.md
backends/vllm/multi-node.md
backends/vllm/prometheus.md
backends/vllm/prompt-embeddings.md
backends/trtllm/multinode/multinode-examples.md
backends/trtllm/llama4_plus_eagle.md
backends/trtllm/kv-cache-transfer.md
backends/trtllm/gemma3_sliding_window_attention.md
backends/trtllm/gpt-oss.md
backends/trtllm/prometheus.md
features/speculative_decoding/README.md
features/speculative_decoding/speculative_decoding_vllm.md
examples/README.md
examples/runtime/hello_world/README.md
benchmarks/kv-router-ab-testing.md
mocker/mocker.md
.. TODO: architecture/distributed_runtime.md and architecture/dynamo_flow.md
have some outdated names/references and need a refresh.