Unverified commit 598cbbb7, authored by Anish, committed by GitHub

docs: reorganizing documentation to make things clearer (#3658)


Signed-off-by: athreesh <anish.maddipoti@utexas.edu>
Co-authored-by: Claude <noreply@anthropic.com>
parent 34fc9693
......@@ -55,9 +55,9 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | ✅ | |
| [**Load Based Planner**](../../../docs/architecture/load_planner.md) | 🚧 | Planned |
| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | 🚧 | Planned |
| [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ | |
| [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | Planned |
| [**KVBM**](../../../docs/kvbm/kvbm_architecture.md) | 🚧 | Planned |
### Large Scale P/D and WideEP Features
......@@ -308,4 +308,4 @@ For detailed instructions on running comprehensive performance sweeps across bot
Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.
Here is the instruction: [Running KVBM in TensorRT-LLM](./../../../docs/guides/run_kvbm_in_trtllm.md) .
Here is the instruction: [Running KVBM in TensorRT-LLM](./../../../docs/kvbm/trtllm-setup.md) .
......@@ -38,9 +38,9 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP |
| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | ✅ | |
| [**Load Based Planner**](../../../docs/architecture/load_planner.md) | 🚧 | WIP |
| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | ✅ | |
| [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ | |
| [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | WIP |
| [**KVBM**](../../../docs/kvbm/kvbm_architecture.md) | ✅ | |
| [**LMCache**](./LMCache_Integration.md) | ✅ | |
### Large Scale P/D and WideEP Features
......
......@@ -10,7 +10,7 @@ When running vLLM through Dynamo, vLLM engine metrics are automatically passed t
For the complete and authoritative list of all vLLM metrics, always refer to the official documentation linked above.
Dynamo runtime metrics are documented in [docs/guides/metrics.md](../../guides/metrics.md).
Dynamo runtime metrics are documented in [docs/observability/metrics.md](../../observability/metrics.md).
## Metric Reference
......@@ -96,7 +96,7 @@ vllm:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B"} 89.38
- [vLLM GitHub - Metrics Implementation](https://github.com/vllm-project/vllm/tree/main/vllm/engine/metrics)
### Dynamo Metrics
- **Dynamo Metrics Guide**: See `docs/guides/metrics.md` for complete documentation on Dynamo runtime metrics
- **Dynamo Metrics Guide**: See [docs/observability/metrics.md](../../observability/metrics.md) for complete documentation on Dynamo runtime metrics
- **Dynamo Runtime Metrics**: Metrics prefixed with `dynamo_*` for runtime, components, endpoints, and namespaces
- Implementation: `lib/runtime/src/metrics.rs` (Rust runtime metrics)
- Metric names: `lib/runtime/src/metrics/prometheus_names.rs` (metric name constants)
......
# Pre-Deployment Profiling
> [!TIP]
> **New to SLA Planner?** For a complete workflow including profiling and deployment, see the [SLA Planner Quick Start Guide](/docs/kubernetes/sla_planner_quickstart.md).
> **New to SLA Planner?** For a complete workflow including profiling and deployment, see the [SLA Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md).
## Profiling Script
......@@ -99,7 +99,7 @@ SLA planner can work with any interpolation data that follows the above format.
## Detailed Kubernetes Profiling Instructions
> [!TIP]
> For a complete step-by-step workflow, see the [SLA Planner Quick Start Guide](/docs/kubernetes/sla_planner_quickstart.md).
> For a complete step-by-step workflow, see the [SLA Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md).
This section provides detailed technical information for advanced users who need to customize the profiling process.
......
../../../deploy/metrics/docker-compose.yml
\ No newline at end of file
......@@ -11,18 +11,18 @@
:maxdepth: 2
:hidden:
runtime/README.md
API/nixl_connect/connector.md
API/nixl_connect/descriptor.md
API/nixl_connect/device.md
API/nixl_connect/device_kind.md
API/nixl_connect/operation_status.md
API/nixl_connect/rdma_metadata.md
API/nixl_connect/readable_operation.md
API/nixl_connect/writable_operation.md
API/nixl_connect/read_operation.md
API/nixl_connect/write_operation.md
API/nixl_connect/README.md
development/runtime-guide.md
api/nixl_connect/connector.md
api/nixl_connect/descriptor.md
api/nixl_connect/device.md
api/nixl_connect/device_kind.md
api/nixl_connect/operation_status.md
api/nixl_connect/rdma_metadata.md
api/nixl_connect/readable_operation.md
api/nixl_connect/writable_operation.md
api/nixl_connect/read_operation.md
api/nixl_connect/write_operation.md
api/nixl_connect/README.md
kubernetes/api_reference.md
kubernetes/create_deployment.md
......@@ -32,14 +32,14 @@
kubernetes/grove.md
kubernetes/model_caching_with_fluid.md
kubernetes/README.md
guides/dynamo_run.md
guides/metrics.md
guides/run_kvbm_in_vllm.md
guides/run_kvbm_in_trtllm.md
guides/tool_calling.md
reference/cli.md
observability/metrics.md
kvbm/vllm-setup.md
kvbm/trtllm-setup.md
guides/tool-calling.md
architecture/kv_cache_routing.md
architecture/load_planner.md
planner/load_planner.md
architecture/request_migration.md
architecture/request_cancellation.md
......
......@@ -42,7 +42,7 @@ Quickstart
Quickstart <self>
Installation <_sections/installation>
Support Matrix <support_matrix.md>
Support Matrix <reference/support-matrix.md>
Architecture <_sections/architecture>
Examples <_sections/examples>
......@@ -63,18 +63,18 @@ Quickstart
:caption: Components
Backends <_sections/backends>
Router <components/router/README>
Planner <architecture/planner_intro>
KVBM <architecture/kvbm_intro>
Router <router/README>
Planner <planner/planner_intro>
KVBM <kvbm/kvbm_intro>
.. toctree::
:hidden:
:caption: Developer Guide
Benchmarking Guide <benchmarks/benchmarking.md>
SLA Planner (Autoscaling) Quickstart <kubernetes/sla_planner_quickstart>
Logging <guides/logging.md>
Health Checks <guides/health_check.md>
Tuning Disaggregated Serving Performance <guides/disagg_perf_tuning.md>
Writing Python Workers in Dynamo <guides/backend.md>
Glossary <dynamo_glossary.md>
SLA Planner (Autoscaling) Quickstart <planner/sla_planner_quickstart>
Logging <observability/logging.md>
Health Checks <observability/health-checks.md>
Tuning Disaggregated Serving Performance <performance/tuning.md>
Writing Python Workers in Dynamo <development/backend-guide.md>
Glossary <reference/glossary.md>
......@@ -90,7 +90,7 @@ Consult the corresponding sh file. Each of the python commands to launch a compo
The front end is launched with "python3 -m dynamo.frontend [--http-port 8000] [--router-mode kv]"
Each worker will launch `python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flags `command.
If you are a Dynamo contributor the [dynamo run guide](/docs/guides/dynamo_run.md) for details on how to run this command.
If you are a Dynamo contributor the [dynamo run guide](/docs/reference/cli.md) for details on how to run this command.
## Step 3: Key Customization Points
......
......@@ -196,7 +196,7 @@ kubectl get pods -n ${NAMESPACE}
3. **Optional:**
- [Set up Prometheus & Grafana](metrics.md)
- [SLA Planner Quickstart Guide](sla_planner_quickstart.md) (for SLA-aware scheduling and autoscaling)
- [SLA Planner Quickstart Guide](../planner/sla_planner_quickstart.md) (for SLA-aware scheduling and autoscaling)
## Troubleshooting
......
......@@ -65,7 +65,7 @@ This will create two components:
Both components expose a `/metrics` endpoint following the OpenMetrics format, but with different metrics appropriate to their roles. For details about:
- Deployment configuration: See the [vLLM README](/docs/backends/vllm/README.md)
- Available metrics: See the [metrics guide](/docs/guides/metrics.md)
- Available metrics: See the [metrics guide](/docs/observability/metrics.md)
### Validate the Deployment
......
......@@ -19,7 +19,7 @@ limitations under the License.
This guide explains how to leverage KVBM (KV Block Manager) to mange KV cache and do KV offloading in TensorRT-LLM (trtllm).
To learn what KVBM is, please check [here](https://docs.nvidia.com/dynamo/latest/architecture/kvbm_intro.html)
To learn what KVBM is, please check [here](https://docs.nvidia.com/dynamo/latest/kvbm/kvbm_intro.html)
> [!Note]
> - Ensure that `etcd` and `nats` are running before starting.
......
......@@ -19,7 +19,7 @@ limitations under the License.
This guide explains how to leverage KVBM (KV Block Manager) to mange KV cache and do KV offloading in vLLM.
To learn what KVBM is, please check [here](https://docs.nvidia.com/dynamo/latest/architecture/kvbm_intro.html)
To learn what KVBM is, please check [here](https://docs.nvidia.com/dynamo/latest/kvbm/kvbm_intro.html)
## Quick Start
......