Unverified commit 598cbbb7, authored by Anish, committed by GitHub

docs: reorganizing documentation to make things clearer (#3658)


Signed-off-by: athreesh <anish.maddipoti@utexas.edu>
Co-authored-by: Claude <noreply@anthropic.com>
parent 34fc9693
@@ -55,9 +55,9 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 | [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ | |
 | [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
 | [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ | |
-| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | ✅ | |
-| [**Load Based Planner**](../../../docs/architecture/load_planner.md) | 🚧 | Planned |
-| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | 🚧 | Planned |
+| [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ | |
+| [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | Planned |
+| [**KVBM**](../../../docs/kvbm/kvbm_architecture.md) | 🚧 | Planned |
 ### Large Scale P/D and WideEP Features
@@ -308,4 +308,4 @@ For detailed instructions on running comprehensive performance sweeps across bot
 Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.
-Here is the instruction: [Running KVBM in TensorRT-LLM](./../../../docs/guides/run_kvbm_in_trtllm.md) .
+Here is the instruction: [Running KVBM in TensorRT-LLM](./../../../docs/kvbm/trtllm-setup.md) .
@@ -38,9 +38,9 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 | [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ | |
 | [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP |
 | [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ | |
-| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | ✅ | |
-| [**Load Based Planner**](../../../docs/architecture/load_planner.md) | 🚧 | WIP |
-| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | ✅ | |
+| [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ | |
+| [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | WIP |
+| [**KVBM**](../../../docs/kvbm/kvbm_architecture.md) | ✅ | |
 | [**LMCache**](./LMCache_Integration.md) | ✅ | |
 ### Large Scale P/D and WideEP Features
......
@@ -10,7 +10,7 @@ When running vLLM through Dynamo, vLLM engine metrics are automatically passed t
 For the complete and authoritative list of all vLLM metrics, always refer to the official documentation linked above.
-Dynamo runtime metrics are documented in [docs/guides/metrics.md](../../guides/metrics.md).
+Dynamo runtime metrics are documented in [docs/observability/metrics.md](../../observability/metrics.md).
 ## Metric Reference
@@ -96,7 +96,7 @@ vllm:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B"} 89.38
 - [vLLM GitHub - Metrics Implementation](https://github.com/vllm-project/vllm/tree/main/vllm/engine/metrics)
 ### Dynamo Metrics
-- **Dynamo Metrics Guide**: See `docs/guides/metrics.md` for complete documentation on Dynamo runtime metrics
+- **Dynamo Metrics Guide**: See [docs/observability/metrics.md](../../observability/metrics.md) for complete documentation on Dynamo runtime metrics
 - **Dynamo Runtime Metrics**: Metrics prefixed with `dynamo_*` for runtime, components, endpoints, and namespaces
 - Implementation: `lib/runtime/src/metrics.rs` (Rust runtime metrics)
 - Metric names: `lib/runtime/src/metrics/prometheus_names.rs` (metric name constants)
......
 # Pre-Deployment Profiling
 > [!TIP]
-> **New to SLA Planner?** For a complete workflow including profiling and deployment, see the [SLA Planner Quick Start Guide](/docs/kubernetes/sla_planner_quickstart.md).
+> **New to SLA Planner?** For a complete workflow including profiling and deployment, see the [SLA Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md).
 ## Profiling Script
@@ -99,7 +99,7 @@ SLA planner can work with any interpolation data that follows the above format.
 ## Detailed Kubernetes Profiling Instructions
 > [!TIP]
-> For a complete step-by-step workflow, see the [SLA Planner Quick Start Guide](/docs/kubernetes/sla_planner_quickstart.md).
+> For a complete step-by-step workflow, see the [SLA Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md).
 This section provides detailed technical information for advanced users who need to customize the profiling process.
......
../../../deploy/metrics/docker-compose.yml
\ No newline at end of file
@@ -11,18 +11,18 @@
 :maxdepth: 2
 :hidden:
-runtime/README.md
-API/nixl_connect/connector.md
-API/nixl_connect/descriptor.md
-API/nixl_connect/device.md
-API/nixl_connect/device_kind.md
-API/nixl_connect/operation_status.md
-API/nixl_connect/rdma_metadata.md
-API/nixl_connect/readable_operation.md
-API/nixl_connect/writable_operation.md
-API/nixl_connect/read_operation.md
-API/nixl_connect/write_operation.md
-API/nixl_connect/README.md
+development/runtime-guide.md
+api/nixl_connect/connector.md
+api/nixl_connect/descriptor.md
+api/nixl_connect/device.md
+api/nixl_connect/device_kind.md
+api/nixl_connect/operation_status.md
+api/nixl_connect/rdma_metadata.md
+api/nixl_connect/readable_operation.md
+api/nixl_connect/writable_operation.md
+api/nixl_connect/read_operation.md
+api/nixl_connect/write_operation.md
+api/nixl_connect/README.md
 kubernetes/api_reference.md
 kubernetes/create_deployment.md
@@ -32,14 +32,14 @@
 kubernetes/grove.md
 kubernetes/model_caching_with_fluid.md
 kubernetes/README.md
-guides/dynamo_run.md
-guides/metrics.md
-guides/run_kvbm_in_vllm.md
-guides/run_kvbm_in_trtllm.md
-guides/tool_calling.md
+reference/cli.md
+observability/metrics.md
+kvbm/vllm-setup.md
+kvbm/trtllm-setup.md
+guides/tool-calling.md
 architecture/kv_cache_routing.md
-architecture/load_planner.md
+planner/load_planner.md
 architecture/request_migration.md
 architecture/request_cancellation.md
......
@@ -42,7 +42,7 @@ Quickstart
 Quickstart <self>
 Installation <_sections/installation>
-Support Matrix <support_matrix.md>
+Support Matrix <reference/support-matrix.md>
 Architecture <_sections/architecture>
 Examples <_sections/examples>
@@ -63,18 +63,18 @@ Quickstart
 :caption: Components
 Backends <_sections/backends>
-Router <components/router/README>
-Planner <architecture/planner_intro>
-KVBM <architecture/kvbm_intro>
+Router <router/README>
+Planner <planner/planner_intro>
+KVBM <kvbm/kvbm_intro>
 .. toctree::
 :hidden:
 :caption: Developer Guide
 Benchmarking Guide <benchmarks/benchmarking.md>
-SLA Planner (Autoscaling) Quickstart <kubernetes/sla_planner_quickstart>
-Logging <guides/logging.md>
-Health Checks <guides/health_check.md>
-Tuning Disaggregated Serving Performance <guides/disagg_perf_tuning.md>
-Writing Python Workers in Dynamo <guides/backend.md>
-Glossary <dynamo_glossary.md>
+SLA Planner (Autoscaling) Quickstart <planner/sla_planner_quickstart>
+Logging <observability/logging.md>
+Health Checks <observability/health-checks.md>
+Tuning Disaggregated Serving Performance <performance/tuning.md>
+Writing Python Workers in Dynamo <development/backend-guide.md>
+Glossary <reference/glossary.md>
@@ -90,7 +90,7 @@ Consult the corresponding sh file. Each of the python commands to launch a compo
 The front end is launched with "python3 -m dynamo.frontend [--http-port 8000] [--router-mode kv]"
 Each worker will launch `python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flags `command.
-If you are a Dynamo contributor the [dynamo run guide](/docs/guides/dynamo_run.md) for details on how to run this command.
+If you are a Dynamo contributor the [dynamo run guide](/docs/reference/cli.md) for details on how to run this command.
 ## Step 3: Key Customization Points
......
@@ -196,7 +196,7 @@ kubectl get pods -n ${NAMESPACE}
 3. **Optional:**
 - [Set up Prometheus & Grafana](metrics.md)
-- [SLA Planner Quickstart Guide](sla_planner_quickstart.md) (for SLA-aware scheduling and autoscaling)
+- [SLA Planner Quickstart Guide](../planner/sla_planner_quickstart.md) (for SLA-aware scheduling and autoscaling)
 ## Troubleshooting
......
@@ -65,7 +65,7 @@ This will create two components:
 Both components expose a `/metrics` endpoint following the OpenMetrics format, but with different metrics appropriate to their roles. For details about:
 - Deployment configuration: See the [vLLM README](/docs/backends/vllm/README.md)
-- Available metrics: See the [metrics guide](/docs/guides/metrics.md)
+- Available metrics: See the [metrics guide](/docs/observability/metrics.md)
 ### Validate the Deployment
......
@@ -19,7 +19,7 @@ limitations under the License.
 This guide explains how to leverage KVBM (KV Block Manager) to mange KV cache and do KV offloading in TensorRT-LLM (trtllm).
-To learn what KVBM is, please check [here](https://docs.nvidia.com/dynamo/latest/architecture/kvbm_intro.html)
+To learn what KVBM is, please check [here](https://docs.nvidia.com/dynamo/latest/kvbm/kvbm_intro.html)
 > [!Note]
 > - Ensure that `etcd` and `nats` are running before starting.
......
@@ -19,7 +19,7 @@ limitations under the License.
 This guide explains how to leverage KVBM (KV Block Manager) to mange KV cache and do KV offloading in vLLM.
-To learn what KVBM is, please check [here](https://docs.nvidia.com/dynamo/latest/architecture/kvbm_intro.html)
+To learn what KVBM is, please check [here](https://docs.nvidia.com/dynamo/latest/kvbm/kvbm_intro.html)
 ## Quick Start
......
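For anyone with out-of-tree links into the old layout, the renames visible in the hunks above can be collected into a lookup table. The following is a minimal sketch, not part of the commit: the `RENAMES` table and the `rewrite_doc_links` helper are hypothetical names introduced here, and the table lists only the moves whose full `docs/` paths appear explicitly in this diff.

```python
# Hypothetical helper (not part of the commit): rewrites old docs paths to
# their new locations. Only renames with an explicit docs/ prefix in the
# diff above are listed; the toctree-relative entries are left out.
RENAMES = {
    "docs/architecture/sla_planner.md": "docs/planner/sla_planner.md",
    "docs/architecture/load_planner.md": "docs/planner/load_planner.md",
    "docs/architecture/kvbm_architecture.md": "docs/kvbm/kvbm_architecture.md",
    "docs/guides/run_kvbm_in_trtllm.md": "docs/kvbm/trtllm-setup.md",
    "docs/guides/metrics.md": "docs/observability/metrics.md",
    "docs/kubernetes/sla_planner_quickstart.md": "docs/planner/sla_planner_quickstart.md",
    "docs/guides/dynamo_run.md": "docs/reference/cli.md",
}


def rewrite_doc_links(text: str) -> str:
    """Replace every old docs path found in `text` with its new path."""
    for old, new in RENAMES.items():
        text = text.replace(old, new)
    return text


if __name__ == "__main__":
    line = "See the [metrics guide](/docs/guides/metrics.md)"
    print(rewrite_doc_links(line))
```

Because this is plain substring replacement, links written relative to the `docs/` directory itself (for example `../../guides/metrics.md`) are not matched and would need their own entries.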