Commit 844f8819 authored by Ryan McCormick, committed by GitHub

docs: Bring back some missed release/0.4.0 doc changes, fix broken links, add lychee link checker github action (#2482)
parent 41f095cf
name: Docs link check
on:
  push:
    branches:
      - main
  pull_request:
permissions:
  contents: read
jobs:
  lychee:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository
        uses: actions/checkout@v4
      # Cache lychee results (e.g. to avoid hitting rate limits)
      # https://lychee.cli.rs/github_action_recipes/caching/
      - name: Restore lychee cache
        uses: actions/cache@v4
        with:
          path: .lycheecache
          key: cache-lychee-${{ github.sha }}
          restore-keys: cache-lychee-
      # https://github.com/lycheeverse/lychee/issues/1487
      - name: Install CA Certificates for lychee
        run: |
          sudo apt-get install ca-certificates
      - name: Install lychee
        run: |
          set -euo pipefail
          mkdir -p "$HOME/.local/bin"
          cd "$RUNNER_TEMP"
          # TODO: Lychee v0.19.1 doesn't support regex in --exclude-path, so use nightly
          # release until there is a released version containing regex support.
          curl -sSL -o lychee.tar.gz \
            https://github.com/lycheeverse/lychee/releases/download/nightly/lychee-x86_64-unknown-linux-gnu.tar.gz
          tar -xzf lychee.tar.gz
          BIN_PATH=$(find . -maxdepth 2 -type f -name lychee | head -n1)
          install -m 0755 "$BIN_PATH" "$HOME/.local/bin/lychee"
          echo "$HOME/.local/bin" >> "$GITHUB_PATH"
          lychee --version
      - name: Check documentation links with lychee
        env:
          # Set GITHUB_TOKEN to avoid github rate limits on URL checks
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          set -euo pipefail
          # Run lychee against all files in repo
          lychee \
            --cache \
            --no-progress \
            --exclude-path "ATTRIBUTIONS.*" \
            --accept "200..=299, 403, 429" \
            --exclude-all-private --exclude 0.0.0.0 \
            .
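Most of the documentation fixes in this commit correct relative Markdown links whose `../` depth did not match the location of the file containing them. As a rough illustration of what a link checker verifies for such links, here is a minimal Python sketch. It is not lychee's implementation: the function name `broken_relative_links` and the regular expression are assumptions made for this example, and it only handles inline `[text](target)` links.

```python
import re
from pathlib import Path
from typing import List

# Matches inline Markdown links of the form [text](target).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Matches targets that start with a URL scheme (https:, mailto:, ...).
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def broken_relative_links(md_file: Path) -> List[str]:
    """Return relative link targets in md_file that do not exist on disk."""
    broken = []
    text = md_file.read_text(encoding="utf-8")
    for target in LINK_RE.findall(text):
        # Skip absolute URLs and same-page anchors; only paths are checked.
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue
        # Drop any "#anchor" suffix, then resolve relative to the file's folder.
        path_part = target.split("#", 1)[0]
        resolved = (md_file.parent / path_part).resolve()
        if not resolved.exists():
            broken.append(target)
    return broken
```

Run against a README, a helper like this would flag a target such as `../../docs/` when the file sits only one directory below the repository root, while `../docs/` would resolve.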
@@ -183,7 +183,7 @@ Run the backend/worker like this:
 python -m dynamo.sglang.worker --help
 ```
-You can pass any sglang flags directly to this worker, see https://docs.sglang.ai/backend/server_arguments.html . See there to use multiple GPUs.
+You can pass any sglang flags directly to this worker, see https://docs.sglang.ai/advanced_features/server_arguments.html . See there to use multiple GPUs.
 ## TensorRT-LLM
......
@@ -12,4 +12,4 @@ See the License for the specific language governing permissions and
 limitations under the License.
 -->
-[../../examples/llm/benchmarks/README.md](../../examples/llm/benchmarks/README.md)
+Coming soon.
@@ -77,4 +77,4 @@ To get started with Dynamo components:
 4. **Run deployment scripts** from the engine's launch directory
 5. **Monitor performance** using the metrics component
-For detailed instructions, see the README files in each component directory and the main [Dynamo documentation](../../docs/).
+For detailed instructions, see the README files in each component directory and the main [Dynamo documentation](../docs/).
@@ -52,7 +52,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 ## Quick Start
-Below we provide a guide that lets you run all of our the common deployment patterns on a single node. See our different [architectures](../llm/README.md#deployment-architectures) for a high level overview of each pattern and the architecture diagram for each.
+Below we provide a guide that lets you run all of our common deployment patterns on a single node.
 ### Start NATS and ETCD in the background
......
@@ -74,7 +74,7 @@ extraPodSpec:
 Before using these templates, ensure you have:
-1. **Dynamo Cloud Platform installed** - See [Installing Dynamo Cloud](../../docs/guides/dynamo_deploy/dynamo_cloud.md)
+1. **Dynamo Cloud Platform installed** - See [Installing Dynamo Cloud](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
 2. **Kubernetes cluster with GPU support**
 3. **Container registry access** for SGLang runtime images
 4. **HuggingFace token secret** (referenced as `envFromSecret: hf-token-secret`)
@@ -159,4 +159,4 @@ Common issues and solutions:
 3. **Health check failures**: Review model loading logs and increase `initialDelaySeconds`
 4. **Out of memory**: Increase memory limits or reduce model batch size
-For additional support, refer to the [deployment troubleshooting guide](../../docs/guides/dynamo_deploy/quickstart.md#troubleshooting).
+For additional support, refer to the [deployment guide](../../../../docs/guides/dynamo_deploy/quickstart.md).
@@ -5,7 +5,7 @@ SPDX-License-Identifier: Apache-2.0
 # Running DeepSeek-R1 Disaggregated with WideEP on H100s
-Dynamo supports SGLang's implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can read their blog post [here](https://www.nvidia.com/en-us/technologies/ai/deepseek-r1-large-scale-p-d-with-wide-expert-parallelism/) for more details. We provide a Dockerfile for this in `container/Dockerfile.sglang-deepep` and configurations to deploy this at scale. In this example, we will run 1 prefill worker on 4 H100 nodes and 1 decode worker on 9 H100 nodes (104 total GPUs).
+Dynamo supports SGLang's implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can read their blog post [here](https://lmsys.org/blog/2025-05-05-large-scale-ep/) for more details. We provide a Dockerfile for this in `container/Dockerfile.sglang-wideep` and configurations to deploy this at scale. In this example, we will run 1 prefill worker on 4 H100 nodes and 1 decode worker on 9 H100 nodes (104 total GPUs).
 ## Instructions
......
 # Example: Deploy Multi-node SGLang with Dynamo on SLURM
-This folder implements the example of [SGLang DeepSeek-R1 Disaggregated with WideEP](../dsr1-wideep.md) on a SLURM cluster.
+This folder implements the example of [SGLang DeepSeek-R1 Disaggregated with WideEP](../docs/dsr1-wideep-h100.md) on a SLURM cluster.
 ## Overview
-The scripts in this folder set up multiple cluster nodes to run the [SGLang DeepSeek-R1 Disaggregated with WideEP](../dsr1-wideep.md) example, with separate nodes handling prefill and decode.
+The scripts in this folder set up multiple cluster nodes to run the [SGLang DeepSeek-R1 Disaggregated with WideEP](../docs/dsr1-wideep-h100.md) example, with separate nodes handling prefill and decode.
 The node setup is done using Python job submission scripts with Jinja2 templates for flexible configuration. The setup also includes GPU utilization monitoring capabilities to track performance during benchmarks.
 ## Scripts
@@ -57,7 +57,7 @@ For simplicity of the example, we will make some assumptions about your SLURM cl
 If your cluster supports similar container based plugins, you may be able to
 modify the template to use that instead.
 3. We assume you have already built a recent Dynamo+SGLang container image as
-described [here](../dsr1-wideep.md#instructions).
+described [here](../docs/dsr1-wideep-h100.md#instructions).
 This is the image that can be passed to the `--container-image` argument in later steps.
 ## Usage
......
@@ -193,7 +193,7 @@ For complete Kubernetes deployment instructions, configurations, and troubleshoo
 ### Client
-See [client](../llm/README.md#client) section to learn how to send request to the deployment.
+See [client](../sglang/README.md#testing-the-deployment) section to learn how to send request to the deployment.
 NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
@@ -218,7 +218,7 @@ DISAGGREGATION_STRATEGY="prefill_first" ./launch/disagg.sh
 ## KV Cache Transfer in Disaggregated Serving
-Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV cache transfer guide](./kv-cache-tranfer.md).
+Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV cache transfer guide](./kv-cache-transfer.md).
 ## Request Migration
@@ -233,7 +233,7 @@ This allows a request to be migrated up to 3 times before failing. See the [Requ
 ## Client
-See [client](../llm/README.md#client) section to learn how to send request to the deployment.
+See [client](../sglang/README.md#testing-the-deployment) section to learn how to send request to the deployment.
 NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
......
@@ -211,7 +211,7 @@ envs:
 ## Testing the Deployment
-Send a test request to verify your deployment. See the [client section](../../../../components/backends/llm/README.md#client) for detailed instructions.
+Send a test request to verify your deployment. See the [client section](../../../../components/backends/vllm/README.md#client) for detailed instructions.
 **Note:** For multi-node deployments, target the node running `python3 -m dynamo.frontend <args>`.
@@ -241,7 +241,7 @@ TensorRT-LLM supports two methods for KV cache transfer in disaggregated serving
 - **UCX** (default): Standard method for KV cache transfer
 - **NIXL** (experimental): Alternative transfer method
-For detailed configuration instructions, see the [KV cache transfer guide](../kv-cache-tranfer.md).
+For detailed configuration instructions, see the [KV cache transfer guide](../kv-cache-transfer.md).
 ## Request Migration
......
@@ -345,7 +345,7 @@ flowchart TD
 ## Next Steps
-- **Production Deployment**: For multi-node deployments, see the [Multi-node Guide](../../examples/basics/multinode/README.md)
+- **Production Deployment**: For multi-node deployments, see the [Multi-node Guide](../../../examples/basics/multinode/README.md)
 - **Advanced Configuration**: Explore TensorRT-LLM engine building options for further optimization
 - **Monitoring**: Set up Prometheus and Grafana for production monitoring
 - **Performance Benchmarking**: Use GenAI-Perf to measure and optimize your deployment performance
@@ -166,5 +166,5 @@ lmcache_config = {
 ## References and Additional Resources
 - [LMCache Documentation](https://docs.lmcache.ai/index.html) - Comprehensive guide and API reference
-- [Configuration Reference](https://docs.lmcache.ai/api_reference/config.html) - Detailed configuration options
+- [Configuration Reference](https://docs.lmcache.ai/api_reference/configurations.html) - Detailed configuration options
@@ -18,7 +18,6 @@ The deprecated `metrics` component is a utility for collecting, aggregating, and
 **Note**: This is a demo implementation. The deprecated `metrics` component is no longer under active development.
 - In this demo the metrics names use the prefix "llm", but in production they will be prefixed with "dynamo" (e.g., the HTTP `/metrics` endpoint will serve metrics with "dynamo" prefixes)
-- This demo will only work when using examples/llm/configs/agg.yml-- other configurations will not work
 <div align="center">
 <img src="images/dynamo_metrics_grafana.png" alt="Dynamo Metrics Dashboard"/>
@@ -81,8 +80,7 @@ metrics --component MyComponent --endpoint my_endpoint
 ### Real Worker
-To run a more realistic deployment to gathering metrics from,
-see the examples in [examples/llm](../../examples/llm).
+To run a more realistic deployment to gather metrics:
 ```bash
 python -m dynamo.frontend &
......
@@ -68,7 +68,7 @@ When using Dynamo HTTP Frontend (`--framework VLLM` or `--framework TRTLLM`), th
 ### Required Files
 The following configuration files should be present in this directory:
-- [docker-compose.yml](./docker-compose.yml): Defines the Prometheus and Grafana services
+- [docker-compose.yml](../docker-compose.yml): Defines the Prometheus and Grafana services
 - [prometheus.yml](./prometheus.yml): Contains Prometheus scraping configuration
 - [grafana-datasources.yml](./grafana-datasources.yml): Contains Grafana datasource configuration
 - [grafana_dashboards/grafana-dashboard-providers.yml](./grafana_dashboards/grafana-dashboard-providers.yml): Contains Grafana dashboard provider configuration
......
@@ -103,7 +103,7 @@ flowchart LR
 ### Multimodal Example
-In the case of the [Dynamo Multimodal Disaggregated Example](../../examples/multimodal/README.md):
+In the case of the [Dynamo Multimodal Disaggregated Example](../../../examples/multimodal/README.md):
 1. The HTTP frontend accepts a text prompt and a URL to an image.
@@ -153,11 +153,11 @@ flowchart LR
 #### Code Examples
-See [prefill_worker](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/components/prefill_worker.py#L199) or [decode_worker](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/components/decode_worker.py#L239) from our Multimodal example,
+See [prefill_worker](../../../examples/multimodal/components/worker.py) or [decode_worker](../../../examples/multimodal/components/worker.py) from our Multimodal example,
 for how they coordinate directly with the Encode Worker by creating a [`WritableOperation`](writable_operation.md),
 sending the operation's metadata via Dynamo's round-robin dispatcher, and awaiting the operation for completion before making use of the transferred data.
-See [encode_worker](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/components/encode_worker.py#L190) from our Multimodal example,
+See [encode_worker](../../..//examples/multimodal/components/encode_worker.py#L190) from our Multimodal example,
 for how the resulting embeddings are registered with the NIXL subsystem by creating a [`Descriptor`](descriptor.md),
 a [`WriteOperation`](write_operation.md) is created using the metadata provided by the requesting worker,
 and the worker awaits for the data transfer to complete for yielding a response.
@@ -170,7 +170,6 @@ and the worker awaits for the data transfer to complete for yielding a response.
 - [Device](device.md)
 - [ReadOperation](read_operation.md)
 - [ReadableOperation](readable_operation.md)
-- [SerializedRequest](serialized_request.md)
 - [WritableOperation](writable_operation.md)
 - [WriteOperation](write_operation.md)
@@ -178,7 +177,6 @@ and the worker awaits for the data transfer to complete for yielding a response.
 ## References
 - [NVIDIA Dynamo](https://developer.nvidia.com/dynamo) @ [GitHub](https://github.com/ai-dynamo/dynamo)
-- [NVIDIA Dynamo NIXL Connect](https://github.com/ai-dynamo/dynamo/tree/main/docs/runtime/nixl_connect)
 - [NVIDIA Inference Transfer Library (NIXL)](https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/#nvidia_inference_transfer_library_nixl_low-latency_hardware-agnostic_communication%C2%A0) @ [GitHub](https://github.com/ai-dynamo/nixl)
-- [Dynamo Multimodal Example](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal)
+- [Dynamo Multimodal Example](../../..//examples/multimodal)
 - [NVIDIA GPU Direct](https://developer.nvidia.com/gpudirect)
@@ -17,7 +17,7 @@ limitations under the License.
 # Dynamo Architecture Flow
-This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in [examples/llm](https://github.com/ai-dynamo/dynamo/tree/main/examples/llm). Color-coded flows indicate different types of operations:
+This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in [components/backends/vllm](../../components/backends/vllm). Color-coded flows indicate different types of operations:
 ## 🔵 Main Request Flow (Blue)
 The primary user journey through the system:
......
@@ -39,7 +39,7 @@ This will create two components:
 - A Worker component exposing metrics on its system port
 Both components expose a `/metrics` endpoint following the OpenMetrics format, but with different metrics appropriate to their roles. For details about:
-- Deployment configuration: See the [vLLM README](../../../../components/backends/vllm/README.md)
+- Deployment configuration: See the [vLLM README](../../../components/backends/vllm/README.md)
 - Available metrics: See the [metrics guide](../metrics.md)
 ### Validate the Deployment
@@ -62,7 +62,7 @@ curl localhost:8080/v1/chat/completions \
 }'
 ```
-For more information about validating the deployment, see the [vLLM README](../../../../components/backends/vllm/README.md).
+For more information about validating the deployment, see the [vLLM README](../../components/backends/vllm/README.md).
 ## Set Up Metrics Collection
......
@@ -54,7 +54,7 @@ You can use `kubectl get dynamoGraphDeployment -n ${NAMESPACE}` to view your dep
 You can use `kubectl delete dynamoGraphDeployment <your-dep-name> -n ${NAMESPACE}` to delete the deployment.
 We provide a Custom Resource YAML file for many examples under the `deploy/` folder.
-Use [VLLM YAML](../../components/backends/vllm/deploy/agg.yaml) for an example.
+Use [VLLM YAML](../../../components/backends/vllm/deploy/agg.yaml) for an example.
 **Note 1** Example Image
......