Unverified commit fa474d36, authored by dagil-nvidia, committed by GitHub

docs: fix disaggregated deployment example in tracing.md (#6999)


Signed-off-by: Dan Gil <dagil@nvidia.com>
parent 3c44b88e
@@ -29,95 +29,46 @@ This guide covers single GPU demo setup using Docker Compose. For Kubernetes dep
 Start the observability stack (Prometheus, Grafana, Tempo, exporters). See [Observability Getting Started](README.md#getting-started-quickly) for instructions.
-### 2. Set Environment Variables
-Configure Dynamo components to export traces:
+### 2. Start Dynamo Components (Single GPU)
+For a simple single-GPU deployment, run the aggregated tracing launch script. This script enables tracing, sets per-component service names, and starts a frontend with a single vLLM worker:
 ```bash
-# Enable JSONL logging and tracing
-export DYN_LOGGING_JSONL=true
-export OTEL_EXPORT_ENABLED=true
-export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:4317
+cd examples/backends/vllm/launch
+./agg_tracing.sh
 ```
-### 3. Start Dynamo Components (Single GPU)
-For a simple single-GPU deployment, start the frontend and a single vLLM worker:
+To override the Tempo endpoint (default `http://localhost:4317`):
 ```bash
-# Start the frontend with tracing enabled (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
-export OTEL_SERVICE_NAME=dynamo-frontend
-python -m dynamo.frontend --router-mode kv &
-# Start a single vLLM worker (aggregated prefill and decode)
-export OTEL_SERVICE_NAME=dynamo-worker-vllm
-python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager \
-  --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT" &
-wait
+export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://tempo:4317
+./agg_tracing.sh
 ```
-This runs both prefill and decode on the same GPU, providing a simpler setup for testing tracing.
+This runs a single aggregated worker on one GPU, providing a simpler setup for testing tracing.
 ### Alternative: Disaggregated Deployment (2 GPUs)
-Run the vLLM disaggregated script with tracing enabled:
+For a disaggregated deployment with tracing, run the disaggregated tracing launch script. This script sets up tracing and launches a frontend, a decode worker on GPU 0, and a prefill worker on GPU 1:
 ```bash
-# Navigate to vLLM launch directory
 cd examples/backends/vllm/launch
-# Export tracing env vars, then run the disaggregated deployment script.
-./disagg.sh
+./disagg_tracing.sh
 ```
-**Note:** the example vLLM `disagg.sh` sets per-worker `--kv-events-config` with unique ZMQ endpoints and unique
-`VLLM_NIXL_SIDE_CHANNEL_PORT` values to avoid "Address already in use" conflicts when multiple workers run on the same host. If you run the components manually, make sure you mirror those settings.
-```bash
-#!/bin/bash
-set -e
-trap 'echo Cleaning up...; kill 0' EXIT
-# Enable tracing
-export DYN_LOGGING_JSONL=true
-export OTEL_EXPORT_ENABLED=true
-export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:4317
-# Run frontend (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
-export OTEL_SERVICE_NAME=dynamo-frontend
-python -m dynamo.frontend --router-mode kv &
-# Run decode worker, make sure to wait for start up
-export OTEL_SERVICE_NAME=dynamo-worker-decode
-DYN_SYSTEM_PORT=8081 CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm \
-  --model Qwen/Qwen3-0.6B \
-  --enforce-eager \
-  --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT" &
-# Run prefill worker, make sure to wait for start up
-export OTEL_SERVICE_NAME=dynamo-worker-prefill
-DYN_SYSTEM_PORT=8082 \
-VLLM_NIXL_SIDE_CHANNEL_PORT=20097 \
-CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
-  --model Qwen/Qwen3-0.6B \
-  --enforce-eager \
-  --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT" \
-  --disaggregation-mode prefill \
-  --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}' &
-```
-For disaggregated deployments, this separates prefill and decode onto different GPUs for better resource utilization.
-### 4. Generate Traces
-Send requests to the frontend to generate traces (works for both aggregated and disaggregated deployments). **Note the `x-request-id` header**, which allows you to easily search for and correlate this specific trace in Grafana:
+This separates prefill and decode onto different GPUs for better resource utilization.
+### 3. Generate Traces
+Send requests to the frontend to generate traces (works for both aggregated and disaggregated deployments). The launch scripts print an example `curl` command on startup with the correct model name.
+**Tip:** Add an `x-request-id` header to easily search for a specific trace in Grafana:
 ```bash
 curl -H 'Content-Type: application/json' \
   -H 'x-request-id: test-trace-001' \
   -d '{
-    "model": "Qwen/Qwen3-0.6B",
+    "model": "<MODEL>",
     "max_completion_tokens": 100,
     "messages": [
       {"role": "user", "content": "What is the capital of France?"}
@@ -126,7 +77,7 @@ curl -H 'Content-Type: application/json' \
 http://localhost:8000/v1/chat/completions
 ```
-### 5. View Traces in Grafana Tempo
+### 4. View Traces in Grafana Tempo
 1. Open Grafana at `http://localhost:3000`
 2. Login with username `dynamo` and password `dynamo`
@@ -145,7 +96,7 @@ Below is an example of what a trace looks like in Grafana Tempo:
 ![Trace Example](../assets/img/trace.png)
-### 6. Stop Services
+### 5. Stop Services
 When done, stop the observability stack. See [Observability Getting Started](README.md#getting-started-quickly) for Docker Compose commands.
@@ -157,56 +108,17 @@ For Kubernetes deployments, ensure you have a Tempo instance deployed and access
 ### Modify DynamoGraphDeployment for Tracing
-Add common tracing environment variables at the top level and service-specific names in each component in your `DynamoGraphDeployment` (e.g., `examples/backends/vllm/deploy/disagg.yaml`):
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: vllm-disagg
-spec:
-  # Common environment variables for all services
-  env:
-    - name: DYN_LOGGING_JSONL
-      value: "true"
-    - name: OTEL_EXPORT_ENABLED
-      value: "true"
-    - name: OTEL_EXPORTER_OTLP_TRACES_ENDPOINT
-      value: "http://tempo.observability.svc.cluster.local:4317"
-  services:
-    Frontend:
-      # ... existing configuration ...
-      extraPodSpec:
-        mainContainer:
-          # ... existing configuration ...
-          env:
-            - name: OTEL_SERVICE_NAME
-              value: "dynamo-frontend"
-    VllmDecodeWorker:
-      # ... existing configuration ...
-      extraPodSpec:
-        mainContainer:
-          # ... existing configuration ...
-          env:
-            - name: OTEL_SERVICE_NAME
-              value: "dynamo-worker-decode"
-    VllmPrefillWorker:
-      # ... existing configuration ...
-      extraPodSpec:
-        mainContainer:
-          # ... existing configuration ...
-          env:
-            - name: OTEL_SERVICE_NAME
-              value: "dynamo-worker-prefill"
-```
-Apply the updated DynamoGraphDeployment:
+Tracing-enabled variants of the example deployments are provided:
+- **Aggregated:** `examples/backends/vllm/deploy/agg_tracing.yaml`
+- **Disaggregated:** `examples/backends/vllm/deploy/disagg_tracing.yaml`
+These add the [Environment Variables](#environment-variables) to the base `agg.yaml` / `disagg.yaml` deployments. To override the Tempo endpoint, edit `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` in the YAML.
+Apply a tracing-enabled deployment:
 ```bash
-kubectl apply -f examples/backends/vllm/deploy/disagg.yaml
+kubectl apply -f examples/backends/vllm/deploy/disagg_tracing.yaml
 ```
 Traces will now be exported to Tempo and can be viewed in Grafana.
......
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# Aggregated vLLM deployment with OpenTelemetry tracing enabled.
# Base deployment: agg.yaml
# See docs/observability/tracing.md for setup instructions.
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-agg-tracing
spec:
  envs:
    - name: DYN_LOGGING_JSONL
      value: "true"
    - name: OTEL_EXPORT_ENABLED
      value: "true"
    - name: OTEL_EXPORTER_OTLP_TRACES_ENDPOINT
      value: "http://tempo.observability.svc.cluster.local:4317"
  services:
    Frontend:
      envFromSecret: hf-token-secret
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
          env:
            - name: OTEL_SERVICE_NAME
              value: "dynamo-frontend"
    VllmDecodeWorker:
      envFromSecret: hf-token-secret
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"
        requests:
          custom:
            # Increase this value for larger models
            ephemeral-storage: "2Gi"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
          workingDir: /workspace/examples/backends/vllm
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - Qwen/Qwen3-0.6B
          env:
            - name: OTEL_SERVICE_NAME
              value: "dynamo-worker-vllm"
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# Disaggregated vLLM deployment with OpenTelemetry tracing enabled.
# Base deployment: disagg.yaml
# See docs/observability/tracing.md for setup instructions.
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-disagg-tracing
spec:
  envs:
    - name: DYN_LOGGING_JSONL
      value: "true"
    - name: OTEL_EXPORT_ENABLED
      value: "true"
    - name: OTEL_EXPORTER_OTLP_TRACES_ENDPOINT
      value: "http://tempo.observability.svc.cluster.local:4317"
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
          env:
            - name: OTEL_SERVICE_NAME
              value: "dynamo-frontend"
    VllmDecodeWorker:
      envFromSecret: hf-token-secret
      componentType: worker
      subComponentType: decode
      replicas: 1
      resources:
        limits:
          gpu: "1"
        requests:
          custom:
            # Increase this value for larger models
            ephemeral-storage: "2Gi"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
          workingDir: /workspace/examples/backends/vllm
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --disaggregation-mode
            - decode
          env:
            - name: OTEL_SERVICE_NAME
              value: "dynamo-worker-decode"
    VllmPrefillWorker:
      envFromSecret: hf-token-secret
      componentType: worker
      subComponentType: prefill
      replicas: 1
      resources:
        limits:
          gpu: "1"
        requests:
          custom:
            # Increase this value for larger models
            ephemeral-storage: "2Gi"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
          workingDir: /workspace/examples/backends/vllm
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --disaggregation-mode
            - prefill
            - --kv-transfer-config
            - '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
          env:
            - name: OTEL_SERVICE_NAME
              value: "dynamo-worker-prefill"
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
set -e
trap 'echo Cleaning up...; kill 0' EXIT
# Default model
MODEL="Qwen/Qwen3-0.6B"
# Parse command line arguments
EXTRA_ARGS=()
while [[ $# -gt 0 ]]; do
    case $1 in
        --model)
            MODEL="$2"
            shift 2
            ;;
        *)
            EXTRA_ARGS+=("$1")
            shift
            ;;
    esac
done
# Enable tracing -- requires the observability stack (Prometheus, Grafana, Tempo).
# See docs/observability/README.md for setup instructions.
export DYN_LOGGING_JSONL=true
export OTEL_EXPORT_ENABLED=true
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="${OTEL_EXPORTER_OTLP_TRACES_ENDPOINT:-http://localhost:4317}"
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
echo "=========================================="
echo "Launching Aggregated Serving + Tracing (1 GPU)"
echo "=========================================="
echo "Model: $MODEL"
echo "Frontend: http://localhost:$HTTP_PORT"
echo "Tempo: $OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
echo "=========================================="
echo ""
echo "Example test command:"
echo ""
echo " curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
echo " -H 'Content-Type: application/json' \\"
echo " -H 'x-request-id: test-trace-001' \\"
echo " -d '{"
echo " \"model\": \"${MODEL}\","
echo " \"messages\": [{\"role\": \"user\", \"content\": \"Explain why Roger Federer is considered one of the greatest tennis players of all time\"}],"
echo " \"max_tokens\": 32"
echo " }'"
echo ""
echo "=========================================="
# run ingress
# dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
export OTEL_SERVICE_NAME=dynamo-frontend
python -m dynamo.frontend &
# run worker
# --enforce-eager is added for quick deployment. for production use, need to remove this flag
export OTEL_SERVICE_NAME=dynamo-worker-vllm
DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT:-8081} \
python -m dynamo.vllm --model "$MODEL" --enforce-eager \
--otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT" \
"${EXTRA_ARGS[@]}"
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
set -e
trap 'echo Cleaning up...; kill 0' EXIT
# Common configuration
MODEL="Qwen/Qwen3-0.6B"
# Parse command line arguments
while [[ $# -gt 0 ]]; do
    case $1 in
        --model)
            MODEL="$2"
            shift 2
            ;;
        *)
            shift
            ;;
    esac
done
# Enable tracing -- requires the observability stack (Prometheus, Grafana, Tempo).
# See docs/observability/README.md for setup instructions.
export DYN_LOGGING_JSONL=true
export OTEL_EXPORT_ENABLED=true
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="${OTEL_EXPORTER_OTLP_TRACES_ENDPOINT:-http://localhost:4317}"
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
echo "=========================================="
echo "Launching Disaggregated Serving + Tracing (2 GPUs)"
echo "=========================================="
echo "Model: $MODEL"
echo "Frontend: http://localhost:$HTTP_PORT"
echo "Tempo: $OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
echo "=========================================="
echo ""
echo "Example test command:"
echo ""
echo " curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
echo " -H 'Content-Type: application/json' \\"
echo " -H 'x-request-id: test-trace-001' \\"
echo " -d '{"
echo " \"model\": \"${MODEL}\","
echo " \"messages\": [{\"role\": \"user\", \"content\": \"Explain why Roger Federer is considered one of the greatest tennis players of all time\"}],"
echo " \"max_tokens\": 32"
echo " }'"
echo ""
echo "=========================================="
# run ingress
# dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
export OTEL_SERVICE_NAME=dynamo-frontend
python -m dynamo.frontend &
# --enforce-eager is added for quick deployment. for production use, need to remove this flag
export OTEL_SERVICE_NAME=dynamo-worker-decode
DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT1:-8081} \
CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm \
--model "$MODEL" \
--enforce-eager \
--otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT" \
--disaggregation-mode decode \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' &
export OTEL_SERVICE_NAME=dynamo-worker-prefill
DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT2:-8082} \
VLLM_NIXL_SIDE_CHANNEL_PORT=20097 \
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
--model "$MODEL" \
--enforce-eager \
--otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT" \
--disaggregation-mode prefill \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
--kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}'
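
Both launch scripts default `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` to `http://localhost:4317` and start the components without checking that anything is listening there. The sketch below is not part of this commit: it shows one way a pre-flight check could look. The helper names `otlp_host_port` and `otlp_reachable` are hypothetical. It assumes `bash` and coreutils `timeout` are available, and it does not handle IPv6 literals.

```shell
#!/bin/bash
# Hypothetical helper (not in the commit): split an OTLP endpoint URL such as
# http://localhost:4317 into "host port".
otlp_host_port() {
    local endpoint="${1#*://}"   # drop the scheme, e.g. "http://"
    endpoint="${endpoint%%/*}"   # drop any path component
    local host="${endpoint%%:*}"
    local port="${endpoint##*:}"
    # No explicit port in the URL: fall back to 4317, the default used by the launch scripts.
    if [ "$host" = "$port" ]; then port=4317; fi
    echo "$host $port"
}

# Hypothetical pre-flight check: try to open a TCP connection using bash's
# /dev/tcp pseudo-device, giving up after 2 seconds.
otlp_reachable() {
    set -- $(otlp_host_port "$1")   # now $1 = host, $2 = port
    timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

ENDPOINT="${OTEL_EXPORTER_OTLP_TRACES_ENDPOINT:-http://localhost:4317}"
if otlp_reachable "$ENDPOINT"; then
    echo "Tempo endpoint reachable: $ENDPOINT"
else
    echo "Warning: cannot reach $ENDPOINT; traces may not be exported" >&2
fi
```

This only warns and does not exit, so a launch script that has `set -e` enabled could include it without aborting when Tempo is still starting.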