Commit fa474d36, authored by dagil-nvidia, committed by GitHub

docs: fix disaggregated deployment example in tracing.md (#6999)


Signed-off-by: Dan Gil <dagil@nvidia.com>
parent 3c44b88e
......@@ -29,95 +29,46 @@ This guide covers single GPU demo setup using Docker Compose. For Kubernetes dep
Start the observability stack (Prometheus, Grafana, Tempo, exporters). See [Observability Getting Started](README.md#getting-started-quickly) for instructions.
### 2. Set Environment Variables
### 2. Start Dynamo Components (Single GPU)
Configure Dynamo components to export traces:
For a simple single-GPU deployment, run the aggregated tracing launch script. This script enables tracing, sets per-component service names, and starts a frontend with a single vLLM worker:
```bash
# Enable JSONL logging and tracing
export DYN_LOGGING_JSONL=true
export OTEL_EXPORT_ENABLED=true
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:4317
cd examples/backends/vllm/launch
./agg_tracing.sh
```
### 3. Start Dynamo Components (Single GPU)
For a simple single-GPU deployment, start the frontend and a single vLLM worker:
To override the Tempo endpoint (default `http://localhost:4317`):
```bash
# Start the frontend with tracing enabled (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
export OTEL_SERVICE_NAME=dynamo-frontend
python -m dynamo.frontend --router-mode kv &
# Start a single vLLM worker (aggregated prefill and decode)
export OTEL_SERVICE_NAME=dynamo-worker-vllm
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager \
--otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT" &
wait
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://tempo:4317
./agg_tracing.sh
```
This runs both prefill and decode on the same GPU, providing a simpler setup for testing tracing.
This runs a single aggregated worker on one GPU, providing a simpler setup for testing tracing.
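The endpoint override relies on the shell's default-value expansion, which the launch script applies to `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`. The sketch below isolates that pattern; `resolve_endpoint` is a hypothetical helper used only for illustration and is not part of the launch scripts.

```bash
#!/bin/bash
# Sketch of the default-with-override pattern for the Tempo endpoint.
# resolve_endpoint is a hypothetical helper, not part of agg_tracing.sh.
resolve_endpoint() {
    # ${VAR:-default} yields the default when VAR is unset or empty,
    # otherwise the caller's value is used unchanged.
    echo "${OTEL_EXPORTER_OTLP_TRACES_ENDPOINT:-http://localhost:4317}"
}
```

Because the expansion also treats an empty value as unset, exporting the variable as an empty string still falls back to `http://localhost:4317`.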
### Alternative: Disaggregated Deployment (2 GPUs)
Run the vLLM disaggregated script with tracing enabled:
For a disaggregated deployment with tracing, run the disaggregated tracing launch script. This script sets up tracing and launches a frontend, a decode worker on GPU 0, and a prefill worker on GPU 1:
```bash
# Navigate to vLLM launch directory
cd examples/backends/vllm/launch
# Export tracing env vars, then run the disaggregated deployment script.
./disagg.sh
./disagg_tracing.sh
```
**Note:** the example vLLM `disagg.sh` sets per-worker `--kv-events-config` with unique ZMQ endpoints and unique
`VLLM_NIXL_SIDE_CHANNEL_PORT` values to avoid "Address already in use" conflicts when multiple workers run on the same host. If you run the components manually, make sure you mirror those settings.
This separates prefill and decode onto different GPUs for better resource utilization.
```bash
#!/bin/bash
set -e
trap 'echo Cleaning up...; kill 0' EXIT
# Enable tracing
export DYN_LOGGING_JSONL=true
export OTEL_EXPORT_ENABLED=true
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:4317
# Run frontend (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
export OTEL_SERVICE_NAME=dynamo-frontend
python -m dynamo.frontend --router-mode kv &
# Run decode worker, make sure to wait for start up
export OTEL_SERVICE_NAME=dynamo-worker-decode
DYN_SYSTEM_PORT=8081 CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \
--enforce-eager \
--otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT" &
# Run prefill worker, make sure to wait for start up
export OTEL_SERVICE_NAME=dynamo-worker-prefill
DYN_SYSTEM_PORT=8082 \
VLLM_NIXL_SIDE_CHANNEL_PORT=20097 \
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \
--enforce-eager \
--otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT" \
--disaggregation-mode prefill \
--kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}' &
```
For disaggregated deployments, this separates prefill and decode onto different GPUs for better resource utilization.
### 3. Generate Traces
### 4. Generate Traces
Send requests to the frontend to generate traces (works for both aggregated and disaggregated deployments). The launch scripts print an example `curl` command on startup with the correct model name.
Send requests to the frontend to generate traces (works for both aggregated and disaggregated deployments). **Note the `x-request-id` header**, which allows you to easily search for and correlate this specific trace in Grafana:
**Tip:** Add an `x-request-id` header to easily search for a specific trace in Grafana:
```bash
curl -H 'Content-Type: application/json' \
-H 'x-request-id: test-trace-001' \
-d '{
"model": "Qwen/Qwen3-0.6B",
"model": "<MODEL>",
"max_completion_tokens": 100,
"messages": [
{"role": "user", "content": "What is the capital of France?"}
......@@ -126,7 +77,7 @@ curl -H 'Content-Type: application/json' \
http://localhost:8000/v1/chat/completions
```
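Since the model name depends on how the deployment was launched, it can help to build the request body from a variable. The following is a minimal sketch, assuming the request needs only the fields shown in the example above; `build_chat_payload` is a hypothetical helper, and it does not escape special characters in the model name.

```bash
#!/bin/bash
# Hypothetical helper: print a minimal chat-completions payload for a model.
# Assumes the model name contains no characters that need JSON escaping.
build_chat_payload() {
    local model="$1"
    printf '{"model": "%s", "max_completion_tokens": 100, "messages": [{"role": "user", "content": "What is the capital of France?"}]}\n' "$model"
}

# Example use against a running frontend (not executed here):
#   curl -H 'Content-Type: application/json' \
#        -H 'x-request-id: test-trace-001' \
#        -d "$(build_chat_payload "Qwen/Qwen3-0.6B")" \
#        http://localhost:8000/v1/chat/completions
```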
### 5. View Traces in Grafana Tempo
### 4. View Traces in Grafana Tempo
1. Open Grafana at `http://localhost:3000`
2. Login with username `dynamo` and password `dynamo`
......@@ -145,7 +96,7 @@ Below is an example of what a trace looks like in Grafana Tempo:
![Trace Example](../assets/img/trace.png)
### 6. Stop Services
### 5. Stop Services
When done, stop the observability stack. See [Observability Getting Started](README.md#getting-started-quickly) for Docker Compose commands.
......@@ -157,56 +108,17 @@ For Kubernetes deployments, ensure you have a Tempo instance deployed and access
### Modify DynamoGraphDeployment for Tracing
Add common tracing environment variables at the top level and service-specific names in each component in your `DynamoGraphDeployment` (e.g., `examples/backends/vllm/deploy/disagg.yaml`):
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: vllm-disagg
spec:
# Common environment variables for all services
env:
- name: DYN_LOGGING_JSONL
value: "true"
- name: OTEL_EXPORT_ENABLED
value: "true"
- name: OTEL_EXPORTER_OTLP_TRACES_ENDPOINT
value: "http://tempo.observability.svc.cluster.local:4317"
services:
Frontend:
# ... existing configuration ...
extraPodSpec:
mainContainer:
# ... existing configuration ...
env:
- name: OTEL_SERVICE_NAME
value: "dynamo-frontend"
VllmDecodeWorker:
# ... existing configuration ...
extraPodSpec:
mainContainer:
# ... existing configuration ...
env:
- name: OTEL_SERVICE_NAME
value: "dynamo-worker-decode"
VllmPrefillWorker:
# ... existing configuration ...
extraPodSpec:
mainContainer:
# ... existing configuration ...
env:
- name: OTEL_SERVICE_NAME
value: "dynamo-worker-prefill"
```
Tracing-enabled variants of the example deployments are provided:
- **Aggregated:** `examples/backends/vllm/deploy/agg_tracing.yaml`
- **Disaggregated:** `examples/backends/vllm/deploy/disagg_tracing.yaml`
These add the [Environment Variables](#environment-variables) to the base `agg.yaml` / `disagg.yaml` deployments. To override the Tempo endpoint, edit `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` in the YAML.
Apply the updated DynamoGraphDeployment:
Apply a tracing-enabled deployment:
```bash
kubectl apply -f examples/backends/vllm/deploy/disagg.yaml
kubectl apply -f examples/backends/vllm/deploy/disagg_tracing.yaml
```
Traces will now be exported to Tempo and can be viewed in Grafana.
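Before applying a customized deployment, a quick text check can confirm that the tracing variables are present in the YAML. This is a rough sketch only: `missing_tracing_vars` is a hypothetical helper that looks for `name: <VAR>` entries with `grep`, so it does not validate YAML structure or confirm that every service sets `OTEL_SERVICE_NAME`.

```bash
#!/bin/bash
# Hypothetical helper: print each tracing variable that has no
# "name: <VAR>" entry in the given deployment file.
missing_tracing_vars() {
    local file="$1" name
    for name in DYN_LOGGING_JSONL OTEL_EXPORT_ENABLED \
                OTEL_EXPORTER_OTLP_TRACES_ENDPOINT OTEL_SERVICE_NAME; do
        # -q: no output, -F: match the text literally rather than as a regex.
        grep -qF -- "name: ${name}" "$file" || echo "$name"
    done
}

# Example use (not executed here):
#   missing_tracing_vars examples/backends/vllm/deploy/disagg_tracing.yaml
```

Empty output means all four names were found somewhere in the file.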
......
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# Aggregated vLLM deployment with OpenTelemetry tracing enabled.
# Base deployment: agg.yaml
# See docs/observability/tracing.md for setup instructions.
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: vllm-agg-tracing
spec:
envs:
- name: DYN_LOGGING_JSONL
value: "true"
- name: OTEL_EXPORT_ENABLED
value: "true"
- name: OTEL_EXPORTER_OTLP_TRACES_ENDPOINT
value: "http://tempo.observability.svc.cluster.local:4317"
services:
Frontend:
envFromSecret: hf-token-secret
componentType: frontend
replicas: 1
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
env:
- name: OTEL_SERVICE_NAME
value: "dynamo-frontend"
VllmDecodeWorker:
envFromSecret: hf-token-secret
componentType: worker
replicas: 1
resources:
limits:
gpu: "1"
requests:
custom:
# Increase this value for larger models
ephemeral-storage: "2Gi"
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
workingDir: /workspace/examples/backends/vllm
command:
- python3
- -m
- dynamo.vllm
args:
- --model
- Qwen/Qwen3-0.6B
env:
- name: OTEL_SERVICE_NAME
value: "dynamo-worker-vllm"
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# Disaggregated vLLM deployment with OpenTelemetry tracing enabled.
# Base deployment: disagg.yaml
# See docs/observability/tracing.md for setup instructions.
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: vllm-disagg-tracing
spec:
envs:
- name: DYN_LOGGING_JSONL
value: "true"
- name: OTEL_EXPORT_ENABLED
value: "true"
- name: OTEL_EXPORTER_OTLP_TRACES_ENDPOINT
value: "http://tempo.observability.svc.cluster.local:4317"
services:
Frontend:
componentType: frontend
replicas: 1
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
env:
- name: OTEL_SERVICE_NAME
value: "dynamo-frontend"
VllmDecodeWorker:
envFromSecret: hf-token-secret
componentType: worker
subComponentType: decode
replicas: 1
resources:
limits:
gpu: "1"
requests:
custom:
# Increase this value for larger models
ephemeral-storage: "2Gi"
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
workingDir: /workspace/examples/backends/vllm
command:
- python3
- -m
- dynamo.vllm
args:
- --model
- Qwen/Qwen3-0.6B
- --disaggregation-mode
- decode
env:
- name: OTEL_SERVICE_NAME
value: "dynamo-worker-decode"
VllmPrefillWorker:
envFromSecret: hf-token-secret
componentType: worker
subComponentType: prefill
replicas: 1
resources:
limits:
gpu: "1"
requests:
custom:
# Increase this value for larger models
ephemeral-storage: "2Gi"
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
workingDir: /workspace/examples/backends/vllm
command:
- python3
- -m
- dynamo.vllm
args:
- --model
- Qwen/Qwen3-0.6B
- --disaggregation-mode
- prefill
- --kv-transfer-config
- '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
env:
- name: OTEL_SERVICE_NAME
value: "dynamo-worker-prefill"
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
set -e
trap 'echo Cleaning up...; kill 0' EXIT
# Default model
MODEL="Qwen/Qwen3-0.6B"
# Parse command line arguments
EXTRA_ARGS=()
while [[ $# -gt 0 ]]; do
case $1 in
--model)
MODEL="$2"
shift 2
;;
*)
EXTRA_ARGS+=("$1")
shift
;;
esac
done
# Enable tracing -- requires the observability stack (Prometheus, Grafana, Tempo).
# See docs/observability/README.md for setup instructions.
export DYN_LOGGING_JSONL=true
export OTEL_EXPORT_ENABLED=true
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="${OTEL_EXPORTER_OTLP_TRACES_ENDPOINT:-http://localhost:4317}"
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
echo "=========================================="
echo "Launching Aggregated Serving + Tracing (1 GPU)"
echo "=========================================="
echo "Model: $MODEL"
echo "Frontend: http://localhost:$HTTP_PORT"
echo "Tempo: $OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
echo "=========================================="
echo ""
echo "Example test command:"
echo ""
echo " curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
echo " -H 'Content-Type: application/json' \\"
echo " -H 'x-request-id: test-trace-001' \\"
echo " -d '{"
echo " \"model\": \"${MODEL}\","
echo " \"messages\": [{\"role\": \"user\", \"content\": \"Explain why Roger Federer is considered one of the greatest tennis players of all time\"}],"
echo " \"max_tokens\": 32"
echo " }'"
echo ""
echo "=========================================="
# run ingress
# dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
export OTEL_SERVICE_NAME=dynamo-frontend
python -m dynamo.frontend &
# run worker
# --enforce-eager is added for quick deployment; remove this flag for production use
export OTEL_SERVICE_NAME=dynamo-worker-vllm
DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT:-8081} \
python -m dynamo.vllm --model "$MODEL" --enforce-eager \
--otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT" \
"${EXTRA_ARGS[@]}"
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
set -e
trap 'echo Cleaning up...; kill 0' EXIT
# Common configuration
MODEL="Qwen/Qwen3-0.6B"
# Parse command line arguments
while [[ $# -gt 0 ]]; do
case $1 in
--model)
MODEL="$2"
shift 2
;;
*)
shift
;;
esac
done
# Enable tracing -- requires the observability stack (Prometheus, Grafana, Tempo).
# See docs/observability/README.md for setup instructions.
export DYN_LOGGING_JSONL=true
export OTEL_EXPORT_ENABLED=true
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="${OTEL_EXPORTER_OTLP_TRACES_ENDPOINT:-http://localhost:4317}"
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
echo "=========================================="
echo "Launching Disaggregated Serving + Tracing (2 GPUs)"
echo "=========================================="
echo "Model: $MODEL"
echo "Frontend: http://localhost:$HTTP_PORT"
echo "Tempo: $OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
echo "=========================================="
echo ""
echo "Example test command:"
echo ""
echo " curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
echo " -H 'Content-Type: application/json' \\"
echo " -H 'x-request-id: test-trace-001' \\"
echo " -d '{"
echo " \"model\": \"${MODEL}\","
echo " \"messages\": [{\"role\": \"user\", \"content\": \"Explain why Roger Federer is considered one of the greatest tennis players of all time\"}],"
echo " \"max_tokens\": 32"
echo " }'"
echo ""
echo "=========================================="
# run ingress
# dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
export OTEL_SERVICE_NAME=dynamo-frontend
python -m dynamo.frontend &
# --enforce-eager is added for quick deployment; remove this flag for production use
export OTEL_SERVICE_NAME=dynamo-worker-decode
DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT1:-8081} \
CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm \
--model "$MODEL" \
--enforce-eager \
--otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT" \
--disaggregation-mode decode \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' &
export OTEL_SERVICE_NAME=dynamo-worker-prefill
DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT2:-8082} \
VLLM_NIXL_SIDE_CHANNEL_PORT=20097 \
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
--model "$MODEL" \
--enforce-eager \
--otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT" \
--disaggregation-mode prefill \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
--kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}'
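Both launch scripts parse `--model` with the same loop; the aggregated script also collects unrecognized arguments in `EXTRA_ARGS` and forwards them to the worker, while the disaggregated script discards them. The sketch below wraps the collecting variant in a hypothetical `parse_args` function so it can be tried in isolation. The guard for a missing `--model` value is an addition for illustration and is not in the original scripts.

```bash
#!/bin/bash
# Sketch of the launch scripts' argument loop, wrapped in a function.
# parse_args is a hypothetical name; the scripts run this loop inline.
MODEL="Qwen/Qwen3-0.6B"
EXTRA_ARGS=()

parse_args() {
    while [[ $# -gt 0 ]]; do
        case $1 in
            --model)
                # Added guard: without a value, `shift 2` would fail.
                [[ $# -ge 2 ]] || { echo "--model requires a value" >&2; return 1; }
                MODEL="$2"
                shift 2
                ;;
            *)
                # Keep anything unrecognized so it can be passed to the worker.
                EXTRA_ARGS+=("$1")
                shift
                ;;
        esac
    done
}
```

After `parse_args "$@"`, the collected arguments can be expanded as `"${EXTRA_ARGS[@]}"`, which preserves each argument as a separate word.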