docs: expand SGLang observability guide with tracing and dashboards (#6556)

Unverified Commit 29c46fb7 authored Feb 24, 2026 by ishandhanani Committed by GitHub Feb 24, 2026
6 changed files
--- a/docs/assets/img/sglang-trace.png
+++ b/docs/assets/img/sglang-trace.png
--- a/docs/pages/backends/sglang/README.md
+++ b/docs/pages/backends/sglang/README.md
@@ -87,7 +87,7 @@ docker run \
 | [**Diffusion Models**](sglang-diffusion.md) | ✅ | LLM diffusion, image, and video generation |
 | [**Request Cancellation**](../../fault-tolerance/request-cancellation.md) | ✅ | Aggregated full; disaggregated decode-only |
 | [**Graceful Shutdown**](../../fault-tolerance/graceful-shutdown.md) | ✅ | Discovery unregister + grace period |
-| [**Prometheus Metrics**](sglang-prometheus.md) | ✅ | SGLang + Dynamo metrics on `/metrics` |
+| [**Observability**](sglang-observability.md) | ✅ | Metrics, tracing, and Grafana dashboards |
 | [**KVBM**](../../components/kvbm/README.md) | ❌ | Planned |
 ## Quick Start
@@ -130,5 +130,5 @@ You can deploy SGLang with Dynamo on Kubernetes using a `DynamoGraphDeployment`.
 - **[Examples](sglang-examples.md)**: All deployment patterns with launch scripts
 - **[Disaggregation](sglang-disaggregation.md)**: P/D architecture and KV transfer details
 - **[Diffusion](sglang-diffusion.md)**: LLM, image, and video diffusion models
-- **[Prometheus Metrics](sglang-prometheus.md)**: Metrics integration and monitoring
+- **[Observability](sglang-observability.md)**: Metrics, tracing, and Grafana dashboards
 - **[Deploying SGLang with Dynamo on Kubernetes](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy)**: Kubernetes deployment guide
--- a/docs/pages/backends/sglang/sglang-prometheus.md
+++ b/docs/pages/backends/sglang/sglang-prometheus.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: Prometheus
+title: Observability
 ---
-# SGLang Prometheus Metrics
+# SGLang Observability
-## Overview
+This guide covers metrics, tracing, and visualization for SGLang deployments running through Dynamo.
+## Prometheus Metrics
 When running SGLang through Dynamo, SGLang engine metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both SGLang engine metrics (prefixed with `sglang:`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.
@@ -16,21 +18,21 @@ When running SGLang through Dynamo, SGLang engine metrics are automatically pass
 **For visualization setup instructions**, see the [Prometheus and Grafana Setup Guide](../../observability/prometheus-grafana.md).
-## Environment Variables
+### Environment Variables
 | Variable | Description | Default | Example |
 |----------|-------------|---------|---------|
 | `DYN_SYSTEM_PORT` | System metrics/health port | `-1` (disabled) | `8081` |
-## Getting Started Quickly
+### Getting Started Quickly
 This is a single-machine example.
-### Start Observability Stack
+#### Start Observability Stack
 For visualizing metrics with Prometheus and Grafana, start the observability stack. See [Observability Getting Started](../../observability/README.md#getting-started-quickly) for instructions.
-### Launch Dynamo Components
+#### Launch Dynamo Components
 Launch a frontend and SGLang backend to test metrics:
@@ -58,7 +60,7 @@ http://localhost:8000/v1/chat/completions
 curl -s localhost:8081/metrics | grep "^sglang:"
 ```
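The `grep "^sglang:"` check above can also be scripted. The following is a minimal sketch, assuming you have already fetched the `/metrics` body as text. `metric_names` is a hypothetical helper, not part of Dynamo or SGLang, and the sample lines reuse metric names mentioned elsewhere in this guide with made-up values.

```python
def metric_names(exposition_text: str, prefix: str = "sglang:") -> list[str]:
    """Return the sorted, de-duplicated metric names that start with `prefix`."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or the first space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefix):
            names.add(name)
    return sorted(names)


# Illustrative input only; the values are invented for the example.
SAMPLE = """\
# HELP sglang:cache_hit_rate illustrative help text
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{model_name="Qwen/Qwen3-0.6B"} 0.5
sglang:generation_tokens_total{model_name="Qwen/Qwen3-0.6B"} 128
dynamo_component_gpu_cache_usage_percent 0.25
"""

print(metric_names(SAMPLE))
```

Passing `prefix="dynamo_"` would list the Dynamo runtime metrics served from the same endpoint instead.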
-## Exposed Metrics
+### Exposed Metrics
 SGLang exposes metrics in Prometheus Exposition Format text at the `/metrics` HTTP endpoint. All SGLang engine metrics use the `sglang:` prefix and include labels (e.g., `model_name`, `engine_type`, `tp_rank`, `pp_rank`) to identify the source.
@@ -91,7 +93,7 @@ SGLang provides metrics in the following categories (all prefixed with `sglang:`
 **Note:** Specific metrics are subject to change between SGLang versions. Always refer to the [official documentation](https://docs.sglang.io/references/production_metrics.html) or inspect the `/metrics` endpoint for your SGLang version.
-## Available Metrics
+### Available Metrics
 The official SGLang documentation includes complete metric definitions with:
 - HELP and TYPE descriptions
@@ -102,21 +104,283 @@ The official SGLang documentation includes complete metric definitions with:
 For the complete and authoritative list of all SGLang metrics, see the [official SGLang Production Metrics documentation](https://docs.sglang.io/references/production_metrics.html).
-## Implementation Details
+### Implementation Details
 - SGLang uses multiprocess metrics collection via `prometheus_client.multiprocess.MultiProcessCollector`
 - Metrics are filtered by the `sglang:` prefix before being exposed
 - The integration uses Dynamo's `register_engine_metrics_callback()` function
 - Metrics appear after SGLang engine initialization completes
+---
+## Distributed Tracing
+Dynamo propagates [W3C Trace Context](https://www.w3.org/TR/trace-context/) headers through the SGLang request pipeline, allowing you to correlate traces across the frontend, router, and individual SGLang workers in a disaggregated deployment.
+### Prerequisites
+SGLang's engine-internal tracing requires the `opentelemetry` packages. These are declared as SGLang's `[tracing]` extra. Install them into your Dynamo environment:
+```bash
+uv pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp opentelemetry-exporter-otlp-proto-grpc
+```
+Without these packages, Dynamo-side spans (frontend, handler) will still work, but SGLang's internal engine spans will not be emitted and you will see a warning: `"Tracing is disabled because the packages cannot be imported."`
+### How Trace Propagation Works
+```
+Frontend (Rust)
+  creates span, embeds trace_id + span_id in Context
+    |
+    v
+Dynamo RPC (NATS transport)
+  Context serialized with trace_id, span_id
+    |
+    v
+SGLang Handler (Python)
+  handler_base.py:_get_trace_header(context)
+  builds W3C traceparent: "00-{trace_id}-{span_id}-01"
+    |
+    v
+sgl.Engine.async_generate(
+    ...,
+    rid=trace_id,                        # request ID = trace ID
+    external_trace_header=traceparent    # W3C header for SGLang internal spans
+)
+    |
+    v
+SGLang Engine (internal spans attached to same trace)
+```
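The `traceparent` construction in the diagram above can be sketched as follows. This is an illustration of the W3C Trace Context format only, not the code in `otel_tracing.py`. The function name `build_traceparent` and its validation rules are assumptions based on the W3C specification (32 lowercase hex characters for the trace ID, 16 for the span ID, neither all zeros).

```python
import re

_TRACE_ID = re.compile(r"[0-9a-f]{32}")
_SPAN_ID = re.compile(r"[0-9a-f]{16}")


def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a version-00 W3C traceparent header value (hypothetical helper)."""
    trace_id = trace_id.lower()
    span_id = span_id.lower()
    # The spec forbids all-zero IDs and requires fixed-width lowercase hex.
    if not _TRACE_ID.fullmatch(trace_id) or trace_id == "0" * 32:
        raise ValueError(f"invalid trace_id: {trace_id!r}")
    if not _SPAN_ID.fullmatch(span_id) or span_id == "0" * 16:
        raise ValueError(f"invalid span_id: {span_id!r}")
    flags = "01" if sampled else "00"  # 01 = sampled, matching the diagram
    return f"00-{trace_id}-{span_id}-{flags}"


# Example IDs taken from the W3C Trace Context specification.
print(build_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))
```

Validating the IDs before formatting means a malformed context fails loudly in the handler instead of yielding a header that downstream tracers silently drop.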
+Key implementation files:
+- `components/src/dynamo/common/utils/otel_tracing.py` - W3C `traceparent` header builder
+- `components/src/dynamo/sglang/request_handlers/handler_base.py:71-84` - Extracts trace context from Dynamo `Context` object
+- `components/src/dynamo/sglang/request_handlers/llm/decode_handler.py` - Passes `external_trace_header` and `rid=trace_id` to `engine.async_generate()`
+### Environment Variables
+| Variable | Description | Default | Example |
+|----------|-------------|---------|---------|
+| `DYN_LOGGING_JSONL` | Enable JSONL logging (required for tracing) | `false` | `true` |
+| `OTEL_EXPORT_ENABLED` | Enable OTLP trace export | `false` | `true` |
+| `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` | OTLP gRPC endpoint for Tempo | `http://localhost:4317` | `http://tempo:4317` |
+| `OTEL_SERVICE_NAME` | Service name shown in Grafana Tempo | `dynamo` | `dynamo-worker-decode` |
+### SGLang-Specific Flags
+| Flag | Description |
+|------|-------------|
+| `--enable-trace` | Enable W3C trace header propagation into SGLang engine |
+| `--otlp-traces-endpoint` | OTLP gRPC endpoint for SGLang's internal trace export (bare `host:port` format, e.g. `localhost:4317`) |
+Both flags are required for end-to-end tracing through the SGLang engine. Without `--enable-trace`, the Dynamo handler still creates spans, but SGLang's internal engine spans will not be linked.
+### Launch with Tracing
+The disaggregated launch script supports `--enable-otel` to enable tracing across all components:
+```bash
+# Start observability stack first
+docker compose -f deploy/docker-compose.yml up -d
+docker compose -f deploy/docker-observability.yml up -d
+# Launch SGLang disaggregated with tracing
+cd examples/backends/sglang/launch
+./disagg.sh --enable-otel
+```
+Or manually for an aggregated deployment:
+```bash
+export DYN_LOGGING_JSONL=true
+export OTEL_EXPORT_ENABLED=true
+export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:4317
+# Frontend
+OTEL_SERVICE_NAME=dynamo-frontend python -m dynamo.frontend &
+# SGLang worker with tracing
+OTEL_SERVICE_NAME=dynamo-worker-sglang \
+DYN_SYSTEM_PORT=8081 \
+python -m dynamo.sglang \
+  --model Qwen/Qwen3-0.6B \
+  --enable-metrics \
+  --enable-trace \
+  --otlp-traces-endpoint localhost:4317
+```
+### What You'll See in Traces
+With tracing enabled, each inference request produces a single end-to-end trace spanning the full request lifecycle:
+- **Frontend `http-request` span** - Root span from the HTTP service, includes method/uri/trace_id
+- **KV Router spans** - `kv_router.route_request`, `kv_router.select_worker`, `kv_router.compute_block_hashes`, `kv_router.find_matches`, `kv_router.compute_seq_hashes`, `kv_router.schedule`
+- **Worker `handle_payload` span** - The Dynamo RPC handler on the worker side, with component/endpoint/namespace labels
+- **SGLang engine spans** - `Req <id>`, `Scheduler`, `Tokenizer`, `request_process`, `prefill_forward`, `decode_loop`, `Bootstrap Room` (for disagg)
+- **Semantic conventions** - `gen_ai.usage.prompt_tokens`, `gen_ai.usage.completion_tokens`, `gen_ai.latency.time_to_first_token`, etc.
+Example trace tree for a KV-routed request:
+```
+dynamo-frontend: http-request (root)
+  dynamo-frontend: kv_router.route_request
+    dynamo-frontend: kv_router.select_worker
+      kv_router.compute_block_hashes
+      kv_router.find_matches
+      kv_router.compute_seq_hashes
+      kv_router.schedule
+    dynamo-worker-1: handle_payload
+      sglang: Bootstrap Room 0x0
+        sglang: Req <trace-id-prefix>
+          sglang: Scheduler [TP 0]
+            request_process
+            prefill_forward
+            decode_loop (repeated per token)
+          sglang: Tokenizer
+            tokenize
+            dispatch
+```
+![End-to-end trace in Grafana Tempo showing frontend, KV router, worker, and SGLang engine spans](../../../assets/img/sglang-trace.png)
+### Viewing Traces
+1. Open Grafana at `http://localhost:3000` (username: `dynamo`, password: `dynamo`)
+2. Navigate to **Explore** (compass icon)
+3. Select **Tempo** as the data source
+4. Use the **Search** tab:
+   - Filter by **Service Name** (e.g., `dynamo-frontend`, `dynamo-worker-1`, `sglang`)
+   - Filter by **Span Name** (e.g., `http-request`, `handle_payload`, `Req *`, `decode_loop`)
+   - Filter by **Tags** (e.g., `rid=<trace-id>`, `gen_ai.response.model=Qwen/Qwen3-0.6B`)
+5. Click a trace to view the flame graph spanning frontend -> router -> worker -> engine
+Send a request with `x-request-id` for easy lookup:
+```bash
+curl -H 'Content-Type: application/json' \
+  -H 'x-request-id: my-trace-001' \
+  -d '{"model": "Qwen/Qwen3-0.6B", "max_completion_tokens": 50,
+       "messages": [{"role": "user", "content": "Hello"}]}' \
+  http://localhost:8000/v1/chat/completions
+```
+For more details on the Tempo/Grafana tracing infrastructure, see the [Dynamo Tracing Guide](../../observability/tracing.md).
+---
+## SGLang Grafana Dashboard
+Dynamo ships a pre-provisioned Grafana dashboard for SGLang at `deploy/observability/grafana_dashboards/sglang.json`. It is automatically loaded when the observability stack starts.
+### Dashboard Panels
+The dashboard is organized into five sections:
+| Section | Panels | What to Watch |
+|---------|--------|---------------|
+| **Request Latency** | E2E Request Latency, Time-To-First-Token, Inter-Token Latency | Tail latency regressions, TTFT spikes during prefill pressure |
+| **Throughput & Queue** | Token Generation Throughput (tok/s), Running & Queued Requests, Request Rate | Throughput saturation, queue depth growth |
+| **Cache & PIN** | Cache Hit Rate, Active PIN Count, Retractions | KV cache reuse efficiency, PIN pressure from disagg routing |
+| **Memory Pressure** | GPU KV Cache Usage %, Host (CPU) KV Cache Usage %, Eviction & Load-back Rate | OOM risk, HiCache offload activity |
+| **HiCache Latency** | Eviction P99 Latency, Load-back P99 Latency | PCIe/NVLink bottlenecks in KV offload path |
+### Accessing the Dashboard
+1. Open Grafana at `http://localhost:3000`
+2. Login with `dynamo` / `dynamo`
+3. Click **Dashboards** in the left sidebar
+4. Select **SGLang Engine**
+Other available dashboards:
+- **Dynamo Dashboard** (`dynamo.json`) - Frontend and component metrics
+- **DCGM Metrics** (`dcgm-metrics.json`) - GPU utilization, memory, power
+- **KVBM** (`kvbm.json`) - KV block manager metrics
+- **Disagg Dashboard** (`disagg-dashboard.json`) - Disaggregated serving metrics
+---
+## Exposing on a Remote VM
+When developing on a remote VM (cloud instance, bare metal, etc.), the observability ports are only bound to `localhost` inside the VM. You have two options to access them.
+### Option 1: SSH Port Forwarding (Recommended)
+Forward the relevant ports through your SSH connection. No firewall changes are needed, and traffic is encrypted.
+```bash
+# Forward Grafana (3000), Prometheus (9090), and Tempo (3200)
+ssh -L 3000:localhost:3000 \
+    -L 9090:localhost:9090 \
+    -L 3200:localhost:3200 \
+    user@your-vm-ip
+```
+Then open `http://localhost:3000` in your local browser.
+For a long-running tunnel in the background:
+```bash
+ssh -fN \
+    -L 3000:localhost:3000 \
+    -L 9090:localhost:9090 \
+    -L 3200:localhost:3200 \
+    user@your-vm-ip
+```
+### Option 2: Firewall Rules
+Open the ports directly. Only use this on trusted networks.
+```bash
+# Ubuntu/Debian
+sudo ufw allow 3000/tcp   # Grafana
+sudo ufw allow 9090/tcp   # Prometheus
+# Or for cloud VMs, add inbound rules in your security group for ports 3000, 9090
+```
+Then access `http://<vm-ip>:3000` directly.
+### Headless / Agent Access
+For CI pipelines, AI coding agents, or headless workflows where no browser is available, you can query Grafana and Prometheus directly via their APIs:
+```bash
+# Query Prometheus for SGLang token throughput (-g stops curl from treating [1m] as a glob range)
+curl -sg 'http://localhost:9090/api/v1/query?query=rate(sglang:generation_tokens_total[1m])' | python3 -m json.tool
+# Query Prometheus for GPU KV cache usage
+curl -s 'http://localhost:9090/api/v1/query?query=dynamo_component_gpu_cache_usage_percent' | python3 -m json.tool
+# List available Grafana dashboards
+curl -s -u dynamo:dynamo http://localhost:3000/api/search | python3 -m json.tool
+# Search for the SGLang dashboard by title
+curl -s -u dynamo:dynamo 'http://localhost:3000/api/search?query=SGLang' | python3 -m json.tool
+# Fetch a specific dashboard by UID
+curl -s -u dynamo:dynamo http://localhost:3000/api/dashboards/uid/<dashboard-uid> | python3 -m json.tool
+# Snapshot the last hour of metrics via a Prometheus range query (uses GNU date syntax)
+START=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)
+END=$(date -u +%Y-%m-%dT%H:%M:%SZ)
+curl -s "http://localhost:9090/api/v1/query_range?query=sglang:cache_hit_rate&start=${START}&end=${END}&step=15s"
+```
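For pipelines that prefer Python over `curl`, the instant query above might be wrapped as in the sketch below. It uses only the standard library and the documented Prometheus HTTP API response shape. The helper names (`build_query_url`, `extract_samples`, `query_prometheus`) are hypothetical, and the base URL assumes the local port used in the examples above.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: local stack from this guide


def build_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Return an instant-query URL with the PromQL expression percent-encoded."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def extract_samples(payload: dict) -> list[tuple[dict, float]]:
    """Flatten an instant-vector response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error')}")
    # Each vector sample looks like {"metric": {...}, "value": [timestamp, "string"]}.
    return [(s["metric"], float(s["value"][1])) for s in payload["data"]["result"]]


def query_prometheus(promql: str, timeout: float = 5.0) -> list[tuple[dict, float]]:
    """Run an instant query against Prometheus and return its samples."""
    with urllib.request.urlopen(build_query_url(promql), timeout=timeout) as resp:
        return extract_samples(json.load(resp))
```

With the observability stack running, `query_prometheus("rate(sglang:generation_tokens_total[1m])")` would return one `(labels, value)` pair per series. Encoding the expression with `urlencode` avoids having to escape the brackets and parentheses in PromQL by hand.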
+This is useful for automated benchmarking pipelines where you want to capture metrics programmatically alongside performance results.
+---
 ## Related Documentation
 ### SGLang Metrics
 - [Official SGLang Production Metrics](https://docs.sglang.io/references/production_metrics.html)
 - [SGLang GitHub - Metrics Collector](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/metrics/collector.py)
-### Dynamo Metrics
+### Dynamo Observability
 - [Dynamo Metrics Guide](../../observability/metrics.md) - Complete documentation on Dynamo runtime metrics
+- [Dynamo Tracing Guide](../../observability/tracing.md) - Distributed tracing with OpenTelemetry and Tempo
 - [Prometheus and Grafana Setup](../../observability/prometheus-grafana.md) - Visualization setup instructions
 - Dynamo runtime metrics (prefixed with `dynamo_*`) are available at the same `/metrics` endpoint alongside SGLang metrics
  - Implementation: `lib/runtime/src/metrics.rs` (Rust runtime metrics)

--- a/docs/pages/backends/sglang/sglang-reference-guide.md
+++ b/docs/pages/backends/sglang/sglang-reference-guide.md
@@ -115,7 +115,7 @@ DYN_SYSTEM_PORT=8081 python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --enab
 Both SGLang engine metrics (`sglang:*` prefix) and Dynamo runtime metrics (`dynamo_*` prefix) are served from the same endpoint.
-For metric details, see [SGLang Prometheus Metrics](sglang-prometheus.md). For visualization setup, see [Prometheus + Grafana](../../observability/prometheus-grafana.md).
+For metric details, see [SGLang Observability](sglang-observability.md). For visualization setup, see [Prometheus + Grafana](../../observability/prometheus-grafana.md).
 ### KV Events

--- a/docs/pages/observability/metrics.md
+++ b/docs/pages/observability/metrics.md
@@ -86,7 +86,7 @@ Dynamo exposes several categories of metrics:
 - **Frontend Metrics** (`dynamo_frontend_*`) - Request handling, token processing, and latency measurements
 - **Component Metrics** (`dynamo_component_*`) - Request counts, processing times, byte transfers, and system uptime
 - **Specialized Component Metrics** (e.g., `dynamo_preprocessor_*`) - Component-specific metrics
-- **Engine Metrics** (Pass-through) - Backend engines expose their own metrics: [vLLM](../backends/vllm/prometheus.md) (`vllm:*`), [SGLang](../backends/sglang/sglang-prometheus.md) (`sglang:*`), [TensorRT-LLM](../backends/trtllm/prometheus.md) (`trtllm_*`)
+- **Engine Metrics** (Pass-through) - Backend engines expose their own metrics: [vLLM](../backends/vllm/prometheus.md) (`vllm:*`), [SGLang](../backends/sglang/sglang-observability.md) (`sglang:*`), [TensorRT-LLM](../backends/trtllm/prometheus.md) (`trtllm_*`)
 ## Runtime Hierarchy

--- a/docs/versions/dev.yml
+++ b/docs/versions/dev.yml
@@ -148,8 +148,8 @@ navigation:
                path: ../pages/backends/sglang/sglang-disaggregation.md
              - page: Diffusion
                path: ../pages/backends/sglang/sglang-diffusion.md
-              - page: Prometheus
+              - page: Observability
-                path: ../pages/backends/sglang/sglang-prometheus.md
+                path: ../pages/backends/sglang/sglang-observability.md
          - page: TensorRT-LLM
            path: ../pages/backends/trtllm/README.md
      - section: Frontend