doc: update metrics doc regarding frontend staged gauges (#8459)

5135c321 · Jie Hao · GitHub · 69823d72 · 5135c321 · 5135c321
Unverified Commit 5135c321 authored Apr 22, 2026 by Jie Hao Committed by GitHub Apr 22, 2026
Show whitespace changes
Inline Side-by-side

Showing with 66 additions and 15 deletions

docs/kubernetes/autoscaling.md docs/kubernetes/autoscaling.md +22 -12

docs/observability/metrics.md docs/observability/metrics.md +44 -3

No files found.
--- a/docs/kubernetes/autoscaling.md
+++ b/docs/kubernetes/autoscaling.md
@@ -228,11 +228,15 @@ Dynamo exports several metrics useful for autoscaling. These are available at th
 | Metric | Type | Description | Good for scaling |
 |--------|------|-------------|------------------|
-| `dynamo_frontend_queued_requests` | Gauge | Requests waiting in HTTP queue | ✅ Workers |
+| `dynamo_frontend_active_requests` | Gauge | Total concurrent requests from HTTP entry to response complete | ✅ All services |
-| `dynamo_frontend_inflight_requests` | Gauge | Concurrent requests to engine | ✅ All services |
+| `dynamo_frontend_stage_requests{stage,phase}` | Gauge | Requests currently in a given frontend pipeline stage (`preprocess`, `route`, `dispatch`) | ✅ Workers — use `sum(...)` for queue-depth behavior, or `stage="dispatch"` for backend-prefill saturation |
 | `dynamo_frontend_time_to_first_token_seconds` | Histogram | TTFT latency | ✅ Workers |
 | `dynamo_frontend_inter_token_latency_seconds` | Histogram | ITL latency | ✅ Decode |
 | `dynamo_frontend_request_duration_seconds` | Histogram | Total request duration | ⚠️ General |
+| `dynamo_frontend_inflight_requests` | Gauge | Concurrent requests to engine | ⚠️ **Deprecated** — use `dynamo_frontend_active_requests` |
+| `dynamo_frontend_queued_requests` | Gauge | Requests waiting in HTTP queue | ⚠️ **Deprecated** — use `sum(dynamo_frontend_stage_requests)` across `preprocess` + `route` + `dispatch` |
+For the full definition of the `stage` and `phase` labels and derived-signal formulas, see [Stage and phase labels](../observability/metrics.md#stage-and-phase-labels) in the Metrics Reference.
 #### Metric Labels
@@ -344,16 +348,18 @@ spec:
 #### Example: Scale Based on Queue Depth
+"Queue depth" here means the number of requests that have entered the frontend but haven't yet received a first token — i.e. the sum of `dynamo_frontend_stage_requests` across the `preprocess`, `route`, and `dispatch` stages. This replaces the deprecated `dynamo_frontend_queued_requests` gauge.
 Add this rule to your `prometheus-adapter-values.yaml` (alongside the TTFT rule):
 ```yaml
 # Add to rules.external in prometheus-adapter-values.yaml
- seriesQuery: 'dynamo_frontend_queued_requests{namespace!=""}'
+- seriesQuery: 'dynamo_frontend_stage_requests{namespace!="",stage=~"preprocess|route|dispatch"}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
  name:
-    as: "dynamo_queued_requests"
+    as: "dynamo_frontend_pending_requests"
  metricsQuery: |
    sum(<<.Series>>{<<.LabelMatchers>>}) by (namespace, dynamo_namespace)
 ```
@@ -377,7 +383,7 @@ spec:
  - type: External
    external:
      metric:
-        name: dynamo_queued_requests
+        name: dynamo_frontend_pending_requests
        selector:
          matchLabels:
            dynamo_namespace: "default-sglang-agg"
@@ -500,9 +506,9 @@ spec:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090
-      metricName: dynamo_queued_requests
+      metricName: dynamo_frontend_pending_requests
      query: |
-        sum(dynamo_frontend_queued_requests{dynamo_namespace="default-sglang-agg"})
+        sum(dynamo_frontend_stage_requests{dynamo_namespace="default-sglang-agg",stage=~"preprocess|route|dispatch"})
      threshold: "10"    # Scale up when queue > 10 requests
 ```
@@ -651,8 +657,8 @@ Avoid configuring multiple autoscalers for the same service:
 | Service Type | Recommended Metrics | Dynamo Metric |
 |--------------|---------------------|---------------|
 | Frontend | CPU utilization, request rate | `dynamo_frontend_requests_total` |
-| Prefill | Queue depth, TTFT | `dynamo_frontend_queued_requests`, `dynamo_frontend_time_to_first_token_seconds` |
+| Prefill | Dispatch-stage depth (backend prefill saturation), TTFT | `dynamo_frontend_stage_requests{stage="dispatch"}`, `dynamo_frontend_time_to_first_token_seconds` |
-| Decode | ITL | `dynamo_frontend_inter_token_latency_seconds` |
+| Decode | ITL, active concurrency | `dynamo_frontend_inter_token_latency_seconds`, `dynamo_frontend_active_requests` |
 ### 3. Configure Stabilization Windows
@@ -712,9 +718,13 @@ If HPA/KEDA shows `<unknown>` for metrics:
 kubectl port-forward -n default svc/sglang-agg-frontend 8000:8000
 curl http://localhost:8000/metrics | grep dynamo_frontend
-# Example output:
+# Example output (note: stage_requests has no `model` label — it's per frontend pod):
-# dynamo_frontend_queued_requests{model="Qwen/Qwen3-0.6B"} 2
+# dynamo_frontend_active_requests{model="Qwen/Qwen3-0.6B"} 5
-# dynamo_frontend_inflight_requests{model="Qwen/Qwen3-0.6B"} 5
+# dynamo_frontend_stage_requests{stage="preprocess",phase=""} 0
+# dynamo_frontend_stage_requests{stage="route",phase="aggregated"} 0
+# dynamo_frontend_stage_requests{stage="dispatch",phase="aggregated"} 2
+# dynamo_frontend_queued_requests{model="Qwen/Qwen3-0.6B"} 2        # deprecated
+# dynamo_frontend_inflight_requests{model="Qwen/Qwen3-0.6B"} 5      # deprecated
 # Verify Prometheus is scraping the metrics
 kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090

--- a/docs/observability/metrics.md
+++ b/docs/observability/metrics.md
@@ -132,8 +132,10 @@ Some components expose additional metrics specific to their functionality:
 The Dynamo HTTP Frontend (`python -m dynamo.frontend`) exposes `dynamo_frontend_*` metrics on port 8000 by default (configurable via `--http-port` or `DYN_HTTP_PORT`) at the `/metrics` endpoint. Most metrics include `model` labels containing the model name:
- `dynamo_frontend_inflight_requests`: Inflight requests (gauge)
+- `dynamo_frontend_active_requests`: Number of requests currently being handled by the frontend, from HTTP handler entry until the response stream completes (gauge). This is the top-level in-flight count with no stage breakdown.
- `dynamo_frontend_queued_requests`: Number of requests in HTTP processing queue (gauge)
+- `dynamo_frontend_stage_requests`: Number of requests currently in a given frontend pipeline stage (gauge, labels: `stage`, `phase`). See [Stage and phase labels](#stage-and-phase-labels) below.
+- `dynamo_frontend_inflight_requests`: Inflight requests (gauge). **Deprecated** — kept for backward compatibility; prefer `dynamo_frontend_active_requests`, which has identical semantics with a clearer name.
+- `dynamo_frontend_queued_requests`: Number of requests in HTTP processing queue (gauge). **Deprecated** — kept for backward compatibility; the "waiting for first token" window is now the sum of `dynamo_frontend_stage_requests` across the `preprocess`, `route`, and `dispatch` stages.
 - `dynamo_frontend_disconnected_clients`: Number of disconnected clients (gauge)
 - `dynamo_frontend_input_sequence_tokens`: Input sequence length (histogram)
 - `dynamo_frontend_cached_tokens`: Number of cached tokens (prefix cache hits) per request (histogram)
@@ -150,7 +152,44 @@ The Dynamo HTTP Frontend (`python -m dynamo.frontend`) exposes `dynamo_frontend_
 curl http://localhost:8000/metrics
 ```
-**Note**: The `dynamo_frontend_inflight_requests` metric tracks requests from HTTP handler start until the complete response is finished, while `dynamo_frontend_queued_requests` tracks requests from HTTP handler start until first token generation begins (including prefill time). HTTP queue time is a subset of inflight time.
+#### Stage and phase labels
+`dynamo_frontend_stage_requests` decomposes the lifetime of an active frontend request into three sequential pipeline stages. A request is counted in exactly one stage at a time (via an RAII guard that increments on stage entry and decrements on stage exit), and is counted in `dynamo_frontend_active_requests` for its entire lifetime. Between stages — and after `dispatch` exits while the backend is streaming tokens — the request is still in `active_requests` but in no `stage_requests` bucket.
+**`stage` label values:**
+| Stage | What it covers | Enters when | Exits when |
+|-------|----------------|-------------|------------|
+| `preprocess` | Tokenization and chat-template application | The frontend enters `preprocess_request` | Preprocessing returns |
+| `route` | Worker selection (including parking in the KV-router queue while waiting for a worker) | The router's `generate()` is called | A worker is selected or the request is queued for one |
+| `dispatch` | Serialization, transport to the chosen worker, and waiting for the backend's first response (includes backend prefill time) | `generate()` is called in `AddressedPushRouter` | The first response is received from the backend |
+**`phase` label values:**
+| Phase | Meaning |
+|-------|---------|
+| `prefill` | The request is being handled by a prefill worker in disaggregated serving |
+| `decode` | The request is being handled by a decode worker in disaggregated serving |
+| `aggregated` | Aggregated (non-disaggregated) serving — a single worker handles both prefill and decode |
+| `""` (empty) | The stage does not distinguish phases (used by `preprocess`) |
+**Derived signals operators commonly want.** These are cluster-wide totals across all frontend pods. `stage_requests` has no `model` label, so you cannot split these by model; add `by (pod)` or `by (instance)` to any `sum(...)` below if you need per-pod visibility. The stage filter `stage=~"preprocess|route|dispatch"` is used explicitly to keep the "pre-first-token" semantic stable if additional stages (e.g. `postprocess`) are added in the future.
+- **Requests waiting for a worker to start generating (the old "queued" semantic):** `sum(dynamo_frontend_stage_requests{stage=~"preprocess|route|dispatch"})` — i.e. still in `preprocess`, `route`, or `dispatch`.
+- **Requests currently being processed by a backend worker:** use the worker-side gauge `sum(dynamo_component_inflight_requests{dynamo_component="backend",dynamo_endpoint="generate"})` — this is the authoritative count, available whenever `DYN_SYSTEM_PORT` is set on workers (see [Backend Component Metrics](#backend-component-metrics)).
+    - *Frontend-perspective variant* (useful if worker metrics aren't being scraped, or when sizing frontend pods rather than workers): `sum(dynamo_frontend_active_requests) - sum(dynamo_frontend_stage_requests{stage=~"preprocess|route|dispatch"})`. This differs from the worker gauge because its window starts when the first token arrives at the frontend and extends through streaming to the client, so it includes transit and client-buffering time.
+- **Router saturation:** `sum(dynamo_frontend_stage_requests{stage="route"})` spiking indicates workers can't be selected fast enough (e.g. all backends busy, KV-router queue full).
+- **Backend prefill saturation:** `sum(dynamo_frontend_stage_requests{stage="dispatch"})` spiking indicates the backend is slow to produce first tokens.
+#### Deprecated frontend gauges
+The following gauges are still emitted but will be removed in a future release. They were superseded by the gauges above as part of the frontend-metrics rework (PR #8162). Dashboards and alerts should migrate off them.
+| Deprecated metric | Replacement |
+|-------------------|-------------|
+| `dynamo_frontend_inflight_requests` | `dynamo_frontend_active_requests` (same semantics, clearer name) |
+| `dynamo_frontend_queued_requests` | `sum(dynamo_frontend_stage_requests{stage=~"preprocess\|route\|dispatch"})` |
 #### Model Configuration Metrics
@@ -172,6 +211,8 @@ These metrics come from the Model Deployment Card information provided by worker
 ### Request Processing Flow
+> **Deprecated framing.** The two-metric model below (inflight vs. HTTP queue) describes the legacy `dynamo_frontend_inflight_requests` and `dynamo_frontend_queued_requests` gauges and is kept only to help operators reading existing dashboards. New work should use `dynamo_frontend_active_requests` and the per-stage `dynamo_frontend_stage_requests` gauges described under [Stage and phase labels](#stage-and-phase-labels).
 This section explains the distinction between two key metrics used to track request processing:
 1. **Inflight**: Tracks requests from HTTP handler start until the complete response is finished