# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: SGLang for Agentic Workloads
subtitle: Priority scheduling and KV cache eviction policies for multi-turn agentic serving
subtitle: Priority scheduling and session control for multi-turn agentic serving
---
# SGLang for Agentic Workloads
This guide covers SGLang-specific configuration for agentic serving with Dynamo. It explains which SGLang engine flags to enable and how Dynamo's [agent hints](../../components/frontend/nvext.md#agent-hints) map to SGLang behavior.
This guide covers SGLang-specific configuration for agentic serving with Dynamo. It explains which SGLang engine flags to enable, how Dynamo's [agent hints](../../components/frontend/nvext.md#agent-hints) map to SGLang behavior, and how to use session control to manage KV cache for multi-turn agent conversations.
## Overview
...
@@ -65,7 +65,7 @@ When both `--radix-eviction-policy priority` and `--enable-hierarchical-cache` a
| Event | Behavior |
|-------|----------|
| **GPU full** | Low-priority nodes are evicted (demoted to host) first. With `write_through`, all nodes survive on host -- priority only affects demotion order. |
| **Host full** | Low-priority nodes are deleted from host first. High-priority nodes survive longer. Pinned nodes are skipped entirely. |
| **Host full** | Low-priority nodes are deleted from host first. High-priority nodes with active retention survive longer. |
The practical impact depends on your write policy. With `write_through`, GPU eviction is just a demotion -- the real deletion happens at host eviction, which is where priority ordering matters most.
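The two-tier behavior described above can be illustrated with a toy model. This is a minimal sketch, not SGLang's implementation: the `TieredCache` class, its capacities, the convention that a larger `priority` value means "more important", and the tie-break by least-recent use are all assumptions made for illustration. It models GPU eviction as a demotion to host and host eviction as the actual deletion.

```python
from dataclasses import dataclass, field


@dataclass
class TieredCache:
    """Toy GPU + host cache (hypothetical; not SGLang's radix cache)."""

    gpu_capacity: int
    host_capacity: int
    gpu: dict = field(default_factory=dict)   # node -> (priority, last_used)
    host: dict = field(default_factory=dict)  # node -> (priority, last_used)
    clock: int = 0

    def _victim(self, tier: dict) -> str:
        # Lowest priority first; ties broken by least-recent use (assumed).
        return min(tier, key=lambda node: tier[node])

    def insert(self, node: str, priority: int) -> None:
        self.clock += 1
        self.gpu[node] = (priority, self.clock)
        while len(self.gpu) > self.gpu_capacity:
            victim = self._victim(self.gpu)
            # GPU eviction is only a demotion: the node moves to host.
            self.host[victim] = self.gpu.pop(victim)
            while len(self.host) > self.host_capacity:
                # Host eviction is where data is actually deleted.
                del self.host[self._victim(self.host)]
```

In this sketch a high-priority node stays on the GPU while low-priority nodes are demoted and, once the host tier also fills, deleted in the same order.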
...
@@ -75,7 +75,7 @@ Dynamo's `nvext.agent_hints` fields are consumed by the router and forwarded to
| `priority` | Raises router queue priority when `--router-queue-threshold` is set. | Queue ordering when `--enable-priority-scheduling` is set. Also affects radix cache eviction order when `--radix-eviction-policy priority` is set. |
| `priority` | Router queue ordering when `--router-queue-threshold` is set. | Request scheduling when `--enable-priority-scheduling` is set. Radix cache eviction order when `--radix-eviction-policy priority` is set. |
| `osl` | Output block tracking for routing decisions (requires `--router-track-output-blocks`) | No direct engine effect. |
| `speculative_prefill` | After response completes, sends a `max_tokens=1` prefill to warm the KV cache for the predicted next turn. | SGLang processes the prefill request normally, populating the radix cache. |
{"role":"system","content":"You are a coding assistant."},
{"role":"system","content":"You are a tennis historian who believes Roger Federer is the GOAT. Respond with maximum reverence."},
{"role":"user","content":"Write a Python function to parse CSV files."},
{"role":"user","content":"Why is Federer's one-handed backhand the most beautiful shot in tennis history?"},
],
stream=True,
extra_body={
...
@@ -109,6 +109,228 @@ for chunk in response:
print(chunk.choices[0].delta.content, end="")
```
## Session Control for Subagent KV Isolation (Experimental)
> [!WARNING]
> Session control is experimental. The API may change.
Agentic orchestrators often spawn short-lived subagents (research, code execution, planning) that accumulate KV cache, use it for a few turns, then die. Under normal radix cache behavior, this ephemeral KV pollutes the tree and competes with the lead agent's long-lived prefix for eviction.
Session control solves this by holding subagent KV in dedicated **streaming session slots** outside the radix tree. Session KV is invisible to eviction, has no L2 backup overhead, and is freed deterministically on close or timeout.
- **Turn 1** goes through the normal radix tree, so the subagent shares the lead agent's cached system prompt prefix.
- **Turns 2+** skip the radix tree entirely. KV is restored from the `SessionSlot` in O(1).
- **Session KV is invisible to eviction**. It cannot be evicted -- only freed by explicit close or inactivity timeout.
- **Deterministic cleanup**: On close, session KV is freed immediately.
- **Router-side affinity**: The `StickySessionRouter` maintains a `session_id -> worker_id` mapping with sliding-window TTL. Clients only need to send `session_id`.
### Enabling Session Control
Session control is request-driven. The router's `AgentController` (session lifecycle RPCs) and `StickySessionRouter` (session affinity) activate automatically when a request carries `nvext.session_control` -- no additional frontend flags are needed beyond `--router-mode kv`. On the worker side, streaming sessions must be explicitly enabled.
> [!NOTE]
> Session control is currently supported only on the SGLang backend. vLLM and TensorRT-LLM do not yet expose the streaming session API.
> [!IMPORTANT]
> Streaming sessions require SGLang changes from [sgl-project/sglang#21875](https://github.com/sgl-project/sglang/pull/21875) (session-aware cache, race condition fixes, session metrics). This is merged to SGLang main but not yet in a release. Until a version after `0.5.10.post1` is published, build SGLang from source (`pip install -e "python"` from the SGLang repo).
**SGLang worker:**
```bash
python -m dynamo.sglang \
  --model-path <model> \
  --enable-streaming-session \
  ...
```
| Flag | Description |
|------|-------------|
| `--enable-streaming-session` | Wraps the radix cache with `SessionAwareCache`, enabling streaming session slots for subagent KV isolation. |
**Router:**
```bash
python -m dynamo.frontend \
  --router-mode kv \
  ...
```
### Request Format
#### Opening a session
Include `session_control` with `action: "open"` on the first request:
```json
{
  "model": "Qwen/Qwen3-14B-FP8",
  "messages": [{"role": "user", "content": "Research every Federer Grand Slam final in exhaustive detail."}],
  "nvext": {
    "session_control": {
      "session_id": "sub-1",
      "action": "open",
      "timeout": 60
    }
  }
}
```
| Field | Type | Description |
|-------|------|-------------|
| `session_control.session_id` | `string` | Unique session identifier. Present on every turn. |
| `session_control.action` | `string` | `"open"` or `"close"`. Omit on intermediate turns. |
| `session_control.timeout` | `integer` | Inactivity timeout in seconds (default 300). Only used with `action: "open"`. |
#### Subsequent turns
Include `session_control` with just `session_id` (no action). The router resolves affinity automatically:
```json
{
  "model": "Qwen/Qwen3-14B-FP8",
  "messages": [{"role": "user", "content": "Now compare his Wimbledon 2007 final vs Nadal to any shot in human history."}],
  "nvext": {
    "session_control": {
      "session_id": "sub-1"
    }
  }
}
```
#### Closing a session
Include `action: "close"`. The close RPC fires after generation completes:
```json
{
  "model": "Qwen/Qwen3-14B-FP8",
  "messages": [{"role": "user", "content": "Write a 500-word love letter to Federer's single-handed backhand."}],
  "nvext": {
    "session_control": {
      "session_id": "sub-1",
      "action": "close"
    }
  }
}
```
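The three request shapes above differ only in the `session_control` object, so a client can build them with one small helper. The sketch below is illustrative: the function name `session_control` and its strictness (rejecting a `timeout` without `action: "open"`) are assumptions, not part of Dynamo's API; the field names are the ones documented in the table above.

```python
from typing import Optional


def session_control(session_id: str, action: Optional[str] = None,
                    timeout: Optional[int] = None) -> dict:
    """Build the `nvext` fragment for one turn of a session (hypothetical helper)."""
    if action not in (None, "open", "close"):
        raise ValueError(f"unsupported action: {action!r}")
    if timeout is not None and action != "open":
        # The docs say timeout is only used with "open"; rejecting it
        # elsewhere is a choice made for this sketch.
        raise ValueError("timeout is only used with action='open'")
    control = {"session_id": session_id}
    if action is not None:
        control["action"] = action
    if timeout is not None:
        control["timeout"] = timeout
    return {"nvext": {"session_control": control}}
```

With the OpenAI Python client, the returned dictionary could be passed as `extra_body`, for example `extra_body=session_control("sub-1", "open", timeout=60)`.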
### Limitations
- **Streaming sessions only**: Sessions are opened with `streaming=True`, which means only sequential append operations are supported. Branching (`replace`), token-level rewind (`offset`), and `drop_previous_output` are not supported.
- **Timeout is idle-based**: The timeout refreshes on every request. If a subagent pauses for a long tool call that exceeds the timeout, the session is reaped and KV is freed. The subagent must re-open the session and re-prefill.
- **Session metrics**: Active session count (`sglang:num_streaming_sessions`) and held KV tokens (`sglang:streaming_session_held_tokens`) are exported as Prometheus gauges on the worker's metrics endpoint.
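Because the timeout is idle-based, a client that runs long tool calls may want to track whether its session has probably been reaped before sending the next turn. The following is a minimal client-side sketch under stated assumptions: `SessionLease` is a hypothetical name, the estimate ignores network and generation latency, and it assumes the worker's idle timer restarts when a request is sent.

```python
import time


class SessionLease:
    """Client-side estimate of whether a session is still alive (hypothetical)."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_request = None  # no request sent yet

    def next_action(self):
        """Return "open" if the session must be (re)opened, else None."""
        now = self._clock()
        expired = (self._last_request is None
                   or now - self._last_request >= self.timeout_s)
        # Every request refreshes the idle timer.
        self._last_request = now
        return "open" if expired else None
```

The return value can be used directly as the `action` field: `"open"` on the first turn or after a pause longer than the timeout, and no action otherwise.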
## Quickstart
### Launch Script
The `agg_agent.sh` script launches a single aggregated worker with session control, sticky routing, and KV events:
```bash
# Default model (GLM-4.7-Flash, 2 GPUs)
bash examples/backends/sglang/launch/agg_agent.sh
```
The frontend listens on port 8000 (override with `DYN_HTTP_PORT`). Worker metrics are on port 8081.
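To check that sessions are being opened and released, the two session gauges can be read from the worker's Prometheus output. The parser below is a minimal sketch: the function name is hypothetical, it assumes samples carry no trailing timestamp, and it sums values across label sets. Fetching the text (for example from a `/metrics` path on port 8081) is left to the caller, and that path is an assumption.

```python
def read_session_gauges(metrics_text: str) -> dict:
    """Extract the streaming-session gauges from Prometheus text output."""
    wanted = ("sglang:num_streaming_sessions",
              "sglang:streaming_session_held_tokens")
    gauges = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Assumes "name{labels} value" with no trailing timestamp.
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name in wanted:
            # Sum across label sets in case several series are exported.
            gauges[name] = gauges.get(name, 0.0) + float(value)
    return gauges
```

A held-token gauge that stays above zero after all subagents have closed would suggest sessions are being left to time out rather than closed explicitly.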
### Testing with OpenCode
[OpenCode](https://github.com/opencode-ai/opencode) is an open-source AI coding agent with built-in support for subagents, tool calling, and OpenAI-compatible endpoints. The [Dynamo provider fork](https://github.com/ishandhanani/opencode/tree/idhanani/dynamo-provider) injects `nvext.session_control` on subagent requests, giving each spawned agent its own Dynamo streaming session with sticky routing and KV isolation.
```bash
# Terminal 1 -- launch Dynamo with session control + tool/reasoning parsers
DYNAMO_API_KEY=dummy bun run --cwd packages/opencode src/index.ts \
  -- --model "dynamo/zai-org/GLM-4.7-Flash"
```
When OpenCode spawns a subagent (via the `task` tool), the provider automatically:
1. Sends `session_control.action = "open"` on the subagent's first turn
2. Routes subsequent turns to the same worker via `session_id`
3. Sends `session_control.action = "close"` when the subagent completes, freeing KV
The primary agent runs without session control -- only subagent sessions are pinned. This keeps lead-agent requests load-balanced while subagent multi-turn conversations stay on a single worker with warm KV cache.
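The same open / continue / close lifecycle can be reproduced from any OpenAI-compatible client. The sketch below assumes a client object exposing `chat.completions.create(...)` as the OpenAI Python client does; `run_subagent` is a hypothetical helper, not part of OpenCode or Dynamo. It resends the full message history on every turn, as OpenAI-compatible clients normally do, which is also an assumption here.

```python
def run_subagent(client, model: str, session_id: str, prompts: list,
                 timeout: int = 300) -> list:
    """Send each prompt as one turn of a single session; return the replies."""
    messages, replies = [], []
    for turn, prompt in enumerate(prompts):
        control = {"session_id": session_id}
        if turn == 0:
            # First turn opens the session and sets the idle timeout.
            control.update(action="open", timeout=timeout)
        elif turn == len(prompts) - 1:
            # Last turn closes it so the worker frees the session KV.
            control["action"] = "close"
        messages.append({"role": "user", "content": prompt})
        response = client.chat.completions.create(
            model=model,
            messages=list(messages),
            extra_body={"nvext": {"session_control": control}},
        )
        reply = response.choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies
```

Note that with a single prompt this sketch opens the session and never closes it, leaving cleanup to the inactivity timeout.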
#### Configuration
Model and endpoint are configured in `.opencode/opencode.jsonc`:
```jsonc
{
  "provider": {
    "dynamo": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Dynamo",
      "env": ["DYNAMO_API_KEY"],
      "models": {
        "zai-org/GLM-4.7-Flash": {
          "id": "zai-org/GLM-4.7-Flash",
          "name": "GLM 4.7 Flash",
          "tool_call": true,
          "reasoning": true,
          "temperature": true,
          "attachment": false,
          "release_date": "2025-06-01",
          "limit": {"context": 131072, "output": 8192},
          "cost": {"input": 0, "output": 0},
          "interleaved": {"field": "reasoning_content"}
        }
      },
      "options": {
        "baseURL": "http://localhost:8000/v1"
      }
    }
  }
}
```
## See Also
- **[NVIDIA Request Extensions (nvext)](../../components/frontend/nvext.md)**: Full `nvext` field reference including agent hints
@@ -39,6 +39,7 @@ Include `nvext` as a top-level field alongside standard OpenAI-compatible fields
...
| `prefill_worker_id` | `u64` | `None` | Router | Routes the request to a specific prefill worker (disaggregated serving). |
| `decode_worker_id` | `u64` | `None` | Router | Routes the request to a specific decode worker (disaggregated serving). |
| `agent_hints` | object | `None` | Router | Per-request hints for scheduling and load balancing. See [Agent Hints](#agent-hints). |
| `session_control` | object | `None` | Router | Session lifecycle and sticky routing for subagent KV isolation. See [Session Control](#session-control). |
### Header Overrides
...
@@ -129,6 +130,31 @@ Backend details:
}
```
## Session Control
`session_control` enables subagent KV isolation with sticky routing. The router uses `session_id` to keep a session on the same worker and can issue `open` / `close` lifecycle RPCs around streaming sessions.
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `session_control.session_id` | `string` | — | Unique session identifier. Present on every turn. |
| `session_control.action` | `string` | `None` | `"open"` or `"close"`. Omit on intermediate turns. |
| `session_control.timeout` | `integer` | `300` | Inactivity timeout in seconds. Only used with `action: "open"`. |
```json
{
  "nvext": {
    "session_control": {
      "session_id": "subagent-1",
      "action": "open",
      "timeout": 300
    }
  }
}
```
Requires `--router-mode=kv` on the frontend. Session control activates automatically when requests carry `nvext.session_control`. See [SGLang for Agentic Workloads](../../backends/sglang/agents.md) for backend setup details.
## Response Extensions
When the client requests response metadata via `extra_fields`, the response includes an `nvext` object with the requested fields:
...
@@ -164,4 +190,4 @@ When the client requests response metadata via `extra_fields`, the response incl
|----------|-------------|
| [Frontend Guide](frontend-guide.md) | KServe gRPC configuration and integration |
| [Configuration and Tuning](../router/router-configuration.md) | Full router configuration and CLI arguments |
| [SGLang for Agentic Workloads](../../backends/sglang/agents.md) | SGLang engine flags for priority scheduling and eviction policies |
| [SGLang for Agentic Workloads](../../backends/sglang/agents.md) | SGLang engine flags for priority scheduling, eviction policies, and session control |
@@ -47,6 +47,15 @@ For `--router-mode device-aware-weighted`, set `DYN_ENCODER_CUDA_TO_CPU_RATIO` t
...
To implement KV event publishing for custom inference engines, see [KV Event Publishing for Custom Engines](../../integrations/kv-events-custom-engines.md).
For details on per-request agent hints (`priority`, `osl`, `speculative_prefill`), see [NVIDIA Request Extensions (`nvext`)](../frontend/nvext.md#agent-hints).
### Session Control and Sticky Routing
When a request carries `nvext.session_control`, the KV router activates two additional components:
- **AgentController**: Sends session lifecycle RPCs (`open_session`, `close_session`) to the worker's `session_control` endpoint. The event-plane client is lazily initialized on the first session request.
- **StickySessionRouter**: Maintains an in-memory `session_id -> worker_id` affinity map with sliding-window TTL. Subsequent requests with the same `session_id` are routed to the pinned worker, bypassing KV overlap scoring.
These activate automatically with `--router-mode kv` -- no additional flags are needed. Requests without `session_control` are unaffected and follow the standard KV-aware routing path. Session control currently requires the SGLang backend with `--enable-streaming-session`. See [SGLang for Agentic Workloads -- Session Control](../../backends/sglang/agents.md#session-control-for-subagent-kv-isolation-experimental) for details.
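The affinity behavior can be pictured as a small map whose entries expire unless they keep being used. The following is a toy sketch, not the router's code: `StickySessionMap`, its `route` and `close` methods, and the `pick_worker` callback standing in for normal KV-aware selection are all hypothetical names.

```python
import time


class StickySessionMap:
    """Toy session_id -> worker_id map with a sliding TTL (hypothetical)."""

    def __init__(self, ttl_s: float, clock=time.monotonic):
        self.ttl_s = ttl_s
        self._clock = clock
        self._entries = {}  # session_id -> (worker_id, expires_at)

    def route(self, session_id: str, pick_worker) -> int:
        now = self._clock()
        entry = self._entries.get(session_id)
        if entry is not None and entry[1] > now:
            worker_id = entry[0]       # pinned: skip normal worker selection
        else:
            worker_id = pick_worker()  # first turn or expired: normal routing
        # Sliding window: every request pushes the expiry forward.
        self._entries[session_id] = (worker_id, now + self.ttl_s)
        return worker_id

    def close(self, session_id: str) -> None:
        # Drop the pin so later requests are routed normally again.
        self._entries.pop(session_id, None)
```

Because the expiry is refreshed on each request, an active session stays pinned indefinitely, while an abandoned one falls back to normal routing once the TTL elapses.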
## Tuning Guidelines
`--router-kv-overlap-score-weight` is the primary knob for balancing prefill efficiency against decode load. Prefill-heavy workloads benefit from a higher weight, which steers requests toward workers with better cache overlap and reduces TTFT. Decode-heavy workloads benefit from a lower weight, which distributes decode load more evenly and reduces ITL. The default of 1.0 is a reasonable starting point. This weight can also be overridden per request via `nvext.agent_hints.kv_overlap_score_weight`.
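As a rough illustration of this trade-off, assume the router scores each worker with a cost of the form `weight * prefill_blocks + decode_blocks` and picks the lowest. That formula is a simplifying assumption for this sketch rather than a statement of the router's exact scoring, and `pick_worker` and its arguments are hypothetical names.

```python
def pick_worker(workers: dict, request_blocks: int,
                overlap_weight: float = 1.0) -> str:
    """workers maps name -> (cached_overlap_blocks, active_decode_blocks)."""

    def cost(name: str) -> float:
        overlap, decode_load = workers[name]
        prefill_blocks = request_blocks - overlap  # blocks still to compute
        return overlap_weight * prefill_blocks + decode_load

    # The lowest-cost worker wins.
    return min(workers, key=cost)
```

Under this model, a higher weight makes uncached prefill work more expensive, so a busy worker with a warm cache wins; a lower weight lets the lightly loaded worker win even with no cache overlap.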
The request body includes `nvext.agent_hints` for routing and scheduling metadata; the frontend passes those hints to the KV router for queue ordering and worker selection.
The request body includes `nvext.agent_hints` for routing and scheduling metadata that the frontend passes through to the KV router and backend runtime.
| Hint | Description |
|------|-------------|
...
@@ -45,6 +45,7 @@ The request body includes `nvext.agent_hints` for routing and scheduling metadat
@@ -64,6 +65,8 @@ Dynamo is now supported directly in LangChain using the [NVIDIA AI Endpoints int
...
- **Priority-based KV cache eviction:** Instead of evicting by LRU alone, the backend can evict **low-priority** cache entries first when the GPU (and, with HiCache, host) cache is full. The `priority` value in `nvext.agent_hints` is forwarded to the engine; with SGLang, enable `--enable-priority-scheduling` and `--radix-eviction-policy priority`.
- **Subagent KV isolation (experimental):** Session control holds subagent KV in dedicated streaming session slots outside the radix tree. Session KV is invisible to eviction and freed deterministically on close or timeout. The router manages sticky session affinity so subsequent turns always hit the same worker. See [SGLang for Agentic Workloads -- Session Control](../backends/sglang/agents.md#session-control-for-subagent-kv-isolation-experimental).
- **Cache prefetching (future work):** Using the predictable agentic lifecycle (e.g. parent-child subagents, known next turn), Dynamo could proactively prefetch or move KV cache to a different worker so that the next request hits warm cache.