Commit cd8ba391 (unverified), authored by dagil-nvidia, committed by GitHub

docs: fix Fern CI failures (broken links, MDX parser, missing KVBM assets) (#7322)


Signed-off-by: Dan Gil <dagil@nvidia.com>
parent e3b10813
File suppressed by a .gitattributes entry or the file's encoding is unsupported.
File suppressed by a .gitattributes entry or the file's encoding is unsupported.
@@ -630,7 +630,7 @@ for label, key, stat in metrics:
 **Factors that reduce KV router benefit:**
 - **Unique prompts** with no prefix reuse
-- **Short prompts** (<1000 tokens) where routing overhead exceeds benefit
+- **Short prompts** (less than 1000 tokens) where routing overhead exceeds benefit
 - **Evenly distributed load** where round-robin is already optimal
 - **Low request rate** where cache eviction negates benefits
@@ -640,7 +640,7 @@ for label, key, stat in metrics:
 - Workload demonstrates measurable prefix reuse patterns
 **Standard routing is better when:**
-- KV router shows <10% improvement
+- KV router shows less than 10% improvement
 - Increased latency variance is observed
 - Load distribution across workers is more important than cache affinity
......
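The "less than 10% improvement" guidance in the preceding hunk can be turned into a small check. This is a minimal sketch, not code from the documented file: the function names, the choice of a latency figure in milliseconds as the comparison metric, and the default threshold argument are all assumptions made for illustration.

```python
def relative_improvement(baseline_ms: float, kv_router_ms: float) -> float:
    """Fractional latency reduction of KV routing versus a baseline (hypothetical metric)."""
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    return (baseline_ms - kv_router_ms) / baseline_ms


def prefer_kv_router(baseline_ms: float, kv_router_ms: float, threshold: float = 0.10) -> bool:
    """Return True when the measured improvement reaches the threshold (10% by default)."""
    return relative_improvement(baseline_ms, kv_router_ms) >= threshold


# Illustrative numbers only: 200 ms under round-robin versus two KV-router measurements.
strong_gain = prefer_kv_router(200.0, 150.0)  # 25% improvement
weak_gain = prefer_kv_router(200.0, 190.0)    # 5% improvement
```

The other criteria listed in the hunk (latency variance, load distribution) are not captured here and would need their own measurements.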
@@ -87,7 +87,7 @@ The `agent_hints` fields:
 - **`osl`** (output sequence length) is the harness's estimate of how many tokens this request will generate. The router uses this to gauge how long a worker will be occupied, which improves load balancing. A harness can learn this over time by tracking average output lengths per tool call type.
 - **`speculative_prefill`** signals the orchestrator to begin caching this request's prefix on a likely worker before the full request is ready. This is useful when the harness knows a tool call is about to return and wants to warm the cache ahead of time.
-The `cache_control` field will look familiar to anyone who has used Anthropic's prompt caching API. It tells the orchestrator to pin the computed prefix on the worker for the specified TTL, protecting it from eviction during tool call gaps. Currently `ephemeral` is the only supported type (to match Anthropic's API). We discuss how this works in the cache retention section below. You can find complete documentation on agent hints [here](https://docs.nvidia.com/ai-enterprise/dynamo/latest/components/frontend/nvext.html#cache-control).
+The `cache_control` field will look familiar to anyone who has used Anthropic's prompt caching API. It tells the orchestrator to pin the computed prefix on the worker for the specified TTL, protecting it from eviction during tool call gaps. Currently `ephemeral` is the only supported type (to match Anthropic's API). We discuss how this works in the cache retention section below. You can find complete documentation on agent hints [here](../../components/frontend/nvext.md#cache-control).
 ## Layer 2: The Router
@@ -95,7 +95,7 @@ A coding agent follows a sequential pattern: long prefill, tool call, extend pre
 ### KV-Aware Placement
-Without cache-aware routing, turn 2 of a conversation has a ~1/N chance of landing on the same worker as turn 1. Every miss is a full prefix recomputation which is a significant performance bottleneck and extremely costly for an end user. Dynamo's router maintains a global index of which KV cache blocks exist on which workers. The [Flash Indexer post](https://developer.nvidia.com/blog/building-a-high-performance-kv-cache-index-for-llm-inference-with-nvidia-dynamo/) covers the six iterations that got this indexer to 170M ops/s (**planetary** scale KV routing). On every request, the router queries the index for per-worker overlap scores and selects the worker that minimizes the combined cost of cache miss and current decode load. This cost function is tunable, and we show below how teams can build custom agent aware routing strategies on top of it.
+Without cache-aware routing, turn 2 of a conversation has a ~1/N chance of landing on the same worker as turn 1. Every miss is a full prefix recomputation which is a significant performance bottleneck and extremely costly for an end user. Dynamo's router maintains a global index of which KV cache blocks exist on which workers. The [Flash Indexer post](../flash-indexer/flash-indexer.md) covers the six iterations that got this indexer to 170M ops/s (**planetary** scale KV routing). On every request, the router queries the index for per-worker overlap scores and selects the worker that minimizes the combined cost of cache miss and current decode load. This cost function is tunable, and we show below how teams can build custom agent aware routing strategies on top of it.
 ### Priority Scheduling
......
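The KV-Aware Placement paragraph in the hunk above describes selecting the worker that minimizes a combined cost of cache miss and current decode load. The following is a minimal sketch of that idea under stated assumptions: `WorkerState`, `select_worker`, the block-count units, and the `overlap_weight` parameter are hypothetical names invented for this illustration and are not taken from Dynamo's router API.

```python
from dataclasses import dataclass


@dataclass
class WorkerState:
    name: str
    overlap_blocks: int        # request prefix blocks already cached on this worker
    active_decode_blocks: int  # rough proxy for the worker's current decode load


def select_worker(workers, request_blocks, overlap_weight=1.0):
    """Pick the worker with the lowest combined cost (hypothetical cost function)."""
    def cost(w):
        # Blocks that would have to be recomputed in prefill on this worker.
        missed = max(request_blocks - w.overlap_blocks, 0)
        return overlap_weight * missed + w.active_decode_blocks
    return min(workers, key=cost)


workers = [
    WorkerState("worker-0", overlap_blocks=48, active_decode_blocks=30),
    WorkerState("worker-1", overlap_blocks=0, active_decode_blocks=5),
    WorkerState("worker-2", overlap_blocks=60, active_decode_blocks=90),
]

# With equal weighting, worker-0 wins: 16 missed blocks + 30 load = 46.
best = select_worker(workers, request_blocks=64)

# With cache misses ignored, the least-loaded worker wins instead.
least_loaded = select_worker(workers, request_blocks=64, overlap_weight=0.0)
```

Raising `overlap_weight` in this sketch favors cache affinity, while lowering it favors load balance, which mirrors the tunable trade-off the paragraph describes.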
@@ -338,6 +338,9 @@ navigation:
 path: reference/glossary.md
 - page: Tuning Disaggregated Performance
 path: performance/tuning.md
+# -- Frontend (hidden sub-pages) --
+- page: NVIDIA Request Extensions (nvext)
+path: components/frontend/nvext.md
 # -- Backend detail pages --
 - section: vLLM Details
 contents:
......