Unverified commit ece08dc9, authored by Neal Vaidya, committed by GitHub

docs: restructure docs directory and move fern config to fern/ (#6700)


Signed-off-by: Neal Vaidya <nealv@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
parent 1412e44b
......@@ -10,7 +10,7 @@ title: Inference Gateway (GAIE)
Integrate Dynamo with the Gateway API Inference Extension for intelligent KV-aware request routing at the gateway layer.
-EPP's default kv-routing approach is not token-aware because the prompt is not tokenized. But the Dynamo plugin uses a token-aware KV algorithm. It employs the dynamo router which implements kv routing by running your model's tokenizer inline. The EPP plugin configuration lives in [`helm/dynamo-gaie/epp-config-dynamo.yaml`](helm/dynamo-gaie/epp-config-dynamo.yaml) per EPP [convention](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/).
+EPP's default kv-routing approach is not token-aware because the prompt is not tokenized. But the Dynamo plugin uses a token-aware KV algorithm. It employs the dynamo router which implements kv routing by running your model's tokenizer inline. The EPP plugin configuration lives in [`helm/dynamo-gaie/epp-config-dynamo.yaml`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/inference-gateway/standalone/helm/dynamo-gaie/epp-config-dynamo.yaml) per EPP [convention](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/).
Dynamo Integration with the Inference Gateway supports Aggregated and Disaggregated Serving. The epp config is the same for both. If no prefill workers found the service degrades gracefully to perform aggregated serving.
If you want to use LoRA deploy Dynamo without the Inference Gateway.
......@@ -222,7 +222,7 @@ Key configurations include:
**Configuration**
-You can configure the plugin by setting environment variables in the EPP component of your DGD in case of the operator-managed installation or in your [values.yaml](../../../deploy/inference-gateway/standalone/helm/dynamo-gaie/values.yaml).
+You can configure the plugin by setting environment variables in the EPP component of your DGD in case of the operator-managed installation or in your [values.yaml](https://github.com/ai-dynamo/dynamo/blob/main/deploy/inference-gateway/standalone/helm/dynamo-gaie/values.yaml).
Common Vars for Routing Configuration:
- Set `DYN_BUSY_THRESHOLD` to configure the upper bound on how "full" a worker can be (often derived from kv_active_blocks or other load metrics) before the router skips it. If the selected worker exceeds this value, routing falls back to the next best candidate. By default the value is negative meaning this is not enabled.
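
The skip-and-fall-back behavior described for `DYN_BUSY_THRESHOLD` can be illustrated with a minimal sketch. This is not the Dynamo router's implementation: the function name `pick_worker`, the `(worker_id, score, load)` tuple shape, and returning `None` when every worker is over the threshold are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical candidate shape: (worker_id, routing_score, load).
# A higher routing_score means a better candidate; load is how "full" the worker is.
Candidate = tuple[str, float, float]


def pick_worker(candidates: list[Candidate], busy_threshold: float) -> Optional[str]:
    """Return the best-scoring worker whose load does not exceed the threshold.

    A negative busy_threshold disables the check, mirroring the documented default.
    Returning None when all workers are too busy is an assumption of this sketch.
    """
    # Rank candidates from best to worst routing score.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    for worker_id, _score, load in ranked:
        if busy_threshold < 0 or load <= busy_threshold:
            return worker_id
    return None


workers = [("worker-a", 0.9, 0.95), ("worker-b", 0.5, 0.40)]
print(pick_worker(workers, busy_threshold=0.8))   # worker-a is over the threshold
print(pick_worker(workers, busy_threshold=-1.0))  # check disabled
```

With the threshold at 0.8 the best-scoring worker is skipped because its load is 0.95, so the next candidate is chosen; with a negative threshold the best-scoring worker is always returned.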
......
......@@ -172,7 +172,7 @@ Visit http://localhost:9090 and try these example queries:
- `dynamo_frontend_requests_total`
- `dynamo_frontend_time_to_first_token_seconds_bucket`
-![Prometheus UI showing Dynamo metrics](../../../assets/img/prometheus-k8s.png)
+![Prometheus UI showing Dynamo metrics](../../assets/img/prometheus-k8s.png)
### In Grafana
```bash
......@@ -190,7 +190,7 @@ Visit http://localhost:3000 and log in with the credentials captured above.
Once logged in, find the Dynamo dashboard under General.
-![Grafana dashboard showing Dynamo metrics](../../../assets/img/grafana-k8s.png)
+![Grafana dashboard showing Dynamo metrics](../../assets/img/grafana-k8s.png)
## Operator Metrics
......
......@@ -146,7 +146,7 @@ This section shows how trace and span information appears in JSONL logs. These l
When viewing the corresponding trace in Grafana, you should be able to see something like the following:
-![Disaggregated Trace Example](../../assets/img/grafana-disagg-trace.png)
+![Disaggregated Trace Example](../assets/img/grafana-disagg-trace.png)
### Trace Overview
Dynamo creates distributed traces that span across multiple services in a disaggregated serving setup. The following sections describe the key spans you'll see in Grafana when viewing traces for chat completion requests.
......
......@@ -8,7 +8,7 @@ title: Prometheus + Grafana Setup
This guide shows how to set up Prometheus and Grafana for visualizing Dynamo metrics on a single machine for demo purposes.
-![Grafana Dynamo Dashboard](../../assets/img/grafana-dynamo-composite.png)
+![Grafana Dynamo Dashboard](../assets/img/grafana-dynamo-composite.png)
**Components:**
- **Prometheus Server** - Collects and stores metrics from Dynamo services
......
......@@ -143,7 +143,7 @@ http://localhost:8000/v1/chat/completions
Below is an example of what a trace looks like in Grafana Tempo:
-![Trace Example](../../assets/img/trace.png)
+![Trace Example](../assets/img/trace.png)
### 6. Stop Services
......
......@@ -76,7 +76,7 @@ For most frameworks, when chunked prefill is enabled and one forward iteration g
In the prefill engine, the best strategy is to operate at the smallest batch size that saturates the GPUs so that the average time to first token (TTFT) is minimized.
For example, for Llama3.3-70b NVFP4 quantization on B200 TP1 in vLLM, the below figure shows the prefill time with different isl (prefix caching is turned off):
-![Combined bar and line chart showing "Prefill Time". Bar chart represents TTFT (Time To First Token) in milliseconds against ISL (Input Sequence Length). The line chart shows TTFT/ISL (milliseconds per token) against ISL.](../../assets/img/prefill-time.png)
+![Combined bar and line chart showing "Prefill Time". Bar chart represents TTFT (Time To First Token) in milliseconds against ISL (Input Sequence Length). The line chart shows TTFT/ISL (milliseconds per token) against ISL.](../assets/img/prefill-time.png)
For isl less than 1000, the prefill efficiency is low because the GPU is not fully saturated.
For isl larger than 4000, the prefill time per token increases because the attention takes longer to compute with a longer history.
......