Unverified commit 5ff88b33, authored by dagil-nvidia, committed by GitHub

docs: fix image paths to render on both GitHub and Fern (#6228)


Signed-off-by: Dan Gil <dagil@nvidia.com>
Signed-off-by: dagil-nvidia <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
parent 3dd5266e
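
The rewrite applied throughout this diff (docs-root-absolute `/assets/img/...` links replaced by links relative to each Markdown file) can be sketched as follows. This is an illustrative sketch, not the tooling used for the commit: the `docs` root directory, the example file paths, and the helper names `to_relative` and `rewrite_images` are all assumptions.

```python
import posixpath
import re

# Matches the opening of a Markdown image link whose target starts with /assets/.
# Group 1 is "![alt](", group 2 is the absolute target up to a space or ")".
IMG_LINK = re.compile(r"(!\[[^\]]*\]\()(/assets/[^)\s]+)")


def to_relative(abs_link: str, doc_path: str, docs_root: str = "docs") -> str:
    """Rewrite a docs-root-absolute link relative to the file at doc_path.

    docs_root is a hypothetical directory name; adjust it to the real layout.
    """
    target = posixpath.join(docs_root, abs_link.lstrip("/"))
    start = posixpath.dirname(doc_path)
    return posixpath.relpath(target, start)


def rewrite_images(markdown: str, doc_path: str, docs_root: str = "docs") -> str:
    """Rewrite every /assets/ image link in markdown, keeping alt text and titles."""
    return IMG_LINK.sub(
        lambda m: m.group(1) + to_relative(m.group(2), doc_path, docs_root),
        markdown,
    )


if __name__ == "__main__":
    # Hypothetical file two directories below the docs root.
    line = "![Trace Example](/assets/img/trace.png)"
    print(rewrite_images(line, "docs/section/sub/page.md"))
```

Under these assumptions, a file two directories below the docs root gets a `../../` prefix and a file three directories below gets `../../../`, which is the pattern visible in the hunks that follow.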
......@@ -41,7 +41,7 @@ Offloading KV cache to CPU or storage is most effective when KV Cache exceeds GP
## Architecture
-![KVBM Architecture](/assets/img/kvbm-architecture.png)
+![KVBM Architecture](../../../assets/img/kvbm-architecture.png)
*High-level layered architecture view of Dynamo KV Block Manager and how it interfaces with different components of the LLM inference ecosystem*
KVBM has three primary logical layers:
......
......@@ -383,7 +383,7 @@ trtllm-serve Qwen/Qwen3-0.6B --host localhost --port 8000 --backend pytorch --ex
**Solution:** Enable KVBM metrics and check the Grafana dashboard for `Onboard Blocks - Host to Device` and `Onboard Blocks - Disk to Device`. Large numbers of onboarded KV blocks indicate good cache reuse:
-![Grafana Example](/assets/img/kvbm-metrics-grafana.png)
+![Grafana Example](../../../assets/img/kvbm-metrics-grafana.png)
### KVBM Worker Initialization Timeout
......
......@@ -170,12 +170,12 @@ The profiler follows a 5-step process:
- **Prefill**:
- TP/TEP: Measure TTFT with batch size = 1 (assuming ISL is long enough to saturate compute) without KV reuse.
- DEP: Attention uses data parallelism. Send a single burst with total concurrency `attention_dp_size × attn_dp_num_req_ratio` (defaults to 4) and compute the reported TTFT as `time_to_first_token.max / attn_dp_num_req_ratio` from the AIPerf summary of that burst.
-![Prefill Performance](/assets/img/h100-prefill-performance.png)
+![Prefill Performance](../../../assets/img/h100-prefill-performance.png)
- **Decode**: Measure the ITL under different numbers of in-flight requests, from 1 to the maximum the KV cache can hold. To measure ITL without being affected by piggy-backed prefill requests, the script enables KV-reuse and warms up the engine by issuing the same prompts before measuring.
-![Decode Performance](/assets/img/h100-decode-performance.png)
+![Decode Performance](../../../assets/img/h100-decode-performance.png)
4. **Recommendation**: Select optimal parallelization mapping for prefill and decode that achieves the highest per-GPU throughput while adhering to the SLA on TTFT and ITL.
5. **In-Depth Profiling on the Recommended P/D Engine**: Interpolate TTFT with ISL and ITL with active KV cache and decode context length for more accurate performance estimation.
-![ITL Interpolation](/assets/img/pd-interpolation.png)
+![ITL Interpolation](../../../assets/img/pd-interpolation.png)
- **Prefill**: Measures TTFT and throughput per GPU across different input lengths with batch size=1.
- **Decode**: Measures ITL and throughput per GPU under various KV cache loads and decode context lengths.
......
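
The DEP prefill arithmetic quoted in the hunk above can be illustrated with a small sketch. The function names and the sample numbers are hypothetical, and the default of 4 is read from the "(defaults to 4)" note as applying to `attn_dp_num_req_ratio`; the formulas themselves are the ones stated in the profiler text.

```python
def dep_burst_concurrency(attention_dp_size: int, attn_dp_num_req_ratio: int = 4) -> int:
    """Total concurrency of the single burst sent for a DEP prefill measurement."""
    return attention_dp_size * attn_dp_num_req_ratio


def dep_reported_ttft(ttft_max: float, attn_dp_num_req_ratio: int = 4) -> float:
    """Reported TTFT: the burst's maximum TTFT divided by the request ratio."""
    return ttft_max / attn_dp_num_req_ratio


if __name__ == "__main__":
    # Hypothetical inputs: 8 attention-DP ranks, max TTFT of 2.0 s in the burst.
    print(dep_burst_concurrency(8))   # 32 concurrent requests
    print(dep_reported_ttft(2.0))     # 0.5 s reported TTFT
```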
......@@ -47,7 +47,7 @@ The following diagram outlines Dynamo's high-level architecture. To enable large
Every component in the Dynamo architecture is independently scalable and portable. The API server can adapt to task-specific deployment. A smart router processes user requests to route them to the optimal worker for performance. Specifically, for Large Language Models (LLMs), Dynamo employs KV cache-aware routing, which directs requests to the worker with the highest cache hit rate while maintaining load balance, expediting decoding. This routing strategy leverages a KV cache manager that maintains a global radix tree registry for hit rate calculation. The KV cache manager also oversees a multi-tiered memory system, enabling rapid KV cache storage and eviction. This design results in substantial TTFT reductions, increased throughput, and the ability to process extensive context lengths.
-![Diagram of the NVIDIA Dynamo architecture for distributed AI inference, including User Requests, Planner, API Server, Smart Router, and Disaggregated Serving](/assets/img/architecture.png "Dynamo Architecture")
+![Diagram of the NVIDIA Dynamo architecture for distributed AI inference, including User Requests, Planner, API Server, Smart Router, and Disaggregated Serving](../../assets/img/architecture.png "Dynamo Architecture")
Dynamo enables dynamic worker scaling, responding to real-time deployment signals. These signals, captured and communicated through an event plane, empower the Planner to make intelligent, zero-downtime adjustments. For instance, if Dynamo detects an increase in requests with long input sequences, the Planner automatically scales up prefill workers to meet the heightened demand.
......@@ -61,7 +61,7 @@ Dynamo prioritizes seamless integration. Its modular design enables it to work h
Disaggregating prefill and decode boosts performance, gaining efficiency when more GPUs are involved in inference. For example, for Llama 70B, single-node tests show a 30% throughput/GPU improvement, while two-node setups achieve over 2X gains due to better parallelization.
-![Two scatter plots comparing the performance of disagg and baseline configurations on one node versus two nodes](/assets/img/disagg-perf-benefit.png)
+![Two scatter plots comparing the performance of disagg and baseline configurations on one node versus two nodes](../../assets/img/disagg-perf-benefit.png)
* Tested on H100s with R1 Distilled Llama 70B model FP8 using vLLM. 3K ISL/ 150 OSL
......@@ -70,7 +70,7 @@ The disaggregation of prefill and decode phases offers valuable flexibility. Sin
### KV aware routing
-![Two bar charts comparing Random routing and Dynamo with KV aware routing for Time To First Token (3x faster with Dynamo) and Avg request latency (2x faster with Dynamo).](/assets/img/kv-routing.png)
+![Two bar charts comparing Random routing and Dynamo with KV aware routing for Time To First Token (3x faster with Dynamo) and Avg request latency (2x faster with Dynamo).](../../assets/img/kv-routing.png)
* Tested with 100K requests to R1 using R1 Distilled Llama 70B FP8 on 2 nodes of H100s. Avg 4K ISL / 800 OSL
......@@ -80,7 +80,7 @@ Existing routing methods, including load-based routing, overlook the specific pr
### KV cache manager
The Dynamo KV Block Manager (KVBM) enables KV cache offloading to system CPU memory, local SSDs, and network-attached storage, allowing more KV blocks to be reused instead of recomputed. In many cases, KV transfer is faster than recomputation, so KVBM helps reduce time-to-first-token (TTFT). The following plot highlights the performance gains achieved through CPU memory offloading. In a scenario involving 20 multi-turn conversations with 15 users, KVBM with CPU memory offloading achieved a 2.2×–12× improvement in TTFT (depending on QPS), demonstrating benefits that extend beyond basic prefix caching.
-![Line graph comparing Pure GPU prefix caching with vLLM and KVBM host offloading for TTFT (Time To First Token)](/assets/img/kvbm-agg-performance.png)
+![Line graph comparing Pure GPU prefix caching with vLLM and KVBM host offloading for TTFT (Time To First Token)](../../assets/img/kvbm-agg-performance.png)
* Tested with different QPS using Qwen3-8B on H100. Avg 20K ISL / 100 OSL.
......
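
The KV cache-aware routing idea described in the architecture text above (prefer the worker with the highest cache hit rate while keeping load balanced) can be illustrated with a toy scorer. This is not Dynamo's router: it replaces the global radix tree with a plain longest-common-prefix scan over block IDs, and the `Worker` class, the `load_weight` parameter, and the scoring formula are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    # Each entry is a sequence of KV block IDs this worker is assumed to hold.
    cached_prefixes: list[tuple[int, ...]] = field(default_factory=list)
    active_requests: int = 0


def prefix_overlap(request_blocks: tuple[int, ...], cached: tuple[int, ...]) -> int:
    """Count how many leading blocks of the request match a cached sequence."""
    matched = 0
    for wanted, held in zip(request_blocks, cached):
        if wanted != held:
            break
        matched += 1
    return matched


def pick_worker(request_blocks: tuple[int, ...], workers: list[Worker],
                load_weight: float = 1.0) -> Worker:
    """Pick the worker with the best (cache overlap - weighted load) score."""
    def score(worker: Worker) -> float:
        best = max(
            (prefix_overlap(request_blocks, c) for c in worker.cached_prefixes),
            default=0,
        )
        return best - load_weight * worker.active_requests

    return max(workers, key=score)


if __name__ == "__main__":
    a = Worker("a", cached_prefixes=[(1, 2, 3)])
    b = Worker("b", cached_prefixes=[(1,)])
    print(pick_worker((1, 2, 3, 4), [a, b]).name)  # worker "a" has more overlap
```

Raising `load_weight` shifts the trade-off toward load balance: a heavily loaded worker with a long cached prefix can lose to an idle worker with a shorter one.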
......@@ -9,7 +9,7 @@ This document provides an in-depth look at the architecture, components, framewo
## KVBM Components
-![Internal Components of Dynamo KVBM](/assets/img/kvbm-components.png)
+![Internal Components of Dynamo KVBM](../../assets/img/kvbm-components.png)
*Internal Components of Dynamo KVBM*
......@@ -40,7 +40,7 @@ This document provides an in-depth look at the architecture, components, framewo
## KVBM Data Flows
-![KVBM Data Flows](/assets/img/kvbm-data-flows.png)
+![KVBM Data Flows](../../assets/img/kvbm-data-flows.png)
*KVBM Data Flows from device to other memory hierarchies*
......@@ -73,7 +73,7 @@ This document provides an in-depth look at the architecture, components, framewo
## Internal Architecture Deep Dive
-![Internal architecture and key modules in the Dynamo KVBM](/assets/img/kvbm-internal-arch.png)
+![Internal architecture and key modules in the Dynamo KVBM](../../assets/img/kvbm-internal-arch.png)
*Internal architecture and key modules in the Dynamo KVBM*
......@@ -321,23 +321,23 @@ There are two components of the interface:
- **Scheduler (Leader)**: Responsible for orchestration of KV block offload/onboard, builds metadata specifying transfer data to the workers. It also maintains hooks for handling asynchronous transfer completion.
- **Worker**: Responsible for reading metadata built by the scheduler (leader), performs async onboarding/offloading at the end of the forward pass.
-![vLLM KVBM Integration](/assets/img/kvbm-integrations.png)
+![vLLM KVBM Integration](../../assets/img/kvbm-integrations.png)
*Typical integration of KVBM with inference frameworks (vLLM shown as example)*
### Onboarding Operations
-![Onboarding blocks from Host to Device](/assets/img/kvbm-onboard-host2device.png)
+![Onboarding blocks from Host to Device](../../assets/img/kvbm-onboard-host2device.png)
*Onboarding blocks from Host to Device*
-![Onboarding blocks from Disk to Device](/assets/img/kvbm-onboard-disk2device.png)
+![Onboarding blocks from Disk to Device](../../assets/img/kvbm-onboard-disk2device.png)
*Onboarding blocks from Disk to Device*
### Offloading Operations
-![Offloading blocks from Device to Host & Disk](/assets/img/kvbm-offload.png)
+![Offloading blocks from Device to Host & Disk](../../assets/img/kvbm-offload.png)
*Offloading blocks from Device to Host & Disk*
......
......@@ -13,7 +13,7 @@ The Planner is Dynamo's autoscaling controller. It observes system metrics, pred
## Architecture
-![Planner architecture showing Metric Collector, Load Predictor, and Performance Interpolator feeding into the Scaling Algorithm and Connector Layer](/assets/img/planner-architecture.svg)
+![Planner architecture showing Metric Collector, Load Predictor, and Performance Interpolator feeding into the Scaling Algorithm and Connector Layer](../../assets/img/planner-architecture.svg)
## Scaling Algorithm
......
......@@ -24,17 +24,17 @@ AIConfigurator answers these questions in seconds, providing:
### End-to-End Workflow
-![AIConfigurator end-to-end workflow](/assets/img/e2e-workflow.svg)
+![AIConfigurator end-to-end workflow](../../../assets/img/e2e-workflow.svg)
### Aggregated vs Disaggregated Architecture
AIConfigurator evaluates two deployment architectures and recommends the best one for your workload:
-![Aggregated vs Disaggregated architecture comparison](/assets/img/arch-comparison.svg)
+![Aggregated vs Disaggregated architecture comparison](../../../assets/img/arch-comparison.svg)
### When to Use Each Architecture
-![Decision flowchart for choosing aggregated vs disaggregated](/assets/img/decision-flowchart.svg)
+![Decision flowchart for choosing aggregated vs disaggregated](../../../assets/img/decision-flowchart.svg)
## Quick Start
......@@ -288,7 +288,7 @@ Run AIPerf **inside the cluster** to avoid network latency affecting measurement
To use AIPerf to benchmark an AIC-recommended configuration, you'll need to translate AIC parameters into AIPerf profiling arguments (we are working to automate this):
-![AIC-to-AIPerf parameter mapping](/assets/img/param-mapping.svg)
+![AIC-to-AIPerf parameter mapping](../../../assets/img/param-mapping.svg)
| AIC Output | AIPerf Parameter | Notes |
|------------|-----------------|-------|
......
......@@ -158,7 +158,7 @@ Visit http://localhost:9090 and try these example queries:
- `dynamo_frontend_requests_total`
- `dynamo_frontend_time_to_first_token_seconds_bucket`
-![Prometheus UI showing Dynamo metrics](/assets/img/prometheus-k8s.png)
+![Prometheus UI showing Dynamo metrics](../../../assets/img/prometheus-k8s.png)
### In Grafana
```bash
......@@ -176,7 +176,7 @@ Visit http://localhost:3000 and log in with the credentials captured above.
Once logged in, find the Dynamo dashboard under General.
-![Grafana dashboard showing Dynamo metrics](/assets/img/grafana-k8s.png)
+![Grafana dashboard showing Dynamo metrics](../../../assets/img/grafana-k8s.png)
## Operator Metrics
......
......@@ -144,7 +144,7 @@ This section shows how trace and span information appears in JSONL logs. These l
When viewing the corresponding trace in Grafana, you should be able to see something like the following:
-![Disaggregated Trace Example](/assets/img/grafana-disagg-trace.png)
+![Disaggregated Trace Example](../../assets/img/grafana-disagg-trace.png)
### Trace Overview
Dynamo creates distributed traces that span across multiple services in a disaggregated serving setup. The following sections describe the key spans you'll see in Grafana when viewing traces for chat completion requests.
......
......@@ -9,7 +9,7 @@
This guide shows how to set up Prometheus and Grafana for visualizing Dynamo metrics on a single machine for demo purposes.
-![Grafana Dynamo Dashboard](/assets/img/grafana-dynamo-composite.png)
+![Grafana Dynamo Dashboard](../../assets/img/grafana-dynamo-composite.png)
**Components:**
- **Prometheus Server** - Collects and stores metrics from Dynamo services
......
......@@ -144,7 +144,7 @@ http://localhost:8000/v1/chat/completions
Below is an example of what a trace looks like in Grafana Tempo:
-![Trace Example](/assets/img/trace.png)
+![Trace Example](../../assets/img/trace.png)
### 6. Stop Services
......
......@@ -77,7 +77,7 @@ For most frameworks, when chunked prefill is enabled and one forward iteration g
In the prefill engine, the best strategy is to operate at the smallest batch size that saturates the GPUs so that the average time to first token (TTFT) is minimized.
For example, for Llama3.3-70b NVFP4 quantization on B200 TP1 in vLLM, the below figure shows the prefill time with different isl (prefix caching is turned off):
-![Combined bar and line chart showing "Prefill Time". Bar chart represents TTFT (Time To First Token) in milliseconds against ISL (Input Sequence Length). The line chart shows TTFT/ISL (milliseconds per token) against ISL.](/assets/img/prefill-time.png)
+![Combined bar and line chart showing "Prefill Time". Bar chart represents TTFT (Time To First Token) in milliseconds against ISL (Input Sequence Length). The line chart shows TTFT/ISL (milliseconds per token) against ISL.](../../assets/img/prefill-time.png)
For isl less than 1000, the prefill efficiency is low because the GPU is not fully saturated.
For isl larger than 4000, the prefill time per token increases because the attention takes longer to compute with a longer history.
......