Unverified commit 7ca6a562, authored by Jonathan Tong, committed by GitHub

docs: update Fern docs for main branch (#5706)


Signed-off-by: Jont828 <jt572@cornell.edu>
parent 704c1dad
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "KV Block Manager"
--- ---
# KV Block Manager
The Dynamo KV Block Manager (KVBM) is a scalable runtime component The Dynamo KV Block Manager (KVBM) is a scalable runtime component
designed to handle memory allocation, management, and remote sharing of designed to handle memory allocation, management, and remote sharing of
Key-Value (KV) blocks for inference tasks across heterogeneous and Key-Value (KV) blocks for inference tasks across heterogeneous and
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Motivation behind KVBM"
--- ---
# Motivation behind KVBM
Large language models (LLMs) and other AI workloads increasingly rely on KV caches that extend beyond GPU and local CPU memory into remote storage tiers. However, efficiently managing the lifecycle of KV blocks in remote storage presents challenges: Large language models (LLMs) and other AI workloads increasingly rely on KV caches that extend beyond GPU and local CPU memory into remote storage tiers. However, efficiently managing the lifecycle of KV blocks in remote storage presents challenges:
* Tailored for GenAI use-cases * Tailored for GenAI use-cases
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "KVBM Further Reading"
--- ---
# KVBM Further Reading
- [vLLM](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html) - [vLLM](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html)
- [SGLang](https://github.com/sgl-project/sglang/tree/main/benchmark/hicache) - [SGLang](https://github.com/sgl-project/sglang/tree/main/benchmark/hicache)
- [EMOGI](https://arxiv.org/abs/2006.06890) - [EMOGI](https://arxiv.org/abs/2006.06890)
\ No newline at end of file
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Running KVBM in TensorRT-LLM"
--- ---
# Running KVBM in TensorRT-LLM
This guide explains how to leverage KVBM (KV Block Manager) to manage KV cache and do KV offloading in TensorRT-LLM (trtllm). This guide explains how to leverage KVBM (KV Block Manager) to manage KV cache and do KV offloading in TensorRT-LLM (trtllm).
To learn what KVBM is, please check [here](kvbm-architecture.md) To learn what KVBM is, please check [here](kvbm-architecture.md)
<Note> > [!NOTE]
- Ensure that `etcd` and `nats` are running before starting. > - Ensure that `etcd` and `nats` are running before starting.
- KVBM only supports TensorRT-LLM’s PyTorch backend. > - KVBM only supports TensorRT-LLM's PyTorch backend.
- Disable partial reuse `enable_partial_reuse: false` in the LLM API config’s `kv_connector_config` to increase offloading cache hits. > - Disable partial reuse `enable_partial_reuse: false` in the LLM API config's `kv_connector_config` to increase offloading cache hits.
- KVBM requires TensorRT-LLM v1.1.0rc5 or newer. > - KVBM requires TensorRT-LLM v1.2.0rc2 or newer.
- Enabling KVBM metrics with TensorRT-LLM is still a work in progress. > - Enabling KVBM metrics with TensorRT-LLM is still a work in progress.
</Note>
## Quick Start ## Quick Start
...@@ -50,10 +50,9 @@ export DYN_KVBM_DISK_CACHE_GB=8 ...@@ -50,10 +50,9 @@ export DYN_KVBM_DISK_CACHE_GB=8
# DYN_KVBM_DISK_CACHE_OVERRIDE_NUM_BLOCKS to specify exact block counts instead of GB # DYN_KVBM_DISK_CACHE_OVERRIDE_NUM_BLOCKS to specify exact block counts instead of GB
``` ```
<Note> > [!NOTE]
When disk offloading is enabled, disk offload filtering is enabled by default to extend SSD lifespan. The current policy offloads a KV block from CPU to disk only if its frequency is at least `2`. Frequency starts at 1, doubles on each cache hit, and decreases by 1 on each time decay step. > When disk offloading is enabled, disk offload filtering is enabled by default to extend SSD lifespan. The current policy offloads a KV block from CPU to disk only if its frequency is at least `2`. Frequency starts at 1, doubles on each cache hit, and decreases by 1 on each time decay step.
To disable disk offload filtering, set `DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER` to true or 1. > To disable disk offload filtering, set `DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER` to true or 1.
</Note>
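The frequency policy in the note above can be illustrated with a small Python model. This is a sketch for intuition only, not KVBM's implementation: the class and method names are hypothetical, and two details are assumptions not stated in the docs (the first access to a block counts as the "init with 1" step, and a block whose frequency decays to 0 is forgotten).

```python
from dataclasses import dataclass, field

# From the note: a block is offloaded to disk only when frequency >= 2.
OFFLOAD_THRESHOLD = 2


@dataclass
class FrequencyTracker:
    """Toy model of the disk offload filter (hypothetical names)."""

    counts: dict = field(default_factory=dict)

    def touch(self, block_id: str) -> int:
        """Record an access: the first one sets frequency to 1, later hits double it."""
        if block_id in self.counts:
            self.counts[block_id] *= 2
        else:
            self.counts[block_id] = 1
        return self.counts[block_id]

    def decay(self) -> None:
        """One time decay step: subtract 1 everywhere; drop blocks that reach 0 (assumption)."""
        self.counts = {b: c - 1 for b, c in self.counts.items() if c - 1 > 0}

    def should_offload(self, block_id: str) -> bool:
        """True when the block's frequency meets the offload threshold."""
        return self.counts.get(block_id, 0) >= OFFLOAD_THRESHOLD


tracker = FrequencyTracker()
tracker.touch("block-a")                  # first access: frequency 1
tracker.touch("block-a")                  # cache hit: frequency 2
print(tracker.should_offload("block-a"))  # eligible for CPU -> disk offload
tracker.decay()                           # frequency drops back to 1
print(tracker.should_offload("block-a"))  # no longer eligible
```

Under this model a block seen only once never reaches disk, which is the point of the filter: blocks that are not reused do not cost SSD writes.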
```bash ```bash
# write an example LLM API config # write an example LLM API config
...@@ -94,6 +93,16 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" ...@@ -94,6 +93,16 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json"
``` ```
KVBM is also supported on the prefill worker of disaggregated serving. To launch the prefill worker, run:
```bash
# [DYNAMO] To serve an LLM model with dynamo
python3 -m dynamo.trtllm \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--extra-engine-args /tmp/kvbm_llm_api_config.yaml \
--disaggregation-mode prefill &
```
Alternatively, you can use `trtllm-serve` with KVBM by replacing the two [DYNAMO] commands above with the following:
```bash ```bash
trtllm-serve Qwen/Qwen3-0.6B --host localhost --port 8000 --backend pytorch --extra_llm_api_options /tmp/kvbm_llm_api_config.yaml trtllm-serve Qwen/Qwen3-0.6B --host localhost --port 8000 --backend pytorch --extra_llm_api_options /tmp/kvbm_llm_api_config.yaml
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Running KVBM in vLLM"
--- ---
# Running KVBM in vLLM
This guide explains how to leverage KVBM (KV Block Manager) to manage KV cache and do KV offloading in vLLM. This guide explains how to leverage KVBM (KV Block Manager) to manage KV cache and do KV offloading in vLLM.
To learn what KVBM is, please check [here](kvbm-architecture.md) To learn what KVBM is, please check [here](kvbm-architecture.md)
...@@ -43,29 +44,27 @@ cd $DYNAMO_HOME/examples/backends/vllm ...@@ -43,29 +44,27 @@ cd $DYNAMO_HOME/examples/backends/vllm
./launch/disagg_kvbm_2p2d.sh ./launch/disagg_kvbm_2p2d.sh
``` ```
<Note> > [!NOTE]
Configure or tune KVBM cache tiers (choose one of the following options): > Configure or tune KVBM cache tiers (choose one of the following options):
```bash > ```bash
# Option 1: CPU cache only (GPU -> CPU offloading) > # Option 1: CPU cache only (GPU -> CPU offloading)
# 4 means 4GB of pinned CPU memory would be used > # 4 means 4GB of pinned CPU memory would be used
export DYN_KVBM_CPU_CACHE_GB=4 > export DYN_KVBM_CPU_CACHE_GB=4
# Option 2: Both CPU and Disk cache (GPU -> CPU -> Disk tiered offloading) > # Option 2: Both CPU and Disk cache (GPU -> CPU -> Disk tiered offloading)
export DYN_KVBM_CPU_CACHE_GB=4 > export DYN_KVBM_CPU_CACHE_GB=4
# 8 means 8GB of disk would be used > # 8 means 8GB of disk would be used
export DYN_KVBM_DISK_CACHE_GB=8 > export DYN_KVBM_DISK_CACHE_GB=8
# [Experimental] Option 3: Disk cache only (GPU -> Disk direct offloading, bypassing CPU) > # [Experimental] Option 3: Disk cache only (GPU -> Disk direct offloading, bypassing CPU)
# NOTE: this option is only experimental and it might not give out the best performance. > # NOTE: this option is only experimental and it might not give out the best performance.
# NOTE: disk offload filtering is not supported when using this option. > # NOTE: disk offload filtering is not supported when using this option.
export DYN_KVBM_DISK_CACHE_GB=8 > export DYN_KVBM_DISK_CACHE_GB=8
``` > ```
You can also use "DYN_KVBM_CPU_CACHE_OVERRIDE_NUM_BLOCKS" or > You can also use "DYN_KVBM_CPU_CACHE_OVERRIDE_NUM_BLOCKS" or
"DYN_KVBM_DISK_CACHE_OVERRIDE_NUM_BLOCKS" to specify exact block counts instead of GB > "DYN_KVBM_DISK_CACHE_OVERRIDE_NUM_BLOCKS" to specify exact block counts instead of GB
</Note>
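As a pre-launch sanity check, the three options above can be told apart by which of the two `*_CACHE_GB` variables is set. The helper below is a hypothetical sketch, not part of Dynamo: it assumes a tier is enabled when its variable holds a positive number, and it ignores the `*_OVERRIDE_NUM_BLOCKS` variables.

```python
import os
from typing import Mapping, Optional


def kvbm_tier_plan(env: Optional[Mapping[str, str]] = None) -> str:
    """Describe the offloading path implied by the KVBM cache-size variables.

    Hypothetical helper for illustration; KVBM's own parsing may differ.
    """
    env = os.environ if env is None else env
    # An unset or empty variable is treated as 0 GB (assumption).
    cpu_gb = float(env.get("DYN_KVBM_CPU_CACHE_GB") or 0)
    disk_gb = float(env.get("DYN_KVBM_DISK_CACHE_GB") or 0)

    if cpu_gb > 0 and disk_gb > 0:
        return "GPU -> CPU -> Disk"          # Option 2
    if cpu_gb > 0:
        return "GPU -> CPU"                  # Option 1
    if disk_gb > 0:
        return "GPU -> Disk (experimental)"  # Option 3
    return "no offloading configured"


print(kvbm_tier_plan({"DYN_KVBM_CPU_CACHE_GB": "4", "DYN_KVBM_DISK_CACHE_GB": "8"}))
```

Calling `kvbm_tier_plan()` with no argument reads the current process environment, so it can be run in the same shell session that exported the variables.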
> [!NOTE]
<Note> > When disk offloading is enabled, disk offload filtering is enabled by default to extend SSD lifespan. The current policy offloads a KV block from CPU to disk only if its frequency is at least `2`. Frequency starts at 1, doubles on each cache hit, and decreases by 1 on each time decay step.
When disk offloading is enabled, disk offload filtering is enabled by default to extend SSD lifespan. The current policy offloads a KV block from CPU to disk only if its frequency is at least `2`. Frequency starts at 1, doubles on each cache hit, and decreases by 1 on each time decay step. > To disable disk offload filtering, set `DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER` to true or 1.
To disable disk offload filtering, set `DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER` to true or 1.
</Note>
### Sample Request ### Sample Request
```bash ```bash
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Multimodal Inference in Dynamo"
--- ---
# Multimodal Inference in Dynamo
Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text. This section provides comprehensive documentation for deploying multimodal models. Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text. This section provides comprehensive documentation for deploying multimodal models.
<Warning> > [!WARNING]
**Security Requirement**: Multimodal processing must be explicitly enabled at startup. > **Security Requirement**: Multimodal processing must be explicitly enabled at startup.
See the relevant documentation for each backend for the necessary flags. > See the relevant documentation for each backend for the necessary flags.
This prevents unintended processing of multimodal data from untrusted sources. > This prevents unintended processing of multimodal data from untrusted sources.
</Warning>
## Backend Documentation ## Backend Documentation
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "SGLang Multimodal"
--- ---
# SGLang Multimodal
This document provides a comprehensive guide for multimodal inference using SGLang backend in Dynamo. SGLang multimodal uses specialized **E/PD or E/P/D** flows with **NIXL (RDMA)** for zero-copy tensor transfer. This document provides a comprehensive guide for multimodal inference using SGLang backend in Dynamo. SGLang multimodal uses specialized **E/PD or E/P/D** flows with **NIXL (RDMA)** for zero-copy tensor transfer.
## Support Matrix ## Support Matrix
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "TensorRT-LLM Multimodal"
--- ---
# TensorRT-LLM Multimodal
This document provides a comprehensive guide for multimodal inference using TensorRT-LLM backend in Dynamo. This document provides a comprehensive guide for multimodal inference using TensorRT-LLM backend in Dynamo.
You can provide multimodal inputs in the following ways: You can provide multimodal inputs in the following ways:
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "vLLM Multimodal"
--- ---
# vLLM Multimodal
This document provides a comprehensive guide for multimodal inference using vLLM backend in Dynamo. This document provides a comprehensive guide for multimodal inference using vLLM backend in Dynamo.
<Warning> > [!WARNING]
**Security Requirement**: All multimodal workers require the `--enable-multimodal` flag to be explicitly set at startup. This is a security feature to prevent unintended processing of multimodal data from untrusted sources. Workers will fail at startup if multimodal flags (e.g., `--multimodal-worker`, `--multimodal-processor`) are used without `--enable-multimodal`. > **Security Requirement**: All multimodal workers require the `--enable-multimodal` flag to be explicitly set at startup. This is a security feature to prevent unintended processing of multimodal data from untrusted sources. Workers will fail at startup if multimodal flags (e.g., `--multimodal-worker`, `--multimodal-processor`) are used without `--enable-multimodal`.
This flag is analogous to `--enable-mm-embeds` in vllm serve but also extends it to all multimodal content (url, embeddings, b64). > This flag is analogous to `--enable-mm-embeds` in vllm serve but also extends it to all multimodal content (url, embeddings, b64).
</Warning>
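The fail-at-startup behavior described in the warning follows a common opt-in pattern. The sketch below shows that pattern with Python's standard `argparse`; it is illustrative and not Dynamo's actual argument parsing. The flag names come from the warning, but treating them as boolean switches and the wording of the error message are assumptions.

```python
import argparse


def parse_worker_args(argv):
    """Parse worker flags; reject multimodal flags unless explicitly enabled."""
    parser = argparse.ArgumentParser(prog="worker-sketch")
    parser.add_argument("--enable-multimodal", action="store_true")
    parser.add_argument("--multimodal-worker", action="store_true")
    parser.add_argument("--multimodal-processor", action="store_true")
    args = parser.parse_args(argv)

    uses_multimodal = args.multimodal_worker or args.multimodal_processor
    if uses_multimodal and not args.enable_multimodal:
        # parser.error prints a usage message and exits with status 2.
        parser.error("multimodal flags require --enable-multimodal")
    return args


args = parse_worker_args(["--enable-multimodal", "--multimodal-worker"])
print(args.multimodal_worker)
```

Checking the combination right after parsing means a misconfigured worker stops before it loads a model or accepts any request.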
## Support Matrix ## Support Matrix
...@@ -160,9 +160,8 @@ cd $DYNAMO_HOME/examples/backends/vllm ...@@ -160,9 +160,8 @@ cd $DYNAMO_HOME/examples/backends/vllm
bash launch/disagg_multimodal_epd.sh --model llava-hf/llava-1.5-7b-hf bash launch/disagg_multimodal_epd.sh --model llava-hf/llava-1.5-7b-hf
``` ```
<Note> > [!NOTE]
Disaggregation is currently only confirmed to work with LLaVA. Qwen2.5-VL is not confirmed to be supported. > Disaggregation is currently only confirmed to work with LLaVA. Qwen2.5-VL is not confirmed to be supported.
</Note>
## ECConnector Serving ## ECConnector Serving
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Dynamo Observability"
--- ---
# Dynamo Observability
## Getting Started Quickly ## Getting Started Quickly
This is an example to get started quickly on a single machine. This is an example to get started quickly on a single machine.
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Dynamo Health Checks"
--- ---
# Dynamo Health Checks
## Overview ## Overview
Dynamo provides health check and liveness HTTP endpoints for each component which Dynamo provides health check and liveness HTTP endpoints for each component which
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Dynamo Logging"
--- ---
# Dynamo Logging
## Overview ## Overview
Dynamo provides structured logging in both text as well as JSONL. When Dynamo provides structured logging in both text as well as JSONL. When
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Metrics Developer Guide"
--- ---
# Metrics Developer Guide
This guide explains how to create and use custom metrics in Dynamo components using the Dynamo metrics API. This guide explains how to create and use custom metrics in Dynamo components using the Dynamo metrics API.
## Metrics Exposure ## Metrics Exposure
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Dynamo Metrics"
--- ---
# Dynamo Metrics
## Overview ## Overview
Dynamo provides built-in metrics capabilities through the Dynamo metrics API, which is automatically available whenever you use the `DistributedRuntime` framework. This document serves as a reference for all available metrics in Dynamo. Dynamo provides built-in metrics capabilities through the Dynamo metrics API, which is automatically available whenever you use the `DistributedRuntime` framework. This document serves as a reference for all available metrics in Dynamo.
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Metrics Visualization with Prometheus and Grafana"
--- ---
# Metrics Visualization with Prometheus and Grafana
## Overview ## Overview
This guide shows how to set up Prometheus and Grafana for visualizing Dynamo metrics on a single machine for demo purposes. This guide shows how to set up Prometheus and Grafana for visualizing Dynamo metrics on a single machine for demo purposes.
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Distributed Tracing with Tempo"
--- ---
# Distributed Tracing with Tempo
## Overview ## Overview
Dynamo supports OpenTelemetry-based distributed tracing for visualizing request flows across Frontend and Worker components. Traces are exported to Tempo via OTLP (OpenTelemetry Protocol) and visualized in Grafana. Dynamo supports OpenTelemetry-based distributed tracing for visualizing request flows across Frontend and Worker components. Traces are exported to Tempo via OTLP (OpenTelemetry Protocol) and visualized in Grafana.
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Finding Best Initial Configs using AIConfigurator"
--- ---
# Finding Best Initial Configs using AIConfigurator
[AIConfigurator](https://github.com/ai-dynamo/aiconfigurator/tree/main) is a performance optimization tool that helps you find the optimal configuration for deploying LLMs with Dynamo. It automatically determines the best number of prefill and decode workers, parallelism settings, and deployment parameters to meet your SLA targets while maximizing throughput. [AIConfigurator](https://github.com/ai-dynamo/aiconfigurator/tree/main) is a performance optimization tool that helps you find the optimal configuration for deploying LLMs with Dynamo. It automatically determines the best number of prefill and decode workers, parallelism settings, and deployment parameters to meet your SLA targets while maximizing throughput.
## Why Use AIConfigurator? ## Why Use AIConfigurator?
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Disaggregation and Performance Tuning"
--- ---
# Disaggregation and Performance Tuning
Disaggregation gains performance by separating prefill and decode into different engines, which reduces interference between the two.
However, performant disaggregation requires careful tuning of the inference parameters.
Specifically, there are three sets of parameters that need to be tuned:
...@@ -32,10 +33,9 @@ Typically, the number of GPUs vs the performance follows the following pattern: ...@@ -32,10 +33,9 @@ Typically, the number of GPUs vs the performance follows the following pattern:
| Maximum number limited by communication scalability | Worst overall throughput/GPU, best latency/user | | Maximum number limited by communication scalability | Worst overall throughput/GPU, best latency/user |
| More than maximum | Communication overhead dominates, poor performance | | More than maximum | Communication overhead dominates, poor performance |
<Note> > [!NOTE]
For decode-only engines, a larger number of GPUs sometimes leads to a larger KV cache per GPU and more decoding requests running in parallel, which improves both throughput/GPU and latency/user. > For decode-only engines, a larger number of GPUs sometimes leads to a larger KV cache per GPU and more decoding requests running in parallel, which improves both throughput/GPU and latency/user.
For example, for Llama-3.3-70b NVFP4 quantization on B200 in vLLM with 0.9 free GPU memory fraction: > For example, for Llama-3.3-70b NVFP4 quantization on B200 in vLLM with 0.9 free GPU memory fraction:
</Note>
| TP Size | KV Cache Size (GB) | KV Cache per GPU (GB) | Per GPU Improvement over TP1 | | TP Size | KV Cache Size (GB) | KV Cache per GPU (GB) | Per GPU Improvement over TP1 |
| ------: | -----------------: | --------------------: | ---------------------------: | | ------: | -----------------: | --------------------: | ---------------------------: |
...@@ -46,9 +46,8 @@ For example, for Llama-3.3-70b NVFP4 quantization on B200 in vLLM with 0.9 free ...@@ -46,9 +46,8 @@ For example, for Llama-3.3-70b NVFP4 quantization on B200 in vLLM with 0.9 free
The best number of GPUs to use in the prefill and decode engines can be determined by running a few fixed ISL/OSL/concurrency tests with [AIPerf](https://github.com/ai-dynamo/aiperf/tree/main) and comparing the results against the SLA.
AIPerf is pre-installed in the dynamo container. AIPerf is pre-installed in the dynamo container.
<Tip> > [!TIP]
If you are unfamiliar with AIPerf, please see this helpful [tutorial](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorial.md) to get you started. > If you are unfamiliar with AIPerf, please see this helpful [tutorial](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorial.md) to get you started.
</Tip>
Besides the parallelization mapping, other common knobs to tune are maximum batch size, maximum number of tokens, and block size. Besides the parallelization mapping, other common knobs to tune are maximum batch size, maximum number of tokens, and block size.
For prefill engines, usually a small batch size and large `max_num_token` is preferred. For prefill engines, usually a small batch size and large `max_num_token` is preferred.
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Load-based Planner"
--- ---
# Load-based Planner
This document covers load-based planner in `examples/llm/components/planner.py`. This document covers load-based planner in `examples/llm/components/planner.py`.
<Warning> > [!WARNING]
Load-based planner is inoperable as vllm, sglang, and trtllm examples all do not use prefill queues. Please use SLA planner for now. > Load-based planner is inoperable as vllm, sglang, and trtllm examples all do not use prefill queues. Please use SLA planner for now.
</Warning>
<Warning> > [!WARNING]
Bare metal deployment with local connector is deprecated. The only option to deploy load-based planner is via k8s. We will update the examples in this document soon. > Bare metal deployment with local connector is deprecated. The only option to deploy load-based planner is via k8s. We will update the examples in this document soon.
</Warning>
## Load-based Scaling Up/Down Prefill/Decode Workers ## Load-based Scaling Up/Down Prefill/Decode Workers
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Planner"
--- ---
# Planner
The planner monitors the state of the system and adjusts workers to The planner monitors the state of the system and adjusts workers to
ensure that the system runs efficiently. ensure that the system runs efficiently.
...@@ -17,8 +18,7 @@ Key features include: ...@@ -17,8 +18,7 @@ Key features include:
- **Graceful scaling** that ensures no requests are dropped during - **Graceful scaling** that ensures no requests are dropped during
scale-down operations scale-down operations
<Tip> > [!TIP]
**New to SLA Planner?** Start with the [SLA Planner Quick Start Guide](sla-planner-quickstart.md) for a complete, step-by-step workflow. > **New to SLA Planner?** Start with the [SLA Planner Quick Start Guide](sla-planner-quickstart.md) for a complete, step-by-step workflow.
>
**Prerequisites**: SLA-based planner requires pre-deployment profiling (2-4 hours on real silicon or a few minutes using simulator) before deployment. The Quick Start guide includes everything you need. > **Prerequisites**: SLA-based planner requires pre-deployment profiling (2-4 hours on real silicon or a few minutes using simulator) before deployment. The Quick Start guide includes everything you need.
</Tip>