Unverified Commit 39d645e5 authored by Jonathan Tong, committed by GitHub

docs: migrate Fern docs from fern/ into docs/ (#6206)


Signed-off-by: Jont828 <jt572@cornell.edu>
parent d381e6ff
File suppressed by a .gitattributes entry or the file's encoding is unsupported.
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# TensorRT-LLM Prometheus Metrics
## Overview
When running TensorRT-LLM through Dynamo, TensorRT-LLM's Prometheus metrics are automatically passed through and exposed on the `/metrics` endpoint of Dynamo's system port (set with `DYN_SYSTEM_PORT`, which is disabled by default; the examples on this page use 8081). This allows you to access both TensorRT-LLM engine metrics (prefixed with `trtllm_`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.
Additional performance metrics are available via non-Prometheus APIs (see [Non-Prometheus Performance Metrics](#non-prometheus-performance-metrics) below).
TensorRT-LLM natively exposes several Prometheus metrics with the `trtllm_` prefix. The specific metrics available depend on your TensorRT-LLM version.
**For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](../../observability/metrics.md).
**For visualization setup instructions**, see the [Prometheus and Grafana Setup Guide](../../observability/prometheus-grafana.md).
## Environment Variables
| Variable | Description | Default | Example |
|----------|-------------|---------|---------|
| `DYN_SYSTEM_PORT` | System metrics/health port | `-1` (disabled) | `8081` |
## Getting Started Quickly
This is a single-machine example.
### Start Observability Stack
For visualizing metrics with Prometheus and Grafana, start the observability stack. See [Observability Getting Started](../../observability/README.md#getting-started-quickly) for instructions.
### Launch Dynamo Components
Launch a frontend and TensorRT-LLM backend to test metrics:
```bash
# Start frontend (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
$ python -m dynamo.frontend
# Enable system metrics server on port 8081 and enable metrics collection
$ DYN_SYSTEM_PORT=8081 python -m dynamo.trtllm --model <model_name> --publish-events-and-metrics
```
**Note:** The `backend` must be set to `"pytorch"` for metrics collection (enforced in `components/src/dynamo/trtllm/main.py`). TensorRT-LLM's `MetricsCollector` integration has only been tested/validated with the PyTorch backend.
Wait for the TensorRT-LLM worker to start, then send requests and check metrics:
```bash
# Send a request
curl -H 'Content-Type: application/json' \
-d '{
"model": "<model_name>",
"max_completion_tokens": 100,
"messages": [{"role": "user", "content": "Hello"}]
}' \
http://localhost:8000/v1/chat/completions
# Check metrics from the worker
curl -s localhost:8081/metrics | grep "^trtllm_"
```
## Exposed Metrics
TensorRT-LLM exposes metrics in Prometheus Exposition Format text at the `/metrics` HTTP endpoint. All TensorRT-LLM engine metrics use the `trtllm_` prefix and include labels (e.g., `model_name`, `engine_type`, `finished_reason`) to identify the source.
**Note:** TensorRT-LLM uses `model_name` instead of Dynamo's standard `model` label convention.
**Example Prometheus Exposition Format text:**
```
# HELP trtllm_request_success_total Count of successfully processed requests.
# TYPE trtllm_request_success_total counter
trtllm_request_success_total{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm",finished_reason="stop"} 150.0
trtllm_request_success_total{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm",finished_reason="length"} 5.0
# HELP trtllm_time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE trtllm_time_to_first_token_seconds histogram
trtllm_time_to_first_token_seconds_bucket{le="0.01",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 0.0
trtllm_time_to_first_token_seconds_bucket{le="0.05",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 12.0
trtllm_time_to_first_token_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0
trtllm_time_to_first_token_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 8.75
# HELP trtllm_e2e_request_latency_seconds Histogram of end to end request latency in seconds.
# TYPE trtllm_e2e_request_latency_seconds histogram
trtllm_e2e_request_latency_seconds_bucket{le="0.5",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 25.0
trtllm_e2e_request_latency_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0
trtllm_e2e_request_latency_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 45.2
# HELP trtllm_time_per_output_token_seconds Histogram of time per output token in seconds.
# TYPE trtllm_time_per_output_token_seconds histogram
trtllm_time_per_output_token_seconds_bucket{le="0.1",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 120.0
trtllm_time_per_output_token_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0
trtllm_time_per_output_token_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 12.5
# HELP trtllm_request_queue_time_seconds Histogram of time spent in WAITING phase for request.
# TYPE trtllm_request_queue_time_seconds histogram
trtllm_request_queue_time_seconds_bucket{le="1.0",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 140.0
trtllm_request_queue_time_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0
trtllm_request_queue_time_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 32.1
```
**Note:** The specific metrics shown above are examples and may vary depending on your TensorRT-LLM version. Always inspect your actual `/metrics` endpoint for the current list.
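To turn the raw exposition text into numbers, a small parser is often enough. The following is a minimal sketch, not part of Dynamo or TensorRT-LLM: `parse_metrics` and `histogram_mean` are hypothetical helper names, the sample lines are copied from the example above, and the regular expressions assume simple label values (no escaped quotes, no trailing timestamps).

```python
import re
from collections import defaultdict

# Sample lines copied from the example exposition text above.
SAMPLE = """\
# TYPE trtllm_request_success_total counter
trtllm_request_success_total{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm",finished_reason="stop"} 150.0
trtllm_request_success_total{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm",finished_reason="length"} 5.0
trtllm_time_to_first_token_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0
trtllm_time_to_first_token_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 8.75
"""

# Simplified patterns: assume no escaped quotes and no timestamps.
LINE_RE = re.compile(r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return {metric_name: [(labels_dict, value), ...]}, skipping comment lines."""
    samples = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples[match.group("name")].append((labels, float(match.group("value"))))
    return samples


def histogram_mean(samples, base_name):
    """Mean of a histogram, computed as <base>_sum / <base>_count."""
    total = sum(value for _, value in samples[base_name + "_sum"])
    count = sum(value for _, value in samples[base_name + "_count"])
    return total / count if count else float("nan")


metrics = parse_metrics(SAMPLE)
mean_ttft = histogram_mean(metrics, "trtllm_time_to_first_token_seconds")
by_reason = {labels["finished_reason"]: value
             for labels, value in metrics["trtllm_request_success_total"]}
print(f"mean TTFT: {mean_ttft:.4f} s")
print(by_reason)
```

In practice you would feed this the body returned by `curl -s localhost:8081/metrics` instead of the embedded sample.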
### Metric Categories
TensorRT-LLM provides metrics in the following categories (all prefixed with `trtllm_`):
- **Request metrics** - Request success tracking and latency measurements
- **Performance metrics** - Time to first token (TTFT), time per output token (TPOT), and queue time
**Note:** Metrics may change between TensorRT-LLM versions. Always inspect the `/metrics` endpoint for your version.
## Available Metrics
TensorRT-LLM exposes metrics via Dynamo's `/metrics` endpoint with the `trtllm_` prefix. Common metrics include:
- `trtllm_request_success_total` (Counter) - Count of successfully processed requests by finish reason
  - Labels: `model_name`, `engine_type`, `finished_reason`
- `trtllm_e2e_request_latency_seconds` (Histogram) - End-to-end request latency (seconds)
  - Labels: `model_name`, `engine_type`
- `trtllm_time_to_first_token_seconds` (Histogram) - Time to first token, TTFT (seconds)
  - Labels: `model_name`, `engine_type`
- `trtllm_time_per_output_token_seconds` (Histogram) - Time per output token, TPOT (seconds)
  - Labels: `model_name`, `engine_type`
- `trtllm_request_queue_time_seconds` (Histogram) - Time a request spends waiting in the queue (seconds)
  - Labels: `model_name`, `engine_type`
**Note:** The specific metrics available depend on your TensorRT-LLM version. Always inspect your actual `/metrics` endpoint to see the current list of metrics for your version.
TensorRT-LLM provides Prometheus metrics through the `MetricsCollector` class (see [tensorrt_llm/metrics/collector.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/metrics/collector.py)).
## Non-Prometheus Performance Metrics
TensorRT-LLM provides extensive performance data beyond the basic Prometheus metrics. These are not currently exposed to Prometheus.
### Available via Code References
- **RequestPerfMetrics Structure**: [tensorrt_llm/executor/result.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/executor/result.py) - KV cache, timing, speculative decoding metrics
- **Engine Statistics**: `engine.llm.get_stats_async()` - System-wide aggregate statistics
- **KV Cache Events**: `engine.llm.get_kv_cache_events_async()` - Real-time cache operations
### Example RequestPerfMetrics JSON Structure
```json
{
"timing_metrics": {
"arrival_time": 1234567890.123,
"first_scheduled_time": 1234567890.135,
"first_token_time": 1234567890.150,
"last_token_time": 1234567890.300,
"kv_cache_size": 2048576,
"kv_cache_transfer_start": 1234567890.140,
"kv_cache_transfer_end": 1234567890.145
},
"kv_cache_metrics": {
"num_total_allocated_blocks": 100,
"num_new_allocated_blocks": 10,
"num_reused_blocks": 90,
"num_missed_blocks": 5
},
"speculative_decoding": {
"acceptance_rate": 0.85,
"total_accepted_draft_tokens": 42,
"total_draft_tokens": 50
}
}
```
**Note:** These structures may vary depending on your TensorRT-LLM version. Refer to the [TensorRT-LLM source code](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/executor/result.py) for the most up-to-date structure for your version.
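As a rough illustration of how these fields can be combined, the sketch below derives a few per-request figures from the example JSON. The field names come from the example above; the `summarize` helper and the way each derived value is defined (for instance, queue time as `first_scheduled_time - arrival_time`) are assumptions made for illustration, not definitions taken from TensorRT-LLM.

```python
import json

# Subset of the example structure shown above.
perf = json.loads("""
{
  "timing_metrics": {
    "arrival_time": 1234567890.123,
    "first_scheduled_time": 1234567890.135,
    "first_token_time": 1234567890.150,
    "last_token_time": 1234567890.300
  },
  "kv_cache_metrics": {
    "num_total_allocated_blocks": 100,
    "num_reused_blocks": 90
  }
}
""")


def summarize(perf_metrics):
    """Derive illustrative per-request figures (definitions are assumptions)."""
    timing = perf_metrics["timing_metrics"]
    kv = perf_metrics["kv_cache_metrics"]
    return {
        # Time spent waiting before the request was first scheduled.
        "queue_time_s": timing["first_scheduled_time"] - timing["arrival_time"],
        # Time from arrival to the first generated token.
        "ttft_s": timing["first_token_time"] - timing["arrival_time"],
        # Time from arrival to the last generated token.
        "e2e_s": timing["last_token_time"] - timing["arrival_time"],
        # Share of allocated blocks that were served from cache.
        "kv_reuse_ratio": kv["num_reused_blocks"] / kv["num_total_allocated_blocks"],
    }


summary = summarize(perf)
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```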
## Implementation Details
- **Prometheus Integration**: Uses the `MetricsCollector` class from `tensorrt_llm.metrics` (see [collector.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/metrics/collector.py))
- **Dynamo Integration**: Uses the `register_engine_metrics_callback()` function to pass through TRT-LLM's native `trtllm_*` metrics
- **Engine Configuration**: `return_perf_metrics` is set to `True` when `--publish-events-and-metrics` is enabled
- **Initialization**: Metrics appear only after TensorRT-LLM engine initialization completes and at least one request has been processed
- **Metadata**: `MetricsCollector` is initialized with model metadata (model name, engine type)
## Related Documentation
### TensorRT-LLM Metrics
- See the [Non-Prometheus Performance Metrics](#non-prometheus-performance-metrics) section above for detailed performance data and source code references
- [TensorRT-LLM Metrics Collector](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/metrics/collector.py) - Source code reference
### Dynamo Metrics
- [Dynamo Metrics Guide](../../observability/metrics.md) - Complete documentation on Dynamo runtime metrics
- [Prometheus and Grafana Setup](../../observability/prometheus-grafana.md) - Visualization setup instructions
- Dynamo runtime metrics (prefixed with `dynamo_*`) are available at the same `/metrics` endpoint alongside TensorRT-LLM metrics
- Implementation: `lib/runtime/src/metrics.rs` (Rust runtime metrics)
- Metric names: `lib/runtime/src/metrics/prometheus_names.rs` (metric name constants)
- Integration code: `components/src/dynamo/common/utils/prometheus.py` - Prometheus utilities and callback registration
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# Mocker: LLM Engine Simulation in Rust
The Mocker is a lightweight, high-fidelity simulation of an LLM inference engine, implemented entirely in Rust. It replicates the core scheduling, memory management, and timing behaviors of production engines without requiring a GPU, making it invaluable for testing Dynamo's routing, KV cache events, disaggregated serving, and planner components.
## Overview
The mocker simulates:
- **Block-based KV cache management** with LRU eviction
- **Continuous batching scheduler** with watermark-based admission control
- **Prefix caching** with hash-based block deduplication
- **Chunked prefill** for better batching efficiency
- **Realistic timing models** for prefill and decode phases
- **Disaggregated serving** (prefill/decode separation)
- **KV event publishing** for router integration
- **Data parallelism** (multiple DP ranks per engine)
> **Note:** While the mocker uses vLLM as its primary reference implementation, these core components—block-based KV cache management, continuous batching schedulers, LRU evictors, and prefix caching—are fundamental to all modern LLM inference engines, including SGLang and TensorRT-LLM. The architectural patterns simulated here are engine-agnostic and apply broadly across the inference ecosystem.
## Quick Start
### Basic Usage
```bash
# Launch a single mocker worker
python -m dynamo.mocker --model-path Qwen/Qwen3-0.6B
# Launch with custom KV cache configuration
python -m dynamo.mocker \
--model-path Qwen/Qwen3-0.6B \
--num-gpu-blocks-override 8192 \
--block-size 64 \
--max-num-seqs 256
# Launch with timing speedup for faster testing
python -m dynamo.mocker \
--model-path Qwen/Qwen3-0.6B \
--speedup-ratio 10.0
```
### Disaggregated Serving
```bash
# Launch prefill worker
python -m dynamo.mocker \
--model-path Qwen/Qwen3-0.6B \
--is-prefill-worker \
--bootstrap-ports 50100
# Launch decode worker (in another terminal)
python -m dynamo.mocker \
--model-path Qwen/Qwen3-0.6B \
--is-decode-worker
```
### Multiple Workers in One Process
```bash
# Launch 4 mocker workers sharing the same tokio runtime
python -m dynamo.mocker \
--model-path Qwen/Qwen3-0.6B \
--num-workers 4
```
## CLI Arguments
| Argument | Default | Description |
|----------|---------|-------------|
| `--model-path` | Required | HuggingFace model ID or local path for tokenizer |
| `--endpoint` | `dyn://dynamo.backend.generate` | Dynamo endpoint string |
| `--model-name` | Derived from model-path | Model name for API responses |
| `--num-gpu-blocks-override` | 16384 | Number of KV cache blocks |
| `--block-size` | 64 | Tokens per KV cache block |
| `--max-num-seqs` | 256 | Maximum concurrent sequences |
| `--max-num-batched-tokens` | 8192 | Maximum tokens per batch |
| `--enable-prefix-caching` | True | Enable prefix caching |
| `--enable-chunked-prefill` | True | Enable chunked prefill |
| `--watermark` | 0.01 | KV cache watermark (fraction reserved) |
| `--speedup-ratio` | 1.0 | Timing speedup factor |
| `--data-parallel-size` | 1 | Number of DP replicas |
| `--startup-time` | None | Simulated startup delay (seconds) |
| `--planner-profile-data` | None | Path to NPZ file with timing data |
| `--num-workers` | 1 | Workers per process |
| `--stagger-delay` | -1 (auto) | Delay between worker launches (seconds). 0 disables, -1 enables auto mode |
| `--is-prefill-worker` | False | Prefill-only mode |
| `--is-decode-worker` | False | Decode-only mode |
| `--durable-kv-events` | False | Enable durable KV events via JetStream (disables local indexer) |
| `--bootstrap-ports` | None | Ports for P/D rendezvous |
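To get a feel for what the defaults in the table imply, the arithmetic can be sketched as follows. The helper names are hypothetical, and rounding the watermark reservation up to a whole block is an assumption; the mocker's actual rounding may differ.

```python
import math


def kv_capacity(num_gpu_blocks=16384, block_size=64, watermark=0.01):
    """Summarize the KV cache capacity implied by the CLI defaults."""
    # Assumption: the reserved fraction is rounded up to a whole block.
    reserved_blocks = math.ceil(num_gpu_blocks * watermark)
    usable_blocks = num_gpu_blocks - reserved_blocks
    return {
        "total_tokens": num_gpu_blocks * block_size,
        "reserved_blocks": reserved_blocks,
        "usable_tokens": usable_blocks * block_size,
    }


def blocks_for_prompt(prompt_tokens, block_size=64):
    """Number of KV blocks needed to hold a prompt (last block may be partial)."""
    return math.ceil(prompt_tokens / block_size)


capacity = kv_capacity()
print(capacity)
print(blocks_for_prompt(1000))
```

With the defaults, 16384 blocks of 64 tokens hold 1,048,576 tokens in total, and a 1000-token prompt occupies 16 blocks.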
## Architecture
The mocker is organized into several cooperating components that mirror the internal architecture of production LLM inference engines.
### Scheduler
The scheduler implements continuous batching, maintaining three logical queues:
1. **Waiting Queue** - Newly arrived requests awaiting scheduling
2. **Prefill Queue** - Requests scheduled for prefill
3. **Decode Queue** - Requests actively decoding (ordered by age for preemption)
Each iteration, the scheduler receives incoming requests, moves eligible requests from waiting to prefill based on available memory and compute budgets, simulates the prefill phase for queued requests, runs one decode step for all active sequences, and publishes metrics about current resource utilization.
When resources become constrained, the scheduler employs preemption: the oldest decoding request is evicted back to the waiting queue, its KV blocks are freed, and it will be rescheduled later. This mirrors how real engines handle memory pressure.
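The queue movement and preemption described above can be sketched in a few lines. This is a toy model under simplifying assumptions, not the mocker's Rust implementation: `ToyScheduler` and `Request` are hypothetical names, the watermark is expressed directly as a number of reserved blocks, prefill is treated as finishing within one step, and each request's block need is fixed up front.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    blocks_needed: int


class ToyScheduler:
    def __init__(self, total_blocks, reserved_blocks):
        self.free_blocks = total_blocks
        self.reserved_blocks = reserved_blocks  # watermark, expressed in blocks
        self.waiting = deque()
        self.prefill = deque()
        self.decode = deque()  # oldest request first

    def submit(self, request):
        self.waiting.append(request)

    def _admit(self):
        # Admit from the head of the waiting queue while the watermark holds.
        while self.waiting:
            head = self.waiting[0]
            if self.free_blocks - head.blocks_needed < self.reserved_blocks:
                break
            self.free_blocks -= head.blocks_needed
            self.prefill.append(self.waiting.popleft())

    def preempt_oldest(self):
        """Evict the oldest decoding request back to the waiting queue."""
        if not self.decode:
            return None
        victim = self.decode.popleft()
        self.free_blocks += victim.blocks_needed
        self.waiting.appendleft(victim)
        return victim

    def step(self):
        self._admit()
        # Simplification: every prefill completes within the same step.
        while self.prefill:
            self.decode.append(self.prefill.popleft())
        return len(self.decode)  # sequences taking one decode step


sched = ToyScheduler(total_blocks=10, reserved_blocks=1)
for rid in ("a", "b", "c"):
    sched.submit(Request(rid, blocks_needed=4))
active = sched.step()          # "a" and "b" fit; "c" would breach the watermark
victim = sched.preempt_oldest()
print(active, victim.rid, sched.free_blocks)
```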
### KV Block Manager
The block manager tracks KV cache blocks using reference counting and an LRU eviction policy. Blocks exist in one of two pools:
- **Active Pool** - Blocks currently in use by one or more sequences, tracked with reference counts
- **Inactive Pool** - Blocks no longer actively referenced but kept for potential reuse (prefix caching)
When a sequence needs blocks, the manager first checks if they already exist (cache hit). If not, it allocates new blocks, potentially evicting the least-recently-used inactive blocks to make room. When a sequence completes or is preempted, its blocks are either moved to the inactive pool (for potential reuse) or freed entirely.
The following diagram illustrates the block lifecycle, based on vLLM's block manager design:
```
                        ┌───── Cache hit (Use) ────┐
                        │                          │
                        ▼                          │
┌───────────┐       ┌───────────┐       ┌──────────┴──────┐       ┌───────────┐
│ New Block │──────►│  Active   │──────►│    Inactive     │──────►│   Freed   │
└───────────┘ alloc │   Pool    │ deref │      Pool       │ evict └───────────┘
                    │(ref_count)│       │   (LRU order)   │
                    └─────┬─────┘       └─────────────────┘
                          │ destroy (preemption)
                    ┌───────────┐
                    │   Freed   │
                    └───────────┘
```
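The pool transitions in the diagram can be expressed as a small state machine. The sketch below is illustrative only: `ToyBlockManager` and its method names are hypothetical, blocks are identified by an opaque hash string, and capacity is counted across both pools.

```python
from collections import OrderedDict


class ToyBlockManager:
    def __init__(self, capacity):
        self.capacity = capacity
        self.active = {}               # block_hash -> reference count
        self.inactive = OrderedDict()  # block_hash -> None, least recent first

    def use(self, block_hash):
        """Acquire a block; returns 'hit', 'reuse', or 'alloc'."""
        if block_hash in self.active:
            self.active[block_hash] += 1
            return "hit"
        if block_hash in self.inactive:
            # Cache hit on an inactive block: move it back to the active pool.
            del self.inactive[block_hash]
            self.active[block_hash] = 1
            return "reuse"
        if len(self.active) + len(self.inactive) >= self.capacity:
            if not self.inactive:
                raise MemoryError("no inactive block available to evict")
            self.inactive.popitem(last=False)  # evict least recently used
        self.active[block_hash] = 1
        return "alloc"

    def deref(self, block_hash):
        """Drop one reference; the block becomes inactive at zero references."""
        self.active[block_hash] -= 1
        if self.active[block_hash] == 0:
            del self.active[block_hash]
            self.inactive[block_hash] = None

    def destroy(self, block_hash):
        """Preemption path: free the block without keeping it for reuse."""
        self.active.pop(block_hash, None)


mgr = ToyBlockManager(capacity=2)
trace = [mgr.use("a"), mgr.use("a")]
mgr.deref("a")
mgr.deref("a")                 # "a" is now inactive
trace.append(mgr.use("a"))     # reused from the inactive pool
mgr.deref("a")
trace.append(mgr.use("b"))
trace.append(mgr.use("c"))     # pool is full, so inactive "a" is evicted
print(trace)
```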
### Evictor
The LRU evictor maintains blocks ordered by their last access time, enabling O(1) eviction of the oldest unused block. It supports both normal insertion (for completed sequences) and front-insertion (for preempted sequences that should be evicted first if memory pressure continues).
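One way to get both insertion modes with constant-time eviction is an ordered dictionary. This is a sketch of the idea, not the mocker's data structure; `LRUEvictor`, `insert`, `push_front`, and `evict` are hypothetical names.

```python
from collections import OrderedDict


class LRUEvictor:
    def __init__(self):
        # The first entry is always the next eviction candidate.
        self._blocks = OrderedDict()

    def insert(self, block_id):
        """Normal insertion: the block becomes the most recently used."""
        self._blocks.pop(block_id, None)
        self._blocks[block_id] = None

    def push_front(self, block_id):
        """Front insertion: the block becomes the next eviction candidate."""
        self._blocks.pop(block_id, None)
        self._blocks[block_id] = None
        self._blocks.move_to_end(block_id, last=False)

    def remove(self, block_id):
        """Take a block out of the evictor (for example, on a cache hit)."""
        return self._blocks.pop(block_id, "missing") != "missing"

    def evict(self):
        """Pop the oldest block, or return None if the evictor is empty."""
        if not self._blocks:
            return None
        block_id, _ = self._blocks.popitem(last=False)
        return block_id


evictor = LRUEvictor()
evictor.insert(1)       # from a completed sequence
evictor.insert(2)       # from a completed sequence
evictor.push_front(3)   # from a preempted sequence
order = [evictor.evict() for _ in range(4)]
print(order)
```

The preempted block is evicted first even though it was inserted last, which matches the behavior described above.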
### Sequence Tracking
Each active request is tracked as a sequence, managing its token blocks and generation state. As tokens are generated, the sequence tracks which blocks are partial (still being filled) versus full (complete and hashable for prefix caching). When a partial block fills up, it gets "promoted" to a full block with a content-based hash, enabling future cache hits from requests with matching prefixes.
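The promotion step can be sketched as follows. This is an assumption-laden illustration: `ToySequence` is a hypothetical name, and chaining SHA-256 over the parent hash plus the block's tokens is one common way to build prefix-sensitive block hashes, not necessarily the hash function the mocker uses.

```python
import hashlib


class ToySequence:
    def __init__(self, block_size):
        self.block_size = block_size
        self.full_hashes = []  # one chained hash per full block
        self.partial = []      # tokens in the block still being filled

    def append(self, token):
        self.partial.append(token)
        if len(self.partial) == self.block_size:
            self._promote()

    def _promote(self):
        # Chain the parent hash so equal blocks at different prefixes differ.
        parent = self.full_hashes[-1] if self.full_hashes else ""
        payload = parent + "|" + ",".join(str(t) for t in self.partial)
        self.full_hashes.append(hashlib.sha256(payload.encode()).hexdigest())
        self.partial = []


def build(tokens, block_size=4):
    seq = ToySequence(block_size)
    for token in tokens:
        seq.append(token)
    return seq


seq_a = build([1, 2, 3, 4, 5, 6, 7, 8, 9])
seq_b = build([1, 2, 3, 4, 50, 60, 70, 80])
shared_first = seq_a.full_hashes[0] == seq_b.full_hashes[0]
shared_second = seq_a.full_hashes[1] == seq_b.full_hashes[1]
print(shared_first, shared_second, seq_a.partial)
```

Two requests that share their first four tokens produce the same first block hash, which is what makes a prefix cache hit possible.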
### Performance Model
The mocker supports two timing prediction modes:
**Polynomial Model (Default):** Uses hardcoded polynomial formulas that approximate typical GPU behavior. Prefill time scales quadratically with token count, while decode time depends on the total active KV cache size.
**Interpolated Model:** Loads actual profiling data from an NPZ file containing measured prefill and decode latencies. The mocker interpolates between data points to predict timing for any input size. This enables high-fidelity simulation matching a specific hardware configuration.
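The two modes can be sketched side by side. Everything numeric here is made up for illustration: the polynomial coefficients are hypothetical, the profile points are invented, and treating `--speedup-ratio` as a divisor on predicted latency is an assumption about how the flag is applied.

```python
from bisect import bisect_left


def poly_prefill_ms(num_tokens, a=1e-5, b=0.02, c=5.0):
    """Quadratic prefill model; a, b, c are hypothetical coefficients."""
    return a * num_tokens ** 2 + b * num_tokens + c


def interpolate(xs, ys, x):
    """Piecewise-linear interpolation, clamped at both ends (xs ascending)."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def apply_speedup(latency_ms, speedup_ratio=1.0):
    """Assumption: the speedup ratio divides the simulated latency."""
    return latency_ms / speedup_ratio


# Invented profile points: prompt length in tokens -> measured prefill latency (ms).
profile_tokens = [128, 512, 2048]
profile_ms = [10.0, 30.0, 150.0]

predicted_poly = poly_prefill_ms(1000)
predicted_interp = interpolate(profile_tokens, profile_ms, 320)
print(predicted_poly, predicted_interp, apply_speedup(predicted_interp, 10.0))
```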
### Bootstrap Rendezvous (Disaggregated Serving)
For disaggregated prefill/decode deployments, prefill and decode workers coordinate via a simple TCP-based rendezvous protocol. The decode worker connects to the prefill worker's bootstrap port and waits until the prefill phase completes and KV cache is ready. Either side can arrive first—the rendezvous completes when both are ready.
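The "either side can arrive first" property is the interesting part of the protocol. The sketch below illustrates only that property, replacing the TCP connection with in-process `threading.Event` objects; `ToyRendezvous` and its method names are hypothetical and do not reflect the mocker's wire protocol.

```python
import threading


class ToyRendezvous:
    def __init__(self):
        self._lock = threading.Lock()
        self._ready = {}  # request_id -> threading.Event

    def _event(self, request_id):
        # Whichever side arrives first creates the event; the other reuses it.
        with self._lock:
            return self._ready.setdefault(request_id, threading.Event())

    def prefill_done(self, request_id):
        """Called by the prefill side once the KV cache is ready."""
        self._event(request_id).set()

    def decode_wait(self, request_id, timeout=None):
        """Called by the decode side; returns True once prefill is done."""
        return self._event(request_id).wait(timeout)


rendezvous = ToyRendezvous()

# Case 1: prefill finishes before the decode side arrives.
rendezvous.prefill_done("req-1")
prefill_first = rendezvous.decode_wait("req-1", timeout=0.1)

# Case 2: the decode side arrives first and blocks until prefill completes.
timer = threading.Timer(0.05, rendezvous.prefill_done, args=("req-2",))
timer.start()
decode_first = rendezvous.decode_wait("req-2", timeout=2.0)
timer.join()

print(prefill_first, decode_first)
```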
## Integration with Dynamo
### KV Event Publishing
When prefix caching is enabled, the mocker publishes KV cache events to the distributed runtime. These events notify the system when blocks are stored (new content cached) or removed (evicted). This enables the KV-aware router to make intelligent routing decisions based on which workers have which prefixes cached.
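A router-side consumer of these events might look roughly like the following. This is a hypothetical sketch rather than Dynamo's router: `ToyKvIndex`, the `"stored"`/`"removed"` event strings, and the prefix-overlap scoring are illustrative assumptions.

```python
from collections import defaultdict


class ToyKvIndex:
    def __init__(self):
        self._workers_by_block = defaultdict(set)  # block_hash -> worker ids

    def apply(self, worker_id, event_type, block_hashes):
        """Update the index from a stored/removed event."""
        for block_hash in block_hashes:
            if event_type == "stored":
                self._workers_by_block[block_hash].add(worker_id)
            elif event_type == "removed":
                self._workers_by_block[block_hash].discard(worker_id)

    def overlap(self, worker_id, request_hashes):
        """Count the leading request blocks the worker already has cached."""
        matched = 0
        for block_hash in request_hashes:
            if worker_id not in self._workers_by_block.get(block_hash, ()):
                break
            matched += 1
        return matched

    def best_worker(self, workers, request_hashes):
        """Pick the worker with the longest cached prefix."""
        return max(workers, key=lambda w: self.overlap(w, request_hashes))


index = ToyKvIndex()
index.apply("w0", "stored", ["a", "b"])
index.apply("w1", "stored", ["a"])
request = ["a", "b", "c"]
first_choice = index.best_worker(["w0", "w1"], request)

index.apply("w0", "removed", ["a"])   # w0 evicts the first block
second_choice = index.best_worker(["w0", "w1"], request)
print(first_choice, second_choice)
```

A real router would also weigh load, as described under Metrics Publishing; this sketch scores on prefix overlap alone.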
### Metrics Publishing
Each scheduler publishes metrics about its current state, including the number of active decode blocks per DP rank. The router uses these metrics for load-aware routing decisions.
## Testing Scenarios
The mocker is particularly useful for:
1. **Router Testing** - Validate KV-aware routing without GPUs
2. **Planner Testing** - Test SLA-based planners with realistic timing
3. **Fault Tolerance** - Test request migration, graceful shutdown
4. **Disaggregation** - Test P/D separation and KV transfer coordination
5. **Performance Modeling** - Prototype scheduling policies
6. **CI/CD** - Fast integration tests without hardware dependencies
## Comparison with Real Engines
| Feature | Real Engine | Mocker |
|---------|-------------|--------|
| GPU Required | Yes | No |
| Block Manager | Paged KV cache | Simulated blocks |
| Scheduler | Continuous batching | Continuous batching |
| Prefix Caching | Hash-based | Hash-based |
| Chunked Prefill | Supported | Supported |
| Preemption | Recompute/swap | Recompute (simulated) |
| Timing | Real execution | Model-based |
| KV Events | Native | Compatible |
| Data Parallelism | Multi-GPU | Simulated |
## Feature Gaps (WIP)
The following features are not yet supported by the mocker:
- **KV transfer latency simulation** - Disaggregated serving simulates the rendezvous handshake but does not model the actual KV cache transfer time between prefill and decode workers
- **Multi-tier memory** - No support for offloading KV cache to CPU/disk or onboarding back to GPU; potential future integration with KVBM
- **Multimodal support** - Currently only simulates text token processing; no vision encoder or cross-attention simulation
- **Native Rust reference counting** - Work in progress to use native Rc/Arc for block reference counting, enabling natural RAII patterns for simpler tracking
...@@ -19,7 +19,7 @@ limitations under the License. ...@@ -19,7 +19,7 @@ limitations under the License.
The Dynamo KVBM is a distributed KV-cache block management system designed for scalable LLM inference. It cleanly separates memory management from inference runtimes (vLLM, TensorRT-LLM, and SGLang), enabling GPU↔CPU↔Disk/Remote tiering, asynchronous block offload/onboard, and efficient block reuse. The Dynamo KVBM is a distributed KV-cache block management system designed for scalable LLM inference. It cleanly separates memory management from inference runtimes (vLLM, TensorRT-LLM, and SGLang), enabling GPU↔CPU↔Disk/Remote tiering, asynchronous block offload/onboard, and efficient block reuse.
![A block diagram showing a layered architecture view of Dynamo KV Block manager.](../../../docs/images/kvbm-architecture.png) ![A block diagram showing a layered architecture view of Dynamo KV Block manager.](../../../docs/assets/img/kvbm-architecture.png)
## Feature Highlights ## Feature Highlights
...@@ -35,7 +35,7 @@ The Dynamo KVBM is a distributed KV-cache block management system designed for s ...@@ -35,7 +35,7 @@ The Dynamo KVBM is a distributed KV-cache block management system designed for s
pip install kvbm pip install kvbm
``` ```
See the [support matrix](../../../docs/reference/support-matrix.md) for version compatibility questions. See the [support matrix](../../../docs/pages/reference/support-matrix.md) for version compatibility questions.
## Build from Source ## Build from Source
...@@ -115,7 +115,7 @@ DYN_KVBM_CPU_CACHE_GB=100 vllm serve \ ...@@ -115,7 +115,7 @@ DYN_KVBM_CPU_CACHE_GB=100 vllm serve \
Qwen/Qwen3-8B Qwen/Qwen3-8B
``` ```
For more detailed integration with dynamo, disaggregated serving support and benchmarking, please check [vllm-setup](../../../docs/components/kvbm/kvbm_guide.md#run-kvbm-in-dynamo-with-vllm) For more detailed integration with dynamo, disaggregated serving support and benchmarking, please check [vllm-setup](../../../docs/pages/components/kvbm/kvbm-guide.md#run-kvbm-in-dynamo-with-vllm)
### TensorRT-LLM ### TensorRT-LLM
...@@ -137,11 +137,11 @@ DYN_KVBM_CPU_CACHE_GB=100 trtllm-serve Qwen/Qwen3-8B \ ...@@ -137,11 +137,11 @@ DYN_KVBM_CPU_CACHE_GB=100 trtllm-serve Qwen/Qwen3-8B \
--extra_llm_api_options /tmp/kvbm_llm_api_config.yaml --extra_llm_api_options /tmp/kvbm_llm_api_config.yaml
``` ```
For more detailed integration with dynamo and benchmarking, please check [trtllm-setup](../../../docs/components/kvbm/kvbm_guide.md#run-kvbm-in-dynamo-with-tensorrt-llm) For more detailed integration with dynamo and benchmarking, please check [trtllm-setup](../../../docs/pages/components/kvbm/kvbm-guide.md#run-kvbm-in-dynamo-with-tensorrt-llm)
## 📚 Docs ## 📚 Docs
- [Architecture](../../../docs/components/kvbm/README.md#architecture) - [Architecture](../../../docs/pages/components/kvbm/README.md#architecture)
- [Design Deepdive](../../../docs/design_docs/kvbm_design.md) - [Design Deepdive](../../../docs/pages/design-docs/kvbm-design.md)
- [NIXL Overview](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md) - [NIXL Overview](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)
...@@ -50,7 +50,7 @@ maturin develop --uv ...@@ -50,7 +50,7 @@ maturin develop --uv
### Prerequisite ### Prerequisite
See [README.md](../../../docs/development/runtime-guide.md#prerequisites). See [README.md](../../../docs/pages/development/runtime-guide.md#prerequisites).
### Hello World Example ### Hello World Example
......
../../docs/development/runtime-guide.md
\ No newline at end of file
...@@ -3,7 +3,7 @@ ...@@ -3,7 +3,7 @@
Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA Dynamo. Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA Dynamo.
> **Prerequisites:** This guide assumes you have already installed the Dynamo Kubernetes Platform. > **Prerequisites:** This guide assumes you have already installed the Dynamo Kubernetes Platform.
> If not, follow the **[Kubernetes Deployment Guide](../docs/kubernetes/README.md)** first. > If not, follow the **[Kubernetes Deployment Guide](../docs/pages/kubernetes/README.md)** first.
## Available Recipes ## Available Recipes
...@@ -67,8 +67,8 @@ Each complete recipe follows this standard structure: ...@@ -67,8 +67,8 @@ Each complete recipe follows this standard structure:
The recipes require the Dynamo Kubernetes Platform to be installed. Follow the installation guide: The recipes require the Dynamo Kubernetes Platform to be installed. Follow the installation guide:
- **[Kubernetes Deployment Guide](../docs/kubernetes/README.md)** - Quickstart (~10 minutes) - **[Kubernetes Deployment Guide](../docs/pages/kubernetes/README.md)** - Quickstart (~10 minutes)
- **[Detailed Installation Guide](../docs/kubernetes/installation_guide.md)** - Advanced options - **[Detailed Installation Guide](../docs/pages/kubernetes/installation-guide.md)** - Advanced options
**2. GPU Cluster Requirements** **2. GPU Cluster Requirements**
...@@ -289,18 +289,18 @@ image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:x.y.z ...@@ -289,18 +289,18 @@ image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:x.y.z
- Review pod logs: `kubectl logs <pod-name> -n ${NAMESPACE}` - Review pod logs: `kubectl logs <pod-name> -n ${NAMESPACE}`
**For more troubleshooting:** **For more troubleshooting:**
- [Kubernetes Deployment Guide](../docs/kubernetes/README.md#troubleshooting) - [Kubernetes Deployment Guide](../docs/pages/kubernetes/README.md#troubleshooting)
- [Observability Documentation](../docs/kubernetes/observability/) - [Observability Documentation](../docs/pages/kubernetes/observability/)
## Related Documentation ## Related Documentation
- **[Kubernetes Deployment Guide](../docs/kubernetes/README.md)** - Platform installation and concepts - **[Kubernetes Deployment Guide](../docs/pages/kubernetes/README.md)** - Platform installation and concepts
- **[API Reference](../docs/kubernetes/api_reference.md)** - DynamoGraphDeployment CRD specification - **[API Reference](../docs/pages/kubernetes/api-reference.md)** - DynamoGraphDeployment CRD specification
- **[vLLM Backend Guide](../docs/backends/vllm/README.md)** - vLLM-specific features - **[vLLM Backend Guide](../docs/pages/backends/vllm/README.md)** - vLLM-specific features
- **[SGLang Backend Guide](../docs/backends/sglang/README.md)** - SGLang-specific features - **[SGLang Backend Guide](../docs/pages/backends/sglang/README.md)** - SGLang-specific features
- **[TensorRT-LLM Backend Guide](../docs/backends/trtllm/README.md)** - TensorRT-LLM features - **[TensorRT-LLM Backend Guide](../docs/pages/backends/trtllm/README.md)** - TensorRT-LLM features
- **[Observability](../docs/kubernetes/observability/)** - Monitoring and logging - **[Observability](../docs/pages/kubernetes/observability/)** - Monitoring and logging
- **[Benchmarking Guide](../docs/benchmarks/benchmarking.md)** - Performance testing - **[Benchmarking Guide](../docs/pages/benchmarks/benchmarking.md)** - Performance testing
## Contributing ## Contributing
......
...@@ -13,7 +13,7 @@ Production-ready deployments for **DeepSeek-R1** (671B MoE) across multiple back ...@@ -13,7 +13,7 @@ Production-ready deployments for **DeepSeek-R1** (671B MoE) across multiple back
## Prerequisites ## Prerequisites
1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md) 1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/pages/kubernetes/README.md)
2. **GPU cluster** with H200 or GB200 GPUs matching the configuration requirements 2. **GPU cluster** with H200 or GB200 GPUs matching the configuration requirements
3. **HuggingFace token** with access to DeepSeek models 3. **HuggingFace token** with access to DeepSeek models
4. **High-bandwidth networking** — InfiniBand or RoCE recommended for multi-node deployments 4. **High-bandwidth networking** — InfiniBand or RoCE recommended for multi-node deployments
......
...@@ -13,7 +13,7 @@ This recipe deploys DeepSeek-R1 using vLLM in a disaggregated prefill/decode set ...@@ -13,7 +13,7 @@ This recipe deploys DeepSeek-R1 using vLLM in a disaggregated prefill/decode set
### 0) Prerequisites: Install the platform ### 0) Prerequisites: Install the platform
Follow the Kubernetes deployment guide to install the Dynamo platform and prerequisites (CRDs/operator, etc.): Follow the Kubernetes deployment guide to install the Dynamo platform and prerequisites (CRDs/operator, etc.):
- `docs/kubernetes/README.md` - `docs/pages/kubernetes/README.md`
Ensure you have a GPU-enabled cluster with sufficient capacity (32x H100/H200 "Hopper" across 4 nodes), and that the NVIDIA GPU Operator is healthy. Ensure you have a GPU-enabled cluster with sufficient capacity (32x H100/H200 "Hopper" across 4 nodes), and that the NVIDIA GPU Operator is healthy.
......
...@@ -12,7 +12,7 @@ Production-ready deployments for **Llama-3.3-70B-Instruct** using vLLM with FP8 ...@@ -12,7 +12,7 @@ Production-ready deployments for **Llama-3.3-70B-Instruct** using vLLM with FP8
## Prerequisites ## Prerequisites
1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md) 1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/pages/kubernetes/README.md)
2. **GPU cluster** with H100 or H200 GPUs matching the configuration requirements 2. **GPU cluster** with H100 or H200 GPUs matching the configuration requirements
3. **HuggingFace token** with access to Llama models 3. **HuggingFace token** with access to Llama models
......
...@@ -11,7 +11,7 @@ Production-ready deployments for **Qwen3-235B-A22B** (MoE model with 22B active ...@@ -11,7 +11,7 @@ Production-ready deployments for **Qwen3-235B-A22B** (MoE model with 22B active
## Prerequisites ## Prerequisites
1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md) 1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/pages/kubernetes/README.md)
2. **GPU cluster** with H100/H200 GPUs (high memory recommended) 2. **GPU cluster** with H100/H200 GPUs (high memory recommended)
3. **HuggingFace token** with access to Qwen models 3. **HuggingFace token** with access to Qwen models
......