Unverified commit 0a2a820b, authored by Anant Sharma, committed by GitHub

docs: move all md files from components to docs (#3440)


Signed-off-by: Anant Sharma <anants@nvidia.com>
Co-authored-by: Anish <80174047+athreesh@users.noreply.github.com>
parent b640f283
## Encode-Prefill-Decode (EPD) Flow with NIXL
# Encode-Prefill-Decode (EPD) Flow with NIXL
For high-performance multimodal inference with large embeddings, Dynamo supports a specialized **Encode-Prefill-Decode (EPD)** flow using **NIXL (RDMA)** for zero-copy tensor transfer.
### Enabling the Feature
## Enabling the Feature
This is an experimental feature that requires using a specific TensorRT-LLM commit.
To enable it, build the dynamo container with the `--tensorrtllm-commit` flag, followed by the commit hash:
......@@ -11,14 +11,14 @@ To enable it build the dynamo container with the `--tensorrtllm-commit` flag, fo
./container/build.sh --framework trtllm --tensorrtllm-commit b4065d8ca64a64eee9fdc64b39cb66d73d4be47c
```
### Key Features
## Key Features
- **High Performance**: Zero-copy RDMA transfer for embeddings
- **Dynamic Shape Allocation**: Automatically handles variable embedding shapes per image
- **Multi-Format Support**: Works with tensor files (`.pt`) and dictionary-based embeddings
- **Hybrid Transfer**: Large tensors via NIXL, small metadata via JSON
### How to use
## How to use
```bash
cd $DYNAMO_HOME/components/backends/trtllm
......@@ -27,7 +27,7 @@ cd $DYNAMO_HOME/components/backends/trtllm
./launch/epd_disagg.sh
```
### Configuration
## Configuration
The EPD flow uses a dedicated **Encode Worker** that runs separately from the Prefill and Decode workers. The `ENCODE_ENDPOINT` environment variable specifies how the Prefill worker communicates with the Encode worker:
......@@ -49,7 +49,7 @@ For tensor file size protection, use the `--max-file-size-mb "$MAX_FILE_SIZE_MB"
export MAX_FILE_SIZE_MB=50
```
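As a rough illustration of what a file size guard like this does, here is a minimal Python sketch. It is not Dynamo's implementation: the helper names `max_file_size_bytes` and `check_file_size`, the fallback of 50, and the reading of "MB" as 1024 * 1024 bytes are all assumptions made for the example.

```python
import os


def max_file_size_bytes(default_mb: int = 50) -> int:
    """Read MAX_FILE_SIZE_MB from the environment and convert it to bytes.

    Assumption: 1 MB is treated as 1024 * 1024 bytes, and 50 is only an
    illustrative fallback, not a documented default.
    """
    return int(os.environ.get("MAX_FILE_SIZE_MB", default_mb)) * 1024 * 1024


def check_file_size(path: str, limit_bytes: int) -> None:
    """Raise ValueError if the file at `path` is larger than `limit_bytes`."""
    size = os.path.getsize(path)
    if size > limit_bytes:
        raise ValueError(f"{path} is {size} bytes, above the {limit_bytes}-byte limit")
```

A caller would run `check_file_size(path, max_file_size_bytes())` before loading an embedding file, so an oversized file is rejected before it is read into memory.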
### Architecture Overview
## Architecture Overview
The EPD flow implements a **3-worker architecture** for high-performance multimodal inference:
......@@ -57,9 +57,9 @@ The EPD flow implements a **3-worker architecture** for high-performance multimo
- **Prefill Worker**: Handles initial context processing and KV-cache generation
- **Decode Worker**: Performs streaming token generation
### Request Flow Diagrams
## Request Flow Diagrams
#### Prefill-First Disaggregation Strategy
### Prefill-First Disaggregation Strategy
```mermaid
sequenceDiagram
......@@ -103,7 +103,7 @@ sequenceDiagram
Gateway->>Client: Final response + [DONE]
```
#### Decode-First Disaggregation Strategy
### Decode-First Disaggregation Strategy
```mermaid
sequenceDiagram
......@@ -155,7 +155,7 @@ sequenceDiagram
Gateway->>Client: Final response + [DONE]
```
### How the System Works
## How the System Works
1. **Request Processing**: Multimodal requests containing embedding file paths or URLs are routed based on the disaggregation strategy
2. **Multimodal Loading**: EncodeWorker loads large embedding files and extracts auxiliary metadata
......@@ -163,7 +163,7 @@ sequenceDiagram
4. **Dynamic Allocation**: Consumer workers allocate tensors with exact shapes received from EncodeWorker
5. **Reconstruction**: Original embedding format (dictionary or tensor) is reconstructed for model processing
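The metadata, allocation, and reconstruction steps can be sketched in plain Python. This is a hypothetical illustration only: it uses no Dynamo or NIXL API, the function names `describe`, `allocate`, and `transfer` are invented for the example, and an in-memory copy stands in for the RDMA transfer. The only idea it shows is that small shape metadata travels as JSON while the consumer allocates buffers of the exact size before the bulk data arrives.

```python
import json
import math
from array import array


def describe(embeddings: dict) -> str:
    """Producer side: serialize only the tensor shapes as small JSON metadata."""
    return json.dumps({name: shape for name, (shape, _) in embeddings.items()})


def allocate(metadata_json: str) -> dict:
    """Consumer side: allocate one zero-filled float buffer per received shape."""
    shapes = json.loads(metadata_json)
    return {
        name: (shape, array("f", [0.0]) * math.prod(shape))
        for name, shape in shapes.items()
    }


def transfer(embeddings: dict, buffers: dict) -> None:
    """Stand-in for the bulk transfer: copy each payload into its buffer."""
    for name, (_, data) in embeddings.items():
        _, dst = buffers[name]
        if len(dst) != len(data):
            raise ValueError(f"size mismatch for {name}")
        dst[:] = data


# Hypothetical dictionary-based embedding: name -> (shape, flat float data).
source = {"image_embeds": ([2, 3], array("f", [1, 2, 3, 4, 5, 6]))}
meta = describe(source)
buffers = allocate(meta)
transfer(source, buffers)
```

After `transfer`, `buffers` has the same dictionary layout as `source`, which corresponds to the reconstruction step in the list above.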
### Example Request
## Example Request
The request format is identical to regular multimodal requests:
......
......@@ -21,7 +21,7 @@ TRTLLM supports multimodal models with dynamo. You can provide multimodal inputs
Please note that you should provide **either image URLs or embedding file paths** in a single request.
### Aggregated
## Aggregated
Here are quick steps to launch Llama-4 Maverick BF16 in aggregated mode:
```bash
......@@ -32,9 +32,9 @@ export SERVED_MODEL_NAME="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
export MODEL_PATH="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
./launch/agg.sh
```
### Example Requests
## Example Requests
#### With Image URL
### With Image URL
Below is an example of an image being sent to the `Llama-4-Maverick-17B-128E-Instruct` model:
......@@ -69,7 +69,7 @@ Response :
{"id":"unknown-id","choices":[{"index":0,"message":{"content":"The image depicts a serene landscape featuring a large rock formation, likely El Capitan in Yosemite National Park, California. The scene is characterized by a winding road that curves from the bottom-right corner towards the center-left of the image, with a few rocks and trees lining its edge.\n\n**Key Features:**\n\n* **Rock Formation:** A prominent, tall, and flat-topped rock formation dominates the center of the image.\n* **Road:** A paved road winds its way through the landscape, curving from the bottom-right corner towards the center-left.\n* **Trees and Rocks:** Trees are visible on both sides of the road, with rocks scattered along the left side.\n* **Sky:** The sky above is blue, dotted with white clouds.\n* **Atmosphere:** The overall atmosphere of the","refusal":null,"tool_calls":null,"role":"assistant","function_call":null,"audio":null},"finish_reason":"stop","logprobs":null}],"created":1753322607,"model":"meta-llama/Llama-4-Maverick-17B-128E-Instruct","service_tier":null,"system_fingerprint":null,"object":"chat.completion","usage":null}
```
### Disaggregated
## Disaggregated
Here are quick steps to launch in disaggregated mode.
......@@ -93,11 +93,11 @@ In general, disaggregated serving can run on a single node, provided the model f
To deploy `Llama-4-Maverick-17B-128E-Instruct` in disaggregated mode, you will need to follow the multi-node setup instructions, which can be found [here](./multinode/multinode-multimodal-example.md).
### Using Pre-computed Embeddings (Experimental)
## Using Pre-computed Embeddings (Experimental)
Dynamo with TensorRT-LLM supports providing pre-computed embeddings directly in an inference request. This bypasses the need for the model to process an image and generate embeddings itself, which is useful for performance optimization or when working with custom, pre-generated embeddings.
#### How to Use
### How to Use
Once the container is built, you can send requests with paths to local embedding files.
......@@ -107,7 +107,7 @@ Once the container is built, you can send requests with paths to local embedding
When a request with a supported embedding file is received, Dynamo will load the tensor from the file and pass it directly to the model for inference, skipping the image-to-embedding pipeline.
#### Example Request
### Example Request
Here is an example of how to send a request with a pre-computed embedding file.
......@@ -135,7 +135,7 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '
"max_tokens": 160
}'
```
### Encode-Prefill-Decode (EPD) Flow with NIXL
## Encode-Prefill-Decode (EPD) Flow with NIXL
Dynamo with the TensorRT-LLM backend supports multimodal models in Encode -> Prefill -> Decode fashion, enabling you to process embeddings separately in a dedicated worker. For detailed setup instructions, example requests, and best practices, see the [Multimodal EPD Support Guide](./multimodal_epd.md).
......
......@@ -44,7 +44,7 @@ Before you begin, ensure you have completed the initial environment configuratio
The following sections provide specific instructions for deploying `meta-llama/Llama-4-Maverick-17B-128E-Instruct`, including environment variable setup and launch commands. These steps can be adapted for other large multimodal models.
### Environment Variable Setup
## Environment Variable Setup
Assuming you have already allocated your nodes via `salloc`, and are
inside an interactive shell on one of the allocated nodes, set the
......
......@@ -5,7 +5,7 @@ SPDX-License-Identifier: Apache-2.0
# LLM Deployment using vLLM
This directory contains a Dynamo vllm engine and reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.
This directory contains reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.
## Use the Latest Release
......@@ -153,7 +153,7 @@ Below we provide a selected list of advanced deployments. Please open up an issu
### Kubernetes Deployment
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [vLLM Kubernetes Deployment Guide](deploy/README.md)
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [vLLM Kubernetes Deployment Guide](/components/backends/vllm/deploy/README.md)
## Configuration
......
......@@ -7,7 +7,7 @@ SPDX-License-Identifier: Apache-2.0
Dynamo supports running Deepseek R1 with data parallel attention and wide expert parallelism. Each data parallel attention rank is a separate dynamo component that will emit its own KV Events and Metrics. vLLM controls the expert parallelism using the flag `--enable-expert-parallel`.
# Instructions
## Instructions
The following script can be adapted to run Deepseek R1 with a variety of configurations. The current configuration uses 2 nodes, 16 GPUs, and a dp of 16. Follow the [ReadMe](README.md) Getting Started section on each node, and then run these two commands.
......
......@@ -16,7 +16,7 @@ This deployment uses disaggregated serving in vLLM where:
## Prerequisites
This guide assumes readers already know how to deploy Dynamo disaggregated serving with vLLM as illustrated in [README.md](/components/backends/vllm/README.md)
This guide assumes readers already know how to deploy Dynamo disaggregated serving with vLLM as illustrated in [README.md](/docs/backends/vllm/README.md)
## Instructions
......
../../../../components/backends/sglang/README.md
\ No newline at end of file
../../../../../components/backends/sglang/docs/multinode-examples.md
\ No newline at end of file
../../../../components/backends/trtllm/README.md
\ No newline at end of file
../../../../../components/backends/trtllm/multinode/multinode-examples.md
\ No newline at end of file
../../../../components/backends/vllm/LMCache_Integration.md
\ No newline at end of file
../../../../components/backends/vllm/README.md
\ No newline at end of file
......@@ -42,8 +42,22 @@
architecture/request_migration.md
architecture/request_cancellation.md
components/backends/trtllm/multinode/multinode-examples.md
components/backends/sglang/docs/multinode-examples.md
backends/trtllm/multinode/multinode-examples.md
backends/trtllm/multinode/multinode-multimodal-example.md
backends/trtllm/llama4_plus_eagle.md
backends/trtllm/kv-cache-transfer.md
backends/trtllm/multimodal_support.md
backends/trtllm/multimodal_epd.md
backends/trtllm/gemma3_sliding_window_attention.md
backends/trtllm/gpt-oss.md
backends/sglang/multinode-examples.md
backends/sglang/dsr1-wideep-gb200.md
backends/sglang/dsr1-wideep-h100.md
backends/sglang/expert-distribution-eplb.md
backends/sglang/gpt-oss.md
backends/sglang/multimodal_epd.md
backends/sglang/sgl-hicache-example.md
examples/README.md
examples/runtime/hello_world/README.md
......@@ -51,6 +65,10 @@
architecture/distributed_runtime.md
architecture/dynamo_flow.md
backends/vllm/deepseek-r1.md
backends/vllm/gpt-oss.md
backends/vllm/multi-node.md
.. TODO: architecture/distributed_runtime.md and architecture/dynamo_flow.md
have some outdated names/references and need a refresh.
# GitOps Deployment with FluxCD
This section describes how to use FluxCD for GitOps-based deployment of Dynamo inference graphs. GitOps enables you to manage your Dynamo deployments declaratively using Git as the source of truth. We'll use the [aggregated vLLM example](/components/backends/vllm/README.md) to demonstrate the workflow.
This section describes how to use FluxCD for GitOps-based deployment of Dynamo inference graphs. GitOps enables you to manage your Dynamo deployments declaratively using Git as the source of truth. We'll use the [aggregated vLLM example](/docs/backends/vllm/README.md) to demonstrate the workflow.
## Prerequisites
......
......@@ -64,7 +64,7 @@ This will create two components:
- A Worker component exposing metrics on its system port
Both components expose a `/metrics` endpoint following the OpenMetrics format, but with different metrics appropriate to their roles. For details about:
- Deployment configuration: See the [vLLM README](/components/backends/vllm/README.md)
- Deployment configuration: See the [vLLM README](/docs/backends/vllm/README.md)
- Available metrics: See the [metrics guide](/docs/guides/metrics.md)
### Validate the Deployment
......@@ -87,7 +87,7 @@ curl localhost:8000/v1/chat/completions \
}'
```
For more information about validating the deployment, see the [vLLM README](../../components/backends/vllm/README.md).
For more information about validating the deployment, see the [vLLM README](../backends/vllm/README.md).
## Set Up Metrics Collection
......
......@@ -37,8 +37,8 @@ docker compose -f deploy/metrics/docker-compose.yml up -d
## Components
- [Frontend](/components/src/dynamo/frontend/README.md) - HTTP API endpoint that receives requests and forwards them to the decode worker
- [vLLM Prefill Worker](/components/backends/vllm/README.md) - Specialized worker for prefill phase execution
- [vLLM Decode Worker](/components/backends/vllm/README.md) - Specialized worker that handles requests and decides between local/remote prefill
- [vLLM Prefill Worker](/docs/backends/vllm/README.md) - Specialized worker for prefill phase execution
- [vLLM Decode Worker](/docs/backends/vllm/README.md) - Specialized worker that handles requests and decides between local/remote prefill
```mermaid
---
......