<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# NVIDIA Dynamo

[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)

NVIDIA Dynamo is a new modular inference framework designed for serving large language models (LLMs) in multi-node
distributed environments. It enables seamless scaling of inference workloads across GPU nodes and the dynamic allocation
of GPU workers to address traffic bottlenecks at various stages of the model pipeline.

NVIDIA Dynamo also features LLM-specific capabilities, such as disaggregated serving, which separates the context
(prefill) and generation (decode) steps of inference requests onto distinct GPUs and GPU nodes to optimize performance.

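As a toy illustration of the idea (hypothetical names and a fake "model", not the Dynamo API), disaggregation means the prefill step processes the full prompt once to build the KV cache, and the decode step reuses that cache on a different worker instead of recomputing it:

```python
# Illustrative sketch only: disaggregated serving splits a request into a
# prefill phase and a decode phase that can run on different GPUs/nodes.
from dataclasses import dataclass, field

@dataclass
class KVCache:
    # Stands in for the per-request key-value cache produced by prefill.
    tokens: list = field(default_factory=list)

def prefill_worker(prompt_tokens):
    # Processes the full prompt once; returns the KV cache plus the first
    # generated token (faked here as the prompt length).
    cache = KVCache(tokens=list(prompt_tokens))
    first_token = len(prompt_tokens)
    return cache, first_token

def decode_worker(cache, first_token, max_new_tokens):
    # Generates tokens one at a time, extending the transferred KV cache
    # instead of reprocessing the prompt.
    out = [first_token]
    for _ in range(max_new_tokens - 1):
        nxt = out[-1] + 1          # toy "model": next token = previous + 1
        cache.tokens.append(nxt)
        out.append(nxt)
    return out

cache, tok = prefill_worker([101, 102, 103])   # would run on a prefill GPU
tokens = decode_worker(cache, tok, 4)          # would run on a decode GPU
print(tokens)  # [3, 4, 5, 6]
```

In a real deployment the KV cache transfer between the two workers is what the low-latency communication layer accelerates.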
NVIDIA Dynamo includes the following key innovations:

* **Smart Router**: An LLM-aware router that directs requests across large GPU fleets to minimize costly key-value (KV)
cache recomputations for repeat or overlapping requests, freeing up GPUs to respond to new incoming requests
* **Low-Latency Communication Library**: An inference-optimized library that supports state-of-the-art GPU-to-GPU
communication and abstracts the complexity of data exchange across heterogeneous devices and networking protocols,
accelerating data transfers
* **Memory Manager**: An engine that intelligently offloads and reloads inference data (KV cache) to and from lower-cost memory and storage devices using NVIDIA NIXL without impacting user experiences
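
The Smart Router's core idea can be sketched in a few lines (hypothetical names, not the Dynamo implementation): score each worker by how much of the request's prompt prefix is already in its KV cache, and route to the worker with the largest overlap so the least recomputation is needed.

```python
# Illustrative sketch of KV-cache-aware routing.
def prefix_overlap(a, b):
    # Length of the common prefix of two token sequences.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(request_tokens, worker_caches):
    # worker_caches: {worker_id: list of cached token sequences}
    best_worker, best_score = None, -1
    for worker, cached in worker_caches.items():
        score = max((prefix_overlap(request_tokens, c) for c in cached), default=0)
        if score > best_score:
            best_worker, best_score = worker, score
    return best_worker

caches = {
    "gpu-0": [[1, 2, 3, 4]],   # has a matching prefix cached
    "gpu-1": [[9, 9]],         # no overlap with the request below
}
print(route([1, 2, 3, 5], caches))  # gpu-0
```

A production router would also weigh load and cache capacity, but the prefix-overlap signal is what distinguishes an LLM-aware router from a round-robin one.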

> [!NOTE]
> This project is currently in the alpha / experimental /
> rapid-prototyping stage and we are actively looking for feedback and
> collaborators.

## Quick Start

TODO add quick start guide here

## Building Dynamo

### Requirements

Dynamo development and examples are container-based.

* [Docker](https://docs.docker.com/get-started/get-docker/)
* [buildx](https://github.com/docker/buildx)

### Development

You can build the Dynamo container using the build scripts
in `container/` (or directly with `docker build`).

We provide two types of builds:

1. `VLLM`, which includes our VLLM backend using the new NIXL communication library.
2. `TENSORRTLLM`, which includes our TRT-LLM backend.

For example, to build a container for the `VLLM` backend you can run:

<!--pytest.mark.skip-->
```bash
./container/build.sh
```

See the corresponding example for specific build instructions.

## Running Dynamo for Local Testing and Development

You can run the Dynamo container using the run scripts in
`container/` (or directly with `docker run`).

The run script offers a few common workflows:

1. Running a command in a container and exiting.

<!--pytest.mark.skip-->
```bash
./container/run.sh -- python3 -c "import dynamo.runtime; help(dynamo.runtime)"
```
<!--

# This tests the command above, but from within the container
# using pytest-codeblocks

```bash
python3 -c "import dynamo.runtime; help(dynamo.runtime)"
```
-->

2. Starting an interactive shell.

<!--pytest.mark.skip-->
```bash
./container/run.sh -it
```

3. Mounting the local workspace and starting an interactive shell.

<!--pytest.mark.skip-->
```bash
./container/run.sh -it --mount-workspace
```

The last command also passes common environment variables (such as
`-e HF_TOKEN`) and mounts common directories such as `/tmp:/tmp` and
`/mnt:/mnt`.

See the corresponding example for specific deployment instructions.

## Rust-Based Runtime

Dynamo has a new Rust-based distributed runtime whose implementation
is under active development. The Rust-based runtime enables serving
arbitrary Python code as well as native Rust. Please note that the
APIs are subject to change.
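
The pattern of serving arbitrary Python code from a runtime can be sketched with a toy, purely illustrative runtime (hypothetical names, not the Dynamo bindings): handlers are plain async Python functions registered under an endpoint name, and the runtime dispatches calls to them.

```python
# Conceptual sketch only; in a real distributed runtime, call() would
# cross process and node boundaries rather than a local dict lookup.
import asyncio

class ToyRuntime:
    def __init__(self):
        self._endpoints = {}

    def endpoint(self, name):
        # Decorator that registers a handler under a service name.
        def register(fn):
            self._endpoints[name] = fn
            return fn
        return register

    async def call(self, name, payload):
        # Dispatch to the registered async handler.
        return await self._endpoints[name](payload)

runtime = ToyRuntime()

@runtime.endpoint("generate")
async def generate(prompt: str) -> str:
    return f"echo: {prompt}"

print(asyncio.run(runtime.call("generate", "hello")))  # echo: hello
```

The Hello World example below shows the real bindings for this pattern.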

### Hello World

[Hello World](./lib/bindings/python/examples/hello_world)

A basic example demonstrating the Rust-based runtime and Python
bindings.

### LLM

[VLLM](./examples/python_rs/llm/vllm)

An intermediate example expanding further on the concepts introduced
in the Hello World example. In this example, we demonstrate
[Disaggregated Serving](https://arxiv.org/abs/2401.09670) as an
application of the components defined in Dynamo.

## Disclaimers

> [!NOTE]
> This project is currently in the alpha / experimental /
> rapid-prototyping stage and we will be adding new features incrementally.

1. The `TENSORRTLLM` and `VLLM` containers are WIP and not expected to
   work out of the box.

2. Testing has primarily been on single node systems with processes
   launched within a single container.