docs: quick start (#215)

0ee128ea · Neelay Shah · GitHub · fa373c19 · 0ee128ea
Commit 0ee128ea authored Mar 17, 2025 by Neelay Shah Committed by GitHub Mar 17, 2025

Showing with 61 additions and 96 deletions

README.md +61 -96

--- a/README.md
+++ b/README.md
@@ -20,134 +20,99 @@ limitations under the License.
 [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
 [![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
-NVIDIA Dynamo is a new modular inference framework designed for serving large language models (LLMs) in multi-node
+NVIDIA Dynamo is a high-throughput low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments. Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLang or others) and captures LLM-specific capabilities such as:
-distributed environments. It enables seamless scaling of inference workloads across GPU nodes and the dynamic allocation
-of GPU workers to address traffic bottlenecks at various stages of the model pipeline.
-NVIDIA Dynamo also features LLM-specific capabilities, such as disaggregated serving, which separates the context
+- **Disaggregated prefill & decode inference** – Maximizes GPU throughput and facilitates trade-offs between throughput and latency.
-(prefill) and generation (decode) steps of inference requests onto distinct GPUs and GPU nodes to optimize performance.
+- **Dynamic GPU scheduling** – Optimizes performance based on fluctuating demand.
+- **LLM-aware request routing** – Eliminates unnecessary KV cache re-computation.
+- **Accelerated data transfer** – Reduces inference response time using NIXL.
+- **KV cache offloading** – Leverages multiple memory hierarchies for higher system throughput.
-NVIDIA Dynamo includes four key innovations:
+Built in Rust for performance and in Python for extensibility, Dynamo is fully open source and driven by a transparent, OSS-first (open source software) development approach.
-* **Smart Router**: An LLM-aware router that directs requests across large GPU fleets to minimize costly key-value (KV)
+| [Quick Start](#quick-start) | [LLM Serving](#llm-serving) | [Disaggregated Serving and KV Routing](examples/llm) | [Architecture](docs/architecture.md) | [APIs](lib/bindings/python) | [Additional Resources](#additional-resources) |
-cache recomputations for repeat or overlapping requests, freeing up GPUs to respond to new incoming requests
-* **Low-Latency Communication Library**: An inference optimized library that supports state-of-the-art GPU-to-GPU
-communication and abstracts complexity of data exchange across heterogenous devices and networking protocols,
-accelerating data transfers
-* **Memory Manager**: An engine that intelligently offloads and reloads inference data (KV cache) to and from lower-cost memory and storage devices using NVIDIA NIXL without impacting user experiences
-> [!NOTE]
-> This project is currently in the alpha / experimental /
-> rapid-prototyping stage and we are actively looking for feedback and
-> collaborators.
 ## Quick Start
-TODO add quick start guide here
+### Installation
-## Building Dynamo
-### Requirements
-Dynamo development and examples are container based.
+The following examples require a few system level packages.
-* [Docker](https://docs.docker.com/get-started/get-docker/)
+```
-* [buildx](https://github.com/docker/buildx)
+apt-get update
+DEBIAN_FRONTEND=noninteractive apt-get install -yq python3-dev libucx0
-### Development
+pip install ai-dynamo nixl vllm==0.7.2+dynamo
+```
-You can build the Dynamo container using the build scripts
+> [!NOTE]
-in `container/` (or directly with `docker build`).
+> TensorRT-LLM support is currently available on a [branch](https://github.com/ai-dynamo/dynamo/tree/dynamo/trtllm_llmapi_v1/examples/trtllm#building-the-environment).
-We provide 2 types of builds:
+### Running and Interacting with an LLM Locally
-1. `VLLM` which includes our VLLM backend using new NIXL communication library.
+To run a model and interact with it locally you can call `dynamo
-2. `TENSORRTLLM` which includes our TRT-LLM backend
+run` with a Hugging Face model. `dynamo run` supports several backends
+including: `mistralrs`, `sglang`, `vllm`, and `tensorrtllm`.
-For example, if you want to build a container for the `VLLM` backend you can run
+#### Example Command
-<!--pytest.mark.skip-->
+```
-```bash
+dynamo run out=vllm deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-./container/build.sh
 ```
-Please see the instructions in the corresponding example for specific build instructions.
+```
+? User › Hello, how are you?
+✔ User · Hello, how are you?
+Okay, so I'm trying to figure out how to respond to the user's greeting. They said, "Hello, how are you?" and then followed it with "Hello! I'm just a program, but thanks for asking." Hmm, I need to come up with a suitable reply. ...
+```
-## Running Dynamo for Local Testing and Development
+### LLM Serving
-You can run the Dynamo container using the run scripts in
+Dynamo provides a simple way to spin up a local set of inference
-`container/` (or directly with `docker run`).
+components including:
-The run script offers a few common workflows:
+- **OpenAI-Compatible Frontend** – High-performance, OpenAI-compatible HTTP API server written in Rust.
+- **Basic and KV-Aware Router** – Routes and load-balances traffic across a set of workers.
+- **Workers** – Set of pre-configured LLM serving engines.
-1. Running a command in a container and exiting.
+To run a minimal configuration you can use a pre-configured
+example.
-<!--pytest.mark.skip-->
+#### Start Dynamo Distributed Runtime Services
-```bash
-./container/run.sh -- python3 -c "import dynamo.runtime; help(dynamo.runtime)"
-```
-<!--
-# This tests the above the line but from within the container
+First, start the Dynamo Distributed Runtime services:
-# using pytest-codeblocks
 ```bash
-python3 -c "import dynamo.runtime; help(dynamo.runtime)"
+docker compose -f deploy/docker-compose.yml up -d
 ```
--->
-2. Starting an interactive shell.
+#### Start Dynamo LLM Serving Components
+Next, serve a minimal configuration with an HTTP server, a basic
+round-robin router, and a single worker.
-<!--pytest.mark.skip-->
 ```bash
-./container/run.sh -it
+cd examples/llm
+dynamo serve graphs.agg:Frontend -f configs/agg.yaml
 ```
-3. Mounting the local workspace and Starting an interactive shell.
+#### Send a Request
-<!--pytest.mark.skip-->
 ```bash
-./container/run.sh -it --mount-workspace
+curl localhost:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '{
+    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
+    "messages": [
+    {
+        "role": "user",
+        "content": "Hello, how are you?"
+    }
+    ],
+    "stream":false,
+    "max_tokens": 300
+  }' | jq
 ```
-The last command also passes common environment variables ( `-e
+## Additional Resources
-HF_TOKEN` ) and mounts common directories such as `/tmp:/tmp`,
-`/mnt:/mnt`.
-Please see the instructions in the corresponding example for specific
-deployment instructions.
-## Rust Based Runtime
-Dynamo has a new rust based distributed runtime with
-implementation under development. The rust based runtime enables
-serving arbitrary python code as well as native rust. Please note the
-APIs are subject to change.
-### Hello World
-[Hello World](./lib/bindings/python/examples/hello_world)
-A basic example demonstrating the rust based runtime and python
-bindings.
-### LLM
-[VLLM](./examples/python_rs/llm/vllm)
-An intermediate example expanding further on the concepts introduced
-in the Hello World example. In this example, we demonstrate
-[Disaggregated Serving](https://arxiv.org/abs/2401.09670) as an
-application of the components defined in Dynamo.
-# Disclaimers
-> [!NOTE]
-> This project is currently in the alpha / experimental /
-> rapid-prototyping stage and we will be adding new features incrementally.
-1. The `TENSORRTLLM` and `VLLM` containers are WIP and not expected to
-   work out of the box.
-2. Testing has primarily been on single node systems with processes
+- [TODO] ()
-   launched within a single container.
+- [TODO] ()
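
Editor's sketch (not part of the commit): the "Send a Request" step added in this diff can also be issued from Python. The snippet below is a minimal, hedged example using only the standard library. The URL, model name, and payload fields are copied from the `curl` command in the diff; the function names `build_chat_request` and `send_chat_request` are hypothetical helpers introduced here for illustration, and the response is returned as decoded JSON without assuming any particular field layout.

```python
import json
import urllib.request

# Endpoint and model name are taken from the curl example in the README diff.
DEFAULT_URL = "http://localhost:8000/v1/chat/completions"
DEFAULT_MODEL = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"


def build_chat_request(prompt, model=DEFAULT_MODEL, max_tokens=300, url=DEFAULT_URL):
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(prompt, timeout=60.0, **kwargs):
    """Send the request and return the decoded JSON response body.

    Hypothetical helper: it only works once the serving components
    from the diff are running and listening on the given URL.
    """
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the runtime services and the `graphs.agg:Frontend` graph running as described in the diff, calling `send_chat_request("Hello, how are you?")` should play the same role as the `curl ... | jq` command. Building the request separately from sending it makes the payload easy to inspect without a live server.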