Dynamo is a flexible and performant distributed inferencing solution for large-scale deployments. It is an ecosystem of tools, frameworks, and abstractions that makes the design, customization, and deployment of frontier-level models onto datacenter-scale infrastructure easy to reason about and optimized for your specific inferencing workloads. Dynamo's core is written in Rust and contains a set of well-defined Python bindings. Docs and examples for those can be found [here](../../../../../README.md).
Dynamo SDK is a layer on top of the core. It is a Python framework that makes it easy to create inference graphs and deploy them locally and onto a target K8s cluster. The SDK was heavily inspired by [BentoML's](https://github.com/bentoml/BentoML) open source deployment patterns and leverages many of its core primitives. The Dynamo CLI is a companion tool that allows you to spin up an inference pipeline locally, containerize it, and deploy it. You can find a toy hello-world example [here](../../README.md).
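To make the "inference graph" idea concrete, here is a minimal conceptual sketch in plain Python of a two-stage pipeline (a frontend that depends on a worker). The class and method names are illustrative only, not the actual `dynamo.sdk` API — see the linked hello-world example for the real decorators and deployment workflow.

```python
# Conceptual sketch of a two-stage inference graph.
# Names are illustrative, not the real Dynamo SDK API.

class Worker:
    """Downstream stage: stands in for the model-serving component."""
    def generate(self, tokens):
        return f"echo:{'-'.join(tokens)}"

class Frontend:
    """Upstream stage: handles user-facing requests and calls the worker.
    In the SDK, this dependency is declared on the service rather than
    wired by hand, so the graph can be deployed locally or onto K8s."""
    def __init__(self, worker):
        self.worker = worker

    def handle(self, prompt):
        tokens = prompt.split()  # stand-in for real tokenization
        return self.worker.generate(tokens)

pipeline = Frontend(Worker())
print(pipeline.handle("hello world"))  # -> echo:hello-world
```

The point of the pattern is that each stage is an independently deployable service; the framework handles discovery and transport between them.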
The Dynamo KV Cache Manager uses advanced caching policies that prioritize placing frequently accessed data in GPU memory, while less accessed data is moved to shared CPU memory, SSDs, or networked object storage. It incorporates eviction policies that strike a balance between over-caching (which can introduce lookup latencies) and under-caching (which leads to missed lookups and KV cache re-computation).
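The offloading idea can be illustrated with a simple two-tier LRU cache: when the hot ("GPU") tier overflows, the coldest block is offloaded to a larger warm ("CPU") tier instead of being discarded, so a later lookup promotes it back rather than triggering re-computation. This is a hypothetical sketch of the general technique, not the actual Dynamo KV Cache Manager implementation, and the tier names and capacities are stand-ins.

```python
from collections import OrderedDict

class TieredKVCache:
    """Illustrative two-tier KV-block cache (not the Dynamo implementation).

    A small hot tier offloads its least-recently-used blocks to a larger
    warm tier on overflow; only the warm tier truly evicts."""

    def __init__(self, gpu_capacity, cpu_capacity):
        self.gpu = OrderedDict()  # hot tier, LRU order
        self.cpu = OrderedDict()  # warm tier, LRU order
        self.gpu_capacity = gpu_capacity
        self.cpu_capacity = cpu_capacity

    def get(self, key):
        if key in self.gpu:
            self.gpu.move_to_end(key)   # refresh recency
            return self.gpu[key]
        if key in self.cpu:             # warm hit: promote back to hot tier
            value = self.cpu.pop(key)
            self.put(key, value)
            return value
        return None                     # true miss: caller must recompute

    def put(self, key, value):
        self.gpu[key] = value
        self.gpu.move_to_end(key)
        if len(self.gpu) > self.gpu_capacity:
            # Offload the coldest hot block instead of dropping it.
            old_key, old_val = self.gpu.popitem(last=False)
            self.cpu[old_key] = old_val
            if len(self.cpu) > self.cpu_capacity:
                self.cpu.popitem(last=False)  # evict for real
```

Tuning the two capacities is exactly the over-/under-caching trade-off described above: a larger warm tier avoids re-computation at the cost of extra lookup latency on warm hits.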
Additionally, this feature can manage KV cache across multiple GPU nodes, supporting both distributed and disaggregated inference serving, and offers hierarchical caching capabilities, creating offloading strategies at the GPU, node, and cluster levels.
The Dynamo KV Cache Manager is designed to be framework-agnostic to support various backends, including TensorRT-LLM, vLLM, and SGLang, and to facilitate the scaling of KV cache storage across large, distributed clusters using NVLink, NVIDIA Quantum switches, and NVIDIA Spectrum switches. It integrates with [NIXL](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md) to enable data transfers across different worker instances and storage backends.
| **Dependency** | **Version** |
|------------------|-------------|
|**Base Container**| 25.01 |
|**ai-dynamo-vllm**| 0.7.2*|
|**TensorRT-LLM** | 0.19.0** |
|**NIXL** | 0.1.0 |
> **Note**:
> - *ai-dynamo-vllm v0.7.2 is a customized patch of v0.7.2 from vLLM.
> - **The specific version of TensorRT-LLM (planned v0.19.0) that will be supported by Dynamo is subject to change.
- **Wheels**: Pre-built Python wheels are only available for **x86_64 Linux**. No wheels are available for other platforms at this time.
- **Container Images**: We distribute only the source code for container images, and only **x86_64 Linux** is supported for these. Users must build the container image from source if they require it.
Once you've confirmed that your platform and architecture are compatible, you can install **Dynamo** by following the instructions in the [Quick Start Guide](https://github.com/ai-dynamo/dynamo/blob/main/README.md#installation).