There are multi-faceted challenges:
- *Extremely hard UX*: User experience is critical for distributed inference runtimes because managing large-scale inference systems is already complex, and poor usability amplifies every inefficiency. Developers need a clear, intuitive way to define, optimize, and modify inference execution without wrestling with low-level infrastructure details. Without a simple UX, inference runtimes remain inaccessible, error-prone, and inefficient, slowing down model deployment and innovation. A modern distributed inference stack must be designed with usability at its core, empowering developers to scale AI for agentic workflows while ensuring correctness and performance.
- *GPU underutilization*: Traditional monolithic inference pipelines often leave GPUs idle due to the imbalance between the prefill and decode stages. Prefill (where the prompt is processed to build the KV cache) is highly compute-intensive, while decode (where output tokens are generated) is latency-sensitive. A disaggregated approach separates prefill and decode, ensuring optimal GPU utilization and increasing overall throughput ([DistServe](https://arxiv.org/abs/2401.09670)).
- *Expensive KV cache re-computation*: When requests are not routed efficiently, KV caches (the intermediate states of the transformer model) often get flushed and recomputed, wasting computation cycles and increasing latency. KV-aware request routing eliminates redundant KV cache regeneration, significantly boosting efficiency ([DeepSeek](https://arxiv.org/abs/2501.12948)).
- *Memory bottlenecks*: Large-scale inference workloads demand extensive KV cache storage, which can quickly overwhelm GPU memory capacity. KV cache offloading across memory hierarchies (HBM, DDR, NVMe, or remote storage) lets models scale beyond GPU memory limits and reduces latency ([Mooncake](https://github.com/kvcache-ai/Mooncake/blob/main/doc/en/mooncake-store-preview.md), [AIBrix](https://blog.vllm.ai/2025/02/21/aibrix-release.html), [LMCache](https://lmcache.ai/)).
- *Fluctuating demand and inefficient GPU allocation*: Inference workloads are use-case specific and inherently dynamic: demand surges unpredictably, yet traditional serving stacks allocate GPUs statically. Dynamic GPU scheduling allocates resources based on real-time demand, preventing over-provisioning and improving utilization ([AzureTrace](https://github.com/Azure/AzurePublicDataset)).
- *Inefficient data transfer*: Distributed inference workloads introduce unique, highly dynamic communication patterns that differ fundamentally from training. Whereas worker roles in training remain largely static, inference requires real-time worker scaling, dynamic load balancing, and adaptive memory management, which calls for a communication layer that can efficiently handle these evolving requirements. Existing libraries are built for static, synchronous operations and lack the dynamism needed for inference serving. While UCX provides high-performance networking, it demands deep networking expertise to configure correctly, making it impractical for broad inference use cases. What developers really need is a library optimized for inference workloads that abstracts heterogeneous memory (remote memory or storage) and dynamically selects the best transport backend via a unified API.
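To make the KV-aware routing point above concrete, here is a minimal Python sketch (all names and the block size are illustrative assumptions, not Dynamo's actual API): prompts are chain-hashed into fixed-size prefix blocks, and each request is routed to the worker that already caches the longest matching prefix, so that portion of the KV cache never has to be recomputed.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block; chosen for illustration only

def block_hashes(tokens: list[int]) -> list[str]:
    """Chain-hash fixed-size token blocks so equal prefixes yield equal hash chains."""
    hashes, prev = [], ""
    usable = len(tokens) - len(tokens) % BLOCK_SIZE
    for i in range(0, usable, BLOCK_SIZE):
        block = tokens[i:i + BLOCK_SIZE]
        prev = hashlib.sha256((prev + str(block)).encode()).hexdigest()
        hashes.append(prev)
    return hashes

def route(tokens: list[int], workers: dict[str, set[str]]) -> str:
    """Pick the worker caching the longest prefix of the request,
    minimizing redundant KV cache regeneration."""
    request = block_hashes(tokens)

    def overlap(cached: set[str]) -> int:
        n = 0
        for h in request:
            if h not in cached:
                break
            n += 1
        return n

    return max(workers, key=lambda w: overlap(workers[w]))
```

A real router would also weigh load, but the core idea is simply maximizing cached-prefix overlap per request.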
Dynamo's design enables KV cache offloading to system CPU memory, and will be extended to support SSDs and networked object storage in subsequent releases. In many accelerated servers, system (CPU) memory is much larger than GPU memory and fast enough to store and serve KV cache data. The following plot highlights the performance gains achieved through system memory offloading, even with prefix caching enabled in the inference engine. In a scenario involving 80 users with 10 multi-turn conversation exchanges each, system memory offloading delivered a 40% improvement in TTFT, demonstrating additional benefits beyond basic prefix caching.
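The offloading behavior described above can be sketched with a toy two-tier cache (purely illustrative; `TieredKVCache` and its policy are assumptions, not Dynamo's actual KV manager): blocks evicted from the small GPU tier spill into a larger CPU-memory tier and are promoted back on a hit, instead of being recomputed from scratch.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier KV cache: a small 'GPU' tier spills least-recently-used
    blocks to a larger 'CPU' tier instead of discarding them."""

    def __init__(self, gpu_slots: int, cpu_slots: int):
        self.gpu = OrderedDict()  # block_id -> kv data (most recent last)
        self.cpu = OrderedDict()
        self.gpu_slots, self.cpu_slots = gpu_slots, cpu_slots

    def put(self, block_id, kv):
        self.gpu[block_id] = kv
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.gpu_slots:
            evicted_id, evicted_kv = self.gpu.popitem(last=False)
            self.cpu[evicted_id] = evicted_kv  # offload rather than drop
            while len(self.cpu) > self.cpu_slots:
                self.cpu.popitem(last=False)   # only now is data truly lost

    def get(self, block_id):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        if block_id in self.cpu:               # hit in host memory: skip prefill
            kv = self.cpu.pop(block_id)
            self.put(block_id, kv)             # promote back to the fast tier
            return kv
        return None                            # miss: caller must recompute
```

Extending the same pattern with further tiers (NVMe, networked object storage) is what the later releases mentioned above would entail.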
<figure>
  <img src='images/kv_manager.png' alt='missing'/>
  <p>Tested with 100K requests to R1 using R1 Distilled Llama 70B FP8 on 2 nodes of H100s. Avg 4K ISL / 800 OSL</p>
</figure>

### NIXL
NIXL streamlines data transfer through simplified synchronization, batching, and source/destination abstractions. It abstracts data movement across different types of memory and fast storage, whereas other data transfer libraries typically support only a single memory tier. These enhancements yield significant performance gains, accelerating both time-to-first-token (TTFT) and overall throughput.
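As a rough illustration of such a unified transfer abstraction (the names below are hypothetical, not NIXL's real interface), the caller only describes where each buffer lives, and the library chooses a transport for every source/destination tier pair:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MemRegion:
    tier: str      # e.g. "hbm", "dram", "nvme", "remote"
    addr: int
    length: int

# Best available transport per tier pair; a real library would probe
# the hardware and fabric at runtime instead of using a static table.
TRANSPORTS = {
    ("hbm", "hbm"): "nvlink",
    ("hbm", "dram"): "pcie_dma",
    ("dram", "nvme"): "gds",
    ("hbm", "remote"): "rdma",
}

def select_transport(src: MemRegion, dst: MemRegion) -> str:
    """Return the transport backend for this pair, falling back to plain TCP."""
    return TRANSPORTS.get((src.tier, dst.tier), "tcp")

def transfer(src: MemRegion, dst: MemRegion) -> str:
    backend = select_transport(src, dst)
    # A real implementation would post a batched, asynchronous transfer here.
    return f"{src.length} bytes via {backend}"
```

The point of the unified API is exactly this indirection: adding a new memory tier or transport changes the selection table, not the caller's code.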