@@ -112,6 +115,7 @@ via custom RequestType/ResponseType definitions:
...
# basic_worker.py
# This can be run standalone with `dynamo serve basic_worker:YourWorker`
import logging
from pydantic import BaseModel
from dynamo.sdk import endpoint, service
...
@@ -187,6 +191,7 @@ and internally these requests would be routed to the attached worker endpoints i
...
In more advanced scenarios where your worker may operate on some other intermediate format
that may not directly match an OpenAI-like format, you could set up a separate processor worker
that does something like the following:
- Take in OpenAI Chat Completions requests from the HTTP service
- Convert requests from Chat Completions format to the RequestType format your worker expects
- Forward requests to the worker(s)
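
The conversion step could be sketched roughly as follows. This is a minimal illustration only: `WorkerRequest` and `chat_to_worker_request` are hypothetical names that are not part of the Dynamo SDK, the sketch uses a plain dataclass in place of your real RequestType, and it assumes each message's `content` is a plain string.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class WorkerRequest:
    """Hypothetical RequestType; stands in for whatever your worker expects."""

    prompt: str
    max_tokens: int


def chat_to_worker_request(
    chat_request: dict[str, Any], default_max_tokens: int = 128
) -> WorkerRequest:
    """Flatten OpenAI-style chat messages into a single prompt string."""
    # Assumption: every message has string "role" and "content" fields.
    lines = [
        f"{message['role']}: {message['content']}"
        for message in chat_request.get("messages", [])
    ]
    return WorkerRequest(
        prompt="\n".join(lines),
        # Fall back to the default when max_tokens is missing or null.
        max_tokens=chat_request.get("max_tokens") or default_max_tokens,
    )
```

A processor worker would call a helper like this on each incoming request before forwarding the result to the worker endpoint, and would apply the inverse mapping to the worker's responses.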
...
@@ -324,6 +329,7 @@ an endpoint that can do arbitrary things based on your use case.
...
For example, you can initialize the `KvMetricsAggregator` and `KvIndexer`
in your class implementation:
```python
@service(
    dynamo={
...
@@ -445,6 +451,7 @@ metrics, see the [KV Cache Routing Guide](../../docs/architecture/kv_cache_routi
...
NIXL (NVIDIA Inference Xfer Library) enables efficient GPU memory sharing between processes. In Prefill/Decode disaggregation, we use NIXL to transfer computed KV cache blocks from prefill workers to decode workers. Here are the core concepts:
1. **NIXL Agent Setup**
```python
from nixl._api import nixl_agent
...
@@ -458,6 +465,7 @@ class NixlConnector:
...
```
2. **Memory Registration and Transfer Preparation**