@@ -112,6 +115,7 @@ via custom RequestType/ResponseType definitions:
# basic_worker.py
# This can be run standalone with `dynamo serve basic_worker:YourWorker`
importlogging
frompydanticimportBaseModel
fromdynamo.sdkimportendpoint,service
...
...
@@ -187,6 +191,7 @@ and internally these requests would be routed to the attached worker endpoints i
In more advanced scenarios where your worker may operate on some other intermediate format
that may not directly match an OpenAI-like format, you could setup a separate processor worker
that does something like the following:
- Take in OpenAI Chat Completions requests from the HTTP service
- Convert requests from Chat Completions format to the RequestType format your worker expects
- Forward requests to the worker(s)
...
...
@@ -324,6 +329,7 @@ an endpoint that can do arbitrary things based on your use case.
For example, you can initialize the `KvMetricsAggregator` and `KvIndexer`
in your class implementation:
```python
@service(
dynamo={
...
...
@@ -445,6 +451,7 @@ metrics, see the [KV Cache Routing Guide](../../docs/architecture/kv_cache_routi
NIXL (NVIDIA Inter-process Link) enables efficient GPU memory sharing between processes. In Prefill/Decode disaggregation, we use NIXL to transfer computed KV cache blocks from prefill workers to decode workers. Here are the core concepts:
1.**NIXL Agent Setup**
```python
fromnixl._apiimportnixl_agent
...
...
@@ -458,6 +465,7 @@ class NixlConnector:
```
2.**Memory Registration and Transfer Preparation**