Unverified commit 316dffc0, authored by Shriyash.Patil and committed by GitHub

docs: Fix missing logging import in basic worker example (#1580)


Signed-off-by: Shriyash.Patil <shriyash81@gmail.com>
parent 65f2de5f
@@ -33,6 +33,7 @@ see the [Dynamo Serve Guide](../../docs/guides/dynamo_serve.md).
When deploying a python-based worker with `dynamo serve` or `dynamo deploy`, it is
a Python class based definition that requires a few key decorators to get going:
- `@service`: used to define a worker class
- `@endpoint`: marks methods that can be called by other workers or clients
@@ -64,6 +65,7 @@ class YourWorker:
Workers in Dynamo are identified by a `namespace/component/endpoint` naming schema.
When addressing this worker's endpoint with the `namespace/component/endpoint` schema
based on the definitions above, it would be: `your_namespace/YourWorker/your_endpoint`:
- `namespace="your_namespace"`: Defined in the `@service` decorator
- `component="YourWorker"`: Defined by the Python Class name
- `endpoint="your_endpoint"`: Defined by the `@endpoint` decorator, or by default the name of the function being decorated.
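
As a rough illustration of the naming schema in the hunk above, the address can be assembled from its three parts. This is a minimal sketch in plain Python and is not part of the Dynamo SDK: `WorkerAddress` and its `parse` method are hypothetical names introduced here only to show how the `namespace/component/endpoint` pieces fit together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerAddress:
    """Hypothetical helper (not a Dynamo SDK type) modelling the naming schema."""

    namespace: str  # set in the @service decorator
    component: str  # the Python class name of the worker
    endpoint: str   # set by @endpoint, or the decorated function's name

    def __str__(self) -> str:
        # Join the three parts into the namespace/component/endpoint form.
        return f"{self.namespace}/{self.component}/{self.endpoint}"

    @classmethod
    def parse(cls, address: str) -> "WorkerAddress":
        # Split on "/" and require exactly three non-empty parts.
        parts = address.split("/")
        if len(parts) != 3 or not all(parts):
            raise ValueError(
                f"expected namespace/component/endpoint, got {address!r}"
            )
        return cls(*parts)


address = WorkerAddress("your_namespace", "YourWorker", "your_endpoint")
print(address)
```

Keeping the three parts as separate fields, rather than passing a single string around, makes it harder to mix up which decorator or class name each segment comes from.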
@@ -93,6 +95,7 @@ class ResponseType(BaseModel):
For example, if you deploy your worker directly behind an OpenAI HTTP (`http`) service
using `llmctl`, you can define the request and response types to correspond to
Chat Completions objects, such as the ones specified in the OpenAI API. For example:
```python
from vllm.entrypoints.openai.protocol import ChatCompletionRequest
@@ -112,6 +115,7 @@ via custom RequestType/ResponseType definitions:
# basic_worker.py
# This can be run standalone with `dynamo serve basic_worker:YourWorker`
+import logging
from pydantic import BaseModel
from dynamo.sdk import endpoint, service
@@ -187,6 +191,7 @@ and internally these requests would be routed to the attached worker endpoints i
In more advanced scenarios where your worker may operate on some other intermediate format
that may not directly match an OpenAI-like format, you could setup a separate processor worker
that does something like the following:
- Take in OpenAI Chat Completions requests from the HTTP service
- Convert requests from Chat Completions format to the RequestType format your worker expects
- Forward requests to the worker(s)
@@ -324,6 +329,7 @@ an endpoint that can do arbitrary things based on your use case.
For example, you can initialize the `KvMetricsAggregator` and `KvIndexer`
in your class implementation:
```python
@service(
    dynamo={
@@ -445,6 +451,7 @@ metrics, see the [KV Cache Routing Guide](../../docs/architecture/kv_cache_routi
NIXL (NVIDIA Inter-process Link) enables efficient GPU memory sharing between processes. In Prefill/Decode disaggregation, we use NIXL to transfer computed KV cache blocks from prefill workers to decode workers. Here are the core concepts:
1. **NIXL Agent Setup**
```python
from nixl._api import nixl_agent
@@ -458,6 +465,7 @@ class NixlConnector:
```
2. **Memory Registration and Transfer Preparation**
```python
def register_kv_caches(self, kv_cache: torch.Tensor):
    # Get block size from the KV cache tensor
@@ -489,6 +497,7 @@ def register_kv_caches(self, kv_cache: torch.Tensor):
```
3. **Remote Agent Communication**
```python
def get_agent_metadata(self):
    # Get metadata for sharing with other agents
@@ -513,6 +522,7 @@ nixl_connector.add_remote_agent(decode_engine_id, decode_metadata, decode_blocks
```
4. **KV Cache Transfer**
```python
def write_blocks(self, local_block_ids, remote_block_ids, notify_msg):
    # Initiate asynchronous transfer using block IDs
@@ -533,6 +543,7 @@ nixl_connector.write_blocks([0, 3], [12, 16], "kv_transfer")
```
The NIXL connector provides:
- GPU memory registration for sharing between processes
- Connection establishment between Prefill and Decode workers
- Efficient block-based KV cache transfers
@@ -547,6 +558,7 @@ on the same concepts used for any Dynamo client<->worker or worker<->worker
interaction over the DistributedRuntime.
First you can define a worker for each as usual:
```python
class DecodeWorker:
    # ...
@@ -561,6 +573,7 @@ In some scenarios, it may be more efficient for the Decode worker to just do the
Prefill itself rather than do the extra communication, such as if the input
sequence length is below some small threshold. If you wanted to disable
disaggregation, the DecodeWorker could just always do the Prefill step as well.
```python
@service(
    dynamo={
@@ -618,6 +631,7 @@ For more information on Disaggregated Serving, see the
## Best Practices
1. **Resource Management**: Configure resource requirements based on your needs:
```python
@service(
    resources={