Commit 22d22777, authored by Jacky, committed by GitHub

docs: Add cancellation docs for vLLM, TRT-LLM and SGLang backends (#3783)


Signed-off-by: Jacky <18255193+kthui@users.noreply.github.com>
parent 8ad3b9a2
@@ -69,6 +69,22 @@ Dynamo SGLang uses SGLang's native argument parser, so **most SGLang engine argu
> [!NOTE]
> When using `--use-sglang-tokenizer`, only `v1/chat/completions` is available through Dynamo's frontend.
### Request Cancellation
When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
#### Cancellation Support Matrix
| | Prefill | Decode |
|-|---------|--------|
| **Aggregated** | ✅ | ✅ |
| **Disaggregated** | ⚠️ | ✅ |
> [!WARNING]
> ⚠️ The SGLang backend currently does not support cancellation during the remote prefill phase in disaggregated mode.
For more details, see the [Request Cancellation Architecture](../../architecture/request_cancellation.md) documentation.
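The propagation described above can be illustrated with a toy model. The following is a minimal sketch using only Python's standard `asyncio` library, not Dynamo's actual implementation: `worker`, `handle_request`, and `simulate_disconnect` are hypothetical names, and the "disconnect" is simulated by cancelling the task that handles the request. It only shows the general pattern of one cancellation reaching every in-flight worker.

```python
import asyncio


async def worker(name: str, steps: int, log: list[str]) -> None:
    """Hypothetical worker: pretends to compute until done or cancelled."""
    try:
        for _ in range(steps):
            await asyncio.sleep(0.01)  # stand-in for one unit of compute
        log.append(f"{name}: finished")
    except asyncio.CancelledError:
        # A real worker would release its compute resources here.
        log.append(f"{name}: cancelled")
        raise


async def handle_request(log: list[str]) -> None:
    """Hypothetical frontend handler that fans one request out to two workers."""
    # Cancelling the task awaiting gather() also cancels every unfinished child.
    await asyncio.gather(
        worker("worker-0", 1000, log),
        worker("worker-1", 1000, log),
    )


async def simulate_disconnect() -> list[str]:
    """Start a request, then cancel it as if the client had disconnected."""
    log: list[str] = []
    request = asyncio.create_task(handle_request(log))
    await asyncio.sleep(0.05)  # let both workers start
    request.cancel()           # assumption: models the user disconnecting
    try:
        await request
    except asyncio.CancelledError:
        pass
    return log


if __name__ == "__main__":
    print(asyncio.run(simulate_disconnect()))
```

In this sketch both workers record a cancellation long before their 1000 steps would have completed, which is the resource-freeing effect the section describes.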
## Installation
### Install latest release
......
@@ -228,6 +228,20 @@ python3 -m dynamo.trtllm ... --migration-limit=3
This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../../docs/architecture/request_migration.md) documentation for details on how this works.
## Request Cancellation
When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
### Cancellation Support Matrix
| | Prefill | Decode |
|-|---------|--------|
| **Aggregated** | ✅ | ✅ |
| **Disaggregated (Decode-First)** | ✅ | ✅ |
| **Disaggregated (Prefill-First)** | ✅ | ✅ |
For more details, see the [Request Cancellation Architecture](../../../docs/architecture/request_cancellation.md) documentation.
## Client
See [client](../../../docs/backends/sglang/README.md#testing-the-deployment) section to learn how to send request to the deployment.
......
@@ -189,3 +189,16 @@ python3 -m dynamo.vllm ... --migration-limit=3
``` ```
This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../../docs/architecture/request_migration.md) documentation for details on how this works.
## Request Cancellation
When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
### Cancellation Support Matrix
| | Prefill | Decode |
|-|---------|--------|
| **Aggregated** | ✅ | ✅ |
| **Disaggregated** | ✅ | ✅ |
For more details, see the [Request Cancellation Architecture](../../../docs/architecture/request_cancellation.md) documentation.
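For quick reference, the three support matrices added in this commit (SGLang, TRT-LLM, vLLM) can be condensed into one lookup table. This is only a sketch: the dictionary keys and the `supports_cancellation` helper are hypothetical labels chosen for illustration and are not part of Dynamo. The ⚠️ entry for SGLang disaggregated prefill is recorded as `False`, since that README states cancellation is currently not supported during remote prefill.

```python
# Cancellation support per backend, transcribed from the matrices in this commit.
# True = supported (✅); False = not currently supported (⚠️).
# All key names are hypothetical labels, not Dynamo identifiers.
SUPPORT = {
    "sglang": {
        "aggregated": {"prefill": True, "decode": True},
        "disaggregated": {"prefill": False, "decode": True},
    },
    "trtllm": {
        "aggregated": {"prefill": True, "decode": True},
        "disaggregated-decode-first": {"prefill": True, "decode": True},
        "disaggregated-prefill-first": {"prefill": True, "decode": True},
    },
    "vllm": {
        "aggregated": {"prefill": True, "decode": True},
        "disaggregated": {"prefill": True, "decode": True},
    },
}


def supports_cancellation(backend: str, mode: str, phase: str) -> bool:
    """Return whether cancellation is documented as supported for this combination."""
    return SUPPORT[backend][mode][phase]


if __name__ == "__main__":
    for backend, modes in SUPPORT.items():
        for mode, phases in modes.items():
            for phase in phases:
                print(backend, mode, phase, supports_cancellation(backend, mode, phase))
```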