@@ -115,14 +115,13 @@ In the P/D disaggregated setup you would have `deepseek-distill-llama8b.prefill.
...
A Python worker may need to be shut down promptly, for example when the node running the worker is to be reclaimed and there isn't enough time to complete all ongoing requests before the shutdown deadline.
-In such cases, you can signal incomplete responses by raising a `GeneratorExit` exception in your generate loop. This will immediately close the response stream, signaling to the frontend that the stream is incomplete. With request migration enabled (see the [`migration_limit`](../fault-tolerance/request-migration.md) parameter), the frontend will automatically migrate the partially completed request to another worker instance, if available, to be completed.
+In such cases, you can signal incomplete responses by raising an `EngineShutdown` exception in your generate loop. This will immediately close the response stream, signaling to the frontend that the stream is incomplete. With request migration enabled (see the [`migration_limit`](../fault-tolerance/request-migration.md) parameter), the frontend will automatically migrate the partially completed request to another worker instance, if available, to be completed.
-> [!WARNING]
-> We will update the `GeneratorExit` exception to a new Dynamo exception. Please expect minor code breaking change in the near future.
Here's an example of how to implement this in your `RequestHandler`:
-When `GeneratorExit` is raised, the frontend receives the incomplete response and can seamlessly continue generation on another available worker instance, preserving the user experience even during worker shutdowns.
+When `EngineShutdown` is raised, the frontend receives the incomplete response and can seamlessly continue generation on another available worker instance, preserving the user experience even during worker shutdowns.
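
The shutdown path described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual handler: the `EngineShutdown` class here is a local stand-in (the real import path is not shown in this section), and the `shutdown_event` attribute, the request shape, and the `run_demo` helper are all hypothetical names introduced only for the example.

```python
import asyncio


class EngineShutdown(Exception):
    """Local stand-in for the Dynamo exception; the real import path is assumed, not shown."""


class RequestHandler:
    def __init__(self):
        # Hypothetical flag, set by a signal handler or lifecycle hook
        # when the worker has been told to stop.
        self.shutdown_event = asyncio.Event()

    async def generate(self, request):
        # Hypothetical request shape: a dict carrying the tokens to stream back.
        for token_id in request["token_ids"]:
            if self.shutdown_event.is_set():
                # End the stream early so the caller sees an incomplete response.
                raise EngineShutdown("worker is shutting down")
            yield {"token_ids": [token_id]}
            await asyncio.sleep(0)  # give other tasks a chance to run


async def run_demo():
    """Consume the stream and trigger shutdown after two chunks."""
    handler = RequestHandler()
    received = []
    interrupted = False
    try:
        async for chunk in handler.generate({"token_ids": [1, 2, 3, 4]}):
            received.append(chunk["token_ids"][0])
            if len(received) == 2:
                handler.shutdown_event.set()
    except EngineShutdown:
        interrupted = True
    return received, interrupted


if __name__ == "__main__":
    print(asyncio.run(run_demo()))
```

Checking the flag once per generated chunk keeps the shutdown latency bounded by the time it takes to produce a single chunk. In a real worker the loop body would pull outputs from the engine rather than iterate over a fixed list.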
For more information about how request migration works, see the [Request Migration Architecture](../fault-tolerance/request-migration.md) documentation.