@@ -57,7 +57,7 @@ Tune these values based on your workload. Connection window should accommodate `
## Registering a Backend
Similar to HTTP frontend, the registered backend will be auto-discovered and added to the frontend list of serving model. To register a backend, the same `register_llm()` API will be used. Currently the frontend support serving of the following model type and model input combination:
Similar to HTTP frontend, the registered backend will be auto-discovered and added to the frontend list of serving model. To register a backend, the same `register_model()` API will be used. Currently the frontend support serving of the following model type and model input combination:
* `ModelType::Completions` and `ModelInput::Text`: Combination for LLM backend that uses custom preprocessor
* `ModelType::Completions` and `ModelInput::Token`: Combination for LLM backend that uses Dynamo preprocessor (i.e. Dynamo vLLM / SGLang / TRTLLM backend)
...
@@ -153,7 +153,7 @@ See [Router Documentation](../router/README.md) for routing configuration detail
### With Backends
Backends auto-register with the frontend when they call `register_llm()`. Supported backends:
Backends auto-register with the frontend when they call `register_model()`. Supported backends:
- Automatically handles all backend workers registered to the Dynamo endpoint
Backend workers register themselves using the `register_llm` API, after which the KV Router automatically tracks worker state and makes routing decisions based on KV cache overlap.
Backend workers register themselves using the `register_model` API, after which the KV Router automatically tracks worker state and makes routing decisions based on KV cache overlap.
#### CLI Arguments
...
@@ -83,8 +83,8 @@ For more configuration options and tuning guidelines, see the [Router Guide](rou
## Prerequisites and Limitations
**Requirements:**
- **Dynamic endpoints only**: KV router requires `register_llm()` with `model_input=ModelInput.Tokens`. Your backend handler receives pre-tokenized requests with `token_ids` instead of raw text.
- **Dynamic endpoints only**: KV router requires `register_model()` with `model_input=ModelInput.Tokens`. Your backend handler receives pre-tokenized requests with `token_ids` instead of raw text.
- Backend workers must call `register_llm()` with `model_input=ModelInput.Tokens` (see [Backend Guide](../../development/backend-guide.md))
- Backend workers must call `register_model()` with `model_input=ModelInput.Tokens` (see [Backend Guide](../../development/backend-guide.md))
- You cannot use `--static-endpoint` mode with KV routing (use dynamic discovery instead)
- Automatically handles all backend workers registered to the Dynamo endpoint
Backend workers register themselves using the `register_llm` API, after which the KV Router automatically tracks worker state and makes routing decisions based on KV cache overlap.
Backend workers register themselves using the `register_model` API, after which the KV Router automatically tracks worker state and makes routing decisions based on KV cache overlap.
#### CLI Arguments
...
@@ -267,7 +267,7 @@ Dynamo supports disaggregated serving where prefill (prompt processing) and deco
### Automatic Prefill Router Activation
The prefill router is automatically created when:
1. A decode model is registered (e.g., via `register_llm()` with `ModelType.Chat | ModelType.Completions`)
1. A decode model is registered (e.g., via `register_model()` with `ModelType.Chat | ModelType.Completions`)
2. A prefill worker is detected with the same model name and `ModelType.Prefill`
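
The two activation conditions above can be modeled as a small check. This is a minimal sketch, not Dynamo's implementation: the `ModelType` flag enum, the `Registration` record, and the `prefill_router_active` helper are hypothetical stand-ins defined here for illustration, and treating either `Chat` or `Completions` as a "decode" registration is an assumption.

```python
from dataclasses import dataclass
from enum import Flag, auto


class ModelType(Flag):
    """Hypothetical stand-in for the model type flags named above."""
    Chat = auto()
    Completions = auto()
    Prefill = auto()


@dataclass(frozen=True)
class Registration:
    """Hypothetical record of one registered worker."""
    model_name: str
    model_type: ModelType


def prefill_router_active(registrations, model_name):
    """Return True when both activation conditions hold for model_name."""
    decode_types = ModelType.Chat | ModelType.Completions
    same_model = [r for r in registrations if r.model_name == model_name]
    # Condition 1: a decode model is registered (assumed: Chat and/or Completions).
    has_decode = any(r.model_type & decode_types for r in same_model)
    # Condition 2: a prefill worker is registered under the same model name.
    has_prefill = any(r.model_type & ModelType.Prefill for r in same_model)
    return has_decode and has_prefill
```

In this sketch, a decode registration alone does not activate the prefill router; activation requires a `Prefill` registration with a matching model name.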
**Key characteristics of the prefill router:**
...
@@ -283,7 +283,7 @@ When both workers are registered, requests are automatically routed.
# Decode worker registration (in your decode worker)
- ModelType.Chat. Your `generate` method receives a `request` and must return a response dict of type [OpenAI Chat Completion](https://platform.openai.com/docs/api-reference/chat).
- ModelType.Completions. Your `generate` method receives a `request` and must return a response dict of the older [Completions](https://platform.openai.com/docs/api-reference/completions).
`register_llm` can also take the following kwargs:
`register_model` can also take the following kwargs:
- `model_name`: The name to call the model. Your incoming HTTP requests model name must match this. Defaults to the hugging face repo name or the folder name.
- `context_length`: Max model length in tokens. Defaults to the model's set max. Only set this if you need to reduce KV cache allocation to fit into VRAM.
- `kv_cache_block_size`: Size of a KV block for the engine, in tokens. Defaults to 16.
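
The defaulting rules for these kwargs can be sketched as follows. This is an illustration only and does not call the real registration API: `resolve_kwargs`, `default_name`, and `model_max_length` are hypothetical names, and the caller is assumed to already know the repo or folder name and the model's own maximum length.

```python
from typing import Optional

DEFAULT_KV_CACHE_BLOCK_SIZE = 16  # tokens; the default stated in the list above


def resolve_kwargs(
    default_name: str,
    model_max_length: int,
    model_name: Optional[str] = None,
    context_length: Optional[int] = None,
    kv_cache_block_size: Optional[int] = None,
) -> dict:
    """Fill unset kwargs with the defaults described above (hypothetical helper)."""
    return {
        # Falls back to the repo or folder name supplied by the caller.
        "model_name": model_name if model_name is not None else default_name,
        # Falls back to the model's own maximum length.
        "context_length": (
            context_length if context_length is not None else model_max_length
        ),
        # Falls back to 16 tokens per KV block.
        "kv_cache_block_size": (
            kv_cache_block_size
            if kv_cache_block_size is not None
            else DEFAULT_KV_CACHE_BLOCK_SIZE
        ),
    }
```

For example, passing only `context_length=4096` keeps the default name and block size while reducing the context length, which matches the stated use case of shrinking KV cache allocation to fit into VRAM.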