Unverified Commit 52271760 authored by atchernych's avatar atchernych Committed by GitHub
Browse files

feat: Decomposed pipeline for EPP integration [DEP-730] (#5446)


Signed-off-by: default avatarAnna Tchernych <atchernych@nvidia.com>
parent 16a28058
......@@ -167,7 +167,7 @@ class FrontendArgGroup(ArgGroup):
env_var="DYN_ROUTER_MODE",
default="round-robin",
help="How to route the request.",
choices=["round-robin", "random", "kv"],
choices=["round-robin", "random", "kv", "direct"],
)
add_argument(
g,
......
......@@ -7,7 +7,7 @@
# - OpenAI HTTP server.
# - Auto-discovery: Watches etcd for engine/worker registration (via `register_llm`).
# - Pre-processor: Prompt templating and tokenization.
# - Router, defaulting to round-robin. Use --router-mode to switch (round-robin, random, kv).
# - Router, defaulting to round-robin. Use --router-mode to switch (round-robin, random, kv, direct).
#
# Pass `--interactive` or `-i` for text chat instead of HTTP server.
#
......@@ -197,6 +197,9 @@ async def async_main():
elif config.router_mode == "random":
router_mode = RouterMode.Random
kv_router_config = None
elif config.router_mode == "direct":
router_mode = RouterMode.Direct
kv_router_config = None
else:
router_mode = RouterMode.RoundRobin
kv_router_config = None
......
......@@ -63,26 +63,6 @@ kubectl get gateway inference-gateway
### 3. Setup secrets ###
Follow the steps in [model deployment](../../examples/backends/vllm/deploy/README.md) to deploy `Qwen/Qwen3-0.6B` model in aggregate mode using [agg.yaml](../../examples/backends/vllm/deploy/agg.yaml) in `my-model` kubernetes namespace.
Make sure to enable kv-routing by adding the env var in the FrontEnd.
```bash
mainContainer:
image: ...
env:
- name: DYN_ROUTER_MODE
value: "kv"
```
Sample commands to deploy model:
```bash
cd <dynamo-source-root>
cd examples/backends/vllm/deploy
kubectl apply -f agg.yaml -n my-model
```
Take a note of or change the DYNAMO_IMAGE in the model deployment file.
Do not forget docker registry secret if needed.
```bash
......@@ -93,7 +73,7 @@ kubectl create secret docker-registry docker-imagepullsecret \
--namespace=$NAMESPACE
```
Do not forget to include the HuggingFace token if required.
Do not forget to include the HuggingFace token.
```bash
export HF_TOKEN=your_hf_token
......@@ -139,13 +119,34 @@ make info # Check image tag
We recommend deploying Inference Gateway's Endpoint Picker as a Dynamo operator's managed component. Alternatively,
you could deploy it as a standalone pod
#### 5.a. Deploy as a DGD component
#### 5.a. Deploy as a DGD component (recommended)
We provide an example for llama-3-70b vLLM below.
```bash
kubectl apply -f operator-managed/examples/agg.yaml -n ${NAMESPACE}
kubectl apply -f operator-managed/examples/http-route.yaml -n ${NAMESPACE}
# Deploy PVC, first Update `storageClassName` in recipes/llama-3-70b/model-cache/model-cache.yaml to match your cluster before deploying
kubectl apply -f recipes/llama-3-70b/model-cache/model-cache.yaml
kubectl apply -f recipes/llama-3-70b/model-cache/model-download.yaml
# Deploy your model
kubectl apply -f recipes/llama-3-70b/vllm/agg/gaie/deploy.yaml -n ${NAMESPACE}
# Deploy the GAIE http-route CR.
kubectl apply -f recipes/llama-3-70b/vllm/agg/gaie/http-route.yaml -n ${NAMESPACE}
```
- When using GAIE the FrontEnd does not choose the workers. The routing is determined in the EPP.
- You must enable the flag in the FrontEnd cli as below.
```bash
command:
- python3
args:
- -m
- dynamo.frontend
- --router-mode
- direct
```
- The pre-selected worker (decode and prefill in case of the disaggregated serving) are passed in the request headers.
- The flag assures the routing respects this selection.
**Startup Probe Timeout:** The EPP has a default startup probe timeout of 30 minutes (10s × 180 failures).
If your model takes longer to load, increase the `failureThreshold` in the EPP's `startupProbe`. For example,
to allow 60 minutes for startup:
......@@ -166,6 +167,18 @@ If you installed it into a different namespace, you need to adjust the HttpRoute
##### 5.b.1 Deploy Your Model ###
We provide an example for Qwen vLLM below.
Before deploying you must enable the `--direct-route` flag in the FrontEnd cli in your Dynamo Graph.
```bash
command:
- python3
args:
- -m
- dynamo.frontend
- --router-mode
- direct
```
Follow the steps in [model deployment](../../examples/backends/vllm/deploy/README.md) to deploy `Qwen/Qwen3-0.6B` model in aggregate mode using [agg.yaml](../../examples/backends/vllm/deploy/agg.yaml) in `my-model` kubernetes namespace.
Sample commands to deploy model:
......@@ -176,10 +189,6 @@ cd examples/backends/vllm/deploy
kubectl apply -f agg.yaml -n my-model
```
Take a note of or change the DYNAMO_IMAGE in the model deployment file.
Do not forget docker registry secret if needed.
##### 5.b.2 Install Dynamo GIE helm chart ###
```bash
......@@ -214,14 +223,14 @@ You can configure the plugin by setting environment variables in the EPP compone
Common Vars for Routing Configuration:
- Set `DYN_BUSY_THRESHOLD` to configure the upper bound on how "full" a worker can be (often derived from kv_active_blocks or other load metrics) before the router skips it. If the selected worker exceeds this value, routing falls back to the next best candidate. By default the value is negative meaning this is not enabled.
- Set `DYN_ENFORCE_DISAGG=true` if you want to enforce every request being served in the disaggregated manner. By default it is false meaning if the the prefill worker is not available the request will be served in the aggregated manner.
- By default the Dynamo plugin uses KV routing. You can expose `DYN_USE_KV_ROUTING=false` in your [values.yaml](standalone/helm/dynamo-gaie/values.yaml) if you prefer to route in the round-robin fashion.
- If using kv-routing:
- Overwrite the `DYN_KV_BLOCK_SIZE` in your [values-dynamo-epp.yaml](./values-dynamo-epp.yaml) to match your model's block size.The `DYN_KV_BLOCK_SIZE` env var is ***MANDATORY*** to prevent silent KV routing failures.
- Set `DYNAMO_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes.
- Set `DYNAMO_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
- Set `DYNAMO_USE_KV_EVENTS=false` if you want to disable the workers sending KV events while using kv-routing
- See the [Router Guide](../../docs/pages/components/router/router-guide.md) for details.
- Set `DYN_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes. (default: 1)
- Set `DYN_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
- Set `DYN_USE_KV_EVENTS=false` if you want to disable the workers sending KV events while using kv-routing (default: true)
- `DYN_ROUTER_TEMPERATURE` — Temperature for worker sampling via softmax (default: 0.0)
- `DYN_ROUTER_REPLICA_SYNC` — Enable replica synchronization (default: false)
- `DYN_ROUTER_TRACK_ACTIVE_BLOCKS` — Track active blocks (default: true)
- `DYN_ROUTER_TRACK_OUTPUT_BLOCKS` — Track output blocks during generation (default: false)
- See the [KV cache routing design](../../docs/pages/design-docs/router-design.md) for details.
Stand-Alone installation only:
- Overwrite the `DYN_NAMESPACE` env var if needed to match your model's dynamo namespace.
......@@ -272,7 +281,7 @@ b. use port-forward to expose the gateway to the host
```bash
# in first terminal
kubectl port-forward svc/inference-gateway 8000:80 -n my-model
kubectl port-forward svc/inference-gateway 8000:80 -n kgateway-system
# in second terminal where you want to send inference requests
GATEWAY_URL=http://localhost:8000
......@@ -359,6 +368,14 @@ Sample inference output:
}
```
***If you have more than one HttpRoute running on the cluster***
Add the host to your HttpRoute.yaml and add the header `curl -H "Host: llama3-70b-agg.example.com" ...` to every request.
```bash
spec:
hostnames:
- llama3-70b-agg.example.com
```
### 8. Deleting the installation ###
If you need to uninstall run:
......@@ -407,4 +424,4 @@ The plugins set HTTP headers that are forwarded to the backend workers.
| Header | Description | Set By |
|--------|-------------|--------|
| `x-worker-instance-id` | Primary worker ID (decode worker in disagg mode) | kv-aware-scorer |
| `x-prefill-instance-id` | Prefill worker ID (disaggregated mode only) | kv-aware-scorer |
| `x-prefill-instance-id` | Prefill worker ID (disaggregated mode only) | kv-aware-scorer |
\ No newline at end of file
......@@ -26,74 +26,57 @@ package dynamo_kv_scorer
#include <stdlib.h> // for free
#include <stdbool.h>
// enum underlying type is uint32_t; matches cbindgen output
typedef uint32_t dynamo_llm_result_t;
enum { DYNAMO_OK = 0, DYNAMO_ERR = 1 };
// opaque handle forward-decl
struct WorkerSelectionPipeline;
typedef struct WorkerSelectionPipeline WorkerSelectionPipeline;
// Prototypes (C-compatible)
dynamo_llm_result_t dynamo_llm_init(const char *namespace_c_str,
const char *component_c_str,
int64_t worker_id,
uint32_t kv_block_size);
dynamo_llm_result_t dynamo_llm_shutdown(void);
dynamo_llm_result_t dynamo_llm_load_publisher_create(void);
dynamo_llm_result_t dynamo_kv_event_publish_stored(uint64_t event_id,
const uint32_t *token_ids,
const uintptr_t *num_block_tokens,
const uint64_t *block_ids,
size_t num_blocks,
const uint64_t *parent_hash,
uint64_t lora_id);
dynamo_llm_result_t dynamo_kv_event_publish_removed(uint64_t event_id,
const uint64_t *block_ids,
size_t num_blocks);
dynamo_llm_result_t dynamo_create_worker_selection_pipeline(const char *namespace_c_str,
const char *component_c_str,
const char *model_name_c_str,
bool use_kv_routing,
double busy_threshold,
double overlap_score_weight,
double router_temperature,
bool use_kv_events,
bool router_replica_sync,
bool enforce_disagg,
WorkerSelectionPipeline **pipeline_out);
dynamo_llm_result_t dynamo_destroy_worker_selection_pipeline(WorkerSelectionPipeline *pipeline);
dynamo_llm_result_t dynamo_query_worker_selection_and_annotate(WorkerSelectionPipeline *pipeline,
const char *request_json_c_str,
int64_t *decode_worker_id_out,
int64_t *prefill_worker_id_out,
uint32_t **token_ids_out,
size_t *token_count_out,
char **annotated_request_json_out);
dynamo_llm_result_t dynamo_free_worker_selection_result(uint32_t *token_ids,
size_t token_count,
char *annotated_request_json);
// Router bookkeeping functions for GAIE integration
dynamo_llm_result_t dynamo_router_add_request(WorkerSelectionPipeline *pipeline,
const char *request_id_c_str,
const uint32_t *token_ids,
size_t token_count,
uint64_t worker_id,
uint32_t dp_rank);
dynamo_llm_result_t dynamo_router_mark_prefill_complete(WorkerSelectionPipeline *pipeline,
const char *request_id_c_str);
dynamo_llm_result_t dynamo_router_free_request(WorkerSelectionPipeline *pipeline,
const char *request_id_c_str);
// Query router result codes (matches QueryRouterResult in Rust)
typedef uint32_t query_router_result_t;
enum {
QUERY_ROUTER_OK = 0,
QUERY_ROUTER_ERR_INVALID_HANDLE = 1,
QUERY_ROUTER_ERR_INVALID_PARAM = 2,
QUERY_ROUTER_ERR_INIT_FAILED = 3,
QUERY_ROUTER_ERR_QUERY_FAILED = 4,
QUERY_ROUTER_ERR_DISAGG_ENFORCED = 5,
QUERY_ROUTER_ERR_TIMEOUT = 6,
};
// opaque handle forward-decl for Router bindings
struct RouterHandles;
typedef struct RouterHandles RouterHandles;
// Routing result from route_chat_request
typedef struct {
bool is_disaggregated;
uint64_t prefill_worker_id;
uint64_t decode_worker_id;
uint32_t *token_ids;
size_t token_count;
} CRoutingResult;
// Router bindings API (replaces Pipeline API)
query_router_result_t create_routers(const char *namespace_c_str,
const char *component_c_str,
bool enforce_disagg,
RouterHandles **out_handle);
query_router_result_t route_request(RouterHandles *handle,
const char *request_json,
CRoutingResult *out_result);
query_router_result_t add_request(RouterHandles *handle,
const char *request_id,
const uint32_t *token_ids,
size_t token_count,
uint64_t worker_id,
uint32_t dp_rank);
query_router_result_t mark_prefill_complete(RouterHandles *handle,
const char *request_id);
query_router_result_t free_request(RouterHandles *handle,
const char *request_id);
void free_routing_result(CRoutingResult *result);
void destroy(RouterHandles *handle);
*/
import "C"
......@@ -121,9 +104,6 @@ const (
WorkerIDHeader = "x-worker-instance-id"
PrefillWorkerIDHeader = "x-prefill-instance-id"
RoutingModeHeader = "x-dynamo-routing-mode"
// EnableLocalUpdatesHeader controls router bookkeeping in the Dynamo frontend.
// Set to "false" for GAIE Stage 2 so the EPP handles bookkeeping via C FFI.
EnableLocalUpdatesHeader = "x-enable-local-updates"
// stateKey is the key used to store routing state in PluginState
stateKey = "dynamo-routing-state"
......@@ -141,9 +121,8 @@ type params struct{}
type DynamoRoutingState struct {
WorkerID string
PrefillWorkerID string
// TokenData holds the token IDs from the router.
// Currently unused but stored for future implementation where tokens
// may be passed to the worker via request body instead of headers.
// TokenData holds the token IDs from the router, needed for add_request bookkeeping.
// These tokens are used to compute overlap blocks and track active blocks accurately.
TokenData []int64
}
......@@ -214,48 +193,24 @@ var (
ffiOnce sync.Once
ffiErr error
ffiNamespace string
ffiComponent string
ffiModel string
ffiOverlapScoreWeight float64
ffiRouterTemperature float64
ffiKvBlockSize uint32
ffiWorkerID int64
ffiEnforceDisagg bool
ffiNamespace string
ffiComponent string
ffiEnforceDisagg bool
runtimeInitialized bool
routerInitialized bool
// Boxed pipeline handle (owned on the Rust side, opaque here)
pipeline *C.struct_WorkerSelectionPipeline
pipelineMutex sync.RWMutex
// Router handles (owned on the Rust side, opaque here)
routerHandles *C.struct_RouterHandles
routerHandlesMutex sync.RWMutex
)
func loadDynamoConfig() {
ffiNamespace = getEnvOrDefault("DYN_NAMESPACE", "vllm-agg")
ffiComponent = "backend" // The pipeline uses backend not DYN_COMPONENT which is epp
ffiModel = getEnvOrDefault("DYN_MODEL", "Qwen/Qwen3-0.6B")
ffiWorkerID = getEnvInt64OrDefault("DYNAMO_WORKER_ID", 1)
ffiComponent = "backend" // This is not the same as DYN_COMPONENT=epp (in this case)
ffiEnforceDisagg = getEnvBoolOrDefault("DYN_ENFORCE_DISAGG", false)
ffiOverlapScoreWeight = getEnvFloatOrDefault("DYN_OVERLAP_SCORE_WEIGHT", -1.0)
ffiRouterTemperature = getEnvFloatOrDefault("DYN_ROUTER_TEMPERATURE", -1.0)
kvBlockSizeStr := os.Getenv("DYN_KV_BLOCK_SIZE")
if kvBlockSizeStr == "" {
panic("DYN_KV_BLOCK_SIZE is required and must match the model card's kv_cache_block_size")
}
var tmp int64
if n, err := fmt.Sscanf(kvBlockSizeStr, "%d", &tmp); err != nil || n != 1 {
panic(fmt.Sprintf("DYN_KV_BLOCK_SIZE='%s' is not a valid integer", kvBlockSizeStr))
}
ffiKvBlockSize = uint32(tmp)
if ffiKvBlockSize < 16 || ffiKvBlockSize > 8192 {
panic(fmt.Sprintf("DYN_KV_BLOCK_SIZE=%d outside [16,8192]", ffiKvBlockSize))
}
if (ffiKvBlockSize & (ffiKvBlockSize - 1)) != 0 {
panic(fmt.Sprintf("DYN_KV_BLOCK_SIZE=%d must be a power of 2", ffiKvBlockSize))
}
fmt.Printf("Dynamo KV Scorer: Loaded DYN_KV_BLOCK_SIZE=%d\n", ffiKvBlockSize)
// Note: model name and kv_cache_block_size are now auto-discovered from the model card
fmt.Printf("Dynamo KV Scorer: namespace=%s, component=%s, enforce_disagg=%v\n",
ffiNamespace, ffiComponent, ffiEnforceDisagg)
}
func getEnvOrDefault(key, def string) string {
......@@ -265,26 +220,6 @@ func getEnvOrDefault(key, def string) string {
return def
}
func getEnvInt64OrDefault(key string, def int64) int64 {
if v := os.Getenv(key); v != "" {
var p int64
if n, err := fmt.Sscanf(v, "%d", &p); err == nil && n == 1 {
return p
}
}
return def
}
func getEnvFloatOrDefault(key string, def float64) float64 {
if v := os.Getenv(key); v != "" {
var p float64
if n, err := fmt.Sscanf(v, "%f", &p); err == nil && n == 1 {
return p
}
}
return def
}
func getEnvBoolOrDefault(key string, def bool) bool {
if v := os.Getenv(key); v != "" {
switch strings.ToLower(v) {
......@@ -297,46 +232,31 @@ func getEnvBoolOrDefault(key string, def bool) bool {
return def
}
// initFFI: initialize runtime and create a persistent boxed pipeline.
// initFFI: initialize router handles using the new Router bindings.
func initFFI() error {
ffiOnce.Do(func() {
loadDynamoConfig()
ns := C.CString(ffiNamespace)
cm := C.CString(ffiComponent)
model := C.CString(ffiModel)
defer C.free(unsafe.Pointer(ns))
defer C.free(unsafe.Pointer(cm))
defer C.free(unsafe.Pointer(model))
// Init Dynamo runtime
if rc := C.dynamo_llm_init(ns, cm, C.int64_t(ffiWorkerID), C.uint32_t(ffiKvBlockSize)); rc != C.DYNAMO_OK {
ffiErr = fmt.Errorf("dynamo_llm_init failed")
return
}
runtimeInitialized = true
// Create persistent pipeline
pipelineMutex.Lock()
defer pipelineMutex.Unlock()
// Create router handles
routerHandlesMutex.Lock()
defer routerHandlesMutex.Unlock()
rc := C.dynamo_create_worker_selection_pipeline(
rc := C.create_routers(
ns,
cm,
model,
C.bool(getEnvBoolOrDefault("DYN_USE_KV_ROUTING", true)),
C.double(getEnvFloatOrDefault("DYN_BUSY_THRESHOLD", -1.0)),
C.double(ffiOverlapScoreWeight),
C.double(ffiRouterTemperature),
C.bool(getEnvBoolOrDefault("DYN_USE_KV_EVENTS", true)),
C.bool(getEnvBoolOrDefault("DYNAMO_ROUTER_REPLICA_SYNC", false)), // no need as long as we call the Router Book keeping operations from the EPP.
C.bool(ffiEnforceDisagg),
&pipeline,
&routerHandles,
)
if rc != C.DYNAMO_OK {
ffiErr = fmt.Errorf("dynamo_create_worker_selection_pipeline failed")
if rc != C.QUERY_ROUTER_OK {
ffiErr = fmt.Errorf("create_routers failed with code %d", rc)
return
}
routerInitialized = true
})
return ffiErr
}
......@@ -362,16 +282,12 @@ func (k *KVAwareScorer) Score(
"tokenDataCount", len(tokenData),
)
// Store in request headers for the Lua filter at the gateway
// Store in request headers
if req.Headers == nil {
req.Headers = map[string]string{}
}
req.Headers[WorkerIDHeader] = workerID
// Disable local updates in the Dynamo frontend router.
// EPP handles bookkeeping via C FFI (add_request, mark_prefill_complete, free_request).
req.Headers[EnableLocalUpdatesHeader] = "false"
// Set routing mode and prefill worker ID based on disaggregated vs aggregated
if prefillWorkerID != "" && prefillWorkerID != workerID {
// Disaggregated mode: separate prefill and decode workers
......@@ -383,15 +299,13 @@ func (k *KVAwareScorer) Score(
}
// Store routing state for PreRequest to register with router bookkeeping.
// This is the correct place to store state - PreRequest is called AFTER
// scheduling is finalized, ensuring we only register committed requests.
// PreRequest is called AFTER scheduling is finalized, ensuring we only
// register committed requests (avoiding phantom bookkeeping entries).
if req.RequestId != "" {
routingState := &DynamoRoutingState{
WorkerID: workerID,
PrefillWorkerID: prefillWorkerID,
// TokenData is stored for future use. Currently not passed to workers
// via headers (too large). May be passed via request body in future.
TokenData: tokenData,
TokenData: tokenData,
}
k.pluginState.Write(req.RequestId, plugins.StateKey(stateKey), routingState)
}
......@@ -405,8 +319,8 @@ func (k *KVAwareScorer) Score(
}
// PreRequest is called after scheduling is finalized and before the request is sent to the worker.
// This is the correct place to register the request with the Dynamo router's bookkeeping,
// as we know the request WILL be dispatched (avoiding phantom bookkeeping entries).
// This registers the request with the Dynamo router's bookkeeping (add_request), passing the
// token data obtained during Score(). This ensures only committed requests are tracked.
func (k *KVAwareScorer) PreRequest(
ctx context.Context,
request *schedtypes.LLMRequest,
......@@ -432,8 +346,16 @@ func (k *KVAwareScorer) PreRequest(
return
}
// Parse worker ID
var workerIDUint uint64
if _, parseErr := fmt.Sscanf(state.WorkerID, "%d", &workerIDUint); parseErr != nil {
logger.V(logutil.DEFAULT).Error(parseErr, "PreRequest: invalid worker ID",
"requestID", request.RequestId, "workerID", state.WorkerID)
return
}
// Register request with router bookkeeping now that scheduling is committed
if addErr := k.callAddRequest(ctx, request.RequestId, state.TokenData, state.WorkerID, state.PrefillWorkerID); addErr != nil {
if addErr := CallAddRequest(request.RequestId, state.TokenData, workerIDUint, 0); addErr != nil {
logger.V(logutil.DEFAULT).Error(addErr, "PreRequest: failed to add request to router bookkeeping",
"requestID", request.RequestId)
return
......@@ -476,7 +398,7 @@ func (k *KVAwareScorer) ResponseStreaming(
// ResponseComplete is called after the complete response is sent to the client.
// It cleans up the router bookkeeping state for the completed request by calling
// dynamo_router_free_request to release resources associated with the request.
// free_request to release resources associated with the request.
func (k *KVAwareScorer) ResponseComplete(
ctx context.Context,
request *schedtypes.LLMRequest,
......@@ -510,7 +432,7 @@ func (k *KVAwareScorer) ResponseComplete(
"requestID", requestID)
}
// --------------------------- router call (persistent only) ---------------------------
// --------------------------- router call ---------------------------
func (k *KVAwareScorer) callDynamoRouter(
ctx context.Context,
......@@ -522,20 +444,25 @@ func (k *KVAwareScorer) callDynamoRouter(
logger.V(logutil.DEFAULT).Error(err, "FFI init failed")
return "", "", nil, err
}
if !runtimeInitialized {
return "", "", nil, fmt.Errorf("dynamo runtime not initialized")
if !routerInitialized {
return "", "", nil, fmt.Errorf("dynamo router not initialized")
}
pipelineMutex.RLock()
currentPipeline := pipeline
pipelineMutex.RUnlock()
routerHandlesMutex.RLock()
router := routerHandles
routerHandlesMutex.RUnlock()
if currentPipeline == nil {
return "", "", nil, fmt.Errorf("dynamo worker selection pipeline not created")
if router == nil {
return "", "", nil, fmt.Errorf("dynamo router handles not created")
}
// Build OpenAI-compatible JSON request from the new LLMRequest structure
requestBody := buildOpenAIRequest(req)
// Build OpenAI-compatible JSON request from the GAIE LLMRequest structure
requestBody, err := buildOpenAIRequest(req)
if err != nil {
logger.V(logutil.DEFAULT).Info("Invalid/empty request body for router; refusing to route",
"err", err.Error())
return "", "", nil, err
}
requestJSON, jsonErr := json.Marshal(requestBody)
if jsonErr != nil {
logger.V(logutil.DEFAULT).Error(jsonErr, "Failed to marshal OpenAI request")
......@@ -544,115 +471,104 @@ func (k *KVAwareScorer) callDynamoRouter(
cRequestJSON := C.CString(string(requestJSON))
defer C.free(unsafe.Pointer(cRequestJSON))
// Output variables
var cDecodeWorkerID C.int64_t
var cPrefillWorkerID C.int64_t
var cTokens *C.uint32_t
var cTokenCount C.size_t
var cAnnotatedJSON *C.char
// Call the worker selection pipeline
rc := C.dynamo_query_worker_selection_and_annotate(
currentPipeline,
cRequestJSON,
&cDecodeWorkerID,
&cPrefillWorkerID,
&cTokens,
&cTokenCount,
&cAnnotatedJSON,
)
if rc != C.DYNAMO_OK {
return "", "", nil, fmt.Errorf("dynamo_query_worker_selection_and_annotate failed")
var result C.CRoutingResult
rc := C.route_request(router, cRequestJSON, &result)
if rc != C.QUERY_ROUTER_OK {
return "", "", nil, fmt.Errorf("route_request failed with code %d", rc)
}
// Copy tokens into Go memory and free C memory
count := int(uintptr(cTokenCount))
// Copy token IDs into Go memory before freeing the Rust-allocated result.
// These tokens are needed for add_request bookkeeping (overlap + active block tracking).
count := int(result.token_count)
var tokens64 []int64
if count > 0 && cTokens != nil {
src := unsafe.Slice((*uint32)(unsafe.Pointer(cTokens)), count)
if count > 0 && result.token_ids != nil {
src := unsafe.Slice((*uint32)(unsafe.Pointer(result.token_ids)), count)
tokens64 = make([]int64, count)
for i := 0; i < count; i++ {
tokens64[i] = int64(src[i])
}
}
C.dynamo_free_worker_selection_result(cTokens, cTokenCount, cAnnotatedJSON)
workerIDStr := fmt.Sprintf("%d", int64(cDecodeWorkerID))
// Copy scalar result fields before freeing the struct
isDisaggregated := result.is_disaggregated
decodeWorkerID := uint64(result.decode_worker_id)
prefillWorkerIDVal := uint64(result.prefill_worker_id)
// Free the Rust-allocated routing result (including token_ids)
C.free_routing_result(&result)
workerIDStr := fmt.Sprintf("%d", decodeWorkerID)
prefillWorkerIDStr := ""
// Rust returns -1 for prefill_worker_id when not in disaggregated mode
if int64(cPrefillWorkerID) >= 0 {
prefillWorkerIDStr = fmt.Sprintf("%d", int64(cPrefillWorkerID))
if isDisaggregated {
prefillWorkerIDStr = fmt.Sprintf("%d", prefillWorkerIDVal)
}
logger.V(logutil.DEFAULT).Info("Worker selection completed",
"workerID", workerIDStr, "prefillWorkerID", prefillWorkerIDStr, "tokenCount", count)
"workerID", workerIDStr, "prefillWorkerID", prefillWorkerIDStr,
"isDisaggregated", isDisaggregated, "tokenCount", count)
return workerIDStr, prefillWorkerIDStr, tokens64, nil
}
// buildOpenAIRequest constructs an OpenAI-compatible request from the new LLMRequest structure
func buildOpenAIRequest(req *schedtypes.LLMRequest) map[string]any {
// buildOpenAIRequest constructs an OpenAI-compatible request from the GAIE LLMRequest structure.
// Preserves message roles for correct chat template application and tokenization.
func buildOpenAIRequest(req *schedtypes.LLMRequest) (map[string]any, error) {
requestBody := make(map[string]any)
// Extract prompt from the new Body structure
userText := "default prompt"
if req != nil && req.Body != nil {
if req.Body.ChatCompletions != nil && len(req.Body.ChatCompletions.Messages) > 0 {
// Extract text from chat completions messages
var sb strings.Builder
for _, msg := range req.Body.ChatCompletions.Messages {
sb.WriteString(msg.Content.PlainText())
sb.WriteString(" ")
// Preserve the original message structure for correct chat template application
if req == nil || req.Body == nil {
return nil, fmt.Errorf("missing request body")
}
if req.Body.ChatCompletions != nil && len(req.Body.ChatCompletions.Messages) > 0 {
messages := make([]map[string]any, 0, len(req.Body.ChatCompletions.Messages))
anyNonEmpty := false
for _, msg := range req.Body.ChatCompletions.Messages {
content := msg.Content.PlainText()
if strings.TrimSpace(content) != "" {
anyNonEmpty = true
}
userText = strings.TrimSpace(sb.String())
} else if req.Body.Completions != nil && req.Body.Completions.Prompt != "" {
userText = req.Body.Completions.Prompt
messages = append(messages, map[string]any{
"role": msg.Role,
"content": content,
})
}
if !anyNonEmpty {
return nil, fmt.Errorf("empty chat messages")
}
requestBody["messages"] = messages
} else if req.Body.Completions != nil && strings.TrimSpace(req.Body.Completions.Prompt) != "" {
// Legacy completions format - wrap as single user message
requestBody["messages"] = []map[string]any{
{"role": "user", "content": req.Body.Completions.Prompt},
}
} else {
return nil, fmt.Errorf("no messages or prompt provided")
}
requestBody["messages"] = []map[string]any{{"role": "user", "content": userText}}
// Model field is required by OpenAI spec but not used by the router's tokenizer
// (tokenizer is determined by the discovered model card)
if req != nil && strings.TrimSpace(req.TargetModel) != "" {
requestBody["model"] = req.TargetModel
} else {
requestBody["model"] = ffiModel
}
requestBody["max_tokens"] = 1
requestBody["temperature"] = 0.0
requestBody["stream"] = true
requestBody["nvext"] = map[string]any{
"annotations": []string{"query_instance_id"},
requestBody["model"] = "default"
}
return requestBody
return requestBody, nil
}
// --------------------------- router bookkeeping ---------------------------
// callAddRequest registers a request with the router's bookkeeping.
// This should be called after worker selection to track active requests.
func (k *KVAwareScorer) callAddRequest(
ctx context.Context,
requestID string,
tokenData []int64,
workerID string,
prefillWorkerID string,
) error {
logger := log.FromContext(ctx)
if !runtimeInitialized {
return fmt.Errorf("dynamo runtime not initialized")
// CallAddRequest registers a request with the router's bookkeeping.
func CallAddRequest(requestID string, tokenData []int64, workerID uint64, dpRank uint32) error {
if !routerInitialized {
return fmt.Errorf("dynamo router not initialized")
}
pipelineMutex.RLock()
currentPipeline := pipeline
pipelineMutex.RUnlock()
if currentPipeline == nil {
return fmt.Errorf("dynamo worker selection pipeline not created")
}
routerHandlesMutex.RLock()
router := routerHandles
routerHandlesMutex.RUnlock()
// Parse worker ID (use decode worker for bookkeeping in disagg mode)
var workerIDUint uint64
if _, err := fmt.Sscanf(workerID, "%d", &workerIDUint); err != nil {
return fmt.Errorf("invalid worker ID: %s", workerID)
if router == nil {
return fmt.Errorf("dynamo router handles not created")
}
// Convert token data from int64 to uint32
......@@ -669,69 +585,66 @@ func (k *KVAwareScorer) callAddRequest(
cTokens = (*C.uint32_t)(unsafe.Pointer(&tokens[0]))
}
rc := C.dynamo_router_add_request(
currentPipeline,
rc := C.add_request(
router,
cRequestID,
cTokens,
C.size_t(len(tokens)),
C.uint64_t(workerIDUint),
C.uint32_t(0), // dp_rank = 0 for now
C.uint64_t(workerID),
C.uint32_t(dpRank),
)
if rc != C.DYNAMO_OK {
return fmt.Errorf("dynamo_router_add_request failed")
if rc != C.QUERY_ROUTER_OK {
return fmt.Errorf("add_request failed with code %d", rc)
}
logger.V(logutil.VERBOSE).Info("Added request to router bookkeeping",
"requestID", requestID, "workerID", workerID, "tokenCount", len(tokens))
return nil
}
// CallMarkPrefillComplete marks prefill as completed for a request.
// Exported for use by response handlers.
func CallMarkPrefillComplete(requestID string) error {
if !runtimeInitialized {
return fmt.Errorf("dynamo runtime not initialized")
if !routerInitialized {
return fmt.Errorf("dynamo router not initialized")
}
pipelineMutex.RLock()
currentPipeline := pipeline
pipelineMutex.RUnlock()
routerHandlesMutex.RLock()
router := routerHandles
routerHandlesMutex.RUnlock()
if currentPipeline == nil {
return fmt.Errorf("dynamo worker selection pipeline not created")
if router == nil {
return fmt.Errorf("dynamo router handles not created")
}
cRequestID := C.CString(requestID)
defer C.free(unsafe.Pointer(cRequestID))
rc := C.dynamo_router_mark_prefill_complete(currentPipeline, cRequestID)
if rc != C.DYNAMO_OK {
return fmt.Errorf("dynamo_router_mark_prefill_complete failed")
rc := C.mark_prefill_complete(router, cRequestID)
if rc != C.QUERY_ROUTER_OK {
return fmt.Errorf("mark_prefill_complete failed with code %d", rc)
}
return nil
}
// callFreeRequestInternal cleans up router state for a completed/cancelled request.
func callFreeRequestInternal(requestID string) error {
if !runtimeInitialized {
return fmt.Errorf("dynamo runtime not initialized")
if !routerInitialized {
return fmt.Errorf("dynamo router not initialized")
}
pipelineMutex.RLock()
currentPipeline := pipeline
pipelineMutex.RUnlock()
routerHandlesMutex.RLock()
router := routerHandles
routerHandlesMutex.RUnlock()
if currentPipeline == nil {
return fmt.Errorf("dynamo worker selection pipeline not created")
if router == nil {
return fmt.Errorf("dynamo router handles not created")
}
cRequestID := C.CString(requestID)
defer C.free(unsafe.Pointer(cRequestID))
rc := C.dynamo_router_free_request(currentPipeline, cRequestID)
if rc != C.DYNAMO_OK {
return fmt.Errorf("dynamo_router_free_request failed")
rc := C.free_request(router, cRequestID)
if rc != C.QUERY_ROUTER_OK {
return fmt.Errorf("free_request failed with code %d", rc)
}
return nil
}
......@@ -739,21 +652,14 @@ func callFreeRequestInternal(requestID string) error {
// --------------------------- shutdown ---------------------------
func cleanupDynamo() error {
pipelineMutex.Lock()
defer pipelineMutex.Unlock()
routerHandlesMutex.Lock()
defer routerHandlesMutex.Unlock()
if pipeline != nil {
if rc := C.dynamo_destroy_worker_selection_pipeline(pipeline); rc != C.DYNAMO_OK {
fmt.Printf("Warning: dynamo_destroy_worker_selection_pipeline failed\n")
}
pipeline = nil
if routerHandles != nil {
C.destroy(routerHandles)
routerHandles = nil
}
if runtimeInitialized {
if rc := C.dynamo_llm_shutdown(); rc != C.DYNAMO_OK {
return fmt.Errorf("dynamo_llm_shutdown failed")
}
runtimeInitialized = false
}
routerInitialized = false
return nil
}
......@@ -19,7 +19,6 @@
{{- $useDynamo := default false .Values.epp.useDynamo -}}
{{- $resolvedDynNs := (include "dynamo-gaie.dynamoNamespace" .) | trim -}}
{{- $ns := ternary (required "set dynamoGraphDeploymentName when epp.useDynamo=true" $resolvedDynNs) "" $useDynamo -}}
{{- $kv := default "16" .Values.epp.dynamo.kvBlockSize -}}
{{- $useEtcd := default false .Values.epp.dynamo.useEtcd -}}
{{- $eppImage := required "extension.image is required - set via --set-string extension.image=$EPP_IMAGE or in values file" .Values.extension.image }}
......@@ -113,8 +112,6 @@ spec:
value: "nats://{{ $platformName }}-nats.{{ $platformNs }}:4222"
- name: DYN_NAMESPACE
value: "{{ $ns }}"
- name: DYN_KV_BLOCK_SIZE
value: "{{ $kv }}"
- name: USE_STREAMING
value: "true"
# HuggingFace token for downloading model config files
......
......@@ -72,7 +72,6 @@ epp:
# Dynamo-specific settings (only used when useDynamo: true)
configFile: "/etc/epp/epp-config-dynamo.yaml"
dynamo:
kvBlockSize: "16"
# Use ETCD for discovery instead of Kubernetes (default: false)
# Set to true via --set epp.dynamo.useEtcd=true to enable ETCD discovery
useEtcd: false
......
......@@ -87,10 +87,6 @@ func (e *EPPDefaults) GetBaseContainer(context ComponentContext) (corev1.Contain
// EPP-specific environment variables
container.Env = append(container.Env, []corev1.EnvVar{
{
Name: "DYN_KV_BLOCK_SIZE",
Value: "16",
},
{
Name: "USE_STREAMING",
Value: "true",
......
......@@ -6,15 +6,24 @@ use libc::c_char;
use once_cell::sync::OnceCell;
use std::borrow::Cow;
use std::ffi::CStr;
use std::ptr;
use std::sync::Arc;
use std::sync::atomic::{AtomicU32, Ordering};
use std::time::Duration;
use dynamo_llm::{
discovery::{KvWorkerMonitor, ModelWatcher},
kv_router::{protocols::*, publisher::KvEventPublisher},
};
use dynamo_llm::kv_router::{protocols::*, publisher::KvEventPublisher};
use dynamo_llm::preprocessor::OpenAIPreprocessor;
use dynamo_runtime::discovery::DiscoveryQuery;
use dynamo_runtime::{DistributedRuntime, Worker};
use dynamo_runtime::Runtime;
use dynamo_llm::discovery::{ModelManager, WORKER_TYPE_DECODE};
use dynamo_llm::kv_router::KvRouterConfig;
use dynamo_llm::kv_router::protocols::WorkerWithDpRank;
use dynamo_llm::kv_router::{KvRouter, PrefillRouter, RouterConfigOverride};
use dynamo_runtime::pipeline::RouterMode;
static WK: OnceCell<Worker> = OnceCell::new();
static DRT: AsyncOnceCell<DistributedRuntime> = AsyncOnceCell::new();
// [FIXME] shouldn't the publisher be instance passing between API calls?
......@@ -354,1156 +363,866 @@ pub extern "C" fn dynamo_kv_event_publish_removed(
}
}
// Need to setup etcd and nats to run these tests
// #[cfg(test)]
// mod tests {
// use super::*;
// use std::ffi::CString;
// #[test]
// fn test_dynamo_llm_init() {
// // Create C-compatible strings
// let namespace = CString::new("test_namespace").unwrap();
// let component = CString::new("test_component").unwrap();
// // Call the init function
// let result = unsafe {
// dynamo_llm_init(
// namespace.as_ptr(),
// component.as_ptr(),
// 1, // worker_id
// 32, // kv_block_size
// )
// };
// assert_eq!(result as u32, DynamoLlmResult::OK as u32);
// assert!(WK.get().is_some());
// let shutdown_result = dynamo_llm_shutdown();
// assert_eq!(shutdown_result as u32, DynamoLlmResult::OK as u32);
// }
// }
/* ------------------------------------------------------------------------
* Worker selection pipeline
* Router Bindings for GAIE EPP
* ------------------------------------------------------------------------ */
use std::pin::Pin;
const GENERATE_ENDPOINT: &str = "generate";
use anyhow::Context;
use dynamo_runtime::{Runtime, traits::DistributedRuntimeProvider};
use dynamo_llm::discovery::ModelManager;
use dynamo_llm::entrypoint::build_routed_pipeline;
use dynamo_llm::http::service::metrics::Metrics;
use dynamo_llm::kv_router::KvRouterConfig;
use dynamo_llm::model_card::ModelDeploymentCard;
use dynamo_llm::protocols::openai::nvext::NvExt;
use dynamo_llm::types::{
Annotated,
openai::chat_completions::{
NvCreateChatCompletionRequest, NvCreateChatCompletionStreamResponse,
},
};
use dynamo_runtime::{
engine::AsyncEngineStream,
pipeline::{ManyOut, RouterMode, ServiceEngine, SingleIn},
};
/// Opaque handle exposed to C — it owns its own Worker/runtime and engine.
pub struct WorkerSelectionPipeline {
wk: Worker,
engine: ServiceEngine<
SingleIn<NvCreateChatCompletionRequest>,
ManyOut<Annotated<NvCreateChatCompletionStreamResponse>>,
>,
/// KV router for bookkeeping operations (only present when router_mode is KV)
kv_router: Option<Arc<dynamo_llm::kv_router::KvRouter>>,
// Default timeout for bookkeeping operations
const BOOKKEEPING_TIMEOUT_SEC: u64 = 5;
/// Complete routing result for a chat completion request (C-compatible)
#[repr(C)]
pub struct CRoutingResult {
/// Whether disaggregated mode is active
pub is_disaggregated: bool,
/// Prefill worker ID (only valid if is_disaggregated is true)
pub prefill_worker_id: u64,
/// Decode worker ID
pub decode_worker_id: u64,
/// Token IDs (needed for add_request callback)
pub token_ids: *mut u32,
/// Number of tokens in the request
pub token_count: usize,
}
/// Create a worker-selection pipeline ("generate" endpoint).
///
/// # Safety
/// - `namespace_c_str`, `component_c_str`, and `model_name_c_str` must be **non-null** pointers to
/// **NUL-terminated** C strings that contain **valid UTF-8**. They must remain valid for the
/// duration of this call.
/// - `pipeline_out` must be **non-null** and point to writable memory for a `*mut WorkerSelectionPipeline`.
/// On success this function writes exactly once to `*pipeline_out`. The caller becomes the owner of
/// that pointer and **must** later free it by calling `dynamo_destroy_worker_selection_pipeline`.
/// - Must be called **after** a successful `dynamo_llm_init()`; otherwise behavior is undefined.
/// - This function is not signal-safe and must not be called from a signal handler.
/// - This function may block internally; do not call it from contexts that forbid blocking.
///
/// # Errors
/// Returns `DynamoLlmResult::ERR` on failure and does not write to `pipeline_out`.
/// # Safety
/// See detailed safety docs above. Additional parameter:
/// - `enforce_disagg`: If true, requests fail when disaggregated serving is unavailable.
/// If false, falls back to aggregated serving.
#[unsafe(no_mangle)]
pub unsafe extern "C" fn dynamo_create_worker_selection_pipeline(
namespace_c_str: *const c_char,
component_c_str: *const c_char,
model_name_c_str: *const c_char,
use_kv_routing: bool,
busy_threshold: f64,
overlap_score_weight: f64,
router_temperature: f64,
use_kv_events: bool,
router_replica_sync: bool,
enforce_disagg: bool,
pipeline_out: *mut *mut WorkerSelectionPipeline,
) -> DynamoLlmResult {
if pipeline_out.is_null() {
tracing::error!("pipeline_out pointer is null");
return DynamoLlmResult::ERR;
}
let wk = match WK.get() {
Some(w) => w.clone(),
None => {
tracing::error!("Worker not initialized. Call dynamo_llm_init first.");
return DynamoLlmResult::ERR;
}
};
let namespace = match unsafe { CStr::from_ptr(namespace_c_str) }.to_str() {
Ok(s) => s.to_owned(),
Err(e) => {
tracing::error!(error = ?e, "bad namespace");
return DynamoLlmResult::ERR;
impl Default for CRoutingResult {
fn default() -> Self {
Self {
is_disaggregated: false,
prefill_worker_id: 0,
decode_worker_id: 0,
token_ids: ptr::null_mut(),
token_count: 0,
}
};
let component_cow = unsafe { cstr_or_default(component_c_str, "backend") };
if let Cow::Borrowed("backend") = &component_cow {
tracing::info!("defaulting to \"backend\" for component");
}
let component: String = component_cow.into_owned();
}
let model = match unsafe { CStr::from_ptr(model_name_c_str) }.to_str() {
Ok(s) => s.to_owned(),
Err(e) => {
tracing::error!(error = ?e, "bad model");
return DynamoLlmResult::ERR;
}
};
/// Container holding routers and preprocessor for query routing
pub struct RouterHandles {
prefill_router: Arc<PrefillRouter>,
decode_router: Arc<KvRouter>,
#[allow(dead_code)]
model_manager: Arc<ModelManager>,
#[allow(dead_code)]
namespace: String,
/// Cached runtime for executing async operations (avoids creating new runtime per call)
runtime: Runtime,
/// Preprocessor for tokenization and template application (fetched via discovery)
preprocessor: Option<Arc<OpenAIPreprocessor>>,
}
let make_engine = || async {
let router_mode = if use_kv_routing {
RouterMode::KV
} else {
RouterMode::RoundRobin
};
impl RouterHandles {
/// Query optimal prefill worker for a request.
/// Returns worker_id on success.
async fn query_prefill_worker(
&self,
tokens: &[u32],
update_states: bool,
lora_name: Option<String>,
priority_jump: f64,
) -> Result<u64, QueryRouterResult> {
self.prefill_router
.query_prefill_worker(tokens, update_states, lora_name, priority_jump)
.await
.map(|(worker_id, _dp_rank)| worker_id)
.map_err(|e| {
tracing::error!(error = ?e, "Prefill query failed");
QueryRouterResult::ErrQueryFailed
})
}
let kv_router_config = if use_kv_routing {
Some(KvRouterConfig {
overlap_score_weight,
router_temperature,
use_kv_events,
router_replica_sync,
..KvRouterConfig::default()
/// Query optimal decode worker for a request.
/// For disaggregated mode, set `is_disaggregated` to true to use overlap_score_weight=0
/// (since KV cache is being transferred from prefill, not reused).
///
/// Note: The C bindings are query-only and must not mutate router state during worker
/// selection. State updates require a `context_id` (request id) and are managed via the
/// explicit bookkeeping APIs (`add_request`, `mark_prefill_complete`, `free_request`).
/// Returns (worker, overlap_blocks) on success.
async fn query_decode_worker(
&self,
tokens: &[u32],
is_disaggregated: bool,
) -> Result<(WorkerWithDpRank, u32), QueryRouterResult> {
// For decode phase in disaggregated mode, use overlap_score_weight=0
// This matches prefill_router.rs
let config_override = if is_disaggregated {
Some(RouterConfigOverride {
overlap_score_weight: Some(0.0),
..Default::default()
})
} else {
None
};
create_worker_selection_pipeline_chat(
&namespace,
&component,
&model,
router_mode,
(busy_threshold >= 0.0).then_some(busy_threshold),
kv_router_config,
enforce_disagg,
)
.await
};
self.decode_router
.find_best_match(None, tokens, config_override.as_ref(), false, None, 0.0)
.await
.map_err(|e| {
tracing::error!(error = ?e, "Decode query failed");
QueryRouterResult::ErrQueryFailed
})
}
}
let (engine, kv_router) = match wk.runtime().secondary().block_on(make_engine()) {
Ok(p) => p,
Err(e) => {
tracing::error!(error = ?e, "create_worker_selection_pipeline_chat failed");
return DynamoLlmResult::ERR;
}
};
/// Opaque handle for the router pair
pub type RouterHandlesPtr = *mut RouterHandles;
let handle = Box::new(WorkerSelectionPipeline {
wk,
engine,
kv_router,
});
unsafe {
*pipeline_out = Box::into_raw(handle);
/// Result codes for query router C FFI
#[repr(u32)]
pub enum QueryRouterResult {
Ok = 0,
ErrInvalidHandle = 1,
ErrInvalidParam = 2,
ErrInitFailed = 3,
ErrQueryFailed = 4,
ErrDisaggEnforced = 5,
ErrTimeout = 6,
}
/// Build a `KvRouterConfig` from defaults, overridden by optional `DYN_*` environment variables.
///
/// Supported env vars (all optional — unset or empty values are ignored):
/// - `DYN_OVERLAP_SCORE_WEIGHT` — Weight for overlap score in worker selection (default: 1.0)
/// - `DYN_ROUTER_TEMPERATURE` — Temperature for worker sampling via softmax (default: 0.0)
/// - `DYN_USE_KV_EVENTS` — Use KV events for cache tracking (default: true)
/// - `DYN_ROUTER_REPLICA_SYNC` — Enable replica synchronization (default: false)
/// - `DYN_ROUTER_TRACK_ACTIVE_BLOCKS` — Track active blocks (default: true)
/// - `DYN_ROUTER_TRACK_OUTPUT_BLOCKS` — Track output blocks during generation (default: false)
fn kv_router_config_from_env() -> KvRouterConfig {
let mut cfg = KvRouterConfig::default();
fn env_f64(key: &str) -> Option<f64> {
std::env::var(key).ok().and_then(|v| v.parse().ok())
}
DynamoLlmResult::OK
fn env_bool(key: &str) -> Option<bool> {
std::env::var(key)
.ok()
.and_then(|v| match v.to_lowercase().as_str() {
"true" | "1" | "yes" | "on" => Some(true),
"false" | "0" | "no" | "off" => Some(false),
_ => None,
})
}
if let Some(v) = env_f64("DYN_OVERLAP_SCORE_WEIGHT") {
cfg.overlap_score_weight = v;
}
if let Some(v) = env_f64("DYN_ROUTER_TEMPERATURE") {
cfg.router_temperature = v;
}
if let Some(v) = env_bool("DYN_USE_KV_EVENTS") {
cfg.use_kv_events = v;
}
if let Some(v) = env_bool("DYN_ROUTER_REPLICA_SYNC") {
cfg.router_replica_sync = v;
}
if let Some(v) = env_bool("DYN_ROUTER_TRACK_ACTIVE_BLOCKS") {
cfg.router_track_active_blocks = v;
}
if let Some(v) = env_bool("DYN_ROUTER_TRACK_OUTPUT_BLOCKS") {
cfg.router_track_output_blocks = v;
}
tracing::info!(
overlap_score_weight = cfg.overlap_score_weight,
router_temperature = cfg.router_temperature,
use_kv_events = cfg.use_kv_events,
router_replica_sync = cfg.router_replica_sync,
router_track_active_blocks = cfg.router_track_active_blocks,
router_track_output_blocks = cfg.router_track_output_blocks,
"KvRouterConfig initialized (DYN_* env overrides applied)"
);
cfg
}
/// Query worker selection on an existing pipeline and return:
/// - `decode_worker_id_out` (`i64`): The decode worker ID (primary worker)
/// - `prefill_worker_id_out` (`i64`): The prefill worker ID (-1 if not in disaggregated mode)
/// - `token_ids_out` (heap-allocated `*mut u32`; caller must free via
/// `dynamo_free_worker_selection_result`)
/// - `token_count_out` (`usize`)
/// - `annotated_request_json_out` (`*mut c_char` to a NUL-terminated C string;
/// caller frees via the same free function)
/// Create router handles for query-only routing
///
/// # Safety
/// - `pipeline`
/// - Must be a **non-null** pointer previously returned by
/// `dynamo_create_worker_selection_pipeline` and not yet passed to
/// `dynamo_destroy_worker_selection_pipeline`.
/// - Must remain valid for the entire duration of this call.
/// - **Do not** call this function concurrently on the same `pipeline` pointer
/// from multiple threads unless the surrounding code guarantees synchronization.
/// - `request_json_c_str`
/// - Must be a **non-null**, **NUL-terminated** C string containing **valid UTF-8**.
/// - The JSON must represent a valid `NvCreateChatCompletionRequest`; otherwise this
/// function returns `DynamoLlmResult::ERR`.
/// - Must remain valid for the duration of this call.
/// - Output pointers:
/// - `decode_worker_id_out`, `prefill_worker_id_out`, `token_ids_out`, `token_count_out`,
/// and `annotated_request_json_out` must each be **non-null** and point to
/// writable memory for their respective types. On success, this function
/// writes to all five outputs exactly once.
/// - On **error**, outputs are left unmodified.
/// - Ownership & deallocation:
/// - On success, if there are zero tokens, `*token_ids_out` may be set to `NULL`
/// and `*token_count_out` set to `0`.
/// - If non-null, the buffer written to `*token_ids_out` is allocated with the
/// Rust global allocator and **must** be freed by calling
/// `dynamo_free_worker_selection_result` with the same `token_count_out` value.
/// - The pointer written to `*annotated_request_json_out` is a `CString` allocated
/// by Rust and **must** be freed by calling `dynamo_free_worker_selection_result`.
/// - **Do not** free these with `free(3)` or any other allocator; doing so is
/// undefined behavior.
/// - Blocking & context:
/// - This function may **block** internally while it performs async work; do not
/// call it from contexts that forbid blocking (e.g., signal handlers).
/// - Process/ABI assumptions:
/// - The caller and callee must run in the same process and use the same Rust
/// global allocator for the paired allocation/free described above.
/// - This function is not signal-safe.
/// This function waits for at least one decode worker to be discovered before returning.
/// It auto-detects disaggregated mode by checking if prefill workers are present.
/// The KV cache block size is automatically fetched from the model card via discovery.
///
/// # Errors
/// Returns `DynamoLlmResult::ERR` if any precondition fails (null/invalid pointers,
/// malformed UTF-8/JSON, pipeline errors, allocation failures, etc.). On error, no
/// output pointer is written.
/// # Arguments
/// - `namespace`: Namespace for the model
/// - `component`: Component name (defaults to "backend" if NULL or empty)
/// - `enforce_disagg`: If true, disaggregated mode is required (fails if no prefill workers found)
/// - `out_handle`: Output handle
///
/// # Output values
/// - `decode_worker_id_out`: The decode worker ID (primary worker in aggregated mode)
/// - `prefill_worker_id_out`: The prefill worker ID (only set in disaggregated mode, -1 if not present)
/// - `token_ids_out`, `token_count_out`: Token IDs and count
/// - `annotated_request_json_out`: The annotated request JSON
/// # Safety
/// - All string parameters must be valid null-terminated C strings
/// - The returned handle must be freed with `destroy`
#[unsafe(no_mangle)]
pub unsafe extern "C" fn dynamo_query_worker_selection_and_annotate(
pipeline: *mut WorkerSelectionPipeline,
request_json_c_str: *const c_char,
decode_worker_id_out: *mut i64,
prefill_worker_id_out: *mut i64,
token_ids_out: *mut *mut u32,
token_count_out: *mut usize,
annotated_request_json_out: *mut *mut c_char,
) -> DynamoLlmResult {
if pipeline.is_null() {
tracing::error!("Pipeline pointer is null");
return DynamoLlmResult::ERR;
}
if decode_worker_id_out.is_null()
|| prefill_worker_id_out.is_null()
|| token_ids_out.is_null()
|| token_count_out.is_null()
|| annotated_request_json_out.is_null()
{
tracing::error!("One or more output pointers are null");
return DynamoLlmResult::ERR;
pub unsafe extern "C" fn create_routers(
namespace: *const c_char,
component: *const c_char,
enforce_disagg: bool,
out_handle: *mut RouterHandlesPtr,
) -> QueryRouterResult {
if namespace.is_null() || out_handle.is_null() {
return QueryRouterResult::ErrInvalidParam;
}
let req_str = match unsafe { CStr::from_ptr(request_json_c_str) }.to_str() {
Ok(s) => s,
Err(e) => {
tracing::error!(error = ?e, "bad request json");
return DynamoLlmResult::ERR;
}
};
let request: NvCreateChatCompletionRequest = match serde_json::from_str(req_str) {
Ok(r) => r,
Err(e) => {
tracing::error!(error = ?e, "parse request failed");
return DynamoLlmResult::ERR;
}
};
let pl = unsafe { &*pipeline };
let fut = async { query_worker_selection_and_annotate(&pl.engine, request).await };
let (result, annotated_req) = match pl.wk.runtime().secondary().block_on(fut) {
Ok(v) => v,
Err(e) => {
tracing::error!(error = ?e, "query_worker_selection_and_annotate failed");
return DynamoLlmResult::ERR;
}
let namespace_str = match unsafe { CStr::from_ptr(namespace) }.to_str() {
Ok(s) => s.to_owned(),
Err(_) => return QueryRouterResult::ErrInvalidParam,
};
let tokens_ptr = if result.tokens.is_empty() {
std::ptr::null_mut()
let component_str = if component.is_null() {
"backend".to_string()
} else {
let len = result.tokens.len();
let layout = std::alloc::Layout::array::<u32>(len).unwrap();
let ptr = unsafe { std::alloc::alloc(layout) as *mut u32 };
if ptr.is_null() {
tracing::error!("alloc tokens failed");
return DynamoLlmResult::ERR;
}
unsafe {
std::ptr::copy_nonoverlapping(result.tokens.as_ptr(), ptr, len);
match unsafe { CStr::from_ptr(component) }.to_str() {
Ok(s) if !s.is_empty() => s.to_owned(),
_ => "backend".to_string(),
}
ptr
};
let annotated_json = match serde_json::to_string(&annotated_req) {
Ok(s) => s,
// Create the runtime once - it will be stored in RouterHandles and reused
let runtime = match Runtime::from_settings() {
Ok(rt) => rt,
Err(e) => {
if !tokens_ptr.is_null() {
let layout = std::alloc::Layout::array::<u32>(result.tokens.len()).unwrap();
unsafe {
std::alloc::dealloc(tokens_ptr as *mut u8, layout);
}
tracing::error!(error = ?e, "serialize annotated request failed");
}
return DynamoLlmResult::ERR;
tracing::error!(error = ?e, "Failed to create runtime");
return QueryRouterResult::ErrInitFailed;
}
};
let cjson = match std::ffi::CString::new(annotated_json) {
Ok(c) => c,
Err(e) => {
tracing::error!(error = ?e, "CString::new for annotated JSON failed");
if !tokens_ptr.is_null() {
let layout = std::alloc::Layout::array::<u32>(result.tokens.len()).unwrap();
unsafe {
std::alloc::dealloc(tokens_ptr as *mut u8, layout);
}
// Clone for use inside the async block (the original will be moved into handles)
let runtime_for_async = runtime.clone();
let result = runtime_for_async.secondary().block_on(async {
let drt = match DistributedRuntime::from_settings(runtime_for_async.clone()).await {
Ok(drt) => drt,
Err(e) => {
tracing::error!(error = ?e, "Failed to create distributed runtime");
return Err(QueryRouterResult::ErrInitFailed);
}
return DynamoLlmResult::ERR;
};
// Wait for at least one worker to be discovered before proceeding
// This ensures the decode router can be created successfully
let instance_count = wait_for_discovery_sync(&drt).await;
if instance_count == 0 {
tracing::error!(
"Discovery sync failed: no worker instances found. Is the backend running?"
);
return Err(QueryRouterResult::ErrInitFailed);
}
};
unsafe {
*decode_worker_id_out = result.decode_worker_id.unwrap_or(0);
*prefill_worker_id_out = result.prefill_worker_id.unwrap_or(-1);
*token_ids_out = tokens_ptr;
*token_count_out = result.tokens.len();
*annotated_request_json_out = cjson.into_raw();
}
DynamoLlmResult::OK
}
tracing::info!(
"Discovery sync complete, {} worker(s) found",
instance_count
);
/// Destroy a previously created pipeline.
///
/// # Safety
/// - `pipeline`
/// - **Must** be a non-null pointer that was **originally returned by**
/// `dynamo_create_worker_selection_pipeline` (i.e., obtained via
/// `Box::into_raw` on a `WorkerSelectionPipeline`).
/// - **Must not** have been passed to this function (or otherwise freed)
/// before. Passing the same pointer twice is a **double free** and is
/// undefined behavior.
/// - **Must not** be used by any other thread while this function runs.
/// Ensure no concurrent calls are in flight that read or write through
/// this handle (e.g., `dynamo_query_worker_selection_and_annotate`).
/// - After a successful call, the pointer is **invalid** and must not be
/// dereferenced or used again in any way.
/// - Allocator/ABI
/// - The caller and callee must be in the same process and share the same
/// allocator; this function reclaims the allocation that was created by
/// Rust for the handle.
/// - Lifetime/FFI
/// - Do not call from contexts that forbid blocking or running destructors
/// (e.g., signal handlers).
///
/// # Errors
/// - Returns `DynamoLlmResult::ERR` if `pipeline` is null.
/// - On `OK`, ownership of `pipeline` is taken and the underlying resources
/// are dropped; using the pointer after return is undefined behavior.
#[unsafe(no_mangle)]
pub unsafe extern "C" fn dynamo_destroy_worker_selection_pipeline(
pipeline: *mut WorkerSelectionPipeline,
) -> DynamoLlmResult {
if pipeline.is_null() {
tracing::error!("Pipeline pointer is null");
return DynamoLlmResult::ERR;
}
let _boxed: Box<WorkerSelectionPipeline> = unsafe { Box::from_raw(pipeline) };
DynamoLlmResult::OK
}
let kv_router_config = kv_router_config_from_env();
/// Free buffers allocated by `dynamo_query_worker_selection_and_annotate`.
///
/// # Safety
/// - `token_ids` and `annotated_request_json` **must come from this library**:
/// - `token_ids` must be the exact pointer previously returned by
/// `dynamo_query_worker_selection_and_annotate` for the tokens buffer,
/// allocated with Rust’s global allocator in this process.
/// - `annotated_request_json` must be the exact pointer previously returned by
/// `CString::into_raw` inside `dynamo_query_worker_selection_and_annotate`.
/// - **Call at most once** per pointer. Passing the same pointer again is a
/// double-free and is undefined behavior.
/// - Pointer/length invariants:
/// - If `token_ids` is non-null, `token_count` **must** be the exact length
/// originally returned. Mismatched lengths cause invalid deallocation.
/// - If `token_ids` is null, `token_count` should be `0`.
/// - Passing a non-null `token_ids` with `token_count == 0` will leak in this
/// implementation (we only dealloc when `token_count > 0`).
/// - After return, the pointers are **invalid** and must not be used again.
/// - The caller and callee must be in the same process and share the same
/// allocator/ABI (these deallocations use Rust’s global allocator).
/// - Ensure no other threads are concurrently reading/writing these buffers when
/// freeing them.
/// - Do not call from contexts that forbid running destructors (e.g., signal handlers).
///
/// Returns `DynamoLlmResult::OK` on success.
#[unsafe(no_mangle)]
pub unsafe extern "C" fn dynamo_free_worker_selection_result(
token_ids: *mut u32,
token_count: usize,
annotated_request_json: *mut c_char,
) -> DynamoLlmResult {
if token_count > 0 {
match std::alloc::Layout::array::<u32>(token_count) {
Ok(layout) if !token_ids.is_null() => unsafe {
std::alloc::dealloc(token_ids as *mut u8, layout);
// Get component and endpoint
let component_handle = match drt.namespace(&namespace_str) {
Ok(ns) => match ns.component(&component_str) {
Ok(c) => c,
Err(e) => {
tracing::error!(error = ?e, "Failed to get component");
return Err(QueryRouterResult::ErrInitFailed);
}
},
_ => {}
}
}
if !annotated_request_json.is_null() {
unsafe {
drop(std::ffi::CString::from_raw(annotated_request_json));
}
}
DynamoLlmResult::OK
}
Err(e) => {
tracing::error!(error = ?e, "Failed to get namespace");
return Err(QueryRouterResult::ErrInitFailed);
}
};
let endpoint = component_handle.endpoint("generate");
/// Default timeout for GAIE bookkeeping operations (30 seconds)
const GAIE_BOOKKEEPING_TIMEOUT_SECS: u64 = 30;
/// Helper to validate pipeline pointer and extract request_id from C string.
/// Returns `Err(DynamoLlmResult::ERR)` on validation failure, `Ok((pipeline_ref, request_id))` on success.
unsafe fn validate_pipeline_and_request_id(
pipeline: *mut WorkerSelectionPipeline,
request_id_c_str: *const c_char,
operation: &str,
) -> Result<(&'static WorkerSelectionPipeline, String), DynamoLlmResult> {
if pipeline.is_null() {
tracing::error!("[GAIE] {} failed: pipeline pointer is null", operation);
return Err(DynamoLlmResult::ERR);
}
let model_manager = Arc::new(ModelManager::new());
let request_id = match unsafe { CStr::from_ptr(request_id_c_str) }.to_str() {
Ok(s) => s.to_owned(),
Err(e) => {
tracing::error!(error = ?e, "[GAIE] {} failed: bad request_id", operation);
return Err(DynamoLlmResult::ERR);
}
};
// Fetch model card via discovery and create preprocessor + get block_size
let (preprocessor, block_size, model_name) =
match fetch_preprocessor_from_discovery(&drt, &namespace_str).await {
Ok((prep, bs, name)) => {
tracing::info!(
kv_cache_block_size = bs,
"Preprocessor created from discovery"
);
(Some(prep), bs, name)
}
Err(e) => {
tracing::error!(
error = %e,
"Failed to fetch model card from discovery - cannot determine block_size"
);
return Err(QueryRouterResult::ErrInitFailed);
}
};
// SAFETY: Caller guarantees pipeline is valid for the duration of the call
let pl: &'static WorkerSelectionPipeline = unsafe { &*pipeline };
Ok((pl, request_id))
}
// Create decode router
let decode_router = match model_manager
.kv_chooser_for(
&endpoint,
block_size,
Some(kv_router_config),
WORKER_TYPE_DECODE,
)
.await
{
Ok(r) => r,
Err(e) => {
tracing::error!(error = ?e, "Failed to create decode router");
return Err(QueryRouterResult::ErrInitFailed);
}
};
/// Helper to run an async bookkeeping operation with timeout.
/// Returns `OK` on success or timeout, `ERR` only on validation failures (handled by caller).
fn run_bookkeeping_with_timeout<F, Fut>(
pl: &WorkerSelectionPipeline,
operation: &'static str,
request_id: &str,
f: F,
) -> DynamoLlmResult
where
F: FnOnce() -> Fut,
Fut: std::future::Future<Output = ()>,
{
use std::time::Duration;
let timeout_duration = Duration::from_secs(GAIE_BOOKKEEPING_TIMEOUT_SECS);
let fut = f();
let result = pl
.wk
.runtime()
.secondary()
.block_on(async { tokio::time::timeout(timeout_duration, fut).await });
// Create PrefillRouter based on one-time discovery of prefill workers
// Auto-detects disaggregated mode by checking if prefill workers are present
// The prefill workers have to be created before the epp is created.
// Given that we wait first for the decode worker to show up it is reasonable to assume the prefill will be up as well.
let prefill_router = match find_prefill_endpoint(&drt, &namespace_str).await {
Some(prefill_endpoint) => {
tracing::info!("Prefill worker found, running in disaggregated mode");
let mut prefill_config = kv_router_config;
prefill_config.router_track_active_blocks = false;
// Create immediately-resolved channel to activate router
let (tx, rx) = tokio::sync::oneshot::channel();
let _ = tx.send(prefill_endpoint);
PrefillRouter::new(
rx,
model_manager.clone(),
RouterMode::KV,
block_size,
Some(prefill_config),
enforce_disagg,
model_name.clone(),
)
}
None if enforce_disagg => {
tracing::error!("Prefill workers required (enforce_disagg=true) but none found");
return Err(QueryRouterResult::ErrDisaggEnforced);
}
None => {
tracing::info!("No prefill workers found, running in aggregated mode");
PrefillRouter::disabled(model_manager.clone(), RouterMode::KV, enforce_disagg)
}
};
Ok((
prefill_router,
decode_router,
model_manager,
namespace_str,
preprocessor,
))
});
match result {
Ok(()) => DynamoLlmResult::OK,
Err(_elapsed) => {
tracing::warn!(
request_id = %request_id,
timeout_secs = GAIE_BOOKKEEPING_TIMEOUT_SECS,
"[GAIE] {} timed out",
operation
);
// Return OK to avoid blocking the caller - the operation may still complete
DynamoLlmResult::OK
Ok((prefill_router, decode_router, model_manager, namespace_str, preprocessor)) => {
let handles = RouterHandles {
prefill_router,
decode_router,
model_manager,
namespace: namespace_str,
runtime, // Store the runtime for reuse
preprocessor,
};
unsafe { *out_handle = Box::into_raw(Box::new(handles)) };
QueryRouterResult::Ok
}
Err(code) => code,
}
}
/// Router bookkeeping functions for GAIE integration
/// Add a request to the router's bookkeeping after worker selection.
/// Call this from GAIE Stage 1 after `dynamo_query_worker_selection_and_annotate`.
///
/// This function computes the overlap_blocks internally by querying the indexer,
/// so the caller doesn't need to provide it.
/// Register the request with the KvRouter's scheduler for tracking active blocks
/// and managing prefill/decode lifecycle. Call this after `query_decode` returns
/// worker IDs and before sending the request to the worker.
///
/// # Safety
/// - `pipeline` must be a valid, non-null pointer from `dynamo_create_worker_selection_pipeline`
/// - `request_id_c_str` must be a valid NUL-terminated UTF-8 C string
/// - `handle` must be a valid RouterHandles handle
/// - `request_id` must be a valid null-terminated C string
/// - `token_ids` must point to at least `token_count` valid u32 values
/// - Must not be called concurrently on the same pipeline without synchronization
#[unsafe(no_mangle)]
pub unsafe extern "C" fn dynamo_router_add_request(
pipeline: *mut WorkerSelectionPipeline,
request_id_c_str: *const c_char,
pub unsafe extern "C" fn add_request(
handle: RouterHandlesPtr,
request_id: *const c_char,
token_ids: *const u32,
token_count: usize,
worker_id: u64,
dp_rank: u32,
) -> DynamoLlmResult {
let (pl, request_id) = match unsafe {
validate_pipeline_and_request_id(pipeline, request_id_c_str, "add_request")
} {
Ok(v) => v,
Err(e) => return e,
};
) -> QueryRouterResult {
if handle.is_null() || request_id.is_null() {
return QueryRouterResult::ErrInvalidParam;
}
let Some(ref kv_router) = pl.kv_router else {
tracing::debug!(
"[GAIE] KV router not available (router_mode is not KV), skipping add_request (no-op)"
);
return DynamoLlmResult::OK;
let handles = unsafe { &*handle };
let request_id_str = match unsafe { CStr::from_ptr(request_id) }.to_str() {
Ok(s) => s.to_owned(),
Err(_) => return QueryRouterResult::ErrInvalidParam,
};
// Log after kv_router check to reduce noise
tracing::debug!(
request_id = %request_id,
worker_id = worker_id,
dp_rank = dp_rank,
token_count = token_count,
"[GAIE] dynamo_router_add_request processing"
);
let tokens: Vec<u32> = if token_count > 0 && !token_ids.is_null() {
unsafe { std::slice::from_raw_parts(token_ids, token_count) }.to_vec()
} else {
Vec::new()
};
let kv_router = kv_router.clone();
let request_id_clone = request_id.clone();
let decode_router = handles.decode_router.clone();
run_bookkeeping_with_timeout(pl, "add_request", &request_id, || async move {
let worker = dynamo_llm::kv_router::protocols::WorkerWithDpRank::new(worker_id, dp_rank);
let result = handles.runtime.secondary().block_on(async {
let timeout_duration = Duration::from_secs(BOOKKEEPING_TIMEOUT_SEC);
// Compute overlap_blocks using the public method
let overlap_blocks = match kv_router.get_overlap_blocks(&tokens, worker).await {
Ok(overlap) => overlap,
Err(e) => {
tracing::warn!(error = ?e, "Failed to compute overlap, using 0");
0
}
};
tokio::time::timeout(timeout_duration, async {
let worker = WorkerWithDpRank::new(worker_id, dp_rank);
kv_router
.add_request(
request_id_clone.clone(),
&tokens,
overlap_blocks,
None,
worker,
None, // lora_name not exposed in C API yet
None, // router_config_override not exposed in C API yet
)
.await;
tracing::debug!(
request_id = %request_id_clone,
worker_id = worker_id,
dp_rank = dp_rank,
overlap_blocks = overlap_blocks,
token_count = tokens.len(),
"[GAIE] dynamo_router_add_request completed - request registered in router bookkeeping"
);
})
// Compute overlap_blocks using the public method
let overlap_blocks = match decode_router.get_overlap_blocks(&tokens, worker).await {
Ok(overlap) => overlap,
Err(e) => {
tracing::warn!(error = ?e, "Failed to compute overlap, using 0");
0
}
};
decode_router
.add_request(
request_id_str.clone(),
&tokens,
overlap_blocks,
None,
worker,
None, // lora_name
None, // router_config_override
)
.await;
tracing::debug!(
request_id = %request_id_str,
worker_id = worker_id,
dp_rank = dp_rank,
overlap_blocks = overlap_blocks,
token_count = tokens.len(),
"add_request completed"
);
})
.await
});
match result {
Ok(()) => QueryRouterResult::Ok,
Err(_elapsed) => {
tracing::warn!(
request_id = %request_id_str,
timeout_secs = BOOKKEEPING_TIMEOUT_SEC,
"add_request timed out"
);
QueryRouterResult::ErrTimeout
}
}
}
/// Mark prefill as completed for a request.
/// Call this from the EPP extension point when the first token is generated.
///
/// Call when the first token is generated to release prefill tokens from decode worker's load
///
/// # Safety
/// - `pipeline` must be a valid, non-null pointer from `dynamo_create_worker_selection_pipeline`
/// - `request_id_c_str` must be a valid NUL-terminated UTF-8 C string
/// - `handle` must be a valid RouterHandles handle
/// - `request_id` must be a valid null-terminated C string
#[unsafe(no_mangle)]
pub unsafe extern "C" fn dynamo_router_mark_prefill_complete(
pipeline: *mut WorkerSelectionPipeline,
request_id_c_str: *const c_char,
) -> DynamoLlmResult {
let (pl, request_id) = match unsafe {
validate_pipeline_and_request_id(pipeline, request_id_c_str, "mark_prefill_complete")
} {
Ok(v) => v,
Err(e) => return e,
};
pub unsafe extern "C" fn mark_prefill_complete(
handle: RouterHandlesPtr,
request_id: *const c_char,
) -> QueryRouterResult {
if handle.is_null() || request_id.is_null() {
return QueryRouterResult::ErrInvalidParam;
}
let Some(ref kv_router) = pl.kv_router else {
tracing::debug!(
"[GAIE] KV router not available (router_mode is not KV), skipping mark_prefill_complete (no-op)"
);
return DynamoLlmResult::OK;
let handles = unsafe { &*handle };
let request_id_str = match unsafe { CStr::from_ptr(request_id) }.to_str() {
Ok(s) => s.to_owned(),
Err(_) => return QueryRouterResult::ErrInvalidParam,
};
// Log after kv_router check to reduce noise
tracing::debug!(
request_id = %request_id,
"[GAIE] dynamo_router_mark_prefill_complete processing"
);
let decode_router = handles.decode_router.clone();
let kv_router = kv_router.clone();
let request_id_clone = request_id.clone();
let result = handles.runtime.secondary().block_on(async {
let timeout_duration = Duration::from_secs(BOOKKEEPING_TIMEOUT_SEC);
tokio::time::timeout(timeout_duration, async {
if let Err(e) = decode_router.mark_prefill_completed(&request_id_str).await {
tracing::warn!(
request_id = %request_id_str,
error = %e,
"Failed to mark prefill complete"
);
} else {
tracing::debug!(
request_id = %request_id_str,
"mark_prefill_complete completed"
);
}
})
.await
});
run_bookkeeping_with_timeout(pl, "mark_prefill_complete", &request_id, || async move {
if let Err(e) = kv_router.mark_prefill_completed(&request_id_clone).await {
match result {
Ok(()) => QueryRouterResult::Ok,
Err(_elapsed) => {
tracing::warn!(
"Failed to mark prefill completed for {}: {}",
request_id_clone,
e
);
} else {
tracing::debug!(
request_id = %request_id_clone,
"[GAIE] dynamo_router_mark_prefill_complete completed - prefill tokens released"
request_id = %request_id_str,
timeout_secs = BOOKKEEPING_TIMEOUT_SEC,
"mark_prefill_complete timed out"
);
QueryRouterResult::ErrTimeout
}
})
}
}
/// Free a request from the router's bookkeeping.
/// Call this from GAIE hook when the stream is closed (completed or cancelled).
///
/// Call this when the stream is closed (completed or cancelled) to release all resources.
///
/// # Safety
/// - `pipeline` must be a valid, non-null pointer from `dynamo_create_worker_selection_pipeline`
/// - `request_id_c_str` must be a valid NUL-terminated UTF-8 C string
/// - `handle` must be a valid RouterHandles handle
/// - `request_id` must be a valid null-terminated C string
#[unsafe(no_mangle)]
pub unsafe extern "C" fn dynamo_router_free_request(
pipeline: *mut WorkerSelectionPipeline,
request_id_c_str: *const c_char,
) -> DynamoLlmResult {
let (pl, request_id) = match unsafe {
validate_pipeline_and_request_id(pipeline, request_id_c_str, "free_request")
} {
Ok(v) => v,
Err(e) => return e,
};
pub unsafe extern "C" fn free_request(
handle: RouterHandlesPtr,
request_id: *const c_char,
) -> QueryRouterResult {
if handle.is_null() || request_id.is_null() {
return QueryRouterResult::ErrInvalidParam;
}
let Some(ref kv_router) = pl.kv_router else {
tracing::debug!(
"[GAIE] KV router not available (router_mode is not KV), skipping free_request (no-op)"
);
return DynamoLlmResult::OK;
let handles = unsafe { &*handle };
let request_id_str = match unsafe { CStr::from_ptr(request_id) }.to_str() {
Ok(s) => s.to_owned(),
Err(_) => return QueryRouterResult::ErrInvalidParam,
};
// Log after kv_router check to reduce noise
tracing::debug!(
request_id = %request_id,
"[GAIE] dynamo_router_free_request processing"
);
let decode_router = handles.decode_router.clone();
let kv_router = kv_router.clone();
let request_id_clone = request_id.clone();
let result = handles.runtime.secondary().block_on(async {
let timeout_duration = Duration::from_secs(BOOKKEEPING_TIMEOUT_SEC);
run_bookkeeping_with_timeout(pl, "free_request", &request_id, || async move {
if let Err(e) = kv_router.free(&request_id_clone).await {
tracing::warn!("Failed to free request {}: {}", request_id_clone, e);
} else {
tracing::debug!(
request_id = %request_id_clone,
"[GAIE] dynamo_router_free_request completed - request removed from bookkeeping"
);
}
})
}
/// Result of worker selection extraction
#[derive(Debug, Clone, Default)]
pub struct WorkerSelectionResult {
/// Decode worker ID (primary worker for aggregated, decode-only for disaggregated)
pub decode_worker_id: Option<i64>,
/// Prefill worker ID (only present in disaggregated mode)
pub prefill_worker_id: Option<i64>,
/// Token IDs from tokenization
pub tokens: Vec<u32>,
}
/// Helper function to extract worker selection information from the annotation stream
///
/// The response format (from disaggregated_params in nvext):
/// - worker_id: {"prefill_worker_id": 123, "decode_worker_id": 456}
/// - token_ids: [1, 2, 3, ...]
pub async fn extract_worker_selection_from_stream(
mut stream: Pin<Box<dyn AsyncEngineStream<Annotated<NvCreateChatCompletionStreamResponse>>>>,
) -> anyhow::Result<WorkerSelectionResult> {
use dynamo_llm::protocols::openai::nvext::WorkerIdInfo;
use futures::StreamExt;
let mut result = WorkerSelectionResult::default();
while let Some(response) = stream.next().await {
// Check for data in nvext (worker_id and token_ids are direct fields)
// nvext is a serde_json::Value, so we access it as a JSON object
if let Some(data) = &response.data
&& let Some(nvext) = &data.nvext
{
// Extract worker_id
if let Some(worker_id_value) = nvext.get("worker_id")
&& let Ok(worker_info) =
serde_json::from_value::<WorkerIdInfo>(worker_id_value.clone())
{
result.decode_worker_id = worker_info.decode_worker_id.map(|id| id as i64);
result.prefill_worker_id = worker_info.prefill_worker_id.map(|id| id as i64);
tracing::debug!(
decode_worker_id = ?result.decode_worker_id,
prefill_worker_id = ?result.prefill_worker_id,
"Parsed worker_id from nvext"
tokio::time::timeout(timeout_duration, async {
if let Err(e) = decode_router.free(&request_id_str).await {
tracing::warn!(
request_id = %request_id_str,
error = %e,
"Failed to free request"
);
}
// Extract token_ids
if let Some(token_ids_value) = nvext.get("token_ids")
&& let Ok(parsed_tokens) =
serde_json::from_value::<Vec<u32>>(token_ids_value.clone())
{
result.tokens = parsed_tokens;
} else {
tracing::debug!(
"Successfully parsed {} tokens from nvext",
result.tokens.len()
request_id = %request_id_str,
"free_request completed"
);
}
})
.await
});
match result {
Ok(()) => QueryRouterResult::Ok,
Err(_elapsed) => {
tracing::warn!(
request_id = %request_id_str,
timeout_secs = BOOKKEEPING_TIMEOUT_SEC,
"free_request timed out"
);
QueryRouterResult::ErrTimeout
}
}
}
tracing::info!(
decode_worker_id = ?result.decode_worker_id,
prefill_worker_id = ?result.prefill_worker_id,
token_count = result.tokens.len(),
"Worker selection extraction complete"
);
Ok(result)
/// Destroy router handles
///
/// # Safety
/// - `handle` must be a valid RouterHandles handle or null
/// - After this call, `handle` must not be used
#[unsafe(no_mangle)]
pub unsafe extern "C" fn destroy(handle: RouterHandlesPtr) {
if !handle.is_null() {
drop(unsafe { Box::from_raw(handle) });
}
}
/// Utility function to add the "query_instance_id" annotation to an OpenAI request
/// Route a chat completion request in a single call.
///
/// This function modifies the request to include the annotation that signals the KV router
/// to return worker selection information (worker_fid and token_data) instead of
/// performing actual inference.
/// This is the main function for EPP to route a `/v1/chat/completions` request.
/// It combines tokenization and worker selection in one call:
/// 1. Applies the chat template to the request JSON
/// 2. Tokenizes the formatted prompt
/// 3. Queries the prefill router (if disaggregated mode)
/// 4. Queries the decode router
/// 5. Returns worker IDs and token_ids
///
/// # Parameters
/// - `request`: Mutable reference to the OpenAI chat completion request
/// After this call, EPP should:
/// - Call `add_request()` to register the request for bookkeeping
/// - Set worker ID headers and forward to backend
/// - Call `mark_prefill_complete()` on first token
/// - Call `free_request()` when the stream ends
/// - Call `free_routing_result()` to free the result
///
/// # Returns
/// The same request with the "query_instance_id" annotation added
pub fn add_query_instance_id(
request: &mut NvCreateChatCompletionRequest,
) -> &mut NvCreateChatCompletionRequest {
// Send empty value - router treats empty as aggregated / aggregated worker selection
set_kv_annotation(request, "query_instance_id".to_string(), "")
}
/// # Safety
/// - `handle` must be a valid RouterHandles handle
/// - `request_json` must be a valid null-terminated C string containing JSON
/// - `out_result` must be a valid pointer
#[unsafe(no_mangle)]
pub unsafe extern "C" fn route_request(
handle: RouterHandlesPtr,
request_json: *const c_char,
out_result: *mut CRoutingResult,
) -> QueryRouterResult {
if handle.is_null() || request_json.is_null() || out_result.is_null() {
return QueryRouterResult::ErrInvalidParam;
}
// Note: set_worker_ids_for_stage2 and set_token_data_for_stage2 have been removed.
// The EPP now handles routing configuration via HTTP headers:
// - `x-worker-instance-id`: decode worker ID
// - `x-prefill-instance-id`: prefill worker ID (disaggregated mode only)
// - `x-enable-local-updates`: set to "false" to disable router bookkeeping
//
// Body modifications are NOT sent to the inference engine (only headers are forwarded),
// so these functions were ineffective.
/// Ensure `nvext` exists and return a mutable slice of annotations.
fn ensure_annotations(request: &mut NvCreateChatCompletionRequest) -> &mut Vec<String> {
let nvext = request.nvext.get_or_insert_with(|| {
NvExt::builder()
.build()
.expect("NvExt builder should not fail")
let handles = unsafe { &*handle };
// Get preprocessor
let preprocessor = match &handles.preprocessor {
Some(p) => p,
None => {
tracing::error!("Preprocessor not available");
return QueryRouterResult::ErrInitFailed;
}
};
let json_str = match unsafe { CStr::from_ptr(request_json) }.to_str() {
Ok(s) => s,
Err(_) => return QueryRouterResult::ErrInvalidParam,
};
// Parse JSON
let request: dynamo_llm::types::openai::chat_completions::NvCreateChatCompletionRequest =
match serde_json::from_str(json_str) {
Ok(req) => req,
Err(e) => {
tracing::error!(error = ?e, "Failed to parse request JSON");
return QueryRouterResult::ErrInvalidParam;
}
};
// Apply chat template
let formatted_prompt = match preprocessor.apply_template(&request) {
Ok(Some(prompt)) => prompt,
Ok(None) => String::new(),
Err(e) => {
tracing::error!(error = ?e, "Failed to apply chat template");
return QueryRouterResult::ErrQueryFailed;
}
};
// Tokenize
let encoding = match preprocessor.tokenize(&formatted_prompt) {
Ok(enc) => enc,
Err(e) => {
tracing::error!(error = ?e, "Failed to tokenize");
return QueryRouterResult::ErrQueryFailed;
}
};
let tokens = encoding.token_ids();
let token_count = tokens.len();
let is_disaggregated = handles.prefill_router.is_activated();
// Query workers
let result = handles.runtime.secondary().block_on(async {
let prefill_worker_id = if is_disaggregated {
handles
.query_prefill_worker(tokens, false, None, 0.0)
.await?
} else {
0
};
let (decode_worker, _overlap_blocks) = handles
.query_decode_worker(tokens, is_disaggregated)
.await?;
tracing::info!(
is_disaggregated = is_disaggregated,
prefill_worker_id = prefill_worker_id,
decode_worker_id = decode_worker.worker_id,
decode_dp_rank = decode_worker.dp_rank,
token_count = token_count,
"Routed chat request"
);
Ok((prefill_worker_id, decode_worker))
});
nvext.annotations.get_or_insert_with(Vec::new)
}
/// Set a `key:value` annotation.
fn set_kv_annotation(
request: &mut NvCreateChatCompletionRequest,
key: String, // <- owned, only one borrowed param remains
value: impl Into<String>,
) -> &mut NvCreateChatCompletionRequest {
let prefix = format!("{}:", key);
let kv = format!("{}{}", prefix, value.into());
let annotations = ensure_annotations(request);
annotations.retain(|a| !a.starts_with(&prefix));
annotations.push(kv);
request
match result {
Ok((prefill_worker_id, decode_worker)) => {
// Allocate and copy token IDs for caller (needed for add_request bookkeeping)
let token_vec: Vec<u32> = tokens.to_vec();
let mut tokens_boxed = token_vec.into_boxed_slice();
let token_ptr = tokens_boxed.as_mut_ptr();
std::mem::forget(tokens_boxed);
unsafe {
*out_result = CRoutingResult {
is_disaggregated,
prefill_worker_id,
decode_worker_id: decode_worker.worker_id,
token_ids: token_ptr,
token_count,
};
}
QueryRouterResult::Ok
}
Err(code) => code,
}
}
/// Wrapper function that queries worker selection for GAIE Stage 1
///
/// This function performs the complete GAIE Stage 1 flow:
/// 1. Clones the original request and adds "query_instance_id:" (empty) annotation
/// 2. Calls engine.generate() with the modified request
/// 3. Extracts worker_id info and tokens from the response stream
/// 4. Returns WorkerSelectionResult and the original request
///
/// Note: The EPP (caller) is responsible for setting HTTP headers for Stage 2:
/// - `x-worker-instance-id`: decode worker ID
/// - `x-prefill-instance-id`: prefill worker ID (disaggregated mode only)
/// - `x-enable-local-updates`: "false" to disable router bookkeeping
///
/// Body modifications are NOT forwarded to the inference engine, so this function
/// does not modify the request body.
///
/// # Parameters
/// - `engine`: The worker selection pipeline engine
/// - `original_request`: The original OpenAI request to process
/// Free a routing result.
///
/// # Returns
/// A tuple containing (WorkerSelectionResult, original_request)
pub async fn query_worker_selection_and_annotate(
engine: &ServiceEngine<
SingleIn<NvCreateChatCompletionRequest>,
ManyOut<Annotated<NvCreateChatCompletionStreamResponse>>,
>,
original_request: NvCreateChatCompletionRequest,
) -> anyhow::Result<(WorkerSelectionResult, NvCreateChatCompletionRequest)> {
// GAIE Stage 1: Query for worker selection
let mut query_request = original_request.clone();
add_query_instance_id(&mut query_request);
let single_in = SingleIn::new(query_request);
let response_stream = engine.generate(single_in).await?;
let result = extract_worker_selection_from_stream(response_stream).await?;
// Return the original request unchanged.
// The EPP sets routing headers (worker IDs, enable_local_updates) which the
// Dynamo frontend reads via apply_header_routing_overrides().
Ok((result, original_request))
/// # Safety
/// - `result` must be a valid pointer to a CRoutingResult previously returned by route functions
#[unsafe(no_mangle)]
pub unsafe extern "C" fn free_routing_result(result: *mut CRoutingResult) {
if result.is_null() {
return;
}
let res = unsafe { &mut *result };
// Free token IDs
if !res.token_ids.is_null() && res.token_count > 0 {
drop(unsafe {
Box::from_raw(std::slice::from_raw_parts_mut(
res.token_ids,
res.token_count,
))
});
res.token_ids = ptr::null_mut();
res.token_count = 0;
}
}
/// Spawn a background task to watch for prefill models and activate prefill routers.
/// This is a lightweight watcher that only handles prefill model discovery.
fn spawn_prefill_watcher(
drt: DistributedRuntime,
model_manager: Arc<ModelManager>,
target_namespace: String,
) {
/// Fetch model card via discovery and create preprocessor.
///
/// This function:
/// 1. Lists all models via discovery
/// 2. Finds the first model in the target namespace (decode workers only)
/// 3. Downloads the model config (tokenizer files) if needed
/// 4. Creates an OpenAIPreprocessor from the model card
/// 5. Returns the preprocessor, the kv_cache_block_size, and model_name from the model card
async fn fetch_preprocessor_from_discovery(
drt: &DistributedRuntime,
target_namespace: &str,
) -> anyhow::Result<(Arc<OpenAIPreprocessor>, u32, String)> {
use dynamo_llm::model_card::ModelDeploymentCard;
use dynamo_runtime::discovery::{DiscoveryEvent, DiscoveryInstance, DiscoveryQuery};
use dynamo_runtime::protocols::EndpointId;
use futures::StreamExt;
tokio::spawn(async move {
let discovery = drt.discovery();
let mut stream = match discovery
.list_and_watch(DiscoveryQuery::AllModels, None)
.await
{
Ok(s) => s,
Err(e) => {
tracing::error!(error = %e, "Failed to start prefill discovery stream");
return;
}
};
use dynamo_runtime::discovery::DiscoveryInstance;
while let Some(result) = stream.next().await {
let event = match result {
Ok(e) => e,
Err(e) => {
tracing::error!(error = %e, "Error in prefill discovery stream");
continue;
}
};
let discovery = drt.discovery();
match event {
DiscoveryEvent::Added(instance) => {
let (endpoint_id, card) = match &instance {
DiscoveryInstance::Model {
namespace,
component,
endpoint,
..
} => {
// Filter by namespace
if namespace != &target_namespace {
continue;
}
let eid = EndpointId {
namespace: namespace.clone(),
component: component.clone(),
name: endpoint.clone(),
};
match instance.deserialize_model::<ModelDeploymentCard>() {
Ok(card) => (eid, card),
Err(_) => continue,
}
}
_ => continue,
};
// Only handle prefill models
if !card.model_type.supports_prefill() {
continue;
}
// List all models
let instances = discovery.list(DiscoveryQuery::AllModels).await?;
tracing::info!(
model_name = card.name(),
"Prefill model discovered, activating prefill router"
);
// Find first model card in the target namespace (decode workers only)
let mut model_card: Option<ModelDeploymentCard> = None;
// Get the endpoint and activate the prefill router
if let Ok(ns) = drt.namespace(&endpoint_id.namespace)
&& let Ok(comp) = ns.component(&endpoint_id.component)
for instance in instances {
if let DiscoveryInstance::Model { namespace, .. } = &instance {
// Filter by namespace
if namespace != target_namespace {
continue;
}
match instance.deserialize_model::<ModelDeploymentCard>() {
Ok(card) => {
// Skip prefill-only workers, we want decode workers for routing
if card.model_type.supports_prefill()
&& !card.model_type.supports_chat()
&& !card.model_type.supports_completions()
{
let endpoint = comp.endpoint(&endpoint_id.name);
if let Err(e) = model_manager.activate_prefill_router(card.name(), endpoint)
{
tracing::warn!(
model_name = card.name(),
error = %e,
"Failed to activate prefill router"
);
} else {
tracing::info!(
model_name = card.name(),
"Prefill router activated successfully"
);
}
continue;
}
model_card = Some(card);
break;
}
DiscoveryEvent::Removed(id) => {
// Log removal for observability
// Note: The PrefillRouter remains active - worker availability
// is handled dynamically by the underlying Client's instance tracking
tracing::debug!(
instance_id = id.instance_id(),
"Prefill worker instance removed from discovery"
);
Err(e) => {
tracing::debug!(error = %e, "Failed to deserialize model card, skipping");
continue;
}
}
}
});
}
/// Create a worker selection pipeline for OpenAI Chat Completion requests
///
/// This is a concrete implementation that works specifically with NvCreateChatCompletionRequest
/// and is designed for use with C bindings. Uses the "generate" endpoint by default.
///
/// # Parameters
/// - `namespace`: namespace name
/// - `component_name`: component name
/// - `model_name`: Name/slug of the model to load
/// - `router_mode`: How to route requests (KV, RoundRobin, etc.)
/// - `busy_threshold`: Optional threshold for busy worker detection
/// - `kv_router_config`: Optional KV router configuration (only used when router_mode is KV)
/// - `enforce_disagg`: If true, fail requests when disaggregated serving is unavailable
///
/// # Returns
/// A tuple of (engine, kv_router) where kv_router is Some when router_mode is KV
pub async fn create_worker_selection_pipeline_chat(
namespace: &str,
component_name: &str,
model_name: &str,
router_mode: RouterMode,
busy_threshold: Option<f64>,
kv_router_config: Option<KvRouterConfig>,
enforce_disagg: bool,
) -> anyhow::Result<(
ServiceEngine<
SingleIn<NvCreateChatCompletionRequest>,
ManyOut<Annotated<NvCreateChatCompletionStreamResponse>>,
>,
Option<Arc<dynamo_llm::kv_router::KvRouter>>,
)> {
use dynamo_llm::discovery::WORKER_TYPE_DECODE;
use dynamo_llm::kv_router::PrefillRouter;
// Use the global DRT singleton - initialize if not already done
// Check if already initialized (by dynamo_llm_init) to avoid redundant sync wait
let needs_sync = DRT.get().is_none();
let distributed_runtime = DRT
.get_or_try_init(async {
tracing::debug!("Initializing DistributedRuntime singleton (standalone mode)");
DistributedRuntime::from_settings(Runtime::from_settings()?).await
})
.await
.map_err(|e| anyhow::anyhow!("Failed to initialize DistributedRuntime: {}", e))?;
// Only wait for discovery sync if we just initialized the DRT
// (dynamo_llm_init already does this when it initializes)
// Note: This waits indefinitely - the K8s StartupProbe is the timeout mechanism.
if needs_sync {
wait_for_discovery_sync(distributed_runtime).await;
}
let component = distributed_runtime
.namespace(namespace)?
.component(component_name)?;
let endpoint = component.endpoint(GENERATE_ENDPOINT);
let client = endpoint.client().await?;
// Discover the model card by searching all instances with this model name
tracing::debug!("Looking for model: {}", model_name);
tracing::debug!("Namespace: {}", namespace);
let model_manager = Arc::new(ModelManager::new());
let router_config = dynamo_llm::entrypoint::RouterConfig {
router_mode,
kv_router_config: kv_router_config.unwrap_or_default(),
load_threshold_config: dynamo_llm::discovery::LoadThresholdConfig {
active_decode_blocks_threshold: busy_threshold,
active_prefill_tokens_threshold: None,
active_prefill_tokens_threshold_frac: None,
},
enforce_disagg,
};
// Create metrics for migration tracking (not exposed via /metrics in C bindings)
let metrics = Arc::new(Metrics::new());
let watcher = ModelWatcher::new(
component.drt().clone(),
model_manager.clone(),
router_config,
0, // migration_limit - default to 0 for C bindings
None,
metrics.clone(),
let mut card = model_card.ok_or_else(|| {
anyhow::anyhow!(
"No model found in namespace '{}' via discovery",
target_namespace
)
})?;
let kv_cache_block_size = card.kv_cache_block_size;
let model_name = card.name().to_string();
tracing::info!(
model_name = model_name,
kv_cache_block_size = kv_cache_block_size,
"Found model card via discovery"
);
let cards = watcher
.cards_for_model(model_name, Some(namespace), false)
.await
.with_context(|| format!("Failed to discover model: {}", model_name))?;
tracing::debug!("Found {} cards for model {}", cards.len(), model_name);
// Download config (tokenizer files) if not local
card.download_config().await?;
let card = cards.into_iter().next().ok_or_else(|| {
tracing::error!("No ModelDeploymentCard found for model: {}", model_name);
anyhow::anyhow!("ModelDeploymentCard not found for model: {}", model_name)
})?;
// Create preprocessor
let preprocessor = OpenAIPreprocessor::new(card)?;
Ok((preprocessor, kv_cache_block_size, model_name))
}
let chooser = if router_mode == RouterMode::KV {
Some(
model_manager
.kv_chooser_for(
&endpoint,
card.kv_cache_block_size,
kv_router_config,
WORKER_TYPE_DECODE,
)
.await?,
)
} else {
None
/// Find a prefill endpoint from already-discovered instances (one-time filter).
/// Returns the endpoint if a prefill worker is found in the target namespace.
async fn find_prefill_endpoint(
drt: &DistributedRuntime,
target_namespace: &str,
) -> Option<dynamo_runtime::component::Endpoint> {
use dynamo_llm::model_card::ModelDeploymentCard;
use dynamo_runtime::discovery::DiscoveryInstance;
let discovery = drt.discovery();
let instances = match discovery.list(DiscoveryQuery::AllModels).await {
Ok(instances) => instances,
Err(e) => {
tracing::warn!(error = %e, "Failed to list instances for prefill discovery");
return None;
}
};
// Create prefill chooser for dynamic disaggregation support
// This registers the model and returns a receiver that will be activated
// when a prefill worker is discovered
let prefill_chooser = model_manager
.register_prefill_router(model_name.to_string())
.map(|rx| {
// Create prefill-specific config with track_active_blocks disabled
let mut prefill_config = kv_router_config.unwrap_or_default();
prefill_config.router_track_active_blocks = false;
PrefillRouter::new(
rx,
model_manager.clone(),
router_mode,
card.kv_cache_block_size,
Some(prefill_config),
enforce_disagg,
model_name.to_string(),
)
});
for instance in instances {
if let DiscoveryInstance::Model {
namespace,
component,
endpoint,
..
} = &instance
{
// Filter by namespace
if namespace != target_namespace {
continue;
}
// Start background watcher for prefill model discovery
// This will activate the prefill router when prefill workers join
spawn_prefill_watcher(
component.drt().clone(),
model_manager.clone(),
namespace.to_string(),
);
let card = match instance.deserialize_model::<ModelDeploymentCard>() {
Ok(card) => card,
Err(_) => continue,
};
// Download model config files from HuggingFace for EPP
// The backend's card has NATS URLs which aren't accessible from EPP
tracing::debug!(
"Downloading model config files for EPP: {}",
card.display_name
);
// Only handle prefill models
if !card.model_type.supports_prefill() {
continue;
}
let local_path = dynamo_llm::hub::from_hf(&card.display_name, true)
.await
.with_context(|| {
format!(
"Failed to download model config files for: {}",
card.display_name
)
})?;
// Load a fresh card from local files, then copy runtime config from original card
tracing::debug!("Loading ModelDeploymentCard from local path...");
let mut card_with_local_files = ModelDeploymentCard::load_from_disk(&local_path, None)
.with_context(|| format!("Failed to load card from disk: {:?}", local_path))?;
// Copy runtime settings from the backend's card
tracing::debug!("Copying runtime config from backend card...");
card_with_local_files.runtime_config = card.runtime_config.clone();
card_with_local_files.kv_cache_block_size = card.kv_cache_block_size;
card_with_local_files.context_length = card.context_length;
// Load the tokenizer from the downloaded files
tracing::debug!("Loading tokenizer from local files...");
let hf_tokenizer = card_with_local_files
.tokenizer_hf()
.with_context(|| format!("Failed to load tokenizer for: {}", card.display_name))?;
// Create worker monitor if busy_threshold is set
// Note: C bindings don't register with ModelManager, so HTTP endpoint won't see this
let worker_monitor = busy_threshold.map(|t| {
KvWorkerMonitor::new(
client.clone(),
dynamo_llm::discovery::LoadThresholdConfig {
active_decode_blocks_threshold: Some(t),
active_prefill_tokens_threshold: None,
active_prefill_tokens_threshold_frac: None,
},
)
});
tracing::info!(
model_name = card.name(),
"Prefill worker found in discovered instances"
);
// Build and return the endpoint
if let Ok(ns) = drt.namespace(namespace)
&& let Ok(comp) = ns.component(component)
{
return Some(comp.endpoint(endpoint));
}
}
}
// Clone chooser before passing to build_routed_pipeline (which takes ownership)
let kv_router = chooser.clone();
let engine = build_routed_pipeline::<
NvCreateChatCompletionRequest,
NvCreateChatCompletionStreamResponse,
>(
&card_with_local_files,
&client,
model_manager.clone(),
router_mode,
worker_monitor,
chooser,
hf_tokenizer,
prefill_chooser,
enforce_disagg,
0, // migration_limit - default to 0 for C bindings
metrics,
)
.await?;
Ok((engine, kv_router))
None
}
......@@ -1635,6 +1635,7 @@ dependencies = [
"ahash",
"aho-corasick",
"akin",
"aligned-vec",
"anyhow",
"async-nats",
"async-stream",
......@@ -1651,6 +1652,7 @@ dependencies = [
"bytes",
"candle-core",
"chrono",
"cudarc",
"dashmap 5.5.3",
"derive-getters",
"derive_builder",
......@@ -1685,6 +1687,8 @@ dependencies = [
"ndarray",
"ndarray-interp",
"ndarray-npy",
"nix 0.26.4",
"nixl-sys",
"object_store",
"offset-allocator",
"oneshot",
......@@ -3962,6 +3966,15 @@ version = "0.3.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "38d1115007560874e373613744c6fba374c17688327a71c1476d1a5954cc857b"
[[package]]
name = "memoffset"
version = "0.7.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5de893c32cde5f383baa4c04c5d6dbdd735cfd4a794b0debdb2bb1b421da5ff4"
dependencies = [
"autocfg",
]
[[package]]
name = "memoffset"
version = "0.9.1"
......@@ -4255,6 +4268,19 @@ dependencies = [
"thiserror 1.0.69",
]
[[package]]
name = "nix"
version = "0.26.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "598beaf3cc6fdd9a5dfb1630c2800c7acd31df7aaf0f565796fba2b53ca1af1b"
dependencies = [
"bitflags 1.3.2",
"cfg-if 1.0.4",
"libc",
"memoffset 0.7.1",
"pin-utils",
]
[[package]]
name = "nix"
version = "0.29.0"
......@@ -5501,7 +5527,7 @@ dependencies = [
"cfg-if 1.0.4",
"indoc",
"libc",
"memoffset",
"memoffset 0.9.1",
"once_cell",
"portable-atomic",
"pyo3-build-config",
......
......@@ -46,6 +46,9 @@ pub enum RouterMode {
RoundRobin,
Random,
KV,
/// Direct routing - reads worker ID from each request's routing hints.
/// Used when an external orchestrator (e.g., EPP) handles worker selection.
Direct,
}
impl From<RouterMode> for RsRouterMode {
......@@ -54,6 +57,7 @@ impl From<RouterMode> for RsRouterMode {
RouterMode::RoundRobin => Self::RoundRobin,
RouterMode::Random => Self::Random,
RouterMode::KV => Self::KV,
RouterMode::Direct => Self::Direct,
}
}
}
......
......@@ -950,6 +950,7 @@ class RouterMode:
RoundRobin: "RouterMode"
Random: "RouterMode"
KV: "RouterMode"
Direct: "RouterMode"
...
class RouterConfig:
......@@ -968,7 +969,7 @@ class RouterConfig:
Create a RouterConfig.
Args:
mode: The router mode (RoundRobin, Random, or KV)
mode: The router mode (RoundRobin, Random, KV, or Direct)
config: Optional KV router configuration (used when mode is KV)
active_decode_blocks_threshold: Threshold percentage (0.0-1.0) for decode blocks busy detection
active_prefill_tokens_threshold: Literal token count threshold for prefill busy detection
......
......@@ -9,7 +9,7 @@ use crate::{
engines::StreamingEngineAdapter,
entrypoint::{EngineConfig, RouterConfig},
http::service::metrics::Metrics,
kv_router::{KvPushRouter, KvRouter, PrefillRouter},
kv_router::{DirectRoutingRouter, KvPushRouter, KvRouter, PrefillRouter},
migration::Migration,
model_card::ModelDeploymentCard,
preprocessor::{OpenAIPreprocessor, prompt::PromptFormatter},
......@@ -274,10 +274,10 @@ where
.await?;
let service_backend = match router_mode {
RouterMode::Random | RouterMode::RoundRobin | RouterMode::Direct(_) => {
// Non-KV routing: use PushRouter directly.
// Note: Per-worker metrics (active_prefill_tokens, active_decode_blocks) are only
// available in KV routing mode where the router has actual bookkeeping.
RouterMode::Direct => {
ServiceBackend::from_engine(Arc::new(DirectRoutingRouter::new(router)))
}
RouterMode::Random | RouterMode::RoundRobin => {
ServiceBackend::from_engine(Arc::new(router))
}
RouterMode::KV => {
......
......@@ -41,7 +41,7 @@ pub mod worker_query;
pub use config::{KvRouterConfig, RouterConfigOverride};
pub use prefill_router::PrefillRouter;
pub use push_router::KvPushRouter;
pub use push_router::{DirectRoutingRouter, KvPushRouter};
use crate::{
discovery::RuntimeConfigWatch,
......
......@@ -81,14 +81,6 @@ impl InnerPrefillRouter {
InnerPrefillRouter::KvRouter(_) => None,
}
}
/// Peek next worker without incrementing state (for non-KV modes only)
fn peek_next_worker(&self) -> Option<u64> {
match self {
InnerPrefillRouter::SimpleRouter(router) => router.peek_next_worker(),
InnerPrefillRouter::KvRouter(_) => None,
}
}
}
/// PrefillRouter is a forward-only operator that sits between Migration and the decode router.
......@@ -273,7 +265,7 @@ impl PrefillRouter {
preselected_worker: Option<u64>,
) -> Option<(u64, u32, BootstrapInfo)> {
let endpoint_id = self.endpoint_id.get()?;
let prefill_router = self.prefill_router.get()?;
let _prefill_router = self.prefill_router.get()?;
// Worker selection
let (worker_id, dp_rank) = if let Some(id) = preselected_worker {
......@@ -284,12 +276,8 @@ impl PrefillRouter {
"Using pre-selected prefill worker for bootstrap"
);
(id, dp_rank)
} else if self.router_mode.is_kv_routing() {
// KV mode: use find_best_match
let kv_router = match prefill_router {
InnerPrefillRouter::KvRouter(r) => r,
_ => return None,
};
} else {
// Use shared worker selection logic (update_states=false for peek behavior)
// Extract LORA name and priority jump from routing hints
let lora_name = req.routing.as_ref().and_then(|r| r.lora_name.clone());
let priority_jump = req
......@@ -297,24 +285,14 @@ impl PrefillRouter {
.as_ref()
.and_then(|r| r.priority_jump)
.unwrap_or(0.0);
match async {
kv_router
.chooser
.find_best_match(None, &req.token_ids, None, false, lora_name, priority_jump)
.await
}
.instrument(tracing::info_span!("kv_find_best_match"))
.await
match self
.query_prefill_worker(&req.token_ids, false, lora_name, priority_jump)
.instrument(tracing::info_span!("query_prefill_worker"))
.await
{
Ok((worker, _overlap)) => (worker.worker_id, worker.dp_rank),
Ok((worker_id, dp_rank)) => (worker_id, dp_rank),
Err(_) => return None,
}
} else {
// Non-KV mode: use PushRouter's stateful selection
// We use peek_next_worker instead of select_next_worker to avoid double-incrementing the counter
// if we fall back to the original path.
let worker_id = prefill_router.peek_next_worker()?;
(worker_id, 0)
};
// Get bootstrap info from ModelManager (works for ANY mode)
......@@ -489,6 +467,55 @@ impl PrefillRouter {
// No phase permit needed - we wait for completion before changing phase
Self::execute_prefill(self.prefill_router.get().cloned(), request, None, None).await
}
/// Query the best prefill worker without executing a request.
/// Returns (worker_id, dp_rank).
///
/// This is the shared worker selection logic used by both `build_bootstrap_info`
/// and `query_route`.
pub async fn query_prefill_worker(
&self,
token_ids: &[u32],
update_states: bool,
lora_name: Option<String>,
priority_jump: f64,
) -> Result<(u64, u32)> {
let prefill_router = self
.prefill_router
.get()
.ok_or_else(|| anyhow::anyhow!(PrefillError::NotActivated))?;
match prefill_router {
InnerPrefillRouter::KvRouter(r) => {
let (worker, _overlap) = r
.chooser
.find_best_match(
None,
token_ids,
None,
update_states,
lora_name,
priority_jump,
)
.await?;
Ok((worker.worker_id, worker.dp_rank))
}
InnerPrefillRouter::SimpleRouter(r) => {
let worker_id = if update_states {
r.select_next_worker()
} else {
r.peek_next_worker()
}
.ok_or_else(|| anyhow::anyhow!("No workers available for prefill"))?;
Ok((worker_id, 0))
}
}
}
/// Check if disaggregated mode is currently active (prefill router activated)
pub fn is_activated(&self) -> bool {
self.prefill_router.get().is_some()
}
}
impl Drop for PrefillRouter {
......
......@@ -48,7 +48,6 @@ struct WorkerSelection {
struct RequestGuard {
chooser: Arc<KvRouter>,
context_id: String,
handle_local_updates: bool,
tracker: Option<Arc<RequestTracker>>,
request_metrics: Arc<RouterRequestMetrics>,
cumulative_osl: usize,
......@@ -59,9 +58,7 @@ struct RequestGuard {
impl RequestGuard {
async fn finish(&mut self) {
self.record_metrics();
if self.handle_local_updates
&& let Err(e) = self.chooser.free(&self.context_id).await
{
if let Err(e) = self.chooser.free(&self.context_id).await {
tracing::warn!("Failed to free request {}: {e}", self.context_id);
}
self.freed = true;
......@@ -86,7 +83,7 @@ impl RequestGuard {
impl Drop for RequestGuard {
fn drop(&mut self) {
self.record_metrics();
if !self.freed && self.handle_local_updates {
if !self.freed {
let chooser = self.chooser.clone();
let context_id = self.context_id.clone();
let Ok(handle) = tokio::runtime::Handle::try_current() else {
......@@ -112,15 +109,13 @@ impl KvPushRouter {
/// Select a worker for the request, either using a preselected worker or finding the best match.
///
/// When `is_query_only` is false and `handle_local_updates` is true, this also registers
/// the request with the scheduler via `add_request`.
/// When `is_query_only` is false, this also registers the request with the scheduler via `add_request`.
async fn select_worker(
&self,
context_id: &str,
request: &PreprocessedRequest,
phase: RequestPhase,
is_query_only: bool,
handle_local_updates: bool,
) -> Result<WorkerSelection, Error> {
let routing = request.routing.as_ref();
let lora_name = routing.and_then(|r| r.lora_name.clone());
......@@ -172,7 +167,7 @@ impl KvPushRouter {
.get_overlap_blocks(&request.token_ids, worker)
.await?;
if !is_query_only && handle_local_updates {
if !is_query_only {
self.chooser
.add_request(
context_id.to_string(),
......@@ -234,15 +229,6 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu
// Simple query-only detection: presence of query_instance_id annotation means query-only mode
let is_query_only = request.get_annotation_value("query_instance_id").is_some();
// Determine if this router should handle local state updates (add_request, free, etc.)
// Default is true (router handles bookkeeping). Set to false for GAIE Stage 2 where
// an external orchestrator (e.g., EPP sidecar) handles bookkeeping via C FFI.
let handle_local_updates = request
.routing
.as_ref()
.and_then(|r| r.enable_local_updates)
.unwrap_or(true);
// Get phase from tracker (defaults to Aggregated if no tracker or phase not set)
let phase = request
.tracker
......@@ -252,13 +238,7 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu
let block_size = self.chooser.block_size() as usize;
let selection = self
.select_worker(
&context_id,
&request,
phase,
is_query_only,
handle_local_updates,
)
.select_worker(&context_id, &request, phase, is_query_only)
.await?;
let WorkerSelection {
instance_id,
......@@ -335,8 +315,7 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu
.routing
.as_ref()
.and_then(|r| r.expected_output_tokens);
let track_output_blocks =
self.chooser.kv_router_config().router_track_output_blocks && handle_local_updates;
let track_output_blocks = self.chooser.kv_router_config().router_track_output_blocks;
let tracker = request.tracker.clone();
let (mut backend_input, context) = request.into_parts();
......@@ -360,7 +339,6 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu
let mut guard = RequestGuard {
chooser: chooser.clone(),
context_id: context_id.clone(),
handle_local_updates,
tracker: tracker.clone(),
request_metrics: request_metrics.clone(),
cumulative_osl: 0,
......@@ -385,7 +363,9 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu
break;
};
if handle_local_updates && !prefill_marked {
if !prefill_marked {
// Only mark prefill completed when we receive actual tokens,
// not empty bootstrap info (token_ids: []) from disaggregated prefill
let has_tokens = item.data.as_ref()
.map(|d| !d.token_ids.is_empty())
.unwrap_or(false);
......@@ -451,3 +431,48 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu
Ok(ResponseStream::new(wrapped_stream, stream_context))
}
}
/// A direct routing wrapper for `RouterMode::Direct`.
///
/// This wraps a `PushRouter` and reads worker IDs from each request's routing hints,
/// then routes directly to the specified worker. Used when an external router
/// (e.g., EPP) handles worker selection.
pub struct DirectRoutingRouter {
inner: PushRouter<PreprocessedRequest, Annotated<LLMEngineOutput>>,
}
impl DirectRoutingRouter {
pub fn new(inner: PushRouter<PreprocessedRequest, Annotated<LLMEngineOutput>>) -> Self {
DirectRoutingRouter { inner }
}
/// Extract worker ID from request routing hints.
/// Returns an error if no worker ID is found (required in direct routing mode).
fn get_worker_id(request: &PreprocessedRequest) -> Result<u64, Error> {
let routing = request.routing.as_ref();
let worker_id = routing.and_then(|r| r.decode_worker_id.or(r.backend_instance_id));
worker_id.ok_or_else(|| {
anyhow::anyhow!(
"Worker ID required (--direct-route) but none found in request. \
Expected decode_worker_id or backend_instance_id to be set by external router (e.g., EPP)."
)
})
}
}
#[async_trait]
impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutput>>, Error>
for DirectRoutingRouter
{
async fn generate(
&self,
request: SingleIn<PreprocessedRequest>,
) -> Result<ManyOut<Annotated<LLMEngineOutput>>, Error> {
let worker_id = Self::get_worker_id(&request)?;
tracing::debug!(worker_id = worker_id, "Direct routing to specified worker");
self.inner.direct(request, worker_id).await
}
}
......@@ -280,7 +280,6 @@ impl OpenAIPreprocessor {
prefill_worker_id: nvext.prefill_worker_id,
decode_worker_id: nvext.decode_worker_id,
dp_rank: None, // dp_rank is set later in the pipeline
enable_local_updates: nvext.enable_local_updates,
expected_output_tokens: hints.and_then(|h| h.osl),
priority_jump: hints.and_then(|h| h.latency_sensitivity),
lora_name,
......
......@@ -34,14 +34,6 @@ pub struct RoutingHints {
#[serde(default, skip_serializing_if = "Option::is_none")]
pub dp_rank: Option<u32>,
/// Controls whether the router should manage local bookkeeping (add_request,
/// mark_prefill_completed, free) for this request.
///
/// - `None` or `Some(true)`: Router handles bookkeeping locally (default behavior)
/// - `Some(false)`: External caller (e.g., GAIE sidecar) handles bookkeeping via C FFI
#[serde(default, skip_serializing_if = "Option::is_none")]
pub enable_local_updates: Option<bool>,
/// Expected number of output tokens for this request.
/// Used as a hint for routing decisions to estimate resource requirements.
#[serde(default, skip_serializing_if = "Option::is_none")]
......
......@@ -11,16 +11,12 @@ pub use crate::protocols::common::timing::TimingInfo;
pub const HEADER_WORKER_INSTANCE_ID: &str = "x-worker-instance-id";
pub const HEADER_PREFILL_INSTANCE_ID: &str = "x-prefill-instance-id";
/// Header to disable local bookkeeping updates (for GAIE Stage 2)
/// When set to "false", the router skips add_request, mark_prefill_completed, and free calls.
pub const HEADER_ENABLE_LOCAL_UPDATES: &str = "x-enable-local-updates";
/// Apply routing overrides from HTTP headers to nvext.
///
/// Header mappings:
/// - `x-worker-instance-id` -> `backend_instance_id` and `decode_worker_id`
/// - `x-prefill-instance-id` -> `prefill_worker_id`
/// - `x-enable-local-updates` -> `enable_local_updates` (set to false to disable router bookkeeping)
///
/// Headers take priority over existing nvext values when present.
/// If no headers are present, returns the original nvext unchanged.
......@@ -35,17 +31,7 @@ pub fn apply_header_routing_overrides(nvext: Option<NvExt>, headers: &HeaderMap)
.and_then(|v| v.to_str().ok())
.and_then(|s| s.parse::<u64>().ok());
// Parse enable_local_updates header: "true" or "false"
let enable_local_updates = headers
.get(HEADER_ENABLE_LOCAL_UPDATES)
.and_then(|v| v.to_str().ok())
.and_then(|s| match s.to_lowercase().as_str() {
"true" | "1" => Some(true),
"false" | "0" => Some(false),
_ => None,
});
if worker_id.is_none() && prefill_id.is_none() && enable_local_updates.is_none() {
if worker_id.is_none() && prefill_id.is_none() {
return nvext;
}
......@@ -57,9 +43,6 @@ pub fn apply_header_routing_overrides(nvext: Option<NvExt>, headers: &HeaderMap)
if let Some(id) = prefill_id {
ext.prefill_worker_id = Some(id);
}
if let Some(enabled) = enable_local_updates {
ext.enable_local_updates = Some(enabled);
}
Some(ext)
}
......@@ -169,17 +152,6 @@ pub struct NvExt {
#[serde(default, skip_serializing_if = "Option::is_none")]
pub decode_worker_id: Option<u64>,
/// Controls whether the router should manage local bookkeeping (add_request,
/// mark_prefill_completed, free) for this request.
///
/// - `None` or `true`: Router handles bookkeeping locally (default behavior)
/// - `false`: External caller (e.g., GAIE sidecar) handles bookkeeping via C FFI
///
/// Set to `false` for GAIE Stage 2 when the EPP/sidecar manages request lifecycle.
#[builder(default, setter(strip_option))]
#[serde(default, skip_serializing_if = "Option::is_none")]
pub enable_local_updates: Option<bool>,
/// Agent-provided hints for request handling.
#[builder(default, setter(strip_option))]
#[serde(default, skip_serializing_if = "Option::is_none")]
......@@ -187,7 +159,7 @@ pub struct NvExt {
}
/// Hints from the agent/caller about request characteristics.
#[derive(ToSchema, Serialize, Deserialize, Builder, Debug, Clone, Default)]
#[derive(ToSchema, Serialize, Deserialize, Builder, Debug, Clone, Default, PartialEq)]
pub struct AgentHints {
/// Latency sensitivity in seconds for queue ordering.
/// Higher values cause the request to be scheduled sooner when the router queue is enabled.
......@@ -249,7 +221,7 @@ mod tests {
assert_eq!(nv_ext.extra_fields, None);
assert_eq!(nv_ext.prefill_worker_id, None);
assert_eq!(nv_ext.decode_worker_id, None);
assert_eq!(nv_ext.enable_local_updates, None);
assert_eq!(nv_ext.agent_hints, None);
}
// Test valid builder configurations
......
......@@ -83,15 +83,18 @@ pub enum RouterMode {
#[default]
RoundRobin,
Random,
Direct(u64),
// Marker value, KV routing itself is in dynamo-llm
KV,
Direct,
}
impl RouterMode {
pub fn is_kv_routing(&self) -> bool {
*self == RouterMode::KV
}
pub fn is_direct_routing(&self) -> bool {
*self == RouterMode::Direct
}
}
async fn addressed_router(endpoint: &Endpoint) -> anyhow::Result<Arc<AddressedPushRouter>> {
......@@ -415,14 +418,17 @@ where
U: Data + for<'de> Deserialize<'de> + MaybeError,
{
async fn generate(&self, request: SingleIn<T>) -> Result<ManyOut<U>, Error> {
//InstanceSource::Static => self.r#static(request).await,
match self.router_mode {
RouterMode::Random => self.random(request).await,
RouterMode::RoundRobin => self.round_robin(request).await,
RouterMode::Direct(instance_id) => self.direct(request, instance_id).await,
RouterMode::KV => {
anyhow::bail!("KV routing should not call generate on PushRouter");
}
RouterMode::Direct => {
anyhow::bail!(
"Direct routing should not call generate on PushRouter directly; use DirectRoutingRouter wrapper"
);
}
}
}
}
......@@ -16,12 +16,7 @@ spec:
replicas: 1
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/frontend:0.8.0
env:
- name: DYN_KV_BLOCK_SIZE
value: "128"
- name: DYN_MODEL
value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
image: nvcr.io/nvidia/ai-dynamo/frontend:my-tag
eppConfig:
# This configuration uses Dynamo's KV-aware scorer for intelligent routing
config:
......@@ -49,8 +44,15 @@ spec:
mountPoint: /opt/models
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
workingDir: /workspace/examples/backends/vllm
command:
- python3
args:
- -m
- dynamo.frontend
- --router-mode
- direct
envs:
- name: HF_HOME
value: /opt/models
......@@ -79,7 +81,7 @@ spec:
command:
- /bin/sh
- -c
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
workingDir: /workspace/examples/backends/vllm
replicas: 1
resources:
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment