@@ -42,7 +42,7 @@ When using KV routing, the router needs to know what each worker has cached. The
|------------|---------------|-------------|
| **NATS Core (local indexer)** | Default (no extra flags) | Workers maintain a local indexer; router queries workers on startup and receives events via NATS Core |
| **JetStream (durable)** | `--router-durable-kv-events` | Events persisted in NATS JetStream; supports snapshots and durable consumers. *Deprecated.* |
| **ZMQ** | `--event-plane zmq` | Workers publish via ZMQ PUB sockets; the standalone `dynamo.indexer` service aggregates events |
| **Approximate (no events)** | `--no-router-kv-events` | No events consumed; router predicts cache state from its own routing decisions with TTL-based expiration |
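The approximate mode in the table above can be pictured with a small sketch. This is not Dynamo's implementation: the class name `ApproximateCacheView`, its methods, and the injectable clock are hypothetical, and the only behavior taken from the table is that the router infers cache state from its own routing decisions and forgets it after a TTL.

```python
import time
from collections import defaultdict


class ApproximateCacheView:
    """Hypothetical sketch: predict worker cache contents from routing decisions."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        # worker_id -> {block_hash: expiry timestamp}
        self._expiry = defaultdict(dict)

    def record_routing(self, worker_id, block_hashes):
        """Assume the worker now caches every block of the request routed to it."""
        expires_at = self.clock() + self.ttl_seconds
        for block_hash in block_hashes:
            self._expiry[worker_id][block_hash] = expires_at

    def predicted_overlap(self, worker_id, block_hashes):
        """Count leading blocks still believed to be cached on the worker."""
        now = self.clock()
        cached = self._expiry[worker_id]
        overlap = 0
        for block_hash in block_hashes:
            expires_at = cached.get(block_hash)
            if expires_at is None or expires_at <= now:
                break  # prefix match ends at the first unknown or expired block
            overlap += 1
        return overlap
```

Because no worker events are consumed, the prediction can drift from the real cache state; the TTL only bounds how long a stale guess survives.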
### Aggregated vs. Disaggregated Topology
...
@@ -93,6 +93,8 @@ Backend workers register themselves using the `register_model` API, after which
| `--router-prefill-load-model <none\|aic>` | `none` | Prompt-side load model. `aic` decays only the oldest active prefill using an AIC-predicted duration |
| `--router-queue-policy <str>` | `fcfs` | Scheduling policy for the queue: `fcfs` (tail TTFT), `wspt` (avg TTFT), or `lcfs` (comparison-only reverse ordering) |
| `--serve-indexer` | `false` | Serve the Dynamo-native remote indexer from this frontend/router on the worker component |
| `--use-remote-indexer` | `false` | Query the worker component's served remote indexer instead of maintaining a local overlap indexer |
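To make the three `--router-queue-policy` values concrete, here is a minimal sketch of how each could order a pending queue. The `QueuedRequest` fields and the `order_queue` helper are hypothetical, and the `wspt` branch uses the textbook weighted-shortest-processing-time rule (highest weight-to-work ratio first); how Dynamo actually derives weights and work estimates is not stated here.

```python
from dataclasses import dataclass


@dataclass
class QueuedRequest:
    request_id: str
    arrival_order: int       # lower means the request arrived earlier
    estimated_work: float    # hypothetical cost estimate, assumed positive
    weight: float = 1.0      # hypothetical priority weight


def order_queue(requests, policy):
    """Return requests in the order the given policy would dispatch them."""
    if policy == "fcfs":
        return sorted(requests, key=lambda r: r.arrival_order)
    if policy == "lcfs":
        return sorted(requests, key=lambda r: r.arrival_order, reverse=True)
    if policy == "wspt":
        # Textbook WSPT: largest weight / work ratio goes first.
        return sorted(
            requests, key=lambda r: r.weight / r.estimated_work, reverse=True
        )
    raise ValueError(f"unknown queue policy: {policy}")
```

Serving the oldest request first limits how long any single request waits, while serving short requests first lowers the average wait; that is the usual intuition behind the tail versus average TTFT labels in the table.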
For all available options: `python -m dynamo.frontend --help`
...
@@ -444,6 +446,63 @@ graph TD
For improved fault tolerance, you can launch multiple frontend + router replicas. If multiple `dynamo.frontend` processes share the same host or network namespace, give each instance a different HTTP port. In Kubernetes or on separate hosts, replicas can usually reuse the same container port. Alternatively, you can deploy the router separately as the standalone `python -m dynamo.router` service; see the [Standalone Router README](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/router/README.md).
### Dynamo-Native Remote Indexer
For Dynamo-native deployments, the remote indexer is served by `dynamo.frontend` or `dynamo.router`, not by `dynamo.indexer`.
- Use `--serve-indexer` on router/frontend replicas that should expose `kv_indexer_query` from the worker component.
- Use `--use-remote-indexer` on consumer routers/frontends that should query that served endpoint instead of maintaining a local overlap indexer.
- `dynamo.indexer` remains the standalone HTTP + ZMQ microservice for non-Dynamo / direct-ZMQ deployments.
The served indexer is request-plane only. Each serving router/frontend keeps its normal local KV event ingestion, gap detection, and worker-query recovery path; remote consumers only issue hash-based overlap queries.
Approximate mode (`--no-router-kv-events`) is singleton-only for remote serving: only one `--serve-indexer` replica may exist for a given worker component. Event-driven mode allows multiple serving replicas behind the same worker component.
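The replica rule above can be expressed as a small pre-deployment check. This is a sketch only: `check_serving_replicas` and its inputs are hypothetical and not part of Dynamo, and the function simply encodes the constraint as stated.

```python
def check_serving_replicas(serve_indexer_replicas, kv_events_enabled):
    """Return the worker components whose serving-replica count breaks the rule.

    serve_indexer_replicas maps a worker component name to the number of
    router/frontend replicas started with --serve-indexer for it.
    kv_events_enabled is False when --no-router-kv-events is set.
    """
    if kv_events_enabled:
        # Event-driven mode: any number of serving replicas is allowed.
        return []
    # Approximate mode: at most one serving replica per worker component.
    return sorted(
        component
        for component, count in serve_indexer_replicas.items()
        if count > 1
    )
```

A plausible reason for the restriction, though the text does not state it, is that approximate mode derives cache state from a router's own routing decisions, so independent serving replicas would each see only part of the traffic.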
@@ -7,13 +7,16 @@ subtitle: Run the KV cache indexer as an independent HTTP service for querying b
## Overview
The standalone KV indexer (`python -m dynamo.indexer`) is a lightweight service that maintains a radix tree of cached blocks and exposes HTTP endpoints for querying and managing workers. It supports two operational modes:
The standalone KV indexer (`python -m dynamo.indexer`) is a lightweight service that maintains a radix tree of cached blocks and exposes HTTP endpoints for querying and managing workers.
- **Standalone mode** (default): subscribes to ZMQ KV event streams directly from workers. No Dynamo runtime discovery, registration, or event-plane integration required.
- It subscribes to ZMQ KV event streams directly from workers.
- **Dynamo runtime mode** (`--dynamo-runtime`): integrates with the Dynamo runtime for automatic worker discovery via MDC, KV event ingestion via the event plane (NATS or ZMQ), and overlap queries over the request plane for remote frontends.
- It exposes an HTTP API for registration, inspection, and overlap queries.
- It preserves P2P recovery and gap detection/replay for the standalone ZMQ path.
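As a rough picture of what an overlap query computes, the sketch below uses a plain prefix tree keyed by block hash. It is an illustration under assumptions, not the indexer's actual radix tree: `PrefixNode`, `PrefixIndex`, and their methods are hypothetical names, and block hashes are modeled as integers.

```python
from dataclasses import dataclass, field


@dataclass
class PrefixNode:
    workers: set = field(default_factory=set)    # workers caching this prefix
    children: dict = field(default_factory=dict)  # block_hash -> PrefixNode


class PrefixIndex:
    """Hypothetical prefix tree over sequences of block hashes."""

    def __init__(self):
        self.root = PrefixNode()

    def store(self, worker_id, block_hashes):
        """Record that a worker caches this sequence of blocks."""
        node = self.root
        for block_hash in block_hashes:
            node = node.children.setdefault(block_hash, PrefixNode())
            node.workers.add(worker_id)

    def overlap_scores(self, block_hashes):
        """Return, per worker, how many leading blocks of the query it caches."""
        scores = {}
        node = self.root
        for block_hash in block_hashes:
            node = node.children.get(block_hash)
            if node is None:
                break  # no worker caches a longer prefix
            for worker_id in node.workers:
                scores[worker_id] = scores.get(worker_id, 0) + 1
        return scores
```

A router or custom scheduler could then prefer the worker with the highest score, since that worker would need to recompute the fewest prefix blocks.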
This is distinct from the [Standalone Router](../../../components/src/dynamo/router/README.md), which is a full routing service. The standalone indexer provides only the indexing and query layer without routing logic.
For Dynamo-native remote indexing, use `--serve-indexer` on `dynamo.frontend` or `dynamo.router` and `--use-remote-indexer` on consumers instead. That request-plane service reuses the router's existing event ingestion and recovery machinery; it is not implemented by `dynamo.indexer`.
The HTTP API follows the [Mooncake KV Indexer RFC](https://github.com/kvcache-ai/Mooncake/issues/1403) conventions.
`DYN_ROUTER_MIN_INITIAL_WORKERS` is also honored here. When set to a positive integer, the
...
@@ -30,9 +33,7 @@ The indexer maintains one radix tree per `(model_name, tenant_id)` pair. Workers
## Compatibility
In standalone mode, the indexer works with any engine that publishes KV cache events over ZMQ in the expected msgpack format. This includes bare vLLM and SGLang engines, which emit ZMQ KV events natively — no Dynamo-specific wrapper is required.
The standalone indexer works with any engine that publishes KV cache events over ZMQ in the expected msgpack format. This includes bare vLLM and SGLang engines, which emit ZMQ KV events natively — no Dynamo-specific wrapper is required.
In Dynamo runtime mode, the indexer discovers workers automatically via MDC and receives KV events through the event plane. It also registers a query endpoint on the request plane, allowing frontends to query overlap scores remotely without needing direct HTTP access.
## Use Cases
...
@@ -40,7 +41,7 @@ In Dynamo runtime mode, the indexer discovers workers automatically via MDC and
- **State verification**: Confirm that the indexer's view of KV cache state matches the router's internal state (used in integration tests).
- **Custom routing**: Build external routing logic that queries the indexer for overlap scores and makes its own worker selection decisions.
- **Monitoring**: Observe KV cache distribution across workers without running a full router.
- **Remote indexing**: In Dynamo runtime mode, frontends can offload KV cache indexing to a dedicated service and query it over the request plane.
- **Standalone microservice**: Run an indexer independently of the router/frontend when you want direct HTTP inspection and ZMQ-based ingestion.
## P2P Recovery
...
@@ -91,7 +92,6 @@ The service is exposed through the Python bindings package and launched with `py
@@ -109,30 +109,12 @@ cd lib/bindings/python && VIRTUAL_ENV=../../.venv ../../.venv/bin/maturin develo
This keeps the default `kv-indexer` build lean while still allowing Prometheus metrics when needed.
### Runtime-enabled build
```bash
cd lib/bindings/python && VIRTUAL_ENV=../../.venv ../../.venv/bin/maturin develop --uv --features kv-indexer,kv-indexer-runtime
```
This enables the `--dynamo-runtime` CLI flag for MDC discovery, event-plane subscription, and request-plane queries. It also includes the metrics endpoint.
In runtime mode, workers are discovered automatically via MDC. The `--workers` flag can still be used to register additional static workers alongside discovered ones.
| Flag | Default | Description |
|------|---------|-------------|
| `--block-size` | (none) | KV cache block size for initial `--workers` (required when `--workers` is set) |
...
@@ -142,10 +124,6 @@ In runtime mode, workers are discovered automatically via MDC. The `--workers` f
| `--model-name` | `default` | Model name for initial `--workers` |
| `--tenant-id` | `default` | Tenant ID for initial `--workers` |
| `--peers` | (none) | Comma-separated peer indexer URLs for P2P recovery on startup |
Returns metrics in Prometheus text exposition format. Available when the Python bindings are built with the `kv-indexer-metrics` or `kv-indexer-runtime` feature.
Returns metrics in Prometheus text exposition format. Available when the Python bindings are built with the `kv-indexer-metrics` feature.
```bash
curl http://localhost:8090/metrics
...
@@ -400,38 +378,9 @@ If no `replay_endpoint` is configured, gaps are logged as warnings but not recov
The sequence counter (`last_seq`) persists across unregister/register cycles, so re-registering a worker after a gap will trigger replay on the first batch received by the new listener.
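The interaction between a persistent `last_seq` and re-registration can be sketched as follows. This is an illustration, not the indexer's code: `SequenceTracker` and its methods are hypothetical, and it assumes each worker numbers its events with consecutive integers.

```python
class SequenceTracker:
    """Hypothetical sketch of gap detection with a persistent last_seq."""

    def __init__(self):
        self.last_seq = {}      # worker_id -> last sequence number applied
        self.listening = set()  # workers with an active listener

    def register(self, worker_id):
        self.listening.add(worker_id)

    def unregister(self, worker_id):
        # Only the listener is dropped; last_seq is kept on purpose.
        self.listening.discard(worker_id)

    def missing_range(self, worker_id, batch_first_seq):
        """Return the inclusive (start, end) range to replay, or None."""
        last = self.last_seq.get(worker_id)
        if last is None or batch_first_seq <= last + 1:
            return None  # first batch ever seen, or no gap
        return (last + 1, batch_first_seq - 1)

    def advance(self, worker_id, batch_last_seq):
        """Record the highest sequence number applied for the worker."""
        previous = self.last_seq.get(worker_id, batch_last_seq)
        self.last_seq[worker_id] = max(previous, batch_last_seq)
```

Because `unregister` leaves `last_seq` untouched, the first batch seen after `register` is compared against the old counter, which is what exposes events missed while no listener was attached.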
## Dynamo Runtime Mode
When started with `--dynamo-runtime`, the indexer integrates with the Dynamo distributed runtime:
### Worker Discovery
The indexer watches MDC (Model Deployment Card) entries for worker additions and removals. When a worker registers with MDC, the indexer automatically creates an indexer for its model and block size. Workers discovered via MDC are tracked separately from those registered via `--workers` or the `/register` HTTP API; a worker cannot be registered through both paths simultaneously.
### Event Plane Subscription
Instead of connecting directly to ZMQ PUB sockets on each worker, the indexer subscribes to KV events through the Dynamo event plane. The transport (NATS or ZMQ) is determined by the `DYNAMO_EVENT_TRANSPORT` environment variable. Events are routed to the appropriate indexer based on the worker ID.
### Request Plane Query Endpoint
The indexer registers a query endpoint on the Dynamo request plane, allowing frontends to send `IndexerQueryRequest` messages containing a model name, namespace, and block hashes. The indexer looks up the appropriate radix tree and returns overlap scores. This enables frontends to use a remote indexer for KV-aware routing without direct HTTP access.
### Example
```bash
# Start the indexer with runtime integration
python -m dynamo.indexer --dynamo-runtime \
  --namespace my-namespace \
  --component-name kv-indexer \
  --worker-component backend \
  --port 8090 --threads 4
```
The HTTP API remains fully available in runtime mode. Static workers can be added via `--workers` alongside discovered workers.
## Limitations
- **Standalone mode is ZMQ only**: In standalone mode, workers must publish KV events via ZMQ PUB sockets. Build with `kv-indexer-runtime` and use `--dynamo-runtime` to receive events via the event plane (NATS or ZMQ).
- **Standalone mode is ZMQ only**: Workers must publish KV events via ZMQ PUB sockets.
- **No routing logic**: The indexer only maintains the radix tree and answers queries. It does not track active blocks, manage request lifecycle, or perform worker selection.