feat: default kv-events-config to empty (align with vLLM defaults) (#6404)

Signed-off-by: alec-flowers <aflowers@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Commit 7bbacce1 authored Feb 19, 2026 by Alec; committed by GitHub Feb 20, 2026
20 changed files
--- a/README.md
+++ b/README.md
@@ -195,6 +195,8 @@ python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --discovery-backend file \
 >
 > See [Service Discovery and Messaging](#service-discovery-and-messaging) for details.
+> **Deprecation notice:** vLLM automatically enables KV event publishing when prefix caching is active. In a future release, this will change — KV events will be disabled by default for all backends. Start using `--kv-events-config` explicitly to prepare.
 #### Send a Request
 ```bash

--- a/benchmarks/router/run_engines.sh
+++ b/benchmarks/router/run_engines.sh
@@ -260,7 +260,8 @@ else
                fi
                VLLM_ARGS+=("${EXTRA_ARGS[@]}")
-                exec env PYTHONHASHSEED=0 CUDA_VISIBLE_DEVICES=$GPU_DEVICES DYN_VLLM_KV_EVENT_PORT=$((20080 + i)) VLLM_NIXL_SIDE_CHANNEL_PORT=$((20096 + i)) python3 -m dynamo.vllm \
+                VLLM_ARGS+=("--kv-events-config" "{\"publisher\":\"zmq\",\"topic\":\"kv-events\",\"endpoint\":\"tcp://*:$((20080 + i))\",\"enable_kv_cache_events\":true}")
+                exec env PYTHONHASHSEED=0 CUDA_VISIBLE_DEVICES=$GPU_DEVICES VLLM_NIXL_SIDE_CHANNEL_PORT=$((20096 + i)) python3 -m dynamo.vllm \
                    "${VLLM_ARGS[@]}"
            fi
        } &
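
Review note: the hunk above replaces the per-worker `DYN_VLLM_KV_EVENT_PORT` variable with an inline `--kv-events-config` JSON string whose endpoint port is `20080 + i`. A minimal Python sketch of the same construction follows, which avoids hand-escaping quotes in shell. The helper name `kv_events_config` and the `base_port` parameter are hypothetical and not part of this commit.

```python
import json


def kv_events_config(worker_index: int, base_port: int = 20080) -> str:
    """Build the JSON value for --kv-events-config for one worker.

    Hypothetical helper: mirrors the fields used in run_engines.sh,
    with one ZMQ port per worker (base_port + worker_index).
    """
    config = {
        "publisher": "zmq",
        "topic": "kv-events",
        "endpoint": f"tcp://*:{base_port + worker_index}",
        "enable_kv_cache_events": True,
    }
    # Compact separators keep the string identical in shape to the script's.
    return json.dumps(config, separators=(",", ":"))


if __name__ == "__main__":
    for i in range(4):
        print(kv_events_config(i))
```

Each worker on the same host gets a distinct endpoint, which is what the removed environment variable previously guaranteed.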

--- a/components/src/dynamo/vllm/args.py
+++ b/components/src/dynamo/vllm/args.py
@@ -5,6 +5,7 @@ import argparse
 import logging
 import os
 import socket
+import warnings
 from typing import Any, Dict, Optional
 from vllm.config import KVTransferConfig
@@ -344,6 +345,16 @@ def create_kv_events_config(
    # Create default events config for prefix caching
    # TODO: move this to configuration system.
    port = envs.DYN_VLLM_KV_EVENT_PORT
+    warnings.warn(
+        "Automatic KV events configuration is deprecated and will be removed in "
+        "the next release. After that, KV events will be disabled by default "
+        "(matching upstream vLLM). To preserve current behavior, pass "
+        "--kv-events-config explicitly. For example:\n"
+        f'  --kv-events-config \'{{"enable_kv_cache_events":true,"publisher":"zmq","endpoint":"tcp://*:{port}"}}\'\n'
+        "See docs/pages/backends/vllm/README.md for details.",
+        FutureWarning,
+        stacklevel=2,
+    )
    logger.info(
        f"Using env-var DYN_VLLM_KV_EVENT_PORT={port} to create kv_events_config"
    )
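
Review note: the hunk above adds a `FutureWarning` to the auto-configuration path only. The sketch below illustrates that deprecation pattern in isolation: an explicit config is used as-is, and the warning fires only when the fallback is taken. It is a simplified stand-in, not the body of `create_kv_events_config`; the function name, the `prefix_caching` parameter, and the default port of 20080 are assumptions (the real code reads the port from `DYN_VLLM_KV_EVENT_PORT`).

```python
import json
import warnings
from typing import Any, Dict, Optional


def resolve_kv_events_config(
    explicit_json: Optional[str],
    prefix_caching: bool,
    port: int = 20080,  # assumed default for illustration only
) -> Optional[Dict[str, Any]]:
    """Hypothetical resolver showing the deprecation pattern."""
    if explicit_json is not None:
        # The user passed --kv-events-config: honor it, no warning.
        return json.loads(explicit_json)
    if not prefix_caching:
        # Nothing to publish without prefix caching.
        return None
    auto_config = {
        "enable_kv_cache_events": True,
        "publisher": "zmq",
        "endpoint": f"tcp://*:{port}",
    }
    warnings.warn(
        "Automatic KV events configuration is deprecated; pass "
        "--kv-events-config explicitly, for example: "
        + json.dumps(auto_config, separators=(",", ":")),
        FutureWarning,
        stacklevel=2,  # attribute the warning to the caller
    )
    return auto_config
```

`FutureWarning` is shown by default in Python, unlike `DeprecationWarning`, which is hidden outside `__main__` and test runners, so end users launching a worker would see the message.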

--- a/docs/pages/backends/sglang/README.md
+++ b/docs/pages/backends/sglang/README.md
@@ -163,7 +163,7 @@ docker compose -f deploy/docker-compose.yml up -d
 > [!NOTE]
 > - **etcd** is optional but is the default local discovery backend. You can also use `--discovery-backend file` to use file system based discovery.
-> - **NATS** is optional - only needed if using KV routing with events (default). You can disable it with `--no-kv-events` flag for prediction-based routing
+> - **NATS** is optional - only needed if using KV routing with events. Workers must be explicitly configured with `--kv-events-config` to publish events. Use `--no-router-kv-events` on the frontend for prediction-based routing without events
 > - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD)
 > [!TIP]

--- a/docs/pages/backends/trtllm/README.md
+++ b/docs/pages/backends/trtllm/README.md
@@ -72,7 +72,7 @@ docker compose -f deploy/docker-compose.yml up -d
 > [!NOTE]
 > - **etcd** is optional but is the default local discovery backend. You can also use `--discovery-backend file` to use file system based discovery.
-> - **NATS** is optional - only needed if using KV routing with events (default). You can disable it with `--no-kv-events` flag for prediction-based routing
+> - **NATS** is optional - only needed if using KV routing with events. Workers must be explicitly configured to publish events. Use `--no-router-kv-events` on the frontend for prediction-based routing without events
 > - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD)
 ### Build container

--- a/docs/pages/backends/vllm/README.md
+++ b/docs/pages/backends/vllm/README.md
@@ -67,7 +67,7 @@ docker compose -f deploy/docker-compose.yml up -d
 > [!NOTE]
 > - **etcd** is optional but is the default local discovery backend. You can also use `--discovery-backend file` to use file system based discovery.
-> - **NATS** is optional - only needed if using KV routing with events (default). You can disable it with `--no-kv-events` flag for prediction-based routing
+> - **NATS** is optional - only needed if using KV routing with events. For vLLM, KV events are currently enabled by default when prefix caching is active (**deprecated** — use `--kv-events-config` explicitly). Use `--no-router-kv-events` on the frontend for prediction-based routing without events
 > - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD)
 ### Pull or build container

--- a/docs/pages/components/router/router-guide.md
+++ b/docs/pages/components/router/router-guide.md
@@ -191,12 +191,12 @@ The main KV-aware routing arguments (frontend uses the same `--router-*` flag na
 >
 > When `--router-kv-overlap-score-weight` is set to 0, no KVIndexer is created and prefix matching is disabled (pure load balancing). When `--no-router-kv-events` is set, a KVIndexer is still created but no event subscriber is launched to consume KV events from workers. Instead, the router predicts cache state based on its own routing decisions with TTL-based expiration and pruning.
 >
-> **Backend Configuration:** When using `--no-router-kv-events`, configure your backend workers to disable KV event publishing:
+> **Backend Configuration:** When using `--no-router-kv-events`, no additional backend flags are needed — SGLang and TRT-LLM disable KV events by default. For vLLM, KV events are currently enabled by default when prefix caching is active (deprecated — will change in a future release). Use `--kv-events-config` explicitly to control behavior:
-> - **vLLM**: Use `--kv-events-config '{"enable_kv_cache_events": false}'`
+> - **vLLM**: Use `--kv-events-config '{"enable_kv_cache_events": false}'` to disable, or omit (auto-enabled, deprecated)
 > - **SGLang**: Do not use `--kv-events-config`
 > - **TRT-LLM**: Do not use `--publish-events-and-metrics`
 >
-> The cli args `--router-ttl-secs`, `--router-max-tree-size`, and `--router-prune-target-ratio` control local cache management when the router operates without receiving events from workers. When KV events are enabled (default), the router relies on worker-side eviction events and these parameters are ignored.
+> The cli args `--router-ttl-secs`, `--router-max-tree-size`, and `--router-prune-target-ratio` control local cache management when the router operates without receiving events from workers. When workers are configured to publish KV events (via `--kv-events-config`), the router relies on worker-side eviction events and these parameters are ignored.
 >
 > **Queue threshold vs. busy rejection thresholds:** `--router-queue-threshold` and the busy thresholds (`--active-decode-blocks-threshold`, `--active-prefill-tokens-threshold`, `--active-prefill-tokens-threshold-frac`) serve different purposes. The busy thresholds **reject** a worker entirely from the candidate set when it exceeds a utilization limit — no traffic is sent until it drops below the threshold. In contrast, `--router-queue-threshold` does not reject workers; it **defers the entire routing decision** until at least one worker has capacity, so the request is routed with the freshest load metrics. The queue also enables priority scheduling via `nvext.agent_hints.latency_sensitivity`.
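
Review note: the router guide text in this hunk says that without KV events the router predicts cache state from its own routing decisions, with TTL-based expiration and pruning controlled by `--router-ttl-secs`, `--router-max-tree-size`, and `--router-prune-target-ratio`. The toy sketch below shows one way such bookkeeping could work. It is not Dynamo's KVIndexer; the class, its methods, and the interpretation of the prune ratio (shrink to `max_size * ratio` entries, oldest first) are assumptions made for illustration.

```python
import time
from typing import Dict, Optional


class PredictedCache:
    """Toy model of router-side cache prediction (hypothetical)."""

    def __init__(self, ttl_secs: float, max_size: int, prune_target_ratio: float):
        self.ttl_secs = ttl_secs
        self.max_size = max_size
        self.prune_target_ratio = prune_target_ratio
        self._expiry: Dict[str, float] = {}  # block key -> expiry timestamp

    def record_routing(self, block_key: str, now: Optional[float] = None) -> None:
        """Assume the worker now caches this block until the TTL elapses."""
        now = time.monotonic() if now is None else now
        self._expiry[block_key] = now + self.ttl_secs
        if len(self._expiry) > self.max_size:
            self._prune()

    def is_probably_cached(self, block_key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        expiry = self._expiry.get(block_key)
        if expiry is None:
            return False
        if expiry <= now:
            del self._expiry[block_key]  # lazily drop expired entries
            return False
        return True

    def _prune(self) -> None:
        """Shrink to max_size * prune_target_ratio, dropping soonest-to-expire."""
        target = int(self.max_size * self.prune_target_ratio)
        excess = len(self._expiry) - target
        for key in sorted(self._expiry, key=self._expiry.get)[:excess]:
            del self._expiry[key]
```

When workers publish real eviction events, none of this guessing is needed, which is consistent with the guide's statement that these parameters are ignored in that mode.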

--- a/docs/pages/design-docs/event-plane.md
+++ b/docs/pages/design-docs/event-plane.md
@@ -73,8 +73,9 @@ Example setup:
 export NATS_SERVER=nats://nats-server:4222
 export DYN_EVENT_PLANE=nats
-# Start workers -- they publish events to NATS automatically
+# Start workers -- explicitly enable KV event publishing
-python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B
+python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B \
+    --kv-events-config '{"publisher":"nats","topic":"kv-events","enable_kv_cache_events":true}'
 # Start frontend -- it subscribes to events from NATS automatically
 python3 -m dynamo.frontend --router-mode kv
@@ -94,7 +95,8 @@ Example setup:
 export DYN_EVENT_PLANE=zmq
 # Start workers -- each binds a ZMQ socket, registers with discovery
-python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B
+python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B \
+  --kv-events-config '{"publisher":"zmq","endpoint":"tcp://*:20080","enable_kv_cache_events":true}'
 # Start frontend -- discovers workers and connects directly
 python3 -m dynamo.frontend --router-mode kv
@@ -105,10 +107,10 @@ python3 -m dynamo.frontend --router-mode kv
 If you do not need KV-aware routing, you can disable the event plane entirely:
 ```bash
-python3 -m dynamo.frontend --router-mode kv --no-kv-events
+python3 -m dynamo.frontend --router-mode kv --no-router-kv-events
 ```
-With `--no-kv-events`:
+With `--no-router-kv-events`:
 - The router falls back to prediction-based cache-aware routing (estimates cache state from routing decisions).
 - No NATS server or ZMQ sockets are needed.

--- a/docs/pages/design-docs/request-plane.md
+++ b/docs/pages/design-docs/request-plane.md
@@ -33,7 +33,7 @@ Dynamo has **two independent communication planes**:
 - **Request plane** (**`DYN_REQUEST_PLANE`**): how **RPC requests** flow between components (frontend → router → worker), via `tcp`, `http`, or `nats`.
 - **KV event plane** (currently only **NATS** is supported): how **KV cache events** (and optional router replica sync) are distributed/persisted for KV-aware routing.
-**Note:** If you are using `tcp` or `http` request plane with KV events enabled (default), NATS is automatically initialized. You can optionally configure `NATS_SERVER` environment variable (e.g., `NATS_SERVER=nats://nats-hostname:port`) to specify a custom NATS server; otherwise, it defaults to `localhost:4222`. To completely disable NATS, use `--no-kv-events` on the frontend.
+**Note:** If you are using `tcp` or `http` request plane with KV events enabled on the router (the default router-side setting), NATS is automatically initialized. SGLang requires explicit `--kv-events-config` and TRT-LLM requires `--publish-events-and-metrics` to publish events. For vLLM, KV events are currently auto-configured when prefix caching is active (deprecated — use `--kv-events-config` explicitly to prepare for a future release where all backends will default to off). You can optionally configure `NATS_SERVER` environment variable (e.g., `NATS_SERVER=nats://nats-hostname:port`) to specify a custom NATS server; otherwise, it defaults to `localhost:4222`. To disable the router's KV event listener, use `--no-router-kv-events` on the frontend.
 Because they are independent, you can mix them.
@@ -89,7 +89,7 @@ DYN_REQUEST_PLANE=tcp python -m dynamo.vllm --model Qwen/Qwen3-0.6B
 **When to use TCP:**
 - Simple deployments with direct service-to-service communication (e.g. frontend to backend)
-- Minimal infrastructure requirements (NATS is initialized by default for KV events but can be disabled with `--no-kv-events`)
+- Minimal infrastructure requirements (NATS is initialized when the router listens for KV events; disable with `--no-router-kv-events`)
 - Low-latency requirements
 **TCP Configuration Options:**
@@ -161,7 +161,7 @@ DYN_REQUEST_PLANE=nats python -m dynamo.vllm --model Qwen/Qwen3-0.6B
 **When to use NATS:**
 - Production deployments with service discovery
-- KV-aware routing with accurate cache state tracking (requires NATS for event transport). Note: approximate mode (`--no-kv-events`) provides KV routing without NATS but with reduced accuracy.
+- KV-aware routing with accurate cache state tracking (requires NATS for event transport). Note: approximate mode (`--no-router-kv-events`) provides KV routing without NATS but with reduced accuracy.
 - Need for message replay and persistence features
 Limitations:
@@ -290,6 +290,6 @@ curl http://localhost:8000/v1/chat/completions \
 ### Resource Usage
-- **TCP**: Minimal infrastructure (NATS required only if using KV events, can disable with `--no-kv-events`)
+- **TCP**: Minimal infrastructure (NATS required only if using KV events, disable router-side with `--no-router-kv-events`)
-- **HTTP**: Minimal infrastructure (NATS required only if using KV events, can disable with `--no-kv-events`)
+- **HTTP**: Minimal infrastructure (NATS required only if using KV events, disable router-side with `--no-router-kv-events`)
 - **NATS**: Requires running NATS server (additional memory/CPU)
--- a/docs/pages/getting-started/quickstart.md
+++ b/docs/pages/getting-started/quickstart.md
@@ -140,6 +140,10 @@ For dependency-free local development, disable KV event publishing (avoids NATS)
 is expected and can be safely ignored.
 </Note>
+<Note>
+**Deprecation notice:** vLLM automatically enables KV event publishing when prefix caching is active. In a future release, this will change — KV events will be disabled by default for all backends. Start using `--kv-events-config` explicitly to prepare.
+</Note>
 ## Test Your Deployment
 ```bash

--- a/docs/pages/kubernetes/inference-gateway.md
+++ b/docs/pages/kubernetes/inference-gateway.md
@@ -228,7 +228,7 @@ Common Vars for Routing Configuration:
 - Set `DYN_ENFORCE_DISAGG=true` if you want to enforce every request being served in the disaggregated manner. By default it is false, meaning that if the prefill worker is not available, the request will be served in the aggregated manner.
 - Set `DYN_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes. (default: 1)
 - Set `DYN_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
-- Set `DYN_USE_KV_EVENTS=false` if you want to disable the workers sending KV events while using kv-routing (default: true)
+- Set `DYN_USE_KV_EVENTS=false` if you want to disable the router listening for KV events while using kv-routing (default: true). SGLang workers require `--kv-events-config` and TRT-LLM workers require `--publish-events-and-metrics` to publish KV events. For vLLM, KV events are auto-configured when prefix caching is active (deprecated — use `--kv-events-config` explicitly)
 - `DYN_ROUTER_TEMPERATURE` — Temperature for worker sampling via softmax (default: 0.0)
 - `DYN_ROUTER_REPLICA_SYNC` — Enable replica synchronization (default: false)
 - `DYN_ROUTER_TRACK_ACTIVE_BLOCKS` — Track active blocks (default: true)

--- a/docs/pages/observability/tracing.md
+++ b/docs/pages/observability/tracing.md
@@ -73,8 +73,8 @@ cd examples/backends/vllm/launch
 ./disagg.sh
 ```
-**Note:** the example vLLM `disagg.sh` sets additional per-worker port environment variables (e.g., `DYN_VLLM_KV_EVENT_PORT`,
+**Note:** the example vLLM `disagg.sh` sets per-worker `--kv-events-config` with unique ZMQ endpoints and unique
-`VLLM_NIXL_SIDE_CHANNEL_PORT`) to avoid ZMQ "Address already in use" conflicts when multiple workers run on the same host. If you run the components manually, make sure you mirror those port settings.
+`VLLM_NIXL_SIDE_CHANNEL_PORT` values to avoid "Address already in use" conflicts when multiple workers run on the same host. If you run the components manually, make sure you mirror those settings.
 ```bash
 #!/bin/bash
@@ -100,13 +100,13 @@ DYN_SYSTEM_PORT=8081 CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm \
 # Run prefill worker, make sure to wait for start up
 export OTEL_SERVICE_NAME=dynamo-worker-prefill
 DYN_SYSTEM_PORT=8082 \
-DYN_VLLM_KV_EVENT_PORT=20081 \
 VLLM_NIXL_SIDE_CHANNEL_PORT=20097 \
 CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
    --model Qwen/Qwen3-0.6B \
    --enforce-eager \
    --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT" \
-    --is-prefill-worker &
+    --is-prefill-worker \
+    --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}' &
 ```
 For disaggregated deployments, this separates prefill and decode onto different GPUs for better resource utilization.

--- a/examples/backends/vllm/deploy/agg_router.yaml
+++ b/examples/backends/vllm/deploy/agg_router.yaml
@@ -34,3 +34,5 @@ spec:
          args:
            - --model
            - Qwen/Qwen3-0.6B
+            - --kv-events-config
+            - '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20080","enable_kv_cache_events":true}'
--- a/examples/backends/vllm/deploy/disagg_router.yaml
+++ b/examples/backends/vllm/deploy/disagg_router.yaml
@@ -54,3 +54,5 @@ spec:
            - --model
            - Qwen/Qwen3-0.6B
            - --is-prefill-worker
+            - --kv-events-config
+            - '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20080","enable_kv_cache_events":true}'
--- a/examples/backends/vllm/launch/dep.sh
+++ b/examples/backends/vllm/launch/dep.sh
@@ -13,14 +13,14 @@ python -m dynamo.frontend --router-mode kv &
 # Chose Qwen3-30B because it's a small MoE that can fit on smaller GPUs (L40S for example)
 # --enforce-eager is added for quick deployment. for production use, need to remove this flag
 for i in {0..3}; do
-    DYN_VLLM_KV_EVENT_PORT=$((20080 + i)) \
    VLLM_NIXL_SIDE_CHANNEL_PORT=$((20096 + i)) \
    CUDA_VISIBLE_DEVICES=$i python3 -m dynamo.vllm \
    --model Qwen/Qwen3-30B-A3B \
    --data-parallel-rank $i \
    --data-parallel-size 4 \
    --enable-expert-parallel \
-    --enforce-eager &
+    --enforce-eager \
+    --kv-events-config "{\"publisher\":\"zmq\",\"topic\":\"kv-events\",\"endpoint\":\"tcp://*:$((20080 + i))\",\"enable_kv_cache_events\":true}" &
 done
 echo "All workers starting. (press Ctrl+C to stop)..."

--- a/examples/backends/vllm/launch/disagg.sh
+++ b/examples/backends/vllm/launch/disagg.sh
@@ -13,9 +13,9 @@ python -m dynamo.frontend &
 CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager --is-decode-worker &
 DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT2:-8082} \
-DYN_VLLM_KV_EVENT_PORT=20081 \
 VLLM_NIXL_SIDE_CHANNEL_PORT=20097 \
 CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
    --model Qwen/Qwen3-0.6B \
    --enforce-eager \
-    --is-prefill-worker
+    --is-prefill-worker \
+    --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}'
--- a/examples/backends/vllm/launch/disagg_kvbm.sh
+++ b/examples/backends/vllm/launch/disagg_kvbm.sh
@@ -14,7 +14,6 @@ CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --connecto
 # run prefill worker on GPU 1 with KVBM enabled using 20GB of CPU cache
 # NOTE: remove --enforce-eager for production use
-DYN_VLLM_KV_EVENT_PORT=20081 \
 VLLM_NIXL_SIDE_CHANNEL_PORT=20097 \
 DYN_KVBM_CPU_CACHE_GB=20 \
 CUDA_VISIBLE_DEVICES=1 \
@@ -22,4 +21,5 @@ CUDA_VISIBLE_DEVICES=1 \
    --model Qwen/Qwen3-0.6B \
    --is-prefill-worker \
    --connector kvbm nixl \
-    --enforce-eager
+    --enforce-eager \
+    --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}'
--- a/examples/backends/vllm/launch/disagg_kvbm_2p2d.sh
+++ b/examples/backends/vllm/launch/disagg_kvbm_2p2d.sh
@@ -11,14 +11,12 @@ python -m dynamo.frontend --router-mode kv &
 # run decode workers on GPU 0 and 1, without enabling KVBM
 # NOTE: remove --enforce-eager for production use
 CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --connector nixl --enforce-eager --is-decode-worker &
-DYN_VLLM_KV_EVENT_PORT=20081 \
 VLLM_NIXL_SIDE_CHANNEL_PORT=20097 \
 CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --connector nixl --enforce-eager --is-decode-worker &
 # run prefill workers on GPU 2 and 3 with KVBM enabled using 20GB of CPU cache
 # NOTE: use different barrier id prefixes for each prefill worker to avoid conflicts
 # NOTE: remove --enforce-eager for production use
-DYN_VLLM_KV_EVENT_PORT=20082 \
 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 \
 DYN_KVBM_CPU_CACHE_GB=20 \
 CUDA_VISIBLE_DEVICES=2 \
@@ -26,9 +24,9 @@ CUDA_VISIBLE_DEVICES=2 \
    --model Qwen/Qwen3-0.6B \
    --is-prefill-worker \
    --connector kvbm nixl \
-    --enforce-eager &
+    --enforce-eager \
+    --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20082","enable_kv_cache_events":true}' &
-DYN_VLLM_KV_EVENT_PORT=20083 \
 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 \
 DYN_KVBM_LEADER_ZMQ_PUB_PORT=56003 \
 DYN_KVBM_LEADER_ZMQ_ACK_PORT=56004 \
@@ -38,4 +36,5 @@ CUDA_VISIBLE_DEVICES=3 \
    --model Qwen/Qwen3-0.6B \
    --is-prefill-worker \
    --connector kvbm nixl \
-    --enforce-eager
+    --enforce-eager \
+    --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20083","enable_kv_cache_events":true}'
--- a/examples/backends/vllm/launch/disagg_lmcache.sh
+++ b/examples/backends/vllm/launch/disagg_lmcache.sh
@@ -15,10 +15,10 @@ CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B &
 sleep 20
 # run prefill worker on GPU 1 with LMCache
-DYN_VLLM_KV_EVENT_PORT=20081 \
 VLLM_NIXL_SIDE_CHANNEL_PORT=20097 \
 CUDA_VISIBLE_DEVICES=1 \
  python3 -m dynamo.vllm \
    --model Qwen/Qwen3-0.6B \
    --is-prefill-worker \
-    --connector lmcache nixl
+    --connector lmcache nixl \
+    --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}'
--- a/examples/backends/vllm/launch/disagg_same_gpu.sh
+++ b/examples/backends/vllm/launch/disagg_same_gpu.sh
@@ -70,12 +70,12 @@ sleep 10
 # run prefill worker with metrics on port 8082 (foreground)
 DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT2:-8082} \
-DYN_VLLM_KV_EVENT_PORT=20081 \
 VLLM_NIXL_SIDE_CHANNEL_PORT=20097 \
 CUDA_VISIBLE_DEVICES=0 \
 python3 -m dynamo.vllm \
  --model Qwen/Qwen3-0.6B \
  --enforce-eager \
  --is-prefill-worker \
-  --gpu-memory-utilization ${GPU_MEM_FRACTION}
+  --gpu-memory-utilization ${GPU_MEM_FRACTION} \
+  --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}'