Unverified commit 7bbacce1, authored by Alec, committed by GitHub

feat: default kv-events-config to empty (align with vLLM defaults) (#6404)


Signed-off-by: alec-flowers <aflowers@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
parent d6c49779
@@ -195,6 +195,8 @@ python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --discovery-backend file \
 >
 > See [Service Discovery and Messaging](#service-discovery-and-messaging) for details.
+> **Deprecation notice:** vLLM automatically enables KV event publishing when prefix caching is active. In a future release, this will change — KV events will be disabled by default for all backends. Start using `--kv-events-config` explicitly to prepare.
 #### Send a Request
```bash ```bash
......
@@ -260,7 +260,8 @@ else
 fi
 VLLM_ARGS+=("${EXTRA_ARGS[@]}")
-exec env PYTHONHASHSEED=0 CUDA_VISIBLE_DEVICES=$GPU_DEVICES DYN_VLLM_KV_EVENT_PORT=$((20080 + i)) VLLM_NIXL_SIDE_CHANNEL_PORT=$((20096 + i)) python3 -m dynamo.vllm \
+VLLM_ARGS+=("--kv-events-config" "{\"publisher\":\"zmq\",\"topic\":\"kv-events\",\"endpoint\":\"tcp://*:$((20080 + i))\",\"enable_kv_cache_events\":true}")
+exec env PYTHONHASHSEED=0 CUDA_VISIBLE_DEVICES=$GPU_DEVICES VLLM_NIXL_SIDE_CHANNEL_PORT=$((20096 + i)) python3 -m dynamo.vllm \
 "${VLLM_ARGS[@]}"
 fi
 } &
......
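The per-worker port arithmetic in the launch script above (`20080 + i` for the KV event endpoint, `20096 + i` for the NIXL side channel) can be sketched in Python. This is only an illustration under stated assumptions: `worker_launch_spec` and the two base-port constants are hypothetical names, not part of Dynamo, and the JSON keys are copied from the `--kv-events-config` value shown in this diff.

```python
import json

# Base ports taken from the launch script in this diff; the constant names are hypothetical.
KV_EVENT_BASE_PORT = 20080
NIXL_SIDE_CHANNEL_BASE_PORT = 20096


def worker_launch_spec(i: int, model: str = "Qwen/Qwen3-0.6B") -> dict:
    """Build env vars and argv for worker index i (hypothetical helper)."""
    kv_events_config = json.dumps(
        {
            "publisher": "zmq",
            "topic": "kv-events",
            "endpoint": f"tcp://*:{KV_EVENT_BASE_PORT + i}",
            "enable_kv_cache_events": True,
        },
        separators=(",", ":"),  # compact form, same shape as the value in the diff
    )
    return {
        "env": {"VLLM_NIXL_SIDE_CHANNEL_PORT": str(NIXL_SIDE_CHANNEL_BASE_PORT + i)},
        "argv": [
            "python3", "-m", "dynamo.vllm",
            "--model", model,
            "--kv-events-config", kv_events_config,
        ],
    }


if __name__ == "__main__":
    for i in range(2):
        spec = worker_launch_spec(i)
        print(spec["env"], spec["argv"][-1])
```

Building the JSON with `json.dumps` rather than hand-escaped quotes avoids the backslash-heavy string that the shell version needs.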
@@ -5,6 +5,7 @@ import argparse
 import logging
 import os
 import socket
+import warnings
 from typing import Any, Dict, Optional
 from vllm.config import KVTransferConfig
@@ -344,6 +345,16 @@ def create_kv_events_config(
 # Create default events config for prefix caching
 # TODO: move this to configuration system.
 port = envs.DYN_VLLM_KV_EVENT_PORT
+warnings.warn(
+    "Automatic KV events configuration is deprecated and will be removed in "
+    "the next release. After that, KV events will be disabled by default "
+    "(matching upstream vLLM). To preserve current behavior, pass "
+    "--kv-events-config explicitly. For example:\n"
+    f' --kv-events-config \'{{"enable_kv_cache_events":true,"publisher":"zmq","endpoint":"tcp://*:{port}"}}\'\n'
+    "See docs/pages/backends/vllm/README.md for details.",
+    FutureWarning,
+    stacklevel=2,
+)
 logger.info(
     f"Using env-var DYN_VLLM_KV_EVENT_PORT={port} to create kv_events_config"
 )
......
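The hunk above emits a `FutureWarning` whenever the automatic fallback is taken. A minimal standalone sketch of that pattern follows, assuming a hypothetical helper `resolve_kv_events_config` (it is not the real `create_kv_events_config`, whose full body is not shown in this diff). Unlike `DeprecationWarning`, `FutureWarning` is displayed by default, so end users see it without changing warning filters.

```python
import warnings
from typing import Any, Dict, Optional


def resolve_kv_events_config(
    explicit: Optional[Dict[str, Any]], port: int = 20080
) -> Dict[str, Any]:
    """Hypothetical helper: prefer an explicit config, else warn and fall back."""
    if explicit is not None:
        return explicit
    warnings.warn(
        "Automatic KV events configuration is deprecated; "
        "pass --kv-events-config explicitly.",
        FutureWarning,
        stacklevel=2,  # attribute the warning to the caller, not this helper
    )
    return {
        "enable_kv_cache_events": True,
        "publisher": "zmq",
        "endpoint": f"tcp://*:{port}",
    }


if __name__ == "__main__":
    # Record warnings instead of printing them, e.g. in a unit test.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        config = resolve_kv_events_config(None)
    print(config["endpoint"], [w.category.__name__ for w in caught])
```

Passing an explicit dictionary skips the warning entirely, which mirrors the migration path the deprecation message recommends.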
@@ -163,7 +163,7 @@ docker compose -f deploy/docker-compose.yml up -d
 > [!NOTE]
 > - **etcd** is optional but is the default local discovery backend. You can also use `--discovery-backend file` to use file system based discovery.
-> - **NATS** is optional - only needed if using KV routing with events (default). You can disable it with `--no-kv-events` flag for prediction-based routing
+> - **NATS** is optional - only needed if using KV routing with events. Workers must be explicitly configured with `--kv-events-config` to publish events. Use `--no-router-kv-events` on the frontend for prediction-based routing without events
 > - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD)
 > [!TIP]
......
@@ -72,7 +72,7 @@ docker compose -f deploy/docker-compose.yml up -d
 > [!NOTE]
 > - **etcd** is optional but is the default local discovery backend. You can also use `--discovery-backend file` to use file system based discovery.
-> - **NATS** is optional - only needed if using KV routing with events (default). You can disable it with `--no-kv-events` flag for prediction-based routing
+> - **NATS** is optional - only needed if using KV routing with events. Workers must be explicitly configured to publish events. Use `--no-router-kv-events` on the frontend for prediction-based routing without events
 > - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD)
 ### Build container
......
@@ -67,7 +67,7 @@ docker compose -f deploy/docker-compose.yml up -d
 > [!NOTE]
 > - **etcd** is optional but is the default local discovery backend. You can also use `--discovery-backend file` to use file system based discovery.
-> - **NATS** is optional - only needed if using KV routing with events (default). You can disable it with `--no-kv-events` flag for prediction-based routing
+> - **NATS** is optional - only needed if using KV routing with events. For vLLM, KV events are currently enabled by default when prefix caching is active (**deprecated** — use `--kv-events-config` explicitly). Use `--no-router-kv-events` on the frontend for prediction-based routing without events
 > - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD)
 ### Pull or build container
......
@@ -191,12 +191,12 @@ The main KV-aware routing arguments (frontend uses the same `--router-*` flag na
 >
 > When `--router-kv-overlap-score-weight` is set to 0, no KVIndexer is created and prefix matching is disabled (pure load balancing). When `--no-router-kv-events` is set, a KVIndexer is still created but no event subscriber is launched to consume KV events from workers. Instead, the router predicts cache state based on its own routing decisions with TTL-based expiration and pruning.
 >
-> **Backend Configuration:** When using `--no-router-kv-events`, configure your backend workers to disable KV event publishing:
-> - **vLLM**: Use `--kv-events-config '{"enable_kv_cache_events": false}'`
+> **Backend Configuration:** When using `--no-router-kv-events`, no additional backend flags are needed — SGLang and TRT-LLM disable KV events by default. For vLLM, KV events are currently enabled by default when prefix caching is active (deprecated — will change in a future release). Use `--kv-events-config` explicitly to control behavior:
+> - **vLLM**: Use `--kv-events-config '{"enable_kv_cache_events": false}'` to disable, or omit (auto-enabled, deprecated)
 > - **SGLang**: Do not use `--kv-events-config`
 > - **TRT-LLM**: Do not use `--publish-events-and-metrics`
 >
-> The cli args `--router-ttl-secs`, `--router-max-tree-size`, and `--router-prune-target-ratio` control local cache management when the router operates without receiving events from workers. When KV events are enabled (default), the router relies on worker-side eviction events and these parameters are ignored.
+> The cli args `--router-ttl-secs`, `--router-max-tree-size`, and `--router-prune-target-ratio` control local cache management when the router operates without receiving events from workers. When workers are configured to publish KV events (via `--kv-events-config`), the router relies on worker-side eviction events and these parameters are ignored.
 >
 > **Queue threshold vs. busy rejection thresholds:** `--router-queue-threshold` and the busy thresholds (`--active-decode-blocks-threshold`, `--active-prefill-tokens-threshold`, `--active-prefill-tokens-threshold-frac`) serve different purposes. The busy thresholds **reject** a worker entirely from the candidate set when it exceeds a utilization limit — no traffic is sent until it drops below the threshold. In contrast, `--router-queue-threshold` does not reject workers; it **defers the entire routing decision** until at least one worker has capacity, so the request is routed with the freshest load metrics. The queue also enables priority scheduling via `nvext.agent_hints.latency_sensitivity`.
......
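The prediction-based mode described in this hunk (cache state inferred from the router's own routing decisions, with TTL-based expiration and pruning) can be illustrated with a toy model. This is a sketch under stated assumptions, not Dynamo's KVIndexer: the class `PredictedCache` and its methods are hypothetical, and the three constructor parameters only loosely mirror `--router-ttl-secs`, `--router-max-tree-size`, and `--router-prune-target-ratio`, whose exact semantics are not specified in this diff. Here the prune ratio is assumed to be the fraction of the maximum size to shrink down to.

```python
import time
from collections import OrderedDict
from typing import Iterable, Optional, Tuple


class PredictedCache:
    """Toy stand-in for router-side cache prediction (not Dynamo's KVIndexer)."""

    def __init__(self, ttl_secs: float, max_size: int, prune_target_ratio: float) -> None:
        self.ttl_secs = ttl_secs
        self.max_size = max_size
        self.prune_target_ratio = prune_target_ratio
        # (worker_id, block_hash) -> time the block was last routed, oldest first
        self._entries: "OrderedDict[Tuple[int, int], float]" = OrderedDict()

    def record_route(self, worker_id: int, block_hashes: Iterable[int],
                     now: Optional[float] = None) -> None:
        """Assume the chosen worker now caches every block of the routed request."""
        now = time.monotonic() if now is None else now
        for block_hash in block_hashes:
            key = (worker_id, block_hash)
            self._entries.pop(key, None)  # re-insert so the entry moves to the newest end
            self._entries[key] = now
        if len(self._entries) > self.max_size:
            self._prune()

    def _prune(self) -> None:
        """Drop the oldest entries until the size reaches the target fraction."""
        target = int(self.max_size * self.prune_target_ratio)
        while len(self._entries) > target:
            self._entries.popitem(last=False)

    def overlap(self, worker_id: int, block_hashes: Iterable[int],
                now: Optional[float] = None) -> int:
        """Count leading blocks predicted to be cached and not yet expired."""
        now = time.monotonic() if now is None else now
        count = 0
        for block_hash in block_hashes:
            stamp = self._entries.get((worker_id, block_hash))
            if stamp is None or now - stamp > self.ttl_secs:
                break
            count += 1
        return count
```

Because no eviction events arrive from workers in this mode, the TTL and the size cap are the only mechanisms that keep the prediction from drifting indefinitely, which is consistent with the note that these parameters are ignored when workers publish KV events.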
@@ -73,8 +73,9 @@ Example setup:
 export NATS_SERVER=nats://nats-server:4222
 export DYN_EVENT_PLANE=nats
-# Start workers -- they publish events to NATS automatically
-python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B
+# Start workers -- explicitly enable KV event publishing
+python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B \
+    --kv-events-config '{"publisher":"nats","topic":"kv-events","enable_kv_cache_events":true}'
 # Start frontend -- it subscribes to events from NATS automatically
 python3 -m dynamo.frontend --router-mode kv
@@ -94,7 +95,8 @@ Example setup:
 export DYN_EVENT_PLANE=zmq
 # Start workers -- each binds a ZMQ socket, registers with discovery
-python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B
+python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B \
+    --kv-events-config '{"publisher":"zmq","endpoint":"tcp://*:20080","enable_kv_cache_events":true}'
 # Start frontend -- discovers workers and connects directly
 python3 -m dynamo.frontend --router-mode kv
@@ -105,10 +107,10 @@ python3 -m dynamo.frontend --router-mode kv
 If you do not need KV-aware routing, you can disable the event plane entirely:
 ```bash
-python3 -m dynamo.frontend --router-mode kv --no-kv-events
+python3 -m dynamo.frontend --router-mode kv --no-router-kv-events
 ```
-With `--no-kv-events`:
+With `--no-router-kv-events`:
 - The router falls back to prediction-based cache-aware routing (estimates cache state from routing decisions).
 - No NATS server or ZMQ sockets are needed.
......
@@ -33,7 +33,7 @@ Dynamo has **two independent communication planes**:
 - **Request plane** (**`DYN_REQUEST_PLANE`**): how **RPC requests** flow between components (frontend → router → worker), via `tcp`, `http`, or `nats`.
 - **KV event plane** (currently only **NATS** is supported): how **KV cache events** (and optional router replica sync) are distributed/persisted for KV-aware routing.
-**Note:** If you are using `tcp` or `http` request plane with KV events enabled (default), NATS is automatically initialized. You can optionally configure `NATS_SERVER` environment variable (e.g., `NATS_SERVER=nats://nats-hostname:port`) to specify a custom NATS server; otherwise, it defaults to `localhost:4222`. To completely disable NATS, use `--no-kv-events` on the frontend.
+**Note:** If you are using `tcp` or `http` request plane with KV events enabled on the router (the default router-side setting), NATS is automatically initialized. SGLang requires explicit `--kv-events-config` and TRT-LLM requires `--publish-events-and-metrics` to publish events. For vLLM, KV events are currently auto-configured when prefix caching is active (deprecated — use `--kv-events-config` explicitly to prepare for a future release where all backends will default to off). You can optionally configure `NATS_SERVER` environment variable (e.g., `NATS_SERVER=nats://nats-hostname:port`) to specify a custom NATS server; otherwise, it defaults to `localhost:4222`. To disable the router's KV event listener, use `--no-router-kv-events` on the frontend.
 Because they are independent, you can mix them.
@@ -89,7 +89,7 @@ DYN_REQUEST_PLANE=tcp python -m dynamo.vllm --model Qwen/Qwen3-0.6B
 **When to use TCP:**
 - Simple deployments with direct service-to-service communication (e.g. frontend to backend)
-- Minimal infrastructure requirements (NATS is initialized by default for KV events but can be disabled with `--no-kv-events`)
+- Minimal infrastructure requirements (NATS is initialized when the router listens for KV events; disable with `--no-router-kv-events`)
 - Low-latency requirements
 **TCP Configuration Options:**
@@ -161,7 +161,7 @@ DYN_REQUEST_PLANE=nats python -m dynamo.vllm --model Qwen/Qwen3-0.6B
 **When to use NATS:**
 - Production deployments with service discovery
-- KV-aware routing with accurate cache state tracking (requires NATS for event transport). Note: approximate mode (`--no-kv-events`) provides KV routing without NATS but with reduced accuracy.
+- KV-aware routing with accurate cache state tracking (requires NATS for event transport). Note: approximate mode (`--no-router-kv-events`) provides KV routing without NATS but with reduced accuracy.
 - Need for message replay and persistence features
 Limitations:
@@ -290,6 +290,6 @@ curl http://localhost:8000/v1/chat/completions \
 ### Resource Usage
-- **TCP**: Minimal infrastructure (NATS required only if using KV events, can disable with `--no-kv-events`)
-- **HTTP**: Minimal infrastructure (NATS required only if using KV events, can disable with `--no-kv-events`)
+- **TCP**: Minimal infrastructure (NATS required only if using KV events, disable router-side with `--no-router-kv-events`)
+- **HTTP**: Minimal infrastructure (NATS required only if using KV events, disable router-side with `--no-router-kv-events`)
 - **NATS**: Requires running NATS server (additional memory/CPU)
@@ -140,6 +140,10 @@ For dependency-free local development, disable KV event publishing (avoids NATS)
 is expected and can be safely ignored.
 </Note>
+<Note>
+**Deprecation notice:** vLLM automatically enables KV event publishing when prefix caching is active. In a future release, this will change — KV events will be disabled by default for all backends. Start using `--kv-events-config` explicitly to prepare.
+</Note>
 ## Test Your Deployment
```bash ```bash
......
@@ -228,7 +228,7 @@ Common Vars for Routing Configuration:
 - Set `DYN_ENFORCE_DISAGG=true` if you want to enforce every request being served in the disaggregated manner. By default it is false meaning if the the prefill worker is not available the request will be served in the aggregated manner.
 - Set `DYN_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes. (default: 1)
 - Set `DYN_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
-- Set `DYN_USE_KV_EVENTS=false` if you want to disable the workers sending KV events while using kv-routing (default: true)
+- Set `DYN_USE_KV_EVENTS=false` if you want to disable the router listening for KV events while using kv-routing (default: true). SGLang workers require `--kv-events-config` and TRT-LLM workers require `--publish-events-and-metrics` to publish KV events. For vLLM, KV events are auto-configured when prefix caching is active (deprecated — use `--kv-events-config` explicitly)
 - `DYN_ROUTER_TEMPERATURE` — Temperature for worker sampling via softmax (default: 0.0)
 - `DYN_ROUTER_REPLICA_SYNC` — Enable replica synchronization (default: false)
 - `DYN_ROUTER_TRACK_ACTIVE_BLOCKS` — Track active blocks (default: true)
......
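The `DYN_ROUTER_TEMPERATURE` behavior described in this hunk (deterministic top pick at low temperature, more exploration at higher temperature, sampling via softmax) can be sketched as follows. The functions `softmax_probs` and `pick_worker` are hypothetical, and the sketch assumes a higher score means a better worker; the router's actual scoring convention is not shown in this diff.

```python
import math
import random
from typing import List, Sequence


def softmax_probs(scores: Sequence[float], temperature: float) -> List[float]:
    """Turn per-worker scores into selection probabilities (hypothetical helper)."""
    if temperature <= 0.0:
        # Temperature 0 is treated as a deterministic pick of the best score.
        best = max(range(len(scores)), key=scores.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(scores))]
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max before exponentiating for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_worker(scores: Sequence[float], temperature: float) -> int:
    """Sample a worker index according to the softmax probabilities."""
    probs = softmax_probs(scores, temperature)
    return random.choices(range(len(scores)), weights=probs, k=1)[0]


if __name__ == "__main__":
    scores = [1.0, 3.0, 2.0]
    for temperature in (0.0, 1.0, 100.0):
        print(temperature, [round(p, 3) for p in softmax_probs(scores, temperature)])
```

As the temperature grows, the scaled scores converge and the distribution approaches uniform, which is the exploration effect the documentation describes.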
@@ -73,8 +73,8 @@ cd examples/backends/vllm/launch
 ./disagg.sh
 ```
-**Note:** the example vLLM `disagg.sh` sets additional per-worker port environment variables (e.g., `DYN_VLLM_KV_EVENT_PORT`,
-`VLLM_NIXL_SIDE_CHANNEL_PORT`) to avoid ZMQ "Address already in use" conflicts when multiple workers run on the same host. If you run the components manually, make sure you mirror those port settings.
+**Note:** the example vLLM `disagg.sh` sets per-worker `--kv-events-config` with unique ZMQ endpoints and unique
+`VLLM_NIXL_SIDE_CHANNEL_PORT` values to avoid "Address already in use" conflicts when multiple workers run on the same host. If you run the components manually, make sure you mirror those settings.
 ```bash
 #!/bin/bash
@@ -100,13 +100,13 @@ DYN_SYSTEM_PORT=8081 CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm \
 # Run prefill worker, make sure to wait for start up
 export OTEL_SERVICE_NAME=dynamo-worker-prefill
 DYN_SYSTEM_PORT=8082 \
-DYN_VLLM_KV_EVENT_PORT=20081 \
 VLLM_NIXL_SIDE_CHANNEL_PORT=20097 \
 CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
     --model Qwen/Qwen3-0.6B \
     --enforce-eager \
     --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT" \
-    --is-prefill-worker &
+    --is-prefill-worker \
+    --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}' &
 ```
 For disaggregated deployments, this separates prefill and decode onto different GPUs for better resource utilization.
......
@@ -34,3 +34,5 @@ spec:
 args:
   - --model
   - Qwen/Qwen3-0.6B
+  - --kv-events-config
+  - '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20080","enable_kv_cache_events":true}'
@@ -54,3 +54,5 @@ spec:
   - --model
   - Qwen/Qwen3-0.6B
   - --is-prefill-worker
+  - --kv-events-config
+  - '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20080","enable_kv_cache_events":true}'
@@ -13,14 +13,14 @@ python -m dynamo.frontend --router-mode kv &
 # Chose Qwen3-30B because its a small MOE that can fit on smaller GPUs (L40S for example)
 # --enforce-eager is added for quick deployment. for production use, need to remove this flag
 for i in {0..3}; do
-    DYN_VLLM_KV_EVENT_PORT=$((20080 + i)) \
     VLLM_NIXL_SIDE_CHANNEL_PORT=$((20096 + i)) \
     CUDA_VISIBLE_DEVICES=$i python3 -m dynamo.vllm \
         --model Qwen/Qwen3-30B-A3B \
         --data-parallel-rank $i \
         --data-parallel-size 4 \
         --enable-expert-parallel \
-        --enforce-eager &
+        --enforce-eager \
+        --kv-events-config "{\"publisher\":\"zmq\",\"topic\":\"kv-events\",\"endpoint\":\"tcp://*:$((20080 + i))\",\"enable_kv_cache_events\":true}" &
 done
 echo "All workers starting. (press Ctrl+C to stop)..."
......
@@ -13,9 +13,9 @@ python -m dynamo.frontend &
 CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager --is-decode-worker &
 DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT2:-8082} \
-DYN_VLLM_KV_EVENT_PORT=20081 \
 VLLM_NIXL_SIDE_CHANNEL_PORT=20097 \
 CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
     --model Qwen/Qwen3-0.6B \
     --enforce-eager \
-    --is-prefill-worker
+    --is-prefill-worker \
+    --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}'
@@ -14,7 +14,6 @@ CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --connecto
 # run prefill worker on GPU 1 with KVBM enabled using 20GB of CPU cache
 # NOTE: remove --enforce-eager for production use
-DYN_VLLM_KV_EVENT_PORT=20081 \
 VLLM_NIXL_SIDE_CHANNEL_PORT=20097 \
 DYN_KVBM_CPU_CACHE_GB=20 \
 CUDA_VISIBLE_DEVICES=1 \
@@ -22,4 +21,5 @@ CUDA_VISIBLE_DEVICES=1 \
     --model Qwen/Qwen3-0.6B \
     --is-prefill-worker \
     --connector kvbm nixl \
-    --enforce-eager
+    --enforce-eager \
+    --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}'
@@ -11,14 +11,12 @@ python -m dynamo.frontend --router-mode kv &
 # run decode workers on GPU 0 and 1, without enabling KVBM
 # NOTE: remove --enforce-eager for production use
 CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --connector nixl --enforce-eager --is-decode-worker &
-DYN_VLLM_KV_EVENT_PORT=20081 \
 VLLM_NIXL_SIDE_CHANNEL_PORT=20097 \
 CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --connector nixl --enforce-eager --is-decode-worker &
 # run prefill workers on GPU 2 and 3 with KVBM enabled using 20GB of CPU cache
 # NOTE: use different barrier id prefixes for each prefill worker to avoid conflicts
 # NOTE: remove --enforce-eager for production use
-DYN_VLLM_KV_EVENT_PORT=20082 \
 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 \
 DYN_KVBM_CPU_CACHE_GB=20 \
 CUDA_VISIBLE_DEVICES=2 \
@@ -26,9 +24,9 @@ CUDA_VISIBLE_DEVICES=2 \
     --model Qwen/Qwen3-0.6B \
     --is-prefill-worker \
     --connector kvbm nixl \
-    --enforce-eager &
+    --enforce-eager \
+    --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20082","enable_kv_cache_events":true}' &
-DYN_VLLM_KV_EVENT_PORT=20083 \
 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 \
 DYN_KVBM_LEADER_ZMQ_PUB_PORT=56003 \
 DYN_KVBM_LEADER_ZMQ_ACK_PORT=56004 \
@@ -38,4 +36,5 @@ CUDA_VISIBLE_DEVICES=3 \
     --model Qwen/Qwen3-0.6B \
     --is-prefill-worker \
     --connector kvbm nixl \
-    --enforce-eager
+    --enforce-eager \
+    --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20083","enable_kv_cache_events":true}'
@@ -15,10 +15,10 @@ CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B &
 sleep 20
 # run prefill worker on GPU 1 with LMCache
-DYN_VLLM_KV_EVENT_PORT=20081 \
 VLLM_NIXL_SIDE_CHANNEL_PORT=20097 \
 CUDA_VISIBLE_DEVICES=1 \
 python3 -m dynamo.vllm \
     --model Qwen/Qwen3-0.6B \
     --is-prefill-worker \
-    --connector lmcache nixl
+    --connector lmcache nixl \
+    --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}'
@@ -70,12 +70,12 @@ sleep 10
 # run prefill worker with metrics on port 8082 (foreground)
 DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT2:-8082} \
-DYN_VLLM_KV_EVENT_PORT=20081 \
 VLLM_NIXL_SIDE_CHANNEL_PORT=20097 \
 CUDA_VISIBLE_DEVICES=0 \
 python3 -m dynamo.vllm \
     --model Qwen/Qwen3-0.6B \
     --enforce-eager \
     --is-prefill-worker \
-    --gpu-memory-utilization ${GPU_MEM_FRACTION}
+    --gpu-memory-utilization ${GPU_MEM_FRACTION} \
+    --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}'