Commit 21f135f5 authored by Dmitry Tokarev, committed by GitHub

feat: add experimental DeepSeek-V4-Flash + V4-Pro vLLM agg recipes (#… (#8719)


Signed-off-by: Dmitry Tokarev <dtokarev@nvidia.com>
Co-authored-by: Biswa Panda <biswa.panda@gmail.com>
parent 6b0ae0d6
@@ -69,6 +69,8 @@ These recipes are under active development and may require additional setup step
|-------|-----------|------|------|------------|-------|
| **[GLM-5-NVFP4](glm-5-nvfp4/sglang/disagg/)** | SGLang | Disagg Prefill/Decode | 20x GB200 | ✅ | NVFP4, EAGLE speculative decoding, TP16 decode + TP4 prefill. Requires [custom container build](glm-5-nvfp4/). |
| **[nvidia/Kimi-K2.5-NVFP4](kimi-k2.5/trtllm/agg/nvidia/)** | TensorRT-LLM | Aggregated | 8x B200 | ✅ | Text only — MoE model, TP8×EP8, reasoning + tool calling. Vision input not yet functional. |
| **[DeepSeek-V4-Flash](deepseek-v4-flash/vllm/agg/)** | vLLM | Aggregated | 4x B200 | ✅ | Text only — MoE model (284B / 13B active), DP=4 + EP, FP8 KV cache, reasoning + tool calling. Requires [custom container build](deepseek-v4-flash/container/). |
| **[DeepSeek-V4-Pro](deepseek-v4-pro/vllm/agg/)** | vLLM | Aggregated | 8x B200 | ✅ | Text only — MoE model (1.6T / 49B active, 1M context), TP=8 + EP, FP4+FP8 mixed checkpoint, FP8 KV cache, CSA+HCA attention, three reasoning effort modes, tool calling. Requires [custom container build](deepseek-v4-pro/container/). |
## Recipe Structure
......
# DeepSeek-V4-Flash Recipe
Aggregated-serving recipe for **DeepSeek-V4-Flash** on vLLM with Dynamo.
| Variant | Model | Status | Modality | Manifest | GPUs |
|---------|-------|--------|----------|----------|------|
| **vllm-agg** | `deepseek-ai/DeepSeek-V4-Flash` | Experimental | Text only | [`vllm/agg/vllm-dgd.yaml`](vllm/agg/vllm-dgd.yaml) | 4x B200 |
Aggregated, single-replica: 1 decode pod running DP=4 + Expert Parallel on 4 B200 GPUs (TP=1). Tested on 4 of 8 GPUs per B200 node.
## Prerequisites
1. **Dynamo Platform installed** — see the [Kubernetes Deployment Guide](../../docs/kubernetes/README.md).
2. **GPU cluster** with at least 4 B200 GPUs available on one node.
3. **HuggingFace token** with access to `deepseek-ai/DeepSeek-V4-Flash`.
4. **Dynamo + vLLM image with the DeepSeek-V4 stack.** DeepSeek-V4-Flash is not in a stock vLLM release yet. It is built in two steps:
   1. Build the Dynamo vLLM runtime image locally per [`<repo_root>/container/README.md`](../../container/README.md) (this produces the local tag `dynamo:latest-vllm-runtime`).
   2. Build the DeepSeek-V4-Flash overlay on top of it using [`container/Dockerfile.dsv4`](container/Dockerfile.dsv4). See [`container/README.md`](container/README.md) for build args and troubleshooting. From the repo root:

      ```bash
      docker build -f recipes/deepseek-v4-flash/container/Dockerfile.dsv4 \
        -t <your-registry>/vllm-dsv4:<tag> .
      ```

   Then set the `image:` fields in `vllm/agg/vllm-dgd.yaml` (both the frontend and decode workers) to `<your-registry>/vllm-dsv4:<tag>`.
## Quick Start
```bash
export NAMESPACE=dynamo-demo
kubectl create namespace ${NAMESPACE}
# HuggingFace token secret (consumed by the download Job and, as a convenience, by the worker)
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="your-token-here" \
-n ${NAMESPACE}
# Download model into the model-cache PVC.
# Edit model-cache/model-cache.yaml and set storageClassName to a RWX class in your cluster.
# The PVC requests 400Gi; DeepSeek-V4-Flash is ~160 GB on disk (46 safetensors shards,
# FP4+FP8 mixed) and typically takes 30-60 min to download on first apply.
kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}
kubectl apply -f model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=7200s
# Update the `image:` fields in vllm/agg/vllm-dgd.yaml to your Dynamo + vLLM build.
# Deploy
kubectl apply -f vllm/agg/vllm-dgd.yaml -n ${NAMESPACE}
# First launch of the decode worker takes up to ~60 minutes (weight load +
# FlashInfer autotune + cudagraph warmup). The startup probe is sized for this.
kubectl wait --for=condition=Ready pod \
-l nvidia.com/dynamo-graph-deployment-name=dsv4-flash-agg \
-n ${NAMESPACE} --timeout=3600s
```
## Test the Deployment
```bash
kubectl port-forward svc/dsv4-flash-agg-frontend 8000:8000 -n ${NAMESPACE}
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Flash",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}'
```
## Recipe Details
The worker command lives in `vllm/agg/vllm-dgd.yaml`. Key flags and why they're there:
| Flag | Purpose |
|------|---------|
| `--tokenizer-mode deepseek_v4` | Selects the DeepSeek-V4 tokenizer |
| `--dyn-reasoning-parser deepseek_v4` | Extracts chain-of-thought into `message.reasoning_content` |
| `--dyn-tool-call-parser deepseek_v4` | Emits OpenAI-compatible structured `tool_calls` |
| `--attention-config '{"use_fp4_indexer_cache":true}'` | Blackwell FP4 indexer cache for CSA+HCA attention |
| `--kv-cache-dtype fp8` + `--block-size 256` | FP8 KV cache; block size matches the upstream recipe |
| `--tensor-parallel-size 1 --data-parallel-size 4 --enable-expert-parallel` | DP=4 + EP across the 4 GPUs (TP=1) |
| `--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'` | Single-node DEP compilation config from the upstream recipe |
| `--max-num-seqs 256` | Concurrency cap |
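
As a sanity check on the parallelism flags, the number of GPUs a worker occupies is the product of the tensor-parallel and data-parallel sizes. The sketch below illustrates that arithmetic; the helper name `gpus_required` is hypothetical and not part of vLLM or Dynamo.

```python
def gpus_required(tensor_parallel_size: int, data_parallel_size: int = 1) -> int:
    """Return the GPUs one worker occupies: one GPU per (DP rank, TP rank) pair."""
    if tensor_parallel_size < 1 or data_parallel_size < 1:
        raise ValueError("parallel sizes must be positive integers")
    return tensor_parallel_size * data_parallel_size


# Flags from this recipe: --tensor-parallel-size 1 --data-parallel-size 4
flash_gpus = gpus_required(tensor_parallel_size=1, data_parallel_size=4)
print(flash_gpus)  # 4, which should match the `gpu: "4"` request in the manifest
```

If the flags are changed, the `resources.limits.gpu` value in the manifest likely needs to change with them.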
## Model Details
| | |
|---|---|
| **Model** | `deepseek-ai/DeepSeek-V4-Flash` (MoE, 284B total / 13B active) |
| **Checkpoint** | Mixed FP4 (expert weights) + FP8 (attention, norm, router) |
| **Backend** | vLLM with the DeepSeek-V4 stack (`vllm/vllm-openai:deepseekv4-cu130`) |
| **Parallelism** | TP=1, DP=4, Expert Parallel enabled |
| **KV cache** | FP8, block size 256 |
| **Attention** | Hybrid CSA + HCA with Blackwell FP4 indexer cache |
## Verifying Reasoning
```bash
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Flash",
"messages": [{"role": "user", "content": "What is 2+2? Answer briefly."}],
"max_tokens": 200
}' | python3 -m json.tool
```
Expected:
- `choices[0].message.reasoning_content` contains the model's chain-of-thought.
- `choices[0].message.content` contains only the final answer.
- No raw `</think>` tags in either field.
If `reasoning_content` is `null` and `</think>` appears in `content`, the reasoning parser isn't wired up — confirm `--dyn-reasoning-parser deepseek_v4` is on the worker command.
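
The three expectations above can also be checked programmatically. This is a minimal sketch, assuming the response has already been parsed into a Python dict; the function name `check_reasoning` and both sample responses are hypothetical and only illustrate the shape described above.

```python
def check_reasoning(response: dict) -> list[str]:
    """Return a list of problems found in a chat-completion response; empty means OK."""
    message = response["choices"][0]["message"]
    reasoning = message.get("reasoning_content")
    content = message.get("content") or ""
    problems = []
    if not reasoning:
        problems.append("reasoning_content is missing or empty")
    # Raw think tags in either field suggest the reasoning parser is not active.
    for name, text in (("content", content), ("reasoning_content", reasoning or "")):
        if "<think>" in text or "</think>" in text:
            problems.append(f"raw think tag found in {name}")
    return problems


# Hypothetical response shapes, for illustration only.
good = {"choices": [{"message": {"reasoning_content": "2 plus 2 is 4.", "content": "4"}}]}
bad = {"choices": [{"message": {"reasoning_content": None, "content": "2 plus 2 is 4.</think>4"}}]}

print(check_reasoning(good))
print(check_reasoning(bad))
```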
## Verifying Tool Calling
```bash
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Flash",
"messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"}
},
"required": ["location"]
}
}
}],
"max_tokens": 300
}' | python3 -m json.tool
```
Expected:
- `choices[0].message.tool_calls` is a structured array with `function.name`, `function.arguments`, and `id`.
- `choices[0].finish_reason` is `"tool_calls"`.
- `choices[0].message.reasoning_content` may contain the model's reasoning about tool selection.
If `tool_calls` is missing and raw tool-call markers appear in `content`, confirm `--dyn-tool-call-parser deepseek_v4` is on the worker command.
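
A similar check can be scripted for tool calling. The sketch below assumes the response is already a Python dict and that `function.arguments` is a JSON-encoded string, as in the OpenAI chat-completions format; `check_tool_call` and the sample payloads are hypothetical.

```python
import json


def check_tool_call(response: dict) -> list[str]:
    """Return a list of problems with the tool-call fields; empty means OK."""
    choice = response["choices"][0]
    tool_calls = choice["message"].get("tool_calls") or []
    problems = []
    if not tool_calls:
        problems.append("tool_calls is missing or empty")
    if choice.get("finish_reason") != "tool_calls":
        problems.append(f"finish_reason is {choice.get('finish_reason')!r}")
    for call in tool_calls:
        function = call.get("function", {})
        if not call.get("id"):
            problems.append("tool call has no id")
        if not function.get("name"):
            problems.append("tool call has no function.name")
        try:
            json.loads(function.get("arguments", ""))
        except (TypeError, json.JSONDecodeError):
            problems.append("function.arguments is not valid JSON")
    return problems


# Hypothetical responses, for illustration only.
with_tool_call = {"choices": [{
    "finish_reason": "tool_calls",
    "message": {"tool_calls": [{
        "id": "call-0",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"location": "San Francisco"}'},
    }]},
}]}
without_tool_call = {"choices": [{"finish_reason": "stop", "message": {"content": "raw markers"}}]}

print(check_tool_call(with_tool_call))
print(check_tool_call(without_tool_call))
```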
## Notes
- **Storage class.** Update `storageClassName` in `model-cache/model-cache.yaml` to a RWX class that can serve the PVC to frontend and worker pods.
- **Model size.** `deepseek-ai/DeepSeek-V4-Flash` is ~160 GB on disk (46 safetensors shards in FP4+FP8 mixed form). The 400Gi PVC leaves headroom for HF cache metadata and one alternate revision.
- **Image tag.** The manifest ships with `nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag`. Replace with your built Dynamo + vLLM (DeepSeek-V4) image — see Prerequisite 4.
- **First launch is slow.** The decode worker loads weights and warms CUDA graphs; the startup probe allows up to ~60 min (`failureThreshold: 360` at `periodSeconds: 10`) and `VLLM_ENGINE_READY_TIMEOUT_S=3600` is set to match.
- **Parser flags.** Use the Dynamo variants on the worker (`--dyn-reasoning-parser`, `--dyn-tool-call-parser`). vLLM's native `--reasoning-parser` / `--tool-call-parser` are engine-side and do not feed the Dynamo OpenAI renderer.
- **DP stability.** `VLLM_RANDOMIZE_DP_DUMMY_INPUTS=1` and `VLLM_SKIP_P2P_CHECK=1` mirror the DeepSeek-R1 vLLM recipe and stabilize DP dummy inputs on Blackwell.
- **Offline model cache.** The worker runs with `HF_HUB_OFFLINE=1` so vLLM reads the cached weights from the PVC and never contacts the HF Hub at startup. The HF token secret is mounted defensively; it isn't required at runtime once the download Job has completed.
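
The startup-probe sizing in the notes above is simple arithmetic: the probe tolerates roughly `periodSeconds * failureThreshold` seconds of failed checks before the container is restarted. A minimal sketch, assuming no `initialDelaySeconds` is set (the manifest does not set one); the helper name is hypothetical.

```python
def startup_budget_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Approximate time a startup probe allows before Kubernetes restarts the container."""
    return period_seconds * failure_threshold


budget = startup_budget_seconds(period_seconds=10, failure_threshold=360)
engine_ready_timeout = 3600  # VLLM_ENGINE_READY_TIMEOUT_S from the manifest

print(budget, budget == engine_ready_timeout)  # 3600 True
```

If one of the two values is raised, the other probably should be raised with it, so that neither cuts the first launch short.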
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Dynamo vLLM runtime overlaid on the official DeepSeek-V4 vLLM image.
#
# Base: vllm/vllm-openai:deepseekv4-cu130 — ships vLLM from PR #40760
# (zyongye/vllm:dsv4) with the DeepSeek-V4 kernels, tokenizer_mode, tool+reasoning
# parsers, hybrid CSA+HCA attention, MTP speculative decoding, and FP4 indexer.
#
# We take pre-built dynamo artifacts (wheels, nats, etcd, NIXL, UCX, dynamo.vllm
# python worker) from a locally-built Dynamo vLLM runtime image (produced via
# <repo_root>/container/README.md) and layer them on top of the dsv4 vLLM image
# without touching the vLLM install.
#
# Build (run from the repo root):
# docker build -f recipes/deepseek-v4-flash/container/Dockerfile.dsv4 \
# -t <your-registry>/vllm-dsv4:<tag> .
#
# See recipes/deepseek-v4-flash/container/README.md for build args and
# troubleshooting.
#
# Both base images must be Python 3.12 (verified).
# Default to the local tag produced by `container/render.py --framework vllm
# --target runtime` + `docker build -t dynamo:latest-vllm-runtime ...`. Override
# with --build-arg DYNAMO_SRC_IMAGE=... to use a published release tag instead.
ARG DYNAMO_SRC_IMAGE=dynamo:latest-vllm-runtime
ARG DSV4_BASE_IMAGE=vllm/vllm-openai:deepseekv4-cu130
FROM ${DYNAMO_SRC_IMAGE} AS dynamo_src
FROM ${DSV4_BASE_IMAGE}
ENV DEBIAN_FRONTEND=noninteractive
# Runtime deps dynamo needs that aren't in the vLLM image (etcd/nats are static
# binaries we COPY; libibverbs/rdma-core are needed for NIXL's UCX transport).
RUN apt-get update && apt-get install -y --no-install-recommends \
libibverbs1 rdma-core ibverbs-utils libibumad3 \
libnuma1 librdmacm1 ibverbs-providers \
ca-certificates jq curl \
&& apt list --upgradable 2>/dev/null | tail -n +2 | grep 'jammy-' | awk -F/ '{print $1}' | xargs -r apt-get install -y --only-upgrade \
&& rm -rf /var/lib/apt/lists/*
# --- patch vLLM: drop unsupported topk=1024 from sparse attn indexer ---
# from https://github.com/vllm-project/vllm/pull/40760/changes/3602f14f0e146b234be911d916e381b4e6a4dc0c
# TODO: remove once https://github.com/vllm-project/vllm/pull/40760 lands in the base image.
RUN sed -i 's/(512, 1024, 2048)/(512, 2048)/' \
/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/sparse_attn_indexer.py
# --- static binaries ---
COPY --from=dynamo_src /usr/bin/nats-server /usr/bin/nats-server
COPY --from=dynamo_src /usr/local/bin/etcd /usr/local/bin/etcd
ENV PATH=/usr/local/bin/etcd:${PATH}
# --- UCX ---
COPY --from=dynamo_src /usr/local/ucx /usr/local/ucx
ENV PATH=/usr/local/ucx/bin:${PATH}
# --- NIXL (C++ libs for KV transfer) ---
COPY --from=dynamo_src /opt/nvidia/nvda_nixl /opt/nvidia/nvda_nixl
ENV NIXL_PREFIX=/opt/nvidia/nvda_nixl \
NIXL_LIB_DIR=/opt/nvidia/nvda_nixl/lib64 \
NIXL_PLUGIN_DIR=/opt/nvidia/nvda_nixl/lib64/plugins
ENV LD_LIBRARY_PATH=${NIXL_LIB_DIR}:${NIXL_PLUGIN_DIR}:/usr/local/ucx/lib:/usr/local/ucx/lib/ucx:${LD_LIBRARY_PATH}
# --- install dynamo python wheels into the dsv4 image's system python ---
# The dsv4 image uses system python3.12 with pip at /usr/local/lib/python3.12/dist-packages.
# ai_dynamo_runtime is abi3 (cp310+), compatible with cp312.
COPY --from=dynamo_src /opt/dynamo/wheelhouse /opt/dynamo/wheelhouse
RUN pip install --no-cache-dir \
/opt/dynamo/wheelhouse/ai_dynamo_runtime*.whl \
/opt/dynamo/wheelhouse/ai_dynamo*any.whl \
/opt/dynamo/wheelhouse/nixl/nixl*.whl
# --- dynamo python source (dynamo.vllm worker + common + mocker) ---
# Bring the worker entrypoint tree so `python -m dynamo.vllm` resolves.
COPY --from=dynamo_src /workspace/components/src/dynamo /workspace/components/src/dynamo
ENV PYTHONPATH=/workspace/components/src:${PYTHONPATH:-}
WORKDIR /workspace
# --- dynamo runtime env tweaks ---
# Keep vLLM's flashinfer sampler (enabled by default in 0.20+ but explicit here).
ENV VLLM_USE_FLASHINFER_SAMPLER=1
# Default to bash so the Dynamo CRD operator can exec `python3 -m dynamo.vllm`
# via the manifest command/args rather than the vLLM api_server entrypoint.
ENTRYPOINT []
CMD ["bash"]
# DeepSeek-V4-Flash Reference Container
DeepSeek-V4-Flash is not in a stock vLLM release yet, so the recipe ships with its own reference Dockerfile that overlays the Dynamo runtime on top of the upstream dsv4 vLLM image.
- **Base:** [`vllm/vllm-openai:deepseekv4-cu130`](https://hub.docker.com/r/vllm/vllm-openai/tags) — vLLM from PR [#40760](https://github.com/vllm-project/vllm/pull/40760) (`zyongye/vllm:dsv4`) with the DeepSeek-V4 kernels, `tokenizer_mode`, tool + reasoning parsers, hybrid CSA + HCA attention, MTP speculative decoding, and the FP4 indexer.
- **Overlay:** pre-built Dynamo artifacts (wheels, static `nats`/`etcd` binaries, NIXL, UCX, the `dynamo.vllm` Python worker) copied from a locally-built Dynamo vLLM runtime image.
Both layers use Python 3.12; no vLLM reinstall is performed.
## Build flow
Two Docker images are involved:
1. **Dynamo vLLM runtime** — built from this repo using the instructions in [`<repo_root>/container/README.md`](../../../container/README.md). This image contains the Dynamo Rust runtime, wheels, and the `dynamo.vllm` worker.
2. **DeepSeek-V4-Flash overlay** — built here, using the image from step 1 as the source stage (`DYNAMO_SRC_IMAGE`) and the upstream dsv4 vLLM image as the final base (`DSV4_BASE_IMAGE`).
## Step 1 — Build the Dynamo vLLM runtime
From the **repo root**, render and build the runtime image per [`container/README.md`](../../../container/README.md):
```bash
# From <repo_root>
container/render.py --framework vllm --target runtime --output-short-filename
docker build -t dynamo:latest-vllm-runtime -f container/rendered.Dockerfile .
```
This produces the local tag `dynamo:latest-vllm-runtime`, which is what Step 2 expects by default.
## Step 2 — Build the DeepSeek-V4-Flash overlay
Still from the **repo root**:
```bash
docker build \
-f recipes/deepseek-v4-flash/container/Dockerfile.dsv4 \
-t <your-registry>/vllm-dsv4:<tag> \
.
```
The Dockerfile takes no files from the build context (everything comes from `FROM` / `COPY --from=`), so any context directory works — using the repo root keeps the `-f` path straightforward.
### Build args
Both can be overridden with `--build-arg`:
| Arg | Default | Purpose |
|-----|---------|---------|
| `DYNAMO_SRC_IMAGE` | `dynamo:latest-vllm-runtime` | Source image for the Dynamo overlay. The default matches the tag produced by Step 1. Override with a pinned released tag (e.g. `nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.2`) for reproducible builds without rebuilding locally. |
| `DSV4_BASE_IMAGE` | `vllm/vllm-openai:deepseekv4-cu130` | The dsv4 vLLM base. The `deepseekv4-cu129` tag is also available for CUDA 12.9 hosts. |
Example — pin the overlay source to a released Dynamo tag on a CUDA 12.9 host:
```bash
docker build \
-f recipes/deepseek-v4-flash/container/Dockerfile.dsv4 \
--build-arg DYNAMO_SRC_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.2-cuda13 \
--build-arg DSV4_BASE_IMAGE=vllm/vllm-openai:deepseekv4-cu129 \
-t <your-registry>/vllm-dsv4:<tag> \
.
```
## Push
```bash
docker push <your-registry>/vllm-dsv4:<tag>
```
## Wire into the recipe
Once the image is pushed, update the `image:` fields in
[`../vllm/agg/vllm-dgd.yaml`](../vllm/agg/vllm-dgd.yaml) (both the Frontend and the `VllmDecodeWorker`) to point at `<your-registry>/vllm-dsv4:<tag>`, then follow the recipe's [Quick Start](../README.md#quick-start) to deploy.
## What the Dockerfile does
1. Installs the RDMA / UCX runtime deps on top of the dsv4 vLLM image (`libibverbs1`, `rdma-core`, `ibverbs-utils`, `libibumad3`, `libnuma1`, `librdmacm1`, `ibverbs-providers`, plus `ca-certificates`, `jq`, `curl`).
2. Applies a small upstream vLLM patch to the sparse attention indexer (drops the unsupported `topk=1024`). Remove once [vLLM PR #40760](https://github.com/vllm-project/vllm/pull/40760) lands in the base image.
3. Copies the static `nats-server` and `etcd` binaries from the Dynamo runtime image.
4. Copies UCX into `/usr/local/ucx` and NIXL into `/opt/nvidia/nvda_nixl`, with `LD_LIBRARY_PATH` set so NIXL's plugins resolve at runtime.
5. Installs the Dynamo Python wheels (`ai_dynamo_runtime`, `ai_dynamo`, NIXL Python bindings) into the dsv4 image's system Python 3.12.
6. Copies the `dynamo` Python package tree into `/workspace/components/src/dynamo` and puts it on `PYTHONPATH` so `python3 -m dynamo.vllm` resolves.
7. Keeps vLLM's FlashInfer sampler enabled (`VLLM_USE_FLASHINFER_SAMPLER=1`) and clears `ENTRYPOINT` so the Dynamo CRD operator's `command` / `args` take effect.
## Troubleshooting
- **`pull access denied for dynamo:latest-vllm-runtime`** — Step 1 has not been run (or produced a different tag). Build the Dynamo vLLM runtime image locally per [`<repo_root>/container/README.md`](../../../container/README.md), or override `--build-arg DYNAMO_SRC_IMAGE=<your-image>`.
- **`no matching manifest for linux/arm64`** — the dsv4 base is amd64-only today; build on an x86_64 host.
- **CUDA version mismatch on the host** — use `DSV4_BASE_IMAGE=vllm/vllm-openai:deepseekv4-cu129` if your node is still on CUDA 12.9.
- **NIXL plugins not found at runtime** — confirm `LD_LIBRARY_PATH` includes `/opt/nvidia/nvda_nixl/lib64/plugins` (set in the Dockerfile; don't unset it in the pod spec).
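
For the last item, a quick way to confirm the library path inside a running pod is to compare `LD_LIBRARY_PATH` against the directories the Dockerfile adds. This is a minimal sketch: the directory list is copied from the Dockerfile's `ENV` lines, and the helper name `missing_library_dirs` is hypothetical.

```python
import os

# Directories the Dockerfile places on LD_LIBRARY_PATH for NIXL and UCX.
REQUIRED_DIRS = [
    "/opt/nvidia/nvda_nixl/lib64",
    "/opt/nvidia/nvda_nixl/lib64/plugins",
    "/usr/local/ucx/lib",
]


def missing_library_dirs(ld_library_path: str, required=REQUIRED_DIRS) -> list[str]:
    """Return the required directories that are absent from an LD_LIBRARY_PATH string."""
    entries = {entry.rstrip("/") for entry in ld_library_path.split(":") if entry}
    return [directory for directory in required if directory not in entries]


if __name__ == "__main__":
    # An empty list means every required directory is present.
    print(missing_library_dirs(os.environ.get("LD_LIBRARY_PATH", "")))
```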
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 400Gi
  storageClassName: "your-storage-class-name"
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
spec:
  backoffLimit: 3
  completions: 1
  parallelism: 1
  template:
    metadata:
      labels:
        app: model-download
    spec:
      restartPolicy: Never
      containers:
        - name: model-download
          image: python:3.10-slim
          command: ["sh", "-c"]
          envFrom:
            - secretRef:
                name: hf-token-secret
          env:
            - name: MODEL_NAME
              value: deepseek-ai/DeepSeek-V4-Flash
            - name: HF_HOME
              value: /model-store
            - name: HF_XET_HIGH_PERFORMANCE
              value: "1"
          args:
            - |
              set -eux
              pip install --no-cache-dir huggingface_hub==1.11.0
              hf download $MODEL_NAME
          volumeMounts:
            - name: model-cache
              mountPath: /model-store
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# DynamoGraphDeployment for deepseek-ai/DeepSeek-V4-Flash on vLLM,
# aggregated serving (no prefill/decode disaggregation).
#
# Upstream vLLM recipe:
# https://github.com/vllm-project/recipes/blob/main/models/deepseek-ai/DeepSeek-V4-Flash.yaml
#
# Shape: 1 replica x 4 B200 GPUs, DP=4 + Expert Parallel, TP=1.
# Tested on 4 of 8 GPUs per B200 node.
#
# Image: replace the `:my-tag` placeholder with a Dynamo + vLLM image that
# includes the DeepSeek-V4 stack. See `../../container/README.md`
# for the reference build -- it overlays dynamo on
# vllm/vllm-openai:deepseekv4-cu130.
#
# Weights: served from the `model-cache` PVC populated by
# `../../model-cache/model-download.yaml`.
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: dsv4-flash-agg
spec:
  backendFramework: vllm
  pvcs:
    - name: model-cache
      create: false
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
          workingDir: /workspace/examples/backends/vllm
          env:
            - name: HF_HOME
              value: /opt/models
            - name: HF_HUB_OFFLINE
              value: "1"
    VllmDecodeWorker:
      componentType: worker
      subComponentType: decode
      envFromSecret: hf-token-secret
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      sharedMemory:
        size: 200Gi
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
          workingDir: /workspace/examples/backends/vllm
          # Up to ~60 min for first launch: weight load + FlashInfer autotune +
          # cudagraph warmup. periodSeconds * failureThreshold = 10 * 360 = 3600s.
          startupProbe:
            httpGet:
              path: /health
              port: 9090
            periodSeconds: 10
            timeoutSeconds: 10
            failureThreshold: 360
          env:
            - name: SERVED_MODEL_NAME
              value: deepseek-ai/DeepSeek-V4-Flash
            - name: MODEL_PATH
              value: deepseek-ai/DeepSeek-V4-Flash
            - name: HF_HOME
              value: /opt/models
            # Read weights from the PVC only; do not hit the HF Hub at startup.
            - name: HF_HUB_OFFLINE
              value: "1"
            # Give the engine room to finish first-launch init.
            - name: VLLM_ENGINE_READY_TIMEOUT_S
              value: "3600"
            # Stabilize DP dummy inputs (matches the DeepSeek-R1 vLLM recipe).
            - name: VLLM_RANDOMIZE_DP_DUMMY_INPUTS
              value: "1"
            - name: VLLM_SKIP_P2P_CHECK
              value: "1"
            - name: NCCL_CUMEM_ENABLE
              value: "1"
          command:
            - /bin/sh
            - -c
          args:
            - |
              python3 -m dynamo.vllm \
                --model "${MODEL_PATH}" \
                --served-model-name "${SERVED_MODEL_NAME}" \
                --trust-remote-code \
                --kv-cache-dtype fp8 \
                --block-size 256 \
                --tensor-parallel-size 1 \
                --data-parallel-size 4 \
                --enable-expert-parallel \
                --tokenizer-mode deepseek_v4 \
                --dyn-reasoning-parser deepseek_v4 \
                --dyn-tool-call-parser deepseek_v4 \
                --attention-config '{"use_fp4_indexer_cache":true}' \
                --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
                --max-num-seqs 256
      replicas: 1
      resources:
        limits:
          gpu: "4"
        requests:
          gpu: "4"
# DeepSeek-V4-Pro Recipe
Aggregated-serving recipe for **DeepSeek-V4-Pro** on vLLM with Dynamo.
| Variant | Model | Status | Modality | Manifest | GPUs |
|---------|-------|--------|----------|----------|------|
| **vllm-agg** | `deepseek-ai/DeepSeek-V4-Pro` | Experimental | Text only | [`vllm/agg/vllm-dgd.yaml`](vllm/agg/vllm-dgd.yaml) | 8x B200 |
Aggregated, single-replica: 1 decode pod running TP=8 + Expert Parallel on all 8 GPUs of one node.
## Prerequisites
1. **Dynamo Platform installed** — see the [Kubernetes Deployment Guide](../../docs/kubernetes/README.md).
2. **GPU cluster** with at least 8 B200 GPUs available on one node (TP=8 fills an 8-GPU box).
3. **HuggingFace token** with access to `deepseek-ai/DeepSeek-V4-Pro`.
4. **Dynamo + vLLM image with the DeepSeek-V4 stack.** DeepSeek-V4-Pro is not in a stock vLLM release yet. It is built in two steps:
   1. Build the Dynamo vLLM runtime image locally per [`<repo_root>/container/README.md`](../../container/README.md) (this produces the local tag `dynamo:latest-vllm-runtime`).
   2. Build the DeepSeek-V4-Pro overlay on top of it using [`container/Dockerfile.dsv4`](container/Dockerfile.dsv4). See [`container/README.md`](container/README.md) for build args and troubleshooting. From the repo root:

      ```bash
      docker build -f recipes/deepseek-v4-pro/container/Dockerfile.dsv4 \
        -t <your-registry>/vllm-dsv4:<tag> .
      ```

   Then set the `image:` fields in `vllm/agg/vllm-dgd.yaml` (both the frontend and decode workers) to `<your-registry>/vllm-dsv4:<tag>`.
> The Pro and Flash recipes share the same dsv4 image. If you've already built it for [deepseek-v4-flash](../deepseek-v4-flash/), reuse the tag here — model selection happens at runtime via `--model`.
## Quick Start
```bash
export NAMESPACE=dynamo-demo
kubectl create namespace ${NAMESPACE}
# HuggingFace token secret (consumed by the download Job and, as a convenience, by the worker)
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="your-token-here" \
-n ${NAMESPACE}
# Download model into the model-cache PVC.
# Edit model-cache/model-cache.yaml and set storageClassName to a RWX class in your cluster.
# The PVC requests 1500Gi; DeepSeek-V4-Pro is ~865 GB on disk (64 safetensors shards,
# FP4+FP8 mixed) and typically takes 1.5-3 hours to download on first apply.
kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}
kubectl apply -f model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=14400s
# Update the `image:` fields in vllm/agg/vllm-dgd.yaml to your Dynamo + vLLM build.
# Deploy
kubectl apply -f vllm/agg/vllm-dgd.yaml -n ${NAMESPACE}
# First launch of the decode worker takes up to ~90 minutes (TP=8 weight load +
# FlashInfer autotune + cudagraph warmup). The startup probe is sized for this.
kubectl wait --for=condition=Ready pod \
-l nvidia.com/dynamo-graph-deployment-name=dsv4-pro-agg \
-n ${NAMESPACE} --timeout=5400s
```
## Test the Deployment
```bash
kubectl port-forward svc/dsv4-pro-agg-frontend 8000:8000 -n ${NAMESPACE}
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Pro",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}'
```
## Recipe Details
The worker command lives in `vllm/agg/vllm-dgd.yaml`. Key flags and why they're there:
| Flag | Purpose |
|------|---------|
| `--tokenizer-mode deepseek_v4` | Selects the DeepSeek-V4 tokenizer |
| `--dyn-reasoning-parser deepseek_v4` | Extracts chain-of-thought into `message.reasoning_content` |
| `--dyn-tool-call-parser deepseek_v4` | Emits OpenAI-compatible structured `tool_calls` |
| `--attention-config '{"use_fp4_indexer_cache":true}'` | Blackwell FP4 indexer cache for CSA+HCA attention |
| `--kv-cache-dtype fp8` + `--block-size 256` | FP8 KV cache; block size matches the upstream recipe |
| `--tensor-parallel-size 8 --enable-expert-parallel` | TP=8 across 8 GPUs of one node, with EP enabled for the MoE experts |
| `--compilation-config '{"mode":0,"cudagraph_mode":"FULL_DECODE_ONLY"}'` | Conservative cudagraph mode appropriate for the larger Pro model (matches upstream V4-Pro example) |
| `--max-num-seqs 256` | Concurrency cap |
### Why TP=8 (not DP=4 like Flash)?
DeepSeek-V4-Pro is roughly 5.4x the size of Flash on disk (~865 GB vs. ~160 GB). Even with FP4+FP8 mixed weights, it does not fit in 4 ranks at typical batch shapes, so the upstream tested shape for Pro is **TP=8 across all 8 GPUs of one node**. Expert Parallel is still enabled on top of TP: TP shards the dense (attention/router/norm) weights, while EP shards the experts.
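
A rough back-of-the-envelope version of this argument divides the checkpoint size by the number of ranks. The sketch below is an idealization under stated assumptions: it treats the weights as split evenly across ranks (dense weights are actually replicated across DP ranks), ignores KV cache and activation memory, and uses an assumed per-GPU memory of 180 GB that should be replaced with the real figure for your hardware.

```python
ASSUMED_GPU_MEMORY_GB = 180.0  # assumption, not taken from the recipe; adjust for your GPUs
PRO_CHECKPOINT_GB = 865.0      # on-disk size quoted in this README


def weights_per_rank_gb(checkpoint_gb: float, ranks: int) -> float:
    """Idealized per-rank weight footprint if the checkpoint were split evenly."""
    if ranks < 1:
        raise ValueError("ranks must be positive")
    return checkpoint_gb / ranks


for ranks in (4, 8):
    share = weights_per_rank_gb(PRO_CHECKPOINT_GB, ranks)
    fits = share < ASSUMED_GPU_MEMORY_GB
    print(f"{ranks} ranks: ~{share:.0f} GB of weights per rank, fits={fits}")
```

Under these assumptions the weights alone exceed a single GPU at 4 ranks and leave room for KV cache at 8, which is consistent with the upstream choice of TP=8.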
## Model Details
Sourced from the [`deepseek-ai/DeepSeek-V4-Pro` model card](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro) (preview release):
| | |
|---|---|
| **Model** | `deepseek-ai/DeepSeek-V4-Pro` (MoE, 1.6T total / 49B active per token) |
| **Context length** | 1M tokens |
| **Checkpoint** | Mixed precision — MoE expert weights in FP4; most other parameters in FP8 |
| **Attention** | Hybrid Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA). Recipe enables the Blackwell FP4 indexer cache via `--attention-config '{"use_fp4_indexer_cache":true}'` |
| **Residual path** | Manifold-Constrained Hyper-Connections (mHC) |
| **Reasoning modes** | Three effort levels exposed via `chat_template_kwargs`: `{}` (Non-think), `{"thinking":true,"reasoning_effort":"high"}` (Think High), `{"thinking":true,"reasoning_effort":"max"}` (Think Max — needs `--max-model-len >= 393216`) |
| **Long-context efficiency** | Per the model card, ~27% of the per-token inference FLOPs and ~10% of the KV cache vs. DeepSeek-V3.2 at 1M context |
| **License** | MIT |
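
The three reasoning modes in the table can be expressed as request payloads. This is a minimal sketch, assuming the frontend accepts a top-level `chat_template_kwargs` field in the chat-completions body (as vLLM's OpenAI-compatible server does); whether the Dynamo frontend forwards it unchanged is an assumption here. The mode labels and the `build_request` helper are hypothetical.

```python
import json

# chat_template_kwargs values taken from the table above; the dict keys are labels only.
REASONING_MODES = {
    "non-think": {},
    "think-high": {"thinking": True, "reasoning_effort": "high"},
    "think-max": {"thinking": True, "reasoning_effort": "max"},
}


def build_request(prompt: str, mode: str, max_tokens: int = 200) -> dict:
    """Build a chat-completions payload for one of the three reasoning modes."""
    return {
        "model": "deepseek-ai/DeepSeek-V4-Pro",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "chat_template_kwargs": REASONING_MODES[mode],
    }


print(json.dumps(build_request("What is 2+2? Answer briefly.", "think-high"), indent=2))
```

The printed JSON can be passed to `curl -d` against `/v1/chat/completions`. Think Max additionally needs the worker started with a sufficiently large `--max-model-len`, per the table.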
Recipe-level (not model-card) settings in this deployment:
| | |
|---|---|
| **Backend** | vLLM with the DeepSeek-V4 stack (`vllm/vllm-openai:deepseekv4-cu130`) |
| **Parallelism** | TP=8, Expert Parallel enabled |
| **KV cache** | FP8, block size 256 |
## Verifying Reasoning
```bash
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Pro",
"messages": [{"role": "user", "content": "What is 2+2? Answer briefly."}],
"max_tokens": 200
}' | python3 -m json.tool
```
Expected:
- `choices[0].message.reasoning_content` contains the model's chain-of-thought.
- `choices[0].message.content` contains only the final answer.
- No raw `</think>` tags in either field.
If `reasoning_content` is `null` and `</think>` appears in `content`, the reasoning parser isn't wired up — confirm `--dyn-reasoning-parser deepseek_v4` is on the worker command.
## Verifying Tool Calling
```bash
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Pro",
"messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"}
},
"required": ["location"]
}
}
}],
"max_tokens": 300
}' | python3 -m json.tool
```
Expected:
- `choices[0].message.tool_calls` is a structured array with `function.name`, `function.arguments`, and `id`.
- `choices[0].finish_reason` is `"tool_calls"`.
- `choices[0].message.reasoning_content` may contain the model's reasoning about tool selection.
If `tool_calls` is missing and raw tool-call markers appear in `content`, confirm `--dyn-tool-call-parser deepseek_v4` is on the worker command.
## Notes
- **Storage class.** Update `storageClassName` in `model-cache/model-cache.yaml` to a RWX class that can serve the PVC to frontend and worker pods.
- **Model size.** `deepseek-ai/DeepSeek-V4-Pro` is ~865 GB on disk (64 safetensors shards in FP4+FP8 mixed form). The 1500Gi PVC is roughly 1.7x the checkpoint size, which leaves headroom for HF cache metadata and one alternate revision.
- **Image tag.** The manifest ships with `nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag`. Replace with your built Dynamo + vLLM (DeepSeek-V4) image — see Prerequisite 4.
- **First launch is slow.** The decode worker loads weights across 8 TP ranks and warms CUDA graphs; the startup probe allows up to ~90 min (`failureThreshold: 540` at `periodSeconds: 10`) and `VLLM_ENGINE_READY_TIMEOUT_S=5400` is set to match.
- **Parser flags.** Use the Dynamo variants on the worker (`--dyn-reasoning-parser`, `--dyn-tool-call-parser`). vLLM's native `--reasoning-parser` / `--tool-call-parser` are engine-side and do not feed the Dynamo OpenAI renderer.
- **Offline model cache.** The worker runs with `HF_HUB_OFFLINE=1` so vLLM reads the cached weights from the PVC and never contacts the HF Hub at startup. The HF token secret is mounted defensively; it isn't required at runtime once the download Job has completed.
- **Sibling recipe.** [DeepSeek-V4-Flash](../deepseek-v4-flash/) is the smaller sibling (284B / 13B active, DP=4 + EP on 4 B200 GPUs) and uses the same dsv4 container image.
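The startup budget quoted in the "First launch is slow" note is plain arithmetic, and the probe and engine timeout are only useful if they stay in sync. A minimal sketch of that check, with the values copied from the manifest; the function name is illustrative, and the result is approximate because it ignores any initial delay the kubelet applies.

```python
def startup_budget_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Approximate time Kubernetes waits before the startup probe gives up."""
    return period_seconds * failure_threshold


# Values from the VllmDecodeWorker startupProbe and env in vllm-dgd.yaml.
PERIOD_SECONDS = 10
FAILURE_THRESHOLD = 540
ENGINE_READY_TIMEOUT_S = 5400

budget = startup_budget_seconds(PERIOD_SECONDS, FAILURE_THRESHOLD)
print(budget, budget / 60)  # 5400 90.0

# If one value is raised, raise the other so neither cuts startup short.
assert budget == ENGINE_READY_TIMEOUT_S
```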
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Dynamo vLLM runtime overlaid on the official DeepSeek-V4 vLLM image.
# Shared image for all DeepSeek-V4 recipes (Flash, Pro, ...).
#
# Base: vllm/vllm-openai:deepseekv4-cu130 — ships vLLM from PR #40760
# (zyongye/vllm:dsv4) with the DeepSeek-V4 kernels, tokenizer_mode, tool+reasoning
# parsers, hybrid CSA+HCA attention, MTP speculative decoding, and FP4 indexer.
#
# We take pre-built dynamo artifacts (wheels, nats, etcd, NIXL, UCX, dynamo.vllm
# python worker) from a locally-built Dynamo vLLM runtime image (produced via
# <repo_root>/container/README.md) and layer them on top of the dsv4 vLLM image
# without touching the vLLM install.
#
# Build (run from the repo root):
# docker build -f recipes/deepseek-v4-pro/container/Dockerfile.dsv4 \
# -t <your-registry>/vllm-dsv4:<tag> .
#
# See recipes/deepseek-v4-pro/container/README.md for build args and
# troubleshooting.
#
# Both base images must be Python 3.12 (verified).
# Default to the local tag produced by `container/render.py --framework vllm
# --target runtime` + `docker build -t dynamo:latest-vllm-runtime ...`. Override
# with --build-arg DYNAMO_SRC_IMAGE=... to use a published release tag instead.
ARG DYNAMO_SRC_IMAGE=dynamo:latest-vllm-runtime
ARG DSV4_BASE_IMAGE=vllm/vllm-openai:deepseekv4-cu130
FROM ${DYNAMO_SRC_IMAGE} AS dynamo_src
FROM ${DSV4_BASE_IMAGE}
ENV DEBIAN_FRONTEND=noninteractive
# Runtime deps dynamo needs that aren't in the vLLM image (etcd/nats are static
# binaries we COPY; libibverbs/rdma-core are needed for NIXL's UCX transport).
RUN apt-get update && apt-get install -y --no-install-recommends \
libibverbs1 rdma-core ibverbs-utils libibumad3 \
libnuma1 librdmacm1 ibverbs-providers \
ca-certificates jq curl \
&& apt list --upgradable 2>/dev/null | tail -n +2 | grep 'jammy-' | awk -F/ '{print $1}' | xargs -r apt-get install -y --only-upgrade \
&& rm -rf /var/lib/apt/lists/*
# --- patch vLLM: drop unsupported topk=1024 from sparse attn indexer ---
# from https://github.com/vllm-project/vllm/pull/40760/changes/3602f14f0e146b234be911d916e381b4e6a4dc0c
# TODO: remove once https://github.com/vllm-project/vllm/pull/40760 lands in the base image.
RUN sed -i 's/(512, 1024, 2048)/(512, 2048)/' \
/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/sparse_attn_indexer.py
# --- static binaries ---
COPY --from=dynamo_src /usr/bin/nats-server /usr/bin/nats-server
COPY --from=dynamo_src /usr/local/bin/etcd /usr/local/bin/etcd
ENV PATH=/usr/local/bin/etcd:${PATH}
# --- UCX ---
COPY --from=dynamo_src /usr/local/ucx /usr/local/ucx
ENV PATH=/usr/local/ucx/bin:${PATH}
# --- NIXL (C++ libs for KV transfer) ---
COPY --from=dynamo_src /opt/nvidia/nvda_nixl /opt/nvidia/nvda_nixl
ENV NIXL_PREFIX=/opt/nvidia/nvda_nixl \
NIXL_LIB_DIR=/opt/nvidia/nvda_nixl/lib64 \
NIXL_PLUGIN_DIR=/opt/nvidia/nvda_nixl/lib64/plugins
ENV LD_LIBRARY_PATH=${NIXL_LIB_DIR}:${NIXL_PLUGIN_DIR}:/usr/local/ucx/lib:/usr/local/ucx/lib/ucx:${LD_LIBRARY_PATH}
# --- install dynamo python wheels into the dsv4 image's system python ---
# The dsv4 image uses system python3.12 with pip at /usr/local/lib/python3.12/dist-packages.
# ai_dynamo_runtime is abi3 (cp310+), compatible with cp312.
COPY --from=dynamo_src /opt/dynamo/wheelhouse /opt/dynamo/wheelhouse
RUN pip install --no-cache-dir \
/opt/dynamo/wheelhouse/ai_dynamo_runtime*.whl \
/opt/dynamo/wheelhouse/ai_dynamo*any.whl \
/opt/dynamo/wheelhouse/nixl/nixl*.whl
# --- dynamo python source (dynamo.vllm worker + common + mocker) ---
# Bring the worker entrypoint tree so `python -m dynamo.vllm` resolves.
COPY --from=dynamo_src /workspace/components/src/dynamo /workspace/components/src/dynamo
ENV PYTHONPATH=/workspace/components/src:${PYTHONPATH:-}
WORKDIR /workspace
# --- dynamo runtime env tweaks ---
# Keep vLLM's flashinfer sampler (enabled by default in 0.20+ but explicit here).
ENV VLLM_USE_FLASHINFER_SAMPLER=1
# Default to bash so the Dynamo CRD operator can exec `python3 -m dynamo.vllm`
# via the manifest command/args rather than the vLLM api_server entrypoint.
ENTRYPOINT []
CMD ["bash"]
# DeepSeek-V4-Pro Reference Container
DeepSeek-V4-Pro is not in a stock vLLM release yet, so the recipe ships with its own reference Dockerfile that overlays the Dynamo runtime on top of the upstream dsv4 vLLM image. The image is the same one the V4-Flash recipe uses — DeepSeek-V4-Flash and DeepSeek-V4-Pro share the same vLLM dsv4 stack — but is duplicated here so each recipe is self-contained.
- **Base:** [`vllm/vllm-openai:deepseekv4-cu130`](https://hub.docker.com/r/vllm/vllm-openai/tags) — vLLM from PR [#40760](https://github.com/vllm-project/vllm/pull/40760) (`zyongye/vllm:dsv4`) with the DeepSeek-V4 kernels, `tokenizer_mode`, tool + reasoning parsers, hybrid CSA + HCA attention, MTP speculative decoding, and the FP4 indexer.
- **Overlay:** pre-built Dynamo artifacts (wheels, static `nats`/`etcd` binaries, NIXL, UCX, the `dynamo.vllm` Python worker) copied from a locally-built Dynamo vLLM runtime image.
Both layers use Python 3.12; no vLLM reinstall is performed.
## Build flow
Two Docker images are involved:
1. **Dynamo vLLM runtime** — built from this repo using the instructions in [`<repo_root>/container/README.md`](../../../container/README.md). This image contains the Dynamo Rust runtime, wheels, and the `dynamo.vllm` worker.
2. **DeepSeek-V4-Pro overlay** — built here, using the image from step 1 as the source stage (`DYNAMO_SRC_IMAGE`) and the upstream dsv4 vLLM image as the final base (`DSV4_BASE_IMAGE`).
## Step 1 — Build the Dynamo vLLM runtime
From the **repo root**, render and build the runtime image per [`container/README.md`](../../../container/README.md):
```bash
# From <repo_root>
container/render.py --framework vllm --target runtime --output-short-filename
docker build -t dynamo:latest-vllm-runtime -f container/rendered.Dockerfile .
```
This produces the local tag `dynamo:latest-vllm-runtime`, which is what Step 2 expects by default.
## Step 2 — Build the DeepSeek-V4-Pro overlay
Still from the **repo root**:
```bash
docker build \
-f recipes/deepseek-v4-pro/container/Dockerfile.dsv4 \
-t <your-registry>/vllm-dsv4:<tag> \
.
```
The Dockerfile takes no files from the build context (everything comes from `FROM` / `COPY --from=`), so any context directory works — using the repo root keeps the `-f` path straightforward.
> If you have already built the dsv4 overlay for the V4-Flash recipe, you can reuse the same image tag here — there is nothing model-specific in the container. The model is selected at runtime via `--model`.
### Build args
Both can be overridden with `--build-arg`:
| Arg | Default | Purpose |
|-----|---------|---------|
| `DYNAMO_SRC_IMAGE` | `dynamo:latest-vllm-runtime` | Source image for the Dynamo overlay. The default matches the tag produced by Step 1. Override with a pinned released tag (e.g. `nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.2`) for reproducible builds without rebuilding locally. |
| `DSV4_BASE_IMAGE` | `vllm/vllm-openai:deepseekv4-cu130` | The dsv4 vLLM base. Use `vllm/vllm-openai:deepseekv4-cu129` on CUDA 12.9 hosts. |
Example — pin the overlay source to a released Dynamo tag on a CUDA 12.9 host:
```bash
docker build \
-f recipes/deepseek-v4-pro/container/Dockerfile.dsv4 \
--build-arg DYNAMO_SRC_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.2 \
--build-arg DSV4_BASE_IMAGE=vllm/vllm-openai:deepseekv4-cu129 \
-t <your-registry>/vllm-dsv4:<tag> \
.
```
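When the build runs from a script or CI job, it can help to assemble the command as an argument list so that overrides stay optional. This is a rough sketch only: the function name `dsv4_build_command` and the registry in the usage line are hypothetical, while the Dockerfile path and build-arg names are the ones documented above.

```python
import shlex

DOCKERFILE = "recipes/deepseek-v4-pro/container/Dockerfile.dsv4"


def dsv4_build_command(tag, dynamo_src_image=None, dsv4_base_image=None):
    """Build the `docker build` argv; unset overrides fall back to Dockerfile defaults."""
    cmd = ["docker", "build", "-f", DOCKERFILE]
    if dynamo_src_image:
        cmd += ["--build-arg", f"DYNAMO_SRC_IMAGE={dynamo_src_image}"]
    if dsv4_base_image:
        cmd += ["--build-arg", f"DSV4_BASE_IMAGE={dsv4_base_image}"]
    # The trailing "." is the build context; run from the repo root.
    cmd += ["-t", tag, "."]
    return cmd


# Hypothetical registry and tag, overriding only the vLLM base image.
print(shlex.join(dsv4_build_command(
    "registry.example.com/vllm-dsv4:dev",
    dsv4_base_image="vllm/vllm-openai:deepseekv4-cu129",
)))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`.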
## Push
```bash
docker push <your-registry>/vllm-dsv4:<tag>
```
## Wire into the recipe
Once the image is pushed, update the `image:` fields in
[`../vllm/agg/vllm-dgd.yaml`](../vllm/agg/vllm-dgd.yaml) (both the Frontend and the `VllmDecodeWorker`) to point at `<your-registry>/vllm-dsv4:<tag>`, then follow the recipe's [Quick Start](../README.md#quick-start) to deploy.
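Because the same placeholder appears on both services, a plain text substitution is one way to update them together without disturbing the manifest's comments. A minimal sketch, assuming the placeholder string is exactly the one shipped in the manifest; the helper name `set_image`, the sample fragment, and the registry are illustrative.

```python
PLACEHOLDER_IMAGE = "nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag"


def set_image(manifest_text: str, new_image: str) -> tuple[str, int]:
    """Replace every placeholder image reference; return (new_text, count)."""
    count = manifest_text.count(PLACEHOLDER_IMAGE)
    return manifest_text.replace(PLACEHOLDER_IMAGE, new_image), count


# Hypothetical fragment with the two image fields the recipe mentions.
fragment = (
    "Frontend:\n"
    "  image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag\n"
    "VllmDecodeWorker:\n"
    "  image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag\n"
)
updated, replaced = set_image(fragment, "registry.example.com/vllm-dsv4:dev")
print(replaced)  # 2
```

Checking that the count is 2 before writing the file back guards against a manifest whose placeholder has already been edited.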
## What the Dockerfile does
1. Installs the RDMA / UCX runtime deps on top of the dsv4 vLLM image (`libibverbs1`, `rdma-core`, `ibverbs-utils`, `libibumad3`, `libnuma1`, `librdmacm1`, `ibverbs-providers`, plus `ca-certificates`, `jq`, `curl`).
2. Applies a small upstream vLLM patch to the sparse attention indexer (drops the unsupported `topk=1024`). Remove once [vLLM PR #40760](https://github.com/vllm-project/vllm/pull/40760) lands in the base image.
3. Copies the static `nats-server` and `etcd` binaries from the Dynamo runtime image.
4. Copies UCX into `/usr/local/ucx` and NIXL into `/opt/nvidia/nvda_nixl`, with `LD_LIBRARY_PATH` set so NIXL's plugins resolve at runtime.
5. Installs the Dynamo Python wheels (`ai_dynamo_runtime`, `ai_dynamo`, NIXL Python bindings) into the dsv4 image's system Python 3.12.
6. Copies the `dynamo` Python package tree into `/workspace/components/src/dynamo` and puts it on `PYTHONPATH` so `python3 -m dynamo.vllm` resolves.
7. Keeps vLLM's FlashInfer sampler enabled (`VLLM_USE_FLASHINFER_SAMPLER=1`) and clears `ENTRYPOINT` so the Dynamo CRD operator's `command` / `args` take effect.
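One caveat with step 2: `sed -i` exits successfully even when its pattern matches nothing, so a change in the base image could skip the patch silently. The sketch below mirrors the substitution in Python and reports whether it actually changed anything. The function name and the sample source line (including the variable name `SUPPORTED_TOPK`) are hypothetical; only the two tuples come from the Dockerfile.

```python
OLD = "(512, 1024, 2048)"
NEW = "(512, 2048)"


def apply_topk_patch(source: str) -> tuple[str, bool]:
    """Mirror the Dockerfile's sed substitution; return (text, changed)."""
    # sed's s/// without the g flag replaces the first match on each line.
    lines = source.splitlines(keepends=True)
    result = "".join(line.replace(OLD, NEW, 1) for line in lines)
    return result, result != source


# Hypothetical line standing in for the real indexer source.
before = "SUPPORTED_TOPK = (512, 1024, 2048)\n"
after, changed = apply_topk_patch(before)
print(after, changed)

# Applying the patch a second time is a no-op, which is how a missed
# match would look on an already-updated base image.
print(apply_topk_patch(after)[1])  # False
```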
## Troubleshooting
- **`pull access denied for dynamo:latest-vllm-runtime`**: Step 1 has not been run (or produced a different tag). Build the Dynamo vLLM runtime image locally per [`<repo_root>/container/README.md`](../../../container/README.md), or override `--build-arg DYNAMO_SRC_IMAGE=<your-image>`.
- **`no matching manifest for linux/arm64`**: the dsv4 base is amd64-only today; build on an x86_64 host.
- **CUDA version mismatch on the host**: use `DSV4_BASE_IMAGE=vllm/vllm-openai:deepseekv4-cu129` if your node is still on CUDA 12.9.
- **NIXL plugins not found at runtime**: confirm `LD_LIBRARY_PATH` includes `/opt/nvidia/nvda_nixl/lib64/plugins` (set in the Dockerfile; don't unset it in the pod spec).
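For the last item, a short check run inside the worker container can list which of the directories the Dockerfile adds are absent from `LD_LIBRARY_PATH`. This is a sketch under the assumption that the four paths below (taken from the Dockerfile's `ENV LD_LIBRARY_PATH` line) are the ones that matter; the function name is illustrative.

```python
import os

# Directories the Dockerfile prepends to LD_LIBRARY_PATH.
REQUIRED_DIRS = (
    "/opt/nvidia/nvda_nixl/lib64",
    "/opt/nvidia/nvda_nixl/lib64/plugins",
    "/usr/local/ucx/lib",
    "/usr/local/ucx/lib/ucx",
)


def missing_library_dirs(ld_library_path: str, required=REQUIRED_DIRS) -> list[str]:
    """Return the required directories that are not on the given search path."""
    entries = {entry.rstrip("/") for entry in ld_library_path.split(":") if entry}
    return [directory for directory in required if directory not in entries]


# Prints an empty list when nothing has overridden the image's value.
print(missing_library_dirs(os.environ.get("LD_LIBRARY_PATH", "")))
```

A non-empty result usually points at a pod spec that sets `LD_LIBRARY_PATH` outright instead of appending to it.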
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1500Gi
  storageClassName: "your-storage-class-name"
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
spec:
  backoffLimit: 3
  completions: 1
  parallelism: 1
  template:
    metadata:
      labels:
        app: model-download
    spec:
      restartPolicy: Never
      containers:
        - name: model-download
          image: python:3.10-slim
          command: ["sh", "-c"]
          envFrom:
            - secretRef:
                name: hf-token-secret
          env:
            - name: MODEL_NAME
              value: deepseek-ai/DeepSeek-V4-Pro
            - name: HF_HOME
              value: /model-store
            - name: HF_XET_HIGH_PERFORMANCE
              value: "1"
          args:
            - |
              set -eux
              pip install --no-cache-dir huggingface_hub==1.11.0
              hf download $MODEL_NAME
          volumeMounts:
            - name: model-cache
              mountPath: /model-store
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# DynamoGraphDeployment for deepseek-ai/DeepSeek-V4-Pro on vLLM,
# aggregated serving (no prefill/decode disaggregation).
#
# Upstream reference command:
# docker run --gpus all ... vllm/vllm-openai:deepseekv4-cu130 \
# deepseek-ai/DeepSeek-V4-Pro \
# --trust-remote-code --kv-cache-dtype fp8 --block-size 256 \
# --enable-expert-parallel --tensor-parallel-size 8 \
# --compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}' \
# --attention_config.use_fp4_indexer_cache=True \
# --tokenizer-mode deepseek_v4 \
# --tool-call-parser deepseek_v4 --enable-auto-tool-choice \
# --reasoning-parser deepseek_v4
#
# Shape: 1 replica x 8 GPUs, TP=8 + Expert Parallel. Fills a single 8-GPU node.
#
# Image: replace the `:my-tag` placeholder with a Dynamo + vLLM image that
# includes the DeepSeek-V4 stack. See `../../container/README.md`
# for the reference build -- it overlays dynamo on
# vllm/vllm-openai:deepseekv4-cu130.
#
# Weights: served from the `model-cache` PVC populated by
# `../../model-cache/model-download.yaml`.
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: dsv4-pro-agg
spec:
  backendFramework: vllm
  pvcs:
    - name: model-cache
      create: false
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
          workingDir: /workspace/examples/backends/vllm
          env:
            - name: HF_HOME
              value: /opt/models
            - name: HF_HUB_OFFLINE
              value: "1"
    VllmDecodeWorker:
      componentType: worker
      subComponentType: decode
      envFromSecret: hf-token-secret
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      sharedMemory:
        size: 200Gi
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
          workingDir: /workspace/examples/backends/vllm
          # DeepSeek-V4-Pro (1.6T params) is large; first launch loads weights
          # over TP=8 ranks plus FlashInfer autotune + cudagraph warmup. Allow
          # ~90 min: periodSeconds * failureThreshold = 10 * 540 = 5400s.
          startupProbe:
            httpGet:
              path: /health
              port: 9090
            periodSeconds: 10
            timeoutSeconds: 10
            failureThreshold: 540
          env:
            - name: SERVED_MODEL_NAME
              value: deepseek-ai/DeepSeek-V4-Pro
            - name: MODEL_PATH
              value: deepseek-ai/DeepSeek-V4-Pro
            - name: HF_HOME
              value: /opt/models
            # Read weights from the PVC only; do not hit the HF Hub at startup.
            - name: HF_HUB_OFFLINE
              value: "1"
            # Give the engine room to finish first-launch init.
            - name: VLLM_ENGINE_READY_TIMEOUT_S
              value: "5400"
            # Stabilize TP/EP all-reduces and skip the IPC P2P probe.
            - name: VLLM_SKIP_P2P_CHECK
              value: "1"
            - name: NCCL_CUMEM_ENABLE
              value: "1"
          command:
            - /bin/sh
            - -c
          args:
            - |
              python3 -m dynamo.vllm \
                --model "${MODEL_PATH}" \
                --served-model-name "${SERVED_MODEL_NAME}" \
                --trust-remote-code \
                --kv-cache-dtype fp8 \
                --block-size 256 \
                --tensor-parallel-size 8 \
                --enable-expert-parallel \
                --tokenizer-mode deepseek_v4 \
                --dyn-reasoning-parser deepseek_v4 \
                --dyn-tool-call-parser deepseek_v4 \
                --attention-config '{"use_fp4_indexer_cache":true}' \
                --compilation-config '{"mode":0,"cudagraph_mode":"FULL_DECODE_ONLY"}' \
                --max-num-seqs 256
      replicas: 1
      resources:
        limits:
          gpu: "8"
        requests:
          gpu: "8"