Skip to content
GitLab
Menu
Projects
Groups
Snippets
Loading...
Help
Help
Support
Community forum
Keyboard shortcuts
?
Submit feedback
Contribute to GitLab
Sign in
Toggle navigation
Menu
Open sidebar
OpenDAS
dynamo
Commits
3a1561fb
Unverified
Commit
3a1561fb
authored
Mar 23, 2026
by
Alec
Committed by
GitHub
Mar 24, 2026
Browse files
chore: Bump vLLM to 0.18.0 (#7584)
parent
aaa8a567
Changes
8
Show whitespace changes
Inline
Side-by-side
Showing
8 changed files
with
14 additions
and
82 deletions
+14
-82
container/context.yaml
container/context.yaml
+4
-4
container/deps/requirements.test.txt
container/deps/requirements.test.txt
+1
-1
container/deps/vllm/install_vllm.sh
container/deps/vllm/install_vllm.sh
+3
-34
container/deps/vllm/multinode-tp-init-order.patch
container/deps/vllm/multinode-tp-init-order.patch
+0
-37
docs/features/multimodal/embedding-cache.md
docs/features/multimodal/embedding-cache.md
+2
-2
docs/features/multimodal/multimodal-kv-routing.md
docs/features/multimodal/multimodal-kv-routing.md
+2
-2
docs/reference/support-matrix.md
docs/reference/support-matrix.md
+1
-1
pyproject.toml
pyproject.toml
+1
-1
No files found.
container/context.yaml
View file @
3a1561fb
...
@@ -40,13 +40,13 @@ vllm:
...
@@ -40,13 +40,13 @@ vllm:
runtime_image
:
nvcr.io/nvidia/cuda
runtime_image
:
nvcr.io/nvidia/cuda
base_image_tag
:
25.06-cuda12.9-devel-ubuntu24.04
base_image_tag
:
25.06-cuda12.9-devel-ubuntu24.04
runtime_image_tag
:
12.9.1-runtime-ubuntu24.04
runtime_image_tag
:
12.9.1-runtime-ubuntu24.04
vllm_ref
:
v0.1
7.1
vllm_ref
:
v0.1
8.0
cuda13.0
:
cuda13.0
:
base_image
:
nvcr.io/nvidia/cuda-dl-base
base_image
:
nvcr.io/nvidia/cuda-dl-base
runtime_image
:
nvcr.io/nvidia/cuda
runtime_image
:
nvcr.io/nvidia/cuda
base_image_tag
:
25.11-cuda13.0-devel-ubuntu24.04
base_image_tag
:
25.11-cuda13.0-devel-ubuntu24.04
runtime_image_tag
:
13.0.2-runtime-ubuntu24.04
runtime_image_tag
:
13.0.2-runtime-ubuntu24.04
vllm_ref
:
v0.1
7.1
vllm_ref
:
v0.1
8.0
xpu
:
xpu
:
base_image
:
intel/deep-learning-essentials
base_image
:
intel/deep-learning-essentials
runtime_image
:
intel/deep-learning-essentials
runtime_image
:
intel/deep-learning-essentials
...
@@ -59,8 +59,8 @@ vllm:
...
@@ -59,8 +59,8 @@ vllm:
base_image_tag
:
24.04
base_image_tag
:
24.04
runtime_image_tag
:
24.04
runtime_image_tag
:
24.04
vllm_ref
:
v0.16.0
vllm_ref
:
v0.16.0
flashinf_ref
:
v0.6.
4
flashinf_ref
:
v0.6.
6
lmcache_ref
:
0.4.
1
lmcache_ref
:
0.4.
2
vllm_omni_ref
:
"
v0.16.0"
vllm_omni_ref
:
"
v0.16.0"
max_jobs
:
"
10"
max_jobs
:
"
10"
enable_media_ffmpeg
:
"
false"
enable_media_ffmpeg
:
"
false"
...
...
container/deps/requirements.test.txt
View file @
3a1561fb
...
@@ -14,7 +14,7 @@ kr8s==0.20.13
...
@@ -14,7 +14,7 @@ kr8s==0.20.13
kubernetes_asyncio==32.0.0
kubernetes_asyncio==32.0.0
matplotlib==3.10.7
matplotlib==3.10.7
matplotlib-stubs
matplotlib-stubs
mistral-common
=
=1.
9.1
mistral-common
>
=1.
10.0
mypy==1.18.2
mypy==1.18.2
# For NATS object store verification in router tests
# For NATS object store verification in router tests
nats-py==2.12.0
nats-py==2.12.0
...
...
container/deps/vllm/install_vllm.sh
View file @
3a1561fb
...
@@ -12,7 +12,7 @@
...
@@ -12,7 +12,7 @@
set
-euo
pipefail
set
-euo
pipefail
VLLM_VER
=
"0.1
7.1
"
VLLM_VER
=
"0.1
8.0
"
VLLM_REF
=
"v
${
VLLM_VER
}
"
VLLM_REF
=
"v
${
VLLM_VER
}
"
DEVICE
=
"cuda"
DEVICE
=
"cuda"
...
@@ -25,8 +25,8 @@ INSTALLATION_DIR=/tmp
...
@@ -25,8 +25,8 @@ INSTALLATION_DIR=/tmp
TORCH_CUDA_ARCH_LIST
=
"9.0;10.0"
# For EP Kernels -- TODO: check if we need to add 12.0+PTX
TORCH_CUDA_ARCH_LIST
=
"9.0;10.0"
# For EP Kernels -- TODO: check if we need to add 12.0+PTX
DEEPGEMM_REF
=
""
DEEPGEMM_REF
=
""
CUDA_VERSION
=
"12.9"
CUDA_VERSION
=
"12.9"
FLASHINF_REF
=
"v0.6.
4
"
FLASHINF_REF
=
"v0.6.
6
"
LMCACHE_REF
=
"0.4.
1
"
LMCACHE_REF
=
"0.4.
2
"
VLLM_OMNI_REF
=
"v0.16.0"
VLLM_OMNI_REF
=
"v0.16.0"
while
[[
$#
-gt
0
]]
;
do
while
[[
$#
-gt
0
]]
;
do
...
@@ -208,37 +208,6 @@ if [ "$DEVICE" = "cpu" ]; then
...
@@ -208,37 +208,6 @@ if [ "$DEVICE" = "cpu" ]; then
fi
fi
echo
"✓ vLLM installation completed"
echo
"✓ vLLM installation completed"
# Apply hotfix for multi-node TP init ordering (vLLM PR #35892).
# Only applies to vLLM 0.17.1 — fail loudly on any other version so the
# patch + this block get cleaned up when vLLM is bumped.
# Note: In Docker builds the script is copied to /tmp but deps are bind-mounted
# at /tmp/deps, so resolve the patch relative to BASH_SOURCE first, then fall
# back to the bind-mount path.
SCRIPT_DIR
=
"
$(
cd
"
$(
dirname
"
${
BASH_SOURCE
[0]
}
"
)
"
&&
pwd
)
"
VLLM_PATCH
=
"
${
SCRIPT_DIR
}
/multinode-tp-init-order.patch"
if
[
!
-f
"
$VLLM_PATCH
"
]
;
then
VLLM_PATCH
=
"/tmp/deps/vllm/multinode-tp-init-order.patch"
fi
if
[
"
$VLLM_VER
"
=
"0.17.1"
]
;
then
# Patch the cloned repo (used by CPU/XPU source builds that build from source)
echo
"Applying vLLM multi-node TP hotfix to cloned repo..."
git
-C
"
${
INSTALLATION_DIR
}
/vllm"
apply
--ignore-whitespace
"
$VLLM_PATCH
"
# Also patch site-packages if vLLM was installed from a wheel (CUDA builds).
# Skip if the installed location is the clone itself (already patched above).
# Use `uv pip show` instead of importing vllm to avoid pulling in torch/CUDA.
VLLM_SITE
=
$(
uv pip show vllm 2>/dev/null |
grep
-i
'^Location:'
|
awk
'{print $2}'
)
if
[
-n
"
$VLLM_SITE
"
]
&&
[
"
$VLLM_SITE
"
!=
"
${
INSTALLATION_DIR
}
/vllm"
]
;
then
echo
"Applying vLLM multi-node TP hotfix to
${
VLLM_SITE
}
..."
patch
-d
"
$VLLM_SITE
"
-p1
<
"
$VLLM_PATCH
"
fi
echo
"✓ vLLM multi-node TP hotfix applied"
else
echo
"❌ ERROR: vLLM version is
${
VLLM_VER
}
, not 0.17.1."
echo
" The multi-node TP hotfix patch (multinode-tp-init-order.patch) and"
echo
" this block in install_vllm.sh are no longer needed — please remove them."
exit
1
fi
echo
"
\n
=== Installing LMCache from source ==="
echo
"
\n
=== Installing LMCache from source ==="
# LMCache prebuilt wheels are built against PyTorch <=2.8.0 and fail with PyTorch 2.10+
# LMCache prebuilt wheels are built against PyTorch <=2.8.0 and fail with PyTorch 2.10+
# (undefined symbol: c10::cuda::c10_cuda_check_implementation).
# (undefined symbol: c10::cuda::c10_cuda_check_implementation).
...
...
container/deps/vllm/multinode-tp-init-order.patch
deleted
100644 → 0
View file @
aaa8a567
#
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
SPDX-FileCopyrightText: Copyright contributors to the vLLM project
#
SPDX-License-Identifier: Apache-2.0
#
# Hotfix for vLLM 0.17.1: multi-node TP init ordering
#
Upstream fix: https://github.com/vllm-project/vllm/commit/86e1060b
#
Upstream PR: https://github.com/vllm-project/vllm/pull/35892
#
Issue: https://github.com/vllm-project/vllm/issues/36389
#
# WorkerProc.__init__ calls _init_message_queues() before init_device(),
#
but the former needs _INNER_DP_WORLD which the latter creates.
#
Move _init_message_queues() after init_device()+load_model().
#
# Remove this patch once vLLM >= 0.17.2 is adopted.
diff --git a/vllm/v1/executor/multiproc_executor.py b/vllm/v1/executor/multiproc_executor.py
index e3376ba..39a3646 100644
--- a/vllm/v1/executor/multiproc_executor.py
+++ b/vllm/v1/executor/multiproc_executor.py
@@ -586,7 +586,6 @@
class WorkerProc:
)
# Load model
- self._init_message_queues(input_shm_handle, vllm_config)
is_eep_new_worker = envs.VLLM_ELASTIC_EP_SCALE_UP_LAUNCH
if not is_eep_new_worker:
self.worker.init_device()
@@ -596,6 +595,10 @@
class WorkerProc:
)
self.worker.load_model()
+ # Initialize message queues after init_device() since multi-node setups
+ # (nnodes_within_dp > 1) require distributed groups to be initialized
+ self._init_message_queues(input_shm_handle, vllm_config)
+
# Enable environment variable cache (e.g. assume no more
# environment variable overrides after this point)
enable_envs_cache()
docs/features/multimodal/embedding-cache.md
View file @
3a1561fb
...
@@ -23,11 +23,11 @@ If your workload consists entirely of unique images, the cache provides no benef
...
@@ -23,11 +23,11 @@ If your workload consists entirely of unique images, the cache provides no benef
| Backend | Aggregated | Disaggregated (E/PD) | Notes |
| Backend | Aggregated | Disaggregated (E/PD) | Notes |
|---------|------------|----------------------|-------|
|---------|------------|----------------------|-------|
|
**vLLM**
| ✅
*
| ✅ | Aggregated uses vLLM-native
`ec_both`
; disaggregated uses Dynamo
`EmbeddingCacheManager`
|
|
**vLLM**
| ✅ | ✅ | Aggregated uses vLLM-native
`ec_both`
; disaggregated uses Dynamo
`EmbeddingCacheManager`
|
|
**TRT-LLM**
| ❌ | ✅ | Dynamo
`MultimodalEmbeddingCacheManager`
in PD worker |
|
**TRT-LLM**
| ❌ | ✅ | Dynamo
`MultimodalEmbeddingCacheManager`
in PD worker |
|
**SGLang**
| ❌ | ❌ | Not supported yet |
|
**SGLang**
| ❌ | ❌ | Not supported yet |
*
Requires an upcoming version of vLLM that has not yet been released. Support will be available once the new vLLM release is published
.
This support requires vLLM
`0.18.0`
or newer
.
## How It Works
## How It Works
...
...
docs/features/multimodal/multimodal-kv-routing.md
View file @
3a1561fb
...
@@ -25,11 +25,11 @@ Without MM-aware routing, the standard router treats image token blocks as opaqu
...
@@ -25,11 +25,11 @@ Without MM-aware routing, the standard router treats image token blocks as opaqu
| Backend | Supported | Notes |
| Backend | Supported | Notes |
|---------|-----------|-------|
|---------|-----------|-------|
|
**vLLM**
| ✅
*
| Requires vLLM with KV events
`extra_keys`
support (
[
PR #33304
](
https://github.com/vllm-project/vllm/pull/33304
)
) |
|
**vLLM**
| ✅ | Requires vLLM with KV events
`extra_keys`
support (
[
PR #33304
](
https://github.com/vllm-project/vllm/pull/33304
)
) |
|
**TRT-LLM**
| ✅ | Requires
`--publish-events-and-metrics`
on TRT-LLM workers |
|
**TRT-LLM**
| ✅ | Requires
`--publish-events-and-metrics`
on TRT-LLM workers |
|
**SGLang**
| ❌ | Not supported yet |
|
**SGLang**
| ❌ | Not supported yet |
*
Requires an upcoming version of vLLM that has not yet been released. Support will be available once the new vLLM release is published
.
This support requires vLLM
`0.18.0`
or newer
.
## How It Works
## How It Works
...
...
docs/reference/support-matrix.md
View file @
3a1561fb
...
@@ -29,7 +29,7 @@ The following table shows the backend framework versions included with each Dyna
...
@@ -29,7 +29,7 @@ The following table shows the backend framework versions included with each Dyna
|
**Dynamo**
|
**SGLang**
|
**TensorRT-LLM**
|
**vLLM**
|
**NIXL**
|
|
**Dynamo**
|
**SGLang**
|
**TensorRT-LLM**
|
**vLLM**
|
**NIXL**
|
| :--- | :--- | :--- | :--- | :--- |
| :--- | :--- | :--- | :--- | :--- |
|
**main (ToT)**
|
`0.5.9`
|
`1.3.0rc8`
|
`0.1
7.1
`
|
`0.10.1`
|
|
**main (ToT)**
|
`0.5.9`
|
`1.3.0rc8`
|
`0.1
8.0
`
|
`0.10.1`
|
|
**v1.1.0-dev.1**
*(experimental)*
|
`0.5.9`
|
`1.3.0rc5.post1`
|
`0.17.1`
|
`0.10.1`
|
|
**v1.1.0-dev.1**
*(experimental)*
|
`0.5.9`
|
`1.3.0rc5.post1`
|
`0.17.1`
|
`0.10.1`
|
|
**v1.0.1**
|
`0.5.9`
|
`1.3.0rc5.post1`
|
`0.16.0`
|
`0.10.1`
|
|
**v1.0.1**
|
`0.5.9`
|
`1.3.0rc5.post1`
|
`0.16.0`
|
`0.10.1`
|
|
**v1.0.0**
|
`0.5.9`
|
`1.3.0rc5.post1`
|
`0.16.0`
|
`0.10.1`
|
|
**v1.0.0**
|
`0.5.9`
|
`1.3.0rc5.post1`
|
`0.16.0`
|
`0.10.1`
|
...
...
pyproject.toml
View file @
3a1561fb
...
@@ -50,7 +50,7 @@ trtllm =[
...
@@ -50,7 +50,7 @@ trtllm =[
vllm
=
[
vllm
=
[
"uvloop"
,
"uvloop"
,
"nixl[cu12]<=0.10.1"
,
"nixl[cu12]<=0.10.1"
,
"vllm[flashinfer,runai,otel]==0.1
7.1
"
,
"vllm[flashinfer,runai,otel]==0.1
8.0
"
,
# vllm-omni 0.16.0 is now on PyPI; install only future rc builds from source in container builds
# vllm-omni 0.16.0 is now on PyPI; install only future rc builds from source in container builds
# (see container/deps/vllm/install_vllm.sh). pip install ai-dynamo[vllm] will
# (see container/deps/vllm/install_vllm.sh). pip install ai-dynamo[vllm] will
# not include vllm-omni — install it separately from source if needed.
# not include vllm-omni — install it separately from source if needed.
...
...
Write
Preview
Markdown
is supported
0%
Try again
or
attach a new file
.
Attach a file
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Cancel
Please
register
or
sign in
to comment