OpenDAS / vllm_cscc
Commit afd0da21, authored Feb 03, 2025 by zhuwenwen
Merge tag 'v0.7.1' into v0.7.1-dev
parents 1a11f127 4f4d427a
Changes: 960
Showing 20 changed files with 274 additions and 459 deletions (+274 -459)
.github/mergify.yml  +37 -0
.github/workflows/actionlint.yml  +0 -40
.github/workflows/clang-format.yml  +0 -53
.github/workflows/codespell.yml  +0 -45
.github/workflows/lint-and-deploy.yaml  +3 -2
.github/workflows/matchers/ruff.json  +0 -17
.github/workflows/mypy.yaml  +0 -51
.github/workflows/png-lint.yml  +0 -37
.github/workflows/pre-commit.yml  +19 -0
.github/workflows/ruff.yml  +0 -52
.github/workflows/shellcheck.yml  +0 -37
.github/workflows/sphinx-lint.yml  +0 -32
.github/workflows/yapf.yml  +0 -38
.gitignore  +1 -4
.pre-commit-config.yaml  +106 -0
CMakeLists.txt  +58 -33
Dockerfile  +37 -9
Dockerfile.cpu  +3 -3
Dockerfile.hpu  +1 -1
Dockerfile.neuron  +9 -5
.github/mergify.yml
@@ -35,6 +35,43 @@ pull_request_rules:
       add:
         - frontend
+- name: label-structured-output
+  description: Automatically apply structured-output label
+  conditions:
+    - or:
+      - files~=^vllm/model_executor/guided_decoding/
+      - files=tests/model_executor/test_guided_processors.py
+      - files=tests/entrypoints/llm/test_guided_generate.py
+      - files=benchmarks/benchmark_serving_guided.py
+      - files=benchmarks/benchmark_guided.py
+  actions:
+    label:
+      add:
+        - structured-output
+- name: label-speculative-decoding
+  description: Automatically apply speculative-decoding label
+  conditions:
+    - or:
+      - files~=^vllm/spec_decode/
+      - files=vllm/model_executor/layers/spec_decode_base_sampler.py
+      - files~=^tests/spec_decode/
+  actions:
+    label:
+      add:
+        - speculative-decoding
+- name: label-v1
+  description: Automatically apply v1 label
+  conditions:
+    - or:
+      - files~=^vllm/v1/
+      - files~=^tests/v1/
+  actions:
+    label:
+      add:
+        - v1
 - name: ping author on conflicts and add 'needs-rebase' label
   conditions:
     - conflict
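The label rules added to .github/mergify.yml combine two kinds of file conditions under `or`: `files~=` (a regular expression tested against each changed path) and `files=` (an exact path). The following is a minimal Python sketch of that matching logic, not Mergify's implementation. The function name `rule_applies` and the sample path `vllm/v1/engine/core.py` are made up for illustration; the patterns are copied from the rules above.

```python
import re


def rule_applies(changed_files, regexes=(), exact=()):
    """Simplified model of an `or` block of `files~=` / `files=` conditions.

    Returns True if at least one changed file matches one of the regexes
    or is equal to one of the exact paths.
    """
    for path in changed_files:
        if any(re.search(rx, path) for rx in regexes):
            return True
        if path in exact:
            return True
    return False


# Patterns taken from the label-v1 and label-speculative-decoding rules.
V1_REGEXES = [r"^vllm/v1/", r"^tests/v1/"]
SPEC_REGEXES = [r"^vllm/spec_decode/", r"^tests/spec_decode/"]
SPEC_EXACT = ["vllm/model_executor/layers/spec_decode_base_sampler.py"]

# Hypothetical list of files touched by a pull request.
pr_files = ["vllm/v1/engine/core.py", "README.md"]

print("v1 label:", rule_applies(pr_files, V1_REGEXES))
print("speculative-decoding label:", rule_applies(pr_files, SPEC_REGEXES, SPEC_EXACT))
```

Because the regexes are anchored with `^`, a path such as `docs/vllm/v1/notes.md` would not trigger the v1 label in this model.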
.github/workflows/actionlint.yml
deleted 100644 → 0
name: Lint GitHub Actions workflows
on:
  push:
    branches:
      - "main"
    paths:
      - '.github/workflows/*.ya?ml'
      - '.github/workflows/actionlint.*'
      - '.github/workflows/matchers/actionlint.json'
  pull_request:
    branches:
      - "main"
    paths:
      - '.github/workflows/*.ya?ml'
      - '.github/workflows/actionlint.*'
      - '.github/workflows/matchers/actionlint.json'
env:
  LC_ALL: en_US.UTF-8
defaults:
  run:
    shell: bash
permissions:
  contents: read
jobs:
  actionlint:
    runs-on: ubuntu-latest
    steps:
      - name: "Checkout"
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: "Run actionlint"
        run: |
          echo "::add-matcher::.github/workflows/matchers/actionlint.json"
          tools/actionlint.sh -color
.github/workflows/clang-format.yml
deleted 100644 → 0
name: clang-format
on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch
  push:
    branches:
      - main
    paths:
      - '**/*.h'
      - '**/*.cpp'
      - '**/*.cu'
      - '**/*.cuh'
      - '.github/workflows/clang-format.yml'
  pull_request:
    branches:
      - main
    paths:
      - '**/*.h'
      - '**/*.cpp'
      - '**/*.cu'
      - '**/*.cuh'
      - '.github/workflows/clang-format.yml'
jobs:
  clang-format:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.11"]
    steps:
    - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install clang-format==18.1.5
    - name: Running clang-format
      run: |
        EXCLUDES=(
            'csrc/moe/topk_softmax_kernels.cu'
            'csrc/quantization/gguf/ggml-common.h'
            'csrc/quantization/gguf/dequantize.cuh'
            'csrc/quantization/gguf/vecdotq.cuh'
            'csrc/quantization/gguf/mmq.cuh'
            'csrc/quantization/gguf/mmvq.cuh'
        )
        find csrc/ \( -name '*.h' -o -name '*.cpp' -o -name '*.cu' -o -name '*.cuh' \) -print \
            | grep -vFf <(printf "%s\n" "${EXCLUDES[@]}") \
            | xargs clang-format --dry-run --Werror
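The removed clang-format step filters the `find` output with `grep -vFf`, which drops every path containing any of the listed fixed strings before handing the rest to `clang-format`. As a rough Python sketch of that filtering step only (the function name `filter_sources` and the sample path `csrc/cache_kernels.cu` are illustrative assumptions, not part of the workflow):

```python
# Exclusion list copied from the EXCLUDES array in the workflow.
EXCLUDES = [
    "csrc/moe/topk_softmax_kernels.cu",
    "csrc/quantization/gguf/ggml-common.h",
    "csrc/quantization/gguf/dequantize.cuh",
    "csrc/quantization/gguf/vecdotq.cuh",
    "csrc/quantization/gguf/mmq.cuh",
    "csrc/quantization/gguf/mmvq.cuh",
]


def filter_sources(paths, excludes=EXCLUDES):
    """Keep only paths that contain none of the excluded fixed strings.

    This mirrors `grep -vF`: plain substring matching, no regex semantics.
    """
    return [p for p in paths if not any(e in p for e in excludes)]


# Hypothetical `find` output: one file to format, one excluded file.
candidates = ["csrc/cache_kernels.cu", "csrc/quantization/gguf/mmq.cuh"]
print(filter_sources(candidates))
```

The pre-commit configuration introduced in this commit expresses the same exclusions as a single `exclude` regular expression on the clang-format hook instead.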
.github/workflows/codespell.yml
deleted 100644 → 0
name: codespell
on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch
  push:
    branches:
      - main
    paths:
      - "**/*.py"
      - "**/*.md"
      - "**/*.rst"
      - pyproject.toml
      - requirements-lint.txt
      - .github/workflows/codespell.yml
  pull_request:
    branches:
      - main
    paths:
      - "**/*.py"
      - "**/*.md"
      - "**/*.rst"
      - pyproject.toml
      - requirements-lint.txt
      - .github/workflows/codespell.yml
jobs:
  codespell:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.12"]
    steps:
    - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements-lint.txt
    - name: Spelling check with codespell
      run: |
        codespell --toml pyproject.toml
.github/workflows/lint-and-deploy.yaml
@@ -27,7 +27,7 @@ jobs:
           version: v3.10.1
       - name: Run chart-testing (lint)
-        run: ct lint --target-branch ${{ github.event.repository.default_branch }} --chart-dirs examples/chart-helm --charts examples/chart-helm
+        run: ct lint --target-branch ${{ github.event.repository.default_branch }} --chart-dirs examples/online_serving/chart-helm --charts examples/online_serving/chart-helm
       - name: Setup minio
         run: |
@@ -64,7 +64,8 @@ jobs:
         run: |
           export AWS_ACCESS_KEY_ID=minioadmin
           export AWS_SECRET_ACCESS_KEY=minioadmin
-          helm install --wait --wait-for-jobs --timeout 5m0s --debug --create-namespace --namespace=ns-vllm test-vllm examples/chart-helm -f examples/chart-helm/values.yaml --set secrets.s3endpoint=http://minio:9000 --set secrets.s3bucketname=testbucket --set secrets.s3accesskeyid=$AWS_ACCESS_KEY_ID --set secrets.s3accesskey=$AWS_SECRET_ACCESS_KEY --set resources.requests.cpu=1 --set resources.requests.memory=4Gi --set resources.limits.cpu=2 --set resources.limits.memory=5Gi --set image.env[0].name=VLLM_CPU_KVCACHE_SPACE --set image.env[1].name=VLLM_LOGGING_LEVEL --set-string image.env[0].value="1" --set-string image.env[1].value="DEBUG" --set-string extraInit.s3modelpath="opt-125m/" --set-string 'resources.limits.nvidia\.com/gpu=0' --set-string 'resources.requests.nvidia\.com/gpu=0' --set-string image.repository="vllm-cpu-env"
+          sleep 30 && kubectl -n ns-vllm logs -f "$(kubectl -n ns-vllm get pods | awk '/deployment/ {print $1;exit}')" &
+          helm install --wait --wait-for-jobs --timeout 5m0s --debug --create-namespace --namespace=ns-vllm test-vllm examples/online_serving/chart-helm -f examples/online_serving/chart-helm/values.yaml --set secrets.s3endpoint=http://minio:9000 --set secrets.s3bucketname=testbucket --set secrets.s3accesskeyid=$AWS_ACCESS_KEY_ID --set secrets.s3accesskey=$AWS_SECRET_ACCESS_KEY --set resources.requests.cpu=1 --set resources.requests.memory=4Gi --set resources.limits.cpu=2 --set resources.limits.memory=5Gi --set image.env[0].name=VLLM_CPU_KVCACHE_SPACE --set image.env[1].name=VLLM_LOGGING_LEVEL --set-string image.env[0].value="1" --set-string image.env[1].value="DEBUG" --set-string extraInit.s3modelpath="opt-125m/" --set-string 'resources.limits.nvidia\.com/gpu=0' --set-string 'resources.requests.nvidia\.com/gpu=0' --set-string image.repository="vllm-cpu-env"
       - name: curl test
         run: |
.github/workflows/matchers/ruff.json
deleted 100644 → 0
{
    "problemMatcher": [
        {
            "owner": "ruff",
            "pattern": [
                {
                    "regexp": "^(.+?):(\\d+):(\\d+): (\\w+): (.+)$",
                    "file": 1,
                    "line": 2,
                    "column": 3,
                    "code": 4,
                    "message": 5
                }
            ]
        }
    ]
}
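The removed ruff.json problem matcher maps five regex capture groups to `file`, `line`, `column`, `code`, and `message`. The sketch below applies the same pattern with Python's `re` module to show what each group captures. GitHub evaluates matcher patterns with its own regex engine, so this is only an approximation for this particular pattern; `parse_problem` and the sample log line are invented for illustration and are not real ruff output.

```python
import re

# Same pattern as the "regexp" field, with the JSON escaping removed.
RUFF_PATTERN = re.compile(r"^(.+?):(\d+):(\d+): (\w+): (.+)$")


def parse_problem(log_line):
    """Split a `file:line:col: CODE: message` line into named fields.

    Returns None when the line does not match the pattern.
    """
    match = RUFF_PATTERN.match(log_line)
    if match is None:
        return None
    file, line_no, column, code, message = match.groups()
    return {
        "file": file,
        "line": int(line_no),
        "column": int(column),
        "code": code,
        "message": message,
    }


# Hypothetical log line in the shape the matcher expects.
sample = "vllm/example.py:12:5: E501: Line too long"
print(parse_problem(sample))
```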
.github/workflows/mypy.yaml
deleted 100644 → 0
name: mypy
on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch
  push:
    branches:
      - main
    paths:
      - '**/*.py'
      - '.github/workflows/mypy.yaml'
      - 'tools/mypy.sh'
      - 'pyproject.toml'
  pull_request:
    branches:
      - main
    # This workflow is only relevant when one of the following files changes.
    # However, we have github configured to expect and require this workflow
    # to run and pass before github with auto-merge a pull request. Until github
    # allows more flexible auto-merge policy, we can just run this on every PR.
    # It doesn't take that long to run, anyway.
    #paths:
    #  - '**/*.py'
    #  - '.github/workflows/mypy.yaml'
    #  - 'tools/mypy.sh'
    #  - 'pyproject.toml'
jobs:
  mypy:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.9", "3.10", "3.11", "3.12"]
    steps:
    - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install mypy==1.11.1
        pip install types-setuptools
        pip install types-PyYAML
        pip install types-requests
        pip install types-setuptools
    - name: Mypy
      run: |
        echo "::add-matcher::.github/workflows/matchers/mypy.json"
        tools/mypy.sh 1 ${{ matrix.python-version }}
.github/workflows/png-lint.yml
deleted 100644 → 0
name: Lint PNG exports from excalidraw
on:
  push:
    branches:
      - "main"
    paths:
      - '*.excalidraw.png'
      - '.github/workflows/png-lint.yml'
  pull_request:
    branches:
      - "main"
    paths:
      - '*.excalidraw.png'
      - '.github/workflows/png-lint.yml'
env:
  LC_ALL: en_US.UTF-8
defaults:
  run:
    shell: bash
permissions:
  contents: read
jobs:
  actionlint:
    runs-on: ubuntu-latest
    steps:
      - name: "Checkout"
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: "Run png-lint.sh to check excalidraw exported images"
        run: |
          tools/png-lint.sh
.github/workflows/pre-commit.yml
0 → 100644
name: pre-commit
on:
  pull_request:
  push:
    branches: [main]
jobs:
  pre-commit:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
    - uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
      with:
        python-version: "3.12"
    - run: echo "::add-matcher::.github/workflows/matchers/actionlint.json"
    - uses: pre-commit/action@2c7b3805fd2a0fd8c1884dcaebf91fc102a13ecd # v3.0.1
      with:
        extra_args: --all-files --hook-stage manual
.github/workflows/ruff.yml
deleted 100644 → 0
name: ruff
on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch
  push:
    branches:
      - main
    paths:
      - "**/*.py"
      - pyproject.toml
      - requirements-lint.txt
      - .github/workflows/matchers/ruff.json
      - .github/workflows/ruff.yml
  pull_request:
    branches:
      - main
    # This workflow is only relevant when one of the following files changes.
    # However, we have github configured to expect and require this workflow
    # to run and pass before github with auto-merge a pull request. Until github
    # allows more flexible auto-merge policy, we can just run this on every PR.
    # It doesn't take that long to run, anyway.
    #paths:
    #  - "**/*.py"
    #  - pyproject.toml
    #  - requirements-lint.txt
    #  - .github/workflows/matchers/ruff.json
    #  - .github/workflows/ruff.yml
jobs:
  ruff:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.12"]
    steps:
    - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements-lint.txt
    - name: Analysing the code with ruff
      run: |
        echo "::add-matcher::.github/workflows/matchers/ruff.json"
        ruff check --output-format github .
    - name: Run isort
      run: |
        isort . --check-only
.github/workflows/shellcheck.yml
deleted 100644 → 0
name: Lint shell scripts
on:
  push:
    branches:
      - "main"
    paths:
      - '**/*.sh'
      - '.github/workflows/shellcheck.yml'
  pull_request:
    branches:
      - "main"
    paths:
      - '**/*.sh'
      - '.github/workflows/shellcheck.yml'
env:
  LC_ALL: en_US.UTF-8
defaults:
  run:
    shell: bash
permissions:
  contents: read
jobs:
  shellcheck:
    runs-on: ubuntu-latest
    steps:
      - name: "Checkout"
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: "Check shell scripts"
        run: |
          tools/shellcheck.sh
.github/workflows/sphinx-lint.yml
deleted 100644 → 0
name: Lint documentation
on:
  push:
    branches:
      - main
    paths:
      - "docs/**"
  pull_request:
    branches:
      - main
    paths:
      - "docs/**"
jobs:
  sphinx-lint:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.12"]
    steps:
    - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements-lint.txt
    - name: Linting docs
      run: tools/sphinx-lint.sh
.github/workflows/yapf.yml
deleted 100644 → 0
name: yapf
on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch
  push:
    branches:
      - main
    paths:
      - "**/*.py"
      - .github/workflows/yapf.yml
  pull_request:
    branches:
      - main
    paths:
      - "**/*.py"
      - .github/workflows/yapf.yml
jobs:
  yapf:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.12"]
    steps:
    - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install yapf==0.32.0
        pip install toml==0.10.2
    - name: Running yapf
      run: |
        yapf --diff --recursive .
.gitignore
@@ -79,10 +79,7 @@ instance/
 # Sphinx documentation
 docs/_build/
-docs/source/getting_started/examples/*.rst
-!**/*.template.rst
-docs/source/getting_started/examples/*.md
-!**/*.template.md
+docs/source/getting_started/examples/
 # PyBuilder
 .pybuilder/
.pre-commit-config.yaml
0 → 100644
default_stages:
  - pre-commit # Run locally
  - manual # Run in CI
repos:
- repo: https://github.com/google/yapf
  rev: v0.43.0
  hooks:
  - id: yapf
    args: [--in-place, --verbose]
    additional_dependencies: [toml] # TODO: Remove when yapf is upgraded
- repo: https://github.com/astral-sh/ruff-pre-commit
  rev: v0.9.3
  hooks:
  - id: ruff
    args: [--output-format, github]
- repo: https://github.com/codespell-project/codespell
  rev: v2.4.0
  hooks:
  - id: codespell
    exclude: 'benchmarks/sonnet.txt|(build|tests/(lora/data|models/fixtures|prompts))/.*'
- repo: https://github.com/PyCQA/isort
  rev: 5.13.2
  hooks:
  - id: isort
- repo: https://github.com/pre-commit/mirrors-clang-format
  rev: v19.1.7
  hooks:
  - id: clang-format
    exclude: 'csrc/(moe/topk_softmax_kernels.cu|quantization/gguf/(ggml-common.h|dequantize.cuh|vecdotq.cuh|mmq.cuh|mmvq.cuh))'
    types_or: [c++, cuda]
    args: [--style=file, --verbose]
- repo: https://github.com/jackdewinter/pymarkdown
  rev: v0.9.27
  hooks:
  - id: pymarkdown
    files: docs/.*
- repo: https://github.com/rhysd/actionlint
  rev: v1.7.7
  hooks:
  - id: actionlint
- repo: local
  hooks:
  - id: mypy-local
    name: Run mypy for local Python installation
    entry: tools/mypy.sh 0 "local"
    language: python
    types: [python]
    additional_dependencies: &mypy_deps [mypy==1.11.1, types-setuptools, types-PyYAML, types-requests]
    stages: [pre-commit] # Don't run in CI
  - id: mypy-3.9 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
    name: Run mypy for Python 3.9
    entry: tools/mypy.sh 1 "3.9"
    language: python
    types: [python]
    additional_dependencies: *mypy_deps
    stages: [manual] # Only run in CI
  - id: mypy-3.10 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
    name: Run mypy for Python 3.10
    entry: tools/mypy.sh 1 "3.10"
    language: python
    types: [python]
    additional_dependencies: *mypy_deps
    stages: [manual] # Only run in CI
  - id: mypy-3.11 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
    name: Run mypy for Python 3.11
    entry: tools/mypy.sh 1 "3.11"
    language: python
    types: [python]
    additional_dependencies: *mypy_deps
    stages: [manual] # Only run in CI
  - id: mypy-3.12 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
    name: Run mypy for Python 3.12
    entry: tools/mypy.sh 1 "3.12"
    language: python
    types: [python]
    additional_dependencies: *mypy_deps
    stages: [manual] # Only run in CI
  - id: shellcheck
    name: Lint shell scripts
    entry: tools/shellcheck.sh
    language: script
    types: [shell]
  - id: png-lint
    name: Lint PNG exports from excalidraw
    entry: tools/png-lint.sh
    language: script
    types: [png]
  - id: signoff-commit
    name: Sign-off Commit
    entry: bash
    args:
      - -c
      - |
        if ! grep -q "^Signed-off-by: $(git config user.name) <$(git config user.email)>" .git/COMMIT_EDITMSG; then
          printf "\nSigned-off-by: $(git config user.name) <$(git config user.email)>\n" >> .git/COMMIT_EDITMSG
        fi
    language: system
    verbose: true
    stages: [commit-msg]
  - id: suggestion
    name: Suggestion
    entry: bash -c 'echo "To bypass pre-commit hooks, add --no-verify to git commit."'
    language: system
    verbose: true
    pass_filenames: false
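The `signoff-commit` hook in .pre-commit-config.yaml appends a `Signed-off-by:` trailer to the commit message unless a line starting with that trailer is already present. A small Python sketch of the same idea follows, operating on a string rather than on `.git/COMMIT_EDITMSG`. The function name `ensure_signoff` and the sample name and e-mail are assumptions for illustration; unlike `grep`, this sketch compares the trailer literally instead of treating it as a regular expression.

```python
def ensure_signoff(message, name, email):
    """Return the commit message with a Signed-off-by trailer appended if missing.

    A message already containing a line that starts with the trailer is
    returned unchanged, so applying the function twice adds nothing new.
    """
    trailer = f"Signed-off-by: {name} <{email}>"
    if any(line.startswith(trailer) for line in message.splitlines()):
        return message
    # Same shape as the hook's printf: blank line, trailer, newline.
    return message + f"\n{trailer}\n"


# Hypothetical committer identity and message.
msg = "Fix typo in docs\n"
signed = ensure_signoff(msg, "A Developer", "dev@example.com")
print(signed)
```

In the hook itself the identity comes from `git config user.name` and `git config user.email`, and it runs only at the `commit-msg` stage.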
CMakeLists.txt
@@ -28,9 +28,6 @@ add_compile_options(-w)
 # Suppress potential warnings about unused manually-specified variables
 set(ignoreMe "${VLLM_PYTHON_PATH}")
-# Prevent installation of dependencies (cutlass) by default.
-install(CODE "set(CMAKE_INSTALL_LOCAL_ONLY TRUE)" ALL_COMPONENTS)
 #
 # Supported python versions. These versions will be searched in order, the
 # first match will be selected. These should be kept in sync with setup.py.
@@ -185,6 +182,31 @@ message(STATUS "FetchContent base directory: ${FETCHCONTENT_BASE_DIR}")
 # Define other extension targets
 #
+#
+# cumem_allocator extension
+#
+set(VLLM_CUMEM_EXT_SRC
+  "csrc/cumem_allocator.cpp")
+set_gencode_flags_for_srcs(
+  SRCS "${VLLM_CUMEM_EXT_SRC}"
+  CUDA_ARCHS "${CUDA_ARCHS}")
+if(VLLM_GPU_LANG STREQUAL "CUDA")
+  message(STATUS "Enabling cumem allocator extension.")
+  # link against cuda driver library
+  list(APPEND CUMEM_LIBS cuda)
+  define_gpu_extension_target(
+    cumem_allocator
+    DESTINATION vllm
+    LANGUAGE CXX
+    SOURCES ${VLLM_CUMEM_EXT_SRC}
+    LIBRARIES ${CUMEM_LIBS}
+    USE_SABI 3.8
+    WITH_SOABI)
+endif()
 #
 # _C extension
 #
@@ -236,13 +258,13 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
     FetchContent_Declare(
         cutlass
         GIT_REPOSITORY https://github.com/nvidia/cutlass.git
-        GIT_TAG 8aa95dbb888be6d81c6fbf7169718c5244b53227
+        GIT_TAG v3.7.0
         GIT_PROGRESS TRUE
         # Speed up CUTLASS download by retrieving only the specified GIT_TAG instead of the history.
         # Important: If GIT_SHALLOW is enabled then GIT_TAG works only with branch names and tags.
         # So if the GIT_TAG above is updated to a commit hash, GIT_SHALLOW must be set to FALSE
-        GIT_SHALLOW FALSE
+        GIT_SHALLOW TRUE
     )
   endif()
   FetchContent_MakeAvailable(cutlass)
@@ -266,7 +288,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
   # Only build Marlin kernels if we are building for at least some compatible archs.
   # Keep building Marlin for 9.0 as there are some group sizes and shapes that
   # are not supported by Machete yet.
-  cuda_archs_loose_intersection(MARLIN_ARCHS "8.0;8.6;8.7;8.9;9.0" ${CUDA_ARCHS})
+  cuda_archs_loose_intersection(MARLIN_ARCHS "8.0;8.6;8.7;8.9;9.0" "${CUDA_ARCHS}")
   if (MARLIN_ARCHS)
     set(MARLIN_SRCS
        "csrc/quantization/fp8/fp8_marlin.cu"
@@ -287,10 +309,15 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
   endif()
   # The cutlass_scaled_mm kernels for Hopper (c3x, i.e. CUTLASS 3.x) require
-  # CUDA 12.0 or later (and only work on Hopper, 9.0/9.0a for now).
-  cuda_archs_loose_intersection(SCALED_MM_3X_ARCHS "9.0;9.0a" "${CUDA_ARCHS}")
+  # CUDA 12.0 or later (and only work on Hopper, 9.0a for now).
+  cuda_archs_loose_intersection(SCALED_MM_3X_ARCHS "9.0a" "${CUDA_ARCHS}")
   if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.0 AND SCALED_MM_3X_ARCHS)
-    set(SRCS "csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu")
+    set(SRCS
+       "csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu"
+       "csrc/quantization/cutlass_w8a8/c3x/scaled_mm_sm90_fp8.cu"
+       "csrc/quantization/cutlass_w8a8/c3x/scaled_mm_sm90_int8.cu"
+       "csrc/quantization/cutlass_w8a8/c3x/scaled_mm_azp_sm90_int8.cu"
+       "csrc/quantization/cutlass_w8a8/c3x/scaled_mm_blockwise_sm90_fp8.cu")
     set_gencode_flags_for_srcs(
       SRCS "${SRCS}"
       CUDA_ARCHS "${SCALED_MM_3X_ARCHS}")
@@ -342,7 +369,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
   # 2:4 Sparse Kernels
   # The 2:4 sparse kernels cutlass_scaled_sparse_mm and cutlass_compressor
-  # require CUDA 12.2 or later (and only work on Hopper, 9.0/9.0a for now).
+  # require CUDA 12.2 or later (and only work on Hopper, 9.0a for now).
   if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.2 AND SCALED_MM_3X_ARCHS)
     set(SRCS "csrc/sparse/cutlass/sparse_compressor_c3x.cu"
              "csrc/sparse/cutlass/sparse_scaled_mm_c3x.cu")
@@ -525,7 +552,7 @@ endif()
 ]]
 # vllm-flash-attn currently only supported on CUDA
-if (NOT VLLM_TARGET_DEVICE STREQUAL "cuda")
+if (NOT VLLM_GPU_LANG STREQUAL "CUDA")
   return()
 endif ()
@@ -548,7 +575,7 @@ endif()
 # They should be identical but if they aren't, this is a massive footgun.
 #
 # The vllm-flash-attn install rules are nested under vllm to make sure the library gets installed in the correct place.
-# To only install vllm-flash-attn, use --component vllm_flash_attn_c.
+# To only install vllm-flash-attn, use --component _vllm_fa2_C (for FA2) or --component _vllm_fa3_C (for FA3).
 # If no component is specified, vllm-flash-attn is still installed.
 # If VLLM_FLASH_ATTN_SRC_DIR is set, vllm-flash-attn is installed from that directory instead of downloading.
@@ -560,13 +587,17 @@ if (DEFINED ENV{VLLM_FLASH_ATTN_SRC_DIR})
 endif()
 if(VLLM_FLASH_ATTN_SRC_DIR)
-  FetchContent_Declare(vllm-flash-attn SOURCE_DIR ${VLLM_FLASH_ATTN_SRC_DIR})
+  FetchContent_Declare(
+          vllm-flash-attn SOURCE_DIR
+          ${VLLM_FLASH_ATTN_SRC_DIR}
+          BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn
+  )
 #[[
 else()
   FetchContent_Declare(
           vllm-flash-attn
           GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
-          GIT_TAG 04325b6798bcc326c86fb35af62d05a9c8c8eceb
+          GIT_TAG d4e09037abf588af1ec47d0e966b237ee376876c
           GIT_PROGRESS TRUE
           # Don't share the vllm-flash-attn build between build types
           BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn
@@ -574,31 +605,25 @@ else()
 ]]
 endif()
-# Set the parent build flag so that the vllm-flash-attn library does not redo compile flag and arch initialization.
-set(VLLM_PARENT_BUILD ON)
 #[[
-# Ensure the vllm/vllm_flash_attn directory exists before installation
-install(CODE "file(MAKE_DIRECTORY \"\${CMAKE_INSTALL_PREFIX}/vllm/vllm_flash_attn\")" COMPONENT vllm_flash_attn_c)
-# Make sure vllm-flash-attn install rules are nested under vllm/
-install(CODE "set(CMAKE_INSTALL_LOCAL_ONLY FALSE)" COMPONENT vllm_flash_attn_c)
-install(CODE "set(OLD_CMAKE_INSTALL_PREFIX \"\${CMAKE_INSTALL_PREFIX}\")" COMPONENT vllm_flash_attn_c)
-install(CODE "set(CMAKE_INSTALL_PREFIX \"\${CMAKE_INSTALL_PREFIX}/vllm/\")" COMPONENT vllm_flash_attn_c)
 # Fetch the vllm-flash-attn library
 FetchContent_MakeAvailable(vllm-flash-attn)
 message(STATUS "vllm-flash-attn is available at ${vllm-flash-attn_SOURCE_DIR}")
-# Restore the install prefix
-install(CODE "set(CMAKE_INSTALL_PREFIX \"\${OLD_CMAKE_INSTALL_PREFIX}\")" COMPONENT vllm_flash_attn_c)
-install(CODE "set(CMAKE_INSTALL_LOCAL_ONLY TRUE)" COMPONENT vllm_flash_attn_c)
+# Copy over the vllm-flash-attn python files (duplicated for fa2 and fa3, in
+# case only one is built, in the case both are built redundant work is done)
+install(
+  DIRECTORY ${vllm-flash-attn_SOURCE_DIR}/vllm_flash_attn/
+  DESTINATION vllm_flash_attn
+  COMPONENT _vllm_fa2_C
+  FILES_MATCHING PATTERN "*.py"
+)
-# Copy over the vllm-flash-attn python files
 install(
   DIRECTORY ${vllm-flash-attn_SOURCE_DIR}/vllm_flash_attn/
-  DESTINATION vllm/vllm_flash_attn
-  COMPONENT vllm_flash_attn_c
+  DESTINATION vllm_flash_attn
+  COMPONENT _vllm_fa3_C
   FILES_MATCHING PATTERN "*.py"
 )
Dockerfile
@@ -2,8 +2,8 @@
 # to run the OpenAI compatible server.
 # Please update any changes made here to
-# docs/source/dev/dockerfile/dockerfile.md and
-# docs/source/assets/dev/dockerfile-stages-dependency.png
+# docs/source/contributing/dockerfile/dockerfile.md and
+# docs/source/assets/contributing/dockerfile-stages-dependency.png
 ARG CUDA_VERSION=12.4.1
 #################### BASE BUILD IMAGE ####################
@@ -52,7 +52,7 @@ WORKDIR /workspace
 # after this step
 RUN --mount=type=cache,target=/root/.cache/pip \
     if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
-        python3 -m pip install --index-url https://download.pytorch.org/whl/nightly/cu124 "torch==2.6.0.dev20241210+cu124" "torchvision==0.22.0.dev20241215"; \
+        python3 -m pip install --index-url https://download.pytorch.org/whl/nightly/cu126 "torch==2.7.0.dev20250121+cu126" "torchvision==0.22.0.dev20250121"; \
     fi
 COPY requirements-common.txt requirements-common.txt
@@ -126,8 +126,8 @@ RUN --mount=type=cache,target=/root/.cache/ccache \
 # Check the size of the wheel if RUN_WHEEL_CHECK is true
 COPY .buildkite/check-wheel-size.py check-wheel-size.py
-# Default max size of the wheel is 250MB
-ARG VLLM_MAX_SIZE_MB=250
+# sync the default value with .buildkite/check-wheel-size.py
+ARG VLLM_MAX_SIZE_MB=300
 ENV VLLM_MAX_SIZE_MB=$VLLM_MAX_SIZE_MB
 ARG RUN_WHEEL_CHECK=true
 RUN if [ "$RUN_WHEEL_CHECK" = "true" ]; then \
@@ -149,7 +149,8 @@ RUN --mount=type=cache,target=/root/.cache/pip \
 #################### vLLM installation IMAGE ####################
 # image with vLLM installed
-FROM nvidia/cuda:${CUDA_VERSION}-base-ubuntu22.04 AS vllm-base
+# TODO: Restore to base image after FlashInfer AOT wheel fixed
+FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu22.04 AS vllm-base
 ARG CUDA_VERSION=12.4.1
 ARG PYTHON_VERSION=3.12
 WORKDIR /vllm-workspace
@@ -194,12 +195,30 @@ RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist
     --mount=type=cache,target=/root/.cache/pip \
     python3 -m pip install dist/*.whl --verbose
+# How to build this FlashInfer wheel:
+# $ export FLASHINFER_ENABLE_AOT=1
+# $ # Note we remove 7.0 from the arch list compared to the list below, since FlashInfer only supports sm75+
+# $ export TORCH_CUDA_ARCH_LIST='7.5 8.0 8.6 8.9 9.0+PTX'
+# $ git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
+# $ cd flashinfer
+# $ git checkout 524304395bd1d8cd7d07db083859523fcaa246a4
+# $ python3 setup.py bdist_wheel --dist-dir=dist --verbose
 RUN --mount=type=cache,target=/root/.cache/pip \
     . /etc/environment && \
     if [ "$TARGETPLATFORM" != "linux/arm64" ]; then \
-        python3 -m pip install https://github.com/flashinfer-ai/flashinfer/releases/download/v0.1.6/flashinfer-0.1.6+cu121torch2.4-cp${PYTHON_VERSION_STR}-cp${PYTHON_VERSION_STR}-linux_x86_64.whl; \
+        python3 -m pip install https://wheels.vllm.ai/flashinfer/524304395bd1d8cd7d07db083859523fcaa246a4/flashinfer_python-0.2.0.post1-cp${PYTHON_VERSION_STR}-cp${PYTHON_VERSION_STR}-linux_x86_64.whl; \
     fi
 COPY examples examples
+# Although we build Flashinfer with AOT mode, there's still
+# some issues w.r.t. JIT compilation. Therefore we need to
+# install build dependencies for JIT compilation.
+# TODO: Remove this once FlashInfer AOT wheel is fixed
+COPY requirements-build.txt requirements-build.txt
+RUN --mount=type=cache,target=/root/.cache/pip \
+    python3 -m pip install -r requirements-build.txt
 #################### vLLM installation IMAGE ####################
 #################### TEST IMAGE ####################
@@ -234,8 +253,8 @@ RUN mv vllm test_docs/
 #################### TEST IMAGE ####################
 #################### OPENAI API SERVER ####################
-# openai api server alternative
-FROM vllm-base AS vllm-openai
+# base openai image with additional requirements, for any subsequent openai-style images
+FROM vllm-base AS vllm-openai-base
 # install additional dependencies for openai api server
 RUN --mount=type=cache,target=/root/.cache/pip \
@@ -247,5 +266,14 @@ RUN --mount=type=cache,target=/root/.cache/pip \
 ENV VLLM_USAGE_SOURCE production-docker-image
+# define sagemaker first, so it is not default from `docker build`
+FROM vllm-openai-base AS vllm-sagemaker
+COPY examples/online_serving/sagemaker-entrypoint.sh .
+RUN chmod +x sagemaker-entrypoint.sh
+ENTRYPOINT ["./sagemaker-entrypoint.sh"]
+FROM vllm-openai-base AS vllm-openai
 ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
 #################### OPENAI API SERVER ####################
Dockerfile.cpu
@@ -26,10 +26,10 @@ RUN pip install intel_extension_for_pytorch==2.5.0
 WORKDIR /workspace
-COPY requirements-build.txt requirements-build.txt
 ARG PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu"
 ENV PIP_EXTRA_INDEX_URL=${PIP_EXTRA_INDEX_URL}
 RUN --mount=type=cache,target=/root/.cache/pip \
+    --mount=type=bind,src=requirements-build.txt,target=requirements-build.txt \
     pip install --upgrade pip && \
     pip install -r requirements-build.txt
@@ -37,9 +37,9 @@ FROM cpu-test-1 AS build
 WORKDIR /workspace/vllm
-COPY requirements-common.txt requirements-common.txt
-COPY requirements-cpu.txt requirements-cpu.txt
 RUN --mount=type=cache,target=/root/.cache/pip \
+    --mount=type=bind,src=requirements-common.txt,target=requirements-common.txt \
+    --mount=type=bind,src=requirements-cpu.txt,target=requirements-cpu.txt \
     pip install -v -r requirements-cpu.txt
 COPY . .
Dockerfile.hpu
-FROM vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
+FROM vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
 COPY ./ /workspace/vllm
Dockerfile.neuron
 # default base image
 # https://gallery.ecr.aws/neuron/pytorch-inference-neuronx
-ARG BASE_IMAGE="public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.2-ubuntu20.04"
+ARG BASE_IMAGE="public.ecr.aws/neuron/pytorch-inference-neuronx:2.5.1-neuronx-py310-sdk2.21.0-ubuntu22.04"
 FROM $BASE_IMAGE
@@ -15,16 +15,17 @@ RUN apt-get update && \
     ffmpeg libsm6 libxext6 libgl1
 ### Mount Point ###
-# When launching the container, mount the code directory to /app
-ARG APP_MOUNT=/app
+# When launching the container, mount the code directory to /workspace
+ARG APP_MOUNT=/workspace
 VOLUME [ ${APP_MOUNT} ]
 WORKDIR ${APP_MOUNT}/vllm
 RUN python3 -m pip install --upgrade pip
 RUN python3 -m pip install --no-cache-dir fastapi ninja tokenizers pandas
-RUN python3 -m pip install sentencepiece transformers==4.36.2 -U
+RUN python3 -m pip install sentencepiece transformers==4.45.2 -U
 RUN python3 -m pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
-RUN python3 -m pip install --pre neuronx-cc==2.15.* --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
+RUN python3 -m pip install neuronx-cc==2.16.345.0 --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
+RUN python3 -m pip install pytest
 COPY . .
 ARG GIT_REPO_CHECK=0
@@ -42,4 +43,7 @@ RUN --mount=type=bind,source=.git,target=.git \
 # install development dependencies (for testing)
 RUN python3 -m pip install -e tests/vllm_test_utils
+# overwrite entrypoint to run bash script
+RUN echo "import subprocess; import sys; subprocess.check_call(sys.argv[1:])" > /usr/local/bin/dockerd-entrypoint.py
 CMD ["/bin/bash"]