Commit 6d2051cc authored by zhuwenwen

Merge tag 'v0.6.3.post1' into v0.6.3.post1-dev

parents 2c7f740a a2c71c54
# Contributing to vLLM # Contributing to vLLM
Thank you for your interest in contributing to vLLM! Thank you for your interest in contributing to vLLM! Our community is open to everyone and welcomes all kinds of contributions, no matter how small or large. There are several ways you can contribute to the project:
Our community is open to everyone and welcomes all kinds of contributions, no matter how small or large.
There are several ways you can contribute to the project:
- Identify and report any issues or bugs. - Identify and report any issues or bugs.
- Request or add a new model. - Request or add support for a new model.
- Suggest or implement new features. - Suggest or implement new features.
- Improve documentation or contribute a how-to guide.
However, remember that contributions aren't just about code. We also believe in the power of community support; thus, answering queries, offering PR reviews, and assisting others are also highly regarded and beneficial contributions.
We believe in the power of community support; thus, answering queries, assisting others, and enhancing the documentation are highly regarded and beneficial contributions.
Finally, one of the most impactful ways to support us is by raising awareness about vLLM. Finally, one of the most impactful ways to support us is by raising awareness about vLLM. Talk about it in your blog posts and highlight how it's driving your incredible projects. Express your support on social media if you're using vLLM, or simply offer your appreciation by starring our repository!
Talk about it in your blog posts, highlighting how it's driving your incredible projects.
Express your support on Twitter if vLLM aids you, or simply offer your appreciation by starring our repository.
## Setup for development ## Developing
### Build from source Depending on the kind of development you'd like to do (e.g. Python, CUDA), you can choose to build vLLM with or without compilation. Check out the [building from source](https://docs.vllm.ai/en/latest/getting_started/installation.html#build-from-source) documentation for details.
```bash
pip install -e . # This may take several minutes.
```
### Testing ## Testing
```bash ```bash
pip install -r requirements-dev.txt pip install -r requirements-dev.txt
...@@ -36,15 +29,16 @@ mypy ...@@ -36,15 +29,16 @@ mypy
# Unit tests # Unit tests
pytest tests/ pytest tests/
``` ```
**Note:** Currently, the repository does not pass the mypy tests. **Note:** Currently, the repository does not pass the ``mypy`` tests.
## Contribution Guidelines
## Contributing Guidelines ### Issues
### Issue Reporting If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
If you encounter a bug or have a feature request, please check our issues page first to see if someone else has already reported it. > [!IMPORTANT]
If not, please file a new issue, providing as much relevant information as possible. > If you discover a security vulnerability, please follow the instructions [here](/SECURITY.md#reporting-a-vulnerability).
### Pull Requests & Code Reviews ### Pull Requests & Code Reviews
...@@ -53,4 +47,4 @@ Please check the PR checklist in the [PR template](.github/PULL_REQUEST_TEMPLATE ...@@ -53,4 +47,4 @@ Please check the PR checklist in the [PR template](.github/PULL_REQUEST_TEMPLATE
### Thank You ### Thank You
Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM.
Your contributions make vLLM a great tool for everyone! All of your contributions help make vLLM a great tool and community for everyone!
...@@ -27,6 +27,14 @@ RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \ ...@@ -27,6 +27,14 @@ RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \
&& curl -sS https://bootstrap.pypa.io/get-pip.py | python${PYTHON_VERSION} \ && curl -sS https://bootstrap.pypa.io/get-pip.py | python${PYTHON_VERSION} \
&& python3 --version && python3 -m pip --version && python3 --version && python3 -m pip --version
# Upgrade to GCC 10 to avoid https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92519
# as it was causing spam when compiling the CUTLASS kernels
RUN apt-get install -y gcc-10 g++-10
RUN update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-10 110 --slave /usr/bin/g++ g++ /usr/bin/g++-10
RUN <<EOF
gcc --version
EOF
# Workaround for https://github.com/openai/triton/issues/2507 and # Workaround for https://github.com/openai/triton/issues/2507 and
# https://github.com/pytorch/pytorch/issues/107960 -- hopefully # https://github.com/pytorch/pytorch/issues/107960 -- hopefully
# this won't be needed for future versions of this docker image # this won't be needed for future versions of this docker image
...@@ -62,15 +70,10 @@ COPY requirements-build.txt requirements-build.txt ...@@ -62,15 +70,10 @@ COPY requirements-build.txt requirements-build.txt
RUN --mount=type=cache,target=/root/.cache/pip \ RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install -r requirements-build.txt python3 -m pip install -r requirements-build.txt
# files and directories related to build wheels COPY . .
COPY csrc csrc ARG GIT_REPO_CHECK=0
COPY setup.py setup.py RUN --mount=type=bind,source=.git,target=.git \
COPY cmake cmake if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi
COPY CMakeLists.txt CMakeLists.txt
COPY requirements-common.txt requirements-common.txt
COPY requirements-cuda.txt requirements-cuda.txt
COPY pyproject.toml pyproject.toml
COPY vllm vllm
# max jobs used by Ninja to build extensions # max jobs used by Ninja to build extensions
ARG max_jobs=2 ARG max_jobs=2
...@@ -135,7 +138,7 @@ RUN --mount=type=cache,target=/root/.cache/pip \ ...@@ -135,7 +138,7 @@ RUN --mount=type=cache,target=/root/.cache/pip \
#################### DEV IMAGE #################### #################### DEV IMAGE ####################
#################### vLLM installation IMAGE #################### #################### vLLM installation IMAGE ####################
# image with vLLM installed # image with vLLM installed
FROM nvidia/cuda:${CUDA_VERSION}-base-ubuntu20.04 AS vllm-base FROM nvidia/cuda:${CUDA_VERSION}-base-ubuntu22.04 AS vllm-base
ARG CUDA_VERSION=12.4.1 ARG CUDA_VERSION=12.4.1
ARG PYTHON_VERSION=3.12 ARG PYTHON_VERSION=3.12
WORKDIR /vllm-workspace WORKDIR /vllm-workspace
...@@ -173,6 +176,7 @@ RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist ...@@ -173,6 +176,7 @@ RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist
RUN --mount=type=cache,target=/root/.cache/pip \ RUN --mount=type=cache,target=/root/.cache/pip \
. /etc/environment && \ . /etc/environment && \
python3 -m pip install https://github.com/flashinfer-ai/flashinfer/releases/download/v0.1.6/flashinfer-0.1.6+cu121torch2.4-cp${PYTHON_VERSION_STR}-cp${PYTHON_VERSION_STR}-linux_x86_64.whl python3 -m pip install https://github.com/flashinfer-ai/flashinfer/releases/download/v0.1.6/flashinfer-0.1.6+cu121torch2.4-cp${PYTHON_VERSION_STR}-cp${PYTHON_VERSION_STR}-linux_x86_64.whl
COPY examples examples
#################### vLLM installation IMAGE #################### #################### vLLM installation IMAGE ####################
......
...@@ -22,29 +22,17 @@ ENV LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:/usr/local/li ...@@ -22,29 +22,17 @@ ENV LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:/usr/local/li
RUN echo 'ulimit -c 0' >> ~/.bashrc RUN echo 'ulimit -c 0' >> ~/.bashrc
RUN pip install https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_dev/cpu/intel_extension_for_pytorch-2.4.0%2Bgitfbaa4bc-cp310-cp310-linux_x86_64.whl RUN pip install intel_extension_for_pytorch==2.4.0
WORKDIR /workspace WORKDIR /workspace
ENV PIP_EXTRA_INDEX_URL=https://download.pytorch.org/whl/cpu ARG PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu"
ENV PIP_EXTRA_INDEX_URL=${PIP_EXTRA_INDEX_URL}
RUN --mount=type=cache,target=/root/.cache/pip \ RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=bind,src=requirements-build.txt,target=requirements-build.txt \ --mount=type=bind,src=requirements-build.txt,target=requirements-build.txt \
pip install --upgrade pip && \ pip install --upgrade pip && \
pip install -r requirements-build.txt pip install -r requirements-build.txt
# install oneDNN
RUN git clone -b rls-v3.5 https://github.com/oneapi-src/oneDNN.git
RUN --mount=type=cache,target=/root/.cache/ccache \
cmake -B ./oneDNN/build -S ./oneDNN -G Ninja -DONEDNN_LIBRARY_TYPE=STATIC \
-DONEDNN_BUILD_DOC=OFF \
-DONEDNN_BUILD_EXAMPLES=OFF \
-DONEDNN_BUILD_TESTS=OFF \
-DONEDNN_BUILD_GRAPH=OFF \
-DONEDNN_ENABLE_WORKLOAD=INFERENCE \
-DONEDNN_ENABLE_PRIMITIVE=MATMUL && \
cmake --build ./oneDNN/build --target install --config Release
FROM cpu-test-1 AS build FROM cpu-test-1 AS build
WORKDIR /workspace/vllm WORKDIR /workspace/vllm
...@@ -54,7 +42,10 @@ RUN --mount=type=cache,target=/root/.cache/pip \ ...@@ -54,7 +42,10 @@ RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=bind,src=requirements-cpu.txt,target=requirements-cpu.txt \ --mount=type=bind,src=requirements-cpu.txt,target=requirements-cpu.txt \
pip install -v -r requirements-cpu.txt pip install -v -r requirements-cpu.txt
COPY ./ ./ COPY . .
ARG GIT_REPO_CHECK=0
RUN --mount=type=bind,source=.git,target=.git \
if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi
# Support for building with non-AVX512 vLLM: docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" ... # Support for building with non-AVX512 vLLM: docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" ...
ARG VLLM_CPU_DISABLE_AVX512 ARG VLLM_CPU_DISABLE_AVX512
......
...@@ -17,7 +17,7 @@ RUN apt-get update && \ ...@@ -17,7 +17,7 @@ RUN apt-get update && \
# When launching the container, mount the code directory to /app # When launching the container, mount the code directory to /app
ARG APP_MOUNT=/app ARG APP_MOUNT=/app
VOLUME [ ${APP_MOUNT} ] VOLUME [ ${APP_MOUNT} ]
WORKDIR ${APP_MOUNT} WORKDIR ${APP_MOUNT}/vllm
RUN python3 -m pip install --upgrade pip RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install --no-cache-dir fastapi ninja tokenizers pandas RUN python3 -m pip install --no-cache-dir fastapi ninja tokenizers pandas
...@@ -25,17 +25,17 @@ RUN python3 -m pip install sentencepiece transformers==4.36.2 -U ...@@ -25,17 +25,17 @@ RUN python3 -m pip install sentencepiece transformers==4.36.2 -U
RUN python3 -m pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com -U RUN python3 -m pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
RUN python3 -m pip install --pre neuronx-cc==2.15.* --extra-index-url=https://pip.repos.neuron.amazonaws.com -U RUN python3 -m pip install --pre neuronx-cc==2.15.* --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
COPY . /app/vllm COPY . .
ARG GIT_REPO_CHECK=0
RUN --mount=type=bind,source=.git,target=.git \
if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi
RUN cd /app/vllm \ RUN python3 -m pip install -U \
&& python3 -m pip install -U \
cmake>=3.26 ninja packaging setuptools-scm>=8 wheel jinja2 \ cmake>=3.26 ninja packaging setuptools-scm>=8 wheel jinja2 \
-r requirements-neuron.txt -r requirements-neuron.txt
ENV VLLM_TARGET_DEVICE neuron ENV VLLM_TARGET_DEVICE neuron
RUN --mount=type=bind,source=.git,target=.git \ RUN --mount=type=bind,source=.git,target=.git \
cd /app/vllm \ pip install --no-build-isolation -v -e . \
&& pip install --no-build-isolation -v -e . \
&& cd ..
CMD ["/bin/bash"] CMD ["/bin/bash"]
...@@ -9,16 +9,10 @@ RUN apt-get update -y && \ ...@@ -9,16 +9,10 @@ RUN apt-get update -y && \
ffmpeg libsm6 libxext6 libgl1 ffmpeg libsm6 libxext6 libgl1
WORKDIR /workspace WORKDIR /workspace
# copy requirements COPY . .
COPY requirements-build.txt /workspace/vllm/ ARG GIT_REPO_CHECK=0
COPY requirements-common.txt /workspace/vllm/ RUN --mount=type=bind,source=.git,target=.git \
COPY requirements-openvino.txt /workspace/vllm/ if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi
COPY vllm/ /workspace/vllm/vllm
COPY csrc/core /workspace/vllm/csrc/core
COPY cmake/utils.cmake /workspace/vllm/cmake/
COPY CMakeLists.txt /workspace/vllm/
COPY setup.py /workspace/vllm/
# install build requirements # install build requirements
RUN PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" python3 -m pip install -r /workspace/vllm/requirements-build.txt RUN PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" python3 -m pip install -r /workspace/vllm/requirements-build.txt
......
...@@ -14,6 +14,9 @@ RUN micromamba install -y -n base -c https://ftp.osuosl.org/pub/open-ce/1.11.0-p ...@@ -14,6 +14,9 @@ RUN micromamba install -y -n base -c https://ftp.osuosl.org/pub/open-ce/1.11.0-p
COPY ./ /workspace/vllm COPY ./ /workspace/vllm
WORKDIR /workspace/vllm WORKDIR /workspace/vllm
ARG GIT_REPO_CHECK=0
RUN --mount=type=bind,source=.git,target=.git \
if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh; fi
# These packages will be in rocketce eventually # These packages will be in rocketce eventually
RUN --mount=type=cache,target=/root/.cache/pip \ RUN --mount=type=cache,target=/root/.cache/pip \
......
...@@ -117,6 +117,9 @@ RUN --mount=type=cache,target=${CCACHE_DIR} \ ...@@ -117,6 +117,9 @@ RUN --mount=type=cache,target=${CCACHE_DIR} \
FROM base AS final FROM base AS final
# Import the vLLM development directory from the build context # Import the vLLM development directory from the build context
COPY . . COPY . .
ARG GIT_REPO_CHECK=0
RUN --mount=type=bind,source=.git,target=.git \
if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi
# Package upgrades for useful functionality or to avoid dependency issues # Package upgrades for useful functionality or to avoid dependency issues
RUN --mount=type=cache,target=/root/.cache/pip \ RUN --mount=type=cache,target=/root/.cache/pip \
......
...@@ -2,7 +2,7 @@ ARG NIGHTLY_DATE="20240828" ...@@ -2,7 +2,7 @@ ARG NIGHTLY_DATE="20240828"
ARG BASE_IMAGE="us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm_$NIGHTLY_DATE" ARG BASE_IMAGE="us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm_$NIGHTLY_DATE"
FROM $BASE_IMAGE FROM $BASE_IMAGE
WORKDIR /workspace WORKDIR /workspace/vllm
# Install some basic utilities # Install some basic utilities
RUN apt-get update && apt-get install -y \ RUN apt-get update && apt-get install -y \
...@@ -16,14 +16,17 @@ RUN --mount=type=cache,target=/root/.cache/pip \ ...@@ -16,14 +16,17 @@ RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install torch_xla[pallas] -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html python3 -m pip install torch_xla[pallas] -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
# Build vLLM. # Build vLLM.
COPY . /workspace/vllm COPY . .
ARG GIT_REPO_CHECK=0
RUN --mount=type=bind,source=.git,target=.git \
if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh; fi
ENV VLLM_TARGET_DEVICE="tpu" ENV VLLM_TARGET_DEVICE="tpu"
RUN --mount=type=cache,target=/root/.cache/pip \ RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=bind,source=.git,target=.git \ --mount=type=bind,source=.git,target=.git \
cd /workspace/vllm && \
python3 -m pip install \ python3 -m pip install \
cmake>=3.26 ninja packaging setuptools-scm>=8 wheel jinja2 \ cmake>=3.26 ninja packaging setuptools-scm>=8 wheel jinja2 \
-r requirements-tpu.txt -r requirements-tpu.txt
RUN cd /workspace/vllm && python3 setup.py develop RUN python3 setup.py develop
CMD ["/bin/bash"] CMD ["/bin/bash"]
FROM intel/oneapi-basekit:2024.2.1-0-devel-ubuntu22.04 FROM intel/oneapi-basekit:2024.2.1-0-devel-ubuntu22.04 AS vllm-base
RUN wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | tee /usr/share/keyrings/intel-oneapi-archive-keyring.gpg > /dev/null && \ RUN wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | tee /usr/share/keyrings/intel-oneapi-archive-keyring.gpg > /dev/null && \
echo "deb [signed-by=/usr/share/keyrings/intel-oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main " | tee /etc/apt/sources.list.d/oneAPI.list && \ echo "deb [signed-by=/usr/share/keyrings/intel-oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main " | tee /etc/apt/sources.list.d/oneAPI.list && \
...@@ -7,20 +7,52 @@ RUN wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRO ...@@ -7,20 +7,52 @@ RUN wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRO
echo "deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/graphics/ubuntu jammy arc" | tee /etc/apt/sources.list.d/intel.gpu.jammy.list && \ echo "deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/graphics/ubuntu jammy arc" | tee /etc/apt/sources.list.d/intel.gpu.jammy.list && \
chmod 644 /usr/share/keyrings/intel-graphics.gpg chmod 644 /usr/share/keyrings/intel-graphics.gpg
RUN apt-get update -y && \ RUN apt-get update -y && \
apt-get install -y curl libicu70 lsb-release git wget vim numactl python3 python3-pip ffmpeg libsm6 libxext6 libgl1 apt-get install -y --no-install-recommends --fix-missing \
curl \
COPY ./ /workspace/vllm ffmpeg \
git \
libsndfile1 \
libsm6 \
libxext6 \
libgl1 \
lsb-release \
numactl \
python3 \
python3-dev \
python3-pip \
# vim \
wget
WORKDIR /workspace/vllm WORKDIR /workspace/vllm
COPY requirements-xpu.txt /workspace/vllm/requirements-xpu.txt
COPY requirements-common.txt /workspace/vllm/requirements-common.txt
RUN --mount=type=cache,target=/root/.cache/pip \ RUN --mount=type=cache,target=/root/.cache/pip \
pip install -v --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ \ pip install --no-cache-dir \
cmake>=3.26 ninja packaging setuptools-scm>=8 wheel jinja2 \ --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ \
-r requirements-xpu.txt -r requirements-xpu.txt
COPY . .
ARG GIT_REPO_CHECK
RUN --mount=type=bind,source=.git,target=.git \
if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh; fi
ENV VLLM_TARGET_DEVICE=xpu
RUN --mount=type=cache,target=/root/.cache/pip \ RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=bind,source=.git,target=.git \ --mount=type=bind,source=.git,target=.git \
VLLM_TARGET_DEVICE=xpu python3 setup.py install python3 setup.py install
CMD ["/bin/bash"] CMD ["/bin/bash"]
FROM vllm-base AS vllm-openai
# install additional dependencies for openai api server
RUN --mount=type=cache,target=/root/.cache/pip \
pip install accelerate hf_transfer 'modelscope!=1.15.0'
ENV VLLM_USAGE_SOURCE production-docker-image \
TRITON_XPU_PROFILE 1
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
...@@ -13,7 +13,7 @@ vLLM is a fast and easy-to-use LLM inference and serving library, using PageAttention ...@@ -13,7 +13,7 @@ vLLM is a fast and easy-to-use LLM inference and serving library, using PageAttention
| LlamaForCausalLM | LLaMA、LLaMA-2、LLaMA-3、Codellama、deepseek、Yi | Yes | Yes | | LlamaForCausalLM | LLaMA、LLaMA-2、LLaMA-3、Codellama、deepseek、Yi | Yes | Yes |
| QWenLMHeadModel | QWen、Qwen-VL | Yes | Yes | | QWenLMHeadModel | QWen、Qwen-VL | Yes | Yes |
| Qwen2ForCausalLM | QWen1.5、CodeQwen1.5、QWen2 | Yes | Yes | | Qwen2ForCausalLM | QWen1.5、CodeQwen1.5、QWen2 | Yes | Yes |
| ChatGLMModel | chatglm2、chatglm3 | Yes | Yes | | ChatGLMModel | chatglm2、chatglm3、chatglm4、glm-4v-9b | Yes | Yes |
| BaiChuanForCausalLM | Baichuan、Baichuan2 | Yes | Yes | | BaiChuanForCausalLM | Baichuan、Baichuan2 | Yes | Yes |
| BloomForCausalLM | BLOOM | Yes | Yes | | BloomForCausalLM | BLOOM | Yes | Yes |
| InternLMForCausalLM | InternLM | Yes | Yes | | InternLMForCausalLM | InternLM | Yes | Yes |
...@@ -29,6 +29,7 @@ vLLM supports ...@@ -29,6 +29,7 @@ vLLM supports
+ Python 3.9. + Python 3.9.
+ Python 3.10. + Python 3.10.
+ Python 3.11. + Python 3.11.
+ Python 3.12.
### Install by building from source ### Install by building from source
...@@ -75,7 +76,7 @@ VLLM_INSTALL_PUNICA_KERNELS=1 python3 setup.py install ...@@ -75,7 +76,7 @@ VLLM_INSTALL_PUNICA_KERNELS=1 python3 setup.py install
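+ If downloads via pip install are too slow, you can add a mirror: -i https://pypi.tuna.tsinghua.edu.cn/simple/ + If downloads via pip install are too slow, you can add a mirror: -i https://pypi.tuna.tsinghua.edu.cn/simple/
## Verification ## Verification
- python -c "import vllm; print(vllm.\_\_version__)" prints the installed version, which tracks the official release version, e.g. 0.6.2 - python -c "import vllm; print(vllm.\_\_version__)" prints the installed version, which tracks the official release version, e.g. 0.6.3.post1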
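The verification step above prints `vllm.__version__` and compares it by eye with the expected release. A minimal sketch of the same check done programmatically, using the standard-library `importlib.metadata`; the helper names `installed_version` and `matches_release` are hypothetical, and a build that carries a local version suffix may not compare equal to the plain release string:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def matches_release(package: str, expected: str) -> bool:
    """True only when `package` is installed at exactly `expected`."""
    return installed_version(package) == expected


if __name__ == "__main__":
    # "vllm" and "0.6.3.post1" are taken from the README line above.
    print(installed_version("vllm"), matches_release("vllm", "0.6.3.post1"))
```

Returning `None` for a missing package keeps the check usable in environments where the install step failed, instead of raising at import time.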
## Known Issue ## Known Issue
- -
......
...@@ -10,22 +10,13 @@ Easy, fast, and cheap LLM serving for everyone ...@@ -10,22 +10,13 @@ Easy, fast, and cheap LLM serving for everyone
</h3> </h3>
<p align="center"> <p align="center">
| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://vllm.ai"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://discord.gg/jz7wjKhh6g"><b>Discord</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> | | <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://vllm.ai"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://discord.gg/jz7wjKhh6g"><b>Discord</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack</b></a> |
</p> </p>
---
**vLLM, AMD, Anyscale Meet & Greet at [Ray Summit 2024](http://raysummit.anyscale.com) (Monday, Sept 30th, 5-7pm PT) at Marriott Marquis San Francisco**
We are excited to announce our special vLLM event in collaboration with AMD and Anyscale.
Join us to learn more about recent advancements of vLLM on MI300X.
Register [here](https://lu.ma/db5ld9n5) and be a part of the event!
---
*Latest News* 🔥 *Latest News* 🔥
- [2024/10] We have just created a developer slack ([slack.vllm.ai](https://slack.vllm.ai)) focusing on coordinating contributions and discussing features. Please feel free to join us there!
- [2024/10] Ray Summit 2024 held a special track for vLLM! Please find the opening talk slides from the vLLM team [here](https://docs.google.com/presentation/d/1B_KQxpHBTRa_mDF-tR6i8rWdOU5QoTZNcEg2MKZxEHM/edit?usp=sharing). Learn more from the [talks](https://raysummit.anyscale.com/flow/anyscale/raysummit2024/landing/page/sessioncatalog?tab.day=20241001&search.sessiontracks=1719251906298001uzJ2) from other vLLM contributors and users!
- [2024/09] We hosted [the sixth vLLM meetup](https://lu.ma/87q3nvnh) with NVIDIA! Please find the meetup slides [here](https://docs.google.com/presentation/d/1wrLGwytQfaOTd5wCGSPNhoaW3nq0E-9wqyP7ny93xRs/edit?usp=sharing). - [2024/09] We hosted [the sixth vLLM meetup](https://lu.ma/87q3nvnh) with NVIDIA! Please find the meetup slides [here](https://docs.google.com/presentation/d/1wrLGwytQfaOTd5wCGSPNhoaW3nq0E-9wqyP7ny93xRs/edit?usp=sharing).
- [2024/07] We hosted [the fifth vLLM meetup](https://lu.ma/lp0gyjqr) with AWS! Please find the meetup slides [here](https://docs.google.com/presentation/d/1RgUD8aCfcHocghoP3zmXzck9vX3RCI9yfUAB2Bbcl4Y/edit?usp=sharing). - [2024/07] We hosted [the fifth vLLM meetup](https://lu.ma/lp0gyjqr) with AWS! Please find the meetup slides [here](https://docs.google.com/presentation/d/1RgUD8aCfcHocghoP3zmXzck9vX3RCI9yfUAB2Bbcl4Y/edit?usp=sharing).
- [2024/07] In partnership with Meta, vLLM officially supports Llama 3.1 with FP8 quantization and pipeline parallelism! Please check out our blog post [here](https://blog.vllm.ai/2024/07/23/llama31.html). - [2024/07] In partnership with Meta, vLLM officially supports Llama 3.1 with FP8 quantization and pipeline parallelism! Please check out our blog post [here](https://blog.vllm.ai/2024/07/23/llama31.html).
...@@ -51,7 +42,7 @@ vLLM is fast with: ...@@ -51,7 +42,7 @@ vLLM is fast with:
- Speculative decoding - Speculative decoding
- Chunked prefill - Chunked prefill
**Performance benchmark**: We include a [performance benchmark](https://buildkite.com/vllm/performance-benchmark/builds/4068) that compares the performance of vLLM against other LLM serving engines ([TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [text-generation-inference](https://github.com/huggingface/text-generation-inference) and [lmdeploy](https://github.com/InternLM/lmdeploy)). **Performance benchmark**: We include a performance benchmark at the end of [our blog post](https://blog.vllm.ai/2024/09/05/perf-update.html). It compares the performance of vLLM against other LLM serving engines ([TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [SGLang](https://github.com/sgl-project/sglang) and [LMDeploy](https://github.com/InternLM/lmdeploy)). The implementation is under [nightly-benchmarks folder](.buildkite/nightly-benchmarks/) and you can [reproduce](https://github.com/vllm-project/vllm/issues/8176) this benchmark using our one-click runnable script.
vLLM is flexible and easy to use with: vLLM is flexible and easy to use with:
...@@ -136,5 +127,6 @@ If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs ...@@ -136,5 +127,6 @@ If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs
* For technical questions and feature requests, please use Github issues or discussions. * For technical questions and feature requests, please use Github issues or discussions.
* For discussing with fellow users, please use Discord. * For discussing with fellow users, please use Discord.
* For coordinating contributions and development, please use Slack.
* For security disclosures, please use Github's security advisory feature. * For security disclosures, please use Github's security advisory feature.
* For collaborations and partnerships, please contact us at vllm-questions AT lists.berkeley.edu. * For collaborations and partnerships, please contact us at vllm-questions AT lists.berkeley.edu.
\ No newline at end of file
...@@ -2,11 +2,10 @@ ...@@ -2,11 +2,10 @@
## Reporting a Vulnerability ## Reporting a Vulnerability
If you believe you have found a security vulnerability in vLLM, we encourage you to let us know right away. If you believe you have found a security vulnerability in vLLM, we encourage you to let us know right away. We will investigate all legitimate reports and do our best to quickly fix the problem.
We will investigate all legitimate reports and do our best to quickly fix the problem.
Please report security issues using https://github.com/vllm-project/vllm/security/advisories/new Please report security issues privately using [the vulnerability submission form](https://github.com/vllm-project/vllm/security/advisories/new).
--- ---
Please see PyTorch Security for more information how to securely interact with models: https://github.com/pytorch/pytorch/blob/main/SECURITY.md
This document mostly references the recommendation from PyTorch, thank you! Please see [PyTorch's Security Policy](https://github.com/pytorch/pytorch/blob/main/SECURITY.md) for more information and recommendations on how to securely interact with models.
...@@ -23,9 +23,9 @@ class RequestFuncInput: ...@@ -23,9 +23,9 @@ class RequestFuncInput:
output_len: int output_len: int
model: str model: str
best_of: int = 1 best_of: int = 1
use_beam_search: bool = False
logprobs: Optional[int] = None logprobs: Optional[int] = None
multi_modal_content: Optional[dict] = None multi_modal_content: Optional[dict] = None
ignore_eos: bool = False
@dataclass @dataclass
...@@ -48,13 +48,13 @@ async def async_request_tgi( ...@@ -48,13 +48,13 @@ async def async_request_tgi(
assert api_url.endswith("generate_stream") assert api_url.endswith("generate_stream")
async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session: async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
assert not request_func_input.use_beam_search
params = { params = {
"best_of": request_func_input.best_of, "best_of": request_func_input.best_of,
"max_new_tokens": request_func_input.output_len, "max_new_tokens": request_func_input.output_len,
"do_sample": True, "do_sample": True,
"temperature": 0.01, # TGI does not accept 0.0 temperature. "temperature": 0.01, # TGI does not accept 0.0 temperature.
"top_p": 0.99, # TGI does not accept 1.0 top_p. "top_p": 0.99, # TGI does not accept 1.0 top_p.
# TGI does not accept ignore_eos flag.
} }
payload = { payload = {
"inputs": request_func_input.prompt, "inputs": request_func_input.prompt,
...@@ -119,7 +119,6 @@ async def async_request_trt_llm( ...@@ -119,7 +119,6 @@ async def async_request_trt_llm(
assert api_url.endswith("generate_stream") assert api_url.endswith("generate_stream")
async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session: async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
assert not request_func_input.use_beam_search
assert request_func_input.best_of == 1 assert request_func_input.best_of == 1
payload = { payload = {
"accumulate_tokens": True, "accumulate_tokens": True,
...@@ -129,6 +128,8 @@ async def async_request_trt_llm( ...@@ -129,6 +128,8 @@ async def async_request_trt_llm(
"max_tokens": request_func_input.output_len, "max_tokens": request_func_input.output_len,
"stream": True, "stream": True,
} }
if request_func_input.ignore_eos:
payload["min_length"] = request_func_input.output_len
output = RequestFuncOutput() output = RequestFuncOutput()
output.prompt_len = request_func_input.prompt_len output.prompt_len = request_func_input.prompt_len
...@@ -183,7 +184,6 @@ async def async_request_deepspeed_mii( ...@@ -183,7 +184,6 @@ async def async_request_deepspeed_mii(
) -> RequestFuncOutput: ) -> RequestFuncOutput:
async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session: async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
assert request_func_input.best_of == 1 assert request_func_input.best_of == 1
assert not request_func_input.use_beam_search
payload = { payload = {
"prompt": request_func_input.prompt, "prompt": request_func_input.prompt,
...@@ -231,7 +231,6 @@ async def async_request_openai_completions( ...@@ -231,7 +231,6 @@ async def async_request_openai_completions(
), "OpenAI Completions API URL must end with 'completions' or 'profile'." ), "OpenAI Completions API URL must end with 'completions' or 'profile'."
async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session: async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
assert not request_func_input.use_beam_search
payload = { payload = {
"model": request_func_input.model, "model": request_func_input.model,
"prompt": request_func_input.prompt, "prompt": request_func_input.prompt,
...@@ -240,6 +239,7 @@ async def async_request_openai_completions( ...@@ -240,6 +239,7 @@ async def async_request_openai_completions(
"max_tokens": request_func_input.output_len, "max_tokens": request_func_input.output_len,
"logprobs": request_func_input.logprobs, "logprobs": request_func_input.logprobs,
"stream": True, "stream": True,
"ignore_eos": request_func_input.ignore_eos,
} }
headers = { headers = {
"Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}" "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"
...@@ -312,7 +312,6 @@ async def async_request_openai_chat_completions( ...@@ -312,7 +312,6 @@ async def async_request_openai_chat_completions(
), "OpenAI Chat Completions API URL must end with 'chat/completions'." ), "OpenAI Chat Completions API URL must end with 'chat/completions'."
async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session: async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
assert not request_func_input.use_beam_search
content = [{"type": "text", "text": request_func_input.prompt}] content = [{"type": "text", "text": request_func_input.prompt}]
if request_func_input.multi_modal_content: if request_func_input.multi_modal_content:
content.append(request_func_input.multi_modal_content) content.append(request_func_input.multi_modal_content)
...@@ -327,6 +326,7 @@ async def async_request_openai_chat_completions( ...@@ -327,6 +326,7 @@ async def async_request_openai_chat_completions(
"temperature": 0.0, "temperature": 0.0,
"max_tokens": request_func_input.output_len, "max_tokens": request_func_input.output_len,
"stream": True, "stream": True,
"ignore_eos": request_func_input.ignore_eos,
} }
headers = { headers = {
"Content-Type": "application/json", "Content-Type": "application/json",
...@@ -430,4 +430,5 @@ ASYNC_REQUEST_FUNCS = { ...@@ -430,4 +430,5 @@ ASYNC_REQUEST_FUNCS = {
"openai-chat": async_request_openai_chat_completions, "openai-chat": async_request_openai_chat_completions,
"tensorrt-llm": async_request_trt_llm, "tensorrt-llm": async_request_trt_llm,
"scalellm": async_request_openai_completions, "scalellm": async_request_openai_completions,
"sglang": async_request_openai_completions,
} }
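Note on the hunks above: both OpenAI-style request functions now forward `ignore_eos` in the request body. A minimal sketch of that payload construction follows, assuming a simplified input type. `RequestInput` and `build_completions_payload` are hypothetical names used only for illustration, and the field list is a subset of what the benchmark actually sends.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class RequestInput:
    # Hypothetical stand-in for the benchmark's RequestFuncInput; only the
    # fields used below are modeled.
    model: str
    prompt: str
    output_len: int
    logprobs: Optional[int] = None
    ignore_eos: bool = False


def build_completions_payload(req: RequestInput) -> Dict[str, Any]:
    """Build a streaming completions request body from one benchmark request."""
    return {
        "model": req.model,
        "prompt": req.prompt,
        "max_tokens": req.output_len,
        "logprobs": req.logprobs,
        "stream": True,
        # Forwarded as-is; a server that does not know this field may
        # ignore or reject it.
        "ignore_eos": req.ignore_eos,
    }


payload = build_completions_payload(
    RequestInput(model="my-model", prompt="Hello", output_len=16,
                 ignore_eos=True))
print(payload["ignore_eos"], payload["max_tokens"])
```

Keeping the flag on the input object rather than hard-coding it lets one request function serve several backends registered under different names.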
...@@ -11,7 +11,7 @@ from tqdm import tqdm ...@@ -11,7 +11,7 @@ from tqdm import tqdm
from vllm import LLM, SamplingParams from vllm import LLM, SamplingParams
from vllm.engine.arg_utils import DEVICE_OPTIONS, EngineArgs from vllm.engine.arg_utils import DEVICE_OPTIONS, EngineArgs
from vllm.inputs import PromptInputs from vllm.inputs import PromptType
from vllm.model_executor.layers.quantization import QUANTIZATION_METHODS from vllm.model_executor.layers.quantization import QUANTIZATION_METHODS
from vllm.utils import FlexibleArgumentParser from vllm.utils import FlexibleArgumentParser
...@@ -38,7 +38,6 @@ def main(args: argparse.Namespace): ...@@ -38,7 +38,6 @@ def main(args: argparse.Namespace):
quantization_param_path=args.quantization_param_path, quantization_param_path=args.quantization_param_path,
device=args.device, device=args.device,
ray_workers_use_nsight=args.ray_workers_use_nsight, ray_workers_use_nsight=args.ray_workers_use_nsight,
use_v2_block_manager=args.use_v2_block_manager,
enable_chunked_prefill=args.enable_chunked_prefill, enable_chunked_prefill=args.enable_chunked_prefill,
download_dir=args.download_dir, download_dir=args.download_dir,
block_size=args.block_size, block_size=args.block_size,
...@@ -51,9 +50,8 @@ def main(args: argparse.Namespace): ...@@ -51,9 +50,8 @@ def main(args: argparse.Namespace):
sampling_params = SamplingParams( sampling_params = SamplingParams(
n=args.n, n=args.n,
temperature=0.0 if args.use_beam_search else 1.0, temperature=1.0,
top_p=1.0, top_p=1.0,
use_beam_search=args.use_beam_search,
ignore_eos=True, ignore_eos=True,
max_tokens=args.output_len, max_tokens=args.output_len,
) )
...@@ -61,7 +59,7 @@ def main(args: argparse.Namespace): ...@@ -61,7 +59,7 @@ def main(args: argparse.Namespace):
dummy_prompt_token_ids = np.random.randint(10000, dummy_prompt_token_ids = np.random.randint(10000,
size=(args.batch_size, size=(args.batch_size,
args.input_len)) args.input_len))
dummy_inputs: List[PromptInputs] = [{ dummy_prompts: List[PromptType] = [{
"prompt_token_ids": batch "prompt_token_ids": batch
} for batch in dummy_prompt_token_ids.tolist()] } for batch in dummy_prompt_token_ids.tolist()]
...@@ -74,13 +72,13 @@ def main(args: argparse.Namespace): ...@@ -74,13 +72,13 @@ def main(args: argparse.Namespace):
], ],
on_trace_ready=torch.profiler.tensorboard_trace_handler( on_trace_ready=torch.profiler.tensorboard_trace_handler(
str(profile_dir))) as p: str(profile_dir))) as p:
llm.generate(dummy_inputs, llm.generate(dummy_prompts,
sampling_params=sampling_params, sampling_params=sampling_params,
use_tqdm=False) use_tqdm=False)
print(p.key_averages()) print(p.key_averages())
else: else:
start_time = time.perf_counter() start_time = time.perf_counter()
llm.generate(dummy_inputs, llm.generate(dummy_prompts,
sampling_params=sampling_params, sampling_params=sampling_params,
use_tqdm=False) use_tqdm=False)
end_time = time.perf_counter() end_time = time.perf_counter()
...@@ -222,7 +220,6 @@ if __name__ == '__main__': ...@@ -222,7 +220,6 @@ if __name__ == '__main__':
parser.add_argument("--enable-prefix-caching", parser.add_argument("--enable-prefix-caching",
action='store_true', action='store_true',
help="Enable automatic prefix caching") help="Enable automatic prefix caching")
parser.add_argument('--use-v2-block-manager', action='store_true')
parser.add_argument( parser.add_argument(
"--ray-workers-use-nsight", "--ray-workers-use-nsight",
action='store_true', action='store_true',
......
...@@ -113,7 +113,7 @@ def repeat_and_sort_requests(requests: List[Tuple[str, int, int]], ...@@ -113,7 +113,7 @@ def repeat_and_sort_requests(requests: List[Tuple[str, int, int]],
def main(args): def main(args):
tokenizer = get_tokenizer(args.model, trust_remote_code=True) tokenizer = get_tokenizer(args.model, trust_remote_code=True)
input_length_range = tuple(map(int, args.input_length_range.split(':'))) input_length_range = tuple(map(int, args.input_length_range.split(':')))
random.seed(args.seed)
if args.dataset_path is not None: if args.dataset_path is not None:
print(f"Start to sample {args.num_prompts} prompts " print(f"Start to sample {args.num_prompts} prompts "
f"from {args.dataset_path}") f"from {args.dataset_path}")
...@@ -133,7 +133,6 @@ def main(args): ...@@ -133,7 +133,6 @@ def main(args):
tokenizer_mode='auto', tokenizer_mode='auto',
trust_remote_code=True, trust_remote_code=True,
enforce_eager=True, enforce_eager=True,
use_v2_block_manager=args.use_v2_block_manager,
tensor_parallel_size=args.tensor_parallel_size, tensor_parallel_size=args.tensor_parallel_size,
enable_prefix_caching=args.enable_prefix_caching) enable_prefix_caching=args.enable_prefix_caching)
...@@ -175,9 +174,6 @@ if __name__ == "__main__": ...@@ -175,9 +174,6 @@ if __name__ == "__main__":
parser.add_argument('--enable-prefix-caching', parser.add_argument('--enable-prefix-caching',
action='store_true', action='store_true',
help='enable prefix caching') help='enable prefix caching')
parser.add_argument('--use-v2-block-manager',
action='store_true',
help='Use BlockSpaceMangerV2')
parser.add_argument('--num-prompts', parser.add_argument('--num-prompts',
type=int, type=int,
default=1, default=1,
...@@ -194,5 +190,9 @@ if __name__ == "__main__": ...@@ -194,5 +190,9 @@ if __name__ == "__main__":
default='128:256', default='128:256',
help='Range of input lengths for sampling prompts, ' help='Range of input lengths for sampling prompts, '
'specified as "min:max" (e.g., "128:256").') 'specified as "min:max" (e.g., "128:256").')
parser.add_argument("--seed",
type=int,
default=0,
help='Random seed for reproducibility')
args = parser.parse_args() args = parser.parse_args()
main(args) main(args)
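Note on the hunks above: the script now seeds Python's `random` module from a `--seed` argument before sampling prompts. A small sketch of why that makes runs repeatable is shown below. `sample_lengths` is a hypothetical helper, and the 128 to 256 range simply reuses the default of `--input-length-range`.

```python
import argparse
import random


def sample_lengths(seed: int, count: int, low: int, high: int) -> list:
    """Draw `count` integer lengths in [low, high] after seeding the RNG."""
    random.seed(seed)
    return [random.randint(low, high) for _ in range(count)]


parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, default=0,
                    help="Random seed for reproducibility")
args = parser.parse_args([])  # empty argv, so the default seed of 0 is used

first = sample_lengths(args.seed, 5, 128, 256)
second = sample_lengths(args.seed, 5, 128, 256)
print(first == second)
```

Seeding only affects Python's `random` module; code that draws from NumPy or PyTorch generators would need its own seeding.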
...@@ -68,7 +68,6 @@ def run_vllm( ...@@ -68,7 +68,6 @@ def run_vllm(
tensor_parallel_size: int, tensor_parallel_size: int,
seed: int, seed: int,
n: int, n: int,
use_beam_search: bool,
trust_remote_code: bool, trust_remote_code: bool,
dtype: str, dtype: str,
max_model_len: Optional[int], max_model_len: Optional[int],
...@@ -114,9 +113,8 @@ def run_vllm( ...@@ -114,9 +113,8 @@ def run_vllm(
sampling_params.append( sampling_params.append(
SamplingParams( SamplingParams(
n=n, n=n,
temperature=0.0 if use_beam_search else 1.0, temperature=1.0,
top_p=1.0, top_p=1.0,
use_beam_search=use_beam_search,
ignore_eos=True, ignore_eos=True,
max_tokens=output_len, max_tokens=output_len,
)) ))
...@@ -144,15 +142,16 @@ def main(args: argparse.Namespace): ...@@ -144,15 +142,16 @@ def main(args: argparse.Namespace):
args.output_len) args.output_len)
if args.backend == "vllm": if args.backend == "vllm":
elapsed_time = run_vllm( elapsed_time = run_vllm(requests, args.model, args.tokenizer,
requests, args.model, args.tokenizer, args.quantization, args.quantization, args.tensor_parallel_size,
args.tensor_parallel_size, args.seed, args.n, args.use_beam_search, args.seed, args.n, args.trust_remote_code,
args.trust_remote_code, args.dtype, args.max_model_len, args.dtype, args.max_model_len,
args.enforce_eager, args.kv_cache_dtype, args.enforce_eager, args.kv_cache_dtype,
args.quantization_param_path, args.device, args.quantization_param_path, args.device,
args.enable_prefix_caching, args.enable_chunked_prefill, args.enable_prefix_caching,
args.max_num_batched_tokens, args.gpu_memory_utilization, args.enable_chunked_prefill,
args.download_dir) args.max_num_batched_tokens,
args.gpu_memory_utilization, args.download_dir)
else: else:
raise ValueError(f"Unknown backend: {args.backend}") raise ValueError(f"Unknown backend: {args.backend}")
total_num_tokens = sum(prompt_len + output_len total_num_tokens = sum(prompt_len + output_len
...@@ -203,7 +202,6 @@ if __name__ == "__main__": ...@@ -203,7 +202,6 @@ if __name__ == "__main__":
type=int, type=int,
default=1, default=1,
help="Number of generated sequences per prompt.") help="Number of generated sequences per prompt.")
parser.add_argument("--use-beam-search", action="store_true")
parser.add_argument("--num-prompts", parser.add_argument("--num-prompts",
type=int, type=int,
default=200, default=200,
......
"""Benchmark online serving throughput. r"""Benchmark online serving throughput.
On the server side, run one of the following commands: On the server side, run one of the following commands:
vLLM OpenAI API server vLLM OpenAI API server
...@@ -89,10 +89,8 @@ def sample_sharegpt_requests( ...@@ -89,10 +89,8 @@ def sample_sharegpt_requests(
tokenizer: PreTrainedTokenizerBase, tokenizer: PreTrainedTokenizerBase,
fixed_output_len: Optional[int] = None, fixed_output_len: Optional[int] = None,
) -> List[Tuple[str, int, int, None]]: ) -> List[Tuple[str, int, int, None]]:
if fixed_output_len is not None and fixed_output_len < 4:
raise ValueError("output_len too small")
# Load the dataset. # Load the dataset.
with open(dataset_path) as f: with open(dataset_path, encoding='utf-8') as f:
dataset = json.load(f) dataset = json.load(f)
# Filter out the conversations with less than 2 turns. # Filter out the conversations with less than 2 turns.
dataset = [data for data in dataset if len(data["conversations"]) >= 2] dataset = [data for data in dataset if len(data["conversations"]) >= 2]
...@@ -117,7 +115,7 @@ def sample_sharegpt_requests( ...@@ -117,7 +115,7 @@ def sample_sharegpt_requests(
prompt_len = len(prompt_token_ids) prompt_len = len(prompt_token_ids)
output_len = len(completion_token_ids output_len = len(completion_token_ids
) if fixed_output_len is None else fixed_output_len ) if fixed_output_len is None else fixed_output_len
if prompt_len < 4 or output_len < 4: if prompt_len < 4 or (fixed_output_len is None and output_len < 4):
# Prune too short sequences. # Prune too short sequences.
continue continue
if prompt_len > 1024 or prompt_len + output_len > 2048: if prompt_len > 1024 or prompt_len + output_len > 2048:
...@@ -141,7 +139,7 @@ def sample_sonnet_requests( ...@@ -141,7 +139,7 @@ def sample_sonnet_requests(
), "'args.sonnet-input-len' must be greater than 'args.prefix-input-len'." ), "'args.sonnet-input-len' must be greater than 'args.prefix-input-len'."
# Load the dataset. # Load the dataset.
with open(dataset_path) as f: with open(dataset_path, encoding='utf-8') as f:
poem_lines = f.readlines() poem_lines = f.readlines()
# Tokenize the poem lines. # Tokenize the poem lines.
...@@ -178,9 +176,9 @@ def sample_sonnet_requests( ...@@ -178,9 +176,9 @@ def sample_sonnet_requests(
# Sample the rest of lines per request. # Sample the rest of lines per request.
sampled_requests: List[Tuple[str, int, int]] = [] sampled_requests: List[Tuple[str, int, int]] = []
for _ in range(num_requests): for _ in range(num_requests):
sampled_lines = "".join( num_lines_needed = num_input_lines - num_prefix_lines
prefix_lines + sampled_lines = "".join(prefix_lines +
random.sample(poem_lines, num_input_lines - num_prefix_lines)) random.choices(poem_lines, k=num_lines_needed))
prompt = f"{base_prompt}{sampled_lines}" prompt = f"{base_prompt}{sampled_lines}"
message = [ message = [
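Note on the hunk above: replacing `random.sample` with `random.choices` changes the sampling from without replacement to with replacement. The sketch below, using a made-up three-line "poem", shows the practical difference when more lines are requested than the dataset contains.

```python
import random

poem_lines = ["line one\n", "line two\n", "line three\n"]
num_lines_needed = 5  # deliberately larger than len(poem_lines)

# random.sample draws without replacement, so it cannot return more items
# than the population holds.
try:
    random.sample(poem_lines, num_lines_needed)
    sample_failed = False
except ValueError:
    sample_failed = True

# random.choices draws with replacement, so any k works; lines may repeat.
picked = random.choices(poem_lines, k=num_lines_needed)
print(sample_failed, len(picked))
```

The trade-off is that prompts built with `random.choices` can contain repeated lines, which may matter if prefix caching is being measured.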
...@@ -228,10 +226,11 @@ def sample_hf_requests( ...@@ -228,10 +226,11 @@ def sample_hf_requests(
prompt_len = len(prompt_token_ids) prompt_len = len(prompt_token_ids)
output_len = len(completion_token_ids output_len = len(completion_token_ids
) if fixed_output_len is None else fixed_output_len ) if fixed_output_len is None else fixed_output_len
if prompt_len < 4 or output_len < 4: if fixed_output_len is None and (prompt_len < 4 or output_len < 4):
# Prune too short sequences. # Prune too short sequences.
continue continue
if prompt_len > 1024 or prompt_len + output_len > 2048: if fixed_output_len is None and \
(prompt_len > 1024 or prompt_len + output_len > 2048):
# Prune too long sequences. # Prune too long sequences.
continue continue
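Note on the hunk above: in `sample_hf_requests`, both pruning checks now apply only when no fixed output length was requested. The predicate below is a sketch of that logic as a standalone function. `keep_request` is a hypothetical name; the thresholds are the ones in the diff.

```python
from typing import Optional


def keep_request(prompt_len: int, output_len: int,
                 fixed_output_len: Optional[int]) -> bool:
    """Return True if a sampled request should be kept.

    Length pruning applies only when no fixed output length was requested.
    """
    if fixed_output_len is not None:
        return True
    if prompt_len < 4 or output_len < 4:
        return False  # too short
    if prompt_len > 1024 or prompt_len + output_len > 2048:
        return False  # too long
    return True


print(keep_request(2, 128, 128), keep_request(2, 100, None))
```

Note that `sample_sharegpt_requests` differs: there the short-prompt and too-long checks still run when a fixed output length is given, and only the short-output check is skipped.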
...@@ -392,12 +391,12 @@ async def benchmark( ...@@ -392,12 +391,12 @@ async def benchmark(
input_requests: List[Tuple[str, int, int]], input_requests: List[Tuple[str, int, int]],
logprobs: Optional[int], logprobs: Optional[int],
best_of: int, best_of: int,
use_beam_search: bool,
request_rate: float, request_rate: float,
disable_tqdm: bool, disable_tqdm: bool,
profile: bool, profile: bool,
selected_percentile_metrics: List[str], selected_percentile_metrics: List[str],
selected_percentiles: List[str], selected_percentiles: List[str],
ignore_eos: bool,
): ):
if backend in ASYNC_REQUEST_FUNCS: if backend in ASYNC_REQUEST_FUNCS:
request_func = ASYNC_REQUEST_FUNCS[backend] request_func = ASYNC_REQUEST_FUNCS[backend]
...@@ -419,8 +418,8 @@ async def benchmark( ...@@ -419,8 +418,8 @@ async def benchmark(
output_len=test_output_len, output_len=test_output_len,
logprobs=logprobs, logprobs=logprobs,
best_of=best_of, best_of=best_of,
use_beam_search=use_beam_search,
multi_modal_content=test_mm_content, multi_modal_content=test_mm_content,
ignore_eos=ignore_eos,
) )
test_output = await request_func(request_func_input=test_input) test_output = await request_func(request_func_input=test_input)
if not test_output.success: if not test_output.success:
...@@ -432,17 +431,15 @@ async def benchmark( ...@@ -432,17 +431,15 @@ async def benchmark(
if profile: if profile:
print("Starting profiler...") print("Starting profiler...")
profile_input = RequestFuncInput( profile_input = RequestFuncInput(model=model_id,
model=model_id, prompt=test_prompt,
prompt=test_prompt, api_url=base_url + "/start_profile",
api_url=base_url + "/start_profile", prompt_len=test_prompt_len,
prompt_len=test_prompt_len, output_len=test_output_len,
output_len=test_output_len, logprobs=logprobs,
logprobs=logprobs, best_of=best_of,
best_of=best_of, multi_modal_content=test_mm_content,
use_beam_search=use_beam_search, ignore_eos=ignore_eos)
multi_modal_content=test_mm_content,
)
profile_output = await request_func(request_func_input=profile_input) profile_output = await request_func(request_func_input=profile_input)
if profile_output.success: if profile_output.success:
print("Profiler started") print("Profiler started")
...@@ -455,17 +452,15 @@ async def benchmark( ...@@ -455,17 +452,15 @@ async def benchmark(
tasks: List[asyncio.Task] = [] tasks: List[asyncio.Task] = []
async for request in get_request(input_requests, request_rate): async for request in get_request(input_requests, request_rate):
prompt, prompt_len, output_len, mm_content = request prompt, prompt_len, output_len, mm_content = request
request_func_input = RequestFuncInput( request_func_input = RequestFuncInput(model=model_id,
model=model_id, prompt=prompt,
prompt=prompt, api_url=api_url,
api_url=api_url, prompt_len=prompt_len,
prompt_len=prompt_len, output_len=output_len,
output_len=output_len, logprobs=logprobs,
logprobs=logprobs, best_of=best_of,
best_of=best_of, multi_modal_content=mm_content,
use_beam_search=use_beam_search, ignore_eos=ignore_eos)
multi_modal_content=mm_content,
)
tasks.append( tasks.append(
asyncio.create_task( asyncio.create_task(
request_func(request_func_input=request_func_input, request_func(request_func_input=request_func_input,
...@@ -482,7 +477,6 @@ async def benchmark( ...@@ -482,7 +477,6 @@ async def benchmark(
output_len=test_output_len, output_len=test_output_len,
logprobs=logprobs, logprobs=logprobs,
best_of=best_of, best_of=best_of,
use_beam_search=use_beam_search,
) )
profile_output = await request_func(request_func_input=profile_input) profile_output = await request_func(request_func_input=profile_input)
if profile_output.success: if profile_output.success:
...@@ -540,7 +534,7 @@ async def benchmark( ...@@ -540,7 +534,7 @@ async def benchmark(
# E.g., "Time to First Token" # E.g., "Time to First Token"
metric_header: str, metric_header: str,
): ):
# This function print and add statistics of the specified # This function prints and adds statistics of the specified
# metric. # metric.
if metric_attribute_name not in selected_percentile_metrics: if metric_attribute_name not in selected_percentile_metrics:
return return
...@@ -678,7 +672,6 @@ def main(args: argparse.Namespace): ...@@ -678,7 +672,6 @@ def main(args: argparse.Namespace):
input_requests=input_requests, input_requests=input_requests,
logprobs=args.logprobs, logprobs=args.logprobs,
best_of=args.best_of, best_of=args.best_of,
use_beam_search=args.use_beam_search,
request_rate=args.request_rate, request_rate=args.request_rate,
disable_tqdm=args.disable_tqdm, disable_tqdm=args.disable_tqdm,
profile=args.profile, profile=args.profile,
...@@ -686,6 +679,7 @@ def main(args: argparse.Namespace): ...@@ -686,6 +679,7 @@ def main(args: argparse.Namespace):
selected_percentiles=[ selected_percentiles=[
float(p) for p in args.metric_percentiles.split(",") float(p) for p in args.metric_percentiles.split(",")
], ],
ignore_eos=args.ignore_eos,
)) ))
# Save config and results to json # Save config and results to json
...@@ -699,7 +693,6 @@ def main(args: argparse.Namespace): ...@@ -699,7 +693,6 @@ def main(args: argparse.Namespace):
result_json["model_id"] = model_id result_json["model_id"] = model_id
result_json["tokenizer_id"] = tokenizer_id result_json["tokenizer_id"] = tokenizer_id
result_json["best_of"] = args.best_of result_json["best_of"] = args.best_of
result_json["use_beam_search"] = args.use_beam_search
result_json["num_prompts"] = args.num_prompts result_json["num_prompts"] = args.num_prompts
# Metadata # Metadata
...@@ -727,7 +720,7 @@ def main(args: argparse.Namespace): ...@@ -727,7 +720,7 @@ def main(args: argparse.Namespace):
file_name = args.result_filename file_name = args.result_filename
if args.result_dir: if args.result_dir:
file_name = os.path.join(args.result_dir, file_name) file_name = os.path.join(args.result_dir, file_name)
with open(file_name, "w") as outfile: with open(file_name, "w", encoding='utf-8') as outfile:
json.dump(result_json, outfile) json.dump(result_json, outfile)
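Note on the hunks above: the file opens now pass `encoding='utf-8'` because the default encoding of `open()` depends on the platform locale. A short round-trip sketch follows. The result values are made up, and `ensure_ascii=False` is an addition for illustration that the benchmark script does not use.

```python
import json
import os
import tempfile

# Made-up result values, including non-ASCII text.
result_json = {"model_id": "my-model", "note": "latência p99"}

with tempfile.TemporaryDirectory() as result_dir:
    file_name = os.path.join(result_dir, "result.json")
    # Explicit encoding on both write and read keeps the bytes on disk
    # independent of the machine's locale.
    with open(file_name, "w", encoding="utf-8") as outfile:
        json.dump(result_json, outfile, ensure_ascii=False)
    with open(file_name, encoding="utf-8") as infile:
        loaded = json.load(infile)

print(loaded == result_json)
```

With `json.dump`'s default `ensure_ascii=True` the output is plain ASCII anyway, so the explicit encoding matters most on the read side, for example when loading a dataset that contains non-ASCII text.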
...@@ -864,6 +857,11 @@ if __name__ == "__main__": ...@@ -864,6 +857,11 @@ if __name__ == "__main__":
"{backend}-{args.request_rate}qps-{base_model_id}-{current_dt}.json" "{backend}-{args.request_rate}qps-{base_model_id}-{current_dt}.json"
" format.", " format.",
) )
parser.add_argument(
"--ignore-eos",
action="store_true",
help="Set ignore_eos flag when sending the benchmark request. "
"Warning: ignore_eos is not supported in deepspeed_mii and tgi.")
parser.add_argument( parser.add_argument(
"--percentile-metrics", "--percentile-metrics",
type=str, type=str,
...@@ -963,4 +961,4 @@ if __name__ == "__main__": ...@@ -963,4 +961,4 @@ if __name__ == "__main__":
) )
args = parser.parse_args() args = parser.parse_args()
main(args) main(args)
\ No newline at end of file
...@@ -12,11 +12,11 @@ from tqdm import tqdm ...@@ -12,11 +12,11 @@ from tqdm import tqdm
from transformers import (AutoModelForCausalLM, AutoTokenizer, from transformers import (AutoModelForCausalLM, AutoTokenizer,
PreTrainedTokenizerBase) PreTrainedTokenizerBase)
from vllm.inputs import PromptInputs
from vllm.engine.arg_utils import DEVICE_OPTIONS, AsyncEngineArgs, EngineArgs from vllm.engine.arg_utils import DEVICE_OPTIONS, AsyncEngineArgs, EngineArgs
from vllm.entrypoints.openai.api_server import ( from vllm.entrypoints.openai.api_server import (
build_async_engine_client_from_engine_args) build_async_engine_client_from_engine_args)
from vllm.model_executor.layers.quantization import QUANTIZATION_METHODS from vllm.model_executor.layers.quantization import QUANTIZATION_METHODS
from vllm.sampling_params import BeamSearchParams
from vllm.utils import FlexibleArgumentParser, merge_async_iterators from vllm.utils import FlexibleArgumentParser, merge_async_iterators
...@@ -75,7 +75,6 @@ def run_vllm( ...@@ -75,7 +75,6 @@ def run_vllm(
tensor_parallel_size: int, tensor_parallel_size: int,
seed: int, seed: int,
n: int, n: int,
use_beam_search: bool,
trust_remote_code: bool, trust_remote_code: bool,
dtype: str, dtype: str,
max_model_len: Optional[int], max_model_len: Optional[int],
...@@ -89,11 +88,9 @@ def run_vllm( ...@@ -89,11 +88,9 @@ def run_vllm(
distributed_executor_backend: Optional[str], distributed_executor_backend: Optional[str],
gpu_memory_utilization: float = 0.9, gpu_memory_utilization: float = 0.9,
num_scheduler_steps: int = 1, num_scheduler_steps: int = 1,
use_v2_block_manager: bool = False,
download_dir: Optional[str] = None, download_dir: Optional[str] = None,
load_format: str = EngineArgs.load_format, load_format: str = EngineArgs.load_format,
disable_async_output_proc: bool = False, disable_async_output_proc: bool = False,
use_new_beam_search_impl: bool = False,
) -> float: ) -> float:
from vllm import LLM, SamplingParams from vllm import LLM, SamplingParams
llm = LLM( llm = LLM(
...@@ -117,7 +114,6 @@ def run_vllm( ...@@ -117,7 +114,6 @@ def run_vllm(
distributed_executor_backend=distributed_executor_backend, distributed_executor_backend=distributed_executor_backend,
load_format=load_format, load_format=load_format,
num_scheduler_steps=num_scheduler_steps, num_scheduler_steps=num_scheduler_steps,
use_v2_block_manager=use_v2_block_manager,
disable_async_output_proc=disable_async_output_proc, disable_async_output_proc=disable_async_output_proc,
) )
...@@ -129,13 +125,12 @@ def run_vllm( ...@@ -129,13 +125,12 @@ def run_vllm(
sampling_params.append( sampling_params.append(
SamplingParams( SamplingParams(
n=n, n=n,
temperature=0.0 if use_beam_search else 1.0, temperature=1.0,
top_p=1.0, top_p=1.0,
use_beam_search=use_beam_search,
ignore_eos=True, ignore_eos=True,
max_tokens=output_len, max_tokens=output_len,
)) ))
# warmup # warmup
warmup_prompts = [] warmup_prompts = []
warmup_sampling_params = [] warmup_sampling_params = []
...@@ -144,9 +139,8 @@ def run_vllm( ...@@ -144,9 +139,8 @@ def run_vllm(
warmup_sampling_params.append( warmup_sampling_params.append(
SamplingParams( SamplingParams(
n=n, n=n,
temperature=0.0 if use_beam_search else 1.0, temperature=1.0,
top_p=1.0, top_p=1.0,
use_beam_search=use_beam_search,
ignore_eos=True, ignore_eos=True,
max_tokens=output_len, max_tokens=output_len,
)) ))
...@@ -158,7 +152,7 @@ def run_vllm( ...@@ -158,7 +152,7 @@ def run_vllm(
# dummy_prompt_token_ids = np.random.randint(10000, # dummy_prompt_token_ids = np.random.randint(10000,
# size=(args.num_prompts, # size=(args.num_prompts,
# args.input_len)) # args.input_len))
# dummy_inputs: List[PromptInputs] = [{ # dummy_prompts: List[PromptType] = [{
# "prompt_token_ids": batch # "prompt_token_ids": batch
# } for batch in dummy_prompt_token_ids.tolist()] # } for batch in dummy_prompt_token_ids.tolist()]
...@@ -171,22 +165,27 @@ def run_vllm( ...@@ -171,22 +165,27 @@ def run_vllm(
# for _ in tqdm(range(args.num_iters_warmup), desc="Warmup iterations"): # for _ in tqdm(range(args.num_iters_warmup), desc="Warmup iterations"):
# run_to_completion() # run_to_completion()
if not use_new_beam_search_impl:
use_beam_search = False
if not use_beam_search:
start = time.perf_counter() start = time.perf_counter()
llm.generate(prompts, sampling_params, use_tqdm=True) llm.generate(prompts, sampling_params, use_tqdm=True)
end = time.perf_counter() end = time.perf_counter()
else: else:
assert use_beam_search
prompts = [prompt for prompt, _, _ in requests] prompts = [prompt for prompt, _, _ in requests]
# output_len should be the same for all requests. # output_len should be the same for all requests.
output_len = requests[0][2] output_len = requests[0][2]
for prompt, input_len, _output_len in requests: for prompt, input_len, _output_len in requests:
assert _output_len == output_len assert _output_len == output_len
start = time.perf_counter() start = time.perf_counter()
llm.beam_search(prompts, llm.beam_search(
beam_width=n, prompts,
max_tokens=output_len, BeamSearchParams(
ignore_eos=True) beam_width=n,
max_tokens=output_len,
ignore_eos=True,
))
end = time.perf_counter() end = time.perf_counter()
return end - start return end - start
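Note on the hunk above: the beam-search path checks that every request shares one output length, then times a single call with `time.perf_counter`. The sketch below isolates those two steps without depending on vLLM. `common_output_len` and `timed` are hypothetical helpers, and the sketch raises `ValueError` where the script uses `assert`.

```python
import time
from typing import Callable, List, Tuple


def common_output_len(requests: List[Tuple[str, int, int]]) -> int:
    """Return the output length shared by all requests, or raise if they differ."""
    output_len = requests[0][2]
    for _prompt, _input_len, this_output_len in requests:
        if this_output_len != output_len:
            raise ValueError("all requests must share one output length")
    return output_len


def timed(fn: Callable[[], object]) -> float:
    """Run fn once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    fn()
    end = time.perf_counter()
    return end - start


requests = [("prompt a", 3, 8), ("prompt b", 5, 8)]
elapsed = timed(lambda: common_output_len(requests))
print(common_output_len(requests), elapsed >= 0)
```

In the benchmark itself, the timed call would be the engine's generation or beam-search call rather than the cheap check used here.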
...@@ -199,7 +198,6 @@ async def run_vllm_async( ...@@ -199,7 +198,6 @@ async def run_vllm_async(
tensor_parallel_size: int, tensor_parallel_size: int,
seed: int, seed: int,
n: int, n: int,
use_beam_search: bool,
trust_remote_code: bool, trust_remote_code: bool,
dtype: str, dtype: str,
max_model_len: Optional[int], max_model_len: Optional[int],
...@@ -213,7 +211,6 @@ async def run_vllm_async( ...@@ -213,7 +211,6 @@ async def run_vllm_async(
distributed_executor_backend: Optional[str], distributed_executor_backend: Optional[str],
gpu_memory_utilization: float = 0.9, gpu_memory_utilization: float = 0.9,
num_scheduler_steps: int = 1, num_scheduler_steps: int = 1,
use_v2_block_manager: bool = False,
download_dir: Optional[str] = None, download_dir: Optional[str] = None,
load_format: str = EngineArgs.load_format, load_format: str = EngineArgs.load_format,
disable_async_output_proc: bool = False, disable_async_output_proc: bool = False,
...@@ -241,7 +238,6 @@ async def run_vllm_async( ...@@ -241,7 +238,6 @@ async def run_vllm_async(
distributed_executor_backend=distributed_executor_backend, distributed_executor_backend=distributed_executor_backend,
load_format=load_format, load_format=load_format,
num_scheduler_steps=num_scheduler_steps, num_scheduler_steps=num_scheduler_steps,
use_v2_block_manager=use_v2_block_manager,
disable_async_output_proc=disable_async_output_proc, disable_async_output_proc=disable_async_output_proc,
worker_use_ray=False, worker_use_ray=False,
disable_log_requests=True, disable_log_requests=True,
...@@ -258,9 +254,8 @@ async def run_vllm_async( ...@@ -258,9 +254,8 @@ async def run_vllm_async(
sampling_params.append( sampling_params.append(
SamplingParams( SamplingParams(
n=n, n=n,
temperature=0.0 if use_beam_search else 1.0, temperature=1.0,
top_p=1.0, top_p=1.0,
use_beam_search=use_beam_search,
ignore_eos=True, ignore_eos=True,
max_tokens=output_len, max_tokens=output_len,
)) ))
...@@ -282,11 +277,9 @@ def run_hf( ...@@ -282,11 +277,9 @@ def run_hf(
model: str, model: str,
tokenizer: PreTrainedTokenizerBase, tokenizer: PreTrainedTokenizerBase,
n: int, n: int,
use_beam_search: bool,
max_batch_size: int, max_batch_size: int,
trust_remote_code: bool, trust_remote_code: bool,
) -> float: ) -> float:
assert not use_beam_search
llm = AutoModelForCausalLM.from_pretrained( llm = AutoModelForCausalLM.from_pretrained(
model, torch_dtype=torch.float16, trust_remote_code=trust_remote_code) model, torch_dtype=torch.float16, trust_remote_code=trust_remote_code)
if llm.config.model_type == "llama": if llm.config.model_type == "llama":
...@@ -318,7 +311,7 @@ def run_hf( ...@@ -318,7 +311,7 @@ def run_hf(
padding=True).input_ids padding=True).input_ids
llm_outputs = llm.generate( llm_outputs = llm.generate(
input_ids=input_ids.cuda(), input_ids=input_ids.cuda(),
do_sample=not use_beam_search, do_sample=True,
num_return_sequences=n, num_return_sequences=n,
temperature=1.0, temperature=1.0,
top_p=1.0, top_p=1.0,
...@@ -378,40 +371,37 @@ def main(args: argparse.Namespace): ...@@ -378,40 +371,37 @@ def main(args: argparse.Namespace):
if args.async_engine: if args.async_engine:
run_args = [ run_args = [
requests, args.model, args.tokenizer, args.quantization, requests, args.model, args.tokenizer, args.quantization,
args.tensor_parallel_size, args.seed, args.n, args.use_beam_search, args.tensor_parallel_size, args.seed, args.n,
args.trust_remote_code, args.dtype, args.max_model_len, args.trust_remote_code, args.dtype, args.max_model_len,
args.enforce_eager, args.kv_cache_dtype, args.enforce_eager, args.kv_cache_dtype,
args.quantization_param_path, args.device, args.quantization_param_path, args.device,
args.enable_prefix_caching, args.enable_chunked_prefill, args.enable_prefix_caching, args.enable_chunked_prefill,
args.max_num_batched_tokens, args.distributed_executor_backend, args.max_num_batched_tokens, args.distributed_executor_backend,
args.gpu_memory_utilization, args.num_scheduler_steps, args.gpu_memory_utilization, args.num_scheduler_steps,
args.use_v2_block_manager, args.download_dir, args.load_format, args.download_dir, args.load_format, args.disable_async_output_proc
args.disable_async_output_proc
] ]
else: else:
run_args = [ run_args = [
warmup_requests, requests, args.model, args.tokenizer, args.quantization, warmup_requests, requests, args.model, args.tokenizer, args.quantization,
args.tensor_parallel_size, args.seed, args.n, args.use_beam_search, args.tensor_parallel_size, args.seed, args.n,
args.trust_remote_code, args.dtype, args.max_model_len, args.trust_remote_code, args.dtype, args.max_model_len,
args.enforce_eager, args.kv_cache_dtype, args.enforce_eager, args.kv_cache_dtype,
args.quantization_param_path, args.device, args.quantization_param_path, args.device,
args.enable_prefix_caching, args.enable_chunked_prefill, args.enable_prefix_caching, args.enable_chunked_prefill,
args.max_num_batched_tokens, args.distributed_executor_backend, args.max_num_batched_tokens, args.distributed_executor_backend,
args.gpu_memory_utilization, args.num_scheduler_steps, args.gpu_memory_utilization, args.num_scheduler_steps,
args.use_v2_block_manager, args.download_dir, args.load_format, args.download_dir, args.load_format, args.disable_async_output_proc
args.disable_async_output_proc
] ]
if args.async_engine: if args.async_engine:
run_args.append(args.disable_frontend_multiprocessing) run_args.append(args.disable_frontend_multiprocessing)
elapsed_time = uvloop.run(run_vllm_async(*run_args)) elapsed_time = uvloop.run(run_vllm_async(*run_args))
else: else:
elapsed_time = run_vllm(*run_args, args.use_new_beam_search_impl) elapsed_time = run_vllm(*run_args)
elif args.backend == "hf": elif args.backend == "hf":
assert args.tensor_parallel_size == 1 assert args.tensor_parallel_size == 1
elapsed_time = run_hf(requests, args.model, tokenizer, args.n, elapsed_time = run_hf(requests, args.model, tokenizer, args.n,
args.use_beam_search, args.hf_max_batch_size, args.hf_max_batch_size, args.trust_remote_code)
args.trust_remote_code)
elif args.backend == "mii": elif args.backend == "mii":
elapsed_time = run_mii(requests, args.model, args.tensor_parallel_size, elapsed_time = run_mii(requests, args.model, args.tensor_parallel_size,
args.output_len) args.output_len)
...@@ -473,12 +463,10 @@ if __name__ == "__main__": ...@@ -473,12 +463,10 @@ if __name__ == "__main__":
type=int, type=int,
default=1, default=1,
help="Number of generated sequences per prompt.") help="Number of generated sequences per prompt.")
parser.add_argument("--use-beam-search", action="store_true")
parser.add_argument('--num-iters-warmup', parser.add_argument('--num-iters-warmup',
type=int, type=int,
default=1, default=1,
help='Number of iterations to run for warmup.') help='Number of iterations to run for warmup.')
parser.add_argument("--use-new-beam-search-impl", action="store_true")
parser.add_argument("--num-prompts", parser.add_argument("--num-prompts",
type=int, type=int,
default=1000, default=1000,
...@@ -543,9 +531,6 @@ if __name__ == "__main__": ...@@ -543,9 +531,6 @@ if __name__ == "__main__":
type=int, type=int,
default=1, default=1,
help="Maximum number of forward steps per scheduler call.") help="Maximum number of forward steps per scheduler call.")
parser.add_argument("--use-v2-block-manager",
action='store_true',
help="Enable block manager v2.")
parser.add_argument( parser.add_argument(
"--enable-prefix-caching", "--enable-prefix-caching",
action='store_true', action='store_true',
...@@ -633,8 +618,6 @@ if __name__ == "__main__": ...@@ -633,8 +618,6 @@ if __name__ == "__main__":
raise ValueError("dtype must be auto for MII backend.") raise ValueError("dtype must be auto for MII backend.")
if args.n != 1: if args.n != 1:
raise ValueError("n must be 1 for MII backend.") raise ValueError("n must be 1 for MII backend.")
if args.use_beam_search:
raise ValueError("Beam search is not supported for MII backend.")
if args.quantization is not None: if args.quantization is not None:
raise ValueError("Quantization is only for vLLM backend.") raise ValueError("Quantization is only for vLLM backend.")
if args.hf_max_batch_size is not None: if args.hf_max_batch_size is not None:
......
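The hunks above drop `--use-beam-search`, `--use-new-beam-search-impl`, and `--use-v2-block-manager` from the benchmark's argument parser. As a minimal sketch of what that means for existing launch scripts, the snippet below rebuilds only the three options visible in this diff (`--n`, `--num-iters-warmup`, `--num-prompts`) and reports any removed flag still present on a command line. `build_parser`, `REMOVED_FLAGS`, and `find_removed_flags` are hypothetical names for illustration and are not part of vLLM.

```python
import argparse

# Flags deleted by this diff; collected here only for the illustration.
REMOVED_FLAGS = (
    "--use-beam-search",
    "--use-new-beam-search-impl",
    "--use-v2-block-manager",
)


def build_parser() -> argparse.ArgumentParser:
    """Rebuild a small subset of the benchmark CLI shown in the diff."""
    parser = argparse.ArgumentParser(description="Sketch of the trimmed CLI.")
    parser.add_argument("--n",
                        type=int,
                        default=1,
                        help="Number of generated sequences per prompt.")
    parser.add_argument("--num-iters-warmup",
                        type=int,
                        default=1,
                        help="Number of iterations to run for warmup.")
    parser.add_argument("--num-prompts", type=int, default=1000)
    return parser


def find_removed_flags(argv: list) -> list:
    """Return the removed flags that still appear in argv."""
    # parse_known_args collects unrecognized arguments instead of exiting.
    _, unknown = build_parser().parse_known_args(argv)
    return [flag for flag in unknown if flag in REMOVED_FLAGS]
```

With the real script, which calls `parse_args()`, a leftover removed flag would most likely abort with an "unrecognized arguments" error, so a check like this is only useful as a pre-flight step in wrapper scripts.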
...@@ -31,7 +31,7 @@ def benchmark_rope_kernels_multi_lora( ...@@ -31,7 +31,7 @@ def benchmark_rope_kernels_multi_lora(
# batched RoPE can take multiple scaling factors # batched RoPE can take multiple scaling factors
batched_rope = get_rope(head_size, rotary_dim, max_position, base, batched_rope = get_rope(head_size, rotary_dim, max_position, base,
is_neox_style, { is_neox_style, {
"type": "linear", "rope_type": "linear",
"factor": tuple(scaling_factors) "factor": tuple(scaling_factors)
}) })
# non-batched RoPE takes only one scaling factor, we create multiple # non-batched RoPE takes only one scaling factor, we create multiple
...@@ -41,7 +41,7 @@ def benchmark_rope_kernels_multi_lora( ...@@ -41,7 +41,7 @@ def benchmark_rope_kernels_multi_lora(
non_batched_ropes.append( non_batched_ropes.append(
get_rope(head_size, rotary_dim, max_position, base, is_neox_style, get_rope(head_size, rotary_dim, max_position, base, is_neox_style,
{ {
"type": "linear", "rope_type": "linear",
"factor": (scaling_factor, ) "factor": (scaling_factor, )
})) }))
......
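The two hunks above rename the scaling dictionary key passed to `get_rope` from `"type"` to `"rope_type"`. For callers that still hold dictionaries in the old shape, a small adapter along these lines may help. This is a sketch under assumptions: `normalize_rope_scaling` is a hypothetical helper, not a vLLM function, and it only handles the key rename visible in this diff.

```python
def normalize_rope_scaling(rope_scaling: dict) -> dict:
    """Return a copy of rope_scaling that uses "rope_type" instead of "type".

    Hypothetical helper for illustration; the input dict is not modified.
    """
    normalized = dict(rope_scaling)
    if "type" in normalized:
        legacy_value = normalized.pop("type")
        # Keep an explicit "rope_type" if the caller already supplied one.
        normalized.setdefault("rope_type", legacy_value)
    return normalized
```

For example, `normalize_rope_scaling({"type": "linear", "factor": (1.0, 2.0)})` yields a dictionary with `"rope_type": "linear"` and the same `"factor"` tuple, matching the shape used on the new side of the diff.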
...@@ -16,7 +16,6 @@ def main(args): ...@@ -16,7 +16,6 @@ def main(args):
enforce_eager=True, enforce_eager=True,
enable_prefix_caching=True, enable_prefix_caching=True,
tensor_parallel_size=args.tensor_parallel_size, tensor_parallel_size=args.tensor_parallel_size,
use_v2_block_manager=args.use_v2_block_manager,
) )
sampling_params = SamplingParams(temperature=0, max_tokens=args.output_len) sampling_params = SamplingParams(temperature=0, max_tokens=args.output_len)
...@@ -56,8 +55,5 @@ if __name__ == "__main__": ...@@ -56,8 +55,5 @@ if __name__ == "__main__":
parser.add_argument('--enable-prefix-caching', parser.add_argument('--enable-prefix-caching',
action='store_true', action='store_true',
help='enable prefix caching') help='enable prefix caching')
parser.add_argument('--use-v2-block-manager',
action='store_true',
help='Use BlockSpaceMangerV2')
args = parser.parse_args() args = parser.parse_args()
main(args) main(args)