Unverified Commit 57f09a41 authored by Ilya Lavrenov's avatar Ilya Lavrenov Committed by GitHub
Browse files

[Hardware][Intel] OpenVINO vLLM backend (#5379)

parent 59326344
# This script build the OpenVINO docker image and run the offline inference inside the container.
# It serves a sanity check for compilation and basic model usage.
set -ex
# Try building the docker image
docker build -t openvino-test -f Dockerfile.openvino .
# Setup cleanup
remove_docker_container() { docker rm -f openvino-test || true; }
trap remove_docker_container EXIT
remove_docker_container
# Run the image and launch offline inference
docker run --network host --env VLLM_OPENVINO_KVCACHE_SPACE=1 --name openvino-test openvino-test python3 /workspace/vllm/examples/offline_inference.py
# The vLLM Dockerfile is used to construct vLLM image that can be directly used
# to run the OpenAI compatible server.
FROM ubuntu:22.04 AS dev
RUN apt-get update -y && \
apt-get install -y python3-pip git
WORKDIR /workspace
# copy requirements
COPY requirements-build.txt /workspace/vllm/
COPY requirements-common.txt /workspace/vllm/
COPY requirements-openvino.txt /workspace/vllm/
COPY vllm/ /workspace/vllm/vllm
COPY setup.py /workspace/vllm/
# install build requirements
RUN PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" python3 -m pip install -r /workspace/vllm/requirements-build.txt
# build vLLM with OpenVINO backend
RUN PIP_PRE=1 PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu https://storage.openvinotoolkit.org/simple/wheels/nightly/" VLLM_TARGET_DEVICE="openvino" python3 -m pip install /workspace/vllm/
COPY examples/ /workspace/vllm/examples
COPY benchmarks/ /workspace/vllm/benchmarks
CMD ["/bin/bash"]
...@@ -207,9 +207,10 @@ if __name__ == '__main__': ...@@ -207,9 +207,10 @@ if __name__ == '__main__':
parser.add_argument( parser.add_argument(
"--device", "--device",
type=str, type=str,
default="cuda", default="auto",
choices=["cuda", "cpu", "tpu", "xpu"], choices=["auto", "cuda", "cpu", "openvino", "tpu", "xpu"],
help='device type for vLLM execution, supporting CUDA and CPU.') help='device type for vLLM execution, supporting CUDA, OpenVINO and '
'CPU.')
parser.add_argument('--block-size', parser.add_argument('--block-size',
type=int, type=int,
default=16, default=16,
......
...@@ -349,9 +349,10 @@ if __name__ == "__main__": ...@@ -349,9 +349,10 @@ if __name__ == "__main__":
parser.add_argument( parser.add_argument(
"--device", "--device",
type=str, type=str,
default="cuda", default="auto",
choices=["cuda", "cpu", "tpu", "xpu"], choices=["auto", "cuda", "cpu", "openvino", "tpu", "xpu"],
help='device type for vLLM execution, supporting CUDA and CPU.') help='device type for vLLM execution, supporting CUDA, OpenVINO and '
'CPU.')
parser.add_argument( parser.add_argument(
"--enable-prefix-caching", "--enable-prefix-caching",
action='store_true', action='store_true',
......
.. _installation_openvino:
Installation with OpenVINO
==========================
vLLM powered by OpenVINO supports all LLM models from :doc:`vLLM supported models list <../models/supported_models>` and can perform optimal model serving on all x86-64 CPUs with, at least, AVX2 support. OpenVINO vLLM backend supports the following advanced vLLM features:
- Prefix caching (``--enable-prefix-caching``)
- Chunked prefill (``--enable-chunked-prefill``)
**Table of contents**:
- :ref:`Requirements <openvino_backend_requirements>`
- :ref:`Quick start using Dockerfile <openvino_backend_quick_start_dockerfile>`
- :ref:`Build from source <install_openvino_backend_from_source>`
- :ref:`Performance tips <openvino_backend_performance_tips>`
- :ref:`Limitations <openvino_backend_limitations>`
.. _openvino_backend_requirements:
Requirements
------------
* OS: Linux
* Instruction set architecture (ISA) requirement: at least AVX2.
.. _openvino_backend_quick_start_dockerfile:
Quick start using Dockerfile
----------------------------
.. code-block:: console
$ docker build -f Dockerfile.openvino -t vllm-openvino-env .
$ docker run -it --rm vllm-openvino-env
.. _install_openvino_backend_from_source:
Install from source
-------------------
- First, install Python. For example, on Ubuntu 22.04, you can run:
.. code-block:: console
$ sudo apt-get update -y
$ sudo apt-get install python3
- Second, install prerequisites vLLM OpenVINO backend installation:
.. code-block:: console
$ pip install --upgrade pip
$ pip install -r requirements-build.txt --extra-index-url https://download.pytorch.org/whl/cpu
- Finally, install vLLM with OpenVINO backend:
.. code-block:: console
$ PIP_PRE=1 PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu https://storage.openvinotoolkit.org/simple/wheels/nightly/" VLLM_TARGET_DEVICE=openvino python -m pip install -v .
.. _openvino_backend_performance_tips:
Performance tips
----------------
vLLM OpenVINO backend uses the following environment variables to control behavior:
- ``VLLM_OPENVINO_KVCACHE_SPACE`` to specify the KV Cache size (e.g, ``VLLM_OPENVINO_KVCACHE_SPACE=40`` means 40 GB space for KV cache), larger setting will allow vLLM running more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users.
- ``VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8`` to control KV cache precision. By default, FP16 / BF16 is used depending on platform.
- ``VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON`` to enable U8 weights compression during model loading stage. By default, compression is turned off.
To enable better TPOT / TTFT latency, you can use vLLM's chunked prefill feature (``--enable-chunked-prefill``). Based on the experiments, the recommended batch size is ``256`` (``--max-num-batched-tokens``)
OpenVINO best known configuration is:
.. code-block:: console
$ VLLM_OPENVINO_KVCACHE_SPACE=100 VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8 VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON \
python3 vllm/benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-chat-hf --dataset vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json --enable-chunked-prefill --max-num-batched-tokens 256
.. _openvino_backend_limitations:
Limitations
-----------
- LoRA serving is not supported.
- Only LLM models are currently supported. LLaVa and encoder-decoder models are not currently enabled in vLLM OpenVINO integration.
- Tensor and pipeline parallelism are not currently enabled in vLLM integration.
- Speculative sampling is not tested within vLLM integration.
...@@ -63,6 +63,7 @@ Documentation ...@@ -63,6 +63,7 @@ Documentation
getting_started/installation getting_started/installation
getting_started/amd-installation getting_started/amd-installation
getting_started/openvino-installation
getting_started/cpu-installation getting_started/cpu-installation
getting_started/neuron-installation getting_started/neuron-installation
getting_started/tpu-installation getting_started/tpu-installation
......
# Common dependencies
-r requirements-common.txt
# OpenVINO dependencies
torch >= 2.1.2
openvino ~= 2024.3.0.dev
optimum-intel[openvino] >= 1.17.2
triton >= 2.2.0 # FIXME(woosuk): This is a hack to avoid import error.
...@@ -233,6 +233,10 @@ def _is_cpu() -> bool: ...@@ -233,6 +233,10 @@ def _is_cpu() -> bool:
return VLLM_TARGET_DEVICE == "cpu" return VLLM_TARGET_DEVICE == "cpu"
def _is_openvino() -> bool:
return VLLM_TARGET_DEVICE == "openvino"
def _is_xpu() -> bool: def _is_xpu() -> bool:
return VLLM_TARGET_DEVICE == "xpu" return VLLM_TARGET_DEVICE == "xpu"
...@@ -337,6 +341,8 @@ def get_vllm_version() -> str: ...@@ -337,6 +341,8 @@ def get_vllm_version() -> str:
if neuron_version != MAIN_CUDA_VERSION: if neuron_version != MAIN_CUDA_VERSION:
neuron_version_str = neuron_version.replace(".", "")[:3] neuron_version_str = neuron_version.replace(".", "")[:3]
version += f"+neuron{neuron_version_str}" version += f"+neuron{neuron_version_str}"
elif _is_openvino():
version += "+openvino"
elif _is_tpu(): elif _is_tpu():
version += "+tpu" version += "+tpu"
elif _is_cpu(): elif _is_cpu():
...@@ -388,6 +394,8 @@ def get_requirements() -> List[str]: ...@@ -388,6 +394,8 @@ def get_requirements() -> List[str]:
requirements = _read_requirements("requirements-rocm.txt") requirements = _read_requirements("requirements-rocm.txt")
elif _is_neuron(): elif _is_neuron():
requirements = _read_requirements("requirements-neuron.txt") requirements = _read_requirements("requirements-neuron.txt")
elif _is_openvino():
requirements = _read_requirements("requirements-openvino.txt")
elif _is_tpu(): elif _is_tpu():
requirements = _read_requirements("requirements-tpu.txt") requirements = _read_requirements("requirements-tpu.txt")
elif _is_cpu(): elif _is_cpu():
...@@ -396,7 +404,8 @@ def get_requirements() -> List[str]: ...@@ -396,7 +404,8 @@ def get_requirements() -> List[str]:
requirements = _read_requirements("requirements-xpu.txt") requirements = _read_requirements("requirements-xpu.txt")
else: else:
raise ValueError( raise ValueError(
"Unsupported platform, please use CUDA, ROCm, Neuron, or CPU.") "Unsupported platform, please use CUDA, ROCm, Neuron, "
"OpenVINO, or CPU.")
return requirements return requirements
......
...@@ -9,8 +9,8 @@ from vllm.attention.selector import which_attn_to_use ...@@ -9,8 +9,8 @@ from vllm.attention.selector import which_attn_to_use
@pytest.mark.parametrize( @pytest.mark.parametrize(
"name", ["TORCH_SDPA", "ROCM_FLASH", "XFORMERS", "FLASHINFER"]) "name", ["TORCH_SDPA", "ROCM_FLASH", "XFORMERS", "FLASHINFER", "OPENVINO"])
@pytest.mark.parametrize("device", ["cpu", "hip", "cuda"]) @pytest.mark.parametrize("device", ["cpu", "openvino", "hip", "cuda"])
def test_env(name: str, device: str, monkeypatch): def test_env(name: str, device: str, monkeypatch):
"""Test that the attention selector can be set via environment variable. """Test that the attention selector can be set via environment variable.
Note that we do not test FlashAttn because it is the default backend. Note that we do not test FlashAttn because it is the default backend.
...@@ -28,6 +28,11 @@ def test_env(name: str, device: str, monkeypatch): ...@@ -28,6 +28,11 @@ def test_env(name: str, device: str, monkeypatch):
backend = which_attn_to_use(8, 16, 8, None, torch.float16, backend = which_attn_to_use(8, 16, 8, None, torch.float16,
torch.float16, 16) torch.float16, 16)
assert backend.name == "ROCM_FLASH" assert backend.name == "ROCM_FLASH"
elif device == "openvino":
with patch("vllm.attention.selector.is_openvino", return_value=True):
backend = which_attn_to_use(8, 16, 8, None, torch.float16,
torch.float16, 16)
assert backend.name == "OPENVINO"
else: else:
backend = which_attn_to_use(8, 16, 8, None, torch.float16, backend = which_attn_to_use(8, 16, 8, None, torch.float16,
torch.float16, 16) torch.float16, 16)
......
from dataclasses import dataclass
from typing import List, Tuple
import openvino as ov
import torch
from vllm.attention.backends.abstract import (AttentionBackend,
AttentionMetadata)
class OpenVINOAttentionBackend(AttentionBackend):
@staticmethod
def get_name() -> str:
return "openvino"
@staticmethod
def get_impl_cls():
# OpenVINO implements PagedAttention as part of the Optimum
# exported model
raise NotImplementedError
@staticmethod
def make_metadata(*args, **kwargs) -> "AttentionMetadata":
raise NotImplementedError
@staticmethod
def make_openvino_metadata(*args, **kwargs) -> "OpenVINOAttentionMetadata":
return OpenVINOAttentionMetadata(*args, **kwargs)
@staticmethod
def get_kv_cache_shape(
num_blocks: int,
block_size: int,
num_kv_heads: int,
head_size: int,
) -> Tuple[int, ...]:
return (2, num_blocks, num_kv_heads, block_size, head_size)
@staticmethod
def swap_blocks(
src_kv_cache: ov.Tensor,
dst_kv_cache: ov.Tensor,
src_to_dst: torch.Tensor,
) -> None:
# OpenVINO currently supports only CPU, which does not require
# swap of KV cache blocks
raise NotImplementedError
@staticmethod
def copy_blocks(
kv_caches: List[Tuple[ov.Tensor, ov.Tensor]],
src_to_dists: List[Tuple[int, int]],
) -> None:
for src, dst in src_to_dists:
for key_cache, value_cache in kv_caches:
key_cache.data[dst, :] = key_cache.data[src, :]
value_cache.data[dst, :] = value_cache.data[src, :]
@dataclass
class OpenVINOAttentionMetadata:
"""Metadata for OpenVINOAttentionBackend.
Basic terms used below:
- batch_size_in_sequences - total number of sequences to execute​
- prompt_lens – per sequence size number of scheduled tokens​
- batch_size_in_tokens = sum(prompt_lens)​
- max_context_len = max(context_lens)​
- max_num_blocks = div_up(max_context_len / BLOCK_SIZE)​
- num_blocks – total number of blocks in block_indices​
"""
# Describes past KV cache size for each sequence within a batch
# Shape: [batch_size_in_sequences]
# Type: i32​
past_lens: torch.Tensor
# Describes start indices of input / speculative tokens from
# current sequences within a batch sequence​
# Shape: [batch_size_in_sequences + 1]​
# Type: i32
subsequence_begins: torch.Tensor
# Describes block tables for each sequence within a batch​ -
# indices along 0th dimension in key_cache and value_cache inputs​
# Shape: [num_blocks]
# Type: i32​
block_indices: torch.Tensor
# Describes block tables for each sequence within a batch​ -
# for i-th element, it is an index in block_indices with the
# first block belonging to i-th sequence​
# Shape: [batch_size_in_sequences + 1]
# Type: i32​
block_indices_begins: torch.Tensor
# Describes max context length
# Shape: scalar
# Type: i32
max_context_len: torch.Tensor
...@@ -7,7 +7,7 @@ import torch ...@@ -7,7 +7,7 @@ import torch
import vllm.envs as envs import vllm.envs as envs
from vllm.attention.backends.abstract import AttentionBackend from vllm.attention.backends.abstract import AttentionBackend
from vllm.logger import init_logger from vllm.logger import init_logger
from vllm.utils import is_cpu, is_hip, is_tpu, is_xpu from vllm.utils import is_cpu, is_hip, is_openvino, is_tpu, is_xpu
logger = init_logger(__name__) logger = init_logger(__name__)
...@@ -17,6 +17,7 @@ class _Backend(enum.Enum): ...@@ -17,6 +17,7 @@ class _Backend(enum.Enum):
XFORMERS = enum.auto() XFORMERS = enum.auto()
ROCM_FLASH = enum.auto() ROCM_FLASH = enum.auto()
TORCH_SDPA = enum.auto() TORCH_SDPA = enum.auto()
OPENVINO = enum.auto()
FLASHINFER = enum.auto() FLASHINFER = enum.auto()
PALLAS = enum.auto() PALLAS = enum.auto()
IPEX = enum.auto() IPEX = enum.auto()
...@@ -64,6 +65,10 @@ def get_attn_backend( ...@@ -64,6 +65,10 @@ def get_attn_backend(
logger.info("Using Torch SDPA backend.") logger.info("Using Torch SDPA backend.")
from vllm.attention.backends.torch_sdpa import TorchSDPABackend from vllm.attention.backends.torch_sdpa import TorchSDPABackend
return TorchSDPABackend return TorchSDPABackend
elif backend == _Backend.OPENVINO:
logger.info("Using OpenVINO Attention backend.")
from vllm.attention.backends.openvino import OpenVINOAttentionBackend
return OpenVINOAttentionBackend
elif backend == _Backend.IPEX: elif backend == _Backend.IPEX:
assert is_xpu(), RuntimeError( assert is_xpu(), RuntimeError(
"IPEX attention backend is only used for the XPU device.") "IPEX attention backend is only used for the XPU device.")
...@@ -113,6 +118,11 @@ def which_attn_to_use( ...@@ -113,6 +118,11 @@ def which_attn_to_use(
logger.info("Cannot use %s backend on CPU.", selected_backend) logger.info("Cannot use %s backend on CPU.", selected_backend)
return _Backend.TORCH_SDPA return _Backend.TORCH_SDPA
if is_openvino():
if selected_backend != _Backend.OPENVINO:
logger.info("Cannot use %s backend on OpenVINO.", selected_backend)
return _Backend.OPENVINO
if is_xpu(): if is_xpu():
if selected_backend != _Backend.IPEX: if selected_backend != _Backend.IPEX:
logger.info("Cannot use %s backend on XPU.", selected_backend) logger.info("Cannot use %s backend on XPU.", selected_backend)
......
...@@ -14,8 +14,8 @@ from vllm.model_executor.models import ModelRegistry ...@@ -14,8 +14,8 @@ from vllm.model_executor.models import ModelRegistry
from vllm.tracing import is_otel_installed from vllm.tracing import is_otel_installed
from vllm.transformers_utils.config import get_config, get_hf_text_config from vllm.transformers_utils.config import get_config, get_hf_text_config
from vllm.utils import (cuda_device_count_stateless, get_cpu_memory, is_cpu, from vllm.utils import (cuda_device_count_stateless, get_cpu_memory, is_cpu,
is_hip, is_neuron, is_tpu, is_xpu, print_warning_once, is_hip, is_neuron, is_openvino, is_tpu, is_xpu,
update_environment_variables) print_warning_once, update_environment_variables)
if TYPE_CHECKING: if TYPE_CHECKING:
from ray.util.placement_group import PlacementGroup from ray.util.placement_group import PlacementGroup
...@@ -781,6 +781,8 @@ class DeviceConfig: ...@@ -781,6 +781,8 @@ class DeviceConfig:
# Automated device type detection # Automated device type detection
if is_neuron(): if is_neuron():
self.device_type = "neuron" self.device_type = "neuron"
elif is_openvino():
self.device_type = "openvino"
elif is_tpu(): elif is_tpu():
self.device_type = "tpu" self.device_type = "tpu"
elif is_cpu(): elif is_cpu():
...@@ -796,7 +798,7 @@ class DeviceConfig: ...@@ -796,7 +798,7 @@ class DeviceConfig:
self.device_type = device self.device_type = device
# Some device types require processing inputs on CPU # Some device types require processing inputs on CPU
if self.device_type in ["neuron"]: if self.device_type in ["neuron", "openvino"]:
self.device = torch.device("cpu") self.device = torch.device("cpu")
elif self.device_type in ["tpu"]: elif self.device_type in ["tpu"]:
self.device = None self.device = None
......
...@@ -504,12 +504,14 @@ class EngineArgs: ...@@ -504,12 +504,14 @@ class EngineArgs:
'Enabling this will use the fully sharded layers. ' 'Enabling this will use the fully sharded layers. '
'At high sequence length, max rank or ' 'At high sequence length, max rank or '
'tensor parallel size, this is likely faster.')) 'tensor parallel size, this is likely faster.'))
parser.add_argument( parser.add_argument("--device",
"--device", type=str,
type=str, default=EngineArgs.device,
default=EngineArgs.device, choices=[
choices=["auto", "cuda", "neuron", "cpu", "tpu", "xpu"], "auto", "cuda", "neuron", "cpu", "openvino",
help='Device type for vLLM execution.') "tpu", "xpu"
],
help='Device type for vLLM execution.')
# Related to Vision-language models such as llava # Related to Vision-language models such as llava
parser = EngineArgs.add_cli_args_for_vlm(parser) parser = EngineArgs.add_cli_args_for_vlm(parser)
......
...@@ -393,6 +393,12 @@ class AsyncLLMEngine: ...@@ -393,6 +393,12 @@ class AsyncLLMEngine:
"Distributed execution is not supported with the CPU backend.") "Distributed execution is not supported with the CPU backend.")
from vllm.executor.cpu_executor import CPUExecutorAsync from vllm.executor.cpu_executor import CPUExecutorAsync
executor_class = CPUExecutorAsync executor_class = CPUExecutorAsync
elif engine_config.device_config.device_type == "openvino":
assert distributed_executor_backend is None, (
"Distributed execution is not supported with "
"the OpenVINO backend.")
from vllm.executor.openvino_executor import OpenVINOExecutorAsync
executor_class = OpenVINOExecutorAsync
elif engine_config.device_config.device_type == "xpu": elif engine_config.device_config.device_type == "xpu":
if distributed_executor_backend is None: if distributed_executor_backend is None:
from vllm.executor.xpu_executor import XPUExecutorAsync from vllm.executor.xpu_executor import XPUExecutorAsync
......
...@@ -363,6 +363,9 @@ class LLMEngine: ...@@ -363,6 +363,9 @@ class LLMEngine:
elif engine_config.device_config.device_type == "cpu": elif engine_config.device_config.device_type == "cpu":
from vllm.executor.cpu_executor import CPUExecutor from vllm.executor.cpu_executor import CPUExecutor
executor_class = CPUExecutor executor_class = CPUExecutor
elif engine_config.device_config.device_type == "openvino":
from vllm.executor.openvino_executor import OpenVINOExecutor
executor_class = OpenVINOExecutor
elif engine_config.device_config.device_type == "xpu": elif engine_config.device_config.device_type == "xpu":
if distributed_executor_backend == "ray": if distributed_executor_backend == "ray":
initialize_ray_cluster(engine_config.parallel_config) initialize_ray_cluster(engine_config.parallel_config)
......
...@@ -28,6 +28,9 @@ if TYPE_CHECKING: ...@@ -28,6 +28,9 @@ if TYPE_CHECKING:
VLLM_TRACE_FUNCTION: int = 0 VLLM_TRACE_FUNCTION: int = 0
VLLM_ATTENTION_BACKEND: Optional[str] = None VLLM_ATTENTION_BACKEND: Optional[str] = None
VLLM_CPU_KVCACHE_SPACE: int = 0 VLLM_CPU_KVCACHE_SPACE: int = 0
VLLM_OPENVINO_KVCACHE_SPACE: int = 0
VLLM_OPENVINO_CPU_KV_CACHE_PRECISION: Optional[str] = None
VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS: bool = False
VLLM_XLA_CACHE_PATH: str = "~/.vllm/xla_cache/" VLLM_XLA_CACHE_PATH: str = "~/.vllm/xla_cache/"
VLLM_USE_RAY_COMPILED_DAG: bool = False VLLM_USE_RAY_COMPILED_DAG: bool = False
VLLM_WORKER_MULTIPROC_METHOD: str = "fork" VLLM_WORKER_MULTIPROC_METHOD: str = "fork"
...@@ -49,7 +52,8 @@ environment_variables: Dict[str, Callable[[], Any]] = { ...@@ -49,7 +52,8 @@ environment_variables: Dict[str, Callable[[], Any]] = {
# ================== Installation Time Env Vars ================== # ================== Installation Time Env Vars ==================
# Target device of vLLM, supporting [cuda (by default), rocm, neuron, cpu] # Target device of vLLM, supporting [cuda (by default),
# rocm, neuron, cpu, openvino]
"VLLM_TARGET_DEVICE": "VLLM_TARGET_DEVICE":
lambda: os.getenv("VLLM_TARGET_DEVICE", "cuda"), lambda: os.getenv("VLLM_TARGET_DEVICE", "cuda"),
...@@ -208,6 +212,22 @@ environment_variables: Dict[str, Callable[[], Any]] = { ...@@ -208,6 +212,22 @@ environment_variables: Dict[str, Callable[[], Any]] = {
"VLLM_CPU_KVCACHE_SPACE": "VLLM_CPU_KVCACHE_SPACE":
lambda: int(os.getenv("VLLM_CPU_KVCACHE_SPACE", "0")), lambda: int(os.getenv("VLLM_CPU_KVCACHE_SPACE", "0")),
# OpenVINO key-value cache space
# default is 4GB
"VLLM_OPENVINO_KVCACHE_SPACE":
lambda: int(os.getenv("VLLM_OPENVINO_KVCACHE_SPACE", "0")),
# OpenVINO KV cache precision
# default is bf16 if natively supported by platform, otherwise f16
# To enable KV cache compression, please, explicitly specify u8
"VLLM_OPENVINO_CPU_KV_CACHE_PRECISION":
lambda: os.getenv("VLLM_OPENVINO_CPU_KV_CACHE_PRECISION", None),
# Enables weights compression during model export via HF Optimum
# default is False
"VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS":
lambda: bool(os.getenv("VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS", False)),
# If the env var is set, it uses the Ray's compiled DAG API # If the env var is set, it uses the Ray's compiled DAG API
# which optimizes the control plane overhead. # which optimizes the control plane overhead.
# Run vLLM with VLLM_USE_RAY_COMPILED_DAG=1 to enable it. # Run vLLM with VLLM_USE_RAY_COMPILED_DAG=1 to enable it.
......
from typing import List, Set, Tuple
import openvino as ov
import openvino.properties.hint as hints
import torch
import vllm.envs as envs
from vllm.config import CacheConfig, ModelConfig
from vllm.executor.executor_base import ExecutorAsyncBase, ExecutorBase
from vllm.logger import init_logger
from vllm.lora.request import LoRARequest
from vllm.sequence import ExecuteModelRequest, SamplerOutput
from vllm.utils import (get_distributed_init_method, get_ip, get_open_port,
make_async)
logger = init_logger(__name__)
class OpenVINOExecutor(ExecutorBase):
def _init_executor(self) -> None:
assert self.device_config.device_type == "openvino"
assert self.lora_config is None, "OpenVINO backend doesn't support LoRA"
self.model_config = _verify_and_get_model_config(self.model_config)
self.cache_config = _verify_and_get_cache_config(self.cache_config)
# Instantiate the worker and load the model to CPU.
self._init_worker()
def _init_worker(self):
from vllm.worker.openvino_worker import OpenVINOWorker
assert (
self.parallel_config.world_size == 1
), "OpenVINOExecutor only supports single CPU socket currently."
distributed_init_method = get_distributed_init_method(
get_ip(), get_open_port())
self.driver_worker = OpenVINOWorker(
model_config=self.model_config,
parallel_config=self.parallel_config,
scheduler_config=self.scheduler_config,
device_config=self.device_config,
cache_config=self.cache_config,
load_config=self.load_config,
local_rank=0,
rank=0,
distributed_init_method=distributed_init_method,
lora_config=self.lora_config,
vision_language_config=self.vision_language_config,
kv_cache_dtype=self.cache_config.cache_dtype,
is_driver_worker=True,
)
self.driver_worker.init_device()
self.driver_worker.load_model()
def determine_num_available_blocks(self) -> Tuple[int, int]:
"""Determine the number of available KV blocks by invoking the
underlying worker.
"""
return self.driver_worker.determine_num_available_blocks()
def initialize_cache(self, num_gpu_blocks: int,
num_cpu_blocks: int) -> None:
"""Initialize the KV cache by invoking the underlying worker."""
# NOTE: We log here to avoid multiple logs when number of workers is
# greater than one. We could log in the engine, but not all executors
# have GPUs.
# NOTE: `cpu block` for OpenVINO backend is located on CPU memory but is
# referred as `gpu block`. Because we want to reuse the existing block
# management procedure.
logger.info("# CPU blocks: %d", num_gpu_blocks)
self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
def execute_model(
self,
execute_model_req: ExecuteModelRequest) -> List[SamplerOutput]:
output = self.driver_worker.execute_model(execute_model_req)
return output
def add_lora(self, lora_request: LoRARequest) -> bool:
return self.driver_worker.add_lora(lora_request)
def remove_lora(self, lora_id: int) -> bool:
return self.driver_worker.remove_lora(lora_id)
def pin_lora(self, lora_id: int) -> bool:
return self.driver_worker.pin_lora(lora_id)
def list_loras(self) -> Set[int]:
return self.driver_worker.list_loras()
def check_health(self) -> None:
# OpenVINOExecutor will always be healthy as long as
# it's running.
return
class OpenVINOExecutorAsync(OpenVINOExecutor, ExecutorAsyncBase):
async def execute_model_async(
self,
execute_model_req: ExecuteModelRequest) -> List[SamplerOutput]:
output = await make_async(self.driver_worker.execute_model
)(execute_model_req=execute_model_req, )
return output
async def check_health_async(self) -> None:
# OpenVINOExecutor will always be healthy as long as
# it's running.
return
def _verify_and_get_model_config(config: ModelConfig) -> ModelConfig:
if config.dtype != torch.float32:
logger.warning(
f"Only float32 dtype is supported on OpenVINO, casting from {config.dtype}." # noqa: G004, E501
)
config.dtype = torch.float32
if not config.enforce_eager:
logger.warning(
"CUDA graph is not supported on OpenVINO backend, fallback to the "
"eager mode.")
config.enforce_eager = True
return config
def _verify_and_get_cache_config(config: CacheConfig) -> CacheConfig:
if envs.VLLM_OPENVINO_CPU_KV_CACHE_PRECISION == "u8":
logger.info("KV cache type is overried to u8 via "
"VLLM_OPENVINO_CPU_KV_CACHE_PRECISION env var.")
config.cache_dtype = ov.Type.u8
else:
core = ov.Core()
inference_precision = core.get_property("CPU",
hints.inference_precision)
if inference_precision == ov.Type.bf16:
config.cache_dtype = ov.Type.bf16
else:
config.cache_dtype = ov.Type.f16
if config.block_size != 32:
logger.info(
f"OpenVINO optimal block size is 32, overriding currently set {config.block_size}" # noqa: G004, E501
)
config.block_size = 32
kv_cache_space = envs.VLLM_OPENVINO_KVCACHE_SPACE
if kv_cache_space >= 0:
_GB = 1 << 30
if kv_cache_space == 0:
config.openvino_kvcache_space_bytes = 4 * _GB # type: ignore
logger.warning(
"Environment variable VLLM_OPENVINO_KVCACHE_SPACE (GB) "
"for OpenVINO backend is not set, using 4 by default.")
else:
config.openvino_kvcache_space_bytes = kv_cache_space * _GB # type: ignore
else:
raise RuntimeError(
"Invalid environment variable VLLM_OPENVINO_KVCACHE_SPACE"
f" {kv_cache_space}, expect a positive integer value.")
return config
...@@ -679,7 +679,7 @@ def _get_ranks(x: torch.Tensor, indices: torch.Tensor) -> torch.Tensor: ...@@ -679,7 +679,7 @@ def _get_ranks(x: torch.Tensor, indices: torch.Tensor) -> torch.Tensor:
Returns: Returns:
torch.Tensor: 1D tensor of shape (N,) where N is the no. of tokens. torch.Tensor: 1D tensor of shape (N,) where N is the no. of tokens.
Each element in the returned tensor represents the rank Each element in the returned tensor represents the rank
of the chosen token in the input logprob tensor. of the chosen token in the input logprob tensor.
""" """
vals = x[torch.arange(0, len(x), device=x.device, dtype=indices.dtype), vals = x[torch.arange(0, len(x), device=x.device, dtype=indices.dtype),
...@@ -965,7 +965,7 @@ def _modify_greedy_probs_inplace(logprobs: torch.Tensor, probs: torch.Tensor, ...@@ -965,7 +965,7 @@ def _modify_greedy_probs_inplace(logprobs: torch.Tensor, probs: torch.Tensor,
distribution. distribution.
- Greedy sampling performs `argmax` to obtain the token with the - Greedy sampling performs `argmax` to obtain the token with the
highest likelihood. highest likelihood.
Ignoring greedy sampling for a moment, we find that the computed probability Ignoring greedy sampling for a moment, we find that the computed probability
distribution has the following property: we can sample from it independently distribution has the following property: we can sample from it independently
and find that the token sampled by the Sampler has a frequency corresponding and find that the token sampled by the Sampler has a frequency corresponding
......
# ruff: noqa: SIM117
from pathlib import Path
from typing import List, Optional, Tuple
import openvino as ov
import torch
from huggingface_hub import HfApi
from openvino._offline_transformations import paged_attention_transformation
from optimum.intel import OVModelForCausalLM
from torch import nn
import vllm.envs as envs
from vllm.attention.backends.openvino import OpenVINOAttentionMetadata
from vllm.config import DeviceConfig, ModelConfig
from vllm.logger import init_logger
from vllm.model_executor.layers.logits_processor import (LogitsProcessor,
_prune_hidden_states)
from vllm.model_executor.layers.sampler import Sampler
from vllm.model_executor.sampling_metadata import SamplingMetadata
from vllm.sequence import SamplerOutput
logger = init_logger(__name__)
def _flattenize_inputs(inputs):
"""
Helper function for making nested inputs flattens
"""
flatten_inputs = []
for input_data in inputs:
if input_data is None:
continue
if isinstance(input_data, (list, tuple)):
flatten_inputs.extend(_flattenize_inputs(input_data))
elif isinstance(input_data, dict):
flatten_inputs.extend(_flattenize_inputs(list(
input_data.values())))
else:
flatten_inputs.append(input_data)
return flatten_inputs
def _modify_cache_parameters(model: ov.Model, kv_cache_dtype: ov.Type,
is_cpu: bool):
# Apply hardware dependent modifications to KV tensors
for parameter in model.get_parameters():
input = parameter.get_output_tensor(0)
input_names = input.get_names()
if len(input_names) != 1:
continue
input_name = next(iter(input_names))
shape = parameter.get_partial_shape()
# use real block size if available, just a placeholder
# to provide the expected rank
x_size = 1
num_blocks = ov.Dimension()
block_size = ov.Dimension()
head_size = ov.Dimension()
# TODO: Negotiate required layout with plugins (CPU is ~OK, GPU is TBD),
# pass more parameters to this function to set more static dimensions
if input_name.startswith("key_cache."):
cpu_shape = [num_blocks, shape[1], block_size, head_size]
gpu_shape = [
num_blocks,
shape[1],
shape[2].get_length() //
x_size if shape[2].is_static else ov.Dimension(),
block_size,
x_size,
]
elif input_name.startswith("value_cache."):
cpu_shape = [num_blocks, shape[1], block_size, head_size]
gpu_shape = [num_blocks, shape[1], shape[2], block_size]
else:
continue
parameter.set_partial_shape(
ov.PartialShape(cpu_shape if is_cpu else gpu_shape))
parameter.set_element_type(kv_cache_dtype)
model.validate_nodes_and_infer_types()
def _require_model_export(model_id, revision=None, subfolder=None):
model_dir = Path(model_id)
if subfolder is not None:
model_dir = model_dir / subfolder
if model_dir.is_dir():
return (not (model_dir / "openvino_model.xml").exists()
or not (model_dir / "openvino_model.bin").exists())
hf_api = HfApi()
try:
model_info = hf_api.model_info(model_id, revision=revision or "main")
normalized_subfolder = (None if subfolder is None else
Path(subfolder).as_posix())
model_files = [
file.rfilename for file in model_info.siblings
if normalized_subfolder is None
or file.rfilename.startswith(normalized_subfolder)
]
ov_model_path = ("openvino_model.xml" if normalized_subfolder is None
else f"{normalized_subfolder}/openvino_model.xml")
return (ov_model_path not in model_files
or ov_model_path.replace(".xml", ".bin") not in model_files)
except Exception:
return True
class OpenVINOCasualLM(nn.Module):
def __init__(
self,
model_config: ModelConfig,
device_config: DeviceConfig,
kv_cache_dtype: ov.Type,
) -> None:
super().__init__()
self.logits_processor = LogitsProcessor(
model_config.hf_config.vocab_size, logits_as_input=True)
self.sampler = Sampler()
export = _require_model_export(model_config.model)
if export:
logger.warning(
f"Provided model id {model_config.model} does not " # noqa: G004
"contain OpenVINO IR, the model will be converted to IR with "
"default options. If you need to use specific options for "
"model conversion, use optimum-cli export openvino with "
"desired options.")
else:
logger.warning(
"OpenVINO IR is available for provided model id " # noqa: G004
f"{model_config.model}. This IR will be used for inference "
"as-is, all possible options that may affect model conversion "
"are ignored.")
load_in_8bit = envs.VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS
pt_model = OVModelForCausalLM.from_pretrained(
model_config.model,
export=export,
compile=False,
load_in_8bit=load_in_8bit,
trust_remote_code=model_config.trust_remote_code,
)
paged_attention_transformation(pt_model.model)
_modify_cache_parameters(pt_model.model, kv_cache_dtype,
device_config.device.type == "cpu")
core = ov.Core()
ov_compiled = core.compile_model(pt_model.model, "CPU")
self.ov_request = ov_compiled.create_infer_request()
def forward(
self,
input_ids: torch.Tensor,
positions: torch.Tensor,
kv_caches: List[Tuple[ov.Tensor, ov.Tensor]],
attn_metadata: OpenVINOAttentionMetadata,
) -> torch.Tensor:
flatten_kv_cache = _flattenize_inputs(kv_caches)
inputs = [
input_ids,
positions,
*flatten_kv_cache,
attn_metadata.past_lens,
attn_metadata.subsequence_begins,
attn_metadata.block_indices,
attn_metadata.block_indices_begins,
attn_metadata.max_context_len,
]
self.ov_request.start_async(inputs, share_inputs=True)
self.ov_request.wait()
logits = torch.from_numpy(self.ov_request.get_tensor("logits").data)
# TODO: remove 'view' once OpenVINO PA will drop 'seq_len' dimension
return logits.view(-1, logits.shape[-1])
def compute_logits(self, hidden_states: torch.Tensor,
sampling_metadata: SamplingMetadata) -> torch.Tensor:
hidden_states = _prune_hidden_states(hidden_states, sampling_metadata)
logits = self.logits_processor(None, hidden_states, sampling_metadata)
return logits
def sample(
self,
logits: torch.Tensor,
sampling_metadata: SamplingMetadata,
) -> Optional[SamplerOutput]:
next_tokens = self.sampler(logits, sampling_metadata)
return next_tokens
def get_model(
model_config: ModelConfig,
device_config: DeviceConfig,
kv_cache_dtype: ov.Type,
**kwargs,
) -> torch.nn.Module:
lora_config = kwargs.get("lora_config", None)
if lora_config:
raise ValueError(
"OpenVINO modeling does not support LoRA, "
"but LoRA is enabled. Support for this model may "
"be added in the future. If this is important to you, "
"please open an issue on github.")
return OpenVINOCasualLM(model_config, device_config, kv_cache_dtype)
...@@ -176,6 +176,15 @@ def is_cpu() -> bool: ...@@ -176,6 +176,15 @@ def is_cpu() -> bool:
return False return False
@lru_cache(maxsize=None)
def is_openvino() -> bool:
from importlib.metadata import PackageNotFoundError, version
try:
return "openvino" in version("vllm")
except PackageNotFoundError:
return False
@lru_cache(maxsize=None) @lru_cache(maxsize=None)
def is_neuron() -> bool: def is_neuron() -> bool:
try: try:
...@@ -546,7 +555,7 @@ def is_pin_memory_available() -> bool: ...@@ -546,7 +555,7 @@ def is_pin_memory_available() -> bool:
elif is_neuron(): elif is_neuron():
print_warning_once("Pin memory is not supported on Neuron.") print_warning_once("Pin memory is not supported on Neuron.")
return False return False
elif is_cpu(): elif is_cpu() or is_openvino():
return False return False
return True return True
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment