Commit 2b1b8c7d authored by zhuwenwen

Merge branch 'v0.6.2-eval' into v0.6.2-dev

parents 3f42b83d 367708c7
# <div align="center"><strong>vLLM</strong></div>
## Introduction
vLLM is a fast, easy-to-use library for LLM inference and serving. It uses PagedAttention to manage KV-cache memory efficiently, applies continuous batching to incoming requests, and supports many Hugging Face models, such as LLaMA & LLaMA-2, Qwen, and Chatglm2 & Chatglm3.
## Official features not yet supported
- **Quantized inference**: fp16 inference and gptq / awq-int4 inference are currently supported; marlin weight quantization and the kv-cache fp8 inference scheme are not yet supported
- **Module support**: Sliding window attention is not currently supported
## Supported model architectures
| Architecture | Models | Model parallelism | FP16 |
| :------: | :------: | :------: | :------: |
| LlamaForCausalLM | Llama 3.1, Llama 3, Llama 2, Llama, Yi, Codellama, deepseek | Yes | Yes |
| QWenLMHeadModel | QWen, Qwen-VL | Yes | Yes |
| Qwen2ForCausalLM | QWen2, QWen1.5, CodeQwen1.5 | Yes | Yes |
| ChatGLMModel | glm-4v-9b, chatglm3, chatglm2 | Yes | Yes |
| DeepseekV2ForCausalLM | DeepSeek-V2 | Yes | Yes |
| BaiChuanForCausalLM | Baichuan2, Baichuan | Yes | Yes |
| BloomForCausalLM | BLOOM | Yes | Yes |
| InternLMForCausalLM | InternLM | Yes | Yes |
| InternLM2ForCausalLM | InternLM2 | Yes | Yes |
| MiniCPMForCausalLM | MiniCPM | Yes | Yes |
| MiniCPM3ForCausalLM | MiniCPM3 | Yes | Yes |
| MixtralForCausalLM | Mixtral-8x7B, Mixtral-8x7B-Instruct | Yes | Yes |
| TeleChat12BForCausalLM (#TelechatForCausalLM) | TeleChat-12B | Yes | Yes |
| LlavaForConditionalGeneration | LLaMA, LLaMA-2, LLaMA-3 | Yes | Yes |
| Qwen2VLForConditionalGeneration | Qwen2-VL | Yes | Yes |
| MiniCPMV | MiniCPM-V | Yes | Yes |
| Phi3VForCausalLM | Phi-3.5-vision | Yes | Yes |
## Installation
@@ -33,11 +39,11 @@ vLLM supports
### Installing by building from source
#### Preparing the build environment
Two ways to prepare the environment are provided:
1. Based on the 光源 (SourceFind) pytorch2.3.0 base image: the images can be downloaded from [https://sourcefind.cn/#/image/dcu/pytorch](https://sourcefind.cn/#/image/dcu/pytorch); choose the image version that matches pytorch2.1.0, python, dtk, and the operating system.
2. Based on an existing python environment: install pytorch2.3.0. The pytorch whl packages are available at [https://cancon.hpccube.com:65024/4/main/pytorch](https://cancon.hpccube.com:65024/4/main/pytorch); download the pytorch2.1.0 whl package that matches your python and dtk versions. Install it as follows:
```shell
pip install torch*  # the downloaded torch whl package
pip install setuptools wheel
```
@@ -47,11 +53,11 @@ pip install setuptools wheel
```shell
git clone http://developer.hpccube.com/codes/OpenDAS/vllm.git # switch to the branch you need
```
Install the dependencies:
```shell
pip install -r requirements-rocm.txt
```
- Two ways to build from source are provided (run them from inside the vllm directory):
```
1. Build a whl package and install it
VLLM_INSTALL_PUNICA_KERNELS=1 python setup.py bdist_wheel
```
@@ -65,17 +71,17 @@ VLLM_INSTALL_PUNICA_KERNELS=1 python3 setup.py install
#### Preparing the base runtime environment
1. Use the 光源 pytorch2.3.0 base image environment described above.
2. Download the dependency packages that match pytorch2.3.0, python, dtk, and the operating system:
- triton: [https://cancon.hpccube.com:65024/4/main/triton](https://cancon.hpccube.com:65024/4/main/triton/)
- xformers: [https://cancon.hpccube.com:65024/4/main/xformers](https://cancon.hpccube.com:65024/4/main/xformers)
- flash_attn: [https://cancon.hpccube.com:65024/4/main/flash_attn](https://cancon.hpccube.com:65024/4/main/flash_attn)
- lmslim: [https://cancon.hpccube.com:65024/4/main/lmslim](https://cancon.hpccube.com:65024/4/main/lmslim)
#### Notes
+ If `pip install` downloads are too slow, add a mirror: -i https://pypi.tuna.tsinghua.edu.cn/simple/
## Verification
- `python -c "import vllm; print(vllm.__version__)"` prints the installed version number, which tracks the official release, for example 0.6.2.
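The same check can be scripted without importing the package. The following is a minimal sketch that uses only the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical, and the sketch assumes the build was installed under the distribution name `vllm`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "vllm" is the assumed distribution name for this build.
    found = installed_version("vllm")
    print(found if found is not None else "vllm is not installed")
```

Returning `None` instead of raising keeps the helper usable in setup scripts that only need to decide whether an install step can be skipped.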
## Known Issue
-
......
@@ -23,7 +23,7 @@ def get_model_architecture(
         model_config: ModelConfig) -> Tuple[Type[nn.Module], str]:
     architectures = getattr(model_config.hf_config, "architectures", [])
     visions = getattr(model_config.hf_config, "visual", []) or getattr(model_config.hf_config, "vision_config", [])
-    support_nn_architectures = ['LlamaForCausalLM', 'QWenLMHeadModel', 'Qwen2ForCausalLM', 'ChatGLMModel', 'BaichuanForCausalLM', 'BloomForCausalLM', 'MedusaModel']
+    support_nn_architectures = ['LlamaForCausalLM', 'QWenLMHeadModel', 'Qwen2ForCausalLM', 'Qwen2VLForConditionalGeneration', 'ChatGLMModel', 'BaichuanForCausalLM', 'BloomForCausalLM', 'MedusaModel']
     if any(arch in architectures for arch in support_nn_architectures):
         if os.getenv('LLAMA_NN') != '0':
             if (architectures == ['QWenLMHeadModel'] or architectures == ['ChatGLMModel'] ) and visions != []:
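The gate in this hunk combines an architecture membership test with an environment check. The sketch below isolates just those two conditions so the default behaviour is easy to see. The helper names `takes_nn_path` and `model_flag_enabled` are hypothetical, and the sketch does not reproduce whatever the gated branch goes on to do, since that code is not shown in the diff. One detail worth noticing: this hunk tests `LLAMA_NN != '0'` (on unless explicitly disabled), while the `Qwen2VLForConditionalGeneration` constructor elsewhere in this commit tests `LLAMA_NN == '1'` (off unless explicitly enabled).

```python
import os
from typing import List, Mapping, Optional

# Architectures listed on the new side of the hunk above.
SUPPORT_NN_ARCHITECTURES = [
    "LlamaForCausalLM", "QWenLMHeadModel", "Qwen2ForCausalLM",
    "Qwen2VLForConditionalGeneration", "ChatGLMModel",
    "BaichuanForCausalLM", "BloomForCausalLM", "MedusaModel",
]


def takes_nn_path(architectures: List[str],
                  env: Optional[Mapping[str, str]] = None) -> bool:
    """Hypothetical helper: True when the gated branch would be entered."""
    env = os.environ if env is None else env
    supported = any(arch in architectures for arch in SUPPORT_NN_ARCHITECTURES)
    # `!= "0"`: the branch is taken unless LLAMA_NN is explicitly "0".
    return supported and env.get("LLAMA_NN") != "0"


def model_flag_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Hypothetical helper mirroring the `== '1'` check in the model class."""
    env = os.environ if env is None else env
    # `== "1"`: the flag is off unless LLAMA_NN is explicitly "1".
    return env.get("LLAMA_NN") == "1"


if __name__ == "__main__":
    archs = ["Qwen2VLForConditionalGeneration"]
    for value in (None, "0", "1"):
        env = {} if value is None else {"LLAMA_NN": value}
        print(value, takes_nn_path(archs, env), model_flag_enabled(env))
```

With `LLAMA_NN` unset, the two checks disagree, so setting the variable explicitly avoids relying on either default.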
......
@@ -577,7 +577,6 @@ class LlamaForCausalLM(nn.Module, SupportsLoRA):
                 "mlp.down_proj.weight",
                 "lm_head.weight"
             ]
             combined_words = "|".join(lay_key_words)
             lay_qkv_words = ["self_attn.qkv_proj.weight"]
......
@@ -71,6 +71,11 @@ from vllm.utils import is_cpu
 from .utils import (PPMissingLayer, is_pp_missing_parameter,
                     make_empty_intermediate_tensors_factory)
+import os
+import re
+from vllm import _custom_ops as ops
+from vllm.model_executor.utils import pad_weight, gemm_bank_conf
 logger = init_logger(__name__)
 # === Vision Inputs === #
@@ -889,6 +894,16 @@ class Qwen2VLForConditionalGeneration(nn.Module, SupportsMultiModal):
         self.make_empty_intermediate_tensors = (
             make_empty_intermediate_tensors_factory(
                 ["hidden_states", "residual"], config.hidden_size))
+        self.quant_method = None
+        if quant_config is not None:
+            self.quant_method=quant_config.get_name()
+        self.quant_config=quant_config
+        self.use_llama_nn = os.environ.get('LLAMA_NN') == '1'
+        self.use_gemm_pad = os.environ.get('GEMM_PAD') == '1'
+        self.use_fa_pad = os.environ.get('FA_PAD') == '1'
+        self.use_awq_pad = os.environ.get('AWQ_PAD') == '1'
     def _validate_and_reshape_mm_tensor(self,
                                         mm_input: Union[torch.Tensor,
@@ -1119,3 +1134,46 @@ class Qwen2VLForConditionalGeneration(nn.Module, SupportsMultiModal):
                 weight_loader = getattr(param, "weight_loader",
                                         default_weight_loader)
                 weight_loader(param, loaded_weight)
+        if self.use_llama_nn and self.quant_method is None:
+            lay_key_words = [
+                "attn.qkv.weight",
+                "attn.proj.weight",
+                "mlp.fc1.weight",
+                "mlp.fc2.weight",
+                "mlp.0.weight",
+                "mlp.2.weight",
+                "self_attn.qkv_proj.weight",
+                "self_attn.o_proj.weight",
+                "mlp.gate_up_proj.weight",
+                "mlp.down_proj.weight",
+                "lm_head.weight",
+            ]
+            combined_words = "|".join(lay_key_words)
+            lay_qkv_words = ["attn.qkv.weight"]
+            qkv_words = "|".join(lay_qkv_words)
+            lay_qkv_bias_words = ["attn.qkv.bias"]
+            qkv_bias_words = "|".join(lay_qkv_bias_words)
+            for layername, weight in params_dict.items():
+                if self.use_fa_pad and (re.findall(qkv_bias_words, layername)):
+                    weight.data = pad_weight(weight.data, 32)
+                matches = re.findall(combined_words, layername)
+                if matches:
+                    if self.use_gemm_pad and gemm_bank_conf(weight.data.shape[0]):
+                        weight.data = pad_weight(weight.data, 32)
+                    if self.use_fa_pad and (re.findall(qkv_words, layername)):
+                        if not gemm_bank_conf(weight.data.shape[0]):
+                            weight.data = pad_weight(weight.data, 32)
+                    _weight = torch.zeros_like(weight.data)
+                    ori_shape =_weight.shape
+                    ops.trans_w16_gemm(_weight, weight.data, _weight.shape[0], _weight.shape[1])
+                    weight.data.copy_(_weight)
+                    weight.data=weight.data.reshape(ori_shape[1],-1)
\ No newline at end of file
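The loop in this hunk picks weights by joining the key words with `|` and passing the result to `re.findall`. Because the key words are used as regular expressions without escaping, each `.` matches any character, not only a literal dot. The sketch below shows only that selection step; the parameter names in `names` are illustrative rather than taken from a real checkpoint, the helpers `build_pattern` and `select_layers` are hypothetical, and `pad_weight`, `gemm_bank_conf`, and `ops.trans_w16_gemm` are left out because their implementations are not part of this diff.

```python
import re
from typing import Iterable, List

# Key words copied from the hunk above.
LAY_KEY_WORDS = [
    "attn.qkv.weight", "attn.proj.weight", "mlp.fc1.weight", "mlp.fc2.weight",
    "mlp.0.weight", "mlp.2.weight", "self_attn.qkv_proj.weight",
    "self_attn.o_proj.weight", "mlp.gate_up_proj.weight",
    "mlp.down_proj.weight", "lm_head.weight",
]


def build_pattern(key_words: Iterable[str], escape: bool) -> re.Pattern:
    """Join key words with '|'; optionally escape them so '.' is literal."""
    parts = [re.escape(word) if escape else word for word in key_words]
    return re.compile("|".join(parts))


def select_layers(names: Iterable[str], pattern: re.Pattern) -> List[str]:
    """Return the names containing at least one key word."""
    return [name for name in names if pattern.search(name)]


if __name__ == "__main__":
    names = [
        "visual.blocks.0.attn.qkv.weight",           # illustrative name
        "model.layers.0.self_attn.qkv_proj.weight",  # illustrative name
        "visual.blocks.0.norm1.weight",              # no key word: skipped
        "contrived.mlpx0yweight",                    # matches only unescaped
    ]
    print(select_layers(names, build_pattern(LAY_KEY_WORDS, escape=False)))
    print(select_layers(names, build_pattern(LAY_KEY_WORDS, escape=True)))
```

For realistic parameter names the two variants usually select the same layers; escaping simply rules out accidental matches.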
@@ -6,6 +6,7 @@ from typing import Any, List, Optional, Tuple, TypeVar, Union
 import numpy as np
 import numpy.typing as npt
 from PIL import Image
+import os
 from vllm.connections import global_http_connection
 from vllm.envs import VLLM_AUDIO_FETCH_TIMEOUT, VLLM_IMAGE_FETCH_TIMEOUT
@@ -30,6 +31,10 @@ def _load_image_from_data_url(image_url: str):
     return load_image_from_base64(image_base64)
+def _load_image_from_file(file_path: str) -> Image.Image:
+    # Load image directly from file
+    return Image.open(file_path)
 def fetch_image(image_url: str, *, image_mode: str = "RGB") -> Image.Image:
     """
     Load a PIL image from a HTTP or base64 data URL.
@@ -43,6 +48,11 @@ def fetch_image(image_url: str, *, image_mode: str = "RGB") -> Image.Image:
     elif image_url.startswith('data:image'):
         image = _load_image_from_data_url(image_url)
+    elif os.path.isfile(image_url):
+        # Load image from local file path
+        image = _load_image_from_file(image_url)
     else:
         raise ValueError("Invalid 'image_url': A valid 'image_url' must start "
                          "with either 'data:image' or 'http'.")
......
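The `fetch_image` hunk above adds a third source type, a local file path, which is tried after the URL checks. The sketch below mirrors that branch order using only the standard library, so it classifies the source rather than loading an image. The helper name `classify_image_source` is hypothetical, and the first `http` branch is an assumption inferred from the error message, since that line falls outside the hunk.

```python
import os
import tempfile


def classify_image_source(image_url: str) -> str:
    """Hypothetical helper mirroring the branch order in fetch_image."""
    if image_url.startswith("http"):  # assumed first branch, not shown in the hunk
        return "http"
    if image_url.startswith("data:image"):
        return "data_url"
    if os.path.isfile(image_url):  # branch added by this commit
        return "local_file"
    raise ValueError(
        "Invalid 'image_url': expected an 'http' URL, a 'data:image' URL, "
        "or a path to an existing file.")


if __name__ == "__main__":
    print(classify_image_source("http://example.com/cat.png"))
    print(classify_image_source("data:image/png;base64,AAAA"))
    with tempfile.NamedTemporaryFile(suffix=".png") as handle:
        print(classify_image_source(handle.name))
```

The error message in the commit still names only `'data:image'` and `'http'`; the message in this sketch also mentions file paths, which is a suggestion rather than the committed behaviour.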