Commit 2b1b8c7d authored by zhuwenwen's avatar zhuwenwen
Browse files

Merge branch 'v0.6.2-eval' into v0.6.2-dev

parents 3f42b83d 367708c7
# <div align="center"><strong>vLLM</strong></div>
## 简介
vLLM是一个快速且易于使用的LLM推理和服务库使用PageAttention高效管理kv内存Continuous batching传入请求支持很多Hugging Face模型如LLaMA & LLaMA-2、Qwen、Chatglm2 & Chatglm3等。
vLLM是一个快速且易于使用的LLM推理和服务库,使用PageAttention高效管理kv内存,Continuous batching传入请求,支持很多Hugging Face模型,如LLaMA & LLaMA-2、Qwen、Chatglm2 & Chatglm3等。
## 暂不支持的官方功能
- **量化推理**目前支持fp16的推理和gptq,awq-int4推理mralin的权重量化、kv-cache fp8推理方案暂不支持
- **模块支持**目前不支持Sliding window attention
- **量化推理**:目前支持fp16的推理和gptq,awq-int4推理,mralin的权重量化、kv-cache fp8推理方案暂不支持
- **模块支持**:目前不支持Sliding window attention
## 支持模型结构列表
| 结构 | 模型 | 模型并行 | FP16 |
| :------: | :------: | :------: | :------: |
| LlamaForCausalLM | LLaMA、LLaMA-2、LLaMA-3、Codellama、deepseek、Yi | Yes | Yes |
| QWenLMHeadModel | QWen、Qwen-VL | Yes | Yes |
| Qwen2ForCausalLM | QWen1.5、CodeQwen1.5、QWen2 | Yes | Yes |
| ChatGLMModel | chatglm2、chatglm3 | Yes | Yes |
| BaiChuanForCausalLM | Baichuan、Baichuan2 | Yes | Yes |
| BloomForCausalLM | BLOOM | Yes | Yes |
| InternLMForCausalLM | InternLM | Yes | Yes |
| InternLM2ForCausalLM | InternLM2 | Yes | Yes |
| DeepseekV2ForCausalLM | DeepSeek-V2 | Yes | Yes |
| MixtralForCausalLM | Mixtral-8x7B | Yes | Yes |
| TeleChat12BForCausalLM (#TelechatForCausalLM) | TeleChat-12B | Yes | Yes |
| LlamaForCausalLM | Llama 3.1,Llama 3,Llama 2,Llama,Yi,Codellama、deepseek | Yes | Yes |
| QWenLMHeadModel | QWen,Qwen-VL | Yes | Yes |
| Qwen2ForCausalLM | QWen2,QWen1.5,CodeQwen1.5 | Yes | Yes |
| ChatGLMModel | glm-4v-9b,chatglm3,chatglm2 | Yes | Yes |
| DeepseekV2ForCausalLM | DeepSeek-V2 | Yes | Yes |
| BaiChuanForCausalLM | Baichuan2,Baichuan | Yes | Yes |
| BloomForCausalLM | BLOOM | Yes | Yes |
| InternLMForCausalLM | InternLM | Yes | Yes |
| InternLM2ForCausalLM | InternLM2 | Yes | Yes |
| MiniCPMForCausalLM | MiniCPM | Yes | Yes |
| MiniCPM3ForCausalLM | MiniCPM3 | Yes | Yes |
| MixtralForCausalLM | Mixtral-8x7B,Mixtral-8x7B-Instruct | Yes | Yes |
| TeleChat12BForCausalLM (#TelechatForCausalLM) | TeleChat-12B | Yes | Yes |
| LlavaForConditionalGeneration | LLaMA,LLaMA-2,LLaMA-3 | Yes | Yes |
| Qwen2VLForConditionalGeneration | Qwen2-VL | Yes | Yes |
| MiniCPMV | MiniCPM-V | Yes | Yes |
| Phi3VForCausalLM | Phi-3.5-vision | Yes | Yes |
## 安装
......@@ -33,11 +39,11 @@ vLLM支持
### 使用源码编译方式安装
#### 编译环境准备
提供2种环境准备方式
提供2种环境准备方式:
1. 基于光源pytorch2.3.0基础镜像环境镜像下载地址[https://sourcefind.cn/#/image/dcu/pytorch](https://sourcefind.cn/#/image/dcu/pytorch)根据pytorch2.1.0、python、dtk及系统下载对应的镜像版本。
1. 基于光源pytorch2.3.0基础镜像环境:镜像下载地址:[https://sourcefind.cn/#/image/dcu/pytorch](https://sourcefind.cn/#/image/dcu/pytorch),根据pytorch2.1.0、python、dtk及系统下载对应的镜像版本。
2. 基于现有python环境安装pytorch2.3.0pytorch whl包下载目录[https://cancon.hpccube.com:65024/4/main/pytorch](https://cancon.hpccube.com:65024/4/main/pytorch)根据python、dtk版本,下载对应pytorch2.1.0的whl包。安装命令如下
2. 基于现有python环境:安装pytorch2.3.0,pytorch whl包下载目录:[https://cancon.hpccube.com:65024/4/main/pytorch](https://cancon.hpccube.com:65024/4/main/pytorch),根据python、dtk版本,下载对应pytorch2.1.0的whl包。安装命令如下:
```shell
pip install torch* (下载的torch的whl包)
pip install setuptools wheel
......@@ -47,11 +53,11 @@ pip install setuptools wheel
```shell
git clone http://developer.hpccube.com/codes/OpenDAS/vllm.git # 根据需要的分支进行切换
```
安装依赖
安装依赖:
```shell
pip install -r requirements-rocm.txt
```
- 提供2种源码编译方式进入vllm目录):
- 提供2种源码编译方式(进入vllm目录):
```
1. 编译whl包并安装
VLLM_INSTALL_PUNICA_KERNELS=1 python setup.py bdist_wheel
......@@ -65,17 +71,17 @@ VLLM_INSTALL_PUNICA_KERNELS=1 python3 setup.py install
#### 运行基础环境准备
1、使用上面基于光源pytorch2.3.0基础镜像环境
2、根据pytorch2.3.0、python、dtk及系统下载对应的依赖包
2、根据pytorch2.3.0、python、dtk及系统下载对应的依赖包:
- triton:[https://cancon.hpccube.com:65024/4/main/triton](https://cancon.hpccube.com:65024/4/main/triton/)
- xformers:[https://cancon.hpccube.com:65024/4/main/xformers](https://cancon.hpccube.com:65024/4/main/xformers)
- flash_attn: [https://cancon.hpccube.com:65024/4/main/flash_attn](https://cancon.hpccube.com:65024/4/main/flash_attn)
- lmslim: [https://cancon.hpccube.com:65024/4/main/lmslim](https://cancon.hpccube.com:65024/4/main/lmslim)
#### 注意事项
+ 若使用 pip install 下载安装过慢可添加源-i https://pypi.tuna.tsinghua.edu.cn/simple/
+ 若使用 pip install 下载安装过慢,可添加源:-i https://pypi.tuna.tsinghua.edu.cn/simple/
## 验证
- python -c "import vllm; print(vllm.\_\_version__)"版本号与官方版本同步查询该软件的版本号例如0.6.2
- python -c "import vllm; print(vllm.\_\_version__)",版本号与官方版本同步,查询该软件的版本号,例如0.6.2;
## Known Issue
-
......
......@@ -23,7 +23,7 @@ def get_model_architecture(
model_config: ModelConfig) -> Tuple[Type[nn.Module], str]:
architectures = getattr(model_config.hf_config, "architectures", [])
visions = getattr(model_config.hf_config, "visual", []) or getattr(model_config.hf_config, "vision_config", [])
support_nn_architectures = ['LlamaForCausalLM', 'QWenLMHeadModel', 'Qwen2ForCausalLM', 'ChatGLMModel', 'BaichuanForCausalLM', 'BloomForCausalLM', 'MedusaModel']
support_nn_architectures = ['LlamaForCausalLM', 'QWenLMHeadModel', 'Qwen2ForCausalLM', 'Qwen2VLForConditionalGeneration', 'ChatGLMModel', 'BaichuanForCausalLM', 'BloomForCausalLM', 'MedusaModel']
if any(arch in architectures for arch in support_nn_architectures):
if os.getenv('LLAMA_NN') != '0':
if (architectures == ['QWenLMHeadModel'] or architectures == ['ChatGLMModel'] ) and visions != []:
......
......@@ -577,7 +577,6 @@ class LlamaForCausalLM(nn.Module, SupportsLoRA):
"mlp.down_proj.weight",
"lm_head.weight"
]
combined_words = "|".join(lay_key_words)
lay_qkv_words = ["self_attn.qkv_proj.weight"]
......
......@@ -71,6 +71,11 @@ from vllm.utils import is_cpu
from .utils import (PPMissingLayer, is_pp_missing_parameter,
make_empty_intermediate_tensors_factory)
import os
import re
from vllm import _custom_ops as ops
from vllm.model_executor.utils import pad_weight, gemm_bank_conf
logger = init_logger(__name__)
# === Vision Inputs === #
......@@ -889,6 +894,16 @@ class Qwen2VLForConditionalGeneration(nn.Module, SupportsMultiModal):
self.make_empty_intermediate_tensors = (
make_empty_intermediate_tensors_factory(
["hidden_states", "residual"], config.hidden_size))
self.quant_method = None
if quant_config is not None:
self.quant_method=quant_config.get_name()
self.quant_config=quant_config
self.use_llama_nn = os.environ.get('LLAMA_NN') == '1'
self.use_gemm_pad = os.environ.get('GEMM_PAD') == '1'
self.use_fa_pad = os.environ.get('FA_PAD') == '1'
self.use_awq_pad = os.environ.get('AWQ_PAD') == '1'
def _validate_and_reshape_mm_tensor(self,
mm_input: Union[torch.Tensor,
......@@ -1119,3 +1134,46 @@ class Qwen2VLForConditionalGeneration(nn.Module, SupportsMultiModal):
weight_loader = getattr(param, "weight_loader",
default_weight_loader)
weight_loader(param, loaded_weight)
if self.use_llama_nn and self.quant_method is None:
lay_key_words = [
"attn.qkv.weight",
"attn.proj.weight",
"mlp.fc1.weight",
"mlp.fc2.weight",
"mlp.0.weight",
"mlp.2.weight",
"self_attn.qkv_proj.weight",
"self_attn.o_proj.weight",
"mlp.gate_up_proj.weight",
"mlp.down_proj.weight",
"lm_head.weight",
]
combined_words = "|".join(lay_key_words)
lay_qkv_words = ["attn.qkv.weight"]
qkv_words = "|".join(lay_qkv_words)
lay_qkv_bias_words = ["attn.qkv.bias"]
qkv_bias_words = "|".join(lay_qkv_bias_words)
for layername, weight in params_dict.items():
if self.use_fa_pad and (re.findall(qkv_bias_words, layername)):
weight.data = pad_weight(weight.data, 32)
matches = re.findall(combined_words, layername)
if matches:
if self.use_gemm_pad and gemm_bank_conf(weight.data.shape[0]):
weight.data = pad_weight(weight.data, 32)
if self.use_fa_pad and (re.findall(qkv_words, layername)):
if not gemm_bank_conf(weight.data.shape[0]):
weight.data = pad_weight(weight.data, 32)
_weight = torch.zeros_like(weight.data)
ori_shape =_weight.shape
ops.trans_w16_gemm(_weight, weight.data, _weight.shape[0], _weight.shape[1])
weight.data.copy_(_weight)
weight.data=weight.data.reshape(ori_shape[1],-1)
\ No newline at end of file
......@@ -6,6 +6,7 @@ from typing import Any, List, Optional, Tuple, TypeVar, Union
import numpy as np
import numpy.typing as npt
from PIL import Image
import os
from vllm.connections import global_http_connection
from vllm.envs import VLLM_AUDIO_FETCH_TIMEOUT, VLLM_IMAGE_FETCH_TIMEOUT
......@@ -30,6 +31,10 @@ def _load_image_from_data_url(image_url: str):
return load_image_from_base64(image_base64)
def _load_image_from_file(file_path: str) -> Image.Image:
# Load image directly from file
return Image.open(file_path)
def fetch_image(image_url: str, *, image_mode: str = "RGB") -> Image.Image:
"""
Load a PIL image from a HTTP or base64 data URL.
......@@ -43,6 +48,11 @@ def fetch_image(image_url: str, *, image_mode: str = "RGB") -> Image.Image:
elif image_url.startswith('data:image'):
image = _load_image_from_data_url(image_url)
elif os.path.isfile(image_url):
# Load image from local file path
image = _load_image_from_file(image_url)
else:
raise ValueError("Invalid 'image_url': A valid 'image_url' must start "
"with either 'data:image' or 'http'.")
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment