OpenDAS / vllm_cscc
Commit 2b1b8c7d
Authored Nov 05, 2024 by zhuwenwen
Merge branch 'v0.6.2-eval' into v0.6.2-dev
Parents: 3f42b83d, 367708c7
Showing 5 changed files with 97 additions and 24 deletions:

README.md (+28, -22)
vllm/model_executor/model_loader/utils.py (+1, -1)
vllm/model_executor/models/llama.py (+0, -1)
vllm/model_executor/models/qwen2_vl.py (+58, -0)
vllm/multimodal/utils.py (+10, -0)
README.md
# <div align="center"><strong>vLLM</strong></div>
## Introduction
vLLM is a fast, easy-to-use library for LLM inference and serving. It uses PagedAttention to manage KV-cache memory efficiently, applies continuous batching to incoming requests, and supports many Hugging Face models, such as LLaMA and LLaMA-2, Qwen, ChatGLM2 and ChatGLM3.
## Upstream features not yet supported
- **Quantized inference**: FP16 inference and gptq / awq-int4 quantized inference are currently supported; Marlin weight quantization and the kv-cache fp8 inference scheme are not yet supported.
- **Module support**: Sliding window attention is not currently supported.
## Supported model architectures
| Architecture | Models | Model parallelism | FP16 |
| :------: | :------: | :------: | :------: |
| LlamaForCausalLM | LLaMA、LLaMA-2、LLaMA-3、Codellama、deepseek、Yi | Yes | Yes |
| QWenLMHeadModel | QWen、Qwen-VL | Yes | Yes |
| Qwen2ForCausalLM | QWen1.5、CodeQwen1.5、QWen2 | Yes | Yes |
| ChatGLMModel | chatglm2、chatglm3 | Yes | Yes |
| BaiChuanForCausalLM | Baichuan、Baichuan2 | Yes | Yes |
| LlamaForCausalLM | Llama 3.1,Llama 3,Llama 2,Llama,Yi,Codellama、deepseek | Yes | Yes |
| QWenLMHeadModel | QWen,Qwen-VL | Yes | Yes |
| Qwen2ForCausalLM | QWen2,QWen1.5,CodeQwen1.5 | Yes | Yes |
| ChatGLMModel | glm-4v-9b,chatglm3,chatglm2 | Yes | Yes |
| DeepseekV2ForCausalLM | DeepSeek-V2 | Yes | Yes |
| BaiChuanForCausalLM | Baichuan2,Baichuan | Yes | Yes |
| BloomForCausalLM | BLOOM | Yes | Yes |
| InternLMForCausalLM | InternLM | Yes | Yes |
| InternLM2ForCausalLM | InternLM2 | Yes | Yes |
| DeepseekV2ForCausalLM | DeepSeek-V2 | Yes | Yes |
| MixtralForCausalLM | Mixtral-8x7B | Yes | Yes |
| MiniCPMForCausalLM | MiniCPM | Yes | Yes |
| MiniCPM3ForCausalLM | MiniCPM3 | Yes | Yes |
| MixtralForCausalLM | Mixtral-8x7B,Mixtral-8x7B-Instruct | Yes | Yes |
| TeleChat12BForCausalLM (#TelechatForCausalLM) | TeleChat-12B | Yes | Yes |
| LlavaForConditionalGeneration | LLaMA,LLaMA-2,LLaMA-3 | Yes | Yes |
| Qwen2VLForConditionalGeneration | Qwen2-VL | Yes | Yes |
| MiniCPMV | MiniCPM-V | Yes | Yes |
| Phi3VForCausalLM | Phi-3.5-vision | Yes | Yes |
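
To find which row of the table applies to a given checkpoint, one option is to read the `architectures` field of the model's Hugging Face `config.json`. The sketch below is illustrative only: `SUPPORTED_ARCHITECTURES` is a hand-copied subset of the table, and `read_architectures` and `is_supported` are hypothetical helper names that are not part of vLLM.

```python
import json
from pathlib import Path
from typing import List

# Hand-copied subset of the table above (assumption: the names appear
# verbatim in the checkpoint's config.json).
SUPPORTED_ARCHITECTURES = {
    "LlamaForCausalLM",
    "Qwen2ForCausalLM",
    "Qwen2VLForConditionalGeneration",
    "ChatGLMModel",
}


def read_architectures(model_dir: str) -> List[str]:
    """Return the `architectures` list from <model_dir>/config.json, or [] if absent."""
    config_path = Path(model_dir, "config.json")
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return list(config.get("architectures", []))


def is_supported(model_dir: str) -> bool:
    """True if any declared architecture appears in SUPPORTED_ARCHITECTURES."""
    return any(arch in SUPPORTED_ARCHITECTURES
               for arch in read_architectures(model_dir))
```

Calling `is_supported("/path/to/checkpoint")` on a local model directory gives a quick yes/no before attempting to load the model; the path here is a placeholder.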
## Installation
...
...
@@ -33,11 +39,11 @@ vLLM supports
### Installing from source
#### Preparing the build environment
Two ways to prepare the environment are provided:
1. Use the 光源 (sourcefind.cn) pytorch2.3.0 base image. Image download address: [https://sourcefind.cn/#/image/dcu/pytorch](https://sourcefind.cn/#/image/dcu/pytorch). Download the image version that matches pytorch2.1.0, python, dtk, and your operating system.
2. Use an existing Python environment. Install pytorch2.3.0; the pytorch whl download directory is [https://cancon.hpccube.com:65024/4/main/pytorch](https://cancon.hpccube.com:65024/4/main/pytorch). Download the pytorch2.1.0 whl package that matches your python and dtk versions. The install commands are as follows:
```shell
pip install torch*  # the downloaded torch whl package
pip install setuptools wheel
```
...
...
@@ -47,11 +53,11 @@ pip install setuptools wheel
```shell
git clone http://developer.hpccube.com/codes/OpenDAS/vllm.git
# Switch to the branch you need
```
Install the dependencies:
```shell
pip install -r requirements-rocm.txt
```
- Two ways to build from source are provided (run them from the vllm directory):
```
1. Build the whl package and install it
VLLM_INSTALL_PUNICA_KERNELS=1 python setup.py bdist_wheel
```
...
...
@@ -65,17 +71,17 @@ VLLM_INSTALL_PUNICA_KERNELS=1 python3 setup.py install
#### Preparing the base runtime environment
1. Use the 光源 pytorch2.3.0 base image environment described above.
2. Download the dependency packages that match pytorch2.3.0, python, dtk, and your operating system:
- triton: [https://cancon.hpccube.com:65024/4/main/triton](https://cancon.hpccube.com:65024/4/main/triton/)
- xformers: [https://cancon.hpccube.com:65024/4/main/xformers](https://cancon.hpccube.com:65024/4/main/xformers)
- flash_attn: [https://cancon.hpccube.com:65024/4/main/flash_attn](https://cancon.hpccube.com:65024/4/main/flash_attn)
- lmslim: [https://cancon.hpccube.com:65024/4/main/lmslim](https://cancon.hpccube.com:65024/4/main/lmslim)
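
After installing the packages listed above, a quick importability check can catch a missing or mismatched wheel early. This is a minimal sketch using only the standard library; the import names in `RUNTIME_DEPENDENCIES` are assumptions (a package's import name can differ from its wheel name), and `missing_modules` is a hypothetical helper.

```python
import importlib.util
from typing import Iterable, List

# Assumed top-level import names for the runtime dependencies.
RUNTIME_DEPENDENCIES = ["torch", "triton", "xformers", "flash_attn", "lmslim"]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(RUNTIME_DEPENDENCIES)
    print("missing:", missing if missing else "none")
```

The check only confirms that a module can be located; it does not verify that the wheel was built for the installed pytorch, python, or dtk version.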
#### Notes
+ If pip install downloads are too slow, add a mirror index: -i https://pypi.tuna.tsinghua.edu.cn/simple/
## Verification
- Run python -c "import vllm; print(vllm.\_\_version__)" to query the installed version. The version number tracks the official upstream release, for example 0.6.2.
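
The same check can be scripted without importing vllm itself. The sketch below reads the installed distribution's version through `importlib.metadata`. It assumes the distribution is registered under the name `vllm`; `installed_version` and `matches_release` are hypothetical helpers, and the suffix handling is only one possible convention.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version string of a distribution, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def matches_release(version: str, expected: str) -> bool:
    """True if `version` equals `expected` or extends it with a '+...' or '.…' suffix."""
    return version == expected or version.startswith((expected + "+", expected + "."))


if __name__ == "__main__":
    version = installed_version("vllm")  # assumed distribution name
    if version is None:
        print("vllm is not installed")
    else:
        print(version, "matches 0.6.2:", matches_release(version, "0.6.2"))
```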
## Known Issue
- None
...
...
vllm/model_executor/model_loader/utils.py
...
...
@@ -23,7 +23,7 @@ def get_model_architecture(
         model_config: ModelConfig) -> Tuple[Type[nn.Module], str]:
     architectures = getattr(model_config.hf_config, "architectures", [])
     visions = getattr(model_config.hf_config, "visual", []) or getattr(model_config.hf_config, "vision_config", [])
-    support_nn_architectures = ['LlamaForCausalLM', 'QWenLMHeadModel', 'Qwen2ForCausalLM', 'ChatGLMModel', 'BaichuanForCausalLM', 'BloomForCausalLM', 'MedusaModel']
+    support_nn_architectures = ['LlamaForCausalLM', 'QWenLMHeadModel', 'Qwen2ForCausalLM', 'Qwen2VLForConditionalGeneration', 'ChatGLMModel', 'BaichuanForCausalLM', 'BloomForCausalLM', 'MedusaModel']
     if any(arch in architectures for arch in support_nn_architectures):
         if os.getenv('LLAMA_NN') != '0':
             if (architectures == ['QWenLMHeadModel'] or architectures == ['ChatGLMModel']) and visions != []:
...
...
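
The two outer conditions in the hunk above can be read as a single predicate: the model declares at least one architecture from `support_nn_architectures`, and `LLAMA_NN` is not set to `'0'`. The sketch below mirrors only those two checks. The body of the branch is elided in the diff and is not reproduced here, and `nn_path_enabled` is a hypothetical name.

```python
import os
from typing import List, Mapping, Optional

# Copied from the new side of the hunk.
SUPPORT_NN_ARCHITECTURES = [
    'LlamaForCausalLM', 'QWenLMHeadModel', 'Qwen2ForCausalLM',
    'Qwen2VLForConditionalGeneration', 'ChatGLMModel',
    'BaichuanForCausalLM', 'BloomForCausalLM', 'MedusaModel',
]


def nn_path_enabled(architectures: List[str],
                    env: Optional[Mapping[str, str]] = None) -> bool:
    """Mirror the two outer checks: a supported architecture and LLAMA_NN != '0'."""
    env = os.environ if env is None else env
    supported = any(arch in architectures for arch in SUPPORT_NN_ARCHITECTURES)
    return supported and env.get('LLAMA_NN') != '0'
```

With this commit, `nn_path_enabled(['Qwen2VLForConditionalGeneration'], {})` evaluates to true, whereas the old list would not have matched that architecture.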
vllm/model_executor/models/llama.py
...
...
@@ -577,7 +577,6 @@ class LlamaForCausalLM(nn.Module, SupportsLoRA):
                              "mlp.down_proj.weight", "lm_head.weight"]
             combined_words = "|".join(lay_key_words)
             lay_qkv_words = ["self_attn.qkv_proj.weight"]
...
...
vllm/model_executor/models/qwen2_vl.py
...
...
@@ -71,6 +71,11 @@ from vllm.utils import is_cpu
 from .utils import (PPMissingLayer, is_pp_missing_parameter,
                     make_empty_intermediate_tensors_factory)
+import os
+import re
+from vllm import _custom_ops as ops
+from vllm.model_executor.utils import pad_weight, gemm_bank_conf
 logger = init_logger(__name__)
 # === Vision Inputs === #
...
...
@@ -890,6 +895,16 @@ class Qwen2VLForConditionalGeneration(nn.Module, SupportsMultiModal):
             make_empty_intermediate_tensors_factory(
                 ["hidden_states", "residual"], config.hidden_size))
+        self.quant_method = None
+        if quant_config is not None:
+            self.quant_method = quant_config.get_name()
+        self.quant_config = quant_config
+        self.use_llama_nn = os.environ.get('LLAMA_NN') == '1'
+        self.use_gemm_pad = os.environ.get('GEMM_PAD') == '1'
+        self.use_fa_pad = os.environ.get('FA_PAD') == '1'
+        self.use_awq_pad = os.environ.get('AWQ_PAD') == '1'
     def _validate_and_reshape_mm_tensor(self,
                                         mm_input: Union[torch.Tensor,
                                                         List[torch.Tensor]],
...
...
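
One detail worth noting when reading this hunk next to `model_loader/utils.py`: the two files test `LLAMA_NN` differently. `utils.py` uses `os.getenv('LLAMA_NN') != '0'`, which is true when the variable is unset, while the flags added here use `os.environ.get(...) == '1'`, which is false when the variable is unset. The sketch below isolates the two semantics; the helper names are hypothetical.

```python
from typing import Mapping


def enabled_unless_zero(env: Mapping[str, str], name: str) -> bool:
    """Semantics of `os.getenv(name) != '0'`: on by default, off only when set to '0'."""
    return env.get(name) != '0'


def enabled_only_if_one(env: Mapping[str, str], name: str) -> bool:
    """Semantics of `os.environ.get(name) == '1'`: off by default, on only when set to '1'."""
    return env.get(name) == '1'


if __name__ == "__main__":
    for env in ({}, {'LLAMA_NN': '0'}, {'LLAMA_NN': '1'}):
        print(env,
              enabled_unless_zero(env, 'LLAMA_NN'),
              enabled_only_if_one(env, 'LLAMA_NN'))
```

Under these semantics the two checks agree when `LLAMA_NN` is explicitly `'0'` or `'1'` and disagree when it is unset, so setting the variable explicitly avoids the ambiguity.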
@@ -1119,3 +1134,46 @@ class Qwen2VLForConditionalGeneration(nn.Module, SupportsMultiModal):
                 weight_loader = getattr(param, "weight_loader",
                                         default_weight_loader)
                 weight_loader(param, loaded_weight)
+        if self.use_llama_nn and self.quant_method is None:
+            lay_key_words = [
+                "attn.qkv.weight",
+                "attn.proj.weight",
+                "mlp.fc1.weight",
+                "mlp.fc2.weight",
+                "mlp.0.weight",
+                "mlp.2.weight",
+                "self_attn.qkv_proj.weight",
+                "self_attn.o_proj.weight",
+                "mlp.gate_up_proj.weight",
+                "mlp.down_proj.weight",
+                "lm_head.weight",
+            ]
+            combined_words = "|".join(lay_key_words)
+            lay_qkv_words = ["attn.qkv.weight"]
+            qkv_words = "|".join(lay_qkv_words)
+            lay_qkv_bias_words = ["attn.qkv.bias"]
+            qkv_bias_words = "|".join(lay_qkv_bias_words)
+            for layername, weight in params_dict.items():
+                if self.use_fa_pad and (re.findall(qkv_bias_words, layername)):
+                    weight.data = pad_weight(weight.data, 32)
+                matches = re.findall(combined_words, layername)
+                if matches:
+                    if self.use_gemm_pad and gemm_bank_conf(weight.data.shape[0]):
+                        weight.data = pad_weight(weight.data, 32)
+                    if self.use_fa_pad and (re.findall(qkv_words, layername)):
+                        if not gemm_bank_conf(weight.data.shape[0]):
+                            weight.data = pad_weight(weight.data, 32)
+                    _weight = torch.zeros_like(weight.data)
+                    ori_shape = _weight.shape
+                    ops.trans_w16_gemm(_weight, weight.data, _weight.shape[0], _weight.shape[1])
+                    weight.data.copy_(_weight)
+                    weight.data = weight.data.reshape(ori_shape[1], -1)
\ No newline at end of file
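
The layer selection in this hunk relies on `re.findall` with patterns built by joining the key words with `|`. Because the key words are not escaped, each `.` matches any character rather than a literal dot. The sketch below shows which names the joined pattern selects and how an escaped variant would differ. The layer names are made-up examples, and `selects` and `escaped_words` are hypothetical additions that do not appear in the commit.

```python
import re

# Key words copied from the hunk above.
lay_key_words = [
    "attn.qkv.weight", "attn.proj.weight", "mlp.fc1.weight", "mlp.fc2.weight",
    "mlp.0.weight", "mlp.2.weight", "self_attn.qkv_proj.weight",
    "self_attn.o_proj.weight", "mlp.gate_up_proj.weight",
    "mlp.down_proj.weight", "lm_head.weight",
]
combined_words = "|".join(lay_key_words)

# Stricter alternative (not in the commit): treat every key word literally.
escaped_words = "|".join(re.escape(word) for word in lay_key_words)


def selects(pattern: str, layername: str) -> bool:
    """True if the pattern matches anywhere in the layer name, as re.findall would."""
    return bool(re.findall(pattern, layername))


if __name__ == "__main__":
    # Hypothetical layer names, for illustration only.
    for name in ("visual.blocks.0.attn.qkv.weight",
                 "model.layers.0.self_attn.qkv_proj.weight",
                 "visual.blocks.0.attn.qkv.bias",
                 "visual.blocks.0.attnXqkvXweight"):
        print(name, selects(combined_words, name), selects(escaped_words, name))
```

For real parameter names the two patterns are likely to select the same layers; the escaped form simply rules out accidental matches such as the last example.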
vllm/multimodal/utils.py
...
...
@@ -6,6 +6,7 @@ from typing import Any, List, Optional, Tuple, TypeVar, Union
 import numpy as np
 import numpy.typing as npt
 from PIL import Image
+import os
 from vllm.connections import global_http_connection
 from vllm.envs import VLLM_AUDIO_FETCH_TIMEOUT, VLLM_IMAGE_FETCH_TIMEOUT
...
@@ -30,6 +31,10 @@ def _load_image_from_data_url(image_url: str):
     return load_image_from_base64(image_base64)
+def _load_image_from_file(file_path: str) -> Image.Image:
+    # Load image directly from file
+    return Image.open(file_path)
 def fetch_image(image_url: str, *, image_mode: str = "RGB") -> Image.Image:
     """
     Load a PIL image from a HTTP or base64 data URL.
...
...
@@ -43,6 +48,11 @@ def fetch_image(image_url: str, *, image_mode: str = "RGB") -> Image.Image:
     elif image_url.startswith('data:image'):
         image = _load_image_from_data_url(image_url)
+    elif os.path.isfile(image_url):
+        # Load image from local file path
+        image = _load_image_from_file(image_url)
     else:
         raise ValueError("Invalid 'image_url': A valid 'image_url' must start "
                          "with either 'data:image' or 'http'.")
...
...
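
The branch order in `fetch_image` after this change can be summarized as a three-way dispatch on the input string. The sketch below reproduces only that decision, without loading any image, so it runs with the standard library alone. The first `http` branch is inferred from the error message rather than shown in the hunk, and `classify_image_source` is a hypothetical name.

```python
import os


def classify_image_source(image_url: str) -> str:
    """Return which loader branch would handle `image_url`."""
    if image_url.startswith('http'):
        # Assumed first branch; it lies outside the visible hunk.
        return 'http'
    elif image_url.startswith('data:image'):
        return 'data_url'
    elif os.path.isfile(image_url):
        # Branch added by this commit: an existing local file path.
        return 'local_file'
    raise ValueError(
        "Invalid 'image_url': expected an 'http' URL, a 'data:image' URL, "
        "or a path to an existing local file.")
```

Because the file check comes last, a string that starts with `http` or `data:image` is never treated as a path, and a path that does not exist falls through to the `ValueError`. The message in the committed code still mentions only the `data:image` and `http` forms.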