OpenDAS / vllm_cscc
Commit 2b1b8c7d, authored Nov 05, 2024 by zhuwenwen
Merge branch 'v0.6.2-eval' into v0.6.2-dev
Parents: 3f42b83d, 367708c7
Showing 5 changed files with 97 additions and 24 deletions
README.md +28 -22
vllm/model_executor/model_loader/utils.py +1 -1
vllm/model_executor/models/llama.py +0 -1
vllm/model_executor/models/qwen2_vl.py +58 -0
vllm/multimodal/utils.py +10 -0
README.md
# <div align="center"><strong>vLLM</strong></div>
## Introduction
vLLM is a fast, easy-to-use library for LLM inference and serving. It manages KV-cache memory efficiently with PagedAttention, applies continuous batching to incoming requests, and supports many Hugging Face models, such as LLaMA & LLaMA-2, Qwen, and Chatglm2 & Chatglm3.
## Official features not yet supported
- **Quantized inference**: fp16 inference and gptq/awq int4 inference are currently supported; marlin weight quantization and the kv-cache fp8 inference scheme are not yet supported.
- **Module support**: Sliding window attention is not currently supported.
## Supported model architectures
| Architecture | Models | Model parallel | FP16 |
| :------: | :------: | :------: | :------: |
-| LlamaForCausalLM | LLaMA、LLaMA-2、LLaMA-3、Codellama、deepseek、Yi | Yes | Yes |
-| QWenLMHeadModel | QWen、Qwen-VL | Yes | Yes |
-| Qwen2ForCausalLM | QWen1.5、CodeQwen1.5、QWen2 | Yes | Yes |
-| ChatGLMModel | chatglm2、chatglm3 | Yes | Yes |
-| BaiChuanForCausalLM | Baichuan、Baichuan2 | Yes | Yes |
-| BloomForCausalLM | BLOOM | Yes | Yes |
-| InternLMForCausalLM | InternLM | Yes | Yes |
-| InternLM2ForCausalLM | InternLM2 | Yes | Yes |
-| DeepseekV2ForCausalLM | DeepSeek-V2 | Yes | Yes |
-| MixtralForCausalLM | Mixtral-8x7B | Yes | Yes |
-| TeleChat12BForCausalLM (#TelechatForCausalLM) | TeleChat-12B | Yes | Yes |
+| LlamaForCausalLM | Llama 3.1,Llama 3,Llama 2,Llama,Yi,Codellama、deepseek | Yes | Yes |
+| QWenLMHeadModel | QWen,Qwen-VL | Yes | Yes |
+| Qwen2ForCausalLM | QWen2,QWen1.5,CodeQwen1.5 | Yes | Yes |
+| ChatGLMModel | glm-4v-9b,chatglm3,chatglm2 | Yes | Yes |
+| DeepseekV2ForCausalLM | DeepSeek-V2 | Yes | Yes |
+| BaiChuanForCausalLM | Baichuan2,Baichuan | Yes | Yes |
+| BloomForCausalLM | BLOOM | Yes | Yes |
+| InternLMForCausalLM | InternLM | Yes | Yes |
+| InternLM2ForCausalLM | InternLM2 | Yes | Yes |
+| MiniCPMForCausalLM | MiniCPM | Yes | Yes |
+| MiniCPM3ForCausalLM | MiniCPM3 | Yes | Yes |
+| MixtralForCausalLM | Mixtral-8x7B,Mixtral-8x7B-Instruct | Yes | Yes |
+| TeleChat12BForCausalLM (#TelechatForCausalLM) | TeleChat-12B | Yes | Yes |
+| LlavaForConditionalGeneration | LLaMA,LLaMA-2,LLaMA-3 | Yes | Yes |
+| Qwen2VLForConditionalGeneration | Qwen2-VL | Yes | Yes |
+| MiniCPMV | MiniCPM-V | Yes | Yes |
+| Phi3VForCausalLM | Phi-3.5-vision | Yes | Yes |
## Installation
...
@@ -33,11 +39,11 @@ vLLM supports
### Installing from source
#### Preparing the build environment
Two ways to prepare the environment are provided:
1. Based on the 光源 pytorch2.3.0 base image: download the image from [https://sourcefind.cn/#/image/dcu/pytorch](https://sourcefind.cn/#/image/dcu/pytorch), choosing the image version that matches your pytorch2.1.0, python, dtk, and operating system.
2. Based on an existing python environment: install pytorch2.3.0. The pytorch whl packages are available at [https://cancon.hpccube.com:65024/4/main/pytorch](https://cancon.hpccube.com:65024/4/main/pytorch); download the pytorch2.1.0 whl package that matches your python and dtk versions. The install commands are as follows:
```shell
pip install torch*  # the downloaded torch whl package
pip install setuptools wheel
...
```
@@ -47,11 +53,11 @@ pip install setuptools wheel
```shell
git clone http://developer.hpccube.com/codes/OpenDAS/vllm.git # switch to the branch you need
```
Install the dependencies:
```shell
pip install -r requirements-rocm.txt
```
- Two ways to build from source are provided (run them from inside the vllm directory):
```
1. Build a whl package and install it
VLLM_INSTALL_PUNICA_KERNELS=1 python setup.py bdist_wheel
...
```
@@ -65,17 +71,17 @@ VLLM_INSTALL_PUNICA_KERNELS=1 python3 setup.py install
#### Preparing the base runtime environment
1. Use the 光源 pytorch2.3.0 base image described above.
2. Download the dependency packages that match your pytorch2.3.0, python, dtk, and operating system:
- triton: [https://cancon.hpccube.com:65024/4/main/triton](https://cancon.hpccube.com:65024/4/main/triton/)
- xformers: [https://cancon.hpccube.com:65024/4/main/xformers](https://cancon.hpccube.com:65024/4/main/xformers)
- flash_attn: [https://cancon.hpccube.com:65024/4/main/flash_attn](https://cancon.hpccube.com:65024/4/main/flash_attn)
- lmslim: [https://cancon.hpccube.com:65024/4/main/lmslim](https://cancon.hpccube.com:65024/4/main/lmslim)
#### Notes
+ If pip install downloads are too slow, add a mirror index: -i https://pypi.tuna.tsinghua.edu.cn/simple/
## Verification
- python -c "import vllm; print(vllm.\_\_version__)" prints the version number of the installed software, which is kept in sync with the official release, for example 0.6.2.
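The version check can also be scripted. The following is a minimal sketch, not part of the README: `version_matches` is a hypothetical helper, the default `0.6.2` is taken from the example above, and the idea that a local build may append a suffix to the version string is an assumption.

```python
def version_matches(installed: str, expected: str = "0.6.2") -> bool:
    """Return True when `installed` is `expected`, optionally with a suffix.

    Assumption for illustration: a local build may report something like
    "0.6.2+<tag>" or "0.6.2.post1", so only the leading release part is
    compared.
    """
    if installed == expected:
        return True
    # Accept "0.6.2+..." or "0.6.2. ...", but reject e.g. "0.6.20".
    return installed.startswith(expected) and installed[len(expected)] in "+."


# Usage, assuming vLLM is installed in the current environment:
# import vllm
# assert version_matches(vllm.__version__), vllm.__version__
```

Comparing only the release prefix keeps the check tolerant of build tags while still rejecting a different release.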
## Known Issue
- None
...
vllm/model_executor/model_loader/utils.py
...
@@ -23,7 +23,7 @@ def get_model_architecture(
         model_config: ModelConfig) -> Tuple[Type[nn.Module], str]:
     architectures = getattr(model_config.hf_config, "architectures", [])
     visions = getattr(model_config.hf_config, "visual", []) or getattr(model_config.hf_config, "vision_config", [])
-    support_nn_architectures = ['LlamaForCausalLM', 'QWenLMHeadModel', 'Qwen2ForCausalLM', 'ChatGLMModel', 'BaichuanForCausalLM', 'BloomForCausalLM', 'MedusaModel']
+    support_nn_architectures = ['LlamaForCausalLM', 'QWenLMHeadModel', 'Qwen2ForCausalLM', 'Qwen2VLForConditionalGeneration', 'ChatGLMModel', 'BaichuanForCausalLM', 'BloomForCausalLM', 'MedusaModel']
     if any(arch in architectures for arch in support_nn_architectures):
         if os.getenv('LLAMA_NN') != '0':
             if (architectures == ['QWenLMHeadModel'] or architectures == ['ChatGLMModel']) and visions != []:
...
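The two outer tests in this hunk can be exercised in isolation. The following is a minimal sketch under stated assumptions: `nn_layout_candidate` is a hypothetical helper that is not in the repository, and it reproduces only the conditions visible in the hunk, not whatever the truncated branch goes on to do.

```python
import os
from typing import List, Optional

# Copied from the new side of the hunk above.
SUPPORT_NN_ARCHITECTURES = [
    'LlamaForCausalLM', 'QWenLMHeadModel', 'Qwen2ForCausalLM',
    'Qwen2VLForConditionalGeneration', 'ChatGLMModel',
    'BaichuanForCausalLM', 'BloomForCausalLM', 'MedusaModel',
]


def nn_layout_candidate(architectures: List[str],
                        llama_nn: Optional[str]) -> bool:
    """Hypothetical helper mirroring the two outer `if` tests in the hunk.

    `llama_nn` stands in for os.getenv('LLAMA_NN'); None means "unset".
    """
    listed = any(arch in architectures for arch in SUPPORT_NN_ARCHITECTURES)
    # Any value other than the literal string '0' passes, including unset.
    return listed and llama_nn != '0'


if __name__ == "__main__":
    print(nn_layout_candidate(['Qwen2VLForConditionalGeneration'],
                              os.getenv('LLAMA_NN')))
```

One detail worth noticing: this gate treats an unset `LLAMA_NN` as enabled, whereas the qwen2_vl.py hunk in the same commit sets `use_llama_nn` with `os.environ.get('LLAMA_NN') == '1'`, which treats unset as disabled.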
vllm/model_executor/models/llama.py
...
@@ -577,7 +577,6 @@ class LlamaForCausalLM(nn.Module, SupportsLoRA):
                 "mlp.down_proj.weight",
                 "lm_head.weight"
             ]
             combined_words = "|".join(lay_key_words)
             lay_qkv_words = ["self_attn.qkv_proj.weight"]
...
vllm/model_executor/models/qwen2_vl.py
...
@@ -71,6 +71,11 @@ from vllm.utils import is_cpu
 from .utils import (PPMissingLayer, is_pp_missing_parameter,
                     make_empty_intermediate_tensors_factory)
+import os
+import re
+from vllm import _custom_ops as ops
+from vllm.model_executor.utils import pad_weight, gemm_bank_conf
 logger = init_logger(__name__)
 # === Vision Inputs === #
...
@@ -889,6 +894,16 @@ class Qwen2VLForConditionalGeneration(nn.Module, SupportsMultiModal):
         self.make_empty_intermediate_tensors = (
             make_empty_intermediate_tensors_factory(
                 ["hidden_states", "residual"], config.hidden_size))
+        self.quant_method = None
+        if quant_config is not None:
+            self.quant_method = quant_config.get_name()
+        self.quant_config = quant_config
+        self.use_llama_nn = os.environ.get('LLAMA_NN') == '1'
+        self.use_gemm_pad = os.environ.get('GEMM_PAD') == '1'
+        self.use_fa_pad = os.environ.get('FA_PAD') == '1'
+        self.use_awq_pad = os.environ.get('AWQ_PAD') == '1'
     def _validate_and_reshape_mm_tensor(self,
                                         mm_input: Union[torch.Tensor,
...
@@ -1119,3 +1134,46 @@ class Qwen2VLForConditionalGeneration(nn.Module, SupportsMultiModal):
                 weight_loader = getattr(param, "weight_loader",
                                         default_weight_loader)
                 weight_loader(param, loaded_weight)
+        if self.use_llama_nn and self.quant_method is None:
+            lay_key_words = [
+                "attn.qkv.weight",
+                "attn.proj.weight",
+                "mlp.fc1.weight",
+                "mlp.fc2.weight",
+                "mlp.0.weight",
+                "mlp.2.weight",
+                "self_attn.qkv_proj.weight",
+                "self_attn.o_proj.weight",
+                "mlp.gate_up_proj.weight",
+                "mlp.down_proj.weight",
+                "lm_head.weight",
+            ]
+            combined_words = "|".join(lay_key_words)
+            lay_qkv_words = ["attn.qkv.weight"]
+            qkv_words = "|".join(lay_qkv_words)
+            lay_qkv_bias_words = ["attn.qkv.bias"]
+            qkv_bias_words = "|".join(lay_qkv_bias_words)
+            for layername, weight in params_dict.items():
+                if self.use_fa_pad and (re.findall(qkv_bias_words, layername)):
+                    weight.data = pad_weight(weight.data, 32)
+                matches = re.findall(combined_words, layername)
+                if matches:
+                    if self.use_gemm_pad and gemm_bank_conf(weight.data.shape[0]):
+                        weight.data = pad_weight(weight.data, 32)
+                    if self.use_fa_pad and (re.findall(qkv_words, layername)):
+                        if not gemm_bank_conf(weight.data.shape[0]):
+                            weight.data = pad_weight(weight.data, 32)
+                    _weight = torch.zeros_like(weight.data)
+                    ori_shape = _weight.shape
+                    ops.trans_w16_gemm(_weight, weight.data, _weight.shape[0], _weight.shape[1])
+                    weight.data.copy_(_weight)
+                    weight.data = weight.data.reshape(ori_shape[1], -1)
\ No newline at end of file
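To see which parameters the added loop would touch, the name matching can be tried on its own. The following is a minimal sketch under stated assumptions: the keyword lists are copied from the hunk, but `build_pattern`, `classify`, the `strict` flag, and the example parameter names are hypothetical. It does not call `pad_weight`, `gemm_bank_conf`, or `ops.trans_w16_gemm`, whose behavior is not shown in this commit.

```python
import re
from typing import Dict, List

# Keyword lists copied from the hunk above.
LAY_KEY_WORDS = [
    "attn.qkv.weight", "attn.proj.weight", "mlp.fc1.weight", "mlp.fc2.weight",
    "mlp.0.weight", "mlp.2.weight", "self_attn.qkv_proj.weight",
    "self_attn.o_proj.weight", "mlp.gate_up_proj.weight",
    "mlp.down_proj.weight", "lm_head.weight",
]
LAY_QKV_WORDS = ["attn.qkv.weight"]
LAY_QKV_BIAS_WORDS = ["attn.qkv.bias"]


def build_pattern(words: List[str], strict: bool = False) -> str:
    """Join keywords into one alternation, as the hunk does with "|".join.

    strict=True escapes each keyword first (an illustrative variant, not
    what the commit does), so "." only matches a literal dot.
    """
    return "|".join(re.escape(w) if strict else w for w in words)


def classify(layername: str, strict: bool = False) -> Dict[str, bool]:
    """Report which of the three patterns match a parameter name."""
    return {
        "combined": bool(re.findall(build_pattern(LAY_KEY_WORDS, strict), layername)),
        "qkv": bool(re.findall(build_pattern(LAY_QKV_WORDS, strict), layername)),
        "qkv_bias": bool(re.findall(build_pattern(LAY_QKV_BIAS_WORDS, strict), layername)),
    }


if __name__ == "__main__":
    # Hypothetical parameter names; real names depend on the checkpoint.
    for name in ("visual.blocks.0.attn.qkv.weight",
                 "model.layers.0.self_attn.qkv_proj.weight",
                 "model.layers.0.input_layernorm.weight"):
        print(name, classify(name))
```

Because the keywords are joined without escaping, each `.` acts as a regex wildcard. That is harmless for typical parameter names, but the `strict` variant shows how `re.escape` would make the match exact if that ever matters.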
vllm/multimodal/utils.py
...
@@ -6,6 +6,7 @@ from typing import Any, List, Optional, Tuple, TypeVar, Union
 import numpy as np
 import numpy.typing as npt
 from PIL import Image
+import os
 from vllm.connections import global_http_connection
 from vllm.envs import VLLM_AUDIO_FETCH_TIMEOUT, VLLM_IMAGE_FETCH_TIMEOUT
...
@@ -30,6 +31,10 @@ def _load_image_from_data_url(image_url: str):
     return load_image_from_base64(image_base64)
+def _load_image_from_file(file_path: str) -> Image.Image:
+    # Load image directly from file
+    return Image.open(file_path)
 def fetch_image(image_url: str, *, image_mode: str = "RGB") -> Image.Image:
     """
     Load a PIL image from a HTTP or base64 data URL.
...
@@ -43,6 +48,11 @@ def fetch_image(image_url: str, *, image_mode: str = "RGB") -> Image.Image:
     elif image_url.startswith('data:image'):
         image = _load_image_from_data_url(image_url)
+    elif os.path.isfile(image_url):
+        # Load image from local file path
+        image = _load_image_from_file(image_url)
     else:
         raise ValueError("Invalid 'image_url': A valid 'image_url' must start "
                          "with either 'data:image' or 'http'.")
...
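The branch order that `fetch_image` ends up with can be illustrated without PIL or network access. The following is a minimal sketch under stated assumptions: `classify_image_source` and its return labels are hypothetical, and the leading `http` branch is inferred from the error message because it lies outside the hunk.

```python
import os


def classify_image_source(image_url: str) -> str:
    """Return which loader branch an `image_url` would take.

    Mirrors the order of the checks around the hunk above: HTTP first
    (assumed, since that branch is not shown), then data URLs, then the
    newly added local-file test.
    """
    if image_url.startswith('http'):
        return "http"
    elif image_url.startswith('data:image'):
        return "data_url"
    elif os.path.isfile(image_url):
        return "file"
    # The message here also names the file case; that wording is a
    # suggestion for illustration, not what the commit contains.
    raise ValueError("Invalid 'image_url': expected an 'http' URL, a "
                     "'data:image' URL, or a path to an existing file.")


if __name__ == "__main__":
    print(classify_image_source("http://example.com/cat.png"))
    print(classify_image_source("data:image/png;base64,AAAA"))
```

Since `os.path.isfile` is tried last, a path that does not exist falls through to the `ValueError`, whose message in the commit still mentions only the `data:image` and `http` forms.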