# <div align="center"><strong>vLLM</strong></div>
## Introduction
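vLLM is a fast, easy-to-use library for LLM inference and serving. It uses PagedAttention to manage KV-cache memory efficiently, applies continuous batching to incoming requests, and supports many Hugging Face models, such as LLaMA & LLaMA-2, Qwen, and ChatGLM2 & ChatGLM3.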

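## Upstream features not yet supported
- **Quantized inference**: FP16 inference and GPTQ inference are currently supported. AWQ-INT4 and Marlin weight quantization and the KV-cache FP8 inference scheme are not yet supported.
- **Module support**: Sliding window attention, the MoE kernel, and the LoRA module are not currently supported.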


## Supported model architectures
| Architecture | Model | Model parallelism | FP16 |
| :----------: | :----------: | :------: | :--: |
|    LlamaForCausalLM       |    LLaMA          |   Yes    | Yes  |
|    LlamaForCausalLM       |    LLaMA-2        |   Yes    | Yes  |
|    LlamaForCausalLM       |    LLaMA-3        |   Yes    | Yes  |
|    LlamaForCausalLM       |    Codellama      |   Yes    | Yes  |
|    QWenLMHeadModel        |    QWen           |   Yes    | Yes  |
|    Qwen2ForCausalLM       |    QWen1.5        |   Yes    | Yes  |
|    Qwen2ForCausalLM       |    CodeQwen1.5    |   Yes    | Yes  |
|    Qwen2ForCausalLM       |    QWen2          |   Yes    | Yes  |
|    ChatGLMModel           |    chatglm2       |   Yes    | Yes  |
|    ChatGLMModel           |    chatglm3       |   Yes    | Yes  |
|    ChatGLMModel           |    glm-4          |   Yes    | Yes  |
|    BaiChuanForCausalLM    |    Baichuan-7B    |   Yes    | Yes  |
|    BaiChuanForCausalLM    |    Baichuan2-7B   |   Yes    | Yes  |
|    InternLMForCausalLM    |    InternLM       |   Yes    | Yes  |
|    InternLM2ForCausalLM   |    InternLM2      |   Yes    | Yes  |
|    LlamaForCausalLM       |    deepseek       |   Yes    | Yes  |
|    DeepseekV2ForCausalLM  |    DeepSeek-V2    |   Yes    | Yes  |
|    LlamaForCausalLM       |    Yi             |   Yes    | Yes  |
|    MixtralForCausalLM     |    Mixtral-8x7B   |   Yes    | Yes  |


## Installation
vLLM supports:
+ Python 3.8.
+ Python 3.9.
+ Python 3.10.
+ Python 3.11.
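
Before building, it can help to confirm that the active interpreter is one of the versions listed above. The following is a minimal sketch, not part of the original project: the helper name `is_supported_python` is hypothetical, and the snippet assumes the interpreter is reachable as `python3`.

```shell
# Hypothetical helper: succeeds only for the Python versions listed above.
is_supported_python() {
    case "$1" in
        3.8|3.9|3.10|3.11) return 0 ;;
        *) return 1 ;;
    esac
}

# Query the active interpreter (assumes it is available as `python3`).
if command -v python3 >/dev/null 2>&1; then
    current="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
    if is_supported_python "$current"; then
        echo "Python $current is supported"
    else
        echo "Python $current is not in the supported list"
    fi
fi
```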

### Install by building from source

#### Preparing the build environment
There are two ways to prepare the environment:

1. Use the SourceFind (光源) pytorch2.1.0 base image. Download it from [https://sourcefind.cn/#/image/dcu/pytorch](https://sourcefind.cn/#/image/dcu/pytorch), choosing the image version that matches pytorch2.1.0 and your Python, DTK, and operating system.

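2. Use an existing Python environment and install pytorch2.1.0. The PyTorch whl packages are hosted at [https://cancon.hpccube.com:65024/4/main/pytorch](https://cancon.hpccube.com:65024/4/main/pytorch); download the pytorch2.1.0 whl package that matches your Python and DTK versions, then install it as follows: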
```shell
pip install torch*  # the torch whl package you downloaded
pip install setuptools wheel
```

#### Building and installing from source
```shell
git clone http://developer.hpccube.com/codes/OpenDAS/vllm.git # switch to the branch you need
```

- Two ways to build from source are provided (run them from inside the vllm directory):
```
# 1. Build the whl packages and install them
VLLM_INSTALL_PUNICA_KERNELS=1 python setup.py bdist_wheel
cd dist
pip install vllm*
# dist/ is inside the vllm root, so go back up before entering csrc/
cd ../csrc/quantization/gptq
python setup.py bdist_wheel
cd dist
pip install gptq_kernel

# 2. Build and install directly from source
VLLM_INSTALL_PUNICA_KERNELS=1 python3 setup.py install
```

#### Preparing the runtime environment
1. Use the SourceFind (光源) pytorch2.1.0 base image described above.

2. Download the dependency packages that match pytorch2.1.0 and your Python, DTK, and operating system:
- triton: [https://cancon.hpccube.com:65024/4/main/triton](https://cancon.hpccube.com:65024/4/main/triton/)
- xformers: [https://cancon.hpccube.com:65024/4/main/xformers](https://cancon.hpccube.com:65024/4/main/xformers)
- flash_attn: [https://cancon.hpccube.com:65024/4/main/flash_attn](https://cancon.hpccube.com:65024/4/main/flash_attn)
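
Once the three packages are downloaded, a quick check that none is missing can save a failed install. The sketch below is illustrative only: the helper name `missing_wheels` is hypothetical, and the `<name>-*.whl` file pattern is an assumption about how the downloaded files are named.

```shell
# Hypothetical helper: print the name of each dependency whose wheel is
# missing from the given download directory.
# Assumption: the wheel files are named "<name>-<version>...whl".
missing_wheels() {
    dir="$1"
    for name in triton xformers flash_attn; do
        # ls fails when the glob matches nothing
        if ! ls "$dir"/"$name"-*.whl >/dev/null 2>&1; then
            echo "$name"
        fi
    done
}
```

For example, `missing_wheels ./wheels` prints nothing when all three wheels are present in `./wheels`; in that case `pip install ./wheels/*.whl` would install them.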


#### Notes
+ If `pip install` downloads are too slow, add a mirror index: `-i https://pypi.tuna.tsinghua.edu.cn/simple/`
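
To avoid retyping the mirror option, it can be wrapped in a small shell function. This is a sketch only: the names `PIP_MIRROR_URL` and `pip_install_mirrored` are hypothetical, while `-i` is pip's standard index URL option.

```shell
# Mirror URL taken from the note above.
PIP_MIRROR_URL="https://pypi.tuna.tsinghua.edu.cn/simple/"

# Hypothetical wrapper: forwards all arguments to `pip install`
# with the mirror index set.
pip_install_mirrored() {
    pip install -i "$PIP_MIRROR_URL" "$@"
}
```

With this defined, `pip_install_mirrored setuptools wheel` is equivalent to `pip install -i https://pypi.tuna.tsinghua.edu.cn/simple/ setuptools wheel`.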

## Verification
- Run `python -c "import vllm; print(vllm.__version__)"` to query the installed version. The version number tracks the official vLLM release, for example 0.5.0.
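
The same check can be scripted so that a failed import produces a readable message instead of a traceback. This is a minimal sketch: the helper name `vllm_version` is hypothetical, and it assumes the interpreter is reachable as `python3`.

```shell
# Hypothetical helper: print the installed vLLM version, or fail if vLLM
# cannot be imported. Assumes the interpreter is available as `python3`.
vllm_version() {
    python3 -c "import vllm; print(vllm.__version__)" 2>/dev/null
}

if version="$(vllm_version)"; then
    echo "vLLM $version is installed"
else
    echo "vLLM is not importable in this environment"
fi
```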

## Known Issue
-

## References
- [README_ORIGIN](README_ORIGIN.md)
- [https://github.com/vllm-project/vllm](https://github.com/vllm-project/vllm)