<!--
 * @Author: zhuww
 * @email: zhuww@sugon.com
 * @Date: 2024-04-25 10:38:07
 * @LastEditTime: 2024-09-29 17:50:01
-->
# LLAMA

## Paper
- [https://arxiv.org/pdf/2302.13971.pdf](https://arxiv.org/pdf/2302.13971.pdf)

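## Model structure
The LLaMA network is based on the Transformer architecture and adopts several improvements that were proposed later and used in other models such as PaLM. The main differences from the original architecture are:
- Pre-normalization. To improve training stability, the input of each transformer sub-layer is normalized instead of the output, using the RMSNorm normalization function.
- SwiGLU activation function [PaLM]. The ReLU non-linearity is replaced with SwiGLU to improve performance, with a hidden dimension of 2/3 · 4d instead of the 4d used in PaLM.
- Rotary embeddings. Absolute positional embeddings are removed, and rotary positional embeddings (RoPE) are added at every layer of the network.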
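
The three changes above can be illustrated with a small pure-Python sketch. This is not the code used by the model or by vLLM: the function names, the `eps` value, the RoPE base of 10000, and the interleaved pairing of dimensions are all assumptions chosen to keep the example short.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def swiglu_hidden_dim(d):
    """Feed-forward width of 2/3 * 4d, without rounding to a hardware-friendly multiple."""
    return (2 * 4 * d) // 3


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu(gate, up):
    """Element-wise SwiGLU gating, silu(gate) * up; the linear projections are omitted."""
    return [silu(g) * u for g, u in zip(gate, up)]


def rope(x, position, base=10000.0):
    """Rotate each consecutive pair (x[i], x[i+1]) by the angle position * base**(-i/d)."""
    d = len(x)
    out = list(x)
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```

For `d = 4096`, `swiglu_hidden_dim` gives 10922; the released 7B configuration rounds this up to 11008. Because RoPE is a pure rotation, it leaves the vector norm unchanged and is the identity at position 0.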

![img](./docs/llama_str.png)

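## Algorithm principle
LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. The models are trained on trillions of tokens and show that state-of-the-art models can be trained exclusively on publicly available datasets, without relying on proprietary or inaccessible data.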

![img](./docs/llama_pri.png)

## Environment setup

### Docker (Method 1)
Pull the inference Docker image from [SourceFind (光源)](https://www.sourcefind.cn/#/image/dcu/custom):

```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.2-py3.10
# <Image ID>: replace with the ID of the Docker image pulled above
# <Host Path>: path on the host
# <Container Path>: mount path inside the container
docker run -it --name llama_vllm --privileged --shm-size=64G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal -v <Host Path>:<Container Path> <Image ID> /bin/bash
```
`Tips: on K100/Z100L, use the custom image instead: docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.5.0-dtk24.04.1-ubuntu20.04-py310-zk-v1. K100/Z100L does not support AWQ quantization.`
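
Once inside the container, a quick check that the device nodes and the mounted directory from the `docker run` command are visible can save debugging time later. The following is a minimal sketch; the function name `check_container_paths` is hypothetical, and the paths are simply the ones passed via `--device` and `-v` above.

```python
import os

# Device nodes passed through by the `docker run` command above.
DCU_DEVICE_PATHS = ("/dev/kfd", "/dev/dri")
# Host directory mounted into the container by `-v /opt/hyhal:/opt/hyhal`.
HYHAL_PATH = "/opt/hyhal"


def check_container_paths(paths=DCU_DEVICE_PATHS + (HYHAL_PATH,)):
    """Return {path: bool} telling whether each path exists in this environment."""
    return {p: os.path.exists(p) for p in paths}


if __name__ == "__main__":
    for path, ok in check_container_paths().items():
        print(f"{path}: {'found' if ok else 'MISSING'}")
```

A missing entry usually points at a `--device` or `-v` option that was dropped from the `docker run` command.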

### Dockerfile (Method 2)
```
# <Host Path>: path on the host
# <Container Path>: mount path inside the container
docker build -t llama:latest .
docker run -it --name llama_vllm --privileged --shm-size=64G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal -v <Host Path>:<Container Path> llama:latest /bin/bash
```

### Anaconda (Method 3)
```
conda create -n llama_vllm python=3.10
```
The DCU-specific deep learning libraries required by this project can be downloaded and installed from the [Guanghe (光合)](https://developer.hpccube.com/tool/) developer community.
* DTK driver: dtk24.04.2
* PyTorch: 2.1.0
* triton: 2.1.0
* lmslim: 0.1.0
* xformers: 0.0.25
* flash_attn: 2.0.4
* vllm: 0.5.0
* python: python3.10

`Tips: install the dependencies above first, and install the vllm package last.`
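
After installation, the versions listed above can be compared against what is actually installed. This is a sketch under assumptions: the distribution names in `EXPECTED` (for example `torch` for PyTorch) are guesses at how the packages are registered, and a prefix match is used because vendor builds often append a local version suffix.

```python
from importlib import metadata

# Versions from the list above; the distribution names are assumptions.
EXPECTED = {
    "torch": "2.1.0",
    "triton": "2.1.0",
    "lmslim": "0.1.0",
    "xformers": "0.0.25",
    "flash_attn": "2.0.4",
    "vllm": "0.5.0",
}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_versions(expected=EXPECTED):
    """Return {name: (expected, installed, ok)}, where ok is a prefix match."""
    report = {}
    for name, want in expected.items():
        got = installed_version(name)
        report[name] = (want, got, got is not None and got.startswith(want))
    return report


if __name__ == "__main__":
    for name, (want, got, ok) in check_versions().items():
        print(f"{name}: expected {want}, installed {got}, {'ok' if ok else 'CHECK'}")
```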

## Dataset


## Inference

### Model download

**Fast download channel:**

| Base model | Chat model | GPTQ model | AWQ model |
| ------- | ------- | ------- | ------- | 
| [Llama-2-7b-hf](http://113.200.138.88:18080/aimodels/Llama-2-7b-hf)   | [Llama-2-7b-chat-hf](http://113.200.138.88:18080/aimodels/Llama-2-7b-chat-hf)    | [Llama-2-7B-Chat-GPTQ](http://113.200.138.88:18080/aimodels/Llama-2-7B-Chat-GPTQ)   | [Llama-2-7B-Chat-AWQ](http://113.200.138.88:18080/aimodels/thebloke/Llama-2-7B-AWQ)   |
| [Llama-2-13b-hf](http://113.200.138.88:18080/aimodels/Llama-2-13b-hf) | [Llama-2-13b-chat-hf](http://113.200.138.88:18080/aimodels/meta-llama/Llama-2-13b-chat-hf) | [Llama-2-13B-GPTQ](http://113.200.138.88:18080/aimodels/Llama-2-13B-chat-GPTQ) | [Llama-2-13B-AWQ](http://113.200.138.88:18080/aimodels/thebloke/Llama-2-13B-AWQ) |
| [Llama-2-70b-hf](http://113.200.138.88:18080/aimodels/Llama-2-70b-hf) | [Llama-2-70b-chat-hf](http://113.200.138.88:18080/aimodels/meta-llama/Llama-2-70b-chat-hf) | [Llama-2-70B-Chat-GPTQ](http://113.200.138.88:18080/aimodels/Llama-2-70B-Chat-GPTQ) | [Llama-2-70B-Chat-AWQ](http://113.200.138.88:18080/aimodels/thebloke/Llama-2-70B-AWQ) |
| [Meta-Llama-3-8B](http://113.200.138.88:18080/aimodels/Meta-Llama-3-8B) | [Meta-Llama-3-8B-Instruct](http://113.200.138.88:18080/aimodels/Meta-Llama-3-8B-Instruct) | | [Meta-Llama-3-8B-Instruct-AWQ](http://113.200.138.88:18080/aimodels/solidrust/Meta-Llama-3-8B-Instruct-hf-AWQ) |
| [Meta-Llama-3-70B](http://113.200.138.88:18080/aimodels/Meta-Llama-3-70B) | [Meta-Llama-3-70B-Instruct](http://113.200.138.88:18080/aimodels/Meta-Llama-3-70B-Instruct) | | [Meta-Llama-3-70B-Instruct-AWQ](http://113.200.138.88:18080/aimodels/techxgenus/Meta-Llama-3-70B-Instruct-AWQ) |


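### Offline batch inference
```bash
python examples/offline_inference.py
```
Here, `prompts` is the list of prompts. `temperature` controls sampling randomness: lower values make generation more deterministic, higher values make it more random, 0 means greedy sampling, and the default is 1. `max_tokens=16` sets the maximum number of generated tokens; the default is 16.
`model` is the model path. `tensor_parallel_size=1` is the number of cards to use; the default is 1. `dtype="float16"` is the inference data type; if the model weights are stored in bfloat16, set this to float16 for inference. `quantization="gptq"` runs inference with GPTQ quantization and requires one of the GPTQ models listed above. `quantization="awq"` runs inference with AWQ quantization and requires one of the AWQ models listed above.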
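
The parameters above map onto vLLM's `LLM` and `SamplingParams` classes. The sketch below shows one way to wire them together; it is not a copy of `examples/offline_inference.py`, the helper names are hypothetical, and the `temperature=0.8` default is only an illustrative value. vLLM is imported lazily so the helper that builds the engine arguments can be inspected without a DCU.

```python
def build_llm_kwargs(model, tensor_parallel_size=1, quantization=None):
    """Collect the engine arguments described above into a dict for vllm.LLM."""
    kwargs = {
        "model": model,  # local path or model name
        "tensor_parallel_size": tensor_parallel_size,  # number of cards
        "dtype": "float16",  # use float16 even if the weights are bfloat16
    }
    if quantization is not None:  # "gptq" or "awq", with matching weights
        kwargs["quantization"] = quantization
    return kwargs


def run_offline_inference(model, prompts, temperature=0.8, max_tokens=16, **engine_args):
    """Generate one completion per prompt and return (prompt, text) pairs."""
    from vllm import LLM, SamplingParams  # imported here so the module loads without vLLM

    sampling_params = SamplingParams(temperature=temperature, max_tokens=max_tokens)
    llm = LLM(**build_llm_kwargs(model, **engine_args))
    outputs = llm.generate(prompts, sampling_params)
    return [(out.prompt, out.outputs[0].text) for out in outputs]
```

A call such as `run_offline_inference("meta-llama/Llama-2-7b-chat-hf", ["I believe the meaning of life is"])` would load the model once and return the generated text for each prompt.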

### Offline batch inference benchmark
1. With fixed input and output lengths
```bash
python benchmarks/benchmark_throughput.py --num-prompts 1 --input-len 32 --output-len 128 --model meta-llama/Llama-2-7b-chat-hf -tp 1 --trust-remote-code --enforce-eager --dtype float16
```
Here `--num-prompts` is the batch size, `--input-len` is the input sequence length, `--output-len` is the number of output tokens, `--model` is the model path, `-tp` is the number of cards to use, and `--dtype float16` is the inference data type; if the model weights are stored in bfloat16, set this to float16 for inference. Setting `--output-len 1` measures first-token latency. `-q gptq` runs inference with a GPTQ-quantized model.
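
To relate these flags to the reported numbers, the arithmetic can be sketched as below. This assumes the token throughput counts both prompt and generated tokens, which is how I understand the benchmark script to report it; the function name and the elapsed time in the example are hypothetical.

```python
def throughput(num_prompts, input_len, output_len, elapsed_s):
    """Return (requests per second, tokens per second) for a fixed-length run.

    Assumes every request has exactly input_len prompt tokens and produces
    exactly output_len tokens, and that both are counted in the token total.
    """
    total_tokens = num_prompts * (input_len + output_len)
    return num_prompts / elapsed_s, total_tokens / elapsed_s


# Hypothetical run: the command above finishing in 4 seconds.
requests_per_s, tokens_per_s = throughput(1, 32, 128, 4.0)
```

With `--output-len 1`, the elapsed time is dominated by prompt processing, which is why that setting approximates first-token latency.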

2. With a dataset

Download the dataset:
```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

```bash
python benchmarks/benchmark_throughput.py --num-prompts 1 --model meta-llama/Llama-2-7b-chat-hf --dataset ShareGPT_V3_unfiltered_cleaned_split.json -tp 1 --trust-remote-code --enforce-eager --dtype float16
```
Here `--num-prompts` is the batch size, `--model` is the model path, `--dataset` is the dataset to use, `-tp` is the number of cards to use, and `--dtype float16` is the inference data type; if the model weights are stored in bfloat16, set this to float16 for inference. `-q gptq` runs inference with a GPTQ-quantized model.
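
For orientation, the ShareGPT file is a JSON list of records, each with a `conversations` list of turns. The sketch below extracts the first prompt and reply from each record, which is roughly what a throughput benchmark needs; the function names and the sample records are made up for illustration, and the real benchmark applies further filtering (for example on token length) that is not shown.

```python
import json


def first_turn_pairs(records):
    """Return (prompt, reply) from the first two turns of each conversation.

    Records with fewer than two turns are skipped.
    """
    pairs = []
    for rec in records:
        conv = rec.get("conversations", [])
        if len(conv) >= 2:
            pairs.append((conv[0]["value"], conv[1]["value"]))
    return pairs


def load_pairs(path):
    """Load a ShareGPT-style JSON file and extract its first-turn pairs."""
    with open(path, encoding="utf-8") as f:
        return first_turn_pairs(json.load(f))


# Made-up records in the same shape as the dataset.
SAMPLE_RECORDS = [
    {"id": "a", "conversations": [
        {"from": "human", "value": "Hi"},
        {"from": "gpt", "value": "Hello!"},
        {"from": "human", "value": "Tell me more"},
    ]},
    {"id": "b", "conversations": [{"from": "human", "value": "Only one turn"}]},
]
```

Calling `load_pairs("ShareGPT_V3_unfiltered_cleaned_split.json")` after the download above would return the pairs for the whole file.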


### API server inference benchmark
1. Start the server:
```bash
python -m vllm.entrypoints.openai.api_server  --model meta-llama/Llama-2-7b-chat-hf  --dtype float16 --enforce-eager -tp 1
```

2. Start the client:
```bash
python benchmarks/benchmark_serving.py --model meta-llama/Llama-2-7b-chat-hf --dataset ShareGPT_V3_unfiltered_cleaned_split.json  --num-prompts 1 --trust-remote-code
```
The parameters are the same as in the dataset-based offline batch inference benchmark above; see [benchmarks/benchmark_serving.py](benchmarks/benchmark_serving.py) for details.


### OpenAI-compatible server
Start the server:
```bash
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-chat-hf --enforce-eager --dtype float16 --trust-remote-code
```
Here `--model` is the path of the model to load and `--dtype` is the data type (float16). By default the chat template predefined in the tokenizer is used; `--chat-template` supplies a new template that overrides it. `-q gptq` runs inference with a GPTQ-quantized model.

List the available models:
```bash
curl http://localhost:8000/v1/models
```
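
The same query can be made from Python with only the standard library. This is a minimal sketch: the helper names are hypothetical, the base URL assumes the server from the previous step is listening on localhost port 8000, and the response is assumed to follow the OpenAI list format with model entries under `data`.

```python
import json
import urllib.request


def models_url(base_url="http://localhost:8000/v1"):
    """Build the model-listing URL, tolerating a trailing slash on base_url."""
    return base_url.rstrip("/") + "/models"


def model_ids(payload):
    """Extract the model ids from an OpenAI-style model list response."""
    return [entry["id"] for entry in payload["data"]]


def list_model_ids(base_url="http://localhost:8000/v1", timeout=10):
    """Query the running server and return the ids of the models it serves."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        return model_ids(json.load(resp))
```

The ids returned here are the values to pass as `model` in later requests.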

### Using the OpenAI Completions API with vLLM
```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-hf",
        "prompt": "I believe the meaning of life is",
        "max_tokens": 7,
        "temperature": 0
    }'
```
The `model` value must match the name of the model the server is serving. Alternatively, use [examples/openai_completion_client.py](examples/openai_completion_client.py).
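
For readers who prefer Python without extra dependencies, the request above can be assembled with `urllib`. This sketch is not the contents of `examples/openai_completion_client.py`; the function names and the 60-second timeout are assumptions.

```python
import json
import urllib.request


def build_completion_request(model, prompt, max_tokens=7, temperature=0,
                             base_url="http://localhost:8000/v1"):
    """Build a POST request for /v1/completions without sending it."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(model, prompt, **options):
    """Send the request to a running server and return the generated text."""
    request = build_completion_request(model, prompt, **options)
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.load(resp)["choices"][0]["text"]
```

Keeping request construction separate from sending makes it easy to inspect the payload before a server is available.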


### Using the OpenAI Chat API with vLLM
```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [
            {"role": "system", "content": "I believe the meaning of life is"},
            {"role": "user", "content": "I believe the meaning of life is"}
        ]
    }'
```
Alternatively, use [examples/openai_chatcompletion_client.py](examples/openai_chatcompletion_client.py).
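
The chat endpoint takes a list of role-tagged messages and returns the reply under `choices[0].message.content`. The sketch below builds such a payload and reads a reply; the helper names and the sample response are made up for illustration and only show the fields used here.

```python
def build_chat_payload(model, user_content, system_content=None):
    """Build the JSON body for /v1/chat/completions as a Python dict."""
    messages = []
    if system_content is not None:
        messages.append({"role": "system", "content": system_content})
    messages.append({"role": "user", "content": user_content})
    return {"model": model, "messages": messages}


def extract_reply(response):
    """Return the assistant text from an OpenAI-style chat completion response."""
    return response["choices"][0]["message"]["content"]


# Made-up response containing only the fields read by extract_reply.
SAMPLE_RESPONSE = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "to find purpose."}}
    ]
}
```

The dict from `build_chat_payload` can be serialized with `json.dumps` and posted in the same way as the curl command above.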
### **Using Gradio with vLLM**

1. Install gradio

```
pip install gradio
```

2. Install the required files

    2.1 Start the gradio service and follow the prompts

```
python  gradio_openai_chatbot_webserver.py --model "meta-llama/Llama-2-7b-chat-hf" --model-url http://localhost:8000/v1 --temp 0.8 --stop-token-ids ""
```

    2.2 Change the file permissions

Go to the download directory named in the prompt and run the following command to grant execute permission

```
chmod +x frpc_linux_amd64_v0.*
```

3. Start the OpenAI-compatible server

```
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-chat-hf --enforce-eager --dtype float16 --trust-remote-code --port 8000
```

4. Start the gradio service

```
python  gradio_openai_chatbot_webserver.py --model "meta-llama/Llama-2-7b-chat-hf" --model-url http://localhost:8000/v1 --temp 0.8 --stop-token-ids ""
```

5. Use the chat service

Open the local URL in a browser to use the chat service provided by Gradio.
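
Behind the web page, a Gradio front end of this kind has to turn its chat history into OpenAI-style messages and parse the `--stop-token-ids` argument. The sketch below shows one plausible way to do both; it is not taken from `gradio_openai_chatbot_webserver.py`, and the function names, the system prompt, and the assumption that history arrives as (user, assistant) pairs are all illustrative.

```python
def history_to_messages(message, history, system_prompt="You are a helpful assistant."):
    """Convert (user, assistant) history pairs plus a new message to chat messages."""
    messages = [{"role": "system", "content": system_prompt}]
    for user_text, assistant_text in history:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": message})
    return messages


def parse_stop_token_ids(raw):
    """Parse a comma-separated id string; an empty string yields an empty list."""
    return [int(part.strip()) for part in raw.split(",") if part.strip()]
```

With `--stop-token-ids ""` as in the commands above, `parse_stop_token_ids` returns an empty list, so no extra stop tokens are applied.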
## Result
Accelerator used: 1 × DCU-K100_AI-64G
```
Prompt: 'I believe the meaning of life is', Generated text: ' to find purpose, happiness, and fulfillment. Here are some reasons why:\n\n1. Purpose: Having a sense of purpose gives life meaning and direction. It helps individuals set goals and work towards achieving them, which can lead to a sense of accomplishment and fulfillment.\n2. Happiness: Happiness is a fundamental aspect of life that brings joy and satisfaction.
```

### Accuracy


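## Application scenarios

### Algorithm category
Conversational question answering

### Key application industries
Finance, scientific research, education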

## Source repository and issue feedback
* [https://developer.hpccube.com/codes/modelzoo/llama_vllm](https://developer.hpccube.com/codes/modelzoo/llama_vllm)

## References
* [https://github.com/vllm-project/vllm](https://github.com/vllm-project/vllm)