<!--
 * @Author: zhuww
 * @email: zhuww@sugon.com
 * @Date: 2024-04-25 10:38:07
 * @LastEditTime: 2024-05-24 15:47:01
-->
# LLAMA

## Paper
- [https://arxiv.org/pdf/2302.13971.pdf](https://arxiv.org/pdf/2302.13971.pdf)

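## Model Architecture
The LLaMA network is based on the Transformer architecture and adopts several improvements that were proposed later and used in other models such as PaLM. The main differences from the original architecture are:
- Pre-normalization. To improve training stability, the input of each Transformer sub-layer is normalized instead of the output, using the RMSNorm normalization function.
- SwiGLU activation function [PaLM]. The ReLU non-linearity is replaced with SwiGLU to improve performance, with a hidden dimension of 2/3 · 4d instead of the 4d used in PaLM.
- Rotary embeddings. Absolute position embeddings are removed, and rotary position embeddings (RoPE) are added at every layer of the network instead.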
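
The three changes can be illustrated with a small pure-Python sketch. It is not the model's implementation: the helper names, the `eps` and `base` values, the pairing of consecutive elements in RoPE, and the rounding of the SwiGLU width up to a multiple of 256 are assumptions made for illustration and are not stated in this README.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then apply a learned per-element weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def swiglu_hidden_dim(d, multiple_of=256):
    """Feed-forward width 2/3 * 4d, rounded up to a multiple of `multiple_of` (assumed)."""
    hidden = int(2 * (4 * d) / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

def rope_rotate(vec, position, base=10000.0):
    """RoPE: rotate each consecutive pair of elements by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        x0, x1 = vec[i], vec[i + 1]
        out.append(x0 * math.cos(theta) - x1 * math.sin(theta))
        out.append(x0 * math.sin(theta) + x1 * math.cos(theta))
    return out

normed = rms_norm([3.0, 4.0], [1.0, 1.0])
ffn_dim = swiglu_hidden_dim(4096)
rotated = rope_rotate([1.0, 0.0, 1.0, 0.0], position=3)
print(normed, ffn_dim, rotated)
```

Under these assumptions, a model dimension of 4096 gives a feed-forward width of 11008, and the rotation leaves the vector's length unchanged while encoding the position in its direction.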

![img](./docs/llama_str.png)

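## Algorithm Principles
LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. The models are trained on trillions of tokens and show that state-of-the-art models can be trained using publicly available datasets exclusively, without relying on proprietary and inaccessible datasets.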

![img](./docs/llama_pri.png)

## Environment Setup

### Docker (Method 1)
Pull the inference Docker image from [光源](https://www.sourcefind.cn/#/image/dcu/custom):

```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-centos7.6-dtk24.04-py310
# Replace <Image ID> with the ID of the Docker image pulled above
# <Host Path>: path on the host
# <Container Path>: mapped path inside the container
docker run -it --name llama_vllm --privileged --shm-size=64G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal -v <Host Path>:<Container Path> <Image ID> /bin/bash

# Update the image's ray version and serving dependencies
pip install ray==2.9.1 aiohttp==3.9.1 outlines==0.0.37 openai==1.23.3
```

### Dockerfile (Method 2)
```
# <Host Path>: path on the host
# <Container Path>: mapped path inside the container
docker build -t llama:latest .
docker run -it --name llama_vllm --privileged --shm-size=64G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal -v <Host Path>:<Container Path> llama:latest /bin/bash
```

### Anaconda (Method 3)
```
conda create -n llama_vllm python=3.10
pip install ray==2.9.1 aiohttp==3.9.1 outlines==0.0.37 openai==1.23.3
```
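The special deep learning libraries that this project requires for DCU cards can be downloaded and installed from the [光合](https://developer.hpccube.com/tool/) developer community.
* DTK driver: dtk24.04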
* Pytorch: 2.1.0
* triton: 2.1.0
* vllm: 0.3.3
* xformers: 0.0.25
* flash_attn: 2.0.4
* python: python3.10

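`Tips: the versions of the DTK driver, Python, torch, and the other DCU-related tools listed above must match each other exactly. Currently only the K100_AI is supported.`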
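
Because the versions have to line up, a quick check of the installed Python packages can catch mismatches early. The following is a minimal sketch using only the standard library; the distribution names in `EXPECTED` are assumptions (a DCU build may publish a package under a different name), and a prefix match is used on the assumption that vendor builds may append a local suffix to the version string.

```python
from importlib import metadata

# Pins copied from the list above; the distribution names are assumptions.
EXPECTED = {
    "torch": "2.1.0",
    "triton": "2.1.0",
    "vllm": "0.3.3",
    "xformers": "0.0.25",
    "flash_attn": "2.0.4",
}

def installed_version(name):
    """Return the installed version string of a distribution, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

def check_versions(expected, lookup=installed_version):
    """Map each package to (expected, found, ok); a prefix match tolerates build suffixes."""
    report = {}
    for name, want in expected.items():
        found = lookup(name)
        ok = found is not None and found.startswith(want)
        report[name] = (want, found, ok)
    return report

for name, (want, found, ok) in check_versions(EXPECTED).items():
    status = "OK" if ok else "MISMATCH"
    print(f"{name:12s} expected {want:8s} found {found} {status}")
```

The DTK driver itself is not a Python package, so it is outside the scope of this check.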

## Dataset


## Inference
### Build and Install from Source
```
# If you use the 光源 image, you can skip building from source; vllm is already installed in the image.
git clone http://developer.hpccube.com/codes/modelzoo/llama_vllm.git
cd llama_vllm
git submodule init && git submodule update
cd vllm
pip install wheel
python setup.py bdist_wheel
cd dist && pip install vllm*
```

### Model Download

| Base model                                                                      | Chat model                                                                                    | GPTQ model                                                                                        |
| ------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------- |
| [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)   | [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)    | [Llama-2-7B-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GPTQ/tree/gptq-4bit-128g-actorder_True)   |
| [Llama-2-13b-hf](https://huggingface.co/meta-llama/Llama-2-13b-hf) | [Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) | [Llama-2-13B-GPTQ](https://huggingface.co/TheBloke/Llama-2-13B-GPTQ/tree/gptq-4bit-128g-actorder_True) |
| [Llama-2-70b-hf](https://huggingface.co/meta-llama/Llama-2-70b-hf) | [Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) | [Llama-2-70B-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-70B-Chat-GPTQ/tree/gptq-4bit-128g-actorder_True) |
| [Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) | [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | 
| [Meta-Llama-3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B) | [Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) | 


### Offline Batch Inference
```bash
python vllm/examples/offline_inference.py
```
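Here, `prompts` is the list of prompts. `temperature` controls the randomness of sampling: lower values make generation more deterministic, higher values make it more random, and 0 means greedy sampling (default 1). `max_tokens=16` is the maximum number of tokens to generate (default 16).
`model` is the model path; `tensor_parallel_size=1` is the number of cards to use (default 1); `dtype="float16"` is the inference data type, and if the model weights are bfloat16 it must be changed to float16 for inference; `quantization="gptq"` enables GPTQ-quantized inference, which requires downloading one of the GPTQ models listed above.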
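
To show how these parameters fit together, here is a minimal sketch assuming the upstream vLLM Python API (`LLM`, `SamplingParams`, `LLM.generate`). The helper `build_engine_args`, the sampling values, and the command-line handling are illustrative assumptions, not the contents of `vllm/examples/offline_inference.py`.

```python
import sys

def build_engine_args(model, tensor_parallel_size=1, dtype="float16", quantization=None):
    """Collect keyword arguments for vllm.LLM; quantization is only passed when set."""
    args = {"model": model, "tensor_parallel_size": tensor_parallel_size, "dtype": dtype}
    if quantization is not None:
        args["quantization"] = quantization  # for example "gptq" with a GPTQ checkpoint
    return args

def main(model_path):
    # Imported lazily so the helper above can be used without vLLM installed.
    from vllm import LLM, SamplingParams

    prompts = ["I believe the meaning of life is"]
    sampling_params = SamplingParams(temperature=0.8, max_tokens=16)
    llm = LLM(**build_engine_args(model_path))
    for output in llm.generate(prompts, sampling_params):
        print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")

# Run as: python sketch.py <model path>; without an argument nothing is loaded.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Keeping the engine arguments in a plain dictionary makes it easy to switch between a float16 checkpoint and a GPTQ one without touching the generation loop.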


### Offline Batch Inference Benchmark
1. Specify the input and output lengths
```bash
python vllm/benchmarks/benchmark_throughput.py --num-prompts 1 --input-len 32 --output-len 128 --model meta-llama/Llama-2-7b-chat-hf -tp 1 --trust-remote-code --enforce-eager --dtype float16
```
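Here, `--num-prompts` is the batch size, `--input-len` is the input sequence length, `--output-len` is the number of output tokens, `--model` is the model path, and `-tp` is the number of cards to use. `--dtype float16` is the inference data type; if the model weights are bfloat16, it must be set to float16 for inference. Setting `--output-len 1` measures first-token latency. `-q gptq` runs inference with a GPTQ-quantized model.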
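
The relationship between these parameters and the reported throughput can be sketched as follows. This assumes the benchmark counts prompt tokens plus generated tokens, as the upstream vLLM script does; the function name and the elapsed time are hypothetical.

```python
def throughput(num_prompts, input_len, output_len, elapsed_s):
    """Return (requests per second, tokens per second) for a fixed-length batch."""
    total_tokens = num_prompts * (input_len + output_len)
    return num_prompts / elapsed_s, total_tokens / elapsed_s

# Sizes mirror the command above; the 4.0 s elapsed time is a made-up placeholder.
req_s, tok_s = throughput(num_prompts=1, input_len=32, output_len=128, elapsed_s=4.0)
print(f"Throughput: {req_s:.2f} requests/s, {tok_s:.2f} tokens/s")
```

With `--output-len 1` the elapsed time is dominated by prompt processing, which is why that setting approximates first-token latency.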

2. Use a dataset

Download the dataset:
```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

```bash
python vllm/benchmarks/benchmark_throughput.py --num-prompts 1 --model meta-llama/Llama-2-7b-chat-hf --dataset ShareGPT_V3_unfiltered_cleaned_split.json -tp 1 --trust-remote-code --enforce-eager --dtype float16
```
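Here, `--num-prompts` is the batch size, `--model` is the model path, `--dataset` is the dataset to use, and `-tp` is the number of cards to use. `--dtype float16` is the inference data type; if the model weights are bfloat16, it must be set to float16 for inference. `-q gptq` runs inference with a GPTQ-quantized model.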
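
For orientation, the sketch below shows how prompt and reply pairs could be drawn from a file in the ShareGPT layout. It assumes each record has a `conversations` list whose entries carry a `value` field, and that only the first two turns are used; the function name and the inline sample are hypothetical, and the benchmark script's own sampling logic may differ.

```python
import json

def first_turn_pairs(records, limit):
    """Return up to `limit` (prompt, reply) pairs from the first two turns of each record."""
    pairs = []
    for record in records:
        turns = record.get("conversations", [])
        if len(turns) < 2:
            continue  # a prompt and a reply are both required
        pairs.append((turns[0]["value"], turns[1]["value"]))
        if len(pairs) == limit:
            break
    return pairs

# Tiny inline sample in the assumed layout; the downloaded file would be loaded
# with json.load(open("ShareGPT_V3_unfiltered_cleaned_split.json")) instead.
sample = json.loads("""[
  {"id": "a", "conversations": [{"from": "human", "value": "Hi"},
                                {"from": "gpt", "value": "Hello!"}]},
  {"id": "b", "conversations": [{"from": "human", "value": "Alone"}]}
]""")
pairs = first_turn_pairs(sample, limit=1)
print(pairs)
```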


### API Server Inference Benchmark
1. Start the server:
```bash
python -m vllm.entrypoints.api_server  --model meta-llama/Llama-2-7b-chat-hf  --dtype float16 --enforce-eager -tp 1 
```

2. Start the client:
```bash
python vllm/benchmarks/benchmark_serving.py --model meta-llama/Llama-2-7b-chat-hf --dataset ShareGPT_V3_unfiltered_cleaned_split.json  --num-prompts 1 --trust-remote-code
```
The parameters are the same as for the dataset-based offline batch inference benchmark; see `vllm/benchmarks/benchmark_serving.py` for details.
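
To check the server by hand before benchmarking, a single request can be built with the standard library. This sketch assumes the upstream `vllm.entrypoints.api_server` behaviour: a `/generate` endpoint on port 8000 that accepts a `prompt` plus sampling parameters and returns a JSON object with a `text` list. The helper name is hypothetical.

```python
import json
import urllib.request

def build_generate_request(prompt, host="localhost", port=8000, **sampling):
    """Build a POST request for the /generate endpoint without sending it."""
    payload = {"prompt": prompt, "stream": False, **sampling}
    return urllib.request.Request(
        f"http://{host}:{port}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_generate_request(
    "I believe the meaning of life is", max_tokens=16, temperature=0
)
print(request.get_method(), request.full_url)

# With the server from step 1 running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read())["text"])
```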


### OpenAI-Compatible Server
Start the server:
```bash
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-chat-hf --enforce-eager --dtype float16 --trust-remote-code
```
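Here, `--model` is the path of the model to load and `--dtype` is the data type (float16). By default the chat template predefined in the tokenizer is used; `--chat-template` supplies a new template that overrides the default. `-q gptq` runs inference with a GPTQ-quantized model.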

List the available models:
```bash
curl http://localhost:8000/v1/models
```

### Using the OpenAI Completions API with vLLM
```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "I believe the meaning of life is",
        "max_tokens": 7,
        "temperature": 0
    }'
```
Alternatively, use [vllm/examples/openai_completion_client.py](https://developer.hpccube.com/codes/OpenDAS/vllm/-/blob/675c0abe47eb9d29c126fbecda86fd5801162eba/examples/openai_completion_client.py)
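
The response to the request above is a JSON document in the OpenAI completions format. A minimal sketch of reading it, assuming the standard `choices` and `usage` fields; the sample body is hand-written for illustration and is not real server output.

```python
import json

def parse_completion(body):
    """Return (text, finish_reason, completion_tokens) from a /v1/completions response."""
    data = json.loads(body)
    choice = data["choices"][0]
    return choice["text"], choice["finish_reason"], data["usage"]["completion_tokens"]

# Hand-written placeholder in the completions response shape; values are not measured.
sample_body = json.dumps({
    "object": "text_completion",
    "choices": [{"index": 0, "text": " <generated text>", "finish_reason": "length"}],
    "usage": {"prompt_tokens": 8, "completion_tokens": 7, "total_tokens": 15},
})
text, reason, n_tokens = parse_completion(sample_body)
print(repr(text), reason, n_tokens)
```

A `finish_reason` of `length` indicates that generation stopped because `max_tokens` was reached rather than because the model emitted an end-of-sequence token.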


### Using the OpenAI Chat API with vLLM
```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [
            {"role": "system", "content": "I believe the meaning of life is"},
            {"role": "user", "content": "I believe the meaning of life is"}
        ]
    }'
```
Alternatively, use [vllm/examples/openai_chatcompletion_client.py](https://developer.hpccube.com/codes/OpenDAS/vllm/-/blob/675c0abe47eb9d29c126fbecda86fd5801162eba/examples/openai_chatcompletion_client.py)
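
For multi-turn conversations, the `messages` list grows by alternating `user` and `assistant` entries after the optional `system` entry. The sketch below builds such a request body; the helper name, the system prompt, and the earlier turn are hypothetical and only illustrate the message layout.

```python
import json

def build_chat_payload(model, system_prompt, history, user_message):
    """Assemble a /v1/chat/completions body from a system prompt, prior turns, and a new message."""
    messages = [{"role": "system", "content": system_prompt}]
    for user_text, assistant_text in history:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": user_message})
    return {"model": model, "messages": messages}

payload = build_chat_payload(
    "meta-llama/Llama-2-7b-chat-hf",
    "You are a helpful assistant.",      # hypothetical system prompt
    [("Hi", "Hello! How can I help?")],  # hypothetical earlier turn
    "I believe the meaning of life is",
)
print(json.dumps(payload, indent=2))
```

The printed JSON can be passed to `curl` with `-d` in place of the inline body shown in the command above.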


## Result
Accelerator used: 1 × DCU-K100_AI-64G
```
Prompt: 'I believe the meaning of life is', Generated text: ' to find purpose, happiness, and fulfillment. Here are some reasons why:\n\n1. Purpose: Having a sense of purpose gives life meaning and direction. It helps individuals set goals and work towards achieving them, which can lead to a sense of accomplishment and fulfillment.\n2. Happiness: Happiness is a fundamental aspect of life that brings joy and satisfaction.
```

### Accuracy


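## Application Scenarios

### Algorithm Category
Conversational question answering

### Key Application Industries
Finance, scientific research, education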

## Source Repository and Issue Feedback
* [https://developer.hpccube.com/codes/modelzoo/llama_vllm](https://developer.hpccube.com/codes/modelzoo/llama_vllm)

## References
* [https://github.com/vllm-project/vllm](https://github.com/vllm-project/vllm)