README.md 10.2 KB
Newer Older
dcuai's avatar
dcuai committed
1
# InternLM
laibao's avatar
laibao committed
2

laibao's avatar
laibao committed
3
4
5
6
7
8
9
## 论文

`InternLM2 Technical Report`

- [https://arxiv.org/pdf/2403.17297]

## 模型结构
laibao's avatar
laibao committed
10

laibao's avatar
laibao committed
11
Internlm2.5与Internlm2模型结构相同,但取得更好效果,Internlm2采用LLama和GQA结构,相较于Internlm改进了Wqkv的权重矩阵进行交错重排,不再简单堆叠每个头的Wk、Wq和Wv矩阵。此交织重排操作大概能提高5%的训练效率。
laibao's avatar
laibao committed
12

laibao's avatar
laibao committed
13
14
15
16
17
18
19
<div align=center>
    <img src="doc/struct.png"/>
</div>

## 算法原理

Internlm2.5主要是更新了7B系列的Chat模型。其中InternLM2.5-7B-Chat-1M模型支持百万长度的上下文窗口。主要核心点有三个:(1)卓越的模型推理能力,在数学推理的任务上达到了SOTA,超过了同等规模参数量的其他模型,如LLaMA3-8B和Gemma-9B;(2)支持百万长度上下文长度的推理,并且可以通过LMDeploy快速部署,开箱即用;(3)增加了更多的应用工具的支持
laibao's avatar
laibao committed
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34

<div align=center>
    <img src="doc/eval1.png"/>
</div>
<div align=center>
    <img src="doc/eval2.png"/>
</div>

## 环境配置

### Docker(方法一)

提供[光源](https://www.sourcefind.cn/#/image/dcu/custom)拉取推理的docker镜像:

```
35
docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.8.5-ubuntu22.04-dtk25.04.1-rc5-das1.6-py3.10-20250724
laibao's avatar
laibao committed
36
37
38
# <Image ID>用上面拉取docker镜像的ID替换
# <Host Path>主机端路径
# <Container Path>容器映射路径
zhuwenwen's avatar
zhuwenwen committed
39
# 若要在主机端和容器端映射端口需要删除--network host参数
laibao's avatar
laibao committed
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
docker run -it --name internlm_vllm --privileged --shm-size=64G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal -v <Host Path>:<Container Path> <Image ID> /bin/bash
```

`Tips:若在K100/Z100L上使用,使用定制镜像docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.5.0-dtk24.04.1-ubuntu20.04-py310-zk-v1,K100/Z100L不支持awq量化`

### Dockerfile(方法二)

```
# <Host Path>主机端路径
# <Container Path>容器映射路径
docker build -t internlm:latest .
docker run -it --name internlm_vllm --privileged --shm-size=64G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal:ro -v <Host Path>:<Container Path> internlm:latest /bin/bash
```

### Anaconda(方法三)

```
conda create -n internlm_vllm python=3.10
```

60
关于本项目DCU显卡所需的特殊深度学习库可从[光合](https://developer.hpccube.com/tool/)开发者社区下载安装。
laibao's avatar
laibao committed
61

62
63
64
65
* DTK驱动:dtk25.04.01
* Pytorch: 2.4.0
* triton: 3.0.0
* lmslim: 0.2.1
laibao's avatar
laibao committed
66
* flash_attn: 2.6.1
67
68
* flash_mla: 1.0.0
* vllm: 0.8.5
laibao's avatar
laibao committed
69
70
* python: python3.10

71
`Tips:需先安装相关依赖,最后安装vllm包`  
72
环境变量:    
73
export VLLM_NUMA_BIND=1  
74
75
76
export ALLREDUCE_STREAM_WITH_COMPUTE=1  
export VLLM_RANK0_NUMA=0  
export VLLM_RANK1_NUMA=1  
laibao's avatar
laibao committed
77
78
79
80
export VLLM_RANK2_NUMA=2  
export VLLM_RANK3_NUMA=3  
export VLLM_RANK4_NUMA=4  
export VLLM_RANK5_NUMA=5  
81
82
export VLLM_RANK6_NUMA=6  
export VLLM_RANK7_NUMA=7  
laibao's avatar
laibao committed
83
84
85
86
87
88
89
90
91

## 数据集



## 推理

### 模型下载

laibao's avatar
laibao committed
92
93
| 基座模型                                                                      |                                                                               |                                                                                 |                                                                                   |
| ----------------------------------------------------------------------------- | ----------------------------------------------------------------------------- | ------------------------------------------------------------------------------- | --------------------------------------------------------------------------------- |
chenzk's avatar
chenzk committed
94
| [internlm2-7b](https://huggingface.co/internlm/internlm2-7b)  | [internlm2-20b](https://huggingface.co/internlm/internlm2-20b) | [internlm2_5-7b](https://huggingface.co/internlm/internlm2_5-7b) | [internlm2_5-20b](https://huggingface.co/internlm/internlm2_5-20b) |
laibao's avatar
laibao committed
95
96
97
98

### 离线批量推理

```bash
99
VLLM_USE_FLASH_ATTN_PA=1 python examples/offline_inference/basic/basic.py
laibao's avatar
laibao committed
100
101
102
103
104
105
106
107
108
109
```

其中,`prompts`为提示词;`temperature`为控制采样随机性的值,值越小模型生成越确定,值变高模型生成更随机,0表示贪婪采样,默认为1;`max_tokens=16`为生成长度,默认为1;
`model`为模型路径;`tensor_parallel_size=1`为使用卡数,默认为1;`dtype="float16"`为推理数据类型,如果模型权重是bfloat16,需要修改为float16推理,`quantization="gptq"`为使用gptq量化进行推理,需下载以上GPTQ模型。`quantization="awq"`为使用awq量化进行推理,需下载以上AWQ模型。

### 离线批量推理性能测试

1、指定输入输出

```bash
110
VLLM_USE_FLASH_ATTN_PA=1 python benchmarks/benchmark_throughput.py --num-prompts 1 --input-len 32 --output-len 128 --model internlm/internlm2_5-7b -tp 1 --trust-remote-code --enforce-eager --dtype float16
laibao's avatar
laibao committed
111
112
```

laibao's avatar
laibao committed
113
其中 `--num-prompts`是batch数,`--input-len`是输入seqlen,`--output-len`是输出token长度,`--model`为模型路径,`-tp`为使用卡数,`dtype="float16"`为推理数据类型,如果模型权重是bfloat16,需要修改为float16推理。若指定 `--output-len  1`即为首字延迟。`-q gptq`为使用gptq量化模型进行推理。如果提示上下文受限,可以添加--max-model-len 64000
laibao's avatar
laibao committed
114
115
116

2、使用数据集
下载数据集:
117
[sharegpt_v3_unfiltered_cleaned_split](https://huggingface.co/datasets/learnanything/sharegpt_v3_unfiltered_cleaned_split)
laibao's avatar
laibao committed
118
119

```bash
120
VLLM_USE_FLASH_ATTN_PA=1 python benchmarks/benchmark_throughput.py --num-prompts 1 --model internlm/internlm2_5-7b --dataset-name sharegpt --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json -tp 1 --trust-remote-code --enforce-eager --dtype float16
laibao's avatar
laibao committed
121
122
123
124
125
126
127
128
129
```

其中 `--num-prompts`是batch数,`--model`为模型路径,`--dataset`为使用的数据集,`-tp`为使用卡数,`dtype="float16"`为推理数据类型,如果模型权重是bfloat16,需要修改为float16推理。`-q gptq`为使用gptq量化模型进行推理。

### OpenAI api服务推理性能测试

1.启动服务:

```bash
130
VLLM_USE_FLASH_ATTN_PA=1 vllm serve --model internlm/internlm2_5-7b --enforce-eager --dtype float16 --trust-remote-code --tensor-parallel-size 1
laibao's avatar
laibao committed
131
132
133
134
135
```

2.启动客户端

```
136
python benchmarks/benchmark_serving.py --model internlm/internlm2_5-7b --dataset-name sharegpt --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json  --num-prompts 1 --trust-remote-code
laibao's avatar
laibao committed
137
138
139
140
141
142
143
144
145
```

参数同使用数据集,离线批量推理性能测试,具体参考[benchmarks/benchmark_serving.py](/codes/modelzoo/qwen1.5_vllm/-/blob/master/benchmarks/benchmark_serving.py)

### OpenAI兼容服务

启动服务:

```bash
146
VLLM_USE_FLASH_ATTN_PA=1 vllm serve internlm/internlm2_5-7b --enforce-eager --dtype float16 --trust-remote-code
laibao's avatar
laibao committed
147
148
```

laibao's avatar
laibao committed
149
这里 serve之后为加载模型路径,`--dtype`为数据类型:float16,默认情况使用tokenizer中的预定义聊天模板,`--chat-template`可以添加新模板覆盖默认模板,`-q gptq`为使用gptq量化模型进行推理,`-q awqq`为使用awq量化模型进行推理。
laibao's avatar
laibao committed
150
151
152
153
154
155
156
157
158
159
160
161
162

列出模型型号:

```bash
curl http://localhost:8000/v1/models
```

### OpenAI Completions API和vllm结合使用

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
laibao's avatar
laibao committed
163
        "model": "internlm/internlm2_5-7b",
laibao's avatar
laibao committed
164
165
166
167
168
169
170
171
172
173
174
175
        "prompt": "What is deep learning?",
        "max_tokens": 7,
        "temperature": 0
    }'
```

或者使用[examples/openai_completion_client.py](examples/openai_completion_client.py)

### OpenAI Chat API和vllm结合使用

```bash
curl http://localhost:8000/v1/chat/completions \
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "internlm/internlm2_5-7b",
    "max_tokens": 128,
    "messages": [
      {
        "role": "user",
        "content": "What is deep learning?"
      }
    ]
  }'
```

或者使用[examples/online_serving/openai_chat_completion_client.py](examples/online_serving/openai_chat_completion_client.py)
laibao's avatar
laibao committed
191
192
193
194
195
196
197
198
199
200
201
202
203
204

### **gradio和vllm结合使用**

1.安装gradio

```
pip install gradio
```

2.安装必要文件

    2.1 启动gradio服务,根据提示操作

```
205
python examples/online_serving/gradio_openai_chatbot_webserver.py --model "internlm/internlm2_5-7b" --model-url http://localhost:8000/v1 --temp 0.8 --stop-token-ids ""
laibao's avatar
laibao committed
206
207
208
209
210
211
212
213
214
```

    2.2 更改文件权限

打开提示下载文件目录,输入以下命令给予权限

```
chmod +x frpc_linux_amd64_v0.*
```
laibao's avatar
laibao committed
215

laibao's avatar
laibao committed
216
217
218
219
   2.3 端口映射

```
ssh -L 8000:计算节点IP:8000 -L 8001:计算节点IP:8001 用户名@登录节点 -p 登录节点端口
laibao's avatar
laibao committed
220
```
laibao's avatar
laibao committed
221
222
223
224

3.启动OpenAI兼容服务

```
225
VLLM_USE_FLASH_ATTN_PA=1 vllm serve internlm/internlm2_5-7b --enforce-eager --dtype float16 --trust-remote-code --host "0.0.0.0"
laibao's avatar
laibao committed
226
227
```

laibao's avatar
laibao committed
228
229
如果提示上下文受限,可以添加--max-model-len 64000

laibao's avatar
laibao committed
230
231
232
4.启动gradio服务

```
233
python examples/online_serving/gradio_openai_chatbot_webserver.py --model "internlm/internlm2_5-7b" --model-url http://localhost:8000/v1 --temp 0.8 --stop-token-ids "" --host "0.0.0.0" --port 8001
laibao's avatar
laibao committed
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
```

5.使用对话服务

在浏览器中输入本地 URL,可以使用 Gradio 提供的对话服务。

## result

使用的加速卡:1张 DCU-K100_AI-64G

```
Prompt: 'What is deep learning?', Generated text: ' Deep learning is a subset of machine learning that involves the use of neural networks to model and solve complex problems. Neural networks are a network of interconnected nodes or " neurons" that are designed to recognize patterns in data, learn from examples, and make predictions or decisions.\nThe term "deep" in deep learning refers to the use of multiple layers or hidden layers in these neural networks. Each layer processes the input data in a different way, extracting increasingly abstract features as the data passes through.'
```

### 精度



## 应用场景

### 算法类别

对话问答

### 热点应用行业

金融,科研,教育

## 源码仓库及问题反馈

* [https://developer.sourcefind.cn/codes/modelzoo/internlm_vllm](https://developer.sourcefind.cn/codes/modelzoo/internlm_vllm)

## 参考资料

* [https://github.com/vllm-project/vllm](https://github.com/vllm-project/vllm)