# DeepSeek-V3.2-Exp
## Paper
[DeepSeek_V3.2](./DeepSeek_V3_2.pdf)

## Model Architecture
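DeepSeek-V3.2-Exp is an experimental release. As an intermediate step toward the next-generation architecture, V3.2-Exp builds on V3.1-Terminus by introducing DeepSeek Sparse Attention, a sparse attention mechanism designed to explore and validate efficiency optimizations for training and inference in long-context scenarios.
This experimental release reflects the DeepSeek team's ongoing research into more efficient Transformer architectures, with a particular focus on improving computational efficiency when processing extended text sequences.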

<div align=center>
    <img src="./doc/arch.png"/>
</div>

## Algorithm
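DeepSeek Sparse Attention (DSA) achieves fine-grained sparse attention for the first time, substantially improving long-context training and inference efficiency while keeping model output quality virtually unchanged.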
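
To give a feel for what "sparse attention" means in general, the toy sketch below lets one query attend to only the `k` highest-scoring keys instead of all of them. It is a generic top-k illustration in plain Python, not the DSA implementation: the function name `topk_sparse_attention` and the use of the full dot product to choose which keys to keep are assumptions made for this example. See the paper linked above for the actual mechanism.

```python
import math


def topk_sparse_attention(query, keys, values, k):
    """Attend from one query to only the k highest-scoring keys.

    Generic top-k sparse attention for illustration; not the DSA implementation.
    """
    scale = 1.0 / math.sqrt(len(query))
    # Scaled dot-product score between the query and every key.
    scores = [scale * sum(q * x for q, x in zip(query, key)) for key in keys]
    # Keep only the indices of the k largest scores (the "sparse" step).
    kept = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the kept scores only; subtract the max for numerical stability.
    top = max(scores[i] for i in kept)
    weights = [math.exp(scores[i] - top) for i in kept]
    total = sum(weights)
    # Weighted sum of the value vectors that belong to the kept keys.
    dim = len(values[0])
    output = [0.0] * dim
    for weight, i in zip(weights, kept):
        for d in range(dim):
            output[d] += (weight / total) * values[i][d]
    return output, sorted(kept)
```

With `k` fixed, the softmax and the weighted sum touch `k` entries per query no matter how long the context is, which is where the savings in long-context settings come from.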

## Environment Setup
### Hardware Requirements
DCU model: K100AI. Nodes: 4. Cards: 32.

Adjust the `-v` paths, `docker_name`, and `imageID` to match your environment.

### Docker (Method 1)
```bash
docker pull image.sourcefind.cn:5000/dcu/admin/base/vllm:0.9.2-ubuntu22.04-dtk25.04.1-rc5-rocblas104381-0915-das1.6-py3.10-20250916-rc2
docker run -it --shm-size 200g --network=host --name {docker_name} --privileged --device=/dev/kfd --device=/dev/dri --device=/dev/mkfd --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro {imageID} bash

# Install transformers
git clone -b add-deepseek-exp https://github.com/huggingface/transformers.git
cd transformers
pip install -e .

cd /your_code_path/deepseek-v3.2-exp_pytorch
```

### Dockerfile (Method 2)
```bash
cd docker
docker build --no-cache -t deepseek-v3.2-exp:latest .

docker run -it --shm-size 200g --network=host --name {docker_name} --privileged --device=/dev/kfd --device=/dev/dri --device=/dev/mkfd --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro {imageID} bash

# Install transformers
git clone -b add-deepseek-exp https://github.com/huggingface/transformers.git
cd transformers
pip install -e .

cd /your_code_path/deepseek-v3.2-exp_pytorch
```

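### Anaconda (Method 3)
The special deep learning libraries that this project needs for DCU cards can be downloaded and installed from the [光合](https://developer.sourcefind.cn/tool/) developer community.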
```bash
DTK: 25.04.1
python: 3.10.12
torch: 2.5.1+das.opt1.dtk25041
```
`Tips: the versions of the DTK driver, PyTorch, and the other DCU-related tools listed above must match one another exactly.` Install the remaining components as follows:
```bash
# Install transformers
git clone -b add-deepseek-exp https://github.com/huggingface/transformers.git
cd transformers
pip install -e .

cd /your_code_path/deepseek-v3.2-exp_pytorch

```
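
Because the DCU tool versions have to line up exactly, it can help to compare what is installed against the versions listed in this README before going further. The sketch below does that for Python and torch only. The helper names are hypothetical, and it assumes that `importlib.metadata` reports the torch build string in the same form as shown above. The DTK version is not checked here.

```python
import sys
from importlib import metadata

# Versions listed in this README; adjust if your environment differs.
EXPECTED = {"python": "3.10.12", "torch": "2.5.1+das.opt1.dtk25041"}


def installed_versions():
    """Collect the Python and torch versions visible to this interpreter."""
    found = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    try:
        found["torch"] = metadata.version("torch")
    except metadata.PackageNotFoundError:
        found["torch"] = None  # torch is not installed in this environment
    return found


def mismatches(found, expected=EXPECTED):
    """Return {name: (found, expected)} for every version that differs."""
    return {
        name: (found.get(name), wanted)
        for name, wanted in expected.items()
        if found.get(name) != wanted
    }
```

Calling `mismatches(installed_versions())` returns an empty dict when both versions agree with the list above.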

## Dataset


## Training
Not available yet.

## Inference
Sample model: [DeepSeek-V3.2-Exp](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp)

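First convert the model to BF16. After the conversion finishes, copy `config.json`, `generation_config.json`, `tokenizer_config.json`, and `tokenizer.json` from the original model into `/path/to/DeepSeek-V3.2-Exp-bf16`, then delete the `quantization_config` field from `config.json`, as shown in the figure below.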

<div align=center>
    <img src="./doc/config.png"/>
</div>

```bash
# Convert FP8 weights to BF16
python inference/fp8_cast_bf16.py --input-fp8-hf-path /path/to/DeepSeek-V3.2-Exp --output-bf16-hf-path /path/to/DeepSeek-V3.2-Exp-bf16
```
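
The copy-and-edit step that follows the conversion can also be scripted. The sketch below is one possible way to do it, assuming the four file names listed above exist in the original model directory. The function name `prepare_bf16_dir` is hypothetical and not part of this repository.

```python
import json
import shutil
from pathlib import Path

# File names taken from the step described above.
FILES = ["config.json", "generation_config.json", "tokenizer_config.json", "tokenizer.json"]


def prepare_bf16_dir(src_dir, dst_dir):
    """Copy config/tokenizer files and drop quantization_config from config.json."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for name in FILES:
        shutil.copy2(src / name, dst / name)
    config_path = dst / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # pop() with a default is a no-op when the field is already absent.
    config.pop("quantization_config", None)
    config_path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    return config
```

For example: `prepare_bf16_dir("/path/to/DeepSeek-V3.2-Exp", "/path/to/DeepSeek-V3.2-Exp-bf16")`.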

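### Inference with vLLM
#### Multi-node server
1. Set the environment variables
> Note:
> Write the environment variables for each node into a `.sh` file, save it, and `source` that file on every compute node.
>
> VLLM_HOST_IP: the IP of the node's local communication interface. Prefer the IP of the IB NIC to **avoid RCCL timeouts**.
>
> NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME: the name of the network interface that owns that IP.
>
> To look up interfaces and IPs: `ifconfig`
>
> To check IB port status: `ibstat`. The port must be in the Active state to be usable, and all nodes must be configured consistently.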

<div align=center>
    <img src="./doc/ip_bw.png"/>
</div>

```bash
export ALLREDUCE_STREAM_WITH_COMPUTE=1
export VLLM_HOST_IP=x.x.x.x # IP of this compute node; preferably the IP of the IB interface named in SOCKET_IFNAME
export NCCL_SOCKET_IFNAME=ibxxxx
export GLOO_SOCKET_IFNAME=ibxxxx
export NCCL_IB_HCA=mlx5_0:1 # name of the IB NIC in your environment
unset NCCL_ALGO
export NCCL_MIN_NCHANNELS=16
export NCCL_MAX_NCHANNELS=16
export NCCL_NET_GDR_READ=1
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# NUMA core binding for Hygon CPUs
export VLLM_NUMA_BIND=1
export VLLM_RANK0_NUMA=0
export VLLM_RANK1_NUMA=1
export VLLM_RANK2_NUMA=2
export VLLM_RANK3_NUMA=3
export VLLM_RANK4_NUMA=4
export VLLM_RANK5_NUMA=5
export VLLM_RANK6_NUMA=6
export VLLM_RANK7_NUMA=7
# Extra environment variables required on BW clusters:
export NCCL_NET_GDR_LEVEL=7
export NCCL_SDMA_COPY_ENABLE=0
export VLLM_RPC_TIMEOUT=1800000
```
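
Since every node has to source its own copy of these variables, a quick check on each node can catch a missing or inconsistent value before the cluster is started. The sketch below is a hypothetical helper, not part of this repository. The list of required names is copied from the block above, and the rule that both `*_SOCKET_IFNAME` variables name the same interface is an assumption inferred from that example.

```python
import os

# Variables that the block above sets on every node (names copied from it).
REQUIRED = [
    "VLLM_HOST_IP",
    "NCCL_SOCKET_IFNAME",
    "GLOO_SOCKET_IFNAME",
    "NCCL_IB_HCA",
    "HIP_VISIBLE_DEVICES",
]


def check_env(env=None):
    """Return a list of human-readable problems found in the environment."""
    env = os.environ if env is None else env
    problems = [f"{name} is not set" for name in REQUIRED if not env.get(name)]
    # Assumption: both socket variables should name the same interface.
    nccl, gloo = env.get("NCCL_SOCKET_IFNAME"), env.get("GLOO_SOCKET_IFNAME")
    if nccl and gloo and nccl != gloo:
        problems.append("NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME differ")
    # The block above unsets NCCL_ALGO, so it should be absent.
    if "NCCL_ALGO" in env:
        problems.append("NCCL_ALGO is still set")
    return problems
```

Running `check_env()` after sourcing the `.sh` file returns an empty list when nothing is missing.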

2. Start the Ray cluster
> x.x.x.x is the `VLLM_HOST_IP` of the master node from step 1.

```bash
# Run on the head node
ray start --head --node-ip-address=x.x.x.x --port=6379 --num-gpus=8 --num-cpus=32
# Run on each worker node
ray start --address='x.x.x.x:6379' --num-gpus=8 --num-cpus=32
```
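
Before the server is launched, it may be worth confirming that all cards have joined the cluster: 4 nodes with 8 cards each should give 32 in total. The sketch below uses Ray's Python API (`ray.init(address="auto")` and `ray.cluster_resources()`). The helper names are hypothetical, and it assumes the cards appear under the `GPU` resource key, which follows from the `--num-gpus=8` flag used above.

```python
NODES = 4          # node count from the hardware requirements
GPUS_PER_NODE = 8  # matches --num-gpus=8 above


def gpu_shortfall(resources, required=NODES * GPUS_PER_NODE):
    """Return how many GPUs are still missing from a Ray resource dict."""
    return max(0, required - int(resources.get("GPU", 0)))


def check_cluster():
    """Connect to the running Ray cluster and report missing GPUs."""
    import ray  # imported here so the helper above works without Ray installed

    ray.init(address="auto")  # attach to the cluster started with `ray start`
    try:
        return gpu_shortfall(ray.cluster_resources())
    finally:
        ray.shutdown()
```

Running `check_cluster()` on the head node returns `0` once all 32 cards are registered. A non-zero value suggests that a worker has not joined yet.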

3. Start the vLLM server
> On Intel CPUs, add the `--enforce-eager` flag.

```bash
vllm serve deepseek-v3.2/DeepSeek-V3.2-Exp-bf16 \
    --trust-remote-code \
    --distributed-executor-backend ray \
    --dtype bfloat16 \
    --tensor-parallel-size 32 \
    --max-model-len 1024 \
    --max-num-seqs 128 \
    --no-enable-chunked-prefill \
    --no-enable-prefix-caching \
    --gpu-memory-utilization 0.85 \
    --host 12.12.12.11 \
    --port 8001 \
    --kv-cache-dtype bfloat16
```

Once the server has started, it can be queried as follows:
```bash
curl http://127.0.0.1:8001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-v3.2/DeepSeek-V3.2-Exp-bf16",
        "messages": [
            {
                "role": "user",
                "content": "Explain Machine Learning to me in a nutshell."
            }
        ],
        "temperature": 0.15,
        "top_p": 1.0,
        "max_tokens": 2048,
        "stream": false
}'
```
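
The same request can be sent from Python with only the standard library. The sketch below is a hedged example rather than an official client: the helper names are hypothetical, the base URL is taken from the curl command, and it assumes the server exposes the usual OpenAI-compatible `/v1/models` and `/v1/chat/completions` routes. It asks the server for the model name it accepts instead of hard-coding one. It also defaults `max_tokens` to 256, an illustrative value chosen to stay below the `--max-model-len 1024` used when starting the server.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8001"  # host and port from the curl example


def build_chat_request(model, prompt, base_url=BASE_URL, max_tokens=256):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.15,
        "top_p": 1.0,
        "max_tokens": max_tokens,
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def served_model_names(base_url=BASE_URL):
    """Ask the server which model names it accepts."""
    with urllib.request.urlopen(f"{base_url}/v1/models") as response:
        return [entry["id"] for entry in json.load(response)["data"]]


def chat(prompt, base_url=BASE_URL):
    """Send one prompt to the first served model and return the reply text."""
    model = served_model_names(base_url)[0]
    request = build_chat_request(model, prompt, base_url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["message"]["content"]
```

For example, `chat("Explain Machine Learning to me in a nutshell.")` returns the reply text once the server is reachable at `BASE_URL`.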

## Result
<div align=center>
    <img src="./doc/results_dcu.jpg"/>
</div>

### Accuracy
Accuracy on DCU is consistent with GPU. Inference framework: vLLM.

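## Application Scenarios
### Algorithm Category
`Conversational question answering`

### Key Application Industries
`Manufacturing, finance, education, broadcast media`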

## Pretrained Weights
- [DeepSeek-V3.2-Exp](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp)
- [DeepSeek-V3.2-Exp-Base](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp-Base)

## Source Repository and Issue Feedback
- https://developer.sourcefind.cn/codes/modelzoo/deepseek-v3.2-exp_pytorch

## References
- https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp
- https://github.com/deepseek-ai/DeepSeek-V3.2-Exp