# DeepSeek-V3.1
## Paper
Not available yet.

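## Model Architecture
DeepSeek-V3.1 is a hybrid model that supports both a thinking mode and a non-thinking mode. Compared with the previous version, this release improves on several fronts:
- Hybrid thinking mode: a single model supports both thinking and non-thinking modes by switching the chat template.
- Smarter tool calling: post-training optimization significantly improves the model's performance on tool use and agent tasks.
- Higher thinking efficiency: DeepSeek-V3.1-Think reaches answer quality comparable to DeepSeek-R1-0528 while responding faster.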

<div align=center>
    <img src="./doc/arch.png"/>
</div>

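## Algorithm
DeepSeek-V3.1 is post-trained on top of DeepSeek-V3.1-Base. DeepSeek-V3.1-Base is built from the original V3 base checkpoint through a two-stage long-context extension approach, following the method described in the original DeepSeek-V3 report. The dataset was expanded by collecting more long documents and substantially scaling up both training stages: the 32K extension stage grew 10x to 630B tokens, and the 128K extension stage grew 3.3x to 209B tokens.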
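
As a quick sanity check on the figures above, dividing each new stage size by its stated growth factor gives the implied earlier size. The sketch below uses the numbers as quoted, in the same token unit as the text; the function name is only illustrative.

```python
def implied_previous_size(new_size: float, growth_factor: float) -> float:
    """Return the stage size before it was scaled up by growth_factor."""
    return new_size / growth_factor


# Values quoted in the paragraph above (same token unit as the text).
stage_32k = implied_previous_size(630, 10)    # 32K extension stage
stage_128k = implied_previous_size(209, 3.3)  # 128K extension stage

print(f"32K stage before scaling:  {stage_32k:.1f}")
print(f"128K stage before scaling: {stage_128k:.1f}")
```

Both results land near 63, which suggests the two stages were of similar size before the expansion. This is an inference from the quoted factors, not a figure stated in the source.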

## Environment Setup
### Hardware Requirements
DCU model: BW; nodes: 4; cards: 32.

Adjust the `-v` paths, `docker_name`, and `imageID` to match your environment.

### Docker (Option 1)
```bash
docker pull image.sourcefind.cn:5000/dcu/admin/base/vllm:0.9.2-ubuntu22.04-dtk25.04.1-rc5-rocblas101839-0811-das1.6-py3.10-20250812-beta

docker run -it --name {docker_name} --device=/dev/kfd --privileged --network=host --device=/dev/dri --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v /your_code_path:/your_code_path -v /opt/hyhal:/opt/hyhal:ro -v /module/DeepSeek-V3.1:/your_model_path/DeepSeek-V3.1 --group-add video --shm-size 64G {imageID} bash

cd /your_code_path/deepseek-v3.1_vllm
```

### Dockerfile (Option 2)
```bash
cd docker
docker build --no-cache -t deepseek-v3.1:latest .
docker run -it --name {docker_name} --device=/dev/kfd --privileged --network=host --device=/dev/dri --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v /your_code_path:/your_code_path -v /opt/hyhal:/opt/hyhal:ro -v /module/DeepSeek-V3.1:/your_model_path/DeepSeek-V3.1 --group-add video --shm-size 64G {imageID} bash

cd /your_code_path/deepseek-v3.1_vllm
```

### Anaconda (Option 3)
The DCU-specific deep learning libraries this project needs can be downloaded and installed from the [光合](https://developer.sourcefind.cn/tool/) (Guanghe) developer community.
```bash
DTK: 25.04.1
python: 3.10.12
torch: 2.5.1+das.opt2.dtk25041
vllm: 0.9.2
transformers: 4.55.0
```
`Tips: the DTK driver, PyTorch, and the other DCU-related tool versions listed above must match each other exactly.`
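
Because the versions have to line up exactly, it can help to compare the installed packages against the pinned list before launching anything. The following is a minimal sketch using the standard `importlib.metadata` module; the helper name `version_mismatches` is hypothetical, and exact string equality is an assumption that may be too strict if a DCU build reports an extra local suffix.

```python
from importlib import metadata

# Versions pinned in the list above. DTK and Python are not pip packages,
# so they are left out of this check.
PINNED = {
    "torch": "2.5.1+das.opt2.dtk25041",
    "vllm": "0.9.2",
    "transformers": "4.55.0",
}


def version_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for every package that differs.

    `found` is None when the package is not installed.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in version_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means every pinned package matches; anything else lists what to reinstall.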

## Dataset

## Training
Not available yet.

## Inference
### Precision Conversion
```bash
python ./infer/fp8_cast_bf16.py --input-fp8-hf-path /path/to/fp8_weights --output-bf16-hf-path /path/to/bf16_weights
```
After the conversion succeeds, copy the other files into the output directory as well, and delete the `"quantization_config"` key-value pair from `config.json`.
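
The manual cleanup described above can be scripted. This is a minimal sketch, not part of the repository: the function name `finalize_bf16_dir` is hypothetical, and it assumes that "the other files" means every regular file in the FP8 directory that is not a `.safetensors` shard and is not already present in the output directory.

```python
import json
import shutil
from pathlib import Path


def finalize_bf16_dir(fp8_dir, bf16_dir):
    """Copy non-weight files to bf16_dir and drop "quantization_config"."""
    src, dst = Path(fp8_dir), Path(bf16_dir)
    copied = []
    for item in sorted(src.iterdir()):
        # Assumption: weight shards come from the conversion script,
        # so only the remaining regular files are copied.
        if not item.is_file() or item.suffix == ".safetensors":
            continue
        if (dst / item.name).exists():
            continue  # never overwrite what the conversion already wrote
        shutil.copy2(item, dst / item.name)
        copied.append(item.name)

    # Remove the FP8 quantization entry so the weights load as plain BF16.
    config_path = dst / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config.pop("quantization_config", None)
    config_path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return copied
```

Calling `finalize_bf16_dir("/path/to/fp8_weights", "/path/to/bf16_weights")` returns the names of the files it copied, which makes it easy to confirm nothing unexpected was moved.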

### vLLM Inference
#### Multi-node server
Sample model: [DeepSeek-V3.1](https://huggingface.co/deepseek-ai/DeepSeek-V3.1)

1. Set the environment variables
> Note:
> On every node, write the environment variables into a `.sh` file, save it, and `source` that file on each compute node separately.
>
> VLLM_HOST_IP: the IP of the node's local communication interface. Prefer the IP of the IB NIC **to avoid RCCL timeouts**.
>
> NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME: the name of the interface that owns the node's local communication IP.
>
> To look up interfaces and IPs: `ifconfig`
>
> To check IB port status: `ibstat`. A port is usable only when its state is Active, and the setup must be consistent across all nodes.

<div align=center>
    <img src="./doc/ip_bw.png"/>
</div>

```bash
export ALLREDUCE_STREAM_WITH_COMPUTE=1
export VLLM_HOST_IP=x.x.x.x # IP of this compute node; use the address bound to the IB interface named in *_SOCKET_IFNAME
export NCCL_SOCKET_IFNAME=ibxxxx
export GLOO_SOCKET_IFNAME=ibxxxx
export NCCL_IB_HCA=mlx5_0:1 # name of the IB adapter in your environment
unset NCCL_ALGO
export NCCL_MIN_NCHANNELS=16
export NCCL_MAX_NCHANNELS=16
export NCCL_NET_GDR_READ=1
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export VLLM_USE_V1=1

# NUMA core binding for Hygon CPUs
export VLLM_NUMA_BIND=1
export VLLM_RANK0_NUMA=0
export VLLM_RANK1_NUMA=1
export VLLM_RANK2_NUMA=2
export VLLM_RANK3_NUMA=3
export VLLM_RANK4_NUMA=4
export VLLM_RANK5_NUMA=5
export VLLM_RANK6_NUMA=6
export VLLM_RANK7_NUMA=7
# Additional environment variables required on BW clusters:
export NCCL_NET_GDR_LEVEL=7
export NCCL_SDMA_COPY_ENABLE=0
export NCCL_TOPO_FILE="topo-input.xml"
export VLLM_RPC_TIMEOUT=1800000
```
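
Since each node needs its own copy of these settings with a different IP, the node-specific part can be generated instead of hand-edited. The sketch below is an illustration only: `render_env_script` is a hypothetical helper, it emits just the lines that vary per node plus the repetitive NUMA bindings, and the fixed `export` lines from the block above would still need to be appended.

```python
def render_env_script(host_ip, ifname, ib_hca="mlx5_0:1", ranks=8):
    """Return the node-specific part of the env script as text."""
    lines = [
        f"export VLLM_HOST_IP={host_ip}",
        f"export NCCL_SOCKET_IFNAME={ifname}",
        f"export GLOO_SOCKET_IFNAME={ifname}",
        f"export NCCL_IB_HCA={ib_hca}",
        "export HIP_VISIBLE_DEVICES=" + ",".join(str(i) for i in range(ranks)),
        "export VLLM_NUMA_BIND=1",
    ]
    # One NUMA node per local rank, mirroring VLLM_RANK0_NUMA..VLLM_RANK7_NUMA.
    lines += [f"export VLLM_RANK{i}_NUMA={i}" for i in range(ranks)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "10.0.0.1" and "ib0" are placeholder values, not taken from the source.
    print(render_env_script("10.0.0.1", "ib0"), end="")
```

Redirecting the output to a file on each node gives a script that can then be loaded with `source`.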

2. Start the Ray cluster
> `x.x.x.x` is the head node's `VLLM_HOST_IP` from step 1.

```bash
# Run on the head node
ray start --head --node-ip-address=x.x.x.x --port=6379 --num-gpus=8 --num-cpus=32
# Run on each worker node
ray start --address='x.x.x.x:6379' --num-gpus=8 --num-cpus=32
```
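
Before starting the server, it is worth confirming that all nodes have joined. A minimal sketch, assuming 4 nodes with 8 GPUs each (32 in total, as in the hardware section); the helper names are hypothetical, while `ray.init(address="auto")` and `ray.cluster_resources()` are standard Ray calls.

```python
def gpus_missing(resources, expected_gpus=32):
    """Return how many GPUs are still missing from a Ray resource dict."""
    return max(0, expected_gpus - int(resources.get("GPU", 0)))


def check_cluster(expected_gpus=32):
    """Connect to the running Ray cluster and report missing GPUs."""
    import ray  # imported lazily so gpus_missing works without Ray installed

    ray.init(address="auto")
    try:
        return gpus_missing(ray.cluster_resources(), expected_gpus)
    finally:
        ray.shutdown()
```

Running `check_cluster()` on the head node should return `0` once every worker has connected; a positive number means some workers have not registered their GPUs yet.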

3. Start the vLLM server
> On Intel CPUs, add the `--enforce-eager` flag.

```bash
vllm serve deepseek-ai/DeepSeek-V3.1 --trust-remote-code --distributed-executor-backend ray --dtype bfloat16 --tensor-parallel-size 32 --enable-expert-parallel --served-model-name ds31 --max-model-len 64000 --max-seq-len-to-capture 64000 --max-num-batched-tokens 64000 --max-num-seqs 128 --disable-log-requests --block-size 64 --no-enable-chunked-prefill --no-enable-prefix-caching --gpu-memory-utilization 0.9
```
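
Loading a model of this size across several nodes can take a while, so a script that sends requests may want to wait until the server responds. The sketch below polls the OpenAI-compatible `/v1/models` route that vLLM serves; the helper names, the retry count, and the interval are all illustrative assumptions rather than recommended values.

```python
import time
import urllib.error
import urllib.request


def server_is_up(base_url, timeout_s=5.0):
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, attempts=60, interval_s=10.0, sleep=time.sleep):
    """Call probe() until it returns True; give up after `attempts` tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(interval_s)
    return False
```

A typical call would be `wait_until_ready(lambda: server_is_up("http://x.x.x.x:8000"))`, with `x.x.x.x` replaced by the head node's address.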

Once the server is up, you can query it as follows:
```bash

curl http://x.x.x.x:8000/v1/chat/completions   \
    -H "Content-Type: application/json"  \
    -d '{
        "model": "ds31",
        "messages": [
            {
                "role": "user",
                "content": "请介绍下你自己"
            }
        ],
        "chat_template_kwargs": {
            "thinking": true
        }
    }'
```
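
The same request can be sent from Python with only the standard library. This is a sketch under a few assumptions: the helper names are hypothetical, the served model name `ds31` comes from the `--served-model-name` flag above, and the reply is read from the usual OpenAI-style `choices[0].message.content` field.

```python
import json
import urllib.request


def build_chat_request(base_url, prompt, thinking=True, model="ds31"):
    """Build the POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Switches between the thinking and non-thinking chat template.
        "chat_template_kwargs": {"thinking": thinking},
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url, prompt, thinking=True):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, prompt, thinking)
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Passing `thinking=False` to `ask` sends the same prompt in non-thinking mode, which makes it easy to compare the two modes side by side.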

## Results
<div align=center>
    <img src="./doc/results-dcu.jpg"/>
</div>

### Accuracy
Accuracy on DCU matches accuracy on GPU. Inference framework: vLLM.

## Application Scenarios
### Algorithm Category
`Conversational Q&A`

### Key Application Industries
`Manufacturing, finance, education`

## Pretrained Weights
- [DeepSeek-V3.1](https://huggingface.co/deepseek-ai/DeepSeek-V3.1)
- [DeepSeek-V3.1-Base](https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Base)

## Source Repository and Issue Feedback
- https://developer.sourcefind.cn/codes/modelzoo/deepseek-v3.1_vllm

## References
- https://huggingface.co/deepseek-ai