# Qwen3-Coder
## Paper
[Qwen3 Technical Report](https://arxiv.org/pdf/2505.09388)

## Model Architecture
Qwen3-Coder-480B-A35B-Instruct has the following characteristics:

- Parameters: 480B total, 35B activated
- Layers: 62
- Attention heads (GQA): 96 for Q, 8 for KV
- Experts: 160
- Activated experts: 8
- Context length: natively supports a 256K-token context, extendable to 1M tokens with YaRN
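
As a quick sanity check on the figures above, the following sketch derives a few ratios from them. The variable names are illustrative and are not taken from the model's configuration files.

```python
# Figures copied from the list above; variable names are illustrative only.
total_params_b = 480      # total parameters, in billions
active_params_b = 35      # parameters activated per token, in billions
num_q_heads = 96          # query heads
num_kv_heads = 8          # key/value heads shared by the query heads (GQA)
num_experts = 160
active_experts = 8

# In grouped-query attention, each KV head serves a fixed group of query heads.
q_heads_per_kv_head = num_q_heads // num_kv_heads

# Fraction of experts and of parameters that take part in one forward pass.
expert_fraction = active_experts / num_experts
param_fraction = active_params_b / total_params_b

print(f"query heads per KV head: {q_heads_per_kv_head}")
print(f"experts active per token: {expert_fraction:.1%}")
print(f"parameters active per token: {param_fraction:.1%}")
```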

<div align=center>
    <img src="./doc/transformers.jpg"/>
</div>

## Algorithm
Work on the pretraining stage continues. For Qwen3-Coder, scaling was applied along several dimensions to improve the model's coding ability:

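- Scaling data: 7.5T tokens in total (70% code), giving strong coding ability while preserving general and math abilities;
- Scaling context: natively supports a 256K context, extendable to 1M with YaRN, optimized for repo-scale and dynamic data (such as Pull Requests) to support Agentic Coding;
- Scaling synthetic data: Qwen2.5-Coder was used to clean and rewrite low-quality data, significantly improving overall data quality.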
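
The figures in this list imply a few derived quantities, sketched below. The reading of "256K" and "1M" as powers of two is an assumption made for illustration, and the resulting factor is only the ratio of the two context lengths, not a value read from the model's configuration.

```python
# Figures from the list above.
total_tokens_t = 7.5      # pretraining tokens, in trillions
code_ratio = 0.70         # share of code in the pretraining data

code_tokens_t = total_tokens_t * code_ratio
other_tokens_t = total_tokens_t - code_tokens_t

# Assumption: "256K" and "1M" mean 262,144 and 1,048,576 tokens.
native_context = 256 * 1024
extended_context = 1024 * 1024
extension_factor = extended_context / native_context

print(f"code tokens: {code_tokens_t:.2f}T, other tokens: {other_tokens_t:.2f}T")
print(f"context extension factor: {extension_factor:g}x")
```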

## Environment Setup
### Hardware Requirements
DCU model: BW1000; nodes: 4; cards: 32.
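
A rough estimate shows why the 480B model is spread over this many cards. The sketch below counts only the bfloat16 weights and assumes they are sharded evenly across all cards. It ignores the KV cache, activations, and framework overhead, so real memory use per card will be higher.

```python
# Back-of-the-envelope weight memory for the 480B model in bfloat16.
total_params = 480e9      # total parameters of Qwen3-Coder-480B-A35B-Instruct
bytes_per_param = 2       # bfloat16 uses 2 bytes per parameter
num_nodes = 4
cards_per_node = 8        # 32 cards spread over 4 nodes

num_cards = num_nodes * cards_per_node
weights_gb = total_params * bytes_per_param / 1e9

# Assumption: weights are split evenly across all cards.
weights_per_card_gb = weights_gb / num_cards

print(f"total weights: {weights_gb:.0f} GB")
print(f"weights per card across {num_cards} cards: {weights_per_card_gb:.0f} GB")
```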

Adjust the `-v` paths, `docker_name`, and `imageID` to match your environment.

### Docker (Option 1)
```bash
docker pull image.sourcefind.cn:5000/dcu/admin/base/vllm:0.8.5-ubuntu22.04-dtk25.04.1-rc5-das1.6-py3.10-20250724
docker run -it --shm-size 200g --network=host --name {docker_name} --privileged --device=/dev/kfd --device=/dev/dri --device=/dev/mkfd --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro {imageID} bash

cd /your_code_path/qwen3-coder_vllm
pip install transformers==4.51.3
```

### Dockerfile (Option 2)
```bash
cd docker
docker build --no-cache -t qwen3-coder:latest .
docker run -it --shm-size 200g --network=host --name {docker_name} --privileged --device=/dev/kfd --device=/dev/dri --device=/dev/mkfd --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro {imageID} bash

cd /your_code_path/qwen3-coder_vllm
pip install transformers==4.51.3
```

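### Anaconda (Option 3)
The special deep learning libraries that this project requires for DCU cards can be downloaded and installed from the [光合](https://developer.sourcefind.cn/tool/) developer community.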
```bash
DTK: 25.04.1
python: 3.10
vllm: 0.8.5
torch: 2.4.1+das.opt2.dtk2504
deepspeed: 0.14.2+das.opt2.dtk2504
transformers: 4.51.3
```
`Tips: the versions of the DCU-related tools above (DTK driver, python, torch, etc.) must match each other exactly.`

## Dataset


## Training
Not available.

## Inference
### Inference with vllm
#### Single-node server
Example model: [Qwen3-Coder-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct)

```bash
export HIP_VISIBLE_DEVICES=0,1,2,3
export ALLREDUCE_STREAM_WITH_COMPUTE=1

vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --trust-remote-code --dtype bfloat16 --max-seq-len-to-capture 32768 -tp 4 --gpu-memory-utilization 0.85 --override-generation-config '{"temperature": 0.7, "top_p":0.8, "top_k":20, "repetition_penalty": 1.05}' --max-model-len 32768
```
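
The JSON passed to `--override-generation-config` is easy to mis-quote by hand. As a minimal sketch, the same command line can be assembled in Python so that the JSON is serialized and shell-quoted automatically. The flags and values are copied from the command above. Building the command this way is an illustration, not part of the project.

```python
import json
import shlex

# Sampling defaults copied from the command above.
generation_config = {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "repetition_penalty": 1.05,
}

command = [
    "vllm", "serve", "Qwen/Qwen3-Coder-30B-A3B-Instruct",
    "--trust-remote-code",
    "--dtype", "bfloat16",
    "--max-seq-len-to-capture", "32768",
    "-tp", "4",
    "--gpu-memory-utilization", "0.85",
    "--override-generation-config", json.dumps(generation_config),
    "--max-model-len", "32768",
]

# shlex.join quotes the JSON argument so it reaches vllm as a single token.
print(shlex.join(command))
```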

Once the server is up, it can be queried as follows:
```bash
curl http://x.x.x.x:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": "Square the number 1024."
            }
        ]
    }'
```
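
The same request can be issued from Python using only the standard library. This is a minimal sketch: the helper names `build_chat_request` and `send_chat_request` are hypothetical, the address is the same placeholder as in the curl example, and the response parsing assumes the OpenAI-style chat completion layout (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumption: the server answers in the OpenAI chat completion format.
    return body["choices"][0]["message"]["content"]


# Placeholder address, as in the curl example; replace it with the server's IP.
request = build_chat_request(
    "http://x.x.x.x:8000",
    "Qwen/Qwen3-Coder-30B-A3B-Instruct",
    "Square the number 1024.",
)
# print(send_chat_request(request))  # uncomment once the server is reachable
```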


#### Multi-node server
Example model: [Qwen3-Coder-480B-A35B-Instruct](https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct)

1. Set the environment variables
> Note:
> On each node, write the environment variables to a `.sh` file, save it, and `source` that file on the node.
>
> VLLM_HOST_IP: the IP of the node's local communication interface. Prefer the IP of the IB NIC **to avoid rccl timeouts**.
>
> NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME: the name of the network interface that carries the node's local communication IP
>
> To look up interfaces and IPs: ifconfig
>
> To check IB port status: ibstat. The port must be in the active state to be usable, and all nodes must be configured consistently.

```bash
export ALLREDUCE_STREAM_WITH_COMPUTE=1
export VLLM_HOST_IP=x.x.x.x # IP of this compute node; use the IP bound to the IB interface set in SOCKET_IFNAME
export NCCL_SOCKET_IFNAME=ibxxxx
export GLOO_SOCKET_IFNAME=ibxxxx
export NCCL_IB_HCA=mlx5_0:1 # name of the IB NIC in your environment
unset NCCL_ALGO
export NCCL_MIN_NCHANNELS=16
export NCCL_MAX_NCHANNELS=16
export NCCL_NET_GDR_READ=1
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export VLLM_SPEC_DECODE_EAGER=1
export VLLM_MLA_DISABLE=0
export VLLM_USE_FLASH_MLA=1

# Additional environment variables recommended for K100_AI clusters:
export VLLM_ENFORCE_EAGER_BS_THRESHOLD=44
export VLLM_RPC_TIMEOUT=1800000

# Core binding for Hygon CPUs
export VLLM_NUMA_BIND=1
export VLLM_RANK0_NUMA=0
export VLLM_RANK1_NUMA=1
export VLLM_RANK2_NUMA=2
export VLLM_RANK3_NUMA=3
export VLLM_RANK4_NUMA=4
export VLLM_RANK5_NUMA=5
export VLLM_RANK6_NUMA=6
export VLLM_RANK7_NUMA=7
```
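
Because every node sources its own `.sh` file, a small check run on each node can catch missing or inconsistent settings before the cluster is started. The sketch below is an assumption-laden illustration: the `check_node_env` helper and the choice of "required" variables are hypothetical, and the rules it applies (the two `*_SOCKET_IFNAME` variables name the same interface, one `VLLM_RANK<i>_NUMA` entry per visible device) are inferred from the notes and the variable list above.

```python
import os

# Variables from the block above that must be filled in per node (illustrative choice).
REQUIRED_VARS = [
    "VLLM_HOST_IP",
    "NCCL_SOCKET_IFNAME",
    "GLOO_SOCKET_IFNAME",
    "NCCL_IB_HCA",
    "HIP_VISIBLE_DEVICES",
]


def check_node_env(env: dict) -> list:
    """Return a list of human-readable problems found in the given environment."""
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]

    # Assumption: both variables should name the same network interface.
    nccl_if = env.get("NCCL_SOCKET_IFNAME")
    gloo_if = env.get("GLOO_SOCKET_IFNAME")
    if nccl_if and gloo_if and nccl_if != gloo_if:
        problems.append("NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME differ")

    # Assumption: with NUMA binding enabled, each visible device has a rank entry.
    if env.get("VLLM_NUMA_BIND") == "1":
        devices = [d for d in env.get("HIP_VISIBLE_DEVICES", "").split(",") if d]
        for rank in range(len(devices)):
            if f"VLLM_RANK{rank}_NUMA" not in env:
                problems.append(f"VLLM_RANK{rank}_NUMA is not set")
    return problems


# Check the environment of the current shell after sourcing the .sh file.
for problem in check_node_env(dict(os.environ)):
    print("warning:", problem)
```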

2. Start the RAY cluster
> x.x.x.x corresponds to the VLLM_HOST_IP from step 1

```bash
# run on the head node
ray start --head --node-ip-address=x.x.x.x --port=6379 --num-gpus=8 --num-cpus=32
# run on the worker nodes
ray start --address='x.x.x.x:6379' --num-gpus=8 --num-cpus=32
```
3. Start the vllm server
> On Intel CPUs, add the flag `--enforce-eager`

```bash
vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct --trust-remote-code --distributed-executor-backend ray --dtype bfloat16 --max-seq-len-to-capture 32768 -tp 32 --gpu-memory-utilization 0.85 --max-num-seqs 128  --block-size 64 --override-generation-config '{"temperature": 0.7, "top_p":0.8, "top_k":20, "repetition_penalty": 1.05}' --max-model-len 32768 --host x.x.x.x
```
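
The `-tp 32` value has to line up with both the cluster size and the attention head counts. The following sketch checks that, assuming no pipeline parallelism, 4 nodes with 8 cards each, and the head counts listed in the model description. The `check_tensor_parallel` helper is hypothetical, and the rule that KV heads are replicated when there are fewer of them than ranks is an assumption about how the framework shards GQA layers.

```python
def check_tensor_parallel(tp_size, num_nodes, gpus_per_node, num_q_heads, num_kv_heads):
    """Return a list of problems with the given tensor-parallel layout."""
    problems = []
    if tp_size != num_nodes * gpus_per_node:
        problems.append("tp size does not equal nodes x cards per node")
    if num_q_heads % tp_size != 0:
        problems.append("query heads cannot be split evenly across ranks")
    if num_kv_heads >= tp_size:
        if num_kv_heads % tp_size != 0:
            problems.append("KV heads cannot be split evenly across ranks")
    elif tp_size % num_kv_heads != 0:
        # Fewer KV heads than ranks: each KV head is assumed to be replicated.
        problems.append("ranks cannot be grouped evenly per KV head")
    return problems


# Values from this README: -tp 32, 4 nodes x 8 cards, 96 Q heads, 8 KV heads.
problems = check_tensor_parallel(32, 4, 8, 96, 8)
print(problems or "layout looks consistent")
print("query heads per rank:", 96 // 32)
print("ranks sharing each KV head:", 32 // 8)
```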

Once the server is up, it can be queried as follows:
```bash
curl http://x.x.x.x:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-Coder-480B-A35B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": "Square the number 1024."
            }
        ]
    }'
```

## Result
<div align=center>
    <img src="./doc/results-dcu.png"/>
</div>

### Accuracy
Accuracy on DCU is consistent with accuracy on GPU. Inference framework: vllm.

## Application Scenarios
### Algorithm Category
Code generation

### Key Industries
Manufacturing, broadcast media, home furnishing, education

## Pretrained Weights
- [Qwen3-Coder-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct)
- [Qwen3-Coder-480B-A35B-Instruct](https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct)

## Source Repository and Issue Feedback
- https://developer.sourcefind.cn/codes/modelzoo/qwen3-coder_vllm

## References
- https://github.com/QwenLM/Qwen3-Coder
- https://qwenlm.github.io/blog/qwen3-coder/