# Qwen3.5
## Paper
[Qwen3.5](https://qwen.ai/blog?id=qwen3.5)

## Model Overview
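Qwen3.5 achieves efficient native multimodal training through heterogeneous infrastructure: parallelism strategies are decoupled between the vision and language components, avoiding the inefficiency of a single unified scheme. Sparse activation is used to overlap computation across modules, so training on mixed text-image-video data reaches nearly 100% of the throughput of a text-only baseline. On top of this, a native FP8 pipeline runs activations, MoE routing, and GEMM operations in low precision, while runtime monitoring keeps sensitive layers in BF16. This yields roughly 50% lower activation memory and a speedup of more than 10%, and scales stably to trillions of tokens.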

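To keep unlocking the potential of reinforcement learning, a scalable asynchronous RL framework was built that supports every Qwen3.5 model size and fully covers text, multimodal, and multi-turn interaction scenarios. Its decoupled design, which separates training from inference, significantly improves hardware utilization and enables dynamic load balancing and fine-grained fault recovery. Combined with FP8 training and inference, rollout routing replay, speculative decoding, and multi-turn rollout locking, it further raises system throughput and improves training-inference consistency. Through system-algorithm co-design, the framework mitigates the long-tail data problem while strictly bounding sample staleness, improving both the stability of the training curves and the performance ceiling. The framework is also designed for native agentic workflows, enabling stable, seamless multi-turn environment interaction without scheduling interruptions at the framework level. This decoupled design lets the system scale to millions of agent scaffolds and environments, significantly strengthening the model's generalization. Together, these optimizations deliver a 3×–5× end-to-end speedup with strong stability, efficiency, and scalability.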

<div align=center>
    <img src="./doc/qwen3.5_397b_a17b_infra.jpg"/>
</div>

## Environment Requirements
| Software |                    Version                     |
| :------: |:-----------------------------------------:|
| DTK |                   26.04                   |
| python |               3.10.12                  |
| transformers |        5.2.0.dev0                 |
| vllm |       0.15.1+das.opt1.alpha.dtk2604       |
| triton | 3.3.0+das.opt2.dtk2604.20260203.g393ad86c |
| torch | 2.9.0+das.opt1.dtk2604.20260126.g22910426 |

Currently only the following custom image is supported: `harbor.sourcefind.cn:5443/dcu/admin/base/custom:vllm015-ubuntu22.04-dtk26.04-glm5-0408`

- Adjust the `-v` mount paths to match where your models are actually stored.
```bash
docker run -it \
    --shm-size 200g \
    --network=host \
    --name qwen3.5 \
    --privileged \
    --device=/dev/kfd \
    --device=/dev/dri \
    --device=/dev/mkfd \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    -u root \
    -v /opt/hyhal/:/opt/hyhal/:ro \
    -v /path/your_code_data/:/path/your_code_data/ \
    harbor.sourcefind.cn:5443/dcu/admin/base/custom:vllm015-ubuntu22.04-dtk26.04-glm5-0408 bash
```
More images are available for download from [光源 (SourceFind)](https://sourcefind.cn/#/service-list).

## Dataset
`Not available yet`

## Training
`Not available yet`

## Inference
### vllm
**Note**:
- When launching the server on `K100 AI`, add the `--disable-custom-all-reduce` flag.
- When serving a W8A8 model, add both the `-cc.mode=3` and `-cc.inductor_compile_config='{"combo_kernels": false, "benchmark_combo_kernel": false}'` flags.
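The two notes above amount to a small decision table. As an illustration only, the hypothetical shell function `vllm_extra_args` below prints the extra flags for a given DCU model and weight format, one flag per line, so the JSON value survives word splitting when loaded into an array. The `K100AI` and `w8a8` labels are this sketch's own conventions, not vLLM options; the flags themselves are exactly the ones listed in the notes.

```bash
#!/usr/bin/env bash
# Hypothetical helper encoding the two notes above.
# Usage: vllm_extra_args <dcu_model> <weights>
#   dcu_model: "K100AI" adds --disable-custom-all-reduce
#   weights:   "w8a8" adds the two -cc.* compilation flags
# Prints one flag per line so callers can load them into an array safely.
vllm_extra_args() {
  local dcu_model=$1 weights=$2
  if [ "$dcu_model" = "K100AI" ]; then
    printf '%s\n' '--disable-custom-all-reduce'
  fi
  if [ "$weights" = "w8a8" ]; then
    printf '%s\n' '-cc.mode=3'
    printf '%s\n' '-cc.inductor_compile_config={"combo_kernels": false, "benchmark_combo_kernel": false}'
  fi
}

# Example (model path is a placeholder):
#   mapfile -t EXTRA < <(vllm_extra_args BW1000 w8a8)
#   vllm serve /path/to/Qwen3.5-397B-A17B-W8A8 --tensor-parallel-size 8 "${EXTRA[@]}"
```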

#### Single-node inference
```bash
export ALLREDUCE_STREAM_WITH_COMPUTE=1
export VLLM_USE_PIECEWISE=1
export VLLM_USE_FLASH_MLA=1
export USE_FUSED_RMS_QUANT=0
export USE_FUSED_SILU_MUL_QUANT=1

export VLLM_USE_GLOBAL_CACHE13=1
export VLLM_FUSED_MOE_CHUNK_SIZE=16384
export VLLM_CUSTOM_CACHE=1
export VLLM_USE_OPT_CAT=1
export VLLM_USE_FUSED_FILL_RMS_CAT=1
export VLLM_USE_LIGHTOP_MOE_SUM_MUL_ADD=0
export VLLM_USE_LIGHTOP_RMS_ROPE_CONCAT=0

## Start the server
vllm serve Qwen/Qwen3.5-35B-A3B \
    --port 8001 \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.9 \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder

## Client request (run in a separate terminal; vllm serve stays in the foreground)
curl http://localhost:8001/v1/chat/completions   \
    -H "Content-Type: application/json"  \
    -d '{
        "model": "Qwen/Qwen3.5-35B-A3B",
        "messages": [
          {"role": "user", "content": "Type \"I love Qwen3.5\" backwards"}
        ],
        "temperature": 0.6
    }'
```
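The OpenAI-compatible response is a JSON object that carries the model's reply in `choices[0].message.content`. For this prompt, a correct reply contains the reversed string `5.3newQ evol I`. A minimal sketch for printing only that field, assuming `python3` is available on the client (the image ships Python 3.10); `extract_reply` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read an OpenAI-style chat completion JSON on stdin
# and print choices[0].message.content.
extract_reply() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Usage: add -s to the curl command above and pipe it into the helper:
#   curl -s http://localhost:8001/v1/chat/completions ... | extract_reply
```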

#### Multi-node inference
1. Set the environment variables
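> Note:
> Write each node's environment variables into a `.sh` file, save it, and `source` that `.sh` file on every compute node.
>
> VLLM_HOST_IP: IP of the node's local communication interface. Prefer the IP of an IB NIC **to avoid RCCL timeouts**.
>
> NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME: name of the network interface that owns that IP.
>
> To look up interface names and IPs: `ifconfig`
>
> To check the IB port state: `ibstat`. The port **must** be in the Active state to be usable, and the setup must be consistent across all nodes.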
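The `ibstat` check can be scripted per node. The sketch below is a hypothetical helper, `ib_ports_active`, that reads `ibstat` output on stdin and succeeds only if every port reports `State: Active`. It assumes the usual `ibstat` layout with one indented `State:` line per port (separate from the `Physical state:` line); `mlx5_0` in the usage line is a placeholder HCA name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if every "State:" line in ibstat output is Active.
# "Physical state:" lines are ignored because they do not start with "State:".
ib_ports_active() {
  local states
  states=$(grep -E '^[[:space:]]*State:' | awk '{print $2}')
  # Fail if no port was found, or if any port is in a state other than Active.
  [ -n "$states" ] && ! echo "$states" | grep -qv '^Active$'
}

# Usage on a real node:
#   ibstat mlx5_0 | ib_ports_active && echo "IB port active" || echo "IB port NOT active"
```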

```bash
export ALLREDUCE_STREAM_WITH_COMPUTE=1
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export VLLM_HOST_IP=x.x.x.x # IP of this compute node; use the IP of the IB interface named in *_SOCKET_IFNAME
export NCCL_SOCKET_IFNAME=ibxxxx
export GLOO_SOCKET_IFNAME=ibxxxx
export NCCL_IB_HCA=mlx5_0:1 # name of the IB NIC (HCA) in your environment
unset NCCL_ALGO
export NCCL_MIN_NCHANNELS=16
export NCCL_MAX_NCHANNELS=16
export NCCL_NET_GDR_READ=1
export VLLM_SPEC_DECODE_EAGER=1
export VLLM_MLA_DISABLE=0
export VLLM_USE_FLASH_MLA=1
export VLLM_RPC_TIMEOUT=1800000
export VLLM_USE_PIECEWISE=1
export USE_FUSED_RMS_QUANT=0
export USE_FUSED_SILU_MUL_QUANT=1

export VLLM_USE_GLOBAL_CACHE13=1
export VLLM_FUSED_MOE_CHUNK_SIZE=16384
export VLLM_CUSTOM_CACHE=1
export VLLM_USE_OPT_CAT=1
export VLLM_USE_FUSED_FILL_RMS_CAT=1
export VLLM_USE_LIGHTOP_MOE_SUM_MUL_ADD=0
export VLLM_USE_LIGHTOP_RMS_ROPE_CONCAT=0 # cannot be enabled together with kvfp8 (FP8 KV cache)

# Hygon CPU core binding (bind each rank to a NUMA node)
export VLLM_NUMA_BIND=1
export VLLM_RANK0_NUMA=0
export VLLM_RANK1_NUMA=1
export VLLM_RANK2_NUMA=2
export VLLM_RANK3_NUMA=3
export VLLM_RANK4_NUMA=4
export VLLM_RANK5_NUMA=5
export VLLM_RANK6_NUMA=6
export VLLM_RANK7_NUMA=7
```
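Since rank *i* is bound to NUMA node *i* above, the eight `VLLM_RANK*_NUMA` exports can equivalently be generated in a loop. This is only a sketch assuming that one-to-one rank-to-NUMA mapping matches your machine's topology; adjust the mapping if your NUMA layout differs.

```bash
#!/usr/bin/env bash
# Equivalent to the explicit VLLM_RANK0_NUMA..VLLM_RANK7_NUMA exports above:
# rank i is bound to NUMA node i (assumes 8 ranks and a one-to-one mapping).
export VLLM_NUMA_BIND=1
for i in 0 1 2 3 4 5 6 7; do
  export "VLLM_RANK${i}_NUMA=${i}"
done
```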

2. Start the Ray cluster
> x.x.x.x is the head node's VLLM_HOST_IP from step 1 (on worker nodes, too, it is the head node's address)

```bash
# Run on the head node
ray start --head --node-ip-address=x.x.x.x --port=6379 --num-gpus=8 --num-cpus=32
# Run on each worker node
ray start --address='x.x.x.x:6379' --num-gpus=8 --num-cpus=32
```

3. Start vllm serve
```bash
vllm serve Qwen/Qwen3.5-397B-A17B \
    --port 8001 \
    --tensor-parallel-size 16 \
    --distributed-executor-backend ray \
    --gpu-memory-utilization 0.9 \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder
```
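Loading the weights across 16 cards can take a while, so it helps to wait until the server is ready before sending requests. A minimal sketch, assuming vLLM's OpenAI-compatible server answers `GET /health` with HTTP 200 once it is up; `wait_for_vllm` is a hypothetical helper name and the 600-second default is an arbitrary choice.

```bash
#!/usr/bin/env bash
# Hypothetical helper: poll <base_url>/health once per second until it returns
# HTTP 200, or give up after <timeout_seconds> (default 600).
wait_for_vllm() {
  local base_url=$1 timeout=${2:-600} waited=0
  # curl -f makes HTTP errors (>= 400) count as failures; -s suppresses output.
  until curl -sf -o /dev/null "${base_url}/health"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "vLLM at ${base_url} not ready after ${timeout}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}

# Usage: wait_for_vllm http://localhost:8001 && echo "server ready"
```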

4. Client request
```bash
curl http://localhost:8001/v1/chat/completions   \
    -H "Content-Type: application/json"  \
    -d '{
        "model": "Qwen/Qwen3.5-397B-A17B",
        "messages": [
          {"role": "user", "content": "Type \"I love Qwen3.5\" backwards"}
        ],
        "temperature": 0.6
    }'
```

## Results
<div align=center>
    <img src="./doc/result-dcu.jpg"/>
</div>

### Accuracy
Accuracy on DCU matches that on GPU (inference framework: vllm).

## Pretrained Weights
|  Model  | Parameters | DCU model  | Min. number of cards |         Download          |
|:------:|:----:|:----------:|:------:|:---------------------:|
| Qwen3.5-397B-A17B | 397B | K100AI,BW1000 |   16   | [Hugging Face](https://huggingface.co/Qwen/Qwen3.5-397B-A17B) |
| Qwen3.5-397B-A17B-INT8 | 397B | BW1000 |   8   | [ModelScope](https://www.modelscope.cn/models/metax-tech/Qwen3.5-397B-A17B-W8A8) |
| Qwen3.5-122B-A10B | 122B | K100AI,BW1000 |   8   | [Hugging Face](https://huggingface.co/Qwen/Qwen3.5-122B-A10B) |
| Qwen3.5-35B-A3B | 35B | K100AI,BW1000 |   2   | [Hugging Face](https://huggingface.co/Qwen/Qwen3.5-35B-A3B) |
| Qwen3.5-27B | 27B | K100AI,BW1000 |   2   | [Hugging Face](https://huggingface.co/Qwen/Qwen3.5-27B) |
| Qwen3.5-9B | 9B | K100AI,BW1000 |   1   | [Hugging Face](https://huggingface.co/Qwen/Qwen3.5-9B) |
| Qwen3.5-4B | 4B | K100AI,BW1000 |   1   | [Hugging Face](https://huggingface.co/Qwen/Qwen3.5-4B) |
| Qwen3.5-2B | 2B | K100AI,BW1000 |   1   | [Hugging Face](https://huggingface.co/Qwen/Qwen3.5-2B) |
| Qwen3.5-0.8B | 0.8B | K100AI,BW1000 |   1   | [Hugging Face](https://huggingface.co/Qwen/Qwen3.5-0.8B) |

## Source Repository and Issue Feedback
- https://developer.sourcefind.cn/codes/modelzoo/qwen3.5_vllm

## References
- https://github.com/QwenLM/Qwen3.5