jerrrrry / verl_deepseekv3_lora · Commit 1624b887, authored May 20, 2025 by jerrrrry: Delete verl-main使用指南.md (parent 670161b8)
# verl-main Usage Guide
## GitHub link
```
https://github.com/volcengine/verl
```
### Mixtral-8x7B LoRA fine-tuning
#### Pretrained weights
```shell
# Open-source pretrained weights
https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
# Path on the bw cluster
/public/opendas/DL_DATA/llm-models/Mixtral-8x7B-Instruct-v0.1
```
#### Dataset
```shell
# Open-source dataset
https://huggingface.co/datasets/openai/gsm8k/tree/main
```
#### Environment variables
```shell
# Adjust to your actual verl-main directory
export PYTHONPATH=/public/home/fugx1/ds/verl-main:$PYTHONPATH
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
```
#### Launch command
```shell
# Adjust to your actual verl-main directory
cd /public/home/fugx1/ds/verl-main/examples/sft/gsm8k
# Run
bash run_mixtral_8x7B.sh 8 /public/home/fugx1/ds/verl-main/examples/sft/gsm8k
```
#### Script contents
```shell
# Adjust the following to your actual paths:
# data.train_files
# data.val_files
# model.partial_pretrain
set -x

if [ "$#" -lt 2 ]; then
    echo "Usage: run_mixtral_8x7B.sh <nproc_per_node> <save_path> [other_configs...]"
    exit 1
fi

nproc_per_node=$1
save_path=$2

source /opt/dtk/env.sh

# Shift the arguments so $@ refers to the rest
shift 2

torchrun --standalone --nnodes=1 --nproc_per_node=$nproc_per_node \
    -m verl.trainer.fsdp_sft_trainer \
    data.train_files=/public/home/fugx1/ds/gsm8k/gsm8k-train.parquet \
    data.val_files=/public/home/fugx1/ds/gsm8k/gsm8k-test.parquet \
    data.prompt_key='question' \
    data.response_key='answer' \
    +data.prompt_dict_keys=['question'] \
    +data.response_dict_keys=['answer'] \
    data.micro_batch_size_per_gpu=4 \
    model.partial_pretrain=/public/opendas/DL_DATA/llm-models/Mixtral-8x7B-Instruct-v0.1 \
    trainer.default_local_dir=$save_path \
    trainer.project_name=gsm8k-sft \
    trainer.experiment_name=gsm8k-sft-mixtral-8x7B-Instruct-v0.1 \
    trainer.total_epochs=1 \
    trainer.logger=['console'] \
    trainer.default_hdfs_dir=null "$@"
```
#### Dependencies
```shell
# Some dependencies are missing from requirements.txt;
# install them as import errors surface at runtime.
vllm
pip install tensordict-0.6.2-cp310-cp310-manylinux1_x86_64.whl --no-deps
pip install orjson --no-deps
pip install cloudpickle --no-deps
pip install ray --no-deps
pip install msgpack --no-deps
pip install google --no-deps
pip install protobuf --no-deps
pip install jsonschema --no-deps
pip install referencing --no-deps
pip install rpds-py --no-deps
pip install hydra-core --no-deps
pip install codetiming --no-deps
pip install jsonschema_specifications --no-deps
pip install pytest --no-deps
pip install pluggy --no-deps
pip install exceptiongroup --no-deps
pip install iniconfig --no-deps
pip install omegaconf --no-deps
pip install antlr4 --no-deps
pip install antlr4-python3-runtime==4.9.3 --no-deps
pip install click --no-deps
pip install wandb --no-deps
pip install pydantic --no-deps
pip install annotated_types --no-deps
pip install msgspec --no-deps
pip install zmq --no-deps
pip install pyzmq --no-deps
pip install blake3 --no-deps
pip install cpuinfo --no-deps
pip install py-cpuinfo --no-deps
pip install openai --no-deps
pip install httpx --no-deps
pip install sniffio --no-deps
pip install anyio --no-deps
pip install distro --no-deps
pip install jiter --no-deps
pip install gguf --no-deps
pip install numa --no-deps
```
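Since these packages only reveal themselves as missing at runtime, a small preflight check can probe importability up front. This is a hypothetical helper, not part of verl; note that pip names and import names differ for some packages (e.g. `rpds-py` imports as `rpds`, `py-cpuinfo` as `cpuinfo`, `pyzmq` as `zmq`):

```python
import importlib.util

def missing_modules(names):
    """Return the subset of import names that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Probe a toy list; in practice, pass the import names of the
# dependencies listed above.
print(missing_modules(["json", "definitely_not_a_real_module"]))
# ['definitely_not_a_real_module']
```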
### DeepSeek V3 LoRA layer-reduced fine-tuning
#### Pretrained weights
```shell
# Hugging Face link
https://huggingface.co/deepseek-ai/DeepSeek-V3
# Directory on the bw thousand-card cluster
/public/opendas/DL_DATA/llm-models/DeepSeek-V3-bf16
```
#### Dataset
```shell
# Open-source dataset
https://huggingface.co/datasets/openai/gsm8k/tree/main
```
#### Environment variables
```shell
# Adjust to your actual verl-main directory
export PYTHONPATH=/public/home/fugx1/ds/verl-main:$PYTHONPATH
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
```
#### File modifications
1. modeling_deepseek.py
   - Before: https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/modeling_deepseek.py
   - After: /public/opendas/DL_DATA/llm-models/DeepSeek-V3-bf16/modeling_deepseek.py
   - Note: the `modeling_deepseek.py` that ships with DeepSeek-V3 is incomplete. A team has reimplemented a modeling file containing the DeepSeek-V3/R1 training logic; the file here completes and modifies `modeling_deepseek.py` accordingly.
   - Reference: https://github.com/ScienceOne-AI/DeepSeek-671B-SFT-Guide/blob/main/model/DeepSeek-V3-BF16/modeling_deepseek.py
2. sft_trainer.yaml
   - Location: verl-main/verl/trainer/config/sft_trainer.yaml
   - Changes:
     - `train_batch_size: 256 -> 128`
     - `min_num_params: 0 -> 1`
     - `trust_remote_code: False -> True`
3. config.json
   - Location: /public/opendas/DL_DATA/llm-models/DeepSeek-V3-bf16/config.json
   - Change: `num_hidden_layers: 61 -> 4` (up to 8)
   - Note: a single node fits at most 8 layers, but performance is then very poor, so 4 layers are used for now.
4. fsdp_sft_trainer.py
   - Location: https://github.com/volcengine/verl/blob/main/verl/trainer/fsdp_sft_trainer.py
   - Change: in the `_compute_loss_and_backward` function, dedent the `loss.backward()` call by 4 spaces (one indentation level out).
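The `config.json` change in step 3 can be scripted rather than hand-edited. A minimal sketch against a toy config file (the real file lives at the path above; `num_hidden_layers` is a standard Hugging Face config key, and the other key here is only illustrative):

```python
import json
from pathlib import Path

# Toy stand-in for the real DeepSeek-V3-bf16/config.json
cfg_path = Path("config.json")
cfg_path.write_text(json.dumps({"num_hidden_layers": 61, "hidden_size": 7168}))

# Step 3: shrink to 4 layers (a single node fits at most 8, slowly)
cfg = json.loads(cfg_path.read_text())
cfg["num_hidden_layers"] = 4
cfg_path.write_text(json.dumps(cfg, indent=2))

print(json.loads(cfg_path.read_text())["num_hidden_layers"])  # 4
```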
#### Launch command
```shell
# Adjust to your actual verl-main directory
cd /public/home/fugx1/ds/verl-main/examples/sft/gsm8k
# Run
bash run_deepseek_671b.sh 8 /public/home/fugx1/ds/verl-main/examples/sft/gsm8k
```
#### Script contents
```shell
# Adjust the following to your actual paths:
# data.train_files
# data.val_files
# model.partial_pretrain
# Adjust the environment variables to your setup
set -x

if [ "$#" -lt 2 ]; then
    echo "Usage: run_deepseek_671b.sh <nproc_per_node> <save_path> [other_configs...]"
    exit 1
fi

nproc_per_node=$1
save_path=$2

source /opt/dtk/env.sh

export NCCL_P2P_LEVEL=PXB  # SYS
export HIP_DIRECT_DISPATCH=0
export HSA_FORCE_FINE_GRAIN_PCIE=1
export OMP_NUM_THREADS=1
export GPU_MAX_HW_QUEUES=10
#export NVTE_FLASH_ATTN_TRITON=1
export NCCL_ALGO=Ring
export NCCL_SOCKET_IFNAME=enp33s0f3u1
export NCCL_NCHANNELS_PER_PEER=16
export NCCL_MIN_NCHANNELS=32  # 20
export NCCL_MAX_NCHANNELS=32  # 20
export NCCL_IB_TIMEOUT=22
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NCCL_IB_HCA=mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1,mlx5_9:1
export NCCL_NET_GDR_LEVEL=7
export NCCL_NET_GDR_READ=1
export RCCL_SDMA_COPY_ENABLE=0
export NCCL_TOPO_FILE="/public/home/fugx1/datasets/rccl-test/topo-input.xml"
# export NCCL_TOPO_FILE="/workspace/rccl-test/rccl-tests-0204/topo-input.xml"
export GLOG_minloglevel=3  # print only error-level NCCL logs

export PATH=/opt/hpc/software/mpi/hpcx/2.12.0/gcc-8.3.1/bin/:$PATH
export LD_LIBRARY_PATH=/opt/hpc/software/mpi/hpcx/2.12.0/gcc-8.3.1/lib/:$LD_LIBRARY_PATH
# Load the hipblaslt library
export LD_LIBRARY_PATH=/public/home/fugx1/tests1/test03/whl/hipblaslt-install-dtk-25.04-0212/lib:$LD_LIBRARY_PATH
# Use the updated rocblas
export LD_LIBRARY_PATH=/public/home/fugx1/tests1/test03/whl/rocblas-install-0224/lib:$LD_LIBRARY_PATH

RANK=$OMPI_COMM_WORLD_RANK
LOCAL_RANK=$OMPI_COMM_WORLD_LOCAL_RANK
WORLD_SIZE=$OMPI_COMM_WORLD_SIZE

# Shift the arguments so $@ refers to the rest
shift 2

torchrun --standalone --nnodes=1 --nproc_per_node=$nproc_per_node \
    -m verl.trainer.fsdp_sft_trainer \
    data.train_files=/public/home/fugx1/ds/gsm8k/gsm8k-train.parquet \
    data.val_files=/public/home/fugx1/ds/gsm8k/gsm8k-test.parquet \
    data.prompt_key='question' \
    data.response_key='answer' \
    +data.prompt_dict_keys=['question'] \
    +data.response_dict_keys=['answer'] \
    data.micro_batch_size_per_gpu=1 \
    model.partial_pretrain=/public/opendas/DL_DATA/llm-models/DeepSeek-V3-bf16 \
    trainer.default_local_dir=$save_path \
    trainer.project_name=gsm8k-sft \
    trainer.experiment_name=gsm8k-sft-deepseek-v3-671b \
    trainer.total_epochs=1 \
    trainer.logger=['console'] \
    trainer.default_hdfs_dir=null "$@"
```
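Although the script copies `OMPI_COMM_WORLD_*` variables, with `torchrun --standalone` it is torchrun itself that exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` into each worker process. A minimal sketch of reading them from inside a worker, with single-process defaults (illustrative only, not verl's actual code):

```python
import os

# torchrun exports these per worker; default to a single-process run
rank = int(os.environ.get("RANK", "0"))
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

print(f"rank {rank}/{world_size}, local_rank {local_rank}")
```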