ModelZoo / GPT2_oneflow

Commit 065ccf6e, authored Oct 10, 2023 by “yuguo”

update

parent 91454f7c
Showing 1 changed file with 3 additions and 3 deletions.

README.md (+3 −3) @ 065ccf6e
@@ -61,7 +61,7 @@ GPT-2 uses masked self-attention, whereas ordinary self-attention…
 pip3 install pybind11 -i https://mirrors.aliyun.com/pypi/simple
 pip3 install -e . -i https://mirrors.aliyun.com/pypi/simple
-## GPT2 pre-training
+## Training
 The pre-training script runs on 1 node with 4 DCU-Z100-16G cards.
@@ -79,7 +79,7 @@ train.dist.pipeline_parallel_size = 1
 bash tools/train.sh tools/train_net.py configs/gpt2_pretrain.py 4
-### Accuracy
+## Accuracy
 Training data: [link](https://oneflow-static.oss-cn-beijing.aliyuncs.com/ci-files/dataset/libai/gpt_dataset)
@@ -91,7 +91,7 @@ train.dist.pipeline_parallel_size = 1
 | :--: | :--------: | :---------------------------: |
 | 4 | Libai-main | total_loss: 4.336/10000 iters |
-### Hybrid parallelism configuration guide
+## Hybrid parallelism configuration guide
 First, model-parallel partitioning can be applied across the multiple cards within a single node. Model parallelism has a high communication overhead (both the forward and backward passes may require all-reduce communication), whereas bandwidth between devices inside a node is high; in addition, the larger the model-parallel group, the fewer pipeline stages are needed, which in turn reduces pipeline bubbles. For these reasons, all the devices in one node are usually used as a single model-parallel group.
…
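The guidance above interacts with the `train.dist.pipeline_parallel_size = 1` setting visible in the hunk context. Below is a minimal, self-contained sketch of how the 1-node / 4-card layout could be laid out, assuming companion fields named `data_parallel_size` and `tensor_parallel_size`; only `pipeline_parallel_size` is confirmed by this diff, and the real values live in `configs/gpt2_pretrain.py`.

```python
# Illustrative stand-in for the train.dist section of a LiBai-style config,
# not the project's actual config machinery. Field names other than
# pipeline_parallel_size are assumptions.
from types import SimpleNamespace

dist = SimpleNamespace(
    data_parallel_size=1,      # assumed: no data-parallel replication
    tensor_parallel_size=4,    # assumed: whole node = one model-parallel group
    pipeline_parallel_size=1,  # value shown in the hunk context above
)

# The product of the three sizes must equal the device count:
# 1 node x 4 DCU-Z100-16G cards = 4.
num_devices = (
    dist.data_parallel_size
    * dist.tensor_parallel_size
    * dist.pipeline_parallel_size
)
assert num_devices == 4
```

Keeping the model-parallel group confined to one node follows the reasoning in the guide: growing it across nodes would route all-reduce traffic over the slower inter-node links.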