ModelZoo / Megatron-DeepSpeed-ViT_pytorch
Commit c748c5f4
Authored Sep 19, 2024 by chenzk
v1.2.6
Parent: 6ff004ca
Pipeline #1703: canceled
Showing 14 changed files with 193 additions and 2 deletions
README.md (+3 -2)
data/.gitignore (+147 -0)
data/test/images/ILSVRC2012_val_00000293.JPEG (+0 -0)
data/test/images/ILSVRC2012_val_00012503.JPEG (+0 -0)
data/test/images/ILSVRC2012_val_0001250388.JPEG (+0 -0)
data/train/n01440764/n01440764_10026.JPEG (+0 -0)
data/train/n01440764/n01440764_12433.JPEG (+0 -0)
data/train/n01806143/n01806143_10002.JPEG (+0 -0)
data/train/n01806143/n01806143_12826.JPEG (+0 -0)
data/val/n01440764/ILSVRC2012_val_00000293.JPEG (+0 -0)
data/val/n01440764/ILSVRC2012_val_00012503.JPEG (+0 -0)
data/val/n01824575/ILSVRC2012_val_00000370.JPEG (+0 -0)
data/val/n01824575/ILSVRC2012_val_00009781.JPEG (+0 -0)
examples/dspvit_1node_minidata.sh (+43 -0)
README.md
@@ -65,7 +65,7 @@ https://www.jianshu.com/p/a42b7d863825
 The datasets used in this project can be downloaded from the fast download channel: [imagenet-2012](http://113.200.138.88:18080/aidatasets/project-dependency/imagenet-2012)
-The project already provides a mini dataset for trial training. The training data directory structure is shown below; prepare the full dataset for regular training with the same structure:
+The project provides a mini dataset for trial training, [tiny-imagenet-200](http://113.200.138.88:18080/aidatasets/project-dependency/tiny-imagenet-200.git). After downloading and extracting it, rename the directory tiny-imagenet-200 to data. The training data directory structure is shown below; prepare the full dataset for regular training with the same structure:
```
data
|
...
```
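The rename step described in the README change could be sketched as a small shell helper. This is only an illustration: the function name `rename_mini_dataset` is hypothetical, and it assumes the archive has already been downloaded and extracted into a directory named `tiny-imagenet-200`; the download itself is not shown because the commit does not say how the archive is packaged.

```shell
# Hypothetical helper (not part of the commit): rename an already extracted
# tiny-imagenet-200 directory to "data", as the README asks.
# Usage: rename_mini_dataset <parent-dir>   (defaults to the current directory)
rename_mini_dataset() {
    local parent="${1:-.}"
    local src="$parent/tiny-imagenet-200"
    local dst="$parent/data"

    # The extracted dataset must exist before it can be renamed.
    if [ ! -d "$src" ]; then
        echo "error: $src not found; download and extract the dataset first" >&2
        return 1
    fi
    # Refuse to overwrite an existing data directory.
    if [ -e "$dst" ]; then
        echo "error: $dst already exists; move it aside before renaming" >&2
        return 1
    fi
    mv "$src" "$dst"
    echo "renamed $src -> $dst"
}
```

Calling `rename_mini_dataset .` from the repository root would leave the dataset at `./data`. Refusing to overwrite an existing `data` directory is a design choice of this sketch, not something the README requires.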
@@ -91,7 +91,8 @@ data
### Single node, multiple GPUs
```
 cd megatron-deepspeed-vit
-sh examples/dspvit_1node.sh
+# sh examples/dspvit_1node_minidata.sh  # quick trial run on the mini dataset
+sh examples/dspvit_1node.sh  # train on the full ImageNet-2012
 # If training logs "Message: 'is_pipe_partitioned= False'", training is not affected. This is a bug in DeepSpeed itself; to silence it, patch the DeepSpeed source as described in the related issue on the DeepSpeed GitHub repository.
```
### Single node, single GPU
...
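Before launching the trial run, it may help to confirm that the files the new script expects are in place. The sketch below is an assumption-laden illustration: the function name `preflight_minidata` is hypothetical, and the paths it checks (`data`, `examples/ds_config.json`, `examples/dspvit_1node_minidata.sh`) are the ones referenced by the script added in this commit.

```shell
# Hypothetical preflight check (not part of the commit).
# Usage: preflight_minidata <repo-root>   (defaults to the current directory)
preflight_minidata() {
    local root="${1:-.}"
    local status=0

    # DATA_PATH in the script is ./data
    if [ ! -d "$root/data" ]; then
        echo "missing directory: $root/data" >&2
        status=1
    fi
    # DS_CONFIG in the script is ./examples/ds_config.json
    if [ ! -f "$root/examples/ds_config.json" ]; then
        echo "missing file: $root/examples/ds_config.json" >&2
        status=1
    fi
    # The launch script itself
    if [ ! -f "$root/examples/dspvit_1node_minidata.sh" ]; then
        echo "missing file: $root/examples/dspvit_1node_minidata.sh" >&2
        status=1
    fi
    return "$status"
}
```

A possible use from the repository root is `preflight_minidata . && sh examples/dspvit_1node_minidata.sh`, so the launch only starts when all three paths exist.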
data/.gitignore
new file, mode 0 → 100644
# tests
# megatron autogenerated indices
tests/data/*/*npy
tests/tools/openwebtext-1000.jsonl
tmp/
# macOS
.DS_Store
# Byte-compiled / optimized / DLL files
*/__pycache__/
*.py[cod]
*.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask:
instance/
.webassets-cache
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
Pipfile
Pipfile.lock
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Intellij project settings
.idea/
.iml
# VSCode
.vscode/
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# static files generated from Django application
media
staticfiles
/tags
# tmp files
*.swp
Deleted binary files (mode 100644 → 0, last present at parent 6ff004ca):
data/test/images/ILSVRC2012_val_00000293.JPEG, 218 KB
data/test/images/ILSVRC2012_val_00012503.JPEG, 170 KB
data/test/images/ILSVRC2012_val_0001250388.JPEG, 170 KB
data/train/n01440764/n01440764_10026.JPEG, 13.4 KB
data/train/n01440764/n01440764_12433.JPEG, 35.9 KB
data/train/n01806143/n01806143_10002.JPEG, 200 KB
data/train/n01806143/n01806143_12826.JPEG, 170 KB
data/val/n01440764/ILSVRC2012_val_00000293.JPEG, 218 KB
data/val/n01440764/ILSVRC2012_val_00012503.JPEG, 170 KB
data/val/n01824575/ILSVRC2012_val_00000370.JPEG, 98.1 KB
data/val/n01824575/ILSVRC2012_val_00009781.JPEG, 120 KB
examples/dspvit_1node_minidata.sh
new file, mode 0 → 100644
#! /bin/bash

# Runs the "345M" parameter model

DATA_PATH="./data"
CHECKPOINT_PATH="./checkpoint"
DS_CONFIG="./examples/ds_config.json"

MICRO_BATCH_SIZE=1
GLOBAL_BATCH_SIZE=8

deepspeed --num_gpus 4 pretrain_vit.py \
       --num-layers 24 \
       --hidden-size 1024 \
       --num-attention-heads 16 \
       --micro-batch-size ${MICRO_BATCH_SIZE} \
       --global-batch-size ${GLOBAL_BATCH_SIZE} \
       --seq-length 1024 \
       --max-position-embeddings 1024 \
       --train-iters 5 \
       --lr-decay-iters 3 \
       --save $CHECKPOINT_PATH \
       --load $CHECKPOINT_PATH \
       --data-path $DATA_PATH \
       --data-impl mmap \
       --split 949,50,1 \
       --distributed-backend nccl \
       --lr 0.00015 \
       --min-lr 1.0e-5 \
       --lr-decay-style cosine \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --lr-warmup-fraction .01 \
       --log-interval 1 \
       --save-interval 5 \
       --eval-interval 5 \
       --eval-iters 5 \
       --fp16 \
       --padded_vocab_size 224 \
       --deepspeed \
       --deepspeed_config $DS_CONFIG \
       # --eval-only True \
       # --do_test True \
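
The batch-size settings in the script above imply some gradient accumulation, which the following sketch works out. It assumes that all 4 GPUs act as data-parallel ranks (the script sets no tensor- or pipeline-parallel flags) and that the global batch size equals micro batch size × data-parallel size × accumulation steps, which is the usual Megatron-style relationship. The variable names `NUM_GPUS`, `DATA_PARALLEL_SIZE`, `ACCUM_STEPS`, and `TOTAL_SAMPLES` are introduced here for illustration, and the contents of `ds_config.json` are not shown in this commit, so they are not checked.

```shell
# Values copied from examples/dspvit_1node_minidata.sh
MICRO_BATCH_SIZE=1
GLOBAL_BATCH_SIZE=8
NUM_GPUS=4
TRAIN_ITERS=5

# Assumption: every GPU is a data-parallel rank (no tensor/pipeline parallelism).
DATA_PARALLEL_SIZE=$NUM_GPUS

# Samples processed by all ranks in one forward/backward pass.
SAMPLES_PER_PASS=$((MICRO_BATCH_SIZE * DATA_PARALLEL_SIZE))

# The global batch must be a whole number of such passes.
if [ $((GLOBAL_BATCH_SIZE % SAMPLES_PER_PASS)) -ne 0 ]; then
    echo "error: global batch size is not divisible by micro batch size x data-parallel size" >&2
    exit 1
fi

ACCUM_STEPS=$((GLOBAL_BATCH_SIZE / SAMPLES_PER_PASS))
TOTAL_SAMPLES=$((GLOBAL_BATCH_SIZE * TRAIN_ITERS))

echo "micro-batches accumulated per iteration: $ACCUM_STEPS"
echo "samples consumed over $TRAIN_ITERS iterations: $TOTAL_SAMPLES"
```

Under these assumptions, each iteration accumulates 2 micro-batches per GPU and the 5-iteration trial run consumes 40 training samples, which is why a very small dataset is enough for it.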