chenpangpang / transformers · Commits · 5d8b9860

Commit 5d8b9860 (Unverified)
Authored Jan 26, 2022 by Ngo Quang Huy, committed by GitHub on Jan 26, 2022
Fix deepspeed docs (#15346)
parent 96161ac4
Showing 1 changed file with 12 additions and 12 deletions.
docs/source/main_classes/deepspeed.mdx  +12 -12
...
@@ -31,7 +31,7 @@ won't be possible on a single GPU.
 🤗 Transformers integrates [DeepSpeed](https://github.com/microsoft/DeepSpeed) via 2 options:
-1. Integration of the core DeepSpeed features via [`Trainer`]. This is everything done for you type
+1. Integration of the core DeepSpeed features via [`Trainer`]. This is everything done for your type
 of integration - just supply your custom config file or use our template and you have nothing else to do. Most of
 this document is focused on this feature.
 2. If you don't use [`Trainer`] and want to use your own Trainer where you integrated DeepSpeed
...
@@ -97,7 +97,7 @@ TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . \
 --disable-pip-version-check 2>&1 | tee build.log
 ```
-If you intend to use NVMe offload you will need to also include `DS_BUILD_AIO=1` in the instructions above (and also
+If you intend to use NVMe offload you will also need to include `DS_BUILD_AIO=1` in the instructions above (and also
 install *libaio-dev* system-wide).
 Edit `TORCH_CUDA_ARCH_LIST` to insert the code for the architectures of the GPU cards you intend to use. Assuming all
...
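As a reading aid for the hunk above, here is a sketch of what the prebuild might look like once `DS_BUILD_AIO=1` is added for NVMe offload. The build flags and the `tee build.log` form come from the snippet being diffed; the `TORCH_CUDA_ARCH_LIST` value and the `apt-get` line are illustrative assumptions, adjust them to your own GPUs and distro.

```bash
# Sketch only: DeepSpeed prebuild with the async-I/O op enabled for NVMe offload.
# Run from a DeepSpeed source checkout; libaio must be available system-wide first,
# e.g. on Debian/Ubuntu: sudo apt-get install libaio-dev
TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 DS_BUILD_AIO=1 \
  pip install . --disable-pip-version-check 2>&1 | tee build.log
```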
@@ -134,7 +134,7 @@ You can check the archs pytorch was built with using:
 python -c "import torch; print(torch.cuda.get_arch_list())"
 ```
-Here is how to find out the arch for one of the installed GPU. For example, for GPU 0:
+Here is how to find out the arch for one of the installed GPUs. For example, for GPU 0:
 ```bash
 CUDA_VISIBLE_DEVICES=0 python -c "import torch; \
...
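The doc's own per-GPU snippet is cut off in this view and is not filled in here. As a hedged alternative, the sketch below prints the compute capability of GPU 0 with `torch.cuda.get_device_capability()`, which can then be used as the `TORCH_CUDA_ARCH_LIST` value (e.g. `(8, 6)` becomes `"8.6"`).

```bash
# Sketch only: print the compute capability of GPU 0 in "major.minor" form,
# assuming a CUDA-enabled PyTorch install.
CUDA_VISIBLE_DEVICES=0 python -c "import torch; \
major, minor = torch.cuda.get_device_capability(); \
print(f'{major}.{minor}')"
```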
@@ -169,7 +169,7 @@ following:
 2. add a new argument `--deepspeed ds_config.json`, where `ds_config.json` is the DeepSpeed configuration file as
 documented [here](https://www.deepspeed.ai/docs/config-json/). The file naming is up to you.
-Therefore, if your original command line looked as following:
+Therefore, if your original command line looked as follows:
 ```bash
 python -m torch.distributed.launch --nproc_per_node=2 your_program.py <normal cl args>
...
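To make the step in the hunk above concrete, here is a sketch of the same launch command with the `--deepspeed ds_config.json` argument appended. `your_program.py` and `<normal cl args>` are the placeholders from the snippet being diffed; `ds_config.json` is whatever name you chose for your DeepSpeed config file.

```bash
# Sketch only: the original torch.distributed.launch command with the DeepSpeed
# config file passed to the Trainer via --deepspeed.
python -m torch.distributed.launch --nproc_per_node=2 your_program.py \
  <normal cl args> --deepspeed ds_config.json
```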
@@ -214,7 +214,7 @@ For some practical usage examples, please, see this [post](https://github.com/hu
 ### Deployment with one GPU
-To deploy DeepSpeed with one GPU adjust the [`Trainer`] command line arguments as following:
+To deploy DeepSpeed with one GPU adjust the [`Trainer`] command line arguments as follows:
 ```bash
 deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py \
...
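The one-GPU command above takes a DeepSpeed config file whose contents are not shown in this hunk. As an illustration only (not the file the doc actually ships), the sketch below writes a minimal ZeRO stage-2 config with optimizer offload to CPU, which is the usual reason to run DeepSpeed on a single GPU. The key names are standard DeepSpeed options, but the values and the file name are assumptions, and a real run will typically also want the fp16/optimizer/scheduler sections from the doc's full examples.

```bash
# Sketch only: minimal ZeRO-2 config with CPU optimizer offload, for illustration.
cat > ds_config.json <<'EOF'
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "allgather_bucket_size": 2e8,
    "reduce_bucket_size": 2e8
  }
}
EOF

# Then point the one-GPU launch at it (task arguments omitted here):
deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py \
  --deepspeed ds_config.json
```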
@@ -560,7 +560,7 @@ Do note that some values, such as `scheduler.params.total_num_steps` are calcula
 ### ZeRO
-[Zero Redundancy Optimizer (ZeRO)](https://www.deepspeed.ai/tutorials/zero/) is the workhorse of DeepSpeed. It support 3 different levels (stages) of optimization. The first one is not quite interesting for scalability purposes,
+[Zero Redundancy Optimizer (ZeRO)](https://www.deepspeed.ai/tutorials/zero/) is the workhorse of DeepSpeed. It supports 3 different levels (stages) of optimization. The first one is not quite interesting for scalability purposes,
 therefore this document focuses on stages 2 and 3. Stage 3 is further improved by the latest addition of ZeRO-Infinity. You will find more indepth information in the DeepSpeed documentation.
...
@@ -581,7 +581,7 @@ going to use.
 #### ZeRO-2 Config
-The following is an example configuration for ZeRO stage 2:
+The following is an example of configuration for ZeRO stage 2:
 ```json
 {
...
@@ -604,13 +604,13 @@ The following is an example configuration for ZeRO stage 2:
 **Performance tuning:**
 - enabling `offload_optimizer` should reduce GPU RAM usage (it requires `"stage": 2`)
-- `"overlap_comm": true` trades off increased GPU RAM usage to lower all-reduce latency. `overlap_comm` uses 4.5x
+- `"overlap_comm": true` trade offs increased GPU RAM usage to lower all-reduce latency. `overlap_comm` uses 4.5x
 the `allgather_bucket_size` and `reduce_bucket_size` values. So if they are set to 5e8, this requires a 9GB footprint
 (`5e8 x 2Bytes x 2 x 4.5`). Therefore, if you have a GPU with 8GB or less RAM, to avoid getting OOM-errors you will
 need to reduce those parameters to about `2e8`, which would require 3.6GB. You will want to do the same on larger
 capacity GPU as well, if you're starting to hit OOM.
-- when reducing these buffers you're trading communication speed to avail more GPU RAM. The smaller the buffer size, the slower the communication, and the more GPU RAM will be available to other tasks. So if a bigger batch size is
+- when reducing these buffers you're trading communication speed to avail more GPU RAM. The smaller the buffer size is, the slower the communication gets, and the more GPU RAM will be available to other tasks. So if a bigger batch size is
 important, getting a slightly slower training time could be a good trade.
...
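The 9GB figure quoted in the performance-tuning hunk above is just the bucket size times 2 bytes (fp16), times the 2 buckets, times the 4.5x factor that `overlap_comm` uses; the quick check below reproduces both numbers from the diff, taking GB as 10^9 bytes.

```bash
# Sketch only: reproduce the footprint arithmetic quoted above.
# bucket_size * 2 bytes (fp16) * 2 buckets * 4.5 (overlap_comm factor)
python -c "print(5e8 * 2 * 2 * 4.5 / 1e9, 'GB')"   # -> 9.0 GB
python -c "print(2e8 * 2 * 2 * 4.5 / 1e9, 'GB')"   # -> 3.6 GB
```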
@@ -619,7 +619,7 @@ The following is an example configuration for ZeRO stage 2:
 #### ZeRO-3 Config
-The following is an example configuration for ZeRO stage 3:
+The following is an example of configuration for ZeRO stage 3:
 ```json
 {
...
@@ -662,7 +662,7 @@ and its typically accessed much faster than normal CPU memory.
 If hitting OOM reduce `stage3_max_live_parameters` and `stage3_max_reuse_distance`. They should have minimal impact
 on performance unless you are doing activation checkpointing. `1e9` would consume ~2GB. The memory is shared by
-`stage3_max_live_parameters` and `stage3_max_reuse_distance`, so its not additive, its just 2GB total.
+`stage3_max_live_parameters` and `stage3_max_reuse_distance`, so it's not additive, it's just 2GB total.
 `stage3_max_live_parameters` is the upper limit on how many full parameters you want to keep on the GPU at any given
 time. "reuse distance" is a metric we are using to figure out when will a parameter be used again in the future, and we
...
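For the last hunk, the "~2GB for `1e9`" estimate follows from counting 2 bytes per fp16 parameter; the sketch below checks that, and notes, as an assumption about placement rather than something shown in this diff, that both keys would sit under `zero_optimization` in a ZeRO-3 config.

```bash
# Sketch only: 1e9 live/reused parameters at 2 bytes each (fp16) is about 2GB,
# shared between the two limits rather than added together.
python -c "print(1e9 * 2 / 1e9, 'GB')"   # -> 2.0 GB
# In a ZeRO-3 config these would typically appear as, e.g.:
#   "zero_optimization": { ..., "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9 }
```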