..
    Copyright 2020 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

Trainer
-----------------------------------------------------------------------------------------------------------------------

The :class:`~transformers.Trainer` and :class:`~transformers.TFTrainer` classes provide an API for feature-complete
training in most standard use cases. It's used in most of the :doc:`example scripts <../examples>`.

Before instantiating your :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer`, create a
:class:`~transformers.TrainingArguments`/:class:`~transformers.TFTrainingArguments` to access all the points of
customization during training.

The API supports distributed training on multiple GPUs/TPUs, mixed precision through `NVIDIA Apex
<https://github.com/NVIDIA/apex>`__ and Native AMP for PyTorch and :obj:`tf.keras.mixed_precision` for TensorFlow.

Both :class:`~transformers.Trainer` and :class:`~transformers.TFTrainer` contain the basic training loop which supports
the above features. To inject custom behavior you can subclass them and override the following methods:

- **get_train_dataloader**/**get_train_tfdataset** -- Creates the training DataLoader (PyTorch) or TF Dataset.
- **get_eval_dataloader**/**get_eval_tfdataset** -- Creates the evaluation DataLoader (PyTorch) or TF Dataset.
- **get_test_dataloader**/**get_test_tfdataset** -- Creates the test DataLoader (PyTorch) or TF Dataset.
- **log** -- Logs information on the various objects watching training.
- **create_optimizer_and_scheduler** -- Sets up the optimizer and learning rate scheduler if they were not passed at
  init. Note that you can also subclass or override the ``create_optimizer`` and ``create_scheduler`` methods
  separately.
- **create_optimizer** -- Sets up the optimizer if it wasn't passed at init.
- **create_scheduler** -- Sets up the learning rate scheduler if it wasn't passed at init.
- **compute_loss** -- Computes the loss on a batch of training inputs.
- **training_step** -- Performs a training step.
- **prediction_step** -- Performs an evaluation/test step.
- **run_model** (TensorFlow only) -- Basic pass through the model.
- **evaluate** -- Runs an evaluation loop and returns metrics.
- **predict** -- Returns predictions (with metrics if labels are available) on a test set.

.. warning::

    The :class:`~transformers.Trainer` class is optimized for 🤗 Transformers models and can have surprising behaviors
    when you use it on other models. When using it on your own model, make sure:

    - your model always returns tuples or subclasses of :class:`~transformers.file_utils.ModelOutput`.
    - your model can compute the loss if a :obj:`labels` argument is provided, and that the loss is returned as the
      first element of the tuple (if your model returns tuples).
    - your model can accept multiple label arguments (use :obj:`label_names` in your
      :class:`~transformers.TrainingArguments` to indicate their names to the :class:`~transformers.Trainer`) but none
      of them should be named :obj:`"label"`.
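
For instance, here is a minimal sketch of a custom PyTorch model that satisfies these constraints (the model, its
dimensions, and its argument names are illustrative assumptions, not a required API):

.. code-block:: python

    from torch import nn


    class TrainerCompatibleModel(nn.Module):
        def __init__(self, hidden_size=128, num_labels=2):
            super().__init__()
            self.classifier = nn.Linear(hidden_size, num_labels)

        def forward(self, inputs_embeds=None, labels=None):
            logits = self.classifier(inputs_embeds)
            if labels is not None:
                # the loss must come first in the returned tuple
                loss = nn.functional.cross_entropy(logits, labels)
                return (loss, logits)
            return (logits,)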

Here is an example of how to customize :class:`~transformers.Trainer` using a custom loss function for multi-label
classification:

.. code-block:: python

    from torch import nn
    from transformers import Trainer


    class MultilabelTrainer(Trainer):
        def compute_loss(self, model, inputs, return_outputs=False):
            labels = inputs.get("labels")
            # forward pass
            outputs = model(**inputs)
            logits = outputs.get("logits")
            # compute custom loss (here, binary cross-entropy over all labels)
            loss_fct = nn.BCEWithLogitsLoss()
            loss = loss_fct(logits.view(-1, self.model.config.num_labels),
                            labels.float().view(-1, self.model.config.num_labels))
            return (loss, outputs) if return_outputs else loss

Another way to customize the training loop behavior for the PyTorch :class:`~transformers.Trainer` is to use
:doc:`callbacks <callback>` that can inspect the training loop state (for progress reporting, logging on TensorBoard or
other ML platforms...) and make decisions (like early stopping).
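
For example, here is a minimal sketch of a callback that prints the logged metrics (the class name is an arbitrary
choice):

.. code-block:: python

    from transformers import TrainerCallback


    class PrintLogsCallback(TrainerCallback):
        # called each time the Trainer logs metrics
        def on_log(self, args, state, control, logs=None, **kwargs):
            if state.is_world_process_zero and logs is not None:
                print(f"step {state.global_step}: {logs}")

It can then be passed at init, e.g. ``Trainer(..., callbacks=[PrintLogsCallback])``.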


Trainer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.Trainer
    :members:

Seq2SeqTrainer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.Seq2SeqTrainer
    :members: evaluate, predict


TFTrainer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFTrainer
    :members:

TrainingArguments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TrainingArguments
    :members:

Seq2SeqTrainingArguments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.Seq2SeqTrainingArguments
    :members:


TFTrainingArguments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFTrainingArguments
    :members:


Checkpoints
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, :class:`~transformers.Trainer` will save all checkpoints in the :obj:`output_dir` you set in the
:class:`~transformers.TrainingArguments` you are using. Those will go in a subfolder named :obj:`checkpoint-xxx`, with
xxx being the step at which the checkpoint was saved.

Resuming training from a checkpoint can be done when calling :meth:`~transformers.Trainer.train` with either:

- :obj:`resume_from_checkpoint=True` which will resume training from the latest checkpoint
- :obj:`resume_from_checkpoint=checkpoint_dir` which will resume training from the specific checkpoint in the directory
  passed.
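
For example, here is a minimal sketch (assuming ``trainer`` is an already-configured :class:`~transformers.Trainer`;
the specific checkpoint path is illustrative):

.. code-block:: python

    # resume from the latest checkpoint in output_dir
    trainer.train(resume_from_checkpoint=True)

    # or resume from a specific checkpoint:
    # trainer.train(resume_from_checkpoint="output_dir/checkpoint-500")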

In addition, you can easily save your checkpoints on the Model Hub when using :obj:`push_to_hub=True`. By default, all
the models saved in intermediate checkpoints are saved in different commits, but not the optimizer state. You can adapt
the :obj:`hub_strategy` value of your :class:`~transformers.TrainingArguments` to either:

- :obj:`"checkpoint"`: the latest checkpoint is also pushed in a subfolder named last-checkpoint, allowing you to
  resume training easily with :obj:`trainer.train(resume_from_checkpoint="output_dir/last-checkpoint")`.
- :obj:`"all_checkpoints"`: all checkpoints are pushed the way they appear in the output folder (so you will get one
  checkpoint folder per folder in your final repository)
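
For example, here is a sketch of enabling checkpoint pushes (assuming you are already authenticated with the Model
Hub; the ``output_dir`` value is illustrative):

.. code-block:: python

    from transformers import TrainingArguments

    args = TrainingArguments(
        output_dir="output_dir",
        push_to_hub=True,           # push saved checkpoints to the Model Hub
        hub_strategy="checkpoint",  # also push the latest checkpoint to last-checkpoint
    )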


Logging
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, :class:`~transformers.Trainer` will use ``logging.INFO`` for the main process and ``logging.WARNING`` for
the replicas, if any.

These defaults can be overridden to use any of the 5 ``logging`` levels with :class:`~transformers.TrainingArguments`'s
arguments:

- ``log_level`` - for the main process
- ``log_level_replica`` - for the replicas

Further, if :class:`~transformers.TrainingArguments`'s ``log_on_each_node`` is set to ``False`` only the main node will
use the log level settings for its main process, all other nodes will use the log level settings for replicas.

Note that :class:`~transformers.Trainer` is going to set ``transformers``'s log level separately for each node in its
:meth:`~transformers.Trainer.__init__`. So you may want to set this sooner (see the next example) if you tap into other
``transformers`` functionality before creating the :class:`~transformers.Trainer` object.

Here is an example of how this can be used in an application:

.. code-block:: python

    [...]
    logger = logging.getLogger(__name__)

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        handlers=[logging.StreamHandler(sys.stdout)],
    )

    # set the main code and the modules it uses to the same log-level according to the node
    log_level = training_args.get_process_log_level()
    logger.setLevel(log_level)
    datasets.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.set_verbosity(log_level)

    trainer = Trainer(...)

Then, if you only want to see warnings on the main node and want to keep all other nodes from printing most likely
duplicated warnings, you could run it as:

.. code-block:: bash

    my_app.py ... --log_level warning --log_level_replica error

In a multi-node environment, if you also don't want the logs to repeat for each node's main process, you will want to
change the above to:

.. code-block:: bash

    my_app.py ... --log_level warning --log_level_replica error --log_on_each_node 0

and then only the main process of the first node will log at the "warning" level, and all other processes on the main
node and all processes on other nodes will log at the "error" level.

If you need your application to be as quiet as possible you could do:

.. code-block:: bash

    my_app.py ... --log_level error --log_level_replica error --log_on_each_node 0

(add ``--log_on_each_node 0`` if in a multi-node environment)



Randomness
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When resuming from a checkpoint generated by :class:`~transformers.Trainer` all efforts are made to restore the
`python`, `numpy` and `pytorch` RNG states to the same states as they were at the moment of saving that checkpoint,
which should make the "stop and resume" style of training as close as possible to non-stop training.

However, due to various default non-deterministic pytorch settings this might not fully work. If you want full
determinism please refer to `Controlling sources of randomness
<https://pytorch.org/docs/stable/notes/randomness.html>`__. As explained in that document, some of the settings that
make things deterministic (e.g., ``torch.backends.cudnn.deterministic``) may slow things down, therefore this can't be
done by default, but you can enable those yourself if needed.
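
For example, here is a minimal sketch of enabling two of those settings (a deliberate speed-for-reproducibility
trade-off; see the PyTorch notes linked above for the full picture):

.. code-block:: python

    import torch

    # ask cuDNN to pick deterministic algorithms, at some cost in speed
    torch.backends.cudnn.deterministic = True
    # disable the cuDNN auto-tuner, which can introduce run-to-run variation
    torch.backends.cudnn.benchmark = False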


Trainer Integrations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



The :class:`~transformers.Trainer` has been extended to support libraries that may dramatically improve your training
time and fit much bigger models.

Currently it supports two third-party solutions, `DeepSpeed <https://github.com/microsoft/DeepSpeed>`__ and `FairScale
<https://github.com/facebookresearch/fairscale/>`__, which implement parts of the paper `ZeRO: Memory Optimizations
Toward Training Trillion Parameter Models, by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He
<https://arxiv.org/abs/1910.02054>`__.

This support is new and experimental as of this writing.

.. _zero-install-notes:

CUDA Extension Installation Notes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As of this writing, both FairScale and DeepSpeed require compilation of CUDA C++ code before they can be used.

While all installation issues should be dealt with through the corresponding GitHub Issues of `FairScale
<https://github.com/facebookresearch/fairscale/issues>`__ and `Deepspeed
<https://github.com/microsoft/DeepSpeed/issues>`__, there are a few common issues that one may encounter while building
any PyTorch extension that needs to build CUDA extensions.

Therefore, if you encounter a CUDA-related build issue while doing one of the following or both:

.. code-block:: bash

    pip install fairscale
    pip install deepspeed

please read the following notes first.

In these notes we give examples for what to do when ``pytorch`` has been built with CUDA ``10.2``. If your situation is
different remember to adjust the version number to the one you are after.

Possible problem #1
=======================================================================================================================

While PyTorch comes with its own CUDA toolkit, to build these two projects you must have an identical version of CUDA
installed system-wide.

For example, if you installed ``pytorch`` with ``cudatoolkit==10.2`` in the Python environment, you also need to have
CUDA ``10.2`` installed system-wide.
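
You can check which CUDA version your ``pytorch`` build was compiled against, for example:

.. code-block:: python

    import torch

    # prints e.g. "10.2"; the system-wide CUDA toolkit should match this version
    print(torch.version.cuda)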

The exact location may vary from system to system, but ``/usr/local/cuda-10.2`` is the most common location on many
Unix systems. When CUDA is correctly set up and added to the ``PATH`` environment variable, one can find the
installation location by doing:

.. code-block:: bash

    which nvcc

If you don't have CUDA installed system-wide, install it first. You will find the instructions by using your favorite
search engine. For example, if you're on Ubuntu you may want to search for: `ubuntu cuda 10.2 install
<https://www.google.com/search?q=ubuntu+cuda+10.2+install>`__.

Possible problem #2
=======================================================================================================================

Another possible common problem is that you may have more than one CUDA toolkit installed system-wide. For example, you
may have:

.. code-block:: bash

    /usr/local/cuda-10.2
    /usr/local/cuda-11.0

Now, in this situation you need to make sure that your ``PATH`` and ``LD_LIBRARY_PATH`` environment variables contain
the correct paths to the desired CUDA version. Typically, package installers will set these to contain whatever the
last installed version was. If you encounter the problem where the package build fails because it can't find the right
CUDA version despite having it installed system-wide, it means that you need to adjust the two aforementioned
environment variables.

First, you may look at their contents:

.. code-block:: bash

    echo $PATH
    echo $LD_LIBRARY_PATH

so you get an idea of what is inside.

It's possible that ``LD_LIBRARY_PATH`` is empty.

``PATH`` lists the locations where executables can be found and ``LD_LIBRARY_PATH`` is where shared libraries are
looked for. In both cases, earlier entries have priority over later ones. ``:`` is used to separate multiple entries.

Now, to tell the build program where to find the specific CUDA toolkit, insert the desired paths to be listed first by
doing:

.. code-block:: bash

    export PATH=/usr/local/cuda-10.2/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH

Note that we aren't overwriting the existing values, but prepending instead.

Of course, adjust the version number and the full path if need be. Check that the directories you assign actually do
exist. The ``lib64`` sub-directory is where the various CUDA ``.so`` objects, like ``libcudart.so``, reside. It's
unlikely that your system will have it named differently, but if it does, adjust it to reflect your reality.


Possible problem #3
=======================================================================================================================

Some older CUDA versions may refuse to build with newer compilers. For example, you may have ``gcc-9`` but CUDA wants
``gcc-7``.

There are various ways to go about it.

If you can install the latest CUDA toolkit, it typically should support the newer compiler.

Alternatively, you could install the lower version of the compiler in addition to the one you already have, or you may
already have it but it's not the default one, so the build system can't see it. If you have ``gcc-7`` installed but the
build system complains it can't find it, the following might do the trick:

.. code-block:: bash

    sudo ln -s /usr/bin/gcc-7  /usr/local/cuda-10.2/bin/gcc
    sudo ln -s /usr/bin/g++-7  /usr/local/cuda-10.2/bin/g++


Here, we are making a symlink to ``gcc-7`` from ``/usr/local/cuda-10.2/bin/gcc`` and since
``/usr/local/cuda-10.2/bin/`` should be in the ``PATH`` environment variable (see the previous problem's solution), it
should find ``gcc-7`` (and ``g++-7``) and then the build will succeed.

As always make sure to edit the paths in the example to match your situation.

FairScale
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By integrating `FairScale <https://github.com/facebookresearch/fairscale/>`__ the :class:`~transformers.Trainer`
provides support for the following features from `the ZeRO paper <https://arxiv.org/abs/1910.02054>`__:

1. Optimizer State Sharding
2. Gradient Sharding
3. Model Parameters Sharding (new and very experimental)
4. CPU offload (new and very experimental)

You will need at least two GPUs to use this feature.

**Installation**:

Install the library via pypi:

.. code-block:: bash

    pip install fairscale

or via ``transformers``' ``extras``:

.. code-block:: bash

    pip install transformers[fairscale]

(will become available starting from ``transformers==4.6.0``)

or find more details on `FairScale's GitHub page <https://github.com/facebookresearch/fairscale/#installation>`__.

If you're still struggling with the build, first make sure to read :ref:`zero-install-notes`.

If that still doesn't resolve the build issue, here are a few more ideas.

``fairscale`` seems to have an issue with pip's recently introduced build isolation feature. If you have a problem
with it, you may want to try one of:

.. code-block:: bash

    pip install fairscale --no-build-isolation

or:

.. code-block:: bash

    git clone https://github.com/facebookresearch/fairscale/
    cd fairscale
    rm -r dist build
    python setup.py bdist_wheel
    pip uninstall -y fairscale
    pip install dist/fairscale-*.whl

``fairscale`` also has issues with building against pytorch-nightly, so if you use it you may have to try one of:

.. code-block:: bash

    pip uninstall -y fairscale; pip install fairscale --pre \
    -f https://download.pytorch.org/whl/nightly/cu110/torch_nightly.html \
    --no-cache --no-build-isolation

or:

.. code-block:: bash

    pip install -v --disable-pip-version-check . \
    -f https://download.pytorch.org/whl/nightly/cu110/torch_nightly.html --pre

Of course, adjust the URLs to match the CUDA version you use.

If after trying everything suggested you still encounter build issues, please proceed with the GitHub Issue of
`FairScale <https://github.com/facebookresearch/fairscale/issues>`__.



**Usage**:

To use the first version of Sharded data-parallelism, add ``--sharded_ddp simple`` to the command line arguments, and
make sure you have added the distributed launcher ``-m torch.distributed.launch
--nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE`` if you haven't been using it already.

For example, here is how you could use it for ``run_translation.py`` with 2 GPUs:

.. code-block:: bash

    python -m torch.distributed.launch --nproc_per_node=2 examples/pytorch/translation/run_translation.py \
    --model_name_or_path t5-small --per_device_train_batch_size 1 \
    --output_dir output_dir --overwrite_output_dir \
    --do_train --max_train_samples 500 --num_train_epochs 1 \
    --dataset_name wmt16 --dataset_config "ro-en" \
    --source_lang en --target_lang ro \
    --fp16 --sharded_ddp simple

Notes:

- This feature requires distributed training (so multiple GPUs).
- It is not implemented for TPUs.
- It works with ``--fp16`` too, to make things even faster.
- One of the main benefits of enabling ``--sharded_ddp simple`` is that it uses a lot less GPU memory, so you should be
  able to use significantly larger batch sizes using the same hardware (e.g. 3x and even bigger) which should lead to
  significantly shorter training time.

To use the second version of Sharded data-parallelism, add ``--sharded_ddp zero_dp_2`` or ``--sharded_ddp zero_dp_3``
to the command line arguments, and make sure you have added the distributed launcher ``-m torch.distributed.launch
--nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE`` if you haven't been using it already.

For example, here is how you could use it for ``run_translation.py`` with 2 GPUs:

.. code-block:: bash

    python -m torch.distributed.launch --nproc_per_node=2 examples/pytorch/translation/run_translation.py \
    --model_name_or_path t5-small --per_device_train_batch_size 1 \
    --output_dir output_dir --overwrite_output_dir \
    --do_train --max_train_samples 500 --num_train_epochs 1 \
    --dataset_name wmt16 --dataset_config "ro-en" \
    --source_lang en --target_lang ro \
    --fp16 --sharded_ddp zero_dp_2

:obj:`zero_dp_2` is an optimized version of the simple wrapper, while :obj:`zero_dp_3` fully shards model weights,
gradients and optimizer states.

Both are compatible with adding :obj:`cpu_offload` to enable ZeRO-offload (activate it like this: :obj:`--sharded_ddp
"zero_dp_2 cpu_offload"`).

Notes:

- This feature requires distributed training (so multiple GPUs).
- It is not implemented for TPUs.
- It works with ``--fp16`` too, to make things even faster.
- The ``cpu_offload`` additional option requires ``--fp16``.
- This is an area of active development, so make sure you have a source install of fairscale to use this feature as
  some bugs you encounter may have been fixed there already.

Known caveats:

- This feature is incompatible with :obj:`--predict_with_generate` in the `run_translation.py` script.
- Using :obj:`--sharded_ddp zero_dp_3` requires wrapping each layer of the model in the special container
  :obj:`FullyShardedDataParallel` of fairscale. It should be used with the option :obj:`auto_wrap` if you are not
  doing this yourself: :obj:`--sharded_ddp "zero_dp_3 auto_wrap"`.


DeepSpeed
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Moved to :ref:`deepspeed-trainer-integration`.


Installation
=======================================================================================================================

Moved to :ref:`deepspeed-installation`.


Deployment with multiple GPUs
=======================================================================================================================

Moved to :ref:`deepspeed-multi-gpu`.


Deployment with one GPU
=======================================================================================================================

Moved to :ref:`deepspeed-one-gpu`.


Deployment in Notebooks
=======================================================================================================================

Moved to :ref:`deepspeed-notebook`.


Configuration
=======================================================================================================================

Moved to :ref:`deepspeed-config`.


Passing Configuration
=======================================================================================================================

Moved to :ref:`deepspeed-config-passing`.


Shared Configuration
=======================================================================================================================

Moved to :ref:`deepspeed-config-shared`.


ZeRO
=======================================================================================================================

Moved to :ref:`deepspeed-zero`.


ZeRO-2 Config
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Moved to :ref:`deepspeed-zero2-config`.


ZeRO-3 Config
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Moved to :ref:`deepspeed-zero3-config`.


NVMe Support
=======================================================================================================================

Moved to :ref:`deepspeed-nvme`.


ZeRO-2 vs ZeRO-3 Performance
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Moved to :ref:`deepspeed-zero2-zero3-performance`.


ZeRO-2 Example
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Moved to :ref:`deepspeed-zero2-example`.


ZeRO-3 Example
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Moved to :ref:`deepspeed-zero3-example`.


Optimizer and Scheduler
=======================================================================================================================


Optimizer
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Moved to :ref:`deepspeed-optimizer`.


Scheduler
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Moved to :ref:`deepspeed-scheduler`.


fp32 Precision
=======================================================================================================================

Moved to :ref:`deepspeed-fp32`.


Automatic Mixed Precision
=======================================================================================================================

Moved to :ref:`deepspeed-amp`.


Batch Size
=======================================================================================================================

Moved to :ref:`deepspeed-bs`.


Gradient Accumulation
=======================================================================================================================

Moved to :ref:`deepspeed-grad-acc`.


Gradient Clipping
=======================================================================================================================

Moved to :ref:`deepspeed-grad-clip`.


Getting The Model Weights Out
=======================================================================================================================

Moved to :ref:`deepspeed-weight-extraction`.