..
    Copyright 2020 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

Trainer
-----------------------------------------------------------------------------------------------------------------------

The :class:`~transformers.Trainer` and :class:`~transformers.TFTrainer` classes provide an API for feature-complete
training in most standard use cases. It's used in most of the :doc:`example scripts <../examples>`.

Before instantiating your :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer`, create a
:class:`~transformers.TrainingArguments`/:class:`~transformers.TFTrainingArguments` to access all the points of
customization during training.

The API supports distributed training on multiple GPUs/TPUs, mixed precision through `NVIDIA Apex
<https://github.com/NVIDIA/apex>`__ and Native AMP for PyTorch and :obj:`tf.keras.mixed_precision` for TensorFlow.
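
Here is a minimal sketch of the usual workflow, assuming a model, a training dataset and an evaluation dataset have
already been prepared (all names and values below are illustrative):

.. code-block:: python

    from transformers import Trainer, TrainingArguments

    training_args = TrainingArguments(
        output_dir="output_dir",           # where checkpoints and logs are written
        num_train_epochs=3,
        per_device_train_batch_size=16,
        evaluation_strategy="epoch",
    )

    trainer = Trainer(
        model=model,                       # assumed: a pretrained 🤗 Transformers model
        args=training_args,
        train_dataset=train_dataset,       # assumed: datasets compatible with the model
        eval_dataset=eval_dataset,
    )

    trainer.train()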

Both :class:`~transformers.Trainer` and :class:`~transformers.TFTrainer` contain the basic training loop which supports
the above features. To inject custom behavior you can subclass them and override the following methods:

- **get_train_dataloader**/**get_train_tfdataset** -- Creates the training DataLoader (PyTorch) or TF Dataset.
- **get_eval_dataloader**/**get_eval_tfdataset** -- Creates the evaluation DataLoader (PyTorch) or TF Dataset.
- **get_test_dataloader**/**get_test_tfdataset** -- Creates the test DataLoader (PyTorch) or TF Dataset.
- **log** -- Logs information on the various objects watching training.
- **create_optimizer_and_scheduler** -- Sets up the optimizer and learning rate scheduler if they were not passed at
  init. Note that you can also subclass or override the ``create_optimizer`` and ``create_scheduler`` methods
  separately (see the sketch after this list).
- **create_optimizer** -- Sets up the optimizer if it wasn't passed at init.
- **create_scheduler** -- Sets up the learning rate scheduler if it wasn't passed at init.
- **compute_loss** -- Computes the loss on a batch of training inputs.
- **training_step** -- Performs a training step.
- **prediction_step** -- Performs an evaluation/test step.
- **run_model** (TensorFlow only) -- Basic pass through the model.
- **evaluate** -- Runs an evaluation loop and returns metrics.
- **predict** -- Returns predictions (with metrics if labels are available) on a test set.
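
For example, here is a minimal sketch of overriding ``create_optimizer`` to swap in plain SGD (the optimizer choice
and hyperparameters are only an illustration):

.. code-block:: python

    import torch
    from transformers import Trainer

    class SGDTrainer(Trainer):
        def create_optimizer(self):
            # only build an optimizer if one wasn't passed at init
            if self.optimizer is None:
                self.optimizer = torch.optim.SGD(
                    self.model.parameters(),
                    lr=self.args.learning_rate,
                    momentum=0.9,
                )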

.. warning::

    The :class:`~transformers.Trainer` class is optimized for 🤗 Transformers models and can have surprising behaviors
    when you use it on other models. When using it on your own model, make sure:

    - your model always returns tuples or subclasses of :class:`~transformers.file_utils.ModelOutput`.
    - your model can compute the loss if a :obj:`labels` argument is provided and that loss is returned as the first
      element of the tuple (if your model returns tuples).
    - your model can accept multiple label arguments (use the :obj:`label_names` in your
      :class:`~transformers.TrainingArguments` to indicate their name to the :class:`~transformers.Trainer`) but none
      of them should be named :obj:`"label"`.
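
As an illustration, here is a minimal sketch of a bare PyTorch model that satisfies these expectations (the class name
and dimensions are made up for the example):

.. code-block:: python

    import torch

    class TrainerCompatibleModel(torch.nn.Module):
        def __init__(self, hidden_size=128, num_labels=2):
            super().__init__()
            self.classifier = torch.nn.Linear(hidden_size, num_labels)

        def forward(self, inputs_embeds, labels=None):
            logits = self.classifier(inputs_embeds)
            if labels is not None:
                loss = torch.nn.functional.cross_entropy(logits, labels)
                # the loss must be the first element of the returned tuple
                return (loss, logits)
            return (logits,)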

Here is an example of how to customize :class:`~transformers.Trainer` using a custom loss function for multi-label
classification:

.. code-block:: python

    import torch
    from transformers import Trainer

    class MultilabelTrainer(Trainer):
        def compute_loss(self, model, inputs, return_outputs=False):
            labels = inputs.pop("labels")
            outputs = model(**inputs)
            logits = outputs.logits
            loss_fct = torch.nn.BCEWithLogitsLoss()
            loss = loss_fct(logits.view(-1, self.model.config.num_labels),
                            labels.float().view(-1, self.model.config.num_labels))
            return (loss, outputs) if return_outputs else loss

Another way to customize the training loop behavior for the PyTorch :class:`~transformers.Trainer` is to use
:doc:`callbacks <callback>` that can inspect the training loop state (for progress reporting, logging on TensorBoard or
other ML platforms...) and make decisions (like early stopping).
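
For instance, here is a minimal sketch of a custom callback that prints the metrics after each evaluation, combined
with the built-in :class:`~transformers.EarlyStoppingCallback` (``model``, ``training_args`` and the datasets are
assumed to be defined already):

.. code-block:: python

    from transformers import EarlyStoppingCallback, Trainer, TrainerCallback

    class PrintMetricsCallback(TrainerCallback):
        def on_evaluate(self, args, state, control, metrics=None, **kwargs):
            # called after each evaluation loop
            print(f"step {state.global_step}: {metrics}")

    # note: EarlyStoppingCallback also expects `load_best_model_at_end=True` and a
    # `metric_for_best_model` to be set in the TrainingArguments
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        callbacks=[PrintMetricsCallback, EarlyStoppingCallback(early_stopping_patience=3)],
    )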


Trainer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.Trainer
    :members:


Seq2SeqTrainer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.Seq2SeqTrainer
    :members: evaluate, predict


TFTrainer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFTrainer
    :members:


TrainingArguments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TrainingArguments
    :members:


Seq2SeqTrainingArguments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.Seq2SeqTrainingArguments
    :members:


TFTrainingArguments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFTrainingArguments
    :members:


Trainer Integrations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



The :class:`~transformers.Trainer` has been extended to support libraries that may dramatically reduce your training
time and let you fit much bigger models.

Currently it supports third-party solutions, `DeepSpeed <https://github.com/microsoft/DeepSpeed>`__ and `FairScale
<https://github.com/facebookresearch/fairscale/>`__, which implement parts of the paper `ZeRO: Memory Optimizations
Toward Training Trillion Parameter Models, by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He
<https://arxiv.org/abs/1910.02054>`__.

This support is new and experimental as of this writing.

Installation Notes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As of this writing, both FairScale and DeepSpeed require compilation of CUDA C++ code before they can be used.

While all installation issues should be dealt with through the corresponding GitHub Issues of `FairScale
<https://github.com/facebookresearch/fairscale/issues>`__ and `DeepSpeed
<https://github.com/microsoft/DeepSpeed/issues>`__, there are a few common issues that one may encounter while building
any PyTorch extension that needs to compile CUDA code.

Therefore, if you encounter a CUDA-related build issue while running one or both of the following:

.. code-block:: bash

    pip install fairscale
    pip install deepspeed

please read the following notes first.

In these notes we give examples of what to do when ``pytorch`` has been built with CUDA ``10.2``. If your situation is
different, remember to adjust the version number to the one you are after.

**Possible problem #1:**

While PyTorch comes with its own CUDA toolkit, to build these two projects you must have an identical version of CUDA
installed system-wide.

For example, if you installed ``pytorch`` with ``cudatoolkit==10.2`` in the Python environment, you also need to have
CUDA ``10.2`` installed system-wide.

The exact location may vary from system to system, but ``/usr/local/cuda-10.2`` is the most common location on many
Unix systems. When CUDA is correctly set up and added to the ``PATH`` environment variable, one can find the
installation location by doing:

.. code-block:: bash

    which nvcc

If you don't have CUDA installed system-wide, install it first. You will find the instructions by using your favorite
search engine. For example, if you're on Ubuntu you may want to search for: `ubuntu cuda 10.2 install
<https://www.google.com/search?q=ubuntu+cuda+10.2+install>`__.

**Possible problem #2:**

Another possible common problem is that you may have more than one CUDA toolkit installed system-wide. For example you
may have:

.. code-block:: bash

    /usr/local/cuda-10.2
    /usr/local/cuda-11.0

Now, in this situation you need to make sure that your ``PATH`` and ``LD_LIBRARY_PATH`` environment variables contain
the correct paths to the desired CUDA version. Typically, package installers will set these to point to whichever
version was installed last. If the package build fails because it can't find the right CUDA version despite it being
installed system-wide, it means that you need to adjust the two aforementioned environment variables.

First, you may look at their contents:

.. code-block:: bash

    echo $PATH
    echo $LD_LIBRARY_PATH

so you get an idea of what is inside.

It's possible that ``LD_LIBRARY_PATH`` is empty.

``PATH`` lists the locations where executables can be found and ``LD_LIBRARY_PATH`` lists the locations where shared
libraries are looked for. In both cases, earlier entries have priority over later ones. ``:`` is used to separate
multiple entries.

Now, to tell the build program where to find the specific CUDA toolkit, insert the desired paths to be listed first by
doing:

.. code-block:: bash

    export PATH=/usr/local/cuda-10.2/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH

Note that we aren't overwriting the existing values, but prepending instead.

Of course, adjust the version number and the full path if need be. Check that the directories you assign actually do
exist. The ``lib64`` sub-directory is where the various CUDA ``.so`` objects, like ``libcudart.so``, reside; it's
unlikely that your system will have it named differently, but if it does, adjust it to reflect your reality.


**Possible problem #3:**

Some older CUDA versions may refuse to build with newer compilers. For example, you may have ``gcc-9`` but it wants
``gcc-7``.

There are various ways to go about it.

If you can install the latest CUDA toolkit, it typically should support the newer compiler.

Alternatively, you could install the older version of the compiler in addition to the one you already have, or you may
already have it but it's not the default one, so the build system can't see it. If you have ``gcc-7`` installed but the
build system complains it can't find it, the following might do the trick:

.. code-block:: bash

    sudo ln -s /usr/bin/gcc-7  /usr/local/cuda-10.2/bin/gcc
    sudo ln -s /usr/bin/g++-7  /usr/local/cuda-10.2/bin/g++


Here, we are making a symlink to ``gcc-7`` from ``/usr/local/cuda-10.2/bin/gcc`` and since
``/usr/local/cuda-10.2/bin/`` should be in the ``PATH`` environment variable (see the previous problem's solution), it
should find ``gcc-7`` (and ``g++-7``) and then the build will succeed.

As always make sure to edit the paths in the example to match your situation.

**If still unsuccessful:**

If after addressing these you still encounter build issues, please proceed with the GitHub Issues of `FairScale
<https://github.com/facebookresearch/fairscale/issues>`__ or `DeepSpeed
<https://github.com/microsoft/DeepSpeed/issues>`__, depending on the project you have the problem with.


FairScale
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By integrating `FairScale <https://github.com/facebookresearch/fairscale/>`__ the :class:`~transformers.Trainer`
provides support for the following features from `the ZeRO paper <https://arxiv.org/abs/1910.02054>`__:

1. Optimizer State Sharding
2. Gradient Sharding
3. Model Parameters Sharding (new and very experimental)
4. CPU offload (new and very experimental)

You will need at least two GPUs to use this feature.

To deploy this feature:

1. Install the library via pypi:

   .. code-block:: bash

       pip install fairscale

   or find more details on `the FairScale GitHub page
   <https://github.com/facebookresearch/fairscale/#installation>`__.

2. To use the first version of Sharded data-parallelism, add ``--sharded_ddp simple`` to the command line arguments,
   and make sure you have added the distributed launcher ``-m torch.distributed.launch
   --nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE`` if you haven't been using it already.

For example, here is how you could use it for ``run_translation.py`` with 2 GPUs:

.. code-block:: bash

    python -m torch.distributed.launch --nproc_per_node=2 examples/seq2seq/run_translation.py \
    --model_name_or_path t5-small --per_device_train_batch_size 1   \
    --output_dir output_dir --overwrite_output_dir \
    --do_train --max_train_samples 500 --num_train_epochs 1 \
    --dataset_name wmt16 --dataset_config "ro-en" \
    --source_lang en --target_lang ro \
    --fp16 --sharded_ddp simple

Notes:

- This feature requires distributed training (so multiple GPUs).
- It is not implemented for TPUs.
- It works with ``--fp16`` too, to make things even faster.
- One of the main benefits of enabling ``--sharded_ddp simple`` is that it uses a lot less GPU memory, so you should be
  able to use significantly larger batch sizes using the same hardware (e.g. 3x and even bigger) which should lead to
  significantly shorter training time.

3. To use the second version of Sharded data-parallelism, add ``--sharded_ddp zero_dp_2`` or ``--sharded_ddp
   zero_dp_3`` to the command line arguments, and make sure you have added the distributed launcher ``-m
   torch.distributed.launch --nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE`` if you haven't been using it already.

For example, here is how you could use it for ``run_translation.py`` with 2 GPUs:

.. code-block:: bash

    python -m torch.distributed.launch --nproc_per_node=2 examples/seq2seq/run_translation.py \
    --model_name_or_path t5-small --per_device_train_batch_size 1   \
    --output_dir output_dir --overwrite_output_dir \
    --do_train --max_train_samples 500 --num_train_epochs 1 \
    --dataset_name wmt16 --dataset_config "ro-en" \
    --source_lang en --target_lang ro \
    --fp16 --sharded_ddp zero_dp_2

:obj:`zero_dp_2` is an optimized version of the simple wrapper, while :obj:`zero_dp_3` fully shards model weights,
gradients and optimizer states.

Both are compatible with adding :obj:`cpu_offload` to enable ZeRO-offload (activate it like this: :obj:`--sharded_ddp
"zero_dp_2 cpu_offload"`).

Notes:

- This feature requires distributed training (so multiple GPUs).
- It is not implemented for TPUs.
- It works with ``--fp16`` too, to make things even faster.
- The ``cpu_offload`` additional option requires ``--fp16``.
- This is an area of active development, so make sure you have a source install of fairscale to use this feature as
  some bugs you encounter may have been fixed there already.

Known caveats:

- This feature is incompatible with :obj:`--predict_with_generate` in the ``run_translation.py`` script.
- Using :obj:`--sharded_ddp zero_dp_3` requires wrapping each layer of the model in the special container
  :obj:`FullyShardedDataParallelism` of fairscale. It should be used with the option :obj:`auto_wrap` if you are not
  doing this yourself: :obj:`--sharded_ddp "zero_dp_3 auto_wrap"`.


DeepSpeed
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

`DeepSpeed <https://github.com/microsoft/DeepSpeed>`__ implements everything described in the `ZeRO paper
<https://arxiv.org/abs/1910.02054>`__, except ZeRO's stage 3, "Parameter Partitioning (Pos+g+p)". Currently it provides
full support for:

1. Optimizer State Partitioning (ZeRO stage 1)
2. Gradient Partitioning (ZeRO stage 2)
3. Custom fp16 handling
4. A range of fast CUDA-extension-based optimizers
5. ZeRO-Offload

ZeRO-Offload has its own dedicated paper: `ZeRO-Offload: Democratizing Billion-Scale Model Training
<https://arxiv.org/abs/2101.06840>`__.

DeepSpeed is currently used only for training, as all the currently available features are of no use for inference.


Installation
=======================================================================================================================

Install the library via pypi:

.. code-block:: bash

    pip install deepspeed

or find more details on `the DeepSpeed GitHub page <https://github.com/microsoft/deepspeed#installation>`__.

Deployment with multiple GPUs
=======================================================================================================================

To deploy this feature with multiple GPUs adjust the :class:`~transformers.Trainer` command line arguments as
follows:

1. replace ``python -m torch.distributed.launch`` with ``deepspeed``.
2. add a new argument ``--deepspeed ds_config.json``, where ``ds_config.json`` is the DeepSpeed configuration file as
   documented `here <https://www.deepspeed.ai/docs/config-json/>`__. The file naming is up to you.

Therefore, if your original command line looked as follows:

.. code-block:: bash

    python -m torch.distributed.launch --nproc_per_node=2 your_program.py <normal cl args>

Now it should be:

.. code-block:: bash

    deepspeed --num_gpus=2 your_program.py <normal cl args> --deepspeed ds_config.json

Unlike ``torch.distributed.launch`` where you have to specify how many GPUs to use with ``--nproc_per_node``, with the
``deepspeed`` launcher you don't have to use the corresponding ``--num_gpus`` if you want all of your GPUs used. The
full details on how to configure various nodes and GPUs can be found `here
<https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node>`__.

In fact, you can continue using ``-m torch.distributed.launch`` with DeepSpeed as long as you don't need to use
``deepspeed`` launcher-specific arguments. Typically if you don't need a multi-node setup you're not required to use
the ``deepspeed`` launcher. But since the DeepSpeed documentation uses it everywhere, for consistency we will use it
here as well.

Here is an example of running ``run_translation.py`` under DeepSpeed deploying all available GPUs:

.. code-block:: bash

    deepspeed examples/seq2seq/run_translation.py \
    --deepspeed examples/tests/deepspeed/ds_config.json \
    --model_name_or_path t5-small --per_device_train_batch_size 1   \
    --output_dir output_dir --overwrite_output_dir --fp16 \
    --do_train --max_train_samples 500 --num_train_epochs 1 \
    --dataset_name wmt16 --dataset_config "ro-en" \
    --source_lang en --target_lang ro


Note that in the DeepSpeed documentation you are likely to see ``--deepspeed --deepspeed_config ds_config.json`` - i.e.
two DeepSpeed-related arguments, but for the sake of simplicity, and since there are already so many arguments to deal
with, we combined the two into a single argument.

For some practical usage examples, please, see this `post
<https://github.com/huggingface/transformers/issues/8771#issuecomment-759248400>`__.



Deployment with one GPU
=======================================================================================================================

To deploy DeepSpeed with one GPU adjust the :class:`~transformers.Trainer` command line arguments as follows:

.. code-block:: bash

    deepspeed --num_gpus=1 examples/seq2seq/run_translation.py \
    --deepspeed examples/tests/deepspeed/ds_config.json \
    --model_name_or_path t5-small --per_device_train_batch_size 1   \
    --output_dir output_dir --overwrite_output_dir --fp16 \
    --do_train --max_train_samples 500 --num_train_epochs 1 \
    --dataset_name wmt16 --dataset_config "ro-en" \
    --source_lang en --target_lang ro

This is almost the same as with multiple GPUs, but here we tell DeepSpeed explicitly to use just one GPU. By default,
DeepSpeed deploys all GPUs it can see. If you have only 1 GPU to start with, then you don't need this argument. The
following `documentation <https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node>`__ discusses the
launcher options.

Why would you want to use DeepSpeed with just one GPU?

1. It has a ZeRO-offload feature which can delegate some computations and memory to the host's CPU and RAM, and thus
   leave more GPU resources for the model's needs - e.g. a larger batch size, or fitting a very big model which
   normally won't fit.
2. It provides a smart GPU memory management system that minimizes memory fragmentation, which again allows you to fit
   bigger models and data batches.

While we are going to discuss the configuration in detail next, the key to getting a huge improvement on a single GPU
with DeepSpeed is to have at least the following configuration in the configuration file:

.. code-block:: json

  {
    "zero_optimization": {
       "stage": 2,
       "allgather_partitions": true,
       "allgather_bucket_size": 2e8,
       "reduce_scatter": true,
       "reduce_bucket_size": 2e8,
       "overlap_comm": true,
       "contiguous_gradients": true,
       "cpu_offload": true
    }
  }

which enables ``cpu_offload`` and some other important features. You may experiment with the buffer sizes; you will
find more details in the discussion below.

For a practical usage example of this type of deployment, please see this `post
<https://github.com/huggingface/transformers/issues/8771#issuecomment-759176685>`__.

Notes:

- if you need to run on a specific GPU, which is different from GPU 0, you can't use ``CUDA_VISIBLE_DEVICES`` to limit
  the visible scope of available GPUs. Instead, you have to use the following syntax:

   .. code-block:: bash

       deepspeed --include localhost:1 examples/seq2seq/run_translation.py ...

   In this example, we tell DeepSpeed to use GPU 1 (the second GPU).



Deployment in Notebooks
=======================================================================================================================

The problem with running notebook cells as a script is that there is no normal ``deepspeed`` launcher to rely on, so
under certain setups we have to emulate it.

Here is how you'd have to adjust your training code in the notebook to use DeepSpeed.

.. code-block:: python

    # DeepSpeed requires a distributed environment even when only one process is used.
    # This emulates a launcher in the notebook
    import os
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '9994' # modify if RuntimeError: Address already in use
    os.environ['RANK'] = "0"
    os.environ['LOCAL_RANK'] = "0"
    os.environ['WORLD_SIZE'] = "1"

    # Now proceed as normal, plus pass the deepspeed config file
    training_args = TrainingArguments(..., deepspeed="ds_config.json")
    trainer = Trainer(...)
    trainer.train()

Note: ``...`` stands for the normal arguments that you'd pass to the functions.

If you want to create the config file on the fly in the notebook in the current directory, you could have a dedicated
cell with:

.. code-block:: python

    %%bash
    cat <<'EOT' > ds_config.json
    {
        "fp16": {
            "enabled": true,
            "loss_scale": 0,
            "loss_scale_window": 1000,
            "hysteresis": 2,
            "min_loss_scale": 1
        },

        "zero_optimization": {
            "stage": 2,
            "allgather_partitions": true,
            "allgather_bucket_size": 2e8,
            "overlap_comm": true,
            "reduce_scatter": true,
            "reduce_bucket_size": 2e8,
            "contiguous_gradients": true,
            "cpu_offload": true
        },

        "optimizer": {
            "type": "AdamW",
            "params": {
                "lr": 3e-5,
                "betas": [0.8, 0.999],
                "eps": 1e-8,
                "weight_decay": 3e-7
            }
        },

        "scheduler": {
            "type": "WarmupLR",
            "params": {
                "warmup_min_lr": 0,
                "warmup_max_lr": 3e-5,
                "warmup_num_steps": 500
            }
        },

        "steps_per_print": 2000,
        "wall_clock_breakdown": false
    }
    EOT


That said, if the script is not in the notebook cells, you can launch ``deepspeed`` normally via shell from a cell
with:

.. code-block::

   !deepspeed examples/seq2seq/run_translation.py ...

or with bash magic, where you can write multi-line code for the shell to run:

.. code-block::

   %%bash

   cd /somewhere
   deepspeed examples/seq2seq/run_translation.py ...




Configuration
=======================================================================================================================

For the complete guide to the DeepSpeed configuration options that can be used in its configuration file, please refer
to the `following documentation <https://www.deepspeed.ai/docs/config-json/>`__.

You can find dozens of DeepSpeed configuration examples that address various practical needs in `the DeepSpeedExamples
repo <https://github.com/microsoft/DeepSpeedExamples>`__:

.. code-block:: bash

  git clone https://github.com/microsoft/DeepSpeedExamples
  cd DeepSpeedExamples
  find . -name '*json'

Continuing with the example above, let's say you're looking to configure the Lamb optimizer. You can search through
the example ``.json`` files with:

.. code-block:: bash

  grep -i Lamb $(find . -name '*json')

Some more examples are to be found in the `main repo <https://github.com/microsoft/DeepSpeed>`__ as well.

When using DeepSpeed you always need to supply a DeepSpeed configuration file, yet some configuration parameters have
to be configured via the command line. You will find the nuances in the rest of this guide.

To get an idea of what a DeepSpeed configuration file looks like, here is one that activates ZeRO stage 2 features,
enables FP16, and uses the ``AdamW`` optimizer and the ``WarmupLR`` scheduler:

.. code-block:: json

    {
        "fp16": {
            "enabled": true,
            "loss_scale": 0,
            "loss_scale_window": 1000,
            "hysteresis": 2,
            "min_loss_scale": 1
        },

        "zero_optimization": {
            "stage": 2,
            "allgather_partitions": true,
            "allgather_bucket_size": 5e8,
            "overlap_comm": true,
            "reduce_scatter": true,
            "reduce_bucket_size": 5e8,
            "contiguous_gradients": true,
            "cpu_offload": true
        },

        "optimizer": {
            "type": "AdamW",
            "params": {
                "lr": 3e-5,
                "betas": [0.8, 0.999],
                "eps": 1e-8,
                "weight_decay": 3e-7
            }
        },

        "scheduler": {
            "type": "WarmupLR",
            "params": {
                "warmup_min_lr": 0,
                "warmup_max_lr": 3e-5,
                "warmup_num_steps": 500
            }
        }
    }

When you execute the program, DeepSpeed will log the configuration it received from the :class:`~transformers.Trainer`
to the console, so you can see exactly what final configuration was passed to it.


Passing Configuration
=======================================================================================================================

As discussed in this document, the DeepSpeed configuration is normally passed as a path to a JSON file, but if you're
not using the command line interface to configure the training, and instead instantiate the
:class:`~transformers.Trainer` via :class:`~transformers.TrainingArguments`, then for the ``deepspeed`` argument you
can pass a nested ``dict``. This allows you to create the configuration on the fly and doesn't require you to write it
to the file system before passing it to :class:`~transformers.TrainingArguments`.

To summarize, you can do:

.. code-block:: python

    TrainingArguments(..., deepspeed="/path/to/ds_config.json")

or:

.. code-block:: python

    ds_config_dict = dict(scheduler=scheduler_params, optimizer=optimizer_params)
    TrainingArguments(..., deepspeed=ds_config_dict)



Shared Configuration
=======================================================================================================================

Some configuration information is required by both the :class:`~transformers.Trainer` and DeepSpeed to function
correctly. Therefore, to prevent conflicting definitions, which could lead to hard-to-detect errors, we chose to
configure those via the :class:`~transformers.Trainer` command line arguments.

Therefore, the following DeepSpeed configuration params shouldn't be used with the :class:`~transformers.Trainer`:

* ``train_batch_size``
* ``train_micro_batch_size_per_gpu``
* ``gradient_accumulation_steps``

as these will be automatically derived from the runtime environment and the following two command line arguments:

.. code-block:: bash

    --per_device_train_batch_size 8 --gradient_accumulation_steps 2

which are always required to be supplied.

Of course, you will need to adjust the values in this example to your situation.
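
If you configure the :class:`~transformers.Trainer` programmatically rather than via the command line, the same rule
applies: set these values on :class:`~transformers.TrainingArguments` and leave them out of the DeepSpeed configuration
file. Here is a minimal sketch (the file name and values are illustrative):

.. code-block:: python

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="output_dir",
        per_device_train_batch_size=8,   # used to derive train_micro_batch_size_per_gpu
        gradient_accumulation_steps=2,   # shared with the DeepSpeed engine
        fp16=True,
        deepspeed="ds_config.json",      # config file without the batch size/accumulation keys
    )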



ZeRO
=======================================================================================================================

The ``zero_optimization`` section of the configuration file is the most important part (`docs
<https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training>`__), since that is where you define
which ZeRO stages you want to enable and how to configure them.

.. code-block:: json

    {
       "zero_optimization": {
           "stage": 2,
           "allgather_partitions": true,
           "allgather_bucket_size": 5e8,
           "overlap_comm": true,
           "reduce_scatter": true,
           "reduce_bucket_size": 5e8,
           "contiguous_gradients": true,
           "cpu_offload": true
       }
    }

Notes:

- enabling ``cpu_offload`` should reduce GPU RAM usage (it requires ``"stage": 2``)
- ``"overlap_comm": true`` trades off increased GPU RAM usage to lower all-reduce latency. ``overlap_comm`` uses 4.5x
  the ``allgather_bucket_size`` and ``reduce_bucket_size`` values. So if they are set to 5e8, this requires a 9GB
  footprint (``5e8 x 2Bytes x 2 x 4.5``). Therefore, if you have a GPU with 8GB or less RAM, to avoid getting
  OOM-errors you will need to reduce those parameters to about ``2e8``, which would require 3.6GB. You will want to do
  the same on a larger-capacity GPU as well, if you're starting to hit OOM.
- when reducing these buffers you're trading communication speed for more available GPU RAM. The smaller the buffer
  size, the slower the communication, and the more GPU RAM will be available to other tasks. So if a bigger batch size
  is important, getting a slightly slower training time could be a good trade.

This section has to be configured exclusively via DeepSpeed configuration - the :class:`~transformers.Trainer` provides
no equivalent command line arguments.



Optimizer and Scheduler
=======================================================================================================================

As long as you don't enable ``cpu_offload`` you can mix and match DeepSpeed and HuggingFace schedulers and optimizers,
with the exception of using the combination of HuggingFace scheduler and DeepSpeed optimizer:

+--------------+--------------+--------------+
| Combos       | HF Scheduler | DS Scheduler |
+--------------+--------------+--------------+
| HF Optimizer | Yes          | Yes          |
+--------------+--------------+--------------+
| DS Optimizer | No           | Yes          |
+--------------+--------------+--------------+

If ``cpu_offload`` is enabled, you must use both a DeepSpeed scheduler and a DeepSpeed optimizer.



Optimizer
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""


DeepSpeed's main optimizers are Adam, AdamW, OneBitAdam, and Lamb. These have been thoroughly tested with ZeRO and are
thus recommended to be used. It can, however, also import other optimizers from ``torch``. The full documentation is
`here <https://www.deepspeed.ai/docs/config-json/#optimizer-parameters>`__.

If you don't configure the ``optimizer`` entry in the configuration file, the :class:`~transformers.Trainer` will
automatically set it to ``AdamW`` and will use the supplied values or the defaults for the following command line
arguments: ``--learning_rate``, ``--adam_beta1``, ``--adam_beta2``, ``--adam_epsilon`` and ``--weight_decay``.

Here is an example of the pre-configured ``optimizer`` entry for ``AdamW``:

.. code-block:: json

    {
        "optimizer": {
            "type": "AdamW",
            "params": {
                "lr": 0.001,
                "betas": [0.8, 0.999],
                "eps": 1e-8,
                "weight_decay": 3e-7
            }
        }
    }

Note that the command line arguments will override the values in the configuration file. This is so that there is one
definitive source of the values and to avoid hard-to-find errors when, for example, the learning rate is set to
different values in different places. The command line takes precedence. The values that get overridden are:

- ``lr`` with the value of ``--learning_rate``
- ``betas`` with the value of ``--adam_beta1 --adam_beta2``
- ``eps`` with the value of ``--adam_epsilon``
- ``weight_decay`` with the value of ``--weight_decay``

Therefore please remember to tune the shared hyperparameters on the command line.

If you want to use another optimizer which is not listed above, you will have to add ``"zero_allow_untested_optimizer":
true`` to the top level configuration.

If you want to use one of the officially supported optimizers, configure them explicitly in the configuration file,
and make sure to adjust the values, e.g. if you use Adam you will want ``weight_decay`` around ``0.01``.


Scheduler
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

DeepSpeed supports LRRangeTest, OneCycle, WarmupLR and WarmupDecayLR LR schedulers. The full documentation is `here
<https://www.deepspeed.ai/docs/config-json/#scheduler-parameters>`__.


Here is where the schedulers overlap between 🤗 Transformers and DeepSpeed:

* ``WarmupLR`` via ``--lr_scheduler_type constant_with_warmup``
* ``WarmupDecayLR`` via ``--lr_scheduler_type linear``. This is also the default value for ``--lr_scheduler_type``,
  therefore, if you don't configure the scheduler this is the scheduler that will get configured by default.


If you don't configure the ``scheduler`` entry in the configuration file, the :class:`~transformers.Trainer` will use
the values of ``--lr_scheduler_type``, ``--learning_rate`` and ``--warmup_steps`` to configure a 🤗 Transformers version
of it.

Here is an example of the pre-configured ``scheduler`` entry for ``WarmupLR``:

.. code-block:: json

    {
        "scheduler": {
            "type": "WarmupLR",
            "params": {
                "warmup_min_lr": 0,
                "warmup_max_lr": 0.001,
                "warmup_num_steps": 1000
            }
        }
    }

Note that the command line arguments will override the values in the configuration file. This is so that there is one
definitive source of the values and to avoid hard-to-find errors when, for example, the learning rate is set to
different values in different places. The command line takes precedence. The values that get overridden are:

- ``warmup_max_lr`` with the value of ``--learning_rate``
- ``warmup_num_steps`` with the value of ``--warmup_steps``
- ``total_num_steps`` with either the value of ``--max_steps`` or if it is not provided, derived automatically at run
  time based on the environment and the size of the dataset and other command line arguments (needed for
  ``WarmupDecayLR``).

Therefore please remember to tune the shared hyperparameters on the command line.

For example, for ``WarmupDecayLR``, you can use the following entry:

.. code-block:: json

    {
        "scheduler": {
            "type": "WarmupDecayLR",
            "params": {
                "total_num_steps": 10,
                "last_batch_iteration": -1,
                "warmup_min_lr": 0,
                "warmup_max_lr": 0.001,
                "warmup_num_steps": 1000
            }
        }
    }

and ``warmup_max_lr``, ``warmup_num_steps`` and ``total_num_steps`` will be corrected at loading time.



Automatic Mixed Precision
=======================================================================================================================

You can work with FP16 in one of the following ways:

1. PyTorch native AMP, as documented `here <https://www.deepspeed.ai/docs/config-json/#fp16-training-options>`__.
2. NVIDIA's apex, as documented `here
   <https://www.deepspeed.ai/docs/config-json/#automatic-mixed-precision-amp-training-options>`__.

If you want to use an equivalent of the PyTorch native AMP, you can either configure the ``fp16`` entry in the
configuration file, or use the following command line arguments: ``--fp16 --fp16_backend amp``.

Here is an example of the ``fp16`` configuration:

.. code-block:: json

    {
        "fp16": {
            "enabled": true,
            "loss_scale": 0,
            "loss_scale_window": 1000,
            "hysteresis": 2,
            "min_loss_scale": 1
        }
    }

If you want to use NVIDIA's apex instead, you can either configure the ``amp`` entry in the configuration file, or use
the following command line arguments: ``--fp16 --fp16_backend apex --fp16_opt_level O1``.

Here is an example of the ``amp`` configuration:

.. code-block:: json

    {
        "amp": {
            "enabled": true,
            "opt_level": "O1"
        }
    }


Gradient Accumulation
=======================================================================================================================

While DeepSpeed normally gets gradient accumulation configured with:

.. code-block:: json

    {
        "gradient_accumulation_steps": 3,
    }

in this case, to enable gradient accumulation, pass the command line ``--gradient_accumulation_steps`` argument as
normal and it will get injected into the DeepSpeed configuration.

If you try to add it directly to the configuration file, you will receive an error from the Trainer - this is because
this setting is needed by the Trainer too, and so this approach ensures that there is a single way of setting this
value and thus avoids potential subtle errors.






Gradient Clipping
=======================================================================================================================

If you don't configure the ``gradient_clipping`` entry in the configuration file, the :class:`~transformers.Trainer`
will use the value of the ``--max_grad_norm`` command line argument to set it.

Here is an example of the ``gradient_clipping`` configuration:

.. code-block:: json

    {
        "gradient_clipping": 1.0,
    }



Notes
=======================================================================================================================

* DeepSpeed works with the PyTorch :class:`~transformers.Trainer` but not TF :class:`~transformers.TFTrainer`.
* While DeepSpeed has a pip-installable PyPI package, it is highly recommended that it gets installed from `source
  <https://github.com/microsoft/deepspeed#installation>`__ to best match your hardware and also if you need to enable
  certain features, like 1-bit Adam, which aren't available in the PyPI distribution.
* You don't have to use the :class:`~transformers.Trainer` to use DeepSpeed with 🤗 Transformers - you can use any model
  with your own trainer, and you will have to adapt the latter according to `the DeepSpeed integration instructions
  <https://www.deepspeed.ai/getting-started/#writing-deepspeed-models>`__.

Main DeepSpeed Resources
=======================================================================================================================

- `Project's GitHub <https://github.com/microsoft/deepspeed>`__
- `Usage docs <https://www.deepspeed.ai/getting-started/>`__
- `API docs <https://deepspeed.readthedocs.io/en/latest/index.html>`__
- `Blog posts <https://www.microsoft.com/en-us/research/search/?q=deepspeed>`__

Papers:

- `ZeRO: Memory Optimizations Toward Training Trillion Parameter Models <https://arxiv.org/abs/1910.02054>`__
- `ZeRO-Offload: Democratizing Billion-Scale Model Training <https://arxiv.org/abs/2101.06840>`__

Finally, please remember that the HuggingFace :class:`~transformers.Trainer` only integrates DeepSpeed, therefore if
you have any problems or questions with regard to DeepSpeed usage, please file an issue with the `DeepSpeed GitHub
<https://github.com/microsoft/DeepSpeed/issues>`__.