..
    Copyright 2020 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

Trainer
-----------------------------------------------------------------------------------------------------------------------

The :class:`~transformers.Trainer` and :class:`~transformers.TFTrainer` classes provide an API for feature-complete
training in most standard use cases. It's used in most of the :doc:`example scripts <../examples>`.

Before instantiating your :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer`, create a
:class:`~transformers.TrainingArguments`/:class:`~transformers.TFTrainingArguments` to access all the points of
customization during training.

The API supports distributed training on multiple GPUs/TPUs, mixed precision through `NVIDIA Apex
<https://github.com/NVIDIA/apex>`__ and Native AMP for PyTorch and :obj:`tf.keras.mixed_precision` for TensorFlow.

Both :class:`~transformers.Trainer` and :class:`~transformers.TFTrainer` contain the basic training loop which supports
the above features. To inject custom behavior you can subclass them and override the following methods:

- **get_train_dataloader**/**get_train_tfdataset** -- Creates the training DataLoader (PyTorch) or TF Dataset.
- **get_eval_dataloader**/**get_eval_tfdataset** -- Creates the evaluation DataLoader (PyTorch) or TF Dataset.
- **get_test_dataloader**/**get_test_tfdataset** -- Creates the test DataLoader (PyTorch) or TF Dataset.
- **log** -- Logs information on the various objects watching training.
- **create_optimizer_and_scheduler** -- Sets up the optimizer and learning rate scheduler if they were not passed at
  init. Note, that you can also subclass or override the ``create_optimizer`` and ``create_scheduler`` methods
  separately.
- **create_optimizer** -- Sets up the optimizer if it wasn't passed at init.
- **create_scheduler** -- Sets up the learning rate scheduler if it wasn't passed at init.
- **compute_loss** -- Computes the loss on a batch of training inputs.
- **training_step** -- Performs a training step.
- **prediction_step** -- Performs an evaluation/test step.
- **run_model** (TensorFlow only) -- Basic pass through the model.
- **evaluate** -- Runs an evaluation loop and returns metrics.
- **predict** -- Returns predictions (with metrics if labels are available) on a test set.

.. warning::

    The :class:`~transformers.Trainer` class is optimized for 🤗 Transformers models and can have surprising behaviors
    when you use it on other models. When using it on your own model, make sure:

    - your model always returns tuples or subclasses of :class:`~transformers.file_utils.ModelOutput`.
    - your model can compute the loss if a :obj:`labels` argument is provided, and that loss is returned as the first
      element of the tuple (if your model returns tuples).
    - your model can accept multiple label arguments (use :obj:`label_names` in your
      :class:`~transformers.TrainingArguments` to indicate their names to the :class:`~transformers.Trainer`), but none
      of them should be named :obj:`"label"`.
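
These constraints can be illustrated with a minimal hypothetical model (not part of the library, just a sketch) that
follows the conventions above: it accepts a ``labels`` argument, always returns a tuple, and puts the loss first:

```python
import torch
from torch import nn


class TrainerCompatibleRegressor(nn.Module):
    """Hypothetical model for illustration only; not part of transformers."""

    def __init__(self, num_features=4):
        super().__init__()
        self.linear = nn.Linear(num_features, 1)

    def forward(self, input_values, labels=None):
        logits = self.linear(input_values).squeeze(-1)
        if labels is not None:
            # when labels are provided, the loss must be the first tuple element
            loss = nn.functional.mse_loss(logits, labels)
            return (loss, logits)
        return (logits,)
```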

Here is an example of how to customize :class:`~transformers.Trainer` using a custom loss function for multi-label
classification:

.. code-block:: python

    from torch import nn
    from transformers import Trainer

    class MultilabelTrainer(Trainer):
        def compute_loss(self, model, inputs, return_outputs=False):
            labels = inputs.pop("labels")
            outputs = model(**inputs)
            logits = outputs.logits
            loss_fct = nn.BCEWithLogitsLoss()
            loss = loss_fct(logits.view(-1, self.model.config.num_labels),
                            labels.float().view(-1, self.model.config.num_labels))
            return (loss, outputs) if return_outputs else loss

Another way to customize the training loop behavior for the PyTorch :class:`~transformers.Trainer` is to use
:doc:`callbacks <callback>` that can inspect the training loop state (for progress reporting, logging on TensorBoard or
other ML platforms...) and take decisions (like early stopping).
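
For instance, here is a minimal sketch of such a callback; it is an illustrative example (the class name and the
recorded fields are our own choices), assuming all you want is to collect the losses the Trainer logs:

```python
from transformers import TrainerCallback


class LossHistoryCallback(TrainerCallback):
    """Illustrative callback: records the loss every time the Trainer logs."""

    def __init__(self):
        self.losses = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        # ``logs`` holds the metrics being logged at this step
        if logs is not None and "loss" in logs:
            self.losses.append((state.global_step, logs["loss"]))
```

You would then pass an instance to the Trainer via ``Trainer(..., callbacks=[LossHistoryCallback()])``.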


Trainer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.Trainer
    :members:

Seq2SeqTrainer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.Seq2SeqTrainer
    :members: evaluate, predict


TFTrainer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFTrainer
    :members:

TrainingArguments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TrainingArguments
    :members:

Seq2SeqTrainingArguments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.Seq2SeqTrainingArguments
    :members:


TFTrainingArguments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFTrainingArguments
    :members:


Logging
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default :class:`~transformers.Trainer` will use ``logging.INFO`` for the main process and ``logging.WARNING`` for
the replicas if any.

These defaults can be overridden to use any of the 5 ``logging`` levels with :class:`~transformers.TrainingArguments`'s
arguments:

- ``log_level`` - for the main process
- ``log_level_replica`` - for the replicas

Further, if :class:`~transformers.TrainingArguments`'s ``log_on_each_node`` is set to ``False``, only the main node
will use the log level settings for its main process; all other nodes will use the log level settings for replicas.

Note that :class:`~transformers.Trainer` is going to set ``transformers``'s log level separately for each node in its
:meth:`~transformers.Trainer.__init__`. So you may want to set this sooner (see the next example) if you tap into other
``transformers`` functionality before creating the :class:`~transformers.Trainer` object.
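
The interaction of ``log_level``, ``log_level_replica`` and ``log_on_each_node`` can be summarized in a small sketch.
This is illustrative pseudologic of the rules described above, not the actual
``TrainingArguments.get_process_log_level`` implementation:

```python
import logging


def process_log_level(is_main_process, is_main_node, log_level=logging.INFO,
                      log_level_replica=logging.WARNING, log_on_each_node=True):
    """Sketch of the per-process log level rules described above."""
    if log_on_each_node:
        # every node applies the main-process level to its own main process
        return log_level if is_main_process else log_level_replica
    # only the main node's main process uses the main-process level;
    # everything else falls back to the replica level
    if is_main_process and is_main_node:
        return log_level
    return log_level_replica
```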

Here is an example of how this can be used in an application:

.. code-block:: python

    [...]
    logger = logging.getLogger(__name__)

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        handlers=[logging.StreamHandler(sys.stdout)],
    )

    # set the main code and the modules it uses to the same log-level according to the node
    log_level = training_args.get_process_log_level()
    logger.setLevel(log_level)
    datasets.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.set_verbosity(log_level)

    trainer = Trainer(...)

Then, if you want to see warnings only on the main node and have all other nodes suppress their (most likely
duplicated) warnings, you could run it as:

.. code-block:: bash

    my_app.py ... --log_level warning --log_level_replica error

In the multi-node environment if you also don't want the logs to repeat for each node's main process, you will want to
change the above to:

.. code-block:: bash

    my_app.py ... --log_level warning --log_level_replica error --log_on_each_node 0

and then only the main process of the first node will log at the "warning" level, and all other processes on the main
node and all processes on other nodes will log at the "error" level.

If you need your application to be as quiet as possible you could do:

.. code-block:: bash

    my_app.py ... --log_level error --log_level_replica error --log_on_each_node 0

(add ``--log_on_each_node 0`` if in a multi-node environment)



Randomness
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When resuming from a checkpoint generated by :class:`~transformers.Trainer` all efforts are made to restore the
``python``, ``numpy`` and ``pytorch`` RNG states to the same states as they were at the moment of saving that
checkpoint, which should make the "stop and resume" style of training as close as possible to non-stop training.
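
The same idea can be demonstrated with Python's built-in generator alone; this is just an illustration of the
principle, not the Trainer checkpoint code:

```python
import random

# take a snapshot of the RNG state, as a checkpoint would
saved_state = random.getstate()
draws_before = [random.random() for _ in range(3)]

# "resume": restore the snapshot and replay
random.setstate(saved_state)
draws_after = [random.random() for _ in range(3)]

# the resumed run reproduces exactly the same sequence of draws
assert draws_before == draws_after
```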

However, due to various default non-deterministic pytorch settings this might not fully work. If you want full
determinism please refer to `Controlling sources of randomness
<https://pytorch.org/docs/stable/notes/randomness.html>`__. As explained in that document, some of the settings that
make things deterministic (e.g., ``torch.backends.cudnn.deterministic``) may slow things down, therefore this can't be
done by default, but you can enable those yourself if needed.


Trainer Integrations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



The :class:`~transformers.Trainer` has been extended to support libraries that may dramatically improve your training
time and fit much bigger models.

Currently it supports third party solutions, `DeepSpeed <https://github.com/microsoft/DeepSpeed>`__ and `FairScale
<https://github.com/facebookresearch/fairscale/>`__, which implement parts of the paper `ZeRO: Memory Optimizations
Toward Training Trillion Parameter Models, by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He
<https://arxiv.org/abs/1910.02054>`__.

This provided support is new and experimental as of this writing.

.. _zero-install-notes:

CUDA Extension Installation Notes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As of this writing, both FairScale and Deepspeed require compilation of CUDA C++ code before they can be used.

While all installation issues should be dealt with through the corresponding GitHub Issues of `FairScale
<https://github.com/facebookresearch/fairscale/issues>`__ and `Deepspeed
<https://github.com/microsoft/DeepSpeed/issues>`__, there are a few common issues that one may encounter while building
any PyTorch extension that needs to build CUDA extensions.

Therefore, if you encounter a CUDA-related build issue while doing one of the following or both:

.. code-block:: bash

    pip install fairscale
    pip install deepspeed

please read the following notes first.

In these notes we give examples for what to do when ``pytorch`` has been built with CUDA ``10.2``. If your situation is
different remember to adjust the version number to the one you are after.

Possible problem #1
=======================================================================================================================

While PyTorch comes with its own CUDA toolkit, to build these two projects you must have an identical version of CUDA
installed system-wide.

For example, if you installed ``pytorch`` with ``cudatoolkit==10.2`` in the Python environment, you also need to have
CUDA ``10.2`` installed system-wide.

The exact location may vary from system to system, but ``/usr/local/cuda-10.2`` is the most common location on many
Unix systems. When CUDA is correctly set up and added to the ``PATH`` environment variable, one can find the
installation location by doing:

.. code-block:: bash

    which nvcc

If you don't have CUDA installed system-wide, install it first. You will find the instructions by using your favorite
search engine. For example, if you're on Ubuntu you may want to search for: `ubuntu cuda 10.2 install
<https://www.google.com/search?q=ubuntu+cuda+10.2+install>`__.

Possible problem #2
=======================================================================================================================

Another possible common problem is that you may have more than one CUDA toolkit installed system-wide. For example you
may have:

.. code-block:: bash

    /usr/local/cuda-10.2
    /usr/local/cuda-11.0

Now, in this situation you need to make sure that your ``PATH`` and ``LD_LIBRARY_PATH`` environment variables contain
the correct paths to the desired CUDA version. Typically, package installers will set these to contain whatever version
was installed last. If you encounter the problem where the package build fails because it can't find the right CUDA
version despite it being installed system-wide, it means that you need to adjust the two aforementioned environment
variables.

First, you may look at their contents:

.. code-block:: bash

    echo $PATH
    echo $LD_LIBRARY_PATH

so you get an idea of what is inside.

It's possible that ``LD_LIBRARY_PATH`` is empty.

``PATH`` lists the locations where executables can be found and ``LD_LIBRARY_PATH`` is where shared libraries are
looked for. In both cases, earlier entries have priority over later ones. ``:`` is used to separate multiple entries.
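
As an illustration of that priority rule, here is a toy ``which``-style resolver over a fake directory listing (not a
real filesystem lookup; the paths and helper are made up for the example):

```python
import os


def resolve(path_var, executable, fake_dirs):
    """Toy ``which``-style lookup: earlier ``:``-separated entries win."""
    for directory in path_var.split(":"):
        if executable in fake_dirs.get(directory, ()):
            return os.path.join(directory, executable)
    return None


fake_dirs = {
    "/usr/local/cuda-10.2/bin": {"nvcc"},
    "/usr/local/cuda-11.0/bin": {"nvcc"},
}

# if the last-installed toolkit happens to be listed first, it wins...
assert resolve("/usr/local/cuda-11.0/bin:/usr/bin", "nvcc", fake_dirs) == "/usr/local/cuda-11.0/bin/nvcc"
# ...prepending the desired version makes it win instead
path = "/usr/local/cuda-10.2/bin:/usr/local/cuda-11.0/bin:/usr/bin"
assert resolve(path, "nvcc", fake_dirs) == "/usr/local/cuda-10.2/bin/nvcc"
```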

Now, to tell the build program where to find the specific CUDA toolkit, insert the desired paths to be listed first by
doing:

.. code-block:: bash

    export PATH=/usr/local/cuda-10.2/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH

Note that we aren't overwriting the existing values, but prepending instead.

Of course, adjust the version number and the full path if need be. Check that the directories you assign actually do
exist. The ``lib64`` sub-directory is where the various CUDA ``.so`` objects, like ``libcudart.so``, reside; it's
unlikely that your system will have it named differently, but if it does, adjust it to reflect your reality.


Possible problem #3
=======================================================================================================================

Some older CUDA versions may refuse to build with newer compilers. For example, you may have ``gcc-9`` but CUDA wants
``gcc-7``.

There are various ways to go about it.

If you can install the latest CUDA toolkit it typically should support the newer compiler.

Alternatively, you could install the lower version of the compiler in addition to the one you already have, or you may
already have it but it's not the default one, so the build system can't see it. If you have ``gcc-7`` installed but the
build system complains it can't find it, the following might do the trick:

.. code-block:: bash

    sudo ln -s /usr/bin/gcc-7  /usr/local/cuda-10.2/bin/gcc
    sudo ln -s /usr/bin/g++-7  /usr/local/cuda-10.2/bin/g++


Here, we are making a symlink to ``gcc-7`` from ``/usr/local/cuda-10.2/bin/gcc``, and since
``/usr/local/cuda-10.2/bin/`` should be in the ``PATH`` environment variable (see the previous problem's solution), it
should find ``gcc-7`` (and ``g++-7``) and then the build will succeed.

As always make sure to edit the paths in the example to match your situation.

FairScale
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By integrating `FairScale <https://github.com/facebookresearch/fairscale/>`__ the :class:`~transformers.Trainer`
provides support for the following features from `the ZeRO paper <https://arxiv.org/abs/1910.02054>`__:

1. Optimizer State Sharding
2. Gradient Sharding
3. Model Parameters Sharding (new and very experimental)
4. CPU offload (new and very experimental)

You will need at least two GPUs to use this feature.

**Installation**:

Install the library via pypi:

.. code-block:: bash

    pip install fairscale

or via ``transformers``' ``extras``:

.. code-block:: bash

    pip install transformers[fairscale]

(will become available starting from ``transformers==4.6.0``)

or find more details on `FairScale's GitHub page <https://github.com/facebookresearch/fairscale/#installation>`__.

If you're still struggling with the build, first make sure to read :ref:`zero-install-notes`.

If that still doesn't resolve the build issue, here are a few more ideas.

``fairscale`` seems to have an issue with pip's recently introduced build isolation feature. If you have a problem with
it, you may want to try one of:

.. code-block:: bash

    pip install fairscale --no-build-isolation

or:

.. code-block:: bash

    git clone https://github.com/facebookresearch/fairscale/
    cd fairscale
    rm -r dist build
    python setup.py bdist_wheel
    pip uninstall -y fairscale
    pip install dist/fairscale-*.whl

``fairscale`` also has issues with building against pytorch-nightly, so if you use it you may have to try one of:

.. code-block:: bash

    pip uninstall -y fairscale; pip install fairscale --pre \
    -f https://download.pytorch.org/whl/nightly/cu110/torch_nightly.html \
    --no-cache --no-build-isolation

or:

.. code-block:: bash

    pip install -v --disable-pip-version-check . \
    -f https://download.pytorch.org/whl/nightly/cu110/torch_nightly.html --pre

Of course, adjust the URLs to match the CUDA version you use.

If after trying everything suggested you still encounter build issues, please, proceed with the GitHub Issue of
`FairScale <https://github.com/facebookresearch/fairscale/issues>`__.



**Usage**:

To use the first version of Sharded data-parallelism, add ``--sharded_ddp simple`` to the command line arguments, and
make sure you have added the distributed launcher ``-m torch.distributed.launch
--nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE`` if you haven't been using it already.

For example, here is how you could use it for ``run_translation.py`` with 2 GPUs:

.. code-block:: bash

    python -m torch.distributed.launch --nproc_per_node=2 examples/pytorch/translation/run_translation.py \
    --model_name_or_path t5-small --per_device_train_batch_size 1   \
    --output_dir output_dir --overwrite_output_dir \
    --do_train --max_train_samples 500 --num_train_epochs 1 \
    --dataset_name wmt16 --dataset_config "ro-en" \
    --source_lang en --target_lang ro \
    --fp16 --sharded_ddp simple

Notes:

- This feature requires distributed training (so multiple GPUs).
- It is not implemented for TPUs.
- It works with ``--fp16`` too, to make things even faster.
- One of the main benefits of enabling ``--sharded_ddp simple`` is that it uses a lot less GPU memory, so you should be
  able to use significantly larger batch sizes using the same hardware (e.g. 3x and even bigger) which should lead to
  significantly shorter training time.

To use the second version of Sharded data-parallelism, add ``--sharded_ddp zero_dp_2`` or ``--sharded_ddp zero_dp_3``
to the command line arguments, and make sure you have added the distributed launcher ``-m torch.distributed.launch
--nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE`` if you haven't been using it already.

For example, here is how you could use it for ``run_translation.py`` with 2 GPUs:

.. code-block:: bash

    python -m torch.distributed.launch --nproc_per_node=2 examples/pytorch/translation/run_translation.py \
    --model_name_or_path t5-small --per_device_train_batch_size 1   \
    --output_dir output_dir --overwrite_output_dir \
    --do_train --max_train_samples 500 --num_train_epochs 1 \
    --dataset_name wmt16 --dataset_config "ro-en" \
    --source_lang en --target_lang ro \
    --fp16 --sharded_ddp zero_dp_2

:obj:`zero_dp_2` is an optimized version of the simple wrapper, while :obj:`zero_dp_3` fully shards model weights,
gradients and optimizer states.

Both are compatible with adding :obj:`cpu_offload` to enable ZeRO-offload (activate it like this: :obj:`--sharded_ddp
"zero_dp_2 cpu_offload"`).
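
Since the value of ``--sharded_ddp`` is a space-separated list of options, combinations can be sanity-checked with a
few lines. This is a hypothetical helper written for illustration, not part of ``transformers``; the "exactly one
sharding mode" rule is our assumption based on the options described above:

```python
def parse_sharded_ddp(value):
    """Hypothetical validator for the space-separated ``--sharded_ddp`` value."""
    known = {"simple", "zero_dp_2", "zero_dp_3", "cpu_offload", "auto_wrap"}
    options = set(value.split())
    unknown = options - known
    if unknown:
        raise ValueError(f"unknown --sharded_ddp options: {sorted(unknown)}")
    # assumption: exactly one sharding mode must be chosen
    if len(options & {"simple", "zero_dp_2", "zero_dp_3"}) != 1:
        raise ValueError("exactly one of simple/zero_dp_2/zero_dp_3 must be given")
    return options
```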

Notes:

- This feature requires distributed training (so multiple GPUs).
- It is not implemented for TPUs.
- It works with ``--fp16`` too, to make things even faster.
- The ``cpu_offload`` additional option requires ``--fp16``.
- This is an area of active development, so make sure you have a source install of fairscale to use this feature as
  some bugs you encounter may have been fixed there already.

Known caveats:

- This feature is incompatible with :obj:`--predict_with_generate` in the ``run_translation.py`` script.
- Using :obj:`--sharded_ddp zero_dp_3` requires wrapping each layer of the model in the special container
  :obj:`FullyShardedDataParallel` of fairscale. It should be used with the option :obj:`auto_wrap` if you are not
  doing this yourself: :obj:`--sharded_ddp "zero_dp_3 auto_wrap"`.


DeepSpeed
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Moved to :ref:`deepspeed-trainer-integration`.

Installation
=======================================================================================================================

Moved to :ref:`deepspeed-installation`.

Deployment with multiple GPUs
=======================================================================================================================

Moved to :ref:`deepspeed-multi-gpu`.

Deployment with one GPU
=======================================================================================================================

Moved to :ref:`deepspeed-one-gpu`.

Deployment in Notebooks
=======================================================================================================================

Moved to :ref:`deepspeed-notebook`.

Configuration
=======================================================================================================================

Moved to :ref:`deepspeed-config`.

Passing Configuration
=======================================================================================================================

Moved to :ref:`deepspeed-config-passing`.

Shared Configuration
=======================================================================================================================

Moved to :ref:`deepspeed-config-shared`.

ZeRO
=======================================================================================================================

Moved to :ref:`deepspeed-zero`.

ZeRO-2 Config
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Moved to :ref:`deepspeed-zero2-config`.

ZeRO-3 Config
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Moved to :ref:`deepspeed-zero3-config`.

NVMe Support
=======================================================================================================================

Moved to :ref:`deepspeed-nvme`.

ZeRO-2 vs ZeRO-3 Performance
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Moved to :ref:`deepspeed-zero2-zero3-performance`.

ZeRO-2 Example
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Moved to :ref:`deepspeed-zero2-example`.

ZeRO-3 Example
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Moved to :ref:`deepspeed-zero3-example`.

Optimizer and Scheduler
=======================================================================================================================

Optimizer
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Moved to :ref:`deepspeed-optimizer`.

Scheduler
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Moved to :ref:`deepspeed-scheduler`.

fp32 Precision
=======================================================================================================================

Moved to :ref:`deepspeed-fp32`.

Automatic Mixed Precision
=======================================================================================================================

Moved to :ref:`deepspeed-amp`.

Batch Size
=======================================================================================================================

Moved to :ref:`deepspeed-bs`.

Gradient Accumulation
=======================================================================================================================

Moved to :ref:`deepspeed-grad-acc`.

Gradient Clipping
=======================================================================================================================

Moved to :ref:`deepspeed-grad-clip`.

Getting The Model Weights Out
=======================================================================================================================

Moved to :ref:`deepspeed-weight-extraction`.