Training API
============

:func:`deepspeed.initialize` returns a *training engine* of type
:class:`DeepSpeedEngine` as the first element of its return value. This engine
is used to progress training:

.. code-block:: python

    for step, batch in enumerate(data_loader):
        # Forward pass: calls the engine's forward() method
        loss = model_engine(batch)

        # Backward pass: runs backpropagation
        model_engine.backward(loss)

        # Weight update
        model_engine.step()
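
The loop above assumes that ``model_engine`` already exists. As a minimal
sketch, it could be obtained as shown here. The ``ds_config`` values and the
``train`` helper are hypothetical placeholders, ``model(batch)`` is assumed to
return a scalar loss, batches are assumed to already be on the engine's device,
and the ``config`` keyword assumes a reasonably recent DeepSpeed release. The
script is also assumed to be started in a way that sets up the distributed
environment, for example with the ``deepspeed`` launcher.

.. code-block:: python

    # Hypothetical configuration; the values are placeholders, not recommendations.
    ds_config = {
        "train_micro_batch_size_per_gpu": 8,
        "gradient_accumulation_steps": 1,
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    }


    def train(model, data_loader):
        """Run one pass over data_loader with a DeepSpeed engine (sketch)."""
        # Imported lazily so this sketch can be loaded without DeepSpeed installed.
        import deepspeed

        # The training engine is the first element of the returned tuple.
        model_engine, _, _, _ = deepspeed.initialize(
            model=model,
            model_parameters=model.parameters(),
            config=ds_config,
        )

        for step, batch in enumerate(data_loader):
            loss = model_engine(batch)   # forward pass
            model_engine.backward(loss)  # backpropagation
            model_engine.step()          # weight update

        return model_engine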

Forward Propagation
-------------------
.. autofunction:: deepspeed.DeepSpeedEngine.forward

Backward Propagation
--------------------
.. autofunction:: deepspeed.DeepSpeedEngine.backward

Optimizer Step
--------------
.. autofunction:: deepspeed.DeepSpeedEngine.step

Gradient Accumulation
---------------------
.. autofunction:: deepspeed.DeepSpeedEngine.is_gradient_accumulation_boundary
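
To illustrate the idea of an accumulation boundary, here is a toy model of the
bookkeeping in plain Python. ``ToyAccumulationCounter`` is a hypothetical
stand-in, not DeepSpeed code. It assumes that a boundary falls on the last
micro-batch of every group of ``gradient_accumulation_steps`` micro-batches,
and that weights would only be updated at a boundary.

.. code-block:: python

    class ToyAccumulationCounter:
        """Hypothetical stand-in that mimics only the boundary bookkeeping."""

        def __init__(self, gradient_accumulation_steps):
            self.gradient_accumulation_steps = gradient_accumulation_steps
            self.micro_steps = 0  # micro-batches processed so far

        def is_gradient_accumulation_boundary(self):
            # True when the current micro-batch is the last one of its group.
            return (self.micro_steps + 1) % self.gradient_accumulation_steps == 0

        def step(self):
            """Advance by one micro-batch; return True if weights would update."""
            at_boundary = self.is_gradient_accumulation_boundary()
            self.micro_steps += 1
            return at_boundary


    counter = ToyAccumulationCounter(gradient_accumulation_steps=4)
    updates = [counter.step() for _ in range(8)]
    print(updates)
    # [False, False, False, True, False, False, False, True]

With four accumulation steps, only every fourth call to ``step()`` in this toy
model corresponds to a weight update; the other calls only accumulate.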


Model Saving
------------
.. autofunction:: deepspeed.DeepSpeedEngine.save_16bit_model


Additionally, when a DeepSpeed checkpoint is created, a ``zero_to_fp32.py`` script is placed in the checkpoint directory. It can be used to reconstruct the fp32 master weights into a single PyTorch ``state_dict`` file.
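
The same reconstruction can also be done from Python rather than by running
the script. The sketch below uses
``get_fp32_state_dict_from_zero_checkpoint`` from
``deepspeed.utils.zero_to_fp32``. The ``load_fp32_weights`` helper is
hypothetical, and ``checkpoint_dir`` is assumed to be the directory that the
checkpoint was saved to. The consolidation happens on the CPU, so it may need
a substantial amount of host memory for large models.

.. code-block:: python

    def load_fp32_weights(model, checkpoint_dir):
        """Load consolidated fp32 weights from a ZeRO checkpoint into model (sketch)."""
        # Imported lazily so this sketch can be loaded without DeepSpeed installed.
        from deepspeed.utils.zero_to_fp32 import (
            get_fp32_state_dict_from_zero_checkpoint,
        )

        # Rebuild a single fp32 state_dict from the partitioned checkpoint files.
        state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)

        # model is assumed to be a plain torch.nn.Module, not the DeepSpeed engine.
        model.load_state_dict(state_dict)
        return model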