Commit 18587639: [doc porting] several docs (#14858)
Authored by Stas Bekman, co-authored by Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* [doc porting] 2 docs
* Apply suggestions from code review
* Update docs/source/main_classes/deepspeed.mdx
* cleanup
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Debugging
## Underflow and Overflow Detection
<Tip>
This feature is currently available for PyTorch only.
</Tip>
<Tip>
For multi-GPU training it requires DDP (`torch.distributed.launch`).
</Tip>
<Tip>
This feature can be used with any `nn.Module`-based model.
</Tip>
If you start getting `loss=NaN` or the model exhibits some other abnormal behavior due to `inf` or `nan` in
activations or weights, you need to discover where the first underflow or overflow happens and what led to it. Luckily
you can accomplish that easily by activating a special module that will do the detection automatically.
If you're using [`Trainer`], you just need to add:
```bash
--debug underflow_overflow
```
to the normal command line arguments, or pass `debug="underflow_overflow"` when creating the
[`TrainingArguments`] object.
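For example, a minimal sketch of the programmatic route might look like this (the `model` and `train_dataset` objects are placeholders for your usual setup):

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="output_dir",
    # activate the underflow/overflow detector without touching the command line
    debug="underflow_overflow",
)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```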
If you're using your own training loop or another Trainer you can accomplish the same with:
```python
from transformers.debug_utils import DebugUnderflowOverflow

debug_overflow = DebugUnderflowOverflow(model)
```
[`~debug_utils.DebugUnderflowOverflow`] inserts hooks into the model that, immediately after each forward call,
test the input and output variables and also the corresponding module's weights. As soon as `inf` or `nan` is
detected in at least one element of the activations or weights, the program asserts and prints a report like this
(this one was caught with `google/mt5-small` under fp16 mixed precision):
```
Detected inf/nan during batch_number=0
Last 21 forward frames:
abs min abs max metadata
encoder.block.1.layer.1.DenseReluDense.dropout Dropout
0.00e+00 2.57e+02 input[0]
0.00e+00 2.85e+02 output
[...]
encoder.block.2.layer.0 T5LayerSelfAttention
6.78e-04 3.15e+03 input[0]
2.65e-04 3.42e+03 output[0]
None output[1]
2.25e-01 1.00e+04 output[2]
encoder.block.2.layer.1.layer_norm T5LayerNorm
8.69e-02 4.18e-01 weight
2.65e-04 3.42e+03 input[0]
1.79e-06 4.65e+00 output
encoder.block.2.layer.1.DenseReluDense.wi_0 Linear
2.17e-07 4.50e+00 weight
1.79e-06 4.65e+00 input[0]
2.68e-06 3.70e+01 output
encoder.block.2.layer.1.DenseReluDense.wi_1 Linear
8.08e-07 2.66e+01 weight
1.79e-06 4.65e+00 input[0]
1.27e-04 2.37e+02 output
encoder.block.2.layer.1.DenseReluDense.dropout Dropout
0.00e+00 8.76e+03 input[0]
0.00e+00 9.74e+03 output
encoder.block.2.layer.1.DenseReluDense.wo Linear
1.01e-06 6.44e+00 weight
0.00e+00 9.74e+03 input[0]
3.18e-04 6.27e+04 output
encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense
1.79e-06 4.65e+00 input[0]
3.18e-04 6.27e+04 output
encoder.block.2.layer.1.dropout Dropout
3.18e-04 6.27e+04 input[0]
0.00e+00 inf output
```
The example output has been trimmed in the middle for brevity.

The second column shows the value of the absolute largest element. If you take a closer look at the last few frames,
the inputs and outputs were in the range of `1e4`, so when this training was done under fp16 mixed precision the very
last step overflowed (since under `fp16` the largest representable number before `inf` is `65504`, roughly `64e3`). To
avoid overflows under `fp16` the activations must remain way below `1e4`, because `1e4 * 1e4 = 1e8`, so any matrix
multiplication with large activations is going to lead to a numerical overflow condition.
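As a quick sanity check of that limit, you can query the fp16 range directly:

```python
import torch

# fp16 values are representable only up to 65504; anything larger becomes inf
print(torch.finfo(torch.float16).max)  # 65504.0
```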
At the very start of the trace you can discover at which batch number the problem occurred (here `Detected inf/nan during batch_number=0` means the problem occurred on the first batch).
Each reported frame starts by declaring the fully qualified entry for the corresponding module this frame is reporting
for. If we look just at this frame:
```
encoder.block.2.layer.1.layer_norm T5LayerNorm
8.69e-02 4.18e-01 weight
2.65e-04 3.42e+03 input[0]
1.79e-06 4.65e+00 output
```
Here, `encoder.block.2.layer.1.layer_norm` indicates that it was a layer norm for the first layer of the second
block of the encoder, and that the specific `forward` call was that of `T5LayerNorm`.
Let's look at the last few frames of that report:
```
Detected inf/nan during batch_number=0
Last 21 forward frames:
abs min abs max metadata
[...]
encoder.block.2.layer.1.DenseReluDense.wi_0 Linear
2.17e-07 4.50e+00 weight
1.79e-06 4.65e+00 input[0]
2.68e-06 3.70e+01 output
encoder.block.2.layer.1.DenseReluDense.wi_1 Linear
8.08e-07 2.66e+01 weight
1.79e-06 4.65e+00 input[0]
1.27e-04 2.37e+02 output
encoder.block.2.layer.1.DenseReluDense.wo Linear
1.01e-06 6.44e+00 weight
0.00e+00 9.74e+03 input[0]
3.18e-04 6.27e+04 output
encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense
1.79e-06 4.65e+00 input[0]
3.18e-04 6.27e+04 output
encoder.block.2.layer.1.dropout Dropout
3.18e-04 6.27e+04 input[0]
0.00e+00 inf output
```
The last frame reports on the `Dropout.forward` function, with the first entry for its only input and the second for
its only output. You can see that it was called from the `dropout` attribute inside the `DenseReluDense` class, and
that it happened during the first layer of the second block, on the very first batch. Finally, the absolute largest
input element was `6.27e+04` and the corresponding output was `inf`.
You can see here that `T5DenseGatedGeluDense.forward` produced output activations whose absolute max value was around
62.7K, very close to fp16's upper limit of 64K. In the next frame we have `Dropout`, which rescales the remaining
elements after it has zeroed some of them out; this pushes the absolute max value past 64K and we get an overflow
(`inf`).

As you can see, it's the previous frames that we need to look into, where the numbers start becoming too large for fp16.
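To make the rescaling step concrete, here is a small sketch. The actual dropout rate isn't shown in the report, so the mT5 default of `0.1` is assumed here; in training mode, dropout scales surviving elements by `1/(1 - p)`:

```python
import torch

# 6.27e4 / 0.9 ≈ 6.97e4, which exceeds the fp16 maximum of 65504
x = torch.tensor(6.27e4, dtype=torch.float16)
print(x / 0.9)  # tensor(inf, dtype=torch.float16)
```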
Let's match the report to the code from `models/t5/modeling_t5.py`:
```python
class T5DenseGatedGeluDense(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.wi_0 = nn.Linear(config.d_model, config.d_ff, bias=False)
        self.wi_1 = nn.Linear(config.d_model, config.d_ff, bias=False)
        self.wo = nn.Linear(config.d_ff, config.d_model, bias=False)
        self.dropout = nn.Dropout(config.dropout_rate)
        self.gelu_act = ACT2FN["gelu_new"]

    def forward(self, hidden_states):
        hidden_gelu = self.gelu_act(self.wi_0(hidden_states))
        hidden_linear = self.wi_1(hidden_states)
        hidden_states = hidden_gelu * hidden_linear
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.wo(hidden_states)
        return hidden_states
```
Now it's easy to see the `dropout` call, and all the previous calls as well.
Since the detection is happening in a forward hook, these reports are printed immediately after each `forward`
returns.
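Conceptually, the detector registers a forward hook on every submodule, along these lines (a simplified sketch, not the actual implementation; it assumes a `model` is already in scope):

```python
import torch


def make_hook(name):
    def hook(module, inputs, output):
        # simplified: the real detector also inspects the module's weights and keeps a history of frames
        tensors = [t for t in (*inputs, output) if isinstance(t, torch.Tensor)]
        if any(not torch.isfinite(t).all() for t in tensors):
            raise ValueError(f"inf/nan detected in {name} ({module.__class__.__name__})")

    return hook


for name, module in model.named_modules():
    module.register_forward_hook(make_hook(name))
```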
Going back to the full report: to act on it and fix the problem, we need to go up a few frames to where the numbers
started to grow, and most likely switch to `fp32` mode there, so that the numbers don't overflow when multiplied
or summed up. Of course, there might be other solutions. For example, we could turn off `amp` temporarily, if it's
enabled, after moving the original `forward` into a helper wrapper, like so:
```python
import torch


def _forward(self, hidden_states):
    hidden_gelu = self.gelu_act(self.wi_0(hidden_states))
    hidden_linear = self.wi_1(hidden_states)
    hidden_states = hidden_gelu * hidden_linear
    hidden_states = self.dropout(hidden_states)
    hidden_states = self.wo(hidden_states)
    return hidden_states


def forward(self, hidden_states):
    if torch.is_autocast_enabled():
        with torch.cuda.amp.autocast(enabled=False):
            return self._forward(hidden_states)
    else:
        return self._forward(hidden_states)
```
Since the automatic detector only reports on inputs and outputs of full frames, once you know where to look you may
want to analyse the intermediate stages of any specific `forward` function as well. In such a case you can use the
`detect_overflow` helper function to inject the detector where you want it, for example:
```python
from transformers.debug_utils import detect_overflow


class T5LayerFF(nn.Module):
    [...]

    def forward(self, hidden_states):
        forwarded_states = self.layer_norm(hidden_states)
        detect_overflow(forwarded_states, "after layer_norm")
        forwarded_states = self.DenseReluDense(forwarded_states)
        detect_overflow(forwarded_states, "after DenseReluDense")
        return hidden_states + self.dropout(forwarded_states)
```
You can see that we added two of these, and now we track whether an `inf` or `nan` for `forwarded_states` was detected
somewhere in between.

Actually, the detector already reports these because each of the calls in the example above is an `nn.Module`, but if
you had some local direct calculations, this is how you'd track them.
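For instance, a hypothetical local computation could be checked directly (the `query` and `key` tensors here are placeholders, not part of the original example):

```python
from transformers.debug_utils import detect_overflow

# check an intermediate product that never passes through an nn.Module on its own
scores = query @ key.transpose(-1, -2)
detect_overflow(scores, "attention scores before softmax")
```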
Additionally, if you're instantiating the debugger in your own code, you can adjust the number of frames printed from
its default, e.g.:
```python
from transformers.debug_utils import DebugUnderflowOverflow

debug_overflow = DebugUnderflowOverflow(model, max_frames_to_save=100)
```
### Specific batch absolute min and max value tracing
The same debugging class can be used for per-batch tracing with the underflow/overflow detection feature turned off.
Let's say you want to watch the absolute min and max values for all the ingredients of each `forward` call of a given
batch, and only do that for batches 1 and 3. Then you instantiate this class as:
```python
debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1,3])
```
And now full batches 1 and 3 will be traced using the same format as the underflow/overflow detector does.
Batches are 0-indexed.
This is helpful if you know that the program starts misbehaving after a certain batch number, so you can fast-forward
right to that area. Here is a sample truncated output for such a configuration:
```
*** Starting batch number=1 ***
abs min abs max metadata
shared Embedding
1.01e-06 7.92e+02 weight
0.00e+00 2.47e+04 input[0]
5.36e-05 7.92e+02 output
[...]
decoder.dropout Dropout
1.60e-07 2.27e+01 input[0]
0.00e+00 2.52e+01 output
decoder T5Stack
not a tensor output
lm_head Linear
1.01e-06 7.92e+02 weight
0.00e+00 1.11e+00 input[0]
6.06e-02 8.39e+01 output
T5ForConditionalGeneration
not a tensor output
*** Starting batch number=3 ***
abs min abs max metadata
shared Embedding
1.01e-06 7.92e+02 weight
0.00e+00 2.78e+04 input[0]
5.36e-05 7.92e+02 output
[...]
```
Here you will get a huge number of frames dumped - as many as there were forward calls in your model - so it may or
may not be what you want, but sometimes it can be easier to use for debugging than a normal debugger. For example, if
a problem starts happening at batch number 150, you can dump traces for batches 149 and 150 and compare where the
numbers started to diverge.
You can also specify the batch number after which to stop the training, with:
```python
debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1,3], abort_after_batch_num=3)
```
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# DeepSpeed Integration
[DeepSpeed](https://github.com/microsoft/DeepSpeed) implements everything described in the [ZeRO paper](https://arxiv.org/abs/1910.02054). Currently it provides full support for:

1. Optimizer state partitioning (ZeRO stage 1)
2. Gradient partitioning (ZeRO stage 2)
[...]
5. A range of fast CUDA-extension-based optimizers
6. ZeRO-Offload to CPU and NVMe

ZeRO-Offload has its own dedicated paper: [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840). And NVMe-support is described in the paper [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857).

DeepSpeed ZeRO-2 is primarily used only for training, as its features are of no use to inference.

DeepSpeed ZeRO-3 can be used for inference as well, since it allows huge models to be loaded on multiple GPUs, which
won't be possible on a single GPU.

🤗 Transformers integrates [DeepSpeed](https://github.com/microsoft/DeepSpeed) via 2 options:

1. Integration of the core DeepSpeed features via [`Trainer`]. This is an everything-done-for-you type of
   integration - just supply your custom config file or use our template and you have nothing else to do. Most of
   this document is focused on this feature.
2. If you don't use [`Trainer`] and want to use your own Trainer where you integrated DeepSpeed
   yourself, core functionality functions like `from_pretrained` and `from_config` include integration of essential
   parts of DeepSpeed like `zero.Init` for ZeRO stage 3 and higher. To tap into this feature read the docs on
   [deepspeed-non-trainer-integration](#deepspeed-non-trainer-integration).

What is integrated:

[...]

Inference:

1. DeepSpeed ZeRO Inference supports ZeRO stage 3 with ZeRO-Infinity. It uses the same ZeRO protocol as training, but
   it doesn't use an optimizer and a lr scheduler and only stage 3 is relevant. For more details see:
   [deepspeed-zero-inference](#deepspeed-zero-inference).

There is also DeepSpeed Inference - this is a totally different technology which uses Tensor Parallelism instead of
ZeRO (coming soon).
<a id='deepspeed-trainer-integration'></a>

## Trainer Deepspeed Integration

<a id='deepspeed-installation'></a>

### Installation

Install the library via pypi:

```bash
pip install deepspeed
```

or via `transformers`' `extras`:

```bash
pip install transformers[deepspeed]
```

or find more details on [the DeepSpeed's GitHub page](https://github.com/microsoft/deepspeed#installation) and
[advanced install](https://www.deepspeed.ai/tutorials/advanced-install/).

If you're still struggling with the build, first make sure to read [zero-install-notes](#zero-install-notes).
If you don't prebuild the extensions and rely on them to be built at run time and you tried all of the above solutions
to no avail, the next thing to try is to pre-build the modules before installing them.

To make a local build for DeepSpeed:

```bash
git clone https://github.com/microsoft/DeepSpeed/
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . \
--global-option="build_ext" --global-option="-j8" --no-cache -v \
--disable-pip-version-check 2>&1 | tee build.log
```

If you intend to use NVMe offload you will need to also include `DS_BUILD_AIO=1` in the instructions above (and also
install *libaio-dev* system-wide).

Edit `TORCH_CUDA_ARCH_LIST` to insert the code for the architectures of the GPU cards you intend to use. Assuming all
your cards are the same you can get the arch via:

```bash
CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_capability())"
```

So if you get `8, 6`, then use `TORCH_CUDA_ARCH_LIST="8.6"`. If you have multiple different cards, you can list all
of them like so: `TORCH_CUDA_ARCH_LIST="6.1;8.6"`

If you need to use the same setup on multiple machines, make a binary wheel:

```bash
git clone https://github.com/microsoft/DeepSpeed/
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 \
python setup.py build_ext -j8 bdist_wheel
```

it will generate something like `dist/deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl` which now you can install
as `pip install deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl` locally or on any other machine.

Again, remember to adjust `TORCH_CUDA_ARCH_LIST` to the target architectures.

You can find the complete list of NVIDIA GPUs and their corresponding **Compute Capabilities** (same as arch in this
context) [here](https://developer.nvidia.com/cuda-gpus).

You can check the archs pytorch was built with using:

```bash
python -c "import torch; print(torch.cuda.get_arch_list())"
```

Here is how to find out the arch for one of the installed GPUs. For example, for GPU 0:

```bash
CUDA_VISIBLE_DEVICES=0 python -c "import torch; \
print(torch.cuda.get_device_properties(torch.device('cuda')))"
```

If the output is:

```bash
_CudaDeviceProperties(name='GeForce RTX 3090', major=8, minor=6, total_memory=24268MB, multi_processor_count=82)
```

then you know that this card's arch is `8.6`.

You can also leave `TORCH_CUDA_ARCH_LIST` out completely and then the build program will automatically query the
architecture of the GPUs the build is made on. This may or may not match the GPUs on the target machines, that's why
it's best to specify the desired archs explicitly.

If after trying everything suggested you still encounter build issues, please, proceed with the GitHub Issue of
[Deepspeed](https://github.com/microsoft/DeepSpeed/issues).
<a id='deepspeed-multi-gpu'></a>

### Deployment with multiple GPUs

To deploy this feature with multiple GPUs adjust the [`Trainer`] command line arguments as following:

1. replace `python -m torch.distributed.launch` with `deepspeed`.
2. add a new argument `--deepspeed ds_config.json`, where `ds_config.json` is the DeepSpeed configuration file as
   documented [here](https://www.deepspeed.ai/docs/config-json/). The file naming is up to you.

Therefore, if your original command line looked as following:

```bash
python -m torch.distributed.launch --nproc_per_node=2 your_program.py <normal cl args>
```

Now it should be:

```bash
deepspeed --num_gpus=2 your_program.py <normal cl args> --deepspeed ds_config.json
```

Unlike `torch.distributed.launch`, where you have to specify how many GPUs to use with `--nproc_per_node`, with the
`deepspeed` launcher you don't have to use the corresponding `--num_gpus` if you want all of your GPUs used. The
full details on how to configure various nodes and GPUs can be found [here](https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node).

In fact, you can continue using `-m torch.distributed.launch` with DeepSpeed as long as you don't need to use the
`deepspeed` launcher-specific arguments. Typically if you don't need a multi-node setup you're not required to use
the `deepspeed` launcher. But since in the DeepSpeed documentation it'll be used everywhere, for consistency we will
use it here as well.

Here is an example of running `run_translation.py` under DeepSpeed deploying all available GPUs:

```bash
deepspeed examples/pytorch/translation/run_translation.py \
--deepspeed tests/deepspeed/ds_config_zero3.json \
--model_name_or_path t5-small --per_device_train_batch_size 1 \
--output_dir output_dir --overwrite_output_dir --fp16 \
--do_train --max_train_samples 500 --num_train_epochs 1 \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_lang en --target_lang ro
```

Note that in the DeepSpeed documentation you are likely to see `--deepspeed --deepspeed_config ds_config.json` - i.e.
two DeepSpeed-related arguments, but for the sake of simplicity, and since there are already so many arguments to deal
with, we combined the two into a single argument.

For some practical usage examples, please, see this [post](https://github.com/huggingface/transformers/issues/8771#issuecomment-759248400).
<a id='deepspeed-one-gpu'></a>

### Deployment with one GPU

To deploy DeepSpeed with one GPU adjust the [`Trainer`] command line arguments as following:

```bash
deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py \
--deepspeed tests/deepspeed/ds_config_zero2.json \
--model_name_or_path t5-small --per_device_train_batch_size 1 \
--output_dir output_dir --overwrite_output_dir --fp16 \
--do_train --max_train_samples 500 --num_train_epochs 1 \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_lang en --target_lang ro
```

This is almost the same as with multiple-GPUs, but here we tell DeepSpeed explicitly to use just one GPU via
`--num_gpus=1`. By default, DeepSpeed deploys all GPUs it can see on the given node. If you have only 1 GPU to start
with, then you don't need this argument. The following [documentation](https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node) discusses the launcher options.

Why would you want to use DeepSpeed with just one GPU?

[...]

While we are going to discuss the configuration in detail next, the key to getting a huge improvement on a single GPU
with DeepSpeed is to have at least the following configuration in the configuration file:

```json
{
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
[...]
        "overlap_comm": true,
        "contiguous_gradients": true
    }
}
```

which enables optimizer offload and some other important features. You may experiment with the buffer sizes, you will
find more details in the discussion below.

For a practical usage example of this type of deployment, please, see this [post](https://github.com/huggingface/transformers/issues/8771#issuecomment-759176685).

You may also try the ZeRO-3 with CPU and NVMe offload as explained further in this document.

[...]

Notes:

- if you need to run on a specific GPU, which is different from GPU 0, you can't use `CUDA_VISIBLE_DEVICES` to limit
the visible scope of available GPUs. Instead, you have to use the following syntax:

```bash
deepspeed --include localhost:1 examples/pytorch/translation/run_translation.py ...
```

In this example, we tell DeepSpeed to use GPU 1 (second gpu).
<a id='deepspeed-notebook'></a>

### Deployment in Notebooks

The problem with running notebook cells as a script is that there is no normal `deepspeed` launcher to rely on, so
under certain setups we have to emulate it.

If you're using only 1 GPU, here is how you'd have to adjust your training code in the notebook to use DeepSpeed.

```python
# DeepSpeed requires a distributed environment even when only one process is used.
# This emulates a launcher in the notebook
import os

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '9994'  # modify if RuntimeError: Address already in use
os.environ['RANK'] = "0"
os.environ['LOCAL_RANK'] = "0"
os.environ['WORLD_SIZE'] = "1"

# Now proceed as normal, plus pass the deepspeed config file
training_args = TrainingArguments(..., deepspeed="ds_config_zero3.json")
trainer = Trainer(...)
trainer.train()
```

Note: `...` stands for the normal arguments that you'd pass to the functions.

If you want to use more than 1 GPU, you must use a multi-process environment for DeepSpeed to work. That is, you have
to use the launcher for that purpose and this cannot be accomplished by emulating the distributed environment presented
at the beginning of this section.

[...]

If you want to create the config file on the fly in the notebook in the current directory, you could have a dedicated
cell with:

```python
%%bash
cat <<'EOT' > ds_config_zero3.json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
[...]
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
EOT
```

If the training script is in a normal file and not in the notebook cells, you can launch `deepspeed` normally via
shell from a cell. For example, to use `run_translation.py` you would launch it with:

```python
!git clone https://github.com/huggingface/transformers
!cd transformers; deepspeed examples/pytorch/translation/run_translation.py ...
```

or with `%%bash` magic, where you can write multi-line code for the shell program to run:

```python
%%bash

git clone https://github.com/huggingface/transformers
cd transformers
deepspeed examples/pytorch/translation/run_translation.py ...
```

In such case you don't need any of the code presented at the beginning of this section.

Note: while `%%bash` magic is neat, it currently buffers the output so you won't see the logs until the process
completes.
<a id='deepspeed-config'></a>

### Configuration

For the complete guide to the DeepSpeed configuration options that can be used in its configuration file please refer
to the [following documentation](https://www.deepspeed.ai/docs/config-json/).

You can find dozens of DeepSpeed configuration examples that address various practical needs in [the DeepSpeedExamples
repo](https://github.com/microsoft/DeepSpeedExamples):

```bash
git clone https://github.com/microsoft/DeepSpeedExamples
cd DeepSpeedExamples
find . -name '*json'
```

Continuing the code from above, let's say you're looking to configure the Lamb optimizer. So you can search through the
example `.json` files with:

```bash
grep -i Lamb $(find . -name '*json')
```

Some more examples are to be found in the [main repo](https://github.com/microsoft/DeepSpeed) as well.

When using DeepSpeed you always need to supply a DeepSpeed configuration file, yet some configuration parameters have
to be configured via the command line. You will find the nuances in the rest of this guide.

To get an idea of what a DeepSpeed configuration file looks like, here is one that activates ZeRO stage 2 features,
including optimizer states cpu offload, uses `AdamW` optimizer and `WarmupLR` scheduler and will enable mixed
precision training if `--fp16` is passed:

```json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
[...]
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
```

When you execute the program, DeepSpeed will log the configuration it received from the [`Trainer`]
to the console, so you can see exactly what was the final configuration passed to it.
.. _deepspeed-config-passing: <a id='deepspeed-config-passing'></a>
Passing Configuration ### Passing Configuration
=======================================================================================================================
As discussed in this document normally the DeepSpeed configuration is passed as a path to a json file, but if you're As discussed in this document normally the DeepSpeed configuration is passed as a path to a json file, but if you're
not using the command line interface to configure the training, and instead instantiate the not using the command line interface to configure the training, and instead instantiate the
:class:`~transformers.Trainer` via :class:`~transformers.TrainingArguments` then for the ``deepspeed`` argument you can [`Trainer`] via [`TrainingArguments`] then for the `deepspeed` argument you can
pass a nested ``dict``. This allows you to create the configuration on the fly and doesn't require you to write it to pass a nested `dict`. This allows you to create the configuration on the fly and doesn't require you to write it to
the file system before passing it to :class:`~transformers.TrainingArguments`. the file system before passing it to [`TrainingArguments`].
To summarize you can do: To summarize you can do:
.. code-block:: python ```python
TrainingArguments(..., deepspeed="/path/to/ds_config.json")
TrainingArguments(..., deepspeed="/path/to/ds_config.json") ```
or: or:
.. code-block:: python ```python
ds_config_dict=dict(scheduler=scheduler_params, optimizer=optimizer_params)
TrainingArguments(..., deepspeed=ds_config_dict)
```
ds_config_dict=dict(scheduler=scheduler_params, optimizer=optimizer_params) <a id='deepspeed-config-shared'></a>
TrainingArguments(..., deepspeed=ds_config_dict)
<a id='deepspeed-config-shared'></a>

### Shared Configuration

<Tip warning={true}>

This section is a must-read.

</Tip>

Some configuration values are required by both the [`Trainer`] and DeepSpeed to function correctly.
Therefore, to prevent conflicting definitions, which could lead to hard-to-detect errors, we chose to configure those
via the [`Trainer`] command line arguments.

Additionally, some configuration values are derived automatically based on the model's configuration, so instead of
remembering to manually adjust multiple values, it's best to let the [`Trainer`] do the majority
of the configuration for you.

Therefore, in the rest of this guide you will find a special configuration value: `auto`, which when set will be
automatically replaced with the correct or most efficient value. Please feel free to ignore this
recommendation and set the values explicitly, in which case be very careful that your
[`Trainer`] arguments and DeepSpeed configuration agree. For example, are you using the same
learning rate, batch size, or gradient accumulation settings? If these mismatch, the training may fail in very
difficult-to-detect ways. You have been warned.

There are multiple other values that are specific to DeepSpeed-only and those you will have to set manually to suit
your needs.

In your own programs, you can also use the following approach if you'd like to modify the DeepSpeed config as a master
and configure [`TrainingArguments`] based on that. The steps are:

1. Create or load the DeepSpeed configuration to be used as a master configuration
2. Create the [`TrainingArguments`] object based on these values

Do note that some values, such as `scheduler.params.total_num_steps`, are calculated by
[`Trainer`] during `train`, but you can of course do the math yourself.

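Here is a minimal sketch of that approach. The config file name, the world size and the batch-size arithmetic are illustrative assumptions, not a fixed recipe:

```python
import json

from transformers import TrainingArguments

# 1. load the DeepSpeed config that acts as the master configuration
with open("ds_config_zero3.json") as f:  # hypothetical file name
    ds_config = json.load(f)

# 2. derive the TrainingArguments values from it, instead of the other way around
world_size = 2  # assumption: training on 2 GPUs
per_device_bs = ds_config["train_micro_batch_size_per_gpu"]  # assumes an explicit value, not "auto"
grad_accum = ds_config["train_batch_size"] // (per_device_bs * world_size)

training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=per_device_bs,
    gradient_accumulation_steps=grad_accum,
    deepspeed=ds_config,  # pass the very same dict so the two sides cannot disagree
)
```

Keeping the DeepSpeed dict as the single source and deriving the [`Trainer`] arguments from it avoids the mismatches described above.
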
<a id='deepspeed-zero'></a>

### ZeRO

[Zero Redundancy Optimizer (ZeRO)](https://www.deepspeed.ai/tutorials/zero/) is the workhorse of DeepSpeed. It
supports 3 different levels (stages) of optimization. The first one is not very interesting for scalability purposes,
therefore this document focuses on stages 2 and 3. Stage 3 is further improved by the latest addition of ZeRO-Infinity.
You will find more in-depth information in the DeepSpeed documentation.

The `zero_optimization` section of the configuration file is the most important part ([docs](https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training)), since that is where you define
which ZeRO stages you want to enable and how to configure them. You will find the explanation for each parameter in the
DeepSpeed docs.

This section has to be configured exclusively via the DeepSpeed configuration - the [`Trainer`] provides
no equivalent command line arguments.

Note: currently DeepSpeed doesn't validate parameter names, so if you misspell any, it'll use the default setting for
the misspelled parameter. You can watch the DeepSpeed engine startup log messages to see what values it is
going to use.

<a id='deepspeed-zero2-config'></a>

#### ZeRO-2 Config

The following is an example configuration for ZeRO stage 2:

```json
{
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            [...]
        },
        [...]
        "reduce_bucket_size": 5e8,
        "contiguous_gradients": true
    }
}
```

**Performance tuning:**

- enabling `offload_optimizer` should reduce GPU RAM usage (it requires `"stage": 2`)
- `"overlap_comm": true` trades off increased GPU RAM usage to lower all-reduce latency. `overlap_comm` uses 4.5x
  the `allgather_bucket_size` and `reduce_bucket_size` values. So if they are set to 5e8, this requires a 9GB
  footprint (`5e8 x 2 bytes x 2 x 4.5`). Therefore, if you have a GPU with 8GB or less RAM, to avoid getting
  OOM-errors you will need to reduce those parameters to about `2e8`, which would require 3.6GB. You will want to do
  the same on larger-capacity GPUs as well, if you're starting to hit OOM.
- when reducing these buffers you're trading communication speed for more available GPU RAM. The smaller the buffer size,
  the slower the communication and the more GPU RAM will be available to other tasks. So if a bigger batch size is
  important to you, a slightly slower training time could be a worthwhile trade-off.

<a id='deepspeed-zero3-config'></a>

#### ZeRO-3 Config

The following is an example configuration for ZeRO stage 3:

```json
{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            [...]
        },
        [...]
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_fp16_weights_on_model_save": true
    }
}
```

If you are getting OOMs because your model or activations don't fit into the GPU memory and you have unutilized CPU
memory, offloading the optimizer states and parameters to CPU memory with `"device": "cpu"` may solve this limitation.
If you don't want to offload to CPU memory, use `none` instead of `cpu` for the `device` entry. Offloading to
NVMe is discussed further down.

Pinned memory is enabled with `pin_memory` set to `true`. This feature can improve the throughput at the cost of
making less memory available to other processes. Pinned memory is set aside for the specific process that requested it
and it is typically accessed much faster than normal CPU memory.

**Performance tuning:**

- `stage3_max_live_parameters`: `1e9`
- `stage3_max_reuse_distance`: `1e9`

If hitting OOM, reduce `stage3_max_live_parameters` and `stage3_max_reuse_distance`. They should have minimal impact
on performance unless you are doing activation checkpointing. `1e9` would consume ~2GB. The memory is shared by
`stage3_max_live_parameters` and `stage3_max_reuse_distance`, so it's not additive, it's just 2GB total.

`stage3_max_live_parameters` is the upper limit on how many full parameters you want to keep on the GPU at any given
time. The "reuse distance" is a metric used to figure out when a parameter will be used again in the future, and we
use `stage3_max_reuse_distance` to decide whether to throw away the parameter or to keep it. If a parameter is
going to be used again in the near future (less than `stage3_max_reuse_distance`) then we keep it to reduce communication
overhead. This is super helpful when you have activation checkpointing enabled, where we do forward recompute and
backward passes at a single-layer granularity and want to keep the parameter in the forward recompute until the backward pass.

The following configuration values depend on the model's hidden size:

- `reduce_bucket_size`: `hidden_size*hidden_size`
- `stage3_prefetch_bucket_size`: `0.9 * hidden_size * hidden_size`
- `stage3_param_persistence_threshold`: `10 * hidden_size`

Therefore set these values to `auto` and the [`Trainer`] will automatically assign the recommended
values. But, of course, feel free to set these explicitly as well.

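As a rough sketch of what `auto` resolves to, here is the arithmetic spelled out; the model name is just an example, and the [`Trainer`] does this resolution for you:

```python
# Rough sketch of the hidden-size-derived values; "t5-small" is only an illustrative model.
from transformers import AutoConfig

hidden_size = AutoConfig.from_pretrained("t5-small").d_model  # T5 calls its hidden size d_model (512 here)

zero3_auto_values = {
    "reduce_bucket_size": hidden_size * hidden_size,                       # 262144
    "stage3_prefetch_bucket_size": int(0.9 * hidden_size * hidden_size),   # 235929
    "stage3_param_persistence_threshold": 10 * hidden_size,                # 5120
}
```
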
`stage3_gather_fp16_weights_on_model_save` enables model fp16 weights consolidation when the model gets saved. With large
models and multiple GPUs this is an expensive operation both in terms of memory and speed. It's currently required if
you plan to resume the training. Watch out for future updates that will remove this limitation and make things more
flexible.

If you're migrating from a ZeRO-2 configuration, note that the `allgather_partitions`, `allgather_bucket_size` and
`reduce_scatter` configuration parameters are not used in ZeRO-3. If you keep these in the config file they will just
be ignored.

- `sub_group_size`: `1e9`

`sub_group_size` controls the granularity in which parameters are updated during optimizer steps. Parameters are
grouped into buckets of `sub_group_size` and each bucket is updated one at a time. When used with NVMe offload in
ZeRO-Infinity, `sub_group_size` therefore controls the granularity in which model states are moved in and out of CPU
memory from NVMe during the optimizer step. This prevents running out of CPU memory for extremely large models.

You can leave `sub_group_size` at its default value of *1e9* when not using NVMe offload. You may want to change its
default value in the following cases:

1. Running into OOM during the optimizer step: reduce `sub_group_size` to reduce the memory utilization of temporary buffers
2. The optimizer step is taking a long time: increase `sub_group_size` to improve bandwidth utilization as a result of
   the increased data buffers.

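For the first case, a lowered value might look like the following sketch; the number is illustrative only and should be chosen to fit your CPU memory:

```python
# Illustrative only: lowering sub_group_size to reduce temporary-buffer memory during the optimizer step.
ds_config_fragment = {
    "zero_optimization": {
        "stage": 3,
        "sub_group_size": 1e8,  # down from the default 1e9; pick a value that fits your setup
    }
}
```
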
<a id='deepspeed-nvme'></a>

### NVMe Support

ZeRO-Infinity allows for training incredibly large models by extending GPU and CPU memory with NVMe memory. Thanks to
smart partitioning and tiling algorithms each GPU needs to send and receive very small amounts of data during the
offload process, which makes a modern NVMe a good fit for extending the total memory pool available to the training
process. ZeRO-Infinity requires ZeRO-3 to be enabled.

The following configuration example enables NVMe to offload both optimizer states and params:

```json
{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            [...]
        },
        [...]
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_fp16_weights_on_model_save": true
    },
    [...]
}
```

You can choose to offload both optimizer states and params to NVMe, just one of them, or neither. For example, if you
have copious amounts of CPU memory available, by all means offload to CPU memory only as it'd be faster (hint:
`"device": "cpu"`).

Here is the full documentation for offloading [optimizer states](https://www.deepspeed.ai/docs/config-json/#optimizer-offloading) and [parameters](https://www.deepspeed.ai/docs/config-json/#parameter-offloading).

Make sure that your `nvme_path` is actually an NVMe, since it will work with a normal hard drive or SSD, but it'll
be much, much slower. The fast scalable training was designed with modern NVMe transfer speeds in mind (as of this
writing one can have ~3.5GB/s read, ~3GB/s write peak speeds).

In order to figure out the optimal `aio` configuration block you must run a benchmark on your target setup, as
[explained here](https://github.com/microsoft/DeepSpeed/issues/998).

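For orientation, here is a sketch of what an `aio` block looks like, expressed as a config dict. The values are common starting points and assumptions, not benchmark results, so replace them with your own measurements:

```python
# Starting-point values only; replace them with the results of the benchmark linked above.
aio_block = {
    "aio": {
        "block_size": 262144,    # bytes per I/O request
        "queue_depth": 32,       # number of outstanding I/O requests
        "thread_count": 1,       # I/O submission threads
        "single_submit": False,
        "overlap_events": True,
    }
}
```
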
<a id='deepspeed-zero2-zero3-performance'></a>

#### ZeRO-2 vs ZeRO-3 Performance

ZeRO-3 is likely to be slower than ZeRO-2 if everything else is configured the same because the former has to gather
model weights in addition to what ZeRO-2 does. If ZeRO-2 meets your needs and you don't need to scale beyond a few GPUs,
then you may choose to stick with it. It's important to understand that ZeRO-3 enables much higher scalability
at a cost of speed.

It's possible to adjust the ZeRO-3 configuration to make it perform closer to ZeRO-2:

- set `stage3_param_persistence_threshold` to a very large number, larger than the largest parameter, e.g., `6 * hidden_size * hidden_size`. This will keep the parameters on the GPUs.
- turn off `offload_params` since ZeRO-2 doesn't have that option.

The performance will likely improve significantly with just `offload_params` turned off, even if you don't change
`stage3_param_persistence_threshold`. Of course, these changes will impact the size of the model you can train. So
these help you to trade scalability for speed depending on your needs.

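A sketch of such a ZeRO-3 section, expressed as a config dict; the threshold value is illustrative only and should be larger than your largest parameter:

```python
# Illustrative ZeRO-3 settings tuned to behave more like ZeRO-2: a huge persistence threshold and no param offload.
zero2_like_zero3 = {
    "zero_optimization": {
        "stage": 3,
        "stage3_param_persistence_threshold": 1e8,  # example value; must exceed your largest parameter's element count
        # note: no offload_params section, so parameters stay on the GPUs
    }
}
```
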
<a id='deepspeed-zero2-example'></a>

#### ZeRO-2 Example

Here is a full ZeRO-2 auto-configuration file `ds_config_zero2.json`:

```json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        [...]
    },
    [...]
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
```

Here is a full ZeRO-2 all-enabled manually set configuration file. It is here mainly for you to see what the typical
values look like, but we highly recommend using the one with multiple `auto` settings in it.

```json
{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        [...]
    },
    [...]
    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}
```

<a id='deepspeed-zero3-example'></a>

#### ZeRO-3 Example

Here is a full ZeRO-3 auto-configuration file `ds_config_zero3.json`:

```json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        [...]
    },
    [...]
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
```

Here is a full ZeRO-3 all-enabled manually set configuration file. It is here mainly for you to see what the typical
values look like, but we highly recommend using the one with multiple `auto` settings in it.

```json
{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        [...]
    },
    [...]
    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}
```

### Optimizer and Scheduler

As long as you don't enable `offload_optimizer` you can mix and match DeepSpeed and HuggingFace schedulers and
optimizers, with the exception of using the combination of HuggingFace scheduler and DeepSpeed optimizer:

| Combos       | HF Scheduler | DS Scheduler |
|--------------|--------------|--------------|
| HF Optimizer | Yes          | Yes          |
| DS Optimizer | No           | Yes          |

It is possible to use a non-DeepSpeed optimizer when `offload_optimizer` is enabled, as long as it has both a CPU and
a GPU implementation (with the exception of LAMB).

<a id='deepspeed-optimizer'></a>

#### Optimizer

DeepSpeed's main optimizers are Adam, AdamW, OneBitAdam, and Lamb. These have been thoroughly tested with ZeRO and are
thus recommended to be used. DeepSpeed can, however, import other optimizers from `torch`. The full documentation is [here](https://www.deepspeed.ai/docs/config-json/#optimizer-parameters).

If you don't configure the `optimizer` entry in the configuration file, the [`Trainer`] will
automatically set it to `AdamW` and will use the supplied values or the defaults for the following command line
arguments: `--learning_rate`, `--adam_beta1`, `--adam_beta2`, `--adam_epsilon` and `--weight_decay`.

Here is an example of the auto-configured `optimizer` entry for `AdamW`:

```json
{
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    }
}
```

Note that the command line arguments will set the values in the configuration file. This is so that there is one
definitive source of the values and to avoid hard-to-find errors when, for example, the learning rate is set to
different values in different places. Command line rules. The values that get overridden are:

- `lr` with the value of `--learning_rate`
- `betas` with the value of `--adam_beta1 --adam_beta2`
- `eps` with the value of `--adam_epsilon`
- `weight_decay` with the value of `--weight_decay`

Therefore please remember to tune the shared hyperparameters on the command line.

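If you configure the [`Trainer`] in code rather than via the command line, the same shared hyperparameters are plain [`TrainingArguments`] fields; a minimal sketch with illustrative values:

```python
# Illustrative values only: these TrainingArguments fields are what --learning_rate,
# --adam_beta1/--adam_beta2, --adam_epsilon and --weight_decay map to.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",
    learning_rate=3e-5,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    weight_decay=0.01,
    deepspeed="ds_config_zero2.json",  # the "auto" optimizer params get filled from the values above
)
```
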
You can also set the values explicitly:

```json
{
    "optimizer": {
        "type": "AdamW",
        "params": {
            [...]
            "weight_decay": 3e-7
        }
    }
}
```

But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed
configuration.

If you want to use another optimizer which is not listed above, you will have to add the following to the top-level
configuration:

```json
{
    "zero_allow_untested_optimizer": true
}
```

Similarly to `AdamW`, you can configure other officially supported optimizers. Just remember that those may have different
config values, e.g. for Adam you will want `weight_decay` around `0.01`.

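For example, a sketch of an `optimizer` entry for DeepSpeed's Adam; apart from the `weight_decay` ballpark mentioned above, the values are illustrative:

```python
# Illustrative Adam entry; only the weight_decay recommendation comes from the text above.
adam_optimizer_config = {
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 3e-5,             # example value
            "betas": [0.9, 0.999],  # example values
            "eps": 1e-8,            # example value
            "weight_decay": 0.01,   # recommended ballpark for Adam
        }
    }
}
```
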
<a id='deepspeed-scheduler'></a>

#### Scheduler

DeepSpeed supports the `LRRangeTest`, `OneCycle`, `WarmupLR` and `WarmupDecayLR` learning rate schedulers. The full
documentation is [here](https://www.deepspeed.ai/docs/config-json/#scheduler-parameters).

Here is where the schedulers overlap between 🤗 Transformers and DeepSpeed:

- `WarmupLR` via `--lr_scheduler_type constant_with_warmup`
- `WarmupDecayLR` via `--lr_scheduler_type linear`. This is also the default value for `--lr_scheduler_type`,
  therefore, if you don't configure the scheduler this is the scheduler that will get configured by default.

If you don't configure the `scheduler` entry in the configuration file, the [`Trainer`] will use
the values of `--lr_scheduler_type`, `--learning_rate` and `--warmup_steps` or `--warmup_ratio` to configure a
🤗 Transformers version of it.

Here is an example of the auto-configured `scheduler` entry for `WarmupLR`:

```json
{
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    }
}
```

Since `"auto"` is used, the [`Trainer`] arguments will set the correct values in the configuration
file. This is so that there is one definitive source of the values and to avoid hard-to-find errors when, for example,
the learning rate is set to different values in different places. Command line rules. The values that get set are:

- `warmup_min_lr` with the value of `0`.
- `warmup_max_lr` with the value of `--learning_rate`.
- `warmup_num_steps` with the value of `--warmup_steps` if provided. Otherwise it will use `--warmup_ratio`
  multiplied by the number of training steps and rounded up.
- `total_num_steps` with either the value of `--max_steps` or, if it is not provided, derived automatically at run
  time based on the environment, the size of the dataset and other command line arguments (needed for
  `WarmupDecayLR`).

You can, of course, take over any or all of the configuration values and set those yourself:

```json
{
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            [...]
            "warmup_num_steps": 1000
        }
    }
}
```

But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed
configuration.

For example, for `WarmupDecayLR`, you can use the following entry:

```json
{
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "total_num_steps": "auto",
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    }
}
```

and `total_num_steps`, `warmup_max_lr` and `warmup_num_steps` will be set at loading time.

<a id='deepspeed-fp32'></a>

### fp32 Precision

DeepSpeed supports the full fp32 and the fp16 mixed precision.

Because of the much reduced memory needs and faster speed one gets with fp16 mixed precision, the only time you
will want to not use it is when the model you're using doesn't behave well under this training mode. Typically this
happens when the model wasn't pretrained in fp16 mixed precision (e.g. this often happens with bf16-pretrained
models). Such models may overflow or underflow, leading to `NaN` loss. If this is your case then you will want to use
the full fp32 mode, by explicitly disabling the otherwise default fp16 mixed precision mode with:

```json
{
    "fp16": {
        "enabled": false
    }
}
```

If you're using an Ampere-architecture based GPU, PyTorch version 1.7 and higher will automatically switch to using
the much more efficient tf32 format for some operations, but the results will still be in fp32. For details and
benchmarks, please see [TensorFloat-32 (TF32) on Ampere devices](https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices). The document includes
instructions on how to disable this automatic conversion if for some reason you prefer not to use it.

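For convenience, this is roughly what the opt-out looks like in PyTorch; see the linked document for the authoritative instructions:

```python
# Sketch of opting out of TF32 for matmuls and cuDNN convolutions, per the PyTorch notes linked above.
import torch

torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```
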
<a id='deepspeed-amp'></a>

### Automatic Mixed Precision

You can use automatic mixed precision with either the pytorch-like AMP way or the apex-like way.

To configure the pytorch AMP-like mode, set:

```json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        [...]
        "hysteresis": 2,
        "min_loss_scale": 1
    }
}
```

and the [`Trainer`] will automatically enable or disable it based on the value of
`args.fp16_backend`. The rest of the config values are up to you.

This mode gets enabled when the `--fp16 --fp16_backend amp` command line args are passed.

You can also enable/disable this mode explicitly:

```json
{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        [...]
        "hysteresis": 2,
        "min_loss_scale": 1
    }
}
```

But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed
configuration.

Here is the [documentation](https://www.deepspeed.ai/docs/config-json/#fp16-training-options).

To configure the apex AMP-like mode, set:

```json
"amp": {
    "enabled": "auto",
    "opt_level": "auto"
}
```

and the [`Trainer`] will automatically configure it based on the values of `args.fp16_backend` and
`args.fp16_opt_level`.

This mode gets enabled when the `--fp16 --fp16_backend apex --fp16_opt_level O1` command line args are passed.

You can also configure this mode explicitly:

```json
{
    "amp": {
        "enabled": true,
        "opt_level": "O1"
    }
}
```

But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed
configuration.

Here is the [documentation](https://www.deepspeed.ai/docs/config-json/#automatic-mixed-precision-amp-training-options).

<a id='deepspeed-bs'></a>

### Batch Size

To configure batch size, use:

```json
{
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
```

and the [`Trainer`] will automatically set `train_micro_batch_size_per_gpu` to the value of
`args.per_device_train_batch_size` and `train_batch_size` to `args.world_size * args.per_device_train_batch_size * args.gradient_accumulation_steps`.

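As a quick worked example with made-up numbers, showing the arithmetic behind the two `auto` values:

```python
# Made-up numbers: 2 GPUs, per-device batch size 4, 3 gradient accumulation steps.
world_size = 2
per_device_train_batch_size = 4
gradient_accumulation_steps = 3

train_micro_batch_size_per_gpu = per_device_train_batch_size  # 4
train_batch_size = world_size * per_device_train_batch_size * gradient_accumulation_steps  # 24
```
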
You can also set the values explicitly:

```json
{
    "train_batch_size": 12,
    "train_micro_batch_size_per_gpu": 4
}
```

But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed
configuration.

<a id='deepspeed-grad-acc'></a>

### Gradient Accumulation

To configure gradient accumulation, set:

```json
{
    "gradient_accumulation_steps": "auto"
}
```

and the [`Trainer`] will automatically set it to the value of `args.gradient_accumulation_steps`.

You can also set the value explicitly:

```json
{
    "gradient_accumulation_steps": 3
}
```

But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed
configuration.

<a id='deepspeed-grad-clip'></a>

### Gradient Clipping

To configure gradient clipping, set:

```json
{
    "gradient_clipping": "auto"
}
```

and the [`Trainer`] will automatically set it to the value of `args.max_grad_norm`.

You can also set the value explicitly:

```json
{
    "gradient_clipping": 1.0
}
```

But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed
configuration.

<a id='deepspeed-weight-extraction'></a>

### Getting The Model Weights Out

As long as you continue training and resuming using DeepSpeed you don't need to worry about anything. DeepSpeed stores
fp32 master weights in its custom checkpoint optimizer files, which are `global_step*/*optim_states.pt` (this is a glob
pattern), and are saved under the normal checkpoint.

**FP16 Weights:**

When a model is saved under ZeRO-2, you end up having the normal `pytorch_model.bin` file with the model weights, but
they are only the fp16 version of the weights.

Under ZeRO-3, things are much more complicated, since the model weights are partitioned out over multiple GPUs,
therefore `"stage3_gather_fp16_weights_on_model_save": true` is required to get the `Trainer` to save the fp16
version of the weights. If this setting is `False`, `pytorch_model.bin` won't be created. This is because by default
DeepSpeed's `state_dict` contains a placeholder and not the real weights. If we were to save this `state_dict` it
wouldn't be possible to load it back.

```json
{
    "zero_optimization": {
        "stage3_gather_fp16_weights_on_model_save": true
    }
}
```

**FP32 Weights:**

While the fp16 weights are fine for resuming training, if you finished finetuning your model and want to upload it to
the [models hub](https://huggingface.co/models) or pass it to someone else you most likely will want to get the fp32
weights. This ideally shouldn't be done during training since it is a process that requires a lot of memory, and
therefore is best performed offline after the training is complete. But if desired and you have plenty of free CPU
memory it can be done in the same training script. The following sections will discuss both approaches.

**Live FP32 Weights Recovery:**

This approach may not work if your model is large and you have little free CPU memory left at the end of the training.

If you have saved at least one checkpoint, and you want to use the latest one, you can do the following:

```python
from transformers.trainer_utils import get_last_checkpoint
from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint

checkpoint_dir = get_last_checkpoint(trainer.args.output_dir)
fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
```

If you're using the `--load_best_model_at_end` [`TrainingArguments`] argument (to track the best
checkpoint), then you can finish the training by first saving the final model explicitly and then do the same as above:

```python
import os

from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint

checkpoint_dir = os.path.join(trainer.args.output_dir, "checkpoint-final")
trainer.deepspeed.save_checkpoint(checkpoint_dir)
fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
```

<Tip>

Note that once `load_state_dict_from_zero_checkpoint` was run, the `model` will no longer be usable in the
DeepSpeed context of the same application, i.e. you will need to re-initialize the DeepSpeed engine, since
`model.load_state_dict(state_dict)` will remove all the DeepSpeed magic from it. So do this only at the very end
of the training.

</Tip>

Of course, you don't have to use [`Trainer`] and you can adjust the examples above to your own
trainer.

If for some reason you want more refinement, you can also extract the fp32 `state_dict` of the weights and apply
these yourself as is shown in the following example:

```python
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)  # already on cpu
model = model.cpu()
model.load_state_dict(state_dict)
```

**Offline FP32 Weights Recovery:**

DeepSpeed creates a special conversion script `zero_to_fp32.py` which it places in the top-level of the checkpoint
folder. Using this script you can extract the weights at any point. The script is standalone and you no longer need to
have the configuration file or a `Trainer` to do the extraction.

Let's say your checkpoint folder looks like this:

```bash
$ ls -l output_dir/checkpoint-1/
-rw-rw-r-- 1 stas stas 1.4K Mar 27 20:42 config.json
drwxrwxr-x 2 stas stas 4.0K Mar 25 19:52 global_step1/
-rw-rw-r-- 1 stas stas   12 Mar 27 13:16 latest
-rw-rw-r-- 1 stas stas 827K Mar 27 20:42 optimizer.pt
-rw-rw-r-- 1 stas stas 231M Mar 27 20:42 pytorch_model.bin
-rw-rw-r-- 1 stas stas  623 Mar 27 20:42 scheduler.pt
-rw-rw-r-- 1 stas stas 1.8K Mar 27 20:42 special_tokens_map.json
-rw-rw-r-- 1 stas stas 774K Mar 27 20:42 spiece.model
-rw-rw-r-- 1 stas stas 1.9K Mar 27 20:42 tokenizer_config.json
-rw-rw-r-- 1 stas stas  339 Mar 27 20:42 trainer_state.json
-rw-rw-r-- 1 stas stas 2.3K Mar 27 20:42 training_args.bin
-rwxrw-r-- 1 stas stas 5.5K Mar 27 13:16 zero_to_fp32.py*
```

In this example there is just one DeepSpeed checkpoint sub-folder *global_step1*. Therefore to reconstruct the fp32
weights just run:

```bash
python zero_to_fp32.py . pytorch_model.bin
```

This is it. `pytorch_model.bin` will now contain the full fp32 model weights consolidated from multiple GPUs.

The script will automatically be able to handle either a ZeRO-2 or ZeRO-3 checkpoint.

`python zero_to_fp32.py -h` will give you usage details.

The script will auto-discover the deepspeed sub-folder using the contents of the file `latest`, which in the current
example will contain `global_step1`.

Note: currently the script requires 2x the general RAM of the final fp32 model weights.

ZeRO-3 and Infinity Nuances ### ZeRO-3 and Infinity Nuances
=======================================================================================================================
ZeRO-3 is quite different from ZeRO-2 because of its param sharding feature. ZeRO-3 is quite different from ZeRO-2 because of its param sharding feature.
...@@ -1576,104 +1520,99 @@ circumstances you may find the following information to be needed. ...@@ -1576,104 +1520,99 @@ circumstances you may find the following information to be needed.
Constructing Massive Models #### Constructing Massive Models
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
DeepSpeed/ZeRO-3 can handle models with Trillions of parameters which may not fit onto the existing RAM. In such cases, DeepSpeed/ZeRO-3 can handle models with Trillions of parameters which may not fit onto the existing RAM. In such cases,
but also if you want the initialization to happen much faster, initialize the model using `deepspeed.zero.Init()` but also if you want the initialization to happen much faster, initialize the model using *deepspeed.zero.Init()*
context manager (which is also a function decorator), like so: context manager (which is also a function decorator), like so:
.. code-block:: python ```python
from transformers import T5ForConditionalGeneration, T5Config
from transformers import T5ForConditionalGeneration, T5Config import deepspeed
import deepspeed with deepspeed.zero.Init():
with deepspeed.zero.Init():
config = T5Config.from_pretrained("t5-small") config = T5Config.from_pretrained("t5-small")
model = T5ForConditionalGeneration(config) model = T5ForConditionalGeneration(config)
```
As you can see this gives you a randomly initialized model. As you can see this gives you a randomly initialized model.
If you want to use a pretrained model, ``model_class.from_pretrained`` will activate this feature as long as If you want to use a pretrained model, `model_class.from_pretrained` will activate this feature as long as
``is_deepspeed_zero3_enabled()`` returns ``True``, which currently is setup by the `is_deepspeed_zero3_enabled()` returns `True`, which currently is setup by the
class:`~transformers.TrainingArguments` object if the passed DeepSpeed configuration file contains ZeRO-3 config [`TrainingArguments`] object if the passed DeepSpeed configuration file contains ZeRO-3 config
section. Thus you must create the :class:`~transformers.TrainingArguments` object **before** calling section. Thus you must create the [`TrainingArguments`] object **before** calling
``from_pretrained``. Here is an example of a possible sequence: `from_pretrained`. Here is an example of a possible sequence:
.. code-block:: python
from transformers import AutoModel, Trainer, TrainingArguments ```python
training_args = TrainingArguments(..., deepspeed=ds_config) from transformers import AutoModel, Trainer, TrainingArguments
model = AutoModel.from_pretrained("t5-small") training_args = TrainingArguments(..., deepspeed=ds_config)
trainer = Trainer(model=model, args=training_args, ...) model = AutoModel.from_pretrained("t5-small")
trainer = Trainer(model=model, args=training_args, ...)
```
If you're using the official example scripts and your command line arguments include ``--deepspeed ds_config.json`` If you're using the official example scripts and your command line arguments include `--deepspeed ds_config.json`
with ZeRO-3 config enabled, then everything is already done for you, since this is how example scripts are written. with ZeRO-3 config enabled, then everything is already done for you, since this is how example scripts are written.
Note: If the fp16 weights of the model can't fit onto the memory of a single GPU this feature must be used. Note: If the fp16 weights of the model can't fit onto the memory of a single GPU this feature must be used.
For full details on this method and other related features please refer to `Constructing Massive Models For full details on this method and other related features please refer to [Constructing Massive Models](https://deepspeed.readthedocs.io/en/latest/zero3.html#constructing-massive-models).
<https://deepspeed.readthedocs.io/en/latest/zero3.html#constructing-massive-models>`__.
Also when loading fp16-pretrained models, you will want to tell ``from_pretrained`` to use Also when loading fp16-pretrained models, you will want to tell `from_pretrained` to use
``torch_dtype=torch.float16``. For details, please, see :ref:`from_pretrained-torch-dtype`. `torch_dtype=torch.float16`. For details, please, see [from_pretrained-torch-dtype](#from_pretrained-torch-dtype).
Gathering Parameters #### Gathering Parameters
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Under ZeRO-3 on multiple GPUs no single GPU has all the parameters unless it's the parameters for the currently Under ZeRO-3 on multiple GPUs no single GPU has all the parameters unless it's the parameters for the currently
executing layer. So if you need to access all parameters from all layers at once there is a specific method to do it. executing layer. So if you need to access all parameters from all layers at once there is a specific method to do it.
Most likely you won't need it, but if you do please refer to `Gathering Parameters Most likely you won't need it, but if you do please refer to [Gathering Parameters](https://deepspeed.readthedocs.io/en/latest/zero3.html#manual-parameter-coordination)
<https://deepspeed.readthedocs.io/en/latest/zero3.html#manual-parameter-coordination>`__
We do however use it internally in several places, one such example is when loading pretrained model weights in We do however use it internally in several places, one such example is when loading pretrained model weights in
``from_pretrained``. We load one layer at a time and immediately partition it to all participating GPUs, as for very `from_pretrained`. We load one layer at a time and immediately partition it to all participating GPUs, as for very
large models it won't be possible to load it on one GPU and then spread it out to multiple GPUs, due to memory large models it won't be possible to load it on one GPU and then spread it out to multiple GPUs, due to memory
limitations. limitations.
Also under ZeRO-3, if you write your own code and run into a model parameter weight that looks like: Also under ZeRO-3, if you write your own code and run into a model parameter weight that looks like:
.. code-block:: python ```python
tensor([1.], device='cuda:0', dtype=torch.float16, requires_grad=True)
```
tensor([1.], device='cuda:0', dtype=torch.float16, requires_grad=True) stress on `tensor([1.])`, or if you get an error where it says the parameter is of size `1`, instead of some much
stress on ``tensor([1.])``, or if you get an error where it says the parameter is of size ``1``, instead of some much
larger multi-dimensional shape, this means that the parameter is partitioned and what you see is a ZeRO-3 placeholder. larger multi-dimensional shape, this means that the parameter is partitioned and what you see is a ZeRO-3 placeholder.
.. _deepspeed-zero-inference: <a id='deepspeed-zero-inference'></a>
ZeRO Inference ### ZeRO Inference
=======================================================================================================================
ZeRO Inference uses the same config as ZeRO-3 Training. You just don't need the optimizer and scheduler sections. In ZeRO Inference uses the same config as ZeRO-3 Training. You just don't need the optimizer and scheduler sections. In
fact you can leave these in the config file if you want to share the same one with the training. They will just be fact you can leave these in the config file if you want to share the same one with the training. They will just be
ignored. ignored.
Otherwise you just need to pass the usual :class:`~transformers.TrainingArguments` arguments. For example: Otherwise you just need to pass the usual [`TrainingArguments`] arguments. For example:
.. code-block:: bash
deepspeed --num_gpus=2 your_program.py <normal cl args> --do_eval --deepspeed ds_config.json ```bash
deepspeed --num_gpus=2 your_program.py <normal cl args> --do_eval --deepspeed ds_config.json
```
The only important thing is that you need to use a ZeRO-3 configuration, since ZeRO-2 provides no benefit whatsoever The only important thing is that you need to use a ZeRO-3 configuration, since ZeRO-2 provides no benefit whatsoever
for the inference as only ZeRO-3 performs sharding of parameters, whereas ZeRO-1 shards gradients and optimizer states. for the inference as only ZeRO-3 performs sharding of parameters, whereas ZeRO-1 shards gradients and optimizer states.
Here is an example of running ``run_translation.py`` under DeepSpeed deploying all available GPUs: Here is an example of running `run_translation.py` under DeepSpeed deploying all available GPUs:
.. code-block:: bash
deepspeed examples/pytorch/translation/run_translation.py \ ```bash
--deepspeed tests/deepspeed/ds_config_zero3.json \ deepspeed examples/pytorch/translation/run_translation.py \
--model_name_or_path t5-small --output_dir output_dir \ --deepspeed tests/deepspeed/ds_config_zero3.json \
--do_eval --max_eval_samples 50 --warmup_steps 50 \ --model_name_or_path t5-small --output_dir output_dir \
--max_source_length 128 --val_max_target_length 128 \ --do_eval --max_eval_samples 50 --warmup_steps 50 \
--overwrite_output_dir --per_device_eval_batch_size 4 \ --max_source_length 128 --val_max_target_length 128 \
--predict_with_generate --dataset_config "ro-en" --fp16 \ --overwrite_output_dir --per_device_eval_batch_size 4 \
--source_lang en --target_lang ro --dataset_name wmt16 \ --predict_with_generate --dataset_config "ro-en" --fp16 \
--source_prefix "translate English to Romanian: " --source_lang en --target_lang ro --dataset_name wmt16 \
--source_prefix "translate English to Romanian: "
```
Since for inference there is no need for additional large memory used by the optimizer states and the gradients you Since for inference there is no need for additional large memory used by the optimizer states and the gradients you
should be able to fit much larger batches and/or sequence length onto the same hardware. should be able to fit much larger batches and/or sequence length onto the same hardware.
...@@ -1684,8 +1623,7 @@ to the ZeRO technology, but instead uses tensor parallelism to scale models that ...@@ -1684,8 +1623,7 @@ to the ZeRO technology, but instead uses tensor parallelism to scale models that
work in progress and we will provide the integration once that product is complete. work in progress and we will provide the integration once that product is complete.
Filing Issues ### Filing Issues
=======================================================================================================================
Here is how to file an issue so that we could quickly get to the bottom of the issue and help you to unblock your work. Here is how to file an issue so that we could quickly get to the bottom of the issue and help you to unblock your work.
...@@ -1693,30 +1631,29 @@ In your report please always include: ...@@ -1693,30 +1631,29 @@ In your report please always include:
1. the full Deepspeed config file in the report 1. the full Deepspeed config file in the report
2. either the command line arguments if you were using the :class:`~transformers.Trainer` or 2. either the command line arguments if you were using the [`Trainer`] or
:class:`~transformers.TrainingArguments` arguments if you were scripting the Trainer setup yourself. Please do not [`TrainingArguments`] arguments if you were scripting the Trainer setup yourself. Please do not
dump the :class:`~transformers.TrainingArguments` as it has dozens of entries that are irrelevant. dump the [`TrainingArguments`] as it has dozens of entries that are irrelevant.
3. Output of: 3. Output of:
.. code-block:: bash ```bash
python -c 'import torch; print(f"torch: {torch.__version__}")' python -c 'import torch; print(f"torch: {torch.__version__}")'
python -c 'import transformers; print(f"transformers: {transformers.__version__}")' python -c 'import transformers; print(f"transformers: {transformers.__version__}")'
python -c 'import deepspeed; print(f"deepspeed: {deepspeed.__version__}")' python -c 'import deepspeed; print(f"deepspeed: {deepspeed.__version__}")'
```
4. If possible include a link to a Google Colab notebook that we can reproduce the problem with. You can use this 4. If possible include a link to a Google Colab notebook that we can reproduce the problem with. You can use this
`notebook <https://github.com/stas00/porting/blob/master/transformers/deepspeed/DeepSpeed_on_colab_CLI.ipynb>`__ as [notebook](https://github.com/stas00/porting/blob/master/transformers/deepspeed/DeepSpeed_on_colab_CLI.ipynb) as
a starting point. a starting point.
5. Unless it's impossible please always use a standard dataset that we can use and not something custom. 5. Unless it's impossible please always use a standard dataset that we can use and not something custom.
6. If possible try to use one of the existing `examples 6. If possible try to use one of the existing [examples](https://github.com/huggingface/transformers/tree/master/examples/pytorch) to reproduce the problem with.
<https://github.com/huggingface/transformers/tree/master/examples/pytorch>`__ to reproduce the problem with.
Things to consider: Things to consider:
* Deepspeed is often not the cause of the problem. - Deepspeed is often not the cause of the problem.
Some of the filed issues proved to be Deepspeed-unrelated. That is once Deepspeed was removed from the setup, the Some of the filed issues proved to be Deepspeed-unrelated. That is once Deepspeed was removed from the setup, the
problem was still there. problem was still there.
...@@ -1725,109 +1662,97 @@ Things to consider: ...@@ -1725,109 +1662,97 @@ Things to consider:
exception and you can see that DeepSpeed modules are involved, first re-test your setup without DeepSpeed in it. exception and you can see that DeepSpeed modules are involved, first re-test your setup without DeepSpeed in it.
And only if the problem persists then do mentioned Deepspeed and supply all the required details. And only if the problem persists then do mentioned Deepspeed and supply all the required details.
* If it's clear to you that the issue is in the DeepSpeed core and not the integration part, please file the Issue - If it's clear to you that the issue is in the DeepSpeed core and not the integration part, please file the Issue
directly with `Deepspeed <https://github.com/microsoft/DeepSpeed/>`__. If you aren't sure, please do not worry, directly with [Deepspeed](https://github.com/microsoft/DeepSpeed/). If you aren't sure, please do not worry,
either Issue tracker will do, we will figure it out once you posted it and redirect you to another Issue tracker if either Issue tracker will do, we will figure it out once you posted it and redirect you to another Issue tracker if
need be. need be.
Troubleshooting ### Troubleshooting
=======================================================================================================================
* ``deepspeed`` process gets killed at startup without a traceback - `deepspeed` process gets killed at startup without a traceback
If the ``deepspeed`` process gets killed at launch time without a traceback, that usually means that the program tried If the `deepspeed` process gets killed at launch time without a traceback, that usually means that the program tried
to allocate more CPU memory than your system has or your process is allowed to allocate and the OS kernel killed that to allocate more CPU memory than your system has or your process is allowed to allocate and the OS kernel killed that
process. This is because your configuration file most likely has either ``offload_optimizer`` or ``offload_param`` or process. This is because your configuration file most likely has either `offload_optimizer` or `offload_param` or
both configured to offload to ``cpu``. If you have NVMe, experiment with offloading to NVMe if you're running under both configured to offload to `cpu`. If you have NVMe, experiment with offloading to NVMe if you're running under
ZeRO-3. ZeRO-3.
Work is being done to enable estimating how much memory is needed for a specific model: `PR Work is being done to enable estimating how much memory is needed for a specific model: [PR](https://github.com/microsoft/DeepSpeed/pull/965).
<https://github.com/microsoft/DeepSpeed/pull/965>`__.
Notes ### Notes
=======================================================================================================================
* DeepSpeed works with the PyTorch :class:`~transformers.Trainer` but not TF :class:`~transformers.TFTrainer`. - DeepSpeed works with the PyTorch [`Trainer`] but not TF [`TFTrainer`].
* While DeepSpeed has a pip installable PyPI package, it is highly recommended that it gets installed from `source - While DeepSpeed has a pip installable PyPI package, it is highly recommended that it gets installed from [source](https://github.com/microsoft/deepspeed#installation) to best match your hardware and also if you need to enable
<https://github.com/microsoft/deepspeed#installation>`__ to best match your hardware and also if you need to enable
certain features, like 1-bit Adam, which aren't available in the pypi distribution. certain features, like 1-bit Adam, which aren't available in the pypi distribution.
* You don't have to use the :class:`~transformers.Trainer` to use DeepSpeed with 🤗 Transformers - you can use any model - You don't have to use the [`Trainer`] to use DeepSpeed with 🤗 Transformers - you can use any model
with your own trainer, and you will have to adapt the latter according to `the DeepSpeed integration instructions with your own trainer, and you will have to adapt the latter according to [the DeepSpeed integration instructions](https://www.deepspeed.ai/getting-started/#writing-deepspeed-models).
<https://www.deepspeed.ai/getting-started/#writing-deepspeed-models>`__.
.. _deepspeed-non-trainer-integration: <a id='deepspeed-non-trainer-integration'></a>
Non-Trainer Deepspeed Integration ## Non-Trainer Deepspeed Integration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The :class:`~transformers.integrations.HfDeepSpeedConfig` is used to integrate Deepspeed into the 🤗 Transformers core The [`~integrations.HfDeepSpeedConfig`] is used to integrate Deepspeed into the 🤗 Transformers core
functionality, when :class:`~transformers.Trainer` is not used. functionality, when [`Trainer`] is not used.
When using :class:`~transformers.Trainer` everything is automatically taken care of. When using [`Trainer`] everything is automatically taken care of.
When not using :class:`~transformers.Trainer`, to efficiently deploy DeepSpeed stage 3, you must instantiate the When not using [`Trainer`], to efficiently deploy DeepSpeed stage 3, you must instantiate the
:class:`~transformers.integrations.HfDeepSpeedConfig` object before instantiating the model. [`~integrations.HfDeepSpeedConfig`] object before instantiating the model.
For example for a pretrained model: For example for a pretrained model:
.. code-block:: python ```python
from transformers.deepspeed import HfDeepSpeedConfig
from transformers.deepspeed import HfDeepSpeedConfig from transformers import AutoModel, deepspeed
from transformers import AutoModel, deepspeed
ds_config = { ... } # deepspeed config object or path to the file ds_config = { ... } # deepspeed config object or path to the file
# must run before instantiating the model # must run before instantiating the model
dschf = HfDeepSpeedConfig(ds_config) # keep this object alive dschf = HfDeepSpeedConfig(ds_config) # keep this object alive
model = AutoModel.from_pretrained("gpt2") model = AutoModel.from_pretrained("gpt2")
engine = deepspeed.initialize(model=model, config_params=ds_config, ...) engine = deepspeed.initialize(model=model, config_params=ds_config, ...)
```
or for non-pretrained model: or for non-pretrained model:
.. code-block:: python ```python
from transformers.deepspeed import HfDeepSpeedConfig
from transformers.deepspeed import HfDeepSpeedConfig from transformers import AutoModel, AutoConfig, deepspeed
from transformers import AutoModel, AutoConfig, deepspeed
ds_config = { ... } # deepspeed config object or path to the file
# must run before instantiating the model
dschf = HfDeepSpeedConfig(ds_config) # keep this object alive
config = AutoConfig.from_pretrained("gpt2")
model = AutoModel.from_config(config)
engine = deepspeed.initialize(model=model, config_params=ds_config, ...)
HfDeepSpeedConfig
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. autoclass:: transformers.deepspeed.HfDeepSpeedConfig ds_config = { ... } # deepspeed config object or path to the file
:members: # must run before instantiating the model
dschf = HfDeepSpeedConfig(ds_config) # keep this object alive
config = AutoConfig.from_pretrained("gpt2")
model = AutoModel.from_config(config)
engine = deepspeed.initialize(model=model, config_params=ds_config, ...)
```
## HfDeepSpeedConfig
[[autodoc]] deepspeed.HfDeepSpeedConfig
- all
Main DeepSpeed Resources ## Main DeepSpeed Resources
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- `Project's github <https://github.com/microsoft/deepspeed>`__ - [Project's github](https://github.com/microsoft/deepspeed)
- `Usage docs <https://www.deepspeed.ai/getting-started/>`__ - [Usage docs](https://www.deepspeed.ai/getting-started/)
- `API docs <https://deepspeed.readthedocs.io/en/latest/index.html>`__ - [API docs](https://deepspeed.readthedocs.io/en/latest/index.html)
- `Blog posts <https://www.microsoft.com/en-us/research/search/?q=deepspeed>`__ - [Blog posts](https://www.microsoft.com/en-us/research/search/?q=deepspeed)
Papers: Papers:
- `ZeRO: Memory Optimizations Toward Training Trillion Parameter Models <https://arxiv.org/abs/1910.02054>`__ - [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054)
- `ZeRO-Offload: Democratizing Billion-Scale Model Training <https://arxiv.org/abs/2101.06840>`__ - [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840)
- `ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning <https://arxiv.org/abs/2104.07857>`__ - [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857)
Finally, please, remember that, HuggingFace :class:`~transformers.Trainer` only integrates DeepSpeed, therefore if you Finally, please, remember that, HuggingFace [`Trainer`] only integrates DeepSpeed, therefore if you
have any problems or questions with regards to DeepSpeed usage, please, file an issue with `DeepSpeed GitHub have any problems or questions with regards to DeepSpeed usage, please, file an issue with [DeepSpeed GitHub](https://github.com/microsoft/DeepSpeed/issues).
<https://github.com/microsoft/DeepSpeed/issues>`__.
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Testing
Let's take a look at how 🤗 Transformer models are tested and how you can write new tests and improve the existing ones.
There are 2 test suites in the repository:
1. `tests` -- tests for the general API
2. `examples` -- tests primarily for various applications that aren't part of the API
## How transformers are tested
1. Once a PR is submitted it gets tested with 9 CircleCi jobs. Every new commit to that PR gets retested. These jobs
are defined in this [config file](https://github.com/huggingface/transformers-doc2mdx/tree/master/.circleci/config.yml), so that if needed you can reproduce the same
environment on your machine.
These CI jobs don't run `@slow` tests.
2. There are 3 jobs run by [github actions](https://github.com/huggingface/transformers/actions):
- [torch hub integration](https://github.com/huggingface/transformers-doc2mdx/tree/master/.github/workflows/github-torch-hub.yml): checks whether torch hub
integration works.
- [self-hosted (push)](https://github.com/huggingface/transformers-doc2mdx/tree/master/.github/workflows/self-push.yml): runs fast tests on GPU only on commits on
`master`. It only runs if a commit on `master` has updated the code in one of the following folders: `src`,
`tests`, `.github` (to prevent running on added model cards, notebooks, etc.)
- [self-hosted runner](https://github.com/huggingface/transformers-doc2mdx/tree/master/.github/workflows/self-scheduled.yml): runs normal and slow tests on GPU in
`tests` and `examples`:
```bash
RUN_SLOW=1 pytest tests/
RUN_SLOW=1 pytest examples/
```
The results can be observed [here](https://github.com/huggingface/transformers/actions).
## Running tests
### Choosing which tests to run
This document goes into many details of how tests can be run. If after reading everything, you need even more details
you will find them [here](https://docs.pytest.org/en/latest/usage.html).
Here are some most useful ways of running tests.
Run all:
```console
pytest
```
or:
```bash
make test
```
Note that the latter is defined as:
```bash
python -m pytest -n auto --dist=loadfile -s -v ./tests/
```
which tells pytest to:
- run as many test processes as they are CPU cores (which could be too many if you don't have a ton of RAM!)
- ensure that all tests from the same file will be run by the same test process
- do not capture output
- run in verbose mode
### Getting the list of all tests
All tests of the test suite:
```bash
pytest --collect-only -q
```
All tests of a given test file:
```bash
pytest tests/test_optimization.py --collect-only -q
```
### Run a specific test module
To run an individual test module:
```bash
pytest tests/test_logging.py
```
### Run specific tests
Since unittest is used inside most of the tests, to run specific subtests you need to know the name of the unittest
class containing those tests. For example, it could be:
```bash
pytest tests/test_optimization.py::OptimizationTest::test_adam_w
```
Here:
- `tests/test_optimization.py` - the file with tests
- `OptimizationTest` - the name of the class
- `test_adam_w` - the name of the specific test function
If the file contains multiple classes, you can choose to run only tests of a given class. For example:
```bash
pytest tests/test_optimization.py::OptimizationTest
```
will run all the tests inside that class.
As mentioned earlier you can see what tests are contained inside the `OptimizationTest` class by running:
```bash
pytest tests/test_optimization.py::OptimizationTest --collect-only -q
```
You can run tests by keyword expressions.
To run only tests whose name contains `adam`:
```bash
pytest -k adam tests/test_optimization.py
```
Logical `and` and `or` can be used to indicate whether all keywords should match or either. `not` can be used to
negate.
To run all tests except those whose name contains `adam`:
```bash
pytest -k "not adam" tests/test_optimization.py
```
And you can combine the two patterns in one:
```bash
pytest -k "ada and not adam" tests/test_optimization.py
```
For example to run both `test_adafactor` and `test_adam_w` you can use:
```bash
pytest -k "test_adam_w or test_adam_w" tests/test_optimization.py
```
Note that we use `or` here, since we want either of the keywords to match to include both.
If you want to include only tests that include both patterns, `and` is to be used:
```bash
pytest -k "test and ada" tests/test_optimization.py
```
### Run only modified tests
You can run the tests related to the unstaged files or the current branch (according to Git) by using [pytest-picked](https://github.com/anapaulagomes/pytest-picked). This is a great way of quickly testing your changes didn't break
anything, since it won't run the tests related to files you didn't touch.
```bash
pip install pytest-picked
```
```bash
pytest --picked
```
All tests will be run from files and folders which are modified, but not yet committed.
### Automatically rerun failed tests on source modification
[pytest-xdist](https://github.com/pytest-dev/pytest-xdist) provides a very useful feature of detecting all failed
tests, and then waiting for you to modify files and continuously re-rerun those failing tests until they pass while you
fix them. So that you don't need to re start pytest after you made the fix. This is repeated until all tests pass after
which again a full run is performed.
```bash
pip install pytest-xdist
```
To enter the mode: `pytest -f` or `pytest --looponfail`
File changes are detected by looking at `looponfailroots` root directories and all of their contents (recursively).
If the default for this value does not work for you, you can change it in your project by setting a configuration
option in `setup.cfg`:
```ini
[tool:pytest]
looponfailroots = transformers tests
```
or `pytest.ini`/``tox.ini`` files:
```ini
[pytest]
looponfailroots = transformers tests
```
This would lead to only looking for file changes in the respective directories, specified relatively to the ini-file’s
directory.
[pytest-watch](https://github.com/joeyespo/pytest-watch) is an alternative implementation of this functionality.
### Skip a test module
If you want to run all test modules, except a few you can exclude them by giving an explicit list of tests to run. For
example, to run all except `test_modeling_*.py` tests:
```bash
pytest *ls -1 tests/*py | grep -v test_modeling*
```
### Clearing state
CI builds and when isolation is important (against speed), cache should be cleared:
```bash
pytest --cache-clear tests
```
### Running tests in parallel
As mentioned earlier `make test` runs tests in parallel via `pytest-xdist` plugin (`-n X` argument, e.g. `-n 2`
to run 2 parallel jobs).
`pytest-xdist`'s `--dist=` option allows one to control how the tests are grouped. `--dist=loadfile` puts the
tests located in one file onto the same process.
Since the order of executed tests is different and unpredictable, if running the test suite with `pytest-xdist`
produces failures (meaning we have some undetected coupled tests), use [pytest-replay](https://github.com/ESSS/pytest-replay) to replay the tests in the same order, which should help with then somehow
reducing that failing sequence to a minimum.
### Test order and repetition
It's good to repeat the tests several times, in sequence, randomly, or in sets, to detect any potential
inter-dependency and state-related bugs (tear down). And the straightforward multiple repetition is just good to detect
some problems that get uncovered by randomness of DL.
#### Repeat tests
- [pytest-flakefinder](https://github.com/dropbox/pytest-flakefinder):
```bash
pip install pytest-flakefinder
```
And then run every test multiple times (50 by default):
```bash
pytest --flake-finder --flake-runs=5 tests/test_failing_test.py
```
<Tip>
This plugin doesn't work with `-n` flag from `pytest-xdist`.
</Tip>
<Tip>
There is another plugin `pytest-repeat`, but it doesn't work with `unittest`.
</Tip>
#### Run tests in a random order
```bash
pip install pytest-random-order
```
Important: the presence of `pytest-random-order` will automatically randomize tests, no configuration change or
command line options is required.
As explained earlier this allows detection of coupled tests - where one test's state affects the state of another. When
`pytest-random-order` is installed it will print the random seed it used for that session, e.g:
```bash
pytest tests
[...]
Using --random-order-bucket=module
Using --random-order-seed=573663
```
So that if the given particular sequence fails, you can reproduce it by adding that exact seed, e.g.:
```bash
pytest --random-order-seed=573663
[...]
Using --random-order-bucket=module
Using --random-order-seed=573663
```
It will only reproduce the exact order if you use the exact same list of tests (or no list at all). Once you start to
manually narrowing down the list you can no longer rely on the seed, but have to list them manually in the exact order
they failed and tell pytest to not randomize them instead using `--random-order-bucket=none`, e.g.:
```bash
pytest --random-order-bucket=none tests/test_a.py tests/test_c.py tests/test_b.py
```
To disable the shuffling for all tests:
```bash
pytest --random-order-bucket=none
```
By default `--random-order-bucket=module` is implied, which will shuffle the files on the module levels. It can also
shuffle on `class`, `package`, `global` and `none` levels. For the complete details please see its
[documentation](https://github.com/jbasko/pytest-random-order).
Another randomization alternative is: [`pytest-randomly`](https://github.com/pytest-dev/pytest-randomly). This
module has a very similar functionality/interface, but it doesn't have the bucket modes available in
`pytest-random-order`. It has the same problem of imposing itself once installed.
### Look and feel variations
#### pytest-sugar
[pytest-sugar](https://github.com/Frozenball/pytest-sugar) is a plugin that improves the look-n-feel, adds a
progressbar, and show tests that fail and the assert instantly. It gets activated automatically upon installation.
```bash
pip install pytest-sugar
```
To run tests without it, run:
```bash
pytest -p no:sugar
```
or uninstall it.
#### Report each sub-test name and its progress
For a single or a group of tests via `pytest` (after `pip install pytest-pspec`):
```bash
pytest --pspec tests/test_optimization.py
```
#### Instantly shows failed tests
[pytest-instafail](https://github.com/pytest-dev/pytest-instafail) shows failures and errors instantly instead of
waiting until the end of test session.
```bash
pip install pytest-instafail
```
```bash
pytest --instafail
```
### To GPU or not to GPU
On a GPU-enabled setup, to test in CPU-only mode add `CUDA_VISIBLE_DEVICES=""`:
```bash
CUDA_VISIBLE_DEVICES="" pytest tests/test_logging.py
```
or if you have multiple gpus, you can specify which one is to be used by `pytest`. For example, to use only the
second gpu if you have gpus `0` and `1`, you can run:
```bash
CUDA_VISIBLE_DEVICES="1" pytest tests/test_logging.py
```
This is handy when you want to run different tasks on different GPUs.
Some tests must be run on CPU-only, others on either CPU or GPU or TPU, yet others on multiple-GPUs. The following skip
decorators are used to set the requirements of tests CPU/GPU/TPU-wise:
- `require_torch` - this test will run only under torch
- `require_torch_gpu` - as `require_torch` plus requires at least 1 GPU
- `require_torch_multi_gpu` - as `require_torch` plus requires at least 2 GPUs
- `require_torch_non_multi_gpu` - as `require_torch` plus requires 0 or 1 GPUs
- `require_torch_up_to_2_gpus` - as `require_torch` plus requires 0 or 1 or 2 GPUs
- `require_torch_tpu` - as `require_torch` plus requires at least 1 TPU
Let's depict the GPU requirements in the following table:
| n gpus | decorator |
|--------+--------------------------------|
| `>= 0` | `@require_torch` |
| `>= 1` | `@require_torch_gpu` |
| `>= 2` | `@require_torch_multi_gpu` |
| `< 2` | `@require_torch_non_multi_gpu` |
| `< 3` | `@require_torch_up_to_2_gpus` |
For example, here is a test that must be run only when there are 2 or more GPUs available and pytorch is installed:
```python
@require_torch_multi_gpu
def test_example_with_multi_gpu():
```
If a test requires `tensorflow` use the `require_tf` decorator. For example:
```python
@require_tf
def test_tf_thing_with_tensorflow():
```
These decorators can be stacked. For example, if a test is slow and requires at least one GPU under pytorch, here is
how to set it up:
```python
@require_torch_gpu
@slow
def test_example_slow_on_gpu():
```
Some decorators like `@parametrized` rewrite test names, therefore `@require_*` skip decorators have to be listed
last for them to work correctly. Here is an example of the correct usage:
```python
@parameterized.expand(...)
@require_torch_multi_gpu
def test_integration_foo():
```
This order problem doesn't exist with `@pytest.mark.parametrize`, you can put it first or last and it will still
work. But it only works with non-unittests.
Inside tests:
- How many GPUs are available:
```python
from transformers.testing_utils import get_gpu_count
n_gpu = get_gpu_count() # works with torch and tf
```
### Distributed training
`pytest` can't deal with distributed training directly. If this is attempted - the sub-processes don't do the right
thing and end up thinking they are `pytest` and start running the test suite in loops. It works, however, if one
spawns a normal process that then spawns off multiple workers and manages the IO pipes.
Here are some tests that use it:
- [test_trainer_distributed.py](https://github.com/huggingface/transformers-doc2mdx/tree/master/tests/test_trainer_distributed.py)
- [test_deepspeed.py](https://github.com/huggingface/transformers-doc2mdx/tree/master/tests/deepspeed/test_deepspeed.py)
To jump right into the execution point, search for the `execute_subprocess_async` call in those tests.
You will need at least 2 GPUs to see these tests in action:
```bash
CUDA_VISIBLE_DEVICES=0,1 RUN_SLOW=1 pytest -sv tests/test_trainer_distributed.py
```
### Output capture
During test execution any output sent to `stdout` and `stderr` is captured. If a test or a setup method fails, its
according captured output will usually be shown along with the failure traceback.
To disable output capturing and to get the `stdout` and `stderr` normally, use `-s` or `--capture=no`:
```bash
pytest -s tests/test_logging.py
```
To send test results to JUnit format output:
```bash
py.test tests --junitxml=result.xml
```
### Color control
To have no color (e.g., yellow on white background is not readable):
```bash
pytest --color=no tests/test_logging.py
```
### Sending test report to online pastebin service
Creating a URL for each test failure:
```bash
pytest --pastebin=failed tests/test_logging.py
```
This will submit test run information to a remote Paste service and provide a URL for each failure. You may select
tests as usual or add for example -x if you only want to send one particular failure.
Creating a URL for a whole test session log:
```bash
pytest --pastebin=all tests/test_logging.py
```
## Writing tests
🤗 transformers tests are based on `unittest`, but run by `pytest`, so most of the time features from both systems
can be used.
You can read [here](https://docs.pytest.org/en/stable/unittest.html) which features are supported, but the important
thing to remember is that most `pytest` fixtures don't work. Neither parametrization, but we use the module
`parameterized` that works in a similar way.
### Parametrization
Often, there is a need to run the same test multiple times, but with different arguments. It could be done from within
the test, but then there is no way of running that test for just one set of arguments.
```python
# test_this1.py
import unittest
from parameterized import parameterized
class TestMathUnitTest(unittest.TestCase):
@parameterized.expand([
("negative", -1.5, -2.0),
("integer", 1, 1.0),
("large fraction", 1.6, 1),
])
def test_floor(self, name, input, expected):
assert_equal(math.floor(input), expected)
```
Now, by default this test will be run 3 times, each time with the last 3 arguments of `test_floor` being assigned the
corresponding arguments in the parameter list.
and you could run just the `negative` and `integer` sets of params with:
```bash
pytest -k "negative and integer" tests/test_mytest.py
```
or all but `negative` sub-tests, with:
```bash
pytest -k "not negative" tests/test_mytest.py
```
Besides using the `-k` filter that was just mentioned, you can find out the exact name of each sub-test and run any
or all of them using their exact names.
```bash
pytest test_this1.py --collect-only -q
```
and it will list:
```bash
test_this1.py::TestMathUnitTest::test_floor_0_negative
test_this1.py::TestMathUnitTest::test_floor_1_integer
test_this1.py::TestMathUnitTest::test_floor_2_large_fraction
```
So now you can run just 2 specific sub-tests:
```bash
pytest test_this1.py::TestMathUnitTest::test_floor_0_negative test_this1.py::TestMathUnitTest::test_floor_1_integer
```
The module [parameterized](https://pypi.org/project/parameterized/) which is already in the developer dependencies
of `transformers` works for both: `unittests` and `pytest` tests.
If, however, the test is not a `unittest`, you may use `pytest.mark.parametrize` (or you may see it being used in
some existing tests, mostly under `examples`).
Here is the same example, this time using `pytest`'s `parametrize` marker:
```python
# test_this2.py
import pytest
@pytest.mark.parametrize(
"name, input, expected",
[
("negative", -1.5, -2.0),
("integer", 1, 1.0),
("large fraction", 1.6, 1),
],
)
def test_floor(name, input, expected):
assert_equal(math.floor(input), expected)
```
Same as with `parameterized`, with `pytest.mark.parametrize` you can have a fine control over which sub-tests are
run, if the `-k` filter doesn't do the job. Except, this parametrization function creates a slightly different set of
names for the sub-tests. Here is what they look like:
```bash
pytest test_this2.py --collect-only -q
```
and it will list:
```bash
test_this2.py::test_floor[integer-1-1.0]
test_this2.py::test_floor[negative--1.5--2.0]
test_this2.py::test_floor[large fraction-1.6-1]
```
So now you can run just the specific test:
```bash
pytest test_this2.py::test_floor[negative--1.5--2.0] test_this2.py::test_floor[integer-1-1.0]
```
as in the previous example.
### Files and directories
In tests often we need to know where things are relative to the current test file, and it's not trivial since the test
could be invoked from more than one directory or could reside in sub-directories with different depths. A helper class
`transformers.test_utils.TestCasePlus` solves this problem by sorting out all the basic paths and provides easy
accessors to them:
- `pathlib` objects (all fully resolved):
- `test_file_path` - the current test file path, i.e. `__file__`
- `test_file_dir` - the directory containing the current test file
- `tests_dir` - the directory of the `tests` test suite
- `examples_dir` - the directory of the `examples` test suite
- `repo_root_dir` - the directory of the repository
- `src_dir` - the directory of `src` (i.e. where the `transformers` sub-dir resides)
- stringified paths---same as above but these return paths as strings, rather than `pathlib` objects:
- `test_file_path_str`
- `test_file_dir_str`
- `tests_dir_str`
- `examples_dir_str`
- `repo_root_dir_str`
- `src_dir_str`
To start using those all you need is to make sure that the test resides in a subclass of
`transformers.test_utils.TestCasePlus`. For example:
```python
from transformers.testing_utils import TestCasePlus
class PathExampleTest(TestCasePlus):
def test_something_involving_local_locations(self):
data_dir = self.tests_dir / "fixtures/tests_samples/wmt_en_ro"
```
If you don't need to manipulate paths via `pathlib` or you just need a path as a string, you can always invoked
`str()` on the `pathlib` object or use the accessors ending with `_str`. For example:
```python
from transformers.testing_utils import TestCasePlus
class PathExampleTest(TestCasePlus):
def test_something_involving_stringified_locations(self):
examples_dir = self.examples_dir_str
```
### Temporary files and directories
Using unique temporary files and directories are essential for parallel test running, so that the tests won't overwrite
each other's data. Also we want to get the temporary files and directories removed at the end of each test that created
them. Therefore, using packages like `tempfile`, which address these needs is essential.
However, when debugging tests, you need to be able to see what goes into the temporary file or directory and you want
to know it's exact path and not having it randomized on every test re-run.
A helper class `transformers.test_utils.TestCasePlus` is best used for such purposes. It's a sub-class of
`unittest.TestCase`, so we can easily inherit from it in the test modules.
Here is an example of its usage:
```python
from transformers.testing_utils import TestCasePlus
class ExamplesTests(TestCasePlus):
def test_whatever(self):
tmp_dir = self.get_auto_remove_tmp_dir()
```
This code creates a unique temporary directory, and sets `tmp_dir` to its location.
- Create a unique temporary dir:
```python
def test_whatever(self):
tmp_dir = self.get_auto_remove_tmp_dir()
```
`tmp_dir` will contain the path to the created temporary dir. It will be automatically removed at the end of the
test.
- Create a temporary dir of my choice, ensure it's empty before the test starts and don't empty it after the test.
```python
def test_whatever(self):
tmp_dir = self.get_auto_remove_tmp_dir("./xxx")
```
This is useful for debug when you want to monitor a specific directory and want to make sure the previous tests didn't
leave any data in there.
- You can override the default behavior by directly overriding the `before` and `after` args, leading to one of the
following behaviors:
- `before=True`: the temporary dir will always be cleared at the beginning of the test.
- `before=False`: if the temporary dir already existed, any existing files will remain there.
- `after=True`: the temporary dir will always be deleted at the end of the test.
- `after=False`: the temporary dir will always be left intact at the end of the test.
<Tip>
In order to run the equivalent of `rm -r` safely, only subdirs of the project repository checkout are allowed if
an explicit obj:*tmp_dir* is used, so that by mistake no `/tmp` or similar important part of the filesystem will
get nuked. i.e. please always pass paths that start with `./`.
</Tip>
<Tip>
Each test can register multiple temporary directories and they all will get auto-removed, unless requested
otherwise.
</Tip>
### Temporary sys.path override
If you need to temporary override `sys.path` to import from another test for example, you can use the
`ExtendSysPath` context manager. Example:
```python
import os
from transformers.testing_utils import ExtendSysPath
bindir = os.path.abspath(os.path.dirname(__file__))
with ExtendSysPath(f"{bindir}/.."):
from test_trainer import TrainerIntegrationCommon # noqa
```
### Skipping tests
This is useful when a bug is found and a new test is written, yet the bug is not fixed yet. In order to be able to
commit it to the main repository we need make sure it's skipped during `make test`.
Methods:
- A **skip** means that you expect your test to pass only if some conditions are met, otherwise pytest should skip
running the test altogether. Common examples are skipping windows-only tests on non-windows platforms, or skipping
tests that depend on an external resource which is not available at the moment (for example a database).
- A **xfail** means that you expect a test to fail for some reason. A common example is a test for a feature not yet
implemented, or a bug not yet fixed. When a test passes despite being expected to fail (marked with
pytest.mark.xfail), it’s an xpass and will be reported in the test summary.
One of the important differences between the two is that `skip` doesn't run the test, and `xfail` does. So if the
code that's buggy causes some bad state that will affect other tests, do not use `xfail`.
#### Implementation
- Here is how to skip whole test unconditionally:
```python
@unittest.skip("this bug needs to be fixed")
def test_feature_x():
```
or via pytest:
```python
@pytest.mark.skip(reason="this bug needs to be fixed")
```
or the `xfail` way:
```python
@pytest.mark.xfail
def test_feature_x():
```
- Here is how to skip a test based on some internal check inside the test:
```python
def test_feature_x():
if not has_something():
pytest.skip("unsupported configuration")
```
or the whole module:
```python
import pytest
if not pytest.config.getoption("--custom-flag"):
pytest.skip("--custom-flag is missing, skipping tests", allow_module_level=True)
```
or the `xfail` way:
```python
def test_feature_x():
pytest.xfail("expected to fail until bug XYZ is fixed")
```
- Here is how to skip all tests in a module if some import is missing:
```python
docutils = pytest.importorskip("docutils", minversion="0.3")
```
- Skip a test based on a condition:
```python
@pytest.mark.skipif(sys.version_info < (3,6), reason="requires python3.6 or higher")
def test_feature_x():
```
or:
```python
@unittest.skipIf(torch_device == "cpu", "Can't do half precision")
def test_feature_x():
```
or skip the whole module:
```python
@pytest.mark.skipif(sys.platform == 'win32', reason="does not run on windows")
class TestClass():
def test_feature_x(self):
```
More details, example and ways are [here](https://docs.pytest.org/en/latest/skipping.html).
### Slow tests
The library of tests is ever-growing, and some of the tests take minutes to run, therefore we can't afford waiting for
an hour for the test suite to complete on CI. Therefore, with some exceptions for essential tests, slow tests should be
marked as in the example below:
```python
from transformers.testing_utils import slow
@slow
def test_integration_foo():
```
Once a test is marked as `@slow`, to run such tests set `RUN_SLOW=1` env var, e.g.:
```bash
RUN_SLOW=1 pytest tests
```
Some decorators like `@parameterized` rewrite test names, therefore `@slow` and the rest of the skip decorators
`@require_*` have to be listed last for them to work correctly. Here is an example of the correct usage:
```python
@parameterized.expand(...)
@slow
def test_integration_foo():
```
As explained at the beginning of this document, slow tests get to run on a scheduled basis, rather than in PRs CI
checks. So it's possible that some problems will be missed during a PR submission and get merged. Such problems will
get caught during the next scheduled CI job. But it also means that it's important to run the slow tests on your
machine before submitting the PR.
Here is a rough decision making mechanism for choosing which tests should be marked as slow:
If the test is focused on one of the library's internal components (e.g., modeling files, tokenization files,
pipelines), then we should run that test in the non-slow test suite. If it's focused on an other aspect of the library,
such as the documentation or the examples, then we should run these tests in the slow test suite. And then, to refine
this approach we should have exceptions:
- All tests that need to download a heavy set of weights or a dataset that is larger than ~50MB (e.g., model or
tokenizer integration tests, pipeline integration tests) should be set to slow. If you're adding a new model, you
should create and upload to the hub a tiny version of it (with random weights) for integration tests. This is
discussed in the following paragraphs.
- All tests that need to do a training not specifically optimized to be fast should be set to slow.
- We can introduce exceptions if some of these should-be-non-slow tests are excruciatingly slow, and set them to
`@slow`. Auto-modeling tests, which save and load large files to disk, are a good example of tests that are marked
as `@slow`.
- If a test completes under 1 second on CI (including downloads if any) then it should be a normal test regardless.
Collectively, all the non-slow tests need to cover entirely the different internals, while remaining fast. For example,
a significant coverage can be achieved by testing with specially created tiny models with random weights. Such models
have the very minimal number of layers (e.g., 2), vocab size (e.g., 1000), etc. Then the `@slow` tests can use large
slow models to do qualitative testing. To see the use of these simply look for *tiny* models with:
```bash
grep tiny tests examples
```
Here is a an example of a [script](https://github.com/huggingface/transformers-doc2mdx/tree/master/scripts/fsmt/fsmt-make-tiny-model.py) that created the tiny model
[stas/tiny-wmt19-en-de](https://huggingface.co/stas/tiny-wmt19-en-de). You can easily adjust it to your specific
model's architecture.
It's easy to measure the run-time incorrectly if for example there is an overheard of downloading a huge model, but if
you test it locally the downloaded files would be cached and thus the download time not measured. Hence check the
execution speed report in CI logs instead (the output of `pytest --durations=0 tests`).
That report is also useful to find slow outliers that aren't marked as such, or which need to be re-written to be fast.
If you notice that the test suite starts getting slow on CI, the top listing of this report will show the slowest
tests.
### Testing the stdout/stderr output
In order to test functions that write to `stdout` and/or `stderr`, the test can access those streams using the
`pytest`'s [capsys system](https://docs.pytest.org/en/latest/capture.html). Here is how this is accomplished:
```python
import sys
def print_to_stdout(s): print(s)
def print_to_stderr(s): sys.stderr.write(s)
def test_result_and_stdout(capsys):
msg = "Hello"
print_to_stdout(msg)
print_to_stderr(msg)
out, err = capsys.readouterr() # consume the captured output streams
# optional: if you want to replay the consumed streams:
sys.stdout.write(out)
sys.stderr.write(err)
# test:
assert msg in out
assert msg in err
```
And, of course, most of the time, `stderr` will come as a part of an exception, so try/except has to be used in such
a case:
```python
def raise_exception(msg): raise ValueError(msg)
def test_something_exception():
msg = "Not a good value"
error = ''
try:
raise_exception(msg)
except Exception as e:
error = str(e)
assert msg in error, f"{msg} is in the exception:\n{error}"
```
Another approach to capturing stdout is via `contextlib.redirect_stdout`:
```python
from io import StringIO
from contextlib import redirect_stdout
def print_to_stdout(s): print(s)
def test_result_and_stdout():
msg = "Hello"
buffer = StringIO()
with redirect_stdout(buffer):
print_to_stdout(msg)
out = buffer.getvalue()
# optional: if you want to replay the consumed streams:
sys.stdout.write(out)
# test:
assert msg in out
```
An important potential issue with capturing stdout is that it may contain `\r` characters that in normal `print`
reset everything that has been printed so far. There is no problem with `pytest`, but with `pytest -s` these
characters get included in the buffer, so to be able to have the test run with and without `-s`, you have to make an
extra cleanup to the captured output, using `re.sub(r'~.*\r', '', buf, 0, re.M)`.
But, then we have a helper context manager wrapper to automatically take care of it all, regardless of whether it has
some `\r`'s in it or not, so it's a simple:
```python
from transformers.testing_utils import CaptureStdout
with CaptureStdout() as cs:
function_that_writes_to_stdout()
print(cs.out)
```
Here is a full test example:
```python
from transformers.testing_utils import CaptureStdout
msg = "Secret message\r"
final = "Hello World"
with CaptureStdout() as cs:
print(msg + final)
assert cs.out == final+"\n", f"captured: {cs.out}, expecting {final}"
```
If you'd like to capture `stderr` use the `CaptureStderr` class instead:
```python
from transformers.testing_utils import CaptureStderr
with CaptureStderr() as cs:
function_that_writes_to_stderr()
print(cs.err)
```
If you need to capture both streams at once, use the parent `CaptureStd` class:
```python
from transformers.testing_utils import CaptureStd

with CaptureStd() as cs:
    function_that_writes_to_stdout_and_stderr()
print(cs.err, cs.out)
```
Also, to aid debugging test issues, by default these context managers automatically replay the captured streams on exit
from the context.
### Capturing logger stream
If you need to validate the output of a logger, you can use `CaptureLogger`:
```python
from transformers import logging
from transformers.testing_utils import CaptureLogger

msg = "Testing 1, 2, 3"
logging.set_verbosity_info()
logger = logging.get_logger("transformers.models.bart.tokenization_bart")
with CaptureLogger(logger) as cl:
    logger.info(msg)
assert cl.out == msg + "\n"
```
### Testing with environment variables
If you want to test the impact of environment variables for a specific test you can use the helper decorator
`transformers.testing_utils.mockenv`:
```python
import os
import unittest

from transformers.testing_utils import mockenv


class HfArgumentParserTest(unittest.TestCase):
    @mockenv(TRANSFORMERS_VERBOSITY="error")
    def test_env_override(self):
        env_level_str = os.getenv("TRANSFORMERS_VERBOSITY", None)
        self.assertEqual(env_level_str, "error")
```
At times an external program needs to be called, which requires setting `PYTHONPATH` in `os.environ` to include
multiple local paths. A helper class `transformers.testing_utils.TestCasePlus` comes to the rescue:
```python
from transformers.testing_utils import TestCasePlus


class EnvExampleTest(TestCasePlus):
    def test_external_prog(self):
        env = self.get_env()
        # now call the external program, passing `env` to it
```
Depending on whether the test file is located under the `tests` test suite or `examples`, it will correctly set up
`env[PYTHONPATH]` to include one of these two directories, as well as the `src` directory to ensure the testing is
done against the current repo, and finally whatever `env[PYTHONPATH]` was already set to before the test was
called, if anything.
This helper method creates a copy of the `os.environ` object, so the original remains intact.
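For instance, here is a rough sketch of what calling an external program with that environment might look like (the `-c` command below is purely illustrative):

```python
import subprocess
import sys

from transformers.testing_utils import TestCasePlus


class EnvExampleTest(TestCasePlus):
    def test_external_prog(self):
        env = self.get_env()  # a copy of os.environ with PYTHONPATH set up
        # run a child python process that must be able to import the repo's `transformers`
        cmd = [sys.executable, "-c", "import transformers; print(transformers.__file__)"]
        result = subprocess.run(cmd, env=env, capture_output=True, text=True)
        assert result.returncode == 0, result.stderr
```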
### Getting reproducible results
In some situations you may want to remove randomness for your tests. To get identical reproducible results, you
will need to fix the seed:
```python
seed = 42

# python RNG
import random

random.seed(seed)

# pytorch RNGs
import torch

torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

# numpy RNG
import numpy as np

np.random.seed(seed)

# tf RNG (only if tensorflow is part of your test)
import tensorflow as tf

tf.random.set_seed(seed)
```
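Alternatively, `transformers` provides a small `set_seed` helper that fixes the python, numpy and pytorch (and, if installed, tensorflow) seeds in one call. A minimal usage sketch (note that it doesn't toggle `torch.backends.cudnn.deterministic` for you):

```python
from transformers import set_seed

set_seed(42)
```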
### Debugging tests
To start a debugger at the point of the warning, do this:
```bash
pytest tests/test_logging.py -W error::UserWarning --pdb
```
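Similarly, to drop straight into the debugger on the first test failure rather than on a warning, the standard `pytest` flags can be combined:

```bash
# stop after the first failure and open the python debugger there
pytest tests/test_logging.py -x --pdb
```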
## Working with github actions workflows
To trigger a self-push workflow CI job, you must:
1. Create a new branch on `transformers` origin (not a fork!).
2. The branch name has to start with either `ci_` or `ci-` (`master` triggers it too, but we can't do PRs on
   `master`). It also gets triggered only for specific paths - you can find the up-to-date definition, in case it
   changed since this document was written, [here](https://github.com/huggingface/transformers/blob/master/.github/workflows/self-push.yml) under *push:*.
3. Create a PR from this branch.
4. Then you can see the job appear [here](https://github.com/huggingface/transformers/actions/workflows/self-push.yml). It may not run right away if there
is a backlog.
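Put together, the command-line part of the above looks roughly like this (the branch name is just an example):

```bash
# create a CI-triggering branch directly on the main repo (not on a fork) and push it
git checkout -b ci_debug-my-feature
git push origin ci_debug-my-feature
# then open a PR from this branch in the github UI and watch the self-push workflow run
```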
## Testing Experimental CI Features
Testing CI features can be potentially problematic since it can interfere with the normal CI functioning. Therefore if
a new CI feature is to be added, it should be done as follows.
1. Create a new dedicated job that tests what needs to be tested.
2. The new job must always succeed so that it gives us a green ✓ (details below).
3. Let it run for some days to see that a variety of different PR types get to run on it (user fork branches,
   non-forked branches, branches originating from github.com UI direct file edits, various forced pushes, etc. - there
   are so many) while monitoring the experimental job's logs (not the overall job status, as it's purposefully always
   green).
4. When it's clear that everything is solid, then merge the new changes into existing jobs.
That way experiments on CI functionality itself won't interfere with the normal workflow.
Now how can we make the job always succeed while the new CI feature is being developed?
Some CIs, like TravisCI, support ignore-step-failure and will report the overall job as successful, but CircleCI and
Github Actions as of this writing don't support that.
So the following workaround can be used:
1. `set +euo pipefail` at the beginning of the run command to suppress most potential failures in the bash script.
2. The last command must be a success: `echo "done"` or just `true` will do.
Here is an example:
```yaml
- run:
    name: run CI experiment
    command: |
        set +euo pipefail
        echo "setting run-all-despite-any-errors-mode"
        this_command_will_fail
        echo "but bash continues to run"
        # emulate another failure
        false
        # but the last command must be a success
        echo "during experiment do not remove: reporting success to CI, even if there were failures"
```
For simple commands you could also do:
```bash
cmd_that_may_fail || true
```
Of course, once satisfied with the results, integrate the experimental step or job with the rest of the normal jobs,
while removing `set +euo pipefail` or any other things you may have added to ensure that the experimental job doesn't
interfere with the normal CI functioning.
This whole process would have been much easier if we could just set something like `allow-failure` for the
experimental step, and let it fail without impacting the overall status of PRs. But as mentioned earlier, CircleCI and
Github Actions don't support it at the moment.
You can vote for this feature and see where it stands at these CI-specific threads:
- [Github Actions:](https://github.com/actions/toolkit/issues/399)
- [CircleCI:](https://ideas.circleci.com/ideas/CCI-I-344)