Commit 18587639: [doc porting] several docs (#14858)
Authored by Stas Bekman, co-authored by Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* [doc porting] 2 docs
* Apply suggestions from code review
* Update docs/source/main_classes/deepspeed.mdx
* cleanup
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Debugging
## Underflow and Overflow Detection
<Tip>
This feature is currently available for PyTorch only.
</Tip>
<Tip>
For multi-GPU training it requires DDP (`torch.distributed.launch`).
</Tip>
<Tip>
This feature can be used with any `nn.Module`-based model.
</Tip>
If you start getting `loss=NaN` or the model exhibits some other abnormal behavior due to `inf` or `nan` in
activations or weights, you need to discover where the first underflow or overflow happens and what led to it. Luckily
you can accomplish that easily by activating a special module that will do the detection automatically.
If you're using [`Trainer`], you just need to add:
```bash
--debug underflow_overflow
```
to the normal command line arguments, or pass `debug="underflow_overflow"` when creating the
[`TrainingArguments`] object.
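For example, a minimal sketch of the programmatic route might look like this (the `model` and `train_dataset` objects are placeholders for your usual setup):

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="output_dir",
    # activate the underflow/overflow detector without touching the command line
    debug="underflow_overflow",
)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```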
If you're using your own training loop or another Trainer you can accomplish the same with:
```python
from transformers.debug_utils import DebugUnderflowOverflow

debug_overflow = DebugUnderflowOverflow(model)
```
[`~debug_utils.DebugUnderflowOverflow`] inserts hooks into the model that, immediately after each forward call,
test the input and output variables and also the corresponding module's weights. As soon as `inf` or `nan` is
detected in at least one element of the activations or weights, the program asserts and prints a report like this
(this one was caught with `google/mt5-small` under fp16 mixed precision):
```
Detected inf/nan during batch_number=0
Last 21 forward frames:
abs min abs max metadata
encoder.block.1.layer.1.DenseReluDense.dropout Dropout
0.00e+00 2.57e+02 input[0]
0.00e+00 2.85e+02 output
[...]
encoder.block.2.layer.0 T5LayerSelfAttention
6.78e-04 3.15e+03 input[0]
2.65e-04 3.42e+03 output[0]
None output[1]
2.25e-01 1.00e+04 output[2]
encoder.block.2.layer.1.layer_norm T5LayerNorm
8.69e-02 4.18e-01 weight
2.65e-04 3.42e+03 input[0]
1.79e-06 4.65e+00 output
encoder.block.2.layer.1.DenseReluDense.wi_0 Linear
2.17e-07 4.50e+00 weight
1.79e-06 4.65e+00 input[0]
2.68e-06 3.70e+01 output
encoder.block.2.layer.1.DenseReluDense.wi_1 Linear
8.08e-07 2.66e+01 weight
1.79e-06 4.65e+00 input[0]
1.27e-04 2.37e+02 output
encoder.block.2.layer.1.DenseReluDense.dropout Dropout
0.00e+00 8.76e+03 input[0]
0.00e+00 9.74e+03 output
encoder.block.2.layer.1.DenseReluDense.wo Linear
1.01e-06 6.44e+00 weight
0.00e+00 9.74e+03 input[0]
3.18e-04 6.27e+04 output
encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense
1.79e-06 4.65e+00 input[0]
3.18e-04 6.27e+04 output
encoder.block.2.layer.1.dropout Dropout
3.18e-04 6.27e+04 input[0]
0.00e+00 inf output
```
The example output has been trimmed in the middle for brevity.

The second column shows the value of the absolute largest element. If you take a closer look at the last few frames,
the inputs and outputs were in the range of `1e4`, so when this training was done under fp16 mixed precision the very
last step overflowed (since under `fp16` the largest representable number before `inf` is `65504`, roughly `64e3`). To
avoid overflows under `fp16` the activations must remain way below `1e4`, because `1e4 * 1e4 = 1e8`, so any matrix
multiplication with large activations is going to lead to a numerical overflow condition.
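As a quick sanity check of that limit, you can query the fp16 range directly:

```python
import torch

# fp16 values are representable only up to 65504; anything larger becomes inf
print(torch.finfo(torch.float16).max)  # 65504.0
```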
At the very start of the trace you can discover at which batch number the problem occurred (here `Detected inf/nan during batch_number=0` means the problem occurred on the first batch).
Each reported frame starts by declaring the fully qualified entry for the corresponding module this frame is reporting
for. If we look just at this frame:
```
encoder.block.2.layer.1.layer_norm T5LayerNorm
8.69e-02 4.18e-01 weight
2.65e-04 3.42e+03 input[0]
1.79e-06 4.65e+00 output
```
Here, `encoder.block.2.layer.1.layer_norm` indicates that it was a layer norm for the first layer of the second
block of the encoder, and that the specific `forward` call was that of `T5LayerNorm`.
Let's look at the last few frames of that report:
```
Detected inf/nan during batch_number=0
Last 21 forward frames:
abs min abs max metadata
[...]
encoder.block.2.layer.1.DenseReluDense.wi_0 Linear
2.17e-07 4.50e+00 weight
1.79e-06 4.65e+00 input[0]
2.68e-06 3.70e+01 output
encoder.block.2.layer.1.DenseReluDense.wi_1 Linear
8.08e-07 2.66e+01 weight
1.79e-06 4.65e+00 input[0]
1.27e-04 2.37e+02 output
encoder.block.2.layer.1.DenseReluDense.wo Linear
1.01e-06 6.44e+00 weight
0.00e+00 9.74e+03 input[0]
3.18e-04 6.27e+04 output
encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense
1.79e-06 4.65e+00 input[0]
3.18e-04 6.27e+04 output
encoder.block.2.layer.1.dropout Dropout
3.18e-04 6.27e+04 input[0]
0.00e+00 inf output
```
The last frame reports on the `Dropout.forward` function, with the first entry for its only input and the second for
its only output. You can see that it was called from the `dropout` attribute inside the `DenseReluDense` class, and
that it happened during the first layer of the second block, on the very first batch. Finally, the absolute largest
input element was `6.27e+04` and the corresponding output was `inf`.
You can see here that `T5DenseGatedGeluDense.forward` produced output activations whose absolute max value was around
62.7K, very close to fp16's upper limit of 64K. In the next frame we have `Dropout`, which rescales the remaining
elements after it has zeroed some of them out; this pushes the absolute max value past 64K and we get an overflow
(`inf`).

As you can see, it's the previous frames that we need to look into, where the numbers start becoming too large for fp16.
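To make the rescaling step concrete, here is a small sketch. The actual dropout rate isn't shown in the report, so the mT5 default of `0.1` is assumed here; in training mode, dropout scales surviving elements by `1/(1 - p)`:

```python
import torch

# 6.27e4 / 0.9 ≈ 6.97e4, which exceeds the fp16 maximum of 65504
x = torch.tensor(6.27e4, dtype=torch.float16)
print(x / 0.9)  # tensor(inf, dtype=torch.float16)
```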
Let's match the report to the code from `models/t5/modeling_t5.py`:
```python
class T5DenseGatedGeluDense(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.wi_0 = nn.Linear(config.d_model, config.d_ff, bias=False)
        self.wi_1 = nn.Linear(config.d_model, config.d_ff, bias=False)
        self.wo = nn.Linear(config.d_ff, config.d_model, bias=False)
        self.dropout = nn.Dropout(config.dropout_rate)
        self.gelu_act = ACT2FN["gelu_new"]

    def forward(self, hidden_states):
        hidden_gelu = self.gelu_act(self.wi_0(hidden_states))
        hidden_linear = self.wi_1(hidden_states)
        hidden_states = hidden_gelu * hidden_linear
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.wo(hidden_states)
        return hidden_states
```
Now it's easy to see the `dropout` call, and all the previous calls as well.
Since the detection is happening in a forward hook, these reports are printed immediately after each `forward`
returns.
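Conceptually, the detector registers a forward hook on every submodule, along these lines (a simplified sketch, not the actual implementation; it assumes a `model` is already in scope):

```python
import torch


def make_hook(name):
    def hook(module, inputs, output):
        # simplified: the real detector also inspects the module's weights and keeps a history of frames
        tensors = [t for t in (*inputs, output) if isinstance(t, torch.Tensor)]
        if any(not torch.isfinite(t).all() for t in tensors):
            raise ValueError(f"inf/nan detected in {name} ({module.__class__.__name__})")

    return hook


for name, module in model.named_modules():
    module.register_forward_hook(make_hook(name))
```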
Going back to the full report: to act on it and fix the problem, we need to go up a few frames to where the numbers
started to grow, and most likely switch to `fp32` mode there, so that the numbers don't overflow when multiplied
or summed up. Of course, there might be other solutions. For example, we could turn off `amp` temporarily, if it's
enabled, after moving the original `forward` into a helper wrapper, like so:
```python
import torch


def _forward(self, hidden_states):
    hidden_gelu = self.gelu_act(self.wi_0(hidden_states))
    hidden_linear = self.wi_1(hidden_states)
    hidden_states = hidden_gelu * hidden_linear
    hidden_states = self.dropout(hidden_states)
    hidden_states = self.wo(hidden_states)
    return hidden_states


def forward(self, hidden_states):
    if torch.is_autocast_enabled():
        with torch.cuda.amp.autocast(enabled=False):
            return self._forward(hidden_states)
    else:
        return self._forward(hidden_states)
```
Since the automatic detector only reports on inputs and outputs of full frames, once you know where to look you may
want to analyse the intermediate stages of any specific `forward` function as well. In such a case you can use the
`detect_overflow` helper function to inject the detector where you want it, for example:
```python
from transformers.debug_utils import detect_overflow


class T5LayerFF(nn.Module):
    [...]

    def forward(self, hidden_states):
        forwarded_states = self.layer_norm(hidden_states)
        detect_overflow(forwarded_states, "after layer_norm")
        forwarded_states = self.DenseReluDense(forwarded_states)
        detect_overflow(forwarded_states, "after DenseReluDense")
        return hidden_states + self.dropout(forwarded_states)
```
You can see that we added two of these, and now we track whether an `inf` or `nan` for `forwarded_states` was detected
somewhere in between.

Actually, the detector already reports these because each of the calls in the example above is an `nn.Module`, but if
you had some local direct calculations, this is how you'd track them.
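For instance, a hypothetical local computation could be checked directly (the `query` and `key` tensors here are placeholders, not part of the original example):

```python
from transformers.debug_utils import detect_overflow

# check an intermediate product that never passes through an nn.Module on its own
scores = query @ key.transpose(-1, -2)
detect_overflow(scores, "attention scores before softmax")
```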
Additionally, if you're instantiating the debugger in your own code, you can adjust the number of frames printed from
its default, e.g.:
```python
from transformers.debug_utils import DebugUnderflowOverflow

debug_overflow = DebugUnderflowOverflow(model, max_frames_to_save=100)
```
### Specific batch absolute min and max value tracing
The same debugging class can be used for per-batch tracing with the underflow/overflow detection feature turned off.
Let's say you want to watch the absolute min and max values for all the ingredients of each `forward` call of a given
batch, and only do that for batches 1 and 3. Then you instantiate this class as:
```python
debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1,3])
```
And now full batches 1 and 3 will be traced using the same format as the underflow/overflow detector does.
Batches are 0-indexed.
This is helpful if you know that the program starts misbehaving after a certain batch number, so you can fast-forward
right to that area. Here is a sample truncated output for such a configuration:
```
*** Starting batch number=1 ***
abs min abs max metadata
shared Embedding
1.01e-06 7.92e+02 weight
0.00e+00 2.47e+04 input[0]
5.36e-05 7.92e+02 output
[...]
decoder.dropout Dropout
1.60e-07 2.27e+01 input[0]
0.00e+00 2.52e+01 output
decoder T5Stack
not a tensor output
lm_head Linear
1.01e-06 7.92e+02 weight
0.00e+00 1.11e+00 input[0]
6.06e-02 8.39e+01 output
T5ForConditionalGeneration
not a tensor output
*** Starting batch number=3 ***
abs min abs max metadata
shared Embedding
1.01e-06 7.92e+02 weight
0.00e+00 2.78e+04 input[0]
5.36e-05 7.92e+02 output
[...]
```
Here you will get a huge number of frames dumped - as many as there were forward calls in your model - so it may or
may not be what you want, but sometimes it can be easier to use for debugging than a normal debugger. For example, if
a problem starts happening at batch number 150, you can dump traces for batches 149 and 150 and compare where the
numbers started to diverge.
You can also specify the batch number after which to stop the training, with:
```python
debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1,3], abort_after_batch_num=3)
```
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# DeepSpeed Integration
[DeepSpeed](https://github.com/microsoft/DeepSpeed) implements everything described in the [ZeRO paper](https://arxiv.org/abs/1910.02054). Currently it provides full support for:

1. Optimizer state partitioning (ZeRO stage 1)
2. Gradient partitioning (ZeRO stage 2)
[...]
5. A range of fast CUDA-extension-based optimizers
6. ZeRO-Offload to CPU and NVMe

ZeRO-Offload has its own dedicated paper: [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840). And NVMe-support is described in the paper [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857).

DeepSpeed ZeRO-2 is primarily used only for training, as its features are of no use to inference.

DeepSpeed ZeRO-3 can be used for inference as well, since it allows huge models to be loaded on multiple GPUs, which
won't be possible on a single GPU.

🤗 Transformers integrates [DeepSpeed](https://github.com/microsoft/DeepSpeed) via 2 options:

1. Integration of the core DeepSpeed features via [`Trainer`]. This is an everything-done-for-you type of
   integration - just supply your custom config file or use our template and you have nothing else to do. Most of
   this document is focused on this feature.
2. If you don't use [`Trainer`] and want to use your own Trainer where you integrated DeepSpeed
   yourself, core functionality functions like `from_pretrained` and `from_config` include integration of essential
   parts of DeepSpeed like `zero.Init` for ZeRO stage 3 and higher. To tap into this feature read the docs on
   [deepspeed-non-trainer-integration](#deepspeed-non-trainer-integration).

What is integrated:

[...]

Inference:

1. DeepSpeed ZeRO Inference supports ZeRO stage 3 with ZeRO-Infinity. It uses the same ZeRO protocol as training, but
   it doesn't use an optimizer and a lr scheduler and only stage 3 is relevant. For more details see:
   [deepspeed-zero-inference](#deepspeed-zero-inference).

There is also DeepSpeed Inference - this is a totally different technology which uses Tensor Parallelism instead of
ZeRO (coming soon).
<a id='deepspeed-trainer-integration'></a>

## Trainer Deepspeed Integration

<a id='deepspeed-installation'></a>

### Installation

Install the library via pypi:

```bash
pip install deepspeed
```

or via `transformers`' `extras`:

```bash
pip install transformers[deepspeed]
```

or find more details on [the DeepSpeed's GitHub page](https://github.com/microsoft/deepspeed#installation) and
[advanced install](https://www.deepspeed.ai/tutorials/advanced-install/).

If you're still struggling with the build, first make sure to read [zero-install-notes](#zero-install-notes).
If you don't prebuild the extensions and rely on them to be built at run time and you tried all of the above solutions
to no avail, the next thing to try is to pre-build the modules before installing them.

To make a local build for DeepSpeed:

```bash
git clone https://github.com/microsoft/DeepSpeed/
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . \
--global-option="build_ext" --global-option="-j8" --no-cache -v \
--disable-pip-version-check 2>&1 | tee build.log
```

If you intend to use NVMe offload you will need to also include `DS_BUILD_AIO=1` in the instructions above (and also
install *libaio-dev* system-wide).

Edit `TORCH_CUDA_ARCH_LIST` to insert the code for the architectures of the GPU cards you intend to use. Assuming all
your cards are the same you can get the arch via:

```bash
CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_capability())"
```

So if you get `8, 6`, then use `TORCH_CUDA_ARCH_LIST="8.6"`. If you have multiple different cards, you can list all
of them like so: `TORCH_CUDA_ARCH_LIST="6.1;8.6"`

If you need to use the same setup on multiple machines, make a binary wheel:

```bash
git clone https://github.com/microsoft/DeepSpeed/
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 \
python setup.py build_ext -j8 bdist_wheel
```

it will generate something like `dist/deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl` which now you can install
as `pip install deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl` locally or on any other machine.

Again, remember to adjust `TORCH_CUDA_ARCH_LIST` to the target architectures.

You can find the complete list of NVIDIA GPUs and their corresponding **Compute Capabilities** (same as arch in this
context) [here](https://developer.nvidia.com/cuda-gpus).

You can check the archs pytorch was built with using:

```bash
python -c "import torch; print(torch.cuda.get_arch_list())"
```

Here is how to find out the arch for one of the installed GPUs. For example, for GPU 0:

```bash
CUDA_VISIBLE_DEVICES=0 python -c "import torch; \
print(torch.cuda.get_device_properties(torch.device('cuda')))"
```

If the output is:

```bash
_CudaDeviceProperties(name='GeForce RTX 3090', major=8, minor=6, total_memory=24268MB, multi_processor_count=82)
```

then you know that this card's arch is `8.6`.

You can also leave `TORCH_CUDA_ARCH_LIST` out completely and then the build program will automatically query the
architecture of the GPUs the build is made on. This may or may not match the GPUs on the target machines, that's why
it's best to specify the desired archs explicitly.

If after trying everything suggested you still encounter build issues, please, proceed with the GitHub Issue of
[Deepspeed](https://github.com/microsoft/DeepSpeed/issues).
<a id='deepspeed-multi-gpu'></a>

### Deployment with multiple GPUs

To deploy this feature with multiple GPUs adjust the [`Trainer`] command line arguments as following:

1. replace `python -m torch.distributed.launch` with `deepspeed`.
2. add a new argument `--deepspeed ds_config.json`, where `ds_config.json` is the DeepSpeed configuration file as
   documented [here](https://www.deepspeed.ai/docs/config-json/). The file naming is up to you.

Therefore, if your original command line looked as following:

```bash
python -m torch.distributed.launch --nproc_per_node=2 your_program.py <normal cl args>
```

Now it should be:

```bash
deepspeed --num_gpus=2 your_program.py <normal cl args> --deepspeed ds_config.json
```

Unlike `torch.distributed.launch`, where you have to specify how many GPUs to use with `--nproc_per_node`, with the
`deepspeed` launcher you don't have to use the corresponding `--num_gpus` if you want all of your GPUs used. The
full details on how to configure various nodes and GPUs can be found [here](https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node).

In fact, you can continue using `-m torch.distributed.launch` with DeepSpeed as long as you don't need to use the
`deepspeed` launcher-specific arguments. Typically if you don't need a multi-node setup you're not required to use
the `deepspeed` launcher. But since in the DeepSpeed documentation it'll be used everywhere, for consistency we will
use it here as well.

Here is an example of running `run_translation.py` under DeepSpeed deploying all available GPUs:

```bash
deepspeed examples/pytorch/translation/run_translation.py \
--deepspeed tests/deepspeed/ds_config_zero3.json \
--model_name_or_path t5-small --per_device_train_batch_size 1 \
--output_dir output_dir --overwrite_output_dir --fp16 \
--do_train --max_train_samples 500 --num_train_epochs 1 \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_lang en --target_lang ro
```

Note that in the DeepSpeed documentation you are likely to see `--deepspeed --deepspeed_config ds_config.json` - i.e.
two DeepSpeed-related arguments, but for the sake of simplicity, and since there are already so many arguments to deal
with, we combined the two into a single argument.

For some practical usage examples, please, see this [post](https://github.com/huggingface/transformers/issues/8771#issuecomment-759248400).
<a id='deepspeed-one-gpu'></a>

### Deployment with one GPU

To deploy DeepSpeed with one GPU adjust the [`Trainer`] command line arguments as following:

```bash
deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py \
--deepspeed tests/deepspeed/ds_config_zero2.json \
--model_name_or_path t5-small --per_device_train_batch_size 1 \
--output_dir output_dir --overwrite_output_dir --fp16 \
--do_train --max_train_samples 500 --num_train_epochs 1 \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_lang en --target_lang ro
```

This is almost the same as with multiple-GPUs, but here we tell DeepSpeed explicitly to use just one GPU via
`--num_gpus=1`. By default, DeepSpeed deploys all GPUs it can see on the given node. If you have only 1 GPU to start
with, then you don't need this argument. The following [documentation](https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node) discusses the launcher options.

Why would you want to use DeepSpeed with just one GPU?

[...]

While we are going to discuss the configuration in detail next, the key to getting a huge improvement on a single GPU
with DeepSpeed is to have at least the following configuration in the configuration file:

```json
{
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
[...]
        "overlap_comm": true,
        "contiguous_gradients": true
    }
}
```

which enables optimizer offload and some other important features. You may experiment with the buffer sizes, you will
find more details in the discussion below.

For a practical usage example of this type of deployment, please, see this [post](https://github.com/huggingface/transformers/issues/8771#issuecomment-759176685).

You may also try the ZeRO-3 with CPU and NVMe offload as explained further in this document.

[...]

Notes:

- if you need to run on a specific GPU, which is different from GPU 0, you can't use `CUDA_VISIBLE_DEVICES` to limit
the visible scope of available GPUs. Instead, you have to use the following syntax:

```bash
deepspeed --include localhost:1 examples/pytorch/translation/run_translation.py ...
```

In this example, we tell DeepSpeed to use GPU 1 (second gpu).
<a id='deepspeed-notebook'></a>

### Deployment in Notebooks

The problem with running notebook cells as a script is that there is no normal `deepspeed` launcher to rely on, so
under certain setups we have to emulate it.

If you're using only 1 GPU, here is how you'd have to adjust your training code in the notebook to use DeepSpeed.

```python
# DeepSpeed requires a distributed environment even when only one process is used.
# This emulates a launcher in the notebook
import os

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '9994'  # modify if RuntimeError: Address already in use
os.environ['RANK'] = "0"
os.environ['LOCAL_RANK'] = "0"
os.environ['WORLD_SIZE'] = "1"

# Now proceed as normal, plus pass the deepspeed config file
training_args = TrainingArguments(..., deepspeed="ds_config_zero3.json")
trainer = Trainer(...)
trainer.train()
```

Note: `...` stands for the normal arguments that you'd pass to the functions.

If you want to use more than 1 GPU, you must use a multi-process environment for DeepSpeed to work. That is, you have
to use the launcher for that purpose and this cannot be accomplished by emulating the distributed environment presented
at the beginning of this section.

[...]

If you want to create the config file on the fly in the notebook in the current directory, you could have a dedicated
cell with:

```python
%%bash
cat <<'EOT' > ds_config_zero3.json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
[...]
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
EOT
```

If the training script is in a normal file and not in the notebook cells, you can launch `deepspeed` normally via
shell from a cell. For example, to use `run_translation.py` you would launch it with:

```python
!git clone https://github.com/huggingface/transformers
!cd transformers; deepspeed examples/pytorch/translation/run_translation.py ...
```

or with `%%bash` magic, where you can write multi-line code for the shell program to run:

```python
%%bash

git clone https://github.com/huggingface/transformers
cd transformers
deepspeed examples/pytorch/translation/run_translation.py ...
```

In such case you don't need any of the code presented at the beginning of this section.

Note: while `%%bash` magic is neat, it currently buffers the output so you won't see the logs until the process
completes.
<a id='deepspeed-config'></a>

### Configuration

For the complete guide to the DeepSpeed configuration options that can be used in its configuration file please refer
to the [following documentation](https://www.deepspeed.ai/docs/config-json/).

You can find dozens of DeepSpeed configuration examples that address various practical needs in [the DeepSpeedExamples
repo](https://github.com/microsoft/DeepSpeedExamples):

```bash
git clone https://github.com/microsoft/DeepSpeedExamples
cd DeepSpeedExamples
find . -name '*json'
```

Continuing the code from above, let's say you're looking to configure the Lamb optimizer. So you can search through the
example `.json` files with:

```bash
grep -i Lamb $(find . -name '*json')
```

Some more examples are to be found in the [main repo](https://github.com/microsoft/DeepSpeed) as well.

When using DeepSpeed you always need to supply a DeepSpeed configuration file, yet some configuration parameters have
to be configured via the command line. You will find the nuances in the rest of this guide.

To get an idea of what a DeepSpeed configuration file looks like, here is one that activates ZeRO stage 2 features,
including optimizer states cpu offload, uses `AdamW` optimizer and `WarmupLR` scheduler and will enable mixed
precision training if `--fp16` is passed:

```json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
[...]
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
```

When you execute the program, DeepSpeed will log the configuration it received from the [`Trainer`]
to the console, so you can see exactly what was the final configuration passed to it.
.. _deepspeed-config-passing: <a id='deepspeed-config-passing'></a>
Passing Configuration ### Passing Configuration
=======================================================================================================================
As discussed in this document normally the DeepSpeed configuration is passed as a path to a json file, but if you're As discussed in this document normally the DeepSpeed configuration is passed as a path to a json file, but if you're
not using the command line interface to configure the training, and instead instantiate the not using the command line interface to configure the training, and instead instantiate the
:class:`~transformers.Trainer` via :class:`~transformers.TrainingArguments` then for the ``deepspeed`` argument you can [`Trainer`] via [`TrainingArguments`] then for the `deepspeed` argument you can
pass a nested ``dict``. This allows you to create the configuration on the fly and doesn't require you to write it to pass a nested `dict`. This allows you to create the configuration on the fly and doesn't require you to write it to
the file system before passing it to :class:`~transformers.TrainingArguments`. the file system before passing it to [`TrainingArguments`].
To summarize you can do: To summarize you can do:
.. code-block:: python ```python
TrainingArguments(..., deepspeed="/path/to/ds_config.json")
TrainingArguments(..., deepspeed="/path/to/ds_config.json") ```
or: or:
.. code-block:: python ```python
ds_config_dict=dict(scheduler=scheduler_params, optimizer=optimizer_params)
TrainingArguments(..., deepspeed=ds_config_dict)
```
ds_config_dict=dict(scheduler=scheduler_params, optimizer=optimizer_params) <a id='deepspeed-config-shared'></a>
TrainingArguments(..., deepspeed=ds_config_dict)
<a id='deepspeed-config-shared'></a>

### Shared Configuration

<Tip warning={true}>

This section is a must-read.

</Tip>

Some configuration values are required by both the [`Trainer`] and DeepSpeed to function correctly.
Therefore, to prevent conflicting definitions, which could lead to hard-to-detect errors, we chose to configure those
via the [`Trainer`] command line arguments.

Additionally, some configuration values are derived automatically based on the model's configuration, so instead of
remembering to manually adjust multiple values, it's best to let the [`Trainer`] do the majority
of the configuration for you.

Therefore, in the rest of this guide you will find a special configuration value: `auto`, which when set will be
automatically replaced with the correct or most efficient value. Please feel free to ignore this
recommendation and set the values explicitly, in which case be very careful that your
[`Trainer`] arguments and DeepSpeed configuration agree. For example, are you using the same
learning rate, batch size, or gradient accumulation settings? If these mismatch, the training may fail in very
difficult-to-detect ways. You have been warned.

There are multiple other values that are specific to DeepSpeed-only and those you will have to set manually to suit
your needs.

In your own programs, you can also use the following approach if you'd like to modify the DeepSpeed config as a master
and configure [`TrainingArguments`] based on that. The steps are:

1. Create or load the DeepSpeed configuration to be used as a master configuration
2. Create the [`TrainingArguments`] object based on these values

Do note that some values, such as `scheduler.params.total_num_steps`, are calculated by
[`Trainer`] during `train`, but you can of course do the math yourself.

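Here is a minimal sketch of that approach. The config file name, the world size and the batch-size arithmetic are illustrative assumptions, not a fixed recipe:

```python
import json

from transformers import TrainingArguments

# 1. load the DeepSpeed config that acts as the master configuration
with open("ds_config_zero3.json") as f:  # hypothetical file name
    ds_config = json.load(f)

# 2. derive the TrainingArguments values from it, instead of the other way around
world_size = 2  # assumption: training on 2 GPUs
per_device_bs = ds_config["train_micro_batch_size_per_gpu"]  # assumes an explicit value, not "auto"
grad_accum = ds_config["train_batch_size"] // (per_device_bs * world_size)

training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=per_device_bs,
    gradient_accumulation_steps=grad_accum,
    deepspeed=ds_config,  # pass the very same dict so the two sides cannot disagree
)
```

Keeping the DeepSpeed dict as the single source and deriving the [`Trainer`] arguments from it avoids the mismatches described above.
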
<a id='deepspeed-zero'></a>

### ZeRO

[Zero Redundancy Optimizer (ZeRO)](https://www.deepspeed.ai/tutorials/zero/) is the workhorse of DeepSpeed. It
supports 3 different levels (stages) of optimization. The first one is not very interesting for scalability purposes,
therefore this document focuses on stages 2 and 3. Stage 3 is further improved by the latest addition of ZeRO-Infinity.
You will find more in-depth information in the DeepSpeed documentation.

The `zero_optimization` section of the configuration file is the most important part ([docs](https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training)), since that is where you define
which ZeRO stages you want to enable and how to configure them. You will find the explanation for each parameter in the
DeepSpeed docs.

This section has to be configured exclusively via the DeepSpeed configuration - the [`Trainer`] provides
no equivalent command line arguments.

Note: currently DeepSpeed doesn't validate parameter names, so if you misspell any, it'll use the default setting for
the misspelled parameter. You can watch the DeepSpeed engine startup log messages to see what values it is
going to use.

<a id='deepspeed-zero2-config'></a>

#### ZeRO-2 Config

The following is an example configuration for ZeRO stage 2:

```json
{
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            [...]
        },
        [...]
        "reduce_bucket_size": 5e8,
        "contiguous_gradients": true
    }
}
```

**Performance tuning:**

- enabling `offload_optimizer` should reduce GPU RAM usage (it requires `"stage": 2`)
- `"overlap_comm": true` trades off increased GPU RAM usage to lower all-reduce latency. `overlap_comm` uses 4.5x
  the `allgather_bucket_size` and `reduce_bucket_size` values. So if they are set to 5e8, this requires a 9GB
  footprint (`5e8 x 2 bytes x 2 x 4.5`). Therefore, if you have a GPU with 8GB or less RAM, to avoid getting
  OOM-errors you will need to reduce those parameters to about `2e8`, which would require 3.6GB. You will want to do
  the same on larger-capacity GPUs as well, if you're starting to hit OOM.
- when reducing these buffers you're trading communication speed for more available GPU RAM. The smaller the buffer size,
  the slower the communication and the more GPU RAM will be available to other tasks. So if a bigger batch size is
  important to you, a slightly slower training time could be a worthwhile trade-off.

<a id='deepspeed-zero3-config'></a>

#### ZeRO-3 Config

The following is an example configuration for ZeRO stage 3:

```json
{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            [...]
        },
        [...]
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_fp16_weights_on_model_save": true
    }
}
```

If you are getting OOMs because your model or activations don't fit into the GPU memory and you have unutilized CPU
memory, offloading the optimizer states and parameters to CPU memory with `"device": "cpu"` may solve this limitation.
If you don't want to offload to CPU memory, use `none` instead of `cpu` for the `device` entry. Offloading to
NVMe is discussed further down.

Pinned memory is enabled with `pin_memory` set to `true`. This feature can improve the throughput at the cost of
making less memory available to other processes. Pinned memory is set aside for the specific process that requested it
and it is typically accessed much faster than normal CPU memory.

**Performance tuning:**

- `stage3_max_live_parameters`: `1e9`
- `stage3_max_reuse_distance`: `1e9`

If hitting OOM, reduce `stage3_max_live_parameters` and `stage3_max_reuse_distance`. They should have minimal impact
on performance unless you are doing activation checkpointing. `1e9` would consume ~2GB. The memory is shared by
`stage3_max_live_parameters` and `stage3_max_reuse_distance`, so it's not additive, it's just 2GB total.

`stage3_max_live_parameters` is the upper limit on how many full parameters you want to keep on the GPU at any given
time. The "reuse distance" is a metric used to figure out when a parameter will be used again in the future, and we
use `stage3_max_reuse_distance` to decide whether to throw away the parameter or to keep it. If a parameter is
going to be used again in the near future (less than `stage3_max_reuse_distance`) then we keep it to reduce communication
overhead. This is super helpful when you have activation checkpointing enabled, where we do forward recompute and
backward passes at a single-layer granularity and want to keep the parameter in the forward recompute until the backward pass.

The following configuration values depend on the model's hidden size:

- `reduce_bucket_size`: `hidden_size*hidden_size`
- `stage3_prefetch_bucket_size`: `0.9 * hidden_size * hidden_size`
- `stage3_param_persistence_threshold`: `10 * hidden_size`

Therefore set these values to `auto` and the [`Trainer`] will automatically assign the recommended
values. But, of course, feel free to set these explicitly as well.

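As a rough sketch of what `auto` resolves to, here is the arithmetic spelled out; the model name is just an example, and the [`Trainer`] does this resolution for you:

```python
# Rough sketch of the hidden-size-derived values; "t5-small" is only an illustrative model.
from transformers import AutoConfig

hidden_size = AutoConfig.from_pretrained("t5-small").d_model  # T5 calls its hidden size d_model (512 here)

zero3_auto_values = {
    "reduce_bucket_size": hidden_size * hidden_size,                       # 262144
    "stage3_prefetch_bucket_size": int(0.9 * hidden_size * hidden_size),   # 235929
    "stage3_param_persistence_threshold": 10 * hidden_size,                # 5120
}
```
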
`stage3_gather_fp16_weights_on_model_save` enables model fp16 weights consolidation when the model gets saved. With large
models and multiple GPUs this is an expensive operation both in terms of memory and speed. It's currently required if
you plan to resume the training. Watch out for future updates that will remove this limitation and make things more
flexible.

If you're migrating from a ZeRO-2 configuration, note that the `allgather_partitions`, `allgather_bucket_size` and
`reduce_scatter` configuration parameters are not used in ZeRO-3. If you keep these in the config file they will just
be ignored.

- `sub_group_size`: `1e9`

`sub_group_size` controls the granularity in which parameters are updated during optimizer steps. Parameters are
grouped into buckets of `sub_group_size` and each bucket is updated one at a time. When used with NVMe offload in
ZeRO-Infinity, `sub_group_size` therefore controls the granularity in which model states are moved in and out of CPU
memory from NVMe during the optimizer step. This prevents running out of CPU memory for extremely large models.

You can leave `sub_group_size` at its default value of *1e9* when not using NVMe offload. You may want to change its
default value in the following cases:

1. Running into OOM during the optimizer step: reduce `sub_group_size` to reduce the memory utilization of temporary buffers
2. The optimizer step is taking a long time: increase `sub_group_size` to improve bandwidth utilization as a result of
   the increased data buffers.

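For the first case, a lowered value might look like the following sketch; the number is illustrative only and should be chosen to fit your CPU memory:

```python
# Illustrative only: lowering sub_group_size to reduce temporary-buffer memory during the optimizer step.
ds_config_fragment = {
    "zero_optimization": {
        "stage": 3,
        "sub_group_size": 1e8,  # down from the default 1e9; pick a value that fits your setup
    }
}
```
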
<a id='deepspeed-nvme'></a>

### NVMe Support

ZeRO-Infinity allows for training incredibly large models by extending GPU and CPU memory with NVMe memory. Thanks to
smart partitioning and tiling algorithms each GPU needs to send and receive very small amounts of data during the
offload process, which makes a modern NVMe a good fit for extending the total memory pool available to the training
process. ZeRO-Infinity requires ZeRO-3 to be enabled.

The following configuration example enables NVMe to offload both optimizer states and params:

```json
{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            [...]
        },
        [...]
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_fp16_weights_on_model_save": true
    },
    [...]
}
```

You can choose to offload both optimizer states and params to NVMe, just one of them, or neither. For example, if you
have copious amounts of CPU memory available, by all means offload to CPU memory only as it'd be faster (hint:
`"device": "cpu"`).

Here is the full documentation for offloading [optimizer states](https://www.deepspeed.ai/docs/config-json/#optimizer-offloading) and [parameters](https://www.deepspeed.ai/docs/config-json/#parameter-offloading).

Make sure that your `nvme_path` is actually an NVMe, since it will work with a normal hard drive or SSD, but it'll
be much, much slower. The fast scalable training was designed with modern NVMe transfer speeds in mind (as of this
writing one can have ~3.5GB/s read, ~3GB/s write peak speeds).

In order to figure out the optimal `aio` configuration block you must run a benchmark on your target setup, as
[explained here](https://github.com/microsoft/DeepSpeed/issues/998).

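For orientation, here is a sketch of what an `aio` block looks like, expressed as a config dict. The values are common starting points and assumptions, not benchmark results, so replace them with your own measurements:

```python
# Starting-point values only; replace them with the results of the benchmark linked above.
aio_block = {
    "aio": {
        "block_size": 262144,    # bytes per I/O request
        "queue_depth": 32,       # number of outstanding I/O requests
        "thread_count": 1,       # I/O submission threads
        "single_submit": False,
        "overlap_events": True,
    }
}
```
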
<a id='deepspeed-zero2-zero3-performance'></a>

#### ZeRO-2 vs ZeRO-3 Performance

ZeRO-3 is likely to be slower than ZeRO-2 if everything else is configured the same because the former has to gather
model weights in addition to what ZeRO-2 does. If ZeRO-2 meets your needs and you don't need to scale beyond a few GPUs,
then you may choose to stick with it. It's important to understand that ZeRO-3 enables much higher scalability
at a cost of speed.

It's possible to adjust the ZeRO-3 configuration to make it perform closer to ZeRO-2:

- set `stage3_param_persistence_threshold` to a very large number, larger than the largest parameter, e.g., `6 * hidden_size * hidden_size`. This will keep the parameters on the GPUs.
- turn off `offload_params` since ZeRO-2 doesn't have that option.

The performance will likely improve significantly with just `offload_params` turned off, even if you don't change
`stage3_param_persistence_threshold`. Of course, these changes will impact the size of the model you can train. So
these help you to trade scalability for speed depending on your needs.

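A sketch of such a ZeRO-3 section, expressed as a config dict; the threshold value is illustrative only and should be larger than your largest parameter:

```python
# Illustrative ZeRO-3 settings tuned to behave more like ZeRO-2: a huge persistence threshold and no param offload.
zero2_like_zero3 = {
    "zero_optimization": {
        "stage": 3,
        "stage3_param_persistence_threshold": 1e8,  # example value; must exceed your largest parameter's element count
        # note: no offload_params section, so parameters stay on the GPUs
    }
}
```
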
<a id='deepspeed-zero2-example'></a>

#### ZeRO-2 Example

Here is a full ZeRO-2 auto-configuration file `ds_config_zero2.json`:

```json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        [...]
    },
    [...]
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
```

Here is a full ZeRO-2 all-enabled manually set configuration file. It is here mainly for you to see what the typical
values look like, but we highly recommend using the one with multiple `auto` settings in it.

```json
{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        [...]
    },
    [...]
    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}
```

<a id='deepspeed-zero3-example'></a>

#### ZeRO-3 Example

Here is a full ZeRO-3 auto-configuration file `ds_config_zero3.json`:

```json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        [...]
    },
    [...]
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
```

Here is a full ZeRO-3 all-enabled manually set configuration file. It is here mainly for you to see what the typical
values look like, but we highly recommend using the one with multiple `auto` settings in it.

```json
{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        [...]
    },
    [...]
    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}
```

### Optimizer and Scheduler

As long as you don't enable `offload_optimizer` you can mix and match DeepSpeed and HuggingFace schedulers and
optimizers, with the exception of using the combination of HuggingFace scheduler and DeepSpeed optimizer:

| Combos       | HF Scheduler | DS Scheduler |
|--------------|--------------|--------------|
| HF Optimizer | Yes          | Yes          |
| DS Optimizer | No           | Yes          |

It is possible to use a non-DeepSpeed optimizer when `offload_optimizer` is enabled, as long as it has both a CPU and
a GPU implementation (with the exception of LAMB).

<a id='deepspeed-optimizer'></a>

#### Optimizer

DeepSpeed's main optimizers are Adam, AdamW, OneBitAdam, and Lamb. These have been thoroughly tested with ZeRO and are
thus recommended to be used. DeepSpeed can, however, import other optimizers from `torch`. The full documentation is [here](https://www.deepspeed.ai/docs/config-json/#optimizer-parameters).

If you don't configure the `optimizer` entry in the configuration file, the [`Trainer`] will
automatically set it to `AdamW` and will use the supplied values or the defaults for the following command line
arguments: `--learning_rate`, `--adam_beta1`, `--adam_beta2`, `--adam_epsilon` and `--weight_decay`.

Here is an example of the auto-configured `optimizer` entry for `AdamW`:

```json
{
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    }
}
```

Note that the command line arguments will set the values in the configuration file. This is so that there is one
definitive source of the values and to avoid hard-to-find errors when, for example, the learning rate is set to
different values in different places. Command line rules. The values that get overridden are:

- `lr` with the value of `--learning_rate`
- `betas` with the value of `--adam_beta1 --adam_beta2`
- `eps` with the value of `--adam_epsilon`
- `weight_decay` with the value of `--weight_decay`

Therefore please remember to tune the shared hyperparameters on the command line.

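If you configure the [`Trainer`] in code rather than via the command line, the same shared hyperparameters are plain [`TrainingArguments`] fields; a minimal sketch with illustrative values:

```python
# Illustrative values only: these TrainingArguments fields are what --learning_rate,
# --adam_beta1/--adam_beta2, --adam_epsilon and --weight_decay map to.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",
    learning_rate=3e-5,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    weight_decay=0.01,
    deepspeed="ds_config_zero2.json",  # the "auto" optimizer params get filled from the values above
)
```
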
You can also set the values explicitly:

```json
{
    "optimizer": {
        "type": "AdamW",
        "params": {
            [...]
            "weight_decay": 3e-7
        }
    }
}
```

But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed
configuration.

If you want to use another optimizer which is not listed above, you will have to add the following to the top-level
configuration:

```json
{
    "zero_allow_untested_optimizer": true
}
```

Similarly to `AdamW`, you can configure other officially supported optimizers. Just remember that those may have different
config values, e.g. for Adam you will want `weight_decay` around `0.01`.

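For example, a sketch of an `optimizer` entry for DeepSpeed's Adam; apart from the `weight_decay` ballpark mentioned above, the values are illustrative:

```python
# Illustrative Adam entry; only the weight_decay recommendation comes from the text above.
adam_optimizer_config = {
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 3e-5,             # example value
            "betas": [0.9, 0.999],  # example values
            "eps": 1e-8,            # example value
            "weight_decay": 0.01,   # recommended ballpark for Adam
        }
    }
}
```
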
<a id='deepspeed-scheduler'></a>

#### Scheduler

DeepSpeed supports the `LRRangeTest`, `OneCycle`, `WarmupLR` and `WarmupDecayLR` learning rate schedulers. The full
documentation is [here](https://www.deepspeed.ai/docs/config-json/#scheduler-parameters).

Here is where the schedulers overlap between 🤗 Transformers and DeepSpeed:

- `WarmupLR` via `--lr_scheduler_type constant_with_warmup`
- `WarmupDecayLR` via `--lr_scheduler_type linear`. This is also the default value for `--lr_scheduler_type`,
  therefore, if you don't configure the scheduler this is the scheduler that will get configured by default.

If you don't configure the `scheduler` entry in the configuration file, the [`Trainer`] will use
the values of `--lr_scheduler_type`, `--learning_rate` and `--warmup_steps` or `--warmup_ratio` to configure a
🤗 Transformers version of it.

Here is an example of the auto-configured `scheduler` entry for `WarmupLR`:

```json
{
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    }
}
```

Since `"auto"` is used, the [`Trainer`] arguments will set the correct values in the configuration
file. This is so that there is one definitive source of the values and to avoid hard-to-find errors when, for example,
the learning rate is set to different values in different places. Command line rules. The values that get set are:

- `warmup_min_lr` with the value of `0`.
- `warmup_max_lr` with the value of `--learning_rate`.
- `warmup_num_steps` with the value of `--warmup_steps` if provided. Otherwise it will use `--warmup_ratio`
  multiplied by the number of training steps and rounded up.
- `total_num_steps` with either the value of `--max_steps` or, if it is not provided, derived automatically at run
  time based on the environment, the size of the dataset and other command line arguments (needed for
  `WarmupDecayLR`).

You can, of course, take over any or all of the configuration values and set those yourself:

```json
{
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            [...]
            "warmup_num_steps": 1000
        }
    }
}
```

But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed
configuration.

For example, for `WarmupDecayLR`, you can use the following entry:

```json
{
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "total_num_steps": "auto",
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    }
}
```

and `total_num_steps`, `warmup_max_lr` and `warmup_num_steps` will be set at loading time.

<a id='deepspeed-fp32'></a>

### fp32 Precision

DeepSpeed supports the full fp32 and the fp16 mixed precision.

Because of the much reduced memory needs and faster speed one gets with fp16 mixed precision, the only time you
will want to not use it is when the model you're using doesn't behave well under this training mode. Typically this
happens when the model wasn't pretrained in fp16 mixed precision (e.g. this often happens with bf16-pretrained
models). Such models may overflow or underflow, leading to `NaN` loss. If this is your case then you will want to use
the full fp32 mode, by explicitly disabling the otherwise default fp16 mixed precision mode with:

```json
{
    "fp16": {
        "enabled": false
    }
}
```

If you're using an Ampere-architecture based GPU, PyTorch version 1.7 and higher will automatically switch to using
the much more efficient tf32 format for some operations, but the results will still be in fp32. For details and
benchmarks, please see [TensorFloat-32 (TF32) on Ampere devices](https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices). The document includes
instructions on how to disable this automatic conversion if for some reason you prefer not to use it.

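For convenience, this is roughly what the opt-out looks like in PyTorch; see the linked document for the authoritative instructions:

```python
# Sketch of opting out of TF32 for matmuls and cuDNN convolutions, per the PyTorch notes linked above.
import torch

torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```
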
<a id='deepspeed-amp'></a>

### Automatic Mixed Precision

You can use automatic mixed precision with either the pytorch-like AMP way or the apex-like way.

To configure the pytorch AMP-like mode, set:

```json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        [...]
        "hysteresis": 2,
        "min_loss_scale": 1
    }
}
```

and the [`Trainer`] will automatically enable or disable it based on the value of
`args.fp16_backend`. The rest of the config values are up to you.

This mode gets enabled when the `--fp16 --fp16_backend amp` command line args are passed.

You can also enable/disable this mode explicitly:

```json
{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        [...]
        "hysteresis": 2,
        "min_loss_scale": 1
    }
}
```

But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed
configuration.

Here is the [documentation](https://www.deepspeed.ai/docs/config-json/#fp16-training-options).

To configure the apex AMP-like mode, set:

```json
"amp": {
    "enabled": "auto",
    "opt_level": "auto"
}
```

and the [`Trainer`] will automatically configure it based on the values of `args.fp16_backend` and
`args.fp16_opt_level`.

This mode gets enabled when the `--fp16 --fp16_backend apex --fp16_opt_level O1` command line args are passed.

You can also configure this mode explicitly:

```json
{
    "amp": {
        "enabled": true,
        "opt_level": "O1"
    }
}
```

But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed
configuration.

Here is the [documentation](https://www.deepspeed.ai/docs/config-json/#automatic-mixed-precision-amp-training-options).

<a id='deepspeed-bs'></a>

### Batch Size

To configure batch size, use:

```json
{
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
```

and the [`Trainer`] will automatically set `train_micro_batch_size_per_gpu` to the value of
`args.per_device_train_batch_size` and `train_batch_size` to `args.world_size * args.per_device_train_batch_size * args.gradient_accumulation_steps`.

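As a quick worked example with made-up numbers, showing the arithmetic behind the two `auto` values:

```python
# Made-up numbers: 2 GPUs, per-device batch size 4, 3 gradient accumulation steps.
world_size = 2
per_device_train_batch_size = 4
gradient_accumulation_steps = 3

train_micro_batch_size_per_gpu = per_device_train_batch_size  # 4
train_batch_size = world_size * per_device_train_batch_size * gradient_accumulation_steps  # 24
```
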
You can also set the values explicitly:

```json
{
    "train_batch_size": 12,
    "train_micro_batch_size_per_gpu": 4
}
```

But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed
configuration.

<a id='deepspeed-grad-acc'></a>

### Gradient Accumulation

To configure gradient accumulation, set:

```json
{
    "gradient_accumulation_steps": "auto"
}
```

and the [`Trainer`] will automatically set it to the value of `args.gradient_accumulation_steps`.

You can also set the value explicitly:

```json
{
    "gradient_accumulation_steps": 3
}
```

But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed
configuration.

<a id='deepspeed-grad-clip'></a>

### Gradient Clipping

To configure gradient clipping, set:

```json
{
    "gradient_clipping": "auto"
}
```

and the [`Trainer`] will automatically set it to the value of `args.max_grad_norm`.

You can also set the value explicitly:

```json
{
    "gradient_clipping": 1.0
}
```

But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed
configuration.

<a id='deepspeed-weight-extraction'></a>

### Getting The Model Weights Out

As long as you continue training and resuming using DeepSpeed you don't need to worry about anything. DeepSpeed stores
fp32 master weights in its custom checkpoint optimizer files, which are `global_step*/*optim_states.pt` (this is a glob
pattern), and are saved under the normal checkpoint.

**FP16 Weights:**

When a model is saved under ZeRO-2, you end up having the normal `pytorch_model.bin` file with the model weights, but
they are only the fp16 version of the weights.

Under ZeRO-3, things are much more complicated, since the model weights are partitioned out over multiple GPUs,
therefore `"stage3_gather_fp16_weights_on_model_save": true` is required to get the `Trainer` to save the fp16
version of the weights. If this setting is `False`, `pytorch_model.bin` won't be created. This is because by default
DeepSpeed's `state_dict` contains a placeholder and not the real weights. If we were to save this `state_dict` it
wouldn't be possible to load it back.

```json
{
    "zero_optimization": {
        "stage3_gather_fp16_weights_on_model_save": true
    }
}
```

**FP32 Weights:**

While the fp16 weights are fine for resuming training, if you finished finetuning your model and want to upload it to
the [models hub](https://huggingface.co/models) or pass it to someone else you most likely will want to get the fp32
weights. This ideally shouldn't be done during training since it is a process that requires a lot of memory, and
therefore is best performed offline after the training is complete. But if desired and you have plenty of free CPU
memory it can be done in the same training script. The following sections will discuss both approaches.

**Live FP32 Weights Recovery:**

This approach may not work if your model is large and you have little free CPU memory left at the end of the training.

If you have saved at least one checkpoint, and you want to use the latest one, you can do the following:

```python
from transformers.trainer_utils import get_last_checkpoint
from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint

checkpoint_dir = get_last_checkpoint(trainer.args.output_dir)
fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
```

If you're using the `--load_best_model_at_end` [`TrainingArguments`] argument (to track the best
checkpoint), then you can finish the training by first saving the final model explicitly and then do the same as above:

```python
import os

from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint

checkpoint_dir = os.path.join(trainer.args.output_dir, "checkpoint-final")
trainer.deepspeed.save_checkpoint(checkpoint_dir)
fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
```

<Tip>

Note that once `load_state_dict_from_zero_checkpoint` was run, the `model` will no longer be usable in the
DeepSpeed context of the same application, i.e. you will need to re-initialize the DeepSpeed engine, since
`model.load_state_dict(state_dict)` will remove all the DeepSpeed magic from it. So do this only at the very end
of the training.

</Tip>

Of course, you don't have to use [`Trainer`] and you can adjust the examples above to your own
trainer.

If for some reason you want more refinement, you can also extract the fp32 `state_dict` of the weights and apply
these yourself as is shown in the following example:

```python
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)  # already on cpu
model = model.cpu()
model.load_state_dict(state_dict)
```

**Offline FP32 Weights Recovery:**

DeepSpeed creates a special conversion script `zero_to_fp32.py` which it places in the top-level of the checkpoint
folder. Using this script you can extract the weights at any point. The script is standalone and you no longer need to
have the configuration file or a `Trainer` to do the extraction.

Let's say your checkpoint folder looks like this:

```bash
$ ls -l output_dir/checkpoint-1/
-rw-rw-r-- 1 stas stas 1.4K Mar 27 20:42 config.json
drwxrwxr-x 2 stas stas 4.0K Mar 25 19:52 global_step1/
-rw-rw-r-- 1 stas stas   12 Mar 27 13:16 latest
-rw-rw-r-- 1 stas stas 827K Mar 27 20:42 optimizer.pt
-rw-rw-r-- 1 stas stas 231M Mar 27 20:42 pytorch_model.bin
-rw-rw-r-- 1 stas stas  623 Mar 27 20:42 scheduler.pt
-rw-rw-r-- 1 stas stas 1.8K Mar 27 20:42 special_tokens_map.json
-rw-rw-r-- 1 stas stas 774K Mar 27 20:42 spiece.model
-rw-rw-r-- 1 stas stas 1.9K Mar 27 20:42 tokenizer_config.json
-rw-rw-r-- 1 stas stas  339 Mar 27 20:42 trainer_state.json
-rw-rw-r-- 1 stas stas 2.3K Mar 27 20:42 training_args.bin
-rwxrw-r-- 1 stas stas 5.5K Mar 27 13:16 zero_to_fp32.py*
```

In this example there is just one DeepSpeed checkpoint sub-folder *global_step1*. Therefore to reconstruct the fp32
weights just run:

```bash
python zero_to_fp32.py . pytorch_model.bin
```

This is it. `pytorch_model.bin` will now contain the full fp32 model weights consolidated from multiple GPUs.

The script will automatically be able to handle either a ZeRO-2 or ZeRO-3 checkpoint.

`python zero_to_fp32.py -h` will give you usage details.

The script will auto-discover the deepspeed sub-folder using the contents of the file `latest`, which in the current
example will contain `global_step1`.

Note: currently the script requires 2x the general RAM of the final fp32 model weights.

ZeRO-3 and Infinity Nuances ### ZeRO-3 and Infinity Nuances
=======================================================================================================================
ZeRO-3 is quite different from ZeRO-2 because of its param sharding feature. ZeRO-3 is quite different from ZeRO-2 because of its param sharding feature.
...@@ -1576,104 +1520,99 @@ circumstances you may find the following information to be needed. ...@@ -1576,104 +1520,99 @@ circumstances you may find the following information to be needed.
Constructing Massive Models #### Constructing Massive Models
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
DeepSpeed/ZeRO-3 can handle models with Trillions of parameters which may not fit onto the existing RAM. In such cases, DeepSpeed/ZeRO-3 can handle models with Trillions of parameters which may not fit onto the existing RAM. In such cases,
but also if you want the initialization to happen much faster, initialize the model using `deepspeed.zero.Init()` but also if you want the initialization to happen much faster, initialize the model using *deepspeed.zero.Init()*
context manager (which is also a function decorator), like so: context manager (which is also a function decorator), like so:
.. code-block:: python ```python
from transformers import T5ForConditionalGeneration, T5Config
from transformers import T5ForConditionalGeneration, T5Config import deepspeed
import deepspeed with deepspeed.zero.Init():
with deepspeed.zero.Init():
config = T5Config.from_pretrained("t5-small") config = T5Config.from_pretrained("t5-small")
model = T5ForConditionalGeneration(config) model = T5ForConditionalGeneration(config)
```
As you can see this gives you a randomly initialized model. As you can see this gives you a randomly initialized model.
If you want to use a pretrained model, ``model_class.from_pretrained`` will activate this feature as long as If you want to use a pretrained model, `model_class.from_pretrained` will activate this feature as long as
``is_deepspeed_zero3_enabled()`` returns ``True``, which currently is setup by the `is_deepspeed_zero3_enabled()` returns `True`, which currently is setup by the
class:`~transformers.TrainingArguments` object if the passed DeepSpeed configuration file contains ZeRO-3 config [`TrainingArguments`] object if the passed DeepSpeed configuration file contains ZeRO-3 config
section. Thus you must create the :class:`~transformers.TrainingArguments` object **before** calling section. Thus you must create the [`TrainingArguments`] object **before** calling
``from_pretrained``. Here is an example of a possible sequence: `from_pretrained`. Here is an example of a possible sequence:
.. code-block:: python
from transformers import AutoModel, Trainer, TrainingArguments ```python
training_args = TrainingArguments(..., deepspeed=ds_config) from transformers import AutoModel, Trainer, TrainingArguments
model = AutoModel.from_pretrained("t5-small") training_args = TrainingArguments(..., deepspeed=ds_config)
trainer = Trainer(model=model, args=training_args, ...) model = AutoModel.from_pretrained("t5-small")
trainer = Trainer(model=model, args=training_args, ...)
```
If you're using the official example scripts and your command line arguments include ``--deepspeed ds_config.json`` If you're using the official example scripts and your command line arguments include `--deepspeed ds_config.json`
with ZeRO-3 config enabled, then everything is already done for you, since this is how example scripts are written. with ZeRO-3 config enabled, then everything is already done for you, since this is how example scripts are written.
Note: If the fp16 weights of the model can't fit onto the memory of a single GPU this feature must be used. Note: If the fp16 weights of the model can't fit onto the memory of a single GPU this feature must be used.
For full details on this method and other related features please refer to `Constructing Massive Models For full details on this method and other related features please refer to [Constructing Massive Models](https://deepspeed.readthedocs.io/en/latest/zero3.html#constructing-massive-models).
<https://deepspeed.readthedocs.io/en/latest/zero3.html#constructing-massive-models>`__.
Also when loading fp16-pretrained models, you will want to tell ``from_pretrained`` to use Also when loading fp16-pretrained models, you will want to tell `from_pretrained` to use
``torch_dtype=torch.float16``. For details, please, see :ref:`from_pretrained-torch-dtype`. `torch_dtype=torch.float16`. For details, please, see [from_pretrained-torch-dtype](#from_pretrained-torch-dtype).
Gathering Parameters #### Gathering Parameters
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Under ZeRO-3 on multiple GPUs no single GPU has all the parameters unless it's the parameters for the currently Under ZeRO-3 on multiple GPUs no single GPU has all the parameters unless it's the parameters for the currently
executing layer. So if you need to access all parameters from all layers at once there is a specific method to do it. executing layer. So if you need to access all parameters from all layers at once there is a specific method to do it.
Most likely you won't need it, but if you do please refer to `Gathering Parameters Most likely you won't need it, but if you do please refer to [Gathering Parameters](https://deepspeed.readthedocs.io/en/latest/zero3.html#manual-parameter-coordination)
<https://deepspeed.readthedocs.io/en/latest/zero3.html#manual-parameter-coordination>`__
We do however use it internally in several places, one such example is when loading pretrained model weights in We do however use it internally in several places, one such example is when loading pretrained model weights in
``from_pretrained``. We load one layer at a time and immediately partition it to all participating GPUs, as for very `from_pretrained`. We load one layer at a time and immediately partition it to all participating GPUs, as for very
large models it won't be possible to load it on one GPU and then spread it out to multiple GPUs, due to memory large models it won't be possible to load it on one GPU and then spread it out to multiple GPUs, due to memory
limitations. limitations.
Also under ZeRO-3, if you write your own code and run into a model parameter weight that looks like: Also under ZeRO-3, if you write your own code and run into a model parameter weight that looks like:
.. code-block:: python ```python
tensor([1.], device='cuda:0', dtype=torch.float16, requires_grad=True)
```
tensor([1.], device='cuda:0', dtype=torch.float16, requires_grad=True) stress on `tensor([1.])`, or if you get an error where it says the parameter is of size `1`, instead of some much
stress on ``tensor([1.])``, or if you get an error where it says the parameter is of size ``1``, instead of some much
larger multi-dimensional shape, this means that the parameter is partitioned and what you see is a ZeRO-3 placeholder. larger multi-dimensional shape, this means that the parameter is partitioned and what you see is a ZeRO-3 placeholder.
.. _deepspeed-zero-inference: <a id='deepspeed-zero-inference'></a>
ZeRO Inference ### ZeRO Inference
=======================================================================================================================
ZeRO Inference uses the same config as ZeRO-3 Training. You just don't need the optimizer and scheduler sections. In ZeRO Inference uses the same config as ZeRO-3 Training. You just don't need the optimizer and scheduler sections. In
fact you can leave these in the config file if you want to share the same one with the training. They will just be fact you can leave these in the config file if you want to share the same one with the training. They will just be
ignored. ignored.
Otherwise you just need to pass the usual :class:`~transformers.TrainingArguments` arguments. For example: Otherwise you just need to pass the usual [`TrainingArguments`] arguments. For example:
.. code-block:: bash
deepspeed --num_gpus=2 your_program.py <normal cl args> --do_eval --deepspeed ds_config.json ```bash
deepspeed --num_gpus=2 your_program.py <normal cl args> --do_eval --deepspeed ds_config.json
```
The only important thing is that you need to use a ZeRO-3 configuration, since ZeRO-2 provides no benefit whatsoever The only important thing is that you need to use a ZeRO-3 configuration, since ZeRO-2 provides no benefit whatsoever
for the inference as only ZeRO-3 performs sharding of parameters, whereas ZeRO-1 shards gradients and optimizer states. for the inference as only ZeRO-3 performs sharding of parameters, whereas ZeRO-1 shards gradients and optimizer states.
Here is an example of running ``run_translation.py`` under DeepSpeed deploying all available GPUs: Here is an example of running `run_translation.py` under DeepSpeed deploying all available GPUs:
.. code-block:: bash
deepspeed examples/pytorch/translation/run_translation.py \ ```bash
--deepspeed tests/deepspeed/ds_config_zero3.json \ deepspeed examples/pytorch/translation/run_translation.py \
--model_name_or_path t5-small --output_dir output_dir \ --deepspeed tests/deepspeed/ds_config_zero3.json \
--do_eval --max_eval_samples 50 --warmup_steps 50 \ --model_name_or_path t5-small --output_dir output_dir \
--max_source_length 128 --val_max_target_length 128 \ --do_eval --max_eval_samples 50 --warmup_steps 50 \
--overwrite_output_dir --per_device_eval_batch_size 4 \ --max_source_length 128 --val_max_target_length 128 \
--predict_with_generate --dataset_config "ro-en" --fp16 \ --overwrite_output_dir --per_device_eval_batch_size 4 \
--source_lang en --target_lang ro --dataset_name wmt16 \ --predict_with_generate --dataset_config "ro-en" --fp16 \
--source_prefix "translate English to Romanian: " --source_lang en --target_lang ro --dataset_name wmt16 \
--source_prefix "translate English to Romanian: "
```
Since for inference there is no need for additional large memory used by the optimizer states and the gradients you Since for inference there is no need for additional large memory used by the optimizer states and the gradients you
should be able to fit much larger batches and/or sequence length onto the same hardware. should be able to fit much larger batches and/or sequence length onto the same hardware.
...@@ -1684,8 +1623,7 @@ to the ZeRO technology, but instead uses tensor parallelism to scale models that ...@@ -1684,8 +1623,7 @@ to the ZeRO technology, but instead uses tensor parallelism to scale models that
work in progress and we will provide the integration once that product is complete. work in progress and we will provide the integration once that product is complete.
Filing Issues ### Filing Issues
=======================================================================================================================
Here is how to file an issue so that we could quickly get to the bottom of the issue and help you to unblock your work. Here is how to file an issue so that we could quickly get to the bottom of the issue and help you to unblock your work.
...@@ -1693,30 +1631,29 @@ In your report please always include: ...@@ -1693,30 +1631,29 @@ In your report please always include:
1. the full Deepspeed config file in the report 1. the full Deepspeed config file in the report
2. either the command line arguments if you were using the :class:`~transformers.Trainer` or 2. either the command line arguments if you were using the [`Trainer`] or
:class:`~transformers.TrainingArguments` arguments if you were scripting the Trainer setup yourself. Please do not [`TrainingArguments`] arguments if you were scripting the Trainer setup yourself. Please do not
dump the :class:`~transformers.TrainingArguments` as it has dozens of entries that are irrelevant. dump the [`TrainingArguments`] as it has dozens of entries that are irrelevant.
3. Output of: 3. Output of:
.. code-block:: bash ```bash
python -c 'import torch; print(f"torch: {torch.__version__}")' python -c 'import torch; print(f"torch: {torch.__version__}")'
python -c 'import transformers; print(f"transformers: {transformers.__version__}")' python -c 'import transformers; print(f"transformers: {transformers.__version__}")'
python -c 'import deepspeed; print(f"deepspeed: {deepspeed.__version__}")' python -c 'import deepspeed; print(f"deepspeed: {deepspeed.__version__}")'
```
4. If possible include a link to a Google Colab notebook that we can reproduce the problem with. You can use this 4. If possible include a link to a Google Colab notebook that we can reproduce the problem with. You can use this
`notebook <https://github.com/stas00/porting/blob/master/transformers/deepspeed/DeepSpeed_on_colab_CLI.ipynb>`__ as [notebook](https://github.com/stas00/porting/blob/master/transformers/deepspeed/DeepSpeed_on_colab_CLI.ipynb) as
a starting point. a starting point.
5. Unless it's impossible please always use a standard dataset that we can use and not something custom. 5. Unless it's impossible please always use a standard dataset that we can use and not something custom.
6. If possible try to use one of the existing `examples 6. If possible try to use one of the existing [examples](https://github.com/huggingface/transformers/tree/master/examples/pytorch) to reproduce the problem with.
<https://github.com/huggingface/transformers/tree/master/examples/pytorch>`__ to reproduce the problem with.
Things to consider: Things to consider:
* Deepspeed is often not the cause of the problem. - Deepspeed is often not the cause of the problem.
Some of the filed issues proved to be Deepspeed-unrelated. That is once Deepspeed was removed from the setup, the Some of the filed issues proved to be Deepspeed-unrelated. That is once Deepspeed was removed from the setup, the
problem was still there. problem was still there.
...@@ -1725,109 +1662,97 @@ Things to consider: ...@@ -1725,109 +1662,97 @@ Things to consider:
exception and you can see that DeepSpeed modules are involved, first re-test your setup without DeepSpeed in it. exception and you can see that DeepSpeed modules are involved, first re-test your setup without DeepSpeed in it.
And only if the problem persists then do mentioned Deepspeed and supply all the required details. And only if the problem persists then do mentioned Deepspeed and supply all the required details.
* If it's clear to you that the issue is in the DeepSpeed core and not the integration part, please file the Issue - If it's clear to you that the issue is in the DeepSpeed core and not the integration part, please file the Issue
directly with `Deepspeed <https://github.com/microsoft/DeepSpeed/>`__. If you aren't sure, please do not worry, directly with [Deepspeed](https://github.com/microsoft/DeepSpeed/). If you aren't sure, please do not worry,
either Issue tracker will do, we will figure it out once you posted it and redirect you to another Issue tracker if either Issue tracker will do, we will figure it out once you posted it and redirect you to another Issue tracker if
need be. need be.
Troubleshooting ### Troubleshooting
=======================================================================================================================
* ``deepspeed`` process gets killed at startup without a traceback - `deepspeed` process gets killed at startup without a traceback
If the ``deepspeed`` process gets killed at launch time without a traceback, that usually means that the program tried If the `deepspeed` process gets killed at launch time without a traceback, that usually means that the program tried
to allocate more CPU memory than your system has or your process is allowed to allocate and the OS kernel killed that to allocate more CPU memory than your system has or your process is allowed to allocate and the OS kernel killed that
process. This is because your configuration file most likely has either ``offload_optimizer`` or ``offload_param`` or process. This is because your configuration file most likely has either `offload_optimizer` or `offload_param` or
both configured to offload to ``cpu``. If you have NVMe, experiment with offloading to NVMe if you're running under both configured to offload to `cpu`. If you have NVMe, experiment with offloading to NVMe if you're running under
ZeRO-3. ZeRO-3.
Work is being done to enable estimating how much memory is needed for a specific model: `PR Work is being done to enable estimating how much memory is needed for a specific model: [PR](https://github.com/microsoft/DeepSpeed/pull/965).
<https://github.com/microsoft/DeepSpeed/pull/965>`__.
Notes ### Notes
=======================================================================================================================
* DeepSpeed works with the PyTorch :class:`~transformers.Trainer` but not TF :class:`~transformers.TFTrainer`. - DeepSpeed works with the PyTorch [`Trainer`] but not TF [`TFTrainer`].
* While DeepSpeed has a pip installable PyPI package, it is highly recommended that it gets installed from `source - While DeepSpeed has a pip installable PyPI package, it is highly recommended that it gets installed from [source](https://github.com/microsoft/deepspeed#installation) to best match your hardware and also if you need to enable
<https://github.com/microsoft/deepspeed#installation>`__ to best match your hardware and also if you need to enable
certain features, like 1-bit Adam, which aren't available in the pypi distribution. certain features, like 1-bit Adam, which aren't available in the pypi distribution.
* You don't have to use the :class:`~transformers.Trainer` to use DeepSpeed with 🤗 Transformers - you can use any model - You don't have to use the [`Trainer`] to use DeepSpeed with 🤗 Transformers - you can use any model
with your own trainer, and you will have to adapt the latter according to `the DeepSpeed integration instructions with your own trainer, and you will have to adapt the latter according to [the DeepSpeed integration instructions](https://www.deepspeed.ai/getting-started/#writing-deepspeed-models).
<https://www.deepspeed.ai/getting-started/#writing-deepspeed-models>`__.
.. _deepspeed-non-trainer-integration: <a id='deepspeed-non-trainer-integration'></a>
Non-Trainer Deepspeed Integration ## Non-Trainer Deepspeed Integration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The :class:`~transformers.integrations.HfDeepSpeedConfig` is used to integrate Deepspeed into the 🤗 Transformers core The [`~integrations.HfDeepSpeedConfig`] is used to integrate Deepspeed into the 🤗 Transformers core
functionality, when :class:`~transformers.Trainer` is not used. functionality, when [`Trainer`] is not used.
When using :class:`~transformers.Trainer` everything is automatically taken care of. When using [`Trainer`] everything is automatically taken care of.
When not using :class:`~transformers.Trainer`, to efficiently deploy DeepSpeed stage 3, you must instantiate the When not using [`Trainer`], to efficiently deploy DeepSpeed stage 3, you must instantiate the
:class:`~transformers.integrations.HfDeepSpeedConfig` object before instantiating the model. [`~integrations.HfDeepSpeedConfig`] object before instantiating the model.
For example for a pretrained model: For example for a pretrained model:
.. code-block:: python ```python
from transformers.deepspeed import HfDeepSpeedConfig
from transformers.deepspeed import HfDeepSpeedConfig from transformers import AutoModel, deepspeed
from transformers import AutoModel, deepspeed
ds_config = { ... } # deepspeed config object or path to the file ds_config = { ... } # deepspeed config object or path to the file
# must run before instantiating the model # must run before instantiating the model
dschf = HfDeepSpeedConfig(ds_config) # keep this object alive dschf = HfDeepSpeedConfig(ds_config) # keep this object alive
model = AutoModel.from_pretrained("gpt2") model = AutoModel.from_pretrained("gpt2")
engine = deepspeed.initialize(model=model, config_params=ds_config, ...) engine = deepspeed.initialize(model=model, config_params=ds_config, ...)
```
or for non-pretrained model: or for non-pretrained model:
.. code-block:: python ```python
from transformers.deepspeed import HfDeepSpeedConfig
from transformers.deepspeed import HfDeepSpeedConfig from transformers import AutoModel, AutoConfig, deepspeed
from transformers import AutoModel, AutoConfig, deepspeed
ds_config = { ... } # deepspeed config object or path to the file
# must run before instantiating the model
dschf = HfDeepSpeedConfig(ds_config) # keep this object alive
config = AutoConfig.from_pretrained("gpt2")
model = AutoModel.from_config(config)
engine = deepspeed.initialize(model=model, config_params=ds_config, ...)
HfDeepSpeedConfig
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. autoclass:: transformers.deepspeed.HfDeepSpeedConfig ds_config = { ... } # deepspeed config object or path to the file
:members: # must run before instantiating the model
dschf = HfDeepSpeedConfig(ds_config) # keep this object alive
config = AutoConfig.from_pretrained("gpt2")
model = AutoModel.from_config(config)
engine = deepspeed.initialize(model=model, config_params=ds_config, ...)
```
## HfDeepSpeedConfig
[[autodoc]] deepspeed.HfDeepSpeedConfig
- all
Main DeepSpeed Resources ## Main DeepSpeed Resources
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- `Project's github <https://github.com/microsoft/deepspeed>`__ - [Project's github](https://github.com/microsoft/deepspeed)
- `Usage docs <https://www.deepspeed.ai/getting-started/>`__ - [Usage docs](https://www.deepspeed.ai/getting-started/)
- `API docs <https://deepspeed.readthedocs.io/en/latest/index.html>`__ - [API docs](https://deepspeed.readthedocs.io/en/latest/index.html)
- `Blog posts <https://www.microsoft.com/en-us/research/search/?q=deepspeed>`__ - [Blog posts](https://www.microsoft.com/en-us/research/search/?q=deepspeed)
Papers: Papers:
- `ZeRO: Memory Optimizations Toward Training Trillion Parameter Models <https://arxiv.org/abs/1910.02054>`__ - [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054)
- `ZeRO-Offload: Democratizing Billion-Scale Model Training <https://arxiv.org/abs/2101.06840>`__ - [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840)
- `ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning <https://arxiv.org/abs/2104.07857>`__ - [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857)
Finally, please, remember that, HuggingFace :class:`~transformers.Trainer` only integrates DeepSpeed, therefore if you Finally, please, remember that, HuggingFace [`Trainer`] only integrates DeepSpeed, therefore if you
have any problems or questions with regards to DeepSpeed usage, please, file an issue with `DeepSpeed GitHub have any problems or questions with regards to DeepSpeed usage, please, file an issue with [DeepSpeed GitHub](https://github.com/microsoft/DeepSpeed/issues).
<https://github.com/microsoft/DeepSpeed/issues>`__.
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Testing
Let's take a look at how 🤗 Transformer models are tested and how you can write new tests and improve the existing ones.
There are 2 test suites in the repository:
1. `tests` -- tests for the general API
2. `examples` -- tests primarily for various applications that aren't part of the API
## How transformers are tested
1. Once a PR is submitted it gets tested with 9 CircleCi jobs. Every new commit to that PR gets retested. These jobs
are defined in this [config file](https://github.com/huggingface/transformers-doc2mdx/tree/master/.circleci/config.yml), so that if needed you can reproduce the same
environment on your machine.
These CI jobs don't run `@slow` tests.
2. There are 3 jobs run by [github actions](https://github.com/huggingface/transformers/actions):
- [torch hub integration](https://github.com/huggingface/transformers-doc2mdx/tree/master/.github/workflows/github-torch-hub.yml): checks whether torch hub
integration works.
- [self-hosted (push)](https://github.com/huggingface/transformers-doc2mdx/tree/master/.github/workflows/self-push.yml): runs fast tests on GPU only on commits on
`master`. It only runs if a commit on `master` has updated the code in one of the following folders: `src`,
`tests`, `.github` (to prevent running on added model cards, notebooks, etc.)
- [self-hosted runner](https://github.com/huggingface/transformers-doc2mdx/tree/master/.github/workflows/self-scheduled.yml): runs normal and slow tests on GPU in
`tests` and `examples`:
```bash
RUN_SLOW=1 pytest tests/
RUN_SLOW=1 pytest examples/
```
The results can be observed [here](https://github.com/huggingface/transformers/actions).
## Running tests
### Choosing which tests to run
This document goes into many details of how tests can be run. If after reading everything, you need even more details
you will find them [here](https://docs.pytest.org/en/latest/usage.html).
Here are some most useful ways of running tests.
Run all:
```console
pytest
```
or:
```bash
make test
```
Note that the latter is defined as:
```bash
python -m pytest -n auto --dist=loadfile -s -v ./tests/
```
which tells pytest to:
- run as many test processes as they are CPU cores (which could be too many if you don't have a ton of RAM!)
- ensure that all tests from the same file will be run by the same test process
- do not capture output
- run in verbose mode
### Getting the list of all tests
All tests of the test suite:
```bash
pytest --collect-only -q
```
All tests of a given test file:
```bash
pytest tests/test_optimization.py --collect-only -q
```
### Run a specific test module
To run an individual test module:
```bash
pytest tests/test_logging.py
```
### Run specific tests
Since unittest is used inside most of the tests, to run specific subtests you need to know the name of the unittest
class containing those tests. For example, it could be:
```bash
pytest tests/test_optimization.py::OptimizationTest::test_adam_w
```
Here:
- `tests/test_optimization.py` - the file with tests
- `OptimizationTest` - the name of the class
- `test_adam_w` - the name of the specific test function
If the file contains multiple classes, you can choose to run only tests of a given class. For example:
```bash
pytest tests/test_optimization.py::OptimizationTest
```
will run all the tests inside that class.
As mentioned earlier you can see what tests are contained inside the `OptimizationTest` class by running:
```bash
pytest tests/test_optimization.py::OptimizationTest --collect-only -q
```
You can run tests by keyword expressions.
To run only tests whose name contains `adam`:
```bash
pytest -k adam tests/test_optimization.py
```
Logical `and` and `or` can be used to indicate whether all keywords should match or either. `not` can be used to
negate.
To run all tests except those whose name contains `adam`:
```bash
pytest -k "not adam" tests/test_optimization.py
```
And you can combine the two patterns in one:
```bash
pytest -k "ada and not adam" tests/test_optimization.py
```
For example to run both `test_adafactor` and `test_adam_w` you can use:
```bash
pytest -k "test_adam_w or test_adam_w" tests/test_optimization.py
```
Note that we use `or` here, since we want either of the keywords to match to include both.
If you want to include only tests that include both patterns, `and` is to be used:
```bash
pytest -k "test and ada" tests/test_optimization.py
```
### Run only modified tests
You can run the tests related to the unstaged files or the current branch (according to Git) by using [pytest-picked](https://github.com/anapaulagomes/pytest-picked). This is a great way of quickly testing your changes didn't break
anything, since it won't run the tests related to files you didn't touch.
```bash
pip install pytest-picked
```
```bash
pytest --picked
```
All tests will be run from files and folders which are modified, but not yet committed.
### Automatically rerun failed tests on source modification
[pytest-xdist](https://github.com/pytest-dev/pytest-xdist) provides a very useful feature of detecting all failed
tests, and then waiting for you to modify files and continuously re-rerun those failing tests until they pass while you
fix them. So that you don't need to re start pytest after you made the fix. This is repeated until all tests pass after
which again a full run is performed.
```bash
pip install pytest-xdist
```
To enter the mode: `pytest -f` or `pytest --looponfail`
File changes are detected by looking at `looponfailroots` root directories and all of their contents (recursively).
If the default for this value does not work for you, you can change it in your project by setting a configuration
option in `setup.cfg`:
```ini
[tool:pytest]
looponfailroots = transformers tests
```
or `pytest.ini`/``tox.ini`` files:
```ini
[pytest]
looponfailroots = transformers tests
```
This would lead to only looking for file changes in the respective directories, specified relatively to the ini-file’s
directory.
[pytest-watch](https://github.com/joeyespo/pytest-watch) is an alternative implementation of this functionality.
### Skip a test module
If you want to run all test modules, except a few you can exclude them by giving an explicit list of tests to run. For
example, to run all except `test_modeling_*.py` tests:
```bash
pytest *ls -1 tests/*py | grep -v test_modeling*
```
### Clearing state
CI builds and when isolation is important (against speed), cache should be cleared:
```bash
pytest --cache-clear tests
```
### Running tests in parallel
As mentioned earlier `make test` runs tests in parallel via `pytest-xdist` plugin (`-n X` argument, e.g. `-n 2`
to run 2 parallel jobs).
`pytest-xdist`'s `--dist=` option allows one to control how the tests are grouped. `--dist=loadfile` puts the
tests located in one file onto the same process.
Since the order of executed tests is different and unpredictable, if running the test suite with `pytest-xdist`
produces failures (meaning we have some undetected coupled tests), use [pytest-replay](https://github.com/ESSS/pytest-replay) to replay the tests in the same order, which should help with then somehow
reducing that failing sequence to a minimum.
### Test order and repetition
It's good to repeat the tests several times, in sequence, randomly, or in sets, to detect any potential
inter-dependency and state-related bugs (tear down). And the straightforward multiple repetition is just good to detect
some problems that get uncovered by randomness of DL.
#### Repeat tests
- [pytest-flakefinder](https://github.com/dropbox/pytest-flakefinder):
```bash
pip install pytest-flakefinder
```
And then run every test multiple times (50 by default):
```bash
pytest --flake-finder --flake-runs=5 tests/test_failing_test.py
```
<Tip>
This plugin doesn't work with `-n` flag from `pytest-xdist`.
</Tip>
<Tip>
There is another plugin `pytest-repeat`, but it doesn't work with `unittest`.
</Tip>
#### Run tests in a random order
```bash
pip install pytest-random-order
```
Important: the presence of `pytest-random-order` will automatically randomize tests, no configuration change or
command line options is required.
As explained earlier this allows detection of coupled tests - where one test's state affects the state of another. When
`pytest-random-order` is installed it will print the random seed it used for that session, e.g:
```bash
pytest tests
[...]
Using --random-order-bucket=module
Using --random-order-seed=573663
```
So that if the given particular sequence fails, you can reproduce it by adding that exact seed, e.g.:
```bash
pytest --random-order-seed=573663
[...]
Using --random-order-bucket=module
Using --random-order-seed=573663
```
It will only reproduce the exact order if you use the exact same list of tests (or no list at all). Once you start to
manually narrowing down the list you can no longer rely on the seed, but have to list them manually in the exact order
they failed and tell pytest to not randomize them instead using `--random-order-bucket=none`, e.g.:
```bash
pytest --random-order-bucket=none tests/test_a.py tests/test_c.py tests/test_b.py
```
To disable the shuffling for all tests:
```bash
pytest --random-order-bucket=none
```
By default `--random-order-bucket=module` is implied, which will shuffle the files on the module levels. It can also
shuffle on `class`, `package`, `global` and `none` levels. For the complete details please see its
[documentation](https://github.com/jbasko/pytest-random-order).
Another randomization alternative is: [`pytest-randomly`](https://github.com/pytest-dev/pytest-randomly). This
module has a very similar functionality/interface, but it doesn't have the bucket modes available in
`pytest-random-order`. It has the same problem of imposing itself once installed.
### Look and feel variations
#### pytest-sugar
[pytest-sugar](https://github.com/Frozenball/pytest-sugar) is a plugin that improves the look-n-feel, adds a
progressbar, and show tests that fail and the assert instantly. It gets activated automatically upon installation.
```bash
pip install pytest-sugar
```
To run tests without it, run:
```bash
pytest -p no:sugar
```
or uninstall it.
#### Report each sub-test name and its progress
For a single or a group of tests via `pytest` (after `pip install pytest-pspec`):
```bash
pytest --pspec tests/test_optimization.py
```
#### Instantly shows failed tests
[pytest-instafail](https://github.com/pytest-dev/pytest-instafail) shows failures and errors instantly instead of
waiting until the end of test session.
```bash
pip install pytest-instafail
```
```bash
pytest --instafail
```
### To GPU or not to GPU
On a GPU-enabled setup, to test in CPU-only mode add `CUDA_VISIBLE_DEVICES=""`:
```bash
CUDA_VISIBLE_DEVICES="" pytest tests/test_logging.py
```
or if you have multiple gpus, you can specify which one is to be used by `pytest`. For example, to use only the
second gpu if you have gpus `0` and `1`, you can run:
```bash
CUDA_VISIBLE_DEVICES="1" pytest tests/test_logging.py
```
This is handy when you want to run different tasks on different GPUs.
Some tests must be run on CPU-only, others on either CPU or GPU or TPU, yet others on multiple-GPUs. The following skip
decorators are used to set the requirements of tests CPU/GPU/TPU-wise:
- `require_torch` - this test will run only under torch
- `require_torch_gpu` - as `require_torch` plus requires at least 1 GPU
- `require_torch_multi_gpu` - as `require_torch` plus requires at least 2 GPUs
- `require_torch_non_multi_gpu` - as `require_torch` plus requires 0 or 1 GPUs
- `require_torch_up_to_2_gpus` - as `require_torch` plus requires 0 or 1 or 2 GPUs
- `require_torch_tpu` - as `require_torch` plus requires at least 1 TPU
Let's depict the GPU requirements in the following table:
| n gpus | decorator |
|--------+--------------------------------|
| `>= 0` | `@require_torch` |
| `>= 1` | `@require_torch_gpu` |
| `>= 2` | `@require_torch_multi_gpu` |
| `< 2` | `@require_torch_non_multi_gpu` |
| `< 3` | `@require_torch_up_to_2_gpus` |
For example, here is a test that must be run only when there are 2 or more GPUs available and pytorch is installed:
```python
@require_torch_multi_gpu
def test_example_with_multi_gpu():
```
If a test requires `tensorflow` use the `require_tf` decorator. For example:
```python
@require_tf
def test_tf_thing_with_tensorflow():
```
These decorators can be stacked. For example, if a test is slow and requires at least one GPU under pytorch, here is
how to set it up:
```python
@require_torch_gpu
@slow
def test_example_slow_on_gpu():
```
Some decorators like `@parametrized` rewrite test names, therefore `@require_*` skip decorators have to be listed
last for them to work correctly. Here is an example of the correct usage:
```python
@parameterized.expand(...)
@require_torch_multi_gpu
def test_integration_foo():
```
This order problem doesn't exist with `@pytest.mark.parametrize`, you can put it first or last and it will still
work. But it only works with non-unittests.
Inside tests:
- How many GPUs are available:
```python
from transformers.testing_utils import get_gpu_count
n_gpu = get_gpu_count() # works with torch and tf
```
### Distributed training
`pytest` can't deal with distributed training directly. If this is attempted - the sub-processes don't do the right
thing and end up thinking they are `pytest` and start running the test suite in loops. It works, however, if one
spawns a normal process that then spawns off multiple workers and manages the IO pipes.
Here are some tests that use it:
- [test_trainer_distributed.py](https://github.com/huggingface/transformers-doc2mdx/tree/master/tests/test_trainer_distributed.py)
- [test_deepspeed.py](https://github.com/huggingface/transformers-doc2mdx/tree/master/tests/deepspeed/test_deepspeed.py)
To jump right into the execution point, search for the `execute_subprocess_async` call in those tests.
You will need at least 2 GPUs to see these tests in action:
```bash
CUDA_VISIBLE_DEVICES=0,1 RUN_SLOW=1 pytest -sv tests/test_trainer_distributed.py
```
### Output capture
During test execution any output sent to `stdout` and `stderr` is captured. If a test or a setup method fails, its
according captured output will usually be shown along with the failure traceback.
To disable output capturing and to get the `stdout` and `stderr` normally, use `-s` or `--capture=no`:
```bash
pytest -s tests/test_logging.py
```
To send test results to JUnit format output:
```bash
py.test tests --junitxml=result.xml
```
### Color control
To have no color (e.g., yellow on white background is not readable):
```bash
pytest --color=no tests/test_logging.py
```
### Sending test report to online pastebin service
Creating a URL for each test failure:
```bash
pytest --pastebin=failed tests/test_logging.py
```
This will submit test run information to a remote Paste service and provide a URL for each failure. You may select
tests as usual or add for example -x if you only want to send one particular failure.
Creating a URL for a whole test session log:
```bash
pytest --pastebin=all tests/test_logging.py
```
## Writing tests
🤗 transformers tests are based on `unittest`, but run by `pytest`, so most of the time features from both systems
can be used.
You can read [here](https://docs.pytest.org/en/stable/unittest.html) which features are supported, but the important
thing to remember is that most `pytest` fixtures don't work. Neither parametrization, but we use the module
`parameterized` that works in a similar way.
### Parametrization
Often, there is a need to run the same test multiple times, but with different arguments. It could be done from within
the test, but then there is no way of running that test for just one set of arguments.
```python
# test_this1.py
import unittest
from parameterized import parameterized
class TestMathUnitTest(unittest.TestCase):
@parameterized.expand([
("negative", -1.5, -2.0),
("integer", 1, 1.0),
("large fraction", 1.6, 1),
])
def test_floor(self, name, input, expected):
assert_equal(math.floor(input), expected)
```
Now, by default this test will be run 3 times, each time with the last 3 arguments of `test_floor` being assigned the
corresponding arguments in the parameter list.
and you could run just the `negative` and `integer` sets of params with:
```bash
pytest -k "negative and integer" tests/test_mytest.py
```
or all but `negative` sub-tests, with:
```bash
pytest -k "not negative" tests/test_mytest.py
```
Besides using the `-k` filter that was just mentioned, you can find out the exact name of each sub-test and run any
or all of them using their exact names.
```bash
pytest test_this1.py --collect-only -q
```
and it will list:
```bash
test_this1.py::TestMathUnitTest::test_floor_0_negative
test_this1.py::TestMathUnitTest::test_floor_1_integer
test_this1.py::TestMathUnitTest::test_floor_2_large_fraction
```
So now you can run just 2 specific sub-tests:
```bash
pytest test_this1.py::TestMathUnitTest::test_floor_0_negative test_this1.py::TestMathUnitTest::test_floor_1_integer
```
The module [parameterized](https://pypi.org/project/parameterized/) which is already in the developer dependencies
of `transformers` works for both: `unittests` and `pytest` tests.
If, however, the test is not a `unittest`, you may use `pytest.mark.parametrize` (or you may see it being used in
some existing tests, mostly under `examples`).
Here is the same example, this time using `pytest`'s `parametrize` marker:
```python
# test_this2.py
import pytest
@pytest.mark.parametrize(
"name, input, expected",
[
("negative", -1.5, -2.0),
("integer", 1, 1.0),
("large fraction", 1.6, 1),
],
)
def test_floor(name, input, expected):
assert_equal(math.floor(input), expected)
```
Same as with `parameterized`, with `pytest.mark.parametrize` you can have a fine control over which sub-tests are
run, if the `-k` filter doesn't do the job. Except, this parametrization function creates a slightly different set of
names for the sub-tests. Here is what they look like:
```bash
pytest test_this2.py --collect-only -q
```
and it will list:
```bash
test_this2.py::test_floor[integer-1-1.0]
test_this2.py::test_floor[negative--1.5--2.0]
test_this2.py::test_floor[large fraction-1.6-1]
```
So now you can run just the specific test:
```bash
pytest test_this2.py::test_floor[negative--1.5--2.0] test_this2.py::test_floor[integer-1-1.0]
```
as in the previous example.
### Files and directories
In tests often we need to know where things are relative to the current test file, and it's not trivial since the test
could be invoked from more than one directory or could reside in sub-directories with different depths. A helper class
`transformers.test_utils.TestCasePlus` solves this problem by sorting out all the basic paths and provides easy
accessors to them:
- `pathlib` objects (all fully resolved):
- `test_file_path` - the current test file path, i.e. `__file__`
- `test_file_dir` - the directory containing the current test file
- `tests_dir` - the directory of the `tests` test suite
- `examples_dir` - the directory of the `examples` test suite
- `repo_root_dir` - the directory of the repository
- `src_dir` - the directory of `src` (i.e. where the `transformers` sub-dir resides)
- stringified paths---same as above but these return paths as strings, rather than `pathlib` objects:
- `test_file_path_str`
- `test_file_dir_str`
- `tests_dir_str`
- `examples_dir_str`
- `repo_root_dir_str`
- `src_dir_str`
To start using those all you need is to make sure that the test resides in a subclass of
`transformers.test_utils.TestCasePlus`. For example:
```python
from transformers.testing_utils import TestCasePlus
class PathExampleTest(TestCasePlus):
def test_something_involving_local_locations(self):
data_dir = self.tests_dir / "fixtures/tests_samples/wmt_en_ro"
```
If you don't need to manipulate paths via `pathlib` or you just need a path as a string, you can always invoked
`str()` on the `pathlib` object or use the accessors ending with `_str`. For example:
```python
from transformers.testing_utils import TestCasePlus
class PathExampleTest(TestCasePlus):
def test_something_involving_stringified_locations(self):
examples_dir = self.examples_dir_str
```
### Temporary files and directories
Using unique temporary files and directories are essential for parallel test running, so that the tests won't overwrite
each other's data. Also we want to get the temporary files and directories removed at the end of each test that created
them. Therefore, using packages like `tempfile`, which address these needs is essential.
However, when debugging tests, you need to be able to see what goes into the temporary file or directory and you want
to know it's exact path and not having it randomized on every test re-run.
A helper class `transformers.test_utils.TestCasePlus` is best used for such purposes. It's a sub-class of
`unittest.TestCase`, so we can easily inherit from it in the test modules.
Here is an example of its usage:
```python
from transformers.testing_utils import TestCasePlus
class ExamplesTests(TestCasePlus):
def test_whatever(self):
tmp_dir = self.get_auto_remove_tmp_dir()
```
This code creates a unique temporary directory, and sets `tmp_dir` to its location.
- Create a unique temporary dir:
```python
def test_whatever(self):
tmp_dir = self.get_auto_remove_tmp_dir()
```
`tmp_dir` will contain the path to the created temporary dir. It will be automatically removed at the end of the
test.
- Create a temporary dir of my choice, ensure it's empty before the test starts and don't empty it after the test.
```python
def test_whatever(self):
tmp_dir = self.get_auto_remove_tmp_dir("./xxx")
```
This is useful for debug when you want to monitor a specific directory and want to make sure the previous tests didn't
leave any data in there.
- You can override the default behavior by directly overriding the `before` and `after` args, leading to one of the
following behaviors:
- `before=True`: the temporary dir will always be cleared at the beginning of the test.
- `before=False`: if the temporary dir already existed, any existing files will remain there.
- `after=True`: the temporary dir will always be deleted at the end of the test.
- `after=False`: the temporary dir will always be left intact at the end of the test.
<Tip>
In order to run the equivalent of `rm -r` safely, only subdirs of the project repository checkout are allowed if
an explicit obj:*tmp_dir* is used, so that by mistake no `/tmp` or similar important part of the filesystem will
get nuked. i.e. please always pass paths that start with `./`.
</Tip>
<Tip>
Each test can register multiple temporary directories and they all will get auto-removed, unless requested
otherwise.
</Tip>
### Temporary sys.path override
If you need to temporary override `sys.path` to import from another test for example, you can use the
`ExtendSysPath` context manager. Example:
```python
import os
from transformers.testing_utils import ExtendSysPath
bindir = os.path.abspath(os.path.dirname(__file__))
with ExtendSysPath(f"{bindir}/.."):
from test_trainer import TrainerIntegrationCommon # noqa
```
### Skipping tests
This is useful when a bug is found and a new test is written, yet the bug is not fixed yet. In order to be able to
commit it to the main repository we need make sure it's skipped during `make test`.
Methods:
- A **skip** means that you expect your test to pass only if some conditions are met, otherwise pytest should skip
running the test altogether. Common examples are skipping windows-only tests on non-windows platforms, or skipping
tests that depend on an external resource which is not available at the moment (for example a database).
- A **xfail** means that you expect a test to fail for some reason. A common example is a test for a feature not yet
implemented, or a bug not yet fixed. When a test passes despite being expected to fail (marked with
pytest.mark.xfail), it’s an xpass and will be reported in the test summary.
One of the important differences between the two is that `skip` doesn't run the test, and `xfail` does. So if the
code that's buggy causes some bad state that will affect other tests, do not use `xfail`.
#### Implementation
- Here is how to skip whole test unconditionally:
```python
@unittest.skip("this bug needs to be fixed")
def test_feature_x():
```
or via pytest:
```python
@pytest.mark.skip(reason="this bug needs to be fixed")
```
or the `xfail` way:
```python
@pytest.mark.xfail
def test_feature_x():
```
- Here is how to skip a test based on some internal check inside the test:
```python
def test_feature_x():
if not has_something():
pytest.skip("unsupported configuration")
```
or the whole module:
```python
import pytest
if not pytest.config.getoption("--custom-flag"):
pytest.skip("--custom-flag is missing, skipping tests", allow_module_level=True)
```
or the `xfail` way:
```python
def test_feature_x():
pytest.xfail("expected to fail until bug XYZ is fixed")
```
- Here is how to skip all tests in a module if some import is missing:
```python
docutils = pytest.importorskip("docutils", minversion="0.3")
```
- Skip a test based on a condition:
```python
@pytest.mark.skipif(sys.version_info < (3,6), reason="requires python3.6 or higher")
def test_feature_x():
```
or:
```python
@unittest.skipIf(torch_device == "cpu", "Can't do half precision")
def test_feature_x():
```
or skip the whole module:
```python
@pytest.mark.skipif(sys.platform == 'win32', reason="does not run on windows")
class TestClass():
def test_feature_x(self):
```
More details, example and ways are [here](https://docs.pytest.org/en/latest/skipping.html).
### Slow tests
The library of tests is ever-growing, and some of the tests take minutes to run, therefore we can't afford waiting for
an hour for the test suite to complete on CI. Therefore, with some exceptions for essential tests, slow tests should be
marked as in the example below:
```python
from transformers.testing_utils import slow
@slow
def test_integration_foo():
```
Once a test is marked as `@slow`, to run such tests set `RUN_SLOW=1` env var, e.g.:
```bash
RUN_SLOW=1 pytest tests
```
Some decorators like `@parameterized` rewrite test names, therefore `@slow` and the rest of the skip decorators
`@require_*` have to be listed last for them to work correctly. Here is an example of the correct usage:
```python
@parameterized.expand(...)
@slow
def test_integration_foo():
```
As explained at the beginning of this document, slow tests get to run on a scheduled basis, rather than in PRs CI
checks. So it's possible that some problems will be missed during a PR submission and get merged. Such problems will
get caught during the next scheduled CI job. But it also means that it's important to run the slow tests on your
machine before submitting the PR.
Here is a rough decision making mechanism for choosing which tests should be marked as slow:
If the test is focused on one of the library's internal components (e.g., modeling files, tokenization files,
pipelines), then we should run that test in the non-slow test suite. If it's focused on an other aspect of the library,
such as the documentation or the examples, then we should run these tests in the slow test suite. And then, to refine
this approach we should have exceptions:
- All tests that need to download a heavy set of weights or a dataset that is larger than ~50MB (e.g., model or
tokenizer integration tests, pipeline integration tests) should be set to slow. If you're adding a new model, you
should create and upload to the hub a tiny version of it (with random weights) for integration tests. This is
discussed in the following paragraphs.
- All tests that need to do a training not specifically optimized to be fast should be set to slow.
- We can introduce exceptions if some of these should-be-non-slow tests are excruciatingly slow, and set them to
`@slow`. Auto-modeling tests, which save and load large files to disk, are a good example of tests that are marked
as `@slow`.
- If a test completes under 1 second on CI (including downloads if any) then it should be a normal test regardless.
Collectively, all the non-slow tests need to cover entirely the different internals, while remaining fast. For example,
a significant coverage can be achieved by testing with specially created tiny models with random weights. Such models
have the very minimal number of layers (e.g., 2), vocab size (e.g., 1000), etc. Then the `@slow` tests can use large
slow models to do qualitative testing. To see the use of these simply look for *tiny* models with:
```bash
grep tiny tests examples
```
Here is a an example of a [script](https://github.com/huggingface/transformers-doc2mdx/tree/master/scripts/fsmt/fsmt-make-tiny-model.py) that created the tiny model
[stas/tiny-wmt19-en-de](https://huggingface.co/stas/tiny-wmt19-en-de). You can easily adjust it to your specific
model's architecture.
It's easy to measure the run-time incorrectly if for example there is an overheard of downloading a huge model, but if
you test it locally the downloaded files would be cached and thus the download time not measured. Hence check the
execution speed report in CI logs instead (the output of `pytest --durations=0 tests`).
That report is also useful to find slow outliers that aren't marked as such, or which need to be re-written to be fast.
If you notice that the test suite starts getting slow on CI, the top listing of this report will show the slowest
tests.
### Testing the stdout/stderr output
In order to test functions that write to `stdout` and/or `stderr`, the test can access those streams using the
`pytest`'s [capsys system](https://docs.pytest.org/en/latest/capture.html). Here is how this is accomplished:
```python
import sys
def print_to_stdout(s): print(s)
def print_to_stderr(s): sys.stderr.write(s)
def test_result_and_stdout(capsys):
msg = "Hello"
print_to_stdout(msg)
print_to_stderr(msg)
out, err = capsys.readouterr() # consume the captured output streams
# optional: if you want to replay the consumed streams:
sys.stdout.write(out)
sys.stderr.write(err)
# test:
assert msg in out
assert msg in err
```
And, of course, most of the time, `stderr` will come as a part of an exception, so try/except has to be used in such
a case:
```python
def raise_exception(msg): raise ValueError(msg)
def test_something_exception():
msg = "Not a good value"
error = ''
try:
raise_exception(msg)
except Exception as e:
error = str(e)
assert msg in error, f"{msg} is in the exception:\n{error}"
```
Another approach to capturing stdout is via `contextlib.redirect_stdout`:
```python
from io import StringIO
from contextlib import redirect_stdout
def print_to_stdout(s): print(s)
def test_result_and_stdout():
msg = "Hello"
buffer = StringIO()
with redirect_stdout(buffer):
print_to_stdout(msg)
out = buffer.getvalue()
# optional: if you want to replay the consumed streams:
sys.stdout.write(out)
# test:
assert msg in out
```
An important potential issue with capturing stdout is that it may contain `\r` characters that in normal `print`
reset everything that has been printed so far. There is no problem with `pytest`, but with `pytest -s` these
characters get included in the buffer, so to be able to have the test run with and without `-s`, you have to make an
extra cleanup to the captured output, using `re.sub(r'~.*\r', '', buf, 0, re.M)`.
But, then we have a helper context manager wrapper to automatically take care of it all, regardless of whether it has
some `\r`'s in it or not, so it's a simple:
```python
from transformers.testing_utils import CaptureStdout
with CaptureStdout() as cs:
function_that_writes_to_stdout()
print(cs.out)
```
Here is a full test example:
```python
from transformers.testing_utils import CaptureStdout
msg = "Secret message\r"
final = "Hello World"
with CaptureStdout() as cs:
print(msg + final)
assert cs.out == final+"\n", f"captured: {cs.out}, expecting {final}"
```
If you'd like to capture `stderr` use the `CaptureStderr` class instead:
```python
from transformers.testing_utils import CaptureStderr
with CaptureStderr() as cs:
function_that_writes_to_stderr()
print(cs.err)
```
If you need to capture both streams at once, use the parent `CaptureStd` class:
```python
from transformers.testing_utils import CaptureStd

with CaptureStd() as cs:
    function_that_writes_to_stdout_and_stderr()
print(cs.err, cs.out)
```
Also, to aid debugging test issues, by default these context managers automatically replay the captured streams on exit
from the context.
### Capturing logger stream
If you need to validate the output of a logger, you can use `CaptureLogger`:
```python
from transformers import logging
from transformers.testing_utils import CaptureLogger

msg = "Testing 1, 2, 3"
logging.set_verbosity_info()
logger = logging.get_logger("transformers.models.bart.tokenization_bart")
with CaptureLogger(logger) as cl:
    logger.info(msg)
assert cl.out == msg + "\n"
```
### Testing with environment variables
If you want to test the impact of environment variables for a specific test you can use the helper decorator
`transformers.testing_utils.mockenv`:
```python
import os
import unittest

from transformers.testing_utils import mockenv


class HfArgumentParserTest(unittest.TestCase):
    @mockenv(TRANSFORMERS_VERBOSITY="error")
    def test_env_override(self):
        env_level_str = os.getenv("TRANSFORMERS_VERBOSITY", None)
        self.assertEqual(env_level_str, "error")
```
At times an external program needs to be called, which requires setting `PYTHONPATH` in `os.environ` to include
multiple local paths. A helper class `transformers.testing_utils.TestCasePlus` comes to the rescue:
```python
from transformers.testing_utils import TestCasePlus


class EnvExampleTest(TestCasePlus):
    def test_external_prog(self):
        env = self.get_env()
        # now call the external program, passing `env` to it
```
Depending on whether the test file is located under the `tests` test suite or `examples`, it will correctly set up
`env[PYTHONPATH]` to include one of these two directories, as well as the `src` directory to ensure the testing is
done against the current repo, and finally whatever `env[PYTHONPATH]` was already set to before the test was
called, if anything.
This helper method creates a copy of the `os.environ` object, so the original remains intact.
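For instance, here is a rough sketch of what calling an external program with that environment might look like (the `-c` command below is purely illustrative):

```python
import subprocess
import sys

from transformers.testing_utils import TestCasePlus


class EnvExampleTest(TestCasePlus):
    def test_external_prog(self):
        env = self.get_env()  # a copy of os.environ with PYTHONPATH set up
        # run a child python process that must be able to import the repo's `transformers`
        cmd = [sys.executable, "-c", "import transformers; print(transformers.__file__)"]
        result = subprocess.run(cmd, env=env, capture_output=True, text=True)
        assert result.returncode == 0, result.stderr
```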
### Getting reproducible results
In some situations you may want to remove randomness for your tests. To get identical reproducible results, you
will need to fix the seed:
```python
seed = 42

# python RNG
import random

random.seed(seed)

# pytorch RNGs
import torch

torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

# numpy RNG
import numpy as np

np.random.seed(seed)

# tf RNG (only if tensorflow is part of your test)
import tensorflow as tf

tf.random.set_seed(seed)
```
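Alternatively, `transformers` provides a small `set_seed` helper that fixes the python, numpy and pytorch (and, if installed, tensorflow) seeds in one call. A minimal usage sketch (note that it doesn't toggle `torch.backends.cudnn.deterministic` for you):

```python
from transformers import set_seed

set_seed(42)
```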
### Debugging tests
To start a debugger at the point of the warning, do this:
```bash
pytest tests/test_logging.py -W error::UserWarning --pdb
```
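Similarly, to drop straight into the debugger on the first test failure rather than on a warning, the standard `pytest` flags can be combined:

```bash
# stop after the first failure and open the python debugger there
pytest tests/test_logging.py -x --pdb
```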
## Working with github actions workflows
To trigger a self-push workflow CI job, you must:
1. Create a new branch on `transformers` origin (not a fork!).
2. The branch name has to start with either `ci_` or `ci-` (`master` triggers it too, but we can't do PRs on
   `master`). It also gets triggered only for specific paths - you can find the up-to-date definition, in case it
   changed since this document was written, [here](https://github.com/huggingface/transformers/blob/master/.github/workflows/self-push.yml) under *push:*.
3. Create a PR from this branch.
4. Then you can see the job appear [here](https://github.com/huggingface/transformers/actions/workflows/self-push.yml). It may not run right away if there
is a backlog.
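Put together, the command-line part of the above looks roughly like this (the branch name is just an example):

```bash
# create a CI-triggering branch directly on the main repo (not on a fork) and push it
git checkout -b ci_debug-my-feature
git push origin ci_debug-my-feature
# then open a PR from this branch in the github UI and watch the self-push workflow run
```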
## Testing Experimental CI Features
Testing CI features can be potentially problematic since it can interfere with the normal CI functioning. Therefore if
a new CI feature is to be added, it should be done as follows.
1. Create a new dedicated job that tests what needs to be tested.
2. The new job must always succeed so that it gives us a green ✓ (details below).
3. Let it run for some days to see that a variety of different PR types get to run on it (user fork branches,
   non-forked branches, branches originating from github.com UI direct file edits, various forced pushes, etc. - there
   are so many) while monitoring the experimental job's logs (not the overall job status, as it's purposefully always
   green).
4. When it's clear that everything is solid, then merge the new changes into existing jobs.
That way experiments on CI functionality itself won't interfere with the normal workflow.
Now how can we make the job always succeed while the new CI feature is being developed?
Some CIs, like TravisCI, support ignore-step-failure and will report the overall job as successful, but CircleCI and
Github Actions as of this writing don't support that.
So the following workaround can be used:
1. `set +euo pipefail` at the beginning of the run command to suppress most potential failures in the bash script.
2. The last command must be a success: `echo "done"` or just `true` will do.
Here is an example:
```yaml
- run:
    name: run CI experiment
    command: |
        set +euo pipefail
        echo "setting run-all-despite-any-errors-mode"
        this_command_will_fail
        echo "but bash continues to run"
        # emulate another failure
        false
        # but the last command must be a success
        echo "during experiment do not remove: reporting success to CI, even if there were failures"
```
For simple commands you could also do:
```bash
cmd_that_may_fail || true
```
Of course, once satisfied with the results, integrate the experimental step or job with the rest of the normal jobs,
while removing `set +euo pipefail` or any other things you may have added to ensure that the experimental job doesn't
interfere with the normal CI functioning.
This whole process would have been much easier if we could just set something like `allow-failure` for the
experimental step, and let it fail without impacting the overall status of PRs. But as mentioned earlier, CircleCI and
Github Actions don't support it at the moment.
You can vote for this feature and see where it stands at these CI-specific threads:
- [Github Actions:](https://github.com/actions/toolkit/issues/399)
- [CircleCI:](https://ideas.circleci.com/ideas/CCI-I-344)