Unverified Commit c83fbc5f authored by Cheng Li's avatar Cheng Li Committed by GitHub
Browse files

[Deepspeed] Allow HF optimizer and scheduler to be passed to deepspeed (#10464)



* pass hf optimizer and scheduler to deepspeed if not specified in ds config

* pass hf optimizer and scheduler to deepspeed if not specified in ds config

* update

* make init_deepspeed support config dict

* fix docstring formatting

* clean up trainer's comments

* add new tests

* fix type

* composit argparse doesn't work

* style

* add a new test, rename others

* document new functionality

* complete tests, add docs

* style

* correct level

* Apply suggestions from code review
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>

* add new methods to the doc

* must tell DS we are using a non-native optimizer

* add protection against cpu_offload + HF optimizer combo

* fix the cli overrides

* sync docs + tests

* restore AdamW

* better docs

* need new version

* no longer needed

* remove outdate information

* refactor duplicated code
Co-authored-by: default avatarStas Bekman <stas@stason.org>
Co-authored-by: default avatarStas Bekman <stas00@users.noreply.github.com>
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent c2324844
...@@ -31,7 +31,10 @@ the above features. To inject custom behavior you can subclass them and override ...@@ -31,7 +31,10 @@ the above features. To inject custom behavior you can subclass them and override
- **get_test_dataloader**/**get_test_tfdataset** -- Creates the test DataLoader (PyTorch) or TF Dataset. - **get_test_dataloader**/**get_test_tfdataset** -- Creates the test DataLoader (PyTorch) or TF Dataset.
- **log** -- Logs information on the various objects watching training. - **log** -- Logs information on the various objects watching training.
- **create_optimizer_and_scheduler** -- Sets up the optimizer and learning rate scheduler if they were not passed at - **create_optimizer_and_scheduler** -- Sets up the optimizer and learning rate scheduler if they were not passed at
init. init. Note, that you can also subclass or override the ``create_optimizer`` and ``create_scheduler`` methods
separately.
- **create_optimizer** -- Sets up the optimizer if it wasn't passed at init.
- **create_scheduler** -- Sets up the learning rate scheduler if it wasn't passed at init.
- **compute_loss** - Computes the loss on a batch of training inputs. - **compute_loss** - Computes the loss on a batch of training inputs.
- **training_step** -- Performs a training step. - **training_step** -- Performs a training step.
- **prediction_step** -- Performs an evaluation/test step. - **prediction_step** -- Performs an evaluation/test step.
...@@ -542,8 +545,6 @@ cell with: ...@@ -542,8 +545,6 @@ cell with:
"cpu_offload": true "cpu_offload": true
}, },
"zero_allow_untested_optimizer": true,
"optimizer": { "optimizer": {
"type": "AdamW", "type": "AdamW",
"params": { "params": {
...@@ -612,17 +613,11 @@ example ``.json`` files with: ...@@ -612,17 +613,11 @@ example ``.json`` files with:
Some more examples are to be found in the `main repo <https://github.com/microsoft/DeepSpeed>`__ as well. Some more examples are to be found in the `main repo <https://github.com/microsoft/DeepSpeed>`__ as well.
While you always have to supply the DeepSpeed configuration file, you can configure the DeepSpeed integration in When using DeepSpeed you always need to supply a DeepSpeed configuration file, yet some configuration parameters have
several ways: to be configured via the command line. You will find the nuances in the rest of this guide.
1. Supply most of the configuration inside the file, and just use a few required command line arguments. This is the
recommended way as it puts most of the configuration params in one place.
2. Supply just the ZeRO configuration params inside the file, and configure the rest using the normal
:class:`~transformers.Trainer` command line arguments.
3. Any variation of the first two ways.
To get an idea of what DeepSpeed configuration file looks like, here is one that activates ZeRO stage 2 features, To get an idea of what DeepSpeed configuration file looks like, here is one that activates ZeRO stage 2 features,
enables FP16, uses AdamW optimizer and WarmupLR scheduler: enables FP16, uses ``AdamW`` optimizer and ``WarmupLR`` scheduler:
.. code-block:: json .. code-block:: json
...@@ -666,36 +661,33 @@ enables FP16, uses AdamW optimizer and WarmupLR scheduler: ...@@ -666,36 +661,33 @@ enables FP16, uses AdamW optimizer and WarmupLR scheduler:
} }
} }
If you already have a command line that you have been using with :class:`transformers.Trainer` args, you can continue When you execute the program, DeepSpeed will log the configuration it received from the :class:`~transformers.Trainer`
using those and the :class:`~transformers.Trainer` will automatically convert them into the corresponding DeepSpeed to the console, so you can see exactly what was the final configuration passed to it.
configuration at run time. For example, you could use the following configuration file:
.. code-block:: json
{ Passing Configuration
"zero_optimization": { =======================================================================================================================
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"contiguous_gradients": true,
"cpu_offload": true
}
}
and the following command line arguments: As discussed in this document normally the DeepSpeed configuration is passed as a path to a json file, but if you're
not using the command line interface to configure the training, and instead instantiate the
:class:`~transformers.Trainer` via :class:`~transformers.TrainingArguments` then for the ``deepspeed`` argument you can
pass a nested ``dict``. This allows you to create the configuration on the fly and doesn't require you to write it to
the file system before passing it to :class:`~transformers.TrainingArguments`.
.. code-block:: bash To summarize you can do:
--learning_rate 3e-5 --warmup_steps 500 --adam_beta1 0.8 --adam_beta2 0.999 --adam_epsilon 1e-8 \ .. code-block:: python
--weight_decay 3e-7 --lr_scheduler_type constant_with_warmup --fp16 --fp16_backend amp
TrainingArguments(..., deespeed="/path/to/ds_config.json")
or:
.. code-block:: python
ds_config_dict=dict(scheduler=scheduler_params, optimizer=optimizer_params)
TrainingArguments(..., deespeed=ds_config_dict)
to achieve the same configuration as provided by the longer json file in the first example.
When you execute the program, DeepSpeed will log the configuration it received from the :class:`~transformers.Trainer`
to the console, so you can see exactly what the final configuration was passed to it.
Shared Configuration Shared Configuration
======================================================================================================================= =======================================================================================================================
...@@ -761,9 +753,27 @@ no equivalent command line arguments. ...@@ -761,9 +753,27 @@ no equivalent command line arguments.
Optimizer Optimizer and Scheduler
======================================================================================================================= =======================================================================================================================
As long as you don't enable ``cpu_offload`` you can mix and match DeepSpeed and HuggingFace schedulers and optimizers,
with the exception of using the combination of HuggingFace scheduler and DeepSpeed optimizer:
+--------------+--------------+--------------+
| Combos | HF Scheduler | DS Scheduler |
+--------------+--------------+--------------+
| HF Optimizer | Yes | Yes |
+--------------+--------------+--------------+
| DS Optimizer | No | Yes |
+--------------+--------------+--------------+
If ``cpu_offload`` is enabled you must use both DeepSpeed scheduler and DeepSpeed optimizer.
Optimizer
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
DeepSpeed's main optimizers are Adam, AdamW, OneBitAdam, and Lamb. These have been thoroughly tested with ZeRO and are DeepSpeed's main optimizers are Adam, AdamW, OneBitAdam, and Lamb. These have been thoroughly tested with ZeRO and are
thus recommended to be used. It, however, can import other optimizers from ``torch``. The full documentation is `here thus recommended to be used. It, however, can import other optimizers from ``torch``. The full documentation is `here
...@@ -773,7 +783,7 @@ If you don't configure the ``optimizer`` entry in the configuration file, the :c ...@@ -773,7 +783,7 @@ If you don't configure the ``optimizer`` entry in the configuration file, the :c
automatically set it to ``AdamW`` and will use the supplied values or the defaults for the following command line automatically set it to ``AdamW`` and will use the supplied values or the defaults for the following command line
arguments: ``--learning_rate``, ``--adam_beta1``, ``--adam_beta2``, ``--adam_epsilon`` and ``--weight_decay``. arguments: ``--learning_rate``, ``--adam_beta1``, ``--adam_beta2``, ``--adam_epsilon`` and ``--weight_decay``.
Here is an example of the pre-configured ``optimizer`` entry for AdamW: Here is an example of the pre-configured ``optimizer`` entry for ``AdamW``:
.. code-block:: json .. code-block:: json
...@@ -789,6 +799,17 @@ Here is an example of the pre-configured ``optimizer`` entry for AdamW: ...@@ -789,6 +799,17 @@ Here is an example of the pre-configured ``optimizer`` entry for AdamW:
} }
} }
Note that the command line arguments will override the values in the configuration file. This is so that there is one
definitive source of the values and to avoid hard to find errors when for example, the learning rate is set to
different values in different places. Command line rules. The values that get overridden are:
- ``lr`` with the value of ``--learning_rate``
- ``betas`` with the value of ``--adam_beta1 --adam_beta2``
- ``eps`` with the value of ``--adam_epsilon``
- ``weight_decay`` with the value of ``--weight_decay``
Therefore please remember to tune the shared hyperparameters on the command line.
If you want to use another optimizer which is not listed above, you will have to add ``"zero_allow_untested_optimizer": If you want to use another optimizer which is not listed above, you will have to add ``"zero_allow_untested_optimizer":
true`` to the top level configuration. true`` to the top level configuration.
...@@ -797,48 +818,60 @@ make sure to adjust the values. e.g. if use Adam you will want ``weight_decay`` ...@@ -797,48 +818,60 @@ make sure to adjust the values. e.g. if use Adam you will want ``weight_decay``
Scheduler Scheduler
======================================================================================================================= """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
DeepSpeed supports LRRangeTest, OneCycle, WarmupLR and WarmupDecayLR LR schedulers. The full documentation is `here DeepSpeed supports LRRangeTest, OneCycle, WarmupLR and WarmupDecayLR LR schedulers. The full documentation is `here
<https://www.deepspeed.ai/docs/config-json/#scheduler-parameters>`__. <https://www.deepspeed.ai/docs/config-json/#scheduler-parameters>`__.
If you don't configure the ``scheduler`` entry in the configuration file, the :class:`~transformers.Trainer` will use
the value of ``--lr_scheduler_type`` to configure it. Currently the :class:`~transformers.Trainer` supports only 2 LR Here is where the schedulers overlap between 🤗 Transformers and DeepSpeed:
schedulers that are also supported by DeepSpeed:
* ``WarmupLR`` via ``--lr_scheduler_type constant_with_warmup`` * ``WarmupLR`` via ``--lr_scheduler_type constant_with_warmup``
* ``WarmupDecayLR`` via ``--lr_scheduler_type linear``. This is also the default value for ``--lr_scheduler_type``, * ``WarmupDecayLR`` via ``--lr_scheduler_type linear``. This is also the default value for ``--lr_scheduler_type``,
therefore, if you don't configure the scheduler this is scheduler that will get configured by default. therefore, if you don't configure the scheduler this is scheduler that will get configured by default.
In either case, the values of ``--learning_rate`` and ``--warmup_steps`` will be used for the configuration.
In other words, if you don't use the configuration file to set the ``scheduler`` entry, provide either: If you don't configure the ``scheduler`` entry in the configuration file, the :class:`~transformers.Trainer` will use
the values of ``--lr_scheduler_type``, ``--learning_rate`` and ``--warmup_steps`` to configure a 🤗 Transformers version
.. code-block:: bash of it.
--lr_scheduler_type constant_with_warmup --learning_rate 3e-5 --warmup_steps 500 Here is an example of the pre-configured ``scheduler`` entry for ``WarmupLR``:
or .. code-block:: json
.. code-block:: bash {
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": 0.001,
"warmup_num_steps": 1000
}
}
}
--lr_scheduler_type linear --learning_rate 3e-5 --warmup_steps 500 Note that the command line arguments will override the values in the configuration file. This is so that there is one
definitive source of the values and to avoid hard to find errors when for example, the learning rate is set to
different values in different places. Command line rules. The values that get overridden are:
with the desired values. If you don't pass these arguments, reasonable default values will be used instead. - ``warmup_max_lr`` with the value of ``--learning_rate``
- ``warmup_num_steps`` with the value of ``--warmup_steps``
- ``total_num_steps`` with either the value of ``--max_steps`` or if it is not provided, derived automatically at run
time based on the environment and the size of the dataset and other command line arguments (needed for
``WarmupDecayLR``).
In the case of WarmupDecayLR ``total_num_steps`` gets set either via the ``--max_steps`` command line argument, or if Therefore please remember to tune the shared hyperparameters on the command line.
it is not provided, derived automatically at run time based on the environment and the size of the dataset and other
command line arguments.
Here is an example of the pre-configured ``scheduler`` entry for WarmupLR (``constant_with_warmup`` in the For example, for ``WarmupDecayLR``, you can use the following entry:
:class:`~transformers.Trainer` API):
.. code-block:: json .. code-block:: json
{ {
"scheduler": { "scheduler": {
"type": "WarmupLR", "type": "WarmupDecayLR",
"params": { "params": {
"total_num_steps": 10,
"last_batch_iteration": -1,
"warmup_min_lr": 0, "warmup_min_lr": 0,
"warmup_max_lr": 0.001, "warmup_max_lr": 0.001,
"warmup_num_steps": 1000 "warmup_num_steps": 1000
...@@ -846,6 +879,10 @@ Here is an example of the pre-configured ``scheduler`` entry for WarmupLR (``con ...@@ -846,6 +879,10 @@ Here is an example of the pre-configured ``scheduler`` entry for WarmupLR (``con
} }
} }
and ``warmup_max_lr``, ``warmup_num_steps`` and ``total_num_steps`` will be corrected at loading time.
Automatic Mixed Precision Automatic Mixed Precision
======================================================================================================================= =======================================================================================================================
...@@ -933,9 +970,9 @@ Notes ...@@ -933,9 +970,9 @@ Notes
* While DeepSpeed has a pip installable PyPI package, it is highly recommended that it gets installed from `source * While DeepSpeed has a pip installable PyPI package, it is highly recommended that it gets installed from `source
<https://github.com/microsoft/deepspeed#installation>`__ to best match your hardware and also if you need to enable <https://github.com/microsoft/deepspeed#installation>`__ to best match your hardware and also if you need to enable
certain features, like 1-bit Adam, which aren't available in the pypi distribution. certain features, like 1-bit Adam, which aren't available in the pypi distribution.
* You don't have to use the :class:`~transformers.Trainer` to use DeepSpeed with HuggingFace ``transformers`` - you can * You don't have to use the :class:`~transformers.Trainer` to use DeepSpeed with 🤗 Transformers - you can use any model
use any model with your own trainer, and you will have to adapt the latter according to `the DeepSpeed integration with your own trainer, and you will have to adapt the latter according to `the DeepSpeed integration instructions
instructions <https://www.deepspeed.ai/getting-started/#writing-deepspeed-models>`__. <https://www.deepspeed.ai/getting-started/#writing-deepspeed-models>`__.
Main DeepSpeed Resources Main DeepSpeed Resources
======================================================================================================================= =======================================================================================================================
......
...@@ -12,10 +12,12 @@ ...@@ -12,10 +12,12 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import io
import json import json
import os import os
import sys import sys
import unittest import unittest
from copy import deepcopy
from transformers.integrations import is_deepspeed_available from transformers.integrations import is_deepspeed_available
from transformers.testing_utils import ( from transformers.testing_utils import (
...@@ -67,16 +69,76 @@ class TrainerIntegrationDeepSpeed(TestCasePlus): ...@@ -67,16 +69,76 @@ class TrainerIntegrationDeepSpeed(TestCasePlus):
MASTER_ADDR="localhost", MASTER_PORT="10999", RANK="0", LOCAL_RANK="0", WORLD_SIZE="1" MASTER_ADDR="localhost", MASTER_PORT="10999", RANK="0", LOCAL_RANK="0", WORLD_SIZE="1"
) )
self.ds_config_file = f"{self.test_file_dir_str}/ds_config.json" self.ds_config_file = f"{self.test_file_dir_str}/ds_config.json"
with io.open(self.ds_config_file, "r", encoding="utf-8") as f:
self.ds_config_dict = json.load(f)
def test_fake_notebook_no_launcher(self): def test_fake_notebook_no_launcher(self):
# this setup emulates a notebook where a launcher needs to be emulated by hand # this setup emulates a notebook where a launcher needs to be emulated by hand
with CaptureStd() as cs: # noqa
with CaptureStd() as cs:
with mockenv_context(**self.dist_env_1_gpu): with mockenv_context(**self.dist_env_1_gpu):
trainer = get_regression_trainer(local_rank=0, deepspeed=self.ds_config_file) trainer = get_regression_trainer(local_rank=0, deepspeed=self.ds_config_file)
trainer.train() trainer.train()
assert "DeepSpeed info" in cs.out, "expected DeepSpeed logger output but got none" # fixme:
# assert "DeepSpeed info" in cs.out, "expected DeepSpeed logger output but got none"
# Test various combos
# 1. DS scheduler + DS optimizer: this is already tested by most other tests
# 2. HF scheduler + HF optimizer:
# 3. DS scheduler + HF optimizer:
# 4. HF scheduler + DS optimizer:
def test_hf_scheduler_hf_optimizer(self):
a = 0
with mockenv_context(**self.dist_env_1_gpu):
ds_config_dict = deepcopy(self.ds_config_dict)
del ds_config_dict["optimizer"] # force default HF Trainer optimizer
del ds_config_dict["scheduler"] # force default HF Trainer scheduler
ds_config_dict["zero_optimization"]["cpu_offload"] = False
ds_config_dict["fp16"]["initial_scale_power"] = 1 # force optimizer on the first step
trainer = get_regression_trainer(a=a, local_rank=0, deepspeed=ds_config_dict)
trainer.train()
new_a = trainer.model.a.item()
self.assertNotEqual(new_a, a)
def test_ds_scheduler_hf_optimizer(self):
a = 0
with mockenv_context(**self.dist_env_1_gpu):
ds_config_dict = deepcopy(self.ds_config_dict)
del ds_config_dict["optimizer"] # force default HF Trainer optimizer
ds_config_dict["zero_optimization"]["cpu_offload"] = False
ds_config_dict["fp16"]["initial_scale_power"] = 1 # force optimizer on the first step
trainer = get_regression_trainer(a=a, local_rank=0, deepspeed=ds_config_dict)
trainer.train()
new_a = trainer.model.a.item()
self.assertNotEqual(new_a, a)
def test_hf_scheduler_ds_optimizer(self):
# this combo is not possible at the moment
with mockenv_context(**self.dist_env_1_gpu):
ds_config_dict = deepcopy(self.ds_config_dict)
del ds_config_dict["scheduler"] # force default HF Trainer scheduler
ds_config_dict["zero_optimization"]["cpu_offload"] = False
ds_config_dict["fp16"]["initial_scale_power"] = 1 # force optimizer on the first step
trainer = get_regression_trainer(local_rank=0, deepspeed=ds_config_dict)
with self.assertRaises(Exception) as context:
trainer.train()
self.assertTrue("HF scheduler + DeepSpeed optimizer combination is not possible" in str(context.exception))
def test_hf_optimizer_with_offload(self):
# must not allow non-DS optimizer when using ZERO-offload
with mockenv_context(**self.dist_env_1_gpu):
ds_config_dict = deepcopy(self.ds_config_dict)
del ds_config_dict["optimizer"] # force default HF Trainer optimizer
ds_config_dict["zero_optimization"]["cpu_offload"] = True
# sanity check - should the default config change
assert (
"cpu_offload" in ds_config_dict["zero_optimization"]
and ds_config_dict["zero_optimization"]["cpu_offload"] is True
), "ensure the config is set up correctly"
trainer = get_regression_trainer(local_rank=0, deepspeed=ds_config_dict)
with self.assertRaises(Exception) as context:
trainer.train()
self.assertTrue("ZeRO Offload can only work with DeepSpeed optimizers" in str(context.exception))
def test_early_get_last_lr(self): def test_early_get_last_lr(self):
# with deepspeed's fp16 and dynamic loss scale enabled the optimizer/scheduler steps may # with deepspeed's fp16 and dynamic loss scale enabled the optimizer/scheduler steps may
......
...@@ -24,7 +24,6 @@ import tempfile ...@@ -24,7 +24,6 @@ import tempfile
from pathlib import Path from pathlib import Path
from types import SimpleNamespace from types import SimpleNamespace
from .trainer_utils import SchedulerType
from .utils import logging from .utils import logging
from .utils.versions import require_version from .utils.versions import require_version
...@@ -282,14 +281,19 @@ def init_deepspeed(trainer, num_training_steps): ...@@ -282,14 +281,19 @@ def init_deepspeed(trainer, num_training_steps):
""" """
import deepspeed import deepspeed
require_version("deepspeed>0.3.10") require_version("deepspeed>0.3.12")
args = trainer.args args = trainer.args
ds_config_file = args.deepspeed ds_config_file = args.deepspeed
model = trainer.model model = trainer.model
with io.open(ds_config_file, "r", encoding="utf-8") as f: if isinstance(args.deepspeed, dict):
config = json.load(f) config = args.deepspeed
elif isinstance(args.deepspeed, str):
with io.open(ds_config_file, "r", encoding="utf-8") as f:
config = json.load(f)
else:
raise ValueError("expecting either a path to a config file or a pre-populated dict")
# The following code translates relevant trainer's cl args into the DS config # The following code translates relevant trainer's cl args into the DS config
...@@ -321,28 +325,49 @@ def init_deepspeed(trainer, num_training_steps): ...@@ -321,28 +325,49 @@ def init_deepspeed(trainer, num_training_steps):
else: # override only if the ds config doesn't already have this section else: # override only if the ds config doesn't already have this section
config["gradient_clipping"] = args.max_grad_norm config["gradient_clipping"] = args.max_grad_norm
# Optimizer + Scheduler
# Currently support combos:
# 1. DS scheduler + DS optimizer: Yes
# 2. HF scheduler + HF optimizer: Yes
# 3. DS scheduler + HF optimizer: Yes
# 4. HF scheduler + DS optimizer: No
# Unless Offload is enabled in which case it's:
# 1. DS scheduler + DS optimizer: Yes
# 2. HF scheduler + HF optimizer: No
# 3. DS scheduler + HF optimizer: No
# 4. HF scheduler + DS optimizer: No
optimizer = None
if "optimizer" in config: if "optimizer" in config:
logger.info( logger.info(f"Updating the `scheduler` config from {ds_config_file} with other command line arguments")
f"Keeping the `optimizer` config from {ds_config_file} intact, ignoring any optimizer-specific cl args"
# to avoid inconsistent values of lr and warm up steps the command line args override config
params = dict(
lr=args.learning_rate,
betas=[args.adam_beta1, args.adam_beta2],
eps=args.adam_epsilon,
weight_decay=args.weight_decay,
) )
for k, v in params.items():
if k in config["optimizer"]["params"]:
logger.info(f"setting optimizer.params.{k} to {v}")
config["optimizer"]["params"][k] = v
else: # override only if the ds config doesn't already have this section else: # override only if the ds config doesn't already have this section
# ds supports Adam, AdamW, OneBitAdam, and Lamb optimizers and can import other optimizers from torch. if (
# To use other optimizers requires voiding warranty with: `"zero_allow_untested_optimizer": true"` "zero_optimization" in config
and "cpu_offload" in config["zero_optimization"]
optimizer_configs = { and config["zero_optimization"]["cpu_offload"] is True
"AdamW": { ):
"lr": args.learning_rate, raise ValueError("ZeRO Offload can only work with DeepSpeed optimizers")
"betas": [args.adam_beta1, args.adam_beta2], else:
"eps": args.adam_epsilon, # ds supports Adam, OneBitAdam, and Lamb optimizers and can import other optimizers from torch.
"weight_decay": args.weight_decay, # But trainer uses AdamW by default.
} # To use other optimizers so using a different scheduler requires voiding warranty with: `zero_allow_untested_optimizer`
} trainer.create_optimizer()
optimizer = "AdamW" optimizer = trainer.optimizer
# flag that this is non-native optimizer
config["optimizer"] = { config["zero_allow_untested_optimizer"] = True
"type": optimizer,
"params": optimizer_configs[optimizer],
}
# DS schedulers (deepspeed/runtime/lr_schedules.py): # DS schedulers (deepspeed/runtime/lr_schedules.py):
# #
...@@ -352,34 +377,33 @@ def init_deepspeed(trainer, num_training_steps): ...@@ -352,34 +377,33 @@ def init_deepspeed(trainer, num_training_steps):
# OneCycle | na | na | 1CLR # OneCycle | na | na | 1CLR
# WarmupLR | constant_with_warmup | get_constant_schedule_with_warmup | w/ warmup_min_lr=0 # WarmupLR | constant_with_warmup | get_constant_schedule_with_warmup | w/ warmup_min_lr=0
# WarmupDecayLR| linear | get_linear_schedule_with_warmup | # WarmupDecayLR| linear | get_linear_schedule_with_warmup |
lr_scheduler = None
if "scheduler" in config: if "scheduler" in config:
logger.info( logger.info(f"Updating the `scheduler` config from {ds_config_file} with other command line arguments")
f"Keeping the `scheduler` config from {ds_config_file} intact, ignoring any scheduler-specific cl args" # the user won't easily know the correct num_training_steps should they use WarmupDecayLR,
# so let's set it to the correct value
if config["scheduler"]["type"] == "WarmupDecayLR":
logger.info(f"setting scheduler.params.total_num_steps to {num_training_steps}")
config["scheduler"]["params"]["total_num_steps"] = num_training_steps
# to avoid inconsistent values of lr and warmup steps the command line args override config
params = dict(
warmup_max_lr=args.learning_rate,
warmup_num_steps=args.warmup_steps,
) )
for k, v in params.items():
if k in config["scheduler"]["params"]:
logger.info(f"setting scheduler.params.{k} to {v}")
config["scheduler"]["params"][k] = v
else: # override only if the ds config doesn't already have this section else: # override only if the ds config doesn't already have this section
if args.lr_scheduler_type == SchedulerType.LINEAR: if "optimizer" in config:
scheduler = "WarmupDecayLR" # to make this option work, we need to init DS optimizer first, then init HS scheduler,
params = { # then pass the HS scheduler to DS init, which is not possible at the moment
"last_batch_iteration": -1, raise ValueError("At the moment HF scheduler + DeepSpeed optimizer combination is not possible")
"total_num_steps": num_training_steps,
"warmup_min_lr": 0,
"warmup_max_lr": args.learning_rate,
"warmup_num_steps": args.warmup_steps,
}
elif args.lr_scheduler_type == SchedulerType.CONSTANT_WITH_WARMUP:
scheduler = "WarmupLR"
params = {
"warmup_min_lr": 0,
"warmup_max_lr": args.learning_rate,
"warmup_num_steps": args.warmup_steps,
}
else: else:
raise ValueError(f"{args.lr_scheduler_type} scheduler type is not supported by DeepSpeed") trainer.create_scheduler(num_training_steps=num_training_steps)
lr_scheduler = trainer.lr_scheduler
config["scheduler"] = {
"type": scheduler,
"params": params,
}
# fp16 # fp16
if trainer.fp16_backend is not None: if trainer.fp16_backend is not None:
...@@ -409,6 +433,9 @@ def init_deepspeed(trainer, num_training_steps): ...@@ -409,6 +433,9 @@ def init_deepspeed(trainer, num_training_steps):
# for clarity extract the specific cl args that are being passed to deepspeed # for clarity extract the specific cl args that are being passed to deepspeed
ds_args = dict(local_rank=args.local_rank) ds_args = dict(local_rank=args.local_rank)
# keep for quick debug:
# from pprint import pprint; pprint(config)
# init that takes part of the config via `args`, and the bulk of it via `config_params` # init that takes part of the config via `args`, and the bulk of it via `config_params`
model_parameters = filter(lambda p: p.requires_grad, model.parameters()) model_parameters = filter(lambda p: p.requires_grad, model.parameters())
model, optimizer, _, lr_scheduler = deepspeed.initialize( model, optimizer, _, lr_scheduler = deepspeed.initialize(
...@@ -416,6 +443,8 @@ def init_deepspeed(trainer, num_training_steps): ...@@ -416,6 +443,8 @@ def init_deepspeed(trainer, num_training_steps):
model=model, model=model,
model_parameters=model_parameters, model_parameters=model_parameters,
config_params=config, config_params=config,
optimizer=optimizer,
lr_scheduler=lr_scheduler,
) )
return model, optimizer, lr_scheduler return model, optimizer, lr_scheduler
......
...@@ -491,10 +491,14 @@ def assert_screenout(out, what): ...@@ -491,10 +491,14 @@ def assert_screenout(out, what):
class CaptureStd: class CaptureStd:
""" """
Context manager to capture: Context manager to capture:
stdout, clean it up and make it available via obj.out stderr, and make it available via obj.err
init arguments: - out - capture stdout: True/False, default True - err - capture stdout: True/False, default - stdout, clean it up and make it available via obj.out
True - stderr, and make it available via obj.err
init arguments:
- out - capture stdout: True/False, default True
- err - capture stdout: True/False, default True
Examples:: Examples::
......
...@@ -312,6 +312,12 @@ class Trainer: ...@@ -312,6 +312,12 @@ class Trainer:
self.sharded_ddp = ShardedDDPOption.ZERO_DP_3 self.sharded_ddp = ShardedDDPOption.ZERO_DP_3
# one place to sort out whether to place the model on device or not # one place to sort out whether to place the model on device or not
# postpone switching model to cuda when:
# 1. MP - since we are trying to fit a much bigger than 1 gpu model
# 2. fp16-enabled DeepSpeed loads the model in half the size and it doesn't need .to() anyway,
# and we only use deepspeed for training at the moment
# 3. full fp16 eval - since the model needs to be half'ed first
# 4. Sharded DDP - same as MP
self.place_model_on_device = args.place_model_on_device self.place_model_on_device = args.place_model_on_device
if ( if (
self.is_model_parallel self.is_model_parallel
...@@ -327,10 +333,6 @@ class Trainer: ...@@ -327,10 +333,6 @@ class Trainer:
self.eval_dataset = eval_dataset self.eval_dataset = eval_dataset
self.tokenizer = tokenizer self.tokenizer = tokenizer
# postpone switching model to cuda when:
# 1. MP - since we are trying to fit a much bigger than 1 gpu model
# 2. fp16-enabled DeepSpeed loads the model in half the size and it doesn't need .to() anyway,
# and we only use deepspeed for training at the moment
if self.place_model_on_device: if self.place_model_on_device:
model = model.to(args.device) model = model.to(args.device)
...@@ -616,6 +618,17 @@ class Trainer: ...@@ -616,6 +618,17 @@ class Trainer:
""" """
Setup the optimizer and the learning rate scheduler. Setup the optimizer and the learning rate scheduler.
We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the
Trainer's init through :obj:`optimizers`, or subclass and override this method (or :obj:`create_optimizer`
and/or :obj:`create_scheduler`) in a subclass.
"""
self.create_optimizer()
self.create_scheduler(num_training_steps)
def create_optimizer(self):
"""
Setup the optimizer.
We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the
Trainer's init through :obj:`optimizers`, or subclass and override this method in a subclass. Trainer's init through :obj:`optimizers`, or subclass and override this method in a subclass.
""" """
...@@ -652,6 +665,13 @@ class Trainer: ...@@ -652,6 +665,13 @@ class Trainer:
else: else:
self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs) self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)
def create_scheduler(self, num_training_steps: int):
"""
Setup the scheduler. The optimizer of the trainer must have been set up before this method is called.
Args:
num_training_steps (int): The number of training steps to do.
"""
if self.lr_scheduler is None: if self.lr_scheduler is None:
warmup_steps = ( warmup_steps = (
self.args.warmup_steps self.args.warmup_steps
...@@ -902,7 +922,7 @@ class Trainer: ...@@ -902,7 +922,7 @@ class Trainer:
if self.args.deepspeed: if self.args.deepspeed:
model, optimizer, lr_scheduler = init_deepspeed(self, num_training_steps=max_steps) model, optimizer, lr_scheduler = init_deepspeed(self, num_training_steps=max_steps)
self.model = model.module self.model = model.module
self.model_wrapped = model # will get further wrapped in DDP self.model_wrapped = model
self.deepspeed = model # DeepSpeedEngine object self.deepspeed = model # DeepSpeedEngine object
self.optimizer = optimizer self.optimizer = optimizer
self.lr_scheduler = lr_scheduler self.lr_scheduler = lr_scheduler
......
...@@ -263,9 +263,10 @@ class TrainingArguments: ...@@ -263,9 +263,10 @@ class TrainingArguments:
If a string is passed, it will be split on space. If a bool is passed, it will be converted to an empty If a string is passed, it will be split on space. If a bool is passed, it will be converted to an empty
list for :obj:`False` and :obj:`["simple"]` for :obj:`True`. list for :obj:`False` and :obj:`["simple"]` for :obj:`True`.
deepspeed (:obj:`str`, `optional`): deepspeed (:obj:`str` or :obj:`dict`, `optional`):
Use `Deepspeed <https://github.com/microsoft/deepspeed>`__. This is an experimental feature and its API may Use `Deepspeed <https://github.com/microsoft/deepspeed>`__. This is an experimental feature and its API may
evolve in the future. The value is the location of its json config file (usually ``ds_config.json``). evolve in the future. The value is either the location of DeepSpeed json config file (e.g.,
``ds_config.json``) or an already loaded json file as a :obj:`dict`"
label_smoothing_factor (:obj:`float`, `optional`, defaults to 0.0): label_smoothing_factor (:obj:`float`, `optional`, defaults to 0.0):
The label smoothing factor to use. Zero means no label smoothing, otherwise the underlying onehot-encoded The label smoothing factor to use. Zero means no label smoothing, otherwise the underlying onehot-encoded
labels are changed from 0s and 1s to :obj:`label_smoothing_factor/num_labels` and :obj:`1 - labels are changed from 0s and 1s to :obj:`label_smoothing_factor/num_labels` and :obj:`1 -
...@@ -481,7 +482,9 @@ class TrainingArguments: ...@@ -481,7 +482,9 @@ class TrainingArguments:
) )
deepspeed: Optional[str] = field( deepspeed: Optional[str] = field(
default=None, default=None,
metadata={"help": "Enable deepspeed and pass the path to deepspeed json config file (e.g. ds_config.json)"}, metadata={
"help": "Enable deepspeed and pass the path to deepspeed json config file (e.g. ds_config.json) or an already loaded json file as a dict"
},
) )
label_smoothing_factor: float = field( label_smoothing_factor: float = field(
default=0.0, metadata={"help": "The label smoothing epsilon to apply (zero means no label smoothing)."} default=0.0, metadata={"help": "The label smoothing epsilon to apply (zero means no label smoothing)."}
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment