Unverified commit bc2571e6 authored by Stas Bekman, committed by GitHub

[Deepspeed] ZeRO-Infinity integration plus config revamp (#11418)



* adding Z-inf

* revamp config process

* up version requirement

* wip

* massive rewrite

* cleanup

* cleanup

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* consistent json commas

* act on suggestions

* leave this feature for 0.3.16

* style
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent 0661abc5
@@ -400,18 +400,18 @@ DeepSpeed

`DeepSpeed <https://github.com/microsoft/DeepSpeed>`__ implements everything described in the `ZeRO paper
<https://arxiv.org/abs/1910.02054>`__. Currently it provides full support for:

1. Optimizer state partitioning (ZeRO stage 1)
2. Gradient partitioning (ZeRO stage 2)
3. Parameter partitioning (ZeRO stage 3)
4. Custom mixed precision training handling
5. A range of fast CUDA-extension-based optimizers
6. ZeRO-Offload to CPU and NVMe

ZeRO-Offload has its own dedicated paper: `ZeRO-Offload: Democratizing Billion-Scale Model Training
<https://arxiv.org/abs/2101.06840>`__. And NVMe support is described in the paper `ZeRO-Infinity: Breaking the GPU
Memory Wall for Extreme Scale Deep Learning <https://arxiv.org/abs/2104.07857>`__.

DeepSpeed ZeRO-2 is primarily used only for training, as its features are of no use to inference.

DeepSpeed ZeRO-3 can be used for inference as well, since it allows huge models to be loaded on multiple GPUs, which
won't be possible on a single GPU.

@@ -541,7 +541,7 @@ Here is an example of running ``run_translation.py`` under DeepSpeed deploying a

.. code-block:: bash

deepspeed examples/pytorch/translation/run_translation.py \
--deepspeed tests/deepspeed/ds_config_zero3.json \
--model_name_or_path t5-small --per_device_train_batch_size 1 \
--output_dir output_dir --overwrite_output_dir --fp16 \
--do_train --max_train_samples 500 --num_train_epochs 1 \

@@ -566,17 +566,17 @@ To deploy DeepSpeed with one GPU adjust the :class:`~transformers.Trainer` comma

.. code-block:: bash

deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py \
--deepspeed tests/deepspeed/ds_config_zero2.json \
--model_name_or_path t5-small --per_device_train_batch_size 1 \
--output_dir output_dir --overwrite_output_dir --fp16 \
--do_train --max_train_samples 500 --num_train_epochs 1 \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_lang en --target_lang ro

This is almost the same as with multiple GPUs, but here we tell DeepSpeed explicitly to use just one GPU via
``--num_gpus=1``. By default, DeepSpeed deploys all GPUs it can see on the given node. If you have only 1 GPU to start
with, then you don't need this argument. The following `documentation
<https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node>`__ discusses the launcher options.

Why would you want to use DeepSpeed with just one GPU?

@@ -601,7 +601,7 @@ with DeepSpeed is to have at least the following configuration in the configurat

"overlap_comm": true,
"contiguous_gradients": true,
"cpu_offload": true
}
}

which enables ``cpu_offload`` and some other important features. You may experiment with the buffer sizes, you will

@@ -610,6 +610,11 @@ find more details in the discussion below.

For a practical usage example of this type of deployment, please, see this `post
<https://github.com/huggingface/transformers/issues/8771#issuecomment-759176685>`__.

You may also try the ZeRO-3 with CPU and NVMe offload as explained further in this document.

<!--- TODO: Benchmark whether we can get better performance out of ZeRO-3 vs. ZeRO-2 on a single GPU, and then
recommend ZeRO-3 config as starting one. -->

Notes:
- if you need to run on a specific GPU, which is different from GPU 0, you can't use ``CUDA_VISIBLE_DEVICES`` to limit - if you need to run on a specific GPU, which is different from GPU 0, you can't use ``CUDA_VISIBLE_DEVICES`` to limit
@@ -643,7 +648,7 @@ If you're using only 1 GPU, here is how you'd have to adjust your training code

os.environ['WORLD_SIZE'] = "1"

# Now proceed as normal, plus pass the deepspeed config file
training_args = TrainingArguments(..., deepspeed="ds_config_zero3.json")
trainer = Trainer(...)
trainer.train()

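For reference, a minimal sketch of the full launcher-environment emulation for this single-GPU notebook setup might
look like the following. This is only an illustration: the address and port are arbitrary placeholders, and the exact
set of variables should match what your environment expects.

.. code-block:: python

import os

# emulate the env vars the distributed launcher would normally set (placeholder values)
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "9994"  # any free port
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
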
@@ -659,47 +664,62 @@ cell with:

.. code-block:: python

%%bash
cat <<'EOT' > ds_config_zero3.json
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},

"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},

"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},

"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e14,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_fp16_weights_on_model_save": true
},

"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
EOT

@@ -725,7 +745,7 @@ or with ``%%bash`` magic, where you can write a multi-line code for the shell pr

In such a case you don't need any of the code presented at the beginning of this section.

Note: While the ``%%bash`` magic is neat, it currently buffers the output so you won't see the logs until the process
completes.

@@ -760,48 +780,55 @@ When using DeepSpeed you always need to supply a DeepSpeed configuration file, y

to be configured via the command line. You will find the nuances in the rest of this guide.

To get an idea of what a DeepSpeed configuration file looks like, here is one that activates ZeRO stage 2 features,
including optimizer state CPU offload, uses the ``AdamW`` optimizer and the ``WarmupLR`` scheduler, and will enable
mixed precision training if ``--fp16`` is passed:

.. code-block:: json

{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},

"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},

"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},

"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true,
"cpu_offload": true
},

"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto"
}

When you execute the program, DeepSpeed will log the configuration it received from the :class:`~transformers.Trainer`

@@ -835,35 +862,38 @@ or:

Shared Configuration
=======================================================================================================================

.. warning::

    This section is a must-read.

Some configuration values are required by both the :class:`~transformers.Trainer` and DeepSpeed to function correctly,
therefore, to prevent conflicting definitions, which could lead to hard to detect errors, we chose to configure those
via the :class:`~transformers.Trainer` command line arguments.

Additionally, some configuration values are derived automatically based on the model's configuration, so instead of
remembering to manually adjust multiple values, it's best to let the :class:`~transformers.Trainer` do the majority of
the configuration for you.

Therefore, in the rest of this guide you will find a special configuration value: ``auto``, which when set will be
automatically replaced with the correct or most efficient value. Please feel free to ignore this recommendation and set
the values explicitly, in which case be very careful that your :class:`~transformers.Trainer` arguments and DeepSpeed
configurations agree. For example, are you using the same learning rate, batch size, or gradient accumulation settings?
If these mismatch, the training may fail in very difficult to detect ways. You have been warned.

There are multiple other values that are specific to DeepSpeed only, and those you will have to set manually to suit
your needs.

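To make the ``auto`` mechanism more concrete, here is a minimal sketch of the kind of substitution performed on the
configuration. The helper below is only illustrative (it mirrors the ``_set_if_auto`` helper added by this commit), and
the substituted values are assumed to come from ``--max_grad_norm`` and ``--per_device_train_batch_size``:

.. code-block:: python

def set_if_auto(config, key, val):
    # fill in a value only if the user left it as "auto"
    if config.get(key) == "auto":
        config[key] = val

ds_config = {"gradient_clipping": "auto", "train_micro_batch_size_per_gpu": "auto"}

set_if_auto(ds_config, "gradient_clipping", 1.0)             # from --max_grad_norm
set_if_auto(ds_config, "train_micro_batch_size_per_gpu", 8)  # from --per_device_train_batch_size

assert ds_config == {"gradient_clipping": 1.0, "train_micro_batch_size_per_gpu": 8}
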
ZeRO
=======================================================================================================================

`Zero Redundancy Optimizer (ZeRO) <https://www.deepspeed.ai/tutorials/zero/>`__ is the workhorse of DeepSpeed. It
supports 3 different levels (stages) of optimization. The first one is not quite interesting for scalability purposes,
therefore this document focuses on stages 2 and 3. Stage 3 is further improved by the latest addition of ZeRO-Infinity.
You will find more in-depth information in the DeepSpeed documentation.

The ``zero_optimization`` section of the configuration file is the most important part (`docs
<https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training>`__), since that is where you define

@@ -916,36 +946,43 @@ ZeRO-3 Config

The following is an example configuration for ZeRO stage 3:

.. code-block:: json

{
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e14,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_fp16_weights_on_model_save": true
}
}

If you are getting OOMs because your model or activations don't fit into the GPU memory and you have unutilized CPU
memory, offloading the optimizer states and parameters to CPU memory with ``"device": "cpu"`` may solve this
limitation. If you don't want to offload to CPU memory, use ``none`` instead of ``cpu`` for the ``device`` entry.
Offloading to NVMe is discussed further down.

Pinned memory is enabled with ``pin_memory`` set to ``true``. This feature can improve the throughput at the cost of
making less memory available to other processes. Pinned memory is set aside for the specific process that requested it
and it's typically accessed much faster than normal CPU memory.

**Performance tuning:**

- ``sub_group_size``: ``1e14``
- ``stage3_max_live_parameters``: ``1e9``
- ``stage3_max_reuse_distance``: ``1e9``

@@ -960,37 +997,91 @@ going to be used again in near future (less than ``stage3_max_reuse_distance``)

overhead. This is super helpful when you have activation checkpointing enabled, where we do a forward recompute and
backward passes at a single layer granularity and want to keep the parameter in the forward recompute till the backward
pass.

The following configuration values depend on the model's hidden size:

- ``reduce_bucket_size``: ``hidden_size*hidden_size``
- ``stage3_prefetch_bucket_size``: ``0.9 * hidden_size * hidden_size``
- ``stage3_param_persistence_threshold``: ``10 * hidden_size``

Therefore set these values to ``auto`` and the :class:`~transformers.Trainer` will automatically assign the recommended
values. But, of course, feel free to set these explicitly as well.

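For reference, here is a small sketch of how these three values can be derived from a model's configuration. It is
illustrative only; the :class:`~transformers.Trainer` does the equivalent internally using ``model.config.hidden_size``
(``bert-base-uncased`` is used here purely as an example model):

.. code-block:: python

from transformers import AutoConfig

hidden_size = AutoConfig.from_pretrained("bert-base-uncased").hidden_size  # 768 for this model

# the hidden_size-dependent values that "auto" resolves to
zero3_auto_values = {
    "reduce_bucket_size": hidden_size * hidden_size,                      # 589824
    "stage3_prefetch_bucket_size": int(0.9 * hidden_size * hidden_size),  # 530841
    "stage3_param_persistence_threshold": 10 * hidden_size,               # 7680
}
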
``stage3_gather_fp16_weights_on_model_save`` enables model fp16 weights consolidation when model gets saved. With large
models and multiple GPUs this is an expensive operation both in terms of memory and speed. It's currently required if
you plan to resume the training. Watch out for future updates that will remove this limitation and make things more
flexible.
If you're migrating from ZeRO-2 configuration note that ``allgather_partitions``, ``allgather_bucket_size`` and
``reduce_scatter`` configuration parameters are not used in ZeRO-3. If you keep these in the config file they will just
be ignored. Make sure to remove ``cpu_offload`` though, since it has been deprecated in ZeRO-3.
NVMe Support
=======================================================================================================================
ZeRO-Infinity allows for training incredibly large models by extending GPU and CPU memory with NVMe memory. Thanks to
smart partitioning and tiling algorithms each GPU needs to send and receive very small amounts of data during
offloading, so modern NVMe proved to be fit to allow for an even larger total memory pool available to your training
process. ZeRO-Infinity requires ZeRO-3 to be enabled.

The following configuration example enables NVMe to offload both optimizer states and the params:

.. code-block:: json

{
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "nvme",
"nvme_path": "/local_nvme",
"pin_memory": true,
"buffer_count": 4,
"fast_init": false
},
"offload_param": {
"device": "nvme",
"nvme_path": "/local_nvme",
"pin_memory": true,
"buffer_count": 5,
"buffer_size": 1e8,
"max_in_cpu": 1e9
},
"aio": {
"block_size": 262144,
"queue_depth": 32,
"thread_count": 1,
"single_submit": false,
"overlap_events": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e14,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_fp16_weights_on_model_save": true
}
}

You can choose to offload both optimizer states and params to NVMe, or just one of them, or none. For example, if you
have copious amounts of CPU memory available, by all means offload to CPU memory only as it'd be faster (hint:
``"device": "cpu"``).

Here is the full documentation for offloading `optimizer states
<https://www.deepspeed.ai/docs/config-json/#optimizer-offloading>`__ and `parameters
<https://www.deepspeed.ai/docs/config-json/#parameter-offloading>`__.
Make sure that your ``nvme_path`` is actually an NVMe, since it will work with the normal hard drive or SSD, but it'll
be much much slower. The fast scalable training was designed with modern NVMe transfer speeds in mind (as of this
writing one can have ~3.5GB/s read, ~3GB/s write peak speeds).
In order to figure out the optimal ``aio`` configuration block you must run a benchmark on your target setup, as
`explained here <https://github.com/microsoft/DeepSpeed/issues/998>`__.
ZeRO-2 vs ZeRO-3 Performance

@@ -1016,13 +1107,13 @@ these help you to trade scalability for speed depending on your needs.

ZeRO-2 Example
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Here is a full ZeRO-2 auto-configuration file ``ds_config_zero2.json``:

.. code-block:: json

{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,

@@ -1030,6 +1121,25 @@ Here is a full ZeRO-2 all-enabled configuration file ``ds_config_zero2.json``:

"min_loss_scale": 1
},

"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": { "zero_optimization": {
"stage": 2, "stage": 2,
"allgather_partitions": true, "allgather_partitions": true,
...@@ -1041,6 +1151,30 @@ Here is a full ZeRO-2 all-enabled configuration file ``ds_config_zero2.json``: ...@@ -1041,6 +1151,30 @@ Here is a full ZeRO-2 all-enabled configuration file ``ds_config_zero2.json``:
"cpu_offload": true "cpu_offload": true
}, },
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
Here is a full ZeRO-2 all-enabled manually set configuration file. It is here mainly for you to see what the typical
values look like, but we highly recommend using the one with multiple ``auto`` settings in it.
.. code-block:: json
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": { "optimizer": {
"type": "AdamW", "type": "AdamW",
"params": { "params": {
...@@ -1060,6 +1194,17 @@ Here is a full ZeRO-2 all-enabled configuration file ``ds_config_zero2.json``: ...@@ -1060,6 +1194,17 @@ Here is a full ZeRO-2 all-enabled configuration file ``ds_config_zero2.json``:
} }
}, },
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true,
"cpu_offload": true
},
"steps_per_print": 2000, "steps_per_print": 2000,
"wall_clock_breakdown": false "wall_clock_breakdown": false
} }
@@ -1069,13 +1214,14 @@ Here is a full ZeRO-2 all-enabled configuration file ``ds_config_zero2.json``:

ZeRO-3 Example
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Here is a full ZeRO-3 auto-configuration file ``ds_config_zero3.json``:

.. code-block:: json

{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,

@@ -1083,22 +1229,69 @@ Here is a full ZeRO-3 all-enabled configuration file ``ds_config_zero3.json``:

"min_loss_scale": 1
},

"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": { "zero_optimization": {
"stage": 3, "stage": 3,
"cpu_offload": true, "offload_optimizer": {
"cpu_offload_params": true, "device": "cpu",
"cpu_offload_use_pin_memory" : true, "pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true, "overlap_comm": true,
"contiguous_gradients": true, "contiguous_gradients": true,
"sub_group_size": 1e14, "sub_group_size": 1e14,
"reduce_bucket_size": 1e6, "reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": 0.94e6, "stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": 1e4, "stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9, "stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9, "stage3_max_reuse_distance": 1e9,
"stage3_gather_fp16_weights_on_model_save": true "stage3_gather_fp16_weights_on_model_save": true
}, },
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
Here is a full ZeRO-3 all-enabled manually set configuration file. It is here mainly for you to see what the typical
values look like, but we highly recommend using the one with multiple ``auto`` settings in it.
.. code-block:: json
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": { "optimizer": {
"type": "AdamW", "type": "AdamW",
"params": { "params": {
...@@ -1118,6 +1311,27 @@ Here is a full ZeRO-3 all-enabled configuration file ``ds_config_zero3.json``: ...@@ -1118,6 +1311,27 @@ Here is a full ZeRO-3 all-enabled configuration file ``ds_config_zero3.json``:
} }
}, },
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e14,
"reduce_bucket_size": 1e6,
"stage3_prefetch_bucket_size": 0.94e6,
"stage3_param_persistence_threshold": 1e4,
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_fp16_weights_on_model_save": true
},
"steps_per_print": 2000, "steps_per_print": 2000,
"wall_clock_breakdown": false "wall_clock_breakdown": false
} }
@@ -1153,7 +1367,7 @@ If you don't configure the ``optimizer`` entry in the configuration file, the :c

automatically set it to ``AdamW`` and will use the supplied values or the defaults for the following command line
arguments: ``--learning_rate``, ``--adam_beta1``, ``--adam_beta2``, ``--adam_epsilon`` and ``--weight_decay``.

Here is an example of the auto-configured ``optimizer`` entry for ``AdamW``:

.. code-block:: json

@@ -1161,15 +1375,16 @@ Here is an example of the pre-configured ``optimizer`` entry for ``AdamW``:

"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
}
}

Note that the command line arguments will set the values in the configuration file. This is so that there is one
definitive source of the values and to avoid hard to find errors when, for example, the learning rate is set to
different values in different places. Command line rules. The values that get overridden are:

@@ -1180,19 +1395,42 @@ different values in different places. Command line rules. The values that get ov

Therefore please remember to tune the shared hyperparameters on the command line.

You can also set the values explicitly:

.. code-block:: json
{
"optimizer": {
"type": "AdamW",
"params": {
"lr": 0.001,
"betas": [0.8, 0.999],
"eps": 1e-8,
"weight_decay": 3e-7
}
}
}
But then you're on your own synchronizing the :class:`~transformers.Trainer` command line arguments and the DeepSpeed
configuration.

If you want to use another optimizer which is not listed above, you will have to add the following to the top level
configuration:

.. code-block:: json

{
"zero_allow_untested_optimizer": true
}

Similarly to ``AdamW``, you can configure other officially supported optimizers. Just remember that those may have
different config values, e.g. for Adam you will want ``weight_decay`` around ``0.01``.

Scheduler
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

DeepSpeed supports ``LRRangeTest``, ``OneCycle``, ``WarmupLR`` and ``WarmupDecayLR`` learning rate schedulers. The full
documentation is `here <https://www.deepspeed.ai/docs/config-json/#scheduler-parameters>`__.

Here is where the schedulers overlap between 🤗 Transformers and DeepSpeed:

@@ -1200,12 +1438,11 @@ Here is where the schedulers overlap between 🤗 Transformers and DeepSpeed:

* ``WarmupDecayLR`` via ``--lr_scheduler_type linear``. This is also the default value for ``--lr_scheduler_type``,
therefore, if you don't configure the scheduler this is the scheduler that will get configured by default.

If you don't configure the ``scheduler`` entry in the configuration file, the :class:`~transformers.Trainer` will use
the values of ``--lr_scheduler_type``, ``--learning_rate`` and ``--warmup_steps`` to configure a 🤗 Transformers version
of it.

Here is an example of the auto-configured ``scheduler`` entry for ``WarmupLR``:

.. code-block:: json

@@ -1213,24 +1450,41 @@ Here is an example of the pre-configured ``scheduler`` entry for ``WarmupLR``:

"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
}
}

Since ``"auto"`` is used, the :class:`~transformers.Trainer` arguments will set the correct values in the configuration
file. This is so that there is one definitive source of the values and to avoid hard to find errors when, for example,
the learning rate is set to different values in different places. Command line rules. The values that get set are:

- ``warmup_max_lr`` with the value of ``--learning_rate``
- ``warmup_num_steps`` with the value of ``--warmup_steps``
- ``total_num_steps`` with either the value of ``--max_steps`` or if it is not provided, derived automatically at run
time based on the environment and the size of the dataset and other command line arguments (needed for
``WarmupDecayLR``).

You can, of course, take over any or all of the configuration values and set those yourself:

.. code-block:: json
{
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": 0.001,
"warmup_num_steps": 1000
}
}
}
But then you're on your own synchronizing the :class:`~transformers.Trainer` command line arguments and the DeepSpeed
configuration.

For example, for ``WarmupDecayLR``, you can use the following entry:

@@ -1240,16 +1494,16 @@ For example, for ``WarmupDecayLR``, you can use the following entry:

"scheduler": {
"type": "WarmupDecayLR",
"params": {
"last_batch_iteration": -1,
"total_num_steps": "auto",
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
}
}

and ``total_num_steps``, ``warmup_max_lr`` and ``warmup_num_steps`` will be set at loading time.

@@ -1258,10 +1512,32 @@ Automatic Mixed Precision

You can use automatic mixed precision with either a pytorch-like AMP way or the apex-like way:

To configure pytorch AMP-like mode set:

.. code-block:: json

{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
}
}
and the :class:`~transformers.Trainer` will automatically enable or disable it based on the value of
``args.fp16_backend``. The rest of config values are up to you.

This mode gets enabled when the ``--fp16 --fp16_backend amp`` command line args are passed.

.. note::

    At the moment DeepSpeed doesn't support fp32 mode, though it will become available soon. Until then it will
    always be set to ``true``.

You can also enable/disable this mode explicitly:

.. code-block:: json

@@ -1270,17 +1546,32 @@ Here is an example of the ``fp16`` configuration:

"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
}
}

But then you're on your own synchronizing the :class:`~transformers.Trainer` command line arguments and the DeepSpeed
configuration.

Here is the `documentation <https://www.deepspeed.ai/docs/config-json/#fp16-training-options>`__.

To configure apex AMP-like mode set:

.. code-block:: json

"amp": {
"enabled": "auto",
"opt_level": "auto"
}
and the :class:`~transformers.Trainer` will automatically configure it based on the values of ``args.fp16_backend`` and
``args.fp16_opt_level``.

This mode gets enabled when the ``--fp16 --fp16_backend apex --fp16_opt_level O1`` command line args are passed.

You can also configure this mode explicitly:

.. code-block:: json

@@ -1291,6 +1582,9 @@ Here is an example of the ``amp`` configuration:

}
}

But then you're on your own synchronizing the :class:`~transformers.Trainer` command line arguments and the DeepSpeed
configuration.

Here is the `documentation
<https://www.deepspeed.ai/docs/config-json/#automatic-mixed-precision-amp-training-options>`__.

@@ -1298,43 +1592,55 @@ Here is the `documentation

Gradient Accumulation
=======================================================================================================================

To configure gradient accumulation set:

.. code-block:: json

{
"gradient_accumulation_steps": "auto"
}

and the :class:`~transformers.Trainer` will automatically set it to the value of ``args.gradient_accumulation_steps``.

You can also set the value explicitly:
.. code-block:: json
{
"gradient_accumulation_steps": 3
}
But then you're on your own synchronizing the :class:`~transformers.Trainer` command line arguments and the DeepSpeed
configuration.

Gradient Clipping
=======================================================================================================================

To configure gradient clipping set:

.. code-block:: json
{
"gradient_clipping": "auto"
}
and the :class:`~transformers.Trainer` will automatically set it to the value of ``args.max_grad_norm``.

You can also set the value explicitly:

.. code-block:: json

{
"gradient_clipping": 1.0
}

But then you're on your own synchronizing the :class:`~transformers.Trainer` command line arguments and the DeepSpeed
configuration.

Getting The Model Weights Out
=======================================================================================================================

As long as you continue training and resuming using DeepSpeed you don't need to worry about anything. DeepSpeed stores

@@ -1352,6 +1658,16 @@ version of the weights. If this setting is ``False`` ``pytorch_model.bin`` won't

DeepSpeed's ``state_dict`` contains a placeholder and not the real weights. If we were to save this ``state_dict`` it
won't be possible to load it back.

.. code-block:: json
{
"zero_optimization": {
"stage3_gather_fp16_weights_on_model_save": true
}
}

**FP32 Weights:**

While the fp16 weights are fine for resuming training, if you finished finetuning your model and want to upload it to

@@ -1398,44 +1714,18 @@ This is it. ``pytorch_model.bin`` will now contain the full fp32 model weights c

Note: currently the script requires 2x general RAM of the final fp32 model weights.

ZeRO-3 and Infinity Nuances
=======================================================================================================================

ZeRO-3 is quite different from ZeRO-2 because of its param sharding feature.

ZeRO-Infinity further extends ZeRO-3 to support NVMe memory and multiple other speed and scalability improvements.

While all the efforts were made for things to just work without needing any special changes to your models, in certain
circumstances you may find the following information to be needed.

Registering External Parameters
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
If layer A needs to access weights belonging to layer B, currently layer A needs to tell DeepSpeed about it. This is
done with the help of ``deepspeed.zero.register_external_parameter`` that needs to be called in ``A.__init__`` and can
be seen in the following example:
.. code-block:: python

class ModuleZ3(torch.nn.Module):
    def __init__(self, *args):
        super().__init__()
        self.layer1 = SomeLayer()
        self.layer2 = OtherLayer()
        deepspeed.zero.register_external_parameter(self, self.layer1.weight)

    def forward(self, input):
        x = self.layer1(input)
        # self.layer1.weight is needed in ModuleZ3.forward
        y = self.layer2(x, self.layer1.weight)
        return y

In general ``transformers`` models don't use this style of referring to other layer's weights so most likely you won't
need to use it.
For full details on this method please refer to `Registering External Parameters
<https://deepspeed.readthedocs.io/en/latest/zero3.html#registering-external-parameters>`__.

Constructing Massive Models
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

@@ -1455,18 +1745,20 @@ context manager (which is also a function decorator), like so:

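The code example referenced by this hunk is collapsed in the diff view. A minimal sketch of constructing a model under
the ``deepspeed.zero.Init`` context manager might look like this (``t5-small`` is used purely as an illustration, and
DeepSpeed must be installed):

.. code-block:: python

import deepspeed
from transformers import T5Config, T5ForConditionalGeneration

# the model is partitioned across GPUs as it is built, so the full copy never has to fit on one GPU
with deepspeed.zero.Init():
    config = T5Config.from_pretrained("t5-small")
    model = T5ForConditionalGeneration(config)
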
As you can see this gives you a randomly initialized model.

If you want to use a pretrained model, ``model_class.from_pretrained`` will activate this feature as long as
``is_deepspeed_zero3_enabled()`` returns ``True``, which currently is set up by the
:class:`~transformers.TrainingArguments` object if the passed DeepSpeed configuration file contains a ZeRO-3 config
section. Thus you must create the :class:`~transformers.TrainingArguments` object **before** calling
``from_pretrained``. Here is an example of a possible sequence:

.. code-block:: python

from transformers import AutoModel, Trainer, TrainingArguments

training_args = TrainingArguments(..., deepspeed=ds_config)
model = AutoModel.from_pretrained("t5-small")
trainer = Trainer(model=model, args=training_args, ...)

If you're using the official example scripts and your command line arguments include ``--deepspeed ds_config.json``
with ZeRO-3 config enabled, then everything is already done for you, since this is how example scripts are written.

Note: If the fp16 weights of the model can't fit onto the memory of a single GPU this feature must be used.

@@ -1475,8 +1767,6 @@ For full details on this method and other related features please refer to `Cons

Gathering Parameters
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

@@ -1501,8 +1791,6 @@ larger multi-dimensional shape, this means that the parameter is partitioned and

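The body of this section is collapsed in the diff view. As a rough, illustrative sketch only (the choice of parameter
is arbitrary), gathering a ZeRO-3 partitioned parameter so that its full tensor becomes visible looks something like
this:

.. code-block:: python

import deepspeed

def show_full_embedding_shape(model):
    # outside this context a ZeRO-3 partitioned parameter is only a small placeholder
    with deepspeed.zero.GatheredParameters(model.get_input_embeddings().weight):
        print(model.get_input_embeddings().weight.shape)  # the full, gathered shape
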
Notes
=======================================================================================================================

@@ -1514,6 +1802,7 @@ Notes

with your own trainer, and you will have to adapt the latter according to `the DeepSpeed integration instructions
<https://www.deepspeed.ai/getting-started/#writing-deepspeed-models>`__.

Main DeepSpeed Resources
=======================================================================================================================

@@ -1526,6 +1815,7 @@ Papers:

- `ZeRO: Memory Optimizations Toward Training Trillion Parameter Models <https://arxiv.org/abs/1910.02054>`__
- `ZeRO-Offload: Democratizing Billion-Scale Model Training <https://arxiv.org/abs/2101.06840>`__
- `ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning <https://arxiv.org/abs/2104.07857>`__

Finally, please remember that the HuggingFace :class:`~transformers.Trainer` only integrates DeepSpeed, therefore if
you have any problems or questions with regards to DeepSpeed usage, please file an issue with `DeepSpeed GitHub

......

@@ -90,7 +90,7 @@ _deps = [

    "cookiecutter==1.7.2",
    "dataclasses",
    "datasets",
    "deepspeed>=0.3.15",
    "docutils==0.16.0",
    "fairscale>0.3",
    "faiss-cpu",

......

@@ -7,7 +7,7 @@ deps = {

    "cookiecutter": "cookiecutter==1.7.2",
    "dataclasses": "dataclasses",
    "datasets": "datasets",
    "deepspeed": "deepspeed>=0.3.15",
    "docutils": "docutils==0.16.0",
    "fairscale": "fairscale>0.3",
    "faiss-cpu": "faiss-cpu",

......

@@ -19,8 +19,8 @@ import io

import json
import numbers
import os
import tempfile
import weakref
from copy import deepcopy
from pathlib import Path

@@ -269,74 +269,180 @@ def rewrite_logs(d):

    return new_d
_is_deepspeed_zero3_enabled = None def _is_true(config, key):
if config is None:
return False
return bool(config.get(key))
def is_deepspeed_zero3_enabled(): def _set_if_auto(config, key, val):
""" if config is None:
This function answers to the question of whether DeepSpeed is going to be used and run using ZeRO Stage 3. return
if config.get(key) == "auto":
config[key] = val
It includes an auto-discovery method, see comments in the code for details.
Returns: ``True`` if either it was explicitly enabled via ``deepspeed_zero3_enable(True)`` or the auto-detector was class DeepSpeedConfigHF:
able to derive that the ``Trainer`` will be running via DeepSpeed ZeRO stage 3.
""" """
global _is_deepspeed_zero3_enabled This object contains Deepspeed configuration and can be quickly queried for things like zero stage.
if _is_deepspeed_zero3_enabled is None:
_is_deepspeed_zero3_enabled = False
# Try to auto-discover if we are about to use DeepSpeed with ZeRO3 enabled. This will only
# work for scripts using cli to pass --deepspeed ds_config.json. If cmd args aren't used,
# then to get the model efficiently loaded across multiple-gpus one has to explicitly call
# is_deepspeed_zero3_enabled(True) **before** instantiating a model object
if "--deepspeed" in sys.argv:
idx = sys.argv.index("--deepspeed")
ds_config = sys.argv[idx + 1]
if not os.path.exists(ds_config):
raise ValueError("--deepspeed requires a valid path to a config file")
config = deepspeed_parse_config(ds_config)
if (
"zero_optimization" in config
and "stage" in config["zero_optimization"]
and config["zero_optimization"]["stage"] == 3
):
_is_deepspeed_zero3_enabled = True
return _is_deepspeed_zero3_enabled
def deepspeed_zero3_enable(enable=True):
"""
``is_deepspeed_zero3_enabled()`` tries to derive automatically if DeepSpeed ZeRO 3 is going to be used by looking
at ``sys.argv`` which may or may not contain information about where to find the DeepSpeed config if any.
This function allows for explicit enabling/disabling of this global flag. We store a ``weakref`` of this object in the module's global to be able to access the config from areas where the
Trainer is not available (e.g. `from_pretrained` and `_get_resized_embeddings`).
Args: The ``DeepSpeedConfigHF`` object is meant to be created during ``TrainingArguments`` object creation and has the
enable: if set to ``True`` will make ``is_deepspeed_zero3_enabled()`` return ``True`` same lifespan as the latter.
""" """
global _is_deepspeed_zero3_enabled
_is_deepspeed_zero3_enabled = enable
def __init__(self, args):
self.config = None
self.stage = 0
self.offload = False
def deepspeed_parse_config(ds_config): dep_version_check("deepspeed")
"""
If ``ds_config`` isn't already a dict, read it from the config file.
If it's already a dict, return a copy of it, so that we can freely modify it. self.config_process(args)
"""
dep_version_check("deepspeed") # set global weakref object
deepspeed_config_hf_set(self)
if isinstance(ds_config, dict):
# Don't modify user's data should they want to reuse it (e.g. in tests), because once we def is_zero2(self):
# modified it, it will not be accepted here again, since some config params must not be set by users return self.stage == 2
config = deepcopy(ds_config)
elif isinstance(ds_config, str): def is_zero3(self):
with io.open(ds_config, "r", encoding="utf-8") as f: return self.stage == 3
config = json.load(f)
def is_offload(self):
return self.offload
def config_process(self, args):
"""
1. load the json config if ``args.deepspeed`` is a path
2. replace any ``auto`` values in the config with the correct or recommended value
This is done as early as possible, before the model is created, so that the ``is_deepspeed_zero3_enabled`` query
works and the early deepspeed config object is available during ``zero.Init()``, which needs to know whether fp16
is enabled, the dtype, etc.
"""
config_file_or_dict = args.deepspeed
if isinstance(config_file_or_dict, dict):
# Don't modify user's data should they want to reuse it (e.g. in tests), because once we
# modified it, it will not be accepted here again, since `auto` values would have been overridden
config = deepcopy(config_file_or_dict)
elif isinstance(config_file_or_dict, str):
with io.open(config_file_or_dict, "r", encoding="utf-8") as f:
config = json.load(f)
else:
raise ValueError("expecting either a path to a config file or a pre-populated dict")
self.config = config
# DeepSpeed does:
# train_batch_size = world_size * train_micro_batch_size_per_gpu * gradient_accumulation_steps
train_batch_size = args.world_size * args.per_device_train_batch_size * args.gradient_accumulation_steps
_set_if_auto(config, "train_micro_batch_size_per_gpu", args.per_device_train_batch_size)
_set_if_auto(config, "gradient_accumulation_steps", args.gradient_accumulation_steps)
_set_if_auto(config, "train_batch_size", train_batch_size)
_set_if_auto(config, "gradient_clipping", args.max_grad_norm)
# zero
config_zero = config.get("zero_optimization", {})
self.stage = config_zero.get("stage", 0)
config_optim = config.get("optimizer", {})
if config_optim != {}:
config_optim_params = config_optim.get("params")
_set_if_auto(config_optim_params, "lr", args.learning_rate)
_set_if_auto(config_optim_params, "betas", [args.adam_beta1, args.adam_beta2])
_set_if_auto(config_optim_params, "eps", args.adam_epsilon)
_set_if_auto(config_optim_params, "weight_decay", args.weight_decay)
config_sched = config.get("scheduler", {})
if config_sched != {}:
config_sched_params = config_sched.get("params")
_set_if_auto(config_sched_params, "warmup_min_lr", 0)
_set_if_auto(config_sched_params, "warmup_max_lr", args.learning_rate)
_set_if_auto(config_sched_params, "warmup_num_steps", args.warmup_steps)
# total_num_steps - will get set in deepspeed_init
# fp16
if args.fp16:
fp16_backend = "apex" if args.fp16_backend == "apex" else "amp"
else:
fp16_backend = None
# amp: similar to the pytorch native amp - it has a bunch of optional params but we won't set
# any here unless the user did the work
config_fp16 = config.get("fp16")
# XXX: at the moment fp16 can't be False, but the fp32 solution is in the works - once it's PR'ed and
# merged and a new release is made, delete the next line and uncomment the one after it
_set_if_auto(config_fp16, "enabled", True)
# _set_if_auto(config_fp16, "enabled", fp16_backend == "amp")
# apex: delegates amp work to apex (which needs to be available), but it cannot be used with any
# ZeRO features, so probably best to be avoided.
config_amp = config.get("amp")
_set_if_auto(config_amp, "enabled", fp16_backend == "apex")
_set_if_auto(config_amp, "opt_level", args.fp16_opt_level)
config_zero = config.get("zero_optimization", {})
if self.is_zero2():
self.offload = _is_true(config_zero, "cpu_offload")
elif self.is_zero3():
offload_devices = ["cpu", "nvme"]
if config_zero.get("offload_optimizer", {}).get("device") in offload_devices:
self.offload = True
if config_zero.get("offload_param", {}).get("device") in offload_devices:
self.offload = True
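# Editor's illustration (schematic configs, not part of this change) of the offload detection above:
#
#     {"zero_optimization": {"stage": 2, "cpu_offload": True}}                      # ZeRO-2: boolean flag
#     {"zero_optimization": {"stage": 3, "offload_optimizer": {"device": "nvme"}}}  # ZeRO-3: device name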
def config_finalize(self, args, model, num_training_steps):
"""
This stage is run after we have the model and know num_training_steps.
Now we can complete the configuration process.
"""
config = self.config
# zero
config_zero = config.get("zero_optimization", {})
if self.is_zero3():
# automatically assign the optimal config values based on model config
hidden_size = model.config.hidden_size
_set_if_auto(config_zero, "reduce_bucket_size", hidden_size * hidden_size)
_set_if_auto(config_zero, "stage3_prefetch_bucket_size", 0.9 * hidden_size * hidden_size)
_set_if_auto(config_zero, "stage3_param_persistence_threshold", 10 * hidden_size)
# scheduler
config_sched = config.get("scheduler", {})
config_sched_params = config_sched.get("params", {})
_set_if_auto(config_sched_params, "total_num_steps", num_training_steps)
# keep the config object global to be able to access it anywhere during TrainingArguments life-cycle
_deepspeed_config_hf_weak_ref = None
def deepspeed_config_hf_set(deepspeed_config_hf_obj):
# this is a special weakref global object to allow us to get to Deepspeed config from APIs
# that don't have an easy way to get to the Deepspeed config outside of the Trainer domain.
global _deepspeed_config_hf_weak_ref
# will go away automatically when DeepSpeedConfigHF is destroyed (when TrainingArguments is destroyed)
_deepspeed_config_hf_weak_ref = weakref.ref(deepspeed_config_hf_obj)
def is_deepspeed_zero3_enabled():
if _deepspeed_config_hf_weak_ref is not None and _deepspeed_config_hf_weak_ref() is not None:
return _deepspeed_config_hf_weak_ref().is_zero3()
else: else:
raise ValueError("expecting either a path to a config file or a pre-populated dict") return False
return config def deepspeed_config():
if _deepspeed_config_hf_weak_ref is not None and _deepspeed_config_hf_weak_ref() is not None:
return _deepspeed_config_hf_weak_ref().config
else:
return None
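# Editor's usage sketch (assumption: a ZeRO-3 TrainingArguments object is alive; not part of this
# change): these accessors let code far from the Trainer, e.g. ``from_pretrained``, query the config:
#
#     if is_deepspeed_zero3_enabled():
#         ds_config = deepspeed_config()  # plain dict, or None once the weakref target is gone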
def deepspeed_init(trainer, num_training_steps, resume_from_checkpoint=None): def deepspeed_init(trainer, num_training_steps, resume_from_checkpoint=None):
...@@ -355,41 +461,16 @@ def deepspeed_init(trainer, num_training_steps, resume_from_checkpoint=None): ...@@ -355,41 +461,16 @@ def deepspeed_init(trainer, num_training_steps, resume_from_checkpoint=None):
""" """
import deepspeed import deepspeed
args = trainer.args
model = trainer.model model = trainer.model
config = deepspeed_parse_config(args.deepspeed) deepspeed_config_hf = trainer.args.deepspeed_config_hf
deepspeed_config_hf.config_finalize(trainer.args, model, num_training_steps)
# The following code translates relevant trainer's cl args into the DS config # resume config update - some bits like `model` and `num_training_steps` only become available during train
config = deepspeed_config_hf.config
# First to ensure that there is no mismatch between cl args values and presets in the config
# file, ask to not set in ds config file:
# - "train_batch_size",
# - "train_micro_batch_size_per_gpu",
# - "gradient_accumulation_steps"
bs_keys = ["train_batch_size", "train_micro_batch_size_per_gpu"]
if len([x for x in bs_keys if x in config.keys()]):
raise ValueError(
f"Do not include {bs_keys} entries in the ds config file, as they will be set via --per_device_train_batch_size or its default"
)
if "gradient_accumulation_steps" in config.keys():
raise ValueError(
"Do not include gradient_accumulation_steps entries in the ds config file, as they will be set via --gradient_accumulation_steps or its default"
)
# DeepSpeed does:
# train_batch_size = n_gpus * train_micro_batch_size_per_gpu * gradient_accumulation_steps
# therefore we just need to set:
config["train_micro_batch_size_per_gpu"] = args.per_device_train_batch_size
config["gradient_accumulation_steps"] = args.gradient_accumulation_steps
if "gradient_clipping" in config:
logger.info("Keeping the `gradient_clipping` config intact, ignoring any gradient clipping-specific cl args")
else: # override only if the ds config doesn't already have this section
config["gradient_clipping"] = args.max_grad_norm
# Optimizer + Scheduler # Optimizer + Scheduler
# Currently support combos: # Currently supported combos:
# 1. DS scheduler + DS optimizer: Yes # 1. DS scheduler + DS optimizer: Yes
# 2. HF scheduler + HF optimizer: Yes # 2. HF scheduler + HF optimizer: Yes
# 3. DS scheduler + HF optimizer: Yes # 3. DS scheduler + HF optimizer: Yes
...@@ -402,36 +483,16 @@ def deepspeed_init(trainer, num_training_steps, resume_from_checkpoint=None): ...@@ -402,36 +483,16 @@ def deepspeed_init(trainer, num_training_steps, resume_from_checkpoint=None):
# 4. HF scheduler + DS optimizer: No # 4. HF scheduler + DS optimizer: No
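# Editor's illustration (schematic configs, not part of this change) of how the combos map onto the
# presence of the two config sections:
#
#     {"optimizer": {...}, "scheduler": {...}}   # 1. DS optimizer + DS scheduler
#     {}                                         # 2. HF optimizer + HF scheduler
#     {"scheduler": {...}}                       # 3. HF optimizer + DS scheduler
#     {"optimizer": {...}}                       # 4. DS optimizer + HF scheduler -> not supported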
optimizer = None optimizer = None
if "optimizer" in config: if "optimizer" not in config:
logger.info("Updating the `scheduler` config with other command line arguments") if deepspeed_config_hf.is_offload():
# to avoid inconsistent values of lr and warm up steps the command line args override config
params = dict(
lr=args.learning_rate,
betas=[args.adam_beta1, args.adam_beta2],
eps=args.adam_epsilon,
weight_decay=args.weight_decay,
)
for k, v in params.items():
if k in config["optimizer"]["params"]:
logger.info(f"setting optimizer.params.{k} to {v}")
config["optimizer"]["params"][k] = v
else: # override only if the ds config doesn't already have this section
if (
"zero_optimization" in config
and "cpu_offload" in config["zero_optimization"]
and config["zero_optimization"]["cpu_offload"] is True
):
raise ValueError("ZeRO Offload can only work with DeepSpeed optimizers") raise ValueError("ZeRO Offload can only work with DeepSpeed optimizers")
else:
# ds supports Adam, OneBitAdam, and Lamb optimizers and can import other optimizers from torch. # ds supports Adam, OneBitAdam, and Lamb optimizers and can import other optimizers from torch.
# But trainer uses AdamW by default. # But trainer uses AdamW by default.
# To use other optimizers so using a different scheduler requires voiding warranty with: `zero_allow_untested_optimizer` trainer.create_optimizer()
trainer.create_optimizer() optimizer = trainer.optimizer
optimizer = trainer.optimizer # To use other optimizers requires voiding warranty with: `zero_allow_untested_optimizer`
# flag that this is non-native optimizer config["zero_allow_untested_optimizer"] = True
config["zero_allow_untested_optimizer"] = True
# DS schedulers (deepspeed/runtime/lr_schedules.py): # DS schedulers (deepspeed/runtime/lr_schedules.py):
# #
...@@ -442,25 +503,7 @@ def deepspeed_init(trainer, num_training_steps, resume_from_checkpoint=None): ...@@ -442,25 +503,7 @@ def deepspeed_init(trainer, num_training_steps, resume_from_checkpoint=None):
# WarmupLR | constant_with_warmup | get_constant_schedule_with_warmup | w/ warmup_min_lr=0 # WarmupLR | constant_with_warmup | get_constant_schedule_with_warmup | w/ warmup_min_lr=0
# WarmupDecayLR| linear | get_linear_schedule_with_warmup | # WarmupDecayLR| linear | get_linear_schedule_with_warmup |
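# Editor's example (hypothetical values, not part of this change): a DS scheduler section equivalent
# to HF's linear schedule with warmup could look like:
#
#     scheduler = {"type": "WarmupDecayLR",
#                  "params": {"warmup_min_lr": 0, "warmup_max_lr": 3e-5,
#                             "warmup_num_steps": 500, "total_num_steps": "auto"}}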
lr_scheduler = None lr_scheduler = None
if "scheduler" in config: if "scheduler" not in config:
logger.info("Updating the `scheduler` config with other command line arguments")
# the user won't easily know the correct num_training_steps should they use WarmupDecayLR,
# so let's set it to the correct value
if config["scheduler"]["type"] == "WarmupDecayLR":
logger.info(f"setting scheduler.params.total_num_steps to {num_training_steps}")
config["scheduler"]["params"]["total_num_steps"] = num_training_steps
# to avoid inconsistent values of lr and warmup steps the command line args override config
params = dict(
warmup_max_lr=args.learning_rate,
warmup_num_steps=args.warmup_steps,
)
for k, v in params.items():
if k in config["scheduler"]["params"]:
logger.info(f"setting scheduler.params.{k} to {v}")
config["scheduler"]["params"][k] = v
else: # override only if the ds config doesn't already have this section
if "optimizer" in config: if "optimizer" in config:
# to make this option work, we need to init DS optimizer first, then init HF scheduler, # to make this option work, we need to init DS optimizer first, then init HF scheduler,
# then pass the HF scheduler to DS init, which is not possible at the moment # then pass the HF scheduler to DS init, which is not possible at the moment
...@@ -469,43 +512,6 @@ def deepspeed_init(trainer, num_training_steps, resume_from_checkpoint=None): ...@@ -469,43 +512,6 @@ def deepspeed_init(trainer, num_training_steps, resume_from_checkpoint=None):
trainer.create_scheduler(num_training_steps=num_training_steps) trainer.create_scheduler(num_training_steps=num_training_steps)
lr_scheduler = trainer.lr_scheduler lr_scheduler = trainer.lr_scheduler
# fp16
if trainer.fp16_backend is not None:
# Deepspeed has 2 possible fp16 config entries:
# - `fp16`: for the native amp - it has a bunch of optional params but we won't set any here unless the user did the work
# - `amp`: which delegates amp work to apex (which needs to be available), but it cannot be used with any ZeRO features, so probably best to be avoided.
if trainer.fp16_backend == "apex":
if "amp" in config:
logger.info("Keeping the `amp` config intact, ignoring any amp-specific cl args")
else:
config["amp"] = {
"enabled": True,
"opt_level": args.fp16_opt_level,
}
elif trainer.fp16_backend == "amp":
if "fp16" in config:
logger.info("Keeping the `fp16` config intact, ignoring any fp16-specific cl args")
else:
config["fp16"] = {
"enabled": True,
}
# zero
if "zero_optimization" in config:
zero = config["zero_optimization"]
# now we know for sure if zero3 is enabled
deepspeed_zero3_enable(zero.get("stage") == 3)
# automatically assign the optimal config values based on model config
hidden_size = model.config.hidden_size
if zero.get("reduce_bucket_size") == 0:
zero["reduce_bucket_size"] = hidden_size * hidden_size
if zero.get("stage3_prefetch_bucket_size") == 0:
zero["stage3_prefetch_bucket_size"] = 0.9 * hidden_size * hidden_size
if zero.get("stage3_param_persistence_threshold") == 0:
zero["stage3_param_persistence_threshold"] = 10 * hidden_size
# keep for quick debug: # keep for quick debug:
# from pprint import pprint; pprint(config) # from pprint import pprint; pprint(config)
......
...@@ -1122,7 +1122,11 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix ...@@ -1122,7 +1122,11 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
import deepspeed import deepspeed
logger.info("Detected DeepSpeed ZeRO-3: activating zero.init() for this model") logger.info("Detected DeepSpeed ZeRO-3: activating zero.init() for this model")
# this immediately partitions the model to avoid the overhead in time and memory copying it on CPU or each GPU first # this immediately partitions the model across all gpus, to avoid the overhead in time
# and memory copying it on CPU or each GPU first
# XXX: param_dict will be added in deepspeed==0.3.16 and probably replaced by deepspeed_config
# with deepspeed.zero.Init(param_dict=deepspeed_config()):
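# Editor's note (hypothetical usage, not part of this change): nothing changes for the caller here -
# a plain `model = AutoModel.from_pretrained("t5-small")` issued while a ZeRO-3 TrainingArguments
# object is alive goes through this `zero.Init()` branch and is partitioned on creation.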
with deepspeed.zero.Init(): with deepspeed.zero.Init():
model = cls(config, *model_args, **model_kwargs) model = cls(config, *model_args, **model_kwargs)
else: else:
......
...@@ -70,9 +70,6 @@ class TrainingArguments: ...@@ -70,9 +70,6 @@ class TrainingArguments:
<https://docs.python.org/3/library/argparse.html#module-argparse>`__ arguments that can be specified on the command <https://docs.python.org/3/library/argparse.html#module-argparse>`__ arguments that can be specified on the command
line. line.
Parameters: Parameters:
output_dir (:obj:`str`): output_dir (:obj:`str`):
The output directory where the model predictions and checkpoints will be written. The output directory where the model predictions and checkpoints will be written.
...@@ -625,6 +622,14 @@ class TrainingArguments: ...@@ -625,6 +622,14 @@ class TrainingArguments:
elif ShardedDDPOption.ZERO_DP_2 in self.sharded_ddp and ShardedDDPOption.ZERO_DP_3 in self.sharded_ddp: elif ShardedDDPOption.ZERO_DP_2 in self.sharded_ddp and ShardedDDPOption.ZERO_DP_3 in self.sharded_ddp:
raise ValueError("`--sharded_ddp zero_dp_2` is not compatible with `--sharded_ddp zero_dp_3`.") raise ValueError("`--sharded_ddp zero_dp_2` is not compatible with `--sharded_ddp zero_dp_3`.")
if self.deepspeed:
# - must be run very last in arg parsing, since it will use a lot of these settings.
# - must be run before the model is created.
from transformers.integrations import DeepSpeedConfigHF
# will be used later by the Trainer (leave self.deepspeed unmodified in case a user relies on it not being modified)
self.deepspeed_config_hf = DeepSpeedConfigHF(self)
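# Editor's sketch (hypothetical file name, not part of this change): constructing the arguments is
# now enough to parse the DS config and pre-fill its "auto" values:
#
#     args = TrainingArguments(output_dir="out", deepspeed="ds_config_zero3.json", fp16=True)
#     args.deepspeed_config_hf.is_zero3()  # True if the file sets zero_optimization.stage to 3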
def __repr__(self): def __repr__(self):
# We override the default repr to remove deprecated arguments from the repr. This method should be removed once # We override the default repr to remove deprecated arguments from the repr. This method should be removed once
# those deprecated arguments are removed from TrainingArguments. (TODO: v5) # those deprecated arguments are removed from TrainingArguments. (TODO: v5)
......
{ {
"fp16": { "fp16": {
"enabled": true, "enabled": "auto",
"loss_scale": 0, "loss_scale": 0,
"loss_scale_window": 1000, "loss_scale_window": 1000,
"initial_scale_power": 16, "initial_scale_power": 16,
...@@ -8,36 +8,40 @@ ...@@ -8,36 +8,40 @@
"min_loss_scale": 1 "min_loss_scale": 1
}, },
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true,
"cpu_offload": true
},
"optimizer": { "optimizer": {
"type": "AdamW", "type": "AdamW",
"params": { "params": {
"lr": 3e-5, "lr": "auto",
"betas": [0.8, 0.999], "betas": "auto",
"eps": 1e-8, "eps": "auto",
"weight_decay": 3e-7 "weight_decay": "auto"
} }
}, },
"scheduler": { "scheduler": {
"type": "WarmupLR", "type": "WarmupLR",
"params": { "params": {
"warmup_min_lr": 0, "warmup_min_lr": "auto",
"warmup_max_lr": 3e-5, "warmup_max_lr": "auto",
"warmup_num_steps": 500 "warmup_num_steps": "auto"
} }
}, },
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true,
"cpu_offload": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000, "steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false "wall_clock_breakdown": false
} }
{ {
"fp16": { "fp16": {
"enabled": true, "enabled": "auto",
"loss_scale": 0, "loss_scale": 0,
"loss_scale_window": 1000, "loss_scale_window": 1000,
"initial_scale_power": 16, "initial_scale_power": 16,
...@@ -8,41 +8,50 @@ ...@@ -8,41 +8,50 @@
"min_loss_scale": 1 "min_loss_scale": 1
}, },
"zero_optimization": {
"stage": 3,
"cpu_offload": true,
"cpu_offload_params": true,
"cpu_offload_use_pin_memory" : true,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e14,
"reduce_bucket_size": 0,
"stage3_prefetch_bucket_size": 0,
"stage3_param_persistence_threshold": 0,
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_fp16_weights_on_model_save": true
},
"optimizer": { "optimizer": {
"type": "AdamW", "type": "AdamW",
"params": { "params": {
"lr": 3e-5, "lr": "auto",
"betas": [0.8, 0.999], "betas": "auto",
"eps": 1e-8, "eps": "auto",
"weight_decay": 3e-7 "weight_decay": "auto"
} }
}, },
"scheduler": { "scheduler": {
"type": "WarmupLR", "type": "WarmupLR",
"params": { "params": {
"warmup_min_lr": 0, "warmup_min_lr": "auto",
"warmup_max_lr": 3e-5, "warmup_max_lr": "auto",
"warmup_num_steps": 500 "warmup_num_steps": "auto"
} }
}, },
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e14,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_fp16_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000, "steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false "wall_clock_breakdown": false
} }
...@@ -42,7 +42,7 @@ with ExtendSysPath(f"{bindir}/.."): ...@@ -42,7 +42,7 @@ with ExtendSysPath(f"{bindir}/.."):
from test_trainer import TrainerIntegrationCommon # noqa from test_trainer import TrainerIntegrationCommon # noqa
if is_torch_available(): if is_torch_available():
from test_trainer import get_regression_trainer # noqa from test_trainer import RegressionModelConfig, RegressionPreTrainedModel, get_regression_trainer # noqa
set_seed(42) set_seed(42)
...@@ -66,6 +66,10 @@ def require_deepspeed(test_case): ...@@ -66,6 +66,10 @@ def require_deepspeed(test_case):
return test_case return test_case
if is_deepspeed_available():
from deepspeed.utils import logger as deepspeed_logger # noqa
from transformers.integrations import deepspeed_config, is_deepspeed_zero3_enabled # noqa
ZERO2 = "zero2" ZERO2 = "zero2"
ZERO3 = "zero3" ZERO3 = "zero3"
stages = [ZERO2, ZERO3] stages = [ZERO2, ZERO3]
...@@ -115,12 +119,6 @@ class TrainerIntegrationDeepSpeed(TestCasePlus, TrainerIntegrationCommon): ...@@ -115,12 +119,6 @@ class TrainerIntegrationDeepSpeed(TestCasePlus, TrainerIntegrationCommon):
with io.open(self.ds_config_file[ZERO3], "r", encoding="utf-8") as f: with io.open(self.ds_config_file[ZERO3], "r", encoding="utf-8") as f:
self.ds_config_dict[ZERO3] = json.load(f) self.ds_config_dict[ZERO3] = json.load(f)
def tearDown(self):
# XXX: Fixme - this is a temporary band-aid since this global variable impacts other tests
import transformers
transformers.integrations._is_deepspeed_zero3_enabled = None
def get_config_dict(self, stage): def get_config_dict(self, stage):
"""As the tests modify the dict, always make a copy""" """As the tests modify the dict, always make a copy"""
config = deepcopy(self.ds_config_dict[stage]) config = deepcopy(self.ds_config_dict[stage])
...@@ -173,25 +171,65 @@ class TrainerIntegrationDeepSpeed(TestCasePlus, TrainerIntegrationCommon): ...@@ -173,25 +171,65 @@ class TrainerIntegrationDeepSpeed(TestCasePlus, TrainerIntegrationCommon):
trainer = get_regression_trainer(local_rank=0, deepspeed=ds_config_zero2_dict) trainer = get_regression_trainer(local_rank=0, deepspeed=ds_config_zero2_dict)
with self.assertRaises(Exception) as context: with self.assertRaises(Exception) as context:
trainer.train() trainer.train()
self.assertTrue("HF scheduler + DeepSpeed optimizer combination is not possible" in str(context.exception)) self.assertTrue(
"HF scheduler + DeepSpeed optimizer combination is not possible" in str(context.exception),
f"got exception: {context.exception}",
)
def test_stage3_nvme_offload(self):
with mockenv_context(**self.dist_env_1_gpu):
# this actually doesn't have to be on NVMe, any storage will do since this test only
# runs a simple check that we can use some directory as if it were NVMe
nvme_path = self.get_auto_remove_tmp_dir()
nvme_config = dict(device="nvme", nvme_path=nvme_path)
ds_config_zero3_dict = self.get_config_dict(ZERO3)
ds_config_zero3_dict["zero_optimization"]["offload_optimizer"] = nvme_config
ds_config_zero3_dict["zero_optimization"]["offload_param"] = nvme_config
trainer = get_regression_trainer(local_rank=0, deepspeed=ds_config_zero3_dict)
with CaptureLogger(deepspeed_logger) as cs:
trainer.train()
self.assertIn("DeepSpeed info", cs.out, "expected DeepSpeed logger output but got none")
# --- These tests need to run on both zero stages --- #
@parameterized.expand(stages)
def test_fp32(self, stage):
ds_config_dict = self.get_config_dict(stage)
ds_config_dict["fp16"]["enabled"] = False # force non-fp16 mode
# XXX: do we go via from_pretrained in zero 3 here? need to test zero.Init(dtype=torch.float)
# XXX: rewrite this test once fp32 is supported by DeepSpeed
with mockenv_context(**self.dist_env_1_gpu):
trainer = get_regression_trainer(local_rank=0, deepspeed=ds_config_dict)
with self.assertRaises(Exception) as context:
trainer.train()
self.assertIn(
"ZeRO is only supported if fp16 is enabled",
str(context.exception),
f"got exception: {context.exception}",
)
def test_hf_optimizer_with_offload(self): @parameterized.expand(stages)
def test_hf_optimizer_with_offload(self, stage):
# must not allow non-DS optimizer when using ZERO-offload # must not allow non-DS optimizer when using ZERO-offload
ds_config_dict = self.get_config_dict(stage)
del ds_config_dict["optimizer"] # force default HF Trainer optimizer
# force cpu offload
if stage == "stage2":
ds_config_dict["zero_optimization"]["cpu_offload"] = True
elif stage == "stage3":
ds_config_dict["zero_optimization"]["offload_optimizer"]["device"] = "cpu"
with mockenv_context(**self.dist_env_1_gpu): with mockenv_context(**self.dist_env_1_gpu):
ds_config_zero2_dict = self.get_config_dict(ZERO2) trainer = get_regression_trainer(local_rank=0, deepspeed=ds_config_dict)
del ds_config_zero2_dict["optimizer"] # force default HF Trainer optimizer
ds_config_zero2_dict["zero_optimization"]["cpu_offload"] = True
# sanity check - should the default config change
assert (
"cpu_offload" in ds_config_zero2_dict["zero_optimization"]
and ds_config_zero2_dict["zero_optimization"]["cpu_offload"] is True
), "ensure the config is set up correctly"
trainer = get_regression_trainer(local_rank=0, deepspeed=ds_config_zero2_dict)
with self.assertRaises(Exception) as context: with self.assertRaises(Exception) as context:
trainer.train() trainer.train()
self.assertTrue("ZeRO Offload can only work with DeepSpeed optimizers" in str(context.exception)) self.assertIn(
"ZeRO Offload can only work with DeepSpeed optimizers",
str(context.exception),
f"got exception: {context.exception}",
)
# --- These tests need to run on both zero stages --- #
@parameterized.expand(stages) @parameterized.expand(stages)
def test_fake_notebook_no_launcher(self, stage): def test_fake_notebook_no_launcher(self, stage):
# this setup emulates a notebook where a launcher needs to be emulated by hand # this setup emulates a notebook where a launcher needs to be emulated by hand
...@@ -199,14 +237,12 @@ class TrainerIntegrationDeepSpeed(TestCasePlus, TrainerIntegrationCommon): ...@@ -199,14 +237,12 @@ class TrainerIntegrationDeepSpeed(TestCasePlus, TrainerIntegrationCommon):
# note that unittest resets sys.stdout each test, so `CaptureStd` will work here to capture # note that unittest resets sys.stdout each test, so `CaptureStd` will work here to capture
# DeepSpeed log if this test happens to run first in this pytest worker. But it will fail if # DeepSpeed log if this test happens to run first in this pytest worker. But it will fail if
# it's run not as a first test as `sys.stdout` will no longer be the same. So we either have # it's run not as a first test as `sys.stdout` will no longer be the same. So we either have
# to reset `logger.handlers[0].setStream(sys.stdout)` or directly capture from the logger. # to reset `deepspeed_logger.handlers[0].setStream(sys.stdout)` or directly capture from the deepspeed_logger.
from deepspeed.utils import logger with mockenv_context(**self.dist_env_1_gpu):
trainer = get_regression_trainer(local_rank=0, deepspeed=self.ds_config_file[stage])
with CaptureLogger(logger) as cs: with CaptureLogger(deepspeed_logger) as cs:
with mockenv_context(**self.dist_env_1_gpu):
trainer = get_regression_trainer(local_rank=0, deepspeed=self.ds_config_file[stage])
trainer.train() trainer.train()
assert "DeepSpeed info" in cs.out, "expected DeepSpeed logger output but got none" self.assertIn("DeepSpeed info", cs.out, "expected DeepSpeed logger output but got none")
@parameterized.expand(stages) @parameterized.expand(stages)
def test_early_get_last_lr(self, stage): def test_early_get_last_lr(self, stage):
...@@ -425,6 +461,38 @@ class TrainerIntegrationDeepSpeed(TestCasePlus, TrainerIntegrationCommon): ...@@ -425,6 +461,38 @@ class TrainerIntegrationDeepSpeed(TestCasePlus, TrainerIntegrationCommon):
self.assertEqual(b, b1) self.assertEqual(b, b1)
self.check_trainer_state_are_the_same(state, state1) self.check_trainer_state_are_the_same(state, state1)
def test_config_object(self):
# test that we can switch from zero2 to zero3 in the same process for example
# test is_zero, etc.
output_dir = self.get_auto_remove_tmp_dir()
kwargs = dict(output_dir=output_dir, train_len=8)
with mockenv_context(**self.dist_env_1_gpu):
ds_config_zero3_dict = self.get_config_dict("zero3")
ds_config_zero2_dict = self.get_config_dict("zero2")
trainer = get_regression_trainer(deepspeed=ds_config_zero3_dict, **kwargs)
self.assertTrue(is_deepspeed_zero3_enabled())
# test we can repeat that and with train this time
trainer = get_regression_trainer(deepspeed=ds_config_zero3_dict, **kwargs)
trainer.train()
self.assertTrue(is_deepspeed_zero3_enabled())
# test zero3 is disabled
trainer = get_regression_trainer(deepspeed=ds_config_zero2_dict, **kwargs)
self.assertFalse(is_deepspeed_zero3_enabled())
# check config obj
config = deepspeed_config()
self.assertTrue(bool(config), "Deepspeed config should be accessible")
del trainer
# now that the trainer (and its config) is gone, the weakref should be dead and we shouldn't get anything here
config = deepspeed_config()
self.assertFalse(is_deepspeed_zero3_enabled())
self.assertFalse(bool(config), "Deepspeed config should not be accessible")
@slow @slow
@require_deepspeed @require_deepspeed
...@@ -557,6 +625,7 @@ class TestDeepSpeedWithLauncher(TestCasePlus): ...@@ -557,6 +625,7 @@ class TestDeepSpeedWithLauncher(TestCasePlus):
--adafactor --adafactor
--source_lang en --source_lang en
--target_lang ro --target_lang ro
--report_to none
""".split() """.split()
args.extend(["--source_prefix", '"translate English to Romanian: "']) args.extend(["--source_prefix", '"translate English to Romanian: "'])
...@@ -626,6 +695,7 @@ class TestDeepSpeedWithLauncher(TestCasePlus): ...@@ -626,6 +695,7 @@ class TestDeepSpeedWithLauncher(TestCasePlus):
--num_train_epochs 1 --num_train_epochs 1
--warmup_steps 8 --warmup_steps 8
--block_size 128 --block_size 128
--report_to none
""".split() """.split()
ds_args = f"--deepspeed {self.test_file_dir_str}/ds_config_{stage}.json".split() ds_args = f"--deepspeed {self.test_file_dir_str}/ds_config_{stage}.json".split()
......
...@@ -213,16 +213,21 @@ if is_torch_available(): ...@@ -213,16 +213,21 @@ if is_torch_available():
label_names = kwargs.get("label_names", None) label_names = kwargs.get("label_names", None)
train_dataset = RegressionDataset(length=train_len, label_names=label_names) train_dataset = RegressionDataset(length=train_len, label_names=label_names)
eval_dataset = RegressionDataset(length=eval_len, label_names=label_names) eval_dataset = RegressionDataset(length=eval_len, label_names=label_names)
if pretrained:
config = RegressionModelConfig(a=a, b=b, double_output=double_output) model_init = kwargs.pop("model_init", None)
model = RegressionPreTrainedModel(config) if model_init is not None:
model = None
else: else:
model = RegressionModel(a=a, b=b, double_output=double_output) if pretrained:
config = RegressionModelConfig(a=a, b=b, double_output=double_output)
model = RegressionPreTrainedModel(config)
else:
model = RegressionModel(a=a, b=b, double_output=double_output)
compute_metrics = kwargs.pop("compute_metrics", None) compute_metrics = kwargs.pop("compute_metrics", None)
data_collator = kwargs.pop("data_collator", None) data_collator = kwargs.pop("data_collator", None)
optimizers = kwargs.pop("optimizers", (None, None)) optimizers = kwargs.pop("optimizers", (None, None))
output_dir = kwargs.pop("output_dir", "./regression") output_dir = kwargs.pop("output_dir", "./regression")
model_init = kwargs.pop("model_init", None)
args = RegressionTrainingArguments(output_dir, a=a, b=b, **kwargs) args = RegressionTrainingArguments(output_dir, a=a, b=b, **kwargs)
return Trainer( return Trainer(
......