Unverified Commit 78f5fe14 authored by Stas Bekman, committed by GitHub

[Deepspeed] adapt multiple models, add zero_to_fp32 tests (#12477)



* zero_to_fp32 tests

* args change

* remove unnecessary work

* use transformers.trainer_utils.get_last_checkpoint

* document the new features

* cleanup

* wip

* fix fsmt

* add bert

* cleanup

* add xlm-roberta

* electra works

* cleanup

* sync

* split off the model zoo tests

* cleanup

* cleanup

* cleanup

* cleanup

* reformat

* cleanup

* casing

* deepspeed>=0.4.3

* adjust distilbert

* Update docs/source/main_classes/deepspeed.rst
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* style
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent 65bf05cd
@@ -1456,8 +1456,56 @@ won't be possible to load it back.

 While the fp16 weights are fine for resuming training, if you finished finetuning your model and want to upload it to
 the `models hub <https://huggingface.co/models>`__ or pass it to someone else you most likely will want to get the fp32
-weights. This cannot be done during training since this is a process that requires a lot of memory, and therefore this
-is performed offline.
+weights. This ideally shouldn't be done during training since this is a process that requires a lot of memory, and
+therefore best to be performed offline after the training is complete. But if desired and you have plenty of free CPU
+memory it can be done in the same training script. The following sections will discuss both approaches.
+
+**Live FP32 Weights Recovery:**
+
+This approach may not work if your model is large and you have little free CPU memory left at the end of the training.
+
+If you have saved at least one checkpoint, and you want to use the latest one, you can do the following:
+
+.. code-block:: python
+
+    from transformers.trainer_utils import get_last_checkpoint
+    from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
+
+    checkpoint_dir = get_last_checkpoint(trainer.args.output_dir)
+    fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
+
+If you're using the ``--load_best_model_at_end`` :class:`~transformers.TrainingArguments` argument (to track the best
+checkpoint), then you can finish the training by first saving the final model explicitly and then do the same as above:
+
+.. code-block:: python
+
+    import os
+
+    from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
+
+    checkpoint_dir = os.path.join(trainer.args.output_dir, "checkpoint-final")
+    trainer.deepspeed.save_checkpoint(checkpoint_dir)
+    fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
+
+.. note::
+
+    Note, that once ``load_state_dict_from_zero_checkpoint`` was run, the ``model`` will no longer be usable in the
+    DeepSpeed context of the same application, i.e. you will need to re-initialize the deepspeed engine, since
+    ``model.load_state_dict(state_dict)`` will remove all the DeepSpeed magic from it. So do this only at the very end
+    of the training.
+
+Of course, you don't have to use :class:`~transformers.Trainer` and you can adjust the examples above to your own
+trainer.
+
+If for some reason you want more refinement, you can also extract the fp32 ``state_dict`` of the weights and apply
+these yourself as is shown in the following example:
+
+.. code-block:: python
+
+    from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
+
+    state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)  # already on cpu
+    model = model.cpu()
+    model.load_state_dict(state_dict)
+
+**Offline FP32 Weights Recovery:**

 DeepSpeed creates a special conversion script ``zero_to_fp32.py`` which it places in the top-level of the checkpoint
 folder. Using this script you can extract the weights at any point. The script is standalone and you no longer need to

@@ -1486,15 +1534,16 @@ weights just run:

 .. code-block:: bash

-    python zero_to_fp32.py global_step1 pytorch_model.bin
+    python zero_to_fp32.py . pytorch_model.bin
+
+This is it. ``pytorch_model.bin`` will now contain the full fp32 model weights consolidated from multiple GPUs.

-The script will automatically handle either ZeRO-2 or ZeRO-3 checkpoint.
+The script will automatically be able to handle either a ZeRO-2 or ZeRO-3 checkpoint.

 ``python zero_to_fp32.py -h`` will give you usage details.

-If you have multiple DeepSpeed checkpoint sub-folders, pick the one you know to have the desired weights.
+The script will auto-discover the deepspeed sub-folder using the contents of the file ``latest``, which in the current
+example will contain ``global_step1``.

-This is it. ``pytorch_model.bin`` will now contain the full fp32 model weights consolidated from multiple GPUs.

 Note: currently the script requires 2x general RAM of the final fp32 model weights.
......
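Putting the two recovery flows together, here is a minimal sketch of consolidating a saved ZeRO checkpoint offline and then loading the recovered fp32 weights back into a plain model. The checkpoint path, the output file name and the ``model`` variable are illustrative placeholders, not part of the changes above:

.. code-block:: python

    import subprocess

    import torch

    # any checkpoint folder saved by the Trainer during a DeepSpeed run (illustrative path)
    checkpoint_dir = "output_dir/checkpoint-100"

    # run the conversion script that DeepSpeed drops into every checkpoint folder
    subprocess.check_call(
        ["python", f"{checkpoint_dir}/zero_to_fp32.py", checkpoint_dir, f"{checkpoint_dir}/pytorch_model.bin"]
    )

    # the result is a regular state_dict that a plain fp32 copy of the model can load
    state_dict = torch.load(f"{checkpoint_dir}/pytorch_model.bin", map_location="cpu")
    model.load_state_dict(state_dict)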
@@ -91,7 +91,7 @@ _deps = [
     "cookiecutter==1.7.2",
     "dataclasses",
     "datasets",
-    "deepspeed>=0.4.0",
+    "deepspeed>=0.4.3",
     "docutils==0.16.0",
     "fairscale>0.3",
     "faiss-cpu",
......
@@ -8,7 +8,7 @@ deps = {
     "cookiecutter": "cookiecutter==1.7.2",
     "dataclasses": "dataclasses",
     "datasets": "datasets",
-    "deepspeed": "deepspeed>=0.4.0",
+    "deepspeed": "deepspeed>=0.4.3",
     "docutils": "docutils==0.16.0",
     "fairscale": "fairscale>0.3",
     "faiss-cpu": "faiss-cpu",
......
@@ -819,6 +819,14 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
         if new_num_tokens is None:
             return old_lm_head

+        if is_deepspeed_zero3_enabled():
+            import deepspeed
+
+            with deepspeed.zero.GatheredParameters(old_lm_head.weight, modifier_rank=None):
+                old_num_tokens, old_lm_head_dim = (
+                    old_lm_head.weight.size() if not transposed else old_lm_head.weight.t().size()
+                )
+        else:
             old_num_tokens, old_lm_head_dim = (
                 old_lm_head.weight.size() if not transposed else old_lm_head.weight.t().size()
             )

@@ -829,7 +837,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
         if not isinstance(old_lm_head, nn.Linear):
             raise TypeError(
                 f"Old language model head is of type {type(old_lm_head)}, which is not an instance of {nn.Linear}."
-                f"You should either use a different resize function or make sure that `old_embeddings` are an instance of {nn.Linear}."
+                f"You should either use a different resize function or make sure that `old_lm_head` are an instance of {nn.Linear}."
             )

         # Build new lm head

@@ -842,6 +850,26 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
         num_tokens_to_copy = min(old_num_tokens, new_num_tokens)

+        # XXX: put the long block of code in a wrapper
+        if is_deepspeed_zero3_enabled():
+            import deepspeed
+
+            with deepspeed.zero.GatheredParameters(old_lm_head.weight, modifier_rank=0):
+                if torch.distributed.get_rank() == 0:
+                    # Copy old lm head weights to new lm head
+                    if not transposed:
+                        new_lm_head.weight.data[:num_tokens_to_copy, :] = old_lm_head.weight.data[
+                            :num_tokens_to_copy, :
+                        ]
+                    else:
+                        new_lm_head.weight.data[:, :num_tokens_to_copy] = old_lm_head.weight.data[
+                            :, :num_tokens_to_copy
+                        ]
+
+                    # Copy bias weights to new lm head
+                    if has_new_lm_head_bias:
+                        new_lm_head.bias.data[:num_tokens_to_copy] = old_lm_head.bias.data[:num_tokens_to_copy]
+        else:
             # Copy old lm head weights to new lm head
             if not transposed:
                 new_lm_head.weight.data[:num_tokens_to_copy, :] = old_lm_head.weight.data[:num_tokens_to_copy, :]
......
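The modeling changes above all follow the same ZeRO-3 rule: parameters are sharded across ranks, so any code that touches a weight outside of the forward/backward pass must first gather it with ``deepspeed.zero.GatheredParameters``. Below is a minimal sketch of the read-only variant (``modifier_rank=None``); the function name is illustrative and it assumes it runs inside an already initialized ZeRO-3 setup:

.. code-block:: python

    import deepspeed
    import torch.nn as nn

    def read_full_weight_shape(linear: nn.Linear):
        # under ZeRO-3 each rank only holds a shard, so linear.weight.size() is not the real
        # shape until the parameter is gathered; modifier_rank=None makes the gather read-only,
        # so any modification made inside the block is discarded on exit
        with deepspeed.zero.GatheredParameters(linear.weight, modifier_rank=None):
            return tuple(linear.weight.size())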
@@ -18,7 +18,6 @@
 """

-import copy
 import math

 import numpy as np

@@ -27,6 +26,7 @@ from torch import nn
 from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss

 from ...activations import gelu
+from ...deepspeed import is_deepspeed_zero3_enabled
 from ...file_utils import (
     add_code_sample_docstrings,
     add_start_docstrings,

@@ -85,6 +85,16 @@ class Embeddings(nn.Module):
         self.word_embeddings = nn.Embedding(config.vocab_size, config.dim, padding_idx=config.pad_token_id)
         self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.dim)
         if config.sinusoidal_pos_embds:
+            if is_deepspeed_zero3_enabled():
+                import deepspeed
+
+                with deepspeed.zero.GatheredParameters(self.position_embeddings.weight, modifier_rank=0):
+                    if torch.distributed.get_rank() == 0:
+                        create_sinusoidal_embeddings(
+                            n_pos=config.max_position_embeddings, dim=config.dim, out=self.position_embeddings.weight
+                        )
+            else:
                 create_sinusoidal_embeddings(
                     n_pos=config.max_position_embeddings, dim=config.dim, out=self.position_embeddings.weight
                 )

@@ -274,9 +284,7 @@ class Transformer(nn.Module):
     def __init__(self, config):
         super().__init__()
         self.n_layers = config.n_layers
-
-        layer = TransformerBlock(config)
-        self.layer = nn.ModuleList([copy.deepcopy(layer) for _ in range(config.n_layers)])
+        self.layer = nn.ModuleList([TransformerBlock(config) for _ in range(config.n_layers)])

     def forward(
         self, x, attn_mask=None, head_mask=None, output_attentions=False, output_hidden_states=False, return_dict=None
......
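The DistilBERT change uses the second gather variant: ``modifier_rank=0`` gathers the full parameter, lets rank 0 modify it in place, and re-partitions the updated values to all ranks when the block exits. A minimal sketch of that pattern, with an illustrative function and initializer, assuming a ZeRO-3 run with ``torch.distributed`` initialized:

.. code-block:: python

    import deepspeed
    import torch
    import torch.nn as nn

    def reinit_embedding_under_zero3(embedding: nn.Embedding):
        # rank 0 writes the new values; on exit DeepSpeed broadcasts and re-shards them,
        # so every rank ends up with its slice of the freshly initialized weight
        with deepspeed.zero.GatheredParameters(embedding.weight, modifier_rank=0):
            if torch.distributed.get_rank() == 0:
                nn.init.normal_(embedding.weight, mean=0.0, std=0.02)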
@@ -36,6 +36,7 @@ from torch import Tensor, nn
 from torch.nn import CrossEntropyLoss, LayerNorm

 from ...activations import ACT2FN
+from ...deepspeed import is_deepspeed_zero3_enabled
 from ...file_utils import (
     add_code_sample_docstrings,
     add_end_docstrings,

@@ -658,11 +659,14 @@ class FSMTDecoder(nn.Module):
             [DecoderLayer(config) for _ in range(config.decoder_layers)]
         )  # type: List[DecoderLayer]

-        self.output_projection = nn.Linear(
-            self.embed_tokens.weight.shape[1],
-            self.embed_tokens.weight.shape[0],
-            bias=False,
-        )
+        if is_deepspeed_zero3_enabled():
+            import deepspeed
+
+            with deepspeed.zero.GatheredParameters(self.embed_tokens.weight, modifier_rank=None):
+                embed_tokens_weight_shape = self.embed_tokens.weight.shape
+        else:
+            embed_tokens_weight_shape = self.embed_tokens.weight.shape
+        self.output_projection = nn.Linear(embed_tokens_weight_shape[1], embed_tokens_weight_shape[0], bias=False)
         self.output_projection.weight = self.embed_tokens.weight

     def forward(

@@ -1127,19 +1131,6 @@ class FSMTForConditionalGeneration(PretrainedFSMTModel):
         base_model = FSMTModel(config)
         self.model = base_model

-    def resize_token_embeddings(self, new_num_tokens: int) -> nn.Embedding:
-        new_embeddings = super().resize_token_embeddings(new_num_tokens)
-        self.model.encoder.embed_tokens = new_embeddings
-
-        new_embeddings = super().resize_token_embeddings(new_num_tokens)
-        self.model.decoder.embed_tokens = new_embeddings
-
-        # XXX: this is not quite correct, as we have 2 different `new_embeddings`, and
-        # only one return value is expected. Needs to be redesigned in the core to support dual dicts
-        raise NotImplementedError("this method needs re-thinking for models with 2 separate dictionaries")
-
-        return new_embeddings
-
     @add_start_docstrings_to_model_forward(FSMT_INPUTS_DOCSTRING)
     @replace_return_docstrings(output_type=Seq2SeqLMOutput, config_class=_CONFIG_FOR_DOC)
     @add_end_docstrings(FSMT_GENERATION_EXAMPLE)

@@ -1257,6 +1248,9 @@ class FSMTForConditionalGeneration(PretrainedFSMTModel):
     def get_output_embeddings(self):
         return self.model.decoder.embed_tokens

+    def set_output_embeddings(self, value):
+        self.model.decoder.embed_tokens = value
+

 class SinusoidalPositionalEmbedding(nn.Embedding):
     """
......
@@ -23,9 +23,8 @@ import torch
 import torch.utils.checkpoint
 from torch import nn

-from transformers.deepspeed import is_deepspeed_zero3_enabled
-
 from ...activations import ACT2FN
+from ...deepspeed import is_deepspeed_zero3_enabled
 from ...file_utils import (
     ModelOutput,
     add_start_docstrings,

@@ -853,16 +852,6 @@ class Wav2Vec2PreTrainedModel(PreTrainedModel):
             module.bias.data.zero_()
             module.weight.data.fill_(1.0)
         elif isinstance(module, nn.Conv1d):
-            if is_deepspeed_zero3_enabled():
-                import deepspeed
-
-                if hasattr(module, "weight_v") and hasattr(module, "weight_g"):
-                    with deepspeed.zero.GatheredParameters([module.weight_v, module.weight_g], modifier_rank=0):
-                        nn.init.kaiming_normal_(module.weight.data)
-                else:
-                    with deepspeed.zero.GatheredParameters(module.weight, modifier_rank=0):
-                        nn.init.kaiming_normal_(module.weight.data)
-            else:
             nn.init.kaiming_normal_(module.weight.data)

         if isinstance(module, (nn.Linear, nn.Conv1d)) and module.bias is not None:
......
@@ -799,10 +799,7 @@ class TrainingArguments:
             device = torch.device("cuda", self.local_rank)
             self._n_gpu = 1
         elif self.deepspeed:
-            # deepspeed performs its own DDP internally, and requires the program to be started with:
-            # deepspeed ./program.py
-            # rather than:
-            # python -m torch.distributed.launch --nproc_per_node=2 ./program.py
+            # deepspeed inits torch.distributed internally
            from .deepspeed import is_deepspeed_available

             if not is_deepspeed_available():
......
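For context, the shortened comment refers to DeepSpeed setting up the ``torch.distributed`` process group itself. A hedged sketch of what that amounts to (the exact call site is not shown in this diff; ``deepspeed.init_distributed`` is the public helper, and it relies on the rank/world-size environment variables that the ``deepspeed`` launcher exports):

.. code-block:: python

    import deepspeed

    # reads RANK / LOCAL_RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT from the environment
    # and initializes the process group, replacing `python -m torch.distributed.launch`
    deepspeed.init_distributed()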
@@ -37,11 +37,12 @@ from transformers.testing_utils import (
     require_torch_multi_gpu,
     slow,
 )
-from transformers.trainer_utils import set_seed
+from transformers.trainer_utils import get_last_checkpoint, set_seed

-bindir = os.path.abspath(os.path.dirname(__file__))
-with ExtendSysPath(f"{bindir}/.."):
+tests_dir = os.path.abspath(os.path.dirname(os.path.dirname(__file__)))
+root_dir = os.path.dirname(tests_dir)
+with ExtendSysPath(tests_dir):
     from test_trainer import TrainerIntegrationCommon  # noqa

     if is_torch_available():

@@ -49,9 +50,10 @@ with ExtendSysPath(f"{bindir}/.."):
 set_seed(42)

-MBART_TINY = "sshleifer/tiny-mbart"
 T5_SMALL = "t5-small"
 T5_TINY = "patrickvonplaten/t5-tiny-random"
+GPT2_TINY = "sshleifer/tiny-gpt2"


 def load_json(path):

@@ -77,8 +79,19 @@ def require_deepspeed_aio(test_case):

 if is_deepspeed_available():
     from deepspeed.utils import logger as deepspeed_logger  # noqa
+    from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
     from transformers.deepspeed import deepspeed_config, is_deepspeed_zero3_enabled  # noqa


+def get_launcher(distributed=False):
+    # 1. explicitly set --num_nodes=1 just in case these tests end up run on a multi-node setup
+    #    - it won't be able to handle that
+    # 2. for now testing with just 2 gpus max (since some quality tests may give different
+    #    results with more gpus because we use very little data)
+    num_gpus = min(2, get_gpu_count()) if distributed else 1
+    return f"deepspeed --num_nodes 1 --num_gpus {num_gpus}".split()
+
+
 ZERO2 = "zero2"
 ZERO3 = "zero3"
 stages = [ZERO2, ZERO3]
@@ -568,6 +581,41 @@ class TrainerIntegrationDeepSpeed(TestCasePlus, TrainerIntegrationCommon):
         self.assertEqual(b, b1)
         self.check_trainer_state_are_the_same(state, state1)

+    @parameterized.expand(stages)
+    def test_load_state_dict_from_zero_checkpoint(self, stage):
+        # test that we can load fp32 weights directly from the zero checkpoint into the current model
+
+        output_dir = self.get_auto_remove_tmp_dir()  # "./xxx", after=False, before=False)
+
+        ds_config_dict = self.get_config_dict(stage)
+
+        kwargs = dict(
+            output_dir=output_dir,
+            train_len=4,
+            per_device_train_batch_size=4,
+            num_train_epochs=1,
+            save_strategy="steps",
+            save_steps=1,
+            learning_rate=0.1,
+            fp16=True,
+            deepspeed=ds_config_dict,
+        )
+
+        with mockenv_context(**self.dist_env_1_gpu):
+            trainer = get_regression_trainer(**kwargs)
+            trainer.train()
+
+            (a, b) = trainer.model.a.item(), trainer.model.b.item()
+            state = dataclasses.asdict(trainer.state)
+
+            checkpoint_dir = get_last_checkpoint(output_dir)
+            model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
+
+            (a1, b1) = model.a.item(), model.b.item()
+            state1 = dataclasses.asdict(trainer.state)
+            self.assertEqual(a, a1)
+            self.assertEqual(b, b1)
+            self.check_trainer_state_are_the_same(state, state1)
+
     def test_config_object(self):
         # test that we can switch from zero2 to zero3 in the same process for example
         # test is_zero, etc.
@@ -809,7 +857,7 @@ class TestDeepSpeedWithLauncher(TestCasePlus):
         ds_args = f"--deepspeed {self.test_file_dir_str}/ds_config_{stage}.json".split()
         script = [f"{self.examples_dir_str}/pytorch/translation/run_translation.py"]

-        launcher = self.get_launcher(distributed)
+        launcher = get_launcher(distributed)

         cmd = launcher + script + args + ds_args
         # keep for quick debug

@@ -826,7 +874,7 @@ class TestDeepSpeedWithLauncher(TestCasePlus):
         data_dir = self.tests_dir / "fixtures"
         output_dir = self.get_auto_remove_tmp_dir()
         args = f"""
-            --model_name_or_path sshleifer/tiny-gpt2
+            --model_name_or_path {GPT2_TINY}
             --train_file {data_dir}/sample_text.txt
             --validation_file {data_dir}/sample_text.txt
             --output_dir {output_dir}

@@ -846,7 +894,7 @@ class TestDeepSpeedWithLauncher(TestCasePlus):
         ds_args = f"--deepspeed {self.test_file_dir_str}/ds_config_{stage}.json".split()
         script = [f"{self.examples_dir_str}/pytorch/language-modeling/run_clm.py"]

-        launcher = self.get_launcher(distributed=True)
+        launcher = get_launcher(distributed=True)

         cmd = launcher + script + args + ds_args
         # keep for quick debug

@@ -860,7 +908,7 @@ class TestDeepSpeedWithLauncher(TestCasePlus):
         output_dir = self.get_auto_remove_tmp_dir()
         args = f"""
             --model_type gpt2
-            --tokenizer_name sshleifer/tiny-gpt2
+            --tokenizer_name {GPT2_TINY}
             --train_file {data_dir}/sample_text.txt
             --validation_file {data_dir}/sample_text.txt
             --output_dir {output_dir}

@@ -877,7 +925,7 @@ class TestDeepSpeedWithLauncher(TestCasePlus):
         ds_args = f"--deepspeed {self.test_file_dir_str}/ds_config_zero3.json".split()
         script = [f"{self.examples_dir_str}/pytorch/language-modeling/run_clm.py"]

-        launcher = self.get_launcher(distributed=True)
+        launcher = get_launcher(distributed=True)

         cmd = launcher + script + args + ds_args
         # keep for quick debug

@@ -885,11 +933,3 @@ class TestDeepSpeedWithLauncher(TestCasePlus):

         with CaptureStderr() as cs:
             execute_subprocess_async(cmd, env=self.get_env())
         assert "Detected DeepSpeed ZeRO-3" in cs.err
-
-    def get_launcher(self, distributed=False):
-        # 1. explicitly set --num_nodes=1 just in case these tests end up run on a multi-node setup
-        # - it won't be able to handle that
-        # 2. for now testing with just 2 gpus max (since some quality tests may give different
-        # results with mode gpus because we use very little data)
-        num_gpus = min(2, get_gpu_count()) if distributed else 1
-        return f"deepspeed --num_nodes 1 --num_gpus {num_gpus}".split()
# Copyright 2020 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import itertools
import os
import subprocess
from parameterized import parameterized
from transformers import is_torch_available
from transformers.testing_utils import (
    ExtendSysPath,
    TestCasePlus,
    execute_subprocess_async,
    get_gpu_count,
    require_deepspeed,
    require_torch_gpu,
    slow,
)
from transformers.trainer_utils import set_seed


tests_dir = os.path.abspath(os.path.dirname(os.path.dirname(__file__)))
root_dir = os.path.dirname(tests_dir)
with ExtendSysPath(tests_dir):
    from test_trainer import TrainerIntegrationCommon  # noqa

    if is_torch_available():
        from test_trainer import RegressionModelConfig, RegressionPreTrainedModel, get_regression_trainer  # noqa
set_seed(42)
# translation
FSMT_TINY = "stas/tiny-wmt19-en-de"
BART_TINY = "sshleifer/bart-tiny-random"
T5_SMALL = "t5-small"
T5_TINY = "patrickvonplaten/t5-tiny-random"
MBART_TINY = "sshleifer/tiny-mbart"
MARIAN_TINY = "sshleifer/tiny-marian-en-de"
# summarization
PEGASUS_TINY = "stas/pegasus-cnn_dailymail-tiny-random"
# causal lm
GPT2_TINY = "sshleifer/tiny-gpt2"
XLM_ROBERTA_TINY = "hf-internal-testing/tiny-xlm-roberta"
# question-answering
ROBERTA_TINY = "sshleifer/tiny-distilroberta-base"
# masked lm
DISTILBERT_TINY = "sshleifer/tiny-distilbert-base-cased"
ELECTRA_TINY = "hf-internal-testing/tiny-electra"
# classification
XLNET_TINY = "sshleifer/tiny-xlnet-base-cased"
BERT_TINY = "hf-internal-testing/tiny-bert"
# TODO: to add:
# albert
# deberta
# funnel
# longformer
# dpr
# gpt_neo
# camembert
# deberta-v2
# m2m_100
# tapas
# vit
# big_bird
def get_launcher(distributed=False):
    # 1. explicitly set --num_nodes=1 just in case these tests end up run on a multi-node setup
    #    - it won't be able to handle that
    # 2. for now testing with just 2 gpus max (since some quality tests may give different
    #    results with more gpus because we use very little data)
    num_gpus = min(2, get_gpu_count()) if distributed else 1
    return f"deepspeed --num_nodes 1 --num_gpus {num_gpus}".split()
def make_task_cmds():
    data_dir_fixtures = f"{tests_dir}/fixtures"
    data_dir_samples = f"{data_dir_fixtures}/tests_samples"
    data_dir_wmt = f"{data_dir_samples}/wmt_en_ro"
    data_dir_xsum = f"{data_dir_samples}/xsum"
    args_main = """
        --do_train
        --max_train_samples 4
        --per_device_train_batch_size 2
        --num_train_epochs 1
        --fp16
        --report_to none
        --overwrite_output_dir
        """.split()

    # XXX: try to cover as many models as possible once (it's enough to run on one task per model)
    # but need a tiny model for each
    #
    # should have T5_TINY, etc. global var defined
    tasks2models = dict(
        trans=[
            "bart",
            "fsmt",
            "marian",
            "mbart",
            "t5",
        ],
        sum=[
            "pegasus",
        ],
        clm=[
            "gpt2",
            "xlm-roberta",
        ],
        mlm=[
            "electra",
            "distilbert",
        ],
        qa=[
            "roberta",
        ],
        clas=[
            "bert",
            "xlnet",
        ],
    )

    scripts_dir = f"{root_dir}/examples/pytorch"

    tasks = dict(
        trans=f"""
        {scripts_dir}/translation/run_translation.py
        --train_file {data_dir_wmt}/train.json
        --source_lang en
        --target_lang ro
        """,
        sum=f"""
        {scripts_dir}/summarization/run_summarization.py
        --train_file {data_dir_xsum}/sample.json
        --max_source_length 12
        --max_target_length 12
        """,
        clm=f"""
        {scripts_dir}/language-modeling/run_clm.py
        --train_file {data_dir_fixtures}/sample_text.txt
        --block_size 8
        """,
        mlm=f"""
        {scripts_dir}/language-modeling/run_mlm.py
        --train_file {data_dir_fixtures}/sample_text.txt
        """,
        qa=f"""
        {scripts_dir}/question-answering/run_qa.py
        --train_file {data_dir_samples}/SQUAD/sample.json
        """,
        clas=f"""
        {scripts_dir}/text-classification/run_glue.py
        --train_file {data_dir_samples}/MRPC/train.csv
        --max_seq_length 12
        --task_name MRPC
        """,
    )

    launcher = get_launcher(distributed=True)

    cmds = {}
    for task, args in tasks.items():
        args = args.split()
        for model in tasks2models[task]:
            model_name = globals()[f"{model.upper().replace('-', '_')}_TINY"]
            args_model = f"--model_name_or_path {model_name}".split()
            cmds[f"{task}_{model}"] = launcher + args + args_model + args_main

            # # generation special case
            # if task == "gen":
            #     launcher = f"deepspeed --num_nodes 1 --num_gpus 1".split()
            #     args_model += f"--model_type {model}".split()
            #     cmds[f"{task}_{model}"] = launcher + args + args_model
            # else:

    return cmds
task_cmds = make_task_cmds()
ZERO2 = "zero2"
ZERO3 = "zero3"
stages = [ZERO2, ZERO3]
def parameterized_custom_name_func(func, param_num, param):
    # customize the test name generator function as we want both params to appear in the sub-test
    # name, as by default it shows only the first param
    param_based_name = parameterized.to_safe_name("_".join(str(x) for x in param.args))
    return f"{func.__name__}_{param_based_name}"
# Cartesian-product of zero stages with models to test
params = list(itertools.product(stages, task_cmds.keys()))
@slow
@require_deepspeed
@require_torch_gpu
class TestDeepSpeedModelZoo(TestCasePlus):
    """This class is for testing via an external script - can do multiple gpus"""

    def get_task_cmd(self, task, stage):
        # return a ready to run train cmd
        if task not in task_cmds:
            raise ValueError(f"don't know of task {task}, have {task_cmds.keys()}")

        cmd = task_cmds[task]
        args_ds = f"--deepspeed {self.test_file_dir_str}/ds_config_{stage}.json".split()

        output_dir = self.get_auto_remove_tmp_dir()
        args_out = f"--output_dir {output_dir}".split()

        cmd += args_ds + args_out
        return cmd, output_dir
    @parameterized.expand(params, name_func=parameterized_custom_name_func)
    def test_zero_to_fp32(self, stage, task):
        # testing the ability to do a run followed by recovery of full fp32 weights

        cmd, output_dir = self.get_task_cmd(task, stage)

        # 1. generate the checkpoint
        cmd += "--save_steps 1".split()
        # keep for quick debug
        # print(" ".join([f"\nPYTHONPATH={self.src_dir_str}"] + cmd)); die
        execute_subprocess_async(cmd, env=self.get_env())

        # 2. test that the fp32 weights get reconsolidated
        chkpt_dir = f"{output_dir}/checkpoint-1"
        recovered_model_path = f"{chkpt_dir}/out.bin"
        cmd = f"{chkpt_dir}/zero_to_fp32.py {chkpt_dir} {recovered_model_path}"
        # keep for quick debug
        # print(" ".join([f"\nPYTHONPATH={self.src_dir_str}"] + cmd)); die
        subprocess.check_call(cmd, shell=True)

        assert os.path.exists(recovered_model_path), f"{recovered_model_path} was not found"

        # possibly could also test that the resulting saved model is usable but given that we use
        # random models we won't know if it's any good