serialization.rst 10.8 KB
Newer Older
1
2
Loading Google AI or OpenAI pre-trained weights or PyTorch dump
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
thomwolf's avatar
thomwolf committed
3

4
5
``from_pretrained()`` method
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
thomwolf's avatar
thomwolf committed
6

7
To load one of Google AI's, OpenAI's pre-trained models or a PyTorch saved model (an instance of ``BertForPreTraining`` saved with ``torch.save()``\ ), the PyTorch model classes and the tokenizer can be instantiated using the ``from_pretrained()`` method:
thomwolf's avatar
thomwolf committed
8

9
.. code-block:: python
thomwolf's avatar
thomwolf committed
10

11
   model = BERT_CLASS.from_pretrained(PRE_TRAINED_MODEL_NAME_OR_PATH, cache_dir=None, from_tf=False, state_dict=None, *input, **kwargs)
thomwolf's avatar
thomwolf committed
12
13
14
15

where


16
17
18
* ``BERT_CLASS`` is either a tokenizer to load the vocabulary (\ ``BertTokenizer`` or ``OpenAIGPTTokenizer`` classes) or one of the eight BERT or three OpenAI GPT PyTorch model classes (to load the pre-trained weights): ``BertModel``\ , ``BertForMaskedLM``\ , ``BertForNextSentencePrediction``\ , ``BertForPreTraining``\ , ``BertForSequenceClassification``\ , ``BertForTokenClassification``\ , ``BertForMultipleChoice``\ , ``BertForQuestionAnswering``\ , ``OpenAIGPTModel``\ , ``OpenAIGPTLMHeadModel`` or ``OpenAIGPTDoubleHeadsModel``\ , and
*
  ``PRE_TRAINED_MODEL_NAME_OR_PATH`` is either:
thomwolf's avatar
thomwolf committed
19
20


21
22
  *
    the shortcut name of a Google AI's or OpenAI's pre-trained model selected in the list:
thomwolf's avatar
thomwolf committed
23
24


25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
    * ``bert-base-uncased``: 12-layer, 768-hidden, 12-heads, 110M parameters
    * ``bert-large-uncased``: 24-layer, 1024-hidden, 16-heads, 340M parameters
    * ``bert-base-cased``: 12-layer, 768-hidden, 12-heads , 110M parameters
    * ``bert-large-cased``: 24-layer, 1024-hidden, 16-heads, 340M parameters
    * ``bert-base-multilingual-uncased``: (Orig, not recommended) 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
    * ``bert-base-multilingual-cased``: **(New, recommended)** 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
    * ``bert-base-chinese``: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
    * ``bert-base-german-cased``: Trained on German data only, 12-layer, 768-hidden, 12-heads, 110M parameters `Performance Evaluation <https://deepset.ai/german-bert>`__
    * ``bert-large-uncased-whole-word-masking``: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the the tokens corresponding to a word at once)
    * ``bert-large-cased-whole-word-masking``: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the the tokens corresponding to a word at once)
    * ``bert-large-uncased-whole-word-masking-finetuned-squad``: The ``bert-large-uncased-whole-word-masking`` model finetuned on SQuAD (using the ``run_bert_squad.py`` examples). Results: *exact_match: 86.91579943235573, f1: 93.1532499015869*
    * ``openai-gpt``: OpenAI GPT English model, 12-layer, 768-hidden, 12-heads, 110M parameters
    * ``gpt2``: OpenAI GPT-2 English model, 12-layer, 768-hidden, 12-heads, 117M parameters
    * ``gpt2-medium``: OpenAI GPT-2 English model, 24-layer, 1024-hidden, 16-heads, 345M parameters
    * ``transfo-xl-wt103``: Transformer-XL English model trained on wikitext-103, 18-layer, 1024-hidden, 16-heads, 257M parameters
thomwolf's avatar
thomwolf committed
40

41
42
  *
    a path or url to a pretrained model archive containing:
thomwolf's avatar
thomwolf committed
43
44


45
46
47
    * ``bert_config.json`` or ``openai_gpt_config.json`` a configuration file for the model, and
    * ``pytorch_model.bin`` a PyTorch dump of a pre-trained instance of ``BertForPreTraining``\ , ``OpenAIGPTModel``\ , ``TransfoXLModel``\ , ``GPT2LMHeadModel`` (saved with the usual ``torch.save()``\ )

48
  If ``PRE_TRAINED_MODEL_NAME_OR_PATH`` is a shortcut name, the pre-trained weights will be downloaded from AWS S3 (see the links `here <https://github.com/huggingface/transformers/blob/master/transformers/modeling_bert.py>`__\ ) and stored in a cache folder to avoid future download (the cache folder can be found at ``~/.pytorch_pretrained_bert/``\ ).
49
50
51
52
53
54
55
56
57
58
59

*
  ``cache_dir`` can be an optional path to a specific directory to download and cache the pre-trained model weights. This option is useful in particular when you are using distributed training: to avoid concurrent access to the same weights you can set for example ``cache_dir='./pretrained_model_{}'.format(args.local_rank)`` (see the section on distributed training for more information).

* ``from_tf``\ : should we load the weights from a locally saved TensorFlow checkpoint
* ``state_dict``\ : an optional state dictionary (collections.OrderedDict object) to use instead of Google pre-trained models
* ``*inputs``\ , `**kwargs`: additional input for the specific Bert class (ex: num_labels for BertForSequenceClassification)

``Uncased`` means that the text has been lowercased before WordPiece tokenization, e.g., ``John Smith`` becomes ``john smith``. The Uncased model also strips out any accent markers. ``Cased`` means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging). For information about the Multilingual and Chinese model, see the `Multilingual README <https://github.com/google-research/bert/blob/master/multilingual.md>`__ or the original TensorFlow repository.

When using an ``uncased model``\ , make sure to pass ``--do_lower_case`` to the example training scripts (or pass ``do_lower_case=True`` to FullTokenizer if you're using your own script and loading the tokenizer your-self.).
thomwolf's avatar
thomwolf committed
60
61
62

Examples:

63
64
65
66
67
.. code-block:: python

   # BERT
   tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True, do_basic_tokenize=True)
   model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
thomwolf's avatar
thomwolf committed
68

69
70
71
   # OpenAI GPT
   tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
   model = OpenAIGPTModel.from_pretrained('openai-gpt')
thomwolf's avatar
thomwolf committed
72

73
74
75
   # Transformer-XL
   tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
   model = TransfoXLModel.from_pretrained('transfo-xl-wt103')
thomwolf's avatar
thomwolf committed
76

77
78
79
   # OpenAI GPT-2
   tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
   model = GPT2Model.from_pretrained('gpt2')
thomwolf's avatar
thomwolf committed
80

81
82
Cache directory
~~~~~~~~~~~~~~~
thomwolf's avatar
thomwolf committed
83

84
``pytorch_pretrained_bert`` save the pretrained weights in a cache directory which is located at (in this order of priority):
thomwolf's avatar
thomwolf committed
85
86


87
88
89
* ``cache_dir`` optional arguments to the ``from_pretrained()`` method (see above),
* shell environment variable ``PYTORCH_PRETRAINED_BERT_CACHE``\ ,
* PyTorch cache home + ``/pytorch_pretrained_bert/``
thomwolf's avatar
thomwolf committed
90
91
  where PyTorch cache home is defined by (in this order):

92
93
94
  * shell environment variable ``ENV_TORCH_HOME``
  * shell environment variable ``ENV_XDG_CACHE_HOME`` + ``/torch/``\ )
  * default: ``~/.cache/torch/``
thomwolf's avatar
thomwolf committed
95

96
Usually, if you don't set any specific environment variable, ``pytorch_pretrained_bert`` cache will be at ``~/.cache/torch/pytorch_pretrained_bert/``.
thomwolf's avatar
thomwolf committed
97

98
99
100
101
You can alsways safely delete ``pytorch_pretrained_bert`` cache but the pretrained model weights and vocabulary files wil have to be re-downloaded from our S3.

Serialization best-practices
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
thomwolf's avatar
thomwolf committed
102
103
104
105

This section explain how you can save and re-load a fine-tuned model (BERT, GPT, GPT-2 and Transformer-XL).
There are three types of files you need to save to be able to reload a fine-tuned model:

106
107
108
109

* the model it-self which should be saved following PyTorch serialization `best practices <https://pytorch.org/docs/stable/notes/serialization.html#best-practices>`__\ ,
* the configuration file of the model which is saved as a JSON file, and
* the vocabulary (and the merges for the BPE-based models GPT and GPT-2).
thomwolf's avatar
thomwolf committed
110
111
112
113

The *default filenames* of these files are as follow:


114
115
116
117
118
119
* the model weights file: ``pytorch_model.bin``\ ,
* the configuration file: ``config.json``\ ,
* the vocabulary file: ``vocab.txt`` for BERT and Transformer-XL, ``vocab.json`` for GPT/GPT-2 (BPE vocabulary),
* for GPT/GPT-2 (BPE vocabulary) the additional merges file: ``merges.txt``.

**If you save a model using these *default filenames*\ , you can then re-load the model and tokenizer using the ``from_pretrained()`` method.**
thomwolf's avatar
thomwolf committed
120

121
Here is the recommended way of saving the model, configuration and vocabulary to an ``output_dir`` directory and reloading the model and tokenizer afterwards:
thomwolf's avatar
thomwolf committed
122

123
.. code-block:: python
thomwolf's avatar
thomwolf committed
124

125
   from transformers import WEIGHTS_NAME, CONFIG_NAME
thomwolf's avatar
thomwolf committed
126

127
   output_dir = "./models/"
thomwolf's avatar
thomwolf committed
128

129
   # Step 1: Save a model, configuration and vocabulary that you have fine-tuned
thomwolf's avatar
thomwolf committed
130

131
132
133
   # If we have a distributed model, save only the encapsulated model
   # (it was wrapped in PyTorch DistributedDataParallel or DataParallel)
   model_to_save = model.module if hasattr(model, 'module') else model
thomwolf's avatar
thomwolf committed
134

135
136
137
   # If we save using the predefined names, we can load using `from_pretrained`
   output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
   output_config_file = os.path.join(output_dir, CONFIG_NAME)
thomwolf's avatar
thomwolf committed
138

139
140
141
   torch.save(model_to_save.state_dict(), output_model_file)
   model_to_save.config.to_json_file(output_config_file)
   tokenizer.save_vocabulary(output_dir)
thomwolf's avatar
thomwolf committed
142

143
144
145
146
147
148
149
150
   # Step 2: Re-load the saved model and vocabulary

   # Example for a Bert model
   model = BertForQuestionAnswering.from_pretrained(output_dir)
   tokenizer = BertTokenizer.from_pretrained(output_dir, do_lower_case=args.do_lower_case)  # Add specific options if needed
   # Example for a GPT model
   model = OpenAIGPTDoubleHeadsModel.from_pretrained(output_dir)
   tokenizer = OpenAIGPTTokenizer.from_pretrained(output_dir)
thomwolf's avatar
thomwolf committed
151
152
153

Here is another way you can save and reload the model if you want to use specific paths for each type of files:

154
155
156
157
158
159
160
.. code-block:: python

   output_model_file = "./models/my_own_model_file.bin"
   output_config_file = "./models/my_own_config_file.bin"
   output_vocab_file = "./models/my_own_vocab_file.bin"

   # Step 1: Save a model, configuration and vocabulary that you have fine-tuned
thomwolf's avatar
thomwolf committed
161

162
163
164
   # If we have a distributed model, save only the encapsulated model
   # (it was wrapped in PyTorch DistributedDataParallel or DataParallel)
   model_to_save = model.module if hasattr(model, 'module') else model
thomwolf's avatar
thomwolf committed
165

166
167
168
   torch.save(model_to_save.state_dict(), output_model_file)
   model_to_save.config.to_json_file(output_config_file)
   tokenizer.save_vocabulary(output_vocab_file)
thomwolf's avatar
thomwolf committed
169

170
   # Step 2: Re-load the saved model and vocabulary
thomwolf's avatar
thomwolf committed
171

172
173
   # We didn't save using the predefined WEIGHTS_NAME, CONFIG_NAME names, we cannot load using `from_pretrained`.
   # Here is how to do it in this situation:
thomwolf's avatar
thomwolf committed
174

175
176
177
178
179
180
   # Example for a Bert model
   config = BertConfig.from_json_file(output_config_file)
   model = BertForQuestionAnswering(config)
   state_dict = torch.load(output_model_file)
   model.load_state_dict(state_dict)
   tokenizer = BertTokenizer(output_vocab_file, do_lower_case=args.do_lower_case)
thomwolf's avatar
thomwolf committed
181

182
183
184
185
186
187
   # Example for a GPT model
   config = OpenAIGPTConfig.from_json_file(output_config_file)
   model = OpenAIGPTDoubleHeadsModel(config)
   state_dict = torch.load(output_model_file)
   model.load_state_dict(state_dict)
   tokenizer = OpenAIGPTTokenizer(output_vocab_file)
thomwolf's avatar
thomwolf committed
188