<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# How to add a model to 🤗 Transformers?

The 🤗 Transformers library is often able to offer new models thanks to community contributors. But this can be a challenging project and requires an in-depth knowledge of the 🤗 Transformers library and the model to implement. At Hugging Face, we're trying to empower more of the community to actively add models and we've put together this guide to walk you through the process of adding a PyTorch model (make sure you have [PyTorch installed](https://pytorch.org/get-started/locally/)).

Along the way, you'll:

- get insights into open-source best practices
- understand the design principles behind one of the most popular deep learning libraries
- learn how to efficiently test large models
- learn how to integrate Python utilities like `black`, `ruff`, and `make fix-copies` to ensure clean and readable code

A Hugging Face team member will be available to help you along the way so you'll never be alone. 🤗 ❤️

To get started, open a [New model addition](https://github.com/huggingface/transformers/issues/new?assignees=&labels=New+model&template=new-model-addition.yml) issue for the model you want to see in 🤗 Transformers. If you're not especially picky about contributing a specific model, you can filter by the [New model label](https://github.com/huggingface/transformers/labels/New%20model) to see if there are any unclaimed model requests and work on it.

Once you've opened a new model request, the first step is to get familiar with 🤗 Transformers if you aren't already!

## General overview of 🤗 Transformers

First, you should get a general overview of 🤗 Transformers. 🤗 Transformers is a very opinionated library, so there is a
chance that you don't agree with some of the library's philosophies or design choices. From our experience, however, we
found that the fundamental design choices and philosophies of the library are crucial to efficiently scale 🤗
Transformers while keeping maintenance costs at a reasonable level.

A good first starting point to better understand the library is to read the [documentation of our philosophy](philosophy). As a result of our way of working, there are some choices that we try to apply to all models:

- Composition is generally favored over abstraction
- Duplicating code is not always bad if it strongly improves the readability or accessibility of a model
- Model files are as self-contained as possible so that when you read the code of a specific model, you ideally only
  have to look into the respective `modeling_....py` file.

In our opinion, the library's code is not just a means to provide a product, *e.g.* the ability to use BERT for
inference, but also the very product that we want to improve. Hence, when adding a model, the user is not only the
person who will use your model, but also everybody who will read, try to understand, and possibly tweak your code.

With this in mind, let's go a bit deeper into the general library design.

### Overview of models

To successfully add a model, it is important to understand the interaction between your model and its config,
[`PreTrainedModel`], and [`PretrainedConfig`]. For exemplary purposes, we will
call the model to be added to 🤗 Transformers `BrandNewBert`.

Let's take a look:

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers_overview.png"/>

As you can see, we do make use of inheritance in 🤗 Transformers, but we keep the level of abstraction to an absolute
minimum. There are never more than two levels of abstraction for any model in the library. `BrandNewBertModel`
inherits from `BrandNewBertPreTrainedModel` which in turn inherits from [`PreTrainedModel`] and
that's it. As a general rule, we want to make sure that a new model only depends on
[`PreTrainedModel`]. The important functionalities that are automatically provided to every new
model are [`~PreTrainedModel.from_pretrained`] and
[`~PreTrainedModel.save_pretrained`], which are used for serialization and deserialization. All of the
other important functionalities, such as `BrandNewBertModel.forward`, should be completely defined in the new
`modeling_brand_new_bert.py` script. Next, we want to make sure that a model with a specific head layer, such as
`BrandNewBertForMaskedLM`, does not inherit from `BrandNewBertModel`, but rather uses `BrandNewBertModel`
as a component that can be called in its forward pass to keep the level of abstraction low. Every new model requires a
configuration class, called `BrandNewBertConfig`. This configuration is always stored as an attribute in
[`PreTrainedModel`], and thus can be accessed via the `config` attribute for all classes
inheriting from `BrandNewBertPreTrainedModel`:

```python
model = BrandNewBertModel.from_pretrained("brandy/brand_new_bert")
model.config  # model has access to its config
```

Similar to the model, the configuration inherits basic serialization and deserialization functionalities from
[`PretrainedConfig`]. Note that the configuration and the model are always serialized into two
different formats - the model to a *pytorch_model.bin* file and the configuration to a *config.json* file. Calling
the model's [`~PreTrainedModel.save_pretrained`] will automatically call
the config's [`~PretrainedConfig.save_pretrained`], so that both model and configuration are saved.
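For instance, the whole round trip can be done in a few lines. A minimal sketch, reusing the placeholder `BrandNewBert` names of this guide:

```python
model = BrandNewBertModel.from_pretrained("brandy/brand_new_bert")
model.save_pretrained("/path/to/awesome-name-you-picked")  # writes pytorch_model.bin and config.json
model = BrandNewBertModel.from_pretrained("/path/to/awesome-name-you-picked")  # reloads model and config together
```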


### Code style

When coding your new model, keep in mind that Transformers is an opinionated library and we have a few quirks of our
own regarding how code should be written :-)

1. The forward pass of your model should be fully written in the modeling file while being fully independent of other
   models in the library. If you want to reuse a block from another model, copy the code and paste it with a
   `# Copied from` comment on top (see [here](https://github.com/huggingface/transformers/blob/v4.17.0/src/transformers/models/roberta/modeling_roberta.py#L160)
   for a good example and [there](pr_checks#check-copies) for more documentation on Copied from). 
2. The code should be fully understandable, even by a non-native English speaker. This means you should pick
   descriptive variable names and avoid abbreviations. As an example, `activation` is preferred to `act`.
   One-letter variable names are strongly discouraged unless it's an index in a for loop.
3. More generally, we prefer longer explicit code to a short magical one.
4. Avoid subclassing `nn.Sequential` in PyTorch but subclass `nn.Module` and write the forward pass, so that anyone
   using your code can quickly debug it by adding print statements or breakpoints (see the sketch after this list).
5. Your function signature should be type-annotated. For the rest, good variable names are way more readable and
   understandable than type annotations.
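To illustrate the fourth point, here is a minimal sketch of such a module; the block itself and all names in it are hypothetical:

```python
import torch
from torch import nn


class ResidualBlock(nn.Module):
    """Hypothetical example: an explicit forward pass instead of nn.Sequential."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.activation = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Every intermediate value can be printed or inspected with a breakpoint here
        intermediate_states = self.activation(self.dense(hidden_states))
        return hidden_states + intermediate_states
```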

### Overview of tokenizers

Not quite ready yet :-( This section will be added soon!

## Step-by-step recipe to add a model to 🤗 Transformers

Everyone has different preferences for how to port a model, so it can be very helpful for you to take a look at summaries
of how other contributors ported models to Hugging Face. Here is a list of community blog posts on how to port a model:

1. [Porting GPT2 Model](https://medium.com/huggingface/from-tensorflow-to-pytorch-265f40ef2a28) by [Thomas](https://huggingface.co/thomwolf)
2. [Porting WMT19 MT Model](https://huggingface.co/blog/porting-fsmt) by [Stas](https://huggingface.co/stas)

From experience, we can tell you that the most important things to keep in mind when adding a model are:

-  Don't reinvent the wheel! Most parts of the code you will add for the new 🤗 Transformers model already exist
  somewhere in 🤗 Transformers. Take some time to find similar, already existing models and tokenizers you can copy
  from. [grep](https://www.gnu.org/software/grep/) and [rg](https://github.com/BurntSushi/ripgrep) are your
  friends. Note that it might very well happen that your model's tokenizer is based on one model implementation, and
  your model's modeling code on another one. *E.g.* FSMT's modeling code is based on BART, while FSMT's tokenizer code
  is based on XLM.
-  It's more of an engineering challenge than a scientific challenge. You should spend more time creating an
  efficient debugging environment rather than trying to understand all theoretical aspects of the model in the paper.
-  Ask for help when you're stuck! Models are the core component of 🤗 Transformers so we at Hugging Face are more
  than happy to help you at every step to add your model. Don't hesitate to ask if you notice you are not making
  progress.

In the following, we try to give you a general recipe that we found most useful when porting a model to 🤗 Transformers.

The following list is a summary of everything that has to be done to add a model and can be used by you as a To-Do
List:

☐ (Optional) Understood the model's theoretical aspects<br>
☐ Prepared 🤗 Transformers dev environment<br>
☐ Set up debugging environment of the original repository<br>
☐ Created script that successfully runs the `forward()` pass using the original repository and checkpoint<br>
☐ Successfully added the model skeleton to 🤗 Transformers<br>
☐ Successfully converted original checkpoint to 🤗 Transformers checkpoint<br>
☐ Successfully ran `forward()` pass in 🤗 Transformers that gives identical output to original checkpoint<br>
☐ Finished model tests in 🤗 Transformers<br>
☐ Successfully added tokenizer in 🤗 Transformers<br>
☐ Ran end-to-end integration tests<br>
☐ Finished docs<br>
☐ Uploaded model weights to the Hub<br>
☐ Submitted the pull request<br>
☐ (Optional) Added a demo notebook

To begin with, we usually recommend starting by getting a good theoretical understanding of `BrandNewBert`. However,
if you prefer to understand the theoretical aspects of the model *on-the-job*, then it is totally fine to directly dive
into the `BrandNewBert`'s code-base. This option might suit you better if your engineering skills are better than
your theoretical skills, if you have trouble understanding `BrandNewBert`'s paper, or if you just enjoy programming
much more than reading scientific papers.

### 1. (Optional) Theoretical aspects of BrandNewBert

You should take some time to read *BrandNewBert's* paper, if such descriptive work exists. There might be large
sections of the paper that are difficult to understand. If this is the case, this is fine - don't worry! The goal is
not to get a deep theoretical understanding of the paper, but to extract the necessary information required to
effectively re-implement the model in 🤗 Transformers. That being said, you don't have to spend too much time on the
theoretical aspects, but rather focus on the practical ones, namely:

-  What type of model is *brand_new_bert*? BERT-like encoder-only model? GPT2-like decoder-only model? BART-like
  encoder-decoder model? Look at the [model_summary](model_summary) if you're not familiar with the differences between those.
-  What are the applications of *brand_new_bert*? Text classification? Text generation? Seq2Seq tasks, *e.g.,*
  summarization?
-  What is the novel feature of the model that makes it different from BERT/GPT-2/BART?
-  Which of the already existing [🤗 Transformers models](https://huggingface.co/transformers/#contents) is most
  similar to *brand_new_bert*?
-  What type of tokenizer is used? A SentencePiece tokenizer? A WordPiece tokenizer? Is it the same tokenizer as used
  for BERT or BART?

After you feel like you have gotten a good overview of the architecture of the model, you might want to write to the
Hugging Face team with any questions you might have. This might include questions regarding the model's architecture,
its attention layer, etc. We will be more than happy to help you.

### 2. Next prepare your environment

1. Fork the [repository](https://github.com/huggingface/transformers) by clicking on the 'Fork' button on the
   repository's page. This creates a copy of the code under your GitHub user account.

2. Clone your `transformers` fork to your local disk, and add the base repository as a remote:

   ```bash
   git clone https://github.com/[your Github handle]/transformers.git
   cd transformers
   git remote add upstream https://github.com/huggingface/transformers.git
   ```

3. Set up a development environment, for instance by running the following command:

   ```bash
   python -m venv .env
   source .env/bin/activate
   pip install -e ".[dev]"
   ```

   Depending on your OS, and since the number of optional dependencies of Transformers is growing, you might get a
   failure with this command. If that's the case make sure to install the Deep Learning framework you are working with
   (PyTorch, TensorFlow and/or Flax) then do:

   ```bash
   pip install -e ".[quality]"
   ```

   which should be enough for most use cases. You can then return to the parent directory:

   ```bash
   cd ..
   ```

4. We recommend adding the PyTorch version of *brand_new_bert* to Transformers. To install PyTorch, please follow the
   instructions on https://pytorch.org/get-started/locally/.

   **Note:** You don't need to have CUDA installed. Making the new model work on CPU is sufficient.

5. To port *brand_new_bert*, you will also need access to its original repository:

   ```bash
   git clone https://github.com/org_that_created_brand_new_bert_org/brand_new_bert.git
   cd brand_new_bert
   pip install -e .
   ```

Now you have set up a development environment to port *brand_new_bert* to 🤗 Transformers.

### 3.-4. Run a pretrained checkpoint using the original repository

At first, you will work on the original *brand_new_bert* repository. Often, the original implementation is very
“researchy”, meaning that documentation might be lacking and the code can be difficult to understand. But this should
be exactly your motivation to reimplement *brand_new_bert*. At Hugging Face, one of our main goals is to *make people
stand on the shoulders of giants*, which translates here very well into taking a working model and rewriting it to make
it as **accessible, user-friendly, and beautiful** as possible. This is the number-one motivation to re-implement
models into 🤗 Transformers - trying to make complex new NLP technology accessible to **everybody**.

You should therefore start by diving into the original repository.

Successfully running the official pretrained model in the original repository is often **the most difficult** step.
From our experience, it is very important to spend some time getting familiar with the original code-base. You need to
figure out the following:

- Where to find the pretrained weights?
- How to load the pretrained weights into the corresponding model?
- How to run the tokenizer independently from the model?
- Trace one forward pass so that you know which classes and functions are required for a simple forward pass. Usually,
  you only have to reimplement those functions.
- Be able to locate the important components of the model: Where is the model's class? Are there model sub-classes,
  *e.g.* EncoderModel, DecoderModel? Where is the self-attention layer? Are there multiple different attention layers,
  *e.g.* *self-attention*, *cross-attention*...?
- How can you debug the model in the original environment of the repo? Do you have to add *print* statements, can you
  work with an interactive debugger like *ipdb*, or should you use an efficient IDE to debug the model, like PyCharm?

It is very important that before you start the porting process, you can **efficiently** debug code in the original
repository! Also, remember that you are working with an open-source library, so do not hesitate to open an issue, or
even a pull request in the original repository. The maintainers of this repository are most likely very happy about
someone looking into their code!

At this point, it is really up to you which debugging environment and strategy you prefer to use to debug the original
model. We strongly advise against setting up a costly GPU environment - simply work on a CPU, both when starting to
dive into the original repository and also when starting to write the 🤗 Transformers implementation of the model. Only
at the very end, when the model has already been successfully ported to 🤗 Transformers, should you verify that the
model also works as expected on GPU.

In general, there are two possible debugging environments for running the original model:

-  [Jupyter notebooks](https://jupyter.org/) / [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb)
-  Local Python scripts.

Jupyter notebooks have the advantage that they allow for cell-by-cell execution which can be helpful to better split
logical components from one another and to have faster debugging cycles as intermediate results can be stored. Also,
notebooks are often easier to share with other contributors, which might be very helpful if you want to ask the Hugging
Face team for help. If you are familiar with Jupyter notebooks, we strongly recommend you work with them.

The obvious disadvantage of Jupyter notebooks is that if you are not used to working with them you will have to spend
some time adjusting to the new programming environment and you might not be able to use your known debugging tools
anymore, like `ipdb`.

For each code-base, a good first step is always to load a **small** pretrained checkpoint and to be able to reproduce a
single forward pass using a dummy integer vector of input IDs as an input. Such a script could look like this (in
pseudocode):

```python
model = BrandNewBertModel.load_pretrained_checkpoint("/path/to/checkpoint/")
input_ids = [0, 4, 5, 2, 3, 7, 9]  # vector of input ids
original_output = model.predict(input_ids)
```

Next, regarding the debugging strategy, there are generally a few to choose from:

- Decompose the original model into many small testable components and run a forward pass on each of those for
  verification
- Decompose the original model only into the original *tokenizer* and the original *model*, run a forward pass on
  those, and use intermediate print statements or breakpoints for verification

Again, it is up to you which strategy to choose. Often, one or the other is advantageous depending on the original code
base.

If the original code-base allows you to decompose the model into smaller sub-components, *e.g.* if the original
code-base can easily be run in eager mode, it is usually worth the effort to do so. There are some important advantages
to taking the more difficult road in the beginning:

- at a later stage when comparing the original model to the Hugging Face implementation, you can verify automatically
  for each component individually that the corresponding component of the 🤗 Transformers implementation matches instead
  of relying on visual comparison via print statements
- it can give you some rope to decompose the big problem of porting a model into smaller problems of just porting
  individual components and thus structure your work better
- separating the model into logical meaningful components will help you to get a better overview of the model's design
  and thus to better understand the model
- at a later stage those component-by-component tests help you to ensure that no regression occurs as you continue
  changing your code

[Lysandre's](https://gist.github.com/LysandreJik/db4c948f6b4483960de5cbac598ad4ed) integration checks for ELECTRA
give a nice example of how this can be done.

However, if the original code-base is very complex or only allows intermediate components to be run in a compiled mode,
it might be too time-consuming or even impossible to separate the model into smaller testable sub-components. A good
example is [T5's MeshTensorFlow](https://github.com/tensorflow/mesh/tree/master/mesh_tensorflow) library which is
very complex and does not offer a simple way to decompose the model into its sub-components. For such libraries, one
often relies on verifying print statements.

No matter which strategy you choose, the recommended procedure is often the same: you should start by debugging the
starting layers first and the ending layers last.

It is recommended that you retrieve the output, either by print statements or sub-component functions, of the following
layers in the following order:

1. Retrieve the input IDs passed to the model
2. Retrieve the word embeddings
3. Retrieve the input of the first Transformer layer
4. Retrieve the output of the first Transformer layer
5. Retrieve the output of the following n - 1 Transformer layers
6. Retrieve the output of the whole BrandNewBert Model

The input IDs should consist of an array of integers, *e.g.* `input_ids = [0, 4, 4, 3, 2, 4, 1, 7, 19]`.

The outputs of the following layers often consist of multi-dimensional float arrays and can look like this:

```
[[
 [-0.1465, -0.6501,  0.1993,  ...,  0.1451,  0.3430,  0.6024],
 [-0.4417, -0.5920,  0.3450,  ..., -0.3062,  0.6182,  0.7132],
 [-0.5009, -0.7122,  0.4548,  ..., -0.3662,  0.6091,  0.7648],
 ...,
 [-0.5613, -0.6332,  0.4324,  ..., -0.3792,  0.7372,  0.9288],
 [-0.5416, -0.6345,  0.4180,  ..., -0.3564,  0.6992,  0.9191],
 [-0.5334, -0.6403,  0.4271,  ..., -0.3339,  0.6533,  0.8694]]],
```

We expect that every model added to 🤗 Transformers passes a couple of integration tests, meaning that the original
model and the reimplemented version in 🤗 Transformers have to give the exact same output up to a precision of 0.001!
Since it is normal that the exact same model written in different libraries can give a slightly different output
depending on the library framework, we accept an error tolerance of 1e-3 (0.001). It is not enough if the model gives
nearly the same output, they have to be almost identical. Therefore, you will certainly compare the intermediate
outputs of the 🤗 Transformers version multiple times against the intermediate outputs of the original implementation of
*brand_new_bert* in which case an **efficient** debugging environment of the original repository is absolutely
important. Here is some advice to make your debugging environment as efficient as possible.

- Find the best way of debugging intermediate results. Is the original repository written in PyTorch? Then you should
  probably take the time to write a longer script that decomposes the original model into smaller sub-components to
  retrieve intermediate values (see the hook-based sketch after this list). Is the original repository written in
  TensorFlow 1? Then you might have to rely on TensorFlow print operations like
  [tf.print](https://www.tensorflow.org/api_docs/python/tf/print) to output intermediate values. Is the original
  repository written in Jax? Then make sure that the model is **not jitted** when running the forward pass, *e.g.*
  check out [this link](https://github.com/google/jax/issues/196).
- Use the smallest pretrained checkpoint you can find. The smaller the checkpoint, the faster your debug cycle
  becomes. It is not efficient if your pretrained model is so big that your forward pass takes more than 10 seconds.
  In case only very large checkpoints are available, it might make more sense to create a dummy model in the new
  environment with randomly initialized weights and save those weights for comparison with the 🤗 Transformers version
  of your model.
- Make sure you are using the easiest way of calling a forward pass in the original repository. Ideally, you want to
  find the function in the original repository that **only** calls a single forward pass, *i.e.* that is often called
  `predict`, `evaluate`, `forward` or `__call__`. You don't want to debug a function that calls `forward`
  multiple times, *e.g.* to generate text, like `autoregressive_sample`, `generate`.
- Try to separate the tokenization from the model's *forward* pass. If the original repository shows examples where
  you have to input a string, then try to find out where in the forward call the string input is changed to input ids
  and start from this point. This might mean that you have to possibly write a small script yourself or change the
  original code so that you can directly input the ids instead of an input string.
- Make sure that the model in your debugging setup is **not** in training mode, which often causes the model to yield
  random outputs due to multiple dropout layers in the model. Make sure that the forward pass in your debugging
  environment is **deterministic** so that the dropout layers are not used. Or use *transformers.utils.set_seed*
  if the old and new implementations are in the same framework.
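If the original repository is written in PyTorch, recording intermediate values can often be done generically with forward hooks instead of a hand-written decomposition script. A minimal sketch, assuming `model` is the original `torch.nn.Module` and `input_ids` is already a tensor:

```python
import torch

intermediates = {}


def save_output(name):
    def hook(module, inputs, output):
        intermediates[name] = output

    return hook


for name, module in model.named_modules():
    module.register_forward_hook(save_output(name))

model.eval()  # disable dropout so the outputs are deterministic
with torch.no_grad():
    model(input_ids)  # `intermediates` now maps every submodule name to its output
```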

The following section gives you more specific details/tips on how you can do this for *brand_new_bert*.

### 5.-14. Port BrandNewBert to 🤗 Transformers

Next, you can finally start adding new code to 🤗 Transformers. Go into the clone of your 🤗 Transformers' fork:

```bash
cd transformers
```

In the special case that you are adding a model whose architecture exactly matches the model architecture of an
existing model you only have to add a conversion script as described in [this section](#write-a-conversion-script).
In this case, you can just re-use the whole model architecture of the already existing model.

Otherwise, let's start generating a new model. We recommend using the following script to add a model starting from
an existing model:

```bash
transformers-cli add-new-model-like
```

You will be prompted with a questionnaire to fill in the basic information of your model.

**Open a Pull Request on the main huggingface/transformers repo**

Before starting to adapt the automatically generated code, now is the time to open a “Work in progress (WIP)” pull
request, *e.g.* “[WIP] Add *brand_new_bert*”, in 🤗 Transformers so that you and the Hugging Face team can work
side-by-side on integrating the model into 🤗 Transformers.

You should do the following:

1. Create a branch with a descriptive name from your main branch

   ```bash
   git checkout -b add_brand_new_bert
   ```

2. Commit the automatically generated code:

   ```bash
   git add .
   git commit
   ```

3. Fetch and rebase to current main:

   ```bash
   git fetch upstream
   git rebase upstream/main
   ```

4. Push the changes to your account using:

   ```bash
   git push -u origin add_brand_new_bert
   ```

5. Once you are satisfied, go to the webpage of your fork on GitHub. Click on “Pull request”. Make sure to add the
   GitHub handle of some members of the Hugging Face team as reviewers, so that the Hugging Face team gets notified for
   future changes.

6. Change the PR into a draft by clicking on “Convert to draft” on the right of the GitHub pull request web page.

In the following, whenever you have made some progress, don't forget to commit your work and push it to your account so
that it shows in the pull request. Additionally, you should make sure to update your work with the current main from
time to time by doing:

```bash
git fetch upstream
git merge upstream/main
```

In general, all questions you might have regarding the model or your implementation should be asked in your PR and
discussed/solved in the PR. This way, the Hugging Face team will always be notified when you are committing new code or
if you have a question. It is often very helpful to point the Hugging Face team to your added code so that the Hugging
Face team can efficiently understand your problem or question.

To do so, you can go to the “Files changed” tab where you see all of your changes, go to a line regarding which you
want to ask a question, and click on the “+” symbol to add a comment. Whenever a question or problem has been solved,
you can click on the “Resolve” button of the created comment.

In the same way, the Hugging Face team will open comments when reviewing your code. We recommend asking most questions
on GitHub on your PR. For some very general questions that are not very useful for the public, feel free to ping the
Hugging Face team by Slack or email.

**5. Adapt the generated model's code for brand_new_bert**

At first, we will focus only on the model itself and not care about the tokenizer. All the relevant code should be
found in the generated files `src/transformers/models/brand_new_bert/modeling_brand_new_bert.py` and
`src/transformers/models/brand_new_bert/configuration_brand_new_bert.py`.

Now you can finally start coding :). The generated code in
`src/transformers/models/brand_new_bert/modeling_brand_new_bert.py` will either have the same architecture as BERT if
it's an encoder-only model or BART if it's an encoder-decoder model. At this point, you should remind yourself what
you've learned in the beginning about the theoretical aspects of the model: *How is the model different from BERT or
BART?* Implement those changes, which often means changing the *self-attention* layer, the order of the normalization
layer, etc… Again, it is often useful to look at the similar architecture of already existing models in Transformers to
get a better feeling of how your model should be implemented.

**Note** that at this point, you don't have to be very sure that your code is fully correct or clean. Rather, it is
advised to add a first *unclean*, copy-pasted version of the original code to
`src/transformers/models/brand_new_bert/modeling_brand_new_bert.py` until you feel like all the necessary code is
added. From our experience, it is much more efficient to quickly add a first version of the required code and
improve/correct the code iteratively with the conversion script as described in the next section. The only thing that
has to work at this point is that you can instantiate the 🤗 Transformers implementation of *brand_new_bert*, *i.e.* the
following command should work:

```python
from transformers import BrandNewBertModel, BrandNewBertConfig

model = BrandNewBertModel(BrandNewBertConfig())
```

The above command will create a model according to the default parameters as defined in `BrandNewBertConfig()` with
random weights, thus making sure that the `init()` methods of all components work.
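You can also pass non-default parameters to the configuration to match a specific checkpoint. A minimal sketch - the exact parameter names are hypothetical and depend on your model:

```python
from transformers import BrandNewBertConfig, BrandNewBertModel

config = BrandNewBertConfig(hidden_size=768, num_hidden_layers=12)  # hypothetical parameter names
model = BrandNewBertModel(config)
print(sum(p.numel() for p in model.parameters()))  # sanity-check the parameter count
```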

Note that all random initialization should happen in the `_init_weights` method of your `BrandNewBertPreTrainedModel`
class. It should initialize all leaf modules depending on the variables of the config. Here is an example with the
BERT `_init_weights` method:

```py
def _init_weights(self, module):
    """Initialize the weights"""
    if isinstance(module, nn.Linear):
        module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        if module.bias is not None:
            module.bias.data.zero_()
    elif isinstance(module, nn.Embedding):
        module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        if module.padding_idx is not None:
            module.weight.data[module.padding_idx].zero_()
    elif isinstance(module, nn.LayerNorm):
        module.bias.data.zero_()
        module.weight.data.fill_(1.0)
```

You can have some more custom schemes if you need a special initialization for some modules. For instance, in
`Wav2Vec2ForPreTraining`, the last two linear layers need to have the initialization of the regular PyTorch `nn.Linear`
but all the other ones should use an initialization as above. This is coded like this:

```py
def _init_weights(self, module):
    """Initialize the weights"""
    if isinstance(module, Wav2Vec2ForPreTraining):
        module.project_hid.reset_parameters()
        module.project_q.reset_parameters()
        module.project_hid._is_hf_initialized = True
        module.project_q._is_hf_initialized = True
    elif isinstance(module, nn.Linear):
        module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        if module.bias is not None:
            module.bias.data.zero_()
```

The `_is_hf_initialized` flag is internally used to make sure we only initialize a submodule once. By setting it to
`True` for `module.project_q` and `module.project_hid`, we make sure the custom initialization we did is not overridden
later on, as the `_init_weights` function won't be applied to them.

**6. Write a conversion script**

Next, you should write a conversion script that lets you convert the checkpoint you used to debug *brand_new_bert* in
the original repository to a checkpoint compatible with your just created 🤗 Transformers implementation of
*brand_new_bert*. It is not advised to write the conversion script from scratch, but rather to look through already
existing conversion scripts in 🤗 Transformers for one that has been used to convert a similar model that was written in
the same framework as *brand_new_bert*. Usually, it is enough to copy an already existing conversion script and
slightly adapt it for your use case. Don't hesitate to ask the Hugging Face team to point you to a similar already
existing conversion script for your model.

- If you are porting a model from TensorFlow to PyTorch, a good starting point might be BERT's conversion script [here](https://github.com/huggingface/transformers/blob/7acfa95afb8194f8f9c1f4d2c6028224dbed35a2/src/transformers/models/bert/modeling_bert.py#L91)
- If you are porting a model from PyTorch to PyTorch, a good starting point might be BART's conversion script [here](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bart/convert_bart_original_pytorch_checkpoint_to_pytorch.py)

In the following, we'll quickly explain how PyTorch models store layer weights and define layer names. In PyTorch, the
name of a layer is defined by the name of the class attribute you give the layer. Let's define a dummy model in
PyTorch, called `SimpleModel`, as follows:

```python
from torch import nn


class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.dense = nn.Linear(10, 10)
        self.intermediate = nn.Linear(10, 10)
        self.layer_norm = nn.LayerNorm(10)
```

Now we can create an instance of this model definition, which will fill all weights (`dense`, `intermediate`,
`layer_norm`) with random values. We can print the model to see its architecture:

```python
model = SimpleModel()

print(model)
```

This will print out the following:

```
SimpleModel(
  (dense): Linear(in_features=10, out_features=10, bias=True)
  (intermediate): Linear(in_features=10, out_features=10, bias=True)
  (layer_norm): LayerNorm((10,), eps=1e-05, elementwise_affine=True)
)
```

We can see that the layer names are defined by the name of the class attribute in PyTorch. You can print out the weight
values of a specific layer:

```python
print(model.dense.weight.data)
```

to see that the weights were randomly initialized:

```
tensor([[-0.0818,  0.2207, -0.0749, -0.0030,  0.0045, -0.1569, -0.1598,  0.0212,
         -0.2077,  0.2157],
        [ 0.1044,  0.0201,  0.0990,  0.2482,  0.3116,  0.2509,  0.2866, -0.2190,
          0.2166, -0.0212],
        [-0.2000,  0.1107, -0.1999, -0.3119,  0.1559,  0.0993,  0.1776, -0.1950,
         -0.1023, -0.0447],
        [-0.0888, -0.1092,  0.2281,  0.0336,  0.1817, -0.0115,  0.2096,  0.1415,
         -0.1876, -0.2467],
        [ 0.2208, -0.2352, -0.1426, -0.2636, -0.2889, -0.2061, -0.2849, -0.0465,
          0.2577,  0.0402],
        [ 0.1502,  0.2465,  0.2566,  0.0693,  0.2352, -0.0530,  0.1859, -0.0604,
          0.2132,  0.1680],
        [ 0.1733, -0.2407, -0.1721,  0.1484,  0.0358, -0.0633, -0.0721, -0.0090,
          0.2707, -0.2509],
        [-0.1173,  0.1561,  0.2945,  0.0595, -0.1996,  0.2988, -0.0802,  0.0407,
          0.1829, -0.1568],
        [-0.1164, -0.2228, -0.0403,  0.0428,  0.1339,  0.0047,  0.1967,  0.2923,
          0.0333, -0.0536],
        [-0.1492, -0.1616,  0.1057,  0.1950, -0.2807, -0.2710, -0.1586,  0.0739,
          0.2220,  0.2358]])
```

In the conversion script, you should fill those randomly initialized weights with the exact weights of the
corresponding layer in the checkpoint. *E.g.*

```python
import torch

# retrieve matching layer weights, e.g. by
# recursive algorithm
layer_name = "dense"
pretrained_weight = array_of_dense_layer

model_pointer = getattr(model, "dense")

model_pointer.weight.data = torch.from_numpy(pretrained_weight)
```

While doing so, you must verify that each randomly initialized weight of your PyTorch model and its corresponding
pretrained checkpoint weight exactly match in both **shape and name**. To do so, it is **necessary** to add assert
statements for the shape and print out the names of the checkpoint weights. E.g. you should add statements like:

```python
assert (
    model_pointer.weight.shape == pretrained_weight.shape
), f"Pointer shape of random weight {model_pointer.weight.shape} and array shape of checkpoint weight {pretrained_weight.shape} mismatched"
```

Besides, you should also print out the names of both weights to make sure they match, *e.g.*

```python
logger.info(f"Initialize PyTorch weight {layer_name} from {pretrained_weight.name}")
```

If either the shape or the name doesn't match, you probably assigned the wrong checkpoint weight to a randomly
initialized layer of the 🤗 Transformers implementation.

An incorrect shape is most likely due to an incorrect setting of the config parameters in `BrandNewBertConfig()` that
do not exactly match those that were used for the checkpoint you want to convert. However, it could also be that
PyTorch's implementation of a layer requires the weight to be transposed beforehand.

Finally, you should also check that **all** required weights are initialized and print out all checkpoint weights that
were not used for initialization to make sure the model is correctly converted. It is completely normal that the
conversion trials fail with either a wrong shape statement or a wrong name assignment. This is most likely because either
you used incorrect parameters in `BrandNewBertConfig()`, have a wrong architecture in the 🤗 Transformers
implementation, you have a bug in the `init()` functions of one of the components of the 🤗 Transformers
implementation or you need to transpose one of the checkpoint weights.
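A minimal sketch of this bookkeeping, assuming the checkpoint was loaded into a dictionary called `original_state_dict` whose keys have already been renamed to the 🤗 Transformers naming scheme:

```python
hf_keys = set(model.state_dict().keys())
checkpoint_keys = set(original_state_dict.keys())

print("Checkpoint weights that were not used:", checkpoint_keys - hf_keys)
print("Model weights not initialized from the checkpoint:", hf_keys - checkpoint_keys)
```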

This step should be iterated with the previous step until all weights of the checkpoint are correctly loaded in the
Transformers model. Having correctly loaded the checkpoint into the 🤗 Transformers implementation, you can then save
the model under a folder of your choice `/path/to/converted/checkpoint/folder` that should then contain both a
`pytorch_model.bin` file and a `config.json` file:

```python
model.save_pretrained("/path/to/converted/checkpoint/folder")
```

**7. Implement the forward pass**

Having managed to correctly load the pretrained weights into the 🤗 Transformers implementation, you should now make
sure that the forward pass is correctly implemented. In [Get familiar with the original repository](#3-4-run-a-pretrained-checkpoint-using-the-original-repository), you have already created a script that runs a forward
pass of the model using the original repository. Now you should write an analogous script using the 🤗 Transformers
implementation instead of the original one. It should look as follows:

```python
import torch

model = BrandNewBertModel.from_pretrained("/path/to/converted/checkpoint/folder")
input_ids = torch.tensor([[0, 4, 4, 3, 2, 4, 1, 7, 19]])
output = model(input_ids).last_hidden_state
```

It is very likely that the 🤗 Transformers implementation and the original model implementation don't give the exact
same output the very first time or that the forward pass throws an error. Don't be disappointed - it's expected! First,
you should make sure that the forward pass doesn't throw any errors. It often happens that the wrong dimensions are
695
used leading to a *Dimensionality mismatch* error or that the wrong data type object is used, *e.g.* `torch.long`
instead of `torch.float32`. Don't hesitate to ask the Hugging Face team for help, if you don't manage to solve
certain errors.

The final part to make sure the 🤗 Transformers implementation works correctly is to ensure that the outputs are
equivalent to a precision of `1e-3`. First, you should ensure that the output shapes are identical, *i.e.*
`outputs.shape` should yield the same value for the script of the 🤗 Transformers implementation and the original
implementation. Next, you should make sure that the output values are identical as well. This is one of the most
difficult parts of adding a new model. Common mistakes why the outputs are not identical are:

- Some layers were not added, *i.e.* an *activation* layer was not added, or the residual connection was forgotten
- The word embedding matrix was not tied
- The wrong positional embeddings are used because the original implementation uses an offset
- Dropout is applied during the forward pass. To fix this make sure *model.training is False* and that no dropout
  layer is falsely activated during the forward pass, *i.e.* pass *self.training* to [PyTorch's functional dropout](https://pytorch.org/docs/stable/nn.functional.html?highlight=dropout#torch.nn.functional.dropout)

The best way to fix the problem is usually to look at the forward pass of the original implementation and the 🤗
Transformers implementation side-by-side and check if there are any differences. Ideally, you should debug/print out
intermediate outputs of both implementations of the forward pass to find the exact position in the network where the 🤗
Transformers implementation shows a different output than the original implementation. First, make sure that the
hard-coded `input_ids` in both scripts are identical. Next, verify that the outputs of the first transformation of
the `input_ids` (usually the word embeddings) are identical. And then work your way up to the very last layer of the
network. At some point, you will notice a difference between the two implementations, which should point you to the bug
in the 🤗 Transformers implementation. From our experience, a simple and efficient way is to add many print statements
in both the original implementation and 🤗 Transformers implementation, at the same positions in the network
respectively, and to successively remove print statements showing the same values for intermediate representations.
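A small helper like the following sketch can make those comparisons less tedious; the tensors passed to it are placeholders for whatever intermediate outputs you retrieved from both implementations:

```python
import torch


def compare(name, original_tensor, transformers_tensor, atol=1e-3):
    if torch.allclose(original_tensor, transformers_tensor, atol=atol):
        print(f"{name}: OK")
    else:
        max_diff = (original_tensor - transformers_tensor).abs().max()
        print(f"{name}: MISMATCH, max absolute difference {max_diff:.2e}")


compare("word embeddings", original_embeddings, hf_embeddings)
compare("output of first layer", original_layer_0_output, hf_layer_0_output)
```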

When you're confident that both implementations yield the same output, verify the outputs with
`torch.allclose(original_output, output, atol=1e-3)`, and you're done with the most difficult part! Congratulations - the
work left to be done should be a cakewalk 😊.

**8. Adding all necessary model tests**

At this point, you have successfully added a new model. However, it is very much possible that the model does not yet
fully comply with the required design. To make sure the implementation is fully compatible with 🤗 Transformers, all
common tests should pass. The Cookiecutter should have automatically added a test file for your model, probably under
`tests/models/brand_new_bert/test_modeling_brand_new_bert.py`. Run this test file to verify that all common
tests pass:

```bash
pytest tests/models/brand_new_bert/test_modeling_brand_new_bert.py
```

Having fixed all common tests, it is now crucial to ensure that all the nice work you have done is well tested, so that

- a) The community can easily understand your work by looking at specific tests of *brand_new_bert*
- b) Future changes to your model will not break any important feature of the model.

At first, integration tests should be added. Those integration tests essentially do the same as the debugging scripts
you used earlier to implement the model in 🤗 Transformers. A template of those model tests has already been added by the
Cookiecutter, called `BrandNewBertModelIntegrationTests` and only has to be filled out by you. To ensure that those
tests are passing, run

```bash
RUN_SLOW=1 pytest -sv tests/models/brand_new_bert/test_modeling_brand_new_bert.py::BrandNewBertModelIntegrationTests
```

<Tip>

In case you are using Windows, you should replace `RUN_SLOW=1` with `SET RUN_SLOW=1`.

</Tip>
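Such an integration test could look like the following sketch. The checkpoint name and the expected values are placeholders (the values shown are taken from the example output earlier in this guide) and must be replaced with real outputs of the original model:

```python
import unittest

import torch

from transformers import BrandNewBertModel
from transformers.testing_utils import require_torch, slow


@require_torch
class BrandNewBertModelIntegrationTests(unittest.TestCase):
    @slow
    def test_inference_no_head(self):
        model = BrandNewBertModel.from_pretrained("brandy/brand_new_bert")
        input_ids = torch.tensor([[0, 4, 4, 3, 2, 4, 1, 7, 19]])
        with torch.no_grad():
            output = model(input_ids).last_hidden_state
        # Placeholder values - replace with a slice of the original model's output
        expected_slice = torch.tensor(
            [[[-0.1465, -0.6501, 0.1993], [-0.4417, -0.5920, 0.3450], [-0.5009, -0.7122, 0.4548]]]
        )
        self.assertTrue(torch.allclose(output[:, :3, :3], expected_slice, atol=1e-3))
```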

Second, all features that are special to *brand_new_bert* should be tested additionally in a separate test under
`BrandNewBertModelTester`/`BrandNewBertModelTest`. This part is often forgotten but is extremely useful in two
ways:

- It helps to transfer the knowledge you have acquired during the model addition to the community by showing how the
  special features of *brand_new_bert* should work.
- Future contributors can quickly test changes to the model by running those special tests.


**9. Implement the tokenizer**

Next, we should add the tokenizer of *brand_new_bert*. Usually, the tokenizer is equivalent to or very similar to an
already existing tokenizer of 🤗 Transformers.

It is very important to find/extract the original tokenizer file and to manage to load this file into the 🤗
Transformers' implementation of the tokenizer.

To ensure that the tokenizer works correctly, it is recommended to first create a script in the original repository
that inputs a string and returns the `input_ids`. It could look similar to this (in pseudo-code):

```python
input_str = "This is a long example input string containing special characters .$?-, numbers 2872 234 12 and words."
model = BrandNewBertModel.load_pretrained_checkpoint("/path/to/checkpoint/")
input_ids = model.tokenize(input_str)
```

You might have to take a deeper look again into the original repository to find the correct tokenizer function or you
might even have to make changes to your clone of the original repository to only output the `input_ids`. Having written
a functional tokenization script that uses the original repository, an analogous script for 🤗 Transformers should be
created. It should look similar to this:

```python
from transformers import BrandNewBertTokenizer

input_str = "This is a long example input string containing special characters .$?-, numbers 2872 234 12 and words."

tokenizer = BrandNewBertTokenizer.from_pretrained("/path/to/tokenizer/folder/")

input_ids = tokenizer(input_str).input_ids
```

When both `input_ids` yield the same values, as a final step a tokenizer test file should also be added.

Analogous to the modeling test files of *brand_new_bert*, the tokenization test files of *brand_new_bert* should
contain a couple of hard-coded integration tests.
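Such a hard-coded test could look like the following sketch; the class name and the expected IDs are placeholders, and the IDs must come from the original tokenizer:

```python
import unittest

from transformers import BrandNewBertTokenizer


class BrandNewBertTokenizationIntegrationTest(unittest.TestCase):
    def test_tokenization(self):
        input_str = "This is a long example input string containing special characters .$?-, numbers 2872 234 12 and words."
        # Placeholder IDs - replace with the IDs produced by the original tokenizer
        expected_ids = [0, 4, 4, 3, 2, 4, 1, 7, 19]

        tokenizer = BrandNewBertTokenizer.from_pretrained("/path/to/tokenizer/folder/")
        self.assertListEqual(tokenizer(input_str).input_ids, expected_ids)
```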

**10. Run End-to-end integration tests**

Having added the tokenizer, you should also add a couple of end-to-end integration tests using both the model and the
tokenizer to `tests/models/brand_new_bert/test_modeling_brand_new_bert.py` in 🤗 Transformers.
Such a test should show on a meaningful
text-to-text sample that the 🤗 Transformers implementation works as expected. A meaningful text-to-text sample can
include *e.g.* a source-to-target-translation pair, an article-to-summary pair, a question-to-answer pair, etc… If none
of the ported checkpoints has been fine-tuned on a downstream task it is enough to simply rely on the model tests. In a
final step to ensure that the model is fully functional, it is advised that you also run all tests on GPU. It can
happen that you forgot to add some `.to(self.device)` statements to internal tensors of the model, which in such a
test would show in an error. In case you have no access to a GPU, the Hugging Face team can take care of running those
tests for you.
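As a sketch, such an end-to-end check for a summarization checkpoint could look as follows; `BrandNewBertForConditionalGeneration` is a hypothetical head class and the input text is a placeholder:

```python
import torch

from transformers import BrandNewBertForConditionalGeneration, BrandNewBertTokenizer

model = BrandNewBertForConditionalGeneration.from_pretrained("/path/to/converted/checkpoint/folder")
tokenizer = BrandNewBertTokenizer.from_pretrained("/path/to/tokenizer/folder/")

input_ids = tokenizer("Summarize: some long article text ...", return_tensors="pt").input_ids
with torch.no_grad():
    generated_ids = model.generate(input_ids)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```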

**11. Add Docstring**

Now, all the necessary functionality for *brand_new_bert* is added - you're almost done! The only thing left to add is
a nice docstring and a doc page. The Cookiecutter should have added a template file called
`docs/source/model_doc/brand_new_bert.md` that you should fill out. Users of your model will usually first look at
this page before using your model. Hence, the documentation must be understandable and concise. It is very useful for
the community to add some *Tips* to show how the model should be used. Don't hesitate to ping the Hugging Face team
regarding the docstrings.

Next, make sure that the docstring added to `src/transformers/models/brand_new_bert/modeling_brand_new_bert.py` is
correct and includes all necessary inputs and outputs. We have a detailed guide about writing documentation and our docstring format [here](writing-documentation). It is always good to remind oneself that documentation should
be treated at least as carefully as the code in 🤗 Transformers since the documentation is usually the first contact
point of the community with the model.

**Code refactor**

Great, now you have added all the necessary code for *brand_new_bert*. At this point, you should correct some potential
incorrect code style by running:

```bash
make style
```

and verify that your coding style passes the quality check:

```bash
make quality
```

There are a couple of other very strict design tests in 🤗 Transformers that might still be failing, which show up in
the tests of your pull request. This is often because of some missing information in the docstring or some incorrect
naming. The Hugging Face team will surely help you if you're stuck here.

Lastly, it is always a good idea to refactor one's code after having ensured that the code works correctly. With all
tests passing, now it's a good time to go over the added code again and do some refactoring.

You have now finished the coding part, congratulations! 🎉 You are Awesome! 😎

**12. Upload the models to the model hub**

In this final part, you should convert and upload all checkpoints to the model hub and add a model card for each
uploaded model checkpoint. You can get familiar with the hub functionalities by reading our [Model sharing and uploading Page](model_sharing). You should work alongside the Hugging Face team here to decide on a fitting name for each
checkpoint and to get the required access rights to be able to upload the model under the author's organization of
*brand_new_bert*. The `push_to_hub` method, present in all models in `transformers`, is a quick and efficient way to push your checkpoint to the hub. A little snippet is pasted below:

```python
brand_new_bert.push_to_hub("brand_new_bert")
# Uncomment the following line to push to an organization.
# brand_new_bert.push_to_hub("<organization>/brand_new_bert")
```

It is worth spending some time to create fitting model cards for each checkpoint. The model cards should highlight the
specific characteristics of this particular checkpoint, *e.g.* on which dataset was the checkpoint
pretrained/fine-tuned? On what downstream task should the model be used? The cards should also include some code on how
to correctly use the model.

**13. (Optional) Add notebook**

It is very helpful to add a notebook that showcases in-detail how *brand_new_bert* can be used for inference and/or
fine-tuned on a downstream task. This is not mandatory to merge your PR, but very useful for the community.

**14. Submit your finished PR**

You're done programming now and can move to the last step, which is getting your PR merged into main. Usually, the
Hugging Face team should have helped you already at this point, but it is worth taking some time to give your finished
PR a nice description and potentially add comments to your code, if you want to point out certain design choices to your
reviewer.

### Share your work!!

Now, it's time to get some credit from the community for your work! Having completed a model addition is a major
contribution to Transformers and the whole NLP community. Your code and the ported pre-trained models will certainly be
used by hundreds and possibly even thousands of developers and researchers. You should be proud of your work and share
your achievements with the community.

**You have made another model that is super easy to access for everyone in the community! 🤯**