"vscode:/vscode.git/clone" did not exist on "87b9ec3843f7f9a81253075f92c9e6537ecefe1c"
migration.md 6.4 KB
Newer Older
1
# Migrating from previous packages

## Migrating from pytorch-transformers to transformers

Here is a quick summary of what you should take care of when migrating from `pytorch-transformers` to `transformers`.

### Positional order of some models' keyword inputs (`attention_mask`, `token_type_ids`...) changed

To be able to use TorchScript (see #1010, #1204 and #1195), the positional order of some models' **keyword inputs** (`attention_mask`, `token_type_ids`...) has been changed.

If you used to call the models with keyword arguments, e.g. `model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)`, this should not cause any change.

If you used to pass these keyword arguments positionally, e.g. `model(input_ids, attention_mask, token_type_ids)`, you may have to double-check the exact order of the input arguments.
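
For illustration, here is a minimal sketch of the two calling styles (assuming a plain BERT model and toy tensors; the model's signature remains the authoritative reference for the exact order):

```python
import torch
from transformers import BertModel

model = BertModel.from_pretrained('bert-base-uncased')

input_ids = torch.tensor([[101, 7592, 2088, 102]])   # toy token ids
attention_mask = torch.ones_like(input_ids)
token_type_ids = torch.zeros_like(input_ids)

# Safe: keyword arguments are unaffected by the reordering
outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)

# Positional arguments must follow the new order (for BERT, attention_mask now
# comes before token_type_ids), so double-check the signature before relying on it
outputs = model(input_ids, attention_mask, token_type_ids)
```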

## Migrating from pytorch-pretrained-bert

Here is a quick summary of what you should take care of when migrating from `pytorch-pretrained-bert` to `transformers`.

### Models always output `tuples`

The main breaking change when migrating from `pytorch-pretrained-bert` to `transformers` is that every model's forward method always outputs a `tuple` with various elements depending on the model and the configuration parameters.

The exact content of the tuples for each model is detailed in the models' docstrings and the [documentation](https://huggingface.co/transformers/).

In almost every case, you can simply take the first element of the output tuple as the output you previously used in `pytorch-pretrained-bert`.

Here is a `pytorch-pretrained-bert` to `transformers` conversion example for a `BertForSequenceClassification` classification model:

```python
# Let's load our model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

# If you used to have this line in pytorch-pretrained-bert:
loss = model(input_ids, labels=labels)

# Now just use this line in transformers to extract the loss from the output tuple:
outputs = model(input_ids, labels=labels)
loss = outputs[0]

# In transformers you can also access the logits:
loss, logits = outputs[:2]

# And even the attention weights if you configure the model to output them (and other outputs too, see the docstrings and documentation)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', output_attentions=True)
outputs = model(input_ids, labels=labels)
loss, logits, attentions = outputs
```

### Serialization

Breaking changes in the `from_pretrained()` method:

1. Models are now set in evaluation mode by default when instantiated with the `from_pretrained()` method. To train them, don't forget to set them back in training mode (`model.train()`) to activate the dropout modules.

2. The additional `*inputs` and `**kwargs` arguments supplied to the `from_pretrained()` method used to be passed directly to the underlying model class's `__init__()` method. They are now used to update the model configuration first, which can break derived model classes built on the previous `BertForSequenceClassification` examples. More precisely, the positional arguments `*inputs` provided to `from_pretrained()` are directly forwarded to the model's `__init__()` method, while the keyword arguments `**kwargs` (i) that match configuration class attributes are used to update those attributes and (ii) that don't match any configuration class attribute are forwarded to the model's `__init__()` method. See the sketch below.
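
As an illustration of both points, here is a minimal sketch (assuming a `BertForSequenceClassification` checkpoint; `num_labels` is just one example of a configuration attribute):

```python
from transformers import BertForSequenceClassification

# `num_labels` matches a configuration attribute, so from_pretrained() uses it
# to update the configuration instead of passing it to __init__() directly.
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)
assert model.config.num_labels == 3

# The model is returned in evaluation mode (point 1); switch back to training
# mode before fine-tuning so the dropout modules are active.
model.train()
```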

Also, while not a breaking change, the serialization methods have been standardized, and you should probably switch to the new `save_pretrained(save_directory)` method if you were using any other serialization method before.

Here is an example:

```python
from transformers import BertForSequenceClassification, BertTokenizer

### Let's load a model and tokenizer
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

### Do some stuff to our model and tokenizer
# Ex: add new tokens to the vocabulary and embeddings of our model
tokenizer.add_tokens(['[SPECIAL_TOKEN_1]', '[SPECIAL_TOKEN_2]'])
model.resize_token_embeddings(len(tokenizer))
# Train our model (train() stands in for your own training loop)
train(model)

### Now let's save our model and tokenizer to a directory
model.save_pretrained('./my_saved_model_directory/')
tokenizer.save_pretrained('./my_saved_model_directory/')

### Reload the model and the tokenizer
model = BertForSequenceClassification.from_pretrained('./my_saved_model_directory/')
tokenizer = BertTokenizer.from_pretrained('./my_saved_model_directory/')
```

### Optimizers: BertAdam & OpenAIAdam are now AdamW, schedules are standard PyTorch schedules

The two optimizers previously included, `BertAdam` and `OpenAIAdam`, have been replaced by a single `AdamW` optimizer, which has a few differences:

- it only implements the weight decay correction,
- schedules are now external (see below),
- gradient clipping is now also external (see below).

The new `AdamW` optimizer matches the PyTorch `Adam` optimizer API and lets you use standard PyTorch or apex methods for scheduling and gradient clipping.

The schedules are now standard [PyTorch learning rate schedulers](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate) and not part of the optimizer anymore.

Here is a conversion example from `BertAdam` with a linear warmup and decay schedule to `AdamW` with the same schedule:

```python
# Parameters:
lr = 1e-3
max_grad_norm = 1.0
num_training_steps = 1000
num_warmup_steps = 100
warmup_proportion = float(num_warmup_steps) / float(num_training_steps)  # 0.1

### Previously, the BertAdam optimizer was instantiated like this:
optimizer = BertAdam(model.parameters(), lr=lr, schedule='warmup_linear', warmup=warmup_proportion, t_total=num_training_steps)
### and used like this:
for batch in train_data:
    loss = model(batch)
    loss.backward()
    optimizer.step()

### In transformers, the optimizer and schedule are split and instantiated like this:
optimizer = AdamW(model.parameters(), lr=lr, correct_bias=False)  # To reproduce BertAdam specific behavior set correct_bias=False
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps)  # PyTorch scheduler
### and used like this:
for batch in train_data:
    loss = model(batch)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)  # Gradient clipping is not in AdamW anymore (so you can use amp without issue)
    optimizer.step()
    scheduler.step()
```