<!---
Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

## Sequence to Sequence Training and Evaluation

This directory contains examples for finetuning and evaluating transformers on summarization and translation tasks.
Please tag @patil-suraj with any issues/unexpected behaviors, or send a PR!
For deprecated `bertabs` instructions, see [`bertabs/README.md`](https://github.com/huggingface/transformers/blob/master/examples/research_projects/bertabs/README.md).

### Supported Architectures

- `BartForConditionalGeneration`
- `MarianMTModel`
- `PegasusForConditionalGeneration`
- `MBartForConditionalGeneration`
- `FSMTForConditionalGeneration`
- `T5ForConditionalGeneration`

This directory is in a bit of a messy state and is undergoing some cleanup; please bear with us in the meantime :-) Here are the instructions for using the new and old scripts to fine-tune sequence-to-sequence models.

## New script

The new script for fine-tuning a model on a summarization or translation task is `run_seq2seq.py`. It is a lightweight example of how to download and preprocess a dataset from the [🤗 Datasets](https://github.com/huggingface/datasets) library or use your own files (jsonlines or csv), then fine-tune one of the architectures above on it.

For custom datasets in `jsonlines` format please see: https://huggingface.co/docs/datasets/loading_datasets.html#json-files

Here is an example on a summarization task:
```bash
python examples/seq2seq/run_seq2seq.py \
    --model_name_or_path t5-small \
    --do_train \
    --do_eval \
    --task summarization \
    --dataset_name xsum \
    --output_dir ~/tmp/tst-summarization \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --overwrite_output_dir \
    --predict_with_generate
```

And here is how you would use it on your own files (replace `path_to_csv_or_jsonlines_file`, `text_column_name` and `summary_column_name` with the relevant values):
```bash
python examples/seq2seq/run_seq2seq.py \
    --model_name_or_path t5-small \
    --do_train \
    --do_eval \
    --task summarization \
    --train_file path_to_csv_or_jsonlines_file \
    --validation_file path_to_csv_or_jsonlines_file \
    --output_dir ~/tmp/tst-summarization \
    --overwrite_output_dir \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --predict_with_generate \
    --text_column text_column_name \
    --summary_column summary_column_name
```
The training and validation files should have a column for the input texts and a column for the summaries.
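
For illustration, here is a minimal sketch of what a custom `jsonlines` training file could look like. The field names below simply mirror the `text_column_name`/`summary_column_name` placeholders from the command above and the contents are made up; use whatever column names your data actually has and pass them via `--text_column`/`--summary_column`.
```bash
# Hypothetical example: each line of the JSON Lines file is one training example.
cat > train.json <<'EOF'
{"text_column_name": "Full text of the first document to summarize ...", "summary_column_name": "Its reference summary."}
{"text_column_name": "Full text of the second document ...", "summary_column_name": "Another reference summary."}
EOF
```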

Here is an example of fine-tuning on a translation task:
```bash
python examples/seq2seq/run_seq2seq.py \
    --model_name_or_path sshleifer/student_marian_en_ro_6_1 \
    --do_train \
    --do_eval \
    --task translation_en_to_ro \
    --dataset_name wmt16 \
    --dataset_config_name ro-en \
    --source_lang en_XX \
    --target_lang ro_RO \
    --output_dir ~/tmp/tst-translation \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --overwrite_output_dir \
    --predict_with_generate
```

And here is how you would use it on your own files (replace `path_to_jsonlines_file` with the relevant values):
```bash
python examples/seq2seq/run_seq2seq.py \
    --model_name_or_path sshleifer/student_marian_en_ro_6_1 \
    --do_train \
    --do_eval \
    --task translation_en_to_ro \
    --dataset_name wmt16 \
    --dataset_config_name ro-en \
    --source_lang en_XX \
    --target_lang ro_RO \
    --train_file path_to_jsonlines_file \
    --validation_file path_to_jsonlines_file \
    --output_dir ~/tmp/tst-translation \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --overwrite_output_dir \
    --predict_with_generate
```
Here the files are expected to be in JSON Lines format, with each input being a dictionary with a `"translation"` key containing one entry per language (here `"en"` and `"ro"`).
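
For illustration, a minimal sketch of such a file (only the `"translation"`/`"en"`/`"ro"` structure comes from the description above; the sentence pairs are made up):
```bash
# Hypothetical example: each line holds one translation pair under the "translation" key.
cat > train.json <<'EOF'
{"translation": {"en": "An English sentence.", "ro": "O propoziție în română."}}
{"translation": {"en": "Another English sentence.", "ro": "O altă propoziție în română."}}
EOF
```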

## Old script

The new script is still very recent and hasn't been widely tested yet. It is also missing some functionality offered by the old
script, which is why we are leaving the old script here for now.

### Download the Datasets

#### XSUM

```bash
cd examples/seq2seq
wget https://cdn-datasets.huggingface.co/summarization/xsum.tar.gz
tar -xzvf xsum.tar.gz
export XSUM_DIR=${PWD}/xsum
```
This should create a directory called `xsum/` with files like `test.source`.
To use your own data, copy that file format. Each article to be summarized is on its own line.

#### CNN/DailyMail

```bash
cd examples/seq2seq
wget https://cdn-datasets.huggingface.co/summarization/cnn_dm_v2.tgz
tar -xzvf cnn_dm_v2.tgz  # empty lines removed
mv cnn_cln cnn_dm
export CNN_DIR=${PWD}/cnn_dm
```
This should create a directory called `cnn_dm/` with 6 files.

#### WMT16 English-Romanian Translation Data

Download with this command:
```bash
wget https://cdn-datasets.huggingface.co/translation/wmt_en_ro.tar.gz
tar -xzvf wmt_en_ro.tar.gz
export ENRO_DIR=${PWD}/wmt_en_ro
```
This should create a directory called `wmt_en_ro/` with 6 files.

#### WMT English-German

```bash
wget https://cdn-datasets.huggingface.co/translation/wmt_en_de.tgz
tar -xzvf wmt_en_de.tgz
export DATA_DIR=${PWD}/wmt_en_de
```

#### FSMT datasets (wmt)

Refer to the scripts starting with `eval_` under:
https://github.com/huggingface/transformers/tree/master/scripts/fsmt

#### Pegasus (multiple datasets)

Multiple eval datasets are available for download from:
https://github.com/stas00/porting/tree/master/datasets/pegasus


#### Your Data

If you are using your own data, it must be formatted as one directory with 6 files:
```
train.source
train.target
val.source
val.target
test.source
test.target
```
The `.source` files are the input, the `.target` files are the desired output.
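
As an optional sanity check (a minimal sketch, assuming your data lives in `$DATA_DIR`), each `.source` file should have exactly as many lines as its `.target` counterpart, since inputs and outputs are paired line by line:
```bash
# Optional sanity check: the line counts of paired .source/.target files should match.
for split in train val test; do
    wc -l $DATA_DIR/$split.source $DATA_DIR/$split.target
done
```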

### Potential issues

- native AMP (`--fp16` and no apex) may lead to a huge memory leak and require 10x GPU memory. This has been fixed in pytorch-nightly, and the minimal official version to have this fix will be pytorch-1.7.1. Until then, if you have to use mixed precision, please use AMP only with pytorch-nightly or NVIDIA's apex. Reference: https://github.com/huggingface/transformers/issues/8403


### Tips and Tricks

General Tips:
- since you need to run from `examples/seq2seq`, and likely need to modify code, the easiest workflow is to fork transformers, clone your fork, and run `pip install -e .` before you get started.
- try `--freeze_encoder` or `--freeze_embeds` for faster training/larger batch size.  (3hr per epoch with bs=8, see the "xsum_shared_task" command below)
- `fp16_opt_level=O1` (the default works best).
- In addition to the pytorch-lightning .ckpt checkpoint, a transformers checkpoint will be saved.
Load it with `BartForConditionalGeneration.from_pretrained(f'{output_dir}/best_tfmr')`.
- At the moment, `--do_predict` does not work in a multi-gpu setting. You need to use `evaluate_checkpoint` or the `run_eval.py` code.
- This warning can be safely ignored:
    > "Some weights of BartForConditionalGeneration were not initialized from the model checkpoint at facebook/bart-large-xsum and are newly initialized: ['final_logits_bias']"
- Both finetuning and eval are 30% faster with `--fp16`. For that you need to [install apex](https://github.com/NVIDIA/apex#quick-start).
- Read scripts before you run them!

Summarization Tips:
- 1 epoch at batch size 1 for bart-large takes 24 hours and requires 13GB GPU RAM with fp16 on an NVIDIA V100.
- If you want to run experiments on improving the summarization finetuning process, try the XSUM Shared Task (below). It's faster to train than CNNDM because the summaries are shorter.
- For CNN/DailyMail, the default `val_max_target_length` and `test_max_target_length` will truncate the ground truth labels, resulting in slightly higher rouge scores. To get accurate rouge scores, you should rerun `calculate_rouge` on the `{output_dir}/test_generations.txt` file saved by `trainer.test()`.
- `--max_target_length=60 --val_max_target_length=60 --test_max_target_length=100 ` is a reasonable setting for XSUM.
- `wandb` can be used by specifying `--logger_name wandb`. It is useful for reproducibility. Specify the environment variable `WANDB_PROJECT='hf_xsum'` to do the XSUM shared task.
- If you are finetuning on your own dataset, start from `distilbart-cnn-12-6` if you want long summaries and `distilbart-xsum-12-6` if you want short summaries.
(It rarely makes sense to start from `bart-large` unless you are researching fine-tuning methods.)

**Update 2018-07-18**
Datasets: `LegacySeq2SeqDataset` will be used for all tokenizers without a `prepare_seq2seq_batch` method. Otherwise, `Seq2SeqDataset` will be used.
Future work/help wanted: A new dataset to support multilingual tasks.


### Fine-tuning using Seq2SeqTrainer
To use `Seq2SeqTrainer` for fine-tuning, you should use the `finetune_trainer.py` script. It subclasses `Trainer` to extend it for seq2seq training. Apart from the `Trainer`-related `TrainingArguments`, it shares the same argument names as the `finetune.py` script. One notable difference is that calculating generative metrics (BLEU, ROUGE) is optional and is controlled with the `--predict_with_generate` argument.

With PyTorch 1.6+, it will automatically use native AMP when `--fp16` is set.

To see all the possible command line options, run:

```bash
python finetune_trainer.py --help
```

For multi-GPU training, use `torch.distributed.launch`, e.g. with 2 GPUs:
```bash
python -m torch.distributed.launch --nproc_per_node=2  finetune_trainer.py ...
```

**At the moment, `Seq2SeqTrainer` does not support *with teacher* distillation.**

All `Seq2SeqTrainer`-based fine-tuning scripts are included in the `builtin_trainer` directory.

#### TPU Training
`Seq2SeqTrainer` supports TPU training with a few caveats:
1. As the `generate` method does not work on TPU at the moment, `predict_with_generate` cannot be used. You should use `--prediction_loss_only` to only calculate the loss, and not set `--do_predict` or `--predict_with_generate`.
2. All sequences should be padded to be of equal length to avoid extremely slow training. (`finetune_trainer.py` does this automatically when running on TPU.)

We provide a very simple launcher script named `xla_spawn.py` that lets you run our example scripts on multiple TPU cores without any boilerplate. Just pass a `--num_cores` flag to this script, then your regular training script with its arguments (this is similar to the `torch.distributed.launch` helper for `torch.distributed`).
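
As a rough sketch, an invocation could look like the following (the path to `xla_spawn.py` and the remaining `finetune_trainer.py` arguments depend on your setup, so adjust them to your checkout):
```bash
# Hypothetical invocation: run finetune_trainer.py on 8 TPU cores via the xla_spawn.py helper.
# Add your usual finetune_trainer.py arguments after the script name.
python xla_spawn.py --num_cores 8 \
    finetune_trainer.py \
    --prediction_loss_only
```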

The `builtin_trainer/finetune_tpu.sh` script provides the minimal arguments needed for TPU training.

The following command fine-tunes `sshleifer/student_marian_en_ro_6_3` on TPU V3-8 and should complete one epoch in ~5-6 mins.

```bash
./builtin_trainer/train_distil_marian_enro_tpu.sh
```

## Evaluation Commands

To create summaries for each article in the dataset, we use `run_eval.py`; here are a few commands that run eval for different tasks and models.
If 'translation' is in your task name, the computed metric will be BLEU. Otherwise, ROUGE will be used.

For T5, you need to specify `--task translation_{src}_to_{tgt}` as follows:
```bash
export DATA_DIR=wmt_en_ro
./run_eval.py t5-base \
    $DATA_DIR/val.source t5_val_generations.txt \
    --reference_path $DATA_DIR/val.target \
    --score_path enro_bleu.json \
    --task translation_en_to_ro \
    --n_obs 100 \
    --device cuda \
    --fp16 \
    --bs 32
```

This command works for MBART, although the BLEU score is suspiciously low.
```bash
export DATA_DIR=wmt_en_ro
./run_eval.py facebook/mbart-large-en-ro $DATA_DIR/val.source mbart_val_generations.txt \
    --reference_path $DATA_DIR/val.target \
    --score_path enro_bleu.json \
    --task translation \
    --n_obs 100 \
    --device cuda \
    --fp16 \
    --bs 32
```

Summarization (xsum will be very similar):
```bash
export DATA_DIR=cnn_dm
./run_eval.py sshleifer/distilbart-cnn-12-6 $DATA_DIR/val.source dbart_val_generations.txt \
    --reference_path $DATA_DIR/val.target \
    --score_path cnn_rouge.json \
    --task summarization \
    --n_obs 100 \
    --device cuda \
    --max_target_length 56 \
    --fp16 \
    --bs 32
```

### Multi-GPU Evaluation
Here is a command to run xsum evaluation on 8 GPUs. It is more than linearly faster than `run_eval.py` in some cases
because it uses SortishSampler to minimize padding. You can also use it on 1 GPU. `data_dir` must have
`{type_path}.source` and `{type_path}.target`. Run `./run_distributed_eval.py --help` for all command line arguments.

```bash
python -m torch.distributed.launch --nproc_per_node=8  run_distributed_eval.py \
    --model_name sshleifer/distilbart-large-xsum-12-3  \
    --save_dir xsum_generations \
    --data_dir xsum \
    --fp16  # you can pass generate kwargs like num_beams here, just like run_eval.py
```

Contributions that implement this command for other distributed hardware setups are welcome!

#### Single-GPU Eval: Tips and Tricks

When using `run_eval.py`, the following features can be useful:

* if you are running the script multiple times and want to make it easier to track what arguments produced that output, use `--dump-args`. Along with the results it will also dump any custom params that were passed to the script. For example, if you used `--num_beams 8 --early_stopping true`, the output will be:
   ```
   {'bleu': 26.887, 'n_obs': 10, 'runtime': 1, 'seconds_per_sample': 0.1, 'num_beams': 8, 'early_stopping': True}
   ```

   `--info` is an additional argument available for the same purpose of tracking the conditions of the experiment. It's useful for passing things that weren't in the argument list, e.g. a language pair `--info "lang:en-ru"`. If you pass `--info` without a value, it will fall back to the current date/time string, e.g. `2020-09-13 18:44:43`.

   If using `--dump-args --info`, the output will be:
   ```
   {'bleu': 26.887, 'n_obs': 10, 'runtime': 1, 'seconds_per_sample': 0.1, 'num_beams': 8, 'early_stopping': True, 'info': '2020-09-13 18:44:43'}
   ```

   If using `--dump-args --info "pair:en-ru chkpt=best"`, the output will be:
   ```
   {'bleu': 26.887, 'n_obs': 10, 'runtime': 1, 'seconds_per_sample': 0.1, 'num_beams': 8, 'early_stopping': True, 'info': 'pair=en-ru chkpt=best'}
   ```


* if you need to perform a parametric search to find the hparam values that lead to the highest BLEU score, let `run_eval_search.py` do the searching for you.

   The script accepts the exact same arguments as `run_eval.py`, plus an additional argument `--search`. The value of `--search` is parsed, reformatted and fed to ``run_eval.py`` as additional args.

   The format for the `--search` value is a simple string with hparams and colon separated values to try, e.g.:
   ```
    --search "num_beams=5:10 length_penalty=0.8:1.0:1.2 early_stopping=true:false"
   ```
   which will generate `12` (`2*3*2`) searches over the product of the hparam values. For example, the search above will invoke `run_eval.py` repeatedly with:
   ```
    --num_beams 5 --length_penalty 0.8 --early_stopping true
    --num_beams 5 --length_penalty 0.8 --early_stopping false
    [...]
    --num_beams 10 --length_penalty 1.2 --early_stopping false
   ```

   On completion, the script prints a markdown table of the results sorted by the best BLEU score, along with the winning arguments.

```
bleu  | num_beams | length_penalty | early_stopping
----- | --------- | -------------- | --------------
26.71 |         5 |            1.1 |              1
26.66 |         5 |            0.9 |              1
26.66 |         5 |            0.9 |              0
26.41 |         5 |            1.1 |              0
21.94 |         1 |            0.9 |              1
21.94 |         1 |            0.9 |              0
21.94 |         1 |            1.1 |              1
21.94 |         1 |            1.1 |              0

Best score args:
stas/wmt19-en-ru data/en-ru/val.source data/en-ru/test_translations.txt --reference_path data/en-ru/val.target --score_path data/en-ru/test_bleu.json --bs 8 --task translation --num_beams 5 --length_penalty 1.1 --early_stopping True
```

If you pass `--info "some experiment-specific info"` it will get printed before the results table - this is useful for scripting and multiple runs, so one can tell the different sets of results from each other.


### Contributing
- follow the standard contributing guidelines and code of conduct.
- add tests to `test_seq2seq_examples.py`
- To run only the seq2seq tests, you must be in the root of the repository and run:
```bash
pytest examples/seq2seq/
```

### Converting pytorch-lightning checkpoints
pytorch-lightning's `--do_predict` often fails; after you are done training, the best way to evaluate your model is to convert the checkpoint.

This should be done for you, with a file called `{save_dir}/best_tfmr`.

If that file doesn't exist but you have a lightning `.ckpt` file, you can run
```bash
python convert_pl_checkpoint_to_hf.py PATH_TO_CKPT  randomly_initialized_hf_model_path save_dir/best_tfmr
```
Then run either `run_eval.py` or `run_distributed_eval.py` with `save_dir/best_tfmr` (see the previous sections).
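
For example, a hedged sketch that reuses the `run_eval.py` argument pattern from the Evaluation Commands section (the data paths, output names and task are placeholders to adapt to your setup):
```bash
# Hypothetical sketch: evaluate the converted checkpoint with run_eval.py.
./run_eval.py save_dir/best_tfmr $DATA_DIR/val.source val_generations.txt \
    --reference_path $DATA_DIR/val.target \
    --score_path metrics.json \
    --task summarization \
    --bs 32
```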


# Experimental Features
These features are harder to use and not always useful.

### Dynamic Batch Size for MT
`finetune.py` has a command line arg `--max_tokens_per_batch` that allows batches to be dynamically sized.
This feature can only be used:
- with fairseq installed
- on 1 GPU
- without sortish sampler
- after calling `./save_len_file.py $tok $data_dir`

For example,
```bash
./save_len_file.py Helsinki-NLP/opus-mt-en-ro  wmt_en_ro
./dynamic_bs_example.sh --max_tokens_per_batch=2000 --output_dir benchmark_dynamic_bs
```
splits `wmt_en_ro/train` into 11,197 unevenly sized batches and can finish one epoch in 8 minutes on a V100.

For comparison,
```bash
./dynamic_bs_example.sh --sortish_sampler --train_batch_size 48
```
uses 12,723 batches of length 48 and takes slightly longer, 9.5 minutes.

The feature is still experimental, because:
+ we could make it much more robust with memory-mapped/preprocessed datasets.
+ The speedup over sortish sampler is not that large at the moment.