## Sequence to Sequence

This directory contains examples for finetuning and evaluating transformers on summarization and translation tasks.
Please tag @sshleifer with any issues/unexpected behaviors, or send a PR!
For `bertabs` instructions, see [`bertabs/README.md`](bertabs/README.md).

### Supported Architectures

- `BartForConditionalGeneration` (and anything that inherits from it)
- `MarianMTModel`
- `PegasusForConditionalGeneration`
- `MBartForConditionalGeneration`
- `FSMTForConditionalGeneration`
- `T5ForConditionalGeneration`

## Datasets

#### XSUM

```bash
cd examples/seq2seq
wget https://cdn-datasets.huggingface.co/summarization/xsum.tar.gz
tar -xzvf xsum.tar.gz
export XSUM_DIR=${PWD}/xsum
```
This should make a directory called `xsum/` with files like `test.source`.
To use your own data, copy that file format: each article to be summarized is on its own line.
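
As a quick sanity check of data in this format (a hedged example using the `xsum/` files above; substitute your own paths), the `.source` and `.target` files of each split should have the same number of lines:

```bash
# each line of train.source pairs with the same line of train.target
wc -l xsum/train.source xsum/train.target   # the two line counts should match
head -n 1 xsum/train.source                 # one full article on a single line
head -n 1 xsum/train.target                 # the corresponding one-line summary
```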

#### CNN/DailyMail

```bash
cd examples/seq2seq
wget https://cdn-datasets.huggingface.co/summarization/cnn_dm_v2.tgz
tar -xzvf cnn_dm_v2.tgz  # empty lines removed
mv cnn_cln cnn_dm
export CNN_DIR=${PWD}/cnn_dm
```
This should make a directory called `cnn_dm/` with 6 files.

#### WMT16 English-Romanian Translation Data

Download with this command:
```bash
wget https://cdn-datasets.huggingface.co/translation/wmt_en_ro.tar.gz
tar -xzvf wmt_en_ro.tar.gz
export ENRO_DIR=${PWD}/wmt_en_ro
```
This should make a directory called `wmt_en_ro/` with 6 files.

#### WMT English-German

```bash
wget https://cdn-datasets.huggingface.co/translation/wmt_en_de.tgz
tar -xzvf wmt_en_de.tgz
export DATA_DIR=${PWD}/wmt_en_de
```

#### FSMT datasets (wmt)

Refer to the scripts starting with `eval_` under:
https://github.com/huggingface/transformers/tree/master/scripts/fsmt

#### Pegasus (multiple datasets)

Multiple eval datasets are available for download from: 
https://github.com/stas00/porting/tree/master/datasets/pegasus


#### Private Data

If you are using your own data, it must be formatted as one directory with 6 files:
```
train.source
train.target
val.source
val.target
test.source
test.target
```
The `.source` files are the input, the `.target` files are the desired output.
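
For illustration only, here is one hedged way to produce those six files from a hypothetical tab-separated file `my_data.tsv` (one `document<TAB>summary` pair per line, 1,000 lines assumed); the input format and split sizes are assumptions, not requirements of these scripts:

```bash
# my_data.tsv is a hypothetical input with one "document<TAB>summary" pair per line (1,000 lines assumed)
mkdir -p my_dataset
cut -f1 my_data.tsv > all.source   # documents
cut -f2 my_data.tsv > all.target   # summaries
head -n 800 all.source > my_dataset/train.source
head -n 800 all.target > my_dataset/train.target
sed -n '801,900p' all.source > my_dataset/val.source
sed -n '801,900p' all.target > my_dataset/val.target
tail -n 100 all.source > my_dataset/test.source
tail -n 100 all.target > my_dataset/test.target
```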

### Tips and Tricks

General Tips:
- Since you need to run from `examples/seq2seq`, and will likely need to modify code, the easiest workflow is to fork transformers, clone your fork, and run `pip install -e .` before you get started.
- Try `--freeze_encoder` or `--freeze_embeds` for faster training/larger batch size (3hr per epoch with bs=8, see the "xsum_shared_task" command below).
- `fp16_opt_level=O1` (the default) works best.
- In addition to the pytorch-lightning .ckpt checkpoint, a transformers checkpoint will be saved.
Load it with `BartForConditionalGeneration.from_pretrained(f'{output_dir}/best_tfmr')`.
- At the moment, `--do_predict` does not work in a multi-GPU setting. You need to use `evaluate_checkpoint` or the `run_eval.py` code.
- This warning can be safely ignored:
    > "Some weights of BartForConditionalGeneration were not initialized from the model checkpoint at facebook/bart-large-xsum and are newly initialized: ['final_logits_bias']"
- Both finetuning and eval are 30% faster with `--fp16`. For that you need to [install apex](https://github.com/NVIDIA/apex#quick-start).
- Read scripts before you run them!

Summarization Tips:
- 1 epoch at batch size 1 for bart-large takes 24 hours and requires 13GB GPU RAM with fp16 on an NVIDIA V100.
- If you want to run experiments on improving the summarization finetuning process, try the XSUM Shared Task (below). It's faster to train than CNN/DailyMail because the summaries are shorter.
- For CNN/DailyMail, the default `val_max_target_length` and `test_max_target_length` will truncate the ground truth labels, resulting in slightly higher rouge scores. To get accurate rouge scores, you should rerun `calculate_rouge` on the `{output_dir}/test_generations.txt` file saved by `trainer.test()`.
- `--max_target_length=60 --val_max_target_length=60 --test_max_target_length=100` is a reasonable setting for XSUM.
- `wandb` can be used by specifying `--logger_name wandb`. It is useful for reproducibility. Specify the environment variable `WANDB_PROJECT='hf_xsum'` to do the XSUM shared task.
- If you are finetuning on your own dataset, start from `distilbart-cnn-12-6` if you want long summaries and `distilbart-xsum-12-6` if you want short summaries.
(It rarely makes sense to start from `bart-large` unless you are researching finetuning methods.)

**Update 2018-07-18**
Datasets: `LegacySeq2SeqDataset` will be used for all tokenizers without a `prepare_seq2seq_batch` method. Otherwise, `Seq2SeqDataset` will be used.
Future work/help wanted: A new dataset to support multilingual tasks.


### Finetuning Scripts
All finetuning bash scripts call `finetune.py` (or `distillation.py`) with reasonable command line arguments. They usually require extra command line arguments to work.

To see all the possible command line options, run:

```bash
./finetune.py --help
```

### Finetuning Training Params

To override the pretrained model's training params, you can pass them to `./finetune.sh`:

```bash
./finetune.sh \
    [...]
    --encoder_layerdrop 0.1 \
    --decoder_layerdrop 0.1 \
    --dropout 0.1 \
    --attention_dropout 0.1 \
```

### Summarization Finetuning
Run/modify `finetune.sh`

The following command should work on a 16GB GPU:
```bash
./finetune.sh \
    --data_dir $XSUM_DIR \
    --train_batch_size=1 \
    --eval_batch_size=1 \
    --output_dir=xsum_results \
    --num_train_epochs 6 \
    --model_name_or_path facebook/bart-large
```

There is a starter finetuning script for pegasus at `finetune_pegasus_xsum.sh`.
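
A hedged invocation sketch, assuming `finetune_pegasus_xsum.sh` forwards extra arguments the same way `finetune.sh` does (check the script itself before running; the batch sizes and output directory below are placeholders):

```bash
./finetune_pegasus_xsum.sh \
    --data_dir $XSUM_DIR \
    --output_dir pegasus_xsum_results \
    --train_batch_size=1 \
    --eval_batch_size=1
```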

### Translation Finetuning

First, follow the wmt_en_ro download instructions.
Then you can finetune mbart_cc25 on English-Romanian with the following command.
**Recommendation:** Read and potentially modify the fairly opinionated defaults in the `train_mbart_cc25_enro.sh` script before running it.

Best performing command:
```bash
export ENRO_DIR='wmt_en_ro' # Download instructions above
# export WANDB_PROJECT="MT" # optional
export MAX_LEN=128
export BS=4
./train_mbart_cc25_enro.sh --output_dir enro_finetune_baseline --label_smoothing 0.1 --fp16_opt_level=O1 --logger_name wandb --sortish_sampler
```
This should take < 6h/epoch on a 16GB V100 and achieve a test BLEU above 26.
To get results in line with fairseq, you need to do some postprocessing (see `romanian_postprocessing.md`).

Multi-GPU command (using 8 GPUs as an example):
```bash
export ENRO_DIR='wmt_en_ro' # Download instructions above
# export WANDB_PROJECT="MT" # optional
export MAX_LEN=128
export BS=4
./train_mbart_cc25_enro.sh --output_dir enro_finetune_baseline --gpus 8 --logger_name wandb
```
### Finetuning Outputs
As you train, `output_dir` will be filled with files that look something like this (comments are mine).
Some of them are metrics, some of them are checkpoints, some of them are metadata. Here is a quick tour:

```bash
output_dir
├── best_tfmr  # this is a huggingface checkpoint generated by save_pretrained. It is the same model as the PL .ckpt file below
│   ├── config.json
│   ├── merges.txt
│   ├── pytorch_model.bin
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   └── vocab.json
├── git_log.json   # repo, branch, and commit hash
├── val_avg_rouge2=0.1984-step_count=11.ckpt  # this is a pytorch lightning checkpoint associated with the best val score. (it will be called BLEU for MT)
├── metrics.json  # new validation metrics will continually be appended to this
├── student  # this is a huggingface checkpoint generated by SummarizationDistiller. It is the student before it gets finetuned.
│   ├── config.json
│   └── pytorch_model.bin
├── test_generations.txt
# ^^ are the summaries or translations produced by your best checkpoint on the test data. Populated when training is done
├── test_results.txt  # a convenience file with the test set metrics. This data is also in metrics.json['test']
├── hparams.pkl  # the command line args passed after some light preprocessing. Should be saved fairly quickly.
```
After training, you can recover the best checkpoint by running
```python
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained(f'{output_dir}/best_tfmr')
```

### Fine-tuning using Seq2SeqTrainer
To use `Seq2SeqTrainer` for fine-tuning you should use the `finetune_trainer.py` script. It subclasses `Trainer` to extend it for seq2seq training. Apart from the `Trainer`-related `TrainingArguments`, it shares the same argument names as `finetune.py`. One notable difference is that calculating generative metrics (BLEU, ROUGE) is optional and is controlled by the `--predict_with_generate` argument; set this flag to calculate BLEU and ROUGE metrics.

With PyTorch 1.6+ it will automatically use native AMP when `--fp16` is set.
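
A minimal sketch of such a run (the model, data directory, and output directory below are placeholders rather than recommended settings, and the flags are assumed to match `finetune_trainer.py`'s argument parser):

```bash
# a sketch: finetune a small Marian student on the wmt_en_ro data with generative metrics enabled
python finetune_trainer.py \
    --model_name_or_path sshleifer/student_marian_en_ro_6_3 \
    --data_dir $ENRO_DIR \
    --task translation \
    --output_dir enro_trainer_baseline \
    --do_train --do_eval \
    --predict_with_generate \
    --fp16
```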

To see all the possible command line options, run:

```bash
./builtin_trainer/finetune.sh --help # This calls python finetune_trainer.py --help
```

**At the moment, `Seq2SeqTrainer` does not support *with teacher* distillation.**

All `Seq2SeqTrainer` based fine-tuning scripts are included in the `builtin_trainer` directory.

#### TPU Training
`Seq2SeqTrainer` supports TPU training with a few caveats:
1. As the `generate` method does not work on TPU at the moment, `predict_with_generate` cannot be used. You should use `--prediction_loss_only` to only calculate the loss, and not set `--do_predict` or `--predict_with_generate`.
2. All sequences should be padded to equal length, otherwise training is extremely slow. (`finetune_trainer.py` does this automatically when running on TPU.)

We provide a very simple launcher script named `xla_spawn.py` that lets you run our example scripts on multiple TPU cores without any boilerplate. Just pass a `--num_cores` flag to this script, followed by your regular training script with its arguments (this is similar to the `torch.distributed.launch` helper for `torch.distributed`).
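
Roughly, such a launch looks like the sketch below; the relative path to `xla_spawn.py` and the training arguments are assumptions (the `builtin_trainer` scripts wrap these details for you):

```bash
# a sketch, assuming xla_spawn.py sits one level up, in examples/
python ../xla_spawn.py --num_cores 8 \
    finetune_trainer.py \
    --model_name_or_path sshleifer/student_marian_en_ro_6_3 \
    --data_dir $ENRO_DIR \
    --task translation \
    --output_dir enro_tpu_baseline \
    --do_train \
    --prediction_loss_only
```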

The `builtin_trainer/finetune_tpu.sh` script provides the minimal arguments needed for TPU training.

The following command fine-tunes `sshleifer/student_marian_en_ro_6_3` on a TPU v3-8 and should complete one epoch in ~5-6 minutes.

```bash
./builtin_trainer/train_distil_marian_enro_tpu.sh
```

### Evaluation Commands

To create summaries for each article in the dataset, we use `run_eval.py`; here are a few commands that run eval for different tasks and models.
If 'translation' is in your task name, the computed metric will be BLEU. Otherwise, ROUGE will be used.

For T5, you need to specify `--task translation_{src}_to_{tgt}`, as follows:
```bash
export DATA_DIR=wmt_en_ro
./run_eval.py t5-base \
    $DATA_DIR/val.source t5_val_generations.txt \
    --reference_path $DATA_DIR/val.target \
    --score_path enro_bleu.json \
    --task translation_en_to_ro \
    --n_obs 100 \
    --device cuda \
    --fp16 \
    --bs 32
```

This command works for MBART, although the BLEU score is suspiciously low.
```bash
export DATA_DIR=wmt_en_ro
./run_eval.py facebook/mbart-large-en-ro $DATA_DIR/val.source mbart_val_generations.txt \
    --reference_path $DATA_DIR/val.target \
    --score_path enro_bleu.json \
    --task translation \
    --n_obs 100 \
    --device cuda \
    --fp16 \
    --bs 32
```

Summarization (xsum will be very similar):
```bash
export DATA_DIR=cnn_dm
./run_eval.py sshleifer/distilbart-cnn-12-6 $DATA_DIR/val.source dbart_val_generations.txt \
    --reference_path $DATA_DIR/val.target \
    --score_path cnn_rouge.json \
    --task summarization \
    --n_obs 100 \
    --device cuda \
    --max_source_length 1024 \
    --max_target_length 56 \
    --fp16 \
    --bs 32
```

### Multi-GPU Evaluation
Here is a command to run XSUM evaluation on 8 GPUs. It is more than linearly faster than `run_eval.py` in some cases
because it uses `SortishSampler` to minimize padding. You can also use it on 1 GPU. `data_dir` must have
`{type_path}.source` and `{type_path}.target`. Run `./run_distributed_eval.py --help` for all command line arguments.

```bash
python -m torch.distributed.launch --nproc_per_node=8  run_distributed_eval.py \
    --model_name sshleifer/distilbart-large-xsum-12-3  \
    --save_dir xsum_generations \
    --data_dir xsum \
    --fp16  # you can pass generate kwargs like num_beams here, just like run_eval.py
```

Contributions that implement this command for other distributed hardware setups are welcome!

#### Single-GPU Eval: Tips and Tricks

When using `run_eval.py`, the following features can be useful:

* If you are running the script multiple times and want to make it easier to track what arguments produced the output, use `--dump-args`. Along with the results it will also dump any custom params that were passed to the script. For example if you used: `--num_beams 8 --early_stopping true`, the output will be:
   ```
   {'bleu': 26.887, 'n_obs': 10, 'runtime': 1, 'seconds_per_sample': 0.1, 'num_beams': 8, 'early_stopping': True}
   ```

   `--info` is an additional argument available for the same purpose of tracking the conditions of the experiment. It's useful for passing things that weren't in the argument list, e.g. a language pair `--info "lang:en-ru"`. But also, if you pass `--info` without a value, it will fall back to the current date/time string, e.g. `2020-09-13 18:44:43`.

   If using `--dump-args --info`, the output will be:
   
   ```
   {'bleu': 26.887, 'n_obs': 10, 'runtime': 1, 'seconds_per_sample': 0.1, 'num_beams': 8, 'early_stopping': True, 'info': '2020-09-13 18:44:43'}
   ```

   If using `--dump-args --info "pair:en-ru chkpt=best"`, the output will be:
   
   ```
   {'bleu': 26.887, 'n_obs': 10, 'runtime': 1, 'seconds_per_sample': 0.1, 'num_beams': 8, 'early_stopping': True, 'info': 'pair=en-ru chkpt=best'}
   ```
      

* If you need to perform a parametric search to find the hyperparameters that lead to the highest BLEU score, let `run_eval_search.py` do the searching for you.

   The script accepts the exact same arguments as `run_eval.py`, plus an additional argument `--search`. The value of `--search` is parsed, reformatted and fed to `run_eval.py` as additional args.

   The format for the `--search` value is a simple string with hparams and colon separated values to try, e.g.:
   ```
    --search "num_beams=5:10 length_penalty=0.8:1.0:1.2 early_stopping=true:false"
   ```
   which will generate 12 (2*3*2) searches, one for each combination of hparams. For example, the search just shown will invoke `run_eval.py` repeatedly with:
   
   ```
    --num_beams 5 --length_penalty 0.8 --early_stopping true
    --num_beams 5 --length_penalty 0.8 --early_stopping false
    [...]
    --num_beams 10 --length_penalty 1.2 --early_stopping false
   ```
   
   On completion, the script prints a markdown table of the results sorted by the best BLEU score, along with the winning arguments.

```
bleu  | num_beams | length_penalty | early_stopping
----- | --------- | -------------- | --------------
26.71 |         5 |            1.1 |              1
26.66 |         5 |            0.9 |              1
26.66 |         5 |            0.9 |              0
26.41 |         5 |            1.1 |              0
21.94 |         1 |            0.9 |              1
21.94 |         1 |            0.9 |              0
21.94 |         1 |            1.1 |              1
21.94 |         1 |            1.1 |              0

Best score args:
stas/wmt19-en-ru data/en-ru/val.source data/en-ru/test_translations.txt --reference_path data/en-ru/val.target --score_path data/en-ru/test_bleu.json --bs 8 --task translation --num_beams 5 --length_penalty 1.1 --early_stopping True
```

If you pass `--info "some experiment-specific info"` it will get printed before the results table - this is useful for scripting and multiple runs, so one can tell the different sets of results from each other.


### DistilBART
![DBART](https://huggingface.co/front/thumbnails/distilbart_large.png)

For the CNN/DailyMail dataset (relatively longer, more extractive summaries), we found a simple technique that works:
you just copy alternating layers from `bart-large-cnn` and finetune more on the same data.

For the XSUM dataset, that didn't work as well, so we used the same initialization strategy followed by a combination of DistilBERT's ce_loss and the hidden-state MSE loss used in the TinyBERT paper.

You can see the performance tradeoffs of model sizes [here](https://docs.google.com/spreadsheets/d/1EkhDMwVO02m8jCD1cG3RoFPLicpcL1GQHTQjfvDYgIM/edit#gid=0)
and more granular timing results [here](https://docs.google.com/spreadsheets/d/1EkhDMwVO02m8jCD1cG3RoFPLicpcL1GQHTQjfvDYgIM/edit#gid=1753259047&range=B2:I23).

#### No Teacher Distillation
To run the simpler distilbart-cnn-style distillation, all you need is data, a GPU, and a properly initialized student.
You don't even need `distillation.py`.

Some [un-finetuned students](https://huggingface.co/models?search=sshleifer%2Fstudent) are available for replication purposes.
They are initialized by copying layers from the associated `bart-large-{cnn|xsum}` teacher using `--init_strategy alternate`. (You can read about that in `initialization_utils.py`)
The command that produced `sshleifer/distilbart-cnn-12-6` is
```bash
./train_distilbart_cnn.sh
```
Runtime: 6h on an NVIDIA RTX 24GB GPU.

*Note*: You can get the same simple distillation logic by using `./run_distiller.sh --no_teacher` followed by the same arguments as in `train_distilbart_cnn.sh`.
If you are using `wandb` and comparing the two distillation methods, using this entry point will make your logs consistent,
because you will have the same hyperparameters logged in every run.
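
A hedged sketch of that alternative entry point; the data and output directories below are placeholders, and in practice you would append exactly the arguments used in `train_distilbart_cnn.sh`:

```bash
# placeholders only; append the exact arguments from train_distilbart_cnn.sh
./run_distiller.sh --no_teacher \
    --data_dir $CNN_DIR \
    --output_dir distilbart_cnn_no_teacher
```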

#### With a teacher (Intermediate Supervision)
*Note*: only BART variants are supported.

In this method, we try to enforce that the student and teacher produce similar encoder_outputs, logits, and hidden_states using `BartSummarizationDistiller`.
This is how `sshleifer/distilbart-xsum*` checkpoints were produced.

The command that produced `sshleifer/distilbart-xsum-12-6` is:

```bash
./train_distilbart_xsum.sh --logger_name wandb --gpus 1
```

Runtime: 13h on a V100 16GB GPU.

### Contributing
- Follow the standard contributing guidelines and code of conduct.
- Add tests to `test_seq2seq_examples.py`.
- To run only the seq2seq tests, you must be in the root of the repository and run:
```bash
pytest examples/seq2seq/
```

### Converting pytorch-lightning checkpoints
Pytorch lightning `--do_predict` often fails; after you are done training, the best way to evaluate your model is to convert it.

This should be done for you automatically, producing a directory called `{save_dir}/best_tfmr`.

If that directory doesn't exist but you have a lightning `.ckpt` file, you can run
```bash
python convert_pl_checkpoint_to_hf.py PATH_TO_CKPT  randomly_initialized_hf_model_path save_dir/best_tfmr
```
Then run either `run_eval` or `run_distributed_eval` with `save_dir/best_tfmr` (see the previous sections).
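
A hedged end-to-end sketch, reusing the checkpoint name from the Finetuning Outputs example above (your `.ckpt` filename will differ, and the base model passed as the second argument is just a placeholder with the matching config):

```bash
# convert the lightning checkpoint to a transformers checkpoint, then evaluate it
python convert_pl_checkpoint_to_hf.py \
    xsum_results/val_avg_rouge2=0.1984-step_count=11.ckpt \
    facebook/bart-large \
    xsum_results/best_tfmr
./run_eval.py xsum_results/best_tfmr $XSUM_DIR/test.source test_generations.txt \
    --reference_path $XSUM_DIR/test.target \
    --score_path xsum_rouge.json \
    --task summarization
```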


## Experimental Features 
These features are harder to use and not always useful.

###  Dynamic Batch Size for MT
`finetune.py` has a command line arg `--max_tokens_per_batch` that allows batches to be dynamically sized.
This feature can only be used:
- with fairseq installed
- on 1 GPU
- without sortish sampler
- after calling `./save_len_file.py $tok $data_dir`

For example, 
```bash
./save_len_file.py Helsinki-NLP/opus-mt-en-ro  wmt_en_ro
./dynamic_bs_example.sh --max_tokens_per_batch=2000 --output_dir benchmark_dynamic_bs
```
splits `wmt_en_ro/train` into 11,197 batches of uneven length and can finish 1 epoch in 8 minutes on a V100.

For comparison,
```bash
./dynamic_bs_example.sh --sortish_sampler --train_batch_size 48
```
uses 12,723 batches of length 48 and takes slightly longer, 9.5 minutes.

The feature is still experimental, because:
- We can make it much more robust if we have memory mapped/preprocessed datasets.
- The speedup over sortish sampler is not that large at the moment.