To reproduce the authors' results on the CNN/Daily Mail dataset, first download both the CNN and Daily Mail datasets [from Kyunghyun Cho's website](https://cs.nyu.edu/~kcho/DMQA/) (the links next to "Stories") into the same folder. Then uncompress the archives by running:
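For example, a minimal Python sketch that unpacks both archives (the archive names `cnn_stories.tgz` and `dailymail_stories.tgz` are assumptions based on the default downloads from that page):

```python
import tarfile

# Assumed default archive names from the DMQA download page.
for archive in ("cnn_stories.tgz", "dailymail_stories.tgz"):
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(".")  # unpack into the current folder
```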
### Rouge Setup

Install `files2rouge` following the instructions [here](https://github.com/pltrdy/files2rouge).
I also needed to run `sudo apt-get install libxml-parser-perl`.

Then run `ptb_tokenize` on `test.target` and your generated hypotheses, and score them with:

```python
from files2rouge import files2rouge
from files2rouge import settings

# Replace the <...> placeholders with the paths to your tokenized files.
files2rouge.run(<path_to_tokenized_hypo>,
                <path_to_tokenized_target>,
                saveto='rouge_output.txt')
```

### Tips:

- 1 epoch at batch size 1 for `bart-large` takes 24 hours and requires 13GB of GPU RAM with fp16 on an NVIDIA V100.
- Try `bart-base`, `--freeze_encoder` or `--freeze_embeds` for faster training/larger batch size. (3hr/epoch with bs=8, see below.)
- `fp16_opt_level=O1` (the default) works best.
- If you are finetuning on your own dataset, start from `bart-large-cnn` if you want long summaries and `bart-large-xsum` if you want short summaries. (It rarely makes sense to start from `bart-large` unless you are researching finetuning methods.)
- In addition to the pytorch-lightning .ckpt checkpoint, a transformers checkpoint will be saved. Load it with `BartForConditionalGeneration.from_pretrained(f'{output_dir}/best_tfmr')` (see the sketch after this list).
- At the moment, `--do_predict` does not work in a multi-gpu setting. You need to use `evaluate_checkpoint` or the `run_eval.py` code.
- If you want to run experiments on improving the summarization finetuning process, try the XSUM Shared Task (below). It's faster to train than CNNDM because the summaries are shorter.
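A minimal sketch of loading that saved checkpoint and generating a summary, as referenced in the checkpoint tip above (the `output_dir` value, the tokenizer checkpoint, and the generation settings are assumptions; use whatever you trained with):

```python
from transformers import BartForConditionalGeneration, BartTokenizer

output_dir = "cnn_dm_finetune"  # hypothetical --output_dir from your training run

# The transformers-format checkpoint saved alongside the pytorch-lightning .ckpt file.
model = BartForConditionalGeneration.from_pretrained(f"{output_dir}/best_tfmr")
# Assumption: the tokenizer of the checkpoint you finetuned from.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")

article = "..."  # one untokenized article, in the same format as a line of test.source
batch = tokenizer([article], return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(batch["input_ids"], num_beams=4, max_length=142, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```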
### XSUM Shared Task
Compare XSUM results with others by using `--logger wandb_shared`. This requires `wandb` registration.
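If you have not used `wandb` before, one way to set up credentials is from Python (a sketch; it assumes you have already created an account):

```python
import wandb

# Prompts for your wandb API key the first time it runs;
# `wandb login` on the command line also works.
wandb.login()
```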