Both types of models do require CNN data and follow different procedures of obtaining so.
#### For BART models
To be able to reproduce the authors' results on the CNN/Daily Mail dataset you first need to download both CNN and Daily Mail datasets [from Kyunghyun Cho's website](https://cs.nyu.edu/~kcho/DMQA/)(the links next to "Stories") in the same folder. Then uncompress the archives by running:
To be able to reproduce the authors' results on the CNN/Daily Mail dataset you first need to download both CNN and Daily Mail datasets [from Kyunghyun Cho's website](https://cs.nyu.edu/~kcho/DMQA/)(the links next to "Stories") in the same folder. Then uncompress the archives by running:
```bash
```bash
...
@@ -9,22 +12,40 @@ tar -xzvf cnn_dm.tgz
...
@@ -9,22 +12,40 @@ tar -xzvf cnn_dm.tgz
this should make a directory called cnn_dm/ with files like `test.source`.
this should make a directory called cnn_dm/ with files like `test.source`.
To use your own data, copy that files format. Each article to be summarized is on its own line.
To use your own data, copy that files format. Each article to be summarized is on its own line.
#### For T5 models
First, you need to download the CNN data. It's about ~400 MB and can be downloaded by
***This script evaluates the the multitask pre-trained checkpoint for ``t5-base`` (see paper [here](https://arxiv.org/pdf/1910.10683.pdf)) on the CNN/Daily Mail test dataset. Please note that the results in the paper were attained using a model fine-tuned on summarization, so that results will be worse here by approx. 0.5 ROUGE points***
### Get the CNN Data
First, you need to download the CNN data. It's about ~400 MB and can be downloaded by