### Get CNN Data
Both types of models require CNN/Daily Mail data, but they obtain it in different ways.

#### For BART models
To reproduce the authors' results on the CNN/Daily Mail dataset, you need the CNN and Daily Mail "Stories" data [from Kyunghyun Cho's website](https://cs.nyu.edu/~kcho/DMQA/). A pre-processed version is available as a single archive; download and uncompress it by running:

```bash
wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/cnn_dm.tgz
tar -xzvf cnn_dm.tgz
```

This should create a directory called `cnn_dm/` containing files like `test.source`.
To use your own data, copy that file format: each article to be summarized goes on its own line.
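Below is a minimal sketch of writing your own data in that format; the `my_data/` paths and the example texts are placeholders, and only the `.source` file is needed for generation (the matching `.target` file holds reference summaries).

```python
import os

# Hypothetical articles and reference summaries; replace with your own data.
articles = [
    "First article, with the full text flattened onto a single line ...",
    "Second article, also on one line ...",
]
reference_summaries = [
    "Reference summary for the first article.",
    "Reference summary for the second article.",
]

os.makedirs("my_data", exist_ok=True)
with open("my_data/test.source", "w") as src, open("my_data/test.target", "w") as tgt:
    for article, summary in zip(articles, reference_summaries):
        # One example per line; strip embedded newlines so the files stay line-aligned.
        src.write(article.replace("\n", " ") + "\n")
        tgt.write(summary.replace("\n", " ") + "\n")
```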

#### For T5 models
First, download the CNN data (about 400 MB) by running:

```bash
python download_cnn_daily_mail.py cnn_articles_input_data.txt cnn_articles_reference_summaries.txt
```

You should confirm that each file has 11490 lines:

```bash
wc -l cnn_articles_input_data.txt # should print 11490
wc -l cnn_articles_reference_summaries.txt # should print 11490
```
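
Equivalently, here is a small Python sanity check that the two files are line-aligned (11490 is the size of the CNN/Daily Mail test split):

```python
# Sanity check: articles and reference summaries must be line-aligned.
with open("cnn_articles_input_data.txt") as f:
    n_articles = sum(1 for _ in f)
with open("cnn_articles_reference_summaries.txt") as f:
    n_summaries = sum(1 for _ in f)

assert n_articles == n_summaries == 11490, (n_articles, n_summaries)
print(f"OK: {n_articles} article/summary pairs")
```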

### Evaluation

To create summaries for each article in the dataset, run:
```bash
python evaluate_cnn.py <path_to_test.source> test_generations.txt <model-name>
```
The default batch size of 8 fits in 16 GB of GPU memory, but you may need to adjust it for your system.
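
For reference, the sketch below shows the kind of batched generation loop the script performs, assuming a `transformers` seq2seq checkpoint; the model name, file paths, and generation settings are illustrative, not the script's actual defaults.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/bart-large-cnn"  # assumption: any seq2seq summarization checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

with open("test.source") as f:
    articles = [line.strip() for line in f]

batch_size = 8  # the default mentioned above; lower it if you run out of GPU memory
with open("test_generations.txt", "w") as out:
    for i in range(0, len(articles), batch_size):
        batch = tokenizer(articles[i:i + batch_size], return_tensors="pt",
                          padding=True, truncation=True).to(device)
        summary_ids = model.generate(**batch)
        for summary in tokenizer.batch_decode(summary_ids, skip_special_tokens=True):
            out.write(summary.strip() + "\n")
```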

### Training
Run or modify `finetune_bart.sh` or `finetune_t5.sh`.
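The hyperparameters and data paths live inside those scripts, so edit them there before launching, e.g.:

```bash
# Edit the script first to point at your data and output directories.
bash finetune_bart.sh   # or: bash finetune_t5.sh
```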

## (WIP) Rouge Scores

To create summaries for each article in the dataset and also calculate rouge scores, run:
```bash
python evaluate_cnn.py <path_to_test.source> test_generations.txt <model-name> --reference_path <path_to_correct_summaries> --score_path <path_to_save_rouge_scores>
```
The rouge scores `rouge1`, `rouge2`, and `rougeL` are computed automatically and saved to `<path_to_save_rouge_scores>`.
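
To spot-check individual scores outside this script, the snippet below uses the standalone `rouge_score` package (an assumption; it is not required by this example) on a pair of illustrative strings:

```python
from rouge_score import rouge_scorer

# Compare one prediction against its reference (illustrative strings).
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "the cat was found under the bed",       # reference
    "the cat was discovered under the bed",  # prediction
)
for name, score in scores.items():
    print(name, round(score.fmeasure, 4))
```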

### Stanford CoreNLP Setup
```bash
ptb_tokenize () {
    # Tokenize $1 with Stanford's PTBTokenizer and write the output to $2.
    cat $1 | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > $2
}

# Install Java, fetch Stanford CoreNLP, and put its jars on the classpath.
sudo apt install openjdk-8-jre-headless
sudo apt-get install ant
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
unzip stanford-corenlp-full-2018-10-05.zip
cd stanford-corenlp-full-2018-10-05
export CLASSPATH=stanford-corenlp-3.9.2.jar:stanford-corenlp-3.9.2-models.jar
```
Then run `ptb_tokenize` on `test.target` and your generated hypotheses.
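For example (the tokenized output file names are just suggestions):

```bash
ptb_tokenize test.target test.target.tokenized
ptb_tokenize test_generations.txt test_generations.tokenized
```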
### Rouge Setup
Install `files2rouge` following the instructions [here](https://github.com/pltrdy/files2rouge).
I also needed to run `sudo apt-get install libxml-parser-perl`.

```python
from files2rouge import files2rouge
from files2rouge import settings

files2rouge.run("<path_to_tokenized_hypo>",
                "<path_to_tokenized_target>",
                saveto="rouge_output.txt")
```