"examples/summarization/bertabs/modeling_bertabs.py" did not exist on "5eab3cf6bce7b6f11793056d8772aeb6e761ac4f"
README.md 2.09 KB
Newer Older
1
### Get the CNN Data
To reproduce the authors' results on the CNN/Daily Mail dataset, first download both the CNN and Daily Mail datasets [from Kyunghyun Cho's website](https://cs.nyu.edu/~kcho/DMQA/) (the links next to "Stories") into the same folder. Then uncompress the archives by running:

```bash
tar -xvf cnn_stories.tgz && tar -xvf dailymail_stories.tgz
```
This should create a directory called `cnn_dm/` with files like `test.source`.
To use your own data, match that file format: each article to be summarized goes on its own line.
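
A minimal sketch of preparing your own data in that layout (the `my_data/` directory and the article text below are placeholders; `test.target` holds the reference summaries, line-aligned with `test.source`):

```bash
mkdir -p my_data
# One article per line in test.source; the matching reference summary
# sits on the same line number in test.target.
printf '%s\n' "Full text of article one ..." "Full text of article two ..." > my_data/test.source
printf '%s\n' "Reference summary one." "Reference summary two." > my_data/test.target
```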

### Usage
To create summaries for each article in the dataset, run:
```bash
python evaluate_cnn.py <path_to_test.source> cnn_test_summaries.txt
```
The default batch size of 8 fits in 16 GB of GPU memory, but it may need to be adjusted for your system.
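
For example, using the `cnn_dm/` layout from above:

```bash
# Generate a summary for every line of test.source and write the
# results to cnn_test_summaries.txt.
python evaluate_cnn.py cnn_dm/test.source cnn_test_summaries.txt
```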

### Training

After downloading the CNN and Daily Mail datasets, preprocess the dataset:
```bash
git clone https://github.com/artmatsak/cnn-dailymail
cd cnn-dailymail && python make_datafiles.py ../cnn/stories/ ../dailymail/stories/
```

Run the training script: `run_train.sh`
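
A typical invocation is simply the following; inspect the script first, since the data paths and hyperparameters it assumes are set inside it:

```bash
bash run_train.sh
```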

### Where is the code?
The core model is in `src/transformers/modeling_bart.py`. This directory only contains examples.

### (WIP) Rouge Scores

### Stanford CoreNLP Setup
```bash
# Helper: run Stanford's PTB tokenizer over a file, keeping one output
# line per input line.
ptb_tokenize () {
    cat "$1" | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > "$2"
}

# Install Java, then download and unpack Stanford CoreNLP.
sudo apt install openjdk-8-jre-headless
sudo apt-get install ant
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
unzip stanford-corenlp-full-2018-10-05.zip
cd stanford-corenlp-full-2018-10-05
# The jar paths are relative to the CoreNLP directory entered above.
export CLASSPATH=stanford-corenlp-3.9.2.jar:stanford-corenlp-3.9.2-models.jar
```
Then run `ptb_tokenize` on `test.target` and your generated hypotheses.
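
For example, with the CoreNLP jars still on your `CLASSPATH` (the tokenized output file names below are arbitrary):

```bash
# Tokenize the references and the generated summaries the same way,
# so ROUGE compares consistently tokenized text.
ptb_tokenize cnn_dm/test.target test.target.tokenized
ptb_tokenize cnn_test_summaries.txt cnn_test_summaries.tokenized
```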
### Rouge Setup
Install `files2rouge` following the instructions [here](https://github.com/pltrdy/files2rouge).
I also needed to run `sudo apt-get install libxml-parser-perl`.

```python
from files2rouge import files2rouge
from files2rouge import settings
files2rouge.run(<path_to_tokenized_hypo>,
                <path_to_tokenized_target>,
                saveto='rouge_output.txt')
```
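
Alternatively, `files2rouge` also installs a command-line entry point, which should be equivalent (the file names assume the tokenization step above):

```bash
# Prints ROUGE-1/2/L for the hypotheses against the references.
files2rouge cnn_test_summaries.tokenized test.target.tokenized
```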