# Text Summarization with Pretrained Encoders

This folder contains part of the code necessary to reproduce the results on abstractive summarization from the article [Text Summarization with Pretrained Encoders](https://arxiv.org/pdf/1908.08345.pdf) by [Yang Liu](https://nlp-yang.github.io/) and [Mirella Lapata](https://homepages.inf.ed.ac.uk/mlap/). It can also be used to summarize any document.

The original code can be found in Yang Liu's [GitHub repository](https://github.com/nlpyang/PreSumm).

The model is loaded with the pre-trained weights of the abstractive summarization model trained on the CNN/Daily Mail dataset, fine-tuned first on an extractive and then on an abstractive objective.
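
For reference, loading follows the usual `from_pretrained` pattern. The sketch below is illustrative only; the `modeling_bertabs` import and the checkpoint identifier are assumptions rather than the script's exact code:

```python
# Illustrative sketch only -- run_summarization.py handles loading for you.
# The `modeling_bertabs` module and checkpoint name are assumptions here.
from transformers import BertTokenizer

from modeling_bertabs import BertAbs  # local module in this folder (assumption)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)
model = BertAbs.from_pretrained("bertabs-finetuned-cnndm")  # hypothetical identifier
model.eval()  # inference only; the weights are already fine-tuned
```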

## Setup

```bash
git clone https://github.com/huggingface/transformers && cd transformers
pip install .
pip install nltk py-rouge
cd examples/summarization
```

## Reproduce the authors' ROUGE score

To reproduce the authors' results on the CNN/Daily Mail dataset, first download both the CNN and Daily Mail datasets [from Kyunghyun Cho's website](https://cs.nyu.edu/~kcho/DMQA/) (the links next to "Stories") into the same folder. Then uncompress the archives by running:

```bash
tar -xvf cnn_stories.tgz && tar -xvf dailymail_stories.tgz
```

Then move all the stories into a single folder; we will refer to the path of this folder as `$DATA_PATH` below.
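
If you prefer to do the move programmatically, here is a minimal sketch; it assumes the archives extract to `cnn/stories/` and `dailymail/stories/`:

```python
# Illustrative helper, not part of the repository: gathers every .story file
# from both extracted archives into a single folder ($DATA_PATH).
import glob
import os
import shutil

data_path = "stories"  # your $DATA_PATH
os.makedirs(data_path, exist_ok=True)
for story in glob.glob("cnn/stories/*.story") + glob.glob("dailymail/stories/*.story"):
    shutil.move(story, data_path)
```

Once the stories are in place, run the following in the same folder as `run_summarization.py` (the `--summaries_output_dir` argument is optional):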

```bash
python run_summarization.py \
    --documents_dir $DATA_PATH \
    --summaries_output_dir $SUMMARIES_PATH \
    --no_cuda false \
    --batch_size 4 \
    --min_length 50 \
    --max_length 200 \
    --beam_size 5 \
    --alpha 0.95 \
    --block_trigram true \
    --compute_rouge true
```
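
The `--alpha` flag controls the length penalty applied during beam search; PreSumm's beam search uses a GNMT-style penalty (Wu et al., 2016). The sketch below illustrates the idea and is not the script's exact code:

```python
# Illustrative only: GNMT-style length penalty; `alpha` corresponds to
# the --alpha flag above.
def length_penalty(length: int, alpha: float) -> float:
    return ((5.0 + length) ** alpha) / (6.0 ** alpha)

# A beam is ranked by its cumulative log-probability divided by the penalty,
# so higher alpha values favor longer summaries.
def rescore(log_prob_sum: float, length: int, alpha: float = 0.95) -> float:
    return log_prob_sum / length_penalty(length, alpha)
```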

The script executes on GPU if one is available and if `no_cuda` is not set to `true`. Inference on multiple GPUs is not supported yet. The ROUGE scores will be displayed in the console at the end of evaluation and written to a `rouge_scores.txt` file. The script takes 30 hours to compute with a single Tesla V100 GPU and a batch size of 10 (300,000 texts to summarize).
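
If you want to score summaries yourself, the `py-rouge` package installed in the Setup step can be used directly. A minimal sketch, with placeholder strings standing in for real summaries:

```python
# Standalone ROUGE scoring sketch using the py-rouge package.
import rouge

evaluator = rouge.Rouge(
    metrics=["rouge-n", "rouge-l"],
    max_n=2,  # report ROUGE-1 and ROUGE-2
    apply_avg=True,  # average scores over the whole corpus
)

hypotheses = ["the cat sat on the mat"]  # generated summaries (placeholders)
references = ["a cat was sitting on the mat"]  # gold summaries (placeholders)
scores = evaluator.get_scores(hypotheses, references)
print(scores["rouge-2"]["f"], scores["rouge-l"]["f"])
```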

## Summarize any text

Put the documents that you would like to summarize in a folder (the path to which is referred to as `$DATA_PATH` below) and run the following in the same folder as `run_summarization.py` (the `--summaries_output_dir` argument is optional):

```bash
python run_summarization.py \
    --documents_dir $DATA_PATH \
    --summaries_output_dir $SUMMARIES_PATH \
    --no_cuda false \
    --batch_size 4 \
    --min_length 50 \
    --max_length 200 \
    --beam_size 5 \
    --alpha 0.95 \
    --block_trigram true
```

You may want to play around with `min_length`, `max_length`, and `alpha` to suit your use case. If you want to compute ROUGE on another dataset, you will need to tweak the way stories and summaries are loaded in `utils_summarization.py` and tell it where to fetch the reference summaries.
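
As a starting point, CNN/Daily Mail `.story` files mark each reference summary sentence with an `@highlight` line, and a loader for another dataset needs to produce the same (document, summary) pairs. A hypothetical sketch of such a split, to adapt rather than use as-is:

```python
# Hypothetical loader sketch -- adapt the actual import logic in
# utils_summarization.py to your dataset's layout instead.
def split_story(raw_text: str):
    """Split a CNN/DM-style story into document lines and summary lines."""
    lines = [line.strip() for line in raw_text.split("\n") if line.strip()]
    story_lines, summary_lines = [], []
    next_is_highlight = False
    for line in lines:
        if line == "@highlight":
            next_is_highlight = True  # the next line is a summary sentence
        elif next_is_highlight:
            summary_lines.append(line)
            next_is_highlight = False
        else:
            story_lines.append(line)
    return story_lines, summary_lines
```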