Unverified Commit 3552d0e0 authored by Julien Chaumond's avatar Julien Chaumond Committed by GitHub
Browse files

[model_cards] Migrate cards from this repo to model repos on huggingface.co (#9013)



* rm all model cards

* Update the .rst

@sgugger it is still not super crystal clear/streamlined so let me know if any ideas to make it simpler

* Add a rootlevel README.md with simple instructions/context

* Update docs/source/model_sharing.rst
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply suggestions from code review
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: default avatarPatrick von Platen <patrick.v.platen@gmail.com>

* make style

* rm all model cards
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: default avatarPatrick von Platen <patrick.v.platen@gmail.com>
parent 29e45979
../roberta_1M_to_1B/README.md
\ No newline at end of file
../roberta_1M_to_1B/README.md
\ No newline at end of file
../roberta_1M_to_1B/README.md
\ No newline at end of file
../roberta_1M_to_1B/README.md
\ No newline at end of file
../roberta_1M_to_1B/README.md
\ No newline at end of file
../roberta_1M_to_1B/README.md
\ No newline at end of file
# RoBERTa Pretrained on Smaller Datasets
We pretrain RoBERTa on smaller datasets (1M, 10M, 100M, 1B tokens). We release 3 models with lowest perplexities for each pretraining data size out of 25 runs (or 10 in the case of 1B tokens). The pretraining data reproduces that of BERT: We combine English Wikipedia and a reproduction of BookCorpus using texts from smashwords in a ratio of approximately 3:1.
### Hyperparameters and Validation Perplexity
The hyperparameters and validation perplexities corresponding to each model are as follows:
| Model Name | Training Size | Model Size | Max Steps | Batch Size | Validation Perplexity |
|--------------------------|---------------|------------|-----------|------------|-----------------------|
| [roberta-base-1B-1][link-roberta-base-1B-1] | 1B | BASE | 100K | 512 | 3.93 |
| [roberta-base-1B-2][link-roberta-base-1B-2] | 1B | BASE | 31K | 1024 | 4.25 |
| [roberta-base-1B-3][link-roberta-base-1B-3] | 1B | BASE | 31K | 4096 | 3.84 |
| [roberta-base-100M-1][link-roberta-base-100M-1] | 100M | BASE | 100K | 512 | 4.99 |
| [roberta-base-100M-2][link-roberta-base-100M-2] | 100M | BASE | 31K | 1024 | 4.61 |
| [roberta-base-100M-3][link-roberta-base-100M-3] | 100M | BASE | 31K | 512 | 5.02 |
| [roberta-base-10M-1][link-roberta-base-10M-1] | 10M | BASE | 10K | 1024 | 11.31 |
| [roberta-base-10M-2][link-roberta-base-10M-2] | 10M | BASE | 10K | 512 | 10.78 |
| [roberta-base-10M-3][link-roberta-base-10M-3] | 10M | BASE | 31K | 512 | 11.58 |
| [roberta-med-small-1M-1][link-roberta-med-small-1M-1] | 1M | MED-SMALL | 100K | 512 | 153.38 |
| [roberta-med-small-1M-2][link-roberta-med-small-1M-2] | 1M | MED-SMALL | 10K | 512 | 134.18 |
| [roberta-med-small-1M-3][link-roberta-med-small-1M-3] | 1M | MED-SMALL | 31K | 512 | 139.39 |
The hyperparameters corresponding to model sizes mentioned above are as follows:
| Model Size | L | AH | HS | FFN | P |
|------------|----|----|-----|------|------|
| BASE | 12 | 12 | 768 | 3072 | 125M |
| MED-SMALL | 6 | 8 | 512 | 2048 | 45M |
(AH = number of attention heads; HS = hidden size; FFN = feedforward network dimension; P = number of parameters.)
For other hyperparameters, we select:
- Peak Learning rate: 5e-4
- Warmup Steps: 6% of max steps
- Dropout: 0.1
[link-roberta-med-small-1M-1]: https://huggingface.co/nyu-mll/roberta-med-small-1M-1
[link-roberta-med-small-1M-2]: https://huggingface.co/nyu-mll/roberta-med-small-1M-2
[link-roberta-med-small-1M-3]: https://huggingface.co/nyu-mll/roberta-med-small-1M-3
[link-roberta-base-10M-1]: https://huggingface.co/nyu-mll/roberta-base-10M-1
[link-roberta-base-10M-2]: https://huggingface.co/nyu-mll/roberta-base-10M-2
[link-roberta-base-10M-3]: https://huggingface.co/nyu-mll/roberta-base-10M-3
[link-roberta-base-100M-1]: https://huggingface.co/nyu-mll/roberta-base-100M-1
[link-roberta-base-100M-2]: https://huggingface.co/nyu-mll/roberta-base-100M-2
[link-roberta-base-100M-3]: https://huggingface.co/nyu-mll/roberta-base-100M-3
[link-roberta-base-1B-1]: https://huggingface.co/nyu-mll/roberta-base-1B-1
[link-roberta-base-1B-2]: https://huggingface.co/nyu-mll/roberta-base-1B-2
[link-roberta-base-1B-3]: https://huggingface.co/nyu-mll/roberta-base-1B-3
# German Sentiment Classification with Bert
This model was trained for sentiment classification of German language texts. To achieve the best results all model inputs needs to be preprocessed with the same procedure, that was applied during the training. To simplify the usage of the model,
we provide a Python package that bundles the code need for the preprocessing and inferencing.
The model uses the Googles Bert architecture and was trained on 1.834 million German-language samples. The training data contains texts from various domains like Twitter, Facebook and movie, app and hotel reviews.
You can find more information about the dataset and the training process in the [paper](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.202.pdf).
## Using the Python package
To get started install the package from [pypi](https://pypi.org/project/germansentiment/):
```bash
pip install germansentiment
```
```python
from germansentiment import SentimentModel
model = SentimentModel()
texts = [
"Mit keinem guten Ergebniss","Das ist gar nicht mal so gut",
"Total awesome!","nicht so schlecht wie erwartet",
"Der Test verlief positiv.","Sie fährt ein grünes Auto."]
result = model.predict_sentiment(texts)
print(result)
```
The code above will output following list:
```python
["negative","negative","positive","positive","neutral", "neutral"]
```
## A minimal working Sample
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from typing import List
import torch
import re
class SentimentModel():
def __init__(self, model_name: str):
self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.clean_chars = re.compile(r'[^A-Za-züöäÖÜÄß ]', re.MULTILINE)
self.clean_http_urls = re.compile(r'https*\S+', re.MULTILINE)
self.clean_at_mentions = re.compile(r'@\S+', re.MULTILINE)
def predict_sentiment(self, texts: List[str])-> List[str]:
texts = [self.clean_text(text) for text in texts]
# Add special tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
input_ids = self.tokenizer(texts, padding=True, truncation=True, add_special_tokens=True)
input_ids = torch.tensor(input_ids["input_ids"])
with torch.no_grad():
logits = self.model(input_ids)
label_ids = torch.argmax(logits[0], axis=1)
labels = [self.model.config.id2label[label_id] for label_id in label_ids.tolist()]
return labels
def replace_numbers(self,text: str) -> str:
return text.replace("0"," null").replace("1"," eins").replace("2"," zwei").replace("3"," drei").replace("4"," vier").replace("5"," fünf").replace("6"," sechs").replace("7"," sieben").replace("8"," acht").replace("9"," neun")
def clean_text(self,text: str)-> str:
text = text.replace("\n", " ")
text = self.clean_http_urls.sub('',text)
text = self.clean_at_mentions.sub('',text)
text = self.replace_numbers(text)
text = self.clean_chars.sub('', text) # use only text chars
text = ' '.join(text.split()) # substitute multiple whitespace with single whitespace
text = text.strip().lower()
return text
texts = ["Mit keinem guten Ergebniss","Das war unfair", "Das ist gar nicht mal so gut",
"Total awesome!","nicht so schlecht wie erwartet", "Das ist gar nicht mal so schlecht",
"Der Test verlief positiv.","Sie fährt ein grünes Auto.", "Der Fall wurde an die Polzei übergeben."]
model = SentimentModel(model_name = "oliverguhr/german-sentiment-bert")
print(model.predict_sentiment(texts))
```
## Model and Data
If you are interested in code and data that was used to train this model please have a look at [this repository](https://github.com/oliverguhr/german-sentiment) and our [paper](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.202.pdf). Here is a table of the F1 scores that his model achieves on following datasets. Since we trained this model on a newer version of the transformer library, the results are slightly better than reported in the paper.
| Dataset | F1 micro Score |
| :----------------------------------------------------------- | -------------: |
| [holidaycheck](https://github.com/oliverguhr/german-sentiment) | 0.9568 |
| [scare](https://www.romanklinger.de/scare/) | 0.9418 |
| [filmstarts](https://github.com/oliverguhr/german-sentiment) | 0.9021 |
| [germeval](https://sites.google.com/view/germeval2017-absa/home) | 0.7536 |
| [PotTS](https://www.aclweb.org/anthology/L16-1181/) | 0.6780 |
| [emotions](https://github.com/oliverguhr/german-sentiment) | 0.9649 |
| [sb10k](https://www.spinningbytes.com/resources/germansentiment/) | 0.7376 |
| [Leipzig Wikipedia Corpus 2016](https://wortschatz.uni-leipzig.de/de/download/german) | 0.9967 |
| all | 0.9639 |
## Cite
For feedback and questions contact me view mail or Twitter [@oliverguhr](https://twitter.com/oliverguhr). Please cite us if you found this useful:
```
@InProceedings{guhr-EtAl:2020:LREC,
author = {Guhr, Oliver and Schumann, Anne-Kathrin and Bahrmann, Frank and Böhme, Hans Joachim},
title = {Training a Broad-Coverage German Sentiment Classification Model for Dialog Systems},
booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference},
month = {May},
year = {2020},
address = {Marseille, France},
publisher = {European Language Resources Association},
pages = {1620--1625},
url = {https://www.aclweb.org/anthology/2020.lrec-1.201}
}
```
---
language: id
---
# Indonesian T5 Summarization Base Model
Finetuned T5 base summarization model for Indonesian.
## Finetuning Corpus
`t5-base-indonesian-summarization-cased` model is based on `t5-base-bahasa-summarization-cased` by [huseinzol05](https://huggingface.co/huseinzol05), finetuned using [indosum](https://github.com/kata-ai/indosum) dataset.
## Load Finetuned Model
```python
from transformers import T5Tokenizer, T5Model, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained("panggi/t5-base-indonesian-summarization-cased")
model = T5ForConditionalGeneration.from_pretrained("panggi/t5-base-indonesian-summarization-cased")
```
## Code Sample
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained("panggi/t5-base-indonesian-summarization-cased")
model = T5ForConditionalGeneration.from_pretrained("panggi/t5-base-indonesian-summarization-cased")
# https://www.sehatq.com/artikel/apa-itu-dispepsia-fungsional-ketahui-gejala-dan-faktor-risikonya
ARTICLE_TO_SUMMARIZE = "Secara umum, dispepsia adalah kumpulan gejala pada saluran pencernaan seperti nyeri, sensasi terbakar, dan rasa tidak nyaman pada perut bagian atas. Pada beberapa kasus, dispepsia yang dialami seseorang tidak dapat diketahui penyebabnya. Jenis dispepsia ini disebut dengan dispepsia fungsional. Apa saja gejala dispepsia fungsional? Apa itu dispepsia fungsional? Dispepsia fungsional adalah kumpulan gejala tanpa sebab pada saluran pencernaan bagian atas. Gejala tersebut dapat berupa rasa sakit, nyeri, dan tak nyaman pada perut bagian atas atau ulu hati. Penderita dispepsia fungsional juga akan merasakan kenyang lebih cepat dan sensasi perut penuh berkepanjangan. Gejala-gejala tersebut bisa berlangsung selama sebulan atau lebih. Dispepsia ini memiliki nama “fungsional” karena kumpulan gejalanya tidak memiliki penyebab yang jelas. Dilihat dari fungsi dan struktur saluran pencernaan, dokter tidak menemukan hal yang salah. Namun, gejalanya bisa sangat mengganggu dan menyiksa. Dispepsia fungsional disebut juga dengan dispepsia nonulkus. Diperkirakan bahwa 20% masyarakat dunia menderita dispepsia fungsional. Kondisi ini berisiko tinggi dialami oleh wanita, perokok, dan orang yang mengonsumsi obat anti-peradangan nonsteroid (NSAID). Dispepsia fungsional bisa bersifat kronis dan mengganggu kehidupan penderitanya. Namun beruntung, ada beberapa strategi yang bisa diterapkan untuk mengendalikan gejala dispepsia ini. Strategi tersebut termasuk perubahan gaya hidup, obat-obatan, dan terapi.Ragam gejala dispepsia fungsional Gejala dispepsia fungsional dapat bervariasi antara satu pasien dengan pasien lain. Beberapa tanda yang bisa dirasakan seseorang, yaitu: Sensasi terbakar atau nyeri di saluran pencernaan bagian atas Perut kembung Cepat merasa kenyang walau baru makan sedikit Mual Muntah Bersendawa Rasa asam di mulut Penurunan berat badan Tekanan psikologis terkait dengan kondisi yang dialami Apa sebenarnya penyebab dispepsia fungsional? Sebagai penyakit fungsional, dokter mengkategorikan dispepsia ini sebagai penyakit yang tidak diketahui penyebabnya. Hanya saja, beberapa faktor bisa meningkatkan risiko seseorang terkena dispepsia fungsional. Faktor risiko tersebut, termasuk: Alergi terhadap zat tertentu Perubahan mikrobioma usus Infeksi, seperti yang dipicu oleh bakteriHelicobacter pylori Sekresi asam lambung yang tidak normal Peradangan pada saluran pencernaan bagian atas Gangguan pada fungsi lambung untuk mencerna makanan Pola makan tertentu Gaya hidup tidak sehat Stres Kecemasan atau depresi Efek samping pemakaian obat seperti obat antiinflamasi nonsteroid Penanganan untuk dispepsia fungsional Ada banyak pilihan pengobatan untuk dispepsia fungsional. Seperti yang disampaikan di atas, tidak ada penyebab tunggal dispepsia ini yang bisa diketahui. Gejala yang dialami antara satu pasien juga mungkin amat berbeda dari orang lain. Dengan demikian, jenis pengobatan dispepsia fungsional juga akan bervariasi. Beberapa pilihan strategi penanganan untuk dispepsia fungsional, meliputi: 1. Obat-obatan Ada beberapa jenis obat yang mungkin akan diberikan dokter, seperti Obat penetral asam lambung yang disebut penghambat reseptor H2 Obat penghambat produksi asam lambung yang disebut proton pump inhibitors Obat untuk mengendalikan gas di perut yang mengandung simetikon Antidepresan seperti amitriptyline Obat penguat kerongkongan yang disebut agen prokinetik Obat untuk pengosongan isi lambung seperti metoclopramide Antibiotik jika dokter mendeteksi adanya infeksi bakteri H. pylori 2. Anjuran terkait perubahan gaya hidup Selain obat-obatan, dokter akan memberikan rekomendasi perubahan gaya hidup yang harus diterapkan pasien. Tips terkait perubahan gaya hidup termasuk: Makan lebih sering namun dengan porsi yang lebih sedikit Menjauhi makanan berlemak karena memperlambat pengosongan makanan di lambung Menjauhi jenis makanan lain yang memicu gejala dispepsia, seperti makanan pedas, makanan tinggi asam, produk susu, dan produk kafein Menjauhi rokok Dokter juga akan meminta pasien untuk mencari cara untuk mengendalikan stres, tidur dengan kepala lebih tinggi, dan menjalankan usaha untuk mengendalikan berat badan. Apakah penyakit dispepsia itu berbahaya? Dispepsia, termasuk dispepsia fungsional, dapat menjadi kronis dengan gejala yang menyiksa. Jika tidak ditangani, dispepsia tentu dapat berbahaya dan mengganggu kehidupan pasien. Segera hubungi dokter apabila Anda merasakan gejala dispepsia, terlebih jika tidak merespons obat-obatan yang dijual bebas. Catatan dari SehatQ Dispepsia fungsional adalah kumpulan gejala pada saluran pencernaan bagian atas yang tidak diketahui penyebabnya. Dispepsia fungsional dapat ditangani dengan kombinasi obat-obatan dan perubahan gaya hidup. Jika masih memiliki pertanyaan terkait dispepsia fungsional, Anda bisa menanyakan ke dokter di aplikasi kesehatan keluarga SehatQ. Aplikasi SehatQ bisa diunduh gratis di Appstore dan Playstore yang berikan informasi penyakit terpercaya."
# generate summary
inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=5000, return_tensors='pt', truncation=True)
summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=100, early_stopping=True)
print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids])
```
Output:
```
'Dispepsia fungsional adalah kumpulan gejala tanpa sebab pada saluran pencernaan bagian atas. Gejala tersebut dapat berupa rasa sakit, nyeri, dan tak nyaman pada perut bagian atas. Penderita dispepsia fungsional juga akan merasakan kenyang lebih cepat dan sensasi perut penuh berkepanjangan. Gejala-gejala tersebut bisa berlangsung selama sebulan atau lebih.
```
## Acknowledgement
Thanks to Immanuel Drexel for his article [Text Summarization, Extractive, T5, Bahasa Indonesia, Huggingface’s Transformers](https://medium.com/analytics-vidhya/text-summarization-t5-bahasa-indonesia-huggingfaces-transformers-ee9bfe368e2f)
---
language: id
---
# Indonesian T5 Summarization Small Model
Finetuned T5 small summarization model for Indonesian.
## Finetuning Corpus
`t5-small-indonesian-summarization-cased` model is based on `t5-small-bahasa-summarization-cased` by [huseinzol05](https://huggingface.co/huseinzol05), finetuned using [indosum](https://github.com/kata-ai/indosum) dataset.
## Load Finetuned Model
```python
from transformers import T5Tokenizer, T5Model, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained("panggi/t5-small-indonesian-summarization-cased")
model = T5ForConditionalGeneration.from_pretrained("panggi/t5-small-indonesian-summarization-cased")
```
## Code Sample
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained("panggi/t5-small-indonesian-summarization-cased")
model = T5ForConditionalGeneration.from_pretrained("panggi/t5-small-indonesian-summarization-cased")
# https://www.sehatq.com/artikel/apa-itu-dispepsia-fungsional-ketahui-gejala-dan-faktor-risikonya
ARTICLE_TO_SUMMARIZE = "Secara umum, dispepsia adalah kumpulan gejala pada saluran pencernaan seperti nyeri, sensasi terbakar, dan rasa tidak nyaman pada perut bagian atas. Pada beberapa kasus, dispepsia yang dialami seseorang tidak dapat diketahui penyebabnya. Jenis dispepsia ini disebut dengan dispepsia fungsional. Apa saja gejala dispepsia fungsional? Apa itu dispepsia fungsional? Dispepsia fungsional adalah kumpulan gejala tanpa sebab pada saluran pencernaan bagian atas. Gejala tersebut dapat berupa rasa sakit, nyeri, dan tak nyaman pada perut bagian atas atau ulu hati. Penderita dispepsia fungsional juga akan merasakan kenyang lebih cepat dan sensasi perut penuh berkepanjangan. Gejala-gejala tersebut bisa berlangsung selama sebulan atau lebih. Dispepsia ini memiliki nama “fungsional” karena kumpulan gejalanya tidak memiliki penyebab yang jelas. Dilihat dari fungsi dan struktur saluran pencernaan, dokter tidak menemukan hal yang salah. Namun, gejalanya bisa sangat mengganggu dan menyiksa. Dispepsia fungsional disebut juga dengan dispepsia nonulkus. Diperkirakan bahwa 20% masyarakat dunia menderita dispepsia fungsional. Kondisi ini berisiko tinggi dialami oleh wanita, perokok, dan orang yang mengonsumsi obat anti-peradangan nonsteroid (NSAID). Dispepsia fungsional bisa bersifat kronis dan mengganggu kehidupan penderitanya. Namun beruntung, ada beberapa strategi yang bisa diterapkan untuk mengendalikan gejala dispepsia ini. Strategi tersebut termasuk perubahan gaya hidup, obat-obatan, dan terapi.Ragam gejala dispepsia fungsional Gejala dispepsia fungsional dapat bervariasi antara satu pasien dengan pasien lain. Beberapa tanda yang bisa dirasakan seseorang, yaitu: Sensasi terbakar atau nyeri di saluran pencernaan bagian atas Perut kembung Cepat merasa kenyang walau baru makan sedikit Mual Muntah Bersendawa Rasa asam di mulut Penurunan berat badan Tekanan psikologis terkait dengan kondisi yang dialami Apa sebenarnya penyebab dispepsia fungsional? Sebagai penyakit fungsional, dokter mengkategorikan dispepsia ini sebagai penyakit yang tidak diketahui penyebabnya. Hanya saja, beberapa faktor bisa meningkatkan risiko seseorang terkena dispepsia fungsional. Faktor risiko tersebut, termasuk: Alergi terhadap zat tertentu Perubahan mikrobioma usus Infeksi, seperti yang dipicu oleh bakteriHelicobacter pylori Sekresi asam lambung yang tidak normal Peradangan pada saluran pencernaan bagian atas Gangguan pada fungsi lambung untuk mencerna makanan Pola makan tertentu Gaya hidup tidak sehat Stres Kecemasan atau depresi Efek samping pemakaian obat seperti obat antiinflamasi nonsteroid Penanganan untuk dispepsia fungsional Ada banyak pilihan pengobatan untuk dispepsia fungsional. Seperti yang disampaikan di atas, tidak ada penyebab tunggal dispepsia ini yang bisa diketahui. Gejala yang dialami antara satu pasien juga mungkin amat berbeda dari orang lain. Dengan demikian, jenis pengobatan dispepsia fungsional juga akan bervariasi. Beberapa pilihan strategi penanganan untuk dispepsia fungsional, meliputi: 1. Obat-obatan Ada beberapa jenis obat yang mungkin akan diberikan dokter, seperti Obat penetral asam lambung yang disebut penghambat reseptor H2 Obat penghambat produksi asam lambung yang disebut proton pump inhibitors Obat untuk mengendalikan gas di perut yang mengandung simetikon Antidepresan seperti amitriptyline Obat penguat kerongkongan yang disebut agen prokinetik Obat untuk pengosongan isi lambung seperti metoclopramide Antibiotik jika dokter mendeteksi adanya infeksi bakteri H. pylori 2. Anjuran terkait perubahan gaya hidup Selain obat-obatan, dokter akan memberikan rekomendasi perubahan gaya hidup yang harus diterapkan pasien. Tips terkait perubahan gaya hidup termasuk: Makan lebih sering namun dengan porsi yang lebih sedikit Menjauhi makanan berlemak karena memperlambat pengosongan makanan di lambung Menjauhi jenis makanan lain yang memicu gejala dispepsia, seperti makanan pedas, makanan tinggi asam, produk susu, dan produk kafein Menjauhi rokok Dokter juga akan meminta pasien untuk mencari cara untuk mengendalikan stres, tidur dengan kepala lebih tinggi, dan menjalankan usaha untuk mengendalikan berat badan. Apakah penyakit dispepsia itu berbahaya? Dispepsia, termasuk dispepsia fungsional, dapat menjadi kronis dengan gejala yang menyiksa. Jika tidak ditangani, dispepsia tentu dapat berbahaya dan mengganggu kehidupan pasien. Segera hubungi dokter apabila Anda merasakan gejala dispepsia, terlebih jika tidak merespons obat-obatan yang dijual bebas. Catatan dari SehatQ Dispepsia fungsional adalah kumpulan gejala pada saluran pencernaan bagian atas yang tidak diketahui penyebabnya. Dispepsia fungsional dapat ditangani dengan kombinasi obat-obatan dan perubahan gaya hidup. Jika masih memiliki pertanyaan terkait dispepsia fungsional, Anda bisa menanyakan ke dokter di aplikasi kesehatan keluarga SehatQ. Aplikasi SehatQ bisa diunduh gratis di Appstore dan Playstore yang berikan informasi penyakit terpercaya."
# generate summary
inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=5000, return_tensors='pt', truncation=True)
summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=100, early_stopping=True)
print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids])
```
Output:
```
'Dispepsia fungsional adalah kumpulan gejala tanpa sebab pada saluran pencernaan bagian atas. Gejala tersebut dapat berupa rasa sakit, nyeri, dan tak nyaman pada perut bagian atas. Penderita dispepsia fungsional juga akan merasakan kenyang lebih cepat dan sensasi perut penuh berkepanjangan. Gejala-gejala tersebut bisa berlangsung selama sebulan atau lebih.
```
## Acknowledgement
Thanks to Immanuel Drexel for his article [Text Summarization, Extractive, T5, Bahasa Indonesia, Huggingface’s Transformers](https://medium.com/analytics-vidhya/text-summarization-t5-bahasa-indonesia-huggingfaces-transformers-ee9bfe368e2f)
# Bert2Bert Summarization with 🤗 EncoderDecoder Framework
This model is a Bert2Bert model fine-tuned on summarization.
Bert2Bert is a `EncoderDecoderModel`, meaning that both the encoder and the decoder are `bert-base-uncased`
BERT models. Leveraging the [EncoderDecoderFramework](https://huggingface.co/transformers/model_doc/encoderdecoder.html#encoder-decoder-models), the
two pretrained models can simply be loaded into the framework via:
```python
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")
```
The decoder of an `EncoderDecoder` model needs cross-attention layers and usually makes use of causal
masking for auto-regressiv generation.
Thus, ``bert2bert`` is consequently fined-tuned on the `CNN/Daily Mail`dataset and the resulting model
`bert2bert-cnn_dailymail-fp16` is uploaded here.
## Example
The model is by no means a state-of-the-art model, but nevertheless
produces reasonable summarization results. It was mainly fine-tuned
as a proof-of-concept for the 🤗 EncoderDecoder Framework.
The model can be used as follows:
```python
from transformers import BertTokenizer, EncoderDecoderModel
model = EncoderDecoderModel.from_pretrained("patrickvonplaten/bert2bert-cnn_dailymail-fp16")
tokenizer = BertTokenizer.from_pretrained("patrickvonplaten/bert2bert-cnn_dailymail-fp16")
article = """(CNN)Sigma Alpha Epsilon is under fire for a video showing party-bound fraternity members singing a racist chant. SAE's national chapter suspended the students, but University of Oklahoma President David Boren took it a step further, saying the university's affiliation with the fraternity is permanently done. The news is shocking, but it's not the first time SAE has faced controversy. SAE was founded March 9, 1856, at the University of Alabama, five years before the American Civil War, according to the fraternity website. When the war began, the group had fewer than 400 members, of which "369 went to war for the Confederate States and seven for the Union Army," the website says. The fraternity now boasts more than 200,000 living alumni, along with about 15,000 undergraduates populating 219 chapters and 20 "colonies" seeking full membership at universities. SAE has had to work hard to change recently after a string of member deaths, many blamed on the hazing of new recruits, SAE national President Bradley Cohen wrote in a message on the fraternity's website. The fraternity's website lists more than 130 chapters cited or suspended for "health and safety incidents" since 2010. At least 30 of the incidents involved hazing, and dozens more involved alcohol. However, the list is missing numerous incidents from recent months. Among them, according to various media outlets: Yale University banned the SAEs from campus activities last month after members allegedly tried to interfere with a sexual misconduct investigation connected to an initiation rite. Stanford University in December suspended SAE housing privileges after finding sorority members attending a fraternity function were subjected to graphic sexual content. And Johns Hopkins University in November suspended the fraternity for underage drinking. "The media has labeled us as the 'nation's deadliest fraternity,' " Cohen said. In 2011, for example, a student died while being coerced into excessive alcohol consumption, according to a lawsuit. SAE's previous insurer dumped the fraternity. "As a result, we are paying Lloyd's of London the highest insurance rates in the Greek-letter world," Cohen said. Universities have turned down SAE's attempts to open new chapters, and the fraternity had to close 12 in 18 months over hazing incidents."""
input_ids = tokenizer(article, return_tensors="pt").input_ids
output_ids = model.generate(input_ids)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# should produce
# sae was founded in 1856, five years before the civil war. the fraternity has had to work hard to change recently. the university of oklahoma president says the university's affiliation with the fraternity is permanently done. the sae has had a string of members in recent mon
ths.
```
## Training script:
**IMPORTANT**: In order for this code to work, make sure you checkout to the branch
[more_general_trainer_metric](https://github.com/huggingface/transformers/tree/more_general_trainer_metric), which slightly adapts
the `Trainer` for `EncoderDecoderModels` according to this PR: https://github.com/huggingface/transformers/pull/5840.
The following code shows the complete training script that was used to fine-tune `bert2bert-cnn_dailymail-fp16
` for reproducability. The training last ~9h on a standard GPU.
```python
#!/usr/bin/env python3
import nlp
import logging
from transformers import BertTokenizer, EncoderDecoderModel, Trainer, TrainingArguments
logging.basicConfig(level=logging.INFO)
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# CLS token will work as BOS token
tokenizer.bos_token = tokenizer.cls_token
# SEP token will work as EOS token
tokenizer.eos_token = tokenizer.sep_token
# load train and validation data
train_dataset = nlp.load_dataset("cnn_dailymail", "3.0.0", split="train")
val_dataset = nlp.load_dataset("cnn_dailymail", "3.0.0", split="validation[:10%]")
# load rouge for validation
rouge = nlp.load_metric("rouge")
# set decoding params
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.config.max_length = 142
model.config.min_length = 56
model.config.no_repeat_ngram_size = 3
model.early_stopping = True
model.length_penalty = 2.0
model.num_beams = 4
# map data correctly
def map_to_encoder_decoder_inputs(batch):
# Tokenizer will automatically set [BOS] <text> [EOS]
# cut off at BERT max length 512
inputs = tokenizer(batch["article"], padding="max_length", truncation=True, max_length=512)
# force summarization <= 128
outputs = tokenizer(batch["highlights"], padding="max_length", truncation=True, max_length=128)
batch["input_ids"] = inputs.input_ids
batch["attention_mask"] = inputs.attention_mask
batch["decoder_input_ids"] = outputs.input_ids
batch["labels"] = outputs.input_ids.copy()
# mask loss for padding
batch["labels"] = [
[-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"]
]
batch["decoder_attention_mask"] = outputs.attention_mask
assert all([len(x) == 512 for x in inputs.input_ids])
assert all([len(x) == 128 for x in outputs.input_ids])
return batch
def compute_metrics(pred):
labels_ids = pred.label_ids
pred_ids = pred.predictions
# all unnecessary tokens are removed
pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid
return {
"rouge2_precision": round(rouge_output.precision, 4),
"rouge2_recall": round(rouge_output.recall, 4),
"rouge2_fmeasure": round(rouge_output.fmeasure, 4),
}
# set batch size here
batch_size = 16
# make train dataset ready
train_dataset = train_dataset.map(
map_to_encoder_decoder_inputs, batched=True, batch_size=batch_size, remove_columns=["article", "highlights"],
)
train_dataset.set_format(
type="torch", columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)
# same for validation dataset
val_dataset = val_dataset.map(
map_to_encoder_decoder_inputs, batched=True, batch_size=batch_size, remove_columns=["article", "highlights"],
)
val_dataset.set_format(
type="torch", columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)
# set training arguments - these params are not really tuned, feel free to change
training_args = TrainingArguments(
output_dir="./",
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
predict_from_generate=True,
evaluate_during_training=True,
do_train=True,
do_eval=True,
logging_steps=1000,
save_steps=1000,
eval_steps=1000,
overwrite_output_dir=True,
warmup_steps=2000,
save_total_limit=10,
)
# instantiate trainer
trainer = Trainer(
model=model,
args=training_args,
compute_metrics=compute_metrics,
train_dataset=train_dataset,
eval_dataset=val_dataset,
)
# start training
trainer.train()
```
## Evaluation
The following script evaluates the model on the test set of
CNN/Daily Mail.
```python
#!/usr/bin/env python3
import nlp
from transformers import BertTokenizer, EncoderDecoderModel
tokenizer = BertTokenizer.from_pretrained("patrickvonplaten/bert2bert-cnn_dailymail-fp16")
model = EncoderDecoderModel.from_pretrained("patrickvonplaten/bert2bert-cnn_dailymail-fp16")
model.to("cuda")
test_dataset = nlp.load_dataset("cnn_dailymail", "3.0.0", split="test")
batch_size = 128
# map data correctly
def generate_summary(batch):
# Tokenizer will automatically set [BOS] <text> [EOS]
# cut off at BERT max length 512
inputs = tokenizer(batch["article"], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
input_ids = inputs.input_ids.to("cuda")
attention_mask = inputs.attention_mask.to("cuda")
outputs = model.generate(input_ids, attention_mask=attention_mask)
# all special tokens including will be removed
output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)
batch["pred"] = output_str
return batch
results = test_dataset.map(generate_summary, batched=True, batch_size=batch_size, remove_columns=["article"])
# load rouge for validation
rouge = nlp.load_metric("rouge")
pred_str = results["pred"]
label_str = results["highlights"]
rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid
print(rouge_output)
```
The obtained results should be:
| - | Rouge2 - mid -precision | Rouge2 - mid - recall | Rouge2 - mid - fmeasure |
|----------|:-------------:|:------:|:------:|
| **CNN/Daily Mail** | 16.12 | 17.07 | **16.1** |
---
language: en
license: apache-2.0
datasets:
- cnn_dailymail
tags:
- summarization
---
Bert2Bert Summarization with 🤗EncoderDecoder Framework
This model is a warm-started *BERT2BERT* model fine-tuned on the *CNN/Dailymail* summarization dataset.
The model achieves a **18.22** ROUGE-2 score on *CNN/Dailymail*'s test dataset.
For more details on how the model was fine-tuned, please refer to
[this](https://colab.research.google.com/drive/1Ekd5pUeCX7VOrMx94_czTkwNtLN32Uyu?usp=sharing) notebook.
# Bert2GPT2 Summarization with 🤗 EncoderDecoder Framework
This model is a Bert2Bert model fine-tuned on summarization.
Bert2GPT2 is a `EncoderDecoderModel`, meaning that the encoder is a `bert-base-uncased`
BERT model and the decoder is a `gpt2` GPT2 model. Leveraging the [EncoderDecoderFramework](https://huggingface.co/transformers/model_doc/encoderdecoder.html#encoder-decoder-models), the
two pretrained models can simply be loaded into the framework via:
```python
bert2gpt2 = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "gpt2")
```
The decoder of an `EncoderDecoder` model needs cross-attention layers and usually makes use of causal
masking for auto-regressiv generation.
Thus, ``bert2gpt2`` is consequently fined-tuned on the `CNN/Daily Mail`dataset and the resulting model
`bert2gpt2-cnn_dailymail-fp16` is uploaded here.
## Example
The model is by no means a state-of-the-art model, but nevertheless
produces reasonable summarization results. It was mainly fine-tuned
as a proof-of-concept for the 🤗 EncoderDecoder Framework.
The model can be used as follows:
```python
from transformers import BertTokenizer, GPT2Tokenizer, EncoderDecoderModel
model = EncoderDecoderModel.from_pretrained("patrickvonplaten/bert2gpt2-cnn_dailymail-fp16")
# reuse tokenizer from bert2bert encoder-decoder model
bert_tokenizer = BertTokenizer.from_pretrained("patrickvonplaten/bert2bert-cnn_dailymail-fp16")
article = """(CNN)Sigma Alpha Epsilon is under fire for a video showing party-bound fraternity members singing a racist chant. SAE's national chapter suspended the students, but University of Oklahoma President David B
oren took it a step further, saying the university's affiliation with the fraternity is permanently done. The news is shocking, but it's not the first time SAE has faced controversy. SAE was founded March 9, 185
6, at the University of Alabama, five years before the American Civil War, according to the fraternity website. When the war began, the group had fewer than 400 members, of which "369 went to war for the Confede
rate States and seven for the Union Army," the website says. The fraternity now boasts more than 200,000 living alumni, along with about 15,000 undergraduates populating 219 chapters and 20 "colonies" seeking fu
ll membership at universities. SAE has had to work hard to change recently after a string of member deaths, many blamed on the hazing of new recruits, SAE national President Bradley Cohen wrote in a message on t
he fraternity's website. The fraternity's website lists more than 130 chapters cited or suspended for "health and safety incidents" since 2010. At least 30 of the incidents involved hazing, and dozens more invol
ved alcohol. However, the list is missing numerous incidents from recent months. Among them, according to various media outlets: Yale University banned the SAEs from campus activities last month after members al
legedly tried to interfere with a sexual misconduct investigation connected to an initiation rite. Stanford University in December suspended SAE housing privileges after finding sorority members attending a frat
ernity function were subjected to graphic sexual content. And Johns Hopkins University in November suspended the fraternity for underage drinking. "The media has labeled us as the 'nation's deadliest fraternity,
' " Cohen said. In 2011, for example, a student died while being coerced into excessive alcohol consumption, according to a lawsuit. SAE's previous insurer dumped the fraternity. "As a result, we are paying Lloy
d's of London the highest insurance rates in the Greek-letter world," Cohen said. Universities have turned down SAE's attempts to open new chapters, and the fraternity had to close 12 in 18 months over hazing in
cidents."""
input_ids = bert_tokenizer(article, return_tensors="pt").input_ids
output_ids = model.generate(input_ids)
# we need a gpt2 tokenizer for the output word embeddings
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
print(gpt2_tokenizer.decode(output_ids[0], skip_special_tokens=True))
# should produce
# SAE's national chapter suspended the students, but university president says it's permanent.
# The fraternity has had to deal with a string of incidents since 2010.
# SAE has more than 200,000 members, many of whom are students.
# A student died while being coerced into drinking alcohol.
```
## Training script:
**IMPORTANT**: In order for this code to work, make sure you checkout to the branch
[more_general_trainer_metric](https://github.com/huggingface/transformers/tree/more_general_trainer_metric), which slightly adapts
the `Trainer` for `EncoderDecoderModels` according to this PR: https://github.com/huggingface/transformers/pull/5840.
The following code shows the complete training script that was used to fine-tune `bert2gpt2-cnn_dailymail-fp16
` for reproducability. The training last ~11h on a standard GPU.
```python
#!/usr/bin/env python3
import nlp
import logging
from transformers import BertTokenizer, GPT2Tokenizer, EncoderDecoderModel, Trainer, TrainingArguments
logging.basicConfig(level=logging.INFO)
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-cased", "gpt2")
# cache is currently not supported by EncoderDecoder framework
model.decoder.config.use_cache = False
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
# CLS token will work as BOS token
bert_tokenizer.bos_token = bert_tokenizer.cls_token
# SEP token will work as EOS token
bert_tokenizer.eos_token = bert_tokenizer.sep_token
# make sure GPT2 appends EOS in begin and end
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
outputs = [self.bos_token_id] + token_ids_0 + [self.eos_token_id]
return outputs
GPT2Tokenizer.build_inputs_with_special_tokens = build_inputs_with_special_tokens
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# set pad_token_id to unk_token_id -> be careful here as unk_token_id == eos_token_id == bos_token_id
gpt2_tokenizer.pad_token = gpt2_tokenizer.unk_token
# set decoding params
model.config.decoder_start_token_id = gpt2_tokenizer.bos_token_id
model.config.eos_token_id = gpt2_tokenizer.eos_token_id
model.config.max_length = 142
model.config.min_length = 56
model.config.no_repeat_ngram_size = 3
model.early_stopping = True
model.length_penalty = 2.0
model.num_beams = 4
# load train and validation data
train_dataset = nlp.load_dataset("cnn_dailymail", "3.0.0", split="train")
val_dataset = nlp.load_dataset("cnn_dailymail", "3.0.0", split="validation[:5%]")
# load rouge for validation
rouge = nlp.load_metric("rouge", experiment_id=1)
encoder_length = 512
decoder_length = 128
batch_size = 16
# map data correctly
def map_to_encoder_decoder_inputs(batch): # Tokenizer will automatically set [BOS] <text> [EOS]
# use bert tokenizer here for encoder
inputs = bert_tokenizer(batch["article"], padding="max_length", truncation=True, max_length=encoder_length)
# force summarization <= 128
outputs = gpt2_tokenizer(batch["highlights"], padding="max_length", truncation=True, max_length=decoder_length)
batch["input_ids"] = inputs.input_ids
batch["attention_mask"] = inputs.attention_mask
batch["decoder_input_ids"] = outputs.input_ids
batch["labels"] = outputs.input_ids.copy()
batch["decoder_attention_mask"] = outputs.attention_mask
# complicated list comprehension here because pad_token_id alone is not good enough to know whether label should be excluded or not
batch["labels"] = [
[-100 if mask == 0 else token for mask, token in mask_and_tokens] for mask_and_tokens in [zip(masks, labels) for masks, labels in zip(batch["decoder_attention_mask"], batch["labels"])]
]
assert all([len(x) == encoder_length for x in inputs.input_ids])
assert all([len(x) == decoder_length for x in outputs.input_ids])
return batch
def compute_metrics(pred):
labels_ids = pred.label_ids
pred_ids = pred.predictions
# all unnecessary tokens are removed
pred_str = gpt2_tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
labels_ids[labels_ids == -100] = gpt2_tokenizer.eos_token_id
label_str = gpt2_tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid
return {
"rouge2_precision": round(rouge_output.precision, 4),
"rouge2_recall": round(rouge_output.recall, 4),
"rouge2_fmeasure": round(rouge_output.fmeasure, 4),
}
# make train dataset ready
train_dataset = train_dataset.map(
map_to_encoder_decoder_inputs, batched=True, batch_size=batch_size, remove_columns=["article", "highlights"],
)
train_dataset.set_format(
type="torch", columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)
# same for validation dataset
val_dataset = val_dataset.map(
map_to_encoder_decoder_inputs, batched=True, batch_size=batch_size, remove_columns=["article", "highlights"],
)
val_dataset.set_format(
type="torch", columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)
# set training arguments - these params are not really tuned, feel free to change
training_args = TrainingArguments(
output_dir="./",
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
predict_from_generate=True,
evaluate_during_training=True,
do_train=True,
do_eval=True,
logging_steps=1000,
save_steps=1000,
eval_steps=1000,
overwrite_output_dir=True,
warmup_steps=2000,
save_total_limit=10,
fp16=True,
)
# instantiate trainer
trainer = Trainer(
model=model,
args=training_args,
compute_metrics=compute_metrics,
train_dataset=train_dataset,
eval_dataset=val_dataset,
)
# start training
trainer.train()
```
## Evaluation
The following script evaluates the model on the test set of
CNN/Daily Mail.
```python
#!/usr/bin/env python3
import nlp
from transformers import BertTokenizer, GPT2Tokenizer, EncoderDecoderModel
model = EncoderDecoderModel.from_pretrained("patrickvonplaten/bert2gpt2-cnn_dailymail-fp16")
model.to("cuda")
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
# CLS token will work as BOS token
bert_tokenizer.bos_token = bert_tokenizer.cls_token
# SEP token will work as EOS token
bert_tokenizer.eos_token = bert_tokenizer.sep_token
# make sure GPT2 appends EOS in begin and end
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
outputs = [self.bos_token_id] + token_ids_0 + [self.eos_token_id]
return outputs
GPT2Tokenizer.build_inputs_with_special_tokens = build_inputs_with_special_tokens
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# set pad_token_id to unk_token_id -> be careful here as unk_token_id == eos_token_id == bos_token_id
gpt2_tokenizer.pad_token = gpt2_tokenizer.unk_token
# set decoding params
model.config.decoder_start_token_id = gpt2_tokenizer.bos_token_id
model.config.eos_token_id = gpt2_tokenizer.eos_token_id
model.config.max_length = 142
model.config.min_length = 56
model.config.no_repeat_ngram_size = 3
model.early_stopping = True
model.length_penalty = 2.0
model.num_beams = 4
test_dataset = nlp.load_dataset("cnn_dailymail", "3.0.0", split="test")
batch_size = 64
# map data correctly
def generate_summary(batch):
# Tokenizer will automatically set [BOS] <text> [EOS]
# cut off at BERT max length 512
inputs = bert_tokenizer(batch["article"], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
input_ids = inputs.input_ids.to("cuda")
attention_mask = inputs.attention_mask.to("cuda")
outputs = model.generate(input_ids, attention_mask=attention_mask)
# all special tokens including will be removed
output_str = gpt2_tokenizer.batch_decode(outputs, skip_special_tokens=True)
batch["pred"] = output_str
return batch
results = test_dataset.map(generate_summary, batched=True, batch_size=batch_size, remove_columns=["article"])
# load rouge for validation
rouge = nlp.load_metric("rouge")
pred_str = results["pred"]
label_str = results["highlights"]
rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid
print(rouge_output)
```
The obtained results should be:
| - | Rouge2 - mid -precision | Rouge2 - mid - recall | Rouge2 - mid - fmeasure |
|----------|:-------------:|:------:|:------:|
| **CNN/Daily Mail** | 14.42 | 16.99 | **15.16** |
# Longformer2Roberta Summarization with 🤗 EncoderDecoder Framework
This model is a Longformer2Roberta model fine-tuned on summarization.
Longformer2Roberta is a `EncoderDecoderModel`, meaning that both the encoder is a `allenai/longformer-base-4096` model and the decoder is a `roberta-base` model. Leveraging the [EncoderDecoderFramework](https://huggingface.co/transformers/model_doc/encoderdecoder.html#encoder-decoder-models), the
two pretrained models can simply be loaded into the framework via:
```python
roberta2roberta = EncoderDecoderModel.from_encoder_decoder_pretrained("allenai/longformer-base-4096", "roberta-base")
```
The decoder of an `EncoderDecoder` model needs cross-attention layers and usually makes use of causal
masking for auto-regressiv generation.
Thus, ``longformer2roberta`` is consequently fined-tuned on the `CNN/Daily Mail`dataset and the resulting model
`longformer2roberta-cnn_dailymail-fp16` is uploaded here.
## Example
The model is by no means a state-of-the-art model, but nevertheless
produces reasonable summarization results. It was mainly fine-tuned
as a proof-of-concept for the 🤗 EncoderDecoder Framework.
The model can be used as follows:
```python
from transformers import LongformerTokenizer, EncoderDecoderModel
model = EncoderDecoderModel.from_pretrained("patrickvonplaten/longformer2roberta-cnn_dailymail-fp16")
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
article = """(CNN)James Holmes made his introduction to the world in a Colorado cinema filled with spectators watching a midnight showing of the new Batman movie, "The Dark Knight Rises," in June 2012. The moment became one of the deadliest shootings in U.S. history. Holmes is accused of opening fire on the crowd, killing 12 people and injuring or maiming 70 others in Aurora, a suburb of Denver. Holmes appeared like a comic book character: He resembled the Joker, with red-orange hair, similar to the late actor Heath Ledger\'s portrayal of the villain in an earlier Batman movie, authorities said. But Holmes was hardly a cartoon. Authorities said he wore body armor and carried several guns, including an AR-15 rifle, with lots of ammo. He also wore a gas mask. Holmes says he was insane at the time of the shootings, and that is his legal defense and court plea: not guilty by reason of insanity. Prosecutors aren\'t swayed and will seek the death penalty. Opening statements in his trial are scheduled to begin Monday. Holmes admits to the shootings but says he was suffering "a psychotic episode" at the time, according to court papers filed in July 2013 by the state public defenders, Daniel King and Tamara A. Brady. Evidence "revealed thus far in the case supports the defense\'s position that Mr. Holmes suffers from a severe mental illness and was in the throes of a psychotic episode when he committed the acts that resulted in the tragic loss of life and injuries sustained by moviegoers on July 20, 2012," the public defenders wrote. Holmes no longer looks like a dazed Joker, as he did in his first appearance before a judge in 2012. He appeared dramatically different in January when jury selection began for his trial: 9,000 potential jurors were summoned for duty, described as one of the nation\'s largest jury calls. Holmes now has a cleaner look, with a mustache, button-down shirt and khaki pants. In January, he had a beard and eyeglasses. If this new image sounds like one of an academician, it may be because Holmes, now 27, once was one. Just before the shooting, Holmes was a doctoral student in neuroscience, and he was studying how the brain works, with his schooling funded by a U.S. government grant. Yet for all his learning, Holmes apparently lacked the capacity to command his own mind, according to the case against him. A jury will ultimately decide Holmes\' fate. That panel is made up of 12 jurors and 12 alternates. They are 19 women and five men, and almost all are white and middle-aged. The trial could last until autumn. When jury summonses were issued in January, each potential juror stood a 0.2% chance of being selected, District Attorney George Brauchler told the final jury this month. He described the approaching trial as "four to five months of a horrible roller coaster through the worst haunted house you can imagine." The jury will have to render verdicts on each of the 165 counts against Holmes, including murder and attempted murder charges. Meanwhile, victims and their relatives are challenging all media outlets "to stop the gratuitous use of the name and likeness of mass killers, thereby depriving violent individuals the media celebrity and media spotlight they so crave," the No Notoriety group says. They are joined by victims from eight other mass shootings in recent U.S. history. Raised in central coastal California and in San Diego, James Eagan Holmes is the son of a mathematician father noted for his work at the FICO firm that provides credit scores and a registered nurse mother, according to the U-T San Diego newspaper. Holmes also has a sister, Chris, a musician, who\'s five years younger, the newspaper said. His childhood classmates remember him as a clean-cut, bespectacled boy with an "exemplary" character who "never gave any trouble, and never got in trouble himself," The Salinas Californian reported. His family then moved down the California coast, where Holmes grew up in the San Diego-area neighborhood of Rancho Peñasquitos, which a neighbor described as "kind of like Mayberry," the San Diego newspaper said. Holmes attended Westview High School, which says its school district sits in "a primarily middle- to upper-middle-income residential community." There, Holmes ran cross-country, played soccer and later worked at a biotechnology internship at the Salk Institute and Miramar College, which attracts academically talented students. By then, his peers described him as standoffish and a bit of a wiseacre, the San Diego newspaper said. Holmes attended college fairly close to home, in a neighboring area known as Southern California\'s "inland empire" because it\'s more than an hour\'s drive from the coast, in a warm, low-desert climate. He entered the University of California, Riverside, in 2006 as a scholarship student. In 2008 he was a summer camp counselor for disadvantaged children, age 7 to 14, at Camp Max Straus, run by Jewish Big Brothers Big Sisters of Los Angeles. He graduated from UC Riverside in 2010 with the highest honors and a bachelor\'s degree in neuroscience. "Academically, he was at the top of the top," Chancellor Timothy P. White said. He seemed destined for even higher achievement. By 2011, he had enrolled as a doctoral student in the neuroscience program at the University of Colorado Anschutz Medical Campus in Aurora, the largest academic health center in the Rocky Mountain region. The doctoral in neuroscience program attended by Holmes focuses on how the brain works, with an emphasis on processing of information, behavior, learning and memory. Holmes was one of six pre-thesis Ph.D. students in the program who were awarded a neuroscience training grant from the National Institutes of Health. The grant rewards outstanding neuroscientists who will make major contributions to neurobiology. A syllabus that listed Holmes as a student at the medical school shows he was to have delivered a presentation about microRNA biomarkers. But Holmes struggled, and his own mental health took an ominous turn. In March 2012, he told a classmate he wanted to kill people, and that he would do so "when his life was over," court documents said. Holmes was "denied access to the school after June 12, 2012, after he made threats to a professor," according to court documents. About that time, Holmes was a patient of University of Colorado psychiatrist Lynne Fenton. Fenton was so concerned about Holmes\' behavior that she mentioned it to her colleagues, saying he could be a danger to others, CNN affiliate KMGH-TV reported, citing sources with knowledge of the investigation. Fenton\'s concerns surfaced in early June, sources told the Denver station. Holmes began to fantasize about killing "a lot of people" in early June, nearly six weeks before the shootings, the station reported, citing unidentified sources familiar with the investigation. Holmes\' psychiatrist contacted several members of a "behavioral evaluation and threat assessment" team to say Holmes could be a danger to others, the station reported. At issue was whether to order Holmes held for 72 hours to be evaluated by mental health professionals, the station reported. "Fenton made initial phone calls about engaging the BETA team" in "the first 10 days" of June, but it "never came together" because in the period Fenton was having conversations with team members, Holmes began the process of dropping out of school, a source told KMGH. Defense attorneys have rejected the prosecution\'s assertions that Holmes was barred from campus. Citing statements from the university, Holmes\' attorneys have argued that his access was revoked because that\'s normal procedure when a student drops enrollment. What caused this turn for the worse for Holmes has yet to be clearly detailed. In the months before the shooting, he bought four weapons and more than 6,000 rounds of ammunition, authorities said. Police said he also booby-trapped his third-floor apartment with explosives, but police weren\'t fooled. After Holmes was caught in the cinema parking lot immediately after the shooting, bomb technicians went to the apartment and neutralized the explosives. No one was injured at the apartment building. Nine minutes before Holmes went into the movie theater, he called a University of Colorado switchboard, public defender Brady has said in court. The number he called can be used to get in contact with faculty members during off hours, Brady said. Court documents have also revealed that investigators have obtained text messages that Holmes exchanged with someone before the shooting. That person was not named, and the content of the texts has not been made public. According to The New York Times, Holmes sent a text message to a fellow graduate student, a woman, about two weeks before the shooting. She asked if he had left Aurora yet, reported the newspaper, which didn\'t identify her. No, he had two months left on his lease, Holmes wrote back, according to the Times. He asked if she had heard of "dysphoric mania," a form of bipolar disorder marked by the highs of mania and the dark and sometimes paranoid delusions of major depression. The woman asked if the disorder could be managed with treatment. "It was," Holmes wrote her, according to the Times. But he warned she should stay away from him "because I am bad news," the newspaper reported. It was her last contact with Holmes. After the shooting, Holmes\' family issued a brief statement: "Our hearts go out to those who were involved in this tragedy and to the families and friends of those involved," they said, without giving any information about their son. Since then, prosecutors have refused to offer a plea deal to Holmes. For Holmes, "justice is death," said Brauchler, the district attorney. In December, Holmes\' parents, who will be attending the trial, issued another statement: They asked that their son\'s life be spared and that he be sent to an institution for mentally ill people for the rest of his life, if he\'s found not guilty by reason of insanity. "He is not a monster," Robert and Arlene Holmes wrote, saying the death penalty is "morally wrong, especially when the condemned is mentally ill." "He is a human being gripped by a severe mental illness," the parents said. The matter will be settled by the jury. CNN\'s Ana Cabrera and Sara Weisfeldt contributed to this report from Denver."""
input_ids = tokenizer(article, return_tensors="pt").input_ids
output_ids = model.generate(input_ids)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# should produce
# James Holmes, 27, is accused of opening fire on a Colorado theater.
# He was a doctoral student at University of Colorado.
# Holmes says he was suffering "a psychotic episode" at the time of the shooting.
# Prosecutors won't say whether Holmes was barred from campus.
```
Such an article has a length of > 2000 tokens, which means that it cannot be handled correctly by Bert or Roberta encoders.
## Training script:
**IMPORTANT**: In order for this code to work, make sure you checkout to the branch
[more_general_trainer_metric](https://github.com/huggingface/transformers/tree/more_general_trainer_metric), which slightly adapts
the `Trainer` for `EncoderDecoderModels` according to this PR: https://github.com/huggingface/transformers/pull/5840.
The following code shows the complete training script that was used to fine-tune `longformer2roberta-cnn_dailymail-fp16
` for reproducability. The training last ~90h on a standard GPU.
```python
#!/usr/bin/env python3
import nlp
import logging
from transformers import LongformerTokenizer, EncoderDecoderModel, Trainer, TrainingArguments
logging.basicConfig(level=logging.INFO)
model = EncoderDecoderModel.from_encoder_decoder_pretrained("allenai/longformer-base-4096", "roberta-base")
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
# load train and validation data
train_dataset = nlp.load_dataset("cnn_dailymail", "3.0.0", split="train")
val_dataset = nlp.load_dataset("cnn_dailymail", "3.0.0", split="validation[:5%]")
# load rouge for validation
rouge = nlp.load_metric("rouge", experiment_id=0)
# enable gradient checkpointing for longformer encoder
model.encoder.config.gradient_checkpointing = True
# set decoding params
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.config.max_length = 142
model.config.min_length = 56
model.config.no_repeat_ngram_size = 3
model.early_stopping = True
model.length_penalty = 2.0
model.num_beams = 4
encoder_length = 2048
decoder_length = 128
batch_size = 16
# map data correctly
def map_to_encoder_decoder_inputs(batch):
# Tokenizer will automatically set [BOS] <text> [EOS]
# cut off at Longformer at 2048
inputs = tokenizer(batch["article"], padding="max_length", truncation=True, max_length=encoder_length)
# force summarization <= 128
outputs = tokenizer(batch["highlights"], padding="max_length", truncation=True, max_length=decoder_length)
batch["input_ids"] = inputs.input_ids
batch["attention_mask"] = inputs.attention_mask
# set 128 tokens to global attention
batch["global_attention_mask"] = [[1 if i < 128 else 0 for i in range(sequence_length)] for sequence_length in len(inputs.input_ids) * [encoder_length]]
batch["decoder_input_ids"] = outputs.input_ids
batch["labels"] = outputs.input_ids.copy()
# mask loss for padding
batch["labels"] = [
[-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"]
]
batch["decoder_attention_mask"] = outputs.attention_mask
assert all([len(x) == encoder_length for x in inputs.input_ids])
assert all([len(x) == decoder_length for x in outputs.input_ids])
return batch
def compute_metrics(pred):
labels_ids = pred.label_ids
pred_ids = pred.predictions
# all unnecessary tokens are removed
pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
labels_ids[labels_ids == -100] = tokenizer.eos_token_id
label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid
return {
"rouge2_precision": round(rouge_output.precision, 4),
"rouge2_recall": round(rouge_output.recall, 4),
"rouge2_fmeasure": round(rouge_output.fmeasure, 4),
}
return {
"rouge2_precision": round(rouge_output.precision, 4),
"rouge2_recall": round(rouge_output.recall, 4),
"rouge2_fmeasure": round(rouge_output.fmeasure, 4),
}
# make train dataset ready
train_dataset = train_dataset.map(
map_to_encoder_decoder_inputs, batched=True, batch_size=batch_size, remove_columns=["article", "highlights"],
)
train_dataset.set_format(
type="torch", columns=["input_ids", "attention_mask", "global_attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)
# same for validation dataset
val_dataset = val_dataset.map(
map_to_encoder_decoder_inputs, batched=True, batch_size=batch_size, remove_columns=["article", "highlights"],
)
val_dataset.set_format(
type="torch", columns=["input_ids", "global_attention_mask", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)
# set training arguments - these params are not really tuned, feel free to change
training_args = TrainingArguments(
output_dir="./",
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
predict_from_generate=True,
evaluate_during_training=True,
do_train=True,
do_eval=True,
logging_steps=1000,
save_steps=1000,
eval_steps=1000,
overwrite_output_dir=True,
warmup_steps=2000,
save_total_limit=3,
fp16=True,
)
# instantiate trainer
trainer = Trainer(
model=model,
args=training_args,
compute_metrics=compute_metrics,
train_dataset=train_dataset,
eval_dataset=val_dataset,
)
# start training
trainer.train()
```
## Evaluation
The following script evaluates the model on the test set of
CNN/Daily Mail.
```python
#!/usr/bin/env python3
import nlp
import torch
from transformers import LongformerTokenizer, EncoderDecoderModel
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = EncoderDecoderModel.from_pretrained("patrickvonplaten/longformer2roberta-cnn_dailymail-fp16")
model.to("cuda")
test_dataset = nlp.load_dataset("cnn_dailymail", "3.0.0", split="test")
batch_size = 32
encoder_length = 2048
decoder_length = 128
# map data correctly
def generate_summary(batch):
# Tokenizer will automatically set [BOS] <text> [EOS]
# cut off at BERT max length 512
inputs = tokenizer(batch["article"], padding="max_length", truncation=True, max_length=encoder_length, return_tensors="pt")
input_ids = inputs.input_ids.to("cuda")
attention_mask = inputs.attention_mask.to("cuda")
global_attention_mask = torch.zeros_like(attention_mask)
global_attention_mask[:, :decoder_length] = 1
outputs = model.generate(input_ids, attention_mask=attention_mask, global_attention_mask=global_attention_mask)
# all special tokens including will be removed
output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)
batch["pred"] = output_str
return batch
results = test_dataset.map(generate_summary, batched=True, batch_size=batch_size, remove_columns=["article"])
# load rouge for validation
rouge = nlp.load_metric("rouge")
pred_str = results["pred"]
label_str = results["highlights"]
rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid
print(rouge_output)
```
The obtained results should be:
| - | Rouge2 - mid -precision | Rouge2 - mid - recall | Rouge2 - mid - fmeasure |
|----------|:-------------:|:------:|:------:|
| **CNN/Daily Mail** | 12.39 | 15.05 | **13.21** |
**Note** This model was trained to show how Longformer can be used as an Encoder model in a EncoderDecoder setup.
Better results are obtained for datasets of much longer inputs.
# Roberta2Roberta Summarization with 🤗 EncoderDecoder Framework
This model is a Roberta2Roberta model fine-tuned on summarization.
Roberta2Roberta is a `EncoderDecoderModel`, meaning that both the encoder and the decoder are `roberta-base`
RoBERTa models. Leveraging the [EncoderDecoderFramework](https://huggingface.co/transformers/model_doc/encoderdecoder.html#encoder-decoder-models), the
two pretrained models can simply be loaded into the framework via:
```python
roberta2roberta = EncoderDecoderModel.from_encoder_decoder_pretrained("roberta-base", "roberta-base")
```
The decoder of an `EncoderDecoder` model needs cross-attention layers and usually makes use of causal
masking for auto-regressiv generation.
Thus, ``roberta2roberta`` is consequently fined-tuned on the `CNN/Daily Mail`dataset and the resulting model
`roberta2roberta-cnn_dailymail-fp16` is uploaded here.
## Example
The model is by no means a state-of-the-art model, but nevertheless
produces reasonable summarization results. It was mainly fine-tuned
as a proof-of-concept for the 🤗 EncoderDecoder Framework.
The model can be used as follows:
```python
from transformers import BertTokenizer, EncoderDecoderModel
model = EncoderDecoderModel.from_pretrained("patrickvonplaten/roberta2roberta-cnn_dailymail-fp16")
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
article = """(CNN)Sigma Alpha Epsilon is under fire for a video showing party-bound fraternity members singing a racist chant. SAE's national chapter suspended the students, but University of Oklahoma President David B
oren took it a step further, saying the university's affiliation with the fraternity is permanently done. The news is shocking, but it's not the first time SAE has faced controversy. SAE was founded March 9, 185
6, at the University of Alabama, five years before the American Civil War, according to the fraternity website. When the war began, the group had fewer than 400 members, of which "369 went to war for the Confede
rate States and seven for the Union Army," the website says. The fraternity now boasts more than 200,000 living alumni, along with about 15,000 undergraduates populating 219 chapters and 20 "colonies" seeking fu
ll membership at universities. SAE has had to work hard to change recently after a string of member deaths, many blamed on the hazing of new recruits, SAE national President Bradley Cohen wrote in a message on t
he fraternity's website. The fraternity's website lists more than 130 chapters cited or suspended for "health and safety incidents" since 2010. At least 30 of the incidents involved hazing, and dozens more invol
ved alcohol. However, the list is missing numerous incidents from recent months. Among them, according to various media outlets: Yale University banned the SAEs from campus activities last month after members al
legedly tried to interfere with a sexual misconduct investigation connected to an initiation rite. Stanford University in December suspended SAE housing privileges after finding sorority members attending a frat
ernity function were subjected to graphic sexual content. And Johns Hopkins University in November suspended the fraternity for underage drinking. "The media has labeled us as the 'nation's deadliest fraternity,
' " Cohen said. In 2011, for example, a student died while being coerced into excessive alcohol consumption, according to a lawsuit. SAE's previous insurer dumped the fraternity. "As a result, we are paying Lloy
d's of London the highest insurance rates in the Greek-letter world," Cohen said. Universities have turned down SAE's attempts to open new chapters, and the fraternity had to close 12 in 18 months over hazing in
cidents."""
input_ids = tokenizer(article, return_tensors="pt").input_ids
output_ids = model.generate(input_ids)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# should produce
# Sigma Alpha Epsilon is under fire for a video showing party-bound fraternity members singing racist chants. The fraternity's national chapter has had to close 12 in 18 months over hazing.
# Sigma has had more than 130 chapters in 18 states. University of Oklahoma president says fraternity has been "deteriorated".
```
## Training script:
**IMPORTANT**: In order for this code to work, make sure you checkout to the branch
[more_general_trainer_metric](https://github.com/huggingface/transformers/tree/more_general_trainer_metric), which slightly adapts
the `Trainer` for `EncoderDecoderModels` according to this PR: https://github.com/huggingface/transformers/pull/5840.
The following code shows the complete training script that was used to fine-tune `roberta2roberta-cnn_dailymail-fp16
` for reproducability. The training last ~9h on a standard GPU.
```python
#!/usr/bin/env python3
import nlp
import logging
from transformers import RobertaTokenizer, EncoderDecoderModel, Trainer, TrainingArguments
logging.basicConfig(level=logging.INFO)
model = EncoderDecoderModel.from_encoder_decoder_pretrained("roberta-base", "roberta-base")
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
# load train and validation data
train_dataset = nlp.load_dataset("cnn_dailymail", "3.0.0", split="train")
val_dataset = nlp.load_dataset("cnn_dailymail", "3.0.0", split="validation[:5%]")
# load rouge for validation
rouge = nlp.load_metric("rouge", experiment_id=0)
# set decoding params
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.config.max_length = 142
model.config.min_length = 56
model.config.no_repeat_ngram_size = 3
model.early_stopping = True
model.length_penalty = 2.0
model.num_beams = 4
encoder_length = 512
decoder_length = 128
batch_size = 16
# map data correctly
def map_to_encoder_decoder_inputs(batch):
# Tokenizer will automatically set [BOS] <text> [EOS]
# cut off at Longformer at 2048
inputs = tokenizer(batch["article"], padding="max_length", truncation=True, max_length=encoder_length)
# force summarization <= 256
outputs = tokenizer(batch["highlights"], padding="max_length", truncation=True, max_length=decoder_length)
batch["input_ids"] = inputs.input_ids
batch["attention_mask"] = inputs.attention_mask
batch["decoder_input_ids"] = outputs.input_ids
batch["labels"] = outputs.input_ids.copy()
# mask loss for padding
batch["labels"] = [
[-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"]
]
batch["decoder_attention_mask"] = outputs.attention_mask
assert all([len(x) == encoder_length for x in inputs.input_ids])
assert all([len(x) == decoder_length for x in outputs.input_ids])
return batch
def compute_metrics(pred):
labels_ids = pred.label_ids
pred_ids = pred.predictions
# all unnecessary tokens are removed
pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
labels_ids[labels_ids == -100] = tokenizer.eos_token_id
label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid
return {
"rouge2_precision": round(rouge_output.precision, 4),
"rouge2_recall": round(rouge_output.recall, 4),
"rouge2_fmeasure": round(rouge_output.fmeasure, 4),
}
# make train dataset ready
train_dataset = train_dataset.map(
map_to_encoder_decoder_inputs, batched=True, batch_size=batch_size, remove_columns=["article", "highlights"],
)
train_dataset.set_format(
type="torch", columns=["input_ids", "attention_mask", "decoder_attention_mask", "decoder_input_ids", "labels"],
)
# same for validation dataset
val_dataset = val_dataset.map(
map_to_encoder_decoder_inputs, batched=True, batch_size=batch_size, remove_columns=["article", "highlights"],
)
val_dataset.set_format(
type="torch", columns=["input_ids", "decoder_attention_mask", "attention_mask", "decoder_input_ids", "labels"],
)
# set training arguments - these params are not really tuned, feel free to change
training_args = TrainingArguments(
output_dir="./",
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
predict_from_generate=True,
evaluate_during_training=True,
do_train=True,
do_eval=True,
logging_steps=1000,
save_steps=1000,
eval_steps=1000,
overwrite_output_dir=True,
warmup_steps=2000,
save_total_limit=3,
fp16=True,
)
# instantiate trainer
trainer = Trainer(
model=model,
args=training_args,
compute_metrics=compute_metrics,
train_dataset=train_dataset,
eval_dataset=val_dataset,
)
# start training
trainer.train()
```
## Evaluation
The following script evaluates the model on the test set of
CNN/Daily Mail.
```python
#!/usr/bin/env python3
import nlp
from transformers import RobertaTokenizer, EncoderDecoderModel
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = EncoderDecoderModel.from_pretrained("patrickvonplaten/roberta2roberta-cnn_dailymail-fp16")
model.to("cuda")
test_dataset = nlp.load_dataset("cnn_dailymail", "3.0.0", split="test")
batch_size = 128
# map data correctly
def generate_summary(batch):
# Tokenizer will automatically set [BOS] <text> [EOS]
# cut off at BERT max length 512
inputs = tokenizer(batch["article"], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
input_ids = inputs.input_ids.to("cuda")
attention_mask = inputs.attention_mask.to("cuda")
outputs = model.generate(input_ids, attention_mask=attention_mask)
# all special tokens including will be removed
output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)
batch["pred"] = output_str
return batch
results = test_dataset.map(generate_summary, batched=True, batch_size=batch_size, remove_columns=["article"])
# load rouge for validation
rouge = nlp.load_metric("rouge")
pred_str = results["pred"]
label_str = results["highlights"]
rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid
print(rouge_output)
```
The obtained results should be:
| - | Rouge2 - mid -precision | Rouge2 - mid - recall | Rouge2 - mid - fmeasure |
|----------|:-------------:|:------:|:------:|
| **CNN/Daily Mail** | 15.79 | 19.05 | **16.79** |
# Shared Roberta2Roberta Summarization with 🤗 EncoderDecoder Framework
This model is a shared Roberta2Roberta model, meaning that the encoder and decoder weights are tied, fine-tuned on summarization.
Roberta2Roberta is a `EncoderDecoderModel`, meaning that both the encoder and the decoder are `roberta-base`
RoBERTa models. In this setup the encoder and decoder weights are tied. Leveraging the [EncoderDecoderFramework](https://huggingface.co/transformers/model_doc/encoderdecoder.html#encoder-decoder-models), the
two pretrained models can simply be loaded into the framework via:
```python
roberta2roberta = EncoderDecoderModel.from_encoder_decoder_pretrained("roberta-base", "roberta-base", tie_encoder_decoder=True)
```
The decoder of an `EncoderDecoder` model needs cross-attention layers and usually makes use of causal
masking for auto-regressiv generation.
Thus, ``roberta2roberta`` is consequently fined-tuned on the `CNN/Daily Mail`dataset and the resulting model
`roberta2roberta-share-cnn_dailymail-fp16` is uploaded here.
## Example
The model is by no means a state-of-the-art model, but nevertheless
produces reasonable summarization results. It was mainly fine-tuned
as a proof-of-concept for the 🤗 EncoderDecoder Framework.
The model can be used as follows:
```python
from transformers import RobertaTokenizer, EncoderDecoderModel
model = EncoderDecoderModel.from_pretrained("patrickvonplaten/roberta2roberta-share-cnn_dailymail-fp16")
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
article = """(CNN)Sigma Alpha Epsilon is under fire for a video showing party-bound fraternity members singing a racist chant. SAE's national chapter suspended the students, but University of Oklahoma President David B
oren took it a step further, saying the university's affiliation with the fraternity is permanently done. The news is shocking, but it's not the first time SAE has faced controversy. SAE was founded March 9, 185
6, at the University of Alabama, five years before the American Civil War, according to the fraternity website. When the war began, the group had fewer than 400 members, of which "369 went to war for the Confede
rate States and seven for the Union Army," the website says. The fraternity now boasts more than 200,000 living alumni, along with about 15,000 undergraduates populating 219 chapters and 20 "colonies" seeking fu
ll membership at universities. SAE has had to work hard to change recently after a string of member deaths, many blamed on the hazing of new recruits, SAE national President Bradley Cohen wrote in a message on t
he fraternity's website. The fraternity's website lists more than 130 chapters cited or suspended for "health and safety incidents" since 2010. At least 30 of the incidents involved hazing, and dozens more invol
ved alcohol. However, the list is missing numerous incidents from recent months. Among them, according to various media outlets: Yale University banned the SAEs from campus activities last month after members al
legedly tried to interfere with a sexual misconduct investigation connected to an initiation rite. Stanford University in December suspended SAE housing privileges after finding sorority members attending a frat
ernity function were subjected to graphic sexual content. And Johns Hopkins University in November suspended the fraternity for underage drinking. "The media has labeled us as the 'nation's deadliest fraternity,
' " Cohen said. In 2011, for example, a student died while being coerced into excessive alcohol consumption, according to a lawsuit. SAE's previous insurer dumped the fraternity. "As a result, we are paying Lloy
d's of London the highest insurance rates in the Greek-letter world," Cohen said. Universities have turned down SAE's attempts to open new chapters, and the fraternity had to close 12 in 18 months over hazing in
cidents."""
input_ids = tokenizer(article, return_tensors="pt").input_ids
output_ids = model.generate(input_ids)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# should produce
# SAE's national chapter suspended after video shows party-bound fraternity members singing racist chant. University of Oklahoma president says university's affiliation with fraternity is permanently done.
# SAE has had to close 12 chapters since 2010 after members were killed in hazing. The fraternity has had more than 130 chapters in 18 months.
```
## Training script:
**IMPORTANT**: In order for this code to work, make sure you checkout to the branch
[more_general_trainer_metric](https://github.com/huggingface/transformers/tree/more_general_trainer_metric), which slightly adapts
the `Trainer` for `EncoderDecoderModels` according to this PR: https://github.com/huggingface/transformers/pull/5840.
The following code shows the complete training script that was used to fine-tune `roberta2roberta-cnn_dailymail-fp16
` for reproducability. The training last ~9h on a standard GPU.
```python
#!/usr/bin/env python3
import nlp
import logging
from transformers import RobertaTokenizer, EncoderDecoderModel, Trainer, TrainingArguments
logging.basicConfig(level=logging.INFO)
model = EncoderDecoderModel.from_encoder_decoder_pretrained("roberta-base", "roberta-base", tie_encoder_decoder=True)
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
# load train and validation data
train_dataset = nlp.load_dataset("cnn_dailymail", "3.0.0", split="train")
val_dataset = nlp.load_dataset("cnn_dailymail", "3.0.0", split="validation[:5%]")
# load rouge for validation
rouge = nlp.load_metric("rouge", experiment_id=0)
# set decoding params
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.config.max_length = 142
model.config.min_length = 56
model.config.no_repeat_ngram_size = 3
model.early_stopping = True
model.length_penalty = 2.0
model.num_beams = 4
encoder_length = 512
decoder_length = 128
batch_size = 16
# map data correctly
def map_to_encoder_decoder_inputs(batch):
# Tokenizer will automatically set [BOS] <text> [EOS]
# cut off at Longformer at 2048
inputs = tokenizer(batch["article"], padding="max_length", truncation=True, max_length=encoder_length)
# force summarization <= 256
outputs = tokenizer(batch["highlights"], padding="max_length", truncation=True, max_length=decoder_length)
batch["input_ids"] = inputs.input_ids
batch["attention_mask"] = inputs.attention_mask
batch["decoder_input_ids"] = outputs.input_ids
batch["labels"] = outputs.input_ids.copy()
# mask loss for padding
batch["labels"] = [
[-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"]
]
batch["decoder_attention_mask"] = outputs.attention_mask
assert all([len(x) == encoder_length for x in inputs.input_ids])
assert all([len(x) == decoder_length for x in outputs.input_ids])
return batch
def compute_metrics(pred):
labels_ids = pred.label_ids
pred_ids = pred.predictions
# all unnecessary tokens are removed
pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
labels_ids[labels_ids == -100] = tokenizer.eos_token_id
label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid
return {
"rouge2_precision": round(rouge_output.precision, 4),
"rouge2_recall": round(rouge_output.recall, 4),
"rouge2_fmeasure": round(rouge_output.fmeasure, 4),
}
# make train dataset ready
train_dataset = train_dataset.map(
map_to_encoder_decoder_inputs, batched=True, batch_size=batch_size, remove_columns=["article", "highlights"],
)
train_dataset.set_format(
type="torch", columns=["input_ids", "attention_mask", "decoder_attention_mask", "decoder_input_ids", "labels"],
)
# same for validation dataset
val_dataset = val_dataset.map(
map_to_encoder_decoder_inputs, batched=True, batch_size=batch_size, remove_columns=["article", "highlights"],
)
val_dataset.set_format(
type="torch", columns=["input_ids", "decoder_attention_mask", "attention_mask", "decoder_input_ids", "labels"],
)
# set training arguments - these params are not really tuned, feel free to change
training_args = TrainingArguments(
output_dir="./",
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
predict_from_generate=True,
evaluate_during_training=True,
do_train=True,
do_eval=True,
logging_steps=1000,
save_steps=1000,
eval_steps=1000,
overwrite_output_dir=True,
warmup_steps=2000,
save_total_limit=3,
fp16=True,
)
# instantiate trainer
trainer = Trainer(
model=model,
args=training_args,
compute_metrics=compute_metrics,
train_dataset=train_dataset,
eval_dataset=val_dataset,
)
# start training
trainer.train()
```
## Evaluation
The following script evaluates the model on the test set of
CNN/Daily Mail.
```python
#!/usr/bin/env python3
import nlp
from transformers import RobertaTokenizer, EncoderDecoderModel
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = EncoderDecoderModel.from_pretrained("patrickvonplaten/roberta2roberta-share-cnn_dailymail-fp16")
model.to("cuda")
test_dataset = nlp.load_dataset("cnn_dailymail", "3.0.0", split="test")
batch_size = 128
# map data correctly
def generate_summary(batch):
# Tokenizer will automatically set [BOS] <text> [EOS]
# cut off at BERT max length 512
inputs = tokenizer(batch["article"], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
input_ids = inputs.input_ids.to("cuda")
attention_mask = inputs.attention_mask.to("cuda")
outputs = model.generate(input_ids, attention_mask=attention_mask)
# all special tokens including will be removed
output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)
batch["pred"] = output_str
return batch
results = test_dataset.map(generate_summary, batched=True, batch_size=batch_size, remove_columns=["article"])
# load rouge for validation
rouge = nlp.load_metric("rouge")
pred_str = results["pred"]
label_str = results["highlights"]
rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid
print(rouge_output)
```
The obtained results should be:
| - | Rouge2 - mid -precision | Rouge2 - mid - recall | Rouge2 - mid - fmeasure |
|----------|:-------------:|:------:|:------:|
| **CNN/Daily Mail** | 15.6 | 18.79 | **16.59** |
---
language: en
license: apache-2.0
datasets:
- xsum
tags:
- summarization
---
Shared RoBERTa2RoBERTa Summarization with 🤗EncoderDecoder Framework
This model is a warm-started *RoBERTaShared* model fine-tuned on the *BBC XSum* summarization dataset.
The model achieves a **16.89** ROUGE-2 score on *BBC XSUM*'s test dataset.
For more details on how the model was fine-tuned, please refer to
[this](https://colab.research.google.com/drive/1Ekd5pUeCX7VOrMx94_czTkwNtLN32Uyu?usp=sharing) notebook.
---
language: "nl"
thumbnail: "https://github.com/iPieter/RobBERT/raw/master/res/robbert_logo.png"
tags:
- Dutch
- RoBERTa
- RobBERT
license: mit
datasets:
- oscar
- Shuffled Dutch section of the OSCAR corpus (https://oscar-corpus.com/)
---
# RobBERT
## Model description
[RobBERT v2](https://github.com/iPieter/RobBERT) is a Dutch state-of-the-art [RoBERTa](https://arxiv.org/abs/1907.11692)-based language model.
More detailled information can be found in the [RobBERT paper](https://arxiv.org/abs/2001.06286).
## How to use
```python
from transformers import RobertaTokenizer, RobertaForSequenceClassification
tokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
model = RobertaForSequenceClassification.from_pretrained("pdelobelle/robbert-v2-dutch-base")
```
## Performance Evaluation Results
All experiments are described in more detail in our [paper](https://arxiv.org/abs/2001.06286).
### Sentiment analysis
Predicting whether a review is positive or negative using the [Dutch Book Reviews Dataset](https://github.com/benjaminvdb/110kDBRD).
| Model | Accuracy [%] |
|-------------------|--------------------------|
| ULMFiT | 93.8 |
| BERTje | 93.0 |
| RobBERT v2 | **95.1** |
### Die/Dat (coreference resolution)
We measured how well the models are able to do coreference resolution by predicting whether "die" or "dat" should be filled into a sentence.
For this, we used the [EuroParl corpus](https://www.statmt.org/europarl/).
#### Finetuning on whole dataset
| Model | Accuracy [%] | F1 [%] |
|-------------------|--------------------------|--------------|
| [Baseline](https://arxiv.org/abs/2001.02943) (LSTM) | | 75.03 |
| mBERT | 98.285 | 98.033 |
| BERTje | 98.268 | 98.014 |
| RobBERT v2 | **99.232** | **99.121** |
#### Finetuning on 10K examples
We also measured the performance using only 10K training examples.
This experiment clearly illustrates that RobBERT outperforms other models when there is little data available.
| Model | Accuracy [%] | F1 [%] |
|-------------------|--------------------------|--------------|
| mBERT | 92.157 | 90.898 |
| BERTje | 93.096 | 91.279 |
| RobBERT v2 | **97.816** | **97.514** |
#### Using zero-shot word masking task
Since BERT models are pre-trained using the word masking task, we can use this to predict whether "die" or "dat" is more likely.
This experiment shows that RobBERT has internalised more information about Dutch than other models.
| Model | Accuracy [%] |
|-------------------|--------------------------|
| ZeroR | 66.70 |
| mBERT | 90.21 |
| BERTje | 94.94 |
| RobBERT v2 | **98.75** |
### Part-of-Speech Tagging.
Using the [Lassy UD dataset](https://universaldependencies.org/treebanks/nl_lassysmall/index.html).
| Model | Accuracy [%] |
|-------------------|--------------------------|
| Frog | 91.7 |
| mBERT | **96.5** |
| BERTje | 96.3 |
| RobBERT v2 | 96.4 |
Interestingly, we found that when dealing with **small data sets**, RobBERT v2 **significantly outperforms** other models.
<p align="center">
<img src="https://github.com/iPieter/RobBERT/blob/master/res/robbert_pos_accuracy.png" alt="RobBERT's performance on smaller datasets">
</p>
### Named Entity Recognition
Using the [CoNLL 2002 evaluation script](https://www.clips.uantwerpen.be/conll2002/ner/).
| Model | Accuracy [%] |
|-------------------|--------------------------|
| Frog | 57.31 |
| mBERT | **90.94** |
| BERT-NL | 89.7 |
| BERTje | 88.3 |
| RobBERT v2 | 89.08 |
## Training procedure
We pre-trained RobBERT using the RoBERTa training regime.
We pre-trained our model on the Dutch section of the [OSCAR corpus](https://oscar-corpus.com/), a large multilingual corpus which was obtained by language classification in the Common Crawl corpus.
This Dutch corpus is 39GB large, with 6.6 billion words spread over 126 million lines of text, where each line could contain multiple sentences, thus using more data than concurrently developed Dutch BERT models.
RobBERT shares its architecture with [RoBERTa's base model](https://github.com/pytorch/fairseq/tree/master/examples/roberta), which itself is a replication and improvement over BERT.
Like BERT, it's architecture consists of 12 self-attention layers with 12 heads with 117M trainable parameters.
One difference with the original BERT model is due to the different pre-training task specified by RoBERTa, using only the MLM task and not the NSP task.
During pre-training, it thus only predicts which words are masked in certain positions of given sentences.
The training process uses the Adam optimizer with polynomial decay of the learning rate l_r=10^-6 and a ramp-up period of 1000 iterations, with hyperparameters beta_1=0.9
and RoBERTa's default beta_2=0.98.
Additionally, a weight decay of 0.1 and a small dropout of 0.1 helps prevent the model from overfitting.
RobBERT was trained on a computing cluster with 4 Nvidia P100 GPUs per node, where the number of nodes was dynamically adjusted while keeping a fixed batch size of 8192 sentences.
At most 20 nodes were used (i.e. 80 GPUs), and the median was 5 nodes.
By using gradient accumulation, the batch size could be set independently of the number of GPUs available, in order to maximally utilize the cluster.
Using the [Fairseq library](https://github.com/pytorch/fairseq/tree/master/examples/roberta), the model trained for two epochs, which equals over 16k batches in total, which took about three days on the computing cluster.
In between training jobs on the computing cluster, 2 Nvidia 1080 Ti's also covered some parameter updates for RobBERT v2.
## Limitations and bias
In the [RobBERT paper](https://arxiv.org/abs/2001.06286), we also investigated potential sources of bias in RobBERT.
We found that the zeroshot model estimates the probability of *hij* (he) to be higher than *zij* (she) for most occupations in bleached template sentences, regardless of their actual job gender ratio in reality.
<p align="center">
<img src="https://github.com/iPieter/RobBERT/blob/master/res/gender_diff.png" alt="RobBERT's performance on smaller datasets">
</p>
By augmenting the DBRB Dutch Book sentiment analysis dataset with the stated gender of the author of the review, we found that highly positive reviews written by women were generally more accurately detected by RobBERT as being positive than those written by men.
<p align="center">
<img src="https://github.com/iPieter/RobBERT/blob/master/res/dbrd.png" alt="RobBERT's performance on smaller datasets">
</p>
## BibTeX entry and citation info
```bibtex
@misc{delobelle2020robbert,
title={RobBERT: a Dutch RoBERTa-based Language Model},
author={Pieter Delobelle and Thomas Winters and Bettina Berendt},
year={2020},
eprint={2001.06286},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
---
language:
- en
inference: false
---
---
language: pt
widget:
- text: "Quem era Jim Henson? Jim Henson era um"
- text: "Em um achado chocante, o cientista descobriu um"
- text: "Barack Hussein Obama II, nascido em 4 de agosto de 1961, é"
- text: "Corrida por vacina contra Covid-19 tem"
license: mit
datasets:
- wikipedia
---
# GPorTuguese-2: a Language Model for Portuguese text generation (and more NLP tasks...)
## Introduction
GPorTuguese-2 (Portuguese GPT-2 small) is a state-of-the-art language model for Portuguese based on the GPT-2 small model.
It was trained on Portuguese Wikipedia using **Transfer Learning and Fine-tuning techniques** in just over a day, on one GPU NVIDIA V100 32GB and with a little more than 1GB of training data.
It is a proof-of-concept that it is possible to get a state-of-the-art language model in any language with low ressources.
It was fine-tuned from the [English pre-trained GPT-2 small](https://huggingface.co/gpt2) using the Hugging Face libraries (Transformers and Tokenizers) wrapped into the [fastai v2](https://dev.fast.ai/) Deep Learning framework. All the fine-tuning fastai v2 techniques were used.
It is now available on Hugging Face. For further information or requests, please go to "[Faster than training from scratch — Fine-tuning the English GPT-2 in any language with Hugging Face and fastai v2 (practical case with Portuguese)](https://medium.com/@pierre_guillou/faster-than-training-from-scratch-fine-tuning-the-english-gpt-2-in-any-language-with-hugging-f2ec05c98787)".
## Model
| Model | #params | Model file (pt/tf) | Arch. | Training /Validation data (text) |
|-------------------------|---------|--------------------|-------------|------------------------------------------|
| `gpt2-small-portuguese` | 124M | 487M / 475M | GPT-2 small | Portuguese Wikipedia (1.28 GB / 0.32 GB) |
## Evaluation results
In a little more than a day (we only used one GPU NVIDIA V100 32GB; through a Distributed Data Parallel (DDP) training mode, we could have divided by three this time to 10 hours, just with 2 GPUs), we got a loss of 3.17, an **accuracy of 37.99%** and a **perplexity of 23.76** (see the validation results table below).
| after ... epochs | loss | accuracy (%) | perplexity | time by epoch | cumulative time |
|------------------|------|--------------|------------|---------------|-----------------|
| 0 | 9.95 | 9.90 | 20950.94 | 00:00:00 | 00:00:00 |
| 1 | 3.64 | 32.52 | 38.12 | 5:48:31 | 5:48:31 |
| 2 | 3.30 | 36.29 | 27.16 | 5:38:18 | 11:26:49 |
| 3 | 3.21 | 37.46 | 24.71 | 6:20:51 | 17:47:40 |
| 4 | 3.19 | 37.74 | 24.21 | 6:06:29 | 23:54:09 |
| 5 | 3.17 | 37.99 | 23.76 | 6:16:22 | 30:10:31 |
## GPT-2
*Note: information copied/pasted from [Model: gpt2 >> GPT-2](https://huggingface.co/gpt2#gpt-2)*
Pretrained model on English language using a causal language modeling (CLM) objective. It was introduced in this [paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) and first released at this [page](https://openai.com/blog/better-language-models/) (February 14, 2019).
Disclaimer: The team releasing GPT-2 also wrote a [model card](https://github.com/openai/gpt-2/blob/master/model_card.md) for their model. Content from this model card has been written by the Hugging Face team to complete the information they provided and give specific examples of bias.
## Model description
*Note: information copied/pasted from [Model: gpt2 >> Model description](https://huggingface.co/gpt2#model-description)*
GPT-2 is a transformers model pretrained on a very large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences.
More precisely, inputs are sequences of continuous text of a certain length and the targets are the same sequence, shifted one token (word or piece of word) to the right. The model uses internally a mask-mechanism to make sure the predictions for the token `i` only uses the inputs from `1` to `i` but not the future tokens.
This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks. The model is best at what it was pretrained for however, which is generating texts from a prompt.
## How to use GPorTuguese-2 with HuggingFace (PyTorch)
The following code use PyTorch. To use TensorFlow, check the below corresponding paragraph.
### Load GPorTuguese-2 and its sub-word tokenizer (Byte-level BPE)
```python
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch
tokenizer = AutoTokenizer.from_pretrained("pierreguillou/gpt2-small-portuguese")
model = AutoModelWithLMHead.from_pretrained("pierreguillou/gpt2-small-portuguese")
# Get sequence length max of 1024
tokenizer.model_max_length=1024
model.eval() # disable dropout (or leave in train mode to finetune)
```
### Generate one word
```python
# input sequence
text = "Quem era Jim Henson? Jim Henson era um"
inputs = tokenizer(text, return_tensors="pt")
# model output
outputs = model(**inputs, labels=inputs["input_ids"])
loss, logits = outputs[:2]
predicted_index = torch.argmax(logits[0, -1, :]).item()
predicted_text = tokenizer.decode([predicted_index])
# results
print('input text:', text)
print('predicted text:', predicted_text)
# input text: Quem era Jim Henson? Jim Henson era um
# predicted text: homem
```
### Generate one full sequence
```python
# input sequence
text = "Quem era Jim Henson? Jim Henson era um"
inputs = tokenizer(text, return_tensors="pt")
# model output using Top-k sampling text generation method
sample_outputs = model.generate(inputs.input_ids,
pad_token_id=50256,
do_sample=True,
max_length=50, # put the token number you want
top_k=40,
num_return_sequences=1)
# generated sequence
for i, sample_output in enumerate(sample_outputs):
print(">> Generated text {}\n\n{}".format(i+1, tokenizer.decode(sample_output.tolist())))
# >> Generated text
# Quem era Jim Henson? Jim Henson era um executivo de televisão e diretor de um grande estúdio de cinema mudo chamado Selig,
# depois que o diretor de cinema mudo Georges Seuray dirigiu vários filmes para a Columbia e o estúdio.
```
## How to use GPorTuguese-2 with HuggingFace (TensorFlow)
The following code use TensorFlow. To use PyTorch, check the above corresponding paragraph.
### Load GPorTuguese-2 and its sub-word tokenizer (Byte-level BPE)
```python
from transformers import AutoTokenizer, TFAutoModelWithLMHead
import tensorflow as tf
tokenizer = AutoTokenizer.from_pretrained("pierreguillou/gpt2-small-portuguese")
model = TFAutoModelWithLMHead.from_pretrained("pierreguillou/gpt2-small-portuguese")
# Get sequence length max of 1024
tokenizer.model_max_length=1024
model.eval() # disable dropout (or leave in train mode to finetune)
```
### Generate one full sequence
```python
# input sequence
text = "Quem era Jim Henson? Jim Henson era um"
inputs = tokenizer.encode(text, return_tensors="tf")
# model output using Top-k sampling text generation method
outputs = model.generate(inputs, eos_token_id=50256, pad_token_id=50256,
do_sample=True,
max_length=40,
top_k=40)
print(tokenizer.decode(outputs[0]))
# >> Generated text
# Quem era Jim Henson? Jim Henson era um amigo familiar da família. Ele foi contratado pelo seu pai
# para trabalhar como aprendiz no escritório de um escritório de impressão, e então começou a ganhar dinheiro
```
## Limitations and bias
The training data used for this model come from Portuguese Wikipedia. We know it contains a lot of unfiltered content from the internet, which is far from neutral. As the openAI team themselves point out in their model card:
> Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don’t support use-cases that require the generated text to be true. Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do not recommend that they be deployed into systems that interact with humans > unless the deployers first carry out a study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race, and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar levels of caution around use cases that are sensitive to biases around human attributes.
## Author
Portuguese GPT-2 small was trained and evaluated by [Pierre GUILLOU](https://www.linkedin.com/in/pierreguillou/) thanks to the computing power of the GPU (GPU NVIDIA V100 32 Go) of the [AI Lab](https://www.linkedin.com/company/ailab-unb/) (University of Brasilia) to which I am attached as an Associate Researcher in NLP and the participation of its directors in the definition of NLP strategy, Professors Fabricio Ataides Braz and Nilton Correia da Silva.
## Citation
If you use our work, please cite:
```bibtex
@inproceedings{pierre2020gpt2smallportuguese,
title={GPorTuguese-2 (Portuguese GPT-2 small): a Language Model for Portuguese text generation (and more NLP tasks...)},
author={Pierre Guillou},
year={2020}
}
```
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment