"...lm-evaluation-harness.git" did not exist on "7feab5e35ed89904ec8580fe88aff89e460de878"
Unverified Commit 3552d0e0 authored by Julien Chaumond's avatar Julien Chaumond Committed by GitHub
Browse files

[model_cards] Migrate cards from this repo to model repos on huggingface.co (#9013)



* rm all model cards

* Update the .rst

@sgugger it is still not super crystal clear/streamlined so let me know if any ideas to make it simpler

* Add a rootlevel README.md with simple instructions/context

* Update docs/source/model_sharing.rst
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply suggestions from code review
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: default avatarPatrick von Platen <patrick.v.platen@gmail.com>

* make style

* rm all model cards
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: default avatarPatrick von Platen <patrick.v.platen@gmail.com>
parent 29e45979
---
language: fr
---
# CamemBERT: a Tasty French Language Model
## Introduction
[CamemBERT](https://arxiv.org/abs/1911.03894) is a state-of-the-art language model for French based on the RoBERTa model.
It is now available on Hugging Face in 6 different versions with varying number of parameters, amount of pretraining data and pretraining data source domains.
For further information or requests, please go to [Camembert Website](https://camembert-model.fr/)
## Pre-trained models
| Model | #params | Arch. | Training data |
|--------------------------------|--------------------------------|-------|-----------------------------------|
| `camembert-base` | 110M | Base | OSCAR (138 GB of text) |
| `camembert/camembert-large` | 335M | Large | CCNet (135 GB of text) |
| `camembert/camembert-base-ccnet` | 110M | Base | CCNet (135 GB of text) |
| `camembert/camembert-base-wikipedia-4gb` | 110M | Base | Wikipedia (4 GB of text) |
| `camembert/camembert-base-oscar-4gb` | 110M | Base | Subsample of OSCAR (4 GB of text) |
| `camembert/camembert-base-ccnet-4gb` | 110M | Base | Subsample of CCNet (4 GB of text) |
## How to use CamemBERT with HuggingFace
##### Load CamemBERT and its sub-word tokenizer :
```python
from transformers import CamembertModel, CamembertTokenizer
# You can replace "camembert-base" with any other model from the table, e.g. "camembert/camembert-large".
tokenizer = CamembertTokenizer.from_pretrained("camembert/camembert-base-ccnet")
camembert = CamembertModel.from_pretrained("camembert/camembert-base-ccnet")
camembert.eval() # disable dropout (or leave in train mode to finetune)
```
##### Filling masks using pipeline
```python
from transformers import pipeline
camembert_fill_mask = pipeline("fill-mask", model="camembert/camembert-base-ccnet", tokenizer="camembert/camembert-base-ccnet")
results = camembert_fill_mask("Le camembert est <mask> :)")
# results
#[{'sequence': '<s> Le camembert est bon :)</s>', 'score': 0.14011502265930176, 'token': 305},
# {'sequence': '<s> Le camembert est délicieux :)</s>', 'score': 0.13929404318332672, 'token': 11661},
# {'sequence': '<s> Le camembert est excellent :)</s>', 'score': 0.07010319083929062, 'token': 3497},
# {'sequence': '<s> Le camembert est parfait :)</s>', 'score': 0.025885622948408127, 'token': 2528},
# {'sequence': '<s> Le camembert est top :)</s>', 'score': 0.025684962049126625, 'token': 2328}]
```
##### Extract contextual embedding features from Camembert output
```python
import torch
# Tokenize in sub-words with SentencePiece
tokenized_sentence = tokenizer.tokenize("J'aime le camembert !")
# ['▁J', "'", 'aime', '▁le', '▁cam', 'ember', 't', '▁!']
# 1-hot encode and add special starting and end tokens
encoded_sentence = tokenizer.encode(tokenized_sentence)
# [5, 133, 22, 1250, 16, 12034, 14324, 81, 76, 6]
# NB: Can be done in one step : tokenize.encode("J'aime le camembert !")
# Feed tokens to Camembert as a torch tensor (batch dim 1)
encoded_sentence = torch.tensor(encoded_sentence).unsqueeze(0)
embeddings, _ = camembert(encoded_sentence)
# embeddings.detach()
# embeddings.size torch.Size([1, 10, 768])
#tensor([[[ 0.0667, -0.2467, 0.0954, ..., 0.2144, 0.0279, 0.3621],
# [-0.0472, 0.4092, -0.6602, ..., 0.2095, 0.1391, -0.0401],
# [ 0.1911, -0.2347, -0.0811, ..., 0.4306, -0.0639, 0.1821],
# ...,
```
##### Extract contextual embedding features from all Camembert layers
```python
from transformers import CamembertConfig
# (Need to reload the model with new config)
config = CamembertConfig.from_pretrained("camembert/camembert-base-ccnet", output_hidden_states=True)
camembert = CamembertModel.from_pretrained("camembert/camembert-base-ccnet", config=config)
embeddings, _, all_layer_embeddings = camembert(encoded_sentence)
# all_layer_embeddings list of len(all_layer_embeddings) == 13 (input embedding layer + 12 self attention layers)
all_layer_embeddings[5]
# layer 5 contextual embedding : size torch.Size([1, 10, 768])
#tensor([[[ 0.0057, -0.1022, 0.0163, ..., -0.0675, -0.0360, 0.1078],
# [-0.1096, -0.3344, -0.0593, ..., 0.1625, -0.0432, -0.1646],
# [ 0.3751, -0.3829, 0.0844, ..., 0.1067, -0.0330, 0.3334],
# ...,
```
## Authors
CamemBERT was trained and evaluated by Louis Martin\*, Benjamin Muller\*, Pedro Javier Ortiz Suárez\*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
## Citation
If you use our work, please cite:
```bibtex
@inproceedings{martin2020camembert,
title={CamemBERT: a Tasty French Language Model},
author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
year={2020}
}
```
---
language: fr
---
# CamemBERT: a Tasty French Language Model
## Introduction
[CamemBERT](https://arxiv.org/abs/1911.03894) is a state-of-the-art language model for French based on the RoBERTa model.
It is now available on Hugging Face in 6 different versions with varying number of parameters, amount of pretraining data and pretraining data source domains.
For further information or requests, please go to [Camembert Website](https://camembert-model.fr/)
## Pre-trained models
| Model | #params | Arch. | Training data |
|--------------------------------|--------------------------------|-------|-----------------------------------|
| `camembert-base` | 110M | Base | OSCAR (138 GB of text) |
| `camembert/camembert-large` | 335M | Large | CCNet (135 GB of text) |
| `camembert/camembert-base-ccnet` | 110M | Base | CCNet (135 GB of text) |
| `camembert/camembert-base-wikipedia-4gb` | 110M | Base | Wikipedia (4 GB of text) |
| `camembert/camembert-base-oscar-4gb` | 110M | Base | Subsample of OSCAR (4 GB of text) |
| `camembert/camembert-base-ccnet-4gb` | 110M | Base | Subsample of CCNet (4 GB of text) |
## How to use CamemBERT with HuggingFace
##### Load CamemBERT and its sub-word tokenizer :
```python
from transformers import CamembertModel, CamembertTokenizer
# You can replace "camembert-base" with any other model from the table, e.g. "camembert/camembert-large".
tokenizer = CamembertTokenizer.from_pretrained("camembert/camembert-base-oscar-4gb")
camembert = CamembertModel.from_pretrained("camembert/camembert-base-oscar-4gb")
camembert.eval() # disable dropout (or leave in train mode to finetune)
```
##### Filling masks using pipeline
```python
from transformers import pipeline
camembert_fill_mask = pipeline("fill-mask", model="camembert/camembert-base-oscar-4gb", tokenizer="camembert/camembert-base-oscar-4gb")
>>> results = camembert_fill_mask("Le camembert est <mask> !")
# results
#[{'sequence': '<s> Le camembert est parfait!</s>', 'score': 0.04089554399251938, 'token': 1654},
#{'sequence': '<s> Le camembert est délicieux!</s>', 'score': 0.037193264812231064, 'token': 7200},
#{'sequence': '<s> Le camembert est prêt!</s>', 'score': 0.025467922911047935, 'token': 1415},
#{'sequence': '<s> Le camembert est meilleur!</s>', 'score': 0.022812040522694588, 'token': 528},
#{'sequence': '<s> Le camembert est différent!</s>', 'score': 0.017135459929704666, 'token': 2935}]
```
##### Extract contextual embedding features from Camembert output
```python
import torch
# Tokenize in sub-words with SentencePiece
tokenized_sentence = tokenizer.tokenize("J'aime le camembert !")
# ['▁J', "'", 'aime', '▁le', '▁ca', 'member', 't', '▁!']
# 1-hot encode and add special starting and end tokens
encoded_sentence = tokenizer.encode(tokenized_sentence)
# [5, 121, 11, 660, 16, 730, 25543, 110, 83, 6]
# NB: Can be done in one step : tokenize.encode("J'aime le camembert !")
# Feed tokens to Camembert as a torch tensor (batch dim 1)
encoded_sentence = torch.tensor(encoded_sentence).unsqueeze(0)
embeddings, _ = camembert(encoded_sentence)
# embeddings.detach()
# embeddings.size torch.Size([1, 10, 768])
#tensor([[[-0.1120, -0.1464, 0.0181, ..., -0.1723, -0.0278, 0.1606],
# [ 0.1234, 0.1202, -0.0773, ..., -0.0405, -0.0668, -0.0788],
# [-0.0440, 0.0480, -0.1926, ..., 0.1066, -0.0961, 0.0637],
# ...,
```
##### Extract contextual embedding features from all Camembert layers
```python
from transformers import CamembertConfig
# (Need to reload the model with new config)
config = CamembertConfig.from_pretrained("camembert/camembert-base-oscar-4gb", output_hidden_states=True)
camembert = CamembertModel.from_pretrained("camembert/camembert-base-oscar-4gb", config=config)
embeddings, _, all_layer_embeddings = camembert(encoded_sentence)
# all_layer_embeddings list of len(all_layer_embeddings) == 13 (input embedding layer + 12 self attention layers)
all_layer_embeddings[5]
# layer 5 contextual embedding : size torch.Size([1, 10, 768])
#tensor([[[-0.1584, -0.1207, -0.0179, ..., 0.5457, 0.1491, -0.1191],
# [-0.1122, 0.3634, 0.0676, ..., 0.4395, -0.0470, -0.3781],
# [-0.2232, 0.0019, 0.0140, ..., 0.4461, -0.0233, 0.0735],
# ...,
```
## Authors
CamemBERT was trained and evaluated by Louis Martin\*, Benjamin Muller\*, Pedro Javier Ortiz Suárez\*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
## Citation
If you use our work, please cite:
```bibtex
@inproceedings{martin2020camembert,
title={CamemBERT: a Tasty French Language Model},
author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
year={2020}
}
```
---
language: fr
---
# CamemBERT: a Tasty French Language Model
## Introduction
[CamemBERT](https://arxiv.org/abs/1911.03894) is a state-of-the-art language model for French based on the RoBERTa model.
It is now available on Hugging Face in 6 different versions with varying number of parameters, amount of pretraining data and pretraining data source domains.
For further information or requests, please go to [Camembert Website](https://camembert-model.fr/)
## Pre-trained models
| Model | #params | Arch. | Training data |
|--------------------------------|--------------------------------|-------|-----------------------------------|
| `camembert-base` | 110M | Base | OSCAR (138 GB of text) |
| `camembert/camembert-large` | 335M | Large | CCNet (135 GB of text) |
| `camembert/camembert-base-ccnet` | 110M | Base | CCNet (135 GB of text) |
| `camembert/camembert-base-wikipedia-4gb` | 110M | Base | Wikipedia (4 GB of text) |
| `camembert/camembert-base-oscar-4gb` | 110M | Base | Subsample of OSCAR (4 GB of text) |
| `camembert/camembert-base-ccnet-4gb` | 110M | Base | Subsample of CCNet (4 GB of text) |
## How to use CamemBERT with HuggingFace
##### Load CamemBERT and its sub-word tokenizer :
```python
from transformers import CamembertModel, CamembertTokenizer
# You can replace "camembert-base" with any other model from the table, e.g. "camembert/camembert-large".
tokenizer = CamembertTokenizer.from_pretrained("camembert/camembert-base-wikipedia-4gb")
camembert = CamembertModel.from_pretrained("camembert/camembert-base-wikipedia-4gb")
camembert.eval() # disable dropout (or leave in train mode to finetune)
```
##### Filling masks using pipeline
```python
from transformers import pipeline
camembert_fill_mask = pipeline("fill-mask", model="camembert/camembert-base-wikipedia-4gb", tokenizer="camembert/camembert-base-wikipedia-4gb")
results = camembert_fill_mask("Le camembert est un fromage de <mask>!")
# results
#[{'sequence': '<s> Le camembert est un fromage de chèvre!</s>', 'score': 0.4937814474105835, 'token': 19370},
#{'sequence': '<s> Le camembert est un fromage de brebis!</s>', 'score': 0.06255942583084106, 'token': 30616},
#{'sequence': '<s> Le camembert est un fromage de montagne!</s>', 'score': 0.04340197145938873, 'token': 2364},
# {'sequence': '<s> Le camembert est un fromage de Noël!</s>', 'score': 0.02823255956172943, 'token': 3236},
#{'sequence': '<s> Le camembert est un fromage de vache!</s>', 'score': 0.021357402205467224, 'token': 12329}]
```
##### Extract contextual embedding features from Camembert output
```python
import torch
# Tokenize in sub-words with SentencePiece
tokenized_sentence = tokenizer.tokenize("J'aime le camembert !")
# ['▁J', "'", 'aime', '▁le', '▁ca', 'member', 't', '▁!']
# 1-hot encode and add special starting and end tokens
encoded_sentence = tokenizer.encode(tokenized_sentence)
# [5, 221, 10, 10600, 14, 8952, 10540, 75, 1114, 6]
# NB: Can be done in one step : tokenize.encode("J'aime le camembert !")
# Feed tokens to Camembert as a torch tensor (batch dim 1)
encoded_sentence = torch.tensor(encoded_sentence).unsqueeze(0)
embeddings, _ = camembert(encoded_sentence)
# embeddings.detach()
# embeddings.size torch.Size([1, 10, 768])
#tensor([[[-0.0928, 0.0506, -0.0094, ..., -0.2388, 0.1177, -0.1302],
# [ 0.0662, 0.1030, -0.2355, ..., -0.4224, -0.0574, -0.2802],
# [-0.0729, 0.0547, 0.0192, ..., -0.1743, 0.0998, -0.2677],
# ...,
```
##### Extract contextual embedding features from all Camembert layers
```python
from transformers import CamembertConfig
# (Need to reload the model with new config)
config = CamembertConfig.from_pretrained("camembert/camembert-base-wikipedia-4gb", output_hidden_states=True)
camembert = CamembertModel.from_pretrained("camembert/camembert-base-wikipedia-4gb", config=config)
embeddings, _, all_layer_embeddings = camembert(encoded_sentence)
# all_layer_embeddings list of len(all_layer_embeddings) == 13 (input embedding layer + 12 self attention layers)
all_layer_embeddings[5]
# layer 5 contextual embedding : size torch.Size([1, 10, 768])
#tensor([[[-0.0059, -0.0227, 0.0065, ..., -0.0770, 0.0369, 0.0095],
# [ 0.2838, -0.1531, -0.3642, ..., -0.0027, -0.8502, -0.7914],
# [-0.0073, -0.0338, -0.0011, ..., 0.0533, -0.0250, -0.0061],
# ...,
```
## Authors
CamemBERT was trained and evaluated by Louis Martin\*, Benjamin Muller\*, Pedro Javier Ortiz Suárez\*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
## Citation
If you use our work, please cite:
```bibtex
@inproceedings{martin2020camembert,
title={CamemBERT: a Tasty French Language Model},
author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
year={2020}
}
```
---
language: fr
---
# CamemBERT: a Tasty French Language Model
## Introduction
[CamemBERT](https://arxiv.org/abs/1911.03894) is a state-of-the-art language model for French based on the RoBERTa model.
It is now available on Hugging Face in 6 different versions with varying number of parameters, amount of pretraining data and pretraining data source domains.
For further information or requests, please go to [Camembert Website](https://camembert-model.fr/)
## Pre-trained models
| Model | #params | Arch. | Training data |
|--------------------------------|--------------------------------|-------|-----------------------------------|
| `camembert-base` | 110M | Base | OSCAR (138 GB of text) |
| `camembert/camembert-large` | 335M | Large | CCNet (135 GB of text) |
| `camembert/camembert-base-ccnet` | 110M | Base | CCNet (135 GB of text) |
| `camembert/camembert-base-wikipedia-4gb` | 110M | Base | Wikipedia (4 GB of text) |
| `camembert/camembert-base-oscar-4gb` | 110M | Base | Subsample of OSCAR (4 GB of text) |
| `camembert/camembert-base-ccnet-4gb` | 110M | Base | Subsample of CCNet (4 GB of text) |
## How to use CamemBERT with HuggingFace
##### Load CamemBERT and its sub-word tokenizer :
```python
from transformers import CamembertModel, CamembertTokenizer
# You can replace "camembert-base" with any other model from the table, e.g. "camembert/camembert-large".
tokenizer = CamembertTokenizer.from_pretrained("camembert/camembert-large")
camembert = CamembertModel.from_pretrained("camembert/camembert-large")
camembert.eval() # disable dropout (or leave in train mode to finetune)
```
##### Filling masks using pipeline
```python
from transformers import pipeline
camembert_fill_mask = pipeline("fill-mask", model="camembert/camembert-large", tokenizer="camembert/camembert-large")
results = camembert_fill_mask("Le camembert est <mask> :)")
# results
#[{'sequence': '<s> Le camembert est bon :)</s>', 'score': 0.15560828149318695, 'token': 305},
#{'sequence': '<s> Le camembert est excellent :)</s>', 'score': 0.06821336597204208, 'token': 3497},
#{'sequence': '<s> Le camembert est délicieux :)</s>', 'score': 0.060438305139541626, 'token': 11661},
#{'sequence': '<s> Le camembert est ici :)</s>', 'score': 0.02023460529744625, 'token': 373},
#{'sequence': '<s> Le camembert est meilleur :)</s>', 'score': 0.01778135634958744, 'token': 876}]
```
##### Extract contextual embedding features from Camembert output
```python
import torch
# Tokenize in sub-words with SentencePiece
tokenized_sentence = tokenizer.tokenize("J'aime le camembert !")
# ['▁J', "'", 'aime', '▁le', '▁cam', 'ember', 't', '▁!']
# 1-hot encode and add special starting and end tokens
encoded_sentence = tokenizer.encode(tokenized_sentence)
# [5, 133, 22, 1250, 16, 12034, 14324, 81, 76, 6]
# NB: Can be done in one step : tokenize.encode("J'aime le camembert !")
# Feed tokens to Camembert as a torch tensor (batch dim 1)
encoded_sentence = torch.tensor(encoded_sentence).unsqueeze(0)
embeddings, _ = camembert(encoded_sentence)
# embeddings.detach()
# torch.Size([1, 10, 1024])
#tensor([[[-0.1284, 0.2643, 0.4374, ..., 0.1627, 0.1308, -0.2305],
# [ 0.4576, -0.6345, -0.2029, ..., -0.1359, -0.2290, -0.6318],
# [ 0.0381, 0.0429, 0.5111, ..., -0.1177, -0.1913, -0.1121],
# ...,
```
##### Extract contextual embedding features from all Camembert layers
```python
from transformers import CamembertConfig
# (Need to reload the model with new config)
config = CamembertConfig.from_pretrained("camembert/camembert-large", output_hidden_states=True)
camembert = CamembertModel.from_pretrained("camembert/camembert-large", config=config)
embeddings, _, all_layer_embeddings = camembert(encoded_sentence)
# all_layer_embeddings list of len(all_layer_embeddings) == 25 (input embedding layer + 24 self attention layers)
all_layer_embeddings[5]
# layer 5 contextual embedding : size torch.Size([1, 10, 1024])
#tensor([[[-0.0600, 0.0742, 0.0332, ..., -0.0525, -0.0637, -0.0287],
# [ 0.0950, 0.2840, 0.1985, ..., 0.2073, -0.2172, -0.6321],
# [ 0.1381, 0.1872, 0.1614, ..., -0.0339, -0.2530, -0.1182],
# ...,
```
## Authors
CamemBERT was trained and evaluated by Louis Martin\*, Benjamin Muller\*, Pedro Javier Ortiz Suárez\*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
## Citation
If you use our work, please cite:
```bibtex
@inproceedings{martin2020camembert,
title={CamemBERT: a Tasty French Language Model},
author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
year={2020}
}
```
---
thumbnail: https://raw.githubusercontent.com/JetRunner/BERT-of-Theseus/master/bert-of-theseus.png
datasets:
- multi_nli
---
# BERT-of-Theseus
See our paper ["BERT-of-Theseus: Compressing BERT by Progressive Module Replacing"](http://arxiv.org/abs/2002.02925).
BERT-of-Theseus is a new compressed BERT by progressively replacing the components of the original BERT.
![BERT of Theseus](https://github.com/JetRunner/BERT-of-Theseus/blob/master/bert-of-theseus.png?raw=true)
## Load Pretrained Model on MNLI
We provide a 6-layer pretrained model on MNLI as a general-purpose model, which can transfer to other sentence classification tasks, outperforming DistillBERT (with the same 6-layer structure) on six tasks of GLUE (dev set).
| Method | MNLI | MRPC | QNLI | QQP | RTE | SST-2 | STS-B |
|-----------------|------|------|------|------|------|-------|-------|
| BERT-base | 83.5 | 89.5 | 91.2 | 89.8 | 71.1 | 91.5 | 88.9 |
| DistillBERT | 79.0 | 87.5 | 85.3 | 84.9 | 59.9 | 90.7 | 81.2 |
| BERT-of-Theseus | 82.1 | 87.5 | 88.8 | 88.8 | 70.1 | 91.8 | 87.8 |
Please Note: this checkpoint is for [Intermediate-Task Transfer Learning](https://arxiv.org/abs/2005.00628) so it does not include the classification head for MNLI! Please fine-tune it before use (like DistilBERT).
---
language: fr
tags:
- conversational
widget:
- text: "bonjour."
- text: "mais encore"
- text: "est ce que l'argent achete le bonheur?"
---
## a dialoggpt model trained on french opensubtitles with custom tokenizer
trained with this notebook
https://colab.research.google.com/drive/1pfCV3bngAmISNZVfDvBMyEhQKuYw37Rl#scrollTo=AyImj9qZYLRi&uniqifier=3
config from microsoft/DialoGPT-medium
dataset generated from 2018 opensubtitle from opus folowing these guidelines
https://github.com/PolyAI-LDN/conversational-datasets/tree/master/opensubtitles with this notebook
https://colab.research.google.com/drive/1uyh3vJ9nEjqOHI68VD73qxt4olJzODxi#scrollTo=deaacv4XfLMk
### How to use
Now we are ready to try out how the model works as a chatting partner!
```python
import torch
from transformers import AutoTokenizer, AutoModelWithLMHead
tokenizer = AutoTokenizer.from_pretrained("cedpsam/chatbot_fr")
model = AutoModelWithLMHead.from_pretrained("cedpsam/chatbot_fr")
for step in range(6):
# encode the new user input, add the eos_token and return a tensor in Pytorch
new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')
# print(new_user_input_ids)
# append the new user input tokens to the chat history
bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids
# generated a response while limiting the total chat history to 1000 tokens,
chat_history_ids = model.generate(
bot_input_ids, max_length=1000,
pad_token_id=tokenizer.eos_token_id,
top_p=0.92, top_k = 50
)
# pretty print last ouput tokens from bot
print("DialoGPT: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))
---
language:
- en
tags:
- harry-potter
license: mit
---
# Harry Potter Fanfiction Generator
This is a pre-trained GPT-2 generative text model that allows you to generate your own Harry Potter fanfiction, trained off of the top 100 rated fanficition stories. We intend for this to be used for individual fun and experimentation and not as a commercial product.
---
language: "en"
tags:
- gpt2
- arxiv
- transformers
datasets:
- https://github.com/staeiou/arxiv_archive/tree/v1.0.1
---
# ArXiv AI GPT-2
## Model description
This GPT-2 (774M) model is capable of generating abstracts given paper titles. It was trained using all research paper titles and abstracts under artificial intelligence (AI), machine learning (LG), computation and language (CL), and computer vision and pattern recognition (CV) on arXiv.
## Intended uses & limitations
#### How to use
To generate paper abstracts, use the provided `generate.py` [here](https://gist.github.com/chrisliu298/ccb8144888eace069da64ad3e6472d64). This is very similar to the HuggingFace's `run_generation.py` [here](https://github.com/huggingface/transformers/tree/master/examples/text-generation). You can simply replace the text with with your own model path (line 89) and change the input string to your paper title (line 127). If you want to use your own script, make sure to prepend `<|startoftext|> ` at the front and append ` <|sep|>` at the end of the paper title.
## Training data
I selected a subset of the [arXiv Archive](https://github.com/staeiou/arxiv_archive) dataset (Geiger, 2019) as the training and evaluation data to fine-tune GPT-2. The original arXiv Archive dataset contains a full archive of metadata about papers on arxiv.org, from the start of the site in 1993 to the end of 2019. Our subset includes all the paper titles (query) and abstracts (context) under the Artificial Intelligence (cs.AI), Machine Learning (cs.LG), Computation and Language (cs.CL), and Computer Vision and Pattern Recognition (cs.CV) categories. I provide the information of the sub-dataset and the distribution of the training and evaluation dataset as follows.
| Splits | Count | Percentage (%) | BPE Token Count |
| :--------: | :--------: | :------------: | :-------------: |
| Train | 90,000 | 90.11 | 20,834,012 |
| Validation | 4,940 | 4.95 | 1,195,056 |
| Test | 4,940 | 4.95 | 1,218,754 |
| **Total** | **99,880** | **100** | **23,247,822** |
The original dataset is in the format of a tab-separated value, so we wrote a simple preprocessing script to convert it into a text file format, which is the input file type (a document) of the GPT-2 model. An example of a paper’s title and its abstract is shown below.
```text
<|startoftext|> Some paper title <|sep|> Some paper abstract <|endoftext|>
```
Because there are a lot of cross-domain papers in the dataset, I deduplicate the dataset using the arXiv ID, which is unique for every paper. I sort the paper by submission date, by doing so, one can examine GPT-2’s ability to use learned terminologies when it is prompted with paper titles from the “future.”
## Training procedure
I used block size = 512, batch size = 1, gradidnet accumulation = 1, learning rate = 1e-5, epochs = 5, and everything else follows the default model configuration.
## Eval results
The resulting GPT-2 large model's perplexity score on the test set is **14.9413**.
## Reference
```bibtex
@dataset{r_stuart_geiger_2019_2533436,
author= {R. Stuart Geiger},
title={{ArXiV Archive: A tidy and complete archive of metadata for papers on arxiv.org, 1993-2019}},
month=jan,
year= 2019,
publisher={Zenodo},
version= {v1.0.1},
doi={10.5281/zenodo.2533436},
url={https://doi.org/10.5281/zenodo.2533436}
}
```
---
language:
- ru
- en
---
## EnDR-BERT
EnDR-BERT - Multilingual, Cased, which pretrained on the english collection of consumer comments on drug administration from [2]. Pre-training was based on the [original BERT code](https://github.com/google-research/bert) provided by Google. In particular, Multi-BERT was for used for initialization and all the parameters are the same as in Multi-BERT. Training details are described in our paper. \
link: https://yadi.sk/d/-PTn0xhk1PqvgQ
## Citing & Authors
If you find this repository helpful, feel free to cite our publication:
[1] Tutubalina E, Alimova I, Miftahutdinov Z, et al. The Russian Drug Reaction Corpus and Neural Models for Drug Reactions and Effectiveness Detection in User Reviews.//Bioinformatics. - 2020.
preprint: https://arxiv.org/abs/2004.03659
```
@article{10.1093/bioinformatics/btaa675,
author = {Tutubalina, Elena and Alimova, Ilseyar and Miftahutdinov, Zulfat and Sakhovskiy, Andrey and Malykh, Valentin and Nikolenko, Sergey},
title = "{The Russian Drug Reaction Corpus and Neural Models for Drug Reactions and Effectiveness Detection in User Reviews}",
journal = {Bioinformatics},
year = {2020},
month = {07},
issn = {1367-4803},
doi = {10.1093/bioinformatics/btaa675},
url = {https://doi.org/10.1093/bioinformatics/btaa675},
note = {btaa675},
eprint = {https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btaa675/33539752/btaa675.pdf},
}
```
[2] Tutubalina, EV and Miftahutdinov, Z Sh and Nugmanov, RI and Madzhidov, TI and Nikolenko, SI and Alimova, IS and Tropsha, AE Using semantic analysis of texts for the identification of drugs with similar therapeutic effects.//Russian Chemical Bulletin. – 2017. – Т. 66. – №. 11. – С. 2180-2189.
[link to paper](https://www.researchgate.net/profile/Elena_Tutubalina/publication/323751823_Using_semantic_analysis_of_texts_for_the_identification_of_drugs_with_similar_therapeutic_effects/links/5bf7cfc3299bf1a0202cbc1f/Using-semantic-analysis-of-texts-for-the-identification-of-drugs-with-similar-therapeutic-effects.pdf)
```
@article{tutubalina2017using,
title={Using semantic analysis of texts for the identification of drugs with similar therapeutic effects},
author={Tutubalina, EV and Miftahutdinov, Z Sh and Nugmanov, RI and Madzhidov, TI and Nikolenko, SI and Alimova, IS and Tropsha, AE},
journal={Russian Chemical Bulletin},
volume={66},
number={11},
pages={2180--2189},
year={2017},
publisher={Springer}
}
```
---
language:
- ru
- en
---
## EnRuDR-BERT
EnRuDR-BERT - Multilingual, Cased, which pretrained on the raw part of the RuDReC corpus (1.4M reviews) and english collection of consumer comments on drug administration from [2]. Pre-training was based on the [original BERT code](https://github.com/google-research/bert) provided by Google. In particular, Multi-BERT was for used for initialization; vocabulary of Russian subtokens and parameters are the same as in Multi-BERT. Training details are described in our paper. \
link: https://yadi.sk/d/-PTn0xhk1PqvgQ
## Citing & Authors
If you find this repository helpful, feel free to cite our publication:
[1] Tutubalina E, Alimova I, Miftahutdinov Z, et al. The Russian Drug Reaction Corpus and Neural Models for Drug Reactions and Effectiveness Detection in User Reviews.//Bioinformatics. - 2020.
preprint: https://arxiv.org/abs/2004.03659
```
@article{10.1093/bioinformatics/btaa675,
author = {Tutubalina, Elena and Alimova, Ilseyar and Miftahutdinov, Zulfat and Sakhovskiy, Andrey and Malykh, Valentin and Nikolenko, Sergey},
title = "{The Russian Drug Reaction Corpus and Neural Models for Drug Reactions and Effectiveness Detection in User Reviews}",
journal = {Bioinformatics},
year = {2020},
month = {07},
issn = {1367-4803},
doi = {10.1093/bioinformatics/btaa675},
url = {https://doi.org/10.1093/bioinformatics/btaa675},
note = {btaa675},
eprint = {https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btaa675/33539752/btaa675.pdf},
}
```
[2] Tutubalina, EV and Miftahutdinov, Z Sh and Nugmanov, RI and Madzhidov, TI and Nikolenko, SI and Alimova, IS and Tropsha, AE Using semantic analysis of texts for the identification of drugs with similar therapeutic effects.//Russian Chemical Bulletin. – 2017. – Т. 66. – №. 11. – С. 2180-2189.
[link to paper](https://www.researchgate.net/profile/Elena_Tutubalina/publication/323751823_Using_semantic_analysis_of_texts_for_the_identification_of_drugs_with_similar_therapeutic_effects/links/5bf7cfc3299bf1a0202cbc1f/Using-semantic-analysis-of-texts-for-the-identification-of-drugs-with-similar-therapeutic-effects.pdf)
```
@article{tutubalina2017using,
title={Using semantic analysis of texts for the identification of drugs with similar therapeutic effects},
author={Tutubalina, EV and Miftahutdinov, Z Sh and Nugmanov, RI and Madzhidov, TI and Nikolenko, SI and Alimova, IS and Tropsha, AE},
journal={Russian Chemical Bulletin},
volume={66},
number={11},
pages={2180--2189},
year={2017},
publisher={Springer}
}
```
## RuDR-BERT
RuDR-BERT - Multilingual, Cased, which pretrained on the raw part of the RuDReC corpus (1.4M reviews). Pre-training was based on the [original BERT code](https://github.com/google-research/bert) provided by Google. In particular, Multi-BERT was for used for initialization; vocabulary of Russian subtokens and parameters are the same as in Multi-BERT. Training details are described in our paper. \
link: https://yadi.sk/d/-PTn0xhk1PqvgQ
## Citing & Authors
If you find this repository helpful, feel free to cite our publication:
[1] Tutubalina E, Alimova I, Miftahutdinov Z, et al. The Russian Drug Reaction Corpus and Neural Models for Drug Reactions and Effectiveness Detection in User Reviews.//Bioinformatics. - 2020.
preprint: https://arxiv.org/abs/2004.03659
```
@article{10.1093/bioinformatics/btaa675,
author = {Tutubalina, Elena and Alimova, Ilseyar and Miftahutdinov, Zulfat and Sakhovskiy, Andrey and Malykh, Valentin and Nikolenko, Sergey},
title = "{The Russian Drug Reaction Corpus and Neural Models for Drug Reactions and Effectiveness Detection in User Reviews}",
journal = {Bioinformatics},
year = {2020},
month = {07},
issn = {1367-4803},
doi = {10.1093/bioinformatics/btaa675},
url = {https://doi.org/10.1093/bioinformatics/btaa675},
note = {btaa675},
eprint = {https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btaa675/33539752/btaa675.pdf},
}
```
[2] Tutubalina, EV and Miftahutdinov, Z Sh and Nugmanov, RI and Madzhidov, TI and Nikolenko, SI and Alimova, IS and Tropsha, AE Using semantic analysis of texts for the identification of drugs with similar therapeutic effects.//Russian Chemical Bulletin. – 2017. – Т. 66. – №. 11. – С. 2180-2189.
[link to paper](https://www.researchgate.net/profile/Elena_Tutubalina/publication/323751823_Using_semantic_analysis_of_texts_for_the_identification_of_drugs_with_similar_therapeutic_effects/links/5bf7cfc3299bf1a0202cbc1f/Using-semantic-analysis-of-texts-for-the-identification-of-drugs-with-similar-therapeutic-effects.pdf)
```
@article{tutubalina2017using,
title={Using semantic analysis of texts for the identification of drugs with similar therapeutic effects},
author={Tutubalina, EV and Miftahutdinov, Z Sh and Nugmanov, RI and Madzhidov, TI and Nikolenko, SI and Alimova, IS and Tropsha, AE},
journal={Russian Chemical Bulletin},
volume={66},
number={11},
pages={2180--2189},
year={2017},
publisher={Springer}
}
```
---
language: zh
---
## albert_chinese_small
### Overview
**Language model:** albert-small
**Model size:** 18.5M
**Language:** Chinese
**Training data:** [CLUECorpusSmall](https://github.com/CLUEbenchmark/CLUECorpus2020)
**Eval data:** [CLUE dataset](https://github.com/CLUEbenchmark/CLUE)
### Results
For results on downstream tasks like text classification, please refer to [this repository](https://github.com/CLUEbenchmark/CLUE).
### Usage
**NOTE:**Since sentencepiece is not used in `albert_chinese_small` model, you have to call **BertTokenizer** instead of AlbertTokenizer !!!
```
import torch
from transformers import BertTokenizer, AlbertModel
tokenizer = BertTokenizer.from_pretrained("clue/albert_chinese_small")
albert = AlbertModel.from_pretrained("clue/albert_chinese_small")
```
### About CLUE benchmark
Organization of Language Understanding Evaluation benchmark for Chinese: tasks & datasets, baselines, pre-trained Chinese models, corpus and leaderboard.
Github: https://github.com/CLUEbenchmark
Website: https://www.cluebenchmarks.com/
---
language: zh
---
## albert_chinese_tiny
### Overview
**Language model:** albert-tiny
**Model size:** 16M
**Language:** Chinese
**Training data:** [CLUECorpusSmall](https://github.com/CLUEbenchmark/CLUECorpus2020)
**Eval data:** [CLUE dataset](https://github.com/CLUEbenchmark/CLUE)
### Results
For results on downstream tasks like text classification, please refer to [this repository](https://github.com/CLUEbenchmark/CLUE).
### Usage
**NOTE:**Since sentencepiece is not used in `albert_chinese_tiny` model, you have to call **BertTokenizer** instead of AlbertTokenizer !!!
```
import torch
from transformers import BertTokenizer, AlbertModel
tokenizer = BertTokenizer.from_pretrained("clue/albert_chinese_tiny")
albert = AlbertModel.from_pretrained("clue/albert_chinese_tiny")
```
### About CLUE benchmark
Organization of Language Understanding Evaluation benchmark for Chinese: tasks & datasets, baselines, pre-trained Chinese models, corpus and leaderboard.
Github: https://github.com/CLUEbenchmark
Website: https://www.cluebenchmarks.com/
---
language: zh
---
# Introduction
This model was trained on TPU and the details are as follows:
## Model
##
| Model_name | params | size | Training_corpus | Vocab |
| :------------------------------------------ | :----- | :------- | :----------------- | :-----------: |
| **`RoBERTa-tiny-clue`** <br/>Super_small_model | 7.5M | 28.3M | **CLUECorpus2020** | **CLUEVocab** |
| **`RoBERTa-tiny-pair`** <br/>Super_small_sentence_pair_model | 7.5M | 28.3M | **CLUECorpus2020** | **CLUEVocab** |
| **`RoBERTa-tiny3L768-clue`** <br/>small_model | 38M | 110M | **CLUECorpus2020** | **CLUEVocab** |
| **`RoBERTa-tiny3L312-clue`** <br/>small_model | <7.5M | 24M | **CLUECorpus2020** | **CLUEVocab** |
| **`RoBERTa-large-clue`** <br/> Large_model | 290M | 1.20G | **CLUECorpus2020** | **CLUEVocab** |
| **`RoBERTa-large-pair`** <br/>Large_sentence_pair_model | 290M | 1.20G | **CLUECorpus2020** | **CLUEVocab** |
### Usage
With the help of[Huggingface-Transformers 2.5.1](https://github.com/huggingface/transformers), you could use these model as follows
```
tokenizer = BertTokenizer.from_pretrained("MODEL_NAME")
model = BertModel.from_pretrained("MODEL_NAME")
```
`MODEL_NAME`
| Model_NAME | MODEL_LINK |
| -------------------------- | ------------------------------------------------------------ |
| **RoBERTa-tiny-clue** | [`clue/roberta_chinese_clue_tiny`](https://huggingface.co/clue/roberta_chinese_clue_tiny) |
| **RoBERTa-tiny-pair** | [`clue/roberta_chinese_pair_tiny`](https://huggingface.co/clue/roberta_chinese_pair_tiny) |
| **RoBERTa-tiny3L768-clue** | [`clue/roberta_chinese_3L768_clue_tiny`](https://huggingface.co/clue/roberta_chinese_3L768_clue_tiny) |
| **RoBERTa-tiny3L312-clue** | [`clue/roberta_chinese_3L312_clue_tiny`](https://huggingface.co/clue/roberta_chinese_3L312_clue_tiny) |
| **RoBERTa-large-clue** | [`clue/roberta_chinese_clue_large`](https://huggingface.co/clue/roberta_chinese_clue_large) |
| **RoBERTa-large-pair** | [`clue/roberta_chinese_pair_large`](https://huggingface.co/clue/roberta_chinese_pair_large) |
## Details
Please read <a href='https://arxiv.org/pdf/2003.01355'>https://arxiv.org/pdf/2003.01355.
Please visit our repository: https://github.com/CLUEbenchmark/CLUEPretrainedModels.git
---
language: zh
---
## roberta_chinese_base
### Overview
**Language model:** roberta-base
**Model size:** 392M
**Language:** Chinese
**Training data:** [CLUECorpusSmall](https://github.com/CLUEbenchmark/CLUECorpus2020)
**Eval data:** [CLUE dataset](https://github.com/CLUEbenchmark/CLUE)
### Results
For results on downstream tasks like text classification, please refer to [this repository](https://github.com/CLUEbenchmark/CLUE).
### Usage
**NOTE:** You have to call **BertTokenizer** instead of RobertaTokenizer !!!
```
import torch
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained("clue/roberta_chinese_base")
roberta = BertModel.from_pretrained("clue/roberta_chinese_base")
```
### About CLUE benchmark
Organization of Language Understanding Evaluation benchmark for Chinese: tasks & datasets, baselines, pre-trained Chinese models, corpus and leaderboard.
Github: https://github.com/CLUEbenchmark
Website: https://www.cluebenchmarks.com/
---
language: zh
---
## roberta_chinese_large
### Overview
**Language model:** roberta-large
**Model size:** 1.2G
**Language:** Chinese
**Training data:** [CLUECorpusSmall](https://github.com/CLUEbenchmark/CLUECorpus2020)
**Eval data:** [CLUE dataset](https://github.com/CLUEbenchmark/CLUE)
### Results
For results on downstream tasks like text classification, please refer to [this repository](https://github.com/CLUEbenchmark/CLUE).
### Usage
**NOTE:** You have to call **BertTokenizer** instead of RobertaTokenizer !!!
```
import torch
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained("clue/roberta_chinese_large")
roberta = BertModel.from_pretrained("clue/roberta_chinese_large")
```
### About CLUE benchmark
Organization of Language Understanding Evaluation benchmark for Chinese: tasks & datasets, baselines, pre-trained Chinese models, corpus and leaderboard.
Github: https://github.com/CLUEbenchmark
Website: https://www.cluebenchmarks.com/
---
language: zh
---
## xlnet_chinese_large
### Overview
**Language model:** xlnet-large
**Model size:** 1.3G
**Language:** Chinese
**Training data:** [CLUECorpusSmall](https://github.com/CLUEbenchmark/CLUECorpus2020)
**Eval data:** [CLUE dataset](https://github.com/CLUEbenchmark/CLUE)
### Results
For results on downstream tasks like text classification, please refer to [this repository](https://github.com/CLUEbenchmark/CLUE).
### Usage
```
import torch
from transformers import XLNetTokenizer,XLNetModel
tokenizer = XLNetTokenizer.from_pretrained("clue/xlnet_chinese_large")
xlnet = XLNetModel.from_pretrained("clue/xlnet_chinese_large")
```
### About CLUE benchmark
Organization of Language Understanding Evaluation benchmark for Chinese: tasks & datasets, baselines, pre-trained Chinese models, corpus and leaderboard.
Github: https://github.com/CLUEbenchmark
Website: https://www.cluebenchmarks.com/
---
language: "ca"
tags:
- masked-lm
- catalan
- exbert
license: mit
---
# Calbert: a Catalan Language Model
## Introduction
CALBERT is an open-source language model for Catalan pretrained on the ALBERT architecture.
It is now available on Hugging Face in its `tiny-uncased` version and `base-uncased` (the one you're looking at) as well, and was pretrained on the [OSCAR dataset](https://traces1.inria.fr/oscar/).
For further information or requests, please go to the [GitHub repository](https://github.com/codegram/calbert)
## Pre-trained models
| Model | Arch. | Training data |
| ----------------------------------- | -------------- | ---------------------- |
| `codegram` / `calbert-tiny-uncased` | Tiny (uncased) | OSCAR (4.3 GB of text) |
| `codegram` / `calbert-base-uncased` | Base (uncased) | OSCAR (4.3 GB of text) |
## How to use Calbert with HuggingFace
#### Load Calbert and its tokenizer:
```python
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("codegram/calbert-base-uncased")
model = AutoModel.from_pretrained("codegram/calbert-base-uncased")
model.eval() # disable dropout (or leave in train mode to finetune
```
#### Filling masks using pipeline
```python
from transformers import pipeline
calbert_fill_mask = pipeline("fill-mask", model="codegram/calbert-base-uncased", tokenizer="codegram/calbert-base-uncased")
results = calbert_fill_mask("M'agrada [MASK] això")
# results
# [{'sequence': "[CLS] m'agrada molt aixo[SEP]", 'score': 0.614592969417572, 'token': 61},
# {'sequence': "[CLS] m'agrada moltíssim aixo[SEP]", 'score': 0.06058056280016899, 'token': 4867},
# {'sequence': "[CLS] m'agrada més aixo[SEP]", 'score': 0.017195818945765495, 'token': 43},
# {'sequence': "[CLS] m'agrada llegir aixo[SEP]", 'score': 0.016321714967489243, 'token': 684},
# {'sequence': "[CLS] m'agrada escriure aixo[SEP]", 'score': 0.012185849249362946, 'token': 1306}]
```
#### Extract contextual embedding features from Calbert output
```python
import torch
# Tokenize in sub-words with SentencePiece
tokenized_sentence = tokenizer.tokenize("M'és una mica igual")
# ['▁m', "'", 'es', '▁una', '▁mica', '▁igual']
# 1-hot encode and add special starting and end tokens
encoded_sentence = tokenizer.encode(tokenized_sentence)
# [2, 109, 7, 71, 36, 371, 1103, 3]
# NB: Can be done in one step : tokenize.encode("M'és una mica igual")
# Feed tokens to Calbert as a torch tensor (batch dim 1)
encoded_sentence = torch.tensor(encoded_sentence).unsqueeze(0)
embeddings, _ = model(encoded_sentence)
embeddings.size()
# torch.Size([1, 8, 768])
embeddings.detach()
# tensor([[[-0.0261, 0.1166, -0.1075, ..., -0.0368, 0.0193, 0.0017],
# [ 0.1289, -0.2252, 0.9881, ..., -0.1353, 0.3534, 0.0734],
# [-0.0328, -1.2364, 0.9466, ..., 0.3455, 0.7010, -0.2085],
# ...,
# [ 0.0397, -1.0228, -0.2239, ..., 0.2932, 0.1248, 0.0813],
# [-0.0261, 0.1165, -0.1074, ..., -0.0368, 0.0193, 0.0017],
# [-0.1934, -0.2357, -0.2554, ..., 0.1831, 0.6085, 0.1421]]])
```
## Authors
CALBERT was trained and evaluated by [Txus Bach](https://twitter.com/txustice), as part of [Codegram](https://www.codegram.com)'s applied research.
<a href="https://huggingface.co/exbert/?model=codegram/calbert-base-uncased&modelKind=bidirectional&sentence=M%27agradaria%20força%20saber-ne%20més">
<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>
---
language: "ca"
tags:
- masked-lm
- catalan
- exbert
license: mit
---
# Calbert: a Catalan Language Model
## Introduction
CALBERT is an open-source language model for Catalan pretrained on the ALBERT architecture.
It is now available on Hugging Face in its `tiny-uncased` version (the one you're looking at) and `base-uncased` as well, and was pretrained on the [OSCAR dataset](https://traces1.inria.fr/oscar/).
For further information or requests, please go to the [GitHub repository](https://github.com/codegram/calbert)
## Pre-trained models
| Model | Arch. | Training data |
| ----------------------------------- | -------------- | ---------------------- |
| `codegram` / `calbert-tiny-uncased` | Tiny (uncased) | OSCAR (4.3 GB of text) |
| `codegram` / `calbert-base-uncased` | Base (uncased) | OSCAR (4.3 GB of text) |
## How to use Calbert with HuggingFace
#### Load Calbert and its tokenizer:
```python
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("codegram/calbert-tiny-uncased")
model = AutoModel.from_pretrained("codegram/calbert-tiny-uncased")
model.eval() # disable dropout (or leave in train mode to finetune
```
#### Filling masks using pipeline
```python
from transformers import pipeline
calbert_fill_mask = pipeline("fill-mask", model="codegram/calbert-tiny-uncased", tokenizer="codegram/calbert-tiny-uncased")
results = calbert_fill_mask("M'agrada [MASK] això")
# results
# [{'sequence': "[CLS] m'agrada molt aixo[SEP]", 'score': 0.4403671622276306, 'token': 61},
# {'sequence': "[CLS] m'agrada més aixo[SEP]", 'score': 0.050061386078596115, 'token': 43},
# {'sequence': "[CLS] m'agrada veure aixo[SEP]", 'score': 0.026286985725164413, 'token': 157},
# {'sequence': "[CLS] m'agrada bastant aixo[SEP]", 'score': 0.022483550012111664, 'token': 2143},
# {'sequence': "[CLS] m'agrada moltíssim aixo[SEP]", 'score': 0.014491282403469086, 'token': 4867}]
```
#### Extract contextual embedding features from Calbert output
```python
import torch
# Tokenize in sub-words with SentencePiece
tokenized_sentence = tokenizer.tokenize("M'és una mica igual")
# ['▁m', "'", 'es', '▁una', '▁mica', '▁igual']
# 1-hot encode and add special starting and end tokens
encoded_sentence = tokenizer.encode(tokenized_sentence)
# [2, 109, 7, 71, 36, 371, 1103, 3]
# NB: Can be done in one step : tokenize.encode("M'és una mica igual")
# Feed tokens to Calbert as a torch tensor (batch dim 1)
encoded_sentence = torch.tensor(encoded_sentence).unsqueeze(0)
embeddings, _ = model(encoded_sentence)
embeddings.size()
# torch.Size([1, 8, 312])
embeddings.detach()
# tensor([[[-0.2726, -0.9855, 0.9643, ..., 0.3511, 0.3499, -0.1984],
# [-0.2824, -1.1693, -0.2365, ..., -3.1866, -0.9386, -1.3718],
# [-2.3645, -2.2477, -1.6985, ..., -1.4606, -2.7294, 0.2495],
# ...,
# [ 0.8800, -0.0244, -3.0446, ..., 0.5148, -3.0903, 1.1879],
# [ 1.1300, 0.2425, 0.2162, ..., -0.5722, -2.2004, 0.4045],
# [ 0.4549, -0.2378, -0.2290, ..., -2.1247, -2.2769, -0.0820]]])
```
## Authors
CALBERT was trained and evaluated by [Txus Bach](https://twitter.com/txustice), as part of [Codegram](https://www.codegram.com)'s applied research.
<a href="https://huggingface.co/exbert/?model=codegram/calbert-tiny-uncased&modelKind=bidirectional&sentence=M%27agradaria%20força%20saber-ne%20més">
<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>
# LIMIT-BERT
Code and model for the *EMNLP 2020 Findings* paper:
[LIMIT-BERT: Linguistic Informed Multi-task BERT](https://arxiv.org/abs/1910.14296))
## Contents
1. [Requirements](#Requirements)
2. [Training](#Training)
## Requirements
* Python 3.6 or higher.
* Cython 0.25.2 or any compatible version.
* [PyTorch](http://pytorch.org/) 1.0.0+.
* [EVALB](http://nlp.cs.nyu.edu/evalb/). Before starting, run `make` inside the `EVALB/` directory to compile an `evalb` executable. This will be called from Python for evaluation.
* [pytorch-transformers](https://github.com/huggingface/pytorch-transformers) PyTorch 1.0.0+ or any compatible version.
#### Pre-trained Models (PyTorch)
The following pre-trained models are available for download from Google Drive:
* [`LIMIT-BERT`](https://drive.google.com/open?id=1fm0cK2A91iLG3lCpwowCCQSALnWS2X4i):
PyTorch version, same setting with BERT-Large-WWM,loading model with [pytorch-transformers](https://github.com/huggingface/pytorch-transformers).
## How to use
```
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("cooelf/limitbert")
model = AutoModel.from_pretrained("cooelf/limitbert")
```
Please see our original repo for the training scripts.
https://github.com/cooelf/LIMIT-BERT
## Training
To train LIMIT-BERT, simply run:
```
sh run_limitbert.sh
```
### Evaluation Instructions
To test after setting model path:
```
sh test_bert.sh
```
## Citation
```
@article{zhou2019limit,
title={{LIMIT-BERT}: Linguistic informed multi-task {BERT}},
author={Zhou, Junru and Zhang, Zhuosheng and Zhao, Hai},
journal={arXiv preprint arXiv:1910.14296},
year={2019}
}
```
\ No newline at end of file
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment