"server/vscode:/vscode.git/clone" did not exist on "840424a2c4eb9a18726818d427ef382dc41a7e1b"
Unverified commit 3552d0e0, authored by Julien Chaumond and committed by GitHub

[model_cards] Migrate cards from this repo to model repos on huggingface.co (#9013)



* rm all model cards

* Update the .rst

@sgugger it is still not super crystal clear/streamlined, so let me know if you have any ideas to make it simpler

* Add a rootlevel README.md with simple instructions/context

* Update docs/source/model_sharing.rst
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* make style

* rm all model cards
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
---
language: ga
tags:
- irish
---
## BERTreach
([beirtreach](https://www.teanglann.ie/en/fgb/beirtreach) means 'oyster bed')
**Model size:** 84M
**Training data:**
* [PARSEME 1.2](https://gitlab.com/parseme/parseme_corpus_ga/-/blob/master/README.md)
* Newscrawl 300k portion of the [Leipzig Corpora](https://wortschatz.uni-leipzig.de/en/download/irish)
* Private news corpus crawled with [Corpus Crawler](https://github.com/google/corpuscrawler)
(2,125,804 sentences, 47,419,062 tokens, as reckoned by `wc`)
```python
from transformers import pipeline
fill_mask = pipeline("fill-mask", model="jimregan/BERTreach", tokenizer="jimregan/BERTreach")
```
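For example, a quick call with an illustrative masked sentence (the Irish example below is not from the original card):

```python
fill_mask("Tá an <mask> go maith.")  # "The <mask> is good."
```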
# shrugging-grace/tweetclassifier
## Model description
This model classifies tweets as either relating to the Covid-19 pandemic or not.
## Intended uses & limitations
It is intended to be used on tweets commenting on UK politics, in particular those trending with the #PMQs hashtag, which refers to the weekly Prime Minister's Questions.
#### How to use
The model outputs one of two labels (a usage sketch follows below):
- ``LABEL_0``: the tweet relates to Covid-19
- ``LABEL_1``: the tweet does not relate to Covid-19
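A minimal usage sketch, assuming the checkpoint is hosted under the `shrugging-grace/tweetclassifier` identifier shown in the title (the example tweet is hypothetical):

```python
from transformers import pipeline

# Returns LABEL_0 (relates to Covid-19) or LABEL_1 (does not) with a confidence score.
classifier = pipeline("text-classification", model="shrugging-grace/tweetclassifier")
print(classifier("The PM was asked about the national lockdown at #PMQs today."))
```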
## Training data
The model was trained on 1,000 tweets (all carrying the "#PMQs" hashtag), which were manually labeled by the author. The tweets were collected between May and July 2020.
### BibTeX entry and citation info
This was based on a pretrained version of BERT.
```bibtex
@article{devlin2018bert,
  title={Bert: Pre-training of deep bidirectional transformers for language understanding},
  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1810.04805},
  year={2018}
}
```
---
language: en
tags:
- text-classification
- pytorch
datasets:
- yahoo-answers
pipeline_tag: zero-shot-classification
---
# bart-large-mnli-yahoo-answers
## Model Description
This model takes [facebook/bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) and fine-tunes it on Yahoo Answers topic classification. It can be used to predict whether a topic label can be assigned to a given sequence, whether or not the label has been seen before.
You can play with an interactive demo of this zero-shot technique with this model, as well as the non-finetuned [facebook/bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli), [here](https://huggingface.co/zero-shot/).
## Intended Usage
This model was fine-tuned on topic classification and will perform best at zero-shot topic classification. Use `hypothesis_template="This text is about {}."` as this is the template used during fine-tuning.
For settings other than topic classification, you can use any model pre-trained on MNLI such as [facebook/bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) or [roberta-large-mnli](https://huggingface.co/roberta-large-mnli) with the same code as written below.
#### With the zero-shot classification pipeline
The model can be used with the `zero-shot-classification` pipeline like so:
```python
from transformers import pipeline
nlp = pipeline("zero-shot-classification", model="joeddav/bart-large-mnli-yahoo-answers")
sequence_to_classify = "Who are you voting for in 2020?"
candidate_labels = ["Europe", "public health", "politics", "elections"]
hypothesis_template = "This text is about {}."
nlp(sequence_to_classify, candidate_labels, multi_class=True, hypothesis_template=hypothesis_template)
```
#### With manual PyTorch
```python
# pose sequence as a NLI premise and label as a hypothesis
import torch
from transformers import BartForSequenceClassification, BartTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
nli_model = BartForSequenceClassification.from_pretrained('joeddav/bart-large-mnli-yahoo-answers').to(device)
tokenizer = BartTokenizer.from_pretrained('joeddav/bart-large-mnli-yahoo-answers')

sequence = "Who are you voting for in 2020?"  # text to classify
label = "politics"                            # candidate topic label

premise = sequence
hypothesis = f'This text is about {label}.'

# run through model pre-trained on MNLI
x = tokenizer.encode(premise, hypothesis, return_tensors='pt',
                     max_length=tokenizer.max_len,
                     truncation_strategy='only_first')
logits = nli_model(x.to(device))[0]

# we throw away "neutral" (dim 1) and take the probability of
# "entailment" (2) as the probability of the label being true
entail_contradiction_logits = logits[:,[0,2]]
probs = entail_contradiction_logits.softmax(dim=1)
prob_label_is_true = probs[:,1]
```
## Training
The model is a pre-trained MNLI classifier further fine-tuned on Yahoo Answers topic classification in the manner originally described in [Yin et al. 2019](https://arxiv.org/abs/1909.00161) and [this blog post](https://joeddav.github.io/blog/2020/05/29/ZSL.html). That is, each sequence is fed to the pre-trained NLI model in place of the premise and each candidate label as the hypothesis, formatted like so: `This text is about {class name}.` For each example in the training set, a true and a randomly-selected false label hypothesis are fed to the model which must predict which labels are valid and which are false.
Since this method studies the ability to classify unseen labels after being trained on a different set of labels, the model is only trained on 5 out of the 10 labels in Yahoo Answers. These are "Society & Culture", "Health", "Computers & Internet", "Business & Finance", and "Family & Relationships".
## Evaluation Results
This model was evaluated with the label-weighted F1 of the _seen_ and _unseen_ labels. That is, for each example the model must predict from one of the 10 corpus labels. The F1 is reported for the labels seen during training as well as the labels unseen during training. We found an F1 score of `.68` and `.72` for the unseen and seen labels, respectively. In order to adjust for the in-vs-out of distribution labels, we subtract a fixed amount of 30% from the normalized probabilities of the _seen_ labels, as described in [Yin et al. 2019](https://arxiv.org/abs/1909.00161) and [our blog post](https://joeddav.github.io/blog/2020/05/29/ZSL.html).
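As an illustration of that adjustment (this is not the original evaluation code, and the label ordering is hypothetical):

```python
import numpy as np

# Normalized probabilities over the 10 Yahoo Answers labels for one example.
probs = np.array([0.30, 0.20, 0.15, 0.10, 0.05, 0.08, 0.05, 0.03, 0.02, 0.02])
# Hypothetical ordering: the 5 labels seen during fine-tuning come first.
seen_mask = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

# Subtract a fixed 30% from the seen labels before taking the argmax.
adjusted = probs - 0.30 * seen_mask
print(int(np.argmax(adjusted)))  # 5 -- an unseen label wins after the adjustment
```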
---
language: multilingual
tags:
- text-classification
- pytorch
- tensorflow
datasets:
- multi_nli
- xnli
license: mit
pipeline_tag: zero-shot-classification
widget:
- text: "За кого вы голосуете в 2020 году?"
labels: "politique étrangère, Europe, élections, affaires, politique"
- text: "لمن تصوت في 2020؟"
labels: "السياسة الخارجية, أوروبا, الانتخابات, الأعمال, السياسة"
- text: "2020'de kime oy vereceksiniz?"
labels: "dış politika, Avrupa, seçimler, ticaret, siyaset"
---
# xlm-roberta-large-xnli
## Model Description
This model takes [xlm-roberta-large](https://huggingface.co/xlm-roberta-large) and fine-tunes it on a combination of NLI data in 15 languages. It is intended to be used for zero-shot text classification, such as with the Hugging Face [ZeroShotClassificationPipeline](https://huggingface.co/transformers/master/main_classes/pipelines.html#transformers.ZeroShotClassificationPipeline).
## Intended Usage
This model is intended to be used for zero-shot text classification, especially in languages other than English. It is fine-tuned on XNLI, which is a multilingual NLI dataset. The model can therefore be used with any of the languages in the XNLI corpus:
- English
- French
- Spanish
- German
- Greek
- Bulgarian
- Russian
- Turkish
- Arabic
- Vietnamese
- Thai
- Chinese
- Hindi
- Swahili
- Urdu
Since the base model was pre-trained on 100 different languages, the
model has shown some effectiveness in languages beyond those listed above as
well. See the full list of pre-trained languages in appendix A of the
[XLM-RoBERTa paper](https://arxiv.org/abs/1911.02116).
For English-only classification, it is recommended to use
[bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) or
[a distilled bart MNLI model](https://huggingface.co/models?filter=pipeline_tag%3Azero-shot-classification&search=valhalla).
#### With the zero-shot classification pipeline
The model can be loaded with the `zero-shot-classification` pipeline like so:
```python
from transformers import pipeline
classifier = pipeline("zero-shot-classification",
model="joeddav/xlm-roberta-large-xnli")
```
You can then classify in any of the above languages. You can even pass the labels in one language and the sequence to
classify in another:
```python
# we will classify the Russian translation of, "Who are you voting for in 2020?"
sequence_to_classify = "За кого вы голосуете в 2020 году?"
# we can specify candidate labels in Russian or any other language above:
candidate_labels = ["Europe", "public health", "politics"]
classifier(sequence_to_classify, candidate_labels)
# {'labels': ['politics', 'Europe', 'public health'],
# 'scores': [0.9048484563827515, 0.05722189322113991, 0.03792969882488251],
# 'sequence': 'За кого вы голосуете в 2020 году?'}
```
The default hypothesis template is the English `This example is {}.`. If you are working strictly within one language, it
may be worthwhile to translate this into the language you are working with:
```python
sequence_to_classify = "¿A quién vas a votar en 2020?"
candidate_labels = ["Europa", "salud pública", "política"]
hypothesis_template = "Este ejemplo es {}."
classifier(sequence_to_classify, candidate_labels, hypothesis_template=hypothesis_template)
# {'labels': ['política', 'Europa', 'salud pública'],
# 'scores': [0.9109585881233215, 0.05954807624220848, 0.029493311420083046],
# 'sequence': '¿A quién vas a votar en 2020?'}
```
#### With manual PyTorch
```python
# pose sequence as a NLI premise and label as a hypothesis
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
nli_model = AutoModelForSequenceClassification.from_pretrained('joeddav/xlm-roberta-large-xnli').to(device)
tokenizer = AutoTokenizer.from_pretrained('joeddav/xlm-roberta-large-xnli')

sequence = "За кого вы голосуете в 2020 году?"  # text to classify
label = "politics"                               # candidate label

premise = sequence
hypothesis = f'This example is {label}.'

# run through model pre-trained on MNLI
x = tokenizer.encode(premise, hypothesis, return_tensors='pt',
                     truncation_strategy='only_first')
logits = nli_model(x.to(device))[0]

# we throw away "neutral" (dim 1) and take the probability of
# "entailment" (2) as the probability of the label being true
entail_contradiction_logits = logits[:,[0,2]]
probs = entail_contradiction_logits.softmax(dim=1)
prob_label_is_true = probs[:,1]
```
## Training
This model was pre-trained on a set of 100 languages, as described in
[the original paper](https://arxiv.org/abs/1911.02116). It was then fine-tuned on the NLI task using the concatenated
MNLI train set and the XNLI validation and test sets. Finally, it was trained for one additional epoch on XNLI data
only, with the translations of the premise and hypothesis shuffled such that the premise and hypothesis for each
example come from the same original English example but are in different languages.
---
language: ca
---
## Introduction
Download the model here:
* Catalan Roberta model: [julibert-2020-11-10.zip](https://www.softcatala.org/pub/softcatala/julibert/julibert-2020-11-10.zip)
## What's this?
Source code: https://github.com/Softcatala/julibert
* Corpus: OSCAR Catalan corpus (3.8 GB)
* Model type: RoBERTa
* Vocabulary size: 50,265
* Steps: 500,000
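A minimal usage sketch, assuming the zip above has been extracted into a local `./julibert` directory containing both the model and tokenizer files (the Catalan example sentence is illustrative):

```python
from transformers import pipeline

# Load the fill-mask pipeline from the extracted model directory.
fill_mask = pipeline("fill-mask", model="./julibert", tokenizer="./julibert")
print(fill_mask("Barcelona és la capital de <mask>."))
```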
# Tensorflow CamemBERT
In this repository you will find different versions of the CamemBERT model for Tensorflow.
## CamemBERT
[CamemBERT](https://camembert-model.fr/) is a state-of-the-art language model for French based on the RoBERTa architecture pretrained on the French subcorpus of the newly available multilingual corpus OSCAR.
## Model Weights
| Model | Downloads
| -------------------------------- | ---------------------------------------------------------------------------------------------------------------
| `jplu/tf-camembert-base` | [`config.json`](https://s3.amazonaws.com/models.huggingface.co/bert/jplu/tf-camembert-base/config.json) • [`tf_model.h5`](https://s3.amazonaws.com/models.huggingface.co/bert/jplu/tf-camembert-base/tf_model.h5)
## Usage
With Transformers >= 2.4, the TensorFlow models of CamemBERT can be loaded as follows:
```python
from transformers import TFCamembertModel
model = TFCamembertModel.from_pretrained("jplu/tf-camembert-base")
```
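A hedged follow-up showing a forward pass, written against a recent Transformers version; the tokenizer is loaded from the official `camembert-base` checkpoint on the assumption that it shares the same vocabulary:

```python
from transformers import CamembertTokenizer, TFCamembertModel

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")  # assumed-compatible tokenizer
model = TFCamembertModel.from_pretrained("jplu/tf-camembert-base")

inputs = tokenizer("J'aime le camembert !", return_tensors="tf")
outputs = model(inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```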
## Huggingface model hub
All models are available on the [Huggingface model hub](https://huggingface.co/jplu).
## Acknowledgments
Thanks to all the Huggingface team for the support and their amazing library!
---
language: multilingual
---
# XLM-R + NER
This model is [XLM-RoBERTa-base](https://arxiv.org/abs/1911.02116) fine-tuned for NER on the 40 languages proposed in [XTREME](https://github.com/google-research/xtreme), using data from [WikiAnn](https://aclweb.org/anthology/P17-1178). This is still ongoing work and the results will be updated every time an improvement is reached.
The covered labels are:
```
LOC
ORG
PER
O
```
## Metrics on evaluation set:
### Average over the 40 languages
Number of documents: 262300
```
precision recall f1-score support
ORG 0.81 0.81 0.81 102452
PER 0.90 0.91 0.91 108978
LOC 0.86 0.89 0.87 121868
micro avg 0.86 0.87 0.87 333298
macro avg 0.86 0.87 0.87 333298
```
### Afrikaans
Number of documents: 1000
```
precision recall f1-score support
ORG 0.89 0.88 0.88 582
PER 0.89 0.97 0.93 369
LOC 0.84 0.90 0.86 518
micro avg 0.87 0.91 0.89 1469
macro avg 0.87 0.91 0.89 1469
```
### Arabic
Number of documents: 10000
```
precision recall f1-score support
ORG 0.83 0.84 0.84 3507
PER 0.90 0.91 0.91 3643
LOC 0.88 0.89 0.88 3604
micro avg 0.87 0.88 0.88 10754
macro avg 0.87 0.88 0.88 10754
```
### Basque
Number of documents: 10000
```
precision recall f1-score support
LOC 0.88 0.93 0.91 5228
ORG 0.86 0.81 0.83 3654
PER 0.91 0.91 0.91 4072
micro avg 0.89 0.89 0.89 12954
macro avg 0.89 0.89 0.89 12954
```
### Bengali
Number of documents: 1000
```
precision recall f1-score support
ORG 0.86 0.89 0.87 325
LOC 0.91 0.91 0.91 406
PER 0.96 0.95 0.95 364
micro avg 0.91 0.92 0.91 1095
macro avg 0.91 0.92 0.91 1095
```
### Bulgarian
Number of documents: 1000
```
precision recall f1-score support
ORG 0.86 0.83 0.84 3661
PER 0.92 0.95 0.94 4006
LOC 0.92 0.95 0.94 6449
micro avg 0.91 0.92 0.91 14116
macro avg 0.91 0.92 0.91 14116
```
### Burmese
Number of documents: 100
```
precision recall f1-score support
LOC 0.60 0.86 0.71 37
ORG 0.68 0.63 0.66 30
PER 0.44 0.44 0.44 36
micro avg 0.57 0.65 0.61 103
macro avg 0.57 0.65 0.60 103
```
### Chinese
Number of documents: 10000
```
precision recall f1-score support
ORG 0.70 0.69 0.70 4022
LOC 0.76 0.81 0.78 3830
PER 0.84 0.84 0.84 3706
micro avg 0.76 0.78 0.77 11558
macro avg 0.76 0.78 0.77 11558
```
### Dutch
Number of documents: 10000
```
precision recall f1-score support
ORG 0.87 0.87 0.87 3930
PER 0.95 0.95 0.95 4377
LOC 0.91 0.92 0.91 4813
micro avg 0.91 0.92 0.91 13120
macro avg 0.91 0.92 0.91 13120
```
### English
Number of documents: 10000
```
precision recall f1-score support
LOC 0.83 0.84 0.84 4781
PER 0.89 0.90 0.89 4559
ORG 0.75 0.75 0.75 4633
micro avg 0.82 0.83 0.83 13973
macro avg 0.82 0.83 0.83 13973
```
### Estonian
Number of documents: 10000
```
precision recall f1-score support
LOC 0.89 0.92 0.91 5654
ORG 0.85 0.85 0.85 3878
PER 0.94 0.94 0.94 4026
micro avg 0.90 0.91 0.90 13558
macro avg 0.90 0.91 0.90 13558
```
### Finnish
Number of documents: 10000
```
precision recall f1-score support
ORG 0.84 0.83 0.84 4104
LOC 0.88 0.90 0.89 5307
PER 0.95 0.94 0.94 4519
micro avg 0.89 0.89 0.89 13930
macro avg 0.89 0.89 0.89 13930
```
### French
Number of documents: 10000
```
precision recall f1-score support
LOC 0.90 0.89 0.89 4808
ORG 0.84 0.87 0.85 3876
PER 0.94 0.93 0.94 4249
micro avg 0.89 0.90 0.90 12933
macro avg 0.89 0.90 0.90 12933
```
### Georgian
Number of documents: 10000
```
precision recall f1-score support
PER 0.90 0.91 0.90 3964
ORG 0.83 0.77 0.80 3757
LOC 0.82 0.88 0.85 4894
micro avg 0.84 0.86 0.85 12615
macro avg 0.84 0.86 0.85 12615
```
### German
Number of documents: 10000
```
precision recall f1-score support
LOC 0.85 0.90 0.87 4939
PER 0.94 0.91 0.92 4452
ORG 0.79 0.78 0.79 4247
micro avg 0.86 0.86 0.86 13638
macro avg 0.86 0.86 0.86 13638
```
### Greek
Number of documents: 10000
```
precision recall f1-score support
ORG 0.86 0.85 0.85 3771
LOC 0.88 0.91 0.90 4436
PER 0.91 0.93 0.92 3894
micro avg 0.88 0.90 0.89 12101
macro avg 0.88 0.90 0.89 12101
```
### Hebrew
Number of documents: 10000
```
precision recall f1-score support
PER 0.87 0.88 0.87 4206
ORG 0.76 0.75 0.76 4190
LOC 0.85 0.85 0.85 4538
micro avg 0.83 0.83 0.83 12934
macro avg 0.82 0.83 0.83 12934
```
### Hindi
Number of documents: 1000
```
precision recall f1-score support
ORG 0.78 0.81 0.79 362
LOC 0.83 0.85 0.84 422
PER 0.90 0.95 0.92 427
micro avg 0.84 0.87 0.85 1211
macro avg 0.84 0.87 0.85 1211
```
### Hungarian
Number of documents: 10000
```
precision recall f1-score support
PER 0.95 0.95 0.95 4347
ORG 0.87 0.88 0.87 3988
LOC 0.90 0.92 0.91 5544
micro avg 0.91 0.92 0.91 13879
macro avg 0.91 0.92 0.91 13879
```
### Indonesian
Number of documents: 10000
```
precision recall f1-score support
ORG 0.88 0.89 0.88 3735
LOC 0.93 0.95 0.94 3694
PER 0.93 0.93 0.93 3947
micro avg 0.91 0.92 0.92 11376
macro avg 0.91 0.92 0.92 11376
```
### Italian
Number of documents: 10000
```
precision recall f1-score support
LOC 0.88 0.88 0.88 4592
ORG 0.86 0.86 0.86 4088
PER 0.96 0.96 0.96 4732
micro avg 0.90 0.90 0.90 13412
macro avg 0.90 0.90 0.90 13412
```
### Japanese
Number of documents: 10000
```
precision recall f1-score support
ORG 0.62 0.61 0.62 4184
PER 0.76 0.81 0.78 3812
LOC 0.68 0.74 0.71 4281
micro avg 0.69 0.72 0.70 12277
macro avg 0.69 0.72 0.70 12277
```
### Javanese
Number of documents: 100
```
precision recall f1-score support
ORG 0.79 0.80 0.80 46
PER 0.81 0.96 0.88 26
LOC 0.75 0.75 0.75 40
micro avg 0.78 0.82 0.80 112
macro avg 0.78 0.82 0.80 112
```
### Kazakh
Number of documents: 1000
```
precision recall f1-score support
ORG 0.76 0.61 0.68 307
LOC 0.78 0.90 0.84 461
PER 0.87 0.91 0.89 367
micro avg 0.81 0.83 0.82 1135
macro avg 0.81 0.83 0.81 1135
```
### Korean
Number of documents: 10000
```
precision recall f1-score support
LOC 0.86 0.89 0.88 5097
ORG 0.79 0.74 0.77 4218
PER 0.83 0.86 0.84 4014
micro avg 0.83 0.83 0.83 13329
macro avg 0.83 0.83 0.83 13329
```
### Malay
Number of documents: 1000
```
precision recall f1-score support
ORG 0.87 0.89 0.88 368
PER 0.92 0.91 0.91 366
LOC 0.94 0.95 0.95 354
micro avg 0.91 0.92 0.91 1088
macro avg 0.91 0.92 0.91 1088
```
### Malayalam
Number of documents: 1000
```
precision recall f1-score support
ORG 0.75 0.74 0.75 347
PER 0.84 0.89 0.86 417
LOC 0.74 0.75 0.75 391
micro avg 0.78 0.80 0.79 1155
macro avg 0.78 0.80 0.79 1155
```
### Marathi
Number of documents: 1000
```
precision recall f1-score support
PER 0.89 0.94 0.92 394
LOC 0.82 0.84 0.83 457
ORG 0.84 0.78 0.81 339
micro avg 0.85 0.86 0.85 1190
macro avg 0.85 0.86 0.85 1190
```
### Persian
Number of documents: 10000
```
precision recall f1-score support
PER 0.93 0.92 0.93 3540
LOC 0.93 0.93 0.93 3584
ORG 0.89 0.92 0.90 3370
micro avg 0.92 0.92 0.92 10494
macro avg 0.92 0.92 0.92 10494
```
### Portuguese
Number of documents: 10000
```
precision recall f1-score support
LOC 0.90 0.91 0.91 4819
PER 0.94 0.92 0.93 4184
ORG 0.84 0.88 0.86 3670
micro avg 0.89 0.91 0.90 12673
macro avg 0.90 0.91 0.90 12673
```
### Russian
Number of documents: 10000
```
precision recall f1-score support
PER 0.93 0.96 0.95 3574
LOC 0.87 0.89 0.88 4619
ORG 0.82 0.80 0.81 3858
micro avg 0.87 0.88 0.88 12051
macro avg 0.87 0.88 0.88 12051
```
### Spanish
Number of documents: 10000
```
precision recall f1-score support
PER 0.95 0.93 0.94 3891
ORG 0.86 0.88 0.87 3709
LOC 0.89 0.91 0.90 4553
micro avg 0.90 0.91 0.90 12153
macro avg 0.90 0.91 0.90 12153
```
### Swahili
Number of documents: 1000
```
precision recall f1-score support
ORG 0.82 0.85 0.83 349
PER 0.95 0.92 0.94 403
LOC 0.86 0.89 0.88 450
micro avg 0.88 0.89 0.88 1202
macro avg 0.88 0.89 0.88 1202
```
### Tagalog
Number of documents: 1000
```
precision recall f1-score support
LOC 0.90 0.91 0.90 338
ORG 0.83 0.91 0.87 339
PER 0.96 0.93 0.95 350
micro avg 0.90 0.92 0.91 1027
macro avg 0.90 0.92 0.91 1027
```
### Tamil
Number of documents: 1000
```
precision recall f1-score support
PER 0.90 0.92 0.91 392
ORG 0.77 0.76 0.76 370
LOC 0.78 0.81 0.79 421
micro avg 0.82 0.83 0.82 1183
macro avg 0.82 0.83 0.82 1183
```
### Telugu
Number of documents: 1000
```
precision recall f1-score support
ORG 0.67 0.55 0.61 347
LOC 0.78 0.87 0.82 453
PER 0.73 0.86 0.79 393
micro avg 0.74 0.77 0.76 1193
macro avg 0.73 0.77 0.75 1193
```
### Thai
Number of documents: 10000
```
precision recall f1-score support
LOC 0.63 0.76 0.69 3928
PER 0.78 0.83 0.80 6537
ORG 0.59 0.59 0.59 4257
micro avg 0.68 0.74 0.71 14722
macro avg 0.68 0.74 0.71 14722
```
### Turkish
Number of documents: 10000
```
precision recall f1-score support
PER 0.94 0.94 0.94 4337
ORG 0.88 0.89 0.88 4094
LOC 0.90 0.92 0.91 4929
micro avg 0.90 0.92 0.91 13360
macro avg 0.91 0.92 0.91 13360
```
### Urdu
Number of documents: 1000
```
precision recall f1-score support
LOC 0.90 0.95 0.93 352
PER 0.96 0.96 0.96 333
ORG 0.91 0.90 0.90 326
micro avg 0.92 0.94 0.93 1011
macro avg 0.92 0.94 0.93 1011
```
### Vietnamese
Number of documents: 10000
```
precision recall f1-score support
ORG 0.86 0.87 0.86 3579
LOC 0.88 0.91 0.90 3811
PER 0.92 0.93 0.93 3717
micro avg 0.89 0.90 0.90 11107
macro avg 0.89 0.90 0.90 11107
```
### Yoruba
Number of documents: 100
```
precision recall f1-score support
LOC 0.54 0.72 0.62 36
ORG 0.58 0.31 0.41 35
PER 0.77 1.00 0.87 36
micro avg 0.64 0.68 0.66 107
macro avg 0.63 0.68 0.63 107
```
## Reproduce the results
Download and prepare the dataset from the [XTREME repo](https://github.com/google-research/xtreme#download-the-data). Next, from the root of the transformers repo run:
```bash
cd examples/ner
python run_tf_ner.py \
--data_dir . \
--labels ./labels.txt \
--model_name_or_path jplu/tf-xlm-roberta-base \
--output_dir model \
--max_seq_length 128 \
--num_train_epochs 2 \
--per_gpu_train_batch_size 16 \
--per_gpu_eval_batch_size 32 \
--do_train \
--do_eval \
--logging_dir logs \
--mode token-classification \
--evaluate_during_training \
--optimizer_name adamw
```
## Usage with pipelines
```python
from transformers import pipeline
nlp_ner = pipeline(
"ner",
model="jplu/tf-xlm-r-ner-40-lang",
tokenizer=(
'jplu/tf-xlm-r-ner-40-lang',
{"use_fast": True}),
framework="tf"
)
text_fr = "Barack Obama est né à Hawaï."
text_en = "Barack Obama was born in Hawaii."
text_es = "Barack Obama nació en Hawai."
text_zh = "巴拉克·奧巴馬(Barack Obama)出生於夏威夷。"
text_ar = "ولد باراك أوباما في هاواي."
nlp_ner(text_fr)
#Output: [{'word': '▁Barack', 'score': 0.9894659519195557, 'entity': 'PER'}, {'word': '▁Obama', 'score': 0.9888848662376404, 'entity': 'PER'}, {'word': '▁Hawa', 'score': 0.998701810836792, 'entity': 'LOC'}, {'word': 'ï', 'score': 0.9987035989761353, 'entity': 'LOC'}]
nlp_ner(text_en)
#Output: [{'word': '▁Barack', 'score': 0.9929141998291016, 'entity': 'PER'}, {'word': '▁Obama', 'score': 0.9930834174156189, 'entity': 'PER'}, {'word': '▁Hawaii', 'score': 0.9986202120780945, 'entity': 'LOC'}]
nlp_ner(text_es)
#Output: [{'word': '▁Barack', 'score': 0.9944776296615601, 'entity': 'PER'}, {'word': '▁Obama', 'score': 0.9949177503585815, 'entity': 'PER'}, {'word': '▁Hawa', 'score': 0.9987911581993103, 'entity': 'LOC'}, {'word': 'i', 'score': 0.9984861612319946, 'entity': 'LOC'}]
nlp_ner(text_zh)
#Output: [{'word': '夏威夷', 'score': 0.9988449215888977, 'entity': 'LOC'}]
nlp_ner(text_ar)
#Output: [{'word': '▁با', 'score': 0.9903655648231506, 'entity': 'PER'}, {'word': 'راك', 'score': 0.9850614666938782, 'entity': 'PER'}, {'word': '▁أوباما', 'score': 0.9850308299064636, 'entity': 'PER'}, {'word': '▁ها', 'score': 0.9477543234825134, 'entity': 'LOC'}, {'word': 'وا', 'score': 0.9428229928016663, 'entity': 'LOC'}, {'word': 'ي', 'score': 0.9319471716880798, 'entity': 'LOC'}]
```
# Tensorflow XLM-RoBERTa
In this repository you will find different versions of the XLM-RoBERTa model for Tensorflow.
## XLM-RoBERTa
[XLM-RoBERTa](https://ai.facebook.com/blog/-xlm-r-state-of-the-art-cross-lingual-understanding-through-self-supervision/) is a large-scale cross-lingual sentence encoder. It is trained on 2.5 TB of filtered CommonCrawl data covering 100 languages. XLM-R achieves state-of-the-art results on multiple cross-lingual benchmarks.
## Model Weights
| Model | Downloads
| -------------------------------- | ---------------------------------------------------------------------------------------------------------------
| `jplu/tf-xlm-roberta-base` | [`config.json`](https://s3.amazonaws.com/models.huggingface.co/bert/jplu/tf-xlm-roberta-base/config.json) • [`tf_model.h5`](https://s3.amazonaws.com/models.huggingface.co/bert/jplu/tf-xlm-roberta-base/tf_model.h5)
| `jplu/tf-xlm-roberta-large` | [`config.json`](https://s3.amazonaws.com/models.huggingface.co/bert/jplu/tf-xlm-roberta-large/config.json) • [`tf_model.h5`](https://s3.amazonaws.com/models.huggingface.co/bert/jplu/tf-xlm-roberta-large/tf_model.h5)
## Usage
With Transformers >= 2.4, the TensorFlow models of XLM-RoBERTa can be loaded as follows:
```python
from transformers import TFXLMRobertaModel
model = TFXLMRobertaModel.from_pretrained("jplu/tf-xlm-roberta-base")
```
Or:
```python
model = TFXLMRobertaModel.from_pretrained("jplu/tf-xlm-roberta-large")
```
## Huggingface model hub
All models are available on the [Huggingface model hub](https://huggingface.co/jplu).
## Acknowledgments
Thanks to all the Huggingface team for the support and their amazing library!
---
language: eo
thumbnail: https://huggingface.co/blog/assets/01_how-to-train/EsperBERTo-thumbnail-v2.png
widget:
- text: "Mi estas viro kej estas tago varma."
---
# EsperBERTo: RoBERTa-like Language model trained on Esperanto
**Companion model to blog post https://huggingface.co/blog/how-to-train** 🔥
## Training Details
- current checkpoint: 566000
- machine name: `galinette`
![](https://huggingface.co/blog/assets/01_how-to-train/EsperBERTo-thumbnail-v2.png)
## Example pipeline
```python
from transformers import TokenClassificationPipeline, pipeline
MODEL_PATH = "./models/EsperBERTo-small-pos/"
nlp = pipeline(
"ner",
model=MODEL_PATH,
tokenizer=MODEL_PATH,
)
# or instantiate a TokenClassificationPipeline directly.
nlp("Mi estas viro kej estas tago varma.")
# {'entity': 'PRON', 'score': 0.9979867339134216, 'word': ' Mi'}
# {'entity': 'VERB', 'score': 0.9683094620704651, 'word': ' estas'}
# {'entity': 'VERB', 'score': 0.9797462821006775, 'word': ' estas'}
# {'entity': 'NOUN', 'score': 0.8509314060211182, 'word': ' tago'}
# {'entity': 'ADJ', 'score': 0.9996201395988464, 'word': ' varma'}
```
---
language: eo
thumbnail: https://huggingface.co/blog/assets/01_how-to-train/EsperBERTo-thumbnail-v2.png
widget:
- text: "Jen la komenco de bela <mask>."
- text: "Uno du <mask>"
- text: "Jen finiĝas bela <mask>."
---
# EsperBERTo: RoBERTa-like Language model trained on Esperanto
**Companion model to blog post https://huggingface.co/blog/how-to-train** 🔥
## Training Details
- current checkpoint: 566000
- machine name: `galinette`
![](https://huggingface.co/blog/assets/01_how-to-train/EsperBERTo-thumbnail-v2.png)
## Example pipeline
```python
from transformers import pipeline
fill_mask = pipeline(
"fill-mask",
model="julien-c/EsperBERTo-small",
tokenizer="julien-c/EsperBERTo-small"
)
fill_mask("Jen la komenco de bela <mask>.")
# This is the beginning of a beautiful <mask>.
# =>
# {
# 'score':0.06502299010753632
# 'sequence':'<s> Jen la komenco de bela vivo.</s>'
# 'token':1099
# }
# {
# 'score':0.0421181358397007
# 'sequence':'<s> Jen la komenco de bela vespero.</s>'
# 'token':5100
# }
# {
# 'score':0.024884626269340515
# 'sequence':'<s> Jen la komenco de bela laboro.</s>'
# 'token':1570
# }
# {
# 'score':0.02324388362467289
# 'sequence':'<s> Jen la komenco de bela tago.</s>'
# 'token':1688
# }
# {
# 'score':0.020378097891807556
# 'sequence':'<s> Jen la komenco de bela festo.</s>'
# 'token':4580
# }
```
## How to build a dummy model
```python
from transformers import BertConfig, BertForMaskedLM, BertTokenizer, TFBertForMaskedLM
SMALL_MODEL_IDENTIFIER = "julien-c/bert-xsmall-dummy"
DIRNAME = "./bert-xsmall-dummy"
config = BertConfig(10, 20, 1, 1, 40)
model = BertForMaskedLM(config)
model.save_pretrained(DIRNAME)
tf_model = TFBertForMaskedLM.from_pretrained(DIRNAME, from_pt=True)
tf_model.save_pretrained(DIRNAME)
# Slightly different for tokenizer.
# tokenizer = BertTokenizer.from_pretrained(DIRNAME)
# tokenizer.save_pretrained(DIRNAME)
```
---
tags:
- ci
---
## Dummy model used for unit testing and CI
```python
import json
import os
from transformers import RobertaConfig, RobertaForMaskedLM, TFRobertaForMaskedLM
DIRNAME = "./dummy-unknown"
config = RobertaConfig(10, 20, 1, 1, 40)
model = RobertaForMaskedLM(config)
model.save_pretrained(DIRNAME)
tf_model = TFRobertaForMaskedLM.from_pretrained(DIRNAME, from_pt=True)
tf_model.save_pretrained(DIRNAME)
# Tokenizer:
vocab = [
"l",
"o",
"w",
"e",
"r",
"s",
"t",
"i",
"d",
"n",
"\u0120",
"\u0120l",
"\u0120n",
"\u0120lo",
"\u0120low",
"er",
"\u0120lowest",
"\u0120newer",
"\u0120wider",
"<unk>",
]
vocab_tokens = dict(zip(vocab, range(len(vocab))))
merges = ["#version: 0.2", "\u0120 l", "\u0120l o", "\u0120lo w", "e r", ""]
vocab_file = os.path.join(DIRNAME, "vocab.json")
merges_file = os.path.join(DIRNAME, "merges.txt")
with open(vocab_file, "w", encoding="utf-8") as fp:
fp.write(json.dumps(vocab_tokens) + "\n")
with open(merges_file, "w", encoding="utf-8") as fp:
fp.write("\n".join(merges))
```
---
language: si
tags:
- SinhalaBERTo
- Sinhala
- roberta
datasets:
- oscar
---
### Overview
This is a slightly smaller model trained on the deduplicated Sinhala portion of the [OSCAR](https://oscar-corpus.com/) dataset. As Sinhala is a low-resource language, only a handful of models have been trained for it, so this is a good starting point for training models for further downstream tasks.
## Model Specification
The model chosen for training is [Roberta](https://arxiv.org/abs/1907.11692) with the following specifications:
1. vocab_size=52000
2. max_position_embeddings=514
3. num_attention_heads=12
4. num_hidden_layers=6
5. type_vocab_size=1
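For reference, here is a sketch of how these hyperparameters map onto a `RobertaConfig` (illustrative only, not the released configuration file):

```python
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
```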
## How to Use
You can use this model directly with a pipeline for masked language modeling:
```python
from transformers import AutoTokenizer, AutoModelWithLMHead, pipeline

model = AutoModelWithLMHead.from_pretrained("keshan/SinhalaBERTo")
tokenizer = AutoTokenizer.from_pretrained("keshan/SinhalaBERTo")
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
fill_mask("මම ගෙදර <mask>.")
```
---
language: et
---
## Model Description
This model is based on **Sentence-Transformers'** `distiluse-base-multilingual-cased` multilingual model, extended to produce sentence embeddings for Estonian.
## Sentence-Transformers
This model can be imported directly via the SentenceTransformers package as shown below:
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('kiri-ai/distiluse-base-multilingual-cased-et')
sentences = ['Here is a sample sentence','Another sample sentence']
embeddings = model.encode(sentences)
print("Sentence embeddings:")
print(embeddings)
```
## Fine-tuning
The fine-tuning and training processes were inspired by [sbert's](https://www.sbert.net/) multilingual training techniques which are available [here](https://www.sbert.net/examples/training/multilingual/README.html). The documentation shows and explains the step-by-step process of using parallel sentences to train models in a different language.
### Resources
The model was fine-tuned on English-Estonian parallel sentences taken from [OPUS](http://opus.nlpl.eu/) and [ParaCrawl](https://paracrawl.eu/).
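As an illustrative check (not from the original card), parallel English and Estonian sentences should map to nearby points in the shared embedding space:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('kiri-ai/distiluse-base-multilingual-cased-et')

# "Ilm on täna ilus." is Estonian for "The weather is nice today."
emb_en = model.encode('The weather is nice today.', convert_to_tensor=True)
emb_et = model.encode('Ilm on täna ilus.', convert_to_tensor=True)
print(util.pytorch_cos_sim(emb_en, emb_et))  # expect a high cosine similarity
```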
---
language: ko
---
# 📈 Financial Korean ELECTRA model
Pretrained ELECTRA Language Model for Korean (`finance-koelectra-base-discriminator`)
> ELECTRA is a new method for self-supervised language representation learning. It can be used to
> pre-train transformer networks using relatively little compute. ELECTRA models are trained to
> distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to
> the discriminator of a GAN.
More details about ELECTRA can be found in the [ICLR paper](https://openreview.net/forum?id=r1xMH1BtvB)
or in the [official ELECTRA repository](https://github.com/google-research/electra) on GitHub.
## Stats
The current version of the model is trained on financial news data from Naver News.
The final training corpus has a size of 25GB and contains 2.3B tokens.
The model was trained as a cased model on a TITAN RTX GPU for 500k steps.
## Usage
```python
from transformers import ElectraForPreTraining, ElectraTokenizer
import torch
discriminator = ElectraForPreTraining.from_pretrained("krevas/finance-koelectra-base-discriminator")
tokenizer = ElectraTokenizer.from_pretrained("krevas/finance-koelectra-base-discriminator")
sentence = "내일 해당 종목이 대폭 상승할 것이다"
fake_sentence = "내일 해당 종목이 맛있게 상승할 것이다"
fake_tokens = tokenizer.tokenize(fake_sentence)
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")
discriminator_outputs = discriminator(fake_inputs)
predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)
[print("%7s" % token, end="") for token in fake_tokens]
[print("%7s" % int(prediction), end="") for prediction in predictions.tolist()[1:-1]]
print("fake token : %s" % fake_tokens[predictions.tolist()[1:-1].index(1)])
```
# Huggingface model hub
All models are available on the [Huggingface model hub](https://huggingface.co/krevas).
---
language: ko
---
# 📈 Financial Korean ELECTRA model
Pretrained ELECTRA Language Model for Korean (`finance-koelectra-base-generator`)
> ELECTRA is a new method for self-supervised language representation learning. It can be used to
> pre-train transformer networks using relatively little compute. ELECTRA models are trained to
> distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to
> the discriminator of a GAN.
More details about ELECTRA can be found in the [ICLR paper](https://openreview.net/forum?id=r1xMH1BtvB)
or in the [official ELECTRA repository](https://github.com/google-research/electra) on GitHub.
## Stats
The current version of the model is trained on financial news data from Naver News.
The final training corpus has a size of 25GB and contains 2.3B tokens.
The model was trained as a cased model on a TITAN RTX GPU for 500k steps.
## Usage
```python
from transformers import pipeline
fill_mask = pipeline(
"fill-mask",
model="krevas/finance-koelectra-base-generator",
tokenizer="krevas/finance-koelectra-base-generator"
)
print(fill_mask(f"내일 해당 종목이 대폭 {fill_mask.tokenizer.mask_token}할 것이다."))
```
# Huggingface model hub
All models are available on the [Huggingface model hub](https://huggingface.co/krevas).
---
language: ko
---
# 📈 Financial Korean ELECTRA model
Pretrained ELECTRA Language Model for Korean (`finance-koelectra-small-discriminator`)
> ELECTRA is a new method for self-supervised language representation learning. It can be used to
> pre-train transformer networks using relatively little compute. ELECTRA models are trained to
> distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to
> the discriminator of a GAN.
More details about ELECTRA can be found in the [ICLR paper](https://openreview.net/forum?id=r1xMH1BtvB)
or in the [official ELECTRA repository](https://github.com/google-research/electra) on GitHub.
## Stats
The current version of the model is trained on financial news data from Naver News.
The final training corpus has a size of 25GB and contains 2.3B tokens.
The model was trained as a cased model on a TITAN RTX GPU for 500k steps.
## Usage
```python
from transformers import ElectraForPreTraining, ElectraTokenizer
import torch
discriminator = ElectraForPreTraining.from_pretrained("krevas/finance-koelectra-small-discriminator")
tokenizer = ElectraTokenizer.from_pretrained("krevas/finance-koelectra-small-discriminator")
sentence = "내일 해당 종목이 대폭 상승할 것이다"
fake_sentence = "내일 해당 종목이 맛있게 상승할 것이다"
fake_tokens = tokenizer.tokenize(fake_sentence)
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")
discriminator_outputs = discriminator(fake_inputs)
predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)
[print("%7s" % token, end="") for token in fake_tokens]
[print("%7s" % int(prediction), end="") for prediction in predictions.tolist()[1:-1]]
print("fake token : %s" % fake_tokens[predictions.tolist()[1:-1].index(1)])
```
# Huggingface model hub
All models are available on the [Huggingface model hub](https://huggingface.co/krevas).
---
language: ko
---
# 📈 Financial Korean ELECTRA model
Pretrained ELECTRA Language Model for Korean (`finance-koelectra-small-generator`)
> ELECTRA is a new method for self-supervised language representation learning. It can be used to
> pre-train transformer networks using relatively little compute. ELECTRA models are trained to
> distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to
> the discriminator of a GAN.
More details about ELECTRA can be found in the [ICLR paper](https://openreview.net/forum?id=r1xMH1BtvB)
or in the [official ELECTRA repository](https://github.com/google-research/electra) on GitHub.
## Stats
The current version of the model is trained on financial news data from Naver News.
The final training corpus has a size of 25GB and contains 2.3B tokens.
The model was trained as a cased model on a TITAN RTX GPU for 500k steps.
## Usage
```python
from transformers import pipeline
fill_mask = pipeline(
"fill-mask",
model="krevas/finance-koelectra-small-generator",
tokenizer="krevas/finance-koelectra-small-generator"
)
print(fill_mask(f"내일 해당 종목이 대폭 {fill_mask.tokenizer.mask_token}할 것이다."))
```
# Huggingface model hub
All models are available on the [Huggingface model hub](https://huggingface.co/krevas).
### Model
**[`albert-xlarge-v2`](https://huggingface.co/albert-xlarge-v2)** fine-tuned on **[`SQuAD V2`](https://rajpurkar.github.io/SQuAD-explorer/)** using **[`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)**
### Training Parameters
Trained on 4x NVIDIA GeForce RTX 2080 Ti (11 GB) GPUs.
```bash
BASE_MODEL=albert-xlarge-v2
python run_squad.py \
--version_2_with_negative \
--model_type albert \
--model_name_or_path $BASE_MODEL \
--output_dir $OUTPUT_MODEL \
--do_eval \
--do_lower_case \
--train_file $SQUAD_DIR/train-v2.0.json \
--predict_file $SQUAD_DIR/dev-v2.0.json \
--per_gpu_train_batch_size 3 \
--per_gpu_eval_batch_size 64 \
--learning_rate 3e-5 \
--num_train_epochs 3.0 \
--max_seq_length 384 \
--doc_stride 128 \
--save_steps 2000 \
--threads 24 \
--warmup_steps 814 \
--gradient_accumulation_steps 4 \
--fp16 \
--do_train
```
### Evaluation
Evaluation on the dev set. I did not sweep for best threshold.
| | val |
|-------------------|-------------------|
| exact | 84.41842836688285 |
| f1 | 87.4628460501696 |
| total | 11873.0 |
| HasAns_exact | 80.68488529014844 |
| HasAns_f1 | 86.78245127423482 |
| HasAns_total | 5928.0 |
| NoAns_exact | 88.1412952060555 |
| NoAns_f1 | 88.1412952060555 |
| NoAns_total | 5945.0 |
| best_exact | 84.41842836688285 |
| best_exact_thresh | 0.0 |
| best_f1 | 87.46284605016956 |
| best_f1_thresh | 0.0 |
### Usage
See the [huggingface documentation](https://huggingface.co/transformers/model_doc/albert.html#albertforquestionanswering). Training on `SQuAD V2` allows the model to score whether a paragraph contains an answer:
```python
# `model` is the fine-tuned AlbertForQuestionAnswering; `input_ids` encodes a (question, context) pair
start_scores, end_scores = model(input_ids)
span_scores = start_scores.softmax(dim=1).log()[:,:,None] + end_scores.softmax(dim=1).log()[:,None,:]
ignore_score = span_scores[:,0,0]  # no-answer score (the span collapsed onto the [CLS] token)
```
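For completeness, a hedged end-to-end sketch written against a recent Transformers version; the checkpoint directory is a hypothetical local path, since the card does not name a hub identifier:

```python
import torch
from transformers import AlbertForQuestionAnswering, AlbertTokenizer

MODEL_DIR = "./albert-xlarge-v2-squad-v2"  # hypothetical path to the fine-tuned checkpoint
tokenizer = AlbertTokenizer.from_pretrained(MODEL_DIR)
model = AlbertForQuestionAnswering.from_pretrained(MODEL_DIR)

question = "What is the capital of France?"
context = "Paris is the capital and most populous city of France."
inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

start = int(torch.argmax(outputs.start_logits))
end = int(torch.argmax(outputs.end_logits))
# For SQuAD V2, a predicted span collapsing onto the [CLS] token signals "no answer".
print(tokenizer.decode(inputs["input_ids"][0][start:end + 1]))
```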