Unverified Commit 3552d0e0 authored by Julien Chaumond, committed by GitHub

[model_cards] Migrate cards from this repo to model repos on huggingface.co (#9013)



* rm all model cards

* Update the .rst

@sgugger it is still not super crystal clear/streamlined so let me know if any ideas to make it simpler

* Add a rootlevel README.md with simple instructions/context

* Update docs/source/model_sharing.rst
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* make style

* rm all model cards
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
This model is pre-trained on blog articles from AWS Blogs.
## Pre-training corpora
The input text consists of around 3000 blog articles from the [AWS Blogs website](https://aws.amazon.com/blogs/), covering technical subject matter including AWS products, tools and tutorials.
## Pre-training details
I picked a RoBERTa architecture for masked language modeling (6-layer, 768-hidden, 12-heads, 82M parameters) and its corresponding ByteLevelBPE tokenization strategy. I then followed HuggingFace's Transformers [blog post](https://huggingface.co/blog/how-to-train) to train the model.
I used the following training set-up: 28k training steps with batches of 64 sequences of length 512 and an initial learning rate of 5e-5. The model achieved a training loss of 3.6 on the MLM task over 10 epochs.
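For reference, here is a minimal sketch of that set-up with the `transformers` `Trainer` (the tokenizer path and corpus file below are placeholders, since the blog-article corpus is not part of this card; this is not the original training script):
```python
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Placeholder path to a locally trained ByteLevelBPE tokenizer
tokenizer = RobertaTokenizerFast.from_pretrained("./aws-blogs-tokenizer")

# 6-layer, 768-hidden, 12-head RoBERTa (~82M parameters), as described above
config = RobertaConfig(
    vocab_size=len(tokenizer),
    num_hidden_layers=6,
    hidden_size=768,
    num_attention_heads=12,
)
model = RobertaForMaskedLM(config)

# Placeholder corpus file with one blog-article passage per line
dataset = load_dataset("text", data_files={"train": "aws_blogs.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
training_args = TrainingArguments(
    output_dir="./aws-blogs-roberta",
    per_device_train_batch_size=64,
    learning_rate=5e-5,
    max_steps=28_000,
)
Trainer(model=model, args=training_args, data_collator=collator, train_dataset=dataset).train()
```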
# GPT2 Genre Based Story Generator
## Model description
GPT2 fine-tuned on genre-based story generation.
## Intended uses
Used to generate stories based on a user-provided genre and starting prompt.
## How to use
#### Supported Genres
superhero, action, drama, horror, thriller, sci_fi
#### Input text format
\<BOS> \<genre> Some optional text...
**Example**: \<BOS> \<sci_fi> After discovering time travel,
```python
# Example of usage
from transformers import pipeline
story_gen = pipeline("text-generation", "pranavpsv/gpt2-genre-story-generator")
print(story_gen("<BOS> <superhero> Batman"))
```
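The pipeline forwards the usual generation keyword arguments, so length and sampling can be adjusted; for example (a usage sketch, not part of the original card):
```python
# Longer, sampled continuation of a sci_fi prompt
print(story_gen("<BOS> <sci_fi> After discovering time travel,", max_length=100, do_sample=True))
```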
## Training data
Initialized with the pre-trained weights of the "gpt2" checkpoint and fine-tuned on stories of various genres.
---
language: en
thumbnail:
tags:
- bert
- embeddings
license: apache-2.0
---
# LABSE BERT
## Model description
Model for "Language-agnostic BERT Sentence Embedding" paper from Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, Wei Wang. Model available in [TensorFlow Hub](https://tfhub.dev/google/LaBSE/1).
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
import torch
# Mean pooling helper (from sentence-transformers)
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask

tokenizer = AutoTokenizer.from_pretrained("pvl/labse_bert", do_lower_case=False)
model = AutoModel.from_pretrained("pvl/labse_bert")

sentences = ['This framework generates embeddings for each input sentence',
             'Sentences are passed as a list of string.',
             'The quick brown fox jumps over the lazy dog.']

encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
```
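LaBSE embeddings are intended to be compared with cosine similarity; the short follow-up below (not from the original card) continues from the code above:
```python
import torch.nn.functional as F

# L2-normalize the pooled embeddings, then compute pairwise cosine similarities
normalized_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
cosine_similarities = normalized_embeddings @ normalized_embeddings.T
print(cosine_similarities)
```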
---
language: ar
tags:
- qarib
license: apache-2.0
datasets:
- Arabic GigaWord
- Abulkhair Arabic Corpus
- opus
- Twitter data
---
# QARiB: QCRI Arabic and Dialectal BERT
## About QARiB
The QCRI Arabic and Dialectal BERT (QARiB) model was trained on a collection of ~420 million tweets and ~180 million sentences of text.
The tweets were collected using the Twitter API with the language filter `lang:ar`. The text data is a combination of
[Arabic GigaWord](url), [Abulkhair Arabic Corpus]() and [OPUS](http://opus.nlpl.eu/).
### bert-base-qarib60_1790k
- Data size: 60Gb
- Number of Iterations: 1790k
- Loss: 1.8764963
## Training QARiB
The training of the model was performed using Google's original TensorFlow code on Google Cloud TPU v2.
We used a Google Cloud Storage bucket for persistent storage of training data and models.
See more details in [Training QARiB](../Training_QARiB.md)
## Using QARiB
You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you. For more details, see [Using QARiB](../Using_QARiB.md)
### How to use
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> fill_mask = pipeline("fill-mask", model="./models/data60gb_86k")
>>> fill_mask("شو عندكم يا [MASK]")
[{'sequence': '[CLS] شو عندكم يا عرب [SEP]', 'score': 0.0990147516131401, 'token': 2355, 'token_str': 'عرب'},
{'sequence': '[CLS] شو عندكم يا جماعة [SEP]', 'score': 0.051633741706609726, 'token': 2308, 'token_str': 'جماعة'},
{'sequence': '[CLS] شو عندكم يا شباب [SEP]', 'score': 0.046871256083250046, 'token': 939, 'token_str': 'شباب'},
{'sequence': '[CLS] شو عندكم يا رفاق [SEP]', 'score': 0.03598872944712639, 'token': 7664, 'token_str': 'رفاق'},
{'sequence': '[CLS] شو عندكم يا ناس [SEP]', 'score': 0.031996358186006546, 'token': 271, 'token_str': 'ناس'}]
>>> fill_mask("قللي وشفيييك يرحم [MASK]")
[{'sequence': '[CLS] قللي وشفيييك يرحم والديك [SEP]', 'score': 0.4152909517288208, 'token': 9650, 'token_str': 'والديك'},
{'sequence': '[CLS] قللي وشفيييك يرحملي [SEP]', 'score': 0.07663793861865997, 'token': 294, 'token_str': '##لي'},
{'sequence': '[CLS] قللي وشفيييك يرحم حالك [SEP]', 'score': 0.0453166700899601, 'token': 2663, 'token_str': 'حالك'},
{'sequence': '[CLS] قللي وشفيييك يرحم امك [SEP]', 'score': 0.04390475153923035, 'token': 1942, 'token_str': 'امك'},
{'sequence': '[CLS] قللي وشفيييك يرحمونك [SEP]', 'score': 0.027349254116415977, 'token': 3283, 'token_str': '##ونك'}]
>>> fill_mask("وقام المدير [MASK]")
[
{'sequence': '[CLS] وقام المدير بالعمل [SEP]', 'score': 0.0678194984793663, 'token': 4230, 'token_str': 'بالعمل'},
{'sequence': '[CLS] وقام المدير بذلك [SEP]', 'score': 0.05191086605191231, 'token': 984, 'token_str': 'بذلك'},
{'sequence': '[CLS] وقام المدير بالاتصال [SEP]', 'score': 0.045264165848493576, 'token': 26096, 'token_str': 'بالاتصال'},
{'sequence': '[CLS] وقام المدير بعمله [SEP]', 'score': 0.03732728958129883, 'token': 40486, 'token_str': 'بعمله'},
{'sequence': '[CLS] وقام المدير بالامر [SEP]', 'score': 0.0246378555893898, 'token': 29124, 'token_str': 'بالامر'}
]
>>> fill_mask("وقامت المديرة [MASK]")
[{'sequence': '[CLS] وقامت المديرة بذلك [SEP]', 'score': 0.23992691934108734, 'token': 984, 'token_str': 'بذلك'},
{'sequence': '[CLS] وقامت المديرة بالامر [SEP]', 'score': 0.108805812895298, 'token': 29124, 'token_str': 'بالامر'},
{'sequence': '[CLS] وقامت المديرة بالعمل [SEP]', 'score': 0.06639821827411652, 'token': 4230, 'token_str': 'بالعمل'},
{'sequence': '[CLS] وقامت المديرة بالاتصال [SEP]', 'score': 0.05613093823194504, 'token': 26096, 'token_str': 'بالاتصال'},
{'sequence': '[CLS] وقامت المديرة المديرة [SEP]', 'score': 0.021778125315904617, 'token': 41635, 'token_str': 'المديرة'}]
```
## Training procedure
The training of the model was performed using Google's original TensorFlow code on an eight-core Google Cloud TPU v2.
We used a Google Cloud Storage bucket for persistent storage of training data and models.
## Eval results
We evaluated QARiB models on five downstream NLP tasks:
- Sentiment Analysis
- Emotion Detection
- Named-Entity Recognition (NER)
- Offensive Language Detection
- Dialect Identification
The results obtained with QARiB models outperform multilingual BERT, AraBERT and ArabicBERT.
## Model Weights and Vocab Download
TBD
## Contacts
Ahmed Abdelali, Sabit Hassan, Hamdy Mubarak, Kareem Darwish and Younes Samih
---
language: ar
tags:
- qarib
license: apache-2.0
datasets:
- Arabic GigaWord
- Abulkhair Arabic Corpus
- opus
- Twitter data
---
# QARiB: QCRI Arabic and Dialectal BERT
## About QARiB
The QCRI Arabic and Dialectal BERT (QARiB) model was trained on a collection of ~420 million tweets and ~180 million sentences of text.
The tweets were collected using the Twitter API with the language filter `lang:ar`. The text data is a combination of
[Arabic GigaWord](url), [Abulkhair Arabic Corpus]() and [OPUS](http://opus.nlpl.eu/).
### bert-base-qarib60_1970k
- Data size: 60Gb
- Number of Iterations: 1970k
- Loss: 1.5708898
## Training QARiB
The training of the model was performed using Google's original TensorFlow code on Google Cloud TPU v2.
We used a Google Cloud Storage bucket for persistent storage of training data and models.
See more details in [Training QARiB](../Training_QARiB.md)
## Using QARiB
You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you. For more details, see [Using QARiB](../Using_QARiB.md)
### How to use
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> fill_mask = pipeline("fill-mask", model="./models/data60gb_86k")
>>> fill_mask("شو عندكم يا [MASK]")
[{'sequence': '[CLS] شو عندكم يا عرب [SEP]', 'score': 0.0990147516131401, 'token': 2355, 'token_str': 'عرب'},
{'sequence': '[CLS] شو عندكم يا جماعة [SEP]', 'score': 0.051633741706609726, 'token': 2308, 'token_str': 'جماعة'},
{'sequence': '[CLS] شو عندكم يا شباب [SEP]', 'score': 0.046871256083250046, 'token': 939, 'token_str': 'شباب'},
{'sequence': '[CLS] شو عندكم يا رفاق [SEP]', 'score': 0.03598872944712639, 'token': 7664, 'token_str': 'رفاق'},
{'sequence': '[CLS] شو عندكم يا ناس [SEP]', 'score': 0.031996358186006546, 'token': 271, 'token_str': 'ناس'}]
>>> fill_mask("قللي وشفيييك يرحم [MASK]")
[{'sequence': '[CLS] قللي وشفيييك يرحم والديك [SEP]', 'score': 0.4152909517288208, 'token': 9650, 'token_str': 'والديك'},
{'sequence': '[CLS] قللي وشفيييك يرحملي [SEP]', 'score': 0.07663793861865997, 'token': 294, 'token_str': '##لي'},
{'sequence': '[CLS] قللي وشفيييك يرحم حالك [SEP]', 'score': 0.0453166700899601, 'token': 2663, 'token_str': 'حالك'},
{'sequence': '[CLS] قللي وشفيييك يرحم امك [SEP]', 'score': 0.04390475153923035, 'token': 1942, 'token_str': 'امك'},
{'sequence': '[CLS] قللي وشفيييك يرحمونك [SEP]', 'score': 0.027349254116415977, 'token': 3283, 'token_str': '##ونك'}]
>>> fill_mask("وقام المدير [MASK]")
[
{'sequence': '[CLS] وقام المدير بالعمل [SEP]', 'score': 0.0678194984793663, 'token': 4230, 'token_str': 'بالعمل'},
{'sequence': '[CLS] وقام المدير بذلك [SEP]', 'score': 0.05191086605191231, 'token': 984, 'token_str': 'بذلك'},
{'sequence': '[CLS] وقام المدير بالاتصال [SEP]', 'score': 0.045264165848493576, 'token': 26096, 'token_str': 'بالاتصال'},
{'sequence': '[CLS] وقام المدير بعمله [SEP]', 'score': 0.03732728958129883, 'token': 40486, 'token_str': 'بعمله'},
{'sequence': '[CLS] وقام المدير بالامر [SEP]', 'score': 0.0246378555893898, 'token': 29124, 'token_str': 'بالامر'}
]
>>> fill_mask("وقامت المديرة [MASK]")
[{'sequence': '[CLS] وقامت المديرة بذلك [SEP]', 'score': 0.23992691934108734, 'token': 984, 'token_str': 'بذلك'},
{'sequence': '[CLS] وقامت المديرة بالامر [SEP]', 'score': 0.108805812895298, 'token': 29124, 'token_str': 'بالامر'},
{'sequence': '[CLS] وقامت المديرة بالعمل [SEP]', 'score': 0.06639821827411652, 'token': 4230, 'token_str': 'بالعمل'},
{'sequence': '[CLS] وقامت المديرة بالاتصال [SEP]', 'score': 0.05613093823194504, 'token': 26096, 'token_str': 'بالاتصال'},
{'sequence': '[CLS] وقامت المديرة المديرة [SEP]', 'score': 0.021778125315904617, 'token': 41635, 'token_str': 'المديرة'}]
```
## Training procedure
The training of the model was performed using Google's original TensorFlow code on an eight-core Google Cloud TPU v2.
We used a Google Cloud Storage bucket for persistent storage of training data and models.
## Eval results
We evaluated QARiB models on five downstream NLP tasks:
- Sentiment Analysis
- Emotion Detection
- Named-Entity Recognition (NER)
- Offensive Language Detection
- Dialect Identification
The results obtained with QARiB models outperform multilingual BERT, AraBERT and ArabicBERT.
## Model Weights and Vocab Download
TBD
## Contacts
Ahmed Abdelali, Sabit Hassan, Hamdy Mubarak, Kareem Darwish and Younes Samih
---
language: ar
tags:
- qarib
license: apache-2.0
datasets:
- Arabic GigaWord
- Abulkhair Arabic Corpus
- opus
- Twitter data
---
# QARiB: QCRI Arabic and Dialectal BERT
## About QARiB
The QCRI Arabic and Dialectal BERT (QARiB) model was trained on a collection of ~420 million tweets and ~180 million sentences of text.
The tweets were collected using the Twitter API with the language filter `lang:ar`. The text data is a combination of
[Arabic GigaWord](url), [Abulkhair Arabic Corpus]() and [OPUS](http://opus.nlpl.eu/).
### bert-base-qarib60_860k
- Data size: 60Gb
- Number of Iterations: 860k
- Loss: 2.2454472
## Training QARiB
The training of the model was performed using Google's original TensorFlow code on Google Cloud TPU v2.
We used a Google Cloud Storage bucket for persistent storage of training data and models.
See more details in [Training QARiB](../Training_QARiB.md)
## Using QARiB
You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you. For more details, see [Using QARiB](../Using_QARiB.md)
### How to use
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> fill_mask = pipeline("fill-mask", model="./models/data60gb_86k")
>>> fill_mask("شو عندكم يا [MASK]")
[{'sequence': '[CLS] شو عندكم يا عرب [SEP]', 'score': 0.0990147516131401, 'token': 2355, 'token_str': 'عرب'},
{'sequence': '[CLS] شو عندكم يا جماعة [SEP]', 'score': 0.051633741706609726, 'token': 2308, 'token_str': 'جماعة'},
{'sequence': '[CLS] شو عندكم يا شباب [SEP]', 'score': 0.046871256083250046, 'token': 939, 'token_str': 'شباب'},
{'sequence': '[CLS] شو عندكم يا رفاق [SEP]', 'score': 0.03598872944712639, 'token': 7664, 'token_str': 'رفاق'},
{'sequence': '[CLS] شو عندكم يا ناس [SEP]', 'score': 0.031996358186006546, 'token': 271, 'token_str': 'ناس'}]
>>> fill_mask("قللي وشفيييك يرحم [MASK]")
[{'sequence': '[CLS] قللي وشفيييك يرحم والديك [SEP]', 'score': 0.4152909517288208, 'token': 9650, 'token_str': 'والديك'},
{'sequence': '[CLS] قللي وشفيييك يرحملي [SEP]', 'score': 0.07663793861865997, 'token': 294, 'token_str': '##لي'},
{'sequence': '[CLS] قللي وشفيييك يرحم حالك [SEP]', 'score': 0.0453166700899601, 'token': 2663, 'token_str': 'حالك'},
{'sequence': '[CLS] قللي وشفيييك يرحم امك [SEP]', 'score': 0.04390475153923035, 'token': 1942, 'token_str': 'امك'},
{'sequence': '[CLS] قللي وشفيييك يرحمونك [SEP]', 'score': 0.027349254116415977, 'token': 3283, 'token_str': '##ونك'}]
>>> fill_mask("وقام المدير [MASK]")
[
{'sequence': '[CLS] وقام المدير بالعمل [SEP]', 'score': 0.0678194984793663, 'token': 4230, 'token_str': 'بالعمل'},
{'sequence': '[CLS] وقام المدير بذلك [SEP]', 'score': 0.05191086605191231, 'token': 984, 'token_str': 'بذلك'},
{'sequence': '[CLS] وقام المدير بالاتصال [SEP]', 'score': 0.045264165848493576, 'token': 26096, 'token_str': 'بالاتصال'},
{'sequence': '[CLS] وقام المدير بعمله [SEP]', 'score': 0.03732728958129883, 'token': 40486, 'token_str': 'بعمله'},
{'sequence': '[CLS] وقام المدير بالامر [SEP]', 'score': 0.0246378555893898, 'token': 29124, 'token_str': 'بالامر'}
]
>>> fill_mask("وقامت المديرة [MASK]")
[{'sequence': '[CLS] وقامت المديرة بذلك [SEP]', 'score': 0.23992691934108734, 'token': 984, 'token_str': 'بذلك'},
{'sequence': '[CLS] وقامت المديرة بالامر [SEP]', 'score': 0.108805812895298, 'token': 29124, 'token_str': 'بالامر'},
{'sequence': '[CLS] وقامت المديرة بالعمل [SEP]', 'score': 0.06639821827411652, 'token': 4230, 'token_str': 'بالعمل'},
{'sequence': '[CLS] وقامت المديرة بالاتصال [SEP]', 'score': 0.05613093823194504, 'token': 26096, 'token_str': 'بالاتصال'},
{'sequence': '[CLS] وقامت المديرة المديرة [SEP]', 'score': 0.021778125315904617, 'token': 41635, 'token_str': 'المديرة'}]
```
## Training procedure
The training of the model was performed using Google's original TensorFlow code on an eight-core Google Cloud TPU v2.
We used a Google Cloud Storage bucket for persistent storage of training data and models.
## Eval results
We evaluated QARiB models on five downstream NLP tasks:
- Sentiment Analysis
- Emotion Detection
- Named-Entity Recognition (NER)
- Offensive Language Detection
- Dialect Identification
The results obtained with QARiB models outperform multilingual BERT, AraBERT and ArabicBERT.
## Model Weights and Vocab Download
TBD
## Contacts
Ahmed Abdelali, Sabit Hassan, Hamdy Mubarak, Kareem Darwish and Younes Samih
## Model in Action 🚀
```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

def set_seed(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(42)

model = T5ForConditionalGeneration.from_pretrained('ramsrigouthamg/t5_paraphraser')
tokenizer = T5Tokenizer.from_pretrained('ramsrigouthamg/t5_paraphraser')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device ", device)
model = model.to(device)

sentence = "Which course should I take to get started in data science?"
# sentence = "What are the ingredients required to bake a perfect cake?"
# sentence = "What is the best possible approach to learn aeronautical engineering?"
# sentence = "Do apples taste better than oranges in general?"

text = "paraphrase: " + sentence + " </s>"

max_len = 256

encoding = tokenizer.encode_plus(text, pad_to_max_length=True, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to(device), encoding["attention_mask"].to(device)

# Sample with top_k=120 and top_p=0.98, returning 10 candidate paraphrases
beam_outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    do_sample=True,
    max_length=256,
    top_k=120,
    top_p=0.98,
    early_stopping=True,
    num_return_sequences=10
)

print("\nOriginal Question ::")
print(sentence)
print("\n")
print("Paraphrased Questions :: ")

final_outputs = []
for beam_output in beam_outputs:
    sent = tokenizer.decode(beam_output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    if sent.lower() != sentence.lower() and sent not in final_outputs:
        final_outputs.append(sent)

for i, final_output in enumerate(final_outputs):
    print("{}: {}".format(i, final_output))
```
## Output
```
Original Question ::
Which course should I take to get started in data science?
Paraphrased Questions ::
0: What should I learn to become a data scientist?
1: How do I get started with data science?
2: How would you start a data science career?
3: How can I start learning data science?
4: How do you get started in data science?
5: What's the best course for data science?
6: Which course should I start with for data science?
7: What courses should I follow to get started in data science?
8: What degree should be taken by a data scientist?
9: Which course should I follow to become a Data Scientist?
```
## Detailed blog post available here:
https://towardsdatascience.com/paraphrase-any-question-with-t5-text-to-text-transfer-transformer-pretrained-model-and-cbb9e35f1555
---
language: pt
tags:
- portuguese
- brazil
- pt_BR
widget:
- text: gostei muito dessa <mask>
---
# BR_BERTo
Portuguese (Brazil) model for text inference.
## Params
Trained on a corpus of 6_993_330 sentences.
- Vocab size: 150_000
- RobertaForMaskedLM size: 512
- Num train epochs: 3
- Time to train: ~10 days (on GCP with an Nvidia T4)
I followed the great tutorial from the HuggingFace team:
[How to train a new language model from scratch using Transformers and Tokenizers](https://huggingface.co/blog/how-to-train)
More info here:
[BR_BERTo](https://github.com/rdenadai/BR-BERTo)
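A minimal fill-mask sketch, assuming the hub id `rdenadai/BR_BERTo` (inferred from the repository name above; adjust if the actual id differs):
```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="rdenadai/BR_BERTo")  # assumed hub id
# Same prompt as the widget example above
print(fill_mask("gostei muito dessa <mask>"))
```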
---
language: de
---
# Model description
## Dataset
Trained on fictional and non-fictional German texts written between 1840 and 1920:
* Narrative texts from Digitale Bibliothek (https://textgrid.de/digitale-bibliothek)
* Fairy tales and sagas from Grimm Korpus (https://www1.ids-mannheim.de/kl/projekte/korpora/archiv/gri.html)
* Newspaper and magazine articles from Mannheimer Korpus Historischer Zeitungen und Zeitschriften (https://repos.ids-mannheim.de/mkhz-beschreibung.html)
* Magazine articles from the journal „Die Grenzboten“ (http://www.deutschestextarchiv.de/doku/textquellen#grenzboten)
* Fictional and non-fictional texts from Projekt Gutenberg (https://www.projekt-gutenberg.org)
## Hardware used
1 Tesla P4 GPU
## Hyperparameters
| Parameter | Value |
|-------------------------------|----------|
| Epochs | 3 |
| Gradient_accumulation_steps | 1 |
| Train_batch_size | 32 |
| Learning_rate | 0.00003 |
| Max_seq_len | 128 |
## Evaluation results: Automatic tagging of four forms of speech/thought/writing representation in historical fictional and non-fictional German texts
The language model was used in the task to tag direct, indirect, reported and free indirect speech/thought/writing representation in fictional and non-fictional German texts. The tagger is available and described in detail at https://github.com/redewiedergabe/tagger.
The tagging model was trained using the SequenceTagger Class of the Flair framework ([Akbik et al., 2019](https://www.aclweb.org/anthology/N19-4010)) which implements a BiLSTM-CRF architecture on top of a language embedding (as proposed by [Huang et al. (2015)](https://arxiv.org/abs/1508.01991)).
### Hyperparameters
| Parameter | Value |
|-------------------------------|------------|
| Hidden_size | 256 |
| Learning_rate | 0.1 |
| Mini_batch_size | 8 |
| Max_epochs | 150 |
Results are reported below in comparison to a custom trained flair embedding, which was stacked onto a custom trained fastText-model. Both models were trained on the same dataset.
| | BERT ||| FastText+Flair |||Test data|
|----------------|----------|-----------|----------|------|-----------|--------|--------|
| | F1 | Precision | Recall | F1 | Precision | Recall ||
| Direct | 0.80 | 0.86 | 0.74 | 0.84 | 0.90 | 0.79 |historical German, fictional & non-fictional|
| Indirect | **0.76** | **0.79** | **0.73** | 0.73 | 0.78 | 0.68 |historical German, fictional & non-fictional|
| Reported | **0.58** | **0.69** | **0.51** | 0.56 | 0.68 | 0.48 |historical German, fictional & non-fictional|
| Free indirect | **0.57** | **0.80** | **0.44** | 0.47 | 0.78 | 0.34 |modern German, fictional|
## Intended use
Historical German texts (1840 to 1920).
(The model also showed good performance with modern German fictional texts.)
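A minimal sketch for loading the language model with `transformers` (the model id is a placeholder, since this card does not state the hub id):
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "<this-model-id>"  # placeholder: replace with this model's hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
# "The forest was [MASK] and dark."
print(fill_mask(f"Der Wald war {tokenizer.mask_token} und dunkel."))
```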
---
widget:
- text: "Even the Dwarves"
- text: "The secrets of"
---
# Model name
Magic The Generating
## Model description
This is a fine-tuned GPT-2 model trained on a corpus of all available English-language Magic: The Gathering card flavour texts.
## Intended uses & limitations
This is intended only for generating new, novel, and sometimes surprising, MtG-like flavour texts.
#### How to use
```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained("rjbownes/Magic-The-Generating")
model = GPT2LMHeadModel.from_pretrained("rjbownes/Magic-The-Generating")
```
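Once loaded, new flavour texts can be sampled with `generate`; here is a short sketch (not from the original card) using one of the widget prompts:
```python
# Sample a continuation of the prompt "Even the Dwarves"
inputs = tokenizer("Even the Dwarves", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=40,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```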
#### Limitations and bias
The training corpus was surprisingly small, only ~29000 cards; I had suspected there were more. This might mean there is a real limit to the number of entirely original strings this will generate.
This is also only based on the 117M-parameter GPT-2; it would be an obvious upgrade to retrain with the medium, large or XL models. However, despite this, the outputs I tested were very convincing!
## Training data
The data was 29222 MtG card flavour texts. The model was based on the "gpt2" pretrained transformer: https://huggingface.co/gpt2.
## Training procedure
Only English-language MtG flavour texts were scraped from the [Scryfall](https://scryfall.com/) API. Empty strings and any non-UTF-8 encoded tokens were removed, leaving 29222 entries.
The model was trained using Google Colab on a T4 instance for 4 epochs, with the AdamW optimizer (default parameters) and a batch size of 32. Token embedding lengths were capped at 98 tokens, as this was the longest string, and an attention mask was added so the model ignores all padding tokens during training.
## Eval results
Average Training Loss: 0.44866578806635815.
Validation loss: 0.5606984243444775.
Sample model outputs:
1. "Every branch a crossroads, every vine a swift steed."
—Gwendlyn Di Corci
2. "The secrets of this world will tell their masters where to strike if need be."
—Noyan Dar, Tazeem roilmage
3. "The secrets of nature are expensive. You'd be better off just to have more freedom."
4. "Even the Dwarves knew to leave some stones unturned."
5. "The wise always keep an ear open to the whispers of power."
### BibTeX entry and citation info
```bibtex
@article{BownesLM,
title={Fine Tuning GPT-2 for Magic the Gathering flavour text generation.},
author={Richard J. Bownes},
journal={Medium},
year={2020}
}
```
---
language: en
tags:
- exbert
license: mit
datasets:
- bookcorpus
- wikipedia
---
# RoBERTa base model
Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://arxiv.org/abs/1907.11692) and first released in
[this repository](https://github.com/pytorch/fairseq/tree/master/examples/roberta). This model is case-sensitive: it
makes a difference between english and English.
Disclaimer: The team releasing RoBERTa did not write a model card for this model so this model card has been written by
the Hugging Face team.
## Model description
RoBERTa is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means
it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data) with an automatic process to generate inputs and labels from those texts.
More precisely, it was pretrained with the Masked language modeling (MLM) objective. Taking a sentence, the model
randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict
the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one
after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to
learn a bidirectional representation of the sentence.
This way, the model learns an inner representation of the English language that can then be used to extract features
useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard
classifier using the features produced by the RoBERTa model as inputs.
## Intended uses & limitations
You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task.
See the [model hub](https://huggingface.co/models?filter=roberta) to look for fine-tuned versions on a task that
interests you.
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at models like GPT2.
### How to use
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='roberta-base')
>>> unmasker("Hello I'm a <mask> model.")
[{'sequence': "<s>Hello I'm a male model.</s>",
'score': 0.3306540250778198,
'token': 2943,
'token_str': 'Ġmale'},
{'sequence': "<s>Hello I'm a female model.</s>",
'score': 0.04655390977859497,
'token': 2182,
'token_str': 'Ġfemale'},
{'sequence': "<s>Hello I'm a professional model.</s>",
'score': 0.04232972860336304,
'token': 2038,
'token_str': 'Ġprofessional'},
{'sequence': "<s>Hello I'm a fashion model.</s>",
'score': 0.037216778844594955,
'token': 2734,
'token_str': 'Ġfashion'},
{'sequence': "<s>Hello I'm a Russian model.</s>",
'score': 0.03253649175167084,
'token': 1083,
'token_str': 'ĠRussian'}]
```
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import RobertaTokenizer, TFRobertaModel
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = TFRobertaModel.from_pretrained('roberta-base')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
### Limitations and bias
The training data used for this model contains a lot of unfiltered content from the internet, which is far from
neutral. Therefore, the model can have biased predictions:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='roberta-base')
>>> unmasker("The man worked as a <mask>.")
[{'sequence': '<s>The man worked as a mechanic.</s>',
'score': 0.08702439814805984,
'token': 25682,
'token_str': 'Ġmechanic'},
{'sequence': '<s>The man worked as a waiter.</s>',
'score': 0.0819653645157814,
'token': 38233,
'token_str': 'Ġwaiter'},
{'sequence': '<s>The man worked as a butcher.</s>',
'score': 0.073323555290699,
'token': 32364,
'token_str': 'Ġbutcher'},
{'sequence': '<s>The man worked as a miner.</s>',
'score': 0.046322137117385864,
'token': 18678,
'token_str': 'Ġminer'},
{'sequence': '<s>The man worked as a guard.</s>',
'score': 0.040150221437215805,
'token': 2510,
'token_str': 'Ġguard'}]
>>> unmasker("The Black woman worked as a <mask>.")
[{'sequence': '<s>The Black woman worked as a waitress.</s>',
'score': 0.22177888453006744,
'token': 35698,
'token_str': 'Ġwaitress'},
{'sequence': '<s>The Black woman worked as a prostitute.</s>',
'score': 0.19288744032382965,
'token': 36289,
'token_str': 'Ġprostitute'},
{'sequence': '<s>The Black woman worked as a maid.</s>',
'score': 0.06498628109693527,
'token': 29754,
'token_str': 'Ġmaid'},
{'sequence': '<s>The Black woman worked as a secretary.</s>',
'score': 0.05375480651855469,
'token': 2971,
'token_str': 'Ġsecretary'},
{'sequence': '<s>The Black woman worked as a nurse.</s>',
'score': 0.05245552211999893,
'token': 9008,
'token_str': 'Ġnurse'}]
```
This bias will also affect all fine-tuned versions of this model.
## Training data
The RoBERTa model was pretrained on the combination of five datasets:
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books;
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and headers);
- [CC-News](https://commoncrawl.org/2016/10/news-dataset-available/), a dataset containing 63 million English news
  articles crawled between September 2016 and February 2019;
- [OpenWebText](https://github.com/jcpeterson/openwebtext), an open-source recreation of the WebText dataset used to
  train GPT-2;
- [Stories](https://arxiv.org/abs/1806.02847), a dataset containing a subset of CommonCrawl data filtered to match the
  story-like style of Winograd schemas.
Together these datasets weigh 160GB of text.
## Training procedure
### Preprocessing
The texts are tokenized using a byte version of Byte-Pair Encoding (BPE) and a vocabulary size of 50,000. The inputs of
the model take pieces of 512 contiguous tokens that may span over documents. The beginning of a new document is marked
with `<s>` and the end of one by `</s>`.
The details of the masking procedure for each sentence are the following:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `<mask>`.
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the 10% remaining cases, the masked tokens are left as is.
Contrary to BERT, the masking is done dynamically during pretraining (e.g., it changes at each epoch and is not fixed).
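In `transformers`, this dynamic masking scheme corresponds to what `DataCollatorForLanguageModeling` applies when batches are built; a short illustrative sketch (not part of the original card):
```python
from transformers import DataCollatorForLanguageModeling, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
# 15% of tokens are selected; of those, 80% become <mask>, 10% a random token, 10% are kept
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

batch = data_collator([tokenizer("Hello I'm a model.")])
print(batch["input_ids"])  # some positions replaced by <mask> (or a random token)
print(batch["labels"])     # -100 everywhere except the selected positions
```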
### Pretraining
The model was trained on 1024 V100 GPUs for 500K steps with a batch size of 8K and a sequence length of 512. The
optimizer used is Adam with a learning rate of 6e-4, \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and
\\(\epsilon = 1e-6\\), a weight decay of 0.01, learning rate warmup for 24,000 steps and linear decay of the learning
rate after.
## Evaluation results
When fine-tuned on downstream tasks, this model achieves the following results:
Glue test results:
| Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE |
|:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
| | 87.6 | 91.9 | 92.8 | 94.8 | 63.6 | 91.2 | 90.2 | 78.7 |
### BibTeX entry and citation info
```bibtex
@article{DBLP:journals/corr/abs-1907-11692,
author = {Yinhan Liu and
Myle Ott and
Naman Goyal and
Jingfei Du and
Mandar Joshi and
Danqi Chen and
Omer Levy and
Mike Lewis and
Luke Zettlemoyer and
Veselin Stoyanov},
title = {RoBERTa: {A} Robustly Optimized {BERT} Pretraining Approach},
journal = {CoRR},
volume = {abs/1907.11692},
year = {2019},
url = {http://arxiv.org/abs/1907.11692},
archivePrefix = {arXiv},
eprint = {1907.11692},
timestamp = {Thu, 01 Aug 2019 08:59:33 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-1907-11692.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
<a href="https://huggingface.co/exbert/?model=roberta-base">
<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>
---
language: en
tags:
- exbert
license: mit
datasets:
- bookcorpus
- wikipedia
---
# RoBERTa large model
Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://arxiv.org/abs/1907.11692) and first released in
[this repository](https://github.com/pytorch/fairseq/tree/master/examples/roberta). This model is case-sensitive: it
makes a difference between english and English.
Disclaimer: The team releasing RoBERTa did not write a model card for this model so this model card has been written by
the Hugging Face team.
## Model description
RoBERTa is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means
it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data) with an automatic process to generate inputs and labels from those texts.
More precisely, it was pretrained with the Masked language modeling (MLM) objective. Taking a sentence, the model
randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict
the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one
after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to
learn a bidirectional representation of the sentence.
This way, the model learns an inner representation of the English language that can then be used to extract features
useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard
classifier using the features produced by the RoBERTa model as inputs.
## Intended uses & limitations
You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task.
See the [model hub](https://huggingface.co/models?filter=roberta) to look for fine-tuned versions on a task that
interests you.
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at models like GPT2.
### How to use
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='roberta-large')
>>> unmasker("Hello I'm a <mask> model.")
[{'sequence': "<s>Hello I'm a male model.</s>",
'score': 0.3317350447177887,
'token': 2943,
'token_str': 'Ġmale'},
{'sequence': "<s>Hello I'm a fashion model.</s>",
'score': 0.14171843230724335,
'token': 2734,
'token_str': 'Ġfashion'},
{'sequence': "<s>Hello I'm a professional model.</s>",
'score': 0.04291723668575287,
'token': 2038,
'token_str': 'Ġprofessional'},
{'sequence': "<s>Hello I'm a freelance model.</s>",
'score': 0.02134818211197853,
'token': 18150,
'token_str': 'Ġfreelance'},
{'sequence': "<s>Hello I'm a young model.</s>",
'score': 0.021098261699080467,
'token': 664,
'token_str': 'Ġyoung'}]
```
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained('roberta-large')
model = RobertaModel.from_pretrained('roberta-large')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import RobertaTokenizer, TFRobertaModel
tokenizer = RobertaTokenizer.from_pretrained('roberta-large')
model = TFRobertaModel.from_pretrained('roberta-large')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
### Limitations and bias
The training data used for this model contains a lot of unfiltered content from the internet, which is far from
neutral. Therefore, the model can have biased predictions:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='roberta-large')
>>> unmasker("The man worked as a <mask>.")
[{'sequence': '<s>The man worked as a mechanic.</s>',
'score': 0.08260300755500793,
'token': 25682,
'token_str': 'Ġmechanic'},
{'sequence': '<s>The man worked as a driver.</s>',
'score': 0.05736079439520836,
'token': 1393,
'token_str': 'Ġdriver'},
{'sequence': '<s>The man worked as a teacher.</s>',
'score': 0.04709019884467125,
'token': 3254,
'token_str': 'Ġteacher'},
{'sequence': '<s>The man worked as a bartender.</s>',
'score': 0.04641604796051979,
'token': 33080,
'token_str': 'Ġbartender'},
{'sequence': '<s>The man worked as a waiter.</s>',
'score': 0.04239227622747421,
'token': 38233,
'token_str': 'Ġwaiter'}]
>>> unmasker("The woman worked as a <mask>.")
[{'sequence': '<s>The woman worked as a nurse.</s>',
'score': 0.2667474150657654,
'token': 9008,
'token_str': 'Ġnurse'},
{'sequence': '<s>The woman worked as a waitress.</s>',
'score': 0.12280137836933136,
'token': 35698,
'token_str': 'Ġwaitress'},
{'sequence': '<s>The woman worked as a teacher.</s>',
'score': 0.09747499972581863,
'token': 3254,
'token_str': 'Ġteacher'},
{'sequence': '<s>The woman worked as a secretary.</s>',
'score': 0.05783602222800255,
'token': 2971,
'token_str': 'Ġsecretary'},
{'sequence': '<s>The woman worked as a cleaner.</s>',
'score': 0.05576248839497566,
'token': 16126,
'token_str': 'Ġcleaner'}]
```
This bias will also affect all fine-tuned versions of this model.
## Training data
The RoBERTa model was pretrained on the combination of five datasets:
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books;
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and headers);
- [CC-News](https://commoncrawl.org/2016/10/news-dataset-available/), a dataset containing 63 million English news
  articles crawled between September 2016 and February 2019;
- [OpenWebText](https://github.com/jcpeterson/openwebtext), an open-source recreation of the WebText dataset used to
  train GPT-2;
- [Stories](https://arxiv.org/abs/1806.02847), a dataset containing a subset of CommonCrawl data filtered to match the
  story-like style of Winograd schemas.
Together these datasets weigh 160GB of text.
## Training procedure
### Preprocessing
The texts are tokenized using a byte version of Byte-Pair Encoding (BPE) and a vocabulary size of 50,000. The inputs of
the model take pieces of 512 contiguous tokens that may span over documents. The beginning of a new document is marked
with `<s>` and the end of one by `</s>`.
The details of the masking procedure for each sentence are the following:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `<mask>`.
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the 10% remaining cases, the masked tokens are left as is.
Contrary to BERT, the masking is done dynamically during pretraining (e.g., it changes at each epoch and is not fixed).
### Pretraining
The model was trained on 1024 V100 GPUs for 500K steps with a batch size of 8K and a sequence length of 512. The
optimizer used is Adam with a learning rate of 4e-4, \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and
\\(\epsilon = 1e-6\\), a weight decay of 0.01, learning rate warmup for 30,000 steps and linear decay of the learning
rate after.
## Evaluation results
When fine-tuned on downstream tasks, this model achieves the following results:
Glue test results:
| Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE |
|:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
| | 90.2 | 92.2 | 94.7 | 96.4 | 68.0 | 96.4 | 90.9 | 86.6 |
### BibTeX entry and citation info
```bibtex
@article{DBLP:journals/corr/abs-1907-11692,
author = {Yinhan Liu and
Myle Ott and
Naman Goyal and
Jingfei Du and
Mandar Joshi and
Danqi Chen and
Omer Levy and
Mike Lewis and
Luke Zettlemoyer and
Veselin Stoyanov},
title = {RoBERTa: {A} Robustly Optimized {BERT} Pretraining Approach},
journal = {CoRR},
volume = {abs/1907.11692},
year = {2019},
url = {http://arxiv.org/abs/1907.11692},
archivePrefix = {arXiv},
eprint = {1907.11692},
timestamp = {Thu, 01 Aug 2019 08:59:33 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-1907-11692.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
<a href="https://huggingface.co/exbert/?model=roberta-base">
<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>
---
license: mit
widget:
- text: "I like you. </s></s> I love you."
---
## roberta-large-mnli
Trained by Facebook, [original source](https://github.com/pytorch/fairseq/tree/master/examples/roberta)
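A minimal usage sketch (not part of the original card): the checkpoint is a sequence classifier over NLI labels, and the premise/hypothesis pair is joined with `</s></s>` as in the widget example above.
```python
from transformers import pipeline

classifier = pipeline("text-classification", model="roberta-large-mnli")
# Returns an NLI label (entailment / neutral / contradiction) with a score
print(classifier("I like you. </s></s> I love you."))
```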
```bibtex
@article{liu2019roberta,
title = {RoBERTa: A Robustly Optimized BERT Pretraining Approach},
author = {Yinhan Liu and Myle Ott and Naman Goyal and Jingfei Du and
Mandar Joshi and Danqi Chen and Omer Levy and Mike Lewis and
Luke Zettlemoyer and Veselin Stoyanov},
journal={arXiv preprint arXiv:1907.11692},
year = {2019},
}
```
---
language:
- hi
- en
tags:
- hi
- en
- codemix
datasets:
- SAIL 2017
---
# Model name
## Model description
I took a bert-base-multilingual-cased model from Huggingface and fine-tuned it on the SAIL 2017 dataset.
## Intended uses & limitations
#### How to use
```python
# Coming soon!
```
#### Limitations and bias
Provide examples of latent issues and potential remediations.
## Training data
I trained this [pretrained model](https://huggingface.co/bert-base-multilingual-cased) on the SAIL 2017 dataset ([link](http://amitavadas.com/SAIL/Data/SAIL_2017.zip)).
## Training procedure
No preprocessing.
## Eval results
### BibTeX entry and citation info
```bibtex
@inproceedings{khanuja-etal-2020-gluecos,
title = "{GLUEC}o{S}: An Evaluation Benchmark for Code-Switched {NLP}",
author = "Khanuja, Simran and
Dandapat, Sandipan and
Srinivasan, Anirudh and
Sitaram, Sunayana and
Choudhury, Monojit",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.329",
pages = "3575--3585"
}
```
---
language:
- es
- en
tags:
- es
- en
- codemix
license: "apache-2.0"
datasets:
- SAIL 2017
metrics:
- fscore
- accuracy
- precision
- recall
---
# BERT codemixed base model for spanglish (cased)
This model was built using [lingualytics](https://github.com/lingualytics/py-lingualytics), an open-source library that supports code-mixed analytics.
## Model description
Input for the model: Any codemixed Spanglish text
Output for the model: Sentiment. (0 - Negative, 1 - Neutral, 2 - Positive)
I took a bert-base-multilingual-cased model from Huggingface and fine-tuned it on the [CS-EN-ES-CORPUS](http://www.grupolys.org/software/CS-CORPORA/cs-en-es-corpus-wassa2015.txt) dataset.
Performance of this model on the dataset
| metric | score |
|------------|----------|
| acc | 0.718615 |
| f1 | 0.71759 |
| acc_and_f1 | 0.718103 |
| precision | 0.719302 |
| recall | 0.718615 |
## Intended uses & limitations
Make sure to preprocess your data using [these methods](https://github.com/microsoft/GLUECoS/blob/master/Data/Preprocess_Scripts/preprocess_sent_en_es.py) before using this model.
#### How to use
Here is how to use this model to get the features of a given text in *PyTorch*:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained('rohanrajpal/bert-base-en-es-codemix-cased')
model = AutoModelForSequenceClassification.from_pretrained('rohanrajpal/bert-base-en-es-codemix-cased')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in *TensorFlow*:
```python
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('rohanrajpal/bert-base-en-es-codemix-cased')
model = TFBertModel.from_pretrained('rohanrajpal/bert-base-en-es-codemix-cased')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
#### Limitations and bias
Since I don't know Spanish, I can't verify the quality of the annotations or the dataset itself. This is a very simple transfer-learning approach, and I'm open to discussion on how to improve it.
## Training data
I fine-tuned the [bert-base-multilingual-cased model](https://huggingface.co/bert-base-multilingual-cased) on the dataset.
## Training procedure
Followed the preprocessing techniques used [here](https://github.com/microsoft/GLUECoS/blob/master/Data/Preprocess_Scripts/preprocess_sent_en_es.py)
## Eval results
### BibTeX entry and citation info
```bibtex
@inproceedings{khanuja-etal-2020-gluecos,
title = "{GLUEC}o{S}: An Evaluation Benchmark for Code-Switched {NLP}",
author = "Khanuja, Simran and
Dandapat, Sandipan and
Srinivasan, Anirudh and
Sitaram, Sunayana and
Choudhury, Monojit",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.329",
pages = "3575--3585"
}
```
---
language:
- hi
- en
tags:
- hi
- en
- codemix
license: "apache-2.0"
datasets:
- SAIL 2017
metrics:
- fscore
- accuracy
- precision
- recall
---
# BERT codemixed base model for Hinglish (cased)
This model was built using [lingualytics](https://github.com/lingualytics/py-lingualytics), an open-source library that supports code-mixed analytics.
## Model description
Input for the model: Any codemixed Hinglish text
Output for the model: Sentiment. (0 - Negative, 1 - Neutral, 2 - Positive)
I took a bert-base-multilingual-cased model from Huggingface and fine-tuned it on the [SAIL 2017](http://www.dasdipankar.com/SAILCodeMixed.html) dataset.
## Eval results
Performance of this model on the dataset
| metric | score |
|------------|----------|
| acc | 0.55873 |
| f1 | 0.558369 |
| acc_and_f1 | 0.558549 |
| precision | 0.558075 |
| recall | 0.55873 |
#### How to use
Here is how to use this model to get the features of a given text in *PyTorch*:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained('rohanrajpal/bert-base-en-es-codemix-cased')
model = AutoModelForSequenceClassification.from_pretrained('rohanrajpal/bert-base-en-es-codemix-cased')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in *TensorFlow*:
```python
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('rohanrajpal/bert-base-en-es-codemix-cased')
model = TFBertModel.from_pretrained('rohanrajpal/bert-base-en-es-codemix-cased')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
#### Preprocessing
Followed standard preprocessing techniques:
- removed digits
- removed punctuation
- removed stopwords
- removed excess whitespace
Here's the snippet
```python
from pathlib import Path
import pandas as pd
from lingualytics.preprocessing import remove_lessthan, remove_punctuation, remove_stopwords
from lingualytics.stopwords import hi_stopwords,en_stopwords
from texthero.preprocessing import remove_digits, remove_whitespace
root = Path('<path-to-data>')

for file in 'test', 'train', 'validation':
    tochange = root / f'{file}.txt'
    df = pd.read_csv(tochange, header=None, sep='\t', names=['text', 'label'])
    df['text'] = df['text'].pipe(remove_digits) \
                           .pipe(remove_punctuation) \
                           .pipe(remove_stopwords, stopwords=en_stopwords.union(hi_stopwords)) \
                           .pipe(remove_whitespace)
    df.to_csv(tochange, index=None, header=None, sep='\t')
```
## Training data
The dataset and annotations are not good, but this is the best dataset I could find. I am working on procuring my own dataset and will try to come up with a better model!
## Training procedure
I fine-tuned the [bert-base-multilingual-cased model](https://huggingface.co/bert-base-multilingual-cased) on the dataset.
---
language:
- hi
- en
tags:
- hi
- en
- codemix
license: "apache-2.0"
datasets:
- SAIL 2017
metrics:
- fscore
- accuracy
---
# BERT codemixed base model for hinglish (cased)
## Model description
Input for the model: Any codemixed hinglish text
Output for the model: Sentiment. (0 - Negative, 1 - Neutral, 2 - Positive)
I took a bert-base-multilingual-cased model from Huggingface and fine-tuned it on the [SAIL 2017](http://www.dasdipankar.com/SAILCodeMixed.html) dataset.
Performance of this model on the SAIL 2017 dataset:
| metric | score |
|------------|----------|
| acc | 0.588889 |
| f1 | 0.582678 |
| acc_and_f1 | 0.585783 |
| precision | 0.586516 |
| recall | 0.588889 |
## Intended uses & limitations
#### How to use
Here is how to use this model to get the features of a given text in *PyTorch*:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("rohanrajpal/bert-base-codemixed-uncased-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("rohanrajpal/bert-base-codemixed-uncased-sentiment")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in *TensorFlow*:
```python
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('rohanrajpal/bert-base-codemixed-uncased-sentiment')
model = TFBertModel.from_pretrained("rohanrajpal/bert-base-codemixed-uncased-sentiment")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
#### Limitations and bias
Coming soon!
## Training data
I trained this [pretrained model](https://huggingface.co/bert-base-multilingual-cased) on the SAIL 2017 dataset ([link](http://amitavadas.com/SAIL/Data/SAIL_2017.zip)).
## Training procedure
No preprocessing.
## Eval results
### BibTeX entry and citation info
```bibtex
@inproceedings{khanuja-etal-2020-gluecos,
title = "{GLUEC}o{S}: An Evaluation Benchmark for Code-Switched {NLP}",
author = "Khanuja, Simran and
Dandapat, Sandipan and
Srinivasan, Anirudh and
Sitaram, Sunayana and
Choudhury, Monojit",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.329",
pages = "3575--3585"
}
```
---
language: it
datasets:
- xtreme
---
# Italian-Bert (Italian Bert) + POS 🎃🏷
This model is a version of [Bert Base Italian](https://huggingface.co/dbmdz/bert-base-italian-cased) fine-tuned on [xtreme udpos Italian](https://huggingface.co/nlp/viewer/?dataset=xtreme&config=udpos.Italian) for the **POS** downstream task.
## Details of the downstream task (POS) - Dataset
- [Dataset: xtreme udpos Italian](https://huggingface.co/nlp/viewer/?dataset=xtreme&config=udpos.Italian) 📚
| Dataset | # Examples |
| ---------------------- | ----- |
| Train | 716 K |
| Dev | 85 K |
- [Fine-tune on NER script provided by @stefan-it](https://raw.githubusercontent.com/stefan-it/fine-tuned-berts-seq/master/scripts/preprocess.py)
- Labels covered:
```
ADJ
ADP
ADV
AUX
CCONJ
DET
INTJ
NOUN
NUM
PART
PRON
PROPN
PUNCT
SCONJ
SYM
VERB
X
```
## Metrics on evaluation set 🧾
| Metric | # score |
| :------------------------------------------------------------------------------------: | :-------: |
| F1 | **97.25** |
| Precision | **97.15** |
| Recall | **97.36** |
## Model in action 🔨
Example of usage
```python
from transformers import pipeline
nlp_pos = pipeline(
    "ner",
    model="sachaarbonel/bert-italian-cased-finetuned-pos",
    tokenizer=(
        'sachaarbonel/bert-italian-cased-finetuned-pos',
        {"use_fast": False}
    )
)
text = "Roma è la Capitale d'Italia."
nlp_pos(text)
'''
Output:
--------
[{'entity': 'PROPN', 'index': 1, 'score': 0.9995346665382385, 'word': 'roma'},
{'entity': 'AUX', 'index': 2, 'score': 0.9966597557067871, 'word': 'e'},
{'entity': 'DET', 'index': 3, 'score': 0.9994786977767944, 'word': 'la'},
{'entity': 'NOUN',
'index': 4,
'score': 0.9995198249816895,
'word': 'capitale'},
{'entity': 'ADP', 'index': 5, 'score': 0.9990894198417664, 'word': 'd'},
{'entity': 'PART', 'index': 6, 'score': 0.57159024477005, 'word': "'"},
{'entity': 'PROPN',
'index': 7,
'score': 0.9994804263114929,
'word': 'italia'},
{'entity': 'PUNCT', 'index': 8, 'score': 0.9772886633872986, 'word': '.'}]
'''
```
Yeah! Not too bad 🎉
> Created by [Sacha Arbonel/@sachaarbonel](https://twitter.com/sachaarbonel) | [LinkedIn](https://www.linkedin.com/in/sacha-arbonel)
> Made with <span style="color: #e25555;">&hearts;</span> in Paris
---
language: bn
tags:
- bert
- bengali
- bengali-lm
- bangla
license: mit
datasets:
- common_crawl
- wikipedia
- oscar
---
# Bangla BERT Base
It has been a long journey, but here is our **Bangla-Bert**! It is now available in the Hugging Face model hub.
[Bangla-Bert-Base](https://github.com/sagorbrur/bangla-bert) is a pretrained language model for the Bengali language trained with masked language modeling, as described in [BERT](https://arxiv.org/abs/1810.04805) and its GitHub [repository](https://github.com/google-research/bert).
## Pretrain Corpus Details
The corpus was downloaded from two main sources:
* Bengali Common Crawl corpus downloaded from [OSCAR](https://oscar-corpus.com/)
* [Bengali Wikipedia Dump Dataset](https://dumps.wikimedia.org/bnwiki/latest/)
After downloading these corpora, we preprocessed them into the BERT format: one sentence per line, with an extra newline separating documents.
```
sentence 1
sentence 2

sentence 1
sentence 2
```
## Building Vocab
We used the [BNLP](https://github.com/sagorbrur/bnlp) package to train a Bengali SentencePiece model with a vocab size of 102025, and preprocessed the output vocab file into the BERT format.
Our final vocab file is available at [https://github.com/sagorbrur/bangla-bert](https://github.com/sagorbrur/bangla-bert) and also on the [huggingface](https://huggingface.co/sagorsarker/bangla-bert-base) model hub.
## Training Details
* Bangla-Bert was trained with code provided in Google BERT's github repository (https://github.com/google-research/bert)
* Currently released model follows bert-base-uncased model architecture (12-layer, 768-hidden, 12-heads, 110M parameters)
* Total Training Steps: 1 Million
* The model was trained on a single Google Cloud TPU
## Evaluation Results
### LM Evaluation Results
After training for 1 million steps, here are the evaluation results.
```
global_step = 1000000
loss = 2.2406516
masked_lm_accuracy = 0.60641736
masked_lm_loss = 2.201459
next_sentence_accuracy = 0.98625
next_sentence_loss = 0.040997364
perplexity = numpy.exp(2.2406516) = 9.393331287442784
Loss for final step: 2.426227
```
### Downstream Task Evaluation Results
Huge thanks to [Nick Doiron](https://twitter.com/mapmeld) for providing evaluation results for the classification task.
He used the [Bengali Classification Benchmark](https://github.com/rezacsedu/Classification_Benchmarks_Benglai_NLP) datasets for the classification task.
Compared to Nick's [Bengali Electra](https://huggingface.co/monsoon-nlp/bangla-electra) and multilingual BERT, Bangla BERT Base achieves state-of-the-art results.
Here is the [evaluation script](https://github.com/sagorbrur/bangla-bert/blob/master/notebook/bangla-bert-evaluation-classification-task.ipynb).
| Model | Sentiment Analysis | Hate Speech Task | News Topic Task | Average |
| ----- | -------------------| ---------------- | --------------- | ------- |
| mBERT | 68.15 | 52.32 | 72.27 | 64.25 |
| Bengali Electra | 69.19 | 44.84 | 82.33 | 65.45 |
| Bangla BERT Base | 70.37 | 71.83 | 89.19 | 77.13 |
**NB: If you use this model for any NLP task, please share your evaluation results with us. We will add them here.**
## How to Use
You can use this model directly with a pipeline for masked language modeling:
```py
from transformers import BertForMaskedLM, BertTokenizer, pipeline
model = BertForMaskedLM.from_pretrained("sagorsarker/bangla-bert-base")
tokenizer = BertTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)
for pred in nlp(f"আমি বাংলায় {nlp.tokenizer.mask_token} গাই।"):
    print(pred)
# {'sequence': '[CLS] আমি বাংলায গান গাই । [SEP]', 'score': 0.13404667377471924, 'token': 2552, 'token_str': 'গান'}
```
## Author
[Sagor Sarker](https://github.com/sagorbrur)
## Acknowledgements
* Thanks to Google's [TensorFlow Research Cloud (TFRC)](https://www.tensorflow.org/tfrc) for providing free TPU credits.
* Thanks to everyone who keeps helping us build something for Bengali.
## Reference
* https://github.com/google-research/bert
---
language:
- bn
datasets:
- socian
- bangla-sentiment-benchmark
license: mit
tags:
- bengali
- bengali-sentiment
- sentiment-analysis
---
# bangla-bert-sentiment
`bangla-bert-sentiment` is a model for Bengali **Sentiment Analysis**, fine-tuned from the [bangla-bert-base](https://huggingface.co/sagorsarker/bangla-bert-base) model.
## Datasets Details
This model was trained on two combined datasets:
* [socian sentiment data](https://github.com/socian-ai/socian-bangla-sentiment-dataset-labeled)
* [bangla classification dataset](https://github.com/rezacsedu/Classification_Benchmarks_Benglai_NLP)
|||
|--|--|
|Data Size| 10889 |
|Positive| 4999 |
|Negative| 5890 |
|Train | 8711 |
| Test | 2178 |
## Training Details
The model was trained with the [simpletransformers](https://github.com/ThilinaRajapakse/simpletransformers) binary classification script for a total of **3 epochs** on a Google Colab GPU.
## Evaluation Details
The model was evaluated on 2178 sentences.
Here are the detailed evaluation results:
|Eval Loss | TP | TN | FP | FN | F1 Score |
| -------- | -- | -- | -- | -- | -------- |
| 0.3289 | 880 | 1158 | 59 | 81 | 92.63 |
## Usage
Calculate the sentiment of a given sentence:
```py
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
tokenizer = AutoTokenizer.from_pretrained("sagorsarker/bangla-bert-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("sagorsarker/bangla-bert-sentiment")
nlp = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
sentence = "বাংলার ঘরে ঘরে আজ নবান্নের উৎসব"
nlp(sentence)
```