"tests/gpt2/test_modeling_gpt2.py" did not exist on "afc4ece462ad83a090af620ff4da099a0272e171"
Commit 3552d0e0 authored by Julien Chaumond, committed via GitHub
[model_cards] Migrate cards from this repo to model repos on huggingface.co (#9013)



* rm all model cards

* Update the .rst

@sgugger it is still not super crystal clear/streamlined, so let me know if you have any ideas to make it simpler

* Add a rootlevel README.md with simple instructions/context

* Update docs/source/model_sharing.rst
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* make style

* rm all model cards
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
---
language: de
license: mit
tags:
- sentence_embedding
- search
- pytorch
- xlm-roberta
- roberta
- xlm-r-distilroberta-base-paraphrase-v1
- paraphrase
datasets:
- STSbenchmark
metrics:
- Spearman’s rank correlation
- cosine similarity
---
# German RoBERTa for Sentence Embeddings V2
**The new [T-Systems-onsite/cross-en-de-roberta-sentence-transformer](https://huggingface.co/T-Systems-onsite/cross-en-de-roberta-sentence-transformer) model is slightly better for German. It is also the current best model for English and works cross-lingually. Please consider using that model instead.**
This model is intended to [compute sentence (text) embeddings](https://www.sbert.net/docs/usage/computing_sentence_embeddings.html) for German text. These embeddings can then be compared with [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) to find sentences with a similar semantic meaning. For example, this can be useful for [semantic textual similarity](https://www.sbert.net/docs/usage/semantic_textual_similarity.html), [semantic search](https://www.sbert.net/docs/usage/semantic_search.html), or [paraphrase mining](https://www.sbert.net/docs/usage/paraphrase_mining.html). To do this, you have to use the [Sentence Transformers Python framework](https://github.com/UKPLab/sentence-transformers).
> Sentence-BERT (SBERT) is a modification of the pretrained BERT network that use siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity. This reduces the effort for finding the most similar pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy from BERT.

Source: [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084)
This model was fine-tuned by [Philip May](https://eniak.de/) and open-sourced by [T-Systems-onsite](https://www.t-systems-onsite.de/). Special thanks to [Nils Reimers](https://www.nils-reimers.de/) for his awesome open-source work, the Sentence Transformers library, the models, and his help on GitHub.
## How to use
**The usage description above, provided by Hugging Face, is wrong for sentence embeddings! Please use the following instead:**
To use this model, install the `sentence-transformers` package (see here: <https://github.com/UKPLab/sentence-transformers>).
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('T-Systems-onsite/german-roberta-sentence-transformer-v2')
```
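As a quick illustration (a hedged sketch, not part of the original card; the example sentences are made up), you can compute embeddings for a few German sentences and compare them with cosine similarity:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('T-Systems-onsite/german-roberta-sentence-transformer-v2')

# Example sentences (made up for illustration)
sentences = [
    "Das ist ein Beispielsatz.",
    "Dies ist ein sehr ähnlicher Beispielsatz.",
]

# One embedding vector per sentence
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the two sentences
score = util.pytorch_cos_sim(embeddings[0], embeddings[1])
print(score.item())
```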
For details of usage and examples see here:
- [Computing Sentence Embeddings](https://www.sbert.net/docs/usage/computing_sentence_embeddings.html)
- [Semantic Textual Similarity](https://www.sbert.net/docs/usage/semantic_textual_similarity.html)
- [Paraphrase Mining](https://www.sbert.net/docs/usage/paraphrase_mining.html)
- [Semantic Search](https://www.sbert.net/docs/usage/semantic_search.html)
- [Cross-Encoders](https://www.sbert.net/docs/usage/cross-encoder.html)
- [Examples on GitHub](https://github.com/UKPLab/sentence-transformers/tree/master/examples)
## Training
The base model is [xlm-roberta-base](https://huggingface.co/xlm-roberta-base). This model has been further trained by [Nils Reimers](https://www.nils-reimers.de/) on a large-scale paraphrase dataset for 50+ languages. [Nils Reimers](https://www.nils-reimers.de/) wrote about this [on GitHub](https://github.com/UKPLab/sentence-transformers/issues/509#issuecomment-712243280):
>A paper is upcoming for the paraphrase models.
>
>These models were trained on various datasets with Millions of examples for paraphrases, mainly derived from Wikipedia edit logs, paraphrases mined from Wikipedia and SimpleWiki, paraphrases from news reports, AllNLI-entailment pairs with in-batch-negative loss etc.
>
>In internal tests, they perform much better than the NLI+STSb models as they have see more and broader type of training data. NLI+STSb has the issue that they are rather narrow in their domain and do not contain any domain specific words / sentences (like from chemistry, computer science, math etc.). The paraphrase models has seen plenty of sentences from various domains.
>
>More details with the setup, all the datasets, and a wider evaluation will follow soon.
The resulting model, called `xlm-r-distilroberta-base-paraphrase-v1`, has been released here: <https://github.com/UKPLab/sentence-transformers/releases/tag/v0.3.8>
Building on this cross-lingual model, we fine-tuned it for German on the [deepl.com](https://www.deepl.com/translator) dataset of our [German STSbenchmark dataset](https://github.com/t-systems-on-site-services-gmbh/german-STSbenchmark).
We ran an automatic hyperparameter search with 102 trials using [Optuna](https://github.com/optuna/optuna). Using 10-fold cross-validation on the deepl.com test and dev datasets, we found the following best hyperparameters (an illustrative search sketch is shown below the list):
- batch_size = 15
- num_epochs = 4
- lr = 2.2995320905210864e-05
- eps = 1.8979875906303792e-06
- weight_decay = 0.003314045812507563
- warmup_steps_proportion = 0.46141685205829014
The final model was trained with these hyperparameters on the combination of `sts_de_train.csv` and `sts_de_dev.csv`. The `sts_de_test.csv` was left for testing.
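For illustration only, here is a minimal sketch of how such an Optuna search could be wired up with the Sentence Transformers framework. This is not the authors' actual training code: the CSV loader, its column names, the search space, and the single-split evaluation (instead of 10-fold cross-validation) are all assumptions.
```python
import csv

import optuna
from scipy.stats import spearmanr
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, util


def load_sts_de(path):
    """Hypothetical loader for the German STS CSVs (column names are assumptions)."""
    rows = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            rows.append((row["sentence1"], row["sentence2"], float(row["score"]) / 5.0))
    return rows


train_samples = [InputExample(texts=[s1, s2], label=score)
                 for s1, s2, score in load_sts_de("sts_de_train.csv")]
dev_rows = load_sts_de("sts_de_dev.csv")
dev_s1 = [r[0] for r in dev_rows]
dev_s2 = [r[1] for r in dev_rows]
dev_scores = [r[2] for r in dev_rows]


def objective(trial):
    # Assumed search space; the card only reports the best values found
    batch_size = trial.suggest_int("batch_size", 8, 32)
    num_epochs = trial.suggest_int("num_epochs", 1, 6)
    lr = trial.suggest_float("lr", 1e-6, 1e-4, log=True)
    eps = trial.suggest_float("eps", 1e-7, 1e-5, log=True)
    weight_decay = trial.suggest_float("weight_decay", 1e-4, 1e-2, log=True)
    warmup_proportion = trial.suggest_float("warmup_steps_proportion", 0.0, 0.5)

    model = SentenceTransformer("xlm-r-distilroberta-base-paraphrase-v1")
    train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=batch_size)
    train_loss = losses.CosineSimilarityLoss(model)
    warmup_steps = int(len(train_dataloader) * num_epochs * warmup_proportion)

    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=num_epochs,
        warmup_steps=warmup_steps,
        optimizer_params={"lr": lr, "eps": eps},
        weight_decay=weight_decay,
    )

    # Spearman rank correlation between cosine similarity and gold scores on the dev split
    emb1 = model.encode(dev_s1, convert_to_tensor=True)
    emb2 = model.encode(dev_s2, convert_to_tensor=True)
    cos_scores = util.pytorch_cos_sim(emb1, emb2).diagonal().cpu().numpy()
    return spearmanr(cos_scores, dev_scores).correlation


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=102)
print(study.best_params)
```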
## Evaluation
The evaluation has been done on the test set of our [German STSbenchmark dataset](https://github.com/t-systems-on-site-services-gmbh/german-STSbenchmark). The code is available on [Colab](https://colab.research.google.com/drive/1aCWOqDQx953kEnQ5k4Qn7uiixokocOHv?usp=sharing). As the evaluation metric, we use Spearman's rank correlation between the cosine similarity of the sentence embeddings and the STSbenchmark labels.
| Model Name | Spearman rank correlation<br/>(German) |
|--------------------------------------|-------------------------------------|
| xlm-r-distilroberta-base-paraphrase-v1 | 0.8079 |
| xlm-r-100langs-bert-base-nli-stsb-mean-tokens | 0.8194 |
| xlm-r-bert-base-nli-stsb-mean-tokens | 0.8194 |
| **T-Systems-onsite/<br/>german-roberta-sentence-transformer-v2** | **0.8529** |
| **[T-Systems-onsite/<br/>cross-en-de-roberta-sentence-transformer](https://huggingface.co/T-Systems-onsite/cross-en-de-roberta-sentence-transformer)** | **0.8550** |
---
language: uk
---
Note: **the default code snippet above won't work** because we use `AlbertTokenizer` with `GPT2LMHeadModel`; see this [issue](https://github.com/huggingface/transformers/issues/4285).
## GPT2 124M Trained on Ukrainian Fiction
### Training details
The model was trained on a corpus of 4,040 fiction books, 2.77 GiB in total.
Evaluation on [brown-uk](https://github.com/brown-uk/corpus) gives a perplexity of 50.16.
### Example usage:
```python
from transformers import AlbertTokenizer, GPT2LMHeadModel

tokenizer = AlbertTokenizer.from_pretrained("Tereveni-AI/gpt2-124M-uk-fiction")
model = GPT2LMHeadModel.from_pretrained("Tereveni-AI/gpt2-124M-uk-fiction")

input_ids = tokenizer.encode("Но зла Юнона, суча дочка,", add_special_tokens=False, return_tensors='pt')

outputs = model.generate(
    input_ids,
    do_sample=True,
    num_return_sequences=3,
    max_length=50
)

for i, out in enumerate(outputs):
    print("{}: {}".format(i, tokenizer.decode(out)))
```
Prints something like this:
```bash
0: Но зла Юнона, суча дочка, яка затьмарила всі її таємниці: І хто з'їсть її душу, той помре». І, не дочекавшись гніву богів, посунула в пітьму, щоб не бачити перед собою. Але, за
1: Но зла Юнона, суча дочка, і довела мене до божевілля. Але він не знав нічого. Після того як я його побачив, мені стало зле. Я втратив рівновагу. Але в мене не було часу на роздуми. Я вже втратив надію
2: Но зла Юнона, суча дочка, не нарікала нам! — раптом вигукнула Юнона. — Це ти, старий йолопе! — мовила вона, не перестаючи сміятись. — Хіба ти не знаєш, що мені подобається ходити з тобою?
```
---
language: fi
---
## Quickstart
**Release 1.0** (November 25, 2019)
Download the models here:
* Cased Finnish BERT Base: [bert-base-finnish-cased-v1.zip](http://dl.turkunlp.org/finbert/bert-base-finnish-cased-v1.zip)
* Uncased Finnish BERT Base: [bert-base-finnish-uncased-v1.zip](http://dl.turkunlp.org/finbert/bert-base-finnish-uncased-v1.zip)
We generally recommend the use of the cased model.
Paper presenting Finnish BERT: [arXiv:1912.07076](https://arxiv.org/abs/1912.07076)
## What's this?
A version of Google's [BERT](https://github.com/google-research/bert) deep transfer learning model for Finnish. The model can be fine-tuned to achieve state-of-the-art results for various Finnish natural language processing tasks.
FinBERT features a custom 50,000 wordpiece vocabulary that has much better coverage of Finnish words than e.g. the previously released [multilingual BERT](https://github.com/google-research/bert/blob/master/multilingual.md) models from Google:
| Vocabulary | Example |
|------------|---------|
| FinBERT | Suomessa vaihtuu kesän aikana sekä pääministeri että valtiovarain ##ministeri . |
| Multilingual BERT | Suomessa vai ##htuu kes ##än aikana sekä p ##ää ##minister ##i että valt ##io ##vara ##in ##minister ##i . |
FinBERT has been pre-trained for 1 million steps on over 3 billion tokens (24B characters) of Finnish text drawn from news, online discussion, and internet crawls. By contrast, Multilingual BERT was trained on Wikipedia texts, where the Finnish Wikipedia text is approximately 3% of the amount used to train FinBERT.
These features allow FinBERT to outperform not only Multilingual BERT but also all previously proposed models when fine-tuned for Finnish natural language processing tasks.
## Results
### Document classification
![learning curves for Yle and Ylilauta document classification](https://raw.githubusercontent.com/TurkuNLP/FinBERT/master/img/yle-ylilauta-curves.png)
FinBERT outperforms multilingual BERT (M-BERT) on document classification over a range of training set sizes on the Yle news (left) and Ylilauta online discussion (right) corpora. (Baseline classification performance with [FastText](https://fasttext.cc/) included for reference.)
[[code](https://github.com/spyysalo/finbert-text-classification)][[Yle data](https://github.com/spyysalo/yle-corpus)] [[Ylilauta data](https://github.com/spyysalo/ylilauta-corpus)]
### Named Entity Recognition
Evaluation on the FiNER corpus ([Ruokolainen et al. 2019](https://arxiv.org/abs/1908.04212))
| Model | Accuracy |
|--------------------|----------|
| **FinBERT** | **92.40%** |
| Multilingual BERT | 90.29% |
| [FiNER-tagger](https://github.com/Traubert/FiNer-rules) (rule-based) | 86.82% |
(FiNER tagger results from [Ruokolainen et al. 2019](https://arxiv.org/pdf/1908.04212.pdf))
[[code](https://github.com/jouniluoma/keras-bert-ner)][[data](https://github.com/mpsilfve/finer-data)]
### Part of speech tagging
Evaluation on three Finnish corpora annotated with [Universal Dependencies](https://universaldependencies.org/) part-of-speech tags: the Turku Dependency Treebank (TDT), FinnTreeBank (FTB), and Parallel UD treebank (PUD)
| Model | TDT | FTB | PUD |
|-------------------|-------------|-------------|-------------|
| **FinBERT** | **98.23%** | **98.39%** | **98.08%** |
| Multilingual BERT | 96.97% | 95.87% | 97.58% |
[[code](https://github.com/spyysalo/bert-pos)][[data](http://hdl.handle.net/11234/1-2837)]
## Use with PyTorch
If you want to use the model with the huggingface/transformers library, follow the steps in [huggingface_transformers.md](https://github.com/TurkuNLP/FinBERT/blob/master/huggingface_transformers.md)
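As a hedged illustration (not part of the original instructions), loading the cased model through transformers might look like this, assuming the weights are published under the `TurkuNLP/bert-base-finnish-cased-v1` model id:
```python
from transformers import AutoTokenizer, AutoModel

# Assumed Hugging Face model id for the cased FinBERT release
tokenizer = AutoTokenizer.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")
model = AutoModel.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")

# The custom Finnish vocabulary keeps common words largely intact (compare the table above)
print(tokenizer.tokenize("Suomessa vaihtuu kesän aikana sekä pääministeri että valtiovarainministeri."))

# Encode a sentence and inspect the contextual token embeddings
inputs = tokenizer("Tämä on esimerkkilause.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden size)
```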
## Previous releases
### Release 0.2
**October 24, 2019** Beta version of the BERT base uncased model trained from scratch on a corpus of Finnish news, online discussions, and crawled data.
Download the model here: [bert-base-finnish-uncased.zip](http://dl.turkunlp.org/finbert/bert-base-finnish-uncased.zip)
### Release 0.1
**September 30, 2019** We release a beta version of the BERT base cased model trained from scratch on a corpus of Finnish news, online discussions, and crawled data.
Download the model here: [bert-base-finnish-cased.zip](http://dl.turkunlp.org/finbert/bert-base-finnish-cased.zip)
---
language: fr
widget:
- text: "Je m'appelle Hicham et je vis a Fès"
---
# MagBERT-NER: a state-of-the-art NER model for Moroccan French language (Maghreb)
## Introduction
MagBERT-NER is a state-of-the-art NER model for Moroccan French (Maghreb). It was fine-tuned for the NER task on top of CamemBERT, a French language model based on the RoBERTa architecture.
For further information or requests, please visit our website at [typica.ai Website](https://typica.ai/) or send us an email at contactus@typica.ai
## How to use MagBERT-NER with HuggingFace
##### Load MagBERT-NER and its sub-word tokenizer:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("TypicaAI/magbert-ner")
model = AutoModelForTokenClassification.from_pretrained("TypicaAI/magbert-ner")
# Process a text sample (from Wikipedia about the current Prime Minister of Morocco) using the NER pipeline
from transformers import pipeline
nlp = pipeline('ner', model=model, tokenizer=tokenizer, grouped_entities=True)
nlp("Saad Dine El Otmani, né le 16 janvier 1956 à Inezgane, est un homme d'État marocain, chef du gouvernement du Maroc depuis le 5 avril 2017")
#[{'entity_group': 'I-PERSON',
# 'score': 0.8941445276141167,
# 'word': 'Saad Dine El Otmani'},
# {'entity_group': 'B-DATE',
# 'score': 0.5967703461647034,
# 'word': '16 janvier 1956'},
# {'entity_group': 'B-GPE', 'score': 0.7160899192094803, 'word': 'Inezgane'},
# {'entity_group': 'B-NORP', 'score': 0.7971733212471008, 'word': 'marocain'},
# {'entity_group': 'B-GPE', 'score': 0.8921478390693665, 'word': 'Maroc'},
# {'entity_group': 'B-DATE',
# 'score': 0.5760444005330404,
# 'word': '5 avril 2017'}]
```
## Authors
The MagBERT-NER model was trained by Hicham Assoudi, Ph.D.
For any questions or comments, you can contact me at assoudi@typica.ai
## Citation
If you use our work, please cite:
Hicham Assoudi, Ph.D., MagBERT-NER: a state-of-the-art NER model for Moroccan French language (Maghreb), (2020)
---
language: "en"
tags:
- paraphrase-generation
- text-generation
- Conditional Generation
inference: false
---
# Paraphrase-Generation
## Model description
A T5 model for generating paraphrases of English sentences. Trained on the [Google PAWS](https://github.com/google-research-datasets/paws) dataset.
## How to use
PyTorch and TensorFlow models are available.
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("Vamsi/T5_Paraphrase_Paws")
model = AutoModelForSeq2SeqLM.from_pretrained("Vamsi/T5_Paraphrase_Paws").to(device)

sentence = "This is something which i cannot understand at all"
text = "paraphrase: " + sentence + " </s>"

encoding = tokenizer.encode_plus(text, pad_to_max_length=True, return_tensors="pt")
input_ids = encoding["input_ids"].to(device)
attention_masks = encoding["attention_mask"].to(device)

# Generate several candidate paraphrases with sampling
outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    max_length=256,
    do_sample=True,
    top_k=120,
    top_p=0.95,
    early_stopping=True,
    num_return_sequences=5
)

for output in outputs:
    line = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(line)
```
For more reference on training your own T5 model or using this model, do check out [Paraphrase Generation](https://github.com/Vamsi995/Paraphrase-Generator).
---
language: en
datasets:
- yelp_polarity
---
# RoBERTa-base-finetuned-yelp-polarity
This is a [RoBERTa-base](https://huggingface.co/roberta-base) checkpoint fine-tuned on binary sentiment classification from [Yelp polarity](https://huggingface.co/nlp/viewer/?dataset=yelp_polarity).
It gets **98.08%** accuracy on the test set.
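For illustration, here is a minimal hedged sketch of querying the checkpoint through the text-classification pipeline; the `VictorSanh/roberta-base-finetuned-yelp-polarity` model id and the label names are assumptions, not stated in this card:
```python
from transformers import pipeline

# Assumed repository id; replace with the actual model id if it differs
classifier = pipeline(
    "sentiment-analysis",
    model="VictorSanh/roberta-base-finetuned-yelp-polarity",
)

print(classifier("The food was great and the staff were super friendly!"))
# e.g. [{'label': 'LABEL_1', 'score': 0.99...}] where LABEL_1 would be the positive class
```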
## Hyper-parameters
We used the following hyper-parameters to train the model on one GPU:
```python
num_train_epochs = 2.0
learning_rate = 1e-05
weight_decay = 0.0
adam_epsilon = 1e-08
max_grad_norm = 1.0
per_device_train_batch_size = 32
gradient_accumulation_steps = 1
warmup_steps = 3500
seed = 42
```
---
language: no
thumbnail: https://i.imgur.com/QqSEC5I.png
---
# Norwegian Electra
![Image of norwegian electra](https://i.imgur.com/QqSEC5I.png)
Trained on OSCAR + Wikipedia + OpenSubtitles + some other data I had, with the awesome power of TPUs (v3-8).
Use with caution. I have no downstream tasks in Norwegian to test on, so I have no idea of its performance yet.
# Model
## Electra: Pre-training Text Encoders as Discriminators Rather Than Generators
Kevin Clark and Minh-Thang Luong and Quoc V. Le and Christopher D. Manning
- https://openreview.net/pdf?id=r1xMH1BtvB
- https://github.com/google-research/electra
# Acknowledgments
### TensorFlow Research Cloud
Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC). Thanks for providing access to the TFRC ❤️
- https://www.tensorflow.org/tfrc
#### OSCAR corpus
- https://oscar-corpus.com/
#### OPUS
- http://opus.nlpl.eu/
- http://www.opensubtitles.org/
---
datasets:
- squad_v2
---
# BART-LARGE finetuned on SQuADv2
This is the bart-large model fine-tuned on the SQuADv2 dataset for the question answering task.
## Model details
BART was proposed in the [paper](https://arxiv.org/abs/1910.13461) **BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension**.
BART is a seq2seq model intended for both NLG and NLU tasks.
To use BART for question answering tasks, we feed the complete document into the encoder and decoder, and use the top hidden state of the decoder as a representation for each word. This representation is used to classify the token. As reported in the paper, bart-large achieves results comparable to RoBERTa on SQuAD.
Another notable thing about BART is that it can handle sequences of up to 1024 tokens.
| Param | #Value |
|---------------------|--------|
| encoder layers | 12 |
| decoder layers | 12 |
| hidden size | 4096 |
| num attention heads | 16 |
| on disk size | 1.63GB |
## Model training
This model was trained with the following parameters using the simpletransformers wrapper:
```
train_args = {
'learning_rate': 1e-5,
'max_seq_length': 512,
'doc_stride': 512,
'overwrite_output_dir': True,
'reprocess_input_data': False,
'train_batch_size': 8,
'num_train_epochs': 2,
'gradient_accumulation_steps': 2,
'no_cache': True,
'use_cached_eval_features': False,
'save_model_every_epoch': False,
'output_dir': "bart-squadv2",
'eval_batch_size': 32,
'fp16_opt_level': 'O2',
}
```
[You can even train your own model using this colab notebook](https://colab.research.google.com/drive/1I5cK1M_0dLaf5xoewh6swcm5nAInfwHy?usp=sharing)
## Results
```{"correct": 6832, "similar": 4409, "incorrect": 632, "eval_loss": -14.950117511952177}```
## Model in Action 🚀
```python3
from transformers import BartTokenizer, BartForQuestionAnswering
import torch
tokenizer = BartTokenizer.from_pretrained('a-ware/bart-squadv2')
model = BartForQuestionAnswering.from_pretrained('a-ware/bart-squadv2')
question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
encoding = tokenizer(question, text, return_tensors='pt')
input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']
start_scores, end_scores = model(input_ids, attention_mask=attention_mask, output_attentions=False)[:2]
all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])
answer = tokenizer.convert_tokens_to_ids(answer.split())
answer = tokenizer.decode(answer)
#answer => 'a nice puppet'
```
> Created with ❤️ by A-ware UG [![Github icon](https://cdn0.iconfinder.com/data/icons/octicons/1024/mark-github-32.png)](https://github.com/aware-ai)
---
datasets:
- squad_v2
---
# Roberta-LARGE finetuned on SQuADv2
This is the roberta-large model fine-tuned on the SQuADv2 dataset for question-answering answerability classification.
## Model details
This model is simply a sequence classification model with two inputs (context and question) given as a list.
The result is either [1] if the question is answerable or [0] if it is not.
It was trained for 4 epochs on the SQuADv2 dataset and can be used to filter out which contexts are good to feed into the QA model, in order to avoid bad answers.
## Model training
This model was trained with the following parameters using the simpletransformers wrapper:
```
train_args = {
'learning_rate': 1e-5,
'max_seq_length': 512,
'overwrite_output_dir': True,
'reprocess_input_data': False,
'train_batch_size': 4,
'num_train_epochs': 4,
'gradient_accumulation_steps': 2,
'no_cache': True,
'use_cached_eval_features': False,
'save_model_every_epoch': False,
'output_dir': "bart-squadv2",
'eval_batch_size': 8,
'fp16_opt_level': 'O2',
}
```
## Results
```{"accuracy": 90.48%}```
## Model in Action 🚀
```python3
from simpletransformers.classification import ClassificationModel
model = ClassificationModel('roberta', 'a-ware/roberta-large-squadv2', num_labels=2, args=train_args)
predictions, raw_outputs = model.predict([["my dog is an year old. he loves to go into the rain", "how old is my dog ?"]])
print(predictions)
# ==> [1]
```
> Created with ❤️ by A-ware UG [![Github icon](https://cdn0.iconfinder.com/data/icons/octicons/1024/mark-github-32.png)](https://github.com/aware-ai)
---
datasets:
- squad_v2
---
# XLM-ROBERTA-LARGE finetuned on SQuADv2
This is the xlm-roberta-large model fine-tuned on the SQuADv2 dataset for the question answering task.
## Model details
XLM-RoBERTa was proposed in the [paper](https://arxiv.org/pdf/1911.02116.pdf) **XLM-R: State-of-the-art cross-lingual understanding through self-supervision**.
## Model training
This model was trained with the following parameters using the simpletransformers wrapper:
```
train_args = {
'learning_rate': 1e-5,
'max_seq_length': 512,
'doc_stride': 512,
'overwrite_output_dir': True,
'reprocess_input_data': False,
'train_batch_size': 8,
'num_train_epochs': 2,
'gradient_accumulation_steps': 2,
'no_cache': True,
'use_cached_eval_features': False,
'save_model_every_epoch': False,
'output_dir': "bart-squadv2",
'eval_batch_size': 32,
'fp16_opt_level': 'O2',
}
```
## Results
```{"correct": 6961, "similar": 4359, "incorrect": 553, "eval_loss": -12.177856394381962}```
## Model in Action 🚀
```python3
from transformers import XLMRobertaTokenizer, XLMRobertaForQuestionAnswering
import torch
tokenizer = XLMRobertaTokenizer.from_pretrained('a-ware/xlmroberta-squadv2')
model = XLMRobertaForQuestionAnswering.from_pretrained('a-ware/xlmroberta-squadv2')
question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
encoding = tokenizer(question, text, return_tensors='pt')
input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']
start_scores, end_scores = model(input_ids, attention_mask=attention_mask, output_attentions=False)[:2]
all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])
answer = tokenizer.convert_tokens_to_ids(answer.split())
answer = tokenizer.decode(answer)
#answer => 'a nice puppet'
```
> Created with ❤️ by A-ware UG [![Github icon](https://cdn0.iconfinder.com/data/icons/octicons/1024/mark-github-32.png)](https://github.com/aware-ai)
---
tags:
- finance
---
# Roberta Masked Language Model Trained On Financial Phrasebank Corpus
This is a Masked Language Model trained with [Roberta](https://huggingface.co/transformers/model_doc/roberta.html) on a Financial Phrasebank Corpus.
The model is built using Huggingface transformers.
The model can be found at :[Financial_Roberta](https://huggingface.co/abhilash1910/financial_roberta)
## Specifications
The corpus for training is taken from the [Financial Phrasebank (Malo et al.)](https://www.researchgate.net/publication/251231107_Good_Debt_or_Bad_Debt_Detecting_Semantic_Orientations_in_Economic_Texts).
## Model Specification
The model chosen for training is [Roberta](https://arxiv.org/abs/1907.11692) with the following specifications:
1. vocab_size=56000
2. max_position_embeddings=514
3. num_attention_heads=12
4. num_hidden_layers=6
5. type_vocab_size=1
This is trained using `RobertaConfig` from the transformers package.
The model is trained for 10 epochs with a GPU batch size of 64.
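For illustration only, here is a minimal sketch of how such a configuration could be instantiated with the transformers package; this is an assumption about the setup, not the author's actual training script:
```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Configuration values as listed above
config = RobertaConfig(
    vocab_size=56000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

# A randomly initialized masked-language model with this configuration
model = RobertaForMaskedLM(config=config)
print(model.num_parameters())
```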
## Usage Specifications
To use this model, first import the `AutoTokenizer` and `AutoModelWithLMHead` modules from transformers.
Then specify the pre-trained model, which in this case is 'abhilash1910/financial_roberta', for both the tokenizer and the model.
```python
from transformers import AutoTokenizer, AutoModelWithLMHead
tokenizer = AutoTokenizer.from_pretrained("abhilash1910/financial_roberta")
model = AutoModelWithLMHead.from_pretrained("abhilash1910/financial_roberta")
```
After this the model will be downloaded; it will take some time to download all the model files.
To test the model, import the `pipeline` module from transformers and create a fill-mask pipeline for inference as follows:
```python
from transformers import pipeline
model_mask = pipeline('fill-mask', model='abhilash1910/financial_roberta')
model_mask("The company had a <mask> of 20% in 2020.")
```
Some examples with generic financial statements are also provided below:
Example 1:
```python
model_mask("The company had a <mask> of 20% in 2020.")
```
Output:
```bash
[{'sequence': '<s>The company had a profit of 20% in 2020.</s>',
'score': 0.023112965747714043,
'token': 421,
'token_str': 'Ġprofit'},
{'sequence': '<s>The company had a loss of 20% in 2020.</s>',
'score': 0.021379893645644188,
'token': 616,
'token_str': 'Ġloss'},
{'sequence': '<s>The company had a year of 20% in 2020.</s>',
'score': 0.0185744296759367,
'token': 443,
'token_str': 'Ġyear'},
{'sequence': '<s>The company had a sales of 20% in 2020.</s>',
'score': 0.018143286928534508,
'token': 428,
'token_str': 'Ġsales'},
{'sequence': '<s>The company had a value of 20% in 2020.</s>',
'score': 0.015319528989493847,
'token': 776,
'token_str': 'Ġvalue'}]
```
Example 2:
```python
model_mask("The <mask> is listed under NYSE")
```
Output:
```bash
[{'sequence': '<s>The company is listed under NYSE</s>',
'score': 0.1566661298274994,
'token': 359,
'token_str': 'Ġcompany'},
{'sequence': '<s>The total is listed under NYSE</s>',
'score': 0.05542507395148277,
'token': 522,
'token_str': 'Ġtotal'},
{'sequence': '<s>The value is listed under NYSE</s>',
'score': 0.04729423299431801,
'token': 776,
'token_str': 'Ġvalue'},
{'sequence': '<s>The order is listed under NYSE</s>',
'score': 0.02533523552119732,
'token': 798,
'token_str': 'Ġorder'},
{'sequence': '<s>The contract is listed under NYSE</s>',
'score': 0.02087237872183323,
'token': 635,
'token_str': 'Ġcontract'}]
```
## Resources
For all resources, please look at the [HuggingFace](https://huggingface.co/) site and the [repositories](https://github.com/huggingface).
# Roberta Trained Model For Masked Language Model On French Corpus :robot:
This is a Masked Language Model trained with [Roberta](https://huggingface.co/transformers/model_doc/roberta.html) on a small French news corpus (Leipzig Corpora).
The model is built using Huggingface transformers.
The model can be found at :[French-Roberta](https://huggingface.co/abhilash1910/french-roberta)
## Specifications
The corpus for training is taken from the Leipzig Corpora (French news), and the model is trained on a small subset of the corpus (300K).
## Model Specification
The model chosen for training is [Roberta](https://arxiv.org/abs/1907.11692) with the following specifications:
1. vocab_size=32000
2. max_position_embeddings=514
3. num_attention_heads=12
4. num_hidden_layers=6
5. type_vocab_size=1
This is trained using `RobertaConfig` from the transformers package. The total number of training parameters is 68,124,416.
The model is trained for 100 epochs with a GPU batch size of 64.
More details for building custom models can be found at the [HuggingFace Blog](https://huggingface.co/blog/how-to-train)
## Usage Specifications
To use this model, first import the `AutoTokenizer` and `AutoModelWithLMHead` modules from transformers.
Then specify the pre-trained model, which in this case is 'abhilash1910/french-roberta', for both the tokenizer and the model.
```python
from transformers import AutoTokenizer, AutoModelWithLMHead
tokenizer = AutoTokenizer.from_pretrained("abhilash1910/french-roberta")
model = AutoModelWithLMHead.from_pretrained("abhilash1910/french-roberta")
```
After this the model will be downloaded; it will take some time to download all the model files.
To test the model, import the `pipeline` module from transformers and create a fill-mask pipeline for inference as follows:
```python
from transformers import pipeline
model_mask = pipeline('fill-mask', model='abhilash1910/french-roberta')
model_mask("Le tweet <mask>.")
```
Some examples with generic French sentences are also provided below:
Example 1:
```python
model_mask("À ce jour, <mask> projet a entraîné")
```
Output:
```bash
[{'sequence': '<s>À ce jour, belles projet a entraîné</s>',
'score': 0.18685665726661682,
'token': 6504,
'token_str': 'Ġbelles'},
{'sequence': '<s>À ce jour,- projet a entraîné</s>',
'score': 0.0005200508167035878,
'token': 17,
'token_str': '-'},
{'sequence': '<s>À ce jour, de projet a entraîné</s>',
'score': 0.00045729897101409733,
'token': 268,
'token_str': 'Ġde'},
{'sequence': '<s>À ce jour, du projet a entraîné</s>',
'score': 0.0004307595663703978,
'token': 326,
'token_str': 'Ġdu'},
{'sequence': '<s>À ce jour," projet a entraîné</s>',
'score': 0.0004219160182401538,
'token': 6,
'token_str': '"'}]
```
Example 2:
```python
model_mask("C'est un <mask>")
```
Output:
```bash
[{'sequence': "<s>C'est un belles</s>",
'score': 0.16440927982330322,
'token': 6504,
'token_str': 'Ġbelles'},
{'sequence': "<s>C'est un de</s>",
'score': 0.0005495127406902611,
'token': 268,
'token_str': 'Ġde'},
{'sequence': "<s>C'est un du</s>",
'score': 0.00044988933950662613,
'token': 326,
'token_str': 'Ġdu'},
{'sequence': "<s>C'est un-</s>",
'score': 0.00044542422983795404,
'token': 17,
'token_str': '-'},
{'sequence': "<s>C'est un\t</s>",
'score': 0.00037563967634923756,
'token': 202,
'token_str': 'ĉ'}]
```
## Resources
For all resources, please look at the [HuggingFace](https://huggingface.co/) site and the [repositories](https://github.com/huggingface).
# ReviewBERT
BERT (post-)trained on a review corpus to understand sentiment, opinions and various e-commerce aspects.
`BERT-DK_laptop` is trained on a 100MB laptop corpus under `Electronics/Computers & Accessories/Laptops`.
## Model Description
The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.
Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).
`BERT-DK_laptop` is trained on a 100MB laptop corpus under `Electronics/Computers & Accessories/Laptops`.
## Instructions
Loading the post-trained weights is as simple as, e.g.,
```python
import torch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-DK_laptop")
model = AutoModel.from_pretrained("activebus/BERT-DK_laptop")
```
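As a hedged example (not from the original card), the loaded encoder can then be used to produce contextual representations of a review sentence, e.g. as features for downstream fine-tuning:
```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-DK_laptop")
model = AutoModel.from_pretrained("activebus/BERT-DK_laptop")

# Encode a laptop review (made-up example) and inspect the contextual token embeddings
inputs = tokenizer("The battery life of this laptop is amazing.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden size)
```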
## Evaluation Results
Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf)
## Citation
If you find this work useful, please cite as follows.
```
@inproceedings{xu_bert2019,
title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
month = "jun",
year = "2019",
}
```
# ReviewBERT
BERT (post-)trained on a review corpus to understand sentiment, opinions and various e-commerce aspects.
`BERT-DK_rest` is trained on 1GB of Yelp reviews covering 19 types of restaurants.
## Model Description
The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.
Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).
## Instructions
Loading the post-trained weights is as simple as, e.g.,
```python
import torch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-DK_rest")
model = AutoModel.from_pretrained("activebus/BERT-DK_rest")
```
## Evaluation Results
Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf)
## Citation
If you find this work useful, please cite as follows.
```
@inproceedings{xu_bert2019,
title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
month = "jun",
year = "2019",
}
```
# ReviewBERT
BERT (post-)trained on a review corpus to understand sentiment, opinions and various e-commerce aspects.
`BERT-DK_laptop` is trained on a 100MB laptop corpus under `Electronics/Computers & Accessories/Laptops`.
`BERT-PT_*` additionally uses SQuAD 1.1.
## Model Description
The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.
Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).
## Instructions
Loading the post-trained weights is as simple as, e.g.,
```python
import torch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-PT_laptop")
model = AutoModel.from_pretrained("activebus/BERT-PT_laptop")
```
## Evaluation Results
Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf)
## Citation
If you find this work useful, please cite as follows.
```
@inproceedings{xu_bert2019,
title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
month = "jun",
year = "2019",
}
```
# ReviewBERT
BERT (post-)trained on a review corpus to understand sentiment, opinions and various e-commerce aspects.
`BERT-DK_rest` is trained on 1GB of Yelp reviews covering 19 types of restaurants.
`BERT-PT_*` additionally uses SQuAD 1.1.
## Model Description
The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.
Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).
## Instructions
Loading the post-trained weights is as simple as, e.g.,
```python
import torch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-PT_rest")
model = AutoModel.from_pretrained("activebus/BERT-PT_rest")
```
## Evaluation Results
Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf)
## Citation
If you find this work useful, please cite as follows.
```
@inproceedings{xu_bert2019,
title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
month = "jun",
year = "2019",
}
```
# ReviewBERT
BERT (post-)trained on a review corpus to understand sentiment, opinions and various e-commerce aspects.
Please visit https://github.com/howardhsu/BERT-for-RRC-ABSA for details.
`BERT-XD_Review` is a cross-domain (beyond just `laptop` and `restaurant`) language model, where each example is from a single product / restaurant with the same rating. It is post-trained (fine-tuned) on a combination of 5-core Amazon reviews and all Yelp data, expected to be 22 GB in total, and trained for 4 epochs on `bert-base-uncased`.
The preprocessing code is [here](https://github.com/howardhsu/BERT-for-RRC-ABSA/transformers).
## Model Description
The original model is from `BERT-base-uncased`.
Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).
## Instructions
Loading the post-trained weights is as simple as, e.g.,
```python
import torch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-XD_Review")
model = AutoModel.from_pretrained("activebus/BERT-XD_Review")
```
## Evaluation Results
Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf)
`BERT_Review` is expected to have similar performance on domain-specific tasks (such as aspect extraction) as `BERT-DK`, but much better on general tasks such as aspect sentiment classification (different domains mostly share similar sentiment words).
## Citation
If you find this work useful, please cite as follows.
```
@inproceedings{xu_bert2019,
title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
month = "jun",
year = "2019",
}
```
# ReviewBERT
BERT (post-)trained on a review corpus to understand sentiment, opinions and various e-commerce aspects.
`BERT_Review` is a cross-domain (beyond just `laptop` and `restaurant`) language model where each example is drawn from randomly mixed domains. It is post-trained (fine-tuned) on a combination of 5-core Amazon reviews and all Yelp data, expected to be 22 GB in total, and trained for 4 epochs on `bert-base-uncased`.
The preprocessing code is [here](https://github.com/howardhsu/BERT-for-RRC-ABSA/transformers).
## Model Description
The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.
Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).
## Instructions
Loading the post-trained weights is as simple as, e.g.,
```python
import torch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("activebus/BERT_Review")
model = AutoModel.from_pretrained("activebus/BERT_Review")
```
## Evaluation Results
Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf)
`BERT_Review` is expected to have similar performance on domain-specific tasks (such as aspect extraction) as `BERT-DK`, but much better on general tasks such as aspect sentiment classification (different domains mostly share similar sentiment words).
## Citation
If you find this work useful, please cite as follows.
```
@inproceedings{xu_bert2019,
title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
month = "jun",
year = "2019",
}
```
---
language: pt
---
# PTT5-SMALL-SUM
## Model description
This model was trained to summarize texts in Portuguese, based on `unicamp-dl/ptt5-small-portuguese-vocab`.
#### How to use
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('adalbertojunior/PTT5-SMALL-SUM')
t5 = T5ForConditionalGeneration.from_pretrained('adalbertojunior/PTT5-SMALL-SUM')

text = "Esse é um exemplo de sumarização."

input_ids = tokenizer.encode(text, return_tensors="pt", add_special_tokens=True)

generated_ids = t5.generate(
    input_ids=input_ids,
    num_beams=1,
    max_length=40,
    # repetition_penalty=2.5
).squeeze()

predicted_span = tokenizer.decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
```