"vscode:/vscode.git/clone" did not exist on "cf2a3f37bfac2142f1d081760f787c9db263f895"
Unverified Commit 3552d0e0 authored by Julien Chaumond's avatar Julien Chaumond Committed by GitHub
Browse files

[model_cards] Migrate cards from this repo to model repos on huggingface.co (#9013)



* rm all model cards

* Update the .rst

@sgugger it is still not super crystal clear/streamlined so let me know if any ideas to make it simpler

* Add a rootlevel README.md with simple instructions/context

* Update docs/source/model_sharing.rst
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply suggestions from code review
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: default avatarPatrick von Platen <patrick.v.platen@gmail.com>

* make style

* rm all model cards
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: default avatarPatrick von Platen <patrick.v.platen@gmail.com>
parent 29e45979
### Model
**[`monologg/biobert_v1.1_pubmed`](https://huggingface.co/monologg/biobert_v1.1_pubmed)** fine-tuned on **[`SQuAD V2`](https://rajpurkar.github.io/SQuAD-explorer/)** using **[`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)**
This model is cased.
### Training Parameters
Trained on 4 NVIDIA GeForce RTX 2080 Ti 11Gb
```bash
BASE_MODEL=monologg/biobert_v1.1_pubmed
python run_squad.py \
--version_2_with_negative \
--model_type albert \
--model_name_or_path $BASE_MODEL \
--output_dir $OUTPUT_MODEL \
--do_eval \
--do_lower_case \
--train_file $SQUAD_DIR/train-v2.0.json \
--predict_file $SQUAD_DIR/dev-v2.0.json \
--per_gpu_train_batch_size 18 \
--per_gpu_eval_batch_size 64 \
--learning_rate 3e-5 \
--num_train_epochs 3.0 \
--max_seq_length 384 \
--doc_stride 128 \
--save_steps 2000 \
--threads 24 \
--warmup_steps 550 \
--gradient_accumulation_steps 1 \
--fp16 \
--logging_steps 50 \
--do_train
```
### Evaluation
Evaluation on the dev set. I did not sweep for best threshold.
| | val |
|-------------------|-------------------|
| exact | 75.97068980038743 |
| f1 | 79.37043950121722 |
| total | 11873.0 |
| HasAns_exact | 74.13967611336032 |
| HasAns_f1 | 80.94892513460755 |
| HasAns_total | 5928.0 |
| NoAns_exact | 77.79646761984861 |
| NoAns_f1 | 77.79646761984861 |
| NoAns_total | 5945.0 |
| best_exact | 75.97068980038743 |
| best_exact_thresh | 0.0 |
| best_f1 | 79.37043950121729 |
| best_f1_thresh | 0.0 |
### Usage
See [huggingface documentation](https://huggingface.co/transformers/model_doc/bert.html#bertforquestionanswering). Training on `SQuAD V2` allows the model to score if a paragraph contains an answer:
```python
start_scores, end_scores = model(input_ids)
span_scores = start_scores.softmax(dim=1).log()[:,:,None] + end_scores.softmax(dim=1).log()[:,None,:]
ignore_score = span_scores[:,0,0] #no answer scores
```
---
language:
- en
thumbnail:
widget:
- text: "topic: climate article:"
---
# GPT2-medium-topic-news
## Model description
GPT2-medium fine tuned on a large news corpus conditioned on a topic
## Intended uses & limitations
#### How to use
To generate a news article text conditioned on a topic, prompt model with:
`topic: climate article:`
The following tags were used during training:
`arts law international science business politics disaster world conflict football sport sports artanddesign environment music film lifeandstyle business health commentisfree books technology media education politics travel stage uk society us money culture religion science news tv fashion uk australia cities global childrens sustainable global voluntary housing law local healthcare theguardian`
Zero shot generation works pretty well as long as `topic` is a single word and not too specific.
```python
device = "cuda:0"
tokenizer = AutoTokenizer.from_pretrained("ktrapeznikov/gpt2-medium-topic-news")
model = AutoModelWithLMHead.from_pretrained("ktrapeznikov/gpt2-medium-topic-news")
model.to(device)
topic = "climate"
prompt = tokenizer(f"topic: {topic} article:", return_tensors="pt")
out = model.generate(prompt["input_ids"].to(device), do_sample=True,max_length=500, early_stopping=True, top_p=.9)
print(tokenizer.decode(list(out.cpu()[0])))
```
## Training data
## Training procedure
### Model
**[`allenai/scibert_scivocab_uncased`](https://huggingface.co/allenai/scibert_scivocab_uncased)** fine-tuned on **[`SQuAD V2`](https://rajpurkar.github.io/SQuAD-explorer/)** using **[`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)**
### Training Parameters
Trained on 4 NVIDIA GeForce RTX 2080 Ti 11Gb
```bash
BASE_MODEL=allenai/scibert_scivocab_uncased
python run_squad.py \
--version_2_with_negative \
--model_type albert \
--model_name_or_path $BASE_MODEL \
--output_dir $OUTPUT_MODEL \
--do_eval \
--do_lower_case \
--train_file $SQUAD_DIR/train-v2.0.json \
--predict_file $SQUAD_DIR/dev-v2.0.json \
--per_gpu_train_batch_size 18 \
--per_gpu_eval_batch_size 64 \
--learning_rate 3e-5 \
--num_train_epochs 3.0 \
--max_seq_length 384 \
--doc_stride 128 \
--save_steps 2000 \
--threads 24 \
--warmup_steps 550 \
--gradient_accumulation_steps 1 \
--fp16 \
--logging_steps 50 \
--do_train
```
### Evaluation
Evaluation on the dev set. I did not sweep for best threshold.
| | val |
|-------------------|-------------------|
| exact | 75.07790785816559 |
| f1 | 78.47735207283013 |
| total | 11873.0 |
| HasAns_exact | 70.76585695006747 |
| HasAns_f1 | 77.57449412292718 |
| HasAns_total | 5928.0 |
| NoAns_exact | 79.37762825904122 |
| NoAns_f1 | 79.37762825904122 |
| NoAns_total | 5945.0 |
| best_exact | 75.08633032931863 |
| best_exact_thresh | 0.0 |
| best_f1 | 78.48577454398324 |
| best_f1_thresh | 0.0 |
### Usage
See [huggingface documentation](https://huggingface.co/transformers/model_doc/bert.html#bertforquestionanswering). Training on `SQuAD V2` allows the model to score if a paragraph contains an answer:
```python
start_scores, end_scores = model(input_ids)
span_scores = start_scores.softmax(dim=1).log()[:,:,None] + end_scores.softmax(dim=1).log()[:,None,:]
ignore_score = span_scores[:,0,0] #no answer scores
```
---
language: ar
datasets:
- oscar
- wikipedia
tags:
- ar
- masked-lm
---
# Arabic-ALBERT Base
Arabic edition of ALBERT Base pretrained language model
## Pretraining data
The models were pretrained on ~4.4 Billion words:
- Arabic version of [OSCAR](https://oscar-corpus.com/) (unshuffled version of the corpus) - filtered from [Common Crawl](http://commoncrawl.org/)
- Recent dump of Arabic [Wikipedia](https://dumps.wikimedia.org/backup-index.html)
__Notes on training data:__
- Our final version of corpus contains some non-Arabic words inlines, which we did not remove from sentences since that would affect some tasks like NER.
- Although non-Arabic characters were lowered as a preprocessing step, since Arabic characters do not have upper or lower case, there is no cased and uncased version of the model.
- The corpus and vocabulary set are not restricted to Modern Standard Arabic, they contain some dialectical Arabic too.
## Pretraining details
- These models were trained using Google ALBERT's github [repository](https://github.com/google-research/albert) on a single TPU v3-8 provided for free from [TFRC](https://www.tensorflow.org/tfrc).
- Our pretraining procedure follows training settings of bert with some changes: trained for 7M training steps with batchsize of 64, instead of 125K with batchsize of 4096.
## Models
| | albert-base | albert-large | albert-xlarge |
|:---:|:---:|:---:|:---:|
| Hidden Layers | 12 | 24 | 24 |
| Attention heads | 12 | 16 | 32 |
| Hidden size | 768 | 1024 | 2048 |
## Results
For further details on the models performance or any other queries, please refer to [Arabic-ALBERT](https://github.com/KUIS-AI-Lab/Arabic-ALBERT/)
## How to use
You can use these models by installing `torch` or `tensorflow` and Huggingface library `transformers`. And you can use it directly by initializing it like this:
```python
from transformers import AutoTokenizer, AutoModel
# loading the tokenizer
base_tokenizer = AutoTokenizer.from_pretrained("kuisailab/albert-base-arabic")
# loading the model
base_model = AutoModel.from_pretrained("kuisailab/albert-base-arabic")
```
## Acknowledgement
Thanks to Google for providing free TPU for the training process and for Huggingface for hosting these models on their servers 😊
---
language: ar
datasets:
- oscar
- wikipedia
tags:
- ar
- masked-lm
---
# Arabic-ALBERT Large
Arabic edition of ALBERT Large pretrained language model
## Pretraining data
The models were pretrained on ~4.4 Billion words:
- Arabic version of [OSCAR](https://oscar-corpus.com/) (unshuffled version of the corpus) - filtered from [Common Crawl](http://commoncrawl.org/)
- Recent dump of Arabic [Wikipedia](https://dumps.wikimedia.org/backup-index.html)
__Notes on training data:__
- Our final version of corpus contains some non-Arabic words inlines, which we did not remove from sentences since that would affect some tasks like NER.
- Although non-Arabic characters were lowered as a preprocessing step, since Arabic characters do not have upper or lower case, there is no cased and uncased version of the model.
- The corpus and vocabulary set are not restricted to Modern Standard Arabic, they contain some dialectical Arabic too.
## Pretraining details
- These models were trained using Google ALBERT's github [repository](https://github.com/google-research/albert) on a single TPU v3-8 provided for free from [TFRC](https://www.tensorflow.org/tfrc).
- Our pretraining procedure follows training settings of bert with some changes: trained for 7M training steps with batchsize of 64, instead of 125K with batchsize of 4096.
## Models
| | albert-base | albert-large | albert-xlarge |
|:---:|:---:|:---:|:---:|
| Hidden Layers | 12 | 24 | 24 |
| Attention heads | 12 | 16 | 32 |
| Hidden size | 768 | 1024 | 2048 |
## Results
For further details on the models performance or any other queries, please refer to [Arabic-ALBERT](https://github.com/KUIS-AI-Lab/Arabic-ALBERT/)
## How to use
You can use these models by installing `torch` or `tensorflow` and Huggingface library `transformers`. And you can use it directly by initializing it like this:
```python
from transformers import AutoTokenizer, AutoModel
# loading the tokenizer
tokenizer = AutoTokenizer.from_pretrained("kuisailab/albert-large-arabic")
# loading the model
model = AutoModel.from_pretrained("kuisailab/albert-large-arabic")
```
## Acknowledgement
Thanks to Google for providing free TPU for the training process and for Huggingface for hosting these models on their servers 😊
---
language: ar
datasets:
- oscar
- wikipedia
tags:
- ar
- masked-lm
---
# Arabic-ALBERT Xlarge
Arabic edition of ALBERT Xlarge pretrained language model
## Pretraining data
The models were pretrained on ~4.4 Billion words:
- Arabic version of [OSCAR](https://oscar-corpus.com/) (unshuffled version of the corpus) - filtered from [Common Crawl](http://commoncrawl.org/)
- Recent dump of Arabic [Wikipedia](https://dumps.wikimedia.org/backup-index.html)
__Notes on training data:__
- Our final version of corpus contains some non-Arabic words inlines, which we did not remove from sentences since that would affect some tasks like NER.
- Although non-Arabic characters were lowered as a preprocessing step, since Arabic characters do not have upper or lower case, there is no cased and uncased version of the model.
- The corpus and vocabulary set are not restricted to Modern Standard Arabic, they contain some dialectical Arabic too.
## Pretraining details
- These models were trained using Google ALBERT's github [repository](https://github.com/google-research/albert) on a single TPU v3-8 provided for free from [TFRC](https://www.tensorflow.org/tfrc).
- Our pretraining procedure follows training settings of bert with some changes: trained for 7M training steps with batchsize of 64, instead of 125K with batchsize of 4096.
## Models
| | albert-base | albert-large | albert-xlarge |
|:---:|:---:|:---:|:---:|
| Hidden Layers | 12 | 24 | 24 |
| Attention heads | 12 | 16 | 32 |
| Hidden size | 768 | 1024 | 2048 |
## Results
For further details on the models performance or any other queries, please refer to [Arabic-ALBERT](https://github.com/KUIS-AI-Lab/Arabic-ALBERT/)
## How to use
You can use these models by installing `torch` or `tensorflow` and Huggingface library `transformers`. And you can use it directly by initializing it like this:
```python
from transformers import AutoTokenizer, AutoModel
# loading the tokenizer
tokenizer = AutoTokenizer.from_pretrained("kuisailab/albert-xlarge-arabic")
# loading the model
model = AutoModel.from_pretrained("kuisailab/albert-xlarge-arabic")
```
## Acknowledgement
Thanks to Google for providing free TPU for the training process and for Huggingface for hosting these models on their servers 😊
---
language: te
---
# telugu_bertu
## Model description
This model is a BERT MLM model trained on Telugu.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoModelWithLMHead, AutoTokenizer, pipeline
tokenizer = AutoTokenizer.from_pretrained("kuppuluri/telugu_bertu",
clean_text=False,
handle_chinese_chars=False,
strip_accents=False,
wordpieces_prefix='##')
model = AutoModelWithLMHead.from_pretrained("kuppuluri/telugu_bertu")
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
results = fill_mask("మక్దూంపల్లి పేరుతో చాలా [MASK] ఉన్నాయి.")
```
# Named Entity Recognition Model for Telugu
#### How to use
```python
from simpletransformers.ner import NERModel
model = NERModel('bert',
'kuppuluri/telugu_bertu_ner',
labels=[
'B-PERSON', 'I-ORG', 'B-ORG', 'I-LOC', 'B-MISC',
'I-MISC', 'I-PERSON', 'B-LOC', 'O'
],
use_cuda=False,
args={"use_multiprocessing": False})
text = "విరాట్ కోహ్లీ కూడా అదే నిర్లక్ష్యాన్ని ప్రదర్శించి కేవలం ఒక పరుగుకే రనౌటై పెవిలియన్ చేరాడు ."
results = model.predict([text])
```
## Training data
Training data is from https://github.com/anikethjr/NER_Telugu
## Eval results
On the test set my results were
eval_loss = 0.0004407190410447974
f1_score = 0.999519076627124
precision = 0.9994389677005691
recall = 0.9995991983967936
# Part of Speech tagging Model for Telugu
#### How to use
```python
from simpletransformers.ner import NERModel
model = NERModel('bert',
'kuppuluri/telugu_bertu_pos',
args={"use_multiprocessing": False},
labels=[
'QC', 'JJ', 'NN', 'QF', 'RDP', 'O',
'NNO', 'PRP', 'RP', 'VM', 'WQ',
'PSP', 'UT', 'CC', 'INTF', 'SYMP',
'NNP', 'INJ', 'SYM', 'CL', 'QO',
'DEM', 'RB', 'NST', ],
use_cuda=False)
text = "విరాట్ కోహ్లీ కూడా అదే నిర్లక్ష్యాన్ని ప్రదర్శించి కేవలం ఒక పరుగుకే రనౌటై పెవిలియన్ చేరాడు ."
results = model.predict([text])
```
## Training data
Training data is from https://github.com/anikethjr/NER_Telugu
## Eval results
On the test set my results were
eval_loss = 0.0036797842364565416
f1_score = 0.9983795127912227
precision = 0.9984325602401637
recall = 0.9983264709788816
# Telugu Question-Answering model trained on Tydiqa dataset from Google
#### How to use
```python
from transformers.pipelines import pipeline, AutoModelForQuestionAnswering, AutoTokenizer
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained("kuppuluri/telugu_bertu_tydiqa",
clean_text=False,
handle_chinese_chars=False,
strip_accents=False,
wordpieces_prefix='##')
nlp = pipeline('question-answering', model=model, tokenizer=tokenizer)
result = nlp({'question': question, 'context': context})
```
## Training data
I used Tydiqa Telugu data from Google https://github.com/google-research-datasets/tydiqa
---
language:
- en
- ar
datasets:
- gigaword
- oscar
- wikipedia
---
## GigaBERT-v3
GigaBERT-v3 is a customized bilingual BERT for English and Arabic. It was pre-trained in a large-scale corpus (Gigaword+Oscar+Wikipedia) with ~10B tokens, showing state-of-the-art zero-shot transfer performance from English to Arabic on information extraction (IE) tasks. More details can be found in the following paper:
@inproceedings{lan2020gigabert,
author = {Lan, Wuwei and Chen, Yang and Xu, Wei and Ritter, Alan},
title = {GigaBERT: Zero-shot Transfer Learning from English to Arabic},
booktitle = {Proceedings of The 2020 Conference on Empirical Methods on Natural Language Processing (EMNLP)},
year = {2020}
}
## Usage
```
from transformers import *
tokenizer = BertTokenizer.from_pretrained("lanwuwei/GigaBERT-v3-Arabic-and-English", do_lower_case=True)
model = BertForTokenClassification.from_pretrained("lanwuwei/GigaBERT-v3-Arabic-and-English")
```
More code examples can be found [here](https://github.com/lanwuwei/GigaBERT).
---
language: tr
---
# Turkish Language Models with Huggingface's Transformers
As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. More details about pretrained models and evaluations on downstream tasks can be found [here (our repo)](https://github.com/Loodos/turkish-language-models).
# Turkish ALBERT-Base (uncased)
This is ALBERT-Base model which has 12 repeated encoder layers with 768 hidden layer size trained on uncased Turkish dataset.
## Usage
Using AutoModel and AutoTokenizer from Transformers, you can import the model as described below.
```python
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("loodos/albert-base-turkish-uncased", do_lower_case=False, keep_accents=True)
model = AutoModel.from_pretrained("loodos/albert-base-turkish-uncased")
normalizer = TextNormalization()
normalized_text = normalizer.normalize(text, do_lower_case=True, is_turkish=True)
tokenizer.tokenize(normalized_text)
```
### Notes on Tokenizers
Currently, Huggingface's tokenizers (which were written in Python) have a bug concerning letters "ı, i, I, İ" and non-ASCII Turkish specific letters. There are two reasons.
1- Vocabulary and sentence piece model is created with NFC/NFKC normalization but tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information. Some tokens are never trained.(like "şanlıurfa", "öğün", "çocuk" etc.) NFD/NFKD normalization is not proper for Turkish.
2- Python's default ```string.lower()``` and ```string.upper()``` make the conversions
- "I" and "İ" to 'i'
- 'i' and 'ı' to 'I'
respectively. However, in Turkish, 'I' and 'İ' are two different letters.
We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Huggingface's github repo about this bug. Until it is fixed, in case you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models).
## Details and Contact
You contact us to ask a question, open an issue or give feedback via our github [repo](https://github.com/Loodos/turkish-language-models).
## Acknowledgments
Many thanks to TFRC Team for providing us cloud TPUs on Tensorflow Research Cloud to train our models.
---
language: tr
---
# Turkish Language Models with Huggingface's Transformers
As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. More details about pretrained models and evaluations on downstream tasks can be found [here (our repo)](https://github.com/Loodos/turkish-language-models).
# Turkish BERT-Base (uncased)
This is BERT-Base model which has 12 encoder layers with 768 hidden layer size trained on uncased Turkish dataset.
## Usage
Using AutoModel and AutoTokenizer from Transformers, you can import the model as described below.
```python
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("loodos/bert-base-turkish-uncased", do_lower_case=False)
model = AutoModel.from_pretrained("loodos/bert-base-turkish-uncased")
normalizer = TextNormalization()
normalized_text = normalizer.normalize(text, do_lower_case=True, is_turkish=True)
tokenizer.tokenize(normalized_text)
```
### Notes on Tokenizers
Currently, Huggingface's tokenizers (which were written in Python) have a bug concerning letters "ı, i, I, İ" and non-ASCII Turkish specific letters. There are two reasons.
1- Vocabulary and sentence piece model is created with NFC/NFKC normalization but tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information. Some tokens are never trained.(like "şanlıurfa", "öğün", "çocuk" etc.) NFD/NFKD normalization is not proper for Turkish.
2- Python's default ```string.lower()``` and ```string.upper()``` make the conversions
- "I" and "İ" to 'i'
- 'i' and 'ı' to 'I'
respectively. However, in Turkish, 'I' and 'İ' are two different letters.
We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Huggingface's github repo about this bug. Until it is fixed, in case you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models).
## Details and Contact
You contact us to ask a question, open an issue or give feedback via our github [repo](https://github.com/Loodos/turkish-language-models).
## Acknowledgments
Many thanks to TFRC Team for providing us cloud TPUs on Tensorflow Research Cloud to train our models.
---
language: tr
---
# Turkish Language Models with Huggingface's Transformers
As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. More details about pretrained models and evaluations on downstream tasks can be found [here (our repo)](https://github.com/Loodos/turkish-language-models).
# Turkish ELECTRA-Base-discriminator (uncased/64k)
This is ELECTRA-Base model's discriminator which has the same structure with BERT-Base trained on uncased Turkish dataset. This version has a vocab of size 64k, different from default 32k.
## Usage
Using AutoModelWithLMHead and AutoTokenizer from Transformers, you can import the model as described below.
```python
from transformers import AutoModel, AutoModelWithLMHead
tokenizer = AutoTokenizer.from_pretrained("loodos/electra-base-turkish-64k-uncased-discriminator", do_lower_case=False)
model = AutoModelWithLMHead.from_pretrained("loodos/electra-base-turkish-64k-uncased-discriminator")
normalizer = TextNormalization()
normalized_text = normalizer.normalize(text, do_lower_case=True, is_turkish=True)
tokenizer.tokenize(normalized_text)
```
### Notes on Tokenizers
Currently, Huggingface's tokenizers (which were written in Python) have a bug concerning letters "ı, i, I, İ" and non-ASCII Turkish specific letters. There are two reasons.
1- Vocabulary and sentence piece model is created with NFC/NFKC normalization but tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information. Some tokens are never trained.(like "şanlıurfa", "öğün", "çocuk" etc.) NFD/NFKD normalization is not proper for Turkish.
2- Python's default ```string.lower()``` and ```string.upper()``` make the conversions
- "I" and "İ" to 'i'
- 'i' and 'ı' to 'I'
respectively. However, in Turkish, 'I' and 'İ' are two different letters.
We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Huggingface's github repo about this bug. Until it is fixed, in case you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models).
## Details and Contact
You contact us to ask a question, open an issue or give feedback via our github [repo](https://github.com/Loodos/turkish-language-models).
## Acknowledgments
Many thanks to TFRC Team for providing us cloud TPUs on Tensorflow Research Cloud to train our models.
---
language: tr
---
# Turkish Language Models with Huggingface's Transformers
As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. More details about pretrained models and evaluations on downstream tasks can be found [here (our repo)](https://github.com/Loodos/turkish-language-models).
# Turkish ELECTRA-Base-discriminator (uncased)
This is ELECTRA-Base model's discriminator which has the same structure with BERT-Base trained on uncased Turkish dataset.
## Usage
Using AutoModelWithLMHead and AutoTokenizer from Transformers, you can import the model as described below.
```python
from transformers import AutoModel, AutoModelWithLMHead
tokenizer = AutoTokenizer.from_pretrained("loodos/electra-base-turkish-uncased-discriminator", do_lower_case=False)
model = AutoModelWithLMHead.from_pretrained("loodos/electra-base-turkish-uncased-discriminator")
normalizer = TextNormalization()
normalized_text = normalizer.normalize(text, do_lower_case=True, is_turkish=True)
tokenizer.tokenize(normalized_text)
```
### Notes on Tokenizers
Currently, Huggingface's tokenizers (which were written in Python) have a bug concerning letters "ı, i, I, İ" and non-ASCII Turkish specific letters. There are two reasons.
1- Vocabulary and sentence piece model is created with NFC/NFKC normalization but tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information. Some tokens are never trained.(like "şanlıurfa", "öğün", "çocuk" etc.) NFD/NFKD normalization is not proper for Turkish.
2- Python's default ```string.lower()``` and ```string.upper()``` make the conversions
- "I" and "İ" to 'i'
- 'i' and 'ı' to 'I'
respectively. However, in Turkish, 'I' and 'İ' are two different letters.
We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Huggingface's github repo about this bug. Until it is fixed, in case you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models).
## Details and Contact
You contact us to ask a question, open an issue or give feedback via our github [repo](https://github.com/Loodos/turkish-language-models).
## Acknowledgments
Many thanks to TFRC Team for providing us cloud TPUs on Tensorflow Research Cloud to train our models.
---
language: tr
---
# Turkish Language Models with Huggingface's Transformers
As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. More details about pretrained models and evaluations on downstream tasks can be found [here (our repo)](https://github.com/Loodos/turkish-language-models).
# Turkish ELECTRA-Small-discriminator (cased)
This is ELECTRA-Small model's discriminator which has 12 encoder layers with 256 hidden layers size trained on cased Turkish dataset.
## Usage
Using AutoModelWithLMHead and AutoTokenizer from Transformers, you can import the model as described below.
```python
from transformers import AutoModel, AutoModelWithLMHead
tokenizer = AutoTokenizer.from_pretrained("loodos/electra-small-turkish-cased-discriminator")
model = AutoModelWithLMHead.from_pretrained("loodos/electra-small-turkish-cased-discriminator")
```
## Details and Contact
You contact us to ask a question, open an issue or give feedback via our github [repo](https://github.com/Loodos/turkish-language-models).
## Acknowledgments
Many thanks to TFRC Team for providing us cloud TPUs on Tensorflow Research Cloud to train our models.
---
language: tr
---
# Turkish Language Models with Huggingface's Transformers
As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. More details about pretrained models and evaluations on downstream tasks can be found [here (our repo)](https://github.com/Loodos/turkish-language-models).
# Turkish ELECTRA-Small-discriminator (uncased)
This is ELECTRA-Small model's discriminator which has 12 encoder layers with 256 hidden layer size trained on uncased Turkish dataset.
## Usage
Using AutoModelWithLMHead and AutoTokenizer from Transformers, you can import the model as described below.
```python
from transformers import AutoModel, AutoModelWithLMHead
tokenizer = AutoTokenizer.from_pretrained("loodos/electra-small-turkish-uncased-discriminator", do_lower_case=False)
model = AutoModelWithLMHead.from_pretrained("loodos/electra-small-turkish-uncased-discriminator")
normalizer = TextNormalization()
normalized_text = normalizer.normalize(text, do_lower_case=True, is_turkish=True)
tokenizer.tokenize(normalized_text)
```
### Notes on Tokenizers
Currently, Huggingface's tokenizers (which were written in Python) have a bug concerning letters "ı, i, I, İ" and non-ASCII Turkish specific letters. There are two reasons.
1- Vocabulary and sentence piece model is created with NFC/NFKC normalization but tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information. Some tokens are never trained.(like "şanlıurfa", "öğün", "çocuk" etc.) NFD/NFKD normalization is not proper for Turkish.
2- Python's default ```string.lower()``` and ```string.upper()``` make the conversions
- "I" and "İ" to 'i'
- 'i' and 'ı' to 'I'
respectively. However, in Turkish, 'I' and 'İ' are two different letters.
We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Huggingface's github repo about this bug. Until it is fixed, in case you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models).
## Details and Contact
You contact us to ask a question, open an issue or give feedback via our github [repo](https://github.com/Loodos/turkish-language-models).
## Acknowledgments
Many thanks to TFRC Team for providing us cloud TPUs on Tensorflow Research Cloud to train our models.
---
language: en
inference: false
---
## COVID-SciBERT: A small language modelling expansion of SciBERT, a BERT model trained on scientific text.
### Details of SciBERT
The **SciBERT** model was presented in [SciBERT: A Pretrained Language Model for Scientific Text](https://arxiv.org/abs/1903.10676) by *Iz Beltagy, Kyle Lo, Arman Cohan* and here is the abstract:
Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT (Devlin et al., 2018) to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks.
### Details of the downstream task (Language Modeling) - Dataset 📚
There are actually two datasets that have been used here:
- The original SciBERT model is trained on papers from the corpus of [semanticscholar.org](semanticscholar.org). Corpus size is 1.14M papers, 3.1B tokens. They used the full text of the papers in training, not just abstracts. SciBERT has its own vocabulary (scivocab) that's built to best match the training corpus.
- The expansion is done using the papers present in the [COVID-19 Open Research Dataset Challenge (CORD-19)](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge). Only the abstracts have been used and vocabulary was pruned and added to the existing scivocab. In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 200,000 scholarly articles, including over 100,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, making it difficult for the medical research community to keep up.
### Model training
The training script is present [here](https://github.com/lordtt13/word-embeddings/blob/master/COVID-19%20Research%20Data/COVID-SciBERT.ipynb).
### Pipelining the Model
```python
import transformers
model = transformers.AutoModelWithLMHead.from_pretrained('lordtt13/COVID-SciBERT')
tokenizer = transformers.AutoTokenizer.from_pretrained('lordtt13/COVID-SciBERT')
nlp_fill = transformers.pipeline('fill-mask', model = model, tokenizer = tokenizer)
nlp_fill('Coronavirus or COVID-19 can be prevented by a' + nlp_fill.tokenizer.mask_token)
# Output:
# [{'sequence': '[CLS] coronavirus or covid - 19 can be prevented by a combination [SEP]',
# 'score': 0.1719885915517807,
# 'token': 2702},
# {'sequence': '[CLS] coronavirus or covid - 19 can be prevented by a simple [SEP]',
# 'score': 0.054218728095293045,
# 'token': 2177},
# {'sequence': '[CLS] coronavirus or covid - 19 can be prevented by a novel [SEP]',
# 'score': 0.043364267796278,
# 'token': 3045},
# {'sequence': '[CLS] coronavirus or covid - 19 can be prevented by a high [SEP]',
# 'score': 0.03732519596815109,
# 'token': 597},
# {'sequence': '[CLS] coronavirus or covid - 19 can be prevented by a vaccine [SEP]',
# 'score': 0.021863549947738647,
# 'token': 7039}]
```
> Created by [Tanmay Thakur](https://github.com/lordtt13) | [LinkedIn](https://www.linkedin.com/in/tanmay-thakur-6bb5a9154/)
> PS: Still looking for more resources to expand my expansion!
---
language: en
datasets:
- emo
---
## Emo-MobileBERT: a thin version of BERT LARGE, trained on the EmoContext Dataset from scratch
### Details of MobileBERT
The **MobileBERT** model was presented in [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by *Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou* and here is the abstract:
Natural Language Processing (NLP) has recently achieved great success by using huge pre-trained models with hundreds of millions of parameters. However, these models suffer from heavy model sizes and high latency such that they cannot be deployed to resource-limited mobile devices. In this paper, we propose MobileBERT for compressing and accelerating the popular BERT model. Like the original BERT, MobileBERT is task-agnostic, that is, it can be generically applied to various downstream NLP tasks via simple fine-tuning. Basically, MobileBERT is a thin version of BERT_LARGE, while equipped with bottleneck structures and a carefully designed balance between self-attentions and feed-forward networks. To train MobileBERT, we first train a specially designed teacher model, an inverted-bottleneck incorporated BERT_LARGE model. Then, we conduct knowledge transfer from this teacher to MobileBERT. Empirical studies show that MobileBERT is 4.3x smaller and 5.5x faster than BERT_BASE while achieving competitive results on well-known benchmarks. On the natural language inference tasks of GLUE, MobileBERT achieves a GLUEscore o 77.7 (0.6 lower than BERT_BASE), and 62 ms latency on a Pixel 4 phone. On the SQuAD v1.1/v2.0 question answering task, MobileBERT achieves a dev F1 score of 90.0/79.2 (1.5/2.1 higher than BERT_BASE).
### Details of the downstream task (Emotion Recognition) - Dataset 📚
SemEval-2019 Task 3: EmoContext Contextual Emotion Detection in Text
In this dataset, given a textual dialogue i.e. an utterance along with two previous turns of context, the goal was to infer the underlying emotion of the utterance by choosing from four emotion classes:
- sad 😢
- happy 😃
- angry 😡
- others
### Model training
The training script is present [here](https://github.com/lordtt13/transformers-experiments/blob/master/Custom%20Tasks/emo-mobilebert.ipynb).
### Pipelining the Model
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
tokenizer = AutoTokenizer.from_pretrained("lordtt13/emo-mobilebert")
model = AutoModelForSequenceClassification.from_pretrained("lordtt13/emo-mobilebert")
nlp_sentence_classif = transformers.pipeline('sentiment-analysis', model = model, tokenizer = tokenizer)
nlp_sentence_classif("I've never had such a bad day in my life")
# Output: [{'label': 'sad', 'score': 0.93153977394104}]
```
> Created by [Tanmay Thakur](https://github.com/lordtt13) | [LinkedIn](https://www.linkedin.com/in/tanmay-thakur-6bb5a9154/)
---
language: tr
---
# bert-turkish-question-answering
## Usage
```python
from transformers import pipeline
nlp = pipeline('question-answering', model='lserinol/bert-turkish-question-answering', tokenizer='lserinol/bert-turkish-question-answering')
nlp({
'question': "Ankara'da kaç ilçe vardır?",
'context': r"""Türkiye'nin başkenti Ankara'dır. Ülkenin en büyük idari birimleri illerdir ve 81 il vardır. Bu iller ilçelere ayrılmıştır, toplamda 973 ilçe mevcuttur."""
})
```
```python
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
tokenizer = AutoTokenizer.from_pretrained("lserinol/bert-turkish-question-answering")
model = AutoModelForQuestionAnswering.from_pretrained("lserinol/bert-turkish-question-answering")
text = r"""
Ankara'nın başkent ilan edilmesinin ardından (13 Ekim 1923) şehir hızla gelişmiş ve Türkiye'nin ikinci en kalabalık ili olmuştur.
Türkiye Cumhuriyeti'nin ilk yıllarında ekonomisi tarım ve hayvancılığa dayanan ilin topraklarının yarısı hâlâ tarım amaçlı
kullanılmaktadır. Ekonomik etkinlik büyük oranda ticaret ve sanayiye dayalıdır. Tarım ve hayvancılığın ağırlığı ise giderek
azalmaktadır. Ankara ve civarındaki gerek kamu sektörü gerek özel sektör yatırımları, başka illerden büyük bir nüfus göçünü
teşvik etmiştir. Cumhuriyetin kuruluşundan günümüze, nüfusu ülke nüfusunun iki katı hızda artmıştır. Nüfusun yaklaşık dörtte
üçü hizmet sektörü olarak tanımlanabilecek memuriyet, ulaşım, haberleşme ve ticaret benzeri işlerde, dörtte biri sanayide,
%2'si ise tarım alanında çalışır. Sanayi, özellikle tekstil, gıda ve inşaat sektörlerinde yoğunlaşmıştır. Günümüzde ise en çok
savunma, metal ve motor sektörlerinde yatırım yapılmaktadır. Türkiye'nin en çok sayıda üniversiteye sahip ili olan Ankara'da
ayrıca, üniversite diplomalı kişi oranı ülke ortalamasının iki katıdır. Bu eğitimli nüfus, teknoloji ağırlıklı yatırımların
gereksinim duyduğu iş gücünü oluşturur. Ankara'dan otoyollar, demir yolu ve hava yoluyla Türkiye'nin diğer şehirlerine ulaşılır.
Ankara aynı zamanda başkent olarak Türkiye Büyük Millet Meclisi (TBMM)'ye de ev sahipliği yapmaktadır.
"""
questions = [
"Ankara kaç yılında başkent oldu?",
"Ankara ne zaman başkent oldu?",
"Ankara'dan başka şehirlere nasıl ulaşılır?",
"TBMM neyin kısaltmasıdır?"
]
for question in questions:
inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]
text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
answer_start_scores, answer_end_scores = model(**inputs)
answer_start = torch.argmax(
answer_start_scores
) # Get the most likely beginning of answer with the argmax of the score
answer_end = torch.argmax(answer_end_scores) + 1 # Get the most likely end of answer with the argmax of the score
answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
print(f"Question: {question}")
print(f"Answer: {answer}\n")
```
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment