[model_cards] Migrate cards from this repo to model repos on huggingface.co (#9013)

* rm all model cards
* Update the .rst

  @sgugger it is still not super crystal clear/streamlined so let me know if any ideas to make it simpler
* Add a rootlevel README.md with simple instructions/context
* Update docs/source/model_sharing.rst

  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Apply suggestions from code review

  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
  Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* make style
* rm all model cards

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

3552d0e0 · Julien Chaumond · GitHub · 29e45979
Unverified Commit 3552d0e0 authored Dec 12, 2020 by Julien Chaumond Committed by GitHub Dec 11, 2020
20 changed files
--- a/model_cards/ktrapeznikov/biobert_v1.1_pubmed_squad_v2/README.md
+++ b/model_cards/ktrapeznikov/biobert_v1.1_pubmed_squad_v2/README.md
-### Model
-**[`monologg/biobert_v1.1_pubmed`](https://huggingface.co/monologg/biobert_v1.1_pubmed)** fine-tuned on **[`SQuAD V2`](https://rajpurkar.github.io/SQuAD-explorer/)** using **[`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)**
-This model is cased.
-### Training Parameters
-Trained on 4 NVIDIA GeForce RTX 2080 Ti 11Gb
-```bash
-BASE_MODEL=monologg/biobert_v1.1_pubmed
-python run_squad.py \
-  --version_2_with_negative \
-  --model_type albert \
-  --model_name_or_path $BASE_MODEL \
-  --output_dir $OUTPUT_MODEL \
-  --do_eval \
-  --do_lower_case \
-  --train_file $SQUAD_DIR/train-v2.0.json \
-  --predict_file $SQUAD_DIR/dev-v2.0.json \
-  --per_gpu_train_batch_size 18 \
-  --per_gpu_eval_batch_size 64 \
-  --learning_rate 3e-5 \
-  --num_train_epochs 3.0 \
-  --max_seq_length 384 \
-  --doc_stride 128 \
-  --save_steps 2000 \
-  --threads 24 \
-  --warmup_steps 550 \
-  --gradient_accumulation_steps 1 \
-  --fp16 \
-  --logging_steps 50 \
-  --do_train
-```
-### Evaluation
-Evaluation on the dev set. I did not sweep for best threshold.
-|                   | val               |
-|-------------------|-------------------|
-| exact             | 75.97068980038743 |
-| f1                | 79.37043950121722 |
-| total             | 11873.0           |
-| HasAns_exact      | 74.13967611336032 |
-| HasAns_f1         | 80.94892513460755 |
-| HasAns_total      | 5928.0            |
-| NoAns_exact       | 77.79646761984861 |
-| NoAns_f1          | 77.79646761984861 |
-| NoAns_total       | 5945.0            |
-| best_exact        | 75.97068980038743 |
-| best_exact_thresh | 0.0               |
-| best_f1           | 79.37043950121729 |
-| best_f1_thresh    | 0.0               |
-### Usage
-See [huggingface documentation](https://huggingface.co/transformers/model_doc/bert.html#bertforquestionanswering). Training on `SQuAD V2` allows the model to score if a paragraph contains an answer:
-```python
-start_scores, end_scores = model(input_ids) 
-span_scores = start_scores.softmax(dim=1).log()[:,:,None] + end_scores.softmax(dim=1).log()[:,None,:]
-ignore_score = span_scores[:,0,0] #no answer scores
-```
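Note on the removed Usage snippet above: it adds the log-softmax of the start scores to the log-softmax of the end scores to get one log-probability per (start, end) pair, and reads the pair at position (0, 0) as the "no answer" score. The following is a minimal, dependency-free sketch of that computation for a single sequence. It assumes position 0 is the `[CLS]` token, as in the BERT SQuAD v2 setup the card refers to; the helper names and the toy logits are illustrative and do not come from the card.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a 1-D list of logits."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def span_log_scores(start_logits, end_logits):
    """Return span[i][j] = log P(start=i) + log P(end=j)."""
    log_start = log_softmax(start_logits)
    log_end = log_softmax(end_logits)
    return [[a + b for b in log_end] for a in log_start]


def null_and_best_span(start_logits, end_logits):
    """Compare the (0, 0) 'no answer' score with the best real span (start <= end, start >= 1)."""
    span = span_log_scores(start_logits, end_logits)
    n = len(span)
    null_score = span[0][0]
    best_score = max(span[i][j] for i in range(1, n) for j in range(i, n))
    return null_score, best_score


# Toy logits (made up): both heads strongly prefer position 0.
start_logits = [5.0, 0.1, 0.2, 0.3]
end_logits = [4.0, 0.3, 0.2, 0.1]
null_score, best_score = null_and_best_span(start_logits, end_logits)
print("no answer" if null_score > best_score else "has answer")
```

The evaluation table in the card reports thresholds of 0.0, so a plain comparison like the one above is used here; a tuned threshold on the difference between the two scores would be a separate choice.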
--- a/model_cards/ktrapeznikov/gpt2-medium-topic-news/README.md
+++ b/model_cards/ktrapeznikov/gpt2-medium-topic-news/README.md
---
-language: 
- en
-thumbnail:
-widget:
- - text: "topic: climate article:"
---
-# GPT2-medium-topic-news
-## Model description
-GPT2-medium fine tuned on a large news corpus conditioned on a topic
-## Intended uses & limitations
-#### How to use
-To generate a news article text conditioned on a topic, prompt model with: 
-`topic: climate article:`
-The following tags were used during training:
-`arts law international science business politics disaster world conflict football sport sports artanddesign environment music film lifeandstyle business health commentisfree books technology media education politics travel stage uk society us money culture religion science news tv fashion uk australia cities global childrens sustainable global voluntary housing law local healthcare theguardian`
-Zero shot generation works pretty well as long as `topic` is a single word and not too specific.
-```python
-device = "cuda:0"
-tokenizer = AutoTokenizer.from_pretrained("ktrapeznikov/gpt2-medium-topic-news")
-model = AutoModelWithLMHead.from_pretrained("ktrapeznikov/gpt2-medium-topic-news")
-model.to(device)
-topic = "climate"
-prompt = tokenizer(f"topic: {topic} article:", return_tensors="pt")
-out = model.generate(prompt["input_ids"].to(device), do_sample=True,max_length=500, early_stopping=True, top_p=.9)
-print(tokenizer.decode(list(out.cpu()[0])))
-```
-## Training data
-## Training procedure
--- a/model_cards/ktrapeznikov/scibert_scivocab_uncased_squad_v2/README.md
+++ b/model_cards/ktrapeznikov/scibert_scivocab_uncased_squad_v2/README.md
-### Model
-**[`allenai/scibert_scivocab_uncased`](https://huggingface.co/allenai/scibert_scivocab_uncased)** fine-tuned on **[`SQuAD V2`](https://rajpurkar.github.io/SQuAD-explorer/)** using **[`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)**
-### Training Parameters
-Trained on 4 NVIDIA GeForce RTX 2080 Ti 11Gb
-```bash
-BASE_MODEL=allenai/scibert_scivocab_uncased
-python run_squad.py \
-  --version_2_with_negative \
-  --model_type albert \
-  --model_name_or_path $BASE_MODEL \
-  --output_dir $OUTPUT_MODEL \
-  --do_eval \
-  --do_lower_case \
-  --train_file $SQUAD_DIR/train-v2.0.json \
-  --predict_file $SQUAD_DIR/dev-v2.0.json \
-  --per_gpu_train_batch_size 18 \
-  --per_gpu_eval_batch_size 64 \
-  --learning_rate 3e-5 \
-  --num_train_epochs 3.0 \
-  --max_seq_length 384 \
-  --doc_stride 128 \
-  --save_steps 2000 \
-  --threads 24 \
-  --warmup_steps 550 \
-  --gradient_accumulation_steps 1 \
-  --fp16 \
-  --logging_steps 50 \
-  --do_train
-```
-### Evaluation
-Evaluation on the dev set. I did not sweep for best threshold.
-|                   | val               |
-|-------------------|-------------------|
-| exact             | 75.07790785816559 |
-| f1                | 78.47735207283013 |
-| total             | 11873.0           |
-| HasAns_exact      | 70.76585695006747 |
-| HasAns_f1         | 77.57449412292718 |
-| HasAns_total      | 5928.0            |
-| NoAns_exact       | 79.37762825904122 |
-| NoAns_f1          | 79.37762825904122 |
-| NoAns_total       | 5945.0            |
-| best_exact        | 75.08633032931863 |
-| best_exact_thresh | 0.0               |
-| best_f1           | 78.48577454398324 |
-| best_f1_thresh    | 0.0               |
-### Usage
-See [huggingface documentation](https://huggingface.co/transformers/model_doc/bert.html#bertforquestionanswering). Training on `SQuAD V2` allows the model to score if a paragraph contains an answer:
-```python
-start_scores, end_scores = model(input_ids) 
-span_scores = start_scores.softmax(dim=1).log()[:,:,None] + end_scores.softmax(dim=1).log()[:,None,:]
-ignore_score = span_scores[:,0,0] #no answer scores
-```
--- a/model_cards/kuisailab/albert-base-arabic/README.md
+++ b/model_cards/kuisailab/albert-base-arabic/README.md
---
-language: ar
-datasets:
- oscar
- wikipedia
-tags:
- ar
- masked-lm
---
-# Arabic-ALBERT Base
-Arabic edition of ALBERT Base pretrained language model
-## Pretraining data
-The models were pretrained on ~4.4 Billion words:
- Arabic version of [OSCAR](https://oscar-corpus.com/) (unshuffled version of the corpus) - filtered from [Common Crawl](http://commoncrawl.org/)
- Recent dump of Arabic [Wikipedia](https://dumps.wikimedia.org/backup-index.html)
-__Notes on training data:__
- Our final version of corpus contains some non-Arabic words inlines, which we did not remove from sentences since that would affect some tasks like NER.
- Although non-Arabic characters were lowered as a preprocessing step, since Arabic characters do not have upper or lower case, there is no cased and uncased version of the model.
- The corpus and vocabulary set are not restricted to Modern Standard Arabic, they contain some dialectical Arabic too.
-## Pretraining details
- These models were trained using Google ALBERT's github [repository](https://github.com/google-research/albert) on a single TPU v3-8 provided for free from [TFRC](https://www.tensorflow.org/tfrc).
- Our pretraining procedure follows training settings of bert with some changes: trained for 7M training steps with batchsize of 64, instead of 125K with batchsize of 4096.
-## Models
-|  | albert-base | albert-large | albert-xlarge |
-|:---:|:---:|:---:|:---:|
-| Hidden Layers | 12 | 24 | 24 |
-| Attention heads | 12 | 16 | 32 |
-| Hidden size | 768 | 1024 | 2048 |
-## Results
-For further details on the models performance or any other queries, please refer to [Arabic-ALBERT](https://github.com/KUIS-AI-Lab/Arabic-ALBERT/)
-## How to use
-You can use these models by installing `torch` or `tensorflow` and Huggingface library `transformers`. And you can use it directly by initializing it like this:  
-```python
-from transformers import AutoTokenizer, AutoModel
-# loading the tokenizer
-base_tokenizer    = AutoTokenizer.from_pretrained("kuisailab/albert-base-arabic")
-# loading the model
-base_model   = AutoModel.from_pretrained("kuisailab/albert-base-arabic")
-```
-## Acknowledgement
-Thanks to Google for providing free TPU for the training process and for Huggingface for hosting these models on their servers 😊
--- a/model_cards/kuisailab/albert-large-arabic/README.md
+++ b/model_cards/kuisailab/albert-large-arabic/README.md
---
-language: ar
-datasets:
- oscar
- wikipedia
-tags:
- ar
- masked-lm
---
-# Arabic-ALBERT Large
-Arabic edition of ALBERT Large pretrained language model
-## Pretraining data
-The models were pretrained on ~4.4 Billion words:
- Arabic version of [OSCAR](https://oscar-corpus.com/) (unshuffled version of the corpus) - filtered from [Common Crawl](http://commoncrawl.org/)
- Recent dump of Arabic [Wikipedia](https://dumps.wikimedia.org/backup-index.html)
-__Notes on training data:__
- Our final version of corpus contains some non-Arabic words inlines, which we did not remove from sentences since that would affect some tasks like NER.
- Although non-Arabic characters were lowered as a preprocessing step, since Arabic characters do not have upper or lower case, there is no cased and uncased version of the model.
- The corpus and vocabulary set are not restricted to Modern Standard Arabic, they contain some dialectical Arabic too.
-## Pretraining details
- These models were trained using Google ALBERT's github [repository](https://github.com/google-research/albert) on a single TPU v3-8 provided for free from [TFRC](https://www.tensorflow.org/tfrc).
- Our pretraining procedure follows training settings of bert with some changes: trained for 7M training steps with batchsize of 64, instead of 125K with batchsize of 4096.
-## Models
-|  | albert-base | albert-large | albert-xlarge |
-|:---:|:---:|:---:|:---:|
-| Hidden Layers | 12 | 24 | 24 |
-| Attention heads | 12 | 16 | 32 |
-| Hidden size | 768 | 1024 | 2048 |
-## Results
-For further details on the models performance or any other queries, please refer to [Arabic-ALBERT](https://github.com/KUIS-AI-Lab/Arabic-ALBERT/)
-## How to use
-You can use these models by installing `torch` or `tensorflow` and Huggingface library `transformers`. And you can use it directly by initializing it like this:  
-```python
-from transformers import AutoTokenizer, AutoModel
-# loading the tokenizer
-tokenizer    = AutoTokenizer.from_pretrained("kuisailab/albert-large-arabic")
-# loading the model
-model   = AutoModel.from_pretrained("kuisailab/albert-large-arabic")
-```
-## Acknowledgement
-Thanks to Google for providing free TPU for the training process and for Huggingface for hosting these models on their servers 😊
--- a/model_cards/kuisailab/albert-xlarge-arabic/README.md
+++ b/model_cards/kuisailab/albert-xlarge-arabic/README.md
---
-language: ar
-datasets:
- oscar
- wikipedia
-tags:
- ar
- masked-lm
---
-# Arabic-ALBERT Xlarge
-Arabic edition of ALBERT Xlarge pretrained language model
-## Pretraining data
-The models were pretrained on ~4.4 Billion words:
- Arabic version of [OSCAR](https://oscar-corpus.com/) (unshuffled version of the corpus) - filtered from [Common Crawl](http://commoncrawl.org/)
- Recent dump of Arabic [Wikipedia](https://dumps.wikimedia.org/backup-index.html)
-__Notes on training data:__
- Our final version of corpus contains some non-Arabic words inlines, which we did not remove from sentences since that would affect some tasks like NER.
- Although non-Arabic characters were lowered as a preprocessing step, since Arabic characters do not have upper or lower case, there is no cased and uncased version of the model.
- The corpus and vocabulary set are not restricted to Modern Standard Arabic, they contain some dialectical Arabic too.
-## Pretraining details
- These models were trained using Google ALBERT's github [repository](https://github.com/google-research/albert) on a single TPU v3-8 provided for free from [TFRC](https://www.tensorflow.org/tfrc).
- Our pretraining procedure follows training settings of bert with some changes: trained for 7M training steps with batchsize of 64, instead of 125K with batchsize of 4096.
-## Models
-|  | albert-base | albert-large | albert-xlarge |
-|:---:|:---:|:---:|:---:|
-| Hidden Layers | 12 | 24 | 24 |
-| Attention heads | 12 | 16 | 32 |
-| Hidden size | 768 | 1024 | 2048 |
-## Results
-For further details on the models performance or any other queries, please refer to [Arabic-ALBERT](https://github.com/KUIS-AI-Lab/Arabic-ALBERT/)
-## How to use
-You can use these models by installing `torch` or `tensorflow` and Huggingface library `transformers`. And you can use it directly by initializing it like this:  
-```python
-from transformers import AutoTokenizer, AutoModel
-# loading the tokenizer
-tokenizer    = AutoTokenizer.from_pretrained("kuisailab/albert-xlarge-arabic")
-# loading the model
-model   = AutoModel.from_pretrained("kuisailab/albert-xlarge-arabic")
-```
-## Acknowledgement
-Thanks to Google for providing free TPU for the training process and for Huggingface for hosting these models on their servers 😊
--- a/model_cards/kuppuluri/telugu_bertu/README.md
+++ b/model_cards/kuppuluri/telugu_bertu/README.md
---
-language: te
---
-# telugu_bertu
-## Model description
-This model is a BERT MLM model trained on Telugu.
-## Intended uses & limitations
-#### How to use
-```python
-from transformers import AutoModelWithLMHead, AutoTokenizer, pipeline
-tokenizer = AutoTokenizer.from_pretrained("kuppuluri/telugu_bertu",
-                                          clean_text=False,
-                                          handle_chinese_chars=False,
-                                          strip_accents=False,
-                                          wordpieces_prefix='##')
-model = AutoModelWithLMHead.from_pretrained("kuppuluri/telugu_bertu")
-fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
-results = fill_mask("మక్దూంపల్లి పేరుతో చాలా [MASK] ఉన్నాయి.")
-```
--- a/model_cards/kuppuluri/telugu_bertu_ner/README.md
+++ b/model_cards/kuppuluri/telugu_bertu_ner/README.md
-# Named Entity Recognition Model for Telugu
-#### How to use
-```python
-from simpletransformers.ner import NERModel
-model = NERModel('bert',
-                 'kuppuluri/telugu_bertu_ner',
-                 labels=[
-                     'B-PERSON', 'I-ORG', 'B-ORG', 'I-LOC', 'B-MISC',
-                     'I-MISC', 'I-PERSON', 'B-LOC', 'O'
-                 ],
-                 use_cuda=False,
-                 args={"use_multiprocessing": False})
-text = "విరాట్ కోహ్లీ కూడా అదే నిర్లక్ష్యాన్ని ప్రదర్శించి కేవలం ఒక పరుగుకే రనౌటై పెవిలియన్ చేరాడు ."
-results = model.predict([text])
-```
-## Training data
-Training data is from https://github.com/anikethjr/NER_Telugu
-## Eval results
-On the test set my results were
-eval_loss = 0.0004407190410447974
-f1_score = 0.999519076627124
-precision = 0.9994389677005691
-recall = 0.9995991983967936
--- a/model_cards/kuppuluri/telugu_bertu_pos/README.md
+++ b/model_cards/kuppuluri/telugu_bertu_pos/README.md
-# Part of Speech tagging Model for Telugu
-#### How to use
-```python
-from simpletransformers.ner import NERModel
-model = NERModel('bert',
-                 'kuppuluri/telugu_bertu_pos',
-                 args={"use_multiprocessing": False},
-                 labels=[
-                     'QC', 'JJ', 'NN', 'QF', 'RDP', 'O',
-                     'NNO', 'PRP', 'RP', 'VM', 'WQ',
-                     'PSP', 'UT', 'CC', 'INTF', 'SYMP',
-                     'NNP', 'INJ', 'SYM', 'CL', 'QO',
-                     'DEM', 'RB', 'NST', ],
-                 use_cuda=False)
-text = "విరాట్ కోహ్లీ కూడా అదే నిర్లక్ష్యాన్ని ప్రదర్శించి కేవలం ఒక పరుగుకే రనౌటై పెవిలియన్ చేరాడు ."
-results = model.predict([text])
-```
-## Training data
-Training data is from https://github.com/anikethjr/NER_Telugu
-## Eval results
-On the test set my results were
-eval_loss = 0.0036797842364565416
-f1_score = 0.9983795127912227
-precision = 0.9984325602401637
-recall = 0.9983264709788816
--- a/model_cards/kuppuluri/telugu_bertu_tydiqa/README.md
+++ b/model_cards/kuppuluri/telugu_bertu_tydiqa/README.md
-# Telugu Question-Answering model trained on Tydiqa dataset from Google
-#### How to use
-```python
-from transformers.pipelines import pipeline, AutoModelForQuestionAnswering, AutoTokenizer
-model = AutoModelForQuestionAnswering.from_pretrained(model_name)
-tokenizer = AutoTokenizer.from_pretrained("kuppuluri/telugu_bertu_tydiqa",
-                                          clean_text=False,
-                                          handle_chinese_chars=False,
-                                          strip_accents=False,
-                                          wordpieces_prefix='##')
-nlp = pipeline('question-answering', model=model, tokenizer=tokenizer)
-result = nlp({'question': question, 'context': context})
-```
-## Training data
-I used Tydiqa Telugu data from Google https://github.com/google-research-datasets/tydiqa
--- a/model_cards/lanwuwei/GigaBERT-v3-Arabic-and-English/README.md
+++ b/model_cards/lanwuwei/GigaBERT-v3-Arabic-and-English/README.md
---
-language:
- en
- ar
-datasets:
- gigaword
- oscar
- wikipedia
---
-## GigaBERT-v3
-GigaBERT-v3 is a customized bilingual BERT for English and Arabic. It was pre-trained in a large-scale corpus (Gigaword+Oscar+Wikipedia) with ~10B tokens, showing state-of-the-art zero-shot transfer performance from English to Arabic on information extraction (IE) tasks. More details can be found in the following paper:
-	@inproceedings{lan2020gigabert,
-	  author     = {Lan, Wuwei and Chen, Yang and Xu, Wei and Ritter, Alan},
-  	  title      = {GigaBERT: Zero-shot Transfer Learning from English to Arabic},
-  	  booktitle  = {Proceedings of The 2020 Conference on Empirical Methods on Natural Language Processing (EMNLP)},
-  	  year       = {2020}
-  	} 
-## Usage
-```
-from transformers import *
-tokenizer = BertTokenizer.from_pretrained("lanwuwei/GigaBERT-v3-Arabic-and-English", do_lower_case=True)
-model = BertForTokenClassification.from_pretrained("lanwuwei/GigaBERT-v3-Arabic-and-English")
-```
-More code examples can be found [here](https://github.com/lanwuwei/GigaBERT).
--- a/model_cards/loodos/albert-base-turkish-uncased/README.md
+++ b/model_cards/loodos/albert-base-turkish-uncased/README.md
---
-language: tr
---
-# Turkish Language Models with Huggingface's Transformers
-As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. More details about pretrained models and evaluations on downstream tasks can be found [here (our repo)](https://github.com/Loodos/turkish-language-models).
-# Turkish ALBERT-Base (uncased)
-This is ALBERT-Base model which has 12 repeated encoder layers with 768 hidden layer size trained on uncased Turkish dataset.
-## Usage
-Using AutoModel and AutoTokenizer from Transformers, you can import the model as described below.
-```python
-from transformers import AutoModel, AutoTokenizer
-tokenizer = AutoTokenizer.from_pretrained("loodos/albert-base-turkish-uncased", do_lower_case=False, keep_accents=True)
-model = AutoModel.from_pretrained("loodos/albert-base-turkish-uncased")
-normalizer = TextNormalization()
-normalized_text = normalizer.normalize(text, do_lower_case=True, is_turkish=True)
-tokenizer.tokenize(normalized_text)
-```
-### Notes on Tokenizers
-Currently, Huggingface's tokenizers (which were written in Python) have a bug concerning letters "ı, i, I, İ" and non-ASCII Turkish specific letters. There are two reasons.
-1- Vocabulary and sentence piece model is created with NFC/NFKC normalization but tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information. Some tokens are never trained.(like "şanlıurfa", "öğün", "çocuk" etc.) NFD/NFKD normalization is not proper for Turkish.
-2- Python's default ```string.lower()``` and ```string.upper()``` make the conversions
- "I" and "İ" to 'i'
- 'i' and 'ı' to 'I'
-respectively. However, in Turkish, 'I' and 'İ' are two different letters. 
-We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Huggingface's github repo about this bug. Until it is fixed, in case you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models).
-## Details and Contact
-You contact us to ask a question, open an issue or give feedback via our github [repo](https://github.com/Loodos/turkish-language-models).
-## Acknowledgments
-Many thanks to TFRC Team for providing us cloud TPUs on Tensorflow Research Cloud to train our models.
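Note on the "Notes on Tokenizers" section of the removed card above: the two effects it describes can be reproduced with the Python standard library alone. The sketch below shows the default casing of the Turkish I letters, the effect of NFD normalization on "ş", and a small Turkish-aware lowercasing helper. `turkish_lower` is a hypothetical helper written for this note; it is not the `TextNormalization` module from the Loodos repository.

```python
import unicodedata

# Turkish pairs the dotless letters I/ı and the dotted letters İ/i.
_TR_LOWER = str.maketrans({"I": "ı", "İ": "i"})


def turkish_lower(text: str) -> str:
    """Lowercase text using the Turkish I/ı and İ/i pairing (illustrative helper)."""
    return text.translate(_TR_LOWER).lower()


# Default casing is locale-independent: "I" becomes dotted "i",
# and "İ" becomes "i" followed by U+0307 (combining dot above).
default_i = "I".lower()
default_dotted = "İ".lower()

# NFD splits "ş" into "s" plus a combining cedilla; NFC recomposes it.
nfd = unicodedata.normalize("NFD", "şanlıurfa")
nfc = unicodedata.normalize("NFC", nfd)

print(turkish_lower("ISPARTA İZMİR"))
print(len("şanlıurfa"), len(nfd), nfc == "şanlıurfa")
```

A vocabulary built from NFC text contains the single code point "ş", so a tokenizer that first applies NFD looks up a different character sequence, which is the mismatch the card describes.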
--- a/model_cards/loodos/bert-base-turkish-uncased/README.md
+++ b/model_cards/loodos/bert-base-turkish-uncased/README.md
---
-language: tr
---
-# Turkish Language Models with Huggingface's Transformers
-As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. More details about pretrained models and evaluations on downstream tasks can be found [here (our repo)](https://github.com/Loodos/turkish-language-models).
-# Turkish BERT-Base (uncased)
-This is BERT-Base model which has 12 encoder layers with 768 hidden layer size trained on uncased Turkish dataset.
-## Usage
-Using AutoModel and AutoTokenizer from Transformers, you can import the model as described below.
-```python
-from transformers import AutoModel, AutoTokenizer
-tokenizer = AutoTokenizer.from_pretrained("loodos/bert-base-turkish-uncased", do_lower_case=False)
-model = AutoModel.from_pretrained("loodos/bert-base-turkish-uncased")
-normalizer = TextNormalization()
-normalized_text = normalizer.normalize(text, do_lower_case=True, is_turkish=True)
-tokenizer.tokenize(normalized_text)
-```
-### Notes on Tokenizers
-Currently, Huggingface's tokenizers (which were written in Python) have a bug concerning letters "ı, i, I, İ" and non-ASCII Turkish specific letters. There are two reasons.
-1- Vocabulary and sentence piece model is created with NFC/NFKC normalization but tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information. Some tokens are never trained.(like "şanlıurfa", "öğün", "çocuk" etc.) NFD/NFKD normalization is not proper for Turkish.
-2- Python's default ```string.lower()``` and ```string.upper()``` make the conversions
- "I" and "İ" to 'i'
- 'i' and 'ı' to 'I'
-respectively. However, in Turkish, 'I' and 'İ' are two different letters. 
-We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Huggingface's github repo about this bug. Until it is fixed, in case you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models).
-## Details and Contact
-You contact us to ask a question, open an issue or give feedback via our github [repo](https://github.com/Loodos/turkish-language-models).
-## Acknowledgments
-Many thanks to TFRC Team for providing us cloud TPUs on Tensorflow Research Cloud to train our models.
--- a/model_cards/loodos/electra-base-turkish-64k-uncased-discriminator/README.md
+++ b/model_cards/loodos/electra-base-turkish-64k-uncased-discriminator/README.md
---
-language: tr
---
-# Turkish Language Models with Huggingface's Transformers
-As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. More details about pretrained models and evaluations on downstream tasks can be found [here (our repo)](https://github.com/Loodos/turkish-language-models).
-# Turkish ELECTRA-Base-discriminator (uncased/64k)
-This is ELECTRA-Base model's discriminator which has the same structure with BERT-Base trained on uncased Turkish dataset. This version has a vocab of size 64k, different from default 32k.
-## Usage
-Using AutoModelWithLMHead and AutoTokenizer from Transformers, you can import the model as described below.
-```python
-from transformers import AutoModel, AutoModelWithLMHead
-tokenizer = AutoTokenizer.from_pretrained("loodos/electra-base-turkish-64k-uncased-discriminator", do_lower_case=False)
-model = AutoModelWithLMHead.from_pretrained("loodos/electra-base-turkish-64k-uncased-discriminator")
-normalizer = TextNormalization()
-normalized_text = normalizer.normalize(text, do_lower_case=True, is_turkish=True)
-tokenizer.tokenize(normalized_text)
-```
-### Notes on Tokenizers
-Currently, Huggingface's tokenizers (which were written in Python) have a bug concerning letters "ı, i, I, İ" and non-ASCII Turkish specific letters. There are two reasons.
-1- Vocabulary and sentence piece model is created with NFC/NFKC normalization but tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information. Some tokens are never trained.(like "şanlıurfa", "öğün", "çocuk" etc.) NFD/NFKD normalization is not proper for Turkish.
-2- Python's default ```string.lower()``` and ```string.upper()``` make the conversions
- "I" and "İ" to 'i'
- 'i' and 'ı' to 'I'
-respectively. However, in Turkish, 'I' and 'İ' are two different letters. 
-We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Huggingface's github repo about this bug. Until it is fixed, in case you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models).
-## Details and Contact
-You contact us to ask a question, open an issue or give feedback via our github [repo](https://github.com/Loodos/turkish-language-models).
-## Acknowledgments
-Many thanks to TFRC Team for providing us cloud TPUs on Tensorflow Research Cloud to train our models.
--- a/model_cards/loodos/electra-base-turkish-uncased-discriminator/README.md
+++ b/model_cards/loodos/electra-base-turkish-uncased-discriminator/README.md
---
-language: tr
---
-# Turkish Language Models with Huggingface's Transformers
-As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. More details about pretrained models and evaluations on downstream tasks can be found [here (our repo)](https://github.com/Loodos/turkish-language-models).
-# Turkish ELECTRA-Base-discriminator (uncased)
-This is ELECTRA-Base model's discriminator which has the same structure with BERT-Base trained on uncased Turkish dataset.
-## Usage
-Using AutoModelWithLMHead and AutoTokenizer from Transformers, you can import the model as described below.
-```python
-from transformers import AutoModel, AutoModelWithLMHead
-tokenizer = AutoTokenizer.from_pretrained("loodos/electra-base-turkish-uncased-discriminator", do_lower_case=False)
-model = AutoModelWithLMHead.from_pretrained("loodos/electra-base-turkish-uncased-discriminator")
-normalizer = TextNormalization()
-normalized_text = normalizer.normalize(text, do_lower_case=True, is_turkish=True)
-tokenizer.tokenize(normalized_text)
-```
-### Notes on Tokenizers
-Currently, Huggingface's tokenizers (which were written in Python) have a bug concerning letters "ı, i, I, İ" and non-ASCII Turkish specific letters. There are two reasons.
-1- Vocabulary and sentence piece model is created with NFC/NFKC normalization but tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information. Some tokens are never trained.(like "şanlıurfa", "öğün", "çocuk" etc.) NFD/NFKD normalization is not proper for Turkish.
-2- Python's default ```string.lower()``` and ```string.upper()``` make the conversions
- "I" and "İ" to 'i'
- 'i' and 'ı' to 'I'
-respectively. However, in Turkish, 'I' and 'İ' are two different letters. 
-We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Huggingface's github repo about this bug. Until it is fixed, in case you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models).
-## Details and Contact
-You contact us to ask a question, open an issue or give feedback via our github [repo](https://github.com/Loodos/turkish-language-models).
-## Acknowledgments
-Many thanks to TFRC Team for providing us cloud TPUs on Tensorflow Research Cloud to train our models.
--- a/model_cards/loodos/electra-small-turkish-cased-discriminator/README.md
+++ b/model_cards/loodos/electra-small-turkish-cased-discriminator/README.md
---
-language: tr
---
-# Turkish Language Models with Huggingface's Transformers
-As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. More details about pretrained models and evaluations on downstream tasks can be found [here (our repo)](https://github.com/Loodos/turkish-language-models).
-# Turkish ELECTRA-Small-discriminator (cased)
-This is ELECTRA-Small model's discriminator which has 12 encoder layers with 256 hidden layers size trained on cased Turkish dataset.
-## Usage
-Using AutoModelWithLMHead and AutoTokenizer from Transformers, you can import the model as described below.
-```python
-from transformers import AutoModel, AutoModelWithLMHead
-tokenizer = AutoTokenizer.from_pretrained("loodos/electra-small-turkish-cased-discriminator")
-model = AutoModelWithLMHead.from_pretrained("loodos/electra-small-turkish-cased-discriminator")
-```
-## Details and Contact
-You contact us to ask a question, open an issue or give feedback via our github [repo](https://github.com/Loodos/turkish-language-models).
-## Acknowledgments
-Many thanks to the TFRC team for providing us with Cloud TPUs on the TensorFlow Research Cloud to train our models.
--- a/model_cards/loodos/electra-small-turkish-uncased-discriminator/README.md
+++ b/model_cards/loodos/electra-small-turkish-uncased-discriminator/README.md
---
-language: tr
---
-# Turkish Language Models with Huggingface's Transformers
-As the R&D team at Loodos, we release cased and uncased versions of the most recent language models for Turkish. More details about the pretrained models and their evaluation on downstream tasks can be found [here (our repo)](https://github.com/Loodos/turkish-language-models).
-# Turkish ELECTRA-Small-discriminator (uncased)
-This is the discriminator of the ELECTRA-Small model, which has 12 encoder layers with a hidden size of 256, trained on an uncased Turkish dataset.
-## Usage
-Using AutoModelWithLMHead and AutoTokenizer from Transformers, you can import the model as described below.
-```python
-from transformers import AutoTokenizer, AutoModelWithLMHead
-tokenizer = AutoTokenizer.from_pretrained("loodos/electra-small-turkish-uncased-discriminator", do_lower_case=False)
-model = AutoModelWithLMHead.from_pretrained("loodos/electra-small-turkish-uncased-discriminator")
-# TextNormalization is the text normalization module from the Loodos repo (see the notes below); `text` is your input string
-normalizer = TextNormalization()
-normalized_text = normalizer.normalize(text, do_lower_case=True, is_turkish=True)
-tokenizer.tokenize(normalized_text)
-```
-### Notes on Tokenizers
-Currently, Huggingface's tokenizers (which were written in Python) have a bug concerning the letters "ı, i, I, İ" and the non-ASCII Turkish-specific letters. There are two causes.
-1- The vocabulary and the SentencePiece model are created with NFC/NFKC normalization, but the tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains the Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information. Some tokens (such as "şanlıurfa", "öğün" and "çocuk") are never trained. NFD/NFKD normalization is not suitable for Turkish.
-2- Python's default ```string.lower()``` and ```string.upper()``` make the conversions
- "I" and "İ" to 'i'
- 'i' and 'ı' to 'I'
-respectively. However, in Turkish, 'I' and 'İ' are two different letters, each with its own lowercase form ('ı' and 'i').
-We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Huggingface's GitHub repo about this bug. Until it is fixed, if you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models).
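The two causes described above can be reproduced with the Python standard library alone. The snippet below is a minimal sketch: `turkish_lower`, `turkish_upper` and `normalize_turkish` are hypothetical helpers written for illustration and are not the `TextNormalization` module from the Loodos repo, whose implementation may differ.

```python
import unicodedata

# Python's case mapping is locale-independent, so Turkish pairs are lost.
print("I".lower())                # 'i' (Turkish expects dotless 'ı')
print(len("İ".lower()))           # 2: 'i' followed by U+0307, a combining dot
print("ı".upper(), "i".upper())   # both print 'I'

# NFD splits each Turkish letter into a base letter plus a combining mark,
# so the text no longer matches a vocabulary that was built from NFC text.
for ch in "çöşğü":
    decomposed = unicodedata.normalize("NFD", ch)
    print(ch, [f"U+{ord(c):04X}" for c in decomposed])


def turkish_lower(text: str) -> str:
    """Hypothetical helper: map I->ı and İ->i before the generic lower()."""
    return text.replace("I", "ı").replace("İ", "i").lower()


def turkish_upper(text: str) -> str:
    """Hypothetical helper: map i->İ and ı->I before the generic upper()."""
    return text.replace("i", "İ").replace("ı", "I").upper()


def normalize_turkish(text: str, do_lower_case: bool = True) -> str:
    """Hypothetical helper: recompose to NFC, then apply Turkish lowercasing."""
    text = unicodedata.normalize("NFC", text)
    return turkish_lower(text) if do_lower_case else text


print(turkish_lower("ISPARTA İZMİR"))   # ısparta izmir
print(turkish_upper("ısparta izmir"))   # ISPARTA İZMİR
```

Mapping the two I/İ pairs before calling `lower()` or `upper()` keeps every other character on Python's default case mapping, which is enough for this illustration.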
-## Details and Contact
-You can contact us to ask a question, open an issue, or give feedback via our GitHub [repo](https://github.com/Loodos/turkish-language-models).
-## Acknowledgments
-Many thanks to the TFRC team for providing us with Cloud TPUs on the TensorFlow Research Cloud to train our models.
--- a/model_cards/lordtt13/COVID-SciBERT/README.md
+++ b/model_cards/lordtt13/COVID-SciBERT/README.md
---
-language: en
-inference: false
---
-## COVID-SciBERT: A small language modelling expansion of SciBERT, a BERT model trained on scientific text.
-### Details of SciBERT
-The **SciBERT** model was presented in [SciBERT: A Pretrained Language Model for Scientific Text](https://arxiv.org/abs/1903.10676) by *Iz Beltagy, Kyle Lo, Arman Cohan* and here is the abstract:
-Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT (Devlin et al., 2018) to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks.
-### Details of the downstream task (Language Modeling) - Dataset 📚
-There are actually two datasets that have been used here:
 The original SciBERT model is trained on papers from the corpus of [semanticscholar.org](https://semanticscholar.org). Corpus size is 1.14M papers, 3.1B tokens. They used the full text of the papers in training, not just abstracts. SciBERT has its own vocabulary (scivocab) that's built to best match the training corpus.
- The expansion is done using the papers present in the [COVID-19 Open Research Dataset Challenge (CORD-19)](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge). Only the abstracts have been used and vocabulary was pruned and added to the existing scivocab. In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 200,000 scholarly articles, including over 100,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, making it difficult for the medical research community to keep up.
-### Model training
-The training script is present [here](https://github.com/lordtt13/word-embeddings/blob/master/COVID-19%20Research%20Data/COVID-SciBERT.ipynb).
-### Pipelining the Model
-```python
-import transformers
-model = transformers.AutoModelWithLMHead.from_pretrained('lordtt13/COVID-SciBERT')
-tokenizer = transformers.AutoTokenizer.from_pretrained('lordtt13/COVID-SciBERT')
-nlp_fill = transformers.pipeline('fill-mask', model = model, tokenizer = tokenizer)
-nlp_fill('Coronavirus or COVID-19 can be prevented by a' + nlp_fill.tokenizer.mask_token)
-# Output:
-# [{'sequence': '[CLS] coronavirus or covid - 19 can be prevented by a combination [SEP]',
-#   'score': 0.1719885915517807,
-#   'token': 2702},
-#  {'sequence': '[CLS] coronavirus or covid - 19 can be prevented by a simple [SEP]',
-#   'score': 0.054218728095293045,
-#   'token': 2177},
-#  {'sequence': '[CLS] coronavirus or covid - 19 can be prevented by a novel [SEP]',
-#   'score': 0.043364267796278,
-#   'token': 3045},
-#  {'sequence': '[CLS] coronavirus or covid - 19 can be prevented by a high [SEP]',
-#   'score': 0.03732519596815109,
-#   'token': 597},
-#  {'sequence': '[CLS] coronavirus or covid - 19 can be prevented by a vaccine [SEP]',
-#   'score': 0.021863549947738647,
-#   'token': 7039}]
-```
-> Created by [Tanmay Thakur](https://github.com/lordtt13) | [LinkedIn](https://www.linkedin.com/in/tanmay-thakur-6bb5a9154/)
-> PS: Still looking for more resources to expand my expansion!
--- a/model_cards/lordtt13/emo-mobilebert/README.md
+++ b/model_cards/lordtt13/emo-mobilebert/README.md
---
-language: en
-datasets:
- emo
---
-## Emo-MobileBERT: a thin version of BERT LARGE, trained on the EmoContext Dataset from scratch
-### Details of MobileBERT
-The **MobileBERT** model was presented in [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by *Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou* and here is the abstract:
-Natural Language Processing (NLP) has recently achieved great success by using huge pre-trained models with hundreds of millions of parameters. However, these models suffer from heavy model sizes and high latency such that they cannot be deployed to resource-limited mobile devices. In this paper, we propose MobileBERT for compressing and accelerating the popular BERT model. Like the original BERT, MobileBERT is task-agnostic, that is, it can be generically applied to various downstream NLP tasks via simple fine-tuning. Basically, MobileBERT is a thin version of BERT_LARGE, while equipped with bottleneck structures and a carefully designed balance between self-attentions and feed-forward networks. To train MobileBERT, we first train a specially designed teacher model, an inverted-bottleneck incorporated BERT_LARGE model. Then, we conduct knowledge transfer from this teacher to MobileBERT. Empirical studies show that MobileBERT is 4.3x smaller and 5.5x faster than BERT_BASE while achieving competitive results on well-known benchmarks. On the natural language inference tasks of GLUE, MobileBERT achieves a GLUE score of 77.7 (0.6 lower than BERT_BASE), and 62 ms latency on a Pixel 4 phone. On the SQuAD v1.1/v2.0 question answering task, MobileBERT achieves a dev F1 score of 90.0/79.2 (1.5/2.1 higher than BERT_BASE).
-### Details of the downstream task (Emotion Recognition) - Dataset 📚
-SemEval-2019 Task 3: EmoContext Contextual Emotion Detection in Text
-In this dataset, given a textual dialogue i.e. an utterance along with two previous turns of context, the goal was to infer the underlying emotion of the utterance by choosing from four emotion classes:
- - sad 😢
- - happy 😃
- - angry 😡
- - others
-### Model training
-The training script is present [here](https://github.com/lordtt13/transformers-experiments/blob/master/Custom%20Tasks/emo-mobilebert.ipynb).
-### Pipelining the Model
-```python
-from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
-tokenizer = AutoTokenizer.from_pretrained("lordtt13/emo-mobilebert")
-model = AutoModelForSequenceClassification.from_pretrained("lordtt13/emo-mobilebert")
-nlp_sentence_classif = pipeline('sentiment-analysis', model = model, tokenizer = tokenizer)
-nlp_sentence_classif("I've never had such a bad day in my life")
-# Output: [{'label': 'sad', 'score': 0.93153977394104}]
-```
-> Created by [Tanmay Thakur](https://github.com/lordtt13) | [LinkedIn](https://www.linkedin.com/in/tanmay-thakur-6bb5a9154/)
--- a/model_cards/lserinol/bert-turkish-question-answering/README.md
+++ b/model_cards/lserinol/bert-turkish-question-answering/README.md
---
-language: tr
---
-# bert-turkish-question-answering
-## Usage
-```python
-from transformers import pipeline
-nlp = pipeline('question-answering', model='lserinol/bert-turkish-question-answering', tokenizer='lserinol/bert-turkish-question-answering')
-nlp({
-    'question': "Ankara'da kaç ilçe vardır?",
-    'context': r"""Türkiye'nin başkenti Ankara'dır. Ülkenin en büyük idari birimleri illerdir ve 81 il vardır. Bu iller ilçelere ayrılmıştır, toplamda 973 ilçe mevcuttur."""
-})
-```
-```python
-from transformers import AutoTokenizer, AutoModelForQuestionAnswering
-import torch
-tokenizer = AutoTokenizer.from_pretrained("lserinol/bert-turkish-question-answering")
-model = AutoModelForQuestionAnswering.from_pretrained("lserinol/bert-turkish-question-answering")
-text = r"""
-Ankara'nın başkent ilan edilmesinin ardından (13 Ekim 1923) şehir hızla gelişmiş ve Türkiye'nin ikinci en kalabalık ili olmuştur.
-Türkiye Cumhuriyeti'nin ilk yıllarında ekonomisi tarım ve hayvancılığa dayanan ilin topraklarının yarısı hâlâ tarım amaçlı 
-kullanılmaktadır. Ekonomik etkinlik büyük oranda ticaret ve sanayiye dayalıdır. Tarım ve hayvancılığın ağırlığı ise giderek 
-azalmaktadır. Ankara ve civarındaki gerek kamu sektörü gerek özel sektör yatırımları, başka illerden büyük bir nüfus göçünü 
-teşvik etmiştir. Cumhuriyetin kuruluşundan günümüze, nüfusu ülke nüfusunun iki katı hızda artmıştır. Nüfusun yaklaşık dörtte 
-üçü hizmet sektörü olarak tanımlanabilecek memuriyet, ulaşım, haberleşme ve ticaret benzeri işlerde, dörtte biri sanayide, 
-%2'si ise tarım alanında çalışır. Sanayi, özellikle tekstil, gıda ve inşaat sektörlerinde yoğunlaşmıştır. Günümüzde ise en çok 
-savunma, metal ve motor sektörlerinde yatırım yapılmaktadır. Türkiye'nin en çok sayıda üniversiteye sahip ili olan Ankara'da 
-ayrıca, üniversite diplomalı kişi oranı ülke ortalamasının iki katıdır. Bu eğitimli nüfus, teknoloji ağırlıklı yatırımların 
-gereksinim duyduğu iş gücünü oluşturur. Ankara'dan otoyollar, demir yolu ve hava yoluyla Türkiye'nin diğer şehirlerine ulaşılır.
-Ankara aynı zamanda başkent olarak Türkiye Büyük Millet Meclisi (TBMM)'ye de ev sahipliği yapmaktadır.
-"""
-questions = [
-    "Ankara kaç yılında başkent oldu?",
-    "Ankara ne zaman başkent oldu?",
-    "Ankara'dan başka şehirlere nasıl ulaşılır?",
-    "TBMM neyin kısaltmasıdır?"
-]
-for question in questions:
-    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
-    input_ids = inputs["input_ids"].tolist()[0]
-    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
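-    outputs = model(**inputs)
-    # Index access works whether the model returns a plain tuple or a ModelOutput object
-    answer_start_scores, answer_end_scores = outputs[0], outputs[1]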
-    answer_start = torch.argmax(
-        answer_start_scores
-    )  # Get the most likely beginning of answer with the argmax of the score
-    answer_end = torch.argmax(answer_end_scores) + 1  # Get the most likely end of answer with the argmax of the score
-    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
-    print(f"Question: {question}")
-    print(f"Answer: {answer}\n")
-  ```
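The loop above takes the argmax of the start scores and of the end scores independently, so the predicted end can fall before the predicted start and the answer comes out empty. The following is a minimal sketch of a constrained span search on plain Python lists. The helper `best_span`, its `max_answer_len` parameter, and the toy tokens and scores are all assumptions made for illustration; they are not part of the model card or of the Transformers API.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) token indices, end exclusive, with start < end.

    Picks the pair that maximizes start score + end score, considering
    only spans of at most `max_answer_len` tokens.
    """
    best = (0, 1)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        last = min(s + max_answer_len, len(end_scores))
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e + 1)
    return best


# Toy example: made-up scores in which the highest end score (index 1)
# lies before the highest start score (index 8).
tokens = ["[CLS]", "ankara", "ne", "zaman", "başkent", "oldu", "?", "[SEP]",
          "13", "ekim", "1923", "[SEP]"]
start_scores = [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 1.0, 2.0, 0.0]
end_scores = [0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.5, 0.0]

# Independent argmax, as in the loop above: the slice is empty.
naive_start = start_scores.index(max(start_scores))
naive_end = end_scores.index(max(end_scores)) + 1
print(tokens[naive_start:naive_end])   # []

# Constrained search keeps the end at or after the start.
start, end = best_span(start_scores, end_scores)
print(tokens[start:end])               # ['13', 'ekim', '1923']
```

With real model outputs, the score tensors would first be converted to lists (for example with `.tolist()`) before being passed to such a helper.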