Unverified Commit 3552d0e0 authored by Julien Chaumond's avatar Julien Chaumond Committed by GitHub
Browse files

[model_cards] Migrate cards from this repo to model repos on huggingface.co (#9013)



* rm all model cards

* Update the .rst

@sgugger it is still not super crystal clear/streamlined so let me know if any ideas to make it simpler

* Add a rootlevel README.md with simple instructions/context

* Update docs/source/model_sharing.rst
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply suggestions from code review
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: default avatarPatrick von Platen <patrick.v.platen@gmail.com>

* make style

* rm all model cards
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: default avatarPatrick von Platen <patrick.v.platen@gmail.com>
parent 29e45979
---
language:
- en
- fr
- es
- de
- zh
tags:
- pytorch
- bert
- multilingual
- en
- fr
- es
- de
- zh
datasets: wikipedia
license: apache-2.0
inference: false
---
# bert-base-5lang-cased
This is a smaller version of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handles only 5 languages (en, fr, es, de and zh) instead of 104.
The model is therefore 30% smaller than the original one (124M parameters instead of 178M) but gives exactly the same representations for the above cited languages.
Starting from `bert-base-5lang-cased` will facilitate the deployment of your model on public cloud platforms while keeping similar results.
For instance, Google Cloud Platform requires that the model size on disk should be lower than 500 MB for serveless deployments (Cloud Functions / Cloud ML) which is not the case of the original `bert-base-multilingual-cased`.
For more information about the models size, memory footprint and loading time please refer to the table below:
| Model | Num parameters | Size | Memory | Loading time |
| ---------------------------- | -------------- | -------- | -------- | ------------ |
| bert-base-multilingual-cased | 178 million | 714 MB | 1400 MB | 4.2 sec |
| bert-base-5lang-cased | 124 million | 495 MB | 950 MB | 3.6 sec |
These measurements have been computed on a [Google Cloud n1-standard-1 machine (1 vCPU, 3.75 GB)](https://cloud.google.com/compute/docs/machine-types\#n1_machine_type).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("amine/bert-base-5lang-cased")
model = AutoModel.from_pretrained("amine/bert-base-5lang-cased")
```
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Mutlilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.
\ No newline at end of file
---
language: "fr"
---
# BelGPT-2
**BelGPT-2** (*Belgian GPT-2* 🇧🇪) is a "small" GPT-2 model pre-trained on a very large and heterogeneous French corpus (around 60Gb). Please check [antoiloui/gpt2-french](https://github.com/antoiloui/gpt2-french) for more information about the pre-trained model, the data, the code to use the model and the code to pre-train it.
## Using BelGPT-2 for Text Generation in French
You can use BelGPT-2 with [🤗 transformers](https://github.com/huggingface/transformers) library as follows:
```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
# Load pretrained model and tokenizer
model = GPT2LMHeadModel.from_pretrained("antoiloui/belgpt2")
tokenizer = GPT2Tokenizer.from_pretrained("antoiloui/belgpt2")
# Generate a sample of text
model.eval()
output = model.generate(
bos_token_id=random.randint(1,50000),
do_sample=True,
top_k=50,
max_length=100,
top_p=0.95,
num_return_sequences=1
)
# Decode it
decoded_output = []
for sample in output:
decoded_output.append(tokenizer.decode(sample, skip_special_tokens=True))
print(decoded_output)
```
## Data
Below is the list of all French copora used to pre-trained the model:
| Dataset | `$corpus_name` | Raw size | Cleaned size |
| :------| :--- | :---: | :---: |
| CommonCrawl | `common_crawl` | 200.2 GB | 40.4 GB |
| NewsCrawl | `news_crawl` | 10.4 GB | 9.8 GB |
| Wikipedia | `wiki` | 19.4 GB | 4.1 GB |
| Wikisource | `wikisource` | 4.6 GB | 2.3 GB |
| Project Gutenberg | `gutenberg` | 1.3 GB | 1.1 GB |
| EuroParl | `europarl` | 289.9 MB | 278.7 MB |
| NewsCommentary | `news_commentary` | 61.4 MB | 58.1 MB |
| **Total** | | **236.3 GB** | **57.9 GB** |
# BERT L-10 H-512 fine-tuned on MLM (CORD-19 2020/06/16)
BERT model with [10 Transformer layers and hidden embedding of size 512](https://huggingface.co/google/bert_uncased_L-10_H-512_A-8), referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962), fine-tuned for MLM on CORD-19 dataset (as released on 2020/06/16).
## Training the model
```bash
python run_language_modeling.py
--model_type bert
--model_name_or_path google/bert_uncased_L-10_H-512_A-8
--do_train
--train_data_file {cord19-200616-dataset}
--mlm
--mlm_probability 0.2
--line_by_line
--block_size 512
--per_device_train_batch_size 10
--learning_rate 3e-5
--num_train_epochs 2
--output_dir bert_uncased_L-10_H-512_A-8_cord19-200616
---
datasets:
- squad_v2
---
# BERT L-10 H-512 CORD-19 (2020/06/16) fine-tuned on SQuAD v2.0
BERT model with [10 Transformer layers and hidden embedding of size 512](https://huggingface.co/google/bert_uncased_L-10_H-512_A-8), referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962), [fine-tuned for MLM](https://huggingface.co/aodiniz/bert_uncased_L-10_H-512_A-8_cord19-200616) on CORD-19 dataset (as released on 2020/06/16) and fine-tuned for QA on SQuAD v2.0.
## Training the model
```bash
python run_squad.py
--model_type bert
--model_name_or_path aodiniz/bert_uncased_L-10_H-512_A-8_cord19-200616
--train_file 'train-v2.0.json'
--predict_file 'dev-v2.0.json'
--do_train
--do_eval
--do_lower_case
--version_2_with_negative
--max_seq_length 384
--per_gpu_train_batch_size 10
--learning_rate 3e-5
--num_train_epochs 2
--output_dir bert_uncased_L-10_H-512_A-8_cord19-200616_squad2
# BERT L-2 H-512 fine-tuned on MLM (CORD-19 2020/06/16)
BERT model with [2 Transformer layers and hidden embedding of size 512](https://huggingface.co/google/bert_uncased_L-2_H-512_A-8), referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962), fine-tuned for MLM on CORD-19 dataset (as released on 2020/06/16).
## Training the model
```bash
python run_language_modeling.py
--model_type bert
--model_name_or_path google/bert_uncased_L-2_H-512_A-8
--do_train
--train_data_file {cord19-200616-dataset}
--mlm
--mlm_probability 0.2
--line_by_line
--block_size 512
--per_device_train_batch_size 20
--learning_rate 3e-5
--num_train_epochs 2
--output_dir bert_uncased_L-2_H-512_A-8_cord19-200616
# BERT L-4 H-256 fine-tuned on MLM (CORD-19 2020/06/16)
BERT model with [4 Transformer layers and hidden embedding of size 256](https://huggingface.co/google/bert_uncased_L-4_H-256_A-4), referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962), fine-tuned for MLM on CORD-19 dataset (as released on 2020/06/16).
## Training the model
```bash
python run_language_modeling.py
--model_type bert
--model_name_or_path google/bert_uncased_L-4_H-256_A-4
--do_train
--train_data_file {cord19-200616-dataset}
--mlm
--mlm_probability 0.2
--line_by_line
--block_size 256
--per_device_train_batch_size 20
--learning_rate 3e-5
--num_train_epochs 2
--output_dir bert_uncased_L-4_H-256_A-4_cord19-200616
---
language: ar
datasets:
- oscar
- wikipedia
---
# Arabic BERT Model
Pretrained BERT base language model for Arabic
_If you use this model in your work, please cite this paper:_
<!--```
@inproceedings{
title={KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media},
author={Safaya, Ali and Abdullatif, Moutasem and Yuret, Deniz},
booktitle={Proceedings of the International Workshop on Semantic Evaluation (SemEval)},
year={2020}
}
```-->
```
@misc{safaya2020kuisail,
title={KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media},
author={Ali Safaya and Moutasem Abdullatif and Deniz Yuret},
year={2020},
eprint={2007.13184},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
## Pretraining Corpus
`arabic-bert-base` model was pretrained on ~8.2 Billion words:
- Arabic version of [OSCAR](https://traces1.inria.fr/oscar/) - filtered from [Common Crawl](http://commoncrawl.org/)
- Recent dump of Arabic [Wikipedia](https://dumps.wikimedia.org/backup-index.html)
and other Arabic resources which sum up to ~95GB of text.
__Notes on training data:__
- Our final version of corpus contains some non-Arabic words inlines, which we did not remove from sentences since that would affect some tasks like NER.
- Although non-Arabic characters were lowered as a preprocessing step, since Arabic characters does not have upper or lower case, there is no cased and uncased version of the model.
- The corpus and vocabulary set are not restricted to Modern Standard Arabic, they contain some dialectical Arabic too.
## Pretraining details
- This model was trained using Google BERT's github [repository](https://github.com/google-research/bert) on a single TPU v3-8 provided for free from [TFRC](https://www.tensorflow.org/tfrc).
- Our pretraining procedure follows training settings of bert with some changes: trained for 3M training steps with batchsize of 128, instead of 1M with batchsize of 256.
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and Huggingface library `transformers`. And you can use it directly by initializing it like this:
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")
model = AutoModel.from_pretrained("asafaya/bert-base-arabic")
```
## Results
For further details on the models performance or any other queries, please refer to [Arabic-BERT](https://github.com/alisafaya/Arabic-BERT)
## Acknowledgement
Thanks to Google for providing free TPU for the training process and for Huggingface for hosting this model on their servers 😊
---
language: ar
datasets:
- oscar
- wikipedia
---
# Arabic BERT Large Model
Pretrained BERT Large language model for Arabic
_If you use this model in your work, please cite this paper:_
<!--```
@inproceedings{
title={KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media},
author={Safaya, Ali and Abdullatif, Moutasem and Yuret, Deniz},
booktitle={Proceedings of the International Workshop on Semantic Evaluation (SemEval)},
year={2020}
}
```-->
```
@misc{safaya2020kuisail,
title={KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media},
author={Ali Safaya and Moutasem Abdullatif and Deniz Yuret},
year={2020},
eprint={2007.13184},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
## Pretraining Corpus
`arabic-bert-large` model was pretrained on ~8.2 Billion words:
- Arabic version of [OSCAR](https://traces1.inria.fr/oscar/) - filtered from [Common Crawl](http://commoncrawl.org/)
- Recent dump of Arabic [Wikipedia](https://dumps.wikimedia.org/backup-index.html)
and other Arabic resources which sum up to ~95GB of text.
__Notes on training data:__
- Our final version of corpus contains some non-Arabic words inlines, which we did not remove from sentences since that would affect some tasks like NER.
- Although non-Arabic characters were lowered as a preprocessing step, since Arabic characters does not have upper or lower case, there is no cased and uncased version of the model.
- The corpus and vocabulary set are not restricted to Modern Standard Arabic, they contain some dialectical Arabic too.
## Pretraining details
- This model was trained using Google BERT's github [repository](https://github.com/google-research/bert) on a single TPU v3-8 provided for free from [TFRC](https://www.tensorflow.org/tfrc).
- Our pretraining procedure follows training settings of bert with some changes: trained for 3M training steps with batchsize of 128, instead of 1M with batchsize of 256.
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and Huggingface library `transformers`. And you can use it directly by initializing it like this:
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-large-arabic")
model = AutoModel.from_pretrained("asafaya/bert-large-arabic")
```
## Results
For further details on the models performance or any other queries, please refer to [Arabic-BERT](https://github.com/alisafaya/Arabic-BERT)
## Acknowledgement
Thanks to Google for providing free TPU for the training process and for Huggingface for hosting this model on their servers 😊
---
language: ar
datasets:
- oscar
- wikipedia
---
# Arabic BERT Medium Model
Pretrained BERT Medium language model for Arabic
_If you use this model in your work, please cite this paper:_
<!--```
@inproceedings{
title={KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media},
author={Safaya, Ali and Abdullatif, Moutasem and Yuret, Deniz},
booktitle={Proceedings of the International Workshop on Semantic Evaluation (SemEval)},
year={2020}
}
```-->
```
@misc{safaya2020kuisail,
title={KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media},
author={Ali Safaya and Moutasem Abdullatif and Deniz Yuret},
year={2020},
eprint={2007.13184},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
## Pretraining Corpus
`arabic-bert-medium` model was pretrained on ~8.2 Billion words:
- Arabic version of [OSCAR](https://traces1.inria.fr/oscar/) - filtered from [Common Crawl](http://commoncrawl.org/)
- Recent dump of Arabic [Wikipedia](https://dumps.wikimedia.org/backup-index.html)
and other Arabic resources which sum up to ~95GB of text.
__Notes on training data:__
- Our final version of corpus contains some non-Arabic words inlines, which we did not remove from sentences since that would affect some tasks like NER.
- Although non-Arabic characters were lowered as a preprocessing step, since Arabic characters does not have upper or lower case, there is no cased and uncased version of the model.
- The corpus and vocabulary set are not restricted to Modern Standard Arabic, they contain some dialectical Arabic too.
## Pretraining details
- This model was trained using Google BERT's github [repository](https://github.com/google-research/bert) on a single TPU v3-8 provided for free from [TFRC](https://www.tensorflow.org/tfrc).
- Our pretraining procedure follows training settings of bert with some changes: trained for 3M training steps with batchsize of 128, instead of 1M with batchsize of 256.
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and Huggingface library `transformers`. And you can use it directly by initializing it like this:
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-medium-arabic")
model = AutoModel.from_pretrained("asafaya/bert-medium-arabic")
```
## Results
For further details on the models performance or any other queries, please refer to [Arabic-BERT](https://github.com/alisafaya/Arabic-BERT)
## Acknowledgement
Thanks to Google for providing free TPU for the training process and for Huggingface for hosting this model on their servers 😊
---
language: ar
datasets:
- oscar
- wikipedia
---
# Arabic BERT Mini Model
Pretrained BERT Mini language model for Arabic
_If you use this model in your work, please cite this paper:_
<!--```
@inproceedings{
title={KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media},
author={Safaya, Ali and Abdullatif, Moutasem and Yuret, Deniz},
booktitle={Proceedings of the International Workshop on Semantic Evaluation (SemEval)},
year={2020}
}
```-->
```
@misc{safaya2020kuisail,
title={KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media},
author={Ali Safaya and Moutasem Abdullatif and Deniz Yuret},
year={2020},
eprint={2007.13184},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
## Pretraining Corpus
`arabic-bert-mini` model was pretrained on ~8.2 Billion words:
- Arabic version of [OSCAR](https://traces1.inria.fr/oscar/) - filtered from [Common Crawl](http://commoncrawl.org/)
- Recent dump of Arabic [Wikipedia](https://dumps.wikimedia.org/backup-index.html)
and other Arabic resources which sum up to ~95GB of text.
__Notes on training data:__
- Our final version of corpus contains some non-Arabic words inlines, which we did not remove from sentences since that would affect some tasks like NER.
- Although non-Arabic characters were lowered as a preprocessing step, since Arabic characters does not have upper or lower case, there is no cased and uncased version of the model.
- The corpus and vocabulary set are not restricted to Modern Standard Arabic, they contain some dialectical Arabic too.
## Pretraining details
- This model was trained using Google BERT's github [repository](https://github.com/google-research/bert) on a single TPU v3-8 provided for free from [TFRC](https://www.tensorflow.org/tfrc).
- Our pretraining procedure follows training settings of bert with some changes: trained for 3M training steps with batchsize of 128, instead of 1M with batchsize of 256.
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and Huggingface library `transformers`. And you can use it directly by initializing it like this:
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-mini-arabic")
model = AutoModel.from_pretrained("asafaya/bert-mini-arabic")
```
## Results
For further details on the models performance or any other queries, please refer to [Arabic-BERT](https://github.com/alisafaya/Arabic-BERT)
## Acknowledgement
Thanks to Google for providing free TPU for the training process and for Huggingface for hosting this model on their servers 😊
---
language: gu
---
# Gujarati-XLM-R-Base
This model is finetuned over [XLM-RoBERTa](https://huggingface.co/xlm-roberta-base) (XLM-R) using its base variant with the Gujarati language using the [OSCAR](https://oscar-corpus.com/) monolingual dataset. We used the same masked language modelling (MLM) objective which was used for pretraining the XLM-R. As it is built over the pretrained XLM-R, we leveraged *Transfer Learning* by exploiting the knowledge from its parent model.
## Dataset
OSCAR corpus contains several diverse datasets for different languages. We followed the work of [CamemBERT](https://www.aclweb.org/anthology/2020.acl-main.645/) who reported better performance with this diverse dataset as compared to the other large homogenous datasets.
## Preprocessing and Training Procedure
Please visit [this link](https://github.com/ashwanitanwar/nmt-transfer-learning-xlm-r#6-finetuning-xlm-r) for the detailed procedure.
## Usage
- This model can be used for further finetuning for different NLP tasks using the Gujarati language.
- It can be used to generate contextualised word representations for the Gujarati words.
- It can be used for domain adaptation.
- It can be used to predict the missing words from the Gujarati sentences.
## Demo
### Using the model to predict missing words
```
from transformers import pipeline
unmasker = pipeline('fill-mask', model='ashwani-tanwar/Gujarati-XLM-R-Base')
pred_word = unmasker("અમદાવાદ એ ગુજરાતનું એક <mask> છે.")
print(pred_word)
```
```
[{'sequence': '<s> અમદાવાદ એ ગુજરાતનું એક શહેર છે.</s>', 'score': 0.9463568329811096, 'token': 85227, 'token_str': '▁શહેર'},
{'sequence': '<s> અમદાવાદ એ ગુજરાતનું એક ગામ છે.</s>', 'score': 0.013311690650880337, 'token': 66346, 'token_str': '▁ગામ'},
{'sequence': '<s> અમદાવાદ એ ગુજરાતનું એકનગર છે.</s>', 'score': 0.012945962138473988, 'token': 69702, 'token_str': 'નગર'},
{'sequence': '<s> અમદાવાદ એ ગુજરાતનું એક સ્થળ છે.</s>', 'score': 0.0045941537246108055, 'token': 135436, 'token_str': '▁સ્થળ'},
{'sequence': '<s> અમદાવાદ એ ગુજરાતનું એક મહત્વ છે.</s>', 'score': 0.00402021361514926, 'token': 126763, 'token_str': '▁મહત્વ'}]
```
### Using the model to generate contextualised word representations
```
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("ashwani-tanwar/Gujarati-XLM-R-Base")
model = AutoModel.from_pretrained("ashwani-tanwar/Gujarati-XLM-R-Base")
sentence = "અમદાવાદ એ ગુજરાતનું એક શહેર છે."
encoded_sentence = tokenizer(sentence, return_tensors='pt')
context_word_rep = model(**encoded_sentence)
```
---
language: ar
---
# AraBERT : Pre-training BERT for Arabic Language Understanding
<img src="https://github.com/aub-mind/arabert/blob/master/arabert_logo.png" width="100" align="left"/>
**AraBERT** is an Arabic pretrained lanaguage model based on [Google's BERT architechture](https://github.com/google-research/bert). AraBERT uses the same BERT-Base config. More details are available in the [AraBERT PAPER](https://arxiv.org/abs/2003.00104v2) and in the [AraBERT Meetup](https://github.com/WissamAntoun/pydata_khobar_meetup)
There are two version off the model AraBERTv0.1 and AraBERTv1, with the difference being that AraBERTv1 uses pre-segmented text where prefixes and suffixes were split using the [Farasa Segmenter](http://alt.qcri.org/farasa/segmenter.html).
The model was trained on ~70M sentences or ~23GB of Arabic text with ~3B words. The training corpora are a collection of publically available large scale raw arabic text ([Arabic Wikidumps](https://archive.org/details/arwiki-20190201), [The 1.5B words Arabic Corpus](https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4), [The OSIAN Corpus](https://www.aclweb.org/anthology/W19-4619), Assafir news articles, and 4 other manually crawled news websites (Al-Akhbar, Annahar, AL-Ahram, AL-Wafd) from [the Wayback Machine](http://web.archive.org/))
We evalaute both AraBERT models on different downstream tasks and compare it to [mBERT]((https://github.com/google-research/bert/blob/master/multilingual.md)), and other state of the art models (*To the extent of our knowledge*). The Tasks were Sentiment Analysis on 6 different datasets ([HARD](https://github.com/elnagara/HARD-Arabic-Dataset), [ASTD-Balanced](https://www.aclweb.org/anthology/D15-1299), [ArsenTD-Lev](https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf), [LABR](https://github.com/mohamedadaly/LABR), [ArSaS](http://lrec-conf.org/workshops/lrec2018/W30/pdf/22_W30.pdf)), Named Entity Recognition with the [ANERcorp](http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp), and Arabic Question Answering on [Arabic-SQuAD and ARCD](https://github.com/husseinmozannar/SOQAL)
**Update 2 (21/5/2020) :**
Added support for the farasapy segmenter https://github.com/MagedSaeed/farasapy in the ``preprocess_arabert.py`` which is ~6x faster than the ``py4j.java_gateway``, consider setting ``use_farasapy=True`` when calling preprocess and pass it an instance of ``FarasaSegmenter(interactive=True)`` with interactive set to ``True`` for faster segmentation.
**Update 1 (21/4/2020) :**
Fixed an issue with ARCD fine-tuning which drastically improved performance. Initially we didn't account for the change of the ```answer_start``` during preprocessing.
## Results (Acc.)
Task | prev. SOTA | mBERT | AraBERTv0.1 | AraBERTv1
---|:---:|:---:|:---:|:---:
HARD |95.7 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|95.7|**96.2**|96.1
ASTD |86.5 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)| 80.1|92.2|**92.6**
ArsenTD-Lev|52.4 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|51|58.9|**59.4**
AJGT|93 [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)| 83.6|93.1|**93.8**
LABR|**87.5** [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)|83|85.9|86.7
ANERcorp|81.7 (BiLSTM-CRF)|78.4|**84.2**|81.9
ARCD|mBERT|EM:34.2 F1: 61.3|EM:51.14 F1:82.13|**EM:54.84 F1: 82.15**
*If you tested AraBERT on a public dataset and you want to add your results to the table above, open a pull request or contact us. Also make sure to have your code available online so we can add it as a reference*
## How to use
You can easily use AraBERT since it is almost fully compatible with existing codebases (Use this repo instead of the official BERT one, the only difference is in the ```tokenization.py``` file where we modify the _is_punctuation function to make it compatible with the "+" symbol and the "[" and "]" characters)
To use HuggingFace's Transformer repository you only need to provide a list of token that forces the model to not split them, also make sure that the text is pre-segmented:
**Not all libraries built on top of transformers support the `never_split` argument**
```python
from transformers import AutoTokenizer, AutoModel
from arabert.preprocess_arabert import never_split_tokens, preprocess
from farasa.segmenter import FarasaSegmenter
arabert_tokenizer = AutoTokenizer.from_pretrained(
"aubmindlab/bert-base-arabert",
do_lower_case=False,
do_basic_tokenize=True,
never_split=never_split_tokens)
arabert_model = AutoModel.from_pretrained("aubmindlab/bert-base-arabert")
#Preprocess the text to make it compatible with AraBERT using farasapy
farasa_segmenter = FarasaSegmenter(interactive=True)
#or you can use a py4j JavaGateway to the farasa Segmneter .jar but it's slower
#(see update 2)
#from py4j.java_gateway import JavaGateway
#gateway = JavaGateway.launch_gateway(classpath='./PATH_TO_FARASA/FarasaSegmenterJar.jar')
#farasa = gateway.jvm.com.qcri.farasa.segmenter.Farasa()
text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
text_preprocessed = preprocess( text,
do_farasa_tokenization = True,
farasa = farasa_segmenter,
use_farasapy = True)
>>>text_preprocessed: "و+ لن نبالغ إذا قل +نا إن هاتف أو كمبيوتر ال+ مكتب في زمن +نا هذا ضروري"
arabert_tokenizer.tokenize(text_preprocessed)
>>> ['و+', 'لن', 'نبال', '##غ', 'إذا', 'قل', '+نا', 'إن', 'هاتف', 'أو', 'كمبيوتر', 'ال+', 'مكتب', 'في', 'زمن', '+نا', 'هذا', 'ضروري']
```
**AraBERTv0.1 is compatible with all existing libraries, since it needs no pre-segmentation.**
```python
from transformers import AutoTokenizer, AutoModel
arabert_tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv01",do_lower_case=False)
arabert_model = AutoModel.from_pretrained("aubmindlab/bert-base-arabertv01")
text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
arabert_tokenizer.tokenize(text)
>>> ['ولن', 'ن', '##بالغ', 'إذا', 'قلنا', 'إن', 'هاتف', 'أو', 'كمبيوتر', 'المكتب', 'في', 'زمن', '##ن', '##ا', 'هذا', 'ضروري']
```
The ```araBERT_(Updated_Demo_TF).ipynb``` Notebook is a small demo using the AJGT dataset using TensorFlow (GPU and TPU compatible).
**Coming Soon :** Fine-tunning demo using HuggingFace's Trainer API
**AraBERT on ARCD**
During the preprocessing step the ```answer_start``` character position needs to be recalculated. You can use the file ```arcd_preprocessing.py``` as shown below to clean, preprocess the ARCD dataset before running ```run_squad.py```. More detailed Colab notebook is available in the [SOQAL repo](https://github.com/husseinmozannar/SOQAL).
```bash
python arcd_preprocessing.py \
--input_file="/PATH_TO/arcd-test.json" \
--output_file="arcd-test-pre.json" \
--do_farasa_tokenization=True \
--use_farasapy=True \
```
```bash
python SOQAL/bert/run_squad.py \
--vocab_file="/PATH_TO_PRETRAINED_TF_CKPT/vocab.txt" \
--bert_config_file="/PATH_TO_PRETRAINED_TF_CKPT/config.json" \
--init_checkpoint="/PATH_TO_PRETRAINED_TF_CKPT/" \
--do_train=True \
--train_file=turk_combined_all_pre.json \
--do_predict=True \
--predict_file=arcd-test-pre.json \
--train_batch_size=32 \
--predict_batch_size=24 \
--learning_rate=3e-5 \
--num_train_epochs=4 \
--max_seq_length=384 \
--doc_stride=128 \
--do_lower_case=False\
--output_dir="/PATH_TO/OUTPUT_PATH"/ \
--use_tpu=True \
--tpu_name=$TPU_ADDRESS \
```
## Model Weights and Vocab Download
Models | AraBERTv0.1 | AraBERTv1
---|:---:|:---:
TensorFlow|[Drive Link](https://drive.google.com/open?id=1-kVmTUZZ4DP2rzeHNjTPkY8OjnQCpomO) | [Drive Link](https://drive.google.com/open?id=1-d7-9ljKgDJP5mx73uBtio-TuUZCqZnt)
PyTorch| [Drive_Link](https://drive.google.com/open?id=1-_3te42mQCPD8SxwZ3l-VBL7yaJH-IOv)| [Drive_Link](https://drive.google.com/open?id=1-69s6Pxqbi63HOQ1M9wTcr-Ovc6PWLLo)
**You can find the PyTorch models in HuggingFace's Transformer Library under the ```aubmindlab``` username**
## If you used this model please cite us as:
```
@inproceedings{antoun2020arabert,
title={AraBERT: Transformer-based Model for Arabic Language Understanding},
author={Antoun, Wissam and Baly, Fady and Hajj, Hazem},
booktitle={LREC 2020 Workshop Language Resources and Evaluation Conference 11--16 May 2020},
pages={9}
}
```
## Acknowledgments
Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs, couldn't have done it without this program, and to the [AUB MIND Lab](https://sites.aub.edu.lb/mindlab/) Members for the continous support. Also thanks to [Yakshof](https://www.yakshof.com/#/) and Assafir for data and storage access. Another thanks for Habib Rahal (https://www.behance.net/rahalhabib), for putting a face to AraBERT.
## Contacts
**Wissam Antoun**: [Linkedin](https://www.linkedin.com/in/giulio-ravasio-3a81a9110/) | [Twitter](https://twitter.com/wissam_antoun) | [Github](https://github.com/WissamAntoun) | <wfa07@mail.aub.edu> | <wissam.antoun@gmail.com>
**Fady Baly**: [Linkedin](https://www.linkedin.com/in/fadybaly/) | [Twitter](https://twitter.com/fadybaly) | [Github](https://github.com/fadybaly) | <fgb06@mail.aub.edu> | <baly.fady@gmail.com>
---
language: ar
---
# AraBERT : Pre-training BERT for Arabic Language Understanding
<img src="https://github.com/aub-mind/arabert/blob/master/arabert_logo.png" width="100" align="left"/>
**AraBERT** is an Arabic pretrained lanaguage model based on [Google's BERT architechture](https://github.com/google-research/bert). AraBERT uses the same BERT-Base config. More details are available in the [AraBERT PAPER](https://arxiv.org/abs/2003.00104v2) and in the [AraBERT Meetup](https://github.com/WissamAntoun/pydata_khobar_meetup)
There are two version off the model AraBERTv0.1 and AraBERTv1, with the difference being that AraBERTv1 uses pre-segmented text where prefixes and suffixes were split using the [Farasa Segmenter](http://alt.qcri.org/farasa/segmenter.html).
The model was trained on ~70M sentences or ~23GB of Arabic text with ~3B words. The training corpora are a collection of publically available large scale raw arabic text ([Arabic Wikidumps](https://archive.org/details/arwiki-20190201), [The 1.5B words Arabic Corpus](https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4), [The OSIAN Corpus](https://www.aclweb.org/anthology/W19-4619), Assafir news articles, and 4 other manually crawled news websites (Al-Akhbar, Annahar, AL-Ahram, AL-Wafd) from [the Wayback Machine](http://web.archive.org/))
We evalaute both AraBERT models on different downstream tasks and compare it to [mBERT]((https://github.com/google-research/bert/blob/master/multilingual.md)), and other state of the art models (*To the extent of our knowledge*). The Tasks were Sentiment Analysis on 6 different datasets ([HARD](https://github.com/elnagara/HARD-Arabic-Dataset), [ASTD-Balanced](https://www.aclweb.org/anthology/D15-1299), [ArsenTD-Lev](https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf), [LABR](https://github.com/mohamedadaly/LABR), [ArSaS](http://lrec-conf.org/workshops/lrec2018/W30/pdf/22_W30.pdf)), Named Entity Recognition with the [ANERcorp](http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp), and Arabic Question Answering on [Arabic-SQuAD and ARCD](https://github.com/husseinmozannar/SOQAL)
**Update 2 (21/5/2020) :**
Added support for the farasapy segmenter https://github.com/MagedSaeed/farasapy in the ``preprocess_arabert.py`` which is ~6x faster than the ``py4j.java_gateway``, consider setting ``use_farasapy=True`` when calling preprocess and pass it an instance of ``FarasaSegmenter(interactive=True)`` with interactive set to ``True`` for faster segmentation.
**Update 1 (21/4/2020) :**
Fixed an issue with ARCD fine-tuning which drastically improved performance. Initially we didn't account for the change of the ```answer_start``` during preprocessing.
## Results (Acc.)
Task | prev. SOTA | mBERT | AraBERTv0.1 | AraBERTv1
---|:---:|:---:|:---:|:---:
HARD |95.7 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|95.7|**96.2**|96.1
ASTD |86.5 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)| 80.1|92.2|**92.6**
ArsenTD-Lev|52.4 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|51|58.9|**59.4**
AJGT|93 [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)| 83.6|93.1|**93.8**
LABR|**87.5** [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)|83|85.9|86.7
ANERcorp|81.7 (BiLSTM-CRF)|78.4|**84.2**|81.9
ARCD|mBERT|EM:34.2 F1: 61.3|EM:51.14 F1:82.13|**EM:54.84 F1: 82.15**
*If you tested AraBERT on a public dataset and you want to add your results to the table above, open a pull request or contact us. Also make sure to have your code available online so we can add it as a reference*
## How to use
You can easily use AraBERT since it is almost fully compatible with existing codebases (Use this repo instead of the official BERT one, the only difference is in the ```tokenization.py``` file where we modify the _is_punctuation function to make it compatible with the "+" symbol and the "[" and "]" characters)
To use HuggingFace's Transformer repository you only need to provide a list of token that forces the model to not split them, also make sure that the text is pre-segmented:
**Not all libraries built on top of transformers support the `never_split` argument**
```python
from transformers import AutoTokenizer, AutoModel
from arabert.preprocess_arabert import never_split_tokens, preprocess
from farasa.segmenter import FarasaSegmenter
arabert_tokenizer = AutoTokenizer.from_pretrained(
"aubmindlab/bert-base-arabert",
do_lower_case=False,
do_basic_tokenize=True,
never_split=never_split_tokens)
arabert_model = AutoModel.from_pretrained("aubmindlab/bert-base-arabert")
#Preprocess the text to make it compatible with AraBERT using farasapy
farasa_segmenter = FarasaSegmenter(interactive=True)
#or you can use a py4j JavaGateway to the farasa Segmneter .jar but it's slower
#(see update 2)
#from py4j.java_gateway import JavaGateway
#gateway = JavaGateway.launch_gateway(classpath='./PATH_TO_FARASA/FarasaSegmenterJar.jar')
#farasa = gateway.jvm.com.qcri.farasa.segmenter.Farasa()
text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
text_preprocessed = preprocess( text,
do_farasa_tokenization = True,
farasa = farasa_segmenter,
use_farasapy = True)
>>>text_preprocessed: "و+ لن نبالغ إذا قل +نا إن هاتف أو كمبيوتر ال+ مكتب في زمن +نا هذا ضروري"
arabert_tokenizer.tokenize(text_preprocessed)
>>> ['و+', 'لن', 'نبال', '##غ', 'إذا', 'قل', '+نا', 'إن', 'هاتف', 'أو', 'كمبيوتر', 'ال+', 'مكتب', 'في', 'زمن', '+نا', 'هذا', 'ضروري']
```
**AraBERTv0.1 is compatible with all existing libraries, since it needs no pre-segmentation.**
```python
from transformers import AutoTokenizer, AutoModel
arabert_tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv01",do_lower_case=False)
arabert_model = AutoModel.from_pretrained("aubmindlab/bert-base-arabertv01")
text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
arabert_tokenizer.tokenize(text)
>>> ['ولن', 'ن', '##بالغ', 'إذا', 'قلنا', 'إن', 'هاتف', 'أو', 'كمبيوتر', 'المكتب', 'في', 'زمن', '##ن', '##ا', 'هذا', 'ضروري']
```
The ```araBERT_(Updated_Demo_TF).ipynb``` Notebook is a small demo using the AJGT dataset using TensorFlow (GPU and TPU compatible).
**Coming Soon :** Fine-tunning demo using HuggingFace's Trainer API
**AraBERT on ARCD**
During the preprocessing step the ```answer_start``` character position needs to be recalculated. You can use the file ```arcd_preprocessing.py``` as shown below to clean, preprocess the ARCD dataset before running ```run_squad.py```. More detailed Colab notebook is available in the [SOQAL repo](https://github.com/husseinmozannar/SOQAL).
```bash
python arcd_preprocessing.py \
--input_file="/PATH_TO/arcd-test.json" \
--output_file="arcd-test-pre.json" \
--do_farasa_tokenization=True \
--use_farasapy=True \
```
```bash
python SOQAL/bert/run_squad.py \
--vocab_file="/PATH_TO_PRETRAINED_TF_CKPT/vocab.txt" \
--bert_config_file="/PATH_TO_PRETRAINED_TF_CKPT/config.json" \
--init_checkpoint="/PATH_TO_PRETRAINED_TF_CKPT/" \
--do_train=True \
--train_file=turk_combined_all_pre.json \
--do_predict=True \
--predict_file=arcd-test-pre.json \
--train_batch_size=32 \
--predict_batch_size=24 \
--learning_rate=3e-5 \
--num_train_epochs=4 \
--max_seq_length=384 \
--doc_stride=128 \
--do_lower_case=False\
--output_dir="/PATH_TO/OUTPUT_PATH"/ \
--use_tpu=True \
--tpu_name=$TPU_ADDRESS \
```
## Model Weights and Vocab Download
Models | AraBERTv0.1 | AraBERTv1
---|:---:|:---:
TensorFlow|[Drive Link](https://drive.google.com/open?id=1-kVmTUZZ4DP2rzeHNjTPkY8OjnQCpomO) | [Drive Link](https://drive.google.com/open?id=1-d7-9ljKgDJP5mx73uBtio-TuUZCqZnt)
PyTorch| [Drive_Link](https://drive.google.com/open?id=1-_3te42mQCPD8SxwZ3l-VBL7yaJH-IOv)| [Drive_Link](https://drive.google.com/open?id=1-69s6Pxqbi63HOQ1M9wTcr-Ovc6PWLLo)
**You can find the PyTorch models in HuggingFace's Transformer Library under the ```aubmindlab``` username**
## If you used this model please cite us as:
```
@inproceedings{antoun2020arabert,
title={AraBERT: Transformer-based Model for Arabic Language Understanding},
author={Antoun, Wissam and Baly, Fady and Hajj, Hazem},
booktitle={LREC 2020 Workshop Language Resources and Evaluation Conference 11--16 May 2020},
pages={9}
}
```
## Acknowledgments
Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs, couldn't have done it without this program, and to the [AUB MIND Lab](https://sites.aub.edu.lb/mindlab/) Members for the continous support. Also thanks to [Yakshof](https://www.yakshof.com/#/) and Assafir for data and storage access. Another thanks for Habib Rahal (https://www.behance.net/rahalhabib), for putting a face to AraBERT.
## Contacts
**Wissam Antoun**: [Linkedin](https://www.linkedin.com/in/giulio-ravasio-3a81a9110/) | [Twitter](https://twitter.com/wissam_antoun) | [Github](https://github.com/WissamAntoun) | <wfa07@mail.aub.edu> | <wissam.antoun@gmail.com>
**Fady Baly**: [Linkedin](https://www.linkedin.com/in/fadybaly/) | [Twitter](https://twitter.com/fadybaly) | [Github](https://github.com/fadybaly) | <fgb06@mail.aub.edu> | <baly.fady@gmail.com>
---
language: ar
thumbnail: https://raw.githubusercontent.com/mawdoo3/Multi-dialect-Arabic-BERT/master/multidialct_arabic_bert.png
datasets:
- nadi
---
# Multi-dialect-Arabic-BERT
This is a repository of Multi-dialect Arabic BERT model.
By [Mawdoo3-AI](https://ai.mawdoo3.com/).
<p align="center">
<br>
<img src="https://raw.githubusercontent.com/mawdoo3/Multi-dialect-Arabic-BERT/master/multidialct_arabic_bert.png" alt="Background reference: http://www.qfi.org/wp-content/uploads/2018/02/Qfi_Infographic_Mother-Language_Final.pdf" width="500"/>
<br>
<p>
### About our Multi-dialect-Arabic-BERT model
Instead of training the Multi-dialect Arabic BERT model from scratch, we initialized the weights of the model using [Arabic-BERT](https://github.com/alisafaya/Arabic-BERT) and trained it on 10M arabic tweets from the unlabled data of [The Nuanced Arabic Dialect Identification (NADI) shared task](https://sites.google.com/view/nadi-shared-task).
### To cite this work
```
@misc{talafha2020multidialect,
title={Multi-Dialect Arabic BERT for Country-Level Dialect Identification},
author={Bashar Talafha and Mohammad Ali and Muhy Eddin Za'ter and Haitham Seelawi and Ibraheem Tuffaha and Mostafa Samir and Wael Farhan and Hussein T. Al-Natsheh},
year={2020},
eprint={2007.05612},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Usage
The model weights can be loaded using `transformers` library by HuggingFace.
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bashar-talafha/multi-dialect-bert-base-arabic")
model = AutoModel.from_pretrained("bashar-talafha/multi-dialect-bert-base-arabic")
```
Example using `pipeline`:
```python
from transformers import pipeline
fill_mask = pipeline(
"fill-mask",
model="bashar-talafha/multi-dialect-bert-base-arabic ",
tokenizer="bashar-talafha/multi-dialect-bert-base-arabic "
)
fill_mask(" سافر الرحالة من مطار [MASK] ")
```
```
[{'sequence': '[CLS] سافر الرحالة من مطار الكويت [SEP]', 'score': 0.08296813815832138, 'token': 3226},
{'sequence': '[CLS] سافر الرحالة من مطار دبي [SEP]', 'score': 0.05123933032155037, 'token': 4747},
{'sequence': '[CLS] سافر الرحالة من مطار مسقط [SEP]', 'score': 0.046838656067848206, 'token': 13205},
{'sequence': '[CLS] سافر الرحالة من مطار القاهرة [SEP]', 'score': 0.03234650194644928, 'token': 4003},
{'sequence': '[CLS] سافر الرحالة من مطار الرياض [SEP]', 'score': 0.02606341242790222, 'token': 2200}]
```
### Repository
Please check the [original repository](https://github.com/mawdoo3/Multi-dialect-Arabic-BERT) for more information.
---
language: mn
---
# ALBERT-Mongolian
[pretraining repo link](https://github.com/bayartsogt-ya/albert-mongolian)
## Model description
Here we provide pretrained ALBERT model and trained SentencePiece model for Mongolia text. Training data is the Mongolian wikipedia corpus from Wikipedia Downloads and Mongolian News corpus.
## Evaluation Result:
```
loss = 1.7478163
masked_lm_accuracy = 0.6838185
masked_lm_loss = 1.6687671
sentence_order_accuracy = 0.998125
sentence_order_loss = 0.007942731
```
## Fine-tuning Result on Eduge Dataset:
```
precision recall f1-score support
байгал орчин 0.83 0.76 0.80 483
боловсрол 0.79 0.75 0.77 420
спорт 0.98 0.96 0.97 1391
технологи 0.85 0.83 0.84 543
улс төр 0.88 0.87 0.87 1336
урлаг соёл 0.89 0.94 0.91 726
хууль 0.87 0.83 0.85 840
эдийн засаг 0.80 0.84 0.82 1265
эрүүл мэнд 0.84 0.90 0.87 562
accuracy 0.87 7566
macro avg 0.86 0.85 0.86 7566
weighted avg 0.87 0.87 0.87 7566
```
## Reference
1. [ALBERT - official repo](https://github.com/google-research/albert)
2. [WikiExtrator](https://github.com/attardi/wikiextractor)
3. [Mongolian BERT](https://github.com/tugstugi/mongolian-bert)
4. [ALBERT - Japanese](https://github.com/alinear-corp/albert-japanese)
5. [Mongolian Text Classification](https://github.com/sharavsambuu/mongolian-text-classification)
6. [You's paper](https://arxiv.org/abs/1904.00962)
## Citation
```
@misc{albert-mongolian,
author = {Bayartsogt Yadamsuren},
title = {ALBERT Pretrained Model on Mongolian Datasets},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/bayartsogt-ya/albert-mongolian/}}
}
```
## For More Information
Please contact by bayartsogtyadamsuren@icloud.com
---
language: "mn"
tags:
- mongolian
- cased
---
# BERT-BASE-MONGOLIAN-CASED
[Link to Official Mongolian-BERT repo](https://github.com/tugstugi/mongolian-bert)
## Model description
This repository contains pre-trained Mongolian [BERT](https://arxiv.org/abs/1810.04805) models trained by [tugstugi](https://github.com/tugstugi), [enod](https://github.com/enod) and [sharavsambuu](https://github.com/sharavsambuu).
Special thanks to [nabar](https://github.com/nabar) who provided 5x TPUs.
This repository is based on the following open source projects: [google-research/bert](https://github.com/google-research/bert/),
[huggingface/pytorch-pretrained-BERT](https://github.com/huggingface/pytorch-pretrained-BERT) and [yoheikikuta/bert-japanese](https://github.com/yoheikikuta/bert-japanese).
#### How to use
```python
from transformers import pipeline, AlbertTokenizer, BertForMaskedLM
tokenizer = AlbertTokenizer.from_pretrained('bayartsogt/bert-base-mongolian-cased')
model = BertForMaskedLM.from_pretrained('bayartsogt/bert-base-mongolian-cased')
## declare task ##
pipe = pipeline(task="fill-mask", model=model, tokenizer=tokenizer)
## example ##
input_ = 'Миний [MASK] хоол идэх нь тун чухал.'
output_ = pipe(input_)
for i in range(len(output_)):
print(output_[i])
## Output ##
# {'sequence': '[CLS] Миний хувьд хоол идэх нь тун чухал.[SEP]', 'score': 0.8734784722328186, 'token': 95, 'token_str': '▁хувьд'}
# {'sequence': '[CLS] Миний бодлоор хоол идэх нь тун чухал.[SEP]', 'score': 0.09788835793733597, 'token': 6320, 'token_str': '▁бодлоор'}
# {'sequence': '[CLS] Миний хүү хоол идэх нь тун чухал.[SEP]', 'score': 0.0027510314248502254, 'token': 590, 'token_str': '▁хүү'}
# {'sequence': '[CLS] Миний бие хоол идэх нь тун чухал.[SEP]', 'score': 0.0014857524074614048, 'token': 267, 'token_str': '▁бие'}
# {'sequence': '[CLS] Миний охин хоол идэх нь тун чухал.[SEP]', 'score': 0.0013575413031503558, 'token': 1116, 'token_str': '▁охин'}
```
## Training data
Mongolian Wikipedia and the 700 million word Mongolian news data set [[Pretraining Procedure](https://github.com/tugstugi/mongolian-bert#pre-training)]
### BibTeX entry and citation info
```bibtex
@misc{mongolian-bert,
author = {Tuguldur, Erdene-Ochir and Gunchinish, Sharavsambuu and Bataa, Enkhbold},
title = {BERT Pretrained Models on Mongolian Datasets},
year = {2019},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/tugstugi/mongolian-bert/}}
}
```
---
language: "mn"
tags:
- bert
- mongolian
- uncased
---
# BERT-BASE-MONGOLIAN-UNCASED
[Link to Official Mongolian-BERT repo](https://github.com/tugstugi/mongolian-bert)
## Model description
This repository contains pre-trained Mongolian [BERT](https://arxiv.org/abs/1810.04805) models trained by [tugstugi](https://github.com/tugstugi), [enod](https://github.com/enod) and [sharavsambuu](https://github.com/sharavsambuu).
Special thanks to [nabar](https://github.com/nabar) who provided 5x TPUs.
This repository is based on the following open source projects: [google-research/bert](https://github.com/google-research/bert/),
[huggingface/pytorch-pretrained-BERT](https://github.com/huggingface/pytorch-pretrained-BERT) and [yoheikikuta/bert-japanese](https://github.com/yoheikikuta/bert-japanese).
#### How to use
```python
from transformers import pipeline, AlbertTokenizer, BertForMaskedLM
tokenizer = AlbertTokenizer.from_pretrained('bayartsogt/bert-base-mongolian-uncased')
model = BertForMaskedLM.from_pretrained('bayartsogt/bert-base-mongolian-uncased')
## declare task ##
pipe = pipeline(task="fill-mask", model=model, tokenizer=tokenizer)
## example ##
input_ = 'Миний [MASK] хоол идэх нь тун чухал.'
output_ = pipe(input_)
for i in range(len(output_)):
print(output_[i])
```
## Training data
Mongolian Wikipedia and the 700 million word Mongolian news data set [[Pretraining Procedure](https://github.com/tugstugi/mongolian-bert#pre-training)]
### BibTeX entry and citation info
```bibtex
@misc{mongolian-bert,
author = {Tuguldur, Erdene-Ochir and Gunchinish, Sharavsambuu and Bataa, Enkhbold},
title = {BERT Pretrained Models on Mongolian Datasets},
year = {2019},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/tugstugi/mongolian-bert/}}
}
```
---
language: en
tags:
- exbert
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
---
# BERT base model (cased)
Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://arxiv.org/abs/1810.04805) and first released in
[this repository](https://github.com/google-research/bert). This model is case-sensitive: it makes a difference between
english and English.
Disclaimer: The team releasing BERT did not write a model card for this model so this model card has been written by
the Hugging Face team.
## Model description
BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it
was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it
was pretrained with two objectives:
- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input then run
the entire masked sentence through the model and has to predict the masked words. This is different from traditional
recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like
GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the
sentence.
- Next sentence prediction (NSP): the models concatenates two masked sentences as inputs during pretraining. Sometimes
they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to
predict if the two sentences were following each other or not.
This way, the model learns an inner representation of the English language that can then be used to extract features
useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard
classifier using the features produced by the BERT model as inputs.
## Intended uses & limitations
You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=bert) to look for
fine-tuned versions on a task that interests you.
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at model like GPT2.
### How to use
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-cased')
>>> unmasker("Hello I'm a [MASK] model.")
[{'sequence': "[CLS] Hello I'm a fashion model. [SEP]",
'score': 0.09019174426794052,
'token': 4633,
'token_str': 'fashion'},
{'sequence': "[CLS] Hello I'm a new model. [SEP]",
'score': 0.06349995732307434,
'token': 1207,
'token_str': 'new'},
{'sequence': "[CLS] Hello I'm a male model. [SEP]",
'score': 0.06228214129805565,
'token': 2581,
'token_str': 'male'},
{'sequence': "[CLS] Hello I'm a professional model. [SEP]",
'score': 0.0441727414727211,
'token': 1848,
'token_str': 'professional'},
{'sequence': "[CLS] Hello I'm a super model. [SEP]",
'score': 0.03326151892542839,
'token': 7688,
'token_str': 'super'}]
```
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = TFBertModel.from_pretrained("bert-base-cased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = BertModel.from_pretrained("bert-base-cased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
### Limitations and bias
Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
predictions:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-cased')
>>> unmasker("The man worked as a [MASK].")
[{'sequence': '[CLS] The man worked as a lawyer. [SEP]',
'score': 0.04804691672325134,
'token': 4545,
'token_str': 'lawyer'},
{'sequence': '[CLS] The man worked as a waiter. [SEP]',
'score': 0.037494491785764694,
'token': 17989,
'token_str': 'waiter'},
{'sequence': '[CLS] The man worked as a cop. [SEP]',
'score': 0.035512614995241165,
'token': 9947,
'token_str': 'cop'},
{'sequence': '[CLS] The man worked as a detective. [SEP]',
'score': 0.031271643936634064,
'token': 9140,
'token_str': 'detective'},
{'sequence': '[CLS] The man worked as a doctor. [SEP]',
'score': 0.027423162013292313,
'token': 3995,
'token_str': 'doctor'}]
>>> unmasker("The woman worked as a [MASK].")
[{'sequence': '[CLS] The woman worked as a nurse. [SEP]',
'score': 0.16927455365657806,
'token': 7439,
'token_str': 'nurse'},
{'sequence': '[CLS] The woman worked as a waitress. [SEP]',
'score': 0.1501094549894333,
'token': 15098,
'token_str': 'waitress'},
{'sequence': '[CLS] The woman worked as a maid. [SEP]',
'score': 0.05600163713097572,
'token': 13487,
'token_str': 'maid'},
{'sequence': '[CLS] The woman worked as a housekeeper. [SEP]',
'score': 0.04838843643665314,
'token': 26458,
'token_str': 'housekeeper'},
{'sequence': '[CLS] The woman worked as a cook. [SEP]',
'score': 0.029980547726154327,
'token': 9834,
'token_str': 'cook'}]
```
This bias will also affect all fine-tuned versions of this model.
## Training data
The BERT model was pretrained on [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038
unpublished books and [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and
headers).
## Training procedure
### Preprocessing
The texts are tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are then of the form:
```
[CLS] Sentence A [SEP] Sentence B [SEP]
```
With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus and in
the other cases, it's another random sentence in the corpus. Note that what is considered a sentence here is a
consecutive span of text usually longer than a single sentence. The only constrain is that the result with the two
"sentences" has a combined length of less than 512 tokens.
The details of the masking procedure for each sentence are the following:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace.
- In the 10% remaining cases, the masked tokens are left as is.
### Pretraining
The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size
of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer
used is Adam with a learning rate of 1e-4, \\(\beta_{1} = 0.9\\) and \\(\beta_{2} = 0.999\\), a weight decay of 0.01,
learning rate warmup for 10,000 steps and linear decay of the learning rate after.
## Evaluation results
When fine-tuned on downstream tasks, this model achieves the following results:
Glue test results:
| Task | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average |
|:----:|:-----------:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|:-------:|
| | 84.6/83.4 | 71.2 | 90.5 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | 79.6 |
### BibTeX entry and citation info
```bibtex
@article{DBLP:journals/corr/abs-1810-04805,
author = {Jacob Devlin and
Ming{-}Wei Chang and
Kenton Lee and
Kristina Toutanova},
title = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
Understanding},
journal = {CoRR},
volume = {abs/1810.04805},
year = {2018},
url = {http://arxiv.org/abs/1810.04805},
archivePrefix = {arXiv},
eprint = {1810.04805},
timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
<a href="https://huggingface.co/exbert/?model=bert-base-cased">
<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment