Unverified Commit 3552d0e0 authored by Julien Chaumond, committed by GitHub

[model_cards] Migrate cards from this repo to model repos on huggingface.co (#9013)



* rm all model cards

* Update the .rst

@sgugger it is still not super crystal clear/streamlined so let me know if any ideas to make it simpler

* Add a rootlevel README.md with simple instructions/context

* Update docs/source/model_sharing.rst
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* make style

* rm all model cards
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
parent 29e45979
---
language: en
tags:
- exbert
license: mit
---
# GPT-2
Test the whole generation capabilities here: https://transformer.huggingface.co/doc/gpt2-large
Pretrained model on English language using a causal language modeling (CLM) objective. It was introduced in
[this paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
and first released at [this page](https://openai.com/blog/better-language-models/).
Disclaimer: The team releasing GPT-2 also wrote a
[model card](https://github.com/openai/gpt-2/blob/master/model_card.md) for their model. Content from this model card
has been written by the Hugging Face team to complement the information they provided and to give specific examples of bias.
## Model description
GPT-2 is a transformers model pretrained on a very large corpus of English data in a self-supervised fashion. This
means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots
of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely,
it was trained to guess the next word in sentences.
Concretely, inputs are sequences of continuous text of a certain length and the targets are the same sequence,
shifted one token (word or piece of word) to the right. Internally, the model uses a masking mechanism to make sure the
predictions for token `i` only use the inputs from `1` to `i` and not the future tokens.
This way, the model learns an inner representation of the English language that can then be used to extract features
useful for downstream tasks. The model is best at what it was pretrained for however, which is generating texts from a
prompt.
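Concretely, this shifted next-token objective is what `transformers` computes when the input ids are also passed as `labels` to `GPT2LMHeadModel`; a minimal sketch:
```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

inputs = tokenizer("Hello, I'm a language model,", return_tensors='pt')
# Passing the input ids as labels makes the model compute the causal LM loss;
# the one-token shift of the targets happens internally.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)
```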
## Intended uses & limitations
You can use the raw model for text generation or fine-tune it to a downstream task. See the
[model hub](https://huggingface.co/models?filter=gpt2) to look for fine-tuned versions on a task that interests you.
### How to use
You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we
set a seed for reproducibility:
```python
>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='gpt2')
>>> set_seed(42)
>>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
[{'generated_text': "Hello, I'm a language model, a language for thinking, a language for expressing thoughts."},
{'generated_text': "Hello, I'm a language model, a compiler, a compiler library, I just want to know how I build this kind of stuff. I don"},
{'generated_text': "Hello, I'm a language model, and also have more than a few of your own, but I understand that they're going to need some help"},
{'generated_text': "Hello, I'm a language model, a system model. I want to know my language so that it might be more interesting, more user-friendly"},
{'generated_text': 'Hello, I\'m a language model, not a language model"\n\nThe concept of "no-tricks" comes in handy later with new'}]
```
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import GPT2Tokenizer, GPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import GPT2Tokenizer, TFGPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = TFGPT2Model.from_pretrained('gpt2')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
### Limitations and bias
The training data used for this model has not been released as a dataset one can browse. We know it contains a lot of
unfiltered content from the internet, which is far from neutral. As the openAI team themselves point out in their
[model card](https://github.com/openai/gpt-2/blob/master/model_card.md#out-of-scope-use-cases):
> Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don’t support use-cases
> that require the generated text to be true.
>
> Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do
> not recommend that they be deployed into systems that interact with humans unless the deployers first carry out a
> study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race,
> and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar
> levels of caution around use cases that are sensitive to biases around human attributes.
Here's an example of how the model can have biased predictions:
```python
>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='gpt2')
>>> set_seed(42)
>>> generator("The White man worked as a", max_length=10, num_return_sequences=5)
[{'generated_text': 'The White man worked as a mannequin for'},
{'generated_text': 'The White man worked as a maniser of the'},
{'generated_text': 'The White man worked as a bus conductor by day'},
{'generated_text': 'The White man worked as a plumber at the'},
{'generated_text': 'The White man worked as a journalist. He had'}]
>>> set_seed(42)
>>> generator("The Black man worked as a", max_length=10, num_return_sequences=5)
[{'generated_text': 'The Black man worked as a man at a restaurant'},
{'generated_text': 'The Black man worked as a car salesman in a'},
{'generated_text': 'The Black man worked as a police sergeant at the'},
{'generated_text': 'The Black man worked as a man-eating monster'},
{'generated_text': 'The Black man worked as a slave, and was'}]
```
This bias will also affect all fine-tuned versions of this model.
## Training data
The OpenAI team wanted to train this model on a corpus as large as possible. To build it, they scraped all the web
pages from outbound links on Reddit which received at least 3 karma. Note that all Wikipedia pages were removed from
this dataset, so the model was not trained on any part of Wikipedia. The resulting dataset (called WebText) weighs
40GB of texts but has not been publicly released. You can find a list of the top 1,000 domains present in WebText
[here](https://github.com/openai/gpt-2/blob/master/domains.txt).
## Training procedure
### Preprocessing
The texts are tokenized using a byte-level version of Byte Pair Encoding (BPE) (for unicode characters) and a
vocabulary size of 50,257. The inputs are sequences of 1024 consecutive tokens.
The larger model was trained on 256 cloud TPU v3 cores. The training duration was not disclosed, nor were the exact
details of training.
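The tokenizer figures above (vocabulary size and sequence length) can be checked directly from the released checkpoint; a quick sketch:
```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
print(tokenizer.vocab_size)        # 50257 byte-level BPE tokens
print(tokenizer.model_max_length)  # 1024, the pretraining sequence length
```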
## Evaluation results
The model achieves the following results without any fine-tuning (zero-shot):
| Dataset | LAMBADA | LAMBADA | CBT-CN | CBT-NE | WikiText2 | PTB | enwiki8 | text8 | WikiText103 | 1BW |
|:--------:|:-------:|:-------:|:------:|:------:|:---------:|:------:|:-------:|:------:|:-----------:|:-----:|
| (metric) | (PPL) | (ACC) | (ACC) | (ACC) | (PPL) | (PPL) | (BPB) | (BPC) | (PPL) | (PPL) |
|          | 35.13   | 45.99   | 87.65  | 83.4   | 29.41     | 65.85  | 1.16    | 1.17   | 37.50       | 75.20 |
### BibTeX entry and citation info
```bibtex
@article{radford2019language,
title={Language Models are Unsupervised Multitask Learners},
author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
year={2019}
}
```
<a href="https://huggingface.co/exbert/?model=gpt2">
<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>
# BioBERT-NLI
This is the model [BioBERT](https://github.com/dmis-lab/biobert) [1] fine-tuned on the [SNLI](https://nlp.stanford.edu/projects/snli/) and the [MultiNLI](https://www.nyu.edu/projects/bowman/multinli/) datasets using the [`sentence-transformers` library](https://github.com/UKPLab/sentence-transformers/) to produce universal sentence embeddings [2].
The model uses the original BERT wordpiece vocabulary and was trained using the **average pooling strategy** and a **softmax loss**.
**Base model**: `monologg/biobert_v1.1_pubmed` from HuggingFace's `AutoModel`.
**Training time**: ~6 hours on the NVIDIA Tesla P100 GPU provided in Kaggle Notebooks.
**Parameters**:
| Parameter | Value |
|------------------|-------|
| Batch size | 64 |
| Training steps | 30000 |
| Warmup steps | 1450 |
| Lowercasing | False |
| Max. Seq. Length | 128 |
**Performance**: The model was evaluated on the test portion of the [STS benchmark](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) using Spearman rank correlation, and compared to a general BERT base model fine-tuned with the same procedure to verify that the two behave similarly.
| Model | Score |
|-------------------------------|-------------|
| `biobert-nli` (this) | 73.40 |
| `gsarti/scibert-nli` | 74.50 |
| `bert-base-nli-mean-tokens`[3]| 77.12 |
An example usage for similarity-based scientific paper retrieval is provided in the [Covid Papers Browser](https://github.com/gsarti/covid-papers-browser) repository.
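For reference, the sentence embeddings described above can be reproduced with plain `transformers` and mean pooling; a sketch (the Hub id `gsarti/biobert-nli` is inferred from the comparison table and should be double-checked):
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Hub id inferred from the comparison table above.
model_name = "gsarti/biobert-nli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentences = ["Coronaviruses are enveloped RNA viruses.",
             "SARS-CoV-2 causes COVID-19."]
encoded = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**encoded).last_hidden_state  # (batch, seq_len, hidden)

# Average pooling over non-padding tokens, matching the training strategy.
mask = encoded["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(1) / mask.sum(1)
print(sentence_embeddings.shape)  # e.g. (2, 768)
```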
**References:**
[1] J. Lee et al, [BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://academic.oup.com/bioinformatics/article/36/4/1234/5566506)
[2] A. Conneau et al., [Supervised Learning of Universal Sentence Representations from Natural Language Inference Data](https://www.aclweb.org/anthology/D17-1070/)
[3] N. Reimers and I. Gurevych, [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://www.aclweb.org/anthology/D19-1410/)
# CovidBERT-NLI
This is the model **CovidBERT** trained by DeepSet on AllenAI's [CORD19 Dataset](https://pages.semanticscholar.org/coronavirus-research) of scientific articles about coronaviruses.
The model uses the original BERT wordpiece vocabulary and was subsequently fine-tuned on the [SNLI](https://nlp.stanford.edu/projects/snli/) and the [MultiNLI](https://www.nyu.edu/projects/bowman/multinli/) datasets using the [`sentence-transformers` library](https://github.com/UKPLab/sentence-transformers/) to produce universal sentence embeddings [1] using the **average pooling strategy** and a **softmax loss**.
Parameter details for the original training on CORD-19 are available on [DeepSet's MLFlow](https://public-mlflow.deepset.ai/#/experiments/2/runs/ba27d00c30044ef6a33b1d307b4a6cba)
**Base model**: `deepset/covid_bert_base` from HuggingFace's `AutoModel`.
**Training time**: ~6 hours on the NVIDIA Tesla P100 GPU provided in Kaggle Notebooks.
**Parameters**:
| Parameter | Value |
|------------------|-------|
| Batch size | 64 |
| Training steps | 23000 |
| Warmup steps | 1450 |
| Lowercasing | True |
| Max. Seq. Length | 128 |
**Performance**: The model was evaluated on the test portion of the [STS benchmark](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) using Spearman rank correlation, and compared to similar models fine-tuned with the same procedure.
| Model | Score |
|-------------------------------|-------------|
| `covidbert-nli` (this) | 67.52 |
| `gsarti/biobert-nli` | 73.40 |
| `gsarti/scibert-nli` | 74.50 |
| `bert-base-nli-mean-tokens`[2]| 77.12 |
An example usage for similarity-based scientific paper retrieval is provided in the [Covid-19 Semantic Browser](https://github.com/gsarti/covid-papers-browser) repository.
**References:**
[1] A. Conneau et al., [Supervised Learning of Universal Sentence Representations from Natural Language Inference Data](https://www.aclweb.org/anthology/D17-1070/)
[2] N. Reimers and I. Gurevych, [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://www.aclweb.org/anthology/D19-1410/)
# SciBERT-NLI
This is the model [SciBERT](https://github.com/allenai/scibert) [1] fine-tuned on the [SNLI](https://nlp.stanford.edu/projects/snli/) and the [MultiNLI](https://www.nyu.edu/projects/bowman/multinli/) datasets using the [`sentence-transformers` library](https://github.com/UKPLab/sentence-transformers/) to produce universal sentence embeddings [2].
The model uses the original `scivocab` wordpiece vocabulary and was trained using the **average pooling strategy** and a **softmax loss**.
**Base model**: `allenai/scibert-scivocab-cased` from HuggingFace's `AutoModel`.
**Training time**: ~4 hours on the NVIDIA Tesla P100 GPU provided in Kaggle Notebooks.
**Parameters**:
| Parameter | Value |
|------------------|-------|
| Batch size | 64 |
| Training steps | 20000 |
| Warmup steps | 1450 |
| Lowercasing | True |
| Max. Seq. Length | 128 |
**Performance**: The model was evaluated on the test portion of the [STS benchmark](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) using Spearman rank correlation, and compared to a general BERT base model fine-tuned with the same procedure to verify that the two behave similarly.
| Model | Score |
|-------------------------------|-------------|
| `scibert-nli` (this) | 74.50 |
| `bert-base-nli-mean-tokens`[3]| 77.12 |
An example usage for similarity-based scientific paper retrieval is provided in the [Covid Papers Browser](https://github.com/gsarti/covid-papers-browser) repository.
**References:**
[1] I. Beltagy et al, [SciBERT: A Pretrained Language Model for Scientific Text](https://www.aclweb.org/anthology/D19-1371/)
[2] A. Conneau et al., [Supervised Learning of Universal Sentence Representations from Natural Language Inference Data](https://www.aclweb.org/anthology/D17-1070/)
[3] N. Reimers and I. Gurevych, [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://www.aclweb.org/anthology/D19-1410/)
---
language: tr
---
# Turkish News Text Classification
Turkish text classification model obtained by fine-tuning the Turkish BERT model (`dbmdz/bert-base-turkish-cased`).
# Dataset
The dataset consists of 11 classes obtained from https://www.trthaber.com/. The model was trained on the 6 most distinctive classes.
Dataset can be accessed at https://github.com/gurkan08/datasets/tree/master/trt_11_category.
```python
label_dict = {
    'LABEL_0': 'ekonomi',
    'LABEL_1': 'spor',
    'LABEL_2': 'saglik',
    'LABEL_3': 'kultur_sanat',
    'LABEL_4': 'bilim_teknoloji',
    'LABEL_5': 'egitim'
}
```
70% of the data was used for training and 30% for testing.

train F1-weighted score = 97%

test F1-weighted score = 94%
# Usage
```python
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("gurkan08/bert-turkish-text-classification")
model = AutoModelForSequenceClassification.from_pretrained("gurkan08/bert-turkish-text-classification")

nlp = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

text = ["Süper Lig'in 6. haftasında Sivasspor ile Çaykur Rizespor karşı karşıya geldi...",
        "Son 24 saatte 69 kişi Kovid-19 nedeniyle yaşamını yitirdi, 1573 kişi iyileşti"]

out = nlp(text)

label_dict = {
    'LABEL_0': 'ekonomi',
    'LABEL_1': 'spor',
    'LABEL_2': 'saglik',
    'LABEL_3': 'kultur_sanat',
    'LABEL_4': 'bilim_teknoloji',
    'LABEL_5': 'egitim'
}

results = []
for result in out:
    result['label'] = label_dict[result['label']]
    results.append(result)

print(results)
# [{'label': 'spor', 'score': 0.9992026090621948}, {'label': 'saglik', 'score': 0.9972177147865295}]
```
---
language: ar
---
# Arabic Named Entity Recognition Model
Pretrained BERT-based ([arabic-bert-base](https://huggingface.co/asafaya/bert-base-arabic)) Named Entity Recognition model for Arabic.
The pre-trained model can recognize the following entities:
1. **PERSON**
- و هذا ما نفاه المعاون السياسي للرئيس ***نبيه بري*** ، النائب ***علي حسن خليل***
- لكن أوساط ***الحريري*** تعتبر أنه ضحى كثيرا في سبيل البلد
- و ستفقد الملكة ***إليزابيث الثانية*** بذلك سيادتها على واحدة من آخر ممالك الكومنولث
2. **ORGANIZATION**
- حسب أرقام ***البنك الدولي***
- أعلن ***الجيش العراقي***
- و نقلت وكالة ***رويترز*** عن ثلاثة دبلوماسيين في ***الاتحاد الأوروبي*** ، أن ***بلجيكا*** و ***إيرلندا*** و ***لوكسمبورغ*** تريد أيضاً مناقشة
- ***الحكومة الاتحادية*** و ***حكومة إقليم كردستان***
- و هو ما يثير الشكوك حول مشاركة النجم البرتغالي في المباراة المرتقبة أمام ***برشلونة*** الإسباني في
3. **LOCATION**
- الجديد هو تمكين اللاجئين من “ مغادرة الجزيرة تدريجياً و بهدوء إلى ***أثينا***
- ***جزيرة ساكيز*** تبعد 1 كم عن ***إزمير***
4. **DATE**
- ***غدا الجمعة***
- ***06 أكتوبر 2020***
- ***العام السابق***
5. **PRODUCT**
- عبر حسابه ب ***تطبيق “ إنستغرام ”***
- الجيل الثاني من ***نظارة الواقع الافتراضي أوكولوس كويست*** تحت اسم " ***أوكولوس كويست 2*** "
6. **COMPETITION**
- عدم المشاركة في ***بطولة فرنسا المفتوحة للتنس***
- في مباراة ***كأس السوبر الأوروبي***
7. **PRIZE**
- ***جائزة نوبل ل لآداب***
- الذي فاز ب ***جائزة “ إيمي ” لأفضل دور مساند***
8. **EVENT**
- تسجّل أغنية جديدة خاصة ب ***العيد الوطني السعودي***
- ***مهرجان المرأة يافوية*** في دورته الرابعة
9. **DISEASE**
- في مكافحة فيروس ***كورونا*** و عدد من الأمراض
- الأزمات المشابهة مثل “ ***انفلونزا الطيور*** ” و ” ***انفلونزا الخنازير***
## Example
[Find here a complete example to use this model](https://github.com/hatmimoha/arabic-ner)
Here is the map from index to label:
```
id2label = {
"0": "B-PERSON",
"1": "I-PERSON",
"2": "B-ORGANIZATION",
"3": "I-ORGANIZATION",
"4": "B-LOCATION",
"5": "I-LOCATION",
"6": "B-DATE",
"7": "I-DATE"",
"8": "B-COMPETITION",
"9": "I-COMPETITION",
"10": "B-PRIZE",
"11": "I-PRIZE",
"12": "O",
"13": "B-PRODUCT",
"14": "I-PRODUCT",
"15": "B-EVENT",
"16": "I-EVENT",
"17": "B-DISEASE",
"18": "I-DISEASE",
}
```
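For a quick test, the model can be used through the NER pipeline; a minimal sketch (the Hub id `hatmimoha/arabic-ner` is an assumption based on the repository linked above):
```python
from transformers import pipeline

# Hub id assumed from the linked repository; adjust if different.
ner = pipeline(
    "ner",
    model="hatmimoha/arabic-ner",
    tokenizer="hatmimoha/arabic-ner",
    grouped_entities=True,
)
print(ner("أعلن الجيش العراقي أن العملية ستبدأ غدا الجمعة ."))
```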
## Training Corpus
The training corpus is made of 378,000 tokens (14,000 sentences) collected from the Web and annotated manually.
## Results
The results on a validation corpus of 30,000 tokens show an F-measure of ~87%.
GPT-2 (774M model) fine-tuned on 0.5M PubMed abstracts. Used in [writemeanabstract.com](http://writemeanabstract.com) and in the following preprint:
[Papanikolaou, Yannis, and Andrea Pierleoni. "DARE: Data Augmented Relation Extraction with GPT-2." arXiv preprint arXiv:2004.13845 (2020).](https://arxiv.org/abs/2004.13845)
GPT-2 (355M model) fine-tuned on 0.5M PubMed abstracts. Used in [writemeanabstract.com](http://writemeanabstract.com) and in the following preprint:
[Papanikolaou, Yannis, and Andrea Pierleoni. "DARE: Data Augmented Relation Extraction with GPT-2." arXiv preprint arXiv:2004.13845 (2020).](https://arxiv.org/abs/2004.13845)
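Both checkpoints are plain GPT-2 language models, so they can be used with the text-generation pipeline; a minimal sketch (the Hub id below is a placeholder for whichever checkpoint you want to load):
```python
from transformers import pipeline, set_seed

# Placeholder id; replace with the Hub id of the 355M or 774M PubMed checkpoint.
model_id = "<org>/<gpt2-pubmed-checkpoint>"
generator = pipeline("text-generation", model=model_id)
set_seed(42)
print(generator("Background: We investigated the effect of", max_length=60, num_return_sequences=1))
```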
---
language: nl
---
# Multilingual + Dutch SQuAD2.0
This model is the multilingual BERT model provided by the Google research team, fine-tuned on a Dutch Q&A downstream task.
## Details of the language model
Language model ([**bert-base-multilingual-cased**](https://github.com/google-research/bert/blob/master/multilingual.md)):
12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on cased text in the top 104 languages with the largest Wikipedias.
## Details of the downstream task
Using the `mtranslate` Python module, [**SQuAD2.0**](https://rajpurkar.github.io/SQuAD-explorer/) was machine-translated. To find the start tokens, the direct translations of the answers were searched for in the corresponding paragraphs. Because translations differ depending on context (the standalone answer lacks the surrounding context), the answer could not always be found in the translated text, so some question-answer examples were lost. For the same reason, the resulting dataset may contain occasional errors.
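The alignment step can be sketched roughly as follows (an illustration only, not the exact script used; it assumes `mtranslate`'s `translate(text, target_lang, source_lang)` call):
```python
from mtranslate import translate

def translate_squad_example(context, question, answer_text, lang="nl"):
    """Machine-translate one SQuAD example and re-locate the answer span."""
    ctx = translate(context, lang, "en")
    q = translate(question, lang, "en")
    ans = translate(answer_text, lang, "en")
    # The example is kept only if the translated answer appears verbatim in the
    # translated context; otherwise it is dropped (the loss mentioned above).
    start = ctx.find(ans)
    if start == -1:
        return None
    return {"context": ctx, "question": q,
            "answers": {"text": [ans], "answer_start": [start]}}
```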
| Dataset | # Q&A |
| ---------------------- | ----- |
| SQuAD2.0 Train | 130 K |
| Dutch SQuAD2.0 Train | 99 K |
| SQuAD2.0 Dev | 12 K |
| Dutch SQuAD2.0 Dev | 10 K |
## Model benchmark
| Model | EM/F1 |HasAns (EM/F1) | NoAns |
| ---------------------- | ----- | ----- | ----- |
| [robBERT](https://huggingface.co/pdelobelle/robBERT-base) | 58.04/60.95 | 33.08/40.64 | 73.67 |
| [dutchBERT](https://huggingface.co/wietsedv/bert-base-dutch-cased) | 64.25/68.45 | 45.59/56.49 | 75.94 |
| [multiBERT](https://huggingface.co/bert-base-multilingual-cased) | **67.38**/**71.36** | 47.42/57.76 | 79.88 |
## Model training
The model was trained on a **Tesla V100** GPU with the following command:
```bash
export SQUAD_DIR=path/to/nl_squad
python run_squad.py \
--model_type bert \
--model_name_or_path bert-base-multilingual-cased \
--do_train \
--do_eval \
--train_file $SQUAD_DIR/nl_squadv2_train_clean.json \
--predict_file $SQUAD_DIR/nl_squadv2_dev_clean.json \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--save_steps=8000 \
--output_dir ../../output \
--overwrite_cache \
--overwrite_output_dir
```
**Results**:
{'exact': 67.38028751680629, 'f1': 71.362297054268, 'total': 9669, 'HasAns_exact': 47.422126745435015, 'HasAns_f1': 57.761023151910734, 'HasAns_total': 3724, 'NoAns_exact': 79.88225399495374, 'NoAns_f1': 79.88225399495374, 'NoAns_total': 5945, 'best_exact': 67.53542248422795, 'best_exact_thresh': 0.0, 'best_f1': 71.36229705426837, 'best_f1_thresh': 0.0}
## Model in action
Fast usage with **pipelines**:
```python
from transformers import pipeline
qa_pipeline = pipeline(
"question-answering",
model="henryk/bert-base-multilingual-cased-finetuned-dutch-squad2",
tokenizer="henryk/bert-base-multilingual-cased-finetuned-dutch-squad2"
)
qa_pipeline({
'context': "Amsterdam is de hoofdstad en de dichtstbevolkte stad van Nederland.",
'question': "Wat is de hoofdstad van Nederland?"})
```
# Output:
```json
{
"score": 0.83,
"start": 0,
"end": 9,
"answer": "Amsterdam"
}
```
## Contact
Please do not hesitate to contact me via [LinkedIn](https://www.linkedin.com/in/henryk-borzymowski-0755a2167/) if you want to discuss or get access to the Dutch version of SQuAD.
---
language: pl
---
# Multilingual + Polish SQuAD1.1
This model is the multilingual BERT model provided by the Google research team, fine-tuned on a Polish Q&A downstream task.
## Details of the language model
Language model ([**bert-base-multilingual-cased**](https://github.com/google-research/bert/blob/master/multilingual.md)):
12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on cased text in the top 104 languages with the largest Wikipedias.
## Details of the downstream task
Using the `mtranslate` Python module, [**SQuAD1.1**](https://rajpurkar.github.io/SQuAD-explorer/) was machine-translated. To find the start tokens, the direct translations of the answers were searched for in the corresponding paragraphs. Because translations differ depending on context (the standalone answer lacks the surrounding context), the answer could not always be found in the translated text, so some question-answer examples were lost. For the same reason, the resulting dataset may contain occasional errors.
| Dataset | # Q&A |
| ---------------------- | ----- |
| SQuAD1.1 Train | 87.7 K |
| Polish SQuAD1.1 Train | 39.5 K |
| SQuAD1.1 Dev | 10.6 K |
| Polish SQuAD1.1 Dev | 2.6 K |
## Model benchmark
| Model | EM | F1 |
| ---------------------- | ----- | ----- |
| [SlavicBERT](https://huggingface.co/DeepPavlov/bert-base-bg-cs-pl-ru-cased) | **60.89** | 71.68 |
| [polBERT](https://huggingface.co/dkleczek/bert-base-polish-uncased-v1) | 57.46 | 68.87 |
| [multiBERT](https://huggingface.co/bert-base-multilingual-cased) | 60.67 | **71.89** |
| [xlm](https://huggingface.co/xlm-mlm-100-1280) | 47.98 | 59.42 |
## Model training
The model was trained on a **Tesla V100** GPU with the following command:
```bash
export SQUAD_DIR=path/to/pl_squad
python run_squad.py \
--model_type bert \
--model_name_or_path bert-base-multilingual-cased \
--do_train \
--do_eval \
--train_file $SQUAD_DIR/pl_squadv1_train_clean.json \
--predict_file $SQUAD_DIR/pl_squadv1_dev_clean.json \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--save_steps=8000 \
--output_dir ../../output \
--overwrite_cache \
--overwrite_output_dir
```
**Results**:
{'exact': 60.670731707317074, 'f1': 71.8952193697293, 'total': 2624, 'HasAns_exact': 60.670731707317074, 'HasAns_f1': 71.8952193697293,
'HasAns_total': 2624, 'best_exact': 60.670731707317074, 'best_exact_thresh': 0.0, 'best_f1': 71.8952193697293, 'best_f1_thresh': 0.0}
## Model in action
Fast usage with **pipelines**:
```python
from transformers import pipeline
qa_pipeline = pipeline(
"question-answering",
model="henryk/bert-base-multilingual-cased-finetuned-polish-squad1",
tokenizer="henryk/bert-base-multilingual-cased-finetuned-polish-squad1"
)
qa_pipeline({
'context': "Warszawa jest największym miastem w Polsce pod względem liczby ludności i powierzchni",
'question': "Jakie jest największe miasto w Polsce?"})
```
# Output:
```json
{
"score": 0.9988,
"start": 0,
"end": 8,
"answer": "Warszawa"
}
```
## Contact
Please do not hesitate to contact me via [LinkedIn](https://www.linkedin.com/in/henryk-borzymowski-0755a2167/) if you want to discuss or get access to the Polish version of SQuAD.
---
language: pl
---
# Multilingual + Polish SQuAD2.0
This model is the multilingual BERT model provided by the Google research team, fine-tuned on a Polish Q&A downstream task.
## Details of the language model
Language model ([**bert-base-multilingual-cased**](https://github.com/google-research/bert/blob/master/multilingual.md)):
12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on cased text in the top 104 languages with the largest Wikipedias.
## Details of the downstream task
Using the `mtranslate` Python module, [**SQuAD2.0**](https://rajpurkar.github.io/SQuAD-explorer/) was machine-translated. To find the start tokens, the direct translations of the answers were searched for in the corresponding paragraphs. Because translations differ depending on context (the standalone answer lacks the surrounding context), the answer could not always be found in the translated text, so some question-answer examples were lost. For the same reason, the resulting dataset may contain occasional errors.
| Dataset | # Q&A |
| ---------------------- | ----- |
| SQuAD2.0 Train | 130 K |
| Polish SQuAD2.0 Train | 83.1 K |
| SQuAD2.0 Dev | 12 K |
| Polish SQuAD2.0 Dev | 8.5 K |
## Model benchmark
| Model | EM/F1 |HasAns (EM/F1) | NoAns |
| ---------------------- | ----- | ----- | ----- |
| [SlavicBERT](https://huggingface.co/DeepPavlov/bert-base-bg-cs-pl-ru-cased) | 69.35/71.51 | 47.02/54.09 | 79.20 |
| [polBERT](https://huggingface.co/dkleczek/bert-base-polish-uncased-v1) | 67.33/69.80| 45.73/53.80 | 76.87 |
| [multiBERT](https://huggingface.co/bert-base-multilingual-cased) | **70.76**/**72.92** |45.00/52.04 | 82.13 |
## Model training
The model was trained on a **Tesla V100** GPU with the following command:
```bash
export SQUAD_DIR=path/to/pl_squad
python run_squad.py \
--model_type bert \
--model_name_or_path bert-base-multilingual-cased \
--do_train \
--do_eval \
--version_2_with_negative \
--train_file $SQUAD_DIR/pl_squadv2_train.json \
--predict_file $SQUAD_DIR/pl_squadv2_dev.json \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--save_steps=8000 \
--output_dir ../../output \
--overwrite_cache \
--overwrite_output_dir
```
**Results**:
{'exact': 70.76671723655035, 'f1': 72.92156947155917, 'total': 8569, 'HasAns_exact': 45.00762195121951, 'HasAns_f1': 52.04456128116991, 'HasAns_total': 2624, 'NoAns_exact': 82.13624894869638, 'NoAns_f1': 82.13624894869638, 'NoAns_total': 5945, 'best_exact': 71.72365503559342, 'best_exact_thresh': 0.0, 'best_f1': 73.62662512059369, 'best_f1_thresh': 0.0}
## Model in action
Fast usage with **pipelines**:
```python
from transformers import pipeline
qa_pipeline = pipeline(
"question-answering",
model="henryk/bert-base-multilingual-cased-finetuned-polish-squad2",
tokenizer="henryk/bert-base-multilingual-cased-finetuned-polish-squad2"
)
qa_pipeline({
'context': "Warszawa jest największym miastem w Polsce pod względem liczby ludności i powierzchni",
'question': "Jakie jest największe miasto w Polsce?"})
```
# Output:
```json
{
"score": 0.9986,
"start": 0,
"end": 8,
"answer": "Warszawa"
}
```
## Contact
Please do not hesitate to contact me via [LinkedIn](https://www.linkedin.com/in/henryk-borzymowski-0755a2167/) if you want to discuss or get access to the Polish version of SQuAD.
## DynaBERT: Dynamic BERT with Adaptive Width and Depth
* DynaBERT can flexibly adjust its size and latency by selecting adaptive width and depth, and
its subnetworks achieve performance competitive with other similar-sized compressed models.
The training process of DynaBERT includes first training a width-adaptive BERT and then
allowing both adaptive width and depth using knowledge distillation.
* This code is modified based on the repository developed by Hugging Face: [Transformers v2.1.1](https://github.com/huggingface/transformers/tree/v2.1.1), and is released in [GitHub](https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/DynaBERT).
### Reference
Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, Qun Liu.
[DynaBERT: Dynamic BERT with Adaptive Width and Depth](https://arxiv.org/abs/2004.04037).
```
@inproceedings{hou2020dynabert,
title = {DynaBERT: Dynamic BERT with Adaptive Width and Depth},
author = {Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, Qun Liu},
booktitle = {Advances in Neural Information Processing Systems},
year = {2020}
}
```
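The released DynaBERT subnetworks follow the standard BERT architecture, so a converted checkpoint can be loaded with the usual `transformers` classes; a minimal sketch (the Hub id below is an assumption for illustration):
```python
from transformers import BertTokenizer, BertForSequenceClassification

# Hub id assumed for illustration; adjust to the DynaBERT checkpoint you use.
model_id = "huawei-noah/DynaBERT_SST-2"
tokenizer = BertTokenizer.from_pretrained(model_id)
model = BertForSequenceClassification.from_pretrained(model_id)
inputs = tokenizer("a charming and often affecting journey", return_tensors="pt")
print(model(**inputs).logits)
```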
TinyBERT: Distilling BERT for Natural Language Understanding
========
TinyBERT is 7.5x smaller and 9.4x faster at inference than BERT-base and achieves competitive performance on natural language understanding tasks. It performs a novel Transformer distillation at both the pre-training and the task-specific learning stages. In general distillation, we use the original BERT-base without fine-tuning as the teacher and a large-scale text corpus as the learning data. By performing the Transformer distillation on text from the general domain, we obtain a general TinyBERT that provides a good initialization for the task-specific distillation. We here provide the general TinyBERT for your tasks at hand.
For more details about the techniques of TinyBERT, refer to our paper:
[TinyBERT: Distilling BERT for Natural Language Understanding](https://arxiv.org/abs/1909.10351)
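The general TinyBERT checkpoints are standard BERT-style models, so they load with the usual `transformers` classes; a minimal sketch (the Hub id below is an assumption for illustration):
```python
from transformers import BertTokenizer, BertModel

# Hub id assumed for illustration; adjust to the TinyBERT checkpoint you use.
model_id = "huawei-noah/TinyBERT_General_4L_312D"
tokenizer = BertTokenizer.from_pretrained(model_id)
model = BertModel.from_pretrained(model_id)
outputs = model(**tokenizer("TinyBERT is small and fast.", return_tensors="pt"))
print(outputs.last_hidden_state.shape)
```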
Citation
========
If you find TinyBERT useful in your research, please cite the following paper:
```
@article{jiao2019tinybert,
title={Tinybert: Distilling bert for natural language understanding},
author={Jiao, Xiaoqi and Yin, Yichun and Shang, Lifeng and Jiang, Xin and Chen, Xiao and Li, Linlin and Wang, Fang and Liu, Qun},
journal={arXiv preprint arXiv:1909.10351},
year={2019}
}
```
---
language: code
thumbnail: https://cdn-media.huggingface.co/CodeBERTa/CodeBERTa.png
datasets:
- code_search_net
---
# CodeBERTa-language-id: The World’s fanciest programming language identification algo 🤯
To demonstrate the usefulness of our CodeBERTa pretrained model on downstream tasks beyond language modeling, we fine-tune the [`CodeBERTa-small-v1`](https://huggingface.co/huggingface/CodeBERTa-small-v1) checkpoint on the task of classifying a sample of code into the programming language it's written in (*programming language identification*).
We add a sequence classification head on top of the model.
On the evaluation dataset, we attain an eval accuracy and F1 > 0.999, which is not surprising given that language identification is a relatively easy task (see below for an intuition of why).
## Quick start: using the raw model
```python
from transformers import RobertaTokenizer, RobertaForSequenceClassification

CODEBERTA_LANGUAGE_ID = "huggingface/CodeBERTa-language-id"

tokenizer = RobertaTokenizer.from_pretrained(CODEBERTA_LANGUAGE_ID)
model = RobertaForSequenceClassification.from_pretrained(CODEBERTA_LANGUAGE_ID)

# CODE_TO_IDENTIFY is any string holding the code sample you want to classify.
input_ids = tokenizer(CODE_TO_IDENTIFY, return_tensors="pt").input_ids
logits = model(input_ids)[0]
language_idx = logits.argmax()  # index of the predicted language label
```
## Quick start: using Pipelines 💪
```python
from transformers import RobertaForSequenceClassification, RobertaTokenizer, TextClassificationPipeline
pipeline = TextClassificationPipeline(
model=RobertaForSequenceClassification.from_pretrained(CODEBERTA_LANGUAGE_ID),
tokenizer=RobertaTokenizer.from_pretrained(CODEBERTA_LANGUAGE_ID)
)
pipeline(CODE_TO_IDENTIFY)
```
Let's start with something very easy:
```python
pipeline("""
def f(x):
return x**2
""")
# [{'label': 'python', 'score': 0.9999965}]
```
Now let's probe shorter code samples:
```python
pipeline("const foo = 'bar'")
# [{'label': 'javascript', 'score': 0.9977546}]
```
What if I remove the `const` token from the assignment?
```python
pipeline("foo = 'bar'")
# [{'label': 'javascript', 'score': 0.7176245}]
```
For some reason, this is still statistically detected as JS code, even though it's also valid Python code. However, if we slightly tweak it:
```python
pipeline("foo = u'bar'")
# [{'label': 'python', 'score': 0.7638422}]
```
This is now detected as Python (Notice the `u` string modifier).
Okay, enough with the JS and Python domination already! Let's try fancier languages:
```python
pipeline("echo $FOO")
# [{'label': 'php', 'score': 0.9995257}]
```
(Yes, I used the word "fancy" to describe PHP 😅)
```python
pipeline("outcome := rand.Intn(6) + 1")
# [{'label': 'go', 'score': 0.9936151}]
```
Why is the problem of language identification so easy (with the correct toolkit)? Because code's syntax is rigid, and simple tokens such as `:=` (the assignment operator in Go) are perfect predictors of the underlying language:
```python
pipeline(":=")
# [{'label': 'go', 'score': 0.9998052}]
```
By the way, because we trained our own custom tokenizer on the [CodeSearchNet](https://github.blog/2019-09-26-introducing-the-codesearchnet-challenge/) dataset, and it handles streams of bytes in a very generic way, syntactic constructs such as `:=` are represented by a single token:
```python
self.tokenizer.encode(" :=", add_special_tokens=False)
# [521]
```
<br>
## Fine-tuning code
<details>
```python
import gzip
import json
import logging
import os
from pathlib import Path
from typing import Dict, List, Tuple
import numpy as np
import torch
from sklearn.metrics import f1_score
from tokenizers.implementations.byte_level_bpe import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset
from torch.utils.tensorboard.writer import SummaryWriter
from tqdm import tqdm, trange
from transformers import RobertaForSequenceClassification
from transformers.data.metrics import acc_and_f1, simple_accuracy
logging.basicConfig(level=logging.INFO)
CODEBERTA_PRETRAINED = "huggingface/CodeBERTa-small-v1"
LANGUAGES = [
"go",
"java",
"javascript",
"php",
"python",
"ruby",
]
FILES_PER_LANGUAGE = 1
EVALUATE = True
# Set up tokenizer
tokenizer = ByteLevelBPETokenizer("./pretrained/vocab.json", "./pretrained/merges.txt",)
tokenizer._tokenizer.post_processor = BertProcessing(
("</s>", tokenizer.token_to_id("</s>")), ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)
# Set up Tensorboard
tb_writer = SummaryWriter()
class CodeSearchNetDataset(Dataset):
examples: List[Tuple[List[int], int]]
def __init__(self, split: str = "train"):
"""
train | valid | test
"""
self.examples = []
src_files = []
for language in LANGUAGES:
src_files += list(
Path("../CodeSearchNet/resources/data/").glob(f"{language}/final/jsonl/{split}/*.jsonl.gz")
)[:FILES_PER_LANGUAGE]
for src_file in src_files:
label = src_file.parents[3].name
label_idx = LANGUAGES.index(label)
print("🔥", src_file, label)
lines = []
fh = gzip.open(src_file, mode="rt", encoding="utf-8")
for line in fh:
o = json.loads(line)
lines.append(o["code"])
examples = [(x.ids, label_idx) for x in tokenizer.encode_batch(lines)]
self.examples += examples
print("🔥🔥")
def __len__(self):
return len(self.examples)
def __getitem__(self, i):
# We’ll pad at the batch level.
return self.examples[i]
model = RobertaForSequenceClassification.from_pretrained(CODEBERTA_PRETRAINED, num_labels=len(LANGUAGES))
train_dataset = CodeSearchNetDataset(split="train")
eval_dataset = CodeSearchNetDataset(split="test")
def collate(examples):
input_ids = pad_sequence([torch.tensor(x[0]) for x in examples], batch_first=True, padding_value=1)
labels = torch.tensor([x[1] for x in examples])
    # ^^ unnecessary .unsqueeze(-1) removed; padding is handled at the batch level
return input_ids, labels
train_dataloader = DataLoader(train_dataset, batch_size=256, shuffle=True, collate_fn=collate)
batch = next(iter(train_dataloader))
model.to("cuda")
model.train()
for param in model.roberta.parameters():
param.requires_grad = False
## ^^ Only train final layer.
print(f"num params:", model.num_parameters())
print(f"num trainable params:", model.num_parameters(only_trainable=True))
def evaluate():
eval_loss = 0.0
nb_eval_steps = 0
preds = np.empty((0), dtype=np.int64)
out_label_ids = np.empty((0), dtype=np.int64)
model.eval()
eval_dataloader = DataLoader(eval_dataset, batch_size=512, collate_fn=collate)
for step, (input_ids, labels) in enumerate(tqdm(eval_dataloader, desc="Eval")):
with torch.no_grad():
outputs = model(input_ids=input_ids.to("cuda"), labels=labels.to("cuda"))
loss = outputs[0]
logits = outputs[1]
eval_loss += loss.mean().item()
nb_eval_steps += 1
preds = np.append(preds, logits.argmax(dim=1).detach().cpu().numpy(), axis=0)
out_label_ids = np.append(out_label_ids, labels.detach().cpu().numpy(), axis=0)
eval_loss = eval_loss / nb_eval_steps
acc = simple_accuracy(preds, out_label_ids)
f1 = f1_score(y_true=out_label_ids, y_pred=preds, average="macro")
print("=== Eval: loss ===", eval_loss)
print("=== Eval: acc. ===", acc)
print("=== Eval: f1 ===", f1)
# print(acc_and_f1(preds, out_label_ids))
tb_writer.add_scalars("eval", {"loss": eval_loss, "acc": acc, "f1": f1}, global_step)
### Training loop
global_step = 0
train_iterator = trange(0, 4, desc="Epoch")
optimizer = torch.optim.AdamW(model.parameters())
for _ in train_iterator:
epoch_iterator = tqdm(train_dataloader, desc="Iteration")
for step, (input_ids, labels) in enumerate(epoch_iterator):
optimizer.zero_grad()
outputs = model(input_ids=input_ids.to("cuda"), labels=labels.to("cuda"))
loss = outputs[0]
loss.backward()
tb_writer.add_scalar("training_loss", loss.item(), global_step)
optimizer.step()
global_step += 1
if EVALUATE and global_step % 50 == 0:
evaluate()
model.train()
evaluate()
os.makedirs("./models/CodeBERT-language-id", exist_ok=True)
model.save_pretrained("./models/CodeBERT-language-id")
```
</details>
<br>
## CodeSearchNet citation
<details>
```bibtex
@article{husain_codesearchnet_2019,
title = {{CodeSearchNet} {Challenge}: {Evaluating} the {State} of {Semantic} {Code} {Search}},
shorttitle = {{CodeSearchNet} {Challenge}},
url = {http://arxiv.org/abs/1909.09436},
urldate = {2020-03-12},
journal = {arXiv:1909.09436 [cs, stat]},
author = {Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
month = sep,
year = {2019},
note = {arXiv: 1909.09436},
}
```
</details>
---
language: code
thumbnail: https://cdn-media.huggingface.co/CodeBERTa/CodeBERTa.png
datasets:
- code_search_net
---
# CodeBERTa
CodeBERTa is a RoBERTa-like model trained on the [CodeSearchNet](https://github.blog/2019-09-26-introducing-the-codesearchnet-challenge/) dataset from GitHub.
Supported languages:
```shell
"go"
"java"
"javascript"
"php"
"python"
"ruby"
```
The **tokenizer** is a Byte-level BPE tokenizer trained on the corpus using Hugging Face `tokenizers`.
Because it is trained on a corpus of code (vs. natural language), it encodes the corpus efficiently (the sequences are between 33% and 50% shorter, compared to the same corpus tokenized by gpt2/roberta).
The (small) **model** is a 6-layer, 84M parameters, RoBERTa-like Transformer model – that’s the same number of layers & heads as DistilBERT – initialized from the default initialization settings and trained from scratch on the full corpus (~2M functions) for 5 epochs.
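The tokenizer efficiency claim is easy to check on a snippet of your own; a small sketch comparing sequence lengths against the `gpt2` tokenizer:
```python
from transformers import AutoTokenizer

code = "def fizzbuzz(n):\n    return 'fizzbuzz' if n % 15 == 0 else n\n"

codeberta = AutoTokenizer.from_pretrained("huggingface/CodeBERTa-small-v1")
gpt2 = AutoTokenizer.from_pretrained("gpt2")

# The code-specific vocabulary should yield a noticeably shorter sequence.
print("CodeBERTa:", len(codeberta.tokenize(code)), "tokens")
print("gpt2:     ", len(gpt2.tokenize(code)), "tokens")
```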
### Tensorboard for this training ⤵️
[![tb](https://cdn-media.huggingface.co/CodeBERTa/tensorboard.png)](https://tensorboard.dev/experiment/irRI7jXGQlqmlxXS0I07ew/#scalars)
## Quick start: masked language modeling prediction
```python
PHP_CODE = """
public static <mask> set(string $key, $value) {
if (!in_array($key, self::$allowedKeys)) {
throw new \InvalidArgumentException('Invalid key given');
}
self::$storedValues[$key] = $value;
}
""".lstrip()
```
### Does the model know how to complete simple PHP code?
```python
from transformers import pipeline
fill_mask = pipeline(
"fill-mask",
model="huggingface/CodeBERTa-small-v1",
tokenizer="huggingface/CodeBERTa-small-v1"
)
fill_mask(PHP_CODE)
## Top 5 predictions:
#
' function' # prob 0.9999827146530151
'function' #
' void' #
' def' #
' final' #
```
### Yes! That was easy 🎉 What about some Python (warning: this is going to be meta)
```python
PYTHON_CODE = """
def pipeline(
task: str,
model: Optional = None,
framework: Optional[<mask>] = None,
**kwargs
) -> Pipeline:
pass
""".lstrip()
```
Results:
```python
'framework', 'Framework', ' framework', 'None', 'str'
```
> This program can auto-complete itself! 😱
### Just for fun, let's try to mask natural language (not code):
```python
fill_mask("My name is <mask>.")
# {'sequence': '<s> My name is undefined.</s>', 'score': 0.2548016905784607, 'token': 3353}
# {'sequence': '<s> My name is required.</s>', 'score': 0.07290805131196976, 'token': 2371}
# {'sequence': '<s> My name is null.</s>', 'score': 0.06323737651109695, 'token': 469}
# {'sequence': '<s> My name is name.</s>', 'score': 0.021919190883636475, 'token': 652}
# {'sequence': '<s> My name is disabled.</s>', 'score': 0.019681859761476517, 'token': 7434}
```
This (kind of) works because code contains comments (which contain natural language).
Of course, the most frequent name for a Computer scientist must be undefined 🤓.
## Downstream task: [programming language identification](https://huggingface.co/huggingface/CodeBERTa-language-id)
See the model card for **[`huggingface/CodeBERTa-language-id`](https://huggingface.co/huggingface/CodeBERTa-language-id)** 🤯.
<br>
## CodeSearchNet citation
<details>
```bibtex
@article{husain_codesearchnet_2019,
title = {{CodeSearchNet} {Challenge}: {Evaluating} the {State} of {Semantic} {Code} {Search}},
shorttitle = {{CodeSearchNet} {Challenge}},
url = {http://arxiv.org/abs/1909.09436},
urldate = {2020-03-12},
journal = {arXiv:1909.09436 [cs, stat]},
author = {Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
month = sep,
year = {2019},
note = {arXiv: 1909.09436},
}
```
</details>
---
language: ms
---
# Bahasa Albert Model
Pretrained Albert base language model for Malay and Indonesian.
## Pretraining Corpus
`albert-base-bahasa-cased` model was pretrained on ~1.8 billion words. We trained it on both standard and social-media language structures, and below is the list of data we trained on:
1. [dumping wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [local instagram](https://github.com/huseinzol05/Malaya-Dataset#instagram).
3. [local twitter](https://github.com/huseinzol05/Malaya-Dataset#twitter-1).
4. [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
5. [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
6. [local singlish/manglish text](https://github.com/huseinzol05/Malaya-Dataset#singlish-text).
7. [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
8. [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
9. [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
Preprocessing steps can be reproduced from [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Pretraining details
- This model was trained using Google's ALBERT [repository](https://github.com/google-research/ALBERT) on a v3-8 TPU.
- All steps can be reproduced from [Malaya/pretrained-model/albert](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/albert).
## Load Pretrained Model
To use this model, install `torch` or `tensorflow` together with the Hugging Face `transformers` library, then initialize it like this:
```python
from transformers import AlbertTokenizer, AlbertModel
model = AlbertModel.from_pretrained('huseinzol05/albert-base-bahasa-cased')
tokenizer = AlbertTokenizer.from_pretrained(
'huseinzol05/albert-base-bahasa-cased',
do_lower_case = False,
)
```
## Example using AutoModelWithLMHead
```python
from transformers import AlbertTokenizer, AutoModelWithLMHead, pipeline
model = AutoModelWithLMHead.from_pretrained('huseinzol05/albert-base-bahasa-cased')
tokenizer = AlbertTokenizer.from_pretrained(
'huseinzol05/albert-base-bahasa-cased',
do_lower_case = False,
)
fill_mask = pipeline('fill-mask', model = model, tokenizer = tokenizer)
print(fill_mask('makan ayam dengan [MASK]'))
```
Output is,
```text
[{'sequence': '[CLS] makan ayam dengan ayam[SEP]',
'score': 0.044952988624572754,
'token': 629},
{'sequence': '[CLS] makan ayam dengan sayur[SEP]',
'score': 0.03621877357363701,
'token': 1639},
{'sequence': '[CLS] makan ayam dengan ikan[SEP]',
'score': 0.034429922699928284,
'token': 758},
{'sequence': '[CLS] makan ayam dengan nasi[SEP]',
'score': 0.032447945326566696,
'token': 453},
{'sequence': '[CLS] makan ayam dengan rendang[SEP]',
'score': 0.028885239735245705,
'token': 2451}]
```
## Results
For further details on model performance, check the accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, where we compare against traditional models.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train Albert for Bahasa.