Commit 0603564e authored by Sylvain Gugger

Merge remote-tracking branch 'origin/master'

parents 1e08af38 d86b5ffc
---
language: vi
datasets: wikipedia
license: apache-2.0
---
# bert-base-vi-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions produce exactly the same representations as the original model, which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-vi-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-vi-cased")
```
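Continuing from the snippet above, the model can be used as a drop-in feature extractor. This is a minimal sketch; the Vietnamese example sentence is only an illustration:
```python
import torch

# Encode a Vietnamese sentence and inspect its contextual representations.
inputs = tokenizer("Hà Nội là thủ đô của Việt Nam.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```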
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Multilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.
---
language: zh
datasets: wikipedia
license: apache-2.0
---
# bert-base-zh-cased
We are sharing smaller versions of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handle a custom number of languages.
Unlike [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased), our versions produce exactly the same representations as the original model, which preserves the original accuracy.
For more information please visit our paper: [Load What You Need: Smaller Versions of Multilingual BERT](https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf).
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-zh-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-zh-cased")
```
### How to cite
```bibtex
@inproceedings{smallermbert,
title={Load What You Need: Smaller Versions of Multilingual BERT},
author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
booktitle={SustaiNLP / EMNLP},
year={2020}
}
```
## Contact
Please contact amine@geotrend.fr for any question, feedback or request.
---
language:
- ach
- en
tags:
- translation
license: cc-by-4.0
datasets:
- JW300
metrics:
- bleu
---
# HEL-ACH-EN
## Model description
A machine translation model from Acholi to English, initialized with weights from [opus-mt-luo-en](https://huggingface.co/Helsinki-NLP/opus-mt-luo-en) on Hugging Face.
## Intended uses & limitations
Intended for machine translation experiments. Do not use it for sensitive tasks.
#### How to use
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("Ogayo/Hel-ach-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Ogayo/Hel-ach-en")
```
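Continuing from the snippet above, translation can then be run with `generate`. This is a minimal sketch in which the input text and generation settings are illustrative placeholders, not values from the original card:
```python
# Translate an Acholi sentence to English.
acholi_text = "..."  # replace with an Acholi source sentence
inputs = tokenizer(acholi_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```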
#### Limitations and bias
Trained on Jehovah's Witnesses data, so it reflects their views and Christian views more broadly.
## Training data
Trained on OPUS JW300 data.
Initialized with weights from [opus-mt-luo-en](https://huggingface.co/Helsinki-NLP/opus-mt-luo-en?text=Bed+gi+nyasi+mar+chieng%27+nyuol+mopong%27+gi+mor%21#model_card)
## Training procedure
Removed duplicates and rows with no alphabetic characters. Trained on a GPU.
## Eval results
Testset | BLEU
--- | ---
JW300.luo.en | 46.1
---
tags:
- finance
---
# Roberta Masked Language Model Trained On Financial Phrasebank Corpus
This is a Masked Language Model trained with [Roberta](https://huggingface.co/transformers/model_doc/roberta.html) on a Financial Phrasebank Corpus.
The model is built using Huggingface transformers.
The model can be found at [Financial_Roberta](https://huggingface.co/abhilash1910/financial_roberta).
## Specifications
The corpus for training is taken from the [Financial Phrasebank (Malo et al.)](https://www.researchgate.net/publication/251231107_Good_Debt_or_Bad_Debt_Detecting_Semantic_Orientations_in_Economic_Texts).
## Model Specification
The model chosen for training is [Roberta](https://arxiv.org/abs/1907.11692) with the following specifications:
1. vocab_size=56000
2. max_position_embeddings=514
3. num_attention_heads=12
4. num_hidden_layers=6
5. type_vocab_size=1
The model is configured using `RobertaConfig` from the transformers package and trained for 10 epochs with a GPU batch size of 64.
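As a rough illustration, the specification above corresponds to a configuration like the one sketched below; the exact training script and arguments are not published in this card, so treat this only as an approximation:
```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Approximate reconstruction of the configuration described above.
config = RobertaConfig(
    vocab_size=56000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config=config)
print(f"Parameters: {model.num_parameters():,}")
```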
## Usage Specifications
To use this model, first import the `AutoTokenizer` and `AutoModelWithLMHead` modules from transformers.
Then specify the pre-trained model, which in this case is 'abhilash1910/financial_roberta', for both the tokenizer and the model.
```python
from transformers import AutoTokenizer, AutoModelWithLMHead
tokenizer = AutoTokenizer.from_pretrained("abhilash1910/financial_roberta")
model = AutoModelWithLMHead.from_pretrained("abhilash1910/financial_roberta")
```
After this, the model will be downloaded; it may take some time to fetch all the model files.
To test the model, import the pipeline module from transformers and create a fill-mask pipeline for inference as follows:
```python
from transformers import pipeline
model_mask = pipeline('fill-mask', model='abhilash1910/financial_roberta')
model_mask("The company had a <mask> of 20% in 2020.")
```
Some examples with generic financial statements are provided below:
Example 1:
```python
model_mask("The company had a <mask> of 20% in 2020.")
```
Output:
```bash
[{'sequence': '<s>The company had a profit of 20% in 2020.</s>',
'score': 0.023112965747714043,
'token': 421,
'token_str': 'Ġprofit'},
{'sequence': '<s>The company had a loss of 20% in 2020.</s>',
'score': 0.021379893645644188,
'token': 616,
'token_str': 'Ġloss'},
{'sequence': '<s>The company had a year of 20% in 2020.</s>',
'score': 0.0185744296759367,
'token': 443,
'token_str': 'Ġyear'},
{'sequence': '<s>The company had a sales of 20% in 2020.</s>',
'score': 0.018143286928534508,
'token': 428,
'token_str': 'Ġsales'},
{'sequence': '<s>The company had a value of 20% in 2020.</s>',
'score': 0.015319528989493847,
'token': 776,
'token_str': 'Ġvalue'}]
```
Example 2:
```python
model_mask("The <mask> is listed under NYSE")
```
Output:
```bash
[{'sequence': '<s>The company is listed under NYSE</s>',
'score': 0.1566661298274994,
'token': 359,
'token_str': 'Ġcompany'},
{'sequence': '<s>The total is listed under NYSE</s>',
'score': 0.05542507395148277,
'token': 522,
'token_str': 'Ġtotal'},
{'sequence': '<s>The value is listed under NYSE</s>',
'score': 0.04729423299431801,
'token': 776,
'token_str': 'Ġvalue'},
{'sequence': '<s>The order is listed under NYSE</s>',
'score': 0.02533523552119732,
'token': 798,
'token_str': 'Ġorder'},
{'sequence': '<s>The contract is listed under NYSE</s>',
'score': 0.02087237872183323,
'token': 635,
'token_str': 'Ġcontract'}]
```
## Resources
For all resources, please see the [HuggingFace](https://huggingface.co/) site and the [repositories](https://github.com/huggingface).
---
language: en
license: mit
datasets:
- AI4Bharat IndicNLP Corpora
---
# IndicBERT
IndicBERT is a multilingual ALBERT model pretrained exclusively on 12 major Indian languages. It is pre-trained on our novel monolingual corpus of around 9 billion tokens and subsequently evaluated on a set of diverse tasks. IndicBERT has far fewer parameters than other multilingual models (mBERT, XLM-R, etc.) while achieving performance on par with or better than these models.
The 12 languages covered by IndicBERT are: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu.
The code can be found [here](https://github.com/divkakwani/indic-bert). For more information, check out our [project page](https://indicnlp.ai4bharat.org/) or our [paper](https://indicnlp.ai4bharat.org/papers/arxiv2020_indicnlp_corpus.pdf).
## Pretraining Corpus
We pre-trained indic-bert on AI4Bharat's monolingual corpus. The corpus has the following distribution of languages:
| Language | as | bn | en | gu | hi | kn | |
| ----------------- | ------ | ------ | ------ | ------ | ------ | ------ | ------- |
| **No. of Tokens** | 36.9M | 815M | 1.34B | 724M | 1.84B | 712M | |
| **Language** | **ml** | **mr** | **or** | **pa** | **ta** | **te** | **all** |
| **No. of Tokens** | 767M | 560M | 104M | 814M | 549M | 671M | 8.9B |
## Evaluation Results
IndicBERT is evaluated on IndicGLUE and some additional tasks. The results are summarized below. For more details about the tasks, refer to our [official repo](https://github.com/divkakwani/indic-bert).
#### IndicGLUE
Task | mBERT | XLM-R | IndicBERT
-----| ----- | ----- | ------
News Article Headline Prediction | 89.58 | 95.52 | **95.87**
Wikipedia Section Title Prediction| **73.66** | 66.33 | 73.31
Cloze-style multiple-choice QA | 39.16 | 27.98 | **41.87**
Article Genre Classification | 90.63 | 97.03 | **97.34**
Named Entity Recognition (F1-score) | **73.24** | 65.93 | 64.47
Cross-Lingual Sentence Retrieval Task | 21.46 | 13.74 | **27.12**
Average | 64.62 | 61.09 | **66.66**
#### Additional Tasks
Task | Task Type | mBERT | XLM-R | IndicBERT
-----| ----- | ----- | ------ | -----
BBC News Classification | Genre Classification | 60.55 | **75.52** | 74.60
IIT Product Reviews | Sentiment Analysis | 74.57 | **78.97** | 71.32
IITP Movie Reviews | Sentiment Analysis | 56.77 | **61.61** | 59.03
Soham News Article | Genre Classification | 80.23 | **87.6** | 78.45
Midas Discourse | Discourse Analysis | 71.20 | **79.94** | 78.44
iNLTK Headlines Classification | Genre Classification | 87.95 | 93.38 | **94.52**
ACTSA Sentiment Analysis | Sentiment Analysis | 48.53 | 59.33 | **61.18**
Winograd NLI | Natural Language Inference | 56.34 | 55.87 | **56.34**
Choice of Plausible Alternative (COPA) | Natural Language Inference | 54.92 | 51.13 | **58.33**
Amrita Exact Paraphrase | Paraphrase Detection | **93.81** | 93.02 | 93.75
Amrita Rough Paraphrase | Paraphrase Detection | 83.38 | 82.20 | **84.33**
Average | | 69.84 | **74.42** | 73.66
\* Note: all models have been restricted to a max_seq_length of 128.
## Downloads
The model can be downloaded [here](https://storage.googleapis.com/ai4bharat-public-indic-nlp-corpora/models/indic-bert-v1.tar.gz). Both tf checkpoints and pytorch binaries are included in the archive. Alternatively, you can also download it from [Huggingface](https://huggingface.co/ai4bharat/indic-bert).
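For a quick start with the Hugging Face version, here is a minimal sketch, assuming the `ai4bharat/indic-bert` model id linked above; the Hindi example sentence is only an illustration:
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
model = AutoModel.from_pretrained("ai4bharat/indic-bert")

# Encode a Hindi sentence and inspect its contextual representations.
inputs = tokenizer("यह एक उदाहरण वाक्य है।", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```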
## Citing
If you are using any of the resources, please cite the following article:
```
@inproceedings{kakwani2020indicnlpsuite,
title={{IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages}},
author={Divyanshu Kakwani and Anoop Kunchukuttan and Satish Golla and Gokul N.C. and Avik Bhattacharyya and Mitesh M. Khapra and Pratyush Kumar},
year={2020},
booktitle={Findings of EMNLP},
}
```
We would like to hear from you if:
- You are using our resources. Please let us know how you are putting these resources to use.
- You have any feedback on these resources.
## License
The IndicBERT code and models are released under the MIT License.
## Contributors
- Divyanshu Kakwani
- Anoop Kunchukuttan
- Gokul NC
- Satish Golla
- Avik Bhattacharyya
- Mitesh Khapra
- Pratyush Kumar
This work is the outcome of a volunteer effort as part of [AI4Bharat initiative](https://ai4bharat.org).
## Contact
- Anoop Kunchukuttan ([anoop.kunchukuttan@gmail.com](mailto:anoop.kunchukuttan@gmail.com))
- Mitesh Khapra ([miteshk@cse.iitm.ac.in](mailto:miteshk@cse.iitm.ac.in))
- Pratyush Kumar ([pratyush@cse.iitm.ac.in](mailto:pratyush@cse.iitm.ac.in))
---
language:
- en
tags:
- bert
- bluebert
license:
- PUBLIC DOMAIN NOTICE
datasets:
- PubMed
- MIMIC-III
---
# BlueBert-Base, Uncased, PubMed and MIMIC-III
## Model description
A BERT model pre-trained on PubMed abstracts and clinical notes ([MIMIC-III](https://mimic.physionet.org/)).
## Intended uses & limitations
#### How to use
Please see https://github.com/ncbi-nlp/bluebert
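If you prefer to load the weights through the transformers API, a minimal sketch is shown below; the model id is a placeholder assumption, so check the repository above for the exact released checkpoint name:
```python
from transformers import AutoTokenizer, AutoModel

# Placeholder id -- verify the exact checkpoint name in the bluebert repository.
model_id = "bionlp/bluebert_pubmed_mimic_uncased_L-12_H-768_A-12"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
```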
## Training data
We provide [preprocessed PubMed texts](https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/NCBI-BERT/pubmed_uncased_sentence_nltk.txt.tar.gz) that were used to pre-train the BlueBERT models.
The corpus contains ~4000M words extracted from the [PubMed ASCII code version](https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PubMed/).
Pre-trained model: https://huggingface.co/bert-large-uncased
## Training procedure
* lowercasing the text
* removing special characters outside the `\x00`-`\x7F` (ASCII) range
* tokenizing the text using the [NLTK Treebank tokenizer](https://www.nltk.org/_modules/nltk/tokenize/treebank.html)
Below is a code snippet for more details.
```python
import re

from nltk.tokenize.treebank import TreebankWordTokenizer

# `value` holds one raw input text (e.g. a PubMed sentence).
value = value.lower()
value = re.sub(r'[\r\n]+', ' ', value)        # collapse line breaks
value = re.sub(r'[^\x00-\x7F]+', ' ', value)  # strip non-ASCII characters
tokenized = TreebankWordTokenizer().tokenize(value)
sentence = ' '.join(tokenized)
sentence = re.sub(r"\s's\b", "'s", sentence)  # re-attach possessive 's
```
### BibTeX entry and citation info
```bibtex
@InProceedings{peng2019transfer,
author = {Yifan Peng and Shankai Yan and Zhiyong Lu},
title = {Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets},
booktitle = {Proceedings of the 2019 Workshop on Biomedical Natural Language Processing (BioNLP 2019)},
year = {2019},
pages = {58--65},
}
```
### Acknowledgments
This work was supported by the Intramural Research Programs of the National Institutes of Health, National Library of
Medicine and Clinical Center. This work was supported by the National Library of Medicine of the National Institutes of Health under award number 4R00LM013001-01.
We are also grateful to the authors of BERT and ELMo for making their data and code publicly available.
We would like to thank Dr. Sun Kim for processing the PubMed texts.
### Disclaimer
This tool shows the results of research conducted in the Computational Biology Branch, NCBI. The information produced
on this website is not intended for direct diagnostic use or medical decision-making without review and oversight
by a clinical professional. Individuals should not change their health behavior solely on the basis of information
produced on this website. NIH does not independently verify the validity or utility of the information produced
by this tool. If you have questions about the information produced on this website, please see a health care
professional. More information about NCBI's disclaimer policy is available.
---
language:
- en
tags:
- bert
- bluebert
license:
- PUBLIC DOMAIN NOTICE
datasets:
- PubMed
---
# BlueBert-Base, Uncased, PubMed
## Model description
A BERT model pre-trained on PubMed abstracts.
## Intended uses & limitations
#### How to use
Please see https://github.com/ncbi-nlp/bluebert
## Training data
We provide [preprocessed PubMed texts](https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/NCBI-BERT/pubmed_uncased_sentence_nltk.txt.tar.gz) that were used to pre-train the BlueBERT models.
The corpus contains ~4000M words extracted from the [PubMed ASCII code version](https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PubMed/).
Pre-trained model: https://huggingface.co/bert-large-uncased
## Training procedure
* lowercasing the text
* removing special characters outside the `\x00`-`\x7F` (ASCII) range
* tokenizing the text using the [NLTK Treebank tokenizer](https://www.nltk.org/_modules/nltk/tokenize/treebank.html)
Below is a code snippet for more details.
```python
import re

from nltk.tokenize.treebank import TreebankWordTokenizer

# `value` holds one raw input text (e.g. a PubMed sentence).
value = value.lower()
value = re.sub(r'[\r\n]+', ' ', value)        # collapse line breaks
value = re.sub(r'[^\x00-\x7F]+', ' ', value)  # strip non-ASCII characters
tokenized = TreebankWordTokenizer().tokenize(value)
sentence = ' '.join(tokenized)
sentence = re.sub(r"\s's\b", "'s", sentence)  # re-attach possessive 's
```
### BibTeX entry and citation info
```bibtex
@InProceedings{peng2019transfer,
author = {Yifan Peng and Shankai Yan and Zhiyong Lu},
title = {Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets},
booktitle = {Proceedings of the 2019 Workshop on Biomedical Natural Language Processing (BioNLP 2019)},
year = {2019},
pages = {58--65},
}
```
### Acknowledgments
This work was supported by the Intramural Research Programs of the National Institutes of Health, National Library of
Medicine and Clinical Center. This work was supported by the National Library of Medicine of the National Institutes of Health under award number 4R00LM013001-01.
We are also grateful to the authors of BERT and ELMo for making their data and code publicly available.
We would like to thank Dr. Sun Kim for processing the PubMed texts.
### Disclaimer
This tool shows the results of research conducted in the Computational Biology Branch, NCBI. The information produced
on this website is not intended for direct diagnostic use or medical decision-making without review and oversight
by a clinical professional. Individuals should not change their health behavior solely on the basis of information
produced on this website. NIH does not independently verify the validity or utility of the information produced
by this tool. If you have questions about the information produced on this website, please see a health care
professional. More information about NCBI's disclaimer policy is available.
---
language:
- ru
- en
---
## RuDR-BERT
RuDR-BERT - Multilingual, Cased, pretrained on the collection of consumer comments on drug administration from [2]. Pre-training was based on the [original BERT code](https://github.com/google-research/bert) provided by Google. In particular, Multi-BERT was used for initialization; the vocabulary of Russian subtokens and the parameters are the same as in Multi-BERT. Training details are described in our paper.
## EnDR-BERT
EnDR-BERT - Multilingual, Cased, pretrained on the English collection of consumer comments on drug administration from [2]. Pre-training was based on the [original BERT code](https://github.com/google-research/bert) provided by Google. In particular, Multi-BERT was used for initialization and all the parameters are the same as in Multi-BERT. Training details are described in our paper.
link: https://yadi.sk/d/-PTn0xhk1PqvgQ
......
@@ -9,8 +9,7 @@ tags:
 ```python
 import json
 import os
-from transformers.configuration_roberta import RobertaConfig
-from transformers import RobertaForMaskedLM, TFRobertaForMaskedLM
+from transformers import RobertaConfig, RobertaForMaskedLM, TFRobertaForMaskedLM
 DIRNAME = "./dummy-unknown"
......
---
language: en
datasets:
- quartz
pipeline_tag: question-answering
---
# T5-base fine-tuned on QuaRTz
[Google's T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) fine-tuned on [QuaRTz](https://allenai.org/data/quartz) for **QA** downstream task.
## Details of T5
The **T5** model was presented in [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf) by *Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu*. Here is the abstract:
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.
![model image](https://i.imgur.com/jVFMMWR.png)
## Details of the dataset 📚
**QuaRTz** is a crowdsourced dataset of 3864 multiple-choice questions about open domain qualitative relationships. Each question is paired with one of 405 different background sentences (sometimes short paragraphs).
The dataset is split into:
|Set | Samples|
|-----|--------|
|Train | 2696 |
|Valid | 384 |
|Test | 784 |
## Model fine-tuning 🏋️‍
The training script is a slightly modified version of [this awesome one](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb) by [Suraj Patil](https://twitter.com/psuraj28). The *question*, *context* (`para` field) and *options* (`choices` field) are concatenated and passed to the **encoder**. The **decoder** receives the right *answer* (by querying `answerKey` field). More details about the dataset fields/format [here](https://huggingface.co/nlp/viewer/?dataset=quartz)
## Results 📋
|Set | Metric | Score |
|-----|--------|-------|
|Validation | Accuracy (EM) | **83.59**|
|Test | Accuracy (EM) | **81.50**|
## Model in Action 🚀
```python
from transformers import AutoModelWithLMHead, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-quartz")
model = AutoModelWithLMHead.from_pretrained("mrm8488/t5-base-finetuned-quartz")
def get_response(question, fact, opts, max_length=16):
    input_text = 'question: %s context: %s options: %s' % (question, fact, opts)
    features = tokenizer([input_text], return_tensors='pt')
    output = model.generate(input_ids=features['input_ids'],
                            attention_mask=features['attention_mask'],
                            max_length=max_length)
    return tokenizer.decode(output[0])
fact = 'The sooner cancer is detected the easier it is to treat.'
question = 'John was a doctor in a cancer ward and knew that early detection was key. The cancer being detected quickly makes the cancer treatment'
opts = 'Easier, Harder'
get_response(question, fact, opts)
# output: 'Easier'
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
@@ -51,13 +51,13 @@ The training script is a slightly modified version of [this one](https://colab.r
 ## Model in Action 🚀
 ```python
-from transformers import AutoModelWithLMHead, AutoTokenizer
+from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-squadv2")
-model = AutoModelWithLMHead.from_pretrained("mrm8488/t5-base-finetuned-squadv2")
+model = AutoModelForSeq2SeqLM.from_pretrained("mrm8488/t5-base-finetuned-squadv2")
 def get_answer(question, context):
-  input_text = "question: %s context: %s </s>" % (question, context)
+  input_text = "question: %s context: %s" % (question, context)
   features = tokenizer([input_text], return_tensors='pt')
   output = model.generate(input_ids=features['input_ids'],
......
@@ -2,6 +2,7 @@
 language: de
 tags:
 - exbert
+- German
 ---
 <a href="https://huggingface.co/exbert/?model=smanjil/German-MedBERT">
@@ -10,7 +11,7 @@ tags:
 # German Medical BERT
-This is a fine-tuned model on Medical domain for German language and based on German BERT.
+This is a fine-tuned model on Medical domain for German language and based on German BERT. This model has only been trained to improve on target task (Masked Language Model). It can later be used to perform a downstream task of your needs, while I performed it for NTS-ICD-10 text classification task.
 ## Overview
 **Language model:** bert-base-german-cased
@@ -30,7 +31,12 @@ This is a fine-tuned model on Medical domain for German language and based on Ge
 - Although had to train for upto 25 epochs for classification.
 ## Performance (Micro precision, recall and f1 score for multilabel code classification)
-![performance](https://raw.githubusercontent.com/smanjil/finetune-lm/master/performance.png)
+|Models             |P     |R     |F1    |
+|:------------------|:-----|:-----|:-----|
+|German BERT        |86.04 |75.82 |80.60 |
+|German MedBERT-256 |87.41 |77.97 |82.42 |
+|German MedBERT-512 |87.75 |78.26 |82.73 |
 ## Author
 Manjil Shrestha: `shresthamanjil21 [at] gmail.com`
......
---
language: zh
widget:
- text: "[CLS]国 -"
---
# Chinese Couplet GPT2 Model
## Model description
The model is used to generate Chinese couplets. You can download the model either from the [GPT2-Chinese Github page](https://github.com/Morizeyao/GPT2-Chinese), or via HuggingFace from the link [gpt2-chinese-couplet][couplet].
Since the parameter skip_special_tokens is used in pipelines.py, special tokens such as [SEP] and [UNK] will be deleted, and the output results may not be neat.
## How to use
You can use the model directly with a pipeline for text generation:
When the parameter skip_special_tokens is True:
```python
>>> from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline
>>> tokenizer = BertTokenizer.from_pretrained("uer/gpt2-chinese-couplet")
>>> model = GPT2LMHeadModel.from_pretrained("uer/gpt2-chinese-couplet")
>>> text_generator = TextGenerationPipeline(model, tokenizer)
>>> text_generator("[CLS]丹 枫 江 冷 人 初 去 -", max_length=25, do_sample=True)
[{'generated_text': '[CLS]丹 枫 江 冷 人 初 去 - 黄 叶 声 从 天 外 来 阅 旗'}]
```
When the parameter skip_special_tokens is False:
```python
>>> from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline
>>> tokenizer = BertTokenizer.from_pretrained("uer/gpt2-chinese-couplet")
>>> model = GPT2LMHeadModel.from_pretrained("uer/gpt2-chinese-couplet")
>>> text_generator = TextGenerationPipeline(model, tokenizer)
>>> text_generator("[CLS]丹 枫 江 冷 人 初 去 -", max_length=25, do_sample=True)
[{'generated_text': '[CLS]丹 枫 江 冷 人 初 去 - 黄 叶 声 我 酒 不 辞 [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP]'}]
```
## Training data
The training data contains 700,000 Chinese couplets collected by [couplet-clean-dataset](https://github.com/v-zich/couplet-clean-dataset).
## Training procedure
The model is pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud TI-ONE](https://cloud.tencent.com/product/tione/). We pre-train for 25,000 steps with a sequence length of 64.
```
python3 preprocess.py --corpus_path corpora/couplet.txt \
--vocab_path models/google_zh_vocab.txt \
--dataset_path couplet.pt --processes_num 16 \
--seq_length 64 --target lm
```
```
python3 pretrain.py --dataset_path couplet.pt \
--vocab_path models/google_zh_vocab.txt \
--output_model_path models/couplet_gpt_base_model.bin \
--config_path models/bert_base_config.json --learning_rate 5e-4 \
--tie_weight --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
--batch_size 64 --report_steps 1000 \
--save_checkpoint_steps 5000 --total_steps 25000 \
--embedding gpt --encoder gpt2 --target lm
```
### BibTeX entry and citation info
```
@article{zhao2019uer,
title={UER: An Open-Source Toolkit for Pre-training Models},
author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
journal={EMNLP-IJCNLP 2019},
pages={241},
year={2019}
}
```
[couplet]: https://huggingface.co/uer/gpt2-chinese-couplet
---
language: zh
widget:
- text: "[CLS] ,"
- text: "[CLS] ,"
---
# Chinese Poem GPT2 Model
## Model description
The model is used to generate Chinese ancient poems. You can download the model either from the [GPT2-Chinese Github page](https://github.com/Morizeyao/GPT2-Chinese), or via HuggingFace from the link [gpt2-chinese-poem][poem].
Since the parameter skip_special_tokens is used in pipelines.py, special tokens such as [SEP] and [UNK] will be deleted, and the output results may not be neat.
## How to use
You can use the model directly with a pipeline for text generation:
When the parameter skip_special_tokens is True:
```python
>>> from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline
>>> tokenizer = BertTokenizer.from_pretrained("uer/gpt2-chinese-poem")
>>> model = GPT2LMHeadModel.from_pretrained("uer/gpt2-chinese-poem")
>>> text_generator = TextGenerationPipeline(model, tokenizer)
>>> text_generator("[CLS]梅 山 如 积 翠 ,", max_length=50, do_sample=True)
[{'generated_text': '[CLS]梅 山 如 积 翠 , 的 手 堪 捧 。 遥 遥 仙 人 尉 , 盘 盘 故 时 陇 。 丹 泉 清 可 鉴 , 石 乳 甘 于 。 行 将 解 尘 缨 , 于 焉 蹈 高 踵 。 我'}]
```
When the parameter skip_special_tokens is False:
```python
>>> from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline
>>> tokenizer = BertTokenizer.from_pretrained("uer/gpt2-chinese-poem")
>>> model = GPT2LMHeadModel.from_pretrained("uer/gpt2-chinese-poem")
>>> text_generator = TextGenerationPipeline(model, tokenizer)
>>> text_generator("[CLS]梅 山 如 积 翠 ,", max_length=50, do_sample=True)
[{'generated_text': '[CLS]梅 山 如 积 翠 , 的 [UNK] 手 堪 捧 。 遥 遥 仙 人 尉 , 盘 盘 故 时 陇 。 丹 泉 清 可 鉴 , 石 乳 甘 可 捧 。 银 汉 迟 不 来 , 槎 头 欲 谁 揽 。 何'}]
```
## Training data
The training data contains 800,000 Chinese ancient poems collected by the [chinese-poetry](https://github.com/chinese-poetry/chinese-poetry) and [Poetry](https://github.com/Werneror/Poetry) projects.
## Training procedure
The model is pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud TI-ONE](https://cloud.tencent.com/product/tione/). We pre-train for 200,000 steps with a sequence length of 128.
```
python3 preprocess.py --corpus_path corpora/poem.txt \
--vocab_path models/google_zh_vocab.txt \
--dataset_path poem.pt --processes_num 16 \
--seq_length 128 --target lm
```
```
python3 pretrain.py --dataset_path poem.pt \
--vocab_path models/google_zh_vocab.txt \
--output_model_path models/poem_gpt_base_model.bin \
--config_path models/bert_base_config.json --learning_rate 5e-4 \
--tie_weight --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
--batch_size 64 --report_steps 1000 \
--save_checkpoint_steps 50000 --total_steps 200000 \
--embedding gpt --encoder gpt2 --target lm
```
### BibTeX entry and citation info
```
@article{zhao2019uer,
title={UER: An Open-Source Toolkit for Pre-training Models},
author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
journal={EMNLP-IJCNLP 2019},
pages={241},
year={2019}
}
```
[poem]: https://huggingface.co/uer/gpt2-chinese-poem
@@ -119,7 +119,7 @@ extras["dev"] = extras["all"] + extras["testing"] + extras["quality"] + extras["
 setup(
     name="transformers",
-    version="4.0.0-dev",
+    version="4.0.0-rc-1",
     author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Sam Shleifer, Patrick von Platen, Sylvain Gugger, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors",
     author_email="thomas@huggingface.co",
     description="State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch",
......
@@ -2,7 +2,7 @@
 # There's no way to ignore "F401 '...' imported but unused" warnings in this
 # module, but to preserve other warnings. So, don't check this module at all.
-__version__ = "4.0.0-dev"
+__version__ = "4.0.0-rc-1"
 # Work around to update TensorFlow's absl.logging threshold which alters the
 # default Python logging output behavior when present.
@@ -766,7 +766,10 @@ if is_tf_available():
     from .models.longformer import (
         TF_LONGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
         TFLongformerForMaskedLM,
+        TFLongformerForMultipleChoice,
         TFLongformerForQuestionAnswering,
+        TFLongformerForSequenceClassification,
+        TFLongformerForTokenClassification,
         TFLongformerModel,
         TFLongformerSelfAttention,
     )
......
@@ -43,6 +43,8 @@ class PretrainedConfig(object):
         - **is_composition** (:obj:`bool`): Whether the config class is composed of multiple sub-configs. In this case
           the config has to be initialized from two or more configs of type :class:`~transformers.PretrainedConfig`
           like: :class:`~transformers.EncoderDecoderConfig` or :class:`~RagConfig`.
+        - **keys_to_ignore_at_inference** (:obj:`List[str]`): A list of keys to ignore by default when looking at
+          dictionary outputs of the model during inference.
     Args:
         name_or_path (:obj:`str`, `optional`, defaults to :obj:`""`):
......
@@ -92,7 +92,10 @@ from ..funnel.modeling_tf_funnel import (
 from ..gpt2.modeling_tf_gpt2 import TFGPT2LMHeadModel, TFGPT2Model
 from ..longformer.modeling_tf_longformer import (
     TFLongformerForMaskedLM,
+    TFLongformerForMultipleChoice,
     TFLongformerForQuestionAnswering,
+    TFLongformerForSequenceClassification,
+    TFLongformerForTokenClassification,
     TFLongformerModel,
 )
 from ..lxmert.modeling_tf_lxmert import TFLxmertForPreTraining, TFLxmertModel
@@ -314,6 +317,7 @@ TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(
         (AlbertConfig, TFAlbertForSequenceClassification),
         (CamembertConfig, TFCamembertForSequenceClassification),
         (XLMRobertaConfig, TFXLMRobertaForSequenceClassification),
+        (LongformerConfig, TFLongformerForSequenceClassification),
         (RobertaConfig, TFRobertaForSequenceClassification),
         (BertConfig, TFBertForSequenceClassification),
         (XLNetConfig, TFXLNetForSequenceClassification),
@@ -353,6 +357,7 @@ TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING = OrderedDict(
         (FlaubertConfig, TFFlaubertForTokenClassification),
         (XLMConfig, TFXLMForTokenClassification),
         (XLMRobertaConfig, TFXLMRobertaForTokenClassification),
+        (LongformerConfig, TFLongformerForTokenClassification),
         (RobertaConfig, TFRobertaForTokenClassification),
         (BertConfig, TFBertForTokenClassification),
         (MobileBertConfig, TFMobileBertForTokenClassification),
@@ -368,6 +373,7 @@ TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING = OrderedDict(
         (CamembertConfig, TFCamembertForMultipleChoice),
         (XLMConfig, TFXLMForMultipleChoice),
         (XLMRobertaConfig, TFXLMRobertaForMultipleChoice),
+        (LongformerConfig, TFLongformerForMultipleChoice),
         (RobertaConfig, TFRobertaForMultipleChoice),
         (BertConfig, TFBertForMultipleChoice),
         (DistilBertConfig, TFDistilBertForMultipleChoice),
......
@@ -110,6 +110,7 @@ class BartConfig(PretrainedConfig):
             :obj:`True` for `bart-large-cnn`.
     """
     model_type = "bart"
+    keys_to_ignore_at_inference = ["past_key_values"]
     def __init__(
         self,
......
@@ -77,6 +77,7 @@ class CTRLConfig(PretrainedConfig):
     """
     model_type = "ctrl"
+    keys_to_ignore_at_inference = ["past_key_values"]
     def __init__(
         self,
......