Unverified Commit 3552d0e0 authored by Julien Chaumond, committed by GitHub

[model_cards] Migrate cards from this repo to model repos on huggingface.co (#9013)



* rm all model cards

* Update the .rst

@sgugger it is still not super crystal clear/streamlined, so let me know if you have any ideas to make it simpler

* Add a root-level README.md with simple instructions/context

* Update docs/source/model_sharing.rst
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* make style

* rm all model cards
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
parent 29e45979
---
datasets:
- mnli
tags:
- distilbart
- distilbart-mnli
pipeline_tag: zero-shot-classification
---
# DistilBart-MNLI
distilbart-mnli is the distilled version of bart-large-mnli created using the **No Teacher Distillation** technique proposed for BART summarization by Hugging Face, [here](https://github.com/huggingface/transformers/tree/master/examples/seq2seq#distilbart).
We simply copy alternating layers from `bart-large-mnli` and fine-tune further on the same data.
| | matched acc | mismatched acc |
| ------------------------------------------------------------------------------------ | ----------- | -------------- |
| [bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) (baseline, 12-12) | 89.9 | 90.01 |
| [distilbart-mnli-12-1](https://huggingface.co/valhalla/distilbart-mnli-12-1) | 87.08 | 87.5 |
| [distilbart-mnli-12-3](https://huggingface.co/valhalla/distilbart-mnli-12-3) | 88.1 | 88.19 |
| [distilbart-mnli-12-6](https://huggingface.co/valhalla/distilbart-mnli-12-6) | 89.19 | 89.01 |
| [distilbart-mnli-12-9](https://huggingface.co/valhalla/distilbart-mnli-12-9) | 89.56 | 89.52 |
This is a very simple and effective technique, as we can see the performance drop is very small.
Detailed performance trade-offs will be posted in this [sheet](https://docs.google.com/spreadsheets/d/1dQeUvAKpScLuhDV1afaPJRRAE55s2LpIzDVA5xfqxvk/edit?usp=sharing).
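For reference, here is a minimal zero-shot-classification sketch with the `transformers` pipeline; `valhalla/distilbart-mnli-12-3` is one of the checkpoints listed above, and the sequence/labels are made up for illustration:
```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="valhalla/distilbart-mnli-12-3")

sequence = "one day I will see the world"
candidate_labels = ["travel", "cooking", "dancing"]
classifier(sequence, candidate_labels)
# => {'sequence': 'one day I will see the world', 'labels': [...], 'scores': [...]}
```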
## Fine-tuning
If you want to train these models yourself, clone the [distillbart-mnli repo](https://github.com/patil-suraj/distillbart-mnli) and follow the steps below.
Clone and install transformers from source
```bash
git clone https://github.com/huggingface/transformers.git
pip install -qqq -U ./transformers
```
Download MNLI data
```bash
python transformers/utils/download_glue_data.py --data_dir glue_data --tasks MNLI
```
Create student model
```bash
python create_student.py \
  --teacher_model_name_or_path facebook/bart-large-mnli \
  --student_encoder_layers 12 \
  --student_decoder_layers 6 \
  --save_path student-bart-mnli-12-6
```
Start fine-tuning
```bash
python run_glue.py args.json
```
You can find the logs of these trained models in this [wandb project](https://wandb.ai/psuraj/distilbart-mnli).
# ELECTRA-BASE-DISCRIMINATOR finetuned on SQuADv1
This is the electra-base-discriminator model fine-tuned on the SQuADv1 dataset for the question answering task.
## Model details
As mentioned in the original paper: ELECTRA is a new method for self-supervised language representation learning.
It can be used to pre-train transformer networks using relatively little compute.
ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network,
similar to the discriminator of a GAN. At small scale, ELECTRA achieves strong results even when trained on a single GPU.
At large scale, ELECTRA achieves state-of-the-art results on the SQuAD 2.0 dataset.
| Param | #Value |
|---------------------|--------|
| layers | 12 |
| hidden size | 768 |
| num attention heads  | 12     |
| on disk size | 436MB |
## Model training
This model was trained on a Google Colab V100 GPU.
You can find the fine-tuning Colab here
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/11yo-LaFsgggwmDSy2P8zD3tzf5cCb-DU?usp=sharing).
## Results
The results are slightly better than those reported in the paper, where the authors mention that electra-base achieves 84.5 EM and 90.8 F1.
| Metric | #Value |
|--------|--------|
| EM | 85.0520|
| F1 | 91.6050|
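EM and F1 above are the standard SQuAD metrics. For reference, a minimal sketch of computing them with the `squad` metric from the `datasets` library (the prediction/reference pair below is a made-up placeholder, not taken from this model's evaluation):
```python
from datasets import load_metric

squad_metric = load_metric("squad")

# Placeholder example showing the expected input format.
predictions = [{"id": "0", "prediction_text": "42"}]
references = [{"id": "0", "answers": {"text": ["42"], "answer_start": [0]}}]

print(squad_metric.compute(predictions=predictions, references=references))
# {'exact_match': 100.0, 'f1': 100.0}
```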
## Model in Action 🚀
```python3
from transformers import pipeline
nlp = pipeline('question-answering', model='valhalla/electra-base-discriminator-finetuned_squadv1')
nlp({
    'question': 'What is the answer to everything ?',
    'context': '42 is the answer to life the universe and everything'
})
=> {'answer': '42', 'end': 2, 'score': 0.981274963050339, 'start': 0}
```
> Created with ❤️ by Suraj Patil [![Github icon](https://cdn0.iconfinder.com/data/icons/octicons/1024/mark-github-32.png)](https://github.com/patil-suraj/)
[![Twitter icon](https://cdn0.iconfinder.com/data/icons/shift-logotypes/32/Twitter-32.png)](https://twitter.com/psuraj28)
# LONGFORMER-BASE-4096 fine-tuned on SQuAD v1
This is longformer-base-4096 model fine-tuned on SQuAD v1 dataset for question answering task.
[Longformer](https://arxiv.org/abs/2004.05150) was created by Iz Beltagy, Matthew E. Peters, and Arman Cohan from AllenAI. As the paper explains:
> `Longformer` is a BERT-like model for long documents.
The pre-trained model can handle sequences with up to 4096 tokens.
## Model Training
This model was trained on google colab v100 GPU. You can find the fine-tuning colab here [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1zEl5D-DdkBKva-DdreVOmN0hrAfzKG1o?usp=sharing).
A few things to keep in mind when using Longformer for the QA task:
by default, Longformer uses sliding-window local attention on all tokens, but for QA all question tokens should have global attention (please refer to the paper for more details). The `LongformerForQuestionAnswering` model automatically does that for you (a manual alternative is sketched below). To allow it to do that:
1. The input sequence must have three sep tokens, i.e. the sequence should be encoded like this:
   `<s> question</s></s> context</s>`. If you encode the question and context as an input pair, the tokenizer already takes care of that, so you shouldn't worry about it.
2. `input_ids` should always be a batch of examples.
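Here is a minimal sketch of setting global attention on the question tokens explicitly via `global_attention_mask` (supported in recent `transformers` versions); it roughly mirrors what `LongformerForQuestionAnswering` does automatically:
```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("valhalla/longformer-base-4096-finetuned-squadv1")
model = AutoModelForQuestionAnswering.from_pretrained("valhalla/longformer-base-4096-finetuned-squadv1")

question = "What has Huggingface done ?"
text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this."
encoding = tokenizer(question, text, return_tensors="pt")
input_ids = encoding["input_ids"]

# Global attention on the question tokens: 1 up to (and including) the first </s>, 0 elsewhere.
first_sep = (input_ids[0] == tokenizer.sep_token_id).nonzero()[0].item()
global_attention_mask = torch.zeros_like(input_ids)
global_attention_mask[0, : first_sep + 1] = 1

outputs = model(input_ids,
                attention_mask=encoding["attention_mask"],
                global_attention_mask=global_attention_mask)
# With recent transformers versions the logits are at outputs.start_logits / outputs.end_logits.
```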
## Results
|Metric | # Value |
|-------------|---------|
| Exact Match | 85.1466 |
| F1 | 91.5415 |
## Model in Action 🚀
```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
tokenizer = AutoTokenizer.from_pretrained("valhalla/longformer-base-4096-finetuned-squadv1")
model = AutoModelForQuestionAnswering.from_pretrained("valhalla/longformer-base-4096-finetuned-squadv1")
text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this."
question = "What has Huggingface done ?"
encoding = tokenizer(question, text, return_tensors="pt")
input_ids = encoding["input_ids"]
# default is local attention everywhere
# the forward method will automatically set global attention on question tokens
attention_mask = encoding["attention_mask"]
start_scores, end_scores = model(input_ids, attention_mask=attention_mask)
all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
answer_tokens = all_tokens[torch.argmax(start_scores) :torch.argmax(end_scores)+1]
answer = tokenizer.decode(tokenizer.convert_tokens_to_ids(answer_tokens))
# output => democratized NLP
```
`LongformerForQuestionAnswering` isn't supported in `pipeline` yet. I'll update this card once support has been added.
> Created with ❤️ by Suraj Patil [![Github icon](https://cdn0.iconfinder.com/data/icons/octicons/1024/mark-github-32.png)](https://github.com/patil-suraj/)
[![Twitter icon](https://cdn0.iconfinder.com/data/icons/shift-logotypes/32/Twitter-32.png)](https://twitter.com/psuraj28)
---
datasets:
- squad
tags:
- question-generation
widget:
- text: "Python is a programming language. It is developed by Guido Van Rossum and released in 1991. </s>"
license: mit
---
## T5 for question-generation
This is a [t5-base](https://arxiv.org/abs/1910.10683) model trained for the end-to-end question generation task. Simply input the text and the model will generate multiple questions.
You can play with the model using the inference API: just enter the text and see the results!
For more details, see [this](https://github.com/patil-suraj/question_generation) repo.
### Model in action 🚀
You'll need to clone the [repo](https://github.com/patil-suraj/question_generation).
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/question_generation/blob/master/question_generation.ipynb)
```python3
from pipelines import pipeline
text = "Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum \
and first released in 1991, Python's design philosophy emphasizes code \
readability with its notable use of significant whitespace."
nlp = pipeline("e2e-qg", model="valhalla/t5-base-e2e-qg")
nlp(text)
=> [
'Who created Python?',
'When was Python first released?',
"What is Python's design philosophy?"
]
```
---
datasets:
- squad
tags:
- question-generation
widget:
- text: "generate question: <hl> 42 <hl> is the answer to life, the universe and everything. </s>"
- text: "question: What is 42 context: 42 is the answer to life, the universe and everything. </s>"
license: mit
---
## T5 for multi-task QA and QG
This is a multi-task [t5-base](https://arxiv.org/abs/1910.10683) model trained for the question answering and answer-aware question generation tasks.
For question generation, the answer spans are highlighted within the text with special highlight tokens (`<hl>`) and the input is prefixed with 'generate question: '. For QA, the input is processed like this: `question: question_text context: context_text </s>`
You can play with the model using the inference API. Here's how you can use it:
For QG
`generate question: <hl> 42 <hl> is the answer to life, the universe and everything. </s>`
For QA
`question: What is 42 context: 42 is the answer to life, the universe and everything. </s>`
For more details, see [this](https://github.com/patil-suraj/question_generation) repo.
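If you would rather use plain `transformers` than the repo's `pipelines` wrapper, here is a minimal sketch that feeds the input formats documented above straight into the model (the `max_length` value is an arbitrary choice, and the expected outputs are the ones shown in the pipeline example below):
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("valhalla/t5-base-qa-qg-hl")
model = AutoModelForSeq2SeqLM.from_pretrained("valhalla/t5-base-qa-qg-hl")

def run(prompt):
    # prompt follows the QG/QA formats documented above;
    # recent tokenizers add the closing </s> automatically, so keeping it in the string is harmless
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_length=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# question generation: highlight the answer span with <hl> tokens
run("generate question: <hl> 42 <hl> is the answer to life, the universe and everything. </s>")
# expected: 'What is the answer to life, the universe and everything?'

# question answering
run("question: What is 42 context: 42 is the answer to life, the universe and everything. </s>")
# expected: 'the answer to life, the universe and everything'
```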
### Model in action 🚀
You'll need to clone the [repo](https://github.com/patil-suraj/question_generation).
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/question_generation/blob/master/question_generation.ipynb)
```python3
from pipelines import pipeline
nlp = pipeline("multitask-qa-qg", model="valhalla/t5-base-qa-qg-hl")
# to generate questions simply pass the text
nlp("42 is the answer to life, the universe and everything.")
=> [{'answer': '42', 'question': 'What is the answer to life, the universe and everything?'}]
# for qa pass a dict with "question" and "context"
nlp({
    "question": "What is 42 ?",
    "context": "42 is the answer to life, the universe and everything."
})
=> 'the answer to life, the universe and everything'
```
---
datasets:
- squad
tags:
- question-generation
widget:
- text: "<hl> 42 <hl> is the answer to life, the universe and everything. </s>"
- text: "Python is a programming language. It is developed by <hl> Guido Van Rossum <hl>. </s>"
- text: "Although <hl> practicality <hl> beats purity </s>"
license: mit
---
## T5 for question-generation
This is a [t5-base](https://arxiv.org/abs/1910.10683) model trained for the answer-aware question generation task. The answer spans are highlighted within the text with special highlight tokens.
You can play with the model using the inference API: just highlight the answer spans with `<hl>` tokens and end the text with `</s>`. For example:
`<hl> 42 <hl> is the answer to life, the universe and everything. </s>`
For more details, see [this](https://github.com/patil-suraj/question_generation) repo.
### Model in action 🚀
You'll need to clone the [repo](https://github.com/patil-suraj/question_generation).
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/question_generation/blob/master/question_generation.ipynb)
```python3
from pipelines import pipeline
nlp = pipeline("question-generation", model="valhalla/t5-base-qg-hl")
nlp("42 is the answer to life, universe and everything.")
=> [{'answer': '42', 'question': 'What is the answer to life, universe and everything?'}]
```
# T5 for question-answering
This is a T5-base model fine-tuned on SQuAD1.1 for QA using the text-to-text approach.
## Model training
This model was trained on a Colab TPU with 35GB RAM for 4 epochs.
## Results
| Metric | #Value |
|-------------|---------|
| Exact Match | 81.5610 |
| F1 | 89.9601 |
## Model in Action 🚀
```python
from transformers import AutoModelWithLMHead, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("valhalla/t5-base-squad")
model = AutoModelWithLMHead.from_pretrained("valhalla/t5-base-squad")
def get_answer(question, context):
    input_text = "question: %s context: %s </s>" % (question, context)
    features = tokenizer([input_text], return_tensors='pt')
    out = model.generate(input_ids=features['input_ids'],
                         attention_mask=features['attention_mask'])
    return tokenizer.decode(out[0])
context = "In Norse mythology, Valhalla is a majestic, enormous hall located in Asgard, ruled over by the god Odin."
question = "What is Valhalla ?"
get_answer(question, context)
# output: 'a majestic, enormous hall located in Asgard, ruled over by the god Odin'
```
Play with this model [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1a5xpJiUjZybfU9Mi-aDkOp116PZ9-wni?usp=sharing)
> Created by Suraj Patil [![Github icon](https://cdn0.iconfinder.com/data/icons/octicons/1024/mark-github-32.png)](https://github.com/patil-suraj/)
[![Twitter icon](https://cdn0.iconfinder.com/data/icons/shift-logotypes/32/Twitter-32.png)](https://twitter.com/psuraj28)
---
datasets:
- squad
tags:
- question-generation
widget:
- text: "answer: 42 context: 42 is the answer to life, the universe and everything. </s>"
- text: "answer: Guido Van Rossum context: Python is a programming language. It is developed by Guido Van Rossum. </s>"
- text: "answer: Explicit context: Explicit is better than implicit </s>"
license: mit
---
## T5 for question-generation
This is a [t5-small](https://arxiv.org/abs/1910.10683) model trained for the answer-aware question generation task. The answer text is prepended to the context text.
You can play with the model using the inference API: just put the input text in this format and see the results!
`answer: answer_text context: context_text </s>`
For example
`answer: 42 context: 42 is the answer to life, the universe and everything. </s>`
For more details, see [this](https://github.com/patil-suraj/question_generation) repo.
### Model in action 🚀
You'll need to clone the [repo](https://github.com/patil-suraj/question_generation).
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/question_generation/blob/master/question_generation.ipynb)
```python3
from pipelines import pipeline
nlp = pipeline("question-generation", qg_format="prepend")
nlp("42 is the answer to life, universe and everything.")
=> [{'answer': '42', 'question': 'What is the answer to life, universe and everything?'}]
```
---
datasets:
- squad
tags:
- question-generation
widget:
- text: "Python is developed by Guido Van Rossum and released in 1991. </s>"
license: mit
---
## T5 for question-generation
This is a [t5-small](https://arxiv.org/abs/1910.10683) model trained for the end-to-end question generation task. Simply input the text and the model will generate multiple questions.
You can play with the model using the inference API: just enter the text and see the results!
For more details, see [this](https://github.com/patil-suraj/question_generation) repo.
### Model in action 🚀
You'll need to clone the [repo](https://github.com/patil-suraj/question_generation).
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/question_generation/blob/master/question_generation.ipynb)
```python3
from pipelines import pipeline
text = "Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum \
and first released in 1991, Python's design philosophy emphasizes code \
readability with its notable use of significant whitespace."
nlp = pipeline("e2e-qg")
nlp(text)
=> [
'Who created Python?',
'When was Python first released?',
"What is Python's design philosophy?"
]
```
---
datasets:
- squad
tags:
- question-generation
widget:
- text: "generate question: <hl> 42 <hl> is the answer to life, the universe and everything. </s>"
- text: "question: What is 42 context: 42 is the answer to life, the universe and everything. </s>"
license: mit
---
## T5 for multi-task QA and QG
This is a multi-task [t5-small](https://arxiv.org/abs/1910.10683) model trained for the question answering and answer-aware question generation tasks.
For question generation, the answer spans are highlighted within the text with special highlight tokens (`<hl>`) and the input is prefixed with 'generate question: '. For QA, the input is processed like this: `question: question_text context: context_text </s>`
You can play with the model using the inference API. Here's how you can use it:
For QG
`generate question: <hl> 42 <hl> is the answer to life, the universe and everything. </s>`
For QA
`question: What is 42 context: 42 is the answer to life, the universe and everything. </s>`
For more details, see [this](https://github.com/patil-suraj/question_generation) repo.
### Model in action 🚀
You'll need to clone the [repo](https://github.com/patil-suraj/question_generation).
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/question_generation/blob/master/question_generation.ipynb)
```python3
from pipelines import pipeline
nlp = pipeline("multitask-qa-qg")
# to generate questions simply pass the text
nlp("42 is the answer to life, the universe and everything.")
=> [{'answer': '42', 'question': 'What is the answer to life, the universe and everything?'}]
# for qa pass a dict with "question" and "context"
nlp({
    "question": "What is 42 ?",
    "context": "42 is the answer to life, the universe and everything."
})
=> 'the answer to life, the universe and everything'
```
---
datasets:
- squad
tags:
- question-generation
widget:
- text: "<hl> 42 <hl> is the answer to life, the universe and everything. </s>"
- text: "Python is a programming language. It is developed by <hl> Guido Van Rossum <hl>. </s>"
- text: "Simple is better than <hl> complex <hl>. </s>"
license: mit
---
## T5 for question-generation
This is a [t5-small](https://arxiv.org/abs/1910.10683) model trained for the answer-aware question generation task. The answer spans are highlighted within the text with special highlight tokens.
You can play with the model using the inference API: just highlight the answer spans with `<hl>` tokens and end the text with `</s>`. For example:
`<hl> 42 <hl> is the answer to life, the universe and everything. </s>`
For more details, see [this](https://github.com/patil-suraj/question_generation) repo.
### Model in action 🚀
You'll need to clone the [repo](https://github.com/patil-suraj/question_generation).
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/question_generation/blob/master/question_generation.ipynb)
```python3
from pipelines import pipeline
nlp = pipeline("question-generation")
nlp("42 is the answer to life, universe and everything.")
=> [{'answer': '42', 'question': 'What is the answer to life, universe and everything?'}]
```
# <a name="introduction"></a> BERTweet: A pre-trained language model for English Tweets
- BERTweet is the first public large-scale language model pre-trained for English Tweets. BERTweet is trained based on the [RoBERTa](https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.md) pre-training procedure, using the same model configuration as [BERT-base](https://github.com/google-research/bert).
- The corpus used to pre-train BERTweet consists of 850M English Tweets (16B word tokens ~ 80GB), containing 845M Tweets streamed from 01/2012 to 08/2019 and 5M Tweets related to the **COVID-19** pandemic.
- BERTweet does better than its competitors RoBERTa-base and [XLM-R-base](https://arxiv.org/abs/1911.02116) and outperforms previous state-of-the-art models on three downstream Tweet NLP tasks of Part-of-speech tagging, Named entity recognition and text classification.
The general architecture and experimental results of BERTweet can be found in our [paper](https://arxiv.org/abs/2005.10200):
    @inproceedings{bertweet,
        title     = {{BERTweet: A pre-trained language model for English Tweets}},
        author    = {Dat Quoc Nguyen and Thanh Vu and Anh Tuan Nguyen},
        booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
        year      = {2020}
    }
**Please CITE** our paper when BERTweet is used to help produce published results or is incorporated into other software.
For further information or requests, please go to [BERTweet's homepage](https://github.com/VinAIResearch/BERTweet)!
### <a name="install2"></a> Installation
- Python 3.6+, and PyTorch 1.1.0+ (or TensorFlow 2.0+)
- Install `transformers`:
- `git clone https://github.com/huggingface/transformers.git`
- `cd transformers`
- `pip3 install --upgrade .`
- Install `emoji`: `pip3 install emoji`
### <a name="models2"></a> Pre-trained models
Model | #params | Arch. | Pre-training data
---|---|---|---
`vinai/bertweet-base` | 135M | base | 845M English Tweets (cased)
`vinai/bertweet-covid19-base-cased` | 135M | base | 23M COVID-19 English Tweets (cased)
`vinai/bertweet-covid19-base-uncased` | 135M | base | 23M COVID-19 English Tweets (uncased)
The two pre-trained models `vinai/bertweet-covid19-base-cased` and `vinai/bertweet-covid19-base-uncased` result from further pre-training `vinai/bertweet-base` on a corpus of 23M COVID-19 English Tweets for 40 epochs.
### <a name="usage2"></a> Example usage
```python
import torch
from transformers import AutoModel, AutoTokenizer
bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
# INPUT TWEET IS ALREADY NORMALIZED!
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"
input_ids = torch.tensor([tokenizer.encode(line)])
with torch.no_grad():
    features = bertweet(input_ids)  # Model outputs are now tuples
## With TensorFlow 2.0+:
# from transformers import TFAutoModel
# bertweet = TFAutoModel.from_pretrained("vinai/bertweet-base")
```
### <a name="preprocess"></a> Normalize raw input Tweets
Before applying `fastBPE` to the pre-training corpus of 850M English Tweets, we tokenized these Tweets using `TweetTokenizer` from the NLTK toolkit and used the `emoji` package to translate emotion icons into text strings (here, each icon is referred to as a word token). We also normalized the Tweets by converting user mentions and web/url links into special tokens `@USER` and `HTTPURL`, respectively. Thus it is recommended to also apply the same pre-processing step for BERTweet-based downstream applications w.r.t. the raw input Tweets. BERTweet provides this pre-processing step by enabling the `normalization` argument.
```python
import torch
from transformers import AutoTokenizer
# Load the AutoTokenizer with a normalization mode if the input Tweet is raw
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)
# from transformers import BertweetTokenizer
# tokenizer = BertweetTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)
line = "SC has first two presumptive cases of coronavirus, DHEC confirms https://postandcourier.com/health/covid19/sc-has-first-two-presumptive-cases-of-coronavirus-dhec-confirms/article_bddfe4ae-5fd3-11ea-9ce4-5f495366cee6.html?utm_medium=social&utm_source=twitter&utm_campaign=user-share… via @postandcourier"
input_ids = torch.tensor([tokenizer.encode(line)])
```
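If the input Tweet is raw and you prefer to normalize it yourself, here is a rough sketch of the normalization described above (an approximation for illustration, not the exact pre-processing script used for BERTweet):
```python
from emoji import demojize
from nltk.tokenize import TweetTokenizer

tweet_tokenizer = TweetTokenizer()

def normalize_tweet(tweet):
    normalized = []
    for token in tweet_tokenizer.tokenize(tweet):
        if token.startswith("@") and len(token) > 1:
            normalized.append("@USER")          # user mentions -> @USER
        elif token.lower().startswith(("http", "www")):
            normalized.append("HTTPURL")        # web/url links -> HTTPURL
        else:
            normalized.append(demojize(token))  # emotion icons -> :text: strings
    return " ".join(normalized)

print(normalize_tweet("SC has first two presumptive cases of coronavirus 😢 https://t.co/abc via @DHEC"))
# Something like: "SC has first two presumptive cases of coronavirus :cry: HTTPURL via @USER"
# (the exact emoji alias depends on the emoji package version)
```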
# <a name="introduction"></a> BERTweet: A pre-trained language model for English Tweets
- BERTweet is the first public large-scale language model pre-trained for English Tweets. BERTweet is trained based on the [RoBERTa](https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.md) pre-training procedure, using the same model configuration as [BERT-base](https://github.com/google-research/bert).
- The corpus used to pre-train BERTweet consists of 850M English Tweets (16B word tokens ~ 80GB), containing 845M Tweets streamed from 01/2012 to 08/2019 and 5M Tweets related to the **COVID-19** pandemic.
- BERTweet does better than its competitors RoBERTa-base and [XLM-R-base](https://arxiv.org/abs/1911.02116) and outperforms previous state-of-the-art models on three downstream Tweet NLP tasks of Part-of-speech tagging, Named entity recognition and text classification.
The general architecture and experimental results of BERTweet can be found in our [paper](https://arxiv.org/abs/2005.10200):
    @inproceedings{bertweet,
        title     = {{BERTweet: A pre-trained language model for English Tweets}},
        author    = {Dat Quoc Nguyen and Thanh Vu and Anh Tuan Nguyen},
        booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
        year      = {2020}
    }
**Please CITE** our paper when BERTweet is used to help produce published results or is incorporated into other software.
For further information or requests, please go to [BERTweet's homepage](https://github.com/VinAIResearch/BERTweet)!
### <a name="install2"></a> Installation
- Python 3.6+, and PyTorch 1.1.0+ (or TensorFlow 2.0+)
- Install `transformers`:
- `git clone https://github.com/huggingface/transformers.git`
- `cd transformers`
- `pip3 install --upgrade .`
- Install `emoji`: `pip3 install emoji`
### <a name="models2"></a> Pre-trained models
Model | #params | Arch. | Pre-training data
---|---|---|---
`vinai/bertweet-base` | 135M | base | 845M English Tweets (cased)
`vinai/bertweet-covid19-base-cased` | 135M | base | 23M COVID-19 English Tweets (cased)
`vinai/bertweet-covid19-base-uncased` | 135M | base | 23M COVID-19 English Tweets (uncased)
The two pre-trained models `vinai/bertweet-covid19-base-cased` and `vinai/bertweet-covid19-base-uncased` result from further pre-training `vinai/bertweet-base` on a corpus of 23M COVID-19 English Tweets for 40 epochs.
### <a name="usage2"></a> Example usage
```python
import torch
from transformers import AutoModel, AutoTokenizer
bertweet = AutoModel.from_pretrained("vinai/bertweet-covid19-base-cased")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-covid19-base-cased")
# INPUT TWEET IS ALREADY NORMALIZED!
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"
input_ids = torch.tensor([tokenizer.encode(line)])
with torch.no_grad():
    features = bertweet(input_ids)  # Model outputs are now tuples
## With TensorFlow 2.0+:
# from transformers import TFAutoModel
# bertweet = TFAutoModel.from_pretrained("vinai/bertweet-covid19-base-cased")
```
### <a name="preprocess"></a> Normalize raw input Tweets
Before applying `fastBPE` to the pre-training corpus of 850M English Tweets, we tokenized these Tweets using `TweetTokenizer` from the NLTK toolkit and used the `emoji` package to translate emotion icons into text strings (here, each icon is referred to as a word token). We also normalized the Tweets by converting user mentions and web/url links into special tokens `@USER` and `HTTPURL`, respectively. Thus it is recommended to also apply the same pre-processing step for BERTweet-based downstream applications w.r.t. the raw input Tweets. BERTweet provides this pre-processing step by enabling the `normalization` argument.
```python
import torch
from transformers import AutoTokenizer
# Load the AutoTokenizer with a normalization mode if the input Tweet is raw
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-covid19-base-cased", normalization=True)
# from transformers import BertweetTokenizer
# tokenizer = BertweetTokenizer.from_pretrained("vinai/bertweet-covid19-base-cased", normalization=True)
line = "SC has first two presumptive cases of coronavirus, DHEC confirms https://postandcourier.com/health/covid19/sc-has-first-two-presumptive-cases-of-coronavirus-dhec-confirms/article_bddfe4ae-5fd3-11ea-9ce4-5f495366cee6.html?utm_medium=social&utm_source=twitter&utm_campaign=user-share… via @postandcourier"
input_ids = torch.tensor([tokenizer.encode(line)])
```
# <a name="introduction"></a> BERTweet: A pre-trained language model for English Tweets
- BERTweet is the first public large-scale language model pre-trained for English Tweets. BERTweet is trained based on the [RoBERTa](https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.md) pre-training procedure, using the same model configuration as [BERT-base](https://github.com/google-research/bert).
- The corpus used to pre-train BERTweet consists of 850M English Tweets (16B word tokens ~ 80GB), containing 845M Tweets streamed from 01/2012 to 08/2019 and 5M Tweets related to the **COVID-19** pandemic.
- BERTweet does better than its competitors RoBERTa-base and [XLM-R-base](https://arxiv.org/abs/1911.02116) and outperforms previous state-of-the-art models on three downstream Tweet NLP tasks of Part-of-speech tagging, Named entity recognition and text classification.
The general architecture and experimental results of BERTweet can be found in our [paper](https://arxiv.org/abs/2005.10200):
    @inproceedings{bertweet,
        title     = {{BERTweet: A pre-trained language model for English Tweets}},
        author    = {Dat Quoc Nguyen and Thanh Vu and Anh Tuan Nguyen},
        booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
        year      = {2020}
    }
**Please CITE** our paper when BERTweet is used to help produce published results or is incorporated into other software.
For further information or requests, please go to [BERTweet's homepage](https://github.com/VinAIResearch/BERTweet)!
### <a name="install2"></a> Installation
- Python 3.6+, and PyTorch 1.1.0+ (or TensorFlow 2.0+)
- Install `transformers`:
- `git clone https://github.com/huggingface/transformers.git`
- `cd transformers`
- `pip3 install --upgrade .`
- Install `emoji`: `pip3 install emoji`
### <a name="models2"></a> Pre-trained models
Model | #params | Arch. | Pre-training data
---|---|---|---
`vinai/bertweet-base` | 135M | base | 845M English Tweets (cased)
`vinai/bertweet-covid19-base-cased` | 135M | base | 23M COVID-19 English Tweets (cased)
`vinai/bertweet-covid19-base-uncased` | 135M | base | 23M COVID-19 English Tweets (uncased)
The two pre-trained models `vinai/bertweet-covid19-base-cased` and `vinai/bertweet-covid19-base-uncased` result from further pre-training `vinai/bertweet-base` on a corpus of 23M COVID-19 English Tweets for 40 epochs.
### <a name="usage2"></a> Example usage
```python
import torch
from transformers import AutoModel, AutoTokenizer
bertweet = AutoModel.from_pretrained("vinai/bertweet-covid19-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-covid19-base-uncased")
# INPUT TWEET IS ALREADY NORMALIZED!
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"
input_ids = torch.tensor([tokenizer.encode(line)])
with torch.no_grad():
    features = bertweet(input_ids)  # Model outputs are now tuples
## With TensorFlow 2.0+:
# from transformers import TFAutoModel
# bertweet = TFAutoModel.from_pretrained("vinai/bertweet-covid19-base-uncased")
```
### <a name="preprocess"></a> Normalize raw input Tweets
Before applying `fastBPE` to the pre-training corpus of 850M English Tweets, we tokenized these Tweets using `TweetTokenizer` from the NLTK toolkit and used the `emoji` package to translate emotion icons into text strings (here, each icon is referred to as a word token). We also normalized the Tweets by converting user mentions and web/url links into special tokens `@USER` and `HTTPURL`, respectively. Thus it is recommended to also apply the same pre-processing step for BERTweet-based downstream applications w.r.t. the raw input Tweets. BERTweet provides this pre-processing step by enabling the `normalization` argument.
```python
import torch
from transformers import AutoTokenizer
# Load the AutoTokenizer with a normalization mode if the input Tweet is raw
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-covid19-base-uncased", normalization=True)
# from transformers import BertweetTokenizer
# tokenizer = BertweetTokenizer.from_pretrained("vinai/bertweet-covid19-base-uncased", normalization=True)
line = "SC has first two presumptive cases of coronavirus, DHEC confirms https://postandcourier.com/health/covid19/sc-has-first-two-presumptive-cases-of-coronavirus-dhec-confirms/article_bddfe4ae-5fd3-11ea-9ce4-5f495366cee6.html?utm_medium=social&utm_source=twitter&utm_campaign=user-share… via @postandcourier"
input_ids = torch.tensor([tokenizer.encode(line)])
```
# <a name="introduction"></a> PhoBERT: Pre-trained language models for Vietnamese
Pre-trained PhoBERT models are the state-of-the-art language models for Vietnamese ([Pho](https://en.wikipedia.org/wiki/Pho), i.e. "Phở", is a popular food in Vietnam):
- The two PhoBERT versions, "base" and "large", are the first public large-scale monolingual language models pre-trained for Vietnamese. The PhoBERT pre-training approach is based on [RoBERTa](https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.md), which optimizes the [BERT](https://github.com/google-research/bert) pre-training procedure for more robust performance.
- PhoBERT outperforms previous monolingual and multilingual approaches, obtaining new state-of-the-art performances on four downstream Vietnamese NLP tasks of Part-of-speech tagging, Dependency parsing, Named-entity recognition and Natural language inference.
The general architecture and experimental results of PhoBERT can be found in our EMNLP-2020 Findings [paper](https://arxiv.org/abs/2003.00744):
    @article{phobert,
        title   = {{PhoBERT: Pre-trained language models for Vietnamese}},
        author  = {Dat Quoc Nguyen and Anh Tuan Nguyen},
        journal = {Findings of EMNLP},
        year    = {2020}
    }
**Please CITE** our paper when PhoBERT is used to help produce published results or is incorporated into other software.
For further information or requests, please go to [PhoBERT's homepage](https://github.com/VinAIResearch/PhoBERT)!
### Installation <a name="install2"></a>
- Python 3.6+, and PyTorch 1.1.0+ (or TensorFlow 2.0+)
- Install `transformers`:
- `git clone https://github.com/huggingface/transformers.git`
- `cd transformers`
- `pip3 install --upgrade .`
### Pre-trained models <a name="models2"></a>
Model | #params | Arch. | Pre-training data
---|---|---|---
`vinai/phobert-base` | 135M | base | 20GB of texts
`vinai/phobert-large` | 370M | large | 20GB of texts
### Example usage <a name="usage2"></a>
```python
import torch
from transformers import AutoModel, AutoTokenizer
phobert = AutoModel.from_pretrained("vinai/phobert-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
# INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!
line = "Tôi là sinh_viên trường đại_học Công_nghệ ."
input_ids = torch.tensor([tokenizer.encode(line)])
with torch.no_grad():
    features = phobert(input_ids)  # Model outputs are now tuples
## With TensorFlow 2.0+:
# from transformers import TFAutoModel
# phobert = TFAutoModel.from_pretrained("vinai/phobert-base")
```
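The example above assumes the input text is already word-segmented. A rough sketch of segmenting raw Vietnamese text with VnCoreNLP's RDRSegmenter, as recommended on PhoBERT's homepage (the jar path below is a placeholder for your local VnCoreNLP download):
```python
# pip3 install vncorenlp, then download VnCoreNLP-1.1.1.jar and its word-segmentation models
from vncorenlp import VnCoreNLP

rdrsegmenter = VnCoreNLP("/path/to/VnCoreNLP-1.1.1.jar", annotators="wseg", max_heap_size="-Xmx500m")

sentences = rdrsegmenter.tokenize("Tôi là sinh viên trường đại học Công nghệ .")
for sentence in sentences:
    print(" ".join(sentence))
# Expected to match the segmented example above:
# Tôi là sinh_viên trường đại_học Công_nghệ .
```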
# <a name="introduction"></a> PhoBERT: Pre-trained language models for Vietnamese
Pre-trained PhoBERT models are the state-of-the-art language models for Vietnamese ([Pho](https://en.wikipedia.org/wiki/Pho), i.e. "Phở", is a popular food in Vietnam):
- The two PhoBERT versions, "base" and "large", are the first public large-scale monolingual language models pre-trained for Vietnamese. The PhoBERT pre-training approach is based on [RoBERTa](https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.md), which optimizes the [BERT](https://github.com/google-research/bert) pre-training procedure for more robust performance.
- PhoBERT outperforms previous monolingual and multilingual approaches, obtaining new state-of-the-art performances on four downstream Vietnamese NLP tasks of Part-of-speech tagging, Dependency parsing, Named-entity recognition and Natural language inference.
The general architecture and experimental results of PhoBERT can be found in our EMNLP-2020 Findings [paper](https://arxiv.org/abs/2003.00744):
    @article{phobert,
        title   = {{PhoBERT: Pre-trained language models for Vietnamese}},
        author  = {Dat Quoc Nguyen and Anh Tuan Nguyen},
        journal = {Findings of EMNLP},
        year    = {2020}
    }
**Please CITE** our paper when PhoBERT is used to help produce published results or is incorporated into other software.
For further information or requests, please go to [PhoBERT's homepage](https://github.com/VinAIResearch/PhoBERT)!
### Installation <a name="install2"></a>
- Python 3.6+, and PyTorch 1.1.0+ (or TensorFlow 2.0+)
- Install `transformers`:
- `git clone https://github.com/huggingface/transformers.git`
- `cd transformers`
- `pip3 install --upgrade .`
### Pre-trained models <a name="models2"></a>
Model | #params | Arch. | Pre-training data
---|---|---|---
`vinai/phobert-base` | 135M | base | 20GB of texts
`vinai/phobert-large` | 370M | large | 20GB of texts
### Example usage <a name="usage2"></a>
```python
import torch
from transformers import AutoModel, AutoTokenizer
phobert = AutoModel.from_pretrained("vinai/phobert-large")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-large")
# INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!
line = "Tôi là sinh_viên trường đại_học Công_nghệ ."
input_ids = torch.tensor([tokenizer.encode(line)])
with torch.no_grad():
    features = phobert(input_ids)  # Model outputs are now tuples
## With TensorFlow 2.0+:
# from transformers import TFAutoModel
# phobert = TFAutoModel.from_pretrained("vinai/phobert-large")
```
---
language: zh
---
# albert_chinese_base
This is an albert_chinese_base model from [Google's github](https://github.com/google-research/ALBERT),
converted with huggingface's [conversion script](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py).
## Attention
Since sentencepiece is not used in the albert_chinese_base model, you have to call `BertTokenizer` instead of `AlbertTokenizer`; loading with `AlbertTokenizer` would fail to load the vocabulary.
We can run a MaskedLM prediction to verify that this works.
## Verification
[colab trial](https://colab.research.google.com/drive/1Wjz48Uws6-VuSHv_-DcWLilv77-AaYgj)
```python
import torch
from torch.nn.functional import softmax
from transformers import AlbertForMaskedLM, BertTokenizer

pretrained = 'voidful/albert_chinese_base'
tokenizer = BertTokenizer.from_pretrained(pretrained)  # BertTokenizer, not AlbertTokenizer (see above)
model = AlbertForMaskedLM.from_pretrained(pretrained)

inputtext = "今天[MASK]情很好"

# position of the [MASK] token in the encoded sequence
maskpos = tokenizer.encode(inputtext, add_special_tokens=True).index(tokenizer.mask_token_id)

input_ids = torch.tensor(tokenizer.encode(inputtext, add_special_tokens=True)).unsqueeze(0)  # Batch size 1
outputs = model(input_ids, masked_lm_labels=input_ids)  # on recent transformers versions, pass labels=input_ids instead
loss, prediction_scores = outputs[:2]
logit_prob = softmax(prediction_scores[0, maskpos], dim=-1).data.tolist()
predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_token, logit_prob[predicted_index])
```
Result: `感 0.36333346366882324`