chenpangpang / transformers / commit 7f60e93a

Unverified commit 7f60e93a, authored Jun 29, 2020 by Sylvain Gugger, committed via GitHub on Jun 29, 2020.
Parent: 482a5993

Mention openAI model card and merge content (#5378)

* Mention openAI model card and merge content
* Fix sentence
Showing 1 changed file with 18 additions and 4 deletions.

model_cards/gpt2-README.md (+18, -4)
@@ -13,8 +13,9 @@ Pretrained model on English language using a causal language modeling (CLM) obje
 [this paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
 and first released at [this page](https://openai.com/blog/better-language-models/).
 
-Disclaimer: The team releasing GPT-2 did not write a model card for this model so this model card has been written by
-the Hugging Face team.
+Disclaimer: The team releasing GPT-2 also wrote a
+[model card](https://github.com/openai/gpt-2/blob/master/model_card.md) for their model. Content from this model card
+has been written by the Hugging Face team to complete the information they provided and give specific examples of bias.
 
 ## Model description
@@ -79,7 +80,19 @@ output = model(encoded_input)
 ### Limitations and bias
 
 The training data used for this model has not been released as a dataset one can browse. We know it contains a lot of
-unfiltered from the internet, which is far from neutral. Therefore, the model can have biased predictions:
+unfiltered content from the internet, which is far from neutral. As the openAI team themselves point out in their
+[model card](https://github.com/openai/gpt-2/blob/master/model_card.md#out-of-scope-use-cases):
+
+> Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don’t support use-cases
+> that require the generated text to be true.
+>
+> Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do
+> not recommend that they be deployed into systems that interact with humans > unless the deployers first carry out a
+> study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race,
+> and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar
+> levels of caution around use cases that are sensitive to biases around human attributes.
+
+Here's an example of how the model can have biased predictions:
 
 ```python
 >>> from transformers import pipeline, set_seed
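The hunk above is cut off right after the opening code fence, so the bias example itself is not visible in this diff. As a rough illustration of the kind of probe the new text refers to, here is a minimal sketch using `pipeline` and `set_seed` from `transformers`; the prompts and generation parameters are illustrative assumptions, not the exact snippet in the model card.

```python
# Minimal sketch of a bias probe like the one the model card introduces.
# The prompts and parameters are illustrative, not taken from the commit.
from transformers import pipeline, set_seed

generator = pipeline("text-generation", model="gpt2")
set_seed(42)  # fix the seed so the completions are reproducible

# Compare completions for prompts that differ only in a single attribute.
for prompt in ("The man worked as a", "The woman worked as a"):
    outputs = generator(prompt, max_length=10, num_return_sequences=5)
    print(prompt)
    for out in outputs:
        print("  ", out["generated_text"])
```

Comparing the occupations generated for the two prompts is the sort of qualitative check the model card's example is meant to demonstrate.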
@@ -110,7 +123,8 @@ This bias will also affect all fine-tuned versions of this model.
 The OpenAI team wanted to train this model on a corpus as large as possible. To build it, they scraped all the web
 pages from outbound links on Reddit which received at least 3 karma. Note that all Wikipedia pages were removed from
 this dataset, so the model was not trained on any part of Wikipedia. The resulting dataset (called WebText) weights
-40GB of texts but has not been publicly released.
+40GB of texts but has not been publicly released. You can find a list of the top 1,000 domains present in WebText
+[here](https://github.com/openai/gpt-2/blob/master/domains.txt).
 
 ## Training procedure
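The sentence added in the last hunk links to `domains.txt` in the openai/gpt-2 repository. A quick, hypothetical way to peek at that list (not part of the commit; it assumes network access and uses the raw-file counterpart of the linked GitHub blob URL):

```python
# Illustrative only: fetch the WebText top-domains list referenced above and
# print the first few entries. The raw URL is assumed from the blob URL.
from urllib.request import urlopen

url = "https://raw.githubusercontent.com/openai/gpt-2/master/domains.txt"
with urlopen(url) as resp:
    domains = resp.read().decode("utf-8").splitlines()

print(f"{len(domains)} entries listed; first 10:")
for entry in domains[:10]:
    print(" ", entry)
```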