Unverified Commit 3c289fb3 authored by Kevin Canwen Xu, committed by GitHub

Remove outdated BERT tips (#6217)

* Remove out-dated BERT tips

* Update modeling_outputs.py

* Update bert.rst

* Update bert.rst
parent e4920c92
bert.rst
@@ -27,13 +27,8 @@ Tips:

 - BERT is a model with absolute position embeddings so it's usually advised to pad the inputs on
   the right rather than the left.
-- BERT was trained with a masked language modeling (MLM) objective. It is therefore efficient at predicting masked
-  tokens and at NLU in general, but is not optimal for text generation. Models trained with a causal language
-  modeling (CLM) objective are better in that regard.
-- Alongside MLM, BERT was trained using a next sentence prediction (NSP) objective using the [CLS] token as a sequence
-  approximate. The user may use this token (the first token in a sequence built with special tokens) to get a sequence
-  prediction rather than a token prediction. However, averaging over the sequence may yield better results than using
-  the [CLS] token.
+- BERT was trained with the masked language modeling (MLM) and next sentence prediction (NSP) objectives. It is efficient at predicting masked
+  tokens and at NLU in general, but is not optimal for text generation.

 The original code can be found `here <https://github.com/google-research/bert>`_.
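The tips kept in this hunk map directly onto code: right-side padding (because of BERT's absolute position embeddings) and masked-token prediction with the MLM head. A minimal sketch, assuming the `bert-base-uncased` checkpoint and the standard tokenizer/pipeline APIs (a usage illustration only, not part of this diff):

```python
from transformers import BertTokenizer, pipeline

# BERT uses absolute position embeddings, so batches are padded on the right
# (the default padding side for the BERT tokenizers).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(
    ["A short sentence.", "A noticeably longer sentence that forces padding of the first one."],
    padding=True,           # pads the shorter sequence on the right
    return_tensors="pt",
)
print(batch["attention_mask"])  # trailing zeros mark the right-side padding

# BERT was trained with an MLM objective, so it is well suited to filling in
# masked tokens, but not to open-ended text generation.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
mask = fill_mask.tokenizer.mask_token
print(fill_mask(f"Paris is the {mask} of France."))
```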
modeling_outputs.py
@@ -45,10 +45,6 @@ class BaseModelOutputWithPooling(ModelOutput):
             further processed by a Linear layer and a Tanh activation function. The Linear
             layer weights are trained from the next sentence prediction (classification)
             objective during pretraining.
-            This output is usually *not* a good summary
-            of the semantic content of the input, you're often better with averaging or pooling
-            the sequence of hidden-states for the whole input sequence.
         hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
             Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
             of shape :obj:`(batch_size, sequence_length, hidden_size)`.
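The docstring lines removed here recommended averaging or pooling the hidden states rather than relying on `pooler_output`. A short sketch of how both are read off the `BaseModelOutputWithPooling` returned by `BertModel`, assuming `bert-base-uncased` and `return_dict=True` (a usage illustration, not part of this commit):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT returns a pooled representation of the [CLS] token.", return_tensors="pt")
with torch.no_grad():
    # return_dict=True gives a BaseModelOutputWithPooling instead of a tuple
    outputs = model(**inputs, output_hidden_states=True, return_dict=True)

# pooler_output: the [CLS] hidden state passed through the Linear + Tanh head
# whose weights were trained on the next sentence prediction objective.
print(outputs.pooler_output.shape)       # (batch_size, hidden_size)

# hidden_states: embeddings output plus one tensor per layer, each of shape
# (batch_size, sequence_length, hidden_size); present because
# output_hidden_states=True was passed.
print(len(outputs.hidden_states))

# Mean-pooling the last layer is a common alternative summary of the input,
# as the removed docstring text suggested.
mean_pooled = outputs.last_hidden_state.mean(dim=1)
print(mean_pooled.shape)                 # (batch_size, hidden_size)
```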