chenpangpang / transformers / Commits
e5c78c66
Commit
e5c78c66
authored
Jan 10, 2019
by
thomwolf
Browse files
update readme and few typos
parent
fa5222c2
Showing 3 changed files with 7 additions and 7 deletions (+7, -7).
README.md (+4, -4)
examples/extract_features.py (+2, -2)
pytorch_pretrained_bert/modeling.py (+1, -1)
README.md

-# PyTorch Pretrained Bert - PyTorch Pretrained OpenAI GPT
+# PyTorch Pretrained Bert (also with PyTorch Pretrained OpenAI GPT)
 
 [… CircleCI badge …](https://circleci.com/gh/huggingface/pytorch-pretrained-BERT)
...
@@ -125,18 +125,18 @@ from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
 tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
 
 # Tokenized input
-text = "Who was Jim Henson ? Jim Henson was a puppeteer"
+text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
 tokenized_text = tokenizer.tokenize(text)
 
 # Mask a token that we will try to predict back with `BertForMaskedLM`
 masked_index = 6
 tokenized_text[masked_index] = '[MASK]'
-assert tokenized_text == ['who', 'was', 'jim', 'henson', '?', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer']
+assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']
 
 # Convert token to vocabulary indices
 indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
 # Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
-segments_ids = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
+segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
 
 # Convert inputs to PyTorch tensors
 tokens_tensor = torch.tensor([indexed_tokens])
...
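For reference, here is a minimal runnable sketch of the full masked-LM example this hunk belongs to. The model-loading and prediction steps are not part of the diff, so they are an assumption based on the pytorch_pretrained_bert API of this era; note also that in the updated token list the `[MASK]` sits at index 8 (the second "henson"), so the sketch masks that position.

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenized input with the explicit [CLS]/[SEP] markers introduced by this commit
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)

# Mask the second "henson" (index 8 in the updated token list) and predict it back
masked_index = 8
tokenized_text[masked_index] = '[MASK]'

indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
segments_ids = [0] * 7 + [1] * 7  # sentence A up to the first [SEP], sentence B afterwards

tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

with torch.no_grad():
    predictions = model(tokens_tensor, segments_tensors)  # [1, seq_len, vocab_size]

predicted_index = torch.argmax(predictions[0, masked_index]).item()
print(tokenizer.convert_ids_to_tokens([predicted_index])[0])  # expected: 'henson'
```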
examples/extract_features.py

...
@@ -80,10 +80,10 @@ def convert_examples_to_features(examples, seq_length, tokenizer):
     # The convention in BERT is:
     # (a) For sequence pairs:
     #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
-    #  type_ids: 0   0  0    0    0     0       0 0    1  1  1  1   1 1
+    #  type_ids:   0   0  0    0    0     0       0   0   1  1  1  1   1   1
     # (b) For single sequences:
     #  tokens:   [CLS] the dog is hairy . [SEP]
-    #  type_ids: 0   0   0   0  0     0 0
+    #  type_ids:   0   0   0   0  0     0   0
     #
     # Where "type_ids" are used to indicate whether this is the first
     # sequence or the second sequence. The embedding vectors for `type=0` and
...
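As an illustration of the convention these comments document (a hedged sketch, not code from this file), here is how tokens and type_ids line up for a sequence pair, assuming two hypothetical pre-tokenized inputs tokens_a and tokens_b:

```python
# Sketch only: mirrors the [CLS]/[SEP]/type_ids convention documented above.
tokens_a = ['is', 'this', 'jack', '##son', '##ville', '?']   # hypothetical sentence A
tokens_b = ['no', 'it', 'is', 'not', '.']                    # hypothetical sentence B

tokens = ['[CLS]'] + tokens_a + ['[SEP]']
type_ids = [0] * (len(tokens_a) + 2)            # [CLS], sentence A and its [SEP] get type 0

tokens += tokens_b + ['[SEP]']
type_ids += [1] * (len(tokens_b) + 1)           # sentence B and the final [SEP] get type 1

assert len(tokens) == len(type_ids) == 14
```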
pytorch_pretrained_bert/modeling.py

...
@@ -584,7 +584,7 @@ class BertModel(BertPreTrainedModel):
         to the last attention block of shape [batch_size, sequence_length, hidden_size],
         `pooled_output`: a torch.FloatTensor of size [batch_size, hidden_size] which is the output of a
             classifier pretrained on top of the hidden state associated to the first character of the
-            input (`CLF`) to train on the Next-Sentence task (see BERT's paper).
+            input (`CLS`) to train on the Next-Sentence task (see BERT's paper).
 
     Example usage:
 
     ```python
...
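A short usage sketch of the two outputs this docstring describes, following the pytorch_pretrained_bert API at this commit (shapes as documented; the input sentence is just illustrative):

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize("[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]")
tokens_tensor = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
segments_tensor = torch.tensor([[0] * 7 + [1] * 7])

model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

with torch.no_grad():
    # encoded_layers: list of per-layer tensors, each [batch_size, sequence_length, hidden_size]
    # pooled_output:  [batch_size, hidden_size], built from the first token ([CLS]) and used
    #                 for the next-sentence prediction head
    encoded_layers, pooled_output = model(tokens_tensor, segments_tensor)

print(len(encoded_layers), encoded_layers[-1].shape, pooled_output.shape)
```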