chenpangpang / transformers · Commits

Commit e5c78c66 authored Jan 10, 2019 by thomwolf

update readme and few typos

parent fa5222c2
Showing 3 changed files with 7 additions and 7 deletions (+7 -7):

  README.md                             +4 -4
  examples/extract_features.py          +2 -2
  pytorch_pretrained_bert/modeling.py   +1 -1
README.md

-# PyTorch Pretrained Bert - PyTorch Pretrained OpenAI GPT
+# PyTorch Pretrained Bert (also with PyTorch Pretrained OpenAI GPT)

 [ ](https://circleci.com/gh/huggingface/pytorch-pretrained-BERT)
...
...
@@ -125,18 +125,18 @@ from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
 tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

 # Tokenized input
-text = "Who was Jim Henson ? Jim Henson was a puppeteer"
+text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
 tokenized_text = tokenizer.tokenize(text)

 # Mask a token that we will try to predict back with `BertForMaskedLM`
 masked_index = 6
 tokenized_text[masked_index] = '[MASK]'
-assert tokenized_text == ['who', 'was', 'jim', 'henson', '?', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer']
+assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']

 # Convert token to vocabulary indices
 indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

 # Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
-segments_ids = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
+segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

 # Convert inputs to PyTorch tensors
 tokens_tensor = torch.tensor([indexed_tokens])
...
...
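For context (not part of this commit), here is a self-contained sketch of the updated Quickstart flow. It assumes a working `pytorch_pretrained_bert` install with network access to download the pre-trained weights, and it uses `masked_index = 8` rather than the `6` shown in the hunk above, so that the mask lands on the second "henson" once the `[CLS]`/`[SEP]` tokens are counted:

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

# Load the pre-trained tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenized input with explicit [CLS]/[SEP] markers, as in the updated README
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)

# Mask the second 'henson'; with [CLS]/[SEP] included it sits at index 8
masked_index = 8
tokenized_text[masked_index] = '[MASK]'

# Convert tokens to vocabulary indices and build the sentence A/B segment ids
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

# Predict the masked token back with the masked-LM head
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()
with torch.no_grad():
    predictions = model(tokens_tensor, segments_tensors)

predicted_index = torch.argmax(predictions[0, masked_index]).item()
print(tokenizer.convert_ids_to_tokens([predicted_index])[0])
```

Run end to end, this should print `henson` as the top prediction for the masked position.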
examples/extract_features.py
...
...
@@ -80,10 +80,10 @@ def convert_examples_to_features(examples, seq_length, tokenizer):
     # The convention in BERT is:
     # (a) For sequence pairs:
     #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
-    #  type_ids: 0   0  0    0    0     0       0 0    1  1  1  1   1 1
+    #  type_ids:   0   0  0    0    0     0       0   0   1  1  1  1   1   1
     # (b) For single sequences:
     #  tokens:   [CLS] the dog is hairy . [SEP]
-    #  type_ids: 0     0   0   0  0     0 0
+    #  type_ids:   0   0   0   0  0     0   0
     #
     # Where "type_ids" are used to indicate whether this is the first
     # sequence or the second sequence. The embedding vectors for `type=0` and
...
...
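The comment block above only describes the convention in prose and ASCII alignment. As a rough illustration (not code from this repo; the helper name `build_inputs` is made up), a small function like the one below produces exactly those `tokens`/`type_ids` pairs:

```python
def build_inputs(tokens_a, tokens_b=None):
    """Assemble [CLS]/[SEP]-delimited tokens and the matching type_ids (segment ids)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    type_ids = [0] * len(tokens)                  # first sequence, including [CLS] and its [SEP]
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        type_ids += [1] * (len(tokens_b) + 1)     # second sequence, including the final [SEP]
    return tokens, type_ids

# (a) sequence pair -> 8 zeros followed by 6 ones
tokens, type_ids = build_inputs(
    ["is", "this", "jack", "##son", "##ville", "?"],
    ["no", "it", "is", "not", "."],
)

# (b) single sequence -> 7 zeros
tokens, type_ids = build_inputs(["the", "dog", "is", "hairy", "."])
```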
pytorch_pretrained_bert/modeling.py

...

@@ -584,7 +584,7 @@ class BertModel(BertPreTrainedModel):
             to the last attention block of shape [batch_size, sequence_length, hidden_size],
         `pooled_output`: a torch.FloatTensor of size [batch_size, hidden_size] which is the output of a
             classifier pretrained on top of the hidden state associated to the first character of the
-            input (`CLF`) to train on the Next-Sentence task (see BERT's paper).
+            input (`CLS`) to train on the Next-Sentence task (see BERT's paper).
Example usage:
```python
...
...
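# Not part of this commit: a hedged sketch of how the `pooled_output` described above
# is obtained from `BertModel`, reusing the Quickstart-style inputs from the README.
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize("[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]")
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
segment_ids = torch.tensor([[0] * 7 + [1] * 7])

model = BertModel.from_pretrained('bert-base-uncased')
model.eval()
with torch.no_grad():
    encoded_layers, pooled_output = model(input_ids, segment_ids)

# encoded_layers: list of [batch_size, sequence_length, hidden_size] tensors (one per layer)
# pooled_output:  [batch_size, hidden_size], built from the hidden state of the first token ([CLS])
print(len(encoded_layers), pooled_output.shape)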