Unverified commit cdcb206e, authored by Thomas Wolf and committed by GitHub

Merge pull request #273 from huggingface/update_to_fifth_release

Update to fifth release
parents 3c33499f 321d70a7
@@ -45,12 +45,14 @@ PyTorch pretrained bert can be installed by pip as follows:
pip install pytorch-pretrained-bert
```
If you want to reproduce the original tokenization process of the `OpenAI GPT` paper, you will need to install `ftfy` (limit to version 4.4.3 if you are using Python 2) and `SpaCy`:
```bash
pip install spacy ftfy==4.4.3
python -m spacy download en
```
If you don't install `ftfy` and `SpaCy`, the `OpenAI GPT` tokenizer will default to tokenizing with BERT's `BasicTokenizer` followed by Byte-Pair Encoding, which should be fine for most use cases.
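A minimal sketch of that fallback in practice (the example sentence is arbitrary) — the same call works whether or not `SpaCy` and `ftfy` are installed:
```python
from pytorch_pretrained_bert import OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
# With SpaCy + ftfy installed: the original OpenAI GPT pre-BPE tokenization.
# Without them: BERT's BasicTokenizer followed by BPE (a warning is logged).
print(tokenizer.tokenize("Hello world, this is a test."))
```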
### From source
Clone the repository and run:
@@ -58,12 +60,13 @@ Clone the repository and run:
pip install [--editable] .
```
Here also, if you want to reproduce the original tokenization process of the `OpenAI GPT` model, you will need to install `ftfy` (limit to version 4.4.3 if you are using Python 2) and `SpaCy`:
```bash
pip install spacy ftfy==4.4.3
python -m spacy download en
```
Again, if you don't install `ftfy` and `SpaCy`, the `OpenAI GPT` tokenizer will default to tokenizing with BERT's `BasicTokenizer` followed by Byte-Pair Encoding, which should be fine for most use cases.
A series of tests is included in the [tests folder](https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/tests) and can be run using `pytest` (install pytest if needed: `pip install pytest`).
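For example, from the repository root (a hedged invocation; the `-sv` flags are optional and just make the output verbose):
```bash
python -m pytest -sv tests/
```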
@@ -157,6 +160,10 @@ First let's prepare a tokenized input with `BertTokenizer`
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
@@ -230,6 +237,10 @@ First let's prepare a tokenized input with `OpenAIGPTTokenizer`
import torch
from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTModel, OpenAIGPTLMHeadModel
# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)
# Load pre-trained model tokenizer (vocabulary)
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
@@ -291,6 +302,10 @@ First let's prepare a tokenized input with `TransfoXLTokenizer`
import torch
from pytorch_pretrained_bert import TransfoXLTokenizer, TransfoXLModel, TransfoXLLMHeadModel
# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)
# Load pre-trained model tokenizer (vocabulary from wikitext 103)
tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
@@ -629,10 +644,12 @@ This model *outputs* a tuple of (last_hidden_state, new_mems)
`BertTokenizer` performs end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization.
This class has four arguments:
- `vocab_file`: path to a vocabulary file.
- `do_lower_case`: convert text to lower-case while tokenizing. **Default = True**.
- `max_len`: max length to filter the inputs of the Transformer. Defaults to the pre-trained value for the model if `None`. **Default = None**
- `never_split`: a list of tokens that should not be split during tokenization. **Default = `["[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]"]`**
and three methods:
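(The methods themselves are unchanged and elided from this hunk.) A minimal sketch of the two new arguments in use, assuming `from_pretrained` forwards extra keyword arguments to the constructor (its signature takes `**kwargs`); the values shown are just the documented defaults made explicit:
```python
from pytorch_pretrained_bert import BertTokenizer

# Extra keyword arguments are passed through to the tokenizer constructor.
tokenizer = BertTokenizer.from_pretrained(
    'bert-base-uncased',
    do_lower_case=True,
    never_split=["[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]"])

tokens = tokenizer.tokenize("Who was Jim Henson?")
ids = tokenizer.convert_tokens_to_ids(tokens)
```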
@@ -646,16 +663,20 @@ Please refer to the doc strings and code in [`tokenization.py`](./pytorch_pretra
`OpenAIGPTTokenizer` performs Byte-Pair-Encoding (BPE) tokenization.
This class has four arguments:
- `vocab_file`: path to a vocabulary file.
- `merges_file`: path to a file containing the BPE merges.
- `max_len`: max length to filter the inputs of the Transformer. Defaults to the pre-trained value for the model if `None`. **Default = None**
- `special_tokens`: a list of tokens to add to the vocabulary for fine-tuning. If SpaCy is not installed and BERT's `BasicTokenizer` is used as the pre-BPE tokenizer, these tokens are not split. **Default = None**
and five methods:
- `tokenize(text)`: convert a `str` into a list of `str` tokens by (1) performing pre-BPE tokenization (SpaCy & ftfy if installed, otherwise BERT's `BasicTokenizer`) and (2) Byte-Pair Encoding.
- `convert_tokens_to_ids(tokens)`: convert a list of `str` tokens into a list of `int` indices in the vocabulary.
- `convert_ids_to_tokens(tokens)`: convert a list of `int` indices into a list of `str` tokens in the vocabulary.
- `set_special_tokens(self, special_tokens)`: update the list of special tokens (see the arguments above).
- `decode(ids, skip_special_tokens=False, clean_up_tokenization_spaces=False)`: decode a list of `int` indices into a string, with optional post-processing: (i) remove special tokens from the output and (ii) clean up tokenization spaces.
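A minimal round-trip sketch of the two new methods (hedged; `__classify__` is a hypothetical task symbol):
```python
from pytorch_pretrained_bert import OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
# Append task-specific symbols at the end of the vocabulary.
tokenizer.set_special_tokens(["__classify__"])

ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("hello world"))
text = tokenizer.decode(ids, clean_up_tokenization_spaces=True)
```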
Please refer to the doc strings and code in [`tokenization_openai.py`](./pytorch_pretrained_bert/tokenization_openai.py) for the details of the `OpenAIGPTTokenizer`.
...
__version__ = "0.5.1"
from .tokenization import BertTokenizer, BasicTokenizer, WordpieceTokenizer
from .tokenization_openai import OpenAIGPTTokenizer
from .tokenization_transfo_xl import (TransfoXLTokenizer, TransfoXLCorpus)
...
@@ -959,7 +959,12 @@ class TransfoXLPreTrainedModel(nn.Module):
            for name, child in module._modules.items():
                if child is not None:
                    load(child, prefix + name + '.')
        # If the checkpoint was saved from a model with a head (keys prefixed with
        # 'transformer.') but is being loaded into a bare transformer, look the
        # weights up under that prefix.
        start_prefix = ''
        if not hasattr(model, 'transformer') and any(s.startswith('transformer.') for s in state_dict.keys()):
            start_prefix = 'transformer.'
        load(model, prefix=start_prefix)
        if len(missing_keys) > 0:
            logger.info("Weights of {} not initialized from pretrained model: {}".format(
                model.__class__.__name__, missing_keys))
...
@@ -26,6 +26,7 @@ from io import open
from tqdm import tqdm
from .file_utils import cached_path
from .tokenization import BasicTokenizer
logger = logging.getLogger(__name__)
@@ -72,8 +73,9 @@ class OpenAIGPTTokenizer(object):
""" """
BPE tokenizer. Peculiarities: BPE tokenizer. Peculiarities:
- lower case all inputs - lower case all inputs
- uses SpaCy tokenizer - uses SpaCy tokenizer and ftfy for pre-BPE tokenization if they are installed, fallback to BERT's BasicTokenizer if not.
- special tokens: additional symbols (ex: "__classify__") to add to a vocabulary. - argument special_tokens and function set_special_tokens:
can be used to add additional symbols (ex: "__classify__") to a vocabulary.
""" """
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, cache_dir=None, *inputs, **kwargs):
@@ -122,12 +124,15 @@ class OpenAIGPTTokenizer(object):
        try:
            import ftfy
            import spacy
            self.nlp = spacy.load('en', disable=['parser', 'tagger', 'ner', 'textcat'])
            self.fix_text = ftfy.fix_text
        except ImportError:
            logger.warning("ftfy or spacy is not installed. Using BERT BasicTokenizer instead of SpaCy & ftfy.")
            self.nlp = BasicTokenizer(do_lower_case=True,
                                      never_split=special_tokens if special_tokens is not None else [])
            self.fix_text = None
        self.max_len = max_len if max_len is not None else int(1e12)
        self.encoder = json.load(open(vocab_file, encoding="utf-8"))
        self.decoder = {v:k for k,v in self.encoder.items()}
        merges = open(merges_file, encoding='utf-8').read().split('\n')[1:-1]
@@ -150,6 +155,9 @@ class OpenAIGPTTokenizer(object):
            return
        self.special_tokens = dict((tok, len(self.encoder) + i) for i, tok in enumerate(special_tokens))
        self.special_tokens_decoder = {v:k for k, v in self.special_tokens.items()}
        if self.fix_text is None:
            # Using BERT's BasicTokenizer: we can update the tokenizer
            self.nlp.never_split = special_tokens
        logger.info("Special tokens {}".format(self.special_tokens))
    def bpe(self, token):
@@ -198,6 +206,13 @@ class OpenAIGPTTokenizer(object):
    def tokenize(self, text):
        """ Tokenize a string. """
        split_tokens = []
        if self.fix_text is None:
            # Using BERT's BasicTokenizer
            text = self.nlp.tokenize(text)
            for token in text:
                split_tokens.extend([t for t in self.bpe(token).split(' ')])
        else:
            # Using SpaCy & ftfy (original tokenization process of OpenAI GPT)
            text = self.nlp(text_standardize(self.fix_text(text)))
            for token in text:
                split_tokens.extend([t for t in self.bpe(token.text.lower()).split(' ')])
@@ -219,8 +234,8 @@ class OpenAIGPTTokenizer(object):
        if len(ids) > self.max_len:
            raise ValueError(
                "Token indices sequence length is longer than the specified maximum "
                " sequence length for this OpenAI GPT model ({} > {}). Running this"
                " sequence through the model will result in indexing errors".format(len(ids), self.max_len)
            )
        return ids
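A quick sketch of when this guard fires (hedged; overriding `max_len` by hand is purely for illustration):
```python
from pytorch_pretrained_bert import OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
tokenizer.max_len = 5  # artificially small, just to trigger the check
tokens = tokenizer.tokenize("a sentence long enough to exceed five BPE tokens")
ids = tokenizer.convert_tokens_to_ids(tokens)  # raises the ValueError above
```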
...
@@ -38,7 +38,7 @@ from setuptools import find_packages, setup
setup(
    name="pytorch_pretrained_bert",
    version="0.5.1",
    author="Thomas Wolf, Victor Sanh, Tim Rault, Google AI Language Team Authors, Open AI team Authors",
    author_email="thomas@huggingface.co",
    description="PyTorch version of Google AI BERT model with script to load Google pre-trained models",
...