Unverified commit 601d4d69 authored by Thomas Wolf, committed by GitHub

[tokenizers] Updates data processors, docstring, examples and model cards to the new API (#5308)

* remove references to old API in docstring - update data processors

* style

* fix tests - better type checking error messages

* better type checking

* include awesome fix by @LysandreJik for #5310

* updated doc and examples
parent fd405e9a
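In short, the commit moves every call from `encode_plus` / `batch_encode_plus` to calling the tokenizer directly, and replaces `pad_to_max_length=True` with explicit `padding` and `truncation` arguments. A minimal before/after sketch of the pattern (the checkpoint and example text are illustrative, not taken from the diff):

```python
from transformers import AutoTokenizer

# any pretrained checkpoint works here; bert-base-uncased is just an example
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# old API (still available, but superseded by __call__):
# enc = tokenizer.encode_plus("a question", "some context",
#                             max_length=32, pad_to_max_length=True, return_tensors="pt")

# new API: call the tokenizer directly; padding and truncation are explicit
enc = tokenizer(
    "a question",
    "some context",
    max_length=32,
    padding="max_length",   # pad every sequence to max_length
    truncation=True,        # truncate anything longer than max_length
    return_tensors="pt",
)
print(enc["input_ids"].shape)  # torch.Size([1, 32])

# batched input replaces batch_encode_plus: pass a list of strings
batch = tokenizer(["first sentence", "second, longer sentence"], padding=True, return_tensors="pt")
```

The same pattern repeats in the hunks below.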
@@ -287,8 +287,8 @@ pytorch_model = BertForSequenceClassification.from_pretrained('./save/', from_tf
 sentence_0 = "This research was consistent with his findings."
 sentence_1 = "His findings were compatible with this research."
 sentence_2 = "His findings were not compatible with this research."
-inputs_1 = tokenizer.encode_plus(sentence_0, sentence_1, add_special_tokens=True, return_tensors='pt')
-inputs_2 = tokenizer.encode_plus(sentence_0, sentence_2, add_special_tokens=True, return_tensors='pt')
+inputs_1 = tokenizer(sentence_0, sentence_1, add_special_tokens=True, return_tensors='pt')
+inputs_2 = tokenizer(sentence_0, sentence_2, add_special_tokens=True, return_tensors='pt')
 pred_1 = pytorch_model(inputs_1['input_ids'], token_type_ids=inputs_1['token_type_ids'])[0].argmax().item()
 pred_2 = pytorch_model(inputs_2['input_ids'], token_type_ids=inputs_2['token_type_ids'])[0].argmax().item()
...
@@ -167,7 +167,7 @@ Here's an example showcasing everything so far:
 Indices can be obtained using :class:`transformers.AlbertTokenizer`.
 See :func:`transformers.PreTrainedTokenizer.encode` and
-:func:`transformers.PreTrainedTokenizer.encode_plus` for details.
+:func:`transformers.PreTrainedTokenizer.__call__` for details.
 `What are input IDs? <../glossary.html#input-ids>`__
 ```
...
@@ -11,7 +11,7 @@ The base classes ``PreTrainedTokenizer`` and ``PreTrainedTokenizerFast`` impleme
 - adding new tokens to the vocabulary in a way that is independant of the underlying structure (BPE, SentencePiece...),
 - managing special tokens like mask, beginning-of-sentence, etc tokens (adding them, assigning them to attributes in the tokenizer for easy access and making sure they are not split during tokenization)
-``BatchEncoding`` holds the output of the tokenizer's encoding methods (``encode_plus`` and ``batch_encode_plus``) and is derived from a Python dictionary. When the tokenizer is a pure python tokenizer, this class behave just like a standard python dictionary and hold the various model inputs computed by these methodes (``input_ids``, ``attention_mask``...). When the tokenizer is a "Fast" tokenizer (i.e. backed by HuggingFace tokenizers library), this class provides in addition several advanced alignement methods which can be used to map between the original string (character and words) and the token space (e.g. getting the index of the token comprising a given character or the span of characters corresponding to a given token).
+``BatchEncoding`` holds the output of the tokenizer's encoding methods (``__call__``, ``encode_plus`` and ``batch_encode_plus``) and is derived from a Python dictionary. When the tokenizer is a pure python tokenizer, this class behave just like a standard python dictionary and hold the various model inputs computed by these methodes (``input_ids``, ``attention_mask``...). When the tokenizer is a "Fast" tokenizer (i.e. backed by HuggingFace tokenizers library), this class provides in addition several advanced alignement methods which can be used to map between the original string (character and words) and the token space (e.g. getting the index of the token comprising a given character or the span of characters corresponding to a given token).
 ``PreTrainedTokenizer``
 ~~~~~~~~~~~~~~~~~~~~~~~~
...
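The alignment helpers mentioned in that paragraph only exist on "fast" tokenizers. A minimal sketch of what they look like in use (the checkpoint and sample text are illustrative, not part of this commit):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoding = tokenizer("Transformers are great")

print(encoding.tokens())          # subword tokens, including [CLS] and [SEP]
# map a character position in the original string to its token index ...
print(encoding.char_to_token(0))  # index of the token covering the character at position 0
# ... and a token index back to its character span in the original string
print(encoding.token_to_chars(1)) # CharSpan(start, end) of the first real token
```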
@@ -74,7 +74,7 @@ of each other. The process is the following:
 with the weights stored in the checkpoint.
 - Build a sequence from the two sentences, with the correct model-specific separators token type ids
 and attention masks (:func:`~transformers.PreTrainedTokenizer.encode` and
-:func:`~transformers.PreTrainedTokenizer.encode_plus` take care of this)
+:func:`~transformers.PreTrainedTokenizer.__call__` take care of this)
 - Pass this sequence through the model so that it is classified in one of the two available classes: 0
 (not a paraphrase) and 1 (is a paraphrase)
 - Compute the softmax of the result to get probabilities over the classes
@@ -95,8 +95,8 @@ of each other. The process is the following:
 >>> sequence_1 = "Apples are especially bad for your health"
 >>> sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
->>> paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, return_tensors="pt")
->>> not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, return_tensors="pt")
+>>> paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
+>>> not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")
 >>> paraphrase_classification_logits = model(**paraphrase)[0]
 >>> not_paraphrase_classification_logits = model(**not_paraphrase)[0]
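The process list above ends with a softmax step that falls outside the hunks shown here. A minimal self-contained sketch of that last step in PyTorch (the logits tensor below is a dummy placeholder standing in for `paraphrase_classification_logits`):

```python
import torch

# dummy logits with shape [batch, num_classes], as returned by the sequence classification model
paraphrase_classification_logits = torch.tensor([[0.1, 2.3]])

# softmax over the class dimension turns the logits into probabilities
paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]

# class 1 = "is a paraphrase", class 0 = "not a paraphrase"
print(f"is a paraphrase: {round(paraphrase_results[1] * 100)}%")
print(f"not a paraphrase: {round(paraphrase_results[0] * 100)}%")
```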
@@ -128,8 +128,8 @@ of each other. The process is the following:
 >>> sequence_1 = "Apples are especially bad for your health"
 >>> sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
->>> paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, return_tensors="tf")
->>> not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, return_tensors="tf")
+>>> paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="tf")
+>>> not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="tf")
 >>> paraphrase_classification_logits = model(paraphrase)[0]
 >>> not_paraphrase_classification_logits = model(not_paraphrase)[0]
@@ -221,7 +221,7 @@ Here is an example of question answering using a model and a tokenizer. The proc
 ... ]
 >>> for question in questions:
-...     inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
+...     inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
 ...     input_ids = inputs["input_ids"].tolist()[0]
 ...
 ...     text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
...
@@ -263,7 +263,7 @@ Here is an example of question answering using a model and a tokenizer. The proc
 ... ]
 >>> for question in questions:
-...     inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="tf")
+...     inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="tf")
 ...     input_ids = inputs["input_ids"].numpy()[0]
 ...
 ...     text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
...
@@ -77,7 +77,7 @@ other than bias and layer normalization terms:
 optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5)
 Now we can set up a simple dummy training batch using
-:func:`~transformers.PreTrainedTokenizer.batch_encode_plus`. This returns a
+:func:`~transformers.PreTrainedTokenizer.__call__`. This returns a
 :func:`~transformers.BatchEncoding` instance which
 prepares everything we might need to pass to the model.
...
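For context, a minimal sketch of what such a dummy training batch looks like end to end with the new call syntax (the checkpoint, sentences and labels are placeholders, not taken from the surrounding doc):

```python
import torch
from transformers import AdamW, BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
optimizer = AdamW(model.parameters(), lr=1e-5)

# calling the tokenizer on a list of strings returns a BatchEncoding whose
# input_ids and attention_mask are already padded to the longest sequence
text_batch = ["I love this movie.", "I did not enjoy this movie at all."]
encoding = tokenizer(text_batch, padding=True, truncation=True, return_tensors="pt")

labels = torch.tensor([1, 0])
outputs = model(**encoding, labels=labels)
loss = outputs[0]  # with labels passed, the loss is the first element of the output tuple in transformers 3.x
loss.backward()
optimizer.step()
```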
@@ -298,12 +298,13 @@ def hans_convert_examples_to_features(
         if ex_index % 10000 == 0:
             logger.info("Writing example %d" % (ex_index))
-        inputs = tokenizer.encode_plus(
+        inputs = tokenizer(
             example.text_a,
             example.text_b,
             add_special_tokens=True,
             max_length=max_length,
-            pad_to_max_length=True,
+            padding="max_length",
+            truncation=True,
             return_overflowing_tokens=True,
         )
...
@@ -193,12 +193,12 @@ def make_qa_retriever_model(model_name="google/bert_uncased_L-8_H-512_A-8", from
 def make_qa_retriever_batch(qa_list, tokenizer, max_len=64, device="cuda:0"):
     q_ls = [q for q, a in qa_list]
     a_ls = [a for q, a in qa_list]
-    q_toks = tokenizer.batch_encode_plus(q_ls, max_length=max_len, pad_to_max_length=True)
+    q_toks = tokenizer(q_ls, max_length=max_len, padding="max_length", truncation=True)
     q_ids, q_mask = (
         torch.LongTensor(q_toks["input_ids"]).to(device),
         torch.LongTensor(q_toks["attention_mask"]).to(device),
     )
-    a_toks = tokenizer.batch_encode_plus(a_ls, max_length=max_len, pad_to_max_length=True)
+    a_toks = tokenizer(a_ls, max_length=max_len, padding="max_length", truncation=True)
     a_ids, a_mask = (
         torch.LongTensor(a_toks["input_ids"]).to(device),
         torch.LongTensor(a_toks["attention_mask"]).to(device),
@@ -375,12 +375,12 @@ def make_qa_s2s_model(model_name="facebook/bart-large", from_file=None, device="
 def make_qa_s2s_batch(qa_list, tokenizer, max_len=64, max_a_len=360, device="cuda:0"):
     q_ls = [q for q, a in qa_list]
     a_ls = [a for q, a in qa_list]
-    q_toks = tokenizer.batch_encode_plus(q_ls, max_length=max_len, pad_to_max_length=True)
+    q_toks = tokenizer(q_ls, max_length=max_len, padding="max_length", truncation=True)
     q_ids, q_mask = (
         torch.LongTensor(q_toks["input_ids"]).to(device),
         torch.LongTensor(q_toks["attention_mask"]).to(device),
     )
-    a_toks = tokenizer.batch_encode_plus(a_ls, max_length=min(max_len, max_a_len), pad_to_max_length=True)
+    a_toks = tokenizer(a_ls, max_length=min(max_len, max_a_len), padding="max_length", truncation=True)
     a_ids, a_mask = (
         torch.LongTensor(a_toks["input_ids"]).to(device),
         torch.LongTensor(a_toks["attention_mask"]).to(device),
@@ -531,7 +531,7 @@ def qa_s2s_generate(
 # ELI5-trained retrieval model usage
 ###############
 def embed_passages_for_retrieval(passages, tokenizer, qa_embedder, max_length=128, device="cuda:0"):
-    a_toks = tokenizer.batch_encode_plus(passages, max_length=max_length, pad_to_max_length=True)
+    a_toks = tokenizer(passages, max_length=max_length, padding="max_length", truncation=True)
     a_ids, a_mask = (
         torch.LongTensor(a_toks["input_ids"]).to(device),
         torch.LongTensor(a_toks["attention_mask"]).to(device),
@@ -542,7 +542,7 @@ def embed_passages_for_retrieval(passages, tokenizer, qa_embedder, max_length=12
 def embed_questions_for_retrieval(q_ls, tokenizer, qa_embedder, device="cuda:0"):
-    q_toks = tokenizer.batch_encode_plus(q_ls, max_length=128, pad_to_max_length=True)
+    q_toks = tokenizer(q_ls, max_length=128, padding="max_length", truncation=True)
     q_ids, q_mask = (
         torch.LongTensor(q_toks["input_ids"]).to(device),
         torch.LongTensor(q_toks["attention_mask"]).to(device),
...
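As a side note, the manual `torch.LongTensor(...)` wrapping in these helpers can also be dropped with the new API by asking the tokenizer for tensors directly. A hedged sketch of what that could look like (this helper is not part of the commit; the checkpoint comes from the hunk above):

```python
from transformers import AutoTokenizer

def make_padded_batch(texts, tokenizer, max_len=64, device="cpu"):
    # return_tensors="pt" yields padded torch tensors directly, no LongTensor conversion needed
    toks = tokenizer(texts, max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
    return toks["input_ids"].to(device), toks["attention_mask"].to(device)

tokenizer = AutoTokenizer.from_pretrained("google/bert_uncased_L-8_H-512_A-8")
ids, mask = make_padded_batch(["how do planes fly?", "why is the sky blue?"], tokenizer)
```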
@@ -424,7 +424,7 @@ MASKED_BERT_INPUTS_DOCSTRING = r"""
 Indices can be obtained using :class:`transformers.BertTokenizer`.
 See :func:`transformers.PreTrainedTokenizer.encode` and
-:func:`transformers.PreTrainedTokenizer.encode_plus` for details.
+:func:`transformers.PreTrainedTokenizer.__call__` for details.
 `What are input IDs? <../glossary.html#input-ids>`__
 attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
...
@@ -510,12 +510,13 @@ def convert_examples_to_features(
             else:
                 text_b = example.question + " " + ending
-            inputs = tokenizer.encode_plus(
+            inputs = tokenizer(
                 text_a,
                 text_b,
                 add_special_tokens=True,
                 max_length=max_length,
-                pad_to_max_length=True,
+                padding="max_length",
+                truncation=True,
                 return_overflowing_tokens=True,
             )
             if "num_truncated_tokens" in inputs and inputs["num_truncated_tokens"] > 0:
...
@@ -45,9 +45,9 @@ def generate_summaries_or_translations(
     for batch in tqdm(list(chunks(examples, batch_size))):
         if "t5" in model_name:
             batch = [model.config.prefix + text for text in batch]
-        batch = tokenizer.batch_encode_plus(
-            batch, max_length=1024, return_tensors="pt", truncation=True, pad_to_max_length=True
-        ).to(device)
+        batch = tokenizer(batch, max_length=1024, return_tensors="pt", truncation=True, padding="max_length").to(
+            device
+        )
         summaries = model.generate(**batch, **gen_kwargs)
         dec = tokenizer.batch_decode(summaries, skip_special_tokens=True, clean_up_tokenization_spaces=False)
         for hypothesis in dec:
...
@@ -41,12 +41,12 @@ def encode_file(
     assert lns, f"found empty file at {data_path}"
     examples = []
     for text in tqdm(lns, desc=f"Tokenizing {data_path.name}"):
-        tokenized = tokenizer.batch_encode_plus(
+        tokenized = tokenizer(
             [text],
             max_length=max_length,
-            pad_to_max_length=pad_to_max_length,
-            add_prefix_space=True,
+            padding="max_length" if pad_to_max_length else None,
             truncation=True,
+            add_prefix_space=True,
             return_tensors=return_tensors,
         )
         assert tokenized.input_ids.shape[1] == max_length
...
@@ -40,7 +40,7 @@ def roberta_similarity_batches(to_predict):
     return similarity_scores
 def similarity_roberta(model, tokenizer, sent_pairs):
-    batch_token = tokenizer.batch_encode_plus(sent_pairs, pad_to_max_length=True, max_length=500)
+    batch_token = tokenizer(sent_pairs, padding='max_length', truncation=True, max_length=500)
     res = model(torch.tensor(batch_token['input_ids']).cuda(), attention_mask=torch.tensor(batch_token["attention_mask"]).cuda())
     return res
...
@@ -60,7 +60,7 @@ tokenizer = BartTokenizer.from_pretrained('a-ware/bart-squadv2')
 model = BartForQuestionAnswering.from_pretrained('a-ware/bart-squadv2')
 question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
-encoding = tokenizer.encode_plus(question, text, return_tensors='pt')
+encoding = tokenizer(question, text, return_tensors='pt')
 input_ids = encoding['input_ids']
 attention_mask = encoding['attention_mask']
...
@@ -43,7 +43,7 @@ tokenizer = XLMRobertaTokenizer.from_pretrained('a-ware/xlmroberta-squadv2')
 model = XLMRobertaForQuestionAnswering.from_pretrained('a-ware/xlmroberta-squadv2')
 question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
-encoding = tokenizer.encode_plus(question, text, return_tensors='pt')
+encoding = tokenizer(question, text, return_tensors='pt')
 input_ids = encoding['input_ids']
 attention_mask = encoding['attention_mask']
...
@@ -14,7 +14,7 @@ Therefore, this model does not need a tokenizer. The following function can inst
 import torch
 # Encoding
-def encode(list_of_strings, pad_to_max_length=True, pad_token_id=0):
+def encode(list_of_strings, pad_token_id=0):
     max_length = max([len(string) for string in list_of_strings])
     # create emtpy tensors
...
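The body of this character-level `encode` helper is truncated in the hunk above. A hedged sketch of how such a function can be completed (the byte-value offset and the exact padding scheme below are assumptions for illustration, not read from the model card):

```python
import torch

def encode(list_of_strings, pad_token_id=0):
    max_length = max([len(string) for string in list_of_strings])

    # create empty tensors: attention mask of zeros, input ids filled with the pad token
    attention_masks = torch.zeros((len(list_of_strings), max_length), dtype=torch.long)
    input_ids = torch.full((len(list_of_strings), max_length), pad_token_id, dtype=torch.long)

    for idx, string in enumerate(list_of_strings):
        # make sure the string is in byte format
        if not isinstance(string, bytes):
            string = str.encode(string)
        # assumption: each byte value is shifted to leave room for special ids such as the pad token
        input_ids[idx, : len(string)] = torch.tensor([byte + 2 for byte in string])
        attention_masks[idx, : len(string)] = 1

    return input_ids, attention_masks
```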
@@ -43,7 +43,7 @@ questions = [
 ]
 for question in questions:
-    inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
+    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
     input_ids = inputs["input_ids"].tolist()[0]
     text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
...
@@ -50,7 +50,7 @@ model = AutoModelForQuestionAnswering.from_pretrained("mrm8488/longformer-base-4
 text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this."
 question = "What has Huggingface done ?"
-encoding = tokenizer.encode_plus(question, text, return_tensors="pt")
+encoding = tokenizer(question, text, return_tensors="pt")
 input_ids = encoding["input_ids"]
 # default is local attention everywhere
...
@@ -55,7 +55,7 @@ model = AutoModelWithLMHead.from_pretrained("mrm8488/t5-base-finetuned-squadv2")
 def get_answer(question, context):
     input_text = "question: %s context: %s </s>" % (question, context)
-    features = tokenizer.batch_encode_plus([input_text], return_tensors='pt')
+    features = tokenizer([input_text], return_tensors='pt')
     output = model.generate(input_ids=features['input_ids'],
                             attention_mask=features['attention_mask'])
...
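This hunk cuts off before the generated ids are turned back into text. A hedged sketch of the usual final step for such a T5 question-answering helper (the decode call is the standard pattern, not text quoted from this model card):

```python
def get_answer(question, context):
    # format the input the way the fine-tuned T5 model expects it
    input_text = "question: %s context: %s </s>" % (question, context)
    features = tokenizer([input_text], return_tensors='pt')

    output = model.generate(input_ids=features['input_ids'],
                            attention_mask=features['attention_mask'])

    # decode the generated token ids back into a plain-text answer
    return tokenizer.decode(output[0], skip_special_tokens=True)
```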
@@ -55,7 +55,7 @@ class SentimentModel():
     def predict_sentiment(self, texts: List[str])-> List[str]:
         texts = [self.clean_text(text) for text in texts]
         # Add special tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
-        input_ids = self.tokenizer.batch_encode_plus(texts,pad_to_max_length=True, add_special_tokens=True)
+        input_ids = self.tokenizer(texts, padding=True, truncation=True, add_special_tokens=True)
         input_ids = torch.tensor(input_ids["input_ids"])
         with torch.no_grad():
...
@@ -50,7 +50,7 @@ tokenizer = BartTokenizer.from_pretrained('valhalla/bart-large-finetuned-squadv1
 model = BartForQuestionAnswering.from_pretrained('valhalla/bart-large-finetuned-squadv1')
 question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
-encoding = tokenizer.encode_plus(question, text, return_tensors='pt')
+encoding = tokenizer(question, text, return_tensors='pt')
 input_ids = encoding['input_ids']
 attention_mask = encoding['attention_mask']
...
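Several of the question-answering model cards above stop right after building `input_ids` and `attention_mask`. For completeness, a hedged sketch of how the answer span is typically recovered from an extractive QA model's start/end logits (the argmax-based decoding is the common pattern, not text quoted from these cards):

```python
import torch
from transformers import BartForQuestionAnswering, BartTokenizer

tokenizer = BartTokenizer.from_pretrained('valhalla/bart-large-finetuned-squadv1')
model = BartForQuestionAnswering.from_pretrained('valhalla/bart-large-finetuned-squadv1')

question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
encoding = tokenizer(question, text, return_tensors='pt')

# the first two outputs of a *ForQuestionAnswering model are the start and end logits
start_logits, end_logits = model(**encoding)[:2]

# pick the most likely start and end positions and slice the input ids between them
start = torch.argmax(start_logits)
end = torch.argmax(end_logits) + 1
answer = tokenizer.decode(encoding['input_ids'][0][start:end], skip_special_tokens=True)
print(answer)  # expected to be something like "a nice puppet"
```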