Unverified Commit 3a40cdf5 authored by Thomas Wolf, committed by GitHub



[tests|tokenizers] Refactoring pipelines test backbone - Small tokenizers improvements - General tests speedups (#7970)

* WIP refactoring pipeline tests - switching to fast tokenizers

* fix dialog pipeline and fill-mask

* refactoring pipeline tests backbone

* make large tests slow

* fix tests (tf Bart inactive for now)

* fix doc...

* clean up for merge

* fixing tests - remove bart from summarization until there is TF

* fix quality and RAG

* Add new translation pipeline tests - fix JAX tests

* only slow for dialog

* Fixing the missing TF-BART imports in modeling_tf_auto

* spin out pipeline tests in separate CI job

* adding pipeline test to CI YAML

* add slow pipeline tests

* speed up tf and pt join test to avoid redoing all the standalone pt and tf tests

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>

* Update src/transformers/pipelines.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/pipelines.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/testing_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* add require_torch and require_tf in is_pt_tf_cross_test
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
parent 88b3a91e
@@ -84,7 +84,7 @@ jobs:
key: v0.3-{{ checksum "setup.py" }}
paths:
- '~/.cache/pip'
-   - run: python -m pytest -n 8 --dist=loadfile -rA -s ./tests/ --cov --durations=0 | tee output.txt
+   - run: RUN_PT_TF_CROSS_TESTS=1 python -m pytest -n 8 --dist=loadfile -rA -s ./tests/ -m is_pt_tf_cross_test --cov --durations=0 | tee output.txt
- run: codecov
- store_artifacts:
path: ~/transformers/output.txt
@@ -164,6 +164,56 @@ jobs:
- store_artifacts:
path: ~/transformers/output.txt
destination: test_output.txt
run_tests_pipelines_torch:
working_directory: ~/transformers
docker:
- image: circleci/python:3.7
environment:
OMP_NUM_THREADS: 1
resource_class: xlarge
parallelism: 1
steps:
- checkout
- restore_cache:
keys:
- v0.3-torch-{{ checksum "setup.py" }}
- v0.3-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install git+https://github.com/huggingface/datasets
- run: pip install .[sklearn,torch,testing]
- save_cache:
key: v0.3-torch-{{ checksum "setup.py" }}
paths:
- '~/.cache/pip'
- run: RUN_PIPELINE_TESTS=1 python -m pytest -n 8 --dist=loadfile -rA -s ./tests/ -m is_pipeline_test | tee output.txt
- store_artifacts:
path: ~/transformers/output.txt
destination: test_output.txt
run_tests_pipelines_tf:
working_directory: ~/transformers
docker:
- image: circleci/python:3.7
environment:
OMP_NUM_THREADS: 1
resource_class: xlarge
parallelism: 1
steps:
- checkout
- restore_cache:
keys:
- v0.3-tf-{{ checksum "setup.py" }}
- v0.3-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install git+https://github.com/huggingface/datasets
- run: pip install .[sklearn,tf-cpu,testing]
- save_cache:
key: v0.3-tf-{{ checksum "setup.py" }}
paths:
- '~/.cache/pip'
- run: RUN_PIPELINE_TESTS=1 python -m pytest -n 8 --dist=loadfile -rA -s ./tests/ -m is_pipeline_test | tee output.txt
- store_artifacts:
path: ~/transformers/output.txt
destination: test_output.txt
run_tests_custom_tokenizers:
working_directory: ~/transformers
docker:
@@ -331,6 +381,8 @@ workflows:
- run_tests_torch
- run_tests_tf
- run_tests_flax
- run_tests_pipelines_torch
- run_tests_pipelines_tf
- build_doc
- deploy_doc: *workflow_filters
tpu_testing_jobs:
......
@@ -16,52 +16,52 @@ jobs:
run_tests_torch_and_tf_gpu:
runs-on: [self-hosted, single-gpu]
steps:
- uses: actions/checkout@v2
- name: Python version
run: |
which python
python --version
pip --version
- name: Current dir
run: pwd
- run: nvidia-smi
- name: Loading cache.
uses: actions/cache@v2
id: cache
with:
path: .env
key: v0-tests_tf_torch_gpu-${{ hashFiles('setup.py') }}
- name: Create new python env (on self-hosted runners we have to handle isolation ourselves)
run: |
python -m venv .env
source .env/bin/activate
which python
python --version
pip --version
- name: Install dependencies
run: |
source .env/bin/activate
pip install --upgrade pip
pip install torch!=1.6.0
pip install .[sklearn,testing,onnxruntime]
pip install git+https://github.com/huggingface/datasets
- name: Are GPUs recognized by our DL frameworks
run: |
source .env/bin/activate
python -c "import torch; print('Cuda available:', torch.cuda.is_available())"
python -c "import torch; print('Number of GPUs available:', torch.cuda.device_count())"
- name: Run all non-slow tests on GPU
env:
TF_FORCE_GPU_ALLOW_GROWTH: "true"
# TF_GPU_MEMORY_LIMIT: 4096
OMP_NUM_THREADS: 1
run: |
source .env/bin/activate
python -m pytest -n 2 --dist=loadfile -s ./tests/
run_tests_torch_and_tf_multiple_gpu:
runs-on: [self-hosted, multi-gpu]
......
@@ -12,64 +12,75 @@ jobs:
run_all_tests_torch_and_tf_gpu:
runs-on: [self-hosted, single-gpu]
steps:
- uses: actions/checkout@v2
- name: Loading cache.
uses: actions/cache@v2
id: cache
with:
path: .env
key: v0-slow_tests_tf_torch_gpu-${{ hashFiles('setup.py') }}
- name: Python version
run: |
which python
python --version
pip --version
- name: Current dir
run: pwd
- run: nvidia-smi
- name: Create new python env (on self-hosted runners we have to handle isolation ourselves)
if: steps.cache.outputs.cache-hit != 'true'
run: |
python -m venv .env
source .env/bin/activate
which python
python --version
pip --version
- name: Install dependencies
run: |
source .env/bin/activate
pip install --upgrade pip
pip install torch!=1.6.0
pip install .[sklearn,testing,onnxruntime]
pip install git+https://github.com/huggingface/datasets
- name: Are GPUs recognized by our DL frameworks
run: |
source .env/bin/activate
python -c "import torch; print('Cuda available:', torch.cuda.is_available())"
python -c "import torch; print('Number of GPUs available:', torch.cuda.device_count())"
- name: Run all tests on GPU
env:
TF_FORCE_GPU_ALLOW_GROWTH: "true"
OMP_NUM_THREADS: 1
RUN_SLOW: yes
run: |
source .env/bin/activate
python -m pytest -n 1 --dist=loadfile -s ./tests/ --durations=50
- name: Run examples tests on GPU
env:
TF_FORCE_GPU_ALLOW_GROWTH: "true"
OMP_NUM_THREADS: 1
RUN_SLOW: yes
run: |
source .env/bin/activate
pip install -r examples/requirements.txt
python -m pytest -n 1 --dist=loadfile -s examples --durations=50
- name: Run all pipeline tests on GPU
env:
TF_FORCE_GPU_ALLOW_GROWTH: "true"
OMP_NUM_THREADS: 1
RUN_SLOW: yes
RUN_PIPELINE_TESTS: yes
run: |
source .env/bin/activate
python -m pytest -n 1 --dist=loadfile -s ./tests/ -m is_pipeline_test --durations=50
run_all_tests_torch_and_tf_multiple_gpu:
runs-on: [self-hosted, multi-gpu]
@@ -131,3 +142,13 @@ jobs:
source .env/bin/activate
pip install -r examples/requirements.txt
python -m pytest -n 1 --dist=loadfile -s examples --durations=50
- name: Run all pipeline tests on GPU
env:
TF_FORCE_GPU_ALLOW_GROWTH: "true"
OMP_NUM_THREADS: 1
RUN_SLOW: yes
RUN_PIPELINE_TESTS: yes
run: |
source .env/bin/activate
python -m pytest -n 1 --dist=loadfile -s ./tests/ -m is_pipeline_test --durations=50
@@ -78,15 +78,16 @@ def convert_slow_checkpoint_to_fast(tokenizer_name, checkpoint_name, dump_path,
"=> {} with prefix {}, add_prefix {}".format(dump_path_full, checkpoint_prefix_name, add_prefix)
)
-   file_path = list(tokenizer.pretrained_vocab_files_map.values())[0][checkpoint]
-   next_char = file_path.split(checkpoint)[-1][0]
-   if next_char == "/":
-       dump_path_full = os.path.join(dump_path_full, checkpoint_prefix_name)
-       checkpoint_prefix_name = None
+   if checkpoint in list(tokenizer.pretrained_vocab_files_map.values())[0]:
+       file_path = list(tokenizer.pretrained_vocab_files_map.values())[0][checkpoint]
+       next_char = file_path.split(checkpoint)[-1][0]
+       if next_char == "/":
+           dump_path_full = os.path.join(dump_path_full, checkpoint_prefix_name)
+           checkpoint_prefix_name = None
logger.info(
"=> {} with prefix {}, add_prefix {}".format(dump_path_full, checkpoint_prefix_name, add_prefix)
)
file_names = tokenizer.save_pretrained(
dump_path_full, legacy_format=False, filename_prefix=checkpoint_prefix_name
......
@@ -7,11 +7,8 @@ import numpy as np
from tqdm import tqdm
from ...file_utils import is_tf_available, is_torch_available
-   from ...tokenization_bart import BartTokenizer
from ...tokenization_bert import whitespace_tokenize
-   from ...tokenization_longformer import LongformerTokenizer
-   from ...tokenization_roberta import RobertaTokenizer
-   from ...tokenization_utils_base import TruncationStrategy
+   from ...tokenization_utils_base import PreTrainedTokenizerBase, TruncationStrategy
from ...utils import logging
from .utils import DataProcessor
@@ -112,7 +109,14 @@ def squad_convert_example_to_features(
all_doc_tokens = []
for (i, token) in enumerate(example.doc_tokens):
orig_to_tok_index.append(len(all_doc_tokens))
-   if isinstance(tokenizer, (RobertaTokenizer, LongformerTokenizer, BartTokenizer)):
+   if tokenizer.__class__.__name__ in [
+       "RobertaTokenizer",
+       "LongformerTokenizer",
+       "BartTokenizer",
+       "RobertaTokenizerFast",
+       "LongformerTokenizerFast",
+       "BartTokenizerFast",
+   ]:
sub_tokens = tokenizer.tokenize(token, add_prefix_space=True)
else:
sub_tokens = tokenizer.tokenize(token)
@@ -292,7 +296,7 @@ def squad_convert_example_to_features(
return features
-   def squad_convert_example_to_features_init(tokenizer_for_convert):
+   def squad_convert_example_to_features_init(tokenizer_for_convert: PreTrainedTokenizerBase):
global tokenizer
tokenizer = tokenizer_for_convert
@@ -344,9 +348,9 @@ def squad_convert_examples_to_features(
is_training=not evaluate,
)
"""
# Defining helper methods
features = []
threads = min(threads, cpu_count())
with Pool(threads, initializer=squad_convert_example_to_features_init, initargs=(tokenizer,)) as p:
annotate_ = partial(
@@ -365,6 +369,7 @@ def squad_convert_examples_to_features(
disable=not tqdm_enabled,
)
)
new_features = []
unique_id = 1000000000
example_index = 0
......
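The hunk above swaps an isinstance() check for a check on the tokenizer class name, so the fast Roberta/Longformer/Bart variants get the same add_prefix_space treatment without importing every tokenizer class. A minimal sketch of that dispatch (the set name and the roberta-base checkpoint are illustrative only, and the checkpoint must be available locally or from the Hub):

# Sketch of the class-name based dispatch used in squad_convert_example_to_features.
from transformers import AutoTokenizer

ADD_PREFIX_SPACE_TOKENIZERS = {
    "RobertaTokenizer",
    "LongformerTokenizer",
    "BartTokenizer",
    "RobertaTokenizerFast",
    "LongformerTokenizerFast",
    "BartTokenizerFast",
}

tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=False)
token = "movie"
if tokenizer.__class__.__name__ in ADD_PREFIX_SPACE_TOKENIZERS:
    # Byte-level BPE tokenizers need the leading space to reproduce in-context pieces
    sub_tokens = tokenizer.tokenize(token, add_prefix_space=True)
else:
    sub_tokens = tokenizer.tokenize(token)
print(sub_tokens)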
@@ -52,7 +52,7 @@ from .modeling_tf_albert import (
TFAlbertForTokenClassification,
TFAlbertModel,
)
-   from .modeling_tf_bart import TFBartForConditionalGeneration
+   from .modeling_tf_bart import TFBartForConditionalGeneration, TFBartModel
from .modeling_tf_bert import (
TFBertForMaskedLM,
TFBertForMultipleChoice,
@@ -163,6 +163,7 @@ TF_MODEL_MAPPING = OrderedDict(
(T5Config, TFT5Model),
(DistilBertConfig, TFDistilBertModel),
(AlbertConfig, TFAlbertModel),
+   (BartConfig, TFBartModel),
(CamembertConfig, TFCamembertModel),
(XLMRobertaConfig, TFXLMRobertaModel),
(LongformerConfig, TFLongformerModel),
@@ -186,6 +187,7 @@ TF_MODEL_FOR_PRETRAINING_MAPPING = OrderedDict(
(T5Config, TFT5ForConditionalGeneration),
(DistilBertConfig, TFDistilBertForMaskedLM),
(AlbertConfig, TFAlbertForPreTraining),
+   (BartConfig, TFBartForConditionalGeneration),
(CamembertConfig, TFCamembertForMaskedLM),
(XLMRobertaConfig, TFXLMRobertaForMaskedLM),
(RobertaConfig, TFRobertaForMaskedLM),
......
@@ -640,12 +640,12 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin, TFGenerationMixin):
# Load model
if pretrained_model_name_or_path is not None:
if os.path.isdir(pretrained_model_name_or_path):
-   if os.path.isfile(os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)):
+   if from_pt and os.path.isfile(os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME)):
+       # Load from a PyTorch checkpoint in priority if from_pt
+       archive_file = os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME)
+   elif os.path.isfile(os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)):
# Load from a TF 2.0 checkpoint
archive_file = os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)
-   elif from_pt and os.path.isfile(os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME)):
-       # Load from a PyTorch checkpoint
-       archive_file = os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME)
else:
raise EnvironmentError(
"Error no file named {} found in directory {} or `from_pt` set to False".format(
......
@@ -882,10 +882,10 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin):
if pretrained_model_name_or_path is not None:
if os.path.isdir(pretrained_model_name_or_path):
if from_tf and os.path.isfile(os.path.join(pretrained_model_name_or_path, TF_WEIGHTS_NAME + ".index")):
-   # Load from a TF 1.0 checkpoint
+   # Load from a TF 1.0 checkpoint in priority if from_tf
archive_file = os.path.join(pretrained_model_name_or_path, TF_WEIGHTS_NAME + ".index")
elif from_tf and os.path.isfile(os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)):
-   # Load from a TF 2.0 checkpoint
+   # Load from a TF 2.0 checkpoint in priority if from_tf
archive_file = os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)
elif os.path.isfile(os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME)):
# Load from a PyTorch checkpoint
@@ -951,7 +951,8 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin):
state_dict = torch.load(resolved_archive_file, map_location="cpu")
except Exception:
raise OSError(
-   "Unable to load weights from pytorch checkpoint file. "
+   f"Unable to load weights from pytorch checkpoint file for '{pretrained_model_name_or_path}' "
+   f"at '{resolved_archive_file}'"
"If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True. "
)
......
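The two hunks above make from_pt / from_tf take priority when several checkpoint formats are present in a directory, and extend the PyTorch loading error with the checkpoint path. A hedged sketch of the cross-framework round trip these flags enable, assuming both torch and TensorFlow are installed and the bert-base-cased checkpoint is reachable (paths and model name are only examples):

# Illustrative round trip between TF 2.0 and PyTorch checkpoints (not part of the diff).
from transformers import BertModel, TFBertModel

tf_model = TFBertModel.from_pretrained("bert-base-cased")
tf_model.save_pretrained("./bert-tf")                               # writes the TF 2.0 weights (tf_model.h5)

pt_model = BertModel.from_pretrained("./bert-tf", from_tf=True)     # TF 2.0 -> PyTorch
pt_model.save_pretrained("./bert-pt")                               # writes the PyTorch weights

tf_again = TFBertModel.from_pretrained("./bert-pt", from_pt=True)   # PyTorch -> TF 2.0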
@@ -38,7 +38,7 @@ from .modelcard import ModelCard
from .tokenization_auto import AutoTokenizer
from .tokenization_bert import BasicTokenizer
from .tokenization_utils import PreTrainedTokenizer
-   from .tokenization_utils_base import BatchEncoding, PaddingStrategy
+   from .tokenization_utils_base import PaddingStrategy
from .utils import logging
@@ -2396,11 +2396,12 @@ class ConversationalPipeline(Pipeline):
def __init__(self, min_length_for_response=32, *args, **kwargs):
super().__init__(*args, **kwargs)
+   # We need at least an eos_token
assert self.tokenizer.eos_token_id is not None, "DialoguePipeline tokenizer should have an EOS token set"
-   if self.tokenizer.pad_token_id is not None:
-       self.pad_token_id = self.tokenizer.pad_token_id
-   else:
-       self.pad_token_id = self.tokenizer.eos_token_id
+   if self.tokenizer.pad_token_id is None:
+       self.tokenizer.pad_token = self.tokenizer.eos_token
self.min_length_for_response = min_length_for_response
def __call__(
@@ -2496,7 +2497,7 @@ class ConversationalPipeline(Pipeline):
"""
# Parse arguments
inputs = self._args_parser(*args, **kwargs)
-   inputs = self.tokenizer.batch_encode_plus(inputs, add_special_tokens=False, padding=False).get("input_ids", [])
+   inputs = self.tokenizer(inputs, add_special_tokens=False, padding=False).get("input_ids", [])
for input in inputs:
input.append(self.tokenizer.eos_token_id)
return inputs
@@ -2516,7 +2517,7 @@ class ConversationalPipeline(Pipeline):
sequence_tokens = []
is_previous_pad = False
for token in sequence:
-   if token == self.pad_token_id:
+   if token == self.tokenizer.pad_token_id:
if is_previous_pad:
continue
else:
@@ -2550,13 +2551,10 @@ class ConversationalPipeline(Pipeline):
else:
new_input = new_input[cutoff_eos_index + 1 :]
outputs.append(new_input)
-   max_len = max([len(item) for item in outputs])
-   outputs = [output + [self.pad_token_id] * (max_len - len(output)) for output in outputs]
-   outputs = BatchEncoding(
-       {"input_ids": outputs, "attention_mask": [[1] * len(outputs)]},
-       tensor_type=self.framework,
-   )
-   return outputs
+   padded_outputs = self.tokenizer.pad(
+       {"input_ids": outputs}, padding="longest", return_attention_mask=True, return_tensors=self.framework
+   )
+   return padded_outputs
# Register all the supported tasks here
@@ -2700,6 +2698,7 @@ def pipeline(
config: Optional[Union[str, PretrainedConfig]] = None,
tokenizer: Optional[Union[str, PreTrainedTokenizer]] = None,
framework: Optional[str] = None,
+   use_fast: bool = False,
**kwargs
) -> Pipeline:
"""
@@ -2749,6 +2748,8 @@ def pipeline(
If no framework is specified, will default to the one currently installed. If no framework is specified
and both frameworks are installed, will default to the framework of the :obj:`model`, or to PyTorch if no
model is provided.
+   use_fast (:obj:`bool`, `optional`, defaults to :obj:`False`):
+       Whether or not to use a Fast tokenizer if possible (a :class:`~transformers.PreTrainedTokenizerFast`).
kwargs:
Additional keyword arguments passed along to the specific pipeline init (see the documentation for the
corresponding pipeline class for possible values).
@@ -2807,9 +2808,10 @@ def pipeline(
if isinstance(tokenizer, (str, tuple)):
if isinstance(tokenizer, tuple):
# For tuple we have (tokenizer name, {kwargs})
-   tokenizer = AutoTokenizer.from_pretrained(tokenizer[0], **tokenizer[1])
+   use_fast = tokenizer[1].pop("use_fast", use_fast)
+   tokenizer = AutoTokenizer.from_pretrained(tokenizer[0], use_fast=use_fast, **tokenizer[1])
else:
-   tokenizer = AutoTokenizer.from_pretrained(tokenizer)
+   tokenizer = AutoTokenizer.from_pretrained(tokenizer, use_fast=use_fast)
# Instantiate config if needed
if isinstance(config, str):
......
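The pipeline() factory above gains a use_fast flag that is forwarded to AutoTokenizer.from_pretrained, and it can be overridden per tokenizer when a (name, kwargs) tuple is passed. A minimal sketch; the task and checkpoint names are only examples:

# Sketch of the new `use_fast` switch on the pipeline factory (not part of the diff).
from transformers import pipeline

nlp = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    use_fast=True,  # request a PreTrainedTokenizerFast when one is available
)
print(nlp("Switching the pipeline tests to fast tokenizers was worth it."))

# A `use_fast` key inside the tokenizer kwargs tuple overrides the top-level argument.
nlp_slow = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    tokenizer=("distilbert-base-uncased", {"use_fast": False}),
)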
@@ -59,10 +59,50 @@ def parse_int_from_env(key, default=None):
_run_slow_tests = parse_flag_from_env("RUN_SLOW", default=False)
+   _run_pt_tf_cross_tests = parse_flag_from_env("RUN_PT_TF_CROSS_TESTS", default=False)
_run_custom_tokenizers = parse_flag_from_env("RUN_CUSTOM_TOKENIZERS", default=False)
+   _run_pipeline_tests = parse_flag_from_env("RUN_PIPELINE_TESTS", default=False)
_tf_gpu_memory_limit = parse_int_from_env("TF_GPU_MEMORY_LIMIT", default=None)
def is_pt_tf_cross_test(test_case):
"""
Decorator marking a test as a test that controls interactions between PyTorch and TensorFlow.
PT+TF tests are skipped by default, and can be run by setting the RUN_PT_TF_CROSS_TESTS environment variable
to a truthy value and selecting the is_pt_tf_cross_test pytest mark.
"""
if not _run_pt_tf_cross_tests or not _torch_available or not _tf_available:
return unittest.skip("test is PT+TF test")(test_case)
else:
try:
import pytest # We don't need a hard dependency on pytest in the main library
except ImportError:
return test_case
else:
return pytest.mark.is_pt_tf_cross_test()(test_case)
def is_pipeline_test(test_case):
"""
Decorator marking a test as a pipeline test.
Pipeline tests are skipped by default, and can be run by setting the RUN_PIPELINE_TESTS environment variable
to a truthy value and selecting the is_pipeline_test pytest mark.
"""
if not _run_pipeline_tests:
return unittest.skip("test is pipeline test")(test_case)
else:
try:
import pytest # We don't need a hard dependency on pytest in the main library
except ImportError:
return test_case
else:
return pytest.mark.is_pipeline_test()(test_case)
def slow(test_case):
"""
Decorator marking a test as slow.
......
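A hedged sketch of how the new markers are meant to be used in a test module; the class names and the tiny checkpoint are illustrative, and the tests stay skipped unless the matching environment variable is set, e.g. RUN_PIPELINE_TESTS=1 python -m pytest -m is_pipeline_test ./tests/:

# Illustrative (hypothetical) test module using the new decorators.
import unittest

from transformers import pipeline
from transformers.testing_utils import is_pipeline_test, is_pt_tf_cross_test, require_torch


@is_pipeline_test
class ExampleFillMaskPipelineTest(unittest.TestCase):
    @require_torch
    def test_tiny_model(self):
        nlp = pipeline("fill-mask", model="sshleifer/tiny-distilroberta-base", framework="pt")
        self.assertIsInstance(nlp("My name is <mask>."), list)


@is_pt_tf_cross_test
class ExamplePtTfCrossTest(unittest.TestCase):
    def test_marker_only(self):
        # Real cross tests compare PT and TF model outputs; this only demonstrates the marker.
        self.assertTrue(True)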
@@ -136,18 +136,6 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
"""
raise NotImplementedError
-   def get_vocab(self) -> Dict[str, int]:
-       """
-       Returns the vocabulary as a dictionary of token to index.
-       :obj:`tokenizer.get_vocab()[token]` is equivalent to :obj:`tokenizer.convert_tokens_to_ids(token)` when
-       :obj:`token` is in the vocab.
-       Returns:
-           :obj:`Dict[str, int]`: The vocabulary.
-       """
-       raise NotImplementedError()
def get_added_vocab(self) -> Dict[str, int]:
"""
Returns the added tokens in the vocabulary as a dictionary of token to index.
@@ -733,47 +721,15 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
raise NotImplementedError
def convert_tokens_to_string(self, tokens: List[str]) -> str:
-   """
-   Converts a sequence of token ids in a single string.
-   The most simple way to do it is ``" ".join(tokens)`` but we often want to remove
-   sub-word tokenization artifacts at the same time.
-   Args:
-       tokens (:obj:`List[str]`): The token to join in a string.
-   Return: The joined tokens.
-   """
return " ".join(tokens)
-   def decode(
+   def _decode(
self,
token_ids: List[int],
skip_special_tokens: bool = False,
clean_up_tokenization_spaces: bool = True,
spaces_between_special_tokens: bool = True,
) -> str:
-   """
-   Converts a sequence of ids in a string, using the tokenizer and vocabulary
-   with options to remove special tokens and clean up tokenization spaces.
-   Similar to doing ``self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids))``.
-   Args:
-       token_ids (:obj:`List[int]`):
-           List of tokenized input ids. Can be obtained using the ``__call__`` method.
-       skip_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
-           Whether or not to remove special tokens in the decoding.
-       clean_up_tokenization_spaces (:obj:`bool`, `optional`, defaults to :obj:`True`):
-           Whether or not to clean up the tokenization spaces.
-       spaces_between_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`True`):
-           Whether or not to add spaces around special tokens.
-           The behavior of Fast tokenizers is to have this to :obj:`False`.
-           This is setup to :obj:`True` in slow tokenizers for backward compatibility.
-       Returns:
-           :obj:`str`: The decoded sentence.
-   """
filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
# To avoid mixing byte-level and unicode for byte-level BPT
......
@@ -175,6 +175,23 @@ class TokenSpan(NamedTuple):
end: int
def to_py_obj(obj):
"""
Convert a TensorFlow tensor, PyTorch tensor, Numpy array or python list
to a python list.
"""
if isinstance(obj, (list, tuple)):
return [to_py_obj(o) for o in obj]
elif is_tf_available() and isinstance(obj, tf.Tensor):
return obj.numpy().tolist()
elif is_torch_available() and isinstance(obj, torch.Tensor):
return obj.detach().cpu().tolist()
elif isinstance(obj, np.ndarray):
return obj.tolist()
else:
return obj
class BatchEncoding(UserDict):
"""
Holds the output of the :meth:`~transformers.tokenization_utils_base.PreTrainedTokenizerBase.encode_plus`
@@ -1025,6 +1042,38 @@ class SpecialTokensMixin:
"""
return self.convert_tokens_to_ids(self.additional_special_tokens)
@bos_token_id.setter
def bos_token_id(self, value):
self._bos_token = self.convert_tokens_to_ids(value)
@eos_token_id.setter
def eos_token_id(self, value):
self._eos_token = self.convert_tokens_to_ids(value)
@unk_token_id.setter
def unk_token_id(self, value):
self._unk_token = self.convert_tokens_to_ids(value)
@sep_token_id.setter
def sep_token_id(self, value):
self._sep_token = self.convert_tokens_to_ids(value)
@pad_token_id.setter
def pad_token_id(self, value):
self._pad_token = self.convert_tokens_to_ids(value)
@cls_token_id.setter
def cls_token_id(self, value):
self._cls_token = self.convert_tokens_to_ids(value)
@mask_token_id.setter
def mask_token_id(self, value):
self._mask_token = self.convert_tokens_to_ids(value)
@additional_special_tokens_ids.setter
def additional_special_tokens_ids(self, values):
self._additional_special_tokens = [self.convert_tokens_to_ids(value) for value in values]
@property
def special_tokens_map(self) -> Dict[str, Union[str, List[str]]]:
"""
@@ -1424,6 +1473,18 @@ class PreTrainedTokenizerBase(SpecialTokensMixin):
f"padding_side='{self.padding_side}', special_tokens={self.special_tokens_map_extended})"
)
def get_vocab(self) -> Dict[str, int]:
"""
Returns the vocabulary as a dictionary of token to index.
:obj:`tokenizer.get_vocab()[token]` is equivalent to :obj:`tokenizer.convert_tokens_to_ids(token)` when
:obj:`token` is in the vocab.
Returns:
:obj:`Dict[str, int]`: The vocabulary.
"""
raise NotImplementedError()
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs):
r"""
@@ -1852,6 +1913,32 @@ class PreTrainedTokenizerBase(SpecialTokensMixin):
"""
raise NotImplementedError
def tokenize(self, text: str, pair: Optional[str] = None, add_special_tokens: bool = False, **kwargs) -> List[str]:
"""
Converts a string in a sequence of tokens, using the backend Rust tokenizer.
Note that this method behaves differently between fast and slow tokenizers:
- in fast tokenizers (instances of :class:`~transformers.PreTrainedTokenizerFast`), this method
will replace the unknown tokens with the :obj:`unk_token`,
- in slow tokenizers (instances of :class:`~transformers.PreTrainedTokenizer`), this method
keeps unknown tokens unchanged.
Args:
text (:obj:`str`):
The sequence to be encoded.
pair (:obj:`str`, `optional`):
A second sequence to be encoded with the first.
add_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to add the special tokens associated with the corresponding model.
kwargs (additional keyword arguments, `optional`):
Will be passed to the underlying model specific encode method.
See details in :meth:`~transformers.PreTrainedTokenizer.__call__`
Returns:
:obj:`List[str]`: The list of tokens.
"""
raise NotImplementedError
@add_end_docstrings(
ENCODE_KWARGS_DOCSTRING,
"""
@@ -2456,18 +2543,6 @@ class PreTrainedTokenizerBase(SpecialTokensMixin):
f"Should be one of a python, numpy, pytorch or tensorflow object."
)
-   def to_py_obj(obj):
-       if isinstance(obj, (list, tuple)):
-           return [to_py_obj(o) for o in obj]
-       elif is_tf_available() and isinstance(obj, tf.Tensor):
-           return obj.numpy().tolist()
-       elif is_torch_available() and isinstance(obj, torch.Tensor):
-           return obj.cpu().tolist()
-       elif isinstance(obj, np.ndarray):
-           return obj.tolist()
-       else:
-           return obj
for key, value in encoded_inputs.items():
encoded_inputs[key] = to_py_obj(value)
@@ -2862,33 +2937,53 @@ class PreTrainedTokenizerBase(SpecialTokensMixin):
return encoded_inputs
def convert_tokens_to_string(self, tokens: List[str]) -> str:
"""
Converts a sequence of token ids in a single string.
The most simple way to do it is ``" ".join(tokens)`` but we often want to remove
sub-word tokenization artifacts at the same time.
Args:
tokens (:obj:`List[str]`): The token to join in a string.
Return: The joined tokens.
"""
raise NotImplementedError
def batch_decode(
-   self, sequences: List[List[int]], skip_special_tokens: bool = False, clean_up_tokenization_spaces: bool = True
+   self,
+   sequences: Union[List[int], List[List[int]], "np.ndarray", "torch.Tensor", "tf.Tensor"],
+   skip_special_tokens: bool = False,
+   clean_up_tokenization_spaces: bool = True,
+   **kwargs
) -> List[str]:
"""
Convert a list of lists of token ids into a list of strings by calling decode.
Args:
-   sequences (:obj:`List[List[int]]`):
+   sequences (:obj:`Union[List[int], List[List[int]], np.ndarray, torch.Tensor, tf.Tensor]`):
List of tokenized input ids. Can be obtained using the ``__call__`` method.
skip_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to remove special tokens in the decoding.
clean_up_tokenization_spaces (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to clean up the tokenization spaces.
+   kwargs (additional keyword arguments, `optional`):
+       Will be passed to the underlying model specific decode method.
Returns:
:obj:`List[str]`: The list of decoded sentences.
"""
return [
self.decode(
-   seq, skip_special_tokens=skip_special_tokens, clean_up_tokenization_spaces=clean_up_tokenization_spaces
+   seq,
+   skip_special_tokens=skip_special_tokens,
+   clean_up_tokenization_spaces=clean_up_tokenization_spaces,
+   **kwargs,
)
for seq in sequences
]
def decode(
self,
-   token_ids: List[int],
+   token_ids: Union[int, List[int], "np.ndarray", "torch.Tensor", "tf.Tensor"],
skip_special_tokens: bool = False,
clean_up_tokenization_spaces: bool = True,
**kwargs
@@ -2900,16 +2995,35 @@ class PreTrainedTokenizerBase(SpecialTokensMixin):
Similar to doing ``self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids))``.
Args:
-   token_ids (:obj:`List[int]`):
+   token_ids (:obj:`Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]`):
List of tokenized input ids. Can be obtained using the ``__call__`` method.
skip_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to remove special tokens in the decoding.
clean_up_tokenization_spaces (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to clean up the tokenization spaces.
+   kwargs (additional keyword arguments, `optional`):
+       Will be passed to the underlying model specific decode method.
Returns:
:obj:`str`: The decoded sentence.
"""
+   # Convert inputs to python lists
+   token_ids = to_py_obj(token_ids)
+   return self._decode(
+       token_ids=token_ids,
+       skip_special_tokens=skip_special_tokens,
+       clean_up_tokenization_spaces=clean_up_tokenization_spaces,
+       **kwargs,
+   )
+   def _decode(
+       self,
+       token_ids: Union[int, List[int]],
+       skip_special_tokens: bool = False,
+       clean_up_tokenization_spaces: bool = True,
+       **kwargs
+   ) -> str:
raise NotImplementedError
def get_special_tokens_mask(
......
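With the module-level to_py_obj() helper and the decode()/_decode() split above, decode and batch_decode accept framework tensors and NumPy arrays directly. A short sketch, assuming PyTorch and the bert-base-uncased checkpoint are available (both are only example dependencies):

# Sketch of decode()/batch_decode() accepting framework tensors after this refactor.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# return_tensors="pt" requires PyTorch to be installed
ids = tokenizer("Hello world", return_tensors="pt")["input_ids"]   # shape (1, seq_len)

# to_py_obj() turns the tensor into plain python lists before the actual decoding
print(tokenizer.batch_decode(ids, skip_special_tokens=True))   # ['hello world']
print(tokenizer.decode(ids[0], skip_special_tokens=True))      # 'hello world'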
@@ -122,17 +122,12 @@ class PreTrainedTokenizerFast(PreTrainedTokenizerBase):
return self._tokenizer.get_vocab_size(with_added_tokens=False)
def get_vocab(self) -> Dict[str, int]:
-   """
-   Returns the vocabulary as a dictionary of token to index.
-   :obj:`tokenizer.get_vocab()[token]` is equivalent to :obj:`tokenizer.convert_tokens_to_ids(token)` when
-   :obj:`token` is in the vocab.
-   Returns:
-       :obj:`Dict[str, int]`: The vocabulary.
-   """
return self._tokenizer.get_vocab(with_added_tokens=True)
+   @property
+   def vocab(self) -> Dict[str, int]:
+       return self.get_vocab()
def get_added_vocab(self) -> Dict[str, int]:
"""
Returns the added tokens in the vocabulary as a dictionary of token to index.
@@ -291,25 +286,8 @@ class PreTrainedTokenizerFast(PreTrainedTokenizerBase):
tokens.append(self._tokenizer.id_to_token(index))
return tokens
-   def tokenize(self, text: str, pair: Optional[str] = None, add_special_tokens: bool = False) -> List[str]:
-       """
-       Converts a string in a sequence of tokens, using the backend Rust tokenizer.
-       Note that, unlike slow tokenizers (instances of :class:`~transformers.PreTrainedTokenizer`), this method
-       will replace the unknown tokens with the :obj:`unk_token`.
-       Args:
-           text (:obj:`str`):
-               The sequence to be encoded.
-           pair (:obj:`str`, `optional`):
-               A second sequence to be encoded with the first.
-           add_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
-               Whether or not to add the special tokens associated with the corresponding model.
-       Returns:
-           :obj:`List[str]`: The list of tokens.
-       """
-       return self._tokenizer.encode(text, pair, add_special_tokens=add_special_tokens).tokens
+   def tokenize(self, text: str, pair: Optional[str] = None, add_special_tokens: bool = False, **kwargs) -> List[str]:
+       return self.encode_plus(text=text, text_pair=pair, add_special_tokens=add_special_tokens, **kwargs).tokens()
def set_truncation_and_padding(
self,
@@ -405,29 +383,11 @@ class PreTrainedTokenizerFast(PreTrainedTokenizerBase):
pad_to_multiple_of=pad_to_multiple_of,
)
-   # Avoid thread overhead if only one example.
-   if len(batch_text_or_text_pairs) == 1:
-       if isinstance(batch_text_or_text_pairs[0], tuple):
-           # We got a Tuple with a pair of sequences
-           encodings = self._tokenizer.encode(
-               *batch_text_or_text_pairs[0],
-               add_special_tokens=add_special_tokens,
-               is_pretokenized=is_split_into_words,
-           )
-       else:
-           # We got a single sequence
-           encodings = self._tokenizer.encode(
-               batch_text_or_text_pairs[0],
-               add_special_tokens=add_special_tokens,
-               is_pretokenized=is_split_into_words,
-           )
-       encodings = [encodings]
-   else:
-       encodings = self._tokenizer.encode_batch(
-           batch_text_or_text_pairs,
-           add_special_tokens=add_special_tokens,
-           is_pretokenized=is_split_into_words,
-       )
+   encodings = self._tokenizer.encode_batch(
+       batch_text_or_text_pairs,
+       add_special_tokens=add_special_tokens,
+       is_pretokenized=is_split_into_words,
+   )
# Convert encoding to dict
# `Tokens` has type: List[Dict[str, List[List[int]]]] or List[Dict[str, 2D-Tensor]]
@@ -525,30 +485,16 @@ class PreTrainedTokenizerFast(PreTrainedTokenizerBase):
return batched_output
-   def decode(
+   def convert_tokens_to_string(self, tokens: List[str]) -> str:
+       return self.backend_tokenizer.decoder.decode(tokens)
+   def _decode(
self,
token_ids: Union[int, List[int]],
skip_special_tokens: bool = False,
clean_up_tokenization_spaces: bool = True,
**kwargs
) -> str:
-   """
-   Converts a sequence of ids in a string, using the tokenizer and vocabulary
-   with options to remove special tokens and clean up tokenization spaces.
-   Similar to doing ``self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids))``.
-   Args:
-       token_ids (:obj:`Union[int, List[int]]`):
-           List of tokenized input ids. Can be obtained using the ``__call__`` method.
-       skip_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
-           Whether or not to remove special tokens in the decoding.
-       clean_up_tokenization_spaces (:obj:`bool`, `optional`, defaults to :obj:`True`):
-           Whether or not to clean up the tokenization spaces.
-       Returns:
-           :obj:`str`: The decoded sentence.
-   """
if isinstance(token_ids, int):
token_ids = [token_ids]
text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
......
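After the changes above, tokenize() on a fast tokenizer is routed through encode_plus(...).tokens() (so it accepts the usual encoding kwargs) and convert_tokens_to_string() delegates to the backend Rust decoder. A small sketch, assuming the bert-base-uncased checkpoint is available (the model name is only an example):

# Sketch of the fast-tokenizer behaviour touched above (not part of the diff).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

# tokenize() now goes through encode_plus(), so it can add the model's special tokens on request.
print(tok.tokenize("Hello world", add_special_tokens=True))
# ['[CLS]', 'hello', 'world', '[SEP]']

# convert_tokens_to_string() delegates to the backend (Rust) decoder.
print(tok.convert_tokens_to_string(["play", "##ing"]))   # 'playing'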
@@ -15,3 +15,10 @@ sys.path.insert(1, git_repo_path)
# silence FutureWarning warnings in tests since often we can't act on them until
# they become normal warnings - i.e. the tests still need to test the current functionality
warnings.simplefilter(action="ignore", category=FutureWarning)
def pytest_configure(config):
config.addinivalue_line("markers", "is_pipeline_test: mark test to run only when pipeline are tested")
config.addinivalue_line(
"markers", "is_pt_tf_cross_test: mark test to run only when PT and TF interactions are tested"
)
@@ -19,7 +19,7 @@ import unittest
from transformers import is_tf_available
from transformers.file_utils import cached_property
-   from transformers.testing_utils import require_tf, require_torch, slow
+   from transformers.testing_utils import is_pt_tf_cross_test, require_tf, slow
from .test_configuration_common import ConfigTester
from .test_modeling_tf_common import TFModelTesterMixin, ids_tensor
@@ -231,8 +231,7 @@ def _long_tensor(tok_lst):
TOLERANCE = 1e-4
-   @require_tf
-   @require_torch
+   @is_pt_tf_cross_test
@slow
class TFBartModelIntegrationTest(unittest.TestCase):
def test_inference_no_head(self):
......
@@ -23,8 +23,8 @@ import unittest
from importlib import import_module
from typing import List, Tuple
-   from transformers import is_tf_available, is_torch_available
+   from transformers import is_tf_available
-   from transformers.testing_utils import _tf_gpu_memory_limit, require_tf, slow
+   from transformers.testing_utils import _tf_gpu_memory_limit, is_pt_tf_cross_test, require_tf, slow
if is_tf_available():
@@ -291,9 +291,8 @@ class TFModelTesterMixin:
max_diff = np.amax(np.abs(out_1 - out_2))
self.assertLessEqual(max_diff, 1e-5)
+   @is_pt_tf_cross_test
def test_pt_tf_model_equivalence(self):
-   if not is_torch_available():
-       return
import torch
......
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
from transformers import is_tf_available, is_torch_available
from transformers.testing_utils import DUMMY_UNKWOWN_IDENTIFIER, SMALL_MODEL_IDENTIFIER, is_pt_tf_cross_test, slow
if is_tf_available():
from transformers import (
AutoConfig,
BertConfig,
GPT2Config,
T5Config,
TFAutoModel,
TFAutoModelForCausalLM,
TFAutoModelForMaskedLM,
TFAutoModelForPreTraining,
TFAutoModelForQuestionAnswering,
TFAutoModelForSeq2SeqLM,
TFAutoModelForSequenceClassification,
TFAutoModelWithLMHead,
TFBertForMaskedLM,
TFBertForPreTraining,
TFBertForQuestionAnswering,
TFBertForSequenceClassification,
TFBertModel,
TFGPT2LMHeadModel,
TFRobertaForMaskedLM,
TFT5ForConditionalGeneration,
)
from transformers.modeling_tf_bert import TF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST
from transformers.modeling_tf_gpt2 import TF_GPT2_PRETRAINED_MODEL_ARCHIVE_LIST
from transformers.modeling_tf_t5 import TF_T5_PRETRAINED_MODEL_ARCHIVE_LIST
if is_torch_available():
from transformers import (
AutoModel,
AutoModelForCausalLM,
AutoModelForMaskedLM,
AutoModelForPreTraining,
AutoModelForQuestionAnswering,
AutoModelForSeq2SeqLM,
AutoModelForSequenceClassification,
AutoModelWithLMHead,
BertForMaskedLM,
BertForPreTraining,
BertForQuestionAnswering,
BertForSequenceClassification,
BertModel,
GPT2LMHeadModel,
RobertaForMaskedLM,
T5ForConditionalGeneration,
)
@is_pt_tf_cross_test
class TFPTAutoModelTest(unittest.TestCase):
@slow
def test_model_from_pretrained(self):
import h5py
self.assertTrue(h5py.version.hdf5_version.startswith("1.10"))
# for model_name in TF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
for model_name in ["bert-base-uncased"]:
config = AutoConfig.from_pretrained(model_name)
self.assertIsNotNone(config)
self.assertIsInstance(config, BertConfig)
model = TFAutoModel.from_pretrained(model_name, from_pt=True)
self.assertIsNotNone(model)
self.assertIsInstance(model, TFBertModel)
model = AutoModel.from_pretrained(model_name, from_tf=True)
self.assertIsNotNone(model)
self.assertIsInstance(model, BertModel)
@slow
def test_model_for_pretraining_from_pretrained(self):
import h5py
self.assertTrue(h5py.version.hdf5_version.startswith("1.10"))
# for model_name in TF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
for model_name in ["bert-base-uncased"]:
config = AutoConfig.from_pretrained(model_name)
self.assertIsNotNone(config)
self.assertIsInstance(config, BertConfig)
model = TFAutoModelForPreTraining.from_pretrained(model_name, from_pt=True)
self.assertIsNotNone(model)
self.assertIsInstance(model, TFBertForPreTraining)
model = AutoModelForPreTraining.from_pretrained(model_name, from_tf=True)
self.assertIsNotNone(model)
self.assertIsInstance(model, BertForPreTraining)
@slow
def test_model_for_causal_lm(self):
for model_name in TF_GPT2_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
config = AutoConfig.from_pretrained(model_name)
self.assertIsNotNone(config)
self.assertIsInstance(config, GPT2Config)
model = TFAutoModelForCausalLM.from_pretrained(model_name, from_pt=True)
model, loading_info = TFAutoModelForCausalLM.from_pretrained(
model_name, output_loading_info=True, from_pt=True
)
self.assertIsNotNone(model)
self.assertIsInstance(model, TFGPT2LMHeadModel)
model = AutoModelForCausalLM.from_pretrained(model_name, from_tf=True)
model, loading_info = AutoModelForCausalLM.from_pretrained(
model_name, output_loading_info=True, from_tf=True
)
self.assertIsNotNone(model)
self.assertIsInstance(model, GPT2LMHeadModel)
@slow
def test_lmhead_model_from_pretrained(self):
for model_name in TF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
config = AutoConfig.from_pretrained(model_name)
self.assertIsNotNone(config)
self.assertIsInstance(config, BertConfig)
model = TFAutoModelWithLMHead.from_pretrained(model_name, from_pt=True)
self.assertIsNotNone(model)
self.assertIsInstance(model, TFBertForMaskedLM)
model = AutoModelWithLMHead.from_pretrained(model_name, from_tf=True)
self.assertIsNotNone(model)
self.assertIsInstance(model, BertForMaskedLM)
@slow
def test_model_for_masked_lm(self):
for model_name in TF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
config = AutoConfig.from_pretrained(model_name)
self.assertIsNotNone(config)
self.assertIsInstance(config, BertConfig)
model = TFAutoModelForMaskedLM.from_pretrained(model_name, from_pt=True)
model, loading_info = TFAutoModelForMaskedLM.from_pretrained(
model_name, output_loading_info=True, from_pt=True
)
self.assertIsNotNone(model)
self.assertIsInstance(model, TFBertForMaskedLM)
model = AutoModelForMaskedLM.from_pretrained(model_name, from_tf=True)
model, loading_info = AutoModelForMaskedLM.from_pretrained(
model_name, output_loading_info=True, from_tf=True
)
self.assertIsNotNone(model)
self.assertIsInstance(model, BertForMaskedLM)
@slow
def test_model_for_encoder_decoder_lm(self):
for model_name in TF_T5_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
config = AutoConfig.from_pretrained(model_name)
self.assertIsNotNone(config)
self.assertIsInstance(config, T5Config)
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_name, from_pt=True)
model, loading_info = TFAutoModelForSeq2SeqLM.from_pretrained(
model_name, output_loading_info=True, from_pt=True
)
self.assertIsNotNone(model)
self.assertIsInstance(model, TFT5ForConditionalGeneration)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, from_tf=True)
model, loading_info = AutoModelForSeq2SeqLM.from_pretrained(
model_name, output_loading_info=True, from_tf=True
)
self.assertIsNotNone(model)
self.assertIsInstance(model, T5ForConditionalGeneration)
@slow
def test_sequence_classification_model_from_pretrained(self):
# for model_name in TF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
for model_name in ["bert-base-uncased"]:
config = AutoConfig.from_pretrained(model_name)
self.assertIsNotNone(config)
self.assertIsInstance(config, BertConfig)
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, from_pt=True)
self.assertIsNotNone(model)
self.assertIsInstance(model, TFBertForSequenceClassification)
model = AutoModelForSequenceClassification.from_pretrained(model_name, from_tf=True)
self.assertIsNotNone(model)
self.assertIsInstance(model, BertForSequenceClassification)
@slow
def test_question_answering_model_from_pretrained(self):
# for model_name in TF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
for model_name in ["bert-base-uncased"]:
config = AutoConfig.from_pretrained(model_name)
self.assertIsNotNone(config)
self.assertIsInstance(config, BertConfig)
model = TFAutoModelForQuestionAnswering.from_pretrained(model_name, from_pt=True)
self.assertIsNotNone(model)
self.assertIsInstance(model, TFBertForQuestionAnswering)
model = AutoModelForQuestionAnswering.from_pretrained(model_name, from_tf=True)
self.assertIsNotNone(model)
self.assertIsInstance(model, BertForQuestionAnswering)
def test_from_pretrained_identifier(self):
model = TFAutoModelWithLMHead.from_pretrained(SMALL_MODEL_IDENTIFIER, from_pt=True)
self.assertIsInstance(model, TFBertForMaskedLM)
self.assertEqual(model.num_parameters(), 14830)
self.assertEqual(model.num_parameters(only_trainable=True), 14830)
model = AutoModelWithLMHead.from_pretrained(SMALL_MODEL_IDENTIFIER, from_tf=True)
self.assertIsInstance(model, BertForMaskedLM)
self.assertEqual(model.num_parameters(), 14410)
self.assertEqual(model.num_parameters(only_trainable=True), 14410)
def test_from_identifier_from_model_type(self):
model = TFAutoModelWithLMHead.from_pretrained(DUMMY_UNKWOWN_IDENTIFIER, from_pt=True)
self.assertIsInstance(model, TFRobertaForMaskedLM)
self.assertEqual(model.num_parameters(), 14830)
self.assertEqual(model.num_parameters(only_trainable=True), 14830)
model = AutoModelWithLMHead.from_pretrained(DUMMY_UNKWOWN_IDENTIFIER, from_tf=True)
self.assertIsInstance(model, RobertaForMaskedLM)
self.assertEqual(model.num_parameters(), 14410)
self.assertEqual(model.num_parameters(only_trainable=True), 14410)
import unittest
from typing import List, Optional
from transformers import is_tf_available, is_torch_available, pipeline
from transformers.pipelines import DefaultArgumentHandler, Pipeline
from transformers.testing_utils import _run_slow_tests, is_pipeline_test, require_tf, require_torch, slow
VALID_INPUTS = ["A simple string", ["list of strings"]]
@is_pipeline_test
class CustomInputPipelineCommonMixin:
pipeline_task = None
pipeline_loading_kwargs = {}
small_models = None # Models tested without the @slow decorator
large_models = None # Models tested with the @slow decorator
def setUp(self) -> None:
if not is_tf_available() and not is_torch_available():
return # Currently no JAX pipelines
# Download needed checkpoints
models = self.small_models
if _run_slow_tests:
models = models + self.large_models
for model_name in models:
if is_torch_available():
pipeline(
self.pipeline_task,
model=model_name,
tokenizer=model_name,
framework="pt",
**self.pipeline_loading_kwargs,
)
if is_tf_available():
pipeline(
self.pipeline_task,
model=model_name,
tokenizer=model_name,
framework="tf",
**self.pipeline_loading_kwargs,
)
@require_torch
@slow
def test_pt_defaults(self):
pipeline(self.pipeline_task, framework="pt")
@require_tf
@slow
def test_tf_defaults(self):
pipeline(self.pipeline_task, framework="tf")
@require_torch
def test_torch_small(self):
for model_name in self.small_models:
nlp = pipeline(task=self.pipeline_task, model=model_name, tokenizer=model_name, framework="pt")
self._test_pipeline(nlp)
@require_tf
def test_tf_small(self):
for model_name in self.small_models:
nlp = pipeline(task=self.pipeline_task, model=model_name, tokenizer=model_name, framework="tf")
self._test_pipeline(nlp)
@require_torch
@slow
def test_torch_large(self):
for model_name in self.large_models:
nlp = pipeline(task=self.pipeline_task, model=model_name, tokenizer=model_name, framework="pt")
self._test_pipeline(nlp)
@require_tf
@slow
def test_tf_large(self):
for model_name in self.large_models:
nlp = pipeline(task=self.pipeline_task, model=model_name, tokenizer=model_name, framework="tf")
self._test_pipeline(nlp)
def _test_pipeline(self, nlp: Pipeline):
raise NotImplementedError
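For orientation, an illustrative sketch (not code from this diff) of how a task-specific test is meant to plug into CustomInputPipelineCommonMixin: inherit from it together with unittest.TestCase, fill in pipeline_task and the model lists, and implement _test_pipeline with the task's own assertions. The class name and tiny checkpoint below are placeholders.
class ExampleCustomInputPipelineTests(CustomInputPipelineCommonMixin, unittest.TestCase):
    pipeline_task = "feature-extraction"
    small_models = ["sshleifer/tiny-distilbert-base-cased"]  # placeholder tiny checkpoint, assumed for illustration
    large_models = []

    def _test_pipeline(self, nlp: Pipeline):
        # Task-specific check: feature extraction should return a (nested) list of floats.
        self.assertIsNotNone(nlp)
        output = nlp(VALID_INPUTS[0])
        self.assertIsInstance(output, list)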
@is_pipeline_test
class MonoInputPipelineCommonMixin:
pipeline_task = None
pipeline_loading_kwargs = {} # Additional kwargs to load the pipeline with
pipeline_running_kwargs = {} # Additional kwargs to run the pipeline with
small_models = [] # Models tested without the @slow decorator
large_models = [] # Models tested with the @slow decorator
mandatory_keys = {} # Keys which should be in the output
valid_inputs = VALID_INPUTS # inputs which are valid
invalid_inputs = [None] # inputs which are not allowed
expected_multi_result: Optional[List] = None
expected_check_keys: Optional[List[str]] = None
def setUp(self) -> None:
if not is_tf_available() and not is_torch_available():
return # Currently no JAX pipelines
for model_name in self.small_models:
pipeline(self.pipeline_task, model=model_name, tokenizer=model_name, **self.pipeline_loading_kwargs)
for model_name in self.large_models:
pipeline(self.pipeline_task, model=model_name, tokenizer=model_name, **self.pipeline_loading_kwargs)
@require_torch
@slow
def test_pt_defaults_loads(self):
pipeline(self.pipeline_task, framework="pt", **self.pipeline_loading_kwargs)
@require_tf
@slow
def test_tf_defaults_loads(self):
pipeline(self.pipeline_task, framework="tf", **self.pipeline_loading_kwargs)
@require_torch
def test_torch_small(self):
for model_name in self.small_models:
nlp = pipeline(
task=self.pipeline_task,
model=model_name,
tokenizer=model_name,
framework="pt",
**self.pipeline_loading_kwargs,
)
self._test_pipeline(nlp)
@require_tf
def test_tf_small(self):
for model_name in self.small_models:
nlp = pipeline(
task=self.pipeline_task,
model=model_name,
tokenizer=model_name,
framework="tf",
**self.pipeline_loading_kwargs,
)
self._test_pipeline(nlp)
@require_torch
@slow
def test_torch_large(self):
for model_name in self.large_models:
nlp = pipeline(
task=self.pipeline_task,
model=model_name,
tokenizer=model_name,
framework="pt",
**self.pipeline_loading_kwargs,
)
self._test_pipeline(nlp)
@require_tf
@slow
def test_tf_large(self):
for model_name in self.large_models:
nlp = pipeline(
task=self.pipeline_task,
model=model_name,
tokenizer=model_name,
framework="tf",
**self.pipeline_loading_kwargs,
)
self._test_pipeline(nlp)
def _test_pipeline(self, nlp: Pipeline):
self.assertIsNotNone(nlp)
mono_result = nlp(self.valid_inputs[0], **self.pipeline_running_kwargs)
self.assertIsInstance(mono_result, list)
self.assertIsInstance(mono_result[0], (dict, list))
if isinstance(mono_result[0], list):
mono_result = mono_result[0]
for key in self.mandatory_keys:
self.assertIn(key, mono_result[0])
multi_result = [nlp(example, **self.pipeline_running_kwargs) for example in self.valid_inputs]
self.assertIsInstance(multi_result, list)
self.assertIsInstance(multi_result[0], (dict, list))
if self.expected_multi_result is not None:
for result, expect in zip(multi_result, self.expected_multi_result):
for key in self.expected_check_keys or []:
self.assertEqual(
set([o[key] for o in result]),
set([o[key] for o in expect]),
)
if isinstance(multi_result[0], list):
multi_result = multi_result[0]
for result in multi_result:
for key in self.mandatory_keys:
self.assertIn(key, result)
self.assertRaises(Exception, nlp, self.invalid_inputs)
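By contrast, MonoInputPipelineCommonMixin subclasses typically only declare class attributes and inherit the generic _test_pipeline above unchanged. A minimal sketch (class name and checkpoint are placeholders, not models exercised in this diff):
class ExampleMonoInputPipelineTests(MonoInputPipelineCommonMixin, unittest.TestCase):
    pipeline_task = "sentiment-analysis"
    small_models = ["sshleifer/tiny-distilbert-base-uncased-finetuned-sst-2-english"]  # placeholder tiny checkpoint
    large_models = []
    mandatory_keys = {"label", "score"}  # every returned dict must contain these keys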
@is_pipeline_test
class DefaultArgumentHandlerTestCase(unittest.TestCase):
def setUp(self) -> None:
self.handler = DefaultArgumentHandler()
def test_kwargs_x(self):
mono_data = {"X": "This is a sample input"}
mono_args = self.handler(**mono_data)
self.assertTrue(isinstance(mono_args, list))
self.assertEqual(len(mono_args), 1)
multi_data = {"x": ["This is a sample input", "This is a second sample input"]}
multi_args = self.handler(**multi_data)
self.assertTrue(isinstance(multi_args, list))
self.assertEqual(len(multi_args), 2)
def test_kwargs_data(self):
mono_data = {"data": "This is a sample input"}
mono_args = self.handler(**mono_data)
self.assertTrue(isinstance(mono_args, list))
self.assertEqual(len(mono_args), 1)
multi_data = {"data": ["This is a sample input", "This is a second sample input"]}
multi_args = self.handler(**multi_data)
self.assertTrue(isinstance(multi_args, list))
self.assertEqual(len(multi_args), 2)
def test_multi_kwargs(self):
mono_data = {"data": "This is a sample input", "X": "This is a sample input 2"}
mono_args = self.handler(**mono_data)
self.assertTrue(isinstance(mono_args, list))
self.assertEqual(len(mono_args), 2)
multi_data = {
"data": ["This is a sample input", "This is a second sample input"],
"test": ["This is a sample input 2", "This is a second sample input 2"],
}
multi_args = self.handler(**multi_data)
self.assertTrue(isinstance(multi_args, list))
self.assertEqual(len(multi_args), 4)
def test_args(self):
mono_data = "This is a sample input"
mono_args = self.handler(mono_data)
self.assertTrue(isinstance(mono_args, list))
self.assertEqual(len(mono_args), 1)
mono_data = ["This is a sample input"]
mono_args = self.handler(mono_data)
self.assertTrue(isinstance(mono_args, list))
self.assertEqual(len(mono_args), 1)
multi_data = ["This is a sample input", "This is a second sample input"]
multi_args = self.handler(multi_data)
self.assertTrue(isinstance(multi_args, list))
self.assertEqual(len(multi_args), 2)
multi_data = ["This is a sample input", "This is a second sample input"]
multi_args = self.handler(*multi_data)
self.assertTrue(isinstance(multi_args, list))
self.assertEqual(len(multi_args), 2)
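The behaviour pinned down by these DefaultArgumentHandler tests is that positional and keyword inputs are all normalised into one flat list of examples. A rough, behaviour-only sketch consistent with the assertions above (normalize_inputs is a hypothetical helper, not the library's implementation):
def normalize_inputs(*args, **kwargs):
    # Flatten every positional and keyword value into a single list of examples,
    # matching the lengths asserted in DefaultArgumentHandlerTestCase.
    examples = []
    for value in list(args) + list(kwargs.values()):
        if isinstance(value, (list, tuple)):
            examples.extend(value)
        else:
            examples.append(value)
    return examples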
import unittest
from transformers import Conversation, pipeline
from transformers.testing_utils import require_torch, slow, torch_device
from .test_pipelines_common import MonoInputPipelineCommonMixin
DEFAULT_DEVICE_NUM = -1 if torch_device == "cpu" else 0  # pipeline(device=-1) runs on CPU; device >= 0 selects that CUDA device
class TextGenerationPipelineTests(MonoInputPipelineCommonMixin, unittest.TestCase):
pipeline_task = "conversational"
small_models = [] # Models tested without the @slow decorator
large_models = ["microsoft/DialoGPT-medium"] # Models tested with the @slow decorator
valid_inputs = [Conversation("Hi there!"), [Conversation("Hi there!"), Conversation("How are you?")]]
invalid_inputs = ["Hi there!", Conversation()]
def _test_pipeline(self, nlp):  # we override the default test method to check that the output is a `Conversation` object
self.assertIsNotNone(nlp)
mono_result = nlp(self.valid_inputs[0])
self.assertIsInstance(mono_result, Conversation)
multi_result = nlp(self.valid_inputs[1])
self.assertIsInstance(multi_result, list)
self.assertIsInstance(multi_result[0], Conversation)
# Inactive conversations passed to the pipeline raise a ValueError
self.assertRaises(ValueError, nlp, self.valid_inputs[1])
for bad_input in self.invalid_inputs:
self.assertRaises(Exception, nlp, bad_input)
self.assertRaises(Exception, nlp, self.invalid_inputs)
@require_torch
@slow
def test_integration_torch_conversation(self):
# When
nlp = pipeline(task="conversational", device=DEFAULT_DEVICE_NUM)
conversation_1 = Conversation("Going to the movies tonight - any suggestions?")
conversation_2 = Conversation("What's the last book you have read?")
# Then
self.assertEqual(len(conversation_1.past_user_inputs), 0)
self.assertEqual(len(conversation_2.past_user_inputs), 0)
# When
result = nlp([conversation_1, conversation_2], do_sample=False, max_length=1000)
# Then
self.assertEqual(result, [conversation_1, conversation_2])
self.assertEqual(len(result[0].past_user_inputs), 1)
self.assertEqual(len(result[1].past_user_inputs), 1)
self.assertEqual(len(result[0].generated_responses), 1)
self.assertEqual(len(result[1].generated_responses), 1)
self.assertEqual(result[0].past_user_inputs[0], "Going to the movies tonight - any suggestions?")
self.assertEqual(result[0].generated_responses[0], "The Big Lebowski")
self.assertEqual(result[1].past_user_inputs[0], "What's the last book you have read?")
self.assertEqual(result[1].generated_responses[0], "The Last Question")
# When
conversation_2.add_user_input("Why do you recommend it?")
result = nlp(conversation_2, do_sample=False, max_length=1000)
# Then
self.assertEqual(result, conversation_2)
self.assertEqual(len(result.past_user_inputs), 2)
self.assertEqual(len(result.generated_responses), 2)
self.assertEqual(result.past_user_inputs[1], "Why do you recommend it?")
self.assertEqual(result.generated_responses[1], "It's a good book.")
@require_torch
@slow
def test_integration_torch_conversation_truncated_history(self):
# When
nlp = pipeline(task="conversational", min_length_for_response=24, device=DEFAULT_DEVICE_NUM)
conversation_1 = Conversation("Going to the movies tonight - any suggestions?")
# Then
self.assertEqual(len(conversation_1.past_user_inputs), 0)
# When
result = nlp(conversation_1, do_sample=False, max_length=36)
# Then
self.assertEqual(result, conversation_1)
self.assertEqual(len(result.past_user_inputs), 1)
self.assertEqual(len(result.generated_responses), 1)
self.assertEqual(result.past_user_inputs[0], "Going to the movies tonight - any suggestions?")
self.assertEqual(result.generated_responses[0], "The Big Lebowski")
# When
conversation_1.add_user_input("Is it an action movie?")
result = nlp(conversation_1, do_sample=False, max_length=36)
# Then
self.assertEqual(result, conversation_1)
self.assertEqual(len(result.past_user_inputs), 2)
self.assertEqual(len(result.generated_responses), 2)
self.assertEqual(result.past_user_inputs[1], "Is it an action movie?")
self.assertEqual(result.generated_responses[1], "It's a comedy.")